Monitoring and Control

This section highlights the tools provided by SLURM to monitor job status and control resource usage in the cluster.

Job Monitoring

SLURM provides several tools to monitor job status.

The command to view node occupancy is sinfo. It reports the state of each partition and node, which is valuable for understanding overall system status and resource availability before submitting jobs.

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
dops         up   infinite      3    mix cetus,dops-a3,psi
dops         up   infinite      8   idle dops-a[1-2,4-5],log-c[1-4]
robotica     up   infinite      1   idle sputnik
citcea       up   infinite      4   idle log-c[1-4]
all*         up   infinite      3    mix cetus,dops-a3,psi
all*         up   infinite      8   idle dops-a[1-2,4-5],log-c[1-4],sputnik

For a node-by-node long listing, use the -Nel options:

$ sinfo -Nel
Tue Feb 18 16:22:41 2025
NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
cetus          1      dops       mixed 40     2:10:2  64166        0      1   (null) none
cetus          1      all*       mixed 40     2:10:2  64166        0      1   (null) none
dops-a1        1      dops        idle 4       1:4:1  15879        0      1   (null) none
dops-a1        1      all*        idle 4       1:4:1  15879        0      1   (null) none
dops-a2        1      dops        idle 4       1:4:1  15879        0      1   (null) none
dops-a2        1      all*        idle 4       1:4:1  15879        0      1   (null) none
dops-a3        1      dops       mixed 8       1:4:2  15878        0      1   (null) none
dops-a3        1      all*       mixed 8       1:4:2  15878        0      1   (null) none
dops-a4        1      dops        idle 8       1:4:2  15900        0      1   (null) none
dops-a4        1      all*        idle 8       1:4:2  15900        0      1   (null) none
dops-a5        1      dops        idle 8       1:4:2  15900        0      1   (null) none
dops-a5        1      all*        idle 8       1:4:2  15900        0      1   (null) none
log-c1         1      dops        idle 4       1:2:2   3715        0      1   (null) none
log-c1         1    citcea        idle 4       1:2:2   3715        0      1   (null) none
log-c1         1      all*        idle 4       1:2:2   3715        0      1   (null) none
log-c2         1      dops        idle 4       1:2:2   3779        0      1   (null) none
log-c2         1    citcea        idle 4       1:2:2   3779        0      1   (null) none
log-c2         1      all*        idle 4       1:2:2   3779        0      1   (null) none
log-c3         1      dops        idle 4       1:2:2   3715        0      1   (null) none
log-c3         1    citcea        idle 4       1:2:2   3715        0      1   (null) none
log-c3         1      all*        idle 4       1:2:2   3715        0      1   (null) none
log-c4         1      dops        idle 4       1:2:2   3779        0      1   (null) none
log-c4         1    citcea        idle 4       1:2:2   3779        0      1   (null) none
log-c4         1      all*        idle 4       1:2:2   3779        0      1   (null) none
psi            1      dops       mixed 128    2:32:2 257379        0      1   (null) none
psi            1      all*       mixed 128    2:32:2 257379        0      1   (null) none
sputnik        1  robotica        idle 16      1:8:2  15934        0      1   (null) none
sputnik        1      all*        idle 16      1:8:2  15934        0      1   (null) none
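The S:C:T column encodes Sockets:Cores-per-socket:Threads-per-core, and the CPUS column is their product. As a quick sanity check, the psi row above can be verified with a one-liner (this operates on a sample line pasted from the listing, not on a live sinfo call):

```shell
# S:C:T = Sockets:Cores:Threads; CPUS should equal S*C*T.
# The fields below are copied from the psi row of the sinfo -Nel output.
echo "psi 1 dops mixed 128 2:32:2 257379" |
  awk '{ split($6, sct, ":"); print $1, "CPUS =", sct[1]*sct[2]*sct[3] }'
```

For psi this prints 2 x 32 x 2 = 128, matching the CPUS column.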

The command to view the job queue is squeue. It lists pending and running jobs together with their state, elapsed time, and assigned nodes. Example usage:

$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
544   dops        bash alexandr  R       0:20      1 psi
545   dops        bash alexandr  R       0:13      1 dops-a1

To customize the output, use the -o option to select which fields to display:

$ squeue -o "%A %u %t %M"
JOBID USER ST TIME
544 alexandre.gracia R 0:55
545 alexandre.gracia R 0:48
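The %M field is the job's elapsed time in SLURM's [days-]hours:minutes:seconds notation. For scripting it is often handy to normalize it to seconds; a minimal awk sketch (assuming no days- prefix):

```shell
# Convert an [HH:]MM:SS elapsed time, as printed by squeue's %M field,
# into seconds by folding each colon-separated component into the total.
echo "15:29:13" |
  awk -F: '{ s = 0; for (i = 1; i <= NF; i++) s = s * 60 + $i; print s }'
```

The same loop handles both MM:SS and HH:MM:SS, since each extra field just multiplies the running total by 60.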

There is a custom command, scua, that displays queue results including reserved cores and memory parameters.

$ scua
JOBID CPUS MIN_MEM  EXEC_HOST       TIME                 USER                 NAME    ST
  400    7      7G      cetus    0:10:28     alexandre.gracia         test_docplex     R

The srun command runs a command (for example htop) on a compute node, with its output shown in the current terminal in real time. Examples:

$ srun -w cetus htop # Runs htop on cetus; -w selects the node, and output appears in the current terminal.
$ srun sleep 120 & # Runs sleep for 120 seconds on the first available machine using 1 core. The & frees the terminal.
$ srun -w cetus --mem=2G sleep 120 & # Runs sleep for 120 seconds on one core of cetus, reserving 2 GB, and frees the terminal.

Monitoring with htop on a node

We created a script that runs the htop command on a node. Example:

$ mvachtop node # Example: mvachtop psi

SSH connection to a node

We created a script that opens an SSH connection to cluster machines for running simple Linux commands. Specifically, it opens a terminal on the selected node with 1 core and 512MB of RAM. Example:

$ mvacinteract node # Example: mvacinteract psi
(venv)user.name@psi:~$

Information about a job

We created a script that retrieves information about a job by its ID after it has finished. Example:

$ sjob id # id is a number. Example: sjob 433
JobId 433 Information:
JobId=433
UserId=alexandre.gracia(6334)
GroupId=usuaris(3014)
Name=hostname
JobState=COMPLETED
Partition=all
TimeLimit=UNLIMITED
StartTime=2024-02-11T12:00:38
EndTime=2024-02-11T12:00:38
NodeList=log-c4
NodeCnt=1
ProcCnt=2
WorkDir=/home/users/alexandre.gracia
ReservationName=
Tres=cpu=1,mem=1G,node=1,billing=1
Account=
QOS=
WcKey=
Cluster=unknown
SubmitTime=2024-02-11T12:00:38
EligibleTime=2024-02-11T12:00:38
DerivedExitCode=0:0
ExitCode=0:0

Query used and available cores

SLURM can report used and available cores through sinfo or squeue; you only need to add the appropriate output parameters.

For the full set of output fields and customizations, see the sinfo and squeue manual pages.

$ squeue -o"%.7i %.9P %.8j %.8u %.2t %.10M %.6D %C"
   JOBID PARTITION     NAME     USER        ST  TIME        NODES CPUS
   1357        all     test     alexandre.gr R   15:29:13      1    2

$ sinfo -o "%n %e %m %C" | awk 'NR==1{$0="Hostname Free_Mem Memory CPUS(Active/Idle/Offline/Total)"} {print}' | column -t
   Hostname  Free_Mem  Memory  CPUS(Active/Idle/Offline/Total)
   cetus     57063  64166   32/8/0/40
   dops-a3   13086  15878   6/2/0/8
   psi       92724  257379  32/96/0/128
   dops-a1   13433  15879   0/4/0/4
   dops-a2   13967  15879   0/4/0/4
   dops-a4   13514  15900   0/8/0/8
   dops-a5   13633  15900   0/8/0/8
   log-c1    2049   3715    0/4/0/4
   log-c2    2521   3779    0/4/0/4
   log-c3    2176   3715    0/4/0/4
   log-c4    2506   3779    0/4/0/4
   sputnik   15934  15934   0/0/16/16
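In the CPUS(A/I/O/T) column, each node's CPUs are split into allocated, idle, other (e.g. offline), and total. Utilization can be derived directly from that field; for example, for the cetus row of the table (again on a pasted sample line rather than a live sinfo call):

```shell
# Parse the A/I/O/T field and print the allocated fraction.
# The input line is copied from the cetus row of the table above.
echo "cetus 57063 64166 32/8/0/40" |
  awk '{ split($4, c, "/"); printf "%s: %d%% of %d CPUs allocated\n", $1, 100 * c[1] / c[4], c[4] }'
```

For cetus this gives 32 of 40 CPUs allocated, i.e. 80% utilization.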

Job efficiency with seff

One of the most important commands is seff, which lets us check the efficiency of our executions:

$ seff job_id # id is a number. Example: seff 12895
   Job ID: 12895
   Cluster: multivac
   User/Group: /usuaris
   State: COMPLETED (exit code 0)
   Nodes: 1
   Cores per node: 30
   CPU Utilized: 00:02:17
   CPU Efficiency: 57.08% of 00:04:00 core-walltime
   Job Wall-clock time: 00:00:08
   Memory Utilized: 3.35 MB
   Memory Efficiency: 0.02% of 20.00 GB (20.00 GB/node)
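The efficiency figures follow directly from the numbers seff reports: CPU efficiency is the CPU time actually used divided by cores times wall-clock time. Re-deriving the 57.08% above from the same report:

```shell
# CPU Utilized 00:02:17 = 137 s; core-walltime = 30 cores * 8 s wall-clock = 240 s.
# CPU Efficiency = 100 * used / core-walltime.
awk 'BEGIN { printf "%.2f%%\n", 100 * 137 / (30 * 8) }'
```

A low CPU or memory efficiency means the job reserved far more resources than it used, so the request should be reduced for future runs.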