Monitoring and Control
This section highlights the tools provided by SLURM to monitor job status and control resource usage in the cluster.
Job Monitoring
SLURM provides several tools to monitor job status.
The sinfo command shows the state of partitions and nodes. It gives a quick overview of overall system status and of the resources available for running jobs.
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
dops up infinite 3 mix cetus,dops-a3,psi
dops up infinite 8 idle dops-a[1-2,4-5],log-c[1-4]
robotica up infinite 1 idle sputnik
citcea up infinite 4 idle log-c[1-4]
all* up infinite 3 mix cetus,dops-a3,psi
all* up infinite 8 idle dops-a[1-2,4-5],log-c[1-4],sputnik
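The STATE column distinguishes fully idle nodes from mix(ed) ones that are partially allocated. As a minimal sketch, the per-state node counts can be totalled with awk; the sample lines below are copied from the output above, whereas on the cluster you would pipe `sinfo -h` (no header) straight into the filter:

```shell
# Sample rows copied from the sinfo output above (PARTITION AVAIL TIMELIMIT NODES STATE NODELIST).
sinfo_sample='dops up infinite 3 mix cetus,dops-a3,psi
dops up infinite 8 idle dops-a[1-2,4-5],log-c[1-4]
robotica up infinite 1 idle sputnik'

# Sum the NODES column (field 4) grouped by STATE (field 5).
echo "$sinfo_sample" | awk '{count[$5] += $4} END {for (s in count) print s, count[s]}' | sort
```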
For more details, use the -Nel option:
$ sinfo -Nel
Tue Feb 18 16:22:41 2025
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
cetus 1 dops mixed 40 2:10:2 64166 0 1 (null) none
cetus 1 all* mixed 40 2:10:2 64166 0 1 (null) none
dops-a1 1 dops idle 4 1:4:1 15879 0 1 (null) none
dops-a1 1 all* idle 4 1:4:1 15879 0 1 (null) none
dops-a2 1 dops idle 4 1:4:1 15879 0 1 (null) none
dops-a2 1 all* idle 4 1:4:1 15879 0 1 (null) none
dops-a3 1 dops mixed 8 1:4:2 15878 0 1 (null) none
dops-a3 1 all* mixed 8 1:4:2 15878 0 1 (null) none
dops-a4 1 dops idle 8 1:4:2 15900 0 1 (null) none
dops-a4 1 all* idle 8 1:4:2 15900 0 1 (null) none
dops-a5 1 dops idle 8 1:4:2 15900 0 1 (null) none
dops-a5 1 all* idle 8 1:4:2 15900 0 1 (null) none
log-c1 1 dops idle 4 1:2:2 3715 0 1 (null) none
log-c1 1 citcea idle 4 1:2:2 3715 0 1 (null) none
log-c1 1 all* idle 4 1:2:2 3715 0 1 (null) none
log-c2 1 dops idle 4 1:2:2 3779 0 1 (null) none
log-c2 1 citcea idle 4 1:2:2 3779 0 1 (null) none
log-c2 1 all* idle 4 1:2:2 3779 0 1 (null) none
log-c3 1 dops idle 4 1:2:2 3715 0 1 (null) none
log-c3 1 citcea idle 4 1:2:2 3715 0 1 (null) none
log-c3 1 all* idle 4 1:2:2 3715 0 1 (null) none
log-c4 1 dops idle 4 1:2:2 3779 0 1 (null) none
log-c4 1 citcea idle 4 1:2:2 3779 0 1 (null) none
log-c4 1 all* idle 4 1:2:2 3779 0 1 (null) none
psi 1 dops mixed 128 2:32:2 257379 0 1 (null) none
psi 1 all* mixed 128 2:32:2 257379 0 1 (null) none
sputnik 1 robotica idle 16 1:8:2 15934 0 1 (null) none
sputnik 1 all* idle 16 1:8:2 15934 0 1 (null) none
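The S:C:T column reads sockets : cores per socket : threads per core, and the CPUS column is the product of the three. For example, cetus with 2:10:2 exposes 2 × 10 × 2 = 40 logical CPUs. A quick check in the shell:

```shell
# S:C:T = sockets : cores-per-socket : threads-per-core; CPUS is their product.
# Value taken from the cetus row above.
echo "2:10:2" | awk -F: '{print $1 * $2 * $3}'   # 40 logical CPUs
```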
The squeue command shows the job queue: pending and running jobs, their state, elapsed time, and other relevant details. Example usage:
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
544 dops bash alexandr R 0:20 1 psi
545 dops bash alexandr R 0:13 1 dops-a1
To customize the output, use the -o option to select which fields to display (%A job ID, %u user, %t state, %M elapsed time):
$ squeue -o "%A %u %t %M"
JOBID USER ST TIME
544 alexandre.gracia R 0:55
545 alexandre.gracia R 0:48
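This custom format is convenient for further filtering. As a sketch, the lines below (copied from the output above) are tallied into running jobs per user; on the cluster the input would come from `squeue -h -o "%A %u %t %M"`, where -h suppresses the header:

```shell
# Sample rows copied from the custom squeue output above (JOBID USER ST TIME).
squeue_sample='544 alexandre.gracia R 0:55
545 alexandre.gracia R 0:48'

# Count jobs in the running (R) state per user.
echo "$squeue_sample" | awk '$3 == "R" {jobs[$2]++} END {for (u in jobs) print u, jobs[u]}'
```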
There is a custom command, scua, that displays the queue including the reserved cores and memory for each job.
$ scua
JOBID CPUS MIN_MEM EXEC_HOST TIME USER NAME ST
400 7 7G cetus 0:10:28 alexandre.gracia test_docplex R
The srun command lets us run a command on a node (for example htop) and see its output in real time. Examples:
$ srun -w cetus htop # Runs htop on cetus; -w selects the node, and output is shown interactively.
$ srun sleep 120 & # Runs sleep for 120 seconds on the first available node with 1 core; & runs the job in the background, freeing the terminal.
$ srun -w cetus --mem=2G sleep 120 & # Same, but on cetus with 2 GB of memory reserved.
Monitoring with htop on a node
We created a script that runs the htop command on a node. Example:
$ mvachtop node # Example: mvachtop psi
SSH connection to a node
We created a script that opens an SSH connection to cluster machines for running simple Linux commands. Specifically, it opens a terminal on the selected node with 1 core and 512MB of RAM. Example:
$ mvacinteract node # Example: mvacinteract psi
(venv) nom.usuari@psi:~$
Information about a job
We created a script that retrieves information about a job by its ID after execution. Example:
$ sjob id # id is a number. Example: sjob 433
JobId 433 Information:
JobId=433
UserId=alexandre.gracia(6334)
GroupId=usuaris(3014)
Name=hostname
JobState=COMPLETED
Partition=all
TimeLimit=UNLIMITED
StartTime=2024-02-11T12:00:38
EndTime=2024-02-11T12:00:38
NodeList=log-c4
NodeCnt=1
ProcCnt=2
WorkDir=/home/users/alexandre.gracia
ReservationName=
Tres=cpu=1,mem=1G,node=1,billing=1
Account=
QOS=
WcKey=
Cluster=unknown
SubmitTime=2024-02-11T12:00:38
EligibleTime=2024-02-11T12:00:38
DerivedExitCode=0:0
ExitCode=0:0
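The key=value layout of this report is easy to filter. As a minimal sketch, a single field can be extracted with awk; the sample lines below are copied from the report above, and on the cluster you would pipe the script's output directly, e.g. `sjob 433 | awk -F= '$1 == "JobState" {print $2}'`:

```shell
# Sample key=value lines copied from the sjob report above.
sjob_sample='JobId=433
JobState=COMPLETED
NodeList=log-c4'

# Split on "=" and print the value whose key is JobState.
echo "$sjob_sample" | awk -F= '$1 == "JobState" {print $2}'   # COMPLETED
```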
Query used and available cores
SLURM can report used and available cores through sinfo or squeue; you only need to add the right output parameters. Many more fields and customizations exist, so check the sinfo and squeue manual pages.
$ squeue -o"%.7i %.9P %.8j %.8u %.2t %.10M %.6D %C"
JOBID PARTITION NAME USER ST TIME NODES CPUS
1357 all test alexandre.gr R 15:29:13 1 2
$ sinfo -o "%n %e %m %C" | awk 'NR==2{print "Hostname Free Mem CPUS(Active/Idle/Offline/Total)"} NR>1{print}' | column -t
Hostname Free Mem CPUS(Active/Idle/Offline/Total)
cetus 57063 64166 32/8/0/40
dops-a3 13086 15878 6/2/0/8
psi 92724 257379 32/96/0/128
dops-a1 13433 15879 0/4/0/4
dops-a2 13967 15879 0/4/0/4
dops-a4 13514 15900 0/8/0/8
dops-a5 13633 15900 0/8/0/8
log-c1 2049 3715 0/4/0/4
log-c2 2521 3779 0/4/0/4
log-c3 2176 3715 0/4/0/4
log-c4 2506 3779 0/4/0/4
sputnik 15934 15934 0/0/16/16
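The CPUS(Active/Idle/Offline/Total) column breaks each node's logical CPUs into four counts, so summing the fields over all rows gives cluster-wide occupancy. A sketch over a few sample rows copied from the table above; on the cluster the input would come from `sinfo -h -o "%n %C"`:

```shell
# Sample rows copied from the table above (Hostname A/I/O/T).
cpus_sample='cetus 32/8/0/40
psi 32/96/0/128
sputnik 0/0/16/16'

# Split on spaces and slashes, then total each of the four CPU counters.
echo "$cpus_sample" | awk -F'[ /]' '{a += $2; i += $3; o += $4; t += $5}
    END {printf "active=%d idle=%d offline=%d total=%d\n", a, i, o, t}'
```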
Job efficiency with seff
One of the most important commands is seff, which reports how efficiently a finished job used its allocated CPUs and memory:
$ seff job_id # id is a number. Example: seff 12895
Job ID: 12895
Cluster: multivac
User/Group: /usuaris
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 30
CPU Utilized: 00:02:17
CPU Efficiency: 57.08% of 00:04:00 core-walltime
Job Wall-clock time: 00:00:08
Memory Utilized: 3.35 MB
Memory Efficiency: 0.02% of 20.00 GB (20.00 GB/node)
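The percentages seff reports follow directly from the raw numbers: CPU efficiency is CPU time used divided by cores × wall-clock time. For the job above, 30 cores × 8 s of wall time gives 240 s of core-walltime (shown as 00:04:00), and 00:02:17 = 137 s of CPU used yields 137/240 ≈ 57.08%. A quick check of that arithmetic:

```shell
# CPU efficiency = CPU time used / (cores * wall-clock time).
# Values taken from the seff report above.
cores=30; wall_s=8; cpu_s=137
core_walltime=$((cores * wall_s))   # 240 s = 00:04:00 of core-walltime
awk -v u="$cpu_s" -v t="$core_walltime" 'BEGIN {printf "%.2f%%\n", 100 * u / t}'
```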