Slurm Commands Cheat Sheet

1. Cluster Status (sinfo)

sinfo displays partitions and node states.

Command                       Description
sinfo                         Basic view of partitions and nodes
sinfo -N                      Node-oriented view, one line per node
sinfo -p gpu                  Filter by a specific partition (e.g., gpu)
sinfo -o "%P %a %l %D %t %N"  Custom format: partition, availability, time limit, node count, state, node names

Use cases:

  • Check available partitions
  • View total node count
  • Monitor node states:
    idle, alloc, mix, down, drain
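
Because sinfo's format string is scriptable, node states can be tallied with awk. The input below is canned sample output, since a live cluster is not assumed; on a real system, `sinfo -h -o "%t %D"` (no-header output of state and node count) can be piped through the same filter.

```shell
# Summarize node counts per state. The printf stands in for real output;
# on a cluster, replace it with:  sinfo -h -o "%t %D"
printf 'idle 12\nalloc 30\nmix 4\ndown 2\n' \
  | awk '{count[$1] += $2; total += $2}
         END {for (s in count) print s, count[s]; print "total", total}'
```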

2. Job Submission & Management

Submit & Monitor

sbatch my_job.slurm      # Submit batch script
squeue                   # View all jobs in queue
squeue -u $USER          # Show only your jobs
squeue -j 123456         # Specific job details
squeue -o "%.18i %.9P %.20j %.8u %.2t %.10M %.6D %R"  # Custom format
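
For reference, a minimal sketch of what a batch script like my_job.slurm might contain. The job name, partition name, and resource values are placeholders to adapt to your site, not prescribed settings.

```shell
#!/bin/bash
#SBATCH --job-name=my_job        # appears in squeue's name column
#SBATCH --partition=cpu          # placeholder: pick a partition shown by sinfo
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=01:00:00          # walltime limit, HH:MM:SS

srun hostname                    # launch each job step under srun
```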

Cancel Jobs

scancel 123456           # Cancel specific job
scancel -u $USER         # Cancel all your jobs
scancel 123456_17        # Cancel array task 17

Job Control

scontrol show job 123456     # Detailed job information
scontrol show node node001   # Detailed node information
scontrol hold 123456         # Hold/pause job
scontrol release 123456      # Release held job

3. Interactive Work

Allocate Resources

salloc -p cpu -N 1 -n 1 -c 8 --time=01:00:00

Then, inside the allocation, launch job steps with srun:

srun hostname
srun python script.py

Or launch a step directly, without a prior allocation:

srun -p cpu -N 1 -n 1 -c 4 --time=00:10:00 hostname

4. Reporting & Statistics

Job Accounting

sacct -j 123456                                              # Job history
sacct -j 123456 --format=JobID,JobName,Partition,AllocCPUS,State,Elapsed,MaxRSS
sacct -u $USER --starttime today                             # Your jobs today

Live Statistics

sstat -j 123456.batch                                        # Running job stats
sstat -j 123456.batch --format=AveCPU,MaxRSS,MaxVMSize       # Specific metrics
seff 123456                                                  # Efficiency summary
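
seff's memory-efficiency figure is essentially peak usage (MaxRSS, as reported by sacct/sstat) divided by requested memory. A sketch of that arithmetic, using made-up example numbers:

```shell
# Hypothetical values: job requested 8192 MB, sacct reported MaxRSS of 2048 MB.
requested_mb=8192
maxrss_mb=2048
efficiency=$(( 100 * maxrss_mb / requested_mb ))   # integer percent
echo "Memory efficiency: ${efficiency}%"           # prints: Memory efficiency: 25%
```

A low percentage here suggests the job could request less memory and schedule sooner.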

5. Job Lifecycle States

State          Meaning
PENDING        Waiting in queue
RUNNING        Currently executing
COMPLETED      Finished successfully
FAILED         Exited with a non-zero exit code
CANCELLED      Manually stopped
TIMEOUT        Time limit exceeded
OUT_OF_MEMORY  RAM exhausted
NODE_FAIL      Node hardware failure

Common PENDING Reasons

  • Resources – No free resources available
  • Priority – Lower queue priority
  • QOSMax… – Quality of Service limits
  • AssocGrp… – Account/group limits
  • ReqNodeNotAvail – Requested node unavailable
  • Dependency – Waiting on another job

6. Resource Specification

Most submission errors and scheduling surprises trace back to incorrect resource requests.

Key Parameters

Parameter             Description
--nodes (-N)          Number of nodes requested
--ntasks (-n)         Total tasks/processes (MPI ranks)
--cpus-per-task (-c)  CPU cores per task (OpenMP threads)
--mem                 Total RAM per node
--mem-per-cpu         RAM per CPU core
--time                Max walltime (HH:MM:SS)
--partition (-p)      Target partition/queue

Pro Tips

# Pure MPI (1 core per rank)
--nodes=2 --ntasks-per-node=32 --cpus-per-task=1
 
# Hybrid MPI+OpenMP  
--nodes=2 --ntasks-per-node=4 --cpus-per-task=8
 
# OpenMP
--nodes=1 --ntasks=1 --cpus-per-task=16

Golden Rule:

--ntasks-per-node × --cpus-per-task

should match node core count for optimal packing.
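
The rule can be sanity-checked with shell arithmetic before submitting. The 32-core node count below is an assumption chosen to match the hybrid example above; substitute your cluster's actual cores per node.

```shell
# Golden-rule check for the hybrid example: 4 tasks/node × 8 cpus/task
ntasks_per_node=4
cpus_per_task=8
cores_per_node=32        # assumption: a 32-core node
used=$(( ntasks_per_node * cpus_per_task ))
if [ "$used" -eq "$cores_per_node" ]; then
  echo "fully packed: $used of $cores_per_node cores"
else
  echo "mismatch: $used of $cores_per_node cores"
fi
```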

slurm_most_used_commands.1773821106.txt.gz · Last modified: 2026/03/18 10:05 by nshegunov
