Slurm Commands Cheat Sheet
1. Cluster Status (sinfo)
sinfo displays partitions and node states.
| Command | Description |
|---|---|
| sinfo | Basic view of partitions and nodes |
| sinfo -N | List individual nodes, one per line |
| sinfo -p gpu | Show only a specific partition (e.g., gpu) |
| sinfo -o "%P %a %l %D %t %N" | Custom format: Partition, Availability, Time limit, Node count, State, Node names |
Use cases:
- Check available partitions
- View total node count
- Monitor node states: idle, alloc, mix, down, drain
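To act on these states from a script, the custom output format can be filtered with standard tools. A minimal sketch using hypothetical sinfo output captured in a variable; on a real cluster, pipe `sinfo -N -h -o "%N %t"` directly into the same filter:

```bash
#!/bin/sh
# Count idle nodes from `sinfo -N -h -o "%N %t"`-style output.
# The sample below is hypothetical; replace it with the real command's output.
sample='node001 idle
node002 alloc
node003 alloc
node004 drain'

idle_count=$(printf '%s\n' "$sample" | awk '$2 == "idle"' | wc -l)
echo "idle nodes: $idle_count"
```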
2. Job Submission & Management
Submit & Monitor
```bash
sbatch my_job.slurm   # Submit batch script
squeue                # View all jobs in queue
squeue -u $USER       # Show only your jobs
squeue -j 123456      # Specific job details
squeue -o "%.18i %.9P %.20j %.8u %.2t %.10M %.6D %R"   # Custom format
```
Cancel Jobs
```bash
scancel 123456      # Cancel specific job
scancel -u $USER    # Cancel all your jobs
scancel 123456_17   # Cancel array task 17
```
Job Control
```bash
scontrol show job 123456     # Detailed job information
scontrol show node node001   # Detailed node information
scontrol hold 123456         # Hold/pause job
scontrol release 123456      # Release held job
```
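A common pattern built on these commands is job chaining: submit a second job that starts only if the first succeeds, via --dependency. A sketch with hypothetical script names (prep.slurm, analyze.slurm); the hard-coded ID stands in for a real submission:

```bash
#!/bin/sh
# Chain two jobs: the second runs only if the first completes successfully.
# `sbatch --parsable` prints just the job ID, so it can be captured directly.
first_id=101   # stands in for: first_id=$(sbatch --parsable prep.slurm)
submit_cmd="sbatch --dependency=afterok:${first_id} analyze.slurm"
echo "$submit_cmd"
```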
3. Interactive Work
Allocate Resources
```bash
salloc -p cpu -N 1 -n 1 -c 8 --time=01:00:00
```
Then run inside allocation:
```bash
srun hostname
srun python script.py
```
Run interactive steps:
```bash
srun --partition=short --ntasks=4 --gres=gpu:1 --time=02:00:00 --pty bash
```
4. Reporting & Statistics
Job Accounting
```bash
sacct -j 123456                    # Job history
sacct -j 123456 --format=JobID,JobName,Partition,AllocCPUS,State,Elapsed,MaxRSS
sacct -u $USER --starttime today   # Your jobs today
```
Live Statistics
```bash
sstat -j 123456.batch                                    # Running job stats
sstat -j 123456.batch --format=AveCPU,MaxRSS,MaxVMSize   # Specific metrics
seff 123456                                              # Efficiency summary
```
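The memory efficiency that seff summarizes can be approximated by hand from sacct fields. A sketch with hypothetical numbers: a MaxRSS of 2 GB (sacct reports kilobytes) against a 16G --mem request:

```bash
#!/bin/sh
# Rough memory-efficiency estimate, similar to what seff summarizes.
max_rss_kb=2097152                 # hypothetical MaxRSS (2 GB) from sacct
req_mem_kb=$((16 * 1024 * 1024))   # a 16G --mem request, in KB
pct=$((100 * max_rss_kb / req_mem_kb))
echo "memory efficiency: ${pct}%"
```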
5. Job Lifecycle States
| State | Meaning |
|---|---|
| PENDING | Waiting in queue |
| RUNNING | Currently executing |
| COMPLETED | Finished successfully |
| FAILED | Error occurred |
| CANCELLED | Manually stopped |
| TIMEOUT | Time limit exceeded |
| OUT_OF_MEMORY | RAM exhausted |
| NODE_FAIL | Node hardware failure |
Common PENDING Reasons
- Resources – No free resources available
- Priority – Lower queue priority
- QOSMax… – Quality of Service limits
- AssocGrp… – Account/group limits
- ReqNodeNotAvail – Requested node unavailable
- Dependency – Waiting on another job
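The pending reason shows up in the %R field of squeue. A sketch that pulls it out of a hypothetical output line; on a real cluster, feed `squeue -u $USER -h -o "%.18i %.2t %R"` into the same extraction:

```bash
#!/bin/sh
# Extract the reason column from a `squeue -o "%.18i %.2t %R"`-style line.
line='            123456 PD (Priority)'   # hypothetical pending job
reason=$(printf '%s\n' "$line" | awk '{print $3}')
echo "$reason"
```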
6. Resource Specification
Most common errors come from incorrect resource requests.
Key Parameters
| Parameter | Description |
|---|---|
| --nodes (-N) | Number of nodes requested |
| --ntasks (-n) | Total tasks/processes (MPI ranks) |
| --cpus-per-task (-c) | CPU cores per task (OpenMP threads) |
| --mem | Total RAM per node |
| --mem-per-cpu | RAM per CPU core |
| --time | Max walltime (HH:MM:SS) |
| --partition (-p) | Target partition/queue |
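These parameters usually go at the top of the batch script as #SBATCH directives. A minimal sketch with hypothetical values (partition name, limits); note the OpenMP thread count is wired to --cpus-per-task:

```bash
#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --partition=cpu          # hypothetical partition name
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --time=01:00:00

# Match OpenMP threads to the requested cores (falls back to 8 off-cluster)
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-8}
echo "running with $OMP_NUM_THREADS threads"
```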
Pro Tips
```bash
# Pure MPI (1 core per rank)
--nodes=2 --ntasks-per-node=32 --cpus-per-task=1

# Hybrid MPI+OpenMP
--nodes=2 --ntasks-per-node=4 --cpus-per-task=8

# Pure OpenMP
--nodes=1 --ntasks=1 --cpus-per-task=16
```
Golden Rule: --ntasks-per-node × --cpus-per-task should match the node's core count for optimal packing.
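The rule above can be sanity-checked with shell arithmetic before submitting. A sketch assuming a hypothetical 32-core node and the hybrid layout from the Pro Tips:

```bash
#!/bin/sh
# Check that tasks-per-node x cpus-per-task fills the node's cores.
cores_per_node=32   # hypothetical node size
ntasks_per_node=4
cpus_per_task=8
used=$((ntasks_per_node * cpus_per_task))
if [ "$used" -eq "$cores_per_node" ]; then
  echo "fully packed"
else
  echo "wasting $((cores_per_node - used)) cores per node"
fi
```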
slurm_most_used_commands.txt · Last modified: 2026/04/02 16:08 by nshegunov
