Slurm Commands Cheat Sheet

1. Cluster Status (sinfo)

sinfo displays partitions and node states.

Command                         Description
sinfo                           Basic view of partitions and nodes
sinfo -N                        Node-oriented view, one line per node name
sinfo -p gpu                    Filter by a specific partition (e.g., one named gpu)
sinfo -o "%P %a %l %D %t %N"    Custom columns: partition, availability, time limit, node count, state, node list

2. Job Submission & Management

Submit & Monitor

sbatch my_job.slurm      # Submit batch script
squeue                   # View all jobs in queue
squeue -u $USER          # Show only your jobs
squeue -j 123456         # Specific job details
squeue -o "%.18i %.9P %.20j %.8u %.2t %.10M %.6D %R"  # Custom format
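The commands above assume a batch script named my_job.slurm already exists. As a minimal sketch, the snippet below writes one to disk. The job name, the partition name cpu, and the resource values are placeholders rather than recommendations, so adjust them to whatever sinfo reports on your cluster.

```shell
# Write a minimal batch script; all values are illustrative placeholders.
cat > my_job.slurm <<'EOF'
#!/bin/bash
# Hypothetical job name and an assumed partition called "cpu".
#SBATCH --job-name=demo
#SBATCH --partition=cpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=4G
#SBATCH --time=00:10:00
# Output file pattern: %x expands to the job name, %j to the job ID.
#SBATCH --output=%x-%j.out

# Launch one job step that prints the compute node's hostname.
srun hostname
EOF

# On a cluster, submit it with: sbatch my_job.slurm
```

sbatch reads the #SBATCH lines as default options, and options given on the sbatch command line override them.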

Cancel Jobs

scancel 123456           # Cancel specific job
scancel -u $USER         # Cancel all your jobs
scancel 123456_17        # Cancel array task 17

Job Control

scontrol show job 123456     # Detailed job information
scontrol show node node001   # Detailed node information
scontrol hold 123456         # Hold a pending job so it will not start
scontrol release 123456      # Release a held job

3. Interactive Work

Allocate Resources

salloc -p cpu -N 1 -n 1 -c 8 --time=01:00:00

Then run inside allocation:

srun hostname
srun python script.py

Or start an interactive shell directly on a compute node:

srun --partition=short --ntasks=4 --gres=gpu:1 --time=02:00:00 --pty bash

4. Reporting & Statistics

Job Accounting

sacct -j 123456                                              # Job history
sacct -j 123456 --format=JobID,JobName,Partition,AllocCPUS,State,Elapsed,MaxRSS
sacct -u $USER --starttime today                             # Your jobs today

Live Statistics

sstat -j 123456.batch                                        # Running job stats
sstat -j 123456.batch --format=AveCPU,MaxRSS,MaxVMSize       # Specific metrics
seff 123456                                                  # Efficiency summary (contrib tool, may not be installed; most useful after the job ends)

5. Job Lifecycle States

State            Meaning
PENDING          Waiting in queue
RUNNING          Currently executing
COMPLETED        Finished successfully (exit code 0)
FAILED           Ended with a non-zero exit code or other error
CANCELLED        Cancelled by the user or an administrator
TIMEOUT          Time limit exceeded
OUT_OF_MEMORY    Memory limit exceeded
NODE_FAIL        Failure of an allocated node

Common PENDING Reasons
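squeue reports why a job is still pending in its reason field (format code %r); Resources and Priority are typical values. The helper below is a small sketch, with count_reasons being a hypothetical name, that tallies reason codes read from standard input so you can see what is holding most of your jobs.

```shell
# Tally reason codes read from stdin, most frequent first.
# count_reasons is an illustrative helper, not a Slurm command.
count_reasons() {
  sort | uniq -c | sort -rn
}

# On a cluster, feed it your pending jobs (-h drops the header line):
#   squeue -h -u "$USER" -t PENDING -o "%r" | count_reasons
```

For a single job, scontrol show job prints the same information in its Reason= field.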

6. Resource Specification

The most common job errors come from incorrect resource requests.

Key Parameters

Parameter               Description
--nodes (-N)            Number of nodes requested
--ntasks (-n)           Total tasks/processes (MPI ranks)
--cpus-per-task (-c)    CPU cores per task (OpenMP threads)
--mem                   Total RAM per node
--mem-per-cpu           RAM per CPU core
--time                  Max walltime (HH:MM:SS)
--partition (-p)        Target partition/queue

Pro Tips

# Pure MPI (1 core per rank)
--nodes=2 --ntasks-per-node=32 --cpus-per-task=1

# Hybrid MPI+OpenMP (4 ranks per node, 8 threads per rank)
--nodes=2 --ntasks-per-node=4 --cpus-per-task=8

# Pure OpenMP (1 task, 16 threads)
--nodes=1 --ntasks=1 --cpus-per-task=16

Golden Rule:

--ntasks-per-node × --cpus-per-task should match the node's core count for optimal packing.
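The rule can be checked with plain shell arithmetic before submitting. The sketch below uses a hypothetical helper, check_packing, and assumes 32-core nodes, which is what the MPI and hybrid examples above imply. On a real cluster, look up the core count with sinfo -N -o "%N %c".

```shell
# check_packing TASKS_PER_NODE CPUS_PER_TASK CORES_PER_NODE
# Illustrative helper: compares the requested cores per node with the
# cores a node actually has (32 below is an assumed value).
check_packing() {
  used=$(( $1 * $2 ))
  if [ "$used" -eq "$3" ]; then
    echo "full: $used/$3 cores"
  elif [ "$used" -lt "$3" ]; then
    echo "under: $used/$3 cores ($(( $3 - used )) idle)"
  else
    echo "over: $used/$3 cores"
  fi
}

check_packing 32 1 32   # pure MPI     -> full: 32/32 cores
check_packing 4 8 32    # hybrid       -> full: 32/32 cores
check_packing 4 4 32    # half-filled  -> under: 16/32 cores (16 idle)
```

An "over" result means the request cannot fit on one node of that size, so the job would stay pending or be rejected at submission.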