====== Slurm Commands Cheat Sheet ======
===== 1. Cluster Status (sinfo) =====
''sinfo'' displays partitions and node states.
^ Command ^ Description ^
| sinfo | Basic view of partitions and nodes |
| sinfo -N | Show individual node names |
| sinfo -p gpu | Filter by specific partition (e.g., GPU) |
| sinfo -o "%P %a %l %D %t %N" | Custom: Partition, Availability, Time limit, Node count, State, Node names |
**Use cases:**
* Check available partitions
* View total node count
* Monitor node states: idle, alloc, mix, down, drain
===== 2. Job Submission & Management =====
==== Submit & Monitor ====
<code bash>
sbatch my_job.slurm    # Submit batch script
squeue                 # View all jobs in queue
squeue -u $USER        # Show only your jobs
squeue -j 123456       # Specific job details
squeue -o "%.18i %.9P %.20j %.8u %.2t %.10M %.6D %R"   # Custom format
</code>
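**Example batch script** — a minimal sketch of ''my_job.slurm''; the partition name, resources, and the Python script are placeholders to adapt to your site:
<code bash>
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --partition=cpu        # site-specific partition name
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=01:00:00
#SBATCH --output=%x_%j.out     # %x = job name, %j = job ID

srun python script.py
</code>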
==== Cancel Jobs ====
<code bash>
scancel 123456         # Cancel specific job
scancel -u $USER       # Cancel all your jobs
scancel 123456_17      # Cancel array task 17
</code>
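**Job arrays** — the ''123456_17'' form refers to an array task. A sketch of submitting and cancelling an array (''my_array.slurm'' and the job ID are placeholders):
<code bash>
sbatch --array=0-9%4 my_array.slurm   # Tasks 0-9, at most 4 running at once
scancel 123456_[2-5]                  # Cancel tasks 2 through 5 only
</code>
Inside the script, ''$SLURM_ARRAY_TASK_ID'' selects the work item for each task.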
==== Job Control ====
<code bash>
scontrol show job 123456      # Detailed job information
scontrol show node node001    # Detailed node information
scontrol hold 123456          # Hold a pending job (it will not start until released)
scontrol release 123456       # Release a held job
</code>
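''scontrol'' can also modify or requeue jobs; two commonly useful forms (the job ID is a placeholder, and raising limits usually requires operator rights):
<code bash>
scontrol update JobId=123456 TimeLimit=02:00:00   # Change the job's time limit
scontrol requeue 123456                           # Put a job back in the queue
</code>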
===== 3. Interactive Work =====
==== Allocate Resources ====
<code bash>
salloc -p cpu -N 1 -n 1 -c 8 --time=01:00:00
</code>
**Then run inside allocation:**
<code bash>
srun hostname
srun python script.py
</code>
**Or start an interactive shell in one step:**
<code bash>
srun --partition=short --ntasks=4 --gres=gpu:1 --time=02:00:00 --pty bash
</code>
===== 4. Reporting & Statistics =====
==== Job Accounting ====
<code bash>
sacct -j 123456                      # Job history
sacct -j 123456 --format=JobID,JobName,Partition,AllocCPUS,State,Elapsed,MaxRSS
sacct -u $USER --starttime today     # Your jobs today
</code>
==== Live Statistics ====
<code bash>
sstat -j 123456.batch                                    # Stats for a running job step
sstat -j 123456.batch --format=AveCPU,MaxRSS,MaxVMSize   # Specific metrics
seff 123456                                              # CPU/memory efficiency summary
</code>
Note: ''sstat'' reports only on running jobs; once a job has finished, use ''sacct'' or ''seff'' instead.
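To judge whether a request was sized correctly, compare peak usage against what was asked for; a hedged example (the job ID is a placeholder):
<code bash>
sacct -j 123456 --format=JobID,ReqMem,MaxRSS,Elapsed,State --units=G
</code>
If ''MaxRSS'' is far below ''ReqMem'', the next submission can request less memory and will usually queue faster.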
===== 5. Job Lifecycle States =====
^ State ^ Meaning ^
| **PENDING** | Waiting in queue |
| **RUNNING** | Currently executing |
| **COMPLETED** | Finished successfully |
| **FAILED** | Error occurred |
| **CANCELLED** | Manually stopped |
| **TIMEOUT** | Time limit exceeded |
| **OUT_OF_MEMORY** | RAM exhausted |
| **NODE_FAIL** | Node hardware failure |
==== Common PENDING Reasons ====
* **Resources** – No free resources available
* **Priority** – Lower queue priority
* **QOSMax...** – Quality of Service limits
* **AssocGrp...** – Account/group limits
* **ReqNodeNotAvail** – Requested node unavailable
* **Dependency** – Waiting on another job
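The pending reason appears in the last column of ''squeue''; to list only your pending jobs with their reasons (''%r'' prints the Reason field):
<code bash>
squeue -u $USER -t PENDING -o "%.18i %.9P %.2t %r"
</code>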
===== 6. Resource Specification =====
**The most common submission errors come from incorrect resource requests.**
==== Key Parameters ====
^ Parameter ^ Description ^
| --nodes (-N) | Number of nodes requested |
| --ntasks (-n) | Total tasks/processes (MPI ranks) |
| --cpus-per-task (-c) | CPU cores per task (OpenMP threads) |
| --mem | Total RAM per node |
| --mem-per-cpu | RAM per CPU core |
| --time | Max walltime (HH:MM:SS) |
| --partition (-p) | Target partition/queue |
==== Pro Tips ====
<code bash>
# Pure MPI (1 core per rank)
--nodes=2 --ntasks-per-node=32 --cpus-per-task=1

# Hybrid MPI+OpenMP
--nodes=2 --ntasks-per-node=4 --cpus-per-task=8

# OpenMP (single process, many threads)
--nodes=1 --ntasks=1 --cpus-per-task=16
</code>
**Golden Rule:** ''--ntasks-per-node'' × ''--cpus-per-task'' should match the node core count for optimal packing.
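As a worked example of the rule, on a hypothetical 32-core node the hybrid layout fills every core, and exporting the thread count keeps OpenMP in sync with the request (''./hybrid_app'' is a placeholder):
<code bash>
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8      # 4 tasks × 8 CPUs = 32 cores per node

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./hybrid_app
</code>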