====== Slurm Commands Cheat Sheet ======

===== 1. Cluster Status (sinfo) =====

''sinfo'' displays partitions and node states.

^ Command ^ Description ^
| sinfo | Basic view of partitions and nodes |
| sinfo -N | Show individual node names |
| sinfo -p gpu | Filter by specific partition (e.g., GPU) |
| sinfo -o "%P %a %l %D %t %N" | Custom: Partition, Availability, Time limit, Node count, State, Node names |

**Use cases:**
  * Check available partitions
  * View total node count
  * Monitor node states: idle, alloc, mix, down, drain

===== 2. Job Submission & Management =====

==== Submit & Monitor ====

<code bash>
sbatch my_job.slurm    # Submit batch script
squeue                 # View all jobs in queue
squeue -u $USER        # Show only your jobs
squeue -j 123456       # Specific job details
squeue -o "%.18i %.9P %.20j %.8u %.2t %.10M %.6D %R"   # Custom format
</code>

==== Cancel Jobs ====

<code bash>
scancel 123456         # Cancel specific job
scancel -u $USER       # Cancel all your jobs
scancel 123456_17      # Cancel array task 17
</code>

==== Job Control ====

<code bash>
scontrol show job 123456      # Detailed job information
scontrol show node node001    # Detailed node information
scontrol hold 123456          # Hold/pause job
scontrol release 123456       # Release held job
</code>

===== 3. Interactive Work =====

==== Allocate Resources ====

<code bash>
salloc -p cpu -N 1 -n 1 -c 8 --time=01:00:00
</code>

**Then run inside the allocation:**

<code bash>
srun hostname
srun python script.py
</code>

**Or launch an interactive shell directly:**

<code bash>
srun --partition=short --ntasks=4 --gres=gpu:1 --time=02:00:00 --pty bash
</code>

===== 4. Reporting & Statistics =====

==== Job Accounting ====

<code bash>
sacct -j 123456                    # Job history
sacct -j 123456 --format=JobID,JobName,Partition,AllocCPUS,State,Elapsed,MaxRSS
sacct -u $USER --starttime today   # Your jobs today
</code>

==== Live Statistics ====

<code bash>
sstat -j 123456.batch                                    # Running job stats
sstat -j 123456.batch --format=AveCPU,MaxRSS,MaxVMSize   # Specific metrics
seff 123456                                              # Efficiency summary
</code>

===== 5. Job Lifecycle States =====

^ State ^ Meaning ^
| **PENDING** | Waiting in queue |
| **RUNNING** | Currently executing |
| **COMPLETED** | Finished successfully |
| **FAILED** | Exited with an error (non-zero exit code) |
| **CANCELLED** | Manually stopped |
| **TIMEOUT** | Time limit exceeded |
| **OUT_OF_MEMORY** | RAM exhausted |
| **NODE_FAIL** | Node hardware failure |

==== Common PENDING Reasons ====

  * **Resources** – No free resources available
  * **Priority** – Lower queue priority
  * **QOSMax...** – Quality of Service limits
  * **AssocGrp...** – Account/group limits
  * **ReqNodeNotAvail** – Requested node unavailable
  * **Dependency** – Waiting on another job

===== 6. Resource Specification =====

**Most common errors come from incorrect resource requests.**

==== Key Parameters ====

^ Parameter ^ Description ^
| --nodes (-N) | Number of nodes requested |
| --ntasks (-n) | Total tasks/processes (MPI ranks) |
| --cpus-per-task (-c) | CPU cores per task (OpenMP threads) |
| --mem | Total RAM per node |
| --mem-per-cpu | RAM per CPU core |
| --time | Max walltime (HH:MM:SS) |
| --partition (-p) | Target partition/queue |

==== Pro Tips ====

<code bash>
# Pure MPI (1 core per rank)
--nodes=2 --ntasks-per-node=32 --cpus-per-task=1

# Hybrid MPI+OpenMP
--nodes=2 --ntasks-per-node=4 --cpus-per-task=8

# OpenMP
--nodes=1 --ntasks=1 --cpus-per-task=16
</code>

**Golden Rule:** ''--ntasks-per-node'' × ''--cpus-per-task'' should match the node core count for optimal packing.
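The Golden Rule above is simple arithmetic, and it is worth checking before submitting. A minimal sketch of that check in plain bash, assuming a hypothetical 32-core node (verify your actual core count with ''sinfo -o "%c"'') and the hybrid MPI+OpenMP request from the Pro Tips:

<code bash>
#!/bin/bash
# Assumed node size; on a real cluster query it with: sinfo -o "%c"
NODE_CORES=32

# Hybrid MPI+OpenMP request from the Pro Tips above
NTASKS_PER_NODE=4
CPUS_PER_TASK=8

# Golden Rule: ntasks-per-node x cpus-per-task should equal the node core count
REQUESTED=$((NTASKS_PER_NODE * CPUS_PER_TASK))
if [ "$REQUESTED" -eq "$NODE_CORES" ]; then
    echo "OK: $REQUESTED cores per node requested, node fully packed"
else
    echo "WARN: $REQUESTED cores requested, but node has $NODE_CORES"
fi
</code>

Here 4 × 8 = 32, so the node is fully packed; a request of, say, ''--ntasks-per-node=4 --cpus-per-task=6'' would leave 8 cores idle on every allocated node.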