Slurm Commands Cheat Sheet
1. Cluster Status (sinfo)
sinfo displays partitions and node states.
| Command | Description |
|---|---|
| sinfo | Basic view of partitions and nodes |
| sinfo -N | List individual nodes, one per line |
| sinfo -p gpu | Show only a specific partition (e.g., gpu) |
| sinfo -o "%P %a %l %D %t %N" | Custom format: Partition, Availability, Time limit, Node count, State, Node names |
Use cases:
- Check available partitions
- View total node count
- Monitor node states: idle, alloc, mix, down, drain
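To act on these states from a script, the custom output format can be filtered with standard tools. A minimal sketch using hypothetical sinfo output captured in a variable; on a real cluster, pipe `sinfo -N -h -o "%N %t"` directly into the same filter:

```bash
#!/bin/sh
# Count idle nodes from `sinfo -N -h -o "%N %t"`-style output.
# The sample below is hypothetical; replace it with the real command's output.
sample='node001 idle
node002 alloc
node003 alloc
node004 drain'

idle_count=$(printf '%s\n' "$sample" | awk '$2 == "idle"' | wc -l)
echo "idle nodes: $idle_count"
```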
2. Job Submission & Management
Submit & Monitor
```bash
sbatch my_job.slurm   # Submit batch script
squeue                # View all jobs in queue
squeue -u $USER       # Show only your jobs
squeue -j 123456      # Specific job details
squeue -o "%.18i %.9P %.20j %.8u %.2t %.10M %.6D %R"   # Custom format
```
Cancel Jobs
```bash
scancel 123456      # Cancel specific job
scancel -u $USER    # Cancel all your jobs
scancel 123456_17   # Cancel array task 17
```
Job Control
```bash
scontrol show job 123456     # Detailed job information
scontrol show node node001   # Detailed node information
scontrol hold 123456         # Hold/pause job
scontrol release 123456      # Release held job
```
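A common pattern built on these commands is job chaining: submit a second job that starts only if the first succeeds, via --dependency. A sketch with hypothetical script names (prep.slurm, analyze.slurm); the hard-coded ID stands in for a real submission:

```bash
#!/bin/sh
# Chain two jobs: the second runs only if the first completes successfully.
# `sbatch --parsable` prints just the job ID, so it can be captured directly.
first_id=101   # stands in for: first_id=$(sbatch --parsable prep.slurm)
submit_cmd="sbatch --dependency=afterok:${first_id} analyze.slurm"
echo "$submit_cmd"
```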
3. Interactive Work
Allocate Resources
```bash
salloc -p cpu -N 1 -n 1 -c 8 --time=01:00:00
```
Then run inside allocation:
```bash
srun hostname
srun python script.py
```
Run interactive steps:
```bash
srun --partition=short --ntasks=4 --gres=gpu:1 --time=02:00:00 --pty bash
```
4. Reporting & Statistics
Job Accounting
```bash
sacct -j 123456                    # Job history
sacct -j 123456 --format=JobID,JobName,Partition,AllocCPUS,State,Elapsed,MaxRSS
sacct -u $USER --starttime today   # Your jobs today
```
Live Statistics
```bash
sstat -j 123456.batch                                    # Running job stats
sstat -j 123456.batch --format=AveCPU,MaxRSS,MaxVMSize   # Specific metrics
seff 123456                                              # Efficiency summary
```
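The memory efficiency that seff summarizes can be approximated by hand from sacct fields. A sketch with hypothetical numbers: a MaxRSS of 2 GB (sacct reports kilobytes) against a 16G --mem request:

```bash
#!/bin/sh
# Rough memory-efficiency estimate, similar to what seff summarizes.
max_rss_kb=2097152                 # hypothetical MaxRSS (2 GB) from sacct
req_mem_kb=$((16 * 1024 * 1024))   # a 16G --mem request, in KB
pct=$((100 * max_rss_kb / req_mem_kb))
echo "memory efficiency: ${pct}%"
```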
5. Job Lifecycle States
| State | Meaning |
|---|---|
| PENDING | Waiting in queue |
| RUNNING | Currently executing |
| COMPLETED | Finished successfully |
| FAILED | Error occurred |
| CANCELLED | Manually stopped |
| TIMEOUT | Time limit exceeded |
| OUT_OF_MEMORY | RAM exhausted |
| NODE_FAIL | Node hardware failure |
Common PENDING Reasons
- Resources – No free resources available
- Priority – Lower queue priority
- QOSMax… – Quality of Service limits
- AssocGrp… – Account/group limits
- ReqNodeNotAvail – Requested node unavailable
- Dependency – Waiting on another job
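The pending reason shows up in the %R field of squeue. A sketch that pulls it out of a hypothetical output line; on a real cluster, feed `squeue -u $USER -h -o "%.18i %.2t %R"` into the same extraction:

```bash
#!/bin/sh
# Extract the reason column from a `squeue -o "%.18i %.2t %R"`-style line.
line='            123456 PD (Priority)'   # hypothetical pending job
reason=$(printf '%s\n' "$line" | awk '{print $3}')
echo "$reason"
```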
6. Resource Specification
Most common errors come from incorrect resource requests.
Key Parameters
| Parameter | Description |
|---|---|
| --nodes (-N) | Number of nodes requested |
| --ntasks (-n) | Total tasks/processes (MPI ranks) |
| --cpus-per-task (-c) | CPU cores per task (OpenMP threads) |
| --mem | Total RAM per node |
| --mem-per-cpu | RAM per CPU core |
| --time | Max walltime (HH:MM:SS) |
| --partition (-p) | Target partition/queue |
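These parameters usually go at the top of the batch script as #SBATCH directives. A minimal sketch with hypothetical values (partition name, limits); note the OpenMP thread count is wired to --cpus-per-task:

```bash
#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --partition=cpu          # hypothetical partition name
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --time=01:00:00

# Match OpenMP threads to the requested cores (falls back to 8 off-cluster)
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-8}
echo "running with $OMP_NUM_THREADS threads"
```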
Pro Tips
```bash
# Pure MPI (1 core per rank)
--nodes=2 --ntasks-per-node=32 --cpus-per-task=1

# Hybrid MPI+OpenMP
--nodes=2 --ntasks-per-node=4 --cpus-per-task=8

# Pure OpenMP
--nodes=1 --ntasks=1 --cpus-per-task=16
```
Golden Rule: --ntasks-per-node × --cpus-per-task should match the node's core count for optimal packing.
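The rule above can be sanity-checked with shell arithmetic before submitting. A sketch assuming a hypothetical 32-core node and the hybrid layout from the Pro Tips:

```bash
#!/bin/sh
# Check that tasks-per-node x cpus-per-task fills the node's cores.
cores_per_node=32   # hypothetical node size
ntasks_per_node=4
cpus_per_task=8
used=$((ntasks_per_node * cpus_per_task))
if [ "$used" -eq "$cores_per_node" ]; then
  echo "fully packed"
else
  echo "wasting $((cores_per_node - used)) cores per node"
fi
```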
slurm_most_used_commands.txt · Last modified: 2026/04/02 16:08 by nshegunov
