Slurm Commands Cheat Sheet

1. Cluster Status (sinfo)

sinfo displays partitions and node states.

Command                       Description
sinfo                         Basic view of partitions and nodes
sinfo -N                      Node-oriented view, one line per node
sinfo -p gpu                  Filter by a specific partition (e.g., gpu)
sinfo -o "%P %a %l %D %t %N"  Custom format: partition, availability, time limit, node count, state, node names

Use cases:

  • Check available partitions
  • View total node count
  • Monitor node states:
    idle, alloc, mix, down, drain
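
Because sinfo's format string is scriptable, node states can be tallied with awk. The input below is canned sample output, since a live cluster is not assumed; on a real system, `sinfo -h -o "%t %D"` (no-header output of state and node count) can be piped through the same filter.

```shell
# Summarize node counts per state. The printf stands in for real output;
# on a cluster, replace it with:  sinfo -h -o "%t %D"
printf 'idle 12\nalloc 30\nmix 4\ndown 2\n' \
  | awk '{count[$1] += $2; total += $2}
         END {for (s in count) print s, count[s]; print "total", total}'
```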

2. Job Submission & Management

Submit & Monitor

sbatch my_job.slurm      # Submit batch script
squeue                   # View all jobs in queue
squeue -u $USER          # Show only your jobs
squeue -j 123456         # Specific job details
squeue -o "%.18i %.9P %.20j %.8u %.2t %.10M %.6D %R"  # Custom format
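
For reference, a minimal sketch of what a batch script like my_job.slurm might contain. The job name, partition name, and resource values are placeholders to adapt to your site, not prescribed settings.

```shell
#!/bin/bash
#SBATCH --job-name=my_job        # appears in squeue's name column
#SBATCH --partition=cpu          # placeholder: pick a partition shown by sinfo
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=01:00:00          # walltime limit, HH:MM:SS

srun hostname                    # launch each job step under srun
```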

Cancel Jobs

scancel 123456           # Cancel specific job
scancel -u $USER         # Cancel all your jobs
scancel 123456_17        # Cancel array task 17

Job Control

scontrol show job 123456     # Detailed job information
scontrol show node node001   # Detailed node information
scontrol hold 123456         # Hold/pause job
scontrol release 123456      # Release held job

3. Interactive Work

Allocate Resources

salloc -p cpu -N 1 -n 1 -c 8 --time=01:00:00

Then, inside the allocation, launch job steps with srun:

srun hostname
srun python script.py

Or launch a step directly, without a prior allocation:

srun -p cpu -N 1 -n 1 -c 4 --time=00:10:00 hostname

4. Reporting & Statistics

Job Accounting

sacct -j 123456                                              # Job history
sacct -j 123456 --format=JobID,JobName,Partition,AllocCPUS,State,Elapsed,MaxRSS
sacct -u $USER --starttime today                             # Your jobs today

Live Statistics

sstat -j 123456.batch                                        # Running job stats
sstat -j 123456.batch --format=AveCPU,MaxRSS,MaxVMSize       # Specific metrics
seff 123456                                                  # Efficiency summary
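
seff's memory-efficiency figure is essentially peak usage (MaxRSS, as reported by sacct/sstat) divided by requested memory. A sketch of that arithmetic, using made-up example numbers:

```shell
# Hypothetical values: job requested 8192 MB, sacct reported MaxRSS of 2048 MB.
requested_mb=8192
maxrss_mb=2048
efficiency=$(( 100 * maxrss_mb / requested_mb ))   # integer percent
echo "Memory efficiency: ${efficiency}%"           # prints: Memory efficiency: 25%
```

A low percentage here suggests the job could request less memory and schedule sooner.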

5. Job Lifecycle States

State          Meaning
PENDING        Waiting in queue
RUNNING        Currently executing
COMPLETED      Finished successfully
FAILED         Exited with a non-zero exit code
CANCELLED      Manually stopped
TIMEOUT        Time limit exceeded
OUT_OF_MEMORY  RAM exhausted
NODE_FAIL      Node hardware failure

Common PENDING Reasons

  • Resources – No free resources available
  • Priority – Lower queue priority
  • QOSMax… – Quality of Service limits
  • AssocGrp… – Account/group limits
  • ReqNodeNotAvail – Requested node unavailable
  • Dependency – Waiting on another job

6. Resource Specification

Most submission errors and scheduling surprises trace back to incorrect resource requests.

Key Parameters

Parameter             Description
--nodes (-N)          Number of nodes requested
--ntasks (-n)         Total tasks/processes (MPI ranks)
--cpus-per-task (-c)  CPU cores per task (OpenMP threads)
--mem                 Total RAM per node
--mem-per-cpu         RAM per CPU core
--time                Max walltime (HH:MM:SS)
--partition (-p)      Target partition/queue

Pro Tips

# Pure MPI (1 core per rank)
--nodes=2 --ntasks-per-node=32 --cpus-per-task=1
 
# Hybrid MPI+OpenMP  
--nodes=2 --ntasks-per-node=4 --cpus-per-task=8
 
# OpenMP
--nodes=1 --ntasks=1 --cpus-per-task=16

Golden Rule:

--ntasks-per-node × --cpus-per-task

should match node core count for optimal packing.
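
The rule can be sanity-checked with shell arithmetic before submitting. The 32-core node count below is an assumption chosen to match the hybrid example above; substitute your cluster's actual cores per node.

```shell
# Golden-rule check for the hybrid example: 4 tasks/node × 8 cpus/task
ntasks_per_node=4
cpus_per_task=8
cores_per_node=32        # assumption: a 32-core node
used=$(( ntasks_per_node * cpus_per_task ))
if [ "$used" -eq "$cores_per_node" ]; then
  echo "fully packed: $used of $cores_per_node cores"
else
  echo "mismatch: $used of $cores_per_node cores"
fi
```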

slurm_most_used_commands.1773821106.txt.gz · Last modified: 2026/03/18 10:05 by nshegunov
