slurm_tutorial
Differences
This shows you the differences between two versions of the page.
| slurm_tutorial [2025/04/07 19:33] – nshegunov | slurm_tutorial [2025/04/07 20:03] (current) – [SLURM - Simple Linux Utility for Resource Management] nshegunov | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| ====== SLURM - Simple Linux Utility for Resource Management ====== | ====== SLURM - Simple Linux Utility for Resource Management ====== | ||
| - | SLURM is an open-source workload manager designed for Linux clusters of all sizes. It provides job scheduling and resource management to optimize cluster utilization. It is a highly scalable cluster management and job scheduling system for large and small Linux clusters. It is used by some of the world’s most powerful supercomputers. | + | SLURM is an open-source workload manager designed for Linux clusters of all sizes. It provides job scheduling and resource management to optimize cluster utilization. It is a highly scalable cluster management and job scheduling system for large and small Linux clusters. It is used by some of the world’s most powerful supercomputers. |
| Please refer to [[https:// | Please refer to [[https:// | ||
| Line 19: | Line 19: | ||
| * **Job Step** – A set of (possibly parallel) tasks within a job, typically launched with srun. | * **Job Step** – A set of (possibly parallel) tasks within a job, typically launched with srun. | ||
| * **Scheduler** – The component that determines which jobs run when. | * **Scheduler** – The component that determines which jobs run when. | ||
| + | |||
| + | ===== Basic Architecture ===== | ||
| + | | {{ : | ||
| + | | SLURM architecture overview ([[https:// | ||
| + | |||
| + | Slurm consists of several components that together manage the cluster resources. Below is a short summary:
| + | |||
| + | * **slurmctld (Controller Daemon)** | ||
| + | - Runs on the management (head) node. | ||
| + | - Handles job scheduling, resource allocation, and overall cluster state. | ||
| + | - May be paired with an optional backup controller for failover. | ||
| + | |||
| + | * **slurmd (Node Daemon)** | ||
| + | - Runs on each compute node. | ||
| + | - Responsible for launching, monitoring, and cleaning up jobs on the node. | ||
| + | - Communicates with the slurmctld to receive instructions. | ||
| + | |||
| + | * **slurmdbd (Database Daemon)** '' | ||
| + | - Manages job accounting and usage data. | ||
| + | - Works with an external database (e.g., MySQL, MariaDB). | ||
| + | - Enables commands like **sacct** and **sreport** for usage reporting. | ||
| + | |||
| + | * **Client Commands** | ||
| + | - Tools used by users and admins to interact with Slurm: | ||
| + | - **sbatch** – submit batch jobs | ||
| + | - **srun** – run parallel jobs interactively | ||
| + | - **scancel** – cancel jobs | ||
| + | - **squeue** – view job queues | ||
| + | |||
| + | * **Central Database** '' | ||
| + | - Stores job and usage records. | ||
| + | - Used in conjunction with **slurmdbd** for accounting and reporting. | ||
| + | - Supports multiple clusters if needed. | ||
| + | |||
| + | The components exchange authenticated messages (commonly via the MUNGE authentication plugin) to coordinate resource usage and job execution.
| + | |||
| + | ==== Official Source ==== | ||
| + | |||
| + | SchedMD - Slurm Workload Manager | ||
| + | * https:// | ||
| ===== SLURM Commands ===== | ===== SLURM Commands ===== | ||
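
The client commands listed under Basic Architecture can be illustrated with a small Python sketch that only builds the command lines a user would type. The helper names (`sbatch_cmd`, `squeue_cmd`, `scancel_cmd`, `sacct_cmd`, `run_if_available`), the script name `job.sh`, the user `alice`, and the job ID `12345` are hypothetical and chosen for illustration; only the Slurm command names and flags are real. It is a minimal sketch, not a recommended wrapper for any particular cluster.

```python
import shutil
import subprocess

# Hypothetical helpers: each one returns the argument list for a Slurm
# client command, so the mapping from task to command is easy to see.

def sbatch_cmd(script_path, job_name=None):
    """Submit a batch script; --parsable makes sbatch print only the job ID."""
    cmd = ["sbatch", "--parsable"]
    if job_name:
        cmd.append(f"--job-name={job_name}")
    cmd.append(script_path)
    return cmd

def squeue_cmd(user):
    """List the queued and running jobs of one user."""
    return ["squeue", f"--user={user}"]

def scancel_cmd(job_id):
    """Cancel a job by its ID."""
    return ["scancel", str(job_id)]

def sacct_cmd(job_id):
    """Query accounting data; this needs slurmdbd accounting to be configured."""
    return ["sacct", f"--jobs={job_id}", "--format=JobID,JobName,State,Elapsed"]

def run_if_available(cmd):
    """Run cmd only if the tool is on PATH; return its stdout, else None."""
    if shutil.which(cmd[0]) is None:
        return None
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout

if __name__ == "__main__":
    # Print the command lines instead of executing them, so the sketch
    # is safe to run on a machine without Slurm installed.
    for cmd in (
        sbatch_cmd("job.sh", job_name="demo"),
        squeue_cmd("alice"),
        scancel_cmd(12345),
        sacct_cmd(12345),
    ):
        print(" ".join(cmd))
```

On a login node with the Slurm client tools installed, passing one of these lists to `run_if_available` would execute the command and return its output; elsewhere it returns `None` without running anything.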
slurm_tutorial.1744043611.txt.gz · Last modified: 2025/04/07 19:33 by nshegunov
