Using SLURM

The Bolden cluster uses the SLURM workload manager for job scheduling. This article covers basic SLURM commands and the construction of simple job submission scripts. Below is a list of SLURM commands relevant to the average cluster user. Man pages exist for all SLURM daemons, commands, and API functions, and the --help option provides a brief summary of options for each command. Note that command options are case sensitive (for example, -n and -N request different resources).

sacct is used to report job or job step accounting information about active or completed jobs (an example follows this list).
sbatch is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.
scancel is used to cancel a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step.
squeue reports the state of jobs or job steps. It has a wide variety of filtering, sorting, and formatting options. By default, it reports the running jobs in priority order and then the pending jobs in priority order.
srun is used to submit a job for execution or to initiate job steps in real time. srun has a wide variety of options to specify resource requirements, including minimum and maximum node count, processor count, and specific nodes to use.
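As an example of the accounting command, here is a hedged sketch of querying a finished job; the job ID is a placeholder, and the fields available depend on your site's accounting configuration:

——————————

# Report selected accounting fields for a single job
sacct -j 107136 --format=JobID,JobName,State,Elapsed,ExitCode

——————————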

Most Bolden users will only need a few simple SLURM commands to submit and monitor their jobs. Here is a sample SLURM job script:

——————————

#!/bin/sh
#SBATCH --job-name=JOBNAME
#SBATCH -n 2
#SBATCH -N 1
#SBATCH --output job%j.out
#SBATCH --error job%j.err
#SBATCH -p all

source /share/apps/Modules/3.2.10/init/modules.sh
module load modulename/version

cd /home/user/workingdirectory
./myExecutable

——————————
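The script above runs a serial executable. If your program is parallel (for example, an MPI build), a hedged variant of the final line would launch it through srun, which starts one copy of the program per requested task (two, with -n 2); myExecutable is the same placeholder used above:

——————————

srun ./myExecutable

——————————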

The #SBATCH directives used above are:

#SBATCH -n 2

Requests the number of cores (20 cores max per node).

#SBATCH -N 1

Requests the number of nodes (14 nodes max in the “all” queue).

#SBATCH --output job%j.out

Sends stdout to the named file; %j is replaced with the job ID.

#SBATCH --error job%j.err

Sends stderr to the named file.

#SBATCH -p all

Requests the “all” queue.
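Once the directives and commands are in place, you would submit the script with sbatch; the file name myjob.sh is a placeholder for whatever you named your script. sbatch responds with the job ID it assigned:

——————————

sbatch myjob.sh
Submitted batch job 107136

——————————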

Running jobs or compiling code on the Bolden headnode is strictly monitored, and any processes outside of file transfers or simple file maintenance will be killed. If you would like to work with your software interactively, you can request a session through the scheduler using:

srun --pty -p all bash -i

Note that once you have been moved into the interactive session, you will still need to load your modules, just as you did in the script, with module load <modulename/moduleversion>.
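Putting those steps together, here is a hedged sketch of a typical interactive session; the module name, directory, and executable are the same placeholders used in the batch script above:

——————————

# Request an interactive shell on a compute node in the “all” queue
srun --pty -p all bash -i

# Inside the session, set up the environment just as the batch script does
# (source the init script first if the module command is not already available)
source /share/apps/Modules/3.2.10/init/modules.sh
module load modulename/version

# Run the program interactively
cd /home/user/workingdirectory
./myExecutable

——————————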

Once you have submitted your job to the scheduler, you can check its status by using the squeue command:

JOBID   PARTITION  NAME         USER  ST  TIME  NODES  NODELIST(REASON)
107136  all        hello_world  root  R   0:26  1      node01

The above output shows that there is one job running, named hello_world, with job ID 107136. The job ID is a unique identifier used by many SLURM commands when an action must be taken on one particular job. For instance, to cancel the hello_world job, you would use scancel 107136. TIME is how long the job has been running so far. NODES is the number of nodes allocated to the job, while the NODELIST(REASON) column lists the nodes allocated to running jobs; for pending jobs, that column instead gives the reason the job is pending. Each job is assigned a priority depending on several parameters whose details are beyond the scope of this document.

There are many switches you can use to filter the output: by user (--user), by partition (--partition), by state (--state), etc.
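For instance, here are a couple of hedged examples of filtered queries; the partition name matches the “all” queue used above:

——————————

# Show only your own jobs
squeue --user=$USER

# Show pending jobs in the “all” partition
squeue --partition=all --state=PENDING

——————————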