Job Submission and Execution

Warning

Do not use the #SBATCH --exclusive parameter; it is already included in the Slurm configuration.

Use the #SBATCH --mem= parameter on qfat only. On cpu_ queues, whole nodes are allocated. Accelerated nodes (gpu_ queues) are each divided into eight parts, each with a corresponding share of memory.

Introduction

The Slurm workload manager is used to allocate and access the resources of the Karolina and Barbora clusters and of the Complementary systems.

A man page exists for all Slurm commands, as well as the --help command option, which provides a brief summary of options. Slurm documentation and man pages are also available online.

Getting Partition Information

Display the partitions/queues on the system:

$ sinfo -s
PARTITION    AVAIL  TIMELIMIT   NODES(A/I/O/T) NODELIST
qcpu*           up 2-00:00:00      1/191/0/192 cn[1-192]
qcpu_biz        up 2-00:00:00      1/191/0/192 cn[1-192]
qcpu_exp        up    1:00:00      1/191/0/192 cn[1-192]
qcpu_free       up   18:00:00      1/191/0/192 cn[1-192]
qcpu_long       up 6-00:00:00      1/191/0/192 cn[1-192]
qcpu_preempt    up   12:00:00      1/191/0/192 cn[1-192]
qgpu            up 2-00:00:00          0/8/0/8 cn[193-200]
qgpu_biz        up 2-00:00:00          0/8/0/8 cn[193-200]
qgpu_exp        up    1:00:00          0/8/0/8 cn[193-200]
qgpu_free       up   18:00:00          0/8/0/8 cn[193-200]
qgpu_preempt    up   12:00:00          0/8/0/8 cn[193-200]
qfat            up 2-00:00:00          0/1/0/1 cn201
qdgx            up 2-00:00:00          0/1/0/1 cn202
qviz            up    8:00:00          0/2/0/2 vizserv[1-2]

The NODES(A/I/O/T) column summarizes the node count per state, where A/I/O/T stands for allocated/idle/other/total. The example output is from the Barbora cluster.
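
To use these counts in a script, the A/I/O/T field can be split on its slashes. The following is a minimal sketch; the function name parse_aiot is hypothetical, and the sample value is taken from the qcpu row of the output above.

```shell
# Split a NODES(A/I/O/T) value such as "1/191/0/192" into labelled counts.
# parse_aiot is an illustrative helper, not part of Slurm.
parse_aiot() {
    echo "$1" | awk -F/ '{
        printf "allocated=%s idle=%s other=%s total=%s\n", $1, $2, $3, $4
    }'
}

# Value from the qcpu row in the example output:
parse_aiot "1/191/0/192"
```

On a live system, the same field could be fed in from sinfo -h -o "%P %F", where %F prints the node counts in A/I/O/T form.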

A graphical representation of cluster usage, partitions, nodes, and jobs is also available.

On the Karolina cluster:

  • all cpu queues/partitions provide full node allocation, i.e. whole nodes are allocated to the job
  • other queues/partitions (gpu, fat, viz) provide partial node allocation

See Karolina Slurm Specifics for details.

On the Barbora cluster, all queues/partitions provide full node allocation, i.e. whole nodes are allocated to the job.

On Complementary systems, only some queues/partitions provide full node allocation. See the Complementary systems documentation for details.

Running Interactive Jobs

Sometimes you may want to run a job interactively, for example to debug it by running commands one by one from the command line.

Run an interactive job in the qcpu_exp queue (one node and one task by default):

$ salloc -A PROJECT-ID -p qcpu_exp

Run an interactive job on four nodes with 128 tasks per node and a two-hour time limit (128 is the recommended value for the Karolina CPU partition, based on the node core count):

$ salloc -A PROJECT-ID -p qcpu -N 4 --ntasks-per-node 128 -t 2:00:00
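
The request above asks for 4 nodes with 128 tasks on each. As a quick check of the total number of tasks this implies (the variable names below are only illustrative):

```shell
# Total task count implied by -N 4 --ntasks-per-node 128.
nodes=4
tasks_per_node=128
total_tasks=$((nodes * tasks_per_node))
echo "total tasks: $total_tasks"
```

This prints "total tasks: 512".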

Run an interactive job with X11 forwarding:

$ salloc -A PROJECT-ID -p qcpu_exp --x11

To finish the interactive job, use the Ctrl+D (^D) control sequence.

Warning

Do not use srun to initiate interactive jobs; subsequent srun or mpirun invocations would block forever.

Running Batch Jobs

Batch jobs are the standard way of running jobs and utilizing HPC clusters.

Job Script

Create an example job script called script.sh with the following content:

#!/usr/bin/bash
#SBATCH --job-name MyJobName
#SBATCH --account PROJECT-ID
#SBATCH --partition qcpu
#SBATCH --nodes 4
#SBATCH --ntasks-per-node 128
#SBATCH --time 12:00:00

ml purge
ml OpenMPI/4.1.4-GCC-11.3.0

srun hostname | sort | uniq -c

The script will:

  • use the bash shell interpreter
  • use MyJobName as the job name
  • use project PROJECT-ID for job access and accounting
  • use the partition/queue qcpu
  • use 4 nodes
  • use 128 tasks per node (the value used by MPI)
  • set the job time limit to 12 hours
  • load the appropriate module
  • run the command; srun is Slurm's native way of executing MPI-enabled applications, and hostname is used here only for the sake of simplicity

The submit directory is used as the working directory of the submitted job, so there is no need to change directory in the job script. Alternatively, you can specify the job working directory using the sbatch --chdir option (short form -D).

Job Submit

Submit batch job:

$ cd my_work_dir
$ sbatch script.sh

A path to script.sh (relative or absolute) should be given if the job script is in a different location than the job working directory.

By default, job output is stored in a file called slurm-JOBID.out, which contains both the standard output and the error output of the job. This can be changed using the sbatch options --output (short form -o) and --error (short form -e).

Example output of the job:

     128 cn017.karolina.it4i.cz
     128 cn018.karolina.it4i.cz
     128 cn019.karolina.it4i.cz
     128 cn020.karolina.it4i.cz
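
The --output and --error options mentioned earlier can also be given as #SBATCH directives. The following is a minimal sketch, assuming the sbatch filename patterns %x (job name) and %j (job ID). The log_prefix variable is only illustrative, and its fallback values apply only when the script is run outside Slurm.

```shell
#!/usr/bin/bash
#SBATCH --job-name MyJobName
#SBATCH --account PROJECT-ID
#SBATCH --partition qcpu
#SBATCH --nodes 1
#SBATCH --time 0:10:00
#SBATCH --output %x-%j.out
#SBATCH --error %x-%j.err

# Rebuild the file name prefix from the job environment.
# The defaults after ":-" are placeholders used outside Slurm.
log_prefix="${SLURM_JOB_NAME:-MyJobName}-${SLURM_JOB_ID:-JOBID}"

echo "standard output goes to ${log_prefix}.out"
echo "error output goes to ${log_prefix}.err" >&2
```

With these directives, standard output and error output are written to two separate files instead of the single slurm-JOBID.out file.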

Job Environment Variables

Slurm provides useful information to the job via environment variables. The variables are available on all nodes allocated to the job when accessed via Slurm-supported means (srun, compatible mpirun).

List all Slurm variables:

$ set | grep ^SLURM

Commonly used variables are:

Variable name         Description                                 Example
SLURM_JOB_ID          job ID of the executing job                 593
SLURM_JOB_NODELIST    nodes allocated to the job                  cn[101-102]
SLURM_JOB_NUM_NODES   number of nodes allocated to the job        2
SLURM_STEP_NODELIST   nodes allocated to the job step             cn101
SLURM_STEP_NUM_NODES  number of nodes allocated to the job step   1
SLURM_JOB_PARTITION   name of the partition                       qcpu
SLURM_SUBMIT_DIR      submit directory                            /scratch/project/open-xx-yy/work

See relevant Slurm documentation for details.
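
As an illustration of how a job script might use these variables, the sketch below prints a one-line job summary. The function name job_summary is hypothetical, and the fallback values after ":-" are placeholders used only when a variable is unset (for example, outside a job).

```shell
# Print a short summary built from the Slurm job environment.
# job_summary is an illustrative helper, not a Slurm command.
job_summary() {
    echo "job=${SLURM_JOB_ID:-none} partition=${SLURM_JOB_PARTITION:-none} nodes=${SLURM_JOB_NUM_NODES:-0}"
}

job_summary
```

With the example values from the table above, the function would print "job=593 partition=qcpu nodes=2".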

Get job nodelist:

$ echo $SLURM_JOB_NODELIST
cn[101-102]

Expand the nodelist into a list of nodes (run inside a job, scontrol reads the nodelist from SLURM_JOB_NODELIST):

$ scontrol show hostnames
cn101
cn102
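
scontrol show hostnames is the supported way to expand a nodelist. Purely for illustration of what the expansion does, for example when processing a saved nodelist on a machine without Slurm, the sketch below expands the simple prefix[a-b,c] form with awk. The function expand_nodelist is hypothetical and handles only a single bracket group, not the full Slurm hostlist syntax.

```shell
# Expand a simple nodelist such as "cn[101-102]" into one host per line.
# Illustrative only: supports one bracket group with ranges and commas.
expand_nodelist() {
    echo "$1" | awk '{
        # No bracket group: print the name unchanged.
        if (match($0, /\[[0-9,-]+\]/) == 0) { print; next }
        prefix = substr($0, 1, RSTART - 1)
        body = substr($0, RSTART + 1, RLENGTH - 2)
        n = split(body, parts, ",")
        for (i = 1; i <= n; i++) {
            m = split(parts[i], range, "-")
            first = range[1]
            last = (m == 2) ? range[2] : range[1]
            # Keep the zero padding width of the first number.
            fmt = "%s%0" length(first) "d\n"
            for (k = first + 0; k <= last + 0; k++) printf fmt, prefix, k
        }
    }'
}

expand_nodelist "cn[101-102]"
```

For the nodelist shown above, this prints cn101 and cn102 on separate lines, matching the scontrol output.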

Job Management

Getting Job Information

Show all jobs on the system:

$ squeue

Show my jobs:

$ squeue --me
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               104   qcpu    interact    user   R       1:48      2 cn[101-102]

Show job details for a specific job:

$ scontrol show job JOBID

Show details of the currently executing job from within the job session:

$ scontrol show job $SLURM_JOB_ID

Show my jobs using a long output format which includes time limit:

$ squeue --me -l

Show my jobs in running state:

$ squeue --me -t running

Show my jobs in pending state:

$ squeue --me -t pending

Show jobs for a given project:

$ squeue -A PROJECT-ID

Job States

The most common job states are (in alphabetical order):

Code  Job State      Explanation
CA    CANCELLED      Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.
CD    COMPLETED      Job has terminated all processes on all nodes with an exit code of zero.
CG    COMPLETING     Job is in the process of completing. Some processes on some nodes may still be active.
F     FAILED         Job terminated with non-zero exit code or other failure condition.
NF    NODE_FAIL      Job terminated due to failure of one or more allocated nodes.
OOM   OUT_OF_MEMORY  Job experienced out of memory error.
PD    PENDING        Job is awaiting resource allocation.
PR    PREEMPTED      Job terminated due to preemption.
R     RUNNING        Job currently has an allocation.
RQ    REQUEUED       Completing job is being requeued.
SI    SIGNALING      Job is being signaled.
TO    TIMEOUT        Job terminated upon reaching its time limit.
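
When post-processing squeue output in a script, the short codes can be mapped back to the state names listed above. The following is a minimal sketch; state_name is a hypothetical helper that covers only the states in this table.

```shell
# Map a short job state code to its full state name.
# Covers only the states listed in the table above.
state_name() {
    case "$1" in
        CA)  echo CANCELLED ;;
        CD)  echo COMPLETED ;;
        CG)  echo COMPLETING ;;
        F)   echo FAILED ;;
        NF)  echo NODE_FAIL ;;
        OOM) echo OUT_OF_MEMORY ;;
        PD)  echo PENDING ;;
        PR)  echo PREEMPTED ;;
        R)   echo RUNNING ;;
        RQ)  echo REQUEUED ;;
        SI)  echo SIGNALING ;;
        TO)  echo TIMEOUT ;;
        *)   echo "UNKNOWN($1)" ;;
    esac
}

state_name PD
```

The codes themselves could be taken from squeue --me -h -o "%i %t", where %t prints the job state in its compact form.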

Modifying Jobs

In general:

$ scontrol update JobId=JOBID ATTR=VALUE

Modify a job's time limit:

$ scontrol update JobId=JOBID timelimit=4:00:00

Set or modify a job's comment:

$ scontrol update JobId=JOBID Comment='The best job ever'
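
Time limits such as 4:00:00 above, or 2-00:00:00 in the partition listing, use an hours:minutes:seconds form with an optional leading day count. To compare such values in a script, they can be converted to seconds. The sketch below is illustrative: timelimit_to_seconds is a hypothetical helper and accepts only the H:MM:SS and D-HH:MM:SS forms, not every time format Slurm supports.

```shell
# Convert a time limit in H:MM:SS or D-HH:MM:SS form to seconds.
# Illustrative helper; other Slurm time formats are rejected.
timelimit_to_seconds() {
    echo "$1" | awk '{
        days = 0
        t = $0
        # Split off the optional "D-" day prefix.
        if (index(t, "-") > 0) { split(t, d, "-"); days = d[1]; t = d[2] }
        n = split(t, p, ":")
        if (n != 3) exit 1
        print days * 86400 + p[1] * 3600 + p[2] * 60 + p[3]
    }'
}

timelimit_to_seconds "4:00:00"
timelimit_to_seconds "2-00:00:00"
```

The two calls print 14400 and 172800, i.e. four hours and two days expressed in seconds.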

Deleting Jobs

Delete a job by job ID:

$ scancel JOBID

Delete all my jobs:

$ scancel --me

Delete all my jobs in interactive mode, confirming every action:

$ scancel --me -i

Delete all my running jobs:

$ scancel --me -t running

Delete all my pending jobs:

$ scancel --me -t pending

Delete all my pending jobs for project PROJECT-ID:

$ scancel --me -t pending -A PROJECT-ID

Troubleshooting

Invalid Account

sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

Possible causes:

  • An invalid account (i.e. project) was specified in the job submission.
  • The user does not have access to the given account/project.
  • The given account/project does not have access to the given partition.
  • Access to the given partition was revoked because the project's allocation is exhausted.