
Complementary System Job Scheduling

Introduction

Slurm workload manager is used to allocate and access Complementary systems resources.

Getting Partition Information

Display partitions/queues

$ sinfo -s
PARTITION AVAIL  TIMELIMIT   NODES(A/I/O/T) NODELIST
p00-arm      up 1-00:00:00          0/1/0/1 p00-arm01
p01-arm*     up 1-00:00:00          0/8/0/8 p01-arm[01-08]
p02-intel    up 1-00:00:00          0/2/0/2 p02-intel[01-02]
p03-amd      up 1-00:00:00          0/2/0/2 p03-amd[01-02]
p04-edge     up 1-00:00:00          0/1/0/1 p04-edge01
p05-synt     up 1-00:00:00          0/1/0/1 p05-synt01
p06-arm      up 1-00:00:00          0/2/0/2 p06-arm[01-02]
p07-power    up 1-00:00:00          0/1/0/1 p07-power01
p08-amd      up 1-00:00:00          0/1/0/1 p08-amd01
p10-intel    up 1-00:00:00          0/1/0/1 p10-intel01

Getting Job Information

Show jobs

$ squeue --me
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               104   p01-arm interact    user   R       1:48      2 p01-arm[01-02]

Show job details for a specific job

$ scontrol -d show job JOBID

Show details of the currently executing job from within the job session

$ scontrol -d show job $SLURM_JOBID

Running Interactive Jobs

Run interactive job

 $ salloc -A PROJECT-ID -p p01-arm

Run interactive job, with X11 forwarding

 $ salloc -A PROJECT-ID -p p01-arm --x11

Do not use srun to initiate interactive jobs; subsequent srun or mpirun invocations would block forever.
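
To launch parallel tasks inside an interactive allocation, create the allocation with salloc first and then call srun (or a compatible mpirun) from within it, for example:

 $ salloc -A PROJECT-ID -p p01-arm -N 2
 $ srun hostname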

Running Batch Jobs

Run batch job

 $ sbatch -A PROJECT-ID -p p01-arm ./script.sh
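
A minimal script.sh might look like the following sketch; the job name, requested resources, and workload are placeholders to adjust to your needs:

#!/usr/bin/env bash
#SBATCH --job-name=example
#SBATCH --nodes=1
#SBATCH --time=02:00:00

# print the hostnames of the nodes allocated to the job
srun hostname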

Useful command options (salloc, sbatch, srun)

  • -n, --ntasks
  • -c, --cpus-per-task
  • -N, --nodes
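
For example, to request 2 nodes with 8 tasks in total and 4 CPUs per task (the values and partition are illustrative):

 $ salloc -A PROJECT-ID -p p01-arm -N 2 -n 8 -c 4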

Slurm Job Environment Variables

Slurm provides useful information to the job via environment variables. These variables are available on all nodes allocated to the job when accessed via Slurm-supported means (srun, compatible mpirun).

See all Slurm variables

set | grep ^SLURM

Useful Variables

variable name          description                                 example
SLURM_JOB_ID           job ID of the executing job                 593
SLURM_JOB_NODELIST     nodes allocated to the job                  p03-amd[01-02]
SLURM_JOB_NUM_NODES    number of nodes allocated to the job        2
SLURM_STEP_NODELIST    nodes allocated to the job step             p03-amd01
SLURM_STEP_NUM_NODES   number of nodes allocated to the job step   1
SLURM_JOB_PARTITION    name of the partition                       p03-amd
SLURM_SUBMIT_DIR       submit directory                            /scratch/project/open-xx-yy/work

See Slurm srun documentation for details.

Get job nodelist

$ echo $SLURM_JOB_NODELIST
p03-amd[01-02]

Expand the nodelist to a list of nodes.

$ scontrol show hostnames $SLURM_JOB_NODELIST
p03-amd01
p03-amd02
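
In a job script, the expanded nodelist can be iterated over, for example:

for node in $(scontrol show hostnames $SLURM_JOB_NODELIST); do
    echo "allocated node: $node"
done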

Modifying Jobs

$ scontrol update JobId=JOBID ATTR=VALUE

For example:

$ scontrol update JobId=JOBID Comment='The best job ever'
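
Other job attributes can be changed the same way, for example lowering the time limit of a job (TimeLimit is a standard Slurm job attribute):

$ scontrol update JobId=JOBID TimeLimit=4:00:00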

Deleting Jobs

$ scancel JOBID

Partitions

PARTITION   nodes   whole node   cores per node   features
p00-arm     1       yes          64               aarch64,cortex-a72
p01-arm     8       yes          48               aarch64,a64fx,ib
p02-intel   2       no           64               x86_64,intel,icelake,ib,fpga,bitware,nvdimm
p03-amd     2       no           64               x86_64,amd,milan,ib,gpu,mi100,fpga,xilinx
p04-edge    1       yes          16               x86_64,intel,broadwell,ib
p05-synt    1       yes          8                x86_64,amd,milan,ib,ht
p06-arm     2       yes          80               aarch64,ib
p07-power   1       yes          192              ppc64le,ib
p08-amd     1       yes          128              x86_64,amd,milan-x,ib,ht
p10-intel   1       yes          96               x86_64,intel,sapphire_rapids,ht

Use the -t, --time option to specify the job run time limit. The default job time limit is 2 hours; the maximum job time limit is 24 hours.
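
For example, to request a 12-hour run time limit (the partition is illustrative):

 $ salloc -A PROJECT-ID -p p01-arm -t 12:00:00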

FIFO scheduling with backfilling is employed.

Partition 00 - ARM (Cortex-A72)

Whole node allocation.

One node:

salloc -A PROJECT-ID -p p00-arm

Partition 01 - ARM (A64FX)

Whole node allocation.

One node:

salloc -A PROJECT-ID -p p01-arm
salloc -A PROJECT-ID -p p01-arm -N 1

Multiple nodes:

salloc -A PROJECT-ID -p p01-arm -N 8

Partition 02 - Intel (Ice Lake, NVDIMMs + Bitware FPGAs)

FPGAs are treated as resources. See below for more details about resources.

Partial allocation is possible, per FPGA. Resource separation is not enforced, so use only the FPGAs allocated to the job!

One FPGA:

salloc -A PROJECT-ID -p p02-intel --gres=fpga

Two FPGAs on the same node:

salloc -A PROJECT-ID -p p02-intel --gres=fpga:2

All FPGAs:

salloc -A PROJECT-ID -p p02-intel -N 2 --gres=fpga:2

Partition 03 - AMD (Milan, MI100 GPUs + Xilinx FPGAs)

GPUs and FPGAs are treated as resources. See below for more details about resources.

Partial allocation is possible, per GPU and per FPGA. Resource separation is not enforced, so use only the GPUs and FPGAs allocated to the job!

One GPU:

salloc -A PROJECT-ID -p p03-amd --gres=gpu

Two GPUs on the same node:

salloc -A PROJECT-ID -p p03-amd --gres=gpu:2

Four GPUs on the same node:

salloc -A PROJECT-ID -p p03-amd --gres=gpu:4

All GPUs:

salloc -A PROJECT-ID -p p03-amd -N 2 --gres=gpu:4

One FPGA:

salloc -A PROJECT-ID -p p03-amd --gres=fpga

Two FPGAs:

salloc -A PROJECT-ID -p p03-amd --gres=fpga:2

All FPGAs:

salloc -A PROJECT-ID -p p03-amd -N 2 --gres=fpga:2

One GPU and one FPGA on the same node:

salloc -A PROJECT-ID -p p03-amd --gres=gpu,fpga

Four GPUs and two FPGAs on the same node:

salloc -A PROJECT-ID -p p03-amd --gres=gpu:4,fpga:2

All GPUs and FPGAs:

salloc -A PROJECT-ID -p p03-amd -N 2 --gres=gpu:4,fpga:2

Partition 04 - Edge Server

Whole node allocation:

salloc -A PROJECT-ID -p p04-edge

Partition 05 - FPGA Synthesis Server

Whole node allocation:

salloc -A PROJECT-ID -p p05-synt

Partition 06 - ARM

Whole node allocation:

salloc -A PROJECT-ID -p p06-arm

Partition 07 - IBM Power

Whole node allocation:

salloc -A PROJECT-ID -p p07-power

Partition 08 - AMD Milan-X

Whole node allocation:

salloc -A PROJECT-ID -p p08-amd

Partition 10 - Intel Sapphire Rapids

Whole node allocation:

salloc -A PROJECT-ID -p p10-intel

Features

Nodes have feature tags assigned to them. Users can select nodes based on the feature tags using the --constraint option.

Feature           Description
aarch64           platform
x86_64            platform
ppc64le           platform
amd               manufacturer
intel             manufacturer
icelake           processor family
broadwell         processor family
sapphire_rapids   processor family
milan             processor family
milan-x           processor family
ib                InfiniBand
gpu               equipped with GPU
fpga              equipped with FPGA
nvdimm            equipped with NVDIMMs
ht                hyperthreading enabled
noht              hyperthreading disabled

$ sinfo -o '%16N %f'
NODELIST         AVAIL_FEATURES
p00-arm01        aarch64,cortex-a72
p01-arm[01-08]   aarch64,a64fx,ib
p02-intel01      x86_64,intel,icelake,ib,fpga,bitware,nvdimm,ht
p02-intel02      x86_64,intel,icelake,ib,fpga,bitware,nvdimm,noht
p03-amd02        x86_64,amd,milan,ib,gpu,mi100,fpga,xilinx,noht
p03-amd01        x86_64,amd,milan,ib,gpu,mi100,fpga,xilinx,ht
p04-edge01       x86_64,intel,broadwell,ib,ht
p05-synt01       x86_64,amd,milan,ib,ht
p06-arm[01-02]   aarch64,ib
p07-power01      ppc64le,ib
p08-amd01        x86_64,amd,milan-x,ib,ht
p10-intel01      x86_64,intel,sapphire_rapids,ht

$ salloc -A PROJECT-ID -p p02-intel --constraint noht

$ scontrol -d show node p02-intel02 | grep ActiveFeatures
   ActiveFeatures=x86_64,intel,icelake,ib,fpga,bitware,nvdimm,noht

Resources, GRES

Slurm supports defining and scheduling arbitrary resources, called Generic RESources (GRES) in Slurm's terminology. We use GRES for scheduling and allocating GPUs and FPGAs.

Use only allocated GPUs and FPGAs. Resource separation is not enforced; if you use resources that are not allocated to your job, you may observe strange behavior and run into trouble.

Node Resources

Get information about GRES on a node.

$ scontrol -d show node p02-intel01 | grep Gres=
   Gres=fpga:bitware_520n_mx:2
$ scontrol -d show node p02-intel02 | grep Gres=
   Gres=fpga:bitware_520n_mx:2
$ scontrol -d show node p03-amd01 | grep Gres=
   Gres=gpu:amd_mi100:4,fpga:xilinx_alveo_u250:2
$ scontrol -d show node p03-amd02 | grep Gres=
   Gres=gpu:amd_mi100:4,fpga:xilinx_alveo_u280:2

Request Resources

To allocate the required resources (GPUs or FPGAs), use the --gres option of salloc/srun.

Example: Allocate one FPGA

$ salloc -A PROJECT-ID -p p03-amd --gres fpga:1

Find Out Allocated Resources

Information about allocated resources is available in the Slurm job details, in the JOB_GRES and GRES attributes.

$ scontrol -d show job $SLURM_JOBID | grep GRES=
   JOB_GRES=fpga:xilinx_alveo_u250:1
     Nodes=p03-amd01 CPU_IDs=0-1 Mem=0 GRES=fpga:xilinx_alveo_u250:1(IDX:0)

IDX in the GRES attribute specifies the index(es) of the FPGA(s) (or GPUs) allocated to the job on the node. In the example above, the allocated resource is fpga:xilinx_alveo_u250:1(IDX:0), so we should use the FPGA with index 0 on node p03-amd01.
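
As an illustration only, for a ROCm-based application on the MI100 GPUs the allocated GPU index could be exported via the ROCR_VISIBLE_DEVICES environment variable so that the application sees only the allocated device (this is an assumption about the application runtime, not a scheduler mechanism):

$ export ROCR_VISIBLE_DEVICES=0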

Request Specific Resources

It is possible to allocate specific resources. This is useful for partition p03-amd, which is equipped with FPGAs of different types.

A GRES entry uses the format “name[[:type]:count]”. In the following example, the name is fpga, the type is xilinx_alveo_u280, and the count is 2.

$ salloc -A PROJECT-ID -p p03-amd --gres=fpga:xilinx_alveo_u280:2
salloc: Granted job allocation XXX
salloc: Waiting for resource configuration
salloc: Nodes p03-amd02 are ready for job

$ scontrol -d show job $SLURM_JOBID | grep -i gres
   JOB_GRES=fpga:xilinx_alveo_u280:2
     Nodes=p03-amd02 CPU_IDs=0 Mem=0 GRES=fpga:xilinx_alveo_u280(IDX:0-1)
   TresPerNode=gres:fpga:xilinx_alveo_u280:2