Complementary System Job Scheduling

Introduction

The Slurm workload manager is used to allocate and access Complementary systems resources.

Display partitions/queues

$ sinfo -s
PARTITION AVAIL  TIMELIMIT   NODES(A/I/O/T) NODELIST
p00-arm      up 1-00:00:00          0/1/0/1 p00-arm01
p01-arm*     up 1-00:00:00          0/8/0/8 p01-arm[01-08]
p02-intel    up 1-00:00:00          0/2/0/2 p02-intel[01-02]
p03-amd      up 1-00:00:00          0/2/0/2 p03-amd[01-02]
p04-edge     up 1-00:00:00          0/1/0/1 p04-edge01
p05-synt     up 1-00:00:00          0/1/0/1 p05-synt01

Show jobs

$ squeue --me
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               104   p01-arm interact    user   R       1:48      2 p01-arm[01-02]

Show job details

$ scontrol show job 104

Run interactive job

 $ salloc -A PROJECT-ID -p p01-arm

Run interactive job, with X11 forwarding

 $ salloc -A PROJECT-ID -p p01-arm --x11

Warning

Do not use srun to initiate interactive jobs; subsequent srun and mpirun invocations will block forever.
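
To launch job steps inside an allocation obtained with salloc, use srun as usual. A minimal sketch (the node count is illustrative):

 $ salloc -A PROJECT-ID -p p01-arm -N 2
 $ srun hostname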

Run batch job

 $ sbatch -A PROJECT-ID -p p01-arm ../script.sh
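
The submitted script is an ordinary shell script; Slurm options may be given on the command line (as above) or as #SBATCH directives inside the script. A minimal sketch of what the script might contain (job name and commands are placeholders):

#!/bin/bash
#SBATCH --job-name=example
#SBATCH --time=02:00:00

# commands below run on the first allocated node
hostname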

Useful command options (salloc, sbatch, srun)

  • -n, --ntasks
  • -c, --cpus-per-task
  • -N, --nodes

PARTITION  nodes  cores per node
p00-arm    1      64
p01-arm    8      48
p02-intel  2      64
p03-amd    2      64
p04-edge   1      16
p05-synt   1      8

Use the -t, --time option to specify the job run time limit. The default job time limit is 2 hours; the maximum job time limit is 24 hours.
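
For example, to request 2 nodes with 96 tasks in total and a 4-hour time limit on the p01-arm partition (values are illustrative):

 $ sbatch -A PROJECT-ID -p p01-arm -N 2 -n 96 -t 04:00:00 ./script.sh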

FIFO scheduling with backfilling is employed.
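
Pending jobs therefore start in submission order, but shorter/smaller jobs may be backfilled into scheduling gaps. Expected start times of your pending jobs can be checked with:

$ squeue --me --start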

Partition 00 - ARM (Cortex-A72)

Whole node allocation.

One node:

sbatch -A PROJECT-ID -p p00-arm ./script.sh

Partition 01 - ARM (A64FX)

Whole node allocation.

One node:

sbatch -A PROJECT-ID -p p01-arm ./script.sh
sbatch -A PROJECT-ID -p p01-arm -N 1 ./script.sh

Multiple nodes:

sbatch -A PROJECT-ID -p p01-arm -N 8 ./script.sh
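
A minimal sketch of a multi-node script.sh for this partition (the application name mpi_app is a placeholder; p01-arm nodes have 48 cores each):

#!/bin/bash
#SBATCH --ntasks-per-node=48

# launch one MPI rank per core on all allocated nodes
srun ./mpi_app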

Partition 02 - Intel (Ice Lake, NVDIMMs + Bitware FPGAs)

FPGAs are treated as resources. See below for more details about resources.

Partial allocation is possible, per FPGA; resource separation is not enforced.

One FPGA:

sbatch -A PROJECT-ID -p p02-intel --gres=fpga ./script.sh

Two FPGAs on the same node:

sbatch -A PROJECT-ID -p p02-intel --gres=fpga:2 ./script.sh

All FPGAs:

sbatch -A PROJECT-ID -p p02-intel -N 2 --gres=fpga:2 ./script.sh

Partition 03 - AMD (Milan, MI100 GPUs + Xilinx FPGAs)

GPGPUs and FPGAs are treated as resources. See below for more details about resources.

Partial allocation is possible, per GPGPU and per FPGA; resource separation is not enforced.

One GPU:

sbatch -A PROJECT-ID -p p03-amd --gres=gpgpu ./script.sh

Two GPUs on the same node:

sbatch -A PROJECT-ID -p p03-amd --gres=gpgpu:2 ./script.sh

Four GPUs on the same node:

sbatch -A PROJECT-ID -p p03-amd --gres=gpgpu:4 ./script.sh

All GPUs:

sbatch -A PROJECT-ID -p p03-amd -N 2 --gres=gpgpu:4 ./script.sh

One FPGA:

sbatch -A PROJECT-ID -p p03-amd --gres=fpga ./script.sh

Two FPGAs:

sbatch -A PROJECT-ID -p p03-amd --gres=fpga:2 ./script.sh

All FPGAs:

sbatch -A PROJECT-ID -p p03-amd -N 2 --gres=fpga:2 ./script.sh

One GPU and one FPGA on the same node:

sbatch -A PROJECT-ID -p p03-amd --gres=gpgpu,fpga ./script.sh

Four GPUs and two FPGAs on the same node:

sbatch -A PROJECT-ID -p p03-amd --gres=gpgpu:4,fpga:2 ./script.sh

All GPUs and FPGAs:

sbatch -A PROJECT-ID -p p03-amd -N 2 --gres=gpgpu:4,fpga:2 ./script.sh

Partition 04 - Edge Server

Whole node allocation:

sbatch -A PROJECT-ID -p p04-edge ./script.sh

Partition 05 - FPGA Synthesis Server

Whole node allocation:

sbatch -A PROJECT-ID -p p05-synt ./script.sh

Features

Nodes have feature tags assigned to them. Users can select nodes based on the feature tags using the --constraint option.

Feature    Description
aarch64    platform
x86_64     platform
amd        manufacturer
intel      manufacturer
icelake    processor family
broadwell  processor family
milan      processor family
ib         InfiniBand
gpgpu      equipped with GPGPU
fpga       equipped with FPGA
nvdimm     equipped with NVDIMMs
ht         hyperthreading enabled
noht       hyperthreading disabled

$ sinfo -o '%16N %f'
NODELIST         AVAIL_FEATURES
p00-arm01        aarch64,cortex-a72
p01-arm[01-08]   aarch64,a64fx,ib
p02-intel01      x86_64,intel,icelake,ib,fpga,bitware,nvdimm,ht
p02-intel02      x86_64,intel,icelake,ib,fpga,bitware,nvdimm,noht
p03-amd01        x86_64,amd,milan,ib,gpgpu,mi100,fpga,xilinx,ht
p03-amd02        x86_64,amd,milan,ib,gpgpu,mi100,fpga,xilinx,noht
p04-edge01       x86_64,intel,broadwell,ib,ht
p05-synt01       x86_64,amd,milan,ib,ht

$ salloc -A PROJECT-ID -p p02-intel --constraint noht

$ scontrol -d show node p02-intel02 | grep ActiveFeatures
   ActiveFeatures=x86_64,intel,icelake,ib,fpga,bitware,nvdimm,noht
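
Constraints can also be combined with the & (AND) operator. For example, to request a node equipped with MI100 GPUs and with hyperthreading disabled (p03-amd02 in the listing above):

$ salloc -A PROJECT-ID -p p03-amd --constraint="mi100&noht"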

Resources

Slurm supports the definition and scheduling of arbitrary resources - Generic RESources (GRES) in Slurm terminology. We use GRES for scheduling and allocating GPGPUs and FPGAs.

Get information about GRES on a node:

$ scontrol -d show node p03-amd01 | grep Gres=
   Gres=gpgpu:amd_mi100:4,fpga:xilinx_alveo_u250:2
$ scontrol -d show node p03-amd02 | grep Gres=
   Gres=gpgpu:amd_mi100:4,fpga:xilinx_alveo_u280:2

Request a specific GRES. A GRES entry uses the format "name[[:type]:count]". In the following example, name is fpga, type is xilinx_alveo_u280, and count is 2.

$ salloc -A PROJECT-ID -p p03-amd --gres=fpga:xilinx_alveo_u280:2
salloc: Granted job allocation XXX
salloc: Waiting for resource configuration
salloc: Nodes p03-amd02 are ready for job

$ scontrol -d show job $SLURM_JOBID | grep -i gres
   JOB_GRES=fpga:xilinx_alveo_u280:2
     Nodes=p03-amd02 CPU_IDs=0 Mem=0 GRES=fpga:xilinx_alveo_u280(CNT:2)
   TresPerNode=gres:fpga:xilinx_alveo_u280:2
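
The same syntax selects GPUs by type. For example, to request two MI100 GPUs (as listed in the node GRES above):

$ salloc -A PROJECT-ID -p p03-amd --gres=gpgpu:amd_mi100:2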
