Complementary System Job Scheduling¶
Introduction¶
The Slurm workload manager is used to allocate and access Complementary systems resources.
Getting Partition Information¶
Display partitions/queues
$ sinfo -s
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
p00-arm up 1-00:00:00 0/1/0/1 p00-arm01
p01-arm* up 1-00:00:00 0/8/0/8 p01-arm[01-08]
p02-intel up 1-00:00:00 0/2/0/2 p02-intel[01-02]
p03-amd up 1-00:00:00 0/2/0/2 p03-amd[01-02]
p04-edge up 1-00:00:00 0/1/0/1 p04-edge01
p05-synt up 1-00:00:00 0/1/0/1 p05-synt01
p06-arm up 1-00:00:00 0/2/0/2 p06-arm[01-02]
p07-power up 1-00:00:00 0/1/0/1 p07-power01
p08-amd up 1-00:00:00 0/1/0/1 p08-amd01
p10-intel up 1-00:00:00 0/1/0/1 p10-intel01
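To inspect a single partition in node-oriented form, sinfo can be restricted with -p and switched to per-node output with -N, for example:
$ sinfo -N -p p03-amd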
Getting Job Information¶
Show jobs
$ squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
104 p01-arm interact user R 1:48 2 p01-arm[01-02]
Show job details for a specific job
$ scontrol -d show job JOBID
Show job details for the executing job from within the job session
$ scontrol -d show job $SLURM_JOBID
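To watch only selected fields, the scontrol output can be filtered with grep; for example, to show the elapsed run time and the time limit of the executing job (RunTime and TimeLimit are standard fields of the scontrol output):
$ scontrol show job $SLURM_JOBID | grep -Eo 'RunTime=[^ ]+|TimeLimit=[^ ]+'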
Running Interactive Jobs¶
Run interactive job
$ salloc -A PROJECT-ID -p p01-arm
Run interactive job with X11 forwarding
$ salloc -A PROJECT-ID -p p01-arm --x11
Warning
Do not use srun for initiating interactive jobs; subsequent srun and mpirun invocations would block forever.
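A typical interactive session therefore starts with salloc and launches parallel steps with srun inside the allocation; a minimal sketch (PROJECT-ID is a placeholder):
$ salloc -A PROJECT-ID -p p01-arm -N 2
$ srun hostname   # runs once on each allocated node
$ exit            # release the allocation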
Running Batch Jobs¶
Run batch job
$ sbatch -A PROJECT-ID -p p01-arm ./script.sh
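A minimal sketch of what script.sh might contain, assuming it is submitted with the command above (the directive values are illustrative; account and partition are already given on the sbatch command line):
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=48
#SBATCH --time=02:00:00

# print the allocated nodes, then run one task per allocated core
echo "Running on: $SLURM_JOB_NODELIST"
srun hostname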
Useful command options (salloc, sbatch, srun); a combined usage example follows the list:
- -n, --ntasks: total number of tasks
- -c, --cpus-per-task: number of CPUs per task
- -N, --nodes: number of nodes
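For example, the options above can be combined to request two nodes with eight tasks in total and six CPUs per task (illustrative numbers for the 48-core p01-arm nodes):
$ salloc -A PROJECT-ID -p p01-arm -N 2 -n 8 -c 6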
Slurm Job Environment Variables¶
Slurm provides useful information to the job via environment variables. These variables are available on all nodes allocated to the job when accessed via Slurm-supported means (srun, compatible mpirun).
See all Slurm variables
set | grep ^SLURM
Useful Variables¶
variable name | description | example |
---|---|---|
SLURM_JOB_ID | job id of the executing job | 593 |
SLURM_JOB_NODELIST | nodes allocated to the job | p03-amd[01-02] |
SLURM_JOB_NUM_NODES | number of nodes allocated to the job | 2 |
SLURM_STEP_NODELIST | nodes allocated to the job step | p03-amd01 |
SLURM_STEP_NUM_NODES | number of nodes allocated to the job step | 1 |
SLURM_JOB_PARTITION | name of the partition | p03-amd |
SLURM_SUBMIT_DIR | submit directory | /scratch/project/open-xx-yy/work |
See Slurm srun documentation for details.
Get job nodelist
$ echo $SLURM_JOB_NODELIST
p03-amd[01-02]
Expand nodelist to list of nodes.
$ scontrol show hostnames $SLURM_JOB_NODELIST
p03-amd01
p03-amd02
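The expanded list can be used, for example, to build a machine file or to loop over the allocated nodes from within a job session (the file name is illustrative):
# one hostname per line into a machine file
scontrol show hostnames $SLURM_JOB_NODELIST > machinefile.$SLURM_JOBID
# or iterate over the allocated nodes
for node in $(scontrol show hostnames $SLURM_JOB_NODELIST); do
    echo "allocated node: $node"
done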
Modifying Jobs¶
$ scontrol update JobId=JOBID ATTR=VALUE
for example
$ scontrol update JobId=JOBID Comment='The best job ever'
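Another common modification is lowering the time limit of a queued or running job (regular users can typically reduce TimeLimit, while raising it requires operator privileges):
$ scontrol update JobId=JOBID TimeLimit=04:00:00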
Deleting Jobs¶
$ scancel JOBID
Partitions¶
PARTITION | nodes | whole node | cores per node | features |
---|---|---|---|---|
p00-arm | 1 | yes | 64 | aarch64,cortex-a72 |
p01-arm | 8 | yes | 48 | aarch64,a64fx,ib |
p02-intel | 2 | no | 64 | x86_64,intel,icelake,ib,fpga,bitware,nvdimm |
p03-amd | 2 | no | 64 | x86_64,amd,milan,ib,gpu,mi100,fpga,xilinx |
p04-edge | 1 | yes | 16 | x86_64,intel,broadwell,ib |
p05-synt | 1 | yes | 8 | x86_64,amd,milan,ib,ht |
p06-arm | 2 | yes | 80 | aarch64,ib |
p07-power | 1 | yes | 192 | ppc64le,ib |
p08-amd | 1 | yes | 128 | x86_64,amd,milan-x,ib,ht |
p10-intel | 1 | yes | 96 | x86_64,intel,sapphire_rapids,ht |
Use the -t, --time option to specify the job run time limit. The default job time limit is 2 hours; the maximum job time limit is 24 hours.
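For example, to request an eight-hour limit instead of the two-hour default:
$ salloc -A PROJECT-ID -p p01-arm -t 08:00:00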
FIFO scheduling with backfilling is employed.
Partition 00 - ARM (Cortex-A72)¶
Whole node allocation.
One node:
salloc -A PROJECT-ID -p p00-arm
Partition 01 - ARM (A64FX)¶
Whole node allocation.
One node:
salloc -A PROJECT-ID -p p01-arm
salloc -A PROJECT-ID -p p01-arm -N 1
Multiple nodes:
salloc -A PROJECT-ID -p p01-arm -N 8
Partition 02 - Intel (Ice Lake, NVDIMMs + Bitware FPGAs)¶
FPGAs are treated as resources. See below for more details about resources.
Partial allocation is possible (per FPGA); resource separation is not enforced. Use only the FPGAs allocated to the job!
One FPGA:
salloc -A PROJECT-ID -p p02-intel --gres=fpga
Two FPGAs on the same node:
salloc -A PROJECT-ID -p p02-intel --gres=fpga:2
All FPGAs:
salloc -A PROJECT-ID -p p02-intel -N 2 --gres=fpga:2
Partition 03 - AMD (Milan, MI100 GPUs + Xilinx FPGAs)¶
GPUs and FPGAs are treated as resources. See below for more details about resources.
Partial allocation is possible (per GPU and per FPGA); resource separation is not enforced. Use only the GPUs and FPGAs allocated to the job!
One GPU:
salloc -A PROJECT-ID -p p03-amd --gres=gpu
Two GPUs on the same node:
salloc -A PROJECT-ID -p p03-amd --gres=gpu:2
Four GPUs on the same node:
salloc -A PROJECT-ID -p p03-amd --gres=gpu:4
All GPUs:
salloc -A PROJECT-ID -p p03-amd -N 2 --gres=gpu:4
One FPGA:
salloc -A PROJECT-ID -p p03-amd --gres=fpga
Two FPGAs:
salloc -A PROJECT-ID -p p03-amd --gres=fpga:2
All FPGAs:
salloc -A PROJECT-ID -p p03-amd -N 2 --gres=fpga:2
One GPU and one FPGA on the same node:
salloc -A PROJECT-ID -p p03-amd --gres=gpu,fpga
Four GPUs and two FPGAs on the same node:
salloc -A PROJECT-ID -p p03-amd --gres=gpu:4,fpga:2
All GPUs and FPGAs:
salloc -A PROJECT-ID -p p03-amd -N 2 --gres=gpu:4,fpga:2
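The same --gres requests work for batch jobs as well; an illustrative single-GPU submission (the script name is a placeholder):
$ sbatch -A PROJECT-ID -p p03-amd --gres=gpu:1 ./gpu_job.sh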
Partition 04 - Edge Server¶
Whole node allocation:
salloc -A PROJECT-ID -p p04-edge
Partition 05 - FPGA Synthesis Server¶
Whole node allocation:
salloc -A PROJECT-ID -p p05-synt
Partition 06 - ARM¶
Whole node allocation:
salloc -A PROJECT-ID -p p06-arm
Partition 07 - IBM Power¶
Whole node allocation:
salloc -A PROJECT-ID -p p07-power
Partition 08 - AMD Milan-X¶
Whole node allocation:
salloc -A PROJECT-ID -p p08-amd
Partition 10 - Intel Sapphire Rapids¶
Whole node allocation:
salloc -A PROJECT-ID -p p10-intel
Features¶
Nodes have feature tags assigned to them. Users can select nodes based on the feature tags using the --constraint option.
Feature | Description |
---|---|
aarch64 | platform |
x86_64 | platform |
ppc64le | platform |
amd | manufacturer |
intel | manufacturer |
icelake | processor family |
broadwell | processor family |
sapphire_rapids | processor family |
milan | processor family |
milan-x | processor family |
ib | Infiniband |
gpu | equipped with GPU |
fpga | equipped with FPGA |
nvdimm | equipped with NVDIMMs |
ht | Hyperthreading enabled |
noht | Hyperthreading disabled |
$ sinfo -o '%16N %f'
NODELIST AVAIL_FEATURES
p00-arm01 aarch64,cortex-a72
p01-arm[01-08] aarch64,a64fx,ib
p02-intel01 x86_64,intel,icelake,ib,fpga,bitware,nvdimm,ht
p02-intel02 x86_64,intel,icelake,ib,fpga,bitware,nvdimm,noht
p03-amd02 x86_64,amd,milan,ib,gpu,mi100,fpga,xilinx,noht
p03-amd01 x86_64,amd,milan,ib,gpu,mi100,fpga,xilinx,ht
p04-edge01 x86_64,intel,broadwell,ib,ht
p05-synt01 x86_64,amd,milan,ib,ht
p06-arm[01-02] aarch64,ib
p07-power01 ppc64le,ib
p08-amd01 x86_64,amd,milan-x,ib,ht
p10-intel01 x86_64,intel,sapphire_rapids,ht
$ salloc -A PROJECT-ID -p p02-intel --constraint noht
$ scontrol -d show node p02-intel02 | grep ActiveFeatures
ActiveFeatures=x86_64,intel,icelake,ib,fpga,bitware,nvdimm,noht
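Constraints can be combined using the standard Slurm operators, e.g. AND (&); for instance, to request an Ice Lake node with hyperthreading disabled (quotes protect the & from the shell):
$ salloc -A PROJECT-ID -p p02-intel --constraint "icelake&noht"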
Resources, GRES¶
Slurm supports the definition and scheduling of arbitrary resources, called Generic RESources (GRES) in Slurm's terminology. We use GRES for scheduling/allocating GPUs and FPGAs.
Warning
Use only allocated GPUs and FPGAs. Resource separation is not enforced. If you use non-allocated resources, you may observe strange behavior and run into trouble.
Node Resources¶
Get information about GRES on a node.
$ scontrol -d show node p02-intel01 | grep Gres=
Gres=fpga:bitware_520n_mx:2
$ scontrol -d show node p02-intel02 | grep Gres=
Gres=fpga:bitware_520n_mx:2
$ scontrol -d show node p03-amd01 | grep Gres=
Gres=gpu:amd_mi100:4,fpga:xilinx_alveo_u250:2
$ scontrol -d show node p03-amd02 | grep Gres=
Gres=gpu:amd_mi100:4,fpga:xilinx_alveo_u280:2
Request Resources¶
To allocate the required resources (GPUs or FPGAs), use the --gres option of salloc/srun.
Example: Allocate one FPGA
$ salloc -A PROJECT-ID -p p03-amd --gres fpga:1
Find Out Allocated Resources¶
Information about allocated resources is available in the Slurm job details, in the JOB_GRES and GRES attributes.
$ scontrol -d show job $SLURM_JOBID |grep GRES=
JOB_GRES=fpga:xilinx_alveo_u250:1
Nodes=p03-amd01 CPU_IDs=0-1 Mem=0 GRES=fpga:xilinx_alveo_u250:1(IDX:0)
IDX in the GRES attribute specifies the index(es) of the FPGA(s) or GPU(s) allocated to the job on the node. In the given example, the allocated resource is fpga:xilinx_alveo_u250:1(IDX:0), so we should use the FPGA with index 0 on node p03-amd01.
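Because resource separation is not enforced, the allocated index can be used to point your application at the right device. A sketch for a GPU job on p03-amd that parses the IDX value for the current node and exports it for the ROCm runtime (the parsing is illustrative and assumes a plain index or comma-separated list, not a range):
# extract the GPU index(es) allocated on this node (illustrative parsing)
GPU_IDX=$(scontrol -d show job $SLURM_JOBID \
          | grep "Nodes=$(hostname -s)" \
          | sed -n 's/.*GRES=gpu[^(]*(IDX:\([0-9,]*\)).*/\1/p')
# restrict the ROCm runtime (MI100 GPUs) to the allocated device(s)
export ROCR_VISIBLE_DEVICES="$GPU_IDX"
echo "Using GPU index(es): $ROCR_VISIBLE_DEVICES"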
Request Specific Resources¶
It is possible to allocate specific resources. This is useful for the p03-amd partition, which is equipped with FPGAs of different types.
A GRES entry uses the format name[[:type]:count]; in the following example, the name is fpga, the type is xilinx_alveo_u280, and the count is 2.
$ salloc -A PROJECT-ID -p p03-amd --gres=fpga:xilinx_alveo_u280:2
salloc: Granted job allocation XXX
salloc: Waiting for resource configuration
salloc: Nodes p03-amd02 are ready for job
$ scontrol -d show job $SLURM_JOBID | grep -i gres
JOB_GRES=fpga:xilinx_alveo_u280:2
Nodes=p03-amd02 CPU_IDs=0 Mem=0 GRES=fpga:xilinx_alveo_u280(IDX:0-1)
TresPerNode=gres:fpga:xilinx_alveo_u280:2