Karolina - Job Submission and Execution¶
Introduction¶
Slurm workload manager is used to allocate and access Karolina cluster's resources. This page describes Karolina cluster's specific Slurm settings and usage. General information about Slurm usage at IT4Innovations can be found at Slurm Job Submission and Execution.
Partition Information¶
Partitions/queues on the system:
$ sinfo -s
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
qcpu* up 2-00:00:00 1/717/0/718 cn[001-718]
qcpu_biz up 2-00:00:00 1/717/0/718 cn[001-718]
qcpu_exp up 1:00:00 1/719/0/720 cn[001-720]
qcpu_free up 18:00:00 1/717/0/718 cn[001-718]
qcpu_long up 6-00:00:00 1/717/0/718 cn[001-718]
qcpu_preempt up 12:00:00 1/717/0/718 cn[001-718]
qgpu up 2-00:00:00 0/70/0/70 acn[01-70]
qgpu_big up 12:00:00 71/1/0/72 acn[01-72]
qgpu_biz up 2-00:00:00 0/70/0/70 acn[01-70]
qgpu_exp up 1:00:00 0/72/0/72 acn[01-72]
qgpu_free up 18:00:00 0/70/0/70 acn[01-70]
qgpu_preempt up 12:00:00 0/70/0/70 acn[01-70]
qfat up 2-00:00:00 0/1/0/1 sdf1
qviz up 8:00:00 0/2/0/2 viz[1-2]
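The properties of a particular partition (time limit, default resources, node list) can also be queried with the standard Slurm scontrol command, for example:
$ scontrol show partition qcpu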
For more information about Karolina's queues, see this page.
A graphical representation of cluster usage, partitions, nodes, and jobs can be found at https://extranet.it4i.cz/rsweb/karolina.
On the Karolina cluster:
- all CPU queues/partitions provide full node allocation: whole nodes (all node resources) are allocated to a job;
- other queues/partitions (gpu, fat, viz) provide partial node allocation: a job's resources (CPUs, memory) are separated from other jobs and dedicated to that job.
Partial node allocation and security
Division of nodes means that if two users allocate a portion of the same node, they can see each other's running processes. If this arrangement is a concern for you, consider allocating a whole node.
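For example, on the qgpu partition a whole node corresponds to all eight GPUs of one node, which (as described in Using GPU Queues below) can be requested with:
#SBATCH --gpus 8
#SBATCH --nodes 1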
Using CPU Queues¶
Access standard compute nodes.
Whole nodes are allocated. Use the --nodes option to specify the number of requested nodes. There is no need to specify the number of cores or the memory size.
#!/usr/bin/bash
#SBATCH --job-name MyJobName
#SBATCH --account PROJECT-ID
#SBATCH --partition qcpu
#SBATCH --time 12:00:00
#SBATCH --nodes 8
...
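Assuming the script above is saved as, for example, myjob.sh (the file name is arbitrary), it can be submitted and monitored with the standard Slurm commands:
$ sbatch myjob.sh
$ squeue --user $USER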
Using GPU Queues¶
Nodes per job limit
Because we are still fine-tuning and setting optimal Slurm parameters, we have temporarily limited the maximum number of nodes per job on the qgpu and qgpu_biz partitions to 16.
Access GPU accelerated nodes. Every GPU accelerated node is divided into eight parts; each part contains one GPU, 16 CPU cores, and the corresponding memory. By default, only one part, i.e. 1/8 of the node (one GPU with the corresponding CPU cores and memory), is allocated. There is no need to specify the number of cores or the memory size; on the contrary, doing so is undesirable. Some restrictions are in place to ensure a fair division and efficient use of node resources.
#!/usr/bin/bash
#SBATCH --job-name MyJobName
#SBATCH --account PROJECT-ID
#SBATCH --partition qgpu
#SBATCH --time 12:00:00
...
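Within the job, only the allocated part of the node is visible to your processes. As a quick check, you can for example list the allocated GPU(s) from the job script (this assumes the NVIDIA tools are on the default path of the accelerated nodes):
# list only the GPU(s) allocated to this job
nvidia-smi -L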
To allocate more GPUs, use the --gpus option. The default behavior is to allocate enough nodes to satisfy the resources requested via the --gpus option without delaying the start of the job.
The following code requests four GPUs; the scheduler can allocate from one up to four nodes, depending on the actual cluster state (i.e. GPU availability), to fulfil the request.
#SBATCH --gpus 4
The following code requests 16 GPUs; the scheduler can allocate from two up to sixteen nodes, depending on the actual cluster state (i.e. GPU availability), to fulfil the request.
#SBATCH --gpus 16
To allocate GPUs within one node, you have to specify the --nodes option.
The following code requests four GPUs on exactly one node.
#SBATCH --gpus 4
#SBATCH --nodes 1
The following code requests 16 GPUs on exactly two nodes.
#SBATCH --gpus 16
#SBATCH --nodes 2
Alternatively, you can use the --gpus-per-node option. For multi-node allocations, only the value 8 is allowed, to prevent fragmenting nodes.
The following code requests 16 GPUs on exactly two nodes.
#SBATCH --gpus-per-node 8
#SBATCH --nodes 2
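Putting the options together, a complete batch script requesting two whole GPU nodes (16 GPUs) might look like this (a sketch based on the directives above; job name, project ID, and time are placeholders):
#!/usr/bin/bash
#SBATCH --job-name MyJobName
#SBATCH --account PROJECT-ID
#SBATCH --partition qgpu
#SBATCH --time 12:00:00
#SBATCH --nodes 2
#SBATCH --gpus-per-node 8
...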
Using Fat Queue¶
Access the data analytics (aka fat) node. The fat node is divided into 32 parts; each part contains one socket/processor (24 cores) and the corresponding memory. By default, only one part, i.e. 1/32 of the node (one processor and the corresponding memory), is allocated.
To allocate the requested amount of memory, use the --mem option; the corresponding CPUs will be allocated as well. The fat node has about 22.5 TB of memory available for jobs.
#!/usr/bin/bash
#SBATCH --job-name MyJobName
#SBATCH --account PROJECT-ID
#SBATCH --partition qfat
#SBATCH --time 2:00:00
#SBATCH --mem 6TB
...
You can also specify CPU-oriented options (like --cpus-per-task); an appropriate amount of memory will then be allocated to the job.
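For example, requesting 48 cores (two of the 24-core processors) allocates the memory of the corresponding parts along with them (the value below is only illustrative):
#SBATCH --partition qfat
#SBATCH --cpus-per-task 48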
To allocate a whole fat node, use the --exclusive option:
#SBATCH --exclusive
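An exclusive fat node job then has all 32 processors (768 cores) and the full ~22.5 TB of memory at its disposal; for example (job name, project ID, and time are placeholders):
#!/usr/bin/bash
#SBATCH --job-name MyJobName
#SBATCH --account PROJECT-ID
#SBATCH --partition qfat
#SBATCH --time 2:00:00
#SBATCH --exclusive
...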
Using Viz Queue¶
Access visualization nodes. Every visualization node is divided into eight parts. By default, only one part, i.e. 1/8 of the node, is allocated.
$ salloc -A PROJECT-ID -p qviz
To allocate a whole visualization node, use the --exclusive option:
$ salloc -A PROJECT-ID -p qviz --exclusive
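salloc grants the allocation and starts a shell; the node assigned to the job can be found, for example, in the SLURM_JOB_NODELIST environment variable:
$ echo $SLURM_JOB_NODELIST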