Resource Allocation and Job Execution

To run a job, computational resources of the DGX-2 must first be allocated.

Info

You can access the DGX PBS scheduler by loading the "DGX-2" module.

The DGX-2 uses its own, independent PBS scheduler. Load the DGX-2 module to access the scheduler:

$ ml DGX-2

Resource Allocation Policy

The resources are allocated to the job in a fair-share fashion, subject to constraints set by the queue. The queue provides prioritized and exclusive access to computational resources.

  • qdgx, the queue for the DGX-2 machine

Note

The maximum job walltime is 4 hours. At most 5 jobs may be in the queue, and each user may have only one running job.

Job Submission and Execution

The qsub command submits the job to the queue. It creates a request to the PBS job manager for allocation of the specified resources. The resources are allocated when available, subject to allocation policies and constraints. After the resources are allocated, the jobscript or interactive shell is executed on the allocated node.

Job Submission

When allocating computational resources for the job, specify:

  1. a queue for your job (the default is qdgx)
  2. the number of computational nodes required (the maximum is 16, as there is currently only one DGX-2 machine)
  3. the maximum wall time allocated to your calculation (the default is 2 hours, the maximum is 4 hours)
  4. a jobscript or the interactive switch (-I)
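
The items above map directly onto qsub options. The following is a minimal sketch, not an official tool: dgx_qsub_cmd is a hypothetical helper that checks a request against the limits stated on this page (1 to 16 nodes, 1 to 4 whole hours, default 2 hours) and prints the matching interactive qsub command instead of running it.

```shell
# Sketch only: dgx_qsub_cmd is a hypothetical helper, not part of PBS.
# It validates the request against the limits stated on this page and
# prints the qsub command rather than executing it.
dgx_qsub_cmd() {
    local nodes="$1"
    local hours="${2:-2}"    # default walltime is 2 hours

    # Node count: a positive integer without leading zeros, at most 16.
    case "$nodes" in
        ''|0*|*[!0-9]*)
            echo "error: node count must be an integer from 1 to 16" >&2
            return 1 ;;
    esac
    if [ "$nodes" -gt 16 ]; then
        echo "error: node count must be an integer from 1 to 16" >&2
        return 1
    fi

    # Walltime: whole hours only in this sketch, at most 4.
    case "$hours" in
        ''|0*|*[!0-9]*)
            echo "error: walltime must be 1 to 4 whole hours" >&2
            return 1 ;;
    esac
    if [ "$hours" -gt 4 ]; then
        echo "error: walltime must be 1 to 4 whole hours" >&2
        return 1
    fi

    printf 'qsub -q qdgx -l select=%d -l walltime=%02d:00:00 -I\n' "$nodes" "$hours"
}

# Request 4 nodes (4 GPUs) for the maximum of 4 hours.
dgx_qsub_cmd 4 4
```

Restricting the walltime to whole hours is a simplification made here; PBS itself accepts any hh:mm:ss value within the queue limit.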

Note

Right now, the DGX-2 is divided into 16 computational nodes. Every node contains 6 CPUs (3 physical cores + 3 HT cores) and 1 GPU.
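
Because every node carries 6 CPUs and 1 GPU, the value passed to select fixes both totals. The sketch below only illustrates that arithmetic; dgx_resources is a hypothetical name.

```shell
# Sketch only: dgx_resources is a hypothetical helper that applies the
# per-node figures from the note above (6 CPUs and 1 GPU per node).
dgx_resources() {
    local nodes="$1"
    echo "select=$nodes -> $((nodes * 6)) CPUs, $nodes GPUs"
}

dgx_resources 4     # a quarter of the machine
dgx_resources 16    # the whole DGX-2
```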

Submit the job using the qsub command:

Example for 1 GPU

[kru0052@login4.salomon ~]$ ml DGX-2
PBS 18.1.3 for DGX-2 machine
[kru0052@login4.salomon ~]$ qsub -q qdgx -l select=1 -l walltime=04:00:00 -I
qsub: waiting for job 257.ldgx to start
qsub: job 257.ldgx ready

kru0052@dgx:~$ nvidia-smi
Thu Mar 14 07:46:01 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM3...  On   | 00000000:57:00.0 Off |                    0 |
| N/A   29C    P0    50W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
kru0052@dgx:~$ exit
[kru0052@login4.salomon ~]$ ml purge
PBS 13.1.1 for cluster Salomon
[kru0052@login4.salomon ~]$

Example for 4 GPUs

[kru0052@login4.salomon ~]$ ml DGX-2
PBS 18.1.3 for DGX-2 machine
[kru0052@login4.salomon ~]$ qsub -q qdgx -l select=4 -l walltime=04:00:00 -I
qsub: waiting for job 256.ldgx to start
qsub: job 256.ldgx ready

kru0052@dgx:~$ nvidia-smi
Thu Mar 14 07:45:29 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM3...  On   | 00000000:57:00.0 Off |                    0 |
| N/A   29C    P0    50W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM3...  On   | 00000000:59:00.0 Off |                    0 |
| N/A   35C    P0    51W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM3...  On   | 00000000:5C:00.0 Off |                    0 |
| N/A   30C    P0    50W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM3...  On   | 00000000:5E:00.0 Off |                    0 |
| N/A   35C    P0    53W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
kru0052@dgx:~$ exit
[kru0052@login4.salomon ~]$ ml purge
PBS 13.1.1 for cluster Salomon
[kru0052@login4.salomon ~]$

Example for 16 GPUs (the whole DGX-2)

[kru0052@login4.salomon ~]$ ml DGX-2
PBS 18.1.3 for DGX-2 machine
[kru0052@login4.salomon ~]$ qsub -q qdgx -l select=16 -l walltime=04:00:00 -I
qsub: waiting for job 258.ldgx to start
qsub: job 258.ldgx ready

kru0052@dgx:~$ nvidia-smi
Thu Mar 14 07:46:32 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM3...  On   | 00000000:34:00.0 Off |                    0 |
| N/A   32C    P0    51W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM3...  On   | 00000000:36:00.0 Off |                    0 |
| N/A   31C    P0    48W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM3...  On   | 00000000:39:00.0 Off |                    0 |
| N/A   35C    P0    53W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM3...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   36C    P0    53W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM3...  On   | 00000000:57:00.0 Off |                    0 |
| N/A   29C    P0    50W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM3...  On   | 00000000:59:00.0 Off |                    0 |
| N/A   35C    P0    51W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM3...  On   | 00000000:5C:00.0 Off |                    0 |
| N/A   30C    P0    50W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM3...  On   | 00000000:5E:00.0 Off |                    0 |
| N/A   35C    P0    53W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   8  Tesla V100-SXM3...  On   | 00000000:B7:00.0 Off |                    0 |
| N/A   30C    P0    50W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   9  Tesla V100-SXM3...  On   | 00000000:B9:00.0 Off |                    0 |
| N/A   30C    P0    51W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  10  Tesla V100-SXM3...  On   | 00000000:BC:00.0 Off |                    0 |
| N/A   35C    P0    51W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  11  Tesla V100-SXM3...  On   | 00000000:BE:00.0 Off |                    0 |
| N/A   35C    P0    50W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  12  Tesla V100-SXM3...  On   | 00000000:E0:00.0 Off |                    0 |
| N/A   31C    P0    50W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  13  Tesla V100-SXM3...  On   | 00000000:E2:00.0 Off |                    0 |
| N/A   29C    P0    51W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  14  Tesla V100-SXM3...  On   | 00000000:E5:00.0 Off |                    0 |
| N/A   34C    P0    51W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  15  Tesla V100-SXM3...  On   | 00000000:E7:00.0 Off |                    0 |
| N/A   34C    P0    50W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
kru0052@dgx:~$ exit
[kru0052@login4.salomon ~]$ ml purge
PBS 13.1.1 for cluster Salomon
[kru0052@login4.salomon ~]$

Tip

Submit an interactive job using the qsub -I ... command.

Info

The GPUs allocated to the job are listed in the CUDA_ALLOCATED_DEVICES environment variable. The CUDA_VISIBLE_DEVICES variable must always be numbered from 0, regardless of which physical GPUs were allocated.
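
The relation between the two variables can be sketched as follows. This assumes that CUDA_ALLOCATED_DEVICES holds a comma-separated list of GPU indices, which this page does not state; visible_from_allocated is a hypothetical helper, and the value "4,5,6,7" is made up for illustration.

```shell
# Sketch only. Assumption: CUDA_ALLOCATED_DEVICES is a comma-separated
# list of GPU indices (the format is not documented on this page).
# visible_from_allocated is a hypothetical helper that builds a list
# of the same length, numbered from 0.
visible_from_allocated() {
    local allocated="$1"
    local count i list=""
    [ -n "$allocated" ] || return 1

    # Count the comma-separated entries.
    count=$(printf '%s\n' "$allocated" | tr ',' '\n' | grep -c .)

    # Build "0,1,...,count-1".
    i=0
    while [ "$i" -lt "$count" ]; do
        list="${list:+$list,}$i"
        i=$((i + 1))
    done
    echo "$list"
}

# Illustrative value; inside a job, pass "$CUDA_ALLOCATED_DEVICES" instead.
visible_from_allocated "4,5,6,7"
```

For four allocated GPUs this yields a list of four indices starting at 0, which is the form CUDA_VISIBLE_DEVICES is expected to take inside the job.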

Job Execution

The DGX-2 machine runs only a bare-bones, minimal operating system. Users are expected to run Singularity containers to enrich the environment according to their needs.

Containers (Docker images) optimized for the DGX-2 can be downloaded from NVIDIA GPU Cloud. Select the code of interest and copy the docker nvcr.io link from the Pull Command section. This link can be used directly to download the container via Singularity, see the example below:

Example - Running TensorFlow in Singularity

[kru0052@login4.salomon ~]$ ml DGX-2
PBS 18.1.3 for DGX-2 machine
[kru0052@login4.salomon ~]$ qsub -q qdgx -l select=16 -l walltime=01:00:00 -I
qsub: waiting for job 96.ldgx to start
qsub: job 96.ldgx ready

kru0052@dgx:~$ singularity shell docker://nvcr.io/nvidia/tensorflow:19.02-py3
Singularity tensorflow_19.02-py3.sif:~>
Singularity tensorflow_19.02-py3.sif:~> mpiexec --bind-to socket -np 16 python /opt/tensorflow/nvidia-examples/cnn/resnet.py --layers=18 --precision=fp16 --batch_size=512
PY 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609]
TF 1.13.0-rc0
PY 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609]
TF 1.13.0-rc0
PY 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609]
TF 1.13.0-rc0
PY 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609]
TF 1.13.0-rc0
PY 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609]
TF 1.13.0-rc0
PY 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609]
...
...
...
2019-03-11 08:30:12.263822: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
     1   1.0   338.2  6.999  7.291 2.00000
    10  10.0  3658.6  5.658  5.950 1.62000
    20  20.0 25628.6  2.957  3.258 1.24469
    30  30.0 30815.1  0.177  0.494 0.91877
    40  40.0 30826.3  0.004  0.330 0.64222
    50  50.0 30884.3  0.002  0.327 0.41506
    60  60.0 30888.7  0.001  0.325 0.23728
    70  70.0 30763.2  0.001  0.324 0.10889
    80  80.0 30845.5  0.001  0.324 0.02988
    90  90.0 26350.9  0.001  0.324 0.00025
kru0052@dgx:~$ exit
[kru0052@login4.salomon ~]$ ml purge
PBS 13.1.1 for cluster Salomon
[kru0052@login4.salomon ~]$
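
The same run can also be submitted as a jobscript instead of an interactive session. The sketch below is an assumption-laden illustration rather than a tested recipe for this machine: make_dgx_jobscript is a hypothetical generator, the one-hour walltime is taken from the example above, and it assumes that singularity exec behaves like the interactive singularity shell session shown there.

```shell
# Sketch only: make_dgx_jobscript is a hypothetical helper that prints a
# PBS jobscript. Arguments: node count, container image, then the command
# to run inside the container.
make_dgx_jobscript() {
    local nodes="$1"
    local image="$2"
    shift 2
    cat <<EOF
#!/bin/bash
#PBS -q qdgx
#PBS -l select=${nodes}
#PBS -l walltime=01:00:00

# Run from the directory the job was submitted from.
cd "\$PBS_O_WORKDIR"

singularity exec docker://${image} $*
EOF
}

# Print a jobscript for the 16-GPU TensorFlow run shown above.
make_dgx_jobscript 16 nvcr.io/nvidia/tensorflow:19.02-py3 \
    mpiexec --bind-to socket -np 16 python \
    /opt/tensorflow/nvidia-examples/cnn/resnet.py \
    --layers=18 --precision=fp16 --batch_size=512
```

Redirect the output to a file, for example job.sh (a hypothetical name), and submit it with qsub job.sh after loading the DGX-2 module.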

GPU stat

The GPU load can be monitored with the gpustat utility.

Every 2,0s: gpustat --color

dgx  Mon Mar 11 09:31:00 2019
[0] Tesla V100-SXM3-32GB | 47'C,  96 % | 23660 / 32480 MB | kru0052(23645M)
[1] Tesla V100-SXM3-32GB | 48'C,  96 % | 23660 / 32480 MB | kru0052(23645M)
[2] Tesla V100-SXM3-32GB | 56'C,  97 % | 23660 / 32480 MB | kru0052(23645M)
[3] Tesla V100-SXM3-32GB | 57'C,  97 % | 23660 / 32480 MB | kru0052(23645M)
[4] Tesla V100-SXM3-32GB | 46'C,  97 % | 23660 / 32480 MB | kru0052(23645M)
[5] Tesla V100-SXM3-32GB | 55'C,  96 % | 23660 / 32480 MB | kru0052(23645M)
[6] Tesla V100-SXM3-32GB | 45'C,  96 % | 23660 / 32480 MB | kru0052(23645M)
[7] Tesla V100-SXM3-32GB | 54'C,  97 % | 23660 / 32480 MB | kru0052(23645M)
[8] Tesla V100-SXM3-32GB | 45'C,  96 % | 23660 / 32480 MB | kru0052(23645M)
[9] Tesla V100-SXM3-32GB | 46'C,  95 % | 23660 / 32480 MB | kru0052(23645M)
[10] Tesla V100-SXM3-32GB | 55'C,  96 % | 23660 / 32480 MB | kru0052(23645M)
[11] Tesla V100-SXM3-32GB | 56'C,  96 % | 23660 / 32480 MB | kru0052(23645M)
[12] Tesla V100-SXM3-32GB | 47'C,  95 % | 23660 / 32480 MB | kru0052(23645M)
[13] Tesla V100-SXM3-32GB | 45'C,  96 % | 23660 / 32480 MB | kru0052(23645M)
[14] Tesla V100-SXM3-32GB | 55'C,  96 % | 23660 / 32480 MB | kru0052(23645M)
[15] Tesla V100-SXM3-32GB | 58'C,  95 % | 23660 / 32480 MB | kru0052(23645M)
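
A listing like the one above can be condensed into a single figure. The following sketch assumes the line layout shown in this listing and plain, uncolored output; gpu_avg_util is a hypothetical helper built on awk.

```shell
# Sketch only: gpu_avg_util is a hypothetical helper. It reads gpustat
# lines in the layout shown above (fields separated by "|", with
# "temperature, utilization %" in the second field) from standard input.
gpu_avg_util() {
    awk -F'|' '
        /^\[[0-9]+\]/ {
            split($2, part, ",")           # part[2] looks like "  96 % "
            gsub(/[^0-9]/, "", part[2])    # keep only the digits
            sum += part[2]
            n++
        }
        END {
            if (n) printf "%d GPUs, average utilization %.1f %%\n", n, sum / n
        }
    '
}

# Two lines copied from the listing above.
gpu_avg_util <<'EOF'
[0] Tesla V100-SXM3-32GB | 47'C,  96 % | 23660 / 32480 MB | kru0052(23645M)
[2] Tesla V100-SXM3-32GB | 56'C,  97 % | 23660 / 32480 MB | kru0052(23645M)
EOF
```

On the machine itself, the function would read the output of gpustat through a pipe, provided that output contains no color codes.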