PyTorch¶
PyTorch Highlights¶
- Official page: https://pytorch.org/
- Code: https://github.com/pytorch/pytorch
- Python-based framework for machine learning
- Auto-differentiation on tensor types
- Official LUMI page: https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/p/PyTorch/
- Warning: be careful where the SIF image is installed or copied ($HOME is not recommended because of its small quota). When installing via EasyBuild, specify the installation path explicitly:
export EBU_USER_PREFIX=/project/project_XXXX/EasyBuild
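With that prefix set, a PyTorch container module can then be installed through the usual LUMI EasyBuild-user workflow. A minimal sketch is shown below; the LUMI stack version and the easyconfig name are placeholders, see the official LUMI PyTorch page above for the ones currently available:
# Placeholder LUMI stack version; check the LUMI documentation for the current one
module load LUMI/23.09 partition/G
module load EasyBuild-user
# Placeholder easyconfig name; list the available ones with: eb --search PyTorch
eb PyTorch-<version>.eb -r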
CSC Installed Software Collection¶
- https://docs.csc.fi/support/tutorials/ml-multi/
- https://docs.lumi-supercomputer.eu/software/local/csc/
- https://docs.csc.fi/apps/pytorch/
PyTorch Install¶
Base Environment¶
module purge
module load CrayEnv
module load PrgEnv-cray/8.3.3
module load craype-accel-amd-gfx90a
module load cray-python
# Default ROCm – more recent versions are preferable (e.g. ROCm 5.6.0)
module load rocm/5.2.3
Scripts¶
- natively
- 01-install-direct-torch1.13.1-rocm5.2.3.sh
- 01-install-direct-torch2.1.2-rocm5.5.3.sh
- virtual env (a minimal sketch of this approach follows the list)
- 02-install-venv-torch1.13.1-rocm5.2.3.sh
- 02-install-venv-torch2.1.2-rocm5.5.3.sh
- conda env
- 03-install-conda-torch1.13.1-rocm5.2.3.sh
- 03-install-conda-torch2.1.2-rocm5.5.3.sh
- from source: 04-install-source-torch1.13.1-rocm5.2.3.sh
- containers (singularity)
- 05-install-container-torch2.0.1-rocm5.5.1.sh
- 05-install-container-torch2.1.0-rocm5.6.1.sh
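As a rough illustration of the virtual-env scripts listed above, the approach boils down to something like the following sketch; the version numbers and wheel index URL are examples, check the scripts for the exact combinations used on LUMI:
wd=$(pwd)                      # somewhere under /project or /scratch, not $HOME
python -m venv $wd/torch-rocm-env
source $wd/torch-rocm-env/bin/activate
pip install --upgrade pip
# Example wheel selection for torch 1.13.1 + ROCm 5.2; match it to the loaded rocm module
pip install torch==1.13.1+rocm5.2 --extra-index-url https://download.pytorch.org/whl/rocm5.2
python -c 'import torch; print(torch.__version__)'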
PyTorch Tests¶
Run Interactive Job on Single Node¶
salloc -A project_XXX --partition=standard-g -N 1 -n 1 --gpus 8 -t 01:00:00
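Once the allocation is granted, a quick sanity check of GPU visibility can be done along these lines, assuming one of the environments above is active:
srun -n 1 --gpus 8 rocm-smi
srun -n 1 --gpus 8 python -c 'import torch; print(torch.cuda.is_available(), torch.cuda.device_count())'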
Scripts¶
- natively
- 01-simple-test-direct-torch1.13.1-rocm5.2.3.sh
- 01-simple-test-direct-torch2.1.2-rocm5.5.3.sh
- virtual env
- 02-simple-test-venv-torch1.13.1-rocm5.2.3.sh
- 02-simple-test-venv-torch2.1.2-rocm5.5.3.sh
- conda env
- 03-simple-test-conda-torch1.13.1-rocm5.2.3.sh
- 03-simple-test-conda-torch2.1.2-rocm5.5.3.sh
- from source: 04-simple-test-source-torch1.13.1-rocm5.2.3.sh
- containers (singularity)
- 05-simple-test-container-torch2.0.1-rocm5.5.1.sh
- 05-simple-test-container-torch2.1.0-rocm5.6.1.sh
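Regardless of the installation method, the simple tests above amount to roughly the following kind of check (a minimal sketch, not the exact content of the scripts):
srun -n 1 --gpus 8 python -c '
import torch
# ROCm devices are exposed through the cuda API in PyTorch
x = torch.randn(4096, 4096, device="cuda")
y = x @ x
print(torch.__version__, torch.cuda.device_count(), float(y.sum()))
'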
Run Interactive Job on Multiple Nodes¶
salloc -A project_XXX --partition=standard-g -N 2 -n 16 --gpus 16 -t 01:00:00
Scripts¶
- containers (singularity)
- 07-mnist-distributed-learning-container-torch2.0.1-rocm5.5.1.sh
- 07-mnist-distributed-learning-container-torch2.1.0-rocm5.6.1.sh
- 08-cnn-distributed-container-torch2.0.1-rocm5.5.1.sh
- 08-cnn-distributed-container-torch2.1.0-rocm5.6.1.sh
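A rough sketch of how such a multi-node run is typically launched under Slurm is shown below; train.py stands for a hypothetical script that initializes torch.distributed from the environment variables set here, and the exact mechanics differ between the scripts above:
# Minimal sketch: one task per GCD, rendezvous on the first node of the allocation
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_NODELIST" | head -n 1)
export MASTER_PORT=29500
export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3   # see the RCCL tips below
srun -N 2 -n 16 --gpus 16 bash -c '
  export RANK=$SLURM_PROCID LOCAL_RANK=$SLURM_LOCALID WORLD_SIZE=$SLURM_NTASKS
  python train.py   # hypothetical script calling torch.distributed.init_process_group("nccl")
'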
Tips¶
Official Containers¶
ls -la /appl/local/containers/easybuild-sif-images/
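Inside a GPU allocation, one of these images can be used directly with singularity; a minimal sketch (the image name is a placeholder, pick a real one from the listing above):
SIF=/appl/local/containers/easybuild-sif-images/<pick-an-image>.sif
srun -n 1 --gpus 8 singularity exec $SIF \
  bash -c '$WITH_CONDA; python -c "import torch; print(torch.__version__, torch.cuda.device_count())"'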
Unofficial Versions of ROCM¶
module use /pfs/lustrep2/projappl/project_462000125/samantao-public/mymodules
ml rocm/5.4.3
ml rocm/5.6.0
Unofficial Containers¶
ls -la /pfs/lustrep2/projappl/project_462000125/samantao-public/containers/
Installing Python Modules in Containers¶
#!/bin/bash
wd=$(pwd)
SIF=/pfs/lustrep2/projappl/project_462000125/samantao-public/containers/lumi-pytorch-rocm-5.6.1-python-3.10-pytorch-v2.1.0-dockerhash-aa8dbea5e0e4.sif
rm -rf $wd/setup-me.sh
cat > $wd/setup-me.sh << EOF
#!/bin/bash -e
\$WITH_CONDA
pip3 install scipy h5py tqdm
EOF
chmod +x $wd/setup-me.sh
mkdir -p $wd/pip_install
srun -n 1 --gpus 8 singularity exec \
-B /var/spool/slurmd:/var/spool/slurmd \
-B /opt/cray:/opt/cray \
-B /usr/lib64/libcxi.so.1:/usr/lib64/libcxi.so.1 \
-B $wd:/workdir \
-B $wd/pip_install:$HOME/.local/lib \
$SIF /workdir/setup-me.sh
# Add the path of pip_install to singularity-exec in run.sh:
# -B $wd/pip_install:$HOME/.local/lib \
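For reference, the run.sh referred to in the comment above could look roughly like the following sketch; my-script.py is a hypothetical entry point:
#!/bin/bash
wd=$(pwd)
SIF=/pfs/lustrep2/projappl/project_462000125/samantao-public/containers/lumi-pytorch-rocm-5.6.1-python-3.10-pytorch-v2.1.0-dockerhash-aa8dbea5e0e4.sif
srun -n 1 --gpus 8 singularity exec \
  -B /var/spool/slurmd:/var/spool/slurmd \
  -B /opt/cray:/opt/cray \
  -B /usr/lib64/libcxi.so.1:/usr/lib64/libcxi.so.1 \
  -B $wd:/workdir \
  -B $wd/pip_install:$HOME/.local/lib \
  $SIF bash -c '$WITH_CONDA; python /workdir/my-script.py'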
Controlling Device Visibility¶
HIP_VISIBLE_DEVICES=0,1,2,3 python -c 'import torch; print(torch.cuda.device_count())'
ROCR_VISIBLE_DEVICES=0,1,2,3 python -c 'import torch; print(torch.cuda.device_count())'
- Slurm sets ROCR_VISIBLE_DEVICES for the GPUs of the allocation.
- The two variables are handled at different levels: HIP_VISIBLE_DEVICES by the HIP runtime, ROCR_VISIBLE_DEVICES by the ROCm runtime (ROCr). This has implications for how device-to-device transfers are performed (blit kernels and/or DMA); a per-task binding sketch is given below.
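For instance, ROCR_VISIBLE_DEVICES can be combined with the task-local ID to give each Slurm task its own GCD; a minimal sketch, assuming an allocation with 8 tasks and 8 GPUs on one node:
srun -n 8 --gpus 8 bash -c \
  'ROCR_VISIBLE_DEVICES=$SLURM_LOCALID python -c "import torch; print(torch.cuda.device_count())"'
# each of the 8 tasks should report exactly 1 visible device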
RCCL¶
- The problem – on startup we can see:
NCCL error in: /pfs/lustrep2/projappl/project_462000125/samantao/pytorchexample/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, unhandled system error, NCCL version 2.12.12
- Checking error origin:
export NCCL_DEBUG=INFO
NCCL INFO NET/Socket : Using [0]nmn0:10.120.116.65<0> [1]hsn0:10.253.6.67<0> [2]hsn1:10.253.6.68<0> [3]hsn2:10.253.2.12<0> [4]hsn3:10.253.2.11<0>
NCCL INFO /long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/rccl/src/init.cc:1292
- The fix:
export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3
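In a batch job the same fix is simply exported before srun; a minimal sketch of the relevant part of a job script, with train.py as a hypothetical training script:
#!/bin/bash
#SBATCH --account=project_XXX
#SBATCH --partition=standard-g
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8
#SBATCH --time=01:00:00
export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3   # restrict RCCL to the Slingshot interfaces
export NCCL_DEBUG=INFO                          # optional: keep while verifying the fix
srun python train.py                            # hypothetical distributed training script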
RCCL AWS-CXI Plugin¶
- RCCL relies on runtime plugins to connect to some transport layers
- Libfabric is the provider for the Slingshot interconnect
- A HIP-ified plugin adapted from the AWS OpenFabrics (OFI) plugin is available:
- https://github.com/ROCmSoftwarePlatform/aws-ofi-rccl
- It gives 3-4x faster collectives
- The plugin must be discoverable in the environment when RCCL is loaded:
module use /pfs/lustrep2/projappl/project_462000125/samantao-public/mymodules
module load aws-ofi-rccl/rocm-5.2.3
# Or
export LD_LIBRARY_PATH=/pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofirccl:$LD_LIBRARY_PATH
# (RCCL detects the plugin via librccl-net.so on the library path)
- Verify the plugin is detected
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT
# and search the logs for:
# [0] NCCL INFO NET/OFI Using aws-ofi-rccl 1.4.0
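For example, assuming the job writes to the default Slurm output file:
grep "NET/OFI" slurm-*.out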
amdgpu.ids Issue¶
- The issue is tracked upstream: https://github.com/pytorch/builder/issues/1410
References¶
- Samuel Antao (AMD), LUMI courses: https://lumi-supercomputer.github.io/LUMI-training-materials/4day-20230530/extra_4_10_Best_Practices_GPU_Optimization/
- Multi-GPU and multi-node machine learning by CSC: https://docs.csc.fi/support/tutorials/ml-multi/