Using NVIDIA Grace Partition¶
For testing your application on the NVIDIA Grace partition, you need to prepare a job script for that partition or request an interactive job:
salloc -N 1 -c 144 -A PROJECT-ID -p p11-grace --time=08:00:00
where:
- -N 1 allocates a single node,
- -c 144 allocates 144 cores,
- -p p11-grace selects the NVIDIA Grace partition,
- --time=08:00:00 allocates the resources for 8 hours.
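For non-interactive runs, a minimal batch script sketch requesting the same allocation could look like the following; the job name and the application to run (myprog) are placeholders, and PROJECT-ID and the walltime should be adjusted to your needs:
#!/bin/bash
#SBATCH --job-name=grace-test
#SBATCH --account=PROJECT-ID
#SBATCH --partition=p11-grace
#SBATCH --nodes=1
#SBATCH --cpus-per-task=144
#SBATCH --time=08:00:00

ml NVHPC
./myprog
The script is then submitted with sbatch.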
Available Toolchains¶
The platform offers three toolchains:
- Standard GCC (as a module: ml GCC)
- NVHPC (as a module: ml NVHPC)
- Clang for NVIDIA Grace (installed in /opt/nvidia/clang)
Note
In our initial evaluation, the NVHPC toolchain showed strong results with minimal tuning.
GCC Toolchain¶
The GCC compiler seems to struggle with vectorization of short (constant-length) loops, which tend to get completely unrolled/eliminated instead of being vectorized. For example, a simple nested loop such as
for(int i = 0; i < 1000000; ++i) {
    // Iterations dependent in "i"
    // ...
    for(int j = 0; j < 8; ++j) {
        // but independent in "j"
        // ...
    }
}
may emit scalar code for the inner loop, resulting in no vectorization being used at all.
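To check what GCC actually did with such a loop, its vectorizer reports can be enabled; the source file name below is only illustrative:
ml GCC
gcc -O3 -march=native -fopt-info-vec -fopt-info-vec-missed -c loop.c
Loops that appear only in the missed report were left scalar.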
Clang (For Grace) Toolchain¶
Clang/LLVM tends to behave similarly, but can be guided to properly vectorize the inner loop either with the flags -O3 -ffast-math -march=native -fno-unroll-loops -mllvm -force-vector-width=8 or with pragmas such as #pragma clang loop vectorize_width(8) and #pragma clang loop unroll(disable).
for(int i = 0; i < 1000000; ++i) {
    // Iterations dependent in "i"
    // ...
    #pragma clang loop unroll(disable) vectorize_width(8)
    for(int j = 0; j < 8; ++j) {
        // but independent in "j"
        // ...
    }
}
Note
Our basic experiments show that fixed-width (NEON) vectorization tends to perform better than SVE for short (register-length) loops. In cases like the one above, where the specified vectorize_width is larger than the available vector unit width, Clang will emit multiple NEON instructions (e.g., 4 instructions will be emitted to process 8 64-bit operations in the 128-bit vector units of Grace).
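For the flag-based variant (no pragmas in the source), a compile line using the installed Clang could look as follows; the source file name is illustrative:
/opt/nvidia/clang/17.23.11/bin/clang++ -O3 -ffast-math -march=native -fno-unroll-loops -mllvm -force-vector-width=8 -c loop.cpp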
NVHPC Toolchain¶
The NVHPC toolchain handled the aforementioned case without any additional tuning. A simple -O3 -march=native -fast should therefore be sufficient.
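For instance, the nested-loop example above can be compiled with (illustrative file name):
ml NVHPC
nvc++ -O3 -march=native -fast -c loop.cpp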
Basic Math Libraries¶
The basic libraries (BLAS and LAPACK) are included in the NVHPC toolchain and can be linked simply with -lblas and -llapack for BLAS and LAPACK, respectively (lp64 and ilp64 versions are also included).
Note
The Grace platform doesn't include a CUDA-capable GPU, therefore nvcc will fail with an error. This means that nvc, nvc++, and nvfortran should be used instead.
NVIDIA Performance Libraries¶
The NVPL package includes a more extensive set of libraries, in both sequential and multi-threaded versions:
- BLACS: -lnvpl_blacs_{lp64,ilp64}_{mpich,openmpi3,openmpi4,openmpi5}
- BLAS: -lnvpl_blas_{lp64,ilp64}_{seq,gomp}
- FFTW: -lnvpl_fftw
- LAPACK: -lnvpl_lapack_{lp64,ilp64}_{seq,gomp}
- ScaLAPACK: -lnvpl_scalapack_{lp64,ilp64}
- RAND: -lnvpl_rand or -lnvpl_rand_mt
- SPARSE: -lnvpl_sparse
This package should be compatible with all available toolchains and includes CMake module files for easy integration into CMake-based projects. For further documentation, see NVPL.
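As an illustration of the naming scheme, the same program can be linked against either the sequential or the OpenMP-threaded BLAS variant (myprog.c is a placeholder):
ml NVHPC
nvc -O3 -march=native myprog.c -o myprog_seq -lnvpl_blas_lp64_seq
nvc -O3 -march=native myprog.c -o myprog_omp -lnvpl_blas_lp64_gomp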
Recommended BLAS Library¶
We recommend using the multi-threaded BLAS library from the NVPL package.
Note
It is important to pin the OpenMP threads using OMP_PROC_BIND=spread.
Example:
$ ml NVHPC
$ nvc -O3 -march=native myprog.c -o myprog -lnvpl_blas_lp64_gomp
$ OMP_PROC_BIND=spread ./myprog
Basic Communication Libraries¶
The OpenMPI 4 implementation is included with the NVHPC toolchain and is exposed as a module (ml OpenMPI). The following example
#include <stdio.h>
#include <mpi.h>
#include <sched.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        printf("Hello on rank %d, thread %d on CPU %d\n", rank, omp_get_thread_num(), sched_getcpu());
    }

    MPI_Finalize();
}
can be compiled and run as follows:
ml OpenMPI
mpic++ -fast -fopenmp hello.cpp -o hello
OMP_PROC_BIND=close OMP_NUM_THREADS=4 mpirun -np 4 --map-by slot:pe=36 ./hello
In this configuration, we run 4 ranks, each bound to one quarter of the node's cores (36 cores per rank) and running 4 OpenMP threads.
Simple BLAS Application¶
The hello world example application (written in C++ and Fortran) uses a simple stationary probability vector estimation to illustrate the use of GEMM (a BLAS 3 routine).
Stationary probability vector estimation in C++:
#include <iostream>
#include <vector>
#include <chrono>
#include <algorithm>

#include "cblas.h"

const size_t ITERATIONS  = 32;
const size_t MATRIX_SIZE = 1024;

int main(int argc, char *argv[])
{
    const size_t matrixElements = MATRIX_SIZE*MATRIX_SIZE;

    std::vector<float> a(matrixElements, 1.0f / float(MATRIX_SIZE));

    for(size_t i = 0; i < MATRIX_SIZE; ++i)
        a[i] = 0.5f / (float(MATRIX_SIZE) - 1.0f);
    a[0] = 0.5f;

    std::vector<float> w1(matrixElements, 0.0f);
    std::vector<float> w2(matrixElements, 0.0f);

    std::copy(a.begin(), a.end(), w1.begin());

    std::vector<float> *t1, *t2;
    t1 = &w1;
    t2 = &w2;

    auto c1 = std::chrono::steady_clock::now();

    for(size_t i = 0; i < ITERATIONS; ++i)
    {
        std::fill(t2->begin(), t2->end(), 0.0f);

        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, MATRIX_SIZE, MATRIX_SIZE, MATRIX_SIZE,
                    1.0f, t1->data(), MATRIX_SIZE,
                    a.data(), MATRIX_SIZE,
                    1.0f, t2->data(), MATRIX_SIZE);

        std::swap(t1, t2);
    }

    auto c2 = std::chrono::steady_clock::now();

    for(size_t i = 0; i < MATRIX_SIZE; ++i)
    {
        std::cout << (*t1)[i*MATRIX_SIZE + i] << " ";
    }
    std::cout << std::endl;

    std::cout << "Elapsed Time: " << std::chrono::duration<double>(c2 - c1).count() << std::endl;

    return 0;
}
Stationary probability vector estimation in Fortran:
program main
    implicit none

    integer :: matrix_size, iterations
    integer :: i
    real, allocatable, target :: a(:,:), w1(:,:), w2(:,:)
    real, dimension(:,:), contiguous, pointer :: t1, t2, tmp
    real, pointer :: out_data(:), out_diag(:)
    integer :: cr, cm, c1, c2

    iterations  = 32
    matrix_size = 1024

    call system_clock(count_rate=cr)
    call system_clock(count_max=cm)

    allocate(a(matrix_size, matrix_size))
    allocate(w1(matrix_size, matrix_size))
    allocate(w2(matrix_size, matrix_size))

    a(:,:) = 1.0 / real(matrix_size)
    a(:,1) = 0.5 / real(matrix_size - 1)
    a(1,1) = 0.5

    w1 = a
    w2(:,:) = 0.0

    t1 => w1
    t2 => w2

    call system_clock(c1)

    do i = 1, iterations
        t2(:,:) = 0.0
        call sgemm('N', 'N', matrix_size, matrix_size, matrix_size, 1.0, t1, matrix_size, a, matrix_size, 1.0, t2, matrix_size)
        tmp => t1
        t1  => t2
        t2  => tmp
    end do

    call system_clock(c2)

    out_data(1:size(t1)) => t1
    out_diag => out_data(1::matrix_size+1)

    print *, out_diag
    print *, "Elapsed Time: ", (c2 - c1) / real(cr)

    deallocate(a)
    deallocate(w1)
    deallocate(w2)
end program main
Using NVHPC Toolchain¶
The C++ version of the example can be compiled with NVHPC and run as follows:
ml NVHPC
nvc++ -O3 -march=native -fast -I$NVHPC/Linux_aarch64/$EBVERSIONNVHPC/compilers/include/lp64 -lblas main.cpp -o main
OMP_NUM_THREADS=144 OMP_PROC_BIND=spread ./main
The Fortran version is just as simple:
ml NVHPC
nvfortran -O3 -march=native -fast -lblas main.f90 -o main.x
OMP_NUM_THREADS=144 OMP_PROC_BIND=spread ./main.x
Note
It may be advantageous to use the NVPL libraries instead of the NVHPC ones. For example, the DGEMM BLAS 3 routine from NVPL is almost 30% faster than the NVHPC one.
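A sketch of such a switch, compiling the example with NVHPC but linking NVPL BLAS instead of -lblas, could look like this:
ml NVHPC
nvc++ -O3 -march=native -fast -I$NVHPC/Linux_aarch64/$EBVERSIONNVHPC/compilers/include/lp64 main.cpp -o main -lnvpl_blas_lp64_gomp
OMP_NUM_THREADS=144 OMP_PROC_BIND=spread ./main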
Using Clang (For Grace) Toolchain¶
Similarly, the Clang for Grace toolchain with NVPL BLAS can be used to compile the C++ version of the example.
ml NVHPC
/opt/nvidia/clang/17.23.11/bin/clang++ -O3 -march=native -ffast-math -I$NVHPC/Linux_aarch64/$EBVERSIONNVHPC/compilers/include/lp64 -lnvpl_blas_lp64_gomp main.cpp -o main
Note
The NVHPC module is used just for the cblas.h include in this case. This can be avoided by changing the code to use nvpl_blas.h instead, as sketched below.
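A minimal sketch of that change, assuming nvpl_blas.h exposes the same CBLAS interface used in main.cpp:
//#include "cblas.h"    // CBLAS header shipped with NVHPC
#include <nvpl_blas.h>  // CBLAS interface provided by NVPL
With this change, the -I$NVHPC/... include path can be dropped from the compile line above.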