
Using NVIDIA Grace Partition

To test your application on the NVIDIA Grace partition, prepare a job script for that partition or start an interactive job:

salloc -N 1 -c 144 -A PROJECT-ID -p p11-grace --time=08:00:00


  • -N 1 allocates a single node,
  • -c 144 allocates 144 cores,
  • -A PROJECT-ID sets the project (account) the job is charged to,
  • -p p11-grace selects the NVIDIA Grace partition,
  • --time=08:00:00 requests the allocation for 8 hours.

Available Toolchains

The platform offers three toolchains:

  • GCC
  • Clang (For Grace)
  • NVHPC

The NVHPC toolchain showed strong results with a minimal amount of tuning in our initial evaluation.

GCC Toolchain

GCC seems to struggle with vectorization of short, constant-length loops, which tend to be completely unrolled or eliminated instead of being vectorized. For example, a simple nested loop such as

for(int i = 0; i < 1000000; ++i) {
    // Iterations dependent in "i"
    // ...
    for(int j = 0; j < 8; ++j) {
        // but independent in "j"
        // ...
    }
}

may be compiled to scalar code for the inner loop, so that no vectorization is used at all.

Clang (For Grace) Toolchain

Clang/LLVM tends to behave similarly, but it can be guided to vectorize the inner loop properly, either with the flags -O3 -ffast-math -march=native -fno-unroll-loops -mllvm -force-vector-width=8 or with pragmas such as #pragma clang loop vectorize_width(8) and #pragma clang loop unroll(disable).

for(int i = 0; i < 1000000; ++i) {
    // Iterations dependent in "i"
    // ...
    #pragma clang loop unroll(disable) vectorize_width(8)
    for(int j = 0; j < 8; ++j) {
        // but independent in "j"
        // ...
    }
}


Our basic experiments show that fixed-width vectorization (NEON) tends to perform better than SVE for short (register-length) loops. In cases like the one above, where the specified vectorize_width is larger than the available vector unit width, Clang emits multiple NEON instructions (e.g., 4 instructions to process 8 64-bit operations in the 128-bit units of Grace).
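As a self-contained illustration of the loop shape discussed above, the following sketch applies the same pragma to an inner loop over 8 double-precision values. The function name relax_rows and the update formula are made up for this example and are not taken from any real application; the pragma is guarded so that other compilers simply skip it.

```cpp
#include <array>
#include <cstddef>

// Hypothetical kernel: the outer loop carries a dependency through `state`,
// while the inner loop over 8 lanes is independent and is the
// vectorization target.
constexpr std::size_t kLanes = 8;

std::array<double, kLanes> relax_rows(std::size_t steps, double factor) {
    std::array<double, kLanes> state{};
    for (std::size_t j = 0; j < kLanes; ++j)
        state[j] = static_cast<double>(j);

    for (std::size_t i = 0; i < steps; ++i) {
        // Iteration i depends on the state left by iteration i - 1.
#if defined(__clang__)
#pragma clang loop unroll(disable) vectorize_width(8)
#endif
        for (std::size_t j = 0; j < kLanes; ++j) {
            // Lanes do not interact, so all 8 updates are independent.
            state[j] = state[j] * factor + 1.0;
        }
    }
    return state;
}
```

Whether the inner loop was actually vectorized is best checked in the compiler's optimization report or in the generated assembly, since the pragma is only a hint.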

NVHPC Toolchain

The NVHPC toolchain handled the aforementioned case without any additional tuning. A simple -O3 -march=native -fast should therefore be sufficient.

Basic Math Libraries

The basic libraries (BLAS and LAPACK) are included in the NVHPC toolchain and can be linked simply with -lblas and -llapack, respectively (lp64 and ilp64 versions are also included).
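The lp64 and ilp64 variants differ in the width of their integer arguments: lp64 interfaces take 32-bit integers, while ilp64 interfaces take 64-bit ones. A rough way to decide which variant a problem needs is sketched below; the helper name needs_ilp64 is hypothetical and not part of any library.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: true when a dimension, leading dimension, or vector
// length no longer fits the 32-bit integers of an lp64 interface.
constexpr bool needs_ilp64(std::int64_t extent) {
    return extent > std::numeric_limits<std::int32_t>::max();
}

// The largest extent passed to a single call decides the interface.
constexpr bool needs_ilp64(std::int64_t rows, std::int64_t cols, std::int64_t ld) {
    return needs_ilp64(rows) || needs_ilp64(cols) || needs_ilp64(ld);
}
```

For the 1024 x 1024 matrices used later on this page, the lp64 variant is sufficient.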


The Grace platform doesn't include a CUDA-capable GPU, so nvcc will fail with an error. Use nvc, nvc++, and nvfortran instead.

NVIDIA Performance Libraries

The NVPL package includes a more extensive set of libraries in both sequential and multi-threaded versions:

  • BLACS: -lnvpl_blacs_{lp64,ilp64}_{mpich,openmpi3,openmpi4,openmpi5}
  • BLAS: -lnvpl_blas_{lp64,ilp64}_{seq,gomp}
  • FFTW: -lnvpl_fftw
  • LAPACK: -lnvpl_lapack_{lp64,ilp64}_{seq,gomp}
  • ScaLAPACK: -lnvpl_scalapack_{lp64,ilp64}
  • RAND: -lnvpl_rand or -lnvpl_rand_mt
  • SPARSE: -lnvpl_sparse

This package should be compatible with all available toolchains and includes CMake module files for easy integration into CMake-based projects. For further documentation, see also NVPL.

We recommend using the multi-threaded BLAS library from the NVPL package.


It is important to pin the OpenMP threads using OMP_PROC_BIND=spread.


$ ml NVHPC
$ nvc -O3 -march=native myprog.c -o myprog -lnvpl_blas_lp64_gomp
$ OMP_PROC_BIND=spread ./myprog

Basic Communication Libraries

The OpenMPI 4 implementation is included with the NVHPC toolchain and is exposed as a module (ml OpenMPI). The following example

#include <cstdio>
#include <mpi.h>
#include <sched.h>
#include <omp.h>

int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        // Each OpenMP thread reports its rank, thread number, and current CPU
        #pragma omp parallel
        {
                printf("Hello on rank %d, thread %d on CPU %d\n", rank, omp_get_thread_num(), sched_getcpu());
        }
        MPI_Finalize();
        return 0;
}

can be compiled and run as follows

ml OpenMPI
mpic++ -fast -fopenmp hello.cpp -o hello
OMP_PROC_BIND=close OMP_NUM_THREADS=4 mpirun -np 4 --map-by slot:pe=36 ./hello

In this configuration, we run 4 ranks, each bound to one quarter of the cores (36 of 144) and each running 4 OpenMP threads.
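The pe=36 value follows from dividing the 144 cores evenly among the 4 ranks. The sketch below reproduces that arithmetic; the names cores_per_rank and cores_of_rank are hypothetical, and the assumption that rank r receives the r-th contiguous block of cores is only an illustration of a common placement, not a guarantee about how mpirun numbers the cores.

```cpp
#include <cassert>
#include <cstddef>

// Contiguous block of cores assumed to be assigned to one MPI rank.
struct CoreRange {
    std::size_t first;  // first core index in the block
    std::size_t last;   // last core index in the block (inclusive)
};

// Cores per rank, i.e. the value passed as pe=N in --map-by slot:pe=N.
constexpr std::size_t cores_per_rank(std::size_t total_cores, std::size_t ranks) {
    return total_cores / ranks;
}

// Assumption: rank r receives the r-th contiguous block of cores.
CoreRange cores_of_rank(std::size_t total_cores, std::size_t ranks, std::size_t rank) {
    assert(ranks > 0 && rank < ranks && total_cores % ranks == 0);
    const std::size_t width = cores_per_rank(total_cores, ranks);
    return CoreRange{rank * width, rank * width + width - 1};
}
```

With OMP_NUM_THREADS=4, each rank uses only 4 of its 36 cores; the CPU numbers printed by the hello example show where the threads actually ran.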

Simple BLAS Application

The hello world example application (written in C++ and Fortran) uses a simple stationary probability vector estimation to illustrate the use of GEMM (a BLAS level 3 routine).

Stationary probability vector estimation in C++:

#include <algorithm>
#include <iostream>
#include <vector>
#include <chrono>
#include "cblas.h"

const size_t ITERATIONS  = 32;
const size_t MATRIX_SIZE = 1024;

int main(int argc, char *argv[]) {
    const size_t matrixElements = MATRIX_SIZE*MATRIX_SIZE;

    // Row-stochastic transition matrix stored in row-major order
    std::vector<float> a(matrixElements, 1.0f / float(MATRIX_SIZE));

    for(size_t i = 0; i < MATRIX_SIZE; ++i)
        a[i] = 0.5f / (float(MATRIX_SIZE) - 1.0f);
    a[0] = 0.5f;

    std::vector<float> w1(matrixElements, 0.0f);
    std::vector<float> w2(matrixElements, 0.0f);

    std::copy(a.begin(), a.end(), w1.begin());

    std::vector<float> *t1, *t2;
    t1 = &w1;
    t2 = &w2;

    auto c1 = std::chrono::steady_clock::now();

    for(size_t i = 0; i < ITERATIONS; ++i) {
        std::fill(t2->begin(), t2->end(), 0.0f);

        // t2 = t1 * a
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, MATRIX_SIZE, MATRIX_SIZE, MATRIX_SIZE,
                    1.0f, t1->data(), MATRIX_SIZE,
                    a.data(), MATRIX_SIZE,
                    1.0f, t2->data(), MATRIX_SIZE);

        std::swap(t1, t2);
    }

    auto c2 = std::chrono::steady_clock::now();

    // The diagonal of the matrix power approximates the stationary vector
    for(size_t i = 0; i < MATRIX_SIZE; ++i)
        std::cout << (*t1)[i*MATRIX_SIZE + i] << " ";

    std::cout << std::endl;

    std::cout << "Elapsed Time: " << std::chrono::duration<double>(c2 - c1).count() << std::endl;

    return 0;
}

Stationary probability vector estimation in Fortran:

program main
    implicit none

    integer :: matrix_size, iterations
    integer :: i
    real, allocatable, target :: a(:,:), w1(:,:), w2(:,:)
    real, dimension(:,:), contiguous, pointer :: t1, t2, tmp
    real, pointer :: out_data(:), out_diag(:)
    integer :: cr, cm, c1, c2

    iterations  = 32
    matrix_size = 1024

    call system_clock(count_rate=cr)
    call system_clock(count_max=cm)

    allocate(a(matrix_size, matrix_size))
    allocate(w1(matrix_size, matrix_size))
    allocate(w2(matrix_size, matrix_size))

    a(:,:) = 1.0 / real(matrix_size)
    a(:,1) = 0.5 / real(matrix_size - 1)
    a(1,1) = 0.5

    w1 = a
    w2(:,:) = 0.0

    t1 => w1
    t2 => w2

    call system_clock(c1)

    do i = 1, iterations
        t2(:,:) = 0.0

        call sgemm('N', 'N', matrix_size, matrix_size, matrix_size, 1.0, t1, matrix_size, a, matrix_size, 1.0, t2, matrix_size)

        tmp => t1
        t1  => t2
        t2  => tmp
    end do

    call system_clock(c2)

    out_data(1:size(t1)) => t1
    out_diag => out_data(1::matrix_size+1)

    print *, out_diag
    print *, "Elapsed Time: ", (c2 - c1) / real(cr)

end program main
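Both programs repeatedly multiply a matrix power by the transition matrix and print the diagonal of the result. For a chain that converges, every row of a high matrix power approaches the stationary vector, so the diagonal approximates it. To check the idea on a small input without linking BLAS, a naive reference could look like the sketch below; the names multiply and stationary_estimate are hypothetical, and the triple loop is only a stand-in for GEMM.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<double>;  // row-major, n x n

// Naive C = A * B for n x n row-major matrices (stand-in for GEMM).
Matrix multiply(const Matrix& a, const Matrix& b, std::size_t n) {
    Matrix c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
    return c;
}

// Raises the row-stochastic matrix p to the power (iterations + 1) and
// returns the diagonal, which approximates the stationary vector when
// the chain converges.
std::vector<double> stationary_estimate(const Matrix& p, std::size_t n,
                                        std::size_t iterations) {
    assert(p.size() == n * n);
    Matrix power = p;
    for (std::size_t it = 0; it < iterations; ++it)
        power = multiply(power, p, n);

    std::vector<double> diag(n);
    for (std::size_t i = 0; i < n; ++i)
        diag[i] = power[i * n + i];
    return diag;
}
```

For the two-state chain with rows (0.9, 0.1) and (0.5, 0.5), the exact stationary vector is (5/6, 1/6), which makes a convenient test case.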

Using NVHPC Toolchain

The C++ version of the example can be compiled with NVHPC as follows:

nvc++ -O3 -march=native -fast -I$NVHPC/Linux_aarch64/$EBVERSIONNVHPC/compilers/include/lp64 -lblas main.cpp -o main

The Fortran version is just as simple:

nvfortran -O3 -march=native -fast -lblas main.f90 -o main.x


It may be advantageous to use the NVPL libraries instead of the NVHPC ones. For example, the DGEMM BLAS level 3 routine from NVPL is almost 30% faster than the NVHPC one.
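To compare the two libraries on your own workload, a common metric is sustained GFLOP/s, counting 2·M·N·K floating-point operations per GEMM call. The following is a minimal sketch of such a measurement; the helper names are hypothetical, and the timing wrapper accepts any callable, so the actual BLAS call is left to the caller.

```cpp
#include <chrono>
#include <cstddef>

// Floating-point operations of one GEMM: one multiply and one add
// per (m, n, k) triple.
constexpr double gemm_flops(std::size_t m, std::size_t n, std::size_t k) {
    return 2.0 * static_cast<double>(m) * static_cast<double>(n) *
           static_cast<double>(k);
}

// Converts an operation count and a wall time into GFLOP/s.
constexpr double gflops(double flops, double seconds) {
    return flops / seconds / 1.0e9;
}

// Runs `work` `repeats` times and returns the average wall time of one
// run in seconds.
template <typename Work>
double average_seconds(Work&& work, std::size_t repeats) {
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t r = 0; r < repeats; ++r)
        work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count() /
           static_cast<double>(repeats);
}
```

Passing a lambda that wraps the GEMM call to average_seconds and feeding the result into gflops gives one number per library; running a warm-up call first avoids measuring library initialization.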

Using Clang (For Grace) Toolchain

Similarly, the Clang for Grace toolchain with NVPL BLAS can be used to compile the C++ version of the example.

/opt/nvidia/clang/17.23.11/bin/clang++ -O3 -march=native -ffast-math -I$NVHPC/Linux_aarch64/$EBVERSIONNVHPC/compilers/include/lp64 -lnvpl_blas_lp64_gomp main.cpp -o main


The NVHPC module is used only for the cblas.h include in this case. This can be avoided by changing the code to use nvpl_blas.h instead.

Additional Resources