Heterogeneous Memory Management on Intel Platforms¶
Partition p10-intel offers heterogeneous memory directly exposed to the user, allowing the appropriate kind of memory to be selected manually at the process level or even per allocation. Both kinds of memory are exposed as separate NUMA nodes (HBM as memory-only nodes), which enables both coarse-grained (process level) and fine-grained (allocation level) control over the memory type used.
Overview¶
At the process level, the numactl facilities can be utilized, while the Intel-provided memkind library allows for finer control. Both the memkind library and numactl can be accessed by loading the memkind module or the OpenMPI module (numactl only).
ml memkind
Process Level (NUMACTL)¶
The numactl utility allows you either to restrict the memory pool of the process to a specific set of memory NUMA nodes
numactl --membind <node_ids_set>
or to select a single preferred node
numactl --preferred <node_id>
where <node_ids_set> is a comma-separated list (e.g., 0,2,5), possibly in combination with ranges (such as 0-5). The --membind option kills the process if it requests more memory than can be satisfied from the specified nodes. The --preferred option instead falls back to other nodes (according to their NUMA distance) in the same situation.
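For illustration (./app stands for an arbitrary application; the node numbers are chosen only as an example), the two policies can be applied as follows:
# strict binding: the process is killed if nodes 0-1 run out of memory
numactl --membind 0-1 ./app
# soft preference: allocations spill over to other nodes when node 8 is full
numactl --preferred 8 ./app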
A convenient way to check the numactl configuration in effect is
numactl -s
which prints the configuration of its execution environment, e.g.
numactl --membind 8-15 numactl -s
policy: bind
preferred node: 0
physcpubind: 0 1 2 ... 189 190 191
cpubind: 0 1 2 3 4 5 6 7
nodebind: 0 1 2 3 4 5 6 7
membind: 8 9 10 11 12 13 14 15
The last row shows that memory allocations are restricted to NUMA nodes 8-15.
Allocation Level (MEMKIND)¶
The memkind library (in its simplest use case) offers a new variant of the malloc/free function pair, which allows the kind of memory used for a given allocation to be specified. Moving a specific allocation from the default to the HBM memory pool can then be achieved by replacing:
void *pData = malloc(<SIZE>);
/* ... */
free(pData);
with
#include <memkind.h>
void *pData = memkind_malloc(MEMKIND_HBW, <SIZE>);
/* ... */
memkind_free(NULL, pData); // "kind" parameter is deduced from the address
Other kinds of memory can be selected similarly.
Note
The allocation will return a NULL pointer when memory of the specified kind is not available.
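A minimal sketch of handling this case, assuming a fallback to regular memory is acceptable for the application (hbw_malloc_with_fallback is just an illustrative helper name; MEMKIND_DEFAULT is the kind backed by regular system memory):
#include <memkind.h>
#include <stdio.h>

/* Try to allocate from HBM first; fall back to the default kind when no
 * high-bandwidth memory is available (memkind_malloc returns NULL). */
static void *hbw_malloc_with_fallback(size_t size)
{
    void *ptr = memkind_malloc(MEMKIND_HBW, size);
    if (ptr == NULL) {
        fprintf(stderr, "HBM allocation failed, falling back to default memory\n");
        ptr = memkind_malloc(MEMKIND_DEFAULT, size);
    }
    return ptr;
}
Memory obtained this way can still be released with memkind_free(NULL, ptr), since the kind is deduced from the address.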
High Bandwidth Memory (HBM)¶
Intel Sapphire Rapids (partition p10-intel) consists of two sockets, each with 128GB of DDR DRAM and 64GB of on-package HBM memory. The machine is configured in FLAT mode and therefore exposes the HBM as memory-only NUMA nodes (16GB per 12-core tile). The configuration can be verified by running
numactl -H
which should show 16 NUMA nodes (nodes 0-7 should each contain 12 cores and 32GB of DDR DRAM, while nodes 8-15 should have no cores and 16GB of HBM each).
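For a quick check of just the HBM nodes, the relevant lines of the numactl -H output can be filtered (a simple illustration; the node numbering is as described above):
# The "cpus" lines for nodes 8-15 should be empty and each "size" line
# should report roughly 16GB of memory
numactl -H | grep -E 'node (8|9|1[0-5]) (cpus|size)'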
Process Level¶
With this, we can easily restrict an application to DDR DRAM or to HBM memory:
# Only DDR DRAM
numactl --membind 0-7 ./stream
# ...
Function Best Rate MB/s Avg time Min time Max time
Copy: 369745.8 0.043355 0.043273 0.043588
Scale: 366989.8 0.043869 0.043598 0.045355
Add: 378054.0 0.063652 0.063483 0.063899
Triad: 377852.5 0.063621 0.063517 0.063884
# Only HBM
numactl --membind 8-15 ./stream
# ...
Function Best Rate MB/s Avg time Min time Max time
Copy: 1128430.1 0.015214 0.014179 0.015615
Scale: 1045065.2 0.015814 0.015310 0.016309
Add: 1096992.2 0.022619 0.021878 0.024182
Triad: 1065152.4 0.023449 0.022532 0.024559
The DDR DRAM achieves a bandwidth of around 400GB/s, while the HBM clears the 1TB/s bar.
Some further improvement can be achieved by isolating a process entirely to a single tile. This can be useful for MPI jobs, where $OMPI_COMM_WORLD_RANK can be used to bind each process individually. A simple wrapper script (membind_wrapper.sh) to do this may look like
#!/bin/bash
numactl --membind $((8 + $OMPI_COMM_WORLD_RANK)) "$@"
and can be used as
mpirun -np 8 --map-by slot:pe=12 ./membind_wrapper.sh ./stream_mpi
(8 tiles with 12 cores each). However, this approach assumes that the 16GB of HBM memory local to each tile is sufficient for each process (memory cannot spill over to other tiles). Otherwise,
mpirun -n 8 --map-by slot:pe=12 numactl --membind 8-15 ./stream_mpi
is most likely preferable even for MPI workloads. Applying the above approach to MPI Stream with 8 ranks and 1-24 threads per rank allows the scaling of both memory configurations to be compared. The per-tile binding may be significantly more useful in combination with --preferred instead of --membind, which forces a preference for the tile-local HBM while allowing spills to DDR DRAM, as sketched below.
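A minimal sketch of such a wrapper (the script name preferred_wrapper.sh is only an example; the rank-to-node mapping is the same as above):
#!/bin/bash
# Prefer the HBM node local to this rank's tile, but allow allocations to
# spill over to DDR DRAM once the 16GB of tile-local HBM is exhausted.
numactl --preferred $((8 + $OMPI_COMM_WORLD_RANK)) "$@"
which can be used in the same way:
mpirun -np 8 --map-by slot:pe=12 ./preferred_wrapper.sh ./stream_mpi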
Allocation Level¶
Allocation-level memory kind selection using the memkind library can be illustrated using a modified stream benchmark. The stream benchmark uses three working arrays (A, B and C), whose allocation can be changed to memkind_malloc as follows:
#include <memkind.h>
// ...
STREAM_TYPE *a = (STREAM_TYPE *)memkind_malloc(MEMKIND_HBW_ALL, STREAM_ARRAY_SIZE * sizeof(STREAM_TYPE));
STREAM_TYPE *b = (STREAM_TYPE *)memkind_malloc(MEMKIND_REGULAR, STREAM_ARRAY_SIZE * sizeof(STREAM_TYPE));
STREAM_TYPE *c = (STREAM_TYPE *)memkind_malloc(MEMKIND_HBW_ALL, STREAM_ARRAY_SIZE * sizeof(STREAM_TYPE));
// ...
memkind_free(NULL, a);
memkind_free(NULL, b);
memkind_free(NULL, c);
Arrays A and C are allocated from HBM (MEMKIND_HBW_ALL), while DDR DRAM (MEMKIND_REGULAR) is used for B.
The code then has to be linked against the memkind library
gcc -march=native -O3 -fopenmp memkind_stream.c -o memkind_stream -lmemkind
and can be run as
export MEMKIND_HBW_NODES=8,9,10,11,12,13,14,15
OMP_NUM_THREADS=$((N*12)) OMP_PROC_BIND=spread ./memkind_stream   # N = number of 12-core tiles to use
While the memkind library should be able to detect HBM memory on its own (through HMAT and hwloc), this is not supported on p10-intel. This means that the NUMA nodes representing HBM have to be specified manually using the MEMKIND_HBW_NODES environment variable.
With this setup, we can see that the simple copy operation (C[i] = A[i]) achieves bandwidth comparable to the application bound entirely to HBM memory. On the other hand, the scale operation (B[i] = s*C[i]) is mostly limited by the DDR DRAM bandwidth. It is also worth noting that the operations combining all three arrays perform close to the HBM-only configuration.
Simple Application¶
One application that can greatly benefit from the availability of a larger but slower memory together with a smaller but faster one is computing a histogram with many bins over a large dataset.
#include <iostream>
#include <vector>
#include <chrono>
#include <cmath>
#include <cstring>
#include <cstdlib>
#include <omp.h>
#include <memkind.h>

const size_t N_DATA_SIZE  = 2 * 1024 * 1024 * 1024ull;
const size_t N_BINS_COUNT = 1 * 1024 * 1024ull;
const size_t N_ITERS      = 10;

// Select memory kinds at compile time: with -DHBM the bins are placed in HBM,
// otherwise both the data and the bins reside in regular DDR DRAM.
#if defined(HBM)
    #define DATA_MEMKIND MEMKIND_REGULAR
    #define BINS_MEMKIND MEMKIND_HBW_ALL
#else
    #define DATA_MEMKIND MEMKIND_REGULAR
    #define BINS_MEMKIND MEMKIND_REGULAR
#endif

int main(int argc, char *argv[])
{
    const double binWidth = 1.0 / double(N_BINS_COUNT + 1);

    double *pData = (double *)memkind_malloc(DATA_MEMKIND, N_DATA_SIZE * sizeof(double));
    size_t *pBins = (size_t *)memkind_malloc(BINS_MEMKIND, N_BINS_COUNT * omp_get_max_threads() * sizeof(size_t));

    // Fill the dataset with uniformly distributed random numbers in [0, 1).
    #pragma omp parallel
    {
        drand48_data state;
        srand48_r(omp_get_thread_num(), &state);

        #pragma omp for
        for(size_t i = 0; i < N_DATA_SIZE; ++i)
            drand48_r(&state, &pData[i]);
    }

    auto c1 = std::chrono::steady_clock::now();

    for(size_t it = 0; it < N_ITERS; ++it)
    {
        #pragma omp parallel
        {
            // Each thread accumulates into its own private copy of the bins.
            for(size_t i = 0; i < N_BINS_COUNT; ++i)
                pBins[omp_get_thread_num()*N_BINS_COUNT + i] = size_t(0);

            #pragma omp for
            for(size_t i = 0; i < N_DATA_SIZE; ++i)
            {
                const size_t idx = size_t(pData[i] / binWidth) % N_BINS_COUNT;
                pBins[omp_get_thread_num()*N_BINS_COUNT + idx]++;
            }
        }
    }

    auto c2 = std::chrono::steady_clock::now();

    // Reduce the per-thread bins into the first copy.
    #pragma omp parallel for
    for(size_t i = 0; i < N_BINS_COUNT; ++i)
    {
        for(int j = 1; j < omp_get_max_threads(); ++j)
            pBins[i] += pBins[j*N_BINS_COUNT + i];
    }

    std::cout << "Elapsed Time [s]: " << std::chrono::duration<double>(c2 - c1).count() << std::endl;

    size_t total = 0;
    #pragma omp parallel for reduction(+:total)
    for(size_t i = 0; i < N_BINS_COUNT; ++i)
        total += pBins[i];

    std::cout << "Total Items: " << total << std::endl;

    memkind_free(NULL, pData);
    memkind_free(NULL, pBins);

    return 0;
}
Using HBM Memory (P10-Intel)¶
The following commands can be used to compile and run the example application above:
ml GCC memkind
export MEMKIND_HBW_NODES=8,9,10,11,12,13,14,15
g++ -O3 -fopenmp histogram.cpp -o histogram_dram -lmemkind
g++ -O3 -fopenmp -DHBM histogram.cpp -o histogram_hbm -lmemkind
OMP_PROC_BIND=spread GOMP_CPU_AFFINITY=0-95 OMP_NUM_THREADS=96 ./histogram_dram
OMP_PROC_BIND=spread GOMP_CPU_AFFINITY=0-95 OMP_NUM_THREADS=96 ./histogram_hbm
Moving the histogram bins into HBM memory should speed up the algorithm more than twofold. It should be noted that also moving the pData array into HBM memory worsens this result (presumably because, with the data left in DDR DRAM, the algorithm can saturate both memory interfaces).
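A minimal sketch of how this variant could be tested (the HBM_ALL macro and the extra preprocessor branch are not part of the example above; they are just one possible way to place both arrays in HBM):
// Hypothetical extension of the #if block in histogram.cpp:
#if defined(HBM_ALL)            // both the data and the bins in HBM
    #define DATA_MEMKIND MEMKIND_HBW_ALL
    #define BINS_MEMKIND MEMKIND_HBW_ALL
#elif defined(HBM)              // as above: only the bins in HBM
    #define DATA_MEMKIND MEMKIND_REGULAR
    #define BINS_MEMKIND MEMKIND_HBW_ALL
#else                           // everything in DDR DRAM
    #define DATA_MEMKIND MEMKIND_REGULAR
    #define BINS_MEMKIND MEMKIND_REGULAR
#endif
compiled, e.g., with
g++ -O3 -fopenmp -DHBM_ALL histogram.cpp -o histogram_hbm_all -lmemkind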