How to properly set up GPUDirect RDMA

GPUDirect RDMA (GDR) is a remarkable technology that allows remote machines to directly manipulate the local GPU's memory. However, there are not many online resources discussing this technology, so I was often confused when I ran into issues related to RDMA, and especially to GDR.

Prerequisites

Install RNIC Drivers and Toolkits

In this tutorial, I will use a Mellanox ConnectX RDMA NIC (RNIC) as an example to demonstrate the configuration steps.

Note that some configuration steps are vendor-specific: for another vendor's RNIC, you may need to find an alternative solution if my approach does not apply. Also, I have not tested GDR on NICs made by vendors other than Mellanox. (I suspect only Mellanox RNICs support GDR.)

For ConnectX RNICs, the corresponding drivers and toolkits are all packaged in Mellanox OFED (MLNX_OFED).
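As a rough sketch, installing MLNX_OFED from the downloaded archive usually looks like the following; the archive name, version, and flags are illustrative and vary by release and distro:

# Unpack and install MLNX_OFED (names/versions below are examples)
tar xzf MLNX_OFED_LINUX-5.8-1.0.1.1-ubuntu22.04-x86_64.tgz
cd MLNX_OFED_LINUX-5.8-1.0.1.1-ubuntu22.04-x86_64
sudo ./mlnxofedinstall --add-kernel-support
sudo /etc/init.d/openibd restart

# Verify the installation
ofed_info -s    # installed OFED version
ibv_devinfo     # RNICs visible to the verbs stack (e.g., mlx5_0)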

Install CUDA Drivers and Toolkits

I won't cover how to install these, as there are already many tutorials on this topic on the Internet. I recommend checking the official website and installing the packages it provides.

Note that installing the CUDA driver through apt and the toolkit through conda separately is NOT recommended.
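As a quick sanity check that the driver and toolkit are consistent (assuming both came from the same installer), compare the versions they report:

nvidia-smi       # driver version and the highest CUDA version the driver supports
nvcc --version   # toolkit version; it should not exceed what nvidia-smi reports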

Verify Installation

Once you have installed them properly, run the command nvidia-smi topo -m and you should see something like:

$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    NIC0    NIC1    CPU Affinity    NUMA Affinity
GPU0     X      NV4     NV4     NV4     SYS     SYS     0-127           N/A
GPU1    NV4      X      NV4     NV4     SYS     SYS     0-127           N/A
GPU2    NV4     NV4      X      NV4     PHB     PHB     0-127           N/A
GPU3    NV4     NV4     NV4      X      SYS     SYS     0-127           N/A
NIC0    SYS     SYS     PHB     SYS      X      PIX
NIC1    SYS     SYS     PHB     SYS     PIX      X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0
NIC1: mlx5_1

We will discuss what this output represents in the following section. For now, you should be able to identify both your GPUs and NICs from this output.
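If you want to map the NIC names in this table to physical devices, sysfs is one way to do it (the device names below match the example output; yours may differ):

ls /sys/class/infiniband/                        # e.g., mlx5_0  mlx5_1
readlink -f /sys/class/infiniband/mlx5_0/device  # PCIe address of the RNIC
ls /sys/class/infiniband/mlx5_0/device/net       # its network interface (for RoCE)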

Check GDR Hardware Support

Not every system is created equal. Continuing the above example, we can see several types of relationships between an individual GPU and NIC, such as SYS and PHB. In fact, they greatly affect GDR performance.

From my experience, I believe:

  • GDR Performance is good: PIX, PXB
  • GDR Performance is likely to be good: PHB
    • Might be as good as PIX and PXB
  • GDR Performance is bad: SYS, NODE

My benchmark results for GDR performance on a 100 Gbps RoCE network:

  • Dual Intel Xeon 4112 + NVIDIA Tesla V100
    • SYS: ~2 GB/s
    • PIX: ~10 GB/s
  • Single AMD EPYC 7763 + NVIDIA Tesla A100
    • SYS: ~6 GB/s
    • PHB: ~10 GB/s
  • Dual Intel Xeon E5-2630 v4 + NVIDIA Tesla P100
    • SYS: ~0.3 GB/s

Here is a discussion about PHB and a description of the P2P Level.

(Updated on Jun 9, 2023) Here is a systematic introduction to PCIe Affinity.

If you unfortunately see some SYS or NODE relationships, they can possibly be corrected by plugging your GPU or NIC into more appropriate PCIe slots (see the snippet after the list below).

  • On a multi-socket system, certain PCIe slots are physically connected to one CPU socket (package) while the other slots are connected to other sockets.
  • On a system with a chiplet-based CPU (e.g., AMD EPYC), certain PCIe slots might be physically connected to one chiplet.
    • That is still the case for the AMD EPYC 7003 series, which has a separate I/O die.
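To figure out where a device actually sits before re-plugging anything, you can check its NUMA node and its position in the PCIe tree; the PCIe address below is just an example:

cat /sys/bus/pci/devices/0000:3b:00.0/numa_node  # -1 means the kernel has no NUMA info for it
lspci -tv | less                                 # PCIe tree; devices under the same bridge can reach PIX/PXB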

Load NVIDIA Peer Memory Kernel Module

The nvidia_peermem module is bundled with the CUDA Toolkit downloaded from here. By default, this kernel module is not loaded automatically, so we need to load it manually with the command sudo modprobe nvidia_peermem.

To check that the module is loaded correctly, execute the command lsmod | grep nvidia_peermem and see if the module name appears in the output.
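Note that modprobe only loads the module for the current boot. To load it automatically on every boot on a systemd-based distro, you can register it with systemd-modules-load (the file name below is my choice; any .conf name works):

echo nvidia_peermem | sudo tee /etc/modules-load.d/nvidia-peermem.conf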

If you cannot find this kernel module on your system, consider installing the latest CUDA Toolkit.

There is another version of this module, called nv_peer_mem, that you can find in this repo, but it appears to be no longer maintained.

Disable PCIe ACS

Many reports ([1], [2]) have mentioned that PCIe ACS may hurt GDR performance. PCIe ACS is a security feature, but security rarely matters when we are hungry for performance. Here is a script to disable it. Note that this script is only for your reference; you may need to modify it according to your machine's configuration.

#!/bin/bash

# Author: OmniReduce Team
# This script applies what was mentioned in:
# https://forums.developer.nvidia.com/t/multi-gpu-peer-to-peer-access-failing-on-tesla-k80/39748/9
# to disable PCIe ACS

echo "Before==========================="
sudo lspci -vvv | grep -i acsctl
echo "================================="

# Find every PLX PCIe bridge; adjust the grep pattern for your switch vendor
pcis=$(lspci | grep -i plx | cut -d' ' -f1 | tr '\r\n' ' ')
echo "Disabling ACS on $pcis"

for pci in $pcis
do
    # f2a.w is the ACS Control register offset on these PLX bridges;
    # writing 0 clears every ACS capability bit
    sudo setpci -s $pci f2a.w=0000
done

echo "After============================"
sudo lspci -vvv | grep -i acsctl
echo "================================="

echo "Make sure all ACS features are disabled, i.e., ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-"

Verification

By now, GDR is supposed to work. To verify it, I recommend using OFED PerfTest.

  • DO NOT use the PerfTest binaries provided by OFED or APT. You should compile PerfTest yourself, because the binary distribution is not built with GDR (CUDA) support.
  • Both the client and the server are capable of utilizing GDR.
  • ib_send_bw has some bugs in GDR tests.
# Compile with CUDA support
./autogen.sh && ./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h && make -j

# Launch as server
./ib_write_bw -d <ib_dev> --use_cuda=<gpu index> -a
./ib_write_bw -d mlx5_0 --use_cuda=0 -a                # e.g., GPU0 and mlx5_0

# Launch as client
./ib_write_bw -d <ib_dev> --use_cuda=<gpu index> -a <server ip addr>
./ib_write_bw -d mlx5_0 --use_cuda=0 -a 10.200.0.10    # e.g., GPU0 and mlx5_0
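As a sanity check, you can run the same test against host memory (just drop --use_cuda) and compare. On a healthy PIX/PXB topology, the GDR numbers should be close to the host-memory numbers:

# Baseline without GDR, same NIC (server, then client)
./ib_write_bw -d mlx5_0 -a
./ib_write_bw -d mlx5_0 -a 10.200.0.10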

If you would like to test with NCCL, I recommend referring to this article.
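For a quick confirmation that NCCL is actually using GDR, a sketch with nccl-tests looks like the following; the hostnames and binary path are examples:

# Assumes nccl-tests is built and both hosts are reachable via MPI
mpirun -np 2 -H host1,host2 -x NCCL_DEBUG=INFO \
    ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 1
# GDR is active if the log shows transfers "via NET/IB/0/GDRDMA";
# a plain "via NET/IB/0" means NCCL fell back to staging through host memory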

Troubleshooting

If you still encounter errors like ibv_create_qp failed or ibv_reg_mr failed, they might be caused by Linux user limits (ulimit).

A quick and dirty way to temporarily fix this issue is to run the program as the root user. Once this dirty fix works, you can refer to this article to solve the issue permanently.
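The usual root cause is a small locked-memory limit (RLIMIT_MEMLOCK), since ibv_reg_mr pins pages. A common permanent fix is to raise the limit in /etc/security/limits.conf and log in again:

ulimit -l    # "unlimited" is what you want for RDMA workloads

# Append to /etc/security/limits.conf, then re-login:
*    soft    memlock    unlimited
*    hard    memlock    unlimited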