GPUDirect RDMA (GDR) is an incredible technology that allows remote machines to directly manipulate the local GPU's memory. However, there are not many online resources discussing this technology, so I was very confused when I first encountered issues related to RDMA, especially GDR.
Install RNIC Drivers and Toolkits
In this tutorial, I will use a Mellanox ConnectX RDMA NIC (RNIC) as an example to demonstrate the configuration steps.
Note that some configuration steps are vendor-specific: for another vendor's RNIC, you may need to find an alternative solution if my approach is not applicable to your hardware. Also, I have not tested GDR on NICs made by vendors other than Mellanox. (I suspect only Mellanox's RNICs support GDR.)
For ConnectX RNICs, the corresponding drivers and toolkits are all packed in Mellanox OFED.
Install CUDA Drivers and Toolkits
I won't cover how to install these, as there are already many tutorials on this topic on the Internet. I would recommend checking the official website and installing the packages it provides.
Note that installing the CUDA drivers through `apt` and the Toolkit through `conda` separately is NOT recommended.
Once you have properly installed them, execute the command `nvidia-smi topo -m` and you should see something like:
$ nvidia-smi topo -m
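A typical (abridged) output looks roughly like the following; the exact matrix, device names, and legend wording depend on your driver version and hardware topology, so treat this only as an illustration:

```
        GPU0    NIC0    CPU Affinity    NUMA Affinity
GPU0     X      PIX     0-11            0
NIC0    PIX      X

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
```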
We will discuss what this output represents in the following section. For now, you should be able to identify both your GPUs and NICs from this output.
Check GDR Hardware Support
Not all systems are created equal. Continuing the above example, we can see there are several types of relationships between an individual GPU and NIC, such as `PHB`. In fact, they greatly affect GDR performance.
From my experience, I believe:
- GDR performance is good: `PIX`
- GDR performance is likely to be good: `PXB`, `PHB`
  - Might be as good as `PIX`
- GDR performance is bad: `NODE`, `SYS`
My benchmark results of GDR performance on a 100 Gbps RoCE network:
- Dual Intel Xeon 4112 + NVIDIA Tesla V100
  - `SYS`: ~2 GB/s
  - `PIX`: ~10 GB/s
- Single AMD EPYC 7763 + NVIDIA Tesla A100
  - `SYS`: ~6 GB/s
  - `PHB`: ~10 GB/s
- Dual Intel Xeon E5-2630 v4 + NVIDIA Tesla P100
  - `SYS`: ~0.3 GB/s
(Updated on Jun 9, 2023) Here is a systematic introduction to PCIe Affinity.
If you unfortunately got some `NODE` relationships, they can possibly be corrected by plugging your GPU or NIC into the proper PCIe slots.
- For multi-socket systems, certain PCIe slots are physically connected to one CPU socket (package) while the other slots are connected to the other sockets.
- For systems with a chiplet-based CPU (e.g., AMD EPYC), certain PCIe slots might be physically connected to one chiplet.
  - That is still the case for the AMD EPYC 7003 series, which has a separate I/O die.
Load NVIDIA Peer Memory Kernel Module
The `nvidia_peermem` module is bundled in the CUDA Toolkit downloaded from here. By default, this kernel module will not be loaded automatically, so we can load it manually with the command `sudo modprobe nvidia_peermem`.
To check that this module is loaded correctly, execute the command `lsmod | grep nvidia_peermem` and see whether the module name appears in the output.
If you cannot find this kernel module on your system, consider installing the latest CUDA Toolkit.
There is another version of this module, called `nv_peer_mem`, which you can find on this repo. But it appears to be no longer maintained.
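Putting the above together, a small sketch of loading and verifying the module (the persistence step assumes a systemd-based distro with `modules-load.d`; adjust for your system):

```shell
# Load the module now (requires root).
sudo modprobe nvidia_peermem

# Verify it is loaded; the module name should appear in the output.
lsmod | grep nvidia_peermem

# Optional: load it automatically at boot (assumes systemd's modules-load.d).
echo nvidia_peermem | sudo tee /etc/modules-load.d/nvidia_peermem.conf
```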
Disable PCIe ACS
Many reports have mentioned that PCIe ACS may hurt GDR performance. PCIe ACS is a security feature, but we rarely care about security when we are hungry for performance. Here is the script to disable it. Note that this script is only for your reference; you may need to modify it according to your machine's configuration.
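As a sketch of what such a script typically does (not the author's original): it clears the ACS Control register on every PCI bridge via `setpci`. The `ECAP_ACS` capability name comes from pciutils; this requires root and does not persist across reboots.

```shell
#!/bin/bash
# Sketch: clear the ACS Control register (ECAP_ACS+0x6) on every PCI bridge.
# Run as root; the setting reverts after a reboot.
for bdf in $(lspci -d "::0604" | awk '{print $1}'); do
    # Only touch bridges that actually expose the ACS capability.
    if lspci -s "$bdf" -vvv 2>/dev/null | grep -q "Access Control Services"; then
        setpci -s "$bdf" ECAP_ACS+0x6.w=0x0000
        echo "Disabled ACS on $bdf"
    fi
done
```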
By now, GDR is supposed to work. To verify it, I would recommend using OFED PerfTest.
- DO NOT use the PerfTest provided by OFED or APT. You should compile PerfTest yourself because the binary distribution doesn't support GDR
- Both the client and the server are capable of utilizing GDR
- `ib_send_bw` has some bugs in GDR tests
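As a sketch of compiling PerfTest with GDR support and running a bandwidth test (the repository URL and CUDA include path are assumptions; adjust them to your environment):

```shell
# Build perftest with CUDA support, which enables the --use_cuda flag.
git clone https://github.com/linux-rdma/perftest.git
cd perftest
./autogen.sh
./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h
make -j

# Server side: bandwidth test with the buffer allocated on GPU 0.
./ib_write_bw -d mlx5_0 --use_cuda=0

# Client side: connect to the server, also using GPU 0 locally.
./ib_write_bw -d mlx5_0 --use_cuda=0 <server_hostname>
```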
If you would like to test with NCCL, I recommend referring to this article.
If you still encounter errors like `ibv_create_qp failed` or `ibv_reg_mr failed`, they might be caused by Linux user limits (ulimit).
A quick and dirty way to temporarily work around this issue is to run the program as the root user. Once this dirty fix works, you can refer to this article to solve the issue permanently.
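Concretely, the limit that usually matters here is `memlock`, the amount of memory a process may pin, which `ibv_reg_mr` relies on. A sketch of checking and raising it (the `limits.conf` lines are the standard PAM mechanism, but verify against your distro):

```shell
# Show the current locked-memory (memlock) soft limit for this shell.
# "unlimited" is what you want for RDMA memory registration.
ulimit -l

# Temporary fix for the current shell (only works up to the hard limit):
#   ulimit -l unlimited
# Permanent fix: add these lines to /etc/security/limits.conf and re-login:
#   * soft memlock unlimited
#   * hard memlock unlimited
```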