Everything you need to know about Splitting NCCL Communicators

MPI allows you to create a new communicator by splitting an existing one into sub-communicators, so a program can dynamically select a subset of computing nodes to take part in collective communication operations such as all-reduce and all-gather. NCCL has a similar feature, but it is not well documented yet.
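
For reference, this is roughly what the splitting looks like on the MPI side. The sketch below uses MPI_Comm_split to divide four ranks into two groups of two; the color/key choices are illustrative, not taken from any particular program.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Ranks sharing the same color end up in the same sub-communicator;
    // here ranks 0-1 form one group and ranks 2-3 another.
    int color = world_rank / 2;
    MPI_Comm sub_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &sub_comm);

    int sub_rank, sub_size;
    MPI_Comm_rank(sub_comm, &sub_rank);
    MPI_Comm_size(sub_comm, &sub_size);
    printf("world rank %d -> sub rank %d of %d\n", world_rank, sub_rank, sub_size);

    MPI_Comm_free(&sub_comm);
    MPI_Finalize();
    return 0;
}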

TL;DR

Since NCCL relies on MPI to run across multiple nodes, the following example is based on the MPI programming model. Assume there are 4 CUDA GPUs and 4 corresponding MPI ranks. The code performs an all-reduce within the first two ranks and, simultaneously, within the last two ranks.

#include <nccl.h>
#include <mpi.h>
#include <stdio.h>
#include <stdint.h>
#include <assert.h>
#include <thrust/device_ptr.h>
#include <thrust/fill.h>

int main() {
    MPI_Init(NULL, NULL);
    int world_size, world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    assert(world_size == 4);
    cudaSetDevice(world_rank); // GPU N binds to MPI rank N

    ncclUniqueId nccl_id, nccl_ids[4];
    size_t id_size = sizeof(ncclUniqueId);

    /* Generate unique IDs */
    // ncclUniqueId is a plain struct of exactly 128 bytes,
    // so it can be transferred over MPI as raw bytes.
    ncclGetUniqueId(&nccl_id);
    MPI_Allgather(&nccl_id, id_size, MPI_UINT8_T,
                  &nccl_ids[0], id_size, MPI_UINT8_T, MPI_COMM_WORLD);

    /* Create a sub-communicator */
    ncclComm_t nccl_comm;

    if (world_rank <= 1) {
        // Ranks 0-1 join a 2-rank communicator built from rank 0's ID.
        ncclCommInitRank(&nccl_comm, 2, nccl_ids[0], world_rank);
    } else {
        // Ranks 2-3 join a 2-rank communicator built from rank 2's ID.
        ncclCommInitRank(&nccl_comm, 2, nccl_ids[2], world_rank - 2);
    }

    /* Test */
    constexpr size_t N = (size_t)1e3;
    constexpr size_t arr_size = sizeof(int64_t) * N;
    void *arr, *arr_host;
    cudaMalloc(&arr, arr_size);
    cudaMallocHost(&arr_host, arr_size);

    /* Initialize the array on the local GPU */
    thrust::device_ptr<int64_t> arr_ptr((int64_t*)arr);
    thrust::fill(arr_ptr, arr_ptr + N, (int64_t)world_rank);

    ncclAllReduce(arr, arr, N, ncclInt64, ncclSum, nccl_comm, NULL);
    // cudaMemcpy on the default stream implicitly waits for the all-reduce.
    cudaMemcpy(arr_host, arr, arr_size, cudaMemcpyDeviceToHost);
    printf("[rank%d] result: %ld\n", world_rank, ((int64_t*)arr_host)[0]);

    ncclCommDestroy(nccl_comm);
    cudaFree(arr);
    cudaFreeHost(arr_host);
    MPI_Finalize();
    return 0;
}

This code can be compiled and run on my machine with these commands,

nvcc -ccbin mpic++ test.cu -o test -L/usr/local/cuda/lib -lnccl
mpirun -n 4 ./test

Note: using nvcc to compile MPI code is not common practice. It is recommended to compile with the mpic++ wrapper from a CUDA-aware MPI distribution instead.

The output of this program should be,

[rank0] result: 1 # 0 + 1 = 1
[rank1] result: 1
[rank2] result: 5 # 2 + 3 = 5
[rank3] result: 5

The key is ncclCommInitRank. If only a subset of ranks initializes a communicator with the same unique ID (generated by one of them), the resulting communicator covers exactly that subset and ignores every rank outside it.
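
As a sketch of how this generalizes to arbitrary groupings (the names color, group_comm, and group_id below are mine, not part of any API): split MPI_COMM_WORLD first, let rank 0 of each group generate the unique ID, broadcast it within the group, and initialize NCCL with the group size and group rank.

// Sketch: split the world into groups by `color`, then build one NCCL
// communicator per group. Assumes one GPU has already been bound per rank.
int color = world_rank / 2;              // e.g. two ranks per group
MPI_Comm group_comm;
MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &group_comm);

int group_rank, group_size;
MPI_Comm_rank(group_comm, &group_rank);
MPI_Comm_size(group_comm, &group_size);

ncclUniqueId group_id;
if (group_rank == 0) ncclGetUniqueId(&group_id);  // one ID per group
MPI_Bcast(&group_id, sizeof(group_id), MPI_UINT8_T, 0, group_comm);

ncclComm_t group_nccl_comm;
ncclCommInitRank(&group_nccl_comm, group_size, group_id, group_rank);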

Usage of ncclCommInitRank

Official API explanation:

ncclResult_t ncclCommInitRank(ncclComm_t *comm, int nranks, ncclUniqueId commId, int rank)

Creates a new communicator (multi thread/process version). rank must be between 0 and nranks-1 and unique within a communicator clique. Each rank is associated to a CUDA device, which has to be set before calling ncclCommInitRank. ncclCommInitRank implicitly synchronizes with other ranks, so it must be called by different threads/processes or use ncclGroupStart/ncclGroupEnd.
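
The last sentence matters when a single process drives several GPUs: the per-device calls must be wrapped in ncclGroupStart/ncclGroupEnd, otherwise the first ncclCommInitRank would block waiting for the others. A minimal sketch, assuming one process owns ngpus local devices and already holds the unique ID in nccl_id:

// Sketch: one process initializing NCCL ranks for several local GPUs.
const int ngpus = 2;                     // illustrative
ncclComm_t comms[ngpus];
ncclGroupStart();
for (int i = 0; i < ngpus; ++i) {
    cudaSetDevice(i);                    // bind device i before its init call
    ncclCommInitRank(&comms[i], ngpus, nccl_id, i);
}
ncclGroupEnd();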

In addition to the official documentation, we should also know the following,

  • Each unique ID should only be used once.
  • ncclGetUniqueId can be invoked multiple times and returns a different unique ID each time; the unique IDs generated earlier remain valid.
  • It is safe to communicate within disjoint subsets of nodes simultaneously.
  • Using NCCL to perform inter-GPU communication concurrently with CUDA-aware MPI may create deadlocks.

Performance

I also evaluated the performance impact of sub-grouping.

The testbed is,

  • AWS g4dn.metal instance with 8x NVIDIA Tesla T4 GPUs.
  • Shipped with the AWS Deep Learning AMI
    • OS: Ubuntu 18.04 (Kernel Version: Linux 5.4)
    • CUDA Toolkit: 11.0 (Driver Version: 450.119.03)

First of all, I would like to emphasize the GPU topology of this bare-metal machine.

Note: we should extract topology information from physical machines rather than virtual machines, since the hypervisor may obscure the result for security reasons.

# nvidia-smi topo -m
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  CPU Affinity  NUMA Affinity
GPU0   X    PHB   NODE  NODE  SYS   SYS   SYS   SYS   0-23,48-71    0
GPU1  PHB    X    NODE  NODE  SYS   SYS   SYS   SYS   0-23,48-71    0
GPU2  NODE  NODE   X    PHB   SYS   SYS   SYS   SYS   0-23,48-71    0
GPU3  NODE  NODE  PHB    X    SYS   SYS   SYS   SYS   0-23,48-71    0
GPU4  SYS   SYS   SYS   SYS    X    PHB   NODE  NODE  24-47,72-95   1
GPU5  SYS   SYS   SYS   SYS   PHB    X    NODE  NODE  24-47,72-95   1
GPU6  SYS   SYS   SYS   SYS   NODE  NODE   X    PHB   24-47,72-95   1
GPU7  SYS   SYS   SYS   SYS   NODE  NODE  PHB    X    24-47,72-95   1

It looks like a balanced tree topology. We can expect two neighboring GPUs under the same PCIe switch to communicate more efficiently.

UPI
|--CPU0
|  |--PCIe Switch
|  |  |--GPU0
|  |  |--GPU1
|  |--PCIe Switch
|     |--GPU2
|     |--GPU3
|--CPU1
   |--PCIe Switch
   |  |--GPU4
   |  |--GPU5
   |--PCIe Switch
      |--GPU6
      |--GPU7

The results below are measured on the root rank, and each experiment is repeated 5 times. The environment variable CUDA_VISIBLE_DEVICES was set to reorder the GPUs bound to MPI ranks, and CPU binding was left unset.
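
For reference, a minimal sketch of how each data point could be timed (my own cudaEvent-based measurement around ncclAllReduce; not the exact benchmark code used here):

// Sketch: time one all-reduce on the root rank with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
ncclAllReduce(arr, arr, N, ncclInt64, ncclSum, nccl_comm, NULL);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("[rank%d] all-reduce of %zu int64 elements took %.3f ms\n", world_rank, N, ms);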

The notation for communicators means,

  • 0/1: A single communicator performing all-reduce on physical GPUs 0 and 1.
  • 0/1 + 2/3: Two communicators working at the same time, each performing all-reduce on two GPUs independently.
  • 0-7: Equivalent to 0/1/2/.../6/7.

[Figures: measured all-reduce performance for each communicator configuration]

From the results above, we can conclude that,

  • The GPUs are working in PCIe Gen3 x8 mode, as each PCIe switch splits one PCIe x16 slot into two x8 slots.
    • Double-checked with nvidia-smi --query-gpu=pcie.link.gen.current --format=csv and sudo lspci -vvv.
  • The GPU topology significantly affects all-reduce performance.
    • The topology adopted by NVIDIA DGX systems should accelerate collective communication operations noticeably.
  • The interference between two concurrent communicators is not significant.
  • The UPI bus is not a bottleneck when two PCIe Gen3 x16 devices (the PCIe switches) transmit large data chunks across it.
