Enable L3 PFC + DCQCN for RoCE on Mellanox ConnectX NICs

RoCE, a high-performance implementation of RDMA over Ethernet, offloads flow control and congestion control to the NIC hardware. To remain simple enough to implement in hardware, these algorithms assume a lossless network, so we should do our best to avoid packet loss and keep the fabric lossless. DCQCN (congestion control) combined with PFC (flow control) is a common choice in many data centers; we observed severe performance fluctuations in our system when they were disabled.

(Updated on Mar 31, 2024) You may not need PFC on modern NICs (ConnectX-6 or newer), which support NVIDIA RTTCC congestion control that does not rely on PFC or ECN.

Discussion

L2 PFC vs. L3 PFC

Nowadays it is recommended to set up L3 PFC rather than L2 PFC, largely because L3 PFC is easier to deploy: it classifies traffic by the DSCP field in the IP header, which does not require VLAN tagging and survives L3 routing, whereas L2 PFC relies on the 802.1p PCP bits carried in the VLAN tag.

Note: 802.1p / PCP refers to L2 PFC; DSCP refers to L3 PFC.
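
To see which mode a NIC is currently in, you can inspect the priority trust state reported by mlnx_qos; a minimal check, assuming $IF_NAME is set to your interface name as described in the Steps section below:

# "pcp" means L2 PFC is in effect, "dscp" means L3 PFC
sudo mlnx_qos -i $IF_NAME | grep "trust"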

Terminologies

  • Type of Service (ToS): A value used to distinguish applications; RoCE is one such application.
  • DSCP / dot1p tag: A value embedded in the packet header (DSCP in the IP header, dot1p/PCP in the VLAN tag).
  • Buffer: Receive buffer (SRAM) on the NIC; it can be partitioned into multiple regions.
  • Traffic Class (TC): An intermediate value between PFC Priority and Queue ID.
  • Queue: Send queue on the NIC.
  • Note: Service Level (SL) is an InfiniBand concept and is not related to RoCE networks.
     >>2             Map                 Map
TOS ─────► DSCP Tag ─────► PFC Priority ─────► TC
                                │              │
                                │ Map          │ =
                                │              │
                                ▼              ▼
                            BufferID        QueueID

Fig. Numerical Relationship between terminologies
for Mellanox ConnectX NICs.
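
As a quick sanity check of the >>2 relationship above, the arithmetic can be reproduced in a shell (using the DSCP 26 / ToS 106 convention adopted later in this post; the ECN bits shown are an assumption for illustration):

echo $(( 106 >> 2 ))       # 26: DSCP value derived from ToS 106
echo $(( (26 << 2) | 2 ))  # 106: DSCP 26 shifted left by 2, with ECN bits 0b10 (ECT(0))
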
               ┌─────────────────────────────┐
┌──────────┐   │ NIC                         │
│ TCP App  │   │   ┌────────┐   ┌───────┐    │
│ ToS=0    ├───┼──►│ Queue0 ├──►│       │    │   ┌─────────┐   ┌─────────┐
└──────────┘   │   └────────┘   │ Sche  │    │   │ Packet  │   │ Packet  │
               │      ...       │ duler ├────┼──►│ DSCP=0  │   │ DSCP=26 │
┌──────────┐   │   ┌────────┐   │       │    │   └─────────┘   └─────────┘
│ RDMA App ├───┼──►│ Queue3 ├──►│       │    │
│ ToS=106  │   │   └────────┘   └───────┘    │
└──────────┘   │                             │
               └─────────────────────────────┘

Fig. Egress procedure of Mellanox ConnectX NICs.
              ┌──────────────────────────────┐   ┌────────┐
┌─────────┐   │ NIC                          │   │ System │
│ Packet  │   │   ┌─────────┐   ┌────────┐   │   │ Memory │
│ DSCP=0  ├───┼──►│ Buffer0 ├──►│        │   │   │        │
└─────────┘   │   └─────────┘   │  DMA   │   │   │        │
              │                 │ Engine ├───┼──►│        │
┌─────────┐   │   ┌─────────┐   │        │   │   │        │
│ Packet  ├───┼──►│ Buffer1 ├──►│        │   │   │        │
│ DSCP=26 │   │   └─────────┘   └────────┘   │   │        │
└─────────┘   │                              │   │        │
              └──────────────────────────────┘   └────────┘

Fig. Ingress procedure of Mellanox ConnectX NICs.

Prerequisite

If network traffic goes through network switches, ensure L3 PFC and ECN are enabled on switches. We omitted the detailed steps to configure switches here as they depend on the particular switch SKU and OS.

Note: we can set up a direct (back-to-back) connection between two NICs to test whether the NIC-side configuration works, without involving any switch.
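
For example, a back-to-back RDMA bandwidth test can be run with the perftest suite; a sketch, assuming perftest is installed and reusing the device name and IP address that appear in the examples below:

# on one host (server side)
ib_write_bw -d mlx5_1 --report_gbits
# on the other host (client side), pointing at the server's IP
ib_write_bw -d mlx5_1 --report_gbits 10.200.0.14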

Steps

Enable DCQCN

DCQCN is enabled by default on Mellanox ConnectX NICs, but we need to ensure ECN marking is enabled on all network switches.
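
If you want to double-check or explicitly re-enable DCQCN for a given priority on the NIC itself, the mlx5 driver exposes per-priority knobs under sysfs; a sketch for priority 3, assuming $IF_NAME is set to the interface name identified in the next step and that your driver exposes these entries as writable:

# enable the DCQCN reaction point (RP) and notification point (NP) on priority 3
echo 1 | sudo tee /sys/class/net/$IF_NAME/ecn/roce_rp/enable/3
echo 1 | sudo tee /sys/class/net/$IF_NAME/ecn/roce_np/enable/3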

Identify Interface Name and Device Name

We can check the interface name and device name with show_gids, which should output something like below.

DEV     PORT  INDEX  GID                                      IPv4         VER  DEV
---     ----  -----  ---                                      ------------ ---  ---
mlx5_1  1     3      0000:0000:0000:0000:0000:ffff:0ac8:000e  10.200.0.14  v2   enp216s0f1np1

Here mlx5_1 is the device name, which is commonly used to refer to the RDMA device, and enp216s0f1np1 is the interface name, which Linux manages as an ordinary Ethernet interface. We can save them as environment variables.

export IF_NAME=enp216s0f1np1
export DEV_NAME=mlx5_1
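
Alternatively, ibdev2netdev (shipped with Mellanox OFED) prints the same device-to-interface mapping; the exact output format may vary across driver versions:

ibdev2netdev
# example output:
# mlx5_1 port 1 ==> enp216s0f1np1 (Up)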

Tune PFC Headroom Size

We should reserve enough buffer space to store in-flight packets, since senders (or requesters) take some time (latency) to react to PFC pause frames. This latency is affected by the cable length (i.e., the propagation delay). Mellanox requires us to set the cable length manually, and the NIC then calculates the appropriate headroom size automatically.

Fortunately, the cable length is recorded in the transceiver’s EEPROM.

sudo mlxlink -d $DEV_NAME -m -c -e

From the output, we can find the cable length.

Module Info
-----------
Identifier            : QSFP28
Compliance            : 100GBASE-SR4 or 25GBASE-SR
Cable Technology      : 850 nm VCSEL
Cable Type            : Optical Module (separated)
OUI                   : Other
Vendor Name           : ...
Vendor Part Number    : ...
Vendor Serial Number  : ...
Rev                   : A0
Wavelength [nm]       : 850
Transfer Distance [m] : 50      # here

Then, apply this parameter to our QoS setting.

sudo mlnx_qos -i $IF_NAME --cable_len=50
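
To confirm the value was applied, mlnx_qos reports it back as "Cable len" (see also the full output in the Verification section):

sudo mlnx_qos -i $IF_NAME | grep -i "cable"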

Enable L3 PFC

Execute the following commands to activate PFC and apply the PFC settings to RoCE traffic. Note: the configuration is NOT persistent, so you may need to re-run all of the commands (including the ones above) each time the machine reboots. Also, ensure PFC is enabled for DSCP 26 traffic on all network switches.

(Updated on Mar 31, 2024) You may need to use a different PFC priority or DSCP value depending on the configuration/restriction of network switches. Conventionally, we use Priority 3 for lossless networks.

# use L3 PFC, default=pcp (L2 PFC)
sudo mlnx_qos -i $IF_NAME --trust dscp

# enable PFC on PFC Priority 3
sudo mlnx_qos -i $IF_NAME --pfc 0,0,0,1,0,0,0,0

# clear Traffic Class (TC) settings
echo "tclass=-1" | sudo tee /sys/class/infiniband/$DEV_NAME/tc/1/traffic_class

# set default ToS for RoCE traffic via the device sysfs (ToS 106 = DSCP 26 << 2, plus ECN bits)
echo 106 | sudo tee /sys/class/infiniband/$DEV_NAME/tc/1/traffic_class

# set default ToS for RoCE connections established through RDMA CM
sudo cma_roce_tos -d $DEV_NAME -t 106
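
Since the configuration is not persistent, it may be convenient to collect the commands into a small script and re-run it (e.g., from a boot-time service) after each reboot. A sketch to be run as root, using the interface, device, cable length, and ToS values from this post:

#!/bin/bash
# reapply L3 PFC + RoCE ToS settings (values as used in this post)
set -e
IF_NAME=enp216s0f1np1
DEV_NAME=mlx5_1

mlnx_qos -i $IF_NAME --trust dscp             # L3 PFC (trust DSCP)
mlnx_qos -i $IF_NAME --pfc 0,0,0,1,0,0,0,0    # PFC on priority 3
mlnx_qos -i $IF_NAME --cable_len=50           # PFC headroom from cable length
echo "tclass=-1" > /sys/class/infiniband/$DEV_NAME/tc/1/traffic_class   # clear TC settings
echo 106 > /sys/class/infiniband/$DEV_NAME/tc/1/traffic_class           # default ToS (DSCP 26)
cma_roce_tos -d $DEV_NAME -t 106              # default ToS for RDMA CM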

Verification

Show PFC Setting

sudo mlnx_qos -i $IF_NAME
DCBX mode: OS controlled
Priority trust state: dscp
dscp2prio mapping:
        prio:0 dscp:07,06,05,04,03,02,01,00,
        prio:1 dscp:15,14,13,12,11,10,09,08,
        prio:2 dscp:23,22,21,20,19,18,17,16,
        prio:3 dscp:31,30,29,28,27,26,25,24,
        prio:4 dscp:39,38,37,36,35,34,33,32,
        prio:5 dscp:47,46,45,44,43,42,41,40,
        prio:6 dscp:55,54,53,52,51,50,49,48,
        prio:7 dscp:63,62,61,60,59,58,57,56,
default priority:
Receive buffer size (bytes): 20016,156096,0,0,0,0,0,0,total_size=1027728
Cable len: 7
PFC configuration:
        priority    0   1   2   3   4   5   6   7
        enabled     0   0   0   1   0   0   0   0
        buffer      0   0   0   1   0   0   0   0
tc: 1 ratelimit: unlimited, tsa: vendor
        priority:  0
tc: 0 ratelimit: unlimited, tsa: vendor
        priority:  1
tc: 2 ratelimit: unlimited, tsa: vendor
        priority:  2
tc: 3 ratelimit: unlimited, tsa: vendor
        priority:  3
tc: 4 ratelimit: unlimited, tsa: vendor
        priority:  4
tc: 5 ratelimit: unlimited, tsa: vendor
        priority:  5
tc: 6 ratelimit: unlimited, tsa: vendor
        priority:  6
tc: 7 ratelimit: unlimited, tsa: vendor
        priority:  7
         >>2            Map               Map
TOS 106 ─────► DSCP 26 ─────► PFC Prio 3 ─────► TC 3
                                   │             │
                                   │ Map         │ =
                                   │             │
                                   ▼             ▼
                               Buffer 1       Queue 3

Check DCQCN is functioning

# Check DCQCN is enabled on Prio 3
cat /sys/class/net/$IF_NAME/ecn/roce_np/enable/3
cat /sys/class/net/$IF_NAME/ecn/roce_rp/enable/3

# Check counters related to DCQCN
cat /sys/class/infiniband/$DEV_NAME/ports/1/hw_counters/np_cnp_sent
cat /sys/class/infiniband/$DEV_NAME/ports/1/hw_counters/np_ecn_marked_roce_packets
cat /sys/class/infiniband/$DEV_NAME/ports/1/hw_counters/rp_cnp_handled

Note: two cases trigger the notification point (NP) to send CNP packets:

  • The NP’s NIC receives a packet with an ECN mark (set by the switch to indicate that the switch’s buffer is close to running out of capacity).
  • The NP’s NIC receives an out-of-order packet (i.e., packet loss has occurred).
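
To see whether CNPs are actually being generated and handled while traffic is running, it can help to poll these counters periodically; a minimal sketch:

cd /sys/class/infiniband/$DEV_NAME/ports/1/hw_counters
watch -n 1 "grep . np_cnp_sent np_ecn_marked_roce_packets rp_cnp_handled"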

Check PFC is functioning

ethtool -S $IF_NAME | grep prio3
rx_prio3_bytes: 462536457742
rx_prio3_packets: 500098087
rx_prio3_discards: 0
tx_prio3_bytes: 1180912512618
tx_prio3_packets: 1155032777
rx_prio3_pause: 214479
rx_prio3_pause_duration: 213496
tx_prio3_pause: 12
tx_prio3_pause_duration: 13
rx_prio3_pause_transition: 107222

Note: tx_prio3_pause is the number of PFC pause frames sent by this NIC because the host cannot absorb incoming traffic quickly enough, while rx_prio3_pause counts pause frames received from the link peer asking this NIC to slow down.
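
Likewise, the pause counters can be polled while a workload is running; steadily increasing values indicate PFC is actively being exercised (a sketch):

watch -n 1 "ethtool -S $IF_NAME | grep prio3_pause"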

Miscellaneous

Advanced QoS Settings

The default values of the parameters below should be fine…

# mapping Priority to TC
sudo mlnx_qos -i $IF_NAME --prio_tc=0,1,2,3,4,5,6,7

# mapping Priority to receive buffers
sudo mlnx_qos -i $IF_NAME --prio2buffer=0,0,0,1,0,0,0,0

# adjust receive buffer sizes
sudo mlnx_qos -i $IF_NAME --buffer_size=20016,156096,0,0,0,0,0,0

## Set alternative TSA
# set to vendor (default)
sudo mlnx_qos -i $IF_NAME --tsa=vendor,vendor,vendor,vendor,vendor,vendor,vendor,vendor
# set to ets
sudo mlnx_qos -i $IF_NAME --tsa=ets,ets,ets,ets,ets,ets,ets,ets --tcbw=0,0,0,100,0,0,0,0

# Remap DSCP value to specified PFC Priority
# Map DSCP 3 to Prio 3
sudo mlnx_qos -i $IF_NAME --dscp2prio='set,3,3'

Performance Counters on NICs

Refer to this: https://enterprise-support.nvidia.com/s/article/understanding-mlx5-linux-counters-and-status-parameters.

Failed to map RoCE traffic to specified PFC Priority

  • Consider upgrading OFED drivers.
  • ibv_qp_attr.ah_attr.grh.traffic_class may override the default ToS value.
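
Beyond these, a quick sanity check is to read back the per-device default configured earlier (the exact output format may vary by driver version):

# read back the default traffic class / ToS applied to RoCE traffic
cat /sys/class/infiniband/$DEV_NAME/tc/1/traffic_class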

Configuration for BlueField DPU

QoS settings can be configured from the host using mlnx_qos, but Traffic Class values must be set separately on the DPU side.

References