Enable L3 PFC + DCQCN for RoCE on Mellanox ConnectX NICs

RoCE, a high-performance implementation of RDMA, offloads flow control and congestion control algorithms to hardware to achieve high performance. However, these algorithms assume a lossless network so that they can remain simple enough to implement in hardware. We should therefore mitigate packet loss and keep the network as close to lossless as possible. DCQCN (congestion control) + PFC (flow control) is a common option in many data centers. We observed that our system suffered severe performance fluctuation when they were disabled.

Discussion

L2 PFC vs. L3 PFC

Nowadays it is recommended to set up L3 PFC rather than L2 PFC, likely because L3 PFC is easier to deploy: it classifies traffic by the DSCP field in the IP header, so it does not require VLAN (802.1Q) tagging, and the marking survives routed (L3) hops.

Note: 802.1p / PCP refers to L2 PFC; DSCP refers to L3 PFC.

Terminology

  • Service Level (SL): a concept in InfiniBand networks; not relevant to RoCE networks.
  • Type of Service (ToS): a field in the IP header used to label RoCE traffic.
  • DSCP / dot1p value: a value embedded in the packet header (the IP header for DSCP, the VLAN tag for dot1p).
  • Traffic Class (TC): distinguishes the internal queues of the NIC.
  • Priority: an intermediate value that bridges DSCP / dot1p to TC, and also maps DSCP / dot1p to receive buffers.
Egress:
ToS -----------------------> DSCP / dot1p
     (ToS >> 2 = DSCP value)

Ingress:
DSCP / dot1p ----------------> Priority ----------------> Traffic Class
              (mapping table)           (mapping table)
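
As a quick sanity check of the arithmetic above, the sketch below derives the DSCP value and the default Priority from ToS 106, the value configured later in this guide; the Priority follows from the default dscp2prio mapping shown in the Verification section (Priority = DSCP >> 3).

TOS=106
DSCP=$(( TOS >> 2 ))   # 106 >> 2 = 26
PRIO=$(( DSCP >> 3 ))  # default dscp2prio mapping: prio 3 covers DSCP 24-31
echo "ToS=$TOS -> DSCP=$DSCP -> Priority=$PRIO"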

Note: the definition of terms may vary from vendor to vendor.

Prerequisite

If network traffic goes through network switches, ensure L3 PFC and ECN are enabled on the switches. We omit the detailed steps for configuring switches here, as they depend on the particular switch SKU and OS.

Note: we could set up a direct (back-to-back) connection between two NICs to test whether our NIC configuration works.
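
For such a back-to-back test, a bandwidth benchmark from the perftest package can generate RoCE traffic. A minimal sketch, assuming perftest is installed and $DEV_NAME is set as in the Steps section (the address is the server's IP from the show_gids example below):

# on the server
ib_write_bw -d $DEV_NAME --report_gbits

# on the client
ib_write_bw -d $DEV_NAME --report_gbits 10.200.0.14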

Steps

Enable DCQCN

DCQCN is enabled by default on Mellanox ConnectX NICs, but we need to ensure ECN marking is enabled on all network switches.
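
If DCQCN has been turned off, it can be re-enabled per priority through sysfs; a sketch for Priority 3, using the same paths that appear in the Verification section (np = notification point / receiver, rp = reaction point / sender):

echo 1 | sudo tee /sys/class/net/$IF_NAME/ecn/roce_np/enable/3
echo 1 | sudo tee /sys/class/net/$IF_NAME/ecn/roce_rp/enable/3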

Tune PFC Headroom Size

We should reserve enough buffer space (headroom) to store in-flight packets, since senders (or requestors) take time to react to PFC pause frames. This latency is affected by the cable length (i.e., propagation delay). Mellanox requires us to set the cable length manually; the NIC then automatically calculates the correct headroom size.

Fortunately, the cable length is recorded in the transceiver’s EEPROM.

sudo mlxlink -d $DEV_NAME -m -c -e

From the output, we can find the cable length.

Module Info
-----------
Identifier            : QSFP28
Compliance            : 100GBASE-SR4 or 25GBASE-SR
Cable Technology      : 850 nm VCSEL
Cable Type            : Optical Module (separated)
OUI                   : Other
Vendor Name           : ...
Vendor Part Number    : ...
Vendor Serial Number  : ...
Rev                   : A0
Wavelength [nm]       : 850
Transfer Distance [m] : 50   # <-- the cable length

Then, apply this parameter to our QoS setting.

sudo mlnx_qos -i $IF_NAME --cable_len=50
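
To confirm the value took effect, mlnx_qos reports the configured cable length (see the "Cable len" line in the Verification output):

sudo mlnx_qos -i $IF_NAME | grep 'Cable len'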

Enable L3 PFC

First, we can check the interface name and device name with show_gids, which should output something like the following.

DEV     PORT  INDEX  GID                                      IPv4         VER  DEV
---     ----  -----  ---                                      ----         ---  ---
mlx5_1  1     3      0000:0000:0000:0000:0000:ffff:0ac8:000e  10.200.0.14  v2   enp216s0f1np1

Then, execute the following commands to activate PFC and apply the PFC setting to RoCE traffic.

export IF_NAME=enp216s0f1np1
export DEV_NAME=mlx5_1

# use L3 PFC (trust DSCP); the default is pcp (L2 PFC)
sudo mlnx_qos -i $IF_NAME --trust dscp

# enable PFC on Priority 3
sudo mlnx_qos -i $IF_NAME --pfc 0,0,0,1,0,0,0,0

# clear the Traffic Class (TC) setting
echo "tclass=-1" | sudo tee /sys/class/infiniband/$DEV_NAME/tc/1/traffic_class

# set the default ToS for RoCE traffic
# ToS 106 = DSCP 26 * 4 + ECN bits 0b10; DSCP 26 maps to Priority 3
echo 106 | sudo tee /sys/class/infiniband/$DEV_NAME/tc/1/traffic_class

# set the default ToS for RDMA CM (librdmacm) connections
sudo cma_roce_tos -d $DEV_NAME -t 106
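
We can read the sysfs entry back to confirm the ToS value was applied (the exact output format may vary across driver versions):

cat /sys/class/infiniband/$DEV_NAME/tc/1/traffic_class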

Note: the configuration is NOT persistent. You may need to re-run all the commands above to enable PFC each time the machine reboots. Also, ensure PFC is enabled for DSCP 26 traffic on all network switches.
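
One way to make the settings survive reboots is a oneshot systemd unit that re-runs them at boot; the unit name and script path below are hypothetical, a sketch to adapt:

# /etc/systemd/system/roce-qos.service (hypothetical unit name)
[Unit]
Description=Reapply RoCE PFC/QoS settings
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
# /usr/local/sbin/roce-qos.sh wraps the mlnx_qos / sysfs commands above
ExecStart=/usr/local/sbin/roce-qos.sh

[Install]
WantedBy=multi-user.target

Then enable it with sudo systemctl enable roce-qos.service.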

Verification

Show PFC Setting

sudo mlnx_qos -i $IF_NAME
DCBX mode: OS controlled
Priority trust state: dscp
dscp2prio mapping:
        prio:0 dscp:07,06,05,04,03,02,01,00,
        prio:1 dscp:15,14,13,12,11,10,09,08,
        prio:2 dscp:23,22,21,20,19,18,17,16,
        prio:3 dscp:31,30,29,28,27,26,25,24,
        prio:4 dscp:39,38,37,36,35,34,33,32,
        prio:5 dscp:47,46,45,44,43,42,41,40,
        prio:6 dscp:55,54,53,52,51,50,49,48,
        prio:7 dscp:63,62,61,60,59,58,57,56,
default priority:
Receive buffer size (bytes): 20016,156096,0,0,0,0,0,0,total_size=1027728
Cable len: 7
PFC configuration:
        priority    0   1   2   3   4   5   6   7
        enabled     0   0   0   1   0   0   0   0
        buffer      0   0   0   1   0   0   0   0
tc: 1 ratelimit: unlimited, tsa: vendor
         priority:  0
tc: 0 ratelimit: unlimited, tsa: vendor
         priority:  1
tc: 2 ratelimit: unlimited, tsa: vendor
         priority:  2
tc: 3 ratelimit: unlimited, tsa: vendor
         priority:  3
tc: 4 ratelimit: unlimited, tsa: vendor
         priority:  4
tc: 5 ratelimit: unlimited, tsa: vendor
         priority:  5
tc: 6 ratelimit: unlimited, tsa: vendor
         priority:  6
tc: 7 ratelimit: unlimited, tsa: vendor
         priority:  7

Check DCQCN is functioning

# check DCQCN is enabled on Priority 3
cat /sys/class/net/$IF_NAME/ecn/roce_np/enable/3
cat /sys/class/net/$IF_NAME/ecn/roce_rp/enable/3

# check counters related to DCQCN
cat /sys/class/infiniband/$DEV_NAME/ports/1/hw_counters/np_cnp_sent
cat /sys/class/infiniband/$DEV_NAME/ports/1/hw_counters/np_ecn_marked_roce_packets
cat /sys/class/infiniband/$DEV_NAME/ports/1/hw_counters/rp_cnp_handled
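
To observe these counters changing, sample them periodically while a benchmark is running; a minimal sketch:

# print the DCQCN counters once per second (counters are cumulative)
cd /sys/class/infiniband/$DEV_NAME/ports/1/hw_counters
watch -n 1 grep . np_cnp_sent np_ecn_marked_roce_packets rp_cnp_handled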

Note: two cases trigger the notification point (NP) to send CNP packets:

  • The NP's NIC receives a packet with an ECN mark (set by the switch to indicate that the switch's buffer is about to run out of capacity).
  • The NP's NIC receives an out-of-order packet (i.e., packet loss occurred).

Check PFC is functioning

ethtool -S $IF_NAME | grep prio3
rx_prio3_bytes: 462536457742
rx_prio3_packets: 500098087
rx_prio3_discards: 0
tx_prio3_bytes: 1180912512618
tx_prio3_packets: 1155032777
rx_prio3_pause: 214479
rx_prio3_pause_duration: 213496
tx_prio3_pause: 12
tx_prio3_pause_duration: 13
rx_prio3_pause_transition: 107222

Note: tx_prio3_pause refers to the number of PFC pause frames sent from this NIC because the host cannot absorb the incoming traffic quickly enough; conversely, rx_prio3_pause counts pause frames received from the peer.
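
Since these counters are cumulative, deltas over a time window are more informative; a minimal sketch measuring pauses sent during 10 seconds of load:

before=$(ethtool -S $IF_NAME | awk '/tx_prio3_pause:/ {print $2}')
sleep 10
after=$(ethtool -S $IF_NAME | awk '/tx_prio3_pause:/ {print $2}')
echo "tx_prio3_pause delta: $(( after - before ))"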

Miscellaneous

Advanced QoS Settings

The default values of the parameters below should be fine…

# map Priority to TC
sudo mlnx_qos -i $IF_NAME --prio_tc=0,1,2,3,4,5,6,7

# map Priority to receive buffers
sudo mlnx_qos -i $IF_NAME --prio2buffer=0,0,0,1,0,0,0,0

# adjust receive buffer sizes
sudo mlnx_qos -i $IF_NAME --buffer_size=20016,156096,0,0,0,0,0,0

## set an alternative TSA (transmission selection algorithm)
# set to vendor (default)
sudo mlnx_qos -i $IF_NAME --tsa=vendor,vendor,vendor,vendor,vendor,vendor,vendor,vendor
# set to ets
sudo mlnx_qos -i $IF_NAME --tsa=ets,ets,ets,ets,ets,ets,ets,ets --tcbw=0,0,0,100,0,0,0,0

# remap a DSCP value to a specified PFC Priority
# map DSCP 3 to Prio 3
sudo mlnx_qos -i $IF_NAME --dscp2prio='set,3,3'

Performance Counters on NICs

Refer to this: https://enterprise-support.nvidia.com/s/article/understanding-mlx5-linux-counters-and-status-parameters.
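
To see which hardware counters a given device and port actually expose:

ls /sys/class/infiniband/$DEV_NAME/ports/1/hw_counters/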

Failed to map RoCE traffic to the specified PFC Priority

  • Consider upgrading the OFED drivers (a version check follows below).
  • ibv_qp_attr.ah_attr.grh.traffic_class may override the default ToS value.
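
For the first item, the installed MLNX_OFED version can be checked with ofed_info:

ofed_info -s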

Configuration for BlueField DPU

QoS settings can be configured from the host with mlnx_qos, but Traffic Class values must be set individually on the DPU side.
