RoCE networks, a high-performance implementation of RDMA, offload flow control and congestion control algorithms to hardware to achieve high performance. However, these algorithms assume a lossless network, which keeps them simple enough to implement in hardware. Thus, we should mitigate packet loss and keep the network as close to lossless as possible. DCQCN (congestion control) + PFC (flow control) is a common choice in many data centers. We observed that our system suffered severe performance fluctuations when they were disabled.
L2 PFC vs. L3 PFC
Nowadays it is recommended to set up L3 PFC rather than L2 PFC, likely because L3 PFC is easier to configure.
pcp (Priority Code Point) corresponds to L2 PFC;
dscp (Differentiated Services Code Point) corresponds to L3 PFC.
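As a sketch, the trust mode can be switched between the two with mlnx_qos ($IF_NAME is a placeholder for your interface name, as elsewhere in this post):

```shell
# Trust DSCP values in the IP header, i.e., use L3 PFC
sudo mlnx_qos -i $IF_NAME --trust=dscp
# (to go back to L2 PFC, trust the PCP field in the VLAN tag instead)
# sudo mlnx_qos -i $IF_NAME --trust=pcp
```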
- Service Level: a concept in Infiniband networks. Not related to RoCE networks.
- Type of Service (ToS): to label RoCE traffic.
- DSCP / dot1p Value: a value embedded in the packet header (the IP header for DSCP, the VLAN tag for dot1p).
- Traffic Class (TC): to distinguish internal queues of NICs.
- Priority: an intermediate value bridging DSCP / dot1p and TC; it also maps DSCP / dot1p values to buffers.
Note: the definitions of these terms may vary from vendor to vendor.
If network traffic goes through network switches, ensure L3 PFC and ECN are enabled on them. We omit the detailed configuration steps here since they depend on the particular switch SKU and OS.
Note: we can set up a direct connection between two NICs to test whether our NIC-side configuration works.
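For example, a back-to-back bandwidth test with the perftest tools can exercise the NICs without any switch in the path (a sketch, assuming perftest is installed; $DEV_NAME and $SERVER_IP are placeholders):

```shell
# On the first host (acts as the server):
ib_write_bw -d $DEV_NAME --report_gbits
# On the second host, pointing at the first host's IP:
ib_write_bw -d $DEV_NAME --report_gbits $SERVER_IP
```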
DCQCN is enabled by default on Mellanox ConnectX NICs, but we need to ensure ECN marking is enabled on all network switches.
Tune PFC Headroom Size
We should reserve enough buffer space (headroom) to store in-flight packets, since senders (or requestors) take time to respond to PFC pause frames. This latency is affected by the cable length (i.e., propagation delay). Mellanox requires us to set the cable length manually; the NIC then automatically calculates the correct headroom size.
Fortunately, the cable length is recorded in the transceiver’s EEPROM.
sudo mlxlink -d $DEV_NAME -m -c -e
From the output, we can find the cable length.
Then, apply this value to our QoS settings.
sudo mlnx_qos -i $IF_NAME --cable_len=50
Enable L3 PFC
First, we can check the interface name and device name with
show_gids, which should output something like the following.
DEV PORT INDEX GID IPv4 VER DEV
Then, execute the following commands to activate PFC and apply the PFC setting to RoCE traffic.
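One common recipe looks like the sketch below (the priority, DSCP, and ToS values are assumptions: PFC on priority 3, RoCE traffic marked with ToS 106, i.e., DSCP 26; adjust them to your deployment):

```shell
# Trust DSCP values in the IP header (L3 PFC)
sudo mlnx_qos -i $IF_NAME --trust=dscp
# Enable PFC on priority 3 only
sudo mlnx_qos -i $IF_NAME --pfc 0,0,0,1,0,0,0,0
# Mark RoCE traffic with ToS 106 (DSCP 26)
echo 106 | sudo tee /sys/class/infiniband/$DEV_NAME/tc/1/traffic_class
sudo cma_roce_tos -d $DEV_NAME -t 106
```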
Note: the configuration is NOT persistent. You may need to re-run all the commands above to enable PFC each time the machine reboots. Also, ensure PFC is enabled for DSCP 26 traffic on all network switches.
Show PFC Setting
sudo mlnx_qos -i $IF_NAME
DCBX mode: OS controlled
Check DCQCN is functioning
# Check DCQCN is enabled on Prio 3
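On mlx5 drivers, the per-priority DCQCN switches are typically exposed under sysfs; a sketch (exact paths may vary across OFED versions):

```shell
# A value of 1 means DCQCN is enabled for priority 3
cat /sys/class/net/$IF_NAME/ecn/roce_np/enable/3   # NP (receiver) side
cat /sys/class/net/$IF_NAME/ecn/roce_rp/enable/3   # RP (sender) side
```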
Note: two cases trigger the NP (Notification Point, i.e., the receiver) to send CNP (Congestion Notification Packet) packets:
- The NP’s NIC receives a packet with an ECN mark (set by the switch, indicating the switch’s buffer is about to run out of capacity).
- The NP’s NIC receives an out-of-order packet (packet loss has occurred).
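To see whether CNPs are actually flowing, the per-port hardware counters on mlx5 NICs can be inspected (a sketch; counter names and paths may vary by driver version):

```shell
# CNPs sent by this NIC acting as NP (receiver)
cat /sys/class/infiniband/$DEV_NAME/ports/1/hw_counters/np_cnp_sent
# CNPs received and handled by this NIC acting as RP (sender)
cat /sys/class/infiniband/$DEV_NAME/ports/1/hw_counters/rp_cnp_handled
```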
Check PFC is functioning
ethtool -S $IF_NAME | grep prio3
tx_prio3_pause refers to the number of PFC pause frames sent from this NIC because the server cannot absorb the incoming traffic quickly enough.
Advanced QoS Settings
The default values of the parameters below should be fine…
# mapping Priority to TC
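A sketch of setting the mapping explicitly with mlnx_qos (the identity mapping shown here is an assumption and is typically the default anyway):

```shell
# Map priority i to TC i for all 8 priorities
sudo mlnx_qos -i $IF_NAME -p 0,1,2,3,4,5,6,7
```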
Performance Counters on NICs
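A quick way to keep an eye on the relevant counters while running traffic (a sketch; counter names vary across NIC generations):

```shell
# Refresh pause/ECN/CNP-related counters every second
watch -n 1 "ethtool -S $IF_NAME | grep -E 'pause|ecn|cnp'"
```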
Failed to map RoCE traffic to specified PFC Priority
- Consider upgrading OFED drivers.
- ibv_qp_attr.ah_attr.grh.traffic_class may override the default ToS value.
Configuration for BlueField DPU
QoS settings can be set on the host using
mlnx_qos, but Traffic Class values must be set individually on the DPU side.
- An upcoming SIGCOMM’23 paper