This article is for reference only, because Edgecore SONiC, a customized variant with many proprietary commands, differs considerably from community SONiC. Some commands and configurations are also specific to certain switch ASICs, such as the Intel Tofino I am using right now. So I would still suggest that you don't throw the official guidebook away: read it carefully, and it will save your life.

RoCE networks, a high-performance implementation of RDMA networks, offload flow control and congestion control algorithms to hardware to achieve high performance. However, these algorithms assume a lossless network, which keeps them simple enough to implement in hardware. We should therefore mitigate packet loss and keep the network as close to lossless as we can. DCQCN (congestion control) plus PFC (flow control) is a common choice in many data centers. We observed that our system suffered severe performance fluctuations when they were disabled.

GPUDirect RDMA (GDR) is an incredible technology that allows remote machines to directly manipulate the local GPU's memory. However, few online resources discuss it, so I felt very confused when I ran into RDMA-related issues, especially with GDR.

It's time to abandon NPS, Frp, and other solutions that are hard to configure or no longer maintained. Thanks to Docker, it's possible to set up a reliable reverse proxy with a single command.

After reconfiguring clusters from scratch several times, it seems that I am gradually adapting to this mysterious and strange InfiniBand world...