RoCE (RDMA over Converged Ethernet), a high-performance implementation of RDMA networking, offloads flow control and congestion control algorithms to hardware. However, these algorithms assume a lossless network, which is what keeps them simple enough to implement in hardware. We should therefore mitigate packet loss and keep the network as close to lossless as we can. DCQCN (congestion control) combined with PFC (flow control) is a common choice in many data centers. We observed that our system suffered severe performance fluctuations when they were disabled.

GPUDirect RDMA (GDR) is an incredible technology that allows remote machines to directly access the local GPU's memory. However, there are not many online resources discussing it, so I felt very confused when I ran into issues related to RDMA, and especially to GDR.

It's time to abandon NPS, Frp, and other solutions that are hard to configure or no longer maintained. Thanks to Docker, it's possible to set up a reliable reverse proxy with a single command.

After reconfiguring clusters from scratch several times, it seems that I am gradually adapting to this mysterious and strange InfiniBand world...

OpenWrt doesn't provide a combined disk image for ARM virtual machines, as it does for x86 VMs. Meanwhile, the official ARM64 kernel release can't boot in a UEFI environment. But we can still make it work by compiling OpenWrt from source and building a disk image manually.

When we resize the virtual hard disk of a virtual machine or restore a disk image to a larger disk, the free space that Ubuntu detects on the partition does not increase, because the partition table is unchanged. In the past, we could easily resize the ext4 root partition with the help of resize2fs. However, things get more complex now that Ubuntu uses an LVM volume as its default root partition.
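
To make the extra LVM layers concrete, here is a minimal Python sketch that only builds and prints the commands commonly used to grow an LVM-backed ext4 root. It is not necessarily the exact procedure used in the post. The disk `/dev/sda`, partition number `3`, and logical volume path `/dev/ubuntu-vg/ubuntu-lv` are assumptions for illustration, and the helper names `partition_path` and `lvm_grow_commands` are hypothetical. Check your own layout with `lsblk` before running anything.

```python
import shlex


def partition_path(disk: str, number: int) -> str:
    """Return the partition device node, e.g. /dev/sda3 or /dev/nvme0n1p3."""
    # Disks whose name ends in a digit (NVMe, MMC) use a "p" separator.
    sep = "p" if disk[-1].isdigit() else ""
    return f"{disk}{sep}{number}"


def lvm_grow_commands(disk: str, part_number: int, lv_path: str) -> list[list[str]]:
    """Build the commands that grow each layer, from the outside in."""
    part = partition_path(disk, part_number)
    return [
        ["growpart", disk, str(part_number)],       # 1. grow the partition
        ["pvresize", part],                         # 2. grow the LVM physical volume
        ["lvextend", "-l", "+100%FREE", lv_path],   # 3. grow the logical volume
        ["resize2fs", lv_path],                     # 4. grow the ext4 filesystem
    ]


if __name__ == "__main__":
    # Assumed layout; this only prints the commands and executes nothing.
    for cmd in lvm_grow_commands("/dev/sda", 3, "/dev/ubuntu-vg/ubuntu-lv"):
        print(shlex.join(cmd))
```

The sketch deliberately stays a dry run: each command has to be run as root and in this order, since every layer can only grow into space that the layer beneath it has already claimed.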