After reconfiguring clusters from scratch several times, it seems that I am gradually adapting to this mysterious and strange InfiniBand world...
Relationship among InfiniBand, RoCE, IPoIB, and Ethernet Mode
Let us take the Mellanox ConnectX adapter as an example. This adapter can work in either InfiniBand mode or Ethernet mode, and the mode is configurable with tools provided by the vendor. As iWARP is not widely adopted, this article does not discuss it.
| ||InfiniBand Mode||Ethernet Mode|
|Supported by ConnectX||Yes||Yes|
|Programmable with Verbs||Yes||Yes|
|TCP/IP Support||Needs IPoIB||Yes|
|Configurable with Netplan (e.g. Assign IP Address)||Needs IPoIB||Yes|
|Layout of RDMA Packet||IB Frame + IB Header||ETH Frame + RoCE Header|
|Layout of TCP Packet||IB Frame + IB/IPoIB/IP/TCP Headers||ETH Frame + IP/TCP Headers|
Note that "RoCE header" is used here as a general term: RoCEv1 and RoCEv2 each define this part of the packet differently.
Identify InfiniBand / Ethernet Mode
The easiest way is to look at the interface name and link type with the ip command under Linux. An InfiniBand adapter working in Ethernet mode looks exactly the same as a regular Ethernet adapter.
$ ip a
ibdev2netdev can also help.
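As a rough sketch of the first approach, the link type can be pulled out of the one-line output of ip. The helper name link_kind is hypothetical, and the sketch assumes the iproute2 output format of `ip -o link show`, where InfiniBand interfaces report link/infiniband and Ethernet interfaces report link/ether.

```shell
# Classify interfaces by link type from `ip -o link show` output.
# link_kind is a hypothetical helper name, not part of iproute2.
link_kind() {
    # Reads `ip -o link show` lines on stdin, prints "<ifname> <kind>".
    awk '{
        name = $2; sub(/:$/, "", name)      # "ibp129s0:" -> "ibp129s0"
        kind = "other"
        for (i = 1; i <= NF; i++) {
            if ($i == "link/infiniband") kind = "infiniband"
            else if ($i == "link/ether") kind = "ethernet"
        }
        print name, kind
    }'
}

# Usage on a real host:
#   ip -o link show | link_kind
```

An adapter port switched to Ethernet mode would be reported as ethernet here, consistent with it looking like a regular Ethernet adapter.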
Another approach is ibstat, whose Link layer field shows which mode the adapter is working in.
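The same information is also exposed by the kernel under sysfs, which is handy in scripts. The sketch below assumes the standard /sys/class/infiniband/<device>/ports/<port>/link_layer files. The function name show_link_layers is hypothetical, and the sysfs root is a parameter only so that the function can be tried against a fake directory tree.

```shell
# Print "<device> port <n>: <link layer>" for every RDMA port.
# The first argument is the sysfs root; it defaults to the real path.
show_link_layers() {
    root="${1:-/sys/class/infiniband}"
    for f in "$root"/*/ports/*/link_layer; do
        [ -r "$f" ] || continue            # glob did not match anything
        port_dir=${f%/link_layer}          # .../<dev>/ports/<n>
        port=${port_dir##*/}
        dev_dir=${port_dir%/ports/*}       # .../<dev>
        dev=${dev_dir##*/}
        echo "$dev port $port: $(cat "$f")"
    done
}

# Usage on a real host:
#   show_link_layers
```

Each line should end in either InfiniBand or Ethernet, matching the Link layer field of ibstat.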
Change InfiniBand / Ethernet Mode
There is currently no vendor-neutral way to alter the work mode. For the Mellanox ConnectX adapter, the vendor provides a tool called mlxconfig. Here is the usage listed in the official documentation, where you can find more information about it.
$ sudo mlxconfig -d /dev/mst/mt4103_pci_cr0 set LINK_TYPE_P1=1 LINK_TYPE_P2=1
Note that P1 and P2 refer to the two separate ports on the adapter. In this example the value 1 selects InfiniBand, while 2 selects Ethernet. Attention: please make sure the network switch can handle the corresponding frame type (InfiniBand or Ethernet) before altering the work mode. If the switch cannot recognize the frames sent from the server, you might observe Physical state: Polling reported by ibstat, as the link with the switch cannot be established. Certain network switches can only forward one type of frame at a time, which means you may need to manually reconfigure the switch to let it work with the other type.
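To make the meaning of the numbers explicit, here is a small sketch that builds the command shown above from a mode name. It assumes the LINK_TYPE values used by mlxconfig on VPI-capable ConnectX cards (1 for InfiniBand, 2 for Ethernet) and reuses the device path from the example. Both function names are hypothetical, and the command is only printed so that it can be reviewed before being run by hand.

```shell
# Map a mode name to the LINK_TYPE value that mlxconfig expects.
# Assumption: 1 = InfiniBand, 2 = Ethernet, as on VPI-capable ConnectX cards.
link_type_value() {
    case "$1" in
        ib|infiniband) echo 1 ;;
        eth|ethernet)  echo 2 ;;
        *) echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Print (not run) the mlxconfig command that sets both ports to a mode.
# Usage: build_link_type_cmd <mst-device> <ib|eth>
build_link_type_cmd() {
    value=$(link_type_value "$2") || return 1
    echo "sudo mlxconfig -d $1 set LINK_TYPE_P1=$value LINK_TYPE_P2=$value"
}

# Example:
#   build_link_type_cmd /dev/mst/mt4103_pci_cr0 eth
```

The new setting only takes effect after a reboot or a firmware reset, and the current values can be checked with the query subcommand of mlxconfig.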
By default, IPoIB is configured automatically when an IP address is assigned to the interface. The IP address can be managed by a tool such as netplan or NetworkManager, depending on your Linux distro. As for the configuration file, there is no difference between InfiniBand and regular Ethernet adapters.
# Assign a static IP address with netplan for an InfiniBand interface
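A minimal sketch of what such a configuration might look like, written as a small shell helper to match the other commands in this post. The interface name ibp129s0 is borrowed from the ethtool example at the end of this post, while the address 192.168.1.10/24, the file name, and the function name write_ipoib_netplan are placeholders chosen purely for illustration. Note that the interface sits under the ethernets key, just like a regular Ethernet adapter.

```shell
# Write a netplan file that assigns a static address to an IPoIB interface.
# Usage: write_ipoib_netplan <output-file> <interface> <address/prefix>
write_ipoib_netplan() {
    cat > "$1" <<EOF
network:
  version: 2
  ethernets:
    $2:
      addresses:
        - $3
EOF
}

# Example with placeholder values (writing under /etc/netplan needs root),
# followed by applying the configuration:
#   write_ipoib_netplan /etc/netplan/60-ipoib.yaml ibp129s0 192.168.1.10/24
#   sudo netplan apply
```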
Once the above configuration is applied and the interface is brought up successfully, we can see that the ib_ipoib module is loaded.
$ lsmod | grep ipoib
If the IP address doesn't appear in the output of ip a, we need to check the status of the InfiniBand adapter and make sure its state is Active in ibstat. A common mistake is forgetting to start opensmd when no other subnet manager is running on the fabric, which leaves the adapter stuck at State: Initializing. Note that opensmd does not launch on startup by default.
# Start OpenSM
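The check described above can be sketched as follows. It assumes the usual State: line in ibstat output and the service name opensmd used in this post; on some distributions the unit is called opensm instead. The function name needs_subnet_manager is hypothetical.

```shell
# Read `ibstat` output on stdin and succeed if any port is stuck in
# "Initializing", which usually means no subnet manager is running.
needs_subnet_manager() {
    grep -Eq '^[[:space:]]*State:[[:space:]]*Initializing'
}

# Usage on a real host (the unit name opensmd is an assumption and may be
# `opensm` on some distributions):
#   if ibstat | needs_subnet_manager; then
#       sudo systemctl enable --now opensmd
#   fi
```

Using enable together with --now both starts the service immediately and makes it launch on the next boot.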
Identify RoCE Version
The major difference between RoCEv1 and RoCEv2 is that RoCEv2 encapsulates RDMA traffic in UDP/IP and can therefore be routed across IP networks, while RoCEv1 is an Ethernet link-layer protocol that is addressed by MAC and stays within a single layer-2 domain. A fun fact is that RoCEv1 and RoCEv2 may be enabled simultaneously, and we can choose the version at runtime by specifying the Global Identifier (GID). There is a script written by Mellanox named show_gids, which displays the RoCE version associated with each GID.
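When show_gids is not installed, similar information can be read from sysfs. The sketch below assumes the kernel's gids and gid_attrs/types directories under each port, where the type reads as either "IB/RoCE v1" or "RoCE v2". The function name list_gid_types is hypothetical, and this is only an approximation of what the vendor script prints.

```shell
# List populated GIDs with their RoCE type, similar in spirit to show_gids.
# The first argument is the sysfs root; it defaults to the real path.
list_gid_types() {
    root="${1:-/sys/class/infiniband}"
    for g in "$root"/*/ports/*/gids/*; do
        [ -r "$g" ] || continue
        gid=$(cat "$g" 2>/dev/null) || continue
        # Skip empty table entries, which read as all zeros.
        [ "$gid" = "0000:0000:0000:0000:0000:0000:0000:0000" ] && continue
        idx=${g##*/}
        port_dir=${g%/gids/*}              # .../<dev>/ports/<n>
        port=${port_dir##*/}
        dev_dir=${port_dir%/ports/*}       # .../<dev>
        dev=${dev_dir##*/}
        gid_type=$(cat "$port_dir/gid_attrs/types/$idx" 2>/dev/null) || gid_type="unknown"
        echo "$dev $port $idx $gid $gid_type"
    done
}

# Usage on a real host:
#   list_gid_types
```

The printed index is the GID index that an RDMA application would pass when it wants a specific RoCE version.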
Check Adapter Speed
ethtool can read this information, and it works in both InfiniBand and Ethernet mode.
$ ethtool ibp129s0
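For scripting, the speed can be extracted from that output. This sketch assumes the usual Speed: line that ethtool prints in Mb/s (for example 100000Mb/s); the function name speed_gbps is hypothetical.

```shell
# Extract the "Speed:" value from `ethtool <iface>` output on stdin and
# convert it from Mb/s to Gb/s. Prints "unknown" if the value is not numeric.
speed_gbps() {
    awk '/^[[:space:]]*Speed:/ {
        v = $2; sub(/Mb\/s$/, "", v)       # "100000Mb/s" -> "100000"
        if (v ~ /^[0-9]+$/) printf "%g Gb/s\n", v / 1000
        else print "unknown"
    }'
}

# Usage on a real host:
#   ethtool ibp129s0 | speed_gbps
```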