Tips for Configuring InfiniBand Adapters

After reconfiguring clusters from scratch several times, I seem to be gradually adapting to the mysterious and strange world of InfiniBand...

Relationship among InfiniBand, RoCE, IPoIB, and Ethernet Mode

Let us take the Mellanox ConnectX adapter as an example. This adapter can work in either InfiniBand mode or Ethernet mode, and the mode is configurable with tools provided by the vendor. Since iWARP is not widely adopted, this article does not discuss it.

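|                                                    | InfiniBand Mode                    | Ethernet Mode               |
|----------------------------------------------------|------------------------------------|-----------------------------|
| Supported by ConnectX                              | Yes                                | Yes                         |
| RDMA Support                                       | Yes                                | Yes                         |
| Programmable with Verbs                            | Yes                                | Yes                         |
| TCP/IP Support                                     | Needs IPoIB                        | Yes                         |
| Configurable with Netplan (e.g. Assign IP Address) | Needs IPoIB                        | Yes                         |
| Layout of RDMA Packet                              | IB Frame + IB Header               | ETH Frame + RoCE Header     |
| Layout of TCP Packet                               | IB Frame + IB/IPoIB/IP/TCP Headers | ETH Frame + IP/TCP Headers  |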

Note that "RoCE Header" is a general term here: RoCEv1 and RoCEv2 define this part differently. RoCEv1 carries the InfiniBand headers directly inside the Ethernet frame, while RoCEv2 wraps them in UDP/IP.

Identify InfiniBand / Ethernet Mode

The easiest way is to look at the interface name and link type with ifconfig or ip on Linux. In InfiniBand mode the link type is shown as link/infiniband, whereas an InfiniBand adapter working in Ethernet mode looks exactly the same as a regular Ethernet adapter (link/ether).

$ ip a
# InfiniBand Mode
4: ibp129s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
link/infiniband ...
inet 192.168.7.100/24 brd 192.168.7.255 scope global ibp129s0
valid_lft forever preferred_lft forever

# Ethernet Mode
7: ens1f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether ...
inet 10.200.0.1/24 brd 10.200.0.255 scope global ens1f1
valid_lft forever preferred_lft forever
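
The same check can be scripted. Below is a minimal Python sketch, assuming a Linux host where sysfs exposes the ARP hardware type of each interface in /sys/class/net/<iface>/type (1 for Ethernet, 32 for InfiniBand). The function name link_type and the sysfs_root parameter are made up for illustration and are not part of any vendor tool.

```python
from pathlib import Path

# ARP hardware type codes defined by the Linux kernel (if_arp.h).
ARPHRD_NAMES = {1: "ether", 32: "infiniband"}


def link_type(iface: str, sysfs_root: str = "/sys/class/net") -> str:
    """Return 'ether', 'infiniband', or 'unknown(<code>)' for an interface.

    sysfs_root is a hypothetical parameter so the function can be pointed
    at a fake directory tree for testing.
    """
    code = int((Path(sysfs_root) / iface / "type").read_text().strip())
    return ARPHRD_NAMES.get(code, f"unknown({code})")
```

On the machine shown above, link_type("ibp129s0") should correspond to the link/infiniband line and link_type("ens1f1") to the link/ether line.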

ibdev2netdev also helps: it maps each RDMA device and port to the corresponding network interface.

$ ibdev2netdev
mlx4_0 port 1 ==> ibp129s0 (Up)
mlx5_0 port 1 ==> ens1f0 (Up)
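
If this mapping is needed in a script, the output is easy to parse. The following Python sketch assumes the line format shown above ("<device> port <n> ==> <netdev> (<state>)"). PortMapping and parse_ibdev2netdev are hypothetical names, and other versions of the tool may print lines that this pattern does not match.

```python
import re
from typing import List, NamedTuple


class PortMapping(NamedTuple):
    rdma_dev: str   # e.g. "mlx5_0"
    port: int       # e.g. 1
    netdev: str     # e.g. "ens1f0"
    state: str      # e.g. "Up"


# Matches lines such as: "mlx4_0 port 1 ==> ibp129s0 (Up)"
_LINE = re.compile(r"^(\S+)\s+port\s+(\d+)\s+==>\s+(\S+)\s+\((\w+)\)$")


def parse_ibdev2netdev(text: str) -> List[PortMapping]:
    """Parse ibdev2netdev output; lines that do not match are skipped."""
    mappings = []
    for line in text.splitlines():
        m = _LINE.match(line.strip())
        if m:
            mappings.append(
                PortMapping(m.group(1), int(m.group(2)), m.group(3), m.group(4))
            )
    return mappings
```

The text could be obtained with subprocess.run(["ibdev2netdev"], capture_output=True, text=True).stdout on a host where the tool is installed.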

Another approach is ibstat, whose Link layer field shows which mode the adapter is working in.

$ ibstat
# InfiniBand Mode
CA 'mlx4_0'
CA type: MT4099
Number of ports: 1
Firmware version: 2.42.5000
Hardware version: 1
Node GUID:
System image GUID:
Port 1:
State: Active
Physical state: LinkUp
Rate: 56
Base lid: 1
LMC: 0
SM lid: 1
Capability mask: 0x0251486a
Port GUID:
Link layer: InfiniBand

# Ethernet Mode
CA 'mlx5_0'
CA type: MT4119
Number of ports: 1
Firmware version: 16.25.1020
Hardware version: 0
Node GUID:
System image GUID:
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID:
Link layer: Ethernet
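
To pull just the Link layer field out of this output, a small parser is enough. This is a sketch that assumes the field names shown above ("CA '<name>'", "Port <n>:", "Link layer: <value>"). The function name link_layers is illustrative only.

```python
from typing import Dict, Optional, Tuple


def link_layers(ibstat_output: str) -> Dict[Tuple[str, int], str]:
    """Map (CA name, port number) to the link layer reported by ibstat."""
    result: Dict[Tuple[str, int], str] = {}
    ca: Optional[str] = None
    port: Optional[int] = None
    for raw in ibstat_output.splitlines():
        line = raw.strip()  # real output is indented, so strip first
        if line.startswith("CA '") and line.endswith("'"):
            ca = line[len("CA '"):-1]
            port = None
        elif line.startswith("Port ") and line.endswith(":"):
            number = line[len("Port "):-1]
            if number.isdigit():  # ignores e.g. an empty "Port GUID:" line
                port = int(number)
        elif line.startswith("Link layer:") and ca is not None and port is not None:
            result[(ca, port)] = line.split(":", 1)[1].strip()
    return result
```

Keying the result by both adapter and port matters for dual-port adapters, where each port may be configured in a different mode.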

Change InfiniBand / Ethernet Mode

There is currently no vendor-independent way to change the work mode. For Mellanox ConnectX adapters, the vendor provides a tool called mlxconfig. Here is the usage example from the official documentation, where you can find more information about it.

$ sudo mlxconfig -d /dev/mst/mt4103_pci_cr0 set LINK_TYPE_P1=1 LINK_TYPE_P2=1

Device #1:
----------
Device type: ConnectX3Pro
PCI device: /dev/mst/mt4103_pci_cr0
Configurations: Next Boot New
LINK_TYPE_P1 ETH(2) IB(1)
LINK_TYPE_P2 ETH(2) IB(1)

Apply new Configuration? ? (y/n) [n] : y
Applying... Done!
-I- Please reboot machine to load new configurations.

Note that P1 and P2 refer to the two separate ports on the adapter, and that in the output above 1 stands for InfiniBand and 2 for Ethernet.

Attention: Please make sure the network switch can handle the corresponding frame type (InfiniBand or Ethernet) before altering the work mode. If the switch cannot recognize the frames sent from the server, you might observe Physical state: Polling reported by ibstat, because the link never comes up. Certain network switches can only forward one type of frame at a time, which means you may need to reconfigure the switch manually to make it work with the other type.
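
When several machines have to be switched, it can help to generate the command line instead of typing it. The sketch below only builds the argument list and does not run anything. It assumes the LINK_TYPE_P<n> parameter names and the values IB(1) / ETH(2) exactly as they appear in the output above. The helper name mlxconfig_set_link_type is hypothetical.

```python
from typing import Dict, List

# Values as printed by mlxconfig in the example above: IB(1), ETH(2).
LINK_TYPE = {"ib": 1, "eth": 2}


def mlxconfig_set_link_type(device: str, ports: Dict[int, str]) -> List[str]:
    """Build argv for: mlxconfig -d <device> set LINK_TYPE_P<n>=<value> ..."""
    if not ports:
        raise ValueError("at least one port is required")
    args = ["mlxconfig", "-d", device, "set"]
    for port in sorted(ports):
        mode = ports[port].lower()
        if mode not in LINK_TYPE:
            raise ValueError(f"unknown link type {ports[port]!r} for port {port}")
        args.append(f"LINK_TYPE_P{port}={LINK_TYPE[mode]}")
    return args
```

The resulting list can be prefixed with sudo and handed to subprocess.run. The tool still asks for confirmation, and a reboot is still required afterwards.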

Configure IPoIB

By default, IPoIB is configured automatically once an IP address is assigned to the interface. The IP address can be managed by netplan or NetworkManager, depending on your Linux distro. The configuration file looks the same for InfiniBand and regular Ethernet adapters.

# Assign a static IP address with netplan for an InfiniBand interface
network:
  ethernets:
    ibp129s0:
      addresses:
        - 192.168.7.100/24
  version: 2

Once the above configuration is applied and the interface is brought up successfully, we can see that the ib_ipoib module is loaded.

$ lsmod | grep ipoib
ib_ipoib 180224 0
$ ip a
4: ibp129s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
link/infiniband ...
inet 192.168.7.100/24 brd 192.168.7.255 scope global ibp129s0
valid_lft forever preferred_lft forever
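
The module check can also be done from a script by reading /proc/modules, which lists loaded modules with the module name in the first column, just like lsmod. This is a minimal sketch: the function names are made up for illustration, and it assumes ib_ipoib is built as a loadable module (a driver compiled into the kernel would not be listed).

```python
from pathlib import Path


def module_loaded(name: str, modules_text: str) -> bool:
    """Return True if `name` is the first column of any line of the text."""
    for line in modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False


def ipoib_loaded(proc_modules: str = "/proc/modules") -> bool:
    # Assumption: ib_ipoib is a loadable module; a built-in driver
    # would not appear in /proc/modules.
    return module_loaded("ib_ipoib", Path(proc_modules).read_text())
```

Comparing the whole first column, rather than searching for a substring as grep does, avoids false matches on modules whose names merely contain "ipoib".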

If the IP address doesn't appear in ip a, check the status of the InfiniBand adapter and make sure its state is Active in ibstat. An InfiniBand subnet needs a running subnet manager, so a common mistake is forgetting to start opensm / opensmd, which leaves the adapter stuck at State: Initializing. Note that opensmd does not launch on startup by default.

# Start OpenSM
$ sudo opensm

# Start OpenSM As Daemon
$ sudo service opensmd start # Method 1
$ sudo systemctl start opensmd # Method 2
$ sudo /etc/init.d/opensmd start # Method 3
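
The two failure symptoms mentioned in this article (Physical state: Polling and State: Initializing) can be turned into a small troubleshooting helper. This is only a sketch of the rules of thumb described here for a port in InfiniBand mode, not an exhaustive state machine. The function name diagnose_port and the wording of the hints are my own.

```python
def diagnose_port(state: str, physical_state: str) -> str:
    """Map ibstat's 'State' and 'Physical state' fields to a hint."""
    state = state.strip().lower()
    physical_state = physical_state.strip().lower()
    if physical_state == "polling":
        # The port has no link partner it can talk to.
        return "no link: check the cable and whether the switch matches the adapter's mode"
    if physical_state == "linkup" and state == "initializing":
        # The link is up, but nobody has configured the subnet yet.
        return "link is up but no subnet manager is running: start opensm / opensmd"
    if physical_state == "linkup" and state == "active":
        return "port is ready"
    return "unrecognized combination: inspect the ibstat output manually"
```

The two arguments are the values that ibstat prints after "State:" and "Physical state:" for the port in question.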

Identify RoCE Version

The major difference between RoCEv1 and RoCEv2 is that RoCEv2 can be routed over IP networks, while RoCEv1 is addressed by MAC and therefore stays within a single Ethernet broadcast domain. Interestingly, RoCEv1 and RoCEv2 may be enabled simultaneously, and we can choose the version at runtime by specifying the index of a Global Identifier (GID). Mellanox provides a script named show_gids that displays the RoCE version associated with each GID.

$ show_gids
DEV PORT INDEX GID IPv4 VER DEV
--- ---- ----- --- ------------ --- ---
mlx5_0 1 0 fe80:0000:0000:0000:... v1 ens1f0
mlx5_0 1 1 fe80:0000:0000:0000:... v2 ens1f0
mlx5_0 1 2 0000:0000:0000:0000:... 11.0.0.201 v1 ens1f0
mlx5_0 1 3 0000:0000:0000:0000:... 11.0.0.201 v2 ens1f0
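
Since applications select the RoCE version through the GID index, it is handy to look the index up programmatically. The Python sketch below parses the table format shown above, assuming whitespace-separated columns where the IPv4 column may be empty. GidEntry, parse_show_gids and gid_index_for are hypothetical names, and other versions of the script may print a different layout.

```python
from typing import List, NamedTuple, Optional


class GidEntry(NamedTuple):
    dev: str
    port: int
    index: int
    gid: str
    ipv4: Optional[str]  # None when the IPv4 column is empty
    version: str         # "v1" or "v2"
    netdev: str


def parse_show_gids(text: str) -> List[GidEntry]:
    """Parse show_gids output; header and separator rows are skipped."""
    entries = []
    for line in text.splitlines():
        tokens = line.split()
        # Data rows have 6 tokens (no IPv4) or 7 tokens (with IPv4), and
        # their PORT and INDEX columns are numeric.
        if len(tokens) not in (6, 7):
            continue
        if not (tokens[1].isdigit() and tokens[2].isdigit()):
            continue
        ipv4 = tokens[4] if len(tokens) == 7 else None
        entries.append(GidEntry(tokens[0], int(tokens[1]), int(tokens[2]),
                                tokens[3], ipv4, tokens[-2], tokens[-1]))
    return entries


def gid_index_for(entries: List[GidEntry], ipv4: str,
                  version: str = "v2") -> Optional[int]:
    """Return the first GID index matching the IPv4 address and RoCE version."""
    for entry in entries:
        if entry.ipv4 == ipv4 and entry.version == version:
            return entry.index
    return None
```

The index found this way can then be passed to whatever accepts a GID index, for example the -x option of the perftest tools such as ib_write_bw.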

Check Adapter Speed

ethtool reports the link speed, and it works in both InfiniBand and Ethernet mode.

$ ethtool ibp129s0
Settings for ibp129s0:
...
Speed: 56000Mb/s
Duplex: Full

$ ethtool ens1f0
Settings for ens1f0:
...
Speed: 100000Mb/s
Duplex: Full
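
For monitoring scripts, the Speed line can be extracted and converted to Gb/s. This sketch assumes the "Speed: <number>Mb/s" format shown above. The function name speed_gbps is illustrative, and a non-numeric value (such as the one ethtool prints when the link is down) yields None.

```python
import re
from typing import Optional

# Matches a line such as "Speed: 56000Mb/s" anywhere in the output.
_SPEED = re.compile(r"^\s*Speed:\s*(\d+)\s*Mb/s\s*$", re.MULTILINE)


def speed_gbps(ethtool_output: str) -> Optional[float]:
    """Return the link speed in Gb/s, or None if no numeric speed is present."""
    m = _SPEED.search(ethtool_output)
    return int(m.group(1)) / 1000 if m else None
```

Applied to the two outputs above, this gives 56.0 for ibp129s0 and 100.0 for ens1f0.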

References