This method applies to Ubuntu 22.04 (and possibly later releases), and you must be able to create TUN/TAP devices. For Ubuntu 20.04 or older, the official repository doesn't provide WireGuard-Go.
Note that if you are allowed to load kernel modules (i.e., you are not setting up a VPN inside an OpenVZ or Docker container), you should consider kernel-space WireGuard instead of user-space WireGuard such as the WireGuard-Go introduced in this post.
sudo apt update
Note: The package wireguard should NOT be installed. If you have installed it, just remove it.
Surprisingly, the binary file installed by apt is named wireguard instead of wireguard-go, and wg-quick does not recognize that name. So the workaround is simply to create a soft link pointing to the wireguard binary.
cd /bin
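The listing above is truncated to its first command; a minimal sketch of the whole workaround (assuming the binary was installed to /bin, as in this post) might look like:
cd /bin
sudo ln -s wireguard wireguard-go   # give wg-quick the binary name it expects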
A tutorial on writing the wg-quick configuration file is omitted here, since the same configuration works for both kernel-space and user-space WireGuard, and WireGuard-Go can be launched through wg-quick exactly like kernel-space WireGuard.
# Configuration file is written to /etc/wireguard/wg0.conf
sudo systemctl enable wg-quick@wg0
Perhaps you ran nvidia-smi to see what was going on, but you only got this error message:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
I know what you're gonna say: Nvidia F*** You!
Actually, I encountered this problem many times. It is most likely caused by upgrading or downgrading the Linux kernel without properly generating kernel modules, which might be essential parts of GPU drivers.
Right now the nvidia module is probably not loaded (you can check with lsmod | grep nvidia). We can try to load the kernel module manually.
sudo modprobe nvidia
You should get an error message like this.
modprobe: FATAL: Module nvidia not found in directory /lib/modules/5.15.0-84-generic
Meanwhile, check whether /usr/lib/modules/5.15.0-84-generic/updates/dkms/nvidia.ko is missing.
If you don't see the error message above and that kernel module file does exist, you might have other issues, such as hardware failure. In that case, try to read the kernel logs through dmesg and check the existence of the GPUs through lspci -vvv, which should give you some clues.
DKMS, a utility that manages drivers, as well as NVIDIA Drivers, might be broken. We could fix them by removing them first and installing them back later.
sudo rm -r /var/lib/dkms/nvidia
Note: Installing the full CUDA Toolkit is the only way I recommend to install the drivers. Using the drivers provided by the official Ubuntu repository is NOT recommended.
The previous step will also remove NVIDIA Docker Runtime, which may lead to this error if you use Docker.
Error response from daemon: Cannot restart container ...: could not select device driver "" with capabilities: [[gpu]]
Thus, we need to install it back too.
sudo apt install -y nvidia-container-toolkit
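After reinstalling, a quick way to confirm the runtime works again (a sketch; the image tag is just an example) is to restart Docker and run a GPU container:
sudo systemctl restart docker
docker run --rm --gpus all nvidia/cuda:11.0.3-devel-ubuntu18.04 nvidia-smi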
Nowadays it is recommended to set up L3 PFC rather than L2 PFC, probably because L3 PFC is easier to configure.
Note: 802.1p and pcp usually refer to L2 PFC; dscp refers to L3 PFC.
Egress:
Note: the definition of terms may vary from vendor to vendor.
If network traffic goes through network switches, ensure L3 PFC and ECN are enabled on the switches. We omit the detailed steps for configuring switches here, as they depend on the particular switch SKU and OS.
Note: we could set up a direct connection to test whether our configuration on NICs works or not.
DCQCN is enabled by default on Mellanox ConnectX NICs, but we need to ensure ECN marking is enabled on all network switches.
We should reserve enough buffer space to store in-flight packets, since senders (or requestors) need time (latency) to respond to PFC pause frames. The latency is affected by the cable length (i.e., propagation delay). Mellanox requires us to set the cable length manually, and the NIC then automatically calculates the correct headroom size.
Fortunately, the cable length is recorded in the transceiver’s EEPROM.
sudo mlxlink -d $DEV_NAME -m -c -e
From the outputs, we could find the cable length.
Module Info
Then, apply this parameter to our QoS setting.
sudo mlnx_qos -i $IF_NAME --cable_len=50
First, we can check the interface name and device name with show_gids, which should output something like below.
DEV  PORT  INDEX  GID  IPv4  VER  DEV
Then, execute the following commands to activate PFC and apply the PFC setting to RoCE traffic.
export IF_NAME=enp216s0f1np1
Note: the configuration is NOT persistent. You may need to re-run all the commands above to enable PFC each time the machine reboots. Also, ensure PFC is enabled for DSCP 26 traffic on all network switches.
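One common way to re-apply the settings at boot is a small systemd oneshot unit; this is a sketch of my own devising (the unit name and the script path are made up), not something from the vendor docs:
# /etc/systemd/system/pfc-setup.service
[Unit]
Description=Re-apply PFC/QoS settings for the RoCE NIC
After=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/pfc-setup.sh   # put the mlnx_qos commands above into this script

[Install]
WantedBy=multi-user.target
Then enable it once with sudo systemctl enable pfc-setup.service.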
sudo mlnx_qos -i $IF_NAME
DCBX mode: OS controlled
# Check DCQCN is enabled on Prio 3
Note: there are two cases that trigger the NP (notification point) to send CNP packets:
ethtool -S $IF_NAME | grep prio3
rx_prio3_bytes: 462536457742
Note: tx_prio3_pause refers to the number of PFC pause frames sent from this NIC when the server cannot absorb the incoming traffic quickly enough.
The default values of the parameters below should be fine…
# mapping Priority to TC
Refer to this: https://enterprise-support.nvidia.com/s/article/understanding-mlx5-linux-counters-and-status-parameters.
ibv_qp_attr.ah_attr.grh.traffic_class may override the default ToS value. QoS settings can be set on the host using mlnx_qos, but Traffic Class values must be set on the DPU side individually.
In this tutorial, I will use a Mellanox ConnectX RDMA NIC (RNIC) as an example to demonstrate the configuration steps.
Note that some configuration steps are vendor-specific, which means that for another vendor's RNIC you may need to find an alternative solution if my approach is not applicable. Also, I didn't test GDR on NICs made by vendors other than Mellanox. (I suspect only Mellanox's RNICs support GDR.)
For ConnectX RNICs, the corresponding drivers and toolkits are all packed in Mellanox OFED.
I won't introduce how to install these things, as there are already many tutorials on this topic on the Internet. I would recommend checking the official website and installing the packages it provides.
Note that installing the CUDA drivers through apt and the toolkit through conda separately is NOT recommended.
Once you have properly installed them, just execute the command nvidia-smi topo -m, and you should see something like:
$ nvidia-smi topo -m
We will discuss what this output represents in the following section. For now, you should be able to identify both your GPUs and NICs from this output.
Not every system is created equal. Continuing the above example, we can see there are many types of relationships between an individual GPU and NIC, such as SYS and PHB. In fact, they greatly affect GDR performance.
From my experience, I believe: PIX and PXB work well for GDR; PHB also performs acceptably, though worse than PIX and PXB; SYS and NODE hurt GDR performance badly.
My benchmark results of GDR performance on a 100 Gbps RoCE network:
- Dual Intel Xeon 4112 + NVIDIA Tesla V100: SYS: ~2 GB/s; PIX: ~10 GB/s
- Single AMD EPYC 7763 + NVIDIA Tesla A100: SYS: ~6 GB/s; PHB: ~10 GB/s
- Dual Intel Xeon E5-2630 v4 + NVIDIA Tesla P100: SYS: ~0.3 GB/s
Here are a discussion about PHB and a description of the P2P level.
(Updated on Jun 9, 2023) Here is a systematic introduction to PCIe Affinity.
If you unfortunately got some SYS or NODE relationships, they can possibly be corrected by plugging your GPU or NIC into the proper PCIe slots.
The nvidia_peermem module is bundled with the CUDA Toolkit downloaded from here. By default, this kernel module is not loaded automatically, so we can load it manually with the command sudo modprobe nvidia_peermem.
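If you want the module loaded at every boot, one common approach (my suggestion, not from the original post) is systemd's modules-load.d mechanism:
echo nvidia_peermem | sudo tee /etc/modules-load.d/nvidia-peermem.conf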
To check that the module is loaded correctly, execute lsmod | grep nvidia_peermem and see whether the module name appears in the output.
If you cannot find this kernel module on the system, consider installing the latest CUDA Toolkit.
There is another version of this module, named nv_peer_mem, that you can find on this repo, but it appears to be no longer maintained.
Many reports ([1], [2]) have mentioned that PCIe ACS may hurt GDR performance. PCIe ACS is a security feature, but we rarely care about security when we are hungry for performance. Here is a script to disable it. Note that this script is only for your reference; you may need to modify it according to your machine's configuration.
Up to now, GDR is supposed to work. To verify that, I would recommend using OFED PerfTest.
- DO NOT use the PerfTest binaries provided by OFED or APT. You should compile PerfTest yourself because the binary distribution doesn't support GDR.
- Both the client and the server must be capable of utilizing GDR.
- ib_send_bw has some bugs in GDR tests.
# Compile
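The compile listing above is truncated; a sketch of building PerfTest with CUDA support and running a GDR bandwidth test (exact configure flags may differ across perftest versions) looks like:
git clone https://github.com/linux-rdma/perftest.git
cd perftest
./autogen.sh
./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h   # enables the --use_cuda option
make -j
# server side
./ib_write_bw -d mlx5_0 --use_cuda=0
# client side (use ib_write_bw, since ib_send_bw is buggy with GDR as noted above)
./ib_write_bw -d mlx5_0 --use_cuda=0 <server_ip>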
If you would like to test with NCCL, I recommend referring to this article.
If you still encounter errors like ibv_create_qp failed or ibv_reg_mr failed, they might be caused by Linux user limits (ulimit).
A quick and dirty way to temporarily fix this issue is to run the program as the root user. Once this dirty fix works, you can refer to this article to solve the issue permanently.
A Makefile is the input file of GNU Make, and it can be seen as a rather special Bash script. Compared with modern build tools, Make is very primitive; using it directly feels like drilling wood for fire in the age of one-dollar lighters, but the upside is that the tool is not hard to understand.
First of all, Make is a tool for building fairly large projects (that is, managing a big pile of source files), so much of its design revolves around compilation. It is a product of the 1980s, aimed at programmers who had almost no tooling at all; compared with having nothing, Make is still very useful. In general, Make mainly does these things:
- decide whether the execution of certain Linux commands can be skipped
- determine the order in which commands run, and run unrelated commands in parallel
Let's put compilation aside for now and look at the basic syntax of a Make rule.
target: prerequisites
Make is a bit of a patchwork, with many things lumped together. Take targets, for example. There are:
- real targets, which correspond to actual files
- phony targets, which have nothing to do with files on disk
The prerequisites are one or more targets, and they decide whether the commands below need to be executed.
The commands are just Linux commands; normally they spell out how to produce a real target (if the target is a real one).
Still leaving compilation aside, let's look at the simple example in make/makefile1:
all: son.bak.txt
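The listing above is truncated to its first line; based on the walkthrough below, a plausible reconstruction of the full make/makefile1 (the exact echo texts are inferred, and recipe lines must start with a TAB) is:
all: son.bak.txt
	@echo I am all
son.bak.txt: son.txt
	@echo I am son.bak.txt
	cp son.txt son.bak.txt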
Prefixing a command with @ tells Make not to echo the original command.
Here son.bak.txt is a real target, because there really will be a file called son.bak.txt (after the cp command runs).
son.txt is also a real target, and it exists from the very beginning.
By convention, all is a phony target that acts as the entry point (or the overall goal) of the whole Makefile, and it is written as the first target (as on the first line here). When we run make without specifying a target, we are actually asking Make to produce (make) the all target.
In practice we don't need to carefully distinguish real and phony targets; we only need to know that some targets are special.
What does it mean to make a target? It boils down to:
- if any target in the prerequisites has not been made yet, make those first
- after all targets in the prerequisites have been made, at certain moments, run its commands once
In this example, Make's story goes like this:
- we ask to make the all target
- Make starts making all
- it quickly finds that son.bak.txt has not been made, so it goes to make son.bak.txt
- son.bak.txt in turn requires making son.txt
- however, there is no rule for son.txt, and the file son.txt already exists, so we treat son.txt as already made
- with all prerequisites ready, Make notices that the file son.bak.txt does not exist
- so it runs the commands (cp and echo)
- after the commands finish, son.bak.txt is considered made
- with all of all's prerequisites ready, Make notices there is no file named all
- so it runs the command (echo) (a phony target always runs its commands, since the corresponding file never exists)
- after the command finishes, all is considered made
The program output is as follows:
# make
One major feature of Make is that when we have a big pile of source files, we don't want to recompile everything after changing one file; we only want to pick out the affected code and recompile that, which saves a lot of time. So Make selectively executes the commands in the Makefile. Then when will a command actually be executed?
Continuing the previous example: after running make once, if we try to make again, the output is only
# make
Clearly, only the all target was executed. This time, Make's story is:
- we ask to make the all target
- Make starts making all
- it finds that son.bak.txt has not been made in this run, so it goes to make son.bak.txt
- son.bak.txt requires making son.txt
- we treat son.txt as already made
- with all prerequisites ready, Make checks son.bak.txt
- we already have son.bak.txt, and it is as new as son.txt
- so the commands are skipped
- with all of all's prerequisites ready, there is still no file named all
- so its command (echo) runs (a phony target always runs its commands, since the corresponding file never exists)
- after the command finishes, all is considered made
If we remove the command of all in the Makefile and run make again:
all: son.bak.txt
son.bak.txt: son.txt
	echo I am son.bak.txt
	cp son.txt son.bak.txt
the output shrinks to the lines below, which means this make did nothing at all.
# make
make: Nothing to be done for 'all'.
This is easy to understand: we haven't modified son.txt, so there is no need to copy son.bak.txt again. Make decides whether a target needs to be remade based on whether any target in its prerequisites has been updated (its commands ran, or the file has new modifications).
Specifically, Make compares the modification times of son.bak.txt and son.txt to decide which file is newer.
We can edit son.txt, write something and save it; the next time we run make, we will find that Make copies son.bak.txt again.
# echo 1 > son.txt # write 1 to file son.txt
Our I am son.bak.txt line is back, which means son.bak.txt was copied again.
Here is a slightly more complex example. In make/makefile2, the Makefile content is as follows:
all: son.txt
We also add a phony target clean to clean up the intermediate files the Makefile produces. Running make clean deletes mom.txt and dad.txt.
# make clean
Note that after the first make, if no files are modified, every target except all skips its commands. When only grandma.txt is modified, the dad.txt target also skips its commands. Make's story during that run is:
- make the all target
- make son.txt
- make mom.txt
- make grandma.txt
- check whether the file grandma.txt has new modifications
- it does, so run cat grandma.txt > mom.txt
- make dad.txt
- make grandpa.txt
- check whether the file grandpa.txt has new modifications
- it doesn't, so do nothing
- check whether mom.txt or dad.txt ran their commands
- mom.txt did, so run cat mom.txt > son.txt and cat dad.txt >> son.txt
- check whether son.txt ran its commands
- son.txt did, so run the empty command
Obviously, in the example below, aunt.txt and uncle.txt have absolutely nothing to do with all and are never depended on, so their commands can never be executed.
all: son.txt
A Makefile is essentially a manual of the dependencies between files and commands: it describes which files and which commands produce which files. In the previous example, producing son.txt depends on mom.txt and dad.txt, while producing mom.txt and dad.txt depends on grandma.txt and grandpa.txt respectively. In the real world, to speed up compilation, the commands of several targets that are depended on but independent of each other can run at the same time. This in turn requires us to describe the dependencies accurately. If we write the previous Makefile as:
all: son.txt mom.txt dad.txt
then the commands of son.txt, mom.txt and dad.txt may be executed simultaneously, and there is a chance that the command of son.txt (which needs mom.txt as input) starts before the file mom.txt has been produced.
Time | son.txt | mom.txt | dad.txt
---|---|---|---
1 | cat mom.txt > son.txt | |
2 | ?? where is my file? | | cat > dad.txt
3 | | cat > mom.txt |
So we need to make sure the execution order looks like the following.
Time | son.txt | mom.txt | dad.txt
---|---|---|---
1 | | cat > mom.txt | cat > dad.txt
2 | cat mom.txt > son.txt | |
Running make -j enables parallel building. In the example below, we can see that grandpa and grandma start at the same time, and son starts only after both mom and dad have finished; the total time is 12 s.
all: son
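The listing above is truncated; a minimal sketch of such a Makefile (the sleep durations are made up and will not reproduce the exact 12 s timing described below) is:
all: son
son: mom dad
	sleep 1
mom: grandma
	sleep 2
dad: grandpa
	sleep 2
grandma:
	sleep 3
grandpa:
	sleep 3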
# make -j
The visualization (a Gantt-style table of which target is running at each second, ticks 14 to 25) shows grandma and grandpa running in parallel first, mom and dad each starting as soon as their respective prerequisite finishes, and son running last, for a total of 12 s.
If we modify the code as follows, all targets start executing at the same time, and the total time drops to 5 s.
all: son mom dad grandma grandpa
# make -j
The visualization shows all five targets starting simultaneously at tick 33 and the longest-running one finishing at tick 37, for a total of 5 s.
Make sure you have a machine with a public IP address (it could be a VPS); otherwise this method may not be applicable. Let's take exposing the SSH port of a machine behind NAT as an example. There are three roles in total:
The first step is to install Docker on both the local machine and the public server. For Ubuntu 18.04+, I personally prefer to install Docker through apt.
local-machine&public-server$ sudo apt update
Gost supports numerous proxy protocols. For reliability, it is suggested to use a secured protocol to resist interference from secure gateways like the GFW. Luckily, with the help of Docker, it is very easy to set up a secured tunnel. Another piece of good news is that the Docker daemon will monitor the service and automatically restart the Gost service at boot or on failure. Let's say goodbye to the annoying systemd.
On Public Server
cd ~ # or any other path you like
On Local Machine
docker run --name gost_client --net host -id --restart always --entrypoint "" gogost/gost gost -L rtcp://:<remote_port>/:<local_port> -F "relay+wss://<user>:<passwd>@<pub_server_ip_or_domain>:<gost_service_port>"
Here is the explanation of parameters:
- user / passwd: the username and password for Gost. They are unrelated to any other accounts, such as Linux accounts.
- pub_server_ip_or_domain: either the public IP address or the bound domain name. If you would like to use a valid SSL certificate issued by a CA, you must use the domain name here.
- gost_service_port: can be an arbitrary value. It is also fine to set this port to 443 to pretend to be an HTTPS server.
- remote_port: the port that the public server's Gost listens on. Data received on this port is forwarded to local_port on the local machine. It can be an arbitrary value.
- local_port: the port associated with some service on the local machine. In our case, it should be 22, the SSH port.
Back to our case, the corresponding commands are:
# On Public Server
After that, SSH port 22 on the local machine should be mapped to port 2022 on the public server.
# user-local is the name of the user on local machine
Debugging Tips: If Gost is not working properly, we can read the log using the command below.
docker logs -f gost_server # or gost_client
By default, Gost generates a self-signed SSL certificate if the user doesn't specify one. However, this can be considered unsafe, as the communication is no longer able to defend against MITM attacks. Moreover, some secure gateways may disrupt a TLS session that uses a self-signed certificate.
The solution is simple. Remember the gost directory we created before? We just need to put our certificate there and restart the service. There should be two files in total, cert.pem and key.pem, whose contents start with -----BEGIN CERTIFICATE----- and -----BEGIN RSA PRIVATE KEY----- respectively.
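A quick way to sanity-check the two files before restarting (my own suggestion, assuming they live in the gost directory) is:
openssl x509 -in gost/cert.pem -noout -subject -issuer -dates
openssl rsa -in gost/key.pem -check -noout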
Lastly, the command to restart the service is:
# On Public Server
If you don't want to install any new software, we can also use SSH to build a tunnel. This method is also simple (sometimes), but it is not very reliable; as you might know, an SSH connection can be easily disrupted for many reasons. In our case, we just need to type this command on the local machine to set up a tunnel:
local-machine$ ssh -R "[::]:2022:localhost:22" user-public@a.b.c.d
If everything goes well, now you can connect to the local machine through:
your-computer$ ssh user-local@a.b.c.d -p 2022
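Since a plain ssh -R tunnel dies silently, a common workaround (my addition, not from the original post) is to wrap it in autossh, which restarts the tunnel automatically:
local-machine$ autossh -M 0 -N -o "ServerAliveInterval 30" -o "ServerAliveCountMax 3" -R "[::]:2022:localhost:22" user-public@a.b.c.d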
However, if you fail to connect to your local machine, please check the sshd configuration on the public server.
public-server$ sudo nano /etc/ssh/sshd_config
Find the following line (in the nano editor, press Ctrl-W to search), uncomment it, and replace no with yes.
# Before
Then restart the SSH server.
public-server$ sudo service sshd restart
A common cluster comprises management nodes and compute nodes. This article takes our cluster as an example to demonstrate the steps to install and configure Slurm. In our case, the management node is called clab-mgt01, while the compute nodes are named clab01 through clab20 in order.
Execute the following command to install the dependencies on all machines (clab-all refers to all machines, including management and compute nodes).
clab-all$ sudo apt install slurm-wlm slurm-client munge
Tips: There are several tools that may help manage multiple nodes easily:
- iTerm2 (on Mac) / Terminator (on Linux)
- csshX (on Mac) / cssh (on Linux)
- Parallel SSH (on the cluster side)
There is an official online configuration generator, and we should carefully check the fields below.
- clab-mgt01 in our case.
- clab[01-20] in our case.
- 2.
- 2, otherwise 1.
Click submit, then we can copy the generated content to /etc/slurm-llnl/slurm.conf on all machines.
Tips: Don't forget the shared storage (e.g., NFS) on the cluster. We can use it to distribute files.
Once Munge is installed successfully, the key /etc/munge/munge.key is generated automatically. All machines are required to hold the same key, so we distribute the key on the management node to the remaining nodes, including the compute nodes and any backup management nodes.
Tips: Again, we can also use the shared storage to distribute the key.
Then make sure the permission and the ownership are correctly set.
clab-all$ sudo chmod 400 /etc/munge/munge.key
By default, Slurm cannot work well with cgroups. If we start the Slurm service right now, we may receive the error shown below.
error: cgroup namespace 'freezer' not mounted. aborting
This issue can be fixed by pasting the following content into /etc/slurm/cgroup.conf on the compute nodes,
CgroupMountpoint=/sys/fs/cgroup
or using this command:
echo CgroupMountpoint=/sys/fs/cgroup >> /etc/slurm/cgroup.conf
For unknown reasons, the permissions of the relevant directory are not set properly, which may lead to this error.
slurmctld: fatal: mkdir(/var/spool/slurmctld): Permission denied
The solution is to execute the commands below on the management nodes.
clab-mgt$ sudo mkdir -p /var/spool/slurmctld
So far, we have finished the basic configuration. Let us launch Slurm now.
# On management nodes
Run sinfo, and we should see that all the compute nodes are ready.
$ sinfo
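A quick end-to-end smoke test (my suggestion; the node count matches our 20-node cluster) is to run a trivial job across all nodes:
clab-mgt$ srun -N 20 hostname   # should print all 20 compute nodes' hostnames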
If your Slurm is not working correctly, you can try these commands to debug.
clab-mgt$ sudo slurmctld -D
Let us take Mellanox ConnectX Adapter as an example. Actually, this adapter can work in either InfiniBand Mode or Ethernet Mode, which is configurable with some tools provided by the vendor. As iWARP is not widely adopted, our article will not discuss this protocol.
 | InfiniBand Mode | Ethernet Mode
---|---|---
Supported by ConnectX | Yes | Yes
RDMA Support | Yes | Yes
Programmable with Verbs | Yes | Yes
TCP/IP Support | Needs IPoIB | Yes
Configurable with Netplan (e.g., Assign IP Address) | Needs IPoIB | Yes
Layout of RDMA Packet | IB Frame + IB Header | ETH Frame + RoCE Header
Layout of TCP Packet | IB Frame + IB/IPoIB/IP/TCP Headers | ETH Frame + IP/TCP Headers
Note that the RoCE header is a general concept; RoCEv1 and RoCEv2 define its details differently.
The easiest way is to look directly at the interface name and link type with ifconfig or ip under Linux. An InfiniBand adapter working in Ethernet mode looks exactly the same as a regular Ethernet adapter.
$ ip a
Besides, ibdev2netdev can also help.
$ ibdev2netdev
Another approach is ibstat; the Link layer field shows which mode the adapter is working in.
$ ibstat
There is no general way to alter the work mode for now. For the Mellanox ConnectX adapter, the vendor provides a tool called mlxconfig. Here is the usage listed in the official document, where you can find more information.
$ sudo mlxconfig -d /dev/mst/mt4103_pci_cr0 set LINK_TYPE_P1=1 LINK_TYPE_P2=1
Note that P1 and P2 refer to the two separate ports on the adapter. Attention: please make sure the network switch is capable of handling InfiniBand or Ethernet frames before altering the work mode. If the switch cannot recognize the data frames sent from the server, you may observe Physical state: Polling reported by ibstat, as the packets are not forwarded by the switch correctly. Certain network switches can only forward one type of data frame at a time, which means you may need to manually reconfigure the switch to make it work with the other type of data frame.
By default, IPoIB is configured automatically when an IP address is assigned to the interface. The IP address can be managed by netplan or NetworkManager, depending on your Linux distro. As for the configuration file, there is no difference between InfiniBand and regular Ethernet adapters.
# Assign a static IP address with netplan for an InfiniBand interface
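The listing above is truncated; since the post states the configuration is identical to a regular Ethernet adapter's, a minimal netplan sketch (the interface name ibp129s0 is taken from later in this post, and the address is made up) might look like:
network:
  version: 2
  renderer: networkd
  ethernets:
    ibp129s0:
      addresses:
        - 192.168.100.10/24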
Once the above configuration is applied and the interface is brought up successfully, we can see that the ib_ipoib module is loaded.
$ lsmod | grep ipoib
If the IP address doesn't appear in ip a, we need to check the status of the InfiniBand adapter and make sure its state is Active in ibstat. A common mistake is forgetting to enable opensm / opensmd, which leaves the adapter stuck at State: Initializing. Note that opensmd does not launch on startup by default.
# Start OpenSM
The major difference between RoCEv1 and RoCEv2 is that RoCEv2 can be routed over IP networking, while RoCEv1 routes via MAC addresses. A fun fact is that RoCEv1 and RoCEv2 may be enabled simultaneously, and we can choose the version at runtime by specifying the Group ID (GID). Mellanox ships a script named show_gids that displays the RoCE version associated with each GID.
$ show_gids
ethtool can also read out this information, and it works in both InfiniBand and Ethernet modes.
$ ethtool ibp129s0
Since the Arm community has various opinions on how to boot an Arm machine, such as UEFI + ACPI (widely used by commercial Arm servers as well as modern x86 systems), U-Boot + Device Tree (mostly used by embedded devices with limited resources), and even UEFI + Device Tree (like the Huawei L420 notebook I own), I would suggest not expecting OpenWrt to provide official support for UEFI + ACPI systems any time soon, as it is designed to run on tiny routers.
Don't be scared. With the help of buildroot, which automatically prepares the cross-compilation toolchain we need, this step is much simpler nowadays.
Note: My test environment is Ubuntu 21.10 ARM64 on an Apple M1 Pro. It doesn't matter if you use a machine with a different system or architecture like AMD64, but you may need to take a few extra steps if so.
For Debian / Ubuntu users,
sudo apt update
Note: The content of this sub-section is copied from the official guide. Take a look at it if this command is not applicable to your system.
git clone https://git.openwrt.org/openwrt/openwrt.git
To save effort, it is a good idea to modify an existing configuration instead of creating a new one.
wget https://downloads.openwrt.org/releases/21.02.2/targets/armvirt/64/config.buildinfo
Open the file target/linux/armvirt/config-5.4 and append the following lines to the end of the file.
CONFIG_EFI_STUB=y
make menuconfig
Tweak the configuration as you like, but you should clearly understand the consequences before turning something on or off. Keeping the default options is also fine.
Note: These commands will build the whole toolchain from source the first time they are executed. The compilation process is very slow.
make -j $(nproc) defconfig download clean world
It compiles the kernel and all of the selected pre-installed utilities, then generates an EFI binary of the Linux kernel and an Ext4 / SquashFS partition image of the rootfs.
The exciting moment comes. Let's test the kernel and rootfs we just built.
For Ubuntu users, I would suggest installing virt-manager instead, which offers a helpful GUI wizard for QEMU.
sudo apt install virt-manager
The magical QEMU allows virtual machines to boot a kernel without a bootloader. That is a great feature that enables us to test the kernel's functionality at an early stage.
qemu-system-aarch64 -m 512 -nographic -cpu cortex-a72 -smp 1 -M virt -kernel ~/openwrt/bin/targets/armvirt/64/openwrt-21.02.2-armvirt-64-Image-initramfs -bios /usr/share/qemu-efi-aarch64/QEMU_EFI.fd
Note: Image-initramfs is the kernel binary with OpenWrt's rootfs integrated as the initramfs, so this virtual machine loses its data each time it reboots.
Note: If you encounter this issue,
EFI stub: Booting Linux Kernel...
EFI stub: ERROR: Failed to relocate kernel
the solution is to increase the memory capacity of your virtual machine. Empirically, it should be at least 256 MB.
Considering that data loss is not acceptable, while not every hypervisor is capable of launching a kernel directly, we should put everything we built onto a disk, i.e., the virtual machine's disk image.
To keep things simple, let's start by building a raw disk image, which is one of the virtual disk formats supported by QEMU.
tonny@vm:~$ dd if=/dev/zero of=disk.img bs=1M count=1024
This command creates an empty disk image. Feel free to change the value of count to adjust the size of the disk (size = 1 MB * 1024 = 1 GB).
tonny@vm:~$ fdisk disk.img
A new GPT partition table with two partitions is written to the disk image.
tonny@vm:~$ sudo losetup -Pf disk.img
The OS has recognized the two partitions, loop5p1 and loop5p2.
tonny@vm:~/mnt$ sudo mkfs.vfat /dev/loop5p1
We don't need to format the second partition (rootfs) for now, because we can directly restore the rootfs partition image instead, which is already formatted with the Ext4 file system.
ESP partition contains the EFI executables of bootloaders (e.g., GRUB), as well as its configuration files. We can also put the kernel binary here.
Note: Some Linux distributions, like Ubuntu, will put their kernel in a third partition.
Unlike the rootfs, the OpenWrt build system doesn't generate an ESP partition image for the ARM64 platform. That means we have to build the ESP partition manually.
tonny@vm:~/mnt$ mkdir -p ~/mnt/esp
tonny@vm:~$ sudo dd if=~/openwrt/bin/targets/armvirt/64/openwrt-21.02.2-armvirt-64-rootfs-ext4.img of=/dev/loop5p2 bs=1M
The rootfs image is about 128 MB, which implies the file system inside assumes the partition is about 128 MB. Our rootfs partition is likely larger than that, so we should notify the file system that the partition size has changed.
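A sketch of growing the ext4 file system to fill the partition (resize2fs works offline on the loop partition, after a forced fsck):
sudo e2fsck -f /dev/loop5p2
sudo resize2fs /dev/loop5p2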
For Ubuntu users,
sudo apt install grub-efi-arm64-bin
Note: If your host's architecture isn't ARM64, apt may fail to find this package. Fortunately, thanks to the Multiarch feature, we can easily install packages for another architecture. Take Ubuntu AMD64 as an example.
- Request ARM64 packages:
sudo dpkg --add-architecture arm64
- Add an apt repository for ARM64: modify the file /etc/apt/sources.list and add an ARM64 repository. Pay attention that ARM64 and AMD64 don't share the same repositories, so we also need to add an architecture filter to each repository line. Here is an example.
deb [ arch=amd64 ] https://mirrors.ustc.edu.cn/ubuntu/ impish main restricted universe multiverse
# deb-src [ arch=amd64 ] https://mirrors.ustc.edu.cn/ubuntu/ impish main restricted universe multiverse
deb [ arch=amd64 ] https://mirrors.ustc.edu.cn/ubuntu/ impish-security main restricted universe multiverse
# deb-src [ arch=amd64 ] https://mirrors.ustc.edu.cn/ubuntu/ impish-security main restricted universe multiverse
deb [ arch=amd64 ] https://mirrors.ustc.edu.cn/ubuntu/ impish-updates main restricted universe multiverse
# deb-src [ arch=amd64 ] https://mirrors.ustc.edu.cn/ubuntu/ impish-updates main restricted universe multiverse
deb [ arch=amd64 ] https://mirrors.ustc.edu.cn/ubuntu/ impish-backports main restricted universe multiverse
# deb-src [ arch=amd64 ] https://mirrors.ustc.edu.cn/ubuntu/ impish-backports main restricted universe multiverse
deb [ arch=arm64 ] https://mirrors.ustc.edu.cn/ubuntu-ports/ impish main restricted universe multiverse
# deb-src [ arch=arm64 ] https://mirrors.ustc.edu.cn/ubuntu-ports/ impish main restricted universe multiverse
deb [ arch=arm64 ] https://mirrors.ustc.edu.cn/ubuntu-ports/ impish-security main restricted universe multiverse
# deb-src [ arch=arm64 ] https://mirrors.ustc.edu.cn/ubuntu-ports/ impish-security main restricted universe multiverse
deb [ arch=arm64 ] https://mirrors.ustc.edu.cn/ubuntu-ports/ impish-updates main restricted universe multiverse
# deb-src [ arch=arm64 ] https://mirrors.ustc.edu.cn/ubuntu-ports/ impish-updates main restricted universe multiverse
deb [ arch=arm64 ] https://mirrors.ustc.edu.cn/ubuntu-ports/ impish-backports main restricted universe multiverse
# deb-src [ arch=arm64 ] https://mirrors.ustc.edu.cn/ubuntu-ports/ impish-backports main restricted universe multiverse
- Install the ARM64 GRUB:
sudo apt update
sudo apt install grub-efi-arm64-bin
tonny@vm:~$ lsblk -o PATH,UUID,PARTUUID /dev/loop5
Those UUIDs will be referenced by the GRUB configuration.
Create a new file, ~/grub-early.cfg, and write the following lines. This configuration will be hardcoded into GRUB's EFI binary.
search.fs_uuid CF95-2044 root
Replace the UUID with your loop5p1's.
# tonny@vm:~$ sudo mount /dev/loop5p1 ~/mnt/esp
Note: It is not recommended to use grub-install here. One of its typical usages is:
sudo grub-install --target=arm64-efi --efi-directory ~/mnt/esp --bootloader-id=GRUB --boot-directory ~/mnt/esp/boot/
The hidden, disgusting thing is that if you use the GRUB provided by Ubuntu, this command hardcodes an important GRUB variable, prefix='/EFI/ubuntu', into the EFI binary, and there is no way to change it.
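Instead, you can build the EFI binary yourself with grub-mkimage; this is a sketch (the module list is illustrative, and the prefix /boot/grub matches the grub.cfg location below):
sudo mkdir -p ~/mnt/esp/EFI/BOOT
sudo grub-mkimage -O arm64-efi -o ~/mnt/esp/EFI/BOOT/BOOTAA64.EFI \
    -c ~/grub-early.cfg -p /boot/grub \
    part_gpt fat ext2 normal linux search search_fs_uuid configfile serial echo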
tonny@vm:~/mnt/esp/EFI/BOOT$ cd ../..
The content of grub.cfg is:
serial --unit=0 --speed=115200 --word=8 --parity=no --stop=1 --rtscts=off
Replace the PARTUUIDs (not UUIDs) with your loop5p2's.
tonny@vm:~/mnt/esp/boot$ sudo cp ~/openwrt/bin/targets/armvirt/64/openwrt-21.02.2-armvirt-64-Image vmlinuz
tonny@vm:~/mnt/esp/boot$ cd ~
If everything goes well, you could see your kernel is running happily. Enjoy it!
Note: You don't have to unmount the disk before launching the virtual machine. But you should sync the disk to make sure all the data cached in memory is written back.
tonny@vm:~/mnt/esp/boot$ sync
tonny@vm:~/mnt$ sudo umount ~/mnt/esp
tonny@vm:~/mnt$ virt-manager
Note: Sometimes virt-manager requires elevated permissions to run.
The recommended configuration:
- Architecture: aarch64
- Machine type: virt
- Storage: Home ➡️ Choose Volume disk.img
- Firmware: UEFI aarch64
Note: You can't change the firmware type after the pre-install configuration.
Normally, enlarging an ext4 partition only requires resize2fs. However, things get complex when Ubuntu uses an LVM partition as its default root partition.
Logical Volume Manager (LVM) is similar to Dynamic Disks under Windows: it can take several GPT / MBR partitions on different hard disks as a storage pool (LVM calls it a Volume Group, VG) and allocate space from this pool; Linux then recognizes each allocated space (LVM calls it a Logical Volume, LV) as a usable partition.
Thus, we should modify not only the GPT / MBR partition table but also the LVM configuration.
I suggest performing all the operations in a live CD environment to avoid unpredictable problems. I haven't tested online resizing of the root partition so far.
First, inspect the disks with lsblk --fs. Here nvme0n1p3 is the last GPT partition on the disk nvme0n1; it is easy to identify that this partition is an LVM PV, and ubuntu--vg-ubuntu--lv is the corresponding LV.
tonny@vm:~$ lsblk --fs
Note that ubuntu--vg-ubuntu--lv is the root partition of the system here.
tonny@vm:~$ df -Th
Filesystem Type Size Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv ext4 78G 7.7G 67G 11% /
/dev/nvme0n1p2 ext4 974M 87M 820M 10% /boot
/dev/nvme0n1p1 vfat 511M 3.6M 508M 1% /boot/efi
/dev/mapper/test--vg-test--lv ext4 464M 24K 429M 1% /home/tonny/mnt
You can also check the LVM volume group status with vgs.
tonny@vm:~$ sudo vgs
VG #PV #LV #SN Attr VSize VFree
test-vg 1 1 0 wz--n- 496.00m 0
ubuntu-vg 1 1 0 wz--n- <78.50g 0
Next, modify the partition table with fdisk. I will use an emulated disk /dev/loop3 to demonstrate the whole process. Don't worry, you won't lose your data under normal circumstances. These commands only modify the partition table, but make sure you DO NOT remove the LVM signature, otherwise the system may no longer recognize your LVM PV.
tonny@vm:~$ sudo fdisk /dev/loop3 # replace with your hard disk, such as /dev/nvme0n1p3
tonny@vm:~$ sudo partprobe # ask kernel to read the new partition table
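As an alternative to driving fdisk interactively (my suggestion, not from the original post), growpart from the cloud-guest-utils package can grow a partition in place while preserving its signature:
sudo apt install cloud-guest-utils
sudo growpart /dev/nvme0n1 3   # grow partition 3 of /dev/nvme0n1 to fill the disk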
At this moment, the LVM Volume Group status has changed to,
tonny@vm:~$ sudo vgs
VG #PV #LV #SN Attr VSize VFree
test-vg 1 1 0 wz--n- 1020.00m 524.00m
ubuntu-vg 1 1 0 wz--n- <78.50g 0
Observe that the VFree of test-vg has increased by 524.00 MB.
tonny@vm:~$ sudo lvextend -l +100%FREE /dev/mapper/test--vg-test--lv
The free space of VG is used up now.
tonny@vm:~$ sudo vgs
VG #PV #LV #SN Attr VSize VFree
test-vg 1 1 0 wz--n- 1020.00m 0
ubuntu-vg 1 1 0 wz--n- <78.50g 0
Up to now, although the LVM LV has been resized, the ext4 file system is not aware of the extra available space. Simply run resize2fs to let it know.
tonny@vm:~$ sudo resize2fs /dev/mapper/test--vg-test--lv
We can see that the available space of test--vg-test--lv has been enlarged.
tonny@vm:~$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/test--vg-test--lv 973M 1.3M 917M 1% /home/tonny/mnt
Since NexT v8 has massive code changes compared to v7, and my old way of hacking the code was rather wild, NexT.Remix does not inherit the code of my previous modified theme. The changes now follow NexT's customization conventions and use Theme Inject to modify the theme: only a few files are added and no original files are modified, so merging new upstream code is (relatively) easy.
Besides updating the upstream and Hexo versions and adjusting the UI style, I also replaced the broken utterance with giscus.
As a theme with a rich history (it changed repositories twice, for instance), NexT has stayed true to its origins and still keeps its original look. However, I prefer the currently popular post-flat style while also craving NexT's rich features, and I'm too lazy to migrate platforms, so the only option was to turn NexT into what I like.
Since I am no design master, I had to "borrow" from existing excellent styles.
Actually, NexT has a very good foundation; a few casual tweaks fully satisfy my taste.
With proper configuration, NexT's UI is not bloated at all. This has been the goal since Remix v1. This time I went further and removed unnecessary elements, such as the underlines that were everywhere, the clutter on the friend-links page and in the table of contents, and the underline under dates. Pagination also became a target of my heavy punches.
Besides, I have been de-emphasizing tags and categories for a long time, because I personally never look at the tags and categories of blog posts, my readers all arrive from search engines, and search engines can extract keywords from the text without tags. Of course, the more important reason is that I'm too lazy to add them, so the UI elements for tags and categories were reduced as well.
This is also why I modify the theme myself. One obvious reason is that I don't want my blog theme to collide with everyone else's. Another is that a blog theme should match my writing style. Unlike this person, who pursues the ultimate reading experience and sense of gain for readers, I hope that I can write:
In other words, I don't spend much effort on improving the reading experience. I don't like putting energy into things I don't care about, and less important problems get solved in whatever way takes the least effort. Take illustrations: forget about a unified illustration style; if an image is only for decoration, I simply don't add one. So what I need is a theme that looks good without illustrations, and one that doesn't require writing an excerpt but automatically uses the first paragraph of a post as the excerpt.
I hope my posts focus on answering questions that many people care about but that are still unsolved or lack a summarized answer; a reader clutching at a life-saving straw surely won't be picky about the reading experience. Of course, a basic reading experience is still necessary, and improving the readability of posts through theme modifications is part of that.
Also, I expect that hacking a blog theme reflects the blogger's skill to some degree... What, you say the real gurus build their own blog frameworks? I'm not a front-end professional; I won't take on a porcelain job without the diamond drill.
Oh right, the blog now uses something called the Neko language: partly bilingual Chinese-English, partly the plastic English of NexT that I modified.
I feel this blog has made remarkable progress this year; it looks more professional.
It seems I failed to keep up the pace of one post per month. Anyway, I'd say the quality of the posts is still a tiny bit better than last year's. (Maybe I slacked off so much that I didn't step into any pitfalls.)
After the changes I feel comfortable: both the look and the code are much cleaner than before.
The old setup was expensive and awful; the Azure CDN + GitHub Pages combo is just too strong.
As long as I write posts faster than they go stale, traffic is bound to grow. Except Baidu still stubbornly indexes only my homepage, trash.
There is an article that already describes the detailed steps for the automatic installation of Ubuntu 20.04 Server, but according to its author, their blog system ripped out some important characters. So, by checking against this script, I figured out the correct way to achieve our goal.
Download the live CD image in whichever way you prefer. For users located in China, I would suggest downloading from https://mirrors.tuna.tsinghua.edu.cn/ubuntu-releases/focal/ubuntu-20.04.3-live-server-amd64.iso.
The only thing we need to do is update several files. Here are some recommended editors:
- Ultraiso, to edit the ISO file.
- ISO Master should also work. (I haven't tried it.)
- For the rootfs image, cubic is everything you need. You can refer to my previous post to learn how it works.
. There are two bootloader configuration files need to modify, one for UEFI system, one for the legacy one. Append the kernel arguments like this,
/cdrom/isolinux/txt.cfg
label live
/cdrom/boot/grub/grub.cfg
menuentry "Install Ubuntu Server" {
If you would like to skip the integrity check, you can append the kernel argument fsck.mode=skip as the following example shows.
# File: /cdrom/isolinux/txt.cfg
Note: The HWE kernel is a newer version of the Linux kernel than the default one, shipped with newer drivers. Theoretically, it has better support for the latest hardware.
Two new files are also required for automatic answering.
/cdrom/user-data
This configuration is what I am using now, and it is for machines without Internet access. I have verified that it makes the installation procedure fully automatic.
#cloud-config
Note:
- The direct storage layout means using and erasing the whole disk. (The default option provided by the interactive installer.)
- The password is ubuntu. It can be generated by mkpasswd.
For more usages, check this example.
/cdrom/meta-data
Just create an empty file and put it there.
Rewind to last year: I never wrote a recap for SC20, because that competition went so, so badly. Bad enough that we achieved the distinguished result of second-to-last; bad enough that I was too embarrassed to put it on my resume; bad enough that I only asked about our SC20 score when SC21 was about to start. SC20 was our first SCC, but it didn't have to be bad in such a distinctive way. That competition featured, among other things, the following situations (cluster-init), yet the other team members had apparently never touched a cloud environment at all (foreshadowing #1). I was in charge of the Reproducibility Challenge back then, but I had no idea what it was about and even spent my energy modifying the code. I used GDR to optimize GPU P2P communication and thought I was pretty awesome (foreshadowing #2). I never watched the webinar, and only at competition time did I learn that we were expected to "produce a report of internationally publishable quality within 46 hours". As someone with no papers (still true today), who couldn't even plot properly with Excel, I realized on the spot: uh oh.
The paper to reproduce proposed a program called MemXCT, with CPU and GPU versions, and the competition required plotting data from both and writing a report. On the GPU side, having never used the cloud cluster before, I had no idea what was pre-installed. Our captain was aggressively promoting his HPCX OpenMPI, while I had been testing with the OpenMPI from the NVIDIA HPC SDK. I figured: HPCX is NVIDIA's, the HPC SDK is NVIDIA's, their OpenMPI must be the same thing. Naturally, MPI blew up gloriously and the program refused to run no matter what. After endless MPI parameter tuning and recompilation, trying every permutation of MCA options, nothing worked. I crawled through it with GDB step by step to find the source of the explosion (back then I didn't even know GDB could print a backtrace directly), and found it was exactly the few lines of my GDR P2P modification that crashed. But wasn't HPCX OpenMPI supposedly CUDA-aware and GDR-capable? Young and naive, I never doubted HPCX, and kept blindly trying from day one until the second evening, when it finally occurreded to me to switch back to the NVIDIA HPC SDK. One MPI swap later, every problem disappeared. Thanks to the luxurious four hours of sleep I enjoyed during that stretch, my brain was throttled so hard that the operation "drop this optimization" never crossed my mind (the reproducibility task didn't even call for optimization); for any application, getting the program to run at all matters most.
On the CPU side, as everyone knows (not really), we can pin processes with mpirun's various quirky options (such as ppn and map-by) or with a job scheduler (such as Slurm). At the time I understood neither Slurm nor mpirun's options, and to make things worse the program was hybrid (MPI + OpenMP), which needs several CPU cores per MPI rank rather than one core per rank. I fought like a tiger. The good news: the program ran. The bad news: it ran across multiple nodes in a bizarre posture, e.g., when it should have run on four nodes, one node was struggling while the other three watched.
By the time the competition was almost over, both versions were finally running normally. Since we had written absolutely no report in advance and had no idea how to draw academic figures, if it weren't for a helper assigned at the last minute, never mind finishing the report before the deadline, we probably couldn't even have produced a single figure. We submitted so late that the upload of this pile of academic garbage was only half done when the competition ended. Just as well: that barely-finished dark history will never be seen by anyone else.
The others didn't fare much better. Since pre-competition preparation basically didn't exist (and one member had bailed), the applications were a hundred-million-ton mess, scoring barely any points: CESM didn't run, GROMACS only partially ran, though MiniVite seemed okay. Our only hope was the benchmarks, so we went all-in with the remaining funding, hoping to scrape a category award. The team ground HPL for ages up to 120 TFlops, briefly taking first place, only to be immediately flattened by ETH's 129T, and then flattened again when THU, looking for headlines in the middle of the night, dropped 300T and left not even ashes. Meanwhile Team T's HPCG and IO500 scores also nuked the leaderboard: 3.89x and 5.76x the previous leader (a poor sitting duck), as if Moore's Law had been resurrected. Against such a gap we had no chance to fight back, and could only strongly condemn their 5 a.m. fish-bombing behavior.
Only after the competition did we learn that for HPL and HPCG, Team T's plan, submitted to the committee months earlier, had already schemed to loot the entire Azure datacenter: pile up enough GPUs and the benchmark finishes before other teams can even smell the exhaust; operate fast, boot and shut down quickly, and a mountain of GPUs costs almost nothing. They even prepared a fallback to other GPU node types in case V100 nodes ran out. The committee approved the plan and pulled up chairs to watch the show. Our HPL, by contrast, got docked for exceeding the spec and ended near the bottom. IO500 was even more outrageous: they built MadFS, a file system purpose-built for the leaderboard (see the talk "IO500 S: There is rjgg behind MadFS" on YouTube); deploying research results into a student competition is simply dimensionality-reduction warfare. THU's systems people are just unreasonably strong; everyone in TUNA is talented and speaks so nicely.
In short, if SC21 had not gone well, I would never have mentioned SC20 again.
This year's recruiting results were a pleasant surprise: we tricked some class-of-2020 freshmen of the "just hand them a master's diploma directly" caliber into joining. We had interested applicants take a written test, intended to select those willing to explore the field and do some searching. Since no CS course here covers supercomputing, my expectation was "any reasonable answer counts", a.k.a. "any writing at all counts". Then this happened:
Neko.d> Hello, since you single-handedly made me doubt whether the test was too easy, I'd like to have a chat with you in person today...
How can someone understand nearly everything right out of the gate (freshman year)? What are we old hands even for?
But another problem arose: this time we didn't manage to recruit a single girl, unlike last year when we had a female member and a transgender member, maxing out Diversity, which rounds up to a free pass to the finals. Now the team is all dudes, so the proposal could only awkwardly brag about our glorious past.
When writing the proposal, I planned to split the work: write a Chinese outline first, offload it to teammates to produce English paragraphs, then merge everything into a complete proposal. When it came time to offload, some were rushing homework, some rushing papers, fine, and some went out partying and played dead. By the time the captain's hot potato landed on my head, only a few days remained before the proposal deadline, plus I had to chase Inspur for sponsorship. Realizing I could count on no one and had no time to do it all myself, I could only rage impotently:
Neko.d> I AM REALLY ANGRY!
But my anger achieved nothing; the proposal still had to be written. In the end I hit upon a breathtaking solution: run the outline through Google Translate, skim and touch it up, and submit. As long as I firmly believed that reviewers hate reading walls of fluff, my conscience wouldn't hurt.
A few months later, surprisingly, our foot-written proposal passed, while Team T's got rejected (though they came back via their ISC championship; your granddad is still your granddad). Reportedly several other strong Chinese teams were rejected too, mostly dying of Diversity. Although one reviewer gave us a low Diversity score, the others were apparently successfully fooled. Before entering, make sure you've mastered American-style political correctness.
As everyone knows, I am the king of slacking; I have never lost at slacking. I slacked until the committee emailed urging us to log in, until even my teammates didn't dare slack anymore.
Committee> Azure says you have never logged in. The competition is coming; is your network bad, so you can't log in? Tell us if there's a problem.
Member 1> So when do we practice?
Member 2> So when do we practice?
The Oracle cluster's trial period was only five days, and I only remembered it the day after the account was activated (I suck). The cluster setup was fairly conventional, mainly because there was no autoscaling involved, and we didn't find many problems during the trial (mainly because there was no time to try many things).
One medium-sized incident: I wanted the team to get familiar with Linux account mechanics during practice, so I had them edit passwd and sudoer.d. I underestimated the risk of messing with sudoer.d: break it and sudo stops working, but restoring the sudoer.d config to rescue sudo itself requires sudo. Great, a deadlock. We had to call for outside help: Marcin, Oracle's technical leader. One look at my problem and, clearly having seen it all, he skillfully pitched the art of editing the boot args in GRUB over the serial console with init=/bin/bash; presumably his customers (oh wait, that's me, never mind) break their own systems all the time. When the GRUB screen rendered inside Windows Terminal I was deeply shocked: my first time seeing GRUB over a serial console, original TUI and all. We fixed the args, logged in as root, deleted the broken config, done.
Marcin> :) Sysadmin Sunday
Neko.d> Oh. Sorry to disturb your beautiful Sunday morning lol.
Marcin is a great guy (anyone who rescues us is a good guy; anyone who rescues us on a weekend is a great guy!). He also taught me quite a few practical things, such as:
- playbook automation tooling (reportedly usable for reinstalling parts of the software stack)
- ssh-agent
Over on the Azure side we weren't so lucky. Chatting with other teams' sysadmins before, during, and after the competition, everyone without exception hit:
- updated cloud-init configs not being applied. The accepted workaround: want to change the cloud-init script? Rebuild the cluster! It only takes ten-odd minutes! Very fast! This made me afraid of writing any complex init script.
- an issue with the image OpenLogic:CentOS-HPC:7_9-gen2:latest (which later turned out to be a years-old known issue; the previous sysadmin knew but never said)
- cudaErrorSystemNotReady errors
It's worth mentioning that all three of our sysadmins unanimously concluded that Andy, Azure's technical support, was a master of playing dead; all three of us were ignored by him. We went to complain to SCC chair Kathleen:
Kathleen> Ah, Andy said in the webinar that he's been busy lately, did you skip it?
Kathleen> And where have you been for the stand-up meetings?
Fine, fine. The only useful thing Andy ever told me was that the cloud-init script in the CycleCloud console doesn't have to be a cloud-init script.
In actual use, Azure opened with a jump scare: first the image-generation problem. We only spun up GPU nodes for practice right before the competition started (A100s are expensive, 27 dollars an hour, and the teammates weren't ready, still untangling the CPU version), and only then discovered that the CentOS 8 image came with no drivers whatsoever; installing drivers ourselves stepped on the NVSwitch mine. After wasting ages without success, and realizing that reinstalling drivers on every VM boot would be unbearable, we gambled on whether the 7.9 image shipped drivers. Thankfully it had a full set (apparently also the only image with complete drivers). Downgrading the OS caused some small problems: a missing Lmod broke module load for the Intel suite, and an ancient GCC broke the C++ ABI, but those were solvable.
Apart from being tortured by both clusters' "features", there weren't many other problems; all I remember is:
- spack clean -b
- nfslock not started, causing "No locks available"
On the teammates' side, everyone started out supremely confident, as if each were a compilation master, until they compiled the applications themselves, especially when building dependencies with Spack (e.g., OpenBLAS and FFTW) or building GPU code: the compilers blew up beyond recognition. They rotated through compilers and postures: GCC, ICC, ICX, NVC, all of them. A young and naive teammate even had a shred of trust in AMD and tried AOCC plus AOCL as drop-in replacements; I'll say no more, those who know, know. In the end the old recipe held: ICC + MKL, steady as an old dog. (Actually it was because I forgot to install MKL at first, so I had them try OpenBLAS, and then we enjoyed the compiler fireworks.) Also, NVCC kept picking the wrong host compiler and needed --ccbin and the like. In short, you never know how many compilers' output got stitched together into the final binary that looks runnable but will most likely explode.
I must say this year's instructions were rather sloppy; not only was the release timing arbitrary, but also:
These scrawled instructions made an already not-so-serious remote competition look even less serious. I really wanted to play a proper on-site SC, but there was no chance. Anyway, the competition still had to be played.
The assigned teammate was unbelievably good (also class of 2020), basically a bug-fixing automaton. Since he had experience messing with NVRTC and LLVM+PTX, we threw this problem at him, and overall there weren't many issues (the bugs got fixed so instantly that none left a deep impression on me).
The competition required running dozens of parameter combinations: several grid sizes, precisions, and computation methods. After one round we found that a certain class of parameter combinations exploded; in summary there were several styles of explosion. Style one: a gdb backtrace showed it was just file permissions, a rookie mistake. Style two: it computed NaN. Reading the instructions carefully, the problem setters had clearly anticipated NaN, since one question literally asked something like "have you seen NaNs", so it was obviously a deliberate trap. A young teammate, blind to the evil of human nature, even conceived the idea of patching the code to fix the NaN. Style three: the backtrace wouldn't come out at all, suggesting stack corruption. We set ulimit to unlimited stack: nope. Single-stepping into a for loop reading an array, the program misbehaved. My brain flatlined: I actually suspected the read corrupted the stack (reads don't write the stack, hello!), because the array sat in the heap at high addresses (0x7ff...), close to the stack, so maybe an out-of-bounds access reached the stack; I even computed whether base+size exceeded esp. Completely fine, of course. Debugging onward, more inexplicable things appeared, e.g., unreproducible explosions, or a value abnormal on the first inspection and normal on the second. I performed an abandonment of thinking: 1 a.m., bedtime; day 2 would be an all-nighter anyway, so the night-shift teammate kept debugging alone.
Three teammates at school kept watch the first night, mainly because the committee, entirely ignoring the fact that 50% of the teams came from China, required our cameras on from 12 a.m. to 8 a.m. for public viewing (live-streaming sleep?). Complaining to the chair:
Kathleen> Why didn't you say so earlier?
I had expected other Chinese teams to notice and argue before me; why was I the only one getting scolded? At the wrap-up meeting the chair even said some teams don't like socializing and just want to sleep. (No way, no way, someone actually wants to sleep during a competition?)
Right after my shower, as I lay flat, the night-shift teammate messaged: solved. gdb was unreliable, printf remains king; it turned out to be an integer overflow. Great, overflow and underflow both collected, definitely deliberate. The overflow had a tricky workaround: run MPI on more nodes, because the offending expression divides by the node count, and with enough nodes it stops overflowing. Eight GPUs cap you at eight ranks? Nonsense: run two MPI ranks per GPU and done. (Though I still suspect the stack really was trashed, making GDB act weird; maybe RDMA scribbled over it, since the overflowed result seemed to propagate into MPI arguments.)
The reproducibility teammate, being well-read in AI papers, got drafted for this task. He initially thought the hard part of the live competition would be writing the paper, because everything had been rock solid on Azure during testing; no time to test on Oracle, but it should be fine... right?
Once on Oracle, it was, shall we say, full of surprises (i.e., new bugs). Unlike Azure, Oracle shipped no MVAPICH, so we had to build one with Spack; the IB cards were configured in a more complex way, seemingly dual-port 100G RoCE rather than Azure's single-port IB. The program appeared to compile, run, and produce some results, except... it occasionally exploded near the end of the computation. Per my not-that-many years of rich experience, if an MPI program can produce results, the problem isn't too big. The first blind guess was luck: the explosions were random, so just rerun. A few reruns later: certain parameter combinations exploded 100% of the time, apparently an MPI-internal error, dying in the same MPI function each time. Since the program always blew up after tens of minutes, a few gdb sessions would eat an hour or two, so we tried folk remedies first while collecting the runnable data points.
Eerily, unlimited stack, different MVAPICH versions, using Slurm, and extra MPI runtime flags each sometimes made the program run, then stopped working. Some methods plainly never worked, like changing MVAPICH's fabric build flags or switching to HPCX MPI (which, as shipped, never once ran successfully). Debugging from day into deep night, the program stayed in a superposition of runnable and not runnable. Returning to the meeting room on day 2 morning, the teammate was visibly hollowed out by a night of torture, brain fully halted. While I slept, he had asked the great Marcin from Oracle, who offered a blog post full of magic MPI parameters; it listed OpenMPI flags, Intel MPI flags, Platform MPI flags, everything but MVAPICH. The teammate tried to transliterate a MVAPICH set from the post; it was useless.
Fun fact: the great Marcin also tried MVAPICH for us on the second evening and produced a parameter set:
mpirun -hosts hpc-node-1,hpc-node-2 -env MV2_IBA_HCA=mlx5_2 -env MV2_USE_RoCE=1 /nfs/cluster/osu-micro-benchmarks-5.8/mpi/one-sided/osu_get_latency
My teammate's flags were identical except that he didn't specify MV2_IBA_HCA. (I actually suspect that adding MV2_IBA_HCA would have fixed it, but there was no time left to test.) So,
Neko.d> This MPI runs most of the time and explodes occasionally; maybe it's a MVAPICH bug
Marcin> we can take a look and report to Dr. DK Panda.
In ancient times, resumes went straight to the boss; nowadays, bug reports go straight to the author.
By day 2 noon, debugging still had made no progress. I kept feeling that MVAPICH wasn't actually mandatory; the instructions didn't require it. Asking other teams: well well, some had been running with OpenMPI from the start. So we switched to Intel MPI, a battle-tested thing. Sure, Intel MPI might single-handedly reshape the program's performance characteristics, alter the trend of the performance curves, and thereby overturn the original paper's conclusions, but staying stuck was not an option either. We rebuilt ramBLe with Intel MPI plus our own GCC 9; the first run, without MPI flags, exploded before the computation even began. I thought we were done for. Treating a dead horse as a live one, we tried the flags from the blog post, and the program ran! And finished! It was also the first time I saw Intel MPI refuse to run without exotic flags (usually I feel that stacking too many flags and buffs makes programs explode harder).
The buff combo, pure magic, just a bit of code:
-iface ens800f0 -genv UCX_TLS rc,self,sm -genv UCX_NET_DEVICES mlx5_2:1 -genv I_MPI_FABRICS shm:ofi -genv I_MPI_FALLBACK 0
Rerunning the previously impossible points: hey, results! But by then time was short, and the earlier MVAPICH results were unusable. To catch up, the teammate wanted to run two data points simultaneously on one node, but my project that semester studied interference between co-located programs, so I vetoed it; worst case we would pad the submission with the old MVAPICH data plus an explanatory note, though in the end nearly all results made it in time. There was also the core-binding question: MVAPICH's default binding is a very strange posture, but the instructions never mentioned binding, so we let Intel MPI freestyle with its default binding order. Compared with the MVAPICH results, neither binding nor the MPI swap changed much; the trends agreed.
Both of these problems cost money to run. As a perennial cheapskate, especially with cloud servers (though as an AWS intern I burned plenty of AWS's money), I was reluctant to let QE and the mystery application test on eight CPU nodes or eight GPUs. Stingy me gave them a cheap Intel Haswell node with a single V100: test there first, then move to the real environment. Although it was an Intel CPU, Haswell only supports up to AVX2, so code it can run should also run on the AMD platform.
These two were actually the applications I worried about most before the competition, since they looked less prepared than the other two, and the mystery-application teammate was in Hong Kong, reachable only online. Yet these two hit the fewest problems (or the teammates silently fixed things themselves), to my surprise. The QE teammate first said the CPU unit tests failed, fixed something (a missing library, apparently), and soon they passed. Then he stacked optimization flags, the tests failed again, he tinkered, they passed again. Then came the GPU version; this time the problem was slightly thornier and took him until day 2. To be safe, I had him run the baseline results on four nodes during the night between days 1 and 2, then think about which machines could raise the score, and push the score once the benchmarks left spare money (in the end there was spare money but no time). If the GPU version worked, he could time-share the GPU nodes with the mystery application the next day. To save money I didn't even let him use eight nodes (on reflection it would have cost little and gained quite a few points).
On the mystery-application side, on day 1 he opened by listing every command needed to build and install the program (how did he enumerate them so completely? can he foresee the future?). I glanced over the list, added a few things, and he went off on his own. I expected all manner of incidents, but he was so quiet it felt like he had absconded; in between he asked exactly once about MPI (HPCX wouldn't run again; rebuilding with a different MPI fixed it), about missing Python packages, and about NVCC/MPI environment variables. I was scared he had failed and started slacking off. On day 2 he hit something testable only on a multi-GPU node; shortly after we opened an A100 node for him:
Neko.d> You should be able to connect (to the eight-GPU node) now @12:25
Teammate> It should have finished running by now @14:48
Absurdly smooth. Considering that multi-node runs might blow up in various ways and would need a lot of time debugging across two GPU nodes, and having little confidence in my own multi-node multi-GPU debugging experience and speed, I decided to save the money for the benchmarks. While the mystery application ran on the A100s, QE's GPU version still seemed shaky.
All the tasks were settled on a single eight-GPU node.
The benchmark guy was in the UK and finally remembered to log in around day 2 noon. During earlier training:
Teammate> Is this training or the competition?
Apparently the China-to-UK network just doesn't work.
Running IO500 burns no money, so he warmed up with it. We still had hopes for CycleCloud's bundled distributed BeeGFS and first ran a five-node score. First attempt: 8 points; no idea what level that is, only that it's less than 10% of Team T's score last year. Although for this IO500 the optimal strategy was frankly to tank it (wasting time to gain 10 points is indistinguishable from no gain next to certain leaderboard-special file systems), submitting 8 points felt too embarrassing. A single-node run: 10 points. Wonderful: going distributed was a pessimization (communication's fault? or a cluster misconfiguration?).
After the IO500 warm-up he... went to sleep. It was morning in his timezone; by the time he logged in again ten hours had passed and it was day 2 evening here. He seemed confident, sleeping that soundly.
QE and the mystery application ran extremely economically, saving a good 1k+ dollars for the benchmarks to burn. We had anticipated A100s being impossible to allocate and planned to fall back to V100s, then discovered... V100s couldn't be allocated either (presumably Team T was also benchmarking on V100s at that moment). After finally getting two V100s, we discovered that the cursed Azure image had IB drivers, but only completely unusable IB drivers: the image only carried drivers for the new NICs, while the machines had old NICs. Having never tried V100 nodes before the competition, we were blindsided (though the teammate, a.k.a. last year's sysadmin, used V100 nodes last year; how does he remember nothing?). If we insisted on V100s, we would have to install OFED by hand. (Team T later said they had found this problem during testing, probably with automation scripts ready.)
However, I had never seen that teammate use clusterssh or any input-broadcasting terminal, and I deeply doubted his claim of configuring OFED on 10+ nodes in a short time. Also, when I installed OFED myself I once had a kernel upgrade wipe out the GPU kernel modules; maybe after finally getting OFED right, CUDA would break, so I wasn't comfortable letting him sink more time into the V100s.
While we fiddled with the V100s, ShanghaiTech's sysadmin asked whether we wanted to take over their two A100s before they released them; I didn't yet know about the IB-driver pit, so I declined. When I did discover it, I was nothing but regretful, and then found I could casually grab two A100s myself, after queuing for half a day. Feeling the V100 path was hopeless, we decided to simply run on as many A100s as we could open. Doing the math:
Neko.d> We now have the problem of money we can't spend
We had enough money to wait for A100s; even with eight of them we might not spend it all. I boldly assumed most teams would release their A100s near the end for budget reasons (later: thanks to certain mischief-making teams for mercifully not grabbing A100s), so I had the teammate tune parameters on the A100 cluster while waiting for machines. From the third machine on, Azure wouldn't even let us enter the queue, but retrying enough times got us in; on average, entering the queue took 10 minutes and queuing took another 10, i.e., 20 minutes per node. Everyone's progress was fine and no sysadmin work was needed, so I became an emotionless mouse-clicking machine.
I meant to script the polling, but the company-issued Mac wouldn't let Chrome visit insecure sites (the CycleCloud web console's HTTPS certificate was misconfigured), so there were no dev tools to generate a curl replay; hence, manual clicking.
After about two hours we finally had six A100 nodes, and from then on we never got through the queue again. I didn't think any team was rich enough to hold A100s to the last second (there really was such a team), but afterwards not a single additional node could be opened.
The teammate pulled out HPL and HPCG binaries of unknown ancestry, copy-pasted, ran ancestral scripts with ancestral parameters (probably from ASC days), and started benchmarking! Is Linux binary compatibility really that good now? Six-node HPL initially only reached about 100T; htop showed a sea of red CPU usage, very wrong. It was just a bad MPI process/thread count: classic context switching. After tweaking, the projected score looked decent, but the run crashed. The teammate, on a whim, lowered the thread count, and it worked. Bizarre. (Another team hit a miraculous "Verification failed"; eye-opening.)
He claims he once crashed nodes outright running HPL on the campus cluster; thank goodness Azure's physical machines survived here.
The rest of the story is simple: with 48 A100s, the money was in place, and the power of money never disappoints. The scoundrels of Team T first posted a very low HPL score, playing weak, and near the end of the competition finally released the real one: HPL, HPCG, and IO500 all far ahead of the then-leader, SC20 all over again. One look at our freshly produced score: never mind, let them be happy for a few dozen minutes, then let them taste the cruelty of capitalism (just kidding).
The sadistic teammates felt the A100s weren't squeezed dry; on paper HPL could top 300T, but by the final minutes we had only reached 284T. My brain short-circuited and I agreed to let them override the already-submitted 280T with this score in the last seconds, nearly re-enacting SC20; this was the closest the competition came to capsizing. Thankfully the teammate handling submission was smart enough to submit first and delete later, not the reverse. We submitted in time but couldn't delete the old entry before the deadline; had the order been flipped, well. The organizers' Grafana still broke, apparently unable to handle multiple submissions, and displayed no HPL score for us, so we created the SC21 miracle of a highest LINPACK score of 0 GFLOPS! (just kidding)
Also, IO500 originally got the "great" 10-point score, but we forgot to save it. The single-node rerun only got 8 again; the teammate wanted to install Lustre to boost IO500 until the last moment, but even a slightly higher score would have been pointless.
Last year we had to give a full final presentation; this year it became a poster plus per-application interviews. Before the poster session I was chit-chatting, entirely unaware that the poster counted toward the score, when the iPad (kept on the online meeting during the competition) suddenly spoke up calling me to present. I was (proudly) completely unprepared, so naturally it went terribly; I couldn't even remember what I had put on the poster and could only tell the judges:
Neko.d> Mm, just look through it yourselves; this isn't the final competition configuration anyway, just take a look
It turned out the poster score was indeed not high (I suck).
The teammates' per-application interviews went far better than mine; they could answer basically every question (when did you all learn the physics background of these problems?). Notably, the sophomore teammate responsible for Cardioid couldn't really speak English, but with the help of another teammate, a TOEFL-100+ master, plus interviewers kind enough to type the questions into the chat box, the interview was completed anyway. All I know is
In truth, this 10k-character rambling recap was finished back in October and dragged into the next year before being posted, ehe. Come to think of it, this was my last SC as a formal member/captain. So frustrating: thanks to this damned pandemic I lost 2x US, 2x Singapore, 2x Beijing, and 1x Xiamen free-travel opportunities and failed to help the school spend its money; I feel deeply guilty. I really do want to extend my student status and keep diligently helping the school spend money. I hope this year I can play ASC22 as a Year-0 PhD student (my offer won't fall through, right?), and some year return to the SC venue as a presenter and become an eyewitness to the moment nike's successor hammers the other teams flat at the SC SCC.
Since NCCL relies on MPI to run on multiple nodes, the following example code is based on the MPI programming model. Assume there are 4 CUDA GPUs and 4 corresponding MPI ranks. This code performs the all-reduce operation within the first two and the last two ranks simultaneously.
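The original listing was lost during extraction; below is a minimal sketch of such a program (a hypothetical reconstruction, not the author's original code, with error checking omitted): 4 ranks are split into two independent 2-rank NCCL communicators via MPI_Comm_split, and each sub-group all-reduces its rank values.
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>
#include <cstdio>

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  cudaSetDevice(rank);          // one GPU per rank (assumes 4 GPUs)

  int group = rank / 2;         // sub-group index: {0,1} -> 0, {2,3} -> 1
  int grank = rank % 2;         // rank within the sub-group
  MPI_Comm sub;
  MPI_Comm_split(MPI_COMM_WORLD, group, grank, &sub);

  // Each sub-group leader creates a unique ID, broadcast only inside the sub-group.
  ncclUniqueId id;
  if (grank == 0) ncclGetUniqueId(&id);
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, sub);

  // Only the 2 ranks sharing the same ID join this communicator.
  ncclComm_t comm;
  ncclCommInitRank(&comm, 2, id, grank);

  float *buf;
  cudaMalloc(&buf, sizeof(float));
  float val = (float)rank;
  cudaMemcpy(buf, &val, sizeof(float), cudaMemcpyHostToDevice);

  ncclAllReduce(buf, buf, 1, ncclFloat, ncclSum, comm, 0);
  cudaStreamSynchronize(0);

  cudaMemcpy(&val, buf, sizeof(float), cudaMemcpyDeviceToHost);
  printf("[rank%d] result: %g\n", rank, val);   // ranks 0/1 print 1, ranks 2/3 print 5

  ncclCommDestroy(comm);
  cudaFree(buf);
  MPI_Comm_free(&sub);
  MPI_Finalize();
  return 0;
}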
This code can be compiled and run on my machine with these commands:
nvcc -ccbin mpic++ test.cu -o test -L/usr/local/cuda/lib -lnccl
Note: Using nvcc to compile MPI code is not a common practice. It is recommended to compile it with mpic++ from a CUDA-aware MPI variant.
The output of this program should be:
[rank0] result: 1 # 0 + 1 = 1
The key is ncclCommInitRank. Suppose only a subset of ranks initializes the communicator with the same unique ID belonging to one of them. In that case, this communicator will ignore the ranks outside this subset.
Official API explanation:
ncclResult_t ncclCommInitRank(ncclComm_t *comm, int nranks, ncclUniqueId commId, int rank)
Creates a new communicator (multi thread/process version). rank must be between 0 and nranks-1 and unique within a communicator clique. Each rank is associated with a CUDA device, which has to be set before calling ncclCommInitRank. ncclCommInitRank implicitly synchronizes with other ranks, so it must be called by different threads/processes or used with ncclGroupStart/ncclGroupEnd.
In addition to the official instructions, we should also know that ncclGetUniqueId can be invoked multiple times, and it returns a different unique ID each time. Meanwhile, the unique IDs generated earlier keep working.
Moreover, I also evaluated the influence of sub-grouping on performance.
The testbed is a g4dn.metal instance with 8x NVIDIA Tesla T4 GPUs.
First of all, I would like to emphasize the GPU topology of this bare-metal machine.
Note: We should extract the topology information from physical machines instead of virtual machines, since the hypervisor may fuzz the result for security reasons.
# nvidia-smi topo -m
It looks like a balanced tree topology. We can expect two neighboring GPUs to have higher communication efficiency.
UPI
The results below are measured on the root rank, and each experiment is repeated 5 times. Meanwhile, the environment variable CUDA_VISIBLE_DEVICES was set to reorder the GPUs bound to MPI ranks. CPU binding remains unset.
The meaning of the notations on communicators is:
- 0/1: only one communicator performing all-reduce on physical GPUs 0 and 1.
- 0/1 + 2/3: two communicators working at the same time, each performing all-reduce on two GPUs independently.
- 0-7: equivalent to 0/1/2/.../6/7.
From the result above, we can draw our conclusions. (The PCIe link status can be checked with nvidia-smi --query-gpu=pcie.link.gen.current --format=csv and sudo lspci -vvv.)
Ideally, a Docker image should be built with a Dockerfile. Treating a Docker container as a virtual machine and committing it into an image is very convenient, but this approach easily produces a huge, far-from-lightweight image, so it is only recommended for testing, not for formal scenarios such as a production environment.
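For contrast, a minimal Dockerfile sketch (the mirror swap reuses the sed trick shown later in this post; the package choice is just an example):
FROM ubuntu:18.04
# switch to a faster apt mirror, then install build tools and clean up in one layer
RUN sed -i "s#archive.ubuntu.com#mirrors.sustech.edu.cn#g" /etc/apt/sources.list \
 && apt-get update \
 && apt-get install -y --no-install-recommends build-essential \
 && rm -rf /var/lib/apt/lists/*
CMD ["bash"]
Build it with docker build -t my_image:0.1 . and every step stays reproducible.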
Installing and using Docker (creating and destroying containers, etc.) requires superuser privileges. If you are not the system administrator, make sure Docker is installed in your environment and that you have permission to use it (i.e., you have been added to the docker user group).
Note: Docker also has a rootless mode, but it requires extra configuration.
An image is essentially all the file data of a container at a certain moment, including the runtime environment, programs, temporary files, etc., while a container is the thing that can spawn processes and run programs. So an image is static and a container is dynamic. Their life cycles and transitions are as follows:
|---------push--> (Docker Hub)
Container names don't matter much. An image name consists of name:tag; for example, Ubuntu:18.04 has the image name Ubuntu and the tag 18.04.
- docker ps: list running containers
- docker ps -a: list all containers (including stopped ones)
- docker rm -f: force-remove a container
- docker images: list downloaded images
- docker pull: download an image
- docker rmi: remove an image
I haven't found a good way to speed up image downloads yet.
Recommended command:
docker run -id --name your_ct_name --privileged --network host --restart always ubuntu:18.04 bash
Parameter meanings:
- the -d + -i + bash combination starts bash inside the container, so that the container stays alive in the background
- --restart always: automatically start the container after the host reboots and keep it in the background
- ubuntu:18.04: the recommended Ubuntu 18.04 image
- --privileged: allow the container to use more kernel features
- --network host: use the host network (disable network namespace isolation)
You can also use NVIDIA's CUDA development images:
docker run -id --name your_ct_name --privileged --network host --restart always --gpus all nvidia/cuda:11.0.3-devel-ubuntu18.04 bash
Parameter meanings:
- nvidia/cuda:11.0.3-devel-ubuntu18.04 is an Ubuntu 18.04 image containing the toolchain for CUDA 11.0.3
- the image's CUDA version must match what the driver supports; nvidia-smi shows the highest supported CUDA version in its top-right corner
- the devel images contain the toolchain, such as the nvcc compiler; the runtime images don't
- --gpus all: use all available GPUs
The full list of NVIDIA images: https://hub.docker.com/r/nvidia/cuda/tags?page=1&ordering=last_updated
docker exec -it your_ct_name bash
Parameter meanings:
- -i + -t: start interactive mode
After entering the container, if you want to switch the apt mirror, it is recommended to use the command below, because images usually omit text editors to save space.
sed -i "s#archive.ubuntu.com#mirrors.sustech.edu.cn#g" /etc/apt/sources.list
Container to image:
docker commit your_ct_name your_images_name:your_tag
Image to file:
docker save your_images_name:your_tag -o your_file.tar
File to image:
docker load -i your_file.tar
Under the following premises, putting your site behind a CDN is worse than moving it to a 24-RMB-per-month Tencent Cloud Hong Kong lightweight server.
The best-value solution I know of is Azure CDN (Microsoft Standard) + GitHub Pages. But the website itself is the biggest factor in access speed.
Note: this article assumes the site owner is too lazy to complete ICP filing. With filing, a domestic origin plus a domestic CDN would undoubtedly crush the solution above.
You might ask: "Is this question even worth asking?" But think about it: the nearest filing-free CDN edge nodes are in Hong Kong anyway, and an edge node is a shared bus serving everyone else, so its routes are not necessarily better than Tencent Cloud's. Even more maddening, a personal blog is basically synonymous with negligible traffic, so your site getting evicted from the cache is perfectly normal, and a cache miss makes latency explode. The most likely outcome of mindlessly adding a CDN is losing to a 24-RMB Tencent Cloud Hong Kong box with direct three-carrier connectivity.
So how do we make the CDN fast? Treat the head when the head aches and the foot when the foot aches, of course:
So the problem reduces to choosing the origin and choosing the CDN.
Getting a good origin sounds easy, but the choice is actually quite particular.
For a static website (e.g., Hexo), besides hosting the origin on a VPS, I would rather throw it onto object storage, because
The main drawbacks of object storage are
Once set up, though, syncing data to object storage is as easy as using a network drive (because that's what it is).
As mentioned earlier, a personal blog behind a CDN must always be prepared for cache misses going back to the origin, so shortening the time for edge nodes to download data from the origin is critical. The best approach is to shorten the geographic distance between the edge nodes and the origin, ideally placing them in the same region, because

Another benefit is that since only intra-region communication matters, the quality of the origin's international routes becomes completely irrelevant; no CN2 GIA or anything of the sort is needed, as long as the origin can reach the network.

Since one origin can only cover one region, a single Hong Kong origin should be enough if you only care about access from mainland China. But if you want more than that, you may need more than one origin.
Multi-region optimization is one of the dividing lines between toy-grade and enterprise-grade solutions. Why? Because judging from the pricing of the relevant services, cloud vendors have basically never cared whether individual users live or die. For a CDN, it means configuring multiple origins (both to reduce latency and for disaster recovery) so that the CDN can pick the origin nearest to each visitor. There are roughly two implementations

Multiple origins and GeoDNS are all money burners (enterprises with more money than sense don't care), so what should an ordinary person do? Enter the invincible GitHub Pages. Besides storing your files for free, it also comes with the Fastly CDN. We don't know GitHub Pages' internal architecture, but speed tests show very low latency from probes in many places, so it presumably geo-replicates the data. In other words, there is no DNS to buy and multi-region storage costs nothing; the only flaw is that access from mainland China is hit-or-miss, but as a CDN origin this hardly matters. Moreover, static blogs like Hexo even have plugins that sync the blog to GitHub Pages in one step.
Based on my observations over the past two years, I subjectively rank the CDNs I have used into several tiers by access speed from mainland China. As for access speed from outside China, they are probably all about the same.
The conclusion is obvious: my first pick is Azure CDN (Standard Microsoft). Front Door's pricing is just absurd, and the other vendors' CDNs will make you wonder why you paid for a decelerator (though Cloudflare plus a cheap US VPS does save money). Admittedly, Azure is rather aloof: first you need a foreign-currency card, and then every question is answered with "this is an enterprise-grade design", plus some baffling choices. For example, it is unfriendly to apex domains, CNAME verification keeps failing, and the official docs only tell you to buy their DNS while never mentioning the `cdnverify` workaround; it won't automatically issue TLS certificates for apex domains (AWS can); and the CDN cannot use HTTPS on the frontend with HTTP to the backend (Front Door can), and so on. But what can you do, their CDN is simply fast from mainland China, and at that price I forgive them. The five free rules are fairly generous too, usable for HTTPS redirect and HSTS, although those may be one-click settings on other CDNs' dashboards.

The Azure CDN (Microsoft Standard) + GitHub Pages combo may sound convoluted, but it costs next to nothing per month (probably under 5 RMB) and is plenty fast. One more question is worth pondering, though: does this setup make your site's speed unbeatable? Not really. Judging by Google PageSpeed scores, switching from a single origin (Singapore) + AWS CDN to this combo gained me only about 3 points (access speed from abroad). The other 20-odd points came from tuning the site itself, such as reducing the number of external files loaded. I once came across a highly optimized site that scored full marks on PageSpeed over both domestic and international networks even though it sat behind Cloudflare (tested with Chrome Lighthouse).

Writing this far, I realize the biggest value of this scheme is that it saves me some money; it is quite a bit cheaper than going with Tencent Cloud, and it improves mainland access speed as a bonus.
`HloSharding` Object

First of all, we need a way to represent sharding specifications in a programming language. XLA designed an object to do exactly that; it contains numerous member variables and a set of supporting functions to configure itself. Some attributes of `HloSharding` are listed below.
1 | // File: tensorflow/compiler/xla/service/hlo_sharding.h |
`Array<int64> tile_assignment_` here is multi-dimensional with arbitrary shape. `{devices=[2,1,2]2,3,5,7}` means the shape of `tile_assignment_` is `[2,1,2]`, while the values are `{2,3,5,7}`.
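To make the encoding concrete, here is a small standalone sketch (my own illustration, not XLA code) that decodes `{devices=[2,1,2]2,3,5,7}`, assuming the device list is laid out over the tile grid in row-major order, which is how I read XLA's textual format:

```cpp
// Standalone illustration (not XLA code) of decoding the textual form
// {devices=[2,1,2]2,3,5,7}: the device list is assumed to be laid out
// over the tile grid in row-major order.
#include <cstdio>
#include <vector>

int main() {
  const std::vector<int> shape = {2, 1, 2};      // tile grid shape
  const std::vector<int> devices = {2, 3, 5, 7}; // flattened assignment

  // Tile (i, j, k) owns the data chunk at that grid position and is
  // placed on devices[(i * shape[1] + j) * shape[2] + k].
  for (int i = 0; i < shape[0]; ++i)
    for (int j = 0; j < shape[1]; ++j)
      for (int k = 0; k < shape[2]; ++k)
        std::printf("tile (%d,%d,%d) -> device %d\n", i, j, k,
                    devices[(i * shape[1] + j) * shape[2] + k]);
  return 0;
}
// Prints: (0,0,0)->2, (0,0,1)->3, (1,0,0)->5, (1,0,1)->7.
```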
`std::vector<HloSharding> tuple_elements_` probably was designed to specify the sharding specifications of outputs.

I am not fully aware of the roles of `maximal_` and `tuple_elements_`. Does anybody know?
Note that a single `HloSharding` object can be shared by multiple instructions. By doing this, the cost of creating and maintaining several instances with exactly the same contents is eliminated.
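As a tiny illustration of this sharing (plain C++ with stand-in types of my own, not the real XLA classes):

```cpp
#include <cassert>
#include <memory>
#include <string>

// Stand-in for xla::HloSharding; immutable once constructed.
struct Sharding { std::string spec; };

// Stand-in for xla::HloInstruction, holding a shared, read-only sharding.
struct Instruction { std::shared_ptr<const Sharding> sharding; };

int main() {
  auto s = std::make_shared<const Sharding>(
      Sharding{"{devices=[2,1,2]2,3,5,7}"});
  Instruction a{s}, b{s};  // both instructions point at the same object
  assert(a.sharding.get() == b.sharding.get());
  // Only one Sharding instance exists, however many instructions use it.
  assert(s.use_count() == 3);
  return 0;
}
```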
The original implementation of XLA added the attribute `std::shared_ptr<const HloSharding> sharding_` to the class `xla::HloInstruction`, which is declared in `tensorflow/compiler/xla/service/hlo_instruction.h`. A common usage of this HLO instruction attribute is to declare sharded tensors. Here is a sample of HLO IR code with sharding attributes. Note that the propagation algorithm may fill in this attribute for the instructions that lack it.
1 | primitive_computation_add.6 { |
Note: this HLO IR code is compiled from this JAX Frontend code
1 |
|
This example illustrates a lambda function that takes a replicated tensor as input, splits it by invoking `custom-call`, and then performs the calculation.
You might notice that in the previous example, the instructions invoking operators (e.g., reduce.10) don't carry sharding attributes. That raises a critical question: how does a regular operator react to sharded tensors? XLA's solution is the SPMD Partitioner, which is mainly responsible for converting a full-sized operator into a partition-sized one by adding the necessary collective communication primitives to the lower-layer IR code; the partitioner also converts operator inputs from global tensor symbols with sharding into local tensor symbols without sharding specifications.
We can find some clues in `tensorflow/compiler/xla/service/spmd/spmd_partitioner_test.cc`.
1 | TEST_F(SpmdPartitioningTest, DotPartialContracting2) { |
Two inputs, `lhs` and `rhs`, are tensors partitioned in the way the figure describes. Thus, after partitioning the computation, `lhs` is unwrapped, and its shape changes from `f32[24,100]` to `f32[24,50]`. At the end of the file, an `AllReduce` is added to collect the partial results.
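To see why that `AllReduce` is needed, here is a self-contained sketch (my own illustration with made-up sizes, not XLA code) that emulates partitioning a dot along the contracting dimension across two "devices": each device computes a partial product over its slice of K, and summing the partials, which is exactly what `AllReduce` does, recovers the full result.

```cpp
// Standalone illustration (not XLA code) of a dot whose contracting
// dimension K is split across two "devices": each device computes a
// partial [M,N] product over its half of K, and summing the partials
// (the role of AllReduce) reproduces the unpartitioned result.
#include <cstdio>
#include <vector>

int main() {
  const int M = 2, K = 4, N = 2, kDevices = 2, kLocalK = K / kDevices;
  std::vector<float> lhs(M * K), rhs(K * N), out(M * N, 0.f);
  for (int i = 0; i < M * K; ++i) lhs[i] = float(i + 1);
  for (int i = 0; i < K * N; ++i) rhs[i] = float(i + 1);

  // Per-device partial products over the owned K slice.
  std::vector<std::vector<float>> partial(
      kDevices, std::vector<float>(M * N, 0.f));
  for (int d = 0; d < kDevices; ++d)
    for (int m = 0; m < M; ++m)
      for (int n = 0; n < N; ++n)
        for (int k = d * kLocalK; k < (d + 1) * kLocalK; ++k)
          partial[d][m * N + n] += lhs[m * K + k] * rhs[k * N + n];

  // Emulated AllReduce: element-wise sum of the partials.
  for (int d = 0; d < kDevices; ++d)
    for (int i = 0; i < M * N; ++i) out[i] += partial[d][i];

  for (int m = 0; m < M; ++m, std::printf("\n"))
    for (int n = 0; n < N; ++n) std::printf("%6.1f", out[m * N + n]);
  return 0;
}
```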
The system should be able to figure out optimal sharding specifications for the remaining tensors without the user's annotations. An ideal partitioning plan reduces the communication volume, reduces the memory footprint, and improves performance.
Some unit tests written in `tensorflow/compiler/xla/service/sharding_propagation_test.cc` are intuitive examples.
1 | TEST_P(ParameterizedMetadataTest, BroadcastForwardPass) { |
It clearly shows that the system inferred the sharding specification of `broadcast` to be `{devices=[1,2,2,1]0,1,2,3}` according to its input, which carries the attribute `{devices=[1,2,2]0,1,2,3}`. Note that this test is called `BroadcastForwardPass`; there is also a test named `BroadcastBackwardPass`, which is to say that propagation runs in both directions.
GShard: https://arxiv.org/abs/2006.16668
GSPMD: https://arxiv.org/abs/2105.04663
Julia DistributedArrays.jl: https://juliaparallel.github.io/DistributedArrays.jl/latest/index.html
Of course, we need to download the source code of TensorFlow and install all the dependencies. I suggest using Conda to manage the environment and the built-in GCC on Ubuntu 18.04 (or above, maybe) to build the code. Note that building from source requires about 50 GiB of free space.
1 | # Fetch Source Code |
First of all, configure the project and build it.
1 | ./configure |
During the configuration process, it is recommended to choose ALL the default options unless debugging on GPU is a must, since enabling GPU support needs additional configuration (please refer to this article) and much more time to compile.
As for the bazel build flags:

- `--config=dbg` adds debugging symbols. Required.
- `--config=monolithic` should generate the binary as a single dynamic library, but this option seems to be buggy. Not recommended.

Compiling TensorFlow is quite time-consuming; it took about 20 minutes using 48 CPU threads on my server. Time for coffee now.
In fact, we don't have to write anything in the Python frontend to trigger breakpoints inside the XLA compiler, as there are already tons of unit tests that cover most of the code and demonstrate the capabilities of the compiler.

Let's pick a simple test first to validate that the code was compiled correctly.
1 | bazel test --config=dbg //tensorflow/compiler/xla/tests:tuple_test_cpu |
From the compilation log, we can find that the executable file is located at `bazel-bin/tensorflow/compiler/xla/tests/tuple_test_cpu`. Execute it! If everything works well, the program will print the message below.
1 | [----------] Global test environment tear-down |
Then pick a test you are interested in, and repeat the steps above.

Take `spmd_partitioner_test` as an example. This unit test compiles without any error message, but when you run the executable directly, you will see this message.
1 | [ RUN ] SpmdPartitioningTest.BroadcastAsReplicate3 |
This is because the executable is not linked against a valid backend, which means it doesn't contain the code of the JIT execution environment. The solution is to modify the `BUILD` file manually to fix the dependency, as the message suggests.

Open the `BUILD` file in the directory where the unit test is located. In this example, the test `tensorflow/compiler/xla/service/spmd/spmd_partitioner_test.cc` corresponds to `tensorflow/compiler/xla/service/spmd/BUILD`. Then add the dependency `//tensorflow/compiler/jit:xla_cpu_jit`.
1 | tf_cc_test( |
Since the unit test was built as an executable with debugging symbols, there is nothing special about the configuration of VSCode. Install the `C/C++` extension, and write the following lines into `.vscode/launch.json`.

You can open that JSON file by pressing `ctrl/command` + `shift` + `p`, typing `launch.json`, and selecting `Add Configuration` -> `C/C++: (gdb) Launch`.
1 | { |
Everything is all set! Press `F5` to start debugging.
Day 0 was, of course, the traditional art: machine assembly day. I forget what the pre-competition plan was, something like finishing assembly in the morning. As a scavenger turned... no, a scavenger-turned-sysadmin, I should be able to assemble machines in minutes; my assembly experience is rich enough to have once bricked a 1080 and a Xeon Phi (proudly). Then, surprise: Inspur picked up a genius designer somewhere who made the GPU power cables ridiculously short, so installing GPUs in the recommended GPU1 and GPU2 slots required brute-force miracles. To save trouble, I installed the cards in the GPU1 and GPU3 slots instead; the GPU topology is about the same either way, one card per NUMA node, so in theory no problem.
As for the OS installation plan, from some people's lessons written in blood, plus my own history of blood and tears, we concluded: do not pre-install the OS onto the disks. Back then, my NAS (junk server) ran its disks in AHCI mode, but HPE's BIOS was another work of genius: it couldn't read disk temperatures in AHCI mode (only under RAID), so the fans spun like mad. I switched the disks to RAID mode, and then a whole region of a disk vanished, with partition table errors and mismatched sizes; in the end I had to copy the files out, switch modes, format, and write them back. Ever since, I have suspected that configuring RAID arrays has unknown side effects. So I made an auto-install image and prepared five USB drives, letting us install the OS on one machine while assembling the hardware of another.
With the systems installed, one look at nvidia-smi showed that an A100 on one node had mysteriously vanished. The other teammates and the Inspur folks kept chattering about whether not following the recommended installation would cause problems and whether we should put the cards back. I reseated the A100; still not recognized. I was scared I had just refreshed my personal record for the value of hardware killed by my hands; a 75k-RMB US golden card could buy a pile of 1080s. A closer look revealed that the A100 in the GPU3 slot was the one not recognized, and moving it to GPU2 made it show up. Although I still insist the GPU1/3 placement is fine and that forcing the power cables in is what hurts the hardware (and the hands even more), we moved the cards back to slots 1 and 2 (this time I refused to plug the GPU cables myself). Probably only Inspur knows why slot GPU3 cannot take a GPU. (Did they skimp on its power delivery?)
After relocating the GPUs, one node glowed red like a shining red star. The BMC warned that a high-speed PCIe device was running at a lower speed. We tore the machine down and reassembled it; inserting the riser felt odd. After some troubleshooting, we spotted a cable label stuck inside the riser's slot. Genius designer strikes again: of all the places to stick a label, it had to be on that cable, at exactly that height.
Meanwhile, to reach the cluster over wireless, someone produced, from somewhere on campus, an enormous and extremely arrogant-looking router, which made the other schools suspect us of abusing home advantage. The thing looked mighty, but come on, it's a TP-Link.
After fiddling for most of the day, it was almost four o'clock before we got ready to run HPL. To teach the other teams a lesson, we could not lose in momentum, and the power figure had to be big, so we went straight for five nodes and ten GPUs. Press Enter; a few seconds later, the power reading dropped. Oh, the breaker had tripped. Power back on, cap the power, run again, trip again. As the host team we had known all along that 7 kW trips the breaker, but we shouldn't have reached that number. It took us a while to realize that power strips need load balancing too: we had two power feeds, but the first five servers were basically all drawing from the same one. That broke us.
Tripping the breaker was a small matter, but for some reason BeeGFS had apparently never seen such a scene and was nearly unrescuable afterwards. After struggling for ages with no cure, we gave up treatment. For reliability you still turn to NFS; flashy things like BeeGFS and Ceph can crawl away. NFS's only fault is that when the NFS service on the storage node restarts, any other node mounting at that moment gets stuck. Another sysadmin managed the storage node while I managed the rest. Because he was too lazy to write a proper fstab to auto-mount an xfs disk, after every reboot he had to mount it manually and then restart the NFS service, and during that window, merely mounting the NFS share on my side would trigger a classic race condition. This happened twice in a row: a crash, a reboot, then the race condition right after the reboot. In the end I held his head down until he honestly wrote the fstab, so the NFS service no longer needed restarting.
Interlude: our school's canteen embarrassed itself again; what kind of boxed meals were these? The THU team opened with a question about how to order takeout, and an SJTU member pointed at the lunch box and asked: "Is this what you usually eat?" We later learned the boxes came from the faculty canteen (staff below AP rank have it rough). After enough complaints, the computing center simply told the canteen to skip the fruit and drinks and just cook better food. In any case, those four days were the healthiest I ate all semester. Perhaps the vice president saw my canteen-roasting remarks in an interview; he even called the computing center to ask how everyone was eating. The answer: no incidents.
Finding Grafana too heavyweight (mainly, I couldn't figure out how to configure it), I spent the evening hand-rolling a dashboard: a passable panel built with DHTMLX, plus a simple backend built with Flask and a "filesystem database" (a single JSON file for the data), deployed on a Jetson Nano. As for controlling power from the web page, maybe next year. I wrote a few crappy scripts on-site to manually tune power caps and fan speeds, executed through clusterssh. Automatic power control? Nonexistent.
Day 1 was the last day for assembling machines and setting up environments; from Day 2 onward it was the formal competition, with no reboots and no exceeding the power cap. On boot, all the IPs had changed. No idea what the IT center was up to; the DHCP lease was shorter than some teammates' streaks of not sleeping, and the hosts files had to be reconfigured every day.
The night before, I suddenly realized we hadn't checked the CPUs' C-states and P-states (thanks again to someone's blood-and-tears lessons). dmesg showed `intel_idle: does not run on family 6 model`. Given the new hardware's compatibility needs, we were already on Ubuntu 18.04's 5.4 HWE kernel; was that still not new enough? Sure enough, `cpupower idle-info` showed no C6 state. After a round of searching, I found deb packages of a 5.10 kernel usable on Ubuntu 18.04 and decided to pull one node aside as the victim. With the new kernel installed, `intel_idle` was fine and C6 should have been available, but NVIDIA's kernel modules were not rebuilt, and, crucially, not much power was saved, who knows why. Playing it safe, we skipped the new kernel. Probably the non-CPU power draw was so large that whatever we squeezed out of the CPU would have been a rounding error of it.
Recommended reading on idle-state control:
As for RAM, we had long noticed some teams maxing out every machine, perhaps out of out-of-memory PTSD from the previous ASC. We also worried that RAM drew a lot of power, and all borrowed DIMMs had to stay installed during the competition, so we first borrowed 32 DIMMs, enough to fill two nodes (the two nodes came with 32), and measured: peak RAM power turned out lower than the GPUs' idle power. (Even with four nodes fully populated, peak RAM power stayed within about 100 W; the staff kept making a fuss about RAM power as if it were anything dramatic.) So we decided to fill all four nodes; the OCD in us rejoiced at having no empty DIMM slots. But at that point the staff stopped lending, saying DIMMs would only be handed over once the final machine configuration was confirmed. Unstated rules like this, probably invented on the spot, are damn traps.
While installing the DIMMs, one node glowed red like a shining red star again; the BMC claimed several DIMMs were faulty. We pulled all of them and reinserted them in a shuffled order. The BMC then blamed a different set of DIMMs, though the count of bad ones did drop. We suspected oxidized golden fingers; at ASC19 our team had famously performed an on-site golden-finger erasing on a GPU and brought it back to life, so this year we decided to perform the DIMM edition. After an hour of rubbing and the heroic loss of half an eraser, we reinserted the DIMMs and booted: red light, and the BMC errors were identical to last time, the same slots reported bad even though the DIMMs in them had been shuffled to random positions each round. Considering the DIMMs were new, oxidation probably wasn't the issue. We couldn't be bothered to swap machines right then; we would swap after the configuration was finalized, and meanwhile this node could play the load resistance in the three-plus-one plan, responsible only for burning power, without affecting our three-node, six-GPU tests.
The final measurements showed that, under the same power cap, three nodes with six GPUs and four nodes with eight GPUs scored about the same on HPL, but four nodes with eight GPUs did far better on HPCG. After all, as everyone knows, HPL eats floating-point throughput while HPCG eats aggregate memory bandwidth, and memory reads/writes cost far less power than floating-point computation.
Around five o'clock the final configuration was settled: four nodes, fully populated RAM, eight GPUs. The other teams' plans were probably similar. This year's machines were proper furnaces, 400 W idle and 1 kW+ under load, and a single node could only host two GPUs (three in theory if you sacrificed the wired NIC), so there was little room for fancy moves; the dozen-plus US golden cards we had prepared were for nothing. By six we had set everything up per the final plan, with a little time left to test HPL and my QuEST. Having slept only four hours a day for the previous two days, I decided to do absolutely nothing that night and went straight to bed.
I didn't want to get out of bed in the morning and felt there was nothing for me to do (in fact, the guy running HPL was counting on me to cap the power), so I executed a controlled lateness and slept an extra half hour (after already sleeping about nine hours). On the way I wondered whether I would be the last to arrive; turns out I wasn't, and I was actually among the first few. If everyone is late, then no one is late.
The competition started at eight, HPL results were due at eleven, and we only began tuning HPL at nine. The first run beat expectations; the second was even better. I said let's just submit it, a freshly booted machine runs cool, but the HPL guy was a gambler. The machine then lived up to expectations, each run worse than the last, even blowing the power cap once. When he finally couldn't take it anymore, he announced one last run; I let the machine cool a while longer, and the result was, naturally, an all-time low. Of course the last run is never the last run, and he declared yet another last run and got ready again. I couldn't bear to watch, so I left for the tea break, figuring that with me gone nobody would control the fans and power caps, he couldn't start HPL right away, and the machine would cool longer. Then, from afar, I saw him typing on my laptop, launching HPL himself. My stalling plan failed spectacularly, and I had to run back and take over power control; with no script reading HPL's log to steer the power cap, the multi-phase power control was entirely manual. Emmm, in the end, uh, it produced an all-time high. Ridiculous. Total victory for the gambler. As for HPCG, we had planned from the start to run it exactly once and call it done.
Around ten we submitted the HPL score and received the rest of the day's problems. PRESTO came with tens of GB of input data, and I panicked immediately: weren't we told there would be no IO-hungry applications? Our shared storage was NFS over 1 Gbps Ethernet; merely distributing that data would hurt. Three solutions came to mind: use MPI to distribute the data onto local SSDs and have PRESTO read locally; configure IP over IB and then scp the data to local SSDs, or run NFS over IPoIB; or configure NFS over RDMA. Option one: too lazy to write an MPI program. Option two: the leftover BeeGFS might prevent the IB service from restarting and blow things up. Option three: possibly feasible. I grabbed two nodes sitting elsewhere and gave it a try: loading the NFS-over-RDMA kernel module was fine, but the mount failed, and dmesg showed the module had already exploded, so option three was out too. Meanwhile we got PRESTO running against the shared storage, and dstat showed almost no traffic on the Ethernet NIC, meaning the shared-storage IO problem never needed solving in the first place.
Later there was another accident: the HPL guy, wanting to check whether MPI was healthy, ran HPL once more and overwrote the HPL output we had prepared for submission. Fortunately, it did not affect the final score.
The PRESTO folks spent ages on a hacked-up multi-node version, then wrestled out a pure-MPI one, only to find it slower than a single node, so the official runs were delayed until very late. Astonishingly, after everyone's furious burst of effort, all the cases still finished within the last few dozen minutes.
The runs did finish, but submitting the files brought new problems: the outputs were too large. We went through the USB drive being too small, then copying too slowly, and finally switched to a portable SSD offered by a kind staff member, which was still slow. The most suffocating part was being stuck at the umount command, with iostat showing data still being written to the drive. We had just heard the sad news that NUDT had yanked out a USB drive, corrupted the data, and had to re-copy everything, so nobody dared to yank ours, and we lay on the floor doing nothing. By seven thirty we could not wait any longer, because the NUDT team had finished re-copying and left. So: yank it, format it, copy again. This time we used rsync instead of cp, and the copy speed finally matched what an SSD should deliver; the back stopped hurting, the legs stopped aching, and even umount had strength again.
Nobody had eaten dinner that night (except me), so our advisor let us eat at the Western restaurant with the cracked meal card. Overtime that paid off.
My main job on Day 2 was basically power capping, in other words, nothing much, while Day 3 featured my QuEST, so that day I was barely late. One look at the cases: the smallest simulation was 34 qubits, which instantly blows up GPU memory. Eight GPUs with 320 GB of memory in total are just enough to simulate 33 qubits; the problem setters must have done it on purpose. Case 4 used rather few qubits, twenty-something, but simulated a whole pile of them, and the memory footprint worked out to be equivalent to simulating 35~36 qubits. In the end we decided to run the CPU version; the distributed multi-GPU version I had hacked together was of no use at all.
We initially decided to let the AI problem run first, because earlier intelligence suggested I might need to modify QuEST's multi-GPU code on-site; the final would supply a doctored QuEST source with a few extra gates. The night before Day 3, the AI folks were supremely confident, steady as an old dog, because the original problem statement said BERT and they had built a pile of optimizations against BERT, apparently with good results. Then, on the morning of Day 3, the committee handed out ALBERT. Well, the name does contain the four letters B-E-R-T, but their optimizations were not only wasted, they now had to edit the code to strip them out, and then discovered the loss did not converge at all. Still, their skills were excellent, and they fixed the bug within an hour. Training started around twelve. The final allowed at most three epochs, each taking about half an hour. The first two epochs went without incident; one minute before the last epoch finished, we blew the power cap, and there was nothing to do but kill the program. In principle we could have run inference from a checkpoint saved during training, but because the exit was not clean, a key log line used to judge the training time was never printed, so the earlier results were wasted anyway. Someone proposed generating a log by human brain, but that probably wasn't compliant, so we let it go. By then it was already half past one, and if we didn't start running something else, time would run out.
While the AI model was training, I worked on-site on CPU optimizations for the doctored QuEST source they provided. A year earlier we had found that MPI+OpenMP hybrid is quite a bit faster than pure MPI, but launching a hybrid program with mpirun is sheer pain, especially with Intel MPI. OpenMPI at least provides the `map-by` and `bind-to` options; with Intel MPI, no amount of tweaking genv variables made MPI reserve the requested number of cores for OMP, and Intel's `-print-rank-map` output is abstract art, printing nothing useful. So the plan was to use slurm to help Intel MPI launch the hybrid program, at which point slurm on one node crashed (the node that had once overheated and shut down because someone forgot to turn on the fans). Not even `slurmd -c` helped, so more time went into rescuing slurm; after a few dozen minutes it somehow came back to life on its own. Ready to run QuEST's first case, the quantum Fourier transform: press mpirun, and it crashed within seconds, complaining that ulimit was too small, even though my customized system had already raised ulimit to unlimited. In the end, the fix turned out to be running as root, and the result came out in about twenty minutes.
Then came the second case, a forward-plus-inverse quantum Fourier transform with one more qubit than the first, so it should have taken roughly four times as long. But after an hour the program still had not finished, and something felt off. We started debating whether to kill it; given the sunk cost, I insisted on letting it finish. A while later I checked the logs and found that about half an hour into the computation, slurm had reported killing something; a glance at htop suggested the program was still running "normally", so I assumed the kill had failed. At one hour and twenty minutes things felt seriously wrong and the debate resumed; I still didn't want to kill it and planned to let the AI training run alongside QuEST. Some time later, examining htop carefully, I finally noticed that each node had only three QuEST ranks left; one rank had departed this world, presumably the process slurm murdered. Farewell, QuEST.
As for QuEST's third case, even compiling it was troublesome. The committee provided a Makefile, but QuEST had already migrated to CMake, and that Makefile compiled nothing but errors. Unlike the others, which were basically one .cpp file each, this case consisted of several .cpp files plus a Python mole; the mole not only complicated compilation (it depended on Python header files) but presumably also played a major role in slowing the program down. Anyway, since we couldn't even finish the second case, the third was irrelevant.
At this point, about an hour remained before the end of the competition. The mystery application could compile but only ran single-threaded, and probably couldn't finish a single case. After brief deliberation, we decided to protect the AI run. The AI side should have had no major problems; finishing at least one epoch was a given. But for unknown reasons, the process got killed after three or four hundred seconds. We ran it again; killed again, cause unknown. A Ctrl-C appeared on the screen, yet everyone claimed not to have mistyped or killed the process. Had slurm gone on a killing spree? Unlikely. Either way, our time was almost up. Fortunately, the AI folks had changed the code to catch the keyboard interrupt, save the model, and print the log. The original purpose was to let training run as long as it could and Ctrl-C it if it couldn't finish, but it ended up successfully saving that 400-second model. Testing showed the thing still reached 78% accuracy, versus 85% for the full three epochs. We lost a few accuracy points but got the time points in full. Anyway, with nothing else to submit, the 400-second model went in.
Another early morning. The defense passed without incident, with a performance much better than on the competition floor, as expected of the "Southern University of Presentations". Awards came in the afternoon; all I remember is our team's cursed slogan in the promo video and the admissions ad on our school's grave. We ranked fifth and grabbed no other awards, but given the whole team was playing an on-site final for the first time, the result was passable. Before the competition we bragged about taking first place in Guangdong; in the end we happily took last place in Guangdong, though first in Shenzhen was secured, since a second in Shenzhen doesn't exist. And collapsing as hard as we did on Day 3 while still ranking fifth suggests everyone collapsed, and if everyone collapsed, nobody did!
Later, while chatting, I heard someone had run a few of the cases with a multi-GPU QuEST. Thinking about it, it would indeed be easy to write a version that keeps some ranks' state vectors in GPU memory and the rest in host memory, so GPU and CPU can compute together when GPU memory overflows. Never having attended an ASC on-site final before, I was too young, actually expecting the committee to be nice people. Their cruel tastes include serving up that disgusting weather model every single year, and the mystery application could read and write PB-scale data; thank goodness we didn't go all-in on the mystery app, or our toy-grade NFS would never have survived. What angered me most was the committee canceling the banquet, so at night we had to eat at the on-campus cha chaan teng again; going to the cha chaan teng after every supercomputing competition is unbearable, and the Western restaurant's ~100 RMB per-head budget only buys a midnight snack.