Troubleshooting the error "version GLIBCXX_3.4.20 not found"

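The situation this time was somewhat complicated. First, the cluster runs CentOS, a living fossil, so the software environment is ancient (the hardware, on the other hand, is brand new): GCC is still at 4.8.5, which cannot even compile the latest DGL library. Worse, I have no administrator privileges on this shared cluster and have to install software in ways that do not require them, so I used Conda to install CMake, GCC, and G++. That makes the version GLIBCXX_3.4.20 not found problem even more troublesome to solve. The fix described in this post therefore does not apply to every situation, but it can serve as a reference.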

Note: installing GCC and related tools with Conda, without any special privileges, is very simple.

# Create and activate a Conda environment named dgl
conda create -n dgl
conda activate dgl

# Add the conda-forge channel
conda config --add channels conda-forge

# Install the software
conda install gxx_linux-64 gcc_linux-64 cmake

# Verify whether GCC is working
echo $CC # Should not be the default one (like /usr/bin/gcc)
echo $CXX

The original problem

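Compilation itself goes through without any problem. The error appears only when running import dgl, which loads DGL's shared library. The message is as follows.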

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/work/gutz/miniconda3/envs/dgl/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/__init__.py", line 13, in <module>
    from .backend import load_backend, backend_name
  File "/work/gutz/miniconda3/envs/dgl/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/backend/__init__.py", line 96, in <module>
    load_backend(get_preferred_backend())
  File "/work/gutz/miniconda3/envs/dgl/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/backend/__init__.py", line 41, in load_backend
    from .._ffi.base import load_tensor_adapter # imports DGL C library
  File "/work/gutz/miniconda3/envs/dgl/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/_ffi/base.py", line 45, in <module>
    _LIB, _LIB_NAME, _DIR_NAME = _load_lib()
  File "/work/gutz/miniconda3/envs/dgl/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/_ffi/base.py", line 35, in _load_lib
    lib = ctypes.CDLL(lib_path[0])
  File "/work/gutz/miniconda3/envs/dgl/lib/python3.8/ctypes/__init__.py", line 381, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /work/gutz/miniconda3/envs/dgl/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/libdgl.so)

Investigation

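Three things in the error message are worth noting: first, the error is triggered by libdgl.so; second, it occurs while attempting to load /lib64/libstdc++.so.6; third, the cause is that /lib64/libstdc++.so.6 is too old.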
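The three facts can also be pulled out of the message mechanically. The following is a minimal sketch, assuming the loader's message keeps the shape shown in the traceback above; the function name parse_version_error and the example path /path/to/libdgl.so are hypothetical.

```python
import re

# Shape of the dynamic loader's message, as seen in the traceback above:
#   <loaded library>: version `<symbol version>' not found (required by <library>)
_PATTERN = re.compile(
    r"(?P<loaded>\S+): version `(?P<version>[^']+)' not found "
    r"\(required by (?P<requirer>[^)]+)\)"
)


def parse_version_error(message):
    """Return a dict with 'loaded', 'version' and 'requirer', or None if no match."""
    match = _PATTERN.search(message)
    return match.groupdict() if match else None


# Hypothetical example message; the requirer path is a placeholder.
example = (
    "OSError: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found "
    "(required by /path/to/libdgl.so)"
)
print(parse_version_error(example))
```

The 'loaded' field is the library that was actually picked up, which is the one to be suspicious of when it does not match the environment you built against.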

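On reflection, something is clearly off: a library compiled with Conda's G++ should use the C++ library inside the Conda environment, not the ancient one shipped with the system. Analyzing libdgl.so with ldd and readelf gives the following.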

# ldd /work/gutz/miniconda3/envs/dgl/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/libdgl.so
linux-vdso.so.1 => (0x00007ffde9722000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002aac0673e000)
librt.so.1 => /lib64/librt.so.1 (0x00002aac06942000)
libgomp.so.1 => /work/gutz/miniconda3/envs/dgl/lib/libgomp.so.1 (0x00002aac05929000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002aac06b4a000)
libstdc++.so.6 => /work/gutz/miniconda3/envs/dgl/lib/libstdc++.so.6 (0x00002aac05957000)
libm.so.6 => /lib64/libm.so.6 (0x00002aac06d66000)
libgcc_s.so.1 => /work/gutz/miniconda3/envs/dgl/lib/libgcc_s.so.1 (0x00002aac05acb000)
libc.so.6 => /lib64/libc.so.6 (0x00002aac07068000)
/lib64/ld-linux-x86-64.so.2 (0x00002aac058f6000)
# readelf -d /work/gutz/miniconda3/envs/dgl/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/libdgl.so

Dynamic section at offset 0xc14ef8 contains 32 entries:
Tag Type Name/Value
0x0000000000000001 (NEEDED) Shared library: [libdl.so.2]
0x0000000000000001 (NEEDED) Shared library: [librt.so.1]
0x0000000000000001 (NEEDED) Shared library: [libgomp.so.1]
0x0000000000000001 (NEEDED) Shared library: [libpthread.so.0]
0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6]
0x0000000000000001 (NEEDED) Shared library: [libm.so.6]
0x0000000000000001 (NEEDED) Shared library: [libgcc_s.so.1]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
0x0000000000000001 (NEEDED) Shared library: [ld-linux-x86-64.so.2]
0x000000000000000e (SONAME) Library soname: [libdgl.so]
0x000000000000000f (RPATH) Library rpath: [/work/gutz/miniconda3/envs/dgl/lib]
0x000000000000000c (INIT) 0x19f000
0x000000000000000d (FINI) 0xaae9c0
0x0000000000000019 (INIT_ARRAY) 0xc10280
0x000000000000001b (INIT_ARRAYSZ) 960 (bytes)
0x0000000000000004 (HASH) 0x238
0x000000006ffffef5 (GNU_HASH) 0xa0e8
0x0000000000000005 (STRTAB) 0x45460
0x0000000000000006 (SYMTAB) 0x15af0
0x000000000000000a (STRSZ) 1191101 (bytes)
0x000000000000000b (SYMENT) 24 (bytes)
0x0000000000000003 (PLTGOT) 0xc16138
0x0000000000000007 (RELA) 0x16c318
0x0000000000000008 (RELASZ) 204672 (bytes)
0x0000000000000009 (RELAENT) 24 (bytes)
0x0000000000000018 (BIND_NOW)
0x000000006ffffffb (FLAGS_1) Flags: NOW
0x000000006ffffffe (VERNEED) 0x16c098
0x000000006fffffff (VERNEEDNUM) 9
0x000000006ffffff0 (VERSYM) 0x16811e
0x000000006ffffff9 (RELACOUNT) 1435
0x0000000000000000 (NULL) 0x0
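Only a few entries of this dump matter for the investigation: the NEEDED libraries and the RPATH or RUNPATH. The sketch below filters them out of readelf -d text; the function name summarize_dynamic_section is hypothetical, and the parsing assumes the bracketed layout shown above.

```python
import re

# Matches lines such as:
#   0x... (NEEDED) Shared library: [libstdc++.so.6]
#   0x... (RPATH)  Library rpath: [/some/dir:/other/dir]
_ENTRY = re.compile(r"\((NEEDED|RPATH|RUNPATH)\)\s+.*?\[([^\]]*)\]")


def summarize_dynamic_section(readelf_output):
    """Collect NEEDED libraries and RPATH/RUNPATH directories from `readelf -d` text."""
    summary = {"NEEDED": [], "RPATH": [], "RUNPATH": []}
    for line in readelf_output.splitlines():
        match = _ENTRY.search(line)
        if not match:
            continue
        tag, value = match.groups()
        # RPATH and RUNPATH may hold several colon-separated directories.
        values = [value] if tag == "NEEDED" else [v for v in value.split(":") if v]
        summary[tag].extend(values)
    return summary
```

To feed it real data, the text can be captured with subprocess.run(["readelf", "-d", path], capture_output=True, text=True).stdout, assuming readelf is on the PATH.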

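This is strange. According to ldd, libdgl.so resolves its C++ library to /work/gutz/miniconda3/envs/dgl/lib/libstdc++.so.6, not to the system's /lib64/libstdc++.so.6. Stranger still, in the readelf output the RPATH entry (run-time search path) already specifies where the required libraries live, /work/gutz/miniconda3/envs/dgl/lib. In other words, the directory from which the shared libraries should be loaded is hard-coded in the file. A quick check of /work/gutz/miniconda3/envs/dgl/lib/libstdc++.so.6 also shows that this library does support GLIBCXX_3.4.20.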

# strings /work/gutz/miniconda3/envs/dgl/lib/libstdc++.so.6 | grep GLIBCXX
...
GLIBCXX_3.4.18
GLIBCXX_3.4.19
GLIBCXX_3.4.20
GLIBCXX_3.4.21
GLIBCXX_3.4.22
GLIBCXX_3.4.23
GLIBCXX_3.4.24
...
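The same check can be scripted when several candidate libraries need comparing. This is a rough sketch of the strings-and-grep heuristic, not a real parser of the ELF version-definition section; the function names glibcxx_versions and supports are hypothetical.

```python
import re

# GLIBCXX version tags appear as plain ASCII strings inside libstdc++.so.6.
_GLIBCXX = re.compile(rb"GLIBCXX_(\d+(?:\.\d+)+)")


def glibcxx_versions(data):
    """Return the sorted, de-duplicated GLIBCXX versions (as int tuples) found in raw bytes."""
    found = set()
    for match in _GLIBCXX.finditer(data):
        found.add(tuple(int(part) for part in match.group(1).decode().split(".")))
    return sorted(found)


def supports(data, required):
    """True if a tag such as 'GLIBCXX_3.4.20' occurs in the library bytes."""
    wanted = tuple(int(part) for part in required.split("_", 1)[1].split("."))
    return wanted in glibcxx_versions(data)
```

A typical call would read the file first, for example supports(open(path, "rb").read(), "GLIBCXX_3.4.20"), once for the Conda library and once for the system one. Comparing versions as integer tuples avoids the trap of 3.4.9 sorting after 3.4.20 as text.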

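So what makes DGL load the wrong C++ library? I had no idea, and even dug through the source code to see whether DGL does anything silly such as manually specifying library load paths. It does not: the library-loading code is only two lines.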

def _load_lib():
    """Load libary by searching possible path."""
    lib_path = libinfo.find_lib_path()
    lib = ctypes.CDLL(lib_path[0])  # lib_path[0] = '/work/gutz/miniconda3/envs/dgl/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/libdgl.so'
    ...

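At my wits' end, I tried the debugging method widely recommended online, export LD_DEBUG=libs, which prints extra information while shared libraries are being loaded. The output is long, but it shows that the culprit loading /lib64/libstdc++.so.6 is not DGL but PyTorch.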

197196:      search path=/work/gutz/miniconda3/envs/dgl/lib/python3.8/site-packages/torch/lib               (RUNPATH from file /work/gutz/miniconda3/envs/dgl/lib/python3.8/site-packages/torch/lib/libtorch_global_deps.so)
197196: trying file=/work/gutz/miniconda3/envs/dgl/lib/python3.8/site-packages/torch/lib/libgcc_s.so.1
197196: search cache=/etc/ld.so.cache
197196: trying file=/lib64/libgcc_s.so.1
197196:
197196:
197196: calling init: /lib64/libgcc_s.so.1
197196:
197196:
197196: calling init: /lib64/libstdc++.so.6
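Since LD_DEBUG=libs produces a lot of output, it helps to filter it down to the lines that mention the library under suspicion. The sketch below assumes the "trying file=" and "calling init:" line shapes shown above; the function name loader_events is hypothetical.

```python
import re

# Matches the two kinds of LD_DEBUG=libs lines that name a concrete file:
#   <pid>: trying file=/path/to/lib.so
#   <pid>: calling init: /path/to/lib.so
_LINE = re.compile(r"^\s*\d+:\s+(trying file|calling init)[=:]\s*(\S+)")


def loader_events(log_text, needle="libstdc++"):
    """Return (event, path) pairs from LD_DEBUG=libs output whose path contains `needle`."""
    events = []
    for line in log_text.splitlines():
        match = _LINE.match(line)
        if match and needle in match.group(2):
            events.append(match.groups())
    return events
```

The first "calling init" event for libstdc++ shows which copy ended up in the process, and the "search path" line just before the related "trying file" lines shows which library asked for it.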

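Running ldd on libtorch_global_deps.so confirms that this wretched thing loads /lib64/libstdc++.so.6. Because /lib64/libstdc++.so.6 is already in memory before libdgl.so is loaded, the loader reuses that copy, and the C++ library specified by the rpath is never loaded.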
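One way to confirm which copy a running process actually holds is to look at its memory map. This is a Linux-only sketch that parses text in the /proc/<pid>/maps format; the function name loaded_libraries is hypothetical.

```python
def loaded_libraries(maps_text, needle="libstdc++"):
    """Return the distinct mapped file paths in /proc/<pid>/maps text that contain `needle`."""
    paths = []
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no backing file
        path = fields[5].strip()
        if needle in path and path not in paths:
            paths.append(path)
    return paths
```

Inside the Python session, after import torch, something like loaded_libraries(open("/proc/self/maps").read()) would list the libstdc++ that was mapped first. If it is the one under /lib64, a later import dgl is expected to fail as described.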

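Analyzing libtorch_global_deps.so with readelf shows that its rpath is set to $ORIGIN, that is, the directory containing libtorch_global_deps.so. That directory holds other libraries PyTorch depends on, such as libnvToolsExt, which is used for profiling, but evidently no C++ library, so the loader falls back to the default search paths and loads /lib64/libstdc++.so.6.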
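To make the $ORIGIN behaviour concrete, the sketch below expands an rpath the way described above and then checks which of the resulting directories contain a given library. Both function names are hypothetical, and the expansion is a simplification of what the real loader does.

```python
import os


def expand_origin(rpath, library_path):
    """Expand $ORIGIN / ${ORIGIN} in a colon-separated rpath for the given library file."""
    origin = os.path.dirname(os.path.abspath(library_path))
    directories = []
    for entry in rpath.split(":"):
        if not entry:
            continue  # ignore empty entries in this simplified sketch
        entry = entry.replace("${ORIGIN}", origin).replace("$ORIGIN", origin)
        directories.append(os.path.normpath(entry))
    return directories


def directories_providing(directories, soname):
    """Return the directories that actually contain a file named `soname`."""
    return [d for d in directories if os.path.exists(os.path.join(d, soname))]
```

For libtorch_global_deps.so with an rpath of $ORIGIN, expand_origin yields just the torch/lib directory. If directories_providing returns an empty list for libstdc++.so.6, the loader has nothing to find there and moves on to its default locations.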

Solutions

If you are fully confident in your skills and have plenty of spare time, recompiling PyTorch with Conda's G++ will solve the problem.

If you want something fancier, you can use patchelf to add an rpath pointing to the new C++ library to libtorch_global_deps.so.
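I did not go down this route, so the following is only a sketch of what the patchelf call might look like. It builds the argument list for patchelf --set-rpath, keeping the existing entries and appending the extra directory; the function name and the example paths are hypothetical, and patchelf itself has to be installed separately.

```python
def patchelf_set_rpath_command(library, current_rpath, extra_dir):
    """Build a `patchelf --set-rpath` argv that appends extra_dir to the existing rpath."""
    entries = [entry for entry in current_rpath.split(":") if entry]
    if extra_dir not in entries:
        entries.append(extra_dir)
    # A list (not a shell string) keeps "$ORIGIN" literal instead of letting a shell expand it.
    return ["patchelf", "--set-rpath", ":".join(entries), library]


# Hypothetical paths, for illustration only.
command = patchelf_set_rpath_command(
    "/path/to/torch/lib/libtorch_global_deps.so", "$ORIGIN", "/path/to/envs/dgl/lib"
)
print(command)
```

The list could then be passed to subprocess.run(command, check=True). patchelf rewrites the file in place, so keeping a backup copy of the library first seems prudent.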

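Being lazy, I simply put the directory into the LD_LIBRARY_PATH environment variable so that it is searched first, and called it a day. This is, admittedly, quite inelegant.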

export LD_LIBRARY_PATH=/work/gutz/miniconda3/envs/dgl/lib:$LD_LIBRARY_PATH
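When the job is launched from a Python script rather than from a shell, the same effect can be had by preparing the environment for a child process. This is a minimal sketch with hypothetical helper names. As far as I know, the loader reads LD_LIBRARY_PATH when a process starts, so changing os.environ inside an already running interpreter is not expected to help, which is why the code starts a fresh one.

```python
import os
import subprocess
import sys


def with_library_dir(env, directory):
    """Return a copy of env with `directory` placed first in LD_LIBRARY_PATH."""
    new_env = dict(env)
    current = new_env.get("LD_LIBRARY_PATH", "")
    # Drop empty entries and any earlier occurrence of the same directory.
    rest = [part for part in current.split(":") if part and part != directory]
    new_env["LD_LIBRARY_PATH"] = ":".join([directory] + rest)
    return new_env


def run_python(code, directory):
    """Run `code` in a fresh interpreter whose loader searches `directory` first."""
    env = with_library_dir(os.environ, directory)
    return subprocess.run(
        [sys.executable, "-c", code], env=env, capture_output=True, text=True
    )
```

For example, run_python("import dgl", "/work/gutz/miniconda3/envs/dgl/lib") would repeat the failing import with the Conda library directory searched first. Unlike the plain export, the helper never leaves a trailing colon when the variable was previously unset, which avoids adding an empty entry to the search path.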

References