One day, you rebooted your server and suddenly found that your cute GPUs had all disappeared. You ran
nvidia-smi to see what was going on, but all you got was this error message:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
I know what you're gonna say: Nvidia F*** You!
I have actually run into this problem many times. It is most likely caused by upgrading or downgrading the Linux kernel without rebuilding the kernel modules that the GPU driver depends on.
Check System Status
The nvidia module is probably not loaded (you can check with
lsmod | grep nvidia). Try loading the kernel module manually:
sudo modprobe nvidia
You should get an error message like this.
modprobe: FATAL: Module nvidia not found in directory /lib/modules/5.15.0-84-generic
Meanwhile, check whether
/usr/lib/modules/5.15.0-84-generic/updates/dkms/nvidia.ko is missing.
If you don’t see the error message above and the kernel module file does exist, you might have a different issue, such as a hardware failure. In that case, read the kernel logs with
dmesg and check that the GPUs are still detected with
lspci -vvv, which should give you some clues.
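The checks above can be collected into one small script. This is only a sketch: the helper name nvidia_module_path is made up for illustration, the module path assumes the same DKMS layout as the path shown above, and reading the kernel log may need sudo on your system.

```shell
#!/usr/bin/env bash
# Sketch: gather the basic facts about a missing NVIDIA kernel module.

# Hypothetical helper: where DKMS is assumed to place the built module
# for a given kernel release (same layout as the path shown above).
nvidia_module_path() {
    echo "/usr/lib/modules/$1/updates/dkms/nvidia.ko"
}

kernel="$(uname -r)"
module="$(nvidia_module_path "$kernel")"
echo "running kernel: $kernel"

# 1. Is the nvidia module currently loaded?
if lsmod 2>/dev/null | grep -q '^nvidia '; then
    echo "nvidia module is loaded"
else
    echo "nvidia module is NOT loaded"
fi

# 2. Was the module built for the running kernel?
if [ -e "$module" ]; then
    echo "found:   $module"
else
    echo "missing: $module"
fi

# 3. Any driver messages in the kernel log? (may need sudo)
dmesg 2>/dev/null | grep -iE 'nvidia|nvrm' | tail -n 20

# 4. Are the GPUs still visible on the PCI bus?
if command -v lspci >/dev/null 2>&1; then
    lspci | grep -i nvidia || echo "no NVIDIA device visible on the PCI bus"
fi
```

If step 2 reports the file as missing while step 4 still lists the GPUs, you are most likely looking at the kernel module problem described here rather than a hardware failure.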
Reinstall DKMS and NVIDIA Drivers
DKMS (Dynamic Kernel Module Support), the utility that rebuilds out-of-tree kernel modules when the kernel changes, might be broken, and so might the NVIDIA drivers. We can fix both by removing them and then reinstalling them.
sudo rm -r /var/lib/dkms/nvidia
Note: Installing the full CUDA Toolkit is the only way I recommend to install the drivers. Using the drivers provided by the official Ubuntu repository is NOT recommended.
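For reference, here is one way the whole remove-and-reinstall cycle might look. Only the rm step is shown above; the purge and install commands are my assumptions, as is the cuda package name, which presumes NVIDIA's CUDA apt repository is already configured. The script is a dry run by default and only prints the commands; the NVIDIA_FIX_RUN variable and the function names are made up for this sketch.

```shell
#!/usr/bin/env bash
# Sketch: remove DKMS and the NVIDIA packages, then install them again.
# Dry run by default: commands are printed, not executed.
# Set NVIDIA_FIX_RUN=1 to actually run them.

run() {
    if [ "${NVIDIA_FIX_RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

reinstall_nvidia() {
    # Assumption: purge DKMS and every package with "nvidia" in its name.
    run sudo apt purge -y dkms '*nvidia*'
    # From the step above: drop the stale DKMS build tree.
    run sudo rm -r /var/lib/dkms/nvidia
    # Assumption: the "cuda" meta-package from NVIDIA's repository
    # pulls in both the toolkit and the driver.
    run sudo apt install -y dkms cuda
    # Load the freshly built module and confirm the GPUs are back.
    run sudo modprobe nvidia
    run nvidia-smi
}

reinstall_nvidia
```

Read the printed commands first, adjust the package names to your setup, and only then rerun with NVIDIA_FIX_RUN=1.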
(Optional) Reinstall NVIDIA Docker Runtime
The previous step also removes the NVIDIA Docker runtime, which may lead to this error if you use Docker:
Error response from daemon: Cannot restart container ...: could not select device driver "" with capabilities: [[gpu]]
So we need to reinstall it as well:
sudo apt install -y nvidia-container-toolkit
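After the reinstall, you may want to confirm that containers can see the GPUs again. The sketch below assumes a systemd host and uses the plain ubuntu image as a test container; the helper is_missing_gpu_driver_error is a made-up name that just matches the error text quoted above.

```shell
#!/usr/bin/env bash
# Sketch: check that Docker can hand GPUs to a container again.

# Hypothetical helper: succeeds if the text is the "missing GPU driver"
# error quoted above.
is_missing_gpu_driver_error() {
    case "$1" in
        *'could not select device driver'*'[[gpu]]'*) return 0 ;;
        *) return 1 ;;
    esac
}

if command -v docker >/dev/null 2>&1; then
    # Restart the daemon so it picks up the reinstalled toolkit
    # (assumption: Docker is managed by systemd).
    sudo systemctl restart docker

    # Run nvidia-smi inside a throwaway container.
    if output="$(sudo docker run --rm --gpus all ubuntu nvidia-smi 2>&1)"; then
        echo "$output"
    elif is_missing_gpu_driver_error "$output"; then
        echo "Docker still cannot find the NVIDIA runtime"
    else
        echo "$output"
    fi
else
    echo "docker not found; nothing to check"
fi
```

Containers that were created before the reinstall may need to be restarted once the daemon is back up.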