Fix a Corrupted NVIDIA Driver after Upgrading / Downgrading the Ubuntu Kernel

One day, you rebooted your server and suddenly found that your cute GPUs had all disappeared. You ran nvidia-smi to see what was going on, but all you got was this error message.

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

I know what you're gonna say: Nvidia F*** You!

Causes

Actually, I have encountered this problem many times. It is most likely caused by upgrading or downgrading the Linux kernel without rebuilding the kernel modules that the GPU driver depends on.
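
You can usually confirm this by comparing the kernel you are running with the kernels DKMS has built the nvidia module for. A minimal check, assuming the driver was installed through DKMS in the first place:

# The kernel you are currently running
uname -r
# The kernels DKMS has built the nvidia module for (the running one is likely missing)
dkms status | grep -i nvidia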

Steps

Check System Status

Right now the nvidia module is probably not loaded (you can check with lsmod | grep nvidia). We can try to load the module manually.

sudo modprobe nvidia

You should get an error message like this.

modprobe: FATAL: Module nvidia not found in directory /lib/modules/5.15.0-84-generic

Meanwhile, check whether /usr/lib/modules/5.15.0-84-generic/updates/dkms/nvidia.ko is missing.
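
For example, a quick check that uses uname -r so the path always matches the running kernel (assuming the usual Ubuntu DKMS layout):

# List the DKMS-built modules for the running kernel
ls /usr/lib/modules/$(uname -r)/updates/dkms/ | grep -i nvidia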

If you don't see the error message above and that kernel module file does exist, you might have other issues, such as hardware failure. In that case, read the kernel logs with dmesg and check whether the GPUs are still visible on the PCI bus with lspci -vvv; both should give you some clues.
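
Something along these lines is usually enough to tell the two cases apart (the grep patterns are only examples, not an exhaustive list):

# Look for driver or hardware errors in the kernel log
sudo dmesg | grep -iE 'nvidia|nvrm|xid'
# Confirm the GPUs still show up on the PCI bus (use lspci -vvv for full details)
lspci | grep -i nvidia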

Reinstall DKMS and NVIDIA Drivers

DKMS, the utility that rebuilds out-of-tree kernel modules such as the NVIDIA driver for each installed kernel, might be broken, and so might the NVIDIA driver packages themselves. We can fix both by removing them first and installing them back afterwards.

sudo rm -r /var/lib/dkms/nvidia
sudo apt install --reinstall dkms
sudo apt autoremove --purge 'nvidia*' 'cuda*'

# Please refer to https://developer.nvidia.com/cuda-downloads to install the latest CUDA Toolkit
# Commands below work only on Ubuntu 22.04
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install cuda

Note: Installing the full CUDA Toolkit is the only way I recommend installing the drivers. Using the drivers provided by the official Ubuntu repository is NOT recommended.
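
Once the reinstall finishes (and after a reboot, if the installation pulled in a new kernel), it is worth verifying that DKMS actually rebuilt the module for the running kernel. A minimal sketch:

# DKMS should now report the nvidia module as installed for the running kernel
dkms status | grep -i nvidia
# Load the module and check the GPUs again
sudo modprobe nvidia
nvidia-smi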

(Optional) Reinstall NVIDIA Docker Runtime

The previous step will also remove the NVIDIA Docker runtime, which may lead to this error if you use Docker.

Error response from daemon: Cannot restart container ...: could not select device driver "" with capabilities: [[gpu]]

Thus, we need to install it back too.

sudo apt install -y nvidia-container-toolkit
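
After the package is back, Docker usually needs a restart before containers can see the GPUs again; depending on your setup you may also need to re-register the runtime first. A rough sketch, where the nvidia/cuda image tag is only an example:

# Re-register the NVIDIA runtime with Docker (skip if your daemon.json is already configured)
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Quick smoke test; substitute any CUDA-enabled image you already have
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi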
