20220215_Installing the NVIDIA GPU driver

Source: Internet  Collected by: 自由互联  Published: 2023-09-06

Version: CentOS 8.5

I. Installation steps:

1.1. Download the driver (pick the version that matches your hardware)

https://www.nvidia.cn/Download/index.aspx?lang=cn

Check the GPU model first:
lspci
Then look up the matching driver on the official search page:
https://www.nvidia.cn/geforce/drivers/
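
As a small illustration of the model check, the sketch below narrows lspci output down to NVIDIA display devices. The function name list_nvidia_gpus is made up for this example, and the device classes it matches (VGA, 3D, Display) are an assumption about how your card is reported.

```shell
# Hypothetical helper: keep only NVIDIA display devices from lspci output.
# Reads lspci-style lines on stdin, so it can be tried on saved output too.
list_nvidia_gpus() {
    grep -i 'nvidia' | grep -Ei 'vga|3d|display'
}

# Usage on a real machine:
#   lspci | list_nvidia_gpus
```

The second filter drops the card's HDMI audio function, which lspci also lists under the NVIDIA vendor name.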

1.2. Disable the nouveau driver that ships with the system

In /lib/modprobe.d/dist-blacklist.conf, comment out the nvidiafb entry:
vim /lib/modprobe.d/dist-blacklist.conf
#blacklist nvidiafb
Then add the following lines to the same file:
blacklist nouveau
options nouveau modeset=0
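
Editing the distribution file under /lib works, but the same two settings can also be kept in a separate file under /etc/modprobe.d, the directory meant for local configuration. A minimal sketch, assuming a file name of blacklist-nouveau.conf (both that name and the helper write_nouveau_blacklist are chosen for illustration only):

```shell
# Hypothetical helper: write the nouveau blacklist settings to the file in $1.
# Any existing content of that file is overwritten.
write_nouveau_blacklist() {
    target=$1
    printf '%s\n' \
        'blacklist nouveau' \
        'options nouveau modeset=0' > "$target"
}

# Usage on a real machine (as root); the file name is an assumption:
#   write_nouveau_blacklist /etc/modprobe.d/blacklist-nouveau.conf
```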

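1.3. Rebuild the initramfs image

Back up the existing image first (if the /boot partition is short on space, put the backup in another directory):
mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
dracut /boot/initramfs-$(uname -r).img $(uname -r)
reboot
After the reboot, check that the nouveau driver is no longer loaded:
lsmod | grep nouveau   # no output means nouveau is disabled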
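
For use in a script, the same check can be turned into an exit status. This is only a sketch: nouveau_loaded is a hypothetical name, and it assumes the usual lsmod layout with the module name in the first column.

```shell
# Hypothetical helper: succeed (exit 0) if the lsmod output on stdin lists a
# module named exactly "nouveau" in the first column, fail (exit 1) otherwise.
nouveau_loaded() {
    awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}

# Usage on a real machine:
#   if lsmod | nouveau_loaded; then
#       echo "nouveau is still loaded"
#   fi
```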

1.4. Install dependencies

# CentOS 8
dnf install -y tar bzip2 make automake gcc gcc-c++ vim pciutils elfutils-libelf-devel libglvnd-devel

1.5. Kill the desktop (X server)

[root@liv-pc-002 tmp]# ps -ef | grep  X
root 1948 1946 0 18:15 tty1 00:00:00 /usr/libexec/Xorg vt1 -displayfd 3 -auth /run/user/42/gdm/Xauthority -nolisten tcp -background none -noreset -keeptty -novtswitch -verbose 3
root 2531 2457 0 18:16 pts/0 00:00:00 grep --color=auto X
[root@liv-pc-002 tmp]# kill -9 1948
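
kill -9 gives the X server no chance to clean up, and a display manager such as gdm may start it again. A gentler option is to switch to the text-mode target with systemctl isolate multi-user.target. If you prefer the PID route, the sketch below pulls the Xorg PID out of ps -ef output instead of reading it by eye. The name xorg_pids is hypothetical, and the parsing assumes the standard ps -ef column layout seen in the listing above.

```shell
# Hypothetical helper: print the PID of every process whose command is Xorg.
# Assumes the `ps -ef` layout: UID PID PPID C STIME TTY TIME CMD...
xorg_pids() {
    awk '$8 ~ /(^|\/)Xorg$/ { print $2 }'
}

# Usage on a real machine (as root):
#   ps -ef | xorg_pids | xargs -r kill
# Alternative that stops the whole graphical session:
#   systemctl isolate multi-user.target
```

Matching on the command column keeps the grep process itself out of the result.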

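1.6. Run the installer

chmod +x ./NVIDIA-Linux-x86_64-470.82.00.run
./NVIDIA-Linux-x86_64-470.82.00.run
During installation, choose accept; if you are asked whether to update xorg.conf, choose yes.
When the installation finishes, reboot the machine:
reboot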

1.7. Check the result

nvidia-smi
Tue Mar  1 11:53:21 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.00    Driver Version: 470.82.00    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   52C    P8    21W / 320W |    327MiB / 10015MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1957      G   /usr/libexec/Xorg                  39MiB |
|    0   N/A  N/A      2366      G   /usr/bin/gnome-shell               72MiB |
|    0   N/A  N/A     25129      C   python3.8                         211MiB |
+-----------------------------------------------------------------------------+
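
For scripted checks, the driver version can be pulled out of the banner line. A sketch with an illustrative name, driver_version_from_smi; it assumes the banner keeps the "Driver Version: x.y.z" wording seen in the output above.

```shell
# Hypothetical helper: print the driver version found in nvidia-smi's banner.
# Reads nvidia-smi output on stdin and prints only the first match.
driver_version_from_smi() {
    sed -n 's/.*Driver Version: \([0-9.]*\).*/\1/p' | head -n 1
}

# Usage on a real machine:
#   nvidia-smi | driver_version_from_smi
# nvidia-smi can also report the value directly:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
```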

II. Problems encountered:

2.1 podman conflict

2.1.1 Error message:
Error:
Problem 1: problem with installed package podman-3.0.1-6.module_el8.4.0+781+acf4c33b.x86_64
package podman-3.0.1-6.module_el8.4.0+781+acf4c33b.x86_64 requires runc >= 1.0.0-57, but none of the providers can be installed
package podman-3.0.1-7.module_el8.4.0+830+8027e1c4.x86_64 requires runc >= 1.0.0-57, but none o
2.1.2 Fix:
#dnf remove podman
yum erase podman buildah

2.2 Installing Docker

yum install -y yum-utils
yum-config-manager --add-repo http://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo
yum install -y docker-ce docker-ce-cli containerd.io
# Install nvidia-docker
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
yum install -y nvidia-docker2
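
The distribution=... one-liner builds the repository path from /etc/os-release. The sketch below isolates that step so the value can be inspected before it ends up in a URL. distribution_id is a hypothetical helper; it takes the os-release path as an argument so that it can also be tried on a sample file.

```shell
# Hypothetical helper: print $ID$VERSION_ID from the os-release file in $1.
# The file is sourced in a subshell so its variables do not leak out.
distribution_id() {
    ( . "$1" && printf '%s%s\n' "$ID" "$VERSION_ID" )
}

# Usage on a real machine:
#   distribution_id /etc/os-release
```

If the printed value does not correspond to a directory published in the nvidia-docker repository, the curl command will save an error page instead of a repo file, so it is worth checking the downloaded file before running yum.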

2.3 Kernel version mismatch

2.3.1 Error message:
ERROR: Unable to find the kernel source tree for the currently running kernel.  Please make sure you have installed the kernel     
source files for your kernel and that they are properly configured; on Red Hat Linux systems, for example, be sure you have
the 'kernel-source' or 'kernel-devel' RPM installed. If you know the correct kernel source files are installed, you may
specify the kernel source path with the '--kernel-source-path' command line option.
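2.3.2 Cause:
The installed kernel-devel/kernel-headers packages do not match the running kernel.
2.3.3 Steps:
a. Install (or upgrade) the kernel packages.
dnf install kernel-headers
dnf install kernel-devel
b. After the upgrade, check the running kernel version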
$ uname -a
Linux live 4.18.0-348.7.1.el8_5.x86_64 #1 SMP Wed Dec 22 13:25:12 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
c. Compare it with the installed package versions
# rpm -qa | grep -E "kernel-devel|kernel-headers"
kernel-headers-4.18.0-348.7.1.el8_5.x86_64
kernel-devel-4.18.0-348.7.1.el8_5.x86_64
d. If the versions match, run the driver installer. If they do not, run:
# dnf distro-sync
Reboot afterwards and check again whether the two kernel versions match.
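
The comparison in steps b and c can be done mechanically. A minimal sketch, assuming that rpm prints package names as kernel-devel-<release>, which is the form seen in the listing above; has_matching_kernel_devel is a hypothetical name.

```shell
# Hypothetical helper: succeed if stdin (the output of `rpm -q kernel-devel`)
# contains a kernel-devel package for the kernel release given as $1.
# -x matches whole lines, -F treats the dots in the version literally.
has_matching_kernel_devel() {
    grep -qxF "kernel-devel-$1"
}

# Usage on a real machine:
#   if rpm -q kernel-devel | has_matching_kernel_devel "$(uname -r)"; then
#       echo "kernel-devel matches the running kernel"
#   else
#       echo "mismatch: run dnf distro-sync, reboot, and check again"
#   fi
```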

2.4 Installation reports success, but nvidia-smi prints nothing

Check that the driver version is compatible with your Linux (kernel) version.

2.5 Installer error: ERROR: Unable to find the development tool cc in your path

2.5.1 Error text:
ERROR: Unable to find the development tool `cc` in your path; please make sure that you have the package 'gcc' installed.  If gcc is installed on your system, then please check that `cc` is in your PATH.
2.5.2 Cause:
gcc is not installed.
2.5.3 Fix:
dnf install -y gcc

2.6 Installer error: ERROR: The Nouveau kernel driver is currently in use by your system.

2.6.1 Error text:
ERROR: The Nouveau kernel driver is currently in use by your system.  This driver is incompatible with the NVIDIA driver, and must be disabled before proceeding.  Please consult the NVIDIA driver README and your Linux distribution's documentation for details on how to correctly disable the Nouveau kernel driver.
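2.6.2 Cause:
nouveau is the open-source NVIDIA driver that many Linux distributions ship so that a wide range of cards work out of the box. It conflicts with the NVIDIA driver and must be disabled before the NVIDIA driver can be installed.
2.6.3 Fix: disable the nouveau driver
First apply step 1.2 of this article (disable the nouveau driver that ships with the system). If that does not take effect, do the following:
-------
(1). Disable nouveau on the kernel command line:
vim /etc/default/grub
Append rd.driver.blacklist=nouveau nouveau.modeset=0 to the value of "GRUB_CMDLINE_LINUX".
(2). Regenerate the grub configuration:
grub2-mkconfig -o /boot/grub2/grub.cfg
(3). Append the following line to the end of /usr/lib/modprobe.d/dist-blacklist.conf or /etc/modprobe.d/blacklist.conf:
blacklist nouveau
(4). Back up the current initramfs image, which still contains nouveau:
mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
(5). Rebuild the initramfs with dracut:
dracut -v /boot/initramfs-$(uname -r).img $(uname -r)
(6). Reboot, then run lsmod | grep nouveau to confirm that nouveau is not loaded.
(7). Run the installer again: ./NVIDIA-Linux-x86_64-xxx.run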
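
The edit to /etc/default/grub can also be scripted. The following is a sketch only: add_grub_args is a hypothetical helper, and it assumes the file holds a single double-quoted GRUB_CMDLINE_LINUX="..." line and that the arguments contain no |, & or backslash characters.

```shell
# Hypothetical helper: append kernel arguments ($2) to GRUB_CMDLINE_LINUX in
# the grub defaults file ($1). Does nothing if they are already present.
add_grub_args() {
    file=$1
    args=$2
    if grep -qF -- "$args" "$file"; then
        return 0
    fi
    tmp=$(mktemp) || return 1
    # Rewrite via a temp file, then copy back so the original file keeps
    # its owner and permissions.
    sed "s|^GRUB_CMDLINE_LINUX=\"\(.*\)\"|GRUB_CMDLINE_LINUX=\"\1 $args\"|" \
        "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}

# Usage on a real machine (as root), followed by the grub2-mkconfig step:
#   add_grub_args /etc/default/grub 'rd.driver.blacklist=nouveau nouveau.modeset=0'
```

Because of the presence check, running the helper twice leaves the file unchanged the second time.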

2.7 Installer error: ERROR: An NVIDIA kernel module 'nvidia-drm' appears to already be loaded in your kernel.

2.7.1 Error text:
ERROR: An NVIDIA kernel module 'nvidia-drm' appears to already be loaded in your kernel. This may be because it is in use (for example, by an X server, a CUDA program, or the NVIDIA Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading. Please be sure to exit any programs that may be using the GPU(s) before attempting to upgrade your driver. If no GPU-based programs are running, you know that your kernel supports module unloading, and you still receive this message, then an error may have occured that has corrupted an NVIDIA kernel module’s usage count, for which the simplest remedy is to reboot your computer.
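2.7.2 Cause:
The desktop session is probably still using the GPU, so the desktop (X server) process has to be killed.
2.7.3 Fix:
See step 1.5 of this article.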

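2.8 Driver fails to load after a kernel update

2.8.1 Steps
a. Exclude kernel packages from updates, so that a kernel update cannot break the driver again:
sed -i '$a exclude=kernel* centos-release* initscripts*' /etc/yum.conf
b. List all boot entries and note the index of the one you need:
grubby --info=ALL
Sample output: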
index=0
kernel=/boot/vmlinuz-5.4.225-1.el7.elrepo.x86_64
args="ro crashkernel=auto rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb quiet "
root=/dev/mapper/centos-root
initrd=/boot/initramfs-5.4.225-1.el7.elrepo.x86_64.img
title=CentOS Linux (5.4.225-1.el7.elrepo.x86_64) 7 (Core)
index=1
kernel=/boot/vmlinuz-5.4.180-1.el7.elrepo.x86_64
args="ro crashkernel=auto rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb quiet "
root=/dev/mapper/centos-root
initrd=/boot/initramfs-5.4.180-1.el7.elrepo.x86_64.img
title=CentOS Linux (5.4.180-1.el7.elrepo.x86_64) 7 (Core)
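c. Set the required boot entry as the default
grubby --set-default-index=1
d. Reboot
reboot
e. Check that the running kernel is the same one as before
uname -r
f. Run nvidia-smi to check that the driver loads. If it does not, uninstall the driver and install it again:
nvidia-uninstall
reboot
./NVIDIA-Linux-x86_64-xxx.run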
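
Instead of reading the index off the grubby listing by eye, it can be looked up by kernel version. A sketch under the assumption that the listing uses index= and kernel= lines as in the sample output above; index_for_kernel is a hypothetical name.

```shell
# Hypothetical helper: read `grubby --info=ALL` output on stdin and print the
# index of the boot entry whose kernel is /boot/vmlinuz-<version>, with the
# version given as $1. Quotes around the kernel path are tolerated.
index_for_kernel() {
    awk -F= -v want="/boot/vmlinuz-$1" '
        $1 == "index"  { idx = $2 }
        $1 == "kernel" { k = $2; gsub(/"/, "", k); if (k == want) print idx }
    '
}

# Usage on a real machine (as root); the version string is only an example
# taken from the sample output:
#   idx=$(grubby --info=ALL | index_for_kernel 5.4.180-1.el7.elrepo.x86_64)
#   grubby --set-default-index="$idx"
```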