[GPU] Unable to get GPU monitoring data
nvidia-container-toolkit is installed on all nodes.
The GPU has been added in cc.
All GPU monitoring metrics are empty.
dataknower
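First, check that the environment is ready: nvidia-docker2 is enabled and nvidia-device-plugin is ready;
next, check that the gpu-exporter installed by cc has finished deploying;
finally, run a Prometheus query to check that Prometheus has scraped the gpu-exporter data.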
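
The last step above can be sketched in Python. This is a minimal sketch, not the way cc itself does it: it builds an instant-query URL for the Prometheus HTTP API and checks whether the response contains any series. The metric name `DCGM_FI_DEV_GPU_UTIL` is an assumption (it is what NVIDIA's dcgm-exporter exposes); the gpu-exporter installed by cc may use a different name, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode

# Assumption: metric name taken from NVIDIA's dcgm-exporter; replace it with
# whatever metric the gpu-exporter in your cluster actually exposes.
GPU_METRIC = "DCGM_FI_DEV_GPU_UTIL"


def build_query_url(base_url: str, expr: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


def has_samples(response_body: str) -> bool:
    """Return True if an instant-query response contains at least one series."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        return False
    return len(payload.get("data", {}).get("result", [])) > 0
```

To use it, fetch `build_query_url(<your Prometheus address>, GPU_METRIC)` with any HTTP client and pass the response body to `has_samples`. An empty result means Prometheus has no data for that metric, so the problem is upstream of the dashboards (exporter not running, or not being scraped).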
Where can I find complete instructions for configuring a GPU node?
https://github.com/NVIDIA/k8s-device-plugin
https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html
The information I found on NVIDIA's official site looks somewhat inconsistent or poorly maintained.
The node's /etc/containerd/config.toml file has been configured:
[plugins."io.containerd.grpc.v1.cri"]
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "nvidia"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
BinaryName = "/usr/bin/nvidia-container-runtime"
containerd has already been restarted.
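
A quick way to double-check a config like the one above is to parse it and confirm that the `nvidia` runtime is both defined and set as the default. The following is a minimal sketch with a hypothetical helper name. It only validates the text of the file, not what the running containerd has actually loaded, and it assumes Python 3.11+ for `tomllib` (or the `tomli` backport on older versions).

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # assumption: tomli backport installed on older Pythons

CRI = "io.containerd.grpc.v1.cri"


def check_nvidia_runtime(config_text: str) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    config = tomllib.loads(config_text)
    containerd = config.get("plugins", {}).get(CRI, {}).get("containerd", {})
    problems = []
    if containerd.get("default_runtime_name") != "nvidia":
        problems.append("default_runtime_name is not 'nvidia'")
    runtime = containerd.get("runtimes", {}).get("nvidia")
    if runtime is None:
        problems.append("no 'nvidia' runtime is defined")
    elif not runtime.get("options", {}).get("BinaryName"):
        problems.append("nvidia runtime has no options.BinaryName")
    return problems
```

Read the node's config file into a string and pass it to `check_nvidia_runtime`. Note that TOML requires straight double quotes around `io.containerd.grpc.v1.cri`; a file containing typographic quotes will fail to parse.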
Why is the daemon on the node reporting an error?
The nvidia-device-plugin-daemonset container shows the following log:
2022-11-18T15:06:15.356745259+08:00 2022/11/18 07:06:15 Starting FS watcher.
2022-11-18T15:06:15.356833856+08:00 2022/11/18 07:06:15 Starting OS watcher.
2022-11-18T15:06:15.356969052+08:00 2022/11/18 07:06:15 Starting Plugins.
2022-11-18T15:06:15.356988551+08:00 2022/11/18 07:06:15 Loading configuration.
2022-11-18T15:06:15.356995051+08:00 2022/11/18 07:06:15 Initializing NVML.
2022-11-18T15:06:15.357161445+08:00 2022/11/18 07:06:15 Failed to initialize NVML: could not load NVML library.
2022-11-18T15:06:15.357168645+08:00 2022/11/18 07:06:15 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2022-11-18T15:06:15.357173045+08:00 2022/11/18 07:06:15 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2022-11-18T15:06:15.357177445+08:00 2022/11/18 07:06:15 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2022-11-18T15:06:15.357180245+08:00 2022/11/18 07:06:15 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
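
The decisive line in the log above is `Failed to initialize NVML: could not load NVML library`, which the plugin itself associates with the default runtime not being `nvidia`. When checking several nodes, it can help to scan each device-plugin pod's log for that message. The following is a minimal sketch with a hypothetical helper name; it assumes the log text has already been collected, for example with `kubectl logs`.

```python
NVML_FAILURE = "Failed to initialize NVML"


def nvml_failures(log_text: str) -> list[str]:
    """Return the log lines that report an NVML initialization failure."""
    return [line for line in log_text.splitlines() if NVML_FAILURE in line]
```

A non-empty result for a pod running on a GPU node suggests the container was not started through the NVIDIA runtime, so the runtime configuration on that node is worth rechecking.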