[GPU] Unable to get GPU monitoring data

nvidia-container-toolkit is installed on all nodes.

The GPU has been added in cc.

All GPU monitoring metrics are empty.

    dataknower
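    First, check that the environment is ready: nvidia-docker2 is enabled and nvidia-device-plugin is ready.
    Next, check that the gpu-exporter installed by cc has finished deploying.
    Finally, run a prometheus query to check whether prometheus has collected the gpu-exporter data.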
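The last step above can be scripted against the Prometheus HTTP API. This is only a minimal sketch: the base URL is a placeholder, and the metric name `DCGM_FI_DEV_GPU_UTIL` assumes the exporter is NVIDIA's dcgm-exporter, which the thread does not confirm. Substitute whatever metric your gpu-exporter actually exposes.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API (/api/v1/query)."""
    query_string = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query_string


def series_count(response_body: str) -> int:
    """Return how many series an instant-query JSON response contains."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    return len(payload["data"]["result"])


def gpu_series_count(base_url: str, promql: str, timeout: float = 5.0) -> int:
    """Run the query against a live Prometheus and count the returned series."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return series_count(resp.read().decode("utf-8"))


# Hypothetical usage; both the URL and the metric name are assumptions:
# gpu_series_count("http://prometheus.example:9090", "DCGM_FI_DEV_GPU_UTIL")
```

A count of 0 would mean Prometheus has no GPU series at all, which points back at the exporter or the device plugin rather than at the dashboard.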

      The /etc/containerd/config.toml file on the node has been configured:

      [plugins."io.containerd.grpc.v1.cri"]
      [plugins."io.containerd.grpc.v1.cri".containerd]
        default_runtime_name = "nvidia"
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
            privileged_without_host_devices = false
            runtime_engine = ""
            runtime_root = ""
            runtime_type = "io.containerd.runc.v2"
            [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
              BinaryName = "/usr/bin/nvidia-container-runtime"

      containerd has been restarted.

      Why is the daemon on the node reporting an error?

      The nvidia-device-plugin-daemonset container prints the following log:

      2022-11-18T15:06:15.356745259+08:00 2022/11/18 07:06:15 Starting FS watcher.
      2022-11-18T15:06:15.356833856+08:00 2022/11/18 07:06:15 Starting OS watcher.
      2022-11-18T15:06:15.356969052+08:00 2022/11/18 07:06:15 Starting Plugins.
      2022-11-18T15:06:15.356988551+08:00 2022/11/18 07:06:15 Loading configuration.
      2022-11-18T15:06:15.356995051+08:00 2022/11/18 07:06:15 Initializing NVML.
      2022-11-18T15:06:15.357161445+08:00 2022/11/18 07:06:15 Failed to initialize NVML: could not load NVML library.
      2022-11-18T15:06:15.357168645+08:00 2022/11/18 07:06:15 If this is a GPU node, did you set the docker default runtime to `nvidia`?
      2022-11-18T15:06:15.357173045+08:00 2022/11/18 07:06:15 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
      2022-11-18T15:06:15.357177445+08:00 2022/11/18 07:06:15 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
      2022-11-18T15:06:15.357180245+08:00 2022/11/18 07:06:15 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
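The "could not load NVML library" line suggests the plugin process could not open the NVIDIA management library inside its container. A quick way to test the same thing by hand is to try loading the library yourself. This is a minimal sketch that assumes the library is named `libnvidia-ml.so.1` (the usual soname on Linux) and that you run it in a container started with the same runtime as the plugin; `can_load_nvml` is a hypothetical helper.

```python
import ctypes


def can_load_nvml(library_name: str = "libnvidia-ml.so.1") -> bool:
    """Return True if the dynamic loader can open the given NVML library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        # The loader could not find or open the library in this environment.
        return False
    return True


if __name__ == "__main__":
    print("NVML loadable:", can_load_nvml())
```

If this returns False inside the container but True on the host, the driver libraries are probably not being injected into the container, which usually means the container was not started with the nvidia runtime.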

      13 days later
      8 days later

      frezes

      Our environment is k3s, so things are slightly different. The problem has been fixed.

      Reference (an online document): https://devpress.csdn.net/k8s/62ebde0989d9027116a0fa22.html
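The reply does not say what the actual fix was. One known difference with k3s is that its embedded containerd does not read /etc/containerd/config.toml. By default it uses a generated file under /var/lib/rancher/k3s/agent/etc/containerd/, and customizations normally go into a config.toml.tmpl next to it. The sketch below just reports which of these candidate files exist on a node; `existing_configs` is a hypothetical helper and the candidate list assumes default install paths.

```python
from pathlib import Path

# Assumed default locations; adjust if your install uses a custom data dir.
CANDIDATE_CONFIGS = (
    "/var/lib/rancher/k3s/agent/etc/containerd/config.toml",       # k3s (generated)
    "/var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl",  # k3s template
    "/etc/containerd/config.toml",                                 # standalone containerd
)


def existing_configs(candidates=CANDIDATE_CONFIGS) -> list[str]:
    """Return the candidate containerd config paths that exist as files."""
    return [path for path in candidates if Path(path).is_file()]


if __name__ == "__main__":
    for path in existing_configs():
        print("found:", path)
```

If only the /etc/containerd file exists on a k3s node, edits to it would likely have no effect on the containerd that k3s runs.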