Operating system
Physical machine, Ubuntu 18.04, 4C/8G

Kubernetes version
1.20.4

Container runtime
Client: Docker Engine - Community
 Version:           20.10.9
 API version:       1.41
 Go version:        go1.16.8
 Git commit:        c2ea9bc
 Built:             Mon Oct 4 16:08:29 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.9
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.8
  Git commit:       79ea9d3
  Built:            Mon Oct 4 16:06:34 2021
  OS/Arch:          linux/amd64
  Experimental:     true
 containerd:
  Version:          1.4.11
  GitCommit:        5b46e404f6b9f661a205e28d59c982d3634148f8
 nvidia:
  Version:          1.0.2
  GitCommit:        v1.0.2-0-g52b36a2
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

KubeSphere version
KubeSphere 3.2.0, installed with kk

What is the problem
After installing GPU monitoring, no GPU metrics are available in the custom monitoring templates.

speedbot@c172:/opt/gpu_test/train_params$ kubectl get pods -n gpu-operator-resources
NAME                                                              READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-5xdhd                                       1/1     Running     2          2d3h
gpu-feature-discovery-p2xw5                                       1/1     Running     0          2d3h
gpu-operator-1639972176-node-feature-discovery-master-dbf7ck98n   1/1     Running     0          2d3h
gpu-operator-1639972176-node-feature-discovery-worker-8bdl8       1/1     Running     0          2d3h
gpu-operator-1639972176-node-feature-discovery-worker-c6s69       1/1     Running     0          2d3h
gpu-operator-1639972176-node-feature-discovery-worker-nq58w       1/1     Running     2          2d3h
gpu-operator-868b78d4d8-v5zj2                                     1/1     Running     0          2d3h
nvidia-container-toolkit-daemonset-nl8w6                          1/1     Running     1          2d3h
nvidia-container-toolkit-daemonset-wplq9                          1/1     Running     0          2d3h
nvidia-cuda-validator-7kgxh                                       0/1     Completed   0          46h
nvidia-cuda-validator-pmgxn                                       0/1     Completed   0          2d3h
nvidia-dcgm-5vmjg                                                 1/1     Running     2          2d3h
nvidia-dcgm-exporter-4hrrw                                        1/1     Running     3          2d3h
nvidia-dcgm-exporter-v24tv                                        1/1     Running     2          2d3h
nvidia-dcgm-qsrwd                                                 1/1     Running     0          2d3h
nvidia-device-plugin-daemonset-8nmvp                              1/1     Running     2          2d3h
nvidia-device-plugin-daemonset-hl95b                              1/1     Running     0          2d3h
nvidia-device-plugin-validator-bbjsh                              0/1     Completed   0          2d3h
nvidia-device-plugin-validator-t2g99                              0/1     Completed   0          46h
nvidia-operator-validator-mdnqr                                   1/1     Running     0          2d3h
nvidia-operator-validator-rnxp6                                   1/1     Running     1          2d3h

Additional notes:

1. No DCGM-related metrics appear at http://IP:30450/metrics or in the config configuration file.

2. GPU usage is clearly visible in the output of kubectl describe nodes c172|grep gpu.

speedbot@c172:/opt/gpu_test/train_params$ kubectl describe nodes c172|grep gpu
                    nvidia.com/gpu.compute.major=7
                    nvidia.com/gpu.compute.minor=5
                    nvidia.com/gpu.count=4
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=true
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.family=turing
                    nvidia.com/gpu.machine=SYS-7049GP-TRT
                    nvidia.com/gpu.memory=11019
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.product=NVIDIA-GeForce-RTX-2080-Ti
  nvidia.com/gpu:     4
  nvidia.com/gpu:     4
  gpu-operator-resources        gpu-feature-discovery-p2xw5                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d4h
  gpu-operator-resources        gpu-operator-1639972176-node-feature-discovery-worker-8bdl8    0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d4h
  gpu-operator-resources        nvidia-container-toolkit-daemonset-wplq9                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d4h
  gpu-operator-resources        nvidia-dcgm-exporter-4hrrw                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d4h
  gpu-operator-resources        nvidia-dcgm-qsrwd                                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d4h
  gpu-operator-resources        nvidia-device-plugin-daemonset-hl95b                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d4h
  gpu-operator-resources        nvidia-operator-validator-mdnqr                                0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d4h
  gpu-scheduler                 tf-notebook                                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         47h
  nvidia.com/gpu     1            1
speedbot@c172:/opt/gpu_test/train_params$ 
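
For anyone who wants to repeat the check from note 1 without reading the /metrics page by eye, here is a minimal Python sketch that scans Prometheus text-format output for DCGM metric names. It rests on a few assumptions that are not from the post: that dcgm-exporter metric names start with the prefix DCGM_, that the helper name dcgm_metric_names is purely illustrative, and that the SAMPLE text is made-up data rather than output from this cluster.

```python
import re

# Matches the metric name at the start of a Prometheus text-format sample line.
_METRIC_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def dcgm_metric_names(exposition_text, prefix="DCGM_"):
    """Return the sorted, de-duplicated metric names that start with `prefix`.

    `prefix` defaults to "DCGM_", assumed here to be the naming convention
    used by dcgm-exporter.
    """
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _METRIC_NAME.match(line)
        if match and match.group(1).startswith(prefix):
            names.add(match.group(1))
    return sorted(names)


# Hypothetical sample, NOT taken from the cluster in this post.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="c172"} 35
DCGM_FI_DEV_FB_USED{gpu="0",Hostname="c172"} 1024
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.6
"""

if __name__ == "__main__":
    found = dcgm_metric_names(SAMPLE)
    print(found if found else "no DCGM metrics found")
```

To run it against a live endpoint, the text could be fetched with urllib.request.urlopen on the same http://IP:30450/metrics URL and decoded before being passed to the function. An empty result would match the symptom described in note 1.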
Lcz changed the title to "GPU monitoring is missing metrics"