zhu733756 It should be supported.

You can check your processor architecture codename:

$ cat /proc/cpuinfo | grep name | cut -f2 -d: | uniq -c  # the processor codename when this article was tested was Broadwell
16 Intel Core Processor (Broadwell)
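If you also want to confirm the GPU model itself, a quick check (my own addition, assuming lspci is installed on the host) is:

$ lspci | grep -i nvidia   # lists the NVIDIA devices visible to the host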

The k8s host OS is CentOS 7, and I get the following error:

kubectl logs -f nvidia-driver-daemonset-tjzxf -n gpu-operator-resources

    ========== NVIDIA Software Installer ==========

    Starting installation of NVIDIA driver version 450.80.02 for Linux kernel version 5.11.5-1.el7.elrepo.x86_64

    Stopping NVIDIA persistence daemon…
    Unloading NVIDIA driver kernel modules…
    Unmounting NVIDIA driver rootfs…
    Checking NVIDIA driver packages…
    Updating the package cache…
Unable to open the file '/lib/modules/5.11.5-1.el7.elrepo.x86_64/proc/version' (No such file or directory). Could not resolve Linux kernel version
    Resolving Linux kernel version…
    Stopping NVIDIA persistence daemon…
    Unloading NVIDIA driver kernel modules…
    Unmounting NVIDIA driver rootfs…

leonanor You can open an issue on the official gpu-operator repository. From what I can see the version is supported; I just don't know whether there is some other problem.

5 days later

On ESXi 6.7, after switching the VM OS from CentOS 7.6 to Ubuntu 18.04, the deployment succeeded. A few pitfalls:
1. On ESXi 6.7, pick a VM with the worker role and configure GPU passthrough for it, then add hypervisor.cpuid.v0=FALSE in the VM's advanced settings.
2. After passthrough was configured, Ubuntu would not boot and hung at "x86: booting smp configuration....". At this point you need to update the Intel CPU microcode in the VM:
sudo dpkg -l | grep intel
sudo apt-get purge intel-microcode
sudo update-grub
sudo reboot
After that, Ubuntu reboots and starts up normally.
3. An online installation may time out, because quite a lot of images need to be downloaded. It is best to pre-download the required images on a machine with unrestricted internet access. First run helm fetch nvidia/gpu-operator to download the chart archive, unpack it, and open values.yaml in the extracted folder to find the image names to download. If that machine is not the k8s node with GPU passthrough, export the images with docker save -o and import them with docker load (see the sketch after the image list below).
Images to download:
      nvcr.io/nvidia/k8s/container-toolkit:1.4.7-ubuntu18.04
      nvcr.io/nvidia/gpu-operator:1.6.2
      nvcr.io/nvidia/driver:460.32.03-ubuntu18.04
      nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.2.0-ubuntu20.04
      nvcr.io/nvidia/k8s-device-plugin:v0.8.2-ubi8
      nvcr.io/nvidia/gpu-feature-discovery:v0.4.1
      nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
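A minimal sketch of that pull / save / load flow, assuming Docker is available on both the internet-connected machine and the GPU node (only two of the images are shown; repeat for the rest of the list above):

# On the machine with internet access: pull the images and export them to a tar archive
docker pull nvcr.io/nvidia/gpu-operator:1.6.2
docker pull nvcr.io/nvidia/driver:460.32.03-ubuntu18.04
docker save -o gpu-operator-images.tar \
  nvcr.io/nvidia/gpu-operator:1.6.2 \
  nvcr.io/nvidia/driver:460.32.03-ubuntu18.04

# Copy the archive to the GPU-passthrough k8s node, then import it there
docker load -i gpu-operator-images.tar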

        zhu733756
When running your example I get these errors — is this normal?
        2021-03-16 02:22:19.394090: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
        2021-03-16 02:22:19.650521: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
        2021-03-16 02:22:19.652326: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
        Step 0 (epoch 0.00), 21.3 ms
        Minibatch loss: 8.335, learning rate: 0.010000
        Minibatch error: 85.9%
        Validation error: 84.6%
        Step 100 (epoch 0.12), 3.9 ms
        Minibatch loss: 3.231, learning rate: 0.010000
        Minibatch error: 3.1%

2 years later

@zhu733756 Hi, is the nvidia_dcgm_exporter installed here the same thing as the gpu-dcgm-exporter in cc?

The gpu-dcgm-exporter pod is always in an error state.

The container logs show the following

6 months later

When gpu-operator schedules GPUs, besides choosing the number of GPU cards, is it possible to specify which GPU card to use?

17 days later
11 days later

I used kubectl apply -f gpu-monitor.yaml to set up GPU monitoring for the cluster. Why does this service stop automatically after a while? My YAML file is:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter
  namespace: gpu-operator-resources
  labels:
    app: nvidia-dcgm-exporter
spec:
  jobLabel: nvidia-gpu-resources
  endpoints:
  - port: gpu-metrics
    interval: 15s
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  namespaceSelector:
    matchNames:
    - gpu-operator-resources
2 months later
1 month later

tangpan In helm install gpu-operator nvidia/gpu-operator -n gpu-operator-resources --wait, replace -n gpu-operator-resources with -n kubesphere-monitoring-system and it will work.
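In other words, the adjusted command would look roughly like this (chart and release name unchanged from the original post):

helm install gpu-operator nvidia/gpu-operator -n kubesphere-monitoring-system --wait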

2 months later

tangpan It is being deleted because, when gpu-operator detects that dcgmExporter.serviceMonitor.enable is false, it automatically deletes the ServiceMonitor named nvidia-dcgm-exporter in that namespace. As it happens, your ServiceMonitor uses exactly that name; if you rename it, it will no longer be deleted. Alternatively, enable this ServiceMonitor through gpu-operator as follows.

Run kubectl edit clusterpolicies.nvidia.com cluster-policy and modify the section below to enable the ServiceMonitor; a resource named nvidia-dcgm-exporter will then be created for you automatically.
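A rough sketch of the relevant part of the ClusterPolicy spec; the field names follow the gpu-operator chart values and may differ slightly between versions:

spec:
  dcgmExporter:
    serviceMonitor:
      enabled: true   # when true, gpu-operator creates and keeps the nvidia-dcgm-exporter ServiceMonitor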

7 months later

On Ubuntu 24.04 I installed KubeSphere first and then used helm to install gpu-operator for GPU monitoring, but I ran into a problem at the helm install step:

# kubectl get pod -n gpu-operator-resources
NAME                                                           READY   STATUS                       RESTARTS   AGE
gpu-operator-7576dfc759-4mzms                                  0/1     Running                      0          4s
gpu-operator-node-feature-discovery-gc-67c749cbdf-8vdbk        0/1     CreateContainerConfigError   0          4s
gpu-operator-node-feature-discovery-master-78d66d5695-bpj2x    0/1     CreateContainerConfigError   0          4s
gpu-operator-node-feature-discovery-worker-zkbzr               0/1     CreateContainerConfigError   0          4s

Warning Failed 5s (x4 over 17s) kubelet Error: container has runAsNonRoot and image will run as root (pod: "gpu-operator-node-feature-discovery-gc-67c749cbdf-8vdbk_gpu-operator-resources(1b5febf5-2bac-4b74-9f09-1232a13c33a1)", container: gc)
Why does this happen?
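One way to see where the conflicting settings come from (just a diagnostic sketch, using the pod name from the output above) is to dump the pod-level and container-level securityContext:

kubectl -n gpu-operator-resources get pod gpu-operator-node-feature-discovery-gc-67c749cbdf-8vdbk \
  -o jsonpath='{.spec.securityContext}{"\n"}{.spec.containers[*].securityContext}{"\n"}'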