Monitoring pod CPU with KubeSphere

zhaozhongyuan425 · March 28, 2022

How do I write an alerting policy that monitors pod CPU utilization using the Prometheus bundled with KubeSphere?

Nrehearsal · March 29, 2022

sum by (pod,namespace) (rate(container_cpu_usage_seconds_total[5m])) * 100 > 70

zhaozhongyuan425 · March 30, 2022

Nrehearsal container_cpu_usage_seconds_total is the cumulative CPU time the container has consumed on each CPU. With multiple CPUs, the total CPU time is the sum across all of them, so as I understand it, what you gave is not CPU utilization. This is the approach I'm using now:

sum(rate(container_cpu_usage_seconds_total{namespace="demo", image!="", container_name!="POD"}[2m])) by (pod,namespace) / sum(kube_pod_container_resource_limits_cpu_cores{namespace="demo",pod!="",container!="POD",container!=""}) by (pod,namespace) != +inf >= 0.85

Many articles online say to divide by container_spec_cpu_quota instead, which is the container's quota: the number of CPUs assigned to the container * 100000.

That expression is:

sum(rate(container_cpu_usage_seconds_total{image!="",container!="POD",container!=""}[1m])) by (pod,namespace) / (sum(container_spec_cpu_quota{image!="",container!="POD",container!=""}/100000) by (pod,namespace)) * 100

I can't confirm which one is correct, but my KubeSphere Prometheus returns no data for container_spec_cpu_quota, so I used my approach above as the alerting policy.

Nrehearsal · March 31, 2022

The formula you gave should be correct.

CPU utilization is calculated as:
rate(container_cpu_usage_seconds_total[10m]) / (container_spec_cpu_quota / container_spec_cpu_period)

Definitions:

container_cpu_usage_seconds_total: the container's cumulative CPU time, in seconds
container_spec_cpu_period: the container's CPU accounting period, usually 100000 microseconds
container_spec_cpu_quota: how much CPU time the container may use in each period, in microseconds

Example:
1. Suppose the container's cpu_quota corresponds to 7 CPUs per period, i.e. 700000 microseconds / 100000 microseconds. This converts as follows:

700 milliseconds of CPU time per 100 milliseconds
0.7 seconds of CPU time per 100 milliseconds
7 seconds of CPU time per second

2. Suppose rate(container_cpu_usage_seconds_total[10m]) is 1.34 (CPU-seconds per second). With everything expressed in seconds, CPU utilization is 1.34 / 7 * 100 ≈ 19.1%.

3. kube_pod_container_resource_limits_cpu_cores should be equivalent to container_spec_cpu_quota / container_spec_cpu_period.
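
The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the inputs (a 700000 microsecond quota, a 100000 microsecond period, a usage rate of 1.34) are the numbers from the example, not values read from a real cluster.

```python
def cpu_limit_cores(quota_us: float, period_us: float = 100_000) -> float:
    """CPU limit in cores: quota divided by period, both in microseconds."""
    return quota_us / period_us


def cpu_utilization_percent(usage_rate: float, quota_us: float,
                            period_us: float = 100_000) -> float:
    """Usage rate (CPU-seconds per second) as a percentage of the limit."""
    return usage_rate / cpu_limit_cores(quota_us, period_us) * 100


limit = cpu_limit_cores(700_000)                # 7 CPUs, as in step 1
util = cpu_utilization_percent(1.34, 700_000)   # usage rate from step 2
print(f"limit={limit:.1f} cores, utilization={util:.1f}%")
```

With these inputs the limit comes out at 7 cores and the utilization at about 19.1%, matching the figures in the example.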

See the following two links:
Prometheus metrics
Average CPU % usage per container

zhaozhongyuan425 · March 31, 2022

Nrehearsal

Understood, that makes sense now. Thanks for the guidance.

zhaozhongyuan425 · April 26, 2022

How do I modify the alert message template?

Annotations:
- aliasName = demo-pod-memory
- message = demo-pod-memory (memory utilization) >=85%
- rule_update_time = 2022-04-21T07:46:24Z
- summary = demo-pod-memory >=85%
Labels:
- alertname = demo-pod-memory
- alerttype = metric
- cluster = default
- pod = order-2
- rule_id = 1a925d2b8e96f65656e54
- severity = error
Annotations:
- aliasName = demo-pod-memory
- message = demo-pod-memory (memory utilization) >=85%
- rule_update_time = 2022-04-21T07:46:24Z
- summary = demo-pod-memory >=85%

I'd like to add the namespace field to this message. Any pointers would be appreciated, thanks.

Nrehearsal · April 27, 2022

message: 'Pod:{{ $labels.pod }}, Namespace:{{ $labels.namespace }}, CPU utilization is greater than 70%, Current Value:{{ $value | printf "%.2f%%" }}.'
summary: 'Pod:{{ $labels.pod }}, CPU utilization is greater than 70%'
expr: (sum by(pod,namespace) (irate(container_cpu_usage_seconds_total{container!="",container!="POD",namespace="$REPLACE_WITH_YOUR_NAMESPACE",pod=~".*$REPLACE_WITH_YOUR_POD.*"}[3m]))) / (sum by(pod,namespace) (kube_pod_container_resource_limits_cpu_cores{namespace="$REPLACE_WITH_YOUR_NAMESPACE",pod=~".*$REPLACE_WITH_YOUR_POD.*"})) * 100 > 70

Replace $REPLACE_WITH_YOUR_NAMESPACE and $REPLACE_WITH_YOUR_POD with your own values.
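
If the substitution needs to be done for several pods, it could be scripted. A minimal sketch in Python, where render_expr is a hypothetical helper and the values "demo" and "order" are only illustrative (borrowed from the namespace and pod names mentioned earlier in the thread):

```python
# The expr from the rule above, with its two placeholders left in place.
EXPR_TEMPLATE = (
    '(sum by(pod,namespace) (irate(container_cpu_usage_seconds_total{'
    'container!="",container!="POD",namespace="$REPLACE_WITH_YOUR_NAMESPACE",'
    'pod=~".*$REPLACE_WITH_YOUR_POD.*"}[3m]))) / '
    '(sum by(pod,namespace) (kube_pod_container_resource_limits_cpu_cores{'
    'namespace="$REPLACE_WITH_YOUR_NAMESPACE",'
    'pod=~".*$REPLACE_WITH_YOUR_POD.*"})) * 100 > 70'
)


def render_expr(namespace: str, pod: str) -> str:
    """Fill both placeholders with plain string replacement."""
    return (EXPR_TEMPLATE
            .replace("$REPLACE_WITH_YOUR_NAMESPACE", namespace)
            .replace("$REPLACE_WITH_YOUR_POD", pod))


expr = render_expr("demo", "order")
print(expr)
```

The pod value ends up inside a regular expression matcher (pod=~), so characters such as "." in a pod name are treated as regex metacharacters.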

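zhaozhongyuan425 · May 10, 2022

Nrehearsal Yes, that's solved now. Thanks.

Can KubeSphere's built-in alerts be removed? Also, how do I configure alerting to multiple DingTalk groups?

Nrehearsal · May 10, 2022

zhaozhongyuan425

The built-in policies are here: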
kubectl get -n kubesphere-monitoring-system prometheusrules.monitoring.coreos.com prometheus-k8s-rules -o yaml
The structure is roughly name -> rules -> rule; adjusting the individual rules as needed should work.
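
To illustrate that structure, here is a rough Python sketch that drops selected alerts from a PrometheusRule object, assuming it was exported with -o json instead of -o yaml and loaded into a dict. The drop_alerts helper, the group name and the alert names are all hypothetical; in the PrometheusRule resource the groups live under spec.groups, and each rule is either an alerting rule (alert key) or a recording rule (record key).

```python
import copy


def drop_alerts(prometheus_rule: dict, unwanted: set[str]) -> dict:
    """Return a copy of the object without the alerting rules named in `unwanted`."""
    result = copy.deepcopy(prometheus_rule)
    for group in result["spec"]["groups"]:
        # Recording rules have no "alert" key, so they are always kept.
        group["rules"] = [
            rule for rule in group["rules"]
            if rule.get("alert") not in unwanted
        ]
    return result


# Hypothetical sample with the same shape as a PrometheusRule object.
sample = {
    "spec": {
        "groups": [
            {
                "name": "example-group",
                "rules": [
                    {"alert": "ExampleAlertA", "expr": "up == 0"},
                    {"alert": "ExampleAlertB", "expr": "up == 1"},
                    {"record": "example:rule", "expr": "sum(up)"},
                ],
            }
        ]
    }
}

trimmed = drop_alerts(sample, {"ExampleAlertA"})
remaining = [r.get("alert") or r.get("record")
             for r in trimmed["spec"]["groups"][0]["rules"]]
print(remaining)
```

Whether an edited object stays in place after being applied back to the cluster is not something this thread confirms, so it is worth checking on a test cluster first.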

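I'm not sure about sending notifications to multiple DingTalk groups. Please open a new thread for that.

LYN · April 12, 2023

This doesn't seem quite right. In cluster node management,

the CPU usage shown is:

50%

56.05/112 cores

which is calculated from the number of CPU cores.

frezes · April 12, 2023

LYN
What's described above is the container's CPU usage, not the CPU utilization. To configure an alert, you can use pod_resource_usage / pod_resource_limit, i.e. the container's usage as a percentage of its limit.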

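LYN · April 12, 2023

frezes Thanks a lot. One more question: the autoscaling configured for workloads uses CPU usage. Is that the ratio of cores used to total cores? If so, my understanding is that the number of cores used can't be controlled. An application with very low utilization can still run on several cores, so wouldn't this metric be a problem for autoscaling?

frezes · April 12, 2023

LYN
No. The utilization here is also pod resource usage / pod resource limit.

See: https://kubernetes.io/zh-cn/docs/tasks/run-application/horizontal-pod-autoscale/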
https://kubernetes.io/zh-cn/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/
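
For reference, the HorizontalPodAutoscaler page linked above gives the scaling rule as desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), and it describes the CPU utilization value as a percentage of the containers' resource requests. A small Python sketch of that rule follows; the function name and the numbers are illustrative, and the real controller also applies a tolerance and stabilization behaviour that are left out here.

```python
import math


def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float) -> int:
    """Scaling rule from the HPA documentation, without tolerance handling."""
    return math.ceil(current_replicas * current_utilization / target_utilization)


# 2 replicas averaging 90% against a 60% target scale up to 3.
scaled_up = desired_replicas(2, 90.0, 60.0)
# 4 replicas averaging 30% against a 60% target scale down to 2.
scaled_down = desired_replicas(4, 30.0, 60.0)
print(scaled_up, scaled_down)
```

Because the metric is a ratio per pod rather than a count of host cores, how many physical cores the process happens to run on does not enter the calculation.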
