tzghost You need to edit the Service below, not the operator's Service:

kubectl -n kubesphere-monitoring-system get svc prometheus-k8s
NAME             TYPE       CLUSTER-IP     EXTERNAL-IP   PORT(S)          AGE
prometheus-k8s   NodePort   10.233.9.200   <none>        9090:31193/TCP   19d
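If the goal is to pin the exposed NodePort rather than edit the Service interactively, a patch works too. This is only a sketch: the port name web is what prometheus-operator conventionally gives this Service's port, and 30090 is an arbitrary example value.

```shell
# Pin prometheus-k8s to a fixed NodePort (30090 is just an example value)
kubectl -n kubesphere-monitoring-system patch svc prometheus-k8s \
  --type merge \
  -p '{"spec":{"ports":[{"name":"web","port":9090,"nodePort":30090}]}}'
```

Note that a JSON merge patch replaces the whole ports array, so every port the Service should keep has to appear in the patch.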
    6 days later

    benjaminhuo Can't the alertmanager-main configuration be modified? When I change replicas to 1 or edit the config template, the changes get reverted. Also, the alertmanager.yaml inside alertmanager-main is encrypted; how should I go about changing the configuration in that file?

    Config entries I add to prometheus-k8s-rulefiles-0 get reverted as well

    11 days later

    All of these have to be edited through the CRDs; you can't edit the workloads or ConfigMaps directly:

    # Adjust the Alertmanager replica count
    kubectl -n kubesphere-monitoring-system edit alertmanagers.monitoring.coreos.com main
    # Adjust the Alertmanager configuration: copy the content out and base64-decode it, then base64-encode the edited result and write it back
    kubectl -n kubesphere-monitoring-system edit secrets alertmanager-main
    # Rules also have to be changed through the CRD
    kubectl -n kubesphere-monitoring-system edit prometheusrules.monitoring.coreos.com prometheus-k8s-rules
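The decode/edit/encode round trip for the Secret can be scripted instead of done by hand inside kubectl edit. A sketch, assuming GNU base64 (-w0 disables line wrapping) and that the key inside the Secret is alertmanager.yaml:

```shell
# Pull the current Alertmanager config out of the Secret and decode it
kubectl -n kubesphere-monitoring-system get secret alertmanager-main \
  -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d > alertmanager.yaml

# ... edit alertmanager.yaml locally ...

# Re-encode the edited file and write it back; the operator picks up the change
kubectl -n kubesphere-monitoring-system patch secret alertmanager-main \
  --type merge \
  -p "{\"data\":{\"alertmanager.yaml\":\"$(base64 -w0 < alertmanager.yaml)\"}}"
```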
      5 days later

      Everything works after those adjustments, but I've run into a new problem: a ContainerBackoff alert keeps firing. I've already deleted and recreated the pod, yet the alert persists. What could be causing this? benjaminhuo
      =====Monitoring Alert=====
      Level: warning
      Name: ContainerBackoff
      Message: Back-off restarting failed container
      Container: notification-manager-operator
      POD: notification-manager-operator-6958786cd6-qmck2
      Namespace: kubesphere-monitoring-system
      Alert time: 2020-10-15 15:49:50
      =======end========

      wanjunlei I've already deleted the pod notification-manager-operator-6958786cd6-qmck2, but the alert keeps firing
      [root@bg-003-kvm004-vms003 ~]# kubectl -n kubesphere-monitoring-system get pods
      NAME                                               READY   STATUS    RESTARTS   AGE
      alertmanager-main-0                                2/2     Running   0          16d
      alertmanager-main-1                                2/2     Running   0          16d
      alertmanager-main-2                                2/2     Running   0          16d
      kube-state-metrics-95c974544-5bnm5                 3/3     Running   0          37d
      node-exporter-6n5ld                                2/2     Running   0          37d
      node-exporter-8vs2v                                2/2     Running   0          37d
      node-exporter-kjsp5                                2/2     Running   0          37d
      node-exporter-m6ql6                                2/2     Running   0          4d21h
      node-exporter-x7bmr                                2/2     Running   0          37d
      node-exporter-x8wpd                                2/2     Running   0          37d
      notification-manager-deployment-7c8df68d94-f4g97   1/1     Running   0          37d
      notification-manager-deployment-7c8df68d94-qb49z   1/1     Running   0          37d
      notification-manager-operator-6958786cd6-djsbt     2/2     Running   0          157m
      prometheus-k8s-0                                   3/3     Running   1          94m
      prometheus-k8s-1                                   3/3     Running   1          94m
      prometheus-operator-84d58bf775-269pk               2/2     Running   0          37d

      =====Monitoring Alert=====
      Level: warning
      Name: ContainerBackoff
      Message: Back-off restarting failed container
      Container: notification-manager-operator
      POD: notification-manager-operator-6958786cd6-qmck2
      Namespace: kubesphere-monitoring-system
      Alert time: 2020-10-15 16:56:21
      =======end========

        xulai
        metadata:
          name: notification-manager-operator-6958786cd6-qmck2.163921a5a979e695
          namespace: kubesphere-monitoring-system
          selfLink: /api/v1/namespaces/kubesphere-monitoring-system/events/notification-manager-operator-6958786cd6-qmck2.163921a5a979e695
          uid: 827e396e-8c57-4232-9b48-d94e12e5d12d
          resourceVersion: "12336096"
          creationTimestamp: "2020-10-14T11:35:18Z"
        involvedObject:
          kind: Pod
          namespace: kubesphere-monitoring-system
          name: notification-manager-operator-6958786cd6-qmck2
          uid: fb6064d4-67b1-4c05-9a7a-264a1d649ccf
          apiVersion: v1
          resourceVersion: "4665"
          fieldPath: spec.containers{notification-manager-operator}
        reason: BackOff
        message: Back-off restarting failed container
        source:
          component: kubelet
          host: bg-003-kvm006-vms004
        firstTimestamp: "2020-09-29T02:55:37Z"
        lastTimestamp: "2020-10-14T11:38:23Z"
        count: 28
        type: Warning
        eventTime: null
        reportingComponent: ""
        reportingInstance: ""

          tzghost
          Check whether Alertmanager has an active alert for notification-manager-operator-6958786cd6-qmck2:
          curl alertmanager-main.kubesphere-monitoring-system.svc:9093/api/v2/alerts?filter=pod=notification-manager-operator-6958786cd6-qmck2, or expose port 9093 of the kubesphere-monitoring-system/alertmanager-main Service and check the corresponding page
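If the in-cluster DNS name isn't reachable from your shell, a port-forward gets at the same API. A sketch (run the two commands in separate terminals):

```shell
# Terminal 1: forward Alertmanager's API port to localhost
kubectl -n kubesphere-monitoring-system port-forward svc/alertmanager-main 9093:9093

# Terminal 2: query active alerts for the old pod
curl -s 'http://127.0.0.1:9093/api/v2/alerts?filter=pod=notification-manager-operator-6958786cd6-qmck2'
```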

            notification-manager-operator is a Deployment; if you delete its pod, it will of course be recreated.
            Besides, this component is required for receiving WeChat and Slack notifications from Alertmanager, so it must not be deleted.
            What were you trying to achieve by deleting it?

              xulai

              http://192.168.0.55:31289/api/v2/alerts?filter=pod=notification-manager-operator-6958786cd6-qmck2
              [
                {
                  "annotations": {
                    "message": "Back-off restarting failed container",
                    "summary": "Container back-off",
                    "summaryCn": "容器回退"
                  },
                  "endsAt": "2020-10-15T11:07:05.068Z",
                  "fingerprint": "594b1ab1a9d84ff6",
                  "receivers": [{"name": "event"}],
                  "startsAt": "2020-10-15T11:02:05.068Z",
                  "status": {"inhibitedBy": [], "silencedBy": [], "state": "active"},
                  "updatedAt": "2020-10-15T11:02:05.068Z",
                  "labels": {
                    "alertname": "ContainerBackoff",
                    "alerttype": "event",
                    "container": "notification-manager-operator",
                    "namespace": "kubesphere-monitoring-system",
                    "pod": "notification-manager-operator-6958786cd6-qmck2",
                    "severity": "warning"
                  }
                }
              ]

                benjaminhuo The pod notification-manager-operator-6958786cd6-qmck2 kept firing the ContainerBackoff alert, which is why I tried deleting and recreating it. In theory, once the old pod no longer exists after recreation, the alert should resolve, but it just keeps firing, which is odd

                tzghost Check the logs of the kubesphere-logging-system/ks-events-ruler workload for anything abnormal
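A sketch of that check, assuming the workload is a Deployment named ks-events-ruler:

```shell
# Tail recent logs from every container of the ks-events-ruler Deployment
kubectl -n kubesphere-logging-system logs deploy/ks-events-ruler --all-containers --tail=100
```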

                  tzghost Check the pod IPs of all the alertmanager pods under kubesphere-monitoring-system
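A sketch for listing those pod IPs; the label selector alertmanager=main is an assumption based on how prometheus-operator labels Alertmanager pods, so verify it with kubectl get pods --show-labels if nothing matches:

```shell
# Show the Alertmanager pods together with their IPs and nodes
kubectl -n kubesphere-monitoring-system get pods -l alertmanager=main -o wide
```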