wanjunlei I have already deleted the pod notification-manager-operator-6958786cd6-qmck2, but the alert keeps firing.
[root@bg-003-kvm004-vms003 ~]# kubectl -n kubesphere-monitoring-system get pods
NAME READY STATUS RESTARTS AGE
alertmanager-main-0 2/2 Running 0 16d
alertmanager-main-1 2/2 Running 0 16d
alertmanager-main-2 2/2 Running 0 16d
kube-state-metrics-95c974544-5bnm5 3/3 Running 0 37d
node-exporter-6n5ld 2/2 Running 0 37d
node-exporter-8vs2v 2/2 Running 0 37d
node-exporter-kjsp5 2/2 Running 0 37d
node-exporter-m6ql6 2/2 Running 0 4d21h
node-exporter-x7bmr 2/2 Running 0 37d
node-exporter-x8wpd 2/2 Running 0 37d
notification-manager-deployment-7c8df68d94-f4g97 1/1 Running 0 37d
notification-manager-deployment-7c8df68d94-qb49z 1/1 Running 0 37d
notification-manager-operator-6958786cd6-djsbt 2/2 Running 0 157m
prometheus-k8s-0 3/3 Running 1 94m
prometheus-k8s-1 3/3 Running 1 94m
prometheus-operator-84d58bf775-269pk 2/2 Running 0 37d

=====Monitoring Alert=====
Severity: warning
Name: ContainerBackoff
Message: Back-off restarting failed container
Container: notification-manager-operator
Pod: notification-manager-operator-6958786cd6-qmck2
Namespace: kubesphere-monitoring-system
Alert time: 2020-10-15 16:56:21
=======end========

    xulai
    metadata:
      name: notification-manager-operator-6958786cd6-qmck2.163921a5a979e695
      namespace: kubesphere-monitoring-system
      selfLink: /api/v1/namespaces/kubesphere-monitoring-system/events/notification-manager-operator-6958786cd6-qmck2.163921a5a979e695
      uid: 827e396e-8c57-4232-9b48-d94e12e5d12d
      resourceVersion: 12336096
      creationTimestamp: 2020-10-14T11:35:18Z
      managedFields: []
    involvedObject:
      kind: Pod
      namespace: kubesphere-monitoring-system
      name: notification-manager-operator-6958786cd6-qmck2
      uid: fb6064d4-67b1-4c05-9a7a-264a1d649ccf
      apiVersion: v1
      resourceVersion: 4665
      fieldPath: spec.containers{notification-manager-operator}
    reason: BackOff
    message: Back-off restarting failed container
    source:
      component: kubelet
      host: bg-003-kvm006-vms004
    firstTimestamp: 2020-09-29T02:55:37Z
    lastTimestamp: 2020-10-14T11:38:23Z
    count: 28
    type: Warning
    eventTime: null
    reportingComponent: ""
    reportingInstance: ""
    logStripANSI: undefined

      tzghost
      Check whether Alertmanager still holds an alert for notification-manager-operator-6958786cd6-qmck2:
      Run curl alertmanager-main.kubesphere-monitoring-system.svc:9093/api/v2/alerts?filter=pod=notification-manager-operator-6958786cd6-qmck2 to check, or expose port 9093 of the kubesphere-monitoring-system/alertmanager-main service and look at the page served on that port.
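The same query can be scripted. The following is a minimal Python sketch, not part of the original reply: the helper names `build_alerts_url` and `fetch_alerts` are hypothetical, and it assumes the Alertmanager v2 API is reachable at the base URL you pass in. It writes the filter as a quoted label matcher (`pod="..."`) and URL-encodes it, which avoids shell and URL quoting problems.

```python
import json
import urllib.parse
import urllib.request


def build_alerts_url(base_url, pod):
    """Build an /api/v2/alerts URL that filters on the `pod` label.

    `base_url` is an assumption: use whatever address exposes
    alertmanager-main on port 9093 in your cluster.
    """
    matcher = f'pod="{pod}"'  # Alertmanager label matcher syntax
    query = urllib.parse.urlencode({"filter": matcher})
    return f"{base_url.rstrip('/')}/api/v2/alerts?{query}"


def fetch_alerts(base_url, pod, timeout=10):
    """Return the decoded JSON list of alerts matching the pod label."""
    url = build_alerts_url(base_url, pod)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


# Example usage (requires network access to Alertmanager):
# alerts = fetch_alerts(
#     "http://alertmanager-main.kubesphere-monitoring-system.svc:9093",
#     "notification-manager-operator-6958786cd6-qmck2",
# )
# for alert in alerts:
#     print(alert["labels"]["alertname"], alert["status"]["state"])
```

An empty list means Alertmanager holds no alert for that pod; a non-empty list means something is still sending it.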

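        notification-manager-operator is a deployment, so of course the pod is recreated after you delete it.
        Also, this component is required for receiving WeChat and Slack alerts from Alertmanager, so it must not be deleted.
        What were you trying to achieve by deleting it?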

          xulai

          http://192.168.0.55:31289/api/v2/alerts?filter=pod=notification-manager-operator-6958786cd6-qmck2
          [{"annotations":{"message":"Back-off restarting failed container","summary":"Container back-off","summaryCn":"容器回退"},"endsAt":"2020-10-15T11:07:05.068Z","fingerprint":"594b1ab1a9d84ff6","receivers":[{"name":"event"}],"startsAt":"2020-10-15T11:02:05.068Z","status":{"inhibitedBy":[],"silencedBy":[],"state":"active"},"updatedAt":"2020-10-15T11:02:05.068Z","labels":{"alertname":"ContainerBackoff","alerttype":"event","container":"notification-manager-operator","namespace":"kubesphere-monitoring-system","pod":"notification-manager-operator-6958786cd6-qmck2","severity":"warning"}}]
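One way to read this response is to compare its timestamps with the `lastTimestamp` of the event posted earlier in the thread. The Python sketch below does that arithmetic; the helper `parse_ts` and the one-hour staleness threshold are illustrative assumptions, not something taken from KubeSphere or Alertmanager.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(value):
    """Parse an RFC 3339 UTC timestamp with or without fractional seconds."""
    for fmt in ("%Y-%m-%dT%H:%M:%S.%fZ", "%Y-%m-%dT%H:%M:%SZ"):
        try:
            return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp: {value!r}")


# Values copied from the event dump and the alert JSON in this thread.
event_last_seen = parse_ts("2020-10-14T11:38:23Z")
alert_starts_at = parse_ts("2020-10-15T11:02:05.068Z")
alert_ends_at = parse_ts("2020-10-15T11:07:05.068Z")

# How long after the last recorded event occurrence the alert started.
gap = alert_starts_at - event_last_seen
# How long the alert stays active unless it is sent again.
window = alert_ends_at - alert_starts_at

# Arbitrary threshold, chosen only for illustration.
stale = gap > timedelta(hours=1)

print(f"alert started {gap} after the event was last seen")
print(f"alert validity window: {window}")
print(f"alert looks stale: {stale}")
```

The alert started almost a day after the event was last observed and expires after five minutes, so something appears to keep re-sending it for a pod that no longer exists. That is consistent with the problem lying in the component that turns events into alerts rather than in the pod itself.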

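            benjaminhuo The pod notification-manager-operator-6958786cd6-qmck2 kept raising ContainerBackoff alerts, so I deleted it to have it recreated. Once it was deleted and recreated, the original pod no longer existed, so the alert should have resolved. It keeps firing anyway, which is strange.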

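            tzghost Check whether the logs of the kubesphere-logging-system/ks-events-ruler workload show anything abnormal.

              tzghost Check the IPs of all pods of the alertmanager workload in kubesphere-monitoring-system.

                tzghost In the Alertmanager UI, open the Status tab and confirm the IPs of all peers.

                  tzghost Restart the kubesphere-logging-system/ks-events-ruler workload to bring it back to normal. I have verified that this is a bug; I have recorded it and will fix it next.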

                    4 days later

                    tzghost You can now run kubectl -n kubesphere-logging-system edit ruler ks-events-ruler and update the image version in it to v0.2.0 (this version fixes the bug above).

                      23 days later