wanjunlei benjaminhuo Thanks to you both for the replies. The built-in monitoring turned out to be fairly basic, so we plan to add some custom monitoring items closer to our business. How do we expose the Prometheus bundled with 3.0 for external access? The console does not let us configure an external gateway for it.

  • Jeff replied to this post

    tzghost You need to edit the svc below, not the operator's svc:

    kubectl -n kubesphere-monitoring-system get svc prometheus-k8s
    NAME             TYPE       CLUSTER-IP     EXTERNAL-IP   PORT(S)          AGE
    prometheus-k8s   NodePort   10.233.9.200   <none>        9090:31193/TCP   19d
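    If the service in your cluster is still ClusterIP rather than NodePort, a minimal sketch for exposing it (the node port value is assigned automatically) could be:

    kubectl -n kubesphere-monitoring-system patch svc prometheus-k8s -p '{"spec": {"type": "NodePort"}}'
    # check the assigned node port, then open http://<node-ip>:<node-port> in a browser
    kubectl -n kubesphere-monitoring-system get svc prometheus-k8s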
      6 days later

      benjaminhuo Can the alertmanager-main configuration not be modified? I found that changes such as setting replicas to 1 or editing the config template get reverted. Also, the alertmanager.yaml inside alertmanager-main is stored encoded; how should I go about adjusting the configuration in that file?

      Config entries I added to prometheus-k8s-rulefiles-0 were also reverted.

      11 days later

      All of these have to be edited through the CRDs; you cannot edit the workloads or the ConfigMap directly:

      # Adjust the Alertmanager replica count
      kubectl -n kubesphere-monitoring-system edit alertmanagers.monitoring.coreos.com main
      # Adjust the Alertmanager configuration: copy the content out, base64-decode it, edit it, then base64-encode it and write it back
      kubectl -n kubesphere-monitoring-system edit secrets alertmanager-main
      # Modifying rules also goes through the CRD
      kubectl -n kubesphere-monitoring-system edit prometheusrules.monitoring.coreos.com prometheus-k8s-rules
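      A minimal sketch of that decode/edit/encode round trip for the alertmanager.yaml key (assuming a recent kubectl; older versions need --dry-run instead of --dry-run=client):

      # dump the current configuration to a local file
      kubectl -n kubesphere-monitoring-system get secret alertmanager-main \
        -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d > alertmanager.yaml
      # edit alertmanager.yaml locally, then write it back into the secret
      kubectl -n kubesphere-monitoring-system create secret generic alertmanager-main \
        --from-file=alertmanager.yaml --dry-run=client -o yaml | \
        kubectl -n kubesphere-monitoring-system replace -f -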
        5 days later

        After the adjustments everything works, but I have run into a new problem: ContainerBackoff keeps firing even though I have already deleted and recreated that POD, and the alert just keeps coming. What could the problem be? benjaminhuo
        =====Monitoring alert=====
        Severity: warning
        Name: ContainerBackoff
        Message: Back-off restarting failed container
        Container: notification-manager-operator
        POD: notification-manager-operator-6958786cd6-qmck2
        Namespace: kubesphere-monitoring-system
        Alert time: 2020-10-15 15:49:50
        =======end========

        wanjunlei I have already deleted the POD notification-manager-operator-6958786cd6-qmck2, but the alert keeps firing:
        [root@bg-003-kvm004-vms003 ~]# kubectl -n kubesphere-monitoring-system get pods
        NAME                                               READY   STATUS    RESTARTS   AGE
        alertmanager-main-0                                2/2     Running   0          16d
        alertmanager-main-1                                2/2     Running   0          16d
        alertmanager-main-2                                2/2     Running   0          16d
        kube-state-metrics-95c974544-5bnm5                 3/3     Running   0          37d
        node-exporter-6n5ld                                2/2     Running   0          37d
        node-exporter-8vs2v                                2/2     Running   0          37d
        node-exporter-kjsp5                                2/2     Running   0          37d
        node-exporter-m6ql6                                2/2     Running   0          4d21h
        node-exporter-x7bmr                                2/2     Running   0          37d
        node-exporter-x8wpd                                2/2     Running   0          37d
        notification-manager-deployment-7c8df68d94-f4g97   1/1     Running   0          37d
        notification-manager-deployment-7c8df68d94-qb49z   1/1     Running   0          37d
        notification-manager-operator-6958786cd6-djsbt     2/2     Running   0          157m
        prometheus-k8s-0                                   3/3     Running   1          94m
        prometheus-k8s-1                                   3/3     Running   1          94m
        prometheus-operator-84d58bf775-269pk               2/2     Running   0          37d

        =====Monitoring alert=====
        Severity: warning
        Name: ContainerBackoff
        Message: Back-off restarting failed container
        Container: notification-manager-operator
        POD: notification-manager-operator-6958786cd6-qmck2
        Namespace: kubesphere-monitoring-system
        Alert time: 2020-10-15 16:56:21
        =======end========

          xulai
          metadata:
            name: notification-manager-operator-6958786cd6-qmck2.163921a5a979e695
            namespace: kubesphere-monitoring-system
            selfLink: /api/v1/namespaces/kubesphere-monitoring-system/events/notification-manager-operator-6958786cd6-qmck2.163921a5a979e695
            uid: 827e396e-8c57-4232-9b48-d94e12e5d12d
            resourceVersion: "12336096"
            creationTimestamp: "2020-10-14T11:35:18Z"
          involvedObject:
            kind: Pod
            namespace: kubesphere-monitoring-system
            name: notification-manager-operator-6958786cd6-qmck2
            uid: fb6064d4-67b1-4c05-9a7a-264a1d649ccf
            apiVersion: v1
            resourceVersion: "4665"
            fieldPath: spec.containers{notification-manager-operator}
          reason: BackOff
          message: Back-off restarting failed container
          source:
            component: kubelet
            host: bg-003-kvm006-vms004
          firstTimestamp: "2020-09-29T02:55:37Z"
          lastTimestamp: "2020-10-14T11:38:23Z"
          count: 28
          type: Warning
          eventTime: null
          reportingComponent: ""
          reportingInstance: ""
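          For reference, roughly the same event can be pulled with kubectl instead of the console (a sketch; the selector values are taken from the event above):

          kubectl -n kubesphere-monitoring-system get events \
            --field-selector involvedObject.name=notification-manager-operator-6958786cd6-qmck2,reason=BackOff -o yaml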

            tzghost
            Check whether Alertmanager still has an alert for notification-manager-operator-6958786cd6-qmck2:
            curl 'alertmanager-main.kubesphere-monitoring-system.svc:9093/api/v2/alerts?filter=pod=notification-manager-operator-6958786cd6-qmck2', or expose port 9093 of the kubesphere-monitoring-system/alertmanager-main service and check it on the corresponding page.
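            If you would rather not expose a NodePort, a quick alternative is a local port-forward (a minimal sketch):

            kubectl -n kubesphere-monitoring-system port-forward svc/alertmanager-main 9093:9093
            # then, from another terminal:
            curl 'http://127.0.0.1:9093/api/v2/alerts?filter=pod=notification-manager-operator-6958786cd6-qmck2'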

              notification-manager-operator is a Deployment, so if you delete its Pod it will of course be recreated.
              Besides, this component is required for receiving WeChat and Slack notifications from Alertmanager, so it must not be deleted.
              What was your intent in deleting it?
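              To see why the container was backing off in the first place, the usual first checks are describe and the previous container log (a sketch, using the current Pod name from the listing above; --previous only returns output if that container has actually restarted):

              kubectl -n kubesphere-monitoring-system describe pod notification-manager-operator-6958786cd6-djsbt
              kubectl -n kubesphere-monitoring-system logs notification-manager-operator-6958786cd6-djsbt \
                -c notification-manager-operator --previous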

                xulai

                http://192.168.0.55:31289/api/v2/alerts?filter=pod=notification-manager-operator-6958786cd6-qmck2

                [
                  {
                    "annotations": {
                      "message": "Back-off restarting failed container",
                      "summary": "Container back-off",
                      "summaryCn": "容器回退"
                    },
                    "endsAt": "2020-10-15T11:07:05.068Z",
                    "fingerprint": "594b1ab1a9d84ff6",
                    "receivers": [{"name": "event"}],
                    "startsAt": "2020-10-15T11:02:05.068Z",
                    "status": {"inhibitedBy": [], "silencedBy": [], "state": "active"},
                    "updatedAt": "2020-10-15T11:02:05.068Z",
                    "labels": {
                      "alertname": "ContainerBackoff",
                      "alerttype": "event",
                      "container": "notification-manager-operator",
                      "namespace": "kubesphere-monitoring-system",
                      "pod": "notification-manager-operator-6958786cd6-qmck2",
                      "severity": "warning"
                    }
                  }
                ]

                  benjaminhuo It's because the POD notification-manager-operator-6958786cd6-qmck2 kept firing the ContainerBackoff alert, so I tried deleting and recreating it. In theory, once it is recreated the original POD no longer exists and the alert should resolve, but it just keeps firing, which is odd.

                    tzghost Check the logs of the kubesphere-logging-system/ks-events-ruler workload for anything abnormal.
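                    A minimal sketch of pulling those logs, assuming ks-events-ruler is a Deployment in that namespace (adjust the workload kind or name if it differs in your cluster):

                    kubectl -n kubesphere-logging-system get deploy ks-events-ruler
                    kubectl -n kubesphere-logging-system logs deploy/ks-events-ruler --all-containers --tail=200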