After enabling alert notifications in 3.0, how do I add notification channels other than email?

rysinal

You could start by reading this document:
https://github.com/kubesphere/docs.kubesphere.io/blob/master/content/en/monitoring/notification-manager.md#slack

tzghost

benjaminhuo rysinal I got an error when adding it as a CRD. Could you take a look at what the problem is?

benjaminhuo

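The notification channels added in 3.0, such as Slack and WeChat Work, are provided by the Notification Manager project, which handles the alerts sent by Alertmanager.
The current version of Notification Manager has no UI support yet. You can add channels from the command line as CRDs; see the Notification Manager documentation above, or https://github.com/kubesphere/docs.kubesphere.io/blob/master/content/en/monitoring/notification-manager.md . We plan to add UI support for multiple notification channels in the next release.
The alerts enabled in the current version, v3.0.0, do not go through Alertmanager, so they cannot use the new channels such as WeChat Work.
We plan to add custom Prometheus-format alerts in a later release. These will be sent to Alertmanager, so you will be able to receive custom alert notifications through WeChat Work. v3.0.0 ships with many built-in Prometheus-format alerts that are popular in the community, which should be enough for day-to-day operations.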

wanjunlei

tzghost The data in a Secret has to be base64-encoded. Base64-encode your apisecret and try again.
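For reference, a minimal sketch of base64-encoding a value before placing it under `data` in a Secret. The value `my-api-secret` is a made-up placeholder, not a real credential. `printf '%s'` is used instead of `echo` so that no trailing newline ends up inside the encoded value.

```shell
# Hypothetical placeholder; substitute the real API secret.
api_secret='my-api-secret'

# Encode without a trailing newline.
encoded=$(printf '%s' "$api_secret" | base64)
echo "$encoded"

# Decode again to confirm the round trip gives back the original value.
decoded=$(printf '%s' "$encoded" | base64 --decode)
echo "$decoded"
```

The encoded string is what goes into the Secret; base64 is only an encoding, so it does not protect the value.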

tzghost

After configuring WeChat Work I receive notifications fine, but the content is really unreadable. wanjunlei

wanjunlei

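tzghost notification manager receives its data from alert manager, so the style of the data is kept consistent with alert manager.
You can open an issue at https://github.com/kubesphere/notification-manager describing the data format you want, and we can all discuss it.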

tzghost

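wanjunlei OK. A format similar to the email alerts, with the node, the metric, the abnormal value and the time, would be enough and more readable. Is there a reference solution for this?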

wanjunlei

You can switch the WeChat format to the email one.
Run
kubectl edit notificationmanagers.notification.kubesphere.io -n kubesphere-monitoring-system notification-manager
and add this:

spec:
  receivers:
    options:
      wechat:
        template: '{{ template "email.default.html" . }}'

Then wait for notification manager to restart.

tzghost

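wanjunlei benjaminhuo Thanks to both of you for the answers. The current monitoring turns out to be fairly basic, and we plan to define custom metrics that are closer to our business. How do we expose the Prometheus that ships with 3.0 for external access? The console does not let me configure an external gateway.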

benjaminhuo

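The alerts sent by Alertmanager carry richer information than the node, metric, abnormal value and time found in the old monitoring alert emails. They cover not only nodes but also workloads such as pods, deployments, daemonsets and statefulsets, abnormal container states, and key system components such as apiserver and etcd. As for alert types, there are alerts from Prometheus, which usually have an abnormal value and a threshold, and alerts from kube-events, which have no threshold and only a description of the anomaly. The one in your screenshot is an event alert.

In the alert you received, the title is admittedly a bit messy: Alertmanager-style alerts put the label values in parentheses in the title. The rest should be reasonably clear, although the content not being in Chinese may make it feel messy to you. We will improve this.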

Jeff

tzghost Take a look at this: https://kubesphere.com.cn/forum/d/2006-system-workspace

tzghost

Jeff After enabling the gateway feature, I get an error when configuring NodePort for Prometheus. What is the cause?

benjaminhuo

tzghost You need to edit the svc below, not the operator's svc:

kubectl -n kubesphere-monitoring-system get svc prometheus-k8s
NAME             TYPE       CLUSTER-IP     EXTERNAL-IP   PORT(S)          AGE
prometheus-k8s   NodePort   10.233.9.200   <none>        9090:31193/TCP   19d
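As an alternative to editing the Service interactively, the same change can be sketched with `kubectl patch`. This is only an illustration: the function name `expose_prometheus` and the `patch` variable are made up here, and nothing runs against the cluster until the function is called.

```shell
# Namespace and Service name as shown in the output above.
NS=kubesphere-monitoring-system
SVC=prometheus-k8s

# Merge patch that switches the Service type to NodePort.
patch='{"spec":{"type":"NodePort"}}'

# Hypothetical helper: apply the patch, then show the assigned node port.
expose_prometheus() {
  kubectl -n "$NS" patch svc "$SVC" -p "$patch" &&
    kubectl -n "$NS" get svc "$SVC"
}
```

Calling `expose_prometheus` should leave the Service with a `9090:<nodePort>/TCP` mapping like the one shown above; the actual port number is assigned by the cluster.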

tzghost

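benjaminhuo Can the configuration of alertmanager-main not be modified? When I change replicas to 1 or edit the configuration template, the change gets reverted. Also, the alertmanager.yaml in alertmanager-main is encrypted. How should I go about changing the configuration in that file?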

tzghost

The entries I added to prometheus-k8s-rulefiles-0 were reverted as well.

tzghost

@benjaminhuo @Jeff @rysinal @wanjunlei Could any of you find the time to answer the questions above?

benjaminhuo

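All of these have to be changed through the CRDs (or, for the Alertmanager configuration, the Secret). You cannot edit the workloads or ConfigMaps directly: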

# Change the Alertmanager replicas
kubectl -n kubesphere-monitoring-system edit alertmanagers.monitoring.coreos.com main
# Change the Alertmanager configuration: copy the content out, base64-decode it, edit it, then base64-encode it and write it back
kubectl -n kubesphere-monitoring-system edit secrets alertmanager-main
# Rules must also be changed through the CRD
kubectl -n kubesphere-monitoring-system edit prometheusrules.monitoring.coreos.com prometheus-k8s-rules
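The decode, edit and re-encode round trip for the Alertmanager configuration can be sketched roughly as follows. The helper names `fetch_config` and `encode_config` are made up for illustration, and the key name `alertmanager.yaml` is taken from the question above; nothing touches the cluster until `fetch_config` is called.

```shell
NS=kubesphere-monitoring-system

# Hypothetical helper: write the decoded alertmanager.yaml from the
# alertmanager-main Secret to the local file given as $1.
fetch_config() {
  kubectl -n "$NS" get secret alertmanager-main \
    -o jsonpath='{.data.alertmanager\.yaml}' | base64 --decode > "$1"
}

# Hypothetical helper: print the edited file $1 as a single-line base64
# string, ready to paste back into the Secret.
encode_config() {
  base64 < "$1" | tr -d '\n'
}
```

A possible flow is `fetch_config /tmp/alertmanager.yaml`, edit the file, then paste the output of `encode_config /tmp/alertmanager.yaml` as the value of the `alertmanager.yaml` key when running the `kubectl edit secrets alertmanager-main` command above.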

tzghost

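It works after the adjustment, but I have run into a new problem: ContainerBackoff keeps being reported. I have already deleted and recreated this pod, yet the alert keeps firing. What is the problem? benjaminhuo
=====Monitoring alert=====
Severity: warning
Name: ContainerBackoff
Message: Back-off restarting failed container
Container: notification-manager-operator
Pod: notification-manager-operator-6958786cd6-qmck2
Namespace: kubesphere-monitoring-system
Alert time: 2020-10-15 15:49:50
=======end========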

wanjunlei

Check this pod's logs.

tzghost

wanjunlei I have already deleted the pod notification-manager-operator-6958786cd6-qmck2, but the alert keeps firing.
[root@bg-003-kvm004-vms003 ~]# kubectl -n kubesphere-monitoring-system get pods
NAME                                               READY   STATUS    RESTARTS   AGE
alertmanager-main-0                                2/2     Running   0          16d
alertmanager-main-1                                2/2     Running   0          16d
alertmanager-main-2                                2/2     Running   0          16d
kube-state-metrics-95c974544-5bnm5                 3/3     Running   0          37d
node-exporter-6n5ld                                2/2     Running   0          37d
node-exporter-8vs2v                                2/2     Running   0          37d
node-exporter-kjsp5                                2/2     Running   0          37d
node-exporter-m6ql6                                2/2     Running   0          4d21h
node-exporter-x7bmr                                2/2     Running   0          37d
node-exporter-x8wpd                                2/2     Running   0          37d
notification-manager-deployment-7c8df68d94-f4g97   1/1     Running   0          37d
notification-manager-deployment-7c8df68d94-qb49z   1/1     Running   0          37d
notification-manager-operator-6958786cd6-djsbt     2/2     Running   0          157m
prometheus-k8s-0                                   3/3     Running   1          94m
prometheus-k8s-1                                   3/3     Running   1          94m
prometheus-operator-84d58bf775-269pk               2/2     Running   0          37d

=====Monitoring alert=====
Severity: warning
Name: ContainerBackoff
Message: Back-off restarting failed container
Container: notification-manager-operator
Pod: notification-manager-operator-6958786cd6-qmck2
Namespace: kubesphere-monitoring-system
Alert time: 2020-10-15 16:56:21
=======end========

xulai

tzghost Use the event query in the toolbox of the web console to look at this pod's events.