• Monitoring & Logging
  • Production KubeSphere logging platform cannot collect logs; fluent-bit keeps reporting errors (urgent help needed)

1. Background

Hello, KubeSphere community veterans. KubeSphere has been running in our production project for about half a year, and we have now hit a series of problems with log collection. A community member kindly suggested a configuration parameter for this issue yesterday, but unfortunately it did not solve the problem, so I am posting again to ask for help. Many thanks in advance. Link to the previous post: https://ask.kubesphere.io/forum/d/22968-kubesphere-ping-tai-wu-fa-zheng-chang-shou-ji-ri-zhi-fluent-bitri-zhi-bao-cuo-wu-fa-zeng-jia-huan-chong-qu/9

KubeSphere version: 3.3.1 (minimal deployment on Kubernetes)

Kubernetes version: 1.22.0

2. Problem Description

After the KubeSphere platform had been running in production for a while, the development team reported that neither the production nor the test platform could display logs; log searches returned nothing, as shown in the screenshot below.

So I started digging into the problem and found that the fluent-bit containers had been logging errors continuously. The specific errors are as follows:

[2023/07/23 00:15:37] [ warn] [http_client] cannot increase buffer: current=512000 requested=544768 max=512000
{"log":"[2023/11/08 10:26:33] [ warn] [input] tail.2 paused (mem buf overlimit)\n","stream":"stderr","time":"2023-11-08T10:26:33.406030339Z"}

The errors above indicate that the memory buffer limit (memBufLimit) and the buffer size (bufferSize) need to be increased, so I set memBufLimit on the tail input CRD (CRD -> input -> tail) and bufferSize on the es output CRD (CRD -> output -> es).
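For reference, the change amounts to roughly the following (a sketch only; the object names follow the default KubeSphere 3.3 logging setup and the values are placeholders, not recommendations):

kubectl edit input -n kubesphere-logging-system tail
# under spec.tail, raise the per-input memory buffer limit, e.g.:
#   memBufLimit: 100MB
kubectl edit output -n kubesphere-logging-system es
# under spec.es, raise the HTTP buffer used when talking to ES
# (the warning above shows the current max of 512000 bytes), e.g.:
#   bufferSize: 2MB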

After making those changes, I restarted the fluent-bit service so that the new parameters would take effect. Just as I was happily waiting for the logging platform to recover, more bad news arrived: fluent-bit started reporting a new set of errors, as shown below:

[2023/12/04 23:17:16] [error] [http_client] broken connection to elasticsearch-logging-data.kubesphere-logging-system.svc:9200 ?
 [2023/12/04 23:17:16] [ warn] [output:es:es.0] http_do=-1 URI=/_bulk
 [2023/12/04 23:17:16] [error] [http_client] broken connection to elasticsearch-logging-data.kubesphere-logging-system.svc:9200 ?
 [2023/12/04 23:17:16] [ warn] [output:es:es.0] http_do=-1 URI=/_bulk
 [2023/12/04 23:17:16] [error] [http_client] broken connection to elasticsearch-logging-data.kubesphere-logging-system.svc:9200 ?
 [2023/12/04 23:17:16] [ warn] [output:es:es.0] http_do=-1 URI=/_bulk
 [2023/12/04 23:17:16] [error] [http_client] broken connection to elasticsearch-logging-data.kubesphere-logging-system.svc:9200 ?
 [2023/12/04 23:17:16] [ warn] [output:es:es.0] http_do=-1 URI=/_bulk
 [2023/12/04 23:17:16] [ warn] [engine] chunk '12-1701731723.856781118.flb' cannot be retried: task_id=51, input=systemd.1 > output=es.0
 [2023/12/04 23:17:16] [ warn] [engine] chunk '12-1701731722.751100596.flb' cannot be retried: task_id=47, input=systemd.1 > output=es.0
 [2023/12/04 23:17:16] [ warn] [engine] chunk '12-1701731775.506676822.flb' cannot be retried: task_id=63, input=tail.2 > output=es.0
 [2023/12/04 23:17:16] [ warn] [engine] chunk '12-1701731718.465677409.flb' cannot be retried: task_id=36, input=systemd.1 > output=es.0
 [2023/12/04 23:17:19] [ warn] [engine] failed to flush chunk '12-1701731835.508401286.flb', retry in 11 seconds: task_id=1, input=tail.2 > output=es.0 (out_id=0)
 [2023/12/04 23:17:19] [ warn] [engine] failed to flush chunk '12-1701731836.447435482.flb', retry in 7 seconds: task_id=2, input=tail.2 > output=es.0 (out_id=0)
 [2023/12/04 23:17:23] [ warn] [engine] chunk '12-1701731809.214007741.flb' cannot be retried: task_id=5, input=tail.2 > output=es.0
 [2023/12/04 23:17:24] [ warn] [engine] failed to flush chunk '12-1701731839.217955696.flb', retry in 7 seconds: task_id=3, input=tail.2 > output=es.0 (out_id=0)
 [2023/12/04 23:17:25] [ warn] [engine] chunk '12-1701731814.220947533.flb' cannot be retried: task_id=0, input=tail.2 > output=es.0

The errors above fall into two groups: 1. the connection between fluent-bit and ES is broken (this one may have been caused by me restarting fluent-bit, finding the problem still unresolved, and then also restarting the ES service); 2. the "failed to flush chunk" errors, which only appeared after I changed the buffer size (Buffer_Size). Even more maddening, some fluent-bit pods still report the "cannot increase buffer" error, as shown in the screenshot below:

I then checked whether the platform's ES was healthy, and it looked normal. Faced with these errors I honestly have no idea where the root cause lies; I have gone through all kinds of documentation without finding a solution. Since this is a production project the impact is serious, so I sincerely ask the community for help. Many thanks.
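For reference, a minimal way to double-check ES health from a workstation with kubectl access (a sketch; it assumes the default in-cluster ES with no authentication or TLS, reachable through the service named in the fluent-bit errors):

kubectl -n kubesphere-logging-system port-forward svc/elasticsearch-logging-data 9200:9200
# in a second shell:
curl -s 'http://127.0.0.1:9200/_cluster/health?pretty'    # expect status green or yellow, not red
curl -s 'http://127.0.0.1:9200/_cat/nodes?v'              # all data/master nodes should be listed
curl -s 'http://127.0.0.1:9200/_cat/indices?v' | tail     # the log indices written by fluent-bit should exist and keep growing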

The ES status is not normal.

kubectl edit output -n kubesphere-logging-system es

spec:
  es:
    traceError: true

Add this and it will print detailed error information.

    wanjunlei Added. I then clicked into es-data2 and saw the following errors:

    [2023-12-04T23:18:57,282][WARN ][o.e.i.r.PeerRecoveryTargetService] [elasticsearch-logging-data-1] error while reading global checkpoint from translog, resetting the starting sequence number from -2 to unassigned and recovering as if there are none
    org.elasticsearch.index.translog.TranslogCorruptedException: translog from source [/usr/share/elasticsearch/data/nodes/0/indices/e2vtPlx3Tx6jJpVhfiEnQg/0/translog/translog-1.tlog] is corrupted, translog header truncated
    	at org.elasticsearch.index.translog.TranslogHeader.read(TranslogHeader.java:171) ~[elasticsearch-6.8.22.jar:6.8.22]
    	at org.elasticsearch.index.translog.Translog.readCheckpoint(Translog.java:1871) ~[elasticsearch-6.8.22.jar:6.8.22]
    	at org.elasticsearch.index.translog.Translog.readGlobalCheckpoint(Translog.java:1862) ~[elasticsearch-6.8.22.jar:6.8.22]
    	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.getStartingSeqNo(PeerRecoveryTargetService.java:393) ~[elasticsearch-6.8.22.jar:6.8.22]
    	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.getStartRecoveryRequest(PeerRecoveryTargetService.java:354) [elasticsearch-6.8.22.jar:6.8.22]
    	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:184) [elasticsearch-6.8.22.jar:6.8.22]
    	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.access$500(PeerRecoveryTargetService.java:84) [elasticsearch-6.8.22.jar:6.8.22]
    	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:656) [elasticsearch-6.8.22.jar:6.8.22]
    	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:778) [elasticsearch-6.8.22.jar:6.8.22]
    	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.8.22.jar:6.8.22]
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
    	at java.lang.Thread.run(Thread.java:832) [?:?]
    Caused by: java.io.EOFException
    	at org.elasticsearch.common.io.stream.InputStreamStreamInput.readByte(InputStreamStreamInput.java:59) ~[elasticsearch-6.8.22.jar:6.8.22]
    	at org.elasticsearch.index.translog.BufferedChecksumStreamInput.readByte(BufferedChecksumStreamInput.java:62) ~[elasticsearch-6.8.22.jar:6.8.22]
    	at org.elasticsearch.index.translog.BufferedChecksumStreamInput.read(BufferedChecksumStreamInput.java:81) ~[elasticsearch-6.8.22.jar:6.8.22]
    	at org.apache.lucene.store.InputStreamDataInput.readByte(InputStreamDataInput.java:34) ~[lucene-core-7.7.3.jar:7.7.3 1a0d2a901dfec93676b0fe8be425101ceb754b85 - noble - 2020-04-21 10:31:55]
    	at org.apache.lucene.store.DataInput.readInt(DataInput.java:101) ~[lucene-core-7.7.3.jar:7.7.3 1a0d2a901dfec93676b0fe8be425101ceb754b85 - noble - 2020-04-21 10:31:55]
    	at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:194) ~[lucene-core-7.7.3.jar:7.7.3 1a0d2a901dfec93676b0fe8be425101ceb754b85 - noble - 2020-04-21 10:31:55]
    	at org.elasticsearch.index.translog.TranslogHeader.readHeaderVersion(TranslogHeader.java:112) ~[elasticsearch-6.8.22.jar:6.8.22]
    	at org.elasticsearch.index.translog.TranslogHeader.read(TranslogHeader.java:133) ~[elasticsearch-6.8.22.jar:6.8.22]
    	... 12 more
    [2023-12-04T23:19:00,899][WARN ][o.e.i.r.PeerRecoveryTargetService] [elasticsearch-logging-data-1] error while reading global checkpoint from translog, resetting the starting sequence number from -2 to unassigned and recovering as if there are none
    org.elasticsearch.index.translog.TranslogCorruptedException: translog from source [/usr/share/elasticsearch/data/nodes/0/indices/e2vtPlx3Tx6jJpVhfiEnQg/4/translog] is corrupted
    	at org.elasticsearch.index.translog.Translog.readCheckpoint(Translog.java:1875) ~[elasticsearch-6.8.22.jar:6.8.22]
    	at org.elasticsearch.index.translog.Translog.readGlobalCheckpoint(Translog.java:1862) ~[elasticsearch-6.8.22.jar:6.8.22]
    	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.getStartingSeqNo(PeerRecoveryTargetService.java:393) ~[elasticsearch-6.8.22.jar:6.8.22]
    	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.getStartRecoveryRequest(PeerRecoveryTargetService.java:354) [elasticsearch-6.8.22.jar:6.8.22]
    	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:184) [elasticsearch-6.8.22.jar:6.8.22]
    	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.access$500(PeerRecoveryTargetService.java:84) [elasticsearch-6.8.22.jar:6.8.22]
    	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:656) [elasticsearch-6.8.22.jar:6.8.22]
    	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:778) [elasticsearch-6.8.22.jar:6.8.22]
    	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.8.22.jar:6.8.22]
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
    	at java.lang.Thread.run(Thread.java:832) [?:?]
    Caused by: java.lang.NullPointerException: Cannot invoke "java.lang.CharSequence.length()" because "text" is null
    	at org.apache.lucene.util.BytesRef.<init>(BytesRef.java:84) ~[lucene-core-7.7.3.jar:7.7.3 1a0d2a901dfec93676b0fe8be425101ceb754b85 - noble - 2020-04-21 10:31:55]
    	at org.elasticsearch.index.translog.TranslogHeader.read(TranslogHeader.java:145) ~[elasticsearch-6.8.22.jar:6.8.22]
    	at org.elasticsearch.index.translog.Translog.readCheckpoint(Translog.java:1871) ~[elasticsearch-6.8.22.jar:6.8.22]
    	... 11 more
    [2023-12-04T23:19:01,682][WARN ][o.e.t.TransportService   ] [elasticsearch-logging-data-1] Received response for a request that has timed out, sent [60733ms] ago, timed out [30904ms] ago, action [internal:discovery/zen/fd/master_ping], node [{elasticsearch-logging-discovery-1}{HmIFepAoSMq_-zcKUU8qKA}{t9XUAlMKSZeDzCEZh_zHhA}{172.18.176.171}{172.18.176.171:9300}], id [27]

    From the error messages: there was an error while reading the global checkpoint from the translog, so the starting sequence number was reset from -2 to unassigned and recovery proceeded as if there were none. The translog is corrupted.
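    A sketch of how the affected index can be identified from the shard path in that stack trace: the directory name e2vtPlx3Tx6jJpVhfiEnQg is the index UUID, which the _cat APIs can map back to an index name (reusing a port-forward to the ES service as shown earlier):

    curl -s 'http://127.0.0.1:9200/_cat/indices?v' | grep e2vtPlx3Tx6jJpVhfiEnQg    # uuid -> index name
    curl -s 'http://127.0.0.1:9200/_cat/shards?v' | grep UNASSIGNED                 # shards stuck because of the corruption
    curl -s 'http://127.0.0.1:9200/_cat/recovery?v&active_only=true'                # recoveries still in progress

    If the index turns out to be an old daily log index, deleting it (or, as a last resort, truncating the corrupted translog with the elasticsearch-shard tool that ships with ES 6.8) may be acceptable; both options discard data, so treat this as a hint rather than a recommendation.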

    lihai2099 Not yet. What's above was the ES-side problem: it kept getting OOM-killed and restarting, and adjusting the resource limits was enough to fix that (the limit change is sketched after the logs below). But log collection itself is still not solved: there are no more errors, just a constant stream of warnings about failing to flush chunks and retrying after some seconds, as shown below:

    [2023/12/06 20:49:24] [ warn] [engine] failed to flush chunk '13-1701895764.413633850.flb', retry in 10 seconds: task_id=0, input=tail.2 > output=es.0 (out_id=0)
    [2023/12/06 20:49:24] [ warn] [engine] failed to flush chunk '13-1701895764.413004425.flb', retry in 11 seconds: task_id=1, input=tail.7 > output=es.0 (out_id=0)
    [2023/12/06 20:49:34] [ info] [engine] flush chunk '13-1701895764.413633850.flb' succeeded at retry 1: task_id=0, input=tail.2 > output=es.0 (out_id=0)
    [2023/12/06 20:49:35] [ info] [engine] flush chunk '13-1701895764.413004425.flb' succeeded at retry 1: task_id=1, input=tail.7 > output=es.0 (out_id=0)
    [2023/12/06 21:10:24] [ warn] [engine] failed to flush chunk '13-1701897023.542976235.flb', retry in 10 seconds: task_id=1, input=tail.7 > output=es.0 (out_id=0)
    [2023/12/06 21:10:24] [ warn] [engine] failed to flush chunk '13-1701897023.543427735.flb', retry in 8 seconds: task_id=0, input=tail.2 > output=es.3 (out_id=3)
    [2023/12/06 21:10:25] [ warn] [engine] failed to flush chunk '13-1701897024.457758706.flb', retry in 10 seconds: task_id=2, input=tail.2 > output=es.0 (out_id=0)
    [2023/12/06 21:10:25] [ warn] [engine] failed to flush chunk '13-1701897024.457323650.flb', retry in 8 seconds: task_id=3, input=tail.7 > output=es.0 (out_id=0)
    [2023/12/06 21:10:32] [ info] [engine] flush chunk '13-1701897023.543427735.flb' succeeded at retry 1: task_id=0, input=tail.2 > output=es.3 (out_id=3)
    [2023/12/06 21:10:33] [ info] [engine] flush chunk '13-1701897024.457323650.flb' succeeded at retry 1: task_id=3, input=tail.7 > output=es.0 (out_id=0)
    [2023/12/06 21:10:34] [ info] [engine] flush chunk '13-1701897023.542976235.flb' succeeded at retry 1: task_id=1, input=tail.7 > output=es.0 (out_id=0)
    [2023/12/06 21:10:35] [ info] [engine] flush chunk '13-1701897024.457758706.flb' succeeded at retry 1: task_id=2, input=tail.2 > output=es.0 (out_id=0)
    [2023/12/07 00:22:25] [ warn] [engine] failed to flush chunk '13-1701908544.784383481.flb', retry in 11 seconds: task_id=1, input=tail.7 > output=es.0 (out_id=0)
    [2023/12/07 00:22:25] [ warn] [engine] failed to flush chunk '13-1701908544.784911836.flb', retry in 9 seconds: task_id=0, input=tail.2 > output=es.3 (out_id=3)
    [2023/12/07 00:22:34] [ info] [engine] flush chunk '13-1701908544.784911836.flb' succeeded at retry 1: task_id=0, input=tail.2 > output=es.3 (out_id=3)
    [2023/12/07 00:22:36] [ info] [engine] flush chunk '13-1701908544.784383481.flb' succeeded at retry 1: task_id=1, input=tail.7 > output=es.0 (out_id=0)
    [2023/12/07 01:04:25] [ warn] [engine] failed to flush chunk '13-1701911064.805752355.flb', retry in 9 seconds: task_id=1, input=tail.7 > output=es.3 (out_id=3)
    [2023/12/07 01:04:25] [ warn] [engine] failed to flush chunk '13-1701911064.806307406.flb', retry in 7 seconds: task_id=0, input=tail.2 > output=es.0 (out_id=0)
    [2023/12/07 01:04:32] [ info] [engine] flush chunk '13-1701911064.806307406.flb' succeeded at retry 1: task_id=0, input=tail.2 > output=es.0 (out_id=0)
    [2023/12/07 01:04:34] [ info] [engine] flush chunk '13-1701911064.805752355.flb' succeeded at retry 1: task_id=1, input=tail.7 > output=es.3 (out_id=3)
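    For completeness, the OOM fix mentioned above (raising the ES data node resource limits) would look roughly like this. This is a sketch: the container name, heap flag and sizes are assumptions that need to be matched to the actual StatefulSet, and in a ks-installer-managed cluster the change may need to go through the installer's configuration instead so it is not reverted:

    kubectl -n kubesphere-logging-system edit statefulset elasticsearch-logging-data
    # under spec.template.spec.containers (data node container, assumed name "elasticsearch"):
    #   resources:
    #     requests: { memory: 2Gi }
    #     limits:   { memory: 4Gi }
    #   env:
    #     - name: ES_JAVA_OPTS
    #       value: "-Xms2g -Xmx2g"   # keep the JVM heap at roughly half the memory limit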

    Judging by the log level these are only warnings, so they shouldn't affect normal use, right? Is the current problem that logs cannot be written, or that they cannot be viewed?

      1. fluent-bit writes to the backend ES, right? Make sure ES itself is healthy; you can manually call the API and do a test write (see the sketch after this list). Also check whether the ES logs show any errors.

      2. Have the fluent-bit or ES versions ever been changed? Make sure the versions are compatible.
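      A sketch of such a manual write test against the same _bulk endpoint fluent-bit uses (it assumes the port-forward shown earlier; the index name fluentbit-write-test is made up for the test):

      curl -s -H 'Content-Type: application/x-ndjson' -XPOST 'http://127.0.0.1:9200/_bulk' \
        --data-binary $'{"index":{"_index":"fluentbit-write-test","_type":"_doc"}}\n{"message":"manual write test"}\n'
      # a response containing "errors":false means ES accepts writes; clean up afterwards:
      curl -s -XDELETE 'http://127.0.0.1:9200/fluentbit-write-test'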

    lihai2099 Logs can be viewed now. Although the fluent-bit log keeps showing the retries above, the logs are viewable from the console.