Deploying KubeSphere v3.0.0 with Sealos + Longhorn
willqy:
Can you post the minio logs? It looks like binding the PV may have failed.
tdcare:
The problem is Longhorn storage. With storage installed as described above, KubeSphere automatically creates the PV during installation, but the PV cannot be attached to the host, and the following error is reported:
Events:
Type Reason Age From Message
Warning FailedMount 41m (x414 over 3d6h) kubelet, ecs-ebb9-0001.novalocal Unable to attach or mount volumes: unmounted volumes=[export], unattached volumes=[minio-config-dir ks-minio-token-mw554 export]: timed out waiting for the condition
Warning FailedMount 7m26s (x406 over 3d6h) kubelet, ecs-ebb9-0001.novalocal Unable to attach or mount volumes: unmounted volumes=[export], unattached volumes=[ks-minio-token-mw554 export minio-config-dir]: timed out waiting for the condition
Warning FailedAttachVolume 94s (x2312 over 3d6h) attachdetach-controller AttachVolume.Attach failed for volume "pvc-08ff037b-339b-41f4-b22f-b5108b438507" : rpc error: code = Aborted desc = The volume pvc-08ff037b-339b-41f4-b22f-b5108b438507 is attaching
Warning FailedMount <invalid> (x1247 over 3d6h) kubelet, ecs-ebb9-0001.novalocal Unable to attach or mount volumes: unmounted volumes=[export], unattached volumes=[export minio-config-dir ks-minio-token-mw554]: timed out waiting for the condition
tdcare:
[root@ecs-ebb9-0004 ~]# kubectl -n kubesphere-system describe pv minio
Error from server (NotFound): persistentvolumes "minio" not found
[root@ecs-ebb9-0004 ~]# kubectl -n kubesphere-system describe pvc minio
Name: minio
Namespace: kubesphere-system
StorageClass: longhorn
Status: Bound
Volume: pvc-08ff037b-339b-41f4-b22f-b5108b438507
Labels: app=minio
app.kubernetes.io/managed-by=Helm
chart=minio-2.5.16
heritage=Helm
release=ks-minio
Annotations: meta.helm.sh/release-name: ks-minio
meta.helm.sh/release-namespace: kubesphere-system
pv.kubernetes.io/bind-completed: yes
pv.kubernetes.io/bound-by-controller: yes
volume.beta.kubernetes.io/storage-provisioner: driver.longhorn.io
Finalizers: [kubernetes.io/pvc-protection]
Capacity: 20Gi
Access Modes: RWO
VolumeMode: Filesystem
Mounted By: minio-7bfdb5968b-kcfzd
Events: <none>
tdcare:
[root@ecs-ebb9-0004 ~]# kubectl -n kubesphere-system describe pod minio-7bfdb5968b-kcfzd
Name: minio-7bfdb5968b-kcfzd
Namespace: kubesphere-system
Priority: 0
Node: ecs-ebb9-0001.novalocal/192.168.0.231
Start Time: Mon, 05 Oct 2020 16:30:18 +0800
Labels: app=minio
pod-template-hash=7bfdb5968b
release=ks-minio
Annotations: checksum/config: c6cc7f4b40064dffd59b339e133fa4819f787573ee18e1d001435aa4daff8ba2
checksum/secrets: f9625c177e0e74a3b9997c3c65189ebffcfbde7aaa910de0ba38b48b032c1a96
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/minio-7bfdb5968b
Containers:
minio:
Container ID:
Image: minio/minio:RELEASE.2019-08-07T01-59-21Z
Image ID:
Port: 9000/TCP
Host Port: 0/TCP
Command:
/bin/sh
-ce
/usr/bin/docker-entrypoint.sh minio -C /root/.minio/ server /data
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Requests:
cpu: 250m
memory: 256Mi
Liveness: http-get http://:service/minio/health/live delay=5s timeout=1s period=30s #success=1 #failure=3
Readiness: http-get http://:service/minio/health/ready delay=5s timeout=1s period=15s #success=1 #failure=3
Environment:
MINIO_ACCESS_KEY: <set to the key 'accesskey' in secret 'minio'> Optional: false
MINIO_SECRET_KEY: <set to the key 'secretkey' in secret 'minio'> Optional: false
MINIO_BROWSER: on
Mounts:
/data from export (rw)
/root/.minio/ from minio-config-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from ks-minio-token-mw554 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
export:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: minio
ReadOnly: false
minio-user:
Type: Secret (a volume populated by a Secret)
SecretName: minio
Optional: false
minio-config-dir:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
ks-minio-token-mw554:
Type: Secret (a volume populated by a Secret)
SecretName: ks-minio-token-mw554
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 41m (x414 over 3d6h) kubelet, ecs-ebb9-0001.novalocal Unable to attach or mount volumes: unmounted volumes=[export], unattached volumes=[minio-config-dir ks-minio-token-mw554 export]: timed out waiting for the condition
Warning FailedMount 7m26s (x406 over 3d6h) kubelet, ecs-ebb9-0001.novalocal Unable to attach or mount volumes: unmounted volumes=[export], unattached volumes=[ks-minio-token-mw554 export minio-config-dir]: timed out waiting for the condition
Warning FailedAttachVolume 94s (x2312 over 3d6h) attachdetach-controller AttachVolume.Attach failed for volume "pvc-08ff037b-339b-41f4-b22f-b5108b438507" : rpc error: code = Aborted desc = The volume pvc-08ff037b-339b-41f4-b22f-b5108b438507 is attaching
Warning FailedMount <invalid> (x1247 over 3d6h) kubelet, ecs-ebb9-0001.novalocal Unable to attach or mount volumes: unmounted volumes=[export], unattached volumes=[export minio-config-dir ks-minio-token-mw554]: timed out waiting for the condition
willqy:
Two things to check: whether the Longhorn dependencies are installed, and, if the pod is scheduled onto a master node, whether the taint has been removed so that the Longhorn CSI plugin can also be scheduled there.
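A minimal sketch of that taint removal (the node name is hypothetical; substitute your own master):

```shell
# Hypothetical master node name: adjust to your cluster.
# The trailing '-' removes the taint, letting Longhorn's DaemonSet
# pods (manager, CSI plugin) schedule onto the master as well.
kubectl taint nodes k8s-master1 node-role.kubernetes.io/master:NoSchedule-

# Then confirm the Longhorn components are running on every node:
kubectl -n longhorn-system get pods -o wide
```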
sunshuyan:
Warning FailedMount 5m1s (x2 over 11m) kubelet, hcjt-itc-dl-v100-10 Unable to attach or mount volumes: unmounted volumes=[redis-pvc], unattached volumes=[default-token-lrpd8 redis-pvc]: timed out waiting for the condition
Warning FailedAttachVolume 97s (x16 over 16m) attachdetach-controller AttachVolume.Attach failed for volume "pvc-7a21e6c8-9eae-4866-8202-09a1aee0406b" : rpc error: code = NotFound desc = ControllerPublishVolume: the volume pvc-7a21e6c8-9eae-4866-8202-09a1aee0406b not exists
Warning FailedMount 26s (x5 over 14m) kubelet, hcjt-itc-dl-v100-10 Unable to attach or mount volumes: unmounted volumes=[redis-pvc], unattached volumes=[redis-pvc default-token-lrpd8]: timed out waiting for the condition
willqy:
Problem log: this covers Prometheus PV expansion, PV data reset, and PV filesystem repair. Longhorn still does not feel reliable; if you have spare disks, use Rook instead, or at least do not back Longhorn with the local filesystem, give it dedicated disks.
Prometheus failure
A recent power outage restarted the cluster nodes, leaving one of the two Prometheus pods dead and the other wounded; the KubeSphere UI could no longer display monitoring data.
Checking pod status, one of them is Running, so monitoring should in theory still work:
[root@jenkins ~]# kubectl -n kubesphere-monitoring-system get pods
NAME READY STATUS RESTARTS AGE
......
prometheus-k8s-0 3/3 Running 36 33d
prometheus-k8s-1 2/3 CrashLoopBackOff 41 9d
prometheus-operator-84d58bf775-g7hv8 2/2 Running 0 9d
Looking at the logs of the CrashLoopBackOff pod, the error suggests a corrupted file:
[root@jenkins ~]# kubectl -n kubesphere-monitoring-system logs -f prometheus-k8s-1 -c prometheus
......
level=info ts=2020-11-26T01:05:21.880Z caller=main.go:583 msg="Scrape manager stopped"
level=error ts=2020-11-26T01:05:21.880Z caller=main.go:764 err="opening storage failed: block dir: \"/prometheus/01EQ2ZQCKZEX21JP81GX10BPNK\": invalid character '\\x00' looking for beginning of value"
Restarting the pod, the PV would no longer mount. Longhorn takes the blame here: any unexpected restart can leave a PV unable to mount:
[root@jenkins ~]# kubectl -n kubesphere-monitoring-system describe pods prometheus-k8s-1
......
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 7m4s default-scheduler Successfully assigned kubesphere-monitoring-system/prometheus-k8s-1 to k8s-master1
Warning FailedMount 6m56s kubelet MountVolume.SetUp failed for volume "pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6 but could not correct them: fsck from util-linux 2.31.1
/dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6 contains a file system with errors, check forced.
/dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: Inode 262155 extent tree (at level 1) could be shorter. IGNORED.
/dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: Inode 262171 extent tree (at level 1) could be shorter. IGNORED.
/dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: Inode 262179 extent tree (at level 1) could be shorter. IGNORED.
/dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: Inode 262184 extent tree (at level 1) could be shorter. IGNORED.
/dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: Inode 262188 extent tree (at level 1) could be shorter. IGNORED.
/dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: Inode 262198 extent tree (at level 1) could be shorter. IGNORED.
/dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: Inode 262208 extent tree (at level 1) could be shorter. IGNORED.
/dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: Inode 262216 has an invalid extent node (blk 1081353, lblk 0)
/dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)
.
Warning FailedMount 5m2s kubelet Unable to attach or mount volumes: unmounted volumes=[prometheus-k8s-db], unattached volumes=[config-out tls-assets prometheus-k8s-db prometheus-k8s-rulefiles-0 prometheus-k8s-token-n2nws config]: timed out waiting for the condition
Warning FailedMount 2m47s kubelet Unable to attach or mount volumes: unmounted volumes=[prometheus-k8s-db], unattached volumes=[prometheus-k8s-db prometheus-k8s-rulefiles-0 prometheus-k8s-token-n2nws config config-out tls-assets]: timed out waiting for the condition
Warning FailedMount 35s (x10 over 6m54s) kubelet MountVolume.SetUp failed for volume "pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6 but could not correct them: fsck from util-linux 2.31.1
The cause was not clear yet, so I checked the first (Running) pod next: a pile of "no space left" errors:
[root@jenkins ~]# kubectl -n kubesphere-monitoring-system logs -f prometheus-k8s-0 -c prometheus
level=warn ts=2020-11-26T00:00:23.231Z caller=manager.go:595 component="rule manager" group=node.rules msg="Rule sample appending failed" err="write to WAL: log samples: write /prometheus/wal/00000660: no space left on device"
level=warn ts=2020-11-26T00:00:23.231Z caller=manager.go:595 component="rule manager" group=node.rules msg="Rule sample appending failed" err="write to WAL: log samples: write /prometheus/wal/00000660: no space left on device"
Clearly the disk was full. Checking the filesystem inside the pod confirmed /prometheus at 100% usage. Presumably, with the second pod broken, all data went to the first pod and filled it up.
[root@jenkins ~]# kubectl -n kubesphere-monitoring-system exec -it prometheus-k8s-0 -c prometheus -- df -h | grep prometheus
19.6G 19.5G 0 100% /prometheus
457.1G 89.3G 367.8G 20% /etc/prometheus/config_out
tmpfs 7.8G 0 7.8G 0% /etc/prometheus/certs
457.1G 89.3G 367.8G 20% /etc/prometheus/rules/prometheus-k8s-rulefiles-0
So what now? Searching the forum for "prometheus" turned up a related thread: the metric retention period can be shortened from the default 7d to 1d:
https://kubesphere.com.cn/forum/d/657-prometheus
# kubectl edit prometheuses -n kubesphere-monitoring-system
retention: 1d
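The same edit can be applied non-interactively; this assumes the Prometheus custom resource is named k8s, as in a default kube-prometheus style install (verify with `kubectl get prometheuses -A`):

```shell
# Shorten metric retention from the default 7d to 1d
# (resource name "k8s" is an assumption: check it first)
kubectl -n kubesphere-monitoring-system patch prometheus k8s \
  --type merge -p '{"spec":{"retention":"1d"}}'
```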
But restarting the pod after this change had no effect. Cleanup is probably periodic, so it could take a day. To restore monitoring sooner, I decided to expand the first pod's PV.
The underlying storage is Longhorn; its documentation covers volume expansion:
https://longhorn.io/docs/1.0.2/volumes-and-nodes/expansion/
Recent Kubernetes versions already support PV expansion, but it is not enabled by default: the StorageClass needs an extra field, allowVolumeExpansion: true.
[root@jenkins longhorn]# kubectl edit sc longhorn
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
...
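Equivalently, the field can be set with a one-line patch instead of an interactive edit (StorageClass name longhorn as above):

```shell
# allowVolumeExpansion is a top-level StorageClass field,
# so a simple patch is enough
kubectl patch storageclass longhorn -p '{"allowVolumeExpansion": true}'
```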
Next the volume has to be detached from the node, which is done from the Longhorn UI.
Once detached, edit the corresponding PVC and change the size:
[root@jenkins clone]# kubectl -n kubesphere-monitoring-system edit pvc prometheus-k8s-db-prometheus-k8s-0
......
spec:
accessModes:
- ReadWriteOnce
  resources:
    requests:
      storage: 31Gi
...
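For the record, the same resize can be done without opening an editor, a sketch using kubectl patch on the PVC shown above:

```shell
# Bump the requested size; the CSI driver picks up the change
# once the volume has been detached, per the Longhorn flow
kubectl -n kubesphere-monitoring-system patch pvc prometheus-k8s-db-prometheus-k8s-0 \
  -p '{"spec":{"resources":{"requests":{"storage":"31Gi"}}}}'
```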
After the edit, the new size takes effect on both the PV and the PVC automatically:
[root@jenkins clone]# kubectl -n kubesphere-monitoring-system get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
prometheus-k8s-db-prometheus-k8s-0 Bound pvc-748d5256-d046-4c04-a37e-0edbe454f2ca 31Gi RWO longhorn 36d
prometheus-k8s-db-prometheus-k8s-1 Bound pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6 20Gi RWO longhorn 36d
[root@jenkins clone]# kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-748d5256-d046-4c04-a37e-0edbe454f2ca 31Gi RWO Delete Bound kubesphere-monitoring-system/prometheus-k8s-db-prometheus-k8s-0 longhorn 36d
pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6 20Gi RWO Delete Bound kubesphere-monitoring-system/prometheus-k8s-db-prometheus-k8s-1 longhorn 36d
Re-attach the volume to the node in the Longhorn UI.
Inside the pod, however, the filesystem size had not updated and was still 20G. Following the final step in the Longhorn docs, I updated the filesystem manually, and it worked:
volume_name=pvc-748d5256-d046-4c04-a37e-0edbe454f2ca
# mount and unmount once so the kernel picks up the expanded device size,
# then mount again for the online resize
mount /dev/longhorn/$volume_name /data/pv
umount /dev/longhorn/$volume_name
mount /dev/longhorn/$volume_name /data/pv
[root@k8s-node1 ~]# resize2fs /dev/longhorn/$volume_name
resize2fs 1.42.9 (28-Dec-2013)
Filesystem at /dev/longhorn/pvc-748d5256-d046-4c04-a37e-0edbe454f2ca is mounted on /data/pv; on-line resizing required
old_desc_blocks = 3, new_desc_blocks = 4
The filesystem on /dev/longhorn/pvc-748d5256-d046-4c04-a37e-0edbe454f2ca is now 8126464 blocks long.
umount /dev/longhorn/$volume_name
Checking again, the filesystem has been expanded:
[root@jenkins longhorn]# kubectl -n kubesphere-monitoring-system exec -it prometheus-k8s-0 -c prometheus -- df -h | grep prometheus
30.4G 15.9G 14.5G 52% /prometheus
457.1G 88.1G 369.0G 19% /etc/prometheus/config_out
tmpfs 7.8G 0 7.8G 0% /etc/prometheus/certs
457.1G 88.1G 369.0G 19% /etc/prometheus/rules/prometheus-k8s-rulefiles-0
The first pod's PV was expanded successfully, the pod runs normally, and monitoring in the KubeSphere UI is back:
[root@jenkins clone]# kubectl -n kubesphere-monitoring-system get pods
NAME READY STATUS RESTARTS AGE
......
prometheus-k8s-0 3/3 Running 1 73m
prometheus-k8s-1 2/3 CrashLoopBackOff 3 2m2s
prometheus-operator-84d58bf775-8tgmr 2/2 Running 0 5h47m
To handle the last crashing pod, first find its PV name, pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6:
[root@jenkins longhorn]# kubectl get pv |grep prometheus-k8s-1
pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6 32Gi RWO Delete Bound kubesphere-monitoring-system/prometheus-k8s-db-prometheus-k8s-1 longhorn 37d
Go to the corresponding node and brute-force wipe the data:
[root@k8s-master1 ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 465.8G 0 disk
├─sda1 8:1 0 200M 0 part /boot/efi
├─sda2 8:2 0 1G 0 part /boot
├─sda3 8:3 0 406.8G 0 part /data
├─sda4 8:4 0 50G 0 part /
└─sda5 8:5 0 7.8G 0 part
sdb 8:16 0 32G 0 disk /var/lib/kubelet/pods/ed0732c4-a8ae-4d3f-b78f-a09767589acd/volume-subpaths/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6/prometheus/2
sdc 8:32 0 2G 0 disk /var/lib/kubelet/pods/dd454b29-a2d5-4f2f-8244-bf3dd4d21054/volumes/kubernetes.io~csi/pvc-4da824f3-b462-4e32-b3ae-5be3f811dea5/mount
[root@k8s-master1 ~]# cd /var/lib/kubelet/pods/ed0732c4-a8ae-4d3f-b78f-a09767589acd/volume-subpaths/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6/prometheus/2
[root@k8s-master1 2]# rm -rf *
After restarting the pod, it came up successfully:
[root@k8s-master1 ~]# kubectl -n kubesphere-monitoring-system get pods
NAME READY STATUS RESTARTS AGE
alertmanager-main-0 2/2 Running 2 10d
alertmanager-main-1 2/2 Running 4 36d
alertmanager-main-2 2/2 Running 4 34d
kube-state-metrics-95c974544-8fjd8 3/3 Running 3 34d
node-exporter-mdqvj 2/2 Running 4 36d
node-exporter-p8glr 2/2 Running 4 36d
node-exporter-s8ffl 2/2 Running 6 36d
node-exporter-vsjkp 2/2 Running 6 34d
notification-manager-deployment-7c8df68d94-bdm25 1/1 Running 1 34d
notification-manager-deployment-7c8df68d94-k6c2l 1/1 Running 2 36d
notification-manager-operator-6958786cd6-lqtkq 2/2 Running 8 36d
prometheus-k8s-0 3/3 Running 1 105m
prometheus-k8s-1 3/3 Running 1 3m8s
prometheus-operator-84d58bf775-8tgmr 2/2 Running 2 6h19m
If you hit a Prometheus error like "invalid magic number 0", the same brute-force cleanup can also recover the pod:
[root@jenkins longhorn]# kubectl -n kubesphere-monitoring-system logs -f prometheus-k8s-0 -c prometheus
level=error ts=2020-11-27T06:33:33.531Z caller=main.go:764 err="opening storage failed: /prometheus/chunks_head/000516: invalid magic number 0"
Jenkins problem
After a system upgrade and a reboot of the KubeSphere DevOps node, the Jenkins pod was down:
[root@jenkins argocd]# kubectl -n kubesphere-devops-system get pods
NAME READY STATUS RESTARTS AGE
ks-jenkins-54455f5db8-glhbs 0/1 Error 2 35d
s2ioperator-0 1/1 Running 2 35d
uc-jenkins-update-center-cd9464fff-r5txz 1/1 Running 2 11d
Checking the events: once again a Longhorn PV would not mount, apparently because of filesystem corruption:
[root@jenkins argocd]# kubectl -n kubesphere-devops-system describe pods ks-jenkins-54455f5db8-glhbs
......
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 51m (x17 over 3h27m) kubelet Unable to attach or mount volumes: unmounted volumes=[jenkins-home], unattached volumes=[plugin-dir secrets-dir ks-jenkins-token-zzhd5 casc-config jenkins-home jenkins-config]: timed out waiting for the condition
Warning FailedMount 46m (x20 over 3h52m) kubelet Unable to attach or mount volumes: unmounted volumes=[jenkins-home], unattached volumes=[jenkins-home jenkins-config plugin-dir secrets-dir ks-jenkins-token-zzhd5 casc-config]: timed out waiting for the condition
Warning FailedMount 42m (x15 over 3h57m) kubelet Unable to attach or mount volumes: unmounted volumes=[jenkins-home], unattached volumes=[ks-jenkins-token-zzhd5 casc-config jenkins-home jenkins-config plugin-dir secrets-dir]: timed out waiting for the condition
Warning FailedMount 26m (x13 over 3h55m) kubelet Unable to attach or mount volumes: unmounted volumes=[jenkins-home], unattached volumes=[jenkins-config plugin-dir secrets-dir ks-jenkins-token-zzhd5 casc-config jenkins-home]: timed out waiting for the condition
Warning FailedMount 7m1s (x108 over 3h51m) kubelet MountVolume.SetUp failed for volume "pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf but could not correct them: fsck from util-linux 2.31.1
/dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf contains a file system with errors, check forced.
/dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf: Inode 131224 has an invalid extent node (blk 557148, lblk 0)
/dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)
.
Warning FailedMount 97s (x14 over 3h46m) kubelet Unable to attach or mount volumes: unmounted volumes=[jenkins-home], unattached volumes=[casc-config jenkins-home jenkins-config plugin-dir secrets-dir ks-jenkins-token-zzhd5]: timed out waiting for the condition
The message says to run fsck manually. Doing so failed at first: the e2fsck version was too old. (Proceed with caution and make a backup; this can destroy the PV.)
[root@k8s-node1 ~]# fsck -cvf /dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf
fsck from util-linux 2.23.2
e2fsck 1.42.9 (28-Dec-2013)
/dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf has unsupported feature(s): metadata_csum
e2fsck: Get a newer version of e2fsck!
Download and build the latest e2fsprogs:
https://distfiles.macports.org/e2fsprogs/
wget https://distfiles.macports.org/e2fsprogs/e2fsprogs-1.45.6.tar.gz
tar -zxvf e2fsprogs-1.45.6.tar.gz
cd e2fsprogs-1.45.6
./configure
make && make install
Run the check again:
[root@k8s-node1 ~]# fsck.ext4 -y /dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf
e2fsck 1.45.6 (20-Mar-2020)
/dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inode 131224 has an invalid extent node (blk 557148, lblk 0)
Clear? yes
Inode 131224 extent tree (at level 1) could be shorter. Optimize? yes
Inode 131224, i_blocks is 992, should be 0. Fix? yes
Inode 131227 extent block passes checks, but checksum does not match extent
(logical block 16, physical block 107664, len 50)
Fix? yes
Inode 131227, i_blocks is 1080, should be 536. Fix? yes
Running additional passes to resolve blocks claimed by more than one inode...
Pass 1B: Rescanning for multiply-claimed blocks
Multiply-claimed block(s) in inode 131227: 112333
Multiply-claimed block(s) in inode 131343: 112333
Pass 1C: Scanning directories for inodes with multiply-claimed blocks
Pass 1D: Reconciling multiply-claimed blocks
(There are 2 inodes containing multiply-claimed blocks.)
File /support/all_2020-11-24_20.29.26.log (inode #131227, mod time Wed Nov 25 21:00:09 2020)
has 1 multiply-claimed block(s), shared with 1 file(s):
/jobs/demo-devops4t6ff/jobs/demo-pipeline/builds/7/workflow/4.xml (inode #131343, mod time Thu Nov 26 22:33:19 2020)
Clone multiply-claimed blocks? yes
File /jobs/demo-devops4t6ff/jobs/demo-pipeline/builds/7/workflow/4.xml (inode #131343, mod time Thu Nov 26 22:33:19 2020)
has 1 multiply-claimed block(s), shared with 1 file(s):
/support/all_2020-11-24_20.29.26.log (inode #131227, mod time Wed Nov 25 21:00:09 2020)
Multiply-claimed blocks already reassigned or cloned.
Pass 1E: Optimizing extent trees
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: -(87056--87071) -(87088--87167) -(87296--87354) -(87424--87482) -(106512--106618) -(106624--106629) -(107024--107135) +(107664--107713) -108092 +110755 +110769 +110811 +110815 +110846 +110862 +112328 +112342 +112369 +112377 +112382 -123851 -123856 -123897 -(123899--123901) -124722 -124756 -126409 -126435 -126459 -129237 -129246 -129267 -129271 -129284 -129293 -130369 -130377 -130419 -130432 -130471 -132717 -132720 -132774 -132778 -132796 -133198 -133210 -557145 -557148 -557168 -559108 -559165 -559194 -559204 -560608 -560611 -560625 -560628 -563206 -563229 -563259 -563272 -564614 -564645 -564657 -564677 -564743 -564749 -564758 -564763 -564788 -565628 -565918 -565952 +566488 +566493 +566951 +566964 -567308 -568137 -568140 -568144 -568146 -568172 -568198 -568721 -568765 -568779
Fix? yes
Free blocks count wrong for group #0 (22415, counted=22414).
Fix? yes
Free blocks count wrong for group #2 (292, counted=506).
Fix? yes
Free blocks count wrong for group #3 (17798, counted=17985).
Fix? yes
Free blocks count wrong for group #4 (32761, counted=32768).
Fix? yes
Free blocks count wrong for group #17 (31489, counted=31522).
Fix? yes
Free blocks count wrong (1900427, counted=1900858).
Fix? yes
/dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf: ***** FILE SYSTEM WAS MODIFIED *****
/dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf: 5862/524288 files (0.6% non-contiguous), 196294/2097152 blocks
After restarting the pod, everything is back to normal:
[root@jenkins longhorn]# kubectl -n kubesphere-devops-system get pods
NAME READY STATUS RESTARTS AGE
ks-jenkins-54455f5db8-w7rqw 1/1 Running 0 13m
s2ioperator-0 1/1 Running 2 35d
uc-jenkins-update-center-cd9464fff-r5txz 1/1 Running 2 11d
willqy:
Also, if you want to know why the PVC will not attach to the pod, look further into the Longhorn logs. Volume attach is performed by the CSI node component, so check the logs of longhorn-csi-plugin-xxx; the kubelet logs may contain useful information as well. As distributed storage, Longhorn's stability may still fall short; consider NFS or Ceph as alternatives.
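A sketch of pulling those logs (the label selector and container name follow Longhorn's defaults, so treat them as assumptions):

```shell
# CSI node plugin logs (one pod per node; pick the node that failed to mount)
kubectl -n longhorn-system logs -l app=longhorn-csi-plugin \
  -c longhorn-csi-plugin --tail=100

# kubelet logs on that node, filtered to volume events
journalctl -u kubelet --since "1 hour ago" | grep -i volume
```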
stoneshi-yunify:
There are so many Longhorn pods that I did not know which one to look at. Good to know; I will dig into it next time something breaks.