Deploying KubeSphere v3.0.0 with Sealos + Longhorn
willqy:
Can you post the minio logs? It looks like binding the PV may have failed.
tdcare:
The problem is Longhorn storage. With storage installed as described above, KubeSphere automatically creates the PV during installation, but the PV cannot be attached to the host, and the following error is reported:
Events:
Type Reason Age From Message
Warning FailedMount 41m (x414 over 3d6h) kubelet, ecs-ebb9-0001.novalocal Unable to attach or mount volumes: unmounted volumes=[export], unattached volumes=[minio-config-dir ks-minio-token-mw554 export]: timed out waiting for the condition
Warning FailedMount 7m26s (x406 over 3d6h) kubelet, ecs-ebb9-0001.novalocal Unable to attach or mount volumes: unmounted volumes=[export], unattached volumes=[ks-minio-token-mw554 export minio-config-dir]: timed out waiting for the condition
Warning FailedAttachVolume 94s (x2312 over 3d6h) attachdetach-controller AttachVolume.Attach failed for volume "pvc-08ff037b-339b-41f4-b22f-b5108b438507" : rpc error: code = Aborted desc = The volume pvc-08ff037b-339b-41f4-b22f-b5108b438507 is attaching
Warning FailedMount <invalid> (x1247 over 3d6h) kubelet, ecs-ebb9-0001.novalocal Unable to attach or mount volumes: unmounted volumes=[export], unattached volumes=[export minio-config-dir ks-minio-token-mw554]: timed out waiting for the condition
tdcare:
[root@ecs-ebb9-0004 ~]# kubectl -n kubesphere-system describe pv minio
Error from server (NotFound): persistentvolumes "minio" not found
[root@ecs-ebb9-0004 ~]# kubectl -n kubesphere-system describe pvc minio
Name: minio
Namespace: kubesphere-system
StorageClass: longhorn
Status: Bound
Volume: pvc-08ff037b-339b-41f4-b22f-b5108b438507
Labels: app=minio
app.kubernetes.io/managed-by=Helm
chart=minio-2.5.16
heritage=Helm
release=ks-minio
Annotations: meta.helm.sh/release-name: ks-minio
meta.helm.sh/release-namespace: kubesphere-system
pv.kubernetes.io/bind-completed: yes
pv.kubernetes.io/bound-by-controller: yes
volume.beta.kubernetes.io/storage-provisioner: driver.longhorn.io
Finalizers: [kubernetes.io/pvc-protection]
Capacity: 20Gi
Access Modes: RWO
VolumeMode: Filesystem
Mounted By: minio-7bfdb5968b-kcfzd
Events: <none>
tdcare:
[root@ecs-ebb9-0004 ~]# kubectl -n kubesphere-system describe pod minio-7bfdb5968b-kcfzd
Name: minio-7bfdb5968b-kcfzd
Namespace: kubesphere-system
Priority: 0
Node: ecs-ebb9-0001.novalocal/192.168.0.231
Start Time: Mon, 05 Oct 2020 16:30:18 +0800
Labels: app=minio
pod-template-hash=7bfdb5968b
release=ks-minio
Annotations: checksum/config: c6cc7f4b40064dffd59b339e133fa4819f787573ee18e1d001435aa4daff8ba2
checksum/secrets: f9625c177e0e74a3b9997c3c65189ebffcfbde7aaa910de0ba38b48b032c1a96
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/minio-7bfdb5968b
Containers:
minio:
Container ID:
Image: minio/minio:RELEASE.2019-08-07T01-59-21Z
Image ID:
Port: 9000/TCP
Host Port: 0/TCP
Command:
/bin/sh
-ce
/usr/bin/docker-entrypoint.sh minio -C /root/.minio/ server /data
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Requests:
cpu: 250m
memory: 256Mi
Liveness: http-get http://:service/minio/health/live delay=5s timeout=1s period=30s #success=1 #failure=3
Readiness: http-get http://:service/minio/health/ready delay=5s timeout=1s period=15s #success=1 #failure=3
Environment:
MINIO_ACCESS_KEY: <set to the key 'accesskey' in secret 'minio'> Optional: false
MINIO_SECRET_KEY: <set to the key 'secretkey' in secret 'minio'> Optional: false
MINIO_BROWSER: on
Mounts:
/data from export (rw)
/root/.minio/ from minio-config-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from ks-minio-token-mw554 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
export:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: minio
ReadOnly: false
minio-user:
Type: Secret (a volume populated by a Secret)
SecretName: minio
Optional: false
minio-config-dir:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
ks-minio-token-mw554:
Type: Secret (a volume populated by a Secret)
SecretName: ks-minio-token-mw554
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 41m (x414 over 3d6h) kubelet, ecs-ebb9-0001.novalocal Unable to attach or mount volumes: unmounted volumes=[export], unattached volumes=[minio-config-dir ks-minio-token-mw554 export]: timed out waiting for the condition
Warning FailedMount 7m26s (x406 over 3d6h) kubelet, ecs-ebb9-0001.novalocal Unable to attach or mount volumes: unmounted volumes=[export], unattached volumes=[ks-minio-token-mw554 export minio-config-dir]: timed out waiting for the condition
Warning FailedAttachVolume 94s (x2312 over 3d6h) attachdetach-controller AttachVolume.Attach failed for volume "pvc-08ff037b-339b-41f4-b22f-b5108b438507" : rpc error: code = Aborted desc = The volume pvc-08ff037b-339b-41f4-b22f-b5108b438507 is attaching
Warning FailedMount <invalid> (x1247 over 3d6h) kubelet, ecs-ebb9-0001.novalocal Unable to attach or mount volumes: unmounted volumes=[export], unattached volumes=[export minio-config-dir ks-minio-token-mw554]: timed out waiting for the condition
willqy:
Two things to check: whether the Longhorn dependencies are installed, and, if the pod is scheduled onto a master node, whether the taint has been removed so that the Longhorn CSI plugin can also be scheduled there.
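A minimal sketch of that taint removal (the node name is hypothetical; substitute your own master):

```shell
# Hypothetical master node name: adjust to your cluster.
# The trailing '-' removes the taint, letting Longhorn's DaemonSet
# pods (manager, CSI plugin) schedule onto the master as well.
kubectl taint nodes k8s-master1 node-role.kubernetes.io/master:NoSchedule-

# Then confirm the Longhorn components are running on every node:
kubectl -n longhorn-system get pods -o wide
```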
sunshuyan:
Warning FailedMount 5m1s (x2 over 11m) kubelet, hcjt-itc-dl-v100-10 Unable to attach or mount volumes: unmounted volumes=[redis-pvc], unattached volumes=[default-token-lrpd8 redis-pvc]: timed out waiting for the condition
Warning FailedAttachVolume 97s (x16 over 16m) attachdetach-controller AttachVolume.Attach failed for volume "pvc-7a21e6c8-9eae-4866-8202-09a1aee0406b" : rpc error: code = NotFound desc = ControllerPublishVolume: the volume pvc-7a21e6c8-9eae-4866-8202-09a1aee0406b not exists
Warning FailedMount 26s (x5 over 14m) kubelet, hcjt-itc-dl-v100-10 Unable to attach or mount volumes: unmounted volumes=[redis-pvc], unattached volumes=[redis-pvc default-token-lrpd8]: timed out waiting for the condition
willqy:
Problem log: this covers Prometheus PV expansion, PV data reset, and PV filesystem repair. Longhorn still does not feel reliable; if you have spare disks, use Rook instead, or at least do not back Longhorn with the local filesystem, give it dedicated disks.
Prometheus failure
A recent power outage restarted the cluster nodes, leaving one of the two Prometheus pods dead and the other wounded; the KubeSphere UI could no longer display monitoring data.
Checking pod status, one of them is Running, so monitoring should in theory still work:
[root@jenkins ~]# kubectl -n kubesphere-monitoring-system get pods
NAME READY STATUS RESTARTS AGE
......
prometheus-k8s-0 3/3 Running 36 33d
prometheus-k8s-1 2/3 CrashLoopBackOff 41 9d
prometheus-operator-84d58bf775-g7hv8 2/2 Running 0 9d
Looking at the logs of the CrashLoopBackOff pod, the error suggests a corrupted file:
[root@jenkins ~]# kubectl -n kubesphere-monitoring-system logs -f prometheus-k8s-1 -c prometheus
......
level=info ts=2020-11-26T01:05:21.880Z caller=main.go:583 msg="Scrape manager stopped"
level=error ts=2020-11-26T01:05:21.880Z caller=main.go:764 err="opening storage failed: block dir: \"/prometheus/01EQ2ZQCKZEX21JP81GX10BPNK\": invalid character '\\x00' looking for beginning of value"
Restarting the pod, the PV would no longer mount. Longhorn takes the blame here: any unexpected restart can leave a PV unable to mount:
[root@jenkins ~]# kubectl -n kubesphere-monitoring-system describe pods prometheus-k8s-1
......
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 7m4s default-scheduler Successfully assigned kubesphere-monitoring-system/prometheus-k8s-1 to k8s-master1
Warning FailedMount 6m56s kubelet MountVolume.SetUp failed for volume "pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6 but could not correct them: fsck from util-linux 2.31.1
/dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6 contains a file system with errors, check forced.
/dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: Inode 262155 extent tree (at level 1) could be shorter. IGNORED.
/dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: Inode 262171 extent tree (at level 1) could be shorter. IGNORED.
/dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: Inode 262179 extent tree (at level 1) could be shorter. IGNORED.
/dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: Inode 262184 extent tree (at level 1) could be shorter. IGNORED.
/dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: Inode 262188 extent tree (at level 1) could be shorter. IGNORED.
/dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: Inode 262198 extent tree (at level 1) could be shorter. IGNORED.
/dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: Inode 262208 extent tree (at level 1) could be shorter. IGNORED.
/dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: Inode 262216 has an invalid extent node (blk 1081353, lblk 0)
/dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)
.
Warning FailedMount 5m2s kubelet Unable to attach or mount volumes: unmounted volumes=[prometheus-k8s-db], unattached volumes=[config-out tls-assets prometheus-k8s-db prometheus-k8s-rulefiles-0 prometheus-k8s-token-n2nws config]: timed out waiting for the condition
Warning FailedMount 2m47s kubelet Unable to attach or mount volumes: unmounted volumes=[prometheus-k8s-db], unattached volumes=[prometheus-k8s-db prometheus-k8s-rulefiles-0 prometheus-k8s-token-n2nws config config-out tls-assets]: timed out waiting for the condition
Warning FailedMount 35s (x10 over 6m54s) kubelet MountVolume.SetUp failed for volume "pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6 but could not correct them: fsck from util-linux 2.31.1
The cause was not clear yet, so I checked the first (Running) pod next: a pile of "no space left" errors:
[root@jenkins ~]# kubectl -n kubesphere-monitoring-system logs -f prometheus-k8s-0 -c prometheus
level=warn ts=2020-11-26T00:00:23.231Z caller=manager.go:595 component="rule manager" group=node.rules msg="Rule sample appending failed" err="write to WAL: log samples: write /prometheus/wal/00000660: no space left on device"
level=warn ts=2020-11-26T00:00:23.231Z caller=manager.go:595 component="rule manager" group=node.rules msg="Rule sample appending failed" err="write to WAL: log samples: write /prometheus/wal/00000660: no space left on device"
Clearly the disk was full. Checking the filesystem inside the pod confirmed /prometheus at 100% usage. Presumably, with the second pod broken, all data went to the first pod and filled it up.
[root@jenkins ~]# kubectl -n kubesphere-monitoring-system exec -it prometheus-k8s-0 -c prometheus -- df -h | grep prometheus
19.6G 19.5G 0 100% /prometheus
457.1G 89.3G 367.8G 20% /etc/prometheus/config_out
tmpfs 7.8G 0 7.8G 0% /etc/prometheus/certs
457.1G 89.3G 367.8G 20% /etc/prometheus/rules/prometheus-k8s-rulefiles-0
So what now? Searching the forum for "prometheus" turned up a related thread: the metric retention period can be shortened from the default 7d to 1d:
https://kubesphere.com.cn/forum/d/657-prometheus
# kubectl edit prometheuses -n kubesphere-monitoring-system
retention: 1d
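The same edit can be applied non-interactively; this assumes the Prometheus custom resource is named k8s, as in a default kube-prometheus style install (verify with `kubectl get prometheuses -A`):

```shell
# Shorten metric retention from the default 7d to 1d
# (resource name "k8s" is an assumption: check it first)
kubectl -n kubesphere-monitoring-system patch prometheus k8s \
  --type merge -p '{"spec":{"retention":"1d"}}'
```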
But restarting the pod after this change had no effect. Cleanup is probably periodic, so it could take a day. To restore monitoring sooner, I decided to expand the first pod's PV.
The underlying storage is Longhorn; its documentation covers volume expansion:
https://longhorn.io/docs/1.0.2/volumes-and-nodes/expansion/
Recent Kubernetes versions already support PV expansion, but it is not enabled by default: the StorageClass needs an extra field, allowVolumeExpansion: true.
[root@jenkins longhorn]# kubectl edit sc longhorn
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
...
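Equivalently, the field can be set with a one-line patch instead of an interactive edit (StorageClass name longhorn as above):

```shell
# allowVolumeExpansion is a top-level StorageClass field,
# so a simple patch is enough
kubectl patch storageclass longhorn -p '{"allowVolumeExpansion": true}'
```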
Next the volume has to be detached from the node, which is done from the Longhorn UI.
Once detached, edit the corresponding PVC and change the size:
[root@jenkins clone]# kubectl -n kubesphere-monitoring-system edit pvc prometheus-k8s-db-prometheus-k8s-0
......
spec:
accessModes:
- ReadWriteOnce
  resources:
    requests:
      storage: 31Gi
...
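For the record, the same resize can be done without opening an editor, a sketch using kubectl patch on the PVC shown above:

```shell
# Bump the requested size; the CSI driver picks up the change
# once the volume has been detached, per the Longhorn flow
kubectl -n kubesphere-monitoring-system patch pvc prometheus-k8s-db-prometheus-k8s-0 \
  -p '{"spec":{"resources":{"requests":{"storage":"31Gi"}}}}'
```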
After the edit, the new size takes effect on both the PV and the PVC automatically:
[root@jenkins clone]# kubectl -n kubesphere-monitoring-system get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
prometheus-k8s-db-prometheus-k8s-0 Bound pvc-748d5256-d046-4c04-a37e-0edbe454f2ca 31Gi RWO longhorn 36d
prometheus-k8s-db-prometheus-k8s-1 Bound pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6 20Gi RWO longhorn 36d
[root@jenkins clone]# kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-748d5256-d046-4c04-a37e-0edbe454f2ca 31Gi RWO Delete Bound kubesphere-monitoring-system/prometheus-k8s-db-prometheus-k8s-0 longhorn 36d
pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6 20Gi RWO Delete Bound kubesphere-monitoring-system/prometheus-k8s-db-prometheus-k8s-1 longhorn 36d
Re-attach the volume to the node in the Longhorn UI.
Inside the pod, however, the filesystem size had not updated and was still 20G. Following the final step in the Longhorn docs, I updated the filesystem manually, and it worked:
volume_name=pvc-748d5256-d046-4c04-a37e-0edbe454f2ca
# mount and unmount once so the kernel picks up the expanded device size,
# then mount again for the online resize
mount /dev/longhorn/$volume_name /data/pv
umount /dev/longhorn/$volume_name
mount /dev/longhorn/$volume_name /data/pv
[root@k8s-node1 ~]# resize2fs /dev/longhorn/$volume_name
resize2fs 1.42.9 (28-Dec-2013)
Filesystem at /dev/longhorn/pvc-748d5256-d046-4c04-a37e-0edbe454f2ca is mounted on /data/pv; on-line resizing required
old_desc_blocks = 3, new_desc_blocks = 4
The filesystem on /dev/longhorn/pvc-748d5256-d046-4c04-a37e-0edbe454f2ca is now 8126464 blocks long.
umount /dev/longhorn/$volume_name
Checking again, the filesystem has been expanded:
[root@jenkins longhorn]# kubectl -n kubesphere-monitoring-system exec -it prometheus-k8s-0 -c prometheus -- df -h | grep prometheus
30.4G 15.9G 14.5G 52% /prometheus
457.1G 88.1G 369.0G 19% /etc/prometheus/config_out
tmpfs 7.8G 0 7.8G 0% /etc/prometheus/certs
457.1G 88.1G 369.0G 19% /etc/prometheus/rules/prometheus-k8s-rulefiles-0
The first pod's PV was expanded successfully, the pod runs normally, and monitoring in the KubeSphere UI is back:
[root@jenkins clone]# kubectl -n kubesphere-monitoring-system get pods
NAME READY STATUS RESTARTS AGE
......
prometheus-k8s-0 3/3 Running 1 73m
prometheus-k8s-1 2/3 CrashLoopBackOff 3 2m2s
prometheus-operator-84d58bf775-8tgmr 2/2 Running 0 5h47m
To handle the last crashing pod, first find its PV name, pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6:
[root@jenkins longhorn]# kubectl get pv |grep prometheus-k8s-1
pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6 32Gi RWO Delete Bound kubesphere-monitoring-system/prometheus-k8s-db-prometheus-k8s-1 longhorn 37d
Go to the corresponding node and brute-force wipe the data:
[root@k8s-master1 ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 465.8G 0 disk
├─sda1 8:1 0 200M 0 part /boot/efi
├─sda2 8:2 0 1G 0 part /boot
├─sda3 8:3 0 406.8G 0 part /data
├─sda4 8:4 0 50G 0 part /
└─sda5 8:5 0 7.8G 0 part
sdb 8:16 0 32G 0 disk /var/lib/kubelet/pods/ed0732c4-a8ae-4d3f-b78f-a09767589acd/volume-subpaths/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6/prometheus/2
sdc 8:32 0 2G 0 disk /var/lib/kubelet/pods/dd454b29-a2d5-4f2f-8244-bf3dd4d21054/volumes/kubernetes.io~csi/pvc-4da824f3-b462-4e32-b3ae-5be3f811dea5/mount
[root@k8s-master1 ~]# cd /var/lib/kubelet/pods/ed0732c4-a8ae-4d3f-b78f-a09767589acd/volume-subpaths/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6/prometheus/2
[root@k8s-master1 2]# rm -rf *
After restarting the pod, it came up successfully:
[root@k8s-master1 ~]# kubectl -n kubesphere-monitoring-system get pods
NAME READY STATUS RESTARTS AGE
alertmanager-main-0 2/2 Running 2 10d
alertmanager-main-1 2/2 Running 4 36d
alertmanager-main-2 2/2 Running 4 34d
kube-state-metrics-95c974544-8fjd8 3/3 Running 3 34d
node-exporter-mdqvj 2/2 Running 4 36d
node-exporter-p8glr 2/2 Running 4 36d
node-exporter-s8ffl 2/2 Running 6 36d
node-exporter-vsjkp 2/2 Running 6 34d
notification-manager-deployment-7c8df68d94-bdm25 1/1 Running 1 34d
notification-manager-deployment-7c8df68d94-k6c2l 1/1 Running 2 36d
notification-manager-operator-6958786cd6-lqtkq 2/2 Running 8 36d
prometheus-k8s-0 3/3 Running 1 105m
prometheus-k8s-1 3/3 Running 1 3m8s
prometheus-operator-84d58bf775-8tgmr 2/2 Running 2 6h19m
If you hit a Prometheus error like "invalid magic number 0", the same brute-force cleanup can also recover the pod:
[root@jenkins longhorn]# kubectl -n kubesphere-monitoring-system logs -f prometheus-k8s-0 -c prometheus
level=error ts=2020-11-27T06:33:33.531Z caller=main.go:764 err="opening storage failed: /prometheus/chunks_head/000516: invalid magic number 0"
Jenkins problem
After a system upgrade and a reboot of the KubeSphere DevOps node, the Jenkins pod was down:
[root@jenkins argocd]# kubectl -n kubesphere-devops-system get pods
NAME READY STATUS RESTARTS AGE
ks-jenkins-54455f5db8-glhbs 0/1 Error 2 35d
s2ioperator-0 1/1 Running 2 35d
uc-jenkins-update-center-cd9464fff-r5txz 1/1 Running 2 11d
Checking the events: once again a Longhorn PV would not mount, apparently because of filesystem corruption:
[root@jenkins argocd]# kubectl -n kubesphere-devops-system describe pods ks-jenkins-54455f5db8-glhbs
......
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 51m (x17 over 3h27m) kubelet Unable to attach or mount volumes: unmounted volumes=[jenkins-home], unattached volumes=[plugin-dir secrets-dir ks-jenkins-token-zzhd5 casc-config jenkins-home jenkins-config]: timed out waiting for the condition
Warning FailedMount 46m (x20 over 3h52m) kubelet Unable to attach or mount volumes: unmounted volumes=[jenkins-home], unattached volumes=[jenkins-home jenkins-config plugin-dir secrets-dir ks-jenkins-token-zzhd5 casc-config]: timed out waiting for the condition
Warning FailedMount 42m (x15 over 3h57m) kubelet Unable to attach or mount volumes: unmounted volumes=[jenkins-home], unattached volumes=[ks-jenkins-token-zzhd5 casc-config jenkins-home jenkins-config plugin-dir secrets-dir]: timed out waiting for the condition
Warning FailedMount 26m (x13 over 3h55m) kubelet Unable to attach or mount volumes: unmounted volumes=[jenkins-home], unattached volumes=[jenkins-config plugin-dir secrets-dir ks-jenkins-token-zzhd5 casc-config jenkins-home]: timed out waiting for the condition
Warning FailedMount 7m1s (x108 over 3h51m) kubelet MountVolume.SetUp failed for volume "pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf but could not correct them: fsck from util-linux 2.31.1
/dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf contains a file system with errors, check forced.
/dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf: Inode 131224 has an invalid extent node (blk 557148, lblk 0)
/dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)
.
Warning FailedMount 97s (x14 over 3h46m) kubelet Unable to attach or mount volumes: unmounted volumes=[jenkins-home], unattached volumes=[casc-config jenkins-home jenkins-config plugin-dir secrets-dir ks-jenkins-token-zzhd5]: timed out waiting for the condition
The message says to run fsck manually. Doing so failed at first: the e2fsck version was too old. (Proceed with caution and make a backup; this can destroy the PV.)
[root@k8s-node1 ~]# fsck -cvf /dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf
fsck from util-linux 2.23.2
e2fsck 1.42.9 (28-Dec-2013)
/dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf has unsupported feature(s): metadata_csum
e2fsck: Get a newer version of e2fsck!
Download and build the latest e2fsprogs:
https://distfiles.macports.org/e2fsprogs/
wget https://distfiles.macports.org/e2fsprogs/e2fsprogs-1.45.6.tar.gz
tar -zxvf e2fsprogs-1.45.6.tar.gz
cd e2fsprogs-1.45.6
./configure
make && make install
Run the check again:
[root@k8s-node1 ~]# fsck.ext4 -y /dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf
e2fsck 1.45.6 (20-Mar-2020)
/dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inode 131224 has an invalid extent node (blk 557148, lblk 0)
Clear? yes
Inode 131224 extent tree (at level 1) could be shorter. Optimize? yes
Inode 131224, i_blocks is 992, should be 0. Fix? yes
Inode 131227 extent block passes checks, but checksum does not match extent
(logical block 16, physical block 107664, len 50)
Fix? yes
Inode 131227, i_blocks is 1080, should be 536. Fix? yes
Running additional passes to resolve blocks claimed by more than one inode...
Pass 1B: Rescanning for multiply-claimed blocks
Multiply-claimed block(s) in inode 131227: 112333
Multiply-claimed block(s) in inode 131343: 112333
Pass 1C: Scanning directories for inodes with multiply-claimed blocks
Pass 1D: Reconciling multiply-claimed blocks
(There are 2 inodes containing multiply-claimed blocks.)
File /support/all_2020-11-24_20.29.26.log (inode #131227, mod time Wed Nov 25 21:00:09 2020)
has 1 multiply-claimed block(s), shared with 1 file(s):
/jobs/demo-devops4t6ff/jobs/demo-pipeline/builds/7/workflow/4.xml (inode #131343, mod time Thu Nov 26 22:33:19 2020)
Clone multiply-claimed blocks? yes
File /jobs/demo-devops4t6ff/jobs/demo-pipeline/builds/7/workflow/4.xml (inode #131343, mod time Thu Nov 26 22:33:19 2020)
has 1 multiply-claimed block(s), shared with 1 file(s):
/support/all_2020-11-24_20.29.26.log (inode #131227, mod time Wed Nov 25 21:00:09 2020)
Multiply-claimed blocks already reassigned or cloned.
Pass 1E: Optimizing extent trees
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: -(87056--87071) -(87088--87167) -(87296--87354) -(87424--87482) -(106512--106618) -(106624--106629) -(107024--107135) +(107664--107713) -108092 +110755 +110769 +110811 +110815 +110846 +110862 +112328 +112342 +112369 +112377 +112382 -123851 -123856 -123897 -(123899--123901) -124722 -124756 -126409 -126435 -126459 -129237 -129246 -129267 -129271 -129284 -129293 -130369 -130377 -130419 -130432 -130471 -132717 -132720 -132774 -132778 -132796 -133198 -133210 -557145 -557148 -557168 -559108 -559165 -559194 -559204 -560608 -560611 -560625 -560628 -563206 -563229 -563259 -563272 -564614 -564645 -564657 -564677 -564743 -564749 -564758 -564763 -564788 -565628 -565918 -565952 +566488 +566493 +566951 +566964 -567308 -568137 -568140 -568144 -568146 -568172 -568198 -568721 -568765 -568779
Fix? yes
Free blocks count wrong for group #0 (22415, counted=22414).
Fix? yes
Free blocks count wrong for group #2 (292, counted=506).
Fix? yes
Free blocks count wrong for group #3 (17798, counted=17985).
Fix? yes
Free blocks count wrong for group #4 (32761, counted=32768).
Fix? yes
Free blocks count wrong for group #17 (31489, counted=31522).
Fix? yes
Free blocks count wrong (1900427, counted=1900858).
Fix? yes
/dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf: ***** FILE SYSTEM WAS MODIFIED *****
/dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf: 5862/524288 files (0.6% non-contiguous), 196294/2097152 blocks
After restarting the pod, everything is back to normal:
[root@jenkins longhorn]# kubectl -n kubesphere-devops-system get pods
NAME READY STATUS RESTARTS AGE
ks-jenkins-54455f5db8-w7rqw 1/1 Running 0 13m
s2ioperator-0 1/1 Running 2 35d
uc-jenkins-update-center-cd9464fff-r5txz 1/1 Running 2 11d
willqy:
Also, if you want to know why the PVC will not attach to the pod, look further into the Longhorn logs. Volume attach is performed by the CSI node component, so check the logs of longhorn-csi-plugin-xxx; the kubelet logs may contain useful information as well. As distributed storage, Longhorn's stability may still fall short; consider NFS or Ceph as alternatives.
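A sketch of pulling those logs (the label selector and container name follow Longhorn's defaults, so treat them as assumptions):

```shell
# CSI node plugin logs (one pod per node; pick the node that failed to mount)
kubectl -n longhorn-system logs -l app=longhorn-csi-plugin \
  -c longhorn-csi-plugin --tail=100

# kubelet logs on that node, filtered to volume events
journalctl -u kubelet --since "1 hour ago" | grep -i volume
```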
stoneshi-yunify:
There are so many Longhorn pods that I did not know which one to look at. Good to know; I will dig into it next time something breaks.