A beginner-friendly quick-deployment walkthrough of KubeSphere 3.0.0 🚀

About sealos

Official website: https://sealyun.com/

sealos is a highly available Kubernetes installation tool that can only be described as silky smooth: one command, offline installation with all dependencies bundled, in-kernel load balancing with no reliance on haproxy or keepalived, written in pure Golang, 99-year certificates, and support for Kubernetes v1.15, v1.16, v1.17, v1.18 and v1.19.

About longhorn

Website: https://www.rancher.cn/longhorn

A cloud-native distributed block storage solution for Kubernetes: easy to use | 100% open source | write once, run anywhere.

Features: highly available persistent storage for Kubernetes, simple incremental snapshots and backups, and cross-cluster disaster recovery.

Below we use sealos to quickly bring up a highly available Kubernetes cluster, deploy KubeSphere on top of it, and use longhorn as the underlying storage.

KubeSphere deployment options

There are roughly two ways to deploy KubeSphere:

  • Deploy a Kubernetes cluster together with KubeSphere
  • Deploy KubeSphere on an existing Kubernetes cluster

Deploying onto an existing cluster offers more flexibility, so the following demonstrates deploying a standalone Kubernetes cluster first and then installing KubeSphere on it, using Rancher's open-source cloud-native distributed storage longhorn as the underlying storage.

Deploy the Kubernetes cluster

Use the sealos tool to deploy the cluster. Prepare 4 nodes: 3 masters and 1 worker (or 1 master and 2 workers). Every node must have its hostname configured, and node clocks must be synchronized:

hostnamectl set-hostname xx
yum install -y chrony
systemctl enable --now chronyd
timedatectl set-timezone Asia/Shanghai
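
To confirm the clocks really are in sync before moving on, a minimal check (plain chrony/timedatectl, nothing sealos-specific):

chronyc sources -v
timedatectl status | grep -Ei "time zone|synchroni"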

On the first master node, download the deployment tool and the offline package:

# Go-based binary installer
wget -c https://sealyun.oss-cn-beijing.aliyuncs.com/latest/sealos && \
    chmod +x sealos && mv sealos /usr/bin

# Using Kubernetes v1.18.8 as an example; v1.19.x is not recommended because KubeSphere does not support it yet
wget -c https://sealyun.oss-cn-beijing.aliyuncs.com/cd3d5791b292325d38bbfaffd9855312-1.18.8/kube1.18.8.tar.gz

Run the following command to deploy the cluster; --passwd is the root password on all nodes:

sealos init --passwd 123456 \
  --master 10.39.140.248 \
  --master 10.39.140.249 \
  --master 10.39.140.250 \
  --node 10.39.140.251 \
  --pkg-url kube1.18.8.tar.gz \
  --version v1.18.8

Confirm the Kubernetes cluster is running properly:

# kubectl get nodes
NAME          STATUS   ROLES    AGE   VERSION
k8s-master1   Ready    master   13h   v1.18.8
k8s-master2   Ready    master   13h   v1.18.8
k8s-master3   Ready    master   13h   v1.18.8
k8s-node1     Ready    <none>   13h   v1.18.8

Deploy longhorn storage

longhorn recommends a dedicated disk for its storage; for this test we simply use the local directory /data/longhorn (the default is /var/lib/longhorn).
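
If you follow this layout, make sure the directory exists on every node before installing longhorn; a minimal sketch (the dedicated-disk lines are commented out and the device name is only an example):

# run on every node: create the longhorn data directory
mkdir -p /data/longhorn
# recommended instead: format and mount a dedicated disk
# mkfs.ext4 /dev/sdb && mount /dev/sdb /data/longhorn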

Note that several KubeSphere components request 20Gi PVs, so make sure the nodes have enough free space; otherwise a PV may bind successfully while no node satisfies its scheduling requirements. For a test-only environment, you can shrink the PV sizes in cluster-configuration.yaml ahead of time.
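
A quick way to locate those size fields before editing (treat the pattern as a rough guide; field names differ slightly between ks-installer versions):

grep -niE "volume?size" cluster-configuration.yaml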

Installing longhorn with 3 data replicas requires at least 3 schedulable nodes, so remove the master taints here so that pods can be scheduled onto them:

kubectl taint nodes --all node-role.kubernetes.io/master-  
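
To double-check that the taints are really gone, a small verification sketch (not part of the original steps; the TAINTS column should be empty or <none> on every node):

kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints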

Install helm on k8s-master1:

version=v3.3.1
curl -LO https://repo.huaweicloud.com/helm/${version}/helm-${version}-linux-amd64.tar.gz
tar -zxvf helm-${version}-linux-amd64.tar.gz
mv linux-amd64/helm /usr/local/bin/helm && rm -rf linux-amd64

Install the longhorn dependency on all nodes:

yum install -y iscsi-initiator-utils
systemctl enable --now iscsid
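
Optionally verify the iSCSI initiator is active on each node before installing the chart:

systemctl is-active iscsid
cat /etc/iscsi/initiatorname.iscsi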

Add the longhorn chart repository; if your network connection is poor, you can download the chart source from the longhorn GitHub releases instead:

helm repo add longhorn https://charts.longhorn.io
helm repo update

Deploy longhorn. Offline deployment is supported; the images need to be pushed to the longhornio project of your private registry in advance:

kubectl create namespace longhorn-system

helm install longhorn longhorn/longhorn \
  --namespace longhorn-system \
  --set defaultSettings.defaultDataPath="/data/longhorn/" \
  --set defaultSettings.defaultReplicaCount=3 \
  --set service.ui.type=NodePort \
  --set service.ui.nodePort=30890
  # for an offline install, also add: --set privateRegistry.registryUrl=10.39.140.196:8081
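
Before checking the pods one by one, you can optionally block until everything is Ready; a minimal sketch (the 10-minute timeout is arbitrary):

kubectl -n longhorn-system wait --for=condition=Ready pods --all --timeout=600s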

Confirm longhorn is running properly:

[root@jenkins longhorn]# kubectl -n longhorn-system get pods
NAME                                        READY   STATUS    RESTARTS   AGE
csi-attacher-58b856dcff-9kqdt               1/1     Running   0          13h
csi-attacher-58b856dcff-c4zzp               1/1     Running   0          13h
csi-attacher-58b856dcff-tvfw2               1/1     Running   0          13h
csi-provisioner-56dd9dc55b-6ps8m            1/1     Running   0          13h
csi-provisioner-56dd9dc55b-m7gz4            1/1     Running   0          13h
csi-provisioner-56dd9dc55b-s9bh4            1/1     Running   0          13h
csi-resizer-6b87c4d9f8-2skth                1/1     Running   0          13h
csi-resizer-6b87c4d9f8-sqn2g                1/1     Running   0          13h
csi-resizer-6b87c4d9f8-z6xql                1/1     Running   0          13h
engine-image-ei-b99baaed-5fd7m              1/1     Running   0          13h
engine-image-ei-b99baaed-jcjxj              1/1     Running   0          12h
engine-image-ei-b99baaed-n6wxc              1/1     Running   0          12h
engine-image-ei-b99baaed-qxfhg              1/1     Running   0          12h
instance-manager-e-44ba7ac9                 1/1     Running   0          12h
instance-manager-e-48676e4a                 1/1     Running   0          12h
instance-manager-e-57bd994b                 1/1     Running   0          12h
instance-manager-e-753c704f                 1/1     Running   0          13h
instance-manager-r-4f4be1c1                 1/1     Running   0          12h
instance-manager-r-68bfb49b                 1/1     Running   0          12h
instance-manager-r-ccb87377                 1/1     Running   0          12h
instance-manager-r-e56429be                 1/1     Running   0          13h
longhorn-csi-plugin-fqgf7                   2/2     Running   0          12h
longhorn-csi-plugin-gbrnf                   2/2     Running   0          13h
longhorn-csi-plugin-kjj6b                   2/2     Running   0          12h
longhorn-csi-plugin-tvbvj                   2/2     Running   0          12h
longhorn-driver-deployer-74bb5c9fcb-khmbk   1/1     Running   0          14h
longhorn-manager-82ztz                      1/1     Running   0          12h
longhorn-manager-8kmsn                      1/1     Running   0          12h
longhorn-manager-flmfl                      1/1     Running   0          12h
longhorn-manager-mz6zj                      1/1     Running   0          14h
longhorn-ui-77c6d6f5b7-nzsg2                1/1     Running   0          14h

Confirm the default StorageClass is ready:

# kubectl get sc
NAME                 PROVISIONER          RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
longhorn (default)   driver.longhorn.io   Delete          Immediate           true                   14h

Log in to the longhorn UI and confirm that the nodes are in a schedulable state.
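
The UI was exposed as a NodePort (30890) in the helm install above, so any node IP plus that port should work; the service name below is the chart default and worth double-checking in your release:

kubectl -n longhorn-system get svc longhorn-frontend
# then open http://<any-node-ip>:30890 in a browser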


Deploy KubeSphere

References:

https://kubesphere.com.cn/en/docs/installing-on-kubernetes/
https://github.com/kubesphere/ks-installer

To deploy KubeSphere 3.0, download the YAML manifests:

wget https://raw.githubusercontent.com/kubesphere/ks-installer/v3.0.0/deploy/kubesphere-installer.yaml
wget https://raw.githubusercontent.com/kubesphere/ks-installer/v3.0.0/deploy/cluster-configuration.yaml

Edit cluster-configuration.yaml and enable the components you need by setting the corresponding fields; the following is only an example:

  devops:
    enabled: true
    ......
  logging:
    enabled: true
    ......
  metrics_server:
    enabled: true
    ......
  openpitrix:
    enabled: true
    ......

Run the KubeSphere deployment:

kubectl apply -f kubesphere-installer.yaml
kubectl apply -f cluster-configuration.yaml

Follow the deployment log and confirm there are no errors:

kubectl logs -n kubesphere-system $(kubectl get pod -n kubesphere-system -l app=ks-install -o jsonpath='{.items[0].metadata.name}') -f
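
Once the installer log reports success, you can optionally wait for the console deployment to finish rolling out before logging in (a small convenience check, not part of the original steps):

kubectl -n kubesphere-system rollout status deployment/ks-console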

After the deployment completes, confirm all pods are running properly:

[root@k8s-master1 ~]# kubectl get pods -A | grep kubesphere
kubesphere-controls-system     default-http-backend-857d7b6856-q24v2                             1/1     Running     0          12h
kubesphere-controls-system     kubectl-admin-58f985d8f6-jl9bj                                    1/1     Running     0          11h
kubesphere-controls-system     kubesphere-router-demo-ns-6c97d4968b-njgrc                        1/1     Running     1          154m
kubesphere-devops-system       ks-jenkins-54455f5db8-hm6kc                                       1/1     Running     0          11h
kubesphere-devops-system       s2ioperator-0                                                     1/1     Running     1          11h
kubesphere-devops-system       uc-jenkins-update-center-cd9464fff-qnvfz                          1/1     Running     0          12h
kubesphere-logging-system      elasticsearch-logging-curator-elasticsearch-curator-160079hmdmb   0/1     Completed   0          11h
kubesphere-logging-system      elasticsearch-logging-data-0                                      1/1     Running     0          12h
kubesphere-logging-system      elasticsearch-logging-data-1                                      1/1     Running     0          12h
kubesphere-logging-system      elasticsearch-logging-discovery-0                                 1/1     Running     0          12h
kubesphere-logging-system      fluent-bit-c45h2                                                  1/1     Running     0          12h
kubesphere-logging-system      fluent-bit-kptfc                                                  1/1     Running     0          12h
kubesphere-logging-system      fluent-bit-rzjfp                                                  1/1     Running     0          12h
kubesphere-logging-system      fluent-bit-wztkp                                                  1/1     Running     0          12h
kubesphere-logging-system      fluentbit-operator-855d4b977d-fk6hs                               1/1     Running     0          12h
kubesphere-logging-system      ks-events-exporter-5bc4d9f496-x297f                               2/2     Running     0          12h
kubesphere-logging-system      ks-events-operator-8dbf7fccc-9qmml                                1/1     Running     0          12h
kubesphere-logging-system      ks-events-ruler-698b7899c7-fkn4l                                  2/2     Running     0          12h
kubesphere-logging-system      ks-events-ruler-698b7899c7-hw6rq                                  2/2     Running     0          12h
kubesphere-logging-system      logsidecar-injector-deploy-74c66bfd85-cxkxm                       2/2     Running     0          12h
kubesphere-logging-system      logsidecar-injector-deploy-74c66bfd85-lzxbm                       2/2     Running     0          12h
kubesphere-monitoring-system   alertmanager-main-0                                               2/2     Running     0          11h
kubesphere-monitoring-system   alertmanager-main-1                                               2/2     Running     0          11h
kubesphere-monitoring-system   alertmanager-main-2                                               2/2     Running     0          11h
kubesphere-monitoring-system   kube-state-metrics-95c974544-r8kmq                                3/3     Running     0          12h
kubesphere-monitoring-system   node-exporter-9ddxn                                               2/2     Running     0          12h
kubesphere-monitoring-system   node-exporter-dw929                                               2/2     Running     0          12h
kubesphere-monitoring-system   node-exporter-ht868                                               2/2     Running     0          12h
kubesphere-monitoring-system   node-exporter-nxdsm                                               2/2     Running     0          12h
kubesphere-monitoring-system   notification-manager-deployment-7c8df68d94-hv56l                  1/1     Running     0          12h
kubesphere-monitoring-system   notification-manager-deployment-7c8df68d94-ttdsg                  1/1     Running     0          12h
kubesphere-monitoring-system   notification-manager-operator-6958786cd6-pllgc                    2/2     Running     0          12h
kubesphere-monitoring-system   prometheus-k8s-0                                                  3/3     Running     1          11h
kubesphere-monitoring-system   prometheus-k8s-1                                                  3/3     Running     1          11h
kubesphere-monitoring-system   prometheus-operator-84d58bf775-5rqdj                              2/2     Running     0          12h
kubesphere-system              etcd-65796969c7-whbzx                                             1/1     Running     0          12h
kubesphere-system              ks-apiserver-b4dbcc67-2kknm                                       1/1     Running     0          11h
kubesphere-system              ks-apiserver-b4dbcc67-k6jr2                                       1/1     Running     0          11h
kubesphere-system              ks-apiserver-b4dbcc67-q8845                                       1/1     Running     0          11h
kubesphere-system              ks-console-786b9846d4-86hxw                                       1/1     Running     0          12h
kubesphere-system              ks-console-786b9846d4-l6mhj                                       1/1     Running     0          12h
kubesphere-system              ks-console-786b9846d4-wct8z                                       1/1     Running     0          12h
kubesphere-system              ks-controller-manager-7fd8799789-478ks                            1/1     Running     0          11h
kubesphere-system              ks-controller-manager-7fd8799789-hwgmp                            1/1     Running     0          11h
kubesphere-system              ks-controller-manager-7fd8799789-pdbch                            1/1     Running     0          11h
kubesphere-system              ks-installer-64ddc4b77b-c7qz8                                     1/1     Running     0          12h
kubesphere-system              minio-7bfdb5968b-b5v59                                            1/1     Running     0          12h
kubesphere-system              mysql-7f64d9f584-kvxcb                                            1/1     Running     0          12h
kubesphere-system              openldap-0                                                        1/1     Running     0          12h
kubesphere-system              openldap-1                                                        1/1     Running     0          12h
kubesphere-system              redis-ha-haproxy-5c6559d588-2rt6v                                 1/1     Running     9          12h
kubesphere-system              redis-ha-haproxy-5c6559d588-mhj9p                                 1/1     Running     8          12h
kubesphere-system              redis-ha-haproxy-5c6559d588-tgpjv                                 1/1     Running     11         12h
kubesphere-system              redis-ha-server-0                                                 2/2     Running     0          12h
kubesphere-system              redis-ha-server-1                                                 2/2     Running     0          12h
kubesphere-system              redis-ha-server-2                                                 2/2     Running     0          12h

Note that some KubeSphere components are deployed via helm:

[root@k8s-master1 ~]# helm ls -A | grep kubesphere
elasticsearch-logging           kubesphere-logging-system       1               2020-09-23 00:49:08.526873742 +0800 CST deployed        elasticsearch-1.22.1            6.7.0-0217                  
elasticsearch-logging-curator   kubesphere-logging-system       1               2020-09-23 00:49:16.117842593 +0800 CST deployed        elasticsearch-curator-1.3.3     5.5.4-0217                  
ks-events                       kubesphere-logging-system       1               2020-09-23 00:51:45.529430505 +0800 CST deployed        kube-events-0.1.0               0.1.0                       
ks-jenkins                      kubesphere-devops-system        1               2020-09-23 01:03:15.106022826 +0800 CST deployed        jenkins-0.19.0                  2.121.3-0217                
ks-minio                        kubesphere-system               2               2020-09-23 00:48:16.990599158 +0800 CST deployed        minio-2.5.16                    RELEASE.2019-08-07T01-59-21Z
ks-openldap                     kubesphere-system               1               2020-09-23 00:03:28.767712181 +0800 CST deployed        openldap-ha-0.1.0               1.0                         
ks-redis                        kubesphere-system               1               2020-09-23 00:03:19.439784188 +0800 CST deployed        redis-ha-3.9.0                  5.0.5                       
logsidecar-injector             kubesphere-logging-system       1               2020-09-23 00:51:57.519733074 +0800 CST deployed        logsidecar-injector-0.1.0       0.1.0                       
notification-manager            kubesphere-monitoring-system    1               2020-09-23 00:54:14.662762759 +0800 CST deployed        notification-manager-0.1.0      0.1.0                       
uc                              kubesphere-devops-system        1               2020-09-23 00:51:37.885154574 +0800 CST deployed        jenkins-update-center-0.8.0     3.0.0    

Get the web console port; the default is 30880:

kubectl get svc/ks-console -n kubesphere-system
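
Combining any node IP with that NodePort gives the console URL; a small sketch assuming the first node's InternalIP is reachable from your browser:

node_ip=$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
echo "http://${node_ip}:30880"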

The default login account is:

admin/P@88w0rd

Log in to the KubeSphere UI.

[screenshot]

Cluster node information:

[screenshot]

Service component information:

[screenshot]

The bound PV volumes in the longhorn UI:

[screenshot]

Volume details:

[screenshot]

Uninstall KubeSphere from the cluster

Reference:
https://kubesphere.com.cn/en/docs/installing-on-kubernetes/uninstalling/uninstalling-kubesphere-from-k8s/

wget https://raw.githubusercontent.com/kubesphere/ks-installer/master/scripts/kubesphere-delete.sh
sh kubesphere-delete.sh

Original post: https://blog.csdn.net/networken/article/details/105664147

    willqy I tweaked the title a little. For this post I'd suggest adding a short introductory paragraph about both Sealos and Longhorn at the beginning, along with links to their official websites.

    Feynman Changed the title to "Deploying KubeSphere v3.0.0 with Sealos + Longhorn".

    Hi, is the ks-minio error related to "deploying with Sealos + Longhorn"? If not, please open a new topic. Also, please describe the problem you hit during deployment in detail, thanks.

    Print out the minio logs; is it failing to bind the PV?

    10 days later

    There's a problem with longhorn storage

    With storage installed by the method above, PVs get created automatically during the KubeSphere installation, but they cannot be attached to the host, and errors like the following are reported:

    Events:
      Type     Reason              Age                          From                              Message
      Warning  FailedMount         41m (x414 over 3d6h)         kubelet, ecs-ebb9-0001.novalocal  Unable to attach or mount volumes: unmounted volumes=[export], unattached volumes=[minio-config-dir ks-minio-token-mw554 export]: timed out waiting for the condition
      Warning  FailedMount         7m26s (x406 over 3d6h)       kubelet, ecs-ebb9-0001.novalocal  Unable to attach or mount volumes: unmounted volumes=[export], unattached volumes=[ks-minio-token-mw554 export minio-config-dir]: timed out waiting for the condition
      Warning  FailedAttachVolume  94s (x2312 over 3d6h)        attachdetach-controller           AttachVolume.Attach failed for volume "pvc-08ff037b-339b-41f4-b22f-b5108b438507" : rpc error: code = Aborted desc = The volume pvc-08ff037b-339b-41f4-b22f-b5108b438507 is attaching
      Warning  FailedMount         <invalid> (x1247 over 3d6h)  kubelet, ecs-ebb9-0001.novalocal  Unable to attach or mount volumes: unmounted volumes=[export], unattached volumes=[export minio-config-dir ks-minio-token-mw554]: timed out waiting for the condition

    [root@ecs-ebb9-0004 ~]# kubectl -n kubesphere-system describe pv minio
    Error from server (NotFound): persistentvolumes "minio" not found
    [root@ecs-ebb9-0004 ~]# kubectl -n kubesphere-system describe pvc minio
    Name:          minio
    Namespace:     kubesphere-system
    StorageClass:  longhorn
    Status:        Bound
    Volume:        pvc-08ff037b-339b-41f4-b22f-b5108b438507
    Labels:        app=minio
                   app.kubernetes.io/managed-by=Helm
                   chart=minio-2.5.16
                   heritage=Helm
                   release=ks-minio
    Annotations:   meta.helm.sh/release-name: ks-minio
                   meta.helm.sh/release-namespace: kubesphere-system
                   pv.kubernetes.io/bind-completed: yes
                   pv.kubernetes.io/bound-by-controller: yes
                   volume.beta.kubernetes.io/storage-provisioner: driver.longhorn.io
    Finalizers:    [kubernetes.io/pvc-protection]
    Capacity:      20Gi
    Access Modes:  RWO
    VolumeMode:    Filesystem
    Mounted By:    minio-7bfdb5968b-kcfzd
    Events:        <none>
    [root@ecs-ebb9-0004 ~]# kubectl -n kubesphere-system describe pod minio-7bfdb5968b-kcfzd
    Name:           minio-7bfdb5968b-kcfzd
    Namespace:      kubesphere-system
    Priority:       0
    Node:           ecs-ebb9-0001.novalocal/192.168.0.231
    Start Time:     Mon, 05 Oct 2020 16:30:18 +0800
    Labels:         app=minio
                    pod-template-hash=7bfdb5968b
                    release=ks-minio
    Annotations:    checksum/config: c6cc7f4b40064dffd59b339e133fa4819f787573ee18e1d001435aa4daff8ba2
                    checksum/secrets: f9625c177e0e74a3b9997c3c65189ebffcfbde7aaa910de0ba38b48b032c1a96
    Status:         Pending
    IP:             
    IPs:            <none>
    Controlled By:  ReplicaSet/minio-7bfdb5968b
    Containers:
      minio:
        Container ID:  
        Image:         minio/minio:RELEASE.2019-08-07T01-59-21Z
        Image ID:      
        Port:          9000/TCP
        Host Port:     0/TCP
        Command:
          /bin/sh
          -ce
          /usr/bin/docker-entrypoint.sh minio -C /root/.minio/ server /data
        State:          Waiting
          Reason:       ContainerCreating
        Ready:          False
        Restart Count:  0
        Requests:
          cpu:      250m
          memory:   256Mi
        Liveness:   http-get http://:service/minio/health/live delay=5s timeout=1s period=30s #success=1 #failure=3
        Readiness:  http-get http://:service/minio/health/ready delay=5s timeout=1s period=15s #success=1 #failure=3
        Environment:
          MINIO_ACCESS_KEY:  <set to the key 'accesskey' in secret 'minio'>  Optional: false
          MINIO_SECRET_KEY:  <set to the key 'secretkey' in secret 'minio'>  Optional: false
          MINIO_BROWSER:     on
        Mounts:
          /data from export (rw)
          /root/.minio/ from minio-config-dir (rw)
          /var/run/secrets/kubernetes.io/serviceaccount from ks-minio-token-mw554 (ro)
    Conditions:
      Type              Status
      Initialized       True 
      Ready             False 
      ContainersReady   False 
      PodScheduled      True 
    Volumes:
      export:
        Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
        ClaimName:  minio
        ReadOnly:   false
      minio-user:
        Type:        Secret (a volume populated by a Secret)
        SecretName:  minio
        Optional:    false
      minio-config-dir:
        Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
        Medium:     
        SizeLimit:  <unset>
      ks-minio-token-mw554:
        Type:        Secret (a volume populated by a Secret)
        SecretName:  ks-minio-token-mw554
        Optional:    false
    QoS Class:       Burstable
    Node-Selectors:  <none>
    Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                     node.kubernetes.io/unreachable:NoExecute for 300s
    Events:
      Type     Reason              Age                          From                              Message
      ----     ------              ----                         ----                              -------
      Warning  FailedMount         41m (x414 over 3d6h)         kubelet, ecs-ebb9-0001.novalocal  Unable to attach or mount volumes: unmounted volumes=[export], unattached volumes=[minio-config-dir ks-minio-token-mw554 export]: timed out waiting for the condition
      Warning  FailedMount         7m26s (x406 over 3d6h)       kubelet, ecs-ebb9-0001.novalocal  Unable to attach or mount volumes: unmounted volumes=[export], unattached volumes=[ks-minio-token-mw554 export minio-config-dir]: timed out waiting for the condition
      Warning  FailedAttachVolume  94s (x2312 over 3d6h)        attachdetach-controller           AttachVolume.Attach failed for volume "pvc-08ff037b-339b-41f4-b22f-b5108b438507" : rpc error: code = Aborted desc = The volume pvc-08ff037b-339b-41f4-b22f-b5108b438507 is attaching
      Warning  FailedMount         <invalid> (x1247 over 3d6h)  kubelet, ecs-ebb9-0001.novalocal  Unable to attach or mount volumes: unmounted volumes=[export], unattached volumes=[export minio-config-dir ks-minio-token-mw554]: timed out waiting for the condition

    Two things to check: whether the longhorn dependency is installed, and, if pods run on the master nodes, whether the taints were removed so that the longhorn CSI plugin can also be scheduled there.

      1 month later

      willqy I installed longhorn and after it had been running for a while, a pod restart produced this error:

      Warning  FailedMount         5m1s (x2 over 11m)  kubelet, hcjt-itc-dl-v100-10  Unable to attach or mount volumes: unmounted volumes=[redis-pvc], unattached volumes=[default-token-lrpd8 redis-pvc]: timed out waiting for the condition
      Warning  FailedAttachVolume  97s (x16 over 16m)  attachdetach-controller       AttachVolume.Attach failed for volume "pvc-7a21e6c8-9eae-4866-8202-09a1aee0406b" : rpc error: code = NotFound desc = ControllerPublishVolume: the volume pvc-7a21e6c8-9eae-4866-8202-09a1aee0406b not exists
      Warning  FailedMount         26s (x5 over 14m)   kubelet, hcjt-itc-dl-v100-10  Unable to attach or mount volumes: unmounted volumes=[redis-pvc], unattached volumes=[redis-pvc default-token-lrpd8]: timed out waiting for the condition

        sunshuyan
        The PV can no longer be attached. Check its state in the longhorn UI and try manually attaching it to the corresponding node.

        7 days later

        Issue log: this covers prometheus PV expansion, PV data reset, and PV filesystem repair. longhorn still doesn't feel entirely reliable; if you have spare disks, consider rook instead, or at least don't back longhorn with the local filesystem, give it dedicated disks.

        prometheus failure

        A recent power outage restarted the cluster nodes, leaving one of the two prometheus pods dead and the other limping, and the KubeSphere UI could no longer display any monitoring data.

        Checking the pod status, one of them is Running, so monitoring should in theory still work:

        [root@jenkins ~]# kubectl -n kubesphere-monitoring-system get pods
        NAME                                               READY   STATUS             RESTARTS   AGE
        ......
        prometheus-k8s-0                                   3/3     Running            36         33d
        prometheus-k8s-1                                   2/3     CrashLoopBackOff   41         9d
        prometheus-operator-84d58bf775-g7hv8               2/2     Running            0          9d

        The logs of the CrashLoopBackOff pod show errors that look like file corruption:

        [root@jenkins ~]# kubectl -n kubesphere-monitoring-system logs -f prometheus-k8s-1  -c prometheus
        ......
        level=info ts=2020-11-26T01:05:21.880Z caller=main.go:583 msg="Scrape manager stopped"
        level=error ts=2020-11-26T01:05:21.880Z caller=main.go:764 err="opening storage failed: block dir: \"/prometheus/01EQ2ZQCKZEX21JP81GX10BPNK\": invalid character '\\x00' looking for beginning of value"

        After restarting the pod, the PV would no longer mount, so it looks like longhorn takes the blame here: an unexpected restart can apparently leave the PV unable to mount:

        [root@jenkins ~]# kubectl -n kubesphere-monitoring-system describe pods prometheus-k8s-1 
        ......
        Events:
          Type     Reason       Age    From               Message
          ----     ------       ----   ----               -------
          Normal   Scheduled    7m4s   default-scheduler  Successfully assigned kubesphere-monitoring-system/prometheus-k8s-1 to k8s-master1
          Warning  FailedMount  6m56s  kubelet            MountVolume.SetUp failed for volume "pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6 but could not correct them: fsck from util-linux 2.31.1
        /dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6 contains a file system with errors, check forced.
        /dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: Inode 262155 extent tree (at level 1) could be shorter.  IGNORED.
        /dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: Inode 262171 extent tree (at level 1) could be shorter.  IGNORED.
        /dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: Inode 262179 extent tree (at level 1) could be shorter.  IGNORED.
        /dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: Inode 262184 extent tree (at level 1) could be shorter.  IGNORED.
        /dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: Inode 262188 extent tree (at level 1) could be shorter.  IGNORED.
        /dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: Inode 262198 extent tree (at level 1) could be shorter.  IGNORED.
        /dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: Inode 262208 extent tree (at level 1) could be shorter.  IGNORED.
        /dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: Inode 262216 has an invalid extent node (blk 1081353, lblk 0)
        
        
        /dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
          (i.e., without -a or -p options)
        .
          Warning  FailedMount  5m2s                  kubelet  Unable to attach or mount volumes: unmounted volumes=[prometheus-k8s-db], unattached volumes=[config-out tls-assets prometheus-k8s-db prometheus-k8s-rulefiles-0 prometheus-k8s-token-n2nws config]: timed out waiting for the condition
          Warning  FailedMount  2m47s                 kubelet  Unable to attach or mount volumes: unmounted volumes=[prometheus-k8s-db], unattached volumes=[prometheus-k8s-db prometheus-k8s-rulefiles-0 prometheus-k8s-token-n2nws config config-out tls-assets]: timed out waiting for the condition
          Warning  FailedMount  35s (x10 over 6m54s)  kubelet  MountVolume.SetUp failed for volume "pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6 but could not correct them: fsck from util-linux 2.31.1

        No obvious cause yet. Checking what is going on with the first, Running pod turns up a pile of no-space-left errors:

        [root@jenkins ~]# kubectl -n kubesphere-monitoring-system logs -f prometheus-k8s-0 -c prometheus
        level=warn ts=2020-11-26T00:00:23.231Z caller=manager.go:595 component="rule manager" group=node.rules msg="Rule sample appending failed" err="write to WAL: log samples: write /prometheus/wal/00000660: no space left on device"
        level=warn ts=2020-11-26T00:00:23.231Z caller=manager.go:595 component="rule manager" group=node.rules msg="Rule sample appending failed" err="write to WAL: log samples: write /prometheus/wal/00000660: no space left on device"

        The space is clearly exhausted. Checking the filesystem inside the pod confirms /prometheus is at 100% usage; presumably with the second pod broken, all data was written to the first pod and filled it up.

        [root@jenkins ~]# kubectl -n kubesphere-monitoring-system exec -it prometheus-k8s-0 -c prometheus -- df -h | grep prometheus
                                 19.6G     19.5G         0 100% /prometheus
                                457.1G     89.3G    367.8G  20% /etc/prometheus/config_out
        tmpfs                     7.8G         0      7.8G   0% /etc/prometheus/certs
                                457.1G     89.3G    367.8G  20% /etc/prometheus/rules/prometheus-k8s-rulefiles-0

        What to do? Searching the forum for the prometheus keyword turns up a related thread: you can change the monitoring data retention period from the default 7d down to 1d:

        https://kubesphere.com.cn/forum/d/657-prometheus

        # kubectl edit prometheuses -n kubesphere-monitoring-system
        
         retention: 1d
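
        The same change can also be applied non-interactively; a sketch assuming the Prometheus custom resource is named k8s, as it is in a default KubeSphere v3.0.0 install:

        kubectl -n kubesphere-monitoring-system patch prometheus k8s --type merge -p '{"spec":{"retention":"1d"}}'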

        Restarting the pods after the change had no effect, though; cleanup is probably periodic, so it may take up to a day. In the meantime, consider expanding the first pod's PV to get monitoring back first.

        The underlying storage is longhorn, so check the official docs for how to expand a PV:

        https://longhorn.io/docs/1.0.2/volumes-and-nodes/expansion/

        Newer Kubernetes versions already support PV expansion, it just isn't enabled by default; edit the StorageClass and add the field allowVolumeExpansion: true

        [root@jenkins longhorn]# kubectl edit sc longhorn 
        allowVolumeExpansion: true
        apiVersion: storage.k8s.io/v1
        ...
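
        The same field can also be set with a one-line patch instead of an interactive edit (a sketch, same effect as above):

        kubectl patch sc longhorn -p '{"allowVolumeExpansion": true}'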

        Then the volume needs to be detached from the node; do this from the longhorn UI.

        Once that is done, edit the corresponding PVC directly and change its size:

        [root@jenkins clone]# kubectl -n kubesphere-monitoring-system edit pvc prometheus-k8s-db-prometheus-k8s-0 
        ......
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 31Gi
        ...

        After the edit, the new PV and PVC sizes take effect automatically:

        [root@jenkins clone]# kubectl -n kubesphere-monitoring-system get pvc
        NAME                                 STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
        prometheus-k8s-db-prometheus-k8s-0   Bound    pvc-748d5256-d046-4c04-a37e-0edbe454f2ca   31Gi       RWO            longhorn       36d
        prometheus-k8s-db-prometheus-k8s-1   Bound    pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6   20Gi       RWO            longhorn       36d
        
        [root@jenkins clone]#  kubectl get pv
        NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                                             STORAGECLASS   REASON   AGE
        pvc-748d5256-d046-4c04-a37e-0edbe454f2ca   31Gi       RWO            Delete           Bound    kubesphere-monitoring-system/prometheus-k8s-db-prometheus-k8s-0   longhorn                36d
        pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6   20Gi       RWO            Delete           Bound    kubesphere-monitoring-system/prometheus-k8s-db-prometheus-k8s-1   longhorn                36d

        Re-attach the PV to the node from the longhorn UI.

        Inside the pod, however, the filesystem size was not updated automatically and was still 20G. Following the final steps in the longhorn docs, resize the filesystem manually; after a bit of fiddling it succeeds:

        volume_name=pvc-748d5256-d046-4c04-a37e-0edbe454f2ca
        mount /dev/longhorn/$volume_name /data/pv
        umount /dev/longhorn/$volume_name
        mount /dev/longhorn/$volume_name /data/pv
        
        [root@k8s-node1 ~]# resize2fs /dev/longhorn/$volume_name
        resize2fs 1.42.9 (28-Dec-2013)
        Filesystem at /dev/longhorn/pvc-748d5256-d046-4c04-a37e-0edbe454f2ca is mounted on /data/pv; on-line resizing required
        old_desc_blocks = 3, new_desc_blocks = 4
        The filesystem on /dev/longhorn/pvc-748d5256-d046-4c04-a37e-0edbe454f2ca is now 8126464 blocks long.
        
        umount /dev/longhorn/$volume_name

        Checking again, the filesystem has been expanded successfully:

        [root@jenkins longhorn]# kubectl -n kubesphere-monitoring-system exec -it prometheus-k8s-0 -c prometheus -- df -h | grep prometheus
                                 30.4G     15.9G     14.5G  52% /prometheus
                                457.1G     88.1G    369.0G  19% /etc/prometheus/config_out
        tmpfs                     7.8G         0      7.8G   0% /etc/prometheus/certs
                                457.1G     88.1G    369.0G  19% /etc/prometheus/rules/prometheus-k8s-rulefiles-0

        In the end the first pod's PV was expanded successfully, it runs normally, and monitoring in the KubeSphere UI is back to normal:

        [root@jenkins clone]# kubectl -n kubesphere-monitoring-system get pods
        NAME                                               READY   STATUS             RESTARTS   AGE
        ......
        prometheus-k8s-0                                   3/3     Running            1          73m
        prometheus-k8s-1                                   2/3     CrashLoopBackOff   3          2m2s
        prometheus-operator-84d58bf775-8tgmr               2/2     Running            0          5h47m

        To deal with the remaining crashing pod, find its PV name, pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6:

        [root@jenkins longhorn]# kubectl get pv |grep prometheus-k8s-1 
        pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6   32Gi       RWO            Delete           Bound    kubesphere-monitoring-system/prometheus-k8s-db-prometheus-k8s-1   longhorn                37d

        Go to the corresponding node and brute-force wipe the data:

        [root@k8s-master1 ~]# lsblk
        NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
        sda      8:0    0 465.8G  0 disk 
        ├─sda1   8:1    0   200M  0 part /boot/efi
        ├─sda2   8:2    0     1G  0 part /boot
        ├─sda3   8:3    0 406.8G  0 part /data
        ├─sda4   8:4    0    50G  0 part /
        └─sda5   8:5    0   7.8G  0 part 
        sdb      8:16   0    32G  0 disk /var/lib/kubelet/pods/ed0732c4-a8ae-4d3f-b78f-a09767589acd/volume-subpaths/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6/prometheus/2
        sdc      8:32   0     2G  0 disk /var/lib/kubelet/pods/dd454b29-a2d5-4f2f-8244-bf3dd4d21054/volumes/kubernetes.io~csi/pvc-4da824f3-b462-4e32-b3ae-5be3f811dea5/mount
        
        [root@k8s-master1 ~]# cd /var/lib/kubelet/pods/ed0732c4-a8ae-4d3f-b78f-a09767589acd/volume-subpaths/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6/prometheus/2
        [root@k8s-master1 2]# rm -rf *

        After restarting the pod, it starts successfully:

        [root@k8s-master1 ~]# kubectl -n kubesphere-monitoring-system get pods
        NAME                                               READY   STATUS    RESTARTS   AGE
        alertmanager-main-0                                2/2     Running   2          10d
        alertmanager-main-1                                2/2     Running   4          36d
        alertmanager-main-2                                2/2     Running   4          34d
        kube-state-metrics-95c974544-8fjd8                 3/3     Running   3          34d
        node-exporter-mdqvj                                2/2     Running   4          36d
        node-exporter-p8glr                                2/2     Running   4          36d
        node-exporter-s8ffl                                2/2     Running   6          36d
        node-exporter-vsjkp                                2/2     Running   6          34d
        notification-manager-deployment-7c8df68d94-bdm25   1/1     Running   1          34d
        notification-manager-deployment-7c8df68d94-k6c2l   1/1     Running   2          36d
        notification-manager-operator-6958786cd6-lqtkq     2/2     Running   8          36d
        prometheus-k8s-0                                   3/3     Running   1          105m
        prometheus-k8s-1                                   3/3     Running   1          3m8s
        prometheus-operator-84d58bf775-8tgmr               2/2     Running   2          6h19m

        If you hit a prometheus "invalid magic number 0" error, the pod can also be recovered with the same brute-force cleanup:

        [root@jenkins longhorn]# kubectl -n kubesphere-monitoring-system logs -f prometheus-k8s-0 -c prometheus
        level=error ts=2020-11-27T06:33:33.531Z caller=main.go:764 err="opening storage failed: /prometheus/chunks_head/000516: invalid magic number 0"

        jenkins issue

        After a system upgrade and a reboot of the KubeSphere DevOps node, the jenkins pod turned out to be down:

        [root@jenkins argocd]# kubectl -n kubesphere-devops-system get pods
        NAME                                       READY   STATUS    RESTARTS   AGE
        ks-jenkins-54455f5db8-glhbs                0/1     Error     2          35d
        s2ioperator-0                              1/1     Running   2          35d
        uc-jenkins-update-center-cd9464fff-r5txz   1/1     Running   2          11d

        The logs show it is the longhorn PV again: it cannot be mounted and the filesystem looks corrupted:

        [root@jenkins argocd]# kubectl -n kubesphere-devops-system describe pods  ks-jenkins-54455f5db8-glhbs   
        ......
        Events:
          Type     Reason       Age                     From     Message
          ----     ------       ----                    ----     -------
          Warning  FailedMount  51m (x17 over 3h27m)    kubelet  Unable to attach or mount volumes: unmounted volumes=[jenkins-home], unattached volumes=[plugin-dir secrets-dir ks-jenkins-token-zzhd5 casc-config jenkins-home jenkins-config]: timed out waiting for the condition
          Warning  FailedMount  46m (x20 over 3h52m)    kubelet  Unable to attach or mount volumes: unmounted volumes=[jenkins-home], unattached volumes=[jenkins-home jenkins-config plugin-dir secrets-dir ks-jenkins-token-zzhd5 casc-config]: timed out waiting for the condition
          Warning  FailedMount  42m (x15 over 3h57m)    kubelet  Unable to attach or mount volumes: unmounted volumes=[jenkins-home], unattached volumes=[ks-jenkins-token-zzhd5 casc-config jenkins-home jenkins-config plugin-dir secrets-dir]: timed out waiting for the condition
          Warning  FailedMount  26m (x13 over 3h55m)    kubelet  Unable to attach or mount volumes: unmounted volumes=[jenkins-home], unattached volumes=[jenkins-config plugin-dir secrets-dir ks-jenkins-token-zzhd5 casc-config jenkins-home]: timed out waiting for the condition
          Warning  FailedMount  7m1s (x108 over 3h51m)  kubelet  MountVolume.SetUp failed for volume "pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf but could not correct them: fsck from util-linux 2.31.1
        /dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf contains a file system with errors, check forced.
        /dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf: Inode 131224 has an invalid extent node (blk 557148, lblk 0)
        
        
        /dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
          (i.e., without -a or -p options)
        .
          Warning  FailedMount  97s (x14 over 3h46m)  kubelet  Unable to attach or mount volumes: unmounted volumes=[jenkins-home], unattached volumes=[casc-config jenkins-home jenkins-config plugin-dir secrets-dir ks-jenkins-token-zzhd5]: timed out waiting for the condition

        The event finally suggests running fsck manually, but that fails because e2fsck is too old (proceed with caution and take a backup first, this can wreck the PV 🙂):

        [root@k8s-node1 ~]# fsck -cvf /dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf 
        fsck from util-linux 2.23.2
        e2fsck 1.42.9 (28-Dec-2013)
        /dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf has unsupported feature(s): metadata_csum
        e2fsck: Get a newer version of e2fsck!

        Download the latest e2fsprogs:

        https://distfiles.macports.org/e2fsprogs/

        wget https://distfiles.macports.org/e2fsprogs/e2fsprogs-1.45.6.tar.gz
        tar -zxvf e2fsprogs-1.45.6.tar.gz
        cd e2fsprogs-1.45.6
        ./configure && make && make install

        Run the check again:

        [root@k8s-node1 ~]# fsck.ext4 -y /dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf  
        e2fsck 1.45.6 (20-Mar-2020)
        /dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf contains a file system with errors, check forced.
        Pass 1: Checking inodes, blocks, and sizes
        Inode 131224 has an invalid extent node (blk 557148, lblk 0)
        Clear? yes
        
        Inode 131224 extent tree (at level 1) could be shorter.  Optimize? yes
        
        Inode 131224, i_blocks is 992, should be 0.  Fix? yes
        
        Inode 131227 extent block passes checks, but checksum does not match extent
                (logical block 16, physical block 107664, len 50)
        Fix? yes
        
        Inode 131227, i_blocks is 1080, should be 536.  Fix? yes
        
        
        Running additional passes to resolve blocks claimed by more than one inode...
        Pass 1B: Rescanning for multiply-claimed blocks
        Multiply-claimed block(s) in inode 131227: 112333
        Multiply-claimed block(s) in inode 131343: 112333
        Pass 1C: Scanning directories for inodes with multiply-claimed blocks
        Pass 1D: Reconciling multiply-claimed blocks
        (There are 2 inodes containing multiply-claimed blocks.)
        
        File /support/all_2020-11-24_20.29.26.log (inode #131227, mod time Wed Nov 25 21:00:09 2020) 
          has 1 multiply-claimed block(s), shared with 1 file(s):
                /jobs/demo-devops4t6ff/jobs/demo-pipeline/builds/7/workflow/4.xml (inode #131343, mod time Thu Nov 26 22:33:19 2020)
        Clone multiply-claimed blocks? yes
        
        File /jobs/demo-devops4t6ff/jobs/demo-pipeline/builds/7/workflow/4.xml (inode #131343, mod time Thu Nov 26 22:33:19 2020) 
          has 1 multiply-claimed block(s), shared with 1 file(s):
                /support/all_2020-11-24_20.29.26.log (inode #131227, mod time Wed Nov 25 21:00:09 2020)
        Multiply-claimed blocks already reassigned or cloned.
        
        Pass 1E: Optimizing extent trees
        Pass 2: Checking directory structure
        Pass 3: Checking directory connectivity
        Pass 4: Checking reference counts
        Pass 5: Checking group summary information
        Block bitmap differences:  -(87056--87071) -(87088--87167) -(87296--87354) -(87424--87482) -(106512--106618) -(106624--106629) -(107024--107135) +(107664--107713) -108092 +110755 +110769 +110811 +110815 +110846 +110862 +112328 +112342 +112369 +112377 +112382 -123851 -123856 -123897 -(123899--123901) -124722 -124756 -126409 -126435 -126459 -129237 -129246 -129267 -129271 -129284 -129293 -130369 -130377 -130419 -130432 -130471 -132717 -132720 -132774 -132778 -132796 -133198 -133210 -557145 -557148 -557168 -559108 -559165 -559194 -559204 -560608 -560611 -560625 -560628 -563206 -563229 -563259 -563272 -564614 -564645 -564657 -564677 -564743 -564749 -564758 -564763 -564788 -565628 -565918 -565952 +566488 +566493 +566951 +566964 -567308 -568137 -568140 -568144 -568146 -568172 -568198 -568721 -568765 -568779
        Fix? yes
        
        Free blocks count wrong for group #0 (22415, counted=22414).
        Fix? yes
        
        Free blocks count wrong for group #2 (292, counted=506).
        Fix? yes
        
        Free blocks count wrong for group #3 (17798, counted=17985).
        Fix? yes
        
        Free blocks count wrong for group #4 (32761, counted=32768).
        Fix? yes
        
        Free blocks count wrong for group #17 (31489, counted=31522).
        Fix? yes
        
        Free blocks count wrong (1900427, counted=1900858).
        Fix? yes
        
        
        /dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf: ***** FILE SYSTEM WAS MODIFIED *****
        /dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf: 5862/524288 files (0.6% non-contiguous), 196294/2097152 blocks

        Restart the pod and it is back to normal:

        [root@jenkins longhorn]# kubectl -n kubesphere-devops-system get pods
        NAME                                       READY   STATUS    RESTARTS   AGE
        ks-jenkins-54455f5db8-w7rqw                1/1     Running   0          13m
        s2ioperator-0                              1/1     Running   2          35d
        uc-jenkins-update-center-cd9464fff-r5txz   1/1     Running   2          11d

          willqy 👍
          Also, if you want to know why a PVC cannot attach to its pod, dig further into the longhorn logs. Volume attach is handled by the CSI node component, so check the logs of longhorn-csi-plugin-xxx; the kubelet logs may contain useful information too. As a distributed storage backend, longhorn may still fall a bit short on stability; NFS or Ceph are worth considering as alternatives.