A beginner-friendly quick-deployment walkthrough of KubeSphere 3.0.0 🚀

About sealos

Official website: https://sealyun.com/

sealos is a highly available Kubernetes installation tool that can only be described as silky smooth: one command, offline installation with all dependencies bundled, in-kernel load balancing with no reliance on haproxy or keepalived, written in pure Golang, 99-year certificates, and support for Kubernetes v1.15, v1.16, v1.17, v1.18 and v1.19.

About longhorn

Website: https://www.rancher.cn/longhorn

A cloud-native distributed block storage solution for Kubernetes: easy to use | 100% open source | write once, run anywhere.

Features: highly available persistent storage for Kubernetes, simple incremental snapshots and backups, and cross-cluster disaster recovery.

Below we use sealos to quickly bring up a highly available Kubernetes cluster, deploy KubeSphere on top of it, and use longhorn as the underlying storage.

KubeSphere deployment options

There are roughly two ways to deploy KubeSphere:

  • Deploy a Kubernetes cluster together with KubeSphere
  • Deploy KubeSphere on an existing Kubernetes cluster

Deploying onto an existing cluster offers more flexibility, so the following demonstrates deploying a standalone Kubernetes cluster first and then installing KubeSphere on it, using Rancher's open-source cloud-native distributed storage longhorn as the underlying storage.

Deploy the Kubernetes cluster

Use the sealos tool to deploy the cluster. Prepare 4 nodes: 3 masters and 1 worker (or 1 master and 2 workers). Every node must have its hostname configured, and node clocks must be synchronized:

hostnamectl set-hostname xx
yum install -y chrony
systemctl enable --now chronyd
timedatectl set-timezone Asia/Shanghai
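
To confirm the clocks really are in sync before moving on, a minimal check (plain chrony/timedatectl, nothing sealos-specific):

chronyc sources -v
timedatectl status | grep -Ei "time zone|synchroni"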

On the first master node, download the deployment tool and the offline package:

# Go-based binary installer
wget -c https://sealyun.oss-cn-beijing.aliyuncs.com/latest/sealos && \
    chmod +x sealos && mv sealos /usr/bin

# Using Kubernetes v1.18.8 as an example; v1.19.x is not recommended because KubeSphere does not support it yet
wget -c https://sealyun.oss-cn-beijing.aliyuncs.com/cd3d5791b292325d38bbfaffd9855312-1.18.8/kube1.18.8.tar.gz

Run the following command to deploy the cluster; --passwd is the root password on all nodes:

sealos init --passwd 123456 \
  --master 10.39.140.248 \
  --master 10.39.140.249 \
  --master 10.39.140.250 \
  --node 10.39.140.251 \
  --pkg-url kube1.18.8.tar.gz \
  --version v1.18.8

Confirm the Kubernetes cluster is running properly:

# kubectl get nodes
NAME          STATUS   ROLES    AGE   VERSION
k8s-master1   Ready    master   13h   v1.18.8
k8s-master2   Ready    master   13h   v1.18.8
k8s-master3   Ready    master   13h   v1.18.8
k8s-node1     Ready    <none>   13h   v1.18.8

Deploy longhorn storage

longhorn recommends a dedicated disk for its storage; for this test we simply use the local directory /data/longhorn (the default is /var/lib/longhorn).
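
If you follow this layout, make sure the directory exists on every node before installing longhorn; a minimal sketch (the dedicated-disk lines are commented out and the device name is only an example):

# run on every node: create the longhorn data directory
mkdir -p /data/longhorn
# recommended instead: format and mount a dedicated disk
# mkfs.ext4 /dev/sdb && mount /dev/sdb /data/longhorn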

Note that several KubeSphere components request 20Gi PVs, so make sure the nodes have enough free space; otherwise a PV may bind successfully while no node satisfies its scheduling requirements. For a test-only environment, you can shrink the PV sizes in cluster-configuration.yaml ahead of time.
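
A quick way to locate those size fields before editing (treat the pattern as a rough guide; field names differ slightly between ks-installer versions):

grep -niE "volume?size" cluster-configuration.yaml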

Installing longhorn with 3 data replicas requires at least 3 schedulable nodes, so remove the master taints here so that pods can be scheduled onto them:

kubectl taint nodes --all node-role.kubernetes.io/master-  
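
To double-check that the taints are really gone, a small verification sketch (not part of the original steps; the TAINTS column should be empty or <none> on every node):

kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints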

Install helm on k8s-master1:

version=v3.3.1
curl -LO https://repo.huaweicloud.com/helm/${version}/helm-${version}-linux-amd64.tar.gz
tar -zxvf helm-${version}-linux-amd64.tar.gz
mv linux-amd64/helm /usr/local/bin/helm && rm -rf linux-amd64

Install the longhorn dependency on all nodes:

yum install -y iscsi-initiator-utils
systemctl enable --now iscsid
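
Optionally verify the iSCSI initiator is active on each node before installing the chart:

systemctl is-active iscsid
cat /etc/iscsi/initiatorname.iscsi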

Add the longhorn chart repository; if your network connection is poor, you can download the chart source from the longhorn GitHub releases instead:

helm repo add longhorn https://charts.longhorn.io
helm repo update

Deploy longhorn. Offline deployment is supported; the images need to be pushed to the longhornio project of your private registry in advance:

kubectl create namespace longhorn-system

helm install longhorn longhorn/longhorn \
  --namespace longhorn-system \
  --set defaultSettings.defaultDataPath="/data/longhorn/" \
  --set defaultSettings.defaultReplicaCount=3 \
  --set service.ui.type=NodePort \
  --set service.ui.nodePort=30890
  # for an offline install, also add: --set privateRegistry.registryUrl=10.39.140.196:8081
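
Before checking the pods one by one, you can optionally block until everything is Ready; a minimal sketch (the 10-minute timeout is arbitrary):

kubectl -n longhorn-system wait --for=condition=Ready pods --all --timeout=600s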

Confirm longhorn is running properly:

[root@jenkins longhorn]# kubectl -n longhorn-system get pods
NAME                                        READY   STATUS    RESTARTS   AGE
csi-attacher-58b856dcff-9kqdt               1/1     Running   0          13h
csi-attacher-58b856dcff-c4zzp               1/1     Running   0          13h
csi-attacher-58b856dcff-tvfw2               1/1     Running   0          13h
csi-provisioner-56dd9dc55b-6ps8m            1/1     Running   0          13h
csi-provisioner-56dd9dc55b-m7gz4            1/1     Running   0          13h
csi-provisioner-56dd9dc55b-s9bh4            1/1     Running   0          13h
csi-resizer-6b87c4d9f8-2skth                1/1     Running   0          13h
csi-resizer-6b87c4d9f8-sqn2g                1/1     Running   0          13h
csi-resizer-6b87c4d9f8-z6xql                1/1     Running   0          13h
engine-image-ei-b99baaed-5fd7m              1/1     Running   0          13h
engine-image-ei-b99baaed-jcjxj              1/1     Running   0          12h
engine-image-ei-b99baaed-n6wxc              1/1     Running   0          12h
engine-image-ei-b99baaed-qxfhg              1/1     Running   0          12h
instance-manager-e-44ba7ac9                 1/1     Running   0          12h
instance-manager-e-48676e4a                 1/1     Running   0          12h
instance-manager-e-57bd994b                 1/1     Running   0          12h
instance-manager-e-753c704f                 1/1     Running   0          13h
instance-manager-r-4f4be1c1                 1/1     Running   0          12h
instance-manager-r-68bfb49b                 1/1     Running   0          12h
instance-manager-r-ccb87377                 1/1     Running   0          12h
instance-manager-r-e56429be                 1/1     Running   0          13h
longhorn-csi-plugin-fqgf7                   2/2     Running   0          12h
longhorn-csi-plugin-gbrnf                   2/2     Running   0          13h
longhorn-csi-plugin-kjj6b                   2/2     Running   0          12h
longhorn-csi-plugin-tvbvj                   2/2     Running   0          12h
longhorn-driver-deployer-74bb5c9fcb-khmbk   1/1     Running   0          14h
longhorn-manager-82ztz                      1/1     Running   0          12h
longhorn-manager-8kmsn                      1/1     Running   0          12h
longhorn-manager-flmfl                      1/1     Running   0          12h
longhorn-manager-mz6zj                      1/1     Running   0          14h
longhorn-ui-77c6d6f5b7-nzsg2                1/1     Running   0          14h

Confirm the default StorageClass is ready:

# kubectl get sc
NAME                 PROVISIONER          RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
longhorn (default)   driver.longhorn.io   Delete          Immediate           true                   14h

Log in to the longhorn UI and confirm that the nodes are in a schedulable state.
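
The UI was exposed as a NodePort (30890) in the helm install above, so any node IP plus that port should work; the service name below is the chart default and worth double-checking in your release:

kubectl -n longhorn-system get svc longhorn-frontend
# then open http://<any-node-ip>:30890 in a browser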


Deploy KubeSphere

References:

https://kubesphere.com.cn/en/docs/installing-on-kubernetes/
https://github.com/kubesphere/ks-installer

To deploy KubeSphere 3.0, download the YAML manifests:

wget https://raw.githubusercontent.com/kubesphere/ks-installer/v3.0.0/deploy/kubesphere-installer.yaml
wget https://raw.githubusercontent.com/kubesphere/ks-installer/v3.0.0/deploy/cluster-configuration.yaml

Edit cluster-configuration.yaml and enable the components you need by setting the corresponding fields; the following is only an example:

  devops:
    enabled: true
    ......
  logging:
    enabled: true
    ......
  metrics_server:
    enabled: true
    ......
  openpitrix:
    enabled: true
    ......

Run the KubeSphere deployment:

kubectl apply -f kubesphere-installer.yaml
kubectl apply -f cluster-configuration.yaml

Follow the deployment log and confirm there are no errors:

kubectl logs -n kubesphere-system $(kubectl get pod -n kubesphere-system -l app=ks-install -o jsonpath='{.items[0].metadata.name}') -f
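
Once the installer log reports success, you can optionally wait for the console deployment to finish rolling out before logging in (a small convenience check, not part of the original steps):

kubectl -n kubesphere-system rollout status deployment/ks-console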

After the deployment completes, confirm all pods are running properly:

[root@k8s-master1 ~]# kubectl get pods -A | grep kubesphere
kubesphere-controls-system     default-http-backend-857d7b6856-q24v2                             1/1     Running     0          12h
kubesphere-controls-system     kubectl-admin-58f985d8f6-jl9bj                                    1/1     Running     0          11h
kubesphere-controls-system     kubesphere-router-demo-ns-6c97d4968b-njgrc                        1/1     Running     1          154m
kubesphere-devops-system       ks-jenkins-54455f5db8-hm6kc                                       1/1     Running     0          11h
kubesphere-devops-system       s2ioperator-0                                                     1/1     Running     1          11h
kubesphere-devops-system       uc-jenkins-update-center-cd9464fff-qnvfz                          1/1     Running     0          12h
kubesphere-logging-system      elasticsearch-logging-curator-elasticsearch-curator-160079hmdmb   0/1     Completed   0          11h
kubesphere-logging-system      elasticsearch-logging-data-0                                      1/1     Running     0          12h
kubesphere-logging-system      elasticsearch-logging-data-1                                      1/1     Running     0          12h
kubesphere-logging-system      elasticsearch-logging-discovery-0                                 1/1     Running     0          12h
kubesphere-logging-system      fluent-bit-c45h2                                                  1/1     Running     0          12h
kubesphere-logging-system      fluent-bit-kptfc                                                  1/1     Running     0          12h
kubesphere-logging-system      fluent-bit-rzjfp                                                  1/1     Running     0          12h
kubesphere-logging-system      fluent-bit-wztkp                                                  1/1     Running     0          12h
kubesphere-logging-system      fluentbit-operator-855d4b977d-fk6hs                               1/1     Running     0          12h
kubesphere-logging-system      ks-events-exporter-5bc4d9f496-x297f                               2/2     Running     0          12h
kubesphere-logging-system      ks-events-operator-8dbf7fccc-9qmml                                1/1     Running     0          12h
kubesphere-logging-system      ks-events-ruler-698b7899c7-fkn4l                                  2/2     Running     0          12h
kubesphere-logging-system      ks-events-ruler-698b7899c7-hw6rq                                  2/2     Running     0          12h
kubesphere-logging-system      logsidecar-injector-deploy-74c66bfd85-cxkxm                       2/2     Running     0          12h
kubesphere-logging-system      logsidecar-injector-deploy-74c66bfd85-lzxbm                       2/2     Running     0          12h
kubesphere-monitoring-system   alertmanager-main-0                                               2/2     Running     0          11h
kubesphere-monitoring-system   alertmanager-main-1                                               2/2     Running     0          11h
kubesphere-monitoring-system   alertmanager-main-2                                               2/2     Running     0          11h
kubesphere-monitoring-system   kube-state-metrics-95c974544-r8kmq                                3/3     Running     0          12h
kubesphere-monitoring-system   node-exporter-9ddxn                                               2/2     Running     0          12h
kubesphere-monitoring-system   node-exporter-dw929                                               2/2     Running     0          12h
kubesphere-monitoring-system   node-exporter-ht868                                               2/2     Running     0          12h
kubesphere-monitoring-system   node-exporter-nxdsm                                               2/2     Running     0          12h
kubesphere-monitoring-system   notification-manager-deployment-7c8df68d94-hv56l                  1/1     Running     0          12h
kubesphere-monitoring-system   notification-manager-deployment-7c8df68d94-ttdsg                  1/1     Running     0          12h
kubesphere-monitoring-system   notification-manager-operator-6958786cd6-pllgc                    2/2     Running     0          12h
kubesphere-monitoring-system   prometheus-k8s-0                                                  3/3     Running     1          11h
kubesphere-monitoring-system   prometheus-k8s-1                                                  3/3     Running     1          11h
kubesphere-monitoring-system   prometheus-operator-84d58bf775-5rqdj                              2/2     Running     0          12h
kubesphere-system              etcd-65796969c7-whbzx                                             1/1     Running     0          12h
kubesphere-system              ks-apiserver-b4dbcc67-2kknm                                       1/1     Running     0          11h
kubesphere-system              ks-apiserver-b4dbcc67-k6jr2                                       1/1     Running     0          11h
kubesphere-system              ks-apiserver-b4dbcc67-q8845                                       1/1     Running     0          11h
kubesphere-system              ks-console-786b9846d4-86hxw                                       1/1     Running     0          12h
kubesphere-system              ks-console-786b9846d4-l6mhj                                       1/1     Running     0          12h
kubesphere-system              ks-console-786b9846d4-wct8z                                       1/1     Running     0          12h
kubesphere-system              ks-controller-manager-7fd8799789-478ks                            1/1     Running     0          11h
kubesphere-system              ks-controller-manager-7fd8799789-hwgmp                            1/1     Running     0          11h
kubesphere-system              ks-controller-manager-7fd8799789-pdbch                            1/1     Running     0          11h
kubesphere-system              ks-installer-64ddc4b77b-c7qz8                                     1/1     Running     0          12h
kubesphere-system              minio-7bfdb5968b-b5v59                                            1/1     Running     0          12h
kubesphere-system              mysql-7f64d9f584-kvxcb                                            1/1     Running     0          12h
kubesphere-system              openldap-0                                                        1/1     Running     0          12h
kubesphere-system              openldap-1                                                        1/1     Running     0          12h
kubesphere-system              redis-ha-haproxy-5c6559d588-2rt6v                                 1/1     Running     9          12h
kubesphere-system              redis-ha-haproxy-5c6559d588-mhj9p                                 1/1     Running     8          12h
kubesphere-system              redis-ha-haproxy-5c6559d588-tgpjv                                 1/1     Running     11         12h
kubesphere-system              redis-ha-server-0                                                 2/2     Running     0          12h
kubesphere-system              redis-ha-server-1                                                 2/2     Running     0          12h
kubesphere-system              redis-ha-server-2                                                 2/2     Running     0          12h

Note that some KubeSphere components are deployed via helm:

[root@k8s-master1 ~]# helm ls -A | grep kubesphere
elasticsearch-logging           kubesphere-logging-system       1               2020-09-23 00:49:08.526873742 +0800 CST deployed        elasticsearch-1.22.1            6.7.0-0217                  
elasticsearch-logging-curator   kubesphere-logging-system       1               2020-09-23 00:49:16.117842593 +0800 CST deployed        elasticsearch-curator-1.3.3     5.5.4-0217                  
ks-events                       kubesphere-logging-system       1               2020-09-23 00:51:45.529430505 +0800 CST deployed        kube-events-0.1.0               0.1.0                       
ks-jenkins                      kubesphere-devops-system        1               2020-09-23 01:03:15.106022826 +0800 CST deployed        jenkins-0.19.0                  2.121.3-0217                
ks-minio                        kubesphere-system               2               2020-09-23 00:48:16.990599158 +0800 CST deployed        minio-2.5.16                    RELEASE.2019-08-07T01-59-21Z
ks-openldap                     kubesphere-system               1               2020-09-23 00:03:28.767712181 +0800 CST deployed        openldap-ha-0.1.0               1.0                         
ks-redis                        kubesphere-system               1               2020-09-23 00:03:19.439784188 +0800 CST deployed        redis-ha-3.9.0                  5.0.5                       
logsidecar-injector             kubesphere-logging-system       1               2020-09-23 00:51:57.519733074 +0800 CST deployed        logsidecar-injector-0.1.0       0.1.0                       
notification-manager            kubesphere-monitoring-system    1               2020-09-23 00:54:14.662762759 +0800 CST deployed        notification-manager-0.1.0      0.1.0                       
uc                              kubesphere-devops-system        1               2020-09-23 00:51:37.885154574 +0800 CST deployed        jenkins-update-center-0.8.0     3.0.0    

Get the web console port; the default is 30880:

kubectl get svc/ks-console -n kubesphere-system
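
Combining any node IP with that NodePort gives the console URL; a small sketch assuming the first node's InternalIP is reachable from your browser:

node_ip=$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
echo "http://${node_ip}:30880"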

The default login account is:

admin/P@88w0rd

Log in to the KubeSphere UI.

[screenshot]

Cluster node information:

[screenshot]

Service component information:

[screenshot]

The bound PV volumes in the longhorn UI:

[screenshot]

Volume details:

[screenshot]

Uninstall KubeSphere from the cluster

Reference:
https://kubesphere.com.cn/en/docs/installing-on-kubernetes/uninstalling/uninstalling-kubesphere-from-k8s/

wget https://raw.githubusercontent.com/kubesphere/ks-installer/master/scripts/kubesphere-delete.sh
sh kubesphere-delete.sh

Original post: https://blog.csdn.net/networken/article/details/105664147

    willqy I tweaked the title a little. For this post I'd suggest adding a short introductory paragraph about both Sealos and Longhorn at the beginning, along with links to their official websites.

    Feynman Changed the title to "Deploying KubeSphere v3.0.0 with Sealos + Longhorn".

    Hi, is the ks-minio error related to "deploying with Sealos + Longhorn"? If not, please open a new topic. Also, please describe the problem you hit during deployment in detail, thanks.

    Print out the minio logs; is it failing to bind the PV?

    10 days later

    There's a problem with longhorn storage

    With storage installed by the method above, PVs get created automatically during the KubeSphere installation, but they cannot be attached to the host, and errors like the following are reported:

    Events:
      Type     Reason              Age                          From                              Message
      Warning  FailedMount         41m (x414 over 3d6h)         kubelet, ecs-ebb9-0001.novalocal  Unable to attach or mount volumes: unmounted volumes=[export], unattached volumes=[minio-config-dir ks-minio-token-mw554 export]: timed out waiting for the condition
      Warning  FailedMount         7m26s (x406 over 3d6h)       kubelet, ecs-ebb9-0001.novalocal  Unable to attach or mount volumes: unmounted volumes=[export], unattached volumes=[ks-minio-token-mw554 export minio-config-dir]: timed out waiting for the condition
      Warning  FailedAttachVolume  94s (x2312 over 3d6h)        attachdetach-controller           AttachVolume.Attach failed for volume "pvc-08ff037b-339b-41f4-b22f-b5108b438507" : rpc error: code = Aborted desc = The volume pvc-08ff037b-339b-41f4-b22f-b5108b438507 is attaching
      Warning  FailedMount         <invalid> (x1247 over 3d6h)  kubelet, ecs-ebb9-0001.novalocal  Unable to attach or mount volumes: unmounted volumes=[export], unattached volumes=[export minio-config-dir ks-minio-token-mw554]: timed out waiting for the condition

    [root@ecs-ebb9-0004 ~]# kubectl -n kubesphere-system describe pv minio
    Error from server (NotFound): persistentvolumes "minio" not found
    [root@ecs-ebb9-0004 ~]# kubectl -n kubesphere-system describe pvc minio
    Name:          minio
    Namespace:     kubesphere-system
    StorageClass:  longhorn
    Status:        Bound
    Volume:        pvc-08ff037b-339b-41f4-b22f-b5108b438507
    Labels:        app=minio
                   app.kubernetes.io/managed-by=Helm
                   chart=minio-2.5.16
                   heritage=Helm
                   release=ks-minio
    Annotations:   meta.helm.sh/release-name: ks-minio
                   meta.helm.sh/release-namespace: kubesphere-system
                   pv.kubernetes.io/bind-completed: yes
                   pv.kubernetes.io/bound-by-controller: yes
                   volume.beta.kubernetes.io/storage-provisioner: driver.longhorn.io
    Finalizers:    [kubernetes.io/pvc-protection]
    Capacity:      20Gi
    Access Modes:  RWO
    VolumeMode:    Filesystem
    Mounted By:    minio-7bfdb5968b-kcfzd
    Events:        <none>
    [root@ecs-ebb9-0004 ~]# kubectl -n kubesphere-system describe pod minio-7bfdb5968b-kcfzd
    Name:           minio-7bfdb5968b-kcfzd
    Namespace:      kubesphere-system
    Priority:       0
    Node:           ecs-ebb9-0001.novalocal/192.168.0.231
    Start Time:     Mon, 05 Oct 2020 16:30:18 +0800
    Labels:         app=minio
                    pod-template-hash=7bfdb5968b
                    release=ks-minio
    Annotations:    checksum/config: c6cc7f4b40064dffd59b339e133fa4819f787573ee18e1d001435aa4daff8ba2
                    checksum/secrets: f9625c177e0e74a3b9997c3c65189ebffcfbde7aaa910de0ba38b48b032c1a96
    Status:         Pending
    IP:             
    IPs:            <none>
    Controlled By:  ReplicaSet/minio-7bfdb5968b
    Containers:
      minio:
        Container ID:  
        Image:         minio/minio:RELEASE.2019-08-07T01-59-21Z
        Image ID:      
        Port:          9000/TCP
        Host Port:     0/TCP
        Command:
          /bin/sh
          -ce
          /usr/bin/docker-entrypoint.sh minio -C /root/.minio/ server /data
        State:          Waiting
          Reason:       ContainerCreating
        Ready:          False
        Restart Count:  0
        Requests:
          cpu:      250m
          memory:   256Mi
        Liveness:   http-get http://:service/minio/health/live delay=5s timeout=1s period=30s #success=1 #failure=3
        Readiness:  http-get http://:service/minio/health/ready delay=5s timeout=1s period=15s #success=1 #failure=3
        Environment:
          MINIO_ACCESS_KEY:  <set to the key 'accesskey' in secret 'minio'>  Optional: false
          MINIO_SECRET_KEY:  <set to the key 'secretkey' in secret 'minio'>  Optional: false
          MINIO_BROWSER:     on
        Mounts:
          /data from export (rw)
          /root/.minio/ from minio-config-dir (rw)
          /var/run/secrets/kubernetes.io/serviceaccount from ks-minio-token-mw554 (ro)
    Conditions:
      Type              Status
      Initialized       True 
      Ready             False 
      ContainersReady   False 
      PodScheduled      True 
    Volumes:
      export:
        Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
        ClaimName:  minio
        ReadOnly:   false
      minio-user:
        Type:        Secret (a volume populated by a Secret)
        SecretName:  minio
        Optional:    false
      minio-config-dir:
        Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
        Medium:     
        SizeLimit:  <unset>
      ks-minio-token-mw554:
        Type:        Secret (a volume populated by a Secret)
        SecretName:  ks-minio-token-mw554
        Optional:    false
    QoS Class:       Burstable
    Node-Selectors:  <none>
    Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                     node.kubernetes.io/unreachable:NoExecute for 300s
    Events:
      Type     Reason              Age                          From                              Message
      ----     ------              ----                         ----                              -------
      Warning  FailedMount         41m (x414 over 3d6h)         kubelet, ecs-ebb9-0001.novalocal  Unable to attach or mount volumes: unmounted volumes=[export], unattached volumes=[minio-config-dir ks-minio-token-mw554 export]: timed out waiting for the condition
      Warning  FailedMount         7m26s (x406 over 3d6h)       kubelet, ecs-ebb9-0001.novalocal  Unable to attach or mount volumes: unmounted volumes=[export], unattached volumes=[ks-minio-token-mw554 export minio-config-dir]: timed out waiting for the condition
      Warning  FailedAttachVolume  94s (x2312 over 3d6h)        attachdetach-controller           AttachVolume.Attach failed for volume "pvc-08ff037b-339b-41f4-b22f-b5108b438507" : rpc error: code = Aborted desc = The volume pvc-08ff037b-339b-41f4-b22f-b5108b438507 is attaching
      Warning  FailedMount         <invalid> (x1247 over 3d6h)  kubelet, ecs-ebb9-0001.novalocal  Unable to attach or mount volumes: unmounted volumes=[export], unattached volumes=[export minio-config-dir ks-minio-token-mw554]: timed out waiting for the condition

    Two things to check: whether the longhorn dependency is installed, and, if pods run on the master nodes, whether the taints were removed so that the longhorn CSI plugin can also be scheduled there.

      1 month later

      willqy I installed longhorn and after it had been running for a while, a pod restart produced this error:

      Warning  FailedMount         5m1s (x2 over 11m)  kubelet, hcjt-itc-dl-v100-10  Unable to attach or mount volumes: unmounted volumes=[redis-pvc], unattached volumes=[default-token-lrpd8 redis-pvc]: timed out waiting for the condition
      Warning  FailedAttachVolume  97s (x16 over 16m)  attachdetach-controller       AttachVolume.Attach failed for volume "pvc-7a21e6c8-9eae-4866-8202-09a1aee0406b" : rpc error: code = NotFound desc = ControllerPublishVolume: the volume pvc-7a21e6c8-9eae-4866-8202-09a1aee0406b not exists
      Warning  FailedMount         26s (x5 over 14m)   kubelet, hcjt-itc-dl-v100-10  Unable to attach or mount volumes: unmounted volumes=[redis-pvc], unattached volumes=[redis-pvc default-token-lrpd8]: timed out waiting for the condition

        sunshuyan
        The PV can no longer be attached. Check its state in the longhorn UI and try manually attaching it to the corresponding node.

        7 days later

        Issue log: this covers prometheus PV expansion, PV data reset, and PV filesystem repair. longhorn still doesn't feel entirely reliable; if you have spare disks, consider rook instead, or at least don't back longhorn with the local filesystem, give it dedicated disks.

        prometheus failure

        A recent power outage restarted the cluster nodes, leaving one of the two prometheus pods dead and the other limping, and the KubeSphere UI could no longer display any monitoring data.

        Checking the pod status, one of them is Running, so monitoring should in theory still work:

        [root@jenkins ~]# kubectl -n kubesphere-monitoring-system get pods
        NAME                                               READY   STATUS             RESTARTS   AGE
        ......
        prometheus-k8s-0                                   3/3     Running            36         33d
        prometheus-k8s-1                                   2/3     CrashLoopBackOff   41         9d
        prometheus-operator-84d58bf775-g7hv8               2/2     Running            0          9d

        The logs of the CrashLoopBackOff pod show errors that look like file corruption:

        [root@jenkins ~]# kubectl -n kubesphere-monitoring-system logs -f prometheus-k8s-1  -c prometheus
        ......
        level=info ts=2020-11-26T01:05:21.880Z caller=main.go:583 msg="Scrape manager stopped"
        level=error ts=2020-11-26T01:05:21.880Z caller=main.go:764 err="opening storage failed: block dir: \"/prometheus/01EQ2ZQCKZEX21JP81GX10BPNK\": invalid character '\\x00' looking for beginning of value"

        After restarting the pod, the PV would no longer mount, so it looks like longhorn takes the blame here: an unexpected restart can apparently leave the PV unable to mount:

        [root@jenkins ~]# kubectl -n kubesphere-monitoring-system describe pods prometheus-k8s-1 
        ......
        Events:
          Type     Reason       Age    From               Message
          ----     ------       ----   ----               -------
          Normal   Scheduled    7m4s   default-scheduler  Successfully assigned kubesphere-monitoring-system/prometheus-k8s-1 to k8s-master1
          Warning  FailedMount  6m56s  kubelet            MountVolume.SetUp failed for volume "pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6 but could not correct them: fsck from util-linux 2.31.1
        /dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6 contains a file system with errors, check forced.
        /dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: Inode 262155 extent tree (at level 1) could be shorter.  IGNORED.
        /dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: Inode 262171 extent tree (at level 1) could be shorter.  IGNORED.
        /dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: Inode 262179 extent tree (at level 1) could be shorter.  IGNORED.
        /dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: Inode 262184 extent tree (at level 1) could be shorter.  IGNORED.
        /dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: Inode 262188 extent tree (at level 1) could be shorter.  IGNORED.
        /dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: Inode 262198 extent tree (at level 1) could be shorter.  IGNORED.
        /dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: Inode 262208 extent tree (at level 1) could be shorter.  IGNORED.
        /dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: Inode 262216 has an invalid extent node (blk 1081353, lblk 0)
        
        
        /dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
          (i.e., without -a or -p options)
        .
          Warning  FailedMount  5m2s                  kubelet  Unable to attach or mount volumes: unmounted volumes=[prometheus-k8s-db], unattached volumes=[config-out tls-assets prometheus-k8s-db prometheus-k8s-rulefiles-0 prometheus-k8s-token-n2nws config]: timed out waiting for the condition
          Warning  FailedMount  2m47s                 kubelet  Unable to attach or mount volumes: unmounted volumes=[prometheus-k8s-db], unattached volumes=[prometheus-k8s-db prometheus-k8s-rulefiles-0 prometheus-k8s-token-n2nws config config-out tls-assets]: timed out waiting for the condition
          Warning  FailedMount  35s (x10 over 6m54s)  kubelet  MountVolume.SetUp failed for volume "pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/longhorn/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6 but could not correct them: fsck from util-linux 2.31.1

        No obvious cause yet. Checking what is going on with the first, Running pod turns up a pile of no-space-left errors:

        [root@jenkins ~]# kubectl -n kubesphere-monitoring-system logs -f prometheus-k8s-0 -c prometheus
        level=warn ts=2020-11-26T00:00:23.231Z caller=manager.go:595 component="rule manager" group=node.rules msg="Rule sample appending failed" err="write to WAL: log samples: write /prometheus/wal/00000660: no space left on device"
        level=warn ts=2020-11-26T00:00:23.231Z caller=manager.go:595 component="rule manager" group=node.rules msg="Rule sample appending failed" err="write to WAL: log samples: write /prometheus/wal/00000660: no space left on device"

        The space is clearly exhausted. Checking the filesystem inside the pod confirms /prometheus is at 100% usage; presumably with the second pod broken, all data was written to the first pod and filled it up.

        [root@jenkins ~]# kubectl -n kubesphere-monitoring-system exec -it prometheus-k8s-0 -c prometheus -- df -h | grep prometheus
                                 19.6G     19.5G         0 100% /prometheus
                                457.1G     89.3G    367.8G  20% /etc/prometheus/config_out
        tmpfs                     7.8G         0      7.8G   0% /etc/prometheus/certs
                                457.1G     89.3G    367.8G  20% /etc/prometheus/rules/prometheus-k8s-rulefiles-0

        What to do? Searching the forum for the prometheus keyword turns up a related thread: you can change the monitoring data retention period from the default 7d down to 1d:

        https://kubesphere.com.cn/forum/d/657-prometheus

        # kubectl edit prometheuses -n kubesphere-monitoring-system
        
         retention: 1d
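
        The same change can also be applied non-interactively; a sketch assuming the Prometheus custom resource is named k8s, as it is in a default KubeSphere v3.0.0 install:

        kubectl -n kubesphere-monitoring-system patch prometheus k8s --type merge -p '{"spec":{"retention":"1d"}}'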

        Restarting the pods after the change had no effect, though; cleanup is probably periodic, so it may take up to a day. In the meantime, consider expanding the first pod's PV to get monitoring back first.

        The underlying storage is longhorn, so check the official docs for how to expand a PV:

        https://longhorn.io/docs/1.0.2/volumes-and-nodes/expansion/

        Newer Kubernetes versions already support PV expansion, it just isn't enabled by default; edit the StorageClass and add the field allowVolumeExpansion: true

        [root@jenkins longhorn]# kubectl edit sc longhorn 
        allowVolumeExpansion: true
        apiVersion: storage.k8s.io/v1
        ...
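
        The same field can also be set with a one-line patch instead of an interactive edit (a sketch, same effect as above):

        kubectl patch sc longhorn -p '{"allowVolumeExpansion": true}'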

        Then the volume needs to be detached from the node; do this from the longhorn UI.

        Once that is done, edit the corresponding PVC directly and change its size:

        [root@jenkins clone]# kubectl -n kubesphere-monitoring-system edit pvc prometheus-k8s-db-prometheus-k8s-0 
        ......
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 31Gi
        ...

        After the edit, the new PV and PVC sizes take effect automatically:

        [root@jenkins clone]# kubectl -n kubesphere-monitoring-system get pvc
        NAME                                 STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
        prometheus-k8s-db-prometheus-k8s-0   Bound    pvc-748d5256-d046-4c04-a37e-0edbe454f2ca   31Gi       RWO            longhorn       36d
        prometheus-k8s-db-prometheus-k8s-1   Bound    pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6   20Gi       RWO            longhorn       36d
        
        [root@jenkins clone]#  kubectl get pv
        NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                                             STORAGECLASS   REASON   AGE
        pvc-748d5256-d046-4c04-a37e-0edbe454f2ca   31Gi       RWO            Delete           Bound    kubesphere-monitoring-system/prometheus-k8s-db-prometheus-k8s-0   longhorn                36d
        pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6   20Gi       RWO            Delete           Bound    kubesphere-monitoring-system/prometheus-k8s-db-prometheus-k8s-1   longhorn                36d

        Re-attach the PV to the node from the longhorn UI.

        Inside the pod, however, the filesystem size was not updated automatically and was still 20G. Following the final steps in the longhorn docs, resize the filesystem manually; after a bit of fiddling it succeeds:

        volume_name=pvc-748d5256-d046-4c04-a37e-0edbe454f2ca
        mount /dev/longhorn/$volume_name /data/pv
        umount /dev/longhorn/$volume_name
        mount /dev/longhorn/$volume_name /data/pv
        
        [root@k8s-node1 ~]# resize2fs /dev/longhorn/$volume_name
        resize2fs 1.42.9 (28-Dec-2013)
        Filesystem at /dev/longhorn/pvc-748d5256-d046-4c04-a37e-0edbe454f2ca is mounted on /data/pv; on-line resizing required
        old_desc_blocks = 3, new_desc_blocks = 4
        The filesystem on /dev/longhorn/pvc-748d5256-d046-4c04-a37e-0edbe454f2ca is now 8126464 blocks long.
        
        umount /dev/longhorn/$volume_name

        Checking again, the filesystem has been expanded successfully:

        [root@jenkins longhorn]# kubectl -n kubesphere-monitoring-system exec -it prometheus-k8s-0 -c prometheus -- df -h | grep prometheus
                                 30.4G     15.9G     14.5G  52% /prometheus
                                457.1G     88.1G    369.0G  19% /etc/prometheus/config_out
        tmpfs                     7.8G         0      7.8G   0% /etc/prometheus/certs
                                457.1G     88.1G    369.0G  19% /etc/prometheus/rules/prometheus-k8s-rulefiles-0

        In the end the first pod's PV was expanded successfully, it runs normally, and monitoring in the KubeSphere UI is back to normal:

        [root@jenkins clone]# kubectl -n kubesphere-monitoring-system get pods
        NAME                                               READY   STATUS             RESTARTS   AGE
        ......
        prometheus-k8s-0                                   3/3     Running            1          73m
        prometheus-k8s-1                                   2/3     CrashLoopBackOff   3          2m2s
        prometheus-operator-84d58bf775-8tgmr               2/2     Running            0          5h47m

        To deal with the remaining crashing pod, find its PV name, pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6:

        [root@jenkins longhorn]# kubectl get pv |grep prometheus-k8s-1 
        pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6   32Gi       RWO            Delete           Bound    kubesphere-monitoring-system/prometheus-k8s-db-prometheus-k8s-1   longhorn                37d

        Go to the corresponding node and brute-force wipe the data:

        [root@k8s-master1 ~]# lsblk
        NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
        sda      8:0    0 465.8G  0 disk 
        ├─sda1   8:1    0   200M  0 part /boot/efi
        ├─sda2   8:2    0     1G  0 part /boot
        ├─sda3   8:3    0 406.8G  0 part /data
        ├─sda4   8:4    0    50G  0 part /
        └─sda5   8:5    0   7.8G  0 part 
        sdb      8:16   0    32G  0 disk /var/lib/kubelet/pods/ed0732c4-a8ae-4d3f-b78f-a09767589acd/volume-subpaths/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6/prometheus/2
        sdc      8:32   0     2G  0 disk /var/lib/kubelet/pods/dd454b29-a2d5-4f2f-8244-bf3dd4d21054/volumes/kubernetes.io~csi/pvc-4da824f3-b462-4e32-b3ae-5be3f811dea5/mount
        
        [root@k8s-master1 ~]# cd /var/lib/kubelet/pods/ed0732c4-a8ae-4d3f-b78f-a09767589acd/volume-subpaths/pvc-b62a16c1-6d47-4bad-8369-9bd98c473ee6/prometheus/2
        [root@k8s-master1 2]# rm -rf *

        After restarting the pod, it starts successfully:

        [root@k8s-master1 ~]# kubectl -n kubesphere-monitoring-system get pods
        NAME                                               READY   STATUS    RESTARTS   AGE
        alertmanager-main-0                                2/2     Running   2          10d
        alertmanager-main-1                                2/2     Running   4          36d
        alertmanager-main-2                                2/2     Running   4          34d
        kube-state-metrics-95c974544-8fjd8                 3/3     Running   3          34d
        node-exporter-mdqvj                                2/2     Running   4          36d
        node-exporter-p8glr                                2/2     Running   4          36d
        node-exporter-s8ffl                                2/2     Running   6          36d
        node-exporter-vsjkp                                2/2     Running   6          34d
        notification-manager-deployment-7c8df68d94-bdm25   1/1     Running   1          34d
        notification-manager-deployment-7c8df68d94-k6c2l   1/1     Running   2          36d
        notification-manager-operator-6958786cd6-lqtkq     2/2     Running   8          36d
        prometheus-k8s-0                                   3/3     Running   1          105m
        prometheus-k8s-1                                   3/3     Running   1          3m8s
        prometheus-operator-84d58bf775-8tgmr               2/2     Running   2          6h19m

        If you hit a prometheus "invalid magic number 0" error, the pod can also be recovered with the same brute-force cleanup:

        [root@jenkins longhorn]# kubectl -n kubesphere-monitoring-system logs -f prometheus-k8s-0 -c prometheus
        level=error ts=2020-11-27T06:33:33.531Z caller=main.go:764 err="opening storage failed: /prometheus/chunks_head/000516: invalid magic number 0"

        jenkins issue

        After a system upgrade and a reboot of the KubeSphere DevOps node, the jenkins pod turned out to be down:

        [root@jenkins argocd]# kubectl -n kubesphere-devops-system get pods
        NAME                                       READY   STATUS    RESTARTS   AGE
        ks-jenkins-54455f5db8-glhbs                0/1     Error     2          35d
        s2ioperator-0                              1/1     Running   2          35d
        uc-jenkins-update-center-cd9464fff-r5txz   1/1     Running   2          11d

        The logs show it is the longhorn PV again: it cannot be mounted and the filesystem looks corrupted:

        [root@jenkins argocd]# kubectl -n kubesphere-devops-system describe pods  ks-jenkins-54455f5db8-glhbs   
        ......
        Events:
          Type     Reason       Age                     From     Message
          ----     ------       ----                    ----     -------
          Warning  FailedMount  51m (x17 over 3h27m)    kubelet  Unable to attach or mount volumes: unmounted volumes=[jenkins-home], unattached volumes=[plugin-dir secrets-dir ks-jenkins-token-zzhd5 casc-config jenkins-home jenkins-config]: timed out waiting for the condition
          Warning  FailedMount  46m (x20 over 3h52m)    kubelet  Unable to attach or mount volumes: unmounted volumes=[jenkins-home], unattached volumes=[jenkins-home jenkins-config plugin-dir secrets-dir ks-jenkins-token-zzhd5 casc-config]: timed out waiting for the condition
          Warning  FailedMount  42m (x15 over 3h57m)    kubelet  Unable to attach or mount volumes: unmounted volumes=[jenkins-home], unattached volumes=[ks-jenkins-token-zzhd5 casc-config jenkins-home jenkins-config plugin-dir secrets-dir]: timed out waiting for the condition
          Warning  FailedMount  26m (x13 over 3h55m)    kubelet  Unable to attach or mount volumes: unmounted volumes=[jenkins-home], unattached volumes=[jenkins-config plugin-dir secrets-dir ks-jenkins-token-zzhd5 casc-config jenkins-home]: timed out waiting for the condition
          Warning  FailedMount  7m1s (x108 over 3h51m)  kubelet  MountVolume.SetUp failed for volume "pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf but could not correct them: fsck from util-linux 2.31.1
        /dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf contains a file system with errors, check forced.
        /dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf: Inode 131224 has an invalid extent node (blk 557148, lblk 0)
        
        
        /dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
          (i.e., without -a or -p options)
        .
          Warning  FailedMount  97s (x14 over 3h46m)  kubelet  Unable to attach or mount volumes: unmounted volumes=[jenkins-home], unattached volumes=[casc-config jenkins-home jenkins-config plugin-dir secrets-dir ks-jenkins-token-zzhd5]: timed out waiting for the condition

        The event finally suggests running fsck manually, but that fails because e2fsck is too old (proceed with caution and take a backup first, this can wreck the PV 🙂):

        [root@k8s-node1 ~]# fsck -cvf /dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf 
        fsck from util-linux 2.23.2
        e2fsck 1.42.9 (28-Dec-2013)
        /dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf has unsupported feature(s): metadata_csum
        e2fsck: Get a newer version of e2fsck!

        Download the latest e2fsprogs:

        https://distfiles.macports.org/e2fsprogs/

        wget https://distfiles.macports.org/e2fsprogs/e2fsprogs-1.45.6.tar.gz
        tar -zxvf e2fsprogs-1.45.6.tar.gz
        cd e2fsprogs-1.45.6
        ./configure && make && make install

        Run the check again:

        [root@k8s-node1 ~]# fsck.ext4 -y /dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf  
        e2fsck 1.45.6 (20-Mar-2020)
        /dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf contains a file system with errors, check forced.
        Pass 1: Checking inodes, blocks, and sizes
        Inode 131224 has an invalid extent node (blk 557148, lblk 0)
        Clear? yes
        
        Inode 131224 extent tree (at level 1) could be shorter.  Optimize? yes
        
        Inode 131224, i_blocks is 992, should be 0.  Fix? yes
        
        Inode 131227 extent block passes checks, but checksum does not match extent
                (logical block 16, physical block 107664, len 50)
        Fix? yes
        
        Inode 131227, i_blocks is 1080, should be 536.  Fix? yes
        
        
        Running additional passes to resolve blocks claimed by more than one inode...
        Pass 1B: Rescanning for multiply-claimed blocks
        Multiply-claimed block(s) in inode 131227: 112333
        Multiply-claimed block(s) in inode 131343: 112333
        Pass 1C: Scanning directories for inodes with multiply-claimed blocks
        Pass 1D: Reconciling multiply-claimed blocks
        (There are 2 inodes containing multiply-claimed blocks.)
        
        File /support/all_2020-11-24_20.29.26.log (inode #131227, mod time Wed Nov 25 21:00:09 2020) 
          has 1 multiply-claimed block(s), shared with 1 file(s):
                /jobs/demo-devops4t6ff/jobs/demo-pipeline/builds/7/workflow/4.xml (inode #131343, mod time Thu Nov 26 22:33:19 2020)
        Clone multiply-claimed blocks? yes
        
        File /jobs/demo-devops4t6ff/jobs/demo-pipeline/builds/7/workflow/4.xml (inode #131343, mod time Thu Nov 26 22:33:19 2020) 
          has 1 multiply-claimed block(s), shared with 1 file(s):
                /support/all_2020-11-24_20.29.26.log (inode #131227, mod time Wed Nov 25 21:00:09 2020)
        Multiply-claimed blocks already reassigned or cloned.
        
        Pass 1E: Optimizing extent trees
        Pass 2: Checking directory structure
        Pass 3: Checking directory connectivity
        Pass 4: Checking reference counts
        Pass 5: Checking group summary information
        Block bitmap differences:  -(87056--87071) -(87088--87167) -(87296--87354) -(87424--87482) -(106512--106618) -(106624--106629) -(107024--107135) +(107664--107713) -108092 +110755 +110769 +110811 +110815 +110846 +110862 +112328 +112342 +112369 +112377 +112382 -123851 -123856 -123897 -(123899--123901) -124722 -124756 -126409 -126435 -126459 -129237 -129246 -129267 -129271 -129284 -129293 -130369 -130377 -130419 -130432 -130471 -132717 -132720 -132774 -132778 -132796 -133198 -133210 -557145 -557148 -557168 -559108 -559165 -559194 -559204 -560608 -560611 -560625 -560628 -563206 -563229 -563259 -563272 -564614 -564645 -564657 -564677 -564743 -564749 -564758 -564763 -564788 -565628 -565918 -565952 +566488 +566493 +566951 +566964 -567308 -568137 -568140 -568144 -568146 -568172 -568198 -568721 -568765 -568779
        Fix? yes
        
        Free blocks count wrong for group #0 (22415, counted=22414).
        Fix? yes
        
        Free blocks count wrong for group #2 (292, counted=506).
        Fix? yes
        
        Free blocks count wrong for group #3 (17798, counted=17985).
        Fix? yes
        
        Free blocks count wrong for group #4 (32761, counted=32768).
        Fix? yes
        
        Free blocks count wrong for group #17 (31489, counted=31522).
        Fix? yes
        
        Free blocks count wrong (1900427, counted=1900858).
        Fix? yes
        
        
        /dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf: ***** FILE SYSTEM WAS MODIFIED *****
        /dev/longhorn/pvc-364adde3-9e86-42ee-95e6-c2d9e12a6bdf: 5862/524288 files (0.6% non-contiguous), 196294/2097152 blocks

        Restart the pod and it is back to normal:

        [root@jenkins longhorn]# kubectl -n kubesphere-devops-system get pods
        NAME                                       READY   STATUS    RESTARTS   AGE
        ks-jenkins-54455f5db8-w7rqw                1/1     Running   0          13m
        s2ioperator-0                              1/1     Running   2          35d
        uc-jenkins-update-center-cd9464fff-r5txz   1/1     Running   2          11d

          willqy 👍
          Also, if you want to know why a PVC cannot attach to its pod, dig further into the longhorn logs. Volume attach is handled by the CSI node component, so check the logs of longhorn-csi-plugin-xxx; the kubelet logs may contain useful information too. As a distributed storage backend, longhorn may still fall a bit short on stability; NFS or Ceph are worth considering as alternatives.