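Background

Users often ask: how do I restart my K8s? How do I restart KubeSphere? What they usually mean is: after I rebooted the server, neither K8s nor KubeSphere came back up, so how do I fix that?

Common problem

After a server reboot, running kubectl over SSH often fails with the error below, and the KubeSphere console is unreachable as well. Many users mistakenly conclude that KubeSphere itself is broken.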

kubectl get pod --all-namespaces
The connection to the server lb.kubesphere.local:6443 was refused - did you specify the right host or port?

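Solution

Note that neither K8s nor KubeSphere is something you "restart" as such; only Docker can be restarted. Normally K8s and Docker recover on their own after a server reboot, and KubeSphere resumes running automatically.

Occasionally, though, the problem above appears. The most likely cause is that the Docker daemon failed to start on the server, and restarting Docker with systemctl is enough to fix it: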

sudo systemctl daemon-reload
sudo systemctl restart docker
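
After Docker restarts, the control plane containers may need a little while before the API server answers again, so an immediate kubectl call can still be refused. The helper below is a minimal sketch of a polling loop; the function name `retry` and the attempt count and delay in the example are illustrative assumptions, not values from the original post.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND...
# Runs COMMAND until it succeeds or ATTEMPTS tries have been used up.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Only sleep when another attempt is still to come.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (assumed values: up to 30 tries, 5 seconds apart):
#   retry 30 5 kubectl get nodes
```

If the loop gives up, the Docker restart alone was probably not enough and the kubelet logs are the next place to look.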

Check the cluster's Pods again. Everything is back to Running, and you can log in to KubeSphere as usual:

kubectl get pod --all-namespaces
NAMESPACE                      NAME                                           READY   STATUS    RESTARTS   AGE
kube-system                    calico-kube-controllers-76d4774d89-kfrbs       1/1     Running   2          38h
kube-system                    calico-node-j2kp9                              1/1     Running   2          38h
kube-system                    coredns-6dd6674597-hlhwq                       1/1     Running   2          38h
kube-system                    coredns-6dd6674597-lmzrk                       1/1     Running   2          38h
kube-system                    kube-apiserver-node2                           1/1     Running   3          38h
kube-system                    kube-controller-manager-node2                  1/1     Running   3          38h
kube-system                    kube-proxy-8hqtk                               1/1     Running   4          38h
kube-system                    kube-scheduler-node2                           1/1     Running   3          38h
kube-system                    nodelocaldns-2lxn2                             1/1     Running   2          38h
kube-system                    openebs-localpv-provisioner-84446d4bd7-xvhrd   1/1     Running   3          36h
kube-system                    openebs-ndm-operator-6456dc9db-th787           1/1     Running   3          36h
kube-system                    openebs-ndm-pzbfw                              1/1     Running   3          36h
kubesphere-controls-system     default-http-backend-857d7b6856-v4gbk          1/1     Running   1          35h
kubesphere-controls-system     kubectl-admin-d4bcbdccc-7zq87                  1/1     Running   1          35h
kubesphere-monitoring-system   alertmanager-main-0                            2/2     Running   2          35h
kubesphere-monitoring-system   kube-state-metrics-95c974544-qsmp6             3/3     Running   3          35h
kubesphere-monitoring-system   node-exporter-sqjg8                            2/2     Running   2          35h
kubesphere-monitoring-system   prometheus-k8s-0                               3/3     Running   4          35h
kubesphere-monitoring-system   prometheus-operator-84d58bf775-qjdn2           2/2     Running   2          35h
kubesphere-system              ks-apiserver-867c6668bd-l6czq                  1/1     Running   0          7m9s
kubesphere-system              ks-console-959df9898-gqkzl                     1/1     Running   0          7m9s
kubesphere-system              ks-controller-manager-7c9f7fc6f7-lss6l         1/1     Running   0          7m8s
kubesphere-system              ks-installer-5b988669b9-7c74v                  1/1     Running   1          35h
kubesphere-system              openldap-0                                     1/1     Running   1          35h
kubesphere-system              redis-644bc597b9-blx6p                         1/1     Running   1          35h
10 months later

I ran:
sudo systemctl daemon-reload
sudo systemctl restart docker
It had no effect.
ps -ef | grep docker
shows quite a few containers running.
ping lb.kubesphere.local also succeeds.
But the following error persists:
The connection to the server lb.kubesphere.local:6443 was refused - did you specify the right host or port?

    Running journalctl -xefu kubelet
    shows node "node1" not found, yet node1 is the very machine I am working on.

    4月 23 17:00:40 node1 kubelet[809]: E0423 17:00:40.942104 809 kubelet.go:2268] node “node1” not found
    4月 23 17:00:41 node1 kubelet[809]: E0423 17:00:41.042342 809 kubelet.go:2268] node “node1” not found
    4月 23 17:00:41 node1 kubelet[809]: E0423 17:00:41.052967 809 eviction_manager.go:260] eviction manager: failed to get summary stats: failed to get node info: node “node1” not found
    4月 23 17:00:41 node1 kubelet[809]: I0423 17:00:41.088311 809 kubelet_node_status.go:294] Setting node annotation to enable volume controller attach/detach
    4月 23 17:00:41 node1 kubelet[809]: E0423 17:00:41.142523 809 kubelet.go:2268] node “node1” not found
    4月 23 17:00:41 node1 kubelet[809]: I0423 17:00:41.182145 809 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: 53a01358e7a5f2cb168861bda976f4fd05d3046953d284ab5eb44e332ed34a56
    4月 23 17:00:41 node1 kubelet[809]: E0423 17:00:41.184157 809 pod_workers.go:191] Error syncing pod 90464b1efed9533a2f90c07565ba22ba (“kube-apiserver-node1_kube-system(90464b1efed9533a2f90c07565ba22ba)”), skipping: failed to “StartContainer” for “kube-apiserver” with CrashLoopBackOff: “back-off 5m0s restarting failed container=kube-apiserver pod=kube-apiserver-node1_kube-system(90464b1efed9533a2f90c07565ba22ba)”
    4月 23 17:00:41 node1 kubelet[809]: E0423 17:00:41.242776 809 kubelet.go:2268] node “node1” not found
    4月 23 17:00:41 node1 kubelet[809]: E0423 17:00:41.343141 809 kubelet.go:2268] node “node1” not found
    4月 23 17:00:41 node1 kubelet[809]: E0423 17:00:41.444885 809 kubelet.go:2268] node “node1” not found
    4月 23 17:00:41 node1 kubelet[809]: E0423 17:00:41.545132 809 kubelet.go:2268] node “node1” not found
    4月 23 17:00:41 node1 kubelet[809]: E0423 17:00:41.645382 809 kubelet.go:2268] node “node1” not found
    4月 23 17:00:41 node1 kubelet[809]: E0423 17:00:41.745623 809 kubelet.go:2268] node “node1” not found
    4月 23 17:00:41 node1 kubelet[809]: I0423 17:00:41.777522 809 kubelet_node_status.go:294] Setting node annotation to enable volume controller attach/detach
    4月 23 17:00:41 node1 kubelet[809]: I0423 17:00:41.843589 809 kubelet_node_status.go:70] Attempting to register node node1
    4月 23 17:00:41 node1 kubelet[809]: E0423 17:00:41.844344 809 kubelet_node_status.go:92] Unable to register node “node1” with API server: Post https://lb.kubesphere.local:6443/api/v1/nodes: dial tcp 192.168.10.201:6443: connect: connection refused
    4月 23 17:00:41 node1 kubelet[809]: E0423 17:00:41.845814 809 kubelet.go:2268] node “node1” not found
    4月 23 17:00:41 node1 kubelet[809]: E0423 17:00:41.946087 809 kubelet.go:2268] node “node1” not found
    4月 23 17:00:42 node1 kubelet[809]: E0423 17:00:42.046336 809 kubelet.go:2268] node “node1” not found
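
In the log above, the repeated "not found" lines appear to be a consequence rather than the cause: the kubelet cannot look up its node object while the API server is down. The more informative entries are the CrashLoopBackOff for kube-apiserver and the refused connection on port 6443. As a rough sketch, the filter below keeps only error-level kubelet lines and drops that repeated message; the function name `kubelet_errors` is hypothetical and the pattern is based only on the log format shown here.

```shell
# kubelet_errors: reads kubelet journal lines on stdin and prints only
# error-level entries (E followed by the MMDD date), without the repeated
# 'node ... not found' message from kubelet.go.
kubelet_errors() {
  grep -E ' E[0-9]{4} ' | grep -Ev 'kubelet\.go:[0-9]+\] node .* not found$'
}

# Example (not run here):
#   journalctl -u kubelet --no-pager -n 200 | kubelet_errors
# A likely next step is reading the crashing container's own logs, e.g.
#   docker ps -a --filter name=k8s_kube-apiserver
#   docker logs --tail 50 <container id from the previous command>
```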

    Running docker start $(docker ps -a | awk '{ print $1}' | tail -n +2) gives:
    Error response from daemon: cannot join network of a non running container: a511ba5f2affcccde6f3617eb0f05e1fce5f1832772782e35f7b06c6f4d5ead2
    Error response from daemon: cannot join network of a non running container: 6a90fa79ed308a1b04824a17f99ecdb9b91d60d55bf9b9d0c423a08367de9357
    a511ba5f2aff
    6a90fa79ed30
    Error response from daemon: cannot join network of a non running container: 5f504334cd9a159562e0ce9a7495c087bddec92b177debe403a69e9d4b4ce8fe
    5f504334cd9a
    Error response from daemon: cannot join network of a non running container: 56cf795700c3d974ea969f38183e627114d27e14e0ec3371c13d9c282e32feb1
    56cf795700c3
    Error response from daemon: cannot join network of a non running container: cd60b579fb5e4ca1a5e52731db7fc579203e90e0029118cbd7ac112b0f5b8fc8
    Error response from daemon: cannot join network of a non running container: 37c91b0aaac04d5b805364199606dbdbedc6a5b98aa2f2a1f9b8b62efa06fe81
    Error response from daemon: cannot join network of a non running container: 37c91b0aaac04d5b805364199606dbdbedc6a5b98aa2f2a1f9b8b62efa06fe81
    37c91b0aaac0
    Error response from daemon: cannot join network of a non running container: 0ecc2360b1e982d78ab09492433e3797955bcfa3f91b37779e1cd4037b3e9dc0
    0ecc2360b1e9
    Error response from daemon: cannot join network of a non running container: 2e28d93113ad2da3aeedaeb8bad1ab80b1b508c2a9bf71aedeaba85f6f5aa15c
    Error response from daemon: cannot join network of a non running container: 2e28d93113ad2da3aeedaeb8bad1ab80b1b508c2a9bf71aedeaba85f6f5aa15c
    Error response from daemon: cannot join network of a non running container: 2e28d93113ad2da3aeedaeb8bad1ab80b1b508c2a9bf71aedeaba85f6f5aa15c
    2e28d93113ad
    Error response from daemon: cannot join network of a non running container: ee19daa4a38929b0d3fea22b14139c298aa732cef00ed35bd0c943fe9fa1cb70
    Error response from daemon: cannot join network of a non running container: ee19daa4a38929b0d3fea22b14139c298aa732cef00ed35bd0c943fe9fa1cb70
    ee19daa4a389
    Error response from daemon: cannot join network of a non running container: 0d2f99afb5a5ec534e5944c5cabb211773a8ad214be34d0faedc9df17df75eb3
    Error response from daemon: cannot join network of a non running container: 13f81fe97b89faf5609332aa7deb7a16f335fd8af96edf07ae9b2d3da7ab4aa4
    Error response from daemon: cannot join network of a non running container: 83489694de3d632076a4bcf2bfbaf90b439fdd6e53f4c23e7457ec650a1f6f8a
    13f81fe97b89
    cd60b579fb5e
    0d2f99afb5a5
    83489694de3d
    Error response from daemon: cannot join network of a non running container: afb9aa18bf18406f847dd95d2c0109bafa3394ebb5e9e274059b5f7ca1fd3a1b
    Error response from daemon: cannot join network of a non running container: 0268001c7d40e00f94702113e0a8dbc7098cb665d0fd8da6d74184dabf32d525
    Error response from daemon: cannot join network of a non running container: 8ddeb0e18b90ec4a6cc82e32c2c7f25df03180fbc179f03173d237b8e9b105e0
    afb9aa18bf18
    0268001c7d40
    8ddeb0e18b90
    Error response from daemon: cannot join network of a non running container: c366fcd837665aec35465f3a9205384584b83135550952b77839a188a33ce6b5
    Error response from daemon: cannot join network of a non running container: c366fcd837665aec35465f3a9205384584b83135550952b77839a188a33ce6b5
    Error response from daemon: cannot join network of a non running container: c366fcd837665aec35465f3a9205384584b83135550952b77839a188a33ce6b5
    Error response from daemon: cannot join network of a non running container: c366fcd837665aec35465f3a9205384584b83135550952b77839a188a33ce6b5
    c366fcd83766
    Error response from daemon: cannot join network of a non running container: de694d0c0820a4fc43b803e266a30d09713cafb40a1dd9a32f4d3ae0f21cb65b
    de694d0c0820
    Error response from daemon: cannot join network of a non running container: 0c39be3fdaf0bc8b463e8e2816dbd4573a57742ee7fab38032ed21707d27d285
    0c39be3fdaf0
    Error: failed to start containers: d7c732efd69d, 493689545c02, 40dbff812406, 13754729ac50, 32c518747ce5, dce0f93dcc2c, 0f4d67aa4059, 53376ffa15a2, 5cfabc176015, 562660e17e50, 271418226fb8, e146c2972d23, eb2f3a4a994d, 4accc049da2a, 57c65659ce72, ee314eed0b27, 6a67eeced599, e16a5dec3b64, 70f16e4ada4f, 3382a1953b06, 8672e683a779, 9aa774641b5c, b4f72aa37236, b6956ab88fac, 805bd6ca615e

      4 months later

      alpeai try this

      sudo -i
      swapoff -a
      exit
      strace -eopenat kubectl version
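
The `swapoff -a` suggestion is relevant because, by default, the kubelet refuses to start while swap is enabled, and `swapoff -a` only lasts until the next reboot unless the swap entry in /etc/fstab is also disabled. A small sketch for checking whether swap came back after a reboot follows; the function name `swap_active` is hypothetical, and the optional file argument exists only so the check can be tried against a sample file.

```shell
# swap_active [FILE]: succeeds if FILE (default /proc/swaps) lists at least
# one swap area. The first line of /proc/swaps is a header, so any further
# line means swap is currently enabled.
swap_active() {
  swaps_file=${1:-/proc/swaps}
  awk 'NR > 1 { found = 1 } END { exit !found }' "$swaps_file"
}

# Example (not run here):
#   swap_active && echo "swap is on; the kubelet may refuse to start"
```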
      6 months later

      I ran into this too. The cause turned out to be expired certificates, and reissuing them fixed it.
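
An expired certificate can be ruled in or out quickly with openssl. The sketch below wraps `openssl x509 -checkend`, which exits non-zero when the certificate expires within the given number of seconds (0 means "already expired"). The function name `cert_valid` is hypothetical, and the path in the example is the kubeadm default, which is an assumption about how the cluster was installed.

```shell
# cert_valid FILE: succeeds if the certificate in FILE has not yet expired.
cert_valid() {
  openssl x509 -checkend 0 -noout -in "$1" >/dev/null
}

# Example (not run here), assuming the kubeadm default location:
#   cert_valid /etc/kubernetes/pki/apiserver.crt \
#     || echo "apiserver certificate has expired"
```

On kubeadm-based clusters, `kubeadm certs check-expiration` (or `kubeadm alpha certs check-expiration` on older releases) should report the expiry dates of all control plane certificates at once.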

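      1 month later

      alpeai docker container ps -a shows that the containers will not come up, with the error cannot join network of a non running container. The containers have to be started in order; for example, some components need their pause container started first.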
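
The ordering matters because a pod's application containers share the network namespace of that pod's pause container, so they cannot start before it. As a rough sketch of "pause containers first", the function below reorders a container listing. It assumes Docker is the runtime and that the kubelet named the pause containers with the `k8s_POD_` prefix; the function name `pause_first` is hypothetical. The kubelet normally recreates these containers on its own, so a manual start like this is only a workaround.

```shell
# pause_first: reads "ID NAME" pairs on stdin and prints the IDs with pod
# sandbox (pause) containers first, followed by all other containers.
# Assumption: pause containers are named k8s_POD_<pod>_...
pause_first() {
  awk '
    $2 ~ /^k8s_POD_/ { print $1; next }   # pause containers are printed at once
    { rest[n++] = $1 }                    # everything else is held back
    END { for (i = 0; i < n; i++) print rest[i] }
  '
}

# Example (not run here):
#   docker ps -a --format "{{.ID}} {{.Names}}" | pause_first | xargs -r docker start
```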
