Load-balancing problem in a cluster with multiple master nodes

caixuhui


Operating system information
Virtual machine, CentOS 7.9, 32 cores / 60 GB RAM

Kubernetes version information
Output of kubectl version:

Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.9", GitCommit:"4fb7ed12476d57b8437ada90b4f93b17ffaeed99", GitTreeState:"clean", BuildDate:"2020-07-15T16:18:16Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.9", GitCommit:"4fb7ed12476d57b8437ada90b4f93b17ffaeed99", GitTreeState:"clean", BuildDate:"2020-07-15T16:10:45Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

Container runtime
Output of docker version:

Client: Docker Engine - Community
 Version:           20.10.12
 API version:       1.41
 Go version:        go1.16.12
 Git commit:        e91ed57
 Built:             Mon Dec 13 11:45:41 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.12
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.12
  Git commit:       459d0df
  Built:            Mon Dec 13 11:44:05 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.12
  GitCommit:        7b11cfaabd73bb80907dd23182b9347b4245eb5d
 runc:
  Version:          1.0.2
  GitCommit:        v1.0.2-0-g52b36a2
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

crictl version

-bash: crictl: command not found

nerdctl version

-bash: nerdctl: command not found

KubeSphere version information
v3.0.0, installed offline with kk.

What is the problem
I created a new cluster with 3 master nodes and 2 worker nodes:

master-1
master-2
master-3
worker-1
worker-2

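The masters are load-balanced with HAProxy + keepalived.

However, if master-1 goes down, the KubeSphere web console becomes unreachable. If master-2 or master-3 goes down, the console is still reachable.

The cluster was originally installed by running the kk command on master-1. After master-1 went down, I checked the etcd status on another master with systemctl status etcd.

10.0.40.197 is the IP of master-1.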

● etcd.service - etcd docker wrapper
   Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: active (running) since 四 2023-03-23 10:15:15 CST; 12min ago
 Main PID: 10143 (etcd)
    Tasks: 10
   Memory: 19.6M
   CGroup: /system.slice/etcd.service
           ├─10143 /bin/bash /usr/local/bin/etcd
           └─10145 /usr/bin/docker run --restart=on-failure:5 --env-file=/etc/etcd.env --net=host -v /etc/ssl/certs:/etc/ssl/certs:ro -v /etc/ssl/etcd/ssl:/etc/ssl/etcd/ssl:ro -v /var/lib/etcd:/var/lib/etcd:rw --memory=512M --blkio-weight=1000 --name=etcd2 dockerhub.kubekey.loca...

3月 23 10:27:46 ks-node-4 etcd[10143]: 2023-03-23 02:27:46.231768 W | etcdserver: failed to reach the peerURL(https://10.0.40.197:2380) of member aacc57ad2318b2b1 (Get https://10.0.40.197:2380/version: dial tcp 10.0.40.197:2380: connect: no route to host)
3月 23 10:27:46 ks-node-4 etcd[10143]: 2023-03-23 02:27:46.231788 W | etcdserver: cannot get the version of member aacc57ad2318b2b1 (Get https://10.0.40.197:2380/version: dial tcp 10.0.40.197:2380: connect: no route to host)
3月 23 10:27:50 ks-node-4 etcd[10143]: 2023-03-23 02:27:50.640410 W | rafthttp: health check for peer aacc57ad2318b2b1 could not connect: dial tcp 10.0.40.197:2380: connect: no route to host (prober "ROUND_TRIPPER_SNAPSHOT")
3月 23 10:27:50 ks-node-4 etcd[10143]: 2023-03-23 02:27:50.640428 W | rafthttp: health check for peer aacc57ad2318b2b1 could not connect: dial tcp 10.0.40.197:2380: connect: no route to host (prober "ROUND_TRIPPER_RAFT_MESSAGE")
3月 23 10:27:52 ks-node-4 etcd[10143]: 2023-03-23 02:27:52.243686 W | etcdserver: failed to reach the peerURL(https://10.0.40.197:2380) of member aacc57ad2318b2b1 (Get https://10.0.40.197:2380/version: dial tcp 10.0.40.197:2380: connect: no route to host)
3月 23 10:27:52 ks-node-4 etcd[10143]: 2023-03-23 02:27:52.243702 W | etcdserver: cannot get the version of member aacc57ad2318b2b1 (Get https://10.0.40.197:2380/version: dial tcp 10.0.40.197:2380: connect: no route to host)
3月 23 10:27:55 ks-node-4 etcd[10143]: 2023-03-23 02:27:55.640727 W | rafthttp: health check for peer aacc57ad2318b2b1 could not connect: dial tcp 10.0.40.197:2380: connect: no route to host (prober "ROUND_TRIPPER_RAFT_MESSAGE")
3月 23 10:27:55 ks-node-4 etcd[10143]: 2023-03-23 02:27:55.640771 W | rafthttp: health check for peer aacc57ad2318b2b1 could not connect: dial tcp 10.0.40.197:2380: connect: no route to host (prober "ROUND_TRIPPER_SNAPSHOT")
3月 23 10:27:58 ks-node-4 etcd[10143]: 2023-03-23 02:27:58.255969 W | etcdserver: failed to reach the peerURL(https://10.0.40.197:2380) of member aacc57ad2318b2b1 (Get https://10.0.40.197:2380/version: dial tcp 10.0.40.197:2380: connect: no route to host)
3月 23 10:27:58 ks-node-4 etcd[10143]: 2023-03-23 02:27:58.255989 W | etcdserver: cannot get the version of member aacc57ad2318b2b1 (Get https://10.0.40.197:2380/version: dial tcp 10.0.40.197:2380: connect: no route to host)
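
The warnings above show this node probing a peer at master-1, which suggests etcd has more than one member. Whether the remaining members can still serve requests depends on quorum. The following is a minimal sketch of that arithmetic, assuming one etcd member per master. The etcdctl command in the comments is only a suggestion: client port 2379 is the etcd default and is assumed here, and the certificate file names are hypothetical placeholders, since only the /etc/ssl/etcd/ssl directory appears in the output above.

```shell
# Quorum for an etcd cluster of N members is floor(N/2) + 1.
quorum() {
  local members=$1
  echo $(( members / 2 + 1 ))
}

# Succeeds when "healthy" members out of "members" still form a quorum.
has_quorum() {
  local members=$1 healthy=$2
  [ "$healthy" -ge "$(quorum "$members")" ]
}

# Example: three members with one down.
if has_quorum 3 2; then
  echo "3 members, 2 healthy: quorum kept"
fi

# On a surviving master, the real member health could be checked with
# etcdctl. Port 2379 is assumed, and the <...> file names are placeholders.
#   ETCDCTL_API=3 etcdctl \
#     --endpoints=https://127.0.0.1:2379 \
#     --cacert=/etc/ssl/etcd/ssl/<ca>.pem \
#     --cert=/etc/ssl/etcd/ssl/<cert>.pem \
#     --key=/etc/ssl/etcd/ssl/<key>.pem \
#     endpoint health --cluster
```

If two of three members report healthy, etcd itself should still be usable, and the failure is more likely somewhere between the clients and the API server.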

Running kubectl get nodes on the other nodes returns:

Unable to connect to the server: EOF
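
One thing worth checking for this EOF error is which address kubectl is actually talking to. The sketch below is an assumption about where to look, not a confirmed diagnosis: if the kubeconfig server entry, or the host name in it, resolves to master-1 (10.0.40.197) instead of the keepalived virtual IP, every client would fail as soon as master-1 is down. The helper names server_host and resolve_host are made up for this example.

```shell
# Strip the scheme, port and path from an API server URL, leaving the host.
# Note: this simple version does not handle bracketed IPv6 addresses.
server_host() {
  local url=$1
  url=${url#*://}     # drop "https://"
  url=${url%%/*}      # drop any path
  echo "${url%%:*}"   # drop ":port"
}

# Print the address(es) a host name resolves to on this node.
resolve_host() {
  getent hosts "$1" | awk '{print $1}'
}

# Usage on a surviving master. kubectl config view only reads the local
# kubeconfig, so it works even while the API server is unreachable:
#   server=$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')
#   echo "kubeconfig points at: $server"
#   resolve_host "$(server_host "$server")"
```

If the resolved address turns out to be master-1 rather than the virtual IP, the load balancer is being bypassed, which would match the symptoms described here.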

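At this point, services that were already running keep working. But if the server hosting a service also goes down, the service is not rescheduled (I waited 10 minutes) and can no longer be reached. My guess is that only master-1 is doing the scheduling, and master-2/master-3 are doing nothing at that point.

If master-2 or master-3 goes down instead, the whole cluster stays healthy.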
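
As a hedged way to test the scheduling guess: kube-scheduler and kube-controller-manager normally run on every master and pick one active instance through leader election, so the other masters are standbys rather than idle. The sketch below extracts the current holder from a leader-election record. The helper name leader_of and the sample record are made up, and the kubectl command in the comments assumes the v1.17 scheduler still records its leader in an annotation on the kube-scheduler Endpoints object; it also needs a reachable API server, so it would have to be run while master-1 is up.

```shell
# Extract holderIdentity from a leader-election record (JSON on stdin).
leader_of() {
  sed -n 's/.*"holderIdentity"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Sample record with made-up values, shaped like the leader annotation.
sample='{"holderIdentity":"master-1_0000","leaseDurationSeconds":15}'
printf '%s\n' "$sample" | leader_of

# Against a live cluster (assumed annotation name, see the note above):
#   kubectl -n kube-system get endpoints kube-scheduler \
#     -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}' \
#     | leader_of
```

If the holder moves to another master after the current leader is stopped, scheduling itself fails over correctly, and the outage is more likely caused by how clients reach the API server.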

goodmanljj

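There is not much information to go on, so here is my guess.

Your control-plane masters are set up with high availability and load balancing, but judging from the screenshot, your etcd is probably not highly available, and the only etcd instance is deployed on master-1. That could cause the behavior you are seeing.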

zhuhj

I'm running into the same problem. Did the original poster solve it?

frezes

zhuhj
I suggest opening a new thread and describing your problem in full.

zhuhj

frezes

I've opened a new thread. Could you please take a look and see what the problem is?

https://kubesphere.io/forum/d/9627-mastermaster1kubesphere/7