• Installation & Deployment
  • How do I diagnose problems in an installed k8s cluster?

From your VM, run:
ping 10.10.10.2
telnet 10.10.10.2 53
If they cannot connect, that DNS server is broken; removing it from the nameserver list is a temporary workaround.
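A minimal sketch of that workaround, assuming the entry lives in /etc/resolv.conf and is not immediately rewritten by NetworkManager or dhclient:

# check which resolvers the node is actually using
grep nameserver /etc/resolv.conf
# back the file up, then drop the unreachable entry (temporary measure only)
cp /etc/resolv.conf /etc/resolv.conf.bak
sed -i '/^nameserver 10.10.10.2$/d' /etc/resolv.conf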

Telnet does work from the VM:

telnet 10.10.10.2 53
Trying 10.10.10.2...
Connected to 10.10.10.2.
Escape character is '^]'.

    tscswcn Could you share the kube-proxy logs (kubectl -n kube-system logs kube-proxy-xxx)
    and the ipvs rules (ipvsadm -Ln)?

      yuswift All three of my VMs can telnet to it:
      [root@kubesphere ~]# telnet 10.10.10.2 53
      Trying 10.10.10.2...
      Connected to 10.10.10.2.
      Escape character is '^]'.
      Connection closed by foreign host.

      [root@worker1 ~]# telnet 10.10.10.2 53
      Trying 10.10.10.2...
      Connected to 10.10.10.2.
      Escape character is '^]'.
      Connection closed by foreign host.
      [root@worker1 ~]#

      [root@worker2 ~]# telnet 10.10.10.2 53
      Trying 10.10.10.2...
      Connected to 10.10.10.2.
      Escape character is '^]'.

      [root@worker2 ~]# nslookup kubesphere
      Server: 10.10.10.2
      Address: 10.10.10.2#53

      ** server can't find kubesphere: NXDOMAIN

      [root@worker2 ~]# nslookup 10.10.10.104
      ** server can't find 104.10.10.10.in-addr.arpa.: NXDOMAIN

      [root@worker2 ~]# cat /etc/hosts
      127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
      ::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
      10.10.10.104 kubesphere
      10.10.10.106 worker1 worker1.localdomain
      10.10.10.108 worker2 woker2.localdomain
      10.10.10.104 kubesphere.localdomain.cluster.local kubesphere.localdomain
      10.10.10.104 lb.kubesphere.local
      10.10.10.104 blockdeviceclaims.openebs.io

      [root@worker2 ~]# ping www.baidu.com
      PING www.a.shifen.com (220.181.38.149) 56(84) bytes of data.
      64 bytes from 220.181.38.149 (220.181.38.149): icmp_seq=1 ttl=128 time=9.55 ms
      64 bytes from 220.181.38.149 (220.181.38.149): icmp_seq=2 ttl=128 time=9.04 ms
      64 bytes from 220.181.38.149 (220.181.38.149): icmp_seq=3 ttl=128 time=7.49 ms
      64 bytes from 220.181.38.149 (220.181.38.149): icmp_seq=4 ttl=128 time=8.28 ms
      64 bytes from 220.181.38.149 (220.181.38.149): icmp_seq=5 ttl=128 time=7.30 ms
      ^C
      --- www.a.shifen.com ping statistics ---
      5 packets transmitted, 5 received, 0% packet loss, time 4635ms
      rtt min/avg/max/mdev = 7.303/8.334/9.555/0.871 ms
      [root@worker2 ~]#

      The hostnames are defined in the /etc/hosts file.

      Both of my coredns pods are on the master; the master node is both a master and a worker.
      [root@kubesphere ~]# kubectl get nodes
      NAME STATUS ROLES AGE VERSION
      kubesphere.localdomain Ready master,worker 15d v1.18.6
      worker1.localdomain Ready <none> 15d v1.18.6
      worker2.localdomain Ready <none> 11d v1.18.6
      [root@kubesphere ~]# kubectl get pods -A | grep coredbns
      [root@kubesphere ~]# kubectl get pods -A | grep core
      kube-system coredns-6b55b6764d-4dkqz 1/1 Running 2 15d
      kube-system coredns-6b55b6764d-hwxj7 1/1 Running 2 15d
      [root@kubesphere ~]#

      Is that a problem?
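      Having both replicas on one node is not necessarily wrong, but it is worth confirming where they run and that both are ready. A quick check might look like this (k8s-app=kube-dns is the usual label on the coredns pods):

      kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide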

      RolandMa1986 kubectl -n kube-system logs

      [root@kubesphere ~]# kubectl get pods -A | grep proxy
      kube-system kube-proxy-24b5l 1/1 Running 19 16d
      kube-system kube-proxy-6v4x6 1/1 Running 11 12d
      kube-system kube-proxy-sgr9t 1/1 Running 14 16d
      [root@kubesphere ~]# kubectl -n kube-system logs kube-proxy-24b5l
      I1029 06:18:30.480495 1 node.go:136] Successfully retrieved node IP: 10.10.10.106
      I1029 06:18:30.480573 1 server_others.go:259] Using ipvs Proxier.
      I1029 06:18:30.480813 1 proxier.go:357] missing br-netfilter module or unset sysctl br-nf-call-iptables; proxy may not work as intended
      I1029 06:18:30.481295 1 server.go:583] Version: v1.18.6
      I1029 06:18:30.481912 1 conntrack.go:52] Setting nf_conntrack_max to 131072
      I1029 06:18:30.483408 1 config.go:315] Starting service config controller
      I1029 06:18:30.483436 1 shared_informer.go:223] Waiting for caches to sync for service config
      I1029 06:18:30.483476 1 config.go:133] Starting endpoints config controller
      I1029 06:18:30.483498 1 shared_informer.go:223] Waiting for caches to sync for endpoints config
      I1029 06:18:30.583690 1 shared_informer.go:230] Caches are synced for endpoints config
      I1029 06:18:30.583791 1 shared_informer.go:230] Caches are synced for service config
      I1029 06:19:30.482637 1 graceful_termination.go:93] lw: remote out of the list: 10.233.0.3:53/TCP/10.233.64.39:53
      I1029 06:19:30.482717 1 graceful_termination.go:93] lw: remote out of the list: 10.233.0.3:53/TCP/10.233.64.31:53
      [root@kubesphere ~]# kubectl -n kube-system logs kube-proxy-6v4x6
      I1029 06:18:33.271768 1 node.go:136] Successfully retrieved node IP: 10.10.10.108
      I1029 06:18:33.271850 1 server_others.go:259] Using ipvs Proxier.
      I1029 06:18:33.272574 1 server.go:583] Version: v1.18.6
      I1029 06:18:33.273585 1 conntrack.go:52] Setting nf_conntrack_max to 131072
      I1029 06:18:33.280392 1 config.go:315] Starting service config controller
      I1029 06:18:33.280634 1 shared_informer.go:223] Waiting for caches to sync for service config
      I1029 06:18:33.280749 1 config.go:133] Starting endpoints config controller
      I1029 06:18:33.280899 1 shared_informer.go:223] Waiting for caches to sync for endpoints config
      I1029 06:18:33.380933 1 shared_informer.go:230] Caches are synced for service config
      I1029 06:18:33.381115 1 shared_informer.go:230] Caches are synced for endpoints config
      I1029 06:20:33.277535 1 graceful_termination.go:93] lw: remote out of the list: 10.233.0.3:53/TCP/10.233.64.39:53
      I1029 06:20:33.279166 1 graceful_termination.go:93] lw: remote out of the list: 10.233.0.3:53/TCP/10.233.64.31:53
      I1029 06:22:33.280526 1 graceful_termination.go:93] lw: remote out of the list: 10.233.60.1:80/TCP/10.233.64.78:9090
      I1029 06:24:33.282055 1 graceful_termination.go:93] lw: remote out of the list: 10.233.60.1:80/TCP/10.233.64.76:9090
      [root@kubesphere ~]# kubectl -n kube-system logs kube-proxy-sgr9t
      I1029 07:36:58.762707 1 node.go:136] Successfully retrieved node IP: 10.10.10.104
      I1029 07:36:58.763056 1 server_others.go:259] Using ipvs Proxier.
      I1029 07:36:58.763872 1 server.go:583] Version: v1.18.6
      I1029 07:36:58.764425 1 conntrack.go:52] Setting nf_conntrack_max to 131072
      I1029 07:36:58.769403 1 config.go:133] Starting endpoints config controller
      I1029 07:36:58.769445 1 shared_informer.go:223] Waiting for caches to sync for endpoints config
      I1029 07:36:58.769682 1 config.go:315] Starting service config controller
      I1029 07:36:58.769689 1 shared_informer.go:223] Waiting for caches to sync for service config
      I1029 07:36:58.871107 1 shared_informer.go:230] Caches are synced for service config
      I1029 07:36:58.871188 1 shared_informer.go:230] Caches are synced for endpoints config
      I1029 07:38:58.765417 1 graceful_termination.go:93] lw: remote out of the list: 10.233.0.3:53/TCP/10.233.64.39:53
      I1029 07:38:58.765514 1 graceful_termination.go:93] lw: remote out of the list: 10.233.0.3:53/TCP/10.233.64.31:53
      I1031 01:11:08.365047 1 trace.go:116] Trace[1804000238]: "iptables save" (started: 2020-10-31 01:11:01.197209563 +0000 UTC m=+81144.265636904) (total time: 2.966201505s):
      Trace[1804000238]: [2.966201505s] [2.966201505s] END
      [root@kubesphere ~]#
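      As a side note, the first kube-proxy log above warns about a missing br_netfilter module / unset bridge-nf-call-iptables sysctl on that node. Independent of the DNS issue, a standard fix might be:

      # load the bridge netfilter module now
      modprobe br_netfilter
      # persist the required sysctls and apply them
      printf 'net.bridge.bridge-nf-call-iptables = 1\nnet.bridge.bridge-nf-call-ip6tables = 1\n' > /etc/sysctl.d/k8s.conf
      sysctl --system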

      [root@kubesphere ~]# ipvsadm -Ln
      IP Virtual Server version 1.2.1 (size=4096)
      Prot LocalAddress:Port Scheduler Flags
      -> RemoteAddress:Port Forward Weight ActiveConn InActConn
      TCP 169.254.25.10:30880 rr
      -> 10.233.66.36:8000 Masq 1 0 0
      TCP 169.254.25.10:32567 rr
      -> 10.233.66.41:80 Masq 1 0 0
      TCP 172.17.0.1:30880 rr
      -> 10.233.66.36:8000 Masq 1 0 0
      TCP 10.10.10.104:30880 rr
      -> 10.233.66.36:8000 Masq 1 0 0
      TCP 10.10.10.104:32567 rr
      -> 10.233.66.41:80 Masq 1 0 0
      TCP 10.233.0.1:443 rr
      -> 10.10.10.104:6443 Masq 1 62 0
      TCP 10.233.0.3:53 rr
      -> 10.233.64.58:53 Masq 1 0 4
      -> 10.233.64.70:53 Masq 1 0 4
      TCP 10.233.0.3:9153 rr
      -> 10.233.64.58:9153 Masq 1 0 0
      -> 10.233.64.70:9153 Masq 1 0 0
      TCP 10.233.2.130:6379 rr
      -> 10.233.64.71:6379 Masq 1 0 0
      TCP 10.233.7.20:80 rr
      -> 10.233.66.41:80 Masq 1 0 0
      TCP 10.233.8.12:443 rr
      -> 10.233.64.65:8443 Masq 1 0 0
      TCP 10.233.13.76:9093 rr persistent 10800
      -> 10.233.64.60:9093 Masq 1 0 0
      TCP 10.233.15.225:8443 rr
      -> 10.233.64.75:8443 Masq 1 0 0
      TCP 10.233.24.157:443 rr
      -> 10.10.10.108:4443 Masq 1 2 0
      TCP 10.233.27.144:19093 rr
      -> 10.233.64.61:19093 Masq 1 0 0
      TCP 10.233.35.97:443 rr
      -> 10.233.64.80:8443 Masq 1 0 0
      TCP 10.233.35.122:80 rr
      -> 10.233.66.36:8000 Masq 1 0 0
      TCP 10.233.40.123:9090 rr persistent 10800
      -> 10.233.64.55:9090 Masq 1 0 0
      TCP 10.233.47.113:80 rr
      -> 10.233.64.62:8080 Masq 1 0 0
      TCP 10.233.49.67:5656 rr
      -> 10.233.64.72:5656 Masq 1 0 0
      TCP 10.233.60.1:80 rr
      -> 10.233.64.79:9090 Masq 1 0 0
      TCP 10.233.64.0:30880 rr
      -> 10.233.66.36:8000 Masq 1 0 0
      TCP 10.233.64.0:32567 rr
      -> 10.233.66.41:80 Masq 1 0 0
      TCP 10.233.64.1:30880 rr
      -> 10.233.66.36:8000 Masq 1 0 0
      TCP 10.233.64.1:32567 rr
      -> 10.233.66.41:80 Masq 1 0 0
      TCP 127.0.0.1:30880 rr
      -> 10.233.66.36:8000 Masq 1 0 0
      TCP 127.0.0.1:32567 rr
      -> 10.233.66.41:80 Masq 1 0 0
      TCP 172.17.0.1:32567 rr
      -> 10.233.66.41:80 Masq 1 0 0
      UDP 10.233.0.3:53 rr
      -> 10.233.64.58:53 Masq 1 0 0
      -> 10.233.64.70:53 Masq 1 0 0
      [root@kubesphere ~]#

      RolandMa1986 Please take another look for me.
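      To double-check that the DNS service path shown in the ipvsadm output above is actually answering, a quick query against the cluster DNS service IP (10.233.0.3 above) from any node could look like this (the queried name is just an example):

      nslookup kubernetes.default.svc.cluster.local 10.233.0.3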

      tscswcn The "ipvs rr udp 10.133.0.3 53 no destination available" message does not affect the environment; to stop it from being printed, run dmesg -n 1 on the machine.

      Are DNS queries in your cluster still timing out at the moment?

        RolandMa1986 Yes, there are still timeouts:

        [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:58402->10.10.10.2:53: i/o timeout

        [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:48861->10.10.10.2:53: i/o timeout

        [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:33260->10.10.10.2:53: i/o timeout

        [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:53321->10.10.10.2:53: i/o timeout

        [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:56475->10.10.10.2:53: i/o timeout

        [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:57477->10.10.10.2:53: i/o timeout

        [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:45194->10.10.10.2:53: i/o timeout

        [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:50980->10.10.10.2:53: i/o timeout

        [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:41519->10.10.10.2:53: i/o timeout

        [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:46032->10.10.10.2:53: i/o timeout

        [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:53543->10.10.10.2:53: i/o timeout

        [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:50189->10.10.10.2:53: i/o timeout

        [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:49052->10.10.10.2:53: i/o timeout

        [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:40430->10.10.10.2:53: i/o timeout

        [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:40941->10.10.10.2:53: i/o timeout

        [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:34961->10.10.10.2:53: i/o timeout

        [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:33067->10.10.10.2:53: i/o timeout

        [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:57059->10.10.10.2:53: i/o timeout

        [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:54908->10.10.10.2:53: i/o timeout

        10.10.10.2 is provided by VMware. Should I just remove it?
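        The timeouts above show CoreDNS forwarding upstream queries toward 10.10.10.2. To see where that forward target comes from, inspecting the Corefile might look like this (coredns is the usual ConfigMap name in kube-system):

        kubectl -n kube-system get configmap coredns -o yaml | grep -B1 -A3 forward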


          tscswcn Remove every DNS server that the hosts cannot reach; otherwise it interferes with CoreDNS.
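          Once the unreachable entry is gone from every node's /etc/resolv.conf, coredns typically has to be restarted to pick up the change, since its upstream list is taken from the node's resolv.conf when the pod starts. A sketch:

          # restart coredns so it re-reads the upstream resolvers, then watch it come back
          kubectl -n kube-system rollout restart deployment coredns
          kubectl -n kube-system get pods -l k8s-app=kube-dns -w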

          Latest progress on this issue:

          I attached a busybox sidecar container to the coredns pod. From inside busybox the VMs cannot be pinged, and further diagnosis shows the coredns pod is missing a route:

          / # ip route
          default via 10.233.66.1 dev eth0
          10.233.66.0/24 dev eth0 scope link src 10.233.66.44
          10.244.0.0/16 via 10.233.66.1 dev eth0
          / # ping 10.10.10.4
          PING 10.10.10.4 (10.10.10.4): 56 data bytes
          ^C
          --- 10.10.10.4 ping statistics ---
          6 packets transmitted, 0 packets received, 100% packet loss
          / # ping 10.10.10.6
          PING 10.10.10.6 (10.10.10.6): 56 data bytes
          ^C
          --- 10.10.10.6 ping statistics ---
          2 packets transmitted, 0 packets received, 100% packet loss
          / # ping 10.10.10.8
          PING 10.10.10.8 (10.10.10.8): 56 data bytes
          ^C
          --- 10.10.10.8 ping statistics ---
          4 packets transmitted, 0 packets received, 100% packet loss
          / # ping 10.10.10.104
          PING 10.10.10.104 (10.10.10.104): 56 data bytes
          ^C
          --- 10.10.10.104 ping statistics ---
          2 packets transmitted, 0 packets received, 100% packet loss
          / # ping 10.10.10.106
          PING 10.10.10.106 (10.10.10.106): 56 data bytes
          ^C
          --- 10.10.10.106 ping statistics ---
          4 packets transmitted, 0 packets received, 100% packet loss
          / # ping 10.10.10.108
          PING 10.10.10.108 (10.10.10.108): 56 data bytes
          64 bytes from 10.10.10.108: seq=0 ttl=64 time=0.158 ms
          64 bytes from 10.10.10.108: seq=1 ttl=64 time=0.431 ms
          64 bytes from 10.10.10.108: seq=2 ttl=64 time=0.118 ms
          ^C

          So the problem should be here. The route that should be added is:

          ip route add 10.10.10.0/24 proto kernel scope link src 10.10.10.104 metric 100

          But running this command in the container fails with an error; some posts I found online say the kernel needs to be modified, and I don't know how to do that.
          / # ip route add 10.10.10.0/24 proto kernel scope link src 10.10.10.104 metric 100
          ip: RTNETLINK answers: Operation not permitted
          / #
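          For what it's worth, that "Operation not permitted" usually means the sidecar container lacks the NET_ADMIN capability rather than anything kernel-level. One way around it is to inspect or add the route from the host's view of the pod's network namespace instead (the container ID below is a placeholder):

          # on the node hosting the coredns pod, find the container's PID
          PID=$(docker inspect -f '{{.State.Pid}}' <coredns-container-id>)
          # enter only its network namespace from the host, where NET_ADMIN is available
          nsenter -t "$PID" -n ip route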

          This issue has now been resolved by adding the following flags to the docker service:
          --bip
          --ip-masq
          --mtu
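          For reference, one common way to set these dockerd flags is a systemd drop-in for the docker service. The values below are placeholders, not the exact ones used here, and any flags already present in the existing ExecStart should be carried over:

          # create a drop-in that overrides ExecStart (pick a --bip that does not overlap the node or pod networks)
          mkdir -p /etc/systemd/system/docker.service.d
          printf '[Service]\nExecStart=\nExecStart=/usr/bin/dockerd --bip=172.18.0.1/16 --ip-masq=true --mtu=1450\n' > /etc/systemd/system/docker.service.d/network.conf
          systemctl daemon-reload && systemctl restart docker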