• Installation & Deployment
  • How do I diagnose problems with a k8s cluster I installed?

This looks like a network connectivity problem. Was the cluster installed with kubeadm?

It was installed with kk, which uses kubeadm under the hood.
Right now the errors above don't seem to matter much, but coredns does have a problem:

kubectl logs -f coredns-6b55b6764d-4dkqz -n kube-system

kubectl logs -f nodelocaldns-4lzlx -n kube-system

coredns is timing out. Is 10.10.10.2 the nameserver configured in /etc/resolv.conf on your host?

    On your VM, run ping 10.10.10.2
    telnet 10.10.10.2 53
    If it is unreachable, that DNS server is broken; removing it from the nameserver list is a temporary workaround (a sketch follows below).
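
    A minimal sketch of that workaround, assuming the host uses a plain, static /etc/resolv.conf (if NetworkManager or systemd-resolved manages it, the change belongs in their config instead):

    # back up the resolver config, then drop the suspect nameserver
    cp /etc/resolv.conf /etc/resolv.conf.bak
    sed -i '/^nameserver 10.10.10.2/d' /etc/resolv.conf
    # coredns only reads the node's /etc/resolv.conf when its pods start, so restart it afterwards
    kubectl -n kube-system rollout restart deployment coredns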

    I can telnet to it from the VM:

    telnet 10.10.10.2 53
    Trying 10.10.10.2...
    Connected to 10.10.10.2.
    Escape character is '^]'.

      tscswcn Could you share the kube-proxy logs (kubectl -n kube-system logs kube-proxy-xxx)
      and the ipvs table (ipvsadm -Ln)?

        yuswift All 3 of my VMs can reach it:
        [root@kubesphere ~]# telnet 10.10.10.2 53
        Trying 10.10.10.2...
        Connected to 10.10.10.2.
        Escape character is '^]'.
        Connection closed by foreign host.

        [root@worker1 ~]# telnet 10.10.10.2 53
        Trying 10.10.10.2...
        Connected to 10.10.10.2.
        Escape character is '^]'.
        Connection closed by foreign host.
        [root@worker1 ~]#

        [root@worker2 ~]# telnet 10.10.10.2 53
        Trying 10.10.10.2...
        Connected to 10.10.10.2.
        Escape character is '^]'.

        [root@worker2 ~]# nslookup kubesphere
        Server: 10.10.10.2
        Address: 10.10.10.2#53

        ** server can't find kubesphere: NXDOMAIN

        [root@worker2 ~]# nslookup 10.10.10.104
        ** server can't find 104.10.10.10.in-addr.arpa.: NXDOMAIN

        [root@worker2 ~]# cat /etc/hosts
        127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
        ::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
        10.10.10.104 kubesphere
        10.10.10.106 worker1 worker1.localdomain
        10.10.10.108 worker2 woker2.localdomain
        10.10.10.104 kubesphere.localdomain.cluster.local kubesphere.localdomain
        10.10.10.104 lb.kubesphere.local
        10.10.10.104 blockdeviceclaims.openebs.io

        [root@worker2 ~]# ping www.baidu.com
        PING www.a.shifen.com (220.181.38.149) 56(84) bytes of data.
        64 bytes from 220.181.38.149 (220.181.38.149): icmp_seq=1 ttl=128 time=9.55 ms
        64 bytes from 220.181.38.149 (220.181.38.149): icmp_seq=2 ttl=128 time=9.04 ms
        64 bytes from 220.181.38.149 (220.181.38.149): icmp_seq=3 ttl=128 time=7.49 ms
        64 bytes from 220.181.38.149 (220.181.38.149): icmp_seq=4 ttl=128 time=8.28 ms
        64 bytes from 220.181.38.149 (220.181.38.149): icmp_seq=5 ttl=128 time=7.30 ms
        ^C
        --- www.a.shifen.com ping statistics ---
        5 packets transmitted, 5 received, 0% packet loss, time 4635ms
        rtt min/avg/max/mdev = 7.303/8.334/9.555/0.871 ms
        [root@worker2 ~]#

        The hostnames are all defined in my /etc/hosts file.
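
        (A note on the NXDOMAIN above: nslookup only asks the DNS server 10.10.10.2, which knows nothing about /etc/hosts entries; a lookup that goes through NSS does see them, e.g.:)

        # getent resolves through NSS, which includes /etc/hosts, unlike nslookup/dig
        getent hosts kubesphere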

        Both of my coredns pods are on the master; the master node serves as both master and worker.
        [root@kubesphere ~]# kubectl get nodes
        NAME STATUS ROLES AGE VERSION
        kubesphere.localdomain Ready master,worker 15d v1.18.6
        worker1.localdomain Ready <none> 15d v1.18.6
        worker2.localdomain Ready <none> 11d v1.18.6
        [root@kubesphere ~]# kubectl get pods -A | grep coredbns
        [root@kubesphere ~]# kubectl get pods -A | grep core
        kube-system coredns-6b55b6764d-4dkqz 1/1 Running 2 15d
        kube-system coredns-6b55b6764d-hwxj7 1/1 Running 2 15d
        [root@kubesphere ~]#
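
        The grep above does not show the node column; placement can be double-checked with the wide output (a generic command, assuming the standard k8s-app=kube-dns label that kubeadm gives coredns):

        kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide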

        Is that a problem?

        RolandMa1986 Here is the kubectl -n kube-system logs output:

        [root@kubesphere ~]# kubectl get pods -A | grep proxy
        kube-system kube-proxy-24b5l 1/1 Running 19 16d
        kube-system kube-proxy-6v4x6 1/1 Running 11 12d
        kube-system kube-proxy-sgr9t 1/1 Running 14 16d
        [root@kubesphere ~]# kubectl -n kube-system logs kube-proxy-24b5l
        I1029 06:18:30.480495 1 node.go:136] Successfully retrieved node IP: 10.10.10.106
        I1029 06:18:30.480573 1 server_others.go:259] Using ipvs Proxier.
        I1029 06:18:30.480813 1 proxier.go:357] missing br-netfilter module or unset sysctl br-nf-call-iptables; proxy may not work as intended
        I1029 06:18:30.481295 1 server.go:583] Version: v1.18.6
        I1029 06:18:30.481912 1 conntrack.go:52] Setting nf_conntrack_max to 131072
        I1029 06:18:30.483408 1 config.go:315] Starting service config controller
        I1029 06:18:30.483436 1 shared_informer.go:223] Waiting for caches to sync for service config
        I1029 06:18:30.483476 1 config.go:133] Starting endpoints config controller
        I1029 06:18:30.483498 1 shared_informer.go:223] Waiting for caches to sync for endpoints config
        I1029 06:18:30.583690 1 shared_informer.go:230] Caches are synced for endpoints config
        I1029 06:18:30.583791 1 shared_informer.go:230] Caches are synced for service config
        I1029 06:19:30.482637 1 graceful_termination.go:93] lw: remote out of the list: 10.233.0.3:53/TCP/10.233.64.39:53
        I1029 06:19:30.482717 1 graceful_termination.go:93] lw: remote out of the list: 10.233.0.3:53/TCP/10.233.64.31:53
        [root@kubesphere ~]# kubectl -n kube-system logs kube-proxy-6v4x6
        I1029 06:18:33.271768 1 node.go:136] Successfully retrieved node IP: 10.10.10.108
        I1029 06:18:33.271850 1 server_others.go:259] Using ipvs Proxier.
        I1029 06:18:33.272574 1 server.go:583] Version: v1.18.6
        I1029 06:18:33.273585 1 conntrack.go:52] Setting nf_conntrack_max to 131072
        I1029 06:18:33.280392 1 config.go:315] Starting service config controller
        I1029 06:18:33.280634 1 shared_informer.go:223] Waiting for caches to sync for service config
        I1029 06:18:33.280749 1 config.go:133] Starting endpoints config controller
        I1029 06:18:33.280899 1 shared_informer.go:223] Waiting for caches to sync for endpoints config
        I1029 06:18:33.380933 1 shared_informer.go:230] Caches are synced for service config
        I1029 06:18:33.381115 1 shared_informer.go:230] Caches are synced for endpoints config
        I1029 06:20:33.277535 1 graceful_termination.go:93] lw: remote out of the list: 10.233.0.3:53/TCP/10.233.64.39:53
        I1029 06:20:33.279166 1 graceful_termination.go:93] lw: remote out of the list: 10.233.0.3:53/TCP/10.233.64.31:53
        I1029 06:22:33.280526 1 graceful_termination.go:93] lw: remote out of the list: 10.233.60.1:80/TCP/10.233.64.78:9090
        I1029 06:24:33.282055 1 graceful_termination.go:93] lw: remote out of the list: 10.233.60.1:80/TCP/10.233.64.76:9090
        [root@kubesphere ~]# kubectl -n kube-system logs kube-proxy-sgr9t
        I1029 07:36:58.762707 1 node.go:136] Successfully retrieved node IP: 10.10.10.104
        I1029 07:36:58.763056 1 server_others.go:259] Using ipvs Proxier.
        I1029 07:36:58.763872 1 server.go:583] Version: v1.18.6
        I1029 07:36:58.764425 1 conntrack.go:52] Setting nf_conntrack_max to 131072
        I1029 07:36:58.769403 1 config.go:133] Starting endpoints config controller
        I1029 07:36:58.769445 1 shared_informer.go:223] Waiting for caches to sync for endpoints config
        I1029 07:36:58.769682 1 config.go:315] Starting service config controller
        I1029 07:36:58.769689 1 shared_informer.go:223] Waiting for caches to sync for service config
        I1029 07:36:58.871107 1 shared_informer.go:230] Caches are synced for service config
        I1029 07:36:58.871188 1 shared_informer.go:230] Caches are synced for endpoints config
        I1029 07:38:58.765417 1 graceful_termination.go:93] lw: remote out of the list: 10.233.0.3:53/TCP/10.233.64.39:53
        I1029 07:38:58.765514 1 graceful_termination.go:93] lw: remote out of the list: 10.233.0.3:53/TCP/10.233.64.31:53
        I1031 01:11:08.365047 1 trace.go:116] Trace[1804000238]: "iptables save" (started: 2020-10-31 01:11:01.197209563 +0000 UTC m=+81144.265636904) (total time: 2.966201505s):
        Trace[1804000238]: [2.966201505s] [2.966201505s] END
        [root@kubesphere ~]#
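
        One thing that stands out in the kube-proxy-24b5l log is the "missing br-netfilter module or unset sysctl br-nf-call-iptables" warning. A hedged sketch of the usual remedy on that node (standard kernel module and sysctls, nothing specific to this setup):

        modprobe br_netfilter
        echo br_netfilter > /etc/modules-load.d/br_netfilter.conf     # load the module on boot as well
        echo "net.bridge.bridge-nf-call-iptables = 1"  >> /etc/sysctl.d/99-kubernetes.conf
        echo "net.bridge.bridge-nf-call-ip6tables = 1" >> /etc/sysctl.d/99-kubernetes.conf
        sysctl --system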

        [root@kubesphere ~]# ipvsadm -Ln
        IP Virtual Server version 1.2.1 (size=4096)
        Prot LocalAddress:Port Scheduler Flags
        -> RemoteAddress:Port Forward Weight ActiveConn InActConn
        TCP 169.254.25.10:30880 rr
        -> 10.233.66.36:8000 Masq 1 0 0
        TCP 169.254.25.10:32567 rr
        -> 10.233.66.41:80 Masq 1 0 0
        TCP 172.17.0.1:30880 rr
        -> 10.233.66.36:8000 Masq 1 0 0
        TCP 10.10.10.104:30880 rr
        -> 10.233.66.36:8000 Masq 1 0 0
        TCP 10.10.10.104:32567 rr
        -> 10.233.66.41:80 Masq 1 0 0
        TCP 10.233.0.1:443 rr
        -> 10.10.10.104:6443 Masq 1 62 0
        TCP 10.233.0.3:53 rr
        -> 10.233.64.58:53 Masq 1 0 4
        -> 10.233.64.70:53 Masq 1 0 4
        TCP 10.233.0.3:9153 rr
        -> 10.233.64.58:9153 Masq 1 0 0
        -> 10.233.64.70:9153 Masq 1 0 0
        TCP 10.233.2.130:6379 rr
        -> 10.233.64.71:6379 Masq 1 0 0
        TCP 10.233.7.20:80 rr
        -> 10.233.66.41:80 Masq 1 0 0
        TCP 10.233.8.12:443 rr
        -> 10.233.64.65:8443 Masq 1 0 0
        TCP 10.233.13.76:9093 rr persistent 10800
        -> 10.233.64.60:9093 Masq 1 0 0
        TCP 10.233.15.225:8443 rr
        -> 10.233.64.75:8443 Masq 1 0 0
        TCP 10.233.24.157:443 rr
        -> 10.10.10.108:4443 Masq 1 2 0
        TCP 10.233.27.144:19093 rr
        -> 10.233.64.61:19093 Masq 1 0 0
        TCP 10.233.35.97:443 rr
        -> 10.233.64.80:8443 Masq 1 0 0
        TCP 10.233.35.122:80 rr
        -> 10.233.66.36:8000 Masq 1 0 0
        TCP 10.233.40.123:9090 rr persistent 10800
        -> 10.233.64.55:9090 Masq 1 0 0
        TCP 10.233.47.113:80 rr
        -> 10.233.64.62:8080 Masq 1 0 0
        TCP 10.233.49.67:5656 rr
        -> 10.233.64.72:5656 Masq 1 0 0
        TCP 10.233.60.1:80 rr
        -> 10.233.64.79:9090 Masq 1 0 0
        TCP 10.233.64.0:30880 rr
        -> 10.233.66.36:8000 Masq 1 0 0
        TCP 10.233.64.0:32567 rr
        -> 10.233.66.41:80 Masq 1 0 0
        TCP 10.233.64.1:30880 rr
        -> 10.233.66.36:8000 Masq 1 0 0
        TCP 10.233.64.1:32567 rr
        -> 10.233.66.41:80 Masq 1 0 0
        TCP 127.0.0.1:30880 rr
        -> 10.233.66.36:8000 Masq 1 0 0
        TCP 127.0.0.1:32567 rr
        -> 10.233.66.41:80 Masq 1 0 0
        TCP 172.17.0.1:32567 rr
        -> 10.233.66.41:80 Masq 1 0 0
        UDP 10.233.0.3:53 rr
        -> 10.233.64.58:53 Masq 1 0 0
        -> 10.233.64.70:53 Masq 1 0 0
        [root@kubesphere ~]#
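
        The DNS virtual service can also be inspected on its own; if 10.233.0.3:53 ever listed no destinations, kube-proxy rather than coredns would be the suspect. A generic check:

        ipvsadm -Ln -t 10.233.0.3:53   # the TCP DNS service and its real servers
        ipvsadm -Ln -u 10.233.0.3:53   # the UDP DNS service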

        RolandMa1986 Please take another look at this for me.

        tscswcn The "ipvs rr udp 10.133.0.3 53 no destination available" message is not something that affects the environment. To stop it from being displayed, run dmesg -n 1 on the machine.

        Are DNS queries in your cluster still timing out?

          RolandMa1986 Yes, there still are:

          [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:58402->10.10.10.2:53: i/o timeout

          [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:48861->10.10.10.2:53: i/o timeout

          [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:33260->10.10.10.2:53: i/o timeout

          [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:53321->10.10.10.2:53: i/o timeout

          [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:56475->10.10.10.2:53: i/o timeout

          [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:57477->10.10.10.2:53: i/o timeout

          [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:45194->10.10.10.2:53: i/o timeout

          [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:50980->10.10.10.2:53: i/o timeout

          [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:41519->10.10.10.2:53: i/o timeout

          [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:46032->10.10.10.2:53: i/o timeout

          [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:53543->10.10.10.2:53: i/o timeout

          [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:50189->10.10.10.2:53: i/o timeout

          [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:49052->10.10.10.2:53: i/o timeout

          [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:40430->10.10.10.2:53: i/o timeout

          [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:40941->10.10.10.2:53: i/o timeout

          [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:34961->10.10.10.2:53: i/o timeout

          [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:33067->10.10.10.2:53: i/o timeout

          [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:57059->10.10.10.2:53: i/o timeout

          [ERROR] plugin/errors: 2 . NS: read udp 10.233.64.70:54908->10.10.10.2:53: i/o timeout

          10.10.10.2 is the one provided by VMware. Should I just remove it?


            tscswcn Remove every DNS server the host cannot reach; otherwise it interferes with coredns.
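
            An alternative to pruning the host's resolv.conf, assuming the Corefile still has the default forward . /etc/resolv.conf stanza, is to point coredns at a resolver that is known to be reachable (8.8.8.8 below is only a placeholder):

            kubectl -n kube-system edit configmap coredns
            # in the Corefile, change:   forward . /etc/resolv.conf
            # to, for example:           forward . 8.8.8.8
            kubectl -n kube-system rollout restart deployment coredns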

            Latest progress on this issue:

            I put a busybox container into the coredns pod; from inside busybox I found it could not ping the VMs.

            With the busybox sidecar attached to the coredns pod, further diagnosis showed that the coredns pod is missing a route:

            / # ip route
            default via 10.233.66.1 dev eth0
            10.233.66.0/24 dev eth0 scope link src 10.233.66.44
            10.244.0.0/16 via 10.233.66.1 dev eth0
            / # ping 10.10.10.4
            PING 10.10.10.4 (10.10.10.4): 56 data bytes
            ^C
            --- 10.10.10.4 ping statistics ---
            6 packets transmitted, 0 packets received, 100% packet loss
            / # ping 10.10.10.6
            PING 10.10.10.6 (10.10.10.6): 56 data bytes
            ^C
            --- 10.10.10.6 ping statistics ---
            2 packets transmitted, 0 packets received, 100% packet loss
            / # ping 10.10.10.8
            PING 10.10.10.8 (10.10.10.8): 56 data bytes
            ^C
            --- 10.10.10.8 ping statistics ---
            4 packets transmitted, 0 packets received, 100% packet loss
            / # ping 10.10.10.104
            PING 10.10.10.104 (10.10.10.104): 56 data bytes
            ^C
            --- 10.10.10.104 ping statistics ---
            2 packets transmitted, 0 packets received, 100% packet loss
            / # ping 10.10.10.106
            PING 10.10.10.106 (10.10.10.106): 56 data bytes
            ^C
            --- 10.10.10.106 ping statistics ---
            4 packets transmitted, 0 packets received, 100% packet loss
            / # ping 10.10.10.108
            PING 10.10.10.108 (10.10.10.108): 56 data bytes
            64 bytes from 10.10.10.108: seq=0 ttl=64 time=0.158 ms
            64 bytes from 10.10.10.108: seq=1 ttl=64 time=0.431 ms
            64 bytes from 10.10.10.108: seq=2 ttl=64 time=0.118 ms
            ^C

            So the problem should be here:

            ip route add 10.10.10.0/24 proto kernel scope link src 10.10.10.104 metric 100

            But running this command inside the container fails with an error. Some posts I found online say the kernel needs to be modified, and I don't know how to do that:
            / # ip route add 10.10.10.0/24 proto kernel scope link src 10.10.10.104 metric 100
            ip: RTNETLINK answers: Operation not permitted
            / #
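
            "RTNETLINK answers: Operation not permitted" here usually just means the busybox container lacks the NET_ADMIN capability, rather than anything needing a kernel change. A hedged sketch of how the pod's routes could be inspected from the node with full privileges (the container ID is a placeholder, and docker as the runtime is an assumption):

            PID=$(docker inspect -f '{{.State.Pid}}' <busybox-container-id>)
            nsenter -t "$PID" -n ip route          # view the pod's routing table as root on the node
            # route experiments (ip route add ...) can be run the same way via nsenter -n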

            This issue has now been resolved by adding the following flags to the docker service:
            --bip
            --ip-masq
            --mtu
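
            For reference, a hedged sketch of what that change can look like; the values below are placeholders rather than the poster's actual ones, and /etc/docker/daemon.json is equivalent to passing the flags on the docker service's ExecStart line:

            # /etc/docker/daemon.json
            {
              "bip": "172.26.0.1/16",
              "ip-masq": true,
              "mtu": 1450
            }
            # then reload and restart docker so the bridge is recreated with the new settings
            systemctl daemon-reload
            systemctl restart docker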