During the installation of a KubeSphere 3.0 cluster, the installation itself succeeded and all resources look healthy, but accessing the Dashboard returns Internal Server Error. I have read many similar posts and issues and already checked the LB, Redis, DNS, time zone, network and so on; all of them are normal, yet the problem persists, and it also reproduces in other test environments. Below are the detailed installation process, the configuration files, and the troubleshooting steps, in the hope that someone in the community can help spot the cause.
Cluster installation followed this guide: https://v3-0.docs.kubesphere.io/docs/installing-on-linux/on-premises/install-kubesphere-on-vmware-vsphere/#install-a-load-balancer-using-keepalived-and-haproxy-optional
Cluster node list (each node has two NICs, one for business traffic and one for management; the IP addresses below are the business NIC IPs):
Host IP          Host Name      Role
192.168.100.91   k8s-master01   master, etcd, HAProxy, Keepalived
192.168.100.92   k8s-master02   master, etcd, HAProxy, Keepalived
192.168.100.93   k8s-master03   master, etcd, HAProxy, Keepalived
192.168.100.1    k8s-node01     worker
192.168.100.2    k8s-node02     worker
192.168.100.3    k8s-node03     worker
192.168.100.4    k8s-node04     worker
192.168.100.5    k8s-node05     worker
192.168.100.99   VIP            VIP
The cluster is deployed as 3 masters + 5 workers; the masters also carry the LB role, and the VIP floats among the three master nodes.
Installation steps:
1. Set the hostname on every cluster node.
2. Set up passwordless SSH login; master01, acting as the deployment node, SSHes into all other nodes (see the sketch after step 3 below).
3. Disable the firewall and SELinux.

1) Disable the firewall
// Check the firewall status
# firewall-cmd --list-all
# systemctl stop firewalld
# systemctl disable firewalld
// Confirm it is stopped
# firewall-cmd --list-all
FirewallD is not running
2) Disable SELinux
# sed -i "/^SELINUX/s/enforcing/disabled/" /etc/selinux/config
# setenforce 0
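
A sketch of steps 1 and 2 in command form (hostnames and the key path match the inventory and configuration below; adjust as needed):

// Set the hostname on each node, e.g. on the first master
# hostnamectl set-hostname k8s-master01
// On master01 (the deployment node), generate a key pair and copy it to every node
# ssh-keygen -t rsa
# for ip in 192.168.100.91 192.168.100.92 192.168.100.93 192.168.100.1 192.168.100.2 192.168.100.3 192.168.100.4 192.168.100.5; do ssh-copy-id root@$ip; done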

4. Configure time synchronization

1) Set the time zone
# timedatectl set-timezone Asia/Shanghai
2) Install the NTP service
# yum -y install ntp
# systemctl restart ntpd && systemctl enable ntpd
3) Point NTP at upstream time servers (Aliyun)
# vi /etc/ntp.conf
server ntp.aliyun.com iburst
server ntp1.aliyun.com iburst
server ntp2.aliyun.com iburst
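
After editing /etc/ntp.conf, restart ntpd again so the new servers take effect, then verify synchronization (ntpq and timedatectl are standard tools; the output will vary per environment):

# systemctl restart ntpd
// List the configured peers and their reachability
# ntpq -p
// Confirm the system clock is synchronized and the time zone is Asia/Shanghai
# timedatectl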

5. Configure and update the yum repositories
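
No details were recorded for this step; a minimal sketch of what it typically looks like on CentOS 7 using the Aliyun mirror (the mirror URL is an assumption; substitute an internal mirror if required):

// Back up the current repo file and switch to the Aliyun mirror (assumed mirror choice)
# cp /etc/yum.repos.d/CentOS-Base.repo /etc/yum.repos.d/CentOS-Base.repo.bak
# curl -o /etc/yum.repos.d/CentOS-Base.repo https://mirrors.aliyun.com/repo/Centos-7.repo
// Rebuild the cache and apply updates
# yum clean all && yum makecache
# yum -y update
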
6. Configure the cluster VIP
The VIP was configured by following the article linked above, and it fails over between the masters correctly. Because the masters double as LB nodes (Master + HAProxy + Keepalived), HAProxy listens on port 16443 to avoid clashing with kube-apiserver on 6443, so the cluster configuration also uses 16443 (a HAProxy sketch follows the ip output below). For example:

  controlPlaneEndpoint:
    domain: lb.kubesphere.local
    address: "192.168.100.99"
    port: "16443"
On master01 the VIP is currently bound to the business NIC eno3:

4: eno3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 7c:c3:85:3a:48:8b brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.91/24 brd 192.168.100.255 scope global noprefixroute eno3
       valid_lft forever preferred_lft forever
    inet 192.168.100.99/32 scope global eno3
       valid_lft forever preferred_lft forever
    inet6 fe80::9f85:947:d534:1f48/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
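
For reference, a minimal sketch of the HAProxy frontend/backend for the 16443 listener on the three masters (illustrative only, assuming the layout from the linked guide; the actual haproxy.cfg on these nodes may differ):

# vi /etc/haproxy/haproxy.cfg
frontend kube-apiserver-frontend
    bind *:16443
    mode tcp
    option tcplog
    default_backend kube-apiserver-backend

backend kube-apiserver-backend
    mode tcp
    option tcp-check
    balance roundrobin
    server k8s-master01 192.168.100.91:6443 check
    server k8s-master02 192.168.100.92:6443 check
    server k8s-master03 192.168.100.93:6443 check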

7. Cluster configuration file (ks-config.yaml)

apiVersion: kubekey.kubesphere.io/v1alpha1
kind: Cluster
metadata:
  name: ks-cluster
spec:
  hosts:
  - {name: k8s-master01, address: 192.168.100.91, internalAddress: 192.168.100.91, privateKeyPath: "~/.ssh/id_rsa"} # password-less login with SSH keys
  - {name: k8s-master02, address: 192.168.100.92, internalAddress: 192.168.100.92, privateKeyPath: "~/.ssh/id_rsa"} # password-less login with SSH keys
  - {name: k8s-master03, address: 192.168.100.93, internalAddress: 192.168.100.93, privateKeyPath: "~/.ssh/id_rsa"} # password-less login with SSH keys
  - {name: k8s-node01, address: 192.168.100.1, internalAddress: 192.168.100.1, privateKeyPath: "~/.ssh/id_rsa"} # password-less login with SSH keys
  - {name: k8s-node02, address: 192.168.100.2, internalAddress: 192.168.100.2, privateKeyPath: "~/.ssh/id_rsa"} # password-less login with SSH keys
  - {name: k8s-node03, address: 192.168.100.3, internalAddress: 192.168.100.3, privateKeyPath: "~/.ssh/id_rsa"} # password-less login with SSH keys
  - {name: k8s-node04, address: 192.168.100.4, internalAddress: 192.168.100.4, privateKeyPath: "~/.ssh/id_rsa"} # password-less login with SSH keys
  - {name: k8s-node05, address: 192.168.100.5, internalAddress: 192.168.100.5, privateKeyPath: "~/.ssh/id_rsa"} # password-less login with SSH keys
  roleGroups:
    etcd:
     - k8s-master01
     - k8s-master02
     - k8s-master03
    master:
     - k8s-master01
     - k8s-master02
     - k8s-master03
    worker:
     - k8s-node01
     - k8s-node02
     - k8s-node03
     - k8s-node04
     - k8s-node05
  controlPlaneEndpoint:
    domain: lb.kubesphere.local
    address: "192.168.100.99"
    port: "16443"
  kubernetes:
    version: v1.17.9
    imageRepo: kubesphere
    clusterName: cluster.local
    masqueradeAll: false  # masqueradeAll tells kube-proxy to SNAT everything if using the pure iptables proxy mode. [Default: false]
    maxPods: 120  # maxPods is the number of pods that can run on this Kubelet. [Default: 110]
    nodeCidrMaskSize: 24  # internal network node size allocation. This is the size allocated to each node on your network. [Default: 24]
    proxyMode: ipvs  # mode specifies which proxy mode to use. [Default: ipvs]
  network:
    plugin: calico
    calico:
      ipipMode: Always  # IPIP Mode to use for the IPv4 POOL created at start up. If set to a value other than Never, vxlanMode should be set to "Never". [Always | CrossSubnet | Never] [Default: Always]
      vxlanMode: Never  # VXLAN Mode to use for the IPv4 POOL created at start up. If set to a value other than Never, ipipMode should be set to "Never". [Always | CrossSubnet | Never] [Default: Never]
      vethMTU: 1500  # The maximum transmission unit (MTU) setting determines the largest packet size that can be transmitted through your network. [Default: 1440]
    kubePodsCIDR: 10.232.64.0/18
    kubeServiceCIDR: 10.232.0.0/18
  registry:
    registryMirrors: []
    insecureRegistries: []
    privateRegistry: ""
  addons:
  - name: rbd-provisioner
    namespace: kube-system
    sources:
      chart:
        name: rbd-provisioner
        repo: https://charts.kubesphere.io/test
        values:
        # for more values, see https://github.com/kubesphere/helm-charts/tree/master/src/test/rbd-provisioner
        # Ceph access details are redacted here
        - ceph.mon=x.x.x.x:6789\,x.x.x.x:6789\,x.x.x.x:6789
        - ceph.pool=xxx-k8s
        - ceph.adminId=admin
        - ceph.adminKey=admin-token
        - ceph.userId=k8s
        - ceph.userKey=k8s-token
        - sc.isDefault=true
---
apiVersion: installer.kubesphere.io/v1alpha1
kind: ClusterConfiguration
metadata:
  name: ks-installer
  namespace: kubesphere-system
  labels:
    version: v3.0.0
spec:
  local_registry: ""
  persistence:
    storageClass: ""
  authentication:
    jwtSecret: ""
  etcd:
    monitoring: true        # Whether to install etcd monitoring dashboard
    endpointIps: 192.168.100.91,192.168.100.92,192.168.100.93  # etcd cluster endpointIps
    port: 2379              # etcd port
    tlsEnable: true
  common:
    mysqlVolumeSize: 20Gi # MySQL PVC size
    minioVolumeSize: 20Gi # Minio PVC size
    etcdVolumeSize: 20Gi  # etcd PVC size
    openldapVolumeSize: 2Gi   # openldap PVC size
    redisVolumSize: 2Gi # Redis PVC size
    es:  # Storage backend for logging, tracing, events and auditing.
      elasticsearchMasterReplicas: 1   # total number of master nodes, it's not allowed to use even number
      elasticsearchDataReplicas: 1     # total number of data nodes
      elasticsearchMasterVolumeSize: 4Gi   # Volume size of Elasticsearch master nodes
      elasticsearchDataVolumeSize: 20Gi    # Volume size of Elasticsearch data nodes
      logMaxAge: 7                     # Log retention time in built-in Elasticsearch, it is 7 days by default.
      elkPrefix: logstash              # The string making up index names. The index name will be formatted as ks-<elk_prefix>-log
      # externalElasticsearchUrl:
      # externalElasticsearchPort:
  console:
    enableMultiLogin: false  # enable/disable multiple sign-on, which allows an account to be used by several users at the same time.
    port: 30880
  alerting:                # Whether to install KubeSphere alerting system. It enables Users to customize alerting policies to send messages to receivers in time with different time intervals and alerting levels to choose from.
    enabled: false
  auditing:                # Whether to install KubeSphere audit log system. It provides a security-relevant chronological set of records, recording the sequence of activities happened in platform, initiated by different tenants.
    enabled: false
  devops:                  # Whether to install KubeSphere DevOps System. It provides out-of-box CI/CD system based on Jenkins, and automated workflow tools including Source-to-Image & Binary-to-Image
    enabled: false
    jenkinsMemoryLim: 2Gi      # Jenkins memory limit
    jenkinsMemoryReq: 1500Mi   # Jenkins memory request
    jenkinsVolumeSize: 8Gi     # Jenkins volume size
    jenkinsJavaOpts_Xms: 512m  # The following three fields are JVM parameters
    jenkinsJavaOpts_Xmx: 512m
    jenkinsJavaOpts_MaxRAM: 2g
  events:                  # Whether to install KubeSphere events system. It provides a graphical web console for Kubernetes Events exporting, filtering and alerting in multi-tenant Kubernetes clusters.
    enabled: false
  logging:                 # Whether to install KubeSphere logging system. Flexible logging functions are provided for log query, collection and management in a unified console. Additional log collectors can be added, such as Elasticsearch, Kafka and Fluentd.
    enabled: false
    logsidecarReplicas: 2
  metrics_server:                    # Whether to install metrics-server. It enables HPA (Horizontal Pod Autoscaler).
    enabled: true
  monitoring:                        #
    prometheusReplicas: 1            # Prometheus replicas are responsible for monitoring different segments of data source and provide high availability as well.
    prometheusMemoryRequest: 900Mi   # Prometheus request memory
    prometheusVolumeSize: 20Gi       # Prometheus PVC size
    alertmanagerReplicas: 1          # AlertManager Replicas
  multicluster:
    clusterRole: none  # host | member | none  # You can install a solo cluster, or specify it as the role of host or member cluster
  networkpolicy:       # Network policies allow network isolation within the same cluster, which means firewalls can be set up between certain instances (Pods).
    enabled: false
  notification:        # Email Notification support for the legacy alerting system, should be enabled/disabled together with the above alerting option
    enabled: false
  openpitrix:          # Whether to install KubeSphere Application Store. It provides an application store for Helm-based applications, and offer application lifecycle management
    enabled: false
  servicemesh:         # Whether to install KubeSphere Service Mesh (Istio-based). It provides fine-grained traffic management, observability and tracing, and offer visualization for traffic topology
    enabled: false

8. Run the installation
Installation command: ./kk create cluster -f ks-config.yaml
The installation occasionally hit errors; in that case the cluster was deleted and reinstalled with: ./kk delete cluster -f ks-config.yaml
9. Resource status after installation:

[root@k8s-master01 kubesphere]# kubectl -n kube-system get pods
NAME                                       READY   STATUS    RESTARTS   AGE
calico-kube-controllers-59d85c5c84-26l68   1/1     Running   0          11h
calico-node-cb9kn                          1/1     Running   0          11h
calico-node-gwc58                          1/1     Running   0          11h
calico-node-lwkff                          1/1     Running   0          11h
calico-node-stszc                          1/1     Running   0          11h
calico-node-svkjz                          1/1     Running   0          11h
calico-node-v2dmd                          1/1     Running   0          11h
calico-node-wwp8d                          1/1     Running   0          11h
calico-node-x6jwb                          1/1     Running   0          11h
coredns-74d59cc5c6-d4qxg                   1/1     Running   0          11h
coredns-74d59cc5c6-qv6qg                   1/1     Running   0          11h
kube-apiserver-k8s-master01                1/1     Running   0          11h
kube-apiserver-k8s-master02                1/1     Running   0          11h
kube-apiserver-k8s-master03                1/1     Running   0          11h
kube-controller-manager-k8s-master01       1/1     Running   0          11h
kube-controller-manager-k8s-master02       1/1     Running   0          11h
kube-controller-manager-k8s-master03       1/1     Running   0          11h
kube-proxy-4mtz7                           1/1     Running   0          11h
kube-proxy-9qcqd                           1/1     Running   0          11h
kube-proxy-9qfhb                           1/1     Running   0          11h
kube-proxy-g858h                           1/1     Running   0          11h
kube-proxy-nkksv                           1/1     Running   0          11h
kube-proxy-nmhq2                           1/1     Running   0          11h
kube-proxy-vrkxs                           1/1     Running   0          11h
kube-proxy-xqhxz                           1/1     Running   0          11h
kube-scheduler-k8s-master01                1/1     Running   0          11h
kube-scheduler-k8s-master02                1/1     Running   0          11h
kube-scheduler-k8s-master03                1/1     Running   0          11h
metrics-server-5ddd98b7f9-dl8xf            1/1     Running   0          10h
nodelocaldns-4zjwm                         1/1     Running   0          11h
nodelocaldns-g25z9                         1/1     Running   0          11h
nodelocaldns-jz5nz                         1/1     Running   0          11h
nodelocaldns-k8lkv                         1/1     Running   0          11h
nodelocaldns-m88qx                         1/1     Running   0          11h
nodelocaldns-pcjvg                         1/1     Running   0          11h
nodelocaldns-sg4tv                         1/1     Running   0          11h
nodelocaldns-ztvmc                         1/1     Running   0          11h
rbd-provisioner-7d87776cf7-7jwdd           1/1     Running   0          11h
snapshot-controller-0                      1/1     Running   0          10h
[root@k8s-master01 kubesphere]# kubectl -n kubesphere-system get pods
NAME                                     READY   STATUS    RESTARTS   AGE
ks-apiserver-86f894d9bb-4n7bf            1/1     Running   0          10h
ks-apiserver-86f894d9bb-pmtqn            1/1     Running   0          41m
ks-apiserver-86f894d9bb-t94fx            1/1     Running   0          41m
ks-console-9bc9c5df8-6l6lc               1/1     Running   0          10h
ks-controller-manager-6ff6df55b8-lr72j   1/1     Running   0          10h
ks-controller-manager-6ff6df55b8-p922p   1/1     Running   0          10h
ks-controller-manager-6ff6df55b8-tvjs7   1/1     Running   0          10h
ks-installer-85854b8c8-pj224             1/1     Running   0          11h
openldap-0                               1/1     Running   0          10h
openldap-1                               1/1     Running   0          10h
redis-ha-haproxy-ffb8d889d-7njmz         1/1     Running   1          10h
redis-ha-haproxy-ffb8d889d-gc7kf         1/1     Running   1          10h
redis-ha-haproxy-ffb8d889d-vtfpc         1/1     Running   2          10h
redis-ha-server-0                        2/2     Running   0          10h
redis-ha-server-1                        2/2     Running   0          10h
redis-ha-server-2                        2/2     Running   0          10h

Successful installation log:

#####################################################
###              Welcome to KubeSphere!           ###
#####################################################

Console: http://192.168.100.91:30880
Account: admin
Password: P@88w0rd

NOTES:
  1. After logging into the console, please check the
     monitoring status of service components in
     the "Cluster Management". If any service is not
     ready, please wait patiently until all components 
     are ready.
  2. Please modify the default password after login.

#####################################################
https://kubesphere.io             2020-10-10 00:01:18
#####################################################

Now, the problem
Accessing http://192.168.100.91:30880 is sluggish; occasionally the page loads after a few refreshes, but logging in then hangs. The ks-console logs are as follows:

......
# kubectl -n kubesphere-system logs -f ks-console-9bc9c5df8-6l6lc
 --> GET / 302 2ms 43b 2020/10/10T10:45:29.615
  <-- GET /login 2020/10/10T10:45:29.616
(node:16) UnhandledPromiseRejectionWarning: redis.info: Executed timeout 5000 ms
    at Timeout._onTimeout (/opt/kubesphere/console/server/server.js:64530:19)
    at ontimeout (timers.js:498:11)
    at tryOnTimeout (timers.js:323:5)
    at Timer.listOnTimeout (timers.js:290:5)
(node:16) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 11641)
{ redis.set: Executed timeout 5000 ms
    at Timeout._onTimeout (/opt/kubesphere/console/server/server.js:64530:19)
    at ontimeout (timers.js:498:11)
    at tryOnTimeout (timers.js:323:5)
    at Timer.listOnTimeout (timers.js:290:5)
  name: 'redis.set',
  args: 
   [ 'sid-b7mWELvecc_rVal4cPGqArt4GldBsrlX',
     '{"salt":"x8CQ0Rt5l2jnwg33"}',
     'EX',
     7200000 ] }
  --> GET /login 200 5,014ms 14.82kb 2020/10/10T10:45:34.630
  <-- GET / 2020/10/10T10:45:39.613
{ UnauthorizedError: Not Login
    at Object.throw (/opt/kubesphere/console/server/server.js:31701:11)
    at getCurrentUser (/opt/kubesphere/console/server/server.js:9037:14)
    at renderView (/opt/kubesphere/console/server/server.js:23231:46)
    at dispatch (/opt/kubesphere/console/server/server.js:6870:32)
    at next (/opt/kubesphere/console/server/server.js:6871:18)
    at /opt/kubesphere/console/server/server.js:70183:16
    at dispatch (/opt/kubesphere/console/server/server.js:6870:32)
    at next (/opt/kubesphere/console/server/server.js:6871:18)
    at /opt/kubesphere/console/server/server.js:77986:37
    at dispatch (/opt/kubesphere/console/server/server.js:6870:32)
    at next (/opt/kubesphere/console/server/server.js:6871:18)
    at /opt/kubesphere/console/server/server.js:70183:16
    at dispatch (/opt/kubesphere/console/server/server.js:6870:32)
    at next (/opt/kubesphere/console/server/server.js:6871:18)
    at /opt/kubesphere/console/server/server.js:77986:37
    at dispatch (/opt/kubesphere/console/server/server.js:6870:32) message: 'Not Login' }
  --> GET / 302 2ms 43b 2020/10/10T10:45:39.615
  <-- GET /login 2020/10/10T10:45:39.616

1. Redis checks (reads and writes work fine, which basically rules out Redis itself; they also work fine on the node running ks-console, and a check from inside the ks-console pod is sketched below as well):

[root@k8s-master01 kubesphere]# kubectl -n kubesphere-system  exec -it redis-ha-server-0 -- sh -c 'for i in `seq 0 2`; do nc -vz redis-ha-server-$i.redis-ha.kubesphere-system.svc 6379; done'
Defaulting container name to redis.
Use 'kubectl describe pod/redis-ha-server-0 -n kubesphere-system' to see all of the containers in this pod.
redis-ha-server-0.redis-ha.kubesphere-system.svc (10.232.113.2:6379) open
redis-ha-server-1.redis-ha.kubesphere-system.svc (10.232.76.4:6379) open
redis-ha-server-2.redis-ha.kubesphere-system.svc (10.232.66.9:6379) open
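
The same kind of TCP check can also be run from inside the ks-console pod; a sketch, assuming nc is available in the ks-console image (if not, the same check can be done from a debug pod scheduled on the same node):

kubectl -n kubesphere-system exec -it ks-console-9bc9c5df8-6l6lc -- nc -vz redis.kubesphere-system.svc 6379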

[root@k8s-master01 kubesphere]# kubectl -n kubesphere-system  get svc
NAME                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)              AGE
redis                   ClusterIP   10.232.6.151    <none>        6379/TCP             10h

[root@k8s-master01 kubesphere]# redis-cli -h 10.232.6.151
10.232.6.151:6379> set k v 
OK
10.232.6.151:6379> get k
"v"

2. DNS check

[root@k8s-master01 kubesphere]# kubectl -n kubesphere-system exec -it ks-console-9bc9c5df8-6l6lc   nslookup redis.kubesphere-system.svc
nslookup: can't resolve '(null)': Name does not resolve

Name:      redis.kubesphere-system.svc
Address 1: 10.232.6.151 redis.kubesphere-system.svc.cluster.local

3. ks-apiserver connectivity check
Pinging the ks-apiserver pod IPs from any node works fine:

[root@k8s-master01 kubesphere]# kubectl -n kubesphere-system get pods -o wide
NAME                                     READY   STATUS    RESTARTS   AGE   IP             NODE           NOMINATED NODE   READINESS GATES
ks-apiserver-86f894d9bb-4n7bf            1/1     Running   0          10h   10.232.76.5    k8s-master03   <none>           <none>
ks-apiserver-86f894d9bb-pmtqn            1/1     Running   0          54m   10.232.113.8   k8s-master02   <none>           <none>
ks-apiserver-86f894d9bb-t94fx            1/1     Running   0          54m   10.232.66.12   k8s-master01   <none>           <none>
[root@k8s-master01 kubesphere]# ping 10.232.76.5
PING 10.232.76.5 (10.232.76.5) 56(84) bytes of data.
64 bytes from 10.232.76.5: icmp_seq=1 ttl=63 time=0.273 ms
^C
--- 10.232.76.5 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.273/0.273/0.273/0.000 ms
[root@k8s-master01 kubesphere]# ping 10.232.113.8
PING 10.232.113.8 (10.232.113.8) 56(84) bytes of data.
64 bytes from 10.232.113.8: icmp_seq=1 ttl=63 time=0.287 ms
^C
--- 10.232.113.8 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.287/0.287/0.287/0.000 ms
[root@k8s-master01 kubesphere]# ping 10.232.66.12
PING 10.232.66.12 (10.232.66.12) 56(84) bytes of data.
64 bytes from 10.232.66.12: icmp_seq=1 ttl=64 time=0.089 ms
^C
--- 10.232.66.12 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.089/0.089/0.089/0.000 ms

4. To further verify ks-apiserver connectivity, its API was called directly; as shown below, both pods respond normally:

[root@k8s-master01 kubesphere]# curl -i -X POST -H 'Content-Type: application/x-www-form-urlencoded' \
>  'http://10.232.76.5:9090/oauth/token' \
>   --data-urlencode 'grant_type=password' \
>   --data-urlencode 'username=admin' \
>   --data-urlencode 'password=P@88w0rd'
HTTP/1.1 200 OK
Content-Type: application/json
Date: Sat, 10 Oct 2020 02:59:59 GMT
Content-Length: 690

[root@k8s-master01 kubesphere]# curl -i -X POST -H 'Content-Type: application/x-www-form-urlencoded' \
>  'http://10.232.113.8:9090/oauth/token' \
>   --data-urlencode 'grant_type=password' \
>   --data-urlencode 'username=admin' \
>   --data-urlencode 'password=P@88w0rd'
HTTP/1.1 200 OK
Content-Type: application/json
Date: Sat, 10 Oct 2020 03:00:31 GMT
Content-Length: 690

5. The key finding
Since the Dashboard can occasionally be reached, I focused on the request paths that fail and found the following.
For example, when the Dashboard requests: http://xxx:9090/kapis/tenant.kubesphere.io/v1alpha2/workspaces/system-workspace/namespaces?labelSelector=kubefed.io%2Fmanaged%21%3Dtrue%2C%20kubesphere.io%2Fkubefed-host-namespace%21%3Dtrue&sortBy=createTime&limit=10
the request hangs: the page keeps spinning and the server never responds. So I also tried calling ks-apiserver directly, as follows.
Calling the ks-apiserver pod on master01 from master01 itself works fine, for example:

[root@k8s-master01 kubesphere]# curl -i -X GET -H "Authorization: Bearer token" \
>   -H 'Content-Type: application/json' \
>   'http://10.232.66.12:9090/kapis/tenant.kubesphere.io/v1alpha2/workspaces/system-workspace/namespaces?labelSelector=kubefed.io%2Fmanaged%21%3Dtrue%2C%20kubesphere.io%2Fkubefed-host-namespace%21%3Dtrue&sortBy=createTime&limit=10'
HTTP/1.1 200 OK
Content-Type: application/json
Date: Sat, 10 Oct 2020 03:06:55 GMT
Transfer-Encoding: chunked

However!!! Calling the ks-apiserver pod on master02 from master01 reproduces exactly the same problem as the Dashboard: the request hangs indefinitely, which points to a problem with cross-node access. To dig further into the inter-node network, I modified the request.

[root@k8s-master01 kubesphere]# curl -i -X GET -H "Authorization: Bearer token" \
>   -H 'Content-Type: application/json' \
>   'http://10.232.113.8:9090/kapis/tenant.kubesphere.io/v1alpha2/workspaces/system-workspace/namespaces?labelSelector=kubefed.io%2Fmanaged%21%3Dtrue%2C%20kubesphere.io%2Fkubefed-host-namespace%21%3Dtrue&sortBy=createTime&limit=10'

I changed the GET request from master01 to the ks-apiserver pod on master02 into a POST request; this time the ks-apiserver on master02 responded immediately, which suggests that cross-node network access itself is fine.

[root@k8s-master01 kubesphere]# curl -i -X POST -H "Authorization: Bearer token" \
>   -H 'Content-Type: application/json' \
>   'http://10.232.113.8:9090/kapis/tenant.kubesphere.io/v1alpha2/workspaces/system-workspace/namespaces?labelSelector=kubefed.io%2Fmanaged%21%3Dtrue%2C%20kubesphere.io%2Fkubefed-host-namespace%21%3Dtrue&sortBy=createTime&limit=10'
HTTP/1.1 400 Bad Request
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Sat, 10 Oct 2020 03:10:38 GMT
Content-Length: 4

From this I infer that ks-apiserver gets stuck while accessing some resource. Its logs show neither errors nor timeouts, there is no debug option among the ks-apiserver parameters, so I cannot tell what is happening inside it, and reading the source did not give a quick answer. This is where I am stuck!!!
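
One additional comparison that might narrow it down is to repeat the hanging cross-node GET with a much smaller page size, to see whether only large responses get stuck; a sketch that reuses the pod IP and bearer-token placeholder from above, only changing limit to 1:

curl -i -X GET -H "Authorization: Bearer token" \
  -H 'Content-Type: application/json' \
  'http://10.232.113.8:9090/kapis/tenant.kubesphere.io/v1alpha2/workspaces/system-workspace/namespaces?sortBy=createTime&limit=1'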

  • Jeff replied to this post

    cnicy To narrow the problem down and make it easier to debug, you can scale both ks-apiserver and ks-console down to 1 replica and then check the logs again. Also schedule ks-console and ks-apiserver onto the same node to rule out cross-node network communication.
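
A sketch of the commands this suggests, using the deployment names visible in the pod lists above (the nodeSelector value is only an example for pinning onto master01):

kubectl -n kubesphere-system scale deployment ks-apiserver --replicas=1
kubectl -n kubesphere-system scale deployment ks-console --replicas=1
# Optionally pin ks-console (and likewise ks-apiserver) to one node via a nodeSelector
kubectl -n kubesphere-system patch deployment ks-console -p '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/hostname":"k8s-master01"}}}}}'
kubectl -n kubesphere-system logs -f deploy/ks-console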

      Jeff I did test it that way: with ks-console and ks-apiserver on the same node everything is fine, but on different nodes the problem appears.

      • Jeff replied to this post

        cnicy Then the network plugin is at fault, and that is a Kubernetes-level issue. Search the issues of the k8s repo and of the plugin's repo and see whether there is a known fix.

          Jeff I double-checked: the network plugin is fine, and proxy and DNS are both normal. Requests do reach ks-apiserver, but some of them get held, which suggests that some ks-apiserver code path has no timeout, and since ks-apiserver has no debug logging I cannot tell where it hangs. I ran the ks-console proxy locally and exposed the cluster's ks-apiserver Service via a NodePort; from my local machine everything works perfectly. So something inside the cluster appears to be deadlocking.
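
For reference, exposing the ks-apiserver Service as a NodePort (as described above) can be done with a patch like the following sketch, assuming the Service is named ks-apiserver:

kubectl -n kubesphere-system patch svc ks-apiserver -p '{"spec":{"type":"NodePort"}}'
kubectl -n kubesphere-system get svc ks-apiserver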

            2 months later

            cnicy I am hitting this problem too: access from a non-master node occasionally freezes, and access from a master node also has problems; the ks-console pod keeps restarting. My Kubernetes version is 1.19.3 and the CA has been changed. Access across different machines is just very difficult.