### 1. Kubernetes version

kubectl version
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.10", GitCommit:"f3add640dbcd4f3c33a7749f38baaac0b3fe810d", GitTreeState:"clean", BuildDate:"2020-05-20T14:00:52Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.10", GitCommit:"f3add640dbcd4f3c33a7749f38baaac0b3fe810d", GitTreeState:"clean", BuildDate:"2020-05-20T13:51:56Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

One master node and three worker nodes:

kubectl get nodes
NAME      STATUS   ROLES    AGE   VERSION
master1   Ready    master   22d   v1.16.10
master2   Ready    <none>   22d   v1.16.10
master3   Ready    <none>   22d   v1.16.10
worker1   Ready    <none>   21d   v1.16.10

### 2. All components of KubeSphere 2.1.1 are installed

helm list
NAME                         	REVISION	UPDATED                 	STATUS  	CHART                      	APP VERSION                 	NAMESPACE                
elasticsearch-logging        	1       	Wed Nov 11 18:24:06 2020	DEPLOYED	elasticsearch-1.22.1       	6.7.0-0217                  	kubesphere-logging-system
elasticsearch-logging-curator	1       	Wed Nov 11 18:24:09 2020	DEPLOYED	elasticsearch-curator-1.3.3	5.5.4-0217                  	kubesphere-logging-system
istio                        	1       	Wed Nov 11 18:33:32 2020	DEPLOYED	istio-1.3.3                	1.3.3                       	istio-system             
istio-init                   	1       	Wed Nov 11 18:32:48 2020	DEPLOYED	istio-init-1.3.2           	1.3.2                       	istio-system             
jaeger-operator              	1       	Wed Nov 11 18:34:13 2020	DEPLOYED	jaeger-operator-2.9.0      	1.13.1                      	istio-system             
ks-jenkins                   	4       	Mon Nov 23 09:50:01 2020	DEPLOYED	jenkins-0.19.0             	2.121.3-0217                	kubesphere-devops-system 
ks-minio                     	1       	Wed Nov 11 18:22:25 2020	DEPLOYED	minio-2.5.16               	RELEASE.2019-08-07T01-59-21Z	kubesphere-system        
ks-openldap                  	1       	Wed Nov 11 18:17:00 2020	DEPLOYED	openldap-ha-0.1.0          	1.0                         	kubesphere-system        
ks-openpitrix                	1       	Wed Nov 11 18:23:59 2020	DEPLOYED	openpitrix-0.1.0           	v0.4.8                      	openpitrix-system        
logging-fluentbit-operator   	1       	Wed Nov 11 18:24:03 2020	DEPLOYED	fluentbit-operator-0.1.0   	0.1.0-0217                  	kubesphere-logging-system
metrics-server               	1       	Wed Nov 11 18:21:46 2020	DEPLOYED	metrics-server-2.5.0       	0.3.1-0217                  	kube-system                            
uc                           	1       	Wed Nov 11 18:31:00 2020	DEPLOYED	jenkins-update-center-0.8.0	2.1.1                       	kubesphere-devops-system 

### 3. Everything was running fine until one day a node suddenly went down; the metrics-server and ks-apiserver running on it were migrated, and ks-apiserver has been unable to start ever since

The node is lost:
kubectl get nodes
NAME      STATUS     ROLES    AGE   VERSION
master1   Ready      master   22d   v1.16.10
master2   Ready      <none>   22d   v1.16.10
master3   Ready      <none>   22d   v1.16.10
worker1   NotReady   <none>   21d   v1.16.10
ks-apiserver cannot start:

kubectl get pods -n kubesphere-system -o wide
NAME                                     READY   STATUS    RESTARTS   AGE     IP              NODE      NOMINATED NODE   READINESS GATES
etcd-5769d4997f-8vkxc                    1/1     Running   0          3h16m   10.244.180.48   master2   <none>           <none>
ks-account-78dd6486bf-s5w8x              1/1     Running   0          10d     10.244.137.69   master1   <none>           <none>
ks-apigateway-764d86967d-7qrkh           1/1     Running   0          10d     10.244.137.86   master1   <none>           <none>
ks-apiserver-6b75dfdf4-gjxpr             0/1     Error     1          10d     10.244.137.83   master1   <none>           <none>
ks-console-7fd5b7d47-rkwmv               1/1     Running   0          10d     10.244.137.82   master1   <none>           <none>
ks-controller-manager-5cd6ff58b7-5cg8p   1/1     Running   1          10d     10.244.137.68   master1   <none>           <none>
ks-installer-7d9fb945c7-ld466            1/1     Running   1          20d     10.244.136.13   master3   <none>           <none>
minio-845b7bd867-r762v                   1/1     Running   1          20d     10.244.136.32   master3   <none>           <none>
mysql-66df969d-b9dzk                     1/1     Running   1          20d     10.244.136.11   master3   <none>           <none>
openldap-0                               1/1     Running   5          22d     10.244.137.78   master1   <none>           <none>
redis-6fd6c6d6f9-jtxzs                   1/1     Running   4          22d     10.244.137.75   master1   <none>           <none>
### 4. The ks-apiserver logs confirm the error happens when it talks to kube-apiserver
kubectl logs -f ks-apiserver-6b75dfdf4-gjxpr -n kubesphere-system
W1203 10:26:44.173078       1 client_config.go:549] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I1203 10:26:44.174200       1 server.go:179] Start cache objects
Error: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
Usage:
  ks-apiserver [flags]

Flags:
      --add-dir-header                                  If true, adds the file directory to the header
      --alsologtostderr                                 log to standard error as well as files
      --bind-address string                             server bind address (default "0.0.0.0")
      --elasticsearch-host string                       ElasticSearch logging service host. KubeSphere is using elastic as log store, if this filed left blank, KubeSphere will use kubernetes builtin log API instead, and the following elastic search options will be ignored.
      --elasticsearch-version string                    ElasticSearch major version, e.g. 5/6/7, if left blank, will detect automatically.Currently, minimum supported version is 5.x
  -h, --help                                            help for ks-apiserver
      --index-prefix string                             Index name prefix. KubeSphere will retrieve logs against indices matching the prefix. (default "fluentbit")
      --insecure-port int                               insecure port number (default 9090)
      --istio-pilot-host string                         istio pilot discovery service url
      --jaeger-query-host string                        jaeger query service url
      --jenkins-host string                             Jenkins service host address. If left blank, means Jenkins is unnecessary.
      --jenkins-max-connections int                     Maximum allowed connections to Jenkins.  (default 100)
      --jenkins-password string                         Password for access to Jenkins service, used pair with username.
      --jenkins-username string                         Username for access to Jenkins service. Leave it blank if there isn't any.
      --kubeconfig string                               Path for kubernetes kubeconfig file, if left blank, will use in cluster way.
      --log-backtrace-at traceLocation                  when logging hits line file:N, emit a stack trace (default :0)
      --log-dir string                                  If non-empty, write log files in this directory
      --log-file string                                 If non-empty, use this log file
      --log-file-max-size uint                          Defines the maximum size a log file can grow to. Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800)
      --logtostderr                                     log to standard error instead of files (default true)
      --master string                                   Used to generate kubeconfig for downloading, if not specified, will use host in kubeconfig.
      --mysql-host string                               MySQL service host address. If left blank, the following related mysql options will be ignored.
      --mysql-max-connection-life-time duration         Maximum connection life time allowed to connecto to mysql. (default 10s)
      --mysql-max-idle-connections int                  Maximum idle connections allowed to connect to mysql. (default 100)
      --mysql-max-open-connections int                  Maximum open connections allowed to connect to mysql. (default 100)
      --mysql-password string                           Password for access to mysql, should be used pair with password.
      --mysql-username string                           Username for access to mysql service.
      --openpitrix-app-manager-endpoint string          OpenPitrix app manager endpoint
      --openpitrix-attachment-manager-endpoint string   OpenPitrix attachment manager endpoint
      --openpitrix-category-manager-endpoint string     OpenPitrix category manager endpoint
      --openpitrix-cluster-manager-endpoint string      OpenPitrix cluster manager endpoint
      --openpitrix-repo-indexer-endpoint string         OpenPitrix repo indexer endpoint
      --openpitrix-repo-manager-endpoint string         OpenPitrix repo manager endpoint
      --openpitrix-runtime-manager-endpoint string      OpenPitrix runtime manager endpoint
      --prometheus-endpoint string                      Prometheus service endpoint which stores KubeSphere monitoring data, if left blank, will use builtin metrics-server as data source.
      --prometheus-secondary-endpoint string            Prometheus secondary service endpoint, if left empty and endpoint is set, will use endpoint instead.
      --s3-access-key-id string                         access key of s2i s3 (default "AKIAIOSFODNN7EXAMPLE")
      --s3-bucket string                                bucket name of s2i s3 (default "s2i-binaries")
      --s3-disable-SSL                                  disable ssl (default true)
      --s3-endpoint string                              Endpoint to access to s3 object storage service, if left blank, the following options will be ignored.
      --s3-force-path-style                             force path style (default true)
      --s3-region string                                Region of s3 that will access to, like us-east-1. (default "us-east-1")
      --s3-secret-access-key string                     secret access key of s2i s3 (default "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY")
      --s3-session-token string                         session token of s2i s3
      --secure-port int                                 secure port number
      --servicemesh-prometheus-host string              prometheus service for servicemesh
      --skip-headers                                    If true, avoid header prefixes in the log messages
      --skip-log-headers                                If true, avoid headers when opening log files
      --sonarqube-host string                           Sonarqube service address, if left empty, following sonarqube options will be ignored.
      --sonarqube-token string                          Sonarqube service access token.
      --stderrthreshold severity                        logs at or above this threshold go to stderr (default 2)
      --tls-cert-file string                            tls cert file
      --tls-private-key string                          tls private key
  -v, --v Level                                         number for the log level verbosity
      --vmodule moduleSpec                              comma-separated list of pattern=N settings for file-filtered logging

2020/12/03 10:26:44 unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
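The long usage dump above is just the help text ks-apiserver prints when it exits on a fatal error; the real failure is the aggregated-API discovery error on the last line. A quick check (not part of the original logs) is to inspect the failing APIService and its status conditions:

kubectl get apiservice v1beta1.metrics.k8s.io
kubectl get apiservice v1beta1.metrics.k8s.io -o yaml
# status.conditions shows Available=True for a healthy aggregated API;
# here it should instead report a failure reason such as FailedDiscoveryCheck.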

### 5. The kube-apiserver logs show it failing to reach metrics.k8s.io/v1beta1

kubectl logs -f kube-apiserver-master1  -n kube-system
.....

E1203 10:22:44.410647       1 available_controller.go:416] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: Get https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E1203 10:22:49.410972       1 available_controller.go:416] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: Get https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E1203 10:22:54.411326       1 available_controller.go:416] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: Get https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E1203 10:22:59.411622       1 available_controller.go:416] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: Get https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E1203 10:23:04.411972       1 available_controller.go:416] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: Get https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E1203 10:23:09.412390       1 available_controller.go:416] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: Get https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E1203 10:23:14.412799       1 available_controller.go:416] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: Get https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E1203 10:23:19.413131       1 available_controller.go:416] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: Get https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E1203 10:23:21.758943       1 timeout.go:132] net/http: abort Handler
E1203 10:23:24.413409       1 available_controller.go:416] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: Get https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E1203 10:23:29.413710       1 available_controller.go:416] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: Get https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E1203 10:23:34.414094       1 available_controller.go:416] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: Get https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E1203 10:23:39.534172       1 available_controller.go:416] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: Get https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I1203 10:23:40.410119       1 controller.go:107] OpenAPI AggregationController: Processing item v1beta1.metrics.k8s.io
W1203 10:23:40.410178       1 handler_proxy.go:99] no RequestInfo found in the context
E1203 10:23:40.410216       1 controller.go:114] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable
, Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
I1203 10:23:40.410223       1 controller.go:127] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
E1203 10:24:04.246085       1 available_controller.go:416] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: Get https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E1203 10:24:09.246520       1 available_controller.go:416] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: Get https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E1203 10:24:34.246193       1 available_controller.go:416] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: Get https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E1203 10:24:39.246591       1 available_controller.go:416] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: Get https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
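The aggregator is timing out on the Service IP 10.101.186.48. A check not shown in this thread, but which helps narrow things down, is to look at what actually backs that Service; if its Endpoints still list only the pod from the lost node, there is nothing healthy for kube-apiserver to reach:

kubectl -n kube-system get endpoints metrics-server
kubectl -n kube-system describe service metrics-server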

### 6. Check metrics-server; it has already been recreated on another node

kubectl get pods -o wide -n kube-system
NAME                                      READY   STATUS     RESTARTS   AGE     IP               NODE      NOMINATED NODE   READINESS GATES
calico-kube-controllers-bbdc58449-cgsbt   1/1     Running    5          22d     10.244.137.79    master1   <none>           <none>
calico-node-2qk2b                         1/1     NodeLost   5          21d     192.168.210.74   worker1   <none>           <none>
calico-node-gjdrw                         1/1     Running    5          22d     192.168.210.73   master3   <none>           <none>
calico-node-jwbnz                         1/1     Running    4          22d     192.168.210.72   master2   <none>           <none>
calico-node-svmv7                         1/1     Running    5          22d     192.168.210.71   master1   <none>           <none>
coredns-85d448b787-8nks7                  1/1     Running    5          22d     10.244.137.87    master1   <none>           <none>
coredns-85d448b787-lpxtt                  1/1     Running    5          22d     10.244.137.80    master1   <none>           <none>
etcd-master1                              1/1     Running    5          22d     192.168.210.71   master1   <none>           <none>
kube-apiserver-master1                    1/1     Running    2          7d8h    192.168.210.71   master1   <none>           <none>
kube-controller-manager-master1           1/1     Running    9          22d     192.168.210.71   master1   <none>           <none>
kube-proxy-7wkbr                          1/1     Running    5          22d     192.168.210.71   master1   <none>           <none>
kube-proxy-8d7dj                          1/1     Running    4          22d     192.168.210.72   master2   <none>           <none>
kube-proxy-nhdsn                          1/1     NodeLost   4          21d     192.168.210.74   worker1   <none>           <none>
kube-proxy-sfbjm                          1/1     Running    4          22d     192.168.210.73   master3   <none>           <none>
kube-scheduler-master1                    1/1     Running    9          22d     192.168.210.71   master1   <none>           <none>
metrics-server-8b7689b66-xm6mf            1/1     Running    0          36s     10.244.180.55    master2   <none>           <none>
metrics-server-8b7689b66-z9hk9            1/1     Unknown    0          3m58s   10.244.235.186   worker1   <none>           <none>
tiller-deploy-5fd994b8f-twpn2             1/1     Running    0          3h13m   10.244.180.23    master2   <none>           <none>

### 7. Check api-resources

kubectl api-resources
NAME                              SHORTNAMES   APIGROUP                       NAMESPACED   KIND
bindings                                                                      true         Binding
componentstatuses                 cs                                          false        ComponentStatus
configmaps                        cm                                          true         ConfigMap
endpoints                         ep                                          true         Endpoints
events                            ev                                          true         Event
limitranges                       limits                                      true         LimitRange
namespaces                        ns                                          false        Namespace
nodes                             no                                          false        Node
persistentvolumeclaims            pvc                                         true         PersistentVolumeClaim
persistentvolumes                 pv                                          false        PersistentVolume
pods                              po                                          true         Pod
podtemplates                                                                  true         PodTemplate
replicationcontrollers            rc                                          true         ReplicationController
resourcequotas                    quota                                       true         ResourceQuota
secrets                                                                       true         Secret
serviceaccounts                   sa                                          true         ServiceAccount
services                          svc                                         true         Service
mutatingwebhookconfigurations                  admissionregistration.k8s.io   false        MutatingWebhookConfiguration
validatingwebhookconfigurations                admissionregistration.k8s.io   false        ValidatingWebhookConfiguration
customresourcedefinitions         crd,crds     apiextensions.k8s.io           false        CustomResourceDefinition
apiservices                                    apiregistration.k8s.io         false        APIService
applications                                   app.k8s.io                     true         Application
controllerrevisions                            apps                           true         ControllerRevision
daemonsets                        ds           apps                           true         DaemonSet
deployments                       deploy       apps                           true         Deployment
replicasets                       rs           apps                           true         ReplicaSet
statefulsets                      sts          apps                           true         StatefulSet
meshpolicies                                   authentication.istio.io        false        MeshPolicy
policies                                       authentication.istio.io        true         Policy
tokenreviews                                   authentication.k8s.io          false        TokenReview
localsubjectaccessreviews                      authorization.k8s.io           true         LocalSubjectAccessReview
selfsubjectaccessreviews                       authorization.k8s.io           false        SelfSubjectAccessReview
selfsubjectrulesreviews                        authorization.k8s.io           false        SelfSubjectRulesReview
subjectaccessreviews                           authorization.k8s.io           false        SubjectAccessReview
horizontalpodautoscalers          hpa          autoscaling                    true         HorizontalPodAutoscaler
cronjobs                          cj           batch                          true         CronJob
jobs                                           batch                          true         Job
certificatesigningrequests        csr          certificates.k8s.io            false        CertificateSigningRequest
adapters                                       config.istio.io                true         adapter
attributemanifests                             config.istio.io                true         attributemanifest
handlers                                       config.istio.io                true         handler
httpapispecbindings                            config.istio.io                true         HTTPAPISpecBinding
httpapispecs                                   config.istio.io                true         HTTPAPISpec
instances                                      config.istio.io                true         instance
quotaspecbindings                              config.istio.io                true         QuotaSpecBinding
quotaspecs                                     config.istio.io                true         QuotaSpec
rules                                          config.istio.io                true         rule
templates                                      config.istio.io                true         template
leases                                         coordination.k8s.io            true         Lease
bgpconfigurations                              crd.projectcalico.org          false        BGPConfiguration
bgppeers                                       crd.projectcalico.org          false        BGPPeer
blockaffinities                                crd.projectcalico.org          false        BlockAffinity
clusterinformations                            crd.projectcalico.org          false        ClusterInformation
felixconfigurations                            crd.projectcalico.org          false        FelixConfiguration
globalnetworkpolicies             gnp          crd.projectcalico.org          false        GlobalNetworkPolicy
globalnetworksets                              crd.projectcalico.org          false        GlobalNetworkSet
hostendpoints                                  crd.projectcalico.org          false        HostEndpoint
ipamblocks                                     crd.projectcalico.org          false        IPAMBlock
ipamconfigs                                    crd.projectcalico.org          false        IPAMConfig
ipamhandles                                    crd.projectcalico.org          false        IPAMHandle
ippools                                        crd.projectcalico.org          false        IPPool
kubecontrollersconfigurations                  crd.projectcalico.org          false        KubeControllersConfiguration
networkpolicies                                crd.projectcalico.org          true         NetworkPolicy
networksets                                    crd.projectcalico.org          true         NetworkSet
s2ibinaries                                    devops.kubesphere.io           true         S2iBinary
s2ibuilders                       s2ib         devops.kubesphere.io           true         S2iBuilder
s2ibuildertemplates               s2ibt        devops.kubesphere.io           false        S2iBuilderTemplate
s2iruns                           s2ir         devops.kubesphere.io           true         S2iRun
events                            ev           events.k8s.io                  true         Event
ingresses                         ing          extensions                     true         Ingress
jaegers                                        jaegertracing.io               true         Jaeger
fluentbits                                     logging.kubesphere.io          true         FluentBit
alertmanagers                                  monitoring.coreos.com          true         Alertmanager
podmonitors                                    monitoring.coreos.com          true         PodMonitor
prometheuses                                   monitoring.coreos.com          true         Prometheus
prometheusrules                                monitoring.coreos.com          true         PrometheusRule
servicemonitors                                monitoring.coreos.com          true         ServiceMonitor
destinationrules                  dr           networking.istio.io            true         DestinationRule
envoyfilters                                   networking.istio.io            true         EnvoyFilter
gateways                          gw           networking.istio.io            true         Gateway
serviceentries                    se           networking.istio.io            true         ServiceEntry
sidecars                                       networking.istio.io            true         Sidecar
virtualservices                   vs           networking.istio.io            true         VirtualService
ingresses                         ing          networking.k8s.io              true         Ingress
networkpolicies                   netpol       networking.k8s.io              true         NetworkPolicy
runtimeclasses                                 node.k8s.io                    false        RuntimeClass
poddisruptionbudgets              pdb          policy                         true         PodDisruptionBudget
podsecuritypolicies               psp          policy                         false        PodSecurityPolicy
clusterrolebindings                            rbac.authorization.k8s.io      false        ClusterRoleBinding
clusterroles                                   rbac.authorization.k8s.io      false        ClusterRole
rolebindings                                   rbac.authorization.k8s.io      true         RoleBinding
roles                                          rbac.authorization.k8s.io      true         Role
authorizationpolicies                          rbac.istio.io                  true         AuthorizationPolicy
clusterrbacconfigs                             rbac.istio.io                  false        ClusterRbacConfig
rbacconfigs                                    rbac.istio.io                  true         RbacConfig
servicerolebindings                            rbac.istio.io                  true         ServiceRoleBinding
serviceroles                                   rbac.istio.io                  true         ServiceRole
priorityclasses                   pc           scheduling.k8s.io              false        PriorityClass
servicepolicies                                servicemesh.kubesphere.io      true         ServicePolicy
strategies                                     servicemesh.kubesphere.io      true         Strategy
csidrivers                                     storage.k8s.io                 false        CSIDriver
csinodes                                       storage.k8s.io                 false        CSINode
storageclasses                    sc           storage.k8s.io                 false        StorageClass
volumeattachments                              storage.k8s.io                 false        VolumeAttachment
workspaces                                     tenant.kubesphere.io           false        Workspace
error: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request

### 8. conntrack shows kube-apiserver is still connecting to the pod on the lost node

Check the metrics-server Service:

 kubectl get svc -n kube-system
NAME                               TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                  AGE
etcd                               ClusterIP   None            <none>        2379/TCP                 21d
kube-controller-manager-headless   ClusterIP   None            <none>        10252/TCP                22d
kube-dns                           ClusterIP   10.96.0.10      <none>        53/UDP,53/TCP,9153/TCP   22d
kube-scheduler-headless            ClusterIP   None            <none>        10251/TCP                22d
kubelet                            ClusterIP   None            <none>        10250/TCP                22d
metrics-server                     ClusterIP   **10.101.186.48**   <none>        443/TCP                  21d
tiller-deploy                      ClusterIP   10.103.19.75    <none>        44134/TCP                22d

Check the metrics-server pods and their pod IPs:

kubectl get pods -o wide -n kube-system
NAME                                      READY   STATUS     RESTARTS   AGE     IP               NODE      NOMINATED NODE   READINESS GATES
calico-kube-controllers-bbdc58449-cgsbt   1/1     Running    5          22d     10.244.137.79    master1   <none>           <none>
calico-node-2qk2b                         1/1     NodeLost   5          21d     192.168.210.74   worker1   <none>           <none>
calico-node-gjdrw                         1/1     Running    5          22d     192.168.210.73   master3   <none>           <none>
calico-node-jwbnz                         1/1     Running    4          22d     192.168.210.72   master2   <none>           <none>
calico-node-svmv7                         1/1     Running    5          22d     192.168.210.71   master1   <none>           <none>
coredns-85d448b787-8nks7                  1/1     Running    5          22d     10.244.137.87    master1   <none>           <none>
coredns-85d448b787-lpxtt                  1/1     Running    5          22d     10.244.137.80    master1   <none>           <none>
etcd-master1                              1/1     Running    5          22d     192.168.210.71   master1   <none>           <none>
kube-apiserver-master1                    1/1     Running    2          7d8h    192.168.210.71   master1   <none>           <none>
kube-controller-manager-master1           1/1     Running    9          22d     192.168.210.71   master1   <none>           <none>
kube-proxy-7wkbr                          1/1     Running    5          22d     192.168.210.71   master1   <none>           <none>
kube-proxy-8d7dj                          1/1     Running    4          22d     192.168.210.72   master2   <none>           <none>
kube-proxy-nhdsn                          1/1     NodeLost   4          21d     192.168.210.74   worker1   <none>           <none>
kube-proxy-sfbjm                          1/1     Running    4          22d     192.168.210.73   master3   <none>           <none>
kube-scheduler-master1                    1/1     Running    9          22d     192.168.210.71   master1   <none>           <none>
metrics-server-8b7689b66-xm6mf            1/1     Running    0          36s     **10.244.180.55**    master2   <none>           <none>
metrics-server-8b7689b66-z9hk9            1/1     Unknown    0          3m58s   **10.244.235.186**   worker1   <none>           <none>
tiller-deploy-5fd994b8f-twpn2             1/1     Running    0          3h13m   10.244.180.23    master2   <none>           <none>

Observe the connections:

conntrack -L | grep 10.101.186.48
tcp      6 278 ESTABLISHED src=10.101.186.48 dst=**10.101.186.48** sport=45842 dport=443 src=**10.244.235.186** dst=192.168.210.71 sport=443 dport=19158 [ASSURED] mark=0 use=1
tcp      6 298 ESTABLISHED src=10.101.186.48 dst=**10.101.186.48** sport=45820 dport=443 src=**10.244.235.186** dst=**192.168.210.71** sport=443 dport=15276 [ASSURED] mark=0 use=2

This bug only appears when a node suddenly loses power or has its network cable pulled; it then takes about 10 minutes for the connection to switch over, unless kube-apiserver is restarted. Shutting a node down gracefully does not trigger it.
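For reference, "restarting kube-apiserver" here means bouncing the static pod. Assuming a kubeadm-style control plane with the default manifest directory, one way to do that on the master is:

mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
sleep 30    # give the kubelet time to stop the old container
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/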


    guoh1988

    error: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request

    This error means one of the APIServices is broken. Run kubectl get apiservice and confirm the metrics service is healthy:

    root@master1:~# kubectl get apiservice | grep metrics
    v1beta1.metrics.k8s.io                 kube-system/metrics-server   True        147d

    If the third column is not True, the metrics service has a problem; Kubernetes API discovery requires every APIService to be in the True state.

      Jeff Thanks for your reply. Here kubectl get apiservice | grep metrics shows a failed state:

      v1beta1.metrics.k8s.io                 kube-system/metrics-server   False (FailedDiscoveryCheck)   22d

      In my output above a new metrics-server has already come up, but kube-apiserver keeps trying to connect to the pod on the lost node, 10.244.235.186. Once I kill kube-apiserver it connects to 10.244.180.55 correctly.
      We hit a similar problem with UDP in the past, where clearing the connection cache with conntrack -D was enough for new connections to be created. This time, however, even after conntrack -D the entries are recreated pointing at the wrong pod IP 10.244.235.186, as if something is remembering it.
      https://github.com/kubernetes/kubernetes/issues/59368?from=singlemessage

      metrics-server-8b7689b66-xm6mf            1/1     Running    0          36s     10.244.180.55    master2   <none>           <none>
      metrics-server-8b7689b66-z9hk9            1/1     Unknown    0          3m58s   10.244.235.186  worker1   <none>           <none>
      conntrack -L | grep 10.101.186.48
      tcp      6 278 ESTABLISHED src=10.101.186.48 dst=10.101.186.48 sport=45842 dport=443 src=10.244.235.186 dst=192.168.210.71 sport=443 dport=19158 [ASSURED] mark=0 use=1
      tcp      6 298 ESTABLISHED src=10.101.186.48 dst=10.101.186.48 sport=45820 dport=443 src=10.244.235.186 dst=192.168.210.71 sport=443 dport=15276 [ASSURED] mark=0 use=2
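      For reference, the conntrack flush mentioned above, with the ClusterIP and dead pod IP from this cluster plugged in, looks roughly like this:

      conntrack -D -d 10.101.186.48     # delete entries whose original destination is the metrics-server ClusterIP
      conntrack -D -r 10.244.235.186    # delete entries whose reply source is the pod on the lost node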

      I tried the following and it works: changing the kernel setting from net.ipv4.tcp_retries2=15 to net.ipv4.tcp_retries2=1. With that, after a power-off the connection is released within about 1 minute and then points to 10.244.180.55.

      https://blog.csdn.net/gao1738/article/details/42839697
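      A sketch of that kernel change (the file name under /etc/sysctl.d is arbitrary; as noted in the next reply, this is not a setting recommended for production):

      sysctl -w net.ipv4.tcp_retries2=1
      # persist across reboots
      echo "net.ipv4.tcp_retries2 = 1" > /etc/sysctl.d/99-tcp-retries2.conf
      sysctl --system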


        guoh1988 For a production environment I would not recommend that setting. I suggest checking whether the ipvs rules on the master node are failing to get updated.

          Jeff I am already using ipvs mode; I am not sure whether the ipvs rules you mention refer to kube-proxy's ipvs mode. The UDP bug described above is different from this one: for UDP, running conntrack -D to clear the connection cache is enough for new connections to be created, but for kube-apiserver connecting to metrics-server, even after clearing with conntrack -D the recreated entries are still wrong. The only things that help are changing the kernel net.ipv4.tcp_retries2 or restarting kube-apiserver.

          kubectl logs -f kube-proxy-7wkbr  -n kube-system
          I1123 01:40:45.112593       1 node.go:135] Successfully retrieved node IP: 192.168.210.71
          I1123 01:40:45.112639       1 server_others.go:177] Using ipvs Proxier.
          W1123 01:40:45.112929       1 proxier.go:415] IPVS scheduler not specified, use rr by default
          I1123 01:40:45.113153       1 server.go:529] Version: v1.16.10
          I1123 01:40:45.113560       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
          I1123 01:40:45.113585       1 conntrack.go:52] Setting nf_conntrack_max to 131072
          I1123 01:40:45.113628       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
          I1123 01:40:45.113650       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
          I1123 01:40:45.115833       1 config.go:131] Starting endpoints config controller
          I1123 01:40:45.115878       1 config.go:313] Starting service config controller
          I1123 01:40:45.115894       1 shared_informer.go:197] Waiting for caches to sync for endpoints config
          I1123 01:40:45.115897       1 shared_informer.go:197] Waiting for caches to sync for service config
          I1123 01:40:45.216030       1 shared_informer.go:204] Caches are synced for endpoints config 
          I1123 01:40:45.216033       1 shared_informer.go:204] Caches are synced for service config 

            guoh1988 Yes. If the pod has been recreated but the apiservice is still not healthy, check the ipvs rules on the master node: run ipvsadm -Ln to see whether the pod IP behind the metrics-server IP is correct, and ipvsadm -lnc to see whether there are many sync_wait connections.

            Thanks for your reply.
            Under normal conditions:

            kubectl get pods -n kube-system -o wide
            NAME                                      READY   STATUS    RESTARTS   AGE    IP               NODE      NOMINATED NODE   READINESS GATES
            metrics-server-8b7689b66-rvrrl            1/1     Running   0          8m7s   10.244.235.137   worker1   <none>           <none>
            
            ipvsadm -Ln
            TCP  10.101.186.48:443 rr
              -> 10.244.235.137:443           Masq    1      2          0   
            
            ipvsadm -lnc  | grep 10.101.186.48
            TCP 14:37  ESTABLISHED 10.101.186.48:56312 10.101.186.48:443  10.244.235.137:443
            TCP 14:55  ESTABLISHED 10.101.186.48:56328 10.101.186.48:443  10.244.235.137:443

            Under the failure condition:

            ipvsadm -Ln
            TCP  10.101.186.48:443 rr
              -> 10.244.180.39:443            Masq    1      0          0         
              -> 10.244.235.137:443           Masq    0      2          0 

            One connection immediately drops into CLOSE_WAIT, while the other stays ESTABLISHED; its timer matches the 15 min seen earlier.

            ipvsadm -lnc  | grep 10.101.186.48
            TCP 14:58  ESTABLISHED 10.101.186.48:56312 10.101.186.48:443  10.244.235.137:443
            TCP 00:18  CLOSE_WAIT  10.101.186.48:56328 10.101.186.48:443  10.244.235.137:443

            And one entry just keeps waiting:

            ipvsadm -lnc  | grep 10.101.186.48
            TCP 14:44  ESTABLISHED 10.101.186.48:56312 10.101.186.48:443  10.244.235.137:443
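            Not something tried in this thread, but given the output above one possible manual workaround is to delete the dead real server from the IPVS virtual service by hand, using the ClusterIP and stale pod IP shown here; kube-proxy should then only re-add healthy endpoints on its next sync:

            ipvsadm -d -t 10.101.186.48:443 -r 10.244.235.137:443
            ipvsadm -Ln -t 10.101.186.48:443    # verify only the live backend remains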

              guoh1988

              ipvsadm -Ln
              TCP  10.101.186.48:443 rr
                -> 10.244.180.39:443            Masq    1      0          0
                -> 10.244.235.137:443           Masq    0      2          0

              This rule is controlled by kube-proxy. Check your environment and see whether kube-proxy is simply not refreshing the rules fast enough.
              ipvsadm -Ln
              TCP  10.101.186.48:443 rr
                -> 10.244.180.39:443            Masq    1      0          0
                -> 10.244.235.137:443           Masq    0      2          0

              The weight here has already gone to 0, so it should not be selected by round-robin any more. I just tried again: when the problem occurs, killing kube-apiserver always lets it connect to 10.244.180.39:443.
              Below is my kube-proxy configuration:


              apiVersion: kubeproxy.config.k8s.io/v1alpha1
              bindAddress: 0.0.0.0
              clientConnection:
                acceptContentTypes: ""
                burst: 10
                contentType: application/vnd.kubernetes.protobuf
                kubeconfig: /var/lib/kube-proxy/kubeconfig.conf
                qps: 5
              clusterCIDR: 10.244.0.0/16
              configSyncPeriod: 15m0s
              conntrack:
                maxPerCore: 32768
                min: 131072
                tcpCloseWaitTimeout: 1h0m0s
                tcpEstablishedTimeout: 24h0m0s
              enableProfiling: false
              healthzBindAddress: 0.0.0.0:10256
              hostnameOverride: ""
              iptables:
                masqueradeAll: false
                masqueradeBit: 14
                minSyncPeriod: 0s
                syncPeriod: 30s
              ipvs:
                excludeCIDRs: null
                minSyncPeriod: 0s
                scheduler: ""
                strictARP: true
                syncPeriod: 30s
              kind: KubeProxyConfiguration
              metricsBindAddress: 127.0.0.1:10249
              mode: ipvs
              nodePortAddresses: null
              oomScoreAdj: -999
              portRange: ""
              udpIdleTimeout: 250ms
              winkernel:
                enableDSR: false
                networkName: ""
                sourceVip: ""

              By the way, do you have a k8s + KubeSphere cluster at hand? This is very easy to reproduce: find the node the metrics-server pod is on and simply cut power to that node. I am using VMware here.


                guoh1988 A hard power-off can indeed cause this problem; we have hit similar issues before. When a node goes down and its pods have to be migrated, there is a default 5-minute delay. For metrics-server you can make that time smaller.

                  Jeff I have already reduced it; I set it to 30s. But the 5 min is the time before pods on the lost node are rebuilt, and my metrics-server pod has already been rebuilt, so the problem is not that the pod was never recreated. This wait takes 15 min.
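                  For reference, the 30s adjustment mentioned above is normally expressed through the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable tolerations that Kubernetes adds to every pod with tolerationSeconds: 300 by default; a sketch of shrinking them under spec.template.spec of the metrics-server Deployment:

                  tolerations:
                  - key: node.kubernetes.io/unreachable
                    operator: Exists
                    effect: NoExecute
                    tolerationSeconds: 30
                  - key: node.kubernetes.io/not-ready
                    operator: Exists
                    effect: NoExecute
                    tolerationSeconds: 30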

                  10 days later

                  guoh1988 At this point this is a Kubernetes problem; you can search the k8s issues for anything related.