• KSV
  • cloudcore failure


# Problem background

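I have a virtual machine, 83, that is no longer needed and that I plan to reclaim.

VM 83 had previously joined KS (KubeSphere) as an edge node, under the name edgenode-83.

I first removed edgenode-83 from KS in the normal way, by running the following commands:
kubectl drain edgenode-83 --delete-emptydir-data --force --ignore-daemonsets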

kubectl delete nodes edgenode-83
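The cloudcore log at the bottom of this post shows edgenode-83 registering a tunnel agent after the node object was deleted, which suggests edgecore on the VM was still running at that point. The following is a minimal sketch of a removal order that stops edgecore first. It assumes edgecore runs as a systemd unit named `edgecore` on the edge VM and that the VM is reachable over SSH as root; the `run` helper and the `DRY_RUN`, `NODE`, and `EDGE_HOST` variables are hypothetical names used only so the steps can be previewed. Whether this order avoids the cloudcore failure has not been verified.

```shell
#!/usr/bin/env bash
# Sketch only: preview (DRY_RUN=1, the default) or execute the removal steps.
# Assumption: edgecore is a systemd service called "edgecore" on the edge VM.
NODE="${NODE:-edgenode-83}"
EDGE_HOST="${EDGE_HOST:-10.0.8.83}"   # internalIP reported in the cloudcore log
DRY_RUN="${DRY_RUN:-1}"

run() {
  # Print the command; execute it only when DRY_RUN is not 1.
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then
    "$@"
  fi
}

remove_edge_node() {
  # 1. Stop edgecore on the VM so it no longer reconnects to cloudcore.
  run ssh "root@${EDGE_HOST}" systemctl disable --now edgecore || return 1
  # 2. Evict workloads, then remove the node object.
  run kubectl drain "$NODE" --delete-emptydir-data --force --ignore-daemonsets || return 1
  run kubectl delete node "$NODE" || return 1
}

remove_edge_node
```

With the default `DRY_RUN=1` the script only prints the three commands; setting `DRY_RUN=0` would actually run them.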

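After that, whenever cloudcore restarts, it fails and cannot run normally (cloudcore ports: 30000, 30004).

The other edge nodes can no longer connect and report a timeout error.

The detailed cloudcore error log is at the bottom of this post.

-- My phone number (WeChat) is 18521096651. If anyone has run into this and knows how to fix it, please get in touch. Thanks.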

Operating system information
Virtual machine, CentOS 7.9, 8C/16G

[root@master11 ~]# cat /etc/redhat-release

CentOS Linux release 7.9.2009 (Core)

Kubernetes version information
The output of kubectl version is pasted further below.

Container runtime
The output of docker version / crictl version / nerdctl version is pasted below.

[root@master11 ~]# docker version

Client:

Version: 20.10.8

API version: 1.41

Go version: go1.16.6

Git commit: 3967b7d

Built: Fri Jul 30 19:50:40 2021

OS/Arch: linux/amd64

Context: default

Experimental: true

Server: Docker Engine - Community

Engine:

Version: 20.10.8

API version: 1.41 (minimum version 1.12)

Go version: go1.16.6

Git commit: 75249d8

Built: Fri Jul 30 19:55:09 2021

OS/Arch: linux/amd64

Experimental: false

containerd:

Version: v1.4.9

GitCommit: e25210fe30a0a703442421b0f60afac609f950a3

runc:

Version: 1.0.1

GitCommit: v1.0.1-0-g4144b638

docker-init:

Version: 0.19.0

GitCommit: de40ad0

[root@master11 ~]# crictl version

-bash: crictl: command not found

[root@master11 ~]#

[root@master11 ~]# nerdctl version

-bash: nerdctl: command not found

KubeSphere version information

The version is v3.3.2.

Offline installation.

Installed with kk (KubeKey).

[root@master11 ~]# kubectl version

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5", GitCommit:"aea7bbadd2fc0cd689de94a54e5b7b758869d691", GitTreeState:"clean", BuildDate:"2021-09-15T21:10:45Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}

Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5", GitCommit:"aea7bbadd2fc0cd689de94a54e5b7b758869d691", GitTreeState:"clean", BuildDate:"2021-09-15T21:04:16Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}

What the problem is
The error log, ideally with screenshots.

---

Container logs

W0828 10:56:18.756174 1 validation.go:154] TLSTunnelPrivateKeyFile does not exist in /etc/kubeedge/certs/server.key, will load from secret

W0828 10:56:18.756282 1 validation.go:157] TLSTunnelCertFile does not exist in /etc/kubeedge/certs/server.crt, will load from secret

W0828 10:56:18.756325 1 validation.go:160] TLSTunnelCAFile does not exist in /etc/kubeedge/ca/rootCA.crt, will load from secret

I0828 10:56:18.756405 1 server.go:77] Version: v1.9.2

W0828 10:56:18.756456 1 client_config.go:615] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.

I0828 10:56:19.793428 1 module.go:52] Module cloudhub registered successfully

I0828 10:56:19.966632 1 module.go:52] Module edgecontroller registered successfully

I0828 10:56:19.966854 1 module.go:52] Module devicecontroller registered successfully

I0828 10:56:19.966903 1 module.go:52] Module synccontroller registered successfully

I0828 10:56:19.966995 1 module.go:52] Module cloudStream registered successfully

W0828 10:56:19.967010 1 module.go:55] Module router is disabled, do not register

W0828 10:56:19.967021 1 module.go:55] Module dynamiccontroller is disabled, do not register

I0828 10:56:19.967128 1 core.go:46] starting module devicecontroller

I0828 10:56:19.967206 1 core.go:46] starting module synccontroller

I0828 10:56:19.967429 1 core.go:46] starting module cloudStream

I0828 10:56:19.967446 1 downstream.go:878] Start downstream devicecontroller

I0828 10:56:19.967517 1 core.go:46] starting module cloudhub

I0828 10:56:19.967592 1 core.go:46] starting module edgecontroller

I0828 10:56:19.968870 1 upstream.go:125] start upstream controller

I0828 10:56:19.968936 1 downstream.go:339] start downstream controller

I0828 10:56:20.356643 1 server.go:257] Ca and CaKey don't exist in local directory, and will read from the secret

I0828 10:56:21.566482 1 server.go:302] CloudCoreCert and key don't exist in local directory, and will read from the secret

I0828 10:56:21.858993 1 tunnelserver.go:146] Succeed in loading TunnelCA from CloudHub

I0828 10:56:21.859713 1 tunnelserver.go:159] Succeed in loading TunnelCert and Key from CloudHub

I0828 10:56:21.860210 1 streamserver.go:305] Prepare to start stream server ...

I0828 10:56:21.971950 1 upstream.go:64] Start upstream devicecontroller

I0828 10:56:22.056405 1 signcerts.go:100] Succeed to creating token

I0828 10:56:22.057072 1 server.go:44] start unix domain socket server

I0828 10:56:22.057589 1 uds.go:71] listening on: //var/lib/kubeedge/kubeedge.sock

I0828 10:56:22.359701 1 tunnelserver.go:179] Prepare to start tunnel server ...

I0828 10:56:22.360381 1 server.go:64] Starting cloudhub websocket server

I0828 10:56:22.570056 1 tunnelserver.go:119] get a new tunnel agent hostname edgenode-131, internalIP 10.0.8.131

I0828 10:56:23.229356 1 tunnelserver.go:119] get a new tunnel agent hostname edgenode-83, internalIP 10.0.8.83

E0828 10:56:23.235188 1 tunnelserver.go:191] Failed while getting a Node to retry updating node KubeletEndpoint Port, node: edgenode-83, error: nodes "edgenode-83" not found

E0828 10:56:23.239467 1 tunnelserver.go:191] Failed while getting a Node to retry updating node KubeletEndpoint Port, node: edgenode-83, error: nodes "edgenode-83" not found

E0828 10:56:23.239504 1 tunnelserver.go:203] Update KubeletEndpoint Port of Node 'edgenode-83' error: timed out waiting for the condition.
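
To see at a glance which edge nodes cloudcore could not find, the tunnelserver errors above can be filtered by node name. A minimal sketch follows, using lines from the log above as sample input (quotation marks normalized to plain ASCII). The function name `missing_nodes` and the file name `cloudcore.log` are placeholders, not part of KubeEdge.

```shell
#!/usr/bin/env bash
# Read cloudcore log lines on stdin and print "<node> <count>" for every node
# named in a tunnelserver "Failed while getting a Node" error.
missing_nodes() {
  grep 'Failed while getting a Node' \
    | sed -n 's/.*node: \([^,]*\),.*/\1/p' \
    | sort | uniq -c \
    | awk '{print $2, $1}'
}

# Sample input copied from the log above; normally: missing_nodes < cloudcore.log
missing_nodes <<'EOF'
I0828 10:56:22.570056 1 tunnelserver.go:119] get a new tunnel agent hostname edgenode-131, internalIP 10.0.8.131
I0828 10:56:23.229356 1 tunnelserver.go:119] get a new tunnel agent hostname edgenode-83, internalIP 10.0.8.83
E0828 10:56:23.235188 1 tunnelserver.go:191] Failed while getting a Node to retry updating node KubeletEndpoint Port, node: edgenode-83, error: nodes "edgenode-83" not found
E0828 10:56:23.239467 1 tunnelserver.go:191] Failed while getting a Node to retry updating node KubeletEndpoint Port, node: edgenode-83, error: nodes "edgenode-83" not found
EOF
```

On the sample input only edgenode-83 is reported, while edgenode-131, which still exists as a node, is not.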

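---

Additional information:

The problem above can be reproduced in the following cases.

case1:

Delete an edge node and then restart cloudcore. cloudcore then fails:

cloudcore cannot run normally, cloudcore ports 30000 and 30004 are unreachable, and every edgecore times out when connecting to cloudcore.

All 52 edgecore instances then fail, and in KS all 52 edge nodes show as tainted.

case2:

Delete an edge node. After some time (for example 1 hour or 2 days later) cloudcore fails, with the same symptoms as case1.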
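
The "ports unreachable" symptom in both cases can be checked from any edge node with plain bash. This is a minimal sketch, assuming `timeout` from GNU coreutils is available; the function names and the `CLOUDCORE_HOST` variable are placeholders for illustration, and the cloudcore address has to be filled in by the reader.

```shell
#!/usr/bin/env bash
# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
# Uses bash's built-in /dev/tcp pseudo-device, so no extra tools are needed.
port_open() {
  local host="$1" port="$2"
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Report the state of the two cloudcore ports mentioned in this post.
check_cloudcore_ports() {
  local host="$1" port
  for port in 30000 30004; do
    if port_open "$host" "$port"; then
      echo "${host}:${port} open"
    else
      echo "${host}:${port} unreachable"
    fi
  done
}

# Usage (CLOUDCORE_HOST is a placeholder for the cloudcore address):
# check_cloudcore_ports "$CLOUDCORE_HOST"
```

Running the check before and after deleting a node would show whether the ports stop answering at the moment of the cloudcore restart (case1) or only later (case2).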