过渡无压力！KubeSphere v3.4.x 到 v4.x 平滑升级全攻略

xingxing122

hongming 更新完毕之后，需要重新执行啥命令来加载一下呢，或者重新更新呢

xingxing122

重新执行更新，还是报错，我的电脑是mac M1 系列，跟客户是啥系统没关系吧，看这个报错，还是空指针，哪里不对

kubectl logs -f -n kubesphere-system prepare-upgrade-rnxs8

I0422 05:07:44.886017       1 filepath.go:71] [Storage] LocalFileStorage File directory /tmp/ks-upgrade already exists

I0422 05:07:44.886187       1 executor.go:158] [Job] whizard-alerting is disabled

I0422 05:07:44.886201       1 executor.go:158] [Job] whizard-logging is disabled

I0422 05:07:44.886205       1 executor.go:158] [Job] whizard-notification is disabled

I0422 05:07:44.886209       1 executor.go:158] [Job] tower is disabled

I0422 05:07:44.886212       1 executor.go:158] [Job] whizard-telemetry is disabled

I0422 05:07:44.886216       1 executor.go:158] [Job] whizard-events is disabled

I0422 05:07:44.886220       1 executor.go:158] [Job] kubefed is disabled

I0422 05:07:44.886224       1 executor.go:158] [Job] servicemesh is disabled

I0422 05:07:44.886227       1 executor.go:158] [Job] storage-utils is disabled

I0422 05:07:44.886240       1 executor.go:155] [Job] devops is enabled, priority 800

I0422 05:07:44.886261       1 executor.go:155] [Job] iam is enabled, priority 999

I0422 05:07:44.886272       1 executor.go:158] [Job] metrics-server is disabled

I0422 05:07:44.886276       1 executor.go:158] [Job] opensearch is disabled

I0422 05:07:44.886279       1 executor.go:158] [Job] whizard-monitoring is disabled

I0422 05:07:44.886287       1 executor.go:155] [Job] network is enabled, priority 100

I0422 05:07:44.886295       1 executor.go:158] [Job] vector is disabled

I0422 05:07:44.886304       1 executor.go:155] [Job] application is enabled, priority 100

I0422 05:07:44.886311       1 executor.go:155] [Job] core is enabled, priority 10000

I0422 05:07:44.886323       1 executor.go:155] [Job] gateway is enabled, priority 90

I0422 05:07:44.886327       1 executor.go:158] [Job] kubeedge is disabled

I0422 05:07:44.898462       1 helm.go:145] getting history for release [ks-core]

I0422 05:07:44.951846       1 validator.go:57] [Validator] Current release's version is v3.3.2

I0422 05:07:44.951869       1 executor.go:220] [Job] core prepare-upgrade start

I0422 05:07:44.951878       1 executor.go:58] [Job] Detected that the plugin core is true

I0422 05:07:44.977148       1 core.go:314] scale down deployment kubesphere-system/ks-apiserver unchanged

I0422 05:07:45.000332       1 core.go:314] scale down deployment kubesphere-system/ks-console unchanged

I0422 05:07:45.006025       1 core.go:314] scale down deployment kubesphere-system/ks-controller-manager unchanged

I0422 05:07:45.028711       1 core.go:314] scale down deployment kubesphere-system/ks-installer unchanged

panic: runtime error: invalid memory address or nil pointer dereference

[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x2025ccf]

goroutine 1 [running]:

kubesphere.io/ks-upgrade/pkg/jobs/core.(\*upgradeJob).deleteKubeSphereWebhook(0xc000a2f630, {0x2ba2f40, 0x40e7c00})

	/workspace/pkg/jobs/core/core.go:429 +0x22f

kubesphere.io/ks-upgrade/pkg/jobs/core.(\*upgradeJob).PrepareUpgrade(0xc000a2f630, {0x2ba2f40, 0x40e7c00})

	/workspace/pkg/jobs/core/core.go:118 +0xcc

kubesphere.io/ks-upgrade/pkg/executor.(\*Executor).PrepareUpgrade(0xc000491710, {0x2ba2f40, 0x40e7c00})

	/workspace/pkg/executor/executor.go:227 +0x275

main.init.func5(0xc00021a800?, {0x26e5683?, 0x4?, 0x26e5687?})

	/workspace/cmd/ks-upgrade.go:102 +0x26

github.com/spf13/cobra.(\*Command).execute(0x4095a80, {0xc000898420, 0x3, 0x3})

	/workspace/vendor/github.com/spf13/cobra/command.go:985 +0xaaa

github.com/spf13/cobra.(\*Command).ExecuteC(0x4094c20)

	/workspace/vendor/github.com/spf13/cobra/command.go:1117 +0x3ff

github.com/spf13/cobra.(\*Command).Execute(...)

	/workspace/vendor/github.com/spf13/cobra/command.go:1041

main.main()

	/workspace/cmd/ks-upgrade.go:136 +0x4e

hongming

xingxing122

检查一下 prepare-upgrade-xxxx pod 的 imagePullPolicy，镜像更新了没

    image: docker.io/kubesphere/ks-upgrade:v4.1.3
    imageID: docker.io/kubesphere/ks-upgrade@sha256:bbdc80bbcab3f87b020af43d177c28425af593e103133ec6defdab9488dfb3a3

xingxing122

脚本就卡在这里了，

deployment.apps/ks-installer scaled

etcd endpointIps is empty or localhost, will be filled with

clusterconfiguration.installer.kubesphere.io/ks-installer patched (no change)

remove redis

No resources found

apply CRDs

job.batch “prepare-upgrade” deleted

configmap/ks-upgrade-prepare-config unchanged

job.batch/prepare-upgrade created

查看资源运行情况

xingxing122

hongming 更新了，还是这个问题，有点崩溃

xingxing122

hongming 还有其他可以解决的么，升级之路太难了

hongming

xingxing122

确认一下 image hash 是否一致，我这能复现你遇到的问题，修复后也重新验证过，相关问题的修复记录 https://github.com/kubesphere/ks-upgrade/pull/27/files#diff-a539a85bdf77e6a269d56942c9ce3f56b0e1a0d33cf1ccd03ba6081f582a17dbR429

这一行的空指针加了前置判断，环境中配置的差异会被忽略

	/workspace/pkg/jobs/core/core.go:429 +0x22f

请提供一下最新的日志信息，可能不是这里产生的错误了

hongming

xingxing122

只能尽可能减少环境的差异，有一些特殊配置在升级过程中可能没有被覆盖到，但能根据异常日志定位到错误 https://github.com/kubesphere/ks-upgrade，比如你遇到的这个空指针问题，就可以稍微调整一下 validatingwebhook 的配置跳过

xingxing122

镜像是一致的呢，我只是打的tag 传到的仓库而已，因为mac M1 系列的本地是拉取不到arm的镜像，所以需要传到我仓库去

hongming

xingxing122

是不是 retag 的时候搞错了，改了 imagePullPolicy也没拉到更新后的镜像，日志中的错误信息已经足够定位问题，你也可以自己去构建镜像试试。如果觉得麻烦，可以这么跳过，把导致异常的 validatingwebhook 先备份删除，升级完成后再恢复

xingxing122

hongming tag 不应该的，毕竟后面的imageID 都是一致的，把异常的validatingwebhook 备份删除，这个操作有点麻烦，咋跳过，咋跳过呢

hongming

xingxing122

这个 imageID 看起来也是一致的，也是对的，但这是 x86 的，kubesphere/ks-upgrade 并没有提供 arm 镜像，你看看 prepare-upgrade 的 pod 实际拉取的是哪个呢，还有就是异常日志有没有变化，排除其它问题导致中断

xingxing122

hongming 我看了日志报错，还是那个nil 要不然就需要删除这些，但是我觉得还是跳过比较多，我尝试使用github 上制作镜像试试

kubectl get validatingwebhookconfigurations | grep kubesphere

cluster.kubesphere.io 1 19h

network.kubesphere.io 1 19h

resourcesquotas.quota.kubesphere.io 1 19h

rulegroups.alerting.kubesphere.io 3 19h

storageclass-accessor.storage.kubesphere.io 1 19h

users.iam.kubesphere.io 1 19h

xingxing122

xingxing122 我拉取了master 分支，打的镜像还是有问题，这个问题修复是在那个分支搞的

hongming

xingxing122

是 master 分支，这一行改动一目了然，只有 ClientConfig.Service 是指针类型

你再检查一下 pod，看看 imagePullPolicy 改过了吗

xingxing122

hongming 改动了，我拉去master 分支自己在本地打了一个镜像

xingxing122

hongming 我现在都不会升级了，接下来都不知道咋操作了。这个问题搞的有点懵逼

bwcx

升级网关 failed to get API group resources: unable to retrieve the complete list of server APIs: gateway.kubesphere.io/v2alpha1: the server could not find the requested resource

hongming

bwcx

member 集群上的网关升级，要等集群污点移除后，Gateway 扩展组件调度成功之后再执行。

bwcx

hongming 我只有host

hongming

bwcx

host 集群上也要等待扩展组件调度，可以通过UI 检查一下网关扩展组件的调度状态，或者通过以下命令：

kubectl get installplan gateway -o json | jq -r '["Cluster", "State"], (.status.clusterSchedulingStatuses | to_entries[] | [.key, .value.state]) | @tsv' | column -t

bwcx

hongming
Cluster State host Installed

gateway-agent-backend-controller-manager 日志