System info: Kylin 4.0.2 (GNU/Linux 4.4.131-20190726.kylin.server-generic aarch64)
This deployment uses KubeSphere 2.1.1.
Node status looks normal:
root@master1:~# kubectl get nodes
NAME STATUS ROLES AGE VERSION
master1 Ready master 3d17h v1.15.9
master2 Ready master 3d17h v1.15.9
master3 Ready master 3d17h v1.15.9
Pod status across all namespaces:
> root@master1:~# kubectl get pods --all-namespaces -o wide
> NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
> kube-system calico-kube-controllers-56cd854695-jcjk4 1/1 Running 0 6h10m 192.168.137.89 master1 <none> <none>
> kube-system calico-node-mmcdg 1/1 Running 5 3d17h 192.168.100.12 master2 <none> <none>
> kube-system calico-node-wwdtm 1/1 Running 5 3d17h 192.168.100.13 master3 <none> <none>
> kube-system calico-node-x7tmn 1/1 Running 4 3d17h 192.168.100.11 master1 <none> <none>
> kube-system coredns-5d4dd4b4db-24k9c 1/1 Running 0 162m 192.168.136.19 master3 <none> <none>
> kube-system coredns-5d4dd4b4db-gqsbb 1/1 Running 0 6h10m 192.168.137.90 master1 <none> <none>
> kube-system etcd-master1 1/1 Running 4 2d3h 192.168.100.11 master1 <none> <none>
> kube-system etcd-master2 1/1 Running 5 137m 192.168.100.12 master2 <none> <none>
> kube-system etcd-master3 1/1 Running 5 6h8m 192.168.100.13 master3 <none> <none>
> kube-system kube-apiserver-master1 1/1 Running 4 2d3h 192.168.100.11 master1 <none> <none>
> kube-system kube-apiserver-master2 1/1 Running 5 137m 192.168.100.12 master2 <none> <none>
> kube-system kube-apiserver-master3 1/1 Running 6 6h8m 192.168.100.13 master3 <none> <none>
> kube-system kube-controller-manager-master1 1/1 Running 5 2d3h 192.168.100.11 master1 <none> <none>
> kube-system kube-controller-manager-master2 1/1 Running 6 137m 192.168.100.12 master2 <none> <none>
> kube-system kube-controller-manager-master3 1/1 Running 5 6h8m 192.168.100.13 master3 <none> <none>
> kube-system kube-proxy-b2fgj 1/1 Running 5 3d17h 192.168.100.13 master3 <none> <none>
> kube-system kube-proxy-hg4vw 1/1 Running 4 3d17h 192.168.100.11 master1 <none> <none>
> kube-system kube-proxy-jfswn 1/1 Running 5 3d17h 192.168.100.12 master2 <none> <none>
> kube-system kube-scheduler-master1 1/1 Running 5 2d3h 192.168.100.11 master1 <none> <none>
> kube-system kube-scheduler-master2 1/1 Running 6 137m 192.168.100.12 master2 <none> <none>
> kube-system kube-scheduler-master3 1/1 Running 5 6h8m 192.168.100.13 master3 <none> <none>
> kubesphere-system ks-installer-6d58d545d7-79ckm 1/1 Running 0 80m 192.168.180.17 master2 <none> <none>
> metallb-system controller-7457ddbd47-jw59p 1/1 Running 0 6h10m 192.168.137.91 master1 <none> <none>
> metallb-system speaker-ctqd8 1/1 Running 6 3d17h 192.168.100.13 master3 <none> <none>
> metallb-system speaker-gr2zb 1/1 Running 4 3d17h 192.168.100.11 master1 <none> <none>
> metallb-system speaker-xndp2 1/1 Running 5 3d17h 192.168.100.12 master2 <none> <none>
The symptom: after running `kubectl create -f kubesphere-minimal.yaml`, an ansible-playbook process starts on one of the master nodes:
/usr/bin/python2 /usr/bin/ansible-playbook -b -e @/kubesphere/config/ks-config.yaml -e @/kubesphere/results/env/extravars /kubesphere/playbooks/preinstall.yaml
Checking the installer logs with:
root@master1:/opt/kubesphere-2.1.1# kubectl logs -n kubesphere-system $(kubectl get pod -n kubesphere-system -l app=ks-install -o jsonpath='{.items[0].metadata.name}') -f
2020-08-24T07:25:06Z INFO : shell-operator latest
2020-08-24T07:25:06Z INFO : HTTP SERVER Listening on 0.0.0.0:9115
2020-08-24T07:25:06Z INFO : Use temporary dir: /tmp/shell-operator
2020-08-24T07:25:06Z INFO : Initialize hooks manager …
2020-08-24T07:25:06Z INFO : Search and load hooks …
2020-08-24T07:25:06Z INFO : Load hook config from '/hooks/kubesphere/installRunner.py'
2020-08-24T07:25:07Z INFO : Initializing schedule manager …
2020-08-24T07:25:07Z INFO : KUBE Init Kubernetes client
2020-08-24T07:25:07Z INFO : KUBE-INIT Kubernetes client is configured successfully
2020-08-24T07:25:07Z INFO : MAIN: run main loop
2020-08-24T07:25:07Z INFO : MAIN: add onStartup tasks
2020-08-24T07:25:07Z INFO : QUEUE add all HookRun@OnStartup
2020-08-24T07:25:07Z INFO : Running schedule manager …
2020-08-24T07:25:07Z INFO : MSTOR Create new metric shell_operator_live_ticks
2020-08-24T07:25:07Z INFO : MSTOR Create new metric shell_operator_tasks_queue_length
> 2020-08-24T07:25:07Z INFO : GVR for kind 'ConfigMap' is /v1, Resource=configmaps
> 2020-08-24T07:25:07Z INFO : EVENT Kube event 'cbfd0ac1-0983-40b1-888a-c45b6b7c7549'
2020-08-24T07:25:07Z INFO : QUEUE add TASK_HOOK_RUN@KUBE_EVENTS kubesphere/installRunner.py
2020-08-24T07:25:10Z INFO : TASK_RUN HookRun@KUBE_EVENTS kubesphere/installRunner.py
> 2020-08-24T07:25:10Z INFO : Running hook 'kubesphere/installRunner.py' binding 'KUBE_EVENTS' …
The log stays at this point with no further output, and the installation never proceeds. After a while, the node running ansible-playbook gets lost: several core pods go into Unknown/NodeLost status, and `kubectl get nodes` shows the node NotReady.
Given more time, a second node ends up in the same state, because once a node is lost the ansible-playbook process is re-launched on another node. `top` shows it pinned at 100% CPU:
root@master2:~# top
> top - 16:20:49 up 2:15, 4 users, load average: 7.96, 7.47, 7.20
> Tasks: 696 total, 6 running, 690 sleeping, 0 stopped, 0 zombie
> %Cpu(s): 0.2 us, 1.7 sy, 0.0 ni, 98.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> KiB Mem : 26588371+total, 25827788+free, 2642688 used, 4963136 buff/cache
> KiB Swap: 0 total, 0 free, 0 used. 23996217+avail Mem
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 20702 root 20 0 59456 55168 5888 R 100.0 0.0 55:37.20 ansible-playboo
> 2021 root 20 0 2542400 118208 63168 S 4.0 0.0 5:18.84 kubelet
> 3349 root 20 0 10.100g 87040 23168 S 3.6 0.0 8:59.32 etcd
> 2657 root 20 0 3470400 107776 34816 S 2.0 0.0 2:59.17 dockerd
> 3374 root 20 0 473856 251648 64256 S 1.3 0.1 2:07.23 kube-apiserver
> 4072 root 20 0 140608 44416 25344 S 1.3 0.0 0:17.33 kube-proxy
> 4749 root 20 0 150144 48448 29376 S 1.3 0.0 1:48.92 calico-node
> 4126 root 20 0 132416 20032 20032 S 1.0 0.0 0:55.67 speaker
> 20647 root 20 0 13952 6208 3264 S 1.0 0.0 0:23.97 top
> 26792 root 20 0 13952 6080 3264 R 0.7 0.0 0:00.14 top
> 7 root 20 0 0 0 0 S 0.3 0.0 0:09.14 rcu_sched
> 3375 root 20 0 142528 29824 20928 S 0.3 0.0 0:08.63 kube-scheduler
> 1 root 20 0 162816 9216 4928 D 0.0 0.0 0:13.98 systemd
> 2 root 20 0 0 0 0 S 0.0 0.0 0:00.08 kthreadd
> 3 root 20 0 0 0 0 S 0.0 0.0 0:00.33 ksoftirqd/0
> 5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
> 8 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcu_bh
> 9 root rt 0 0 0 0 S 0.0 0.0 0:00.02 migration/0
> 10 root rt 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
> 11 root rt 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/1
> 12 root rt 0 0 0 0 S 0.0 0.0 0:00.00 migration/1
> 13 root 20 0 0 0 0 S 0.0 0.0 0:00.01 ksoftirqd/1
> 15 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/1:0H
> 16 root rt 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/2
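As a diagnostic sketch (not from the original report, and assuming you can reach a shell on the affected node before it goes NotReady): with the PID of the spinning ansible-playbook process (20702 in the `top` output above; adjust to whatever PID your node shows), procfs can reveal whether the task is stuck in the kernel and where:

```shell
# Inspect the kernel-side state of the spinning process via procfs.
# The PID 20702 is taken from the top output above -- adjust as needed.
pid=20702
for f in wchan stack status; do
  if [ -r "/proc/$pid/$f" ]; then
    echo "== /proc/$pid/$f =="
    cat "/proc/$pid/$f"
  else
    echo "/proc/$pid/$f: not readable (process gone or insufficient permission)"
  fi
done
```

If `/proc/$pid/stack` repeatedly shows the task inside `tty_write`/`n_tty_write`, that matches the soft-lockup backtrace in the kernel log further below (the process busy-looping while writing to a TTY).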
The system log shows repeated kubelet errors:
> Aug 24 09:09:41 master2 kubelet[1971]: E0824 09:09:41.092123 1971 machine.go:288] failed to get cache information for node 0: open /sys/devices/system/cpu/cpu0/cache: no such file or directory
> Aug 24 09:09:41 master2 kubelet[1971]: E0824 09:09:41.092170 1971 machine.go:288] failed to get cache information for node 1: open /sys/devices/system/cpu/cpu8/cache: no such file or directory
> Aug 24 09:09:41 master2 kubelet[1971]: E0824 09:09:41.092194 1971 machine.go:288] failed to get cache information for node 2: open /sys/devices/system/cpu/cpu16/cache: no such file or directory
> Aug 24 09:09:41 master2 kubelet[1971]: E0824 09:09:41.092216 1971 machine.go:288] failed to get cache information for node 3: open /sys/devices/system/cpu/cpu24/cache: no such file or directory
> Aug 24 09:09:41 master2 kubelet[1971]: E0824 09:09:41.092237 1971 machine.go:288] failed to get cache information for node 4: open /sys/devices/system/cpu/cpu32/cache: no such file or directory
> Aug 24 09:09:41 master2 kubelet[1971]: E0824 09:09:41.092258 1971 machine.go:288] failed to get cache information for node 5: open /sys/devices/system/cpu/cpu40/cache: no such file or directory
> Aug 24 09:09:41 master2 kubelet[1971]: E0824 09:09:41.092278 1971 machine.go:288] failed to get cache information for node 6: open /sys/devices/system/cpu/cpu48/cache: no such file or directory
> Aug 24 09:09:41 master2 kubelet[1971]: E0824 09:09:41.092298 1971 machine.go:288] failed to get cache information for node 7: open /sys/devices/system/cpu/cpu56/cache: no such file or directory
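These `machine.go` errors come from cAdvisor inside the kubelet failing to read per-CPU cache topology from sysfs; on some aarch64 kernels (including, apparently, this Kylin 4.4 build) those directories simply are not populated, so the messages are typically noise rather than the root cause. A quick check to confirm (a hedged sketch, not from the original post):

```shell
# Check whether the kernel exposes CPU cache topology in sysfs.
# If cache/ is missing under each cpuN/ directory, cadvisor (inside
# kubelet) logs exactly the machine.go errors shown above.
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
  if [ -d "$cpu/cache" ]; then
    echo "$cpu: cache topology present"
  else
    echo "$cpu: cache topology missing"
  fi
done | head -4
```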
Soft lockups also start appearing in the kernel log:
> Aug 24 14:21:15 master2 kernel: [ 950.796039] NMI watchdog: BUG: soft lockup - CPU#44 stuck for 45s! [ansible-playboo:11385]
> Aug 24 14:21:15 master2 kernel: [ 950.804264] Modules linked in: veth xt_multiport iptable_raw ip_set_hash_ip ip_set_hash_net ipip tunnel4 ip_tunnel xt_set ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_bitmap_port ip_set_hash_ipport ip_set dummy ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 xt_comment xt_mark nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables aufs iptable_filter ip_tables x_tables overlay crc32_arm64 igb shpchp dms5013a sunrpc autofs4 qla2xxx megaraid_sas scsi_transport_fc sr_mod cdrom
> Aug 24 14:21:15 master2 kernel: [ 950.804322]
> Aug 24 14:21:15 master2 kernel: [ 950.804326] CPU: 44 PID: 11385 Comm: ansible-playboo Tainted: G L 4.4.131-20190726.kylin.server-generic #kylin
> Aug 24 14:21:15 master2 kernel: [ 950.804328] Hardware name: wanfang / , BIOS V3.0.4 2019-08-04
> Aug 24 14:21:15 master2 kernel: [ 950.804331] task: ffff808348bb3e00 ti: ffff808348cfc000 task.ti: ffff808348cfc000
> Aug 24 14:21:15 master2 kernel: [ 950.804336] PC is at _raw_spin_unlock_irqrestore+0x2c/0x38
> Aug 24 14:21:15 master2 kernel: [ 950.804341] LR is at remove_wait_queue+0x50/0x60
> Aug 24 14:21:15 master2 kernel: [ 950.804343] pc : [<ffff800000a150fc>] lr : [<ffff800000106f10>] pstate: 20000145
> Aug 24 14:21:15 master2 kernel: [ 950.804345] sp : ffff808348cffb90
> Aug 24 14:21:15 master2 kernel: [ 950.804346] x29: ffff808348cffb90 x28: ffff8583444e9400
> Aug 24 14:21:15 master2 kernel: [ 950.804349] x27: ffff8000004db870 x26: ffff8583444e94d8
> Aug 24 14:21:15 master2 kernel: [ 950.804352] x25: ffff86834451cb00 x24: ffff800000a7b488
> Aug 24 14:21:15 master2 kernel: [ 950.804355] x23: 0000000000000000 x22: 0000000000000000
> Aug 24 14:21:15 master2 kernel: [ 950.804358] x21: 0000000000000000 x20: ffff8583444e9630
> Aug 24 14:21:15 master2 kernel: [ 950.804360] x19: 0000000000000140 x18: 000000000000005d
> Aug 24 14:21:15 master2 kernel: [ 950.804363] x17: 0000ffffb0f5c6e4 x16: ffff80000024ced8
> Aug 24 14:21:15 master2 kernel: [ 950.804366] x15: 0000000000000000 x14: 0000ffffafac8948
> Aug 24 14:21:15 master2 kernel: [ 950.804368] x13: 0000000000000002 x12: 0000000000000070
> Aug 24 14:21:15 master2 kernel: [ 950.804371] x11: 0000000000000073 x10: ffff808348cfc000
> Aug 24 14:21:15 master2 kernel: [ 950.804373] x9 : 000000007fff0000 x8 : 0001000000000000
> Aug 24 14:21:15 master2 kernel: [ 950.804376] x7 : 0000000000000114 x6 : 0000000000000000
> Aug 24 14:21:15 master2 kernel: [ 950.804379] x5 : dead000000000100 x4 : dead000000000200
> Aug 24 14:21:15 master2 kernel: [ 950.804381] x3 : ffff8583444e9638 x2 : ffff8583444e9638
> Aug 24 14:21:15 master2 kernel: [ 950.804384] x1 : 0000000000000140 x0 : 000000000000da7f
> Aug 24 14:21:15 master2 kernel: [ 950.804386]
> Aug 24 14:21:28 master2 kernel: [ 963.477997] INFO: rcu_sched self-detected stall on CPU
> Aug 24 14:21:28 master2 kernel: [ 963.483124] 44-...: (60005 ticks this GP) idle=2f1/140000000000001/0 softirq=4529/4529 fqs=59542
> Aug 24 14:21:28 master2 kernel: [ 963.492029] (t=60010 jiffies g=9323 c=9322 q=218744)
> Aug 24 14:21:28 master2 kernel: [ 963.497159] Task dump for CPU 44:
> Aug 24 14:21:28 master2 kernel: [ 963.497161] ansible-playboo R running task 0 11385 11353 0x00000006
> Aug 24 14:21:28 master2 kernel: [ 963.497165] Call trace:
> Aug 24 14:21:28 master2 kernel: [ 963.499601] [<ffff8000000897f8>] dump_backtrace+0x0/0x178
> Aug 24 14:21:28 master2 kernel: [ 963.499605] [<ffff800000089b1c>] show_stack+0x24/0x30
> Aug 24 14:21:28 master2 kernel: [ 963.499609] [<ffff8000001cbcb4>] sched_show_task+0xd0/0xdc
> Aug 24 14:21:28 master2 kernel: [ 963.499612] [<ffff8000001cc2c8>] dump_cpu_task+0x48/0x54
> Aug 24 14:21:28 master2 kernel: [ 963.499614] [<ffff8000001cce48>] rcu_dump_cpu_stacks+0x98/0xb4
> Aug 24 14:21:28 master2 kernel: [ 963.499618] [<ffff8000001280d8>] rcu_check_callbacks+0x758/0x8b8
> Aug 24 14:21:28 master2 kernel: [ 963.499621] [<ffff80000012da8c>] update_process_times+0x44/0x90
> Aug 24 14:21:28 master2 kernel: [ 963.499624] [<ffff80000013ec90>] tick_sched_handle.isra.0+0x38/0x78
> Aug 24 14:21:28 master2 kernel: [ 963.499626] [<ffff80000013ed1c>] tick_sched_timer+0x4c/0x90
> Aug 24 14:21:28 master2 kernel: [ 963.499628] [<ffff80000012e458>] __hrtimer_run_queues+0x148/0x2c0
> Aug 24 14:21:28 master2 kernel: [ 963.499631] [<ffff80000012ecc0>] hrtimer_interrupt+0xa0/0x1d0
> Aug 24 14:21:28 master2 kernel: [ 963.499635] [<ffff80000089ba24>] arch_timer_handler_phys+0x3c/0x50
> Aug 24 14:21:28 master2 kernel: [ 963.499638] [<ffff80000011b85c>] handle_percpu_devid_irq+0x9c/0x228
> Aug 24 14:21:28 master2 kernel: [ 963.499641] [<ffff800000116d1c>] generic_handle_irq+0x34/0x50
> Aug 24 14:21:28 master2 kernel: [ 963.499644] [<ffff80000011706c>] __handle_domain_irq+0x6c/0xc0
> Aug 24 14:21:28 master2 kernel: [ 963.499646] [<ffff800000081c88>] gic_handle_irq+0x98/0x194
> Aug 24 14:21:28 master2 kernel: [ 963.499648] Exception stack(0xffff808348cffa20 to 0xffff808348cffb40)
> Aug 24 14:21:28 master2 kernel: [ 963.499651] fa20: 0000000000000140 ffff8583444e9630 ffff808348cffb80 ffff800000a150fc
> Aug 24 14:21:28 master2 kernel: [ 963.499654] fa40: 0000000060000145 0000000000000000 ffff86834451cb00 ffff8583444e94d8
> Aug 24 14:21:28 master2 kernel: [ 963.499656] fa60: 0000000000004efe 0000000000000140 ffff8583444e9638 ffff8583444e9638
> Aug 24 14:21:28 master2 kernel: [ 963.499659] fa80: 0001000000000000 0000000000000000 0000000000000000 0000000000000114
> Aug 24 14:21:28 master2 kernel: [ 963.499661] faa0: 0001000000000000 000000007fff0000 ffff808348cfc000 0000000000000073
> Aug 24 14:21:28 master2 kernel: [ 963.499663] fac0: 0000000000000070 0000000000000002 0000ffffafac8948 0000000000000000
> Aug 24 14:21:28 master2 kernel: [ 963.499665] fae0: ffff80000024ced8 0000ffffb0f5c6e4 000000000000005d 0000000000000140
> Aug 24 14:21:28 master2 kernel: [ 963.499667] fb00: ffff8583444e9630 ffff808348cffc60 ffff86834451cb00 ffff8583444f2540
> Aug 24 14:21:28 master2 rsyslogd-2007: action 'action 10' suspended, next retry is Mon Aug 24 14:22:28 2020 [v8.16.0 try http://www.rsyslog.com/e/2007 ]
> Aug 24 14:21:28 master2 kernel: [ 963.499670] fb20: 0000000000000000 ffff86834451cb00 ffff8583444e94d8 ffff8000004db870
> Aug 24 14:21:28 master2 kernel: [ 963.499672] [<ffff800000084da8>] el1_irq+0x68/0xc0
> Aug 24 14:21:28 master2 kernel: [ 963.499675] [<ffff800000106e48>] add_wait_queue+0x58/0x68
> Aug 24 14:21:28 master2 kernel: [ 963.499678] [<ffff8000004db920>] n_tty_write+0xb0/0x460
> Aug 24 14:21:28 master2 kernel: [ 963.499681] [<ffff8000004d7894>] tty_write+0x114/0x268
> Aug 24 14:21:28 master2 kernel: [ 963.499686] [<ffff80000024af04>] do_loop_readv_writev.part.0+0x84/0xb8
> Aug 24 14:21:28 master2 kernel: [ 963.499688] [<ffff80000024b7ac>] do_readv_writev+0x1dc/0x270
> Aug 24 14:21:28 master2 kernel: [ 963.499690] [<ffff80000024bb18>] vfs_writev+0x58/0x80
> Aug 24 14:21:28 master2 kernel: [ 963.499693] [<ffff80000024bba0>] do_writev+0x60/0xf0
> Aug 24 14:21:28 master2 kernel: [ 963.499696] [<ffff80000024cf10>] SyS_writev+0x38/0x48
> Aug 24 14:21:28 master2 kernel: [ 963.499698] [<ffff800000085484>] el0_svc_naked+0x38/0x3c
This problem has been blocking us for several days and is a real headache. Any help would be much appreciated.