System information: Kylin 4.0.2 (GNU/Linux 4.4.131-20190726.kylin.server-generic aarch64)
This deployment installs KubeSphere 2.1.1.

Node status looks normal:
root@master1:# kubectl get nodes
NAME      STATUS   ROLES    AGE     VERSION
master1   Ready    master   3d17h   v1.15.9
master2   Ready    master   3d17h   v1.15.9
master3   Ready    master   3d17h   v1.15.9

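Because the nodes later drop to NotReady during the install, it may help to check this table from a script rather than by eye. The following is a minimal sketch, not part of KubeSphere or kubectl: the helper name `not_ready_nodes` is hypothetical, and it assumes the plain-text table that `kubectl get nodes` prints, as shown above.

```python
def not_ready_nodes(kubectl_output):
    """Return names of nodes whose STATUS column does not include 'Ready'.

    Expects the plain-text table printed by `kubectl get nodes`.
    """
    names = []
    lines = [ln for ln in kubectl_output.splitlines() if ln.strip()]
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 2:
            continue
        name, status = fields[0], fields[1]
        # STATUS can be a comma-separated list, e.g. "Ready,SchedulingDisabled"
        if "Ready" not in status.split(","):
            names.append(name)
    return names
```

Feeding it the captured output of `kubectl get nodes` at intervals would show which master drops out first and roughly when.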
List all pods:


> root@master1:~# kubectl get pods --all-namespaces -o wide
> NAMESPACE           NAME                                       READY   STATUS    RESTARTS   AGE     IP               NODE      NOMINATED NODE   READINESS GATES
> kube-system         calico-kube-controllers-56cd854695-jcjk4   1/1     Running   0          6h10m   192.168.137.89   master1   <none>           <none>
> kube-system         calico-node-mmcdg                          1/1     Running   5          3d17h   192.168.100.12   master2   <none>           <none>
> kube-system         calico-node-wwdtm                          1/1     Running   5          3d17h   192.168.100.13   master3   <none>           <none>
> kube-system         calico-node-x7tmn                          1/1     Running   4          3d17h   192.168.100.11   master1   <none>           <none>
> kube-system         coredns-5d4dd4b4db-24k9c                   1/1     Running   0          162m    192.168.136.19   master3   <none>           <none>
> kube-system         coredns-5d4dd4b4db-gqsbb                   1/1     Running   0          6h10m   192.168.137.90   master1   <none>           <none>
> kube-system         etcd-master1                               1/1     Running   4          2d3h    192.168.100.11   master1   <none>           <none>
> kube-system         etcd-master2                               1/1     Running   5          137m    192.168.100.12   master2   <none>           <none>
> kube-system         etcd-master3                               1/1     Running   5          6h8m    192.168.100.13   master3   <none>           <none>
> kube-system         kube-apiserver-master1                     1/1     Running   4          2d3h    192.168.100.11   master1   <none>           <none>
> kube-system         kube-apiserver-master2                     1/1     Running   5          137m    192.168.100.12   master2   <none>           <none>
> kube-system         kube-apiserver-master3                     1/1     Running   6          6h8m    192.168.100.13   master3   <none>           <none>
> kube-system         kube-controller-manager-master1            1/1     Running   5          2d3h    192.168.100.11   master1   <none>           <none>
> kube-system         kube-controller-manager-master2            1/1     Running   6          137m    192.168.100.12   master2   <none>           <none>
> kube-system         kube-controller-manager-master3            1/1     Running   5          6h8m    192.168.100.13   master3   <none>           <none>
> kube-system         kube-proxy-b2fgj                           1/1     Running   5          3d17h   192.168.100.13   master3   <none>           <none>
> kube-system         kube-proxy-hg4vw                           1/1     Running   4          3d17h   192.168.100.11   master1   <none>           <none>
> kube-system         kube-proxy-jfswn                           1/1     Running   5          3d17h   192.168.100.12   master2   <none>           <none>
> kube-system         kube-scheduler-master1                     1/1     Running   5          2d3h    192.168.100.11   master1   <none>           <none>
> kube-system         kube-scheduler-master2                     1/1     Running   6          137m    192.168.100.12   master2   <none>           <none>
> kube-system         kube-scheduler-master3                     1/1     Running   5          6h8m    192.168.100.13   master3   <none>           <none>
> kubesphere-system   ks-installer-6d58d545d7-79ckm              1/1     Running   0          80m     192.168.180.17   master2   <none>           <none>
> metallb-system      controller-7457ddbd47-jw59p                1/1     Running   0          6h10m   192.168.137.91   master1   <none>           <none>
> metallb-system      speaker-ctqd8                              1/1     Running   6          3d17h   192.168.100.13   master3   <none>           <none>
> metallb-system      speaker-gr2zb                              1/1     Running   4          3d17h   192.168.100.11   master1   <none>           <none>
> metallb-system      speaker-xndp2                              1/1     Running   5          3d17h   192.168.100.12   master2   <none>           <none>

The symptom: after running kubectl create -f kubesphere-minimal.yaml, an ansible-playbook process is started on one of the master nodes:
/usr/bin/python2 /usr/bin/ansible-playbook -b -e @/kubesphere/config/ks-config.yaml -e @/kubesphere/results/env/extravars /kubesphere/playbooks/preinstall.yaml

Check the logs with:
root@master1:/opt/kubesphere-2.1.1# kubectl logs -n kubesphere-system $(kubectl get pod -n kubesphere-system -l app=ks-install -o jsonpath='{.items[0].metadata.name}') -f
2020-08-24T07:25:06Z INFO : shell-operator latest
2020-08-24T07:25:06Z INFO : HTTP SERVER Listening on 0.0.0.0:9115
2020-08-24T07:25:06Z INFO : Use temporary dir: /tmp/shell-operator
2020-08-24T07:25:06Z INFO : Initialize hooks manager …
2020-08-24T07:25:06Z INFO : Search and load hooks …
2020-08-24T07:25:06Z INFO : Load hook config from '/hooks/kubesphere/installRunner.py'
2020-08-24T07:25:07Z INFO : Initializing schedule manager …
2020-08-24T07:25:07Z INFO : KUBE Init Kubernetes client
2020-08-24T07:25:07Z INFO : KUBE-INIT Kubernetes client is configured successfully
2020-08-24T07:25:07Z INFO : MAIN: run main loop
2020-08-24T07:25:07Z INFO : MAIN: add onStartup tasks
2020-08-24T07:25:07Z INFO : QUEUE add all HookRun@OnStartup
2020-08-24T07:25:07Z INFO : Running schedule manager …
2020-08-24T07:25:07Z INFO : MSTOR Create new metric shell_operator_live_ticks
2020-08-24T07:25:07Z INFO : MSTOR Create new metric shell_operator_tasks_queue_length
2020-08-24T07:25:07Z INFO : GVR for kind 'ConfigMap' is /v1, Resource=configmaps
2020-08-24T07:25:07Z INFO : EVENT Kube event 'cbfd0ac1-0983-40b1-888a-c45b6b7c7549'
2020-08-24T07:25:07Z INFO : QUEUE add TASK_HOOK_RUN@KUBE_EVENTS kubesphere/installRunner.py
2020-08-24T07:25:10Z INFO : TASK_RUN HookRun@KUBE_EVENTS kubesphere/installRunner.py
2020-08-24T07:25:10Z INFO : Running hook 'kubesphere/installRunner.py' binding 'KUBE_EVENTS' …
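The log stays in this state with no further output, and the installation never moves on to the next step. After a while, the node running ansible-playbook is lost: several core pods change to Unknown or NodeLost status, and kubectl get nodes shows the node as NotReady.
Given more time, two nodes end up in this state, because once a node is lost the ansible-playbook process is started again on another node. top shows that process using 100% CPU.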
root@master2:# top


> top - 16:20:49 up  2:15,  4 users,  load average: 7.96, 7.47, 7.20
> Tasks: 696 total,   6 running, 690 sleeping,   0 stopped,   0 zombie
> %Cpu(s):  0.2 us,  1.7 sy,  0.0 ni, 98.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> KiB Mem : 26588371+total, 25827788+free,  2642688 used,  4963136 buff/cache
> KiB Swap:        0 total,        0 free,        0 used. 23996217+avail Mem 
> 
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                                                                   
> 20702 root      20   0   59456  55168   5888 R 100.0  0.0  55:37.20 ansible-playboo                                                                                                                                                                           
>  2021 root      20   0 2542400 118208  63168 S   4.0  0.0   5:18.84 kubelet                                                                                                                                                                                   
>  3349 root      20   0 10.100g  87040  23168 S   3.6  0.0   8:59.32 etcd                                                                                                                                                                                      
>  2657 root      20   0 3470400 107776  34816 S   2.0  0.0   2:59.17 dockerd                                                                                                                                                                                   
>  3374 root      20   0  473856 251648  64256 S   1.3  0.1   2:07.23 kube-apiserver                                                                                                                                                                            
>  4072 root      20   0  140608  44416  25344 S   1.3  0.0   0:17.33 kube-proxy                                                                                                                                                                                
>  4749 root      20   0  150144  48448  29376 S   1.3  0.0   1:48.92 calico-node                                                                                                                                                                               
>  4126 root      20   0  132416  20032  20032 S   1.0  0.0   0:55.67 speaker                                                                                                                                                                                   
> 20647 root      20   0   13952   6208   3264 S   1.0  0.0   0:23.97 top                                                                                                                                                                                       
> 26792 root      20   0   13952   6080   3264 R   0.7  0.0   0:00.14 top                                                                                                                                                                                       
>     7 root      20   0       0      0      0 S   0.3  0.0   0:09.14 rcu_sched                                                                                                                                                                                 
>  3375 root      20   0  142528  29824  20928 S   0.3  0.0   0:08.63 kube-scheduler                                                                                                                                                                            
>     1 root      20   0  162816   9216   4928 D   0.0  0.0   0:13.98 systemd                                                                                                                                                                                   
>     2 root      20   0       0      0      0 S   0.0  0.0   0:00.08 kthreadd                                                                                                                                                                                  
>     3 root      20   0       0      0      0 S   0.0  0.0   0:00.33 ksoftirqd/0                                                                                                                                                                               
>     5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H                                                                                                                                                                              
>     8 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh                                                                                                                                                                                    
>     9 root      rt   0       0      0      0 S   0.0  0.0   0:00.02 migration/0                                                                                                                                                                               
>    10 root      rt   0       0      0      0 S   0.0  0.0   0:00.00 watchdog/0                                                                                                                                                                                
>    11 root      rt   0       0      0      0 S   0.0  0.0   0:00.00 watchdog/1                                                                                                                                                                                
>    12 root      rt   0       0      0      0 S   0.0  0.0   0:00.00 migration/1                                                                                                                                                                               
>    13 root      20   0       0      0      0 S   0.0  0.0   0:00.01 ksoftirqd/1                                                                                                                                                                               
>    15 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/1:0H                                                                                                                                                                              
>    16 root      rt   0       0      0      0 S   0.0  0.0   0:00.00 watchdog/2

System log:


> Aug 24 09:09:41 master2 kubelet[1971]: E0824 09:09:41.092123    1971 machine.go:288] failed to get cache information for node 0: open /sys/devices/system/cpu/cpu0/cache: no such file or directory
> Aug 24 09:09:41 master2 kubelet[1971]: E0824 09:09:41.092170    1971 machine.go:288] failed to get cache information for node 1: open /sys/devices/system/cpu/cpu8/cache: no such file or directory
> Aug 24 09:09:41 master2 kubelet[1971]: E0824 09:09:41.092194    1971 machine.go:288] failed to get cache information for node 2: open /sys/devices/system/cpu/cpu16/cache: no such file or directory
> Aug 24 09:09:41 master2 kubelet[1971]: E0824 09:09:41.092216    1971 machine.go:288] failed to get cache information for node 3: open /sys/devices/system/cpu/cpu24/cache: no such file or directory
> Aug 24 09:09:41 master2 kubelet[1971]: E0824 09:09:41.092237    1971 machine.go:288] failed to get cache information for node 4: open /sys/devices/system/cpu/cpu32/cache: no such file or directory
> Aug 24 09:09:41 master2 kubelet[1971]: E0824 09:09:41.092258    1971 machine.go:288] failed to get cache information for node 5: open /sys/devices/system/cpu/cpu40/cache: no such file or directory
> Aug 24 09:09:41 master2 kubelet[1971]: E0824 09:09:41.092278    1971 machine.go:288] failed to get cache information for node 6: open /sys/devices/system/cpu/cpu48/cache: no such file or directory
> Aug 24 09:09:41 master2 kubelet[1971]: E0824 09:09:41.092298    1971 machine.go:288] failed to get cache information for node 7: open /sys/devices/system/cpu/cpu56/cache: no such file or directory

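The kubelet errors above each name a /sys/devices/system/cpu/cpuN/cache path that does not exist. To see how many CPUs on a node are affected, a minimal sketch like the following could be run there. The function name `cpus_missing_cache` and its `sysfs_root` parameter are hypothetical helpers for illustration, not part of kubelet or any tool mentioned in this post.

```python
import os
import re


def cpus_missing_cache(sysfs_root="/sys/devices/system/cpu"):
    """Return sorted CPU indices whose cpuN directory has no 'cache' subdirectory."""
    missing = []
    for entry in os.listdir(sysfs_root):
        match = re.fullmatch(r"cpu(\d+)", entry)
        if match is None:
            continue  # skip non-CPU entries such as cpufreq or cpuidle
        cache_dir = os.path.join(sysfs_root, entry, "cache")
        if not os.path.isdir(cache_dir):
            missing.append(int(match.group(1)))
    return sorted(missing)
```

Calling `cpus_missing_cache()` on an affected node should list at least the CPUs named in the kubelet messages. If it lists every CPU, the kernel is probably not exporting cache topology at all, which would point at the platform rather than at kubelet.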
A soft lockup also occurs:


> Aug 24 14:21:15 master2 kernel: [  950.796039] NMI watchdog: BUG: soft lockup - CPU#44 stuck for 45s! [ansible-playboo:11385]
> Aug 24 14:21:15 master2 kernel: [  950.804264] Modules linked in: veth xt_multiport iptable_raw ip_set_hash_ip ip_set_hash_net ipip tunnel4 ip_tunnel xt_set ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_bitmap_port ip_set_hash_ipport ip_set dummy ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 xt_comment xt_mark nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables aufs iptable_filter ip_tables x_tables overlay crc32_arm64 igb shpchp dms5013a sunrpc autofs4 qla2xxx megaraid_sas scsi_transport_fc sr_mod cdrom
> Aug 24 14:21:15 master2 kernel: [  950.804322] 
> Aug 24 14:21:15 master2 kernel: [  950.804326] CPU: 44 PID: 11385 Comm: ansible-playboo Tainted: G             L  4.4.131-20190726.kylin.server-generic #kylin
> Aug 24 14:21:15 master2 kernel: [  950.804328] Hardware name: wanfang   /  , BIOS V3.0.4 2019-08-04
> Aug 24 14:21:15 master2 kernel: [  950.804331] task: ffff808348bb3e00 ti: ffff808348cfc000 task.ti: ffff808348cfc000
> Aug 24 14:21:15 master2 kernel: [  950.804336] PC is at _raw_spin_unlock_irqrestore+0x2c/0x38
> Aug 24 14:21:15 master2 kernel: [  950.804341] LR is at remove_wait_queue+0x50/0x60
> Aug 24 14:21:15 master2 kernel: [  950.804343] pc : [<ffff800000a150fc>] lr : [<ffff800000106f10>] pstate: 20000145
> Aug 24 14:21:15 master2 kernel: [  950.804345] sp : ffff808348cffb90
> Aug 24 14:21:15 master2 kernel: [  950.804346] x29: ffff808348cffb90 x28: ffff8583444e9400 
> Aug 24 14:21:15 master2 kernel: [  950.804349] x27: ffff8000004db870 x26: ffff8583444e94d8 
> Aug 24 14:21:15 master2 kernel: [  950.804352] x25: ffff86834451cb00 x24: ffff800000a7b488 
> Aug 24 14:21:15 master2 kernel: [  950.804355] x23: 0000000000000000 x22: 0000000000000000 
> Aug 24 14:21:15 master2 kernel: [  950.804358] x21: 0000000000000000 x20: ffff8583444e9630 
> Aug 24 14:21:15 master2 kernel: [  950.804360] x19: 0000000000000140 x18: 000000000000005d 
> Aug 24 14:21:15 master2 kernel: [  950.804363] x17: 0000ffffb0f5c6e4 x16: ffff80000024ced8 
> Aug 24 14:21:15 master2 kernel: [  950.804366] x15: 0000000000000000 x14: 0000ffffafac8948 
> Aug 24 14:21:15 master2 kernel: [  950.804368] x13: 0000000000000002 x12: 0000000000000070 
> Aug 24 14:21:15 master2 kernel: [  950.804371] x11: 0000000000000073 x10: ffff808348cfc000 
> Aug 24 14:21:15 master2 kernel: [  950.804373] x9 : 000000007fff0000 x8 : 0001000000000000 
> Aug 24 14:21:15 master2 kernel: [  950.804376] x7 : 0000000000000114 x6 : 0000000000000000 
> Aug 24 14:21:15 master2 kernel: [  950.804379] x5 : dead000000000100 x4 : dead000000000200 
> Aug 24 14:21:15 master2 kernel: [  950.804381] x3 : ffff8583444e9638 x2 : ffff8583444e9638 
> Aug 24 14:21:15 master2 kernel: [  950.804384] x1 : 0000000000000140 x0 : 000000000000da7f 
> Aug 24 14:21:15 master2 kernel: [  950.804386] 
> Aug 24 14:21:28 master2 kernel: [  963.477997] INFO: rcu_sched self-detected stall on CPU
> Aug 24 14:21:28 master2 kernel: [  963.483124] 	44-...: (60005 ticks this GP) idle=2f1/140000000000001/0 softirq=4529/4529 fqs=59542 
> Aug 24 14:21:28 master2 kernel: [  963.492029] 	 (t=60010 jiffies g=9323 c=9322 q=218744)
> Aug 24 14:21:28 master2 kernel: [  963.497159] Task dump for CPU 44:
> Aug 24 14:21:28 master2 kernel: [  963.497161] ansible-playboo R  running task        0 11385  11353 0x00000006
> Aug 24 14:21:28 master2 kernel: [  963.497165] Call trace:
> Aug 24 14:21:28 master2 kernel: [  963.499601] [<ffff8000000897f8>] dump_backtrace+0x0/0x178
> Aug 24 14:21:28 master2 kernel: [  963.499605] [<ffff800000089b1c>] show_stack+0x24/0x30
> Aug 24 14:21:28 master2 kernel: [  963.499609] [<ffff8000001cbcb4>] sched_show_task+0xd0/0xdc
> Aug 24 14:21:28 master2 kernel: [  963.499612] [<ffff8000001cc2c8>] dump_cpu_task+0x48/0x54
> Aug 24 14:21:28 master2 kernel: [  963.499614] [<ffff8000001cce48>] rcu_dump_cpu_stacks+0x98/0xb4
> Aug 24 14:21:28 master2 kernel: [  963.499618] [<ffff8000001280d8>] rcu_check_callbacks+0x758/0x8b8
> Aug 24 14:21:28 master2 kernel: [  963.499621] [<ffff80000012da8c>] update_process_times+0x44/0x90
> Aug 24 14:21:28 master2 kernel: [  963.499624] [<ffff80000013ec90>] tick_sched_handle.isra.0+0x38/0x78
> Aug 24 14:21:28 master2 kernel: [  963.499626] [<ffff80000013ed1c>] tick_sched_timer+0x4c/0x90
> Aug 24 14:21:28 master2 kernel: [  963.499628] [<ffff80000012e458>] __hrtimer_run_queues+0x148/0x2c0
> Aug 24 14:21:28 master2 kernel: [  963.499631] [<ffff80000012ecc0>] hrtimer_interrupt+0xa0/0x1d0
> Aug 24 14:21:28 master2 kernel: [  963.499635] [<ffff80000089ba24>] arch_timer_handler_phys+0x3c/0x50
> Aug 24 14:21:28 master2 kernel: [  963.499638] [<ffff80000011b85c>] handle_percpu_devid_irq+0x9c/0x228
> Aug 24 14:21:28 master2 kernel: [  963.499641] [<ffff800000116d1c>] generic_handle_irq+0x34/0x50
> Aug 24 14:21:28 master2 kernel: [  963.499644] [<ffff80000011706c>] __handle_domain_irq+0x6c/0xc0
> Aug 24 14:21:28 master2 kernel: [  963.499646] [<ffff800000081c88>] gic_handle_irq+0x98/0x194
> Aug 24 14:21:28 master2 kernel: [  963.499648] Exception stack(0xffff808348cffa20 to 0xffff808348cffb40)
> Aug 24 14:21:28 master2 kernel: [  963.499651] fa20: 0000000000000140 ffff8583444e9630 ffff808348cffb80 ffff800000a150fc
> Aug 24 14:21:28 master2 kernel: [  963.499654] fa40: 0000000060000145 0000000000000000 ffff86834451cb00 ffff8583444e94d8
> Aug 24 14:21:28 master2 kernel: [  963.499656] fa60: 0000000000004efe 0000000000000140 ffff8583444e9638 ffff8583444e9638
> Aug 24 14:21:28 master2 kernel: [  963.499659] fa80: 0001000000000000 0000000000000000 0000000000000000 0000000000000114
> Aug 24 14:21:28 master2 kernel: [  963.499661] faa0: 0001000000000000 000000007fff0000 ffff808348cfc000 0000000000000073
> Aug 24 14:21:28 master2 kernel: [  963.499663] fac0: 0000000000000070 0000000000000002 0000ffffafac8948 0000000000000000
> Aug 24 14:21:28 master2 kernel: [  963.499665] fae0: ffff80000024ced8 0000ffffb0f5c6e4 000000000000005d 0000000000000140
> Aug 24 14:21:28 master2 kernel: [  963.499667] fb00: ffff8583444e9630 ffff808348cffc60 ffff86834451cb00 ffff8583444f2540
> Aug 24 14:21:28 master2 rsyslogd-2007: action 'action 10' suspended, next retry is Mon Aug 24 14:22:28 2020 [v8.16.0 try http://www.rsyslog.com/e/2007 ]
> Aug 24 14:21:28 master2 kernel: [  963.499670] fb20: 0000000000000000 ffff86834451cb00 ffff8583444e94d8 ffff8000004db870
> Aug 24 14:21:28 master2 kernel: [  963.499672] [<ffff800000084da8>] el1_irq+0x68/0xc0
> Aug 24 14:21:28 master2 kernel: [  963.499675] [<ffff800000106e48>] add_wait_queue+0x58/0x68
> Aug 24 14:21:28 master2 kernel: [  963.499678] [<ffff8000004db920>] n_tty_write+0xb0/0x460
> Aug 24 14:21:28 master2 kernel: [  963.499681] [<ffff8000004d7894>] tty_write+0x114/0x268
> Aug 24 14:21:28 master2 kernel: [  963.499686] [<ffff80000024af04>] do_loop_readv_writev.part.0+0x84/0xb8
> Aug 24 14:21:28 master2 kernel: [  963.499688] [<ffff80000024b7ac>] do_readv_writev+0x1dc/0x270
> Aug 24 14:21:28 master2 kernel: [  963.499690] [<ffff80000024bb18>] vfs_writev+0x58/0x80
> Aug 24 14:21:28 master2 kernel: [  963.499693] [<ffff80000024bba0>] do_writev+0x60/0xf0
> Aug 24 14:21:28 master2 kernel: [  963.499696] [<ffff80000024cf10>] SyS_writev+0x38/0x48
> Aug 24 14:21:28 master2 kernel: [  963.499698] [<ffff800000085484>] el0_svc_naked+0x38/0x3c

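To see how often the lockup fires and whether it always involves the same process, the "soft lockup" lines can be pulled out of the system log. This is a rough sketch that assumes the message format shown in the log above; the names `SOFT_LOCKUP` and `find_soft_lockups` are made up for this example.

```python
import re

# Matches e.g. "BUG: soft lockup - CPU#44 stuck for 45s! [ansible-playboo:11385]"
SOFT_LOCKUP = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>.+):(?P<pid>\d+)\]"
)


def find_soft_lockups(log_text):
    """Return (cpu, seconds_stuck, command, pid) for each soft lockup line."""
    events = []
    for line in log_text.splitlines():
        match = SOFT_LOCKUP.search(line)
        if match:
            events.append((
                int(match.group("cpu")),
                int(match.group("secs")),
                match.group("comm"),
                int(match.group("pid")),
            ))
    return events
```

Passing the contents of the node's system log to `find_soft_lockups` gives one tuple per lockup report, which makes it easy to check whether ansible-playbook is the stuck task every time.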
This problem has been bothering me for days and is a real headache. Any help would be much appreciated.

  • Jeff replied to this post

    Phoaster-wry Could you share the CPU information? For example, the output of:

    lscpu

      Jeff

      root@master1:~# lscpu
      Architecture:          aarch64
      Byte Order:            Little Endian
      CPU(s):                64
      On-line CPU(s) list:   0-63
      Thread(s) per core:    1
      Core(s) per socket:    4
      Socket(s):             16
      NUMA node(s):          8
      Model name:            Phytium,FT2000PLUS
      CPU max MHz:           2200.0000
      CPU min MHz:           1000.0000
      BogoMIPS:              3600.00
      NUMA node0 CPU(s):     0-7
      NUMA node1 CPU(s):     8-15
      NUMA node2 CPU(s):     16-23
      NUMA node3 CPU(s):     24-31
      NUMA node4 CPU(s):     32-39
      NUMA node5 CPU(s):     40-47
      NUMA node6 CPU(s):     48-55
      NUMA node7 CPU(s):     56-63
      Flags:                 fp asimd evtstrm crc32