k8s in Practice 5: A Record of a Failed Kubernetes Cluster Crash Recovery

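The whole cluster crashed for no apparent reason: no command could be executed, and every component except controller-manager and scheduler failed to start. The logs and errors collected at the time follow: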

[root@k8s-master2 ~]# kubectl get cs
error: the server doesn't have a resource type "cs"
[root@k8s-master2 ~]# systemctl status flanneld
● flanneld.service - Flanneld overlay address etcd agent
Loaded: loaded (/etc/systemd/system/flanneld.service; enabled; vendor preset: disabled)
Active: activating (start) since Mon 2019-04-08 11:26:35 CST; 37s ago
Main PID: 6691 (flanneld)
Memory: 10.6M
CGroup: /system.slice/flanneld.service
└─6691 /opt/k8s/bin/flanneld -etcd-cafile=/etc/kubernetes/cert/ca.pem -etcd-certfile=/etc/flanneld/...

Apr 08 11:26:45 k8s-master2 flanneld[6691]: ; error #2: dial tcp 192.168.32.129:2379: getsockopt: connecti...used
Apr 08 11:26:46 k8s-master2 flanneld[6691]: timed out
Apr 08 11:26:56 k8s-master2 flanneld[6691]: E0408 11:26:56.506816 6691 main.go:349] Couldn't fetch netw...used
Apr 08 11:26:56 k8s-master2 flanneld[6691]: ; error #1: net/http: TLS handshake timeout
Apr 08 11:26:56 k8s-master2 flanneld[6691]: ; error #2: dial tcp 192.168.32.129:2379: getsockopt: connecti...used
Apr 08 11:26:57 k8s-master2 flanneld[6691]: timed out
Apr 08 11:27:07 k8s-master2 flanneld[6691]: E0408 11:27:07.511956 6691 main.go:349] Couldn't fetch netw...used
Apr 08 11:27:07 k8s-master2 flanneld[6691]: ; error #1: net/http: TLS handshake timeout
Apr 08 11:27:07 k8s-master2 flanneld[6691]: ; error #2: dial tcp 192.168.32.129:2379: getsockopt: connecti...used
Apr 08 11:27:08 k8s-master2 flanneld[6691]: timed out
Hint: Some lines were ellipsized, use -l to show in full.
[root@k8s-master2 ~]# systemctl status kube-apiserver
● kube-apiserver.service - Kubernetes API Server
Loaded: loaded (/etc/systemd/system/kube-apiserver.service; enabled; vendor preset: disabled)
Active: activating (auto-restart) (Result: exit-code) since Mon 2019-04-08 11:30:01 CST; 4s ago
Docs: https://github.com/GoogleCloudPlatform/kubernetes
Process: 7348 ExecStart=/opt/k8s/bin/kube-apiserver --enable-admission-plugins=Initializers,NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,ResourceQuota --anonymous-auth=false --experimental-encryption-provider-config=/etc/kubernetes/encryption-config.yaml --advertise-address=192.168.32.129 --bind-address=192.168.32.129 --insecure-port=0 --authorization-mode=Node,RBAC --runtime-config=api/all --enable-bootstrap-token-auth --service-cluster-ip-range=10.254.0.0/16 --service-node-port-range=8400-9000 --tls-cert-file=/etc/kubernetes/cert/kubernetes.pem --tls-private-key-file=/etc/kubernetes/cert/kubernetes-key.pem --client-ca-file=/etc/kubernetes/cert/ca.pem --kubelet-client-certificate=/etc/kubernetes/cert/kubernetes.pem --kubelet-client-key=/etc/kubernetes/cert/kubernetes-key.pem --service-account-key-file=/etc/kubernetes/cert/ca-key.pem --etcd-cafile=/etc/kubernetes/cert/ca.pem --etcd-certfile=/etc/kubernetes/cert/kubernetes.pem --etcd-keyfile=/etc/kubernetes/cert/kubernetes-key.pem --etcd-servers=https://192.168.32.128:2379,https://192.168.32.129:2379,https://192.168.32.130:2379 --enable-swagger-ui=true --allow-privileged=true --apiserver-count=3 --audit-log-maxage=30 --audit-log-maxbackup=3 --audit-log-maxsize=100 --audit-log-path=/var/log/kube-apiserver-audit.log --event-ttl=1h --alsologtostderr=true --logtostderr=false --log-dir=/var/log/kubernetes --v=2 (code=exited, status=255)
Main PID: 7348 (code=exited, status=255)
Memory: 0B
CGroup: /system.slice/kube-apiserver.service

Apr 08 11:30:01 k8s-master2 systemd[1]: kube-apiserver.service: main process exited, code=exited, status=255/n/a
Apr 08 11:30:01 k8s-master2 systemd[1]: Failed to start Kubernetes API Server.
Apr 08 11:30:01 k8s-master2 systemd[1]: Unit kube-apiserver.service entered failed state.
Apr 08 11:30:01 k8s-master2 systemd[1]: kube-apiserver.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
[root@k8s-master2 ~]# systemctl status etcd
● etcd.service - Etcd Server
Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: disabled)
Active: activating (auto-restart) (Result: exit-code) since Mon 2019-04-08 11:28:02 CST; 358ms ago
Docs: https://github.com/coreos
Process: 7001 ExecStart=/opt/k8s/bin/etcd --data-dir=/var/lib/etcd --name=k8s-master2 --cert-file=/etc/etcd/cert/etcd.pem --key-file=/etc/etcd/cert/etcd-key.pem --trusted-ca-file=/etc/kubernetes/cert/ca.pem --peer-cert-file=/etc/etcd/cert/etcd.pem --peer-key-file=/etc/etcd/cert/etcd-key.pem --peer-trusted-ca-file=/etc/kubernetes/cert/ca.pem --peer-client-cert-auth --client-cert-auth --listen-peer-urls=https://192.168.32.129:2380 --initial-advertise-peer-urls=https://192.168.32.129:2380 --listen-client-urls=https://192.168.32.129:2379,http://127.0.0.1:2379 --advertise-client-urls=https://192.168.32.129:2379 --initial-cluster-token=etcd-cluster-0 --initial-cluster=k8s-master1=https://192.168.32.128:2380,k8s-master2=https://192.168.32.129:2380,k8s-master3=https://192.168.32.130:2380 --initial-cluster-state=new (code=exited, status=2)
Main PID: 7001 (code=exited, status=2)

Apr 08 11:28:02 k8s-master2 systemd[1]: etcd.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 08 11:28:02 k8s-master2 systemd[1]: Failed to start Etcd Server.
Apr 08 11:28:02 k8s-master2 systemd[1]: Unit etcd.service entered failed state.
Apr 08 11:28:02 k8s-master2 systemd[1]: etcd.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
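Troubleshooting approach
etcd holds the cluster's most critical data: flannel's network configuration is stored in etcd, and so is all other cluster state.
The Kubernetes components read that data through kube-apiserver (flanneld, as its command line shows, connects to the etcd endpoints directly).
So start with etcd and get it running first.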
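
The order described above can be sketched as a small shell helper. It is only an illustration under stated assumptions: the unit names are the ones that appear in the `systemctl status` output in this post, the ordering of docker and kubelet after flanneld follows what is described later, and `collect_unit_states` and `first_inactive_unit` are hypothetical function names. The helper reports the first unit, in dependency order, that is not active.

```shell
#!/bin/sh
# Dependency order assumed from this post: etcd first, then kube-apiserver,
# then flanneld, docker and kubelet.
UNIT_ORDER="etcd kube-apiserver flanneld docker kubelet"

# Print "unit state" for each unit, one per line.
# Only queries systemd; it changes nothing on the host.
collect_unit_states() {
    for unit in $UNIT_ORDER; do
        # is-active prints the state (active, activating, failed, inactive, ...)
        # and exits non-zero for anything but "active", hence the "|| true".
        state="$(systemctl is-active "$unit" 2>/dev/null)" || true
        printf '%s %s\n' "$unit" "${state:-unknown}"
    done
}

# Read "unit state" lines on stdin and print the first unit, in dependency
# order, whose state is not "active". A unit missing from the input counts
# as not active. Prints nothing when every unit is active.
first_inactive_unit() {
    states="$(cat)"
    for unit in $UNIT_ORDER; do
        state="$(printf '%s\n' "$states" | awk -v u="$unit" '$1 == u { print $2; exit }')"
        if [ "$state" != "active" ]; then
            printf '%s\n' "$unit"
            return 0
        fi
    done
}
```

On a node, `collect_unit_states | first_inactive_unit` would print the unit to look at first; in the situation recorded here that would be `etcd`.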

Restarting etcd fails, as shown below:

[root@k8s-master3 ~]# systemctl daemon-reload && systemctl restart etcd
Job for etcd.service failed because the control process exited with error code. See "systemctl status etcd.service" and "journalctl -xe" for details.
[root@k8s-master3 ~]# systemctl status etcd.service
● etcd.service - Etcd Server
Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: disabled)
Active: activating (auto-restart) (Result: exit-code) since Mon 2019-04-08 13:47:20 CST; 1s ago
Docs: https://github.com/coreos
Process: 25019 ExecStart=/opt/k8s/bin/etcd --data-dir=/var/lib/etcd --name=k8s-master3 --cert-file=/etc/etcd/cert/etcd.pem --key-file=/etc/etcd/cert/etcd-key.pem --trusted-ca-file=/etc/kubernetes/cert/ca.pem --peer-cert-file=/etc/etcd/cert/etcd.pem --peer-key-file=/etc/etcd/cert/etcd-key.pem --peer-trusted-ca-file=/etc/kubernetes/cert/ca.pem --peer-client-cert-auth --client-cert-auth --listen-peer-urls=https://192.168.32.130:2380 --initial-advertise-peer-urls=https://192.168.32.130:2380 --listen-client-urls=https://192.168.32.130:2379,http://127.0.0.1:2379 --advertise-client-urls=https://192.168.32.130:2379 --initial-cluster-token=etcd-cluster-0 --initial-cluster=k8s-master1=https://192.168.32.128:2380,k8s-master2=https://192.168.32.129:2380,k8s-master3=https://192.168.32.130:2380 --initial-cluster-state=new (code=exited, status=2)
Main PID: 25019 (code=exited, status=2)

Apr 08 13:47:20 k8s-master3 systemd[1]: etcd.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 08 13:47:20 k8s-master3 systemd[1]: Failed to start Etcd Server.
Apr 08 13:47:20 k8s-master3 systemd[1]: Unit etcd.service entered failed state.
Apr 08 13:47:20 k8s-master3 systemd[1]: etcd.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
[root@k8s-master3 ~]# journalctl -f etcd
Failed to add match 'etcd': Invalid argument
Failed to add filters: Invalid argument
[root@k8s-master3 ~]# journalctl -u etcd
-- Logs begin at Mon 2019-04-08 12:10:25 CST, end at Mon 2019-04-08 13:47:42 CST. --
Apr 08 12:10:25 k8s-master3 systemd[1]: Starting Etcd Server...
Apr 08 12:10:25 k8s-master3 etcd[4390]: etcd Version: 3.3.7
Apr 08 12:10:25 k8s-master3 etcd[4390]: Git SHA: 56536de55
Apr 08 12:10:25 k8s-master3 etcd[4390]: Go Version: go1.9.6
Apr 08 12:10:25 k8s-master3 etcd[4390]: Go OS/Arch: linux/amd64
Apr 08 12:10:25 k8s-master3 etcd[4390]: setting maximum number of CPUs to 1, total number of available CPUs is 1
Apr 08 12:10:25 k8s-master3 etcd[4390]: the server is already initialized as member before, starting as etcd memb
Apr 08 12:10:25 k8s-master3 etcd[4390]: peerTLS: cert = /etc/etcd/cert/etcd.pem, key = /etc/etcd/cert/etcd-key.pe
Apr 08 12:10:25 k8s-master3 etcd[4390]: listening for peers on https://192.168.32.130:2380
Apr 08 12:10:25 k8s-master3 etcd[4390]: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cer
Apr 08 12:10:25 k8s-master3 etcd[4390]: The scheme of client url http://127.0.0.1:2379 is HTTP while client cert
Apr 08 12:10:25 k8s-master3 etcd[4390]: listening for client requests on 127.0.0.1:2379
Apr 08 12:10:25 k8s-master3 etcd[4390]: listening for client requests on 192.168.32.130:2379
Apr 08 12:10:25 k8s-master3 etcd[4390]: recovered store from snapshot at index 3200034
Apr 08 12:10:25 k8s-master3 etcd[4390]: recovering backend from snapshot error: database snapshot file path error
Apr 08 12:10:25 k8s-master3 etcd[4390]: panic: recovering backend from snapshot error: database snapshot file pat
Apr 08 12:10:25 k8s-master3 etcd[4390]: panic: runtime error: invalid memory address or nil pointer dereference
Apr 08 12:10:25 k8s-master3 etcd[4390]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0xbccc10]
Apr 08 12:10:25 k8s-master3 etcd[4390]: goroutine 1 [running]:
Apr 08 12:10:25 k8s-master3 etcd[4390]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewSe
Apr 08 12:10:25 k8s-master3 etcd[4390]: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/et
Apr 08 12:10:25 k8s-master3 etcd[4390]: panic(0xe1d2e0, 0xc4203eb9c0)
Apr 08 12:10:25 k8s-master3 etcd[4390]: /usr/local/go/src/runtime/panic.go:491 +0x283
Apr 08 12:10:25 k8s-master3 etcd[4390]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*Packag
Apr 08 12:10:25 k8s-master3 etcd[4390]: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/et
Apr 08 12:10:25 k8s-master3 etcd[4390]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewSe
Apr 08 12:10:25 k8s-master3 etcd[4390]: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/et
Apr 08 12:10:25 k8s-master3 etcd[4390]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/embed.StartEtcd(
Apr 08 12:10:25 k8s-master3 etcd[4390]: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/et
Apr 08 12:10:25 k8s-master3 etcd[4390]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEt
Apr 08 12:10:25 k8s-master3 etcd[4390]: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/et
Apr 08 12:10:25 k8s-master3 etcd[4390]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEt
Apr 08 12:10:25 k8s-master3 etcd[4390]: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/et
Apr 08 12:10:25 k8s-master3 etcd[4390]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.Main()
Apr 08 12:10:25 k8s-master3 etcd[4390]: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/et
Apr 08 12:10:25 k8s-master3 etcd[4390]: main.main()
Apr 08 12:10:25 k8s-master3 etcd[4390]: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/et
Apr 08 12:10:25 k8s-master3 systemd[1]: etcd.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 08 12:10:25 k8s-master3 systemd[1]: Failed to start Etcd Server.
Apr 08 12:10:25 k8s-master3 systemd[1]: Unit etcd.service entered failed state.
Apr 08 12:10:25 k8s-master3 systemd[1]: etcd.service failed.
The key errors:

Apr 08 12:10:25 k8s-master3 etcd[4390]: recovering backend from snapshot error: database snapshot file path error
Apr 08 12:10:25 k8s-master3 etcd[4390]: panic: recovering backend from snapshot error: database snapshot file pat
Apr 08 12:10:25 k8s-master3 etcd[4390]: panic: runtime error: invalid memory address or nil pointer dereference
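
Lines like these are easy to miss in a long journal. A filter along the following lines can pull them out. It is a sketch, not an etcd tool: `etcd_fatal_lines` is a hypothetical helper, and the patterns are simply the strings seen in the log above.

```shell
#!/bin/sh
# Keep only log lines that carry a Go panic or the snapshot recovery error
# seen in this post; everything else is dropped.
# "|| true" keeps the exit status at 0 when nothing matches.
etcd_fatal_lines() {
    grep -E 'panic:|recovering backend from snapshot error' || true
}
```

It would be used as `journalctl -u etcd --no-pager | etcd_fatal_lines`. Note the `-u` option, which selects the unit: the `journalctl -f etcd` attempt in the transcript above failed because a bare `etcd` argument is parsed as a match expression.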
Fix and reasoning:
Delete all etcd data and reinitialize the cluster.

[root@k8s-master3 ~]# rm -rf /var/lib/etcd/*
[root@k8s-master3 ~]# systemctl daemon-reload && systemctl restart etcd
[root@k8s-master3 ~]# systemctl status etcd.service
● etcd.service - Etcd Server
Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2019-04-08 13:57:39 CST; 9s ago
Docs: https://github.com/coreos
Main PID: 27342 (etcd)
Memory: 34.7M
CGroup: /system.slice/etcd.service
└─27342 /opt/k8s/bin/etcd --data-dir=/var/lib/etcd --name=k8s-master3 --cert-file=/etc/etcd/cert/et...

Apr 08 13:57:46 k8s-master3 etcd[27342]: bd8793282fb7e56f [term 4] received MsgTimeoutNow from 5bac98ba27...ship.
Apr 08 13:57:46 k8s-master3 etcd[27342]: bd8793282fb7e56f became candidate at term 5
Apr 08 13:57:46 k8s-master3 etcd[27342]: bd8793282fb7e56f received MsgVoteResp from bd8793282fb7e56f at term 5
Apr 08 13:57:46 k8s-master3 etcd[27342]: bd8793282fb7e56f [logterm: 4, index: 43] sent MsgVote request to...erm 5
Apr 08 13:57:46 k8s-master3 etcd[27342]: bd8793282fb7e56f [logterm: 4, index: 43] sent MsgVote request to...erm 5
Apr 08 13:57:46 k8s-master3 etcd[27342]: raft.node: bd8793282fb7e56f lost leader 5bac98ba2781a51e at term 5
Apr 08 13:57:48 k8s-master3 etcd[27342]: bd8793282fb7e56f received MsgVoteResp from 5bac98ba2781a51e at term 5
Apr 08 13:57:48 k8s-master3 etcd[27342]: bd8793282fb7e56f [quorum:2] has received 2 MsgVoteResp votes and...tions
Apr 08 13:57:48 k8s-master3 etcd[27342]: bd8793282fb7e56f became leader at term 5
Apr 08 13:57:48 k8s-master3 etcd[27342]: raft.node: bd8793282fb7e56f elected leader bd8793282fb7e56f at term 5
Hint: Some lines were ellipsized, use -l to show in full.
[root@k8s-master3 ~]# etcdctl --ca-file=/etc/kubernetes/cert/ca.pem --cert-file=/etc/etcd/cert/etcd.pem --key-file=/etc/etcd/cert/etcd-key.pem --endpoints=https://192.168.32.128:2379,https://192.168.32.129:2379,https://192.168.32.130:2379 member list
5bac98ba2781a51e: name=k8s-master2 peerURLs=https://192.168.32.129:2380 clientURLs=https://192.168.32.129:2379 isLeader=true
bd8793282fb7e56f: name=k8s-master3 peerURLs=https://192.168.32.130:2380 clientURLs=https://192.168.32.130:2379 isLeader=false
bee1cc9618cefbee: name=k8s-master1 peerURLs=https://192.168.32.128:2380 clientURLs=https://192.168.32.128:2379 isLeader=false
[root@k8s-master3 ~]#
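
Running `rm -rf` on `/var/lib/etcd` cannot be undone. A more cautious variant, sketched below, moves the contents of the data directory aside instead of deleting them, so the old files stay available for later inspection. This is an assumption about what could have been done, not what was done in this incident. `set_aside_dir` is a hypothetical helper, the backup naming is arbitrary, and keeping the files does not by itself make the data recoverable.

```shell
#!/bin/sh
# Move the contents of a data directory into a timestamped sibling directory
# instead of deleting them, and print the backup path.
# Intended to be run while the service that owns the directory is stopped.
set_aside_dir() {
    data_dir="${1%/}"   # drop one trailing slash so the backup is a sibling
    [ -d "$data_dir" ] || { echo "not a directory: $data_dir" >&2; return 1; }
    backup_dir="${data_dir}.bak.$(date +%Y%m%d%H%M%S)"
    mkdir -p "$backup_dir" || return 1
    # Move every top-level entry, including dotfiles, out of the data directory.
    # -mindepth/-maxdepth are GNU/BSD find extensions, not strict POSIX.
    find "$data_dir" -mindepth 1 -maxdepth 1 -exec mv {} "$backup_dir"/ \;
    printf '%s\n' "$backup_dir"
}
```

For example, `set_aside_dir /var/lib/etcd` would leave `/var/lib/etcd` empty, as the `rm -rf` above does, while the previous contents remain under the printed backup path.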
Once the etcd cluster was healthy again, commands could be executed.
But all data was lost.

[root@k8s-master1 etcd]# kubectl get all
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/kubernetes ClusterIP 10.254.0.1 <none> 443/TCP 45m
[root@k8s-master1 etcd]# kubectl get all -n kube-system
No resources found.
[root@k8s-master1 etcd]# kubectl get cs
NAME STATUS MESSAGE ERROR
controller-manager Healthy ok
scheduler Healthy ok
etcd-0 Healthy {"health":"true"}
etcd-1 Healthy {"health":"true"}
etcd-2 Healthy {"health":"true"}
[root@k8s-master1 etcd]#
kube-apiserver is back to normal

[root@k8s-master1 etcd]# systemctl status kube-apiserver
● kube-apiserver.service - Kubernetes API Server
Loaded: loaded (/etc/systemd/system/kube-apiserver.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2019-04-08 13:57:56 CST; 46min ago
Docs: https://github.com/GoogleCloudPlatform/kubernetes
Main PID: 18534 (kube-apiserver)
Memory: 223.9M
CGroup: /system.slice/kube-apiserver.service
└─18534 /opt/k8s/bin/kube-apiserver --enable-admission-plugins=Initializers,NamespaceLifecycle,...

Apr 08 13:58:18 k8s-master1 kube-apiserver[18534]: I0408 13:58:18.174401 18534 storage_rbac.go:215] c...ler
Apr 08 13:58:18 k8s-master1 kube-apiserver[18534]: I0408 13:58:18.254198 18534 storage_rbac.go:215] c...ler
Apr 08 13:58:18 k8s-master1 kube-apiserver[18534]: I0408 13:58:18.270269 18534 storage_rbac.go:215] c...tor
Apr 08 13:58:18 k8s-master1 kube-apiserver[18534]: I0408 13:58:18.292810 18534 storage_rbac.go:215] c...ler
Apr 08 13:58:18 k8s-master1 kube-apiserver[18534]: I0408 13:58:18.305491 18534 storage_rbac.go:215] c...ler
Apr 08 13:58:18 k8s-master1 kube-apiserver[18534]: I0408 13:58:18.319763 18534 storage_rbac.go:215] c...ler
Apr 08 13:58:18 k8s-master1 kube-apiserver[18534]: I0408 13:58:18.339471 18534 storage_rbac.go:215] c...ler
Apr 08 13:58:18 k8s-master1 kube-apiserver[18534]: I0408 13:58:18.642901 18534 controller.go:608] quo...es}
Apr 08 13:58:19 k8s-master1 kube-apiserver[18534]: I0408 13:58:19.011673 18534 controller.go:608] quo...gs}
Apr 08 13:58:19 k8s-master1 kube-apiserver[18534]: I0408 13:58:19.112267 18534 controller.go:608] quo...ts}
Hint: Some lines were ellipsized, use -l to show in full.
[root@k8s-master1 etcd]#
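Other components
Because all data was lost, flanneld cannot fetch its network configuration from etcd and fails to start.
Because flanneld cannot start, docker and kubelet cannot start either.
See below: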

[root@k8s-master1 ~]# systemctl status flanneld -l
● flanneld.service - Flanneld overlay address etcd agent
Loaded: loaded (/etc/systemd/system/flanneld.service; enabled; vendor preset: disabled)
Active: activating (start) since Mon 2019-04-08 14:49:35 CST; 1min 1s ago
Main PID: 2194 (flanneld)
Memory: 40.5M
CGroup: /system.slice/flanneld.service
└─2194 /opt/k8s/bin/flanneld -etcd-cafile=/etc/kubernetes/cert/ca.pem -etcd-certfile=/etc/flanneld/cert/flanneld.pem -etcd-keyfile=/etc/flanneld/cert/flanneld-key.pem -etcd-endpoints=https://192.168.32.128:2379,https://192.168.32.129:2379,https://192.168.32.130:2379 -etcd-prefix=/kubernetes/network -iface=ens33

Apr 08 14:50:32 k8s-master1 flanneld[2194]: timed out
Apr 08 14:50:32 k8s-master1 flanneld[2194]: E0408 14:50:32.209982 2194 main.go:349] Couldn't fetch network config: 100: Key not found (/kubernetes) [14]
Apr 08 14:50:33 k8s-master1 flanneld[2194]: timed out
Apr 08 14:50:33 k8s-master1 flanneld[2194]: E0408 14:50:33.215958 2194 main.go:349] Couldn't fetch network config: 100: Key not found (/kubernetes) [14]
Apr 08 14:50:34 k8s-master1 flanneld[2194]: timed out
Apr 08 14:50:34 k8s-master1 flanneld[2194]: E0408 14:50:34.220404 2194 main.go:349] Couldn't fetch network config: 100: Key not found (/kubernetes) [14]
Apr 08 14:50:35 k8s-master1 flanneld[2194]: timed out
Apr 08 14:50:35 k8s-master1 flanneld[2194]: E0408 14:50:35.224053 2194 main.go:349] Couldn't fetch network config: 100: Key not found (/kubernetes) [14]
Apr 08 14:50:36 k8s-master1 flanneld[2194]: timed out
Apr 08 14:50:36 k8s-master1 flanneld[2194]: E0408 14:50:36.237390 2194 main.go:349] Couldn't fetch network config: 100: Key not found (/kubernetes) [14]
[root@k8s-master1 ~]# systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
Active: inactive (dead)
Docs: https://docs.docker.com
[root@k8s-master1 ~]# systemctl status kubelet
● kubelet.service - Kubernetes Kubelet
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Active: inactive (dead)
Docs: https://github.com/GoogleCloudPlatform/kubernetes

Apr 08 14:51:05 k8s-master1 systemd[1]: Dependency failed for Kubernetes Kubelet.
Apr 08 14:51:05 k8s-master1 systemd[1]: Job kubelet.service/start failed with result 'dependency'.
[root@k8s-master1 ~]#
Write flanneld's network configuration back into etcd, as shown below:

[root@k8s-master1 ~]# source /opt/k8s/bin/environment.sh
[root@k8s-master1 ~]# echo ${ETCD_ENDPOINTS}
https://192.168.32.128:2379,https://192.168.32.129:2379,https://192.168.32.130:2379
[root@k8s-master1 ~]#
[root@k8s-master1 ~]# etcdctl --endpoints=${ETCD_ENDPOINTS} --ca-file=/etc/kubernetes/cert/ca.pem --cert-file=/etc/flanneld/cert/flanneld.pem --key-file=/etc/flanneld/cert/flanneld-key.pem set ${FLANNEL_ETCD_PREFIX}/config '{"Network":"'${CLUSTER_CIDR}'",
"SubnetLen": 24, "Backend": {"Type": "vxlan"}}'
{"Network":"172.30.0.0/16",
"SubnetLen": 24, "Backend": {"Type": "vxlan"}}
[root@k8s-master1 ~]# etcdctl --endpoints=${ETCD_ENDPOINTS} --ca-file=/etc/kubernetes/cert/ca.pem --cert-file=/etc/flanneld/cert/flanneld.pem --key-file=/etc/flanneld/cert/flanneld-key.pem ls
/kubernetes
[root@k8s-master1 ~]# etcdctl --endpoints=${ETCD_ENDPOINTS} --ca-file=/etc/kubernetes/cert/ca.pem --cert-file=/etc/flanneld/cert/flanneld.pem --key-file=/etc/flanneld/cert/flanneld-key.pem ls /kubernetes
/kubernetes/network
[root@k8s-master1 ~]# etcdctl --endpoints=${ETCD_ENDPOINTS} --ca-file=/etc/kubernetes/cert/ca.pem --cert-file=/etc/flanneld/cert/flanneld.pem --key-file=/etc/flanneld/cert/flanneld-key.pem ls /kubernetes/network
/kubernetes/network/config
/kubernetes/network/subnets
[root@k8s-master1 ~]# etcdctl --endpoints=${ETCD_ENDPOINTS} --ca-file=/etc/kubernetes/cert/ca.pem --cert-file=/etc/flanneld/cert/flanneld.pem --key-file=/etc/flanneld/cert/flanneld-key.pem ls /kubernetes/network/subnets
/kubernetes/network/subnets/172.30.45.0-24
/kubernetes/network/subnets/172.30.79.0-24
/kubernetes/network/subnets/172.30.96.0-24
/kubernetes/network/subnets/172.30.27.0-24
[root@k8s-master1 ~]#
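
The JSON value in the `set` command above is assembled by splicing `${CLUSTER_CIDR}` between quoted fragments, which is easy to get wrong. A small helper can build the same document in one place. This is a sketch: `flannel_config_json` is a hypothetical function name, and it emits the value on a single line rather than with the embedded line break used above.

```shell
#!/bin/sh
# Build the flannel network config as a single-line JSON string.
# Arguments: cluster CIDR, subnet length, backend type.
# No escaping is done, so the arguments must not contain quotes.
flannel_config_json() {
    printf '{"Network":"%s","SubnetLen":%s,"Backend":{"Type":"%s"}}\n' "$1" "$2" "$3"
}
```

With the variables from `environment.sh`, the value passed to `etcdctl ... set ${FLANNEL_ETCD_PREFIX}/config` would then be `"$(flannel_config_json "${CLUSTER_CIDR}" 24 vxlan)"`.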
After the write, all the services came up.

[root@k8s-master1 ~]# systemctl status flanneld
● flanneld.service - Flanneld overlay address etcd agent
Loaded: loaded (/etc/systemd/system/flanneld.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2019-04-08 14:56:11 CST; 1min 41s ago
Process: 3616 ExecStartPost=/opt/k8s/bin/mk-docker-opts.sh -k DOCKER_NETWORK_OPTIONS -d /run/flannel/docker (code=exited, status=0/SUCCESS)
Main PID: 3496 (flanneld)
Memory: 6.6M
CGroup: /system.slice/flanneld.service
└─3496 /opt/k8s/bin/flanneld -etcd-cafile=/etc/kubernetes/cert/ca.pem -etcd-certfile=/etc/flann...

Apr 08 14:56:11 k8s-master1 flanneld[3496]: I0408 14:56:10.998843 3496 main.go:300] Wrote subnet fi....env
Apr 08 14:56:11 k8s-master1 flanneld[3496]: I0408 14:56:10.998850 3496 main.go:304] Running backend.
Apr 08 14:56:11 k8s-master1 flanneld[3496]: I0408 14:56:11.008631 3496 iptables.go:115] Some iptabl...ules
Apr 08 14:56:11 k8s-master1 flanneld[3496]: I0408 14:56:11.008643 3496 iptables.go:137] Deleting ip...CEPT
Apr 08 14:56:11 k8s-master1 flanneld[3496]: I0408 14:56:11.017130 3496 vxlan_network.go:60] watchin...ases
Apr 08 14:56:11 k8s-master1 flanneld[3496]: I0408 14:56:11.017744 3496 main.go:396] Waiting for 22h...ease
Apr 08 14:56:11 k8s-master1 flanneld[3496]: I0408 14:56:11.020469 3496 iptables.go:137] Deleting ip...CEPT
Apr 08 14:56:11 k8s-master1 flanneld[3496]: I0408 14:56:11.022207 3496 iptables.go:125] Adding ipta...CEPT
Apr 08 14:56:11 k8s-master1 flanneld[3496]: I0408 14:56:11.031517 3496 iptables.go:125] Adding ipta...CEPT
Apr 08 14:56:11 k8s-master1 systemd[1]: Started Flanneld overlay address etcd agent.
Hint: Some lines were ellipsized, use -l to show in full.
[root@k8s-master1 ~]# systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2019-04-08 14:56:12 CST; 1min 59s ago
Docs: https://docs.docker.com
Main PID: 3641 (dockerd)
Memory: 68.8M
CGroup: /system.slice/docker.service
├─3641 /usr/bin/dockerd --log-level=error --bip=172.30.45.1/24 --ip-masq=true --mtu=1450
└─3647 docker-containerd -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --metri...

Apr 08 14:56:11 k8s-master1 systemd[1]: Starting Docker Application Container Engine...
Apr 08 14:56:12 k8s-master1 dockerd[3641]: time="2019-04-08T14:56:12.329436766+08:00" level=error msg=...ist"
Apr 08 14:56:12 k8s-master1 systemd[1]: Started Docker Application Container Engine.
Apr 08 14:56:13 k8s-master1 dockerd[3641]: time="2019-04-08T14:56:13.902502953+08:00" level=error msg=...ped"
Apr 08 14:56:14 k8s-master1 dockerd[3641]: time="2019-04-08T14:56:14.163449193+08:00" level=error msg=...071"
Apr 08 14:56:14 k8s-master1 dockerd[3641]: time="2019-04-08T14:56:14.164426984+08:00" level=error msg=...071"
Apr 08 14:57:13 k8s-master1 dockerd[3641]: time="2019-04-08T14:57:13.945446893+08:00" level=error msg=...ped"
Apr 08 14:57:13 k8s-master1 dockerd[3641]: time="2019-04-08T14:57:13.953866536+08:00" level=error msg=...ped"
Hint: Some lines were ellipsized, use -l to show in full.
[root@k8s-master1 ~]# systemctl status kubelet
● kubelet.service - Kubernetes Kubelet
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2019-04-08 14:56:12 CST; 2min 7s ago
Docs: https://github.com/GoogleCloudPlatform/kubernetes
Main PID: 3768 (kubelet)
Memory: 126.8M
CGroup: /system.slice/kubelet.service
└─3768 /opt/k8s/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/kubelet-bootstrap.kubeconfig...

Apr 08 14:56:14 k8s-master1 kubelet[3768]: I0408 14:56:14.299679 3768 reconciler.go:412] Reconciler syn...
Apr 08 14:56:14 k8s-master1 kubelet[3768]: I0408 14:56:14.406547 3768 reconciler.go:181] operationExecu...
Apr 08 14:56:14 k8s-master1 kubelet[3768]: I0408 14:56:14.406580 3768 reconciler.go:181] operationE...89")
Apr 08 14:56:14 k8s-master1 kubelet[3768]: I0408 14:56:14.406595 3768 reconciler.go:181] operationExecu...
Apr 08 14:56:14 k8s-master1 kubelet[3768]: I0408 14:56:14.406715 3768 operation_generator.go:698] Unmou...
Apr 08 14:56:14 k8s-master1 kubelet[3768]: I0408 14:56:14.406782 3768 operation_generator.go:698] Unmou...
Apr 08 14:56:14 k8s-master1 kubelet[3768]: I0408 14:56:14.406978 3768 operation_generator.go:698] Unmou...
Apr 08 14:56:14 k8s-master1 kubelet[3768]: I0408 14:56:14.507212 3768 reconciler.go:301] Volume det...h ""
Apr 08 14:56:14 k8s-master1 kubelet[3768]: I0408 14:56:14.507243 3768 reconciler.go:301] Volume det...h ""
Apr 08 14:56:14 k8s-master1 kubelet[3768]: I0408 14:56:14.507261 3768 reconciler.go:301] Volume det...h ""
Hint: Some lines were ellipsized, use -l to show in full.
[root@k8s-master1 ~]#
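Thoughts on the root cause:

Why did the cluster suddenly crash? Calling it inexplicable would just be fooling myself.
A few days earlier, I had performed an operation in response to an etcd failure.
The record follows:

etcd cluster errors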

[root@k8s-master2 ~]# kubectl get cs
NAME STATUS MESSAGE ERROR
etcd-1 Unhealthy Get https://192.168.32.129:2379/health: dial tcp 192.168.32.129:2379: connect: connection refused
scheduler Healthy ok
controller-manager Healthy ok
etcd-0 Healthy {"health":"true"}
etcd-2 Healthy {"health":"true"}
[root@k8s-master2 ~]#
[root@k8s-master1 cert]# etcdctl --ca-file=/etc/kubernetes/cert/ca.pem --cert-file=/etc/etcd/cert/etcd.pem --key-file=/etc/etcd/cert/etcd-key.pem --endpoints=https://192.168.32.128:2379,https://192.168.32.129:2379,https://192.168.32.130:2379 cluster-health
failed to check the health of member 5bac98ba2781a51e on https://192.168.32.129:2379: Get https://192.168.32.129:2379/health: dial tcp 192.168.32.129:2379: getsockopt: connection refused
member 5bac98ba2781a51e is unreachable: [https://192.168.32.129:2379] are all unreachable
member bd8793282fb7e56f is healthy: got healthy result from https://192.168.32.130:2379
member bee1cc9618cefbee is healthy: got healthy result from https://192.168.32.128:2379
cluster is degraded
[root@k8s-master1 cert]#
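
The output above describes a degraded but still functioning cluster: two of three members are healthy, which meets the quorum of two. The helper below is a sketch of how that reading could be automated from the `cluster-health` text. `summarize_cluster_health` is a hypothetical name, and the "is unhealthy" pattern is an assumption about the v2 `etcdctl` wording that does not appear in this post.

```shell
#!/bin/sh
# Read `etcdctl cluster-health` (v2 API) output on stdin and summarize it.
# Prints: "<healthy> <total> <quorum> <ok|lost>"
summarize_cluster_health() {
    awk '
        /^member [0-9a-f]+ is healthy/     { healthy++; total++ }
        /^member [0-9a-f]+ is unreachable/ { total++ }
        /^member [0-9a-f]+ is unhealthy/   { total++ }   # assumed wording
        END {
            # Raft needs a strict majority of members to make progress.
            quorum = int(total / 2) + 1
            printf "%d %d %d %s\n", healthy, total, quorum, (healthy >= quorum ? "ok" : "lost")
        }
    '
}
```

Piping the `cluster-health` output shown above into `summarize_cluster_health` yields `2 3 2 ok`: losing one more member would drop the cluster below quorum.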
[root@k8s-master1 cert]# etcdctl --ca-file=/etc/kubernetes/cert/ca.pem --cert-file=/etc/etcd/cert/etcd.pem --key-file=/etc/etcd/cert/etcd-key.pem --endpoints=https://192.168.32.128:2379,https://192.168.32.129:2379,https://192.168.32.130:2379 member list
5bac98ba2781a51e: name=k8s-master2 peerURLs=https://192.168.32.129:2380 clientURLs=https://192.168.32.129:2379 isLeader=false
bd8793282fb7e56f: name=k8s-master3 peerURLs=https://192.168.32.130:2380 clientURLs=https://192.168.32.130:2379 isLeader=false
bee1cc9618cefbee: name=k8s-master1 peerURLs=https://192.168.32.128:2380 clientURLs=https://192.168.32.128:2379 isLeader=true
[root@k8s-master1 cert]#
etcd log

journalctl -u etcd

Mar 28 09:55:11 k8s-master2 systemd[1]: Starting Etcd Server...
Mar 28 09:55:11 k8s-master2 etcd[2415]: etcd Version: 3.3.7
Mar 28 09:55:11 k8s-master2 etcd[2415]: Git SHA: 56536de55
Mar 28 09:55:11 k8s-master2 etcd[2415]: Go Version: go1.9.6
Mar 28 09:55:11 k8s-master2 etcd[2415]: Go OS/Arch: linux/amd64
Mar 28 09:55:11 k8s-master2 etcd[2415]: setting maximum number of CPUs to 1, total number of available CPUs is 1
Mar 28 09:55:11 k8s-master2 etcd[2415]: the server is already initialized as member before, starting as etcd member...
Mar 28 09:55:11 k8s-master2 etcd[2415]: peerTLS: cert = /etc/etcd/cert/etcd.pem, key = /etc/etcd/cert/etcd-key.pem, ca = , trusted-ca = /etc/kube
Mar 28 09:55:11 k8s-master2 etcd[2415]: listening for peers on https://192.168.32.129:2380
Mar 28 09:55:11 k8s-master2 etcd[2415]: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored k
Mar 28 09:55:11 k8s-master2 etcd[2415]: The scheme of client url http://127.0.0.1:2379 is HTTP while client cert auth (--client-cert-auth) is ena
Mar 28 09:55:11 k8s-master2 etcd[2415]: listening for client requests on 127.0.0.1:2379
Mar 28 09:55:11 k8s-master2 etcd[2415]: listening for client requests on 192.168.32.129:2379
Mar 28 09:55:12 k8s-master2 etcd[2415]: recovered store from snapshot at index 2000020
Mar 28 09:55:12 k8s-master2 etcd[2415]: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't ex
Mar 28 09:55:12 k8s-master2 etcd[2415]: panic: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doe
Mar 28 09:55:12 k8s-master2 etcd[2415]: panic: runtime error: invalid memory address or nil pointer dereference
Mar 28 09:55:12 k8s-master2 etcd[2415]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0xbccc10]
Mar 28 09:55:12 k8s-master2 etcd[2415]: goroutine 1 [running]:
Mar 28 09:55:12 k8s-master2 etcd[2415]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer.func1(0xc4201a3c88, 0xc4201
Mar 28 09:55:12 k8s-master2 etcd[2415]: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/
Mar 28 09:55:12 k8s-master2 etcd[2415]: panic(0xe1d2e0, 0xc42034af00)
Mar 28 09:55:12 k8s-master2 etcd[2415]: /usr/local/go/src/runtime/panic.go:491 +0x283
Mar 28 09:55:12 k8s-master2 etcd[2415]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc420171ea0, 0x
Mar 28 09:55:12 k8s-master2 etcd[2415]: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/
Mar 28 09:55:12 k8s-master2 etcd[2415]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer(0x7ffdfe599c5f, 0xb, 0x0, 0
Mar 28 09:55:12 k8s-master2 etcd[2415]: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/
Mar 28 09:55:12 k8s-master2 etcd[2415]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/embed.StartEtcd(0xc4201d8000, 0xc4201d8480, 0x0,
Mar 28 09:55:12 k8s-master2 systemd[1]: etcd.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Mar 28 09:55:12 k8s-master2 systemd[1]: Failed to start Etcd Server.
Mar 28 09:55:12 k8s-master2 systemd[1]: Unit etcd.service entered failed state.
Mar 28 09:55:12 k8s-master2 systemd[1]: etcd.service failed.
Mar 28 09:55:17 k8s-master2 systemd[1]: etcd.service holdoff time over, scheduling restart.
/var/log/messages

Mar 28 10:56:53 k8s-master2 systemd: Starting Etcd Server...
Mar 28 10:56:53 k8s-master2 etcd: etcd Version: 3.3.7
Mar 28 10:56:53 k8s-master2 etcd: Git SHA: 56536de55
Mar 28 10:56:53 k8s-master2 etcd: Go Version: go1.9.6
Mar 28 10:56:53 k8s-master2 etcd: Go OS/Arch: linux/amd64
Mar 28 10:56:53 k8s-master2 etcd: setting maximum number of CPUs to 1, total number of available CPUs is 1
Mar 28 10:56:53 k8s-master2 etcd: the server is already initialized as member before, starting as etcd member...
Mar 28 10:56:53 k8s-master2 etcd: peerTLS: cert = /etc/etcd/cert/etcd.pem, key = /etc/etcd/cert/etcd-key.pem, ca = , trusted-ca = /etc/kubernetes/cert/ca.pem, client-cert-auth = true, crl-file =
Mar 28 10:56:53 k8s-master2 etcd: listening for peers on https://192.168.32.129:2380
Mar 28 10:56:53 k8s-master2 etcd: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
Mar 28 10:56:53 k8s-master2 etcd: The scheme of client url http://127.0.0.1:2379 is HTTP while client cert auth (--client-cert-auth) is enabled. Ignored client cert auth for this url.
Mar 28 10:56:53 k8s-master2 etcd: listening for client requests on 127.0.0.1:2379
Mar 28 10:56:53 k8s-master2 etcd: listening for client requests on 192.168.32.129:2379
Mar 28 10:56:53 k8s-master2 etcd: recovered store from snapshot at index 2000020
Mar 28 10:56:53 k8s-master2 etcd: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't exist
Mar 28 10:56:53 k8s-master2 etcd: panic: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't exist
Mar 28 10:56:53 k8s-master2 etcd: panic: runtime error: invalid memory address or nil pointer dereference
Mar 28 10:56:53 k8s-master2 etcd: [signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0xbccc10]
Mar 28 10:56:53 k8s-master2 etcd: goroutine 1 [running]:
Mar 28 10:56:53 k8s-master2 etcd: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer.func1(0xc4201a3c88, 0xc4201a3868)
Mar 28 10:56:53 k8s-master2 etcd: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:290 +0x40
Mar 28 10:56:53 k8s-master2 etcd: panic(0xe1d2e0, 0xc4202a1150)
Mar 28 10:56:53 k8s-master2 etcd: /usr/local/go/src/runtime/panic.go:491 +0x283
Mar 28 10:56:53 k8s-master2 etcd: github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc420171ea0, 0x1011c8c, 0x2a, 0xc4201a38f8, 0x1, 0x1)
Mar 28 10:56:53 k8s-master2 etcd: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x16d
Mar 28 10:56:53 k8s-master2 etcd: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer(0x7ffc7f287c5f, 0xb, 0x0, 0x0, 0x0, 0x0, 0xc4200db600, 0x1, 0x1, 0xc4200db400, ...)
Mar 28 10:56:53 k8s-master2 etcd: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:385 +0x2b18
Mar 28 10:56:53 k8s-master2 etcd: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/embed.StartEtcd(0xc4201e4000, 0xc4201e4480, 0x0, 0x0)
Mar 28 10:56:53 k8s-master2 etcd: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/embed/etcd.go:179 +0x870
Mar 28 10:56:53 k8s-master2 etcd: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcd(0xc4201e4000, 0x6, 0xff05c7, 0x6, 0x1)
Mar 28 10:56:53 k8s-master2 etcd: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:181 +0x40
Mar 28 10:56:53 k8s-master2 etcd: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcdOrProxyV2()
Mar 28 10:56:53 k8s-master2 etcd: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:102 +0x151e
Mar 28 10:56:53 k8s-master2 etcd: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.Main()
Mar 28 10:56:53 k8s-master2 systemd: etcd.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Mar 28 10:56:53 k8s-master2 etcd: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/main.go:46 +0x3f
Mar 28 10:56:53 k8s-master2 etcd: main.main()
Mar 28 10:56:53 k8s-master2 etcd: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/etcd/main.go:28 +0x20
Mar 28 10:56:53 k8s-master2 systemd: Failed to start Etcd Server.
Mar 28 10:56:53 k8s-master2 systemd: Unit etcd.service entered failed state.
Mar 28 10:56:53 k8s-master2 systemd: etcd.service failed.
The key lines in the error log are:

Mar 28 10:56:53 k8s-master2 etcd: recovered store from snapshot at index 2000020
Mar 28 10:56:53 k8s-master2 etcd: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't exist
Mar 28 10:56:53 k8s-master2 etcd: panic: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't exist
These show that etcd cannot find the database snapshot it needs to recover its backend.

Where is the data stored?

[root@k8s-master2 snap]# ll
total 3768
-rw-r--r-- 1 k8s k8s 86575 Mar 21 12:21 000000000000077f-0000000000186a10.snap
-rw-r--r-- 1 k8s k8s 90505 Mar 25 10:42 000000000000078a-000000000019f0b1.snap
-rw-r--r-- 1 k8s k8s 90505 Mar 25 17:08 000000000000078a-00000000001b7752.snap
-rw-r--r-- 1 k8s k8s 93947 Mar 26 14:31 00000000000007a2-00000000001cfdf3.snap
-rw-r--r-- 1 k8s k8s 97869 Mar 27 12:52 00000000000007a7-00000000001e8494.snap
-rw------- 1 k8s k8s 3387392 Mar 28 11:17 db
[root@k8s-master2 snap]# pwd
/var/lib/etcd/member/snap
[root@k8s-master2 snap]#
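The file names in this directory follow etcd's `<term>-<index>.snap` convention, with both fields in hexadecimal. Decoding them makes it easier to compare the listing with the log. The helper below is a minimal sketch; the function name `snap_term_index` is hypothetical and not part of etcd.

```shell
# Decode an etcd raft snapshot file name of the form <term>-<index>.snap
# (both fields hexadecimal) into decimal "term index".
# Hypothetical helper for illustration; not an etcd tool.
snap_term_index() (
  name="${1##*/}"          # drop any directory part
  name="${name%.snap}"     # drop the .snap suffix
  term_hex="${name%%-*}"   # text before the first dash
  index_hex="${name#*-}"   # text after the first dash
  printf '%d %d\n' "0x$term_hex" "0x$index_hex"
)

# Example: decode the newest file from the listing above.
snap_term_index 00000000000007a7-00000000001e8494.snap
```

The newest file decodes to index 2000020, the same index as in the log line `recovered store from snapshot at index 2000020`. So the raft snapshot itself was present and readable. As far as I understand etcd 3.3 (an assumption, not something the log states), the file the panic complains about is a separate database snapshot named `<index>.snap.db`, which does not appear in this listing.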
My approach back then was the same as today's: I deleted all of etcd1's data and re-initialized the member.
[root@k8s-master2 ~]# rm -rf /var/lib/etcd/*
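Deleting the data directory outright discards the only local copy of that member's data. A more cautious variant, sketched here under assumptions, moves the directory aside so it can still be inspected or copied back later. The function name `archive_etcd_data` and the `.bak-<stamp>` suffix are hypothetical, and `chown --reference` assumes GNU coreutils.

```shell
# Move an etcd data directory aside instead of deleting it, then recreate
# an empty directory with the same owner. Prints the archive path.
# Hypothetical helper; stop etcd first (e.g. systemctl stop etcd).
archive_etcd_data() (
  data_dir="$1"
  stamp="$2"
  target="${data_dir}.bak-${stamp}"
  if [ ! -d "$data_dir" ]; then
    echo "no such directory: $data_dir" >&2
    return 1
  fi
  mv "$data_dir" "$target" || return 1
  mkdir -p "$data_dir" || return 1
  # Keep the original owner (k8s:k8s in the listing above); GNU chown only.
  chown --reference="$target" "$data_dir" || return 1
  echo "$target"
)

# Example (not executed here):
#   systemctl stop etcd
#   archive_etcd_data /var/lib/etcd "$(date +%Y%m%d-%H%M%S)"
```

Moving the directory does not repair the cluster by itself. It only keeps the old `member/` tree available in case the re-initialization turns out to have been the wrong call.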

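I suspect that this earlier operation left the etcd cluster's data incomplete, and that this is what caused today's crash of the entire cluster.

Because I had never backed up the etcd data, everything that had been running before (pods, services, and so on) was lost.
The recovery attempt failed completely.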
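The missing backup is what made the loss permanent. A periodic snapshot taken with `etcdctl snapshot save` could look like the sketch below. The endpoint and certificate paths are taken from the logs in this post; that `etcd.pem` is accepted as a client certificate is an assumption, and the function names and backup directory are hypothetical.

```shell
# Build the snapshot file name for a given directory and timestamp.
backup_file() {
  printf '%s/etcd-snapshot-%s.db\n' "$1" "$2"
}

# Save a v3 snapshot of etcd into the given directory.
# Endpoint and certificate paths come from the logs above; using etcd.pem
# as a client certificate is an assumption about this deployment.
backup_etcd() (
  dir="$1"
  out=$(backup_file "$dir" "$(date +%Y%m%d-%H%M%S)")
  mkdir -p "$dir" || return 1
  ETCDCTL_API=3 etcdctl \
    --endpoints=https://192.168.32.129:2379 \
    --cacert=/etc/kubernetes/cert/ca.pem \
    --cert=/etc/etcd/cert/etcd.pem \
    --key=/etc/etcd/cert/etcd-key.pem \
    snapshot save "$out"
)

# Example (not executed here; /data/etcd-backup is a hypothetical path):
#   backup_etcd /data/etcd-backup
```

Run from cron on one master, this would have left a recent `.db` file to restore from. Copying the snapshots off the etcd hosts matters as well, since a backup stored only on the failed node is of little use.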

Source: https://blog.51cto.com/goome/2375348

Copyright notice: posted by 导航君 on January 7, 2023 at 11:38 PM.
When reposting, please credit: k8s实践5:一次失败的kubernetes集群崩溃处理记录 | 第八网址导航
