Replacing a Failed Master Host on OCP 4.3.x

Abip Sjarbini

5 years ago

This procedure assumes that there is still an etcd quorum in the cluster.
If you have lost the majority of your master hosts, leading to etcd quorum loss, then you must follow the disaster recovery procedure to recover from lost master hosts instead of this procedure.
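As a quick sanity check before choosing between this procedure and disaster recovery, the quorum rule can be sketched in shell. This is only an illustration: the function names quorum_size and has_quorum are made up for this post, and the member counts are typed in by hand rather than read from the cluster.

```shell
# Quorum for an etcd cluster of N members is floor(N/2) + 1.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Succeeds (exit 0) when the healthy members still form a quorum.
# $1 = total members, $2 = healthy members (both supplied by hand).
has_quorum() {
  total=$1
  healthy=$2
  [ "$healthy" -ge "$(quorum_size "$total")" ]
}

# Example from this post: 3 masters, 1 failed, 2 still healthy.
if has_quorum 3 2; then
  echo "quorum intact: replace the failed member"
else
  echo "quorum lost: use the disaster recovery procedure"
fi
```

With three masters the quorum is two, so losing a single master still leaves a working etcd cluster, which is the situation this post covers.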


To replace a Single Master Host:
– Remove the member from the etcd cluster
– Add the member back

Here, we have three master nodes, etcd-[0-2].ocp4.ocp.abip.
Let’s assume that etcd-2.ocp4.ocp.abip has failed and needs to be removed.

etcd-0.ocp4.ocp.abip   192.168.24.51
etcd-1.ocp4.ocp.abip   192.168.24.52
etcd-2.ocp4.ocp.abip   192.168.24.53

Removing a Failed Master Host from the etcd Cluster.
Prerequisites:
– Access to the cluster as a user with the cluster-admin role
– SSH Access to an Active Master Host. We’ll perform the activities from etcd-1.ocp4.ocp.abip node.

Procedures:
1. Access an Active Master Host
2. View the list of Pods with etcd

[root@bastion ~]# ssh core@etcd-1.ocp4.ocp.abip

[core@etcd-1 ~]$ oc login -u admin #
The server uses a certificate signed by an unknown authority.
You can bypass the certificate check, but any data you send to the server could be intercepted by others.
Use insecure connections? (y/n): y

[core@etcd-1 ~]$ oc get pods -n openshift-etcd
NAME                               READY   STATUS    RESTARTS   AGE
etcd-member-etcd-0.ocp4.ocp.abip   2/2     Running   62         22d
etcd-member-etcd-1.ocp4.ocp.abip   2/2     Running   57         22d
etcd-member-etcd-2.ocp4.ocp.abip   2/2     Running   59         22d

3. Remove the Failed Master Host, etcd-2.ocp4.ocp.abip.
In an OCP restricted network, the problem is that etcd-member-remove.sh tries to download etcdctl from the internet. (Please refer to the link provided at the end of this blog.)
We need to modify the script as we did when backing up the etcd data:
– Find the etcdctl binary
– Copy it somewhere, e.g. /root/etcdctl
– Modify the script to disable the dl_etcdctl function and point the ETCDCTL environment variable to /root/etcdctl. Here the modified copy is saved as etcd-member-remove-disconnected.sh
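The modification could be scripted roughly as follows. This is a sketch under a loud assumption: that the script calls dl_etcdctl on a line of its own. The real script layout may differ between releases, so diff the result against the original before running it. The helper name make_disconnected_copy is made up for this post.

```shell
# Sketch: write a "-disconnected" copy of a recovery script.
# ASSUMPTION: the script invokes dl_etcdctl on a line by itself.
# $1 = original script, $2 = modified copy, $3 = path to local etcdctl
make_disconnected_copy() {
  src=$1
  dst=$2
  etcdctl_path=$3
  # Swap the bare dl_etcdctl call for an assignment to ETCDCTL,
  # keeping the original indentation. Other lines are left alone.
  sed -E "s|^([[:space:]]*)dl_etcdctl[[:space:]]*\$|\1ETCDCTL=${etcdctl_path}|" "$src" > "$dst"
  chmod +x "$dst"
}

# Possible usage on the master host (as root):
#   make_disconnected_copy /usr/local/bin/etcd-member-remove.sh \
#     /usr/local/bin/etcd-member-remove-disconnected.sh /root/etcdctl
#   diff /usr/local/bin/etcd-member-remove.sh \
#     /usr/local/bin/etcd-member-remove-disconnected.sh
```

Writing to a separate copy leaves the stock script untouched, which also helps with the machine-config operator issue mentioned in the PS.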

[core@etcd-1 ~]$ which etcd-member-remove.sh
/usr/local/bin/etcd-member-remove.sh

[core@etcd-1 ~]$ sudo -E /usr/local/bin/etcd-member-remove-disconnected.sh etcd-member-etcd-2.ocp4.ocp.abip
Trying to backup etcd client certs..
etcd client certs already backed up and available ./assets/backup/
Member d4d8cf3147795936 removed from cluster 46efcf9423373cdf
etcd member etcd-member-etcd-2.ocp4.ocp.abip with d4d8cf3147795936 successfully removed..

4. Verify that the etcd member has been successfully removed from the cluster:

[core@etcd-1 ~]$ id=$(sudo crictl ps --name etcd-member | awk 'FNR==2{print $1}')

[core@etcd-1 ~]$ sudo crictl exec -it $id /bin/sh
sh-4.2#

sh-4.2# export ETCDCTL_API=3
sh-4.2# export ETCDCTL_CACERT=/etc/ssl/etcd/ca.crt
sh-4.2# export ETCDCTL_CERT=$(find /etc/ssl/ -name '*peer*crt')
sh-4.2# export ETCDCTL_KEY=$(find /etc/ssl/ -name '*peer*key')


sh-4.2# etcdctl member list -w table
+------------------+---------+----------------------------------+-----------------------------------+----------------------------+
|        ID        | STATUS  |               NAME               |            PEER ADDRS             |        CLIENT ADDRS        |
+------------------+---------+----------------------------------+-----------------------------------+----------------------------+
| 7122dcf57e681d7d | started | etcd-member-etcd-0.ocp4.ocp.abip | # | # |
| abcc869a529d85cb | started | etcd-member-etcd-1.ocp4.ocp.abip | # | # |
+------------------+---------+----------------------------------+-----------------------------------+----------------------------+
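If you would rather check this from a script than read the table, a filter along these lines might help. It assumes the default (non-table) output of etcdctl member list, where each line is a comma-separated record with the member name in the third field. The function name member_listed is made up for this post, and the sample keeps the addresses as # exactly as they are shown above.

```shell
# Reads `etcdctl member list` (default comma-separated output) on stdin.
# ASSUMPTION: fields are "ID, status, name, peer addrs, client addrs".
# Exit 0 if a member with the given name is listed, 1 otherwise.
member_listed() {
  awk -F', ' -v name="$1" '$3 == name { found = 1 } END { exit !found }'
}

# Sample mirroring the table above (addresses elided as "#").
sample='7122dcf57e681d7d, started, etcd-member-etcd-0.ocp4.ocp.abip, #, #
abcc869a529d85cb, started, etcd-member-etcd-1.ocp4.ocp.abip, #, #'

if printf '%s\n' "$sample" | member_listed etcd-member-etcd-2.ocp4.ocp.abip; then
  echo "still a member"
else
  echo "removed"
fi
```

Inside the etcd container, the same check would read from etcdctl member list instead of the sample text.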

Adding a Master Host Back to the etcd Cluster
Prerequisites:
– Access to the cluster as a user with the cluster-admin role
– SSH Access to the Master Host to Add to the etcd Cluster (the one we removed, etcd-2.ocp4.ocp.abip)
– The IP Address of an Existing Active etcd Member
– For a restricted environment, modify the etcd-member-add.sh and etcd-snapshot-backup.sh scripts as we did before (please refer to the link provided at the end of this blog)

1. Access the Master Host to Add to the etcd Cluster

[root@bastion ~]# ssh core@etcd-2.ocp4.ocp.abip

2. Run the etcd-member-add.sh script (here, the modified copy etcd-member-add-disconnected.sh) and pass in two parameters:
– The IP address of an existing etcd member: 192.168.24.52
– The name of the etcd member to add: etcd-member-etcd-2.ocp4.ocp.abip

[core@etcd-2 ~]$ sudo -E /usr/local/bin/etcd-member-add-disconnected.sh 192.168.24.52 etcd-member-etcd-2.ocp4.ocp.abip
etcd-member.yaml found in ./assets/backup/
etcd.conf backup upready exists ./assets/backup/etcd.conf
Trying to backup etcd client certs..
etcd client certs already backed up and available ./assets/backup/
Stopping etcd..
etcd data-dir backup found ./assets/backup/etcd..
Updating etcd membership..
Member 7f77e67d2bf8334b added to cluster 46efcf9423373cdf

ETCD_NAME="etcd-member-etcd-2.ocp4.ocp.abip"
ETCD_INITIAL_CLUSTER="etcd-member-etcd-0.ocp4.ocp.abip=https://etcd-0.ocp4.ocp.abip:2380,etcd-member-etcd-2.ocp4.ocp.abip=https://etcd-2.ocp4.ocp.abip:2380,etcd-member-etcd-1.ocp4.ocp.abip=https://etcd-1.ocp4.ocp.abip:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://etcd-2.ocp4.ocp.abip:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
Starting etcd..

3. Verify that the new member is in the list of Pods associated with etcd and that its status is Running

[core@etcd-1 ~]$ oc get pods -n openshift-etcd
NAME                               READY   STATUS    RESTARTS   AGE
etcd-member-etcd-0.ocp4.ocp.abip   2/2     Running   62         22d
etcd-member-etcd-1.ocp4.ocp.abip   2/2     Running   57         22d
etcd-member-etcd-2.ocp4.ocp.abip   2/2     Running   0          69s
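The same check can be sketched as a small filter over the oc get pods output. It assumes the default column layout shown above (NAME, READY, STATUS, ...) with a header line. The function name all_pods_ready is made up for this post.

```shell
# Reads `oc get pods` output (with header) on stdin.
# Exit 0 only if there is at least one pod and every pod row
# has STATUS "Running" and READY n/n.
all_pods_ready() {
  awk 'NR > 1 {
         split($2, r, "/")
         rows++
         if ($3 != "Running" || r[1] != r[2]) bad++
       }
       END { exit (rows == 0 || bad > 0) }'
}

# Sample taken from the output above.
all_pods_ready <<'EOF' && echo "all etcd pods ready"
NAME                               READY   STATUS    RESTARTS   AGE
etcd-member-etcd-0.ocp4.ocp.abip   2/2     Running   62         22d
etcd-member-etcd-1.ocp4.ocp.abip   2/2     Running   57         22d
etcd-member-etcd-2.ocp4.ocp.abip   2/2     Running   0          69s
EOF
```

On a live cluster, the input would come from oc get pods -n openshift-etcd piped into the function.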

4. Verify that the etcd member has been successfully added to the etcd cluster, and the new member is healthy:

[core@etcd-1 ~]$ id=$(sudo crictl ps --name etcd-member | awk 'FNR==2{print $1}')

[core@etcd-1 ~]$ sudo crictl exec -it $id /bin/sh
sh-4.2#

sh-4.2# export ETCDCTL_API=3
sh-4.2# export ETCDCTL_CACERT=/etc/ssl/etcd/ca.crt
sh-4.2# export ETCDCTL_CERT=$(find /etc/ssl/ -name '*peer*crt')
sh-4.2# export ETCDCTL_KEY=$(find /etc/ssl/ -name '*peer*key')

sh-4.2# etcdctl member list -w table
+------------------+---------+----------------------------------+-----------------------------------+----------------------------+
|        ID        | STATUS  |               NAME               |            PEER ADDRS             |        CLIENT ADDRS        |
+------------------+---------+----------------------------------+-----------------------------------+----------------------------+
| 7122dcf57e681d7d | started | etcd-member-etcd-0.ocp4.ocp.abip | # | # |
| 7f77e67d2bf8334b | started | etcd-member-etcd-2.ocp4.ocp.abip | # | # |
| abcc869a529d85cb | started | etcd-member-etcd-1.ocp4.ocp.abip | # | # |
+------------------+---------+----------------------------------+-----------------------------------+----------------------------+

sh-4.2# etcdctl endpoint health --cluster
# is healthy: successfully committed proposal: took = 39.875839ms
# is healthy: successfully committed proposal: took = 51.685488ms
# is healthy: successfully committed proposal: took = 61.023569ms
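To turn the health output into a pass/fail result, one option is to count the "is healthy" lines and compare against the number of masters. This is a sketch that assumes the line format shown above. The function name expect_healthy is made up for this post, and the sample keeps the endpoints as # as they appear in the output.

```shell
# Reads `etcdctl endpoint health --cluster` output on stdin.
# Exit 0 only if exactly $1 endpoints report "is healthy".
expect_healthy() {
  expected=$1
  # grep -c prints 0 but exits non-zero when nothing matches.
  healthy=$(grep -c ' is healthy' || true)
  [ "$healthy" -eq "$expected" ]
}

# Sample taken from the output above (endpoints elided as "#").
expect_healthy 3 <<'EOF' && echo "all 3 endpoints healthy"
# is healthy: successfully committed proposal: took = 39.875839ms
# is healthy: successfully committed proposal: took = 51.685488ms
# is healthy: successfully committed proposal: took = 61.023569ms
EOF
```

Some etcdctl versions may print these lines to stderr, so when piping real output it is probably safer to capture both streams, for example etcdctl endpoint health --cluster 2>&1 | expect_healthy 3.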

PS:
We need to revert the changes we made to the etcd-* scripts; otherwise the machine-config operator goes into a DEGRADED state due to a file mismatch. To verify, run oc describe pods -n openshift-machine-config-operator machine-config-daemon-XXX (for the nodes where we modified the scripts).
To fix the DEGRADED state, we need to delete the problematic pods.
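One way to be sure the scripts really are back to their original content is to record checksums before touching them and verify afterwards. This is a sketch, not something the scripts or the operator provide: the function names and the manifest path in the usage comments are made up for this post, and it assumes GNU sha256sum is available on the host.

```shell
# Record checksums of the stock scripts before editing them.
# $1 = manifest file to write, remaining args = files to record
record_checksums() {
  manifest=$1
  shift
  sha256sum "$@" > "$manifest"
}

# Exit 0 only if every recorded file matches its original checksum.
verify_checksums() {
  sha256sum --check --quiet "$1"
}

# Possible usage on the master host (paths are examples):
#   record_checksums /root/etcd-scripts.sha256 /usr/local/bin/etcd-*.sh
#   ... modify, use, then restore the scripts ...
#   verify_checksums /root/etcd-scripts.sha256 && echo "scripts restored"
```

If verification fails, the listed file still differs from the stock version and is a likely cause of the mismatch.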

Note:
– For OCP nodes connected through a proxy, we might need to add the HTTP(S)_PROXY environment variables to the script.
– For OCP 4.3.5 and later, you might not need to modify the backup script.
– Please refer to the link below to modify the scripts for a restricted environment:
Perform etcd Backup for Restricted Environment on OCP 4.3.x
