ETCD Restoration Practical Guide: Easily Recover Your Cluster(En)

Albert Weng
5 min readFeb 27, 2024

--

Recently encountered issues with the K8S cluster. Without relying on other third-party tools, this article documents the process of restoring the cluster using regular ETCD backups.

This guide records the restoration steps for future occurrences (hopefully not!), serving as a reference document.

Similarly, here are the key points that will be covered in this article:

  1. Types of data in ETCD
  2. ETCD backups
  3. ETCD restoration
  4. Conclusion

1. Types of data in ETCD

ETCD data is by default divided into two types:

  • Snap (Snapshot): Stores snapshot data designed to prevent too many WAL (Write Ahead Log) files. It holds the ETCD data state, and the file has a .snap extension. If you open a .snap file, you’ll find visible characters after the initial few, as it essentially contains the stored JSON-formatted string.
  • WAL (Write Ahead Log): Stores write-ahead logs, and its main function is to record the entire process of data changes. In ETCD, all modifications to data are written to WAL before being committed. Each WAL is composed of individual records.

2. ETCD backups

#------------------------------------------------------
# S2-1. Prerequisites
#------------------------------------------------------
[master]# mkdir -p /root/backup_$(date +%Y%m%d)
[master]# cp -r /etc/kubernetes /root/backup_$(date +%Y%m%d)/
[master]# cp -r /var/lib/etcd/ /root/backup_$(date +%Y%m%d)/
[master]# cp -r /var/lib/kubelet/ /root/backup_$(date +%Y%m%d)/
[master]# ls -al /root/backup_$(date +%Y%m%d)/
#------------------------------------------------------
# S2-2. ETCD backup
#------------------------------------------------------
[master]# ETCDCTL_API=3 etcdctl --endpoints="https://127.0.0.1:2379" \
--cert="/etc/kubernetes/pki/etcd/server.crt" \
--key="/etc/kubernetes/pki/etcd/server.key" \
--cacert="/etc/kubernetes/pki/etcd/ca.crt" \
snapshot save /root/backup_$(date +%Y%m%d)/snap-$(date +%Y%m%d).db

[master]# ls -alh
drwx------ 3 root root 20 Feb 5 15:52 etcd
drwx------ 8 root root 4.0K Feb 5 15:52 kubelet
drwxr-xr-x 4 root root 125 Feb 5 15:52 kubernetes
-rw------- 1 root root 169M Feb 5 15:53 snap-20240205.db

3. ETCD restoration

#------------------------------------------------------
# S3-1. Confirm member
#------------------------------------------------------
[master]# ETCDCTL_API=3 etcdctl --endpoints 10.107.88.12:2379,10.107.88.13:2379,10.107.88.14:2379 \
--cert="/etc/kubernetes/pki/etcd/server.crt" \
--key="/etc/kubernetes/pki/etcd/server.key" \
--cacert="/etc/kubernetes/pki/etcd/ca.crt" \
member list --write-out=table
#------------------------------------------------------
# S3-1. Before restore (master01,master02, master03)
# stop kube-apiserver, etcd
#------------------------------------------------------
[master]# cd /etc/kubernetes
[master]# mv manifests manifests.bak
[master01]# scp -rp /etc/kubernetes/manifests.bak.master01 root@lb01:/root/backup_20240205/
[master02]# scp -rp /etc/kubernetes/manifests.bak.master02 root@lb01:/root/backup_20240205/
[master03]# scp -rp /etc/kubernetes/manifests.bak.master03 root@lb01:/root/backup_20240205/
[master]# crictl ps

[master]# mv /var/lib/etcd /var/lib/etcd.bak
[master01]# scp -rp /var/lib/etcd.bak.master01 root@lb01:/root/backup_20240205/
[master02]# scp -rp /var/lib/etcd.bak.master02 root@lb01:/root/backup_20240205/
[master03]# scp -rp /var/lib/etcd.bak.master03 root@lb01:/root/backup_20240205/

※ Before

※ After

#------------------------------------------------------
# S3-3. restore ETCD (master01,master02, master03)
#------------------------------------------------------
[master]# scp -rp /root/backup_20240205 root@master02.test.example.poc:/root/
[master]# scp -rp /root/backup_20240205 root@master03.test.example.poc:/root/
#------------------------------------------------------
# S3-4. Perform restore(master01)
#------------------------------------------------------
[master01]# ETCDCTL_API=3 etcdctl snapshot restore /root/backup_20240205/snap-20240205.db \
--endpoints=10.107.88.12:2379 \
--name=master01.test.example.poc \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--initial-advertise-peer-urls=https://10.107.88.12:2380 \
--initial-cluster-token=etcd-cluster-0 \
--initial-cluster=master01.test.example.poc=https://10.107.88.12:2380,master02.test.example.poc=https://10.107.88.13:2380,master03.test.example.poc=https://10.107.88.14:2380 \
--data-dir=/var/lib/etcd

[master01]# mv /etc/kubernetes/manifests.bak /etc/kubernetes/manifests
#------------------------------------------------------
# S3-5. Perform restore (master02)
#------------------------------------------------------
[master]# ETCDCTL_API=3 etcdctl snapshot restore /root/backup_20240205/snap-20240205.db \
--endpoints=10.107.88.13:2379 \
--name=master02.test.example.poc \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--initial-advertise-peer-urls=https://10.107.88.13:2380 \
--initial-cluster-token=etcd-cluster-0 \
--initial-cluster=master01.test.example.poc=https://10.107.88.12:2380,master02.test.example.poc=https://10.107.88.13:2380,master03.test.example.poc=https://10.107.88.14:2380 \
--data-dir=/var/lib/etcd

[master02]# mv /etc/kubernetes/manifests.bak /etc/kubernetes/manifests
#------------------------------------------------------
# S3-6. Perform restore (master03)
#------------------------------------------------------
[master]# ETCDCTL_API=3 etcdctl snapshot restore /root/backup_20240205/snap-20240205.db \
--endpoints=10.107.88.14:2379 \
--name=master03.test.example.poc \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--initial-advertise-peer-urls=https://10.107.88.14:2380 \
--initial-cluster-token=etcd-cluster-0 \
--initial-cluster=master01.test.example.poc=https://10.107.88.12:2380,master02.test.example.poc=https://10.107.88.13:2380,master03.test.example.poc=https://10.107.88.14:2380 \
--data-dir=/var/lib/etcd

[master03]# mv /etc/kubernetes/manifests.bak /etc/kubernetes/manifests
#------------------------------------------------------
# S3-6. Confirm
#------------------------------------------------------
[master]# crictl ps
[master]# kubectl get podn -n kube-system
[master]# ETCDCTL_API=3 etcdctl --endpoints 10.107.88.12:2379,10.107.88.13:2379,10.107.88.14:2379 \
--cert="/etc/kubernetes/pki/etcd/server.crt" \
--key="/etc/kubernetes/pki/etcd/server.key" \
--cacert="/etc/kubernetes/pki/etcd/ca.crt" \
member list --write-out=table

4. Conclusion

This article provides a quick review of backing up ETCD, and you can refer to previous ETCD articles for more details. In practice, it is recommended to automate ETCD backup operations using Cronjobs and supplement them with third-party backups for application services. This ensures greater system stability and recoverability.

Furthermore, the storage location of backups is crucial. It should be stored outside of K8S, preferably with an additional copy stored remotely for added security.

That’s it for this article. Thanks for tuning in~~

--

--

Albert Weng

You don't have to be great to start, but you have to start to be great