How to Safely Remove a Control-Plane Node from a K3s Cluster

Decommissioning a control-plane node from a K3s cluster with embedded etcd requires precision to maintain cluster stability. In this post, I’ll guide you through removing a control-plane node called stormrider from a K3s cluster, covering pod drainage, etcd member removal, node deletion, and verification steps to ensure a seamless process.

Background

The K3s cluster consisted of six nodes:

  • lunar-probe, nebula-42, quantum-core (worker nodes)
  • skyforge-77, nova-prime, stormrider (control-plane, etcd, master nodes)

The task was to remove stormrider, which ran Fedora Linux Cosmic Edition and hosted four DaemonSet-managed svclb-* pods for LoadBalancer services via K3s’s klipper-lb. Because stormrider was also an etcd member, its removal had to preserve quorum to keep the control plane healthy.

Prerequisites

Ensure you have:

  • kubectl access to the cluster.
  • SSH access to stormrider.
  • etcdctl installed (e.g., v3.5.18, matching the etcd version bundled with your K3s release); a download sketch follows this list.
  • etcd certificates (e.g., /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt, client.crt, client.key).
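
If etcdctl isn’t already on the node, one way to get a matching binary is to pull it from the upstream etcd release tarball. A minimal sketch, assuming a Linux amd64 host and that v3.5.18 matches your K3s build; adjust the version and paths to your environment:

# Version and paths are illustrative; match them to your K3s release
ETCD_VER=v3.5.18
curl -L "https://github.com/etcd-io/etcd/releases/download/${ETCD_VER}/etcd-${ETCD_VER}-linux-amd64.tar.gz" -o /tmp/etcd.tar.gz
tar xzf /tmp/etcd.tar.gz -C /tmp --strip-components=1
/tmp/etcdctl version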

Step-by-Step Guide

1. Verify Cluster and Node State

Check the cluster’s nodes:

kubectl get nodes -o wide

Output (abridged for stormrider):

NAME         STATUS   ROLES                       AGE   VERSION        INTERNAL-IP    OS-IMAGE
stormrider   Ready    control-plane,etcd,master   27d   v1.32.3+k3s1   192.168.7.99   Fedora Linux Cosmic Edition
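
If you only want to see the server (control-plane) nodes, you can filter on the node-role label that K3s typically applies; a quick check, assuming the standard K3s labels:

kubectl get nodes -l node-role.kubernetes.io/control-plane=true -o wide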

List pods on stormrider:

kubectl get pods --all-namespaces -o wide | grep stormrider

Output:

NAMESPACE     NAME                                    READY   STATUS    IP           NODE
kube-system   svclb-starlink-gateway-4f8b2c9d-xk7lp   1/1     Running   10.43.9.12   stormrider
kube-system   svclb-datastream-relay-6d4e3f2a-qw5mn   2/2     Running   10.43.9.11   stormrider
kube-system   svclb-comms-hub-8c9a4g3b-vz8rk          4/4     Running   10.43.9.10   stormrider
kube-system   svclb-astro-core-2b7d5h4c-nj3pm         5/5     Running   10.43.9.09   stormrider


These `svclb-*` pods are DaemonSet-managed for LoadBalancer services.
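
If you want to confirm that a pod really is DaemonSet-managed before draining, you can inspect its ownerReferences; a small check, using one of the pod names from the listing above:

kubectl -n kube-system get pod svclb-starlink-gateway-4f8b2c9d-xk7lp \
  -o jsonpath='{.metadata.ownerReferences[0].kind}'
# Expected output: DaemonSet
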
2. Check etcd Health
Since stormrider is an etcd member, verify etcd health:

./etcdctl endpoint health \
  --endpoints=https://192.168.7.11:2379,https://192.168.7.22:2379,https://192.168.7.99:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key

Output:

https://192.168.7.22:2379 is healthy: successfully committed proposal: took = 6.512345ms
https://192.168.7.99:2379 is healthy: successfully committed proposal: took = 6.789123ms
https://192.168.7.11:2379 is healthy: successfully committed proposal: took = 7.123456ms

All three etcd members (nova-prime, skyforge-77, stormrider) report healthy.
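
The certificate flags get repetitive. etcdctl also reads ETCDCTL_* environment variables, so you can export them once per shell session; a convenience sketch (the remaining examples keep the explicit flags for clarity):

export ETCDCTL_API=3
export ETCDCTL_CACERT=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
export ETCDCTL_CERT=/var/lib/rancher/k3s/server/tls/etcd/client.crt
export ETCDCTL_KEY=/var/lib/rancher/k3s/server/tls/etcd/client.key
export ETCDCTL_ENDPOINTS=https://192.168.7.11:2379,https://192.168.7.22:2379,https://192.168.7.99:2379

# With these set, the health check shortens to:
./etcdctl endpoint health
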
Get the etcd member list:

./etcdctl member list \
  --endpoints=https://192.168.7.11:2379,https://192.168.7.22:2379,https://192.168.7.99:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key

Output:

4a8e56cd7890ef12, started, nova-prime-3g7h8j9k, https://192.168.7.11:2380, https://192.168.7.11:2379, false
7b9f67de8901ab23, started, stormrider-5k2m3n4p, https://192.168.7.99:2380, https://192.168.7.99:2379, false
2c0d34ef9012bc45, started, skyforge-77-6q8r9s0t, https://192.168.7.22:2380, https://192.168.7.22:2379, false

Note stormrider’s member ID: `7b9f67de8901ab23`.
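
If you prefer not to copy the ID by hand, you can capture it from the member list output, since the node name appears in the third comma-separated field; a small sketch using the same flags as above:

MEMBER_ID=$(./etcdctl member list \
  --endpoints=https://192.168.7.11:2379,https://192.168.7.22:2379,https://192.168.7.99:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  | grep stormrider | cut -d',' -f1)
echo "${MEMBER_ID}"   # should print 7b9f67de8901ab23
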
3. Drain the Node
Drain stormrider to evict pods:

kubectl drain stormrider --ignore-daemonsets --delete-emptydir-data --force

Output:

node/stormrider already cordoned
Warning: ignoring DaemonSet-managed Pods: kube-system/svclb-starlink-gateway-4f8b2c9d-xk7lp, kube-system/svclb-datastream-relay-6d4e3f2a-qw5mn, kube-system/svclb-comms-hub-8c9a4g3b-vz8rk, kube-system/svclb-astro-core-2b7d5h4c-nj3pm
node/stormrider drained

The `svclb-*` pods are skipped, as they’re DaemonSet-managed.
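
Before touching etcd, it’s worth confirming that nothing except DaemonSet-managed pods is still scheduled on the node; a quick check using a field selector:

kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=stormrider
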
4. Remove Node from etcd
Remove stormrider from etcd:

./etcdctl member remove 7b9f67de8901ab23 \
  --endpoints=https://192.168.7.11:2379,https://192.168.7.22:2379,https://192.168.7.99:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key

Output:

Member 7b9f67de8901ab23 removed from cluster 12ab34cd56ef7890

Verify the member list:

./etcdctl member list \
  --endpoints=https://192.168.7.11:2379,https://192.168.7.22:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key

Output:

4a8e56cd7890ef12, started, nova-prime-3g7h8j9k, https://192.168.7.11:2380, https://192.168.7.11:2379, false
2c0d34ef9012bc45, started, skyforge-77-6q8r9s0t, https://192.168.7.22:2380, https://192.168.7.22:2379, false

Check etcd health:

./etcdctl endpoint health \
  --endpoints=https://192.168.7.11:2379,https://192.168.7.22:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key

Output:

https://192.168.7.22:2379 is healthy: successfully committed proposal: took = 5.987654ms
https://192.168.7.11:2379 is healthy: successfully committed proposal: took = 6.234567ms

The etcd cluster is stable with two nodes.
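
To double-check which member currently holds leadership after the removal, etcdctl can print a status table; for example:

./etcdctl endpoint status --write-out=table \
  --endpoints=https://192.168.7.11:2379,https://192.168.7.22:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key
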
5. Delete the Node
Remove stormrider from the cluster:

kubectl delete node stormrider

Output:

node "stormrider" deleted

6. Clean Up the Node
SSH into stormrider, then stop and disable the K3s service:

sudo systemctl stop k3s
sudo systemctl disable k3s
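
If the machine is being repurposed and you want K3s gone entirely, the standard install script drops an uninstall helper that also clears /var/lib/rancher/k3s; a sketch, assuming the default script-based install:

# Present only if K3s was installed via the official install script
sudo /usr/local/bin/k3s-uninstall.sh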

7. Verify the Cluster
Check nodes:

kubectl get nodes -o wide

Confirm stormrider is gone, leaving five nodes: lunar-probe, nebula-42, skyforge-77, nova-prime, quantum-core.
Verify pods:

kubectl get pods --all-namespaces -o wide

Ensure `svclb-*` pods have rescheduled. Check services:

kubectl get svc --all-namespaces
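
Since the svclb-* pods are DaemonSet replicas, each of their DaemonSets should now report five desired and five ready pods (one per remaining node); a quick check:

kubectl -n kube-system get daemonset | grep svclb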

Optionally, recheck etcd health:

./etcdctl endpoint health \
  --endpoints=https://192.168.7.11:2379,https://192.168.7.22:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key

Post-Removal Considerations

etcd Quorum: With two etcd members left, quorum still holds, but fault tolerance drops to zero: a two-member cluster needs both members to agree, so losing either one stalls the control plane. Consider adding a third server node to restore tolerance for a single failure.

Services: Monitor the starlink-gateway, datastream-relay, comms-hub, and astro-core services, and keep an eye on svclb-astro-core in particular if it showed instability before the removal.

Pods: The svclb-* DaemonSets simply stop scheduling a replica on the removed node; the replicas on the remaining nodes keep serving LoadBalancer traffic. Verify service accessibility.
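
To confirm a LoadBalancer service is still reachable end to end, you can pull its external IP and probe it. A sketch with hypothetical names: it assumes starlink-gateway lives in the default namespace and listens on port 80; substitute your real namespace and port:

# Hypothetical namespace/port: adjust to your actual service
EXTERNAL_IP=$(kubectl get svc starlink-gateway -n default \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl -sSf "http://${EXTERNAL_IP}:80/" >/dev/null && echo "starlink-gateway reachable"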

Troubleshooting Tips
Stuck Pods: If pods refuse to evict, check what is blocking them (PodDisruptionBudgets, pods with local storage) before resorting to --force on kubectl drain.

etcd Quorum Loss: If quorum is lost, rejoin a server node or restore from an etcd snapshot.

Service Issues: If a LoadBalancer service becomes unreachable, inspect the corresponding svclb-* pods and the Service's endpoints.

Conclusion
Removing stormrider from a K3s cluster required careful steps to drain pods, remove it from etcd, delete the node, and clean up. By verifying each step, we ensured cluster stability and pod rescheduling. Use this guide to safely decommission control-plane nodes in K3s, keeping your cluster robust.
Happy clustering!