1. Node
1.1 Node NotReady
A Kubernetes cluster node being in the NotReady state can result from various issues. Here are some realistic and common reasons:
1. Node Resource Issues
- Insufficient Memory or CPU: If the node is running out of memory or CPU resources, the kubelet may mark the node as NotReady.
- Disk Pressure: The node’s disk usage may be too high, causing the kubelet to mark it as NotReady.
  - Example: kubectl describe node <node-name> shows DiskPressure under conditions.
- Network Pressure: High network latency or dropped packets may cause readiness issues.
2. kubelet Problems
- kubelet Down: The kubelet service on the node is not running or has crashed.
- Certificate Issues: The kubelet’s certificate might have expired, causing it to fail authentication with the kube-apiserver.
- Configuration Errors: Misconfigured kubelet settings (e.g., a wrong --cluster-dns value or a --kubeconfig that points at the wrong API server) can lead to connectivity issues.
3. Networking Issues
- Node Network Unreachable: The node cannot communicate with the control plane or other nodes.
- CNI Plugin Failure: Issues with the Container Network Interface (CNI) plugin (e.g., Calico, Flannel, Weave) may disrupt pod-to-pod or node-to-node communication.
- Firewall Rules: A firewall or security group blocking Kubernetes-related traffic (e.g., ports 6443, 10250) can cause the node to go NotReady.
4. Control Plane Connectivity Issues
- kube-apiserver Unreachable: The node cannot reach the API server due to network partitioning or DNS resolution issues.
- etcd Problems: If the control plane’s etcd database is down or unhealthy, the API server might not respond to node heartbeats.
5. Component Issues
- Container Runtime Failure: The container runtime (e.g., Docker, containerd, CRI-O) is not running or is misconfigured.
- kube-proxy Failure: The kube-proxy component on the node is not functioning correctly, disrupting node communication.
6. Other Reasons
There are many other possible causes, such as hardware failures, missing configuration files, or a kubelet version that does not match the rest of the cluster.
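Before digging into any of these causes, it helps to know which nodes are actually affected. The following is a minimal sketch that filters the output of kubectl get nodes; it assumes the default column layout (NAME, STATUS, ROLES, AGE, VERSION), and the function name not_ready_nodes is made up for illustration.

```shell
# Print the names of nodes whose STATUS is not Ready.
# Reads `kubectl get nodes --no-headers` output on stdin.
# Assumption: default column layout NAME STATUS ROLES AGE VERSION.
not_ready_nodes() {
  # STATUS can be "Ready", "NotReady", or carry a suffix such as
  # "Ready,SchedulingDisabled"; anything not starting with "Ready" is reported.
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Usage (needs cluster access):
#   kubectl get nodes --no-headers | not_ready_nodes
```

A cordoned but healthy node (Ready,SchedulingDisabled) is deliberately not reported, since cordoning is an administrative action rather than a failure.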
1.1.1 How to Debug a NotReady Node
- Check node conditions first, then the kubelet status:
kubectl describe node <node-name>
systemctl status kubelet
# if it is down/inactive, restart it
systemctl restart kubelet
- Check logs:
  - kubelet logs:
journalctl -u kubelet
  - Container runtime logs (e.g., Docker):
journalctl -u docker
- Verify network connectivity:
  - Ping the control plane API server:
ping <control-plane-ip>
  - Check CNI plugin logs (CNI errors also show up in the kubelet journal):
journalctl -u kubelet
- Inspect resource usage:
top
df -h
# or use kubectl top; this prints the name of the node using the most CPU
kubectl top node --sort-by='cpu' | awk 'NR==2 {print $1}'
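Reading df output by eye gets tedious on nodes with many mounts. As a rough sketch, the helper below flags file systems above a usage threshold. The function name disks_over_threshold and the default of 85 percent are illustrative assumptions, not kubelet settings; the kubelet's own eviction thresholds are configured separately.

```shell
# Print mount points whose usage is at or above a threshold percentage.
# Reads `df -P` output on stdin (POSIX format, one file system per line).
# Assumption: mount points contain no spaces; 85 is an arbitrary default.
disks_over_threshold() {
  threshold="${1:-85}"
  awk -v limit="$threshold" '
    NR > 1 {                      # skip the header line
      use = $5                    # "Capacity" column, e.g. "90%"
      sub(/%/, "", use)
      if (use + 0 >= limit) print $6, use "%"
    }'
}

# Usage on the node itself:
#   df -P | disks_over_threshold 85
```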
Note: this article uses systemd commands throughout and assumes an Ubuntu system.
1.2 Cordon and Drain Nodes
kubectl cordon NODENAME
kubectl drain NODENAME
kubectl uncordon NODENAME
1.2.1 kubectl cordon NODENAME
- Purpose: Marks a node as unschedulable. This prevents new pods from being scheduled on that node.
- Effect on existing pods: Existing pods continue to run on the node.
- Use case: temporarily prevent new workloads from being placed on a node, perhaps for investigation or minor maintenance, without disrupting existing applications.
NOTICE: if a pod spec sets nodeName: <node_name>, the pod can still land on a cordoned node, because:
- Cordoning a node: it tells the scheduler not to place new pods on this node. Only pods that tolerate the unschedulable taint, such as DaemonSet pods, are still scheduled there.
- nodeName in pod spec: this is a direct instruction to Kubernetes: “I want this pod to run specifically on this node.” It bypasses the scheduler, so the cordon is never consulted.
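The nodeName behaviour is easy to try out. The sketch below only generates a pod manifest pinned to a given node; the pod name pinned-demo, the nginx image, and the function name pinned_pod_manifest are all placeholders chosen for illustration.

```shell
# Print a minimal pod manifest that sets spec.nodeName directly.
# Assumptions: pod name "pinned-demo" and image "nginx" are placeholders.
pinned_pod_manifest() {
  node="$1"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: pinned-demo
spec:
  nodeName: ${node}
  containers:
  - name: app
    image: nginx
EOF
}

# Usage (needs cluster access):
#   kubectl cordon worker-1
#   pinned_pod_manifest worker-1 | kubectl apply -f -
#   kubectl get pod pinned-demo -o wide
```

Because the manifest names the node itself, the pod should be started by the kubelet on that node even while it is cordoned.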
1.2.2 kubectl drain NODENAME
- Purpose: Marks a node as unschedulable and evicts the pods running on it. DaemonSet-managed pods are not evicted, which is why drain is usually run with --ignore-daemonsets.
- Effect on existing pods: Gracefully terminates pods running on the node.
- Use case: perform more significant maintenance on a node, such as kernel updates or hardware replacement.
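In practice a plain kubectl drain often stops on DaemonSet pods or pods using emptyDir volumes. One possible wrapper with commonly used flags is sketched below. The function name safe_drain, the 300s timeout, and the KUBECTL override (which allows a dry run with echo) are assumptions made for this example.

```shell
# Drain a node with a commonly used set of flags.
# KUBECTL can be overridden (e.g. KUBECTL=echo) to preview the command.
# Assumption: the 300s timeout is an arbitrary choice, tune it per cluster.
safe_drain() {
  node="$1"
  [ -n "$node" ] || { echo "usage: safe_drain <node>" >&2; return 1; }
  # --ignore-daemonsets: skip DaemonSet-managed pods instead of failing
  # --delete-emptydir-data: allow evicting pods that use emptyDir volumes
  # --timeout: give up instead of waiting forever on a stuck eviction
  "${KUBECTL:-kubectl}" drain "$node" \
    --ignore-daemonsets --delete-emptydir-data --timeout=300s
}

# Usage:
#   KUBECTL=echo safe_drain worker-1   # preview only
#   safe_drain worker-1                # real drain, needs cluster access
```

Note that --delete-emptydir-data discards the contents of emptyDir volumes, so it should only be used when that data is disposable.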
2. Cluster
2.1 Update
The steps below follow the official kubeadm upgrade documentation.
0. Check available versions
sudo apt update
sudo apt-cache madison kubeadm
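The madison listing can be long. As a convenience, the sketch below picks the newest package version for one minor release from that listing. It assumes the usual "package | version | source" layout of apt-cache madison and a sort that supports -V (GNU coreutils does); the function name latest_patch is hypothetical.

```shell
# Print the highest package version for a given minor release.
# Reads `apt-cache madison kubeadm` output on stdin.
# Assumption: fields are separated by "|" and the version is the second field.
latest_patch() {
  minor="$1"   # e.g. 1.32
  awk -F'|' -v m="$minor" '
    {
      gsub(/ /, "", $2)                       # trim padding around the version
      if (index($2, m ".") == 1) print $2     # keep versions of this minor only
    }' | sort -V | tail -n 1
}

# Usage:
#   apt-cache madison kubeadm | latest_patch 1.32
```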
1. Update kubeadm
# change the version as you need
sudo apt-mark unhold kubeadm &&
sudo apt-get update && sudo apt-get install -y kubeadm='1.32.x-*' &&
sudo apt-mark hold kubeadm
kubeadm version
1.1 On the control plane node
# change the version as you need
sudo kubeadm upgrade apply v1.32.x
1.2 On the other nodes
sudo kubeadm upgrade node
2. Drain the node
kubectl drain <node-to-drain> --ignore-daemonsets
If you are upgrading a worker node, first ssh to the control plane node, drain the worker from there, then ssh back to the node being upgraded.
3. Update kubelet and kubectl
# change the version as you need
sudo apt-mark unhold kubelet kubectl &&
sudo apt-get update && sudo apt-get install -y kubelet='1.32.x-*' kubectl='1.32.x-*' &&
sudo apt-mark hold kubelet kubectl
Then restart the kubelet:
sudo systemctl daemon-reload
sudo systemctl restart kubelet
4. Uncordon the node
kubectl uncordon <node-to-uncordon>
Similarly, if you are on a worker node, ssh back to the control plane node and run the uncordon there.
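Once every node has been uncordoned, it is worth confirming that all kubelets report the target version. The helper below is a small sketch that assumes the default kubectl get nodes columns (NAME, STATUS, ROLES, AGE, VERSION); the name nodes_not_at_version is made up for this example.

```shell
# Print nodes whose kubelet version differs from the expected one.
# Reads `kubectl get nodes --no-headers` output on stdin.
# Assumption: VERSION is the fifth column of the default output.
nodes_not_at_version() {
  target="$1"   # e.g. v1.32.1
  awk -v want="$target" '$5 != want { print $1, $5 }'
}

# Usage (needs cluster access):
#   kubectl get nodes --no-headers | nodes_not_at_version v1.32.1
```

Empty output means every node already runs the expected kubelet version.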