Troubleshooting Node Issues
When a node enters the NotReady state, pods on that node stop receiving Service traffic and, if the condition persists, are evicted so that their controllers can recreate them on other nodes. Rapid diagnosis is essential to restore cluster capacity.
Identifying NotReady Nodes
# List all nodes with their status
kubectl get nodes
# Filter for problem nodes
kubectl get nodes | grep -v " Ready"
# Get detailed node conditions
kubectl describe node worker-node-3
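The grep filter above matches on spacing and also passes the header row. As a minimal sketch of an alternative, the helper below matches on the STATUS column instead. The name not_ready_nodes is hypothetical, and the helper assumes the default column layout of kubectl get nodes --no-headers (NAME first, STATUS second).

```shell
# Hypothetical helper: reads `kubectl get nodes --no-headers` output on stdin
# and prints name and status for every node whose STATUS is not exactly "Ready".
# Cordoned nodes ("Ready,SchedulingDisabled") are reported as well.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 "\t" $2 }'
}
```

Usage: kubectl get nodes --no-headers | not_ready_nodes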
Node Conditions
The describe node output includes a Conditions section that often points to the root cause. The same conditions can be queried directly:
# Check specific conditions
kubectl get node worker-node-3 -o jsonpath='{.status.conditions[*].type}'
# Common conditions to check:
# - Ready: kubelet is healthy and can accept pods
# - MemoryPressure: node is running low on memory
# - DiskPressure: node disk capacity is low
# - PIDPressure: too many processes on the node
# - NetworkUnavailable: node network is misconfigured
# Get conditions in a readable format
kubectl get node worker-node-3 \
-o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
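For the conditions listed above, Ready is healthy when its status is True, while the pressure and network conditions are healthy when False. The following sketch filters the Type=Status lines produced by the jsonpath range expression down to the unhealthy ones. The name flag_bad_conditions is hypothetical, and custom conditions added by other components may not follow the same convention.

```shell
# Hypothetical helper: reads "Type=Status" lines on stdin and prints only the
# unhealthy ones. Ready is healthy when True; every other condition is treated
# as healthy when False. A status of Unknown is always reported.
flag_bad_conditions() {
  awk -F= '
    NF < 2                         { next }          # skip blank or malformed lines
    $1 == "Ready" && $2 != "True"  { print; next }
    $1 != "Ready" && $2 != "False" { print }
  '
}
```

Usage: pipe the output of the readable-format command above into flag_bad_conditions. No output means every reported condition looks healthy.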
Investigating Further
# Check node resource usage (requires metrics-server in the cluster)
kubectl top node worker-node-3
# List pods running on the problem node
kubectl get pods -A --field-selector spec.nodeName=worker-node-3
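# Check kubelet logs: SSH to the node, or start a debug pod on it.
# The debug pod mounts the node's root filesystem at /host and can only start while the kubelet still responds.
kubectl debug node/worker-node-3 -it --image=busybox
# Inside the debug pod, assuming a systemd-based node: chroot /host journalctl -u kubelet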
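# View events for the node (searched across all namespaces, since Node events are not recorded in the workload's namespace)
kubectl get events -A --field-selector involvedObject.kind=Node,involvedObject.name=worker-node-3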
Taints and Cordoning
Prevent new pods from scheduling on a problem node while investigating:
# Cordon the node (mark unschedulable)
kubectl cordon worker-node-3
# Drain the node (marks it unschedulable if it is not already, then evicts pods gracefully)
kubectl drain worker-node-3 --ignore-daemonsets --delete-emptydir-data
# After fixing, uncordon to allow scheduling again
kubectl uncordon worker-node-3
# Check taints on a node
kubectl describe node worker-node-3 | grep -A3 Taints
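The cordon and drain steps can be combined into one helper with a bounded wait, so that a blocked eviction does not hang the session. This is a sketch only: the name quarantine_node and the 120s default timeout are assumptions, not values from this guide.

```shell
# Hypothetical wrapper: cordon a node, then drain it with a bounded wait.
# Arguments: node name, optional drain timeout (default 120s, an assumed value).
quarantine_node() {
  node="$1"
  timeout="${2:-120s}"
  if [ -z "$node" ]; then
    echo "usage: quarantine_node <node> [timeout]" >&2
    return 2
  fi
  # Explicit cordon mirrors the steps above; drain would also cordon the node.
  kubectl cordon "$node" || return 1
  # --timeout makes drain give up instead of waiting indefinitely, for example
  # when a PodDisruptionBudget prevents an eviction.
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data \
    --timeout="$timeout"
}
```

Usage: quarantine_node worker-node-3 60s. Once the node is repaired, kubectl uncordon worker-node-3 returns it to service.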
If kubectl commands do not reveal the issue, check the kubelet service status, disk space, and network connectivity on the node itself.
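For the disk space check on the node, a small filter over df -P output can highlight nearly full filesystems. This is a sketch under assumptions: the name full_filesystems is hypothetical, the 85 percent default is an arbitrary illustration (the kubelet's eviction thresholds are configured separately), and mount points containing spaces are not handled.

```shell
# Hypothetical helper: reads `df -P` output on stdin and prints mount point and
# usage for filesystems at or above a threshold percentage (default 85).
full_filesystems() {
  awk -v limit="${1:-85}" '
    NR > 1 {                      # skip the df header row
      use = $5
      sub(/%$/, "", use)          # strip the trailing percent sign
      if (use + 0 >= limit + 0) print $6 " " $5
    }
  '
}
```

Usage on the node: df -P | full_filesystems 90. On a systemd-based node, systemctl status kubelet and journalctl -u kubelet cover the kubelet side of the same check.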