Troubleshooting Node Issues

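When a node enters the NotReady state, its pods are removed from Service endpoints and stop receiving traffic. If the node does not recover, the pods are evicted (by default after about five minutes) so their controllers can reschedule them elsewhere. Rapid diagnosis is essential to restore cluster capacity.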

Identifying NotReady Nodes

# List all nodes with their status
kubectl get nodes

# Filter for problem nodes
kubectl get nodes | grep -v " Ready"

# Get detailed node conditions
kubectl describe node worker-node-3
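
A stricter alternative to text matching is to compare the STATUS column exactly. The following is a minimal sketch, not a definitive tool: the function name not_ready_nodes is hypothetical, and it assumes the default column layout of `kubectl get nodes --no-headers` (NAME, STATUS, ROLES, AGE, VERSION) arriving on stdin.

```shell
# Hypothetical helper: print the name and status of every node whose STATUS
# column is not exactly "Ready". Cordoned nodes ("Ready,SchedulingDisabled")
# are reported too, which a plain text match on " Ready" would hide.
# Input: output of `kubectl get nodes --no-headers` on stdin.
not_ready_nodes() {
  awk 'NF >= 2 && $2 != "Ready" { print $1, $2 }'
}

# Usage (requires cluster access):
#   kubectl get nodes --no-headers | not_ready_nodes
```

Because the function only reads stdin, it can be tried against saved output before being pointed at a live cluster.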

Node Conditions

The kubectl describe node output includes conditions that often point to the root cause:

# Check specific conditions
kubectl get node worker-node-3 -o jsonpath='{.status.conditions[*].type}'

# Common conditions to check:
# - Ready: kubelet is healthy and can accept pods
# - MemoryPressure: node is running low on memory
# - DiskPressure: node disk capacity is low
# - PIDPressure: too many processes on the node
# - NetworkUnavailable: node network is misconfigured

# Get conditions in a readable format
kubectl get node worker-node-3 \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
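
For the standard conditions listed above, a healthy node reports Ready=True and False for each pressure or network condition. A small filter can surface only the entries that deviate from that. This is a sketch under that assumption; the function name abnormal_conditions is hypothetical, and it expects the Type=Status lines produced by the jsonpath range expression above.

```shell
# Hypothetical helper: read "Type=Status" lines on stdin and print only the
# conditions that signal trouble. Ready is expected to be "True"; every other
# condition is expected to be "False" (an assumption that holds for the
# standard conditions, but may not for custom ones).
abnormal_conditions() {
  awk -F= '
    NF != 2 { next }                                   # skip blank or malformed lines
    $1 == "Ready" && $2 != "True"  { print $1 "=" $2; next }
    $1 != "Ready" && $2 != "False" { print $1 "=" $2 }
  '
}

# Usage (requires cluster access):
#   kubectl get node worker-node-3 \
#     -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' \
#     | abnormal_conditions
```

Empty output means none of the reported conditions look abnormal under these assumptions. A Ready status of Unknown usually means the control plane has stopped hearing from the kubelet.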

Investigating Further

# Check node resource usage (requires metrics-server in the cluster)
kubectl top node worker-node-3

# List pods running on the problem node
kubectl get pods -A --field-selector spec.nodeName=worker-node-3

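# Check kubelet logs over SSH, or from a debug pod (needs a kubelet that can still start pods).
# The pod mounts the node's root filesystem at /host, so node logs appear under /host/var/log.
kubectl debug node/worker-node-3 -it --image=busybox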

# View events recorded for the node object (searched across all namespaces)
kubectl get events -A --field-selector involvedObject.kind=Node,involvedObject.name=worker-node-3
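
When many pods run on the affected node, a per-status count gives a quicker overview than the raw pod listing. The sketch below assumes the default columns of `kubectl get pods -A --no-headers` (NAMESPACE, NAME, READY, STATUS, ...); the function name pods_by_status is hypothetical.

```shell
# Hypothetical helper: count pods per STATUS value.
# Input: output of `kubectl get pods -A --no-headers ...` on stdin, where
# STATUS is the fourth column. Output: "<status> <count>", sorted by status.
pods_by_status() {
  awk 'NF >= 4 { count[$4]++ } END { for (s in count) print s, count[s] }' \
    | LC_ALL=C sort
}

# Usage (requires cluster access):
#   kubectl get pods -A --no-headers \
#     --field-selector spec.nodeName=worker-node-3 | pods_by_status
```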

Taints and Cordoning

Prevent new pods from scheduling on a problem node while investigating:

# Cordon the node (mark unschedulable)
kubectl cordon worker-node-3

# Drain the node (cordons it if needed, then evicts pods gracefully; DaemonSet pods stay)
kubectl drain worker-node-3 --ignore-daemonsets --delete-emptydir-data

# After fixing, uncordon to allow scheduling again
kubectl uncordon worker-node-3

# Check taints on a node
kubectl describe node worker-node-3 | grep -A3 Taints
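
The cordon and drain steps above can be wrapped so that the exact commands are previewed before they touch the cluster. This is a minimal sketch: the function name isolate_node and the KUBECTL override variable are hypothetical conventions, not kubectl features.

```shell
# Hypothetical wrapper around the cordon and drain commands shown above.
# Set KUBECTL=echo to print the commands instead of running them.
KUBECTL="${KUBECTL:-kubectl}"

isolate_node() {
  node="${1:-}"
  if [ -z "$node" ]; then
    echo "usage: isolate_node <node-name>" >&2
    return 2
  fi
  # Stop new pods from landing on the node first, then evict what is there.
  "$KUBECTL" cordon "$node" || return 1
  "$KUBECTL" drain "$node" --ignore-daemonsets --delete-emptydir-data
}

# Preview:  KUBECTL=echo; isolate_node worker-node-3
# Execute:  KUBECTL=kubectl; isolate_node worker-node-3
```

Running the preview first makes it easy to confirm the node name before any pods are evicted.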

If kubectl commands do not reveal the issue, check kubelet status, disk space, and network connectivity directly on the node.
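
The on-node checks can be sketched as follows, assuming a Linux node where the kubelet runs as a systemd unit named kubelet. The helper name full_filesystems and the 85% default threshold are illustrative choices, not Kubernetes defaults.

```shell
# Hypothetical helper: read `df -P` output on stdin and print the mount point
# and usage of every filesystem at or above a threshold (default 85%).
# Assumes device names and mount points contain no spaces.
full_filesystems() {
  threshold="${1:-85}"
  awk -v limit="$threshold" '
    NR > 1 {                       # skip the header line
      use = $5
      sub(/%/, "", use)            # "91%" -> "91"
      if (use + 0 >= limit + 0) print $6, $5
    }
  '
}

# Usage on the node itself:
#   systemctl status kubelet
#   journalctl -u kubelet --since "30 minutes ago"
#   df -P | full_filesystems 85
```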