Design a Production Cluster

Designing a production Kubernetes cluster requires thoughtful decisions about high availability, networking, security, and operational readiness. This lesson walks through the key considerations.

High Availability

A production control plane must survive node failures:

# Verify control plane redundancy
kubectl get nodes -l node-role.kubernetes.io/control-plane

# List etcd pods; a Running status is a first signal, not a full health check
# (the component=etcd label is set on kubeadm static pods)
kubectl get pods -n kube-system -l component=etcd

# List API server pods; expect one per control plane node
kubectl get pods -n kube-system -l component=kube-apiserver

# Count control plane nodes; expect at least 3
kubectl get nodes -l node-role.kubernetes.io/control-plane --no-headers | wc -l

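Plan for at least three control plane nodes spread across availability zones, and keep the count odd so that etcd can still form a majority when one node is lost. Place an external load balancer in front of the API servers so that clients and kubelets reach them through a single stable endpoint.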
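The reason for three nodes comes down to quorum arithmetic: etcd needs a majority of its members to accept writes. The sketch below prints the quorum size and the number of member failures a cluster tolerates for a few sizes. It assumes etcd runs on the control plane nodes themselves (a stacked topology); quorum and tolerated_failures are illustrative helper names, not part of any tool.

```shell
#!/usr/bin/env bash
# Sketch: majority-quorum arithmetic for an etcd cluster of n members.
# Assumption: one etcd member per control plane node (stacked topology).

quorum() {
  # Majority of n members: floor(n / 2) + 1
  echo $(( $1 / 2 + 1 ))
}

tolerated_failures() {
  # Members that can be lost while a majority remains: n - quorum(n)
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 4 5; do
  echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

Note that four members tolerate no more failures than three, which is why odd counts are the usual choice.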

Networking Architecture

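# Choose a CNI plugin that supports NetworkPolicy
# Options: Calico, Cilium, Weave Net

# Inspect the pod CIDR currently configured on the cluster
# (works when the control plane runs as pods, as on kubeadm clusters)
kubectl cluster-info dump | grep -m1 "cluster-cidr"

# Inspect the service CIDR currently configured on the cluster
kubectl cluster-info dump | grep -m1 "service-cluster-ip-range"

# Verify the ingress controller is running
# (assumes ingress-nginx installed in its default namespace)
kubectl get pods -n ingress-nginx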

Allocate non-overlapping CIDR ranges for pods, services, and nodes. Size them for future growth, because these ranges are difficult to change once the cluster is running.
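One way to sanity-check a CIDR plan before creating the cluster is a small overlap test. The following is a minimal sketch in plain bash for IPv4 only; the three ranges are hypothetical example values, and the helper names (ip_to_int, cidr_range, cidrs_overlap, check_pair) are made up for illustration. It does not handle octets written with leading zeros.

```shell
#!/usr/bin/env bash
# Sketch: detect overlapping IPv4 CIDR ranges using bash integer arithmetic.

ip_to_int() {
  # Convert dotted-quad a.b.c.d to a single integer
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

cidr_range() {
  # Print "first last" addresses of a CIDR block as integers
  local ip=${1%/*} prefix=${1#*/}
  local base size start
  base=$(ip_to_int "$ip")
  size=$(( 1 << (32 - prefix) ))
  start=$(( base & ~(size - 1) ))
  echo "$start $(( start + size - 1 ))"
}

cidrs_overlap() {
  # Exit status 0 when the two blocks share at least one address
  local s1 e1 s2 e2
  read -r s1 e1 <<< "$(cidr_range "$1")"
  read -r s2 e2 <<< "$(cidr_range "$2")"
  (( s1 <= e2 && s2 <= e1 ))
}

check_pair() {
  if cidrs_overlap "$1" "$2"; then
    echo "OVERLAP: $1 and $2"
  else
    echo "ok: $1 and $2"
  fi
}

# Hypothetical plan; substitute your own ranges
POD_CIDR="10.244.0.0/16"
SERVICE_CIDR="10.96.0.0/12"
NODE_CIDR="192.168.0.0/24"

check_pair "$POD_CIDR" "$SERVICE_CIDR"
check_pair "$POD_CIDR" "$NODE_CIDR"
check_pair "$SERVICE_CIDR" "$NODE_CIDR"
```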

Node Pools and Sizing

# Label nodes for workload targeting
kubectl label node worker-1 workload-type=general
kubectl label node gpu-node-1 workload-type=gpu

# Set taints for dedicated workloads
kubectl taint nodes gpu-node-1 gpu=true:NoSchedule

# Review node capacity
kubectl describe nodes | grep -A5 "Capacity"
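Labels and taints only take effect when workloads refer to them. As a rough sketch, the function below prints a pod manifest that selects the workload-type=gpu label and tolerates the gpu=true:NoSchedule taint set above. The pod name gpu-demo and the pause image are placeholders chosen for illustration, and gpu_pod_manifest is a made-up helper name.

```shell
#!/usr/bin/env bash
# Sketch: emit a pod manifest that targets the tainted GPU node pool.
# Assumptions: pod name and image are placeholders, not from a real workload.

gpu_pod_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-demo
spec:
  # Schedule only onto nodes labeled workload-type=gpu
  nodeSelector:
    workload-type: gpu
  # Tolerate the gpu=true:NoSchedule taint on those nodes
  tolerations:
    - key: gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: main
      image: registry.k8s.io/pause:3.9
EOF
}

gpu_pod_manifest
```

To validate the manifest against a cluster without creating anything, the output can be piped to kubectl apply --dry-run=client -f -. The toleration alone does not attract the pod to GPU nodes; the nodeSelector does that, while the taint keeps other pods away.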

Security Baseline

# Enable Pod Security Standards
kubectl label namespace production pod-security.kubernetes.io/enforce=restricted

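# List existing network policies; a production namespace should have
# at least a default-deny policy
kubectl get networkpolicies -n production

# Review cluster-wide role bindings; look for unexpected cluster-admin grants
kubectl get clusterrolebindings

# Check whether audit logging flags are set on the API server
# (selects by label instead of assuming a pod name)
kubectl describe pods -n kube-system -l component=kube-apiserver | grep audit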
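A default network policy for a namespace is commonly a deny-all rule that later policies open up selectively. The sketch below prints such a manifest for a given namespace; default_deny_policy is a hypothetical helper name, and the policy name default-deny-all is an arbitrary choice. It only has an effect when the CNI plugin enforces NetworkPolicy.

```shell
#!/usr/bin/env bash
# Sketch: emit a default-deny NetworkPolicy for one namespace.
# Assumption: the policy name is arbitrary; pick your own convention.

default_deny_policy() {
  local ns=$1
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: ${ns}
spec:
  # An empty podSelector matches every pod in the namespace
  podSelector: {}
  # Listing both types with no rules denies all ingress and egress
  policyTypes:
    - Ingress
    - Egress
EOF
}

default_deny_policy production
```

The output can be piped to kubectl apply -f - once reviewed. Because egress is denied as well, pods in the namespace lose DNS resolution until a separate policy allows traffic to the cluster DNS service.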

Operational Readiness Checklist

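Before going to production, verify that monitoring (Prometheus/Grafana), log aggregation (Loki/EFK), an etcd backup strategy, a disaster recovery plan, and upgrade procedures are all in place and tested. As a starting point for validating control plane health, use kubectl get --raw='/readyz?verbose'; the older kubectl get componentstatuses command has been deprecated since Kubernetes v1.19.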
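Parts of the checklist can be scripted so they run the same way before every release. The following is a minimal sketch of a check runner: run_check is a hypothetical helper that runs any command and reports PASS or FAIL, and the two kubectl checks are examples only. It assumes kubectl is already configured for the target cluster and skips the cluster checks when kubectl is not installed.

```shell
#!/usr/bin/env bash
# Sketch: tiny pre-production check runner.
# Assumption: kubectl, if present, points at the cluster under review.

FAILED=0

run_check() {
  # Usage: run_check "description" command [args...]
  local name=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $name"
  else
    echo "FAIL: $name"
    FAILED=$((FAILED + 1))
  fi
}

if command -v kubectl >/dev/null 2>&1; then
  # The /readyz endpoint aggregates the API server's readiness checks
  run_check "API server readiness" \
    kubectl get --raw='/readyz?verbose' --request-timeout=5s
  # Succeeds when the API server answers a node list request
  run_check "node list query" \
    kubectl get nodes --request-timeout=5s
else
  echo "kubectl not found; skipping cluster checks"
fi

echo "failed checks: $FAILED"
```

Checks for monitoring, logging, and backups depend on how those systems are deployed, so they are left out here; they can be added as further run_check lines.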