Design a Production Cluster
Designing a production Kubernetes cluster requires thoughtful decisions about high availability, networking, security, and operational readiness. This lesson walks through the key considerations.
High Availability
A production control plane must survive node failures:
# Verify control plane redundancy
kubectl get nodes -l node-role.kubernetes.io/control-plane
# Check etcd cluster health
kubectl get pods -n kube-system -l component=etcd
# List API server replicas (expect one per control plane node)
kubectl get pods -n kube-system -l component=kube-apiserver
# Count control plane nodes (expect at least 3)
kubectl get nodes -l node-role.kubernetes.io/control-plane --no-headers | wc -l
Plan for at least three control plane nodes, spread across availability zones. Use an odd number so that etcd can keep quorum after a node failure. Put an external load balancer in front of the API servers.
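The three-node minimum follows from etcd's majority quorum rule. The sketch below works through that arithmetic, assuming stacked etcd (one etcd member per control plane node). The function names are illustrative and not part of any Kubernetes tooling.

```shell
#!/usr/bin/env bash
# Quorum arithmetic for an etcd cluster with N voting members.
# Assumption: one etcd member per control plane node (stacked etcd).

# Members that must be reachable for writes to succeed: floor(N/2) + 1.
etcd_quorum() {
  local members=$1
  echo $(( members / 2 + 1 ))
}

# Members that can fail while the cluster still has quorum.
etcd_fault_tolerance() {
  local members=$1
  echo $(( members - (members / 2 + 1) ))
}

for n in 1 2 3 4 5; do
  printf 'members=%d quorum=%d tolerated_failures=%d\n' \
    "$n" "$(etcd_quorum "$n")" "$(etcd_fault_tolerance "$n")"
done
```

Going from three members to four does not increase the number of tolerated failures, which is why odd member counts are usually preferred.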
Networking Architecture
# Choose a CNI plugin that supports NetworkPolicy
# Options: Calico, Cilium, Weave
# Inspect the configured pod CIDR (plan it to avoid conflicts with other ranges)
kubectl cluster-info dump | grep -m1 "cluster-cidr"
# Inspect the configured service CIDR
kubectl cluster-info dump | grep -m1 "service-cluster-ip-range"
# Verify the ingress controller is running (assumes ingress-nginx in its default namespace)
kubectl get pods -n ingress-nginx
Allocate non-overlapping CIDR ranges for pods, services, and nodes. Consider future growth when sizing these ranges.
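Overlap between ranges can be checked before the cluster is built. The following is a minimal sketch in plain bash for IPv4 only. The function names and the example ranges are assumptions chosen for illustration, not values taken from a real cluster.

```shell
#!/usr/bin/env bash
# Check whether two IPv4 CIDR ranges overlap, using only bash arithmetic.

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Print the first and last address (as integers) of a CIDR such as 10.244.0.0/16.
cidr_range() {
  local ip=${1%/*} prefix=${1#*/}
  local base mask start end
  base=$(ip_to_int "$ip")
  mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  start=$(( base & mask ))
  end=$(( start | (~mask & 0xFFFFFFFF) ))
  echo "$start $end"
}

# Succeed (exit status 0) if the two CIDRs share at least one address.
cidrs_overlap() {
  local s1 e1 s2 e2
  read -r s1 e1 <<< "$(cidr_range "$1")"
  read -r s2 e2 <<< "$(cidr_range "$2")"
  (( s1 <= e2 && s2 <= e1 ))
}

# Report the result for one pair of ranges.
check_pair() {
  if cidrs_overlap "$1" "$2"; then
    echo "OVERLAP: $1 and $2"
  else
    echo "ok: $1 and $2"
  fi
}

# Hypothetical ranges for pods, services, and nodes.
pods="10.244.0.0/16"
services="10.96.0.0/12"
nodes="192.168.0.0/24"

check_pair "$pods" "$services"
check_pair "$pods" "$nodes"
check_pair "$services" "$nodes"
```

Running the same check against the ranges of any peered VPCs or on-premises networks helps catch routing conflicts that are hard to fix after the cluster is in use.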
Node Pools and Sizing
# Label nodes for workload targeting
kubectl label node worker-1 workload-type=general
kubectl label node gpu-node-1 workload-type=gpu
# Set taints for dedicated workloads
kubectl taint nodes gpu-node-1 gpu=true:NoSchedule
# Review node capacity
kubectl describe nodes | grep -A5 "Capacity"
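A workload reaches the tainted GPU node only if it both selects the node and tolerates the taint. The sketch below writes a Pod manifest that matches the workload-type=gpu label and the gpu=true:NoSchedule taint set above. The pod name, image, and file name are hypothetical.

```shell
#!/usr/bin/env bash
# Write a Pod manifest that targets the labeled and tainted GPU node.
# Assumptions: pod name, container image, and file name are placeholders.
manifest="gpu-pod.yaml"

cat > "$manifest" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job
spec:
  # Steer the pod to nodes labeled workload-type=gpu.
  nodeSelector:
    workload-type: gpu
  # Allow scheduling onto nodes tainted gpu=true:NoSchedule.
  tolerations:
    - key: "gpu"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  containers:
    - name: main
      image: registry.example.com/gpu-job:latest
EOF

echo "wrote $manifest"
# Apply with: kubectl apply -f gpu-pod.yaml
```

The toleration only permits scheduling on the tainted node. The nodeSelector is what keeps the pod off the general-purpose workers.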
Security Baseline
# Enable Pod Security Standards
kubectl label namespace production pod-security.kubernetes.io/enforce=restricted
# List existing network policies (a production namespace should include a default-deny policy)
kubectl get networkpolicies -n production
# Review cluster-wide RBAC bindings to find overly broad API access
kubectl get clusterrolebindings
# Check that audit logging flags are set on the API server
kubectl describe pods -n kube-system -l component=kube-apiserver | grep audit
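A common starting point for default network policies is a default-deny policy in each namespace. The sketch below writes one for the production namespace. The policy name and file name are assumptions, and the policy only takes effect when the CNI plugin enforces NetworkPolicy.

```shell
#!/usr/bin/env bash
# Write a default-deny NetworkPolicy for the production namespace.
# Assumptions: the policy name and file name are placeholders.
policy_file="default-deny.yaml"

cat > "$policy_file" <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  # An empty podSelector matches every pod in the namespace.
  podSelector: {}
  # Listing both types with no rules denies all ingress and egress.
  policyTypes:
    - Ingress
    - Egress
EOF

echo "wrote $policy_file"
# Apply with: kubectl apply -f default-deny.yaml
```

Denying egress also blocks DNS lookups, so additional policies that allow the traffic each workload needs are normally applied alongside this one.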
Operational Readiness Checklist
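Before going to production, verify monitoring (Prometheus/Grafana), log aggregation (Loki/EFK), an etcd backup strategy, a disaster recovery plan, and upgrade procedures. As a starting point for validating control plane health, query the API server's readiness endpoint with kubectl get --raw='/readyz?verbose'. The older kubectl get componentstatuses command has been deprecated since Kubernetes v1.19.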