Running Stateful Apps
Stateful apps such as databases and queues need stable network identity, durable storage, and ordered startup and shutdown.
Core requirements
- Stable identity: predictable Pod names
- Durable storage: data survives Pod recreation
- Ordered startup/shutdown: avoid all replicas restarting together
Common stack
- StatefulSet + PVC
- Headless Service for stable DNS
- Backup and restore workflows
Practice tips
- Start with a single instance, then scale out
- Use readinessProbe to avoid routing traffic too early
Practical notes
- Start with a quick inventory: kubectl get nodes, kubectl get pods -A, and kubectl get events -A.
- Compare desired vs. observed state; kubectl describe usually explains drift or failed controllers.
- Keep names, labels, and selectors consistent so Services and controllers can find Pods.
Quick checklist
- The resource matches the intent you described in YAML.
- Namespaces, RBAC, and images are correct for the target environment.
- Health checks and logs are in place before promotion.
Data, identity, and steady state
Stateful workloads need stable identity and stable storage. Kubernetes provides that stability, most directly through StatefulSets, with persistent volumes, stable DNS names, and ordered lifecycle management. The goal is to keep data safe while still allowing automated rollouts and rescheduling.
Replication topology and consistency
For replicated systems, choose a topology that matches your consistency needs. A single leader with followers is common for relational databases, while quorum-based systems require a majority of members to make progress. Understand how your database elects a leader and how clients discover it. Kubernetes can schedule Pods, but it does not solve consensus for you.
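The majority rule above is simple arithmetic. The following is a minimal sketch, assuming a plain majority quorum; the function names quorum_size and tolerated_failures are hypothetical helpers, not part of any tool.

```shell
#!/bin/sh
# Majority quorum size for N voting members: floor(N/2) + 1.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Members that can fail while a majority remains: floor((N-1)/2).
tolerated_failures() {
  echo $(( ($1 - 1) / 2 ))
}

# Print a small table for common cluster sizes.
for n in 1 2 3 4 5; do
  echo "members=$n quorum=$(quorum_size "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

Note that four members tolerate one failure, the same as three, which is why odd replica counts are the usual choice for quorum-based systems.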
Storage planning and isolation
Each replica should have its own PVC. Shared volumes can cause corruption unless the application is built for them. Plan storage capacity per replica and budget for growth. Use anti-affinity to spread replicas across nodes so a single node failure does not take down every replica.
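One way to get a PVC per replica and node-level spreading is a StatefulSet with volumeClaimTemplates and a required pod anti-affinity rule. The sketch below writes such a manifest to a file. Everything specific in it is an assumption for illustration: the names db and demo, the app: db label, the placeholder image, the mount path, and the 10Gi size.

```shell
#!/bin/sh
# Write an illustrative StatefulSet manifest to the file given as $1.
# All names, the image, and the sizes are hypothetical placeholders.
write_manifest() {
  cat > "$1" <<'EOF'
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
  namespace: demo
spec:
  serviceName: db              # headless Service that gives each Pod a DNS name
  replicas: 3
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      affinity:
        podAntiAffinity:       # never place two db Pods on the same node
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: db
              topologyKey: kubernetes.io/hostname
      containers:
        - name: db
          image: example.invalid/db:1.0   # placeholder image
          volumeMounts:
            - name: data
              mountPath: /var/lib/db
  volumeClaimTemplates:        # one PVC per replica: data-db-0, data-db-1, ...
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
EOF
}

write_manifest "${TMPDIR:-/tmp}/db-statefulset.yaml"
# Validate without creating anything (requires kubectl):
# kubectl apply --dry-run=client -f "${TMPDIR:-/tmp}/db-statefulset.yaml"
```

A required anti-affinity rule leaves Pods unschedulable when there are fewer nodes than replicas; the preferred variant trades that strictness for availability.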
Backup and restore discipline
Persistent volumes are not backups. Use logical dumps or snapshots and test restores regularly. Document the recovery sequence, especially for systems with replication, because the order of restore can determine which node becomes the primary. Disaster recovery is a process, not a file.
Upgrades and failure handling
Stateful upgrades are slower and require more care. Use rolling-update partitions or staged rollouts, and ensure readiness probes reflect real availability. When a node fails, Pods may reschedule, but volumes may take time to attach. Monitor for stuck attachments and design for longer recovery windows.
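A partitioned rollout can be sketched as a loop that lowers the StatefulSet partition one step at a time. With a partition set, only Pods whose ordinal is greater than or equal to the partition are updated. The helpers partition_patch and rollout_stages are hypothetical, as are the db StatefulSet and the demo namespace; the kubectl calls are left commented out because they need cluster access.

```shell
#!/bin/sh
# Build the patch body that sets the rolling-update partition to $1.
# Pods with ordinal >= partition are updated; lower ordinals keep the old revision.
partition_patch() {
  printf '{"spec":{"updateStrategy":{"type":"RollingUpdate","rollingUpdate":{"partition":%d}}}}\n' "$1"
}

# Print the partition value for each stage of a one-Pod-at-a-time rollout
# of $1 replicas, from the highest ordinal down to 0.
rollout_stages() {
  i=$(( $1 - 1 ))
  while [ "$i" -ge 0 ]; do
    echo "$i"
    i=$(( i - 1 ))
  done
}

# Example for a 3-replica StatefulSet (requires cluster access):
# for p in $(rollout_stages 3); do
#   kubectl patch statefulset db -n demo -p "$(partition_patch "$p")"
#   kubectl rollout status statefulset/db -n demo
# done
```

Pausing between stages to check replication health is the point of the exercise; the loop above only waits for Pod readiness.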
Observability and tuning
Track replication lag, storage latency, and disk usage. These are early warning signals. Resource limits that are too tight can cause throttling and timeouts, so set realistic requests and leave headroom. For databases, I/O latency is often a better signal than CPU usage.
Leader routing and client behavior
Clients often need to send writes to a leader and reads to replicas. Use stable per-Pod DNS names for direct access and Services for load-balanced reads. If your system supports read-only replicas, make that separation explicit in client config.
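The naming scheme behind this can be sketched as follows. Pods in a StatefulSet governed by a headless Service get DNS names of the form pod-name.service.namespace.svc.cluster-domain. The sketch assumes the default cluster domain cluster.local, a hypothetical read Service called db-read, and, purely for illustration, that db-0 is the current leader; real systems elect leaders, so clients should discover the leader rather than hard-code it.

```shell
#!/bin/sh
# Per-Pod DNS name behind a headless Service.
# $1=StatefulSet name, $2=ordinal, $3=headless Service, $4=namespace,
# $5=cluster domain (optional, defaults to cluster.local)
pod_fqdn() {
  echo "$1-$2.$3.$4.svc.${5:-cluster.local}"
}

# DNS name of a regular Service.
# $1=Service name, $2=namespace, $3=cluster domain (optional)
service_fqdn() {
  echo "$1.$2.svc.${3:-cluster.local}"
}

# Hypothetical client settings: writes go to one Pod, reads to a Service.
WRITE_HOST=$(pod_fqdn db 0 db demo)
READ_HOST=$(service_fqdn db-read demo)
echo "writes -> $WRITE_HOST"
echo "reads  -> $READ_HOST"
```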
Maintenance and automation
Schedule compaction, vacuum, or defragmentation during low traffic windows. Operators or automation tools can enforce backup schedules and safe rollouts, reducing human error. Treat stateful maintenance as a regular task, not an emergency.
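To inspect the current state of a workload, here in a demo namespace with db-0 as the first replica:
kubectl get pods -n demo
kubectl get pvc -n demo
kubectl describe pod db-0 -n demo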
Operational checklist
Verify anti-affinity rules, PodDisruptionBudgets (PDBs), and backup jobs. Confirm that each replica has its own volume, and that failover procedures are rehearsed. Stateful reliability comes from consistent operational habits as much as from configuration.
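As one concrete item from this checklist, a PDB can be sketched like this. The name db-pdb, the demo namespace, and the app: db label are hypothetical, and the minAvailable value of 2 assumes a three-replica system that needs a majority to stay up.

```shell
#!/bin/sh
# Write an illustrative PodDisruptionBudget manifest.
# $1=output file, $2=minimum number of Pods that must stay available
write_pdb() {
  cat > "$1" <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: db-pdb
  namespace: demo
spec:
  minAvailable: $2
  selector:
    matchLabels:
      app: db
EOF
}

write_pdb "${TMPDIR:-/tmp}/db-pdb.yaml" 2
# Apply and inspect (requires cluster access):
# kubectl apply -f "${TMPDIR:-/tmp}/db-pdb.yaml"
# kubectl get pdb -n demo
```

A PDB only limits voluntary disruptions such as node drains; it does not protect against node crashes.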
Wrap-up: the hard part is recovery time
Stateful apps rarely fail because the YAML is wrong. They fail because recovery takes longer than you planned.
If you only do one thing: run a restore drill in a test namespace. It changes how you design everything else.