CFN Cloud
2025-10-12

MySQL Replication

Use replication for read scaling and recovery.

In MySQL replication, all writes go to the primary, which records them in its binary log; replicas fetch and replay that log so reads can be offloaded.

Key ideas

  • The primary accepts writes and records them in the binlog
  • Replicas pull the binlog and replay it
  • Stable network identity matters
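
A quick way to see these pieces in action. This is a sketch using MySQL 8.0 syntax; the host names and the "repl" user are assumptions, not fixed conventions:

```shell
# Primary: current binlog file and position (what replicas replay from).
mysql -h db-0.db -u repl -p -e 'SHOW MASTER STATUS\G'

# Replica: IO/SQL thread state and lag (Seconds_Behind_Source).
mysql -h db-1.db -u repl -p -e 'SHOW REPLICA STATUS\G'
```

Both commands are read-only and safe to run against a live cluster.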

In Kubernetes

  • StatefulSet keeps instance order
  • Headless Service provides stable DNS
  • Separate read/write Services if needed
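
A minimal sketch of those three pieces together. Names (db, demo), the image, and the sizes are assumptions, and credentials, replication config, probes, and resources are omitted for brevity:

```shell
# Hypothetical minimal manifest: headless Service + StatefulSet "db" in ns "demo".
kubectl apply -n demo -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: db
spec:
  clusterIP: None          # headless: gives each pod a stable DNS name
  selector:
    app: db
  ports:
  - port: 3306
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db          # ties pod DNS to the headless Service
  replicas: 3
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
      - name: mysql
        image: mysql:8.0
        ports:
        - containerPort: 3306
  volumeClaimTemplates:    # one PVC per replica, created in ordinal order
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
EOF
```

The volumeClaimTemplates block is what gives each replica its own storage; the headless Service is what gives each replica its own name.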

Risks

  • Replication lag can break consistency
  • Primary failure needs a failover plan

Practical notes

  • Start with a quick inventory: kubectl get nodes, kubectl get pods -A, and kubectl get events -A.
  • Compare desired vs. observed state; kubectl describe usually explains drift or failed controllers.
  • Keep names, labels, and selectors consistent so Services and controllers can find Pods.

Quick checklist

  • The resource matches the intent you described in YAML.
  • Namespaces, RBAC, and images are correct for the target environment.
  • Health checks and logs are in place before promotion.

Data, identity, and steady state

Stateful workloads need stable identity and stable storage. For MySQL replication, Kubernetes provides that stability through persistent volumes, stable DNS names, and ordered lifecycle management. The goal is to keep data safe while still allowing automated rollouts and rescheduling.

Replication topology and consistency

For replicated systems, choose a topology that matches your consistency needs. A single leader with followers is common for relational databases, while quorum-based systems require a majority to make progress. Understand how your database elects a leader and how clients discover it. Kubernetes can schedule Pods, but it does not solve consensus for you.
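
What Kubernetes does give you for discovery is a stable per-pod DNS name behind a headless Service, of the form pod.service.namespace.svc.cluster.local. A quick check, with all names assumed:

```shell
# Resolve the first StatefulSet member from inside the cluster
# (assumes a headless Service "db" in namespace "demo").
kubectl run -n demo dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup db-0.db.demo.svc.cluster.local
```

Clients or failover tooling can rely on these names staying fixed across pod restarts, which is what makes them usable for leader addressing.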

Storage planning and isolation

Each replica should have its own PVC. Shared volumes can cause corruption unless the application is built for it. Plan storage capacity per replica and budget for growth. Use anti-affinity to spread replicas across nodes so a single node failure does not take down the entire cluster.
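
One way to express that spread, shown as a patch against a StatefulSet whose pods carry an assumed app=db label (kubectl patch accepts YAML as well as JSON):

```shell
# Hypothetical: require pods with label app=db to land on different nodes.
kubectl patch statefulset db -n demo --type merge -p '
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: db
            topologyKey: kubernetes.io/hostname
'
```

With required anti-affinity, scheduling fails if there are fewer nodes than replicas; use the preferred variant if you want best-effort spreading instead.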

Backup and restore discipline

Persistent volumes are not backups. Use logical dumps or snapshots and test restores regularly. Document the recovery sequence, especially for systems with replication, because the order of restore can determine which node becomes the primary. Disaster recovery is a process, not a file.
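
A minimal sketch of a logical dump, assuming a schema named "appdb", a "backup" user, and the DNS names from a headless Service:

```shell
# --single-transaction takes a consistent InnoDB snapshot without locking writers.
mysqldump --single-transaction \
  --host=db-0.db.demo.svc.cluster.local \
  --user=backup -p appdb > "appdb-$(date +%F).sql"

# Restore order matters: restore the intended primary first, then re-point
# replicas at it before letting clients back in.
```

Whatever tool you use, the dump is only a backup once you have restored it somewhere and verified the result.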

Upgrades and failure handling

Stateful upgrades are slower and require more care. Use partitions or staged rollouts, and ensure readiness probes reflect real availability. When a node fails, pods may reschedule, but volumes may take time to attach. Monitor for stuck attachments and design for longer recovery windows.
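
StatefulSets support this directly via the rolling-update partition: pods with an ordinal at or above the partition update first. A sketch, with the name "db" assumed:

```shell
# Update only db-2 first; db-0 and db-1 keep the old spec.
kubectl patch statefulset db -n demo -p \
  '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":2}}}}'
kubectl rollout status statefulset/db -n demo

# After verifying db-2, lower the partition to 1, then 0, to finish the rollout.
```

This turns one risky rollout into a series of small, observable steps.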

Observability and tuning

Track replication lag, storage latency, and disk usage. These are early warning signals. Resource limits that are too tight can cause throttling and timeouts, so set realistic requests and leave headroom. For databases, IO latency is often a better signal than CPU usage.
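<test>
[ "$lag" = "42" ]
[ "$threshold" = "30" ]
</test>

As a toy example of turning status output into an alert: the sample line below is fabricated; in practice the value comes from SHOW REPLICA STATUS on each replica.

```shell
# Hypothetical sample line standing in for real SHOW REPLICA STATUS output.
status="Seconds_Behind_Source: 42"

# Extract the numeric lag and compare it to an alert threshold (seconds).
lag=$(printf '%s\n' "$status" | awk -F': ' '{print $2}')
threshold=30
if [ "$lag" -gt "$threshold" ]; then
  echo "replication lag warning: ${lag}s"
fi
```

Note that Seconds_Behind_Source can read zero while the IO thread is broken, so alert on thread state as well as on the lag number.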

Leader routing and client behavior

Clients often need to send writes to the leader and reads to replicas. Use stable DNS names for direct access and Services for balanced reads. If your system supports read-only replicas, make that separation explicit in the client config.
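
A common sketch of that split: a write Service pinned to the current primary by an extra label, and a read Service selecting every member. All names and labels here are assumptions:

```shell
kubectl apply -n demo -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: db-write
spec:
  selector:
    app: db
    role: primary        # failover tooling must keep this label on the leader
  ports:
  - port: 3306
---
apiVersion: v1
kind: Service
metadata:
  name: db-read
spec:
  selector:
    app: db              # all members; acceptable for read-only traffic
  ports:
  - port: 3306
EOF
```

The weak point is the role label: if nothing updates it during failover, db-write silently routes writes to a demoted node.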

Maintenance and automation

Schedule compaction, vacuum, or defragmentation during low traffic windows. Operators or automation tools can enforce backup schedules and safe rollouts, reducing human error. Treat stateful maintenance as a regular task, not an emergency.

A quick health pass for a demo namespace:

kubectl get pods -n demo
kubectl get pvc -n demo
kubectl describe pod db-0 -n demo

Operational checklist

Verify anti-affinity, PodDisruptionBudgets (PDBs), and backup jobs. Confirm that each replica has its own volume and that failover procedures are rehearsed. Stateful reliability comes from consistent operational habits as much as from configuration.

Wrap-up: replication is easy, failover is not

The scary part isn’t setting up replicas. It’s knowing what happens when the primary dies.

Do at least one controlled failover test in staging and measure:

  • time to promote
  • time to rejoin
  • how clients behave during the gap
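
A rough shape for the drill, with names assumed and promotion left to whatever failover tooling you run:

```shell
# Simulate primary loss in staging and time the recovery window.
start=$(date +%s)
kubectl delete pod db-0 -n demo
kubectl wait -n demo --for=condition=Ready pod/db-0 --timeout=600s
echo "pod back after $(( $(date +%s) - start ))s"
# Separately record, from the client side, when writes started succeeding again.
```

The pod-ready time and the writes-succeeding time are usually different numbers; the gap between them is what your clients actually experience.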
