2025-10-12

MySQL Replication

Use replication for read scaling and recovery.

MySQL replication sends all writes to a primary and replays its changes on replicas, so reads can be offloaded.

Key ideas

  • The primary accepts writes and records them in the binary log (binlog)
  • Replicas pull the binlog and replay it
  • Stable network identity matters so replicas can always reach the primary

In Kubernetes

  • A StatefulSet keeps stable instance identity and ordering
  • A headless Service provides stable per-Pod DNS (see the manifest sketch after this list)
  • Separate read/write Services if needed
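
A minimal sketch of that wiring, assuming an app: mysql label, a demo namespace, and the mysql:8.0 image (replication users, server IDs, and credentials are omitted):

apiVersion: v1
kind: Service
metadata:
  name: mysql
  namespace: demo
spec:
  clusterIP: None            # headless: gives each Pod a stable DNS record
  selector:
    app: mysql
  ports:
    - port: 3306
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
  namespace: demo
spec:
  serviceName: mysql         # must match the headless Service
  replicas: 3
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
        - name: mysql
          image: mysql:8.0   # assumed image; pin the version you actually run
          ports:
            - containerPort: 3306
          volumeMounts:
            - name: data
              mountPath: /var/lib/mysql
  volumeClaimTemplates:      # one PVC per replica: data-db-0, data-db-1, ...
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi

Each Pod then gets a stable name such as db-0.mysql.demo.svc.cluster.local plus its own PVC, which is exactly the stable identity replication needs.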

Risks

  • Replication lag can break consistency
  • Primary failure needs a failover plan

Practical notes

  • Start with a quick inventory: kubectl get nodes, kubectl get pods -A, and kubectl get events -A.
  • Compare desired vs. observed state; kubectl describe usually explains drift or failed controllers.
  • Keep names, labels, and selectors consistent so Services and controllers can find Pods.

Quick checklist

  • The resource matches the intent you described in YAML.
  • Namespaces, RBAC, and images are correct for the target environment.
  • Health checks and logs are in place before promotion (a probe sketch follows this list).
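
For the health-check item, a hedged sketch of probes on the MySQL container, assuming mysqladmin is available in the image and can reach the local server (a replica-aware readiness check would also look at lag):

containers:
  - name: mysql
    image: mysql:8.0
    readinessProbe:                # remove the Pod from Service endpoints when it cannot answer
      exec:
        command: ["mysqladmin", "ping", "-h", "127.0.0.1"]
      initialDelaySeconds: 15
      periodSeconds: 5
    livenessProbe:                 # restart only when the process is genuinely stuck
      exec:
        command: ["mysqladmin", "ping", "-h", "127.0.0.1"]
      initialDelaySeconds: 30
      periodSeconds: 10
      failureThreshold: 6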

Data, identity, and steady state

Stateful workloads need stable identity and stable storage. For MySQL replication, Kubernetes provides that stability through persistent volumes, stable DNS names, and ordered lifecycle management. The goal is to keep data safe while still allowing automated rollouts and rescheduling.

Replication topology and consistency

For replicated systems, choose a topology that matches your consistency needs. A single leader with followers is common for relational databases, while quorum-based systems need a majority to make progress. Understand how your database elects a leader and how clients discover it. Kubernetes can schedule Pods, but it does not solve consensus for you.

Storage planning and isolation

Each replica should have its own PVC. Shared volumes can cause corruption unless the application is designed for them. Plan storage capacity per replica and budget for growth. Use anti-affinity to spread replicas across nodes so a single node failure does not take down the whole cluster.
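
A sketch of that anti-affinity rule inside the StatefulSet Pod template, reusing the assumed app: mysql label from the earlier example:

spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:   # hard rule: at most one replica per node
            - labelSelector:
                matchLabels:
                  app: mysql
              topologyKey: kubernetes.io/hostname

If the cluster sometimes has fewer nodes than replicas, the preferredDuringSchedulingIgnoredDuringExecution variant is the softer option.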

Backup and restore discipline

Persistent volumes are not backups. Use logical dumps or snapshots and test restores regularly. Document the recovery sequence, especially for systems with replication, because the order of restore can determine which node becomes the primary. Disaster recovery is a process, not a file.
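
One way to make backups a scheduled, testable artifact is a CronJob that runs mysqldump; the image, Secret name, target host, and backup PVC below are all assumptions for illustration:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: mysql-backup
  namespace: demo
spec:
  schedule: "0 2 * * *"              # daily, during a low-traffic window
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: dump
              image: mysql:8.0       # assumed image that ships mysqldump
              command:
                - /bin/sh
                - -c
                - >
                  mysqldump -h db-0.mysql -uroot -p"$MYSQL_ROOT_PASSWORD"
                  --single-transaction --all-databases
                  > /backup/dump-$(date +%F).sql
              env:
                - name: MYSQL_ROOT_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: mysql-secret       # assumed Secret name
                      key: root-password
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              persistentVolumeClaim:
                claimName: mysql-backup-pvc    # assumed pre-created PVC

The schedule is only half the discipline; restoring one of these dumps onto a scratch instance is what proves the backup works.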

Upgrades and failure handling

Stateful upgrades are slower and require more care. Use partitions or staged rollouts, and ensure readiness probes reflect real availability. When a node fails, Pods may reschedule, but volumes can take time to detach and reattach. Monitor for stuck attachments and design for longer recovery windows.
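
Staged rollouts for a StatefulSet can use the RollingUpdate partition; a sketch, with placeholder values:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
  namespace: demo
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2     # only ordinals >= 2 update; lower the value step by step to finish the rollout

With three replicas, partition: 2 canaries the change on db-2 first; setting it back to 0 lets db-1 and db-0 follow once the canary looks healthy.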

Observability and tuning

Track replication lag, storage latency, and disk usage. These are early warning signals. Resource limits that are too tight can cause throttling and timeouts, so set realistic requests and leave headroom. For databases, IO latency is often a better signal than CPU usage.
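
A hedged example of requests and limits for the database container (the numbers are placeholders to size against your own metrics):

containers:
  - name: mysql
    resources:
      requests:            # what the scheduler reserves; base this on observed usage plus headroom
        cpu: "1"
        memory: 2Gi
      limits:
        memory: 4Gi        # memory limit protects the node; leaving the CPU limit off is one way to avoid throttling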

Leader routing and client behavior

Clients often need to send writes to the leader and reads to replicas. Use stable per-Pod DNS names for direct access and Services for load-balanced reads. If your system supports read-only replicas, make that separation explicit in client configuration.
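
A sketch of that split with two Services, assuming the db StatefulSet and app: mysql label from earlier; pinning the write Service to db-0 via the statefulset.kubernetes.io/pod-name label only holds while db-0 is the primary, so failover tooling (or an operator-managed role label) must update it:

apiVersion: v1
kind: Service
metadata:
  name: mysql-primary      # clients send writes here
  namespace: demo
spec:
  selector:
    statefulset.kubernetes.io/pod-name: db-0
  ports:
    - port: 3306
---
apiVersion: v1
kind: Service
metadata:
  name: mysql-read         # clients send read-only queries here
  namespace: demo
spec:
  selector:
    app: mysql             # balances across all replicas, including the primary
  ports:
    - port: 3306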

Maintenance and automation

Schedule compaction, vacuum, or defragmentation during low traffic windows. Operators or automation tools can enforce backup schedules and safe rollouts, reducing human error. Treat stateful maintenance as a regular task, not an emergency.

# confirm each replica Pod is running and Ready
kubectl get pods -n demo
# check that every replica has its own bound PersistentVolumeClaim
kubectl get pvc -n demo
# inspect events, volume attachment, and probe failures for the first replica
kubectl describe pod db-0 -n demo

Operational checklist

Verify anti-affinity, PDBs, and backup jobs. Confirm that each replica has its own volume and that failover procedures are rehearsed. Stateful reliability comes from consistent operational habits as much as from configuration.
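
A minimal PodDisruptionBudget for the assumed three-replica set, so voluntary disruptions such as node drains never remove more than one replica at a time:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: mysql-pdb
  namespace: demo
spec:
  minAvailable: 2          # with 3 replicas, at most one can be evicted at once
  selector:
    matchLabels:
      app: mysql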

Field checklist

When you move from a quick lab to real traffic, confirm the basics every time. Check resource requests, readiness behavior, log coverage, alerting, and clear rollback steps. A checklist prevents skipping the boring steps that keep services stable. Keep it short, repeatable, and stored with the repo so it evolves with the service.

Troubleshooting flow

Start from symptoms, not guesses. Review recent events for scheduling, image, or probe failures, then scan logs for application errors. If traffic is failing, confirm readiness, verify endpoints, and trace the request path hop by hop. When data looks wrong, validate the active version and configuration against the release plan. Always record what you changed so a rollback is fast and a postmortem is accurate.

Small exercises to build confidence

Practice common operations in a safe environment. Scale the workload up and down and observe how quickly it stabilizes. Restart a single Pod and watch how the service routes around it. Change one configuration value and verify that the change is visible in logs or metrics. These small drills teach how the system behaves under real operations without waiting for an outage.

Production guardrails

Introduce limits gradually. Resource quotas, PodDisruptionBudgets, and network policies should be tested in staging before production. Keep backups and restore procedures documented, even for stateless services, because dependencies often are not stateless. Align monitoring with user outcomes so you catch regressions before they become incidents.
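
As one example of a guardrail to trial in staging first, a NetworkPolicy that only lets Pods labeled role: app (an assumed client label) reach MySQL:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: mysql-ingress
  namespace: demo
spec:
  podSelector:
    matchLabels:
      app: mysql
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: app    # assumed label on application client Pods
      ports:
        - protocol: TCP
          port: 3306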

Documentation and ownership

Write down who owns the service, what success looks like, and which dashboards to use. Include the on-call rotation, escalation path, and basic runbooks for common failures. A small amount of documentation removes a lot of guesswork during incidents and helps new team members ramp up quickly.

Quick validation

After any change, validate the system the same way a user would. Hit the main endpoint, check latency, and watch for error spikes. Confirm that new pods are ready, old ones are gone, and metrics are stable. If the change touched storage, verify disk usage and cleanup behavior. If it touched networking, confirm DNS names and endpoint lists are correct.

Release notes

Write a short note with what changed, why it changed, and how to roll back. This is not bureaucracy; it prevents confusion during incidents. Even a few bullets help future you remember intent and context.

Capacity check

Compare current usage to requests and limits. If the service is close to limits, plan a small scaling adjustment before traffic grows. Capacity planning is easier when it is incremental rather than reactive.

Final reminder

Keep changes small and observable. If a release is risky, reduce scope and validate in staging first. Prefer frequent small updates over rare large ones. When in doubt, pick the option that simplifies rollback and reduces time to detect issues. The goal is not perfect config, but predictable operations.
