CFN Cloud
2025-10-12

MySQL Replication on Kubernetes: Topology, Storage, and Failure Modes

Understand how to run MySQL replication on Kubernetes, including primary-replica design, storage concerns, failover risks, and operational checks.

MySQL replication is where Kubernetes stateful concepts stop being theoretical. Once you have a primary and one or more replicas, identity, storage, topology, and failover behavior all start to matter at the same time.

The basic replication idea

  • the primary accepts writes
  • replicas pull and replay binlog changes
  • clients may read from replicas, but write to the primary

This sounds straightforward, but the operational difficulty is not the initial setup. It is what happens when lag grows, storage gets slow, or the primary disappears.

Why Kubernetes changes the picture

Replication on Kubernetes usually depends on:

  • StatefulSet for stable ordinal identity
  • Headless Service for Pod-level DNS
  • PVC-backed storage per replica
  • careful readiness and failover logic
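Put together, those pieces can be sketched as a minimal manifest. This is a sketch only: the names (`mysql`, `mysql-headless`), the image tag, and the storage size are placeholders, not a production-ready configuration:

```yaml
# Headless Service: gives each Pod a stable DNS name
# (mysql-0.mysql-headless.<ns>.svc.cluster.local)
apiVersion: v1
kind: Service
metadata:
  name: mysql-headless
spec:
  clusterIP: None
  selector:
    app: mysql
  ports:
    - port: 3306
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  serviceName: mysql-headless   # ties Pod DNS identity to the headless Service
  replicas: 3
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
        - name: mysql
          image: mysql:8.0      # placeholder image tag
          ports:
            - containerPort: 3306
  volumeClaimTemplates:         # one PVC per Pod: each replica owns its own volume
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi
```

The `volumeClaimTemplates` section is what guarantees per-replica storage; a plain Deployment with a shared PVC cannot give you that.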

That is why this page should be read alongside:

  • kubernetes-quickstart-statefulset.md
  • kubernetes-quickstart-headless-service.md
  • kubernetes-quickstart-pv-pvc.md
  • kubernetes-quickstart-storageclass.md

What to verify first in a replication setup

Before you even look at MySQL internals, confirm these basics:

  1. each replica has its own PVC
  2. Pod DNS identity is stable
  3. readiness reflects actual service availability
  4. clients know which endpoint is for writes and which is for reads
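For point 3, a common starting point is an exec-based readiness probe. This fragment assumes the root password is injected via a `MYSQL_ROOT_PASSWORD` environment variable; note that `mysqladmin ping` only proves the server answers, so a production setup would layer a replication-health check on top:

```yaml
# Container-level fragment (goes under the mysql container spec).
# Assumes MYSQL_ROOT_PASSWORD is provided from a Secret.
readinessProbe:
  exec:
    command:
      - sh
      - -c
      - mysqladmin ping -uroot -p"$MYSQL_ROOT_PASSWORD"
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
```

A replica that answers `ping` but is hours behind will still receive read traffic with this probe alone, which is exactly the "readiness reflects actual service availability" gap to close.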

If those are wrong, MySQL replication problems will look worse and be harder to explain.

Why replication lag is the signal to respect

A cluster can look superficially healthy while replication is quietly degrading. Pods may all be running, but if lag grows, read-after-write behavior gets worse and failover risk increases.

That is why replication lag is one of the most important stateful signals to watch.

Common sources of lag and instability

  • slow disk
  • insufficient CPU
  • network jitter between nodes
  • large transactions
  • bad readiness or promotion logic

The first instinct should usually be to inspect IO and resource pressure before tuning obscure database parameters.
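Much of that pressure is addressed in the manifest rather than inside MySQL. The fragments below are illustrative: the resource values and the `fast-ssd` StorageClass name are assumptions to adapt to your environment:

```yaml
# Container resources: reserve capacity so replication threads are not
# starved during load spikes (values are illustrative, not recommendations).
resources:
  requests:
    cpu: "1"
    memory: 2Gi
  limits:
    memory: 2Gi
---
# Claim template fragment: pin replicas to a faster StorageClass
# (the name "fast-ssd" is an assumption).
volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      storageClassName: fast-ssd
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 20Gi
```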

Routing matters as much as replication

In many setups, the problem is not that replication itself broke. It is that clients are talking to the wrong place.

Typical pattern:

  • a write endpoint for the primary
  • a read endpoint for replicas
  • direct Pod DNS for internal replication or admin flows

If Service selectors or DNS assumptions are wrong, the symptoms can look like inconsistent data even when replication is technically working.
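One way to express that routing is a pair of Services selecting on a role label. This sketch assumes something (an operator or your promotion tooling) keeps the `role` label accurate, which is exactly the part that breaks during failover:

```yaml
# Write endpoint: selects only the Pod labeled as primary.
apiVersion: v1
kind: Service
metadata:
  name: mysql-write
spec:
  selector:
    app: mysql
    role: primary    # must move to the new primary on failover
  ports:
    - port: 3306
---
# Read endpoint: fans out across replicas.
apiVersion: v1
kind: Service
metadata:
  name: mysql-read
spec:
  selector:
    app: mysql
    role: replica
  ports:
    - port: 3306
```

A quick sanity check is comparing `kubectl get endpoints mysql-write` against the Pod you believe is the primary.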

A practical operator checklist

When bringing up or reviewing a replication cluster, I would check:

  • are primary and replicas clearly separated?
  • does each replica own its own volume?
  • is read traffic really going where expected?
  • do we know how primary promotion would work?
  • do we know how to rejoin an old primary after failover?

These are the questions that usually decide whether the system feels calm or fragile in production.

Quick commands that help early

kubectl get pods -n <ns> -o wide
kubectl get pvc -n <ns>
kubectl get svc -n <ns>
kubectl get endpoints -n <ns> <svc> -o wide
kubectl describe pod -n <ns> <pod>

From the MySQL side, replication status checks matter too:

kubectl exec -it -n db mysql-primary-0 -- mysql -u root -p -e "SHOW MASTER STATUS\G"
kubectl exec -it -n db mysql-secondary-0 -- mysql -u root -p -e "SHOW REPLICA STATUS\G"

The `-it` flag is needed because `-p` prompts for the password interactively. On MySQL 8.0.22 and later, SHOW REPLICA STATUS replaces the deprecated SHOW SLAVE STATUS; in its output, Seconds_Behind_Source (formerly Seconds_Behind_Master) is the first lag number to watch.

Failover is the real test

The most dangerous sentence in a replication setup is: “we assume failover will be fine.”

Replication is relatively easy to start. Promotion, rejoin, and client behavior during failure are what matter operationally.

At minimum, test in staging:

  • time to promote a replica
  • time to rejoin nodes afterward
  • client behavior during the gap
  • what data consistency guarantees you actually get

FAQ

Q: Is replication enough to call MySQL highly available? A: Not by itself. Replication is a building block, but actual HA depends on failover behavior, client routing, and recovery discipline.

Q: What should I look at first if reads seem stale? A: Check replication lag, then storage and CPU pressure, then verify reads are really hitting replicas and not the wrong service path.

Q: Why does Kubernetes not solve the hard part automatically? A: Because Kubernetes manages workload lifecycle and placement, not MySQL consensus, leader promotion, or application-level correctness.

Next reading

  • Continue with kubernetes-quickstart-helm-mysql-cluster.md for a packaged replication deployment path.
  • Revisit kubernetes-quickstart-stateful-apps.md for the broader operational mindset.
  • If you still need the simpler baseline, compare with kubernetes-quickstart-mysql-database.md.

Wrap-up

Replication is where many teams first realize that “the Pods are running” is only the beginning. The real questions are about lag, promotion, client routing, and recovery, and those are exactly the questions worth rehearsing before production.
