Replication and High Availability
Replication Topologies
| Topology | Use Case |
|---|---|
| Single primary + replicas | Read scaling, backups, DR |
| Chain replication | A → B → C for geographic distribution |
| Multi-source | Replica from multiple primaries (merge analytics) |
| Group Replication | Automatic failover, consensus-based HA |
| InnoDB Cluster | MySQL Shell managed Group Replication + Router |
Asynchronous Replication Overview
The primary writes to the binary log (binlog). Replicas connect via IO thread, copy events to relay log, and SQL thread applies them.
Primary (binlog) → Replica IO Thread → Relay Log → Replica SQL Thread → Tables
Primary Configuration
[mysqld]
server-id = 1
log_bin = mysql-bin
binlog_format = ROW
gtid_mode = ON
enforce_gtid_consistency = ON
binlog_expire_logs_seconds = 604800
sync_binlog = 1
CREATE USER 'repl'@'10.0.2.%' IDENTIFIED BY 'ReplSecurePass!';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'10.0.2.%';
FLUSH PRIVILEGES;
SHOW MASTER STATUS;
ROW format is recommended — logs row changes, not SQL statements, avoiding non-deterministic replay issues.
Replica Configuration
[mysqld]
server-id = 2
relay_log = relay-bin
read_only = ON
super_read_only = ON
gtid_mode = ON
enforce_gtid_consistency = ON
replica_parallel_workers = 4
replica_parallel_type = LOGICAL_CLOCK
CHANGE REPLICATION SOURCE TO
SOURCE_HOST = 'primary.internal',
SOURCE_USER = 'repl',
SOURCE_PASSWORD = 'ReplSecurePass!',
SOURCE_AUTO_POSITION = 1,
GET_SOURCE_PUBLIC_KEY = 1;
START REPLICA;
SHOW REPLICA STATUS\G
Verify:
Replica_IO_Running: YesReplica_SQL_Running: YesSeconds_Behind_Sourcenear 0 under normal load
GTID Benefits
Global Transaction Identifiers uniquely tag every transaction:
SELECT @@GLOBAL.gtid_executed;
Failover without guessing binlog file/position — SOURCE_AUTO_POSITION = 1 finds the correct starting point automatically.
Read/Write Splitting
Application architecture:
Writes → Primary
Reads → Replica pool (with lag tolerance)
Critical reads after write → Primary (or lag-aware routing)
# Pseudocode
def get_user(user_id, fresh=False):
conn = primary if fresh else random.choice(replicas)
return conn.execute("SELECT * FROM users WHERE id = %s", user_id)
After registration, route the immediate profile read to primary to avoid “user not found” from lag.
Semi-Synchronous Replication
Primary waits for at least one replica to acknowledge receipt before commit completes:
plugin-load-add = semisync_source.so
rpl_semi_sync_source_enabled = 1
rpl_semi_sync_source_timeout = 1000
rpl_semi_sync_source_wait_point = AFTER_SYNC
On replica:
plugin-load-add = semisync_replica.so
rpl_semi_sync_replica_enabled = 1
Trade-off: higher write latency for reduced data loss risk if primary fails before async replicas catch up.
Failover Strategies
Manual Failover
- Stop writes to primary
- Verify most up-to-date replica (
SHOW REPLICA STATUS, compareExecuted_Gtid_Set) STOP REPLICA; RESET REPLICA ALL;on chosen replicaSET GLOBAL read_only = OFF;- Repoint application and other replicas
Automated Failover
Tools: Orchestrator, MHA, InnoDB Cluster, cloud RDS Multi-AZ
Orchestrator detects primary failure, promotes replica, repoints remaining topology.
Split-Brain Prevention
Never allow two writable primaries accepting the same data without conflict resolution. Use:
- VIP/floating IP managed by failover tool
read_only = ONon all non-primary nodes- Group Replication consensus (single-primary mode)
Replication Monitoring
SHOW REPLICA STATUS\G
SELECT * FROM performance_schema.replication_connection_status;
SELECT * FROM performance_schema.replication_applier_status_by_worker;
-- Alert when lag exceeds threshold
SELECT TIMESTAMPDIFF(SECOND, applier_last_applied_transaction_original_commit_timestamp, NOW())
FROM performance_schema.replication_applier_status_by_worker LIMIT 1;
Common Replication Errors
Duplicate entry on replica:
STOP REPLICA;
SET GLOBAL sql_replica_skip_counter = 1; -- last resort, breaks consistency with GTID
START REPLICA;
-- Better: skip specific GTID or rebuild replica from backup
Schema drift: replica missing column — apply same migration to all nodes simultaneously.
Best Practices
- Use GTID and ROW binlog format
- Keep replica schemas identical to primary
- Monitor lag with alerts (warning: 10s, critical: 60s)
- Test failover quarterly — untested failover always fails in real incidents
- Use
read_onlyon replicas to prevent accidental writes
Production Scenario
A SaaS platform runs primary in us-east-1, replicas in us-east-1 (read scaling) and eu-west-1 (DR). Semi-sync ensures at least one local replica acknowledges writes. Orchestrator handles automatic failover with a 30-second detection window. Application uses ProxySQL for read/write routing with max lag threshold of 5 seconds.
Replication is the foundation of MySQL scalability and availability — design topology, lag handling, and failover before you need them at 3 AM.