Replication Topologies

Topology Use Case
Single primary + replicas Read scaling, backups, DR
Chain replication A → B → C for geographic distribution
Multi-source Replica from multiple primaries (merge analytics)
Group Replication Automatic failover, consensus-based HA
InnoDB Cluster MySQL Shell managed Group Replication + Router

Asynchronous Replication Overview

The primary writes to the binary log (binlog). Replicas connect via IO thread, copy events to relay log, and SQL thread applies them.

  Primary (binlog) → Replica IO Thread → Relay Log → Replica SQL Thread → Tables
  

Primary Configuration

  [mysqld]
server-id = 1
log_bin = mysql-bin
binlog_format = ROW
gtid_mode = ON
enforce_gtid_consistency = ON
binlog_expire_logs_seconds = 604800
sync_binlog = 1
  
  CREATE USER 'repl'@'10.0.2.%' IDENTIFIED BY 'ReplSecurePass!';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'10.0.2.%';
FLUSH PRIVILEGES;

SHOW MASTER STATUS;
  

ROW format is recommended — logs row changes, not SQL statements, avoiding non-deterministic replay issues.

Replica Configuration

  [mysqld]
server-id = 2
relay_log = relay-bin
read_only = ON
super_read_only = ON
gtid_mode = ON
enforce_gtid_consistency = ON
replica_parallel_workers = 4
replica_parallel_type = LOGICAL_CLOCK
  
  CHANGE REPLICATION SOURCE TO
    SOURCE_HOST = 'primary.internal',
    SOURCE_USER = 'repl',
    SOURCE_PASSWORD = 'ReplSecurePass!',
    SOURCE_AUTO_POSITION = 1,
    GET_SOURCE_PUBLIC_KEY = 1;

START REPLICA;
SHOW REPLICA STATUS\G
  

Verify:

  • Replica_IO_Running: Yes
  • Replica_SQL_Running: Yes
  • Seconds_Behind_Source near 0 under normal load

GTID Benefits

Global Transaction Identifiers uniquely tag every transaction:

  SELECT @@GLOBAL.gtid_executed;
  

Failover without guessing binlog file/position — SOURCE_AUTO_POSITION = 1 finds the correct starting point automatically.

Read/Write Splitting

Application architecture:

  Writes → Primary
Reads  → Replica pool (with lag tolerance)
Critical reads after write → Primary (or lag-aware routing)
  
  # Pseudocode
def get_user(user_id, fresh=False):
    conn = primary if fresh else random.choice(replicas)
    return conn.execute("SELECT * FROM users WHERE id = %s", user_id)
  

After registration, route the immediate profile read to primary to avoid “user not found” from lag.

Semi-Synchronous Replication

Primary waits for at least one replica to acknowledge receipt before commit completes:

  plugin-load-add = semisync_source.so
rpl_semi_sync_source_enabled = 1
rpl_semi_sync_source_timeout = 1000
rpl_semi_sync_source_wait_point = AFTER_SYNC
  

On replica:

  plugin-load-add = semisync_replica.so
rpl_semi_sync_replica_enabled = 1
  

Trade-off: higher write latency for reduced data loss risk if primary fails before async replicas catch up.

Failover Strategies

Manual Failover

  1. Stop writes to primary
  2. Verify most up-to-date replica (SHOW REPLICA STATUS, compare Executed_Gtid_Set)
  3. STOP REPLICA; RESET REPLICA ALL; on chosen replica
  4. SET GLOBAL read_only = OFF;
  5. Repoint application and other replicas

Automated Failover

Tools: Orchestrator, MHA, InnoDB Cluster, cloud RDS Multi-AZ

Orchestrator detects primary failure, promotes replica, repoints remaining topology.

Split-Brain Prevention

Never allow two writable primaries accepting the same data without conflict resolution. Use:

  • VIP/floating IP managed by failover tool
  • read_only = ON on all non-primary nodes
  • Group Replication consensus (single-primary mode)

Replication Monitoring

  SHOW REPLICA STATUS\G

SELECT * FROM performance_schema.replication_connection_status;
SELECT * FROM performance_schema.replication_applier_status_by_worker;

-- Alert when lag exceeds threshold
SELECT TIMESTAMPDIFF(SECOND, applier_last_applied_transaction_original_commit_timestamp, NOW())
FROM performance_schema.replication_applier_status_by_worker LIMIT 1;
  

Common Replication Errors

Duplicate entry on replica:

  STOP REPLICA;
SET GLOBAL sql_replica_skip_counter = 1;  -- last resort, breaks consistency with GTID
START REPLICA;
-- Better: skip specific GTID or rebuild replica from backup
  

Schema drift: replica missing column — apply same migration to all nodes simultaneously.

Best Practices

  1. Use GTID and ROW binlog format
  2. Keep replica schemas identical to primary
  3. Monitor lag with alerts (warning: 10s, critical: 60s)
  4. Test failover quarterly — untested failover always fails in real incidents
  5. Use read_only on replicas to prevent accidental writes

Production Scenario

A SaaS platform runs primary in us-east-1, replicas in us-east-1 (read scaling) and eu-west-1 (DR). Semi-sync ensures at least one local replica acknowledges writes. Orchestrator handles automatic failover with a 30-second detection window. Application uses ProxySQL for read/write routing with max lag threshold of 5 seconds.

Replication is the foundation of MySQL scalability and availability — design topology, lag handling, and failover before you need them at 3 AM.