to navigate

to select

to close

On this page

Replication and High Availability

Replication Topologies

Topology	Use Case
Single primary + replicas	Read scaling, backups, DR
Chain replication	A → B → C for geographic distribution
Multi-source	Replica from multiple primaries (merge analytics)
Group Replication	Automatic failover, consensus-based HA
InnoDB Cluster	MySQL Shell managed Group Replication + Router

Asynchronous Replication Overview

The primary writes to the binary log (binlog). Replicas connect via IO thread, copy events to relay log, and SQL thread applies them.

  Primary (binlog) → Replica IO Thread → Relay Log → Replica SQL Thread → Tables

Primary Configuration

  [mysqld]
server-id = 1
log_bin = mysql-bin
binlog_format = ROW
gtid_mode = ON
enforce_gtid_consistency = ON
binlog_expire_logs_seconds = 604800
sync_binlog = 1

  CREATE USER 'repl'@'10.0.2.%' IDENTIFIED BY 'ReplSecurePass!';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'10.0.2.%';
FLUSH PRIVILEGES;

SHOW MASTER STATUS;

ROW format is recommended — logs row changes, not SQL statements, avoiding non-deterministic replay issues.

Replica Configuration

  [mysqld]
server-id = 2
relay_log = relay-bin
read_only = ON
super_read_only = ON
gtid_mode = ON
enforce_gtid_consistency = ON
replica_parallel_workers = 4
replica_parallel_type = LOGICAL_CLOCK

  CHANGE REPLICATION SOURCE TO
    SOURCE_HOST = 'primary.internal',
    SOURCE_USER = 'repl',
    SOURCE_PASSWORD = 'ReplSecurePass!',
    SOURCE_AUTO_POSITION = 1,
    GET_SOURCE_PUBLIC_KEY = 1;

START REPLICA;
SHOW REPLICA STATUS\G

Verify:

Replica_IO_Running: Yes
Replica_SQL_Running: Yes
Seconds_Behind_Source near 0 under normal load

GTID Benefits

Global Transaction Identifiers uniquely tag every transaction:

  SELECT @@GLOBAL.gtid_executed;

Failover without guessing binlog file/position — SOURCE_AUTO_POSITION = 1 finds the correct starting point automatically.

Read/Write Splitting

Application architecture:

  Writes → Primary
Reads  → Replica pool (with lag tolerance)
Critical reads after write → Primary (or lag-aware routing)

  # Pseudocode
def get_user(user_id, fresh=False):
    conn = primary if fresh else random.choice(replicas)
    return conn.execute("SELECT * FROM users WHERE id = %s", user_id)

After registration, route the immediate profile read to primary to avoid “user not found” from lag.

Semi-Synchronous Replication

Primary waits for at least one replica to acknowledge receipt before commit completes:

  plugin-load-add = semisync_source.so
rpl_semi_sync_source_enabled = 1
rpl_semi_sync_source_timeout = 1000
rpl_semi_sync_source_wait_point = AFTER_SYNC

On replica:

  plugin-load-add = semisync_replica.so
rpl_semi_sync_replica_enabled = 1

Trade-off: higher write latency for reduced data loss risk if primary fails before async replicas catch up.

Failover Strategies

Manual Failover

Stop writes to primary
Verify most up-to-date replica (SHOW REPLICA STATUS, compare Executed_Gtid_Set)
STOP REPLICA; RESET REPLICA ALL; on chosen replica
SET GLOBAL read_only = OFF;
Repoint application and other replicas

Automated Failover

Tools: Orchestrator, MHA, InnoDB Cluster, cloud RDS Multi-AZ

Orchestrator detects primary failure, promotes replica, repoints remaining topology.

Split-Brain Prevention

Never allow two writable primaries accepting the same data without conflict resolution. Use:

VIP/floating IP managed by failover tool
read_only = ON on all non-primary nodes
Group Replication consensus (single-primary mode)

Replication Monitoring

  SHOW REPLICA STATUS\G

SELECT * FROM performance_schema.replication_connection_status;
SELECT * FROM performance_schema.replication_applier_status_by_worker;

-- Alert when lag exceeds threshold
SELECT TIMESTAMPDIFF(SECOND, applier_last_applied_transaction_original_commit_timestamp, NOW())
FROM performance_schema.replication_applier_status_by_worker LIMIT 1;

Common Replication Errors

Duplicate entry on replica:

  STOP REPLICA;
SET GLOBAL sql_replica_skip_counter = 1;  -- last resort, breaks consistency with GTID
START REPLICA;
-- Better: skip specific GTID or rebuild replica from backup

Schema drift: replica missing column — apply same migration to all nodes simultaneously.

Best Practices

Use GTID and ROW binlog format
Keep replica schemas identical to primary
Monitor lag with alerts (warning: 10s, critical: 60s)
Test failover quarterly — untested failover always fails in real incidents
Use read_only on replicas to prevent accidental writes

Production Scenario

A SaaS platform runs primary in us-east-1, replicas in us-east-1 (read scaling) and eu-west-1 (DR). Semi-sync ensures at least one local replica acknowledges writes. Orchestrator handles automatic failover with a 30-second detection window. Application uses ProxySQL for read/write routing with max lag threshold of 5 seconds.

Replication is the foundation of MySQL scalability and availability — design topology, lag handling, and failover before you need them at 3 AM.

Performance Tuning

Backup and Recovery

Replication and High Availability

Replication Topologies link

Asynchronous Replication Overview link

Primary Configuration link

Replica Configuration link

GTID Benefits link

Read/Write Splitting link

Semi-Synchronous Replication link

Failover Strategies link

Manual Failover link

Automated Failover link

Split-Brain Prevention link

Replication Monitoring link

Common Replication Errors link

Best Practices link

Production Scenario link