High Availability (Always On)
High availability (HA) keeps databases online through hardware failures, patching, and disasters. SQL Server’s flagship HA feature is Always On Availability Groups (AG) — multi-database replication with optional automatic failover and readable secondaries.
HA Options Compared
| Feature | Always On AG | Failover Cluster (FCI) | Log Shipping | Replication |
|---|---|---|---|---|
| Failover | Automatic (AG) | Automatic (shared storage) | Manual | Manual |
| Read scale-out | Yes (secondaries) | No | No (standby) | Yes (subscribers) |
| RPO | Zero with sync | Storage dependent | Minutes | Minutes |
| Edition | Enterprise (full); Standard (basic) | Standard+ | All | All |
| Complexity | High | Medium | Low | Medium |
Always On Prerequisites
- Windows Server Failover Cluster (WSFC) on Windows, or Pacemaker on Linux
- Matching SQL Server version/build on replicas
- Enterprise Edition for multi-DB AG, readable secondaries, basic features beyond limits
- Standard Edition — one database per AG, no readable secondaries (verify current limits)
- Network latency low enough for sync mode (typically same datacenter)
- Open port 5022 for database mirroring endpoint
WSFC Quorum
Quorum determines cluster viability when nodes fail:
| Quorum Mode | When to Use |
|---|---|
| Node majority | Odd number of voting nodes |
| Node and disk majority | Shared disk witness |
| Cloud witness | Azure blob — recommended for 2-node clusters |
| No quorum (manual) | Maintenance only |
Split-brain occurs without quorum — neither side accepts writes. Test failover quarterly.
Create Availability Group
-- On primary replica
CREATE AVAILABILITY GROUP [AppAG]
WITH (
AUTOMATED_BACKUP_PREFERENCE = SECONDARY,
DB_FAILOVER = ON,
DTC_SUPPORT = NONE
)
FOR DATABASE MyApp
REPLICA ON
N'SQL-PRIMARY' WITH (
ENDPOINT_URL = N'TCP://sql-primary.contoso.com:5022',
AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
FAILOVER_MODE = AUTOMATIC,
SEEDING_MODE = AUTOMATIC,
SECONDARY_ROLE(ALLOW_CONNECTIONS = READ_ONLY)
),
N'SQL-SECONDARY' WITH (
ENDPOINT_URL = N'TCP://sql-secondary.contoso.com:5022',
AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
FAILOVER_MODE = AUTOMATIC,
SEEDING_MODE = AUTOMATIC,
SECONDARY_ROLE(ALLOW_CONNECTIONS = READ_ONLY)
);
GO
Join Secondary Replica
-- On secondary
ALTER AVAILABILITY GROUP [AppAG] JOIN;
ALTER DATABASE MyApp SET HADR = ON;
Automatic seeding streams backup over network — firewall must allow replica-to-replica traffic.
Async Replica for DR
-- Disaster recovery site — async accepts data loss on failover
N'SQL-DR' WITH (
ENDPOINT_URL = N'TCP://sql-dr.contoso.com:5022',
AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
FAILOVER_MODE = MANUAL,
SECONDARY_ROLE(ALLOW_CONNECTIONS = NO)
)
Use async when latency exceeds ~10 ms or cross-region DR is required.
Failover
-- Planned manual failover (no data loss with sync)
ALTER AVAILABILITY GROUP [AppAG] FAILOVER;
-- Force failover with data loss (async DR emergency only)
ALTER AVAILABILITY GROUP [AppAG] FORCE_FAILOVER_ALLOW_DATA_LOSS;
Application connection strings must target AG listener, not individual node names:
Server=tcp:AppAG-Listener.contoso.com,1433;Database=MyApp;MultiSubnetFailover=True;
Read-Only Routing
Offload reporting to secondaries:
ALTER AVAILABILITY GROUP [AppAG]
MODIFY REPLICA ON N'SQL-SECONDARY' WITH (
SECONDARY_ROLE(READ_ONLY_ROUTING_URL = N'TCP://sql-secondary.contoso.com:1433')
);
ALTER AVAILABILITY GROUP [AppAG]
MODIFY REPLICA ON N'SQL-PRIMARY' WITH (
PRIMARY_ROLE(READ_ONLY_ROUTING_LIST = (N'SQL-SECONDARY', N'SQL-PRIMARY'))
);
Connection string: ApplicationIntent=ReadOnly routes to readable secondary.
Monitoring AG Health
SELECT
ag.name AS ag_name,
ar.replica_server_name,
ars.role_desc,
drs.synchronization_state_desc,
drs.synchronization_health_desc,
drs.log_send_queue_size,
drs.redo_queue_size,
drs.last_commit_time
FROM sys.availability_groups AS ag
INNER JOIN sys.availability_replicas AS ar ON ag.group_id = ar.group_id
INNER JOIN sys.dm_hadr_availability_replica_states AS ars ON ar.replica_id = ars.replica_id
INNER JOIN sys.dm_hadr_database_replica_states AS drs ON ar.replica_id = drs.replica_id;
Alert when:
synchronization_health_desc≠ HEALTHYlog_send_queue_sizegrows continuously (network or secondary load)redo_queue_sizehigh (secondary CPU/disk bottleneck)
Log Shipping Alternative
Simpler DR without AG licensing complexity:
-- Primary: backup log every 15 minutes
BACKUP LOG MyApp TO DISK = N'\\share\logs\MyApp_log.trn' WITH COMPRESSION;
-- Secondary: restore with NORECOVERY
RESTORE LOG MyApp FROM DISK = N'\\share\logs\MyApp_log.trn' WITH NORECOVERY;
Manual failover — restore final log WITH RECOVERY. Higher RPO/RTO than sync AG.
Common Mistakes
| Mistake | Impact |
|---|---|
| Connecting to node name | App breaks on failover |
| Sync AG across regions | Performance collapse, timeouts |
| No quorum witness | Cluster goes offline on one node loss |
| Skipping failover drills | Discover broken logins during outage |
| Backups only on primary | Miss secondary preference benefits |
Performance Tips
- Run backups on secondary (
AUTOMATED_BACKUP_PREFERENCE = SECONDARY) - Match hardware on sync replicas — redo lag causes backlog
- Use dedicated network for replication traffic
- Limit readable secondary workload — competes with redo thread
Production Scenario: Patch Cycle
- Failover to secondary (planned)
- Patch former primary (now secondary)
- Verify sync health
- Fail back during maintenance window
- Patch other node
Document runbook with listener name, credentials, and validation queries.
Troubleshooting
| Issue | Resolution |
|---|---|
| AG not synchronizing | Check endpoint, firewall 5022, service accounts |
| Secondary not readable | ALLOW_CONNECTIONS = READ_ONLY, database in SYNCHRONIZING |
| Failover failed | Quorum lost — restore witness |
| Login failures after failover | Sync SQL logins (jobs/contained users) |
-- Verify listener
SELECT dns_name, port FROM sys.availability_group_listeners;
Always On delivers enterprise RPO/RTO — invest in quorum design, listener connection strings, and regular failover testing before production dependency.