Disaster Recovery on AWS
Disaster recovery (DR) keeps your business operating when infrastructure fails, whether from hardware failure, natural disaster, human error, or cyberattack. AWS provides the tools to implement DR strategies ranging from simple backups to active-active multi-region architectures. The key question: can you actually restore, not just back up?
DR Core Metrics
| Metric | Definition | Example |
|---|---|---|
| RTO (Recovery Time Objective) | Max acceptable downtime | 4 hours |
| RPO (Recovery Point Objective) | Max acceptable data loss | 15 minutes |
| MTTR (Mean Time to Recovery) | Average time to restore | 2 hours |
| MTBF (Mean Time Between Failures) | Average time between incidents | 720 hours |
Lower RTO and RPO targets mean higher cost and complexity. Match your strategy to business requirements, not arbitrary zero-downtime goals.
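MTBF and MTTR combine into a steady-state availability estimate, MTBF / (MTBF + MTTR). The sketch below plugs in the example values from the metrics table; the function name `availability_pct` is illustrative and not part of any AWS tool.

```shell
# availability_pct MTBF_HOURS MTTR_HOURS
# Prints steady-state availability as a percentage: MTBF / (MTBF + MTTR) * 100.
availability_pct() {
  awk -v mtbf="$1" -v mttr="$2" 'BEGIN { printf "%.2f\n", mtbf / (mtbf + mttr) * 100 }'
}

availability_pct 720 2   # example values from the table; prints 99.72
```

An estimate like this is a useful sanity check on whether a stated RTO is consistent with the availability the business expects.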
DR Strategy Comparison
| Strategy | RTO | RPO | Cost | Complexity |
|---|---|---|---|---|
| Backup & Restore | Hours–Days | Hours | $ | Low |
| Pilot Light | 10–30 min | Minutes | $$ | Medium |
| Warm Standby | Minutes | Minutes | $$$ | Medium-High |
| Active-Active | Near-zero | Near-zero | $$$$ | High |
Backup & Restore
The simplest strategy — regular backups with documented restore procedures:
# Automated RDS backups (enable at creation)
aws rds create-db-instance \
--backup-retention-period 35 \
--preferred-backup-window "03:00-04:00" \
...
# Cross-region automated backup copy (run in the destination region)
aws rds start-db-instance-automated-backups-replication \
--region us-west-2 \
--source-db-instance-arn arn:aws:rds:us-east-1:123:db:myapp-postgres \
--backup-retention-period 35 \
--kms-key-id arn:aws:kms:us-west-2:123:key/xxx
# Manual snapshot before major changes
aws rds create-db-snapshot \
--db-instance-identifier myapp-postgres \
--db-snapshot-identifier pre-migration-$(date +%Y%m%d)
# Restore to new instance in DR region (the snapshot must already exist in us-west-2)
aws rds restore-db-instance-from-db-snapshot \
--region us-west-2 \
--db-instance-identifier myapp-postgres-dr \
--db-snapshot-identifier arn:aws:rds:us-west-2:123:snapshot:myapp-postgres-auto-xxx \
--db-instance-class db.r6g.large \
--vpc-security-group-ids sg-dr-database \
--db-subnet-group-name dr-db-subnet-group
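A restore only counts once the new instance is actually usable. As a minimal sketch, the helper below checks the status of the restored instance; the function name `restore_is_available` is an assumption for illustration, while `describe-db-instances` and the `db-instance-available` waiter are standard AWS CLI calls.

```shell
# restore_is_available INSTANCE_ID REGION
# Returns 0 when the restored instance reports status "available".
restore_is_available() {
  db_status=$(aws rds describe-db-instances \
    --db-instance-identifier "$1" \
    --region "$2" \
    --query 'DBInstances[0].DBInstanceStatus' \
    --output text) || return 1
  [ "$db_status" = "available" ]
}

# Example (requires AWS credentials):
#   aws rds wait db-instance-available \
#     --db-instance-identifier myapp-postgres-dr --region us-west-2
#   restore_is_available myapp-postgres-dr us-west-2 && echo "restore verified"
```

A status check proves the instance started, not that the data is intact; a fuller drill would also run an application-level query against the restored database.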
S3 Cross-Region Replication
# Enable versioning (required on both the source and destination buckets)
aws s3api put-bucket-versioning \
--bucket myapp-assets \
--versioning-configuration Status=Enabled
# Create replication rule
aws s3api put-bucket-replication \
--bucket myapp-assets \
--replication-configuration '{
"Role": "arn:aws:iam::123:role/s3-replication-role",
"Rules": [{
"Status": "Enabled",
"Priority": 1,
"Filter": {},
"Destination": {
"Bucket": "arn:aws:s3:::myapp-assets-dr",
"ReplicationTime": {"Status": "Enabled", "Time": {"Minutes": 15}},
"Metrics": {"Status": "Enabled", "EventThreshold": {"Minutes": 15}}
},
"DeleteMarkerReplication": {"Status": "Enabled"}
}]
}'
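Once a replication rule is in place, it helps to confirm that objects actually arrive. The sketch below reads the `ReplicationStatus` field that `head-object` returns for a source object; the helper names and the sample key are hypothetical.

```shell
# replication_status BUCKET KEY
# Prints the ReplicationStatus S3 reports for the object (for example PENDING or FAILED).
replication_status() {
  aws s3api head-object --bucket "$1" --key "$2" \
    --query 'ReplicationStatus' --output text
}

# is_replicated BUCKET KEY
# Returns 0 only when the source object has finished replicating.
is_replicated() {
  case "$(replication_status "$1" "$2")" in
    COMPLETE*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (requires AWS credentials; the key is a placeholder):
#   is_replicated myapp-assets uploads/canary.txt || echo "replication not complete"
```

Uploading a small canary object on a schedule and checking it this way gives a simple end-to-end signal alongside the replication metrics enabled in the rule.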
Pilot Light
Minimal resources running in DR region — scale up on failover:
| Component | Primary (us-east-1) | DR (us-west-2) |
|---|---|---|
| RDS | db.r6g.large Multi-AZ | Cross-region read replica (promote on failover) |
| EC2/ECS | 4 tasks running | Task definition registered, 0 tasks |
| AMIs/Containers | Active | Replicated to DR region ECR |
| VPC | Full production | Pre-configured, minimal resources |
| Route 53 | Active routing | Failover record (standby) |
# Promote RDS read replica to standalone (failover)
aws rds promote-read-replica \
--db-instance-identifier myapp-postgres-dr-replica
# Scale ECS service from 0 to production capacity
aws ecs update-service \
--cluster dr-production \
--service myapp-service \
--desired-count 4
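During a real failover the two commands above need to run in order, with a wait between them so the application does not start against a database that is still being promoted. The following is a sketch under stated assumptions: the `run` and `pilot_light_failover` helpers and the `REGION` and `DRY_RUN` variables are illustrative, while the waiters (`db-instance-available`, `services-stable`) are standard AWS CLI commands.

```shell
# run CMD...: execute the command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# pilot_light_failover: promote the replica, wait for it, then scale ECS.
# Identifiers and the count of 4 come from the commands above; REGION is an assumption.
pilot_light_failover() {
  region="${REGION:-us-west-2}"
  run aws rds promote-read-replica \
    --db-instance-identifier myapp-postgres-dr-replica --region "$region" || return 1
  run aws rds wait db-instance-available \
    --db-instance-identifier myapp-postgres-dr-replica --region "$region" || return 1
  run aws ecs update-service --cluster dr-production --service myapp-service \
    --desired-count 4 --region "$region" || return 1
  run aws ecs wait services-stable --cluster dr-production \
    --services myapp-service --region "$region"
}

# Dry run: prints the four commands without calling AWS.
DRY_RUN=1
pilot_light_failover
```

Rehearsing with `DRY_RUN=1` keeps the script and the runbook in sync; before directing traffic to the DR region, confirm the database and service state independently rather than relying on the waiters alone.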
Warm Standby
Reduced-capacity DR environment always running — faster failover than pilot light:
# DR region runs at 20-30% capacity
# Primary: 10 EC2 instances → DR: 2-3 instances always running
# On failover, scale DR ASG to full capacity (the group's max size must already be at least 10)
aws autoscaling update-auto-scaling-group \
--auto-scaling-group-name dr-web-asg \
--desired-capacity 10 \
--min-size 10
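The 20-30% sizing rule is simple arithmetic, but rounding matters for small fleets. A minimal sketch, assuming a hypothetical helper `standby_capacity` that rounds up and never returns less than one instance:

```shell
# standby_capacity FULL_CAPACITY PERCENT
# Prints the standby instance count, rounded up, with a floor of 1.
standby_capacity() {
  n=$(( ($1 * $2 + 99) / 100 ))
  if [ "$n" -lt 1 ]; then
    n=1
  fi
  echo "$n"
}

standby_capacity 10 20   # prints 2
standby_capacity 10 30   # prints 3
```

Rounding up is a deliberate choice here: a standby that rounds down to zero instances is a pilot light, not a warm standby.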
Active-Active Multi-Region
Full capacity in multiple regions with traffic distributed:
# Route 53 latency-based routing
aws route53 change-resource-record-sets \
--hosted-zone-id Z1234567890 \
--change-batch '{
"Changes": [{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "api.example.com",
"Type": "A",
"SetIdentifier": "us-east-1",
"Region": "us-east-1",
"AliasTarget": {
"HostedZoneId": "Z35SXDOTRQ7X7K",
"DNSName": "primary-alb.us-east-1.elb.amazonaws.com",
"EvaluateTargetHealth": true
}
}
}]
}'
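Latency-based routing only distributes traffic when every region has its own record with the same name and type. The command above creates the us-east-1 record; a matching record is needed for each additional region. As a sketch, the hypothetical helper below generates the change batch for any region, so the ALB hosted zone ID and DNS name (which differ per region and load balancer) are passed in rather than guessed.

```shell
# latency_record_batch REGION ALIAS_ZONE_ID ALB_DNS_NAME
# Prints a change batch with one latency-based alias record for api.example.com.
latency_record_batch() {
  cat <<EOF
{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "api.example.com",
      "Type": "A",
      "SetIdentifier": "$1",
      "Region": "$1",
      "AliasTarget": {
        "HostedZoneId": "$2",
        "DNSName": "$3",
        "EvaluateTargetHealth": true
      }
    }
  }]
}
EOF
}

# Example with placeholder values for the second region:
#   aws route53 change-resource-record-sets --hosted-zone-id Z1234567890 \
#     --change-batch "$(latency_record_batch us-west-2 "$ALB_ZONE_ID" "$ALB_DNS_NAME")"
```

Generating both records from one template keeps the regions symmetrical, which matters because `EvaluateTargetHealth` is what removes an unhealthy region from rotation.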
Active-Active Challenges
| Challenge | Solution |
|---|---|
| Data consistency | DynamoDB Global Tables, Aurora Global Database |
| Session state | Stateless apps + ElastiCache Global Datastore |
| Deployment sync | CI/CD deploys to both regions simultaneously |
| Conflict resolution | Last-writer-wins or application-level merge |
Aurora Global Database
Sub-second cross-region replication for PostgreSQL/MySQL:
aws rds create-global-cluster \
--global-cluster-identifier myapp-global \
--engine aurora-postgresql \
--engine-version 16.1
aws rds create-db-cluster \
--db-cluster-identifier myapp-primary \
--engine aurora-postgresql \
--global-cluster-identifier myapp-global
aws rds create-db-cluster \
--db-cluster-identifier myapp-dr \
--engine aurora-postgresql \
--global-cluster-identifier myapp-global \
--region us-west-2
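# Fail over to the DR region (typically under a minute); the target must be given as an ARN
# Without --allow-data-loss this is a planned switchover; add it for an unplanned outage
aws rds failover-global-cluster \
--global-cluster-identifier myapp-global \
--target-db-cluster-identifier arn:aws:rds:us-west-2:123:cluster:myapp-dr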
DynamoDB Global Tables
Multi-region, multi-active NoSQL with automatic replication:
aws dynamodb create-table \
--table-name Orders \
--attribute-definitions AttributeName=orderId,AttributeType=S \
--key-schema AttributeName=orderId,KeyType=HASH \
--billing-mode PAY_PER_REQUEST \
--stream-specification StreamEnabled=true,StreamViewType=NEW_AND_OLD_IMAGES
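# Add a replica (global tables version 2019.11.21); run in the table's home region (us-east-1 here)
# The legacy create-global-table call requires an existing empty table in every region
aws dynamodb update-table \
--table-name Orders \
--replica-updates '[{"Create": {"RegionName": "eu-west-1"}}]'
# Repeat for ap-southeast-1 once the table is ACTIVE again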
DR Runbook Template
Every DR strategy needs a tested runbook:
# DR Failover Runbook: Production API
## Trigger Conditions
- Primary region (us-east-1) unavailable for > 5 minutes
- RDS primary failure with Multi-AZ failover unsuccessful
- Security incident requiring region isolation
## Failover Steps
1. Confirm primary region outage (CloudWatch, Route 53 health checks)
2. Notify stakeholders via PagerDuty (#incident-dr channel)
3. Promote RDS read replica in us-west-2 (aws rds promote-read-replica)
4. Scale ECS services in DR region to production capacity
5. Update Route 53 failover record to point to DR ALB
6. Verify application health: curl https://api.example.com/health
7. Monitor DR region metrics for 30 minutes
8. Document incident timeline for postmortem
## Rollback Steps
1. Confirm primary region restored and stable
2. Sync data from DR back to primary (if applicable)
3. Scale DR back to standby capacity
4. Update Route 53 to primary region
5. Verify and notify stakeholders
## Test Schedule
- Quarterly: Full failover drill (non-business hours)
- Monthly: Backup restore verification
- Weekly: Automated DR health check script
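The weekly automated health check and failover step 6 both come down to probing the health endpoint with a few retries. A minimal sketch, assuming a hypothetical helper `check_health` built on `curl`; the retry count and delay defaults are illustrative, not recommendations.

```shell
# check_health URL [ATTEMPTS] [DELAY_SECONDS]
# Returns 0 as soon as the endpoint answers without an HTTP error status,
# or 1 after all attempts fail.
check_health() {
  url="$1"
  attempts="${2:-3}"
  delay="${3:-5}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl fail on HTTP status 400 and above
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Example:
#   check_health https://api.example.com/health 3 5 || echo "DR health check failed"
```

Running the same function from a scheduler and from the runbook means the drill exercises exactly the check used during a real incident.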
AWS Backup (Centralized)
Manage backups across services from one console:
# Create backup vault
aws backup create-backup-vault --backup-vault-name production-vault
# Backup plan: daily with 35-day retention
aws backup create-backup-plan --backup-plan '{
"BackupPlanName": "daily-production",
"Rules": [{
"RuleName": "daily-backup",
"TargetBackupVaultName": "production-vault",
"ScheduleExpression": "cron(0 5 ? * * *)",
"StartWindowMinutes": 60,
"CompletionWindowMinutes": 120,
"Lifecycle": {"DeleteAfterDays": 35},
"CopyActions": [{
"Lifecycle": {"DeleteAfterDays": 35},
"DestinationBackupVaultArn": "arn:aws:backup:us-west-2:123:backup-vault:dr-vault"
}]
}]
}'
Covers EC2, EBS, RDS, DynamoDB, EFS, S3, and more.
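A backup plan protects nothing until resources are assigned to it. One common approach is a tag-based selection; the sketch below builds the selection document for `aws backup create-backup-selection`. The helper name, the selection name, and the `backup=daily` tag are assumptions for illustration, and the copy destination `dr-vault` referenced by the plan must already exist in us-west-2.

```shell
# backup_selection_json ROLE_ARN TAG_KEY TAG_VALUE
# Prints a selection that assigns every resource carrying the given tag.
backup_selection_json() {
  cat <<EOF
{
  "SelectionName": "tagged-production",
  "IamRoleArn": "$1",
  "ListOfTags": [{
    "ConditionType": "STRINGEQUALS",
    "ConditionKey": "$2",
    "ConditionValue": "$3"
  }]
}
EOF
}

# Example (PLAN_ID comes from the create-backup-plan output; the role ARN is a placeholder):
#   aws backup create-backup-selection --backup-plan-id "$PLAN_ID" \
#     --backup-selection "$(backup_selection_json \
#       arn:aws:iam::123:role/service-role/AWSBackupDefaultServiceRole backup daily)"
```

Selecting by tag means new resources are covered as soon as they are tagged, without editing the plan.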
Real-World Scenario: Financial Services DR
| Requirement | Implementation |
|---|---|
| RTO: 15 minutes | Warm standby in us-west-2 |
| RPO: 5 minutes | Aurora Global Database |
| Compliance: 7-year retention | S3 Glacier Deep Archive via AWS Backup |
| Failover testing | Quarterly automated DR drill |
| Monitoring | Route 53 health checks + CloudWatch composite alarms |
| Communication | PagerDuty integration with runbook links |
Common Mistakes
- Backups without tested restores — untested backups are wishful thinking
- DR region in same geography — us-east-1 and us-east-2 share risk; use us-west-2
- No runbook — panic-driven failover takes 10× longer
- Ignoring data consistency — failover with stale data corrupts business logic
- DR resources not maintained — AMIs, task definitions, and IaC drift from production
- Never testing failover — discover broken DR during actual disaster
Troubleshooting DR Failures
| Issue | Cause | Fix |
|---|---|---|
| Restore takes too long | Large snapshot, wrong instance class | Pre-provision DR instances; use Aurora for fast failover |
| DNS not switching | Health check still passing on failed region | Lower health check thresholds; use CloudWatch alarm-based failover |
| Data inconsistency after failover | Async replication lag | Monitor replication lag; use sync replication for critical data |
| DR region costs too high | Full active-active when warm standby suffices | Match strategy to actual RTO/RPO requirements |
Best Practices
- Define RTO/RPO with business stakeholders, not engineers alone
- Test failover quarterly — automate where possible
- Store IaC templates ready to deploy in DR region
- Replicate AMIs/containers to DR region continuously
- Use Route 53 health checks with automatic failover
- Maintain DR runbooks with step-by-step commands
- Monitor replication lag (RDS, S3 CRR, DynamoDB) with alarms
- Include DR costs in budget planning — DR is insurance, not waste
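Replication lag alarms can be derived directly from the RPO. The sketch below creates a CloudWatch alarm on the `AuroraGlobalDBReplicationLag` metric, which is reported in milliseconds; the helper names are hypothetical, and the period and evaluation settings are assumptions to adjust for your workload.

```shell
# rpo_to_lag_ms RPO_MINUTES
# Converts an RPO in minutes to milliseconds, the unit of the lag metric.
rpo_to_lag_ms() {
  echo $(( $1 * 60 * 1000 ))
}

# create_lag_alarm CLUSTER_ID RPO_MINUTES REGION
# Creates an alarm that fires when replication lag exceeds the RPO.
create_lag_alarm() {
  aws cloudwatch put-metric-alarm \
    --alarm-name "$1-replication-lag" \
    --namespace AWS/RDS \
    --metric-name AuroraGlobalDBReplicationLag \
    --dimensions "Name=DBClusterIdentifier,Value=$1" \
    --statistic Maximum \
    --period 60 \
    --evaluation-periods 3 \
    --comparison-operator GreaterThanThreshold \
    --threshold "$(rpo_to_lag_ms "$2")" \
    --region "$3"
}

# Example (requires AWS credentials):
#   create_lag_alarm myapp-dr 5 us-west-2
```

In practice the threshold would usually sit well below the RPO so the alarm fires before the objective is breached, and an `--alarm-actions` target would be added to route the notification.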