Disaster recovery (DR) keeps your business operating when infrastructure fails, whether from hardware failure, natural disaster, human error, or cyberattack. AWS provides the tools to implement DR strategies ranging from simple backups to active-active multi-region architectures. The real test is whether you can actually restore, not just whether you back up.

DR Core Metrics

| Metric | Definition | Example |
| --- | --- | --- |
| RTO (Recovery Time Objective) | Max acceptable downtime | 4 hours |
| RPO (Recovery Point Objective) | Max acceptable data loss | 15 minutes |
| MTTR (Mean Time to Recovery) | Average time to restore | 2 hours |
| MTBF (Mean Time Between Failures) | Average time between incidents | 720 hours |

Lower RTO/RPO = higher cost and complexity. Match your strategy to business requirements, not arbitrary zero-downtime goals.
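
As a rough illustration of how the MTBF and MTTR examples in the table relate, steady-state availability is commonly estimated as MTBF / (MTBF + MTTR). The helper below is a minimal sketch: the function name `availability_pct` is hypothetical, and it assumes both inputs use the same unit (hours here).

```shell
# Hypothetical helper: estimate availability (%) from MTBF and MTTR.
# Uses the common approximation availability = MTBF / (MTBF + MTTR).
availability_pct() {
    awk -v mtbf="$1" -v mttr="$2" \
        'BEGIN { printf "%.2f\n", 100 * mtbf / (mtbf + mttr) }'
}

# Example values from the metrics table: MTBF 720 hours, MTTR 2 hours
availability_pct 720 2   # prints 99.72
```

An estimate like this is only as good as the incident history behind the two averages, so treat it as a sanity check rather than a commitment.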

DR Strategy Comparison

| Strategy | RTO | RPO | Cost | Complexity |
| --- | --- | --- | --- | --- |
| Backup & Restore | Hours–Days | Hours | $ | Low |
| Pilot Light | 10–30 min | Minutes | $$ | Medium |
| Warm Standby | Minutes | Minutes | $$$ | Medium-High |
| Active-Active | Near-zero | Near-zero | $$$$ | High |
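
One way to make the comparison actionable is to map an agreed RTO target onto a strategy tier. The sketch below is illustrative only: `pick_dr_strategy` is a hypothetical name, and the cut-offs (60, 10, and 1 minute) are assumptions loosely read off the RTO column, not AWS guidance. It also ignores RPO and cost.

```shell
# Hypothetical helper: map an RTO target (whole minutes) to a strategy tier.
# The thresholds are assumptions based on the comparison table.
pick_dr_strategy() {
    if [ "$1" -ge 60 ]; then
        echo "backup-restore"
    elif [ "$1" -ge 10 ]; then
        echo "pilot-light"
    elif [ "$1" -ge 1 ]; then
        echo "warm-standby"
    else
        echo "active-active"
    fi
}

pick_dr_strategy 240   # prints backup-restore
pick_dr_strategy 15    # prints pilot-light
```

In practice the RPO target and budget would narrow the choice further.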

Backup & Restore

The simplest strategy — regular backups with documented restore procedures:

# Automated RDS backups (enable at creation)
aws rds create-db-instance \
    --backup-retention-period 35 \
    --preferred-backup-window "03:00-04:00" \
    ...

# Cross-region automated backup copy (run this in the destination region)
aws rds start-db-instance-automated-backups-replication \
    --region us-west-2 \
    --source-db-instance-arn arn:aws:rds:us-east-1:123:db:myapp-postgres \
    --backup-retention-period 35 \
    --kms-key-id arn:aws:kms:us-west-2:123:key/xxx

# Manual snapshot before major changes
aws rds create-db-snapshot \
    --db-instance-identifier myapp-postgres \
    --db-snapshot-identifier pre-migration-$(date +%Y%m%d)

# Restore to new instance in DR region
# (the snapshot is passed via --db-snapshot-identifier, which also accepts an ARN)
aws rds restore-db-instance-from-db-snapshot \
    --region us-west-2 \
    --db-instance-identifier myapp-postgres-dr \
    --db-snapshot-identifier arn:aws:rds:us-west-2:123:snapshot:myapp-postgres-auto-xxx \
    --db-instance-class db.r6g.large \
    --vpc-security-group-ids sg-dr-database \
    --db-subnet-group-name dr-db-subnet-group
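
A restore only counts once the new instance is reachable. The sketch below waits for the restored instance and prints its endpoint so that a smoke test can connect to it. The function name `verify_restore` is hypothetical, and the instance identifier and region in the usage line are the placeholders from the commands above.

```shell
# Hypothetical helper: wait for a restored instance, then print its endpoint.
# $1 = DB instance identifier, $2 = region
verify_restore() {
    # Blocks until the instance reports "available"
    # (the waiter gives up after its built-in timeout)
    aws rds wait db-instance-available \
        --region "$2" \
        --db-instance-identifier "$1" || return 1
    # Print the endpoint so a follow-up smoke test can connect to it
    aws rds describe-db-instances \
        --region "$2" \
        --db-instance-identifier "$1" \
        --query 'DBInstances[0].Endpoint.Address' \
        --output text
}

# Usage (requires AWS credentials):
# verify_restore myapp-postgres-dr us-west-2
```

Timing this call during a drill gives a measured restore duration to compare against the RTO.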
  

S3 Cross-Region Replication

# Enable versioning (required on both the source and destination buckets)
aws s3api put-bucket-versioning \
    --bucket myapp-assets \
    --versioning-configuration Status=Enabled

# Create replication rule
aws s3api put-bucket-replication \
    --bucket myapp-assets \
    --replication-configuration '{
        "Role": "arn:aws:iam::123:role/s3-replication-role",
        "Rules": [{
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {},
            "Destination": {
                "Bucket": "arn:aws:s3:::myapp-assets-dr",
                "ReplicationTime": {"Status": "Enabled", "Time": {"Minutes": 15}},
                "Metrics": {"Status": "Enabled", "EventThreshold": {"Minutes": 15}}
            },
            "DeleteMarkerReplication": {"Status": "Enabled"}
        }]
    }'
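
To confirm that replication is actually happening, the replication status of an individual object can be inspected. This is a minimal sketch: `object_replication_status` is a hypothetical name and the object key in the usage line is a placeholder.

```shell
# Hypothetical helper: print the replication status of one object.
# Source objects report values such as PENDING, COMPLETED, or FAILED;
# copies in the destination bucket report REPLICA.
# $1 = bucket, $2 = object key
object_replication_status() {
    aws s3api head-object \
        --bucket "$1" \
        --key "$2" \
        --query 'ReplicationStatus' \
        --output text
}

# Usage (requires AWS credentials; the key is a placeholder):
# object_replication_status myapp-assets uploads/logo.png
```

Objects written before the replication rule existed are not replicated automatically, so a spot check is most meaningful on a freshly uploaded object.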
  

Pilot Light

Minimal resources running in DR region — scale up on failover:

| Component | Primary (us-east-1) | DR (us-west-2) |
| --- | --- | --- |
| RDS | db.r6g.large Multi-AZ | Cross-region read replica (promote on failover) |
| EC2/ECS | 4 tasks running | Task definition registered, 0 tasks |
| AMIs/Containers | Active | Replicated to DR region ECR |
| VPC | Full production | Pre-configured, minimal resources |
| Route 53 | Active routing | Failover record (standby) |

# Promote RDS read replica to standalone (failover)
aws rds promote-read-replica \
    --db-instance-identifier myapp-postgres-dr-replica

# Scale ECS service from 0 to production capacity
aws ecs update-service \
    --cluster dr-production \
    --service myapp-service \
    --desired-count 4
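
During a real failover these steps need to run in order, stopping as soon as one fails. The wrapper below is a sketch of that sequencing. `pilot_light_failover` is a hypothetical name, and it assumes the CLI's default region is already set to the DR region.

```shell
# Hypothetical wrapper: run the pilot-light failover steps in order,
# stopping at the first failure.
# $1 = replica identifier, $2 = ECS cluster, $3 = ECS service, $4 = task count
pilot_light_failover() {
    aws rds promote-read-replica \
        --db-instance-identifier "$1" > /dev/null || return 1
    # Wait for the promoted instance; note the waiter can return early if the
    # instance status has not yet left "available" after the promote call
    aws rds wait db-instance-available \
        --db-instance-identifier "$1" || return 1
    aws ecs update-service \
        --cluster "$2" --service "$3" --desired-count "$4" > /dev/null || return 1
    # Wait until the service reaches a steady state at the new task count
    aws ecs wait services-stable --cluster "$2" --services "$3" || return 1
}

# Usage (requires AWS credentials):
# pilot_light_failover myapp-postgres-dr-replica dr-production myapp-service 4
```

Keeping the sequence in one function makes it easy to rehearse in drills and to time against the RTO.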
  

Warm Standby

Reduced-capacity DR environment always running — faster failover than pilot light:

# DR region runs at 20-30% capacity
# Primary: 10 EC2 instances → DR: 2-3 instances always running

# On failover, scale DR ASG to full capacity
aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name dr-web-asg \
    --desired-capacity 10 \
    --min-size 10
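
The standby size follows from the primary fleet size and the chosen percentage, rounded up so that a fractional instance still counts. A small sketch of the arithmetic, using the hypothetical helper name `standby_capacity`:

```shell
# Hypothetical helper: instances to keep running in DR, rounded up.
# $1 = primary instance count, $2 = standby percentage (integer)
standby_capacity() {
    echo $(( ($1 * $2 + 99) / 100 ))
}

standby_capacity 10 20   # prints 2
standby_capacity 10 30   # prints 3
```

For the 10-instance primary above, this reproduces the 2-3 instance range.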
  

Active-Active Multi-Region

Full capacity in multiple regions with traffic distributed:

# Route 53 latency-based routing
aws route53 change-resource-record-sets \
    --hosted-zone-id Z1234567890 \
    --change-batch '{
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "api.example.com",
                "Type": "A",
                "SetIdentifier": "us-east-1",
                "Region": "us-east-1",
                "AliasTarget": {
                    "HostedZoneId": "Z35SXDOTRQ7X7K",
                    "DNSName": "primary-alb.us-east-1.elb.amazonaws.com",
                    "EvaluateTargetHealth": true
                }
            }
        }]
    }'
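
Latency-based routing needs one such record per region, each with its own SetIdentifier. To avoid hand-editing JSON for every region, the change batch can be generated. This is a sketch: `latency_record_batch` is a hypothetical name, and the DNS name and zone ID in the usage lines are placeholders to be replaced with the values of the actual load balancer.

```shell
# Hypothetical helper: print a change batch for one region's latency record.
# $1 = AWS region, $2 = ALB DNS name, $3 = the ALB's canonical hosted zone ID
latency_record_batch() {
    cat <<EOF
{
    "Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "api.example.com",
            "Type": "A",
            "SetIdentifier": "$1",
            "Region": "$1",
            "AliasTarget": {
                "HostedZoneId": "$3",
                "DNSName": "$2",
                "EvaluateTargetHealth": true
            }
        }
    }]
}
EOF
}

# Usage (placeholder DNS name and zone ID):
# aws route53 change-resource-record-sets \
#     --hosted-zone-id Z1234567890 \
#     --change-batch "$(latency_record_batch us-west-2 dr-alb.us-west-2.elb.amazonaws.com ALB_ZONE_ID)"
```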
  

Active-Active Challenges

| Challenge | Solution |
| --- | --- |
| Data consistency | DynamoDB Global Tables, Aurora Global Database |
| Session state | Stateless apps + ElastiCache Global Datastore |
| Deployment sync | CI/CD deploys to both regions simultaneously |
| Conflict resolution | Last-writer-wins or application-level merge |

Aurora Global Database

Sub-second cross-region replication for PostgreSQL/MySQL:

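# Create the global cluster container
aws rds create-global-cluster \
    --global-cluster-identifier myapp-global \
    --engine aurora-postgresql \
    --engine-version 16.1

# Primary cluster (also needs master credentials and at least one DB instance)
aws rds create-db-cluster \
    --db-cluster-identifier myapp-primary \
    --engine aurora-postgresql \
    --engine-version 16.1 \
    --global-cluster-identifier myapp-global

# Secondary cluster in the DR region (engine version must match the global cluster)
aws rds create-db-cluster \
    --db-cluster-identifier myapp-dr \
    --engine aurora-postgresql \
    --engine-version 16.1 \
    --global-cluster-identifier myapp-global \
    --region us-west-2

# Failover to DR region (< 1 minute). The target is given as an ARN.
# --allow-data-loss requests an unplanned failover; without it the
# operation is a planned switchover that needs a healthy primary.
aws rds failover-global-cluster \
    --global-cluster-identifier myapp-global \
    --target-db-cluster-identifier arn:aws:rds:us-west-2:123:cluster:myapp-dr \
    --allow-data-loss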
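
Replication lag is what determines the real RPO at failover time, so it is worth alarming on. The sketch below creates a CloudWatch alarm on the `AuroraGlobalDBReplicationLag` metric of the secondary cluster. The function name `create_lag_alarm`, the alarm name, the evaluation settings, and the SNS topic in the usage line are assumptions, and the CLI's default region is assumed to be the secondary region.

```shell
# Hypothetical helper: alarm when Aurora global replication lag exceeds an RPO budget.
# $1 = secondary cluster identifier, $2 = RPO in seconds, $3 = SNS topic ARN
create_lag_alarm() {
    # AuroraGlobalDBReplicationLag is reported in milliseconds
    threshold_ms=$(( $2 * 1000 ))
    aws cloudwatch put-metric-alarm \
        --alarm-name "$1-global-replication-lag" \
        --namespace AWS/RDS \
        --metric-name AuroraGlobalDBReplicationLag \
        --dimensions "Name=DBClusterIdentifier,Value=$1" \
        --statistic Maximum \
        --period 60 \
        --evaluation-periods 3 \
        --threshold "$threshold_ms" \
        --comparison-operator GreaterThanThreshold \
        --alarm-actions "$3"
}

# Usage (requires AWS credentials; the topic ARN is a placeholder):
# create_lag_alarm myapp-dr 300 arn:aws:sns:us-west-2:123:dr-alerts
```

Setting the threshold well below the agreed RPO leaves time to react before the objective is actually breached.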
  

DynamoDB Global Tables

Multi-region, multi-active NoSQL with automatic replication:

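aws dynamodb create-table \
    --region us-east-1 \
    --table-name Orders \
    --attribute-definitions AttributeName=orderId,AttributeType=S \
    --key-schema AttributeName=orderId,KeyType=HASH \
    --billing-mode PAY_PER_REQUEST \
    --stream-specification StreamEnabled=true,StreamViewType=NEW_AND_OLD_IMAGES

# Add replicas with update-table (current global tables version), one region per call.
# The legacy create-global-table call requires a matching table to already exist in every region.
aws dynamodb update-table \
    --region us-east-1 \
    --table-name Orders \
    --replica-updates '[{"Create": {"RegionName": "eu-west-1"}}]'

# Once describe-table shows the new replica as ACTIVE, add the next region
aws dynamodb update-table \
    --region us-east-1 \
    --table-name Orders \
    --replica-updates '[{"Create": {"RegionName": "ap-southeast-1"}}]'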
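
After setup, it helps to check that every expected region really appears as a replica. The sketch below compares a list of expected regions against what `describe-table` reports under `Table.Replicas`. The function name `missing_replicas` is hypothetical.

```shell
# Hypothetical helper: print expected replica regions that the table does not report.
# $1 = table name, remaining arguments = expected replica regions
missing_replicas() {
    table=$1
    shift
    actual=$(aws dynamodb describe-table \
        --table-name "$table" \
        --query 'Table.Replicas[].RegionName' \
        --output text) || return 1
    # Normalize tabs and newlines to spaces so whole-word matching works
    actual=" $(echo "$actual" | tr '\t\n' '  ') "
    for region in "$@"; do
        case "$actual" in
            *" $region "*) ;;
            *) echo "$region" ;;
        esac
    done
}

# Usage (requires AWS credentials):
# missing_replicas Orders eu-west-1 ap-southeast-1
```

Empty output means every listed region was found, so a check like this could be folded into a scheduled DR health script.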
  

DR Runbook Template

Every DR strategy needs a tested runbook:

# DR Failover Runbook: Production API

## Trigger Conditions
- Primary region (us-east-1) unavailable for > 5 minutes
- RDS primary failure with Multi-AZ failover unsuccessful
- Security incident requiring region isolation

## Failover Steps
1. Confirm primary region outage (CloudWatch, Route 53 health checks)
2. Notify stakeholders via PagerDuty (#incident-dr channel)
3. Promote RDS read replica in us-west-2 (Step 3.1 below)
4. Scale ECS services in DR region to production capacity
5. Update Route 53 failover record to point to DR ALB
6. Verify application health: curl https://api.example.com/health
7. Monitor DR region metrics for 30 minutes
8. Document incident timeline for postmortem

## Rollback Steps
1. Confirm primary region restored and stable
2. Sync data from DR back to primary (if applicable)
3. Scale DR back to standby capacity
4. Update Route 53 to primary region
5. Verify and notify stakeholders

## Test Schedule
- Quarterly: Full failover drill (non-business hours)
- Monthly: Backup restore verification
- Weekly: Automated DR health check script
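
The health verification in failover step 6 is easier to trust when it retries instead of relying on a single request, since DNS changes and newly started tasks take time to settle. A minimal sketch, with the hypothetical function name `wait_healthy` and arbitrary retry settings:

```shell
# Hypothetical helper: poll a health endpoint until it responds successfully.
# $1 = URL, $2 = maximum attempts, $3 = seconds to sleep between attempts
wait_healthy() {
    attempt=1
    while [ "$attempt" -le "$2" ]; do
        # -f makes curl return non-zero on HTTP errors (4xx/5xx)
        if curl -fsS --max-time 5 -o /dev/null "$1"; then
            echo "healthy after $attempt attempt(s)"
            return 0
        fi
        attempt=$(( attempt + 1 ))
        sleep "$3"
    done
    echo "still unhealthy after $2 attempt(s)" >&2
    return 1
}

# Usage:
# wait_healthy https://api.example.com/health 30 10
```

The same function could serve as the core of the weekly automated health check by pointing it at the DR environment's own endpoint.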
  

AWS Backup (Centralized)

Manage backups across services from one console:

# Create backup vault
aws backup create-backup-vault --backup-vault-name production-vault

# Backup plan: daily with 35-day retention
aws backup create-backup-plan --backup-plan '{
    "BackupPlanName": "daily-production",
    "Rules": [{
        "RuleName": "daily-backup",
        "TargetBackupVaultName": "production-vault",
        "ScheduleExpression": "cron(0 5 ? * * *)",
        "StartWindowMinutes": 60,
        "CompletionWindowMinutes": 120,
        "Lifecycle": {"DeleteAfterDays": 35},
        "CopyActions": [{
            "Lifecycle": {"DeleteAfterDays": 35},
            "DestinationBackupVaultArn": "arn:aws:backup:us-west-2:123:backup-vault:dr-vault"
        }]
    }]
}'
  

Covers EC2, EBS, RDS, DynamoDB, EFS, S3, and more.
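
A backup plan only defines schedule and retention. Resources are assigned to it separately through a backup selection. The sketch below generates a tag-based selection document. The function name `backup_selection_json`, the selection name, the tag, and the IAM role in the usage lines are assumptions for illustration.

```shell
# Hypothetical helper: print a tag-based backup selection document.
# $1 = IAM role ARN that AWS Backup assumes, $2 = tag key, $3 = tag value
backup_selection_json() {
    cat <<EOF
{
    "SelectionName": "tagged-resources",
    "IamRoleArn": "$1",
    "ListOfTags": [{
        "ConditionType": "STRINGEQUALS",
        "ConditionKey": "$2",
        "ConditionValue": "$3"
    }]
}
EOF
}

# Usage (PLAN_ID comes from the create-backup-plan output; role and tag are placeholders):
# aws backup create-backup-selection \
#     --backup-plan-id "$PLAN_ID" \
#     --backup-selection "$(backup_selection_json arn:aws:iam::123:role/backup-role backup daily)"
```

Selecting by tag means newly created resources are picked up as soon as they carry the tag, without editing the plan.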

Real-World Scenario: Financial Services DR

| Requirement | Implementation |
| --- | --- |
| RTO: 15 minutes | Warm standby in us-west-2 |
| RPO: 5 minutes | Aurora Global Database |
| Compliance: 7-year retention | S3 Glacier Deep Archive via AWS Backup |
| Failover testing | Quarterly automated DR drill |
| Monitoring | Route 53 health checks + CloudWatch composite alarms |
| Communication | PagerDuty integration with runbook links |

Common Mistakes

  1. Backups without tested restores: untested backups are wishful thinking
  2. DR region too close to the primary: us-east-1 (N. Virginia) and us-east-2 (Ohio) are separate regions, but a more distant region such as us-west-2 further reduces the chance that one wide-area event affects both
  3. No runbook: an improvised, panic-driven failover takes far longer
  4. Ignoring data consistency: failing over to stale data can corrupt business logic
  5. DR resources not maintained: AMIs, task definitions, and IaC drift from production
  6. Never testing failover: you discover broken DR during an actual disaster

Troubleshooting DR Failures

| Issue | Cause | Fix |
| --- | --- | --- |
| Restore takes too long | Large snapshot, wrong instance class | Pre-provision DR instances; use Aurora for fast failover |
| DNS not switching | Health check still passing on failed region | Lower health check thresholds; use CloudWatch alarm-based failover |
| Data inconsistency after failover | Async replication lag | Monitor replication lag; use sync replication for critical data |
| DR region costs too high | Full active-active when warm standby suffices | Match strategy to actual RTO/RPO requirements |
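
For the "DNS not switching" case, a Route 53 health check can follow the state of a CloudWatch alarm instead of probing an endpoint. The sketch below generates such a health check configuration. The function name `alarm_health_check_config` and the alarm name in the usage lines are hypothetical.

```shell
# Hypothetical helper: print a Route 53 health check config that follows a CloudWatch alarm.
# $1 = region where the alarm lives, $2 = alarm name
alarm_health_check_config() {
    cat <<EOF
{
    "Type": "CLOUDWATCH_METRIC",
    "AlarmIdentifier": {"Region": "$1", "Name": "$2"},
    "InsufficientDataHealthStatus": "Unhealthy"
}
EOF
}

# Usage (the alarm name is a placeholder):
# aws route53 create-health-check \
#     --caller-reference "dr-alarm-$(date +%s)" \
#     --health-check-config "$(alarm_health_check_config us-east-1 api-5xx-rate-high)"
```

Treating insufficient data as unhealthy is a design choice: if the primary region stops reporting metrics altogether, the record fails over rather than staying put.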

Best Practices

  • Define RTO/RPO with business stakeholders, not engineers alone
  • Test failover quarterly — automate where possible
  • Store IaC templates ready to deploy in DR region
  • Replicate AMIs/containers to DR region continuously
  • Use Route 53 health checks with automatic failover
  • Maintain DR runbooks with step-by-step commands
  • Monitor replication lag (RDS, S3 CRR, DynamoDB) with alarms
  • Include DR costs in budget planning — DR is insurance, not waste

This completes the AWS Expert Deep Dives track. Review the AWS index for the full learning path.