Disaster Recovery on AWS

Disaster recovery (DR) keeps your business operating when infrastructure fails, whether from hardware failure, natural disaster, human error, or cyberattack. AWS provides the tools to implement DR strategies ranging from simple backups to active-active multi-region architectures. The real test is whether you can restore, not merely whether you back up.

DR Core Metrics

Metric	Definition	Example
RTO (Recovery Time Objective)	Max acceptable downtime	4 hours
RPO (Recovery Point Objective)	Max acceptable data loss	15 minutes
MTTR (Mean Time to Recovery)	Average time to restore	2 hours
MTBF (Mean Time Between Failures)	Average time between incidents	720 hours

Lower RTO/RPO = higher cost and complexity. Match your strategy to business requirements, not arbitrary zero-downtime goals.
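
The MTBF and MTTR examples in the table can be combined into a rough availability figure with the common approximation availability = MTBF / (MTBF + MTTR). The sketch below is illustrative only: the function names are hypothetical, and the formula assumes failures and repairs average out over a long period.

```shell
#!/usr/bin/env bash
# Steady-state availability estimate: MTBF / (MTBF + MTTR).
# Function names are illustrative, not part of any AWS tooling.

# availability_pct MTBF_HOURS MTTR_HOURS -> percentage with 3 decimals
availability_pct() {
    awk -v mtbf="$1" -v mttr="$2" \
        'BEGIN { printf "%.3f\n", 100 * mtbf / (mtbf + mttr) }'
}

# downtime_hours MTBF_HOURS MTTR_HOURS PERIOD_HOURS -> expected downtime
downtime_hours() {
    awk -v mtbf="$1" -v mttr="$2" -v period="$3" \
        'BEGIN { printf "%.1f\n", period * mttr / (mtbf + mttr) }'
}

# Example values from the table: MTBF 720 hours, MTTR 2 hours.
availability_pct 720 2        # prints 99.723
downtime_hours 720 2 8760     # prints 24.3 (hours per 8760-hour year)
```

Roughly 24 hours of expected downtime per year is a useful number to put in front of stakeholders when agreeing on an RTO.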

DR Strategy Comparison

Strategy	RTO	RPO	Cost	Complexity
Backup & Restore	Hours–Days	Hours	$	Low
Pilot Light	10–30 min	Minutes	$$	Medium
Warm Standby	Minutes	Minutes	$$$	Medium-High
Active-Active	Near-zero	Near-zero	$$$$	High

Backup & Restore

The simplest strategy — regular backups with documented restore procedures:

# Automated RDS backups (enable at creation)
aws rds create-db-instance \
    --backup-retention-period 35 \
    --preferred-backup-window "03:00-04:00" \
    ...

# Cross-region automated backup copy (call this in the destination region)
aws rds start-db-instance-automated-backups-replication \
    --region us-west-2 \
    --source-db-instance-arn arn:aws:rds:us-east-1:123:db:myapp-postgres \
    --backup-retention-period 35 \
    --kms-key-id arn:aws:kms:us-west-2:123:key/xxx

# Manual snapshot before major changes
aws rds create-db-snapshot \
    --db-instance-identifier myapp-postgres \
    --db-snapshot-identifier pre-migration-$(date +%Y%m%d)

# Restore to new instance in DR region (the snapshot must already exist there)
aws rds restore-db-instance-from-db-snapshot \
    --region us-west-2 \
    --db-instance-identifier myapp-postgres-dr \
    --db-snapshot-identifier arn:aws:rds:us-west-2:123:snapshot:myapp-postgres-auto-xxx \
    --db-instance-class db.r6g.large \
    --vpc-security-group-ids sg-dr-database \
    --db-subnet-group-name dr-db-subnet-group
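
A restore only counts once the new instance is usable, so it can help to time the wait against the RTO. The following is a minimal sketch under assumptions: `verify_restore` is a hypothetical helper, the 4-hour budget in the usage comment is the example RTO from the metrics table, and only the `aws rds wait db-instance-available` call comes from the AWS CLI.

```shell
#!/usr/bin/env bash
# verify_restore INSTANCE_ID REGION RTO_SECONDS
# Waits for a restored RDS instance to become available and reports whether
# the wait fit inside the RTO budget. The function name and budget handling
# are illustrative; only the "aws rds wait" call is AWS CLI.
verify_restore() {
    instance_id="$1"
    region="$2"
    rto_seconds="$3"
    started=$(date +%s)

    # Polls describe-db-instances until the instance status is "available".
    aws rds wait db-instance-available \
        --db-instance-identifier "$instance_id" \
        --region "$region" || return 1

    elapsed=$(( $(date +%s) - started ))
    echo "${instance_id} available after ${elapsed}s (budget ${rto_seconds}s)"
    [ "$elapsed" -le "$rto_seconds" ]
}

# Usage (not executed here): call right after issuing the restore, e.g.
#   verify_restore myapp-postgres-dr us-west-2 14400   # 4-hour RTO
```

The waiter gives up after a fixed number of polls, so a very large restore may need the call repeated. The timer also covers only the wait itself, not the time spent deciding to fail over.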

S3 Cross-Region Replication

# Enable versioning on the source bucket (replication requires it on the destination bucket too)
aws s3api put-bucket-versioning \
    --bucket myapp-assets \
    --versioning-configuration Status=Enabled

# Create replication rule
aws s3api put-bucket-replication \
    --bucket myapp-assets \
    --replication-configuration '{
        "Role": "arn:aws:iam::123:role/s3-replication-role",
        "Rules": [{
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {},
            "Destination": {
                "Bucket": "arn:aws:s3:::myapp-assets-dr",
                "ReplicationTime": {"Status": "Enabled", "Time": {"Minutes": 15}},
                "Metrics": {"Status": "Enabled", "EventThreshold": {"Minutes": 15}}
            },
            "DeleteMarkerReplication": {"Status": "Enabled"}
        }]
    }'
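
Once a rule is in place, one way to spot-check it is to read the replication status that S3 reports on a source object. This is a sketch, not a monitoring solution: `replication_status` is a hypothetical helper and `logo.png` in the usage comment is a made-up key.

```shell
#!/usr/bin/env bash
# replication_status BUCKET KEY
# Prints the source object's replication status and succeeds only when the
# object has finished replicating. The function name is illustrative.
replication_status() {
    bucket="$1"
    key="$2"
    status=$(aws s3api head-object \
        --bucket "$bucket" \
        --key "$key" \
        --query ReplicationStatus \
        --output text) || return 2

    echo "${key}: ${status}"
    case "$status" in
        COMPLETED|COMPLETE) return 0 ;;   # replica written to the destination
        *)                  return 1 ;;   # PENDING, FAILED, or not covered by a rule
    esac
}

# Usage (not executed here):
#   replication_status myapp-assets logo.png
```

For continuous visibility, the replication metrics enabled in the rule above are a better fit than polling individual objects.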

Pilot Light

Minimal resources running in DR region — scale up on failover:

Component	Primary (us-east-1)	DR (us-west-2)
RDS	db.r6g.large Multi-AZ	Cross-region read replica (promote on failover)
EC2/ECS	4 tasks running	Task definition registered, 0 tasks
AMIs/Containers	Active	Replicated to DR region ECR
VPC	Full production	Pre-configured, minimal resources
Route 53	Active routing	Failover record (standby)

# Promote RDS read replica to standalone (failover); run against the DR region
aws rds promote-read-replica \
    --region us-west-2 \
    --db-instance-identifier myapp-postgres-dr-replica

# Scale ECS service from 0 to production capacity
aws ecs update-service \
    --region us-west-2 \
    --cluster dr-production \
    --service myapp-service \
    --desired-count 4
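
The two failover commands depend on each other: the service should not scale up until the database is writable. A hedged sketch of one way to chain them, with a dry-run mode for drills, follows. The `run` and `pilot_light_failover` helpers and the `DRY_RUN` variable are hypothetical; the `aws` subcommands and waiters are standard AWS CLI.

```shell
#!/usr/bin/env bash
# run CMD...: execute the command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "DRY-RUN: $*"
    else
        "$@"
    fi
}

# pilot_light_failover REPLICA_ID CLUSTER SERVICE DESIRED_COUNT REGION
# Runs the failover steps in order and stops at the first failure.
pilot_light_failover() {
    replica="$1"; cluster="$2"; service="$3"; count="$4"; region="$5"

    run aws rds promote-read-replica \
            --db-instance-identifier "$replica" --region "$region" &&
    run aws rds wait db-instance-available \
            --db-instance-identifier "$replica" --region "$region" &&
    run aws ecs update-service \
            --cluster "$cluster" --service "$service" \
            --desired-count "$count" --region "$region" &&
    run aws ecs wait services-stable \
            --cluster "$cluster" --services "$service" --region "$region"
}

# Rehearse the sequence without touching AWS:
DRY_RUN=1 pilot_light_failover myapp-postgres-dr-replica dr-production myapp-service 4 us-west-2
```

One caveat of this sketch: `wait db-instance-available` can return immediately if the instance status has not yet changed after the promote call, so a real runbook would likely add a check that the instance is no longer a replica.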

Warm Standby

Reduced-capacity DR environment always running — faster failover than pilot light:

# DR region runs at 20-30% capacity
# Primary: 10 EC2 instances → DR: 2-3 instances always running

# On failover, scale DR ASG to full capacity
# (the group's max-size must already be at least 10, or pass --max-size as well)
aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name dr-web-asg \
    --desired-capacity 10 \
    --min-size 10
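
The 20-30% sizing above is simple arithmetic, but rounding matters for small fleets. A minimal sketch, assuming a hypothetical `standby_capacity` helper that rounds up so the standby never drops to zero instances:

```shell
#!/usr/bin/env bash
# standby_capacity PRODUCTION_COUNT PERCENT
# Integer ceiling of PRODUCTION_COUNT * PERCENT / 100, so a small fleet
# never ends up with zero standby instances. The name is illustrative.
standby_capacity() {
    echo $(( ($1 * $2 + 99) / 100 ))
}

standby_capacity 10 20   # prints 2
standby_capacity 10 30   # prints 3
```

The same value can be passed as `--desired-capacity` and `--min-size` when scaling the DR group back down after a failover.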

Active-Active Multi-Region

Full capacity in multiple regions with traffic distributed:

# Route 53 latency-based routing
aws route53 change-resource-record-sets \
    --hosted-zone-id Z1234567890 \
    --change-batch '{
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "api.example.com",
                "Type": "A",
                "SetIdentifier": "us-east-1",
                "Region": "us-east-1",
                "AliasTarget": {
                    "HostedZoneId": "Z35SXDOTRQ7X7K",
                    "DNSName": "primary-alb.us-east-1.elb.amazonaws.com",
                    "EvaluateTargetHealth": true
                }
            }
        }]
    }'

Active-Active Challenges

Challenge	Solution
Data consistency	DynamoDB Global Tables, Aurora Global Database
Session state	Stateless apps + ElastiCache Global Datastore
Deployment sync	CI/CD deploys to both regions simultaneously
Conflict resolution	Last-writer-wins or application-level merge

Aurora Global Database

Sub-second cross-region replication for PostgreSQL/MySQL:

aws rds create-global-cluster \
    --global-cluster-identifier myapp-global \
    --engine aurora-postgresql \
    --engine-version 16.1

# Primary cluster (master credentials and DB instances omitted for brevity)
aws rds create-db-cluster \
    --region us-east-1 \
    --db-cluster-identifier myapp-primary \
    --engine aurora-postgresql \
    --engine-version 16.1 \
    --global-cluster-identifier myapp-global

# Secondary cluster in the DR region
aws rds create-db-cluster \
    --region us-west-2 \
    --db-cluster-identifier myapp-dr \
    --engine aurora-postgresql \
    --engine-version 16.1 \
    --global-cluster-identifier myapp-global

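# Failover to DR region during a regional outage (typically < 1 minute).
# The target must be given as an ARN; --allow-data-loss marks this as an
# unplanned failover, so writes not yet replicated are lost.
aws rds failover-global-cluster \
    --global-cluster-identifier myapp-global \
    --target-db-cluster-identifier arn:aws:rds:us-west-2:123:cluster:myapp-dr \
    --allow-data-loss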

DynamoDB Global Tables

Multi-region, multi-active NoSQL with automatic replication:

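aws dynamodb create-table \
    --region us-east-1 --table-name Orders \
    --attribute-definitions AttributeName=orderId,AttributeType=S \
    --key-schema AttributeName=orderId,KeyType=HASH \
    --billing-mode PAY_PER_REQUEST \
    --stream-specification StreamEnabled=true,StreamViewType=NEW_AND_OLD_IMAGES
aws dynamodb wait table-exists --region us-east-1 --table-name Orders

# Add replicas with update-table (current global tables version, 2019.11.21).
# create-global-table is the legacy API and expects the table to already exist in every region.
aws dynamodb update-table \
    --region us-east-1 --table-name Orders \
    --replica-updates '[{"Create": {"RegionName": "eu-west-1"}}]'

# Once the eu-west-1 replica is ACTIVE (check with describe-table), add the next one
aws dynamodb update-table \
    --region us-east-1 --table-name Orders \
    --replica-updates '[{"Create": {"RegionName": "ap-southeast-1"}}]'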

DR Runbook Template

Every DR strategy needs a tested runbook:

# DR Failover Runbook: Production API

## Trigger Conditions
- Primary region (us-east-1) unavailable for > 5 minutes
- RDS primary failure with Multi-AZ failover unsuccessful
- Security incident requiring region isolation

## Failover Steps
1. Confirm primary region outage (CloudWatch, Route 53 health checks)
2. Notify stakeholders via PagerDuty (#incident-dr channel)
3. Promote RDS read replica in us-west-2 (aws rds promote-read-replica; see Pilot Light)
4. Scale ECS services in DR region to production capacity
5. Update Route 53 failover record to point to DR ALB
6. Verify application health: curl https://api.example.com/health
7. Monitor DR region metrics for 30 minutes
8. Document incident timeline for postmortem

## Rollback Steps
1. Confirm primary region restored and stable
2. Sync data from DR back to primary (if applicable)
3. Scale DR back to standby capacity
4. Update Route 53 to primary region
5. Verify and notify stakeholders

## Test Schedule
- Quarterly: Full failover drill (non-business hours)
- Monthly: Backup restore verification
- Weekly: Automated DR health check script
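
Step 6 of the failover steps, and the weekly automated check in the test schedule, both come down to polling the health endpoint until it answers. A minimal sketch follows, assuming a hypothetical `wait_for_healthy` helper; the URL is the one from the runbook, while the attempt count and delay are arbitrary illustrative values.

```shell
#!/usr/bin/env bash
# wait_for_healthy URL ATTEMPTS DELAY_SECONDS
# Polls URL until curl gets a successful HTTP response, or gives up after
# ATTEMPTS tries. The function name and defaults are illustrative.
wait_for_healthy() {
    url="$1"
    attempts="$2"
    delay="$3"
    i=1
    while [ "$i" -le "$attempts" ]; do
        # -f makes curl fail on HTTP errors; --max-time bounds each try.
        if curl -fsS --max-time 5 "$url" >/dev/null 2>&1; then
            echo "healthy after $i attempt(s)"
            return 0
        fi
        i=$(( i + 1 ))
        sleep "$delay"
    done
    echo "still unhealthy after $attempts attempt(s)" >&2
    return 1
}

# Usage (not executed here): up to 30 tries, 10 seconds apart
#   wait_for_healthy https://api.example.com/health 30 10
```

The non-zero exit on failure lets the same function gate later runbook steps or fail a scheduled job.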

AWS Backup (Centralized)

Manage backups across services from one console:

# Create backup vault
aws backup create-backup-vault --backup-vault-name production-vault

# Backup plan: daily with 35-day retention
aws backup create-backup-plan --backup-plan '{
    "BackupPlanName": "daily-production",
    "Rules": [{
        "RuleName": "daily-backup",
        "TargetBackupVaultName": "production-vault",
        "ScheduleExpression": "cron(0 5 ? * * *)",
        "StartWindowMinutes": 60,
        "CompletionWindowMinutes": 120,
        "Lifecycle": {"DeleteAfterDays": 35},
        "CopyActions": [{
            "Lifecycle": {"DeleteAfterDays": 35},
            "DestinationBackupVaultArn": "arn:aws:backup:us-west-2:123:backup-vault:dr-vault"
        }]
    }]
}'

Covers EC2, EBS, RDS, DynamoDB, EFS, S3, and more.

Real-World Scenario: Financial Services DR

Requirement	Implementation
RTO: 15 minutes	Warm standby in us-west-2
RPO: 5 minutes	Aurora Global Database
Compliance: 7-year retention	AWS Backup lifecycle to cold storage; S3 Glacier Deep Archive for S3 data
Failover testing	Quarterly automated DR drill
Monitoring	Route 53 health checks + CloudWatch composite alarms
Communication	PagerDuty integration with runbook links

Common Mistakes

- Backups without tested restores: an untested backup is wishful thinking
- DR region too close to the primary: us-east-1 and us-east-2 are independent regions but can be exposed to the same large-scale events; a distant region such as us-west-2 lowers correlated risk
- No runbook: improvised failover under pressure takes far longer
- Ignoring data consistency: failing over to stale data corrupts business logic
- DR resources not maintained: AMIs, task definitions, and IaC drift from production
- Never testing failover: you discover broken DR during an actual disaster

Troubleshooting DR Failures

Issue	Cause	Fix
Restore takes too long	Large snapshot, wrong instance class	Pre-provision DR instances; use Aurora for fast failover
DNS not switching	Health check still passing on failed region	Lower health check thresholds; use CloudWatch alarm-based failover
Data inconsistency after failover	Async replication lag	Monitor replication lag; use sync replication for critical data
DR region costs too high	Full active-active when warm standby suffices	Match strategy to actual RTO/RPO requirements
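
For the replication-lag row, one option is a CloudWatch alarm on the RDS `ReplicaLag` metric (reported in seconds) with the RPO as the threshold. The sketch below makes several assumptions: `create_replica_lag_alarm` is a hypothetical wrapper, the 5-minute evaluation window is arbitrary, and the SNS topic ARN in the usage comment is a placeholder.

```shell
#!/usr/bin/env bash
# create_replica_lag_alarm REPLICA_ID RPO_SECONDS SNS_TOPIC_ARN REGION
# Alarms when the replica's maximum lag stays above the RPO for 5 minutes.
# Names, threshold choice, and evaluation window are illustrative.
create_replica_lag_alarm() {
    replica="$1"; rpo_seconds="$2"; topic_arn="$3"; region="$4"

    aws cloudwatch put-metric-alarm \
        --region "$region" \
        --alarm-name "${replica}-replica-lag" \
        --namespace AWS/RDS \
        --metric-name ReplicaLag \
        --dimensions "Name=DBInstanceIdentifier,Value=${replica}" \
        --statistic Maximum \
        --period 60 \
        --evaluation-periods 5 \
        --threshold "$rpo_seconds" \
        --comparison-operator GreaterThanThreshold \
        --treat-missing-data breaching \
        --alarm-actions "$topic_arn"
}

# Usage (not executed here): 15-minute RPO, placeholder topic ARN
#   create_replica_lag_alarm myapp-postgres-dr-replica 900 \
#       arn:aws:sns:us-west-2:123:dr-alerts us-west-2
```

Treating missing data as breaching is a deliberate choice here: a replica that stops reporting lag is more likely broken than healthy.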

Best Practices

- Define RTO/RPO with business stakeholders, not engineers alone
- Test failover quarterly and automate where possible
- Store IaC templates ready to deploy in DR region
- Replicate AMIs/containers to DR region continuously
- Use Route 53 health checks with automatic failover
- Maintain DR runbooks with step-by-step commands
- Monitor replication lag (RDS, S3 CRR, DynamoDB) with alarms
- Include DR costs in budget planning: DR is insurance, not waste

This completes the AWS Expert Deep Dives track. Review the AWS index for the full learning path.
