Disaster recovery (DR) keeps your business operating when infrastructure fails, whether from hardware failure, natural disaster, human error, or cyberattack. Azure provides tools to implement DR strategies ranging from simple backups to active-active multi-region architectures. The critical question is whether you can actually restore, not just back up.

DR Core Metrics

| Metric | Definition | Example |
|---|---|---|
| RTO (Recovery Time Objective) | Max acceptable downtime | 4 hours |
| RPO (Recovery Point Objective) | Max acceptable data loss | 15 minutes |
| MTTR (Mean Time to Recovery) | Average time to restore | 2 hours |
| MTBF (Mean Time Between Failures) | Average time between incidents | 720 hours |

Lower RTO/RPO means higher cost and complexity. Match your strategy to business requirements — not arbitrary zero-downtime goals for every workload.
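
The MTBF and MTTR figures in the table can be combined into an expected availability using the standard steady-state formula, availability = MTBF / (MTBF + MTTR). The following is a minimal sketch using the example values above; the helper name `availability_pct` is hypothetical, and `awk` is assumed to be installed.

```shell
#!/usr/bin/env bash
# Steady-state availability as a percentage: MTBF / (MTBF + MTTR) * 100.
# availability_pct is an illustrative helper, not an Azure tool.
availability_pct() {
  local mtbf_hours="$1" mttr_hours="$2"
  awk -v mtbf="$mtbf_hours" -v mttr="$mttr_hours" \
    'BEGIN { printf "%.3f\n", mtbf / (mtbf + mttr) * 100 }'
}

# Example values from the metrics table: MTBF 720 h, MTTR 2 h.
availability_pct 720 2   # prints 99.723
```

With these example numbers the workload would sit just under "three nines", which is one way to sanity-check whether a stated RTO is realistic for a given failure rate.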

DR Strategy Comparison

| Strategy | RTO | RPO | Cost | Complexity |
|---|---|---|---|---|
| Backup & Restore | Hours–Days | Hours | $ | Low |
| Pilot Light | 10–30 min | Minutes | $$ | Medium |
| Warm Standby | Minutes | Minutes | $$$ | Medium-High |
| Active-Active | Near-zero | Near-zero | $$$$ | High |

Backup & Restore

The simplest strategy — regular backups with documented, tested restore procedures.

Azure Backup for VMs

# Create Recovery Services vault
az backup vault create \
  --resource-group rg-dr-prod \
  --name rsv-webapp-prod \
  --location eastus

# Enable VM backup with the built-in DefaultPolicy (daily backup, 30-day retention)
az backup protection enable-for-vm \
  --resource-group rg-dr-prod \
  --vault-name rsv-webapp-prod \
  --vm vm-web-01 \
  --policy-name DefaultPolicy

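# Create custom backup policy (daily at 02:00 UTC, 30-day retention)
# The JSON below is abbreviated; start from the output of
# "az backup policy get-default-for-vm" to get the full schema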
az backup policy create \
  --resource-group rg-dr-prod \
  --vault-name rsv-webapp-prod \
  --name daily-30day-retention \
  --backup-management-type AzureIaasVM \
  --policy '{
    "schedulePolicy": {"scheduleRunFrequency": "Daily", "scheduleRunTimes": ["2025-01-01T02:00:00Z"]},
    "retentionPolicy": {"dailySchedule": {"retentionTimes": ["2025-01-01T02:00:00Z"], "retentionDuration": {"count": 30, "durationType": "Days"}}}
  }'

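# Find the most recent recovery point for the VM
az backup recoverypoint list \
  --resource-group rg-dr-prod \
  --vault-name rsv-webapp-prod \
  --backup-management-type AzureIaasVM \
  --container-name vm-web-01 --item-name vm-web-01 \
  --query "[0].name" -o tsv

# Restore the VM's disks from that recovery point (replace RECOVERY_POINT_NAME
# with the value returned above), then create the VM from the restored disks
az backup restore restore-disks \
  --resource-group rg-dr-prod \
  --vault-name rsv-webapp-prod \
  --container-name vm-web-01 \
  --item-name vm-web-01 \
  --rp-name RECOVERY_POINT_NAME \
  --storage-account strestoretemp001 \
  --target-resource-group rg-dr-prod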

Azure SQL Backup and Geo-Restore

# Azure SQL automatic backups (enabled by default)
# Point-in-time restore within retention period
az sql db restore \
  --dest-name db-webapp-restored \
  --edition Standard \
  --service-objective S2 \
  --resource-group rg-webapp-prod \
  --server sql-webapp-prod \
  --name db-webapp \
  --time "2025-06-01T14:30:00Z"

# Geo-restore to another region (for example, after a regional outage)
# List available geo-backup IDs with "az sql db geo-backup list"
az sql db geo-backup restore \
  --dest-database db-webapp-dr \
  --dest-server sql-webapp-dr \
  --resource-group rg-webapp-dr \
  --geo-backup-id /subscriptions/SUB_ID/resourceGroups/rg-webapp-prod/providers/Microsoft.Sql/servers/sql-webapp-prod/recoverableDatabases/db-webapp

Storage Geo-Redundancy

# Verify GZRS on production storage
az storage account show \
  --name stwebappprod001 \
  --resource-group rg-webapp-prod \
  --query "{SKU:sku.name, Location:primaryLocation, Secondary:secondaryLocation}"

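# Fail over the storage account to the secondary region (customer-managed).
# Writes made after the last sync time are lost, and the account becomes LRS
# in the new primary region until geo-redundancy is re-enabled.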
az storage account failover \
  --name stwebappprod001 \
  --resource-group rg-webapp-prod
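
Because an unplanned storage failover discards anything written after the last successful sync, it can help to compare the replication lag with the RPO before failing over. The sketch below is one possible way to do that, not a definitive implementation: it assumes GNU `date` (for the `-d` option) and a logged-in Azure CLI, and the helper names `last_sync_time`, `seconds_between`, and `within_rpo` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: estimate potential data loss before a storage account failover.
# Assumes GNU date and a logged-in Azure CLI; helper names are illustrative.

# Print the account's last geo-replication sync time (ISO 8601).
last_sync_time() {
  local account="$1" resource_group="$2"
  az storage account show \
    --name "$account" \
    --resource-group "$resource_group" \
    --expand geoReplicationStats \
    --query "geoReplicationStats.lastSyncTime" -o tsv
}

# Print the whole seconds elapsed between two timestamps.
seconds_between() {
  local earlier="$1" later="$2"
  echo $(( $(date -u -d "$later" +%s) - $(date -u -d "$earlier" +%s) ))
}

# Succeed (exit 0) only if the replication lag is within the RPO (in seconds).
within_rpo() {
  local last_sync="$1" now="$2" rpo_seconds="$3"
  [ "$(seconds_between "$last_sync" "$now")" -le "$rpo_seconds" ]
}

# Usage (15-minute RPO from the metrics table):
#   sync="$(last_sync_time stwebappprod001 rg-webapp-prod)"
#   within_rpo "$sync" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" 900 || echo "Lag exceeds RPO"
```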
  

Pilot Light

Minimal resources running in the DR region — scale up on failover:

| Component | Primary (East US) | DR (West US) |
|---|---|---|
| App Service | P1v3, 3 instances | B1, 1 instance (minimal) |
| Azure SQL | Business Critical | Geo-replica (readable secondary) |
| Storage | GZRS | RA-GZRS (read access) |
| ACR | Active registry | Geo-replicated replica |
| Front Door | Active routing | Same profile, secondary origin (disabled) |

# Create geo-replica for Azure SQL
az sql db replica create \
  --resource-group rg-webapp-prod \
  --server sql-webapp-prod \
  --name db-webapp \
  --partner-server sql-webapp-dr \
  --partner-resource-group rg-webapp-dr

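# Create failover group (listener endpoints stay the same after failover).
# --grace-period is in hours (minimum 1): with the Automatic policy, Azure waits
# at least that long before failing over with possible data loss. For a lower
# RTO, trigger the failover yourself with "az sql failover-group set-primary".
az sql failover-group create \
  --name fg-webapp \
  --resource-group rg-webapp-prod \
  --server sql-webapp-prod \
  --partner-server sql-webapp-dr \
  --partner-resource-group rg-webapp-dr \
  --failover-policy Automatic \
  --grace-period 1 \
  --add-db db-webapp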

On failover: scale up the App Service plan in the DR region, enable the Front Door secondary origin, and promote the DR server to primary in the SQL failover group (or let the automatic policy do so once the grace period expires).
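
The failover steps could be captured in a small runbook script along the following lines. This is a sketch under stated assumptions rather than a tested procedure: resource names are borrowed from this article's examples, the DR plan name `plan-webapp-dr`, the target SKU, and the worker count are assumptions, and the `run` and `pilot_light_failover` helpers are hypothetical. Setting `DRY_RUN=1` prints the commands instead of executing them.

```shell
#!/usr/bin/env bash
# Sketch of a pilot-light failover runbook. Names, SKU, and instance count are
# assumptions; adjust them to your environment before use.

# Execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "DRY-RUN: $*"
  else
    "$@"
  fi
}

pilot_light_failover() {
  # 1. Scale the DR App Service plan up to production size.
  run az appservice plan update \
    --name plan-webapp-dr --resource-group rg-webapp-dr \
    --sku P1v3 --number-of-workers 3 || return 1

  # 2. Promote the DR SQL server to primary in the failover group.
  #    Add --allow-data-loss to force failover when the primary is unreachable.
  run az sql failover-group set-primary \
    --name fg-webapp --resource-group rg-webapp-dr \
    --server sql-webapp-dr || return 1

  # 3. Enable the secondary origin so Front Door routes traffic to the DR region.
  run az afd origin update \
    --origin-name origin-westus --origin-group-name og-webapp \
    --profile-name fd-webapp-prod --resource-group rg-webapp-prod \
    --enabled-state Enabled || return 1
}

# Usage: DRY_RUN=1 pilot_light_failover
```

Stopping at the first failed step keeps the script from routing traffic to a region whose database or application tier is not ready.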

Warm Standby

Reduced-capacity DR environment always running — faster failover than pilot light:

# DR region App Service always running at reduced capacity (1 of 3 production instances)
az appservice plan create \
  --name plan-webapp-dr \
  --resource-group rg-webapp-dr \
  --location westus \
  --sku P1v3 \
  --number-of-workers 1

# On failover, scale to production capacity
az appservice plan update \
  --name plan-webapp-dr \
  --resource-group rg-webapp-dr \
  --number-of-workers 3
  

Azure Site Recovery (ASR)

Replicate VMs, physical servers, and VMware/Hyper-V workloads to Azure for orchestrated failover:

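# Create a Recovery Services vault for ASR in the target (DR) region.
# For Azure-to-Azure replication the vault cannot be in the source region.
az backup vault create \
  --resource-group rg-dr-prod \
  --name asr-vault-prod \
  --location westus

# Enable replication for a VM (via Portal or Recovery Services API)
# ASR handles: initial replication, delta sync, failover, failback

# Test failover (non-disruptive validation)
# NOTE: illustrative invocation only. The command group and flag names depend
# on the site-recovery CLI extension version; confirm them with
# "az site-recovery --help", or run the test failover from the portal.
az site-recovery test-failover \
  --resource-group rg-dr-prod \
  --vault-name asr-vault-prod \
  --fabric-name azure-eastus \
  --protection-container mapped \
  --recovery-container azure-westus \
  --recovery-provider azure \
  --replication-protected-item vm-web-01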

ASR supports test failover without impacting production — validate DR procedures regularly.
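
One way to make regular drills measurable is to time each drill step and compare the result with the RTO. The sketch below assumes the drill is wrapped in a script or command you supply; `time_drill` and `./run-test-failover.sh` are hypothetical names, and timing uses whole seconds from bash's built-in `SECONDS` counter.

```shell
#!/usr/bin/env bash
# Sketch: time a DR drill step and compare it with the RTO.
# Usage: time_drill <rto_seconds> <command> [args...]
time_drill() {
  local rto_seconds="$1"; shift
  local start=$SECONDS
  "$@"
  local status=$?
  local elapsed=$(( SECONDS - start ))
  if [ "$status" -ne 0 ]; then
    echo "FAIL: drill command exited with status $status after ${elapsed}s"
    return 1
  elif [ "$elapsed" -gt "$rto_seconds" ]; then
    echo "FAIL: ${elapsed}s exceeds RTO of ${rto_seconds}s"
    return 1
  fi
  echo "PASS: ${elapsed}s within RTO of ${rto_seconds}s"
}

# Usage (hypothetical runbook script, 15-minute RTO):
#   time_drill 900 ./run-test-failover.sh
```

Recording the PASS/FAIL line for each drill gives a simple history to review when RTO targets or procedures change.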

Active-Active Multi-Region

Both regions serve traffic simultaneously — highest availability and cost:

| Component | East US | West US |
|---|---|---|
| App Service | 3 instances (active) | 3 instances (active) |
| Azure SQL | Primary (read-write) | Geo-replica (read-only) |
| Front Door | Origin 1 (priority 1) | Origin 2 (priority 1) |
| Cosmos DB | Multi-region write | Multi-region write |
| Storage | GZRS primary | RA-GZRS read |

# Front Door with active-active origins
az afd origin create \
  --origin-name origin-westus \
  --origin-group-name og-webapp \
  --profile-name fd-webapp-prod \
  --resource-group rg-webapp-prod \
  --host-name my-webapp-dr.azurewebsites.net \
  --origin-host-header my-webapp-dr.azurewebsites.net \
  --enabled-state Enabled \
  --priority 1 \
  --weight 1000
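
With equal priorities, Front Door only sends traffic to origins that pass its health probes, so it is worth probing both origins directly before relying on active-active routing. The following sketch assumes each app exposes a `/health` endpoint that returns HTTP 200 and that `curl` is installed; the primary hostname `my-webapp.azurewebsites.net` and the helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: probe each origin's health endpoint directly.
# The /health path and the primary hostname are assumptions.

# Succeed only if https://<host>/health answers with HTTP 200 within 5 seconds.
origin_healthy() {
  local host="$1" code
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "https://${host}/health")"
  [ "$code" = "200" ]
}

# Print one status line per origin; return non-zero if any origin is unhealthy.
check_origins() {
  local host failed=0
  for host in "$@"; do
    if origin_healthy "$host"; then
      echo "OK   $host"
    else
      echo "DOWN $host"
      failed=1
    fi
  done
  return "$failed"
}

# Usage:
#   check_origins my-webapp.azurewebsites.net my-webapp-dr.azurewebsites.net
```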
  

Real-World Scenario: Financial Services DR Plan

| Area | Approach |
|---|---|
| Requirement | RTO: 15 min, RPO: 5 min |
| Strategy | Warm standby with auto-failover |
| Database | SQL failover group, failed over by runbook (the automatic policy waits a grace period of at least 1 hour) |
| Application | Front Door health probes + 2 regions |
| Storage | GZRS with object replication |
| VMs (legacy) | ASR continuous replication |
| Testing | Quarterly test failover via ASR |
| Documentation | Runbook with step-by-step failover/failback |
| Compliance | 7-year backup retention in Archive tier |

DR Service Comparison

| Service | Protects | RTO | Best For |
|---|---|---|---|
| Azure Backup | VMs, files, SQL, SAP HANA | Hours | Point-in-time restore |
| Azure Site Recovery | VMs, VMware, physical | Minutes | Orchestrated VM failover |
| SQL Failover Group | Azure SQL databases | About a minute once failover starts (the automatic policy waits a grace period of at least 1 hour) | Database auto-failover |
| Geo-redundant Storage | Blob, file data | Typically under an hour (customer-managed failover) | Data durability |
| Front Door / Traffic Manager | Application routing | Seconds | Traffic redirection |

Common Mistakes

  1. Backups without tested restores: corruption or gaps surface only during an actual disaster
  2. No documented runbooks: engineers improvise under pressure, extending RTO
  3. Same region for DR resources: a regional outage takes down both primary and “DR”
  4. Ignoring DNS TTL: long TTLs delay traffic redirection after failover
  5. Never testing failover: untested DR plans fail when needed most
  6. Over-engineering DR for dev/test: LRS backups are usually sufficient for non-production

Troubleshooting

| Issue | Diagnosis | Fix |
|---|---|---|
| Backup job failed | VM agent issue or snapshot conflict | Check guest OS agent; retry backup |
| ASR replication lag | Network bandwidth or disk churn | Increase replication bandwidth; check throttling |
| SQL failover group not failing over | Grace period or manual policy | Verify automatic policy; check grace period setting |
| Geo-restore not available | Backup not yet replicated to secondary | Wait for geo-replication (typically < 1 hour) |
| Front Door not failing over | Origin health probe failing | Verify /health endpoint returns 200 in DR region |

# Check SQL failover group status
az sql failover-group show \
  --name fg-webapp \
  --resource-group rg-webapp-prod \
  --server sql-webapp-prod \
  --query "{State:replicationState, Role:replicationRole, Databases:databases}" -o json

# List recent backup jobs with their status (failed jobs included)
az backup job list \
  --resource-group rg-dr-prod \
  --vault-name rsv-webapp-prod \
  --query "[].{VM:properties.entityFriendlyName, Status:properties.status, Time:properties.endTime}" \
  -o table
  

Best Practices

  • Define RTO/RPO per workload — not every system needs 99.99% availability
  • Test DR procedures quarterly with documented results and improvement actions
  • Use Azure paired regions for geo-redundancy (East US ↔ West US)
  • Automate failover where possible (SQL failover groups, Front Door health probes)
  • Maintain runbooks with step-by-step instructions, contacts, and decision trees
  • Monitor replication health with alerts on lag or failed backup jobs
  • Include DR in architecture reviews and cost planning — DR resources have ongoing cost
  • Implement immutability (Azure Backup soft delete, blob immutability) against ransomware

This completes the Azure learning path. Review the Azure Well-Architected Framework to ensure your designs align with all five pillars.