Disaster Recovery on Azure

Disaster recovery (DR) keeps your business operating when infrastructure fails, whether from hardware failure, natural disaster, human error, or cyberattack. Azure provides tools to implement DR strategies ranging from simple backups to active-active multi-region architectures. The critical question is whether you can actually restore, not just back up.

DR Core Metrics

Metric	Definition	Example
RTO (Recovery Time Objective)	Max acceptable downtime	4 hours
RPO (Recovery Point Objective)	Max acceptable data loss	15 minutes
MTTR (Mean Time to Recovery)	Average time to restore	2 hours
MTBF (Mean Time Between Failures)	Average time between incidents	720 hours

Lower RTO/RPO means higher cost and complexity. Match your strategy to business requirements — not arbitrary zero-downtime goals for every workload.
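
The MTBF and MTTR examples in the table can be combined into a rough steady-state availability estimate using the standard relation availability = MTBF / (MTBF + MTTR). The sketch below is illustrative only; `availability_pct` is a hypothetical helper name, not part of any Azure tooling.

```shell
#!/bin/sh
# Estimate steady-state availability (percent) from MTBF and MTTR in hours.
# availability_pct is a hypothetical helper, not an Azure CLI command.
availability_pct() {
  awk -v mtbf="$1" -v mttr="$2" \
    'BEGIN { printf "%.2f\n", 100 * mtbf / (mtbf + mttr) }'
}

# Example values from the metrics table: MTBF 720 h, MTTR 2 h
availability_pct 720 2   # prints 99.72
```

Comparing this figure against the availability the business actually needs is one way to judge whether a lower RTO is worth its cost.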

DR Strategy Comparison

Strategy	RTO	RPO	Cost	Complexity
Backup & Restore	Hours–Days	Hours	$	Low
Pilot Light	10–30 min	Minutes	$$	Medium
Warm Standby	Minutes	Minutes	$$$	Medium-High
Active-Active	Near-zero	Near-zero	$$$$	High
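
One way to read the comparison table is to pick the cheapest strategy whose RTO still meets the target. The sketch below encodes that idea; the minute thresholds are assumptions chosen to roughly match the ranges in the table, the helper name `suggest_strategy` is hypothetical, and RPO is deliberately ignored to keep the example short.

```shell
#!/bin/sh
# Suggest the cheapest strategy from the comparison table for a target RTO.
# Thresholds (in minutes) are illustrative assumptions, not Azure guidance.
suggest_strategy() {
  rto_min="$1"
  if [ "$rto_min" -ge 240 ]; then
    echo "Backup & Restore"
  elif [ "$rto_min" -ge 30 ]; then
    echo "Pilot Light"
  elif [ "$rto_min" -ge 5 ]; then
    echo "Warm Standby"
  else
    echo "Active-Active"
  fi
}

suggest_strategy 240   # the 4-hour RTO from the metrics example
```

A real selection would also weigh RPO, budget, and operational maturity.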

Backup & Restore

The simplest strategy — regular backups with documented, tested restore procedures.

Azure Backup for VMs

# Create Recovery Services vault
az backup vault create \
  --resource-group rg-dr-prod \
  --name rsv-webapp-prod \
  --location eastus

# Enable VM backup (daily at 2 AM, 30-day retention)
az backup protection enable-for-vm \
  --resource-group rg-dr-prod \
  --vault-name rsv-webapp-prod \
  --vm vm-web-01 \
  --policy-name DefaultPolicy

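# Create custom backup policy (JSON abbreviated here; export DefaultPolicy with
# `az backup policy show` and edit it to get a complete template)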
az backup policy create \
  --resource-group rg-dr-prod \
  --vault-name rsv-webapp-prod \
  --name daily-30day-retention \
  --backup-management-type AzureIaasVM \
  --policy '{
    "schedulePolicy": {"scheduleRunFrequency": "Daily", "scheduleRunTimes": ["2025-01-01T02:00:00Z"]},
    "retentionPolicy": {"dailySchedule": {"retentionTimes": ["2025-01-01T02:00:00Z"], "retentionDuration": {"count": 30, "durationType": "Days"}}}
  }'

# Restore VM disks from backup
# RP_NAME must hold a recovery point name (see `az backup recoverypoint list`)
az backup restore restore-disks \
  --resource-group rg-dr-prod \
  --vault-name rsv-webapp-prod \
  --container-name vm-web-01 \
  --item-name vm-web-01 \
  --rp-name "$RP_NAME" \
  --storage-account strestoretemp001 \
  --target-resource-group rg-dr-prod
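
The restore command needs the name of a specific recovery point. A minimal sketch for looking up the most recent one follows; `latest_recovery_point` is a hypothetical helper, and it assumes the first entry returned by `az backup recoverypoint list` is the newest.

```shell
#!/bin/sh
# Print the name of the most recent recovery point for a backed-up VM.
# latest_recovery_point is a hypothetical helper; it assumes the list is
# returned newest first and that container and item share the VM name.
latest_recovery_point() {
  az backup recoverypoint list \
    --resource-group "$1" \
    --vault-name "$2" \
    --container-name "$3" \
    --item-name "$3" \
    --backup-management-type AzureIaasVM \
    --query "[0].name" \
    --output tsv
}

# Usage (requires an authenticated Azure CLI session):
#   RP_NAME=$(latest_recovery_point rg-dr-prod rsv-webapp-prod vm-web-01)
```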

Azure SQL Backup and Geo-Restore

# Azure SQL automatic backups (enabled by default)
# Point-in-time restore within retention period
az sql db restore \
  --dest-name db-webapp-restored \
  --edition Standard \
  --service-objective S2 \
  --resource-group rg-webapp-prod \
  --server sql-webapp-prod \
  --name db-webapp \
  --time "2025-06-01T14:30:00Z"

# Geo-restore from the geo-replicated backup to another region (after a regional outage)
az sql db geo-backup restore \
  --dest-database db-webapp-dr \
  --dest-server sql-webapp-dr \
  --resource-group rg-webapp-dr \
  --geo-backup-id /subscriptions/SUB_ID/resourceGroups/rg-webapp-prod/providers/Microsoft.Sql/servers/sql-webapp-prod/recoverableDatabases/db-webapp

Storage Geo-Redundancy

# Verify GZRS on production storage
az storage account show \
  --name stwebappprod001 \
  --resource-group rg-webapp-prod \
  --query "{SKU:sku.name, Location:primaryLocation, Secondary:secondaryLocation}"

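# Fail over storage account to secondary region (customer-managed failover:
# writes made after the last sync time are lost, and the account becomes LRS
# in the new primary region until geo-redundancy is re-enabled)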
az storage account failover \
  --name stwebappprod001 \
  --resource-group rg-webapp-prod
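
Before starting a storage failover it helps to know how far the secondary lags, since anything written after the last sync time will not be present there. The sketch below reads `lastSyncTime` from the account's geo-replication stats; `seconds_since_last_sync` is a hypothetical helper, and the date parsing assumes GNU `date`.

```shell
#!/bin/sh
# Report how many seconds of writes could be lost if a storage account
# failover were started now. Assumes GNU date (for -d) and a geo-redundant
# account; seconds_since_last_sync is a hypothetical helper name.
# Args: <account> <resource-group> [now-epoch, defaults to current time]
seconds_since_last_sync() {
  last_sync=$(az storage account show \
    --name "$1" \
    --resource-group "$2" \
    --expand geoReplicationStats \
    --query "geoReplicationStats.lastSyncTime" \
    --output tsv) || return 1
  now_epoch="${3:-$(date -u +%s)}"
  sync_epoch=$(date -u -d "$last_sync" +%s) || return 1
  echo $((now_epoch - sync_epoch))
}

# Usage (requires an authenticated Azure CLI session):
#   seconds_since_last_sync stwebappprod001 rg-webapp-prod
```

Comparing the result with the workload's RPO gives a quick go/no-go signal for the failover decision.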

Pilot Light

Minimal resources running in the DR region — scale up on failover:

Component	Primary (East US)	DR (West US)
App Service	P1v3, 3 instances	B1, 1 instance (minimal)
Azure SQL	Business Critical	Geo-replica (readable secondary)
Storage	GZRS	RA-GZRS (read access)
ACR	Active registry	Geo-replicated replica
Front Door	Active routing	Same profile, secondary origin (disabled)

# Create geo-replica for Azure SQL
az sql db replica create \
  --resource-group rg-webapp-prod \
  --server sql-webapp-prod \
  --name db-webapp \
  --partner-server sql-webapp-dr \
  --partner-resource-group rg-webapp-dr

# Create failover group (the listener endpoint follows the primary after failover)
# --grace-period is in hours (minimum 1): automatic failover starts only after
# the primary has been unavailable for that long
az sql failover-group create \
  --name fg-webapp \
  --resource-group rg-webapp-prod \
  --server sql-webapp-prod \
  --partner-server sql-webapp-dr \
  --partner-resource-group rg-webapp-dr \
  --failover-policy Automatic \
  --grace-period 1 \
  --add-db db-webapp

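On failover, scale up App Service in the DR region and enable the Front Door secondary origin. The SQL failover group fails the database over automatically once the grace period expires; trigger the failover manually if you cannot wait that long.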
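
The failover steps can be captured in a small runbook script. The following is a sketch under stated assumptions rather than a finished runbook: it reuses resource names that appear elsewhere in this guide (`plan-webapp-dr`, `origin-westus`, `og-webapp`, `fd-webapp-prod`, `fg-webapp`), the target SKU and instance count are assumed, and the `run` and `pilot_light_failover` helpers are hypothetical. With `DRY_RUN=1` (the default here) it only prints the commands.

```shell
#!/bin/sh
# Pilot-light failover sketch. With DRY_RUN=1 (the default here) each command
# is printed instead of executed. Target SKU and instance count are assumptions.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

pilot_light_failover() {
  # 1. Scale the DR App Service plan up to production size
  run az appservice plan update \
    --name plan-webapp-dr --resource-group rg-webapp-dr \
    --sku P1v3 --number-of-workers 3
  # 2. Enable the secondary origin in Front Door
  run az afd origin update \
    --origin-name origin-westus --origin-group-name og-webapp \
    --profile-name fd-webapp-prod --resource-group rg-webapp-prod \
    --enabled-state Enabled
  # 3. Optional: promote the DR database now instead of waiting for the
  #    automatic policy (run against the secondary server)
  run az sql failover-group set-primary \
    --name fg-webapp --resource-group rg-webapp-dr --server sql-webapp-dr
}

pilot_light_failover
```

Running it first in dry-run mode during a drill gives the team a chance to review the exact commands before anything changes.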

Warm Standby

Reduced-capacity DR environment always running — faster failover than pilot light:

# DR region App Service always running at about one-third of production capacity (1 of 3 instances)
az appservice plan create \
  --name plan-webapp-dr \
  --resource-group rg-webapp-dr \
  --location westus \
  --sku P1v3 \
  --number-of-workers 1

# On failover, scale to production capacity
az appservice plan update \
  --name plan-webapp-dr \
  --resource-group rg-webapp-dr \
  --number-of-workers 3

Azure Site Recovery (ASR)

Replicate VMs, physical servers, and VMware/Hyper-V workloads to Azure for orchestrated failover:

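# Create the Recovery Services vault used by ASR
# (place it outside the source region so it survives a regional outage)
az backup vault create \
  --resource-group rg-dr-prod \
  --name asr-vault-prod \
  --location westus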

# Enable replication for a VM (via Portal or Recovery Services API)
# ASR handles: initial replication, delta sync, failover, failback

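# Test failover (non-disruptive validation)
# The command and flag names below are illustrative; confirm the exact syntax
# for your installed site-recovery extension with `az site-recovery --help`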
az site-recovery test-failover \
  --resource-group rg-dr-prod \
  --vault-name asr-vault-prod \
  --fabric-name azure-eastus \
  --protection-container mapped \
  --recovery-container azure-westus \
  --recovery-provider azure \
  --replication-protected-item vm-web-01

ASR supports test failover without impacting production — validate DR procedures regularly.

Active-Active Multi-Region

Both regions serve traffic simultaneously — highest availability and cost:

Component	East US	West US
App Service	3 instances (active)	3 instances (active)
Azure SQL	Primary (read-write)	Geo-replica (read-only)
Front Door	Origin 1 (priority 1)	Origin 2 (priority 1)
Cosmos DB	Multi-region write	Multi-region write
Storage	GZRS primary	RA-GZRS read

# Front Door with active-active origins (equal priority, so both regions serve traffic)
az afd origin create \
  --origin-name origin-westus \
  --origin-group-name og-webapp \
  --profile-name fd-webapp-prod \
  --resource-group rg-webapp-prod \
  --host-name my-webapp-dr.azurewebsites.net \
  --origin-host-header my-webapp-dr.azurewebsites.net \
  --enabled-state Enabled \
  --priority 1 \
  --weight 1000
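
Active-active routing only helps if Front Door can tell when a region is unhealthy. A possible health-probe configuration for the origin group is sketched below; the `/health` path, probe interval, and sample thresholds are assumptions to adapt, and `configure_health_probe` is a hypothetical wrapper name.

```shell
#!/bin/sh
# Configure health probes on the origin group so Front Door stops routing to
# an unhealthy region. Probe path and thresholds are assumptions to adapt.
configure_health_probe() {
  az afd origin-group update \
    --origin-group-name og-webapp \
    --profile-name fd-webapp-prod \
    --resource-group rg-webapp-prod \
    --probe-path /health \
    --probe-protocol Https \
    --probe-request-type GET \
    --probe-interval-in-seconds 30 \
    --sample-size 4 \
    --successful-samples-required 3 \
    --additional-latency-in-milliseconds 50
}

# Usage (requires an authenticated Azure CLI session):
#   configure_health_probe
```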

Real-World Scenario: Financial Services DR Plan

Requirement	RTO: 15 min, RPO: 5 min
Strategy	Warm standby with auto-failover
Database	SQL failover group (scripted failover to meet the RTO; the automatic policy waits at least 1 hour)
Application	Front Door health probes + 2 regions
Storage	GZRS with object replication
VMs (legacy)	ASR continuous replication
Testing	Quarterly test failover via ASR
Documentation	Runbook with step-by-step failover/failback
Compliance	7-year backup retention in Archive tier
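
Each quarterly test is more useful when its measured timings are checked against the plan's targets. The sketch below does that comparison with the 15-minute RTO and 5-minute RPO from this scenario as defaults; `meets_objectives` is a hypothetical helper, and the sample measurements are made up for illustration.

```shell
#!/bin/sh
# Compare timings measured in a DR drill against the scenario's targets.
# meets_objectives is a hypothetical helper; all values are in minutes.
# Args: <measured-rto> <measured-rpo> [target-rto=15] [target-rpo=5]
meets_objectives() {
  measured_rto="$1"; measured_rpo="$2"
  target_rto="${3:-15}"; target_rpo="${4:-5}"
  if [ "$measured_rto" -le "$target_rto" ] && [ "$measured_rpo" -le "$target_rpo" ]; then
    echo "PASS"
  else
    echo "FAIL"
    return 1
  fi
}

# Illustrative drill: 12 minutes to recover, 3 minutes of data loss
meets_objectives 12 3 || true
```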

DR Service Comparison

Service	Protects	RTO	Best For
Azure Backup	VMs, files, SQL, SAP HANA	Hours	Point-in-time restore
Azure Site Recovery	VMs, VMware, physical	Minutes	Orchestrated VM failover
SQL Failover Group	Azure SQL databases	~30 seconds after failover is triggered	Database auto-failover
Geo-redundant Storage	Blob, file data	Minutes (manual failover)	Data durability
Front Door / Traffic Manager	Application routing	Seconds	Traffic redirection

Common Mistakes

Backups without tested restores: corruption or gaps are discovered only during an actual disaster
No documented runbooks: engineers improvise under pressure, extending RTO
Same region for DR resources: a regional outage takes down both primary and “DR”
Ignoring DNS TTL: long TTLs delay traffic redirection after failover
Never testing failover: untested DR plans fail when needed most
Over-engineering DR for dev/test: locally redundant (LRS) backups are usually sufficient for non-production

Troubleshooting

Issue	Diagnosis	Fix
Backup job failed	VM agent issue or snapshot conflict	Check guest OS agent; retry backup
ASR replication lag	Network bandwidth or disk churn	Increase replication bandwidth; check throttling
SQL failover group not failing over	Grace period not yet elapsed, or manual policy	Verify automatic policy; check grace period setting (specified in hours)
Geo-restore not available	Backup not yet replicated to secondary	Wait for geo-replication (typically < 1 hour)
Front Door not failing over	Origin health probe failing	Verify `/health` endpoint returns 200 in DR region

# Check SQL failover group status
az sql failover-group show \
  --name fg-webapp \
  --resource-group rg-webapp-prod \
  --server sql-webapp-prod \
  --query "{State:replicationState, Role:replicationRole, Databases:databases}" -o json

# List completed backup jobs
az backup job list \
  --resource-group rg-dr-prod \
  --vault-name rsv-webapp-prod \
  --query "[?properties.status=='Completed'].{VM:properties.entityFriendlyName, Time:properties.endTime}" \
  -o table
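
For the Front Door row in the table above, the origin's health endpoint can be checked directly from a shell. This is a minimal sketch; `check_health` is a hypothetical helper, and the URL in the usage comment assumes the DR app exposes `/health` on the hostname used earlier in this guide.

```shell
#!/bin/sh
# Check that an origin's health endpoint answers with HTTP 200.
# check_health is a hypothetical helper.
check_health() {
  code=$(curl --silent --output /dev/null --max-time 10 \
    --write-out '%{http_code}' "$1")
  if [ "$code" = "200" ]; then
    echo "healthy ($code)"
  else
    echo "unhealthy ($code)"
    return 1
  fi
}

# Usage:
#   check_health https://my-webapp-dr.azurewebsites.net/health
```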

Best Practices

Define RTO/RPO per workload — not every system needs 99.99% availability
Test DR procedures quarterly with documented results and improvement actions
Use Azure paired regions for geo-redundancy (East US ↔ West US)
Automate failover where possible (SQL failover groups, Front Door health probes)
Maintain runbooks with step-by-step instructions, contacts, and decision trees
Monitor replication health with alerts on lag or failed backup jobs
Include DR in architecture reviews and cost planning — DR resources have ongoing cost
Protect backups against ransomware with immutability and soft delete (Azure Backup immutable vaults and soft delete, blob immutability policies)
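
Monitoring for failed backup jobs can start as a scheduled script before a full alerting setup exists. The sketch below returns a non-zero exit code when the vault reports failed jobs so that a scheduler or pipeline can raise an alert; `count_failed_backups` and `alert_on_failed_backups` are hypothetical helper names.

```shell
#!/bin/sh
# Exit non-zero when the vault reports failed backup jobs, so a scheduler or
# pipeline can raise an alert. Both function names are hypothetical helpers.
count_failed_backups() {
  az backup job list \
    --resource-group "$1" \
    --vault-name "$2" \
    --status Failed \
    --query "length(@)" \
    --output tsv
}

alert_on_failed_backups() {
  failed=$(count_failed_backups "$1" "$2") || return 2
  if [ "$failed" -gt 0 ]; then
    echo "ALERT: $failed failed backup job(s) in vault $2"
    return 1
  fi
  echo "OK: no failed backup jobs in vault $2"
}

# Usage (requires an authenticated Azure CLI session):
#   alert_on_failed_backups rg-dr-prod rsv-webapp-prod
```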

This completes the Azure learning path. Review the Azure Well-Architected Framework to ensure your designs align with all five pillars.
