Disaster Recovery on Azure
Disaster recovery (DR) keeps your business operating when infrastructure fails, whether from hardware failure, natural disaster, human error, or cyberattack. Azure provides tools to implement DR strategies ranging from simple backups to active-active multi-region architectures. The critical question is whether you can actually restore, not just back up.
DR Core Metrics
| Metric | Definition | Example |
|---|---|---|
| RTO (Recovery Time Objective) | Max acceptable downtime | 4 hours |
| RPO (Recovery Point Objective) | Max acceptable data loss | 15 minutes |
| MTTR (Mean Time to Recovery) | Average time to restore | 2 hours |
| MTBF (Mean Time Between Failures) | Average time between incidents | 720 hours |
Lower RTO/RPO means higher cost and complexity. Match your strategy to business requirements — not arbitrary zero-downtime goals for every workload.
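As a small illustration of how these targets are used after an incident or a drill, the sketch below compares measured downtime and data loss against the RTO and RPO. The function name `check_objectives` and the sample incident numbers are hypothetical; only the 4-hour RTO and 15-minute RPO come from the table.

```shell
# check_objectives DOWNTIME_MIN RTO_MIN DATA_GAP_MIN RPO_MIN
# Prints which objectives were missed: none, rto, rpo, or rto,rpo.
# DATA_GAP_MIN is the age of the last good recovery point when the failure hit.
check_objectives() (
  downtime=$1 rto=$2 data_gap=$3 rpo=$4
  missed=""
  if [ "$downtime" -gt "$rto" ]; then missed="rto"; fi
  if [ "$data_gap" -gt "$rpo" ]; then missed="${missed:+$missed,}rpo"; fi
  echo "${missed:-none}"
)

# Targets from the table: RTO 4 hours (240 min), RPO 15 minutes.
check_objectives 120 240 10 15   # prints: none
check_objectives 300 240 45 15   # prints: rto,rpo
```

Recording this result after every drill gives you a trend line for whether the chosen strategy still meets its targets.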
DR Strategy Comparison
| Strategy | RTO | RPO | Cost | Complexity |
|---|---|---|---|---|
| Backup & Restore | Hours–Days | Hours | $ | Low |
| Pilot Light | 10–30 min | Minutes | $$ | Medium |
| Warm Standby | Minutes | Minutes | $$$ | Medium-High |
| Active-Active | Near-zero | Near-zero | $$$$ | High |
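One way to read the table is as a lookup from a workload's RTO target to the cheapest strategy that can plausibly meet it. The sketch below encodes that idea; the function name and the minute cutoffs (240, 30, 5) are assumptions chosen to line up with the ranges in the table, not Azure guidance, and a real decision would check RPO the same way.

```shell
# suggest_strategy RTO_MIN
# Prints the least expensive strategy whose typical RTO fits the target.
# Cutoffs are illustrative assumptions derived from the comparison table.
suggest_strategy() (
  rto_min=$1
  if   [ "$rto_min" -ge 240 ]; then echo "Backup & Restore"
  elif [ "$rto_min" -ge 30 ];  then echo "Pilot Light"
  elif [ "$rto_min" -ge 5 ];   then echo "Warm Standby"
  else                              echo "Active-Active"
  fi
)

suggest_strategy 1440   # prints: Backup & Restore
suggest_strategy 15     # prints: Warm Standby
```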
Backup & Restore
The simplest strategy — regular backups with documented, tested restore procedures.
Azure Backup for VMs
# Create Recovery Services vault
az backup vault create \
--resource-group rg-dr-prod \
--name rsv-webapp-prod \
--location eastus
# Enable VM backup with the vault's built-in DefaultPolicy (daily backup, 30-day retention)
az backup protection enable-for-vm \
--resource-group rg-dr-prod \
--vault-name rsv-webapp-prod \
--vm vm-web-01 \
--policy-name DefaultPolicy
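# Create custom backup policy (daily at 02:00 UTC, 30-day retention)
# Tip: start from the output of `az backup policy get-default-for-vm` and edit it
az backup policy create \
--resource-group rg-dr-prod \
--vault-name rsv-webapp-prod \
--name daily-30day-retention \
--backup-management-type AzureIaasVM \
--policy '{
"properties": {
"backupManagementType": "AzureIaasVM",
"timeZone": "UTC",
"schedulePolicy": {"schedulePolicyType": "SimpleSchedulePolicy", "scheduleRunFrequency": "Daily", "scheduleRunTimes": ["2025-01-01T02:00:00Z"]},
"retentionPolicy": {"retentionPolicyType": "LongTermRetentionPolicy", "dailySchedule": {"retentionTimes": ["2025-01-01T02:00:00Z"], "retentionDuration": {"count": 30, "durationType": "Days"}}}
}
}'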
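# Restore the VM's disks from the most recent recovery point
# (restore-disks recovers the disks only; create the VM from them afterwards)
RP_NAME=$(az backup recoverypoint list \
--resource-group rg-dr-prod \
--vault-name rsv-webapp-prod \
--container-name vm-web-01 \
--item-name vm-web-01 \
--backup-management-type AzureIaasVM \
--query "[0].name" -o tsv)
az backup restore restore-disks \
--resource-group rg-dr-prod \
--vault-name rsv-webapp-prod \
--container-name vm-web-01 \
--item-name vm-web-01 \
--rp-name "$RP_NAME" \
--storage-account strestoretemp001 \
--target-resource-group rg-dr-prod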
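Because the point of this strategy is a tested restore, it can help to wrap the restore in a repeatable drill that waits for the job and reports its final status. The following is a minimal sketch, assuming the first entry returned by `az backup recoverypoint list` is the most recent one and that the container and item names both equal the VM name, as in the commands above. The function name `restore_drill` is hypothetical.

```shell
# restore_drill VAULT_RG VAULT VM STORAGE_ACCOUNT
# Restores the latest recovery point's disks and succeeds only if the job completes.
restore_drill() (
  rg=$1 vault=$2 vm=$3 storage=$4

  # Name of the most recent recovery point (assumption: the list is newest-first).
  rp=$(az backup recoverypoint list \
        --resource-group "$rg" --vault-name "$vault" \
        --container-name "$vm" --item-name "$vm" \
        --backup-management-type AzureIaasVM \
        --query "[0].name" -o tsv) || exit 1
  [ -n "$rp" ] || { echo "no recovery point found" >&2; exit 1; }

  # Start the restore and capture the job name.
  job=$(az backup restore restore-disks \
        --resource-group "$rg" --vault-name "$vault" \
        --container-name "$vm" --item-name "$vm" \
        --rp-name "$rp" --storage-account "$storage" \
        --target-resource-group "$rg" \
        --query name -o tsv) || exit 1

  # Block until the job finishes, then read its final status.
  az backup job wait --resource-group "$rg" --vault-name "$vault" --name "$job" || exit 1
  status=$(az backup job show --resource-group "$rg" --vault-name "$vault" \
            --name "$job" --query properties.status -o tsv)
  echo "restore job $job: $status"
  [ "$status" = "Completed" ]
)

# Usage (runs a real restore, so point it at a drill resource group):
#   restore_drill rg-dr-prod rsv-webapp-prod vm-web-01 strestoretemp001
```

A completed job only shows that the disks were written; booting a VM from them and checking the application is still part of the drill.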
Azure SQL Backup and Geo-Restore
# Azure SQL automatic backups (enabled by default)
# Point-in-time restore within retention period
az sql db restore \
--dest-name db-webapp-restored \
--edition Standard \
--service-objective S2 \
--resource-group rg-webapp-prod \
--server sql-webapp-prod \
--name db-webapp \
--time "2025-06-01T14:30:00Z"
# Geo-restore into the DR region from the geo-replicated backup (after a regional outage)
# List available geo-backups with: az sql db geo-backup list --resource-group rg-webapp-prod --server sql-webapp-prod
az sql db geo-backup restore \
--dest-database db-webapp-dr \
--dest-server sql-webapp-dr \
--resource-group rg-webapp-dr \
--geo-backup-id /subscriptions/SUB_ID/resourceGroups/rg-webapp-prod/providers/Microsoft.Sql/servers/sql-webapp-prod/recoverableDatabases/db-webapp
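Point-in-time restore only works inside the retention window, so a pre-check against the database's earliest restore point can save a failed attempt during an incident. This is a sketch under the assumption that both timestamps are UTC and ISO 8601 formatted, which lets them be compared as plain strings. The function name `can_restore_to` is hypothetical; `earliestRestoreDate` is the property reported by `az sql db show`.

```shell
# can_restore_to RG SERVER DB RESTORE_TIME
# Succeeds if RESTORE_TIME (UTC, e.g. 2025-06-01T14:30:00Z) is not earlier
# than the earliest point the database can currently be restored to.
can_restore_to() (
  rg=$1 server=$2 db=$3 want=$4
  earliest=$(az sql db show --resource-group "$rg" --server "$server" \
              --name "$db" --query earliestRestoreDate -o tsv) || exit 1
  echo "earliest restore point: $earliest"

  # Keep only YYYY-MM-DDTHH:MM:SS so fractional seconds and offsets do not matter.
  e=$(printf '%.19s' "$earliest")
  w=$(printf '%.19s' "$want")

  # Appending "" forces awk to compare the values as strings.
  awk -v e="$e" -v w="$w" 'BEGIN { exit !((w "") >= (e "")) }'
)

# Usage:
#   can_restore_to rg-webapp-prod sql-webapp-prod db-webapp 2025-06-01T14:30:00Z \
#     && echo "restore time is inside the retention window"
```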
Storage Geo-Redundancy
# Verify GZRS on production storage
az storage account show \
--name stwebappprod001 \
--resource-group rg-webapp-prod \
--query "{SKU:sku.name, Location:primaryLocation, Secondary:secondaryLocation}"
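# Fail over the storage account to the secondary region (manual, customer-managed failover)
# Writes not yet replicated are lost, and the account becomes LRS in the new primary region until geo-redundancy is re-enabled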
az storage account failover \
--name stwebappprod001 \
--resource-group rg-webapp-prod
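Before triggering a storage failover it is worth looking at the account's geo-replication status and last sync time, since writes made after the last sync may not exist in the secondary region. The sketch below reads both values with `--expand geoReplicationStats`. The function name `safe_to_failover` and the rule "proceed only when the status is Live" are assumptions for illustration.

```shell
# safe_to_failover ACCOUNT RG
# Prints the replication status and last sync time; succeeds only if status is Live.
safe_to_failover() (
  account=$1 rg=$2
  stats=$(az storage account show --name "$account" --resource-group "$rg" \
           --expand geoReplicationStats \
           --query "geoReplicationStats.[status, lastSyncTime]" -o tsv) || exit 1

  # Split the two whitespace-separated values into $1 (status) and $2 (last sync).
  set -- $stats
  echo "replication status: ${1:-unknown}, last sync: ${2:-unknown}"
  [ "${1:-}" = "Live" ]
)

# Usage:
#   safe_to_failover stwebappprod001 rg-webapp-prod \
#     && az storage account failover --name stwebappprod001 --resource-group rg-webapp-prod
```

The gap between the last sync time and the moment of failure is your effective RPO for that account.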
Pilot Light
Minimal resources running in the DR region — scale up on failover:
| Component | Primary (East US) | DR (West US) |
|---|---|---|
| App Service | P1v3, 3 instances | B1, 1 instance (minimal) |
| Azure SQL | Business Critical | Geo-replica (readable secondary) |
| Storage | RA-GZRS account (read-write) | Secondary endpoint (read-only) |
| ACR | Active registry | Geo-replicated replica |
| Front Door | Active routing | Same profile, secondary origin (disabled) |
# Create geo-replica for Azure SQL
az sql db replica create \
--resource-group rg-webapp-prod \
--server sql-webapp-prod \
--name db-webapp \
--partner-server sql-webapp-dr \
--partner-resource-group rg-webapp-dr
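# Create failover group (the listener endpoint follows the primary, so connection strings stay the same after failover)
# --grace-period is in hours (minimum 1): automatic failover starts only after an outage has lasted that long
az sql failover-group create \
--name fg-webapp \
--resource-group rg-webapp-prod \
--server sql-webapp-prod \
--partner-server sql-webapp-dr \
--partner-resource-group rg-webapp-dr \
--failover-policy Automatic \
--grace-period 1 \
--add-db db-webapp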
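On failover: scale up the App Service plan in the DR region and enable the Front Door secondary origin. The SQL failover group handles the database, either automatically once the grace period expires or immediately when you trigger the failover yourself.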
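The failover steps can be captured as a small runbook script so nobody has to type them under pressure. The sketch below is one possible shape, not a complete procedure: it reuses resource names that appear elsewhere in this article (`plan-webapp-dr`, `origin-westus`, `og-webapp`, `fd-webapp-prod`, `fg-webapp`), assumes those resources exist, and only prints the commands unless `DRY_RUN=0` is set. The helper names `run` and `failover_to_dr` are hypothetical.

```shell
# With DRY_RUN=1 (the default) each command is printed instead of executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

failover_to_dr() {
  # 1. Scale the DR App Service plan up to production size.
  run az appservice plan update --name plan-webapp-dr --resource-group rg-webapp-dr \
      --sku P1V3 --number-of-workers 3

  # 2. Enable the secondary origin so Front Door starts routing to the DR region.
  run az afd origin update --origin-name origin-westus --origin-group-name og-webapp \
      --profile-name fd-webapp-prod --resource-group rg-webapp-prod \
      --enabled-state Enabled

  # 3. Promote the DR SQL server to primary. The command targets the server that
  #    should become primary. If the old primary is unreachable, a forced
  #    failover needs --allow-data-loss as well.
  run az sql failover-group set-primary --name fg-webapp \
      --resource-group rg-webapp-dr --server sql-webapp-dr
}

failover_to_dr   # dry run: prints the three az commands
```

Run it as `DRY_RUN=0 sh failover.sh` (file name assumed) during a real failover or a scheduled drill. The script keeps going if one step fails, which is a deliberate choice here so that a Front Door error does not block the database failover.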
Warm Standby
Reduced-capacity DR environment always running — faster failover than pilot light:
# DR region App Service always running at reduced capacity (1 of the 3 production instances)
az appservice plan create \
--name plan-webapp-dr \
--resource-group rg-webapp-dr \
--location westus \
--sku P1v3 \
--number-of-workers 1
# On failover, scale to production capacity
az appservice plan update \
--name plan-webapp-dr \
--resource-group rg-webapp-dr \
--number-of-workers 3
Azure Site Recovery (ASR)
Replicate VMs, physical servers, and VMware/Hyper-V workloads to Azure for orchestrated failover:
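# Create a Recovery Services vault for ASR (for Azure-to-Azure replication, place it in the target region, not the source region)
az backup vault create \
--resource-group rg-dr-prod \
--name asr-vault-prod \
--location westus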
# Enable replication for a VM (via Portal or Recovery Services API)
# ASR handles: initial replication, delta sync, failover, failback
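# Test failover (non-disruptive validation)
# Needs the site-recovery CLI extension (az extension add --name site-recovery); confirm the exact subcommand and flags with --help for your version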
az site-recovery test-failover \
--resource-group rg-dr-prod \
--vault-name asr-vault-prod \
--fabric-name azure-eastus \
--protection-container mapped \
--recovery-container azure-westus \
--recovery-provider azure \
--replication-protected-item vm-web-01
ASR supports test failover without impacting production — validate DR procedures regularly.
Active-Active Multi-Region
Both regions serve traffic simultaneously — highest availability and cost:
| Component | East US | West US |
|---|---|---|
| App Service | 3 instances (active) | 3 instances (active) |
| Azure SQL | Primary (read-write) | Geo-replica (read-only) |
| Front Door | Origin 1 (priority 1) | Origin 2 (priority 1) |
| Cosmos DB | Multi-region write | Multi-region write |
| Storage | RA-GZRS primary endpoint (read-write) | RA-GZRS secondary endpoint (read-only) |
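# Front Door with active-active origins (same priority on both origins, traffic split by weight)
az afd origin create \
--origin-name origin-westus \
--origin-group-name og-webapp \
--profile-name fd-webapp-prod \
--resource-group rg-webapp-prod \
--host-name my-webapp-dr.azurewebsites.net \
--origin-host-header my-webapp-dr.azurewebsites.net \
--enabled-state Enabled \
--priority 1 \
--weight 1000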
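When origins share the same priority, Front Door splits requests among the healthy ones roughly in proportion to their weights, within its latency-sensitivity setting. The sketch below computes that nominal share. The function name `traffic_share` is hypothetical, and the assumption that the East US origin also has weight 1000 is not stated in the article.

```shell
# traffic_share ORIGIN_WEIGHT ALL_WEIGHTS...
# Prints the nominal percentage of requests for an origin with ORIGIN_WEIGHT,
# given the weights of all healthy origins at the same priority.
traffic_share() (
  mine=$1; shift
  total=0
  for w in "$@"; do total=$((total + w)); done
  [ "$total" -gt 0 ] || { echo "no weights given" >&2; exit 1; }
  awk -v m="$mine" -v t="$total" 'BEGIN { printf "%.1f%%\n", 100 * m / t }'
)

traffic_share 1000 1000 1000   # two equal origins, prints: 50.0%
traffic_share 250 1000 250     # lighter second origin, prints: 20.0%
```

Lowering one origin's weight is a simple way to send a trickle of real traffic to a region you want to keep warm without giving it half the load.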
Real-World Scenario: Financial Services DR Plan
Targets: RTO 15 minutes, RPO 5 minutes.
| Area | Approach |
|---|---|
| Strategy | Warm standby with auto-failover |
| Database | SQL failover group; failover triggered from the runbook, because the automatic policy waits out a grace period of at least 1 hour |
| Application | Front Door health probes + 2 regions |
| Storage | GZRS with object replication |
| VMs (legacy) | ASR continuous replication |
| Testing | Quarterly test failover via ASR |
| Documentation | Runbook with step-by-step failover/failback |
| Compliance | 7-year backup retention in Archive tier |
DR Service Comparison
| Service | Protects | RTO | Best For |
|---|---|---|---|
| Azure Backup | VMs, files, SQL, SAP HANA | Hours | Point-in-time restore |
| Azure Site Recovery | VMs, VMware, physical | Minutes | Orchestrated VM failover |
| SQL Failover Group | Azure SQL databases | ~30 seconds once failover is initiated | Database auto-failover |
| Geo-redundant Storage | Blob, file data | Minutes (manual failover) | Data durability |
| Front Door / Traffic Manager | Application routing | Seconds | Traffic redirection |
Common Mistakes
- Backups without tested restores: corruption or gaps surface only during an actual disaster
- No documented runbooks: engineers improvise under pressure, extending RTO
- Same region for DR resources: a regional outage takes down both primary and “DR”
- Ignoring DNS TTL: long TTLs delay traffic redirection after failover
- Never testing failover: untested DR plans fail when needed most
- Over-engineering DR for dev/test: LRS backups are usually sufficient for non-production
Troubleshooting
| Issue | Diagnosis | Fix |
|---|---|---|
| Backup job failed | VM agent issue or snapshot conflict | Check guest OS agent; retry backup |
| ASR replication lag | Network bandwidth or disk churn | Increase replication bandwidth; check throttling |
| SQL failover group not failing over | Grace period not yet elapsed, or manual policy | Verify automatic policy; check the grace period (set in hours); trigger failover manually if needed |
| Geo-restore not available | Backup not yet replicated to secondary | Wait for geo-replication (typically < 1 hour) |
| Front Door not failing over | DR origin disabled or failing its health probe | Enable the DR origin; verify /health endpoint returns 200 in DR region |
# Check SQL failover group status
az sql failover-group show \
--name fg-webapp \
--resource-group rg-webapp-prod \
--server sql-webapp-prod \
--query "{State:replicationState, Role:replicationRole, Databases:databases}" -o json
# List completed backup jobs (filter on 'Failed' instead to find failures)
az backup job list \
--resource-group rg-dr-prod \
--vault-name rsv-webapp-prod \
--query "[?properties.status=='Completed'].{VM:properties.entityFriendlyName, Time:properties.endTime}" \
-o table
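For the Front Door row in the table, a quick way to see what the health probe sees is to request the `/health` path on each origin directly. The sketch below does that with `curl`; the function name `check_health` and the primary host name `my-webapp.azurewebsites.net` are assumptions, while `my-webapp-dr.azurewebsites.net` is the DR origin used earlier.

```shell
# check_health HOST...
# Requests https://HOST/health for each host and fails if any does not return 200.
check_health() (
  failed=0
  for host in "$@"; do
    # -m 10 caps each request at 10 seconds; connection errors are reported as 000.
    code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "https://$host/health") || code="000"
    echo "$host -> $code"
    [ "$code" = "200" ] || failed=1
  done
  [ "$failed" -eq 0 ]
)

# Usage:
#   check_health my-webapp.azurewebsites.net my-webapp-dr.azurewebsites.net
```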
Best Practices
- Define RTO/RPO per workload — not every system needs 99.99% availability
- Test DR procedures quarterly with documented results and improvement actions
- Use Azure paired regions for geo-redundancy (East US ↔ West US)
- Automate failover where possible (SQL failover groups, Front Door health probes)
- Maintain runbooks with step-by-step instructions, contacts, and decision trees
- Monitor replication health with alerts on lag or failed backup jobs
- Include DR in architecture reviews and cost planning — DR resources have ongoing cost
- Protect backups against ransomware with soft delete and immutability (Azure Backup soft delete and immutable vaults, blob immutability policies)
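The monitoring practice can start as small as a scheduled check that fails when the vault reports failed backup jobs, which a scheduler or pipeline can then turn into an alert. This is a minimal sketch, assuming the vault names used earlier in this article. The function name `failed_backup_jobs` is hypothetical, and Azure Monitor alerts would be the more complete option.

```shell
# failed_backup_jobs VAULT_RG VAULT
# Prints the number of failed backup jobs the vault reports and
# returns non-zero if there are any (or if the query itself fails).
failed_backup_jobs() (
  count=$(az backup job list --resource-group "$1" --vault-name "$2" \
           --status Failed --query "length(@)" -o tsv) || exit 2
  echo "$count"
  [ "$count" -eq 0 ]
)

# Usage (for example from a scheduled pipeline):
#   failed_backup_jobs rg-dr-prod rsv-webapp-prod || echo "backup failures need attention"
```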
This completes the Azure learning path. Review the Azure Well-Architected Framework to ensure your designs align with all five pillars.