Disaster recovery (DR) keeps your business operating when infrastructure fails, whether from hardware failure, natural disaster, human error, or cyberattack. GCP provides the tools to implement DR strategies ranging from simple backups to active-active multi-region architectures. The real test is whether you can actually restore, not just whether you back up.

DR Core Metrics

Metric | Definition | Example
RTO (Recovery Time Objective) | Max acceptable downtime | 4 hours
RPO (Recovery Point Objective) | Max acceptable data loss | 15 minutes
MTTR (Mean Time to Recovery) | Average time to restore | 2 hours
MTBF (Mean Time Between Failures) | Average time between incidents | 720 hours

Lower RTO/RPO = higher cost and complexity. Match your strategy to business requirements, not arbitrary zero-downtime goals.
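
As a rough worked example, MTBF and MTTR combine into steady-state availability through the standard relation availability = MTBF / (MTBF + MTTR). The sketch below applies it to the example values from the table. The function name availability_pct is illustrative, and the result only holds if failures and repairs really follow those long-run averages.

```shell
#!/bin/sh
# availability_pct MTBF_HOURS MTTR_HOURS
# Prints steady-state availability as a percentage with two decimals,
# using availability = MTBF / (MTBF + MTTR).
availability_pct() {
  awk -v mtbf="$1" -v mttr="$2" \
    'BEGIN { printf "%.2f\n", 100 * mtbf / (mtbf + mttr) }'
}

# Example values from the metrics table: MTBF 720 hours, MTTR 2 hours
availability_pct 720 2   # prints 99.72
```

In other words, the table's example figures imply roughly two hours of downtime in every 722 hours of operation.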

DR Strategy Comparison

Strategy | RTO | RPO | Cost | Complexity
Backup & Restore | Hours–Days | Hours | $ | Low
Pilot Light | 10–30 min | Minutes | $$ | Medium
Warm Standby | Minutes | Minutes | $$$ | Medium-High
Active-Active | Near-zero | Near-zero | $$$$ | High
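
One way to read the comparison is as a lookup from the RTO target to the cheapest strategy that can meet it. The sketch below encodes that idea. The function name suggest_strategy and the minute thresholds (240, 30, 5) are assumptions loosely read off the table, not GCP guidance, and RPO is ignored to keep the example short.

```shell
#!/bin/sh
# suggest_strategy RTO_MINUTES
# Prints the cheapest strategy whose typical RTO fits the target.
# Thresholds are illustrative assumptions; tune them to your own measurements.
suggest_strategy() {
  if [ "$1" -ge 240 ]; then
    echo "backup-restore"   # hours of downtime are acceptable
  elif [ "$1" -ge 30 ]; then
    echo "pilot-light"      # table lists 10-30 min for pilot light
  elif [ "$1" -ge 5 ]; then
    echo "warm-standby"     # minutes
  else
    echo "active-active"    # near-zero
  fi
}

suggest_strategy 15   # prints warm-standby
```

A real decision would also weigh RPO, cost, and team capacity, so treat the output as a starting point for the conversation with stakeholders.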

Backup & Restore

The simplest strategy — regular backups with documented restore procedures:

# Cloud SQL automated backups (enable at creation)
gcloud sql instances create app-db \
  --database-version=POSTGRES_15 \
  --tier=db-custom-2-4096 \
  --region=us-central1 \
  --backup-start-time=03:00 \
  --enable-point-in-time-recovery \
  --retained-backups-count=30

# Manual backup before major changes
gcloud sql backups create --instance=app-db \
  --description="pre-migration-$(date +%Y%m%d)"

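# Restore into a separate target instance (app-db-restored must already exist;
# the restore overwrites its data)
gcloud sql backups restore BACKUP_ID \
  --restore-instance=app-db-restored \
  --backup-instance=app-db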

# Point-in-time recovery
gcloud sql instances clone app-db app-db-pitr \
  --point-in-time='2024-06-01T10:00:00.000Z'
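
Backups only protect you up to the age of the newest one, so a small freshness check against the RPO can catch a silently failing backup job. The sketch below is a minimal version of that check. The function name backup_is_fresh and the 24-hour RPO in the usage comment are assumptions, and the commented gcloud and GNU date invocation is one plausible way to obtain the timestamp, not the only one.

```shell
#!/bin/sh
# backup_is_fresh LAST_BACKUP_EPOCH NOW_EPOCH RPO_SECONDS
# Succeeds (exit 0) when the newest backup is no older than the RPO.
# A backup timestamp in the future is treated as invalid.
backup_is_fresh() {
  age=$(( $2 - $1 ))
  [ "$age" -ge 0 ] && [ "$age" -le "$3" ]
}

# Possible usage (assumes GNU date and an authenticated gcloud):
#   last=$(gcloud sql backups list --instance=app-db \
#     --filter="status=SUCCESSFUL" --sort-by=~endTime --limit=1 \
#     --format="value(endTime)")
#   backup_is_fresh "$(date -d "$last" +%s)" "$(date +%s)" 86400 \
#     || echo "ALERT: newest backup is older than the RPO"
```

This only proves that a recent backup exists. It does not replace a restore drill, which is the only way to confirm the backup is usable.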
  

GCS Cross-Region Replication

# Enable versioning to protect against accidental overwrites and deletes
# (versioning is not a prerequisite for replication in Cloud Storage)
gsutil versioning set on gs://learning-gcp-dev-assets/

# Create DR bucket in another region
gsutil mb -l europe-west1 -c STANDARD gs://learning-gcp-dev-assets-dr/

# Single-region buckets: copy objects to the DR bucket with Storage Transfer Service.
# Dual-region buckets only: enable turbo replication (15-minute RPO target)
gcloud storage buckets update gs://learning-gcp-dev-assets \
  --rpo=ASYNC_TURBO
  

For dual-region buckets, GCP automatically stores data in two regions within a configured pair (e.g., nam4 = Iowa + South Carolina).

Compute Engine Snapshots

# Snapshot schedule for production disks
gcloud compute resource-policies create snapshot-schedule daily-dr \
  --max-retention-days=30 \
  --daily-schedule \
  --start-time=02:00 \
  --region=us-central1 \
  --storage-location=europe-west1

gcloud compute disks add-resource-policies boot-disk \
  --resource-policies=daily-dr \
  --zone=us-central1-a

# Restore from snapshot in DR region
gcloud compute disks create restored-disk \
  --source-snapshot=SNAPSHOT_NAME \
  --zone=europe-west1-b
  

Pilot Light

Minimal resources running in DR region — scale up on failover:

Component | Primary (us-central1) | DR (europe-west1)
Cloud SQL | db-custom-4-16384 HA | Cross-region read replica (promote on failover)
Cloud Run | 4 services, min-instances=2 | Services deployed, min-instances=0
Container images | Active in Artifact Registry | Replicated to DR region registry
VPC | Full production | Pre-configured, minimal resources
Cloud DNS | Active routing | Failover record (standby)

# Promote Cloud SQL read replica to standalone (failover)
gcloud sql instances promote-replica app-db-dr-replica

# Scale Cloud Run from 0 to production capacity
gcloud run services update web-app \
  --region=europe-west1 \
  --min-instances=2 \
  --max-instances=50
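
The two failover commands above can be wrapped in a small script with a dry-run switch, so the sequence can be rehearsed without touching production. This is a sketch under assumptions: the run and pilot_light_failover helpers and the DRY_RUN variable are hypothetical names, and the gcloud commands are the ones from this section.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) prints each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

# run CMD...: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "DRY-RUN: $*"
  else
    "$@"
  fi
}

# Ordered pilot-light failover: database first, then the serving layer.
pilot_light_failover() {
  # Step 1: promote the cross-region replica to a standalone primary
  run gcloud sql instances promote-replica app-db-dr-replica || return 1
  # Step 2: scale Cloud Run in the DR region to production capacity
  run gcloud run services update web-app \
    --region=europe-west1 \
    --min-instances=2 \
    --max-instances=50 || return 1
}

pilot_light_failover
```

Stopping at the first failed step keeps the script from scaling services against a database that was never promoted. When running for real (DRY_RUN=0), gcloud may prompt for confirmation before promoting the replica, so decide deliberately whether to answer that prompt or suppress it.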
  

Warm Standby

Reduced-capacity DR environment always running — faster failover than pilot light:

# DR region runs at 20-30% capacity
# Primary: GKE 10 nodes → DR: GKE 2-3 nodes always running

# On failover, scale DR cluster. For a regional cluster, --num-nodes is the
# node count per zone, so set it to the per-zone target
gcloud container clusters resize dr-cluster \
  --region=europe-west1 \
  --num-nodes=10

# Scale Cloud Run services
gcloud run services update web-app \
  --region=europe-west1 \
  --min-instances=5
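
To size the standby, the 20-30% rule of thumb above can be turned into a node count by rounding up. The helper below is a small sketch of that arithmetic, and the function name standby_nodes is illustrative.

```shell
#!/bin/sh
# standby_nodes PRIMARY_NODES PERCENT
# Prints the standby node count, rounded up so the result never drops
# below the requested share of primary capacity.
standby_nodes() {
  echo $(( ($1 * $2 + 99) / 100 ))
}

standby_nodes 10 20   # prints 2
standby_nodes 10 30   # prints 3
```

Rounding up is a deliberate choice here: an undersized standby tends to fail exactly when failover traffic arrives.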
  

Active-Active Multi-Region

Full capacity in multiple regions with traffic distributed:

# Cloud DNS geolocation routing
# (regions are separated by ";", multiple addresses within a region by ",")
gcloud dns record-sets create api.example.com. \
  --zone=example-com \
  --type=A \
  --ttl=60 \
  --routing-policy-type=GEO \
  --routing-policy-data="us-central1=203.0.113.10;europe-west1=198.51.100.10"

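# Global HTTPS load balancer with backends in both regions
gcloud compute backend-services create api-backend \
  --global --protocol=HTTP --health-checks=http-health

# Assuming zonal NEGs: they need their zone and a RATE (or CONNECTION)
# balancing mode; the zones and per-endpoint rate here are placeholders
gcloud compute backend-services add-backend api-backend \
  --global --network-endpoint-group=neg-us \
  --network-endpoint-group-zone=us-central1-a \
  --balancing-mode=RATE --max-rate-per-endpoint=100

gcloud compute backend-services add-backend api-backend \
  --global --network-endpoint-group=neg-eu \
  --network-endpoint-group-zone=europe-west1-b \
  --balancing-mode=RATE --max-rate-per-endpoint=100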
  

Active-Active Challenges

Challenge | Solution
Data consistency | Cloud Spanner (global), Firestore multi-region
Session state | Stateless apps + Memorystore (regional)
Deployment sync | Cloud Build deploys to both regions simultaneously
Conflict resolution | Last-writer-wins or application-level merge
Cost | 2× infrastructure; justify with RTO/RPO requirements
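
To make the last-writer-wins option concrete, the sketch below picks between two conflicting versions of a record by timestamp. The record layout ("EPOCH REGION VALUE"), the function name lww_pick, and the tie-break on region name are all assumptions made for illustration.

```shell
#!/bin/sh
# lww_pick "EPOCH REGION VALUE" "EPOCH REGION VALUE"
# Prints the record with the larger timestamp. Equal timestamps are broken
# by region name so that every region reaches the same answer.
lww_pick() {
  printf '%s\n%s\n' "$1" "$2" | LC_ALL=C sort -k1,1n -k2,2 | tail -n 1
}

lww_pick "1717236000 us-central1 balance=100" \
         "1717236005 europe-west1 balance=90"
# prints: 1717236005 europe-west1 balance=90
```

Last-writer-wins depends on reasonably synchronized clocks and silently discards the losing write, which is why the table lists an application-level merge as the alternative for data where that loss is unacceptable.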

Cloud SQL DR Options

Feature | RTO | RPO | Use Case
Automated backups | Hours | Hours | Basic DR
Point-in-time recovery | Hours | Seconds | Accidental deletion recovery
Regional HA (failover) | ~60 seconds | Zero (sync) | Zone failure within region
Cross-region read replica | Minutes | Seconds–Minutes | Regional DR
Clone instance | Minutes | Point-in-time | Testing, migration

# Cross-region read replica
gcloud sql instances create app-db-dr \
  --master-instance-name=app-db \
  --tier=db-custom-2-4096 \
  --region=europe-west1

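# Monitor replication lag. Cloud SQL reports it through the Cloud Monitoring
# metric cloudsql.googleapis.com/database/replication/replica_lag rather than
# as a field of `instances describe`. On a PostgreSQL replica you can also run:
#   SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
# Confirm the replica is running and attached to the expected primary:
gcloud sql instances describe app-db-dr \
  --format="value(state,masterInstanceName)"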
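
Because cross-region replication is asynchronous, it helps to gate promotion on the measured lag. The sketch below shows such a gate. The function name lag_ok is hypothetical, the lag is assumed to arrive as whole seconds from whatever monitoring source you use, and the 30-second threshold in the usage line is only an example value.

```shell
#!/bin/sh
# lag_ok LAG_SECONDS THRESHOLD_SECONDS
# Succeeds only when the lag is a whole number strictly below the threshold.
# An empty or non-numeric reading is treated as unknown, which blocks promotion.
lag_ok() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$1" -lt "$2" ]
}

# Example: decide whether to promote with a 30-second threshold
lag=12
if lag_ok "$lag" 30; then
  echo "lag ${lag}s: safe to promote"
else
  echo "lag ${lag}s: hold promotion"
fi
```

Treating a missing reading as unsafe is intentional: during a regional outage the monitoring pipeline may itself be degraded, and an empty value should not be mistaken for zero lag.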
  

GKE Disaster Recovery

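# Backup for GKE (application and cluster state)
# --cluster takes the full resource path of the source cluster; depending on
# your gcloud version this command group may live under `gcloud beta`
gcloud container backup-restore backup-plans create gke-dr-plan \
  --cluster=projects/learning-gcp-dev/locations/us-central1/clusters/prod-cluster \
  --location=us-central1 \
  --all-namespaces \
  --include-secrets \
  --include-volume-data \
  --backup-retain-days=30

# Restore to DR cluster. Restores run through a restore plan that names the
# target cluster; gke-dr-restore-plan is a placeholder for one created beforehand
gcloud container backup-restore restores create gke-restore \
  --location=europe-west1 \
  --restore-plan=gke-dr-restore-plan \
  --backup=projects/learning-gcp-dev/locations/us-central1/backupPlans/gke-dr-plan/backups/BACKUP_ID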
  

DR Runbook Template

Every DR strategy needs a tested runbook:

# DR Failover Runbook: Production API

## Trigger Conditions
- Primary region (us-central1) unavailable for > 5 minutes
- Cloud SQL primary failure with HA failover unsuccessful
- Security incident requiring region isolation

## Failover Steps
1. Confirm primary region outage (Cloud Monitoring, uptime checks)
2. Notify stakeholders via PagerDuty (#incident-dr channel)
3. Promote Cloud SQL read replica in europe-west1
4. Scale Cloud Run / GKE services in DR region to production capacity
5. Update Cloud DNS failover record to point to DR load balancer
6. Verify application health: curl https://api.example.com/health
7. Monitor DR region metrics for 30 minutes
8. Document incident timeline for postmortem

## Rollback Steps
1. Confirm primary region restored and stable
2. Create new read replica from DR primary back to us-central1
3. Scale DR back to standby capacity
4. Update Cloud DNS to primary region
5. Verify and notify stakeholders

## Test Schedule
- Quarterly: Full failover drill (non-business hours)
- Monthly: Backup restore verification
- Weekly: Automated DR health check script
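
The weekly automated health check in the test schedule could start from something as small as a retry wrapper around the runbook's health probe. The sketch below is one possible shape. The retry helper is hypothetical, and the attempt count, delay, and curl timeout in the usage comment are placeholder values.

```shell
#!/bin/sh
# retry ATTEMPTS DELAY_SECONDS CMD...
# Runs CMD until it succeeds or ATTEMPTS is exhausted, sleeping between tries.
# Returns 0 on the first success and 1 if every attempt fails.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    "$@" && return 0
    # Sleep only if another attempt is coming
    if [ "$n" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    n=$(( n + 1 ))
  done
  return 1
}

# Possible usage against the health endpoint from the runbook:
#   retry 5 10 curl --fail --silent --show-error --max-time 5 \
#     https://api.example.com/health || echo "DR health check failed"
```

Retrying a few times before alerting filters out single dropped requests, while the bounded attempt count keeps a real outage from hiding behind endless retries.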
  

Backup for GCE (Centralized)

Manage backups across Compute Engine resources:

# Create backup plan for VMs (a schedule takes a single snapshot storage location)
gcloud compute resource-policies create snapshot-schedule vm-backup \
  --max-retention-days=35 \
  --daily-schedule \
  --start-time=05:00 \
  --region=us-central1 \
  --storage-location=europe-west1

# Apply to the VM's disk (snapshot schedules attach to disks, not instances;
# this assumes the boot disk shares the VM's name, the Compute Engine default)
gcloud compute disks add-resource-policies web-server-01 \
  --resource-policies=vm-backup \
  --zone=us-central1-a
  

Real-World Scenario: Financial Services DR

Requirement | Implementation
RTO: 15 minutes | Warm standby in europe-west1
RPO: 5 minutes | Cloud SQL cross-region read replica
Compliance: 7-year retention | GCS Archive with bucket lock
Failover testing | Quarterly automated DR drill
Monitoring | Cloud DNS health checks + Cloud Monitoring composite alerts
Communication | PagerDuty integration with runbook links
Data integrity | Promote replica only after confirming replication lag < 30s
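
A 15-minute RTO is only credible if the individual failover steps add up to less than that. The sketch below sums per-step estimates and compares them with the target. The function name rto_budget_ok is illustrative, and the step durations in the example (detection, DNS TTL, replica promotion, scale-up) are hypothetical numbers that a real drill would replace with measured values.

```shell
#!/bin/sh
# rto_budget_ok RTO_SECONDS STEP_SECONDS...
# Prints the summed estimate and succeeds when it fits within the RTO.
rto_budget_ok() {
  rto=$1
  shift
  total=0
  for step in "$@"; do
    total=$(( total + step ))
  done
  echo "estimated=${total}s rto=${rto}s"
  [ "$total" -le "$rto" ]
}

# Hypothetical budget for a 15-minute (900 s) RTO:
#   300 s detection, 60 s DNS TTL, 180 s replica promotion, 240 s scale-up
rto_budget_ok 900 300 60 180 240   # prints estimated=780s rto=900s
```

If the sum exceeds the target, the usual levers are faster detection, a lower DNS TTL, or keeping more DR capacity warm.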

Common Mistakes

  1. Backups without tested restores — untested backups are wishful thinking
  2. DR region in the same geography: us-central1 and us-east1 share US risk; use europe-west1
  3. No runbook — panic-driven failover takes 10× longer
  4. Ignoring data consistency — failover with stale data corrupts business logic
  5. DR resources not maintained — container images, IaC, and configs drift from production
  6. Never testing failover — discover broken DR during actual disaster
  7. High DNS TTL during failover — clients cache stale records for hours

Troubleshooting DR Failures

Issue | Cause | Fix
Restore takes too long | Large snapshot, wrong instance tier | Pre-provision DR instances; use Cloud SQL HA for fast failover
DNS not switching | Health check still passing on failed region | Tighten health check intervals and thresholds; trigger failover from Cloud Monitoring alerting policies
Data inconsistency after failover | Async replication lag | Monitor replication lag; promote only when lag < threshold
DR region costs too high | Full active-active when warm standby suffices | Match strategy to actual RTO/RPO requirements
GKE restore fails | Backup plan not covering all namespaces | Verify the backup plan covers all namespaces and includes secrets and volume data

Best Practices

  • Define RTO/RPO with business stakeholders, not engineers alone
  • Test failover quarterly — automate where possible
  • Store Terraform/IaC templates ready to deploy in DR region
  • Replicate container images to DR region Artifact Registry continuously
  • Use Cloud DNS health-checked failover for automatic traffic switching
  • Maintain DR runbooks with step-by-step commands
  • Monitor replication lag (Cloud SQL, GCS) with Cloud Monitoring alerts
  • Include DR costs in budget planning — DR is insurance, not waste
  • Lower DNS TTL to 60 seconds before planned failover events
  • Use dual-region GCS buckets for data that must survive regional failure

This completes the GCP Expert Deep Dives track. Review the GCP index for the full learning path.