Disaster Recovery on GCP

Disaster recovery (DR) keeps your business operating when infrastructure fails, whether from hardware failure, natural disaster, human error, or cyberattack. GCP provides the tools to implement DR strategies ranging from simple backups to active-active multi-region architectures. The real test is whether you can restore, not just whether you back up.

DR Core Metrics

Metric	Definition	Example
RTO (Recovery Time Objective)	Max acceptable downtime	4 hours
RPO (Recovery Point Objective)	Max acceptable data loss	15 minutes
MTTR (Mean Time to Recovery)	Average time to restore	2 hours
MTBF (Mean Time Between Failures)	Average time between incidents	720 hours

Lower RTO/RPO = higher cost and complexity. Match your strategy to business requirements, not arbitrary zero-downtime goals.
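The MTBF and MTTR rows in the table combine into a steady-state availability estimate, MTBF / (MTBF + MTTR). The following is a minimal sketch of that arithmetic. The helper name `availability_pct` is illustrative, and the formula assumes failures and repairs alternate with those average durations.

```shell
#!/bin/sh
# Steady-state availability from MTBF and MTTR (both in hours).
# availability_pct is an illustrative helper name, not a gcloud command.
availability_pct() {
  mtbf="$1"
  mttr="$2"
  # availability = MTBF / (MTBF + MTTR), printed as a percentage
  awk -v up="$mtbf" -v down="$mttr" 'BEGIN { printf "%.2f\n", 100 * up / (up + down) }'
}

# Example values from the metrics table: MTBF 720 hours, MTTR 2 hours
availability_pct 720 2
```

With the table's example values this prints 99.72, which works out to roughly 24 hours of downtime per year.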

DR Strategy Comparison

Strategy	RTO	RPO	Cost	Complexity
Backup & Restore	Hours–Days	Hours	$	Low
Pilot Light	10–30 min	Minutes	$$	Medium
Warm Standby	Minutes	Minutes	$$$	Medium-High
Active-Active	Near-zero	Near-zero	$$$$	High
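One way to read the table is to pick the cheapest strategy whose typical RTO still fits the target. The sketch below encodes that idea. The cut-offs (one day, 30 minutes, 5 minutes) are assumptions loosely derived from the table rather than GCP guidance, and `suggest_strategy` is a hypothetical helper name.

```shell
#!/bin/sh
# Suggest the cheapest DR strategy for a target RTO given in whole minutes.
# The cut-offs below are assumptions for illustration; tune them to your own
# measured failover times.
suggest_strategy() {
  rto_min="$1"
  if [ "$rto_min" -ge 1440 ]; then
    echo "Backup & Restore"     # a day or more of downtime is acceptable
  elif [ "$rto_min" -ge 30 ]; then
    echo "Pilot Light"          # 30 minutes or more
  elif [ "$rto_min" -ge 5 ]; then
    echo "Warm Standby"         # a few minutes
  else
    echo "Active-Active"        # near-zero
  fi
}

suggest_strategy 2880   # two days
suggest_strategy 15     # fifteen minutes
```

Treat the result as a starting point for the conversation with stakeholders. RPO, budget, and data residency can each push the choice up or down a tier.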

Backup & Restore

The simplest strategy — regular backups with documented restore procedures:

  # Cloud SQL automated backups (enable at creation)
gcloud sql instances create app-db \
  --database-version=POSTGRES_15 \
  --tier=db-custom-2-4096 \
  --region=us-central1 \
  --backup-start-time=03:00 \
  --enable-point-in-time-recovery \
  --retained-backups-count=30

# Manual backup before major changes
gcloud sql backups create --instance=app-db \
  --description="pre-migration-$(date +%Y%m%d)"

# Restore a backup onto another instance (the target must already exist; its data is overwritten)
gcloud sql backups restore BACKUP_ID \
  --restore-instance=app-db-restored \
  --backup-instance=app-db

# Point-in-time recovery
gcloud sql instances clone app-db app-db-pitr \
  --point-in-time='2024-06-01T10:00:00.000Z'
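Backups only help if a recent one actually exists, so it is worth checking the age of the latest successful backup. The sketch below shows one way to do that. `backup_is_fresh` is a hypothetical helper, the 26-hour window in the usage comment is an assumed threshold for a daily schedule, and the function relies on GNU `date`.

```shell
#!/bin/sh
# Succeeds (exit 0) when a backup's end time is within the allowed age.
# Exit 1 = backup too old, exit 2 = timestamp could not be parsed.
# Relies on GNU date (-d); backup_is_fresh is an illustrative helper name.
backup_is_fresh() {
  end_time="$1"        # ISO 8601 timestamp, e.g. 2024-06-01T03:05:00Z
  max_age_hours="$2"
  end_epoch=$(date -u -d "$end_time" +%s) || return 2
  now_epoch=$(date -u +%s)
  age_seconds=$(( now_epoch - end_epoch ))
  [ "$age_seconds" -le $(( max_age_hours * 3600 )) ]
}

# Usage sketch (not executed here):
#   latest=$(gcloud sql backups list --instance=app-db \
#     --filter="status=SUCCESSFUL" --sort-by=~endTime --limit=1 \
#     --format="value(endTime)")
#   backup_is_fresh "$latest" 26 || echo "ALERT: no recent backup for app-db"
```

A freshness check does not replace a restore drill. It only catches the case where backups have silently stopped.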

GCS Cross-Region Replication

# Enable object versioning (guards against accidental overwrites and deletes; not required for replication)
gsutil versioning set on gs://learning-gcp-dev-assets/

# Create DR bucket in another region
gsutil mb -l europe-west1 -c STANDARD gs://learning-gcp-dev-assets-dr/

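# A single-region bucket does not replicate on its own: copy objects to the DR bucket
# with Storage Transfer Service, or store them in a dual-region bucket instead.
# On a dual-region bucket, turbo replication tightens the replication target to 15 minutes
# (the command fails on a single-region bucket):
gcloud storage buckets update gs://learning-gcp-dev-assets \
  --recovery-point-objective=ASYNC_TURBO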

For dual-region buckets, GCP automatically stores data in two regions within a configured pair (e.g., nam4 = Iowa + South Carolina).

Compute Engine Snapshots

# Snapshot schedule for production disks
# (--daily-schedule is a switch; the start time is its own flag)
gcloud compute resource-policies create snapshot-schedule daily-dr \
  --max-retention-days=30 \
  --daily-schedule \
  --start-time=02:00 \
  --region=us-central1 \
  --storage-location=europe-west1

gcloud compute disks add-resource-policies boot-disk \
  --resource-policies=daily-dr \
  --zone=us-central1-a

# Restore from snapshot in DR region
gcloud compute disks create restored-disk \
  --source-snapshot=SNAPSHOT_NAME \
  --zone=europe-west1-b

Pilot Light

Minimal resources running in DR region — scale up on failover:

Component	Primary (us-central1)	DR (europe-west1)
Cloud SQL	db-custom-4-16384 HA	Cross-region read replica (promote on failover)
Cloud Run	4 services, min-instances=2	Services deployed, min-instances=0
Container images	Active in Artifact Registry	Replicated to DR region registry
VPC	Full production	Pre-configured, minimal resources
Cloud DNS	Active routing	Failover record (standby)

# Promote the cross-region read replica to a standalone primary (failover; promotion cannot be undone)
gcloud sql instances promote-replica app-db-dr

# Scale Cloud Run from 0 to production capacity
gcloud run services update web-app \
  --region=europe-west1 \
  --min-instances=2 \
  --max-instances=50

Warm Standby

Reduced-capacity DR environment always running — faster failover than pilot light:

  # DR region runs at 20-30% capacity
# Primary: GKE 10 nodes → DR: GKE 2-3 nodes always running

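# On failover, scale DR cluster
# On a regional cluster, --num-nodes is the node count per zone, so 4 per zone
# across an assumed 3 zones gives 12 nodes, enough to cover the primary's 10
gcloud container clusters resize dr-cluster \
  --region=europe-west1 \
  --num-nodes=4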

# Scale Cloud Run services
gcloud run services update web-app \
  --region=europe-west1 \
  --min-instances=5
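When a regional GKE cluster is resized, `--num-nodes` is applied per zone, so the value to pass is the target node total divided by the number of zones, rounded up. A minimal sketch of that calculation follows. `nodes_per_zone` is a hypothetical helper, and the three-zone figure is an assumption about the DR cluster.

```shell
#!/bin/sh
# Nodes per zone needed so a regional cluster reaches at least a total node count.
# nodes_per_zone is an illustrative helper; supply your cluster's real zone count.
nodes_per_zone() {
  total="$1"
  zones="$2"
  # integer ceiling of total / zones
  echo $(( (total + zones - 1) / zones ))
}

nodes_per_zone 10 3   # match a 10-node primary in a 3-zone regional cluster
```

For the 10-node primary in this example the result is 4 per zone, which yields 12 nodes in total.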

Active-Active Multi-Region

Full capacity in multiple regions with traffic distributed:

  # Cloud DNS geolocation routing
gcloud dns record-sets create api.example.com. \
  --zone=example-com \
  --type=A \
  --ttl=60 \
  --routing-policy-type=GEO \
  --routing-policy-data="us-central1=203.0.113.10;europe-west1=198.51.100.10"

# Global HTTPS load balancer with backends in both regions
gcloud compute backend-services create api-backend \
  --global --protocol=HTTP --health-checks=http-health

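# Zonal NEG backends need their zone and a RATE balancing mode (UTILIZATION is
# not accepted for NEGs); the zones and per-endpoint rate are placeholder assumptions
gcloud compute backend-services add-backend api-backend \
  --global --network-endpoint-group=neg-us \
  --network-endpoint-group-zone=us-central1-a \
  --balancing-mode=RATE --max-rate-per-endpoint=100

gcloud compute backend-services add-backend api-backend \
  --global --network-endpoint-group=neg-eu \
  --network-endpoint-group-zone=europe-west1-b \
  --balancing-mode=RATE --max-rate-per-endpoint=100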

Active-Active Challenges

Challenge	Solution
Data consistency	Cloud Spanner (global), Firestore multi-region
Session state	Stateless apps + Memorystore (regional)
Deployment sync	Cloud Build deploys to both regions simultaneously
Conflict resolution	Last-writer-wins or application-level merge
Cost	2× infrastructure; justify with RTO/RPO requirements
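To make the last-writer-wins row concrete, the sketch below merges records written in two regions and keeps, for each key, the one with the highest timestamp. The record layout (`key timestamp value`) and the helper name `lww_merge` are assumptions made for this example. A real system would apply the rule inside the database or application layer.

```shell
#!/bin/sh
# Last-writer-wins merge: for each key keep the record with the highest timestamp.
# Input lines: "<key> <timestamp> <value>". On equal timestamps the first record
# seen is kept.
lww_merge() {
  awk '
    # keep the record when the key is new or the timestamp is strictly newer
    !($1 in ts) || ($2 + 0) > ts[$1] { ts[$1] = $2 + 0; line[$1] = $0 }
    END { for (k in line) print line[k] }
  ' | sort
}

printf '%s\n' \
  "cart-1 100 us-central1" \
  "cart-1 200 europe-west1" \
  "cart-2 300 us-central1" \
  "cart-2 250 europe-west1" | lww_merge
```

Last-writer-wins silently discards the older write and depends on reasonably synchronized clocks, which is why the table lists application-level merge as the alternative.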

Cloud SQL DR Options

Feature	RTO	RPO	Use Case
Automated backups	Hours	Hours	Basic DR
Point-in-time recovery	Hours	Seconds	Accidental deletion recovery
Regional HA (failover)	~60 seconds	Zero (sync)	Zone failure within region
Cross-region read replica	Minutes	Seconds–Minutes	Regional DR
Clone instance	Minutes	Point-in-time	Testing, migration

  # Cross-region read replica
gcloud sql instances create app-db-dr \
  --master-instance-name=app-db \
  --tier=db-custom-2-4096 \
  --region=europe-west1

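# Confirm the replica is running and attached to the primary
gcloud sql instances describe app-db-dr \
  --format="value(state,masterInstanceName)"

# Replication lag is not a field on the instance resource; watch the Cloud Monitoring
# metric cloudsql.googleapis.com/database/replication/replica_lag instead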
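Cloud SQL reports replication lag through the Cloud Monitoring metric `cloudsql.googleapis.com/database/replication/replica_lag`. Whatever way you read that value, it helps to gate promotion on it so a stale replica is not promoted by reflex. The sketch below assumes the lag (in seconds) has already been fetched. `safe_to_promote` is a hypothetical helper, and the 30-second default mirrors the threshold used in the financial services scenario later in this article.

```shell
#!/bin/sh
# Gate replica promotion on replication lag (seconds).
# Exit status: 0 = lag below threshold, 1 = lag too high,
#              2 = lag missing or containing non-numeric characters.
# safe_to_promote is an illustrative helper; set the threshold from your RPO.
safe_to_promote() {
  lag="$1"
  max="${2:-30}"
  case "$lag" in
    ''|*[!0-9.]*) return 2 ;;   # unknown lag: do not promote blindly
  esac
  awk -v lag="$lag" -v max="$max" 'BEGIN { exit !(lag + 0 < max + 0) }'
}

# Usage sketch (not executed here):
#   safe_to_promote "$current_lag" 30 \
#     && gcloud sql instances promote-replica app-db-dr
```

Returning a distinct status for an unknown lag keeps a monitoring outage from being mistaken for a healthy replica.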

GKE Disaster Recovery

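# Backup for GKE (application and cluster state)
# --cluster takes the full resource path; retention is set with --backup-retain-days
gcloud container backup-restore backup-plans create gke-dr-plan \
  --cluster=projects/learning-gcp-dev/locations/us-central1/clusters/prod-cluster \
  --location=us-central1 \
  --all-namespaces \
  --include-secrets \
  --include-volume-data \
  --backup-retain-days=30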

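# Restore to DR cluster. The target cluster is defined by a restore plan created
# beforehand; gke-dr-restore-plan is a placeholder name for that plan
gcloud container backup-restore restores create gke-restore \
  --location=europe-west1 \
  --restore-plan=gke-dr-restore-plan \
  --backup=projects/learning-gcp-dev/locations/us-central1/backupPlans/gke-dr-plan/backups/BACKUP_ID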

DR Runbook Template

Every DR strategy needs a tested runbook:

  # DR Failover Runbook: Production API

## Trigger Conditions
- Primary region (us-central1) unavailable for > 5 minutes
- Cloud SQL primary failure with HA failover unsuccessful
- Security incident requiring region isolation

## Failover Steps
1. Confirm primary region outage (Cloud Monitoring, uptime checks)
2. Notify stakeholders via PagerDuty (#incident-dr channel)
3. Promote Cloud SQL read replica in europe-west1
4. Scale Cloud Run / GKE services in DR region to production capacity
5. Update Cloud DNS failover record to point to DR load balancer
6. Verify application health: curl https://api.example.com/health
7. Monitor DR region metrics for 30 minutes
8. Document incident timeline for postmortem

## Rollback Steps
1. Confirm primary region restored and stable
2. Create new read replica from DR primary back to us-central1
3. Scale DR back to standby capacity
4. Update Cloud DNS to primary region
5. Verify and notify stakeholders

## Test Schedule
- Quarterly: Full failover drill (non-business hours)
- Monthly: Backup restore verification
- Weekly: Automated DR health check script
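Failover steps 3 to 6 of the runbook can be scripted so that a drill and a real failover run the same commands. The sketch below defaults to a dry run that only prints each command. Resource names follow this article's examples, the DR load balancer address (198.51.100.10) is a placeholder, and `run` and `failover` are hypothetical helper names.

```shell
#!/bin/sh
# Dry-run wrapper: prints each command unless DRY_RUN=0 is set explicitly.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "DRY-RUN: $*"
  else
    "$@"
  fi
}

# Runbook steps 3-6: promote the replica, scale Cloud Run, repoint DNS, check health.
# Stops at the first failing step so later steps never run against a broken state.
failover() {
  run gcloud sql instances promote-replica app-db-dr --quiet || return 1
  run gcloud run services update web-app \
    --region=europe-west1 --min-instances=2 --max-instances=50 || return 1
  run gcloud dns record-sets update api.example.com. \
    --zone=example-com --type=A --ttl=60 --rrdatas=198.51.100.10 || return 1
  run curl --fail --silent --show-error https://api.example.com/health
}

# Usage sketch:
#   failover              # prints the four commands without running them
#   DRY_RUN=0 failover    # runs them for real
```

Making the dry run the default means a mistyped invocation during a drill cannot promote the replica by accident.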

Backup for GCE (Centralized)

Manage backups across Compute Engine resources:

# Create a snapshot schedule for VM disks
# (--daily-schedule is a switch; a schedule accepts a single storage location)
gcloud compute resource-policies create snapshot-schedule vm-backup \
  --max-retention-days=35 \
  --daily-schedule \
  --start-time=05:00 \
  --region=us-central1 \
  --storage-location=europe-west1

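# Snapshot schedules attach to disks, not instances; this assumes the boot disk
# shares the VM's name (web-server-01), which is the default
gcloud compute disks add-resource-policies web-server-01 \
  --resource-policies=vm-backup \
  --zone=us-central1-a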

Real-World Scenario: Financial Services DR

Requirement	Implementation
RTO: 15 minutes	Warm standby in europe-west1
RPO: 5 minutes	Cloud SQL cross-region read replica
Compliance: 7-year retention	GCS Archive with bucket lock
Failover testing	Quarterly automated DR drill
Monitoring	Cloud DNS health checks + Cloud Monitoring composite alerts
Communication	PagerDuty integration with runbook links
Data integrity	Promote replica only after confirming replication lag < 30s

Common Mistakes

Backups without tested restores — untested backups are wishful thinking
DR region in same geography — us-central1 and us-east1 share US risk; use europe-west1
No runbook — panic-driven failover takes 10× longer
Ignoring data consistency — failover with stale data corrupts business logic
DR resources not maintained — container images, IaC, and configs drift from production
Never testing failover — discover broken DR during actual disaster
High DNS TTL during failover — clients cache stale records for hours

Troubleshooting DR Failures

Issue	Cause	Fix
Restore takes too long	Large snapshot, wrong instance tier	Pre-provision DR instances; use Cloud SQL HA for fast failover
DNS not switching	Health check still passing on failed region	Lower health check thresholds; trigger failover from Cloud Monitoring alerting policies
Data inconsistency after failover	Async replication lag	Monitor replication lag; promote only when lag < threshold
DR region costs too high	Full active-active when warm standby suffices	Match strategy to actual RTO/RPO requirements
GKE restore fails	Backup plan not covering all namespaces	Verify the backup plan covers every required namespace and includes secrets and volume data

Best Practices

Define RTO/RPO with business stakeholders, not engineers alone
Test failover quarterly — automate where possible
Store Terraform/IaC templates ready to deploy in DR region
Replicate container images to DR region Artifact Registry continuously
Use Cloud DNS health-checked failover for automatic traffic switching
Maintain DR runbooks with step-by-step commands
Monitor replication lag (Cloud SQL, GCS) with Cloud Monitoring alerts
Include DR costs in budget planning — DR is insurance, not waste
Lower DNS TTL to 60 seconds before planned failover events
Use dual-region GCS buckets for data that must survive regional failure

This completes the GCP Expert Deep Dives track. Review the GCP index for the full learning path.
