Disaster Recovery on GCP
Disaster recovery (DR) keeps your business operating when infrastructure fails, whether from hardware failure, natural disaster, human error, or cyberattack. GCP provides the tools to implement DR strategies ranging from simple backups to active-active multi-region architectures. The real test of any DR plan is whether you can restore, not just whether you can back up.
DR Core Metrics
| Metric | Definition | Example |
|---|---|---|
| RTO (Recovery Time Objective) | Max acceptable downtime | 4 hours |
| RPO (Recovery Point Objective) | Max acceptable data loss | 15 minutes |
| MTTR (Mean Time to Recovery) | Average time to restore | 2 hours |
| MTBF (Mean Time Between Failures) | Average time between incidents | 720 hours |
Lower RTO/RPO = higher cost and complexity. Match your strategy to business requirements, not arbitrary zero-downtime goals.
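MTBF and MTTR from the table combine into a steady-state availability estimate, MTBF / (MTBF + MTTR). The following is a minimal sketch using the table's example values; the function name `availability_pct` is illustrative and not part of any GCP tool.

```shell
# availability_pct MTBF_HOURS MTTR_HOURS
# Prints steady-state availability as a percentage: 100 * MTBF / (MTBF + MTTR).
# The function name is hypothetical; the inputs below are the table's example values.
availability_pct() {
  awk -v mtbf="$1" -v mttr="$2" 'BEGIN { printf "%.2f\n", 100 * mtbf / (mtbf + mttr) }'
}

availability_pct 720 2   # prints 99.72
```

Because only the ratio of the two numbers matters, halving MTTR improves the figure exactly as much as doubling MTBF, which is one reason tested restore procedures pay off.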
DR Strategy Comparison
| Strategy | RTO | RPO | Cost | Complexity |
|---|---|---|---|---|
| Backup & Restore | Hours–Days | Hours | $ | Low |
| Pilot Light | 10–30 min | Minutes | $$ | Medium |
| Warm Standby | Minutes | Minutes | $$$ | Medium-High |
| Active-Active | Near-zero | Near-zero | $$$$ | High |
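One way to read the comparison table is as a lookup from the required RTO to the cheapest strategy that can meet it. The sketch below assumes cut-offs of 240, 30, and 5 minutes, which are illustrative values chosen to match the table's rough bands, and it ignores RPO for brevity.

```shell
# pick_strategy RTO_MINUTES
# Maps a required RTO to the cheapest strategy in the comparison table.
# The cut-offs (240, 30, 5 minutes) are assumptions, not GCP guidance;
# replace them with figures from your own failover drills.
pick_strategy() {
  rto="$1"
  if [ "$rto" -ge 240 ]; then
    echo "Backup & Restore"
  elif [ "$rto" -ge 30 ]; then
    echo "Pilot Light"
  elif [ "$rto" -ge 5 ]; then
    echo "Warm Standby"
  else
    echo "Active-Active"
  fi
}

pick_strategy 240   # the 4-hour RTO example: prints Backup & Restore
pick_strategy 15    # prints Warm Standby
```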
Backup & Restore
The simplest strategy — regular backups with documented restore procedures:
# Cloud SQL automated backups (enable at creation)
gcloud sql instances create app-db \
--database-version=POSTGRES_15 \
--tier=db-custom-2-4096 \
--region=us-central1 \
--backup-start-time=03:00 \
--enable-point-in-time-recovery \
--retained-backups-count=30
# Manual backup before major changes
gcloud sql backups create --instance=app-db \
--description="pre-migration-$(date +%Y%m%d)"
# Restore to a different instance (the target must already exist; its data is overwritten)
gcloud sql backups restore BACKUP_ID \
--restore-instance=app-db-restored \
--backup-instance=app-db
# Point-in-time recovery
gcloud sql instances clone app-db app-db-pitr \
--point-in-time='2024-06-01T10:00:00.000Z'
GCS Cross-Region Replication
# Enable object versioning (guards against accidental overwrites and deletes; replication does not require it)
gsutil versioning set on gs://learning-gcp-dev-assets/
# Create DR bucket in another region
gsutil mb -l europe-west1 -c STANDARD gs://learning-gcp-dev-assets-dr/
# Copy objects to the DR bucket with Storage Transfer Service, or use a dual-region bucket.
# On a dual-region bucket, turbo replication targets a 15-minute RPO:
gcloud storage buckets update gs://learning-gcp-dev-assets \
--recovery-point-objective=ASYNC_TURBO
For dual-region buckets, GCP automatically stores data in two regions within a configured pair (e.g., nam4 = Iowa + South Carolina).
Compute Engine Snapshots
# Snapshot schedule for production disks
gcloud compute resource-policies create snapshot-schedule daily-dr \
--max-retention-days=30 \
--daily-schedule \
--start-time=02:00 \
--region=us-central1 \
--storage-location=europe-west1
gcloud compute disks add-resource-policies boot-disk \
--resource-policies=daily-dr \
--zone=us-central1-a
# Restore from snapshot in DR region
gcloud compute disks create restored-disk \
--source-snapshot=SNAPSHOT_NAME \
--zone=europe-west1-b
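A snapshot schedule only helps if the newest snapshot is recent enough for your RPO. The sketch below checks a backup's age against an RPO; it takes plain Unix timestamps so it runs without any cloud access, and the function name `backup_within_rpo` is hypothetical.

```shell
# backup_within_rpo LAST_BACKUP_EPOCH NOW_EPOCH RPO_SECONDS
# Succeeds (exit 0) when the newest backup is no older than the RPO.
# A backup timestamp in the future is treated as a failure (likely clock or input error).
backup_within_rpo() {
  age=$(( $2 - $1 ))
  [ "$age" -ge 0 ] && [ "$age" -le "$3" ]
}

# Example: a snapshot taken 20 hours ago checked against a 24-hour RPO.
now=$(date +%s)
if backup_within_rpo $(( now - 72000 )) "$now" 86400; then
  echo "snapshot age is within RPO"
else
  echo "snapshot is older than RPO"
fi
```

A daily schedule can leave the newest snapshot almost 24 hours old, so on its own it cannot satisfy an RPO measured in minutes.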
Pilot Light
Minimal resources running in DR region — scale up on failover:
| Component | Primary (us-central1) | DR (europe-west1) |
|---|---|---|
| Cloud SQL | db-custom-4-16384 HA | Cross-region read replica (promote on failover) |
| Cloud Run | 4 services, min-instances=2 | Services deployed, min-instances=0 |
| Container images | Active in Artifact Registry | Replicated to DR region registry |
| VPC | Full production | Pre-configured, minimal resources |
| Cloud DNS | Active routing | Failover record (standby) |
# Promote Cloud SQL read replica to standalone (failover)
gcloud sql instances promote-replica app-db-dr-replica
# Scale Cloud Run from 0 to production capacity
gcloud run services update web-app \
--region=europe-west1 \
--min-instances=2 \
--max-instances=50
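During a drill it helps to rehearse the failover sequence without touching anything. The sketch below wraps the two commands above in a dry-run helper; `run`, `failover_pilot_light`, and the `DRY_RUN` variable are hypothetical names introduced for illustration.

```shell
# run CMD...: prints the command when DRY_RUN is 1 (the default), executes it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "DRY-RUN: $*"
  else
    "$@"
  fi
}

# failover_pilot_light: the two failover commands from this section, in order.
# Stops before scaling Cloud Run if the database promotion fails.
failover_pilot_light() {
  run gcloud sql instances promote-replica app-db-dr-replica || return 1
  run gcloud run services update web-app \
    --region=europe-west1 --min-instances=2 --max-instances=50
}

failover_pilot_light
```

Setting `DRY_RUN=0` executes the commands for real, so keeping the default at 1 makes the safe behavior the one you get by accident.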
Warm Standby
Reduced-capacity DR environment always running — faster failover than pilot light:
# DR region runs at 20-30% capacity
# Primary: GKE 10 nodes → DR: GKE 2-3 nodes always running
# On failover, scale DR cluster (for a regional cluster, --num-nodes is the node count per zone)
gcloud container clusters resize dr-cluster \
--region=europe-west1 \
--num-nodes=10
# Scale Cloud Run services
gcloud run services update web-app \
--region=europe-west1 \
--min-instances=5
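The 20-30% sizing rule above is simple arithmetic. As a small sketch, the hypothetical helper below rounds up so the standby never drops below the requested share of primary capacity.

```shell
# standby_nodes PRIMARY_NODES PERCENT
# Integer ceiling of PRIMARY_NODES * PERCENT / 100.
standby_nodes() {
  echo $(( ($1 * $2 + 99) / 100 ))
}

standby_nodes 10 20   # prints 2
standby_nodes 10 30   # prints 3
```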
Active-Active Multi-Region
Full capacity in multiple regions with traffic distributed:
# Cloud DNS geolocation routing
gcloud dns record-sets create api.example.com. \
--zone=example-com \
--type=A \
--ttl=60 \
--routing-policy-type=GEO \
--routing-policy-data="us-central1=203.0.113.10;europe-west1=198.51.100.10"
# Global HTTPS load balancer with backends in both regions
gcloud compute backend-services create api-backend \
--global --protocol=HTTP --health-checks=http-health
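# Assumes zonal NEGs, which need a zone and the RATE balancing mode; zones and max rate are placeholders
gcloud compute backend-services add-backend api-backend \
--global --network-endpoint-group=neg-us --network-endpoint-group-zone=us-central1-a \
--balancing-mode=RATE --max-rate-per-endpoint=100
gcloud compute backend-services add-backend api-backend \
--global --network-endpoint-group=neg-eu --network-endpoint-group-zone=europe-west1-b \
--balancing-mode=RATE --max-rate-per-endpoint=100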
Active-Active Challenges
| Challenge | Solution |
|---|---|
| Data consistency | Cloud Spanner (global), Firestore multi-region |
| Session state | Stateless apps + Memorystore (regional) |
| Deployment sync | Cloud Build deploys to both regions simultaneously |
| Conflict resolution | Last-writer-wins or application-level merge |
| Cost | 2× infrastructure; justify with RTO/RPO requirements |
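To make the last-writer-wins row concrete, here is a minimal sketch that resolves two conflicting writes by timestamp. The function name and the tie rule are assumptions made for illustration, not how any particular GCP database resolves conflicts.

```shell
# lww_pick TS_A VALUE_A TS_B VALUE_B
# Last-writer-wins: prints the value with the newer timestamp.
# On a tie it keeps A, an arbitrary rule chosen for this sketch;
# a real system needs a tie-breaker that every region applies identically.
lww_pick() {
  if [ "$3" -gt "$1" ]; then
    printf '%s\n' "$4"
  else
    printf '%s\n' "$2"
  fi
}

# Two regions wrote the same key; the second write is five seconds newer.
lww_pick 1717236000 "status=pending" 1717236005 "status=shipped"   # prints status=shipped
```

Last-writer-wins silently discards the older write and depends on synchronized clocks, which is why the table lists application-level merge as the alternative.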
Cloud SQL DR Options
| Feature | RTO | RPO | Use Case |
|---|---|---|---|
| Automated backups | Hours | Hours | Basic DR |
| Point-in-time recovery | Hours | Seconds | Accidental deletion recovery |
| Regional HA (failover) | ~60 seconds | Zero (sync) | Zone failure within region |
| Cross-region read replica | Minutes | Seconds–Minutes | Regional DR |
| Clone instance | Minutes | Point-in-time | Testing, migration |
# Cross-region read replica
gcloud sql instances create app-db-dr \
--master-instance-name=app-db \
--tier=db-custom-2-4096 \
--region=europe-west1
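# Check replica state; replication lag itself is exposed through the Cloud Monitoring
# metric cloudsql.googleapis.com/database/replication/replica_lag (seconds)
gcloud sql instances describe app-db-dr \
--format="value(state,masterInstanceName)"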
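Replication lag matters most at promotion time, because anything not yet replicated is lost. The sketch below gates promotion on a lag value you supply; the function name is hypothetical, and the 30-second default mirrors the threshold in this document's financial-services scenario.

```shell
# safe_to_promote LAG_SECONDS [THRESHOLD_SECONDS]
# Succeeds only when the lag is a whole number below the threshold (default 30).
# Empty or non-numeric input fails closed, so a missing metric never approves a promotion.
safe_to_promote() {
  lag="$1"
  threshold="${2:-30}"
  case "$lag" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$lag" -lt "$threshold" ]
}

if safe_to_promote 12; then
  echo "lag OK: promote replica"
else
  echo "lag too high or unknown: hold promotion"
fi
```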
GKE Disaster Recovery
# Backup for GKE (application and cluster state)
gcloud container backup-restore backup-plans create gke-dr-plan \
--cluster=projects/learning-gcp-dev/locations/us-central1/clusters/prod-cluster \
--location=us-central1 \
--all-namespaces \
--include-secrets \
--include-volume-data \
--backup-retain-days=30
# Restore to DR cluster (assumes a restore plan, here called gke-dr-restore-plan,
# already exists in europe-west1 and targets dr-cluster)
gcloud container backup-restore restores create gke-restore \
--location=europe-west1 \
--restore-plan=gke-dr-restore-plan \
--backup=projects/learning-gcp-dev/locations/us-central1/backupPlans/gke-dr-plan/backups/BACKUP_ID
DR Runbook Template
Every DR strategy needs a tested runbook:
# DR Failover Runbook: Production API
## Trigger Conditions
- Primary region (us-central1) unavailable for > 5 minutes
- Cloud SQL primary failure with HA failover unsuccessful
- Security incident requiring region isolation
## Failover Steps
1. Confirm primary region outage (Cloud Monitoring, uptime checks)
2. Notify stakeholders via PagerDuty (#incident-dr channel)
3. Promote Cloud SQL read replica in europe-west1
4. Scale Cloud Run / GKE services in DR region to production capacity
5. Update Cloud DNS failover record to point to DR load balancer
6. Verify application health: curl https://api.example.com/health
7. Monitor DR region metrics for 30 minutes
8. Document incident timeline for postmortem
## Rollback Steps
1. Confirm primary region restored and stable
2. Create new read replica from DR primary back to us-central1
3. Scale DR back to standby capacity
4. Update Cloud DNS to primary region
5. Verify and notify stakeholders
## Test Schedule
- Quarterly: Full failover drill (non-business hours)
- Monthly: Backup restore verification
- Weekly: Automated DR health check script
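Step 6 of the failover steps (verify application health) usually needs a few retries while DNS and instances settle. The sketch below is one possible retry loop; `wait_healthy` is a hypothetical helper, and it uses `true` and `false` as stand-in probes so it runs anywhere.

```shell
# wait_healthy MAX_ATTEMPTS SLEEP_SECONDS PROBE_CMD...
# Runs the probe until it succeeds or attempts run out; returns 0 on success, 1 otherwise.
wait_healthy() {
  max="$1"; pause="$2"; shift 2
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "healthy after $attempt attempt(s)"
      return 0
    fi
    attempt=$(( attempt + 1 ))
    # Sleep only if another attempt follows.
    [ "$attempt" -le "$max" ] && sleep "$pause"
  done
  echo "still unhealthy after $max attempt(s)"
  return 1
}

# Stand-in probes; in the runbook the probe would be something like:
#   wait_healthy 10 30 curl -fsS https://api.example.com/health
wait_healthy 3 0 true
wait_healthy 2 0 false || echo "escalate to incident channel"
```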
Backup for GCE (Centralized)
Manage backups across Compute Engine resources:
# Create a snapshot schedule for VM disks (a schedule takes a single storage location)
gcloud compute resource-policies create snapshot-schedule vm-backup \
--max-retention-days=35 \
--daily-schedule \
--start-time=05:00 \
--region=us-central1 \
--storage-location=europe-west1
# Attach to the VM's disk (snapshot schedules attach to disks; assumes the boot disk shares the VM's name)
gcloud compute disks add-resource-policies web-server-01 \
--resource-policies=vm-backup \
--zone=us-central1-a
Real-World Scenario: Financial Services DR
| Requirement | Implementation |
|---|---|
| RTO: 15 minutes | Warm standby in europe-west1 |
| RPO: 5 minutes | Cloud SQL cross-region read replica |
| Compliance: 7-year retention | GCS Archive with bucket lock |
| Failover testing | Quarterly automated DR drill |
| Monitoring | Cloud DNS health checks + Cloud Monitoring composite alerts |
| Communication | PagerDuty integration with runbook links |
| Data integrity | Promote replica only after confirming replication lag < 30s |
Common Mistakes
- Backups without tested restores — untested backups are wishful thinking
- DR region in same geography: us-central1 and us-east1 share US risk; use europe-west1
- No runbook: panic-driven failover takes 10× longer
- Ignoring data consistency — failover with stale data corrupts business logic
- DR resources not maintained — container images, IaC, and configs drift from production
- Never testing failover — discover broken DR during actual disaster
- High DNS TTL during failover — clients cache stale records for hours
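The DNS TTL point can be put in numbers. As a rough sketch, the time clients keep reaching a failed region is about the failure-detection time plus the record TTL. The check interval of 10 seconds and the threshold of 3 failed checks below are assumed values, and the function name is hypothetical.

```shell
# worst_case_cutover TTL_SECONDS CHECK_INTERVAL_SECONDS UNHEALTHY_THRESHOLD
# Estimate: (interval * consecutive failed checks) to detect, plus TTL for caches to expire.
# Resolvers that ignore TTLs can take longer, so treat the result as an estimate.
worst_case_cutover() {
  echo $(( $2 * $3 + $1 ))
}

worst_case_cutover 60 10 3     # prints 90
worst_case_cutover 3600 10 3   # prints 3630
```

With a one-hour TTL, the cached record dominates everything else in the failover timeline.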
Troubleshooting DR Failures
| Issue | Cause | Fix |
|---|---|---|
| Restore takes too long | Large snapshot, wrong instance tier | Pre-provision DR instances; use Cloud SQL HA for fast failover |
| DNS not switching | Health check still passing on failed region | Lower health check thresholds; use Cloud Monitoring alert-based failover |
| Data inconsistency after failover | Async replication lag | Monitor replication lag; promote only when lag < threshold |
| DR region costs too high | Full active-active when warm standby suffices | Match strategy to actual RTO/RPO requirements |
| GKE restore fails | Backup plan not covering all namespaces | Verify backup plan includes secrets and volume data |
Best Practices
- Define RTO/RPO with business stakeholders, not engineers alone
- Test failover quarterly — automate where possible
- Store Terraform/IaC templates ready to deploy in DR region
- Replicate container images to DR region Artifact Registry continuously
- Use Cloud DNS health-checked failover for automatic traffic switching
- Maintain DR runbooks with step-by-step commands
- Monitor replication lag (Cloud SQL, GCS) with Cloud Monitoring alerts
- Include DR costs in budget planning — DR is insurance, not waste
- Lower DNS TTL to 60 seconds before planned failover events
- Use dual-region GCS buckets for data that must survive regional failure
This completes the GCP Expert Deep Dives track. Review the GCP index for the full learning path.