Cloud Monitoring
Cloud Monitoring (formerly Stackdriver) is the metrics, alerting, and dashboard service in GCP’s observability suite. Together with Cloud Logging and Cloud Trace, it gathers metrics, logs, and traces from GCP services, applications, and hybrid infrastructure. Without observability, you are flying blind: incidents go undetected, root causes take hours instead of minutes, and capacity planning becomes guesswork.
Monitoring Stack
| Component | Purpose | Data Source |
|---|---|---|
| Cloud Monitoring | Metrics, alerts, dashboards | GCP services, custom metrics |
| Cloud Logging | Log storage and analysis | Application and audit logs |
| Cloud Trace | Distributed request tracing | Instrumented applications |
| Error Reporting | Aggregated error tracking | Application error logs |
| Cloud Profiler | CPU/memory profiling | Instrumented applications |
View Metrics
# List metric descriptors
gcloud monitoring metrics-descriptors list \
--filter="metric.type=compute.googleapis.com/instance/cpu/utilization" \
--limit=5
# Query time series
gcloud monitoring time-series list \
--filter='metric.type="compute.googleapis.com/instance/cpu/utilization"' \
--interval-start-time=2024-01-15T00:00:00Z \
--interval-end-time=2024-01-15T23:59:59Z
In Console: Monitoring → Metrics Explorer → select resource type and metric.
Key GCP Metrics to Monitor
| Service | Metric | Alert Threshold |
|---|---|---|
| Compute Engine | instance/cpu/utilization | > 80% for 5 min |
| Cloud SQL | database/cpu/utilization | > 85% for 10 min |
| GKE | container/cpu/limit_utilization | > 90% for 5 min |
| Cloud Run | request_latencies (p99) | > 500ms for 5 min |
| Cloud Storage | storage/total_bytes | Budget-based |
| Load Balancer | https/backend_latencies | p99 > 1s |
Logging with Cloud Logging
Query logs with the Logging query language:
resource.type="gce_instance"
severity>=ERROR
timestamp>="2024-01-15T00:00:00Z"
jsonPayload.message=~"connection refused"
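Clauses written on separate lines, as in the query above, are combined with an implicit AND. The sketch below assembles the same kind of filter from optional parts so it can be passed to a tool or API call. `buildLogFilter` and its parameter names are hypothetical, and the helper does not escape quotes in its inputs.

```javascript
// Hypothetical helper: builds a Logging query from optional parts.
// Each clause goes on its own line, which the query language treats as AND.
// Assumption: inputs are trusted, so embedded quotes are not escaped.
function buildLogFilter({ resourceType, minSeverity, since, messagePattern }) {
  const clauses = [];
  if (resourceType) clauses.push(`resource.type="${resourceType}"`);
  if (minSeverity) clauses.push(`severity>=${minSeverity}`);
  if (since) clauses.push(`timestamp>="${since.toISOString()}"`);
  if (messagePattern) clauses.push(`jsonPayload.message=~"${messagePattern}"`);
  return clauses.join('\n');
}

const filter = buildLogFilter({
  resourceType: 'gce_instance',
  minSeverity: 'ERROR',
  since: new Date('2024-01-15T00:00:00Z'),
  messagePattern: 'connection refused',
});
console.log(filter);
```

Omitting a part simply drops its clause, so `buildLogFilter({ minSeverity: 'ERROR' })` yields the single line `severity>=ERROR`.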
Export logs to BigQuery, Cloud Storage, or Pub/Sub for long-term analysis:
gcloud logging sinks create error-logs-sink \
bigquery.googleapis.com/projects/learning-gcp-dev/datasets/logs \
--log-filter='severity>=ERROR'
# List sinks
gcloud logging sinks list
# View recent errors
gcloud logging read 'severity>=ERROR' --limit=10 --format=json
Log-Based Metrics
Turn log patterns into alertable metrics:
# Create a log-based metric for 500 errors
gcloud logging metrics create http_500_errors \
--description="Count of HTTP 500 responses" \
--log-filter='resource.type="cloud_run_revision"
httpRequest.status=500'
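To make clear what this counter metric measures, here is a simplified in-memory equivalent of the filter: it counts entries that come from a Cloud Run revision and carry an HTTP 500 status. The sample entries are invented for illustration, and `countHttp500` is a hypothetical name; the real metric is computed by Cloud Logging as entries arrive.

```javascript
// In-memory stand-in for the log filter above (both conditions must match).
function countHttp500(entries) {
  return entries.filter(
    (e) => e.resource?.type === 'cloud_run_revision' && e.httpRequest?.status === 500
  ).length;
}

// Invented sample entries, shaped like simplified LogEntry objects.
const sampleEntries = [
  { resource: { type: 'cloud_run_revision' }, httpRequest: { status: 500 } },
  { resource: { type: 'cloud_run_revision' }, httpRequest: { status: 200 } },
  { resource: { type: 'gce_instance' }, httpRequest: { status: 500 } },
  { resource: { type: 'cloud_run_revision' }, httpRequest: { status: 500 } },
];

console.log(countHttp500(sampleEntries)); // 2
```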
Create Alerts
# Create notification channel (email)
gcloud alpha monitoring channels create \
--display-name="Ops Team Email" \
--type=email \
--channel-labels=email_address=OPS_TEAM_EMAIL
# Create alert policy
gcloud alpha monitoring policies create \
--display-name="High CPU Alert" \
--combiner=OR \
--condition-display-name="CPU > 80%" \
--condition-filter='metric.type="compute.googleapis.com/instance/cpu/utilization" AND resource.type="gce_instance"' \
--if="> 0.8" \
--duration=300s \
--notification-channels=CHANNEL_ID \
--documentation="Runbook: https://wiki.company.com/high-cpu"
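The policy above pairs a threshold (0.8) with a duration (300s): a brief spike should not page anyone, only a breach that holds for the whole window. The following sketch models that behaviour on a list of samples. It is a simplification that assumes evenly spaced points and ignores the alignment and aggregation steps Cloud Monitoring applies; `breachesThreshold` is a hypothetical name.

```javascript
// Simplified model of a threshold condition with a duration window.
// points: [{ t: seconds, value: number }]; the condition holds only when
// every sample in the trailing window exceeds the threshold.
function breachesThreshold(points, threshold, durationSec, nowSec) {
  const recent = points.filter((p) => p.t > nowSec - durationSec && p.t <= nowSec);
  return recent.length > 0 && recent.every((p) => p.value > threshold);
}

// One sample per minute over five minutes, all at 90% CPU.
const sustained = [60, 120, 180, 240, 300].map((t) => ({ t, value: 0.9 }));
// Same series with a single dip to 50% in the middle.
const withDip = sustained.map((p) => (p.t === 180 ? { ...p, value: 0.5 } : p));

console.log(breachesThreshold(sustained, 0.8, 300, 300)); // true
console.log(breachesThreshold(withDip, 0.8, 300, 300)); // false
```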
Alert Severity Levels
| Severity | Response Time | Example |
|---|---|---|
| P1 — Critical | < 15 min | Production down, data loss |
| P2 — High | < 1 hour | Degraded performance, partial outage |
| P3 — Medium | < 4 hours | Non-critical service affected |
| P4 — Low | Next business day | Warning threshold, capacity planning |
Dashboards and SLOs
- Dashboards: Custom charts combining metrics from multiple services
- Uptime checks: Synthetic monitoring from global locations
- SLOs: Define service level objectives based on availability or latency metrics
Example SLO: 99.9% of HTTP requests complete in under 200ms over a 30-day window.
SLI: fraction of requests with request_latency < 200ms
SLO: 99.9% of requests meet the SLI over 30 days
Error budget: 0.1% of requests may miss the target (for a time-based availability SLO, 0.1% of 30 days is about 43 minutes of downtime)
When the error budget is exhausted, freeze feature releases and focus on reliability.
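The error budget arithmetic above can be checked in a few lines. This is a minimal sketch: `errorBudget` is a hypothetical helper, and the figure of 10,000,000 requests per window is an invented traffic volume used only to show the request-based view next to the time-based one.

```javascript
// Hypothetical helper: converts an SLO target into an error budget,
// expressed both as minutes of the window and as a number of requests.
function errorBudget(sloTarget, windowDays, totalRequests) {
  const budgetFraction = 1 - sloTarget;
  const windowMinutes = windowDays * 24 * 60;
  return {
    allowedBadMinutes: budgetFraction * windowMinutes,
    allowedBadRequests: Math.round(budgetFraction * totalRequests),
  };
}

// 99.9% over 30 days, assuming 10,000,000 requests in the window.
const budget = errorBudget(0.999, 30, 10000000);
console.log(budget.allowedBadMinutes.toFixed(1)); // "43.2"
console.log(budget.allowedBadRequests); // 10000
```

Tracking the requests figure is usually closer to what a latency SLO measures; the minutes figure is the familiar shorthand for availability targets.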
Application Instrumentation
Node.js with OpenTelemetry:
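const { NodeSDK } = require('@opentelemetry/sdk-node');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { TraceExporter } = require('@google-cloud/opentelemetry-cloud-trace-exporter');
const { MetricExporter } = require('@google-cloud/opentelemetry-cloud-monitoring-exporter');

// Set the service version through the environment, for example:
// OTEL_RESOURCE_ATTRIBUTES=service.version=1.2.0
const sdk = new NodeSDK({
  serviceName: 'web-app',
  traceExporter: new TraceExporter(),
  // NodeSDK expects a metric reader that wraps the exporter
  metricReader: new PeriodicExportingMetricReader({ exporter: new MetricExporter() }),
});
sdk.start();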
Install the Ops Agent on Compute Engine VMs for automatic metric and log collection:
curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
sudo bash add-google-cloud-ops-agent-repo.sh --also-install
Real-World Scenario: E-Commerce Monitoring
An online store monitors the full request path:
Uptime Check (synthetic) → Alert if homepage > 3s
↓
Cloud Run metrics (request count, latency p50/p99, error rate)
↓
Cloud SQL metrics (connections, replication lag, CPU)
↓
Cloud Logging (structured JSON logs with trace ID)
↓
Cloud Trace (distributed trace across services)
↓
Error Reporting (aggregated stack traces)
SLO dashboard shows error budget remaining. PagerDuty integration fires on P1/P2 alerts only.
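The paging rule in this scenario can be written down as a small routing table that combines it with the response times from the severity levels listed earlier. This is an illustrative sketch only: `ROUTES`, `routeAlert`, and the `ticket` channel for P3/P4 are assumptions, not part of any PagerDuty or Cloud Monitoring API.

```javascript
// Hypothetical routing table: only P1 and P2 page a human.
const ROUTES = {
  P1: { channel: 'pager', respondWithinMinutes: 15 },
  P2: { channel: 'pager', respondWithinMinutes: 60 },
  P3: { channel: 'ticket', respondWithinMinutes: 240 },
  P4: { channel: 'ticket', respondWithinMinutes: null }, // next business day
};

// Looks up the route for a severity label and rejects unknown labels,
// so a typo in an alert policy does not silently drop the alert.
function routeAlert(severity) {
  const route = ROUTES[severity];
  if (!route) throw new Error(`Unknown severity: ${severity}`);
  return route;
}

console.log(routeAlert('P1').channel); // pager
console.log(routeAlert('P3').channel); // ticket
```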
Common Mistakes
| Mistake | Impact | Fix |
|---|---|---|
| Alerting on everything | Alert fatigue, ignored pages | Alert on symptoms, not causes |
| No runbook links in alerts | Slow incident response | Add documentation to alert policies |
| Logs without structure | Unqueryable, unparseable | Use JSON structured logging |
| No SLOs defined | No reliability targets | Define SLIs and SLOs per service |
| Monitoring only infrastructure | Blind to user experience | Add uptime checks and app metrics |
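As an illustration of the structured logging fix, the sketch below writes one JSON object per line to stdout. On managed runtimes such as Cloud Run, JSON lines are parsed into structured entries, and the `severity`, `message`, and `logging.googleapis.com/trace` fields are recognized specially. `buildLogEntry` is a hypothetical helper, and `orderId` is an invented business field.

```javascript
// Hypothetical helper: builds a structured log entry as a plain object.
// The trace field links the entry to a trace when both IDs are known.
function buildLogEntry(severity, message, { projectId, traceId, ...fields } = {}) {
  const entry = { severity, message, ...fields };
  if (projectId && traceId) {
    entry['logging.googleapis.com/trace'] = `projects/${projectId}/traces/${traceId}`;
  }
  return entry;
}

const logEntry = buildLogEntry('ERROR', 'connection refused', {
  projectId: 'learning-gcp-dev',
  traceId: '105445aa7843bc8bf206b12000100000',
  orderId: 'A-1001', // invented field for illustration
});

// One JSON object per line keeps the entry queryable in Cloud Logging.
console.log(JSON.stringify(logEntry));
```

With entries in this shape, a query such as `jsonPayload.orderId="A-1001"` can find every log line for one order.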
Best Practices
- Define SLIs (latency, error rate, throughput) per service
- Collect infrastructure metrics automatically; add custom metrics for business KPIs
- Set alerts on actionable thresholds with runbook links
- Use log-based metrics to turn log patterns into alertable metrics
- Review dashboards weekly; tune alert thresholds to reduce noise
- Implement distributed tracing for microservices
- Export logs to BigQuery for long-term analysis and compliance
- Use Error Reporting for automatic error grouping and notification
Troubleshooting
Metrics not appearing:
# Verify Ops Agent is running on VM
sudo systemctl status google-cloud-ops-agent
# Check agent config
sudo cat /etc/google-cloud-ops-agent/config.yaml
Alert not firing:
gcloud alpha monitoring policies list --format="table(displayName,enabled)"
# Confirm the notification channel is enabled and, for channel types that require it, verified
gcloud alpha monitoring channels list
High logging costs:
# Check log volume by resource
gcloud logging read 'timestamp>="2024-01-01"' --format='value(resource.type)' | sort | uniq -c | sort -rn
# Add an exclusion filter for noisy debug logs on the _Default sink
gcloud logging sinks update _Default --add-exclusion=name=exclude-debug,filter='severity<=DEBUG'
Trace gaps in distributed requests:
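Ensure every service propagates the same trace context header end to end (the W3C traceparent header that OpenTelemetry uses by default, or the legacy X-Cloud-Trace-Context header) and that all services export spans to the same project.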
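When debugging gaps, it helps to check what a service actually receives and forwards. The X-Cloud-Trace-Context header has the form `TRACE_ID/SPAN_ID;o=OPTIONS`, where the trace ID is 32 hexadecimal characters. The sketch below parses that value and copies it onto an outgoing request. Both function names are hypothetical, and forwarding the header unchanged is a simplification: a tracing library would also substitute the current span ID.

```javascript
const TRACE_HEADER = 'x-cloud-trace-context';

// Parses "TRACE_ID/SPAN_ID;o=OPTIONS"; returns null for malformed values.
function parseCloudTraceContext(value) {
  const match = /^([0-9a-f]{32})\/(\d+)(?:;o=([01]))?$/i.exec(value || '');
  if (!match) return null;
  return {
    traceId: match[1].toLowerCase(),
    spanId: match[2],
    sampled: match[3] === '1',
  };
}

// Copies the trace header from an incoming request onto an outgoing one.
// Assumes header names are lower-cased, as Node's http module provides them.
function outgoingHeaders(incomingHeaders) {
  const value = incomingHeaders[TRACE_HEADER];
  return value ? { [TRACE_HEADER]: value } : {};
}

const incoming = { 'x-cloud-trace-context': '105445aa7843bc8bf206b12000100000/1;o=1' };
console.log(parseCloudTraceContext(incoming[TRACE_HEADER]));
console.log(outgoingHeaders(incoming));
```

A `null` result from the parser on an internal hop is a quick signal that an upstream service dropped or rewrote the header.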
Effective monitoring transforms raw telemetry into operational awareness and faster incident response.