Cloud Monitoring (formerly Stackdriver) is the core of GCP's unified observability platform. Together with Cloud Logging and Cloud Trace, it collects metrics, logs, and traces from GCP services, applications, and hybrid infrastructure. Without observability, you are flying blind: incidents go undetected, root causes take hours instead of minutes to find, and capacity planning becomes guesswork.

Monitoring Stack

Component         Purpose                      Data Source
Cloud Monitoring  Metrics, alerts, dashboards  GCP services, custom metrics
Cloud Logging     Log storage and analysis     Application and audit logs
Cloud Trace       Distributed request tracing  Instrumented applications
Error Reporting   Aggregated error tracking    Application error logs
Cloud Profiler    CPU/memory profiling         Instrumented applications

View Metrics

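  # List metric descriptors (Monitoring API v3; PROJECT_ID is a placeholder)
curl -s -G \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  --data-urlencode 'filter=metric.type="compute.googleapis.com/instance/cpu/utilization"' \
  --data-urlencode 'pageSize=5' \
  "https://monitoring.googleapis.com/v3/projects/PROJECT_ID/metricDescriptors"

# Query time series for one day
curl -s -G \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  --data-urlencode 'filter=metric.type="compute.googleapis.com/instance/cpu/utilization"' \
  --data-urlencode 'interval.startTime=2024-01-15T00:00:00Z' \
  --data-urlencode 'interval.endTime=2024-01-15T23:59:59Z' \
  "https://monitoring.googleapis.com/v3/projects/PROJECT_ID/timeSeries"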
  

In Console: Monitoring → Metrics Explorer → select resource type and metric.

Key GCP Metrics to Monitor

Service         Metric                           Alert Threshold
Compute Engine  instance/cpu/utilization         > 80% for 5 min
Cloud SQL       database/cpu/utilization         > 85% for 10 min
GKE             container/cpu/limit_utilization  > 90% for 5 min
Cloud Run       request_latencies (p99)          > 500 ms for 5 min
Cloud Storage   storage/total_bytes              Budget-based
Load Balancer   https/backend_latencies (p99)    > 1 s
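
A threshold such as "> 80% for 5 min" only fires when the metric stays above the limit for the whole window, which is what filters out short spikes. The sketch below is a simplified model of that idea, not a description of how Cloud Monitoring evaluates conditions internally; `breachesFor` and the sample shape are hypothetical.

```javascript
// Hypothetical helper: true when every sample in the trailing window
// exceeds the threshold AND the samples cover the whole window.
// Samples must be sorted by time; t is in seconds, value is a 0..1 fraction.
function breachesFor(samples, threshold, durationSec) {
  if (samples.length === 0) return false;
  const latest = samples[samples.length - 1].t;
  const windowStart = latest - durationSec;
  const inWindow = samples.filter((s) => s.t >= windowStart);
  // Without full coverage, a single high sample would count as a breach.
  const coversWindow = inWindow[0].t <= windowStart;
  return coversWindow && inWindow.every((s) => s.value > threshold);
}

// One CPU sample per minute for five minutes.
const sustained = [0, 60, 120, 180, 240, 300].map((t) => ({ t, value: 0.9 }));
// Same series with a dip below the threshold at t=120.
const withDip = sustained.map((s) => (s.t === 120 ? { ...s, value: 0.5 } : s));

console.log(breachesFor(sustained, 0.8, 300)); // sustained breach
console.log(breachesFor(withDip, 0.8, 300));   // the dip breaks the streak
```

Longer durations trade detection speed for fewer false alarms, which is why the table pairs each threshold with a window.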

Logging with Cloud Logging

Query logs with Logging query language:

  resource.type="gce_instance"
severity>=ERROR
timestamp>="2024-01-15T00:00:00Z"
jsonPayload.message=~"connection refused"
  

Export logs to BigQuery, Cloud Storage, or Pub/Sub for long-term analysis:

  gcloud logging sinks create error-logs-sink \
  bigquery.googleapis.com/projects/learning-gcp-dev/datasets/logs \
  --log-filter='severity>=ERROR'

# List sinks
gcloud logging sinks list

# View recent errors
gcloud logging read 'severity>=ERROR' --limit=10 --format=json
  

Log-Based Metrics

Turn log patterns into alertable metrics:

  # Create a log-based metric for 500 errors
gcloud logging metrics create http_500_errors \
  --description="Count of HTTP 500 responses" \
  --log-filter='resource.type="cloud_run_revision"
    httpRequest.status=500'
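
Conceptually, a counter-type log-based metric increments once for every log entry that matches the filter. The sketch below mimics that with in-memory entries; the sample entries and the `matchesHttp500` helper are hypothetical, and the real evaluation happens inside Cloud Logging.

```javascript
// Hypothetical entries shaped loosely like Cloud Logging LogEntry objects.
const entries = [
  { resource: { type: 'cloud_run_revision' }, httpRequest: { status: 500 } },
  { resource: { type: 'cloud_run_revision' }, httpRequest: { status: 200 } },
  { resource: { type: 'gce_instance' }, httpRequest: { status: 500 } },
  { resource: { type: 'cloud_run_revision' }, httpRequest: { status: 500 } },
];

// Mirrors the two-line filter: lines in a Logging filter are joined with AND.
function matchesHttp500(entry) {
  return (
    entry.resource?.type === 'cloud_run_revision' &&
    entry.httpRequest?.status === 500
  );
}

const http500Count = entries.filter(matchesHttp500).length;
console.log(http500Count);
```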
  

Create Alerts

  # Create notification channel (email); EMAIL_ADDRESS is a placeholder
gcloud alpha monitoring channels create \
  --display-name="Ops Team Email" \
  --type=email \
  --channel-labels=email_address=EMAIL_ADDRESS

# Create alert policy (alpha command; check flag names against your gcloud version)
gcloud alpha monitoring policies create \
  --display-name="High CPU Alert" \
  --condition-display-name="CPU > 80%" \
  --condition-filter='metric.type="compute.googleapis.com/instance/cpu/utilization" AND resource.type="gce_instance"' \
  --if="> 0.8" \
  --duration=300s \
  --combiner=OR \
  --notification-channels=CHANNEL_ID \
  --documentation="Runbook: https://wiki.company.com/high-cpu"
  

Alert Severity Levels

Severity       Response Time      Example
P1 (Critical)  < 15 min           Production down, data loss
P2 (High)      < 1 hour           Degraded performance, partial outage
P3 (Medium)    < 4 hours          Non-critical service affected
P4 (Low)       Next business day  Warning threshold, capacity planning

Dashboards and SLOs

  • Dashboards: Custom charts combining metrics from multiple services
  • Uptime checks: Synthetic monitoring from global locations
  • SLOs: Define service level objectives based on availability or latency metrics

Example SLO: 99.9% of HTTP requests complete in under 200ms over a 30-day window.

  SLI: request_latency < 200ms
SLO: 99.9% of requests meet SLI over 30 days
Error budget: 0.1% of requests (for a time-based availability SLO, 0.1% of 30 days = ~43 minutes of downtime)
  

When the error budget is exhausted, freeze feature releases and focus on reliability.
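
The arithmetic behind the error budget is small enough to sketch directly. The function names and the request counts below are illustrative assumptions; Cloud Monitoring's SLO feature computes the budget for you.

```javascript
// Request-based error budget for an SLO target such as 0.999.
function errorBudget(sloTarget, totalRequests, badRequests) {
  const allowedBad = (1 - sloTarget) * totalRequests;
  return {
    allowedBad,
    remaining: allowedBad - badRequests,
    exhausted: badRequests >= allowedBad,
  };
}

// Time-based view: minutes of full downtime a window can absorb.
function downtimeBudgetMinutes(sloTarget, windowDays) {
  return (1 - sloTarget) * windowDays * 24 * 60;
}

// Illustrative numbers: 2,000,000 requests in the window, 1,500 too slow.
const budget = errorBudget(0.999, 2_000_000, 1_500);
console.log(budget.exhausted); // false: about 2,000 slow requests are allowed
console.log(downtimeBudgetMinutes(0.999, 30).toFixed(1)); // "43.2"
```

A release gate could check `budget.exhausted` before a deploy, which turns the freeze rule into something enforceable rather than a convention.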

Application Instrumentation

Node.js with OpenTelemetry:

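  const { NodeSDK } = require('@opentelemetry/sdk-node');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { TraceExporter } = require('@google-cloud/opentelemetry-cloud-trace-exporter');
const { MetricExporter } = require('@google-cloud/opentelemetry-cloud-monitoring-exporter');

// Set the service version as a resource attribute, for example:
//   OTEL_RESOURCE_ATTRIBUTES=service.version=1.2.0
const sdk = new NodeSDK({
  serviceName: 'web-app',
  traceExporter: new TraceExporter(),
  // NodeSDK takes a metric reader, which wraps the exporter
  metricReader: new PeriodicExportingMetricReader({
    exporter: new MetricExporter(),
  }),
});
sdk.start();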
  

Install the Ops Agent on Compute Engine VMs for automatic metric and log collection:

  curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
sudo bash add-google-cloud-ops-agent-repo.sh --also-install
  

Real-World Scenario: E-Commerce Monitoring

An online store monitors the full request path:

  Uptime Check (synthetic) → Alert if homepage > 3s
  ↓
Cloud Run metrics (request count, latency p50/p99, error rate)
  ↓
Cloud SQL metrics (connections, replication lag, CPU)
  ↓
Cloud Logging (structured JSON logs with trace ID)
  ↓
Cloud Trace (distributed trace across services)
  ↓
Error Reporting (aggregated stack traces)
  

The SLO dashboard shows the remaining error budget. The PagerDuty integration fires only on P1/P2 alerts.
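
The paging rule can be expressed as a small routing function. This is a sketch under assumptions: the destination names `page` and `ticket` are hypothetical, and the response targets are copied from the severity table earlier in this article.

```javascript
// Response-time targets in minutes, from the severity table.
// P4 is "next business day", so it has no minute target here.
const RESPONSE_TARGET_MINUTES = { P1: 15, P2: 60, P3: 240, P4: null };

// Only P1 and P2 wake someone up; everything else becomes a ticket.
function routeAlert(severity) {
  if (!(severity in RESPONSE_TARGET_MINUTES)) {
    throw new Error(`Unknown severity: ${severity}`);
  }
  return severity === 'P1' || severity === 'P2' ? 'page' : 'ticket';
}

for (const severity of ['P1', 'P2', 'P3', 'P4']) {
  console.log(severity, routeAlert(severity));
}
```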

Common Mistakes

Mistake                         Impact                        Fix
Alerting on everything          Alert fatigue, ignored pages  Alert on symptoms, not causes
No runbook links in alerts      Slow incident response        Add documentation to alert policies
Logs without structure          Unqueryable, unparseable      Use JSON structured logging
No SLOs defined                 No reliability targets        Define SLIs and SLOs per service
Monitoring only infrastructure  Blind to user experience      Add uptime checks and app metrics
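
For the structured-logging fix, a minimal sketch of a JSON log line might look like the following. It assumes the application writes one JSON object per line to stdout and that a logging agent or the runtime ingests it; `logEntry`, `orderId`, `PROJECT_ID`, and the trace ID are illustrative. The `severity`, `message`, and `logging.googleapis.com/trace` keys are special fields that Cloud Logging recognizes in JSON payloads.

```javascript
// Build one structured log line as a JSON string.
function logEntry(severity, message, fields = {}, trace) {
  const entry = { severity, message, ...fields };
  if (trace) {
    // Links the log entry to a trace so logs and spans can be correlated.
    entry['logging.googleapis.com/trace'] =
      `projects/${trace.projectId}/traces/${trace.traceId}`;
  }
  return JSON.stringify(entry);
}

const line = logEntry(
  'ERROR',
  'connection refused',
  { orderId: 'A-1001' }, // hypothetical business field
  { projectId: 'PROJECT_ID', traceId: '105445aa7843bc8bf206b12000100000' }
);
console.log(line);
```

Because every field is a JSON key, a query such as jsonPayload.message=~"connection refused" can match it without regex parsing of free-form text.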

Best Practices

  1. Define SLIs (latency, error rate, throughput) per service
  2. Collect infrastructure metrics automatically; add custom metrics for business KPIs
  3. Set alerts on actionable thresholds with runbook links
  4. Use log-based metrics to turn log patterns into alertable metrics
  5. Review dashboards weekly; tune alert thresholds to reduce noise
  6. Implement distributed tracing for microservices
  7. Export logs to BigQuery for long-term analysis and compliance
  8. Use Error Reporting for automatic error grouping and notification

Troubleshooting

Metrics not appearing:

  # Verify Ops Agent is running on VM
sudo systemctl status google-cloud-ops-agent
# Check agent config
sudo cat /etc/google-cloud-ops-agent/config.yaml
  

Alert not firing:

  gcloud alpha monitoring policies list --format="table(displayName,enabled)"
# Check that the notification channel is verified (email channels need confirmation)
gcloud alpha monitoring channels list
  

High logging costs:

  # Check log volume by resource
gcloud logging read 'timestamp>="2024-01-01"' --format='value(resource.type)' | sort | uniq -c | sort -rn
# Add an exclusion filter for noisy debug logs (the exclusion name is an example)
gcloud logging sinks update _Default \
  --add-exclusion=name=exclude-debug,filter='severity<=DEBUG'
  

Trace gaps in distributed requests: Ensure all services propagate the X-Cloud-Trace-Context header and use the same trace exporter.
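
To make the propagation requirement concrete, here is a minimal sketch of reading and forwarding the header by hand. It assumes the documented header shape TRACE_ID/SPAN_ID;o=OPTIONS and lowercase header keys, as Node's HTTP server provides them; `parseTraceContext` and `propagate` are hypothetical helpers. With OpenTelemetry in place, a configured propagator normally handles this.

```javascript
// Parse "TRACE_ID/SPAN_ID;o=OPTIONS"; span ID and options are optional.
function parseTraceContext(header) {
  const m = /^([0-9a-f]{32})(?:\/(\d+))?(?:;o=([01]))?$/i.exec(header || '');
  if (!m) return null;
  return { traceId: m[1], spanId: m[2] ?? null, sampled: m[3] === '1' };
}

// Copy the incoming trace header onto the headers of an outbound request.
function propagate(incomingHeaders, outgoingHeaders = {}) {
  const value = incomingHeaders['x-cloud-trace-context'];
  return value
    ? { ...outgoingHeaders, 'x-cloud-trace-context': value }
    : { ...outgoingHeaders };
}

const incoming = {
  'x-cloud-trace-context': '105445aa7843bc8bf206b12000100000/1;o=1',
};
console.log(parseTraceContext(incoming['x-cloud-trace-context']));
console.log(propagate(incoming, { accept: 'application/json' }));
```

A service that drops this header starts a new trace, which shows up as a gap between spans.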

Effective monitoring transforms raw telemetry into operational awareness and faster incident response.

Next: Google Kubernetes Engine — managed Kubernetes clusters.