Cloud Monitoring

Cloud Monitoring (formerly Stackdriver Monitoring) is the metrics, alerting, and dashboard service at the center of GCP’s observability suite. Together with Cloud Logging and Cloud Trace, it collects metrics, logs, and traces from GCP services, applications, and hybrid infrastructure. Without observability, you are flying blind: incidents go undetected, root causes take hours instead of minutes to find, and capacity planning becomes guesswork.

Monitoring Stack

Component	Purpose	Data Source
Cloud Monitoring	Metrics, alerts, dashboards	GCP services, custom metrics
Cloud Logging	Log storage and analysis	Application and audit logs
Cloud Trace	Distributed request tracing	Instrumented applications
Error Reporting	Aggregated error tracking	Application error logs
Cloud Profiler	CPU/memory profiling	Instrumented applications

View Metrics

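# List metric descriptors through the Monitoring API (PROJECT_ID is a placeholder)
curl -s -G \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  --data-urlencode 'filter=metric.type="compute.googleapis.com/instance/cpu/utilization"' \
  --data-urlencode 'pageSize=5' \
  "https://monitoring.googleapis.com/v3/projects/PROJECT_ID/metricDescriptors"

# Query time series for one day
curl -s -G \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  --data-urlencode 'filter=metric.type="compute.googleapis.com/instance/cpu/utilization"' \
  --data-urlencode 'interval.startTime=2024-01-15T00:00:00Z' \
  --data-urlencode 'interval.endTime=2024-01-15T23:59:59Z' \
  "https://monitoring.googleapis.com/v3/projects/PROJECT_ID/timeSeries"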

In the Console: Monitoring → Metrics Explorer, then select a resource type and metric.

Key GCP Metrics to Monitor

Service	Metric	Alert Threshold
Compute Engine	`instance/cpu/utilization`	> 80% for 5 min
Cloud SQL	`database/cpu/utilization`	> 85% for 10 min
GKE	`container/cpu/limit_utilization`	> 90% for 5 min
Cloud Run	`request_latencies` (p99)	> 500ms for 5 min
Cloud Storage	`storage/total_bytes`	Budget-based
Load Balancer	`https/backend_latencies`	p99 > 1s
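Thresholds such as "> 80% for 5 min" only fire when the breach is sustained, not on a single spike. The sketch below illustrates that idea locally; `breachesForDuration` and the sample data are hypothetical, and Cloud Monitoring's own evaluation (alignment, aggregation, missing-data handling) is more involved than this.

```javascript
// Hypothetical helper: returns true when every sample inside the trailing
// window exceeds the threshold, which approximates "> 80% for 5 min".
function breachesForDuration(samples, threshold, durationMs, nowMs) {
  const windowStart = nowMs - durationMs;
  const inWindow = samples.filter(s => s.timeMs > windowStart && s.timeMs <= nowMs);
  // With no data in the window there is nothing to breach.
  if (inWindow.length === 0) return false;
  return inWindow.every(s => s.value > threshold);
}

const MINUTE = 60 * 1000;
const now = 10 * MINUTE;

// One CPU utilization sample per minute, expressed as a fraction between 0 and 1.
const sustained = [6, 7, 8, 9, 10].map(m => ({ timeMs: m * MINUTE, value: 0.9 }));
// Same series, but the middle sample dips below the threshold.
const spike = sustained.map((s, i) => (i === 2 ? { ...s, value: 0.4 } : s));

console.log(breachesForDuration(sustained, 0.8, 5 * MINUTE, now)); // true
console.log(breachesForDuration(spike, 0.8, 5 * MINUTE, now));     // false
```

Requiring the whole window to breach is what keeps short bursts from paging anyone; a shorter duration trades noise for faster detection.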

Logging with Cloud Logging

Query logs with the Logging query language:

resource.type="gce_instance"
severity>=ERROR
timestamp>="2024-01-15T00:00:00Z"
jsonPayload.message=~"connection refused"
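Each line of the query is a separate condition, and the lines are combined with an implicit AND. As a rough illustration, the sketch below applies the same four conditions to in-memory entries; `matchesQuery` and the sample entries are hypothetical, and the real query language supports far more than this.

```javascript
// Numeric values of Cloud Logging severity levels, used for the >= comparison.
const SEVERITY = {
  DEFAULT: 0, DEBUG: 100, INFO: 200, NOTICE: 300, WARNING: 400,
  ERROR: 500, CRITICAL: 600, ALERT: 700, EMERGENCY: 800,
};

// Hypothetical local matcher mirroring the four clauses of the query above.
function matchesQuery(entry) {
  return entry.resource.type === 'gce_instance'
    && SEVERITY[entry.severity] >= SEVERITY.ERROR
    && Date.parse(entry.timestamp) >= Date.parse('2024-01-15T00:00:00Z')
    // =~ is an unanchored regular-expression match on the field.
    && /connection refused/.test(entry.jsonPayload?.message ?? '');
}

const entries = [
  { insertId: 'a', resource: { type: 'gce_instance' }, severity: 'ERROR',
    timestamp: '2024-01-15T08:30:00Z', jsonPayload: { message: 'db: connection refused by 10.0.0.5' } },
  { insertId: 'b', resource: { type: 'gce_instance' }, severity: 'WARNING',
    timestamp: '2024-01-15T08:31:00Z', jsonPayload: { message: 'connection refused' } },
  { insertId: 'c', resource: { type: 'cloud_run_revision' }, severity: 'CRITICAL',
    timestamp: '2024-01-15T08:32:00Z', jsonPayload: { message: 'connection refused' } },
  { insertId: 'd', resource: { type: 'gce_instance' }, severity: 'CRITICAL',
    timestamp: '2024-01-14T23:59:59Z', jsonPayload: { message: 'connection refused' } },
];

// Only entry "a" satisfies all four conditions.
const matched = entries.filter(matchesQuery).map(e => e.insertId);
console.log(matched);
```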

Export logs to BigQuery, Cloud Storage, or Pub/Sub for long-term analysis:

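# Route ERROR-and-above entries to a BigQuery dataset
gcloud logging sinks create error-logs-sink \
  bigquery.googleapis.com/projects/learning-gcp-dev/datasets/logs \
  --log-filter='severity>=ERROR'
# Afterwards, grant the sink's writer identity write access to the dataset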

# List sinks
gcloud logging sinks list

# View recent errors
gcloud logging read 'severity>=ERROR' --limit=10 --format=json

Log-Based Metrics

Turn log patterns into alertable metrics:

# Create a log-based metric for 500 errors
gcloud logging metrics create http_500_errors \
  --description="Count of HTTP 500 responses" \
  --log-filter='resource.type="cloud_run_revision"
    httpRequest.status=500'
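A counter-style log-based metric is, conceptually, a count of the log entries that match a filter in each time interval. The sketch below reproduces that idea locally under simplifying assumptions; `isHttp500`, `countPerMinute`, and the sample entries are hypothetical and are not part of any Google Cloud library.

```javascript
// Hypothetical local equivalent of the metric's filter.
function isHttp500(entry) {
  return entry.resource?.type === 'cloud_run_revision' && entry.httpRequest?.status === 500;
}

// Counts matching entries per minute, keyed by the minute's ISO timestamp.
function countPerMinute(entries, predicate) {
  const counts = new Map();
  for (const entry of entries.filter(predicate)) {
    const minuteMs = Math.floor(Date.parse(entry.timestamp) / 60000) * 60000;
    const minute = new Date(minuteMs).toISOString();
    counts.set(minute, (counts.get(minute) ?? 0) + 1);
  }
  return counts;
}

const requestLogs = [
  { resource: { type: 'cloud_run_revision' }, httpRequest: { status: 500 }, timestamp: '2024-01-15T08:30:05Z' },
  { resource: { type: 'cloud_run_revision' }, httpRequest: { status: 500 }, timestamp: '2024-01-15T08:30:40Z' },
  { resource: { type: 'cloud_run_revision' }, httpRequest: { status: 200 }, timestamp: '2024-01-15T08:30:50Z' },
  { resource: { type: 'cloud_run_revision' }, httpRequest: { status: 500 }, timestamp: '2024-01-15T08:31:10Z' },
];

const http500PerMinute = countPerMinute(requestLogs, isHttp500);
console.log(Object.fromEntries(http500PerMinute));
```

An alert policy can then apply a threshold to the resulting series in the same way as it would to any built-in metric.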

Create Alerts

# Create notification channel (email); OPS_TEAM_EMAIL is a placeholder for the address
gcloud alpha monitoring channels create \
  --display-name="Ops Team Email" \
  --type=email \
  --channel-labels=email_address=OPS_TEAM_EMAIL

# Create alert policy
gcloud alpha monitoring policies create \
  --display-name="High CPU Alert" \
  --combiner=OR \
  --condition-display-name="CPU > 80%" \
  --condition-filter='metric.type="compute.googleapis.com/instance/cpu/utilization" AND resource.type="gce_instance"' \
  --if="> 0.8" \
  --duration=300s \
  --notification-channels=CHANNEL_ID \
  --documentation="Runbook: https://wiki.company.com/high-cpu"

Alert Severity Levels

Severity	Response Time	Example
P1 — Critical	< 15 min	Production down, data loss
P2 — High	< 1 hour	Degraded performance, partial outage
P3 — Medium	< 4 hours	Non-critical service affected
P4 — Low	Next business day	Warning threshold, capacity planning

Dashboards and SLOs

Dashboards: Custom charts combining metrics from multiple services
Uptime checks: Synthetic monitoring from global locations
SLOs: Define service level objectives based on availability or latency metrics

Example SLO: 99.9% of HTTP requests complete in under 200ms over a 30-day window.

SLI: request_latency < 200ms
SLO: 99.9% of requests meet SLI over 30 days
Error budget: 0.1% of requests may miss the SLI (for a time-based availability SLO, 0.1% of 30 days = ~43 minutes of downtime)

When the error budget is exhausted, freeze feature releases and focus on reliability.
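The error budget arithmetic can be sketched as follows. The function names and the traffic figures (1,000,000 requests, 250 of them slow) are hypothetical; the 99.9% target and 30-day window come from the example above. A time-based budget is measured in minutes, while a request-based SLO such as the latency example spends its budget in requests.

```javascript
// Time-based budget: allowed "bad" minutes in the window for an availability SLO.
function errorBudgetMinutes(sloTarget, windowDays) {
  return (1 - sloTarget) * windowDays * 24 * 60;
}

// Request-based budget: allowed failing or slow requests for a given traffic volume.
function errorBudgetRequests(sloTarget, totalRequests) {
  return Math.round((1 - sloTarget) * totalRequests);
}

// Fraction of the request budget still unspent (negative once exhausted).
function budgetRemaining(sloTarget, totalRequests, badRequests) {
  const budget = (1 - sloTarget) * totalRequests;
  return (budget - badRequests) / budget;
}

console.log(errorBudgetMinutes(0.999, 30).toFixed(1));          // prints 43.2
console.log(errorBudgetRequests(0.999, 1000000));               // prints 1000
console.log(budgetRemaining(0.999, 1000000, 250).toFixed(2));   // prints 0.75
```

A release freeze would apply once `budgetRemaining` reaches zero or goes negative.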

Application Instrumentation

Node.js with OpenTelemetry:

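const { NodeSDK } = require('@opentelemetry/sdk-node');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { TraceExporter } = require('@google-cloud/opentelemetry-cloud-trace-exporter');
const { MetricExporter } = require('@google-cloud/opentelemetry-cloud-monitoring-exporter');

// Option names vary between SDK releases; this assumes a recent @opentelemetry/sdk-node.
const sdk = new NodeSDK({
  serviceName: 'web-app',
  traceExporter: new TraceExporter(),
  // Metrics are exported through a reader that wraps the exporter
  metricReader: new PeriodicExportingMetricReader({ exporter: new MetricExporter() }),
});
// The version can be supplied with OTEL_RESOURCE_ATTRIBUTES=service.version=1.2.0
sdk.start();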

Install the Ops Agent on Compute Engine VMs for automatic metric and log collection:

curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
sudo bash add-google-cloud-ops-agent-repo.sh --also-install

Real-World Scenario: E-Commerce Monitoring

An online store monitors the full request path:

Uptime Check (synthetic) → Alert if homepage > 3s
  ↓
Cloud Run metrics (request count, latency p50/p99, error rate)
  ↓
Cloud SQL metrics (connections, replication lag, CPU)
  ↓
Cloud Logging (structured JSON logs with trace ID)
  ↓
Cloud Trace (distributed trace across services)
  ↓
Error Reporting (aggregated stack traces)

SLO dashboard shows error budget remaining. PagerDuty integration fires on P1/P2 alerts only.
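The "structured JSON logs with trace ID" step can be sketched as below. On platforms such as Cloud Run, a JSON line written to stdout is parsed by Cloud Logging, and the `severity`, `message`, and `logging.googleapis.com/trace` fields are treated specially. The `logEntry` helper, the `orderId` field, and the sample trace ID are hypothetical; the project ID is the one used earlier on this page.

```javascript
// Hypothetical helper that builds one structured log line for Cloud Logging.
function logEntry(severity, message, { projectId, traceId, ...fields } = {}) {
  const entry = { severity, message, ...fields };
  if (projectId && traceId) {
    // Special field that lets Cloud Logging link the entry to a Cloud Trace trace.
    entry['logging.googleapis.com/trace'] = `projects/${projectId}/traces/${traceId}`;
  }
  return JSON.stringify(entry);
}

const line = logEntry('ERROR', 'checkout failed', {
  projectId: 'learning-gcp-dev',
  traceId: '105445aa7843bc8bf206b12000100000',
  orderId: 'A-1001', // extra fields end up in jsonPayload and stay queryable
});

// One JSON object per line on stdout is what the logging agent expects.
console.log(line);
```

With the trace field populated, a log entry found in Cloud Logging can be followed to the matching request in Cloud Trace, which is what ties the layers of the diagram together.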

Common Mistakes

Mistake	Impact	Fix
Alerting on everything	Alert fatigue, ignored pages	Alert on symptoms, not causes
No runbook links in alerts	Slow incident response	Add documentation to alert policies
Logs without structure	Unqueryable, unparseable	Use JSON structured logging
No SLOs defined	No reliability targets	Define SLIs and SLOs per service
Monitoring only infrastructure	Blind to user experience	Add uptime checks and app metrics

Best Practices

Define SLIs (latency, error rate, throughput) per service
Collect infrastructure metrics automatically; add custom metrics for business KPIs
Set alerts on actionable thresholds with runbook links
Use log-based metrics to turn log patterns into alertable metrics
Review dashboards weekly; tune alert thresholds to reduce noise
Implement distributed tracing for microservices
Export logs to BigQuery for long-term analysis and compliance
Use Error Reporting for automatic error grouping and notification

Troubleshooting

Metrics not appearing:

# Verify Ops Agent is running on VM
sudo systemctl status google-cloud-ops-agent
# Check agent config
sudo cat /etc/google-cloud-ops-agent/config.yaml

Alert not firing:

# Confirm the policy exists and is enabled
gcloud alpha monitoring policies list --format="table(displayName,enabled)"
# Verify notification channel is verified (email channels need confirmation)
gcloud alpha monitoring channels list

High logging costs:

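# Sample recent log volume by resource type (raise --limit for a larger sample)
gcloud logging read 'timestamp>="2024-01-01T00:00:00Z"' --limit=10000 --format='value(resource.type)' | sort | uniq -c | sort -rn
# Add an exclusion filter for noisy debug logs on the _Default sink (exclude-debug is an example name)
gcloud logging sinks update _Default \
  --add-exclusion=name=exclude-debug,filter='severity=DEBUG'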

Trace gaps in distributed requests: Ensure every service propagates the same trace context header (X-Cloud-Trace-Context, or W3C traceparent, which OpenTelemetry uses by default) and uses the same trace exporter.
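When a service is not instrumented by an SDK, the header has to be forwarded by hand. The sketch below assumes the documented `TRACE_ID/SPAN_ID;o=OPTIONS` layout of X-Cloud-Trace-Context and treats the span and options parts as optional; `parseCloudTraceContext` and `propagate` are hypothetical helpers, not library functions.

```javascript
// Parses "TRACE_ID/SPAN_ID;o=OPTIONS"; returns null when the header is missing or malformed.
function parseCloudTraceContext(header) {
  const match = /^([0-9a-f]{32})(?:\/(\d+))?(?:;o=(\d))?$/i.exec(header ?? '');
  if (!match) return null;
  const [, traceId, spanId = null, options] = match;
  return { traceId: traceId.toLowerCase(), spanId, sampled: options === '1' };
}

// Copies the incoming trace header onto the headers of an outgoing request.
// Node.js lowercases incoming header names, hence the lowercase key.
function propagate(incomingHeaders, outgoingHeaders = {}) {
  const value = incomingHeaders['x-cloud-trace-context'];
  return value ? { ...outgoingHeaders, 'x-cloud-trace-context': value } : outgoingHeaders;
}

const incoming = { 'x-cloud-trace-context': '105445aa7843bc8bf206b12000100000/1;o=1' };

console.log(parseCloudTraceContext(incoming['x-cloud-trace-context']));
console.log(propagate(incoming, { accept: 'application/json' }));
```

A service that drops the header starts a new trace, which is what shows up as a gap in the waterfall view.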

Effective monitoring transforms raw telemetry into operational awareness and faster incident response.

Next: Google Kubernetes Engine — managed Kubernetes clusters.
