Amazon CloudWatch is AWS’s observability service: it collects metrics, stores logs, triggers alarms, and visualizes system health. Most AWS services publish metrics to CloudWatch automatically; you add custom metrics and logs for application-level visibility.

CloudWatch Components

Component | Purpose
Metrics | Time-series data (CPU, request count, custom)
Logs | Centralized log storage and querying
Alarms | Automated actions on metric thresholds
Dashboards | Visual monitoring panels
Events/EventBridge | Event-driven automation
Synthetics | Canary scripts for uptime monitoring
Container Insights | ECS/EKS deep metrics

View EC2 Metrics

# CPU utilization for an instance (last hour)
# Note: "date -v-1H" is BSD/macOS syntax; on GNU/Linux use: date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ
aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
    --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
    --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
    --period 300 \
    --statistics Average Maximum

# List available metrics
aws cloudwatch list-metrics --namespace AWS/EC2
  

Standard EC2 metrics (5-minute intervals, free):

  • CPUUtilization, NetworkIn, NetworkOut, DiskReadOps, DiskWriteOps (the Disk* metrics cover instance store volumes only)

Enable detailed monitoring (1-minute intervals, extra cost) for Auto Scaling responsiveness.
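
The CLI example above depends on a platform-specific date flag, so the same request can be sketched in Python, where the time window is computed portably. The helper names build_cpu_request, expected_datapoints, and fetch_cpu are illustrative assumptions, not part of any AWS SDK; only fetch_cpu talks to AWS, and it needs boto3 and credentials.

```python
from datetime import datetime, timedelta, timezone


def build_cpu_request(instance_id, now=None, hours=1, period=300):
    """Assemble keyword arguments for CloudWatch get_metric_statistics.

    Hypothetical helper: it only builds the request, it does not call AWS.
    """
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": start,
        "EndTime": end,
        "Period": period,
        "Statistics": ["Average", "Maximum"],
    }


def expected_datapoints(hours, period):
    """Upper bound on the number of datapoints in the window."""
    return (hours * 3600) // period


def fetch_cpu(instance_id):
    """Run the query; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers above work without it

    client = boto3.client("cloudwatch")
    response = client.get_metric_statistics(**build_cpu_request(instance_id))
    # Datapoints are not guaranteed to arrive in time order, so sort them.
    return sorted(response["Datapoints"], key=lambda d: d["Timestamp"])
```

With basic monitoring and a 300-second period, a one-hour window holds at most 12 datapoints; with detailed monitoring and a 60-second period, at most 60.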

Create Alarms

# Alert when average EC2 CPU > 80% for 10 minutes (2 consecutive 5-minute periods)
aws cloudwatch put-metric-alarm \
    --alarm-name high-cpu-web-server \
    --alarm-description "CPU above 80% for 10 minutes" \
    --metric-name CPUUtilization \
    --namespace AWS/EC2 \
    --statistic Average \
    --period 300 \
    --threshold 80 \
    --comparison-operator GreaterThanThreshold \
    --evaluation-periods 2 \
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts

# Composite alarm (multiple conditions); high-error-rate is assumed to be an existing alarm
aws cloudwatch put-composite-alarm \
    --alarm-name service-degraded \
    --alarm-rule "ALARM(high-cpu-web-server) OR ALARM(high-error-rate)"
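
How long a breach must last before an alarm fires is the period multiplied by the number of evaluation periods. The sketch below is a simplified model of that rule for a GreaterThanThreshold alarm, not CloudWatch's actual evaluator: it ignores the treat-missing-data setting, and the function names are assumptions made for illustration.

```python
def alarm_state(datapoints, threshold, evaluation_periods, datapoints_to_alarm=None):
    """Return "ALARM", "OK", or "INSUFFICIENT_DATA" for the newest window.

    Simplified model: only the most recent `evaluation_periods` datapoints
    are considered, and missing data is not handled beyond a length check.
    """
    needed = datapoints_to_alarm or evaluation_periods
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    window = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in window if value > threshold)
    return "ALARM" if breaching >= needed else "OK"


def minutes_to_alarm(period_seconds, evaluation_periods):
    """Shortest sustained breach, in minutes, that can trigger the alarm."""
    return period_seconds * evaluation_periods / 60
```

For the alarm above, minutes_to_alarm(300, 2) gives 10 minutes. Setting datapoints-to-alarm lower than evaluation-periods (an "M out of N" alarm) tolerates brief dips below the threshold.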
  

Alarm Actions

Action | Use case
SNS notification | Email, SMS, Slack (via Lambda)
Auto Scaling | Scale out/in on metric
EC2 action | Recover, reboot, stop, or terminate an instance
Lambda | Custom remediation
Systems Manager | Create OpsItems or incidents; run Automation runbooks via EventBridge

CloudWatch Logs

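# Create log group
aws logs create-log-group --log-group-name /aws/lambda/my-function

# Stream logs from CLI (requires AWS CLI v2)
aws logs tail /aws/lambda/my-function --follow

# Run a Logs Insights query; start-query returns a queryId, not the results
# Note: "date -v-1H" is BSD/macOS syntax; on GNU/Linux use: date -u -d '1 hour ago' +%s
# Fetch results with the returned ID: aws logs get-query-results --query-id QUERY_ID
aws logs start-query \
    --log-group-name /aws/lambda/my-function \
    --start-time $(date -u -v-1H +%s) \
    --end-time $(date -u +%s) \
    --query-string 'fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20'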
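
Logs Insights queries run asynchronously: start-query hands back an ID, and the results have to be polled for. A minimal polling sketch follows; wait_for_query is a hypothetical helper, and logs_client is assumed to be a boto3 CloudWatch Logs client (or anything exposing a compatible get_query_results method).

```python
import time


def wait_for_query(logs_client, query_id, poll_seconds=1.0, timeout_seconds=60.0):
    """Poll get_query_results until the query finishes, then return rows as dicts."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = logs_client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each row is a list of {"field": ..., "value": ...} pairs.
            return [
                {cell["field"]: cell["value"] for cell in row}
                for row in response["results"]
            ]
        if status not in ("Scheduled", "Running"):
            raise RuntimeError(f"query {query_id} ended with status {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"query {query_id} still {status} after {timeout_seconds}s")
        time.sleep(poll_seconds)
```

Typical use would be wait_for_query(boto3.client("logs"), started["queryId"]), where started is the response from start_query.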
  

Logs Insights Query Examples

# Error count in 5-minute bins
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() as errors by bin(5m)

# Slow Lambda invocations (longer than 3 seconds)
fields @timestamp, @duration, @requestId
| filter @type = "REPORT"
| filter @duration > 3000
| sort @duration desc
| limit 50

# API Gateway 5xx errors (assumes JSON access logs with status, path, and ip fields)
fields @timestamp, status, path, ip
| filter status >= 500
| stats count() by path
  

Custom Metrics

import boto3
cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_data(
    Namespace='MyApp/Orders',
    MetricData=[{
        'MetricName': 'OrdersProcessed',
        'Value': 1,
        'Unit': 'Count',
        'Dimensions': [
            {'Name': 'Environment', 'Value': 'production'},
            {'Name': 'Region', 'Value': 'us-east-1'}
        ]
    }]
)
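
Publishing one datapoint per event means one API call per order. One way to cut call volume, sketched here under the assumption that samples are buffered in memory for a short interval, is to send a pre-aggregated statistic set. The summarize helper is hypothetical; the StatisticValues shape is what put_metric_data accepts in place of Value.

```python
def summarize(metric_name, samples, unit, dimensions=None):
    """Collapse many local samples into one MetricDatum using StatisticValues."""
    if not samples:
        raise ValueError("samples must not be empty")
    datum = {
        "MetricName": metric_name,
        "Unit": unit,
        "StatisticValues": {
            "SampleCount": float(len(samples)),
            "Sum": float(sum(samples)),
            "Minimum": float(min(samples)),
            "Maximum": float(max(samples)),
        },
    }
    if dimensions:
        # Convert a plain dict into the Name/Value list the API expects.
        datum["Dimensions"] = [{"Name": k, "Value": v} for k, v in dimensions.items()]
    return datum
```

The result can be passed as MetricData=[summarize(...)] to the put_metric_data call shown above. The trade-off is that percentile statistics are generally not available for data published as statistic sets.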
  

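Use Embedded Metric Format (EMF) to write structured log lines that CloudWatch turns into metrics automatically. In Lambda, anything printed to stdout reaches CloudWatch Logs:

import json
import time

print(json.dumps({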
    "_aws": {
        "Timestamp": int(time.time() * 1000),
        "CloudWatchMetrics": [{
            "Namespace": "MyApp",
            "Dimensions": [["Service"]],
            "Metrics": [{"Name": "ProcessingTime", "Unit": "Milliseconds"}]
        }]
    },
    "Service": "order-processor",
    "ProcessingTime": 245
}))
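
When several metrics share the same dimensions, the EMF record can be assembled by a small helper rather than written out by hand. This is a sketch only; emf_line and its argument shapes are assumptions made for illustration, and AWS also ships client libraries for EMF that are not shown here.

```python
import json
import time


def emf_line(namespace, dimensions, metrics, now_ms=None):
    """Build one EMF log line.

    dimensions: dict of dimension name -> value
    metrics: dict of metric name -> (value, unit)
    """
    record = {
        "_aws": {
            "Timestamp": now_ms if now_ms is not None else int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                # One dimension set containing every dimension name.
                "Dimensions": [list(dimensions)],
                "Metrics": [{"Name": n, "Unit": u} for n, (_, u) in metrics.items()],
            }],
        },
    }
    # Dimension and metric values sit at the top level of the record.
    record.update(dimensions)
    record.update({n: v for n, (v, _) in metrics.items()})
    return json.dumps(record)
```

For example, print(emf_line("MyApp", {"Service": "order-processor"}, {"ProcessingTime": (245, "Milliseconds")})) emits a record equivalent to the one above.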
  

Dashboards

aws cloudwatch put-dashboard \
    --dashboard-name Production-Overview \
    --dashboard-body file://dashboard.json

Contents of dashboard.json:

{
    "widgets": [{
        "type": "metric",
        "properties": {
            "metrics": [
                ["AWS/EC2", "CPUUtilization", "AutoScalingGroupName", "web-asg"],
                [".", "NetworkIn", ".", "."],
                ["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", "app/web-alb/xxx"]
            ],
            "period": 300,
            "stat": "Average",
            "region": "us-east-1",
            "title": "Web Tier Health"
        }
    }]
}
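
Hand-written dashboard JSON gets repetitive once there are several services. As a rough sketch, the body can be generated in Python instead; metric_widget and dashboard_body are hypothetical helpers, and any Auto Scaling group name other than web-asg is a placeholder.

```python
import json


def metric_widget(title, metrics, region, period=300, stat="Average"):
    """Return one metric widget in the dashboard body format."""
    return {
        "type": "metric",
        "properties": {
            "metrics": metrics,
            "period": period,
            "stat": stat,
            "region": region,
            "title": title,
        },
    }


def dashboard_body(asg_names, region="us-east-1"):
    """One CPU widget per Auto Scaling group, serialized as a JSON string."""
    widgets = [
        metric_widget(
            f"{name} CPU",
            [["AWS/EC2", "CPUUtilization", "AutoScalingGroupName", name]],
            region,
        )
        for name in asg_names
    ]
    return json.dumps({"widgets": widgets}, indent=2)
```

The returned string can be written to dashboard.json for the CLI command above, or passed as DashboardBody to boto3's put_dashboard.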
  

Log Retention and Costs

Retention | Cost impact and typical use
Never expire | Highest; logs accumulate forever
30 days | Good default for application logs
7 days | Dev/staging environments
Export to S3 | Archive long-term at lower cost

aws logs put-retention-policy \
    --log-group-name /aws/lambda/my-function \
    --retention-in-days 30
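
Setting retention one group at a time is easy to forget, so it can help to sweep an account for groups that never expire. The sketch below assumes logs_client is a boto3 CloudWatch Logs client; both function names are hypothetical.

```python
def groups_without_retention(logs_client):
    """Yield names of log groups that have no retention policy."""
    paginator = logs_client.get_paginator("describe_log_groups")
    for page in paginator.paginate():
        for group in page["logGroups"]:
            # retentionInDays is absent when the group never expires.
            if "retentionInDays" not in group:
                yield group["logGroupName"]


def apply_default_retention(logs_client, days=30):
    """Set a retention policy on every group lacking one; return the names changed."""
    changed = []
    # Materialize the list first so we are not paginating while modifying.
    for name in list(groups_without_retention(logs_client)):
        logs_client.put_retention_policy(logGroupName=name, retentionInDays=days)
        changed.append(name)
    return changed
```

Running groups_without_retention on its own gives a read-only report, which is a safer first step than changing policies in bulk.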
  

CloudWatch Agent

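EC2 does not report memory or disk usage on its own; the CloudWatch agent collects these, plus custom OS metrics and log files, from inside the instance. A sample agent configuration:

{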
    "metrics": {
        "namespace": "CWAgent",
        "metrics_collected": {
            "mem": {"measurement": ["mem_used_percent"]},
            "disk": {
                "measurement": ["disk_used_percent"],
                "resources": ["/"]
            }
        }
    },
    "logs": {
        "logs_collected": {
            "files": {
                "collect_list": [{
                    "file_path": "/var/log/nginx/access.log",
                    "log_group_name": "/app/nginx/access",
                    "log_stream_name": "{instance_id}"
                }]
            }
        }
    }
}
  

Real-World Scenario: Production Monitoring Stack

Layer | Tool | Alerts
Infrastructure | CloudWatch EC2/RDS metrics | CPU > 80%, disk > 90%
Application | Custom metrics + Logs Insights | Error rate > 1%, p99 latency > 2s
Uptime | Route 53 health checks + Synthetics | Endpoint down
Security | GuardDuty + CloudTrail | Unauthorized API calls
Notification | SNS → PagerDuty/Slack | On-call escalation

CloudWatch vs Third-Party Tools

Feature | CloudWatch | Datadog/New Relic
AWS integration | Native, automatic | Agent required
Custom metrics cost | $0.30/metric/month | Included in plan
Log analytics | Logs Insights | Full APM
Multi-cloud | AWS only | Multi-cloud
Setup time | Minutes | Hours (agent config)

Start with CloudWatch; add third-party APM when you need distributed tracing across services.

Common Mistakes

  1. No alarms configured — metrics without alarms are just graphs
  2. Alarm fatigue — too many low-threshold alarms; tune evaluation periods
  3. Never-expire log retention — costs grow linearly with traffic
  4. Missing custom metrics — infrastructure metrics don’t show business KPIs
  5. Not using Logs Insights — grep across log groups is slow and expensive
  6. Ignoring billing metrics — set billing alarms in us-east-1
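
For the last item, a billing alarm might look like the sketch below. The helper names and the alarm naming scheme are assumptions; the account must have billing alerts enabled, and the SNS topic is expected to exist already.

```python
def billing_alarm_request(threshold_usd, sns_topic_arn):
    """Keyword arguments for put_metric_alarm on estimated charges."""
    return {
        "AlarmName": f"billing-over-{threshold_usd}-usd",
        "AlarmDescription": f"Estimated charges above {threshold_usd} USD",
        "Namespace": "AWS/Billing",
        "MetricName": "EstimatedCharges",
        "Dimensions": [{"Name": "Currency", "Value": "USD"}],
        "Statistic": "Maximum",
        "Period": 21600,  # 6 hours; billing data is published only a few times a day
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_usd),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


def create_billing_alarm(threshold_usd, sns_topic_arn):
    """Create the alarm; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the request builder works without it

    # Billing metrics are published in us-east-1 regardless of where resources run.
    client = boto3.client("cloudwatch", region_name="us-east-1")
    client.put_metric_alarm(**billing_alarm_request(threshold_usd, sns_topic_arn))
```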

Troubleshooting

Issue | Cause | Fix
No metrics appearing | Wrong namespace/dimensions | Verify with list-metrics
Alarm never triggers | Insufficient data points | Check evaluation-periods, datapoints-to-alarm, and treat-missing-data
Logs not appearing | Missing IAM permissions | Role needs logs:CreateLogStream, logs:PutLogEvents (plus logs:CreateLogGroup if the group does not exist yet)
High Logs cost | Verbose logging, no retention | Reduce log level; set retention; filter at source
Dashboard empty | Wrong Region | Metrics are Region-specific; check that each widget's "region" matches where the resources run

Best Practices

  • Define SLIs/SLOs (availability, latency, error rate) and alarm on SLO breaches
  • Use composite alarms to reduce noise
  • Set log retention on every log group at creation
  • Emit custom business metrics (orders/min, signups/day)
  • Create runbooks linked from alarm descriptions
  • Enable X-Ray alongside CloudWatch for request tracing
  • Export long-term logs to S3 + Athena for cost-effective analysis

Next: Elastic Load Balancing.