CloudWatch — Monitoring

Amazon CloudWatch is AWS’s observability service: it collects metrics, stores logs, raises alarms, and visualizes system health. Most AWS services publish metrics to CloudWatch automatically; you add custom metrics and logs for application-level visibility.

CloudWatch Components

Component	Purpose
Metrics	Time-series data (CPU, request count, custom)
Logs	Centralized log storage and querying
Alarms	Automated actions on metric thresholds
Dashboards	Visual monitoring panels
EventBridge (formerly CloudWatch Events)	Event-driven automation
Synthetics	Canary scripts for uptime monitoring
Container Insights	ECS/EKS deep metrics

View EC2 Metrics

# CPU utilization for an instance (last hour)
# `date -v-1H` is BSD/macOS syntax; on GNU/Linux use: date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S
aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
    --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%S) \
    --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
    --period 300 \
    --statistics Average Maximum

# List available metrics
aws cloudwatch list-metrics --namespace AWS/EC2
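The shell `date` flags differ between macOS and Linux, so the same lookup can be sketched in Python, where the one-hour window is computed portably. This is a minimal sketch, assuming boto3 is installed and credentials are configured; the helper names `last_hour_window`, `cpu_stats_request`, and `fetch_cpu_stats` are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def last_hour_window(now=None):
    """Return (start, end) as timezone-aware UTC datetimes covering the last hour."""
    end = now or datetime.now(timezone.utc)
    return end - timedelta(hours=1), end


def cpu_stats_request(instance_id, now=None, period=300):
    """Build the keyword arguments for CloudWatch get_metric_statistics."""
    start, end = last_hour_window(now)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": start,
        "EndTime": end,
        "Period": period,
        "Statistics": ["Average", "Maximum"],
    }


def fetch_cpu_stats(instance_id):
    """Call CloudWatch. Needs boto3 and AWS credentials, so nothing runs on import."""
    import boto3  # imported lazily so the helpers above work without boto3

    client = boto3.client("cloudwatch")
    response = client.get_metric_statistics(**cpu_stats_request(instance_id))
    # Datapoints come back unordered, so sort them by timestamp.
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])
```

Calling `fetch_cpu_stats("i-0123456789abcdef0")` would return the same datapoints as the CLI command, sorted oldest first.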

Standard EC2 metrics (5-minute intervals, free):

CPUUtilization, NetworkIn, NetworkOut, DiskReadOps, DiskWriteOps

Enable detailed monitoring (1-minute intervals, extra cost) for Auto Scaling responsiveness.

Create Alarms

# Alert when average EC2 CPU stays above 80% for 10 minutes (two 5-minute periods)
aws cloudwatch put-metric-alarm \
    --alarm-name high-cpu-web-server \
    --alarm-description "CPU above 80% for 10 minutes" \
    --metric-name CPUUtilization \
    --namespace AWS/EC2 \
    --statistic Average \
    --period 300 \
    --threshold 80 \
    --comparison-operator GreaterThanThreshold \
    --evaluation-periods 2 \
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts

# Composite alarm (multiple conditions)
aws cloudwatch put-composite-alarm \
    --alarm-name service-degraded \
    --alarm-rule "ALARM(high-cpu-web-server) OR ALARM(high-error-rate)"
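How long a metric must breach before the alarm fires is the product of `--period` and `--evaluation-periods`: with a 300-second period and 2 evaluation periods, that is about 10 minutes. The sketch below is a simplified local model of that evaluation, including the "M out of N" behavior of `--datapoints-to-alarm`. It is an illustration only and ignores how CloudWatch treats missing data; the function names are hypothetical.

```python
def time_to_alarm_seconds(period, evaluation_periods):
    """Minimum time a metric must keep breaching before the alarm can fire."""
    return period * evaluation_periods


def alarm_state(datapoints, threshold, evaluation_periods, datapoints_to_alarm=None):
    """Approximate 'M out of N' evaluation for GreaterThanThreshold.

    datapoints: one value per period, oldest first.
    Returns "ALARM", "OK", or "INSUFFICIENT_DATA".
    """
    required = datapoints_to_alarm or evaluation_periods
    window = datapoints[-evaluation_periods:]
    if len(window) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    breaching = sum(1 for value in window if value > threshold)
    return "ALARM" if breaching >= required else "OK"
```

For example, `alarm_state([85, 50, 90], 80, 2)` evaluates only the last two datapoints, finds one breach, and stays `OK`; widening to 3 evaluation periods with `datapoints_to_alarm=2` tolerates the single dip and returns `ALARM`.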

Alarm Actions

Action	Use Case
SNS notification	Email, SMS, Slack (via Lambda)
Auto Scaling	Scale out/in on metric
EC2 action	Recover, reboot, stop, or terminate an instance
Lambda	Custom remediation
SSM Automation	Run runbooks (typically triggered through EventBridge on alarm state change)

CloudWatch Logs

  # Create log group
aws logs create-log-group --log-group-name /aws/lambda/my-function

# Stream logs from CLI
aws logs tail /aws/lambda/my-function --follow

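# Filter logs with a Logs Insights query (returns a queryId; fetch rows with `aws logs get-query-results`)
# `date -v-1H` is BSD/macOS syntax; on GNU/Linux use: date -u -d '1 hour ago' +%s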
aws logs start-query \
    --log-group-name /aws/lambda/my-function \
    --start-time $(date -u -v-1H +%s) \
    --end-time $(date -u +%s) \
    --query-string 'fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20'
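`start-query` is asynchronous: it returns a query ID, and the rows have to be fetched separately once the query completes. A minimal polling sketch, assuming `logs_client` is a `boto3.client("logs")` instance; the function name, `poll_seconds`, and `timeout_seconds` are hypothetical choices, not part of the API.

```python
import time


def run_insights_query(logs_client, log_group, query, start, end,
                       poll_seconds=1.0, timeout_seconds=60.0):
    """Start a Logs Insights query and wait for its rows.

    start and end are epoch seconds. Returns a list of {field: value} dicts.
    """
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=int(start),
        endTime=int(end),
        queryString=query,
    )["queryId"]

    deadline = time.monotonic() + timeout_seconds
    while True:
        response = logs_client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each row is a list of {"field": ..., "value": ...} pairs.
            return [{cell["field"]: cell["value"] for cell in row}
                    for row in response["results"]]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query {query_id} ended with status {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"query {query_id} not complete after {timeout_seconds}s")
        time.sleep(poll_seconds)
```

Passing the client in as an argument keeps the function easy to exercise with a stub, without touching AWS.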

Logs Insights Query Examples

# Error count per 5-minute bin
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() as errors by bin(5m)

# Slow Lambda invocations (longer than 3 seconds)
fields @timestamp, @duration, @requestId
| filter @type = "REPORT"
| filter @duration > 3000
| sort @duration desc
| limit 50

# API Gateway 5xx errors by path (field names depend on your access log format)
fields @timestamp, status, path, ip
| filter status >= 500
| stats count() by path

Custom Metrics

import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_data(
    Namespace='MyApp/Orders',
    MetricData=[{
        'MetricName': 'OrdersProcessed',
        'Value': 1,
        'Unit': 'Count',
        'Dimensions': [
            {'Name': 'Environment', 'Value': 'production'},
            {'Name': 'Region', 'Value': 'us-east-1'}
        ]
    }]
)
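Publishing one datapoint per API call gets expensive at volume, so datapoints are often buffered and sent in batches. The sketch below shows one way to do that, assuming `cloudwatch` is a `boto3.client("cloudwatch")` instance. The helper names are hypothetical, and the default `batch_size` of 1000 is an assumption about the per-request limit that should be checked against current CloudWatch quotas.

```python
def order_datum(value, environment, region):
    """One OrdersProcessed datapoint with the same dimensions as the example above."""
    return {
        "MetricName": "OrdersProcessed",
        "Value": value,
        "Unit": "Count",
        "Dimensions": [
            {"Name": "Environment", "Value": environment},
            {"Name": "Region", "Value": region},
        ],
    }


def chunked(items, size):
    """Yield consecutive slices of items, each at most size long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def publish_metrics(cloudwatch, namespace, metric_data, batch_size=1000):
    """Send metric_data in batches; returns the number of API requests made."""
    requests_made = 0
    for batch in chunked(metric_data, batch_size):
        cloudwatch.put_metric_data(Namespace=namespace, MetricData=batch)
        requests_made += 1
    return requests_made
```

A caller might collect `order_datum(...)` entries in a list during a request cycle and flush them with `publish_metrics(cloudwatch, "MyApp/Orders", buffered)`.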

Use Embedded Metric Format (EMF) for structured logging that auto-creates metrics:

import json
import time

print(json.dumps({
    "_aws": {
        "Timestamp": int(time.time() * 1000),
        "CloudWatchMetrics": [{
            "Namespace": "MyApp",
            "Dimensions": [["Service"]],
            "Metrics": [{"Name": "ProcessingTime", "Unit": "Milliseconds"}]
        }]
    },
    "Service": "order-processor",
    "ProcessingTime": 245
}))
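An EMF record repeats each dimension and metric name twice: once in the `_aws` metadata and once as a top-level key holding the value. A small builder keeps the two in sync. This is a sketch under that reading of the format; `emf_record` and its argument shapes are hypothetical.

```python
import json
import time


def emf_record(namespace, dimensions, metrics, timestamp_ms=None):
    """Build one Embedded Metric Format record as a dict.

    dimensions: dict of dimension name -> value.
    metrics: dict of metric name -> (value, unit).
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                # One dimension set containing every dimension name.
                "Dimensions": [list(dimensions)],
                "Metrics": [
                    {"Name": name, "Unit": unit}
                    for name, (_value, unit) in metrics.items()
                ],
            }],
        },
    }
    # Values live at the top level, keyed by the names declared above.
    record.update(dimensions)
    record.update({name: value for name, (value, _unit) in metrics.items()})
    return record


def emit(record):
    """Write the record as one JSON line to stdout."""
    print(json.dumps(record))
```

`emit(emf_record("MyApp", {"Service": "order-processor"}, {"ProcessingTime": (245, "Milliseconds")}))` produces a record with the same shape as the hand-written one above.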

Dashboards

  aws cloudwatch put-dashboard \
    --dashboard-name Production-Overview \
    --dashboard-body file://dashboard.json

  {
    "widgets": [{
        "type": "metric",
        "properties": {
            "metrics": [
                ["AWS/EC2", "CPUUtilization", "AutoScalingGroupName", "web-asg"],
                [".", "NetworkIn", ".", "."],
                ["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", "app/web-alb/xxx"]
            ],
            "period": 300,
            "stat": "Average",
            "region": "us-east-1",
            "title": "Web Tier Health"
        }
    }]
}
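Hand-editing dashboard JSON becomes error-prone as widgets multiply, so the body can be generated instead. A minimal sketch, assuming `cloudwatch` is a `boto3.client("cloudwatch")` instance; `metric_widget`, `dashboard_body`, and `publish_dashboard` are hypothetical helper names.

```python
import json


def metric_widget(title, metrics, region="us-east-1", period=300, stat="Average"):
    """Build one metric widget with the same properties as the JSON above."""
    return {
        "type": "metric",
        "properties": {
            "metrics": metrics,
            "period": period,
            "stat": stat,
            "region": region,
            "title": title,
        },
    }


def dashboard_body(widgets):
    """Serialize widgets into the JSON string that put-dashboard expects."""
    return json.dumps({"widgets": widgets}, indent=4)


def publish_dashboard(cloudwatch, name, widgets):
    """Create or overwrite the dashboard called name."""
    return cloudwatch.put_dashboard(
        DashboardName=name,
        DashboardBody=dashboard_body(widgets),
    )
```

Because every widget carries its own `region`, a single dashboard built this way can mix metrics from several regions.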

Log Retention and Costs

Retention setting	Cost impact and typical use
Never expire	Highest — logs accumulate forever
30 days	Good default for application logs
7 days	Dev/staging environments
Export to S3	Archive long-term at lower cost

  aws logs put-retention-policy \
    --log-group-name /aws/lambda/my-function \
    --retention-in-days 30
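Log groups created implicitly (for example by Lambda) start with no retention policy, so a periodic sweep can catch the ones that were missed. A sketch of such a sweep, assuming `logs_client` is a `boto3.client("logs")` instance; the function name and the 30-day default are illustrative choices.

```python
def enforce_retention(logs_client, retention_days=30):
    """Set a retention policy on every log group that has none.

    Returns the names of the log groups that were updated.
    """
    updated = []
    paginator = logs_client.get_paginator("describe_log_groups")
    for page in paginator.paginate():
        for group in page["logGroups"]:
            # Groups that never expire have no retentionInDays key.
            if "retentionInDays" not in group:
                logs_client.put_retention_policy(
                    logGroupName=group["logGroupName"],
                    retentionInDays=retention_days,
                )
                updated.append(group["logGroupName"])
    return updated
```

Groups that already have a policy are left alone, so a deliberately shorter retention in dev or staging is not overwritten.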

CloudWatch Agent

Collect memory, disk, and custom OS metrics from EC2:

  {
    "metrics": {
        "namespace": "CWAgent",
        "metrics_collected": {
            "mem": {"measurement": ["mem_used_percent"]},
            "disk": {
                "measurement": ["disk_used_percent"],
                "resources": ["/"]
            }
        }
    },
    "logs": {
        "logs_collected": {
            "files": {
                "collect_list": [{
                    "file_path": "/var/log/nginx/access.log",
                    "log_group_name": "/app/nginx/access",
                    "log_stream_name": "{instance_id}"
                }]
            }
        }
    }
}
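When several log files need shipping, the `collect_list` entries follow the same pattern and can be generated. The sketch below rebuilds the configuration above from a mapping of file paths to log group names. The helper names are hypothetical, and any extra path or log group you pass in is your own choice, not something the agent requires.

```python
import json


def log_file_entry(file_path, log_group_name):
    """One collect_list entry; the stream is named after the instance ID."""
    return {
        "file_path": file_path,
        "log_group_name": log_group_name,
        "log_stream_name": "{instance_id}",
    }


def agent_config(log_files, disk_paths=("/",)):
    """Build the agent config dict. log_files maps file path -> log group name."""
    return {
        "metrics": {
            "namespace": "CWAgent",
            "metrics_collected": {
                "mem": {"measurement": ["mem_used_percent"]},
                "disk": {
                    "measurement": ["disk_used_percent"],
                    "resources": list(disk_paths),
                },
            },
        },
        "logs": {
            "logs_collected": {
                "files": {
                    "collect_list": [
                        log_file_entry(path, group)
                        for path, group in log_files.items()
                    ]
                }
            }
        },
    }


def write_config(path, log_files):
    """Write the generated config as JSON to path."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(agent_config(log_files), handle, indent=4)
```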

Real-World Scenario: Production Monitoring Stack

Layer	Tool	Alerts
Infrastructure	CloudWatch EC2/RDS metrics	CPU > 80%, disk > 90%
Application	Custom metrics + Logs Insights	Error rate > 1%, p99 latency > 2s
Uptime	Route 53 health checks + Synthetics	Endpoint down
Security	GuardDuty + CloudTrail	Unauthorized API calls
Notification	SNS → PagerDuty/Slack	On-call escalation
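The application-layer thresholds in the table (error rate above 1%, p99 latency above 2 s) can be made concrete with a small local check. This is an illustration of the arithmetic only, using a nearest-rank percentile; CloudWatch computes percentile statistics on its side, and the function names here are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of values for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def application_alerts(latencies_s, errors, requests,
                       max_error_rate=0.01, max_p99_s=2.0):
    """Return the names of the thresholds that are currently breached."""
    alerts = []
    if requests and errors / requests > max_error_rate:
        alerts.append("error_rate")
    if latencies_s and percentile(latencies_s, 99) > max_p99_s:
        alerts.append("p99_latency")
    return alerts
```

With 100 requests, a single 5-second outlier does not move the nearest-rank p99, but two of them do, which is why p99 is a steadier alerting signal than the maximum.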

CloudWatch vs Third-Party Tools

Feature	CloudWatch	Datadog/New Relic
AWS integration	Native, automatic	Integration or agent setup required
Custom metrics cost	About $0.30/metric/month at the first pricing tier (varies by region and volume)	Plan-dependent (allotments or usage-based billing)
Log analytics	Logs Insights	Full APM
Multi-cloud	AWS only	Multi-cloud
Setup time	Minutes	Hours (agent config)

Start with CloudWatch; add third-party APM when you need distributed tracing across services.

Common Mistakes

No alarms configured: metrics without alarms are just graphs
Alarm fatigue: too many low-threshold alarms; raise thresholds or tune evaluation periods
Never-expire log retention: stored volume, and with it storage cost, keeps growing indefinitely
Missing custom metrics: infrastructure metrics don’t show business KPIs
Not using Logs Insights: grepping across log groups by hand is slow and expensive
Ignoring billing metrics: set billing alarms in us-east-1, where billing metrics are published
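A billing alarm is an ordinary metric alarm on the `EstimatedCharges` metric in the `AWS/Billing` namespace, created in us-east-1. The sketch below builds the request; it assumes billing alerts are already enabled in the account's billing preferences. The alarm name pattern, the 6-hour period, and the helper names are illustrative assumptions.

```python
def billing_alarm_request(threshold_usd, sns_topic_arn):
    """Build keyword arguments for put_metric_alarm on estimated charges."""
    return {
        "AlarmName": f"billing-over-{threshold_usd}-usd",
        "AlarmDescription": f"Estimated charges above {threshold_usd} USD",
        "Namespace": "AWS/Billing",
        "MetricName": "EstimatedCharges",
        "Dimensions": [{"Name": "Currency", "Value": "USD"}],
        "Statistic": "Maximum",
        "Period": 21600,  # 6 hours; billing data is published only a few times a day
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_usd),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


def create_billing_alarm(threshold_usd, sns_topic_arn):
    """Create the alarm. Needs boto3 and AWS credentials, so nothing runs on import."""
    import boto3  # imported lazily so the builder above works without boto3

    # Billing metrics are only available in us-east-1.
    client = boto3.client("cloudwatch", region_name="us-east-1")
    client.put_metric_alarm(**billing_alarm_request(threshold_usd, sns_topic_arn))
```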

Troubleshooting

Issue	Cause	Fix
No metrics appearing	Wrong namespace/dimensions	Verify with `list-metrics`
Alarm never triggers	Insufficient data points	Check `evaluation-periods` and `datapoints-to-alarm`
Logs not appearing	Missing IAM permissions	Role needs `logs:CreateLogStream`, `logs:PutLogEvents`
High Logs cost	Verbose logging, no retention	Reduce log level; set retention; filter at source
Dashboard empty	Widget points at the wrong region	Dashboards are global, but each widget reads metrics from its own `region` property; check that value

Best Practices

Define SLIs/SLOs (availability, latency, error rate) and alarm on SLO breaches
Use composite alarms to reduce noise
Set log retention on every log group at creation
Emit custom business metrics (orders/min, signups/day)
Create runbooks linked from alarm descriptions
Enable X-Ray alongside CloudWatch for request tracing
Export long-term logs to S3 + Athena for cost-effective analysis
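Alarming on SLO breaches usually means tracking an error budget: the number of failures the SLO still allows in the current window. A minimal sketch of that arithmetic, with hypothetical function names; the request and failure counts would come from your own metrics.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget left: 1.0 is untouched, below 0 means breached."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total_requests <= 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


def slo_breached(slo_target, total_requests, failed_requests):
    """True once more failures have occurred than the SLO allows."""
    return error_budget_remaining(slo_target, total_requests, failed_requests) < 0
```

For a 99.9% availability target over 1,000,000 requests, the budget is about 1,000 failures, so 250 failures leave roughly 75% of it.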

Next: Elastic Load Balancing.
