CloudWatch — Monitoring
Amazon CloudWatch is AWS’s observability service: it collects metrics, stores logs, triggers alarms, and visualizes system health. Most AWS services publish metrics to CloudWatch automatically; you add custom metrics and logs for application-level visibility.
CloudWatch Components
| Component | Purpose |
|---|---|
| Metrics | Time-series data (CPU, request count, custom) |
| Logs | Centralized log storage and querying |
| Alarms | Automated actions on metric thresholds |
| Dashboards | Visual monitoring panels |
| EventBridge (formerly CloudWatch Events) | Event-driven automation |
| Synthetics | Canary scripts for uptime monitoring |
| Container Insights | ECS/EKS deep metrics |
View EC2 Metrics
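# CPU utilization for an instance (last hour)
# Note: `date -v-1H` is BSD/macOS syntax; with GNU date (Linux) use
# `date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S` for --start-time instead.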
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
--start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Average Maximum
# List available metrics
aws cloudwatch list-metrics --namespace AWS/EC2
Standard EC2 metrics (5-minute intervals, free):
CPUUtilization, NetworkIn, NetworkOut, DiskReadOps, DiskWriteOps
Enable detailed monitoring (1-minute intervals, extra cost) when Auto Scaling needs to react faster than 5-minute datapoints allow.
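The same one-hour CPU query can be built in Python, which avoids shell differences in computing the start time. This is a minimal sketch: `cpu_query_params` is an illustrative helper name, not part of boto3, and it only builds the keyword arguments that boto3's `get_metric_statistics` accepts.

```python
from datetime import datetime, timedelta, timezone


def cpu_query_params(instance_id, now=None, lookback=timedelta(hours=1)):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end - lookback,
        "EndTime": end,
        "Period": 300,  # seconds; matches basic 5-minute monitoring
        "Statistics": ["Average", "Maximum"],
    }


# Usage (needs boto3 and AWS credentials, so it is left commented out):
# import boto3
# resp = boto3.client("cloudwatch").get_metric_statistics(
#     **cpu_query_params("i-0123456789abcdef0"))
# for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
#     print(point["Timestamp"], point["Average"], point["Maximum"])
```

The datapoints are sorted by timestamp before printing because the API does not promise to return them in order.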
Create Alarms
# Alert when average EC2 CPU stays above 80% for 10 minutes (2 x 5-minute periods)
aws cloudwatch put-metric-alarm \
--alarm-name high-cpu-web-server \
--alarm-description "CPU above 80% for 10 minutes" \
--metric-name CPUUtilization \
--namespace AWS/EC2 \
--statistic Average \
--period 300 \
--threshold 80 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 2 \
--dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
--alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts
# Composite alarm (multiple conditions)
aws cloudwatch put-composite-alarm \
--alarm-name service-degraded \
--alarm-rule "ALARM(high-cpu-web-server) OR ALARM(high-error-rate)"
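How long a breach must last before an alarm fires follows from `--period` and `--evaluation-periods`. The sketch below is a simplified model of that evaluation, not CloudWatch's actual implementation: it assumes one datapoint per period, a `GreaterThanThreshold` comparison, and no missing data. Both function names are illustrative.

```python
def alarm_window_seconds(period, evaluation_periods):
    """Length of the window the alarm looks at before changing state."""
    return period * evaluation_periods


def would_alarm(datapoints, threshold, evaluation_periods, datapoints_to_alarm=None):
    """Simplified M-of-N check for a GreaterThanThreshold alarm.

    Assumes one datapoint per period and no missing data.
    """
    m = datapoints_to_alarm or evaluation_periods
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return False  # not enough data yet to evaluate the full window
    breaching = sum(1 for value in recent if value > threshold)
    return breaching >= m


# With --period 300 and --evaluation-periods 2 the window is 10 minutes.
print(alarm_window_seconds(300, 2))
print(would_alarm([50, 85, 90], threshold=80, evaluation_periods=2))
```

Passing `datapoints_to_alarm` lower than `evaluation_periods` models an "M out of N" alarm, which tolerates a single good datapoint in the middle of a sustained problem.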
Alarm Actions
| Action | Use Case |
|---|---|
| SNS notification | Email, SMS, Slack (via Lambda) |
| Auto Scaling | Scale out/in on metric |
| EC2 action | Recover, reboot, stop, or terminate an instance |
| Lambda | Custom remediation |
| SSM Automation | Run runbooks (typically via an EventBridge rule on alarm state change) |
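For the SNS-to-Slack path in the table, a Lambda function subscribed to the topic receives the alarm as a JSON string inside the SNS record. The following is a rough sketch of the parsing step only. The `format_alarm` helper and the output format are assumptions for illustration, and the actual Slack call is left out because it depends on your webhook setup.

```python
import json


def format_alarm(sns_message):
    """Turn the JSON body of a CloudWatch alarm notification into one line."""
    alarm = json.loads(sns_message)
    return "[{state}] {name}: {reason}".format(
        state=alarm["NewStateValue"],
        name=alarm["AlarmName"],
        reason=alarm.get("NewStateReason", "no reason given"),
    )


def handler(event, context):
    """Lambda entry point for an SNS-triggered function."""
    lines = [format_alarm(record["Sns"]["Message"]) for record in event["Records"]]
    # Posting `lines` to Slack is omitted here: the webhook URL and the
    # HTTP call depend on your workspace configuration.
    return lines
```

Keeping the formatting in a separate function makes it easy to unit-test with a captured alarm message before wiring up the notification channel.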
CloudWatch Logs
# Create log group
aws logs create-log-group --log-group-name /aws/lambda/my-function
# Stream logs from CLI
aws logs tail /aws/lambda/my-function --follow
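# Filter logs (Logs Insights query). start-query only returns a queryId;
# fetch the rows afterwards with: aws logs get-query-results --query-id <queryId>
# (`date -v-1H` is BSD/macOS syntax; with GNU date use `date -u -d '1 hour ago' +%s`)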
aws logs start-query \
--log-group-name /aws/lambda/my-function \
--start-time $(date -u -v-1H +%s) \
--end-time $(date -u +%s) \
--query-string 'fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20'
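Logs Insights queries are asynchronous: `start-query` returns a query ID, and the results have to be polled for. A minimal Python sketch of that loop, assuming a boto3 CloudWatch Logs client is passed in; `run_insights_query` and its polling defaults are illustrative choices, not AWS recommendations.

```python
import time


def run_insights_query(logs, log_group, query, start, end,
                       poll_seconds=1.0, max_polls=60):
    """Start a Logs Insights query and wait for its results.

    `logs` is a boto3 CloudWatch Logs client, e.g. boto3.client("logs").
    `start` and `end` are epoch seconds.
    """
    query_id = logs.start_query(
        logGroupName=log_group,
        startTime=start,
        endTime=end,
        queryString=query,
    )["queryId"]

    for _ in range(max_polls):
        resp = logs.get_query_results(queryId=query_id)
        if resp["status"] == "Complete":
            # Each row is a list of {"field": ..., "value": ...} pairs.
            return [{c["field"]: c["value"] for c in row} for row in resp["results"]]
        if resp["status"] in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query {query_id} ended with status {resp['status']}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"query {query_id} did not complete after {max_polls} polls")
```

Each returned row is flattened into a plain dict, so a row from the query above can be read as `row["@timestamp"]` and `row["@message"]`.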
Logs Insights Query Examples
# Error count in 5-minute buckets
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() as errors by bin(5m)
# Slow Lambda invocations (longer than 3 seconds)
fields @timestamp, @duration, @requestId
| filter @type = "REPORT"
| filter @duration > 3000
| sort @duration desc
| limit 50
# API Gateway 5xx errors by path (assumes access logs with status, path, and ip fields)
fields @timestamp, status, path, ip
| filter status >= 500
| stats count() by path
Custom Metrics
import boto3

cloudwatch = boto3.client('cloudwatch')

# Publish one data point for a custom metric
cloudwatch.put_metric_data(
    Namespace='MyApp/Orders',
    MetricData=[{
        'MetricName': 'OrdersProcessed',
        'Value': 1,
        'Unit': 'Count',
        'Dimensions': [
            {'Name': 'Environment', 'Value': 'production'},
            {'Name': 'Region', 'Value': 'us-east-1'}
        ]
    }]
)
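Calling `put_metric_data` once per event adds an API request to every operation. One option is to aggregate locally and publish a statistic set through the `StatisticValues` field of a `MetricData` entry. The helper below is a sketch of that idea; `summarize_metric` is an illustrative name, and how often you flush is left to the application.

```python
def summarize_metric(name, values, unit="Count", dimensions=None):
    """Collapse many observations into one MetricData entry."""
    if not values:
        raise ValueError("need at least one value")
    return {
        "MetricName": name,
        "Unit": unit,
        "Dimensions": dimensions or [],
        "StatisticValues": {
            "SampleCount": float(len(values)),
            "Sum": float(sum(values)),
            "Minimum": float(min(values)),
            "Maximum": float(max(values)),
        },
    }


# Usage (needs boto3 and AWS credentials, so it is left commented out):
# import boto3
# boto3.client("cloudwatch").put_metric_data(
#     Namespace="MyApp/Orders",
#     MetricData=[summarize_metric("OrdersProcessed", [1, 1, 3])],
# )
```

The trade-off is that percentile statistics generally cannot be derived from pre-aggregated values, so keep raw values for metrics where p99 matters.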
Use Embedded Metric Format (EMF) for structured logging that auto-creates metrics:
import json
import time  # needed for the Timestamp field below

# Print one EMF record; CloudWatch extracts ProcessingTime as a metric
print(json.dumps({
    "_aws": {
        "Timestamp": int(time.time() * 1000),
        "CloudWatchMetrics": [{
            "Namespace": "MyApp",
            "Dimensions": [["Service"]],
            "Metrics": [{"Name": "ProcessingTime", "Unit": "Milliseconds"}]
        }]
    },
    "Service": "order-processor",
    "ProcessingTime": 245
}))
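An EMF record has to name every dimension and metric in `_aws` and also carry their values as top-level keys, which is easy to get out of sync by hand. A small builder keeps the two in step. This is a sketch under that reading of the format; `emf_record` and its argument shapes are assumptions for illustration.

```python
import json
import time


def emf_record(namespace, dimensions, metrics, timestamp_ms=None):
    """Build an Embedded Metric Format document.

    dimensions: {"Service": "order-processor"}
    metrics:    {"ProcessingTime": (245, "Milliseconds")}
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [list(dimensions)],  # one dimension set
                "Metrics": [{"Name": n, "Unit": u} for n, (_, u) in metrics.items()],
            }],
        },
    }
    record.update(dimensions)  # dimension values live at the top level
    record.update({n: v for n, (v, _) in metrics.items()})
    return record


print(json.dumps(emf_record(
    "MyApp",
    {"Service": "order-processor"},
    {"ProcessingTime": (245, "Milliseconds")},
)))
```

In Lambda, printing the record to standard output is enough, since the function's output already goes to CloudWatch Logs.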
Dashboards
aws cloudwatch put-dashboard \
--dashboard-name Production-Overview \
--dashboard-body file://dashboard.json
{
  "widgets": [{
    "type": "metric",
    "properties": {
      "metrics": [
        ["AWS/EC2", "CPUUtilization", "AutoScalingGroupName", "web-asg"],
        [".", "NetworkIn", ".", "."],
        ["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", "app/web-alb/xxx"]
      ],
      "period": 300,
      "stat": "Average",
      "region": "us-east-1",
      "title": "Web Tier Health"
    }
  }]
}
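Hand-written dashboard JSON gets repetitive once there are several similar widgets. The body can instead be generated and passed to `put_dashboard` as a string. A minimal sketch, assuming one CPU widget per Auto Scaling group; the function names and the two-per-row layout are illustrative choices.

```python
import json


def cpu_widget(asg_name, region, x=0, y=0):
    """One metric widget showing average CPU for an Auto Scaling group."""
    return {
        "type": "metric",
        "x": x, "y": y, "width": 12, "height": 6,
        "properties": {
            "metrics": [["AWS/EC2", "CPUUtilization", "AutoScalingGroupName", asg_name]],
            "period": 300,
            "stat": "Average",
            "region": region,
            "title": f"{asg_name} CPU",
        },
    }


def dashboard_body(asg_names, region="us-east-1"):
    """Lay widgets out two per row on the 24-column dashboard grid."""
    widgets = [
        cpu_widget(name, region, x=(i % 2) * 12, y=(i // 2) * 6)
        for i, name in enumerate(asg_names)
    ]
    return json.dumps({"widgets": widgets})


# Usage (needs boto3 and AWS credentials, so it is left commented out):
# import boto3
# boto3.client("cloudwatch").put_dashboard(
#     DashboardName="Production-Overview",
#     DashboardBody=dashboard_body(["web-asg"]),
# )
```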
Log Retention and Costs
| Retention | Cost Impact |
|---|---|
| Never expire | Highest — logs accumulate forever |
| 30 days | Good default for application logs |
| 7 days | Dev/staging environments |
| Export to S3 | Archive long-term at lower cost |
aws logs put-retention-policy \
--log-group-name /aws/lambda/my-function \
--retention-in-days 30
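Setting retention one log group at a time does not scale, and groups created implicitly (for example by Lambda) default to never expiring. The sketch below finds groups without a policy and applies a default. It assumes a boto3 CloudWatch Logs client is passed in; both function names are illustrative.

```python
def groups_without_retention(log_groups):
    """Names of log groups that never expire.

    describe_log_groups omits `retentionInDays` for such groups.
    """
    return [g["logGroupName"] for g in log_groups if "retentionInDays" not in g]


def apply_default_retention(logs, days=30):
    """Set a retention policy on every log group that lacks one.

    `logs` is a boto3 CloudWatch Logs client, e.g. boto3.client("logs").
    Returns the names of the groups that were changed.
    """
    changed = []
    for page in logs.get_paginator("describe_log_groups").paginate():
        for name in groups_without_retention(page["logGroups"]):
            logs.put_retention_policy(logGroupName=name, retentionInDays=days)
            changed.append(name)
    return changed
```

Run it per region, since log groups are regional, and consider printing the list from `groups_without_retention` first as a dry run.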
CloudWatch Agent
Collect memory, disk, and custom OS metrics from EC2:
{
  "metrics": {
    "namespace": "CWAgent",
    "metrics_collected": {
      "mem": {"measurement": ["mem_used_percent"]},
      "disk": {
        "measurement": ["disk_used_percent"],
        "resources": ["/"]
      }
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [{
          "file_path": "/var/log/nginx/access.log",
          "log_group_name": "/app/nginx/access",
          "log_stream_name": "{instance_id}"
        }]
      }
    }
  }
}
Real-World Scenario: Production Monitoring Stack
| Layer | Tool | Alerts |
|---|---|---|
| Infrastructure | CloudWatch EC2/RDS metrics | CPU > 80%, disk > 90% |
| Application | Custom metrics + Logs Insights | Error rate > 1%, p99 latency > 2s |
| Uptime | Route 53 health checks + Synthetics | Endpoint down |
| Security | GuardDuty + CloudTrail | Unauthorized API calls |
| Notification | SNS → PagerDuty/Slack | On-call escalation |
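An "error rate > 1%" alert like the one in the table needs a ratio of two metrics, which a single-metric alarm cannot express. One way is a metric math alarm. The sketch below builds the arguments for boto3's `put_metric_alarm`. The metric names `Errors` and `Requests`, the 5-period evaluation, and the helper name are assumptions, so substitute whatever your application actually publishes.

```python
def error_rate_alarm_params(namespace, sns_topic_arn, threshold_percent=1.0):
    """Keyword arguments for put_metric_alarm using a metric math expression."""

    def stat(metric_id, metric_name):
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {"Namespace": namespace, "MetricName": metric_name},
                "Period": 60,
                "Stat": "Sum",
            },
            "ReturnData": False,  # input only; not what the alarm evaluates
        }

    return {
        "AlarmName": "high-error-rate",
        "ComparisonOperator": "GreaterThanThreshold",
        "Threshold": threshold_percent,
        "EvaluationPeriods": 5,
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
        "Metrics": [
            stat("errors", "Errors"),      # hypothetical custom metric
            stat("requests", "Requests"),  # hypothetical custom metric
            {
                "Id": "rate",
                "Expression": "100 * errors / requests",
                "Label": "Error rate (%)",
                "ReturnData": True,  # the alarm evaluates this series
            },
        ],
    }


# Usage (needs boto3 and AWS credentials, so it is left commented out):
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**error_rate_alarm_params(
#     "MyApp", "arn:aws:sns:us-east-1:123456789012:ops-alerts"))
```

Exactly one entry in `Metrics` sets `ReturnData` to true; the others only feed the expression.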
CloudWatch vs Third-Party Tools
| Feature | CloudWatch | Datadog/New Relic |
|---|---|---|
| AWS integration | Native, automatic | Agent required |
| Custom metrics cost | About $0.30/metric/month (first pricing tier; varies by region) | Included in plan |
| Log analytics | Logs Insights | Full APM |
| Multi-cloud | AWS only | Multi-cloud |
| Setup time | Minutes | Hours (agent config) |
Start with CloudWatch; add third-party APM when you need distributed tracing across services.
Common Mistakes
- No alarms configured: metrics without alarms are just graphs
- Alarm fatigue: too many low-threshold alarms; tune thresholds and evaluation periods
- Never-expire log retention: stored log volume, and the storage bill, only ever grows
- Missing custom metrics: infrastructure metrics don’t show business KPIs
- Not using Logs Insights: grepping across log groups is slow and expensive
- Ignoring billing metrics: set billing alarms in us-east-1, the only region where billing metrics are published
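For the billing alarm mentioned in the last item, the metric is `EstimatedCharges` in the `AWS/Billing` namespace. A minimal sketch of the alarm parameters for boto3's `put_metric_alarm` follows; the alarm name pattern and the helper name are illustrative, and billing alerts must already be enabled in the account's billing preferences.

```python
def billing_alarm_params(limit_usd, sns_topic_arn):
    """Keyword arguments for put_metric_alarm on estimated charges."""
    return {
        "AlarmName": f"billing-over-{limit_usd}-usd",
        "Namespace": "AWS/Billing",
        "MetricName": "EstimatedCharges",
        "Dimensions": [{"Name": "Currency", "Value": "USD"}],
        "Statistic": "Maximum",
        "Period": 21600,  # 6 hours; billing data is only updated a few times a day
        "EvaluationPeriods": 1,
        "Threshold": float(limit_usd),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


# Usage (needs boto3 and AWS credentials, so it is left commented out).
# The client must target us-east-1 and the SNS topic must live there too:
# import boto3
# boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(
#     **billing_alarm_params(100, "arn:aws:sns:us-east-1:123456789012:ops-alerts"))
```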
Troubleshooting
| Issue | Cause | Fix |
|---|---|---|
| No metrics appearing | Wrong namespace/dimensions | Verify with list-metrics |
| Alarm never triggers | Insufficient data points | Check evaluation-periods, datapoints-to-alarm, and treat-missing-data |
| Logs not appearing | Missing IAM permissions | Role needs logs:CreateLogStream, logs:PutLogEvents |
| High Logs cost | Verbose logging, no retention | Reduce log level; set retention; filter at source |
| Dashboard empty | Wrong region | CloudWatch dashboards are region-specific |
Best Practices
- Define SLIs/SLOs (availability, latency, error rate) and alarm on SLO breaches
- Use composite alarms to reduce noise
- Set log retention on every log group at creation
- Emit custom business metrics (orders/min, signups/day)
- Create runbooks linked from alarm descriptions
- Enable X-Ray alongside CloudWatch for request tracing
- Export long-term logs to S3 + Athena for cost-effective analysis
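The S3 export in the last item can be scripted with boto3's `create_export_task`, which takes its time range in epoch milliseconds. The sketch below builds the parameters for one UTC day. The helper name, task name, and prefix layout are assumptions, and the destination bucket needs a policy that lets CloudWatch Logs write to it.

```python
from datetime import datetime, timedelta, timezone


def export_task_params(log_group, bucket, day):
    """Keyword arguments for create_export_task covering one UTC day.

    `day` is a datetime.date.
    """
    start = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    end = start + timedelta(days=1)
    return {
        "taskName": f"export-{day.isoformat()}",
        "logGroupName": log_group,
        "fromTime": int(start.timestamp() * 1000),  # epoch milliseconds
        "to": int(end.timestamp() * 1000),
        "destination": bucket,  # S3 bucket name
        "destinationPrefix": f"{log_group.strip('/')}/{day.isoformat()}",
    }


# Usage (needs boto3 and AWS credentials, so it is left commented out;
# "my-log-archive" is a placeholder bucket name):
# import boto3
# yesterday = (datetime.now(timezone.utc) - timedelta(days=1)).date()
# boto3.client("logs").create_export_task(
#     **export_task_params("/aws/lambda/my-function", "my-log-archive", yesterday))
```

Date-based prefixes make it straightforward to define a partitioned Athena table over the exported objects later.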