Cost Optimization
AWS’s pay-as-you-go model means costs scale with usage — both up and down. Without active management, cloud bills grow silently from idle resources, over-provisioned instances, and forgotten snapshots. This guide covers the tools and strategies professionals use to optimize AWS costs without sacrificing performance or reliability.
Understand Your Bill
AWS charges fall into major categories:
| Category | Examples | Typical % of Bill |
|---|---|---|
| Compute | EC2, Lambda, Fargate | 40-60% |
| Storage | S3, EBS, EFS | 10-20% |
| Database | RDS, DynamoDB, ElastiCache | 15-25% |
| Networking | Data transfer, NAT Gateway, CloudFront | 5-15% |
| Other | CloudWatch, KMS, support plan | 5-10% |
# Cost Explorer CLI: May 2024 cost by service (the End date is exclusive)
aws ce get-cost-and-usage \
--time-period Start=2024-05-01,End=2024-06-01 \
--granularity MONTHLY \
--metrics BlendedCost \
--group-by Type=DIMENSION,Key=SERVICE
# Daily costs for EC2
aws ce get-cost-and-usage \
--time-period Start=2024-06-01,End=2024-06-13 \
--granularity DAILY \
--metrics UnblendedCost \
--filter '{"Dimensions":{"Key":"SERVICE","Values":["Amazon Elastic Compute Cloud - Compute"]}}'
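The JSON returned by `get-cost-and-usage` can be ranked offline once it is saved to a file. The sketch below assumes the usual Cost Explorer response shape (`ResultsByTime`, then `Groups`, each with `Keys` and `Metrics`); the function name `top_services` and the sample amounts are made up for illustration.

```python
from collections import defaultdict


def top_services(response: dict, metric: str = "BlendedCost", n: int = 5) -> list:
    """Sum cost per service across all returned periods; return the n largest."""
    totals = defaultdict(float)
    for period in response.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            service = group["Keys"][0]
            # Cost Explorer returns amounts as strings, so convert before summing.
            totals[service] += float(group["Metrics"][metric]["Amount"])
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical sample, trimmed to the fields the function reads.
SAMPLE = {
    "ResultsByTime": [{
        "Groups": [
            {"Keys": ["Amazon Simple Storage Service"],
             "Metrics": {"BlendedCost": {"Amount": "96.20", "Unit": "USD"}}},
            {"Keys": ["Amazon Elastic Compute Cloud - Compute"],
             "Metrics": {"BlendedCost": {"Amount": "412.50", "Unit": "USD"}}},
            {"Keys": ["AWS Key Management Service"],
             "Metrics": {"BlendedCost": {"Amount": "3.00", "Unit": "USD"}}},
        ]
    }]
}

if __name__ == "__main__":
    for service, amount in top_services(SAMPLE, n=2):
        print(f"{service}: ${amount:.2f}")
```

To use it on real data, redirect the CLI output to a file and pass the result of `json.load` to `top_services`, with `metric` matching the `--metrics` value used in the query.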
Tagging Strategy
Tags enable cost allocation — without them, you can’t answer “which team spent what?”
| Tag Key | Example Values | Purpose |
|---|---|---|
| Environment | dev, staging, production | Separate dev spend |
| Project | ecommerce, analytics | Per-project billing |
| Owner | team-platform, team-data | Team accountability |
| CostCenter | CC-1234 | Finance integration |
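# Enforce tagging with an AWS Organizations SCP or a Config rule
# Example SCP: deny EC2 launch when either required tag is missing.
# Keys inside a single condition block are ANDed, so each tag needs its own
# statement; otherwise the deny fires only when both tags are absent.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RequireEnvironmentTag",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {"Null": {"aws:RequestTag/Environment": "true"}}
    },
    {
      "Sid": "RequireProjectTag",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {"Null": {"aws:RequestTag/Project": "true"}}
    }
  ]
}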
Activate Cost Allocation Tags in Billing Console → Cost Allocation Tags. A tag shows up in cost reports only after it is activated, which can take up to 24 hours.
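A preventive policy only covers new launches, so existing resources still need an audit. As a minimal sketch, the check below takes a tag list in the `[{"Key": ..., "Value": ...}]` shape that EC2 describe calls return and reports which of the four keys from the table are missing. The helper name `missing_tags` and the rule that an empty value counts as missing are assumptions for illustration.

```python
REQUIRED_TAGS = ("Environment", "Project", "Owner", "CostCenter")


def missing_tags(tags: list, required: tuple = REQUIRED_TAGS) -> list:
    """Return required tag keys that are absent or blank, in `required` order.

    Tag keys are case-sensitive in AWS, so "environment" does not satisfy
    "Environment".
    """
    present = {t["Key"] for t in tags if t.get("Value", "").strip()}
    return [key for key in required if key not in present]


if __name__ == "__main__":
    instance_tags = [
        {"Key": "Environment", "Value": "dev"},
        {"Key": "Owner", "Value": ""},
    ]
    print(missing_tags(instance_tags))
```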
Right-Sizing
EC2 instances are frequently over-provisioned. Size them from measured utilization, not guesses:
# Compute Optimizer recommendations
aws compute-optimizer get-ec2-instance-recommendations
# CloudWatch CPU utilization over 14 days
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-xxx \
--start-time 2024-05-30T00:00:00Z \
--end-time 2024-06-13T00:00:00Z \
--period 86400 \
--statistics Average Maximum
| CPU Avg | Action |
|---|---|
| < 20% | Downsize instance type |
| 20-70% | Right-sized |
| > 70% sustained | Upsize or add instances |
Also check Trusted Advisor (Business/Enterprise support) for underutilized EBS volumes, idle ELBs, and unused Elastic IPs.
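The CPU thresholds in the table above can be applied directly to the `Datapoints` list that `get-metric-statistics` returns. This is a rough sketch: the function names are hypothetical, and it looks at CPU only, so memory- or network-bound instances would need additional metrics before any resize.

```python
def average_cpu(datapoints: list) -> float:
    """Mean of the `Average` field across CloudWatch datapoints."""
    if not datapoints:
        raise ValueError("no datapoints returned for this instance and period")
    return sum(dp["Average"] for dp in datapoints) / len(datapoints)


def rightsizing_action(avg_cpu_percent: float) -> str:
    """Map an average CPU percentage to the action from the table."""
    if not 0 <= avg_cpu_percent <= 100:
        raise ValueError("CPU utilization must be between 0 and 100")
    if avg_cpu_percent < 20:
        return "downsize"
    if avg_cpu_percent <= 70:
        return "right-sized"
    return "upsize or add instances"


if __name__ == "__main__":
    # Hypothetical daily datapoints, trimmed to the fields used here.
    sample = [{"Average": 10.0, "Maximum": 41.0},
              {"Average": 20.0, "Maximum": 55.0},
              {"Average": 15.0, "Maximum": 38.0}]
    print(rightsizing_action(average_cpu(sample)))
```

Checking `Maximum` as well before downsizing is a sensible extra guard, since a low average can hide short peaks.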
Reserved Capacity vs Savings Plans
| Option | Commitment | Flexibility | Savings |
|---|---|---|---|
| On-Demand | None | Full | 0% (baseline) |
| Savings Plans (Compute) | $/hour for 1 or 3 years | Any instance family/region | Up to 66% |
| Reserved Instances (EC2) | Specific instance type for 1 or 3 years | Low: tied to instance type and Region (or AZ for zonal RIs) | Up to 72% |
| Spot Instances | None (can be interrupted) | Any available capacity | Up to 90% |
# Purchase Compute Savings Plan (Console recommended for first time)
# Example: $0.50/hour commitment for 1 year, no upfront
# Applies to EC2, Fargate, Lambda automatically
Strategy: Steady-state baseline on Savings Plans; burst capacity on On-Demand; fault-tolerant workloads on Spot.
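Whether a commitment pays off depends on how many hours the workload actually runs, because the committed amount is charged whether or not it is used. The arithmetic can be sketched as follows, under the simplifying assumption that the commitment is sized for exactly one instance; the discount and rate in the example are illustrative, not quoted AWS prices.

```python
HOURS_PER_MONTH = 730  # common approximation: 8,760 hours / 12


def break_even_utilization(discount: float) -> float:
    """Fraction of hours a workload must run before a commitment beats On-Demand."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1.0 - discount


def monthly_costs(on_demand_rate: float, discount: float, hours_used: float,
                  hours_in_month: int = HOURS_PER_MONTH) -> tuple:
    """Return (on_demand_cost, committed_cost) for one month."""
    on_demand = on_demand_rate * hours_used
    # The commitment is billed for every hour of the month, used or not.
    committed = on_demand_rate * (1 - discount) * hours_in_month
    return on_demand, committed


if __name__ == "__main__":
    # Hypothetical: $0.10/hour On-Demand, 30% discount, instance runs 400 h/month.
    od, sp = monthly_costs(0.10, 0.30, 400)
    print(f"On-Demand ${od:.2f} vs committed ${sp:.2f}")
    print(f"Break-even at {break_even_utilization(0.30):.0%} utilization")
```

With a 30% discount the break-even is 70% utilization, which is why the commitment belongs on the steady-state baseline and not on burst capacity.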
Spot Instances
Ideal for batch processing, CI/CD workers, and stateless workloads:
# Launch Spot instances via an Auto Scaling group
# subnet-xxx and subnet-yyy are placeholders for your own subnet IDs
aws autoscaling create-auto-scaling-group \
--auto-scaling-group-name batch-workers \
--mixed-instances-policy '{
"LaunchTemplate": {"LaunchTemplateSpecification": {"LaunchTemplateName": "batch-lt", "Version": "$Latest"}},
"InstancesDistribution": {
"OnDemandBaseCapacity": 0,
"OnDemandPercentageAboveBaseCapacity": 0,
"SpotAllocationStrategy": "capacity-optimized"
}
}' \
--vpc-zone-identifier "subnet-xxx,subnet-yyy" \
--min-size 0 --max-size 20 --desired-capacity 5
Handle Spot interruptions gracefully — use Spot Instance interruption notices (2-minute warning via IMDS).
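One way to act on the notice is to poll the instance metadata service from the worker itself. The sketch below uses IMDSv2 (token first, then the `spot/instance-action` path, which returns 404 until an interruption is scheduled). It only works from inside an EC2 instance, and the function names are hypothetical; what to do on a notice (drain, checkpoint, deregister) is left to the workload.

```python
import json
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"


def parse_instance_action(status: int, body: str):
    """Return (action, time) if an interruption is scheduled, else None."""
    if status == 404:
        return None  # no interruption notice yet
    if status != 200:
        raise RuntimeError(f"unexpected IMDS status {status}")
    notice = json.loads(body)
    return notice["action"], notice["time"]


def fetch_instance_action(timeout: float = 2.0):
    """Query IMDSv2 for a Spot interruption notice (run on the instance)."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token", method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"})
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return parse_instance_action(resp.status, resp.read().decode())
    except urllib.error.HTTPError as err:
        # urlopen raises on 404, which here simply means "no notice".
        return parse_instance_action(err.code, "")
```

A worker loop would call `fetch_instance_action()` every few seconds and start its shutdown path as soon as the result is not `None`.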
Storage Cost Optimization
S3 Lifecycle Policies
{
"Rules": [{
"ID": "TieredStorage",
"Status": "Enabled",
"Filter": {"Prefix": "data/"},
"Transitions": [
{"Days": 30, "StorageClass": "STANDARD_IA"},
{"Days": 90, "StorageClass": "GLACIER"},
{"Days": 365, "StorageClass": "DEEP_ARCHIVE"}
]
}]
}
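To sanity-check a rule before applying it, it can help to compute which storage class an object of a given age should have reached. This is a small sketch that reads the same rule structure as above; the function name is hypothetical, objects are assumed to be uploaded as `STANDARD`, and S3 applies transitions asynchronously, so the real change can lag the nominal day.

```python
RULE = {
    "ID": "TieredStorage",
    "Status": "Enabled",
    "Filter": {"Prefix": "data/"},
    "Transitions": [
        {"Days": 30, "StorageClass": "STANDARD_IA"},
        {"Days": 90, "StorageClass": "GLACIER"},
        {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
    ],
}


def storage_class_at(rule: dict, age_days: int) -> str:
    """Storage class an object of the given age should have reached."""
    current = "STANDARD"  # assumed upload class
    for transition in sorted(rule["Transitions"], key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current


if __name__ == "__main__":
    for age in (0, 30, 90, 365):
        print(age, storage_class_at(RULE, age))
```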
EBS Optimization
# Find unattached EBS volumes (paying for storage with no instance)
aws ec2 describe-volumes \
--filters Name=status,Values=available \
--query 'Volumes[*].[VolumeId,Size,CreateTime]' \
--output table
# Delete unused volumes (verify first!)
aws ec2 delete-volume --volume-id vol-xxx
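Switch gp2 → gp3 for about 20% lower per-GB cost. gp3 includes a baseline of 3,000 IOPS and 125 MiB/s regardless of size, which matches or beats most gp2 volumes; gp2 volumes larger than 1,000 GiB have a higher IOPS baseline, so provision extra gp3 IOPS or throughput for those before migrating.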
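The gp2 to gp3 trade-off is simple arithmetic. gp2 baseline IOPS scale at 3 IOPS per GiB (minimum 100, maximum 16,000), while gp3 starts at 3,000 IOPS for any size. The sketch below estimates the storage saving and flags how many extra IOPS a gp3 volume would need to match its gp2 baseline. The per-GB prices are illustrative assumptions chosen to match the 20% figure; check current pricing for your Region, and note that provisioned IOPS are billed separately and not included here.

```python
GP3_BASELINE_IOPS = 3000


def gp2_baseline_iops(size_gib: int) -> int:
    """gp2 baseline: 3 IOPS per GiB, floor 100, ceiling 16,000 (burst ignored)."""
    return max(100, min(3 * size_gib, 16000))


def gp3_migration(size_gib: int, gp2_price: float = 0.10,
                  gp3_price: float = 0.08) -> dict:
    """Estimate monthly storage saving and extra IOPS needed on gp3.

    Prices are assumed USD per GiB-month, for illustration only.
    """
    extra_iops = max(0, gp2_baseline_iops(size_gib) - GP3_BASELINE_IOPS)
    saving = round(size_gib * (gp2_price - gp3_price), 2)
    return {"monthly_storage_saving": saving,
            "extra_iops_to_match_gp2": extra_iops}


if __name__ == "__main__":
    print(gp3_migration(500))
    print(gp3_migration(2000))
```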
NAT Gateway Costs
NAT Gateway is often a surprise line item: roughly $32/month per gateway (typically one per AZ) plus $0.045 per GB processed:
| Alternative | Savings | Trade-off |
|---|---|---|
| VPC endpoints for S3/DynamoDB | Free (gateway) | S3/DynamoDB only |
| Interface endpoints for AWS APIs | ~$7/month/AZ | Per-service cost |
| NAT Instance (t3.micro) | ~$8/month | You manage HA |
| VPC endpoints + no NAT | Maximum | Limited to AWS services |
Audit NAT Gateway data processing charges monthly.
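For the monthly audit, the NAT Gateway bill can be approximated from two numbers: how many gateways run and how many GB they process. The sketch below uses the approximate rates quoted in this section as defaults; the function names are hypothetical, and it ignores any separate data transfer charges.

```python
def nat_gateway_monthly_cost(gb_processed: float, gateways: int = 1,
                             fixed_per_gateway: float = 32.0,
                             per_gb: float = 0.045) -> float:
    """Approximate monthly cost: fixed hourly charge plus data processing."""
    return gateways * fixed_per_gateway + gb_processed * per_gb


def saving_from_gateway_endpoint(s3_dynamodb_gb: float,
                                 per_gb: float = 0.045) -> float:
    """Processing charge avoided by routing S3/DynamoDB traffic through
    a gateway VPC endpoint instead of the NAT Gateway."""
    return s3_dynamodb_gb * per_gb


if __name__ == "__main__":
    # Hypothetical: 3 gateways (one per AZ), 5 TB processed, 2 TB of it to S3.
    print(f"Current:  ${nat_gateway_monthly_cost(5000, gateways=3):.2f}")
    print(f"Avoidable: ${saving_from_gateway_endpoint(2000):.2f}")
```

In this made-up case the data processing charge ($225) is more than twice the fixed charge ($96), which is typical of why the line item surprises people.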
Budgets and Alerts
# Create budget via CLI
aws budgets create-budget \
--account-id 123456789012 \
--budget '{
"BudgetName": "monthly-total",
"BudgetLimit": {"Amount": "500", "Unit": "USD"},
"TimeUnit": "MONTHLY",
"BudgetType": "COST"
}' \
--notifications-with-subscribers '[{
"Notification": {
"NotificationType": "ACTUAL",
"ComparisonOperator": "GREATER_THAN",
"Threshold": 80,
"ThresholdType": "PERCENTAGE"
},
"Subscribers": [{"SubscriptionType": "EMAIL", "Address": "[email protected]"}]
}]'
Set budgets per team/project using cost allocation tags.
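A single 80% alert often arrives late, so several thresholds are common. As a sketch, the helper below builds the JSON for `--notifications-with-subscribers` with one notification per threshold, using the same fields as the command above. The helper name, the 50/80/100 defaults, and the example address are assumptions for illustration.

```python
import json


def budget_notifications(email: str, thresholds: tuple = (50, 80, 100)) -> list:
    """One ACTUAL-spend notification per percentage threshold."""
    return [
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": threshold,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [{"SubscriptionType": "EMAIL", "Address": email}],
        }
        for threshold in thresholds
    ]


if __name__ == "__main__":
    # finops@example.com is a placeholder address.
    print(json.dumps(budget_notifications("finops@example.com"), indent=2))
```

The printed JSON can be saved to a file and passed to the CLI as `--notifications-with-subscribers file://notifications.json`.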
Real-World Scenario: Startup Cost Review
| Finding | Monthly Cost | Action | Savings |
|---|---|---|---|
| 3 idle t3.large (dev) | $180 | Stop after hours / use t3.micro | $150 |
| Unattached 500 GB EBS | $50 | Delete after snapshot | $50 |
| NAT Gateway (single AZ dev) | $45 | VPC endpoints for S3/APIs | $30 |
| RDS db.r5.xlarge (20% CPU) | $350 | Downsize to db.r5.large | $175 |
| S3 Standard for 2TB logs | $46 | Lifecycle to Glacier after 30d | $35 |
| Total | | | ~$440/month |
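The totals in the table are easy to verify, and the same few lines work for any review spreadsheet. The figures below are copied from the table; the function name `summarize` is just for illustration.

```python
# (finding, monthly cost in USD, monthly saving in USD), copied from the table.
FINDINGS = [
    ("3 idle t3.large (dev)", 180, 150),
    ("Unattached 500 GB EBS", 50, 50),
    ("NAT Gateway (single AZ dev)", 45, 30),
    ("RDS db.r5.xlarge (20% CPU)", 350, 175),
    ("S3 Standard for 2TB logs", 46, 35),
]


def summarize(findings: list) -> tuple:
    """Return (total cost, total saving, saving as a percentage of cost)."""
    total_cost = sum(cost for _, cost, _ in findings)
    total_saving = sum(saving for _, _, saving in findings)
    return total_cost, total_saving, round(100 * total_saving / total_cost, 1)


if __name__ == "__main__":
    cost, saving, percent = summarize(FINDINGS)
    print(f"${saving} saved of ${cost} reviewed ({percent}%)")
```

The five findings cost $671/month in total, so the $440 saving is roughly two thirds of the reviewed spend.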
FinOps Best Practices
- Monthly cost review — dedicated meeting with engineering and finance
- Showback/chargeback — teams see their own cloud costs
- Automate shutdown — dev/staging resources off nights and weekends
- Use AWS Free Tier wisely for experiments, not production
- Review Reserved/Savings Plan utilization quarterly
- Delete unused resources — EIPs, snapshots, AMIs, old log groups
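The payoff from automated shutdown is easy to quantify. The sketch below assumes a weekday schedule of 08:00 to 20:00 (an assumption, adjust it to your team) and computes both whether an instance should be running at a given time and what fraction of compute hours the schedule removes. Stopped instances still pay for attached EBS volumes, so this covers compute charges only.

```python
from datetime import datetime


def should_run(now: datetime, start_hour: int = 8, end_hour: int = 20) -> bool:
    """True on Monday-Friday from start_hour (inclusive) to end_hour (exclusive)."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def scheduled_saving_fraction(start_hour: int = 8, end_hour: int = 20,
                              days_per_week: int = 5) -> float:
    """Fraction of weekly compute hours avoided compared with running 24/7."""
    running_hours = (end_hour - start_hour) * days_per_week
    return 1 - running_hours / (24 * 7)


if __name__ == "__main__":
    print(should_run(datetime(2024, 6, 12, 10, 0)))   # a Wednesday morning
    print(f"{scheduled_saving_fraction():.0%} of compute hours avoided")
```

A scheduler (for example a Lambda function on a cron trigger) would evaluate something like `should_run` in the team's local time zone and stop or start tagged instances accordingly.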
Common Cost Mistakes
- Leaving dev environments running 24/7 — schedule stop/start
- Over-provisioned RDS — db.r5 for a dev database with 5 connections
- No lifecycle policies on S3 — logs accumulate in Standard class forever
- Multiple NAT Gateways in dev — one is enough for non-production
- Ignoring data transfer costs — cross-AZ and cross-region transfer adds up
- Unused Reserved Instances — buy RIs only for proven steady-state workloads
Troubleshooting Unexpected Bills
| Spike Source | How to Find | Fix |
|---|---|---|
| EC2 | Cost Explorer → EC2 → by instance ID | Stop/terminate idle instances |
| Data transfer | Cost Explorer → Data Transfer | CloudFront for outbound; same-AZ placement |
| NAT Gateway | VPC → NAT Gateways → monitoring | VPC endpoints; reduce cross-AZ traffic |
| CloudWatch Logs | Log groups → stored bytes | Set retention; reduce log verbosity |
| S3 storage and requests | S3 → Metrics → BucketSizeBytes, NumberOfObjects; request metrics (e.g., AllRequests) | Lifecycle policies; Intelligent-Tiering |
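Before drilling into a service, it helps to know which day the spike started. The sketch below flags days in a daily cost series (for example, the `DAILY` output of Cost Explorer) that exceed a trailing average by a margin. It is a deliberately simple heuristic with assumed parameters, not how AWS Cost Anomaly Detection works.

```python
from statistics import mean, pstdev


def find_spikes(daily_costs: list, window: int = 7, k: float = 3.0,
                min_ratio: float = 1.2) -> list:
    """Return indexes of days that exceed the trailing-window threshold.

    A day is flagged when its cost is above both mean + k * stddev and
    min_ratio * mean of the previous `window` days. The min_ratio floor
    avoids flagging tiny changes when recent costs are nearly constant.
    """
    spikes = []
    for i in range(window, len(daily_costs)):
        history = daily_costs[i - window:i]
        avg = mean(history)
        threshold = max(avg + k * pstdev(history), avg * min_ratio)
        if daily_costs[i] > threshold:
            spikes.append(i)
    return spikes


if __name__ == "__main__":
    # Hypothetical daily EC2 costs in USD; the eighth day jumps.
    costs = [10, 11, 10, 12, 11, 10, 11, 30, 11]
    print(find_spikes(costs))
```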
Best Practices Summary
- Tag everything from day one — retroactive tagging is painful
- Use Cost Explorer and Cost Anomaly Detection weekly
- Purchase Savings Plans for steady-state compute after 3 months of stable usage
- Use Spot for fault-tolerant and batch workloads
- Apply S3 lifecycle policies to every bucket
- Enable AWS Budgets with alerts at 50%, 80%, 100%
- Run Trusted Advisor or Compute Optimizer monthly
- Automate dev environment shutdown with Instance Scheduler or Lambda