AWS’s pay-as-you-go model means costs scale with usage — both up and down. Without active management, cloud bills grow silently from idle resources, over-provisioned instances, and forgotten snapshots. This guide covers the tools and strategies professionals use to optimize AWS costs without sacrificing performance or reliability.

Understand Your Bill

AWS charges fall into major categories:

| Category | Examples | Typical % of Bill |
|---|---|---|
| Compute | EC2, Lambda, Fargate | 40-60% |
| Storage | S3, EBS, EFS | 10-20% |
| Database | RDS, DynamoDB, ElastiCache | 15-25% |
| Networking | Data transfer, NAT Gateway, CloudFront | 5-15% |
| Other | CloudWatch, KMS, support plan | 5-10% |

  # Cost Explorer CLI (one calendar month, grouped by service)
aws ce get-cost-and-usage \
    --time-period Start=2024-05-01,End=2024-06-01 \
    --granularity MONTHLY \
    --metrics BlendedCost \
    --group-by Type=DIMENSION,Key=SERVICE

# Daily costs for EC2
aws ce get-cost-and-usage \
    --time-period Start=2024-06-01,End=2024-06-13 \
    --granularity DAILY \
    --metrics UnblendedCost \
    --filter '{"Dimensions":{"Key":"SERVICE","Values":["Amazon Elastic Compute Cloud - Compute"]}}'
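
The JSON returned by `get-cost-and-usage` is easier to act on once it is sorted by spend. The following is a minimal Python sketch, assuming the output of the first command above (grouped by SERVICE, `BlendedCost` metric) has already been loaded into a dictionary. The helper name `top_services` and the sample amounts are hypothetical and only illustrate the shape of the data.

```python
def top_services(response, metric="BlendedCost", limit=5):
    """Return (service, amount) pairs sorted by spend, highest first."""
    totals = {}
    for period in response.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            service = group["Keys"][0]
            # Cost Explorer reports amounts as strings, so convert before summing.
            amount = float(group["Metrics"][metric]["Amount"])
            totals[service] = totals.get(service, 0.0) + amount
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


# Made-up sample shaped like a Cost Explorer response.
sample = {
    "ResultsByTime": [
        {
            "TimePeriod": {"Start": "2024-05-01", "End": "2024-06-01"},
            "Groups": [
                {"Keys": ["Amazon Simple Storage Service"],
                 "Metrics": {"BlendedCost": {"Amount": "41.20", "Unit": "USD"}}},
                {"Keys": ["Amazon Elastic Compute Cloud - Compute"],
                 "Metrics": {"BlendedCost": {"Amount": "212.75", "Unit": "USD"}}},
                {"Keys": ["Amazon Relational Database Service"],
                 "Metrics": {"BlendedCost": {"Amount": "98.10", "Unit": "USD"}}},
            ],
        }
    ]
}

for service, amount in top_services(sample):
    print(f"{amount:10.2f}  {service}")
```

Summing across `ResultsByTime` means the same helper also works for DAILY granularity, where each day arrives as a separate period.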
  

Tagging Strategy

Tags enable cost allocation — without them, you can’t answer “which team spent what?”

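| Tag Key | Example Values | Purpose |
|---|---|---|
| Environment | dev, staging, production | Separate dev spend |
| Project | ecommerce, analytics | Per-project billing |
| Owner | team-platform, team-data | Team accountability |
| CostCenter | CC-1234 | Finance integration |

  # Enforce tagging with an AWS Organizations SCP or a Config rule
# Example SCP: deny EC2 launch when a required tag is missing
# One statement per tag: keys inside a single Condition are ANDed,
# so a combined statement would deny only when BOTH tags are absent
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "RequireEnvironmentTag",
            "Effect": "Deny",
            "Action": "ec2:RunInstances",
            "Resource": "arn:aws:ec2:*:*:instance/*",
            "Condition": {"Null": {"aws:RequestTag/Environment": "true"}}
        },
        {
            "Sid": "RequireProjectTag",
            "Effect": "Deny",
            "Action": "ec2:RunInstances",
            "Resource": "arn:aws:ec2:*:*:instance/*",
            "Condition": {"Null": {"aws:RequestTag/Project": "true"}}
        }
    ]
}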
  

Activate the tags under Billing Console → Cost Allocation Tags. Until a tag is activated there, it does not appear in Cost Explorer or in cost reports.
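
A tagging policy is easier to audit with a small check that lists which required keys a resource lacks. This is a minimal sketch, assuming the four tag keys from the table above are the required set and that a resource's tags are already available as a plain dictionary; `missing_tags` is a hypothetical helper, not an AWS API.

```python
REQUIRED_TAGS = ("Environment", "Project", "Owner", "CostCenter")


def missing_tags(resource_tags, required=REQUIRED_TAGS):
    """Return the required tag keys that are absent or blank on a resource."""
    return [key for key in required if not resource_tags.get(key, "").strip()]


# Example: Owner is present but empty, CostCenter is absent.
example = {"Environment": "dev", "Project": "ecommerce", "Owner": ""}
print(missing_tags(example))
```

Treating blank values as missing is a design choice here: an empty `Owner` tag passes a presence check but still leaves the cost unattributed.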

Right-Sizing

Many EC2 instances are over-provisioned. Size them from measured utilization, not guesses:

  # Compute Optimizer recommendations
aws compute-optimizer get-ec2-instance-recommendations

# CloudWatch CPU utilization over 14 days
aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=i-xxx \
    --start-time 2024-05-30T00:00:00Z \
    --end-time 2024-06-13T00:00:00Z \
    --period 86400 \
    --statistics Average Maximum
  
| CPU Avg | Action |
|---|---|
| < 20% | Downsize instance type |
| 20-70% | Right-sized |
| > 70% sustained | Upsize or add instances |

Also check Trusted Advisor (the full set of checks requires a Business or Enterprise support plan) for underutilized EBS volumes, idle ELBs, and unused Elastic IPs.
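
The CPU thresholds in the table above can be applied mechanically to CloudWatch output. The sketch below assumes the `Datapoints` list from `get-metric-statistics` has been loaded into Python and uses the mean of the daily averages as a stand-in for "sustained" load; the function names are hypothetical, and memory, network, and peak usage would also need checking before resizing.

```python
def right_sizing_action(avg_cpu_percent):
    """Map an average CPU percentage to the action from the table."""
    if not 0 <= avg_cpu_percent <= 100:
        raise ValueError("CPU utilization must be between 0 and 100")
    if avg_cpu_percent < 20:
        return "downsize"
    if avg_cpu_percent > 70:
        return "upsize or add instances"
    return "right-sized"


def action_from_datapoints(datapoints):
    """Average the daily 'Average' values and classify the result."""
    if not datapoints:
        raise ValueError("no datapoints to evaluate")
    averages = [point["Average"] for point in datapoints]
    return right_sizing_action(sum(averages) / len(averages))


# Made-up datapoints shaped like CloudWatch output.
sample_datapoints = [
    {"Average": 12.5, "Maximum": 40.0},
    {"Average": 9.0, "Maximum": 35.2},
]
print(action_from_datapoints(sample_datapoints))
```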

Reserved Capacity vs Savings Plans

| Option | Commitment | Flexibility | Savings |
|---|---|---|---|
| On-Demand | None | Full | 0% (baseline) |
| Savings Plans (Compute) | $/hour for 1 or 3 years | Any instance family/region | Up to 66% |
| Reserved Instances (EC2) | Specific instance type for 1 or 3 years | Low: tied to instance type and Region/AZ | Up to 72% |
| Spot Instances | None (can be interrupted) | Any available capacity | Up to 90% |

  # Purchase Compute Savings Plan (Console recommended for first time)
# Example: $0.50/hour commitment for 1 year, no upfront
# Applies to EC2, Fargate, Lambda automatically
  

Strategy: Steady-state baseline on Savings Plans; burst capacity on On-Demand; fault-tolerant workloads on Spot.
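
Why commitments only suit the steady-state baseline comes down to a break-even calculation: a commitment is paid for every hour, used or not, so it beats On-Demand only when the workload runs for more than `1 - discount` of the time. The sketch below illustrates that arithmetic; the function names are hypothetical, and the 66% figure is the "up to" maximum from the table, so real discounts are often lower.

```python
def break_even_utilization(discount):
    """Fraction of hours a workload must run for a commitment to pay off."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in the range [0, 1)")
    return 1.0 - discount


def cheaper_option(discount, utilization):
    """Compare paying On-Demand for used hours with committing to all hours."""
    if utilization > break_even_utilization(discount):
        return "commitment"
    return "on-demand"


# Assumption: the maximum Compute Savings Plan discount from the table.
print(f"break-even at {break_even_utilization(0.66):.0%} utilization")
print(cheaper_option(0.66, utilization=0.50))  # runs half the time
print(cheaper_option(0.66, utilization=0.20))  # occasional burst
```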

Spot Instances

Ideal for batch processing, CI/CD workers, and stateless workloads:

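  # Launch Spot instances via an Auto Scaling group
# Subnet IDs and instance types below are placeholders; replace with your own
aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name batch-workers \
    --vpc-zone-identifier "subnet-xxx,subnet-yyy" \
    --mixed-instances-policy '{
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {"LaunchTemplateName": "batch-lt", "Version": "$Latest"},
            "Overrides": [{"InstanceType": "m5.large"}, {"InstanceType": "m5a.large"}, {"InstanceType": "m6i.large"}]
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": 0,
            "OnDemandPercentageAboveBaseCapacity": 0,
            "SpotAllocationStrategy": "capacity-optimized"
        }
    }' \
    --min-size 0 --max-size 20 --desired-capacity 5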
  

Handle Spot interruptions gracefully — use Spot Instance interruption notices (2-minute warning via IMDS).
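
When an interruption is scheduled, the instance metadata path `spot/instance-action` returns a small JSON document containing the action and its time; otherwise the path returns 404. The sketch below covers only the parsing step, assuming the body has already been fetched from IMDS (the HTTP request and IMDSv2 token handling are left out). `seconds_until_interruption` is a hypothetical helper.

```python
import json
from datetime import datetime, timezone


def seconds_until_interruption(notice_body, now=None):
    """Parse an interruption notice and return (action, seconds remaining)."""
    notice = json.loads(notice_body)
    deadline = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return notice["action"], (deadline - now).total_seconds()


# Made-up notice body in the documented shape.
body = '{"action": "terminate", "time": "2024-06-13T08:22:00Z"}'
observed_at = datetime(2024, 6, 13, 8, 20, 0, tzinfo=timezone.utc)
action, remaining = seconds_until_interruption(body, now=observed_at)
print(action, remaining)
```

A worker would typically poll every few seconds and, once a notice appears, stop accepting new jobs and checkpoint within the remaining time.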

Storage Cost Optimization

S3 Lifecycle Policies

  {
    "Rules": [{
        "ID": "TieredStorage",
        "Status": "Enabled",
        "Filter": {"Prefix": "data/"},
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER"},
            {"Days": 365, "StorageClass": "DEEP_ARCHIVE"}
        ]
    }]
}
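
To sanity-check a lifecycle rule before applying it, the transitions can be modelled as a lookup from object age to storage class. This is a rough sketch that mirrors the three transitions in the policy above; it treats a transition as taking effect once the object is at least that many days old and ignores S3's own scheduling details. `storage_class_at` is a hypothetical helper.

```python
# (age in days, storage class) pairs copied from the lifecycle policy.
TRANSITIONS = [(30, "STANDARD_IA"), (90, "GLACIER"), (365, "DEEP_ARCHIVE")]


def storage_class_at(age_days, transitions=TRANSITIONS, initial="STANDARD"):
    """Return the storage class an object of the given age should be in."""
    current = initial
    for days, storage_class in sorted(transitions):
        if age_days >= days:
            current = storage_class
    return current


for age in (0, 45, 120, 400):
    print(age, storage_class_at(age))
```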
  

EBS Optimization

  # Find unattached EBS volumes (paying for storage with no instance)
aws ec2 describe-volumes \
    --filters Name=status,Values=available \
    --query 'Volumes[*].[VolumeId,Size,CreateTime]' \
    --output table

# Delete unused volumes (verify first!)
aws ec2 delete-volume --volume-id vol-xxx
  

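Switch gp2 → gp3 for up to 20% lower per-GB cost. gp3 includes a baseline of 3,000 IOPS and 125 MB/s at any size, which matches or beats most gp2 volumes; large gp2 volumes can exceed that baseline, so check them before migrating.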
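
Whether a gp2 volume keeps its performance on gp3 depends on its size, because gp2 baseline IOPS scale at 3 IOPS per GiB (minimum 100, maximum 16,000) while gp3 starts at a flat 3,000. The sketch below flags volumes that would need extra provisioned IOPS after migration; the helper names are hypothetical and throughput is not modelled.

```python
GP3_BASELINE_IOPS = 3000


def gp2_baseline_iops(size_gib):
    """gp2 baseline: 3 IOPS per GiB, clamped to the 100-16,000 range."""
    return min(max(3 * size_gib, 100), 16000)


def gp3_needs_extra_iops(size_gib):
    """True if the gp2 baseline exceeds what gp3 includes for free."""
    return gp2_baseline_iops(size_gib) > GP3_BASELINE_IOPS


for size in (20, 500, 2000):
    print(size, gp2_baseline_iops(size), gp3_needs_extra_iops(size))
```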

NAT Gateway Costs

NAT Gateway is often a surprise line item: each gateway (typically one per AZ) costs roughly $32/month plus $0.045 per GB processed. The alternatives compare as follows:

| Alternative | Cost / Savings | Trade-off |
|---|---|---|
| VPC endpoints for S3/DynamoDB | Free (gateway endpoints) | S3/DynamoDB only |
| Interface endpoints for AWS APIs | ~$7/month/AZ | Per-service cost |
| NAT Instance (t3.micro) | ~$8/month | You manage HA |
| VPC endpoints + no NAT | Maximum savings | Limited to AWS services |

Audit NAT Gateway data processing charges monthly.
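
For the monthly audit, a quick estimate shows how the fixed and per-GB parts of the NAT Gateway charge combine. The sketch below uses the approximate figures quoted in this section (about $32 per gateway per month and $0.045 per GB) as default parameters; actual prices vary by region and over time, and `nat_gateway_monthly_cost` is a hypothetical helper.

```python
def nat_gateway_monthly_cost(gateways, gb_processed,
                             fixed_per_gateway=32.0, per_gb=0.045):
    """Estimate monthly NAT Gateway spend from gateway count and traffic."""
    if gateways < 0 or gb_processed < 0:
        raise ValueError("inputs must be non-negative")
    return gateways * fixed_per_gateway + gb_processed * per_gb


# One gateway in dev versus three gateways (one per AZ) moving 1 TB.
print(f"{nat_gateway_monthly_cost(1, 0):.2f}")
print(f"{nat_gateway_monthly_cost(3, 1000):.2f}")
```

Splitting the estimate this way makes it clear which lever matters: removing gateways cuts the fixed part, while VPC endpoints cut the per-GB part.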

Budgets and Alerts

  # Create budget via CLI
aws budgets create-budget \
    --account-id 123456789012 \
    --budget '{
        "BudgetName": "monthly-total",
        "BudgetLimit": {"Amount": "500", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST"
    }' \
    --notifications-with-subscribers '[{
        "Notification": {
            "NotificationType": "ACTUAL",
            "ComparisonOperator": "GREATER_THAN",
            "Threshold": 80,
            "ThresholdType": "PERCENTAGE"
        },
        "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "alerts@example.com"}]
    }]'
  

Set budgets per team/project using cost allocation tags.
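
The CLI example above defines a single 80% alert; alerting at several thresholds means repeating the same notification structure. The sketch below generates that JSON for the `--notifications-with-subscribers` argument, assuming thresholds of 50%, 80%, and 100% and a placeholder email address. `budget_notifications` is a hypothetical helper, not part of any AWS SDK.

```python
import json


def budget_notifications(email, thresholds=(50, 80, 100)):
    """Build one ACTUAL-spend notification per percentage threshold."""
    return [
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": threshold,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [{"SubscriptionType": "EMAIL", "Address": email}],
        }
        for threshold in thresholds
    ]


# Placeholder address; the output can be passed to the CLI as a JSON string.
notifications = budget_notifications("alerts@example.com")
print(json.dumps(notifications, indent=2))
```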

Real-World Scenario: Startup Cost Review

| Finding | Monthly Cost | Action | Savings |
|---|---|---|---|
| 3 idle t3.large (dev) | $180 | Stop after hours / use t3.micro | $150 |
| Unattached 500 GB EBS | $50 | Delete after snapshot | $50 |
| NAT Gateway (single AZ dev) | $45 | VPC endpoints for S3/APIs | $30 |
| RDS db.r5.xlarge (20% CPU) | $350 | Downsize to db.r5.large | $175 |
| S3 Standard for 2 TB logs | $46 | Lifecycle to Glacier after 30d | $35 |
| Total | | | ~$440/month |
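
The total in the review above can be reproduced, and put in proportion to the spend it came from, with a few lines of arithmetic. The sketch below simply re-enters the figures from the scenario as data; nothing is measured.

```python
# (finding, monthly cost in USD, monthly savings in USD) from the scenario.
findings = [
    ("3 idle t3.large (dev)", 180, 150),
    ("Unattached 500 GB EBS", 50, 50),
    ("NAT Gateway (single AZ dev)", 45, 30),
    ("RDS db.r5.xlarge (20% CPU)", 350, 175),
    ("S3 Standard for 2 TB logs", 46, 35),
]

total_cost = sum(cost for _, cost, _ in findings)
total_savings = sum(savings for _, _, savings in findings)
savings_pct = 100 * total_savings / total_cost

print(f"reviewed spend: ${total_cost}/month")
print(f"savings: ${total_savings}/month ({savings_pct:.0f}% of reviewed spend)")
```

Expressing savings as a share of the reviewed line items helps prioritize: here the RDS downsize and the idle dev instances account for most of the total.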

FinOps Best Practices

  1. Monthly cost review — dedicated meeting with engineering and finance
  2. Showback/chargeback — teams see their own cloud costs
  3. Automate shutdown — dev/staging resources off nights and weekends
  4. Use AWS Free Tier wisely for experiments, not production
  5. Review Reserved/Savings Plan utilization quarterly
  6. Delete unused resources — EIPs, snapshots, AMIs, old log groups
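
The payoff from automated shutdown (item 3 above) follows directly from the share of the week a resource stays off. The sketch below computes that share, assuming a fixed weekly schedule; it covers compute hours only, since attached EBS volumes keep accruing charges while an instance is stopped. `scheduled_savings_fraction` is a hypothetical helper.

```python
HOURS_PER_WEEK = 24 * 7


def scheduled_savings_fraction(hours_per_day, days_per_week):
    """Fraction of compute hours saved versus running 24/7."""
    if not 0 <= hours_per_day <= 24 or not 0 <= days_per_week <= 7:
        raise ValueError("schedule is outside a 24-hour day or 7-day week")
    running_hours = hours_per_day * days_per_week
    return 1 - running_hours / HOURS_PER_WEEK


# Dev environment running 12 hours a day on weekdays only.
print(f"{scheduled_savings_fraction(12, 5):.0%} of compute hours saved")
```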

Common Cost Mistakes

  1. Leaving dev environments running 24/7 — schedule stop/start
  2. Over-provisioned RDS — db.r5 for a dev database with 5 connections
  3. No lifecycle policies on S3 — logs accumulate in Standard class forever
  4. Multiple NAT Gateways in dev — one is enough for non-production
  5. Ignoring data transfer costs — cross-AZ and cross-region transfer adds up
  6. Unused Reserved Instances — buy RIs only for proven steady-state workloads

Troubleshooting Unexpected Bills

| Spike Source | How to Find | Fix |
|---|---|---|
| EC2 | Cost Explorer → EC2 → by instance ID | Stop/terminate idle instances |
| Data transfer | Cost Explorer → Data Transfer | CloudFront for outbound; same-AZ placement |
| NAT Gateway | VPC → NAT Gateways → monitoring | VPC endpoints; reduce cross-AZ traffic |
| CloudWatch Logs | Log groups → stored bytes | Set retention; reduce log verbosity |
| S3 requests | S3 → Metrics → request metrics (AllRequests, GetRequests, PutRequests) | Lifecycle policies; Intelligent-Tiering |

Best Practices Summary

  • Tag everything from day one — retroactive tagging is painful
  • Use Cost Explorer and Cost Anomaly Detection weekly
  • Purchase Savings Plans for steady-state compute after 3 months of stable usage
  • Use Spot for fault-tolerant and batch workloads
  • Apply S3 lifecycle policies to every bucket
  • Enable AWS Budgets with alerts at 50%, 80%, 100%
  • Run Trusted Advisor or Compute Optimizer monthly
  • Automate dev environment shutdown with Instance Scheduler or Lambda
