The AWS Well-Architected Framework provides architectural best practices across six pillars. Use it to evaluate designs, identify risks, and build production systems that scale securely and cost-effectively. The AWS Well-Architected Tool in the console is available at no charge, and production workloads can also be reviewed together with an AWS Solutions Architect.

The Six Pillars

Pillar | Focus | Key Question
Operational Excellence | Run and monitor systems | Can you deploy, respond to incidents, and improve?
Security | Protect data and systems | Is everything encrypted, least-privilege, and audited?
Reliability | Recover from failures | Does the system meet SLAs across AZ/region failures?
Performance Efficiency | Use resources efficiently | Right-sized compute, caching, and async processing?
Cost Optimization | Avoid unnecessary spend | Reserved capacity, right-sizing, lifecycle policies?
Sustainability | Minimize environmental impact | Efficient resources, serverless, Graviton instances?

Operational Excellence

Design Principles

  • Operations as code — CloudFormation, CDK, Terraform for reproducible infrastructure
  • Automate changes — CI/CD pipelines, no manual console changes in production
  • Anticipate failure — Game days, chaos engineering, runbooks
  • Learn from events — Blameless postmortems, update runbooks

Checklist

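# Infrastructure as Code example (CloudFormation snippet)
# WebServerLaunchTemplate, PrivateSubnetA/B, and WebTargetGroup are hypothetical
# resources assumed to be defined elsewhere in the same template.
Resources:
  WebServerASG:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: "2"
      MaxSize: "10"
      HealthCheckType: ELB          # use load balancer health, not only EC2 status
      HealthCheckGracePeriod: 300   # seconds to wait before the first health check
      VPCZoneIdentifier:            # subnets in two different AZs
        - !Ref PrivateSubnetA
        - !Ref PrivateSubnetB
      LaunchTemplate:
        LaunchTemplateId: !Ref WebServerLaunchTemplate
        Version: !GetAtt WebServerLaunchTemplate.LatestVersionNumber
      TargetGroupARNs:
        - !Ref WebTargetGroup
      Tags:
        - Key: Environment
          Value: production
          PropagateAtLaunch: true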
  
  • All infrastructure defined in IaC (no manual drift)
  • CI/CD pipeline with automated testing and rollback
  • Runbooks for common incidents (linked from CloudWatch alarms)
  • Regular game days and failure injection tests
  • Structured logging with correlation IDs
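The last checklist item can be sketched in a few lines. The example below is a minimal illustration using only the Python standard library, not a prescribed AWS pattern: it emits one JSON object per log line and stamps each line with a correlation ID held in a context variable. The names `JsonFormatter`, `build_logger`, and `handle_request` are hypothetical, as is the choice of a UUID for the ID.

```python
import contextvars
import json
import logging
import uuid

# Holds the correlation ID for the current request or task context.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        }
        return json.dumps(payload)


def build_logger(name: str = "app") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr (hypothetical helper)."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


def handle_request(logger: logging.Logger) -> str:
    """Assign a fresh correlation ID, log with it, and return it."""
    request_id = str(uuid.uuid4())
    token = correlation_id.set(request_id)
    try:
        logger.info("order received")
        return request_id
    finally:
        correlation_id.reset(token)  # restore the previous context value
```

In a real service the ID would more likely be taken from an incoming request header and passed on to downstream calls, so that log lines from different components can be joined on the same value.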

Security Pillar

Defense in Depth

  Internet → WAF → ALB → Security Groups → Private Subnets → Encryption
                ↓
         CloudTrail + GuardDuty + Config
  
Layer | AWS Service
Identity | IAM, IAM Identity Center (formerly AWS SSO), MFA
Network | VPC, SG, NACL, WAF, Shield
Data at rest | KMS, S3 encryption, RDS encryption
Data in transit | TLS 1.3, ACM certificates
Detection | GuardDuty, Security Hub, Macie
Audit | CloudTrail, Config, Access Analyzer

Security Checklist

  • Root account secured with MFA, no access keys
  • Least-privilege IAM with permission boundaries
  • All data encrypted at rest (customer managed KMS keys for sensitive data)
  • TLS everywhere — no HTTP in production
  • CloudTrail enabled in all regions
  • GuardDuty enabled
  • Secrets in Secrets Manager, not environment variables
  • Regular penetration testing and vulnerability scans
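As one way to make the least-privilege item checkable, the sketch below scans an IAM policy document (as a Python dict) for `Allow` statements whose `Action` or `Resource` is a bare `*`. It is a deliberately simple illustration: the function names, the sample policy, and the bucket name are hypothetical, and it does not replace tools such as IAM Access Analyzer.

```python
from typing import Any


def _as_list(value: Any) -> list:
    """IAM accepts a single string or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def find_wildcard_statements(policy: dict) -> list:
    """Return Allow statements whose Action or Resource is a bare '*'."""
    findings = []
    for stmt in _as_list(policy.get("Statement")):
        if stmt.get("Effect") != "Allow":
            continue
        actions = _as_list(stmt.get("Action"))
        resources = _as_list(stmt.get("Resource"))
        if "*" in actions or "*" in resources:
            findings.append(stmt)
    return findings


# Hypothetical policy: the bucket name and Sid values are illustrative only.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadReports",
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-reports-bucket/*",
        },
        {"Sid": "TooBroad", "Effect": "Allow", "Action": "*", "Resource": "*"},
    ],
}

flagged = [stmt["Sid"] for stmt in find_wildcard_statements(policy)]
```

Here only the `TooBroad` statement is flagged; the scoped `s3:GetObject` statement passes because its resource is limited to one bucket prefix. Partial wildcards such as `s3:*` are not caught by this sketch.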

Reliability Pillar

High Availability Patterns

Pattern | Implementation | RTO / RPO
Multi-AZ | RDS Multi-AZ, ASG across AZs | Minutes / Zero
Multi-Region | Route 53 failover, S3 CRR | Minutes-Hours / Minutes
Backup & Restore | RDS snapshots, S3 versioning | Hours / Hours
Pilot Light | Minimal DR region, scale on failover | 10-30 min / Minutes
Warm Standby | Reduced capacity in DR region | Minutes / Minutes
Active-Active | Full capacity in multiple regions | Near-zero / Near-zero

Reliability Checklist

  • Workloads span at least two AZs
  • Auto Scaling with health checks (ELB, not just EC2)
  • RDS Multi-AZ with automated backups (7+ day retention)
  • SQS/SNS for decoupling and async processing
  • Circuit breakers and retry with exponential backoff
  • Tested disaster recovery procedures (quarterly)
  • Route 53 health checks for DNS failover
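The "retry with exponential backoff" item above can be illustrated with a short standard-library sketch. It assumes full jitter (a random wait between zero and the capped exponential bound) and retries on any exception; production code would normally retry only errors known to be transient. The function names and default values are hypothetical.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base: float = 0.1, cap: float = 5.0) -> List[float]:
    """Upper bound for each wait: base * 2**attempt, capped at `cap` seconds."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(retries)]


def call_with_retry(
    func: Callable[[], T],
    retries: int = 4,
    base: float = 0.1,
    cap: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying failures with full-jitter exponential backoff."""
    delays = backoff_delays(retries, base, cap)
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            # Full jitter: wait a random time in [0, upper bound].
            sleep(random.uniform(0, delays[attempt]))
    raise RuntimeError("unreachable")
```

The `sleep` parameter exists so that tests can substitute a stub instead of actually waiting. Jitter matters because many clients retrying on the same schedule can otherwise hit a recovering service at the same instant.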

Performance Efficiency

Right-Sizing and Selection

Workload | Recommended Service
Static website | S3 + CloudFront
REST API (variable traffic) | Lambda + API Gateway
Containerized microservices | ECS Fargate or EKS
Long-running batch | EC2 Spot Instances
Real-time analytics | Kinesis + Lambda
Caching layer | ElastiCache (Redis)

# Use Compute Optimizer for right-sizing recommendations
# (123456789012 is a placeholder; substitute your own account ID)
aws compute-optimizer get-ec2-instance-recommendations \
    --account-ids 123456789012
  

Performance Checklist

  • CloudFront CDN for static and cacheable content
  • ElastiCache for session and database query caching
  • Async processing for non-critical paths (SQS + Lambda)
  • Database read replicas for read-heavy workloads
  • Graviton instances where compatible (AWS cites up to 40% better price-performance than comparable x86 instances)
  • Load testing before production launch (Artillery, k6, Locust)
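The caching item in this checklist usually means the cache-aside pattern: read from the cache first, fall back to the database on a miss, then store the result with a TTL. The sketch below shows that flow with an in-memory dict standing in for ElastiCache (Redis); the `CacheAside` class, its `loader` callback, and the 60-second TTL are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class CacheAside:
    """Cache-aside with TTL; a dict stands in for ElastiCache (Redis)."""

    def __init__(
        self,
        loader: Callable[[str], str],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.loader = loader  # e.g. a database query (hypothetical)
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}  # key -> (expiry, value)

    def get(self, key: str) -> str:
        entry = self._store.get(key)
        if entry is not None and entry[0] > self.clock():
            return entry[1]  # cache hit, still fresh
        value = self.loader(key)  # miss or expired: go to the source
        self._store[key] = (self.clock() + self.ttl_seconds, value)
        return value
```

With a real Redis client the dict lookup and assignment would become `GET` and `SET` with an expiry, but the control flow stays the same. The TTL bounds how stale a cached row can be, which is the main trade-off to tune.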

Cost Optimization

See the dedicated Cost Optimization page. Key principles:

  • Right-size — don’t over-provision; use Compute Optimizer
  • Reserved capacity — Savings Plans or Reserved Instances for steady workloads
  • Spot Instances — up to 90% savings for fault-tolerant workloads
  • Lifecycle policies — S3 IA/Glacier for infrequent data
  • Tag everything — cost allocation by team/project/environment
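To make the lifecycle-policy principle concrete, the sketch below builds a lifecycle configuration as a plain Python dict in the shape that the S3 `PutBucketLifecycleConfiguration` API accepts. The helper name, the `logs/` prefix, the bucket name in the comment, and the 30/90/365-day thresholds are illustrative assumptions, and the validation is intentionally not exhaustive.

```python
def lifecycle_config(
    prefix: str,
    ia_days: int = 30,
    glacier_days: int = 90,
    expire_days: int = 365,
) -> dict:
    """Build a lifecycle configuration dict for objects under one prefix."""
    if ia_days < 30:
        # S3 requires objects to be at least 30 days old before STANDARD_IA.
        raise ValueError("ia_days must be at least 30")
    if not ia_days < glacier_days < expire_days:
        raise ValueError("expected ia_days < glacier_days < expire_days")
    return {
        "Rules": [
            {
                "ID": f"tier-{prefix.strip('/') or 'all'}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_days, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": expire_days},
            }
        ]
    }


config = lifecycle_config("logs/")
# With boto3 installed and credentials configured, the dict could be applied as:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket", LifecycleConfiguration=config)
```

Keeping the rule as data makes it easy to review in a pull request and to reuse across buckets, which fits the operations-as-code principle earlier on this page.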

Sustainability

  • Prefer Graviton (ARM) instances — better performance per watt
  • Use serverless (Lambda, Fargate) — no idle capacity
  • Apply S3 lifecycle policies — reduce stored data volume
  • Choose regions powered by renewable energy where possible
  • Right-size to avoid over-provisioned resources

Well-Architected Review Process

  1. Define workload scope (e.g., “Production e-commerce API”)
  2. Answer questions for each pillar in the AWS WA Tool
  3. Identify high-risk issues (HRIs) — must fix before production
  4. Create improvement plan with prioritized remediation
  5. Re-review after 6-12 months or major architecture changes

Real-World Scenario: SaaS Platform Assessment

Pillar | Current State | HRI | Remediation
Security | No WAF on ALB | Yes | Attach AWS WAF with managed rules
Reliability | Single AZ RDS | Yes | Enable Multi-AZ
Cost | All On-Demand EC2 | No | Purchase Compute Savings Plan
Operations | Manual deploys | Yes | Implement CodePipeline CI/CD
Performance | No CDN | No | Add CloudFront for static assets
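The assessment above can be turned into a prioritized improvement plan mechanically. The sketch below encodes the five findings from the scenario and orders them so that high-risk issues come first; the `Finding` dataclass and `improvement_plan` function are hypothetical names, and a real plan would likely also weigh effort and business impact.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Finding:
    pillar: str
    current_state: str
    high_risk: bool
    remediation: str


# Rows copied from the SaaS platform assessment.
findings = [
    Finding("Security", "No WAF on ALB", True, "Attach AWS WAF with managed rules"),
    Finding("Reliability", "Single AZ RDS", True, "Enable Multi-AZ"),
    Finding("Cost", "All On-Demand EC2", False, "Purchase Compute Savings Plan"),
    Finding("Operations", "Manual deploys", True, "Implement CodePipeline CI/CD"),
    Finding("Performance", "No CDN", False, "Add CloudFront for static assets"),
]


def improvement_plan(items: List[Finding]) -> List[Finding]:
    """HRIs first; the original order is kept within each group (stable sort)."""
    return sorted(items, key=lambda f: not f.high_risk)


plan = [f.pillar for f in improvement_plan(findings)]
```

For this scenario the plan starts with Security, Reliability, and Operations, followed by Cost and Performance.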

Architecture Patterns Reference

Pattern | Services | When to Use
Three-tier web | ALB + EC2/ECS + RDS | Traditional web apps
Serverless API | API GW + Lambda + DynamoDB | Variable traffic, event-driven
Event-driven | EventBridge + SQS + Lambda | Async workflows, decoupling
Data lake | S3 + Glue + Athena | Analytics on structured/unstructured data
Microservices | ECS/EKS + ALB + per-service DB | Independent team deployment

Common Architectural Mistakes

  1. Single AZ production — AZ failure = total outage
  2. Monolith on oversized EC2 — can’t scale components independently
  3. No caching layer — database becomes bottleneck
  4. Synchronous everything — tight coupling causes cascading failures
  5. Shared database across microservices — defeats service independence
  6. No observability — can’t debug what you can’t see
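Mistake 4 (cascading failures from tight synchronous coupling) is what a circuit breaker guards against: after repeated failures, callers fail fast instead of piling up on a dependency that is already down. The sketch below is a minimal, single-threaded illustration; the class name, the thresholds, and the simplified half-open behavior (one trial call after the timeout) are assumptions, not a reference implementation.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, func: Callable[[], object]) -> object:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: let one trial call through (half-open).
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(lambda: client.get_order(order_id))` with a hypothetical `client`, and treat `CircuitOpenError` as a signal to serve a fallback or a cached response.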

Best Practices Summary

  • Run a Well-Architected Review before every major launch
  • Automate everything — infrastructure, deployment, scaling, remediation
  • Design for failure — assume any component will fail
  • Apply least privilege at every layer
  • Measure with SLIs/SLOs — availability, latency, error rate
  • Document decisions — ADRs (Architecture Decision Records) for future teams
  • Review architecture quarterly as requirements evolve
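The SLI/SLO item above comes down to simple arithmetic, sketched below. The example assumes a request-based availability SLI and a 30-day window; the function names and the 99.9% target are illustrative assumptions rather than values taken from the framework.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded in the measurement window."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime allowed by an availability SLO over the window, in minutes."""
    if not 0 < slo < 1:
        raise ValueError("slo must be between 0 and 1, e.g. 0.999")
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, sli: float) -> float:
    """Share of the error budget still unspent (negative if overspent)."""
    allowed = 1 - slo
    spent = 1 - sli
    return 1 - spent / allowed


sli = availability_sli(good_requests=9_995, total_requests=10_000)
minutes = error_budget_minutes(0.999)  # roughly 43 minutes per 30 days
remaining = budget_remaining(0.999, sli)  # roughly half the budget left
```

Tracking the remaining budget gives teams a shared, numeric basis for deciding when to slow feature work and prioritize reliability fixes.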
