Azure Well-Architected Framework
The Azure Well-Architected Framework provides design principles and best practices for building secure, reliable, and efficient cloud workloads. Microsoft organizes the guidance around five pillars: Reliability, Security, Cost Optimization, Operational Excellence, and Performance Efficiency. Applying the pillars systematically helps you build workloads that are resilient, secure, cost-effective, and maintainable at scale.
The Five Pillars
| Pillar | Goal | Key Practices |
|---|---|---|
| Reliability | Recover from failures, meet SLAs | Redundancy, health checks, DR plans |
| Security | Protect data and systems | Identity, encryption, threat protection |
| Cost Optimization | Maximize value, minimize spend | Right-sizing, reserved capacity, tagging |
| Operational Excellence | Run and improve systems | Automation, observability, IaC |
| Performance Efficiency | Scale to meet demand | Auto-scale, caching, async patterns |
Each pillar includes a set of design principles and review questions you should answer for every production workload.
Reliability Patterns
- Deploy across Availability Zones for zone-level redundancy (App Service, VMs, SQL, Storage)
- Use paired regions for geo-disaster recovery; most Azure regions have a designated pair, but some newer regions do not, so confirm the pairing before you design around it
- Implement health probes on load balancers, Front Door, and App Service
- Design for graceful degradation when dependencies fail (circuit breakers, fallbacks)
- Test recovery with chaos engineering (Azure Chaos Studio) and documented DR drills
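The graceful-degradation bullet above can be illustrated with a small circuit breaker. This is a minimal sketch, not a production library: the class name `CircuitBreaker`, its parameters, and the fallback convention are all assumptions made for illustration.

```python
import time


class CircuitBreaker:
    """Open after N consecutive failures; allow a trial call after a cooldown."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock          # injectable so the cooldown can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, fallback):
        # While open and still cooling down, skip the dependency entirely.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            # Cooldown elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self.failures += 1
            # Open on reaching the limit, or re-open after a failed trial call.
            if self.failures >= self.max_failures or self.opened_at is not None:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each dependency call, for example `breaker.call(fetch_recommendations, lambda: [])`, where `fetch_recommendations` is a hypothetical function and the empty list is the degraded response.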
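Example multi-region architecture (Front Door and Traffic Manager are global services, so they sit above both regions; the storage account must use RA-GZRS for the secondary endpoint to be readable):

```
Global
└── Front Door or Traffic Manager (priority routing, health probes)

Primary Region (East US)
├── App Service (active, 3 instances)
├── Azure SQL (primary, zone-redundant)
└── Storage (RA-GZRS, primary endpoint)

Secondary Region (West US)
├── App Service (standby, 1 instance; scale out on failover)
├── Azure SQL (geo-replica in failover group)
└── Storage (RA-GZRS, read-only secondary endpoint)
```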
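Priority routing in this layout means sending traffic to the most-preferred region that still passes its health probe. The sketch below shows only that selection logic; the `Endpoint` type and `pick_endpoint` function are hypothetical names and are not part of any Azure SDK.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    priority: int   # lower number = more preferred
    healthy: bool   # result of the most recent health probe


def pick_endpoint(endpoints):
    """Return the healthy endpoint with the lowest priority number, or None."""
    healthy = [e for e in endpoints if e.healthy]
    return min(healthy, key=lambda e: e.priority, default=None)


# Illustrative failover: the primary fails its probe, so the secondary is chosen.
regions = [
    Endpoint("eastus", priority=1, healthy=False),
    Endpoint("westus", priority=2, healthy=True),
]
chosen = pick_endpoint(regions)
print(chosen.name if chosen else "no healthy endpoint")
```

Returning `None` when every endpoint is unhealthy forces the caller to decide what a total outage looks like (a static error page, for instance) instead of silently routing to a failed region.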
```bash
# Verify zone support in a region
az account list-locations \
  --query "[?name=='eastus'].availabilityZoneMappings" -o json

# Create a zone-redundant App Service plan
# (zone redundancy needs at least 3 instances and cannot be changed after creation)
az appservice plan create \
  --name plan-webapp-prod-zr \
  --resource-group rg-webapp-prod \
  --location eastus \
  --sku P1v3 \
  --number-of-workers 3 \
  --zone-redundant
```
Security Layers
- Identity: Entra ID, MFA, conditional access, Managed Identities — no long-lived credentials
- Network: NSGs, private endpoints, Azure Firewall, Azure DDoS Protection (formerly DDoS Protection Standard)
- Data: Encryption at rest (SSE/CMK) and in transit (TLS 1.2+), Key Vault for secrets
- Application: WAF on Application Gateway or Front Door, secure coding, dependency scanning
- Governance: Azure Policy, Defender for Cloud, Activity Log audit, RBAC least privilege
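```bash
# Enable the Defender for Servers plan (pricing name: VirtualMachines) on the current subscription
az security pricing create \
  --name VirtualMachines \
  --tier Standard

# Assign the built-in policy "Secure transfer to storage accounts should be enabled".
# Replace SUB_ID with your subscription ID, and check the assignment's effect:
# Audit only reports noncompliant accounts, while Deny blocks them.
az policy assignment create \
  --name require-https-storage \
  --policy /providers/Microsoft.Authorization/policyDefinitions/404c3081-a854-4457-ae30-26a93ef643f9 \
  --scope /subscriptions/SUB_ID
```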
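The security layers in this section can also be spot-checked in a review script. The following is a rough sketch that audits a simplified resource description. The dictionary keys (`public_network_access`, `min_tls_version`, `uses_managed_identity`) are hypothetical and do not match the property names of any specific Azure resource type.

```python
def audit_resource(cfg):
    """Return a list of findings for one resource described by a plain dict."""
    findings = []

    # Network layer: a public endpoint is treated as a finding by default.
    if cfg.get("public_network_access", True):
        findings.append("network: public endpoint enabled; consider a private endpoint")

    # Data layer: require TLS 1.2 or later for data in transit.
    version = str(cfg.get("min_tls_version", "1.0"))
    tls = tuple(int(part) for part in version.split("."))
    if tls < (1, 2):
        findings.append(f"data: minimum TLS version {version} is below 1.2")

    # Identity layer: prefer Managed Identities over stored credentials.
    if not cfg.get("uses_managed_identity", False):
        findings.append("identity: no managed identity; long-lived credentials likely")

    return findings


sql_server = {
    "public_network_access": True,
    "min_tls_version": "1.2",
    "uses_managed_identity": False,
}
for finding in audit_resource(sql_server):
    print(finding)
```

Missing keys default to the unsafe value, so an incomplete description produces findings instead of passing silently.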
Cost Optimization
```bash
# Tag a resource group for cost allocation.
# Merge keeps existing tags; resource group tags are not inherited by the resources inside it.
az tag update \
  --resource-id /subscriptions/SUB_ID/resourceGroups/rg-webapp-prod \
  --operation Merge \
  --tags environment=prod cost-center=engineering project=web-app

# Review Advisor cost recommendations
az advisor recommendation list --category Cost \
  --query "[].{Name:shortDescription.problem, Impact:impact}" -o table
```
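Cost allocation by tag only works when every resource carries the tags. The sketch below reports resources missing required tags. It assumes input shaped like the JSON from `az resource list -o json` (objects with `name` and `tags`, where `tags` may be null), and the required tag set simply mirrors the tags used in the command above.

```python
REQUIRED_TAGS = {"environment", "cost-center", "project"}


def missing_tags(resources, required=REQUIRED_TAGS):
    """Map resource name -> sorted list of required tags it lacks."""
    report = {}
    for res in resources:
        tags = res.get("tags") or {}   # untagged resources may report null
        missing = sorted(required - set(tags))
        if missing:
            report[res["name"]] = missing
    return report


# Illustrative input; the resource names are made up.
sample = [
    {"name": "app-web-prod", "tags": {"environment": "prod", "cost-center": "engineering", "project": "web-app"}},
    {"name": "disk-old-01", "tags": None},
    {"name": "pip-test", "tags": {"environment": "dev"}},
]
print(missing_tags(sample))
```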
Cost strategies:
- Use Azure Reservations (Reserved Instances) and savings plans for predictable baseline compute
- Right-size VMs with Azure Advisor: sustained CPU utilization below roughly 20% usually signals an oversized VM
- Apply auto-shutdown for dev/test VMs and use Azure DevTest Labs
- Choose appropriate storage tiers (Cool/Archive) and redundancy (LRS for dev)
- Delete orphaned resources: unattached disks, unused IPs, old snapshots
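The right-sizing strategy above can be sketched as a simple filter over utilization data. This is illustrative only: the input fields (`avg_cpu_percent`, `monthly_cost`) are hypothetical, and the saving estimate assumes that dropping one size within a VM family roughly halves the compute cost, which you should verify against current pricing.

```python
def rightsizing_candidates(vms, cpu_threshold=20.0, saving_factor=0.5):
    """Return (name, estimated_monthly_saving) for VMs below the CPU threshold.

    saving_factor=0.5 is an assumption: one size down costs about half as much.
    """
    candidates = []
    for vm in vms:
        if vm["avg_cpu_percent"] < cpu_threshold:
            saving = round(vm["monthly_cost"] * saving_factor, 2)
            candidates.append((vm["name"], saving))
    return candidates


# Made-up utilization figures for illustration.
fleet = [
    {"name": "vm-web-1", "avg_cpu_percent": 12.0, "monthly_cost": 280.0},
    {"name": "vm-db-1", "avg_cpu_percent": 65.0, "monthly_cost": 560.0},
]
print(rightsizing_candidates(fleet))
```

CPU alone is a coarse signal. Memory pressure, disk throughput, and peak load should also be checked before downsizing.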
Operational Excellence
- Deploy with Bicep or Terraform — no manual Portal changes in production
- Use Azure DevOps or GitHub Actions for CI/CD with environment gates
- Centralize logs in Log Analytics with structured KQL queries and workbooks
- Document runbooks for common operational tasks (failover, scaling, certificate rotation)
- Conduct post-incident reviews (PIRs) and track action items to completion
- Implement Infrastructure as Code reviews in pull requests
```bash
# Preview the changes a Bicep deployment would make
az deployment group what-if \
  --resource-group rg-webapp-prod \
  --template-file main.bicep \
  --parameters @parameters.prod.json

# Apply the deployment after reviewing the preview
az deployment group create \
  --resource-group rg-webapp-prod \
  --template-file main.bicep \
  --parameters @parameters.prod.json
```
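A pipeline can use the what-if preview as an environment gate. The sketch below blocks a deployment when the preview contains deletions. It assumes the what-if result is available as JSON (for example via `--no-pretty-print`) with a top-level `changes` list whose items carry a `changeType` field; treat that shape as an assumption and confirm it against your CLI version. The function names are hypothetical.

```python
from collections import Counter


def summarize_changes(what_if_result):
    """Count planned changes by type, e.g. Create, Modify, Delete."""
    return Counter(c["changeType"] for c in what_if_result.get("changes", []))


def is_safe_to_deploy(what_if_result, blocked=("Delete",)):
    """Return False if any planned change has a blocked change type."""
    counts = summarize_changes(what_if_result)
    return not any(counts[change_type] for change_type in blocked)


# Hand-written sample in the assumed shape; resource IDs are shortened placeholders.
preview = {
    "changes": [
        {"resourceId": "app-web-prod", "changeType": "Modify"},
        {"resourceId": "redis-cache", "changeType": "Create"},
        {"resourceId": "sql-legacy", "changeType": "Delete"},
    ]
}
print(dict(summarize_changes(preview)))
print("safe" if is_safe_to_deploy(preview) else "needs manual approval")
```

Blocking only on deletions is a design choice for this sketch. A stricter gate could also require approval for modifications to stateful resources.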
Performance Efficiency
| Pattern | Azure Service | Benefit |
|---|---|---|
| Caching | Azure Cache for Redis | Reduce database load, lower latency |
| CDN | Azure Front Door / CDN | Edge delivery of static assets |
| Async processing | Service Bus, Functions | Decouple heavy work from request path |
| Auto-scale | App Service, AKS HPA, VMSS | Match capacity to demand |
| Read replicas | Azure SQL geo-replicas | Offload read traffic |
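The caching row in the table above usually means the cache-aside pattern: read from the cache first, fall back to the database on a miss, then store the result with a time-to-live. The sketch below uses an in-memory dict as a stand-in for Azure Cache for Redis. The `CacheAside` class and its `loader` callback are hypothetical names.

```python
import time


class CacheAside:
    """Cache-aside with TTL; a dict stands in for a Redis client."""

    def __init__(self, loader, ttl=60.0, clock=time.monotonic):
        self.loader = loader    # function that reads from the system of record
        self.ttl = ttl
        self.clock = clock      # injectable so expiry can be tested
        self._store = {}        # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is not None and entry[0] > self.clock():
            return entry[1]                      # cache hit
        value = self.loader(key)                 # miss or expired: load from source
        self._store[key] = (self.clock() + self.ttl, value)
        return value

    def invalidate(self, key):
        """Drop a key after the underlying record changes."""
        self._store.pop(key, None)
```

Calling `invalidate` on every write keeps cached entries from outliving the data they mirror. The TTL is the backstop for writes that bypass the application.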
Real-World Scenario: E-Commerce Platform Review
| Pillar | Assessment | Action Items |
|---|---|---|
| Reliability | Single-region App Service | Add geo-replica SQL + Front Door failover |
| Security | Public SQL endpoint | Migrate to private endpoint + Managed Identity |
| Cost | Over-provisioned D8s_v5 VMs | Downsize to D4s_v5; purchase 1-year RI |
| Operations | Manual deployments | Implement Bicep + GitHub Actions pipeline |
| Performance | No caching layer | Add Redis for session and product catalog |
Pillar Trade-offs
| Decision | Improves | May Impact |
|---|---|---|
| Multi-region deployment | Reliability | Cost, complexity |
| Private endpoints everywhere | Security | Operational complexity, DNS management |
| Reserved capacity | Cost | Flexibility |
| Comprehensive monitoring | Operations | Log ingestion costs |
| Premium SKUs | Performance | Cost |
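The multi-region trade-off can be made concrete with basic availability arithmetic. The sketch below is a simplified model: the availability figures are illustrative inputs, not Azure SLA values, and the redundancy formula assumes regions fail independently and ignores the global router in front of them.

```python
def serial(*availabilities):
    """Availability of a chain where every component must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def redundant(availability, copies):
    """Availability when at least one of N independent copies must be up."""
    return 1.0 - (1.0 - availability) ** copies


def downtime_minutes_per_year(availability):
    """Expected unavailable minutes in a 365-day year."""
    return (1.0 - availability) * 365 * 24 * 60


# Illustrative only: an app tier and a database tier, each 99.9% available.
single_region = serial(0.999, 0.999)
two_regions = redundant(single_region, copies=2)
print(f"single region: {downtime_minutes_per_year(single_region):.0f} min/year")
print(f"two regions:   {downtime_minutes_per_year(two_regions):.1f} min/year")
```

Comparing the two downtime figures against the cost of running a second region gives a starting point for deciding whether a workload justifies multi-region deployment.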
Common Mistakes
- Optimizing one pillar in isolation — cheaper but unreliable is not a win
- No architecture review before launch — technical debt accumulates fast
- Ignoring the shared responsibility model — Azure secures the platform; you secure your data and access
- Copying on-premises architecture 1:1 — cloud-native patterns reduce cost and improve resilience
- Skipping DR testing — backups exist but restore procedures are untested
- No tagging or governance — cost and security sprawl across subscriptions
Troubleshooting Design Issues
| Symptom | Likely Pillar | Investigation |
|---|---|---|
| Frequent outages | Reliability | Check redundancy, health probes, dependency chains |
| Security audit failures | Security | Review RBAC, public endpoints, encryption settings |
| Budget overruns | Cost | Cost analysis by tag; Advisor recommendations |
| Slow incident response | Operations | Verify monitoring coverage, runbook availability |
| High latency under load | Performance | Profile bottlenecks; check scaling rules and caching |
Best Practices
- Run the Azure Well-Architected Review assessment for each major workload
- Revisit architecture quarterly or when requirements change significantly
- Document architecture decision records (ADRs) for significant design choices
- Use Azure Architecture Center reference architectures as starting points
- Balance pillars based on business priorities — not every workload needs multi-region
- Include FinOps, security, and ops stakeholders in architecture reviews