Azure Kubernetes Service
Azure Kubernetes Service (AKS) is a managed Kubernetes offering. Microsoft manages the control plane (API server, etcd, scheduler); you manage node pools, workloads, and networking. AKS is the standard choice for container orchestration on Azure when you need Kubernetes portability and ecosystem tooling.
AKS Architecture
Control Plane (managed by Azure; no charge on the Free tier)
├── API Server
├── etcd (encrypted at rest)
└── Scheduler + Controller Manager
Your Node Pools (you manage)
├── System node pool (critical addons: CoreDNS, metrics-server)
└── User node pools (application workloads)
Never run application pods on the system node pool — dedicate it to critical cluster services.
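One way to enforce this separation is to add a dedicated user pool and taint the system pool with `CriticalAddonsOnly=true:NoSchedule`, which AKS system addons tolerate. The helper below is a minimal sketch, not an official procedure: the function name `isolate_system_pool`, the pool name `userpool`, and the node count and VM size are illustrative assumptions, and `--node-taints` on `az aks nodepool update` requires a reasonably recent Azure CLI.

```shell
# Hypothetical helper: isolate_system_pool RESOURCE_GROUP CLUSTER SYSTEM_POOL USER_POOL
# Adds a user-mode node pool, then taints the system pool so that only pods
# tolerating CriticalAddonsOnly (the AKS system addons) are scheduled on it.
isolate_system_pool() {
  local rg="$1" cluster="$2" system_pool="$3" user_pool="$4"

  # Dedicated pool for application workloads (count and size are illustrative)
  az aks nodepool add \
    --resource-group "$rg" \
    --cluster-name "$cluster" \
    --name "$user_pool" \
    --mode User \
    --node-count 3 \
    --node-vm-size Standard_D4s_v5 &&
  # Keep application pods off the system pool
  az aks nodepool update \
    --resource-group "$rg" \
    --cluster-name "$cluster" \
    --name "$system_pool" \
    --node-taints CriticalAddonsOnly=true:NoSchedule
}
```

Once the cluster exists, a call might look like `isolate_system_pool rg-aks-prod aks-webapp-prod nodepool1 userpool`, assuming the default system pool is named `nodepool1`.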
Create a Cluster
# Create resource group
az group create --name rg-aks-prod --location eastus
# Create AKS cluster with monitoring
az aks create \
  --resource-group rg-aks-prod \
  --name aks-webapp-prod \
  --node-count 3 \
  --node-vm-size Standard_D4s_v5 \
  --enable-managed-identity \
  --enable-addons monitoring \
  --network-plugin azure \
  --network-policy azure \
  --generate-ssh-keys \
  --zones 1 2 3
# Attach Azure Container Registry (the registry acrwebappprod must already exist)
az aks update \
  --resource-group rg-aks-prod \
  --name aks-webapp-prod \
  --attach-acr acrwebappprod
# Get credentials
az aks get-credentials \
  --resource-group rg-aks-prod \
  --name aks-webapp-prod \
  --overwrite-existing
# Verify that the nodes are Ready
kubectl get nodes
Deploy an Application
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web-app
          image: acrwebappprod.azurecr.io/web-app:v1.2.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 15
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
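# The manifest targets the production namespace, so create it first
kubectl create namespace production
kubectl apply -f deployment.yaml
kubectl expose deployment web-app -n production --type=LoadBalancer --port=80 --target-port=8080
kubectl get svc web-app -n production -w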
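If you prefer to keep the Service in version control instead of creating it imperatively, the `kubectl expose` command above corresponds roughly to the manifest below. This is a sketch; the file name `service.yaml` is an assumption.

```shell
# Write a declarative Service equivalent to the `kubectl expose` command.
cat > service.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: web-app
  namespace: production
spec:
  type: LoadBalancer
  selector:
    app: web-app        # must match the Deployment's pod labels
  ports:
    - port: 80          # port exposed by the Azure load balancer
      targetPort: 8080  # containerPort of the web-app container
EOF
```

Apply it with `kubectl apply -f service.yaml` in place of the `kubectl expose` step.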
Scaling
| Method | Command / Tool | Use Case |
|---|---|---|
| Manual | kubectl scale deployment web-app -n production --replicas=5 | Fixed capacity |
| HPA | Horizontal Pod Autoscaler on CPU/memory | Variable load |
| Cluster Autoscaler | Scale node pools automatically | Node capacity |
| KEDA | Event-driven scaling | Queue depth, Prometheus metrics |
# Horizontal Pod Autoscaler (the Deployment lives in the production namespace)
kubectl autoscale deployment web-app \
  -n production \
  --cpu-percent=70 \
  --min=2 \
  --max=20
# Enable cluster autoscaler on node pool
az aks nodepool update \
  --resource-group rg-aks-prod \
  --cluster-name aks-webapp-prod \
  --name nodepool1 \
  --enable-cluster-autoscaler \
  --min-count 2 \
  --max-count 10
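`kubectl autoscale` only covers CPU. To scale on memory as well, an `autoscaling/v2` manifest is one option. The sketch below mirrors the CPU target and replica bounds used above; the 80% memory threshold and the file name `hpa.yaml` are illustrative assumptions.

```shell
# Write an autoscaling/v2 HPA that scales web-app on CPU and memory utilization.
cat > hpa.yaml <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the container's CPU request
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80   # illustrative threshold, not from this guide
EOF
```

Apply it with `kubectl apply -f hpa.yaml` instead of running `kubectl autoscale`. Utilization is measured against resource requests, so the Deployment must declare them.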
Networking Options
| Plugin | IP Management | Use Case |
|---|---|---|
| Kubenet | Basic overlay, fewer IPs | Simple, IP-constrained VNets |
| Azure CNI | Pod gets VNet IP | Network policies, direct VNet integration |
| Azure CNI Overlay | Overlay on VNet | Large clusters without IP exhaustion |
Use Ingress with Application Gateway (AGIC) or NGINX for HTTP routing. Network Policies restrict pod-to-pod traffic — essential for multi-tenant clusters.
# Install NGINX Ingress Controller
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx --create-namespace
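With the controller installed, HTTP routing is declared through an Ingress resource. The following is a minimal sketch that sends traffic for one host to the `web-app` Service on port 80. The host `app.example.com` and the file name `ingress.yaml` are placeholders, and `nginx` is assumed to be the ingress class registered by the chart's default values.

```shell
# Write an Ingress that routes app.example.com (placeholder host) to web-app.
cat > ingress.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-app
  namespace: production
spec:
  ingressClassName: nginx       # assumed default class of the ingress-nginx chart
  rules:
    - host: app.example.com     # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-app
                port:
                  number: 80
EOF
```

Apply it with `kubectl apply -f ingress.yaml`. When traffic enters through an Ingress, the backing Service usually does not need its own public load balancer and can be of type ClusterIP.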
Identity and Security
# Enable Workload Identity (recommended over pod-managed identity)
az aks update \
  --resource-group rg-aks-prod \
  --name aks-webapp-prod \
  --enable-oidc-issuer \
  --enable-workload-identity
# Enable Azure Policy add-on for governance
az aks enable-addons \
  --resource-group rg-aks-prod \
  --name aks-webapp-prod \
  --addons azure-policy
- Use Workload Identity for pods to authenticate to Azure services (Key Vault, Storage, SQL)
- Store secrets in Azure Key Vault via CSI driver — not in Kubernetes Secrets
- Enable Defender for Containers for vulnerability scanning and runtime protection
- Apply Pod Security Standards (enforce the restricted profile on production namespaces; use baseline as the minimum elsewhere)
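Two of the items above, Workload Identity and Pod Security Standards, come down to a few labels and annotations. The sketch below assumes a user-assigned managed identity already exists. The service account name `web-app-sa`, the file name `security.yaml`, and the client ID placeholder are hypothetical.

```shell
# Write a namespace with Pod Security labels and a Workload Identity service account.
cat > security.yaml <<'EOF'
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted   # reject non-compliant pods
    pod-security.kubernetes.io/warn: restricted      # also warn on apply
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: web-app-sa               # hypothetical name
  namespace: production
  annotations:
    # Replace with the client ID of your user-assigned managed identity
    azure.workload.identity/client-id: "<managed-identity-client-id>"
EOF
```

Apply it with `kubectl apply -f security.yaml`. Pods that should receive a token also need `serviceAccountName: web-app-sa` and the label `azure.workload.identity/use: "true"`, and the managed identity needs a federated credential for `system:serviceaccount:production:web-app-sa`. Under the restricted profile, pods without a compliant `securityContext` (non-root, no privilege escalation, all capabilities dropped, a seccomp profile set) are rejected, so the Deployment shown earlier would need those settings added first.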
| Component | Configuration |
|---|---|
| Cluster | 3 system nodes + 2 user node pools (web, batch) |
| Ingress | Application Gateway with WAF |
| Registry | ACR Premium with geo-replication |
| Monitoring | Container Insights + Prometheus + Grafana |
| CI/CD | Azure DevOps pipeline → helm upgrade |
| DR | Secondary cluster in paired region; ArgoCD sync |
AKS vs Alternatives
| Feature | AKS | Container Apps | App Service (containers) |
|---|---|---|---|
| Control | Full Kubernetes | Abstracted K8s | PaaS containers |
| Complexity | High | Medium | Low |
| Portability | Multi-cloud K8s | Azure-specific | Azure-specific |
| Scaling | HPA, KEDA, CA | KEDA built-in | App Service auto-scale |
| Best for | Complex orchestration | Microservices, jobs | Simple containerized web apps |
Common Mistakes
- Running apps on the system node pool: resource contention breaks cluster addons
- No resource requests/limits: noisy-neighbor pods starve others
- Using the latest tag in production: use immutable tags (v1.2.0) for reproducibility
- Missing Pod Disruption Budgets: node drains cause downtime during upgrades
- Skipping cluster upgrades: unsupported Kubernetes versions lose security patches
- Over-provisioned node pools: right-size with the cluster autoscaler and metrics
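A Pod Disruption Budget for the `web-app` Deployment might look like the sketch below. It keeps at least one replica running during voluntary disruptions such as node drains; the file name `pdb.yaml` is an assumption, and a higher `minAvailable` may suit workloads that cannot serve traffic from a single pod.

```shell
# Write a PodDisruptionBudget that keeps at least one web-app pod available.
cat > pdb.yaml <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app
  namespace: production
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: web-app   # must match the Deployment's pod labels
EOF
```

Apply it with `kubectl apply -f pdb.yaml`, then check `kubectl get pdb -n production` to confirm that disruptions are allowed only while spare replicas exist.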
Troubleshooting
| Issue | Diagnosis | Fix |
|---|---|---|
| Pods stuck Pending | Insufficient node resources or IP exhaustion | Check kubectl describe pod; scale nodes or switch to CNI Overlay |
| ImagePullBackOff | ACR auth failure or wrong tag | Verify ACR attachment; check image name and tag |
| CrashLoopBackOff | App crash on startup | kubectl logs <pod> --previous; fix startup command |
| Service unreachable | Wrong selector or port | Verify labels match; check targetPort |
| Node NotReady | Disk pressure or kubelet issue | kubectl describe node; drain and replace node |
# Debug pod issues
kubectl describe pod <pod-name> -n production
kubectl logs <pod-name> -n production --tail=100
kubectl get events -n production --sort-by='.lastTimestamp'
# Check cluster health
az aks show --resource-group rg-aks-prod --name aks-webapp-prod --query powerState
Best Practices
- Separate system and user node pools with appropriate VM SKUs
- Enable Azure Monitor Container Insights from day one
- Use Workload Identity for Azure service authentication
- Apply Pod Disruption Budgets (minAvailable: 1) for critical deployments
- Store images in ACR with vulnerability scanning enabled
- Run cluster upgrades quarterly — test in staging first
- Use Helm or GitOps (Flux/ArgoCD) for reproducible deployments
- Implement network policies to restrict east-west traffic
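As an illustration of the last item, the sketch below restricts inbound traffic to `web-app` pods so that only the NGINX ingress controller can reach them on port 8080. It assumes the controller runs in the `ingress-nginx` namespace and that the cluster has a network policy engine enabled, as with `--network-policy azure` in the cluster creation step. The policy name and the file name `networkpolicy.yaml` are hypothetical.

```shell
# Write a NetworkPolicy allowing ingress to web-app only from the ingress-nginx namespace.
cat > networkpolicy.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: web-app-allow-ingress-nginx   # hypothetical name
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: web-app
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              # Label set automatically on every namespace by Kubernetes
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - protocol: TCP
          port: 8080
EOF
```

Apply it with `kubectl apply -f networkpolicy.yaml`. Once a policy selects a pod, all other inbound traffic to that pod is dropped, including traffic arriving directly through a LoadBalancer Service, so test in staging before rolling it out.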
Next: Azure Well-Architected Framework.