Azure Kubernetes Service (AKS) is a managed Kubernetes offering. Microsoft manages the control plane (API server, etcd, scheduler); you manage node pools, workloads, and networking. AKS is the standard choice for container orchestration on Azure when you need Kubernetes portability and ecosystem tooling.

AKS Architecture

Control Plane (managed by Azure; no charge on the Free tier)
  ├── API Server
  ├── etcd (encrypted at rest)
  └── Scheduler + Controller Manager

Your Node Pools (you manage)
  ├── System node pool (critical addons: CoreDNS, metrics-server)
  └── User node pools (application workloads)
  

Never run application pods on the system node pool — dedicate it to critical cluster services.
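
One way to keep application pods off the system pool is a node selector on the workload. The fragment below is a minimal sketch of a patch file (the file name user-pool-patch.yaml is hypothetical) for a Deployment such as the web-app example later in this article. It assumes the user node pools carry the AKS-assigned label kubernetes.azure.com/mode=user, which is worth confirming with kubectl get nodes --show-labels. AKS also documents tainting the system pool with CriticalAddonsOnly=true:NoSchedule, which blocks ordinary pods from the other side.

```yaml
# user-pool-patch.yaml (hypothetical file name)
# Pins the pod template to nodes labelled as user-mode node pools.
# Assumption: nodes expose the label kubernetes.azure.com/mode=user.
spec:
  template:
    spec:
      nodeSelector:
        kubernetes.azure.com/mode: user
```

It could be applied with kubectl patch deployment web-app -n production --patch-file user-pool-patch.yaml, or the nodeSelector can be written directly into the Deployment manifest.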

Create a Cluster

# Create resource group
az group create --name rg-aks-prod --location eastus

# Create AKS cluster with monitoring
az aks create \
  --resource-group rg-aks-prod \
  --name aks-webapp-prod \
  --node-count 3 \
  --node-vm-size Standard_D4s_v5 \
  --enable-managed-identity \
  --enable-addons monitoring \
  --network-plugin azure \
  --network-policy azure \
  --generate-ssh-keys \
  --zones 1 2 3

# Attach Azure Container Registry
az aks update \
  --resource-group rg-aks-prod \
  --name aks-webapp-prod \
  --attach-acr acrwebappprod

# Get credentials
az aks get-credentials \
  --resource-group rg-aks-prod \
  --name aks-webapp-prod \
  --overwrite-existing

kubectl get nodes
  

Deploy an Application

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
      - name: web-app
        image: acrwebappprod.azurecr.io/web-app:v1.2.0
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: 250m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 15
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
  
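# The manifest targets the production namespace, so create it first
kubectl create namespace production
kubectl apply -f deployment.yaml
kubectl expose deployment web-app -n production --type=LoadBalancer --port=80 --target-port=8080
kubectl get svc web-app -n production -w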
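
The kubectl expose command above can also be written declaratively, which is easier to keep in version control. The manifest below is a sketch of the equivalent Service; the port name http is an illustrative choice, and everything else mirrors the Deployment's labels and ports.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-app
  namespace: production
spec:
  type: LoadBalancer        # Azure provisions a public load balancer IP
  selector:
    app: web-app            # must match the pod template labels
  ports:
  - name: http              # illustrative port name
    protocol: TCP
    port: 80                # port exposed by the load balancer
    targetPort: 8080        # containerPort of the web-app container
```

Use either this manifest or kubectl expose, not both, since they create a Service with the same name.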
  

Scaling

| Method | Command / Tool | Use Case |
|---|---|---|
| Manual | kubectl scale deployment web-app -n production --replicas=5 | Fixed capacity |
| HPA | Horizontal Pod Autoscaler on CPU/memory | Variable load |
| Cluster Autoscaler | Scale node pools automatically | Node capacity |
| KEDA | Event-driven scaling | Queue depth, Prometheus metrics |

# Horizontal Pod Autoscaler (the Deployment lives in the production namespace)
kubectl autoscale deployment web-app \
  --namespace production \
  --cpu-percent=70 \
  --min=2 \
  --max=20

# Enable cluster autoscaler on node pool
az aks nodepool update \
  --resource-group rg-aks-prod \
  --cluster-name aks-webapp-prod \
  --name nodepool1 \
  --enable-cluster-autoscaler \
  --min-count 2 \
  --max-count 10
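
The kubectl autoscale command above has a declarative counterpart. The following is a minimal sketch using the autoscaling/v2 API with the same bounds and CPU target. Utilization is measured against the container's CPU request (250m in the Deployment above), so the target only makes sense when requests are set.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # percent of the requested CPU
```

When an HPA manages a Deployment, consider leaving replicas out of the Deployment manifest so that re-applying it does not reset the replica count.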
  

Networking Options

| Plugin | IP Management | Use Case |
|---|---|---|
| Kubenet | Basic overlay, fewer IPs | Simple, IP-constrained VNets |
| Azure CNI | Pod gets VNet IP | Network policies, direct VNet integration |
| Azure CNI Overlay | Overlay on VNet | Large clusters without IP exhaustion |

Use Ingress with Application Gateway (AGIC) or NGINX for HTTP routing. Network Policies restrict pod-to-pod traffic — essential for multi-tenant clusters.
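
As an illustration of restricting pod-to-pod traffic, the sketch below denies all ingress in the production namespace and then allows only the ingress controller to reach web-app on port 8080. It assumes the controller runs in a namespace named ingress-nginx (as in the Helm install that follows) and that the cluster was created with a network policy engine enabled. The policy names are hypothetical.

```yaml
# Deny all inbound traffic to every pod in the namespace by default.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress        # hypothetical name
  namespace: production
spec:
  podSelector: {}                   # empty selector = all pods
  policyTypes:
  - Ingress
---
# Allow only the ingress controller namespace to reach web-app.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-to-web-app    # hypothetical name
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: web-app
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          # label that Kubernetes sets automatically on each namespace
          kubernetes.io/metadata.name: ingress-nginx
    ports:
    - protocol: TCP
      port: 8080
```

With these policies in place, traffic that reaches the pods by another path, such as a LoadBalancer Service pointing at them directly, would be dropped unless a further rule allows it.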

# Install NGINX Ingress Controller
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx --create-namespace
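
Once the controller is running, HTTP routing is declared with an Ingress resource. The sketch below routes a hypothetical hostname (app.example.com) to the web-app Service on port 80. It assumes the chart's default ingress class name, nginx, and that a Service named web-app exists in the production namespace.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-app
  namespace: production
spec:
  ingressClassName: nginx           # assumption: chart default class name
  rules:
  - host: app.example.com           # hypothetical hostname
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web-app
            port:
              number: 80            # Service port, not the containerPort
```

When traffic enters through an Ingress, the backing Service is usually of type ClusterIP, so that only the controller holds a public IP.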
  

Identity and Security

# Enable Workload Identity (recommended over pod-managed identity)
az aks update \
  --resource-group rg-aks-prod \
  --name aks-webapp-prod \
  --enable-oidc-issuer \
  --enable-workload-identity

# Enable Azure Policy add-on for governance
az aks enable-addons \
  --resource-group rg-aks-prod \
  --name aks-webapp-prod \
  --addons azure-policy
  
  • Use Workload Identity for pods to authenticate to Azure services (Key Vault, Storage, SQL)
  • Store secrets in Azure Key Vault via CSI driver — not in Kubernetes Secrets
  • Enable Defender for Containers for vulnerability scanning and runtime protection
  • Apply Pod Security Standards (enforce the restricted profile, or at least baseline, on production namespaces)
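
Two of the points above come down to a few lines of metadata. The sketch below labels the production namespace so that Pod Security Admission enforces the restricted profile, and declares a service account annotated for Workload Identity. The service account name web-app-sa and the client ID placeholder are hypothetical. The identity also needs a federated credential on the Azure side that trusts this service account, which is not shown here.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    # Reject pods that violate the restricted Pod Security Standard
    pod-security.kubernetes.io/enforce: restricted
    # Also surface violations as warnings at apply time
    pod-security.kubernetes.io/warn: restricted
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: web-app-sa                  # hypothetical name
  namespace: production
  annotations:
    # Placeholder: client ID of the user-assigned managed identity
    azure.workload.identity/client-id: "<managed-identity-client-id>"
```

Pods opt in by setting serviceAccountName: web-app-sa and the pod label azure.workload.identity/use: "true". Under the restricted profile, pods from the Deployment shown earlier would be rejected until the container gets a securityContext that sets runAsNonRoot, disables privilege escalation, drops all capabilities, and uses the RuntimeDefault seccomp profile.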

Real-World Scenario: Microservices Platform

| Component | Configuration |
|---|---|
| Cluster | 3 system nodes + 2 user node pools (web, batch) |
| Ingress | Application Gateway with WAF |
| Registry | ACR Premium with geo-replication |
| Monitoring | Container Insights + Prometheus + Grafana |
| CI/CD | Azure DevOps pipeline → helm upgrade |
| DR | Secondary cluster in paired region; ArgoCD sync |
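
For the DR row, an Argo CD Application is one way the secondary cluster might be kept in sync with the same Helm chart. This is only a sketch: the repository URL, chart path, application name, and cluster address are hypothetical placeholders, and it assumes Argo CD is installed in a namespace named argocd with the secondary cluster already registered.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web-app-dr                  # hypothetical name
  namespace: argocd                 # assumption: Argo CD's namespace
spec:
  project: default
  source:
    repoURL: https://example.com/org/platform-config.git   # hypothetical repo
    targetRevision: main
    path: charts/web-app            # hypothetical chart path
  destination:
    server: https://<secondary-cluster-api-server>         # placeholder
    namespace: production
  syncPolicy:
    automated:
      prune: true                   # remove resources deleted from Git
      selfHeal: true                # revert manual drift in the cluster
    syncOptions:
    - CreateNamespace=true
```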

AKS vs Alternatives

| Feature | AKS | Container Apps | App Service (containers) |
|---|---|---|---|
| Control | Full Kubernetes | Abstracted K8s | PaaS containers |
| Complexity | High | Medium | Low |
| Portability | Multi-cloud K8s | Azure-specific | Azure-specific |
| Scaling | HPA, KEDA, Cluster Autoscaler | KEDA built-in | App Service auto-scale |
| Best for | Complex orchestration | Microservices, jobs | Simple containerized web apps |

Common Mistakes

  1. Running apps on system node pool — resource contention breaks cluster addons
  2. No resource requests/limits — noisy neighbor pods starve others
  3. Latest tag in production — use immutable tags (v1.2.0) for reproducibility
  4. Missing Pod Disruption Budgets — node drains cause downtime during upgrades
  5. Skipping cluster upgrades — unsupported Kubernetes versions lose security patches
  6. Over-provisioned node pools — right-size with cluster autoscaler and metrics
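
For mistake 2, a LimitRange gives a namespace-wide safety net: containers that omit requests or limits receive defaults instead of running unbounded. The sketch below reuses the values from the web-app Deployment as defaults. The object name is hypothetical, and the numbers are an assumption to tune against observed usage.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults          # hypothetical name
  namespace: production
spec:
  limits:
  - type: Container
    defaultRequest:                 # applied when a container sets no requests
      cpu: 250m
      memory: 256Mi
    default:                        # applied when a container sets no limits
      cpu: 500m
      memory: 512Mi
```

Defaults apply only to pods created after the LimitRange exists, and explicit values in a pod spec still take precedence.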

Troubleshooting

| Issue | Diagnosis | Fix |
|---|---|---|
| Pods stuck Pending | Insufficient node resources or IP exhaustion | Check kubectl describe pod; scale nodes or switch to CNI Overlay |
| ImagePullBackOff | ACR auth failure or wrong tag | Verify ACR attachment; check image name and tag |
| CrashLoopBackOff | App crash on startup | kubectl logs <pod> --previous; fix startup command |
| Service unreachable | Wrong selector or port | Verify labels match; check targetPort |
| Node NotReady | Disk pressure or kubelet issue | kubectl describe node; drain and replace node |

# Debug pod issues
kubectl describe pod <pod-name> -n production
kubectl logs <pod-name> -n production --tail=100
kubectl get events -n production --sort-by='.lastTimestamp'

# Check cluster health
az aks show --resource-group rg-aks-prod --name aks-webapp-prod --query powerState
  

Best Practices

  • Separate system and user node pools with appropriate VM SKUs
  • Enable Azure Monitor Container Insights from day one
  • Use Workload Identity for Azure service authentication
  • Apply Pod Disruption Budgets (minAvailable: 1) for critical deployments
  • Store images in ACR with vulnerability scanning enabled
  • Run cluster upgrades quarterly — test in staging first
  • Use Helm or GitOps (Flux/ArgoCD) for reproducible deployments
  • Implement network policies to restrict east-west traffic
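
The Pod Disruption Budget recommendation can be sketched as follows for the web-app Deployment, using the minAvailable: 1 value mentioned above. The selector is taken from the Deployment's labels.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app
  namespace: production
spec:
  minAvailable: 1                   # at least one pod must survive a drain
  selector:
    matchLabels:
      app: web-app
```

With three replicas, minAvailable: 1 still lets two pods be evicted at the same time. Where capacity matters during upgrades, a higher value or maxUnavailable: 1 may be a better fit.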
