Google Kubernetes Engine (GKE) is Google's managed Kubernetes service. Google operates the control plane; in Standard mode you manage node pools and workloads. Kubernetes originated at Google, and GKE is Google's own managed distribution of it. For containerized workloads at scale, GKE offers the deepest GCP integration for identity, networking, and observability.

GKE Modes

Mode            Control Plane        Node Management        Use Case
Standard        Google-managed       You manage node pools  Full control, custom nodes, GPU
Autopilot       Google-managed       Fully managed          Hands-off, pay per pod resources
GKE Enterprise  Multi-cluster fleet  Advanced features      Large-scale, multi-cluster ops

Standard vs. Autopilot

Criteria         Standard                       Autopilot
Node management  You configure pools            Google manages everything
Pricing          Per node (VM cost)             Per pod CPU/memory request
Customization    Full (taints, GPU, local SSD)  Limited to supported configs
Security         Your responsibility            Hardened by default
Best for         GPU, specialized hardware      Most stateless workloads

Create a Standard Cluster

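# Create a zonal Standard cluster with autoscaling, VPC-native networking,
# Workload Identity, and Shielded GKE Nodes
gcloud container clusters create learning-cluster \
  --zone=us-central1-a \
  --num-nodes=2 \
  --machine-type=e2-medium \
  --enable-autoscaling \
  --min-nodes=1 \
  --max-nodes=5 \
  --enable-ip-alias \
  --workload-pool=learning-gcp-dev.svc.id.goog \
  --enable-shielded-nodes \
  --release-channel=regular

# Fetch credentials so kubectl targets the new cluster
gcloud container clusters get-credentials learning-cluster \
  --zone=us-central1-a

# Confirm the nodes registered and are Ready
kubectl get nodes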
  

Regional Cluster (Production)

# Regional cluster on the Stable channel with network policy enforcement
gcloud container clusters create prod-cluster \
  --region=us-central1 \
  --num-nodes=1 \
  --machine-type=e2-standard-4 \
  --enable-autoscaling \
  --min-nodes=1 \
  --max-nodes=10 \
  --enable-ip-alias \
  --workload-pool=learning-gcp-dev.svc.id.goog \
  --release-channel=stable \
  --enable-network-policy
  

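Regional clusters replicate the control plane and, by default, spread nodes across three zones for zone-level fault tolerance. Node counts are per zone, so --num-nodes=1 starts this cluster with three nodes, and --max-nodes=10 lets autoscaling grow it to 30.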

Deploy a Workload

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      serviceAccountName: web-app-sa
      containers:
      - name: web-app
        image: us-central1-docker.pkg.dev/learning-gcp-dev/app/web:v1
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: 250m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 15

kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
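
The apply step above references service.yaml, which this section does not show. A minimal sketch of what it might contain, assuming a ClusterIP Service in front of the web-app pods; the Service name, type, and port 80 are illustrative assumptions rather than values from the original:

```yaml
# service.yaml (hypothetical contents; the original file is not shown)
apiVersion: v1
kind: Service
metadata:
  name: web-app          # assumed name
spec:
  type: ClusterIP        # assumed; LoadBalancer would expose it directly
  selector:
    app: web-app         # matches the pod labels in the Deployment
  ports:
  - name: http
    port: 80             # assumed service port
    targetPort: 8080     # containerPort from the Deployment
```

A ClusterIP Service keeps the workload internal, leaving external exposure to an Ingress or Gateway.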
  

Workload Identity

Workload Identity lets a Kubernetes service account act as a GCP service account, so pods need no exported key files:

# Create GCP service account
gcloud iam service-accounts create web-app-gcp

# Grant GCP permissions
gcloud projects add-iam-policy-binding learning-gcp-dev \
  --member="serviceAccount:web-app-gcp@learning-gcp-dev.iam.gserviceaccount.com" \
  --role="roles/cloudsql.client"

# Bind K8s SA to GCP SA
gcloud iam service-accounts add-iam-policy-binding \
  web-app-gcp@learning-gcp-dev.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:learning-gcp-dev.svc.id.goog[default/web-app-sa]"

# Create the K8s SA if it does not exist yet, then annotate it
kubectl create serviceaccount web-app-sa --namespace default
kubectl annotate serviceaccount web-app-sa --namespace default \
  iam.gke.io/gcp-service-account=web-app-gcp@learning-gcp-dev.iam.gserviceaccount.com
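
The annotation can also be kept in version control as a manifest instead of being applied imperatively. A sketch of the equivalent ServiceAccount, assuming it lives in the default namespace (the namespace named in the IAM binding above):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: web-app-sa
  namespace: default     # must match the namespace in the workloadIdentityUser binding
  annotations:
    # Email of the GCP service account this Kubernetes SA is allowed to act as
    iam.gke.io/gcp-service-account: web-app-gcp@learning-gcp-dev.iam.gserviceaccount.com
```

Pods pick up the identity only when their spec sets serviceAccountName: web-app-sa, as the Deployment in this section does.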
  

Autoscaling

Scaler                           Scales         Trigger
HPA (Horizontal Pod Autoscaler)  Pod replicas   CPU, memory, custom metrics
VPA (Vertical Pod Autoscaler)    Pod resources  Historical usage
Cluster Autoscaler               Nodes          Pending pods cannot schedule
KEDA                             Pod replicas   External events (Pub/Sub, etc.)

kubectl autoscale deployment web-app \
  --cpu-percent=70 --min=2 --max=10
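
The same autoscaling policy can be written declaratively. A sketch of an autoscaling/v2 HorizontalPodAutoscaler that should behave like the kubectl autoscale command above; the object name web-app is an assumption:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app              # assumed name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # percent of the pods' CPU request
```

Utilization is measured against the CPU request, so with the 250m request in the Deployment the target is roughly 175m of average usage per pod.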
  

Networking

Component           Purpose
GKE Ingress         HTTP(S) routing with Cloud Load Balancing
Gateway API         Next-gen ingress (recommended for new deployments)
Network Policies    Pod-to-pod firewall rules
Private clusters    Nodes have only private IPs
Cloud Service Mesh  mTLS, traffic management (GKE Enterprise)

# Private cluster (nodes not reachable from internet)
gcloud container clusters create private-cluster \
  --region=us-central1 \
  --enable-private-nodes \
  --master-ipv4-cidr=172.16.0.0/28 \
  --enable-ip-alias
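
For the Gateway API row in the table above, a minimal sketch of an external HTTP Gateway and a route to the web-app workload. It assumes the Gateway API is enabled on the cluster, that the cluster version serves the v1 API (older versions may need v1beta1), and that a Service named web-app listens on port 80; the resource names are illustrative:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: external-http                                # assumed name
spec:
  gatewayClassName: gke-l7-global-external-managed   # GKE global external Application Load Balancer
  listeners:
  - name: http
    protocol: HTTP
    port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: web-app                                      # assumed name
spec:
  parentRefs:
  - name: external-http                              # attach the route to the Gateway above
  rules:
  - backendRefs:
    - name: web-app                                  # assumed Service name
      port: 80                                       # assumed Service port
```

Splitting the Gateway from its routes lets a platform team own the load balancer while application teams own their own HTTPRoute objects.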
  

Real-World Scenario: Production Microservices

A fintech platform runs 15 microservices on GKE:

  1. Regional cluster across us-central1 (3 zones)
  2. Separate node pools: general (e2-standard-4), memory (n2-highmem-4), gpu (a2-highgpu-1g)
  3. Workload Identity for Cloud SQL, Secret Manager, Pub/Sub access
  4. Gateway API for ingress with Cloud Armor WAF
  5. HPA on all deployments; Cluster Autoscaler for node pools
  6. Backup for GKE for workload configuration and persistent volume data
  7. Managed Prometheus for metrics; Cloud Trace for distributed tracing
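
Dedicated node pools only help if workloads are steered onto them. A sketch of how one service in such a setup might be pinned to the memory pool using the node pool label that GKE sets on every node; the service name ledger-cache, its image path, and its resource requests are hypothetical:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ledger-cache                # hypothetical service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ledger-cache
  template:
    metadata:
      labels:
        app: ledger-cache
    spec:
      nodeSelector:
        cloud.google.com/gke-nodepool: memory   # schedule only onto the "memory" pool
      containers:
      - name: ledger-cache
        # hypothetical image path
        image: us-central1-docker.pkg.dev/learning-gcp-dev/app/ledger-cache:v1
        resources:
          requests:
            cpu: "1"
            memory: 8Gi             # illustrative; sized to fit an n2-highmem-4 node
```

Tainting the specialized pools and adding matching tolerations keeps general workloads off the more expensive nodes.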

Common Mistakes

Mistake                      Impact                       Fix
Zonal cluster in production  Zone outage = full downtime  Regional cluster
No resource requests/limits  Noisy neighbor, OOM kills    Set requests and limits on all pods
latest image tag             Unpredictable deployments    Pin image tags or digests
SA keys mounted in pods      Credential leakage           Workload Identity
No network policies          Any pod talks to any pod     Implement default-deny policies
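
The default-deny fix in the last row can be sketched as two policies: one that blocks all ingress in a namespace, and one that reopens only what web-app needs. This assumes network policy enforcement is enabled on the cluster and that the workload runs in the default namespace; the policy names are illustrative:

```yaml
# Deny all ingress to every pod in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress      # assumed name
  namespace: default
spec:
  podSelector: {}                 # empty selector matches all pods
  policyTypes:
  - Ingress
---
# Allow pods in the same namespace to reach web-app on port 8080
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-web-app-ingress     # assumed name
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: web-app
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector: {}             # any pod in this namespace
    ports:
    - protocol: TCP
      port: 8080
```

Policies are additive, so traffic is allowed if any policy selecting the pod permits it. Traffic arriving from a load balancer would need its own allow rule.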

Best Practices

  • Use Artifact Registry for container images with vulnerability scanning
  • Enable GKE release channels (Regular or Stable) for managed upgrades
  • Enforce Pod Security Standards with Pod Security Admission
  • Use Backup for GKE for workload configuration and persistent volume data
  • Monitor with Cloud Monitoring GKE dashboards and Managed Prometheus
  • Run node auto-repair and auto-upgrade for node health
  • Use Workload Identity instead of service account keys
  • Implement readiness and liveness probes on every deployment
  • Set PodDisruptionBudgets for critical services during node maintenance
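
As an illustration of the last practice, a minimal PodDisruptionBudget for the web-app Deployment shown earlier. The object name and the minAvailable value are assumptions chosen to fit its three replicas:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app            # assumed name
spec:
  minAvailable: 2          # with 3 replicas, at most one pod is evicted at a time
  selector:
    matchLabels:
      app: web-app
```

The budget applies to voluntary disruptions such as node drains during upgrades, not to crashes or node failures.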

Troubleshooting

Pods stuck in Pending:

kubectl describe pod POD_NAME  # Check events
kubectl get nodes              # Verify nodes are Ready
# Common causes: insufficient CPU/memory, taints, image pull errors
  

Image pull errors:

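kubectl describe pod POD_NAME | grep -A5 "Failed"
# Verify the node SA can read the repository (roles/artifactregistry.reader)
gcloud artifacts repositories list --location=us-central1
gcloud artifacts repositories get-iam-policy app --location=us-central1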
  

Workload Identity not working:

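kubectl describe sa web-app-sa | grep gcp-service-account
# Verify annotation matches GCP SA email exactly
# Verify the GCP SA grants roles/iam.workloadIdentityUser to the K8s SA
gcloud iam service-accounts get-iam-policy web-app-gcp@learning-gcp-dev.iam.gserviceaccount.com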
  

Node NotReady:

kubectl describe node NODE_NAME
gcloud compute instances describe NODE_NAME --zone=ZONE
# Check disk space and kubelet logs on the node over SSH: journalctl -u kubelet
  

GKE combines Kubernetes portability with deep GCP integration for identity, networking, and observability.
