Production Architecture Principles

Running Redis in production requires more than correct commands — it demands deliberate architecture for isolation, failure handling, security, and operability.

Core principles:

  1. Assume failure — nodes crash, networks partition, memory fills
  2. Isolate workloads — cache, sessions, and queues on separate instances
  3. Define SLOs — latency, availability, acceptable data loss (RPO)
  4. Automate operations — failover, backups, scaling, alerting
  5. Document runbooks — incidents happen at 3 AM

Multi-Instance Topology

┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│  Cache Cluster  │  │ Session Sentinel│  │  Queue (Streams)│
│  (eviction OK)  │  │ (no eviction)   │  │  (durability)   │
│  3M + 3R        │  │  1M + 2R + Sent │  │  1M + 2R + AOF  │
└─────────────────┘  └─────────────────┘  └─────────────────┘
         ▲                    ▲                      ▲
         └────────────────────┴──────────────────────┘
                           Application
  

In the diagram, M = master, R = replica, and Sent = Sentinel. Benefits: independent scaling, blast radius containment, and persistence and eviction settings tailored to each workload.
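
One way to make the per-workload tailoring concrete is to keep each instance's eviction and persistence settings side by side. The sketch below is illustrative only: the `WORKLOAD_SETTINGS` name, the `render_conf` helper, and the choice of `allkeys-lru` for the cache are assumptions, not settings prescribed by the topology above.

```python
# Hypothetical per-workload redis.conf fragments (values are assumptions).
WORKLOAD_SETTINGS = {
    # Cache: eviction is acceptable, no persistence needed.
    "cache": {"maxmemory-policy": "allkeys-lru", "appendonly": "no"},
    # Sessions: never evict, tolerate about one second of loss.
    "sessions": {
        "maxmemory-policy": "noeviction",
        "appendonly": "yes",
        "appendfsync": "everysec",
    },
    # Queue: never evict, fsync every write for durability.
    "queue": {
        "maxmemory-policy": "noeviction",
        "appendonly": "yes",
        "appendfsync": "always",
    },
}


def render_conf(workload):
    """Render the settings for one workload as redis.conf lines."""
    settings = WORKLOAD_SETTINGS[workload]
    return "\n".join(f"{name} {value}" for name, value in settings.items())


print(render_conf("queue"))
```

Keeping the three profiles in one version-controlled place makes it harder for a cache-style eviction policy to end up on the session or queue instance by accident.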

Rate Limiting Patterns

Fixed Window

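# Key = rate:<client IP>:<minute bucket, YYYYMMDDHHMM>
MULTI
INCR rate:192.168.1.1:202406131405
EXPIRE rate:192.168.1.1:202406131405 60
EXEC
# EXEC returns the new count; reject the request if it exceeds 100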
  

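Simple, but it allows bursts at window boundaries: a client can send 100 requests at the end of one minute and 100 more at the start of the next, so 200 requests pass within a few seconds.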
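
The boundary burst is easy to reproduce without Redis. The following is a small in-memory simulation, assuming a limit of 100 per 60-second window; the `fixed_window_allowed` helper and the `Counter` standing in for Redis `INCR` are hypothetical names used only for illustration.

```python
from collections import Counter


def fixed_window_allowed(counts, client, ts, limit=100, window=60):
    """Count one request at time ts and return True if it is within the limit.

    counts is an in-memory stand-in for the per-window Redis counter.
    """
    bucket = (client, int(ts // window))  # same role as the timestamp in the key
    counts[bucket] += 1
    return counts[bucket] <= limit


counts = Counter()
client = "192.168.1.1"

# 100 requests in the last second of window 0 ...
late = sum(fixed_window_allowed(counts, client, 59.0 + i / 100) for i in range(100))
# ... and 100 more in the first second of window 1.
early = sum(fixed_window_allowed(counts, client, 60.0 + i / 100) for i in range(100))

# All 200 requests pass even though they arrive within about two seconds.
print(late + early)
```

Each window stays within its own limit, which is why the counter never notices the combined burst.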

Sliding Window with Sorted Set

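import time
import uuid

def is_rate_limited(r, key, limit, window_sec):
    now = time.time()
    pipe = r.pipeline()  # redis-py pipelines run as MULTI/EXEC by default
    # Drop entries that have fallen out of the window
    pipe.zremrangebyscore(key, 0, now - window_sec)
    # Unique member, so two requests with the same timestamp are both counted
    pipe.zadd(key, {f"{now}:{uuid.uuid4().hex}": now})
    pipe.zcard(key)
    # Let idle keys expire on their own
    pipe.expire(key, window_sec)
    _, _, count, _ = pipe.execute()
    return count > limit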
  

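This gives an accurate sliding window. It stores one sorted-set entry per request, so it uses more memory per key than a counter, but it limits more fairly. Note that rejected requests are recorded too, so they count against the limit.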

Token Bucket via Lua

Atomic check-and-decrement prevents race conditions in distributed rate limiters.
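
A minimal sketch of such a script follows, assuming a redis-py style client that exposes `eval`. The hash fields `tokens` and `ts`, the helper names `refill` and `allow_request`, the default capacity and rate, and the choice to pass the clock in from the client are all assumptions made for illustration.

```python
import time

# Runs atomically inside Redis: refill, then try to take one token.
TOKEN_BUCKET_LUA = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])   -- tokens added per second, must be > 0
local now = tonumber(ARGV[3])
local bucket = redis.call('HMGET', key, 'tokens', 'ts')
local tokens = tonumber(bucket[1]) or capacity   -- new bucket starts full
local ts = tonumber(bucket[2]) or now
tokens = math.min(capacity, tokens + math.max(0, now - ts) * rate)
local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end
redis.call('HSET', key, 'tokens', tokens, 'ts', now)
redis.call('EXPIRE', key, math.ceil(capacity / rate) * 2)
return allowed
"""


def refill(tokens, last_ts, now, rate, capacity):
    """Pure-Python mirror of the script's refill step, for reasoning and tests."""
    return min(capacity, tokens + max(0.0, now - last_ts) * rate)


def allow_request(r, key, capacity=10, rate=1.0):
    """Return True if a token was taken. r is assumed to be a redis-py client."""
    return r.eval(TOKEN_BUCKET_LUA, 1, key, capacity, rate, time.time()) == 1
```

Because the refill and the decrement happen in one script, two application instances cannot both spend the last token. Passing the timestamp from the client keeps the script simple but relies on reasonably synchronized application clocks.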

Distributed Locks

Use SET NX EX for simple locks:

import uuid

def acquire_lock(r, lock_name, ttl=10):
    token = str(uuid.uuid4())
    if r.set(f"lock:{lock_name}", token, nx=True, ex=ttl):
        return token
    return None

def release_lock(r, lock_name, token):
    # Lua script — only release if token matches (avoid deleting others' locks)
    script = """
    if redis.call("get", KEYS[1]) == ARGV[1] then
        return redis.call("del", KEYS[1])
    else
        return 0
    end
    """
    return r.eval(script, 1, f"lock:{lock_name}", token)
  

For higher safety, Redlock acquires the lock on a majority of several independent Redis instances. Evaluate whether your consistency requirements justify the added complexity.

Lock guidelines:

  • Always set TTL — prevent deadlocks from crashed holders
  • Use unique tokens — only holder can release
  • Keep lock duration short, since every lock serializes work and limits throughput
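
These guidelines can be bundled into a small context manager so that release is never forgotten. This is a sketch under assumptions: `redis_lock` and `RELEASE_LUA` are hypothetical names, and `r` is assumed to be a redis-py style client with `set` and `eval`.

```python
import uuid
from contextlib import contextmanager

# Delete the lock only if it still holds our token.
RELEASE_LUA = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
end
return 0
"""


@contextmanager
def redis_lock(r, lock_name, ttl=10):
    """Try to take the lock once; yield True if acquired, False otherwise."""
    key = f"lock:{lock_name}"
    token = str(uuid.uuid4())  # unique token: only the holder can release
    acquired = bool(r.set(key, token, nx=True, ex=ttl))  # TTL guards against crashes
    try:
        yield acquired
    finally:
        if acquired:
            r.eval(RELEASE_LUA, 1, key, token)
```

A caller would write `with redis_lock(r, "report") as got:` and skip or retry the work when `got` is False. The work inside the block should finish well within `ttl`, because the lock expires on its own after that.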

Idempotency Keys

Prevent duplicate processing of retried requests:

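import json

class RetryLater(Exception):
    """Another request with the same key is still in flight."""

def process_payment(r, idempotency_key, payment_fn):
    key = f"idem:{idempotency_key}"
    # SET NX claims the key atomically, so only the first request runs the payment
    if r.set(key, "processing", nx=True, ex=86400):
        try:
            result = payment_fn()
        except Exception:
            # Assumes payment_fn fails cleanly; release the claim so a retry can run
            r.delete(key)
            raise
        r.set(key, json.dumps(result), ex=86400)
        return result
    status = r.get(key)
    if isinstance(status, bytes):  # clients without decode_responses=True return bytes
        status = status.decode()
    if status is None or status == "processing":
        raise RetryLater()
    return json.loads(status)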
  

Circuit Breaker with Redis

Track failure counts across app instances:

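class CircuitOpenError(Exception):
    """Raised while the failure threshold is exceeded."""

def call_external_service(r, service_name, fn, threshold=5, cooldown=60):
    fail_key = f"circuit:{service_name}:failures"
    failures = int(r.get(fail_key) or 0)

    # Open circuit: fail fast until the counter expires after `cooldown` seconds
    if failures >= threshold:
        raise CircuitOpenError(f"{service_name} circuit open")

    try:
        result = fn()
    except Exception:
        # Increment and set the TTL in one transaction, so the counter
        # can never be left behind without an expiry
        pipe = r.pipeline()
        pipe.incr(fail_key)
        pipe.expire(fail_key, cooldown)
        pipe.execute()
        raise
    r.delete(fail_key)  # a success resets the failure count
    return result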
  

Graceful Degradation

When Redis is unavailable, applications should degrade — not crash:

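import json
import logging
import redis

logger = logging.getLogger(__name__)

def get_user_cached(r, user_id):
    # r: redis-py client; db: the application's database layer (not shown)
    try:
        cached = r.get(f"user:{user_id}")
        if cached:
            return json.loads(cached)
    except (redis.ConnectionError, redis.TimeoutError):
        logger.warning("Redis unavailable, falling back to DB")
    return db.get_user(user_id)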
  

Define behavior per workload:

  • Cache: treat the outage as a cache miss and fall back to the DB
  • Sessions: force re-login or local session fallback
  • Rate limiting: fail open (allow) or fail closed (deny) — document choice
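
For rate limiting, the fail-open versus fail-closed choice can be made explicit in code rather than left to whatever exception handling happens to exist. The wrapper below is a sketch: `check_rate_limit`, its `limiter` callable, and the `redis_errors` parameter are hypothetical names, and with redis-py the error tuple would typically be `(redis.ConnectionError, redis.TimeoutError)`.

```python
import logging

logger = logging.getLogger(__name__)


def check_rate_limit(limiter, key, fail_open=True, redis_errors=(Exception,)):
    """Return True if the request should be rejected.

    limiter: callable(key) -> bool, where True means "over the limit".
    fail_open: when Redis is unavailable, allow (True) or deny (False).
    redis_errors: exception types that are treated as "Redis unavailable".
    """
    try:
        return limiter(key)
    except redis_errors:
        mode = "open" if fail_open else "closed"
        logger.warning("rate limiter unavailable, failing %s", mode)
        # Fail open -> do not reject; fail closed -> reject.
        return not fail_open
```

A public read endpoint might pass `fail_open=True`, while a login or payment endpoint might prefer `fail_open=False`; either way the decision is visible at the call site.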

Security Checklist

# Bind to a specific private interface address (example value), never 0.0.0.0
bind 10.0.0.5
requirepass strong_password
rename-command FLUSHALL ""
rename-command FLUSHDB ""
rename-command CONFIG ""
rename-command DEBUG ""
  
  • Enable TLS for cross-network traffic
  • Use ACLs with least privilege per application
  • No public internet exposure
  • Rotate credentials quarterly
  • Audit CLIENT LIST for unexpected connections

Backup and Disaster Recovery

Workload          Backup Strategy                      RPO
Cache             None (rebuild from DB)               N/A
Sessions          Hourly RDB + AOF everysec            ~1 sec
Queues/Streams    AOF always + cross-region replica    ~0 sec

Automate backup verification — restore to test instance monthly.

Operational Runbooks

Redis Memory > 85%

  1. Check INFO memory (used_memory vs. maxmemory) and evicted_keys in INFO stats
  2. Run redis-cli --bigkeys to find the largest keys
  3. Identify keys without TTL
  4. Short-term: increase maxmemory or add node
  5. Long-term: key optimization (see Memory Optimization page)
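
Step 3 can be scripted. The sketch below assumes a redis-py style client with `scan_iter` and `ttl`; the `keys_without_ttl` name, the `sample_limit` cap, and the `count=500` batch hint are assumptions chosen for illustration.

```python
def keys_without_ttl(r, pattern="*", sample_limit=1000):
    """Return up to sample_limit keys matching pattern that have no expiry."""
    found = []
    # SCAN iterates incrementally, unlike KEYS, which blocks the server.
    for key in r.scan_iter(match=pattern, count=500):
        # TTL returns -1 when the key exists without an expiry,
        # and -2 when the key no longer exists.
        if r.ttl(key) == -1:
            found.append(key)
            if len(found) >= sample_limit:
                break
    return found
```

This issues one TTL call per key, so on a large keyspace it is probably best pointed at a replica and limited with `pattern` and `sample_limit`.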

Master Failover (Sentinel)

  1. Confirm +switch-master alert
  2. Verify new master: SENTINEL master mymaster
  3. Check application reconnection metrics
  4. Investigate old master — rejoin as replica
  5. Post-incident: document timeline and client impact

Latency Spike

  1. SLOWLOG GET 20
  2. LATENCY DOCTOR (needs latency-monitor-threshold set above 0)
  3. Check INFO persistence for BGSAVE/AOF rewrite
  4. Review recent deploys
  5. Check host CPU, network, and THP (transparent huge pages) settings

Monitoring and Alerting

Alert                   Threshold            Action
Memory usage            > 85% maxmemory      Capacity / optimize keys
Hit ratio               < 80% for 15 min     Cache design review
Replication lag         > 10s                Network / replica load
Rejected connections    > 0                  Pool sizing
Eviction rate           > 2× baseline        Memory pressure
Sentinel failover       Any                  Runbook execution

Export via redis_exporter to Prometheus/Grafana.
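
The hit-ratio alert needs a windowed value, whereas the `keyspace_hits` and `keyspace_misses` counters in INFO stats are cumulative. A minimal sketch of the calculation from two samples follows; the function names are hypothetical, and the inputs are assumed to be dicts such as those returned by redis-py's `r.info("stats")` and `r.info("memory")`.

```python
def windowed_hit_ratio(prev, curr):
    """Hit ratio between two INFO stats samples, or None if there was no traffic."""
    hits = curr["keyspace_hits"] - prev["keyspace_hits"]
    misses = curr["keyspace_misses"] - prev["keyspace_misses"]
    total = hits + misses
    return hits / total if total > 0 else None


def memory_usage_ratio(mem):
    """used_memory / maxmemory, or None when maxmemory is 0 (unlimited)."""
    maxmemory = int(mem.get("maxmemory", 0))
    return int(mem["used_memory"]) / maxmemory if maxmemory else None


# Two samples taken 15 minutes apart (made-up numbers).
before = {"keyspace_hits": 1000, "keyspace_misses": 500}
after = {"keyspace_hits": 1940, "keyspace_misses": 560}
ratio = windowed_hit_ratio(before, after)
```

In a Prometheus setup the same delta would normally be computed in the query layer; the Python version is only meant to show the arithmetic behind the threshold.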

Deployment Best Practices

  1. Rolling Redis upgrades: upgrade the replicas first, fail over to an upgraded replica, then upgrade the old master
  2. Config management: version-control redis.conf, Sentinel config
  3. Change windows: schedule resharding and AOF rewrites during low traffic
  4. Load test after topology changes
  5. Chaos engineering: kill nodes in staging quarterly

Common Production Mistakes

Mistake                            Impact
Single Redis for everything        Cascade failures across features
No fallback when Redis down        Full outage instead of degraded
FLUSHALL in CI pointing at prod    Catastrophic data loss (keep environments separate)
Locks without TTL                  Permanent deadlocks
Hardcoded topology                 Failover breaks clients

Production Scenario: Black Friday Readiness

An e-commerce team prepared Redis for 10× traffic:

Two weeks before:

  • Load test at 3× expected peak
  • Warm cache for top 10K product pages
  • Verify Sentinel failover under load
  • Set alerts: memory 80%, p99 latency 5ms, hit ratio 85%

One week before:

  • Increase Cluster from 6 to 8 nodes
  • Add TTL jitter to all cache keys
  • Deploy lock-based stampede prevention on hero product pages
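
TTL jitter spreads expirations out so that keys written together do not all expire in the same second. A minimal sketch, assuming a symmetric jitter of ±10% around the base TTL; the `jittered_ttl` helper and its parameters are hypothetical.

```python
import random


def jittered_ttl(base_ttl, jitter_fraction=0.1, rng=random):
    """Return base_ttl in seconds, randomly shifted by up to ±jitter_fraction."""
    spread = base_ttl * jitter_fraction
    # Never return less than 1 second, since EXPIRE needs a positive integer.
    return max(1, int(base_ttl + rng.uniform(-spread, spread)))


# Example: TTLs for product pages that nominally cache for 5 minutes.
rng = random.Random(0)  # seeded only to make the example repeatable
ttls = [jittered_ttl(300, rng=rng) for _ in range(100)]
```

With redis-py the value would be passed as the `ex` argument, for example `r.set(key, value, ex=jittered_ttl(300))`.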

Day of:

  • Grafana dashboard: ops/sec, memory, hit ratio, evictions
  • On-call runbook printed (memory, failover, latency)
  • Post-event: slowlog review, capacity report for next year

Result: p99 cache latency 1.8ms, hit ratio 94%, zero evictions, zero failover events.

Production Redis is an operational discipline — architecture patterns, runbooks, and tested failure handling matter as much as command knowledge.