Redis Production Patterns

Production Architecture Principles

Running Redis in production requires more than correct commands — it demands deliberate architecture for isolation, failure handling, security, and operability.

Core principles:

Assume failure — nodes crash, networks partition, memory fills
Isolate workloads — cache, sessions, and queues on separate instances
Define SLOs — latency, availability, acceptable data loss (RPO)
Automate operations — failover, backups, scaling, alerting
Document runbooks — incidents happen at 3 AM

Multi-Instance Topology

┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│  Cache Cluster  │  │ Session Sentinel│  │  Queue (Streams)│
│  (eviction OK)  │  │ (no eviction)   │  │  (durability)   │
│  3M + 3R        │  │  1M + 2R + Sent │  │  1M + 2R + AOF  │
└─────────────────┘  └─────────────────┘  └─────────────────┘
         ▲                    ▲                      ▲
         └────────────────────┴──────────────────────┘
                           Application

Benefits: independent scaling, blast radius containment, tailored persistence and eviction per workload.
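
One way to keep the per-workload differences explicit is a small table of profiles that renders the relevant redis.conf directives. This is only a sketch: the profile names, the WorkloadProfile type, and the chosen maxmemory-policy values are assumptions for illustration, not settings taken from the diagram.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadProfile:
    """Hypothetical per-workload settings; names and values are illustrative."""
    name: str
    maxmemory_policy: str            # redis.conf `maxmemory-policy`
    appendonly: bool                 # redis.conf `appendonly`
    appendfsync: str = "everysec"    # redis.conf `appendfsync`


PROFILES = {
    "cache": WorkloadProfile("cache", "allkeys-lru", appendonly=False),
    "sessions": WorkloadProfile("sessions", "noeviction", appendonly=True),
    "queues": WorkloadProfile("queues", "noeviction", appendonly=True,
                              appendfsync="always"),
}


def render_conf(profile: WorkloadProfile) -> str:
    """Render the redis.conf lines that differ between workloads."""
    lines = [
        f"maxmemory-policy {profile.maxmemory_policy}",
        f"appendonly {'yes' if profile.appendonly else 'no'}",
    ]
    if profile.appendonly:
        lines.append(f"appendfsync {profile.appendfsync}")
    return "\n".join(lines)
```

Keeping these settings in one place makes it harder for a cache-style eviction policy to end up on the session or queue instances by accident.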

Rate Limiting Patterns

Fixed Window

MULTI
INCR rate:192.168.1.1:202406131405
EXPIRE rate:192.168.1.1:202406131405 60
EXEC
# Key suffix is the current minute (YYYYMMDDHHMM); reject if the INCR result is > 100

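Simple, but it allows bursts at window boundaries: a client can send 100 requests in the last second of one minute and 100 more in the first second of the next, so 200 requests pass within about two seconds.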
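
The same fixed-window counter can be sketched in Python. This assumes `r` is a redis-py style client whose pipeline() wraps commands in MULTI/EXEC by default; the function name and the `now` parameter (there only to make the window deterministic in tests) are illustrative.

```python
import time


def fixed_window_allow(r, client_id, limit=100, window_sec=60, now=None):
    """Return True if the request fits in the current fixed window."""
    now = time.time() if now is None else now
    window = int(now // window_sec)      # identical for every request in one window
    key = f"rate:{client_id}:{window}"
    pipe = r.pipeline()                  # assumed transactional, as in redis-py
    pipe.incr(key)
    pipe.expire(key, window_sec)         # the counter cleans itself up
    count, _ = pipe.execute()
    return count <= limit
```

Deriving the key from the window index means no separate reset step is needed: a new window simply starts a new counter.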

Sliding Window with Sorted Set

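import time
import uuid

def is_rate_limited(r, key, limit, window_sec):
    now = time.time()
    pipe = r.pipeline()
    # Drop entries that have aged out of the window
    pipe.zremrangebyscore(key, 0, now - window_sec)
    # Unique member so two requests with the same timestamp both count
    pipe.zadd(key, {f"{now}:{uuid.uuid4().hex}": now})
    pipe.zcard(key)
    pipe.expire(key, window_sec)
    _, _, count, _ = pipe.execute()
    return count > limit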

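This gives an accurate sliding window. It stores one sorted-set member per request, so it uses more memory per key than a counter, but it removes the boundary burst. Rejected requests are recorded too, so a client that keeps retrying stays limited.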

Token Bucket via Lua

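A token bucket refills at a fixed rate up to a maximum capacity, and each request consumes a token. Running the refill, check, and decrement in a single Lua script makes the sequence atomic, which prevents race conditions between application instances in a distributed rate limiter.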
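
A minimal sketch of such a script follows. The hash field names (tokens, ts), the function name take_token, and the choice to pass the current time from the client are all assumptions; passing the time in means application clocks need to be reasonably in sync.

```python
import time

# Lua runs atomically inside Redis, so refill, check, and decrement cannot interleave.
TOKEN_BUCKET_LUA = """
local capacity = tonumber(ARGV[1])
local refill_per_sec = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local cost = tonumber(ARGV[4])

local state = redis.call("HMGET", KEYS[1], "tokens", "ts")
local tokens = tonumber(state[1]) or capacity   -- a new bucket starts full
local ts = tonumber(state[2]) or now

-- Refill for the elapsed time, capped at capacity.
tokens = math.min(capacity, tokens + math.max(0, now - ts) * refill_per_sec)

local allowed = 0
if tokens >= cost then
    tokens = tokens - cost
    allowed = 1
end

redis.call("HSET", KEYS[1], "tokens", tokens, "ts", now)
-- Idle buckets expire once they would have refilled completely anyway.
redis.call("EXPIRE", KEYS[1], math.ceil(capacity / refill_per_sec) * 2)
return allowed
"""


def take_token(r, key, capacity, refill_per_sec, cost=1, now=None):
    """Return True if `cost` tokens were available and have been consumed.

    `r` is assumed to expose eval(script, numkeys, *keys_and_args) like redis-py.
    """
    now = time.time() if now is None else now
    return r.eval(TOKEN_BUCKET_LUA, 1, key, capacity, refill_per_sec, now, cost) == 1
```

For example, take_token(r, "bucket:user:42", capacity=20, refill_per_sec=5) would allow bursts of up to 20 requests and a sustained rate of 5 per second.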

Distributed Locks

Use SET NX EX for simple locks:

import uuid

def acquire_lock(r, lock_name, ttl=10):
    token = str(uuid.uuid4())
    if r.set(f"lock:{lock_name}", token, nx=True, ex=ttl):
        return token
    return None

def release_lock(r, lock_name, token):
    # Lua script — only release if token matches (avoid deleting others' locks)
    script = """
    if redis.call("get", KEYS[1]) == ARGV[1] then
        return redis.call("del", KEYS[1])
    else
        return 0
    end
    """
    return r.eval(script, 1, f"lock:{lock_name}", token)

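A lock on a single instance can be lost during failover, because replication is asynchronous and the new master may not have received the key. For higher safety, Redlock acquires the lock on a majority of several independent Redis instances. Evaluate whether your consistency requirements justify the complexity.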

Lock guidelines:

Always set TTL — prevent deadlocks from crashed holders
Use unique tokens — only holder can release
Keep lock duration short — locks are anti-scale
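
These guidelines can be packaged as a context manager so the release always runs, even when the critical section raises. This is a sketch under assumptions: redis_lock and LockNotAcquired are hypothetical names, and `r` is expected to behave like a redis-py client.

```python
import uuid
from contextlib import contextmanager

# Compare-and-delete: only the holder of the token may remove the key.
_RELEASE_LUA = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
end
return 0
"""


class LockNotAcquired(Exception):
    """Hypothetical exception raised when another holder owns the lock."""


@contextmanager
def redis_lock(r, lock_name, ttl=10):
    key = f"lock:{lock_name}"
    token = str(uuid.uuid4())                     # unique per acquisition
    if not r.set(key, token, nx=True, ex=ttl):    # TTL guards against crashed holders
        raise LockNotAcquired(lock_name)
    try:
        yield token
    finally:
        # A no-op if the TTL already expired and someone else now holds the lock.
        r.eval(_RELEASE_LUA, 1, key, token)
```

Usage would look like `with redis_lock(r, "invoice:123", ttl=5): ...`. The work inside the block should finish well within the TTL, since the lock does not renew itself.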

Idempotency Keys

Prevent duplicate processing of retried requests:

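import json

class RetryLater(Exception):
    """The first attempt with this key is still in flight."""

def process_payment(r, idempotency_key, payment_fn):
    key = f"idem:{idempotency_key}"
    if r.set(key, "processing", nx=True, ex=86400):
        try:
            result = payment_fn()
        except Exception:
            # Assumes a raised exception means no charge was made, so a retry may run
            r.delete(key)
            raise
        r.set(key, json.dumps(result), ex=86400)
        return result
    status = r.get(key)
    if isinstance(status, bytes):  # clients without decode_responses=True return bytes
        status = status.decode()
    if status is None or status == "processing":
        raise RetryLater()
    return json.loads(status)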

Circuit Breaker with Redis

Track failure counts across app instances:

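class CircuitOpenError(Exception):
    """Raised when the shared failure count has reached the threshold."""

def call_external_service(r, service_name, fn):
    fail_key = f"circuit:{service_name}:failures"
    failures = int(r.get(fail_key) or 0)

    # Open: reject calls until the counter expires (60 s after the last failure)
    if failures >= 5:
        raise CircuitOpenError(f"{service_name} circuit open")

    try:
        result = fn()
        r.delete(fail_key)
        return result
    except Exception:
        # Count the failure and refresh the 60 s window in one round trip
        pipe = r.pipeline()
        pipe.incr(fail_key)
        pipe.expire(fail_key, 60)
        pipe.execute()
        raise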

Graceful Degradation

When Redis is unavailable, applications should degrade — not crash:

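import json
import logging

import redis

logger = logging.getLogger(__name__)
r = redis.Redis(decode_responses=True)  # connection settings are placeholders

def get_user_cached(user_id):
    try:
        cached = r.get(f"user:{user_id}")
        if cached:
            return json.loads(cached)
    except (redis.ConnectionError, redis.TimeoutError):
        logger.warning("Redis unavailable, falling back to DB")
    return db.get_user(user_id)  # db: the application's data-access layer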

Define behavior per workload:

Cache miss: always fall back to DB
Sessions: force re-login or local session fallback
Rate limiting: fail open (allow) or fail closed (deny) — document choice
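
For the rate-limiting case, the fail-open versus fail-closed choice can be made an explicit parameter rather than an accident of exception handling. The wrapper below is a sketch: check_rate_limit is a hypothetical name, and the default `errors` tuple uses Python's built-in exceptions, so with redis-py you would likely pass its own ConnectionError and TimeoutError classes instead.

```python
import logging

logger = logging.getLogger(__name__)


def check_rate_limit(limiter, key, fail_open=True,
                     errors=(ConnectionError, TimeoutError)):
    """Return True if the request may proceed.

    `limiter` is any callable that returns True when `key` is over its limit.
    `errors` lists the exception types treated as "Redis unavailable".
    """
    try:
        return not limiter(key)
    except errors:
        logger.warning("rate limiter unavailable, failing %s",
                       "open" if fail_open else "closed")
        return fail_open
```

A public read endpoint might use fail_open=True to stay available, while a login or payment endpoint might prefer fail_open=False so that an outage cannot be used to bypass the limit.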

Security Checklist

# Bind to loopback plus this host's private interface address (example value)
bind 127.0.0.1 10.0.0.5
requirepass strong_password
rename-command FLUSHALL ""
rename-command FLUSHDB ""
rename-command CONFIG ""
rename-command DEBUG ""

Enable TLS for cross-network traffic
Use ACLs with least privilege per application
No public internet exposure
Rotate credentials quarterly
Audit CLIENT LIST for unexpected connections
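
The CLIENT LIST audit can be partly automated by comparing client addresses against the networks you expect. The sketch below assumes the input is a list of dicts with an "addr" field of the form "ip:port", which is the shape redis-py's client_list() returns; the function name and the CIDR ranges are illustrative.

```python
import ipaddress


def unexpected_clients(clients, allowed_cidrs):
    """Return CLIENT LIST entries whose address is outside the allowed networks."""
    networks = [ipaddress.ip_network(cidr) for cidr in allowed_cidrs]
    flagged = []
    for client in clients:
        # Strip the port; tolerate the "[ipv6]:port" form.
        host = client["addr"].rsplit(":", 1)[0].strip("[]")
        try:
            ip = ipaddress.ip_address(host)
        except ValueError:
            # Not an IP address (for example a unix-socket client): flag for review.
            flagged.append(client)
            continue
        if not any(ip in net for net in networks):
            flagged.append(client)
    return flagged
```

Running this on a schedule and alerting on a non-empty result turns the checklist item into a continuous check.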

Backup and Disaster Recovery

Workload	Backup Strategy	RPO
Cache	None — rebuild from DB	N/A
Sessions	Hourly RDB + AOF everysec	~1 sec
Queues/Streams	AOF always + cross-region replica	~0 sec

Automate backup verification — restore to test instance monthly.

Operational Runbooks

Redis Memory > 85%

Check INFO memory and evicted_keys
Run redis-cli --bigkeys analysis
Identify keys without TTL
Short-term: increase maxmemory or add node
Long-term: key optimization (see Memory Optimization page)
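
The "identify keys without TTL" step can be sketched with SCAN so the server is never blocked the way KEYS would block it. The function name and limits are assumptions, and `r` is expected to expose scan_iter() and ttl() like a redis-py client.

```python
def keys_without_ttl(r, pattern="*", sample_limit=1000, scan_count=500):
    """Return up to `sample_limit` keys matching `pattern` that have no expiry."""
    found = []
    for key in r.scan_iter(match=pattern, count=scan_count):
        # TTL returns -1 when the key exists without an expiry, -2 when it is gone.
        if r.ttl(key) == -1:
            found.append(key)
            if len(found) >= sample_limit:
                break
    return found
```

On a large keyspace this still issues one TTL call per key, so it is best run against a replica or with a narrow pattern.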

Master Failover (Sentinel)

Confirm +switch-master alert
Verify new master: SENTINEL master mymaster
Check application reconnection metrics
Investigate old master — rejoin as replica
Post-incident: document timeline and client impact

Latency Spike

SLOWLOG GET 20
LATENCY DOCTOR
Check INFO persistence for BGSAVE/AOF rewrite
Review recent deploys
Check host CPU, network, and THP settings

Monitoring and Alerting

Alert	Threshold	Action
Memory usage	> 85% maxmemory	Capacity / optimize keys
Hit ratio	< 80% for 15 min	Cache design review
Replication lag	> 10s	Network / replica load
Rejected connections	> 0	Pool sizing
Eviction rate	> 2× baseline	Memory pressure
Sentinel failover	Any	Runbook execution

Export via redis_exporter to Prometheus/Grafana.
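
To make a few of the thresholds in the table concrete, here is a sketch that evaluates them against one INFO snapshot. It assumes a dict shaped like redis-py's info() result; evaluate_alerts and the alert names are hypothetical, and in practice these rules would usually live in Prometheus alerting rules rather than application code.

```python
def evaluate_alerts(info, hit_ratio_min=0.80, memory_max=0.85):
    """Return the names of alerts that fire for one INFO snapshot.

    Counters such as keyspace_hits are cumulative since server start, so a
    real check would compare two snapshots to get a windowed ratio.
    """
    alerts = []
    maxmemory = info.get("maxmemory", 0)          # 0 means no limit configured
    if maxmemory and info["used_memory"] / maxmemory > memory_max:
        alerts.append("memory_usage")
    hits = info.get("keyspace_hits", 0)
    misses = info.get("keyspace_misses", 0)
    if hits + misses and hits / (hits + misses) < hit_ratio_min:
        alerts.append("hit_ratio")
    if info.get("rejected_connections", 0) > 0:
        alerts.append("rejected_connections")
    return alerts
```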

Deployment Best Practices

Rolling Redis upgrades: upgrade replicas to the new version, fail over to an upgraded replica, then upgrade the old master
Config management: version-control redis.conf, Sentinel config
Change windows: schedule resharding and AOF rewrites during low traffic
Load test after topology changes
Chaos engineering: kill nodes in staging quarterly

Common Production Mistakes

Mistake	Impact
Single Redis for everything	Cascade failures across features
No fallback when Redis down	Full outage instead of degraded
FLUSHALL in CI pointing at prod	Catastrophic — separate environments
Locks without TTL	Permanent deadlocks
Hardcoded topology	Failover breaks clients

Production Scenario: Black Friday Readiness

An e-commerce team prepared Redis for 10× traffic:

Two weeks before:

Load test at 3× expected peak
Warm cache for top 10K product pages
Verify Sentinel failover under load
Set alerts: memory > 80%, p99 latency > 5 ms, hit ratio < 85%

One week before:

Increase Cluster from 6 to 8 nodes
Add TTL jitter to all cache keys
Deploy lock-based stampede prevention on hero product pages
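
TTL jitter spreads expirations out so that keys cached at the same moment do not all expire together and hit the database at once. A minimal sketch, assuming a hypothetical helper name and a 10% spread (the scenario does not state the value the team used):

```python
import random


def jittered_ttl(base_ttl, jitter_fraction=0.1, rng=random):
    """Return base_ttl shifted by a random amount within +/- jitter_fraction."""
    spread = base_ttl * jitter_fraction
    return max(1, int(base_ttl + rng.uniform(-spread, spread)))
```

A cache write would then use something like `r.set(key, value, ex=jittered_ttl(3600))`, giving each key an expiry somewhere between 54 and 66 minutes.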

Day of:

Grafana dashboard: ops/sec, memory, hit ratio, evictions
On-call runbook printed (memory, failover, latency)
Post-event: slowlog review, capacity report for next year

Result: p99 cache latency 1.8ms, hit ratio 94%, zero evictions, zero failover events.

Production Redis is an operational discipline — architecture patterns, runbooks, and tested failure handling matter as much as command knowledge.
