Redis Production Patterns
Production Architecture Principles
Running Redis in production requires more than correct commands — it demands deliberate architecture for isolation, failure handling, security, and operability.
Core principles:
- Assume failure — nodes crash, networks partition, memory fills
- Isolate workloads — cache, sessions, and queues on separate instances
- Define SLOs — latency, availability, acceptable data loss (RPO)
- Automate operations — failover, backups, scaling, alerting
- Document runbooks — incidents happen at 3 AM
Multi-Instance Topology
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│  Cache Cluster  │  │ Session Sentinel│  │ Queue (Streams) │
│  (eviction OK)  │  │  (no eviction)  │  │  (durability)   │
│     3M + 3R     │  │ 1M + 2R + Sent  │  │  1M + 2R + AOF  │
└─────────────────┘  └─────────────────┘  └─────────────────┘
         ▲                    ▲                    ▲
         └────────────────────┴────────────────────┘
                         Application
Benefits: independent scaling, blast radius containment, tailored persistence and eviction per workload.
Rate Limiting Patterns
Fixed Window
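MULTI
INCR rate:192.168.1.1:202406131405
EXPIRE rate:192.168.1.1:202406131405 60
EXEC
# Reject the request if the INCR result exceeds 100
Simple, but it allows boundary bursts: with a limit of 100 per minute, a client can send 100 requests at the end of one window and 100 at the start of the next, so 200 requests pass within a few seconds.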
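The same counter driven from application code might look like the sketch below. It assumes a redis-py style client passed in as `r`; the function name `fixed_window_allow`, the key layout, and the defaults are illustrative and not part of any library.

```python
import time

def fixed_window_allow(r, client_id, limit=100, window_sec=60, now=None):
    """Return True if the request fits in the current fixed window.

    Assumes `r` is a redis-py style client; names and defaults are illustrative.
    """
    now = time.time() if now is None else now
    # Every request in the same window shares one counter key.
    window_id = int(now // window_sec)
    key = f"rate:{client_id}:{window_id}"

    pipe = r.pipeline()           # redis-py pipelines wrap commands in MULTI/EXEC by default
    pipe.incr(key)
    pipe.expire(key, window_sec)  # old windows clean themselves up
    count, _ = pipe.execute()
    return count <= limit
```

Two calls one second apart can still land in different windows, which is exactly the boundary burst described above.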
Sliding Window with Sorted Set
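import time
import uuid

def is_rate_limited(r, key, limit, window_sec):
    now = time.time()
    pipe = r.pipeline()
    # Drop entries that have fallen out of the window
    pipe.zremrangebyscore(key, 0, now - window_sec)
    # Unique member so two requests with the same timestamp are both counted
    pipe.zadd(key, {f"{now}:{uuid.uuid4()}": now})
    pipe.zcard(key)
    # Let idle keys expire on their own
    pipe.expire(key, window_sec)
    _, _, count, _ = pipe.execute()
    return count > limit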
Accurate sliding window — more memory per key but fairer limiting.
Token Bucket via Lua
Atomic check-and-decrement prevents race conditions in distributed rate limiters.
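A minimal sketch of such a script follows, assuming a redis-py style client `r` whose `eval(script, numkeys, *keys_and_args)` call is available. The hash fields `tokens` and `ts`, the `bucket:` key prefix, and the function name `take_token` are assumptions made for illustration. The current time is passed in from the client for simplicity, which means clock skew between application hosts affects refill accuracy.

```python
import time

# The whole script runs atomically inside Redis, so refill, check, and
# decrement cannot interleave with another caller.
TOKEN_BUCKET_LUA = """
local capacity = tonumber(ARGV[1])
local refill_per_sec = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local state = redis.call("HMGET", KEYS[1], "tokens", "ts")
local tokens = tonumber(state[1]) or capacity
local ts = tonumber(state[2]) or now

-- Refill according to elapsed time, capped at capacity.
local elapsed = math.max(0, now - ts)
tokens = math.min(capacity, tokens + elapsed * refill_per_sec)

local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end

redis.call("HSET", KEYS[1], "tokens", tokens, "ts", now)
-- Expire idle buckets once they would have refilled completely.
redis.call("EXPIRE", KEYS[1], math.ceil(capacity / refill_per_sec) + 1)
return allowed
"""

def take_token(r, bucket, capacity=10, refill_per_sec=1.0, now=None):
    """Return True if a token was available and has been consumed."""
    now = time.time() if now is None else now
    allowed = r.eval(TOKEN_BUCKET_LUA, 1, f"bucket:{bucket}",
                     capacity, refill_per_sec, now)
    return allowed == 1
```

Unlike the window-based limiters, a bucket allows short bursts up to `capacity` while holding the long-run rate at `refill_per_sec`.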
Distributed Locks
Use SET NX EX for simple locks:
import uuid

def acquire_lock(r, lock_name, ttl=10):
    token = str(uuid.uuid4())
    if r.set(f"lock:{lock_name}", token, nx=True, ex=ttl):
        return token
    return None

def release_lock(r, lock_name, token):
    # Lua script: only release if the token matches (avoid deleting others' locks)
    script = """
    if redis.call("get", KEYS[1]) == ARGV[1] then
        return redis.call("del", KEYS[1])
    else
        return 0
    end
    """
    return r.eval(script, 1, f"lock:{lock_name}", token)
Redlock (acquiring the lock on multiple independent Redis instances) offers higher safety; evaluate whether your consistency requirements justify the complexity.
Lock guidelines:
- Always set TTL — prevent deadlocks from crashed holders
- Use unique tokens — only holder can release
- Keep lock duration short — locks are anti-scale
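One way to make these guidelines hard to forget is to wrap acquire and release in a context manager. The sketch below assumes a redis-py style client `r`; `redis_lock` and `LockNotAcquired` are hypothetical names introduced for illustration.

```python
import uuid
from contextlib import contextmanager

# Compare-and-delete: only the holder's token may remove the key.
RELEASE_LUA = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end
"""

class LockNotAcquired(Exception):
    """Raised when another holder already owns the lock (hypothetical name)."""

@contextmanager
def redis_lock(r, lock_name, ttl=10):
    key = f"lock:{lock_name}"
    token = str(uuid.uuid4())  # unique token: only this holder can release
    # The TTL guarantees a crashed holder cannot block others forever.
    if not r.set(key, token, nx=True, ex=ttl):
        raise LockNotAcquired(lock_name)
    try:
        yield token
    finally:
        # If the lock expired and someone else re-acquired it, this is a no-op.
        r.eval(RELEASE_LUA, 1, key, token)
```

A caller would write `with redis_lock(r, "report-job"): ...` and keep the body short enough to finish well inside `ttl`.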
Idempotency Keys
Prevent duplicate processing of retried requests:
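import json

class RetryLater(Exception):
    """The first request with this idempotency key is still in flight."""

def process_payment(r, idempotency_key, payment_fn):
    key = f"idem:{idempotency_key}"
    # NX claims the key for the first request only; retries fall through
    if r.set(key, "processing", nx=True, ex=86400):
        result = payment_fn()
        r.set(key, json.dumps(result), ex=86400)
        return result
    status = r.get(key)
    if isinstance(status, bytes):
        # Clients created without decode_responses=True return bytes
        status = status.decode()
    if status is None or status == "processing":
        raise RetryLater()
    return json.loads(status)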
Circuit Breaker with Redis
Track failure counts across app instances:
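class CircuitOpenError(Exception):
    """Raised while the failure count is at or above the threshold."""

def call_external_service(r, service_name, fn):
    fail_key = f"circuit:{service_name}:failures"
    failures = int(r.get(fail_key) or 0)
    if failures >= 5:
        # Stays open until the counter expires, 60 s after the last failure
        raise CircuitOpenError(f"{service_name} circuit open")
    try:
        result = fn()
        r.delete(fail_key)  # a success resets the counter
        return result
    except Exception:
        r.incr(fail_key)
        r.expire(fail_key, 60)
        raise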
Graceful Degradation
When Redis is unavailable, applications should degrade — not crash:
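import json
import logging
import redis

logger = logging.getLogger(__name__)

def get_user_cached(user_id):
    # r is a redis.Redis client and db the application's data layer, both created elsewhere
    try:
        cached = r.get(f"user:{user_id}")
        if cached:
            return json.loads(cached)
    except (redis.ConnectionError, redis.TimeoutError):
        logger.warning("Redis unavailable, falling back to DB")
    return db.get_user(user_id)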
Define behavior per workload:
- Cache miss: always fall back to DB
- Sessions: force re-login or local session fallback
- Rate limiting: fail open (allow) or fail closed (deny) — document choice
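For the rate-limiting case, the fail-open versus fail-closed choice can be made explicit in code rather than left to whatever the exception handler happens to do. The sketch below is one possible shape; `check_rate_limit` and its parameters are hypothetical, and `errors` would normally be narrowed to the client's connection and timeout exceptions.

```python
import logging

logger = logging.getLogger(__name__)

def check_rate_limit(limiter, key, fail_open=True, errors=(Exception,)):
    """Return True if the request should be rejected.

    `limiter` is any callable that returns True when the caller is over the
    limit, for example a Redis-backed check bound to a client.
    """
    try:
        return limiter(key)
    except errors:
        logger.warning("Rate limiter backend unavailable; fail_open=%s", fail_open)
        # Fail open: treat the request as not limited. Fail closed: reject it.
        return not fail_open
```

Failing open favors availability (public read endpoints, for instance), while failing closed favors protection (login or payment endpoints); the flag keeps that decision visible at the call site.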
Security Checklist
# Bind to the host's private interface address (10.0.0.5 is an example), not a network address
bind 10.0.0.5
requirepass strong_password
rename-command FLUSHALL ""
rename-command FLUSHDB ""
rename-command CONFIG ""
rename-command DEBUG ""
- Enable TLS for cross-network traffic
- Use ACLs with least privilege per application
- No public internet exposure
- Rotate credentials quarterly
- Audit CLIENT LIST for unexpected connections
Backup and Disaster Recovery
| Workload | Backup Strategy | RPO |
|---|---|---|
| Cache | None — rebuild from DB | N/A |
| Sessions | Hourly RDB + AOF everysec | ~1 sec |
| Queues/Streams | AOF always + cross-region replica | ~0 sec |
Automate backup verification — restore to test instance monthly.
Operational Runbooks
Redis Memory > 85%
- Check INFO memory and evicted_keys
- Run --bigkeys analysis
- Identify keys without TTL
- Short-term: increase maxmemory or add node
- Long-term: key optimization (see Memory Optimization page)
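The "identify keys without TTL" step can be scripted. The sketch below assumes a redis-py style client `r` exposing `scan_iter` and `ttl`; the function name and the `sample_limit` cutoff are illustrative. On a large keyspace this issues one TTL call per key, so it is better run against a replica.

```python
def keys_without_ttl(r, pattern="*", sample_limit=1000):
    """Return up to `sample_limit` keys matching `pattern` that have no expiry."""
    found = []
    # SCAN walks the keyspace incrementally, unlike KEYS, which blocks the server.
    for key in r.scan_iter(match=pattern, count=500):
        # TTL returns -1 when the key exists without an expiry, -2 when it is gone.
        if r.ttl(key) == -1:
            found.append(key)
            if len(found) >= sample_limit:
                break
    return found
```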
Master Failover (Sentinel)
- Confirm +switch-master alert
- Verify new master: SENTINEL master mymaster
- Check application reconnection metrics
- Investigate old master — rejoin as replica
- Post-incident: document timeline and client impact
Latency Spike
- SLOWLOG GET 20
- LATENCY DOCTOR
- Check INFO persistence for BGSAVE/AOF rewrite
- Review recent deploys
- Check host CPU, network, and THP settings
Monitoring and Alerting
| Alert | Threshold | Action |
|---|---|---|
| Memory usage | > 85% maxmemory | Capacity / optimize keys |
| Hit ratio | < 80% for 15 min | Cache design review |
| Replication lag | > 10s | Network / replica load |
| Rejected connections | > 0 | Pool sizing |
| Eviction rate | > 2× baseline | Memory pressure |
| Sentinel failover | Any | Runbook execution |
Export via redis_exporter to Prometheus/Grafana.
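In practice these rules would live in Prometheus, but the arithmetic behind a few of them is easy to show. The sketch below evaluates three thresholds from the table against two INFO snapshots, passed as dicts such as those returned by a client's `info()` call. The field names are standard INFO fields; the function name and alert labels are illustrative. Counters in INFO are cumulative since startup, so the hit ratio and rejected connections are computed from the difference between snapshots.

```python
def evaluate_alerts(prev, curr, hit_ratio_min=0.80, memory_max=0.85):
    """Return the names of alerts firing between two INFO snapshots."""
    alerts = []

    # Hit ratio over the interval, not since server start.
    hits = curr["keyspace_hits"] - prev["keyspace_hits"]
    misses = curr["keyspace_misses"] - prev["keyspace_misses"]
    lookups = hits + misses
    if lookups > 0 and hits / lookups < hit_ratio_min:
        alerts.append("hit_ratio")

    # maxmemory of 0 means no limit is configured, so the ratio is undefined.
    maxmemory = curr.get("maxmemory", 0)
    if maxmemory > 0 and curr["used_memory"] / maxmemory > memory_max:
        alerts.append("memory_usage")

    if curr["rejected_connections"] - prev["rejected_connections"] > 0:
        alerts.append("rejected_connections")

    return alerts
```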
Deployment Best Practices
- Blue/green Redis upgrades: deploy new version to replicas, failover, upgrade old master
- Config management: version-control redis.conf, Sentinel config
- Change windows: resharding, AOF rewrite during low traffic
- Load test after topology changes
- Chaos engineering: kill nodes in staging quarterly
Common Production Mistakes
| Mistake | Impact |
|---|---|
| Single Redis for everything | Cascade failures across features |
| No fallback when Redis down | Full outage instead of degraded |
| FLUSHALL in CI pointing at prod | Catastrophic — separate environments |
| Locks without TTL | Permanent deadlocks |
| Hardcoded topology | Failover breaks clients |
Production Scenario: Black Friday Readiness
An e-commerce team prepared Redis for 10× traffic:
Two weeks before:
- Load test at 3× expected peak
- Warm cache for top 10K product pages
- Verify Sentinel failover under load
- Set alerts: memory 80%, p99 latency 5ms, hit ratio 85%
One week before:
- Increase Cluster from 6 to 8 nodes
- Add TTL jitter to all cache keys
- Deploy lock-based stampede prevention on hero product pages
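TTL jitter keeps keys that were written together from expiring together. A minimal sketch, assuming a symmetric spread around the base TTL (the function name and the 10% default are illustrative, not taken from the team's implementation):

```python
import random

def jittered_ttl(base_ttl, jitter_fraction=0.1, rng=random):
    """Return base_ttl shifted by a random offset of up to ±jitter_fraction."""
    spread = base_ttl * jitter_fraction
    # Never return less than 1 second, since a TTL of 0 is not a valid expiry.
    return max(1, int(base_ttl + rng.uniform(-spread, spread)))
```

A cache write would then pass `ex=jittered_ttl(300)` instead of a fixed `ex=300`, spreading expirations over roughly 270 to 330 seconds.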
Day of:
- Grafana dashboard: ops/sec, memory, hit ratio, evictions
- On-call runbook printed (memory, failover, latency)
- Post-event: slowlog review, capacity report for next year
Result: p99 cache latency 1.8ms, hit ratio 94%, zero evictions, zero failover events.
Production Redis is an operational discipline — architecture patterns, runbooks, and tested failure handling matter as much as command knowledge.