Performance Tuning Methodology

  1. Measure — baseline latency, ops/sec, memory, hit ratio
  2. Identify bottleneck — CPU, memory, network, slow commands, connection count
  3. Change one thing — isolate impact
  4. Verify — compare before/after under realistic load
  5. Monitor continuously: performance regressions often appear after deploys

Never tune randomly — every change should trace to measured evidence.
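
The measure and verify steps can be sketched as a small timing harness. This is a minimal illustration, not a replacement for a real load test: the helper names (measure_latency_ms, summarize) are hypothetical, and the operation being timed is whatever callable you pass in.

```python
import math
import statistics
import time
from typing import Callable, Dict, List


def measure_latency_ms(operation: Callable[[], object], samples: int = 1000) -> List[float]:
    """Call `operation` repeatedly and return per-call latencies in milliseconds."""
    latencies = []
    for _ in range(samples):
        start = time.perf_counter()
        operation()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies_ms: List[float]) -> Dict[str, float]:
    """Reduce raw samples to figures worth comparing before and after a change."""
    ordered = sorted(latencies_ms)

    def percentile(p: int) -> float:
        # Nearest-rank percentile: the smallest sample with at least p% of
        # all samples at or below it.
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]

    return {
        "mean": statistics.fmean(ordered),
        "p50": percentile(50),
        "p99": percentile(99),
        "max": ordered[-1],
    }
```

With a redis-py client r, a baseline could be captured as summarize(measure_latency_ms(lambda: r.ping())) and compared against the same call after the change. Run both under the same load, since single-client numbers say little about behavior under concurrency.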

Memory Management

  INFO memory
MEMORY USAGE user:1001
MEMORY STATS
MEMORY DOCTOR
  

Key memory fields:

Field                     Meaning
used_memory               Total bytes allocated by Redis
used_memory_rss           Physical memory as reported by the OS
used_memory_peak          High-water mark of used_memory
maxmemory                 Configured limit (0 = no limit)
mem_fragmentation_ratio   used_memory_rss / used_memory; above 1.5 may indicate fragmentation

Example limits in redis.conf:

  maxmemory 4gb
maxmemory-policy allkeys-lru
maxmemory-samples 10
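
The fields above can be combined into two quick health figures. The sketch below assumes a dict shaped like the one redis-py returns from r.info("memory"); the function name memory_health is hypothetical.

```python
from typing import Dict, Mapping, Optional


def memory_health(info: Mapping[str, int]) -> Dict[str, Optional[float]]:
    """Derive fragmentation and fill level from INFO memory fields."""
    used = info["used_memory"]
    rss = info["used_memory_rss"]
    limit = info.get("maxmemory", 0)  # 0 means no limit is configured
    return {
        # RSS divided by allocated bytes, as defined in the table above.
        "fragmentation_ratio": rss / used if used else None,
        # Fraction of maxmemory in use; undefined when no limit is set.
        "used_fraction": used / limit if limit else None,
    }
```

A fragmentation_ratio above 1.5 or a used_fraction above 0.8 would match the warning levels used elsewhere in this section.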
  

Eviction Policies

Policy           Behavior
noeviction       Reject writes with an error once maxmemory is reached; use for sessions/queues
allkeys-lru      Evict any key, approximating least recently used (LRU)
volatile-lru     Approximated LRU among keys with a TTL only
allkeys-lfu      Evict the least frequently used key (Redis 4+)
volatile-lfu     LFU among keys with a TTL only
allkeys-random   Evict a random key
volatile-ttl     Evict the key with the shortest remaining TTL

Cache workloads: allkeys-lru or allkeys-lfu.
Mixed cache + sessions: volatile-lru with TTLs on cache keys only; keep sessions on a dedicated instance with noeviction.

Latency Monitoring

  CONFIG SET latency-monitor-threshold 10
LATENCY LATEST
LATENCY HISTORY command
LATENCY DOCTOR
LATENCY GRAPH command
  

Built-in latency doctor summarizes issues:

  LATENCY DOCTOR
# Analyzes spikes, suggests causes (fork, AOF, slow commands)
  

Slowlog

Commands that take longer than slowlog-log-slower-than to execute (default 10,000 microseconds = 10 ms) are recorded in the slowlog:

  CONFIG GET slowlog-log-slower-than
SLOWLOG GET 20
SLOWLOG LEN
SLOWLOG RESET
  

Common slow command culprits: KEYS *, SMEMBERS on huge sets, LRANGE on long lists, SORT, large HGETALL.
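
To see which commands dominate the slowlog, the entries can be grouped by command name. This sketch assumes entries shaped like those redis-py returns from r.slowlog_get(20), that is, dicts with a "command" string (or bytes) and a "duration" in microseconds; summarize_slowlog is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def summarize_slowlog(entries: Iterable[dict]) -> List[Tuple[str, int, int]]:
    """Group slowlog entries by command name.

    Returns (command_name, occurrences, total_duration_usec) tuples,
    largest total duration first.
    """
    totals: Dict[str, List[int]] = defaultdict(lambda: [0, 0])
    for entry in entries:
        command = entry["command"]
        if isinstance(command, bytes):
            command = command.decode("utf-8", errors="replace")
        # The first token is the command name; the rest are its arguments.
        name = command.split(" ", 1)[0].upper() if command else "?"
        totals[name][0] += 1
        totals[name][1] += entry["duration"]
    return sorted(
        ((name, count, usec) for name, (count, usec) in totals.items()),
        key=lambda row: row[2],
        reverse=True,
    )
```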

Avoid Expensive Commands

  # Bad on large datasets
KEYS *
SMEMBERS huge_set
HGETALL massive_hash
FLUSHALL

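# Good alternatives: iterate in small steps, repeating with the returned cursor until it is 0
SCAN 0 MATCH user:* COUNT 100
SSCAN huge_set 0 COUNT 100
HSCAN massive_hash 0 COUNT 100
# If a full flush is really intended, free the memory in the background (Redis 4+)
FLUSHALL ASYNC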
  

KEYS is O(N) in the number of keys and blocks the single-threaded event loop while it runs; never use it in production.
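
The cursor loop behind SCAN can be sketched as follows. The function takes any client exposing a redis-py style scan(cursor=..., match=..., count=...) method that returns a (next_cursor, keys) pair; scan_keys itself is a hypothetical helper.

```python
from typing import Iterator


def scan_keys(client, pattern: str = "user:*", count: int = 100) -> Iterator:
    """Yield keys matching `pattern` using incremental SCAN calls."""
    cursor = 0
    while True:
        # Each call does a bounded amount of work; COUNT is a hint, not a hard limit.
        cursor, keys = client.scan(cursor=cursor, match=pattern, count=count)
        yield from keys
        if cursor == 0:  # the server returns cursor 0 when the iteration is complete
            break
```

SCAN may return the same key more than once, so callers that need uniqueness should deduplicate. redis-py also provides scan_iter, which wraps the same loop.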

Pipelining and Batching

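import redis

r = redis.Redis(host="localhost", port=6379)
# transaction=False: plain pipelining, no MULTI/EXEC wrapper
pipe = r.pipeline(transaction=False)
for i in range(10000):
    pipe.set(f"key:{i}", f"value:{i}")
pipe.execute()  # sends all queued commands, then reads the 10,000 replies
# One round trip vs 10,000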
  

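Pipelining can improve throughput 10–100× for bulk operations, mostly by removing per-command network round trips; the actual gain depends on network latency and payload size.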
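
For very large bulk loads it is usually safer to flush the pipeline in batches, so that neither the client nor the server has to buffer every command and reply at once. A minimal sketch, assuming a redis-py style client; pipelined_set and the batch size of 1000 are illustrative choices, not recommendations from the text above.

```python
from typing import Iterable, Tuple


def pipelined_set(client, items: Iterable[Tuple[str, str]], batch_size: int = 1000) -> int:
    """SET key/value pairs through a non-transactional pipeline.

    The pipeline is executed every `batch_size` commands. Returns the
    number of commands sent.
    """
    pipe = client.pipeline(transaction=False)
    queued = 0
    written = 0
    for key, value in items:
        pipe.set(key, value)
        queued += 1
        if queued == batch_size:
            pipe.execute()  # one round trip for the whole batch
            written += queued
            queued = 0
    if queued:
        pipe.execute()  # flush the final partial batch
        written += queued
    return written
```

Writing 10,000 keys this way takes 10 round trips instead of 1, which is still far fewer than 10,000, while keeping each batch small.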

Connection Pooling

import redis

pool = redis.ConnectionPool(
    max_connections=50,
    host="localhost",
    port=6379,
    decode_responses=True
)
r = redis.Redis(connection_pool=pool)
  

Opening a new TCP connection for every command wastes time and resources. Size each pool to the number of concurrent requests expected per process.

  INFO clients
# connected_clients, blocked_clients, rejected_connections
CONFIG GET maxclients
  

Key Metrics to Watch

  INFO stats
INFO replication
INFO cpu
INFO commandstats
  
Metric                            Healthy Signal     Warning
instantaneous_ops_per_sec         Stable under load  Sudden drop = issue
keyspace_hits / keyspace_misses   Hit ratio > 90%    Low hit ratio = wrong cache design
rejected_connections              0                  Pool or maxclients exhausted
used_memory vs maxmemory          < 80%              Evictions or OOM imminent
latest_fork_usec                  < 10,000 (10 ms)   Large RDB fork causing latency

Hit Ratio Calculation

  redis-cli INFO stats | grep keyspace
# hit_ratio = keyspace_hits / (keyspace_hits + keyspace_misses)
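
The same calculation in Python, as a sketch over dicts shaped like the output of redis-py's r.info("stats"). Because keyspace_hits and keyspace_misses are cumulative counters, the second helper compares two snapshots to get the ratio for a recent interval; both function names are hypothetical.

```python
from typing import Mapping, Optional


def hit_ratio(stats: Mapping[str, int]) -> Optional[float]:
    """Lifetime hit ratio from cumulative INFO stats counters."""
    hits = stats["keyspace_hits"]
    misses = stats["keyspace_misses"]
    total = hits + misses
    return hits / total if total else None


def hit_ratio_between(before: Mapping[str, int], after: Mapping[str, int]) -> Optional[float]:
    """Hit ratio for the interval between two INFO stats snapshots."""
    return hit_ratio({
        "keyspace_hits": after["keyspace_hits"] - before["keyspace_hits"],
        "keyspace_misses": after["keyspace_misses"] - before["keyspace_misses"],
    })
```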
  

Command Statistics

  INFO commandstats
# Per command: calls, usec (total time), usec_per_call
# Reset the counters to start a fresh measurement window
CONFIG RESETSTAT
  

Identify hot commands consuming disproportionate CPU time.
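
One way to rank commands by their share of total execution time is sketched below. It assumes a dict shaped like redis-py's r.info("commandstats") output, where each "cmdstat_<name>" key maps to a dict containing "calls" and "usec"; hottest_commands is a hypothetical helper.

```python
from typing import Dict, List, Mapping, Tuple


def hottest_commands(commandstats: Mapping[str, Dict[str, float]], top: int = 5) -> List[Tuple[str, float]]:
    """Return (command, share_of_total_usec) pairs, largest share first."""
    total_usec = sum(stat["usec"] for stat in commandstats.values())
    if not total_usec:
        return []
    prefix = "cmdstat_"
    shares = []
    for key, stat in commandstats.items():
        name = key[len(prefix):] if key.startswith(prefix) else key
        shares.append((name, stat["usec"] / total_usec))
    shares.sort(key=lambda row: row[1], reverse=True)
    return shares[:top]
```

A command with few calls but a large share, such as an occasional HGETALL on a big hash, is often a better optimization target than the most frequently called command.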

Monitoring Stack

Tool                  Purpose
redis_exporter        Prometheus metrics
RedisInsight          GUI exploration, profiler
Grafana               Dashboards for ops/sec, memory, latency
Datadog / New Relic   APM integration
redis-cli INFO        Quick manual checks

Example Prometheus alerts:

  # Memory > 85% maxmemory
# hit_ratio < 80% for 15 minutes
# rejected_connections > 0
# replication lag > 10s
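
The "for 15 minutes" qualifier matters: it keeps a brief dip from paging anyone. The idea, similar in spirit to the for clause of a Prometheus alerting rule, can be sketched in Python as a check over timestamped samples. The function name and the sample format are assumptions made for illustration.

```python
from typing import Iterable, Tuple


def sustained_below(samples: Iterable[Tuple[float, float]], threshold: float, window_s: float) -> bool:
    """True if the metric stayed below `threshold` for the trailing `window_s` seconds.

    `samples` are (timestamp_seconds, value) pairs sorted by timestamp.
    """
    points = list(samples)
    if not points:
        return False
    newest = points[-1][0]
    if newest - points[0][0] < window_s:
        return False  # not enough history to cover the window
    recent = [value for ts, value in points if ts >= newest - window_s]
    return all(value < threshold for value in recent)
```

For the hit ratio alert above, this would be called with threshold=0.8 and window_s=900 on samples collected at a fixed scrape interval.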
  

Best Practices

  1. Set maxmemory and eviction policy before production traffic
  2. Use SCAN family, never KEYS
  3. Pipeline bulk operations
  4. Pool connections in every application process
  5. Monitor slowlog weekly
  6. Separate instances for cache vs sessions vs queues

Common Mistakes

Mistake                            Impact
No maxmemory limit                 Host OOM kill
KEYS in production script          Latency outage
One connection per request         Connection exhaustion
Ignoring mem_fragmentation_ratio   Wasted RAM, need restart
Tuning without baseline metrics    Cannot verify improvement

Troubleshooting

Latency spikes every N minutes:

  LATENCY DOCTOR
INFO persistence
# RDB BGSAVE or AOF rewrite fork — schedule off-peak
  

Ops/sec ceiling:

  INFO cpu
redis-benchmark -q -n 100000 -c 50
# Command execution is single-threaded, so one instance is bounded by one core; scale via sharding or more instances
  

High rejected_connections:

  CONFIG GET maxclients
INFO clients
# Increase maxclients AND fix connection pooling in apps
  

Performance Tips

  • Disable Transparent Huge Pages (THP) on Linux hosts running Redis
  • Use UNLINK instead of DEL for large keys (async reclaim, Redis 4+)
  • Prefer many small values over few huge values for even latency
  • Use CLIENT KILL to drop idle connections during incidents
  • Run MEMORY PURGE (Redis 4+, jemalloc allocator only) if mem_fragmentation_ratio stays above 1.5

Production Scenario

An ad-tech platform serving 500K ops/sec monitored Redis via Prometheus + Grafana. Alerts fired when p99 latency exceeded 5ms (normal: 1.2ms). Slowlog revealed that a deployment had introduced HGETALL on 50KB session hashes. Fix: switched to HMGET for the required fields, and p99 dropped to 1.4ms. A memory alert at 85% triggered the proactive addition of a node to the Cluster before evictions impacted hit ratio.

Profile before optimizing — measure latency, memory, and command distribution, then fix the highest-impact issues first.