Performance and Monitoring
Performance Tuning Methodology
- Measure — baseline latency, ops/sec, memory, hit ratio
- Identify bottleneck — CPU, memory, network, slow commands, connection count
- Change one thing — isolate impact
- Verify — compare before/after under realistic load
- Monitor continuously — performance regressions appear after deploys
Never tune randomly — every change should trace to measured evidence.
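The verify step can be sketched as a before/after comparison of a latency percentile. This is a minimal illustration, not a benchmarking tool: the function names are hypothetical, and the samples are assumed to be per-request latencies in milliseconds collected under comparable load.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1..100."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def compare_runs(baseline_ms, candidate_ms, pct=99):
    """Compare the same percentile before and after a single change."""
    before = percentile(baseline_ms, pct)
    after = percentile(candidate_ms, pct)
    return {
        "before_ms": before,
        "after_ms": after,
        # Negative values mean the change reduced latency.
        "change_pct": (after - before) / before * 100,
    }
```

Comparing the same percentile on both runs keeps the check tied to one measured number, which matches the "change one thing" step.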
Memory Management
INFO memory
MEMORY USAGE user:1001
MEMORY STATS
MEMORY DOCTOR
Key memory fields:
| Field | Meaning |
|---|---|
| used_memory | Total bytes allocated by Redis |
| used_memory_rss | OS-reported physical memory |
| used_memory_peak | High water mark |
| maxmemory | Configured limit |
| mem_fragmentation_ratio | RSS / used_memory; > 1.5 may indicate fragmentation |
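As a rough sketch of how these fields combine, the function below derives the fragmentation ratio and the share of maxmemory in use from a dictionary shaped like the INFO memory section (for example, what a client library returns after parsing it). The function name and the 1.5 default threshold parameter are illustrative choices.

```python
def memory_report(info, frag_threshold=1.5):
    """Summarize fragmentation and headroom from INFO memory fields."""
    used = info["used_memory"]
    rss = info["used_memory_rss"]
    limit = info.get("maxmemory", 0)

    # Fragmentation ratio is RSS divided by the bytes Redis allocated.
    frag = rss / used if used else 0.0
    return {
        "fragmentation_ratio": round(frag, 2),
        "fragmented": frag > frag_threshold,
        # maxmemory == 0 means no limit is configured, so there is no percentage.
        "used_pct_of_limit": round(used / limit * 100, 1) if limit else None,
    }
```

With redis-py this could be fed from `r.info("memory")`, assuming a reachable server.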
maxmemory 4gb
maxmemory-policy allkeys-lru
maxmemory-samples 10
Eviction Policies
| Policy | Behavior |
|---|---|
| noeviction | Return errors on writes when full; use for sessions/queues |
| allkeys-lru | Evict any key using approximated LRU |
| volatile-lru | Evict only keys with a TTL, using approximated LRU |
| allkeys-lfu | Evict least frequently used (Redis 4+) |
| volatile-lfu | LFU among keys with TTL |
| allkeys-random | Random eviction |
| volatile-ttl | Evict keys with shortest TTL |
Cache workloads: allkeys-lru or allkeys-lfu
Mixed cache + sessions: volatile-lru with TTLs on cache keys only, so session keys without a TTL are never evicted; alternatively, move sessions to a dedicated instance running noeviction
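The "LRU approximation" and the maxmemory-samples setting can be illustrated with a toy model: instead of tracking a global LRU order, sample a few keys at random and evict the one that was accessed longest ago. This is a simplified sketch of the idea, not Redis's actual implementation; the function name and the dictionary of last-access timestamps are assumptions made for illustration.

```python
import random


def pick_eviction_victim(last_access, samples=10, rng=random):
    """Return the least recently used key among a random sample of keys.

    last_access maps key -> last access timestamp (larger is more recent).
    """
    keys = list(last_access)
    # Sample at most `samples` keys, mirroring the role of maxmemory-samples.
    candidates = rng.sample(keys, min(samples, len(keys)))
    # The sampled key with the oldest access time is the eviction victim.
    return min(candidates, key=lambda k: last_access[k])
```

A larger sample size makes the choice closer to true LRU at the cost of more work per eviction, which is the trade-off the setting controls.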
Latency Monitoring
CONFIG SET latency-monitor-threshold 10
LATENCY LATEST
LATENCY HISTORY command
LATENCY DOCTOR
LATENCY GRAPH command
Built-in latency doctor summarizes issues:
LATENCY DOCTOR
# Analyzes spikes, suggests causes (fork, AOF, slow commands)
Slowlog
The slowlog records commands whose execution time exceeds slowlog-log-slower-than (default 10,000 microseconds = 10 ms):
CONFIG GET slowlog-log-slower-than
SLOWLOG GET 20
SLOWLOG LEN
SLOWLOG RESET
Common slow command culprits: KEYS *, SMEMBERS on huge sets, LRANGE on long lists, SORT, large HGETALL.
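To see which culprits dominate, slowlog entries can be grouped by command name. The sketch below assumes each entry is a dictionary with a `command` field (text or bytes, command name first) and a `duration` field in microseconds, which is the shape redis-py's `slowlog_get()` is expected to return; treat that shape as an assumption and adjust for your client.

```python
from collections import defaultdict


def summarize_slowlog(entries):
    """Group slowlog entries by command name, sorted by total time spent."""
    totals = defaultdict(lambda: {"count": 0, "total_us": 0, "max_us": 0})
    for entry in entries:
        command = entry["command"]
        if isinstance(command, bytes):
            command = command.decode("utf-8", errors="replace")
        # The first token is the command name, e.g. "KEYS" in "KEYS user:*".
        name = command.split(" ", 1)[0].upper() if command else "UNKNOWN"
        stats = totals[name]
        stats["count"] += 1
        stats["total_us"] += entry["duration"]
        stats["max_us"] = max(stats["max_us"], entry["duration"])
    return sorted(
        ((name, dict(stats)) for name, stats in totals.items()),
        key=lambda item: item[1]["total_us"],
        reverse=True,
    )
```

The first element of the result is the command that accounts for the most logged time, which is usually the place to start.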
Avoid Expensive Commands
# Bad on large datasets
KEYS *
SMEMBERS huge_set
HGETALL massive_hash
FLUSHALL
# Good alternatives
SCAN 0 MATCH user:* COUNT 100
SSCAN huge_set 0 COUNT 100
HSCAN massive_hash 0 COUNT 100
FLUSHALL ASYNC
KEYS is O(N) and blocks the single event loop — never use in production.
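A SCAN call returns one page plus a cursor, so a full iteration is a loop that feeds each returned cursor into the next call until the server returns 0. The sketch below assumes a redis-py style client whose `scan(cursor=..., match=..., count=...)` returns a `(cursor, keys)` pair; the helper name is hypothetical, and redis-py's own `scan_iter` covers the same ground.

```python
def scan_keys(client, pattern, count=100):
    """Yield all keys matching pattern, one SCAN page at a time."""
    cursor = 0
    while True:
        # Each call does a bounded amount of work, so the server is not blocked.
        cursor, keys = client.scan(cursor=cursor, match=pattern, count=count)
        yield from keys
        # A returned cursor of 0 signals that the iteration is complete.
        if cursor == 0:
            break
```

COUNT is only a hint for how much work each call does, and SCAN may return the same key more than once, so callers that need uniqueness should deduplicate.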
Pipelining and Batching
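import redis
r = redis.Redis(host="localhost", port=6379)
# Buffer commands client-side; transaction=False skips the MULTI/EXEC wrapper
pipe = r.pipeline(transaction=False)
for i in range(10000):
    pipe.set(f"key:{i}", f"value:{i}")
# Send the whole batch at once
pipe.execute()
# One round trip vs 10,000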
Pipelining can improve throughput 10–100× for bulk operations.
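For very large batches it can help to flush the pipeline in chunks, so neither the client buffer nor the server reply buffer grows without bound. The following is a sketch under the assumption of a redis-py style client (`pipeline(transaction=False)`, `set`, `execute`); the helper name and the default chunk size of 1,000 are illustrative, not recommendations from the source.

```python
def set_many(client, items, chunk_size=1000):
    """Write (key, value) pairs through a pipeline, flushing every chunk_size commands.

    Returns the number of round trips (execute calls) that were made.
    """
    pipe = client.pipeline(transaction=False)
    pending = 0
    round_trips = 0
    for key, value in items:
        pipe.set(key, value)
        pending += 1
        if pending == chunk_size:
            pipe.execute()  # one round trip for the whole chunk
            round_trips += 1
            pending = 0
    if pending:
        pipe.execute()  # flush the final partial chunk
        round_trips += 1
    return round_trips
```

With 10,000 keys and the default chunk size this makes 10 round trips instead of 10,000, while capping how much is buffered at any moment.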
Connection Pooling
import redis
pool = redis.ConnectionPool(
    max_connections=50,
    host="localhost",
    port=6379,
    decode_responses=True,
)
r = redis.Redis(connection_pool=pool)
One TCP connection per command wastes resources. Size pools to expected concurrent requests per process.
INFO clients
# connected_clients, blocked_clients
INFO stats
# rejected_connections
CONFIG GET maxclients
Key Metrics to Watch
INFO stats
INFO replication
INFO cpu
INFO commandstats
| Metric | Healthy Signal | Warning |
|---|---|---|
| instantaneous_ops_per_sec | Stable under load | Sudden drop = issue |
| keyspace_hits / keyspace_misses | Hit ratio > 90% | Low hit ratio = wrong cache design |
| rejected_connections | 0 | Pool or maxclients exhausted |
| used_memory vs maxmemory | < 80% | Evictions or OOM imminent |
| latest_fork_usec | < 10 ms | Large RDB fork causing latency |
Hit Ratio Calculation
redis-cli INFO stats | grep keyspace
# hit_ratio = hits / (hits + misses)
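The formula translates directly into a small helper. This sketch assumes a dictionary shaped like the INFO stats section, with integer `keyspace_hits` and `keyspace_misses`; the function name is illustrative.

```python
def hit_ratio(stats):
    """Return hits / (hits + misses), or None when there have been no lookups."""
    hits = stats.get("keyspace_hits", 0)
    misses = stats.get("keyspace_misses", 0)
    total = hits + misses
    # A freshly started or freshly reset server has no lookups to divide by.
    return hits / total if total else None
```

Both counters are cumulative since the last restart or CONFIG RESETSTAT, so comparing two snapshots gives a more current ratio than a single reading.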
Command Statistics
INFO commandstats
# usec_per_call, calls per command
CONFIG RESETSTAT
Identify hot commands consuming disproportionate CPU time.
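One way to find those hot commands is to rank each command by its share of total execution time. The sketch assumes the commandstats section has been parsed into a dictionary of the form `{"cmdstat_get": {"calls": ..., "usec": ..., "usec_per_call": ...}}`, which is how redis-py is expected to present it; the function name and tuple layout are hypothetical.

```python
def rank_commands(commandstats, top=5):
    """Return (name, calls, usec_per_call, share_pct) rows, costliest first."""
    total_usec = sum(stats["usec"] for stats in commandstats.values())
    prefix = "cmdstat_"
    rows = []
    for name, stats in commandstats.items():
        short = name[len(prefix):] if name.startswith(prefix) else name
        share = stats["usec"] / total_usec * 100 if total_usec else 0.0
        rows.append((stats["usec"], short, stats["calls"], stats["usec_per_call"], share))
    # Sort by total microseconds so rarely called but expensive commands surface.
    rows.sort(key=lambda row: row[0], reverse=True)
    return [(short, calls, per_call, round(share, 1))
            for _, short, calls, per_call, share in rows[:top]]
```

Ranking by total time rather than call count is a deliberate choice here: a command called ten times can still dominate CPU if each call is slow.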
Monitoring Stack
| Tool | Purpose |
|---|---|
| redis_exporter | Prometheus metrics |
| RedisInsight | GUI exploration, profiler |
| Grafana | Dashboards for ops/sec, memory, latency |
| Datadog / New Relic | APM integration |
| redis-cli INFO | Quick manual checks |
Example Prometheus alerts:
# Memory > 85% maxmemory
# hit_ratio < 80% for 15 minutes
# rejected_connections > 0
# replication lag > 10s
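The thresholds above can be expressed as simple predicates over a metrics snapshot. This is only a Python illustration of the conditions, not Prometheus rule syntax: the metric keys and alert names are hypothetical, `hit_ratio` is assumed to be precomputed as hits / (hits + misses), and the "for 15 minutes" duration is not modeled.

```python
def firing_alerts(metrics):
    """Return the names of alert conditions that hold for one metrics snapshot."""
    limit = metrics["maxmemory"]
    ratio = metrics["hit_ratio"]
    rules = [
        # Memory above 85% of maxmemory (skipped when no limit is configured).
        ("memory_above_85pct", limit > 0 and metrics["used_memory"] / limit > 0.85),
        # Hit ratio below 80% (skipped when there is no ratio yet).
        ("hit_ratio_below_80pct", ratio is not None and ratio < 0.80),
        ("connections_rejected", metrics["rejected_connections"] > 0),
        ("replication_lag_above_10s", metrics["replication_lag_seconds"] > 10),
    ]
    return [name for name, firing in rules if firing]
```

In a real alerting setup a cumulative counter such as rejected_connections would normally be evaluated as an increase over a time window rather than against a single snapshot.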
Best Practices
- Set maxmemory and eviction policy before production traffic
- Use SCAN family, never KEYS
- Pipeline bulk operations
- Pool connections in every application process
- Monitor slowlog weekly
- Separate instances for cache vs sessions vs queues
Common Mistakes
| Mistake | Impact |
|---|---|
| No maxmemory limit | Host OOM kill |
| KEYS in production script | Latency outage |
| One connection per request | Connection exhaustion |
| Ignoring mem_fragmentation_ratio | Wasted RAM, need restart |
| Tuning without baseline metrics | Cannot verify improvement |
Troubleshooting
Latency spikes every N minutes:
LATENCY DOCTOR
INFO persistence
# RDB BGSAVE or AOF rewrite fork — schedule off-peak
Ops/sec ceiling:
INFO cpu
redis-benchmark -q -n 100000 -c 50
# Single-threaded — scale via sharding or more instances
High rejected_connections:
CONFIG GET maxclients
INFO clients
# Increase maxclients AND fix connection pooling in apps
Performance Tips
- Disable transparent huge pages (THP) on Linux hosts running Redis
- Use UNLINK instead of DEL for large keys (async reclaim, Redis 4+)
- Prefer many small values over few huge values for even latency
- Use CLIENT KILL to drop idle connections during incidents
- Run MEMORY PURGE (Redis 4+) if fragmentation ratio > 1.5
Production Scenario
An ad-tech platform serving 500K ops/sec monitored Redis via Prometheus + Grafana. Alerts fired when p99 latency exceeded 5ms (normal: 1.2ms). The slowlog revealed that a deployment had introduced HGETALL on 50KB session hashes. Fix: switch to HMGET for the required fields only, after which p99 dropped to 1.4ms. A memory alert at 85% also triggered the proactive addition of a node to the cluster before evictions affected the hit ratio.
Profile before optimizing — measure latency, memory, and command distribution, then fix the highest-impact issues first.