Monitoring and Observability
Monitoring Layers
Production MySQL observability spans four layers:
- Infrastructure — CPU, memory, disk I/O, network
- Server metrics — connections, buffer pool, locks, replication
- Query performance — slow queries, digest statistics, execution plans
- Business SLOs — p99 query latency, error rates, availability
Tools like Prometheus + mysqld_exporter, Datadog, PMM (Percona Monitoring), and CloudWatch RDS metrics cover layers 1–3. Layer 4 requires application instrumentation.
Essential Status Variables
SHOW GLOBAL STATUS LIKE 'Threads_%';
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_%';
SHOW GLOBAL STATUS LIKE 'Slow_queries';
SHOW GLOBAL STATUS LIKE 'Aborted_%';
| Variable | Healthy Signal | Warning |
|---|---|---|
Threads_running |
Stable under load | Sustained spike = query backlog |
Innodb_buffer_pool_read_requests vs ..._reads |
Hit ratio > 99% | Frequent disk reads = pool too small |
Slow_queries |
Flat growth rate | Sudden jump = regression or load spike |
Aborted_connects |
Near zero | Credential attacks or misconfigured apps |
SELECT
(1 - Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests) * 100
AS buffer_pool_hit_pct
FROM (
SELECT
VARIABLE_VALUE AS Innodb_buffer_pool_reads
FROM performance_schema.global_status
WHERE VARIABLE_NAME = 'Innodb_buffer_pool_reads'
) r
CROSS JOIN (
SELECT VARIABLE_VALUE AS Innodb_buffer_pool_read_requests
FROM performance_schema.global_status
WHERE VARIABLE_NAME = 'Innodb_buffer_pool_read_requests'
) rr;
Performance Schema and sys Schema
Enable Performance Schema (default in MySQL 8.0):
UPDATE performance_schema.setup_instruments
SET ENABLED = 'YES', TIMED = 'YES'
WHERE NAME LIKE 'statement/%';
SELECT * FROM sys.statements_with_runtimes_in_95th_percentile
LIMIT 10;
SELECT * FROM sys.schema_unused_indexes
WHERE object_schema = 'myapp';
The sys schema wraps Performance Schema views with human-readable output — start here before writing raw P_S queries.
Top Queries by Total Time
SELECT
DIGEST_TEXT,
COUNT_STAR AS exec_count,
ROUND(AVG_TIMER_WAIT / 1e12, 3) AS avg_sec,
ROUND(SUM_TIMER_WAIT / 1e12, 1) AS total_sec,
SUM_ROWS_EXAMINED
FROM performance_schema.events_statements_summary_by_digest
WHERE SCHEMA_NAME = 'myapp'
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 15;
Reset digest stats after deployments to measure impact:
TRUNCATE TABLE performance_schema.events_statements_summary_by_digest;
Slow Query Log
[mysqld]
slow_query_log = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time = 0.5
log_queries_not_using_indexes = 1
Analyze with pt-query-digest:
pt-query-digest /var/log/mysql/slow.log > slow_report.txt
Focus on queries with high Query_time × Count — one slow query run once matters less than a mediocre query run a million times.
Processlist and Lock Waits
SELECT * FROM sys.processlist WHERE command != 'Sleep';
SELECT
r.trx_id AS waiting_trx,
r.trx_mysql_thread_id AS waiting_thread,
b.trx_id AS blocking_trx,
b.trx_mysql_thread_id AS blocking_thread
FROM information_schema.innodb_lock_waits w
JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id;
Long-running Waiting for lock entries indicate contention — often missing indexes, long transactions, or batch jobs overlapping peak traffic.
Replication Monitoring
SHOW REPLICA STATUS\G
Key fields:
| Field | Action if Bad |
|---|---|
Replica_IO_Running / Replica_SQL_Running |
Investigate network, credentials, or SQL errors |
Seconds_Behind_Source |
Alert if > 30s for read-your-writes apps |
Last_SQL_Error |
Fix schema drift or conflicting writes |
SELECT * FROM performance_schema.replication_applier_status_by_worker;
Parallel replication workers (slave_parallel_workers) can lag individually — check per-worker errors.
Connection Pool Health
SHOW STATUS LIKE 'Max_used_connections';
SHOW VARIABLES LIKE 'max_connections';
SELECT user, host, COUNT(*) AS connections
FROM information_schema.processlist
GROUP BY user, host
ORDER BY connections DESC;
If Max_used_connections approaches max_connections, applications see “Too many connections.” Fix with connection pooling (ProxySQL, HikariCP, PgBouncer-style middleware) — not just raising the limit.
Alerting Thresholds (Starting Points)
| Metric | Warning | Critical |
|---|---|---|
| Replication lag | > 10s | > 60s |
| Buffer pool hit ratio | < 95% | < 90% |
| Disk usage | > 75% | > 85% |
| Threads_running | > 2× CPU cores sustained | > 4× cores |
| Slow query rate | 2× baseline | 5× baseline |
Tune thresholds per workload — a write-heavy batch system tolerates higher lag differently than a real-time dashboard.
Best Practices
- Baseline metrics during normal traffic before setting alerts
- Monitor rate of change, not absolute values alone
- Correlate MySQL metrics with application traces (OpenTelemetry)
- Automate weekly slow query digest reviews
- Store historical metrics for at least 30 days for incident forensics
Common Mistakes
| Mistake | Consequence |
|---|---|
| Alerting on every slow query | Alert fatigue — use digests and percentiles |
| Ignoring replication lag until failover | Promote stale replica, lose data |
| Monitoring only CPU | I/O-bound databases show low CPU but high latency |
TRUNCATE digest stats in production without note |
Lose pre-deploy comparison baseline |
Troubleshooting Runbook
Sudden latency spike:
- Check
Threads_runningandprocesslist - Identify lock waits
- Review recent deploys and schema changes
- Check disk I/O (iostat) and buffer pool hit ratio
Replication stopped:
- Read
Last_SQL_ErrorinSHOW REPLICA STATUS - Skip single statement only if safe:
STOP REPLICA; SET GTID_NEXT=...; START REPLICA; - Prefer re-sync from backup if errors are widespread
Production Scenario: E-Commerce Flash Sale
Before a flash sale, the SRE team:
-- Snapshot baseline
SELECT * FROM sys.metrics;
-- Verify buffer pool
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
-- Warm critical caches
SELECT COUNT(*) FROM products WHERE category_id = 42;
During the event, Grafana dashboards track Threads_running, InnoDB row lock waits, and p99 checkout query latency. Post-event, pt-query-digest on the slow log identifies new hot queries for indexing in the next sprint.
Performance Tips
- Enable
performance_schemaconsumers selectively — full instrumentation adds ~5–10% overhead - Use EXPLAIN ANALYZE (MySQL 8.0.18+) for real execution stats on suspect queries
- Schedule
ANALYZE TABLEafter large data loads to keep optimizer statistics fresh - Export metrics to Prometheus for long-term trend analysis and capacity planning
Effective monitoring turns MySQL from a black box into a predictable production service — invest in dashboards and runbooks before the first outage, not after.