Monitoring Layers

Production MySQL observability spans four layers:

  1. Infrastructure — CPU, memory, disk I/O, network
  2. Server metrics — connections, buffer pool, locks, replication
  3. Query performance — slow queries, digest statistics, execution plans
  4. Business SLOs — p99 query latency, error rates, availability

Tools like Prometheus + mysqld_exporter, Datadog, PMM (Percona Monitoring), and CloudWatch RDS metrics cover layers 1–3. Layer 4 requires application instrumentation.

Essential Status Variables

  SHOW GLOBAL STATUS LIKE 'Threads_%';
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_%';
SHOW GLOBAL STATUS LIKE 'Slow_queries';
SHOW GLOBAL STATUS LIKE 'Aborted_%';
  
Variable Healthy Signal Warning
Threads_running Stable under load Sustained spike = query backlog
Innodb_buffer_pool_read_requests vs ..._reads Hit ratio > 99% Frequent disk reads = pool too small
Slow_queries Flat growth rate Sudden jump = regression or load spike
Aborted_connects Near zero Credential attacks or misconfigured apps
  SELECT
    (1 - Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests) * 100
        AS buffer_pool_hit_pct
FROM (
    SELECT
        VARIABLE_VALUE AS Innodb_buffer_pool_reads
    FROM performance_schema.global_status
    WHERE VARIABLE_NAME = 'Innodb_buffer_pool_reads'
) r
CROSS JOIN (
    SELECT VARIABLE_VALUE AS Innodb_buffer_pool_read_requests
    FROM performance_schema.global_status
    WHERE VARIABLE_NAME = 'Innodb_buffer_pool_read_requests'
) rr;
  

Performance Schema and sys Schema

Enable Performance Schema (default in MySQL 8.0):

  UPDATE performance_schema.setup_instruments
SET ENABLED = 'YES', TIMED = 'YES'
WHERE NAME LIKE 'statement/%';

SELECT * FROM sys.statements_with_runtimes_in_95th_percentile
LIMIT 10;

SELECT * FROM sys.schema_unused_indexes
WHERE object_schema = 'myapp';
  

The sys schema wraps Performance Schema views with human-readable output — start here before writing raw P_S queries.

Top Queries by Total Time

  SELECT
    DIGEST_TEXT,
    COUNT_STAR AS exec_count,
    ROUND(AVG_TIMER_WAIT / 1e12, 3) AS avg_sec,
    ROUND(SUM_TIMER_WAIT / 1e12, 1) AS total_sec,
    SUM_ROWS_EXAMINED
FROM performance_schema.events_statements_summary_by_digest
WHERE SCHEMA_NAME = 'myapp'
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 15;
  

Reset digest stats after deployments to measure impact:

  TRUNCATE TABLE performance_schema.events_statements_summary_by_digest;
  

Slow Query Log

  [mysqld]
slow_query_log = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time = 0.5
log_queries_not_using_indexes = 1
  

Analyze with pt-query-digest:

  pt-query-digest /var/log/mysql/slow.log > slow_report.txt
  

Focus on queries with high Query_time × Count — one slow query run once matters less than a mediocre query run a million times.

Processlist and Lock Waits

  SELECT * FROM sys.processlist WHERE command != 'Sleep';

SELECT
    r.trx_id AS waiting_trx,
    r.trx_mysql_thread_id AS waiting_thread,
    b.trx_id AS blocking_trx,
    b.trx_mysql_thread_id AS blocking_thread
FROM information_schema.innodb_lock_waits w
JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id;
  

Long-running Waiting for lock entries indicate contention — often missing indexes, long transactions, or batch jobs overlapping peak traffic.

Replication Monitoring

  SHOW REPLICA STATUS\G
  

Key fields:

Field Action if Bad
Replica_IO_Running / Replica_SQL_Running Investigate network, credentials, or SQL errors
Seconds_Behind_Source Alert if > 30s for read-your-writes apps
Last_SQL_Error Fix schema drift or conflicting writes
  SELECT * FROM performance_schema.replication_applier_status_by_worker;
  

Parallel replication workers (slave_parallel_workers) can lag individually — check per-worker errors.

Connection Pool Health

  SHOW STATUS LIKE 'Max_used_connections';
SHOW VARIABLES LIKE 'max_connections';

SELECT user, host, COUNT(*) AS connections
FROM information_schema.processlist
GROUP BY user, host
ORDER BY connections DESC;
  

If Max_used_connections approaches max_connections, applications see “Too many connections.” Fix with connection pooling (ProxySQL, HikariCP, PgBouncer-style middleware) — not just raising the limit.

Alerting Thresholds (Starting Points)

Metric Warning Critical
Replication lag > 10s > 60s
Buffer pool hit ratio < 95% < 90%
Disk usage > 75% > 85%
Threads_running > 2× CPU cores sustained > 4× cores
Slow query rate 2× baseline 5× baseline

Tune thresholds per workload — a write-heavy batch system tolerates higher lag differently than a real-time dashboard.

Best Practices

  1. Baseline metrics during normal traffic before setting alerts
  2. Monitor rate of change, not absolute values alone
  3. Correlate MySQL metrics with application traces (OpenTelemetry)
  4. Automate weekly slow query digest reviews
  5. Store historical metrics for at least 30 days for incident forensics

Common Mistakes

Mistake Consequence
Alerting on every slow query Alert fatigue — use digests and percentiles
Ignoring replication lag until failover Promote stale replica, lose data
Monitoring only CPU I/O-bound databases show low CPU but high latency
TRUNCATE digest stats in production without note Lose pre-deploy comparison baseline

Troubleshooting Runbook

Sudden latency spike:

  1. Check Threads_running and processlist
  2. Identify lock waits
  3. Review recent deploys and schema changes
  4. Check disk I/O (iostat) and buffer pool hit ratio

Replication stopped:

  1. Read Last_SQL_Error in SHOW REPLICA STATUS
  2. Skip single statement only if safe: STOP REPLICA; SET GTID_NEXT=...; START REPLICA;
  3. Prefer re-sync from backup if errors are widespread

Production Scenario: E-Commerce Flash Sale

Before a flash sale, the SRE team:

  -- Snapshot baseline
SELECT * FROM sys.metrics;

-- Verify buffer pool
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- Warm critical caches
SELECT COUNT(*) FROM products WHERE category_id = 42;
  

During the event, Grafana dashboards track Threads_running, InnoDB row lock waits, and p99 checkout query latency. Post-event, pt-query-digest on the slow log identifies new hot queries for indexing in the next sprint.

Performance Tips

  • Enable performance_schema consumers selectively — full instrumentation adds ~5–10% overhead
  • Use EXPLAIN ANALYZE (MySQL 8.0.18+) for real execution stats on suspect queries
  • Schedule ANALYZE TABLE after large data loads to keep optimizer statistics fresh
  • Export metrics to Prometheus for long-term trend analysis and capacity planning

Effective monitoring turns MySQL from a black box into a predictable production service — invest in dashboards and runbooks before the first outage, not after.