to navigate

to select

to close

On this page

Monitoring and Observability

Monitoring Layers

Production MySQL observability spans four layers:

Infrastructure — CPU, memory, disk I/O, network
Server metrics — connections, buffer pool, locks, replication
Query performance — slow queries, digest statistics, execution plans
Business SLOs — p99 query latency, error rates, availability

Tools like Prometheus + mysqld_exporter, Datadog, PMM (Percona Monitoring), and CloudWatch RDS metrics cover layers 1–3. Layer 4 requires application instrumentation.

Essential Status Variables

  SHOW GLOBAL STATUS LIKE 'Threads_%';
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_%';
SHOW GLOBAL STATUS LIKE 'Slow_queries';
SHOW GLOBAL STATUS LIKE 'Aborted_%';

Variable	Healthy Signal	Warning
`Threads_running`	Stable under load	Sustained spike = query backlog
`Innodb_buffer_pool_read_requests` vs `..._reads`	Hit ratio > 99%	Frequent disk reads = pool too small
`Slow_queries`	Flat growth rate	Sudden jump = regression or load spike
`Aborted_connects`	Near zero	Credential attacks or misconfigured apps

  SELECT
    (1 - Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests) * 100
        AS buffer_pool_hit_pct
FROM (
    SELECT
        VARIABLE_VALUE AS Innodb_buffer_pool_reads
    FROM performance_schema.global_status
    WHERE VARIABLE_NAME = 'Innodb_buffer_pool_reads'
) r
CROSS JOIN (
    SELECT VARIABLE_VALUE AS Innodb_buffer_pool_read_requests
    FROM performance_schema.global_status
    WHERE VARIABLE_NAME = 'Innodb_buffer_pool_read_requests'
) rr;

Performance Schema and sys Schema

Enable Performance Schema (default in MySQL 8.0):

  UPDATE performance_schema.setup_instruments
SET ENABLED = 'YES', TIMED = 'YES'
WHERE NAME LIKE 'statement/%';

SELECT * FROM sys.statements_with_runtimes_in_95th_percentile
LIMIT 10;

SELECT * FROM sys.schema_unused_indexes
WHERE object_schema = 'myapp';

The sys schema wraps Performance Schema views with human-readable output — start here before writing raw P_S queries.

Top Queries by Total Time

  SELECT
    DIGEST_TEXT,
    COUNT_STAR AS exec_count,
    ROUND(AVG_TIMER_WAIT / 1e12, 3) AS avg_sec,
    ROUND(SUM_TIMER_WAIT / 1e12, 1) AS total_sec,
    SUM_ROWS_EXAMINED
FROM performance_schema.events_statements_summary_by_digest
WHERE SCHEMA_NAME = 'myapp'
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 15;

Reset digest stats after deployments to measure impact:

  TRUNCATE TABLE performance_schema.events_statements_summary_by_digest;

Slow Query Log

  [mysqld]
slow_query_log = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time = 0.5
log_queries_not_using_indexes = 1

Analyze with pt-query-digest:

  pt-query-digest /var/log/mysql/slow.log > slow_report.txt

Focus on queries with high Query_time × Count — one slow query run once matters less than a mediocre query run a million times.

Processlist and Lock Waits

  SELECT * FROM sys.processlist WHERE command != 'Sleep';

SELECT
    r.trx_id AS waiting_trx,
    r.trx_mysql_thread_id AS waiting_thread,
    b.trx_id AS blocking_trx,
    b.trx_mysql_thread_id AS blocking_thread
FROM information_schema.innodb_lock_waits w
JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id;

Long-running Waiting for lock entries indicate contention — often missing indexes, long transactions, or batch jobs overlapping peak traffic.

Replication Monitoring

  SHOW REPLICA STATUS\G

Key fields:

Field	Action if Bad
`Replica_IO_Running` / `Replica_SQL_Running`	Investigate network, credentials, or SQL errors
`Seconds_Behind_Source`	Alert if > 30s for read-your-writes apps
`Last_SQL_Error`	Fix schema drift or conflicting writes

  SELECT * FROM performance_schema.replication_applier_status_by_worker;

Parallel replication workers (slave_parallel_workers) can lag individually — check per-worker errors.

Connection Pool Health

  SHOW STATUS LIKE 'Max_used_connections';
SHOW VARIABLES LIKE 'max_connections';

SELECT user, host, COUNT(*) AS connections
FROM information_schema.processlist
GROUP BY user, host
ORDER BY connections DESC;

If Max_used_connections approaches max_connections, applications see “Too many connections.” Fix with connection pooling (ProxySQL, HikariCP, PgBouncer-style middleware) — not just raising the limit.

Alerting Thresholds (Starting Points)

Metric	Warning	Critical
Replication lag	> 10s	> 60s
Buffer pool hit ratio	< 95%	< 90%
Disk usage	> 75%	> 85%
Threads_running	> 2× CPU cores sustained	> 4× cores
Slow query rate	2× baseline	5× baseline

Tune thresholds per workload — a write-heavy batch system tolerates higher lag differently than a real-time dashboard.

Best Practices

Baseline metrics during normal traffic before setting alerts
Monitor rate of change, not absolute values alone
Correlate MySQL metrics with application traces (OpenTelemetry)
Automate weekly slow query digest reviews
Store historical metrics for at least 30 days for incident forensics

Common Mistakes

Mistake	Consequence
Alerting on every slow query	Alert fatigue — use digests and percentiles
Ignoring replication lag until failover	Promote stale replica, lose data
Monitoring only CPU	I/O-bound databases show low CPU but high latency
`TRUNCATE` digest stats in production without note	Lose pre-deploy comparison baseline

Troubleshooting Runbook

Sudden latency spike:

Check Threads_running and processlist
Identify lock waits
Review recent deploys and schema changes
Check disk I/O (iostat) and buffer pool hit ratio

Replication stopped:

Read Last_SQL_Error in SHOW REPLICA STATUS
Skip single statement only if safe: STOP REPLICA; SET GTID_NEXT=...; START REPLICA;
Prefer re-sync from backup if errors are widespread

Production Scenario: E-Commerce Flash Sale

Before a flash sale, the SRE team:

  -- Snapshot baseline
SELECT * FROM sys.metrics;

-- Verify buffer pool
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- Warm critical caches
SELECT COUNT(*) FROM products WHERE category_id = 42;

During the event, Grafana dashboards track Threads_running, InnoDB row lock waits, and p99 checkout query latency. Post-event, pt-query-digest on the slow log identifies new hot queries for indexing in the next sprint.

Performance Tips

Enable performance_schema consumers selectively — full instrumentation adds ~5–10% overhead
Use EXPLAIN ANALYZE (MySQL 8.0.18+) for real execution stats on suspect queries
Schedule ANALYZE TABLE after large data loads to keep optimizer statistics fresh
Export metrics to Prometheus for long-term trend analysis and capacity planning

Effective monitoring turns MySQL from a black box into a predictable production service — invest in dashboards and runbooks before the first outage, not after.

Table Partitioning

Introduction to MSSQL (Microsoft SQL Server)

Monitoring and Observability

Monitoring Layers link

Essential Status Variables link

Performance Schema and sys Schema link

Top Queries by Total Time link

Slow Query Log link

Processlist and Lock Waits link

Replication Monitoring link

Connection Pool Health link

Alerting Thresholds (Starting Points) link

Best Practices link

Common Mistakes link

Troubleshooting Runbook link

Production Scenario: E-Commerce Flash Sale link

Performance Tips link