Production Operations
Running SQL Server in production extends beyond backups and indexes — it requires patch discipline, capacity planning, on-call runbooks, automation, and alignment with business SLAs. This page consolidates operational patterns for enterprise DBAs.
Operational Pillars
Reliability ─── Backups, HA, tested DR
Performance ── Baselines, Query Store, proactive tuning
Security ───── Least privilege, patching, auditing
Observability ─ Alerts, dashboards, runbooks
Change control ─ SSDT, flyway, staged deploys
Weakness in any pillar surfaces during incidents — invest proportionally.
Service Level Objectives
Define measurable targets with stakeholders:
| Metric | Example Target | Measurement |
|---|---|---|
| Availability | 99.95% monthly | AG health, app synthetic checks |
| RPO | 15 minutes | Log backup frequency |
| RTO | 1 hour | Last DR drill time |
| P95 query latency | < 200 ms | Query Store, APM |
| Backup success | 100% | Agent job history |
Document exceptions (maintenance windows) in SLA appendix.
Patch and CU Strategy
SQL Server releases Cumulative Updates (CUs) — not every CU is mandatory, but security fixes are:
- Subscribe to SQL Server release services
- Test CU on staging — full regression + Query Store comparison
- Apply during maintenance window — AG failover patch secondary first
- Verify
@@VERSIONand critical DMVs post-patch
SELECT @@VERSION, SERVERPROPERTY('ProductLevel') AS product_level,
SERVERPROPERTY('ProductUpdateLevel') AS update_level;
Never run unsupported versions — extended support end dates drive upgrade projects.
Capacity Planning
Monitor growth trends monthly:
-- Database size trend
SELECT
DB_NAME(database_id) AS database_name,
SUM(size * 8 / 1024) AS size_mb
FROM sys.master_files
GROUP BY database_id
ORDER BY size_mb DESC;
-- Table row counts (approximate for large tables)
SELECT OBJECT_NAME(object_id) AS table_name, SUM(row_count) AS rows
FROM sys.dm_db_partition_stats
WHERE index_id IN (0, 1)
GROUP BY object_id
ORDER BY rows DESC;
Project disk exhaustion — alert at 80% capacity. Plan filegrowth proactively vs autogrowth storms.
CPU and Memory Headroom
- Sustained CPU > 70% — scale up/out or tune top queries
PAGE LIFE EXPECTANCY< 300 — add memory or reduce cache pressure- Batch requests/sec plateau with rising latency — saturation signal
On-Call Runbook Template
Severity 1: Database Unavailable
- Verify scope — single app vs entire instance
- Check AG listener connectivity and quorum
- Review SQL error log —
EXEC xp_readerrorlog 0, 1 - Check disk space on data/log/backup volumes
- Failover to secondary if primary hardware failure (document approval)
- Communicate ETA to stakeholders every 15 minutes
Severity 2: Severe Slowness
sp_WhoIsActiveor Activity Monitor — blocking chain- Top waits —
sys.dm_os_wait_stats - Query Store — regressed queries last hour
- Kill runaway session only with approval
- Mitigate — enable Resource Governor throttle, route reads to secondary
Severity 3: Failed Backup Job
- Agent job history — error message
- Disk space on backup target
- VSS provider issues (if snapshot backup)
- Run manual backup; investigate before next window
Store runbooks in wiki with copy-paste SQL — reduce thinking time at 3 AM.
Automation Essentials
| Task | Tool |
|---|---|
| Index/stats maintenance | Ola Hallengren scripts |
| Backup verification | Restore to test instance via Agent |
| Deployment | SSDT dacpac, Flyway, Liquibase |
| Configuration drift | dbatools, PowerShell DSC |
| Monitoring | SCOM, Azure Monitor, Datadog, Prometheus exporter |
Example Agent job step — post-backup verify:
RESTORE VERIFYONLY FROM DISK = @backup_path;
IF @@ERROR <> 0
EXEC msdb.dbo.sp_send_dbmail
@profile_name = N'DBA_Alerts',
@recipients = N'[email protected]',
@subject = N'BACKUP VERIFY FAILED',
@body = @backup_path;
Change Management
Production schema changes flow:
- Developer submits SSDT/SQL script PR
- Review — execution plan impact, lock risk, rollback script
- Deploy staging — load test
- Deploy production — off-peak, with
SET LOCK_TIMEOUT - Post-deploy — Query Store comparison, smoke tests
-- Safer column add (SQL 2016+ online)
ALTER TABLE dbo.Orders ADD Notes NVARCHAR(500) NULL
WITH (ONLINE = ON);
Always ship rollback script — even if “we never roll back.”
Incident Response: Corruption Suspected
DBCC CHECKDB (MyApp) WITH NO_INFOMSGS, ALL_ERRORMSGS;
-- Or PHYSICAL_ONLY for quick check
DBCC CHECKDB (MyApp) WITH PHYSICAL_ONLY;
If errors found:
RESTORE ... WITH RECOVERYfrom last clean backup- Apply logs to point before corruption
- Root cause — storage, firmware, forced shutdown
- Open vendor ticket with CHECKDB output
Enable page checksums on new databases:
ALTER DATABASE MyApp SET PAGE_VERIFY CHECKSUM;
Security Operations
- Rotate SQL login passwords quarterly (automated vault)
- Review
sysadminmembership monthly - Audit failed logins — brute force detection
- Disable unused features — xp_cmdshell, OLE automation unless required
SELECT name, is_disabled FROM sys.server_principals WHERE name = N'sa';
-- Disable sa if using Windows auth only
ALTER LOGIN sa DISABLE;
Documentation Inventory
Maintain for every production instance:
| Document | Contents |
|---|---|
| Instance catalog | Name, version, edition, owner, criticality |
| Connection map | Apps, listeners, connection pools |
| Backup/restore runbook | Paths, schedules, last restore test date |
| AG diagram | Replicas, quorum, failover steps |
| Escalation matrix | L1 app team → L2 DBA → L3 vendor |
Common Operational Mistakes
| Mistake | Consequence |
|---|---|
| No DR drill in 12 months | RTO unknown until disaster |
Shared sa password in vault |
No accountability |
| Manual changes without scripts | Drift, unreproducible state |
| Alert fatigue | Real incidents missed |
| Shrinking databases routinely | Fragmentation, performance loss |
Performance in Operations
Weekly DBA checklist:
- Backup success rate 100%
- AG synchronization healthy
- Disk space > 20% free on all volumes
- Top 5 Query Store regressions reviewed
- Long-running Agent jobs within SLA
- Error log scanned for severity 17+
EXEC xp_readerrorlog 0, 1, N'Error', N'Severity: 17';
Production Scenario: Zero-Downtime Index Deploy
Large table — Orders 500M rows, need index on CustomerID:
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
ON dbo.Orders (CustomerID)
WITH (ONLINE = ON, MAXDOP = 4, SORT_IN_TEMPDB = ON);
Monitor redo during AG sync. Schedule during low traffic. Pre-notify app team of possible duration estimates from staging test.
Cloud and Hybrid Ops
| Platform | Ops Notes |
|---|---|
| Azure SQL Database | Microsoft patches OS/SQL; you tune indexes, RDT |
| Azure SQL MI | Linked to AG-like SLA; verify backup retention |
| SQL on VM | You patch — treat like on-prem with Azure backup |
Use Azure Monitor alerts for DTU/storage/CPU — complement with Query Store insights.
Handoff to Development Teams
Provide developers:
- Query Store access (read-only) for their database
- Guidelines — no
NOLOCK, parameterize SQL, avoidSELECT * - Staging environment mirroring production cardinality (masked data)
- Escalation path for slow query tickets with query_hash attached
Production excellence is repetitive discipline — backups verified, patches tested, runbooks practiced, and metrics reviewed before users notice pain.