to navigate

to select

to close

On this page

Production Operations

Running SQL Server in production extends beyond backups and indexes — it requires patch discipline, capacity planning, on-call runbooks, automation, and alignment with business SLAs. This page consolidates operational patterns for enterprise DBAs.

Operational Pillars

  Reliability ─── Backups, HA, tested DR
Performance ── Baselines, Query Store, proactive tuning
Security ───── Least privilege, patching, auditing
Observability ─ Alerts, dashboards, runbooks
Change control ─ SSDT, flyway, staged deploys

Weakness in any pillar surfaces during incidents — invest proportionally.

Service Level Objectives

Define measurable targets with stakeholders:

Metric	Example Target	Measurement
Availability	99.95% monthly	AG health, app synthetic checks
RPO	15 minutes	Log backup frequency
RTO	1 hour	Last DR drill time
P95 query latency	< 200 ms	Query Store, APM
Backup success	100%	Agent job history

Document exceptions (maintenance windows) in SLA appendix.

Patch and CU Strategy

SQL Server releases Cumulative Updates (CUs) — not every CU is mandatory, but security fixes are:

Subscribe to SQL Server release services
Test CU on staging — full regression + Query Store comparison
Apply during maintenance window — AG failover patch secondary first
Verify @@VERSION and critical DMVs post-patch

  SELECT @@VERSION, SERVERPROPERTY('ProductLevel') AS product_level,
       SERVERPROPERTY('ProductUpdateLevel') AS update_level;

Never run unsupported versions — extended support end dates drive upgrade projects.

Capacity Planning

Monitor growth trends monthly:

  -- Database size trend
SELECT
    DB_NAME(database_id) AS database_name,
    SUM(size * 8 / 1024) AS size_mb
FROM sys.master_files
GROUP BY database_id
ORDER BY size_mb DESC;

-- Table row counts (approximate for large tables)
SELECT OBJECT_NAME(object_id) AS table_name, SUM(row_count) AS rows
FROM sys.dm_db_partition_stats
WHERE index_id IN (0, 1)
GROUP BY object_id
ORDER BY rows DESC;

Project disk exhaustion — alert at 80% capacity. Plan filegrowth proactively vs autogrowth storms.

CPU and Memory Headroom

Sustained CPU > 70% — scale up/out or tune top queries
PAGE LIFE EXPECTANCY < 300 — add memory or reduce cache pressure
Batch requests/sec plateau with rising latency — saturation signal

On-Call Runbook Template

Severity 1: Database Unavailable

Verify scope — single app vs entire instance
Check AG listener connectivity and quorum
Review SQL error log — EXEC xp_readerrorlog 0, 1
Check disk space on data/log/backup volumes
Failover to secondary if primary hardware failure (document approval)
Communicate ETA to stakeholders every 15 minutes

Severity 2: Severe Slowness

sp_WhoIsActive or Activity Monitor — blocking chain
Top waits — sys.dm_os_wait_stats
Query Store — regressed queries last hour
Kill runaway session only with approval
Mitigate — enable Resource Governor throttle, route reads to secondary

Severity 3: Failed Backup Job

Agent job history — error message
Disk space on backup target
VSS provider issues (if snapshot backup)
Run manual backup; investigate before next window

Store runbooks in wiki with copy-paste SQL — reduce thinking time at 3 AM.

Automation Essentials

Task	Tool
Index/stats maintenance	Ola Hallengren scripts
Backup verification	Restore to test instance via Agent
Deployment	SSDT dacpac, Flyway, Liquibase
Configuration drift	dbatools, PowerShell DSC
Monitoring	SCOM, Azure Monitor, Datadog, Prometheus exporter

Example Agent job step — post-backup verify:

  RESTORE VERIFYONLY FROM DISK = @backup_path;
IF @@ERROR <> 0
    EXEC msdb.dbo.sp_send_dbmail
        @profile_name = N'DBA_Alerts',
        @recipients = N'[email protected]',
        @subject = N'BACKUP VERIFY FAILED',
        @body = @backup_path;

Change Management

Production schema changes flow:

Developer submits SSDT/SQL script PR
Review — execution plan impact, lock risk, rollback script
Deploy staging — load test
Deploy production — off-peak, with SET LOCK_TIMEOUT
Post-deploy — Query Store comparison, smoke tests

  -- Safer column add (SQL 2016+ online)
ALTER TABLE dbo.Orders ADD Notes NVARCHAR(500) NULL
WITH (ONLINE = ON);

Always ship rollback script — even if “we never roll back.”

Incident Response: Corruption Suspected

  DBCC CHECKDB (MyApp) WITH NO_INFOMSGS, ALL_ERRORMSGS;
-- Or PHYSICAL_ONLY for quick check
DBCC CHECKDB (MyApp) WITH PHYSICAL_ONLY;

If errors found:

RESTORE ... WITH RECOVERY from last clean backup
Apply logs to point before corruption
Root cause — storage, firmware, forced shutdown
Open vendor ticket with CHECKDB output

Enable page checksums on new databases:

  ALTER DATABASE MyApp SET PAGE_VERIFY CHECKSUM;

Security Operations

Rotate SQL login passwords quarterly (automated vault)
Review sysadmin membership monthly
Audit failed logins — brute force detection
Disable unused features — xp_cmdshell, OLE automation unless required

  SELECT name, is_disabled FROM sys.server_principals WHERE name = N'sa';
-- Disable sa if using Windows auth only
ALTER LOGIN sa DISABLE;

Documentation Inventory

Maintain for every production instance:

Document	Contents
Instance catalog	Name, version, edition, owner, criticality
Connection map	Apps, listeners, connection pools
Backup/restore runbook	Paths, schedules, last restore test date
AG diagram	Replicas, quorum, failover steps
Escalation matrix	L1 app team → L2 DBA → L3 vendor

Common Operational Mistakes

Mistake	Consequence
No DR drill in 12 months	RTO unknown until disaster
Shared `sa` password in vault	No accountability
Manual changes without scripts	Drift, unreproducible state
Alert fatigue	Real incidents missed
Shrinking databases routinely	Fragmentation, performance loss

Performance in Operations

Weekly DBA checklist:

Backup success rate 100%
AG synchronization healthy
Disk space > 20% free on all volumes
Top 5 Query Store regressions reviewed
Long-running Agent jobs within SLA
Error log scanned for severity 17+

  EXEC xp_readerrorlog 0, 1, N'Error', N'Severity: 17';

Production Scenario: Zero-Downtime Index Deploy

Large table — Orders 500M rows, need index on CustomerID:

  CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
ON dbo.Orders (CustomerID)
WITH (ONLINE = ON, MAXDOP = 4, SORT_IN_TEMPDB = ON);

Monitor redo during AG sync. Schedule during low traffic. Pre-notify app team of possible duration estimates from staging test.

Cloud and Hybrid Ops

Platform	Ops Notes
Azure SQL Database	Microsoft patches OS/SQL; you tune indexes, RDT
Azure SQL MI	Linked to AG-like SLA; verify backup retention
SQL on VM	You patch — treat like on-prem with Azure backup

Use Azure Monitor alerts for DTU/storage/CPU — complement with Query Store insights.

Handoff to Development Teams

Provide developers:

Query Store access (read-only) for their database
Guidelines — no NOLOCK, parameterize SQL, avoid SELECT *
Staging environment mirroring production cardinality (masked data)
Escalation path for slow query tickets with query_hash attached

Production excellence is repetitive discipline — backups verified, patches tested, runbooks practiced, and metrics reviewed before users notice pain.

Query Store Deep Dive

Introduction to MongoDB

Production Operations

Operational Pillars link

Service Level Objectives link

Patch and CU Strategy link

Capacity Planning link

CPU and Memory Headroom link

On-Call Runbook Template link

Severity 1: Database Unavailable link

Severity 2: Severe Slowness link

Severity 3: Failed Backup Job link

Automation Essentials link

Change Management link

Incident Response: Corruption Suspected link

Security Operations link

Documentation Inventory link

Common Operational Mistakes link

Performance in Operations link

Production Scenario: Zero-Downtime Index Deploy link

Cloud and Hybrid Ops link

Handoff to Development Teams link