Running SQL Server in production extends beyond backups and indexes — it requires patch discipline, capacity planning, on-call runbooks, automation, and alignment with business SLAs. This page consolidates operational patterns for enterprise DBAs.

Operational Pillars

  Reliability ─── Backups, HA, tested DR
Performance ── Baselines, Query Store, proactive tuning
Security ───── Least privilege, patching, auditing
Observability ─ Alerts, dashboards, runbooks
Change control ─ SSDT, flyway, staged deploys
  

Weakness in any pillar surfaces during incidents — invest proportionally.

Service Level Objectives

Define measurable targets with stakeholders:

Metric Example Target Measurement
Availability 99.95% monthly AG health, app synthetic checks
RPO 15 minutes Log backup frequency
RTO 1 hour Last DR drill time
P95 query latency < 200 ms Query Store, APM
Backup success 100% Agent job history

Document exceptions (maintenance windows) in SLA appendix.

Patch and CU Strategy

SQL Server releases Cumulative Updates (CUs) — not every CU is mandatory, but security fixes are:

  1. Subscribe to SQL Server release services
  2. Test CU on staging — full regression + Query Store comparison
  3. Apply during maintenance window — AG failover patch secondary first
  4. Verify @@VERSION and critical DMVs post-patch
  SELECT @@VERSION, SERVERPROPERTY('ProductLevel') AS product_level,
       SERVERPROPERTY('ProductUpdateLevel') AS update_level;
  

Never run unsupported versions — extended support end dates drive upgrade projects.

Capacity Planning

Monitor growth trends monthly:

  -- Database size trend
SELECT
    DB_NAME(database_id) AS database_name,
    SUM(size * 8 / 1024) AS size_mb
FROM sys.master_files
GROUP BY database_id
ORDER BY size_mb DESC;

-- Table row counts (approximate for large tables)
SELECT OBJECT_NAME(object_id) AS table_name, SUM(row_count) AS rows
FROM sys.dm_db_partition_stats
WHERE index_id IN (0, 1)
GROUP BY object_id
ORDER BY rows DESC;
  

Project disk exhaustion — alert at 80% capacity. Plan filegrowth proactively vs autogrowth storms.

CPU and Memory Headroom

  • Sustained CPU > 70% — scale up/out or tune top queries
  • PAGE LIFE EXPECTANCY < 300 — add memory or reduce cache pressure
  • Batch requests/sec plateau with rising latency — saturation signal

On-Call Runbook Template

Severity 1: Database Unavailable

  1. Verify scope — single app vs entire instance
  2. Check AG listener connectivity and quorum
  3. Review SQL error log — EXEC xp_readerrorlog 0, 1
  4. Check disk space on data/log/backup volumes
  5. Failover to secondary if primary hardware failure (document approval)
  6. Communicate ETA to stakeholders every 15 minutes

Severity 2: Severe Slowness

  1. sp_WhoIsActive or Activity Monitor — blocking chain
  2. Top waits — sys.dm_os_wait_stats
  3. Query Store — regressed queries last hour
  4. Kill runaway session only with approval
  5. Mitigate — enable Resource Governor throttle, route reads to secondary

Severity 3: Failed Backup Job

  1. Agent job history — error message
  2. Disk space on backup target
  3. VSS provider issues (if snapshot backup)
  4. Run manual backup; investigate before next window

Store runbooks in wiki with copy-paste SQL — reduce thinking time at 3 AM.

Automation Essentials

Task Tool
Index/stats maintenance Ola Hallengren scripts
Backup verification Restore to test instance via Agent
Deployment SSDT dacpac, Flyway, Liquibase
Configuration drift dbatools, PowerShell DSC
Monitoring SCOM, Azure Monitor, Datadog, Prometheus exporter

Example Agent job step — post-backup verify:

  RESTORE VERIFYONLY FROM DISK = @backup_path;
IF @@ERROR <> 0
    EXEC msdb.dbo.sp_send_dbmail
        @profile_name = N'DBA_Alerts',
        @recipients = N'[email protected]',
        @subject = N'BACKUP VERIFY FAILED',
        @body = @backup_path;
  

Change Management

Production schema changes flow:

  1. Developer submits SSDT/SQL script PR
  2. Review — execution plan impact, lock risk, rollback script
  3. Deploy staging — load test
  4. Deploy production — off-peak, with SET LOCK_TIMEOUT
  5. Post-deploy — Query Store comparison, smoke tests
  -- Safer column add (SQL 2016+ online)
ALTER TABLE dbo.Orders ADD Notes NVARCHAR(500) NULL
WITH (ONLINE = ON);
  

Always ship rollback script — even if “we never roll back.”

Incident Response: Corruption Suspected

  DBCC CHECKDB (MyApp) WITH NO_INFOMSGS, ALL_ERRORMSGS;
-- Or PHYSICAL_ONLY for quick check
DBCC CHECKDB (MyApp) WITH PHYSICAL_ONLY;
  

If errors found:

  1. RESTORE ... WITH RECOVERY from last clean backup
  2. Apply logs to point before corruption
  3. Root cause — storage, firmware, forced shutdown
  4. Open vendor ticket with CHECKDB output

Enable page checksums on new databases:

  ALTER DATABASE MyApp SET PAGE_VERIFY CHECKSUM;
  

Security Operations

  • Rotate SQL login passwords quarterly (automated vault)
  • Review sysadmin membership monthly
  • Audit failed logins — brute force detection
  • Disable unused features — xp_cmdshell, OLE automation unless required
  SELECT name, is_disabled FROM sys.server_principals WHERE name = N'sa';
-- Disable sa if using Windows auth only
ALTER LOGIN sa DISABLE;
  

Documentation Inventory

Maintain for every production instance:

Document Contents
Instance catalog Name, version, edition, owner, criticality
Connection map Apps, listeners, connection pools
Backup/restore runbook Paths, schedules, last restore test date
AG diagram Replicas, quorum, failover steps
Escalation matrix L1 app team → L2 DBA → L3 vendor

Common Operational Mistakes

Mistake Consequence
No DR drill in 12 months RTO unknown until disaster
Shared sa password in vault No accountability
Manual changes without scripts Drift, unreproducible state
Alert fatigue Real incidents missed
Shrinking databases routinely Fragmentation, performance loss

Performance in Operations

Weekly DBA checklist:

  • Backup success rate 100%
  • AG synchronization healthy
  • Disk space > 20% free on all volumes
  • Top 5 Query Store regressions reviewed
  • Long-running Agent jobs within SLA
  • Error log scanned for severity 17+
  EXEC xp_readerrorlog 0, 1, N'Error', N'Severity: 17';
  

Production Scenario: Zero-Downtime Index Deploy

Large table — Orders 500M rows, need index on CustomerID:

  CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
ON dbo.Orders (CustomerID)
WITH (ONLINE = ON, MAXDOP = 4, SORT_IN_TEMPDB = ON);
  

Monitor redo during AG sync. Schedule during low traffic. Pre-notify app team of possible duration estimates from staging test.

Cloud and Hybrid Ops

Platform Ops Notes
Azure SQL Database Microsoft patches OS/SQL; you tune indexes, RDT
Azure SQL MI Linked to AG-like SLA; verify backup retention
SQL on VM You patch — treat like on-prem with Azure backup

Use Azure Monitor alerts for DTU/storage/CPU — complement with Query Store insights.

Handoff to Development Teams

Provide developers:

  • Query Store access (read-only) for their database
  • Guidelines — no NOLOCK, parameterize SQL, avoid SELECT *
  • Staging environment mirroring production cardinality (masked data)
  • Escalation path for slow query tickets with query_hash attached

Production excellence is repetitive discipline — backups verified, patches tested, runbooks practiced, and metrics reviewed before users notice pain.