Running MongoDB in production requires operational discipline beyond configuration — backups, monitoring, upgrade procedures, and incident response. This guide covers the runbooks and practices that keep production databases reliable.

Operational Pillars

  Reliability = Backups + Monitoring + Change Management + Incident Response
  

Every production deployment needs documented procedures for each pillar before going live.

Backup Strategies

On Atlas, enable continuous cloud backup with point-in-time recovery (available on M10+ clusters) and test restores quarterly.

Self-Hosted: mongodump

Logical backup — portable, slower on large datasets:

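# Full backup; --oplog also captures writes made during the dump for a consistent snapshot
mongodump --uri="mongodb://host:27017" \
  --oplog --out=/backup/$(date +%Y%m%d) \
  --gzip

# Single database (--oplog cannot be combined with --db)
mongodump --db=myapp --out=/backup/myapp

# Restore (--gzip is required because the dump above was compressed)
mongorestore --uri="mongodb://host:27017" \
  --gzip --oplogReplay /backup/20240613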
  

Schedule with cron:

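# A crontab entry must fit on one line (no backslash continuation); MONGO_URI must be defined in the crontab environment
0 2 * * * mongodump --uri="$MONGO_URI" --oplog --gzip --out=/backup/$(date +\%Y\%m\%d) && find /backup -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} +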
  

Filesystem Snapshots (WiredTiger)

Fast, consistent snapshots when MongoDB is quiesced:

  # Lock database for snapshot (brief pause)
mongosh --eval 'db.fsyncLock()'

# Take EBS/disk snapshot
aws ec2 create-snapshot --volume-id vol-abc123

# Unlock
mongosh --eval 'db.fsyncUnlock()'
  

Use LVM or cloud snapshots for large datasets where mongodump is too slow.

Backup Checklist

  • Daily automated backups
  • Oplog captured for point-in-time recovery
  • Backups stored off-site (different region/account)
  • Restore tested quarterly
  • Backup encryption at rest
  • RTO and RPO documented
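
The RPO item in the checklist can be verified mechanically. The following is a minimal sketch, assuming backup completion times are available as Date objects; backupWithinRpo and its parameters are hypothetical names for illustration, not part of any MongoDB tool.

```javascript
// Hypothetical helper: reports whether the newest backup satisfies a given RPO.
// backupTimes: array of Date objects (completion times of successful backups).
// now: the Date to measure against. rpoHours: tolerated data-loss window in hours.
function backupWithinRpo(backupTimes, now, rpoHours) {
  if (backupTimes.length === 0) {
    // No backup at all can never satisfy an RPO.
    return { ok: false, ageHours: Infinity };
  }
  const newest = Math.max(...backupTimes.map((t) => t.getTime()));
  const ageHours = (now.getTime() - newest) / 3600000; // ms per hour
  return { ok: ageHours <= rpoHours, ageHours };
}
```

With oplog-based point-in-time recovery the effective RPO may be shorter than the age of the last full backup; this sketch only covers the full-backup case.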

Monitoring Stack

Key Metrics

Metric          | Alert Threshold           | Action
----------------|---------------------------|---------------------------------
Replication lag | > 60 seconds              | Check network, disk, write load
Connections     | > 80% of max              | Add pooling, scale
Opcounters      | Sudden drop               | Check application, network
Cache usage     | > 90% full or > 20% dirty | Increase cache or reduce writes
Disk usage      | > 80%                     | Archive, expand, or shard
Page faults     | Sustained high            | Working set exceeds RAM; add memory
Queue length    | > 10                      | Check lock contention, slow queries
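
Several of these thresholds can be checked directly against a db.serverStatus() document. The following is a rough sketch, not a monitoring product: evaluateServerStatus and the limits object are hypothetical names, and the cache check interprets the cache row as a fill ratio of the configured WiredTiger cache, which is an assumption.

```javascript
// Hypothetical helper: returns the names of thresholds that a
// serverStatus-shaped document exceeds. Limits are illustrative defaults.
function evaluateServerStatus(status, limits = { connections: 0.8, queue: 10, cacheFill: 0.9 }) {
  const alerts = [];

  // "available" is the number of connections still free, so
  // utilization is current / (current + available).
  const conn = status.connections;
  const connRatio = Number(conn.current) / (Number(conn.current) + Number(conn.available));
  if (connRatio > limits.connections) alerts.push("connections");

  // Operations queued waiting for a lock.
  if (Number(status.globalLock.currentQueue.total) > limits.queue) alerts.push("queue");

  // Share of the configured WiredTiger cache currently in use.
  const cache = status.wiredTiger.cache;
  const fill = Number(cache["bytes currently in the cache"]) / Number(cache["maximum bytes configured"]);
  if (fill > limits.cacheFill) alerts.push("cacheFill");

  return alerts;
}
```

In mongosh this could be called as evaluateServerStatus(db.serverStatus()); an empty array means none of the three checks fired.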

mongostat and mongotop

  # Throughput every 5 seconds
mongostat --uri="$MONGO_URI" 5

# Collection-level I/O
mongotop --uri="$MONGO_URI" 5
  

Server Status

const status = db.serverStatus();

// Connections
status.connections  // { current, available, totalCreated }

// Operation counters
status.opcounters  // cumulative insert, query, update, delete counts since startup

// WiredTiger cache
status.wiredTiger.cache

// Replication
status.repl  // set name, hosts, primary; use rs.status() for member state and lag
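
Because opcounters are cumulative totals, a per-second rate needs two samples. A minimal sketch, assuming each sample is a serverStatus document containing opcounters and uptimeMillis; opcounterRates is a hypothetical helper name.

```javascript
// Hypothetical helper: per-second operation rates between two
// serverStatus samples taken from the same mongod process.
function opcounterRates(earlier, later) {
  // Number() coerces in case the shell returns 64-bit integer wrappers.
  const seconds = (Number(later.uptimeMillis) - Number(earlier.uptimeMillis)) / 1000;
  if (seconds <= 0) {
    throw new Error("samples are out of order or the server restarted");
  }
  const rates = {};
  for (const op of Object.keys(later.opcounters)) {
    const before = Number(earlier.opcounters[op] || 0);
    rates[op] = (Number(later.opcounters[op]) - before) / seconds;
  }
  return rates;
}
```

A sudden drop in these rates between consecutive sampling intervals is the condition the "Opcounters" row of the metrics table refers to.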
  

Prometheus Integration

Use MongoDB Exporter for Prometheus/Grafana:

# Percona MongoDB Exporter (Atlas clusters can use the Atlas Prometheus integration instead)
docker run -d -p 9216:9216 \
  -e MONGODB_URI="$MONGO_URI" \
  percona/mongodb_exporter:0.40
  

Grafana dashboard templates are available for this exporter; import a MongoDB overview dashboard as a starting point.

Alerting Rules

Configure alerts for:

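# Example Prometheus alert rules (metric names vary by exporter version; check your /metrics output)
- alert: MongoDBReplicationLag
  expr: mongodb_mongod_replset_member_replication_lag > 60
  for: 5m

- alert: MongoDBHighConnections
  # "available" counts free connections, so utilization is current / (current + available)
  expr: mongodb_connections{state="current"} / ignoring(state) sum without(state) (mongodb_connections{state=~"current|available"}) > 0.8
  for: 10m

- alert: MongoDBDiskSpaceLow
  # Requires node_exporter; the mountpoint is an assumption, use your dbPath volume
  expr: (1 - node_filesystem_avail_bytes{mountpoint="/var/lib/mongodb"} / node_filesystem_size_bytes{mountpoint="/var/lib/mongodb"}) > 0.8
  for: 15m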
  

Route to PagerDuty/Slack for production incidents.

Upgrade Procedures

Rolling Upgrade (Replica Set)

Upgrade one member at a time — secondaries first, primary last:

# 1. Upgrade secondary
sudo systemctl stop mongod
# Pin the mongodb-org-* component packages to the same version as well
sudo apt-get install -y mongodb-org=7.0.8
sudo systemctl start mongod
# Wait until the member reports SECONDARY in rs.status()

# 2. Repeat for other secondaries

# 3. Step down primary (it will not seek re-election for 120 seconds)
mongosh --eval 'rs.stepDown(120)'

# 4. Upgrade former primary
sudo systemctl stop mongod
sudo apt-get install -y mongodb-org=7.0.8
sudo systemctl start mongod
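
The ordering rule above (secondaries first, primary last) can be derived from the members array of rs.status(). This is a sketch under that assumption; rollingUpgradeOrder and readyForNextMember are hypothetical helper names, not shell built-ins.

```javascript
// Hypothetical helper: member names in rolling-upgrade order,
// non-primary members first and the current primary last.
function rollingUpgradeOrder(members) {
  const primary = members.filter((m) => m.stateStr === "PRIMARY");
  const others = members.filter((m) => m.stateStr !== "PRIMARY");
  return [...others, ...primary].map((m) => m.name);
}

// Hypothetical helper: true when every member is healthy and in a
// steady state, i.e. it is reasonable to move on to the next member.
function readyForNextMember(members) {
  const steady = ["PRIMARY", "SECONDARY", "ARBITER"];
  return members.every((m) => m.health === 1 && steady.includes(m.stateStr));
}
```

In mongosh, rollingUpgradeOrder(rs.status().members) would list the hosts to upgrade in sequence, and readyForNextMember can be polled after each restart.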
  

Set featureCompatibilityVersion after all members are upgraded (MongoDB 7.0 and later require confirm: true):

db.adminCommand({ setFeatureCompatibilityVersion: "7.0", confirm: true })
  

Upgrade Checklist

  • Read release notes for breaking changes
  • Test upgrade in staging with production data copy
  • Backup before upgrade
  • Upgrade secondaries before primary
  • Set FCV after all members on new version
  • Verify that the application's driver version is compatible with the new server version

Capacity Planning

When to Scale Vertically

  • Working set growing beyond RAM (increasing page faults)
  • CPU sustained > 70% on primary
  • Disk I/O saturation

When to Scale Horizontally

  • Write throughput exceeds single primary
  • Storage approaching node limits
  • Need geographic distribution

Growth Projection Template

Current: 500 GB data, 2K writes/sec, 10K reads/sec
Growth:  20% per quarter
Action:  Shard at 1.5 TB or 5K writes/sec (whichever first)
Budget:  M30 → M40 upgrade in Q3
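
The template's numbers can be turned into a lead time. The sketch below compounds growth quarter by quarter until a threshold is reached; quartersUntil is a hypothetical helper, and applying the same 20% rate to write throughput is an assumption, since the template states a single growth figure.

```javascript
// Hypothetical helper: number of whole quarters until `current` reaches
// `threshold` when it grows by `growthPerQuarter` (0.2 = 20%) each quarter.
function quartersUntil(current, threshold, growthPerQuarter) {
  if (current >= threshold) return 0;
  if (growthPerQuarter <= 0) return Infinity; // never reached without growth
  let value = current;
  let quarters = 0;
  while (value < threshold) {
    value *= 1 + growthPerQuarter;
    quarters += 1;
  }
  return quarters;
}

const dataQuarters = quartersUntil(500, 1500, 0.2);   // GB of data -> 7
const writeQuarters = quartersUntil(2000, 5000, 0.2); // writes/sec -> 6
```

Under these assumptions the write threshold is reached first, after 6 quarters, so that is the date the sharding work would need to be finished by.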
  

Review capacity quarterly with 6-month forward projection.

Incident Response Runbooks

Primary Unreachable

1. Check rs.status(): is an election in progress?
2. Verify network connectivity to all members
3. Check disk space on primary host
4. If hardware failure: the remaining majority elects a new primary automatically; force a reconfiguration only if a majority of members is lost
5. Restore failed node or replace with new member via rs.add()
6. Post-incident: review replication lag alerts
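
The first runbook steps amount to reading rs.status() and asking whether a primary exists and whether enough members are up to elect one. A minimal sketch follows; replicaSetSummary is a hypothetical name, and it assumes every member is a voting member, which may not hold for larger or customized sets.

```javascript
// Hypothetical helper: summarizes the `members` array from rs.status().
// Assumes all members vote; non-voting members would change the majority.
function replicaSetSummary(members) {
  const healthy = members.filter((m) => m.health === 1);
  const primary = healthy.find((m) => m.stateStr === "PRIMARY");
  const majority = Math.floor(members.length / 2) + 1;
  return {
    primary: primary ? primary.name : null,
    healthyCount: healthy.length,
    canElectPrimary: healthy.length >= majority,
    unreachable: members.filter((m) => m.health !== 1).map((m) => m.name),
  };
}
```

If canElectPrimary is false, automatic failover cannot complete and the manual recovery path in step 4 applies.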
  

Database Locked / Slow Queries

  1. db.currentOp({ active: true, secs_running: { $gt: 10 } })
2. Identify long-running queries — missing index? COLLSCAN?
3. db.killOp(opid) for runaway queries (with caution)
4. Check globalLock.currentQueue
5. Review recent deployments or schema changes
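
Steps 1 to 3 can be combined into a small triage helper that works on the inprog array returned by db.currentOp(). This is an illustrative sketch; findSuspectOps is a hypothetical name, and it only reports candidates, leaving the db.killOp() decision to the operator.

```javascript
// Hypothetical helper: long-running active operations, slowest first,
// flagged when their plan summary indicates a collection scan.
function findSuspectOps(inprog, minSeconds = 10) {
  return inprog
    .filter((op) => op.active && (op.secs_running || 0) > minSeconds)
    .map((op) => ({
      opid: op.opid,
      ns: op.ns,
      secsRunning: op.secs_running,
      collectionScan:
        typeof op.planSummary === "string" && op.planSummary.includes("COLLSCAN"),
    }))
    .sort((a, b) => b.secsRunning - a.secsRunning);
}
```

In mongosh this could be called as findSuspectOps(db.currentOp({ active: true }).inprog); entries with collectionScan set to true are the likely missing-index cases.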
  

Disk Full

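1. Emergency: db.runCommand({ compact: "collection" }) returns unused space to the OS; last resort, can block operations on that collection
2. Delete old data: db.logs.deleteMany({ createdAt: { $lt: cutoff } }), where cutoff is a Date; this frees space for reuse inside the data files but does not shrink them
3. Archive to cold storage
4. Expand disk volume (cloud) or add shard
5. Post-incident: configure disk alerts at 70%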
  

Oplog Window Too Small

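1. Check oplog size and window: db.getReplicationInfo()
2. Resize online on each member with db.adminCommand({ replSetResizeOplog: 1, size: <MB> }); oplogSizeMB in the config only sets the initial size
3. For change streams: ensure consumer keeps up
4. Atlas: adjust the oplog size or minimum oplog window in the cluster configuration if the default is too small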
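
To pick a new oplog size, the current window can be scaled to the desired one. The sketch below assumes the write rate stays roughly constant and uses the usedMB and timeDiff (seconds) fields of db.getReplicationInfo(); oplogSizeForWindow is a hypothetical helper name.

```javascript
// Hypothetical helper: estimates the oplog size (MB) needed to cover
// `targetHours`, extrapolating linearly from the current oplog contents.
function oplogSizeForWindow(info, targetHours) {
  const currentHours = Number(info.timeDiff) / 3600; // timeDiff is in seconds
  if (currentHours <= 0) {
    throw new Error("oplog window is empty; cannot extrapolate");
  }
  const mbPerHour = Number(info.usedMB) / currentHours;
  return {
    currentHours,
    requiredMB: Math.ceil(mbPerHour * targetHours),
  };
}
```

Write bursts shorten the real window, so treat requiredMB as a lower bound and add headroom before resizing.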
  

Maintenance Windows

Schedule regular maintenance:

Task                       | Frequency
---------------------------|-------------------
Backup restore test        | Quarterly
Failover drill             | Quarterly
Index review ($indexStats) | Monthly
Slow query review          | Weekly
Version patch              | Monthly (security)
Capacity review            | Quarterly
User access audit          | Quarterly

Log Management

  # mongod.conf
systemLog:
  destination: file
  path: /var/log/mongodb/mongod.log
  logAppend: true
  logRotate: reopen
  verbosity: 0
  component:
    query:
      verbosity: 1  # temporary for debugging
  

Rotate logs with logrotate, then ship them to centralized logging (ELK, CloudWatch, Datadog):

  # Log rotation
/var/log/mongodb/mongod.log {
  daily
  rotate 14
  compress
  postrotate
    /bin/kill -SIGUSR1 $(cat /var/lib/mongodb/mongod.lock)
  endscript
}
  

Security Operations

  • Rotate database passwords quarterly
  • Review IP access lists / firewall rules monthly
  • Patch MongoDB within 30 days of security releases
  • Audit user roles — remove departed team members immediately
  • Enable encryption at rest and in transit (verify, don’t assume)

Production Deployment Patterns

Blue-Green Database Migration

  1. Deploy green cluster (new version or new region)
2. Seed the green cluster via mongodump/mongorestore or a replica set initial sync
3. Enable change stream sync for delta
4. Cutover application connection string
5. Decommission blue after validation period
  

Multi-Environment Strategy

Production:  Atlas M30+, 3-region, cloud backup
Staging:     Atlas M10, same version as production
Development: Atlas M0/M2 or local Docker replica set
  

Never test destructive operations on production.

Common Operational Mistakes

  • No tested backup restore procedure
  • Upgrading primary first — causes unnecessary downtime
  • Ignoring replication lag until reads return stale data
  • Running maintenance without disabling balancer (sharded)
  • No connection pooling — exhausting file descriptors
  • Profiler left at level 2 in production
  • Skipping FCV update after version upgrade

Troubleshooting Commands Reference

  // Health check
db.runCommand({ ping: 1 })
rs.status()
sh.status()  // sharded

// Performance
db.serverStatus()
db.currentOp({ active: true })
db.setProfilingLevel(1, { slowms: 100 })

// Storage
db.stats()
db.collection.stats(1024 * 1024)  // MB

// Replication
rs.printSecondaryReplicationInfo()
db.getReplicationInfo()
  

Best Practices

  1. Automate backups — manual backups get forgotten
  2. Test failover before you need it
  3. Document every production change in a changelog
  4. Use infrastructure as code for Atlas (Terraform provider)
  5. Maintain staging environment mirroring production topology
  6. Set up on-call rotation with runbook access
  7. Review MongoDB security advisories monthly

What Comes Next

You now have the full MongoDB learning path — from document basics through production operations. Apply these patterns iteratively as your workload grows.