MongoDB Production Operations

Running MongoDB in production requires operational discipline beyond configuration — backups, monitoring, upgrade procedures, and incident response. This guide covers the runbooks and practices that keep production databases reliable.

Operational Pillars

  Reliability = Backups + Monitoring + Change Management + Incident Response

Every production deployment needs documented procedures for each pillar before going live.

Backup Strategies

Atlas (Recommended)

Continuous cloud backup with point-in-time recovery — enable on M10+ clusters. Test restores quarterly.

Self-Hosted: mongodump

Logical backup — portable, slower on large datasets:

  # Full backup; --oplog also captures writes made during the dump, so the result is consistent as of the dump's end
mongodump --uri="mongodb://host:27017" \
  --oplog --out=/backup/$(date +%Y%m%d) \
  --gzip

# Single database
mongodump --db=myapp --out=/backup/myapp

# Restore (--gzip is required because the dump was compressed)
mongorestore --uri="mongodb://host:27017" \
  --gzip --oplogReplay /backup/20240613

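Schedule with cron. A crontab entry must fit on one line, and MONGO_URI has to be defined in the crontab itself:

  0 2 * * * mongodump --uri="$MONGO_URI" --oplog --gzip --out=/backup/$(date +\%Y\%m\%d) && find /backup -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} +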
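The retention rule in the cron entry is easier to review when written out as logic. The following is a minimal sketch, assuming backup directories are named YYYYMMDD as produced by the date call above; the function name expiredBackups and its parameters are hypothetical and not part of any MongoDB tool.

```javascript
// Hypothetical helper: given backup directory names (YYYYMMDD), return the
// ones that are more than keepDays old relative to `today`.
function expiredBackups(names, today, keepDays) {
  const msPerDay = 24 * 60 * 60 * 1000;
  return names.filter((name) => {
    const m = /^(\d{4})(\d{2})(\d{2})$/.exec(name);
    if (!m) return false; // never select directories that do not match the pattern
    const taken = Date.UTC(Number(m[1]), Number(m[2]) - 1, Number(m[3]));
    const ageDays = Math.floor((today.getTime() - taken) / msPerDay);
    return ageDays > keepDays;
  });
}

const today = new Date(Date.UTC(2024, 5, 13)); // 2024-06-13
console.log(expiredBackups(["20240601", "20240610", "20240613", "tmp"], today, 7));
// prints [ '20240601' ]
```

Matching on the name pattern first means a stray directory under /backup is never selected for deletion.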

Filesystem Snapshots (WiredTiger)

Fast, consistent snapshots when MongoDB is quiesced:

  # Lock database for snapshot (brief pause)
mongosh --eval 'db.fsyncLock()'

# Take EBS/disk snapshot
aws ec2 create-snapshot --volume-id vol-abc123

# Unlock
mongosh --eval 'db.fsyncUnlock()'

Use LVM or cloud snapshots for large datasets where mongodump is too slow.
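The risk in the lock, snapshot, unlock sequence is a failed snapshot step that leaves the database locked. One way to guard against that is sketched below. It assumes a db object exposing fsyncLock() and fsyncUnlock(), as the mongosh db object does; takeSnapshot is a hypothetical callback standing in for the disk snapshot command, and fakeDb exists only so the sketch runs outside mongosh.

```javascript
// Sketch: run a snapshot step while writes are locked, and always unlock,
// even when the snapshot step throws.
function snapshotWithLock(db, takeSnapshot) {
  db.fsyncLock();
  try {
    return takeSnapshot();
  } finally {
    db.fsyncUnlock(); // runs on success and on failure
  }
}

// Stand-in for the real db object (an assumption for illustration only).
const calls = [];
const fakeDb = {
  fsyncLock: () => calls.push("lock"),
  fsyncUnlock: () => calls.push("unlock"),
};
try {
  snapshotWithLock(fakeDb, () => {
    throw new Error("snapshot failed");
  });
} catch (err) {
  calls.push("error: " + err.message);
}
console.log(calls.join(", ")); // prints: lock, unlock, error: snapshot failed
```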

Backup Checklist

Daily automated backups
Oplog captured for point-in-time recovery
Backups stored off-site (different region/account)
Restore tested quarterly
Backup encryption at rest
RTO and RPO documented

Monitoring Stack

Key Metrics

Metric	Alert Threshold	Action
Replication lag	> 60 seconds	Check network, disk, write load
Connections	> 80% of max	Add pooling, scale
Opcounters	Sudden drop	Check application, network
Cache usage	> 20% dirty (or > 95% full)	Increase cache or reduce writes
Disk usage	> 80%	Archive, expand, or shard
Page faults	Sustained high	Working set exceeds RAM
Queue length	> 10	Lock contention, slow queries

mongostat and mongotop

  # Throughput every 5 seconds
mongostat --uri="$MONGO_URI" 5

# Collection-level I/O
mongotop --uri="$MONGO_URI" 5

Server Status

  const status = db.serverStatus();

// Connections
status.connections  // { current, available, totalCreated }

// Operation counters
status.opcounters  // insert, query, update, delete, getmore, command: cumulative counts since startup (diff two samples for a rate)

// WiredTiger cache
status.wiredTiger.cache

// Replication
status.repl  // set name, hosts, current primary; use rs.status() for member state and lag
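The raw serverStatus fields become alertable once they are turned into ratios. The sketch below assumes the connections and wiredTiger.cache field names as they appear in serverStatus output; the function name summarizeStatus and the sample numbers are made up for illustration.

```javascript
// Sketch: derive ratios from a serverStatus() result.
function summarizeStatus(status) {
  const { current, available } = status.connections;
  const cache = status.wiredTiger.cache;
  const max = cache["maximum bytes configured"];
  return {
    // current + available is the configured connection limit
    connectionUse: current / (current + available),
    cacheUse: cache["bytes currently in the cache"] / max,
    cacheDirty: cache["tracked dirty bytes in the cache"] / max,
  };
}

// Sample shaped like serverStatus(); the numbers are invented.
const sample = {
  connections: { current: 400, available: 600 },
  wiredTiger: {
    cache: {
      "maximum bytes configured": 1000,
      "bytes currently in the cache": 800,
      "tracked dirty bytes in the cache": 50,
    },
  },
};
console.log(summarizeStatus(sample));
// prints { connectionUse: 0.4, cacheUse: 0.8, cacheDirty: 0.05 }
```

In mongosh the same function can be called as summarizeStatus(db.serverStatus()).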

Prometheus Integration

Use MongoDB Exporter for Prometheus/Grafana:

  # Percona MongoDB Exporter (community-maintained; pin an image tag you have tested)
docker run -d -p 9216:9216 \
  -e MONGODB_URI="$MONGO_URI" \
  percona/mongodb_exporter:0.40

Dashboard templates are available for Grafana; import a MongoDB overview dashboard as a starting point.

Alerting Rules

Configure alerts for:

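  # Example Prometheus alert rules (metric names depend on the exporter version; check your /metrics output)
- alert: MongoDBReplicationLag
  expr: mongodb_mongod_replset_member_replication_lag > 60
  for: 5m

# current / (current + available) is the fraction of the connection limit in use
- alert: MongoDBHighConnections
  expr: mongodb_connections{state="current"} / ignoring(state) (mongodb_connections{state="current"} + ignoring(state) mongodb_connections{state="available"}) > 0.8
  for: 10m

# Filesystem usage from node_exporter; the mountpoint is an assumption, set it to your dbPath volume
- alert: MongoDBDiskSpaceLow
  expr: 1 - node_filesystem_avail_bytes{mountpoint="/var/lib/mongodb"} / node_filesystem_size_bytes{mountpoint="/var/lib/mongodb"} > 0.8
  for: 15m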

Route to PagerDuty/Slack for production incidents.

Upgrade Procedures

Rolling Upgrade (Replica Set)

Upgrade one member at a time — secondaries first, primary last:

  # 1. Upgrade secondary
sudo systemctl stop mongod
sudo apt-get install -y mongodb-org=7.0.8
sudo systemctl start mongod
# Wait until rs.status() reports this member as SECONDARY before continuing

# 2. Repeat for other secondaries

# 3. Step down primary
mongosh --eval 'rs.stepDown(120)'

# 4. Upgrade former primary
sudo systemctl stop mongod
sudo apt-get install -y mongodb-org=7.0.8
sudo systemctl start mongod

Set featureCompatibilityVersion once all members run the new version and the deployment has proven stable (MongoDB 7.0 requires confirm: true):

  db.adminCommand({ setFeatureCompatibilityVersion: "7.0", confirm: true })
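The member ordering for a rolling upgrade (secondaries first, primary last) can be derived from rs.status() instead of being tracked by hand. This is a sketch under the assumption that members carry the name, stateStr, and health fields that rs.status().members reports; upgradeOrder is a hypothetical helper and the host names are invented.

```javascript
// Sketch: derive a rolling-upgrade order from rs.status().members.
// Non-primary members come first, the primary last.
function upgradeOrder(members) {
  const unhealthy = members.filter((m) => m.health !== 1);
  if (unhealthy.length > 0) {
    // Upgrading while a member is down risks losing the majority.
    throw new Error("fix unhealthy members first: " + unhealthy.map((m) => m.name).join(", "));
  }
  const primary = members.filter((m) => m.stateStr === "PRIMARY");
  const others = members.filter((m) => m.stateStr !== "PRIMARY");
  return [...others, ...primary].map((m) => m.name);
}

const members = [
  { name: "db1:27017", stateStr: "PRIMARY", health: 1 },
  { name: "db2:27017", stateStr: "SECONDARY", health: 1 },
  { name: "db3:27017", stateStr: "SECONDARY", health: 1 },
];
console.log(upgradeOrder(members));
// prints [ 'db2:27017', 'db3:27017', 'db1:27017' ]
```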

Upgrade Checklist

Read release notes for breaking changes
Test upgrade in staging with production data copy
Backup before upgrade
Upgrade secondaries before primary
Set FCV after all members on new version
Verify application compatibility with new driver version

Capacity Planning

When to Scale Vertically

Working set growing beyond RAM (increasing page faults)
CPU sustained > 70% on primary
Disk I/O saturation

When to Scale Horizontally

Write throughput exceeds single primary
Storage approaching node limits
Need geographic distribution

Growth Projection Template

  Current: 500 GB data, 2K writes/sec, 10K reads/sec
Growth:  20% per quarter
Action:  Shard at 1.5 TB or 5K writes/sec (whichever first)
Budget:  M30 → M40 upgrade in Q3

Review capacity quarterly with 6-month forward projection.
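The template's numbers can be turned into a date for the sharding decision with a few lines of arithmetic. The sketch below assumes growth compounds once per quarter at the stated rate; quartersUntil is a hypothetical helper name.

```javascript
// Sketch: count whole quarters until a metric compounding at `rate`
// per quarter reaches `limit`.
function quartersUntil(current, limit, rate) {
  if (current >= limit) return 0;
  if (rate <= 0) return Infinity; // never reaches the limit
  let quarters = 0;
  let value = current;
  while (value < limit) {
    value *= 1 + rate;
    quarters += 1;
  }
  return quarters;
}

const storage = quartersUntil(500, 1500, 0.2); // GB
const writes = quartersUntil(2000, 5000, 0.2); // writes/sec
console.log({ storage, writes, shardIn: Math.min(storage, writes) });
// prints { storage: 7, writes: 6, shardIn: 6 }
```

With the figures from the template, write throughput crosses its threshold one quarter before storage does, so the write limit drives the schedule.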

Incident Response Runbooks

Primary Unreachable

  1. Check rs.status() — is election in progress?
2. Verify network connectivity to all members
3. Check disk space on primary host
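4. If the primary's hardware failed, the remaining members elect a new primary on their own while a majority is reachable; if the majority is lost, force a reconfiguration with rs.reconfig(cfg, { force: true }) as a last resort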
5. Restore failed node or replace with new member via rs.add()
6. Post-incident: review replication lag alerts
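Whether an election can succeed at all depends on how many members are still reachable. The check below is a sketch that assumes every entry in rs.status().members holds one vote (consult rs.conf() if you run non-voting members); hasMajority is a hypothetical helper.

```javascript
// Sketch: can the replica set still elect a primary?
// Assumption: each member listed has exactly one vote.
function hasMajority(members) {
  const reachable = members.filter((m) => m.health === 1).length;
  const needed = Math.floor(members.length / 2) + 1;
  return { reachable, needed, canElect: reachable >= needed };
}

console.log(hasMajority([{ health: 1 }, { health: 1 }, { health: 0 }]));
// prints { reachable: 2, needed: 2, canElect: true }
console.log(hasMajority([{ health: 1 }, { health: 0 }, { health: 0 }]));
// prints { reachable: 1, needed: 2, canElect: false }
```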

Database Locked / Slow Queries

  1. db.currentOp({ active: true, secs_running: { $gt: 10 } })
2. Identify long-running queries — missing index? COLLSCAN?
3. db.killOp(opid) for runaway queries (with caution)
4. Check db.serverStatus().globalLock.currentQueue for queued readers and writers
5. Review recent deployments or schema changes
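Steps 1 and 2 can be combined into a small triage helper. The sketch assumes the inprog array returned by db.currentOp() with its opid, active, secs_running, ns, and planSummary fields; slowOps is a hypothetical name and the sample operations are invented.

```javascript
// Sketch: pick long-running operations out of db.currentOp().inprog and
// flag the ones whose plan is a collection scan.
function slowOps(inprog, minSeconds) {
  return inprog
    .filter((op) => op.active && (op.secs_running || 0) > minSeconds)
    .map((op) => ({
      opid: op.opid,
      ns: op.ns,
      secs: op.secs_running,
      collscan: String(op.planSummary || "").includes("COLLSCAN"),
    }))
    .sort((a, b) => b.secs - a.secs); // longest-running first
}

// Sample shaped like currentOp output; values are made up.
const inprog = [
  { opid: 11, active: true, secs_running: 42, ns: "myapp.orders", planSummary: "COLLSCAN" },
  { opid: 12, active: true, secs_running: 3, ns: "myapp.users", planSummary: "IXSCAN { _id: 1 }" },
  { opid: 13, active: true, secs_running: 15, ns: "myapp.logs", planSummary: "IXSCAN { createdAt: 1 }" },
];
console.log(slowOps(inprog, 10).map((op) => op.opid)); // prints [ 11, 13 ]
```

The function only reports candidates; the decision to call db.killOp(opid) stays with the operator.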

Disk Full

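  1. Delete old data: db.logs.deleteMany({ createdAt: { $lt: cutoff } })
2. Reclaim the freed space: db.runCommand({ compact: "collection" }); treat it as a last resort, since it only releases space that deletes have already freed and can block operations on the collection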
3. Archive to cold storage
4. Expand disk volume (cloud) or add shard
5. Post-incident: configure disk alerts at 70%
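The cutoff value used in the delete filter is left undefined in the runbook. One way to compute it is sketched here, assuming a retention period expressed in days; cutoffDate and the 90-day figure are illustrative choices, not values from the runbook.

```javascript
// Sketch: compute the `cutoff` date for the deleteMany filter.
function cutoffDate(now, retentionDays) {
  const msPerDay = 24 * 60 * 60 * 1000;
  return new Date(now.getTime() - retentionDays * msPerDay);
}

const cutoff = cutoffDate(new Date("2024-06-13T00:00:00Z"), 90);
const filter = { createdAt: { $lt: cutoff } }; // pass to deleteMany
console.log(cutoff.toISOString()); // prints 2024-03-15T00:00:00.000Z
```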

Oplog Window Too Small

  1. Check oplog size: db.getReplicationInfo()
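2. Resize the oplog online on each member: db.adminCommand({ replSetResizeOplog: 1, size: 16384 }) (size in MB, no restart needed); oplogSizeMB in the config only applies when the oplog is first created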
3. For change streams — ensure consumer keeps up
4. Atlas: set the minimum oplog window or a fixed oplog size in the cluster configuration
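A target oplog size can be estimated from the current window. The sketch assumes the oplog is already full and that the write rate stays roughly constant; logSizeMB and timeDiffHours are fields of the db.getReplicationInfo() result, while requiredOplogMB and the sample numbers are hypothetical.

```javascript
// Sketch: estimate the oplog size (MB) needed to cover targetHours,
// scaling the current size by the ratio of target window to current window.
function requiredOplogMB(info, targetHours) {
  return Math.ceil((info.logSizeMB * targetHours) / info.timeDiffHours);
}

// A 2 GB oplog currently covering 6 hours, with a 24-hour target.
console.log(requiredOplogMB({ logSizeMB: 2048, timeDiffHours: 6 }, 24)); // prints 8192
```

Treat the result as a lower bound and add headroom for write bursts.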

Maintenance Windows

Schedule regular maintenance:

Task	Frequency
Backup restore test	Quarterly
Failover drill	Quarterly
Index review (`$indexStats`)	Monthly
Slow query review	Weekly
Version patch	Monthly (security)
Capacity review	Quarterly
User access audit	Quarterly

Log Management

  # mongod.conf
systemLog:
  destination: file
  path: /var/log/mongodb/mongod.log
  logAppend: true
  logRotate: reopen
  verbosity: 0
  component:
    query:
      verbosity: 1  # temporary for debugging

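Ship logs to centralized logging (ELK, CloudWatch, Datadog). Rotate the local file with logrotate; with logRotate: reopen, mongod reopens the log file when it receives SIGUSR1: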

  # Log rotation
/var/log/mongodb/mongod.log {
  daily
  rotate 14
  compress
  postrotate
    /bin/kill -SIGUSR1 $(cat /var/lib/mongodb/mongod.lock)
  endscript
}
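Before logs reach a central system, slow operations can be pulled out of the local file directly. The sketch below assumes the structured JSON log format used by MongoDB 4.4 and later, where slow operations are logged with the message "Slow query" and a durationMillis attribute; slowQueries is a hypothetical helper and the sample lines are heavily abbreviated.

```javascript
// Sketch: extract slow operations from structured (JSON) mongod log lines.
function slowQueries(lines, minMillis) {
  const hits = [];
  for (const line of lines) {
    let entry;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // skip anything that is not a JSON log line
    }
    if (!entry || typeof entry !== "object") continue;
    const attr = entry.attr || {};
    if (entry.msg === "Slow query" && attr.durationMillis >= minMillis) {
      hits.push({ at: entry.t && entry.t.$date, ns: attr.ns, ms: attr.durationMillis });
    }
  }
  return hits;
}

// Abbreviated sample lines; real entries carry many more fields.
const lines = [
  '{"t":{"$date":"2024-06-13T02:00:01.000+00:00"},"s":"I","c":"COMMAND","msg":"Slow query","attr":{"ns":"myapp.orders","durationMillis":250}}',
  '{"t":{"$date":"2024-06-13T02:00:02.000+00:00"},"s":"I","c":"NETWORK","msg":"Connection accepted","attr":{}}',
  "not json",
];
console.log(slowQueries(lines, 100));
```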

Security Operations

Rotate database passwords quarterly
Review IP access lists / firewall rules monthly
Patch MongoDB within 30 days of security releases
Audit user roles — remove departed team members immediately
Enable encryption at rest and in transit (verify, don’t assume)

Production Deployment Patterns

Blue-Green Database Migration

  1. Deploy green cluster (new version or new region)
2. Seed green with mongodump/mongorestore, or add the green nodes to the replica set and let initial sync copy the data
3. Enable change stream sync for delta
4. Cutover application connection string
5. Decommission blue after validation period
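The delta sync in step 3 comes down to replaying change stream events from blue against green. The translation step might look like the sketch below. It assumes the standard change event fields (operationType, documentKey, fullDocument, updateDescription) and produces operations in the shape bulkWrite accepts; toBulkOp is a hypothetical helper, and it does not handle truncatedArrays, DDL events, or resume tokens.

```javascript
// Sketch: translate one change stream event into a bulkWrite operation.
// Returns null for event types it does not handle (drop, rename, invalidate, ...).
function toBulkOp(event) {
  const filter = event.documentKey;
  switch (event.operationType) {
    case "insert":
    case "replace":
      // upsert keeps replays idempotent if the seed copy already holds the document
      return { replaceOne: { filter, replacement: event.fullDocument, upsert: true } };
    case "update": {
      const { updatedFields = {}, removedFields = [] } = event.updateDescription;
      const update = {};
      if (Object.keys(updatedFields).length > 0) update.$set = updatedFields;
      if (removedFields.length > 0) {
        update.$unset = Object.fromEntries(removedFields.map((f) => [f, ""]));
      }
      return { updateOne: { filter, update } };
    }
    case "delete":
      return { deleteOne: { filter } };
    default:
      return null;
  }
}

console.log(JSON.stringify(toBulkOp({
  operationType: "update",
  documentKey: { _id: 1 },
  updateDescription: { updatedFields: { status: "paid" }, removedFields: ["tmp"] },
})));
// prints {"updateOne":{"filter":{"_id":1},"update":{"$set":{"status":"paid"},"$unset":{"tmp":""}}}}
```

A real consumer would also persist the resume token after each applied batch so the sync can restart without gaps.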

Multi-Environment Strategy

  Production:  Atlas M30+, 3-region, cloud backup
Staging:     Atlas M10, same version as production
Development: Atlas M0/M2 or local Docker replica set

Never test destructive operations on production.

Common Operational Mistakes

No tested backup restore procedure
Upgrading primary first — causes unnecessary downtime
Ignoring replication lag until reads return stale data
Running maintenance without disabling balancer (sharded)
No connection pooling — exhausting file descriptors
Profiler left at level 2 in production
Skipping FCV update after version upgrade

Troubleshooting Commands Reference

  // Health check
db.runCommand({ ping: 1 })
rs.status()
sh.status()  // sharded

// Performance
db.serverStatus()
db.currentOp({ active: true })
db.setProfilingLevel(1, { slowms: 100 })

// Storage
db.stats()
db.collection.stats(1024 * 1024)  // MB

// Replication
rs.printSecondaryReplicationInfo()
db.getReplicationInfo()

Best Practices

Automate backups — manual backups get forgotten
Test failover before you need it
Document every production change in a changelog
Use infrastructure as code for Atlas (Terraform provider)
Maintain staging environment mirroring production topology
Set up on-call rotation with runbook access
Review MongoDB security advisories monthly

What Comes Next

You now have the full MongoDB learning path — from document basics through production operations. Apply these patterns iteratively as your workload grows.
