MongoDB Production Operations
Running MongoDB in production requires operational discipline beyond configuration — backups, monitoring, upgrade procedures, and incident response. This guide covers the runbooks and practices that keep production databases reliable.
Operational Pillars
Reliability = Backups + Monitoring + Change Management + Incident Response
Every production deployment needs documented procedures for each pillar before going live.
Backup Strategies
Atlas (Recommended)
Continuous cloud backup with point-in-time recovery — enable on M10+ clusters. Test restores quarterly.
Self-Hosted: mongodump
Logical backup — portable, slower on large datasets:
# Full backup with oplog for point-in-time
mongodump --uri="mongodb://host:27017" \
--oplog --out=/backup/$(date +%Y%m%d) \
--gzip
# Single database
mongodump --db=myapp --out=/backup/myapp
# Restore
# --gzip is required because the dump above was written with --gzip
mongorestore --uri="mongodb://host:27017" \
--gzip --oplogReplay /backup/20240613
Schedule with cron:
# A crontab entry must fit on one line; cron does not support backslash continuation
0 2 * * * mongodump --uri="$MONGO_URI" --oplog --gzip --out=/backup/$(date +\%Y\%m\%d) && find /backup -mindepth 1 -mtime +7 -delete
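The retention step in the cron line can also be written as an explicit rule that is easy to test. The sketch below is plain JavaScript (runnable with Node.js) and assumes backup directories are named YYYYMMDD, as produced by the --out path above. The function name expiredBackups and its parameters are hypothetical. The cutoff is computed from the date in the directory name rather than from file modification time, so it may differ from find -mtime by a day.

```javascript
// Hypothetical helper: given backup directory names (YYYYMMDD) and "today",
// return the names that are older than `retentionDays`.
// Names that do not match the pattern are never selected for deletion.
function expiredBackups(dirNames, today, retentionDays = 7) {
  const startOfToday = Date.UTC(
    today.getUTCFullYear(),
    today.getUTCMonth(),
    today.getUTCDate()
  );
  const cutoff = startOfToday - retentionDays * 24 * 60 * 60 * 1000;
  return dirNames.filter((name) => {
    const m = /^(\d{4})(\d{2})(\d{2})$/.exec(name);
    if (!m) return false;
    const taken = Date.UTC(Number(m[1]), Number(m[2]) - 1, Number(m[3]));
    return taken < cutoff;
  });
}

// Example data (made up for illustration)
const backupDirs = ["20240601", "20240606", "20240612", "20240613", "lost+found"];
const expired = expiredBackups(backupDirs, new Date(Date.UTC(2024, 5, 13)));
console.log(expired);
```

Keeping the selection separate from the deletion makes it possible to log or review the list before anything is removed.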
Filesystem Snapshots (WiredTiger)
Fast, consistent snapshots when MongoDB is quiesced:
# Lock database for snapshot (brief pause)
mongosh --eval 'db.fsyncLock()'
# Take EBS/disk snapshot
aws ec2 create-snapshot --volume-id vol-abc123
# Unlock
mongosh --eval 'db.fsyncUnlock()'
Use LVM or cloud snapshots for large datasets where mongodump is too slow.
Backup Checklist
- Daily automated backups
- Oplog captured for point-in-time recovery
- Backups stored off-site (different region/account)
- Restore tested quarterly
- Backup encryption at rest
- RTO and RPO documented
Monitoring Stack
Key Metrics
| Metric | Alert Threshold | Action |
|---|---|---|
| Replication lag | > 60 seconds | Check network, disk, write load |
| Connections | > 80% of max | Add pooling, scale |
| Opcounters | Sudden drop | Check application, network |
| Cache usage | > 90% used or > 20% dirty | Increase cache or reduce writes |
| Disk usage | > 80% | Archive, expand, or shard |
| Page faults | Sustained high | Working set exceeds RAM |
| Queue length | > 10 | Lock contention, slow queries |
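As an illustration of the replication lag row, the sketch below derives per-secondary lag from an object shaped like rs.status() output. It is plain JavaScript, so it runs in Node.js against sample data. The field names members, stateStr, name, and optimeDate follow rs.status(); the function name replicationLag, the host names, and the sample values are assumptions made for this example.

```javascript
// Sketch: per-secondary replication lag in seconds, computed as the gap
// between the primary's optimeDate and each secondary's optimeDate.
function replicationLag(status, thresholdSecs = 60) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) return { primary: null, lags: {}, lagging: [] };
  const lags = {};
  for (const m of status.members) {
    if (m.stateStr !== "SECONDARY") continue;
    lags[m.name] = (primary.optimeDate.getTime() - m.optimeDate.getTime()) / 1000;
  }
  const lagging = Object.keys(lags).filter((name) => lags[name] > thresholdSecs);
  return { primary: primary.name, lags, lagging };
}

// Sample data (hypothetical hosts and timestamps)
const lagSample = {
  members: [
    { name: "db1:27017", stateStr: "PRIMARY", optimeDate: new Date("2024-06-13T02:00:00Z") },
    { name: "db2:27017", stateStr: "SECONDARY", optimeDate: new Date("2024-06-13T01:59:55Z") },
    { name: "db3:27017", stateStr: "SECONDARY", optimeDate: new Date("2024-06-13T01:58:00Z") },
  ],
};
const lagReport = replicationLag(lagSample);
console.log(lagReport);
```

In mongosh the same function could be called as replicationLag(rs.status()). On an idle primary the measured lag can fluctuate, so the alert duration matters as much as the threshold.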
mongostat and mongotop
# Throughput every 5 seconds
mongostat --uri="$MONGO_URI" 5
# Collection-level I/O
mongotop --uri="$MONGO_URI" 5
Server Status
const status = db.serverStatus();
// Connections
status.connections // { current, available, totalCreated }
// Operation counters
status.opcounters // insert, query, update, delete: cumulative counts since startup (diff two samples for a rate)
// WiredTiger cache
status.wiredTiger.cache
// Replication
status.repl // set name, hosts, current primary (measure lag with rs.status() or rs.printSecondaryReplicationInfo())
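Because the opcounters values are cumulative, a rate has to be derived from two samples. The sketch below shows one way to do that, together with a connection utilization check. The function names connectionUtilization and opRates are hypothetical, and the sample objects contain only the fields used here. Number() is applied so that plain numbers work as shown; values returned by the shell may be wrapped integer types, and converting them this way is an assumption to verify in your environment.

```javascript
// Fraction of the connection limit in use: current / (current + available).
function connectionUtilization(status) {
  const current = Number(status.connections.current);
  const available = Number(status.connections.available);
  return current / (current + available);
}

// Operations per second between two serverStatus samples taken
// `intervalSecs` apart. Counters reset when mongod restarts, so a negative
// rate indicates a restart between the samples.
function opRates(earlier, later, intervalSecs) {
  const rates = {};
  for (const op of ["insert", "query", "update", "delete"]) {
    rates[op] = (Number(later.opcounters[op]) - Number(earlier.opcounters[op])) / intervalSecs;
  }
  return rates;
}

// Sample data (made up for illustration)
const sampleA = {
  connections: { current: 200, available: 800 },
  opcounters: { insert: 1000, query: 5000, update: 300, delete: 10 },
};
const sampleB = {
  connections: { current: 200, available: 800 },
  opcounters: { insert: 1500, query: 5250, update: 300, delete: 20 },
};
console.log(connectionUtilization(sampleB));
console.log(opRates(sampleA, sampleB, 5));
```

In mongosh, two calls to db.serverStatus() separated by a sleep would supply the two samples.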
Prometheus Integration
Use MongoDB Exporter for Prometheus/Grafana:
# Percona MongoDB Exporter or official MongoDB exporter
docker run -d -p 9216:9216 \
-e MONGODB_URI="$MONGO_URI" \
percona/mongodb_exporter:0.40
Grafana dashboard templates are available; import a MongoDB overview dashboard as a starting point.
Alerting Rules
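Configure alerts for replication lag, connection saturation, and disk usage:
# Example Prometheus alert rules (metric names vary by exporter version; check your /metrics output)
- alert: MongoDBReplicationLag
  expr: mongodb_mongod_replset_member_replication_lag > 60
  for: 5m
- alert: MongoDBHighConnections
  # current / (current + available); ignoring(state) lets series with different state labels match
  expr: mongodb_connections{state="current"} / ignoring(state) (mongodb_connections{state="current"} + ignoring(state) mongodb_connections{state="available"}) > 0.8
  for: 10m
- alert: MongoDBDiskSpaceLow
  # node_exporter filesystem metrics; the mountpoint is an example, use your dbPath volume
  expr: 1 - node_filesystem_avail_bytes{mountpoint="/var/lib/mongodb"} / node_filesystem_size_bytes{mountpoint="/var/lib/mongodb"} > 0.8
  for: 15m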
Route to PagerDuty/Slack for production incidents.
Upgrade Procedures
Rolling Upgrade (Replica Set)
Upgrade one member at a time — secondaries first, primary last:
# 1. Upgrade secondary
sudo systemctl stop mongod
sudo apt-get install -y mongodb-org=7.0.8
sudo systemctl start mongod
# Wait until SECONDARY state
# 2. Repeat for other secondaries
# 3. Step down primary
mongosh --eval 'rs.stepDown(120)'
# 4. Upgrade former primary
sudo systemctl stop mongod
sudo apt-get install -y mongodb-org=7.0.8
sudo systemctl start mongod
Set featureCompatibilityVersion after all members are upgraded (MongoDB 7.0 and later require confirm: true):
db.adminCommand({ setFeatureCompatibilityVersion: "7.0", confirm: true })
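The ordering rule for a rolling upgrade (secondaries first, primary last, and wait for each member to recover before moving on) can be sketched as two small checks over an rs.status()-shaped object. This is plain JavaScript with made-up sample data; the function names upgradeOrder and readyForNextMember are hypothetical, and treating PRIMARY, SECONDARY, and ARBITER as the only acceptable states is an assumption of this sketch.

```javascript
// Members in the order they should be upgraded: secondaries, then the primary.
function upgradeOrder(status) {
  const byState = (state) =>
    status.members.filter((m) => m.stateStr === state).map((m) => m.name);
  return [...byState("SECONDARY"), ...byState("PRIMARY")];
}

// True only when every member is healthy and in a steady state, which is
// the point at which the next member can be taken down.
function readyForNextMember(status) {
  const steady = ["PRIMARY", "SECONDARY", "ARBITER"];
  return status.members.every((m) => m.health === 1 && steady.includes(m.stateStr));
}

// Sample data (hypothetical hosts)
const beforeUpgrade = {
  members: [
    { name: "db1:27017", stateStr: "PRIMARY", health: 1 },
    { name: "db2:27017", stateStr: "SECONDARY", health: 1 },
    { name: "db3:27017", stateStr: "SECONDARY", health: 1 },
  ],
};
const midUpgrade = {
  members: [
    { name: "db1:27017", stateStr: "PRIMARY", health: 1 },
    { name: "db2:27017", stateStr: "RECOVERING", health: 1 },
    { name: "db3:27017", stateStr: "SECONDARY", health: 1 },
  ],
};
console.log(upgradeOrder(beforeUpgrade));
console.log(readyForNextMember(beforeUpgrade), readyForNextMember(midUpgrade));
```

A real procedure would also confirm that replication lag has returned to normal before proceeding, which this sketch does not check.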
Upgrade Checklist
- Read release notes for breaking changes
- Test upgrade in staging with production data copy
- Backup before upgrade
- Upgrade secondaries before primary
- Set FCV after all members on new version
- Verify application compatibility with new driver version
Capacity Planning
When to Scale Vertically
- Working set growing beyond RAM (increasing page faults)
- CPU sustained > 70% on primary
- Disk I/O saturation
When to Scale Horizontally
- Write throughput exceeds single primary
- Storage approaching node limits
- Need geographic distribution
Growth Projection Template
Current: 500 GB data, 2K writes/sec, 10K reads/sec
Growth: 20% per quarter
Action: Shard at 1.5 TB or 5K writes/sec (whichever first)
Budget: M30 → M40 upgrade in Q3
Review capacity quarterly with 6-month forward projection.
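The template's arithmetic can be checked with a short compound-growth calculation. The sketch below is plain JavaScript and assumes the 20% quarterly growth applies to both data size and write rate, which the template does not state explicitly. The function name quartersUntil is hypothetical.

```javascript
// Number of whole quarters until `current` reaches `threshold` when it
// grows by `growthPerQuarter` (0.2 = 20%) each quarter.
function quartersUntil(current, threshold, growthPerQuarter) {
  if (current >= threshold) return 0;
  if (growthPerQuarter <= 0) return Infinity;
  let quarters = 0;
  let value = current;
  while (value < threshold) {
    value *= 1 + growthPerQuarter;
    quarters += 1;
  }
  return quarters;
}

const dataQuarters = quartersUntil(500, 1500, 0.2);    // GB toward the 1.5 TB trigger
const writeQuarters = quartersUntil(2000, 5000, 0.2);  // writes/sec toward the 5K trigger
console.log(dataQuarters);  // 7
console.log(writeQuarters); // 6
```

Under these assumptions the write threshold is reached first, after six quarters, so the sharding work would need to start well inside an 18-month horizon.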
Incident Response Runbooks
Primary Unreachable
1. Check rs.status() — is election in progress?
2. Verify network connectivity to all members
3. Check disk space on primary host
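4. If the primary host has failed, let the remaining members elect a new primary; if no majority survives, force a reconfiguration with rs.reconfig(cfg, { force: true }) as a last resort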
5. Restore failed node or replace with new member via rs.add()
6. Post-incident: review replication lag alerts
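The first steps of this runbook amount to answering two questions: is there a primary, and can the surviving members still form a majority? A minimal sketch over an rs.status()-shaped object follows. It is plain JavaScript with made-up sample data, the function name assessReplicaSet is hypothetical, and it assumes every member is a voting member, which is not always true.

```javascript
// Summarize replica set health: current primary (if any), healthy member
// count, and whether the healthy members form a strict majority.
function assessReplicaSet(status) {
  const total = status.members.length;
  const healthy = status.members.filter((m) => m.health === 1).length;
  const primary = status.members.find(
    (m) => m.stateStr === "PRIMARY" && m.health === 1
  );
  return {
    primary: primary ? primary.name : null,
    healthy,
    total,
    hasMajority: healthy > total / 2,
  };
}

// Sample: the former primary is unreachable, two secondaries remain
const outage = {
  members: [
    { name: "db1:27017", stateStr: "(not reachable/healthy)", health: 0 },
    { name: "db2:27017", stateStr: "SECONDARY", health: 1 },
    { name: "db3:27017", stateStr: "SECONDARY", health: 1 },
  ],
};
const assessment = assessReplicaSet(outage);
console.log(assessment);
```

When hasMajority is true and primary is null, an election should be possible and waiting briefly is usually the right move; when hasMajority is false, no election can succeed without intervention.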
Database Locked / Slow Queries
1. db.currentOp({ active: true, secs_running: { $gt: 10 } })
2. Identify long-running queries — missing index? COLLSCAN?
3. db.killOp(opid) for runaway queries (with caution)
4. Check db.serverStatus().globalLock.currentQueue for queued readers and writers
5. Review recent deployments or schema changes
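Steps 1 and 2 can be combined into a small triage function over db.currentOp() output. The sketch below is plain JavaScript run against sample data. The fields inprog, opid, active, secs_running, ns, and planSummary follow currentOp output; the function name triageOps and the sample operations are assumptions for illustration, and the sample uses plain numbers where the shell may return wrapped integer types.

```javascript
// Return active operations running longer than `minSecs`, slowest first,
// flagging those whose plan summary indicates a collection scan.
function triageOps(currentOp, minSecs = 10) {
  return currentOp.inprog
    .filter((op) => op.active && Number(op.secs_running ?? 0) > minSecs)
    .map((op) => ({
      opid: op.opid,
      ns: op.ns,
      secs: Number(op.secs_running),
      collscan:
        typeof op.planSummary === "string" && op.planSummary.startsWith("COLLSCAN"),
    }))
    .sort((a, b) => b.secs - a.secs);
}

// Sample data (hypothetical operations)
const opsSample = {
  inprog: [
    { opid: 102, active: true, secs_running: 3, ns: "myapp.users", planSummary: "IXSCAN { email: 1 }" },
    { opid: 103, active: true, secs_running: 15, ns: "myapp.orders", planSummary: "IXSCAN { userId: 1 }" },
    { opid: 101, active: true, secs_running: 42, ns: "myapp.orders", planSummary: "COLLSCAN" },
    { opid: 104, active: false, ns: "myapp.logs" },
  ],
};
const suspects = triageOps(opsSample);
console.log(suspects);
```

Operations flagged with collscan are the usual candidates for a missing index; db.killOp(opid) remains a manual decision.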
Disk Full
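1. Delete old data: db.logs.deleteMany({ createdAt: { $lt: cutoff } }); this frees space for reuse inside the data files but does not shrink them on disk
2. Last resort: db.runCommand({ compact: "collection" }) to return freed space to the operating system; run it on one member at a time, since it can block operations on that collection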
3. Archive to cold storage
4. Expand disk volume (cloud) or add shard
5. Post-incident: configure disk alerts at 70%
Oplog Window Too Small
1. Check oplog size: db.getReplicationInfo()
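2. Resize the oplog online with db.adminCommand({ replSetResizeOplog: 1, size: <MB> }) on each member; no restart is needed, and changing oplogSizeMB in the config does not resize an existing oplog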
3. For change streams — ensure consumer keeps up
4. Atlas: oplog scales with tier automatically
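To pick a new oplog size, the current churn rate can be extrapolated to a target window. The sketch below assumes the usedMB and timeDiffHours fields reported by db.getReplicationInfo() and a write rate that stays roughly constant, which is a simplification. The function name oplogSizeForWindow and the sample numbers are hypothetical.

```javascript
// Estimate the oplog size (in MB) needed to retain `targetHours` of
// operations, based on how fast the current oplog is being filled.
function oplogSizeForWindow(info, targetHours) {
  if (!(info.timeDiffHours > 0)) {
    throw new Error("oplog window is zero; cannot estimate a churn rate");
  }
  const mbPerHour = info.usedMB / info.timeDiffHours;
  return Math.ceil(mbPerHour * targetHours);
}

// Sample: 990 MB of oplog covers only 6 hours; the goal is 24 hours
const oplogInfo = { usedMB: 990, timeDiffHours: 6 };
const neededMB = oplogSizeForWindow(oplogInfo, 24);
console.log(neededMB);
```

Measuring during peak write load gives a more conservative estimate than measuring during a quiet period.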
Maintenance Windows
Schedule regular maintenance:
| Task | Frequency |
|---|---|
| Backup restore test | Quarterly |
| Failover drill | Quarterly |
| Index review ($indexStats) | Monthly |
| Slow query review | Weekly |
| Version patch | Monthly (security) |
| Capacity review | Quarterly |
| User access audit | Quarterly |
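For the monthly index review, the output of the $indexStats aggregation stage can be filtered for indexes that have never been used. The sketch below is plain JavaScript over sample data. The fields name and accesses.ops follow $indexStats output; the function name unusedIndexes and the sample index names are assumptions made for this example.

```javascript
// Return names of indexes with zero recorded accesses, skipping the
// mandatory _id index. Counters reset when mongod restarts, so check
// accesses.since before treating an index as unused.
function unusedIndexes(indexStats) {
  return indexStats
    .filter((ix) => ix.name !== "_id_" && Number(ix.accesses.ops) === 0)
    .map((ix) => ix.name);
}

// Sample data (hypothetical indexes)
const statsSample = [
  { name: "_id_", accesses: { ops: 0, since: new Date("2024-05-01T00:00:00Z") } },
  { name: "userId_1", accesses: { ops: 18234, since: new Date("2024-05-01T00:00:00Z") } },
  { name: "legacyFlag_1", accesses: { ops: 0, since: new Date("2024-05-01T00:00:00Z") } },
];
const unused = unusedIndexes(statsSample);
console.log(unused);
```

In mongosh the input could come from db.orders.aggregate([{ $indexStats: {} }]).toArray(), where orders is a placeholder collection name. Statistics are per member, so an index unused on the primary may still serve reads on a secondary.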
Log Management
# mongod.conf
systemLog:
  destination: file
  path: /var/log/mongodb/mongod.log
  logAppend: true
  logRotate: reopen
  verbosity: 0
  component:
    query:
      verbosity: 1 # temporary for debugging
Ship logs to centralized logging (ELK, CloudWatch, Datadog), and rotate the local file with logrotate:
# Log rotation
/var/log/mongodb/mongod.log {
daily
rotate 14
compress
postrotate
/bin/kill -SIGUSR1 $(cat /var/lib/mongodb/mongod.lock)
endscript
}
Security Operations
- Rotate database passwords quarterly
- Review IP access lists / firewall rules monthly
- Patch MongoDB within 30 days of security releases
- Audit user roles — remove departed team members immediately
- Enable encryption at rest and in transit (verify, don’t assume)
Production Deployment Patterns
Blue-Green Database Migration
1. Deploy green cluster (new version or new region)
2. Seed the green cluster with mongodump/mongorestore or a replica set initial sync
3. Enable change stream sync for delta
4. Cutover application connection string
5. Decommission blue after validation period
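Step 3 typically means reading change events from the blue cluster and replaying them on green. One way to make the replay idempotent is to translate each event into an upsert or delete keyed on documentKey. The sketch below shows only that translation, in plain JavaScript. The event fields operationType, documentKey, and fullDocument follow the change event format, and the returned objects use the bulkWrite operation shape; the function name toBulkOp is hypothetical. It assumes the stream is opened with fullDocument set to updateLookup so that update events carry the full document.

```javascript
// Translate one change event into a bulkWrite operation for the target
// cluster, or null when the event cannot be replayed automatically.
function toBulkOp(event) {
  switch (event.operationType) {
    case "insert":
    case "replace":
    case "update":
      // fullDocument can be missing, for example when the document was
      // deleted before the lookup ran; a later delete event covers that case.
      if (!event.fullDocument) return null;
      return {
        replaceOne: {
          filter: event.documentKey,
          replacement: event.fullDocument,
          upsert: true,
        },
      };
    case "delete":
      return { deleteOne: { filter: event.documentKey } };
    default:
      // drop, rename, invalidate, and similar events need manual handling
      return null;
  }
}

// Sample events (made up for illustration)
const events = [
  { operationType: "insert", documentKey: { _id: 1 }, fullDocument: { _id: 1, total: 10 } },
  { operationType: "update", documentKey: { _id: 1 }, fullDocument: { _id: 1, total: 12 } },
  { operationType: "delete", documentKey: { _id: 2 } },
  { operationType: "drop" },
];
const bulkOps = events.map(toBulkOp).filter((op) => op !== null);
console.log(JSON.stringify(bulkOps, null, 2));
```

A complete sync would also persist the resume token after each applied batch and preserve event order, neither of which is shown here.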
Multi-Environment Strategy
Production: Atlas M30+, 3-region, cloud backup
Staging: Atlas M10, same version as production
Development: Atlas M0/M2 or local Docker replica set
Never test destructive operations on production.
Common Operational Mistakes
- No tested backup restore procedure
- Upgrading primary first — causes unnecessary downtime
- Ignoring replication lag until reads return stale data
- Running maintenance without disabling balancer (sharded)
- No connection pooling — exhausting file descriptors
- Profiler left at level 2 in production
- Skipping FCV update after version upgrade
Troubleshooting Commands Reference
// Health check
db.runCommand({ ping: 1 })
rs.status()
sh.status() // sharded
// Performance
db.serverStatus()
db.currentOp({ active: true })
db.setProfilingLevel(1, { slowms: 100 })
// Storage
db.stats()
db.collection.stats(1024 * 1024) // MB
// Replication
rs.printSecondaryReplicationInfo()
db.getReplicationInfo()
Best Practices
- Automate backups — manual backups get forgotten
- Test failover before you need it
- Document every production change in a changelog
- Use infrastructure as code for Atlas (Terraform provider)
- Maintain staging environment mirroring production topology
- Set up on-call rotation with runbook access
- Review MongoDB security advisories monthly
What Comes Next
You now have the full MongoDB learning path — from document basics through production operations. Apply these patterns iteratively as your workload grows.