Replica Sets and Sharding
Production MongoDB runs on replica sets (high availability) and optionally sharded clusters (horizontal scale). This guide covers configuration, failover behavior, read scaling, and sharding fundamentals.
Replica Set Architecture
           ┌─────────────┐
           │   PRIMARY   │ ← writes, default reads
           │ mongo1:27017│
           └──────┬──────┘
                  │ replication
     ┌────────────┼────────────┐
     ▼            ▼            ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│SECONDARY │ │SECONDARY │ │ ARBITER  │
│mongo2    │ │mongo3    │ │ (opt.)   │
└──────────┘ └──────────┘ └──────────┘
     ↑            ↑
read scaling  failover candidate
- Primary — accepts all writes, replicates oplog to secondaries
- Secondary — applies oplog, can serve reads with read preference
- Arbiter — votes in elections only, holds no data (use sparingly)
Minimum production replica set: 3 data-bearing members (or 2 + arbiter for dev only).
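The three-member minimum follows from majority voting: a primary can only be elected while a strict majority of voting members is reachable. A minimal sketch of that arithmetic, assuming every member holds exactly one vote (the function names are illustrative, not part of any MongoDB API):

```javascript
// Simplified model: every member counted here has exactly one vote.
function majorityOf(votingMembers) {
  return Math.floor(votingMembers / 2) + 1;
}

// Number of voting members that can fail while a majority stays reachable.
function faultTolerance(votingMembers) {
  return votingMembers - majorityOf(votingMembers);
}

for (const n of [2, 3, 4, 5]) {
  console.log(
    `${n} voting members: majority ${majorityOf(n)}, tolerates ${faultTolerance(n)} failure(s)`
  );
}
```

Under this model a two-member set tolerates no failures, and a fourth member adds no fault tolerance over three, which is why odd member counts are the usual choice.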
Initialize a Replica Set
// Connect to first member
mongosh --host mongo1:27017
rs.initiate({
_id: "rs0",
members: [
{ _id: 0, host: "mongo1:27017", priority: 2 },
{ _id: 1, host: "mongo2:27017", priority: 1 },
{ _id: 2, host: "mongo3:27017", priority: 1 }
]
})
rs.status()
// Wait until one member shows "PRIMARY", others "SECONDARY"
Each member needs matching replSetName in mongod.conf:
replication:
replSetName: rs0
Connection Strings
mongodb://mongo1:27017,mongo2:27017,mongo3:27017/myapp?replicaSet=rs0
The driver discovers the primary automatically and fails over on election.
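When host lists differ per environment, the URI can be assembled from its parts. The helper below is a hypothetical sketch (`buildReplicaSetUri` is not a driver function); it only concatenates strings and does not validate hosts or option names.

```javascript
// Hypothetical helper: assemble a replica set URI from its parts.
function buildReplicaSetUri({ hosts, database, replicaSet, options = {} }) {
  // replicaSet goes first, followed by any extra URI options.
  const params = new URLSearchParams({ replicaSet, ...options });
  return `mongodb://${hosts.join(",")}/${database}?${params.toString()}`;
}

const uri = buildReplicaSetUri({
  hosts: ["mongo1:27017", "mongo2:27017", "mongo3:27017"],
  database: "myapp",
  replicaSet: "rs0",
});
console.log(uri);
// mongodb://mongo1:27017,mongo2:27017,mongo3:27017/myapp?replicaSet=rs0
```

Listing several seed hosts lets the driver connect even if one of them is down at startup.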
Read Preferences
| Preference | Behavior | Use Case |
|---|---|---|
| primary | Primary only (default) | Strong consistency |
| primaryPreferred | Primary, fallback to secondary | HA with consistency preference |
| secondary | Secondary only | Analytics, reporting |
| secondaryPreferred | Secondary, fallback to primary | Read scaling |
| nearest | Lowest latency member | Multi-region |
// mongosh
db.users.find().readPref("secondaryPreferred")
// Node.js driver
const { MongoClient, ReadPreference } = require("mongodb")
const client = new MongoClient(uri, {
  readPreference: ReadPreference.SECONDARY_PREFERRED
})
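Caution: reads from secondaries may be stale because of replication lag. To read your own writes from a secondary, use a causally consistent session (which applies afterClusterTime for you) together with majority read and write concerns.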
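The modes in the table, plus a staleness cutoff, can be pictured with a small model. This is a simplified sketch and not the driver's actual selection algorithm: the member shape and the `maxLagSeconds` parameter are assumptions made for illustration. Drivers expose a comparable filter as the `maxStalenessSeconds` read preference option.

```javascript
// Simplified model of read-preference member selection (not the driver's code).
// Each member: { name, state: "PRIMARY" | "SECONDARY", lagSeconds }.
function eligibleMembers(members, mode, maxLagSeconds = Infinity) {
  const primary = members.filter(m => m.state === "PRIMARY");
  // Secondaries that lag more than the cutoff are not considered.
  const secondaries = members.filter(
    m => m.state === "SECONDARY" && m.lagSeconds <= maxLagSeconds
  );
  switch (mode) {
    case "primary": return primary;
    case "primaryPreferred": return primary.length ? primary : secondaries;
    case "secondary": return secondaries;
    case "secondaryPreferred": return secondaries.length ? secondaries : primary;
    // nearest: any member qualifies; a real driver then picks by latency.
    case "nearest": return [...primary, ...secondaries];
    default: throw new Error(`unknown read preference: ${mode}`);
  }
}

const members = [
  { name: "mongo1", state: "PRIMARY", lagSeconds: 0 },
  { name: "mongo2", state: "SECONDARY", lagSeconds: 2 },
  { name: "mongo3", state: "SECONDARY", lagSeconds: 45 },
];
console.log(eligibleMembers(members, "secondaryPreferred", 10).map(m => m.name)); // [ 'mongo2' ]
console.log(eligibleMembers(members, "secondaryPreferred", 1).map(m => m.name));  // [ 'mongo1' ]
```

The second call shows the fallback: once every secondary exceeds the cutoff, secondaryPreferred reads go to the primary.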
Write Concern
db.orders.insertOne(
{ total: 99.99 },
{ writeConcern: { w: "majority", j: true, wtimeout: 5000 } }
)
| Setting | Meaning |
|---|---|
| w: 1 | Acknowledged by primary only |
| w: "majority" | Replicated to majority of voting members |
| j: true | Written to journal on disk |
Use w: "majority" for durability; w: 1 for lower latency with accepted risk.
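The trade-off can be made concrete with a toy acknowledgment check. This is a sketch under stated assumptions: every voting member is data-bearing, `acks` counts members (primary included) that have applied the write, and `isWriteAcknowledged` is an illustrative name rather than a MongoDB function.

```javascript
// Simplified model: has a write gathered enough acknowledgments for its `w`?
// Assumes all voting members are data-bearing.
function isWriteAcknowledged(w, acks, votingMembers) {
  const required = w === "majority" ? Math.floor(votingMembers / 2) + 1 : w;
  return acks >= required;
}

const votingMembers = 3;
console.log(isWriteAcknowledged(1, 1, votingMembers));          // true
console.log(isWriteAcknowledged("majority", 1, votingMembers)); // false
console.log(isWriteAcknowledged("majority", 2, votingMembers)); // true
```

A write acknowledged with w: 1 exists only on the primary at that moment, so it can be rolled back if the primary fails before any secondary replicates it.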
Failover and Elections
When the primary fails, secondaries hold an election:
rs.status() // identify current PRIMARY
// Adjust member priority
cfg = rs.conf()
cfg.members[1].priority = 3 // prefer mongo2 as primary
rs.reconfig(cfg)
Election triggers:
- Primary unreachable for ~10 seconds
- Manual rs.stepDown()
- Priority changes
Typical failover: 10–30 seconds for election + client rediscovery.
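To reason about what happens after a failure, a greatly simplified model can help. The sketch below is an assumption-laden illustration: real elections also compare oplog freshness and use term-based voting, and the member shape and function name here are hypothetical.

```javascript
// Greatly simplified election model; real elections also weigh oplog freshness.
// Each member: { name, priority, votes, reachable }.
function electionOutlook(members) {
  const totalVotes = members.reduce((sum, m) => sum + m.votes, 0);
  const reachableVotes = members
    .filter(m => m.reachable)
    .reduce((sum, m) => sum + m.votes, 0);
  // A primary needs a strict majority of all configured votes.
  const canElect = reachableVotes > totalVotes / 2;
  // Priority 0 members (arbiters included) can vote but never become primary.
  const candidates = members
    .filter(m => m.reachable && m.priority > 0)
    .sort((a, b) => b.priority - a.priority);
  const preferred = (canElect && candidates.length > 0) ? candidates[0].name : null;
  return { canElect, preferred };
}

const afterFailure = electionOutlook([
  { name: "mongo1", priority: 2, votes: 1, reachable: true },
  { name: "mongo2", priority: 3, votes: 1, reachable: false }, // failed primary
  { name: "mongo3", priority: 1, votes: 1, reachable: true },
]);
console.log(afterFailure); // { canElect: true, preferred: 'mongo1' }
```

If two of the three members were unreachable, `canElect` would be false and the surviving member would stay a secondary, leaving the set read-only.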
Adding and Removing Members
// Add a secondary
rs.add("mongo4:27017")
// Remove a member (drain first in production)
rs.remove("mongo4:27017")
// Step down primary for maintenance
rs.stepDown(60) // 60 seconds
Always add/remove members during maintenance windows with monitoring.
Monitoring Replication
rs.printSecondaryReplicationInfo()
// Shows lag per secondary
rs.status().members.forEach(m => {
print(m.name, m.stateStr, m.optimeDate)
})
// Raw member data: lag = primary optimeDate minus each secondary's optimeDate
db.adminCommand({ replSetGetStatus: 1 }).members
High lag causes:
- Network issues between members
- Secondary disk slower than primary
- Heavy writes overwhelming replication
- Index builds on secondary
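Per-secondary lag can be derived from the `members` array that replSetGetStatus returns, using the `name`, `stateStr` and `optimeDate` fields. A minimal sketch follows; the sample data and the 30-second threshold are illustrative.

```javascript
// Compute per-secondary lag from a replSetGetStatus-style members array.
function replicationLagSeconds(members) {
  const primary = members.find(m => m.stateStr === "PRIMARY");
  if (!primary) return null; // no primary: lag is undefined
  return members
    .filter(m => m.stateStr === "SECONDARY")
    .map(m => ({
      name: m.name,
      // Subtracting Date objects yields milliseconds.
      lagSeconds: (primary.optimeDate - m.optimeDate) / 1000,
    }));
}

// Illustrative sample; in mongosh, pass
// db.adminCommand({ replSetGetStatus: 1 }).members instead.
const sample = [
  { name: "mongo1:27017", stateStr: "PRIMARY", optimeDate: new Date("2024-01-01T00:00:40Z") },
  { name: "mongo2:27017", stateStr: "SECONDARY", optimeDate: new Date("2024-01-01T00:00:38Z") },
  { name: "mongo3:27017", stateStr: "SECONDARY", optimeDate: new Date("2024-01-01T00:00:05Z") },
];
const lagging = replicationLagSeconds(sample).filter(m => m.lagSeconds > 30);
console.log(lagging); // [ { name: 'mongo3:27017', lagSeconds: 35 } ]
```

The difference of optime dates measures how far a secondary's last applied operation trails the primary's, so an idle primary with no recent writes reports lag near zero.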
Sharded Cluster Overview
For datasets exceeding single-node capacity:
Clients → mongos (query router) → Shards (replica sets)
              ↓
     Config servers (metadata)
| Component | Role |
|---|---|
| mongos | Query router — clients connect here |
| Config servers | Store cluster metadata (replica set of 3) |
| Shards | Replica sets holding data subsets |
Enable Sharding
// Connect to mongos
mongosh --host mongos:27017
sh.enableSharding("analytics")
// Hashed shard key — even distribution
sh.shardCollection("analytics.events", { userId: "hashed" })
// Range shard key — range queries, hotspot risk
sh.shardCollection("analytics.logs", { timestamp: 1 })
sh.status()
Choosing a Shard Key
Good shard keys:
- High cardinality (many distinct values)
- Even distribution of reads and writes
- Supports common query patterns (targeted queries hit one shard)
Bad shard keys:
- Monotonically increasing (_id, timestamp alone): all writes go to one shard
- Low cardinality (boolean, country with 5 values)
// Compound shard key for time-series with even distribution
sh.shardCollection("analytics.events", { userId: 1, timestamp: 1 })
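The hotspot caused by a monotonically increasing key, and the way hashing spreads it, can be seen in a small simulation. This is an illustrative sketch only: the stand-in hash is a multiplicative hash rather than MongoDB's hash function, and the four shards and split points are assumptions.

```javascript
// Illustrative only: MongoDB's hashed sharding uses its own hash and chunk ranges.
const SHARDS = 4;

// Range sharding: split points were chosen from data that already exists.
function rangeShard(key, splitPoints) {
  let shard = 0;
  while (shard < splitPoints.length && key >= splitPoints[shard]) shard++;
  return shard;
}

// Stand-in hash (multiplicative), mapped onto the shard count.
function hashedShard(key) {
  const hashed = Math.imul(key, 2654435761) >>> 0; // unsigned 32-bit value
  return Math.floor((hashed / 2 ** 32) * SHARDS);
}

function distribution(keys, shardOf) {
  const counts = new Array(SHARDS).fill(0);
  for (const key of keys) counts[shardOf(key)]++;
  return counts;
}

// 1000 new writes whose keys keep increasing past the existing split points.
const newKeys = Array.from({ length: 1000 }, (_, i) => 1000 + i);
const rangeCounts = distribution(newKeys, k => rangeShard(k, [250, 500, 750]));
const hashedCounts = distribution(newKeys, hashedShard);
console.log("range: ", rangeCounts); // [ 0, 0, 0, 1000 ]: every write hits the last shard
console.log("hashed:", hashedCounts);
```

The hashed key spreads the same writes across all four shards, at the cost of turning range queries on that field into scatter-gather operations.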
Zone Sharding
Route data to geographic shards for data locality and compliance:
sh.addShardToZone("shard-us", "US")
sh.addShardToZone("shard-eu", "EU")
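// Assumes myapp.users is sharded on { country: 1, userId: 1 } (userId is illustrative).
// Each range is [min, max): min inclusive, max exclusive.
sh.updateZoneKeyRange(
  "myapp.users",
  { country: "US", userId: MinKey() },
  { country: "US", userId: MaxKey() },
  "US"
)
sh.updateZoneKeyRange(
  "myapp.users",
  { country: "DE", userId: MinKey() },
  { country: "DE", userId: MaxKey() },
  "EU"
)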
GDPR data stays in EU shards; US data in US shards.
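Zone ranges are half-open: the lower bound is inclusive and the upper bound is exclusive. The sketch below models that lookup on a single-field key for brevity; the range values and the `zoneFor` helper are hypothetical and not part of the sh API.

```javascript
// Simplified zone lookup on a single-field shard key.
// Each range is [min, max): min inclusive, max exclusive.
function zoneFor(value, zoneRanges) {
  const match = zoneRanges.find(r => value >= r.min && value < r.max);
  return match ? match.zone : null; // null: no zone, any shard may hold the document
}

const zoneRanges = [
  { min: "DE", max: "DF", zone: "EU" }, // covers exactly "DE" for two-letter codes
  { min: "US", max: "UT", zone: "US" },
];
console.log(zoneFor("US", zoneRanges)); // US
console.log(zoneFor("DE", zoneRanges)); // EU
console.log(zoneFor("FR", zoneRanges)); // null

// A range whose min equals its max is empty and matches nothing.
console.log(zoneFor("US", [{ min: "US", max: "US", zone: "US" }])); // null
```

Values that match no range, such as "FR" here, are not pinned to a zone, so every country that must stay in a region needs its own range.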
Balancer
The balancer migrates chunks between shards for even distribution:
sh.getBalancerState() // true = enabled
sh.isBalancerRunning()
sh.setBalancerState(false) // disable during maintenance
Monitor chunk migration — excessive migrations indicate poor shard key choice.
Production Scenario: 3-Region Deployment
Region US-East: Primary + Secondary (rs0)
Region US-West: Secondary (rs0)
Region EU: Secondary (rs0) + zone sharding
- Writes go to US-East primary
- Local reads from nearest secondary
- EU data zone-sharded to EU shard
Common Mistakes
- Running standalone mongod in production (no failover, no transactions)
- Using arbiters instead of data-bearing members in production
- Reading from secondaries without handling replication lag
- Choosing monotonic shard key — creates write hotspots
- Sharding too early — replica set handles most workloads to TB scale
- Not monitoring replication lag — stale reads and delayed failover
Troubleshooting
Replica set won’t elect primary
rs.status() // check for RECOVERING, STARTUP2, or DOWN members
// Ensure majority of members are reachable
StepDown fails — “not primary”
Primary already lost — check rs.status() for current state.
Chunk migration stuck
sh.stopBalancer()
// Fix underlying issue (disk space, network)
sh.setBalancerState(true)
Best Practices
- Always deploy 3+ data-bearing replica set members
- Use w: "majority" for critical writes
- Monitor replication lag with alerts (> 30 seconds)
- Test failover quarterly — verify application reconnects
- Shard only when single replica set exceeds capacity
- Choose shard key before loading data — resharding is costly
What Comes Next
Performance optimization builds on replication awareness — read preferences and write concerns directly affect latency and consistency.