to navigate

to select

to close

On this page

Replica Sets and Sharding

Production MongoDB runs on replica sets (high availability) and optionally sharded clusters (horizontal scale). This guide covers configuration, failover behavior, read scaling, and sharding fundamentals.

Replica Set Architecture

                      ┌─────────────┐
                    │   PRIMARY   │ ← writes, default reads
                    │  mongo1:27017│
                    └──────┬──────┘
                           │ replication
              ┌────────────┼────────────┐
              ▼            ▼            ▼
        ┌──────────┐ ┌──────────┐ ┌──────────┐
        │SECONDARY │ │SECONDARY │ │ ARBITER  │
        │mongo2    │ │mongo3    │ │ (opt.)   │
        └──────────┘ └──────────┘ └──────────┘
              ↑            ↑
         read scaling   failover candidate

Primary — accepts all writes, replicates oplog to secondaries
Secondary — applies oplog, can serve reads with read preference
Arbiter — votes in elections only, holds no data (use sparingly)

Minimum production replica set: 3 data-bearing members (or 2 + arbiter for dev only).

Initialize a Replica Set

  // Connect to first member
mongosh --host mongo1:27017

rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "mongo1:27017", priority: 2 },
    { _id: 1, host: "mongo2:27017", priority: 1 },
    { _id: 2, host: "mongo3:27017", priority: 1 }
  ]
})

rs.status()
// Wait until one member shows "PRIMARY", others "SECONDARY"

Each member needs matching replSetName in mongod.conf:

  replication:
  replSetName: rs0

Connection Strings

  mongodb://mongo1:27017,mongo2:27017,mongo3:27017/myapp?replicaSet=rs0

The driver discovers the primary automatically and fails over on election.

Read Preferences

Preference	Behavior	Use Case
`primary`	Primary only (default)	Strong consistency
`primaryPreferred`	Primary, fallback to secondary	HA with consistency preference
`secondary`	Secondary only	Analytics, reporting
`secondaryPreferred`	Secondary, fallback to primary	Read scaling
`nearest`	Lowest latency member	Multi-region

  // mongosh
db.users.find().readPref("secondaryPreferred")

// Node.js driver
const client = new MongoClient(uri, {
  readPreference: ReadPreference.SECONDARY_PREFERRED
})

Caution: reads from secondaries may be stale (replication lag). Never read your own writes from secondaries without afterClusterTime.

Write Concern

  db.orders.insertOne(
  { total: 99.99 },
  { writeConcern: { w: "majority", j: true, wtimeout: 5000 } }
)

Setting	Meaning
`w: 1`	Acknowledged by primary only
`w: "majority"`	Replicated to majority of voting members
`j: true`	Written to journal on disk

Use w: "majority" for durability; w: 1 for lower latency with accepted risk.

Failover and Elections

When the primary fails, secondaries hold an election:

  rs.status()  // identify current PRIMARY

// Adjust member priority
cfg = rs.conf()
cfg.members[1].priority = 3  // prefer mongo2 as primary
rs.reconfigure(cfg)

Election triggers:

Primary unreachable for ~10 seconds
Manual rs.stepDown()
Priority changes

Typical failover: 10–30 seconds for election + client rediscovery.

Adding and Removing Members

  // Add a secondary
rs.add("mongo4:27017")

// Remove a member ( drain first in production )
rs.remove("mongo4:27017")

// Step down primary for maintenance
rs.stepDown(60)  // 60 seconds

Always add/remove members during maintenance windows with monitoring.

Monitoring Replication

  rs.printSecondaryReplicationInfo()
// Shows lag per secondary

rs.status().members.forEach(m => {
  print(m.name, m.stateStr, m.optimeDate)
})

// Lag in seconds
db.adminCommand({ replSetGetStatus: 1 }).members

High lag causes:

Network issues between members
Secondary disk slower than primary
Heavy writes overwhelming replication
Index builds on secondary

Sharded Cluster Overview

For datasets exceeding single-node capacity:

  Clients → mongos (query router) → Shards (replica sets)
                ↓
         Config servers (metadata)

Component	Role
mongos	Query router — clients connect here
Config servers	Store cluster metadata (replica set of 3)
Shards	Replica sets holding data subsets

Enable Sharding

  // Connect to mongos
mongosh --host mongos:27017

sh.enableSharding("analytics")

// Hashed shard key — even distribution
sh.shardCollection("analytics.events", { userId: "hashed" })

// Range shard key — range queries, hotspot risk
sh.shardCollection("analytics.logs", { timestamp: 1 })

sh.status()

Choosing a Shard Key

Good shard keys:

High cardinality (many distinct values)
Even distribution of reads and writes
Supports common query patterns (targeted queries hit one shard)

Bad shard keys:

Monotonically increasing (_id, timestamp alone) — all writes to one shard
Low cardinality (boolean, country with 5 values)

  // Compound shard key for time-series with even distribution
sh.shardCollection("analytics.events", { userId: 1, timestamp: 1 })

Zone Sharding

Route data to geographic shards for data locality and compliance:

  sh.addShardToZone("shard-us", "US")
sh.addShardToZone("shard-eu", "EU")

sh.updateZoneKeyRange(
  "myapp.users",
  { country: "US", country: "US" },
  "US"
)
sh.updateZoneKeyRange(
  "myapp.users",
  { country: "DE", country: "DE" },
  "EU"
)

GDPR data stays in EU shards; US data in US shards.

Balancer

The balancer migrates chunks between shards for even distribution:

  sh.getBalancerState()       // true = enabled
sh.isBalancerRunning()
sh.setBalancerState(false)  // disable during maintenance

Monitor chunk migration — excessive migrations indicate poor shard key choice.

Production Scenario: 3-Region Deployment

  Region US-East:  Primary + Secondary (rs0)
Region US-West:  Secondary (rs0)
Region EU:       Secondary (rs0) + zone sharding

Writes go to US-East primary
Local reads from nearest secondary
EU data zone-sharded to EU shard

Common Mistakes

Running standalone mongod in production — no failover, no transactions
Using arbiters instead of data-bearing members in production
Reading from secondaries without handling replication lag
Choosing monotonic shard key — creates write hotspots
Sharding too early — replica set handles most workloads to TB scale
Not monitoring replication lag — stale reads and delayed failover

Troubleshooting

Replica set won’t elect primary

  rs.status()  // check for RECOVERING, STARTUP2, or DOWN members
// Ensure majority of members are reachable

StepDown fails — “not primary”

Primary already lost — check rs.status() for current state.

Chunk migration stuck

  sh.stopBalancer()
// Fix underlying issue (disk space, network)
sh.setBalancerState(true)

Best Practices

Always deploy 3+ data-bearing replica set members
Use w: "majority" for critical writes
Monitor replication lag with alerts (> 30 seconds)
Test failover quarterly — verify application reconnects
Shard only when single replica set exceeds capacity
Choose shard key before loading data — resharding is costly

What Comes Next

Performance optimization builds on replication awareness — read preferences and write concerns directly affect latency and consistency.

Joins and Views

Stored Procedures and Functions

Replica Sets and Sharding

Replica Set Architecture link

Initialize a Replica Set link

Connection Strings link

Read Preferences link

Write Concern link

Failover and Elections link

Adding and Removing Members link

Monitoring Replication link

Sharded Cluster Overview link

Enable Sharding link

Choosing a Shard Key link

Zone Sharding link

Balancer link

Production Scenario: 3-Region Deployment link

Common Mistakes link

Troubleshooting link

Best Practices link

What Comes Next link