Replication is implemented via replica sets—a self-healing group of mongod processes that maintain the same dataset.

Replica set architecture
- Primary:
- Accepts all writes.
- Records each write in the oplog (capped collection local.oplog.rs).
- Secondaries:
- Continuously tail the primary’s oplog.
- Apply operations in the same order to maintain consistency.
- Can serve reads (depending on readPreference).
- Arbiters:
- No data, only votes.
- Used to maintain an odd number of votes without extra storage cost.
Key properties:
- Majority-based elections ensure only one primary at a time.
- Automatic failover when primary is unreachable.
- Eventual consistency on secondaries.
Replication internals
- Oplog mechanics
- Every write on primary is converted into an idempotent operation and appended to the oplog.
- Secondaries:
- Use a sync source (usually the primary, but can be another secondary).
- Pull oplog entries and apply them in order.
- If a secondary falls behind:
- It replays from the oplog as long as entries are still present.
- If too far behind and oplog rolled over → needs initial sync.
- Initial sync
- Steps:
- Clone all collections from sync source.
- Apply oplog entries that occurred during the clone.
- Repeat until the node is “caught up”.
- After sync, node transitions to SECONDARY state.
- Heartbeats and elections
- Nodes send heartbeats every ~2 seconds.
- If a node doesn’t hear from the primary within a timeout:
- Eligible secondaries start an election.
- Election factors:
- Priority: higher priority nodes preferred as primary.
- Votes: max 7 voting members, majority required.
- Term: monotonically increasing term to avoid split-brain.
- If primary node A goes down, election will take place between the secondary nodes, that is voting will take place between the secondary nodes and they will elect one of themselves to become the primary node. Now once the failed Node A comes up, it will become the secondary node.
- Write concern & read concern
- Write concern controls durability guarantees:
- w: 1 → acknowledged by primary only.
- w: “majority” → replicated to majority of voting nodes.
- Optional j: true → journaled to disk.
- Read concern controls visibility:
- local → default, may see unreplicated data.
- majority → only data replicated to majority.
- linearizable (with w: “majority” writes) → strongest semantics.
Read preferences and consistency
- Primary: all reads from primary (strongest consistency).
- PrimaryPreferred: primary if available, otherwise secondaries.
- Secondary / SecondaryPreferred: offload reads to secondaries.
- Nearest: lowest latency node.
Trade-off: consistency vs performance. Reads from secondaries can be stale.
Common replica set topologies
- 3-node replica set:
- 1 primary, 2 secondaries.
- Classic HA setup.
- 3-node + 1 arbiter:
- For environments where storage is limited.
- Geo-distributed:
- Nodes in multiple data centers.
- Use tags and writeConcern to control locality and durability.
🧬 MongoDB Replica Set: High Availability
- Primary node handles all writes and maintains the oplog.
- Secondary nodes replicate from the primary and can serve reads.
- Arbiter participates in elections but stores no data.
- Automatic failover ensures continuity if the primary goes down.
- Ideal for disaster recovery, read scaling, and fault tolerance.
Clustering in MongoDB: Sharded Clusters

Sharding is MongoDB’s horizontal scaling mechanism—splitting data across multiple replica sets (shards).
- Distributing data between different machines
- Useful when you have large datasets
So in sharded clusters data is distributed between each of the instances. That is one of the part of the data is kept in one machine and the 2nd part is kept in another machine. Data is not replicated along each of the clusters.
🧭 MongoDB Sharded Cluster: Horizontal Scalability
- Shards are replica sets that store partitions of data.
- Config servers hold metadata about chunk ranges and shard mappings.
- mongos routers direct queries to the correct shard(s).
- Data is split into chunks based on a shard key.
- Balancer redistributes chunks to maintain even load.
- Ideal for large datasets, high-throughput apps, and geo-distributed systems.
Sharded cluster components

Shard keys and data distribution
The shard key is the core of sharding design.
- Defined at collection level, immutable once sharding is enabled.
- Determines how documents are partitioned into chunks.
- Good shard keys:
- High cardinality.
- Even distribution.
- Good query targeting (frequently used in queries).
Shard key patterns:
- Hashed shard key:
- sh.shardCollection(“db.coll”, { field: “hashed” })
- Good for uniform distribution.
- Poor for range queries (scatter-gather).
- Ranged shard key:
- shardCollection(“db.coll”, { field: 1 })
- Good for range queries.
- Risk of hot shard if values are monotonically increasing (e.g., timestamps, ObjectId).
Chunks and balancer
- Data is split into chunks (contiguous ranges of shard key values).
- Each chunk belongs to exactly one shard.
- Balancer:
- Runs in the background.
- Monitors chunk distribution.
- Migrates chunks between shards to keep them balanced.
- Chunk migration steps:
- Copy data from source shard to destination shard.
- Apply changes that occurred during copy (oplog-based).
- Update config metadata to point chunk to new shard.
- Delete old data from source shard.
All metadata changes are coordinated via config servers.
Query routing and execution
- Client connects to mongos, not directly to shards.
- mongos:
- Consults config servers to determine which shard(s) hold relevant chunks.
- Routes query to:
- Single shard (targeted query) if shard key or prefix is specified.
- Multiple/all shards (scatter-gather) if not.
- For writes:
- mongos routes to the shard owning the shard key range.
- Shard’s primary handles the write (since each shard is a replica set).
Fault tolerance in sharded clusters
- Each shard is a replica set → local HA.
- Config servers are a replica set:
- Loss of majority of config servers → cluster metadata unavailable.
- mongos is stateless:
- You can run multiple mongos instances behind a load balancer.
- Failure of one mongos doesn’t affect data, just client connectivity.
Replica sets vs sharded clusters (deep comparison)

Design and operational best practices
For replica sets
- Always use an odd number of voting members (3, 5, 7).
- Prefer 3-node replica set over 2-node + arbiter unless storage is a real constraint.
- Use w: “majority” for critical data.
- Monitor:
- Replication lag.
- Oplog window (time span covered by oplog).
- Ensure oplog is sized to handle expected downtime + peak write load.
For sharded clusters
- Spend time on shard key design—this is the most critical decision.
- Avoid monotonically increasing shard keys with ranged sharding unless you use hashed or zone sharding.
- Keep config servers highly available and well-protected.
- Run multiple mongos instances close to your application layer.
- Monitor:
- Chunk distribution.
- Balancer activity.
- Hot shards.
- Query patterns (to avoid scatter-gather).
How this ties together in real deployments
A typical production “cluster” is:
- Each shard = replica set (3+ nodes).
- Config server replica set (3 nodes).
- Multiple mongos processes.
- Applications connect only to mongos, using:
- Proper writeConcern and readConcern.
- Appropriate readPreference for latency vs consistency.
So you end up with replication inside each shard and sharding across shards—two layers of distribution: one for availability, one for scale.