MongoDB Data Modeling
Data modeling is the most consequential decision in MongoDB. Unlike SQL, there is no single normalized schema — you design documents around how your application reads and writes data.
Design Philosophy
MongoDB rewards schema design for access patterns, not theoretical normalization:
- Identify how the application queries data (read patterns)
- Identify write frequency and update scope
- Choose embed vs reference for each relationship
- Add indexes matching query patterns
- Validate schema where business rules require it
Embedding vs Referencing
When to Embed
- Data is accessed together in the same query
- 1-to-few relationship (addresses on a user, line items on an order)
- Data does not grow unboundedly
- Child data is not shared across parents
// User with embedded addresses — always fetched together
{
_id: ObjectId("..."),
name: "Alice Chen",
email: "alice@example.com",
addresses: [
{ type: "home", street: "123 Main St", city: "Seattle" },
{ type: "work", street: "456 Oak Ave", city: "Bellevue" }
]
}
When to Reference
- 1-to-many with unbounded growth (comments on a popular post)
- Many-to-many relationships (students ↔ courses)
- Data shared across documents (product catalog referenced by orders)
- Child data updated independently and frequently
// Order references products — product details change independently
{
_id: ObjectId("..."),
userId: ObjectId("..."),
items: [
{ productId: ObjectId("..."), sku: "WIDGET-01", qty: 2, price: NumberDecimal("19.99") }
],
status: "completed"
}
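The two checklists above can be condensed into a rough decision helper. This is a minimal sketch, not a rule MongoDB enforces: the function name `chooseRelationshipModel`, its input fields, and the default threshold of 100 children are all assumptions made for illustration.

```javascript
// Hypothetical helper: encodes the embed/reference checklists as a heuristic.
// Field names and the default threshold are illustrative assumptions.
function chooseRelationshipModel(rel, maxEmbedded = 100) {
  const reasons = [];
  if (rel.manyToMany) reasons.push("many-to-many relationship");
  if (rel.sharedAcrossParents) reasons.push("child data is shared across parents");
  if (rel.unbounded || rel.expectedChildren > maxEmbedded) {
    reasons.push("child count is unbounded or large");
  }
  if (rel.updatedIndependently) reasons.push("children are updated independently");

  // Any single reason to reference wins; otherwise embed when read together.
  if (reasons.length > 0) return { model: "reference", reasons };
  if (rel.accessedTogether) {
    return { model: "embed", reasons: ["read together, small, owned by one parent"] };
  }
  return { model: "reference", reasons: ["not read together with the parent"] };
}

const addresses = chooseRelationshipModel({ accessedTogether: true, expectedChildren: 3 });
const comments = chooseRelationshipModel({ accessedTogether: true, unbounded: true });
console.log(addresses.model, comments.model); // prints "embed reference"
```

Treat the result as a starting point for discussion; real workloads often need measurement before the choice is final.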
Design Patterns
Subset Pattern
Store the most recent N items embedded; archive the rest in a separate collection:
// Post with last 20 comments embedded
{
_id: postId,
title: "MongoDB Tips",
comments: [ /* last 20 comments */ ],
commentCount: 1543
}
// Full comment history in `comments` collection
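One way to keep the embedded subset bounded is to trim the array in the same update that appends to it. The sketch below only builds the update document; the helper name `buildAddCommentUpdate` and the comment fields are assumptions, while `$push`, `$each`, `$slice`, and `$inc` are standard update operators.

```javascript
// Builds an update document that appends one comment and trims the
// embedded array to the newest `keep` entries in the same operation.
function buildAddCommentUpdate(comment, keep = 20) {
  return {
    $push: { comments: { $each: [comment], $slice: -keep } },
    $inc: { commentCount: 1 }
  };
}

const update = buildAddCommentUpdate({
  author: "alice",
  text: "Nice post",
  createdAt: new Date()
});
// In mongosh or a driver this would be passed as:
//   db.posts.updateOne({ _id: postId }, update)
console.log(update.$push.comments.$slice); // prints -20
```

The full comment would still be inserted into the `comments` collection separately, since the embedded array only holds the most recent entries.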
Bucket Pattern
Group time-series events into bucket documents:
// One document per sensor per hour instead of one per reading
{
sensorId: "TEMP-001",
hour: ISODate("2024-06-13T14:00:00Z"),
readings: [
{ t: ISODate("2024-06-13T14:00:00Z"), v: 22.5 },
{ t: ISODate("2024-06-13T14:01:00Z"), v: 22.6 },
// ... one reading per minute, so up to 60 readings per hourly bucket
],
count: 60,
min: 22.1,
max: 23.0,
avg: 22.55
}
Bucketing cuts the document count by roughly the number of readings per bucket, which helps IoT and logging workloads.
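Writing into a bucket is typically an upsert keyed on the sensor and the truncated hour. The following is a sketch under assumptions: the helper name `buildBucketUpsert`, the `sum` field (used so an average can be derived), and the `sensorBuckets` collection name are hypothetical and not part of the original example.

```javascript
const HOUR_MS = 60 * 60 * 1000;

// Builds the filter and update for an upsert that adds one reading to its
// hourly bucket and maintains the summary fields in the same write.
function buildBucketUpsert(sensorId, reading) {
  // Truncate the reading time to the start of its UTC hour.
  const hour = new Date(Math.floor(reading.t.getTime() / HOUR_MS) * HOUR_MS);
  return {
    filter: { sensorId, hour },
    update: {
      $push: { readings: reading },
      $inc: { count: 1, sum: reading.v }, // avg can be derived as sum / count
      $min: { min: reading.v },
      $max: { max: reading.v }
    },
    options: { upsert: true }
  };
}

const op = buildBucketUpsert("TEMP-001", { t: new Date("2024-06-13T14:01:00Z"), v: 22.6 });
// db.sensorBuckets.updateOne(op.filter, op.update, op.options)
console.log(op.filter.hour.toISOString()); // prints 2024-06-13T14:00:00.000Z
```

Storing a running `sum` instead of `avg` avoids recomputing the average on every write; the reader divides by `count` when needed.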
Extended Reference Pattern
Denormalize frequently accessed fields alongside the reference:
{
items: [
{
productId: ObjectId("..."),
productName: "Wireless Mouse", // denormalized — avoids $lookup
sku: "MOUSE-01",
qty: 1,
price: NumberDecimal("29.99")
}
]
}
Update the denormalized fields when the source document changes, or accept eventual consistency.
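Propagating a change to the denormalized copies can be done with a filtered positional update. This sketch only builds the arguments; the helper name `buildProductRenameFanout`, the `orders` collection name, and the placeholder id are assumptions, while `arrayFilters` and the `$[<identifier>]` operator are standard MongoDB features.

```javascript
// Builds the arguments for an updateMany call that rewrites the denormalized
// productName in every order line that references the given product.
function buildProductRenameFanout(productId, newName) {
  return {
    filter: { "items.productId": productId },
    update: { $set: { "items.$[line].productName": newName } },
    options: { arrayFilters: [{ "line.productId": productId }] }
  };
}

// The id below is a placeholder string; real code would pass an ObjectId.
const fanout = buildProductRenameFanout("PRODUCT_ID_PLACEHOLDER", "Wireless Mouse v2");
// db.orders.updateMany(fanout.filter, fanout.update, fanout.options)
console.log(Object.keys(fanout.update.$set));
```

If the embedded copy is meant to be a historical snapshot, skip this fan-out and leave the old value in place.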
Outlier Pattern
Most documents are small; a few are huge — handle outliers separately:
// Normal product
{ _id: 1, name: "Widget", reviews: [ /* 5 reviews */ ] }
// Outlier product with 10,000 reviews
{ _id: 2, name: "Popular Gadget", reviewCount: 10000 }
// Reviews stored in separate `reviews` collection
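The write path for this pattern has to decide, per document, whether a new review is embedded or overflows into the separate collection. A minimal sketch follows; the `hasOverflow` flag, the threshold of 50, and the helper name `planReviewWrite` are assumptions introduced for illustration.

```javascript
// Hypothetical threshold: products with more reviews than this overflow.
const MAX_EMBEDDED_REVIEWS = 50;

// Returns a description of the write(s) needed to add one review.
function planReviewWrite(product, review) {
  const embedded = product.reviews ? product.reviews.length : 0;
  if (!product.hasOverflow && embedded < MAX_EMBEDDED_REVIEWS) {
    return {
      collection: "products",
      filter: { _id: product._id },
      update: { $push: { reviews: review }, $inc: { reviewCount: 1 } }
    };
  }
  // Outlier: keep the product document small and store the review on its own.
  return {
    collection: "reviews",
    insert: { productId: product._id, ...review },
    productUpdate: { $set: { hasOverflow: true }, $inc: { reviewCount: 1 } }
  };
}

const normal = planReviewWrite({ _id: 1, reviews: [] }, { rating: 5, text: "Great" });
const outlier = planReviewWrite({ _id: 2, hasOverflow: true }, { rating: 4, text: "Good" });
console.log(normal.collection, outlier.collection); // prints "products reviews"
```

Readers can then check the flag to know whether the embedded array is the complete set or whether the `reviews` collection must be queried as well.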
Attribute Pattern
Polymorphic products with different attributes per category:
{
name: "Running Shoes",
category: "footwear",
attributes: {
size: 10,
color: "blue",
material: "mesh"
}
}
{
name: "Laptop",
category: "electronics",
attributes: {
cpu: "M3",
ram: "16GB",
storage: "512GB"
}
}
Use schema validation to enforce category-specific attribute shapes.
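Category-specific shapes can be expressed with the `oneOf` keyword of `$jsonSchema`, with one branch per category. The sketch below builds such a validator from a plain map; the helper name `buildCategoryValidator` and the choice of required keys are assumptions based on the two example documents.

```javascript
// Builds a $jsonSchema validator where each category requires its own
// attribute keys. A document must match exactly one category branch.
function buildCategoryValidator(requiredByCategory) {
  return {
    $jsonSchema: {
      bsonType: "object",
      required: ["name", "category", "attributes"],
      oneOf: Object.entries(requiredByCategory).map(([category, keys]) => ({
        properties: {
          category: { enum: [category] },
          attributes: { bsonType: "object", required: keys }
        }
      }))
    }
  };
}

const validator = buildCategoryValidator({
  footwear: ["size", "color", "material"],
  electronics: ["cpu", "ram", "storage"]
});
// db.createCollection("products", { validator })
console.log(validator.$jsonSchema.oneOf.length); // prints 2
```

A document whose category is not in the map matches no branch and is rejected, which may or may not be the behavior you want for new categories.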
Time-Series Collections
MongoDB 5.0+ supports native time-series collections for metrics, IoT data, and logs:
db.createCollection("sensor_readings", {
timeseries: {
timeField: "timestamp",
metaField: "sensorId",
granularity: "minutes"
}
})
db.sensor_readings.insertOne({
sensorId: "TEMP-001",
timestamp: new Date(),
temperature: 22.5
})
Time-series collections provide automatic bucketing, compression, and optimized queries. Prefer them over the manual bucket pattern for new projects.
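Querying a time-series collection uses the ordinary aggregation framework. As an illustration, the sketch below builds an hourly-average pipeline for the `sensor_readings` example; the helper name `buildHourlyAveragePipeline` and the output field names are assumptions, and `$dateTrunc` requires MongoDB 5.0 or later.

```javascript
// Builds a pipeline that averages temperature per hour for one sensor
// within a half-open time range [from, to).
function buildHourlyAveragePipeline(sensorId, from, to) {
  return [
    { $match: { sensorId, timestamp: { $gte: from, $lt: to } } },
    {
      $group: {
        _id: { $dateTrunc: { date: "$timestamp", unit: "hour" } },
        avgTemperature: { $avg: "$temperature" },
        readings: { $sum: 1 }
      }
    },
    { $sort: { _id: 1 } }
  ];
}

const pipeline = buildHourlyAveragePipeline(
  "TEMP-001",
  new Date("2024-06-13T00:00:00Z"),
  new Date("2024-06-14T00:00:00Z")
);
// db.sensor_readings.aggregate(pipeline)
console.log(pipeline.length); // prints 3
```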
Schema Validation
db.createCollection("orders", {
validator: {
$jsonSchema: {
bsonType: "object",
required: ["userId", "items", "status", "createdAt"],
properties: {
userId: { bsonType: "objectId" },
status: { enum: ["pending", "completed", "cancelled"] },
items: {
bsonType: "array",
minItems: 1,
items: {
bsonType: "object",
required: ["sku", "qty", "price"],
properties: {
sku: { bsonType: "string" },
qty: { bsonType: "int", minimum: 1 },
price: { bsonType: "decimal" }
}
}
}
}
}
},
validationLevel: "moderate",
validationAction: "error"
})
| Level | Behavior |
|---|---|
| strict | Validate all inserts and updates |
| moderate | Validate inserts; validate updates only if the document is already valid |
| off | No validation |
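On an existing collection, validation settings are changed with the `collMod` command, which makes a staged rollout possible. The sketch below is one possible approach, not a prescribed procedure: the stage names `observe` and `enforce` and the helper name `buildValidationCommand` are assumptions, while `validationLevel` and `validationAction` (`"warn"` or `"error"`) are real `collMod` options.

```javascript
// Builds a collMod command for a two-stage validator rollout:
// "observe" logs violations without rejecting writes, "enforce" rejects them.
function buildValidationCommand(collection, validator, stage) {
  const stages = {
    observe: { validationLevel: "moderate", validationAction: "warn" },
    enforce: { validationLevel: "strict", validationAction: "error" }
  };
  if (!Object.prototype.hasOwnProperty.call(stages, stage)) {
    throw new Error(`unknown stage: ${stage}`);
  }
  return { collMod: collection, validator, ...stages[stage] };
}

const schema = { $jsonSchema: { bsonType: "object", required: ["userId", "items", "status"] } };
const command = buildValidationCommand("orders", schema, "observe");
// db.runCommand(command)
console.log(command.validationAction); // prints warn
```

Running the `observe` stage first gives a chance to find and repair non-conforming documents before writes start failing.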
Denormalization vs Normalization
| Approach | Pros | Cons |
|---|---|---|
| Denormalized (embed) | Single query, fast reads | Larger docs, update anomalies |
| Normalized (reference) | Smaller docs, single source of truth | Multiple queries or $lookup |
Rule of thumb: denormalize read-heavy data; reference shared data that is written often.
Production Scenarios
E-Commerce Catalog
// Products — referenced, updated by catalog team
{ _id, sku, name, price, category, attributes: { ... } }
// Orders — embed line item snapshot at purchase time
{ _id, userId, items: [{ sku, name, price, qty }], total, status }
// Reviews — separate collection, paginated
{ _id, productId, userId, rating, text, createdAt }
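The line item snapshot means the order copies name and price from the catalog at purchase time, so later catalog edits do not alter past orders. A minimal sketch, assuming prices are held as decimal strings (as they would be passed to `NumberDecimal`) and that `buildOrder`, `toCents`, and the sample catalog are hypothetical names for illustration:

```javascript
// Converts a decimal string such as "19.99" to integer cents so the total
// is computed without floating-point drift.
function toCents(price) {
  return Math.round(parseFloat(price) * 100);
}

// Copies name and price from the catalog into the order at purchase time.
function buildOrder(userId, cart, productsBySku) {
  let totalCents = 0;
  const items = cart.map(({ sku, qty }) => {
    const product = productsBySku[sku];
    if (!product) throw new Error(`unknown sku: ${sku}`);
    totalCents += toCents(product.price) * qty;
    return { sku, name: product.name, price: product.price, qty };
  });
  return { userId, items, total: (totalCents / 100).toFixed(2), status: "pending" };
}

const catalog = {
  "WIDGET-01": { name: "Widget", price: "19.99" },
  "MOUSE-01": { name: "Wireless Mouse", price: "29.99" }
};
const order = buildOrder(
  "user-1",
  [{ sku: "WIDGET-01", qty: 2 }, { sku: "MOUSE-01", qty: 1 }],
  catalog
);
console.log(order.total); // prints 69.97
```

Before inserting, the string prices and total would be wrapped in `NumberDecimal(...)` to match the schema shown earlier.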
Social Feed
// User timeline — embed recent posts (subset pattern)
{ userId, posts: [{ postId, author, text, createdAt }] }
// Full posts collection for detail views
{ _id: postId, authorId, text, likes, commentCount }
Common Mistakes
- Unbounded arrays: comments, logs, and events grow forever until the document hits the 16 MB limit
- Over-embedding: 500 KB documents slow every read, even when you need only one field
- Under-embedding: $lookup on every request when the data is always accessed together
- Same design as SQL: importing normalized tables without redesigning for documents
- Ignoring write amplification: updating one field in a 500 KB embedded document rewrites the whole document
Performance Tips
- Keep the working set (hot data plus indexes) in RAM; model documents to limit their size
- Bound embedded arrays with $push combined with $each and $slice, so they behave like capped arrays
- Use $project in aggregation to avoid shipping large embedded arrays
- Archive cold data to separate collections or databases
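To illustrate the projection tip, the sketch below builds a pipeline that returns a post summary with only the last few embedded comments instead of the whole array. The helper name `buildPostSummaryPipeline` and the `recentComments` output field are assumptions; the field names follow the post example from the subset pattern.

```javascript
// Builds a pipeline that returns a post's title, comment count, and only
// the newest `recent` embedded comments.
function buildPostSummaryPipeline(postId, recent = 5) {
  return [
    { $match: { _id: postId } },
    {
      $project: {
        title: 1,
        commentCount: 1,
        recentComments: { $slice: ["$comments", -recent] }
      }
    }
  ];
}

const summaryPipeline = buildPostSummaryPipeline("POST_ID_PLACEHOLDER");
// db.posts.aggregate(summaryPipeline)
console.log(JSON.stringify(summaryPipeline[1].$project.recentComments));
```

The server still reads the full document from storage, so this reduces network transfer and client memory rather than disk I/O.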
Migration Strategy
When schema evolves:
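// Batch migration with bulkWrite: convert double prices to Decimal128, 1,000 documents per batch
let ops = []
db.products.find({ price: { $type: "double" } }, { _id: 1 }).forEach(doc => {
  ops.push({ updateOne: {
    filter: { _id: doc._id },
    update: [{ $set: { price: { $toDecimal: "$price" } } }] // aggregation pipeline update (MongoDB 4.2+)
  } })
  if (ops.length === 1000) {
    db.products.bulkWrite(ops, { ordered: false }) // flush a full batch
    ops = []
  }
})
if (ops.length > 0) db.products.bulkWrite(ops, { ordered: false }) // flush the remainder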
Use dual-write or background migration for zero-downtime schema changes.
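A dual-write rollout usually moves through phases: write the old field only, write both, then write the new field only. The sketch below is one way to model that in application code, under assumptions: the `priceDecimal` field, the phase names, and the helper `buildPriceWrite` are hypothetical, and the price is kept as a decimal string that real code would wrap in `NumberDecimal(...)`.

```javascript
// Builds the update document for a price change, depending on the
// current migration phase of the application.
function buildPriceWrite(price, phase) {
  switch (phase) {
    case "old": // before migration: legacy double field only
      return { $set: { price: parseFloat(price) } };
    case "dual": // during migration: keep both fields in sync
      return { $set: { price: parseFloat(price), priceDecimal: price } };
    case "new": // after migration: new field only, drop the legacy one
      return { $set: { priceDecimal: price }, $unset: { price: "" } };
    default:
      throw new Error(`unknown phase: ${phase}`);
  }
}

const dualWrite = buildPriceWrite("19.99", "dual");
// db.products.updateOne({ _id: productId }, dualWrite)
console.log(Object.keys(dualWrite.$set)); // prints [ 'price', 'priceDecimal' ]
```

Readers switch to the new field only after the background migration has backfilled every document, and the `new` phase is enabled last.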
Troubleshooting
Document too large (16 MB limit)
db.collection.aggregate([
{ $project: { size: { $bsonSize: "$$ROOT" } } },
{ $sort: { size: -1 } },
{ $limit: 10 }
])
Split large arrays into separate collections or apply the bucket pattern.
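Splitting an oversized array into bucket documents can be sketched as a pure function that runs before the rewrite. The helper name `splitIntoBuckets`, the `parentId` and `seq` fields, and the bucket size of 1000 are assumptions chosen for illustration.

```javascript
// Splits one large array into bucket documents of at most `bucketSize`
// items each, numbered in order so the original sequence can be rebuilt.
function splitIntoBuckets(parentId, items, bucketSize) {
  if (!Number.isInteger(bucketSize) || bucketSize < 1) {
    throw new Error("bucketSize must be a positive integer");
  }
  const buckets = [];
  for (let i = 0; i < items.length; i += bucketSize) {
    const slice = items.slice(i, i + bucketSize);
    buckets.push({ parentId, seq: buckets.length, count: slice.length, items: slice });
  }
  return buckets;
}

const sampleItems = Array.from({ length: 2500 }, (_, i) => ({ n: i }));
const buckets = splitIntoBuckets("post-1", sampleItems, 1000);
console.log(buckets.map(b => b.count)); // prints [ 1000, 1000, 500 ]
```

The resulting documents would be inserted into a separate collection, and the large array removed from the parent once the copy is verified.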
Best Practices
- Design for your top 5 query patterns — not every possible query
- Prototype with real data volumes, not 100-row test sets
- Document schema decisions in your team’s architecture docs
- Use schema validation for critical collections in production
- Review schema quarterly as access patterns evolve
What Comes Next
Security builds on your data model — role-based access control operates at database and collection level.