MongoDB Data Modeling

Data modeling is the most consequential design decision in a MongoDB application. Unlike a relational database, where normalization points toward one canonical schema, MongoDB has you design documents around how your application reads and writes data.

Design Philosophy

MongoDB rewards schema design for access patterns, not theoretical normalization:

Identify how the application queries data (read patterns)
Identify write frequency and update scope
Choose embed vs reference for each relationship
Add indexes matching query patterns
Validate schema where business rules require it

Embedding vs Referencing

When to Embed

Data is accessed together in the same query
1-to-few relationship (addresses on a user, line items on an order)
Data does not grow unboundedly
Child data is not shared across parents

// User with embedded addresses, always fetched together
{
  _id: ObjectId("..."),
  name: "Alice Chen",
  email: "alice@example.com",
  addresses: [
    { type: "home", street: "123 Main St", city: "Seattle" },
    { type: "work", street: "456 Oak Ave", city: "Bellevue" }
  ]
}

When to Reference

1-to-many with unbounded growth (comments on a popular post)
Many-to-many relationships (students ↔ courses)
Data shared across documents (product catalog referenced by orders)
Child data updated independently and frequently

// Order references products; product details change independently
{
  _id: ObjectId("..."),
  userId: ObjectId("..."),
  items: [
    { productId: ObjectId("..."), sku: "WIDGET-01", qty: 2, price: NumberDecimal("19.99") }
  ],
  status: "completed"
}
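
The embed and reference criteria above can be condensed into a rough decision helper. This is only a sketch: `chooseModel` and its flag names are hypothetical labels for those criteria, not part of MongoDB, and real designs usually weigh several relationships together.

```javascript
// Rough embed-vs-reference heuristic. The flag names on `rel` are
// hypothetical labels for the criteria above, not a MongoDB API.
function chooseModel(rel) {
  // Any one of these is a strong signal to reference.
  if (rel.unbounded || rel.manyToMany || rel.shared || rel.updatedIndependently) {
    return "reference";
  }
  // Bounded, parent-owned data that is read with its parent can be embedded.
  return rel.accessedTogether ? "embed" : "reference";
}

// Addresses on a user: few, owned by one user, read with the user.
console.log(chooseModel({ accessedTogether: true }));
// Comments on a popular post: read with the post, but unbounded.
console.log(chooseModel({ accessedTogether: true, unbounded: true }));
```

Treating any single "reference" signal as decisive is a deliberately conservative choice; an unbounded array is much harder to undo later than an extra query.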

Design Patterns

Subset Pattern

Store the most recent N items embedded; archive the rest in a separate collection:

// Post with last 20 comments embedded
{
  _id: postId,
  title: "MongoDB Tips",
  comments: [ /* last 20 comments */ ],
  commentCount: 1543
}
// Full comment history in `comments` collection
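
Keeping the embedded subset at N items is typically done at write time. The sketch below builds the two writes for one new comment, assuming collections named `posts` and `comments`; `buildAddComment` and its `keep` parameter are illustrative names. It relies on `$push` with the `$each` and `$slice` modifiers, where a negative `$slice` keeps the last N elements.

```javascript
// Build the two writes for adding a comment under the subset pattern.
// `buildAddComment` and `keep` are illustrative names.
function buildAddComment(postId, comment, keep = 20) {
  return {
    // 1) Full history goes to the separate `comments` collection.
    archiveInsert: { postId, ...comment },
    // 2) Push onto the embedded subset and trim it to the newest `keep` entries.
    postUpdate: {
      filter: { _id: postId },
      update: {
        $push: { comments: { $each: [comment], $slice: -keep } },
        $inc: { commentCount: 1 }
      }
    }
  };
}

const commentOps = buildAddComment("post-1", { author: "alice", text: "Nice tips" });
// In mongosh or a driver (not executed here):
//   db.comments.insertOne(commentOps.archiveInsert)
//   db.posts.updateOne(commentOps.postUpdate.filter, commentOps.postUpdate.update)
console.log(JSON.stringify(commentOps.postUpdate.update));
```

The two writes are not atomic unless wrapped in a transaction, so a reader may briefly see the archive and the subset disagree.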

Bucket Pattern

Group time-series events into bucket documents:

// One document per sensor per hour instead of one per reading
{
  sensorId: "TEMP-001",
  hour: ISODate("2024-06-13T14:00:00Z"),
  readings: [
    { t: ISODate("2024-06-13T14:00:00Z"), v: 22.5 },
    { t: ISODate("2024-06-13T14:01:00Z"), v: 22.6 },
    // ... one reading per minute, 60 per hourly bucket
  ],
  count: 60,
  min: 22.1,
  max: 23.0,
  avg: 22.55
}

With one reading per minute, hourly buckets cut the document count 60-fold. Denser sampling or wider buckets give larger reductions, which is what makes the pattern attractive for IoT and logging workloads.
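
A bucket document like the one above is usually filled with an upsert per reading. The following sketch builds such an upsert; `buildBucketUpsert` is a hypothetical helper, and the `sum` field is an assumption added here because `$inc` can maintain a running sum while an average cannot be updated with a single operator (derive it as `sum / count` when reading).

```javascript
const HOUR_MS = 60 * 60 * 1000;

// Build an upsert that appends one reading to its hourly bucket.
// `sum` is an assumed extra field; avg = sum / count at read time.
function buildBucketUpsert(sensorId, t, v) {
  // Truncate the reading time to the start of its UTC hour.
  const hour = new Date(Math.floor(t.getTime() / HOUR_MS) * HOUR_MS);
  return {
    filter: { sensorId, hour },
    update: {
      $push: { readings: { t, v } },
      $inc: { count: 1, sum: v },
      $min: { min: v },
      $max: { max: v }
    },
    options: { upsert: true } // creates the bucket on the first reading of the hour
  };
}

const reading = buildBucketUpsert("TEMP-001", new Date("2024-06-13T14:37:12Z"), 22.8);
// In mongosh or a driver (not executed here):
//   db.sensor_buckets.updateOne(reading.filter, reading.update, reading.options)
console.log(reading.filter.hour.toISOString());
```

The collection name `sensor_buckets` is also an assumption; a unique index on `{ sensorId, hour }` would guard against duplicate buckets from concurrent upserts.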

Extended Reference Pattern

Denormalize frequently accessed fields alongside the reference:

{
  items: [
    {
      productId: ObjectId("..."),
      productName: "Wireless Mouse",  // denormalized — avoids $lookup
      sku: "MOUSE-01",
      qty: 1,
      price: NumberDecimal("29.99")
    }
  ]
}

Update the denormalized fields when the source document changes, or accept eventual consistency.
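
If the copies should track the source, one approach is a multi-document update with `arrayFilters`. The sketch below assumes an `orders` collection shaped like the example above; `buildRenamePropagation` is a hypothetical helper that only builds the arguments.

```javascript
// Build the arguments for propagating a product rename into the
// denormalized `productName` copies inside order line items.
function buildRenamePropagation(productId, newName) {
  return {
    // Only touch orders that contain the product at all.
    filter: { "items.productId": productId },
    // `$[item]` targets every array element matched by the array filter below.
    update: { $set: { "items.$[item].productName": newName } },
    options: { arrayFilters: [{ "item.productId": productId }] }
  };
}

const rename = buildRenamePropagation("prod-42", "Wireless Mouse v2");
// In mongosh or a driver (not executed here):
//   db.orders.updateMany(rename.filter, rename.update, rename.options)
console.log(JSON.stringify(rename));
```

For historical records such as completed orders, keeping the name as it was at purchase time may be the better choice, in which case no propagation is needed.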

Outlier Pattern

Most documents are small; a few are huge — handle outliers separately:

// Normal product
{ _id: 1, name: "Widget", reviews: [ /* 5 reviews */ ] }

// Outlier product with 10,000 reviews
{ _id: 2, name: "Popular Gadget", reviewCount: 10000 }
// Reviews stored in separate `reviews` collection
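
The write path has to decide which shape a given product uses. A minimal sketch, assuming a hypothetical `hasOverflow` flag on the product and an arbitrary threshold of 50 embedded reviews (neither appears in the example above):

```javascript
// Decide where a new review is written under the outlier pattern.
// `hasOverflow` and `maxEmbedded` are assumptions made for this sketch.
function planReviewWrite(product, review, maxEmbedded = 50) {
  const embedded = Array.isArray(product.reviews) ? product.reviews.length : 0;
  if (!product.hasOverflow && embedded < maxEmbedded) {
    // Typical product: keep the review inside the product document.
    return {
      collection: "products",
      filter: { _id: product._id },
      update: { $push: { reviews: review }, $inc: { reviewCount: 1 } }
    };
  }
  // Outlier: store the review separately and flag the product.
  return {
    collection: "reviews",
    document: { productId: product._id, ...review },
    productUpdate: {
      filter: { _id: product._id },
      update: { $set: { hasOverflow: true }, $inc: { reviewCount: 1 } }
    }
  };
}

console.log(planReviewWrite({ _id: 1, reviews: [] }, { rating: 5 }).collection);
```

Readers then check the flag: unflagged products are served from one document, flagged ones with an additional paginated query against `reviews`.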

Attribute Pattern

Polymorphic products with different attributes per category:

{
  name: "Running Shoes",
  category: "footwear",
  attributes: {
    size: 10,
    color: "blue",
    material: "mesh"
  }
}
{
  name: "Laptop",
  category: "electronics",
  attributes: {
    cpu: "M3",
    ram: "16GB",
    storage: "512GB"
  }
}

Use schema validation to enforce category-specific attribute shapes.
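
One way to express category-specific shapes is a `$jsonSchema` validator with a `oneOf` branch per category. The sketch below is an assumption about which attributes are required, chosen only to match the two example documents; `enum` with a single value is used because MongoDB's `$jsonSchema` does not support `const`.

```javascript
// Validator sketch: each branch pins `category` and the attribute shape
// that goes with it. Required attribute lists are assumptions.
const productValidator = {
  $jsonSchema: {
    bsonType: "object",
    required: ["name", "category", "attributes"],
    oneOf: [
      {
        properties: {
          category: { enum: ["footwear"] },
          attributes: {
            bsonType: "object",
            required: ["size", "color"],
            properties: {
              size: { bsonType: ["int", "double"] },
              color: { bsonType: "string" },
              material: { bsonType: "string" }
            }
          }
        }
      },
      {
        properties: {
          category: { enum: ["electronics"] },
          attributes: {
            bsonType: "object",
            required: ["cpu", "ram", "storage"],
            properties: {
              cpu: { bsonType: "string" },
              ram: { bsonType: "string" },
              storage: { bsonType: "string" }
            }
          }
        }
      }
    ]
  }
};

// In mongosh (not executed here):
//   db.runCommand({ collMod: "products", validator: productValidator })
console.log(productValidator.$jsonSchema.oneOf.length);
```

A document whose category matches neither branch fails validation, so a new category needs a new branch before its products can be inserted.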

Time-Series Collections

MongoDB 5.0+ native time-series for metrics, IoT, and logs:

db.createCollection("sensor_readings", {
  timeseries: {
    timeField: "timestamp",
    metaField: "sensorId",
    granularity: "minutes"
  }
})

db.sensor_readings.insertOne({
  sensorId: "TEMP-001",
  timestamp: new Date(),
  temperature: 22.5
})

Automatic bucketing, compression, and optimized queries — prefer over manual bucket pattern for new projects.
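
Queries against a time-series collection look like queries against any other collection. As an illustration, this sketch builds an aggregation pipeline for hourly averages on the `sensor_readings` collection defined above; `hourlyAveragePipeline` is a hypothetical helper, and `$dateTrunc` requires MongoDB 5.0 or later.

```javascript
// Build a pipeline that averages one sensor's temperature per hour.
function hourlyAveragePipeline(sensorId, from, to) {
  return [
    // Filter on the metaField and a timeField range first.
    { $match: { sensorId, timestamp: { $gte: from, $lt: to } } },
    {
      $group: {
        _id: { $dateTrunc: { date: "$timestamp", unit: "hour" } },
        avgTemperature: { $avg: "$temperature" },
        readings: { $sum: 1 }
      }
    },
    { $sort: { _id: 1 } }
  ];
}

const pipeline = hourlyAveragePipeline(
  "TEMP-001",
  new Date("2024-06-13T00:00:00Z"),
  new Date("2024-06-14T00:00:00Z")
);
// In mongosh (not executed here):
//   db.sensor_readings.aggregate(pipeline)
console.log(JSON.stringify(pipeline));
```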

Schema Validation

db.createCollection("orders", {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["userId", "items", "status", "createdAt"],
      properties: {
        userId: { bsonType: "objectId" },
        status: { enum: ["pending", "completed", "cancelled"] },
        items: {
          bsonType: "array",
          minItems: 1,
          items: {
            bsonType: "object",
            required: ["sku", "qty", "price"],
            properties: {
              sku: { bsonType: "string" },
              qty: { bsonType: "int", minimum: 1 },
              price: { bsonType: "decimal" }
            }
          }
        }
      }
    }
  },
  validationLevel: "moderate",
  validationAction: "error"
})

Level	Behavior
`strict`	Validate all inserts and updates
`moderate`	Validate inserts; updates only if doc already valid
`off`	No validation
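
Validation settings on an existing collection can be changed with the `collMod` command. The helper below is a hypothetical convenience that builds the command and rejects unknown values; it covers the three levels in the table and the long-standing `error` and `warn` actions.

```javascript
const VALIDATION_LEVELS = ["off", "strict", "moderate"];
const VALIDATION_ACTIONS = ["error", "warn"];

// Build a collMod command that changes validation behavior.
function buildValidationChange(collection, level, action = "error") {
  if (!VALIDATION_LEVELS.includes(level)) {
    throw new RangeError(`Unknown validationLevel: ${level}`);
  }
  if (!VALIDATION_ACTIONS.includes(action)) {
    throw new RangeError(`Unknown validationAction: ${action}`);
  }
  return { collMod: collection, validationLevel: level, validationAction: action };
}

// A cautious rollout: log violations first, then enforce.
const observe = buildValidationChange("orders", "moderate", "warn");
const enforce = buildValidationChange("orders", "strict", "error");
// In mongosh (not executed here):
//   db.runCommand(observe)   // later: db.runCommand(enforce)
console.log(JSON.stringify([observe, enforce]));
```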

Denormalization vs Normalization

Approach	Pros	Cons
Denormalized (embed)	Single query, fast reads	Larger docs, update anomalies
Normalized (reference)	Smaller docs, single source of truth	Multiple queries or $lookup

Rule of thumb: denormalize for read-heavy, reference for write-heavy shared data.
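
To make the cost of the normalized side concrete, here is a sketch of the `$lookup` needed to read an order together with its referenced products. It assumes the `orders` and `products` collections used elsewhere on this page; `orderWithProductsPipeline` is a hypothetical helper.

```javascript
// Build a pipeline that joins an order to the products it references.
function orderWithProductsPipeline(orderId) {
  return [
    { $match: { _id: orderId } },
    {
      $lookup: {
        from: "products",
        localField: "items.productId", // array field: each element is matched
        foreignField: "_id",
        as: "products"
      }
    }
  ];
}

// In mongosh (not executed here):
//   db.orders.aggregate(orderWithProductsPipeline(someOrderId))
console.log(JSON.stringify(orderWithProductsPipeline("order-1")));
```

With the denormalized design the same read is a single `findOne` on `orders`, at the price of keeping the copied fields current.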

Production Scenarios

E-Commerce Catalog

// Products: referenced, updated by catalog team
{ _id, sku, name, price, category, attributes: { ... } }

// Orders — embed line item snapshot at purchase time
{ _id, userId, items: [{ sku, name, price, qty }], total, status }

// Reviews — separate collection, paginated
{ _id, productId, userId, rating, text, createdAt }
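
The line item snapshot can be sketched as a pure function that copies the current catalog values into the order. Everything here is illustrative: `buildOrder` is a hypothetical helper, the catalog is an in-memory `Map`, and prices are integer cents (an assumption made so totals are exact in plain JavaScript; with a driver you would store a decimal type instead).

```javascript
// Copy name and price into each line item at purchase time.
// Prices are integer cents here, an assumption made for exact arithmetic.
function buildOrder(userId, cart, catalog) {
  const items = cart.map(({ sku, qty }) => {
    const product = catalog.get(sku);
    if (!product) throw new Error(`Unknown SKU: ${sku}`);
    return { sku, name: product.name, priceCents: product.priceCents, qty };
  });
  const totalCents = items.reduce((sum, item) => sum + item.priceCents * item.qty, 0);
  return { userId, items, totalCents, status: "pending" };
}

const demoCatalog = new Map([
  ["WIDGET-01", { name: "Widget", priceCents: 1999 }],
  ["MOUSE-01", { name: "Wireless Mouse", priceCents: 2999 }]
]);
const demoOrder = buildOrder("user-1", [{ sku: "WIDGET-01", qty: 2 }, { sku: "MOUSE-01", qty: 1 }], demoCatalog);
console.log(demoOrder.totalCents);
```

Because the values are copied, a later price change in the catalog does not alter what the customer was charged.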

Social Feed

// User timeline: embed recent posts (subset pattern)
{ userId, posts: [{ postId, author, text, createdAt }] }

// Full posts collection for detail views
{ _id: postId, authorId, text, likes, commentCount }

Common Mistakes

Unbounded arrays — comments, logs, events grow forever; hits 16 MB limit
Over-embedding — 500 KB documents slow every read even when you need one field
Under-embedding — $lookup on every request when data is always accessed together
Same design as SQL — importing normalized tables without redesigning for documents
Ignoring write amplification — updating one field in a 500 KB embedded document rewrites the whole doc

Performance Tips

Keep working set (hot data + indexes) in RAM — model to limit document size
Bound embedded arrays with $push using $each and $slice, which trims the array on every write
Use $project in aggregation to avoid shipping large embedded arrays
Archive cold data to separate collections or databases

Migration Strategy

When schema evolves:

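// Batch migration with bulkWrite: convert double prices to Decimal128
const ops = []
db.products.find({ price: { $type: "double" } }, { _id: 1 }).forEach(doc => {
  ops.push({
    updateOne: {
      filter: { _id: doc._id, price: { $type: "double" } },  // skip docs already converted
      update: [{ $set: { price: { $toDecimal: "$price" } } }]  // aggregation update (4.2+)
    }
  })
  // Flush in batches of 1000 to limit round trips and memory
  if (ops.length === 1000) { db.products.bulkWrite(ops, { ordered: false }); ops.length = 0 }
})
if (ops.length > 0) db.products.bulkWrite(ops, { ordered: false })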

Use dual-write or background migration for zero-downtime schema changes.
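
While a background migration runs, readers see both old and new document shapes. One common way to cope is to upgrade documents as they are read. The sketch below assumes a `schemaVersion` marker field, which is not part of the schemas shown on this page, and represents the converted price as a string that a driver would wrap in its decimal type.

```javascript
// Upgrade a product document to the new shape when it is read.
// `schemaVersion` is an assumed marker field.
function upgradeProduct(doc) {
  if (doc.schemaVersion >= 2) return doc; // already in the new shape
  return {
    ...doc,
    // Old shape stored price as a double; the new shape uses a decimal string.
    price: typeof doc.price === "number" ? doc.price.toFixed(2) : doc.price,
    schemaVersion: 2
  };
}

console.log(upgradeProduct({ _id: 1, name: "Widget", price: 19.99 }));
```

The application can write the upgraded document back opportunistically, so frequently read documents migrate first and the background job only has to sweep the remainder.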

Troubleshooting

Document too large (16 MB limit)

db.collection.aggregate([
  { $project: { size: { $bsonSize: "$$ROOT" } } },
  { $sort: { size: -1 } },
  { $limit: 10 }
])

Split large arrays into separate collections or apply bucket pattern.
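
Splitting an oversized array into bucket documents can be sketched as a pure function. `splitIntoBuckets` and the `parentId`, `seq`, and `count` field names are hypothetical; pick a bucket size that keeps each document comfortably below the limit.

```javascript
// Split one large array into fixed-size bucket documents.
function splitIntoBuckets(parentId, items, bucketSize) {
  if (!Number.isInteger(bucketSize) || bucketSize < 1) {
    throw new RangeError("bucketSize must be a positive integer");
  }
  const buckets = [];
  for (let i = 0; i < items.length; i += bucketSize) {
    const slice = items.slice(i, i + bucketSize);
    // `seq` preserves the original order across buckets.
    buckets.push({ parentId, seq: buckets.length, count: slice.length, items: slice });
  }
  return buckets;
}

const events = Array.from({ length: 2500 }, (_, i) => ({ n: i }));
const eventBuckets = splitIntoBuckets("doc-1", events, 1000);
// In mongosh or a driver (not executed here):
//   db.event_buckets.insertMany(eventBuckets)
console.log(eventBuckets.map(b => b.count));
```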

Best Practices

Design for your top 5 query patterns — not every possible query
Prototype with real data volumes, not 100-row test sets
Document schema decisions in your team’s architecture docs
Use schema validation for critical collections in production
Review schema quarterly as access patterns evolve

What Comes Next

Security builds on your data model — role-based access control operates at database and collection level.
