Data modeling is the most consequential decision in MongoDB. Unlike in a relational database, there is no single normalized schema to derive; you design documents around how your application reads and writes data.

Design Philosophy

MongoDB rewards schema design for access patterns, not theoretical normalization:

  1. Identify how the application queries data (read patterns)
  2. Identify write frequency and update scope
  3. Choose embed vs reference for each relationship
  4. Add indexes matching query patterns
  5. Validate schema where business rules require it

Embedding vs Referencing

When to Embed

  • Data is accessed together in the same query
  • 1-to-few relationship (addresses on a user, line items on an order)
  • Data does not grow unboundedly
  • Child data is not shared across parents

// User with embedded addresses, always fetched together
{
  _id: ObjectId("..."),
  name: "Alice Chen",
  email: "alice@example.com",
  addresses: [
    { type: "home", street: "123 Main St", city: "Seattle" },
    { type: "work", street: "456 Oak Ave", city: "Bellevue" }
  ]
}
  

When to Reference

  • 1-to-many with unbounded growth (comments on a popular post)
  • Many-to-many relationships (students ↔ courses)
  • Data shared across documents (product catalog referenced by orders)
  • Child data updated independently and frequently

// Order references products; product details change independently
{
  _id: ObjectId("..."),
  userId: ObjectId("..."),
  items: [
    { productId: ObjectId("..."), sku: "WIDGET-01", qty: 2, price: NumberDecimal("19.99") }
  ],
  status: "completed"
}
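
The embed and reference checklists above can be condensed into a small decision helper. This is a rough sketch rather than a MongoDB API: the function name chooseRelationship and its option names are hypothetical, and a real design weighs these factors instead of applying them mechanically.

```javascript
// Hypothetical helper: encodes the embed/reference checklists as a heuristic.
function chooseRelationship({ accessedTogether, bounded, sharedAcrossParents, updatedIndependently }) {
  // Unbounded growth, sharing, or independent updates all favor a separate collection
  if (!bounded || sharedAcrossParents || updatedIndependently) {
    return "reference";
  }
  // Small, private child data that is read with its parent favors embedding
  return accessedTogether ? "embed" : "reference";
}

// Addresses on a user: few, private, read together with the user
const addresses = chooseRelationship({
  accessedTogether: true, bounded: true,
  sharedAcrossParents: false, updatedIndependently: false
});

// Comments on a popular post: read with the post, but growth is unbounded
const comments = chooseRelationship({
  accessedTogether: true, bounded: false,
  sharedAcrossParents: false, updatedIndependently: false
});

console.log(addresses, comments);
```

The helper checks the reasons to reference first because any one of them is usually enough to rule out embedding, while embedding needs all of its conditions to hold.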
  

Design Patterns

Subset Pattern

Store the most recent N items embedded; archive the rest in a separate collection:

  // Post with last 20 comments embedded
{
  _id: postId,
  title: "MongoDB Tips",
  comments: [ /* last 20 comments */ ],
  commentCount: 1543
}
// Full comment history in `comments` collection
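
One way to keep the embedded subset bounded is to trim the array in the same update that appends to it. The sketch below builds such an update document; the helper name buildAddCommentUpdate, the comment fields, and the posts collection name are assumptions for illustration.

```javascript
// Hypothetical helper: builds the update document that appends a comment
// and trims the embedded array to the newest `keep` entries.
function buildAddCommentUpdate(comment, keep = 20) {
  return {
    $push: {
      comments: {
        $each: [comment], // $slice is only valid together with $each
        $slice: -keep     // a negative value keeps the last `keep` elements
      }
    },
    $inc: { commentCount: 1 } // total count, including comments no longer embedded
  };
}

const update = buildAddCommentUpdate({ author: "bob", text: "Nice post", createdAt: new Date() });

// In mongosh, assuming a `posts` collection and a `postId` variable:
// db.posts.updateOne({ _id: postId }, update)
// The same comment would also be inserted into the `comments` collection.
```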
  

Bucket Pattern

Group time-series events into bucket documents:

  // One document per sensor per hour instead of one per reading
{
  sensorId: "TEMP-001",
  hour: ISODate("2024-06-13T14:00:00Z"),
  readings: [
    { t: ISODate("2024-06-13T14:00:00Z"), v: 22.5 },
    { t: ISODate("2024-06-13T14:00:01Z"), v: 22.6 },
    // ... one reading per second, up to 3600 per hourly bucket
  ],
  count: 3600,
  min: 22.1,
  max: 23.0,
  avg: 22.55
}
  

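Reduces document count by a factor equal to the readings per bucket (for example, 3600x for one reading per second in hourly buckets), which suits IoT and logging workloads.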
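
Writing into a bucket is typically a single upsert per reading. The sketch below builds the filter and update for that upsert. The helper name buildBucketUpsert and the sensor_buckets collection are assumptions, and it stores a running sum instead of avg, since an average cannot be maintained with simple update operators and is easy to derive as sum / count when reading.

```javascript
// Hypothetical helper: builds the filter and update for an upsert that
// appends one reading to its hourly bucket.
function buildBucketUpsert(sensorId, reading) {
  // Truncate the reading time to the start of its UTC hour
  const hour = new Date(reading.t);
  hour.setUTCMinutes(0, 0, 0);

  return {
    filter: { sensorId, hour },
    update: {
      $push: { readings: reading },
      $inc: { count: 1, sum: reading.v }, // avg = sum / count at read time
      $min: { min: reading.v },
      $max: { max: reading.v }
    },
    options: { upsert: true } // creates the bucket on the first reading of the hour
  };
}

const op = buildBucketUpsert("TEMP-001", { t: new Date("2024-06-13T14:01:00Z"), v: 22.6 });

// In mongosh, assuming a `sensor_buckets` collection:
// db.sensor_buckets.updateOne(op.filter, op.update, op.options)
```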

Extended Reference Pattern

Denormalize frequently accessed fields alongside the reference:

  {
  items: [
    {
      productId: ObjectId("..."),
      productName: "Wireless Mouse",  // denormalized — avoids $lookup
      sku: "MOUSE-01",
      qty: 1,
      price: NumberDecimal("29.99")
    }
  ]
}
  

Update denormalized fields when source changes, or accept eventual consistency.
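
If the denormalized copy should follow the source, the refresh can be a single multi-document update with an array filter. The following is a sketch under assumptions: the helper name buildProductNameSync and the orders collection are hypothetical, and the placeholder string stands in for a real ObjectId.

```javascript
// Hypothetical helper: builds updateMany arguments that refresh the
// denormalized productName in every order line referencing the product.
function buildProductNameSync(productId, newName) {
  return {
    // Only orders that contain the product at all
    filter: { "items.productId": productId },
    // $[line] touches just the array elements matched by arrayFilters
    update: { $set: { "items.$[line].productName": newName } },
    options: { arrayFilters: [{ "line.productId": productId }] }
  };
}

const sync = buildProductNameSync("PRODUCT_ID_PLACEHOLDER", "Wireless Mouse v2");

// In mongosh, with a real ObjectId in place of the placeholder:
// db.orders.updateMany(sync.filter, sync.update, sync.options)
```

For historical records such as completed orders, leaving the copied value untouched is often the intended behavior, because it records what the customer saw at purchase time.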

Outlier Pattern

Most documents are small; a few are huge — handle outliers separately:

  // Normal product
{ _id: 1, name: "Widget", reviews: [ /* 5 reviews */ ] }

// Outlier product with 10,000 reviews
{ _id: 2, name: "Popular Gadget", reviewCount: 10000 }
// Reviews stored in separate `reviews` collection
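
The application has to decide, per write, whether a document is still in the normal range. The sketch below shows one possible routing function. The threshold MAX_EMBEDDED, the hasOverflow flag, and the helper name planReviewWrite are all assumptions for illustration, not MongoDB features.

```javascript
// Assumed threshold for how many reviews stay embedded; not a MongoDB setting.
const MAX_EMBEDDED = 50;

// Hypothetical helper: decides where a new review should be written.
function planReviewWrite(product, review) {
  const embedded = (product.reviews || []).length;
  if (!product.hasOverflow && embedded < MAX_EMBEDDED) {
    // Normal case: keep the review inside the product document
    return {
      collection: "products",
      filter: { _id: product._id },
      update: { $push: { reviews: review }, $inc: { reviewCount: 1 } }
    };
  }
  // Outlier case: store the review separately and flag the product
  return {
    collection: "reviews",
    insert: { productId: product._id, ...review },
    flagUpdate: { $set: { hasOverflow: true }, $inc: { reviewCount: 1 } }
  };
}

const normal = planReviewWrite({ _id: 1, reviews: [] }, { rating: 5, text: "Great" });
const outlier = planReviewWrite({ _id: 2, hasOverflow: true }, { rating: 4, text: "Good" });
```

Readers can then check the flag and query the reviews collection only for flagged products, so the common case stays a single read.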
  

Attribute Pattern

Products are polymorphic: each category carries different attributes, grouped under a single attributes field:

  {
  name: "Running Shoes",
  category: "footwear",
  attributes: {
    size: 10,
    color: "blue",
    material: "mesh"
  }
}
{
  name: "Laptop",
  category: "electronics",
  attributes: {
    cpu: "M3",
    ram: "16GB",
    storage: "512GB"
  }
}
  

Use schema validation to enforce category-specific attribute shapes.
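
A validator can express category-specific shapes with one oneOf branch per category. The sketch below covers the two categories shown above; which attributes are marked required is an assumption made for illustration.

```javascript
// A sketch of a validator with one branch per category.
// The required attribute lists are assumptions, not rules from a real catalog.
const productValidator = {
  $jsonSchema: {
    bsonType: "object",
    required: ["name", "category", "attributes"],
    // Exactly one branch must match, so unknown categories are rejected
    oneOf: [
      {
        properties: {
          category: { enum: ["footwear"] },
          attributes: {
            bsonType: "object",
            required: ["size", "color"],
            properties: { color: { bsonType: "string" } }
          }
        }
      },
      {
        properties: {
          category: { enum: ["electronics"] },
          attributes: {
            bsonType: "object",
            required: ["cpu", "ram"],
            properties: { cpu: { bsonType: "string" } }
          }
        }
      }
    ]
  }
};

// In mongosh:
// db.createCollection("products", { validator: productValidator })
```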

Time-Series Collections

MongoDB 5.0+ provides native time-series collections for metrics, IoT, and logs:

  db.createCollection("sensor_readings", {
  timeseries: {
    timeField: "timestamp",
    metaField: "sensorId",
    granularity: "minutes"
  }
})

db.sensor_readings.insertOne({
  sensorId: "TEMP-001",
  timestamp: new Date(),
  temperature: 22.5
})
  

Time-series collections handle bucketing, compression, and query optimization automatically; prefer them over the manual bucket pattern for new projects.
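
Queries against a time-series collection look like queries against any other collection. As an illustration, the sketch below builds a pipeline that averages temperature per hour for one sensor; the helper name buildHourlyAveragePipeline and the output field names are assumptions, and $dateTrunc requires MongoDB 5.0 or later.

```javascript
// Hypothetical helper: builds a pipeline computing the average temperature
// per hour for one sensor over a half-open time range [from, to).
function buildHourlyAveragePipeline(sensorId, from, to) {
  return [
    // Filter on the metaField and timeField first to limit the data scanned
    { $match: { sensorId, timestamp: { $gte: from, $lt: to } } },
    {
      $group: {
        _id: { $dateTrunc: { date: "$timestamp", unit: "hour" } },
        avgTemperature: { $avg: "$temperature" },
        samples: { $sum: 1 }
      }
    },
    { $sort: { _id: 1 } } // chronological order
  ];
}

const pipeline = buildHourlyAveragePipeline(
  "TEMP-001",
  new Date("2024-06-13T00:00:00Z"),
  new Date("2024-06-14T00:00:00Z")
);

// In mongosh:
// db.sensor_readings.aggregate(pipeline)
```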

Schema Validation

  db.createCollection("orders", {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["userId", "items", "status", "createdAt"],
      properties: {
        userId: { bsonType: "objectId" },
        status: { enum: ["pending", "completed", "cancelled"] },
        items: {
          bsonType: "array",
          minItems: 1,
          items: {
            bsonType: "object",
            required: ["sku", "qty", "price"],
            properties: {
              sku: { bsonType: "string" },
              qty: { bsonType: "int", minimum: 1 },
              price: { bsonType: "decimal" }
            }
          }
        }
      }
    }
  },
  validationLevel: "moderate",
  validationAction: "error"
})
  
Level      Behavior
strict     Validate all inserts and updates
moderate   Validate inserts; validate updates only if the document is already valid
off        No validation

Denormalization vs Normalization

Approach                 Pros                                   Cons
Denormalized (embed)     Single query, fast reads               Larger docs, update anomalies
Normalized (reference)   Smaller docs, single source of truth   Multiple queries or $lookup

Rule of thumb: denormalize for read-heavy, reference for write-heavy shared data.

Production Scenarios

E-Commerce Catalog

  // Products — referenced, updated by catalog team
{ _id, sku, name, price, category, attributes: { ... } }

// Orders — embed line item snapshot at purchase time
{ _id, userId, items: [{ sku, name, price, qty }], total, status }

// Reviews — separate collection, paginated
{ _id, productId, userId, rating, text, createdAt }
  

Social Feed

  // User timeline — embed recent posts (subset pattern)
{ userId, posts: [{ postId, author, text, createdAt }] }

// Full posts collection for detail views
{ _id: postId, authorId, text, likes, commentCount }
  

Common Mistakes

  • Unbounded arrays: comments, logs, and events grow forever and eventually hit the 16 MB document limit
  • Over-embedding: 500 KB documents slow every read, even when you need one field
  • Under-embedding: running $lookup on every request when the data is always accessed together
  • Same design as SQL: importing normalized tables without redesigning for documents
  • Ignoring write amplification: updating one field in a 500 KB embedded document rewrites the whole document

Performance Tips

  • Keep working set (hot data + indexes) in RAM — model to limit document size
  • Cap array growth with $push combined with $each and $slice (this bounds an embedded array; capped collections are a separate feature)
  • Use $project in aggregation to avoid shipping large embedded arrays
  • Archive cold data to separate collections or databases

Migration Strategy

When schema evolves:

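// Batch migration with bulkWrite: convert double prices to Decimal128
let ops = []
db.products.find({ price: { $type: "double" } }, { _id: 1 }).forEach(doc => {
  ops.push({
    updateOne: {
      // Re-check the type so the operation is safe to re-run
      filter: { _id: doc._id, price: { $type: "double" } },
      update: [{ $set: { price: { $toDecimal: "$price" } } }]  // aggregation pipeline update
    }
  })
  // Flush every 1000 operations to keep each batch small
  if (ops.length === 1000) { db.products.bulkWrite(ops, { ordered: false }); ops = [] }
})
if (ops.length > 0) db.products.bulkWrite(ops, { ordered: false })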
  

Use dual-write or background migration for zero-downtime schema changes.

Troubleshooting

Document too large (16 MB limit)

  db.collection.aggregate([
  { $project: { size: { $bsonSize: "$$ROOT" } } },
  { $sort: { size: -1 } },
  { $limit: 10 }
])
  

Split large arrays into separate collections or apply bucket pattern.
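
Splitting an oversized array can be done in application code before writing the pieces to a separate collection. The sketch below shows one way to do it; the helper name splitIntoBuckets and the field names parentId, seq, count, and items are assumptions for illustration.

```javascript
// Hypothetical helper: splits one oversized array into fixed-size bucket
// documents that can be inserted into a separate collection.
function splitIntoBuckets(parentId, items, bucketSize) {
  const buckets = [];
  for (let i = 0; i < items.length; i += bucketSize) {
    const slice = items.slice(i, i + bucketSize);
    buckets.push({
      parentId,            // reference back to the original document
      seq: buckets.length, // preserves the original ordering
      count: slice.length,
      items: slice
    });
  }
  return buckets;
}

const buckets = splitIntoBuckets("post-1", [1, 2, 3, 4, 5], 2);
// buckets holds 3 documents with items [1, 2], [3, 4], and [5]
```

An index on parentId and seq would let readers page through the buckets in order.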

Best Practices

  1. Design for your top 5 query patterns — not every possible query
  2. Prototype with real data volumes, not 100-row test sets
  3. Document schema decisions in your team’s architecture docs
  4. Use schema validation for critical collections in production
  5. Review schema quarterly as access patterns evolve

What Comes Next

Security builds on your data model — role-based access control operates at database and collection level.