Schema Migration: MongoDB to Cosmos DB¶

Audience: Data engineers, application architects, and database administrators designing the target schema for Cosmos DB after migrating from MongoDB.

Overview¶

MongoDB's flexible schema is one of its core strengths -- and that flexibility carries over to Cosmos DB for MongoDB. Both vCore and RU-based deployments support schemaless documents, nested objects, arrays, and mixed-type fields. However, schema migration is not just about copying the document structure. It is about optimizing for Cosmos DB's performance characteristics: partition key design (RU-based), indexing policy tuning, document size management, and reference vs. embedding trade-offs.

This guide covers collection design patterns, document modeling, indexing strategy, partition key selection, and TTL configuration for both Cosmos DB deployment models.

1. Collection design patterns¶

1:1 collection mapping (default)¶

The simplest approach: map each MongoDB collection to a Cosmos DB container. This works for most migrations and is the recommended starting point.

MongoDB                    Cosmos DB
═══════                    ═════════
users        →            users (container)
orders       →            orders (container)
products     →            products (container)
inventory    →            inventory (container)

Collection consolidation (RU-based optimization)¶

For RU-based deployments with shared database throughput, consolidating small collections into a single container reduces the minimum RU overhead. Each container on dedicated throughput requires at least 400 RU/s; shared throughput databases require 400 RU/s total across all containers.

Pattern: discriminator field

// Users and preferences in a single container
// Partition key: /entityType + /userId (hierarchical)
{
  "entityType": "user",
  "userId": "user-123",
  "name": "Jane Doe",
  "email": "jane@example.com"
}

{
  "entityType": "preference",
  "userId": "user-123",
  "theme": "dark",
  "language": "en-US"
}

When to consolidate:

Collections with fewer than 1,000 documents and minimal throughput
Related entities always queried together
Total collections exceed 25 (shared throughput has a 25-container limit for 400 RU/s)

When not to consolidate:

Collections with different TTL requirements
Collections with different indexing needs
High-throughput collections that need dedicated RU budgets

Collection splitting (for hot partitions)¶

If a single MongoDB collection has a skewed access pattern (e.g., 90% of traffic hits "active" orders), consider splitting:

MongoDB                    Cosmos DB
═══════                    ═════════
orders       →            active_orders (container, high throughput)
             →            archived_orders (container, low throughput)

2. Document modeling: embedded vs. referenced¶

Embedded documents (denormalized)¶

MongoDB best practice: embed related data when it is read together. This carries over to Cosmos DB with one critical caveat for RU-based: document size affects RU cost.

Good candidate for embedding:

{
    "_id": "order-456",
    "customerId": "cust-123",
    "orderDate": "2026-04-30T10:00:00Z",
    "items": [
        { "productId": "prod-1", "name": "Widget A", "qty": 2, "price": 29.99 },
        { "productId": "prod-2", "name": "Widget B", "qty": 1, "price": 49.99 }
    ],
    "shipping": {
        "address": "123 Main St",
        "city": "Arlington",
        "state": "VA",
        "zip": "22201"
    },
    "total": 109.97
}

Why embed here:

Items and shipping are always read with the order.
The embedded arrays are bounded (orders have finite items).
The total document size stays well under 16 KB.

Referenced documents (normalized)¶

Use references when:

The referenced entity is large or unbounded.
The referenced entity changes independently and frequently.
The referenced entity is shared across multiple parent documents.

Example: referencing user profile from orders

// orders container
{
  "_id": "order-456",
  "customerId": "cust-123",  // reference, not embedded
  "orderDate": "2026-04-30T10:00:00Z",
  "total": 109.97
}

// users container
{
  "_id": "cust-123",
  "name": "Jane Doe",
  "email": "jane@example.com",
  "address": { "street": "123 Main St", "city": "Arlington" }
}

RU-based sizing guidance¶

Document size	Point read RU	Insert RU	Recommendation
< 1 KB	1	~5	Optimal for RU efficiency
1--4 KB	1--2	~10	Good
4--16 KB	2--5	~15--30	Acceptable; monitor RU consumption
16--100 KB	5--30	~30--100	Consider splitting or trimming
> 100 KB	30+	100+	Strongly consider normalizing

For RU-based, keep documents under 16 KB for optimal RU efficiency. Documents approaching the 2 MB limit consume disproportionate RUs.

3. Partition key selection patterns¶

Partition key design is covered in detail in the RU-Based Migration Guide. This section provides specific patterns for common MongoDB collection types.

Pattern: user-centric application¶

Collection: user_profiles    → Partition key: /userId
Collection: user_sessions    → Partition key: /userId
Collection: user_orders      → Partition key: /userId
Collection: user_preferences → Partition key: /userId

Advantage: All user data co-located. Single-partition queries for user-scoped operations. Transactions across user's orders and preferences possible.

Pattern: multi-tenant SaaS¶

Collection: tenants   → Partition key: /tenantId
Collection: documents → Partition key: /tenantId
Collection: audit_log → Partition key: /tenantId

Advantage: Tenant isolation. Per-tenant queries never cross partitions. Can use hierarchical partition key (/tenantId, /userId) for finer distribution.

Pattern: event sourcing / time-series¶

Collection: events → Partition key: /entityId
                   → Use TTL for automatic expiration
                   → Use analytical store for time-range queries

Why not partition by date: Partitioning by date creates hot partitions (today's partition gets all writes). Partition by the entity generating events; use analytical store for time-range analytics.

Pattern: catalog / reference data¶

Collection: products   → Partition key: /categoryId
Collection: categories → Partition key: /categoryId

Consideration: If the catalog is small (< 20 GB, < 10,000 RU/s), a synthetic partition key may work: /id provides maximum distribution but loses query affinity. For catalogs queried by category, /categoryId is better.

4. Indexing strategy¶

vCore indexing (familiar MongoDB approach)¶

vCore uses standard MongoDB indexing. Migrate your existing indexes directly:

// Indexes migrate with mongorestore
// Or create manually:
db.orders.createIndex({ customerId: 1, orderDate: -1 });
db.orders.createIndex({ status: 1 });
db.orders.createIndex({ "items.productId": 1 });
db.orders.createIndex({ createdAt: 1 }, { expireAfterSeconds: 2592000 }); // 30-day TTL

vCore indexing best practices:

Keep the total index count under 64 per collection (MongoDB default limit).
Use compound indexes for queries that filter and sort on multiple fields.
Drop unused indexes to reduce write overhead.
Use explain() to validate query plans.

RU-based indexing policy (declarative approach)¶

RU-based Cosmos DB uses a JSON indexing policy rather than createIndex(). The default policy indexes all properties, which is flexible but expensive on writes.

Recommended approach: start with "exclude all, include explicitly"

{
    "indexingMode": "consistent",
    "automatic": true,
    "includedPaths": [],
    "excludedPaths": [{ "path": "/*" }]
}

Then add back only the paths your queries need:

{
    "indexingMode": "consistent",
    "automatic": true,
    "includedPaths": [
        { "path": "/customerId/?" },
        { "path": "/orderDate/?" },
        { "path": "/status/?" },
        { "path": "/total/?" },
        { "path": "/items/[]/productId/?" }
    ],
    "excludedPaths": [{ "path": "/*" }],
    "compositeIndexes": [
        [
            { "path": "/customerId", "order": "ascending" },
            { "path": "/orderDate", "order": "descending" }
        ],
        [
            { "path": "/status", "order": "ascending" },
            { "path": "/total", "order": "descending" }
        ]
    ],
    "spatialIndexes": [
        {
            "path": "/location/*",
            "types": ["Point", "Polygon"]
        }
    ]
}

Index type mapping¶

MongoDB index type	RU-based equivalent	Notes
Single-field	Included path with `/?` suffix	Auto-indexed by default
Compound	Composite index	Required for ORDER BY on multiple fields
Multikey (array)	Included path with `/[]`	Arrays auto-indexed when path is included
Text	Basic text index or Azure AI Search	For full-text search, integrate Azure AI Search
Geospatial (2dsphere)	Spatial index in policy	Supports Point, Polygon, LineString
Hashed	Not needed (partition key handles distribution)	Cosmos DB partitioning replaces hash-based distribution
Wildcard	Default policy (include all)	Use sparingly; increases write cost
TTL	Container-level TTL policy	Set per container, not per index
Unique	Unique key policy (set at container creation)	Must include partition key; immutable after creation

5. TTL configuration¶

MongoDB TTL (per-index)¶

// MongoDB: TTL index on a date field
db.sessions.createIndex({ lastAccess: 1 }, { expireAfterSeconds: 3600 });

Cosmos DB vCore TTL¶

Same syntax as MongoDB. TTL indexes work identically:

db.sessions.createIndex({ lastAccess: 1 }, { expireAfterSeconds: 3600 });

Cosmos DB RU-based TTL¶

RU-based TTL is set at the container level, not per-index. Documents must have a _ts (system timestamp) or a custom TTL field.

Container-level TTL (based on _ts):

# Set default TTL to 30 days (in seconds)
az cosmosdb mongodb collection update \
  --resource-group rg-data-platform \
  --account-name my-cosmos-account \
  --database-name mydb \
  --name sessions \
  --default-ttl 2592000

Per-document TTL override:

{
    "_id": "session-789",
    "userId": "user-123",
    "ttl": 3600
}

Documents with a ttl field set to a positive integer expire that many seconds after their _ts. Documents with ttl: -1 never expire (override container default). Documents without a ttl field use the container default.

6. Schema validation migration¶

MongoDB `$jsonSchema` validation¶

db.createCollection("orders", {
    validator: {
        $jsonSchema: {
            bsonType: "object",
            required: ["customerId", "orderDate", "total"],
            properties: {
                customerId: { bsonType: "string" },
                orderDate: { bsonType: "date" },
                total: { bsonType: "decimal", minimum: 0 },
                status: {
                    enum: [
                        "pending",
                        "processing",
                        "shipped",
                        "delivered",
                        "cancelled",
                    ],
                },
            },
        },
    },
    validationLevel: "moderate",
    validationAction: "warn",
});

Cosmos DB vCore¶

Schema validation using $jsonSchema is supported. Migrate your validation rules directly.

Cosmos DB RU-based¶

Server-side validation is supported via the $jsonSchema validator on container creation. Apply the same validation rules:

# Set validation via Azure CLI (or Azure Portal)
az cosmosdb mongodb collection create \
  --resource-group rg-data-platform \
  --account-name my-cosmos-account \
  --database-name mydb \
  --name orders \
  --shard "customerId" \
  --throughput 4000
# Then apply validator via mongosh or application code

7. Migration checklist¶

Maintainers: csa-inabox core team Last updated: 2026-04-30