Pattern — Cosmos DB

TL;DR: Pick NoSQL API for greenfield, MongoDB / Cassandra / PostgreSQL APIs only when migrating those workloads. Use autoscale RU/s for spiky traffic, provisioned RU/s for predictable. Get the partition key right the first time — you can't change it without recreating the container. Enable Synapse Link if you need analytics over the same data without copying it.

[Screenshot: Cosmos DB account Overview blade, Essentials panel showing East US read/write region, Free Tier Discount opted in, throughput limits, and the Add Container / Data Explorer / Mirroring in Fabric toolbar]

Problem

Cosmos DB is the right choice when you need single-digit millisecond latency at any scale, multi-region writes, or flexible schema for high-cardinality / heterogeneous workloads. But it's expensive when designed wrong, and the wrong partition-key decision is permanent.

Architecture

flowchart LR
    App[Application] --> SDK[Cosmos SDK<br/>direct mode + bulk ops]
    SDK --> GW[Cosmos Gateway<br/>per region]
    GW --> Partition1[Physical partition 1<br/>50 GB / 10,000 RU/s max]
    GW --> Partition2[Physical partition 2]
    GW --> Partition3[Physical partition N]

    Partition1 -.replicate.-> Region2[Secondary region]
    Partition1 -.replicate.-> Region3[Tertiary region]

    Partition1 -. analytical store .-> Synapse[Synapse Link<br/>read-only analytics]

Pattern: pick the right API

| API | Use for |
| --- | --- |
| NoSQL (default) | Greenfield workloads. Best perf, best feature set, native to Cosmos |
| MongoDB API | Migrating MongoDB apps. Drop-in for most MongoDB drivers; some perf trade-offs |
| Cassandra API | Migrating Cassandra apps. Good for time-series and wide-column patterns |
| PostgreSQL (Citus) | Distributed PostgreSQL. Really a different product; choose for HTAP / multi-tenant SaaS |
| Gremlin (graph) | Graph traversal workloads. Often supplanted by GraphRAG patterns over a relational store now |
| Table | Migrating Azure Table Storage. Better consistency + perf than Table Storage |
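
If you land on the NoSQL API, client setup is small. A minimal sketch using the Python SDK (azure-cosmos) with Microsoft Entra auth; the account URL and the appdb / events names are placeholders, not something this pattern prescribes.

```python
# pip install azure-cosmos azure-identity
from azure.cosmos import CosmosClient
from azure.identity import DefaultAzureCredential

# Placeholder endpoint and names -- substitute your own.
ACCOUNT_URL = "https://<your-account>.documents.azure.com:443/"

# DefaultAzureCredential resolves managed identity, Azure CLI login, etc.
client = CosmosClient(url=ACCOUNT_URL, credential=DefaultAzureCredential())
database = client.get_database_client("appdb")
container = database.get_container_client("events")
```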

Pattern: partition key design

The single most important decision. Goals:

  1. High cardinality (millions+ distinct values)
  2. Even access distribution (no hot partition)
  3. Stable (doesn't change after creation — you cannot update a partition key value)
  4. Aligned with your read pattern (most queries should target one partition)

Examples

| Workload | Good partition key | Bad partition key |
| --- | --- | --- |
| User events / activity | userId | eventType (low cardinality) |
| Multi-tenant SaaS | tenantId (if tenants are similar size) | region (low cardinality) |
| IoT sensor readings | deviceId | sensorType |
| Order management | customerId (if reads are by customer) | orderStatus (small cardinality, hot status) |
| Time-series at extreme scale | Synthetic: ${deviceId}-${dayBucket} | timestamp (hot recent partition) |
| Mixed multi-tenant where tenants vary 1M× in size | Hierarchical: tenantId/userId (subpartition) | tenantId alone (mega-tenant becomes hot) |

If you can't find a good single key, use a synthetic key (region#date, customerId#year) or hierarchical partition keys (preview).
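
A sketch of what these choices look like with the Python SDK, continuing from the client sketch above (`database` comes from there). The container names and day-bucket scheme are illustrative, and the hierarchical (MultiHash) form assumes a recent SDK and API version.

```python
from datetime import datetime, timezone
from azure.cosmos import PartitionKey

# Single high-cardinality key: container keyed by userId.
events = database.create_container_if_not_exists(
    id="events",
    partition_key=PartitionKey(path="/userId"),
)

# Synthetic key for extreme-scale time series: deviceId + day bucket,
# stored as its own property so it can serve as the partition key.
def make_pk(device_id: str, ts: datetime) -> str:
    return f"{device_id}-{ts.strftime('%Y%m%d')}"

telemetry = database.create_container_if_not_exists(
    id="telemetry",
    partition_key=PartitionKey(path="/pk"),
)
telemetry.upsert_item({
    "id": "reading-001",
    "deviceId": "dev-42",
    "pk": make_pk("dev-42", datetime.now(timezone.utc)),
    "temperature": 21.7,
})

# Hierarchical (subpartitioned) key for heavily skewed multi-tenant data.
tenant_data = database.create_container_if_not_exists(
    id="tenant-data",
    partition_key=PartitionKey(path=["/tenantId", "/userId"], kind="MultiHash"),
)
```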

Pattern: consistency level

| Level | Latency | Use for |
| --- | --- | --- |
| Strong | Highest | Financial transactions, anything requiring strict linearizability |
| Bounded staleness | Medium-high | Workloads where you want a predictable max staleness window |
| Session (default) | Medium | User-facing apps; "read your own writes" within a session token |
| Consistent prefix | Low | Append-only logs, event streams |
| Eventual | Lowest | Analytics, aggregations, anywhere staleness doesn't matter |

Default is Session — start there. Tighten only when you have a real consistency requirement; loosen only when you've proven you can tolerate it.
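
Consistency is set at the account level and can only be relaxed (never tightened) by a client. A minimal sketch with the Python SDK, assuming the account default is Session and reusing the placeholder ACCOUNT_URL from the earlier sketch.

```python
from azure.cosmos import CosmosClient
from azure.identity import DefaultAzureCredential

# A second client that reads with weaker consistency than the account default,
# e.g. for a read-mostly aggregation job that tolerates staleness.
analytics_client = CosmosClient(
    url=ACCOUNT_URL,
    credential=DefaultAzureCredential(),
    consistency_level="Eventual",   # account default stays Session
)
```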

Pattern: throughput model

| Model | Use for |
| --- | --- |
| Autoscale RU/s (default) | Spiky workloads, dev environments, anything where peak ≠ steady-state |
| Provisioned RU/s | Predictable steady-state where autoscale's 10x range is wasteful |
| Serverless | Dev, low-volume workloads (<5,000 RU/s peak), sandbox |

Autoscale costs ~50% more per RU than provisioned — but pays for itself if your peak/avg ratio >2×.
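
What the two models look like at container creation time. A sketch with the Python SDK, assuming a version of azure-cosmos whose offer_throughput parameter accepts a ThroughputProperties object, and the `database` client from the earlier sketch.

```python
from azure.cosmos import PartitionKey, ThroughputProperties

# Autoscale: billed at ~1.5x the per-RU rate on the highest throughput reached
# each hour, scaling between 10% of the max and the max (here 400-4,000 RU/s).
spiky = database.create_container_if_not_exists(
    id="orders",
    partition_key=PartitionKey(path="/customerId"),
    offer_throughput=ThroughputProperties(auto_scale_max_throughput=4000),
)

# Manual provisioned: a flat 1,000 RU/s, cheaper per RU for steady load.
steady = database.create_container_if_not_exists(
    id="catalog",
    partition_key=PartitionKey(path="/categoryId"),
    offer_throughput=1000,
)
```

The arithmetic behind the rule of thumb: at roughly 1.5x the per-RU rate, autoscale breaks even when the average of your hourly peaks falls below about two-thirds of the flat RU/s you would otherwise have to provision; a peak/avg ratio above 2x leaves comfortable margin.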

Pattern: TTL for cost control

{
    "id": "session-12345",
    "userId": "user-7890",
    "data": "...",
    "ttl": 86400 // expires in 24 hours
}

Setting ttl on documents is free and automatic (no RU cost for deletes). Use it for:

  • Session data
  • Cache documents
  • Streaming events past their analytical window
  • Soft-deleted records (set TTL to grace period)
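
A container-level sketch with the Python SDK; default_ttl and ttl are the real property names, while the container and item values are illustrative. default_ttl=-1 turns TTL on without a container-wide expiry, so only items that carry their own ttl expire.

```python
from azure.cosmos import PartitionKey

sessions = database.create_container_if_not_exists(
    id="sessions",
    partition_key=PartitionKey(path="/userId"),
    default_ttl=-1,   # TTL on, but no container-wide expiry
)

sessions.upsert_item({
    "id": "session-12345",
    "userId": "user-7890",
    "data": "...",
    "ttl": 86400,     # seconds since last write; expires after 24 hours
})
```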

Pattern: Synapse Link (analytical store)

Cosmos has an analytical store: a separate columnar copy that is auto-synced from the transactional store. Pros:

  • Zero-RU impact on transactional workload
  • Auto-synced (~2 minute lag)
  • Queryable from Synapse Spark / Synapse SQL Serverless
  • Eliminates "ETL Cosmos to lakehouse for analytics"

Enable when:

  • You want analytics over current Cosmos data without copying
  • Near-real-time sync (minutes of lag) is fine; you don't need sub-second analytical freshness
  • Cost is acceptable (analytical store has its own storage cost; queries cost from Synapse)
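
Enabling the analytical store is a per-container setting. A sketch with the Python SDK, assuming Synapse Link is already enabled on the account; the container name is illustrative.

```python
from azure.cosmos import PartitionKey

# analytical_storage_ttl=-1 keeps the columnar copy indefinitely;
# a positive value ages analytical data out after that many seconds.
orders_analytical = database.create_container_if_not_exists(
    id="orders-with-analytics",
    partition_key=PartitionKey(path="/customerId"),
    analytical_storage_ttl=-1,
)
```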

Pattern: change feed → Event-driven

Cosmos has a change feed (CDC built in). Use it for:

  • Materialized views (write to Cosmos → change feed → derived view in Cosmos)
  • Event-driven downstream (change feed → Function → Event Hubs → consumer)
  • Cache invalidation (change feed → invalidate Redis)

Better than polling. Cheaper than CDC tools. Use the change feed processor library (.NET and Java SDKs), the change feed pull model in the other SDKs, or the Azure Functions Cosmos DB trigger.
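
A minimal pull-model sketch with the Python SDK (the Functions trigger and the .NET/Java processor handle leases and checkpointing for you). Persisting the continuation token between runs is left out here, so treat it as illustrative only.

```python
# Read every change from the beginning of the container and fan out.
# A real consumer would save the continuation token so the next run
# resumes where this one stopped.
for change in container.query_items_change_feed(is_start_from_beginning=True):
    # e.g. update a materialized view, publish to Event Hubs,
    # or invalidate the Redis entry for change["id"]
    print(change["id"], change.get("_ts"))
```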

Cost optimization checklist

  • Right consistency level (Session is the default for a reason)
  • Autoscale only for spiky workloads
  • TTL on time-bounded documents
  • Indexing policy tuned: exclude paths you never query (the default indexes every path, which gets expensive on writes); see the sketch after this checklist
  • Use Synapse Link instead of cross-Cosmos analytics queries
  • Avoid cross-partition queries in user-facing paths (always include partition key)
  • Bulk mode for large ingest (far higher throughput than issuing inserts one at a time)
  • Reserved capacity for predictable workloads (significant discount)
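
Two of those items in code, using the Python SDK: an indexing policy that excludes a payload subtree you never query, and a user-facing query pinned to a single partition. Paths and names are illustrative.

```python
from azure.cosmos import PartitionKey

# Index only what you query: exclude the large /payload/* subtree.
indexing_policy = {
    "indexingMode": "consistent",
    "includedPaths": [{"path": "/*"}],
    "excludedPaths": [{"path": "/payload/*"}],
}
events_tuned = database.create_container_if_not_exists(
    id="events-tuned",
    partition_key=PartitionKey(path="/userId"),
    indexing_policy=indexing_policy,
)

# User-facing read path: always pass the partition key so the query
# hits one partition instead of fanning out.
items = events_tuned.query_items(
    query="SELECT * FROM c WHERE c.userId = @uid AND c.type = @type",
    parameters=[
        {"name": "@uid", "value": "user-7890"},
        {"name": "@type", "value": "login"},
    ],
    partition_key="user-7890",
)
for item in items:
    print(item["id"])
```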

Common pitfalls

| Pitfall | Mitigation |
| --- | --- |
| Wrong partition key picked at design time | Cannot be changed; recreate the container with the right key and migrate data via change feed |
| Default indexing policy on a write-heavy workload | Tune the indexing policy; exclude unused paths |
| Cross-partition queries in user-facing paths | Always include the partition key in the WHERE clause |
| Strong consistency by default | Use Session unless you have a specific reason |
| Cosmos for analytics workloads | Use Synapse Link or copy to ADLS Delta nightly |
| Multi-region writes for "DR" | Multi-region writes are for active-active, not DR; for DR, use a single write region + failover |
| Hot partition pushing a single physical partition past its limits (10,000 RU/s, 50 GB) | Requests get throttled (429s); recreate with a higher-cardinality, better-distributed key |