🔀 Partitioning Strategies for Azure Cosmos DB
Proper partition key selection is critical for Cosmos DB performance and scalability. Choose partition keys that distribute data evenly and align with query patterns.
🎯 Partition Key Fundamentals
Key Concepts
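- Logical Partition: All items that share the same partition key value (max 20 GB per logical partition)
- Physical Partition: Internal storage and throughput unit managed by Cosmos DB; each one hosts many logical partitions and supports up to 50 GB of storage and 10,000 RU/s
- Partition Key Path: JSON property used for partitioning (e.g., /customerId)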
# Example: E-commerce order partitioning by customerId
{
"id": "order-12345",
"customerId": "cust-67890", # Partition key
"orderDate": "2024-01-15",
"items": [...],
"total": 299.99
}
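To make these concepts concrete, the following sketch resolves a partition key path against a few items and groups them into logical partitions in memory. The helper names (`resolve_partition_key`, `group_by_logical_partition`) and the extra sample orders are illustrative assumptions, not part of any Cosmos DB SDK; the service performs this routing itself.

```python
from collections import defaultdict
from typing import Any


def resolve_partition_key(item: dict[str, Any], path: str) -> Any:
    """Follow a partition key path such as "/customerId" into an item."""
    value: Any = item
    for segment in path.strip("/").split("/"):
        value = value[segment]  # Raises KeyError if the property is missing
    return value


def group_by_logical_partition(items, path):
    """Group item ids by partition key value, mirroring a logical partition."""
    partitions = defaultdict(list)
    for item in items:
        partitions[resolve_partition_key(item, path)].append(item["id"])
    return dict(partitions)


# Hypothetical sample orders, modeled on the example document above
orders = [
    {"id": "order-12345", "customerId": "cust-67890", "total": 299.99},
    {"id": "order-12346", "customerId": "cust-67890", "total": 18.50},
    {"id": "order-12347", "customerId": "cust-11111", "total": 42.00},
]

logical_partitions = group_by_logical_partition(orders, "/customerId")
for key, ids in logical_partitions.items():
    print(key, ids)
```

Both orders for cust-67890 land in the same group, which is why a customer-scoped query can be served from a single logical partition.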
📊 Partition Key Selection Criteria
Good Partition Key Characteristics
| Criteria | Description | Example |
|---|---|---|
| High Cardinality | Many distinct values | UserId, DeviceId |
| Even Distribution | Balanced storage across partitions | CustomerId, TenantId |
| Query Pattern Alignment | Appears in the WHERE clause of the most frequent queries, so they stay within one partition | Date ranges, Categories |
| Write Distribution | Spreads writes evenly | Timestamp + UserId |
Anti-patterns to Avoid
# ❌ BAD: Low cardinality
partition_key = "status" # Only a few values (active, inactive, pending)
# ❌ BAD: Uneven distribution
partition_key = "country" # US might have 80% of data
# ❌ BAD: Value changes over time
partition_key = "lastModifiedDate" # An item's key value can't be updated in place, and current writes pile onto one partition
# ✅ GOOD: High cardinality, even distribution
partition_key = "customerId" # Thousands of customers, even distribution
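One way to check a candidate key against these criteria before committing to it is to measure cardinality and skew on a sample of real data. The sketch below does that in plain Python; the `key_distribution` helper and the generated sample are hypothetical and only meant to show the shape of the check.

```python
from collections import Counter


def key_distribution(items, field):
    """Return (distinct value count, share of items held by the largest value)."""
    counts = Counter(item[field] for item in items)
    largest_share = max(counts.values()) / len(items)
    return len(counts), largest_share


# Hypothetical sample: 1,000 orders, 200 customers, 3 statuses, 80% in one country
sample = [
    {
        "customerId": f"cust-{i % 200}",
        "status": ("active", "inactive", "pending")[i % 3],
        "country": "US" if i % 5 else "DE",
    }
    for i in range(1000)
]

for field in ("status", "country", "customerId"):
    distinct, share = key_distribution(sample, field)
    print(f"{field}: {distinct} distinct values, largest holds {share:.1%}")
```

A low distinct count or a large share held by one value is a signal that the property would concentrate storage or throughput on a few partitions.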
🏗️ Common Partitioning Patterns
Pattern 1: Single Property Key
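# User-centric applications
from azure.cosmos import CosmosClient, PartitionKey

# Placeholder endpoint, key, and database name; replace with your own values
client = CosmosClient("<account-endpoint>", credential="<account-key>")
database = client.create_database_if_not_exists(id="<database-name>")

container = database.create_container(
    id="user-profiles",
    partition_key=PartitionKey(path="/userId")
)

# All items for one user share a logical partition
user_data = {
    "id": "profile-001",
    "userId": "user-12345",  # Partition key
    "profile": {...},
    "preferences": {...}
}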
Pattern 2: Composite Key (Synthetic)
# Hash a property into a fixed set of buckets to build a synthetic key
import hashlib

def generate_partition_key(user_id: str) -> str:
    """Create synthetic partition key from userId."""
    # Hash and take modulo so the same userId always maps to the same bucket
    hash_value = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    bucket = hash_value % 100  # 100 buckets = at most 100 logical partitions (20 GB each)
    return f"bucket-{bucket:03d}"

# Usage
document = {
    "id": "doc-001",
    "userId": "user-12345",
    "partitionKey": generate_partition_key("user-12345"),  # e.g. "bucket-042"; the actual bucket depends on the hash
    "data": {...}
}
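A related variant appends a calculated suffix to a single hot key value, so that writes for that value spread over several logical partitions. The trade-off is that a reader must query every suffix to see all items. The sketch below assumes eight suffixes and uses hypothetical helper names (`suffixed_partition_key`, `all_partition_keys`); it is not an SDK feature.

```python
import hashlib


def suffixed_partition_key(base_key: str, item_id: str, suffix_count: int = 8) -> str:
    """Append a deterministic suffix derived from the item id to a base key."""
    digest = hashlib.md5(item_id.encode()).hexdigest()
    suffix = int(digest, 16) % suffix_count
    return f"{base_key}-{suffix}"


def all_partition_keys(base_key: str, suffix_count: int = 8) -> list[str]:
    """List every key a reader must query to see all items for base_key."""
    return [f"{base_key}-{n}" for n in range(suffix_count)]


write_key = suffixed_partition_key("user-12345", "doc-001")
read_keys = all_partition_keys("user-12345")
print(write_key in read_keys)
```

Because the suffix is derived from the item id rather than chosen at random, a point read for a known item can recompute the exact key instead of fanning out.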
Pattern 3: Hierarchical Key
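# Multi-tenant SaaS applications
document = {
    "id": "record-001",
    "tenantId": "tenant-A",  # First-level partition key
    "customerId": "cust-123",  # Candidate second-level key
    "data": {...}
}
# With /tenantId alone, all of a tenant's data shares one logical partition (20 GB limit),
# which is efficient for tenant-scoped queries. Adding /customerId as a second level of a
# hierarchical partition key lets a large tenant grow beyond that limit.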
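The routing idea behind a hierarchical key can be illustrated in memory: items are keyed by the full tuple, and a query that supplies only a prefix (the tenant) touches just the partitions that share it. This is a simplified model with hypothetical names (`KEY_PATHS`, `full_key`, `partitions_for_prefix`), not SDK code, and the extra sample records are made up.

```python
from collections import defaultdict

# Hypothetical two-level key: tenantId first, then customerId
KEY_PATHS = ("tenantId", "customerId")


def full_key(item: dict) -> tuple:
    """Build the full hierarchical key for an item."""
    return tuple(item[path] for path in KEY_PATHS)


def partitions_for_prefix(partitions: dict, prefix: tuple) -> dict:
    """Select the logical partitions whose key starts with the given prefix."""
    return {key: ids for key, ids in partitions.items() if key[: len(prefix)] == prefix}


records = [
    {"id": "record-001", "tenantId": "tenant-A", "customerId": "cust-123"},
    {"id": "record-002", "tenantId": "tenant-A", "customerId": "cust-456"},
    {"id": "record-003", "tenantId": "tenant-B", "customerId": "cust-123"},
]

partitions = defaultdict(list)
for record in records:
    partitions[full_key(record)].append(record["id"])

tenant_a = partitions_for_prefix(partitions, ("tenant-A",))
one_customer = partitions_for_prefix(partitions, ("tenant-A", "cust-123"))
print(sorted(tenant_a))
print(one_customer)
```

A tenant-wide query matches two logical partitions here, while supplying both levels narrows it to one.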
Pattern 4: Time-based Partitioning
# IoT or time-series data
from datetime import datetime

def create_time_partition_key(timestamp: datetime) -> str:
    """Create partition key from timestamp."""
    # Partition by year-month for time-series queries
    return timestamp.strftime("%Y-%m")

# Note: a month-only key sends every current write to a single logical partition,
# so it fits low-volume data; for high write rates, combine it with deviceId
document = {
    "id": "reading-001",
    "deviceId": "sensor-123",
    "timestamp": "2024-01-15T10:30:00Z",
    "partitionKey": create_time_partition_key(datetime(2024, 1, 15, 10, 30)),  # "2024-01"
    "temperature": 72.5
}
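The selection criteria table lists a timestamp combined with an identifier as a way to spread writes. A minimal sketch of that combination for the sensor reading follows; `create_device_month_key` and `parse_utc_timestamp` are hypothetical helpers, and the underscore separator is an arbitrary choice.

```python
from datetime import datetime


def create_device_month_key(device_id: str, timestamp: datetime) -> str:
    """Combine the device id with the year-month of the reading."""
    return f"{device_id}_{timestamp.strftime('%Y-%m')}"


def parse_utc_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 string; a trailing "Z" is rewritten for older Pythons."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


reading = {
    "id": "reading-001",
    "deviceId": "sensor-123",
    "timestamp": "2024-01-15T10:30:00Z",
    "temperature": 72.5,
}
reading["partitionKey"] = create_device_month_key(
    reading["deviceId"], parse_utc_timestamp(reading["timestamp"])
)
print(reading["partitionKey"])
```

Each device now writes to its own logical partition per month, while a query for one device and month still targets a single partition.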
📈 Scaling Considerations
Monitoring Partition Distribution
from azure.identity import DefaultAzureCredential
from azure.mgmt.cosmosdb import CosmosDBManagementClient

def analyze_partition_metrics():
    """Check storage distribution across partitions."""
    credential = DefaultAzureCredential()
    cosmosdb_client = CosmosDBManagementClient(credential, subscription_id="<sub-id>")
    # Get storage metrics split by partition; the filter can also carry
    # startTime, endTime, and timeGrain clauses
    metrics = cosmosdb_client.collection_partition.list_metrics(
        resource_group_name="myresourcegroup",
        account_name="mycosmosaccount",
        database_rid="<database-rid>",
        collection_rid="<collection-rid>",
        filter="name.value eq 'DataUsage'"
    )
    # list_metrics returns a paged iterable, so iterate over it directly
    for metric in metrics:
        print(f"Partition key range: {metric.partition_key_range_id}")
        for value in metric.metric_values or []:
            print(f"  Storage: {value.total} bytes")
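Once per-partition storage figures are available, a simple skew check can flag partitions that hold far more data than average. The sketch below assumes the figures have been collected into a dictionary keyed by partition id; the `find_skewed_partitions` helper, the 2x threshold, and the sample numbers are all made up for illustration.

```python
from statistics import mean


def find_skewed_partitions(storage_bytes: dict[str, int], threshold: float = 2.0) -> list[str]:
    """Return partition ids whose storage exceeds threshold times the average."""
    average = mean(storage_bytes.values())
    return sorted(pid for pid, size in storage_bytes.items() if size > threshold * average)


# Made-up storage figures in bytes, keyed by partition id
observed = {
    "0": 4_000_000_000,
    "1": 3_500_000_000,
    "2": 21_000_000_000,
    "3": 3_800_000_000,
}
print(find_skewed_partitions(observed))
```

A partition flagged here is a prompt to revisit the key choice, for example by checking whether one key value dominates the data.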
Last Updated: 2025-01-28 Documentation Status: Complete