
Multi-Region Deployment Guide

Note

Quick Summary: Active-active multi-region deployment for CSA-in-a-Box: service capability matrix (native replication vs stamp-per-region), data replication patterns (RA-GZRS, Cosmos multi-master, Event Hubs Geo-DR), failover procedures, RPO/RTO targets, step-by-step deployment, chaos testing, monitoring, and cost implications (+40–80% overhead).

This guide covers deploying CSA-in-a-Box in an active-active multi-region configuration for high availability and disaster recovery. It extends the standard deployment in GETTING_STARTED.md and pairs with the DR runbook in DR.md, which documents the operational failover procedures.

Important

Scope: the CSA-in-a-Box Data Landing Zone (DLZ) and its dependent services. The Management and Connectivity landing zones have their own multi-region considerations (Azure Policy replication, hub VNet per region) that are out of scope here.

📦 Starter parameter file

A starter parameter template lives at deploy/bicep/DLZ/params.multi-region.json. Deploy it to the primary region, then create a paired secondary file with location set to the failover region. Snippet (truncated):

{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentParameters.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "location":    { "value": "East US 2" },
    "environment": { "value": "prod" },
    "prefix":      { "value": "dlz" },
    "parCosmosDB": {
      "value": {
        "enableMultipleWriteLocations": "Enabled",
        "secondaryLocation":            "westus2",
        "_comment_multiRegion": "Active-active multi-master Cosmos DB."
      }
    },
    "parStorage": {
      "value": {
        "sku": "Standard_RAGZRS",
        "_comment_sku": "Read-access geo-zone-redundant storage; secondary region is readable at all times."
      }
    }
  }
}

See the file for the full set of services (Synapse, Databricks, Data Factory, Event Hubs Geo-DR alias, Azure ML, Data Explorer, Functions, Stream Analytics, App Insights) including the manual az eventhubs georecovery-alias setup commands embedded in the _comment_geoDR block.


๐Ÿ—๏ธ 1. Architecture Overview

CSA-in-a-Box multi-region uses an active-active topology where both regions serve traffic simultaneously. Services that support native multi-region replication (Cosmos DB, Storage) are configured for automatic data synchronization. Services without built-in multi-region support (Databricks, Data Factory, Synapse) are deployed as independent stamps in each region.

graph TB
    TM["Azure Traffic Manager<br/>/ Azure Front Door"] --> E["East US 2<br/>(Primary)"]
    TM --> W["West US 2<br/>(Secondary)"]

    subgraph E["East US 2 (Primary)"]
        EC["Cosmos DB"]
        ES["Storage (RA-GZRS)"]
        EK["Key Vault"]
        EH["Event Hubs"]
        ED["Databricks"]
        EA["Data Factory"]
    end

    subgraph W["West US 2 (Secondary)"]
        WC["Cosmos DB"]
        WS["Storage (RA-GZRS)"]
        WK["Key Vault"]
        WH["Event Hubs"]
        WD["Databricks"]
        WA["Data Factory"]
    end

    EC <-->|"Multi-region writes"| WC
    ES <-->|"RA-GZRS replication"| WS
    EK <-->|"Azure-managed"| WK
    EH <-->|"Geo-DR pairing"| WH

The primary region handles all writes for stamp-based services. The secondary region is a warm standby for compute services and an active read replica for data services. During a regional failure, the secondary region can be promoted to primary with the failover procedures in DR.md.


📊 2. Service Capability Matrix

Not all Azure services support multi-region the same way. This matrix documents the strategy for each service deployed by CSA-in-a-Box.

| Service | Multi-Region Support | Strategy | RPO | RTO |
| --- | --- | --- | --- | --- |
| Cosmos DB | Native multi-region writes | Active-active | < 15 min | < 30 min |
| Storage (Silver/Gold) | RA-GZRS | Built-in geo-replication | < 1 h | < 1 h |
| Storage (Bronze/raw) | ZRS | Zone-redundant, single region | < 4 h | < 4 h |
| Key Vault | Azure-managed geo-replication | Built-in | N/A | < 15 min |
| Event Hubs | Geo-DR namespace pairing | Manual config post-deploy | < 5 min | < 15 min |
| Databricks | No native support | Stamp per region | < 4 h | < 4 h |
| Data Factory | No native support | Stamp per region | < 4 h | < 8 h |
| Synapse | No native support | Stamp per region | < 1 h | < 2 h |
| Azure ML | No native support | Stamp per region | < 4 h | < 4 h |
| Data Explorer | Follower databases | Read replicas in secondary | < 30 min | < 1 h |
| App Insights | Per-region instance | Stamp per region | N/A | N/A |
| Stream Analytics | No native support | Stamp per region | < 4 h | < 4 h |
| Purview | No native support | Accept 24h RPO | < 24 h | < 24 h |
Legend
  • Native: The service replicates data automatically across regions.
  • Geo-DR pairing: The service supports a disaster-recovery namespace/alias that can be failed over.
  • Stamp per region: Deploy the service independently in each region. No automatic data sync; the service is stateless or re-populated from replicated data sources.
  • Follower databases: ADX-specific; a secondary cluster attaches read-only copies of databases from the primary cluster.

🔄 3. Data Replication Patterns

3.1 Storage: RA-GZRS Automatic Replication

When sku is set to Standard_RAGZRS, Azure Storage replicates data synchronously across three availability zones in the primary region and asynchronously to the paired secondary region. The secondary region provides read-only access at all times.

Key points:

  • RPO is typically < 15 minutes but Azure does not guarantee a specific RPO for asynchronous replication. Plan for up to 1 hour.
  • After a customer-initiated failover, the account's replication drops to LRS. Re-enable geo-replication after the primary region recovers (see DR.md §5).
  • The secondary endpoint is at {account}-secondary.blob.core.windows.net. Configure read clients to use this endpoint for cross-region reads.

Configuration in params.multi-region.json:

"parStorage": {
    "value": {
        "sku": "Standard_RAGZRS"
    }
}
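The secondary read endpoint noted above can be derived mechanically from the account name. A minimal sketch (the helper name and the sample account are illustrative, not part of the deployment templates):

```python
def secondary_endpoint(account: str, service: str = "blob") -> str:
    """Build the read-only secondary endpoint for an RA-GZRS account.

    Azure exposes the geo-secondary at
    '{account}-secondary.{service}.core.windows.net'.
    """
    return f"https://{account}-secondary.{service}.core.windows.net"

# Point read-only clients at the secondary region:
print(secondary_endpoint("dlzprodsilver"))
# https://dlzprodsilver-secondary.blob.core.windows.net
```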

3.2 Cosmos DB: Multi-Master Writes

With enableMultipleWriteLocations set to Enabled and secondaryLocation set, Cosmos DB accepts writes in both regions simultaneously. Conflict resolution uses Last-Writer-Wins (LWW) by default, based on the _ts (timestamp) property.

Key points:

  • Session consistency is maintained per-region. Cross-region reads may show slight staleness (typically < 10ms for multi-master).
  • Automatic failover is enabled: if the primary region goes down, Cosmos automatically promotes the secondary.
  • Conflict resolution policy is configurable per container. For last-writer-wins, no application changes are needed.
  • Continuous backup (30-day PITR) works across both regions.

Configuration in params.multi-region.json:

"parCosmosDB": {
    "value": {
        "enableMultipleWriteLocations": "Enabled",
        "secondaryLocation": "westus2",
        "enableAutomaticFailover": "Enabled"
    }
}
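The default LWW policy reduces to a simple rule: whichever write carries the higher _ts wins, regardless of region. A plain-Python illustration of that rule (not the Cosmos SDK; the document shape is made up):

```python
def resolve_lww(doc_a: dict, doc_b: dict) -> dict:
    """Pick the winner between two conflicting versions of the same item.

    Cosmos DB's default Last-Writer-Wins policy compares the _ts property
    (epoch seconds, set server-side on every write) and keeps the higher one.
    """
    return doc_a if doc_a["_ts"] >= doc_b["_ts"] else doc_b

east_write = {"id": "order-42", "status": "shipped",   "_ts": 1700000100}
west_write = {"id": "order-42", "status": "cancelled", "_ts": 1700000160}

# The later write wins, even though it landed in a different region:
assert resolve_lww(east_write, west_write)["status"] == "cancelled"
```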

3.3 dbt Models: Idempotent, Deploy to Both Regions

dbt models are inherently idempotent: they read from source data and produce output tables. In a multi-region deployment, run dbt against both regions' Databricks workspaces to populate the medallion layers from the region-local storage replicas.

# Run dbt against the primary region
dbt run --target prod_eastus2

# Run dbt against the secondary region
dbt run --target prod_westus2

Note

Since the source data (Storage RA-GZRS) is eventually consistent across regions, model outputs may differ slightly during replication lag. For critical aggregations, run dbt in the primary region first and let replication propagate the source data before running in the secondary.
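One way to honor that ordering is to gate the secondary dbt run on the storage account's lastSyncTime (the value surfaced by geoReplicationStats, as queried in §8.1). A sketch of the gating logic; the helper name and one-hour tolerance are assumptions, not part of the project:

```python
from datetime import datetime, timedelta, timezone

def safe_to_run_secondary(last_sync_time: str,
                          max_lag: timedelta = timedelta(hours=1)) -> bool:
    """Return True when geo-replication lag is within tolerance.

    last_sync_time is the ISO-8601 'lastSyncTime' from the storage account's
    geoReplicationStats; writes before that instant exist on the secondary.
    """
    last_sync = datetime.fromisoformat(last_sync_time.replace("Z", "+00:00"))
    return datetime.now(timezone.utc) - last_sync <= max_lag

# Example: only trigger `dbt run --target prod_westus2` when lag is acceptable.
recent = (datetime.now(timezone.utc) - timedelta(minutes=5)).isoformat()
print(safe_to_run_secondary(recent))  # True
```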

3.4 Event Hubs: Geo-DR Namespace Pairing

Event Hubs Geo-DR replicates namespace metadata (event hub definitions, consumer groups, access policies) but not message data. During normal operation, producers send to the primary namespace. After failover, the alias DNS record points to the secondary namespace and producers seamlessly reconnect.

# Create the Geo-DR alias (one-time setup after both namespaces are deployed)
az eventhubs georecovery-alias create \
  --resource-group rg-dlz-prod-eventhubs-eastus2 \
  --namespace-name dlz-prod-ehns-eastus2 \
  --alias dlz-prod-ehns-geo \
  --partner-namespace "/subscriptions/<SUB_ID>/resourceGroups/rg-dlz-prod-eventhubs-westus2/providers/Microsoft.EventHub/namespaces/dlz-prod-ehns-westus2"

Warning

  • Messages in-flight during failover may be lost. Design consumers for at-least-once processing with idempotent writes.
  • After failover, the secondary becomes the new primary. The old primary must be re-paired as the secondary.
  • Consumer offsets are not replicated; consumers restart from the latest checkpoint or the beginning of the retention window.
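At-least-once delivery plus idempotent writes means a redelivered event must be harmless. A minimal dedupe sketch (an in-memory seen-set stands in for a durable store such as a Cosmos container; the event shape is illustrative):

```python
class IdempotentConsumer:
    """Process events at-least-once while making duplicates a no-op."""

    def __init__(self):
        self.seen: set[str] = set()    # stand-in for a durable dedupe store
        self.processed: list[dict] = []

    def handle(self, event: dict) -> bool:
        """Return True if processed, False if the event was a duplicate."""
        if event["id"] in self.seen:
            return False               # replay after failover: safely ignored
        self.seen.add(event["id"])
        self.processed.append(event)
        return True

consumer = IdempotentConsumer()
consumer.handle({"id": "evt-1", "payload": "a"})
consumer.handle({"id": "evt-1", "payload": "a"})   # redelivered after failover
assert len(consumer.processed) == 1
```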


🚀 4. Failover Procedures

The detailed step-by-step failover procedure is in DR.md §3. This section summarizes the decision framework.

4.1 Automatic Failover (No operator action required)

| Service | Trigger | Behavior |
| --- | --- | --- |
| Cosmos DB | Regional outage detected by Azure | Secondary promoted to primary automatically. Multi-master writes continue in the surviving region. |
| Storage (RA-GZRS) | No automatic failover | Read traffic can use the secondary endpoint immediately. Write failover requires operator action. |
| Key Vault | Azure-managed | Automatic within Azure's SLA. |

4.2 Manual Failover (Operator action required)

| Service | Command | Time to Complete |
| --- | --- | --- |
| Storage | `az storage account failover --name <name> --yes` | 10–60 minutes |
| Event Hubs | `az eventhubs georecovery-alias fail-over --alias <alias>` | < 5 minutes |
| Databricks | Activate secondary workspace, update linked services | 1–2 hours |
| Data Factory | Redeploy pipelines to secondary factory | 1–4 hours |
| ADX | Promote follower databases to primary | 30–60 minutes |

4.3 DNS / Traffic Manager Configuration

For clients to seamlessly fail over, deploy Azure Traffic Manager (or Azure Front Door) with priority-based routing:

# Create Traffic Manager profile
az network traffic-manager profile create \
  --name "csa-dlz-tm" \
  --resource-group "rg-global-traffic" \
  --routing-method Priority \
  --unique-dns-name "csa-dlz"

# Add primary endpoint
az network traffic-manager endpoint create \
  --name "eastus2-primary" \
  --profile-name "csa-dlz-tm" \
  --resource-group "rg-global-traffic" \
  --type azureEndpoints \
  --target-resource-id "<PRIMARY_RESOURCE_ID>" \
  --priority 1

# Add secondary endpoint
az network traffic-manager endpoint create \
  --name "westus2-secondary" \
  --profile-name "csa-dlz-tm" \
  --resource-group "rg-global-traffic" \
  --type azureEndpoints \
  --target-resource-id "<SECONDARY_RESOURCE_ID>" \
  --priority 2

📋 5. RPO / RTO Targets by Service Tier

Services are classified into tiers that drive the SKU, replication mode, and expected recovery behavior. See DR.md §1 for the authoritative table.

Tier: Critical (RPO < 1h, RTO < 1h)

| Service | Configuration | Recovery |
| --- | --- | --- |
| Cosmos DB | Multi-master, automatic failover | Automatic; no data loss in surviving region |
| Storage (Silver/Gold) | RA-GZRS | Manual failover; read traffic uses secondary immediately |
| Key Vault | Azure-managed geo-replication | Automatic |
| Event Hubs | Geo-DR pairing | Manual failover; in-flight messages may be lost |

Tier: Standard (RPO < 4h, RTO < 8h)

| Service | Configuration | Recovery |
| --- | --- | --- |
| Databricks | Passive workspace in secondary | Activate workspace, rehydrate Unity Catalog |
| Data Factory | Paired factory in secondary | Redeploy pipelines, update linked services |
| Synapse | Per-region workspace | Reattach serverless endpoints to failover storage |
| Azure ML | Per-region workspace | Retarget compute to secondary region |
| Data Explorer | Follower databases | Promote followers to writable databases |

Tier: Best-Effort (RPO < 24h, RTO < 24h)

| Service | Configuration | Recovery |
| --- | --- | --- |
| Purview | No geo-replication | Re-scan from source after region recovery |
| Stream Analytics | Per-region job | Redeploy job definition |

📦 6. Deployment Process

6.1 Prerequisites

Before deploying multi-region:

  • Two Azure subscriptions (or a single subscription with region-specific resource groups)
  • Spoke VNets deployed in both regions, peered to respective hubs
  • Private DNS zones linked to both spoke VNets
  • Key Vault provisioned in both regions (or relying on Azure-managed geo-replication of a single vault)
  • Log Analytics workspace accessible from both regions

6.2 Step 1 โ€” Deploy the Primary Region

az account set --subscription <DLZ_SUBSCRIPTION_ID>

az deployment sub create \
  --location eastus2 \
  --template-file deploy/bicep/DLZ/main.bicep \
  --parameters deploy/bicep/DLZ/params.multi-region.json

6.3 Step 2 โ€” Deploy Stamp-Based Services in the Secondary Region

Create a secondary parameter file for services that need separate stamps:

cp deploy/bicep/DLZ/params.multi-region.json \
   deploy/bicep/DLZ/params.multi-region-westus2.json

Edit the secondary file:

  • Change location to "West US 2"
  • Update resource names to include westus2 suffix
  • Update VNet/subnet references to the secondary region's network
  • Disable services that are already multi-region (Cosmos DB, Storage)
  • Keep stamp-based services enabled (Databricks, Data Factory, etc.)

az deployment sub create \
  --location westus2 \
  --template-file deploy/bicep/DLZ/main.bicep \
  --parameters deploy/bicep/DLZ/params.multi-region-westus2.json
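The edit checklist above can be scripted rather than applied by hand. A hedged sketch that derives the secondary parameter document from the primary one: location matches the starter file, but the deploy* toggles are hypothetical stand-ins for however the template actually disables natively replicated services.

```python
import json

def make_secondary_params(primary: dict, secondary_location: str = "West US 2") -> dict:
    """Derive secondary-region deployment parameters from the primary document.

    'location' comes from params.multi-region.json; the 'deploy*' toggles are
    hypothetical, representing whichever flags the Bicep template exposes for
    skipping services that already replicate natively (Cosmos DB, Storage).
    """
    doc = json.loads(json.dumps(primary))          # deep copy, keep input intact
    params = doc["parameters"]
    params["location"]["value"] = secondary_location
    params["deployCosmosDB"] = {"value": False}    # hypothetical toggle
    params["deployStorage"] = {"value": False}     # hypothetical toggle
    return doc

primary = {"parameters": {"location": {"value": "East US 2"},
                          "environment": {"value": "prod"}}}
secondary = make_secondary_params(primary)
print(secondary["parameters"]["location"]["value"])  # West US 2
```

Resource names, VNet references, and subnet IDs still need region-specific values, so review the generated file before deploying.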

6.4 Step 3 โ€” Configure Geo-DR Pairing

After both namespaces are deployed, pair Event Hubs:

az eventhubs georecovery-alias create \
  --resource-group rg-dlz-prod-eventhubs-eastus2 \
  --namespace-name dlz-prod-ehns-eastus2 \
  --alias dlz-prod-ehns-geo \
  --partner-namespace "<SECONDARY_NAMESPACE_RESOURCE_ID>"

6.5 Step 4 โ€” Configure ADX Follower Databases

az kusto attached-database-configuration create \
  --cluster-name dlzprodadxwestus2 \
  --resource-group rg-dlz-prod-adx-westus2 \
  --attached-database-configuration-name "analytics-follower" \
  --cluster-resource-id "/subscriptions/<SUB_ID>/resourceGroups/rg-dlz-prod-adx-eastus2/providers/Microsoft.Kusto/Clusters/dlzprodadxeastus2" \
  --database-name "analytics" \
  --default-principals-modification-kind "Union"

6.6 Step 5 โ€” Configure Traffic Manager

Set up global routing as described in §4.3 above.

6.7 Step 6 โ€” Verify

# Cosmos DB โ€” verify both regions
az cosmosdb show \
  --name <COSMOS_ACCOUNT> \
  --resource-group <RG> \
  --query "locations[].{Name:locationName, Priority:failoverPriority, Status:provisioningState}" \
  -o table

# Storage โ€” verify RAGZRS
az storage account show \
  --name <STORAGE_ACCOUNT> \
  --resource-group <RG> \
  --query "{Primary:primaryLocation, Secondary:secondaryLocation, SKU:sku.name}" \
  -o table

# Event Hubs โ€” verify Geo-DR
az eventhubs georecovery-alias show \
  --resource-group <RG> \
  --namespace-name <PRIMARY_NAMESPACE> \
  --alias <ALIAS>

🧪 7. Chaos Testing

Use Azure Chaos Studio to validate failover behavior before a real outage occurs.

7.1 Experiment: Storage Failover

{
    "name": "storage-failover-test",
    "identity": { "type": "SystemAssigned" },
    "properties": {
        "steps": [
            {
                "name": "Step1-StorageFailover",
                "branches": [
                    {
                        "name": "Branch1",
                        "actions": [
                            {
                                "type": "continuous",
                                "name": "urn:csci:microsoft:storageAccount:failover/1.0",
                                "duration": "PT10M",
                                "parameters": [],
                                "selectorId": "storageSelector"
                            }
                        ]
                    }
                ]
            }
        ],
        "selectors": [
            {
                "type": "List",
                "id": "storageSelector",
                "targets": [
                    {
                        "type": "ChaosTarget",
                        "id": "/subscriptions/<SUB_ID>/resourceGroups/<RG>/providers/Microsoft.Storage/storageAccounts/<ACCOUNT>/providers/Microsoft.Chaos/targets/Microsoft-StorageAccount"
                    }
                ]
            }
        ]
    }
}

7.2 Experiment: Cosmos DB Region Outage

Simulate a region outage by swapping the account's failover priorities so that the secondary becomes the write region:

# Force failover to secondary (non-destructive, reversible)
az cosmosdb failover-priority-change \
  --name <COSMOS_ACCOUNT> \
  --resource-group <RG> \
  --failover-policies westus2=0 eastus2=1

Verify that applications continue to operate using the secondary region, then reverse:

az cosmosdb failover-priority-change \
  --name <COSMOS_ACCOUNT> \
  --resource-group <RG> \
  --failover-policies eastus2=0 westus2=1

7.3 Quarterly Drill Schedule

| Quarter | Service | Environment | Focus |
| --- | --- | --- | --- |
| Q1 | Cosmos DB | Dev | Automatic failover verification |
| Q2 | Storage | Dev | Customer-initiated failover timing |
| Q3 | Event Hubs | Dev | Geo-DR alias failover + consumer restart |
| Q4 | Full stack | Staging | End-to-end failover drill |

📈 8. Monitoring Multi-Region Health

8.1 Cross-Region Dashboard

Create an Azure Monitor workbook that displays health across both regions side by side:

Cosmos DB replication lag:

AzureDiagnostics
| where ResourceProvider == "MICROSOFT.DOCUMENTDB"
| where Category == "DataPlaneRequests"
| summarize AvgLatencyMs=avg(duration_s * 1000),
            P99LatencyMs=percentile(duration_s * 1000, 99)
  by Region=location_s, bin(TimeGenerated, 5m)
| render timechart

Storage geo-replication status:

az storage account show \
  --name <ACCOUNT> \
  --query "geoReplicationStats.{Status:status, LastSync:lastSyncTime}" \
  -o table

Event Hubs Geo-DR health:

AzureDiagnostics
| where ResourceProvider == "MICROSOFT.EVENTHUB"
| where OperationName == "GeoReplication"
| project TimeGenerated, Status=status_s, Alias=resource_s
| order by TimeGenerated desc
| take 10

8.2 Alerts

| Alert | Metric | Threshold | Action |
| --- | --- | --- | --- |
| Cosmos replication lag | ReplicationLatency | > 5,000 ms for 10 min | Page on-call |
| Storage geo-replication unhealthy | GeoReplicationStatus | != Live for 15 min | Page on-call |
| Event Hubs Geo-DR broken | Manual check | Alias not resolving | Page on-call |
| ADX follower lag | FollowerLatency | > 60,000 ms for 15 min | Notify platform team |
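Each metric alert above is a sustained-breach rule: fire only when the metric exceeds its threshold for the full window, not on a single spike. A sketch of that evaluation (names and sample data are illustrative; the real rules live in Azure Monitor):

```python
def sustained_breach(samples: list[tuple[int, float]], threshold: float,
                     window_min: int) -> bool:
    """True if every sample in the trailing window exceeds the threshold.

    samples: (minute, value) pairs, oldest first, one sample per minute.
    """
    window = [v for t, v in samples if t >= samples[-1][0] - window_min + 1]
    return len(window) >= window_min and all(v > threshold for v in window)

# Cosmos replication lag: > 5,000 ms sustained for 10 minutes pages on-call.
lag = [(m, 6500.0) for m in range(1, 11)]            # 10 min of breach
print(sustained_breach(lag, 5000.0, 10))             # True

blip = [(m, 6500.0 if m == 5 else 800.0) for m in range(1, 11)]
print(sustained_breach(blip, 5000.0, 10))            # False: single spike
```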

8.3 Azure Resource Graph: Multi-Region Inventory

Resources
| where tags['MultiRegion'] != '' or tags['PrimaryRegion'] != ''
| summarize ResourceCount=count() by location, type
| order by location, type

💰 9. Cost Implications of Multi-Region

Multi-region deployments increase cost. Understanding the cost drivers helps right-size the secondary region.

9.1 Cost Multipliers by Service

| Service | Multi-Region Cost Impact | Notes |
| --- | --- | --- |
| Cosmos DB | 2× RU cost | Each region charges for provisioned/consumed RUs |
| Storage (RA-GZRS) | ~1.5× vs LRS | Geo-replication adds ~50% to storage cost |
| Event Hubs | 2× namespace cost | Two full namespaces (primary + secondary) |
| Databricks | 1×–2× | Secondary workspace can be cold standby (1×) or active (2×) |
| Data Factory | ~1.5× | Secondary factory but potentially fewer pipeline runs |
| Data Explorer | 1.5×–2× | Follower databases consume compute in secondary cluster |
| Key Vault | Negligible | Geo-replication included in Standard/Premium tier |
| Azure ML | 1×–2× | Depends on compute utilization in secondary |

9.2 Cost Optimization Strategies

  • Cold standby for stamp-based services: Deploy Databricks and ADF in the secondary region but keep clusters stopped and pipelines disabled. Reduces 2× cost to ~1.1×.
  • Right-size the secondary ADX cluster: Use a smaller SKU for the follower cluster since it only handles read traffic.
  • Selective multi-region: Use Standard_ZRS for Bronze/raw data and reserve RA-GZRS for Silver/Gold layers.
  • Reserved instances: Commit to 1-year or 3-year reservations for Cosmos DB RUs and Databricks DBUs in both regions.
  • Monitor and tune: Use Azure Cost Management with the MultiRegion and PrimaryRegion tags.

9.3 Estimated Monthly Cost Overhead

| Category | Single-Region | Multi-Region Delta | Driver |
| --- | --- | --- | --- |
| Data (Cosmos + Storage) | Baseline | +60–80% | RA-GZRS + multi-master |
| Compute (Databricks + ADF) | Baseline | +10–100% | Cold standby to active |
| Networking (VNet peering, PE) | Baseline | +50–80% | Second set of endpoints |
| Monitoring (Log Analytics) | Baseline | +30–50% | Second workspace or cross-region shipping |
| Total estimate | Baseline | +40–80% | Depends on active vs cold standby |
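The per-category ranges above can be folded into a rough estimator. A sketch using midpoint multipliers; the baseline figures are made-up inputs for illustration, not CSA-in-a-Box benchmarks:

```python
# Midpoint overhead per category, from the table above (e.g. +60-80% -> 0.70).
OVERHEAD = {
    "data": 0.70,        # Cosmos multi-master + RA-GZRS
    "compute": 0.55,     # cold standby (+10%) to fully active (+100%)
    "networking": 0.65,  # second set of private endpoints and peering
    "monitoring": 0.40,  # second workspace or cross-region log shipping
}

def multi_region_estimate(baseline: dict[str, float]) -> float:
    """Monthly multi-region cost from per-category single-region baselines."""
    return sum(cost * (1 + OVERHEAD[cat]) for cat, cost in baseline.items())

# Illustrative single-region baselines (USD/month):
baseline = {"data": 4000.0, "compute": 6000.0, "networking": 800.0, "monitoring": 500.0}
total = multi_region_estimate(baseline)
print(round(total))  # 18120, i.e. roughly +60% over the 11300 baseline
```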

๐Ÿข 10. Combining Multi-Region with Multi-Tenant

For deployments that require both multi-tenant isolation and multi-region availability, deploy each tenant stamp in both regions:

Tenant A: params.tenant-contoso-eastus2.json → eastus2
           params.tenant-contoso-westus2.json → westus2

Tenant B: params.tenant-fabrikam-eastus2.json → eastus2
           params.tenant-fabrikam-westus2.json → westus2

Each tenant's Cosmos DB account and storage account get their own multi-region replication. Stamp-based services (Databricks, ADF) are deployed per-tenant per-region.
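The naming pattern above is mechanical enough to generate. A small sketch (the pattern matches the examples; the function name is illustrative):

```python
def tenant_param_files(tenants: list[str], regions: list[str]) -> list[str]:
    """One parameter file per tenant per region:
    params.tenant-<tenant>-<region>.json."""
    return [f"params.tenant-{t}-{r}.json" for t in tenants for r in regions]

files = tenant_param_files(["contoso", "fabrikam"], ["eastus2", "westus2"])
for f in files:
    print(f)
# params.tenant-contoso-eastus2.json
# params.tenant-contoso-westus2.json
# params.tenant-fabrikam-eastus2.json
# params.tenant-fabrikam-westus2.json
```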

See MULTI_TENANT.md for the tenant isolation model and naming conventions.


📋 11. Quick Reference

| Scenario | Guide |
| --- | --- |
| Deploy multi-region from scratch | §6 above |
| Understand which services support multi-region | §2 Service Capability Matrix |
| Configure Cosmos DB multi-master | §3.2 |
| Configure Storage RA-GZRS | §3.1 |
| Set up Event Hubs Geo-DR | §3.4 |
| Failover the whole platform | DR.md §3 |
| Fail back after recovery | DR.md §5 |
| Run a chaos experiment | §7 |
| Monitor cross-region health | §8 |
| Understand cost impact | §9 |
| Multi-region + multi-tenant | §10 |
| Quarterly DR drill | §7.3 / DR.md §4 |

See also: