
Multi-Region Disaster Recovery Runbook

Note

Quick Summary: Authoritative DR runbook for CSA-in-a-Box covering RPO/RTO targets by service tier, primary/secondary region pairing, step-by-step failover procedures (Cosmos DB, Storage, ADF, Databricks, DNS), quarterly drill cadence, failback procedures, and known gaps.

This runbook is the authoritative answer to "what do we do when a region goes down?" It pairs with the deploy-time rollback procedure in ROLLBACK.md, which covers bad deploys; this document covers regional unavailability (Azure region outage, networking partition, storage replication failure) where the fix is failing over to a different region, not redeploying.

Important

Scope: the CSA-in-a-Box platform's four landing zones (Management, Connectivity, DMLZ, DLZ). Out of scope: application-layer DR for workloads owned by domain teams; those are expected to follow the per-service guidance in this runbook and document their own RPO/RTO.

📑 Table of Contents

  • 1. RPO / RTO Targets by Service Tier
  • 2. Primary / Secondary Region Pairing
  • 3. Failover Procedure
  • 4. Failover Readiness – Quarterly Drill
  • 5. Failback Procedure
  • 6. Known Gaps and Roadmap
  • 7. Quick Reference

📊 1. RPO / RTO Targets by Service Tier

Every Azure service deployed by CSA-in-a-Box is classified into a service tier; this runbook covers the Critical and Standard tiers (higher tiers are noted below). The tier determines the SKU, replication mode, and expected recovery behaviour. Before enabling a new service in production, add a row to this table and wire up the matching configuration.

| Service | Tier | RPO | RTO | Replication | Bicep toggle |
|---|---|---|---|---|---|
| Cosmos DB (ingest + catalog) | Critical | < 15 min | < 30 min | Multi-region writes across two Azure regions | deploy/bicep/DLZ/modules/cosmos/cosmosdb.bicep → secondaryLocation, enableMultipleWriteLocations=true, enableAutomaticFailover=true |
| Data Lake Storage (Silver + Gold) | Critical | < 1 h | < 1 h | RA-GRS (read-access geo-redundant) | deploy/bicep/DLZ/modules/storage/storage.bicep → storageSku=Standard_RAGRS (or Standard_RAGZRS for zone + region redundancy) |
| Data Lake Storage (Bronze / raw) | Standard | < 4 h | < 4 h | ZRS (zone-redundant, single region) | storage.bicep default (no override needed) |
| Log Analytics + App Insights | Critical | < 1 h | < 1 h | Cross-region diagnostic-settings mirror | Manual: add a second diagnostic-settings destination pointing at the failover workspace |
| Key Vault | Critical | N/A | < 15 min | Azure-managed geo-replication (Standard/Premium) + 90-day soft delete | deploy/bicep/**/KeyVault/keyvault.bicep already enables enableSoftDelete + enablePurgeProtection + softDeleteRetentionInDays: 90 |
| Databricks | Standard | < 4 h | < 4 h | Passive paired workspace in secondary region (cold standby) | Manual: deploy the workspace Bicep with a second location param; Unity Catalog metadata is regional |
| Azure Data Factory | Standard | < 4 h | < 8 h | Paired factory in secondary region; linked services recreated via ARM | Manual: redeploy domains/shared/pipelines/adf/* into the failover factory |
| Synapse Serverless SQL | Standard | < 1 h | < 2 h | Automatically HA inside a region; for cross-region, reattach serverless endpoints to the failover storage account | Manual |
| Event Hub | Critical | < 5 min | < 15 min | Geo-DR pairing | Manual: configure Microsoft.EventHub/namespaces/disasterRecoveryConfigs in a follow-up commit |
| Purview | Standard | < 24 h | < 24 h | No built-in geo-replication; rely on the collection export + reimport procedure below | Manual |

Note

Tiers higher than "Critical" (near-zero RPO/RTO, e.g. for regulated workloads) require multi-region active/active and are out of scope for this runbook; those workloads should be reviewed individually.
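
Before trusting a row in this table during an incident, it is worth spot-checking that the deployed resources still match it. A minimal sketch using the az CLI and the same example resource names as §3 (csa-cosmos-dlz, rg-csa-cosmos, csadlzstorage001, rg-csa-storage); adjust the names for your environment:

# Cosmos: expect two locations, automatic failover, and multi-region writes
az cosmosdb show --name csa-cosmos-dlz --resource-group rg-csa-cosmos \
  --query "{locations: locations[].locationName, autoFailover: enableAutomaticFailover, multiWrite: enableMultipleWriteLocations}"

# Silver/Gold storage: expect Standard_RAGRS or Standard_RAGZRS
az storage account show --name csadlzstorage001 --resource-group rg-csa-storage \
  --query "sku.name" -o tsv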


๐ŸŒ 2. Primary / Secondary Region Pairing

CSA-in-a-Box defaults to the following Azure region pairs. Every critical-tier service should be deployed in both regions of a single pair, never across pairs (so that the Azure DR primitives like Standard_RAGRS and Cosmos automatic failover line up):

| Primary | Secondary |
|---|---|
| eastus (default) | westus |
| eastus2 | centralus |
| westeurope | northeurope |
| uksouth | ukwest |

The primary region is driven by AZURE_REGION in .github/workflows/deploy.yml. The secondary region is the target of the DR failover jobs below and is configured per service via the Bicep parameters linked in the table above.
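
Azure's own region metadata is the source of truth for pairings; a quick way to confirm one before wiring a new service (same query shape as the control-plane check in §3 step 2):

# Confirm the official paired region for the current primary
az account list-locations \
  --query "[?name=='eastus'].metadata.pairedRegion[].name" -o tsv
# expected output: westus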


🚀 3. Failover Procedure

Danger

Trigger this procedure when the primary region is confirmed unavailable via the Azure Service Health dashboard and has been down for > 5 minutes (shorter incidents are usually cheaper to wait out).

Step 1 – Declare the incident

  • Page the on-call engineer and open an incident in your tracker.
  • Post to the incident channel: region, start time, services affected, decision authority for go/no-go on failover.
  • Start a scribe log in the incident doc so every action below is timestamped and attributable.

Step 2 – Verify the scope

Before flipping anything, confirm the outage is region-wide and not a per-service or per-subscription issue:

# Control-plane health
az account list-locations --query "[?metadata.regionType=='Physical'].{Name:name, Pairs:metadata.pairedRegion[].name}" -o table

# Verify each of our deployed services
az resource list --location eastus --query "[?provisioningState!='Succeeded'].{Name:name, Type:type, State:provisioningState}" -o table
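
For a per-resource health view, Azure Resource Graph exposes Resource Health state. A sketch, assuming the resource-graph CLI extension is installed:

# Summarise availability state across the subscription (sketch)
az graph query -q "healthresources | where type == 'microsoft.resourcehealth/availabilitystatuses' | summarize count() by tostring(properties.availabilityState)"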

Important

If only one resource group or service is affected, use the per-service rollback procedure instead; failing over the whole platform for a single stuck resource is more risk than reward.

Step 3 – Cosmos DB failover

Cosmos accounts with enableAutomaticFailover=true and a secondaryLocation will fail over on their own within Azure's threshold. If automatic failover has not happened within 10 minutes, force it:

az cosmosdb failover-priority-change \
  --name csa-cosmos-dlz \
  --resource-group rg-csa-cosmos \
  --failover-policies westus=0 eastus=1

Verify:

az cosmosdb show --name csa-cosmos-dlz --resource-group rg-csa-cosmos \
  --query "locations[].{Name:locationName, Priority:failoverPriority}" -o table

Step 4 – Storage account failover

RA-GRS accounts require a manual customer-initiated failover:

az storage account failover \
  --name csadlzstorage001 \
  --resource-group rg-csa-storage \
  --yes

Warning

After the command returns, the account's primary region flips to the paired secondary and the replication type drops to LRS (you need to re-enable geo replication after the original region recovers – see §5 Failback). Expected time: 10–60 minutes.

Verify:

az storage account show --name csadlzstorage001 --resource-group rg-csa-storage \
  --query "{primary:primaryLocation, secondary:secondaryLocation, sku:sku.name}"

Step 5 – ADF linked-service reconfiguration

ADF cannot be failed over; each region has its own factory. Redeploy pipeline JSON into the secondary-region factory:

# Re-run the DLZ deploy workflow targeting the secondary region.
# In .github/workflows/deploy.yml, temporarily set AZURE_REGION=westus
# or pass the target via workflow_dispatch input once supported.
gh workflow run deploy.yml -f environment=prod -f deploy_dlz=true -f dry_run=false

Linked services in the secondary factory must point at:

  • The restored Cosmos account (now in the failover region)
  • The failed-over Data Lake storage account
  • The secondary Databricks workspace (if you're running Databricks pipelines via ADF linked compute)
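
Once the redeploy completes, listing the linked services in the failover factory is a cheap check that nothing still points at the dead region. A sketch assuming the datafactory CLI extension; the factory and resource-group names are illustrative:

az datafactory linked-service list \
  --factory-name csa-adf-dr --resource-group rg-csa-adf \
  --query "[].{name: name, type: properties.type}" -o table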

Step 6 – Databricks failover

  • Stop any still-running jobs in the primary workspace (if the control plane is still reachable).
  • Activate the cold-standby workspace in the secondary region.
  • Rehydrate the Unity Catalog metastore in the secondary workspace via the Databricks Terraform provider export + import pattern (Unity Catalog is regional so metadata does not replicate).
  • Repoint ADF linked services (step 5) at the new workspace URL.
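
A reachability sketch for the standby workspace, assuming the new-generation databricks CLI with a profile configured for the secondary region (profile name illustrative):

# Confirm the secondary workspace answers and shows its regional metastore
databricks metastores list --profile dr-secondary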

Step 7 – DNS, certificates, and clients

  • Update any CNAMEs / Traffic Manager endpoints to point at the secondary region (see the sketch below).
  • Verify that client apps re-resolve to the new endpoints once the old DNS TTL expires.
  • Update any Application Insights connection strings that embed a region-specific ingestion endpoint.
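
No Traffic Manager profile is deployed today (see §6), but where one exists the flip is a single endpoint update. A sketch with illustrative profile, endpoint, and hostname values:

# Disable the primary endpoint so traffic drains to the secondary
az network traffic-manager endpoint update \
  --resource-group rg-csa-dns --profile-name csa-global \
  --name primary-eastus --type azureEndpoints \
  --endpoint-status Disabled

# Confirm resolution flips once the old TTL expires
dig +short csa-data.example.com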

Step 8 – Smoke test

Run the post-deploy verification job from deploy.yml (or the load test harness from tests/load/README.md) against the failover region and confirm:

  • Cosmos writes and reads succeed from both regions (if using enableMultipleWriteLocations).
  • Storage reads return expected content from the failover region.
  • A representative ADF pipeline completes end-to-end.
  • A dbt model run (make test-dbt) completes against the failover Databricks workspace.
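
A minimal CLI spot-check for the first two items, reusing the example resource names from steps 3–4 (the container name is illustrative):

# Storage: reads served from the failover region
az storage blob list --account-name csadlzstorage001 --container-name silver \
  --num-results 5 --auth-mode login -o table

# Cosmos: control plane answers and databases are visible
az cosmosdb sql database list --account-name csa-cosmos-dlz \
  --resource-group rg-csa-cosmos -o table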

🧪 4. Failover Readiness – Quarterly Drill

Rollback procedures rot when they are not exercised. Schedule a chaos-engineering drill once per quarter. The drill is automated via .github/workflows/dr-drill.yml and documented in detail in runbooks/dr-drill.md (CSA-0073):

  • Pick one critical-tier service (rotate Cosmos → Storage → Databricks over the year).
  • Run the failover procedure against the dev environment only.
  • Time each step and capture the durations in a drill report under reports/dr-drills/<YYYY-MM-DD>.md (gitignored – attach it to the incident tracker instead).
  • Update this runbook with anything that drifted.
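
Timing can be as simple as wrapping each step in a shell timer; a sketch for the drill report (output file name illustrative):

start=$(date +%s)
# ... run one failover step here, e.g. the Cosmos priority change from §3 step 3 ...
echo "cosmos_failover_seconds=$(( $(date +%s) - start ))" | tee -a drill-timings.txt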

Note

The dev-environment drill is the minimum bar. For regulated workloads add a second drill per year against a pre-prod clone of production.


🔄 5. Failback Procedure

Danger

Do not fail back at the first sign that the primary region is healthy again; you risk oscillating and losing data. Wait at least one full business day after the Azure Service Health dashboard shows the primary region as green, and only then start the failback.

Step 1 – Re-enable geo replication

A storage-account failover drops the SKU to LRS. Put it back:

az storage account update \
  --name csadlzstorage001 \
  --resource-group rg-csa-storage \
  --sku Standard_RAGRS

Wait for the account's geo-replication status to show Live before proceeding; it typically takes hours for Azure to seed the replica.
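
A polling sketch for that wait, using the same account as step 1:

until [ "$(az storage account show --name csadlzstorage001 --resource-group rg-csa-storage \
    --expand geoReplicationStats --query geoReplicationStats.status -o tsv)" = "Live" ]; do
  echo "$(date -u +%FT%TZ) geo replication still seeding..."
  sleep 300
done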

Step 2 – Swap Cosmos failover priorities back

az cosmosdb failover-priority-change \
  --name csa-cosmos-dlz \
  --resource-group rg-csa-cosmos \
  --failover-policies eastus=0 westus=1

Step 3 – Repoint ADF and Databricks

Reverse the linked-service changes from §3.5 and §3.6. Leave the secondary-region workspaces and factories in place as cold standbys; they are now part of the steady-state DR posture.

Step 4 – DNS and client reset

Flip CNAMEs / Traffic Manager priorities back to the primary region.

Step 5 – Post-incident review

Within two business days, write up:

  • Timeline from first alert to service restoration.
  • Which steps took longer than the documented RTO and why.
  • Any data loss (compare against RPO targets in §1).
  • Updates to this runbook captured as a follow-up PR.

File the post-incident review under the tracking issue for the incident, and update .claude/DEVELOPMENT_LOG.md with a short pointer so future sessions see it.


โš ๏ธ 6. Known Gaps and Roadmap

The following items are intentionally left manual / unconfigured and tracked as follow-up work. They were too large for the initial DR rollout:

  • Event Hub Geo-DR pairing: Microsoft.EventHub/namespaces/disasterRecoveryConfigs is not yet wired into the Bicep modules. Workloads that need sub-5-minute RPO on Event Hub should add this manually via the portal for now.
  • Purview geo replication: Purview does not support cross-region replication. The current posture is "accept 24h RPO and re-scan from source after a disaster"; workloads with stricter requirements need a different catalog service.
  • Automated DR drills: §4's drill procedure is now wired into .github/workflows/dr-drill.yml on a quarterly cron against a scratch subscription (CSA-0073). The per-scenario shell scripts under scripts/drill/ are still stubbed and tracked as follow-ups in runbooks/dr-drill.md §8.
  • Traffic Manager + Front Door: no global-routing layer is currently deployed. Clients talk to region-specific endpoints. If this becomes a pain point, deploy Azure Front Door with priority routing so clients do not need to be aware of the failover.

📋 7. Quick Reference

| Scenario | Runbook section |
|---|---|
| Bad deploy, region healthy | ROLLBACK.md |
| Region unavailable | This document |
| Individual resource deleted | ROLLBACK.md §5–6 (Cosmos PITR / storage soft-delete) |
| dbt model regression | tests/load/README.md → benchmark_dbt_models.py |
| Quarterly drill | §4 above + runbooks/dr-drill.md |

See also: