# Multi-Region Disaster Recovery Runbook
> **Note:** Quick summary: authoritative DR runbook for CSA-in-a-Box covering RPO/RTO targets by service tier, primary/secondary region pairing, step-by-step failover procedures (Cosmos DB, Storage, ADF, Databricks, DNS), quarterly drill cadence, failback procedures, and known gaps.
This runbook is the authoritative answer to "what do we do when a region goes down?" It pairs with the deploy-time rollback procedure in `ROLLBACK.md`, which covers bad deploys; this document covers regional unavailability (Azure region outage, networking partition, storage replication failure) where the fix is failing over to a different region, not redeploying.
> **Important:** Scope: the CSA-in-a-Box platform's four landing zones (Management, Connectivity, DMLZ, DLZ). Out of scope: application-layer DR for workloads owned by domain teams – those teams are expected to follow the per-service guidance in this runbook and document their own RPO/RTO.
## Table of Contents

- 1. RPO / RTO Targets by Service Tier
- 2. Primary / Secondary Region Pairing
- 3. Failover Procedure
- 4. Failover Readiness – Quarterly Drill
- 5. Failback Procedure
- 6. Known Gaps and Roadmap
- 7. Quick Reference
## 1. RPO / RTO Targets by Service Tier
Every Azure service deployed by CSA-in-a-Box is classified into one of three tiers. The tier determines the SKU, replication mode, and expected recovery behaviour. Before enabling a new service in production, add a row to this table and wire up the matching configuration.
| Service | Tier | RPO | RTO | Replication | Bicep toggle |
|---|---|---|---|---|---|
| Cosmos DB (ingest + catalog) | Critical | < 15 min | < 30 min | Multi-region writes across two Azure regions | `deploy/bicep/DLZ/modules/cosmos/cosmosdb.bicep` – `secondaryLocation`, `enableMultipleWriteLocations=true`, `enableAutomaticFailover=true` |
| Data Lake Storage (Silver + Gold) | Critical | < 1 h | < 1 h | RA-GRS (read-access geo-redundant) | `deploy/bicep/DLZ/modules/storage/storage.bicep` – `storageSku=Standard_RAGRS` (or `Standard_RAGZRS` for zone + region redundancy) |
| Data Lake Storage (Bronze / raw) | Standard | < 4 h | < 4 h | ZRS (zone-redundant, single region) | `storage.bicep` default (no override needed) |
| Log Analytics + App Insights | Critical | < 1 h | < 1 h | Cross-region diagnostic-settings mirror | Manual: add a second diagnostic-settings destination pointing at the failover workspace |
| Key Vault | Critical | N/A | < 15 min | Azure-managed geo-replication (Standard/Premium) + 90-day soft delete | `deploy/bicep/**/KeyVault/keyvault.bicep` already enables `enableSoftDelete` + `enablePurgeProtection` + `softDeleteRetentionInDays: 90` |
| Databricks | Standard | < 4 h | < 4 h | Passive paired workspace in secondary region (cold standby) | Manual: deploy the workspace Bicep with a second `location` param; Unity Catalog metadata is regional |
| Azure Data Factory | Standard | < 4 h | < 8 h | Paired factory in secondary region; linked services recreated via ARM | Manual: redeploy `domains/shared/pipelines/adf/*` into the failover factory |
| Synapse Serverless SQL | Standard | < 1 h | < 2 h | Automatically HA inside a region; for cross-region, reattach serverless endpoints to the failover storage account | Manual |
| Event Hub | Critical | < 5 min | < 15 min | Geo-DR pairing | Manual: configure `Microsoft.EventHub/namespaces/disasterRecoveryConfigs` in a follow-up commit |
| Purview | Standard | < 24 h | < 24 h | No built-in geo-replication; rely on the collection export + reimport procedure below | Manual |
> **Note:** Tiers higher than "Critical" (near-zero RPO/RTO, e.g. for regulated workloads) require multi-region active/active and are out of scope for this runbook; those workloads should be reviewed individually.
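When reviewing an incident against these targets, the realized RPO is simply the gap between the last successfully replicated write and the moment the region went down. A minimal sketch of that check (the timestamps and target are illustrative example values, not taken from a real incident):

```shell
# Compare realized RPO against the tier target. In a real post-incident
# review, these timestamps come from the incident timeline and replication
# metrics; here they are example values.
incident_start=1700000000         # epoch seconds when the region went down
last_replicated_write=1699999400  # epoch seconds of the last replicated write
target_rpo_min=15                 # Critical tier target from the table above

realized_rpo_min=$(( (incident_start - last_replicated_write) / 60 ))
if [ "$realized_rpo_min" -le "$target_rpo_min" ]; then
  echo "RPO met: ${realized_rpo_min} min <= ${target_rpo_min} min"
else
  echo "RPO missed: ${realized_rpo_min} min > ${target_rpo_min} min"
fi
```

The same arithmetic applies to RTO: subtract the incident start from the time the smoke test (§3 Step 8) went green.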
## 2. Primary / Secondary Region Pairing
CSA-in-a-Box defaults to the following Azure region pairs. Every critical-tier service should be deployed in both regions of a single pair, never across pairs, so that Azure DR primitives such as `Standard_RAGRS` replication and Cosmos automatic failover line up:
| Primary | Secondary |
|---|---|
| `eastus` (default) | `westus` |
| `eastus2` | `centralus` |
| `westeurope` | `northeurope` |
| `uksouth` | `ukwest` |
The primary region is driven by `AZURE_REGION` in `.github/workflows/deploy.yml`. The secondary region is the target of the DR failover jobs below and is configured per service via the Bicep parameters linked in the table above.
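For scripting the failover steps below, the pairing table can be encoded as a small helper so scripts never hard-code a secondary region. A sketch (the function name is ours; the pairs are exactly those in the table above):

```shell
# Map a primary region to its CSA-in-a-Box secondary, per the pairing table.
# Unknown regions return non-zero so callers fail loudly instead of failing
# over to the wrong place.
secondary_for() {
  case "$1" in
    eastus)     echo "westus" ;;
    eastus2)    echo "centralus" ;;
    westeurope) echo "northeurope" ;;
    uksouth)    echo "ukwest" ;;
    *)          echo "unknown primary region: $1" >&2; return 1 ;;
  esac
}

secondary_for eastus   # prints: westus
```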
## 3. Failover Procedure
> **Danger:** Trigger this procedure only when the primary region is confirmed unavailable on the Azure Service Health dashboard and has been down for more than 5 minutes (shorter incidents are usually cheaper to wait out).
### Step 1 – Declare the incident
- Page the on-call engineer and open an incident in your tracker.
- Post to the incident channel: region, start time, services affected, decision authority for go/no-go on failover.
- Start a scribe log in the incident doc so every action below is timestamped and attributable.
### Step 2 – Verify the scope
Before flipping anything, confirm the outage is region-wide and not a per-service or per-subscription issue:
```shell
# Control-plane health
az account list-locations --query "[?metadata.regionType=='Physical'].{Name:name, Pairs:metadata.pairedRegion[].name}" -o table

# Verify each of our deployed services
az resource list --location eastus --query "[?provisioningState!='Succeeded'].{Name:name, Type:type, State:provisioningState}" -o table
```
> **Important:** If only one resource group or service is affected, use the per-service rollback procedure instead – failing over the whole platform for a single stuck resource is more risk than reward.
### Step 3 – Cosmos DB failover

Cosmos accounts with `enableAutomaticFailover=true` and a `secondaryLocation` will fail over on their own once Azure detects the regional outage. If automatic failover has not happened within 10 minutes, force it:
```shell
az cosmosdb failover-priority-change \
  --name csa-cosmos-dlz \
  --resource-group rg-csa-cosmos \
  --failover-policies westus=0 eastus=1
```
Verify:

```shell
az cosmosdb show --name csa-cosmos-dlz --resource-group rg-csa-cosmos \
  --query "locations[].{Name:locationName, Priority:failoverPriority}" -o table
```
### Step 4 – Storage account failover

RA-GRS accounts require a manual, customer-initiated failover (`az storage account failover`):
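A rehearsal-friendly sketch of the failover loop. The account name is this deployment's; the `DRY_RUN` guard is our addition so the loop can be exercised without touching Azure (it defaults to printing the commands; set `DRY_RUN=0` to execute them):

```shell
# Customer-initiated failover for each RA-GRS account.
# DRY_RUN=0 executes the commands; the default only prints them, which is
# handy when rehearsing this runbook.
ACCOUNTS="csadlzstorage001"      # space-separated RA-GRS account names
RG="rg-csa-storage"
DRY_RUN="${DRY_RUN:-1}"

for acct in $ACCOUNTS; do
  cmd="az storage account failover --name $acct --resource-group $RG --yes"
  if [ "$DRY_RUN" = "1" ]; then
    echo "DRY RUN: $cmd"
  else
    $cmd
  fi
done
```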
> **Warning:** After the command returns, the account's primary region flips to the paired secondary and the replication type drops to LRS (you need to re-enable geo-replication after the original region recovers – see §5 Failback). Expected time: 10–60 minutes.
Verify:

```shell
az storage account show --name csadlzstorage001 --resource-group rg-csa-storage \
  --query "{primary:primaryLocation, secondary:secondaryLocation, sku:sku.name}"
```
### Step 5 – ADF linked-service reconfiguration

ADF cannot be failed over – each region has its own factory. Redeploy the pipeline JSON into the secondary-region factory:
```shell
# Re-run the DLZ deploy workflow targeting the secondary region.
# In .github/workflows/deploy.yml, temporarily set AZURE_REGION=westus
# or pass the target via workflow_dispatch input once supported.
gh workflow run deploy.yml -f environment=prod -f deploy_dlz=true -f dry_run=false
```
Linked services in the secondary factory must point at:
- The restored Cosmos account (now in the failover region)
- The failed-over Data Lake storage account
- The secondary Databricks workspace (if you're running Databricks pipelines via ADF linked compute)
### Step 6 – Databricks failover
- Stop any still-running jobs in the primary workspace (if the control plane is still reachable).
- Activate the cold-standby workspace in the secondary region.
- Rehydrate the Unity Catalog metastore in the secondary workspace via the Databricks Terraform provider export + import pattern (Unity Catalog is regional so metadata does not replicate).
- Repoint ADF linked services (step 5) at the new workspace URL.
### Step 7 – DNS, certificates, and clients
- Update any CNAMEs / Traffic Manager endpoints to point at the secondary region.
- Verify that client apps resolve the new endpoints once the previous DNS TTL expires.
- Rotate any Application Insights connection strings or instrumentation keys that bake the primary region into their ingestion endpoint.
### Step 8 – Smoke test
Run the post-deploy verification job from `deploy.yml` (or the load-test harness from `tests/load/README.md`) against the failover region and confirm:
- Cosmos writes and reads succeed from both regions (if using `enableMultipleWriteLocations`).
- Storage reads return expected content from the failover region.
- A representative ADF pipeline completes end-to-end.
- A dbt model run (`make test-dbt`) completes against the failover Databricks workspace.
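The checks above can be driven by a tiny harness that runs each probe, records pass/fail, and reports a summary. A sketch (the `run_check` helper is ours, and the two checks shown are placeholders to swap for the real Cosmos/Storage/ADF/dbt probes):

```shell
# Minimal smoke-test harness: run_check NAME CMD... records the result and
# prints PASS/FAIL per check; FAILED=1 afterwards means something broke.
FAILED=0
run_check() {
  name="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $name"
  else
    echo "FAIL  $name"
    FAILED=1
  fi
}

# Placeholder checks - replace with the real probes for each bullet above,
# e.g. an az cosmosdb query, an az storage blob download, a pipeline-run
# poll, and `make test-dbt`.
run_check "storage-read" true
run_check "adf-pipeline" true

[ "$FAILED" -eq 0 ] && echo "smoke test passed"
```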
## 4. Failover Readiness – Quarterly Drill

Rollback procedures rot when they are not exercised. Schedule a chaos-engineering drill once per quarter. The drill is automated via `.github/workflows/dr-drill.yml` and documented in detail in `runbooks/dr-drill.md` (CSA-0073):
- Pick one critical-tier service (rotate Cosmos → Storage → Databricks over the year).
- Run the failover procedure against the dev environment only.
- Time each step and capture the durations in a drill report under `reports/dr-drills/<YYYY-MM-DD>.md` (gitignored – attach it to the incident tracker instead).
- Update this runbook with anything that drifted.
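Timing the steps can be scripted so the drill report largely writes itself. A sketch (the `time_step` helper and default report path are ours; the drill only mandates that durations end up in the report):

```shell
# Capture per-step durations for the drill report. Usage:
#   time_step "cosmos-failover" az cosmosdb failover-priority-change ...
REPORT="${REPORT:-dr-drill-report.md}"
echo "| Step | Seconds |" >  "$REPORT"
echo "|---|---|"          >> "$REPORT"

time_step() {
  name="$1"; shift
  start=$(date +%s)
  "$@"                                       # run the actual drill step
  end=$(date +%s)
  echo "| $name | $((end - start)) |" >> "$REPORT"
}

time_step "noop-example" true   # replace with the real failover commands
cat "$REPORT"
```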
> **Note:** The dev-environment drill is the minimum bar. For regulated workloads, add a second drill per year against a pre-prod clone of production.
## 5. Failback Procedure
> **Danger:** Do not fail back at the first sign that the primary region is healthy again – you risk oscillating and losing data. Wait at least one full business day after the Azure Service Health dashboard shows the primary region as green, and only then start the failback.
### Step 1 – Re-enable geo-replication
A storage-account failover drops the SKU to LRS. Put it back:
```shell
az storage account update \
  --name csadlzstorage001 \
  --resource-group rg-csa-storage \
  --sku Standard_RAGRS
```
Wait for the account's geo-replication status to show `Live` before proceeding – it typically takes hours for Azure to seed the replica.
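The wait can be scripted as a polling loop. In the sketch below, `get_status` is a stub standing in for the real call (roughly `az storage account show --name csadlzstorage001 --resource-group rg-csa-storage --expand geoReplicationStats --query geoReplicationStats.status -o tsv`) so the loop itself can be rehearsed offline:

```shell
# Poll until geo-replication reports Live. get_status is a stub so the loop
# can run without Azure; in production, replace its body with the az call
# described in the lead-in above.
get_status() { echo "Live"; }

status=$(get_status)
while [ "$status" != "Live" ]; do
  echo "geo replication is '$status'; checking again in 10 minutes"
  sleep 600
  status=$(get_status)
done
echo "geo replication is Live - safe to proceed with failback"
```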
### Step 2 – Swap Cosmos failover priorities back
```shell
az cosmosdb failover-priority-change \
  --name csa-cosmos-dlz \
  --resource-group rg-csa-cosmos \
  --failover-policies eastus=0 westus=1
```
### Step 3 – Repoint ADF and Databricks

Reverse the linked-service changes from §3.5 and §3.6. Leave the secondary-region workspaces and factories in place as cold standbys – they are now part of the steady-state DR posture.
### Step 4 – DNS and client reset
Flip CNAMEs / Traffic Manager priorities back to the primary region.
### Step 5 – Post-incident review
Within two business days, write up:
- Timeline from first alert to service restoration.
- Which steps took longer than the documented RTO and why.
- Any data loss (compare against RPO targets in ยง1).
- Updates to this runbook captured as a follow-up PR.
File the post-incident review under the tracking issue for the incident, and update `.claude/DEVELOPMENT_LOG.md` with a short pointer so future sessions see it.
## 6. Known Gaps and Roadmap
The following items are intentionally left manual / unconfigured and tracked as follow-up work. They were too large for the initial DR rollout:
- **Event Hub Geo-DR pairing:** `Microsoft.EventHub/namespaces/disasterRecoveryConfigs` is not yet wired into the Bicep modules. Workloads that need sub-5-minute RPO on Event Hub should add this manually via the portal for now.
- **Purview geo-replication:** Purview does not support cross-region replication. The current posture is "accept 24 h RPO and re-scan from source after a disaster"; workloads with stricter requirements need a different catalog service.
- **Automated DR drills:** §4's drill procedure is now wired into `.github/workflows/dr-drill.yml` on a quarterly cron against a scratch subscription (CSA-0073). The per-scenario shell scripts under `scripts/drill/` are still stubbed and tracked as follow-ups in `runbooks/dr-drill.md` §8.
- **Traffic Manager + Front Door:** no global-routing layer is currently deployed. Clients talk to region-specific endpoints. If this becomes a pain point, deploy Azure Front Door with priority routing so clients do not need to be aware of the failover.
## 7. Quick Reference
| Scenario | Runbook section |
|---|---|
| Bad deploy, region healthy | `ROLLBACK.md` |
| Region unavailable | This document |
| Individual resource deleted | `ROLLBACK.md` §5–6 (Cosmos PITR / storage soft-delete) |
| dbt model regression | `tests/load/README.md` → `benchmark_dbt_models.py` |
| Quarterly drill | §4 above + `runbooks/dr-drill.md` |