🌐 Multi-Region Failover Runbook¶
Last Updated: 2026-04-27 | Phase: 14 (Wave 1) | Feature: 1.5
Audience: On-call engineers, incident commanders, platform leads, Fabric admins
Purpose: Step-by-step procedure for failing Fabric workloads over from a primary region to a secondary region during a regional outage — including OneLake geo-redundancy, capacity failover, and Power BI report redirection.
📑 Table of Contents¶
- Pre-Requisites
- Symptoms
- Severity Classification
- Decision Matrix: Failover vs Wait
- Failover Procedure
- Verification
- Failback Procedure
- Post-Incident Actions
- Escalation
- Communication Tree Reference
- Quick-Reference Commands
- Diagrams
- Related Runbooks
- Related Best-Practice Docs
Pre-Requisites¶
These must be configured BEFORE an incident. A failover during an outage assumes everything below is already in place. If any item is missing, fix it during the next DR drill — not mid-incident.
Infrastructure Pre-Requisites¶
| # | Pre-Requisite | Validation | Owner |
|---|---|---|---|
| 1 | Secondary Fabric capacity provisioned (warm standby) — F32+ in paired region (e.g., westus2 for eastus2), paused to control cost | az fabric capacity show --name fabric-cap-dr-westus2 returns state=Paused | Platform Lead |
| 2 | OneLake geo-redundancy enabled — Capacity uses GRS or GZRS; replication lag < 15 min steady-state | AzureStorageMetrics GeoReplicationLag < 900,000 ms (15 min) | Fabric Admin |
| 3 | Workspace pairs configured — Every prod workspace has a DR twin (ws-prod-{name} ↔ ws-dr-{name}) deployed via fabric-cicd from the same git source | fabric-cicd-deploy.py --verify-only --workspace-id $DR_WS_ID succeeds | Platform Lead |
| 4 | Power BI dataset replicas — Semantic models deployed to DR workspace via Deployment Pipelines or git; Direct Lake bindings parameterized | DR semantic models present + refresh succeeds against DR Gold tables | BI Lead |
| 5 | DNS / Front Door routing configured — Azure Front Door or Traffic Manager profile with primary + secondary backends; health probes target Fabric API per region | az network front-door show --name fd-fabric lists both backends | Network Lead |
| 6 | Key Vault replicated — Workspace Identity / SP credentials available in DR-region Key Vault; CMK keys present | az keyvault key list --vault-name kv-fabric-dr-westus2 matches primary keys | Security Lead |
| 7 | ADLS Gen2 landing zones GRS — All upstream landing storage accounts use Standard_GRS or GZRS | az storage account show --query sku.name returns Standard_GRS or Standard_GZRS | Data Engineering |
| 8 | Eventstream/Event Hub DR namespaces — Standby Event Hub namespace pre-provisioned in DR region; producer SDK has both endpoints in config | DR namespace exists; producer config has EH_PRIMARY + EH_SECONDARY | Streaming Lead |
| 9 | Break-glass account — Emergency Global Admin in DR region with FIDO2 hardware key; credentials in physical safe (dual-control) | Quarterly access review confirms credentials sealed | Security Lead |
| 10 | Runbook tested in last 90 days — Quarterly DR drill executed against DR region with documented RTO achieved | Last dr-drill-{YYYY-MM-DD}.md postmortem ≤ 90 days old | Incident Commander |
Recovery Objectives (Documented Targets)¶
| Objective | Target | Source |
|---|---|---|
| RTO (Recovery Time Objective) | 1 hour end-to-end (detection → traffic on DR) | BCDR Best Practices Tier 1 |
| RPO (Recovery Point Objective) | 15 minutes for Delta tables (OneLake GRS); 1 minute for streaming (Event Hub replay) | Same |
| Failback RTO | 2 hours (after primary stable for 24 hr) | Casino/Federal BCDR |
| Drill cadence | Quarterly tabletop + annual full failover | NIST SP 800-34 / FedRAMP CP-4 |
Pre-Flight Checklist (run within 5 min of decision to failover)¶
- DR capacity status verified (Paused and resumable)
- OneLake replication lag < 15 min (within RPO)
- DR workspace last-deployed timestamp < 24 hr ago
- Front Door health probes confirm DR region is healthy
- Incident channel open and IC assigned
- Stakeholder notification draft ready
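The pre-flight checklist lends itself to automation. A minimal Python sketch of the go/no-go evaluation, assuming illustrative field names (this does not correspond to an existing script; populate the fields from the `az`/KQL commands in the Quick-Reference section):

```python
from dataclasses import dataclass

RPO_LAG_MS = 15 * 60 * 1000   # 15 min RPO expressed in ms (900,000)
MAX_DEPLOY_AGE_HR = 24        # DR workspace must be freshly deployed

@dataclass
class PreFlight:
    dr_capacity_state: str     # from `az fabric capacity show`
    replication_lag_ms: float  # GeoReplicationLag metric
    dr_deploy_age_hr: float    # hours since last DR deployment
    dr_probes_healthy: bool    # Front Door health probe status
    ic_assigned: bool          # incident channel open, IC assigned

    def failures(self) -> list[str]:
        checks = {
            "DR capacity resumable": self.dr_capacity_state in ("Paused", "Active"),
            "replication lag within RPO": self.replication_lag_ms < RPO_LAG_MS,
            "DR deployment fresh": self.dr_deploy_age_hr < MAX_DEPLOY_AGE_HR,
            "DR health probes green": self.dr_probes_healthy,
            "IC assigned": self.ic_assigned,
        }
        return [name for name, ok in checks.items() if not ok]

pf = PreFlight("Paused", 480_000, 6, True, True)
print(pf.failures() or "GO")  # prints "GO"
```

Any non-empty `failures()` list means the checklist is not clear; post the list in the incident channel rather than proceeding.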
Symptoms¶
| Indicator | Where to Check | Interpretation |
|---|---|---|
| Azure Service Health regional outage notification | Azure Portal → Service Health → Service issues | Confirmed regional incident — proceed to severity classification |
| Fabric capacity unavailable in primary region | az fabric capacity show returns 503/timeout | Capacity-level outage; may be regional or capacity-specific |
| Mass query failures from one region only | Workspace Monitoring KQL: FabricCapacityMetrics \| where Region == "eastus2" shows 0 events | Regional data plane issue |
| OneLake reads failing tenant-wide | Notebook reads return StorageException; mssparkutils.fs.ls("Files/") fails | Storage layer regional outage |
| Power BI reports stale or 5xx | https://app.powerbi.com returns 503 for tenant; refresh failures across all datasets | Front-end or capacity outage |
| Event Hub ingestion drops to zero | Eventstream input metric flatlines; producer SDK errors | Region or Event Hub namespace outage |
| Cross-region GRS replication lag spike (>1 hr) | AzureStorageMetrics \| where MetricName == "GeoReplicationLag" | Regional storage degradation precursor |
| Microsoft Fabric Status Page red | https://support.fabric.microsoft.com/support/ | Microsoft has confirmed the incident |
Important: A single workspace failure or a single capacity throttling event is not a multi-region failover scenario. Use capacity-throttling-response.md or auth-failure-playbook.md for those.
Severity Classification¶
| Condition | Severity | Why |
|---|---|---|
| Regional outage confirmed by Azure Service Health, ETA > RTO target (1 hr), customer-facing impact | SEV1 | Multi-tenant blast radius, RTO breach imminent |
| Capacity unreachable region-wide, ETA unknown | SEV1 | Treat as regional until proven otherwise |
| Single workspace down, other workspaces in same region healthy | SEV2 | Workspace-scoped — use pipeline/auth runbooks |
| GRS replication lag > 1 hr (no outage yet) | SEV2 | RPO at risk; investigate but do not failover |
| DR drill / planned failover | SEV4 | Scheduled change; follow change management |
Rule of thumb: Regional outage with customer impact = SEV1. Page VP Eng + Incident Commander immediately per incident-response-template.md.
Decision Matrix: Failover vs Wait¶
The cost of an unnecessary failover is real: roughly 2 hours of engineering time, possible data divergence, a customer-visible cutover, and a failback effort. Do not fail over reflexively.
flowchart TD
Start([Regional Outage Suspected]) --> Q1{Azure Service Health<br/>confirms regional issue?}
Q1 -->|No| Wait1[Wait + monitor — likely<br/>workspace/capacity issue]
Q1 -->|Yes| Q2{Microsoft ETA<br/>to recover?}
Q2 -->|< 30 min| Wait2[Wait — recovery likely<br/>faster than failover]
Q2 -->|30-60 min| Q3{Customer SLA<br/>at risk?}
Q2 -->|> 60 min OR unknown| Q4
Q3 -->|No| Wait2
Q3 -->|Yes| Q4{GRS replication<br/>lag within RPO<br/>< 15 min?}
Q4 -->|No| Hold
Q4 -->|Yes| Q5{Data residency<br/>allows secondary<br/>region?}
Q5 -->|No| Hold[HOLD — escalate to<br/>compliance officer]
Q5 -->|Yes| Q6{DR capacity<br/>verified healthy?}
Q6 -->|No| Hold2[HOLD — fix DR<br/>before failing over]
Q6 -->|Yes| Failover[EXECUTE FAILOVER]
style Failover fill:#ea4335,color:#fff
style Hold fill:#fbbc04,color:#000
style Hold2 fill:#fbbc04,color:#000
style Wait1 fill:#34a853,color:#fff
style Wait2 fill:#34a853,color:#fff
Decision Criteria (must answer YES to all before failover)¶
| # | Criterion | If NO |
|---|---|---|
| 1 | Azure Service Health confirms regional incident | Wait — likely capacity/workspace issue |
| 2 | Estimated outage duration > RTO target (1 hr) OR unknown | Wait — recovery faster than failover |
| 3 | OneLake GRS replication lag is within RPO (< 15 min) | Hold — escalate to compliance for data-loss exception approval |
| 4 | Data residency / sovereignty rules allow secondary region (US-East ↔ US-West OK; commercial → gov NOT OK) | Hold — compliance officer must approve |
| 5 | DR capacity is provisioned and resumable (Pre-Req #1) | Hold — re-provision DR before failing over |
| 6 | Incident Commander has authorized the failover | Wait — get IC sign-off (SEV1 requires VP Eng concurrence) |
When to ALWAYS Wait¶
- Outage ETA < 30 min
- Single workspace impacted (use pipeline-failure runbook instead)
- Replication lag exceeds RPO and data loss not acceptable
- DR region itself is degraded
- Compliance constraint not yet cleared
Failover Procedure¶
Total target: 60 minutes from decision to "DR serving traffic". Time-box each step. If a step exceeds 1.5x its budget, escalate.
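The 1.5x escalation rule, applied to the per-step budgets below, reduces to a tiny helper (names and the budget table are transcribed from the step headings; nothing here is an existing script):

```python
# Per-step time budgets in minutes, from the Step 1-8 headings below
STEP_BUDGET_MIN = {1: 5, 2: 5, 3: 10, 4: 10, 5: 10, 6: 5, 7: 5, 8: 10}

def should_escalate(step: int, elapsed_min: float) -> bool:
    """Escalate when a step runs past 1.5x its time budget."""
    return elapsed_min > 1.5 * STEP_BUDGET_MIN[step]
```

For example, Step 3 (10 min budget) triggers escalation at minute 16, since 16 > 15.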
Step 1 — Validate Secondary Region Health (Target: 5 min)¶
# 1.1 — Confirm secondary region is healthy
az rest --method get \
--url "https://management.azure.com/subscriptions/${SUB}/providers/Microsoft.ResourceHealth/availabilityStatuses?api-version=2022-10-01&\$filter=location eq '${DR_REGION}'"
# 1.2 — Check DR capacity provisioning state
az fabric capacity show \
--resource-group "${DR_RG}" \
--name "fabric-cap-dr-${DR_REGION}" \
--query "{name:name, state:properties.state, sku:sku.name}"
# 1.3 — Verify DR workspace last-deployed timestamp
python scripts/fabric-cicd-deploy.py \
--workspace-id "${DR_WORKSPACE_ID}" \
--verify-only
# 1.4 — Confirm OneLake DR replication lag is within RPO
# (Run in Workspace Monitoring KQL — see Quick-Reference Commands)
Pass criteria: Region healthy, DR capacity in Paused or Active state, last deployment < 24 hr, lag < 15 min.
On fail: Abort failover, escalate to Microsoft support — DR region itself is at risk.
Step 2 — Stop Writes to Primary (if reachable) (Target: 5 min)¶
Goal: Quiesce writes so secondary becomes the system of record cleanly. Skip if primary is fully unreachable.
# 2.1 — Pause primary capacity (if still reachable) to halt all compute
az fabric capacity suspend \
--resource-group "${PRIMARY_RG}" \
--name "fabric-cap-prod-${PRIMARY_REGION}"
# 2.2 — Stop upstream Event Hub producers (issue command to producer apps)
# Producer apps should switch to DR endpoint based on env flag
az appconfig kv set \
--name "appcs-fabric" \
--key "Streaming:ActiveRegion" \
--value "${DR_REGION}" --yes
# 2.3 — Pause running pipelines via Fabric REST API
for ws in $(echo "$PRIMARY_WORKSPACES" | tr ',' ' '); do
az rest --method post \
--url "https://api.fabric.microsoft.com/v1/workspaces/${ws}/jobs/instances?jobType=Pipeline&action=cancel"
done
Pass criteria: No new writes hitting primary OneLake; pipelines cancelled.
On fail (primary unreachable): Skip — primary is already silent. Document timestamp of last known write for reconciliation.
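Producer apps react to the `Streaming:ActiveRegion` flag set in Step 2.2 by reconnecting to the DR Event Hub. A minimal sketch of the endpoint-selection logic, assuming the `EH_PRIMARY` / `EH_SECONDARY` settings from Pre-Req #8; a real producer would poll App Configuration (e.g. via the azure-appconfiguration SDK) roughly every 30 seconds and rebuild its client when the value changes:

```python
import os

def active_endpoint(active_region: str, primary_region: str) -> str:
    """Pick the Event Hub connection string matching the ActiveRegion flag.
    EH_PRIMARY / EH_SECONDARY are the producer config keys from Pre-Req #8."""
    key = "EH_PRIMARY" if active_region == primary_region else "EH_SECONDARY"
    return os.environ[key]
```

Keeping the switch keyed on a single App Configuration value means failback (Step 3 of the Failback Procedure) is the same flip in reverse.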
Step 3 — Promote (Resume + Scale) Secondary Capacity (Target: 10 min)¶
# 3.1 — Resume DR capacity
az fabric capacity resume \
--resource-group "${DR_RG}" \
--name "fabric-cap-dr-${DR_REGION}"
# 3.2 — Scale to production SKU (F64) if currently smaller
az rest --method patch \
--url "https://management.azure.com/subscriptions/${SUB}/resourceGroups/${DR_RG}/providers/Microsoft.Fabric/capacities/fabric-cap-dr-${DR_REGION}?api-version=2023-11-01" \
--body '{"sku": {"name": "F64", "tier": "Fabric"}}'
# 3.3 — Wait for capacity to be Active (poll up to 8 min)
for i in $(seq 1 16); do
STATE=$(az fabric capacity show --resource-group "${DR_RG}" --name "fabric-cap-dr-${DR_REGION}" --query "properties.state" -o tsv)
echo "Attempt ${i}: ${STATE}"
if [ "${STATE}" = "Active" ]; then break; fi
sleep 30
done
Pass criteria: Capacity state = Active, SKU = F64.
On fail: Engage Microsoft support (severity A — production down). See Escalation.
Step 4 — Switch OneLake Shortcuts to Secondary Paths (Target: 10 min)¶
# 4.1 — Run in DR workspace notebook (or via REST API script)
# Update all shortcuts in DR lakehouses to point at the DR-region ADLS / paired OneLake
import requests
HEADERS = {"Authorization": f"Bearer {TOKEN}"}  # TOKEN acquired out-of-band
DR_WORKSPACE_ID = "${DR_WORKSPACE_ID}"
DR_LAKEHOUSE_ID = "${DR_LAKEHOUSE_ID}"
DR_ADLS_ACCOUNT = "stfabriclz-dr"  # GRS-paired storage in DR region
BASE = f"https://api.fabric.microsoft.com/v1/workspaces/{DR_WORKSPACE_ID}/items/{DR_LAKEHOUSE_ID}"
# List existing shortcuts
resp = requests.get(f"{BASE}/shortcuts", headers=HEADERS)
resp.raise_for_status()
# Delete and re-create each shortcut with the DR target
# (creating over an existing shortcut of the same name/path fails)
for sc in resp.json()["value"]:
    new_target = sc["target"]["adlsGen2"]
    new_target["location"] = f"https://{DR_ADLS_ACCOUNT}.dfs.core.windows.net"
    requests.delete(f"{BASE}/shortcuts/{sc['path']}/{sc['name']}", headers=HEADERS).raise_for_status()
    requests.post(
        f"{BASE}/shortcuts",
        headers=HEADERS,
        json={"name": sc["name"], "path": sc["path"], "target": {"adlsGen2": new_target}},
    ).raise_for_status()
# 4.2 — If using ADLS storage account failover (single-account pattern)
az storage account failover \
--name "stfabriclz" \
--resource-group "${PRIMARY_RG}" \
--no-wait
# 4.3 — Validate sample read from DR lakehouse
# (Run in DR workspace notebook)
# spark.read.format("delta").load("Tables/bronze_slot_telemetry").limit(10).show()
Pass criteria: Sample DR notebook reads return rows; row counts within RPO of last primary write.
On fail: GRS failover may still be in progress (it can take 15+ min). Wait 5 min and retry; if it is still failing, escalate.
Step 5 — Repoint Power BI Datasets to Secondary Semantic Models (Target: 10 min)¶
# 5.1 — Connect to Power BI service
Connect-PowerBIServiceAccount
# 5.2 — Rebind reports in DR workspace to DR semantic models
$drWorkspaceId = "${DR_WORKSPACE_ID}"
$reports = Get-PowerBIReport -WorkspaceId $drWorkspaceId
foreach ($r in $reports) {
# Find the DR semantic model with same name as primary
$drModel = Get-PowerBIDataset -WorkspaceId $drWorkspaceId | Where-Object Name -eq $r.Name
if ($drModel) {
Invoke-PowerBIRestMethod -Url "groups/$drWorkspaceId/reports/$($r.Id)/Rebind" `
-Method Post -Body (@{datasetId = $drModel.Id} | ConvertTo-Json)
Write-Host "Rebound $($r.Name) → $($drModel.Id)"
}
}
# 5.3 — Trigger refresh of DR semantic models
Get-PowerBIDataset -WorkspaceId $drWorkspaceId | ForEach-Object {
Invoke-PowerBIRestMethod -Url "groups/$drWorkspaceId/datasets/$($_.Id)/refreshes" -Method Post
}
Pass criteria: All critical reports list DR dataset as source; first refresh completes within 5 min.
On fail: Direct Lake reports auto-recover from Gold tables; check that DR Gold tables exist and are queryable.
Step 6 — Update DNS / Front Door Routing (Target: 5 min)¶
# 6.1 — Disable primary backend in Front Door
az network front-door backend-pool backend update \
--front-door-name "fd-fabric" \
--resource-group "${NETWORK_RG}" \
--pool-name "fabric-backends" \
--index 1 \
--enabled-state Disabled
# 6.2 — Promote secondary backend to weight 100
az network front-door backend-pool backend update \
--front-door-name "fd-fabric" \
--resource-group "${NETWORK_RG}" \
--pool-name "fabric-backends" \
--index 2 \
--weight 100 \
--enabled-state Enabled
# 6.3 — Purge Front Door cache to flush stale primary responses
az network front-door purge-endpoint \
--resource-group "${NETWORK_RG}" \
--name "fd-fabric" \
--content-paths "/*"
# 6.4 — Alternative: Traffic Manager
az network traffic-manager endpoint update \
--resource-group "${NETWORK_RG}" \
--profile-name "tm-fabric" \
--name "primary" --type azureEndpoints \
--endpoint-status Disabled
Pass criteria: dig / nslookup of customer-facing hostname resolves to DR backend within TTL window (typically < 60 sec).
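The DNS pass criterion can be scripted rather than eyeballed in `dig` output. A sketch with `socket.getaddrinfo` standing in for `dig`; both function names are illustrative:

```python
import socket

def resolved_ips(hostname: str) -> set[str]:
    """Current A/AAAA answers for the customer-facing hostname."""
    return {info[4][0] for info in
            socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)}

def cutover_complete(resolved: set[str], dr_ips: set[str]) -> bool:
    """True once the hostname resolves only to DR-backend IPs."""
    return bool(resolved) and resolved <= dr_ips
```

Poll `cutover_complete(resolved_ips("fabric.your-org.com"), dr_ips)` until it flips; if it has not flipped within the record's TTL, re-check the Front Door backend states from 6.1-6.2.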
Step 7 — Resume Traffic (Target: 5 min)¶
# 7.1 — Switch streaming producers (already prepared via App Configuration in Step 2.2)
# Producers tail App Config and reconnect within 30 sec
# 7.2 — Re-enable scheduled pipelines in DR workspace
for pipeline in $(echo "$DR_PIPELINES" | tr ',' ' '); do
az rest --method post \
--url "https://api.fabric.microsoft.com/v1/workspaces/${DR_WORKSPACE_ID}/items/${pipeline}/jobs/instances?jobType=Pipeline"
done
# 7.3 — Notify customers via status page that DR is serving
# (Manual: update https://status.your-org.com)
Pass criteria: Streaming ingestion rate in DR ≥ 80% of pre-incident primary rate within 5 min.
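The 80%-of-baseline criterion as a one-liner (illustrative names; feed it the per-minute event counts from the Verification 8.4 KQL query):

```python
def ingestion_recovered(dr_events_per_min: float,
                        baseline_events_per_min: float,
                        threshold: float = 0.8) -> bool:
    """Pass criterion: DR streaming rate >= 80% of pre-incident baseline."""
    return dr_events_per_min >= threshold * baseline_events_per_min
```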
Step 8 — Verify Customer Queries Succeed (Target: 10 min)¶
See Verification — must complete before declaring failover successful.
Verification¶
Do not declare the incident "MITIGATED" until every check below passes. Run all checks; document results in incident channel.
8.1 Capacity Health¶
# Capacity must be Active in DR
az fabric capacity show --resource-group "${DR_RG}" --name "fabric-cap-dr-${DR_REGION}" \
--query "{name:name, state:properties.state, sku:sku.name}"
# Expected: state=Active, sku=F64
8.2 Sample Customer Queries¶
Run in DR workspace notebook — these are the canonical "is the platform working?" queries:
# 8.2.1 — Bronze layer freshness (RPO check)
spark.sql("""
SELECT MAX(event_timestamp) AS latest_event,
COUNT(*) AS row_count
FROM lh_bronze.bronze_slot_telemetry
""").show(truncate=False)
# Expected: latest_event within 15 min of failover start
# 8.2.2 — Silver layer integrity
spark.sql("""
SELECT COUNT(*) AS row_count,
COUNT(DISTINCT machine_id) AS unique_machines
FROM lh_silver.silver_slot_cleansed
""").show()
# 8.2.3 — Gold layer KPI
spark.sql("""
SELECT * FROM lh_gold.gold_slot_performance
WHERE business_date = current_date() - 1
LIMIT 10
""").show()
8.3 Power BI Refresh¶
- Trigger refresh on top-5 customer-facing datasets — must complete within 5 min
- Open one report from each domain (casino, federal/USDA, federal/EPA) — visuals render
- No 5xx errors in Power BI service logs
8.4 Streaming Ingestion¶
// In DR Workspace Monitoring
EventstreamMetrics
| where TimeGenerated > ago(15m)
| where WorkspaceId == "${DR_WORKSPACE_ID}"
| summarize EventsPerMin = count() by bin(TimeGenerated, 1m), EventstreamName
| render timechart
Expected: ingestion rate ≥ 80% of pre-incident baseline.
8.5 End-to-End Smoke Test¶
- Submit synthetic transaction at source → confirm landing in DR Bronze within 5 min
- Confirm Silver/Gold pipeline picks it up on next schedule
- Confirm Power BI report shows the new row
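The smoke-test steps above share one shape: poll until the synthetic record appears, or the budget expires. A generic helper sketch; the `check` callable is whatever query confirms the row landed (e.g. a Bronze-table lookup on the synthetic transaction ID):

```python
import time

def wait_for(check, timeout_s: float = 300, interval_s: float = 15) -> bool:
    """Poll `check()` until it returns True or the budget
    (default: the 5-minute landing target) is exhausted."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval_s)
    return False
```

Run it once per hop (source → Bronze, Bronze → Gold, Gold → report) and record each elapsed time in the incident channel.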
8.6 Verification Pass Criteria¶
All of:
- All KQL/SQL queries return expected row counts (within RPO tolerance)
- Power BI refresh succeeds for all critical datasets
- Streaming rate within 80% of baseline
- No SEV1/SEV2 alerts firing on DR capacity
- Customer-reported queries resolved
If any check fails, do not declare resolved — continue mitigation or escalate.
Failback Procedure¶
Failback is its own change. Do not fail back during the incident. Wait until the primary has been stable for 24 hr and schedule failback as a planned change with full change management approval.
Failback Pre-Conditions¶
- Primary region healthy for 24 hours minimum
- Microsoft confirms incident resolved
- Reverse data-sync has caught up (DR → primary lag < 15 min)
- Change Management ticket approved
- Off-peak window scheduled (target: lowest-traffic hour)
Step 1 — Reverse-Sync Data from Secondary to Primary (Target: 30-120 min)¶
# Use Mirroring or scripted Delta sync to push DR data back to primary OneLake
# Pattern: read DR Delta tables, write as overwrite to primary
tables_to_sync = [
"bronze_slot_telemetry",
"silver_slot_cleansed",
"gold_slot_performance",
"gold_compliance_ctr",
]
for table in tables_to_sync:
    # OneLake abfss paths include the lakehouse item segment (<name>.Lakehouse);
    # DR_LAKEHOUSE / PRIMARY_LAKEHOUSE are your lakehouse item names
    src = f"abfss://{DR_WS}@onelake.dfs.fabric.microsoft.com/{DR_LAKEHOUSE}.Lakehouse/Tables/{table}"
    dst = f"abfss://{PRIMARY_WS}@onelake.dfs.fabric.microsoft.com/{PRIMARY_LAKEHOUSE}.Lakehouse/Tables/{table}"
    df = spark.read.format("delta").load(src)
    row_count = df.count()  # count before overwrite so the log line doesn't re-read the source
    df.write.format("delta").mode("overwrite") \
        .option("overwriteSchema", "true") \
        .save(dst)
    print(f"Synced {table}: {row_count} rows")
Mirroring shortcut: If using Fabric Mirroring (see mirroring.md) the reverse direction can be configured with a new mirror profile; this avoids manual sync code.
Step 2 — Verify Integrity (Target: 30 min)¶
# Row count + content checksum comparison (order-insensitive hash of all rows)
from pyspark.sql import functions as F

def table_fingerprint(name: str):
    df = spark.table(name)
    row = df.select(
        F.count(F.lit(1)).alias("c"),
        # xxhash64 over every column, summed: stable regardless of row order
        F.sum(F.xxhash64(*df.columns)).alias("h"),
    ).collect()[0]
    return row["c"], row["h"]

for table in tables_to_sync:
    primary_cnt, primary_chk = table_fingerprint(f"primary.{table}")
    dr_cnt, dr_chk = table_fingerprint(f"dr.{table}")
    assert primary_cnt == dr_cnt, f"Row count mismatch on {table}: P={primary_cnt} D={dr_cnt}"
    assert primary_chk == dr_chk, f"Checksum mismatch on {table}"
    print(f"{table} OK: {primary_cnt} rows, checksum match")
Step 3 — Quiesce Secondary Writes (Target: 5 min)¶
# Cancel all running pipelines in DR workspace
az rest --method post \
--url "https://api.fabric.microsoft.com/v1/workspaces/${DR_WORKSPACE_ID}/jobs/instances?jobType=Pipeline&action=cancel"
# Switch streaming producers back to primary endpoint
az appconfig kv set \
--name "appcs-fabric" \
--key "Streaming:ActiveRegion" \
--value "${PRIMARY_REGION}" --yes
Step 4 — Switch Routing Back (Target: 5 min)¶
# Re-enable primary in Front Door, demote secondary to standby weight
az network front-door backend-pool backend update \
--front-door-name "fd-fabric" --resource-group "${NETWORK_RG}" \
--pool-name "fabric-backends" --index 1 \
--enabled-state Enabled --weight 100
az network front-door backend-pool backend update \
--front-door-name "fd-fabric" --resource-group "${NETWORK_RG}" \
--pool-name "fabric-backends" --index 2 \
--weight 0
Step 5 — Validate Primary (Target: 30 min)¶
Run full Verification checklist against primary workspace.
Step 6 — Re-Pause DR Capacity (Target: 5 min)¶
# Cost optimization — return DR to warm-standby state
az fabric capacity suspend \
--resource-group "${DR_RG}" \
--name "fabric-cap-dr-${DR_REGION}"
Post-Incident Actions¶
| Action | Owner | Due |
|---|---|---|
| Data reconciliation report (rows ingested DR vs. expected primary) | Data Engineering Lead | Within 24 hr |
| Customer comms — "DR Activation Notice" with impact summary | Communications Lead | Within 24 hr |
| Compliance notifications (NIGC/FinCEN for casino; OMB/CISA for federal) | Incident Commander + Compliance Officer | Per regulatory deadline |
| Gap analysis — what worked, what didn't, missing automation | Incident Commander | Within 48 hr |
| Postmortem published | IC + Tech Lead | Within 48 hr (SEV1) |
| Action items entered into Archon | IC | Within 5 business days |
| Update this runbook with lessons learned | Platform Lead | Within 5 business days |
| Schedule next DR drill (if last drill > 60 days ago) | Platform Lead | Within 30 days |
Use the Blameless Postmortem Template.
Escalation¶
Microsoft Support (Capacity / Service Issues)¶
| Issue | Severity | Path |
|---|---|---|
| DR capacity won't resume | Severity A | Azure Portal → Help + support → New support request → Production system down |
| GRS replication lag > 1 hr | Severity B | Same path; reference incident ID |
| Fabric service-side errors during failover | Severity A | Same path; cite Service Health incident ID |
| Power BI tenant-wide outage | Severity A | Power Platform support |
Microsoft Premier/Unified support phone: (use your contracted support number — store in incident channel pin)
Internal Escalation Path¶
On-Call Engineer ──(5 min)──▶ Platform Lead ──(15 min)──▶ Incident Commander
│
(30 min for SEV1)
▼
VP Engineering ──(45 min)──▶ CTO/CDO
Executive Communications¶
For SEV1 sustained > 1 hour, the IC must brief executives:
| Recipient | Trigger | Channel |
|---|---|---|
| VP Engineering | Failover decision made | Phone + Teams |
| CTO / CDO | SEV1 sustained > 1 hr | Email briefing every 60 min |
| CFO | Customer SLA breach likely | Email from VP Eng |
| Compliance Officer | Any SOX/HIPAA/FedRAMP impact | Phone + email immediately |
| Legal | Regulatory notification required | |
| Customer Success | Customer-visible impact | Slack + email |
Use the Stakeholder Update Template.
Communication Tree Reference¶
See the canonical Communication Tree in the anchor runbook. Specific multi-region failover additions:
| Audience | When | Owner |
|---|---|---|
| Casino Gaming Commission | Within 24 hr (regulatory requirement) | Compliance Officer |
| FinCEN | If CTR/SAR filing window missed | Compliance Officer |
| Federal OCIO (per agency) | Per agency-specific timeline (USDA: 2 hr, SBA: 1 hr, NOAA: immediate, EPA: 4 hr, DOI: 4 hr) | Agency liaison |
| OMB / CISA | If FedRAMP system, per ATO terms | COOP Coordinator |
Quick-Reference Commands¶
Capacity Status Across Regions (Azure CLI)¶
# List all Fabric capacities and their state
az resource list --resource-type "Microsoft.Fabric/capacities" \
--query "[].{name:name, region:location, state:properties.state, sku:sku.name}" \
--output table
# Get specific capacity health
az fabric capacity show \
--resource-group "${RG}" \
--name "${CAPACITY_NAME}" \
--query "{state:properties.state, sku:sku.name, region:location}"
# Resume / Suspend
az fabric capacity resume --resource-group "${RG}" --name "${CAPACITY_NAME}"
az fabric capacity suspend --resource-group "${RG}" --name "${CAPACITY_NAME}"
# Scale SKU (mid-incident upsize)
az rest --method patch \
--url "https://management.azure.com/subscriptions/${SUB}/resourceGroups/${RG}/providers/Microsoft.Fabric/capacities/${CAPACITY_NAME}?api-version=2023-11-01" \
--body '{"sku": {"name": "F64", "tier": "Fabric"}}'
Front Door / Traffic Manager Updates¶
# Front Door — disable primary backend
az network front-door backend-pool backend update \
--front-door-name "fd-fabric" --resource-group "${NETWORK_RG}" \
--pool-name "fabric-backends" --index 1 --enabled-state Disabled
# Front Door — purge cache after failover
az network front-door purge-endpoint \
--resource-group "${NETWORK_RG}" --name "fd-fabric" --content-paths "/*"
# Traffic Manager — disable primary endpoint
az network traffic-manager endpoint update \
--resource-group "${NETWORK_RG}" --profile-name "tm-fabric" \
--name "primary" --type azureEndpoints --endpoint-status Disabled
# DNS — direct CNAME flip (if not using Front Door)
az network dns record-set cname set-record \
--resource-group "${DNS_RG}" --zone-name "your-org.com" \
--record-set-name "fabric" --cname "fabric-dr.westus2.cloudapp.azure.com"
Power BI Dataset Rebinding (PowerShell)¶
Connect-PowerBIServiceAccount
# Rebind a single report to a new dataset
$workspaceId = "${DR_WORKSPACE_ID}"
$reportId = "${REPORT_ID}"
$newDsId = "${DR_DATASET_ID}"
Invoke-PowerBIRestMethod -Url "groups/$workspaceId/reports/$reportId/Rebind" `
-Method Post -Body (@{datasetId = $newDsId} | ConvertTo-Json)
# Refresh all datasets in DR workspace
Get-PowerBIDataset -WorkspaceId $workspaceId | ForEach-Object {
Invoke-PowerBIRestMethod -Url "groups/$workspaceId/datasets/$($_.Id)/refreshes" -Method Post
Write-Host "Refresh started: $($_.Name)"
}
# Update gateway / data source connection (if not Direct Lake)
Invoke-PowerBIRestMethod -Url "gateways/$gatewayId/datasources/$dsId" `
-Method Patch -Body (@{
connectionDetails = '{"server":"fabric-dr.westus2","database":"lh_gold"}'
} | ConvertTo-Json)
KQL — Cross-Region Telemetry¶
// Replication lag across regions
AzureStorageMetrics
| where TimeGenerated > ago(1h)
| where MetricName == "GeoReplicationLag"
| summarize AvgLagSec = avg(Value)/1000, MaxLagSec = max(Value)/1000
by bin(TimeGenerated, 5m), Resource
| render timechart
// Capacity utilization by region
FabricCapacityMetrics
| where TimeGenerated > ago(1h)
| extend Region = tostring(split(ResourceId, "/")[4])
| summarize AvgCU = avg(CUSeconds), MaxCU = max(CUSeconds)
by bin(TimeGenerated, 5m), Region, CapacityName
| render timechart
// Cross-region pipeline failure correlation
FabricPipelineRuns
| where TimeGenerated > ago(2h)
| where Status == "Failed"
| extend Region = tostring(split(WorkspaceId, "-")[2])
| summarize Failures = count() by bin(TimeGenerated, 5m), Region
| render columnchart
// DR readiness — last successful deployment per workspace
FabricDeploymentLogs
| where TimeGenerated > ago(7d)
| where Status == "Succeeded"
| summarize LastDeploy = max(TimeGenerated) by WorkspaceId, Environment
| extend HoursSinceDeploy = datetime_diff("hour", now(), LastDeploy)
| extend ReadyForFailover = iff(HoursSinceDeploy < 24, "✅", "⚠️ Stale")
Bicep Reference (DR Capacity)¶
The DR capacity uses the same module as production — see infra/modules/fabric/fabric-capacity.bicep. Set skuName: 'F32' and add tags: { Environment: 'DR', AutoPause: 'true' } to keep cost low until activation.
Diagrams¶
Failover Decision Tree¶
flowchart TD
Detect([Outage Detected]) --> SH{Azure Service Health<br/>regional incident?}
SH -->|No| Other[Use other runbook<br/>capacity / pipeline / auth]
SH -->|Yes| ETA{ETA > 60 min<br/>or unknown?}
ETA -->|No| Wait[Monitor — recovery faster than failover]
ETA -->|Yes| RPO{GRS lag<br/>< 15 min?}
RPO -->|No| Comp{Compliance OK<br/>with data loss?}
RPO -->|Yes| Res{Data residency<br/>allows secondary?}
Comp -->|No| Hold[HOLD — escalate]
Comp -->|Yes| Res
Res -->|No| Hold
Res -->|Yes| DR{DR capacity<br/>healthy?}
DR -->|No| Fix[Fix DR — abort failover]
DR -->|Yes| IC{IC + VP Eng<br/>authorize?}
IC -->|No| Wait2[Wait for authorization]
IC -->|Yes| Exec([EXECUTE FAILOVER<br/>Steps 1-8])
style Exec fill:#ea4335,color:#fff
style Hold fill:#fbbc04,color:#000
style Fix fill:#fbbc04,color:#000
style Wait fill:#34a853,color:#fff
style Wait2 fill:#34a853,color:#fff
Failover Sequence Diagram¶
sequenceDiagram
participant ASH as Azure Service Health
participant IC as Incident Commander
participant Eng as On-Call Engineer
participant Pri as Primary Region (eastus2)
participant DR as DR Region (westus2)
participant FD as Front Door
participant Cust as Customers
ASH->>Eng: Regional outage alert
Eng->>IC: Page SEV1 (Pre-Flight Checklist)
IC->>IC: Decision matrix → AUTHORIZE FAILOVER
Note over Eng,DR: Step 1 — Validate DR (5 min)
Eng->>DR: Check capacity state, replication lag
DR-->>Eng: Healthy, lag 8 min
Note over Eng,Pri: Step 2 — Stop primary writes (5 min)
Eng->>Pri: Suspend capacity, cancel pipelines
Pri-->>Eng: Quiesced
Note over Eng,DR: Step 3 — Promote DR (10 min)
Eng->>DR: Resume + scale to F64
DR-->>Eng: state=Active
Note over Eng,DR: Step 4 — Switch shortcuts (10 min)
Eng->>DR: Repoint shortcuts to DR ADLS
DR-->>Eng: Sample reads succeed
Note over Eng,DR: Step 5 — Power BI rebind (10 min)
Eng->>DR: Rebind reports + trigger refresh
DR-->>Eng: Refresh succeeded
Note over Eng,FD: Step 6 — Routing (5 min)
Eng->>FD: Disable primary, promote DR
FD-->>Cust: Traffic now lands in DR
Note over Eng,DR: Step 7 — Resume traffic (5 min)
Eng->>DR: Re-enable pipelines, switch producers
DR-->>Cust: Streaming + queries serving
Note over Eng,DR: Step 8 — Verify (10 min)
Eng->>DR: Run verification suite
DR-->>Eng: All checks pass
Eng->>IC: Failover complete — RTO 58 min
IC->>Cust: Status page: MITIGATED
Related Runbooks¶
| Runbook | When to Use |
|---|---|
| Incident Response Template | Anchor — every incident starts here |
| Capacity Throttling Response | Single-capacity throttling (not regional) |
| Pipeline Failure Triage | Pipeline-scoped failure inside one region |
| Auth Failure Playbook | Workspace Identity / SP issues |
| Tenant Migration (Dev/Staging/Prod) | Bad-deployment rollback (not regional outage) |
| Data Quality Incident | GE failure, schema breach |
Related Best-Practice Docs¶
| Document | Description |
|---|---|
| Disaster Recovery & BCDR | RTO/RPO targets, BCDR architecture, FedRAMP CP controls |
| Multi-Tenant Workspace Architecture | Workspace pairing patterns for primary/DR |
| Network Security | Front Door, private endpoints, DR network topology |
| Customer-Managed Keys | CMK key replication for DR region |
| Fabric CI/CD Deployment | Git-based deployment to keep DR workspaces current |
| Capacity Planning & Cost Optimization | DR capacity cost trade-offs (paused vs active) |
| Mirroring | Cross-region replication for warehouses & SQL DBs |