
🌐 Multi-Region Failover Runbook

Last Updated: 2026-04-27 | Phase: 14 (Wave 1) | Feature: 1.5
Audience: On-call engineers, incident commanders, platform leads, Fabric admins
Purpose: Step-by-step procedure for failing Fabric workloads from a primary region to a secondary region during a regional outage — including OneLake geo-redundancy, capacity failover, and Power BI report redirection.


📑 Table of Contents

  1. Pre-Requisites
  2. Symptoms
  3. Severity Classification
  4. Decision Matrix: Failover vs Wait
  5. Failover Procedure
  6. Verification
  7. Failback Procedure
  8. Post-Incident Actions
  9. Escalation
  10. Communication Tree Reference
  11. Quick-Reference Commands
  12. Diagrams
  13. Related Runbooks
  14. Related Best-Practice Docs

Pre-Requisites

These must be configured BEFORE an incident. A failover during an outage assumes everything below is already in place. If any item is missing, fix it during the next DR drill — not mid-incident.

Infrastructure Pre-Requisites

| # | Pre-Requisite | Validation | Owner |
|---|---------------|------------|-------|
| 1 | Secondary Fabric capacity provisioned (warm standby) — F32+ in paired region (e.g., westus2 for eastus2), paused to control cost | `az fabric capacity show --name fabric-cap-dr-westus2` returns `state=Paused` | Platform Lead |
| 2 | OneLake geo-redundancy enabled — capacity uses GRS or GZRS; replication lag < 15 min steady-state | `AzureStorageMetrics` `GeoReplicationLag` < 900,000 ms (15 min) | Fabric Admin |
| 3 | Workspace pairs configured — every prod workspace has a DR twin (`ws-prod-{name}` → `ws-dr-{name}`) deployed via fabric-cicd from the same git source | `fabric-cicd-deploy.py --verify-only --workspace-id $DR_WS_ID` succeeds | Platform Lead |
| 4 | Power BI dataset replicas — semantic models deployed to DR workspace via Deployment Pipelines or git; Direct Lake bindings parameterized | DR semantic models present + refresh succeeds against DR Gold tables | BI Lead |
| 5 | DNS / Front Door routing configured — Azure Front Door or Traffic Manager profile with primary + secondary backends; health probes target Fabric API per region | `az network front-door show --name fd-fabric` lists both backends | Network Lead |
| 6 | Key Vault replicated — Workspace Identity / SP credentials available in DR-region Key Vault; CMK keys present | `az keyvault key list --vault-name kv-fabric-dr-westus2` matches primary keys | Security Lead |
| 7 | ADLS Gen2 landing zones GRS — all upstream landing storage accounts use Standard_GRS or GZRS | `az storage account show --query sku.name` returns `Standard_GRS` or `Standard_GZRS` | Data Engineering |
| 8 | Eventstream/Event Hub DR namespaces — standby Event Hub namespace pre-provisioned in DR region; producer SDK has both endpoints in config | DR namespace exists; producer config has `EH_PRIMARY` + `EH_SECONDARY` | Streaming Lead |
| 9 | Break-glass account — emergency Global Admin in DR region with FIDO2 hardware key; credentials in physical safe (dual-control) | Quarterly access review confirms credentials sealed | Security Lead |
| 10 | Runbook tested in last 90 days — quarterly DR drill executed against DR region with documented RTO achieved | Last `dr-drill-{YYYY-MM-DD}.md` postmortem ≤ 90 days old | Incident Commander |

Recovery Objectives (Documented Targets)

| Objective | Target | Source |
|-----------|--------|--------|
| RTO (Recovery Time Objective) | 1 hour end-to-end (detection → traffic on DR) | BCDR Best Practices Tier 1 |
| RPO (Recovery Point Objective) | 15 minutes for Delta tables (OneLake GRS); 1 minute for streaming (Event Hub replay) | Same |
| Failback RTO | 2 hours (after primary stable for 24 hr) | Casino/Federal BCDR |
| Drill cadence | Quarterly tabletop + annual full failover | NIST SP 800-34 / FedRAMP CP-4 |
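The millisecond threshold used in Pre-Req #2 and the minute-based RPO above are the same number in different units. A trivial sketch of the conversion (illustrative helper, not part of any tooling):

```python
# Delta-table RPO: 15 minutes, i.e. the 900,000 ms threshold from Pre-Req #2
RPO_DELTA_MS = 15 * 60 * 1000

def within_rpo(lag_ms: float, rpo_ms: float = RPO_DELTA_MS) -> bool:
    """True when the observed GeoReplicationLag (ms) is inside the RPO budget."""
    return lag_ms < rpo_ms
```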

Pre-Flight Checklist (run within 5 min of decision to failover)

  • DR capacity status verified (Paused and resumable)
  • OneLake replication lag < 15 min (within RPO)
  • DR workspace last-deployed timestamp < 24 hr ago
  • Front Door health probes confirm DR region is healthy
  • Incident channel open and IC assigned
  • Stakeholder notification draft ready
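For DR drills, the six checklist items above can be collapsed into a single go/no-go helper. A minimal sketch — field names are hypothetical, not from any Fabric SDK:

```python
from dataclasses import dataclass

@dataclass
class PreFlight:
    """One boolean/number per pre-flight checklist item above."""
    dr_capacity_resumable: bool
    replication_lag_min: float      # OneLake GRS lag, minutes
    hours_since_dr_deploy: float
    dr_probes_healthy: bool
    ic_assigned: bool
    comms_draft_ready: bool

    def failures(self) -> list:
        """Return the checklist items that block a failover."""
        out = []
        if not self.dr_capacity_resumable:
            out.append("DR capacity not resumable")
        if self.replication_lag_min >= 15:        # RPO target: 15 min
            out.append("replication lag exceeds RPO")
        if self.hours_since_dr_deploy >= 24:
            out.append("DR deployment stale (>24 hr)")
        if not self.dr_probes_healthy:
            out.append("Front Door probes report DR unhealthy")
        if not self.ic_assigned:
            out.append("no Incident Commander assigned")
        if not self.comms_draft_ready:
            out.append("stakeholder notification draft missing")
        return out
```

An empty `failures()` list means the 5-minute pre-flight gate is clear.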

Symptoms

| Indicator | Where to Check | Interpretation |
|-----------|----------------|----------------|
| Azure Service Health regional outage notification | Azure Portal → Service Health → Service issues | Confirmed regional incident — proceed to severity classification |
| Fabric capacity unavailable in primary region | `az fabric capacity show` returns 503/timeout | Capacity-level outage; may be regional or capacity-specific |
| Mass query failures from one region only | Workspace Monitoring KQL: `FabricCapacityMetrics \| where Region == "eastus2"` shows 0 events | Regional data plane issue |
| OneLake reads failing tenant-wide | Notebook reads return `StorageException`; `mssparkutils.fs.ls("Files/")` fails | Storage layer regional outage |
| Power BI reports stale or 5xx | https://app.powerbi.com returns 503 for tenant; refresh failures across all datasets | Front-end or capacity outage |
| Event Hub ingestion drops to zero | Eventstream input metric flatlines; producer SDK errors | Region or Event Hub namespace outage |
| Cross-region GRS replication lag spike (> 1 hr) | `AzureStorageMetrics \| where MetricName == "GeoReplicationLag"` | Regional storage degradation precursor |
| Microsoft Fabric Status Page red | https://support.fabric.microsoft.com/support/ | Microsoft has confirmed the incident |

Important: A single workspace failure or a single capacity throttling event is not a multi-region failover scenario. Use capacity-throttling-response.md or auth-failure-playbook.md for those.


Severity Classification

| Condition | Severity | Why |
|-----------|----------|-----|
| Regional outage confirmed by Azure Service Health, ETA > RTO target (1 hr), customer-facing impact | SEV1 | Multi-tenant blast radius, RTO breach imminent |
| Capacity unreachable region-wide, ETA unknown | SEV1 | Treat as regional until proven otherwise |
| Single workspace down, other workspaces in same region healthy | SEV2 | Workspace-scoped — use pipeline/auth runbooks |
| GRS replication lag > 1 hr (no outage yet) | SEV2 | RPO at risk; investigate but do not fail over |
| DR drill / planned failover | SEV4 | Scheduled change; follow change management |

Rule of thumb: Regional outage with customer impact = SEV1. Page VP Eng + Incident Commander immediately per incident-response-template.md.


Decision Matrix: Failover vs Wait

The cost of an unnecessary failover is real: ~2 hr of engineering time, possible data divergence, a customer-visible cutover, and a failback effort. Do not fail over reflexively.

flowchart TD
    Start([Regional Outage Suspected]) --> Q1{Azure Service Health<br/>confirms regional issue?}
    Q1 -->|No| Wait1[Wait + monitor — likely<br/>workspace/capacity issue]
    Q1 -->|Yes| Q2{Microsoft ETA<br/>to recover?}
    Q2 -->|< 30 min| Wait2[Wait — recovery likely<br/>faster than failover]
    Q2 -->|30-60 min| Q3{Customer SLA<br/>at risk?}
    Q2 -->|> 60 min OR unknown| Q4
    Q3 -->|No| Wait2
    Q3 -->|Yes| Q4{GRS replication<br/>lag within RPO<br/>< 15 min?}
    Q4 -->|No| Q5{Data residency<br/>allows secondary<br/>region?}
    Q4 -->|Yes| Q5
    Q5 -->|No| Hold[HOLD — escalate to<br/>compliance officer]
    Q5 -->|Yes| Q6{DR capacity<br/>verified healthy?}
    Q6 -->|No| Hold2[HOLD — fix DR<br/>before failing over]
    Q6 -->|Yes| Failover[EXECUTE FAILOVER]

    style Failover fill:#ea4335,color:#fff
    style Hold fill:#fbbc04,color:#000
    style Hold2 fill:#fbbc04,color:#000
    style Wait1 fill:#34a853,color:#fff
    style Wait2 fill:#34a853,color:#fff

Decision Criteria (must answer YES to all before failover)

| # | Criterion | If NO |
|---|-----------|-------|
| 1 | Azure Service Health confirms regional incident | Wait — likely capacity/workspace issue |
| 2 | Estimated outage duration > RTO target (1 hr) OR unknown | Wait — recovery faster than failover |
| 3 | OneLake GRS replication lag is within RPO (< 15 min) | Hold — escalate to compliance for data-loss exception approval |
| 4 | Data residency / sovereignty rules allow secondary region (US-East ↔ US-West OK; commercial → gov NOT OK) | Hold — compliance officer must approve |
| 5 | DR capacity is provisioned and resumable (Pre-Req #1) | Hold — re-provision DR before failing over |
| 6 | Incident Commander has authorized the failover | Wait — get IC sign-off (SEV1 requires VP Eng concurrence) |

When to ALWAYS Wait

  • Outage ETA < 30 min
  • Single workspace impacted (use pipeline-failure runbook instead)
  • Replication lag exceeds RPO and data loss not acceptable
  • DR region itself is degraded
  • Compliance constraint not yet cleared
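The decision matrix above can be encoded directly. This sketch walks the six criteria in flowchart order (failing criteria 1, 2, or 6 means wait; failing 3, 4, or 5 means hold); it is illustrative, not an official gate:

```python
def failover_decision(ash_confirmed: bool, eta_exceeds_rto: bool,
                      lag_within_rpo: bool, residency_ok: bool,
                      dr_ready: bool, ic_authorized: bool) -> str:
    """Evaluate the six failover decision criteria in flowchart order."""
    if not ash_confirmed:
        return "WAIT"             # criterion 1: likely capacity/workspace issue
    if not eta_exceeds_rto:
        return "WAIT"             # criterion 2: recovery faster than failover
    if not lag_within_rpo:
        return "HOLD"             # criterion 3: data-loss exception needed
    if not residency_ok:
        return "HOLD"             # criterion 4: compliance must approve
    if not dr_ready:
        return "HOLD"             # criterion 5: fix DR before failing over
    if not ic_authorized:
        return "WAIT"             # criterion 6: get IC sign-off
    return "EXECUTE FAILOVER"
```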

Failover Procedure

Total target: 60 minutes from decision to "DR serving traffic". Time-box each step. If a step exceeds 1.5x its budget, escalate.
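The per-step budgets in Steps 1–8 sum to the 60-minute target, and the 1.5x escalation rule is a one-liner. A hypothetical helper, shown only to make the rule unambiguous:

```python
# Time budgets (minutes) for Steps 1-8; they sum to the 60-min RTO target
STEP_BUDGET_MIN = {1: 5, 2: 5, 3: 10, 4: 10, 5: 10, 6: 5, 7: 5, 8: 10}

def should_escalate(step: int, elapsed_min: float) -> bool:
    """Escalate when a step has run past 1.5x its time budget."""
    return elapsed_min > 1.5 * STEP_BUDGET_MIN[step]
```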

Step 1 — Validate Secondary Region Health (Target: 5 min)

# 1.1 — Confirm secondary region is healthy
az rest --method get \
  --url "https://management.azure.com/subscriptions/${SUB}/providers/Microsoft.ResourceHealth/availabilityStatuses?api-version=2022-10-01&\$filter=location eq '${DR_REGION}'"

# 1.2 — Check DR capacity provisioning state
az fabric capacity show \
  --resource-group "${DR_RG}" \
  --name "fabric-cap-dr-${DR_REGION}" \
  --query "{name:name, state:properties.state, sku:sku.name}"

# 1.3 — Verify DR workspace last-deployed timestamp
python scripts/fabric-cicd-deploy.py \
  --workspace-id "${DR_WORKSPACE_ID}" \
  --verify-only

# 1.4 — Confirm OneLake DR replication lag is within RPO
# (Run in Workspace Monitoring KQL — see Quick-Reference Commands)

Pass criteria: Region healthy, DR capacity in Paused or Active state, last deployment < 24 hr, lag < 15 min.

On fail: Abort failover, escalate to Microsoft support — DR region itself is at risk.


Step 2 — Stop Writes to Primary (if reachable) (Target: 5 min)

Goal: Quiesce writes so secondary becomes the system of record cleanly. Skip if primary is fully unreachable.

# 2.1 — Pause primary capacity (if still reachable) to halt all compute
az fabric capacity suspend \
  --resource-group "${PRIMARY_RG}" \
  --name "fabric-cap-prod-${PRIMARY_REGION}"

# 2.2 — Stop upstream Event Hub producers (issue command to producer apps)
# Producer apps should switch to DR endpoint based on env flag
az appconfig kv set \
  --name "appcs-fabric" \
  --key "Streaming:ActiveRegion" \
  --value "${DR_REGION}" --yes

# 2.3 — Pause running pipelines via Fabric REST API
for ws in $(echo "$PRIMARY_WORKSPACES" | tr ',' ' '); do
  az rest --method post \
    --url "https://api.fabric.microsoft.com/v1/workspaces/${ws}/jobs/instances?jobType=Pipeline&action=cancel"
done

Pass criteria: No new writes hitting primary OneLake; pipelines cancelled.

On fail (primary unreachable): Skip — primary is already silent. Document timestamp of last known write for reconciliation.
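That last-known-write timestamp feeds reconciliation directly. A sketch (names illustrative) of the data-loss window calculation used later in the post-incident report:

```python
from datetime import datetime, timezone

def loss_window_minutes(last_primary_write: datetime,
                        latest_dr_event: datetime) -> float:
    """Minutes of writes potentially absent from DR (0 if DR has caught up)."""
    delta = (last_primary_write - latest_dr_event).total_seconds() / 60
    return max(0.0, delta)
```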


Step 3 — Promote (Resume + Scale) Secondary Capacity (Target: 10 min)

# 3.1 — Resume DR capacity
az fabric capacity resume \
  --resource-group "${DR_RG}" \
  --name "fabric-cap-dr-${DR_REGION}"

# 3.2 — Scale to production SKU (F64) if currently smaller
az rest --method patch \
  --url "https://management.azure.com/subscriptions/${SUB}/resourceGroups/${DR_RG}/providers/Microsoft.Fabric/capacities/fabric-cap-dr-${DR_REGION}?api-version=2023-11-01" \
  --body '{"sku": {"name": "F64", "tier": "Fabric"}}'

# 3.3 — Wait for capacity to be Active (poll up to 8 min)
for i in $(seq 1 16); do
  STATE=$(az fabric capacity show --resource-group "${DR_RG}" --name "fabric-cap-dr-${DR_REGION}" --query "properties.state" -o tsv)
  echo "Attempt ${i}: ${STATE}"
  if [ "${STATE}" = "Active" ]; then break; fi
  sleep 30
done

Pass criteria: Capacity state = Active, SKU = F64.

On fail: Engage Microsoft support (severity A — production down). See Escalation.


Step 4 — Switch OneLake Shortcuts to Secondary Paths (Target: 10 min)

# 4.1 — Run in DR workspace notebook (or via REST API script)
# Update all shortcuts in DR lakehouses to point at the DR-region ADLS / paired OneLake

import requests

DR_WORKSPACE_ID = "${DR_WORKSPACE_ID}"
DR_LAKEHOUSE_ID = "${DR_LAKEHOUSE_ID}"
DR_ADLS_ACCOUNT = "stfabriclz-dr"  # GRS-paired storage in DR region
HEADERS = {"Authorization": f"Bearer {TOKEN}"}  # TOKEN: Fabric API bearer token, acquired beforehand

BASE = (f"https://api.fabric.microsoft.com/v1/workspaces/{DR_WORKSPACE_ID}"
        f"/items/{DR_LAKEHOUSE_ID}/shortcuts")

# List existing shortcuts
shortcuts = requests.get(BASE, headers=HEADERS).json()

# Re-create each shortcut with the DR target (delete first — creating a
# shortcut whose name already exists fails)
for sc in shortcuts["value"]:
    new_target = sc["target"]["adlsGen2"]
    new_target["location"] = f"https://{DR_ADLS_ACCOUNT}.dfs.core.windows.net"
    requests.delete(f"{BASE}/{sc['path']}/{sc['name']}", headers=HEADERS)
    requests.post(
        BASE,
        headers=HEADERS,
        json={"name": sc["name"], "path": sc["path"], "target": {"adlsGen2": new_target}},
    )
# 4.2 — If using ADLS storage account failover (single-account pattern)
az storage account failover \
  --name "stfabriclz" \
  --resource-group "${PRIMARY_RG}" \
  --no-wait

# 4.3 — Validate sample read from DR lakehouse
# (Run in DR workspace notebook)
# spark.read.format("delta").load("Tables/bronze_slot_telemetry").limit(10).show()

Pass criteria: Sample DR notebook reads return rows; row counts within RPO of last primary write.

On fail: GRS failover may still be in progress (can take 15+ min). Wait 5 min and retry; if still failing escalate.
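The wait-and-retry guidance above fits a small helper; the injectable `sleep` makes it drill-testable without actually waiting (illustrative sketch, names hypothetical):

```python
import time

def retry_until(check, attempts: int = 3, wait_s: int = 300, sleep=time.sleep):
    """Re-run `check` up to `attempts` times, waiting `wait_s` seconds
    (default 5 min, matching the guidance above) between tries.
    Returns True as soon as a try succeeds, else False."""
    for i in range(attempts):
        if check():
            return True
        if i < attempts - 1:
            sleep(wait_s)
    return False
```

Usage during the incident: `retry_until(sample_read_succeeds)` before escalating.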


Step 5 — Repoint Power BI Datasets to Secondary Semantic Models (Target: 10 min)

# 5.1 — Connect to Power BI service
Connect-PowerBIServiceAccount

# 5.2 — Rebind reports in DR workspace to DR semantic models
$drWorkspaceId = "${DR_WORKSPACE_ID}"
$reports = Get-PowerBIReport -WorkspaceId $drWorkspaceId

foreach ($r in $reports) {
    # Find the DR semantic model with same name as primary
    $drModel = Get-PowerBIDataset -WorkspaceId $drWorkspaceId | Where-Object Name -eq $r.Name
    if ($drModel) {
        Invoke-PowerBIRestMethod -Url "groups/$drWorkspaceId/reports/$($r.Id)/Rebind" `
            -Method Post -Body (@{datasetId = $drModel.Id} | ConvertTo-Json)
        Write-Host "Rebound $($r.Name) → $($drModel.Id)"
    }
}

# 5.3 — Trigger refresh of DR semantic models
Get-PowerBIDataset -WorkspaceId $drWorkspaceId | ForEach-Object {
    Invoke-PowerBIRestMethod -Url "groups/$drWorkspaceId/datasets/$($_.Id)/refreshes" -Method Post
}

Pass criteria: All critical reports list DR dataset as source; first refresh completes within 5 min.

On fail: Direct Lake reports auto-recover from Gold tables; check that DR Gold tables exist and are queryable.


Step 6 — Update DNS / Front Door Routing (Target: 5 min)

# 6.1 — Disable primary backend in Front Door
az network front-door backend-pool backend update \
  --front-door-name "fd-fabric" \
  --resource-group "${NETWORK_RG}" \
  --pool-name "fabric-backends" \
  --index 1 \
  --enabled-state Disabled

# 6.2 — Promote secondary backend to weight 100
az network front-door backend-pool backend update \
  --front-door-name "fd-fabric" \
  --resource-group "${NETWORK_RG}" \
  --pool-name "fabric-backends" \
  --index 2 \
  --weight 100 \
  --enabled-state Enabled

# 6.3 — Purge Front Door cache to flush stale primary responses
az network front-door purge-endpoint \
  --resource-group "${NETWORK_RG}" \
  --name "fd-fabric" \
  --content-paths "/*"

# 6.4 — Alternative: Traffic Manager
az network traffic-manager endpoint update \
  --resource-group "${NETWORK_RG}" \
  --profile-name "tm-fabric" \
  --name "primary" --type azureEndpoints \
  --endpoint-status Disabled

Pass criteria: dig / nslookup of customer-facing hostname resolves to DR backend within TTL window (typically < 60 sec).


Step 7 — Resume Traffic (Target: 5 min)

# 7.1 — Switch streaming producers (already prepared via App Configuration in Step 2.2)
# Producers tail App Config and reconnect within 30 sec

# 7.2 — Re-enable scheduled pipelines in DR workspace
for pipeline in $(echo "$DR_PIPELINES" | tr ',' ' '); do
  az rest --method post \
    --url "https://api.fabric.microsoft.com/v1/workspaces/${DR_WORKSPACE_ID}/items/${pipeline}/jobs/instances?jobType=Pipeline"
done

# 7.3 — Notify customers via status page that DR is serving
# (Manual: update https://status.your-org.com)

Pass criteria: Streaming ingestion rate in DR ≥ 80% of pre-incident primary rate within 5 min.
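The 80% pass criterion is a simple ratio check; a minimal sketch for a drill harness (helper name hypothetical):

```python
def ingestion_recovered(dr_events_per_min: float,
                        baseline_events_per_min: float,
                        threshold: float = 0.80) -> bool:
    """Pass criterion: DR streaming rate >= 80% of the pre-incident baseline."""
    if baseline_events_per_min <= 0:
        raise ValueError("baseline must be positive")
    return dr_events_per_min / baseline_events_per_min >= threshold
```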


Step 8 — Verify Customer Queries Succeed (Target: 10 min)

See Verification — must complete before declaring failover successful.


Verification

Do not declare the incident "MITIGATED" until every check below passes. Run all checks; document results in incident channel.

8.1 Capacity Health

# Capacity must be Active in DR
az fabric capacity show --resource-group "${DR_RG}" --name "fabric-cap-dr-${DR_REGION}" \
  --query "{name:name, state:properties.state, sku:sku.name}"
# Expected: state=Active, sku=F64

8.2 Sample Customer Queries

Run in DR workspace notebook — these are the canonical "is the platform working?" queries:

# 8.2.1 — Bronze layer freshness (RPO check)
spark.sql("""
    SELECT MAX(event_timestamp) AS latest_event,
           COUNT(*) AS row_count
    FROM lh_bronze.bronze_slot_telemetry
""").show(truncate=False)
# Expected: latest_event within 15 min of failover start

# 8.2.2 — Silver layer integrity
spark.sql("""
    SELECT COUNT(*) AS row_count,
           COUNT(DISTINCT machine_id) AS unique_machines
    FROM lh_silver.silver_slot_cleansed
""").show()

# 8.2.3 — Gold layer KPI
spark.sql("""
    SELECT * FROM lh_gold.gold_slot_performance
    WHERE business_date = current_date() - 1
    LIMIT 10
""").show()

8.3 Power BI Refresh

  • Trigger refresh on top-5 customer-facing datasets — must complete within 5 min
  • Open one report from each domain (casino, federal/USDA, federal/EPA) — visuals render
  • No 5xx errors in Power BI service logs

8.4 Streaming Ingestion

// In DR Workspace Monitoring
EventstreamMetrics
| where TimeGenerated > ago(15m)
| where WorkspaceId == "${DR_WORKSPACE_ID}"
| summarize EventsPerMin = count() by bin(TimeGenerated, 1m), EventstreamName
| render timechart

Expected: ingestion rate ≥ 80% of pre-incident baseline.

8.5 End-to-End Smoke Test

  • Submit synthetic transaction at source → confirm landing in DR Bronze within 5 min
  • Confirm Silver/Gold pipeline picks it up on next schedule
  • Confirm Power BI report shows the new row

8.6 Verification Pass Criteria

All of:

  • All KQL/SQL queries return expected row counts (within RPO tolerance)
  • Power BI refresh succeeds for all critical datasets
  • Streaming rate within 80% of baseline
  • No SEV1/SEV2 alerts firing on DR capacity
  • Customer-reported queries resolved

If any check fails, do not declare resolved — continue mitigation or escalate.


Failback Procedure

Failback is its own change. Do not fail back during the incident. Wait until the primary has been stable for 24 hr, then schedule failback as a planned change with full change-management approval.

Failback Pre-Conditions

  • Primary region healthy for 24 hours minimum
  • Microsoft confirms incident resolved
  • Reverse data-sync has caught up (DR → primary lag < 15 min)
  • Change Management ticket approved
  • Off-peak window scheduled (target: lowest-traffic hour)

Step 1 — Reverse-Sync Data from Secondary to Primary (Target: 30-120 min)

# Use Mirroring or scripted Delta sync to push DR data back to primary OneLake
# Pattern: read DR Delta tables, write as overwrite to primary

tables_to_sync = [
    "bronze_slot_telemetry",
    "silver_slot_cleansed",
    "gold_slot_performance",
    "gold_compliance_ctr",
]

for table in tables_to_sync:
    df = spark.read.format("delta").load(f"abfss://{DR_WS}@onelake.dfs.fabric.microsoft.com/Tables/{table}")
    df.write.format("delta").mode("overwrite") \
        .option("overwriteSchema", "true") \
        .save(f"abfss://{PRIMARY_WS}@onelake.dfs.fabric.microsoft.com/Tables/{table}")
    print(f"Synced {table}: {df.count()} rows")

Mirroring shortcut: If using Fabric Mirroring (see mirroring.md), the reverse direction can be configured with a new mirror profile; this avoids manual sync code.

Step 2 — Verify Integrity (Target: 30 min)

# Row count + order-independent checksum comparison
# (per-row hash summed — Spark SQL has no STRING_AGG)
for table in tables_to_sync:
    primary_cnt = spark.sql(f"SELECT COUNT(*) AS c FROM primary.{table}").collect()[0]["c"]
    dr_cnt      = spark.sql(f"SELECT COUNT(*) AS c FROM dr.{table}").collect()[0]["c"]
    chk_sql     = "SELECT SUM(crc32(to_json(struct(*)))) AS h FROM {}"
    primary_chk = spark.sql(chk_sql.format(f"primary.{table}")).collect()[0]["h"]
    dr_chk      = spark.sql(chk_sql.format(f"dr.{table}")).collect()[0]["h"]
    assert primary_cnt == dr_cnt, f"Row count mismatch on {table}: P={primary_cnt} D={dr_cnt}"
    assert primary_chk == dr_chk, f"Checksum mismatch on {table}"
    print(f"{table} OK: {primary_cnt} rows, checksum match")

Step 3 — Quiesce Secondary Writes (Target: 5 min)

# Cancel all running pipelines in the DR workspace
az rest --method post \
  --url "https://api.fabric.microsoft.com/v1/workspaces/${DR_WORKSPACE_ID}/jobs/instances?jobType=Pipeline&action=cancel"

# Switch streaming producers back to primary endpoint
az appconfig kv set \
  --name "appcs-fabric" \
  --key "Streaming:ActiveRegion" \
  --value "${PRIMARY_REGION}" --yes

Step 4 — Switch Routing Back (Target: 5 min)

# Re-enable primary in Front Door, demote secondary to standby weight
az network front-door backend-pool backend update \
  --front-door-name "fd-fabric" --resource-group "${NETWORK_RG}" \
  --pool-name "fabric-backends" --index 1 \
  --enabled-state Enabled --weight 100

az network front-door backend-pool backend update \
  --front-door-name "fd-fabric" --resource-group "${NETWORK_RG}" \
  --pool-name "fabric-backends" --index 2 \
  --weight 0

Step 5 — Validate Primary (Target: 30 min)

Run full Verification checklist against primary workspace.

Step 6 — Re-Pause DR Capacity (Target: 5 min)

# Cost optimization — return DR to warm-standby state
az fabric capacity suspend \
  --resource-group "${DR_RG}" \
  --name "fabric-cap-dr-${DR_REGION}"

Post-Incident Actions

| Action | Owner | Due |
|--------|-------|-----|
| Data reconciliation report (rows ingested DR vs. expected primary) | Data Engineering Lead | Within 24 hr |
| Customer comms — "DR Activation Notice" with impact summary | Communications Lead | Within 24 hr |
| Compliance notifications (NIGC/FinCEN for casino; OMB/CISA for federal) | Incident Commander + Compliance Officer | Per regulatory deadline |
| Gap analysis — what worked, what didn't, missing automation | Incident Commander | Within 48 hr |
| Postmortem published | IC + Tech Lead | Within 48 hr (SEV1) |
| Action items entered into Archon | IC | Within 5 business days |
| Update this runbook with lessons learned | Platform Lead | Within 5 business days |
| Schedule next DR drill (if last drill > 60 days ago) | Platform Lead | Within 30 days |

Use the Blameless Postmortem Template.


Escalation

Microsoft Support (Capacity / Service Issues)

| Issue | Severity | Path |
|-------|----------|------|
| DR capacity won't resume | Severity A | Azure Portal → Help + support → New support request → Production system down |
| GRS replication lag > 1 hr | Severity B | Same path; reference incident ID |
| Fabric service-side errors during failover | Severity A | Same path; cite Service Health incident ID |
| Power BI tenant-wide outage | Severity A | Power Platform support |

Microsoft Premier/Unified support phone: (use your contracted support number — store in incident channel pin)

Internal Escalation Path

On-Call Engineer ──(5 min)──▶ Platform Lead ──(15 min)──▶ Incident Commander
Incident Commander ──(30 min for SEV1)──▶ VP Engineering ──(45 min)──▶ CTO/CDO

Executive Communications

For SEV1 sustained > 1 hour, the IC must brief executives:

| Recipient | Trigger | Channel |
|-----------|---------|---------|
| VP Engineering | Failover decision made | Phone + Teams |
| CTO / CDO | SEV1 sustained > 1 hr | Email briefing every 60 min |
| CFO | Customer SLA breach likely | Email from VP Eng |
| Compliance Officer | Any SOX/HIPAA/FedRAMP impact | Phone + email immediately |
| Legal | Regulatory notification required | Email |
| Customer Success | Customer-visible impact | Slack + email |

Use the Stakeholder Update Template.


Communication Tree Reference

See the canonical Communication Tree in the anchor runbook. Specific multi-region failover additions:

| Audience | When | Owner |
|----------|------|-------|
| Casino Gaming Commission | Within 24 hr (regulatory requirement) | Compliance Officer |
| FinCEN | If CTR/SAR filing window missed | Compliance Officer |
| Federal OCIO (per agency) | Per agency-specific timeline (USDA: 2 hr, SBA: 1 hr, NOAA: immediate, EPA: 4 hr, DOI: 4 hr) | Agency liaison |
| OMB / CISA | If FedRAMP system, per ATO terms | COOP Coordinator |
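The agency deadlines above differ enough that computing them explicitly during an incident helps. A sketch — window values are copied from the table, the helper name is hypothetical:

```python
from datetime import datetime, timedelta, timezone

# Notification windows in hours, per the table above (0 = immediate)
AGENCY_WINDOW_HR = {"USDA": 2, "SBA": 1, "NOAA": 0, "EPA": 4, "DOI": 4}

def notify_by(incident_start: datetime, agency: str) -> datetime:
    """Latest permissible notification time for the given federal agency."""
    return incident_start + timedelta(hours=AGENCY_WINDOW_HR[agency])
```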

Quick-Reference Commands

Capacity Status Across Regions (Azure CLI)

# List all Fabric capacities and their state
az resource list --resource-type "Microsoft.Fabric/capacities" \
  --query "[].{name:name, region:location, state:properties.state, sku:sku.name}" \
  --output table

# Get specific capacity health
az fabric capacity show \
  --resource-group "${RG}" \
  --name "${CAPACITY_NAME}" \
  --query "{state:properties.state, sku:sku.name, region:location}"

# Resume / Suspend
az fabric capacity resume  --resource-group "${RG}" --name "${CAPACITY_NAME}"
az fabric capacity suspend --resource-group "${RG}" --name "${CAPACITY_NAME}"

# Scale SKU (mid-incident upsize)
az rest --method patch \
  --url "https://management.azure.com/subscriptions/${SUB}/resourceGroups/${RG}/providers/Microsoft.Fabric/capacities/${CAPACITY_NAME}?api-version=2023-11-01" \
  --body '{"sku": {"name": "F64", "tier": "Fabric"}}'

Front Door / Traffic Manager Updates

# Front Door — disable primary backend
az network front-door backend-pool backend update \
  --front-door-name "fd-fabric" --resource-group "${NETWORK_RG}" \
  --pool-name "fabric-backends" --index 1 --enabled-state Disabled

# Front Door — purge cache after failover
az network front-door purge-endpoint \
  --resource-group "${NETWORK_RG}" --name "fd-fabric" --content-paths "/*"

# Traffic Manager — disable primary endpoint
az network traffic-manager endpoint update \
  --resource-group "${NETWORK_RG}" --profile-name "tm-fabric" \
  --name "primary" --type azureEndpoints --endpoint-status Disabled

# DNS — direct CNAME flip (if not using Front Door)
az network dns record-set cname set-record \
  --resource-group "${DNS_RG}" --zone-name "your-org.com" \
  --record-set-name "fabric" --cname "fabric-dr.westus2.cloudapp.azure.com"

Power BI Dataset Rebinding (PowerShell)

Connect-PowerBIServiceAccount

# Rebind a single report to a new dataset
$workspaceId = "${DR_WORKSPACE_ID}"
$reportId    = "${REPORT_ID}"
$newDsId     = "${DR_DATASET_ID}"

Invoke-PowerBIRestMethod -Url "groups/$workspaceId/reports/$reportId/Rebind" `
    -Method Post -Body (@{datasetId = $newDsId} | ConvertTo-Json)

# Refresh all datasets in DR workspace
Get-PowerBIDataset -WorkspaceId $workspaceId | ForEach-Object {
    Invoke-PowerBIRestMethod -Url "groups/$workspaceId/datasets/$($_.Id)/refreshes" -Method Post
    Write-Host "Refresh started: $($_.Name)"
}

# Update gateway / data source connection (if not Direct Lake)
Invoke-PowerBIRestMethod -Url "gateways/$gatewayId/datasources/$dsId" `
    -Method Patch -Body (@{
        connectionDetails = '{"server":"fabric-dr.westus2","database":"lh_gold"}'
    } | ConvertTo-Json)

KQL — Cross-Region Telemetry

// Replication lag across regions
AzureStorageMetrics
| where TimeGenerated > ago(1h)
| where MetricName == "GeoReplicationLag"
| summarize AvgLagSec = avg(Value)/1000, MaxLagSec = max(Value)/1000
    by bin(TimeGenerated, 5m), Resource
| render timechart

// Capacity utilization by region
FabricCapacityMetrics
| where TimeGenerated > ago(1h)
| extend Region = tostring(split(ResourceId, "/")[4])
| summarize AvgCU = avg(CUSeconds), MaxCU = max(CUSeconds)
    by bin(TimeGenerated, 5m), Region, CapacityName
| render timechart

// Cross-region pipeline failure correlation
FabricPipelineRuns
| where TimeGenerated > ago(2h)
| where Status == "Failed"
| extend Region = tostring(split(WorkspaceId, "-")[2])
| summarize Failures = count() by bin(TimeGenerated, 5m), Region
| render columnchart

// DR readiness — last successful deployment per workspace
FabricDeploymentLogs
| where TimeGenerated > ago(7d)
| where Status == "Succeeded"
| summarize LastDeploy = max(TimeGenerated) by WorkspaceId, Environment
| extend HoursSinceDeploy = datetime_diff("hour", now(), LastDeploy)
| extend ReadyForFailover = iff(HoursSinceDeploy < 24, "✅", "⚠️ Stale")

Bicep Reference (DR Capacity)

The DR capacity uses the same module as production — see infra/modules/fabric/fabric-capacity.bicep. Set skuName: 'F32' and add tags: { Environment: 'DR', AutoPause: 'true' } to keep cost low until activation.


Diagrams

Failover Decision Tree

flowchart TD
    Detect([Outage Detected]) --> SH{Azure Service Health<br/>regional incident?}
    SH -->|No| Other[Use other runbook<br/>capacity / pipeline / auth]
    SH -->|Yes| ETA{ETA > 60 min<br/>or unknown?}
    ETA -->|No| Wait[Monitor — recovery faster than failover]
    ETA -->|Yes| RPO{GRS lag<br/>< 15 min?}
    RPO -->|No| Comp{Compliance OK<br/>with data loss?}
    RPO -->|Yes| Res{Data residency<br/>allows secondary?}
    Comp -->|No| Hold[HOLD — escalate]
    Comp -->|Yes| Res
    Res -->|No| Hold
    Res -->|Yes| DR{DR capacity<br/>healthy?}
    DR -->|No| Fix[Fix DR — abort failover]
    DR -->|Yes| IC{IC + VP Eng<br/>authorize?}
    IC -->|No| Wait2[Wait for authorization]
    IC -->|Yes| Exec([EXECUTE FAILOVER<br/>Steps 1-8])

    style Exec fill:#ea4335,color:#fff
    style Hold fill:#fbbc04,color:#000
    style Fix fill:#fbbc04,color:#000
    style Wait fill:#34a853,color:#fff
    style Wait2 fill:#34a853,color:#fff

Failover Sequence Diagram

sequenceDiagram
    participant ASH as Azure Service Health
    participant IC as Incident Commander
    participant Eng as On-Call Engineer
    participant Pri as Primary Region (eastus2)
    participant DR as DR Region (westus2)
    participant FD as Front Door
    participant Cust as Customers

    ASH->>Eng: Regional outage alert
    Eng->>IC: Page SEV1 (Pre-Flight Checklist)
    IC->>IC: Decision matrix → AUTHORIZE FAILOVER

    Note over Eng,DR: Step 1 — Validate DR (5 min)
    Eng->>DR: Check capacity state, replication lag
    DR-->>Eng: Healthy, lag 8 min

    Note over Eng,Pri: Step 2 — Stop primary writes (5 min)
    Eng->>Pri: Suspend capacity, cancel pipelines
    Pri-->>Eng: Quiesced

    Note over Eng,DR: Step 3 — Promote DR (10 min)
    Eng->>DR: Resume + scale to F64
    DR-->>Eng: state=Active

    Note over Eng,DR: Step 4 — Switch shortcuts (10 min)
    Eng->>DR: Repoint shortcuts to DR ADLS
    DR-->>Eng: Sample reads succeed

    Note over Eng,DR: Step 5 — Power BI rebind (10 min)
    Eng->>DR: Rebind reports + trigger refresh
    DR-->>Eng: Refresh succeeded

    Note over Eng,FD: Step 6 — Routing (5 min)
    Eng->>FD: Disable primary, promote DR
    FD-->>Cust: Traffic now lands in DR

    Note over Eng,DR: Step 7 — Resume traffic (5 min)
    Eng->>DR: Re-enable pipelines, switch producers
    DR-->>Cust: Streaming + queries serving

    Note over Eng,DR: Step 8 — Verify (10 min)
    Eng->>DR: Run verification suite
    DR-->>Eng: All checks pass
    Eng->>IC: Failover complete — RTO 58 min
    IC->>Cust: Status page: MITIGATED

Related Runbooks

| Runbook | When to Use |
|---------|-------------|
| Incident Response Template | Anchor — every incident starts here |
| Capacity Throttling Response | Single-capacity throttling (not regional) |
| Pipeline Failure Triage | Pipeline-scoped failure inside one region |
| Auth Failure Playbook | Workspace Identity / SP issues |
| Tenant Migration (Dev/Staging/Prod) | Bad-deployment rollback (not regional outage) |
| Data Quality Incident | GE failure, schema breach |

Related Best-Practice Docs

| Document | Description |
|----------|-------------|
| Disaster Recovery & BCDR | RTO/RPO targets, BCDR architecture, FedRAMP CP controls |
| Multi-Tenant Workspace Architecture | Workspace pairing patterns for primary/DR |
| Network Security | Front Door, private endpoints, DR network topology |
| Customer-Managed Keys | CMK key replication for DR region |
| Fabric CI/CD Deployment | Git-based deployment to keep DR workspaces current |
| Capacity Planning & Cost Optimization | DR capacity cost trade-offs (paused vs active) |
| Mirroring | Cross-region replication for warehouses & SQL DBs |
