# 🌐 Disaster Recovery Execution Runbook

**Last Updated:** 2026-05-05 | **Version:** 1.0
**Audience:** Platform engineers, SRE, infrastructure operations, Fabric admins
**Purpose:** Execute regional failover, verify OneLake replication, redeploy Fabric capacity in the DR region, and validate data integrity after a regional outage.
## 📑 Table of Contents
- Trigger Conditions
- Severity Classification
- Decision Flowchart
- Step-by-Step Procedure
- OneLake Replication Verification
- Capacity Redeployment
- Data Validation
- Failback Procedure
- Escalation Path
- Post-Incident Review Checklist
- Related Documents
## Trigger Conditions
Use this runbook when any of the following conditions are observed:
| # | Condition | Detection Source |
|---|---|---|
| 1 | Microsoft Fabric regional outage declared on Azure Status page | Azure Status, Fabric Status |
| 2 | Fabric capacity unavailable for > 30 min despite no throttling | Admin Portal → Capacity status shows Unavailable |
| 3 | OneLake endpoint unreachable from all clients for > 15 min | Application monitoring; OneLake SDK connection failures |
| 4 | Azure region failure affecting Fabric's home region | Azure Service Health notification |
| 5 | Planned DR drill scheduled by operations team | DR test calendar |
| 6 | Compliance-mandated failover test (annual or semi-annual) | Compliance requirements (SOX, HIPAA, FedRAMP) |
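Conditions 2 and 3 above are threshold-based and lend themselves to an automated pre-check. The sketch below is a hypothetical helper — the signal names mirror the table but are not a real Fabric API; wire them to your own monitoring:

```python
# Hypothetical decision helper mirroring the trigger-condition table above.
# Thresholds come from conditions 2 (> 30 min) and 3 (> 15 min).
def should_invoke_dr_runbook(
    azure_outage_declared: bool,
    capacity_unavailable_minutes: int,
    onelake_unreachable_minutes: int,
    planned_drill: bool,
) -> bool:
    if azure_outage_declared or planned_drill:   # conditions 1, 4, 5, 6
        return True
    if capacity_unavailable_minutes > 30:        # condition 2
        return True
    if onelake_unreachable_minutes > 15:         # condition 3
        return True
    return False

print(should_invoke_dr_runbook(False, 45, 0, False))  # capacity down 45 min → True
```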
## Severity Classification
| Severity | Condition | Response SLA |
|---|---|---|
| SEV1 | Unplanned regional outage; production Fabric workloads unavailable; compliance reporting at risk | 5 min page |
| SEV2 | Partial regional degradation; some workspaces affected; workaround possible via secondary region | 15 min page |
| DR Drill | Planned test; no production impact; executed during maintenance window | Scheduled |
## Decision Flowchart

```mermaid
flowchart TD
    A([Regional Outage<br/>or DR Drill]) --> B{Fabric capacity<br/>accessible?}
    B -->|Yes| C[Not a regional failure<br/>→ Check other runbooks]
    B -->|No| D{Azure Status confirms<br/>regional outage?}
    D -->|No| E[Wait 15 min,<br/>recheck status]
    E --> D
    D -->|Yes| F{DR region<br/>capacity exists?}
    F -->|No| G[Deploy DR capacity<br/>→ Step 7]
    F -->|Yes| H{OneLake replication<br/>current?}
    H -->|No| I[Assess data gap<br/>→ Step 6]
    H -->|Yes| J[Activate DR workspaces<br/>→ Step 8]
    G --> H
    I --> J
    J --> K[Redirect consumers<br/>→ Step 9]
    K --> L[Validate data<br/>→ Step 10]
    L --> M{Primary region<br/>recovered?}
    M -->|No| N[Continue on DR,<br/>monitor recovery]
    N --> M
    M -->|Yes| O[Begin failback<br/>→ Failback section]
```

## Step-by-Step Procedure
### Phase 1 — Detect and Declare (0–15 min)
Step 1. Confirm the regional outage by checking all three sources:

- Azure Status Page — check the Fabric service and the home region.
- Fabric Status Page — check for service advisories.
- Microsoft 365 Admin Center → Service Health → Microsoft Fabric.
Step 2. If confirmed, declare a SEV1 DR incident and page the Incident Commander, VP Engineering, and infrastructure team.
Step 3. Open a dedicated incident bridge (Teams call or war room) and assign roles:

- Incident Commander: Coordinates overall response.
- Infrastructure Lead: Handles capacity deployment.
- Data Lead: Handles replication verification and data validation.
- Communications Lead: Handles stakeholder updates.
Step 4. Notify stakeholders that DR failover is beginning. Send initial communication:
"A regional outage has been declared for [region]. DR failover is in progress. ETA for service restoration: [X hours]. Updates every 30 minutes."
### Phase 2 — Verify Replication State (15–30 min)
Step 5. Check OneLake replication status for all critical Lakehouses:

```shell
# List Lakehouse items in the DR workspace to confirm DR copies exist
az rest --method get \
  --url "https://api.fabric.microsoft.com/v1/workspaces/$drWorkspaceId/items" \
  --query "value[?type=='Lakehouse'].{name:displayName, id:id}"
```
Step 6. Assess the replication lag against the RPO (Recovery Point Objective):

```python
# Compare latest file modification times between primary and DR OneLake.
# mssparkutils is available in Fabric notebooks; modifyTime is epoch milliseconds.
primary_files = mssparkutils.fs.ls("abfss://primary-lakehouse@onelake.dfs.fabric.microsoft.com/Tables/gold/fact_daily_slot_performance/")
dr_files = mssparkutils.fs.ls("abfss://dr-lakehouse@onelake.dfs.fabric.microsoft.com/Tables/gold/fact_daily_slot_performance/")

primary_max = max(f.modifyTime for f in primary_files)
dr_max = max(f.modifyTime for f in dr_files)

# modifyTime is in milliseconds; 60,000 ms per minute
lag_minutes = (primary_max - dr_max) / 60_000
print(f"Replication lag: {lag_minutes:.0f} minutes")
```
- If lag is within RPO target (e.g., < 60 min), proceed to Step 8.
- If lag exceeds RPO, document the data gap and proceed — missing data will be backfilled after failback.
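The branch above can be captured in a small gate function. The 60-minute target is the example value from the bullet, not a universal default — substitute your documented RPO:

```python
RPO_TARGET_MINUTES = 60  # example target from above; use your documented RPO

def rpo_gate(lag_minutes: float) -> str:
    """Map measured replication lag (Step 6) to the next runbook action."""
    if lag_minutes < RPO_TARGET_MINUTES:
        return "within RPO: proceed to Step 8"
    # Lag exceeds RPO: record the gap and proceed; backfill after failback.
    return "RPO exceeded: document data gap, proceed, backfill after failback"

print(rpo_gate(42))  # within RPO: proceed to Step 8
```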
### Phase 3 — Deploy and Activate DR (30–90 min)
Step 7. If DR capacity does not already exist, deploy it in the secondary region:

```shell
# Deploy Fabric capacity in the DR region using Bicep
az deployment group create \
  --resource-group rg-fabric-dr \
  --template-file infra/modules/fabric/fabric-capacity.bicep \
  --parameters \
    capacityName=fabric-casino-dr \
    location=westus2 \
    skuName=F64 \
    adminMembers='["admin@contoso.com"]'
```
Step 8. Activate DR workspaces:

1. Assign DR workspaces to the DR Fabric capacity.
2. Verify all Lakehouse items, notebooks, pipelines, and semantic models are present.
3. Update connection strings in pipelines to point to DR data sources (if sources are also region-specific).

```shell
# Assign workspace to DR capacity
az rest --method post \
  --url "https://api.fabric.microsoft.com/v1/workspaces/$drWorkspaceId/assignToCapacity" \
  --body "{\"capacityId\": \"$drCapacityId\"}"
```
Step 9. Redirect consumers to DR:

- Update DNS or application configuration to point to DR workspace URLs.
- For Power BI reports, update the semantic model connection to the DR Lakehouse.
- For API consumers, update the Fabric REST API base URL.
- Notify consumers that they should use DR endpoints.
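A minimal sketch of the configuration flip in Step 9, assuming consumer endpoints are driven from a single config dict. The keys and workspace-ID placeholders are illustrative, not a real schema:

```python
# Hypothetical endpoint map; real values come from your app configuration.
ENDPOINTS = {
    "primary": "https://api.fabric.microsoft.com/v1/workspaces/<primary-workspace-id>",
    "dr": "https://api.fabric.microsoft.com/v1/workspaces/<dr-workspace-id>",
}

def redirect_consumers(config: dict, target: str) -> dict:
    """Return an updated config pointing all consumers at the target region."""
    if target not in ENDPOINTS:
        raise ValueError(f"unknown target: {target}")
    updated = dict(config)  # copy so the switch is atomic and revertible
    updated["fabric_base_url"] = ENDPOINTS[target]
    updated["active_region"] = target
    return updated

cfg = redirect_consumers(
    {"fabric_base_url": ENDPOINTS["primary"], "active_region": "primary"}, "dr"
)
print(cfg["active_region"])  # dr
```

Keeping the flip in one place makes the Step 15 failback a single call with `target="primary"`.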
### Phase 4 — Validate (90–180 min)
Step 10. Execute the data validation suite:

```python
# Run the Great Expectations checkpoint defined for DR validation
import subprocess

result = subprocess.run(
    ["great_expectations", "checkpoint", "run", "dr_validation_checkpoint"],
    capture_output=True, text=True,
)
print(result.stdout)
assert result.returncode == 0, f"DR validation failed: {result.stderr}"
```
Step 11. Verify critical workloads:
| Workload | Validation | Expected Result |
|---|---|---|
| Power BI exec dashboard | Load report, check data freshness | Data within RPO window |
| CTR compliance pipeline | Trigger manual run | Pipeline succeeds |
| Real-time Eventstream | Check ingestion metrics | Events flowing |
| Semantic model | Trigger refresh | Refresh succeeds |
| KQL dashboard | Run sample query | Results returned |
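Before sending the Step 12 update, the Step 11 results can be rolled up into a single pass/fail with a list of failures. The check names below are illustrative:

```python
def summarize_validation(results: dict) -> tuple:
    """Roll up Step 11 workload checks into (all_passed, failed_names)."""
    failed = [name for name, ok in results.items() if not ok]
    return (not failed, failed)

all_passed, failed = summarize_validation({
    "power_bi_exec_dashboard": True,
    "ctr_compliance_pipeline": True,
    "eventstream_ingestion": True,
    "semantic_model_refresh": True,
    "kql_dashboard_query": True,
})
print(all_passed, failed)  # True []
```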
Step 12. Send stakeholder update confirming DR is active:
"DR failover complete. Services are running from [DR region]. Data is current as of [timestamp]. RPO gap: [X minutes]. Monitoring continues."
## OneLake Replication Verification
| Check | Command / Method | Expected Result |
|---|---|---|
| Shortcut targets exist | GET /workspaces/{id}/items?type=Lakehouse | All Lakehouses present |
| Table row counts match | spark.table("dr.table").count() vs primary count | Within 1% tolerance |
| Latest partition present | ls on partition directory | Latest date partition exists |
| Delta log consistent | DESCRIBE HISTORY table | No corrupt commits |
| Schema matches | df.schema comparison | Identical schemas |
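The row-count and schema checks in the table reduce to simple comparisons once the counts and `df.schema` values are in hand. A self-contained sketch:

```python
def row_counts_match(primary: int, dr: int, tolerance: float = 0.01) -> bool:
    """Row counts within the 1% tolerance from the table above."""
    if primary == 0:
        return dr == 0
    return abs(primary - dr) / primary <= tolerance

def schemas_match(primary_schema: list, dr_schema: list) -> bool:
    """Schemas must be identical (e.g., compare df.schema.fields lists)."""
    return primary_schema == dr_schema

print(row_counts_match(1_000_000, 998_500))  # True: 0.15% difference, within 1%
```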
## Capacity Redeployment
If the DR capacity was not pre-provisioned, use the Bicep module:
```shell
az deployment group create \
  --resource-group rg-fabric-dr \
  --template-file infra/modules/fabric/fabric-capacity.bicep \
  --parameters \
    capacityName=fabric-casino-dr \
    location=westus2 \
    skuName=F64 \
    adminMembers='["admin@contoso.com"]'
```
Provisioning typically takes 5–10 minutes. Monitor via:
```shell
az resource show \
  --resource-group rg-fabric-dr \
  --name fabric-casino-dr \
  --resource-type "Microsoft.Fabric/capacities" \
  --query "{state: properties.state, sku: sku.name}"
```
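A generic polling helper can wrap the status command above (for example, invoking it via `subprocess` and checking for `state == "Active"`). The probe here is an in-memory stand-in so the sketch stays self-contained:

```python
import time

def poll_until(check, timeout_s: float = 600, interval_s: float = 15) -> bool:
    """Call check() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval_s)
    return False

# Stand-in probe simulating the capacity state transitions; replace the
# lambda with a real status check (e.g., parsing `az resource show` output).
states = iter(["Provisioning", "Provisioning", "Active"])
print(poll_until(lambda: next(states) == "Active", timeout_s=5, interval_s=0))  # True
```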
## Data Validation
Run a comprehensive validation after failover:
```python
# Row counts and data freshness for each critical table in the DR Lakehouse
tables_to_validate = [
    "bronze_slot_telemetry",
    "silver_slot_cleansed",
    "gold.fact_daily_slot_performance",
    "gold.dim_machine",
    "gold.dim_customer",
]

for table in tables_to_validate:
    dr_count = spark.table(f"dr_lakehouse.{table}").count()
    print(f"{table}: {dr_count:,} rows")

    # Check for recent data
    latest = spark.sql(f"""
        SELECT MAX(process_date) AS latest_date
        FROM dr_lakehouse.{table}
    """).collect()[0]["latest_date"]
    print(f"  Latest data: {latest}")
```
## Failback Procedure
After the primary region recovers:
Step 13. Verify the primary region is fully operational:

- Azure Status shows Resolved.
- Primary Fabric capacity shows Active state.
- OneLake endpoints are reachable.
Step 14. Sync data created during the DR period back to the primary:

- Identify new data written to DR (tables, files, notebooks modified during the outage).
- Copy DR-period data to primary OneLake using shortcuts or copy jobs.
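Identifying which tables actually accumulated DR-period data can be done by comparing per-table watermarks. The tuple shape below — (table name, primary max timestamp, DR max timestamp, both epoch ms) — is an assumption; the actual copy is still done with shortcuts or copy jobs as described in Step 14:

```python
def tables_needing_backfill(watermarks: list) -> list:
    """Return tables where DR holds newer data than primary (epoch-ms watermarks)."""
    return [name for name, primary_ts, dr_ts in watermarks if dr_ts > primary_ts]

# Example: only the fact table was written to during the outage
print(tables_needing_backfill([
    ("gold.fact_daily_slot_performance", 1_700_000_000_000, 1_700_000_900_000),
    ("gold.dim_machine", 1_700_000_000_000, 1_700_000_000_000),
]))  # ['gold.fact_daily_slot_performance']
```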
Step 15. Switch consumers back to the primary:

- Revert DNS / configuration changes from Step 9.
- Update Power BI connections back to the primary Lakehouse.
- Notify consumers of failback completion.
Step 16. Validate primary environment using the same checks from Step 10 and Step 11.
Step 17. If DR capacity was deployed on-demand, deprovision to save costs:
```shell
az resource delete \
  --resource-group rg-fabric-dr \
  --name fabric-casino-dr \
  --resource-type "Microsoft.Fabric/capacities"
```
## Escalation Path
| Time Elapsed | Action | Contact |
|---|---|---|
| 0 min | Incident Commander declared; DR failover initiated | IC + On-call SRE |
| 5 min | Page VP Engineering and CISO (data at risk) | VP Engineering + CISO |
| 15 min | Open Microsoft support case (Sev A — business-critical outage) | Microsoft Unified Support |
| 30 min | Executive communication sent to business stakeholders | Communications Lead |
| 1 hr | If DR deployment blocked, engage Azure support for capacity provisioning | Azure Support |
| 4 hr | If failover incomplete, escalate to Microsoft Fabric product team | Microsoft Fabric Engineering |
| 8 hr | If RPO breach impacts compliance, notify Legal and Compliance | Legal + Compliance Officer |
## Post-Incident Review Checklist
- Outage start and end times documented
- Time to detect (TTD) and time to failover (TTF) measured against SLA
- RPO achieved vs RPO target documented
- RTO achieved vs RTO target documented
- OneLake replication lag at time of failover recorded
- DR capacity deployment time recorded (if on-demand)
- Data validation results captured (pass/fail per table)
- Consumer redirect process evaluated — any gaps or delays?
- Failback completed successfully
- Data created during DR period synced to primary
- DR capacity deprovisioned (if on-demand)
- Communication timeline reviewed — were stakeholders informed promptly?
- Runbook accuracy reviewed — any steps to add or update?
- DR test schedule updated for next drill
- Blameless postmortem completed within 48 hours
## Related Documents
| Document | Description |
|---|---|
| Disaster Recovery & BCDR | BCDR architecture and design patterns |
| Multi-Region Failover | Detailed multi-region failover procedures |
| Monitoring & Observability | Alert setup and dashboards |
| Capacity Planning & Cost | DR capacity sizing |
| Incident Response Template | Master incident response structure |
| Capacity Throttling | When DR capacity is throttled |