

🚨 Incident Response Template

Last Updated: 2026-04-27 | Phase: 14 (Wave 1) | Anchor Runbook
Audience: On-call engineers, incident commanders, SRE teams
Purpose: Reusable template for any Fabric production incident - clone this file and fill in the brackets


📑 Table of Contents

  1. How to Use This Template
  2. Severity Matrix
  3. Incident Response Lifecycle
  4. Phase 1 - Detect & Triage
  5. Phase 2 - Mitigate
  6. Phase 3 - Resolve
  7. Phase 4 - Post-Incident Review
  8. Communication Tree
  9. Incident Channel Conventions
  10. Blameless Postmortem Template
  11. Quick-Reference Commands
  12. Related Runbooks

How to Use This Template

This is the master template every Fabric incident runbook should mirror. When responding to an incident:

  1. Open this document while paging in. The structure tells you what to do next.
  2. Open the specific runbook for the failure mode (e.g., Capacity Throttling, Pipeline Failure, Auth Failure).
  3. Use the Communication Tree to page the right people.
  4. Fill in the Postmortem Template within 48 hours of resolution.

Authoring new runbooks: Copy the section structure from this file. Every runbook MUST have: Symptoms, Severity Classification, Diagnostic Steps, Resolution Procedures, Verification, Rollback, Post-Incident Actions, Escalation Path.


Severity Matrix

Rule of thumb: Pick the highest severity any condition matches. When in doubt, classify higher and de-escalate later.

| Severity | Definition | Examples (Fabric) | Response SLA | Resolution SLA | Escalation |
|----------|------------|-------------------|--------------|----------------|------------|
| SEV1 - Critical | Production outage. Revenue or compliance impact. Multi-tenant blast radius. | Capacity unavailable region-wide; mass auth failure; data corruption in Gold; SOX/HIPAA-impacting Bronze loss | 5 min page | 2 hr | Immediate: VP Eng + Incident Commander |
| SEV2 - High | Major feature degraded. Single workspace down. Customer-visible. | Capacity throttling >90% sustained; primary pipeline failed >2 hr; Power BI dataset refresh failure for prod report | 15 min page | 4 hr | Within 30 min: Platform Lead |
| SEV3 - Medium | Degraded but workable. Workaround exists. | Single non-critical pipeline failed; intermittent slow queries; non-prod dataset stale | 2 hr ack | 24 hr | Within 4 hr: Team Lead |
| SEV4 - Low | Minor / cosmetic. No customer impact. | Dev workspace warning; documentation gap; non-blocking GE warning | 24 hr ack | 5 business days | Ticket queue |

Severity Classification Decision Tree

flowchart TD
    Start([Incident Detected]) --> Q1{Production data<br/>or compliance<br/>impact?}
    Q1 -->|Yes| Q2{Multi-workspace<br/>or region-wide?}
    Q2 -->|Yes| SEV1[SEV1 - page on-call now]
    Q2 -->|No| SEV2[SEV2 - page within 15 min]
    Q1 -->|No| Q3{Customer-visible<br/>degradation?}
    Q3 -->|Yes| SEV2
    Q3 -->|No| Q4{Workaround<br/>exists?}
    Q4 -->|Yes| SEV3[SEV3 - ack within 2 hr]
    Q4 -->|No| SEV2
    SEV1 --> SEV1Action[Page VP Eng + Incident Commander]
    SEV2 --> SEV2Action[Page Platform Lead]
    SEV3 --> SEV3Action[Notify Team Lead]

Incident Response Lifecycle

┌────────────┐   ┌────────────┐   ┌────────────┐   ┌────────────┐
│  DETECT    │──▶│  MITIGATE  │──▶│  RESOLVE   │──▶│    PIR     │
│  + TRIAGE  │   │            │   │            │   │            │
│  0–15 min  │   │ 15–60 min  │   │  Variable  │   │ ≤48 hr post│
└────────────┘   └────────────┘   └────────────┘   └────────────┘
      │                │                │                │
      ▼                ▼                ▼                ▼
  Page on-call    Stop bleeding    Root cause        Postmortem
  Classify SEV    Stabilize        Permanent fix     Action items
  Open channel    Rollback if      Verify recovery   Process update
                  needed

Phase 1 - Detect & Triage (0–15 min)

1.1 Acknowledge the Page

# In your paging tool (PagerDuty, Opsgenie, Azure Action Group), acknowledge within 5 min for SEV1/SEV2.
# Acknowledgment stops escalation but does NOT mean resolved.
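If PagerDuty is the paging tool, acknowledgment can also be scripted. A minimal sketch using the PagerDuty REST API; PD_TOKEN, INCIDENT_ID, and ONCALL_EMAIL are illustrative placeholders, so adapt to your own paging setup:

# Acknowledge a PagerDuty incident from the CLI (illustrative placeholders)
curl -s -X PUT "https://api.pagerduty.com/incidents/${INCIDENT_ID}" \
  -H "Authorization: Token token=${PD_TOKEN}" \
  -H "From: ${ONCALL_EMAIL}" \
  -H "Content-Type: application/json" \
  -d '{"incident": {"type": "incident_reference", "status": "acknowledged"}}'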

1.2 Open the Incident Channel

| Tool | Channel Convention | Naming |
|------|--------------------|--------|
| Microsoft Teams | #incident- channel in Engineering | #incident-2026-04-27-capacity-sev1 |
| Slack (if used) | #inc- channel | #inc-2026-04-27-capacity-sev1 |
| Phone bridge (SEV1 only) | Always-on bridge URL (from runbook header) | |
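Channel creation can be scripted as well. A sketch using az rest against Microsoft Graph; TEAM_ID and the channel name are illustrative, and the caller needs permission to create channels in the Engineering team:

# Create the incident channel in Teams via Microsoft Graph (sketch)
az rest --method post \
  --url "https://graph.microsoft.com/v1.0/teams/${TEAM_ID}/channels" \
  --body '{"displayName": "incident-2026-04-27-capacity-sev1", "membershipType": "standard"}'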

1.3 Assign Incident Roles

| Role | Responsibility | Who |
|------|----------------|-----|
| Incident Commander (IC) | Owns the incident end-to-end. Makes decisions. Coordinates roles. | First responder for SEV3/SEV4; designated IC for SEV1/SEV2 |
| Communications Lead | Posts updates every 15 min (SEV1/SEV2) or 60 min (SEV3). Manages stakeholder comms. | Platform Lead or designate |
| Technical Lead | Drives diagnosis and remediation. | Subject-matter on-call |
| Scribe | Captures timeline in real time (timestamp, action, outcome). Critical for postmortem. | Anyone on the bridge |

1.4 Classify Severity

Use the Severity Matrix. Document classification in the channel pin.

1.5 Initial Triage Checklist

  • Confirm the alert is real (not a false positive - check dashboard)
  • Identify scope: which workspace(s)? which capacity? which item type?
  • Check Microsoft Fabric Status Page for known issues
  • Check Azure Service Health for region-level issues
  • Identify customer impact: who sees what's broken?
  • Snapshot diagnostics (capacity metrics, error logs, query plans) - these become postmortem evidence (see the sketch after this list)
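One way to snapshot capacity metrics for the evidence folder, assuming Workspace Monitoring data lands in a Log Analytics workspace (LOG_WS and the evidence/ path are illustrative; the query mirrors the Quick-Reference Commands section):

# Capture top CU consumers as postmortem evidence (illustrative)
mkdir -p evidence
az monitor log-analytics query \
  --workspace "${LOG_WS}" \
  --analytics-query "FabricCapacityMetrics | where TimeGenerated > ago(1h) | summarize TotalCU = sum(CUSeconds) by ItemName, ItemType | top 10 by TotalCU desc" \
  --output json > "evidence/$(date -u +%Y%m%dT%H%M%SZ)-capacity-snapshot.json"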

Phase 2 - Mitigate (15–60 min)

Goal: Stop the bleeding. Restore service even if the root cause is unknown; the permanent fix comes later.

2.1 Common Mitigation Patterns

| Failure Mode | Quick Mitigation | Reference Runbook |
|--------------|------------------|-------------------|
| Capacity throttled | Scale up SKU temporarily; pause non-critical workspaces on shared capacity | capacity-throttling-response.md |
| Pipeline failed | Rerun failed activity; if data corruption, roll back via Delta time-travel | pipeline-failure-triage.md |
| Auth failures | Validate Workspace Identity / Service Principal credentials; re-grant scopes | auth-failure-playbook.md |
| Region outage | Trigger geo-failover; redirect traffic to secondary | multi-region-failover.md |
| Data quality breach | Quarantine affected partition; halt downstream consumers | data-quality-incident.md |
| Bad deployment | Roll back via Deployment Pipelines; revert Git commit | tenant-migration-dev-staging-prod.md |

2.2 Mitigation Decision Framework

   Is service restored by mitigation?
            │
     ┌──────┴──────┐
     │             │
    YES           NO
     │             │
     ▼             ▼
  Move to      Escalate one
  Phase 3      severity level
                + add resources

2.3 Communication During Mitigation

Every 15 min (SEV1/SEV2), post in the incident channel:

[STATUS UPDATE - {HH:MM}]
Severity: {SEV1/SEV2}
Status: {INVESTIGATING / MITIGATING / MONITORING}
Impact: {what's broken, who's affected}
Actions taken: {bullet list of last 15 min}
Next steps: {what we're trying next}
ETA to next update: {15 min}
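Updates can also be posted automatically. A minimal sketch assuming the incident channel has an incoming webhook whose URL is stored in TEAMS_WEBHOOK_URL (webhook and message text are illustrative):

# Post a status update to the incident channel via an incoming webhook (sketch)
curl -s -X POST "${TEAMS_WEBHOOK_URL}" \
  -H "Content-Type: application/json" \
  -d '{"text": "[STATUS UPDATE - 14:30] SEV1 | MITIGATING | Capacity scaled to F128, error rate dropping. Next update 14:45."}'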

Phase 3 - Resolve

3.1 Identify Root Cause

Use the 5 Whys technique. Document each "why" in the scribe log.

Symptom: Power BI report stale at 9 AM
  Why? Refresh failed at 6 AM
    Why? Source query timed out
      Why? Bronze table had 10x normal row count
        Why? Upstream system retried failed batches creating duplicates
          Why? Idempotency key was not enforced ← ROOT CAUSE

3.2 Apply Permanent Fix

  • Code changes go through normal PR review (NOT direct-to-prod, even during an incident - unless SEV1 hotfix)
  • Hotfix branch convention: hotfix/incident-{YYYYMMDD}-{short-desc} (see the example after this list)
  • All hotfixes MUST have a follow-up PR to merge the fix into main + dev environments
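For example, a hotfix branch for a capacity incident might be cut like this (branch name and base branch are illustrative):

# Cut a hotfix branch following the naming convention, then push for review
git checkout main && git pull
git checkout -b hotfix/incident-20260427-capacity-scale
# ...commit the fix...
git push -u origin hotfix/incident-20260427-capacity-scale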

3.3 Verify Resolution

  • Original alert clears
  • Capacity metrics back in green band (<70% sustained)
  • Affected pipelines complete successfully end-to-end (see the sketch after this list)
  • Sample customer queries return expected data
  • No related downstream alerts firing
  • Monitor for 2x the incident duration before declaring resolved (e.g., for a 2 hr incident, watch for 4 hr)
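To check that the affected pipeline completed end-to-end, recent job instances can be listed through the Fabric REST API. A sketch; WS_ID and PIPELINE_ID are placeholders, and the endpoint shape should be verified against the current Fabric REST documentation:

# List recent job instances for the affected pipeline (verify endpoint against Fabric REST docs)
az rest --method get \
  --url "https://api.fabric.microsoft.com/v1/workspaces/${WS_ID}/items/${PIPELINE_ID}/jobs/instances"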

3.4 Resolve in Tooling

1. Update paging tool: status = RESOLVED, add resolution note (see the sketch after this list)
2. Close incident channel topic (keep channel open for postmortem discussion)
3. Communicate resolution to stakeholders
4. Schedule postmortem within 48 hours
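If the page was acknowledged via the PagerDuty API earlier, the same endpoint can mark it resolved with a note. A sketch with the same illustrative placeholders:

# Resolve the PagerDuty incident and attach a resolution note (illustrative)
curl -s -X PUT "https://api.pagerduty.com/incidents/${INCIDENT_ID}" \
  -H "Authorization: Token token=${PD_TOKEN}" \
  -H "From: ${ONCALL_EMAIL}" \
  -H "Content-Type: application/json" \
  -d '{"incident": {"type": "incident_reference", "status": "resolved", "resolution": "Scaled capacity to F128; root-cause fix tracked in follow-up PR"}}'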

Phase 4 - Post-Incident Review (PIR)

Schedule within 48 hours of resolution for SEV1/SEV2; within 5 business days for SEV3.

4.1 PIR Meeting

  • Facilitator: Incident Commander
  • Required attendees: All who responded + service owner + product/customer rep
  • Recommended: 60 min for SEV1/SEV2, 30 min for SEV3
  • Format: blameless - focus on systems and process, not individuals

4.2 Postmortem Doc

Use the Blameless Postmortem Template below. Publish in docs/postmortems/{YYYY-MM-DD}-{slug}.md.

4.3 Action Items

Every action item MUST have:

  • Owner (named individual, not team)
  • Due date
  • Tracking ticket (link to Archon task or GitHub issue) - see the sketch after this list
  • Severity (P0 for "won't survive recurrence" → P3 for nice-to-have)
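If GitHub issues are the tracker, the ticket can be opened straight from the CLI. A sketch with an illustrative repo, labels, and assignee:

# Open a tracking issue for a postmortem action item (illustrative labels/assignee)
gh issue create \
  --title "PIR AI-1: Enforce idempotency key on upstream batch loader" \
  --label "postmortem,P1" \
  --assignee "owner-username" \
  --body "From INC-20260427 postmortem. Owner: @owner-username. Due: 2026-05-15."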


Communication Tree

Internal Escalation (Engineering)

On-Call Engineer  ─────15 min──▶  Platform Lead
       │                                │
   30 min for SEV1                  30 min for SEV1
       ▼                                ▼
Incident Commander  ◀──────────  VP Engineering
       │
   45 min for SEV1
       ▼
   CTO / CDO

External / Stakeholder Communication

| Audience | When | Channel | Owner |
|----------|------|---------|-------|
| Affected workspace owners | SEV1/SEV2 within 30 min | Email + Teams DM | Comms Lead |
| All Fabric users (tenant) | SEV1 only, within 1 hr | Tenant-wide banner + email | Platform Lead |
| Compliance Officer | Any incident with PII/SOX/HIPAA touch | Phone + email | Incident Commander |
| Legal | If SLA breach or contractual notification required | Email | VP Eng |
| Executive team | SEV1 sustained >1 hr or any data loss | Briefing email | VP Eng → CTO |
| Customers (external) | If customer SLA at risk | Status page + email | Product + Comms |

Stakeholder Update Template

Subject: [INC-{YYYYMMDD}-{slug}] {SEV1/2/3} - {one-line summary}

Status: {INVESTIGATING / MITIGATING / RESOLVED}
Started: {HH:MM UTC}
{If resolved} Resolved: {HH:MM UTC}
Duration: {N min}

Impact:
- {what was broken}
- {who was affected}

Current state:
- {what's working}
- {what's still degraded if any}

Next update: {HH:MM UTC} OR upon resolution

Incident channel: {link}
Incident Commander: {name}

Incident Channel Conventions

Channel Naming

#incident-{YYYY-MM-DD}-{slug}-{sevN}

Examples:
  #incident-2026-04-27-capacity-throttle-sev1
  #incident-2026-04-27-pipeline-bronze-fail-sev2
  #incident-2026-04-27-power-bi-refresh-sev3

Pinned Messages (set immediately on creation)

  1. Severity + Status - SEV1 | INVESTIGATING
  2. Impact statement - one sentence, customer-facing
  3. Incident Commander - @username
  4. Bridge link - Teams call URL (SEV1/SEV2 only)
  5. Tracking ticket - Archon or GitHub issue
  6. Status page entry - if external comms

Closing the Channel

  • Keep channel open for 7 days post-resolution for postmortem discussion
  • After 7 days, archive with naming archive/incident-...
  • Postmortem doc must link to (archived) channel for evidence chain

Blameless Postmortem Template

Copy this section into docs/postmortems/{YYYY-MM-DD}-{slug}.md after every SEV1/SEV2.

# Postmortem: {Short Title}

**Incident ID:** INC-{YYYYMMDD}-{slug}
**Severity:** {SEV1/SEV2/SEV3}
**Date:** {YYYY-MM-DD}
**Duration:** {detection → resolution} = {N min/hr}
**Authors:** {IC name + technical lead}
**Status:** DRAFT / REVIEWED / PUBLISHED

## Summary

{2–3 sentences: what happened, who was impacted, how long, root cause in one phrase.}

## Impact

- **Customer impact:** {users affected, revenue impact, data loss/corruption}
- **Internal impact:** {engineering hours, opportunity cost}
- **Compliance impact:** {SOX, HIPAA, GDPR, PCI implications - explicit YES/NO each}

## Timeline

| Time (UTC) | Event |
|------------|-------|
| {HH:MM} | First alert fired ({alert name}) |
| {HH:MM} | On-call paged |
| {HH:MM} | IC assigned, channel opened |
| {HH:MM} | Severity classified as {SEV} |
| {HH:MM} | First mitigation attempted: {action} → {result} |
| {HH:MM} | Root cause identified |
| {HH:MM} | Permanent fix deployed |
| {HH:MM} | Verification complete, incident resolved |

## Root Cause

{Plain-English explanation - what failed and why. Use 5 Whys to dig past surface symptoms.}

### Contributing Factors

- {what made detection slow / mitigation hard / impact larger}
- {gaps in alerting, monitoring, runbooks, training}

## What Went Well

- {detection was fast because...}
- {communication was clear because...}
- {mitigation worked because...}

## What Went Wrong

> **Blameless rule:** describe systems and decisions, not individuals.

- {alert took N min to fire because threshold was wrong}
- {runbook was missing for this failure mode}
- {mitigation step in runbook didn't account for {edge case}}

## Action Items

| ID | Action | Owner | Due | Priority | Ticket |
|----|--------|-------|-----|----------|--------|
| AI-1 | {Specific, measurable action} | @owner | {date} | P0/P1/P2 | {link} |
| AI-2 | ... | ... | ... | ... | ... |

**Action item rules:**
- Every action item must have a named owner (not "the team")
- Every action item must have a due date
- Every action item must have a tracking link
- Recurring postmortems (same root cause class twice) auto-promote action items to P0

## Lessons Learned

{What does this incident teach us about the system, our process, our assumptions? 1–3 paragraphs.}

## Detection & Monitoring Improvements

{Specifically: how would we detect this faster next time?}

## Process Improvements

{Specifically: how would we respond better next time?}

## Appendix

- Incident channel: {archived link}
- Diagnostic snapshots: {paths to logs, screenshots, query results}
- Hotfix PR: {link}
- Follow-up PR: {link}

Quick-Reference Commands

Capacity Health (Azure CLI)

# Get capacity status
az rest --method get \
  --url "https://api.fabric.microsoft.com/v1/capacities/${CAPACITY_ID}"

# Scale capacity (mitigation)
az rest --method patch \
  --url "https://management.azure.com/subscriptions/${SUB}/resourceGroups/${RG}/providers/Microsoft.Fabric/capacities/${NAME}?api-version=2023-11-01" \
  --body '{"sku": {"name": "F128", "tier": "Fabric"}}'

# List Data Pipeline items in the workspace (use the KQL below for run history)
az rest --method get \
  --url "https://api.fabric.microsoft.com/v1/workspaces/${WS_ID}/items?type=DataPipeline"

Capacity Metrics (KQL β€” Workspace Monitoring)

// Top 10 CU consumers in last hour
FabricCapacityMetrics
| where TimeGenerated > ago(1h)
| summarize TotalCU = sum(CUSeconds) by ItemName, ItemType
| top 10 by TotalCU desc

Pipeline Failure Lookup

// All pipeline failures last 24h
FabricPipelineRuns
| where TimeGenerated > ago(24h)
| where Status == "Failed"
| project TimeGenerated, PipelineName, ActivityName, ErrorMessage
| order by TimeGenerated desc

Eventhouse Ingestion Failures

.show ingestion failures
| where FailedOn > ago(1h)
| summarize Failures = count(), Sample = any(Details) by Table, FailureKind
| order by Failures desc

Delta Time-Travel Recovery

# Inspect table history to find a known-good version of the corrupted Gold table
spark.sql("DESCRIBE HISTORY gold.fact_daily_revenue").show(20, False)

# Restore to specific version
spark.sql("RESTORE TABLE gold.fact_daily_revenue TO VERSION AS OF 145")

# Or to specific timestamp
spark.sql("RESTORE TABLE gold.fact_daily_revenue TO TIMESTAMP AS OF '2026-04-27 06:00:00'")

Related Runbooks

| Runbook | When to Use |
|---------|-------------|
| Capacity Throttling Response | CU > 90%, throttling, query queueing |
| Pipeline Failure Triage | Data Pipeline activity failed, retry exhausted |
| Auth Failure Playbook | Workspace Identity, SP, MI auth failures |
| Multi-Region Failover | Region outage, geo-failover required |
| Tenant Migration (Dev/Staging/Prod) | Bad deployment rollback, environment promotion |
| Data Quality Incident | GE failure, schema breach, downstream consumer impact |

| Document | Description |
|----------|-------------|
| SLO/SLI for Fabric | Service-level objectives that trigger paging |
| On-Call Rotation Handbook | Rotation, handoff, paging integration |
| Change Management | RFC, freeze windows, rollback policy |
| Observability Stack | Log Analytics + Workspace Monitoring + Action Groups |
| Error Handling & Monitoring | Pipeline error architecture |
| Alerting & Data Activator | Alert wiring patterns |

⬆️ Back to Top | 📚 Runbooks Index | 🏠 Home