🚨 Incident Response Template¶
Last Updated: 2026-04-27 | Phase: 14 (Wave 1) | Anchor Runbook
Audience: On-call engineers, incident commanders, SRE teams
Purpose: Reusable template for any Fabric production incident — clone this file and fill in the brackets
📋 Table of Contents¶
- How to Use This Template
- Severity Matrix
- Incident Response Lifecycle
- Phase 1 — Detect & Triage
- Phase 2 — Mitigate
- Phase 3 — Resolve
- Phase 4 — Post-Incident Review
- Communication Tree
- Incident Channel Conventions
- Blameless Postmortem Template
- Quick-Reference Commands
- Related Runbooks
How to Use This Template¶
This is the master template every Fabric incident runbook should mirror. When responding to an incident:
- Open this document as soon as you are paged. The structure tells you what to do next.
- Open the specific runbook for the failure mode (e.g., Capacity Throttling, Pipeline Failure, Auth Failure).
- Use the Communication Tree to page the right people.
- Fill in the Postmortem Template within 48 hours of resolution.
Authoring new runbooks: Copy the section structure from this file. Every runbook MUST have: Symptoms, Severity Classification, Diagnostic Steps, Resolution Procedures, Verification, Rollback, Post-Incident Actions, Escalation Path.
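If you want CI to enforce this structure, here is a minimal sketch (a hypothetical helper, not existing tooling) that flags a runbook missing any of the required headings:

```python
from pathlib import Path

# Required sections listed above; a runbook missing any of these fails the check.
REQUIRED_SECTIONS = [
    "Symptoms", "Severity Classification", "Diagnostic Steps",
    "Resolution Procedures", "Verification", "Rollback",
    "Post-Incident Actions", "Escalation Path",
]

def missing_sections(runbook_path: str) -> list[str]:
    """Return the required section names that never appear as a Markdown heading."""
    text = Path(runbook_path).read_text(encoding="utf-8")
    headings = [line.lstrip("#").strip() for line in text.splitlines() if line.startswith("#")]
    return [s for s in REQUIRED_SECTIONS if not any(s in h for h in headings)]

# Example: fail CI if a new runbook is missing sections.
# missing = missing_sections("docs/runbooks/capacity-throttling-response.md")
# assert not missing, f"Runbook missing sections: {missing}"
```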
Severity Matrix¶
Rule of thumb: Pick the highest severity any condition matches. When in doubt, classify higher and de-escalate later.
| Severity | Definition | Examples (Fabric) | Response SLA | Resolution SLA | Escalation |
|---|---|---|---|---|---|
| SEV1 — Critical | Production outage. Revenue or compliance impact. Multi-tenant blast radius. | Capacity unavailable region-wide; mass auth failure; data corruption in Gold; SOX/HIPAA-impacting Bronze loss | 5 min page | 2 hr | Immediate: VP Eng + Incident Commander |
| SEV2 — High | Major feature degraded. Single workspace down. Customer-visible. | Capacity throttling >90% sustained; primary pipeline failed >2 hr; Power BI dataset refresh failure for prod report | 15 min page | 4 hr | Within 30 min: Platform Lead |
| SEV3 — Medium | Degraded but workable. Workaround exists. | Single non-critical pipeline failed; intermittent slow queries; non-prod dataset stale | 2 hr ack | 24 hr | Within 4 hr: Team Lead |
| SEV4 — Low | Minor / cosmetic. No customer impact. | Dev workspace warning; documentation gap; non-blocking GE warning | 24 hr ack | 5 business days | Ticket queue |
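If severity classification is wired into paging or ticketing automation, a minimal sketch of the logic is below; it mirrors the decision tree in the next subsection, and the function and parameter names are purely illustrative:

```python
def classify_severity(prod_or_compliance_impact: bool,
                      multi_workspace_or_region_wide: bool,
                      customer_visible: bool,
                      workaround_exists: bool) -> int:
    """Return a severity level (1-3) following the decision tree below.

    Note: the tree never yields SEV4; minor/cosmetic issues go straight to
    the ticket queue. When in doubt, classify higher and de-escalate later.
    """
    if prod_or_compliance_impact:
        return 1 if multi_workspace_or_region_wide else 2
    if customer_visible:
        return 2
    return 3 if workaround_exists else 2

# Example: sustained throttling visible to customers in a single prod workspace
# classify_severity(True, False, True, False) -> 2
```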
Severity Classification Decision Tree¶
flowchart TD
Start([Incident Detected]) --> Q1{Production data<br/>or compliance<br/>impact?}
Q1 -->|Yes| Q2{Multi-workspace<br/>or region-wide?}
Q2 -->|Yes| SEV1[SEV1 — page on-call now]
Q2 -->|No| SEV2[SEV2 — page within 15 min]
Q1 -->|No| Q3{Customer-visible<br/>degradation?}
Q3 -->|Yes| SEV2
Q3 -->|No| Q4{Workaround<br/>exists?}
Q4 -->|Yes| SEV3[SEV3 — ack within 2 hr]
Q4 -->|No| SEV2
SEV1 --> SEV1Action[Page VP Eng + Incident Commander]
SEV2 --> SEV2Action[Page Platform Lead]
SEV3 --> SEV3Action[Notify Team Lead]
Incident Response Lifecycle¶
┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   DETECT     │───▶│   MITIGATE   │───▶│   RESOLVE    │───▶│     PIR      │
│  + TRIAGE    │    │              │    │              │    │              │
│   0–15 min   │    │  15–60 min   │    │   Variable   │    │ ≤48 hr post  │
└──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘
       │                   │                   │                   │
       ▼                   ▼                   ▼                   ▼
 Page on-call       Stop bleeding       Root cause          Postmortem
 Classify SEV       Stabilize           Permanent fix       Action items
 Open channel       Rollback if         Verify recovery     Process update
                    needed
Phase 1 — Detect & Triage (0–15 min)¶
1.1 Acknowledge the Page¶
# In your paging tool (PagerDuty, Opsgenie, Azure Action Group), acknowledge within 5 min for SEV1/SEV2.
# Acknowledgment stops escalation but does NOT mean resolved.
1.2 Open the Incident Channel¶
| Tool | Channel Convention | Naming |
|---|---|---|
| Microsoft Teams | #incident- channel in Engineering | #incident-2026-04-27-capacity-sev1 |
| Slack (if used) | #inc- channel | #inc-2026-04-27-capacity-sev1 |
| Phone bridge (SEV1 only) | Always-on bridge URL | (from runbook header) |
1.3 Assign Incident Roles¶
| Role | Responsibility | Who |
|---|---|---|
| Incident Commander (IC) | Owns the incident end-to-end. Makes decisions. Coordinates roles. | First responder for SEV3/4; designated IC for SEV1/2 |
| Communications Lead | Posts updates every 15 min (SEV1/2) or 60 min (SEV3). Manages stakeholder comms. | Platform Lead or designate |
| Technical Lead | Drives diagnosis and remediation. | Subject-matter on-call |
| Scribe | Captures timeline in real-time (timestamp, action, outcome). Critical for postmortem. | Anyone on the bridge |
1.4 Classify Severity¶
Use the Severity Matrix. Document classification in the channel pin.
1.5 Initial Triage Checklist¶
- Confirm the alert is real (not a false positive — check dashboard)
- Identify scope: which workspace(s)? which capacity? which item type?
- Check Microsoft Fabric Status Page for known issues
- Check Azure Service Health for region-level issues
- Identify customer impact: who sees what's broken?
- Snapshot diagnostics (capacity metrics, error logs, query plans) — these become postmortem evidence
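For the final checklist item, one option is to script the snapshot. The sketch below uses the azure-monitor-query SDK against the Log Analytics workspace behind Workspace Monitoring; the FabricCapacityMetrics query matches the Quick-Reference section, while the environment variable and output path are assumptions:

```python
import csv
import os
from datetime import datetime, timedelta, timezone
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient, LogsQueryStatus

# Assumption: the Log Analytics workspace backing Workspace Monitoring.
LOG_ANALYTICS_WORKSPACE_ID = os.environ["LOG_ANALYTICS_WORKSPACE_ID"]

QUERY = """
FabricCapacityMetrics
| where TimeGenerated > ago(1h)
| summarize TotalCU = sum(CUSeconds) by ItemName, ItemType
| top 10 by TotalCU desc
"""

def snapshot_top_cu_consumers(out_dir: str = "incident-evidence") -> str:
    """Save the top CU consumers of the last hour to a timestamped CSV."""
    client = LogsQueryClient(DefaultAzureCredential())
    result = client.query_workspace(LOG_ANALYTICS_WORKSPACE_ID, QUERY,
                                    timespan=timedelta(hours=1))
    if result.status != LogsQueryStatus.SUCCESS:
        raise RuntimeError(f"Query did not fully succeed: {result.partial_error}")
    os.makedirs(out_dir, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = os.path.join(out_dir, f"top-cu-consumers-{stamp}.csv")
    table = result.tables[0]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(table.columns)   # column names
        writer.writerows(table.rows)     # query result rows
    return path
```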
Phase 2 — Mitigate (15–60 min)¶
Goal: Stop the bleeding. Restore service even if root cause unknown. Permanent fix comes later.
2.1 Common Mitigation Patterns¶
| Failure Mode | Quick Mitigation | Reference Runbook |
|---|---|---|
| Capacity throttled | Scale up SKU temporarily; pause non-critical workspaces on shared capacity | capacity-throttling-response.md |
| Pipeline failed | Rerun failed activity; if data corruption, rollback via Delta time-travel | pipeline-failure-triage.md |
| Auth failures | Validate Workspace Identity / Service Principal credentials; re-grant scopes | auth-failure-playbook.md |
| Region outage | Trigger geo-failover; redirect traffic to secondary | multi-region-failover.md |
| Data quality breach | Quarantine affected partition; halt downstream consumers | data-quality-incident.md |
| Bad deployment | Roll back via Deployment Pipelines; revert Git commit | tenant-migration-dev-staging-prod.md |
2.2 Mitigation Decision Framework¶
Is service restored by mitigation?
               │
        ┌──────┴──────┐
        │             │
       YES            NO
        │             │
        ▼             ▼
    Move to      Escalate one
    Phase 3      severity level
                 + add resources
2.3 Communication During Mitigation¶
Every 15 min (SEV1/2), post in the incident channel:
[STATUS UPDATE — {HH:MM}]
Severity: {SEV1/SEV2}
Status: {INVESTIGATING / MITIGATING / MONITORING}
Impact: {what's broken, who's affected}
Actions taken: {bullet list of last 15 min}
Next steps: {what we're trying next}
ETA to next update: {15 min}
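If the Communications Lead prefers to script these updates, a minimal sketch posting the template to a Teams incoming webhook is below; the webhook environment variable is an assumption, and the payload uses the simple text form accepted by incoming webhooks:

```python
import os
from datetime import datetime, timezone
import requests  # third-party; pip install requests

# Assumption: an incoming-webhook connector configured on the incident channel.
WEBHOOK_URL = os.environ["INCIDENT_CHANNEL_WEBHOOK"]

def post_status_update(severity: str, status: str, impact: str,
                       actions: list[str], next_steps: str,
                       eta_minutes: int = 15) -> None:
    """Post the standard status-update block to the incident channel."""
    now = datetime.now(timezone.utc).strftime("%H:%M")
    lines = [
        f"[STATUS UPDATE — {now}]",
        f"Severity: {severity}",
        f"Status: {status}",
        f"Impact: {impact}",
        "Actions taken: " + "; ".join(actions),
        f"Next steps: {next_steps}",
        f"ETA to next update: {eta_minutes} min",
    ]
    resp = requests.post(WEBHOOK_URL, json={"text": "\n".join(lines)}, timeout=10)
    resp.raise_for_status()

# Example:
# post_status_update("SEV2", "MITIGATING", "Prod sales report refresh failing",
#                    ["Scaled capacity to F128", "Paused dev workspaces"],
#                    "Rerun failed refresh")
```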
Phase 3 — Resolve¶
3.1 Identify Root Cause¶
Use the 5 Whys technique. Document each "why" in the scribe log.
Symptom: Power BI report stale at 9 AM
Why? Refresh failed at 6 AM
Why? Source query timed out
Why? Bronze table had 10x normal row count
Why? Upstream system retried failed batches creating duplicates
Why? Idempotency key was not enforced ← ROOT CAUSE
3.2 Apply Permanent Fix¶
- Code changes go through normal PR review (NOT direct-to-prod, even during an incident — unless it is a SEV1 hotfix)
- Hotfix branch convention: hotfix/incident-{YYYYMMDD}-{short-desc}
- All hotfixes MUST have a follow-up PR to merge the fix into main + dev environments
3.3 Verify Resolution¶
- Original alert clears
- Capacity metrics back in green band (<70% sustained)
- Affected pipelines complete successfully end-to-end
- Sample customer queries return expected data
- No related downstream alerts firing
- Monitor for 2× the incident duration before declaring resolved (if 2 hr incident, watch 4 hr)
3.4 Resolve in Tooling¶
1. Update paging tool: status = RESOLVED, add resolution note
2. Close incident channel topic (keep channel open for postmortem discussion)
3. Communicate resolution to stakeholders
4. Schedule postmortem within 48 hours
Phase 4 — Post-Incident Review (PIR)¶
Schedule within 48 hours of resolution for SEV1/SEV2; within 5 business days for SEV3.
4.1 PIR Meeting¶
- Facilitator: Incident Commander
- Required attendees: All who responded + service owner + product/customer rep
- Recommended: 60 min for SEV1/2, 30 min for SEV3
- Format: blameless — focus on systems and process, not individuals
4.2 Postmortem Doc¶
Use the Blameless Postmortem Template below. Publish in docs/postmortems/{YYYY-MM-DD}-{slug}.md.
4.3 Action Items¶
Every action item MUST have:
- Owner (named individual, not team)
- Due date
- Tracking ticket (link to Archon task or GitHub issue)
- Severity (P0 for "won't survive recurrence" through P3 for nice-to-have)
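A hedged sketch of how these rules could be checked automatically (the dataclass fields are illustrative, not an existing tool):

```python
from dataclasses import dataclass

@dataclass
class ActionItem:
    action: str
    owner: str        # named individual, e.g. "@jdoe" - not a team alias
    due_date: str     # ISO date
    ticket_url: str   # Archon task or GitHub issue link
    priority: str     # "P0".."P3"

def violations(item: ActionItem) -> list[str]:
    """Return rule violations for a single postmortem action item."""
    problems = []
    if not item.owner or item.owner.strip().lower() in {"team", "the team", "tbd"}:
        problems.append("owner must be a named individual")
    if not item.due_date:
        problems.append("missing due date")
    if not item.ticket_url:
        problems.append("missing tracking ticket link")
    if item.priority not in {"P0", "P1", "P2", "P3"}:
        problems.append("priority must be P0-P3")
    return problems
```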
Communication Tree¶
Internal Escalation (Engineering)¶
On-Call Engineer ─────15 min────▶ Platform Lead
        │                               │
  30 min for SEV1                30 min for SEV1
        ▼                               ▼
Incident Commander ◀───────────▶ VP Engineering
                                        │
                                 45 min for SEV1
                                        ▼
                                    CTO / CDO
External / Stakeholder Communication¶
| Audience | When | Channel | Owner |
|---|---|---|---|
| Affected workspace owners | SEV1/SEV2 within 30 min | Email + Teams DM | Comms Lead |
| All Fabric users (tenant) | SEV1 only, within 1 hr | Tenant-wide banner + email | Platform Lead |
| Compliance Officer | Any incident with PII/SOX/HIPAA touch | Phone + email | Incident Commander |
| Legal | If SLA breach or contractual notification required | | VP Eng |
| Executive team | SEV1 sustained >1 hr or any data loss | Briefing email | VP Eng → CTO |
| Customers (external) | If customer SLA at risk | Status page + email | Product + Comms |
Stakeholder Update Template¶
Subject: [INC-{YYYYMMDD}-{slug}] {SEV1/2/3} — {one-line summary}
Status: {INVESTIGATING / MITIGATING / RESOLVED}
Started: {HH:MM UTC}
{If resolved} Resolved: {HH:MM UTC}
Duration: {N min}
Impact:
- {what was broken}
- {who was affected}
Current state:
- {what's working}
- {what's still degraded if any}
Next update: {HH:MM UTC} OR upon resolution
Incident channel: {link}
Incident Commander: {name}
Incident Channel Conventions¶
Channel Naming¶
#incident-{YYYY-MM-DD}-{slug}-{sevN}
Examples:
#incident-2026-04-27-capacity-throttle-sev1
#incident-2026-04-27-pipeline-bronze-fail-sev2
#incident-2026-04-27-power-bi-refresh-sev3
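A small helper sketch that builds a compliant channel name (the function is hypothetical; slugs are lower-cased and hyphenated to match the examples above):

```python
import re
from datetime import date

def incident_channel_name(slug: str, severity: int, day: date | None = None) -> str:
    """Build a channel name of the form #incident-{YYYY-MM-DD}-{slug}-sev{N}."""
    day = day or date.today()
    clean_slug = re.sub(r"[^a-z0-9]+", "-", slug.lower()).strip("-")
    return f"#incident-{day.isoformat()}-{clean_slug}-sev{severity}"

# incident_channel_name("Capacity throttle", 1, date(2026, 4, 27))
# -> "#incident-2026-04-27-capacity-throttle-sev1"
```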
Pinned Messages (set immediately on creation)¶
- Severity + Status — SEV1 | INVESTIGATING
- Impact statement — one sentence, customer-facing
- Incident Commander — @username
- Bridge link — Teams call URL (SEV1/2 only)
- Tracking ticket — Archon or GitHub issue
- Status page entry — if external comms
Closing the Channel¶
- Keep channel open for 7 days post-resolution for postmortem discussion
- After 7 days, archive with naming archive/incident-...
- Postmortem doc must link to the (archived) channel for evidence chain
Blameless Postmortem Template¶
Copy this section into docs/postmortems/{YYYY-MM-DD}-{slug}.md after every SEV1/SEV2.
# Postmortem: {Short Title}
**Incident ID:** INC-{YYYYMMDD}-{slug}
**Severity:** {SEV1/SEV2/SEV3}
**Date:** {YYYY-MM-DD}
**Duration:** {detection → resolution} = {N min/hr}
**Authors:** {IC name + technical lead}
**Status:** DRAFT / REVIEWED / PUBLISHED
## Summary
{2–3 sentences: what happened, who was impacted, how long, root cause in one phrase.}
## Impact
- **Customer impact:** {users affected, revenue impact, data loss/corruption}
- **Internal impact:** {engineering hours, opportunity cost}
- **Compliance impact:** {SOX, HIPAA, GDPR, PCI implications — explicit YES/NO each}
## Timeline
| Time (UTC) | Event |
|------------|-------|
| {HH:MM} | First alert fired ({alert name}) |
| {HH:MM} | On-call paged |
| {HH:MM} | IC assigned, channel opened |
| {HH:MM} | Severity classified as {SEV} |
| {HH:MM} | First mitigation attempted: {action} → {result} |
| {HH:MM} | Root cause identified |
| {HH:MM} | Permanent fix deployed |
| {HH:MM} | Verification complete, incident resolved |
## Root Cause
{Plain-English explanation — what failed and why. Use 5 Whys to dig past surface symptoms.}
### Contributing Factors
- {what made detection slow / mitigation hard / impact larger}
- {gaps in alerting, monitoring, runbooks, training}
## What Went Well
- {detection was fast because...}
- {communication was clear because...}
- {mitigation worked because...}
## What Went Wrong
> **Blameless rule:** describe systems and decisions, not individuals.
- {alert took N min to fire because threshold was wrong}
- {runbook was missing for this failure mode}
- {mitigation step in runbook didn't account for {edge case}}
## Action Items
| ID | Action | Owner | Due | Priority | Ticket |
|----|--------|-------|-----|----------|--------|
| AI-1 | {Specific, measurable action} | @owner | {date} | P0/P1/P2 | {link} |
| AI-2 | ... | ... | ... | ... | ... |
**Action item rules:**
- Every action item must have a named owner (not "the team")
- Every action item must have a due date
- Every action item must have a tracking link
- Recurring postmortems (same root cause class twice) auto-promote action items to P0
## Lessons Learned
{What does this incident teach us about the system, our process, our assumptions? 1–3 paragraphs.}
## Detection & Monitoring Improvements
{Specifically: how would we detect this faster next time?}
## Process Improvements
{Specifically: how would we respond better next time?}
## Appendix
- Incident channel: {archived link}
- Diagnostic snapshots: {paths to logs, screenshots, query results}
- Hotfix PR: {link}
- Follow-up PR: {link}
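To reduce filing friction, a minimal sketch that scaffolds a new postmortem file at the conventional path; the header fields mirror the template above, and the helper itself is an assumption rather than existing tooling:

```python
from datetime import date
from pathlib import Path

HEADER = """# Postmortem: {title}

**Incident ID:** INC-{compact_date}-{slug}
**Severity:** {severity}
**Date:** {iso_date}
**Status:** DRAFT
"""

def scaffold_postmortem(title: str, slug: str, severity: str, incident_day: date,
                        root: str = "docs/postmortems") -> Path:
    """Create docs/postmortems/{YYYY-MM-DD}-{slug}.md pre-filled with the header block."""
    path = Path(root) / f"{incident_day.isoformat()}-{slug}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(HEADER.format(
        title=title,
        compact_date=incident_day.strftime("%Y%m%d"),
        slug=slug,
        severity=severity,
        iso_date=incident_day.isoformat(),
    ), encoding="utf-8")
    return path

# scaffold_postmortem("Capacity throttle outage", "capacity-throttle", "SEV1", date(2026, 4, 27))
```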
Quick-Reference Commands¶
Capacity Health (Azure CLI)¶
# Get capacity status
az rest --method get \
--url "https://api.fabric.microsoft.com/v1/capacities/${CAPACITY_ID}"
# Scale capacity (mitigation)
az rest --method patch \
--url "https://management.azure.com/subscriptions/${SUB}/resourceGroups/${RG}/providers/Microsoft.Fabric/capacities/${NAME}?api-version=2023-11-01" \
--body '{"sku": {"name": "F128", "tier": "Fabric"}}'
# List Data Pipeline items in the workspace
az rest --method get \
--url "https://api.fabric.microsoft.com/v1/workspaces/${WS_ID}/items?type=DataPipeline"
Capacity Metrics (KQL β Workspace Monitoring)¶
// Top 10 CU consumers in last hour
FabricCapacityMetrics
| where TimeGenerated > ago(1h)
| summarize TotalCU = sum(CUSeconds) by ItemName, ItemType
| top 10 by TotalCU desc
Pipeline Failure Lookup¶
// All pipeline failures last 24h
FabricPipelineRuns
| where TimeGenerated > ago(24h)
| where Status == "Failed"
| project TimeGenerated, PipelineName, ActivityName, ErrorMessage
| order by TimeGenerated desc
Eventhouse Ingestion Failures¶
.show ingestion failures
| where FailedOn > ago(1h)
| summarize Failures = count(), Sample = any(Details) by Table, FailureKind
| order by Failures desc
Delta Time-Travel Recovery¶
# Roll back a corrupted Gold table to a known-good version
spark.sql("DESCRIBE HISTORY gold.fact_daily_revenue").show(20, False)
# Restore to specific version
spark.sql("RESTORE TABLE gold.fact_daily_revenue TO VERSION AS OF 145")
# Or to specific timestamp
spark.sql("RESTORE TABLE gold.fact_daily_revenue TO TIMESTAMP AS OF '2026-04-27 06:00:00'")
Related Runbooks¶
| Runbook | When to Use |
|---|---|
| Capacity Throttling Response | CU > 90%, throttling, query queueing |
| Pipeline Failure Triage | Data Pipeline activity failed, retry exhausted |
| Auth Failure Playbook | Workspace Identity, SP, MI auth failures |
| Multi-Region Failover | Region outage, geo-failover required |
| Tenant Migration (Dev/Staging/Prod) | Bad deployment rollback, environment promotion |
| Data Quality Incident | GE failure, schema breach, downstream consumer impact |
Related Best-Practice Docs¶
| Document | Description |
|---|---|
| SLO/SLI for Fabric | Service-level objectives that trigger paging |
| On-Call Rotation Handbook | Rotation, handoff, paging integration |
| Change Management | RFC, freeze windows, rollback policy |
| Observability Stack | Log Analytics + Workspace Monitoring + Action Groups |
| Error Handling & Monitoring | Pipeline error architecture |
| Alerting & Data Activator | Alert wiring patterns |