Pipeline Failure Triage¶
Last Updated: 2026-04-27 | Phase: 14 (Wave 1) | Feature: 1.3
Audience: On-call data engineers, platform operators, pipeline owners
Purpose: Diagnose and resolve Microsoft Fabric Data Pipeline failures — failed activities, retry exhaustion, timeouts, source connectivity, sink errors
Table of Contents¶
- Symptoms
- Severity Classification
- Symptom → Cause Matrix
- Diagnostic Steps
- Resolution Procedures
- Quarantine Pattern
- Verification
- Rollback
- Post-Incident Actions
- Escalation Path
- Quick-Reference Commands
- Decision Tree
- Related Runbooks
- Related Best-Practice Docs
Symptoms¶
Trigger this runbook when any of the following are observed:
- Fabric Monitor shows pipeline status `Failed` or `PartiallyFailed` in the Monitoring hub.
- Data Activator / Action Group alert fires from `pipeline_errors` or a workspace KQL alert.
- Downstream consumer signal: Power BI report stale, Lakehouse row count flat vs. baseline, GE checkpoint missing a partition, semantic model refresh fails.
- Partial loads: Bronze partition exists but row count materially below the rolling 7-day same-hour average.
- Retry exhaustion: activity has hit `retryPolicy.count` and the pipeline run is `Failed`.
- Copy Job watermark stale (>2× schedule frequency — see copy-job-cdc.md).
- Notebook activity ends with `2147467259` / `genericUserError` in pipeline output.
Companion to incident-response-template.md — open that template first, then drill into this runbook.
Severity Classification¶
Use the master Severity Matrix. Pipeline-specific mapping:
| Pipeline Failure Scope | Severity | Response SLA | Escalation |
|---|---|---|---|
| Compliance pipeline (CTR/SAR/W-2G/HIPAA) failed in filing window | SEV1 | 5 min page | VP Eng + Compliance Officer |
| Gold-layer pipeline failed; Power BI prod report stale | SEV2 | 15 min page | Platform Lead |
| Bronze ingestion failed >2 hr; downstream Silver/Gold blocked | SEV2 | 15 min page | Platform Lead |
| Single non-critical pipeline failed; backfill window available | SEV3 | 2 hr ack | Team Lead |
| Dev/Staging pipeline failure; no prod consumer | SEV4 | 24 hr ack | Ticket queue |
Rule of thumb: if the failure prevents a regulatory filing or a contractually-bound report, classify SEV1 and page immediately. Compliance > convenience.
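The mapping table above can be expressed as a first-match-wins decision function. This is an illustrative sketch (the function and parameter names are hypothetical, not part of any Fabric API) that encodes the rule of thumb: compliance in a filing window always wins, and ambiguous prod cases default to the safer, higher severity.

```python
# Hypothetical helper mirroring the severity table above -- all names are
# illustrative. First match wins, top row (SEV1) checked first.
def classify_severity(compliance, in_filing_window, prod_consumer_stale,
                      bronze_blocked_hours, backfill_available, is_prod):
    if compliance and in_filing_window:
        return "SEV1"   # 5 min page; VP Eng + Compliance Officer
    if not is_prod:
        return "SEV4"   # dev/staging, no prod consumer -> ticket queue
    if prod_consumer_stale or bronze_blocked_hours > 2:
        return "SEV2"   # 15 min page; Platform Lead
    if backfill_available:
        return "SEV3"   # 2 hr ack; Team Lead
    return "SEV2"       # ambiguous prod failure -> default to safer class
```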
Symptom → Cause Matrix¶
Map the surface error to the most-likely classification before applying a fix. Classification drives whether you retry, fix-and-replay, or fail-forward. See error-handling-monitoring.md for taxonomy details.
| Error Code / Pattern | Activity Type | Classification | Likely Cause | First Action |
|---|---|---|---|---|
| SourceNotFound / Path does not exist | Copy / Notebook | PERMANENT | Source file/table missing or moved | Verify source system; do not retry blindly |
| InvalidSchema / Schema mismatch / AnalysisException: cannot resolve | Copy / Notebook | DATA_QUALITY | Source schema drift (new/dropped/renamed column) | Compare schemas; toggle mergeSchema if additive |
| QuotaExceeded / 429 / CapacityThrottled | Copy / Notebook | TRANSIENT | Capacity throttling, source API rate limit | Backoff & retry; see capacity-throttling-response.md |
| AuthenticationFailed / 401 / 403 / Unauthorized | Any | PERMISSION | Workspace Identity / SP / MI credential expired or scope removed | Re-grant; see auth-failure-playbook.md |
| Timeout / Operation timed out / Command timeout | Copy / Lookup / Stored Proc | TIMEOUT | Long-running query, large batch, slow source | Increase timeout, partition source query |
| NetworkUnreachable / TNS:Connect timeout / ETIMEDOUT | Copy (DB source) | TRANSIENT | VPN/ExpressRoute drop, gateway down, firewall change | Validate connectivity; check gateway health |
| SinkDataIntegrityError | Copy → Delta | DATA_QUALITY | NOT NULL/CHECK violation, type coercion failure | Inspect rejected rows; quarantine |
| ChecksumMismatch | Copy (file source) | DATA_QUALITY | Source file truncated/corrupt mid-transfer | Re-fetch source; do not commit partial |
| IdempotencyKeyConflict / MERGE found duplicates | Notebook MERGE | DATA_QUALITY | Source emitted duplicate keys in same batch | Dedupe staging frame before MERGE |
| ConcurrentAppendException | Notebook write | TRANSIENT | Two writers hit same Delta table version | Stagger schedules, retry |
| OutOfMemoryError / Job aborted due to stage failure | Notebook (Spark) | RESOURCE | Skewed join, unbounded collect, undersized pool | Repartition, broadcast small side, scale pool |
| DeadlockVictim / Lock wait timeout exceeded | Stored Proc | TRANSIENT | Source DB contention | Retry; coordinate with source DBA |
| ServiceUnavailable / 503 / 504 | Web / REST source | TRANSIENT | Upstream API outage | Backoff, then escalate to source owner |
| WatermarkStale (Copy Job) | Copy Job | PERMANENT | Previous run failed before watermark commit | Reset watermark; replay from last good run |
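For automated triage (e.g. tagging rows in `pipeline_errors`), the matrix can be distilled into a first-match-wins pattern lookup. A minimal sketch — the patterns are abridged and the helper name is illustrative; extend the table as new failure modes are added:

```python
import re

# Abridged, first-match-wins distillation of the Symptom -> Cause Matrix.
# Network patterns precede the generic Timeout rule so "TNS:Connect timeout"
# classifies as TRANSIENT, not TIMEOUT.
CLASSIFICATION_RULES = [
    (r"SourceNotFound|Path does not exist|WatermarkStale", "PERMANENT"),
    (r"NetworkUnreachable|ETIMEDOUT|TNS", "TRANSIENT"),
    (r"InvalidSchema|Schema mismatch|cannot resolve", "DATA_QUALITY"),
    (r"QuotaExceeded|429|CapacityThrottled", "TRANSIENT"),
    (r"AuthenticationFailed|401|403|Unauthorized", "PERMISSION"),
    (r"Timeout|timed out", "TIMEOUT"),
    (r"SinkDataIntegrityError|ChecksumMismatch|IdempotencyKeyConflict|duplicates", "DATA_QUALITY"),
    (r"ConcurrentAppendException|DeadlockVictim|Lock wait|ServiceUnavailable|503|504", "TRANSIENT"),
    (r"OutOfMemoryError|stage failure", "RESOURCE"),
]

def classify(message):
    for pattern, classification in CLASSIFICATION_RULES:
        if re.search(pattern, message, re.IGNORECASE):
            return classification
    return "UNKNOWN"  # unmapped -> triage manually, then extend the matrix
```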
Diagnostic Steps¶
Step 1 — Open Pipeline Run in Fabric Monitor¶
Fabric Portal → Workspace → Monitoring hub
Filter:
Item type = Data pipeline
Status = Failed (last 24h)
Workspace = ${WS_NAME}
Sort: Start time DESC
Click the failed run to open the Pipeline run details blade.
Step 2 — Identify the First Failed Activity¶
Pipelines often cascade — a downstream activity may fail because an upstream activity produced no output. Always investigate the earliest failed activity, not the last.
Pipeline run → Activity runs tab
Sort: Activity start time ASC
Find: First row with Status = Failed
Step 3 — Extract Error Details¶
Click the failed activity → Output tab → expand the error object. Capture for the incident channel: errorCode, message (first 200 chars), target (activity name), failureType (UserError = our fault, SystemError = Fabric platform fault).
{ "errorCode": "2200", "message": "ErrorCode=UserErrorInvalidColumnName,...",
"failureType": "UserError", "target": "Copy_USDA_NASS_To_Bronze" }
Step 4 — Cross-Reference Centralized Error Table¶
If the pipeline uses the Single Error Activity pattern:
SELECT TOP 20 error_id, error_timestamp, activity_name,
error_classification, severity, error_message, retry_count, correlation_id
FROM dbo.pipeline_errors
WHERE pipeline_run_id = '${PIPELINE_RUN_ID}'
ORDER BY error_timestamp ASC;
Step 5 — Capacity & Source Health¶
az rest --method get --url "https://api.fabric.microsoft.com/v1/capacities/${CAPACITY_ID}"
nc -vz ${SOURCE_HOST} ${SOURCE_PORT}
For Notebook activities, click Output → notebookutils logs or open the linked Spark application UI. If capacity is throttled, stop here — switch to capacity-throttling-response.md.
Resolution Procedures¶
R1 — Retry / Replay (Single Activity vs Full Pipeline)¶
Use when: classification is TRANSIENT, retries are not exhausted, and source is healthy.
Fabric Monitor → Failed pipeline run → Rerun ▼
- Rerun from failed activity — preferred (preserves successful upstream work)
- Rerun entire pipeline — only if upstream produced bad output
REST equivalent:
# Rerun from failed activity (Fabric REST API)
az rest --method post \
--url "https://api.fabric.microsoft.com/v1/workspaces/${WS_ID}/items/${PIPELINE_ID}/jobs/instances/${RUN_ID}?jobType=Pipeline" \
--headers "Content-Type=application/json" \
--body '{"executionData": {"runMode": "rerunFromFailed"}}'
R2 — Schema Evolution Toggles¶
Use when: classification is DATA_QUALITY and the change is additive (new columns) — not breaking (renamed/dropped/retyped).
# Notebook activity — enable additive schema merge for Delta sink
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
(df_source
.write
.format("delta")
.mode("append")
.option("mergeSchema", "true")
.saveAsTable("lh_bronze.bronze_usda_crop_production"))
Do NOT enable autoMerge for Silver/Gold layers — schema drift in business tables must go through PR review. See medallion-architecture-deep-dive.md.
For Copy Activity sinks, edit the activity → Mapping tab → Import schemas to refresh, then save.
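Before toggling anything, confirm the drift really is additive. A sketch of that check (the helper is hypothetical; it compares plain column→type dicts rather than Spark StructTypes for clarity):

```python
# Decide whether schema drift is additive-only (safe for mergeSchema) or
# breaking (dropped/retyped columns -> PR review). Inputs are plain
# {column: type-string} dicts.
def schema_drift(target_cols, source_cols):
    added   = {c: t for c, t in source_cols.items() if c not in target_cols}
    dropped = {c: t for c, t in target_cols.items() if c not in source_cols}
    retyped = {c: (target_cols[c], t) for c, t in source_cols.items()
               if c in target_cols and target_cols[c] != t}
    return {"added": added, "dropped": dropped, "retyped": retyped,
            "additive_only": bool(added) and not dropped and not retyped}
```

Against a live table, the target dict can come from `dict(spark.table(name).dtypes)` and the source dict from `dict(df_source.dtypes)`.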
R3 — Source Connectivity Validation¶
Use when: classification is NetworkUnreachable / TNS:Connect timeout.
import socket, traceback
SOURCE_HOST = "${SOURCE_HOST}"; SOURCE_PORT = ${SOURCE_PORT}
# Layer 1 — TCP reachability
try:
with socket.create_connection((SOURCE_HOST, SOURCE_PORT), timeout=10):
print(f"OK TCP {SOURCE_HOST}:{SOURCE_PORT}")
except OSError as e:
print(f"FAIL TCP: {e}"); raise
# Layer 2 — credential / driver
try:
spark.read.format("jdbc") \
.option("url", f"jdbc:sqlserver://{SOURCE_HOST}:{SOURCE_PORT};database=${DB}") \
.option("dbtable", "(SELECT 1 AS hc) h") \
.option("authentication", "ActiveDirectoryMSI").load().show()
print("OK JDBC auth")
except Exception:
traceback.print_exc(); raise
Layer 1 fails → gateway/firewall/DNS → escalate to the network team. Layer 1 passes, Layer 2 fails → see auth-failure-playbook.md.
R4 — Sink Rollback via Delta Time-Travel¶
Use when: the failed run partially wrote corrupt data to Bronze/Silver/Gold and downstream consumers must see only the last good state.
-- Step 1: identify the last good Delta version BEFORE the failed run started
DESCRIBE HISTORY lh_bronze.bronze_usda_crop_production;
-- Look for the operation row with timestamp just BEFORE the failed pipeline start
-- Step 2: restore to that version
RESTORE TABLE lh_bronze.bronze_usda_crop_production
TO VERSION AS OF 248;
-- OR by timestamp (often cleaner)
RESTORE TABLE lh_bronze.bronze_usda_crop_production
TO TIMESTAMP AS OF '2026-04-27 06:00:00';
# Or in PySpark
spark.sql("""
RESTORE TABLE lh_silver.silver_compliance_events
TO TIMESTAMP AS OF '2026-04-27 05:55:00'
""")
CRITICAL: Notify downstream consumers BEFORE running RESTORE — semantic models reading the table will see row counts shift. See Verification below.
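Picking the restore target from DESCRIBE HISTORY can be done mechanically. A sketch under the assumption that history has been reduced to (version, commit-timestamp) pairs — the helper name is illustrative:

```python
from datetime import datetime

# From (version, commit-timestamp) pairs, pick the newest Delta version
# committed strictly BEFORE the failed pipeline run started.
def last_good_version(history, failed_run_start):
    candidates = [version for version, ts in history if ts < failed_run_start]
    if not candidates:
        raise ValueError("no version predates the failed run -- check Delta retention")
    return max(candidates)
```

Feed it rows from `spark.sql("DESCRIBE HISTORY <table>").select("version", "timestamp").collect()`; the result is the number for `RESTORE TABLE ... TO VERSION AS OF`.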
R5 — Idempotency MERGE Pattern¶
Use when: classification is IdempotencyKeyConflict / duplicate keys, or replaying a partially-completed run that may double-load.
from delta.tables import DeltaTable
from pyspark.sql import functions as F
# Dedupe staging BEFORE merge — idempotency on natural key
staging = raw_df.withColumn("_ingest_ts", F.current_timestamp()) \
.dropDuplicates(["event_id"])
target = DeltaTable.forName(spark, "lh_bronze.bronze_compliance_events")
(target.alias("t").merge(staging.alias("s"), "t.event_id = s.event_id")
.whenMatchedUpdate(
condition="s._ingest_ts > t._ingest_ts",
set={c: f"s.{c}" for c in staging.columns})
.whenNotMatchedInsertAll().execute())
This pattern is safe to replay — re-running it after a failure produces the same result.
R6 — Backfill Window Calculation¶
Use when: the pipeline missed one or more scheduled runs and only the gap needs loading.
# Compute the gap from current watermark
wm = spark.sql("SELECT MAX(transaction_timestamp) AS wm FROM lh_bronze.bronze_slot_transactions") \
.collect()[0]["wm"]
# Source query bounded to the gap — avoids full re-fetch
backfill = spark.read.format("jdbc").option("url", JDBC_URL).option("query", f"""
SELECT * FROM SLOT_TRANSACTIONS
WHERE TRANSACTION_TIMESTAMP > '{wm.isoformat()}'
AND TRANSACTION_TIMESTAMP <= CURRENT_TIMESTAMP
""").load()
print(f"Backfill: {wm} โ now, rows={backfill.count()}")
For Copy Job: Monitor → Watermark State → Reset (set to last good timestamp), then trigger an immediate run. See copy-job-cdc.md.
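The ">2× schedule frequency" staleness rule from Symptoms can be sketched as a small predicate (hypothetical helper; `now` is injectable so the rule is testable):

```python
from datetime import datetime, timedelta, timezone

# A watermark older than 2x the schedule frequency suggests the previous
# run failed before committing -- the Symptoms-section trigger for this R6.
def watermark_is_stale(watermark, schedule, now=None):
    now = now or datetime.now(timezone.utc)
    return (now - watermark) > 2 * schedule
```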
Quarantine Pattern¶
When a small fraction of records fail data-quality checks, do not fail the whole pipeline — quarantine bad rows and continue. This keeps Bronze fresh while preserving evidence for triage.
from pyspark.sql import functions as F
bad = raw_df.filter(F.col("event_id").isNull() | F.col("event_type").isNull() | (F.col("amount") < 0))
good = raw_df.subtract(bad)
good.write.mode("append").format("delta").saveAsTable("lh_bronze.bronze_compliance_events")
(bad.withColumn("_quarantine_reason", F.lit("required_null_or_negative_amount"))
.withColumn("_quarantine_ts", F.current_timestamp())
.withColumn("_pipeline_run_id", F.lit("${PIPELINE_RUN_ID}"))
.write.mode("append").format("delta")
.saveAsTable("lh_bronze.dlq_bronze_compliance_events"))
ratio = bad.count() / max(raw_df.count(), 1)
if ratio > 0.05:
raise ValueError(f"DLQ ratio {ratio:.1%} > 5% — investigate source")
DLQ governance: retention 30 days (90 for compliance domains); triage daily — rows >7 days old auto-page the pipeline owner; once the root cause is fixed, MERGE good DLQ rows back into the target.
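The governance rules above reduce to two questions per DLQ table: which rows have outlived retention, and is any row old enough to page the owner. An illustrative sketch (helper and argument names are hypothetical):

```python
from datetime import datetime, timedelta, timezone

# Evaluate DLQ governance against a list of _quarantine_ts values:
# rows past retention should be purged, and any untriaged row older than
# `page_days` should page the pipeline owner.
def dlq_status(quarantine_ts_list, now=None, retention_days=30, page_days=7):
    now = now or datetime.now(timezone.utc)
    expired  = [ts for ts in quarantine_ts_list
                if now - ts > timedelta(days=retention_days)]
    page_due = any(now - ts > timedelta(days=page_days)
                   for ts in quarantine_ts_list)
    return {"expired": expired, "page_owner": page_due}
```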
Verification¶
After applying any resolution, all of the following must pass before declaring resolved.
V1 — Target Table Row Count vs Baseline¶
SELECT COUNT(*) AS rows_total, COUNT(DISTINCT event_id) AS rows_distinct,
MAX(event_ts) AS max_ts,
SUM(CASE WHEN event_ts >= '${PIPELINE_START}' THEN 1 ELSE 0 END) AS rows_this_run
FROM lh_silver.silver_compliance_events;
Expected: rows_this_run within ±10% of the rolling 7-day average for the same hour-of-day.
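The ±10% acceptance test is a one-liner worth encoding so every responder applies it the same way. A sketch (hypothetical helper; fails closed when no baseline exists):

```python
# V1 acceptance: rows_this_run within +/-10% of the rolling 7-day
# same-hour baseline. No baseline -> fail closed and verify manually.
def within_baseline(rows_this_run, baseline, tol=0.10):
    if baseline <= 0:
        return False
    return abs(rows_this_run - baseline) / baseline <= tol
```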
V2 — Freshness Check¶
from datetime import datetime, timezone
max_ts = spark.sql("SELECT MAX(event_ts) m FROM lh_silver.silver_compliance_events").collect()[0]["m"]
age_min = (datetime.now(timezone.utc) - max_ts).total_seconds() / 60
assert age_min < 30, f"Table stale: {age_min:.1f} min"
V3 — Downstream Consumer Signal¶
- Power BI dataset refresh Succeeded (re-run if needed)
- Semantic model row count card matches expected
- GE checkpoint passes: `great_expectations checkpoint run silver_compliance_checkpoint`
- No related downstream alerts firing in the last 30 min
V4 — Watch for 2× Incident Duration¶
Per incident-response-template.md §3.3: a 2 hr incident → watch 4 hr before declaring fully resolved.
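The 2× rule gives a concrete "watch until" timestamp to put in the incident channel. A trivial sketch (hypothetical helper):

```python
from datetime import datetime

# Watch for twice the incident duration after mitigation before
# declaring fully resolved (incident-response-template.md section 3.3).
def watch_until(incident_start, mitigated_at):
    return mitigated_at + 2 * (mitigated_at - incident_start)
```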
Rollback¶
If the resolution made things worse or introduces a regression:
-- 1. Roll back target table to pre-incident state
DESCRIBE HISTORY lh_silver.silver_compliance_events;
RESTORE TABLE lh_silver.silver_compliance_events TO VERSION AS OF ${LAST_GOOD_VERSION};
-- 2. Snapshot the bad version into an audit table for forensics
CREATE TABLE lh_audit.audit_silver_compliance_events_${YYYYMMDD_HHMM} USING DELTA
AS SELECT * FROM lh_silver.silver_compliance_events VERSION AS OF ${BAD_VERSION};
Pipeline-definition rollback (revert via fabric-cicd):
git revert ${BAD_COMMIT_SHA} && git push origin hotfix/incident-${YYYYMMDD}-pipeline-revert
python scripts/fabric-cicd-deploy.py --workspace-id ${WS_ID} \
--branch hotfix/incident-${YYYYMMDD}-pipeline-revert
Mandatory rollback notifications: post in #incident-${date}-...-sevN; email pipeline-owners with restored version; update Status Page if customer-visible.
Post-Incident Actions¶
Within 48 hours of resolution (per PIR §4):
- Root-cause categorization — tag in `pipeline_errors.error_classification` so the trend lands in the Error Trend KQL.
- GE rule additions — add a Great Expectations expectation for the failure mode (e.g. `expect_column_values_to_not_be_null("event_id")`) so the next occurrence is caught upstream.
- Retry policy review — if classification was `TRANSIENT` but retries didn't recover, re-tune `retryPolicy.count` / `intervalInSeconds` / `backoffMultiplier`.
- Idempotency audit — confirm the pipeline can replay safely; retrofit the MERGE pattern (see R5) if missing.
- Add quarantine — if a data-quality failure halted the pipeline, retrofit the Quarantine Pattern.
- Write postmortem in `docs/postmortems/${YYYY-MM-DD}-pipeline-${slug}.md` using the template.
- Update this runbook if a new failure mode emerged — append to the Symptom → Cause Matrix.
Escalation Path¶
| Trigger | Escalate To | When |
|---|---|---|
| 30 min on SEV1/SEV2 with no mitigation | Platform Lead | Immediate |
| Compliance pipeline (CTR/SAR/HIPAA) fails inside filing window | Compliance Officer + VP Eng | Page on detection |
| Source system owner action required (DBA, agency liaison) | Source-system on-call | Within 15 min |
| Capacity throttling is the root cause | Switch to capacity-throttling-response.md | On classification |
| Auth/Identity failure | Switch to auth-failure-playbook.md | On classification |
| Data-quality breach with downstream blast radius | Open companion data-quality-incident.md | On classification |
| 2 hr SEV2 / 1 hr SEV1 sustained | VP Engineering | Per Communication Tree |
Quick-Reference Commands¶
Fabric REST — Pipeline Runs¶
# List pipelines in workspace
az rest --method get --url "https://api.fabric.microsoft.com/v1/workspaces/${WS_ID}/items?type=DataPipeline"
# Get a specific run's status
az rest --method get --url "https://api.fabric.microsoft.com/v1/workspaces/${WS_ID}/items/${PIPELINE_ID}/jobs/instances/${RUN_ID}"
# Cancel a hung run
az rest --method post --url "https://api.fabric.microsoft.com/v1/workspaces/${WS_ID}/items/${PIPELINE_ID}/jobs/instances/${RUN_ID}/cancel"
# Rerun from failed activity
az rest --method post --url "https://api.fabric.microsoft.com/v1/workspaces/${WS_ID}/items/${PIPELINE_ID}/jobs/instances/${RUN_ID}?jobType=Pipeline" \
--body '{"executionData": {"runMode": "rerunFromFailed"}}'
KQL — Workspace Monitoring¶
// All pipeline failures last 24h, top failing pipelines first
FabricPipelineRuns
| where TimeGenerated > ago(24h)
| where Status == "Failed"
| summarize Failures = count(),
LastError = take_any(ErrorMessage),
LastRunId = arg_max(TimeGenerated, RunId)
by PipelineName
| order by Failures desc
// Activity-level drill-down for a specific pipeline run
FabricPipelineActivityRuns
| where RunId == "${PIPELINE_RUN_ID}"
| project ActivityStart, ActivityName, ActivityType, Status, ErrorCode, ErrorMessage, DurationMs
| order by ActivityStart asc
// Error trend — bin by hour, classify by errorCode prefix
FabricPipelineActivityRuns
| where TimeGenerated > ago(7d) and Status == "Failed"
| extend Bucket = bin(TimeGenerated, 1h)
| summarize Failures = count() by Bucket, ErrorCode
| render timechart
PySpark — Delta History & Diagnostics¶
spark.sql("DESCRIBE HISTORY lh_bronze.bronze_slot_transactions").show(20, False)
# Row count delta between two versions
def cnt(v): return spark.sql(f"SELECT COUNT(*) c FROM lh_bronze.bronze_slot_transactions VERSION AS OF {v}").collect()[0].c
print(f"Delta v247โv248: {cnt(248) - cnt(247)}")
spark.sql("RESTORE TABLE lh_bronze.bronze_slot_transactions TO VERSION AS OF 247")
Centralized Error Table¶
SELECT error_id, activity_name, error_classification, severity,
LEFT(error_message, 200) AS msg, retry_count
FROM dbo.pipeline_errors
WHERE pipeline_run_id = '${PIPELINE_RUN_ID}' AND is_resolved = 0
ORDER BY error_timestamp ASC;
EXEC dbo.usp_resolve_error @error_id=${ERROR_ID}, @resolved_by='${YOUR_UPN}',
@resolution_notes='Schema drift — mergeSchema enabled; source notified';
Decision Tree¶
flowchart TD
Start([Pipeline Failed Alert]) --> Q1{Capacity<br/>throttled?}
Q1 -->|Yes| Cap[capacity-throttling-response.md]
Q1 -->|No| Q2{First failed<br/>activity type?}
Q2 -->|Copy / Lookup| Q3{Error class?}
Q2 -->|Notebook| Q4{Error class?}
Q2 -->|Stored Proc / Web| Q5{Error class?}
Q3 -->|TRANSIENT| R1[Rerun from failed activity]
Q3 -->|PERMISSION| Auth[auth-failure-playbook.md]
Q3 -->|DATA_QUALITY schema| R2[Toggle mergeSchema or fix mapping]
Q3 -->|TIMEOUT| T1[Increase timeout, partition source]
Q3 -->|PERMANENT SourceNotFound| Src[Verify source; do NOT auto-retry]
Q4 -->|RESOURCE OOM| OOM[Repartition / scale pool]
Q4 -->|DATA_QUALITY| DQ{Bad-row<br/>fraction?}
Q4 -->|IdempotencyKeyConflict| R5[Apply MERGE idempotency pattern]
Q5 -->|TRANSIENT deadlock| R1
Q5 -->|PERMANENT| Perm[Fix code/config; PR review]
DQ -->|<5%| Quarantine[Quarantine bad rows, continue]
DQ -->|>=5%| Halt[Halt pipeline, escalate]
R1 --> V[Verification โ V1/V2/V3]
R2 --> V
R5 --> V
OOM --> V
Quarantine --> V
T1 --> V
V -->|Pass| Done([Mark resolved + Postmortem])
V -->|Fail| Roll[Rollback via Delta time-travel]
Roll --> Esc[Escalate per Escalation Path]
Related Runbooks¶
| Runbook | When to Use |
|---|---|
| Incident Response Template | Always open first โ anchor structure for any incident |
| Capacity Throttling Response | Failure root-caused to CU > 90% / throttling |
| Auth Failure Playbook | 401/403/Workspace-Identity/SP/MI auth failures |
| Data Quality Incident | DLQ ratio > 5% or downstream consumer impact |
| Multi-Region Failover | Region outage caused source unreachable |
| Tenant Migration (Dev/Staging/Prod) | Bad pipeline deployment rollback |
Related Best-Practice Docs¶
| Document | Description |
|---|---|
| Error Handling & Monitoring | Pipeline error architecture, classification, taxonomy |
| Pipelines & Data Movement | Pipeline patterns, retry policies, error activity wiring |
| Copy Job CDC | Watermark management, CDC failure modes, replay |
| Alerting & Data Activator | Wiring alerts that page this runbook |
| Monitoring & Observability | KQL dashboards, FabricPipelineRuns schema |
| Medallion Architecture Deep Dive | Schema-evolution policy by layer |
| Testing Strategies | GE checkpoints, regression protection |
| Disaster Recovery (BCDR) | Delta time-travel retention, RTO/RPO targets |