🛠️ Pipeline Failure Triage

Last Updated: 2026-04-27 | Phase: 14 (Wave 1) | Feature: 1.3
Audience: On-call data engineers, platform operators, pipeline owners
Purpose: Diagnose and resolve Microsoft Fabric Data Pipeline failures — failed activities, retry exhaustion, timeouts, source connectivity, sink errors


📑 Table of Contents

  1. Symptoms
  2. Severity Classification
  3. Symptom → Cause Matrix
  4. Diagnostic Steps
  5. Resolution Procedures
  6. Quarantine Pattern
  7. Verification
  8. Rollback
  9. Post-Incident Actions
  10. Escalation Path
  11. Quick-Reference Commands
  12. Decision Tree
  13. Related Runbooks
  14. Related Best-Practice Docs

Symptoms

Trigger this runbook when any of the following are observed:

  • Fabric Monitor shows pipeline status Failed or PartiallyFailed in the Monitoring hub.
  • Data Activator / Action Group alert fires from pipeline_errors or a workspace KQL alert.
  • Downstream consumer signal: Power BI report stale, Lakehouse row count flat vs. baseline, GE checkpoint missing a partition, semantic model refresh fails.
  • Partial loads: Bronze partition exists but row count materially below the rolling 7-day same-hour average.
  • Retry exhaustion: activity has hit retryPolicy.count and pipeline run is Failed.
  • Copy Job watermark stale (>2× schedule frequency — see copy-job-cdc.md).
  • Notebook activity ends with 2147467259 / generic UserError in pipeline output.
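
The watermark-staleness trigger above is mechanical enough to sketch as a check. The function name and wiring below are illustrative, not part of the Copy Job framework:

```python
from datetime import datetime, timedelta

def watermark_stale(last_watermark: datetime, schedule: timedelta,
                    now: datetime) -> bool:
    """Copy Job symptom check: the watermark is stale once its age
    exceeds 2x the schedule frequency (the threshold used above)."""
    return (now - last_watermark) > 2 * schedule

# Hourly schedule, watermark last committed 4 h ago -> stale
print(watermark_stale(datetime(2026, 4, 27, 2, 0), timedelta(hours=1),
                      datetime(2026, 4, 27, 6, 0)))  # True
```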

Companion to incident-response-template.md — open that template first, then drill into this runbook.


Severity Classification

Use the master Severity Matrix. Pipeline-specific mapping:

| Pipeline Failure Scope | Severity | Response SLA | Escalation |
|---|---|---|---|
| Compliance pipeline (CTR/SAR/W-2G/HIPAA) failed in filing window | SEV1 | 5 min page | VP Eng + Compliance Officer |
| Gold-layer pipeline failed; Power BI prod report stale | SEV2 | 15 min page | Platform Lead |
| Bronze ingestion failed >2 hr; downstream Silver/Gold blocked | SEV2 | 15 min page | Platform Lead |
| Single non-critical pipeline failed; backfill window available | SEV3 | 2 hr ack | Team Lead |
| Dev/Staging pipeline failure; no prod consumer | SEV4 | 24 hr ack | Ticket queue |

Rule of thumb: if the failure prevents a regulatory filing or a contractually-bound report, classify SEV1 and page immediately. Compliance > convenience.


Symptom → Cause Matrix

Map the surface error to the most-likely classification before applying a fix. Classification drives whether you retry, fix-and-replay, or fail-forward. See error-handling-monitoring.md for taxonomy details.

| Error Code / Pattern | Activity Type | Classification | Likely Cause | First Action |
|---|---|---|---|---|
| SourceNotFound / Path does not exist | Copy / Notebook | PERMANENT | Source file/table missing or moved | Verify source system; do not retry blindly |
| InvalidSchema / Schema mismatch / AnalysisException: cannot resolve | Copy / Notebook | DATA_QUALITY | Source schema drift (new/dropped/renamed column) | Compare schemas; toggle mergeSchema if additive |
| QuotaExceeded / 429 / CapacityThrottled | Copy / Notebook | TRANSIENT | Capacity throttling, source API rate limit | Backoff & retry; see capacity-throttling-response.md |
| AuthenticationFailed / 401 / 403 / Unauthorized | Any | PERMISSION | Workspace Identity / SP / MI credential expired or scope removed | Re-grant; see auth-failure-playbook.md |
| Timeout / Operation timed out / Command timeout | Copy / Lookup / Stored Proc | TIMEOUT | Long-running query, large batch, slow source | Increase timeout, partition source query |
| NetworkUnreachable / TNS:Connect timeout / ETIMEDOUT | Copy (DB source) | TRANSIENT | VPN/ExpressRoute drop, gateway down, firewall change | Validate connectivity; check gateway health |
| SinkDataIntegrityError | Copy → Delta | DATA_QUALITY | NOT NULL/CHECK violation, type coercion failure | Inspect rejected rows; quarantine |
| ChecksumMismatch | Copy (file source) | DATA_QUALITY | Source file truncated/corrupt mid-transfer | Re-fetch source; do not commit partial |
| IdempotencyKeyConflict / MERGE found duplicates | Notebook MERGE | DATA_QUALITY | Source emitted duplicate keys in same batch | Dedupe staging frame before MERGE |
| ConcurrentAppendException | Notebook write | TRANSIENT | Two writers hit same Delta table version | Stagger schedules, retry |
| OutOfMemoryError / Job aborted due to stage failure | Notebook (Spark) | RESOURCE | Skewed join, unbounded collect, undersized pool | Repartition, broadcast small side, scale pool |
| DeadlockVictim / Lock wait timeout exceeded | Stored Proc | TRANSIENT | Source DB contention | Retry; coordinate with source DBA |
| ServiceUnavailable / 503 / 504 | Web / REST source | TRANSIENT | Upstream API outage | Backoff, then escalate to source owner |
| WatermarkStale (Copy Job) | Copy Job | PERMANENT | Previous run failed before watermark commit | Reset watermark; replay from last good run |
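
As a triage aid, the matrix can be encoded as an ordered list of pattern rules. The rule set below is an illustrative subset of the table, and `classify` is a sketch, not the framework's actual classifier:

```python
import re

# Illustrative subset of the matrix above; first matching rule wins.
CLASSIFICATION_RULES = [
    (r"SourceNotFound|Path does not exist", "PERMANENT"),
    (r"InvalidSchema|Schema mismatch|cannot resolve", "DATA_QUALITY"),
    (r"QuotaExceeded|429|CapacityThrottled", "TRANSIENT"),
    (r"AuthenticationFailed|401|403|Unauthorized", "PERMISSION"),
    (r"Operation timed out|Command timeout", "TIMEOUT"),
    (r"NetworkUnreachable|ETIMEDOUT|TNS:Connect", "TRANSIENT"),
    (r"OutOfMemoryError|stage failure", "RESOURCE"),
]

def classify(error_message: str) -> str:
    for pattern, cls in CLASSIFICATION_RULES:
        if re.search(pattern, error_message, re.IGNORECASE):
            return cls
    return "UNKNOWN"  # default: treat as non-retryable until triaged

print(classify("ErrorCode=QuotaExceeded, the request was throttled"))  # TRANSIENT
```

Classification drives the retry decision, so an `UNKNOWN` result should block automatic retries until a human looks.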

Diagnostic Steps

Step 1 — Open Pipeline Run in Fabric Monitor

Fabric Portal → Workspace → Monitoring hub
  Filter:
    Item type   = Data pipeline
    Status      = Failed (last 24h)
    Workspace   = ${WS_NAME}
  Sort: Start time DESC

Click the failed run to open the Pipeline run details blade.

Step 2 — Identify the First Failed Activity

Pipelines often cascade — a downstream activity may fail because an upstream activity produced no output. Always investigate the earliest failed activity, not the last.

Pipeline run → Activity runs tab
  Sort: Activity start time ASC
  Find: First row with Status = Failed

Step 3 — Extract Error Details

Click the failed activity → Output tab → expand the error object. Capture for the incident channel: errorCode, message (first 200 chars), target (activity name), failureType (UserError = our fault, SystemError = Fabric platform fault).

{ "errorCode": "2200", "message": "ErrorCode=UserErrorInvalidColumnName,...",
  "failureType": "UserError", "target": "Copy_USDA_NASS_To_Bronze" }
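
Formatting that capture consistently can be scripted. The helper below is a sketch; `incident_summary` is a hypothetical name, and only the field names shown in the JSON above are assumed:

```python
import json

def incident_summary(error_obj: dict) -> str:
    """Condense a failed activity's error output (Step 3) into a one-line
    incident-channel post. Field names follow the JSON example above."""
    who = "our fault" if error_obj.get("failureType") == "UserError" else "platform fault"
    return (f"[{error_obj.get('errorCode', '?')}] {error_obj.get('target', '?')} "
            f"({error_obj.get('failureType', '?')} = {who}): "
            f"{error_obj.get('message', '')[:200]}")  # first 200 chars only

raw = ('{"errorCode": "2200", "message": "ErrorCode=UserErrorInvalidColumnName,...",'
       ' "failureType": "UserError", "target": "Copy_USDA_NASS_To_Bronze"}')
print(incident_summary(json.loads(raw)))
```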

Step 4 — Cross-Reference Centralized Error Table

If the pipeline uses the Single Error Activity pattern:

SELECT TOP 20 error_id, error_timestamp, activity_name,
       error_classification, severity, error_message, retry_count, correlation_id
FROM dbo.pipeline_errors
WHERE pipeline_run_id = '${PIPELINE_RUN_ID}'
ORDER BY error_timestamp ASC;

Step 5 — Capacity & Source Health

az rest --method get --url "https://api.fabric.microsoft.com/v1/capacities/${CAPACITY_ID}"
nc -vz ${SOURCE_HOST} ${SOURCE_PORT}

For Notebook activities, click Output → notebookutils logs or open the linked Spark application UI. If capacity is throttled, stop here — switch to capacity-throttling-response.md.


Resolution Procedures

R1 — Retry / Replay (Single Activity vs Full Pipeline)

Use when: classification is TRANSIENT, retries are not exhausted, and source is healthy.

Fabric Monitor → Failed pipeline run → Rerun ▼
  ☑ Rerun from failed activity   ← preferred (preserves successful upstream work)
  ☐ Rerun entire pipeline        ← only if upstream produced bad output

REST equivalent:

# Rerun from failed activity (Fabric REST API)
az rest --method post \
  --url "https://api.fabric.microsoft.com/v1/workspaces/${WS_ID}/items/${PIPELINE_ID}/jobs/instances/${RUN_ID}?jobType=Pipeline" \
  --headers "Content-Type=application/json" \
  --body '{"executionData": {"runMode": "rerunFromFailed"}}'

R2 — Schema Evolution Toggles

Use when: classification is DATA_QUALITY and the change is additive (new columns) — not breaking (renamed/dropped/retyped).

# Notebook activity — enable additive schema merge for Delta sink
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

(df_source
   .write
   .format("delta")
   .mode("append")
   .option("mergeSchema", "true")
   .saveAsTable("lh_bronze.bronze_usda_crop_production"))

Do NOT enable autoMerge for Silver/Gold layers — schema drift in business tables must go through PR review. See medallion-architecture-deep-dive.md.

For Copy Activity sinks, edit the activity → Mapping tab → Import schemas to refresh, then save.
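
Deciding "additive vs breaking" can be made explicit with a schema diff. This is a sketch over plain `{column: type}` dicts; in a notebook the inputs would come from `df.schema` or DESCRIBE TABLE, and `schema_drift` is a hypothetical helper:

```python
def schema_drift(source_cols: dict, target_cols: dict) -> dict:
    """Diff source vs sink schemas ({name: type}) and report whether the
    drift is additive (safe for mergeSchema) or breaking (needs PR review)."""
    added   = {c: t for c, t in source_cols.items() if c not in target_cols}
    dropped = {c: t for c, t in target_cols.items() if c not in source_cols}
    retyped = {c: (target_cols[c], t) for c, t in source_cols.items()
               if c in target_cols and target_cols[c] != t}
    return {
        "added": added, "dropped": dropped, "retyped": retyped,
        # additive_only: new columns exist and nothing was dropped or retyped
        "additive_only": bool(added) and not dropped and not retyped,
    }

drift = schema_drift(
    {"event_id": "string", "amount": "double", "channel": "string"},  # source
    {"event_id": "string", "amount": "double"},                       # target
)
print(drift["additive_only"])  # True -> mergeSchema is safe here
```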

R3 — Source Connectivity Validation

Use when: classification is NetworkUnreachable / TNS:Connect timeout.

import socket, traceback

SOURCE_HOST = "${SOURCE_HOST}"; SOURCE_PORT = ${SOURCE_PORT}

# Layer 1 — TCP reachability
try:
    with socket.create_connection((SOURCE_HOST, SOURCE_PORT), timeout=10):
        print(f"OK  TCP {SOURCE_HOST}:{SOURCE_PORT}")
except OSError as e:
    print(f"FAIL TCP: {e}"); raise

# Layer 2 — credential / driver
try:
    spark.read.format("jdbc") \
        .option("url", f"jdbc:sqlserver://{SOURCE_HOST}:{SOURCE_PORT};database=${DB}") \
        .option("dbtable", "(SELECT 1 AS hc) h") \
        .option("authentication", "ActiveDirectoryMSI").load().show()
    print("OK  JDBC auth")
except Exception:
    traceback.print_exc(); raise

Layer 1 fails → gateway/firewall/DNS — escalate network team. Layer 1 passes, Layer 2 fails → see auth-failure-playbook.md.
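
Before the two layers above, a layer-0 name-resolution check can separate DNS problems (private DNS zone, conditional forwarder) from firewall/gateway drops. A standard-library sketch; the helper name is illustrative:

```python
import socket

def dns_resolves(host: str) -> bool:
    """Layer 0 - name resolution only. If this fails, the fault is DNS,
    not the firewall or the gateway, and the TCP test will never pass."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False

print(dns_resolves("localhost"))              # resolves via hosts file
print(dns_resolves("no-such-host.invalid"))   # reserved TLD, never resolves
```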

R4 — Sink Rollback via Delta Time-Travel

Use when: the failed run partially wrote corrupt data to Bronze/Silver/Gold and downstream consumers must see only the last good state.

-- Step 1: identify the last good Delta version BEFORE the failed run started
DESCRIBE HISTORY lh_bronze.bronze_usda_crop_production;
-- Look for the operation row with timestamp just BEFORE the failed pipeline start

-- Step 2: restore to that version
RESTORE TABLE lh_bronze.bronze_usda_crop_production
TO VERSION AS OF 248;

-- OR by timestamp (often cleaner)
RESTORE TABLE lh_bronze.bronze_usda_crop_production
TO TIMESTAMP AS OF '2026-04-27 06:00:00';

Or in PySpark:

spark.sql("""
  RESTORE TABLE lh_silver.silver_compliance_events
  TO TIMESTAMP AS OF '2026-04-27 05:55:00'
""")

CRITICAL: Notify downstream consumers BEFORE running RESTORE — semantic models reading the table will see row counts shift. See Verification below.
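
Picking the version by eyeballing DESCRIBE HISTORY is error-prone under pressure; the selection logic can be scripted. A sketch over plain dicts (in a notebook the rows would come from `spark.sql("DESCRIBE HISTORY ...").collect()`); `last_good_version` is an illustrative helper:

```python
from datetime import datetime

def last_good_version(history: list[dict], failed_run_start: datetime) -> int:
    """Return the newest Delta version committed strictly BEFORE the failed
    run began, i.e. the RESTORE target. history rows: {'version', 'timestamp'}."""
    candidates = [h["version"] for h in history if h["timestamp"] < failed_run_start]
    if not candidates:
        raise ValueError("no commit precedes the failed run; nothing to restore to")
    return max(candidates)

history = [
    {"version": 249, "timestamp": datetime(2026, 4, 27, 6, 15)},  # failed run's write
    {"version": 248, "timestamp": datetime(2026, 4, 27, 5, 55)},
    {"version": 247, "timestamp": datetime(2026, 4, 27, 4, 55)},
]
print(last_good_version(history, datetime(2026, 4, 27, 6, 0)))  # 248
```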

R5 — Idempotency MERGE Pattern

Use when: classification is IdempotencyKeyConflict / duplicate keys, or replaying a partially-completed run that may double-load.

from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Dedupe staging BEFORE merge — idempotency on natural key
staging = raw_df.withColumn("_ingest_ts", F.current_timestamp()) \
                .dropDuplicates(["event_id"])

target = DeltaTable.forName(spark, "lh_bronze.bronze_compliance_events")
(target.alias("t").merge(staging.alias("s"), "t.event_id = s.event_id")
   .whenMatchedUpdate(
       condition="s._ingest_ts > t._ingest_ts",
       set={c: f"s.{c}" for c in staging.columns})
   .whenNotMatchedInsertAll().execute())

This pattern is safe to replay — re-running it after a failure produces the same result.
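
One caveat: `dropDuplicates` keeps an arbitrary row per key. When a batch can carry several versions of the same event, keep the latest deterministically (in PySpark this is a `row_number()` window over the key ordered by the source timestamp descending). A plain-Python sketch of that logic, with illustrative field names:

```python
def keep_latest(rows: list[dict], key: str, ts: str) -> list[dict]:
    """Keep exactly one row per key: the one with the greatest ts value.
    Deterministic, unlike an unordered dropDuplicates on the key alone."""
    best: dict = {}
    for r in rows:
        k = r[key]
        if k not in best or r[ts] > best[k][ts]:
            best[k] = r
    return list(best.values())

batch = [
    {"event_id": "e1", "src_ts": "2026-04-27T05:00:00", "amount": 10},
    {"event_id": "e1", "src_ts": "2026-04-27T05:05:00", "amount": 12},  # correction
    {"event_id": "e2", "src_ts": "2026-04-27T05:01:00", "amount": 7},
]
print(keep_latest(batch, "event_id", "src_ts"))
```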

R6 — Backfill Window Calculation

Use when: the pipeline missed one or more scheduled runs and only the gap needs loading.

# Compute the gap from current watermark
wm = spark.sql("SELECT MAX(transaction_timestamp) AS wm FROM lh_bronze.bronze_slot_transactions") \
          .collect()[0]["wm"]

# Source query bounded to the gap — avoids full re-fetch
backfill = spark.read.format("jdbc").option("url", JDBC_URL).option("query", f"""
    SELECT * FROM SLOT_TRANSACTIONS
     WHERE TRANSACTION_TIMESTAMP >  '{wm.isoformat()}'
       AND TRANSACTION_TIMESTAMP <= CURRENT_TIMESTAMP
""").load()
print(f"Backfill: {wm} → now, rows={backfill.count()}")

For Copy Job: Monitor → Watermark State → Reset (set to last good timestamp), then trigger an immediate run. See copy-job-cdc.md.
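
When the gap spans many schedule periods, replaying it as one bounded query can strain the source; one option is splitting the window into individually committable chunks. A sketch with an assumed 1-hour default chunk size:

```python
from datetime import datetime, timedelta

def backfill_chunks(watermark: datetime, now: datetime,
                    chunk: timedelta = timedelta(hours=1)) -> list[tuple]:
    """Split the (watermark, now] gap into bounded windows so a large
    backfill can be replayed as small, restartable batches."""
    windows, start = [], watermark
    while start < now:
        end = min(start + chunk, now)
        windows.append((start, end))
        start = end
    return windows

wins = backfill_chunks(datetime(2026, 4, 27, 3, 0), datetime(2026, 4, 27, 6, 30))
print(len(wins), wins[0], wins[-1])
```

Each window maps to one bounded source query; a failure mid-backfill only reruns the chunk that failed.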


Quarantine Pattern

When a small fraction of records fail data-quality checks, do not fail the whole pipeline — quarantine bad rows and continue. This keeps Bronze fresh while preserving evidence for triage.

from pyspark.sql import functions as F

bad  = raw_df.filter(F.col("event_id").isNull() | F.col("event_type").isNull() | (F.col("amount") < 0))
good = raw_df.subtract(bad)

good.write.mode("append").format("delta").saveAsTable("lh_bronze.bronze_compliance_events")

(bad.withColumn("_quarantine_reason", F.lit("required_null_or_negative_amount"))
    .withColumn("_quarantine_ts",     F.current_timestamp())
    .withColumn("_pipeline_run_id",   F.lit("${PIPELINE_RUN_ID}"))
    .write.mode("append").format("delta")
    .saveAsTable("lh_bronze.dlq_bronze_compliance_events"))

ratio = bad.count() / max(raw_df.count(), 1)
if ratio > 0.05:
    raise ValueError(f"DLQ ratio {ratio:.1%} > 5% — investigate source")

DLQ governance: retention 30 days (90 for compliance domains); triage daily — rows >7 days old auto-page pipeline owner; once root cause fixed, MERGE good DLQ rows back into target.
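
The "MERGE good DLQ rows back" step needs a selection rule first: only rows whose root cause is fixed, and only within retention. A plain-Python sketch of that filter; the function name is illustrative, the `_quarantine_*` column names mirror the quarantine write above:

```python
from datetime import datetime, timedelta

def dlq_replay_candidates(dlq_rows: list[dict], fixed_reasons: set,
                          now: datetime, retention_days: int = 30) -> list[dict]:
    """Select quarantined rows eligible to MERGE back into the target:
    root cause fixed AND still inside the DLQ retention window."""
    cutoff = now - timedelta(days=retention_days)
    return [r for r in dlq_rows
            if r["_quarantine_reason"] in fixed_reasons
            and r["_quarantine_ts"] >= cutoff]

dlq = [
    {"event_id": "e1", "_quarantine_reason": "required_null_or_negative_amount",
     "_quarantine_ts": datetime(2026, 4, 26, 7, 0)},
    {"event_id": "e2", "_quarantine_reason": "bad_checksum",
     "_quarantine_ts": datetime(2026, 4, 26, 7, 0)},
]
now = datetime(2026, 4, 27, 7, 0)
# Only the reason we have actually fixed gets replayed
print(dlq_replay_candidates(dlq, {"required_null_or_negative_amount"}, now))
```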


Verification

After applying any resolution, all of the following must pass before declaring resolved.

V1 — Target Table Row Count vs Baseline

SELECT COUNT(*) AS rows_total, COUNT(DISTINCT event_id) AS rows_distinct,
       MAX(event_ts) AS max_ts,
       SUM(CASE WHEN event_ts >= '${PIPELINE_START}' THEN 1 ELSE 0 END) AS rows_this_run
FROM lh_silver.silver_compliance_events;

Expected: rows_this_run within ±10% of rolling 7-day average for same hour-of-day.
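
The ±10% acceptance test is easy to automate once the same-hour history is at hand. A sketch; `baseline_check` is a hypothetical helper, and in practice the history list would come from a query over prior runs:

```python
from statistics import mean

def baseline_check(rows_this_run: int, same_hour_history: list[int],
                   tolerance: float = 0.10) -> tuple[bool, float]:
    """V1 acceptance: rows_this_run must land within +/-10% of the rolling
    7-day same-hour average. Returns (pass, signed deviation fraction)."""
    baseline = mean(same_hour_history)
    deviation = (rows_this_run - baseline) / baseline
    return abs(deviation) <= tolerance, deviation

ok, dev = baseline_check(97_000, [100_000, 101_000, 99_000, 100_500,
                                  98_500, 100_000, 101_000])
print(ok, f"{dev:+.1%}")  # True -3.0%
```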

V2 — Freshness Check

from datetime import datetime, timezone
max_ts = spark.sql("SELECT MAX(event_ts) m FROM lh_silver.silver_compliance_events").collect()[0]["m"]
# collect() returns a naive datetime (session time zone, assumed UTC here);
# make it aware before subtracting, or the subtraction raises TypeError
age_min = (datetime.now(timezone.utc) - max_ts.replace(tzinfo=timezone.utc)).total_seconds() / 60
assert age_min < 30, f"Table stale: {age_min:.1f} min"

V3 — Downstream Consumer Signal

  • Power BI dataset refresh Succeeded (re-run if needed)
  • Semantic model row count card matches expected
  • GE checkpoint passes: great_expectations checkpoint run silver_compliance_checkpoint
  • No related downstream alerts firing in the last 30 min

V4 — Watch for 2× Incident Duration

Per incident-response-template.md §3.3: a 2 hr incident → watch 4 hr before declaring fully resolved.
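
The 2× rule is mechanical enough to encode when setting a reminder; `watch_until` is an illustrative helper, not part of the template:

```python
from datetime import datetime, timedelta

def watch_until(incident_start: datetime, mitigated_at: datetime) -> datetime:
    """2x rule: after mitigation, keep watching for twice the incident's
    duration before declaring fully resolved."""
    return mitigated_at + 2 * (mitigated_at - incident_start)

# 2 h incident mitigated at 08:00 -> watch until 12:00
print(watch_until(datetime(2026, 4, 27, 6, 0), datetime(2026, 4, 27, 8, 0)))
```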


Rollback

If the resolution made things worse or introduces a regression:

-- 1. Roll back target table to pre-incident state
DESCRIBE HISTORY lh_silver.silver_compliance_events;
RESTORE TABLE lh_silver.silver_compliance_events TO VERSION AS OF ${LAST_GOOD_VERSION};

-- 2. Snapshot the bad version into an audit table for forensics
CREATE TABLE lh_audit.audit_silver_compliance_events_${YYYYMMDD_HHMM} USING DELTA
AS SELECT * FROM lh_silver.silver_compliance_events VERSION AS OF ${BAD_VERSION};

Pipeline-definition rollback (revert via fabric-cicd):

git revert ${BAD_COMMIT_SHA} && git push origin hotfix/incident-${YYYYMMDD}-pipeline-revert
python scripts/fabric-cicd-deploy.py --workspace-id ${WS_ID} \
  --branch hotfix/incident-${YYYYMMDD}-pipeline-revert

Mandatory rollback notifications: post in #incident-${date}-...-sevN; email pipeline-owners with restored version; update Status Page if customer-visible.


Post-Incident Actions

Within 48 hours of resolution (per PIR §4):

  • Root-cause categorization — tag in pipeline_errors.error_classification so the trend lands in the Error Trend KQL.
  • GE rule additions — add a Great Expectations expectation for the failure mode (e.g. expect_column_values_to_not_be_null("event_id")) so the next occurrence is caught upstream.
  • Retry policy review — if classification was TRANSIENT but retries didn't recover, re-tune retryPolicy.count / intervalInSeconds / backoffMultiplier.
  • Idempotency audit — confirm the pipeline can replay safely; retrofit MERGE pattern (see R5) if missing.
  • Add quarantine — if a data-quality failure halted the pipeline, retrofit the Quarantine Pattern.
  • Write postmortem in docs/postmortems/${YYYY-MM-DD}-pipeline-${slug}.md using the template.
  • Update this runbook if a new failure mode emerged — append to the Symptom → Cause Matrix.

Escalation Path

| Trigger | Escalate To | When |
|---|---|---|
| 30 min on SEV1/SEV2 with no mitigation | Platform Lead | Immediate |
| Compliance pipeline (CTR/SAR/HIPAA) fails inside filing window | Compliance Officer + VP Eng | Page on detection |
| Source system owner action required (DBA, agency liaison) | Source-system on-call | Within 15 min |
| Capacity throttling is the root cause | Switch to capacity-throttling-response.md | On classification |
| Auth/Identity failure | Switch to auth-failure-playbook.md | On classification |
| Data-quality breach with downstream blast radius | Open companion data-quality-incident.md | On classification |
| 2 hr SEV2 / 1 hr SEV1 sustained | VP Engineering | Per Communication Tree |

Quick-Reference Commands

Fabric REST — Pipeline Runs

# List pipelines in workspace
az rest --method get --url "https://api.fabric.microsoft.com/v1/workspaces/${WS_ID}/items?type=DataPipeline"

# Get a specific run's status
az rest --method get --url "https://api.fabric.microsoft.com/v1/workspaces/${WS_ID}/items/${PIPELINE_ID}/jobs/instances/${RUN_ID}"

# Cancel a hung run
az rest --method post --url "https://api.fabric.microsoft.com/v1/workspaces/${WS_ID}/items/${PIPELINE_ID}/jobs/instances/${RUN_ID}/cancel"

# Rerun from failed activity
az rest --method post --url "https://api.fabric.microsoft.com/v1/workspaces/${WS_ID}/items/${PIPELINE_ID}/jobs/instances/${RUN_ID}?jobType=Pipeline" \
  --body '{"executionData": {"runMode": "rerunFromFailed"}}'

KQL — Workspace Monitoring

// All pipeline failures last 24h, top failing pipelines first
FabricPipelineRuns
| where TimeGenerated > ago(24h)
| where Status == "Failed"
| summarize Failures = count(),
            LastError = take_any(ErrorMessage),
            arg_max(TimeGenerated, RunId)
            by PipelineName
| order by Failures desc

// Activity-level drill-down for a specific pipeline run
FabricPipelineActivityRuns
| where RunId == "${PIPELINE_RUN_ID}"
| project ActivityStart, ActivityName, ActivityType, Status, ErrorCode, ErrorMessage, DurationMs
| order by ActivityStart asc

// Error trend — bin by hour, split by ErrorCode
FabricPipelineActivityRuns
| where TimeGenerated > ago(7d) and Status == "Failed"
| extend Bucket = bin(TimeGenerated, 1h)
| summarize Failures = count() by Bucket, ErrorCode
| render timechart

PySpark — Delta History & Diagnostics

spark.sql("DESCRIBE HISTORY lh_bronze.bronze_slot_transactions").show(20, False)

# Row count delta between two versions
def cnt(v): return spark.sql(f"SELECT COUNT(*) c FROM lh_bronze.bronze_slot_transactions VERSION AS OF {v}").collect()[0].c
print(f"Delta v247→v248: {cnt(248) - cnt(247)}")

spark.sql("RESTORE TABLE lh_bronze.bronze_slot_transactions TO VERSION AS OF 247")

Centralized Error Table

SELECT error_id, activity_name, error_classification, severity,
       LEFT(error_message, 200) AS msg, retry_count
FROM dbo.pipeline_errors
WHERE pipeline_run_id = '${PIPELINE_RUN_ID}' AND is_resolved = 0
ORDER BY error_timestamp ASC;

EXEC dbo.usp_resolve_error @error_id=${ERROR_ID}, @resolved_by='${YOUR_UPN}',
     @resolution_notes='Schema drift — mergeSchema enabled; source notified';

Decision Tree

flowchart TD
    Start([Pipeline Failed Alert]) --> Q1{Capacity<br/>throttled?}
    Q1 -->|Yes| Cap[capacity-throttling-response.md]
    Q1 -->|No| Q2{First failed<br/>activity type?}
    Q2 -->|Copy / Lookup| Q3{Error class?}
    Q2 -->|Notebook| Q4{Error class?}
    Q2 -->|Stored Proc / Web| Q5{Error class?}
    Q3 -->|TRANSIENT| R1[Rerun from failed activity]
    Q3 -->|PERMISSION| Auth[auth-failure-playbook.md]
    Q3 -->|DATA_QUALITY schema| R2[Toggle mergeSchema or fix mapping]
    Q3 -->|TIMEOUT| T1[Increase timeout, partition source]
    Q3 -->|PERMANENT SourceNotFound| Src[Verify source; do NOT auto-retry]
    Q4 -->|RESOURCE OOM| OOM[Repartition / scale pool]
    Q4 -->|DATA_QUALITY| DQ{Bad-row<br/>fraction?}
    Q4 -->|IdempotencyKeyConflict| R5[Apply MERGE idempotency pattern]
    Q5 -->|TRANSIENT deadlock| R1
    Q5 -->|PERMANENT| Perm[Fix code/config; PR review]
    DQ -->|<5%| Quarantine[Quarantine bad rows, continue]
    DQ -->|>=5%| Halt[Halt pipeline, escalate]
    R1 --> V[Verification โ€” V1/V2/V3]
    R2 --> V
    R5 --> V
    OOM --> V
    Quarantine --> V
    T1 --> V
    V -->|Pass| Done([Mark resolved + Postmortem])
    V -->|Fail| Roll[Rollback via Delta time-travel]
    Roll --> Esc[Escalate per Escalation Path]

Related Runbooks

| Runbook | When to Use |
|---|---|
| Incident Response Template | Always open first — anchor structure for any incident |
| Capacity Throttling Response | Failure root-caused to CU > 90% / throttling |
| Auth Failure Playbook | 401/403/Workspace-Identity/SP/MI auth failures |
| Data Quality Incident | DLQ ratio > 5% or downstream consumer impact |
| Multi-Region Failover | Region outage caused source unreachable |
| Tenant Migration (Dev/Staging/Prod) | Bad pipeline deployment rollback |

Related Best-Practice Docs

| Document | Description |
|---|---|
| Error Handling & Monitoring | Pipeline error architecture, classification, taxonomy |
| Pipelines & Data Movement | Pipeline patterns, retry policies, error activity wiring |
| Copy Job CDC | Watermark management, CDC failure modes, replay |
| Alerting & Data Activator | Wiring alerts that page this runbook |
| Monitoring & Observability | KQL dashboards, FabricPipelineRuns schema |
| Medallion Architecture Deep Dive | Schema-evolution policy by layer |
| Testing Strategies | GE checkpoints, regression protection |
| Disaster Recovery (BCDR) | Delta time-travel retention, RTO/RPO targets |

โฌ†๏ธ Back to Top | ๐Ÿ“š Runbooks Index | ๐Ÿ  Home