# Data Pipeline Failure Runbook (CSA-0059)
> **Note**
> **Quick Summary:** First-response procedure for ADF / Synapse pipeline failures — triage by trigger vs. activity scope, classify severity, apply the right retry / backfill strategy, and escalate. Covers trigger mis-fires, activity-level failures (Copy, Databricks, Dataflow), transient vs. deterministic errors, and the backfill-by-watermark pattern.
## Before First Use — Customization Checklist
This runbook ships with placeholders. Populate these in a PR before your first real on-call rotation:
- Populate the Contact Information table with your Platform / Data Engineering leads.
- Wire the Data Factory instance names in §4.1 to your real factories (dev / staging / prod).
- Confirm which ADF pipelines are tagged as critical in your RBAC matrix — those get P2 on first failure, not P3.
- Confirm the ADF diagnostic-settings workspace ID used by the KQL queries below.
## 📑 Table of Contents
- 📋 1. Scope
- 🔒 2. Severity Classification
- 🚀 3. Initial Response
- 🧪 4. Common Scenarios
- 🔁 5. Retry vs. Backfill
- 📋 6. Evidence Preservation
- 📝 7. Communication Templates
- 📎 8. Contact Information
- 🗓️ 9. Drill Log
- 🔗 10. Related Documentation
## 📋 1. Scope
Covers failure response for:
- Azure Data Factory (ADF) pipelines — trigger, activity, and Self-Hosted IR failures.
- Synapse pipeline activities (Pipelines experience in Synapse Workspace).
- Databricks notebook activities invoked by ADF.
- Copy activities reading from / writing to ADLS Gen2, Event Hubs, SQL DB, Cosmos DB.
Out of scope: real-time streaming failures (see dead-letter.md), dbt model failures inside the lakehouse (see dbt-ci.md), platform-wide region outages (see dr-drill.md).
## 🔒 2. Severity Classification
| Severity | Description | Response Time | Escalation |
|---|---|---|---|
| P1 — Critical | Critical pipeline (governance-tagged tier=critical) failed in prod with customer-facing data impact | 1 hour | Data Eng Lead + CISO |
| P2 — High | Any pipeline feeding a Gold-layer product with SLA freshness breached | 4 hours | Data Eng Lead |
| P3 — Medium | Non-critical pipeline failed; backfill window still open | 24 hours | On-call engineer |
| P4 — Low | Retry succeeded automatically; post-mortem needed for the deterministic failure cause | 72 hours | Team queue |
## 🚀 3. Initial Response

### Step 1: Confirm the alert is real

```kusto
// Recent pipeline failures across the subscription
ADFPipelineRun
| where TimeGenerated > ago(2h)
| where Status == "Failed"
| project TimeGenerated, PipelineName, RunId, Status, FailureType, Message
| order by TimeGenerated desc
```
### Step 2: Classify severity

- Is the failed pipeline tagged `tier=critical` in the governance RBAC matrix?
- Is the downstream product SLA breached (check `portal/api/v1/marketplace/products`)?
- Has retry already succeeded? (See the `Status` transitions in the KQL below.)
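The three checks above can be collapsed into a small triage helper for consistency across responders. This is a hypothetical sketch, not part of any shipped tooling: the function name and the yes/no flags are assumptions, and it simplifies P1 by treating any critical-tier failure as P1 without the separate customer-facing-impact check from the severity table.

```bash
# Hypothetical triage helper mirroring the checklist above.
#   tier            - governance tag of the pipeline (e.g. "critical")
#   sla_breached    - yes|no, from the marketplace SLA check
#   retry_succeeded - yes|no, from the Status transitions in KQL
classify_severity() {
  local tier="$1" sla_breached="$2" retry_succeeded="$3"
  if [[ "$retry_succeeded" == "yes" ]]; then
    echo "P4"   # retry already succeeded: post-mortem only
  elif [[ "$tier" == "critical" ]]; then
    echo "P1"   # governance-tagged critical pipeline
  elif [[ "$sla_breached" == "yes" ]]; then
    echo "P2"   # Gold-layer SLA freshness breached
  else
    echo "P3"   # non-critical, backfill window still open
  fi
}

classify_severity critical no no    # prints P1
classify_severity standard yes no   # prints P2
```

Keeping the mapping in one place makes it easy to adjust when the severity table changes.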
### Step 3: Contain

- If the same pipeline has failed 3 times consecutively, pause the trigger to stop alert storms:

  ```bash
  az datafactory trigger stop \
    --factory-name <factory> \
    --resource-group <rg> \
    --name <trigger-name>
  ```

- Open an incident ticket linking the ADF run ID.
### Step 4: Investigate
Use the scenario playbook in §4.
### Step 5: Recover
Pick one of: automatic-retry-in-place, manual rerun of failed activities, or watermark backfill (see §5).
### Step 6: Post-Incident

- File a follow-up task tagged `data-pipeline-followup` within one business day.
- Update detection thresholds if a legitimate failure went to P3 when it should have been P2.
## 🧪 4. Common Scenarios

### 4.1 Identify the failed run

```kusto
ADFActivityRun
| where TimeGenerated > ago(24h)
| where Status == "Failed"
| project TimeGenerated, PipelineName, ActivityName, ActivityType,
    ActivityRunId, Error = tostring(Error.message), ErrorCode = tostring(Error.errorCode)
| order by TimeGenerated desc
```
### 4.2 Trigger did not fire

**Symptom:** Expected pipeline run did not show up in `ADFPipelineRun`.

- Check trigger state:

  ```bash
  az datafactory trigger show \
    --factory-name <factory> --resource-group <rg> --name <trigger>
  ```

- If `runtimeState != "Started"`, the trigger is stopped. Restart it:

  ```bash
  az datafactory trigger start \
    --factory-name <factory> --resource-group <rg> --name <trigger>
  ```

- If the trigger is started but no run occurred, check managed-identity permissions on the storage account or event source.
- Trigger a manual rerun for the missed window (see §5).
### 4.3 Copy activity failure (source / sink)

**Symptom:** `ActivityType == "Copy"`, `ErrorCode` like `2xxx` or `UserErrorSourceDataContractMismatch`.

- Check the error class:
  - `UserError*` → deterministic; will not self-heal. Fix the source or schema.
  - `SystemError*` → transient. Retry the activity in place once.
- Validate that the source dataset is reachable (firewall, managed-identity RBAC). Cross-check `DiagnosticsResourceHealth`.
- For `UserErrorSourceDataContractMismatch`, review the data contracts under `governance/contracts/`. The schema drift detector will have fired — do not retry blindly; the upstream owner owns the fix.
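The `UserError*` / `SystemError*` branching above can be expressed as a tiny helper so the retry decision is mechanical. This is a sketch only: `error_class` is a hypothetical name, and the `SystemError…` code in the usage line is illustrative, not a documented ADF error code.

```bash
# Hypothetical helper: bucket an ADF error code into the classes above
# by its UserError* / SystemError* prefix.
error_class() {
  case "$1" in
    UserError*)   echo "deterministic" ;;  # fix source/schema; no blind retry
    SystemError*) echo "transient" ;;      # one in-place retry is reasonable
    *)            echo "unknown" ;;        # triage manually
  esac
}

error_class UserErrorSourceDataContractMismatch   # prints deterministic
error_class SystemErrorTransientFailure           # prints transient
```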
### 4.4 Databricks notebook activity failure

**Symptom:** `ActivityType == "DatabricksNotebook"` with an exit code.

- Pull the notebook run URL from the activity output and open it.
- If the error is `ExecutionFailed` with a Spark OOM, reduce the partition count or move the job to a larger cluster type in the Databricks instance pool config.
- If the failure is `LibraryInstallFailed`, your cluster-wide library manifest drifted — rebuild the cluster from the Bicep definition.
- For a transient `JobRequestTimedOut`, rerun once; escalate if it recurs within the same hour.
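The "recurs within the same hour" rule can be checked mechanically: two ISO-8601 UTC timestamps fall in the same clock hour exactly when their first 13 characters (`YYYY-MM-DDTHH`) match. `same_hour` is a hypothetical helper name, sketched here for illustration.

```bash
# Compare the YYYY-MM-DDTHH prefix of two ISO-8601 UTC timestamps.
same_hour() {
  [[ "${1:0:13}" == "${2:0:13}" ]]
}

if same_hour "2026-04-20T14:05:00Z" "2026-04-20T14:55:00Z"; then
  echo "escalate"      # second timeout inside the same hour
else
  echo "rerun once"
fi
```

Note this is a same-clock-hour check, not a rolling 60-minute window; adjust if your escalation policy means the latter.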
### 4.5 Dataflow activity OOM / skew

**Symptom:** `ActivityType == "ExecuteDataFlow"` with memory errors.
- Increase integration-runtime core count (AutoResolveIntegrationRuntime compute type General → Memory Optimized).
- If the skew is caused by a hotspot join key, add a partition override to the dataflow sink partitioning settings.
- For recurring skew, promote the job to Databricks — dataflow's cost model does not forgive skew.
### 4.6 Self-Hosted IR offline

**Symptom:** Integration runtime is not available or unhealthy.

- Check IR status:

  ```bash
  az datafactory integration-runtime show \
    --factory-name <factory> --resource-group <rg> --name <ir-name>
  ```

- See `docs/SELF_HOSTED_IR.md` for the full restart procedure. TL;DR: restart the Windows service on the VM hosting the IR, confirm outbound firewall rules for port 443 are still intact, and re-register with the factory key if the IR was re-imaged.
## 🔁 5. Retry vs. Backfill
| Situation | Strategy |
|---|---|
| Transient error, `ActivityRetryCount` not yet exhausted | Let ADF retry automatically. Watch for the third failure. |
| Deterministic error, fix deployed | Manual Rerun from failed activity in ADF Studio. |
| Missed trigger window < 7 days old | Watermark backfill: rerun with the pipeline parameters `startTime`/`endTime` set to the missed window. |
| Missed window ≥ 7 days | Open a data-correctness task; coordinate with the downstream consumer. |
| Bronze data corrupted (wrote bad data) | Roll back bronze partition from ADLS versioning; rerun silver/gold. |
Manual rerun command:
```bash
az datafactory pipeline create-run \
  --factory-name <factory> \
  --resource-group <rg> \
  --name <pipeline> \
  --parameters '{"startTime":"2026-04-20T00:00:00Z","endTime":"2026-04-20T23:59:59Z"}'
```
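When the missed window spans several days, the single command above can be generated once per day. A sketch under stated assumptions: GNU `date` is available, `<factory>` / `<rg>` / `<pipeline>` are the usual placeholders, and the commands are echoed for review rather than executed (drop the `echo` to run them for real).

```bash
# Emit one ISO window per missed day, inclusive of both end dates.
backfill_windows() {   # $1=start date, $2=end date (YYYY-MM-DD)
  local d="$1" end="$2"
  while true; do
    printf '%sT00:00:00Z %sT23:59:59Z\n' "$d" "$d"
    if [[ "$d" == "$end" ]]; then break; fi
    d="$(date -u -d "$d + 1 day" +%F)"   # GNU date assumed
  done
}

# Feed each window into the manual rerun command, echoed for review.
backfill_windows 2026-04-18 2026-04-20 | while read -r s e; do
  echo az datafactory pipeline create-run \
    --factory-name "<factory>" --resource-group "<rg>" --name "<pipeline>" \
    --parameters "{\"startTime\":\"$s\",\"endTime\":\"$e\"}"
done
```

Replaying one day per run keeps each rerun idempotent against the watermark and makes a partial failure easy to resume.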
## 📋 6. Evidence Preservation
Before remediation, capture:
- ADF run ID (`RunId`) and all activity run IDs.
- The full `Error.message` and `Error.failureType` JSON payload.
- KQL query results from `ADFActivityRun` for the last 24 hours.
- For Copy activities: source/sink URLs, row counts read vs. written.
- For Databricks: notebook run URL and the job cluster's log bundle.
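A small capture script can make the checklist above repeatable. This is a sketch, assuming the az `datafactory` extension is installed; it defaults to printing each az command (`DRY_RUN=1`) so the script can be reviewed safely before running it for real against an incident.

```bash
# Sketch of a repeatable evidence snapshot for one ADF run ID.
run_id="<ADF run ID>" factory="<factory>" rg="<rg>"
outdir="evidence/${run_id}"
DRY_RUN="${DRY_RUN:-1}"

capture() {   # echo the command in dry-run mode, otherwise execute it
  if [[ "$DRY_RUN" == "1" ]]; then echo "$@"; else "$@"; fi
}

mkdir -p "$outdir"

# Pipeline-level run record: status, parameters, duration
capture az datafactory pipeline-run show \
  --factory-name "$factory" --resource-group "$rg" \
  --run-id "$run_id" > "$outdir/pipeline-run.json"

# Every activity run inside the pipeline run, last 24 hours
capture az datafactory activity-run query-by-pipeline-run \
  --factory-name "$factory" --resource-group "$rg" --run-id "$run_id" \
  --last-updated-after  "$(date -u -d '24 hours ago' +%FT%TZ)" \
  --last-updated-before "$(date -u +%FT%TZ)" > "$outdir/activity-runs.json"
```

Commit the `evidence/` directory to the incident ticket before any remediation changes the run history.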
## 📝 7. Communication Templates

### Internal notification (P1/P2)

```
Subject: [P1/P2] ADF Pipeline Failure — <pipeline_name>
Run ID: <ADF run ID>
Detected: <timestamp UTC>
Impact: <which products / SLAs affected>
Status: Investigating / Contained / Remediated
Actions taken: <bullet list>
Next update: <time>
```
### Backfill notice to consumers

```
Subject: Backfill planned for <product> — <date range>

Pipeline <pipeline_name> failed on <date> and we are replaying the window
<start> → <end>. Expect duplicated / late-arriving rows in <table> between
<start-of-replay> and <end-of-replay>. Downstream consumers with idempotent
joins should see no impact.
```
## 📎 8. Contact Information
> **Warning**
> **Action Required:** Populate these before first production use.
| Role | Contact | Phone | Escalation |
|---|---|---|---|
| Data Eng On-Call | (set via your org's on-call roster) | (see PagerDuty / OpsGenie) | First responder |
| Data Eng Lead | (set via your org's data eng DL) | (see PagerDuty / OpsGenie) | P1/P2 escalation |
| Platform Team Lead | (set via your org's platform team) | (see PagerDuty / OpsGenie) | Infra / IR failures |
| Upstream Source Owner | (per-pipeline — see governance RBAC matrix) | (DL) | Source data contract issues |
| Azure Support | Case via Portal | N/A | ADF platform-level issues |
## 🗓️ 9. Drill Log
Run this runbook in tabletop form quarterly. Add one row per drill.
| Quarter | Date | Type (tabletop / live) | Scenario exercised | Lead | Gaps identified | Fixes tracked |
|---|---|---|---|---|---|---|
| Q1 — Jan | TBD | TBD | TBD | TBD | TBD | TBD |
| Q2 — Apr | TBD | TBD | TBD | TBD | TBD | TBD |
| Q3 — Jul | TBD | TBD | TBD | TBD | TBD | TBD |
| Q4 — Oct | TBD | TBD | TBD | TBD | TBD | TBD |
## 🔗 10. Related Documentation
- dbt CI — dbt test failures and CI gate recovery
- Dead Letter — streaming dead-letter recovery
- DR Drill — region-level failover
- SELF_HOSTED_IR — Self-Hosted Integration Runtime ops
- Troubleshooting — general triage