Home > Docs > Runbooks > Data Pipeline Failure

Data Pipeline Failure Runbook (CSA-0059)¶

Note

Quick Summary: First-response procedure for ADF / Synapse pipeline failures — triage by trigger vs. activity scope, classify severity, apply the right retry / backfill strategy, and escalate. Covers trigger mis-fires, activity-level failures (Copy, Databricks, Dataflow), transient vs. deterministic errors, and the backfill-by-watermark pattern.

Before First Use — Customization Checklist¶

This runbook ships with placeholders. Populate these in a PR before your first real on-call rotation:

Populate the Contact Information table with your Platform / Data Engineering leads.
Wire the Data Factory instance names in §4.1 to your real factories (dev / staging / prod).
Confirm which ADF pipelines are tagged as critical in your RBAC matrix — those get P2 on first failure, not P3.
Confirm the ADF diagnostic-settings workspace ID used by the KQL queries below.

📋 1. Scope¶

Covers failure response for:

Azure Data Factory (ADF) pipelines — trigger, activity, and Self-Hosted IR failures.
Synapse pipeline activities (Pipelines experience in Synapse Workspace).
Databricks notebook activities invoked by ADF.
Copy activities reading from / writing to ADLS Gen2, Event Hubs, SQL DB, Cosmos DB.

Out of scope: real-time streaming failures (see dead-letter.md), dbt model failures inside the lakehouse (see dbt-ci.md), platform-wide region outages (see dr-drill.md).

🔒 2. Severity Classification¶

Severity	Description	Response Time	Escalation
P1 — Critical	Critical pipeline (governance-tagged `tier=critical`) failed in prod with customer-facing data impact	1 hour	Data Eng Lead + CISO
P2 — High	Any pipeline feeding a Gold-layer product with SLA freshness breached	4 hours	Data Eng Lead
P3 — Medium	Non-critical pipeline failed; backfill window still open	24 hours	On-call engineer
P4 — Low	Retry succeeded automatically; post-mortem needed for the deterministic failure cause	72 hours	Team queue

🚀 3. Initial Response¶

Step 1: Confirm the alert is real¶

// Recent pipeline failures across the subscription
ADFPipelineRun
| where TimeGenerated > ago(2h)
| where Status == "Failed"
| project TimeGenerated, PipelineName, RunId, Status, FailureType, Message
| order by TimeGenerated desc

Step 2: Classify severity¶

Is the failed pipeline tagged tier=critical in the governance RBAC matrix?
Is the downstream product SLA breached (check portal/api/v1/marketplace/products)?
Has retry already succeeded? (see Status transitions in KQL below)

Step 3: Contain¶

If the same pipeline has failed 3 times consecutively, pause the trigger to stop alert storms: bash az datafactory trigger stop \ --factory-name <factory> \ --resource-group <rg> \ --name <trigger-name>
Open an incident ticket linking the ADF run ID.

Step 4: Investigate¶

Use the scenario playbook in §4.

Step 5: Recover¶

Pick one of: automatic-retry-in-place, manual rerun of failed activities, or watermark backfill (see §5).

Step 6: Post-Incident¶

File a follow-up task tagged data-pipeline-followup within one business day.
Update detection thresholds if a legitimate failure went to P3 when it should have been P2.

🧪 4. Common Scenarios¶

4.1 Identify the failed run¶

ADFActivityRun
| where TimeGenerated > ago(24h)
| where Status == "Failed"
| project TimeGenerated, PipelineName, ActivityName, ActivityType,
          ActivityRunId, Error = tostring(Error.message), ErrorCode = tostring(Error.errorCode)
| order by TimeGenerated desc

4.2 Trigger did not fire¶

Symptom: Expected pipeline run did not show up in ADFPipelineRun.

Check trigger state: bash az datafactory trigger show \ --factory-name <factory> --resource-group <rg> --name <trigger>
If runtimeState != "Started", the trigger is stopped. Restart: bash az datafactory trigger start \ --factory-name <factory> --resource-group <rg> --name <trigger>
If trigger is started but no run occurred, check managed-identity permissions on the storage account or event source.
Run a manual rerun for the missed window (see §5).

4.3 Copy activity failure (source / sink)¶

Symptom: ActivityType == "Copy", ErrorCode like 2xxx or UserErrorSourceDataContractMismatch.

Check the error class: - UserError* → deterministic, will not self-heal. Fix the source or schema. - SystemError* → transient. Retry the activity in place once.
Validate the source dataset is reachable (firewall, managed-identity RBAC). Cross-check DiagnosticsResourceHealth.
For UserErrorSourceDataContractMismatch, review data contracts under governance/contracts/. The schema drift detector will have fired — do not retry blindly; the upstream owner owns the fix.

4.4 Databricks notebook activity failure¶

Symptom: ActivityType == "DatabricksNotebook" with an exit code.

Pull the notebook run URL from the activity output and open it.
If the error is ExecutionFailed with a Spark OOM, reduce partition count or move the job to a larger cluster type in the Databricks instance pool config.
If the failure is LibraryInstallFailed, your cluster-wide library manifest drifted — rebuild the cluster from the Bicep definition.
For transient JobRequestTimedOut, rerun once; escalate if it recurs within the same hour.

4.5 Dataflow activity OOM / skew¶

Symptom: ActivityType == "ExecuteDataFlow" with memory errors.

Increase integration-runtime core count (AutoResolveIntegrationRuntime compute type General → Memory Optimized).
If the skew is caused by a hotspot join key, add a partition override to the dataflow sink partitioning settings.
For recurring skew, promote the job to Databricks — dataflow's cost model does not forgive skew.

4.6 Self-Hosted IR offline¶

Symptom: Integration runtime is not available or unhealthy.

Check IR status: bash az datafactory integration-runtime show \ --factory-name <factory> --resource-group <rg> --name <ir-name>
See docs/SELF_HOSTED_IR.md for the full restart procedure. TL;DR: restart the Windows service on the VM hosting the IR, confirm outbound firewall rules for 443 are still intact, re-register with the factory key if the IR was re-imaged.

🔁 5. Retry vs. Backfill¶

Situation	Strategy
Transient error, `ActivityRetryCount` not yet exhausted	Let ADF retry automatically. Watch for third failure.
Deterministic error, fix deployed	Manual `Rerun from failed activity` in ADF Studio.
Missed trigger window < 7 days old	Watermark backfill: trigger `pipeline_param.startTime`/`endTime` for the missed window.
Missed window ≥ 7 days	Open a data-correctness task; coordinate with the downstream consumer.
Bronze data corrupted (wrote bad data)	Roll back bronze partition from ADLS versioning; rerun silver/gold.

Manual rerun command:

az datafactory pipeline create-run \
  --factory-name <factory> \
  --resource-group <rg> \
  --name <pipeline> \
  --parameters '{"startTime":"2026-04-20T00:00:00Z","endTime":"2026-04-20T23:59:59Z"}'

📋 6. Evidence Preservation¶

Before remediation, capture:

ADF run ID (RunId) and all activity run IDs.
The full Error.message and Error.failureType JSON payload.
KQL query results from ADFActivityRun for the last 24 hours.
For Copy activities: source/sink URLs, row counts read vs. written.
For Databricks: notebook run URL and the job cluster's log bundle.

📝 7. Communication Templates¶

Internal notification (P1/P2)¶

Subject: [P1/P2] ADF Pipeline Failure — <pipeline_name>

Run ID: <ADF run ID> Detected: <timestamp UTC> Impact: <which products / SLAs affected> Status: Investigating / Contained / Remediated Actions taken: <bullet list> Next update: <time>

Backfill notice to consumers¶

Subject: Backfill planned for <product> — <date range>

Pipeline <pipeline_name> failed on <date> and we are replaying the window <start> → <end>. Expect duplicated / late-arriving rows in <table> between <start-of-replay> and <end-of-replay>. Downstream consumers with idempotent joins should see no impact.

📎 8. Contact Information¶

Warning

Action Required: Populate these before first production use.

Role	Contact	Phone	Escalation
Data Eng On-Call	(set via your org's on-call roster)	(see PagerDuty / OpsGenie)	First responder
Data Eng Lead	(set via your org's data eng DL)	(see PagerDuty / OpsGenie)	P1/P2 escalation
Platform Team Lead	(set via your org's platform team)	(see PagerDuty / OpsGenie)	Infra / IR failures
Upstream Source Owner	(per-pipeline — see governance RBAC matrix)	(DL)	Source data contract issues
Azure Support	Case via Portal	N/A	ADF platform-level issues

🗓️ 9. Drill Log¶

Run this runbook in tabletop form quarterly. Add one row per drill.

Quarter	Date	Type (tabletop / live)	Scenario exercised	Lead	Gaps identified	Fixes tracked
Q1 — Jan	TBD	TBD	TBD	TBD	TBD	TBD
Q2 — Apr	TBD	TBD	TBD	TBD	TBD	TBD
Q3 — Jul	TBD	TBD	TBD	TBD	TBD	TBD
Q4 — Oct	TBD	TBD	TBD	TBD	TBD	TBD

dbt CI — dbt test failures and CI gate recovery
Dead Letter — streaming dead-letter recovery
DR Drill — region-level failover
SELF_HOSTED_IR — Self-Hosted Integration Runtime ops
Troubleshooting — general triage