

🧪 Data Quality Incident Runbook

Last Updated: 2026-05-05 | Version: 2.0 | Audience: Data engineers, on-call SRE, data stewards, compliance officers | Purpose: Detect data quality degradation, assess downstream impact, quarantine bad records, communicate with stakeholders, and remediate — without propagating bad data through the medallion architecture.



📑 Table of Contents

  1. Trigger Conditions
  2. Severity Classification
  3. Decision Flowchart
  4. Step-by-Step Procedure
  5. Quarantine Procedures
  6. Stakeholder Communication
  7. Remediation Patterns
  8. Escalation Path
  9. Post-Incident Review Checklist
  10. Related Documents

Trigger Conditions

Use this runbook when any of the following conditions are observed:

| # | Condition | Detection Source |
|---|-----------|------------------|
| 1 | Great Expectations checkpoint failed — one or more expectations did not pass | great_expectations checkpoint run exit code; webhook notification |
| 2 | Schema drift at Bronze ingest — new column, type change, or column removed | Bronze notebook log; mergeSchema warning; pipeline error |
| 3 | Null spike beyond threshold — null rate on a required column exceeds 5x historical baseline | GE expect_column_values_to_not_be_null; Workspace Monitoring KQL alert |
| 4 | Duplicate key violation — primary or business key has count > 1 | GE expect_column_values_to_be_unique; Silver MERGE failure |
| 5 | Referential integrity broken — Silver fact rows reference non-existent dimension keys | Silver post-load join check; Gold FK validation |
| 6 | Business-rule violation — domain rule failed (e.g., coin_out > coin_in × 100, negative revenue) | GE custom expectation; KQL business-rule monitor |
| 7 | Downstream consumer reports incorrect data | Email, Teams, support ticket from analyst or executive |
| 8 | Anomaly detection — Data Activator or Workspace Monitoring flags a metric outside expected band | Activator reflex; KQL anomaly detector |
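Condition 3 is defined relative to a historical baseline. As a minimal sketch (the 5x factor matches the table; column names and where the baseline comes from are up to you), the check reduces to a rate comparison:

```python
def null_spike(null_count: int, row_count: int, baseline_null_rate: float,
               factor: float = 5.0) -> bool:
    """True when the observed null rate exceeds factor x the historical baseline."""
    if row_count == 0:
        return False
    return (null_count / row_count) > factor * baseline_null_rate

# 2.4% observed vs. a 0.4% baseline: beyond 5x, so this fires
print(null_spike(null_count=240, row_count=10_000, baseline_null_rate=0.004))  # True
```

The baseline itself would typically come from a rolling window over prior process dates.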

Severity Classification

Rule of thumb: Classify by the furthest-downstream layer the bad data has reached, not by the layer where it was detected. A null spike caught in Bronze is SEV4. The same null spike found in a Power BI exec report is SEV2.

| Severity | Definition | Example (Casino) | Response SLA |
|----------|------------|------------------|--------------|
| SEV1 | Compliance-impacting bad data already consumed by audit reports / regulators | Wrong CTR threshold caused a $9,500 transaction to be filed as reportable; SAR pattern detection missed structuring | 5 min page |
| SEV2 | Bad data in Gold consumed by execs / customers / external reports | gold.fact_daily_slot_performance shows negative win/loss; Power BI exec dashboard shows wrong revenue | 15 min page |
| SEV3 | Bad data in Silver, not yet propagated to Gold | silver_slot_cleansed failed dedup; orphan FK in silver_slot_session | 2 hr ack |
| SEV4 | Bad data in Bronze, caught by quarantine before Silver runs | 12% of last batch's slot_telemetry rows had null machine_id, routed to DLQ | 4 hr ack |

Compliance Override

Any incident touching CTR, SAR, W-2G, HIPAA PHI, FedRAMP audit logs, 42 CFR Part 2 SUD records, or PCI cardholder data is automatically SEV1 or SEV2 regardless of layer.
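The severity rules above (furthest layer reached, consumption status, and the compliance override) can be expressed as a small classifier. This is an illustrative sketch, not a shipped module:

```python
def classify_severity(furthest_layer: str,
                      consumed_by_reports: bool = False,
                      compliance_data: bool = False) -> str:
    """Apply the severity table: classify by the furthest layer the bad
    data reached, with the compliance override forcing SEV1/SEV2."""
    if compliance_data:
        # Compliance override: SEV1 if already consumed, else SEV2
        return "SEV1" if consumed_by_reports else "SEV2"
    if furthest_layer == "bronze":
        return "SEV4"
    if furthest_layer == "silver":
        return "SEV3"
    # Gold: severity depends on whether reports or consumers read it
    return "SEV2" if consumed_by_reports else "SEV3"

print(classify_severity("bronze"))                          # SEV4
print(classify_severity("gold", consumed_by_reports=True))  # SEV2
```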


Decision Flowchart

```mermaid
flowchart TD
    A([Data Quality Alert]) --> B{Where was bad<br/>data detected?}
    B -->|Bronze| C{Quarantine<br/>caught it?}
    C -->|Yes| D[SEV4 — Review DLQ,<br/>fix source → Step 8]
    C -->|No| E[SEV3 — Stop Silver pipeline,<br/>quarantine manually → Step 7]

    B -->|Silver| F{Propagated<br/>to Gold?}
    F -->|No| E
    F -->|Yes| G{Consumed by<br/>reports or consumers?}

    B -->|Gold| G
    G -->|No| H[SEV3 — Pause Gold refresh,<br/>reprocess → Step 9]
    G -->|Yes| I{Compliance<br/>data affected?}
    I -->|Yes| J[SEV1 — Page IC + Compliance,<br/>contain immediately → Step 5]
    I -->|No| K[SEV2 — Page on-call,<br/>flag stale reports → Step 6]
```

Step-by-Step Procedure

Phase 1 — Detect and Classify (0–15 min)

Step 1. Identify the quality issue type from the trigger condition. Collect:

- Affected table(s) and column(s).
- Number of bad records.
- Time range of affected data.
- Detection source (GE checkpoint, consumer report, anomaly alert).
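One way to keep Step 1's facts together through the rest of the incident is a small record object; the field names here are illustrative, not a mandated schema:

```python
from dataclasses import dataclass

@dataclass
class QualityIncident:
    tables: list             # affected table(s)
    columns: list            # affected column(s)
    bad_record_count: int    # number of bad records
    time_range: tuple        # (start, end) of affected data
    detection_source: str    # GE checkpoint, consumer report, anomaly alert

inc = QualityIncident(
    tables=["bronze_slot_telemetry"],
    columns=["machine_id"],
    bad_record_count=1842,   # illustrative count
    time_range=("2026-05-04", "2026-05-05"),
    detection_source="GE checkpoint",
)
```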

Step 2. Determine the furthest-downstream layer the bad data has reached:

```python
# Check whether bad data propagated from Bronze to Silver
bronze_impact = spark.sql("""
    SELECT COUNT(*) AS affected FROM bronze_slot_telemetry
    WHERE machine_id IS NULL AND process_date = '2026-05-05'
""").collect()[0]["affected"]

silver_impact = spark.sql("""
    SELECT COUNT(*) AS affected FROM silver_slot_cleansed
    WHERE machine_id IS NULL AND process_date = '2026-05-05'
""").collect()[0]["affected"]

print(f"Bronze impact: {bronze_impact} rows")
print(f"Silver impact: {silver_impact} rows")
```

Step 3. Classify severity using the table above. If SEV1 or SEV2, open an incident channel immediately.

Step 4. If compliance data is affected (CTR, SAR, W-2G, PHI), notify the Compliance Officer regardless of severity.

Phase 2 — Contain (15–45 min)

Step 5. SEV1 — Compliance data affected:

1. Halt all downstream processing — pause Silver and Gold pipelines that consume the affected table.
2. Mark affected Power BI reports with a data quality banner, or pause scheduled refreshes.
3. Notify the Compliance Officer with the affected record count and time range.
4. Preserve the bad data for forensic analysis (do not delete it).

Step 6. SEV2 — Gold data consumed by reports:

1. Pause the semantic model refresh for affected datasets.
2. Add a note to the Power BI report indicating the data is under investigation.
3. Notify affected report consumers via Teams / email.

Step 7. SEV3 — Bad data in Silver, not yet in Gold:

1. Pause Gold pipelines that consume the affected Silver table.
2. Quarantine bad records (see Quarantine Procedures).
3. Allow good records to continue flowing.

Step 8. SEV4 — Caught in Bronze quarantine:

1. Review the dead-letter queue (DLQ) table for the affected batch.
2. Determine the root cause from the source system.
3. No immediate downstream action needed — the quarantine worked as designed.
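For the Step 8 DLQ review, grouping quarantined rows by their _dlq_reason tag gives a quick shape of the failure. Sketched here over plain dicts standing in for rows read from the DLQ table:

```python
from collections import Counter

# Stand-ins for rows read from the _dlq_slot_telemetry table
dlq_rows = [
    {"_dlq_reason": "null_required_field", "machine_id": None},
    {"_dlq_reason": "null_required_field", "machine_id": None},
    {"_dlq_reason": "bad_timestamp", "machine_id": "M-102"},
]

by_reason = Counter(row["_dlq_reason"] for row in dlq_rows)
for reason, n in by_reason.most_common():
    print(f"{reason}: {n}")
```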

Phase 3 — Diagnose (45–120 min)

Step 9. Investigate the root cause by issue type:

| Issue Type | Diagnostic Query | Common Root Cause |
|------------|------------------|-------------------|
| Null spike | SELECT COUNT(*) FROM table WHERE col IS NULL AND process_date = 'X' | Source system schema change; ETL bug; missing required field |
| Duplicate keys | SELECT key, COUNT(*) FROM table GROUP BY key HAVING COUNT(*) > 1 | Idempotency failure; re-run without dedup; source sent duplicates |
| Schema drift | DESCRIBE TABLE table vs. expected schema | Source added/renamed/removed column |
| Referential integrity | SELECT f.* FROM fact f LEFT JOIN dim d ON f.key = d.key WHERE d.key IS NULL | Dimension load ran after fact load; dimension record deleted |
| Business rule violation | GE custom expectation failure details | Data entry error; calculation bug; threshold change |
| Anomaly | SELECT metric, process_date FROM table WHERE metric > threshold | Source data change; seasonal pattern; real-world event |
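The duplicate-key diagnostic above is a GROUP BY ... HAVING COUNT(*) > 1; the same check over in-memory rows, useful for spot-verifying a pulled sample, looks like:

```python
from collections import Counter

def duplicate_keys(rows: list, key: str) -> dict:
    """Return {key_value: count} for keys appearing more than once."""
    counts = Counter(row[key] for row in rows)
    return {k: n for k, n in counts.items() if n > 1}

rows = [{"session_id": "a"}, {"session_id": "b"}, {"session_id": "a"}]
print(duplicate_keys(rows, "session_id"))  # {'a': 2}
```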

Step 10. Determine the scope of impact:

```python
# Count affected records by date
impact = spark.sql("""
    SELECT process_date, COUNT(*) as bad_records
    FROM affected_table
    WHERE quality_flag = 'FAILED'
    GROUP BY process_date
    ORDER BY process_date
""")
impact.show()
```

Phase 4 — Remediate

Step 11. Fix the root cause:

- Source-side fix: Coordinate with the source system team to correct the data or schema.
- ETL fix: Update the notebook or pipeline transformation logic.
- Validation fix: Add or tighten Great Expectations rules to catch this class of issue earlier.

Step 12. Reprocess affected data:

```python
# Reprocess affected partitions through Silver, then Gold
affected_dates = ["2026-05-04", "2026-05-05"]

for date in affected_dates:
    mssparkutils.notebook.run(
        "01_silver_slot_cleansing",
        3600,                       # timeout in seconds
        {"process_date": date},
    )

# Then reprocess Gold
for date in affected_dates:
    mssparkutils.notebook.run(
        "01_gold_slot_performance",
        3600,                       # timeout in seconds
        {"process_date": date},
    )
```
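Reprocessing runs can fail transiently (cluster startup, throttling). A small retry wrapper keeps the loop resilient; `runner` below is any callable with a (notebook, arguments) shape, so this sketch makes no assumption about the notebook utility's signature:

```python
import time

def run_with_retry(runner, notebook: str, arguments: dict,
                   attempts: int = 3, backoff_seconds: int = 30):
    """Call runner(notebook, arguments), retrying failures with a fixed backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return runner(notebook, arguments)
        except Exception as exc:
            if attempt == attempts:
                raise  # out of retries; surface the failure
            print(f"{notebook} attempt {attempt} failed ({exc}); retrying")
            time.sleep(backoff_seconds)
```

A runner could wrap the notebook call from the loop above, e.g. `lambda nb, args: mssparkutils.notebook.run(nb, 3600, args)`.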

Step 13. Validate the reprocessed data:

```shell
great_expectations checkpoint run silver_checkpoint
great_expectations checkpoint run gold_checkpoint
```

Phase 5 — Verify and Close

Step 14. Confirm downstream reports show correct data:

- Refresh affected semantic models.
- Verify Power BI report values match expected results.
- Remove any data quality banners from reports.

Step 15. Notify stakeholders that the issue is resolved (see Stakeholder Communication).

Step 16. Document the incident and complete the Post-Incident Review Checklist.


Quarantine Procedures

Bronze Quarantine (DLQ Pattern)

Route records that fail validation to a dead-letter queue table instead of blocking the pipeline:

```python
from pyspark.sql.functions import col, lit, current_timestamp

# Validate incoming batch
valid = df.filter(col("machine_id").isNotNull() & col("event_timestamp").isNotNull())
invalid = df.filter(col("machine_id").isNull() | col("event_timestamp").isNull())

# Write valid records to Bronze
valid.write.format("delta").mode("append").save("Tables/bronze_slot_telemetry")

# Write invalid records to DLQ with metadata
invalid.withColumn("_dlq_reason", lit("null_required_field")) \
       .withColumn("_dlq_timestamp", current_timestamp()) \
       .write.format("delta").mode("append") \
       .save("Tables/_dlq_slot_telemetry")
```

Silver Quarantine

Tag records that fail business rules without removing them:

```python
from pyspark.sql.functions import col, when

df_silver = df_silver.withColumn(
    "quality_flag",
    when(col("coin_out") > col("coin_in") * 100, "SUSPECT")
    .when(col("revenue") < 0, "SUSPECT")
    .otherwise("PASS")
)

# Only promote PASS records to Gold
df_gold_ready = df_silver.filter(col("quality_flag") == "PASS")
```
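The same business rules, written as a plain-Python predicate, are handy for unit-testing the rule set without a Spark session (thresholds mirror the when clauses above):

```python
def quality_flag(row: dict) -> str:
    """Mirror the Silver quality rules on a single dict row."""
    if row["coin_out"] > row["coin_in"] * 100:
        return "SUSPECT"
    if row["revenue"] < 0:
        return "SUSPECT"
    return "PASS"

print(quality_flag({"coin_in": 10, "coin_out": 900, "revenue": 50}))   # PASS
print(quality_flag({"coin_in": 10, "coin_out": 1500, "revenue": 50}))  # SUSPECT
```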

Stakeholder Communication

| Timing | Audience | Channel | Content |
|--------|----------|---------|---------|
| Immediately (SEV1/SEV2) | On-call team, IC | Incident bridge | Issue detected, containment in progress |
| Within 30 min | Report consumers | Teams / email | "Data under investigation — reports may show stale data" |
| Within 1 hr | Compliance (if applicable) | Compliance channel | Affected record count, time range, regulatory impact |
| Upon resolution | All stakeholders | Teams / email | "Issue resolved — data corrected as of [timestamp]" |
| Within 48 hr | Engineering team | Postmortem meeting | Root cause, timeline, prevention measures |
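The resolution message lends itself to a small template helper so every incident closes with consistent wording; this formatter is an illustrative sketch, not a required tool:

```python
def resolution_notice(tables: list, corrected_as_of: str) -> str:
    """Render the 'issue resolved' message from the communication table."""
    return (
        f"Issue resolved: data corrected as of {corrected_as_of}. "
        f"Affected tables: {', '.join(tables)}."
    )

print(resolution_notice(["gold.fact_daily_slot_performance"], "2026-05-05 14:30 UTC"))
```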

Remediation Patterns

| Pattern | When to Use | Procedure |
|---------|-------------|-----------|
| Reprocess from Bronze | Silver or Gold data is wrong but Bronze is clean | Re-run Silver + Gold notebooks for affected dates |
| Backfill from source | Bronze data is also wrong | Re-ingest from the source system, then reprocess Silver + Gold |
| Manual correction | Small number of records with known correct values | SQL UPDATE on affected rows, with an audit trail |
| Delta time travel | Need to revert to a known-good state | RESTORE TABLE table TO VERSION AS OF <version> |
| Quarantine release | DLQ records fixed and ready to promote | Move records from DLQ to Bronze, then process normally |
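Pattern selection follows directly from the table; a sketch of that decision (the 10-record cutoff for manual correction is an assumed threshold, not from the runbook):

```python
def choose_remediation(bronze_clean: bool, bad_record_count: int,
                       known_correct_values: bool = False) -> str:
    """Pick a remediation pattern per the table above."""
    if not bronze_clean:
        return "backfill_from_source"   # Bronze is also wrong
    if known_correct_values and bad_record_count <= 10:
        return "manual_correction"      # few rows, known values
    return "reprocess_from_bronze"      # default: clean Bronze, re-run

print(choose_remediation(bronze_clean=False, bad_record_count=500))  # backfill_from_source
```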

Escalation Path

| Time Elapsed | Action | Contact |
|--------------|--------|---------|
| 0 min | On-call engineer begins triage | On-call rotation |
| 5 min | If SEV1 (compliance data), page Compliance Officer and IC | Compliance Officer + IC |
| 15 min | If SEV2, page Data Platform Lead | Data Platform Lead |
| 1 hr | If root cause is source-side, engage source system team | Source System Owner |
| 2 hr | If reprocessing blocked, escalate to VP Engineering | VP Engineering |
| 4 hr | If compliance reporting deadline at risk, notify Legal | Legal Counsel |

Post-Incident Review Checklist

  • Affected table(s), column(s), and record count documented
  • Time range of bad data identified
  • Furthest-downstream layer reached by bad data documented
  • Root cause category identified (source change, ETL bug, schema drift, validation gap)
  • Quarantine procedure executed (DLQ, tagging, pipeline pause)
  • Data reprocessed and validated with Great Expectations
  • Downstream reports refreshed and verified
  • Stakeholders notified of resolution
  • New or tightened GE expectations added to prevent recurrence
  • Source system team notified (if source-side root cause)
  • Monitoring alert created or tuned for this failure pattern
  • Compliance impact assessed and documented (if applicable)
  • Runbook accuracy reviewed โ€” any steps to add or update?
  • Blameless postmortem completed within 48 hours (SEV1/SEV2 only)

Related Documents

| Document | Description |
|----------|-------------|
| Testing Strategies | Great Expectations setup and test patterns |
| Error Handling & Monitoring | Pipeline error architecture |
| Monitoring & Observability | Dashboard and alert setup |
| Medallion Architecture Deep Dive | Bronze/Silver/Gold patterns |
| Data Governance Deep Dive | Purview and governance patterns |
| Incident Response Template | Master incident response structure |
| Failed Refresh Triage | When a quality issue causes a refresh failure |

โฌ†๏ธ Back to Top | ๐Ÿ“‹ Runbook Index | ๐Ÿ  Home