Dead-Letter Queue (DLQ) Operator Runbook

Scope: Canonical DLQ pattern for every ingest pipeline in CSA-in-a-Box. Finding: CSA-0138 · Decision: AQ-0033 / Ballot E9 · Module: deploy/bicep/shared/modules/deadletter/deadletter.bicep

Overview

Before CSA-0138, ingest pipelines had no standard failure sink. Bad records were either silently dropped, or they crashed the pipeline and stalled the whole workload. Streaming workloads could not reach production without a canonical quarantine story.

The approved pattern provisions, per pipeline:

Resource	Purpose
Blob container `deadletter-<pipelineName>`	Quarantines the raw poison message plus a sidecar metadata JSON.
Event Grid system topic + filtered subscription	Fires on `Microsoft.Storage.BlobCreated` events scoped to the DLQ container, for downstream triage.
Log Analytics diagnostic settings	Captures `StorageRead` / `StorageWrite` + Transaction metrics for audit and replay evidence.
Azure Monitor metric alert	Fires when container capacity exceeds `alertThresholdBytes` (default 1 GiB).

Every ingest Bicep template (ADF-based, Stream Analytics, Event Hubs consumer, Databricks Autoloader) should declare module ... '../shared/modules/deadletter/deadletter.bicep' and pass it a pipeline-specific pipelineName.

When a message lands in the DLQ

A pipeline writes to its DLQ container on any of these conditions:

Schema mismatch — record fails Pydantic / JSON-Schema / Avro validation.
Encoding error — bytes that aren't valid UTF-8 / binary payload where JSON is expected.
Downstream 5xx / throttling — persistent failure to write to Bronze after retry budget.
AuthN/AuthZ failure — expired SAS, missing managed-identity role.
Poison-message loop — consumer repeatedly fails on the same offset (Event Hubs / Kafka).

Each quarantined blob is paired with a sidecar <blobname>.metadata.json:

{
    "pipeline": "iot-telemetry",
    "source": "eventhubs://evh-csa-prod/iot-telemetry",
    "timestamp": "2026-04-19T14:03:22Z",
    "offset": "12345678",
    "errorKind": "schema_mismatch",
    "errorMessage": "Pydantic validation error: field 'device_id' required",
    "attemptCount": 3,
    "correlationId": "8f3a2b1c-...",
    "auditEventId": "csa-audit-2026-04-19-..."
}

The auditEventId ties back to the tamper-evident audit log (CSA-0016) so triage can reconstruct the upstream event history.
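As an illustration of the sidecar contract above, here is a minimal sketch of how a pipeline might assemble the metadata document before writing it next to the quarantined blob. The helper names `build_sidecar` and `sidecar_name` are hypothetical, not part of any CSA-in-a-Box library; only the field names and the `<blobname>.metadata.json` convention come from this runbook.

```python
import json
from datetime import datetime, timezone


def sidecar_name(blob_name):
    """Sidecar blob name for a quarantined blob, per the runbook convention."""
    return f"{blob_name}.metadata.json"


def build_sidecar(pipeline, source, offset, error_kind, error_message,
                  attempt_count, correlation_id, audit_event_id, now=None):
    """Return the sidecar JSON text for one quarantined message (hypothetical helper)."""
    # Normalize to UTC so the trailing "Z" in the timestamp is accurate.
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    sidecar = {
        "pipeline": pipeline,
        "source": source,
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "offset": str(offset),  # stored as a string, as in the example above
        "errorKind": error_kind,
        "errorMessage": error_message,
        "attemptCount": attempt_count,
        "correlationId": correlation_id,
        "auditEventId": audit_event_id,
    }
    return json.dumps(sidecar, indent=4)
```

Uploading the payload and the sidecar is left out on purpose; how a given pipeline writes to its DLQ container depends on its runtime.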

Triage

Step 1: Acknowledge the alert

The DLQ-size metric alert (csa-alert-dlq-size-<pipelineName>) routes through the pipeline's action group. Acknowledge in the Azure Monitor alert blade:

https://portal.azure.com/#blade/Microsoft_Azure_Monitor/AlertsManagementSummaryBlade

If this is the first time the pipeline has ever alerted, verify the module was deployed by confirming the container exists:

az storage container show \
  --account-name <storageAccountName> \
  --name deadletter-<pipelineName> \
  --auth-mode login

Step 2: Enumerate poison messages

az storage blob list \
  --account-name <storageAccountName> \
  --container-name deadletter-<pipelineName> \
  --auth-mode login \
  --output table \
  --query "[].{name:name, created:properties.creationTime, size:properties.contentLength}"

Many blobs whose created timestamps fall within a narrow window suggest a spike; group them by time (jq on --output json output, or Sort-Object in PowerShell) before sampling.
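If jq is not at hand, the same grouping can be sketched in Python. This assumes the listing was saved with `--output json` instead of `--output table`, so each entry carries the `name`, `created`, and `size` keys from the `--query` above; the function name `creation_buckets` is hypothetical.

```python
import json
from collections import Counter


def creation_buckets(blob_list_json, width=16):
    """Count blobs per time bucket.

    width=16 truncates an ISO-8601 timestamp such as
    "2026-04-19T14:03:22+00:00" to the minute ("2026-04-19T14:03");
    width=13 would bucket by hour instead.
    """
    blobs = json.loads(blob_list_json)
    return Counter(blob["created"][:width] for blob in blobs)
```

Calling `creation_buckets(text).most_common(5)` then lists the busiest minutes first, which is usually enough to decide where to sample.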

Step 3: Inspect a representative message

# download a sample message and its sidecar
az storage blob download \
  --account-name <storageAccountName> \
  --container-name deadletter-<pipelineName> \
  --name <blobname> \
  --file /tmp/dlq-sample.bin \
  --auth-mode login

az storage blob download \
  --account-name <storageAccountName> \
  --container-name deadletter-<pipelineName> \
  --name <blobname>.metadata.json \
  --file /tmp/dlq-sample.metadata.json \
  --auth-mode login

jq . /tmp/dlq-sample.metadata.json
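To get a quick first read on the downloaded payload itself, a small probe like the following can help. It is only a sketch, assuming the pipeline expects UTF-8 JSON; the function name and its return labels are hypothetical and are not the sidecar's errorKind values.

```python
import json


def probe_payload(raw: bytes) -> str:
    """Return 'encoding', 'not_json', or 'json_ok' for a downloaded DLQ payload."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "encoding"   # bytes are not valid UTF-8
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return "not_json"   # valid text, but not parseable JSON
    return "json_ok"        # parses; the failure is likely schema or downstream
```

For example, `probe_payload(open("/tmp/dlq-sample.bin", "rb").read())` checks the sample downloaded above.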

Step 4: Classify the error

Match errorKind from the sidecar to one of:

errorKind	Typical cause	Disposition
`schema_mismatch`	Producer shipped a new field or changed a type	Fix schema → replay
`encoding`	Bad UTF-8 / corrupted binary	Drop with audit event
`downstream_5xx`	Bronze write still failing after the retry budget is exhausted (usually a transient outage)	Replay
`auth`	Expired SAS, missing RBAC, managed-identity misconfig	Fix identity → replay
`poison_loop`	Consumer crashes on the same record every attempt	Drop + escalate
`other`	Unknown — inspect raw payload and upstream logs	Case-by-case
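The table above can be encoded as a simple lookup for triage scripts. This is a sketch only; `DISPOSITIONS` and `disposition_for` are hypothetical names, and any errorKind not listed falls through to the "other" row.

```python
# Disposition per errorKind, transcribed from the classification table.
DISPOSITIONS = {
    "schema_mismatch": "fix schema, then replay",
    "encoding": "drop with audit event",
    "downstream_5xx": "replay",
    "auth": "fix identity, then replay",
    "poison_loop": "drop and escalate",
}


def disposition_for(sidecar: dict) -> str:
    """Look up the suggested disposition for a parsed sidecar document."""
    return DISPOSITIONS.get(sidecar.get("errorKind"), "case-by-case")
```

The result is a suggestion for the operator, not a decision; Step 5 still applies.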

Step 5: Decide disposition

Replay — if the underlying defect is fixed or transient. See Replay.
Drop — if the payload is unrecoverable. See Drop.
Escalate — if volume or pattern indicates an upstream incident. See Escalation.

Replay

Databricks Autoloader pattern

Point Autoloader at the DLQ container with a one-shot trigger(availableNow=True) read — corrected records flow through the normal Bronze → Silver → Gold path:

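# Runs in a Databricks notebook, where `spark` is predefined.
# `storage` (the storage account name) and `replay_batch` (a per-batch handler
# taking (df, epoch) that revalidates records and writes them to Bronze) must be
# defined before this cell. The container also holds the *.metadata.json
# sidecars, so replay_batch has to skip them or pair them with their payloads.
(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "binaryFile")
  .option("cloudFiles.schemaLocation", "/tmp/schema/dlq-iot-telemetry")
  .load(f"abfss://deadletter-iot-telemetry@{storage}.dfs.core.windows.net/")
  .writeStream
  .trigger(availableNow=True)  # process what is there now, then stop
  .option("checkpointLocation", "/tmp/chk/dlq-replay-iot-telemetry")
  .foreachBatch(replay_batch)
  .start())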

ADF copy-with-retry pattern

Clone the production ingest pipeline.
Replace the source dataset with the DLQ container.
Set the copy activity's retry policy: retry = 3, retryIntervalInSeconds = 60.
Enable Fault tolerance → Skip incompatible rows → Log so persistent failures re-land in DLQ rather than crashing the replay.
Trigger manually; after success, delete the replayed blobs from DLQ.

Event Hubs / Service Bus consumer pattern

Use a standalone replay consumer that reads the DLQ blobs and re-publishes to the original ingest topic with an x-replay-attempt header so downstream observability can distinguish replayed traffic.
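A minimal sketch of such a replay consumer follows. It is deliberately SDK-agnostic: `publish` stands in for whatever sends an event to the original ingest topic, and the event is modeled as a plain dict. Both function names are hypothetical, and carrying `correlationId` alongside `x-replay-attempt` is an assumption, not something this runbook mandates.

```python
def build_replay_event(payload: bytes, sidecar: dict, attempt: int = 1) -> dict:
    """Wrap a DLQ payload for re-publishing, tagged so it is distinguishable."""
    return {
        "body": payload,
        "properties": {
            "x-replay-attempt": str(attempt),
            # Assumption: keep the original correlation id for traceability.
            "correlationId": sidecar["correlationId"],
        },
    }


def replay_all(pairs, publish, attempt: int = 1) -> int:
    """Re-publish every (payload, sidecar) pair and return how many were sent.

    `publish` is any callable accepting one event dict; in a real consumer it
    would map the dict onto the messaging SDK's event type and send it.
    """
    sent = 0
    for payload, sidecar in pairs:
        publish(build_replay_event(payload, sidecar, attempt))
        sent += 1
    return sent
```

Keeping the publish step injectable makes the replay loop easy to dry-run against a list before pointing it at the live topic.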

Drop

When a payload is unrecoverable:

Emit a drop audit event via the tamper-evident logger (CSA-0016) with action = "dlq.drop", the correlationId, and the operator identity.
Delete the blob and its sidecar from the DLQ container.
Capture rationale in the per-pipeline triage log (see Per-pipeline DLQ inventory).

az storage blob delete-batch \
  --source deadletter-<pipelineName> \
  --account-name <storageAccountName> \
  --pattern "<blobname>*" \
  --auth-mode login
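Before the delete runs, the drop audit event from the first step has to exist. The sketch below shows one way to assemble its payload; the actual CSA-0016 logger interface is not described in this runbook, so `build_drop_event` and the `rationale` and `timestamp` fields are hypothetical. Only `action = "dlq.drop"`, the correlationId, and the operator identity come from the steps above.

```python
from datetime import datetime, timezone


def build_drop_event(correlation_id, operator, rationale, now=None):
    """Assemble a drop audit event payload (hypothetical shape)."""
    if not correlation_id or not operator:
        # Every drop must be attributable to a message and an operator.
        raise ValueError("correlation_id and operator are required")
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "action": "dlq.drop",
        "correlationId": correlation_id,
        "operator": operator,
        "rationale": rationale,  # assumption: mirrors the triage-log entry
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```

Refusing to build the event without an operator keeps an anonymous drop from slipping through a script.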

Drop is an auditable action: every drop must have an audit event and an operator signature. If the drop rate exceeds 1 % of ingest volume over a 24-hour window, escalate as described under Escalation.
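The 1 % rule reduces to a single comparison once the two counts for the 24-hour window are known. A minimal sketch, assuming both counts are message counts gathered elsewhere; the function name is hypothetical.

```python
def drop_rate_exceeded(dropped: int, ingested: int, threshold: float = 0.01) -> bool:
    """True when drops exceed `threshold` of ingest volume for the window."""
    if ingested <= 0:
        # No ingest volume recorded: any drop at all is worth a look.
        return dropped > 0
    return dropped / ingested > threshold
```

For example, 11 drops against 1,000 ingested messages trips the check, while 5 does not.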

Escalation

Page the on-call and open a P1 when any of the following trip:

DLQ size alert fires three times in one hour for the same pipeline.
Single DLQ container exceeds 10 GiB regardless of alert threshold.
Ingest pipeline throughput drops by more than 25 % while the DLQ is growing.
errorKind = poison_loop observed — indicates consumer-code defect.
Volume spike correlates with a recent deployment (possible bad release).
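The triggers above can be checked mechanically once the inputs are collected. The sketch below assumes those inputs are already gathered into a snapshot; `DlqSnapshot` and `escalation_reasons` are hypothetical names, and how each field is measured is left to the pipeline's own monitoring.

```python
from dataclasses import dataclass, field

GIB = 1024 ** 3


@dataclass
class DlqSnapshot:
    alerts_last_hour: int        # DLQ size alert firings in the past hour
    container_bytes: int         # current size of the DLQ container
    throughput_drop_pct: float   # ingest throughput drop, in percent
    dlq_growing: bool            # whether the DLQ is growing at the same time
    error_kinds: set = field(default_factory=set)  # errorKind values observed
    spike_after_deploy: bool = False


def escalation_reasons(s: DlqSnapshot) -> list:
    """Return the tripped triggers; a non-empty list means page and open a P1."""
    reasons = []
    if s.alerts_last_hour >= 3:
        reasons.append("size alert fired three times in one hour")
    if s.container_bytes > 10 * GIB:
        reasons.append("container exceeds 10 GiB")
    if s.throughput_drop_pct > 25 and s.dlq_growing:
        reasons.append("throughput down more than 25 % with DLQ growth")
    if "poison_loop" in s.error_kinds:
        reasons.append("poison_loop observed")
    if s.spike_after_deploy:
        reasons.append("spike correlates with a recent deployment")
    return reasons
```

Returning the reasons instead of a bare boolean gives the P1 ticket its opening lines.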

Include in the P1 ticket:

Pipeline name + DLQ container URI
Last 1 hour of DLQ metrics (KQL snippet)
Sample sidecar JSON (redact PII)
Suspected upstream change (git SHA / deployment run)

Per-pipeline DLQ inventory

Teams own the row for their pipeline. Update on deployment.

Pipeline	DLQ Container URI	Owner	Replay SLA	Replay Procedure
template	`https://<storage>.blob.core.windows.net/deadletter-<name>`	team-alias	24 h	See "Databricks Autoloader pattern" above
`iot-telemetry`	(fill on first deploy)	(team)	4 h	(link to pipeline-specific replay notebook)
`noaa-weather`	(fill on first deploy)	(team)	24 h	(link)
`aqi-sensors`	(fill on first deploy)	(team)	1 h	(link)
`casino-telemetry`	(fill on first deploy)	(team)	15 min	(link)
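The Replay SLA column can be checked against the age of the oldest blob in a container. The sketch below interprets the SLA as the maximum time a message may sit in the DLQ before it is replayed; that reading is an assumption, as are the helper names. It parses only the two formats used in the table ("24 h", "15 min").

```python
from datetime import datetime, timedelta, timezone

_UNITS = {"h": "hours", "min": "minutes"}


def parse_sla(text: str) -> timedelta:
    """Parse an inventory SLA cell such as '4 h' or '15 min'."""
    value, unit = text.split()
    return timedelta(**{_UNITS[unit]: int(value)})


def sla_breached(oldest_created: datetime, sla_text: str, now=None) -> bool:
    """True when the oldest DLQ blob is older than the pipeline's replay SLA."""
    now = now or datetime.now(timezone.utc)
    return now - oldest_created > parse_sla(sla_text)
```

`oldest_created` must be timezone-aware, which matches the ISO timestamps the blob listing returns once parsed.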

Finding: CSA-0138 — No standard DLQ pattern
Decision: AQ-0033 / Ballot item E9
ADR: ADR-0005 — Event Hubs over Kafka (context for streaming DLQ semantics)
Audit logger: CSA-0016 tamper-evident audit log (drop-event emission)
Bicep module: deploy/bicep/shared/modules/deadletter/deadletter.bicep
Alerts: Alert rule csa-alert-dlq-size-<pipelineName> fires to the pipeline's action group.

Last reviewed: 2026-04-19 · Owner: Platform Reliability · Review cadence: quarterly