fiab-0006: Mirroring engine via Debezium + Spark Structured Streaming + Delta MERGE¶

Comparative positioning note

This document is written from the perspective of Microsoft Azure, Cloud Scale Analytics, and CSA Loom. Any description of third-party or competing products, services, pricing, or capabilities is derived from publicly available documentation and sources believed accurate at the time of writing, and is provided for general comparison only. We do not claim expertise in, or authority over, any non-Microsoft product or service; the respective vendor's official documentation is the authoritative source for their offerings, which may change over time. Nothing here is intended to disparage any vendor — where a competing product has genuine advantages, we aim to note them honestly. Verify all third-party details against the vendor's current official documentation before making decisions.

Status: Accepted Date: 2026-05-22 Locked decision ref: LD-9

Context¶

CSA Loom needs Fabric Mirroring parity: zero-ETL near-real-time CDC from operational databases into the Loom lakehouse as Delta tables. Fabric GA sources as of 2026-05-22: Azure SQL DB, Azure SQL MI, SQL Server 2016-2025, Cosmos DB, Azure DB for PostgreSQL, Snowflake, Oracle, SAP Datasphere, Fabric SQL DB.

Fabric exposes the Open Mirroring landing-zone protocol publicly so partners can drop Parquet files with a documented __rowMarker__ column directly into a landing zone path — Fabric's replicator picks them up and MERGEs into Delta. This is the partner-extensible ingestion path (Qlik, Striim, Informatica IDMC, SNP Glue for SAP, Theobald Xtract Universal).

Per temp/fiab-research/03-fabric-only-internals.md §5, this is tractable OSS territory — per-source CDC is well-trodden (Debezium has connectors for SQL Server, Postgres, MySQL, Oracle; Cosmos DB Spark connector handles change feed; Snowflake streams API exists), and writing Delta MERGE INTO logic in Spark Structured Streaming is standard.

Three approaches considered: 1. OSS Debezium + Spark Structured Streaming + Delta MERGE — portable, debuggable, source-supported, customer can inspect every connector log 2. Build our own simplified CDC framework — tighter Console integration; engineering cost 3. Wrap Azure Data Factory CDC + Mapping Data Flows — native Azure; managed; slower

Decision¶

OSS Debezium + Spark Structured Streaming + Delta MERGE. With honor of Fabric's Open Mirroring publisher contract so partner publishers can drop Parquet directly.

Implemented as apps/fiab-mirroring-engine/ — Container App (Commercial / GCC) or AKS workload (GCC-High / IL5).

Source connectors:

Source	Mechanism
Azure SQL DB / Azure SQL MI	Debezium SQL Server connector (reads CDC tables)
Postgres	Debezium Postgres connector (logical replication)
MySQL	Debezium MySQL connector (binlog)
Cosmos DB	Azure Cosmos Spark connector (change feed)
Snowflake	Custom poller via Snowflake streams API
Oracle	Debezium Oracle connector (LogMiner)
SQL Server 2016-2025 on-prem	Debezium SQL Server + Self-Hosted IR for network connectivity
SAP / Snowflake / Oracle via partner publishers	Open Mirroring landing-zone protocol — partner writes Parquet directly

Transport: Event Hubs (Kafka protocol surface — Debezium emits Kafka topics; Event Hubs accepts them natively).

Replicator: Spark Structured Streaming job on Databricks that reads Event Hubs + landing zone, parses CDC envelope, MERGE INTOs target Delta with idempotency.

Idempotency: per-row last_op_id in Delta + watermarks in Cosmos DB.

Schema evolution: auto-union new columns; manual recreate for drops.

Open Mirroring landing-zone protocol (identical to Fabric's): - Path: <ADLS>/landing-zone/<schema>/<table>/ - _metadata.json declares keyColumns - 20-digit zero-padded sequence file names - __rowMarker__ column with 1=INSERT, 2=UPDATE, 3=DELETE semantics

Consequences¶

Positive¶

Portable + debuggable — customer can read every Debezium connector log and Spark Streaming job log
Source-supported — Debezium has active community + commercial backing (Red Hat); Spark Structured Streaming is industry standard
Honors Fabric's Open Mirroring publisher contract — partner publishers (Qlik, Striim, Informatica, SAP-side connectors) already support this protocol; they "just work" against Loom
Sub-minute steady-state latency matches Fabric
Forward-migration friendly — when Fabric Mirroring lands in Gov, customers can switch per-source (Cosmos / SQL / Postgres / Snowflake / Oracle already GA in Fabric); keep Loom Mirroring for sources Fabric doesn't yet cover

Negative¶

First-touch setup UX is harder than Fabric's "click to mirror" — customer configures Debezium connector + Spark job parameters; v1 ships templated configs per source type; v1.1 polishes UX
Snowflake source has no native Debezium — custom poller is more fragile; document operational expectations openly
Operating a Debezium Connect runtime + Spark Streaming jobs adds complexity vs Fabric's managed service
Backpressure handling under high CDC volume requires Spark Streaming trigger interval tuning

Neutral¶

Open Mirroring publisher SDK (Python + .NET) deferred to v1.1 (PRP-108) — partners using existing SDKs (Qlik / Striim) just work
Latency at 5-15 s steady-state with default 30-s trigger interval; configurable down to 5-s trigger for lower latency

Alternatives considered¶

Alternative	Why not chosen
Build our own simplified CDC framework	Reinvents the wheel; less community support; harder for customer to debug
Wrap ADF CDC + Mapping Data Flows	Slower latency than Spark Streaming; doesn't match Open Mirroring publisher contract; harder to extend to non-Microsoft sources
Use Azure Synapse Link directly	Only covers Cosmos DB + Azure SQL DB; Synapse Link to Snowflake / Oracle / SAP doesn't exist
Native SQL Server CDC into Parquet (no Debezium)	Works only for SQL Server 2025+; misses older SQL Server, Postgres, MySQL, Cosmos, Snowflake, Oracle

References¶

PRD: temp/fiab-prd/05-workload-parity.md §5.8, 06-custom-apps.md §6.4
Amendments: temp/fiab-prd/AMENDMENTS.md §A10
Research: temp/fiab-research/03-fabric-only-internals.md §5
External: Microsoft Learn — Open Mirroring landing-zone format, Debezium docs
Build: PRP-07 — apps/fiab-mirroring-engine/