ADR 0003 — Delta Lake over Iceberg and Parquet as canonical table format¶
Context and Problem Statement¶
The medallion lakehouse stores Bronze/Silver/Gold tables on ADLS Gen2. We must pick a single canonical table format to unify reads and writes across Databricks Spark, dbt, Synapse SQL serverless, and ad-hoc DuckDB analysis. The choice materially affects query performance (statistics + data-skipping), ACID guarantees (concurrent writes during MERGE operations), Purview lineage, and eventual compatibility with Microsoft Fabric OneLake.
Decision Drivers¶
- Microsoft Fabric OneLake compatibility — Fabric's native table format is Delta; selecting Delta aligns with the strategic target (ADR-0010).
- Databricks-native optimizations — Photon + Delta Caching + Liquid Clustering deliver the best price/performance on our primary engine.
- ACID MERGE semantics — CDC and SCD2 patterns in dbt and ADF pipelines require atomic upserts; plain Parquet does not provide this.
- Governance — Purview, Unity Catalog, and Fabric OneLake all read Delta transaction logs for lineage and classification propagation.
- Open-source interoperability — Delta has an open spec + Delta-RS client so non-Spark readers (DuckDB, Polars, Trino) can consume tables.
Considered Options¶
- Delta Lake (chosen) — Open spec, ACID, Databricks-native, Fabric OneLake-native, mature dbt adapter, Purview-aware.
- Apache Iceberg — Open spec, strong engine-neutrality story (Snowflake, Trino, Athena), excellent schema-evolution semantics.
- Apache Hudi — ACID upserts, strong CDC story, smaller Azure ecosystem.
- Raw Parquet + Hive metastore — Simplest; no ACID; data-skipping via metastore stats only.
Decision Outcome¶
Chosen: Option 1 — Delta Lake for all Silver and Gold tables. Bronze may remain as landed raw files (Parquet/JSON/CSV) until promotion.
Consequences¶
- Positive: Native on Databricks + Fabric OneLake; minimal adapter surface.
- Positive: ACID MERGE, time travel, Z-ORDER / Liquid Clustering, and OPTIMIZE are available without extra tooling.
- Positive: Purview + Unity Catalog automatically discover Delta tables and propagate classifications.
- Positive: Open spec + Delta-RS enable DuckDB/Polars analysis in the dev loop without a Spark cluster.
- Negative: Engine-neutrality is weaker than Iceberg outside the Azure/Databricks/Fabric ecosystem (e.g., Snowflake's Iceberg support is more mature than its Delta support).
- Negative: Delta Uniform (Iceberg compatibility layer) is available but adds a metadata-sync cost; we do not enable it by default.
- Negative: Small-file problems require regular OPTIMIZE jobs; mitigated by dbt post-hooks and scheduled maintenance notebooks.
- Neutral: If a federal tenant mandates Iceberg, Delta Uniform or a one-way conversion path is feasible.
Pros and Cons of the Options¶
Option 1 — Delta Lake¶
- Pros: Databricks + Fabric native; ACID; Z-ORDER; open spec; Delta-RS; mature dbt adapter; Purview-aware.
- Cons: Weaker non-Azure engine parity vs. Iceberg; OPTIMIZE overhead.
Option 2 — Apache Iceberg¶
- Pros: Engine-neutral; strong schema evolution; hidden partitioning; first-class in Snowflake/Trino/Athena.
- Cons: Not native to Databricks (as of decision date); Fabric OneLake does not write Iceberg natively; Purview lineage is weaker.
Option 3 — Apache Hudi¶
- Pros: Strong CDC and upsert semantics; merge-on-read tables for low-latency ingestion.
- Cons: Smaller Azure ecosystem; no Fabric native story; fewer engineers fluent in operations.
Option 4 — Raw Parquet¶
- Pros: Simplest; zero lock-in; every engine reads it.
- Cons: No ACID; no time travel; compaction is manual; CDC requires custom patterns; stats live in an external metastore.
Validation¶
We will know this decision is right if:
- All Silver and Gold tables in vertical examples are Delta within one quarter of onboarding.
- Purview lineage coverage for Delta tables exceeds 95% of the catalog.
- If a tenant's consumer stack is Snowflake-primary and Delta performance is materially worse than Iceberg, revisit with Delta Uniform as the compromise.
References¶
- Decision tree: Delta vs. Iceberg vs. Parquet
- Related code:
domains/shared/dbt/dbt_project.yml(materialization defaults),deploy/bicep/DLZ/modules/databricks/databricks.bicep(Unity Catalog),domains/shared/pipelines/adf/pl_ingest_to_bronze.json - Framework controls: NIST 800-53 SC-28 (Delta log integrity via ADLS encryption at rest), AU-10 (non-repudiation via time travel / version history), CP-9 (backup via time-travel retention). See
governance/compliance/nist-800-53-rev5.yaml. - Discussion: CSA-0087