# ADR 0002 — Azure Databricks over open-source Spark-on-AKS for heavy compute
## Context and Problem Statement
Medallion transformations, large-scale enrichment, and ML feature engineering require a distributed Spark runtime. Customers need a predictable Spark experience in both Azure Commercial and Azure Government, with a credible story for governance (Unity Catalog), performance (Photon), and cost control (job clusters with auto-termination). We must pick a primary compute engine before the Databricks-specific Bicep modules (see `deploy/bicep/DMLZ/modules/Databricks/databricks.bicep`) are finalized.
## Decision Drivers
- Azure Government availability for the Spark runtime with FedRAMP High authorization inheritance.
- Total cost of ownership — we prefer a managed runtime over customer-run AKS Spark operators that need 24x7 platform engineering.
- Governance — native integration with Unity Catalog and Purview for row/column lineage, classification propagation, and table-level ACLs.
- Performance — Photon + Delta Lake optimizations materially reduce query latency for Silver/Gold.
- Composability — the choice must not lock in proprietary transformation code; dbt and PySpark are both portable.
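The "job clusters with auto-termination" cost model behind the TCO driver can be illustrated with a Databricks Jobs API 2.1 payload: a job cluster is created per run and released when the run finishes, so no idle cost accrues. This is a hedged sketch — the job name, notebook path, node type, and worker count are illustrative assumptions, not values from this repo.

```python
# Sketch of a Jobs API 2.1 create payload using an ephemeral job cluster.
# All names, paths, and sizes below are illustrative assumptions.
job_payload = {
    "name": "silver-enrichment",
    "job_clusters": [
        {
            "job_cluster_key": "heavy_compute",
            "new_cluster": {
                "spark_version": "14.3.x-scala2.12",
                "node_type_id": "Standard_E8ds_v5",
                "num_workers": 8,
                # A job cluster is torn down when the run completes, so
                # cost stops with the job rather than with an idle timeout.
            },
        }
    ],
    "tasks": [
        {
            "task_key": "enrich",
            "job_cluster_key": "heavy_compute",
            "notebook_task": {"notebook_path": "/Repos/domains/spark/silver_enrichment"},
        }
    ],
}

# Cheap structural check: every task must reference a defined job cluster.
defined = {c["job_cluster_key"] for c in job_payload["job_clusters"]}
assert all(t["job_cluster_key"] in defined for t in job_payload["tasks"])
```

The same payload shape works against both Commercial and Government workspace endpoints, which is part of why the managed runtime satisfies the first two drivers at once.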
## Considered Options
- Azure Databricks (chosen) — Managed Spark, Unity Catalog, Photon, Delta Lake native, Gov-GA, strong Purview integration.
- Open-source Apache Spark on AKS — Full control, no platform markup, but customer-owned HA, upgrades, and autoscaling.
- Azure Synapse Spark Pools — Managed, integrated with Synapse SQL, but less aggressive innovation cadence and weaker Unity-Catalog-equivalent governance.
- Microsoft Fabric Spark — Strategic target (see ADR-0010) but Gov availability lags Commercial by quarters to a year.
## Decision Outcome
Chosen: Option 1 — Azure Databricks as the primary heavy-compute engine, with Synapse Spark permitted for tenants that have an existing Synapse footprint and Fabric Spark planned as a migration target once Gov-GA lands.
### Consequences
- Positive: Managed service, auto-termination controls cost, Photon gives real speedups on Delta, Unity Catalog gives fine-grained access control and lineage without custom code.
- Positive: Gov-GA available with FedRAMP High inheritance from Microsoft.
- Positive: PySpark + dbt transformations remain portable to Fabric Spark or OSS Spark if we migrate later.
- Negative: Per-DBU premium on top of VM cost; requires active cluster-policy enforcement to stop sprawl (tracked in the cluster policies in `deploy/bicep/DMLZ/modules/Databricks/databricks.bicep`).
- Negative: Workspace sprawl if one workspace per domain becomes the default — mitigated by Unity-Catalog-scoped workspaces.
- Negative: Some Databricks-specific APIs (e.g. SQL warehouses, Jobs 2.1) are non-portable; we cap their use to orchestration glue, not business logic.
- Neutral: Does not preclude a future migration to Fabric Spark; Delta tables and Unity Catalog entries are first-class in Fabric OneLake.
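The cluster-policy mitigation above can be expressed as data plus a small conformance check. The key names (`autotermination_minutes`, `num_workers`, `node_type_id`) and the `range`/`allowlist` rule types follow the Databricks cluster-policy JSON schema, but the specific limits and the `validate_cluster_spec` helper are illustrative assumptions, not the finalized policy from the Bicep module.

```python
# Hypothetical cluster policy: bounded auto-termination, capped size,
# and an allowlist of node SKUs. Limits are illustrative assumptions.
HEAVY_COMPUTE_POLICY = {
    "autotermination_minutes": {"type": "range", "minValue": 10, "maxValue": 60},
    "num_workers": {"type": "range", "maxValue": 20},
    "node_type_id": {"type": "allowlist", "values": ["Standard_E8ds_v5", "Standard_E16ds_v5"]},
}


def validate_cluster_spec(spec: dict, policy: dict) -> list:
    """Return a list of human-readable policy violations for a proposed cluster spec."""
    violations = []
    for key, rule in policy.items():
        value = spec.get(key)
        if value is None:
            violations.append(f"{key}: missing (policy-controlled)")
        elif rule["type"] == "range":
            if "minValue" in rule and value < rule["minValue"]:
                violations.append(f"{key}: {value} below minimum {rule['minValue']}")
            if "maxValue" in rule and value > rule["maxValue"]:
                violations.append(f"{key}: {value} above maximum {rule['maxValue']}")
        elif rule["type"] == "allowlist" and value not in rule["values"]:
            violations.append(f"{key}: {value!r} not in allowlist")
    return violations
```

A spec with `autotermination_minutes: 0`, an oversized worker count, or an off-list SKU fails all three rules; in practice the Databricks control plane enforces the policy at cluster-creation time, and a check like this only belongs in CI linting of job definitions.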
## Pros and Cons of the Options
### Option 1 — Azure Databricks
- Pros: Managed runtime; Photon; Unity Catalog; Gov-GA; strong Purview integration; Delta Lake-native; mature job scheduler.
- Cons: DBU markup; SQL Warehouses are proprietary; workspace proliferation risk.
### Option 2 — OSS Spark on AKS
- Pros: No DBU premium; full version control; portable everywhere.
- Cons: Customer-owned HA, upgrades, autoscaling, and governance; no equivalent to Unity Catalog; no Photon.
### Option 3 — Synapse Spark Pools
- Pros: Managed; integrated with Synapse SQL; Gov-GA; cheaper DBU-free pricing model.
- Cons: Slower innovation; no Photon equivalent; Purview lineage is shallower; tighter coupling to a Synapse workspace.
### Option 4 — Fabric Spark
- Pros: Strategic future target; OneLake-native; deep Purview + Fabric governance integration.
- Cons: Gov-GA lags; not viable for current federal tenants.
## Validation
We will know this decision is right if:
- Spark job cost per TB processed is within 25% of a well-tuned OSS Spark benchmark after cluster policies are applied.
- Unity Catalog replaces legacy Hive-metastore + ACL code in all domains within two quarters.
We will also revisit this decision for new workloads if Fabric Spark reaches Gov-GA and matches Databricks feature parity (tracked in ADR-0010).
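The first validation criterion can be made mechanical. A minimal sketch, assuming per-run DBU and VM costs are exported from billing data; the 25% threshold comes from the criterion above, and the function names are our own, not an existing tool.

```python
def cost_per_tb(dbu_cost: float, vm_cost: float, tb_processed: float) -> float:
    """All-in Spark job cost (DBU premium plus VM spend) per terabyte processed."""
    return (dbu_cost + vm_cost) / tb_processed


def within_budget(databricks_cost_per_tb: float, oss_baseline_per_tb: float,
                  tolerance: float = 0.25) -> bool:
    """True if Databricks cost/TB is within `tolerance` of the tuned OSS Spark benchmark."""
    return databricks_cost_per_tb <= oss_baseline_per_tb * (1 + tolerance)
```

For example, a run costing $40 in DBUs and $60 in VMs over 10 TB yields $10/TB, which passes against an $8.50/TB OSS baseline at the 25% tolerance; doubling the DBU spend would fail it.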
## References
- Decision tree: Fabric vs. Databricks vs. Synapse
- Related code: `deploy/bicep/DMLZ/modules/Databricks/databricks.bicep`, `deploy/bicep/DLZ/modules/databricks/databricks.bicep`, `deploy/bicep/gov/modules/databricks.bicep`, `domains/spark/configs/`
- Framework controls: NIST 800-53 AC-3 (Unity Catalog access enforcement), AU-12 (cluster audit logs to Log Analytics), SC-8 (encryption in transit via customer-managed keys). See `governance/compliance/nist-800-53-rev5.yaml`.
- Discussion: CSA-0087