# ADR 0002 — Azure Databricks over open-source Spark-on-AKS for heavy compute
## Context and Problem Statement
Medallion transformations, large-scale enrichment, and ML feature engineering require a distributed Spark runtime. Customers need a predictable Spark experience in both Azure Commercial and Azure Government, with a credible story for governance (Unity Catalog), performance (Photon), and cost control (job clusters with auto-termination). We must pick a primary compute engine before the Databricks-specific Bicep modules (see `deploy/bicep/DMLZ/modules/Databricks/databricks.bicep`) are finalized.
## Decision Drivers
- Azure Government availability for the Spark runtime with FedRAMP High authorization inheritance.
- Total cost of ownership — we prefer a managed runtime over customer-run AKS Spark operators that need 24x7 platform engineering.
- Governance — native integration with Unity Catalog and Purview for row/column lineage, classification propagation, and table-level ACLs.
- Performance — Photon + Delta Lake optimizations materially reduce query latency for Silver/Gold.
- Composability — the choice must not lock in proprietary transformation code; dbt and PySpark are both portable.
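The "job clusters with auto-termination" cost model behind the TCO driver can be illustrated with a Databricks Jobs API 2.1 payload: a job cluster is created per run and released when the run finishes, so no idle cost accrues. This is a hedged sketch — the job name, notebook path, node type, and worker count are illustrative assumptions, not values from this repo.

```python
# Sketch of a Jobs API 2.1 create payload using an ephemeral job cluster.
# All names, paths, and sizes below are illustrative assumptions.
job_payload = {
    "name": "silver-enrichment",
    "job_clusters": [
        {
            "job_cluster_key": "heavy_compute",
            "new_cluster": {
                "spark_version": "14.3.x-scala2.12",
                "node_type_id": "Standard_E8ds_v5",
                "num_workers": 8,
                # A job cluster is torn down when the run completes, so
                # cost stops with the job rather than with an idle timeout.
            },
        }
    ],
    "tasks": [
        {
            "task_key": "enrich",
            "job_cluster_key": "heavy_compute",
            "notebook_task": {"notebook_path": "/Repos/domains/spark/silver_enrichment"},
        }
    ],
}

# Cheap structural check: every task must reference a defined job cluster.
defined = {c["job_cluster_key"] for c in job_payload["job_clusters"]}
assert all(t["job_cluster_key"] in defined for t in job_payload["tasks"])
```

The same payload shape works against both Commercial and Government workspace endpoints, which is part of why the managed runtime satisfies the first two drivers at once.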
## Considered Options
- Azure Databricks (chosen) — Managed Spark, Unity Catalog, Photon, Delta Lake native, Gov-GA, strong Purview integration.
- Open-source Apache Spark on AKS — Full control, no platform markup, but customer-owned HA, upgrades, and autoscaling.
- Azure Synapse Spark Pools — Managed, integrated with Synapse SQL, but less aggressive innovation cadence and weaker Unity-Catalog-equivalent governance.
- Microsoft Fabric Spark — Strategic target (see ADR-0010) but Gov availability lags Commercial by quarters to a year.
## Decision Outcome
Chosen: Option 1 — Azure Databricks as the primary heavy-compute engine, with Synapse Spark permitted for tenants that have an existing Synapse footprint and Fabric Spark planned as a migration target once Gov-GA lands.
### Consequences
- Positive: Managed service, auto-termination controls cost, Photon gives real speedups on Delta, Unity Catalog gives fine-grained access control and lineage without custom code.
- Positive: Gov-GA available with FedRAMP High inheritance from Microsoft.
- Positive: PySpark + dbt transformations remain portable to Fabric Spark or OSS Spark if we migrate later.
- Negative: Per-DBU premium on top of VM cost; requires active cluster-policy enforcement to stop sprawl (tracked in the cluster policies in `deploy/bicep/DMLZ/modules/Databricks/databricks.bicep`).
- Negative: Workspace sprawl if one workspace per domain becomes the default — mitigated by Unity-Catalog-scoped workspaces.
- Negative: Some Databricks-specific APIs (e.g. SQL warehouses, Jobs 2.1) are non-portable; we cap their use to orchestration glue, not business logic.
- Neutral: Does not preclude a future migration to Fabric Spark; Delta tables and Unity Catalog entries are first-class in Fabric OneLake.
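The cluster-policy mitigation above can be expressed as data plus a small conformance check. The key names (`autotermination_minutes`, `num_workers`, `node_type_id`) and the `range`/`allowlist` rule types follow the Databricks cluster-policy JSON schema, but the specific limits and the `validate_cluster_spec` helper are illustrative assumptions, not the finalized policy from the Bicep module.

```python
# Hypothetical cluster policy: bounded auto-termination, capped size,
# and an allowlist of node SKUs. Limits are illustrative assumptions.
HEAVY_COMPUTE_POLICY = {
    "autotermination_minutes": {"type": "range", "minValue": 10, "maxValue": 60},
    "num_workers": {"type": "range", "maxValue": 20},
    "node_type_id": {"type": "allowlist", "values": ["Standard_E8ds_v5", "Standard_E16ds_v5"]},
}


def validate_cluster_spec(spec: dict, policy: dict) -> list:
    """Return a list of human-readable policy violations for a proposed cluster spec."""
    violations = []
    for key, rule in policy.items():
        value = spec.get(key)
        if value is None:
            violations.append(f"{key}: missing (policy-controlled)")
        elif rule["type"] == "range":
            if "minValue" in rule and value < rule["minValue"]:
                violations.append(f"{key}: {value} below minimum {rule['minValue']}")
            if "maxValue" in rule and value > rule["maxValue"]:
                violations.append(f"{key}: {value} above maximum {rule['maxValue']}")
        elif rule["type"] == "allowlist" and value not in rule["values"]:
            violations.append(f"{key}: {value!r} not in allowlist")
    return violations
```

A spec with `autotermination_minutes: 0`, an oversized worker count, or an off-list SKU fails all three rules; in practice the Databricks control plane enforces the policy at cluster-creation time, and a check like this only belongs in CI linting of job definitions.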
## Pros and Cons of the Options
### Option 1 — Azure Databricks
- Pros: Managed runtime; Photon; Unity Catalog; Gov-GA; strong Purview integration; Delta Lake-native; mature job scheduler.
- Cons: DBU markup; SQL Warehouses are proprietary; workspace proliferation risk.
### Option 2 — OSS Spark on AKS
- Pros: No DBU premium; full version control; portable everywhere.
- Cons: Customer-owned HA, upgrades, autoscaling, and governance; no equivalent to Unity Catalog; no Photon.
### Option 3 — Synapse Spark Pools
- Pros: Managed; integrated with Synapse SQL; Gov-GA; cheaper DBU-free pricing model.
- Cons: Slower innovation; no Photon equivalent; Purview lineage is shallower; tighter coupling to a Synapse workspace.
### Option 4 — Fabric Spark
- Pros: Strategic future target; OneLake-native; deep Purview + Fabric governance integration.
- Cons: Gov-GA lags; not viable for current federal tenants.
## Validation
We will know this decision is right if:
- Spark job cost per TB processed is within 25% of a well-tuned OSS Spark benchmark after cluster policies are applied.
- Unity Catalog replaces legacy Hive-metastore + ACL code in all domains within two quarters.
We will also revisit this decision for new workloads if Fabric Spark reaches Gov-GA and matches Databricks feature parity (tracked in ADR-0010).
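The first validation criterion can be made mechanical. A minimal sketch, assuming per-run DBU and VM costs are exported from billing data; the 25% threshold comes from the criterion above, and the function names are our own, not an existing tool.

```python
def cost_per_tb(dbu_cost: float, vm_cost: float, tb_processed: float) -> float:
    """All-in Spark job cost (DBU premium plus VM spend) per terabyte processed."""
    return (dbu_cost + vm_cost) / tb_processed


def within_budget(databricks_cost_per_tb: float, oss_baseline_per_tb: float,
                  tolerance: float = 0.25) -> bool:
    """True if Databricks cost/TB is within `tolerance` of the tuned OSS Spark benchmark."""
    return databricks_cost_per_tb <= oss_baseline_per_tb * (1 + tolerance)
```

For example, a run costing $40 in DBUs and $60 in VMs over 10 TB yields $10/TB, which passes against an $8.50/TB OSS baseline at the 25% tolerance; doubling the DBU spend would fail it.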
## References
- Decision tree: Fabric vs. Databricks vs. Synapse
- Related code: `deploy/bicep/DMLZ/modules/Databricks/databricks.bicep`, `deploy/bicep/DLZ/modules/databricks/databricks.bicep`, `deploy/bicep/gov/modules/databricks.bicep`, `domains/spark/configs/`
- Framework controls: NIST 800-53 AC-3 (Unity Catalog access enforcement), AU-12 (cluster audit logs to Log Analytics), SC-8 (encryption in transit via customer-managed keys). See `governance/compliance/nist-800-53-rev5.yaml`.
- Discussion: CSA-0087