ADR 0006 — Microsoft Purview over Apache Atlas for data catalog and lineage
Context and Problem Statement
Federal customers require a data catalog that carries classification, sensitivity labels, glossary terms, and end-to-end lineage from source to consumption. The catalog must integrate with Microsoft Information Protection (MIP) labels, propagate classifications to downstream stores, and produce evidence for ATO (NIST 800-53 CA-7, AC-3, AU-6). We must pick a single catalog system that is operational on day one in Azure Government.
Decision Drivers
- Azure Government availability with FedRAMP High.
- MIP sensitivity label propagation across storage, dataflows, and Power BI.
- Lineage coverage for ADF activities, dbt models, Databricks Unity Catalog, and SQL Server/Synapse sources without writing custom bridges.
- RBAC integration with Entra ID and the platform's existing persona model (csa_platform/csa_platform/governance/rbac/).
- Operational burden: prefer managed PaaS so customers do not run Atlas + Solr + HBase themselves.
Considered Options
- Microsoft Purview (chosen) — Managed PaaS, Gov-GA, native Azure scanners, MIP integration, Unity Catalog federation.
- Apache Atlas — Open-source, extensible type system, HDFS-era heritage, requires customer-run Solr + HBase or equivalent.
- DataHub (Acryl) — Modern open-source catalog, plugin-rich, but smaller Azure ecosystem and no MIP integration.
- Collibra — Enterprise catalog with strong business-glossary story, but third-party procurement + no native Azure Gov deployment.
Considered but out of scope
- Unity Catalog as the only catalog — we use Unity Catalog inside the Databricks blast radius (ADR-0002) but federate it into Purview so non-Databricks consumers (SQL, Power BI, Fabric) see the same lineage.
Decision Outcome
Chosen: Option 1 — Microsoft Purview as the platform catalog, with Unity Catalog federated in and dbt emitting OpenLineage events through the Purview connector.
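To make the dbt lineage path concrete, the following is a minimal sketch of the kind of OpenLineage run event a dbt model run would emit. It only shows the event shape. How the Purview connector ingests such events is not described in this ADR, and the function name, the namespace, the producer URI, the dataset names, and the spec version in schemaURL are illustrative assumptions.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(job_name, inputs, outputs,
                    namespace="dbt://csa-platform",  # hypothetical namespace
                    event_type="COMPLETE"):
    """Build an OpenLineage-style run event as a plain dict.

    Top-level field names follow the OpenLineage RunEvent shape
    (eventType, eventTime, run, job, inputs, outputs, producer, schemaURL).
    All values are placeholders chosen for this sketch.
    """
    def dataset(name):
        return {"namespace": namespace, "name": name}

    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [dataset(n) for n in inputs],
        "outputs": [dataset(n) for n in outputs],
        # Hypothetical producer URI; a real emitter identifies itself here.
        "producer": "https://example.invalid/csa-platform/dbt-lineage-sketch",
        # Spec version is illustrative; pin the version the emitter targets.
        "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/$defs/RunEvent",
    }


if __name__ == "__main__":
    event = build_run_event(
        job_name="silver.orders",          # hypothetical dbt model
        inputs=["bronze.orders_raw"],      # hypothetical upstream table
        outputs=["silver.orders"],
    )
    print(json.dumps(event, indent=2))
```

Each event names one job with its input and output datasets, which is what lets a catalog stitch model-level lineage across the medallion layers.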
Consequences
- Positive: Managed PaaS in Azure Gov; FedRAMP High inheritance.
- Positive: Native scanners cover ADLS, SQL, Synapse, Power BI, and Unity Catalog, so lineage for the core stack requires no custom bridges.
- Positive: MIP label propagation is first-class; sensitivity follows the data across the medallion.
- Positive: Entra-ID-native RBAC fits the existing persona model.
- Negative: Open-source catalog extensibility is weaker than Atlas or DataHub — custom type definitions are possible but less ergonomic.
- Negative: Scanner cadence and capacity units are a real cost line; we manage scanner scope explicitly (domain-level, not tenant-level).
- Negative: Vendor lock-in to the Microsoft governance stack; if a tenant requires a neutral catalog, we pair Purview with DataHub exports rather than replacing it.
- Neutral: Does not preclude Fabric Purview migration — Fabric Purview is a superset and our metadata moves forward with us.
Pros and Cons of the Options
Option 1 — Microsoft Purview
- Pros: Gov-GA PaaS; MIP integration; native Azure scanners; Unity Catalog federation; Entra-ID RBAC; Fabric-ready.
- Cons: Weaker custom-type ergonomics; scanner-unit cost; ecosystem tightly coupled to Microsoft.
Option 2 — Apache Atlas
- Pros: Open-source; extensible type system; proven at scale in Hortonworks-era deployments.
- Cons: Customer-run Solr/HBase; no managed Azure offering; no MIP integration; contributor pool aging.
Option 3 — DataHub (Acryl)
- Pros: Modern architecture; strong plugin ecosystem; vibrant community; SaaS option via Acryl.
- Cons: Smaller Azure ecosystem; no MIP story; third-party SaaS procurement adds FedRAMP burden.
Option 4 — Collibra
- Pros: Best-in-class business glossary and stewardship workflows.
- Cons: Third-party SaaS; no Gov-GA deployment; significant licensing cost; not Azure-native.
Validation
We will know this decision is right if:
- Purview lineage covers ADF + dbt + Unity Catalog flows in every vertical example within one sprint of onboarding.
- MIP labels applied in Purview appear on downstream Power BI datasets without manual relabeling.
- Scanner cost stays at or below 10% of storage cost at every tenant. If it exceeds that threshold, we revisit scanner scope and cadence (not the catalog choice).
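Two of the criteria above reduce to simple checks. The sketch below shows one way they could be evaluated, assuming lineage flows can be listed as (source, target) pairs and that per-tenant scanner and storage costs are available as numbers in the same currency and period. The function names and sample values are hypothetical, not part of the platform.

```python
def lineage_coverage(expected_flows, observed_flows):
    """Return (coverage fraction, missing flows) for a vertical example.

    Both arguments are iterables of (source, target) pairs. How observed
    flows are pulled from the catalog is out of scope for this sketch.
    """
    expected = set(expected_flows)
    if not expected:
        return 1.0, set()
    missing = expected - set(observed_flows)
    return (len(expected) - len(missing)) / len(expected), missing


def scanner_cost_needs_review(scanner_cost, storage_cost, threshold=0.10):
    """True when scanner cost exceeds the threshold share of storage cost."""
    if storage_cost <= 0:
        raise ValueError("storage_cost must be positive")
    return scanner_cost / storage_cost > threshold


if __name__ == "__main__":
    expected = [
        ("adf.copy_orders", "bronze.orders_raw"),   # hypothetical ADF flow
        ("bronze.orders_raw", "silver.orders"),     # hypothetical dbt flow
        ("silver.orders", "gold.orders_daily"),     # hypothetical Unity Catalog flow
    ]
    observed = expected[:2]
    coverage, missing = lineage_coverage(expected, observed)
    print(f"coverage={coverage:.2f}, missing={sorted(missing)}")
    print("review scanner scope:", scanner_cost_needs_review(12.0, 100.0))
```

A coverage value below 1.0 points at the specific flows still missing from the catalog. A True result from the cost check triggers a review of scanner scope and cadence, not of the catalog choice.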
References
- Decision tree: n/a (catalog choice is cross-cutting; see architecture matrix)
- Related code: deploy/bicep/DMLZ/ (Purview provisioning), csa_platform/csa_platform/governance/compliance/nist-800-53-rev5.yaml (control mappings referencing Purview evidence), docs/PLATFORM_SERVICES.md (catalog narrative)
- Framework controls: NIST 800-53 CA-7 (continuous monitoring via Purview scan schedules), AC-3 (catalog-level access enforcement), AU-6 (audit record review via Purview lineage), SI-12 (information management and retention). See csa_platform/csa_platform/governance/compliance/nist-800-53-rev5.yaml.
- HIPAA Security Rule: §164.312(b) (audit controls), satisfied by Purview lineage records for PHI-tagged tables. See csa_platform/csa_platform/governance/compliance/hipaa-security-rule.yaml.
- Discussion: CSA-0087