Microsoft Purview Guide

See also: generic Azure reference

For service-agnostic deep-dive content on Microsoft Purview — architecture, feature reference, code samples, and patterns independent of CSA-in-a-Box — see Microsoft Purview in the reference library.

Purview is the data governance hub for CSA-in-a-Box — providing data cataloging, lineage, classification, and access policies across the entire medallion architecture. See ADR-0006 for the decision rationale.


Why Purview

Microsoft Purview was chosen over Apache Atlas, DataHub, and Collibra because it is generally available in Azure Government with FedRAMP High inheritance, integrates natively with Microsoft Information Protection (MIP) sensitivity labels, and provides built-in scanners for every service in the CSA stack (ADLS, Databricks, Synapse, SQL Server, Power BI). Purview's Entra ID RBAC fits the platform's existing persona model, and its metadata carries forward into Fabric Purview when tenants migrate.


Architecture Overview

graph TD
    subgraph "Data Sources"
        ADLS[(ADLS Gen2<br/>Bronze / Silver / Gold)]
        DBR[Databricks<br/>Unity Catalog]
        SQL[Azure SQL]
        COSMOS[Cosmos DB]
        PBI[Power BI<br/>Semantic Models]
        S3[(AWS S3<br/>Cross-cloud)]
    end

    subgraph "Microsoft Purview"
        DM[Data Map<br/>Scanners + Ingestion]
        CAT[Catalog<br/>Search + Glossary]
        LIN[Lineage<br/>ADF + dbt + Databricks]
        CLS[Classification<br/>Built-in + Custom]
        POL[Policies<br/>Access + DevOps]
        COL[Collections<br/>RBAC Hierarchy]
    end

    subgraph Consumers
        ANALYST[Data Analysts]
        STEWARD[Data Stewards]
        ENGINEER[Data Engineers]
        AUDITOR[Compliance / Auditors]
    end

    ADLS --> DM
    DBR --> DM
    SQL --> DM
    COSMOS --> DM
    PBI --> DM
    S3 --> DM

    DM --> CAT
    DM --> LIN
    DM --> CLS
    DM --> POL
    DM --> COL

    CAT --> ANALYST
    LIN --> AUDITOR
    CLS --> STEWARD
    POL --> ENGINEER

Setup

Account Creation (Bicep)

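// Parameters the resource references; declare them here if the enclosing
// template does not already provide them.
param environment string
param location string = resourceGroup().location

resource purviewAccount 'Microsoft.Purview/accounts@2021-12-01' = {
  name: 'pview-csa-${environment}-${location}'
  location: location
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    // Private-endpoint-only access; no public ingress to the account
    publicNetworkAccess: 'Disabled'
    managedResourceGroupName: 'rg-pview-managed-${environment}'
  }
}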

Managed Identity Permissions

Purview's system-assigned managed identity needs read access to every source it scans. Grant these roles before configuring scans.

| Source | Required role | Scope |
| --- | --- | --- |
| ADLS Gen2 | Storage Blob Data Reader | Storage account |
| Azure SQL | db_datareader (SQL role) | Database |
| Databricks | Unity Catalog Account Admin or Metastore Admin | Databricks account |
| Cosmos DB | Cosmos DB Account Reader | Cosmos account |
| Power BI | Fabric Administrator or workspace member | Tenant / workspace |
| Key Vault | Key Vault Secrets User | Key Vault (for scan credentials) |

# Grant ADLS read access to Purview managed identity
PURVIEW_MI=$(az purview account show \
  --name "pview-csa-dev-eastus" \
  --resource-group "rg-dmlz-dev" \
  --query "identity.principalId" -o tsv)

az role assignment create \
  --assignee "$PURVIEW_MI" \
  --role "Storage Blob Data Reader" \
  --scope "/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.Storage/storageAccounts/{sa}"

Data Map

Registering Sources

Register each data source in Purview's Data Map before scanning.

graph LR
    REG[Register Source] --> SCAN[Configure Scan]
    SCAN --> RULE[Apply Scan Rule Set]
    RULE --> SCHED[Set Schedule]
    SCHED --> RUN[Run Scan]
    RUN --> ASSETS[Assets in Catalog]

Source Registration Examples

| Source | Registration type | Key configuration |
| --- | --- | --- |
| ADLS Gen2 | Azure Data Lake Storage Gen2 | Storage account URL; collection assignment |
| Databricks | Azure Databricks | Workspace URL; Unity Catalog metastore |
| Azure SQL | Azure SQL Database | Server FQDN; database name |
| Cosmos DB | Azure Cosmos DB | Account URL; database(s) to scan |
| Power BI | Power BI | Tenant-wide or specific workspaces |
| AWS S3 | Amazon S3 | Bucket ARN; cross-account IAM role |

Scan Rule Sets

Scan rule sets determine which file types and classification rules to apply. CSA-in-a-Box uses a custom rule set that includes Delta Lake metadata and government-specific classifiers.

{
    "name": "csa-scan-rules",
    "kind": "AzureStorage",
    "scanRulesetType": "Custom",
    "properties": {
        "fileExtensions": [".parquet", ".delta", ".json", ".csv", ".avro"],
        "classificationRuleNames": [
            "MICROSOFT.GOVERNMENT.US_SOCIAL_SECURITY_NUMBER",
            "MICROSOFT.FINANCIAL.CREDIT_CARD_NUMBER",
            "CSA.GOVERNMENT.CUI_MARKING",
            "CSA.GOVERNMENT.FOUO_MARKING"
        ]
    }
}
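
The rule set above references two `CSA.`-prefixed classification rules, which only resolve if the matching custom classifiers already exist. The following is a minimal sketch of a pre-deployment check, assuming the simplified JSON shape shown in this guide rather than the full REST payload. The helper name `missing_custom_rules` and the convention that system rules start with `MICROSOFT.` are assumptions made for illustration.

```python
import json

# Rule set copied from the example above (simplified document shape).
RULE_SET = json.loads("""
{
    "name": "csa-scan-rules",
    "kind": "AzureStorage",
    "scanRulesetType": "Custom",
    "properties": {
        "fileExtensions": [".parquet", ".delta", ".json", ".csv", ".avro"],
        "classificationRuleNames": [
            "MICROSOFT.GOVERNMENT.US_SOCIAL_SECURITY_NUMBER",
            "MICROSOFT.FINANCIAL.CREDIT_CARD_NUMBER",
            "CSA.GOVERNMENT.CUI_MARKING",
            "CSA.GOVERNMENT.FOUO_MARKING"
        ]
    }
}
""")


def missing_custom_rules(rule_set, defined_custom_rules, system_prefix="MICROSOFT."):
    """Return custom rule names the rule set references but that are not defined yet."""
    referenced = rule_set["properties"]["classificationRuleNames"]
    # Assumption: anything without the system prefix is a custom rule.
    custom = [name for name in referenced if not name.startswith(system_prefix)]
    return [name for name in custom if name not in defined_custom_rules]


# Hypothetical state: only the CUI classifier has been created so far.
print(missing_custom_rules(RULE_SET, {"CSA.GOVERNMENT.CUI_MARKING"}))
```

Running a check like this in CI before publishing the rule set would surface a missing classifier early instead of at scan time.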

Catalog

Business Glossary

The glossary provides a controlled vocabulary for business terms. CSA-in-a-Box organizes glossary terms by domain.

| Term hierarchy | Example |
| --- | --- |
| Category | Finance, Healthcare, Defense |
| Term | Revenue, Patient Record, Classification Level |
| Synonym | Revenue = Sales, Turnover |
| Related term | Revenue is related to Cost of Goods Sold |

Collections Hierarchy

Collections control both organization and RBAC. CSA-in-a-Box uses three levels.

Root Collection (CSA-in-a-Box)
├── Production
│   ├── Finance
│   ├── Healthcare
│   └── Defense
├── Staging
│   ├── Finance
│   └── Healthcare
└── Development
    └── Sandbox

RBAC Inheritance

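Permissions flow downward. A user with Data Reader on Production can browse assets in Finance, Healthcare, and Defense unless a child collection restricts inherited permissions. Collections have no explicit deny assignment; restricting inheritance on the child is the mechanism for cutting off access granted higher up.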
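
The inheritance rule can be illustrated with a small in-memory model. This is a sketch for reasoning about the hierarchy, not Purview's implementation: the `Collection` class, the `has_role` helper, and the `restrict_inherited` flag (standing in for a child collection that opts out of permissions granted higher up) are all hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Collection:
    name: str
    parent: Optional["Collection"] = None
    # role name -> principals assigned directly on this collection
    assignments: Dict[str, Set[str]] = field(default_factory=dict)
    # Hypothetical flag: this collection does not inherit from its parent
    restrict_inherited: bool = False


def has_role(collection: Collection, principal: str, role: str) -> bool:
    """Walk up the hierarchy until an assignment is found or inheritance stops."""
    node: Optional[Collection] = collection
    while node is not None:
        if principal in node.assignments.get(role, set()):
            return True
        if node.restrict_inherited:
            return False  # nothing above this collection applies
        node = node.parent
    return False


root = Collection("CSA-in-a-Box")
production = Collection("Production", parent=root,
                        assignments={"Data Reader": {"analyst@example.gov"}})
staging = Collection("Staging", parent=root)
finance = Collection("Finance", parent=production)
defense = Collection("Defense", parent=production, restrict_inherited=True)

print(has_role(finance, "analyst@example.gov", "Data Reader"))
print(has_role(defense, "analyst@example.gov", "Data Reader"))
print(has_role(staging, "analyst@example.gov", "Data Reader"))
```

In this model the analyst can read Finance through the Production assignment, but not Defense (inheritance is cut off there) and not Staging (a sibling branch).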

Purview search supports faceted queries across the entire data estate.

# Find Gold-layer assets that carry an SSN classification
qualifiedName:*gold* AND classification:*SSN*

Lineage

Automatic Lineage Sources

| Source | Lineage mechanism | Configuration |
| --- | --- | --- |
| Azure Data Factory | Built-in (automatic) | ADF linked to Purview; no extra setup |
| Databricks | Unity Catalog + OpenLineage | Enable Purview connector in workspace settings |
| Synapse Analytics | Built-in (automatic) | Synapse workspace linked to Purview |
| dbt | OpenLineage events via Purview REST API | dbt-purview integration or custom manifest parser |
| Power BI | Built-in scan | Power BI tenant connected to Purview |

dbt Lineage Integration

"""Register dbt models in Purview via the Atlas API endpoint."""
import json

from azure.identity import DefaultAzureCredential
from azure.purview.catalog import PurviewCatalogClient

credential = DefaultAzureCredential()
client = PurviewCatalogClient(
    endpoint="https://pview-csa-dev-eastus.purview.azure.com",
    credential=credential,
)

# Parse dbt manifest.json for model metadata
with open("target/manifest.json", encoding="utf-8") as f:
    manifest = json.load(f)

for node_id, node in manifest["nodes"].items():
    if node["resource_type"] != "model":
        continue
    # Upsert one entity per dbt model. This assumes a custom "dbt_model" type
    # definition is already registered in the account. The entity alone draws
    # no lineage edge: edges come from Process entities whose inputs and
    # outputs reference these qualified names.
    entity = {
        "typeName": "dbt_model",
        "attributes": {
            "qualifiedName": f"dbt://{node['database']}.{node['schema']}.{node['name']}",
            "name": node["name"],
            "description": node.get("description", ""),
        },
    }
    client.entity.create_or_update(entity={"entity": entity})

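
The script above registers each dbt model as an entity; in the Atlas model the lineage edges themselves are carried by `Process` entities that list input and output datasets. The sketch below builds those payloads from the `depends_on` section of the manifest without calling Purview. It assumes the same `dbt://` qualified-name scheme, a `dbt_model` type that already exists, and a hypothetical `dbt://process/...` naming convention. Only model-to-model edges are handled.

```python
from typing import Any, Dict, List


def qualified_name(node: Dict[str, Any]) -> str:
    """Same naming scheme as the registration script above."""
    return f"dbt://{node['database']}.{node['schema']}.{node['name']}"


def model_ref(node: Dict[str, Any]) -> Dict[str, Any]:
    """Reference an already-registered dbt_model entity by qualified name."""
    return {
        "typeName": "dbt_model",
        "uniqueAttributes": {"qualifiedName": qualified_name(node)},
    }


def build_process_entities(manifest: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Build one Process entity per model that has upstream models."""
    nodes = manifest.get("nodes", {})
    processes = []
    for node_id, node in nodes.items():
        if node.get("resource_type") != "model":
            continue
        upstream_ids = node.get("depends_on", {}).get("nodes", [])
        # dbt sources live under manifest["sources"] and are skipped here.
        inputs = [
            model_ref(nodes[uid])
            for uid in upstream_ids
            if nodes.get(uid, {}).get("resource_type") == "model"
        ]
        if not inputs:
            continue
        processes.append({
            "typeName": "Process",
            "attributes": {
                # Hypothetical naming convention for the process entity.
                "qualifiedName": f"dbt://process/{node_id}",
                "name": f"dbt build {node['name']}",
                "inputs": inputs,
                "outputs": [model_ref(node)],
            },
        })
    return processes


# Tiny inline manifest standing in for target/manifest.json.
SAMPLE_MANIFEST = {
    "nodes": {
        "model.csa.stg_orders": {
            "resource_type": "model", "database": "lake", "schema": "silver",
            "name": "stg_orders",
            "depends_on": {"nodes": ["source.csa.raw.orders"]},
        },
        "model.csa.fct_orders": {
            "resource_type": "model", "database": "lake", "schema": "gold",
            "name": "fct_orders",
            "depends_on": {"nodes": ["model.csa.stg_orders"]},
        },
    }
}

for process in build_process_entities(SAMPLE_MANIFEST):
    print(process["attributes"]["qualifiedName"])
```

Each returned dictionary could then be passed to the same `create_or_update` call used for the model entities. Extending the sketch to dbt sources would mean resolving `source.*` identifiers against `manifest["sources"]` and mapping them to the assets Purview discovered by scanning.
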
graph LR
    SRC[Source Tables<br/>Bronze] -->|ADF Copy| STG[Staging<br/>Silver]
    STG -->|dbt model| GOLD[Gold Tables]
    GOLD -->|Power BI scan| PBI[Semantic Model]
    PBI --> RPT[Report / Dashboard]

    style SRC fill:#cd7f32
    style STG fill:#c0c0c0
    style GOLD fill:#ffd700

Classification

Built-in Classifiers

Purview ships with 200+ system classifiers. The most relevant for government workloads:

| Classifier | Detects | Sensitivity |
| --- | --- | --- |
| US Social Security Number | SSN patterns (XXX-XX-XXXX) | Confidential |
| Credit Card Number | Visa, MC, Amex patterns | Confidential |
| US Passport Number | Passport format | Highly Confidential |
| IP Address | IPv4 / IPv6 addresses | Internal |
| Email Address | RFC 5322 patterns | Internal |

Custom Classifiers for Government Data

{
    "name": "CSA.GOVERNMENT.CUI_MARKING",
    "kind": "Custom",
    "properties": {
        "classificationName": "CUI Marking",
        "description": "Controlled Unclassified Information marking per NIST SP 800-171",
        "pattern": {
            "kind": "Regex",
            "pattern": "(?i)(CUI|CONTROLLED UNCLASSIFIED|NOFORN|FOUO|LAW ENFORCEMENT SENSITIVE)"
        },
        "minimumPercentageMatch": 60
    }
}

| Custom classifier | Pattern target | Regulation |
| --- | --- | --- |
| CSA.GOVERNMENT.CUI_MARKING | CUI, CONTROLLED UNCLASSIFIED | NIST SP 800-171 |
| CSA.GOVERNMENT.FOUO_MARKING | FOR OFFICIAL USE ONLY | DoD 5200.01 |
| CSA.GOVERNMENT.ITAR_MARKING | ITAR, USML, defense articles | 22 CFR 120-130 |
| CSA.GOVERNMENT.HIPAA_PHI | Protected Health Information patterns | HIPAA Security Rule |
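
Before registering a custom classifier, it can help to dry-run its regex against sample column values. The sketch below reuses the CUI pattern and the 60% threshold from the JSON definition above. It assumes substring (search) semantics and a simple share-of-values calculation, which may not match how Purview evaluates `minimumPercentageMatch` internally. The helper names are hypothetical.

```python
import re

# Pattern and threshold copied from the CSA.GOVERNMENT.CUI_MARKING definition.
CUI_PATTERN = re.compile(
    r"(?i)(CUI|CONTROLLED UNCLASSIFIED|NOFORN|FOUO|LAW ENFORCEMENT SENSITIVE)"
)
MINIMUM_PERCENTAGE_MATCH = 60


def match_percentage(values):
    """Share of sample values (0-100) containing at least one marking."""
    if not values:
        return 0.0
    hits = sum(1 for value in values if CUI_PATTERN.search(value))
    return 100.0 * hits / len(values)


def would_classify(values):
    """True when the sample meets the classifier's minimum match threshold."""
    return match_percentage(values) >= MINIMUM_PERCENTAGE_MATCH


marked_column = ["CUI//SP-PRVCY", "controlled unclassified", "FOUO",
                 "public release", "n/a"]
unmarked_column = ["CUI", "public", "public", "public"]

print(match_percentage(marked_column), would_classify(marked_column))
print(match_percentage(unmarked_column), would_classify(unmarked_column))

# The pattern has no word boundaries, so ordinary words can match too.
print(bool(CUI_PATTERN.search("circuit board")))
```

The last line shows a false positive: "circuit" contains the letters "cui". Under the search semantics assumed here, wrapping the short tokens in `\b` word boundaries would be one way to tighten the pattern.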

Sensitivity Labels

Purview integrates with Microsoft Information Protection to apply sensitivity labels that follow data across the estate.

Label hierarchy:
├── Public
├── Internal
├── Confidential
│   ├── Confidential - PII
│   └── Confidential - Financial
└── Highly Confidential
    ├── Highly Confidential - CUI
    └── Highly Confidential - ITAR

Label Propagation

Sensitivity labels applied in Purview propagate to Power BI datasets automatically. A Highly Confidential label on a Gold table will appear on any Power BI semantic model built on that table.


Data Access Policies

Policy Types

| Policy type | Purpose | Scope |
| --- | --- | --- |
| Self-service access | Users request access through the catalog | Per-asset |
| DevOps policies | Grant diagnostic access to ops teams | Resource group |
| Data owner policies | Data owners grant/deny access | Collection |

Self-Service Access

Enable self-service access policies so analysts can request access to datasets through the Purview portal. Data stewards approve or deny requests, creating an auditable access trail for compliance.


Purview + Unity Catalog

CSA-in-a-Box uses a dual governance pattern: Unity Catalog governs data inside the Databricks workspace (fine-grained ACLs, row/column filters), while Purview provides the enterprise-wide catalog that non-Databricks consumers (SQL, Power BI, Fabric) use.

graph LR
    UC[Unity Catalog<br/>Databricks-internal governance] -->|Federation| PV[Purview<br/>Enterprise catalog]
    PV --> PBI[Power BI]
    PV --> SQL[Azure SQL]
    PV --> FAB[Fabric]
    UC --> DBR[Databricks Users]

No Duplication

Unity Catalog metadata is federated into Purview, not duplicated. Purview scans the Unity Catalog metastore and creates linked assets — the source of truth for table definitions remains Unity Catalog.


Purview + Fabric

When tenants migrate to Fabric (ADR-0010), Purview metadata moves forward:

  • Fabric Purview is a superset of standalone Purview
  • OneLake scanning replaces ADLS scanning
  • Direct Lake semantic models appear automatically in the catalog
  • Sensitivity labels propagate through OneLake to Fabric workloads

Monitoring

Key Metrics

| Metric | What to watch | Alert condition |
| --- | --- | --- |
| Scan success rate | Percentage of successful scans | < 95% over 7 days |
| Asset count growth | New assets discovered per scan | Sudden drop (source disconnected) |
| Classification coverage | % of assets with classifications | < 80% for sensitive collections |
| Glossary term adoption | Assets linked to glossary terms | < 50% for Gold layer |
| Lineage completeness | % of Gold tables with upstream lineage | < 90% |
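
The percentage-based alert conditions above translate directly into threshold checks. The following is a minimal sketch, assuming the metric values have already been computed elsewhere (for example, from Log Analytics queries). The metric keys and the `breached` helper are hypothetical names. The "sudden drop" condition for asset count growth needs a baseline and is left out.

```python
from typing import Dict, List

# Minimum acceptable percentage per metric, taken from the table above.
THRESHOLDS: Dict[str, float] = {
    "scan_success_rate": 95.0,        # over a 7-day window
    "classification_coverage": 80.0,  # sensitive collections
    "glossary_term_adoption": 50.0,   # Gold layer
    "lineage_completeness": 90.0,     # Gold tables with upstream lineage
}


def breached(metrics: Dict[str, float]) -> List[str]:
    """Return the names of reported metrics that fall below their threshold."""
    return sorted(
        name
        for name, floor in THRESHOLDS.items()
        if name in metrics and metrics[name] < floor
    )


# Hypothetical weekly snapshot.
snapshot = {
    "scan_success_rate": 97.5,
    "classification_coverage": 72.0,
    "glossary_term_adoption": 50.0,
    "lineage_completeness": 88.0,
}
print(breached(snapshot))
```

A value exactly at the threshold (glossary adoption at 50% in this snapshot) does not alert, matching the strict "less than" conditions in the table.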

Cost Optimization

| Cost driver | Optimization | Impact |
| --- | --- | --- |
| Scanner capacity units | Scope scans to specific collections, not tenant-wide | 30-60% reduction |
| Scan frequency | Daily for production, weekly for dev/staging | 50% reduction |
| Classification rules | Disable irrelevant system classifiers | Faster scans |
| Data Map population | Register only governed sources, not every storage account | Fewer assets to manage |

Scanner Scope

ADR-0006 warns that if scanner cost exceeds 10% of storage cost, revisit scanner scope and cadence — not the catalog choice. Scope scans at the domain/collection level rather than tenant-wide.
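
The 10% guideline is a simple ratio check. The sketch below assumes both costs are already available for the same billing period and in the same currency; the function name and default limit parameter are illustrative, not taken from ADR-0006.

```python
def scanner_scope_needs_review(scanner_cost: float, storage_cost: float,
                               limit: float = 0.10) -> bool:
    """True when scanner spend exceeds the given fraction of storage spend."""
    if storage_cost <= 0:
        raise ValueError("storage_cost must be positive")
    return scanner_cost / storage_cost > limit


# Hypothetical monthly figures.
print(scanner_scope_needs_review(scanner_cost=1200.0, storage_cost=10000.0))
print(scanner_scope_needs_review(scanner_cost=800.0, storage_cost=10000.0))
```

With these figures, the first call reports that scope and cadence need a second look (12% of storage cost), while the second stays within the guideline (8%).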


Anti-Patterns

Don't: Scan everything at the tenant level

Tenant-wide scans discover assets you do not govern, inflating costs and cluttering the catalog. Register specific sources and scope scans to relevant collections.

Don't: Skip the business glossary

Without glossary terms, the catalog is a technical inventory — not a governance tool. Invest in glossary terms early; they compound in value as adoption grows.

Don't: Rely solely on automatic classification

System classifiers catch common patterns (SSN, credit cards) but miss domain-specific sensitive data. Create custom classifiers for CUI, FOUO, ITAR, and any organization-specific data categories.

Do: Federate Unity Catalog into Purview

Maintain one source of truth per domain. Databricks teams use Unity Catalog; everyone else discovers the same data through Purview.

Do: Enable sensitivity label propagation

Labels applied in Purview should flow to Power BI and downstream consumers automatically. This is the primary mechanism for demonstrating data classification compliance.


Checklist

  • Purview account deployed via DMLZ Bicep templates
  • Managed identity granted read access to all governed sources
  • Collection hierarchy created (Organization > Environment > Domain)
  • Sources registered: ADLS, Databricks, SQL, Cosmos DB, Power BI
  • Custom classifiers created for CUI, FOUO, ITAR
  • Scan rule sets configured with Delta Lake file types
  • Scan schedules configured (daily production, weekly dev)
  • Business glossary seeded with initial domain terms
  • Sensitivity labels configured and propagation enabled
  • dbt lineage integration tested (manifest parser or OpenLineage)
  • Unity Catalog federation verified
  • Self-service access policies enabled
  • Diagnostic settings forwarding to Log Analytics