Tutorial: Data Governance with Microsoft Purview

This tutorial walks you through setting up data governance for the CSA-in-a-Box platform using Microsoft Purview. By the end, you will have a configured data catalog with automated scanning, a business glossary, custom classifications, lineage tracking, quality monitoring, and access policies.

Time required: 2–3 hours
Prerequisites: DMLZ deployed, Azure CLI, Python 3.11+

Step 1: Verify Purview Deployment

The Purview account is deployed as part of the Data Management Landing Zone (DMLZ) Bicep templates at deploy/bicep/dmlz/modules/Purview/purview.bicep.

Check the deployment

# Set variables for your environment
export PURVIEW_RG="rg-dmlz-dev"
export DLZ_RG="rg-dlz-dev"
export ENV="dev"

# Find the Purview account
export PURVIEW_ACCOUNT=$(az purview account list \
  --resource-group "$PURVIEW_RG" \
  --query "[0].name" -o tsv)

echo "Purview account: $PURVIEW_ACCOUNT"

# Verify it's running
az purview account show \
  --name "$PURVIEW_ACCOUNT" \
  --resource-group "$PURVIEW_RG" \
  --query "{name:name, state:provisioningState, endpoint:endpoints.catalog}" \
  -o table

Expected output:

Name               State      Endpoint
-----------------  ---------  ------------------------------------------
csadmlzdevpview    Succeeded  https://csadmlzdevpview.purview.azure.com

Verify network access

export PURVIEW_ENDPOINT="https://$PURVIEW_ACCOUNT.purview.azure.com"
export TOKEN=$(az account get-access-token \
  --resource "https://purview.azure.net" \
  --query accessToken -o tsv)

# Test API connectivity
HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
  "$PURVIEW_ENDPOINT/account/collections?api-version=2019-11-01-preview" \
  -H "Authorization: Bearer $TOKEN")

echo "API status: $HTTP_STATUS"  # Should be 200

A 403 response means you need the Collection Admin role on the root collection. A connection error usually points to private endpoint DNS resolution, so check that first.
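The status handling described above can be sketched in Python. This is a minimal illustration only: the helper name `diagnose_status` and the wording of its hints are assumptions, not part of the platform scripts.

```python
from typing import Optional


def diagnose_status(http_status: Optional[int]) -> str:
    """Map the connectivity-test result to a next step (hypothetical helper)."""
    if http_status is None:
        # No HTTP response at all: the request never reached Purview.
        return "connection error: check private endpoint DNS resolution"
    if http_status == 200:
        return "ok: API reachable and authorized"
    if http_status == 403:
        return "forbidden: request Collection Admin on the root collection"
    return f"unexpected status {http_status}: inspect the response body"


for code in (200, 403, None):
    print(code, "->", diagnose_status(code))
```

Passing `None` stands in for the case where curl cannot connect and no status code is available.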

📖 Detailed reference: docs/governance/PURVIEW_SETUP.md

Step 2: Configure Collection Hierarchy

Collections organize your data assets and control access inheritance.

Option A: Use the bootstrap script (recommended)

./scripts/governance/bootstrap-purview.sh \
  --purview-account "$PURVIEW_ACCOUNT" \
  --purview-rg "$PURVIEW_RG" \
  --dlz-rg "$DLZ_RG" \
  --env "$ENV" \
  --dry-run  # Remove --dry-run when ready to apply

Option B: Create manually

# Create environment collections
# Note: ${ENV_NAME^} (capitalize first letter) requires bash 4 or later
for ENV_NAME in production staging development; do
  curl -s -X PUT \
    "$PURVIEW_ENDPOINT/account/collections/$ENV_NAME?api-version=2019-11-01-preview" \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{
      "friendlyName": "'"${ENV_NAME^}"'",
      "parentCollection": {
        "referenceName": "'"$PURVIEW_ACCOUNT"'",
        "type": "CollectionReference"
      }
    }'
done

# Create domain collections under Production
for DOMAIN in Finance Healthcare Environmental Transportation; do
  DOMAIN_LOWER=$(echo "$DOMAIN" | tr '[:upper:]' '[:lower:]')
  curl -s -X PUT \
    "$PURVIEW_ENDPOINT/account/collections/prod-$DOMAIN_LOWER?api-version=2019-11-01-preview" \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{
      "friendlyName": "'"$DOMAIN"'",
      "parentCollection": {
        "referenceName": "production",
        "type": "CollectionReference"
      }
    }'
done
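As a cross-check on the two loops above, the same request bodies can be generated in Python before anything is sent. This is a sketch under assumptions: the helper names `collection_payload` and `plan_collections` are hypothetical, and the account name is the example value from Step 1.

```python
import json


def collection_payload(friendly_name: str, parent: str) -> dict:
    """Build the JSON body used in the PUT requests above."""
    return {
        "friendlyName": friendly_name,
        "parentCollection": {"referenceName": parent, "type": "CollectionReference"},
    }


def plan_collections(root: str) -> dict:
    """Return {collection_name: payload}, parents listed before their children."""
    plan = {}
    for env in ("production", "staging", "development"):
        plan[env] = collection_payload(env.capitalize(), root)
    for domain in ("Finance", "Healthcare", "Environmental", "Transportation"):
        plan[f"prod-{domain.lower()}"] = collection_payload(domain, "production")
    return plan


plan = plan_collections("csadmlzdevpview")  # example account name from Step 1
print(json.dumps(plan["prod-finance"], indent=2))
```

Because Python dictionaries preserve insertion order, iterating over `plan` yields the environment collections before the domain collections that reference `production` as their parent.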

Verify

curl -s "$PURVIEW_ENDPOINT/account/collections?api-version=2019-11-01-preview" \
  -H "Authorization: Bearer $TOKEN" \
  | jq '.value[] | .friendlyName' -r | sort

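Expected output (Option B as written does not create Sandbox or Shared, so expect those two entries only if the bootstrap script or another step created them):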

CSA-in-a-Box (root)
Development
Environmental
Finance
Healthcare
Production
Sandbox
Shared
Staging
Transportation
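The sorted list above hides the parent-child structure. A small sketch like the following can rebuild the hierarchy from the collections response, assuming each item carries `name`, `friendlyName`, and an optional `parentCollection.referenceName`. The function name `render_tree` and the sample data are illustrative, not real API output.

```python
from collections import defaultdict


def render_tree(collections: list) -> list:
    """Return indented friendly names, children sorted under their parent."""
    children = defaultdict(list)
    names = {c["name"] for c in collections}
    for c in collections:
        parent = (c.get("parentCollection") or {}).get("referenceName")
        # Treat a collection as a root if its parent is missing or unknown.
        children[parent if parent in names else None].append(c)

    lines = []

    def walk(parent, depth):
        for c in sorted(children[parent], key=lambda item: item["friendlyName"]):
            lines.append("  " * depth + c["friendlyName"])
            walk(c["name"], depth + 1)

    walk(None, 0)
    return lines


sample = [  # hand-written sample, not real API output
    {"name": "csadmlzdevpview", "friendlyName": "CSA-in-a-Box (root)"},
    {"name": "production", "friendlyName": "Production",
     "parentCollection": {"referenceName": "csadmlzdevpview"}},
    {"name": "prod-finance", "friendlyName": "Finance",
     "parentCollection": {"referenceName": "production"}},
    {"name": "staging", "friendlyName": "Staging",
     "parentCollection": {"referenceName": "csadmlzdevpview"}},
]
print("\n".join(render_tree(sample)))
```

To use it against a live account, pass the `value` array from the collections response instead of `sample`.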

📖 Detailed reference: docs/governance/PURVIEW_SETUP.md — Step 1

Step 3: Register and Scan Data Sources

Register sources

The bootstrap script registers all five source types. To register individually:

STORAGE_ACCOUNT="csadlz${ENV}st"
SUBSCRIPTION_ID=$(az account show --query id -o tsv)

# Register ADLS Gen2
curl -s -X PUT \
  "$PURVIEW_ENDPOINT/scan/datasources/adls-$STORAGE_ACCOUNT?api-version=2022-07-01-preview" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "kind": "AzureStorage",
    "properties": {
      "endpoint": "https://'$STORAGE_ACCOUNT'.dfs.core.windows.net/",
      "resourceGroup": "'$DLZ_RG'",
      "subscriptionId": "'$SUBSCRIPTION_ID'",
      "location": "eastus",
      "resourceName": "'$STORAGE_ACCOUNT'",
      "collection": {
        "referenceName": "production",
        "type": "CollectionReference"
      }
    }
  }'

Grant managed identity access

PURVIEW_MI=$(az purview account show \
  --name "$PURVIEW_ACCOUNT" \
  --resource-group "$PURVIEW_RG" \
  --query identity.principalId -o tsv)

az role assignment create \
  --assignee-object-id "$PURVIEW_MI" \
  --assignee-principal-type ServicePrincipal \
  --role "Storage Blob Data Reader" \
  --scope "/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$DLZ_RG/providers/Microsoft.Storage/storageAccounts/$STORAGE_ACCOUNT"

Create and run a scan

SOURCE_NAME="adls-$STORAGE_ACCOUNT"

# Create scan definition
curl -s -X PUT \
  "$PURVIEW_ENDPOINT/scan/datasources/$SOURCE_NAME/scans/initial-scan?api-version=2022-07-01-preview" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "kind": "AzureStorageMsi",
    "properties": {
      "scanRulesetName": "AzureStorage",
      "scanRulesetType": "System",
      "collection": {
        "referenceName": "production",
        "type": "CollectionReference"
      }
    }
  }'

# Trigger scan
curl -s -X POST \
  "$PURVIEW_ENDPOINT/scan/datasources/$SOURCE_NAME/scans/initial-scan/runs?api-version=2022-07-01-preview" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{ "scanLevel": "Full" }'

echo "Scan triggered. Check status in Purview Studio or via API."

Check scan status

curl -s "$PURVIEW_ENDPOINT/scan/datasources/$SOURCE_NAME/scans/initial-scan/runs?api-version=2022-07-01-preview" \
  -H "Authorization: Bearer $TOKEN" \
  | jq '.value[0] | {status, startTime, endTime}'

Wait for "status": "Succeeded" before proceeding.
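Rather than re-running the status call by hand, the wait can be scripted as a polling loop. The sketch below takes the status lookup as a callable so it stays independent of any HTTP client. The set of terminal states is an assumption (only `Succeeded` is confirmed by this tutorial), and `wait_for_scan` is a hypothetical helper.

```python
import time
from typing import Callable

TERMINAL = {"Succeeded", "Failed", "Canceled"}  # assumed terminal states


def wait_for_scan(get_status: Callable[[], str], interval_s: float = 30.0,
                  timeout_s: float = 3600.0,
                  sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll get_status() until a terminal state; raise TimeoutError otherwise."""
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"scan still {status!r} after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s


# Demo with canned statuses instead of real API calls.
canned = iter(["Accepted", "InProgress", "InProgress", "Succeeded"])
print(wait_for_scan(lambda: next(canned), interval_s=0.0))
```

In practice `get_status` would wrap the runs request shown above and return the `status` field of the most recent run.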

📖 Detailed reference: docs/governance/PURVIEW_SETUP.md — Step 2-3, docs/governance/METADATA_MANAGEMENT.md

Step 4: Set Up Business Glossary

Option A: Use the seed script (recommended)

python scripts/governance/seed-glossary.py \
  --purview-account "$PURVIEW_ACCOUNT" \
  --glossary-file scripts/governance/glossary-terms.yaml

This creates 30+ terms organized into categories (Data Engineering, Data Governance, Finance, Healthcare, Environmental, Transportation, Quality Metrics) with parent-child relationships.

Option B: Use the automation module

python -m csa_platform.governance.purview.purview_automation \
  --account "$PURVIEW_ACCOUNT" \
  --action import-glossary \
  --glossary-file scripts/governance/glossary-terms.yaml

Verify glossary

curl -s "$PURVIEW_ENDPOINT/catalog/api/atlas/v2/glossary" \
  -H "Authorization: Bearer $TOKEN" \
  | jq '.[0] | {name, guid, termCount: (.terms | length)}'

Link terms to assets

After scanning discovers assets, link glossary terms to them:

# Find the gold customer table
ENTITY_GUID=$(curl -s -X POST \
  "$PURVIEW_ENDPOINT/catalog/api/search/query?api-version=2022-08-01-preview" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{ "keywords": "gld_customer_lifetime_value", "limit": 1 }' \
  | jq -r '.value[0].id')

# Get the CLV term GUID
CLV_GUID=$(curl -s "$PURVIEW_ENDPOINT/catalog/api/atlas/v2/glossary" \
  -H "Authorization: Bearer $TOKEN" \
  | jq -r '.[0].terms[] | select(.displayText == "Customer Lifetime Value") | .termGuid')

# Link them
curl -s -X POST \
  "$PURVIEW_ENDPOINT/catalog/api/atlas/v2/glossary/terms/$CLV_GUID/assignedEntities" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '[{ "guid": "'$ENTITY_GUID'" }]'

echo "Linked CLV term to gold customer table"
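The shell commands above link the term even if either lookup came back empty. A guarded version of the same lookup logic might look like this in Python. The field names (`terms`, `displayText`, `termGuid`) are the ones used in the jq filters above; the helper names and sample GUIDs are hypothetical.

```python
from typing import Optional


def find_term_guid(glossaries: list, display_text: str) -> Optional[str]:
    """Return the termGuid whose displayText matches, or None if absent."""
    for glossary in glossaries:
        for term in glossary.get("terms") or []:
            if term.get("displayText") == display_text:
                return term.get("termGuid")
    return None


def assignment_body(entity_guid: Optional[str]) -> list:
    """Body for the assignedEntities request shown above."""
    if not entity_guid or entity_guid == "null":
        # jq -r prints the string "null" when the search returns no hits.
        raise ValueError("asset not found; has the scan finished?")
    return [{"guid": entity_guid}]


sample = [{"name": "Glossary", "terms": [
    {"displayText": "Customer Lifetime Value", "termGuid": "term-guid-1"}]}]
print(find_term_guid(sample, "Customer Lifetime Value"))
print(assignment_body("entity-guid-1"))
```

Checking for the literal string `null` matters because the search command pipes through `jq -r`, which emits that text when `.value[0].id` is missing.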

📖 Detailed reference: docs/governance/DATA_CATALOGING.md

Step 5: Create Custom Classifications

Option A: Use the bootstrap script

The bootstrap script creates SSN, EIN, Tribal Enrollment ID, MRN, and Financial Account Number classifiers automatically.

Option B: Apply from YAML

python -m csa_platform.governance.purview.purview_automation \
  --account "$PURVIEW_ACCOUNT" \
  --action apply-classifications \
  --rules-dir csa_platform/governance/purview/classifications/

This processes all YAML files in the classifications directory:

pii_classifications.yaml — SSN, email, phone, address, name
phi_classifications.yaml — Medical record numbers, diagnosis codes
financial_classifications.yaml — Account numbers, credit cards
government_classifications.yaml — EIN, tribal enrollment, federal IDs

Option C: Create individually

curl -s -X PUT \
  "$PURVIEW_ENDPOINT/scan/classificationrules/CSA_PII_SSN?api-version=2022-07-01-preview" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "kind": "Custom",
    "properties": {
      "description": "US Social Security Number",
      "classificationName": "CSA_PII_SSN",
      "ruleStatus": "Enabled",
      "minimumPercentageMatch": 60.0,
      "dataPatterns": [
        { "pattern": "\\b(?!000|666|9\\d{2})\\d{3}-(?!00)\\d{2}-(?!0000)\\d{4}\\b" }
      ],
      "columnPatterns": [
        { "pattern": "(?i)(ssn|social_security|ss_number)" }
      ]
    }
  }'
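Before registering a rule, it can help to try its patterns against sample values locally. The sketch below reuses the two regular expressions from the request above with Python's `re` module and computes a match percentage to compare against the 60.0 threshold. This only approximates the idea: how Purview samples rows and applies `minimumPercentageMatch` is not reproduced here, its regex engine may differ from Python's, and `match_percentage` is a hypothetical helper.

```python
import re

# Same patterns as in the rule definition (JSON escaping removed).
SSN_DATA = re.compile(r"\b(?!000|666|9\d{2})\d{3}-(?!00)\d{2}-(?!0000)\d{4}\b")
SSN_COLUMN = re.compile(r"(?i)(ssn|social_security|ss_number)")


def match_percentage(values: list) -> float:
    """Percentage of values that fully match the SSN data pattern."""
    if not values:
        return 0.0
    hits = sum(1 for v in values if SSN_DATA.fullmatch(v))
    return 100.0 * hits / len(values)


column = ["123-45-6789", "078-05-1120", "219-09-9999", "n/a"]
pct = match_percentage(column)
print(f"{pct:.1f}% match, above threshold: {pct >= 60.0}")
print("column name matches:", bool(SSN_COLUMN.search("Customer_SSN")))
```

The negative lookaheads reject area numbers 000, 666, and 900 to 999, group number 00, and serial number 0000, so a value such as `000-12-3456` does not count toward the percentage.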

Verify classifications

curl -s "$PURVIEW_ENDPOINT/scan/classificationrules?api-version=2022-07-01-preview" \
  -H "Authorization: Bearer $TOKEN" \
  | jq '.value[] | select(.name | startswith("CSA_")) | .name'

Re-scan with custom classifiers

After creating classifications, re-scan to detect them:

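# Update the scan to use the custom ruleset
# (assumes a ruleset named csa-adls-ruleset already exists; the manual steps above do not create it)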
curl -s -X PUT \
  "$PURVIEW_ENDPOINT/scan/datasources/$SOURCE_NAME/scans/initial-scan?api-version=2022-07-01-preview" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "kind": "AzureStorageMsi",
    "properties": {
      "scanRulesetName": "csa-adls-ruleset",
      "scanRulesetType": "Custom",
      "collection": {
        "referenceName": "production",
        "type": "CollectionReference"
      }
    }
  }'

# Trigger re-scan
curl -s -X POST \
  "$PURVIEW_ENDPOINT/scan/datasources/$SOURCE_NAME/scans/initial-scan/runs?api-version=2022-07-01-preview" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{ "scanLevel": "Full" }'

📖 Detailed reference: docs/governance/DATA_CATALOGING.md — Custom Classifications

Step 6: Configure Data Lineage

6.1 ADF Pipeline Lineage (Automatic)

Connect ADF to Purview for automatic lineage capture:

ADF_NAME="csadlz${ENV}adf"
SUBSCRIPTION_ID=$(az account show --query id -o tsv)

# Grant Purview MI access to ADF
az role assignment create \
  --assignee-object-id "$PURVIEW_MI" \
  --assignee-principal-type ServicePrincipal \
  --role "Data Factory Contributor" \
  --scope "/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$DLZ_RG/providers/Microsoft.DataFactory/factories/$ADF_NAME"

Then in ADF Studio: Manage → Microsoft Purview → Connect and select your Purview account.

6.2 Databricks Lineage (OpenLineage)

Add the OpenLineage Spark listener to your Databricks clusters. See docs/governance/DATA_LINEAGE.md for the init script and Spark configuration.

Quick test — register lineage from a notebook:

# In a Databricks notebook
from azure.identity import DefaultAzureCredential
import requests

token = DefaultAzureCredential().get_token("https://purview.azure.net/.default").token
purview_url = "https://csadmlzdevpview.purview.azure.com"

resp = requests.post(
    f"{purview_url}/catalog/api/atlas/v2/entity/bulk",
    headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    json={"entities": [{
        "typeName": "databricks_notebook_process",
        "attributes": {
            "qualifiedName": "databricks://csadlzdev/notebooks/silver/transform_customers",
            "name": "Transform: customers",
        },
        "relationshipAttributes": {
            "inputs": [{"typeName": "azure_datalake_gen2_resource_set",
                        "uniqueAttributes": {"qualifiedName": "https://csadlzdevst.dfs.core.windows.net/bronze/customers"}}],
            "outputs": [{"typeName": "azure_datalake_gen2_resource_set",
                         "uniqueAttributes": {"qualifiedName": "https://csadlzdevst.dfs.core.windows.net/silver/customers"}}],
        },
    }]},
    timeout=30,
)
print(f"Status: {resp.status_code}")

6.3 dbt Lineage

After running dbt, push lineage from the manifest:

python -m csa_platform.governance.purview.purview_automation \
  --account "$PURVIEW_ACCOUNT" \
  --action register-dbt-lineage \
  --manifest target/manifest.json
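To see what lineage a manifest contains before pushing it, the dependency edges can be read straight from `manifest.json`. The sketch below relies on the standard dbt manifest layout (`nodes` keyed by unique ID, each with `resource_type` and `depends_on.nodes`). It is not a description of what the `purview_automation` module does internally, and the sample node names are invented.

```python
def lineage_edges(manifest: dict) -> list:
    """Return sorted (upstream, downstream) pairs for every model in the manifest."""
    edges = []
    for node_id, node in manifest.get("nodes", {}).items():
        if node.get("resource_type") != "model":
            continue  # skip tests, seeds, snapshots, and other node types
        for upstream in node.get("depends_on", {}).get("nodes", []):
            edges.append((upstream, node_id))
    return sorted(edges)


sample = {"nodes": {  # invented example, shaped like a dbt manifest
    "model.csa.slv_customers": {
        "resource_type": "model",
        "depends_on": {"nodes": ["source.csa.bronze.customers"]}},
    "model.csa.gld_customer_lifetime_value": {
        "resource_type": "model",
        "depends_on": {"nodes": ["model.csa.slv_customers"]}},
    "test.csa.not_null_id": {
        "resource_type": "test",
        "depends_on": {"nodes": ["model.csa.slv_customers"]}},
}}
for upstream, downstream in lineage_edges(sample):
    print(f"{upstream} -> {downstream}")
```

For a real project, load the file with `json.load(open("target/manifest.json"))` and pass the result to `lineage_edges`.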

Verify lineage

# Search for process entities
curl -s -X POST \
  "$PURVIEW_ENDPOINT/catalog/api/search/query?api-version=2022-08-01-preview" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{ "keywords": "*", "filter": { "objectType": "Processes" }, "limit": 10 }' \
  | jq -r '.["@search.count"] as $count | "\($count) process entities found"'

📖 Detailed reference: docs/governance/DATA_LINEAGE.md

Step 7: Set Up Data Quality Rules

7.1 Great Expectations

The quality framework is defined in csa_platform/governance/dataquality/quality-rules.yaml with expectation suites for each medallion layer.

Run quality checks:

# Run the daily quality checkpoint
python csa_platform/governance/dataquality/run_quality_checks.py \
  --config csa_platform/governance/dataquality/quality-rules.yaml \
  --suite bronze_customers_suite

7.2 Push Quality Scores to Purview

After running quality checks, update asset metadata:

# Create the quality metadata type (one-time setup)
curl -s -X POST \
  "$PURVIEW_ENDPOINT/catalog/api/atlas/v2/types/typedefs" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "businessMetadataDefs": [{
      "name": "CSA_DataQuality",
      "description": "Data quality scores",
      "attributeDefs": [
        { "name": "quality_score", "typeName": "float", "isOptional": true, "cardinality": "SINGLE",
          "options": { "applicableEntityTypes": "[\"DataSet\"]" } },
        { "name": "completeness_score", "typeName": "float", "isOptional": true, "cardinality": "SINGLE",
          "options": { "applicableEntityTypes": "[\"DataSet\"]" } },
        { "name": "last_checked", "typeName": "string", "isOptional": true, "cardinality": "SINGLE",
          "options": { "applicableEntityTypes": "[\"DataSet\"]" } }
      ]
    }]
  }'

# Update an asset with quality scores (the Atlas business-metadata endpoint expects POST)
curl -s -X POST \
  "$PURVIEW_ENDPOINT/catalog/api/atlas/v2/entity/guid/$ENTITY_GUID/businessmetadata?isOverwrite=true" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "CSA_DataQuality": {
      "quality_score": 0.96,
      "completeness_score": 0.99,
      "last_checked": "2025-01-12T06:00:00Z"
    }
  }'
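The hard-coded scores in the request above would normally come from a quality run. The following sketch shows one way to derive a completeness score and assemble the same `CSA_DataQuality` body. The completeness definition (share of required cells that are not null) is a common convention assumed here, not taken from quality-rules.yaml, and both helper names are hypothetical.

```python
from datetime import datetime, timezone


def completeness(rows: list, required: list) -> float:
    """Fraction of required cells that are not None across all rows."""
    total = len(rows) * len(required)
    if total == 0:
        return 0.0
    filled = sum(1 for row in rows for col in required if row.get(col) is not None)
    return filled / total


def quality_payload(quality_score: float, completeness_score: float,
                    checked_at: datetime) -> dict:
    """Build the CSA_DataQuality business-metadata body shown above."""
    for score in (quality_score, completeness_score):
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
    # checked_at should be timezone-aware; it is converted to UTC.
    stamp = checked_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {"CSA_DataQuality": {
        "quality_score": round(quality_score, 2),
        "completeness_score": round(completeness_score, 2),
        "last_checked": stamp,
    }}


rows = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": None}]
score = completeness(rows, ["id", "email"])  # 3 of 4 required cells are filled
payload = quality_payload(0.96, score, datetime(2025, 1, 12, 6, 0, tzinfo=timezone.utc))
print(payload)
```

Validating the range up front keeps a percentage such as 96 from being stored where a fraction such as 0.96 is expected.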

7.3 Configure Alerting

Set up alerts for quality failures using Azure Monitor (see quality-rules.yaml alerting section).

📖 Detailed reference: docs/governance/DATA_QUALITY.md

Step 8: Configure Access Policies

Enable Data Use Management

# Enable access policies on the ADLS source
curl -s -X PATCH \
  "$PURVIEW_ENDPOINT/scan/datasources/$SOURCE_NAME?api-version=2022-07-01-preview" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{ "properties": { "dataUseGovernance": "Enabled" } }'

Create a read policy

curl -s -X PUT \
  "$PURVIEW_ENDPOINT/policyStore/dataPolicies/read-gold-finance?api-version=2022-12-01-preview" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "read-gold-finance",
    "properties": {
      "description": "Finance team read access to gold/finance/",
      "decisionRules": [{
        "effect": "Permit",
        "dnfCondition": [[
          { "attributeName": "resource.path", "attributeValueIncludes": "gold/finance" },
          { "attributeName": "principal.microsoft.groups", "attributeValueIncludedIn": ["sg-finance-analysts"] },
          { "attributeName": "action.id", "attributeValueIncludes": "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read" }
        ]]
      }],
      "collection": { "referenceName": "prod-finance", "type": "CollectionReference" }
    }
  }'
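The `dnfCondition` field is in disjunctive normal form: the outer list is an OR over clauses, and each inner list is an AND over attribute conditions. The toy evaluator below illustrates that structure only. The matching semantics it assigns to `attributeValueIncludes` and `attributeValueIncludedIn` are assumptions made for the example, it is not Purview's policy engine, and the shortened action name is illustrative.

```python
def condition_holds(cond: dict, request: dict) -> bool:
    """Evaluate one attribute condition against a request (assumed semantics)."""
    actual = request.get(cond["attributeName"])
    if actual is None:
        return False
    if "attributeValueIncludes" in cond:
        # Assumption: substring test for strings, membership test for lists.
        return cond["attributeValueIncludes"] in actual
    if "attributeValueIncludedIn" in cond:
        allowed = set(cond["attributeValueIncludedIn"])
        values = actual if isinstance(actual, list) else [actual]
        return any(v in allowed for v in values)
    return False


def is_permitted(rules: list, request: dict) -> bool:
    """A Permit rule fires if ANY clause has ALL of its conditions true."""
    return any(
        rule["effect"] == "Permit"
        and any(all(condition_holds(c, request) for c in clause)
                for clause in rule["dnfCondition"])
        for rule in rules
    )


READ = "blobs/read"  # shortened action name for the example
rules = [{"effect": "Permit", "dnfCondition": [[
    {"attributeName": "resource.path", "attributeValueIncludes": "gold/finance"},
    {"attributeName": "principal.microsoft.groups",
     "attributeValueIncludedIn": ["sg-finance-analysts"]},
    {"attributeName": "action.id", "attributeValueIncludes": READ},
]]}]
analyst = {"resource.path": "gold/finance/ledger.parquet",
           "principal.microsoft.groups": ["sg-finance-analysts"],
           "action.id": READ}
outsider = dict(analyst, **{"principal.microsoft.groups": ["sg-marketing"]})
print(is_permitted(rules, analyst), is_permitted(rules, outsider))
```

Adding a second inner list to `dnfCondition` would grant access when either clause is fully satisfied, which is how one policy can cover several alternative paths or groups.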

Assign collection roles

Grant your team appropriate roles on collections via Purview Studio:

Collection Admin → Platform team
Data Source Admin → Data engineers
Data Curator → Data stewards
Data Reader → Analysts

📖 Detailed reference: docs/governance/DATA_ACCESS.md

Step 9: Validate End-to-End

Run through this checklist to confirm everything is working:

Automated validation

echo "=== CSA-in-a-Box Governance Validation ==="
echo ""

# 1. Purview account
echo "1. Purview Account"
az purview account show --name "$PURVIEW_ACCOUNT" -g "$PURVIEW_RG" \
  --query "{name:name, state:provisioningState}" -o table
echo ""

# 2. Collections
echo "2. Collections"
COLLECTION_COUNT=$(curl -s "$PURVIEW_ENDPOINT/account/collections?api-version=2019-11-01-preview" \
  -H "Authorization: Bearer $TOKEN" | jq '.value | length')
echo "   Collections: $COLLECTION_COUNT (expected: 9+)"
echo ""

# 3. Data Sources
echo "3. Data Sources"
SOURCE_COUNT=$(curl -s "$PURVIEW_ENDPOINT/scan/datasources?api-version=2022-07-01-preview" \
  -H "Authorization: Bearer $TOKEN" | jq '.value | length')
echo "   Sources: $SOURCE_COUNT (expected: 5)"
echo ""

# 4. Custom Classifications
echo "4. Custom Classifications"
CLASS_COUNT=$(curl -s "$PURVIEW_ENDPOINT/scan/classificationrules?api-version=2022-07-01-preview" \
  -H "Authorization: Bearer $TOKEN" | jq '[.value[] | select(.name | startswith("CSA_"))] | length')
echo "   Custom classifications: $CLASS_COUNT (expected: 5+)"
echo ""

# 5. Glossary Terms
echo "5. Glossary"
TERM_COUNT=$(curl -s "$PURVIEW_ENDPOINT/catalog/api/atlas/v2/glossary" \
  -H "Authorization: Bearer $TOKEN" | jq '.[0].terms | length')
echo "   Terms: $TERM_COUNT (expected: 30+)"
echo ""

# 6. Discovered Assets
echo "6. Discovered Assets"
ASSET_COUNT=$(curl -s -X POST \
  "$PURVIEW_ENDPOINT/catalog/api/search/query?api-version=2022-08-01-preview" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{ "keywords": "*", "limit": 0 }' | jq '.["@search.count"]')
echo "   Assets: $ASSET_COUNT"
echo ""

# 7. Lineage
echo "7. Lineage"
PROCESS_COUNT=$(curl -s -X POST \
  "$PURVIEW_ENDPOINT/catalog/api/search/query?api-version=2022-08-01-preview" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{ "keywords": "*", "filter": { "objectType": "Processes" }, "limit": 0 }' \
  | jq '.["@search.count"]')
echo "   Process entities: $PROCESS_COUNT"
echo ""

echo "=== Validation Complete ==="
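The script above prints counts but leaves the comparison to the reader. A pass/fail summary could be layered on top, as in this sketch. The thresholds for collections, sources, classifications, and glossary terms come from the expected values printed by the script; the minimums for assets and processes are assumptions, and `evaluate` is a hypothetical helper.

```python
EXPECTED_MINIMUMS = {
    "collections": 9,
    "sources": 5,
    "classifications": 5,
    "glossary_terms": 30,
    "assets": 1,      # assumption: at least one asset after a successful scan
    "processes": 1,   # assumption: at least one lineage process entity
}


def evaluate(counts: dict) -> list:
    """Return (check, count, passed) tuples for each expected minimum."""
    results = []
    for check, minimum in EXPECTED_MINIMUMS.items():
        count = counts.get(check, 0)
        results.append((check, count, count >= minimum))
    return results


observed = {"collections": 10, "sources": 5, "classifications": 4,
            "glossary_terms": 32, "assets": 120, "processes": 0}
for check, count, passed in evaluate(observed):
    print(f"{'PASS' if passed else 'FAIL'}  {check}: {count}")
```

In a CI job, exiting with a non-zero status when any tuple has `passed` set to `False` would turn the checklist into a gate.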

Manual checks in Purview Studio

Open https://web.purview.azure.com and verify:

Check	Where	Expected
Collections visible	Data Map → Collections	9+ collections in hierarchy
Sources registered	Data Map → Sources	5 registered sources
Scan completed	Data Map → Sources → ADLS → Scans	Last run: Succeeded
Assets discovered	Data Catalog → Browse	Tables, files visible
Glossary populated	Data Catalog → Glossary	30+ terms in categories
Classifications applied	Any scanned asset → Schema tab	CSA_PII_SSN etc.
Lineage visible	Any gold asset → Lineage tab	Upstream chain visible
Quality metadata	Any asset → Properties	CSA_DataQuality scores

What's Next

You now have a fully governed data platform. Here are suggested next steps:

Automate with CI/CD — Run bootstrap-purview.sh and classification updates in your deployment pipeline
Set up approval workflows — Use Logic Apps for sensitive data access requests (see DATA_ACCESS.md)
Configure sensitivity labels — Connect MIP for auto-labeling (see DATA_CATALOGING.md)
Monitor quality trends — Build Power BI dashboards from quality scores
Onboard domain teams — Train data stewards on glossary management and asset certification

Reference Documentation

Document	Purpose
PURVIEW_SETUP.md	Initial setup, network, permissions
METADATA_MANAGEMENT.md	Scanning, custom metadata
DATA_CATALOGING.md	Glossary, classifications, labels
DATA_LINEAGE.md	ADF, Databricks, dbt lineage
DATA_QUALITY.md	Great Expectations, scoring
DATA_ACCESS.md	Policies, RBAC, audit

Troubleshooting

Problem	Solution
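`401` or `403` on API calls	A `403` means you lack Collection Admin on the root collection. A `401` usually means `$TOKEN` has expired: re-run the `az account get-access-token` command from Step 1, after `az login` if your CLI session has also expired.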
Scan fails	Check managed identity has Storage Blob Data Reader. Check private endpoints.
Glossary import fails	Ensure no duplicate term names. Check glossary GUID is valid.
Classifications not detected	Re-scan after creating rules. Check `minimumPercentageMatch` threshold.
Lineage not showing	ADF lineage takes 15-30 minutes after pipeline run. Check connection in ADF Studio.
Access policy not enforced	Allow up to 2 hours for policy propagation. Check Data Use Management is enabled.