Metadata Management — CSA-in-a-Box

This guide covers automated scanning, custom scan rule sets, schema extraction, and business metadata enrichment in Microsoft Purview.


Automated Scanning

Create a Scan

After registering a data source (see PURVIEW_SETUP.md), create a scan definition:

PURVIEW_ENDPOINT="https://$PURVIEW_ACCOUNT.purview.azure.com"
TOKEN=$(az account get-access-token --resource "https://purview.azure.net" --query accessToken -o tsv)
SOURCE_NAME="adls-csadlzdevst"

# Create a scan for the ADLS source
curl -s -X PUT \
  "$PURVIEW_ENDPOINT/scan/datasources/$SOURCE_NAME/scans/weekly-full-scan?api-version=2022-07-01-preview" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "kind": "AdlsGen2Msi",
    "properties": {
      "scanRulesetName": "csa-adls-ruleset",
      "scanRulesetType": "Custom",
      "collection": {
        "referenceName": "production",
        "type": "CollectionReference"
      }
    }
  }'

Set a Recurring Schedule

curl -s -X PUT \
  "$PURVIEW_ENDPOINT/scan/datasources/$SOURCE_NAME/scans/weekly-full-scan/triggers/default?api-version=2022-07-01-preview" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "properties": {
      "scanLevel": "Full",
      "recurrence": {
        "frequency": "Week",
        "interval": 1,
        "startTime": "2025-01-01T02:00:00Z",
        "timezone": "UTC",
        "schedule": {
          "hours": [2],
          "minutes": [0],
          "weekDays": ["Sunday"]
        }
      }
    }
  }'
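
The trigger body above can also be generated rather than hand-edited, which keeps `startTime` and the `hours`/`minutes` fields in agreement. The following is a minimal sketch; `build_weekly_trigger` and `VALID_WEEK_DAYS` are hypothetical names introduced here, and the field names are copied from the payload above rather than taken from a client library.

```python
import json
from datetime import datetime, timezone

# Hypothetical helper; not part of purview_automation.py.
VALID_WEEK_DAYS = {
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
}


def build_weekly_trigger(start, week_days, scan_level="Full"):
    """Return a weekly trigger body shaped like the JSON payload above."""
    if start.tzinfo is None:
        raise ValueError("start must be timezone-aware")
    unknown = set(week_days) - VALID_WEEK_DAYS
    if unknown:
        raise ValueError(f"unknown week days: {sorted(unknown)}")

    # Normalize to UTC so startTime and the schedule fields cannot drift apart.
    start_utc = start.astimezone(timezone.utc)
    return {
        "properties": {
            "scanLevel": scan_level,
            "recurrence": {
                "frequency": "Week",
                "interval": 1,
                "startTime": start_utc.strftime("%Y-%m-%dT%H:%M:%SZ"),
                "timezone": "UTC",
                "schedule": {
                    "hours": [start_utc.hour],
                    "minutes": [start_utc.minute],
                    "weekDays": list(week_days),
                },
            },
        }
    }


trigger = build_weekly_trigger(
    datetime(2025, 1, 1, 2, 0, tzinfo=timezone.utc), ["Sunday"]
)
print(json.dumps(trigger, indent=2))
```

The printed JSON can be saved to a file and passed to curl with `-d @trigger.json`.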

Trigger a Scan Manually

# Run a scan immediately; each run is addressed by a caller-supplied run ID
RUN_ID=$(uuidgen)
curl -s -X PUT \
  "$PURVIEW_ENDPOINT/scan/datasources/$SOURCE_NAME/scans/weekly-full-scan/runs/$RUN_ID?scanLevel=Full&api-version=2022-07-01-preview" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Length: 0"

# Check scan run status (first entry in the run history)
curl -s "$PURVIEW_ENDPOINT/scan/datasources/$SOURCE_NAME/scans/weekly-full-scan/runs?api-version=2022-07-01-preview" \
  -H "Authorization: Bearer $TOKEN" | jq '.value[0] | {status, startTime, endTime, assetsDiscovered}'

Expected output after a successful scan:

{
    "status": "Succeeded",
    "startTime": "2025-01-12T02:00:15Z",
    "endTime": "2025-01-12T02:45:32Z",
    "assetsDiscovered": 1247
}
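
The timestamps in that output are enough to derive the scan duration, which is useful when checking whether a full scan still fits its maintenance window. A small sketch using only the standard library; `summarize_run` is a hypothetical helper, and the sample values are the ones shown above.

```python
from datetime import datetime


def parse_utc(ts):
    """Parse an ISO 8601 timestamp that ends in 'Z' (UTC)."""
    # Older Python versions do not accept a trailing 'Z' in fromisoformat.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def summarize_run(run):
    """Return a one-line summary with the elapsed scan time."""
    duration = parse_utc(run["endTime"]) - parse_utc(run["startTime"])
    minutes, seconds = divmod(int(duration.total_seconds()), 60)
    return f"{run['status']}: {run['assetsDiscovered']} assets in {minutes}m {seconds}s"


run = {
    "status": "Succeeded",
    "startTime": "2025-01-12T02:00:15Z",
    "endTime": "2025-01-12T02:45:32Z",
    "assetsDiscovered": 1247,
}
summary = summarize_run(run)
print(summary)  # Succeeded: 1247 assets in 45m 17s
```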

Custom Scan Rule Sets

The system default scan rule sets cover common formats such as CSV, JSON, and Parquet. CSA-in-a-Box adds custom rule sets for Delta Lake and GeoParquet data and for domain-specific classification patterns.

Delta Lake Scan Rule Set

curl -s -X PUT \
  "$PURVIEW_ENDPOINT/scan/scanrulesets/csa-delta-ruleset?api-version=2022-07-01-preview" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "kind": "AzureStorage",
    "properties": {
      "description": "Delta Lake format detection with CSA custom classifiers",
      "excludedSystemClassifications": [],
      "includedCustomClassificationRuleNames": [
        "CSA_PII_SSN", "CSA_PII_EMAIL", "CSA_PII_PHONE",
        "CSA_PHI_MRN", "CSA_FIN_ACCOUNT_NUMBER",
        "CSA_GOV_TRIBAL_ENROLLMENT_ID"
      ],
      "scanRulesetType": "Custom",
      "fileExtensions": [".parquet", ".json"],
      "resourceTypes": {
        "AzureStorage": {
          "scanRulesetType": "Custom",
          "customFileExtensions": [
            {
              "customFileType": {
                "builtInType": "JSON"
              },
              "description": "Delta Lake transaction log",
              "enabled": true,
              "fileExtension": ".json"
            }
          ]
        }
      }
    }
  }'

GeoParquet Scan Rule Set

GeoParquet files contain geometry columns. Create a custom classification to detect spatial data:

# Classification rule for WKT geometry strings
curl -s -X PUT \
  "$PURVIEW_ENDPOINT/scan/classificationrules/CSA_GEOSPATIAL_WKT?api-version=2022-07-01-preview" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "kind": "Custom",
    "properties": {
      "description": "Well-Known Text geometry representation",
      "classificationName": "CSA_GEOSPATIAL_WKT",
      "ruleStatus": "Enabled",
      "minimumPercentageMatch": 50.0,
      "dataPatterns": [
        { "pattern": "^(POINT|LINESTRING|POLYGON|MULTIPOINT|MULTILINESTRING|MULTIPOLYGON|GEOMETRYCOLLECTION)\\s*\\(" }
      ],
      "columnPatterns": [
        { "pattern": "(?i)(geom|geometry|shape|wkt|spatial|location_wkt)" }
      ]
    }
  }'

# Scan rule set that includes geo classifiers
curl -s -X PUT \
  "$PURVIEW_ENDPOINT/scan/scanrulesets/csa-geoparquet-ruleset?api-version=2022-07-01-preview" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "kind": "AzureStorage",
    "properties": {
      "description": "GeoParquet files with spatial column detection",
      "includedCustomClassificationRuleNames": [
        "CSA_GEOSPATIAL_WKT", "CSA_PII_SSN", "CSA_PII_EMAIL"
      ],
      "scanRulesetType": "Custom"
    }
  }'
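
Before deploying the WKT classification rule, it can help to try its two patterns against sample values locally. The sketch below uses Python's `re` module, which is an assumption: the regex engine Purview uses may differ in details, and how Purview combines the column pattern with the data pattern is not modeled here. `match_percentage` is a hypothetical helper that mirrors the idea behind `minimumPercentageMatch`.

```python
import re

# Patterns copied from the CSA_GEOSPATIAL_WKT rule above (JSON escaping removed).
DATA_PATTERN = re.compile(
    r"^(POINT|LINESTRING|POLYGON|MULTIPOINT|MULTILINESTRING|MULTIPOLYGON|GEOMETRYCOLLECTION)\s*\("
)
COLUMN_PATTERN = re.compile(r"(?i)(geom|geometry|shape|wkt|spatial|location_wkt)")
MINIMUM_PERCENTAGE_MATCH = 50.0


def match_percentage(values):
    """Percentage of sample values that match the data pattern."""
    if not values:
        return 0.0
    hits = sum(1 for value in values if DATA_PATTERN.search(value))
    return 100.0 * hits / len(values)


samples = [
    "POINT (30 10)",
    "POLYGON((0 0, 1 0, 1 1, 0 0))",
    "LINESTRING (0 0, 1 1)",
    "point(1 2)",  # lowercase: the data pattern is case-sensitive
]
percentage = match_percentage(samples)
print(f"data pattern matched {percentage:.1f}% of samples")
print("column 'Location_WKT' matches:", bool(COLUMN_PATTERN.search("Location_WKT")))
print("column 'customer_id' matches:", bool(COLUMN_PATTERN.search("customer_id")))
```

Two things show up in this kind of check: lowercase WKT and EWKT values such as `SRID=4326;POINT(1 2)` do not match the data pattern, and the column pattern matches substrings, so a column named `reshaped_flag` would match on `shape`.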

Schema Extraction and Technical Metadata

Purview automatically extracts schema metadata during scanning. For each asset you get:

  • Column names, data types, and nullability
  • File format, encoding, and compression
  • Partition structure (for partitioned datasets)
  • Row count estimates

Query Schema via REST API

# Search for a specific asset
ASSET_QN="https://csadlzdevst.dfs.core.windows.net/silver/customers"

curl -s -X POST \
  "$PURVIEW_ENDPOINT/catalog/api/search/query?api-version=2022-08-01-preview" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "keywords": "customers",
    "filter": {
      "and": [
        { "objectType": "Tables" },
        { "collectionId": "production" }
      ]
    },
    "limit": 5
  }' | jq '.value[] | {name, qualifiedName, entityType}'

# Get full entity with schema
ENTITY_GUID="<guid-from-search>"
curl -s "$PURVIEW_ENDPOINT/catalog/api/atlas/v2/entity/guid/$ENTITY_GUID?api-version=2022-08-01-preview" \
  -H "Authorization: Bearer $TOKEN" | jq '.entity.relationshipAttributes.columns[] | {name: .displayText, type: .attributes.data_type}'
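
The same column extraction can be done in Python once the entity response has been saved or fetched. This is a sketch under stated assumptions: the sample response is hypothetical and only shaped like what the jq filter above expects, and the fallback to `referredEntities` is an assumption for responses where the column reference carries no inline `attributes`.

```python
def extract_columns(response):
    """Return [{'name': ..., 'type': ...}] for each column of an entity response."""
    entity = response.get("entity", {})
    referred = response.get("referredEntities", {})
    columns = []
    for ref in entity.get("relationshipAttributes", {}).get("columns", []):
        # Prefer inline attributes; otherwise look the column up by GUID.
        attrs = ref.get("attributes") or referred.get(ref.get("guid"), {}).get("attributes", {})
        columns.append({"name": ref.get("displayText"), "type": attrs.get("data_type")})
    return columns


# Hypothetical response for illustration only.
sample_response = {
    "entity": {
        "relationshipAttributes": {
            "columns": [
                {"guid": "col-1", "displayText": "customer_id",
                 "attributes": {"data_type": "bigint"}},
                {"guid": "col-2", "displayText": "email"},
            ]
        }
    },
    "referredEntities": {
        "col-2": {"attributes": {"data_type": "string"}},
    },
}

columns = extract_columns(sample_response)
for column in columns:
    print(f"{column['name']}: {column['type']}")
```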

Business Metadata Enrichment

Apply Glossary Terms to Assets

# Link a glossary term to an asset
TERM_GUID="<glossary-term-guid>"
ENTITY_GUID="<entity-guid>"

curl -s -X POST \
  "$PURVIEW_ENDPOINT/catalog/api/atlas/v2/glossary/terms/$TERM_GUID/assignedEntities" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '[
    {
      "guid": "'$ENTITY_GUID'",
      "typeName": "azure_datalake_gen2_resource_set",
      "relationshipAttributes": {
        "meanings": [{ "guid": "'$TERM_GUID'" }]
      }
    }
  ]'

Custom Metadata Attributes

Define custom type definitions to capture domain-specific metadata. CSA-in-a-Box uses three custom attributes across all data assets:

| Attribute | Type | Values | Purpose |
|---|---|---|---|
| data_domain | string | finance, healthcare, environmental, transport | Business domain that owns the data |
| quality_tier | string | bronze, silver, gold | Medallion layer / quality level |
| sla_hours | int | 1–168 | Maximum hours before data is considered stale |

# Create a business metadata type definition
curl -s -X POST \
  "$PURVIEW_ENDPOINT/catalog/api/atlas/v2/types/typedefs" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "businessMetadataDefs": [
      {
        "name": "CSA_DataGovernance",
        "description": "CSA-in-a-Box governance metadata",
        "attributeDefs": [
          {
            "name": "data_domain",
            "typeName": "string",
            "description": "Business domain that owns this asset",
            "isOptional": true,
            "cardinality": "SINGLE",
            "options": {
              "applicableEntityTypes": "[\"DataSet\",\"azure_datalake_gen2_resource_set\",\"azure_sql_table\"]"
            }
          },
          {
            "name": "quality_tier",
            "typeName": "string",
            "description": "Medallion architecture tier: bronze, silver, or gold",
            "isOptional": true,
            "cardinality": "SINGLE",
            "options": {
              "applicableEntityTypes": "[\"DataSet\"]"
            }
          },
          {
            "name": "sla_hours",
            "typeName": "int",
            "description": "Maximum acceptable data staleness in hours",
            "isOptional": true,
            "cardinality": "SINGLE",
            "options": {
              "applicableEntityTypes": "[\"DataSet\"]"
            }
          }
        ]
      }
    ]
  }'

Apply Custom Metadata to an Asset

curl -s -X POST \
  "$PURVIEW_ENDPOINT/catalog/api/atlas/v2/entity/guid/$ENTITY_GUID/businessmetadata?isOverwrite=true" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "CSA_DataGovernance": {
      "data_domain": "finance",
      "quality_tier": "gold",
      "sla_hours": 4
    }
  }'
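
The type definition stores `data_domain` and `quality_tier` as plain strings, so the allowed values listed earlier are not enforced by the payload itself. A client-side check before sending is one way to catch typos. The sketch below is hypothetical (`validate_governance_metadata` is not part of `purview_automation.py`) and takes its allowed values from the attribute list in this guide.

```python
ALLOWED_VALUES = {
    "data_domain": {"finance", "healthcare", "environmental", "transport"},
    "quality_tier": {"bronze", "silver", "gold"},
}
SLA_HOURS_MIN, SLA_HOURS_MAX = 1, 168


def validate_governance_metadata(attrs):
    """Return a list of error messages; an empty list means the payload is valid."""
    errors = []
    for name, value in attrs.items():
        if name in ALLOWED_VALUES:
            allowed = ALLOWED_VALUES[name]
            if not isinstance(value, str) or value not in allowed:
                errors.append(f"{name}: {value!r} is not one of {sorted(allowed)}")
        elif name == "sla_hours":
            # bool is a subclass of int in Python, so reject it explicitly.
            is_int = isinstance(value, int) and not isinstance(value, bool)
            if not is_int or not SLA_HOURS_MIN <= value <= SLA_HOURS_MAX:
                errors.append(f"sla_hours: {value!r} is not an integer between 1 and 168")
        else:
            errors.append(f"{name}: unknown attribute")
    return errors


good = {"data_domain": "finance", "quality_tier": "gold", "sla_hours": 4}
bad = {"data_domain": "marketing", "quality_tier": "platinum", "sla_hours": 200}

print(validate_governance_metadata(good))
for message in validate_governance_metadata(bad):
    print(message)
```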

Bulk Metadata Operations

Bulk Update via REST API

# Bulk update classifications on multiple assets
# ENTITY_GUID_1 and ENTITY_GUID_2 must hold the GUIDs of the target assets
curl -s -X POST \
  "$PURVIEW_ENDPOINT/catalog/api/atlas/v2/entity/bulk/setClassifications" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "guidHeaderMap": {
      "'$ENTITY_GUID_1'": {
        "classifications": [
          { "typeName": "MICROSOFT.PERSONAL.EMAIL" },
          { "typeName": "CSA_PII_SSN" }
        ]
      },
      "'$ENTITY_GUID_2'": {
        "classifications": [
          { "typeName": "MICROSOFT.FINANCIAL.CREDIT_CARD_NUMBER" }
        ]
      }
    }
  }'
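
For more than a handful of assets, splicing GUIDs into a shell string becomes error-prone. One option is to generate the request body from a mapping and let a JSON serializer handle the quoting. A minimal sketch; `build_set_classifications_body` is a hypothetical helper and the GUIDs are placeholders.

```python
import json


def build_set_classifications_body(assignments):
    """Build a body shaped like the guidHeaderMap payload above.

    assignments maps an entity GUID to a list of classification type names.
    """
    return {
        "guidHeaderMap": {
            guid: {"classifications": [{"typeName": name} for name in names]}
            for guid, names in assignments.items()
        }
    }


# Placeholder GUIDs for illustration only.
assignments = {
    "00000000-0000-0000-0000-000000000001": ["MICROSOFT.PERSONAL.EMAIL", "CSA_PII_SSN"],
    "00000000-0000-0000-0000-000000000002": ["MICROSOFT.FINANCIAL.CREDIT_CARD_NUMBER"],
}
body = build_set_classifications_body(assignments)
print(json.dumps(body, indent=2))
```

Writing the output to a file and passing it to curl with `-d @body.json` avoids shell quoting entirely.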

Bulk Search and Tag

# Find assets in the bronze container and tag them.
# The search filter selects ADLS Gen2 files; jq keeps only those under /bronze/.
# With limit set to 100, only the first 100 search results are processed.
curl -s -X POST \
  "$PURVIEW_ENDPOINT/catalog/api/search/query?api-version=2022-08-01-preview" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "keywords": "*",
    "filter": {
      "and": [
        { "objectType": "Files" },
        { "assetType": "Azure Data Lake Storage Gen2" }
      ]
    },
    "limit": 100
  }' | jq -r '.value[] | select((.qualifiedName // "") | contains("/bronze/")) | .id' | while read -r GUID; do
    curl -s -X POST \
      "$PURVIEW_ENDPOINT/catalog/api/atlas/v2/entity/guid/$GUID/businessmetadata?isOverwrite=false" \
      -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: application/json" \
      -d '{ "CSA_DataGovernance": { "quality_tier": "bronze" } }'
    echo "Tagged $GUID as bronze"
  done
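
With `limit` set to 100, the loop above only sees the first page of search results. The sketch below shows one way to page through all of them, assuming the search endpoint accepts an `offset` field alongside `limit`. `iter_search_results` is a hypothetical helper, and `fake_search` stands in for the HTTP call so the example runs offline; a real implementation would POST the body to the search endpoint.

```python
def iter_search_results(search, query, page_size=100):
    """Yield every result by requesting successive pages until one comes back short."""
    offset = 0
    while True:
        page = search({**query, "limit": page_size, "offset": offset}).get("value", [])
        yield from page
        if len(page) < page_size:
            break
        offset += page_size


# Stand-in catalog: 250 hypothetical assets alternating between two containers.
FAKE_CATALOG = [
    {
        "id": f"guid-{i}",
        "qualifiedName": "https://csadlzdevst.dfs.core.windows.net/"
        f"{'bronze' if i % 2 == 0 else 'silver'}/file{i}.parquet",
    }
    for i in range(250)
]


def fake_search(body):
    """Mimic a paged search response from the in-memory catalog."""
    start = body["offset"]
    return {"value": FAKE_CATALOG[start:start + body["limit"]]}


query = {"keywords": "*", "filter": {"objectType": "Files"}}
bronze_ids = [
    asset["id"]
    for asset in iter_search_results(fake_search, query)
    if "/bronze/" in asset["qualifiedName"]
]
print(f"{len(bronze_ids)} bronze assets to tag")
```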

Python Automation

Use purview_automation.py for programmatic metadata management:

from azure.identity import DefaultAzureCredential
from csa_platform.governance.purview.purview_automation import PurviewAutomation

purview = PurviewAutomation(
    account_name="csadmlzdevpview",
    credential=DefaultAzureCredential(),
)

# Apply custom classification rules
results = purview.apply_classification_rules(
    "csa_platform/governance/purview/classifications/pii_classifications.yaml"
)
for r in results:
    print(f"{r['name']}: {r['status']}")

# Import glossary terms
results = purview.import_glossary_terms(
    "scripts/governance/glossary-terms.yaml",
    glossary_name="CSA Business Glossary",
)
for r in results:
    print(f"{r['name']}: {r['status']}")

# Schedule a recurring scan
from csa_platform.governance.purview.purview_automation import ScanSchedule

schedule = ScanSchedule(
    source_name="adls-csadlzdevst",
    scan_name="weekly-full-scan",
    trigger_type="Recurring",
    recurrence_interval=7,
    scan_level="Full",
)
result = purview.schedule_scan(schedule)
print(f"Scan schedule: {result['status']}")

Dry-Run Mode

All automation methods support dry_run=True to validate configuration without making changes:

# Validate classification rules without applying
results = purview.apply_classification_rules(
    "csa_platform/governance/purview/classifications/pii_classifications.yaml",
    dry_run=True,
)
# results holds the validation outcome; nothing is written to Purview

Scan Monitoring

Check Scan History

curl -s "$PURVIEW_ENDPOINT/scan/datasources/$SOURCE_NAME/scans/weekly-full-scan/runs?api-version=2022-07-01-preview" \
  -H "Authorization: Bearer $TOKEN" \
  | jq '.value[] | {status, startTime, endTime, assetsDiscovered}'
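
A run history like this can be turned into a simple health check: count failed runs and flag the source when the last successful scan is older than expected. The sketch below is illustrative only. `scan_health` is a hypothetical helper, the eight-day threshold is an assumption chosen to give a weekly scan one day of slack, and the failed run in the sample data is made up.

```python
from datetime import datetime, timedelta, timezone


def parse_utc(ts):
    """Parse an ISO 8601 timestamp that ends in 'Z' (UTC)."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def scan_health(runs, now, max_age=timedelta(days=8)):
    """Summarize a run history: failed run count, last success, staleness flag."""
    failed = [r for r in runs if r.get("status") == "Failed"]
    succeeded = [r for r in runs if r.get("status") == "Succeeded" and r.get("endTime")]
    latest = max(succeeded, key=lambda r: parse_utc(r["endTime"]), default=None)
    stale = latest is None or now - parse_utc(latest["endTime"]) > max_age
    return {
        "failed_runs": len(failed),
        "latest_success": latest["endTime"] if latest else None,
        "stale": stale,
    }


history = [
    {"status": "Succeeded", "startTime": "2025-01-12T02:00:15Z",
     "endTime": "2025-01-12T02:45:32Z", "assetsDiscovered": 1247},
    # Hypothetical failed run for illustration.
    {"status": "Failed", "startTime": "2025-01-05T02:00:11Z",
     "endTime": "2025-01-05T02:03:40Z", "assetsDiscovered": 0},
]

report = scan_health(history, now=datetime(2025, 1, 15, tzinfo=timezone.utc))
print(report)
```

Passing `now` explicitly keeps the function deterministic; in a scheduled job it would be `datetime.now(timezone.utc)`.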

Monitor via Azure Monitor

The Purview Bicep template configures diagnostic settings to send logs to Log Analytics. Query scan status:

PurviewScanStatusLogs
| where TimeGenerated > ago(7d)
| project TimeGenerated, ScanName, DataSourceName, ScanStatus, AssetsDiscovered
| order by TimeGenerated desc

Next Steps