ETL Migration: Glue to ADF, dbt, and Purview¶
A deep-dive guide for data engineers migrating AWS Glue ETL pipelines, crawlers, and the Glue Data Catalog to Azure Data Factory, dbt, and Microsoft Purview.
Executive summary¶
AWS Glue is three things in one: a metadata catalog (Glue Data Catalog), an ETL engine (Glue Jobs), and a schema discovery tool (Glue Crawlers). On Azure, these responsibilities split across purpose-built services: Azure Data Factory handles orchestration and data movement, dbt handles transformation logic, Unity Catalog manages runtime metadata, and Purview provides enterprise governance, classification, and lineage. This decomposition trades Glue's single-service simplicity for deeper capability in each function.
This guide covers each Glue component, its Azure equivalent, migration patterns, and worked examples including a full Glue PySpark job converted to a dbt SQL model.
Component mapping overview¶
| Glue component | Primary Azure equivalent | Secondary option | Notes |
|---|---|---|---|
| Glue Data Catalog | Unity Catalog (runtime) + Purview (business) | N/A | Unity Catalog for Spark; Purview for governance |
| Glue ETL Jobs (PySpark) | Databricks Jobs + dbt models | ADF Data Flows | dbt for SQL logic; Databricks for PySpark |
| Glue Python Shell | Azure Functions | Databricks lightweight task | Serverless for lightweight jobs |
| Glue Crawlers | Purview scan jobs + Auto Loader schema inference | N/A | Purview for governance; Auto Loader for runtime |
| Glue Studio (visual) | ADF visual pipeline designer | Fabric Data Factory | Visual ETL in ADF; logic in dbt |
| Glue Streaming | Databricks Structured Streaming | Stream Analytics | Delta Live Tables for streaming ETL |
| Glue DataBrew | Power Query in Fabric | dbt | Visual prep in Power Query; SQL in dbt |
| Glue Data Quality | dbt tests + Great Expectations | data-product contracts | Contract-driven quality |
| Glue Workflows | ADF pipeline orchestration | Databricks Workflows | ADF for cross-service; Databricks for Spark-only |
| Step Functions | ADF pipeline + Logic Apps | Durable Functions | ADF for data; Logic Apps for integration |
Part 1: Glue Data Catalog to Unity Catalog and Purview¶
Architecture comparison¶
Glue Data Catalog:
- Stores database, table, and partition metadata
- Integrated with Athena, EMR, Redshift Spectrum
- Per-account, per-region (not global)
- No built-in classification, lineage, or business glossary
- Lake Formation layered on top for access control
Unity Catalog:
- Three-level namespace: catalog.schema.table
- Integrated with Databricks SQL, Jobs, ML
- Cross-workspace (organization-wide)
- Built-in access control (grants, row filters, column masks)
- Lineage tracking for tables, columns, and notebooks
Purview:
- Enterprise-wide data catalog and governance
- Classification (PII, PHI, financial, government)
- Business glossary with term relationships
- Lineage across ADF, Databricks, Power BI
- Scan sources: ADLS, SQL, Databricks, S3 (cross-cloud)
Migration approach¶
Glue Data Catalog
├── databases → Unity Catalog catalogs
├── tables → Unity Catalog tables (managed or external)
├── partitions → Delta partitions (auto-managed)
└── metadata → Purview glossary terms + classifications
Step 1: Export Glue Catalog metadata
# Export all databases
aws glue get-databases --output json > glue_databases.json
# Export tables per database
for db in $(cat glue_databases.json | jq -r '.DatabaseList[].Name'); do
aws glue get-tables --database-name $db --output json > "glue_tables_${db}.json"
done
# Export partitions for partitioned tables
aws glue get-partitions \
--database-name sales \
--table-name fact_orders \
--output json > glue_partitions_sales_fact_orders.json
Step 2: Create Unity Catalog structure
-- Create catalog (one per Glue database or per domain)
CREATE CATALOG IF NOT EXISTS sales_prod;
USE CATALOG sales_prod;
-- Create schemas
CREATE SCHEMA IF NOT EXISTS bronze COMMENT 'Raw ingested data';
CREATE SCHEMA IF NOT EXISTS silver COMMENT 'Cleaned and conformed data';
CREATE SCHEMA IF NOT EXISTS gold COMMENT 'Business-ready aggregates';
-- Register external location for existing ADLS data
CREATE EXTERNAL LOCATION IF NOT EXISTS sales_raw
URL 'abfss://raw@acmeanalyticsgov.dfs.core.usgovcloudapi.net/sales/'
WITH (STORAGE CREDENTIAL sales_credential);
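Rather than hand-writing DDL for every table, the `glue_tables_*.json` exports from Step 1 can be turned into Unity Catalog statements. A minimal sketch, assuming the S3 data has already been copied under the `raw` ADLS container, that the targets are external Parquet tables in `sales_prod.bronze`, and that only common primitive types appear; partition keys, complex types, and table properties still need manual attention.

```python
import glob
import json

# Rough mapping of common Glue/Hive primitive types to Spark SQL types;
# extend for decimals, arrays, structs, and anything else you actually use.
GLUE_TO_SPARK = {
    "string": "STRING", "int": "INT", "bigint": "BIGINT", "double": "DOUBLE",
    "float": "FLOAT", "boolean": "BOOLEAN", "date": "DATE", "timestamp": "TIMESTAMP",
}

def glue_table_to_ddl(table: dict, catalog: str, schema: str, adls_root: str) -> str:
    """Build one CREATE TABLE statement from a Glue get-tables entry."""
    cols = ",\n  ".join(
        f"{c['Name']} {GLUE_TO_SPARK.get(c['Type'].lower(), 'STRING')}"
        for c in table["StorageDescriptor"]["Columns"]
    )
    # Re-root the old s3://bucket/... location under the migrated ADLS container.
    # Partition columns (PartitionKeys) are omitted here for brevity.
    s3_path = table["StorageDescriptor"]["Location"]
    adls_path = adls_root + s3_path.split("/", 3)[3]
    return (
        f"CREATE TABLE IF NOT EXISTS {catalog}.{schema}.{table['Name']} (\n  {cols}\n)\n"
        f"USING PARQUET LOCATION '{adls_path}';"
    )

if __name__ == "__main__":
    adls_root = "abfss://raw@acmeanalyticsgov.dfs.core.usgovcloudapi.net/"
    for path in glob.glob("glue_tables_*.json"):
        with open(path) as f:
            for table in json.load(f)["TableList"]:
                print(glue_table_to_ddl(table, "sales_prod", "bronze", adls_root))
```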
Step 3: Register Purview governance
# Purview scan configuration (via purview_automation.py pattern)
# Cross-reference: csa_platform/csa_platform/governance/purview/purview_automation.py
scan_config = {
"name": "scan-databricks-sales",
"kind": "AzureDatabricks",
"properties": {
"scanRulesetName": "default",
"collection": {
"referenceName": "sales-domain"
}
}
}
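The `scan_config` above still has to be registered against the Purview account. A minimal sketch using the scanning data-plane REST API with `azure-identity` and `requests`; the account name, data source name, and API version are assumptions, and Azure Government deployments use different Purview endpoints and token scopes than the commercial-cloud values shown here.

```python
import requests
from azure.identity import DefaultAzureCredential

# Hypothetical names; the data source must already be registered in Purview.
PURVIEW_ACCOUNT = "acme-purview"
DATA_SOURCE = "databricks-sales"
SCAN_NAME = scan_config["name"]   # "scan-databricks-sales" from above

def register_scan(scan_body: dict) -> dict:
    """Create or update a scan definition via the Purview scanning data plane."""
    credential = DefaultAzureCredential()
    token = credential.get_token("https://purview.azure.net/.default").token
    url = (
        f"https://{PURVIEW_ACCOUNT}.purview.azure.com/scan"
        f"/datasources/{DATA_SOURCE}/scans/{SCAN_NAME}"
    )
    response = requests.put(
        url,
        params={"api-version": "2022-02-01-preview"},  # adjust to your API version
        headers={"Authorization": f"Bearer {token}"},
        json=scan_body,
    )
    response.raise_for_status()
    return response.json()

print(register_scan(scan_config))
```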
Part 2: Glue Jobs to Databricks Jobs and dbt¶
Decision matrix: which target?¶
| Glue job type | Target | Reasoning |
|---|---|---|
| PySpark job doing SQL transforms | dbt SQL model | SQL is more testable and versionable |
| PySpark job with complex Python logic | Databricks notebook/job | Preserve PySpark; update paths |
| Python Shell job (lightweight) | Azure Function | Serverless, no cluster overhead |
| Glue streaming job | Databricks Structured Streaming | Delta Live Tables for managed streaming |
| Glue visual job (Studio) | ADF Data Flow or dbt | Depends on complexity |
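For the Python Shell row above, the usual landing spot is a timer-triggered Azure Function. A minimal sketch using the Functions Python v2 programming model; the schedule and the placeholder refresh logic are illustrative, not taken from any specific Glue job.

```python
# function_app.py (Azure Functions, Python v2 programming model)
import logging

import azure.functions as func

app = func.FunctionApp()

def refresh_reference_data() -> None:
    # Placeholder for the former Glue Python Shell logic, e.g. a small API pull
    # written to ADLS; work this size needs no Spark cluster at all.
    logging.info("Refreshing reference data")

@app.timer_trigger(schedule="0 0 6 * * *", arg_name="timer")  # daily at 06:00 UTC
def nightly_reference_refresh(timer: func.TimerRequest) -> None:
    """Timer-triggered replacement for a scheduled Glue Python Shell job."""
    refresh_reference_data()
```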
Worked example: Glue PySpark job to dbt SQL model¶
Original Glue PySpark job:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'run_date'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Read from Glue Catalog
orders_dyf = glueContext.create_dynamic_frame.from_catalog(
database="sales",
table_name="raw_orders"
)
orders_df = orders_dyf.toDF()
# Filter and transform
from pyspark.sql.functions import col, sum as spark_sum, date_format, current_timestamp
daily_sales = orders_df \
.filter(col("order_date") == args['run_date']) \
.groupBy("region", "product_id") \
.agg(
spark_sum("quantity").alias("total_units"),
spark_sum("gross_amount").alias("total_revenue")
) \
.withColumn("sales_date", col("order_date")) \
.withColumn("etl_timestamp", current_timestamp())
# Write back to Glue Catalog
daily_sales.write \
.mode("overwrite") \
.format("parquet") \
.partitionBy("sales_date") \
.saveAsTable("sales.fact_sales_daily")
job.commit()
Migrated dbt SQL model:
-- models/gold/fact_sales_daily.sql
{{ config(
materialized='incremental',
unique_key=['sales_date', 'region', 'product_id'],
incremental_strategy='merge',
partition_by=['sales_date'],
tags=['daily', 'sales']
) }}
WITH source_orders AS (
SELECT
order_date AS sales_date,
region,
product_id,
quantity,
gross_amount
FROM {{ ref('stg_orders') }}
{% if is_incremental() %}
WHERE order_date >= DATE_SUB(CURRENT_DATE(), 3)
{% endif %}
)
SELECT
sales_date,
region,
product_id,
SUM(quantity) AS total_units,
SUM(gross_amount) AS total_revenue,
CURRENT_TIMESTAMP() AS etl_timestamp
FROM source_orders
GROUP BY sales_date, region, product_id
dbt test for the model:
# models/gold/schema.yml
models:
- name: fact_sales_daily
description: "Daily sales aggregation by region and product"
columns:
- name: sales_date
tests:
- not_null
- name: region
tests:
- not_null
- accepted_values:
values: ["EAST", "WEST", "CENTRAL", "SOUTH"]
- name: total_units
tests:
- not_null
- dbt_utils.expression_is_true:
expression: ">= 0"
- name: total_revenue
tests:
- not_null
- dbt_utils.expression_is_true:
expression: ">= 0"
What changed:
- Glue `DynamicFrame`/`GlueContext` removed; replaced by dbt `ref()` and plain SQL.
- PySpark DataFrame operations expressed as a SQL GROUP BY.
- The exact-date `run_date` parameter replaced by an incremental merge over a rolling three-day window, which also picks up late-arriving orders.
- S3/Parquet write replaced by Delta Lake incremental merge.
- Data quality checks added as declarative dbt tests (in Glue this requires separate Glue Data Quality rulesets or custom code).
- Lineage automatically tracked by dbt and visible in Purview.
Part 3: Glue Crawlers to Purview and Auto Loader¶
Glue Crawlers: what they do¶
Glue Crawlers scan data sources (S3, JDBC), infer schemas, and register tables in the Glue Data Catalog. They run on a schedule and detect schema changes.
Azure equivalents¶
Purview scan jobs (governance discovery):
{
"name": "scan-adls-raw",
"kind": "AdlsGen2",
"properties": {
"scanRulesetName": "AdlsGen2DefaultScanRuleSet",
"collection": { "referenceName": "raw-data-collection" },
"credential": { "referenceName": "managed-identity-credential" },
"endpoint": "https://acmeanalyticsgov.dfs.core.usgovcloudapi.net/",
"resourceTypes": { "AdlsGen2": { "scanRulesetType": "System" } }
}
}
Purview scans discover assets, classify columns (PII, PHI), and build lineage graphs. They are the governance equivalent of crawlers.
Databricks Auto Loader (runtime schema inference):
# Auto Loader: stream new files from ADLS, infer schema, write to Delta
(spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", "dbfs:/schemas/raw_events")
.option("cloudFiles.inferColumnTypes", "true")
.option("cloudFiles.schemaEvolutionMode", "addNewColumns")
.load("abfss://raw@acmeanalyticsgov.dfs.core.usgovcloudapi.net/events/")
.writeStream
.format("delta")
.option("checkpointLocation", "dbfs:/checkpoints/raw_events")
.option("mergeSchema", "true")
.trigger(availableNow=True)
.toTable("events_prod.bronze.raw_events"))
Auto Loader provides runtime schema evolution --- it detects new columns in source files and evolves the Delta table schema automatically. This is more capable than Glue Crawlers for streaming/semi-structured ingestion.
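Where open-ended inference is too permissive, Auto Loader's schema hints and rescue mode give tighter control than a Glue Crawler ever did. A small variation on the read side of the stream above; the hinted columns are illustrative.

```python
# Read side only; the writeStream half of the original stream is unchanged.
events_reader = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "dbfs:/schemas/raw_events")
    # Pin the types of columns you depend on downstream; the rest are inferred.
    .option("cloudFiles.schemaHints", "event_ts TIMESTAMP, quantity BIGINT")
    # In rescue mode the table schema never evolves; new columns and
    # mismatched values are captured in the _rescued_data column instead.
    .option("cloudFiles.schemaEvolutionMode", "rescue")
    .load("abfss://raw@acmeanalyticsgov.dfs.core.usgovcloudapi.net/events/"))
```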
Part 4: Glue Studio to ADF visual pipelines¶
ADF pipeline structure¶
{
"name": "pipeline_daily_sales_etl",
"properties": {
"activities": [
{
"name": "RunDbtSalesModels",
"type": "DatabricksNotebook",
"typeProperties": {
"notebookPath": "/Repos/analytics/sales/run_dbt",
"baseParameters": {
"dbt_command": "dbt run --select tag:daily"
}
},
"linkedServiceName": {
"referenceName": "AzureDatabricksLinkedService",
"type": "LinkedServiceReference"
}
},
{
"name": "RunDbtTests",
"type": "DatabricksNotebook",
"dependsOn": [
{
"activity": "RunDbtSalesModels",
"dependencyConditions": ["Succeeded"]
}
],
"typeProperties": {
"notebookPath": "/Repos/analytics/sales/run_dbt",
"baseParameters": {
"dbt_command": "dbt test --select tag:daily"
}
},
"linkedServiceName": {
"referenceName": "AzureDatabricksLinkedService",
"type": "LinkedServiceReference"
}
},
{
"name": "RefreshPowerBIDataset",
"type": "WebActivity",
"dependsOn": [
{
"activity": "RunDbtTests",
"dependencyConditions": ["Succeeded"]
}
],
"typeProperties": {
"url": "https://api.powerbi.com/v1.0/myorg/groups/{workspace_id}/datasets/{dataset_id}/refreshes",
"method": "POST",
"authentication": {
"type": "MSI",
"resource": "https://analysis.windows.net/powerbi/api"
}
}
}
],
"parameters": {
"run_date": { "type": "string", "defaultValue": "" }
}
}
}
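Both Databricks activities above pass a `dbt_command` base parameter to the same notebook. A minimal sketch of what a `/Repos/analytics/sales/run_dbt` notebook could look like, assuming the dbt project is checked into the same Repo and `dbt-databricks` is installed on the job cluster.

```python
# Databricks notebook: /Repos/analytics/sales/run_dbt (illustrative sketch)
import subprocess

# The ADF activity's baseParameters arrive as notebook widgets.
dbutils.widgets.text("dbt_command", "dbt build")
dbt_command = dbutils.widgets.get("dbt_command")

# Run dbt from the project directory checked into the same Repo.
result = subprocess.run(
    dbt_command.split(),
    cwd="/Workspace/Repos/analytics/sales/dbt_project",
    capture_output=True,
    text=True,
)
print(result.stdout)
print(result.stderr)

# Raise on failure so ADF's Succeeded/Failed dependency conditions fire correctly.
if result.returncode != 0:
    raise RuntimeError(f"dbt exited with status {result.returncode}")
```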
Part 5: Step Functions to ADF pipeline orchestration¶
Architecture comparison¶
| Step Functions concept | ADF equivalent | Notes |
|---|---|---|
| State machine | Pipeline | Visual orchestration |
| Task state | Activity | Execute an action |
| Choice state | If Condition activity | Branching logic |
| Parallel state | ForEach (parallel) | Concurrent execution |
| Wait state | Wait activity | Time-based delay |
| Map state | ForEach activity | Iterate over collection |
| Fail/Succeed state | Fail activity / pipeline success | Terminal states |
| Retry/Catch | Activity retry policy + error handling | Built-in retry with backoff |
| Step Functions Express | ADF with Durable Functions | High-volume, short-lived workflows |
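To make the Choice-state row concrete, here is the shape of an ADF If Condition activity, written as a Python dict in the same style as the Purview scan configuration in Part 1; the expression, parameter, and activity names are illustrative, and the nested activities are trimmed to their names and types.

```python
# Step Functions Choice state expressed as an ADF If Condition activity.
if_condition_activity = {
    "name": "CheckRunMode",
    "type": "IfCondition",
    "typeProperties": {
        # The ADF expression language replaces the Choice state's comparison rules.
        "expression": {
            "value": "@equals(pipeline().parameters.run_mode, 'full')",
            "type": "Expression",
        },
        # Nested activities trimmed to name/type; real ones carry typeProperties
        # and linkedServiceName just like the Databricks activities in Part 4.
        "ifTrueActivities": [{"name": "RunFullRefresh", "type": "DatabricksNotebook"}],
        "ifFalseActivities": [{"name": "RunIncrementalLoad", "type": "DatabricksNotebook"}],
    },
}
```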
Connection management: Glue connections to ADF Linked Services¶
| Glue connection type | ADF linked service | Key Vault integration |
|---|---|---|
| JDBC (PostgreSQL) | AzurePostgreSql | Connection string in Key Vault |
| JDBC (MySQL) | MySql | Connection string in Key Vault |
| JDBC (SQL Server) | SqlServer | Connection string in Key Vault |
| JDBC (Redshift) | AmazonRedshift | Connection string in Key Vault |
| S3 | AmazonS3 | Access key/secret in Key Vault |
| DynamoDB | AmazonDynamoDB | Access key/secret in Key Vault |
| Kafka (MSK) | N/A --- use Event Hubs | Kafka endpoint in connection |
| Custom (REST API) | RestService | API key/token in Key Vault |
Key Vault integration pattern:
{
  "name": "PostgreSQLLinkedService",
  "properties": {
    "type": "AzurePostgreSql",
    "typeProperties": {
      "connectionString": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "KeyVaultLinkedService",
          "type": "LinkedServiceReference"
        },
        "secretName": "postgres-connection-string"
      }
    }
  }
}
Cross-reference: domains/shared/pipelines/adf/ for ADF pipeline patterns; ADR-0001 (docs/adr/0001-adf-dbt-over-airflow.md) for the ADF + dbt architecture decision.
Part 6: Glue Data Quality to dbt tests and contracts¶
Migration mapping¶
| Glue Data Quality rule | dbt equivalent | Notes |
|---|---|---|
| `ColumnExists "col_name"` | `schema.yml` column definition | Schema enforcement at build time |
| `IsComplete "col_name"` | `not_null` test | Column-level test |
| `IsUnique "col_name"` | `unique` test | Column-level test |
| `ColumnValues "col" between X and Y` | `dbt_utils.accepted_range` | Custom test |
| `RowCount > 0` | `dbt_utils.expression_is_true` | Model-level test |
| `DataFreshness "col" < 24hrs` | `freshness` in `sources.yml` | Source freshness monitoring |
| `CustomSQL "SELECT COUNT(*)..."` | Custom SQL test | `tests/custom_test.sql` |
Data product contract pattern¶
# contract.yaml (per data product)
# Cross-reference: domains/finance/data-products/invoices/contract.yaml
name: fact_sales_daily
version: "1.0"
owner: sales-analytics-team
sla:
freshness: 4h
availability: 99.5%
schema:
- name: sales_date
type: DATE
nullable: false
description: "Date of sales aggregation"
- name: region
type: STRING
nullable: false
description: "Sales region"
allowed_values: ["EAST", "WEST", "CENTRAL", "SOUTH"]
- name: total_units
type: BIGINT
nullable: false
constraints:
- ">= 0"
- name: total_revenue
type: DECIMAL(18,2)
nullable: false
constraints:
- ">= 0"
quality_rules:
- row_count: "> 0"
- unique_key: ["sales_date", "region", "product_id"]
CI validates contracts via .github/workflows/validate-contracts.yml.
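A contract like this can be checked mechanically before deployment. A minimal sketch of the kind of check a workflow such as validate-contracts.yml might run, assuming PyYAML is available and that the model's dbt schema.yml sits in the repo; it verifies that every non-nullable contracted column carries a not_null test.

```python
# validate_contract.py: cross-check a data-product contract against dbt tests.
import sys

import yaml

def not_null_columns(dbt_schema: dict, model_name: str) -> set:
    """Columns of the given dbt model that carry a not_null test."""
    for model in dbt_schema.get("models", []):
        if model["name"] == model_name:
            return {
                col["name"]
                for col in model.get("columns", [])
                if "not_null" in col.get("tests", [])
            }
    return set()

def main(contract_path: str, schema_path: str) -> int:
    with open(contract_path) as f:
        contract = yaml.safe_load(f)
    with open(schema_path) as f:
        dbt_schema = yaml.safe_load(f)

    tested = not_null_columns(dbt_schema, contract["name"])
    required = {c["name"] for c in contract["schema"] if not c.get("nullable", True)}
    missing = required - tested
    if missing:
        print(f"Missing not_null tests for: {sorted(missing)}")
        return 1
    print("Contract and dbt tests are consistent")
    return 0

if __name__ == "__main__":
    sys.exit(main("contract.yaml", "models/gold/schema.yml"))
```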
Migration sequence for ETL¶
| Phase | Duration | Activities |
|---|---|---|
| 1. Catalog inventory | 1-2 weeks | Export Glue Catalog; map databases to Unity Catalog catalogs; identify all Glue jobs |
| 2. Catalog migration | 2-3 weeks | Create Unity Catalog structure; register external locations; configure Purview scans |
| 3. Pilot ETL migration | 3-4 weeks | Convert 3-5 representative Glue jobs to dbt models; validate with dual-run |
| 4. Bulk ETL migration | 6-10 weeks | Convert remaining Glue jobs; establish ADF orchestration; deploy contracts |
| 5. Crawler replacement | 2-3 weeks | Configure Purview scan schedules; deploy Auto Loader for streaming sources |
| 6. Validation | 2-3 weeks | Dual-run reconciliation; aggregate parity checks; lineage verification |
Last updated: 2026-04-30
Maintainers: CSA-in-a-Box core team
Related: Migration Center | Compute Migration | Analytics Migration | Migration Playbook