Tutorial: Convert AWS Glue ETL Job to ADF + dbt Pipeline¶
Status: Authored 2026-04-30. Audience: Data engineers converting AWS Glue ETL jobs (PySpark and Python Shell) to Azure Data Factory orchestration with dbt transformation models on Databricks. Prerequisite knowledge: AWS Glue, PySpark basics, SQL, Azure Data Factory concepts. Time estimate: 1-3 days per Glue job depending on complexity.
Overview¶
AWS Glue combines orchestration (triggers, workflows), cataloging (Glue Data Catalog), and compute (Glue Spark/Python jobs) into one service. In csa-inabox on Azure, these responsibilities separate cleanly:
| Glue responsibility | Azure equivalent | Why |
|---|---|---|
| Orchestration (triggers, workflows) | Azure Data Factory (ADF) | Purpose-built orchestrator; see ADR-0001 |
| Catalog (databases, tables) | Unity Catalog + Purview | Runtime + enterprise governance |
| Compute (Spark jobs) | Databricks Jobs | Managed Spark with Photon; see ADR-0002 |
| Transforms (PySpark / Python) | dbt models (SQL-first) or Databricks notebooks | SQL for most transforms; notebooks for complex logic |
This tutorial walks through converting a single Glue ETL job to the ADF + dbt pattern, end to end.
AWS comparison: In AWS, Glue is a single service that does orchestration, cataloging, and compute. In Azure, you get purpose-built services for each concern. This seems like "more services," but it means each piece scales and governs independently. ADF handles orchestration (like Step Functions + Glue triggers), dbt handles transformations (like Glue PySpark but in SQL), and Databricks provides compute (like EMR but managed).
Prerequisites¶
Tools¶
| Tool | Minimum version | Purpose |
|---|---|---|
| AWS CLI | 2.x | Export Glue job definitions |
| Azure CLI | 2.60+ | ADF and Databricks provisioning |
| dbt-databricks | 1.8+ | Transformation layer |
| Databricks CLI | 0.220+ | Workspace management |
| Python | 3.10+ | Local dbt development |
AWS access¶
- `glue:GetJob`, `glue:GetJobRun`, `glue:GetTables`, `glue:GetDatabases` permissions.
- Access to the S3 bucket where Glue job scripts are stored.
Azure access¶
- An ADF instance (or Fabric Data Factory).
- A Databricks workspace with Unity Catalog enabled.
- ADLS Gen2 storage account with data already migrated (see tutorial-s3-to-adls.md).
Step 1: Document the existing Glue job¶
Before converting anything, fully document what the Glue job does.
Export Glue job definition¶
# Get the Glue job definition
aws glue get-job --job-name daily-customer-etl --output json > glue_job_def.json
# Key fields to capture:
# - Command.ScriptLocation (S3 path to the PySpark script)
# - DefaultArguments (parameters passed to the job)
# - Connections (data sources)
# - MaxCapacity or NumberOfWorkers + WorkerType (DPU sizing)
# - Timeout
# - GlueVersion
# Download the PySpark script
SCRIPT_LOCATION=$(jq -r '.Job.Command.ScriptLocation' glue_job_def.json)
aws s3 cp ${SCRIPT_LOCATION} ./glue_scripts/
# Get trigger (schedule) information
aws glue get-trigger --name daily-customer-etl-trigger --output json > glue_trigger.json
# Get Glue Catalog tables used by this job
aws glue get-tables --database-name analytics --output json > glue_tables.json
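The exported JSON can be distilled into the job profile table below programmatically instead of by hand. A minimal sketch — the `summarize_glue_job` helper and its field selection are illustrative, not part of any AWS SDK, though the field names match the `aws glue get-job` response shape:

```python
def summarize_glue_job(job_def: dict) -> dict:
    """Pull the migration-relevant fields out of `aws glue get-job` output.

    Expects the parsed JSON from `glue_job_def.json` (top-level "Job" key).
    """
    job = job_def["Job"]
    return {
        "name": job.get("Name"),
        "script": job.get("Command", {}).get("ScriptLocation"),
        "glue_version": job.get("GlueVersion"),
        # Older jobs use MaxCapacity (DPUs); newer ones use NumberOfWorkers
        "workers": job.get("NumberOfWorkers") or job.get("MaxCapacity"),
        "worker_type": job.get("WorkerType"),
        "timeout_minutes": job.get("Timeout"),
        "arguments": job.get("DefaultArguments", {}),
        "connections": job.get("Connections", {}).get("Connections", []),
    }

# Usage: profile = summarize_glue_job(json.load(open("glue_job_def.json")))
```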
Document the job profile¶
| Attribute | Value |
|---|---|
| Job name | daily-customer-etl |
| Type | Spark (Glue 4.0) |
| DPU / Workers | 10 DPU (G.1X, 5 workers) |
| Schedule | Daily at 03:00 UTC |
| Script location | s3://acme-glue-scripts/jobs/daily_customer_etl.py |
| Source tables | raw.customer_events, raw.customer_profiles |
| Target table | curated.customer_360 |
| Connections | redshift-analytics (JDBC), s3-raw-bucket |
| Avg runtime | 18 minutes |
| Bookmarks | Enabled (incremental) |
Analyze the PySpark script¶
Here is a representative Glue PySpark job:
# daily_customer_etl.py (AWS Glue)
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql.functions import col, when, lit, current_timestamp, datediff
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'run_date'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Read from Glue Catalog
events_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="raw",
    table_name="customer_events",
    push_down_predicate=f"event_date = '{args['run_date']}'"
)
profiles_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="raw",
    table_name="customer_profiles"
)
# Convert to DataFrames for complex transforms
events_df = events_dyf.toDF()
profiles_df = profiles_dyf.toDF()
# Business logic: build customer 360
customer_360 = events_df \
    .groupBy("customer_id") \
    .agg({"event_type": "count", "revenue": "sum", "event_date": "max"}) \
    .withColumnRenamed("count(event_type)", "total_events") \
    .withColumnRenamed("sum(revenue)", "lifetime_revenue") \
    .withColumnRenamed("max(event_date)", "last_activity_date") \
    .join(profiles_df, "customer_id", "left") \
    .withColumn("customer_segment",
        when(col("lifetime_revenue") > 10000, lit("platinum"))
        .when(col("lifetime_revenue") > 5000, lit("gold"))
        .when(col("lifetime_revenue") > 1000, lit("silver"))
        .otherwise(lit("bronze"))
    ) \
    .withColumn("days_since_activity",
        datediff(current_timestamp(), col("last_activity_date"))
    ) \
    .withColumn("is_active", col("days_since_activity") < 90) \
    .withColumn("etl_timestamp", current_timestamp())
# Write to Glue Catalog target
glueContext.write_dynamic_frame.from_catalog(
    frame=DynamicFrame.fromDF(customer_360, glueContext, "customer_360"),
    database="curated",
    table_name="customer_360",
    additional_options={"enableUpdateCatalog": True}
)
job.commit()
Step 2: Create ADF Linked Services for source and target¶
ADF Linked Services are the equivalent of Glue Connections.
Create ADLS Gen2 linked service¶
{
"name": "ls_adls_analytics",
"type": "Microsoft.DataFactory/factories/linkedservices",
"properties": {
"type": "AzureBlobFS",
"typeProperties": {
"url": "https://acmeanalyticsgov.dfs.core.usgovcloudapi.net",
"accountKey": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "ls_keyvault",
"type": "LinkedServiceReference"
},
"secretName": "storage-account-key"
}
}
}
}
Create Databricks linked service¶
{
"name": "ls_databricks",
"type": "Microsoft.DataFactory/factories/linkedservices",
"properties": {
"type": "AzureDatabricks",
"typeProperties": {
"domain": "https://adb-1234567890.1.azuredatabricks.net",
"authentication": "MSI",
"workspaceResourceId": "/subscriptions/<sub-id>/resourceGroups/rg-analytics/providers/Microsoft.Databricks/workspaces/dbx-analytics",
"newClusterNodeType": "Standard_D4s_v5",
"newClusterNumOfWorker": "2:8",
"newClusterSparkEnvVars": {
"PYSPARK_PYTHON": "/databricks/python3/bin/python3"
},
"newClusterVersion": "15.4.x-scala2.12"
}
}
}
AWS comparison: Glue Connections store JDBC/S3 credentials. ADF Linked Services are the same concept but with richer auth options -- managed identity (no credentials stored), Key Vault references, or service principal. Prefer managed identity (`"authentication": "MSI"`) to eliminate credential management entirely.
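Linked-service JSON like the above can be deployed from the CLI rather than the portal. A sketch, assuming the `datafactory` Azure CLI extension is available and the `properties` object is saved as `ls_databricks.json`; the factory and resource-group names are placeholders:

```bash
# One-time: install the Data Factory CLI extension
az extension add --name datafactory

# Create (or update) the linked service from the JSON definition
az datafactory linked-service create \
  --factory-name adf-analytics \
  --resource-group rg-analytics \
  --linked-service-name ls_databricks \
  --properties @ls_databricks.json
```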
Step 3: Build ADF Copy Activity for data ingestion¶
If the Glue job includes a data-copy step (reading from an external source into the lake), convert that to an ADF Copy Activity.
ADF pipeline JSON¶
{
"name": "pl_ingest_customer_events",
"properties": {
"activities": [
{
"name": "copy_customer_events",
"type": "Copy",
"inputs": [
{
"referenceName": "ds_source_customer_events",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "ds_adls_bronze_customer_events",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "ParquetSource",
"storeSettings": {
"type": "AzureBlobFSReadSettings",
"recursive": true,
"wildcardFolderPath": {
"value": "@formatDateTime(pipeline().parameters.run_date, 'yyyy/MM/dd')",
"type": "Expression"
}
}
},
"sink": {
"type": "ParquetSink",
"storeSettings": {
"type": "AzureBlobFSWriteSettings"
},
"formatSettings": {
"type": "ParquetWriteSettings"
}
},
"enableStaging": false
}
},
{
"name": "run_dbt_transforms",
"type": "DatabricksNotebook",
"dependsOn": [
{
"activity": "copy_customer_events",
"dependencyConditions": ["Succeeded"]
}
],
"linkedServiceName": {
"referenceName": "ls_databricks",
"type": "LinkedServiceReference"
},
"typeProperties": {
"notebookPath": "/Repos/acme/analytics/notebooks/run_dbt",
"baseParameters": {
"dbt_command": "dbt run --select customer_360",
"run_date": {
"value": "@pipeline().parameters.run_date",
"type": "Expression"
}
}
}
}
],
"parameters": {
"run_date": {
"type": "string",
"defaultValue": "@utcnow('yyyy-MM-dd')"
}
}
}
}
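Before publishing a pipeline, it is worth sanity-checking that every `dependsOn` reference points at a real activity -- a common copy-paste error when translating Glue workflows with many steps. A minimal, illustrative checker (not an ADF API; it just inspects the exported JSON):

```python
def check_activity_dependencies(pipeline: dict) -> list[str]:
    """Return dependsOn references that name non-existent activities.

    Expects a pipeline dict with the standard ADF shape:
    properties.activities[].name and properties.activities[].dependsOn[].activity.
    """
    activities = pipeline["properties"]["activities"]
    names = {a["name"] for a in activities}
    missing = []
    for activity in activities:
        for dep in activity.get("dependsOn", []):
            if dep["activity"] not in names:
                missing.append(f"{activity['name']} -> {dep['activity']}")
    return missing
```

An empty result means all dependencies resolve; run it over the pipeline JSON file before deploying.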
Step 4: Convert Glue PySpark transforms to dbt SQL models¶
This is where the core transformation logic moves from PySpark to SQL. Most Glue PySpark transforms translate directly to SQL, which is easier to test, version, and audit.
Before: Glue PySpark (from Step 1)¶
The PySpark script does: read events, aggregate by customer, join profiles, compute segments, write output.
After: dbt SQL models¶
Staging model -- clean and type-cast source data:
-- models/staging/stg_customer_events.sql
{{ config(materialized='view') }}
SELECT
customer_id,
event_type,
CAST(revenue AS DOUBLE) AS revenue,
CAST(event_date AS DATE) AS event_date,
event_timestamp
FROM {{ source('bronze', 'customer_events') }}
WHERE event_date IS NOT NULL
Staging model -- customer profiles:
-- models/staging/stg_customer_profiles.sql
{{ config(materialized='view') }}
SELECT
customer_id,
first_name,
last_name,
email,
signup_date,
account_status
FROM {{ source('bronze', 'customer_profiles') }}
Gold model -- customer 360 (replaces the Glue PySpark transform):
-- models/gold/customer_360.sql
{{ config(
materialized='incremental',
unique_key='customer_id',
incremental_strategy='merge',
partition_by=['customer_segment'],
post_hook="OPTIMIZE {{ this }} ZORDER BY (customer_id)"
) }}
WITH event_summary AS (
SELECT
customer_id,
COUNT(event_type) AS total_events,
SUM(revenue) AS lifetime_revenue,
MAX(event_date) AS last_activity_date
FROM {{ ref('stg_customer_events') }}
{% if is_incremental() %}
WHERE event_date >= date_sub(current_date(), 3)
{% endif %}
GROUP BY customer_id
)
SELECT
e.customer_id,
e.total_events,
e.lifetime_revenue,
e.last_activity_date,
p.first_name,
p.last_name,
p.email,
p.signup_date,
p.account_status,
CASE
WHEN e.lifetime_revenue > 10000 THEN 'platinum'
WHEN e.lifetime_revenue > 5000 THEN 'gold'
WHEN e.lifetime_revenue > 1000 THEN 'silver'
ELSE 'bronze'
END AS customer_segment,
datediff(current_date(), e.last_activity_date) AS days_since_activity,
datediff(current_date(), e.last_activity_date) < 90 AS is_active,
current_timestamp() AS etl_timestamp
FROM event_summary e
LEFT JOIN {{ ref('stg_customer_profiles') }} p
ON e.customer_id = p.customer_id
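The ADF pipeline (Step 6) passes `run_date` to dbt as a variable, so the incremental filter can be anchored to the orchestrator's date instead of `current_date()`. A sketch of the alternative filter -- keeping a 3-day lookback (an assumption, to absorb late-arriving events) relative to the supplied date:

```sql
{% if is_incremental() %}
  WHERE event_date >= date_sub(CAST('{{ var("run_date") }}' AS DATE), 3)
{% endif %}
```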
dbt tests (replaces Glue DataQuality):
# models/gold/customer_360.yml
version: 2
models:
  - name: customer_360
    description: >
      Customer 360 view combining event aggregates with profile data.
      Migrated from Glue job: daily-customer-etl.
      Original script: s3://acme-glue-scripts/jobs/daily_customer_etl.py
    columns:
      - name: customer_id
        tests:
          - not_null
          - unique
      - name: lifetime_revenue
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
      - name: customer_segment
        tests:
          - accepted_values:
              values: ["platinum", "gold", "silver", "bronze"]
      - name: total_events
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 1
Conversion cheat sheet: Glue PySpark to dbt SQL¶
| Glue PySpark pattern | dbt SQL equivalent |
|---|---|
| `glueContext.create_dynamic_frame.from_catalog(...)` | `{{ source('schema', 'table') }}` |
| `dyf.toDF().filter(...)` | `WHERE` clause |
| `df.groupBy(...).agg(...)` | `GROUP BY` + aggregate functions |
| `df.join(other, "key", "left")` | `LEFT JOIN ... ON` |
| `df.withColumn("new", expr)` | `expression AS new` in `SELECT` |
| `when(...).when(...).otherwise(...)` | `CASE WHEN ... WHEN ... ELSE ... END` |
| `datediff(col1, col2)` | `datediff(col1, col2)` (same in Databricks SQL) |
| `current_timestamp()` | `current_timestamp()` (same) |
| `df.write.mode("overwrite").saveAsTable(...)` | `{{ config(materialized='table') }}` |
| `df.write.mode("append")` | `{{ config(materialized='incremental') }}` |
| Glue bookmarks (incremental reads) | `{% if is_incremental() %} WHERE ... {% endif %}` |
| `ResolveChoice` (schema conflicts) | `CAST(col AS type)` in staging model |
| `glueContext.write_dynamic_frame.from_catalog(...)` | Automatic (dbt writes to the configured target) |
Step 5: Set up dbt project with profiles.yml for Databricks¶
Project structure¶
redshift_migration/
    dbt_project.yml
    packages.yml
    models/
        staging/
            _sources.yml
            stg_customer_events.sql
            stg_customer_profiles.sql
        gold/
            customer_360.sql
            customer_360.yml
    tests/
        validate_customer_360_parity.sql
    macros/
        generate_schema_name.sql
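The `tests/validate_customer_360_parity.sql` singular test listed above compares the dbt output against the Glue-produced table during the dual-run period. A sketch, assuming the Glue output is reachable as `migration_bridge.curated.customer_360` (as in Step 8); the test fails whenever it returns rows:

```sql
-- tests/validate_customer_360_parity.sql
-- Fails when per-customer aggregates diverge between Glue and dbt outputs
SELECT
    d.customer_id,
    d.lifetime_revenue AS dbt_revenue,
    g.lifetime_revenue AS glue_revenue
FROM {{ ref('customer_360') }} d
FULL OUTER JOIN migration_bridge.curated.customer_360 g
    ON d.customer_id = g.customer_id
WHERE d.customer_id IS NULL
   OR g.customer_id IS NULL
   OR ABS(d.lifetime_revenue - g.lifetime_revenue) > 0.01
```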
dbt_project.yml¶
name: analytics_migration
version: "1.0.0"
config-version: 2
profile: "analytics_migration"
model-paths: ["models"]
test-paths: ["tests"]
macro-paths: ["macros"]

models:
  analytics_migration:
    staging:
      +materialized: view
      +schema: staging
    gold:
      +materialized: incremental
      +schema: gold
      +tags: ["daily"]
Sources configuration¶
# models/staging/_sources.yml
version: 2
sources:
  - name: bronze
    database: analytics_prod
    schema: bronze
    tables:
      - name: customer_events
        description: "Raw customer events (migrated from S3 via AzCopy)"
        loaded_at_field: event_date
        freshness:
          warn_after: { count: 24, period: hour }
          error_after: { count: 48, period: hour }
      - name: customer_profiles
        description: "Customer profile data (migrated from Redshift)"
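The step title mentions `profiles.yml`, which tells dbt-databricks how to reach the workspace; it lives outside the repo (typically `~/.dbt/`). A sketch with placeholder host and warehouse path -- the catalog, schema, and environment-variable name are illustrative:

```yaml
# ~/.dbt/profiles.yml
analytics_migration:
  target: dev
  outputs:
    dev:
      type: databricks
      catalog: analytics_prod
      schema: dev_scratch
      host: adb-1234567890.1.azuredatabricks.net
      http_path: /sql/1.0/warehouses/<warehouse-id>
      token: "{{ env_var('DATABRICKS_TOKEN') }}"
      threads: 4
```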
Run and test¶
# Install dbt packages
dbt deps
# Run the models
dbt run --select customer_360
# Run tests
dbt test --select customer_360
# Generate documentation
dbt docs generate
dbt docs serve
Step 6: Configure ADF pipeline to trigger dbt runs¶
Databricks notebook to execute dbt¶
# notebooks/run_dbt.py
# This notebook is called by ADF to execute dbt commands
import subprocess
import sys
# Get parameters from ADF
dbt_command = dbutils.widgets.get("dbt_command")
run_date = dbutils.widgets.get("run_date")
# Set the run date as a dbt variable
full_command = f"{dbt_command} --vars '{{\"run_date\": \"{run_date}\"}}'"
print(f"Executing: {full_command}")
# Execute dbt
result = subprocess.run(
    full_command,
    shell=True,
    cwd="/Workspace/Repos/acme/analytics",
    capture_output=True,
    text=True
)
print("STDOUT:", result.stdout)
if result.returncode != 0:
    print("STDERR:", result.stderr)
    dbutils.notebook.exit(f"FAILED: {result.stderr[-500:]}")
else:
    dbutils.notebook.exit("SUCCESS")
ADF pipeline with dbt activity¶
The pipeline from Step 3 already includes the run_dbt_transforms activity. Here is the complete orchestration pattern:
[ADF Trigger: Daily 03:00 UTC]
-> [Copy Activity: Ingest new data to bronze]
-> [Databricks Notebook: dbt run --select customer_360]
-> [Databricks Notebook: dbt test --select customer_360]
-> [Web Activity: Notify on success/failure]
Step 7: Set up scheduling (Glue triggers to ADF triggers)¶
Glue trigger to ADF trigger mapping¶
| Glue trigger type | ADF trigger type | Configuration |
|---|---|---|
| Schedule trigger | Schedule trigger | Cron expression |
| On-demand | Manual trigger | REST API call or portal |
| EventBridge event | Event trigger (Storage events) | Blob created/modified events |
| Crawler completion trigger | Pipeline dependency | Activity dependency chain |
| Workflow trigger | Pipeline trigger | Tumbling window or schedule |
Create an ADF schedule trigger¶
{
"name": "tr_daily_customer_etl",
"properties": {
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Day",
"interval": 1,
"startTime": "2026-04-30T03:00:00Z",
"timeZone": "UTC"
}
},
"pipelines": [
{
"pipelineReference": {
"referenceName": "pl_ingest_customer_events",
"type": "PipelineReference"
},
"parameters": {
"run_date": "@trigger().scheduledTime"
}
}
]
}
}
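Triggers can also be created and started from the CLI, assuming the `datafactory` Azure CLI extension and that the `properties` object above is saved as `tr_daily_customer_etl.json` (factory and resource-group names are placeholders). Note that a newly created trigger does nothing until it is started:

```bash
az datafactory trigger create \
  --factory-name adf-analytics \
  --resource-group rg-analytics \
  --name tr_daily_customer_etl \
  --properties @tr_daily_customer_etl.json

# Triggers are created in a stopped state; start it explicitly
az datafactory trigger start \
  --factory-name adf-analytics \
  --resource-group rg-analytics \
  --name tr_daily_customer_etl
```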
Create an event-based trigger (replacing Glue EventBridge triggers)¶
{
"name": "tr_new_file_arrival",
"properties": {
"type": "BlobEventsTrigger",
"typeProperties": {
"blobPathBeginsWith": "/bronze/customer_events/",
"blobPathEndsWith": ".parquet",
"ignoreEmptyBlobs": true,
"scope": "/subscriptions/<sub>/resourceGroups/rg-analytics/providers/Microsoft.Storage/storageAccounts/acmeanalyticsgov",
"events": ["Microsoft.Storage.BlobCreated"]
},
"pipelines": [
{
"pipelineReference": {
"referenceName": "pl_ingest_customer_events",
"type": "PipelineReference"
}
}
]
}
}
AWS comparison: Glue triggers work with CloudWatch Events / EventBridge. ADF triggers work with Azure Event Grid (for storage events) or built-in schedules. The main UX difference is that ADF triggers are configured alongside the pipeline, while Glue triggers are separate resources. Functionally they are equivalent.
Step 8: Validate output parity¶
Side-by-side comparison¶
# Databricks notebook: validate_glue_migration.py
# Load Glue-produced output (from S3 via OneLake shortcut)
glue_output = spark.read.table("migration_bridge.curated.customer_360")
# Load dbt-produced output
dbt_output = spark.read.table("analytics_prod.gold.customer_360")
# Row count comparison
glue_count = glue_output.count()
dbt_count = dbt_output.count()
print(f"Glue output: {glue_count:,} rows")
print(f"dbt output: {dbt_count:,} rows")
print(f"Difference: {abs(glue_count - dbt_count):,} rows")
# Aggregate comparison (guard against a zero baseline)
for col_name in ["lifetime_revenue", "total_events"]:
    glue_sum = glue_output.agg({col_name: "sum"}).first()[0] or 0.0
    dbt_sum = dbt_output.agg({col_name: "sum"}).first()[0] or 0.0
    if glue_sum == 0:
        pct_diff = 0.0 if dbt_sum == 0 else float("inf")
    else:
        pct_diff = abs(glue_sum - dbt_sum) / glue_sum * 100
    status = "PASS" if pct_diff < 0.01 else "FAIL"
    print(f"{col_name}: Glue={glue_sum:,.2f} dbt={dbt_sum:,.2f} diff={pct_diff:.4f}% [{status}]")
# Segment distribution comparison
print("\nSegment distribution:")
glue_segments = glue_output.groupBy("customer_segment").count().orderBy("customer_segment")
dbt_segments = dbt_output.groupBy("customer_segment").count().orderBy("customer_segment")
glue_segments.show()
dbt_segments.show()
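Factoring the tolerance check into plain Python makes it unit-testable outside a Spark session and keeps the zero-baseline handling in one place. An illustrative sketch -- these helper names are not part of the notebook above:

```python
def pct_diff(expected: float, actual: float) -> float:
    """Absolute percentage difference, tolerating a zero baseline."""
    if expected == 0:
        return 0.0 if actual == 0 else float("inf")
    return abs(expected - actual) / abs(expected) * 100

def parity_status(expected: float, actual: float, tolerance_pct: float = 0.01) -> str:
    """PASS when the relative difference is within tolerance, FAIL otherwise."""
    return "PASS" if pct_diff(expected, actual) < tolerance_pct else "FAIL"
```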
Dual-run period¶
Run both the Glue job and the ADF+dbt pipeline in parallel for at least 5 business days. Compare outputs daily. Only decommission the Glue job after consistent parity is confirmed.
Glue Catalog to Purview migration¶
The Glue Data Catalog metadata (databases, tables, partitions) should be migrated to Unity Catalog (runtime) and Purview (governance).
Export Glue Catalog metadata¶
# Export all databases
aws glue get-databases --output json > glue_databases.json
# Export tables for each database
for DB in $(aws glue get-databases --query 'DatabaseList[].Name' --output text); do
aws glue get-tables --database-name ${DB} --output json > "glue_tables_${DB}.json"
done
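The exported table definitions can be translated into Unity Catalog DDL mechanically. A minimal sketch that converts one entry of the `aws glue get-tables` output (field names match the Glue API response) into a `CREATE TABLE` statement -- the type mapping covers only common cases and the target catalog name is an assumption:

```python
# Common Glue -> Databricks SQL type mappings; extend as needed (illustrative)
TYPE_MAP = {
    "string": "STRING", "int": "INT", "bigint": "BIGINT",
    "double": "DOUBLE", "boolean": "BOOLEAN", "date": "DATE",
    "timestamp": "TIMESTAMP",
}

def glue_table_to_ddl(table: dict, catalog: str = "analytics_prod") -> str:
    """Build a CREATE TABLE statement from one TableList entry of get-tables output."""
    cols = table["StorageDescriptor"]["Columns"]
    col_defs = ",\n  ".join(
        f"{c['Name']} {TYPE_MAP.get(c['Type'], c['Type'].upper())}" for c in cols
    )
    return (
        f"CREATE TABLE IF NOT EXISTS {catalog}.{table['DatabaseName']}.{table['Name']} (\n"
        f"  {col_defs}\n) USING DELTA;"
    )
```

This only helps for tables that are re-created empty; tables materialized by dbt are registered in Unity Catalog automatically, as noted below.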
Register in Unity Catalog¶
-- Create catalogs and schemas matching Glue databases
CREATE CATALOG IF NOT EXISTS analytics_prod;
CREATE SCHEMA IF NOT EXISTS analytics_prod.bronze;
CREATE SCHEMA IF NOT EXISTS analytics_prod.silver;
CREATE SCHEMA IF NOT EXISTS analytics_prod.gold;
-- Register migrated tables (already loaded as Delta)
-- Tables created by dbt are automatically registered in Unity Catalog
Connect Purview to scan Unity Catalog¶
# Register the Databricks workspace as a Purview data source
az purview account add-root-collection-admin \
--account-name purview-analytics \
--resource-group rg-analytics \
--object-id <databricks-managed-identity-object-id>
# Purview will scan Unity Catalog and populate:
# - Business glossary terms
# - Data classifications (PII, PHI, etc.)
# - Lineage (dbt model dependencies)
AWS comparison: Glue Data Catalog is both the runtime catalog and the governance layer (via Lake Formation). In Azure, these are separate: Unity Catalog handles runtime metadata (what Spark/Databricks sees), and Purview handles enterprise governance (classifications, lineage, glossary, access policies). This separation means Purview can govern data across Databricks, ADF, Power BI, and external systems -- not just the Spark runtime.
Related resources¶
- AWS-to-Azure migration playbook -- full capability mapping, Glue section 2.3
- S3 to ADLS tutorial -- storage migration prerequisite
- Redshift to Fabric tutorial -- warehouse migration
- Best practices -- migration patterns and pitfalls
- docs/adr/0001-adf-dbt-over-airflow.md -- why ADF + dbt over Airflow
- docs/adr/0006-purview-over-atlas.md -- Purview as the governance layer
- domains/shared/pipelines/adf/ -- reference ADF pipeline patterns
- domains/shared/dbt/dbt_project.yml -- reference dbt project structure
Last updated: 2026-04-30 Maintainers: CSA-in-a-Box core team