Compute Migration: Redshift, EMR, and Athena to Azure¶
A deep-dive guide for data engineers migrating AWS compute services to Databricks, Fabric, and Synapse on Azure.
Executive summary¶
The AWS analytics compute layer is three services with three different programming models: Redshift for SQL warehousing, EMR for Spark/Hadoop, and Athena for serverless ad-hoc queries. Each has its own query dialect, its own performance tuning model, and its own cost structure. Migrating these to Azure means mapping three compute models onto a unified Databricks + Fabric platform.
The consolidation payoff is significant. Instead of managing Redshift clusters, EMR fleets, and Athena workgroups independently, the target architecture uses Databricks SQL Warehouses for interactive queries, Databricks Jobs for batch processing, and Fabric SQL endpoints for lightweight ad-hoc access. One platform, one billing model, one monitoring surface.
Part 1: Redshift to Databricks SQL¶
Architecture mapping¶
| Redshift concept | Databricks SQL equivalent | Notes |
|---|---|---|
| Cluster (RA3 nodes) | SQL Warehouse (Classic or Serverless) | Serverless recommended for variable workloads |
| Database | Unity Catalog catalog | One Redshift database per catalog |
| Schema | Unity Catalog schema | 1:1 mapping |
| Table (distribution style) | Delta table (partitioned) | See distribution mapping below |
| External table (Spectrum) | OneLake shortcut / external location | Zero-copy reads from S3 |
| View | View in Unity Catalog | 1:1 mapping |
| Materialized view | dbt incremental model / Databricks MV | dbt preferred for testability |
| Stored procedure | dbt model / notebook job | See SP migration section |
| WLM queue | SQL Warehouse | One warehouse per workload class |
| User/group | Entra ID user/group + Unity Catalog grants | See Security Migration |
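For example, a Redshift database/schema pair maps onto the three-level Unity Catalog namespace. A minimal sketch; the catalog, schema, and table names are illustrative:

-- Create the Unity Catalog objects that replace a Redshift database and schema
CREATE CATALOG IF NOT EXISTS sales_prod;
CREATE SCHEMA IF NOT EXISTS sales_prod.gold;

-- Three-level namespace replaces Redshift's database.schema.table
SELECT * FROM sales_prod.gold.fact_orders LIMIT 10;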
SQL dialect differences¶
Redshift SQL is a PostgreSQL derivative; Databricks SQL is based on Spark SQL. Most standard SQL runs unchanged on both, but the dialects differ in the areas below.
Date and time functions¶
| Operation | Redshift SQL | Databricks SQL |
|---|---|---|
| Current date | CURRENT_DATE | CURRENT_DATE() or CURRENT_DATE |
| Current timestamp | GETDATE() or SYSDATE | CURRENT_TIMESTAMP() or NOW() |
| Date add | DATEADD(day, 7, date_col) | DATE_ADD(date_col, 7) |
| Date diff | DATEDIFF(day, start, end) | DATEDIFF(end, start) (note: reversed arg order) |
| Date trunc | DATE_TRUNC('month', date_col) | DATE_TRUNC('month', date_col) (same) |
| Extract | EXTRACT(year FROM date_col) | YEAR(date_col) or EXTRACT(YEAR FROM date_col) |
| String to date | TO_DATE(str, 'YYYY-MM-DD') | TO_DATE(str, 'yyyy-MM-dd') (Java format) |
| Date to string | TO_CHAR(date_col, 'YYYY-MM-DD') | DATE_FORMAT(date_col, 'yyyy-MM-dd') |
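A worked conversion covering the trickier rows, DATEADD and the reversed DATEDIFF argument order; the table and column names are illustrative:

-- Redshift
SELECT order_id,
       DATEADD(day, 7, order_date)          AS due_date,
       DATEDIFF(day, order_date, ship_date) AS days_to_ship
FROM sales.fact_orders;

-- Databricks SQL (note the reversed DATEDIFF argument order)
SELECT order_id,
       DATE_ADD(order_date, 7)          AS due_date,
       DATEDIFF(ship_date, order_date)  AS days_to_ship
FROM sales_prod.gold.fact_orders;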
String functions¶
| Operation | Redshift SQL | Databricks SQL |
|---|---|---|
| Concatenate | col1 \|\| col2 or CONCAT(a, b) | CONCAT(a, b) or a \|\| b |
| Substring | SUBSTRING(str, start, len) | SUBSTRING(str, start, len) (same) |
| Length | LEN(str) | LENGTH(str) or LEN(str) |
| Trim | TRIM(str) | TRIM(str) (same) |
| Replace | REPLACE(str, old, new) | REPLACE(str, old, new) (same) |
| Regex match | str ~ 'pattern' | str RLIKE 'pattern' |
| NVL | NVL(a, b) | COALESCE(a, b) or NVL(a, b) |
| DECODE | DECODE(col, val1, res1, val2, res2, default) | CASE WHEN col = val1 THEN res1 ... |
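For example, a DECODE combined with a regex-match filter converts as follows (illustrative names):

-- Redshift
SELECT DECODE(region, 'E', 'East', 'W', 'West', 'Other') AS region_name
FROM sales.fact_orders
WHERE region ~ '^[EW]$';

-- Databricks SQL: DECODE becomes CASE, ~ becomes RLIKE
SELECT CASE region WHEN 'E' THEN 'East'
                   WHEN 'W' THEN 'West'
                   ELSE 'Other' END AS region_name
FROM sales_prod.gold.fact_orders
WHERE region RLIKE '^[EW]$';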
Data types¶
| Redshift type | Databricks SQL type | Notes |
|---|---|---|
| SMALLINT | SMALLINT | Same |
| INTEGER | INT | Same |
| BIGINT | BIGINT | Same |
| DECIMAL(p,s) | DECIMAL(p,s) | Same |
| REAL / FLOAT4 | FLOAT | Same |
| DOUBLE PRECISION / FLOAT8 | DOUBLE | Same |
| BOOLEAN | BOOLEAN | Same |
| CHAR(n) | STRING | Databricks uses variable-length strings |
| VARCHAR(n) | STRING | STRING recommended; VARCHAR(n) is supported where length enforcement is required |
| DATE | DATE | Same |
| TIMESTAMP | TIMESTAMP | Same |
| TIMESTAMPTZ | TIMESTAMP | Databricks stores UTC; apply timezone in queries |
| SUPER (semi-structured) | STRING (JSON) + JSON functions | Use : notation for JSON field access |
| HLLSKETCH | approx_count_distinct() | Function-based rather than type-based |
| GEOMETRY | STRING (WKT) + H3 functions | Spatial support via H3 and Mosaic libraries |
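For SUPER columns, the usual pattern is to land the value as a JSON string and read fields with the : path syntax. A minimal sketch, assuming a hypothetical telemetry table with a payload column holding JSON:

-- payload is a STRING column containing JSON such as
-- {"device": {"id": "d-42", "os": "ios"}}
SELECT payload:device.id AS device_id,
       payload:device.os AS device_os
FROM events_prod.bronze.telemetry;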
Distribution and sort key mapping¶
Redshift distribution styles control how data is physically distributed across nodes. Delta Lake uses partitioning and Z-ordering instead.
| Redshift distribution | Delta Lake equivalent | Migration approach |
|---|---|---|
| DISTSTYLE KEY (col) | ZORDER BY (col) or PARTITIONED BY (col) | High-cardinality distribution keys map to Z-order; partition only on low-cardinality columns |
| DISTSTYLE EVEN | No partition (or hash partition) | Let Delta auto-optimize; add Z-order if needed |
| DISTSTYLE ALL | Broadcast hint in joins | Small dimension tables; use /*+ BROADCAST(dim) */ |
| SORTKEY (col1, col2) | ZORDER BY (col1, col2) | Run after initial load: OPTIMIZE tbl ZORDER BY (col1, col2) |
| COMPOUND SORTKEY | ZORDER BY (col1, col2) | Z-order handles multi-column optimization |
| INTERLEAVED SORTKEY | ZORDER BY (col1, col2) | Z-order is inherently multi-dimensional |
Example: convert a Redshift table definition
-- Redshift DDL
CREATE TABLE sales.fact_orders (
order_id BIGINT,
customer_id BIGINT,
order_date DATE,
region VARCHAR(50),
product_id BIGINT,
quantity INTEGER,
amount DECIMAL(18,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
COMPOUND SORTKEY (order_date, region);
-- Databricks SQL DDL
CREATE TABLE sales_prod.gold.fact_orders (
order_id BIGINT,
customer_id BIGINT,
order_date DATE,
region STRING,
product_id BIGINT,
quantity INT,
amount DECIMAL(18,2)
)
USING DELTA
PARTITIONED BY (order_date)
COMMENT 'Migrated from Redshift sales.fact_orders';
-- After initial data load, optimize
OPTIMIZE sales_prod.gold.fact_orders
ZORDER BY (region, customer_id);
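DISTSTYLE ALL has no table-level counterpart in Delta; replicate its effect with a broadcast hint at join time. A sketch, with an illustrative dimension table:

-- Redshift replicated the dimension to every node (DISTSTYLE ALL);
-- in Databricks SQL, broadcast the small table at join time instead
SELECT /*+ BROADCAST(d) */
       f.order_id, d.region_name, f.amount
FROM sales_prod.gold.fact_orders f
JOIN sales_prod.gold.dim_region d
  ON f.region = d.region_code;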
Workload Management (WLM) to SQL Warehouse sizing¶
| WLM queue | Typical use | Databricks SQL Warehouse | Sizing guidance |
|---|---|---|---|
| ETL queue (high memory) | Batch loads, CTAS | Databricks Job cluster | Auto-scaling; Photon enabled |
| BI queue (high concurrency) | Dashboard queries | SQL Warehouse (Serverless) | Auto-scale 1-10 clusters; 2X-Small to Medium |
| Ad-hoc queue | Analyst queries | SQL Warehouse (Pro) | Auto-scale 1-4 clusters; Small to Medium |
| Short query queue | Sub-second lookups | SQL Warehouse (Serverless) | Auto-scale; smallest size |
| Superuser queue | Admin/DDL | SQL Warehouse (Classic) | Fixed size; restricted access |
Stored procedure migration¶
Redshift stored procedures use a PL/pgSQL-like syntax. There is no direct equivalent in Databricks SQL, so the migration path depends on the procedure's complexity:
Path 1: SQL-only logic to dbt models (recommended)
-- Redshift SP
CREATE OR REPLACE PROCEDURE refresh_daily_sales(run_date DATE)
AS $$
BEGIN
DELETE FROM gold.fact_sales_daily WHERE sales_date = run_date;
INSERT INTO gold.fact_sales_daily
SELECT DATE(order_ts) AS sales_date, region, product_id,
SUM(quantity) AS units, SUM(amount) AS revenue
FROM silver.orders
WHERE DATE(order_ts) = run_date
GROUP BY 1, 2, 3;
END;
$$ LANGUAGE plpgsql;
-- dbt incremental model (replaces SP)
-- models/gold/fact_sales_daily.sql
{{ config(
materialized='incremental',
unique_key=['sales_date', 'region', 'product_id'],
incremental_strategy='merge',
partition_by=['sales_date']
) }}
SELECT
DATE(order_ts) AS sales_date,
region,
product_id,
SUM(quantity) AS units,
SUM(amount) AS revenue
FROM {{ ref('stg_orders') }}
{% if is_incremental() %}
WHERE DATE(order_ts) >= DATE_SUB(CURRENT_DATE(), 3)
{% endif %}
GROUP BY 1, 2, 3
Path 2: Complex imperative logic to Databricks notebooks
# Databricks notebook (replaces complex SP with control flow)
import pyspark.sql.functions as F
from datetime import date, timedelta
# Parameters (from Databricks Job)
run_date = dbutils.widgets.get("run_date")
# Read the day's source rows (same predicate as the SP: DATE(order_ts) = run_date)
orders = (
    spark.table("sales_prod.silver.orders")
    .filter(F.to_date("order_ts") == run_date)
)
# Complex business logic that was in the SP
if orders.isEmpty():  # cheaper than count() == 0
    dbutils.notebook.exit("No data for date")
# Aggregation with conditional logic; carry sales_date so the
# replaceWhere predicate below matches the rows being written
result = (
    orders.groupBy("region", "product_id")
    .agg(
        F.sum("quantity").alias("units"),
        F.sum("amount").alias("revenue"),
        F.countDistinct("customer_id").alias("unique_customers"),
    )
    .withColumn("sales_date", F.to_date(F.lit(run_date)))
)
# Overwrite only the run_date slice of the target (idempotent re-runs)
(
    result.write.format("delta")
    .mode("overwrite")
    .option("replaceWhere", f"sales_date = '{run_date}'")
    .saveAsTable("sales_prod.gold.fact_sales_daily")
)
Part 2: EMR to Databricks¶
Spark version compatibility¶
| EMR release | Spark version | Databricks Runtime | Notes |
|---|---|---|---|
| EMR 6.15 | Spark 3.4.1 | DBR 13.3 LTS | Direct compatibility |
| EMR 6.12 | Spark 3.4.0 | DBR 13.3 LTS | Direct compatibility |
| EMR 7.0 | Spark 3.5.0 | DBR 14.3 LTS | Direct compatibility |
| EMR 7.1 | Spark 3.5.1 | DBR 15.4 LTS | Direct compatibility |
Key point: Spark code written for EMR runs on Databricks with minimal changes. The Spark API is the same. The differences are in cluster configuration, library management, and filesystem paths.
Common code changes¶
File system paths:
# EMR (S3)
df = spark.read.parquet("s3a://acme-analytics-raw/sales/")
df.write.parquet("s3a://acme-analytics-curated/sales_daily/")
# Databricks (ADLS Gen2)
df = spark.read.parquet("abfss://raw@acmeanalyticsgov.dfs.core.usgovcloudapi.net/sales/")
df.write.format("delta").save("abfss://curated@acmeanalyticsgov.dfs.core.usgovcloudapi.net/sales_daily/")
# Databricks (Unity Catalog - preferred)
df = spark.table("sales_prod.bronze.raw_sales")
df.write.format("delta").mode("overwrite").saveAsTable("sales_prod.gold.fact_sales_daily")
Credential configuration:
# EMR: IAM Instance Profile (automatic)
# No credential configuration needed in code
# Databricks: Managed Identity (automatic with Unity Catalog)
# No credential configuration needed in code when using Unity Catalog
# Unity Catalog + managed identity handles auth transparently
Library management:
| EMR approach | Databricks equivalent | Notes |
|---|---|---|
| Bootstrap action (install packages) | Init script | Place in workspace files or Unity Catalog volumes; DBFS init scripts are deprecated |
| EMR step (jar submission) | Job task (jar/wheel) | Attach library to job or cluster |
| --py-files (PySpark dependencies) | Workspace library / PyPI install | %pip install in notebook or cluster library |
| Conda/virtualenv | Databricks cluster library | Install at cluster level or notebook level |
EMR Step to Databricks Job conversion¶
{
"name": "daily_sales_agg",
"schedule": {
"quartz_cron_expression": "0 0 2 * * ?",
"timezone_id": "UTC"
},
"tasks": [
{
"task_key": "aggregate",
"notebook_task": {
"notebook_path": "/Repos/analytics/sales/jobs/daily_sales_agg",
"base_parameters": {
"run_date": "{{job.start_time[yyyy-MM-dd]}}"
}
},
"job_cluster_key": "agg_cluster"
}
],
"job_clusters": [
{
"job_cluster_key": "agg_cluster",
"new_cluster": {
"spark_version": "15.4.x-scala2.12",
"node_type_id": "Standard_D8s_v5",
"num_workers": 4,
"data_security_mode": "SINGLE_USER",
"runtime_engine": "PHOTON"
}
}
]
}
Bootstrap actions to init scripts¶
# EMR bootstrap action
#!/bin/bash
sudo pip3 install pandas==2.1.0 pyarrow==14.0.0
sudo yum install -y libgdal-devel
# Databricks init script (place in workspace or DBFS)
#!/bin/bash
pip install pandas==2.1.0 pyarrow==14.0.0
apt-get update && apt-get install -y libgdal-dev
Part 3: Athena to Fabric SQL endpoint / Databricks SQL¶
Migration path selection¶
| Athena usage pattern | Recommended Azure target | Reasoning |
|---|---|---|
| Ad-hoc analyst queries | Databricks SQL Warehouse (Serverless) | Auto-scales to zero; pay per query |
| Scheduled reports/dashboards | Databricks SQL Warehouse (Pro) | Consistent performance; integrates with Power BI |
| Federated queries (DynamoDB, RDS) | Databricks Lakehouse Federation | Native connectors for common sources |
| Lightweight exploration | Fabric SQL endpoint | Zero-config for data already in OneLake |
| Cost-sensitive scanning | Fabric SQL endpoint | No per-scan charge; included in Fabric capacity |
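For the federated-query row, Lakehouse Federation surfaces an external database as a read-only catalog. A minimal sketch, assuming a PostgreSQL (RDS) source; the connection name, host, database, and secret scope/key are illustrative:

-- Register the external database as a connection, then surface it as a catalog
CREATE CONNECTION rds_orders TYPE postgresql
OPTIONS (
  host 'rds-host.example.com',
  port '5432',
  user 'reader',
  password secret('migration', 'rds_reader_pw')
);

CREATE FOREIGN CATALOG rds_orders_cat
USING CONNECTION rds_orders
OPTIONS (database 'orders');

-- Query it like any Unity Catalog table
SELECT COUNT(*) FROM rds_orders_cat.public.orders;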
Athena saved queries migration¶
Athena saved queries (named queries in the Athena API) are SQL text stored in Athena. Extract them, then adapt the dialect:
# Export all saved queries from Athena
aws athena list-named-queries --output json > athena_queries.json
# For each query ID, get the SQL
for qid in $(jq -r '.NamedQueryIds[]' athena_queries.json); do
  aws athena get-named-query --named-query-id "$qid" --output json >> query_definitions.json
done
Common Athena-to-Databricks SQL adaptations:
-- Athena: partition projection (no direct equivalent)
-- CREATE EXTERNAL TABLE logs (...)
-- PARTITIONED BY (dt STRING)
-- TBLPROPERTIES (
-- 'projection.enabled' = 'true',
-- 'projection.dt.type' = 'date',
-- 'projection.dt.range' = '2020-01-01,NOW'
-- )
-- Databricks: Auto Loader handles dynamic partition discovery
-- Or: explicit partition columns in Delta table
CREATE TABLE logs_prod.bronze.logs
USING DELTA
PARTITIONED BY (dt)
AS SELECT *, DATE(event_time) AS dt FROM ...;
-- Athena: CTAS (Create Table As Select) for results
CREATE TABLE results.monthly_summary
WITH (
format = 'PARQUET',
external_location = 's3://results/monthly/'
) AS SELECT ...
-- Databricks: CREATE TABLE or INSERT OVERWRITE
CREATE OR REPLACE TABLE results_prod.gold.monthly_summary
USING DELTA
AS SELECT ...;
Athena workgroup to SQL Warehouse mapping¶
| Athena workgroup setting | Databricks SQL Warehouse equivalent |
|---|---|
| BytesScannedCutoffPerQuery | SQL Warehouse query timeout + Azure budget alerts |
| RequesterPaysEnabled | N/A; ADLS has no requester-pays mode |
| OutputLocation | Default warehouse location in Unity Catalog |
| EncryptionConfiguration | Storage account encryption (CMK via Key Vault) |
| EnforceWorkGroupConfiguration | SQL Warehouse access control + cluster policies |
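There is no scan-volume cutoff in Databricks SQL; the closest per-query control is the STATEMENT_TIMEOUT parameter, sketched here with an arbitrary one-hour limit:

-- Abort any query running longer than one hour (value in seconds);
-- pair with Azure budget alerts for the cost-control half of the mapping
SET STATEMENT_TIMEOUT = 3600;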
Data migration tooling comparison¶
| Tool | Best for | Throughput | Cost |
|---|---|---|---|
| AzCopy | Bulk S3 to ADLS copy | 5-10 Gbps over ExpressRoute | Free (egress charges apply) |
| ADF Copy Activity | Orchestrated, incremental copies | Scales with DIU count | Per-DIU-hour pricing |
| Databricks notebook | Format conversion (Parquet to Delta) | Scales with cluster size | Per-DBU pricing |
| OneLake shortcut | Zero-copy bridge | N/A (no data movement) | No data movement cost |
| AWS DMS | Database-to-database (Redshift to SQL) | Varies by instance | Per-instance-hour |
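For the notebook-based format conversion row, CONVERT TO DELTA upgrades copied Parquet files in place rather than rewriting them. A sketch, reusing the ADLS path from Part 2:

-- Upgrade copied Parquet files to Delta in place (adds the transaction log,
-- no data rewrite); run from a Databricks notebook or the SQL editor
CONVERT TO DELTA parquet.`abfss://curated@acmeanalyticsgov.dfs.core.usgovcloudapi.net/sales_daily/`;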
Last updated: 2026-04-30 · Maintainers: CSA-in-a-Box core team · Related: Migration Center | Storage Migration | ETL Migration | Migration Playbook