Compute Migration: Redshift, EMR, and Athena to Azure¶
A deep-dive guide for data engineers migrating AWS compute services to Databricks, Fabric, and Synapse on Azure.
Executive summary¶
The AWS analytics compute layer is three services with three different programming models: Redshift for SQL warehousing, EMR for Spark/Hadoop, and Athena for serverless ad-hoc queries. Each has its own query dialect, its own performance tuning model, and its own cost structure. Migrating these to Azure means mapping three compute models onto a unified Databricks + Fabric platform.
The consolidation payoff is significant. Instead of managing Redshift clusters, EMR fleets, and Athena workgroups independently, the target architecture uses Databricks SQL Warehouses for interactive queries, Databricks Jobs for batch processing, and Fabric SQL endpoints for lightweight ad-hoc access. One platform, one billing model, one monitoring surface.
Part 1: Redshift to Databricks SQL¶
Architecture mapping¶
| Redshift concept | Databricks SQL equivalent | Notes |
|---|---|---|
| Cluster (RA3 nodes) | SQL Warehouse (Classic or Serverless) | Serverless recommended for variable workloads |
| Database | Unity Catalog catalog | One Redshift database per catalog |
| Schema | Unity Catalog schema | 1:1 mapping |
| Table (distribution style) | Delta table (partitioned) | See distribution mapping below |
| External table (Spectrum) | OneLake shortcut / external location | Zero-copy reads from S3 |
| View | View in Unity Catalog | 1:1 mapping |
| Materialized view | dbt incremental model / Databricks MV | dbt preferred for testability |
| Stored procedure | dbt model / notebook job | See SP migration section |
| WLM queue | SQL Warehouse | One warehouse per workload class |
| User/group | Entra ID user/group + Unity Catalog grants | See Security Migration |
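For example, a Redshift database/schema pair maps onto the three-level Unity Catalog namespace. A minimal sketch; the catalog, schema, and table names are illustrative:

-- Create the Unity Catalog objects that replace a Redshift database and schema
CREATE CATALOG IF NOT EXISTS sales_prod;
CREATE SCHEMA IF NOT EXISTS sales_prod.gold;

-- Three-level namespace replaces Redshift's database.schema.table
SELECT * FROM sales_prod.gold.fact_orders LIMIT 10;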
SQL dialect differences¶
Redshift SQL is a PostgreSQL derivative; Databricks SQL is based on Spark SQL. Most standard SQL runs unchanged on both, but the dialects differ in the areas below.
Date and time functions¶
| Operation | Redshift SQL | Databricks SQL |
|---|---|---|
| Current date | CURRENT_DATE | CURRENT_DATE() or CURRENT_DATE |
| Current timestamp | GETDATE() or SYSDATE | CURRENT_TIMESTAMP() or NOW() |
| Date add | DATEADD(day, 7, date_col) | DATE_ADD(date_col, 7) |
| Date diff | DATEDIFF(day, start, end) | DATEDIFF(end, start) (note: reversed arg order) |
| Date trunc | DATE_TRUNC('month', date_col) | DATE_TRUNC('month', date_col) (same) |
| Extract | EXTRACT(year FROM date_col) | YEAR(date_col) or EXTRACT(YEAR FROM date_col) |
| String to date | TO_DATE(str, 'YYYY-MM-DD') | TO_DATE(str, 'yyyy-MM-dd') (Java format) |
| Date to string | TO_CHAR(date_col, 'YYYY-MM-DD') | DATE_FORMAT(date_col, 'yyyy-MM-dd') |
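A worked conversion covering the trickier rows, DATEADD and the reversed DATEDIFF argument order; the table and column names are illustrative:

-- Redshift
SELECT order_id,
       DATEADD(day, 7, order_date)          AS due_date,
       DATEDIFF(day, order_date, ship_date) AS days_to_ship
FROM sales.fact_orders;

-- Databricks SQL (note the reversed DATEDIFF argument order)
SELECT order_id,
       DATE_ADD(order_date, 7)          AS due_date,
       DATEDIFF(ship_date, order_date)  AS days_to_ship
FROM sales_prod.gold.fact_orders;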
String functions¶
| Operation | Redshift SQL | Databricks SQL |
|---|---|---|
| Concatenate | col1 \|\| col2 or CONCAT(a, b) | CONCAT(a, b) or a \|\| b |
| Substring | SUBSTRING(str, start, len) | SUBSTRING(str, start, len) (same) |
| Length | LEN(str) | LENGTH(str) or LEN(str) |
| Trim | TRIM(str) | TRIM(str) (same) |
| Replace | REPLACE(str, old, new) | REPLACE(str, old, new) (same) |
| Regex match | str ~ 'pattern' | str RLIKE 'pattern' |
| NVL | NVL(a, b) | COALESCE(a, b) or NVL(a, b) |
| DECODE | DECODE(col, val1, res1, val2, res2, default) | CASE WHEN col = val1 THEN res1 ... |
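For example, a DECODE combined with a regex-match filter converts as follows (illustrative names):

-- Redshift
SELECT DECODE(region, 'E', 'East', 'W', 'West', 'Other') AS region_name
FROM sales.fact_orders
WHERE region ~ '^[EW]$';

-- Databricks SQL: DECODE becomes CASE, ~ becomes RLIKE
SELECT CASE region WHEN 'E' THEN 'East'
                   WHEN 'W' THEN 'West'
                   ELSE 'Other' END AS region_name
FROM sales_prod.gold.fact_orders
WHERE region RLIKE '^[EW]$';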
Data types¶
| Redshift type | Databricks SQL type | Notes |
|---|---|---|
| SMALLINT | SMALLINT | Same |
| INTEGER | INT | Same |
| BIGINT | BIGINT | Same |
| DECIMAL(p,s) | DECIMAL(p,s) | Same |
| REAL / FLOAT4 | FLOAT | Same |
| DOUBLE PRECISION / FLOAT8 | DOUBLE | Same |
| BOOLEAN | BOOLEAN | Same |
| CHAR(n) | STRING | Databricks uses variable-length strings |
| VARCHAR(n) | STRING | STRING recommended; VARCHAR(n) is supported where length enforcement is required |
| DATE | DATE | Same |
| TIMESTAMP | TIMESTAMP | Same |
| TIMESTAMPTZ | TIMESTAMP | Databricks stores UTC; apply timezone in queries |
| SUPER (semi-structured) | STRING (JSON) + JSON functions | Use : notation for JSON field access |
| HLLSKETCH | approx_count_distinct() | Function-based rather than type-based |
| GEOMETRY | STRING (WKT) + H3 functions | Spatial support via H3 and Mosaic libraries |
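For SUPER columns, the usual pattern is to land the value as a JSON string and read fields with the : path syntax. A minimal sketch, assuming a hypothetical telemetry table with a payload column holding JSON:

-- payload is a STRING column containing JSON such as
-- {"device": {"id": "d-42", "os": "ios"}}
SELECT payload:device.id AS device_id,
       payload:device.os AS device_os
FROM events_prod.bronze.telemetry;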
Distribution and sort key mapping¶
Redshift distribution styles control how data is physically distributed across nodes. Delta Lake uses partitioning and Z-ordering instead.
| Redshift distribution | Delta Lake equivalent | Migration approach |
|---|---|---|
| DISTSTYLE KEY (col) | ZORDER BY (col) or PARTITIONED BY (col) | High-cardinality distribution keys map to Z-order; partition only on low-cardinality columns |
| DISTSTYLE EVEN | No partition (or hash partition) | Let Delta auto-optimize; add Z-order if needed |
| DISTSTYLE ALL | Broadcast hint in joins | Small dimension tables; use /*+ BROADCAST(dim) */ |
| SORTKEY (col1, col2) | ZORDER BY (col1, col2) | Run after initial load: OPTIMIZE tbl ZORDER BY (col1, col2) |
| COMPOUND SORTKEY | ZORDER BY (col1, col2) | Z-order handles multi-column optimization |
| INTERLEAVED SORTKEY | ZORDER BY (col1, col2) | Z-order is inherently multi-dimensional |
Example: convert a Redshift table definition
-- Redshift DDL
CREATE TABLE sales.fact_orders (
order_id BIGINT,
customer_id BIGINT,
order_date DATE,
region VARCHAR(50),
product_id BIGINT,
quantity INTEGER,
amount DECIMAL(18,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
COMPOUND SORTKEY (order_date, region);
-- Databricks SQL DDL
CREATE TABLE sales_prod.gold.fact_orders (
order_id BIGINT,
customer_id BIGINT,
order_date DATE,
region STRING,
product_id BIGINT,
quantity INT,
amount DECIMAL(18,2)
)
USING DELTA
PARTITIONED BY (order_date)
COMMENT 'Migrated from Redshift sales.fact_orders';
-- After initial data load, optimize
OPTIMIZE sales_prod.gold.fact_orders
ZORDER BY (region, customer_id);
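DISTSTYLE ALL has no table-level counterpart in Delta; replicate its effect with a broadcast hint at join time. A sketch, with an illustrative dimension table:

-- Redshift replicated the dimension to every node (DISTSTYLE ALL);
-- in Databricks SQL, broadcast the small table at join time instead
SELECT /*+ BROADCAST(d) */
       f.order_id, d.region_name, f.amount
FROM sales_prod.gold.fact_orders f
JOIN sales_prod.gold.dim_region d
  ON f.region = d.region_code;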
Workload Management (WLM) to SQL Warehouse sizing¶
| WLM queue | Typical use | Databricks SQL Warehouse | Sizing guidance |
|---|---|---|---|
| ETL queue (high memory) | Batch loads, CTAS | Databricks Job cluster | Auto-scaling; Photon enabled |
| BI queue (high concurrency) | Dashboard queries | SQL Warehouse (Serverless) | Auto-scale 1-10 clusters; 2X-Small to Medium |
| Ad-hoc queue | Analyst queries | SQL Warehouse (Pro) | Auto-scale 1-4 clusters; Small to Medium |
| Short query queue | Sub-second lookups | SQL Warehouse (Serverless) | Auto-scale; smallest size |
| Superuser queue | Admin/DDL | SQL Warehouse (Classic) | Fixed size; restricted access |
Stored procedure migration¶
Redshift stored procedures use a PL/pgSQL-like syntax. There is no direct equivalent in Databricks SQL, so the migration path depends on the procedure's complexity:
Path 1: SQL-only logic to dbt models (recommended)
-- Redshift SP
CREATE OR REPLACE PROCEDURE refresh_daily_sales(run_date DATE)
AS $$
BEGIN
DELETE FROM gold.fact_sales_daily WHERE sales_date = run_date;
INSERT INTO gold.fact_sales_daily
SELECT DATE(order_ts) AS sales_date, region, product_id,
SUM(quantity) AS units, SUM(amount) AS revenue
FROM silver.orders
WHERE DATE(order_ts) = run_date
GROUP BY 1, 2, 3;
END;
$$ LANGUAGE plpgsql;
-- dbt incremental model (replaces SP)
-- models/gold/fact_sales_daily.sql
{{ config(
materialized='incremental',
unique_key=['sales_date', 'region', 'product_id'],
incremental_strategy='merge',
partition_by=['sales_date']
) }}
SELECT
DATE(order_ts) AS sales_date,
region,
product_id,
SUM(quantity) AS units,
SUM(amount) AS revenue
FROM {{ ref('stg_orders') }}
{% if is_incremental() %}
WHERE DATE(order_ts) >= DATE_SUB(CURRENT_DATE(), 3)
{% endif %}
GROUP BY 1, 2, 3
Path 2: Complex imperative logic to Databricks notebooks
# Databricks notebook (replaces complex SP with control flow)
import pyspark.sql.functions as F
from datetime import date, timedelta
# Parameters (from Databricks Job)
run_date = dbutils.widgets.get("run_date")
# Read the day's source rows (same predicate as the SP: DATE(order_ts) = run_date)
orders = (
    spark.table("sales_prod.silver.orders")
    .filter(F.to_date("order_ts") == run_date)
)
# Complex business logic that was in the SP
if orders.isEmpty():  # cheaper than count() == 0
    dbutils.notebook.exit("No data for date")
# Aggregation with conditional logic; carry sales_date so the
# replaceWhere predicate below matches the rows being written
result = (
    orders.groupBy("region", "product_id")
    .agg(
        F.sum("quantity").alias("units"),
        F.sum("amount").alias("revenue"),
        F.countDistinct("customer_id").alias("unique_customers"),
    )
    .withColumn("sales_date", F.to_date(F.lit(run_date)))
)
# Overwrite only the run_date slice of the target (idempotent re-runs)
(
    result.write.format("delta")
    .mode("overwrite")
    .option("replaceWhere", f"sales_date = '{run_date}'")
    .saveAsTable("sales_prod.gold.fact_sales_daily")
)
Part 2: EMR to Databricks¶
Spark version compatibility¶
| EMR release | Spark version | Databricks Runtime | Notes |
|---|---|---|---|
| EMR 6.15 | Spark 3.4.1 | DBR 13.3 LTS | Direct compatibility |
| EMR 6.12 | Spark 3.4.0 | DBR 13.3 LTS | Direct compatibility |
| EMR 7.0 | Spark 3.5.0 | DBR 14.3 LTS | Direct compatibility |
| EMR 7.1 | Spark 3.5.1 | DBR 15.4 LTS | Direct compatibility |
Key point: Spark code written for EMR runs on Databricks with minimal changes. The Spark API is the same. The differences are in cluster configuration, library management, and filesystem paths.
Common code changes¶
File system paths:
# EMR (S3)
df = spark.read.parquet("s3a://acme-analytics-raw/sales/")
df.write.parquet("s3a://acme-analytics-curated/sales_daily/")
# Databricks (ADLS Gen2)
df = spark.read.parquet("abfss://raw@acmeanalyticsgov.dfs.core.usgovcloudapi.net/sales/")
df.write.format("delta").save("abfss://curated@acmeanalyticsgov.dfs.core.usgovcloudapi.net/sales_daily/")
# Databricks (Unity Catalog - preferred)
df = spark.table("sales_prod.bronze.raw_sales")
df.write.format("delta").mode("overwrite").saveAsTable("sales_prod.gold.fact_sales_daily")
Credential configuration:
# EMR: IAM Instance Profile (automatic)
# No credential configuration needed in code
# Databricks: Managed Identity (automatic with Unity Catalog)
# No credential configuration needed in code when using Unity Catalog
# Unity Catalog + managed identity handles auth transparently
Library management:
| EMR approach | Databricks equivalent | Notes |
|---|---|---|
| Bootstrap action (install packages) | Init script | Place in workspace files or Unity Catalog volumes; DBFS init scripts are deprecated |
| EMR step (jar submission) | Job task (jar/wheel) | Attach library to job or cluster |
| --py-files (PySpark dependencies) | Workspace library / PyPI install | %pip install in notebook or cluster library |
| Conda/virtualenv | Databricks cluster library | Install at cluster level or notebook level |
EMR Step to Databricks Job conversion¶
{
"name": "daily_sales_agg",
"schedule": {
"quartz_cron_expression": "0 0 2 * * ?",
"timezone_id": "UTC"
},
"tasks": [
{
"task_key": "aggregate",
"notebook_task": {
"notebook_path": "/Repos/analytics/sales/jobs/daily_sales_agg",
"base_parameters": {
"run_date": "{{job.start_time[yyyy-MM-dd]}}"
}
},
"job_cluster_key": "agg_cluster"
}
],
"job_clusters": [
{
"job_cluster_key": "agg_cluster",
"new_cluster": {
"spark_version": "15.4.x-scala2.12",
"node_type_id": "Standard_D8s_v5",
"num_workers": 4,
"data_security_mode": "SINGLE_USER",
"runtime_engine": "PHOTON"
}
}
]
}
Bootstrap actions to init scripts¶
# EMR bootstrap action
#!/bin/bash
sudo pip3 install pandas==2.1.0 pyarrow==14.0.0
sudo yum install -y libgdal-devel
# Databricks init script (place in workspace or DBFS)
#!/bin/bash
pip install pandas==2.1.0 pyarrow==14.0.0
apt-get update && apt-get install -y libgdal-dev
Part 3: Athena to Fabric SQL endpoint / Databricks SQL¶
Migration path selection¶
| Athena usage pattern | Recommended Azure target | Reasoning |
|---|---|---|
| Ad-hoc analyst queries | Databricks SQL Warehouse (Serverless) | Auto-scales to zero; pay per query |
| Scheduled reports/dashboards | Databricks SQL Warehouse (Pro) | Consistent performance; integrates with Power BI |
| Federated queries (DynamoDB, RDS) | Databricks Lakehouse Federation | Native connectors for common sources |
| Lightweight exploration | Fabric SQL endpoint | Zero-config for data already in OneLake |
| Cost-sensitive scanning | Fabric SQL endpoint | No per-scan charge; included in Fabric capacity |
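For the federated-query row, Lakehouse Federation surfaces an external database as a read-only catalog. A minimal sketch, assuming a PostgreSQL (RDS) source; the connection name, host, database, and secret scope/key are illustrative:

-- Register the external database as a connection, then surface it as a catalog
CREATE CONNECTION rds_orders TYPE postgresql
OPTIONS (
  host 'rds-host.example.com',
  port '5432',
  user 'reader',
  password secret('migration', 'rds_reader_pw')
);

CREATE FOREIGN CATALOG rds_orders_cat
USING CONNECTION rds_orders
OPTIONS (database 'orders');

-- Query it like any Unity Catalog table
SELECT COUNT(*) FROM rds_orders_cat.public.orders;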
Athena saved queries migration¶
Athena saved queries (named queries in the Athena API) are SQL text stored in Athena. Extract them, then adapt the dialect:
# Export all saved queries from Athena
aws athena list-named-queries --output json > athena_queries.json
# For each query ID, get the SQL
for qid in $(jq -r '.NamedQueryIds[]' athena_queries.json); do
  aws athena get-named-query --named-query-id "$qid" --output json >> query_definitions.json
done
Common Athena-to-Databricks SQL adaptations:
-- Athena: partition projection (no direct equivalent)
-- CREATE EXTERNAL TABLE logs (...)
-- PARTITIONED BY (dt STRING)
-- TBLPROPERTIES (
-- 'projection.enabled' = 'true',
-- 'projection.dt.type' = 'date',
-- 'projection.dt.range' = '2020-01-01,NOW'
-- )
-- Databricks: Auto Loader handles dynamic partition discovery
-- Or: explicit partition columns in Delta table
CREATE TABLE logs_prod.bronze.logs
USING DELTA
PARTITIONED BY (dt)
AS SELECT *, DATE(event_time) AS dt FROM ...;
-- Athena: CTAS (Create Table As Select) for results
CREATE TABLE results.monthly_summary
WITH (
format = 'PARQUET',
external_location = 's3://results/monthly/'
) AS SELECT ...
-- Databricks: CREATE TABLE or INSERT OVERWRITE
CREATE OR REPLACE TABLE results_prod.gold.monthly_summary
USING DELTA
AS SELECT ...;
Athena workgroup to SQL Warehouse mapping¶
| Athena workgroup setting | Databricks SQL Warehouse equivalent |
|---|---|
| BytesScannedCutoffPerQuery | SQL Warehouse query timeout + Azure budget alerts |
| RequesterPaysEnabled | N/A; ADLS has no requester-pays mode |
| OutputLocation | Default warehouse location in Unity Catalog |
| EncryptionConfiguration | Storage account encryption (CMK via Key Vault) |
| EnforceWorkGroupConfiguration | SQL Warehouse access control + cluster policies |
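There is no scan-volume cutoff in Databricks SQL; the closest per-query control is the STATEMENT_TIMEOUT parameter, sketched here with an arbitrary one-hour limit:

-- Abort any query running longer than one hour (value in seconds);
-- pair with Azure budget alerts for the cost-control half of the mapping
SET STATEMENT_TIMEOUT = 3600;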
Data migration tooling comparison¶
| Tool | Best for | Throughput | Cost |
|---|---|---|---|
| AzCopy | Bulk S3 to ADLS copy | 5-10 Gbps over ExpressRoute | Free (egress charges apply) |
| ADF Copy Activity | Orchestrated, incremental copies | Scales with DIU count | Per-DIU-hour pricing |
| Databricks notebook | Format conversion (Parquet to Delta) | Scales with cluster size | Per-DBU pricing |
| OneLake shortcut | Zero-copy bridge | N/A (no data movement) | No data movement cost |
| AWS DMS | Database-to-database (Redshift to SQL) | Varies by instance | Per-instance-hour |
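For the notebook-based format conversion row, CONVERT TO DELTA upgrades copied Parquet files in place rather than rewriting them. A sketch, reusing the ADLS path from Part 2:

-- Upgrade copied Parquet files to Delta in place (adds the transaction log,
-- no data rewrite); run from a Databricks notebook or the SQL editor
CONVERT TO DELTA parquet.`abfss://curated@acmeanalyticsgov.dfs.core.usgovcloudapi.net/sales_daily/`;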
Last updated: 2026-04-30 · Maintainers: CSA-in-a-Box core team · Related: Migration Center | Storage Migration | ETL Migration | Migration Playbook