Home > Tutorials > Amazon Redshift to Fabric Migration

🟧 Tutorial 43: Amazon Redshift → Microsoft Fabric Migration¶

Last Updated: 2026-04-27 | Phase: 14 (Wave 4) | Multi-Cloud Migration Status: ✅ Final | Maintainer: Platform Team

Third-party references — publicly sourced, good-faith comparison

This page references non-Microsoft products and services. That information is drawn from each vendor's publicly available documentation and is offered for honest, good-faith comparison only. This is a personal project written from a Microsoft Fabric and Azure perspective; it does not claim expertise in, or authority over, any third-party product, and nothing here is an official statement by, or endorsed by, those vendors. Capabilities, pricing, and features change often — always verify against the vendor's current official documentation. Where a third-party offering is the stronger choice, we say so plainly.

🟧 Tutorial 43: Amazon Redshift → Microsoft Fabric Migration¶


Difficulty	⭐⭐⭐ Advanced
Time	⏱️ 180-240 minutes
Focus	Multi-cloud migration: Redshift cluster (RA3/DC2), S3 backups, Spectrum external tables, Glue catalog, AWS WLM → Fabric Warehouse, OneLake (S3 shortcuts), OneLake Catalog, Fabric capacity governance

📊 Progress Tracker¶

Tutorial	Status
00 — Environment Setup → 38 — DOJ Justice	✅ Complete (Phases 1-13)
41 — Synapse → Fabric	✅
42 — Databricks → Fabric	✅
43 — Redshift → Fabric (this tutorial)	🔵 YOU ARE HERE
44 — BigQuery → Fabric	⬜
45 — On-Prem SSAS/SSIS/SSRS → Fabric	⬜

Navigation
⬅️ Previous	42 — Databricks → Fabric
➡️ Next	44 — BigQuery → Fabric

📖 Overview¶

Amazon Redshift → Fabric is a multi-cloud migration — a path some enterprises take to consolidate analytics estates onto Microsoft Fabric while keeping operational workloads in AWS. Redshift is a strong, widely-deployed cloud data warehouse with deep AWS-ecosystem integration; this playbook is not an argument that one platform is universally better, but a practical guide for teams that have already decided to move. It walks through extracting Redshift workloads via UNLOAD to S3, bringing the data into OneLake (via S3 shortcut or copy), converting Redshift PostgreSQL-flavored DDL into Fabric Warehouse T-SQL, and rationalizing distribution/sort keys, RA3 nodes, and WLM rules into the Fabric capacity model.

Comparing Redshift and Fabric¶

The table below summarizes differences based on each vendor's publicly available documentation (as of this doc's date). It is descriptive, not a scorecard — Redshift's per-table tuning controls, for example, are a strength for teams that want fine-grained physical control, whereas Fabric favors automatic optimization.

Concern	Redshift Behavior	Fabric Behavior
Compute model	RA3/DC2 nodes, fixed cluster size or Concurrency Scaling	CU pool shared across all workloads (warehouse, Spark, BI, real-time)
Storage	Managed storage on RA3 + S3 (Spectrum)	OneLake unified; S3 readable in place via shortcut
Performance tuning	DISTKEY / SORTKEY / encoding give explicit per-table control	V-Order applies automatically; no DISTKEY/SORTKEY to maintain
BI integration	QuickSight or other BI tools via ODBC/connectors	Power BI Direct Lake (no copy, no refresh)
Real-time	Kinesis + Redshift Streaming Ingestion (separately billed)	Eventstream + Eventhouse native
Multi-cloud reality	Tightly integrated with AWS region/billing	Reads S3 from Azure via shortcut; supports M365 / Power BI integration
Cost shape	Node-hours + S3 + egress + concurrency scaling charges	Single F-SKU bundles compute, storage, BI, governance

What This Tutorial Covers¶

Inventory and complexity scoring of your Redshift cluster
Redshift PostgreSQL DDL → Fabric Warehouse T-SQL conversion
UNLOAD to S3 in Parquet format (largest workload mover)
S3 → OneLake via shortcut (no copy) or copy via Pipeline / fabric-cicd
Redshift Spectrum external tables → OneLake shortcuts
UDF (Python, SQL) and stored procedure conversion
WLM (Workload Management) rules → Fabric capacity governance
RA3 node-hour → F-SKU sizing
Validation: row counts, hashes, query parity
Coexistence (read-in-place during migration)
Cutover playbook and AWS decommission checklist
AWS egress cost minimization strategies

🎯 Learning Objectives¶

By the end of this tutorial, you will be able to:

🏗️ Reference Architecture¶

flowchart LR
    subgraph Redshift["🟧 Source: Amazon Redshift (AWS)"]
        RC[(Redshift Cluster<br/>RA3 / DC2)]
        RBKP[(S3 Backups & UNLOAD targets)]
        SPEC[Redshift Spectrum<br/>External Tables on S3]
        GLUE[AWS Glue Data Catalog]
        WLM[Workload Management<br/>Queues]
    end

    subgraph Fabric["⚡ Target: Microsoft Fabric (Azure)"]
        FW[(Fabric Warehouse)]
        OL[(OneLake)]
        OLC[OneLake Catalog]
        FCAP[Fabric Capacity<br/>F-SKU]
    end

    subgraph Migration["🔄 Migration Layer"]
        Assess[Assessment Script]
        Convert[DDL Converter]
        UNLOAD[UNLOAD → S3 Parquet]
        Shortcut[S3 → OneLake Shortcut]
        Validate[Validation Suite]
    end

    RC --> Assess
    SPEC --> Assess
    WLM --> Assess
    GLUE --> Assess
    Assess --> Convert
    Convert --> FW
    RC -->|UNLOAD FORMAT PARQUET| UNLOAD
    UNLOAD --> RBKP
    RBKP -->|shortcut, no copy| Shortcut
    Shortcut --> OL
    SPEC -->|same S3 prefix| Shortcut
    GLUE -->|metadata sync| OLC
    OL --> Validate
    FW --> Validate
    WLM -.->|capacity governance| FCAP

Component Mapping (the canonical reference)¶

Redshift Component	Fabric Equivalent	Migration Approach
Redshift cluster	Fabric Warehouse	DDL conversion + UNLOAD/COPY load
RA3 / DC2 nodes	Fabric Capacity (CU model)	Heuristic node → CU sizing table
`DISTKEY` / `DISTSTYLE`	V-Order automatic	Remove — Fabric optimizes automatically
`SORTKEY` (compound / interleaved)	V-Order automatic	Remove — Fabric maintains physical order
`ENCODE` (LZO, ZSTD, AZ64)	Delta + Parquet compression	Remove — Fabric handles automatically
WLM (Workload Management) queues	Workspace + capacity assignment	Map queues to workspaces; capacity governs concurrency
Concurrency Scaling	Fabric capacity bursting + Smoothing	Replace WLM auto-scale with capacity smoothing
User-defined types	Standard T-SQL types	Convert to standard types
Python UDFs (`CREATE FUNCTION ... LANGUAGE plpythonu`)	Fabric notebook function or T-SQL function	Manual port — most logic moves to PySpark
SQL UDFs (`CREATE FUNCTION ... LANGUAGE sql`)	T-SQL inline / scalar function	Mostly direct
Stored procedures (`LANGUAGE plpgsql`)	T-SQL stored procedures	Mostly compatible — syntax tweaks
Materialized views	Materialized Lake Views (Wave 9)	Recreate in Fabric MLV
Redshift Spectrum (S3 external tables)	OneLake shortcut to S3	No copy — same S3 prefix, read in place
Redshift Data Sharing	Fabric data sharing	Recreate share targets
AWS Glue Data Catalog	OneLake Catalog	Metadata sync; rebuild lineage in Purview
Federated queries (RDS / Aurora)	Fabric Mirroring + cross-DB queries	Replace federation with Mirroring
Identity (IAM users, roles)	Microsoft Entra ID	Map IAM roles → Entra groups
Backup/snapshot to S3	OneLake retention + RA-GRS	Use OneLake retention policies

📋 Prerequisites¶

🚀 Step-by-Step¶

Step 1 — Assess Your Redshift Cluster¶

Run the assessment script to inventory tables, distribution keys, sort keys, Spectrum externals, UDFs, and query history.

cd tutorials/43-redshift-to-fabric/
python 01_assessment.py \
    --redshift-host "prod-cluster.abc123.us-east-1.redshift.amazonaws.com" \
    --redshift-db "analytics" \
    --redshift-user "$REDSHIFT_USER" \
    --aws-region "us-east-1" \
    --output-dir "./assessment-output"

The script queries:

-- Tables, sizes, distribution / sort keys
SELECT "schema", "table", size, tbl_rows, diststyle, sortkey1, encoded
FROM svv_table_info
WHERE "schema" NOT IN ('information_schema', 'pg_catalog')
ORDER BY size DESC;

-- Column-level definitions
SELECT schemaname, tablename, "column", type, encoding, distkey, sortkey
FROM pg_table_def
WHERE schemaname NOT IN ('information_schema', 'pg_catalog');

-- External (Spectrum) tables
SELECT schemaname, tablename, location, input_format, output_format
FROM svv_external_tables;

-- UDFs and stored procedures
SELECT n.nspname, p.proname, l.lanname, p.prosrc
FROM pg_proc p
JOIN pg_namespace n ON n.oid = p.pronamespace
JOIN pg_language l ON l.oid = p.prolang
WHERE n.nspname NOT IN ('pg_catalog', 'information_schema');

-- Query history (last 30 days, top 100 by elapsed)
SELECT query, userid, starttime, endtime, aborted, elapsed
FROM stl_query
WHERE starttime > DATEADD(day, -30, GETDATE())
ORDER BY elapsed DESC
LIMIT 100;

✅ Verification: Output directory contains: - inventory.csv — every table, view, UDF, stored proc with size + row count + dependencies - distkey_sortkey_report.csv — per-table DISTKEY/SORTKEY/ENCODE that will be removed during conversion - spectrum_externals.csv — S3 prefixes that will become OneLake shortcuts - complexity_scores.csv — per-workload "Easy / Medium / Hard / Very Hard" classification - wlm_queues.json — current WLM rules for capacity-mapping - unsupported_features.md — Redshift features that don't have a 1:1 Fabric equivalent

💡 Tip: Re-run assessment monthly. Redshift STL_* views age out after a few days; use STL_QUERY_METRICS and SVL_QLOG for longer baselines.

Step 2 — Plan Migration Wave Order¶

Use the dependency graph to plan migration waves. Migrate leaves first, roots last.

python 01_assessment.py --command wave-plan \
    --inventory ./assessment-output/inventory.csv \
    --max-wave-effort-days 10

The tool produces migration-waves.md with: - Wave 1: Spectrum external tables (just shortcuts — fastest) - Wave 2: Independent reference / dimension tables - Wave 3: Bronze fact tables (UNLOAD then load) - Wave 4: Silver / aggregate tables - Wave 5: Gold / BI tables, materialized views → MLV - Wave 6: BI semantic models, QuickSight rebind, decommission

⚠️ Gotcha: Redshift late-binding views (WITH NO SCHEMA BINDING) often hide cross-schema dependencies. Run pg_get_viewdef() on all views and parse the SQL — the assessment script does this for you, but spot-check before publishing the wave plan.

Step 3 — Convert Schemas (Redshift PostgreSQL → Fabric Warehouse T-SQL)¶

python 02_schema_conversion.py \
    --source-host "prod-cluster.abc123.us-east-1.redshift.amazonaws.com" \
    --source-db "analytics" \
    --target-warehouse "wh-analytics-prod" \
    --target-workspace "ws-analytics-prod" \
    --output-ddl ./output-ddl/

The script: 1. Reads source schema via pg_table_def and svv_columns 2. Translates types (most are direct; flags incompatibilities) 3. Removes Redshift-specific physical clauses (DISTSTYLE, DISTKEY, SORTKEY, ENCODE, INTERLEAVED SORTKEY) — Fabric Warehouse handles these via V-Order automatically 4. Translates IDENTITY(1,1) → T-SQL IDENTITY(1,1) (direct) 5. Generates Fabric Warehouse DDL with comments noting changes 6. Writes per-table CREATE TABLE to output-ddl/{schema}.{table}.sql

Type Conversion Table (Redshift → Fabric Warehouse)¶

Redshift Type	Fabric Warehouse Type	Notes
`SMALLINT` / `INT2`	`smallint`	Direct
`INTEGER` / `INT4`	`int`	Direct
`BIGINT` / `INT8`	`bigint`	Direct
`DECIMAL(p,s)` / `NUMERIC(p,s)`	`decimal(p,s)`	Direct
`REAL` / `FLOAT4`	`real`	Direct
`DOUBLE PRECISION` / `FLOAT8`	`float`	Direct
`BOOLEAN`	`bit`	T-SQL uses `bit`
`CHAR(n)`	`char(n)`	Direct
`VARCHAR(n)`	`varchar(n)`	Redshift is byte-length; T-SQL is char-length — see gotcha below
`VARCHAR(MAX)` (Redshift max 65535)	`varchar(8000)` or `varchar(max)`	Cap depends on row size
`DATE`	`date`	Direct
`TIMESTAMP` (without TZ)	`datetime2(6)`	Recommended upgrade
`TIMESTAMPTZ`	`datetimeoffset(6)`	Direct
`TIME`	`time(6)`	Direct
`GEOMETRY` / `GEOGRAPHY`	NOT SUPPORTED in Warehouse	Use Lakehouse + `sedona` (Tutorial 21)
`HLLSKETCH`	NOT SUPPORTED	Recompute via `APPROX_COUNT_DISTINCT`
`SUPER` (semi-structured)	`varchar(max)` + JSON	Parse to JSON in Silver layer
`VARBYTE`	`varbinary(n)`	Direct
`OID`	NOT SUPPORTED	Drop or convert to `bigint`

⚠️ Gotcha — VARCHAR length semantics: Redshift VARCHAR(n) is bytes; Fabric T-SQL varchar(n) is characters (single-byte) — for UTF-8 multi-byte data you may need varchar(n*2) or larger. The conversion script flags suspect columns.

Redshift-Specific Features to Remove¶

Redshift Feature	Action
`DISTSTYLE EVEN/KEY/ALL`	Remove — Fabric auto-distributes
`DISTKEY (col)`	Remove — V-Order handles colocation
`SORTKEY (col1, col2)`	Remove — V-Order maintains physical order
`INTERLEAVED SORTKEY (...)`	Remove — V-Order replaces
`ENCODE LZO/ZSTD/AZ64/RAW`	Remove — Parquet compression is automatic
`WITH NO BACKUP`	Remove — OneLake handles persistence
`BACKUP YES/NO`	Remove
`DELETE/UPDATE` triggers via stored procs	Migrate proc; T-SQL is similar

✅ Verification:
-- Run in Fabric Warehouse SQL endpoint
SELECT COUNT(*) FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_TYPE = 'BASE TABLE';
-- Should equal source Redshift table count (excluding NOT SUPPORTED rows)
💡 Tip: Run EXPLAIN on a representative query in Fabric vs Redshift after migration. Fabric's optimizer is different — but you should not be hand-tuning hints. If a query is slow, file it as a tuning issue rather than porting Redshift OPTION clauses.

Step 4 — UNLOAD Data from Redshift to S3¶

UNLOAD is the fastest way to extract Redshift data. Use Parquet + Snappy for downstream Fabric efficiency.

-- Run in Redshift query editor
UNLOAD ('SELECT * FROM analytics.fact_player_session')
TO 's3://your-fabric-staging-bucket/redshift-export/fact_player_session/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
FORMAT PARQUET
PARALLEL ON
MAXFILESIZE 256 MB
CLEANPATH;

Key flags explained:

Flag	Why
`FORMAT PARQUET`	Columnar format Fabric reads natively
`PARALLEL ON`	Multiple files = parallel slicing on read in Fabric
`MAXFILESIZE 256 MB`	Sweet spot for Spark partition reads
`CLEANPATH`	Removes prior export files at the prefix
`IAM_ROLE`	Cluster-attached IAM role — never use access keys

For incremental UNLOAD, parameterize with a watermark:

UNLOAD ('SELECT * FROM analytics.fact_player_session
         WHERE event_ts > ''2026-04-01 00:00:00''
           AND event_ts <= ''2026-04-08 00:00:00''')
TO 's3://your-fabric-staging-bucket/redshift-export/fact_player_session/wave2/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
FORMAT PARQUET
PARALLEL ON;

✅ Verification:
aws s3 ls s3://your-fabric-staging-bucket/redshift-export/fact_player_session/ --recursive --summarize
# Confirm file count and total size match Redshift svv_table_info row/byte estimate
⚠️ Gotcha: UNLOAD to a public S3 bucket counts as internet egress and incurs AWS data-transfer charges. See the AWS Network Costs section below for mitigation strategies.

Step 5 — Bring S3 Data into OneLake¶

You have two paths. Pick based on the table's role.

Path A: OneLake Shortcut (zero copy — preferred)¶

Best for: Spectrum tables, large fact tables read once-or-rarely from Fabric, archive data.

# Create a OneLake shortcut to the S3 prefix
az fabric onelake-shortcut create \
    --workspace-id "$WS_ID" \
    --lakehouse-id "$LH_ID" \
    --name "redshift-fact_player_session" \
    --path "Files/redshift_export/fact_player_session" \
    --target-s3 "{ \"location\": \"https://s3.us-east-1.amazonaws.com\", \"bucket\": \"your-fabric-staging-bucket\", \"subpath\": \"/redshift-export/fact_player_session/\" }" \
    --connection-id "$S3_CONNECTION_ID"

Or via Bicep:

resource shortcut 'Microsoft.Fabric/workspaces/items/shortcuts@2024-01-01' = {
  name: 'redshift-fact_player_session'
  properties: {
    target: {
      s3: {
        location: 'https://s3.${awsRegion}.amazonaws.com'
        bucket: 'your-fabric-staging-bucket'
        subpath: '/redshift-export/fact_player_session/'
      }
    }
    connectionId: s3ConnectionId
  }
}

Why this is huge: - Petabyte-scale data is not copied across clouds (no Azure ingress, no AWS egress on every query) - Source S3 retains full backup - Fabric reads in place with caching - Zero downtime — Redshift Spectrum and Fabric can both read the same prefix during coexistence

✅ Verification: Query a sample table in Fabric and compare row count to Redshift.
# Fabric notebook
df = spark.read.parquet("Files/redshift_export/fact_player_session")
print(df.count())

Path B: One-Time Copy via Fabric Pipeline (when shortcuts won't do)¶

Best for: tables you'll write to from Fabric, hot tables needing OneLake-native cache, regulatory data residency requirements.

python 03_pipeline_copy.py \
    --source-s3 "s3://your-fabric-staging-bucket/redshift-export/fact_player_session/" \
    --target-lakehouse "lh_bronze" \
    --target-table "bronze_player_session" \
    --workspace-id "$WS_ID" \
    --copy-mode "delta"

The pipeline: 1. Reads S3 Parquet via the AWS S3 connector 2. Writes Delta to OneLake Lakehouse Tables 3. Applies V-Order automatically 4. Records lineage in OneLake Catalog

💡 Tip: Choose Path A by default. Only use Path B when there's a concrete reason (residency, write workload, severe latency).

Step 6 — Load Fabric Warehouse via CTAS or COPY INTO¶

Once the data is in OneLake (shortcut or copy), load Fabric Warehouse tables.

-- Run in Fabric Warehouse SQL endpoint

-- Option 1: COPY INTO (best for one-shot bulk load)
COPY INTO analytics.fact_player_session
FROM 'https://onelake.dfs.fabric.microsoft.com/{workspace}/{lakehouse}.Lakehouse/Files/redshift_export/fact_player_session/'
WITH (FILE_TYPE = 'PARQUET');

-- Option 2: CTAS from Lakehouse SQL endpoint (best for joining + reshaping during load)
CREATE TABLE analytics.fact_player_session AS
SELECT
    session_id,
    player_id,
    CAST(start_ts AS datetime2(6)) AS start_ts,
    end_ts,
    coin_in,
    coin_out,
    -- type-promote where the assessment flagged precision concerns
    CAST(net_revenue AS decimal(18,4)) AS net_revenue
FROM lh_bronze.bronze_player_session;

✅ Verification:

SELECT COUNT(*) AS row_count, MIN(start_ts), MAX(start_ts)
FROM analytics.fact_player_session;
-- Compare to Redshift baseline

Step 7 — Migrate Spectrum External Tables → OneLake Shortcuts¶

Redshift Spectrum tables already live as files on S3. The fastest possible migration is: create a shortcut to the same S3 prefix.

# For each Spectrum table:
python 04_spectrum_to_shortcut.py \
    --spectrum-inventory ./assessment-output/spectrum_externals.csv \
    --workspace-id "$WS_ID" \
    --lakehouse-id "$LH_ID" \
    --s3-connection-id "$S3_CONNECTION_ID"

The script: 1. Reads each svv_external_tables row 2. Creates a OneLake shortcut at Files/spectrum/{schema}/{table} pointing at the same S3 location 3. Generates a Lakehouse SQL CREATE TABLE so it appears as a queryable Delta/Parquet table

✅ Verification: Query a Spectrum-migrated table in Fabric and compare to the original Redshift Spectrum query.
-- Fabric SQL endpoint
SELECT COUNT(*) FROM lh_bronze.spectrum_clickstream_2026_q1;
💡 Tip: Spectrum tables migrated this way require zero data movement and zero AWS egress — they're the cheapest, fastest part of the migration.

Step 8 — Convert UDFs and Stored Procedures¶

Redshift UDFs come in three flavors:

Source Pattern	Fabric Target	Approach
`LANGUAGE plpythonu` (Python UDF)	Fabric notebook function (PySpark)	Manual port — most logic moves to PySpark functions called from notebooks
`LANGUAGE sql` (SQL UDF)	T-SQL inline / scalar function	Mostly direct; rewrite `RETURNS TABLE` → T-SQL inline TVF
`LANGUAGE plpgsql` (stored proc)	T-SQL stored procedure	Largely compatible — `RAISE NOTICE` → `PRINT`; cursor syntax differs

# Run the UDF analyzer
python 05_udf_conversion.py \
    --inventory ./assessment-output/inventory.csv \
    --output-dir ./udf-conversions/

The script produces, per UDF: - {schema}.{name}.original.sql — Redshift source - {schema}.{name}.fabric.sql — best-effort T-SQL conversion - {schema}.{name}.notes.md — manual review checklist

⚠️ Gotcha: Redshift Python UDFs have access to a sandbox stdlib that includes pandas, numpy, etc. In Fabric, the equivalent runs in a notebook, not inline in SQL. Plan to refactor calling SQL into a notebook step where the UDF is replaced by a PySpark UDF or vectorized operation.

Step 9 — Convert WLM Rules → Fabric Capacity Governance¶

Redshift WLM (Workload Management) governs concurrency and memory per queue. Fabric uses capacity + workspace assignment for the same outcome.

Redshift WLM Concept	Fabric Equivalent
Manual WLM queue	Workspace assigned to a Fabric capacity
Queue concurrency slots	F-SKU capacity unit pool (CU)
Queue memory percentage	CU smoothing across workloads
Query priority (Auto WLM)	Capacity throttling policy + workspace priority
Short Query Acceleration	Direct Lake (BI hot path)
Concurrency Scaling	Capacity bursting + reserved CU

python 01_assessment.py --command wlm-mapping \
    --wlm-config ./assessment-output/wlm_queues.json \
    --output ./fabric-capacity-plan.md

Output: a recommended workspace + capacity assignment plan, e.g.:

Redshift Queue	Workload	Recommended Fabric Workspace	Capacity
`etl_queue`	Nightly ETL	`ws-fabric-etl`	F64 (dedicated)
`bi_queue`	Power BI / QuickSight	`ws-fabric-bi`	Shared F64
`adhoc_queue`	Analyst ad-hoc	`ws-fabric-analyst`	Shared F64 with throttle

💡 Tip: Use Action Groups + budget alerts (Wave 1 Bicep module) to alert on CU saturation, the equivalent of Redshift WLM queue queueing time.

Step 10 — Size Fabric Capacity from Redshift Node-Hours¶

python 01_assessment.py --command capacity-recommendation \
    --inventory ./assessment-output/inventory.csv \
    --node-type "ra3.4xlarge" \
    --node-count 8 \
    --concurrency-scaling-hours-per-month 120 \
    --query-history-csv ./assessment-output/query_history.csv

Output: F-SKU recommendation with rationale.

RA3 / DC2 → F-SKU Rough Sizing (Starting Point — Validate Empirically)¶

Redshift Cluster	Approximate Fabric F-SKU
2× `dc2.large`	F8
2× `ra3.xlplus`	F16
4× `ra3.xlplus`	F32
4× `ra3.4xlarge`	F64
8× `ra3.4xlarge`	F128
4× `ra3.16xlarge`	F256
8× `ra3.16xlarge`	F512
16× `ra3.16xlarge`	F1024
32× `ra3.16xlarge`	F2048

⚠️ Gotcha — These numbers are heuristics only. Redshift node sizing reflects historical RA3 storage decoupling; Fabric CU also serves Power BI, Eventstream, and Spark from the same pool. Always run a parallel measurement period during coexistence (Step 12) before downsizing.

💡 Tip: If your Redshift cluster used Concurrency Scaling for > 30% of hours, factor that into the F-SKU choice — Fabric capacity does not have a separate "burst tier" billing line.

Step 11 — Validate Migration¶

Validation runs three categories of checks: row counts, content hashes, and query parity.

python 06_validation_suite.py \
    --redshift-host "$RS_HOST" \
    --redshift-db "$RS_DB" \
    --target-warehouse "$FAB_WH_ID" \
    --tables-yaml ./tables-to-validate.yaml \
    --output-report ./validation-report.html

Three-Tier Validation¶

   Tier 1: ROW COUNT     ─────▶  count(*) match per table
   Tier 2: CONTENT HASH  ─────▶  fnv_hash() vs HASHBYTES('SHA2_256', ...) match per row
   Tier 3: QUERY PARITY  ─────▶  representative aggregations match within tolerance

Check	Pass Threshold	Action on Fail
Row count	exact match	Re-run incremental UNLOAD; check filter predicates and watermark
Content hash	100% match	Investigate type conversion (esp. VARCHAR byte-vs-char), NULL handling, timestamp precision
Aggregation parity	within 0.001% (floating point)	Type precision review; Redshift `FLOAT` vs Fabric `decimal`
Query latency	within 2× source p95	Tune V-Order; review query plan; check if Spectrum-shortcut adds S3 read latency

✅ Verification: All three tiers pass for ≥ 95% of in-scope tables before declaring migration complete.

💡 Tip: Redshift FNV_HASH() and Fabric HASHBYTES('SHA2_256', ...) produce different values — the validation suite normalizes by computing both hashes on a column-concatenated string row representation, not the source hash function.

Step 12 — Coexistence Period (Run Both for 2-4 Weeks)¶

During coexistence: - Both Redshift Spectrum and Fabric (via shortcut) read the same S3 data - For copied tables: Fabric is the source of truth; Redshift continues to receive writes via dual-write or one-way replication back to Redshift - BI reports point to Fabric (Direct Lake); Redshift only used for legacy edge cases - Daily validation report confirms parity

💡 Tip: Use Action Groups (Wave 1 Bicep module) to alert on validation failures during coexistence.

Step 13 — Cutover¶

When validation has been green for 7 consecutive days:

Freeze Redshift writes — pause ELT jobs; set databases to read-only
Final delta UNLOAD — capture the last incremental window
Switch BI — repoint Power BI / QuickSight to Fabric semantic model
Switch consumers — apps, APIs, integrations move to Fabric Warehouse endpoint
Update Glue Catalog references — repoint or retire entries; sync to OneLake Catalog
Monitor for 48 hours — close on-call bridge after no incidents
Decommission (Step 14)

Step 14 — Decommission Redshift¶

# After 30-day grace period with all validation green:

# 1. Take a final manual snapshot (cheap insurance)
aws redshift create-cluster-snapshot \
    --cluster-identifier prod-cluster \
    --snapshot-identifier prod-cluster-final-pre-fabric

# 2. Backup metadata for audit
python 07_backup_redshift_metadata.py \
    --cluster prod-cluster \
    --output ./redshift-backup/

# 3. Pause the cluster (keep snapshots, no compute charges)
aws redshift pause-cluster --cluster-identifier prod-cluster

# 4. After 90 more days, delete the cluster (S3 data and snapshots remain)
aws redshift delete-cluster \
    --cluster-identifier prod-cluster \
    --final-cluster-snapshot-identifier prod-cluster-final-archive

# 5. After 180 days total, retire snapshots if no other AWS workloads need them

⚠️ Gotcha: S3 data referenced by OneLake shortcuts is NOT decommissioned — Fabric continues to read in place. Only the Redshift compute cluster is removed.

💸 AWS Network Costs — Egress Mitigation¶

Multi-cloud migration introduces AWS egress charges that single-cloud migrations don't have. Plan for them up front.

Cost Source	Charge	Mitigation
`UNLOAD` Redshift → S3 (same region)	$0 (no transfer)	Always UNLOAD to a same-region S3 bucket
S3 → OneLake shortcut read	Per-GB read on every query	Enable OneLake shortcut caching; right-size F-SKU to keep working set hot
S3 → OneLake one-time copy	Egress per GB out of AWS region	Schedule copy in off-peak windows; compress aggressively
Cross-region Redshift → S3	High per-GB	Move S3 staging bucket to the same AWS region as the cluster
AWS PrivateLink / Direct Connect	Hourly + per-GB	Worth it for > ~50 TB monthly transfer

Recommended Topology¶

S3 staging bucket in the same AWS region as the Redshift cluster (free intra-region UNLOAD)
Fabric capacity in the Azure region geographically closest to that AWS region (e.g., us-east-1 ↔ eastus2)
OneLake shortcut caching enabled to minimize repeat reads
For large repeat workloads (> 50 TB/month), evaluate AWS Direct Connect + Azure ExpressRoute with a megaport interconnect

💡 Tip: For the largest archive tables, evaluate keeping them in S3 + shortcut indefinitely. Reading 100 TB once a quarter via shortcut may be cheaper than copying to OneLake.

🔡 SQL Dialect Differences (Redshift PostgreSQL → Fabric T-SQL)¶

Redshift speaks a PostgreSQL-flavored dialect. Fabric speaks T-SQL. Most production migrations spend more time on dialect translation than on data movement.

Construct	Redshift	Fabric T-SQL
String concat	`\\|\\|` or `CONCAT()`	`+` or `CONCAT()`
Limit	`LIMIT n`	`TOP n` or `OFFSET ... FETCH NEXT`
Date arithmetic	`DATEADD('day', 7, col)`	`DATEADD(day, 7, col)` (no quotes)
Date diff	`DATEDIFF('day', a, b)`	`DATEDIFF(day, a, b)` (no quotes)
Current timestamp	`GETDATE()` / `SYSDATE` / `CURRENT_TIMESTAMP`	`GETDATE()` / `SYSUTCDATETIME()`
String position	`POSITION(sub IN str)` or `STRPOS()`	`CHARINDEX(sub, str)`
Cast	`col::INT` (PG cast)	`CAST(col AS int)`
Boolean literal	`TRUE` / `FALSE`	`1` / `0` (T-SQL `bit`)
Regex	`~` operator, `REGEXP_SUBSTR()`	T-SQL has no native regex — use CLR or move to PySpark
`LISTAGG`	`LISTAGG(col, ',')`	`STRING_AGG(col, ',')`
`MEDIAN()`	Native	`PERCENTILE_CONT(0.5)`
`APPROXIMATE COUNT(DISTINCT col)`	Native (HLL-backed)	`APPROX_COUNT_DISTINCT(col)`
`UNLOAD` (export)	Native	`COPY INTO ... TO ...` (export) or notebook write
`COPY` (load from S3)	Redshift COPY	Fabric `COPY INTO` (different syntax)
Window without ORDER	Allowed	Required
`IDENTITY(seed, step)`	At column or `IDENTITY(1,1)`	Same syntax

💡 Tip: The 02_schema_conversion.py script's --include-views flag attempts dialect translation on view bodies. Manually review every translated view — regex usage and LISTAGG are common breakage points.

✅ Verification (Final Checklist)¶

🧹 Cleanup¶

This tutorial creates assessment artifacts in ./assessment-output/, conversion artifacts in ./output-ddl/, and UDF conversion drafts in ./udf-conversions/. Retain for audit; archive to long-term storage.

If your assessment ran against a real Redshift cluster, the read-only queries leave no trace. UNLOAD runs do leave files on S3 — clean up the staging prefix when migration is complete and cutover has aged 30 days.

# Clean staging bucket (after cutover + 30 days)
aws s3 rm s3://your-fabric-staging-bucket/redshift-export/ --recursive

🚦 Next Steps¶

Tutorial 44 — BigQuery → Fabric — for shops with GCP analytics alongside AWS
Tutorial 41 — Synapse → Fabric — anchor pattern for Azure-native migrations
Tutorial 42 — Databricks → Fabric — for Databricks-on-AWS shops
Migration Patterns Best Practices — cross-source patterns
fabric-cicd Deployment — automate the deployment of migrated artifacts
Capacity Planning — F-SKU sizing deep dive
Network Security — S3 PrivateLink + ExpressRoute interconnect

🛠️ Troubleshooting¶

Issue	Cause	Resolution
`Type 'SUPER' is not supported` during DDL apply	Redshift semi-structured type	Convert column to `varchar(max)` and parse to JSON in Silver layer
Row counts mismatch by ~1%	UNLOAD watermark gap or in-flight writes	Re-run incremental UNLOAD with overlapping window; reconcile diff
`UNLOAD` fails: `S3ServiceException Access Denied`	Cluster IAM role missing `s3:PutObject`	Update role; never use static access keys
OneLake shortcut shows 0 rows	S3 connection permissions / wrong region	Verify connection-id; confirm S3 bucket region matches connection region
Egress costs higher than expected	Cross-region UNLOAD or repeated full-table shortcut reads	Move staging bucket to cluster region; enable shortcut caching
Aggregation off by 0.01%	Redshift `FLOAT` vs Fabric `decimal`	Cast aggregation column to decimal explicitly in both sides
Spectrum-shortcut query slow	Cold S3 + cross-region	Enable OneLake shortcut caching; consider one-time copy to OneLake for hot tables
`VARCHAR` truncation after migration	Redshift bytes vs T-SQL chars	Increase target `varchar(n)` size; re-run validation hashes
Stored procedure compile error: `RAISE NOTICE`	plpgsql vs T-SQL	Replace with `PRINT` or `RAISERROR`
Python UDF returns NULL in Fabric	Inline plpythonu has no T-SQL equivalent	Refactor calling SQL to call a notebook PySpark function
F-SKU throttles after migration	Workload concurrency higher than Redshift WLM allowed	Scale F-SKU; use capacity-throttling runbook
Power BI Direct Lake fails to refresh	Schema drift in Warehouse	Refresh metadata; see workspace monitoring
Encoding mismatch warnings during validation	Redshift `ENCODE ZSTD` source vs Parquet target	Expected — encoding is removed; data parity is preserved
Large `VARBYTE` / LOB columns slow	Cross-cloud shortcut latency on big BLOBs	Copy to OneLake (Path B) for tables with > 64 KB avg row size

📁 Key Files Referenced¶

Step	File
1, 2, 9, 10	`tutorials/43-redshift-to-fabric/01_assessment.py`
3	`tutorials/43-redshift-to-fabric/02_schema_conversion.py`
5 (Path B)	`tutorials/43-redshift-to-fabric/03_pipeline_copy.py`
7	`tutorials/43-redshift-to-fabric/04_spectrum_to_shortcut.py`
8	`tutorials/43-redshift-to-fabric/05_udf_conversion.py`
11	`tutorials/43-redshift-to-fabric/06_validation_suite.py` (planned in Wave 9)
14	`tutorials/43-redshift-to-fabric/07_backup_redshift_metadata.py` (planned)
All	`data_generation/generators/migration/redshift_workload_inventory.py`

📚 References¶

AWS Documentation¶

Microsoft Documentation¶

⬆️ Back to Top | 📚 Tutorial Index | 🏠 Home

← PreviousWorkflow MigrationRead more →Next →44 — BigQuery → FabricRead more →

🟧 Tutorial 43: Amazon Redshift → Microsoft Fabric Migration¶

🟧 Tutorial 43: Amazon Redshift → Microsoft Fabric Migration¶

📊 Progress Tracker¶

📖 Overview¶

Comparing Redshift and Fabric¶

What This Tutorial Covers¶

🎯 Learning Objectives¶

🏗️ Reference Architecture¶

Component Mapping (the canonical reference)¶

📋 Prerequisites¶

🚀 Step-by-Step¶

Step 1 — Assess Your Redshift Cluster¶

Step 2 — Plan Migration Wave Order¶

Step 3 — Convert Schemas (Redshift PostgreSQL → Fabric Warehouse T-SQL)¶

Type Conversion Table (Redshift → Fabric Warehouse)¶

Redshift-Specific Features to Remove¶

Step 4 — UNLOAD Data from Redshift to S3¶

Step 5 — Bring S3 Data into OneLake¶

Path A: OneLake Shortcut (zero copy — preferred)¶

Path B: One-Time Copy via Fabric Pipeline (when shortcuts won't do)¶

Step 6 — Load Fabric Warehouse via CTAS or COPY INTO¶

Step 7 — Migrate Spectrum External Tables → OneLake Shortcuts¶

Step 8 — Convert UDFs and Stored Procedures¶

Step 9 — Convert WLM Rules → Fabric Capacity Governance¶

Step 10 — Size Fabric Capacity from Redshift Node-Hours¶

RA3 / DC2 → F-SKU Rough Sizing (Starting Point — Validate Empirically)¶

Step 11 — Validate Migration¶

Three-Tier Validation¶

Step 12 — Coexistence Period (Run Both for 2-4 Weeks)¶

Step 13 — Cutover¶

Step 14 — Decommission Redshift¶

💸 AWS Network Costs — Egress Mitigation¶

Recommended Topology¶

🔡 SQL Dialect Differences (Redshift PostgreSQL → Fabric T-SQL)¶

✅ Verification (Final Checklist)¶

🧹 Cleanup¶

🚦 Next Steps¶

🛠️ Troubleshooting¶

📁 Key Files Referenced¶

📚 References¶

AWS Documentation¶

Microsoft Documentation¶

Related Tutorials & Docs¶