Home > Tutorials > Databricks to Fabric Migration
🧱 Tutorial 42: Databricks → Microsoft Fabric Migration¶
Last Updated: 2026-04-27 | Phase: 14 (Wave 4) | Companion to Tutorial 41 (Synapse) Status: ✅ Final | Maintainer: Platform Team
Third-party references — publicly sourced, good-faith comparison
This page references non-Microsoft products and services. That information is drawn from each vendor's publicly available documentation and is offered for honest, good-faith comparison only. This is a personal project written from a Microsoft Fabric and Azure perspective; it does not claim expertise in, or authority over, any third-party product, and nothing here is an official statement by, or endorsed by, those vendors. Capabilities, pricing, and features change often — always verify against the vendor's current official documentation. Where a third-party offering is the stronger choice, we say so plainly.
🧱 Tutorial 42: Databricks → Microsoft Fabric Migration¶
| Difficulty | ⭐⭐⭐ Advanced |
| Time | ⏱️ 200-260 minutes |
| Focus | Workspace + Unity Catalog + Delta tables + Workflows + MLflow + DBSQL → Fabric Workspace, OneLake Catalog, Lakehouse Delta, Fabric Pipelines, Fabric MLflow, Fabric Warehouse SQL endpoint |
📊 Progress Tracker¶
| Tutorial | Status |
|---|---|
| 00 — Environment Setup → 38 — DOJ Justice | ✅ Complete (Phases 1-13) |
| 41 — Synapse → Fabric | ✅ Complete |
| 42 — Databricks → Fabric (this tutorial) | 🔵 YOU ARE HERE |
| 43 — Redshift → Fabric | ⬜ |
| 44 — BigQuery → Fabric | ⬜ |
| 45 — On-Prem SSAS/SSIS/SSRS → Fabric | ⬜ |
| Navigation | |
|---|---|
| ⬅️ Previous | 41 — Synapse → Fabric |
| ➡️ Next | 43 — Redshift → Fabric |
📖 Overview¶
Databricks → Fabric tends to be a relatively smooth cloud-to-cloud migration because both platforms speak Delta Lake natively. Databricks is a mature, best-in-class lakehouse and Spark platform with deep ML/AI tooling, and for many teams it is the right long-term home. This tutorial is for teams that have decided to consolidate onto a Microsoft platform that bundles BI, real-time, governance, and ML alongside Spark, with a single capacity-based commercial model and tighter Microsoft 365 integration — not an argument that one platform is better than the other.
Comparing Databricks and Fabric¶
The table below summarizes differences based on each vendor's publicly available documentation (as of this doc's date). It is descriptive, not a ranking — Databricks' Photon engine, DLT, and Unity Catalog are genuine strengths, and the most cost-effective choice depends entirely on your workload mix.
| Concern | Databricks Behavior | Fabric Behavior |
|---|---|---|
| Compute model | DBUs per cluster, billed per workspace | Capacity Units (CUs) shared across Spark, SQL, Power BI, RTI, ML |
| Storage | Customer-managed ADLS / S3 / GCS | OneLake unified, but ADLS shortcuts work too (no copy) |
| BI integration | Power BI via DBSQL connector or DirectQuery | Direct Lake (no copy, no refresh) on Delta — read-from-OneLake |
| Real-time | Structured Streaming + DLT (Delta Live Tables) | Eventstream + Eventhouse + Materialized Lake Views |
| Governance | Unity Catalog (Databricks-native, mature) | OneLake Catalog + Microsoft Purview (org-wide) |
| MLOps | MLflow (managed) + Model Serving | Fabric MLflow + AutoML + Real-Time Endpoints |
| DBSQL Warehouses | Photon-accelerated SQL warehouses (separately billed) | Fabric Warehouse + SQL endpoint over Lakehouse |
| Notebook UX | Databricks notebooks | Fabric notebooks (similar; mssparkutils API differences) |
| Workflows | Databricks Workflows / Jobs | Fabric Data Pipelines |
| DLT (Declarative) | Delta Live Tables | Materialized Lake Views (Wave 9) |
| Cost shape | Variable (per-DBU + storage + serverless DBSQL surcharge) | Fixed monthly F-SKU — predictability differs by workload pattern |
| Microsoft estate fit | Bring your own M365 / Power BI / Purview | Native, single tenant, single bill |
What This Tutorial Covers¶
- Inventory + complexity scoring of your Databricks workspace
- Mapping Unity Catalog → OneLake Catalog (governance migration)
- Moving Delta tables to OneLake (in-place via shortcut, OR rewrite for V-Order)
- Notebook conversion:
dbutils→mssparkutils, magic command parity, library management - Databricks Workflows → Fabric Data Pipelines
- Delta Live Tables → Materialized Lake Views (or scheduled notebooks)
- MLflow registry export + import to Fabric MLflow
- DBSQL → Fabric Warehouse SQL endpoint
- DBU → CU sizing
- Three-tier validation (row count + hash + ML model parity)
- Coexistence pattern (both platforms read same OneLake/ADLS Delta)
- Cutover + decommission
🎯 Learning Objectives¶
By the end of this tutorial, you will be able to:
- Run a comprehensive Databricks workspace inventory (notebooks, jobs, clusters, tables, models)
- Map Unity Catalog three-level namespace to OneLake Catalog
- Migrate Delta tables to OneLake using shortcuts (zero-copy) or rewrite (V-Order optimized)
- Convert Databricks notebooks to Fabric notebooks with mssparkutils shim
- Translate Databricks Workflows into Fabric Data Pipelines
- Map Delta Live Tables to Materialized Lake Views or scheduled notebooks
- Export an MLflow model registry from Databricks and import into Fabric MLflow
- Migrate DBSQL warehouses to Fabric Warehouse / SQL endpoint
- Size F-SKU capacity from existing DBU consumption
- Validate parity via row counts, content hashes, query parity, and ML prediction parity
- Run Databricks and Fabric in coexistence over the same OneLake-shortcut data
- Decommission Databricks workspace safely without data loss
🏗️ Reference Architecture¶
flowchart LR
subgraph DB["🧱 Source: Databricks Workspace"]
DBWS[Databricks Workspace]
UC[(Unity Catalog)]
DBT[(Delta Tables on ADLS/S3)]
DBWF[Workflows / Jobs]
DLT[Delta Live Tables]
DBSQL[(DBSQL Warehouses)]
DBNB[Notebooks]
DBML[MLflow Registry]
DBCP[Cluster Pools]
end
subgraph Fabric["⚡ Target: Microsoft Fabric"]
FWS[Fabric Workspace]
OLC[(OneLake Catalog)]
LH[(Lakehouse Delta in OneLake)]
FPIPE[Fabric Pipelines]
MLV[Materialized Lake Views]
FWH[(Fabric Warehouse / SQL endpoint)]
FNB[Fabric Notebooks]
FML[Fabric MLflow]
FCAP[F-SKU Capacity]
end
subgraph Migration["🔄 Migration Layer"]
Assess[Assessment Script]
DeltaCompat[Delta Compatibility Check]
WFConvert[Workflow → Pipeline Converter]
MLExport[MLflow Registry Export]
Validate[Validation Suite]
end
DBWS --> Assess
UC --> Assess
DBT --> DeltaCompat
DBWF --> WFConvert
DBML --> MLExport
DBNB --> Assess
Assess --> FWS
UC -.governance map.-> OLC
DeltaCompat -->|shortcut OR rewrite| LH
WFConvert --> FPIPE
DLT -.declarative map.-> MLV
MLExport --> FML
DBSQL -.SQL endpoint.-> FWH
DBNB -.dbutils shim.-> FNB
DBCP -.DBU→CU.-> FCAP
LH --> Validate
FPIPE --> Validate
FML --> Validate Component Mapping (the canonical reference)¶
| Databricks Component | Fabric Equivalent | Migration Approach |
|---|---|---|
| Databricks Workspace | Fabric Workspace | New workspace per source workspace; preserve naming |
| Unity Catalog (catalog.schema.table) | OneLake Catalog + Lakehouse schemas | Direct map — three-part name preserved |
| Hive Metastore (legacy) | Lakehouse default schema | Bulk register tables to Lakehouse |
| Delta tables on ADLS Gen2 | Delta in OneLake | Shortcut (zero-copy) OR rewrite for V-Order |
| Delta tables on S3 / GCS | Delta in OneLake | OneLake S3/GCS shortcut OR copy |
| Iceberg tables (UniForm) | OneLake Iceberg shortcut | Direct — see Iceberg Interop |
| Databricks Notebooks | Fabric Notebooks | Port code; dbutils → mssparkutils shim |
| Databricks Workflows / Jobs | Fabric Data Pipelines | Activity translation (see Step 5) |
| Delta Live Tables (DLT) | Materialized Lake Views | Declarative SQL maps directly; Python DLT rewrites |
| DBR Runtime (e.g., 14.3 LTS) | Fabric Spark Runtime 1.3 / 2.0 | Most code portable; see runtime migration |
| Cluster pools | Fabric Spark capacity (CUs) | No 1:1 — Fabric autoscales within F-SKU |
| Photon engine | Fabric V-Order + Native Execution Engine | Different code path, similar perf for BI workloads |
| MLflow Tracking Server | Fabric MLflow (built-in) | Re-point experiments; export-import models |
| MLflow Model Registry | Fabric Model Registry | mlflow export-import tool (Step 7) |
| Model Serving | Fabric Real-Time Endpoints | Re-deploy registered model |
| DBSQL Warehouses | Fabric Warehouse + SQL endpoint | DDL conversion + V-Order; Power BI repoints |
| Unity Catalog Volumes | Lakehouse Files section | Copy non-tabular files |
| Secret Scopes | Azure Key Vault + Workspace Identity | Recreate; reference via OAP |
| Cluster init scripts | Fabric Environment files | Repackage as conda/pip env |
| Repos (Git folders) | Fabric Git Integration | Connect workspace to same repo |
📋 Prerequisites¶
- Tutorial 00 (Environment Setup) complete
- Tutorials 01-03 (Bronze/Silver/Gold medallion) understood
- Tutorial 12 (CI/CD) for fabric-cicd workflow
- Tutorial 13 (Migration Planning) for governance overview
- Tutorial 41 (Synapse → Fabric) recommended for shared assessment patterns
- Source: Existing Databricks workspace (workspace admin or
account_adminfor Unity Catalog inventory) - Target: Fabric F64+ capacity provisioned, workspace created
- Databricks PAT (Personal Access Token) with
workspace.access+metastore.read - Service Principal with
Workspace Adminon Fabric target - Databricks CLI ≥ 0.220 (
databricks --version) - Python 3.11 with
databricks-sdk,mlflow,azure-identity - Network path between Databricks ADLS Gen2 (or S3) and Fabric (Workspace Identity granted Storage Blob Data Reader)
- Estimated 3-5 hours active time + 2-4 weeks coexistence period
🚀 Step-by-Step¶
Step 1 — Assess Your Databricks Workspace¶
Inventory notebooks, jobs, clusters, Unity Catalog tables, and MLflow models.
cd tutorials/42-databricks-to-fabric/
python 01_assessment.py \
--databricks-host "https://adb-1234567890123456.1.azuredatabricks.net" \
--databricks-token "$DATABRICKS_TOKEN" \
--output-dir "./assessment-output"
✅ Verification: Output directory contains: -
notebooks_inventory.csv— every notebook with size, last-run, language, dependencies -jobs_inventory.csv— Workflows, schedules, cluster config, last-run status -clusters_inventory.csv— interactive + job clusters with DBR runtime, node type, autoscale ranges -unity_catalog_inventory.csv— every table with catalog.schema.table, format (Delta/Iceberg/Parquet), size, last-modified -mlflow_models_inventory.csv— every registered model + production-stage version -dlt_pipelines_inventory.csv— Delta Live Tables pipelines with mode (triggered/continuous) -dbsql_warehouses_inventory.csv— DBSQL warehouses with t-shirt size + Photon flag -dbu_consumption_30d.csv— last 30 days DBU usage by workspace + workload type -complexity_scores.csv— per-asset Easy/Medium/Hard/VeryHard rating -unsupported_features.md— items with no 1:1 Fabric equivalent (Photon-only SQL functions, Workflows-only triggers, etc.)💡 Tip: Run assessment monthly during the migration project. New tables, dropped models, and shifting cluster usage all surface here.
⚠️ Gotcha: If your Databricks workspace was created before Unity Catalog (legacy
hive_metastore), the assessment will report tables underhive_metastore.<schema>.<table>. These need explicit promotion to a UC catalog first or direct Lakehouse registration. Don't skip this — Hive metastore tables can have weaker schema enforcement.
Step 2 — Plan Migration Wave Order¶
Use the dependency graph to plan migration waves. Migrate leaves (no consumers) first, roots (most-consumed) last.
python 01_assessment.py --command wave-plan \
--inventory ./assessment-output/ \
--max-wave-effort-days 10
The tool produces migration-waves.md with:
- Wave 1 — Independent reference tables (dimensions with no consumers)
- Wave 2 — Bronze tables (raw ingestion → Fabric Bronze Lakehouse)
- Wave 3 — Silver / cleansed dimension tables
- Wave 4 — Gold / fact tables + aggregations
- Wave 5 — DLT pipelines → Materialized Lake Views
- Wave 6 — MLflow models + serving endpoints
- Wave 7 — DBSQL workloads + Power BI rebinding
- Wave 8 — Decommission Databricks workspace
⚠️ Gotcha: Cross-workspace Delta Sharing creates hidden dependencies. If Workspace A consumes a share from Workspace B, migrate the producer first, then the consumer. The assessment script checks for
system.information_schema.sharesand surfaces these.
Step 3 — Migrate Delta Tables to OneLake¶
This is the easiest part of the migration. Both Databricks and Fabric speak Delta Lake natively. You have two options:
Option A — OneLake Shortcut (zero-copy, recommended)¶
If your Databricks Delta tables live on ADLS Gen2 (Azure) or S3 (AWS), create a OneLake shortcut. No data movement.
# ADLS Gen2 shortcut
az fabric onelake-shortcut create \
--workspace-id "$WS_ID" \
--lakehouse-id "$LH_ID" \
--name "databricks-bronze" \
--path "Tables/databricks_bronze" \
--target-adls-uri "abfss://databricks-data@adlsgendb01.dfs.core.windows.net/bronze/"
Or via Bicep:
resource shortcut 'Microsoft.Fabric/workspaces/items/shortcuts@2024-01-01' = {
name: 'databricks-bronze'
properties: {
target: {
adlsGen2: {
location: 'https://${storageAccount}.dfs.core.windows.net/'
path: '/databricks-data/bronze'
subpath: '/'
}
}
}
}
Why shortcuts are huge here: - Petabyte-scale Delta tables are not copied - Source ADLS / S3 retains full history - Both Databricks and Fabric read the same Delta files in place - Zero downtime, true bidirectional during coexistence
Option B — Rewrite for V-Order¶
If your top BI tables benefit from V-Order (Power BI Direct Lake), rewrite them through Fabric Spark with delta.parquet.vorder.default=true:
# In a Fabric notebook
from delta.tables import DeltaTable
source_path = "abfss://databricks-data@adlsgendb01.dfs.core.windows.net/gold/fact_slot_revenue"
target_table = "lh_gold.fact_slot_revenue"
(spark.read.format("delta").load(source_path)
.write
.format("delta")
.mode("overwrite")
.option("delta.parquet.vorder.default", "true")
.saveAsTable(target_table))
# Then OPTIMIZE for full V-Order benefit
spark.sql(f"OPTIMIZE {target_table} VORDER")
OPTIMIZE / Z-ORDER Differences¶
| Databricks | Fabric |
|---|---|
OPTIMIZE table ZORDER BY (col) | OPTIMIZE table ZORDER BY (col) (supported) |
OPTIMIZE table (file compaction only) | OPTIMIZE table VORDER (V-Order + compaction) |
VACUUM table RETAIN N HOURS | VACUUM table RETAIN N HOURS (same syntax) |
Liquid clustering (CLUSTER BY) | Liquid clustering supported in Fabric Runtime 1.3+ |
| Photon-only functions | NEE-supported subset; check assessment output |
✅ Verification:
# In Fabric notebook df = spark.read.format("delta").load("Tables/databricks-bronze/fact_slot_revenue") print(f"Row count: {df.count()}") # Compare against Databricks notebook output # Row count must match exactly💡 Tip: Don't blindly rewrite every table. Shortcut first; only rewrite the top 10-20 BI hot paths where V-Order pays for itself in CU savings.
⚠️ Gotcha: Databricks-managed Delta tables (no explicit
LOCATION) live in DBFS root, which is not accessible to Fabric shortcuts. You mustALTER TABLE ... SET LOCATIONto ADLS first, or copy via Spark.
Step 4 — Convert Databricks Notebooks¶
Databricks and Fabric notebooks are very similar. Most port with mechanical changes.
Conversion Checklist¶
| Databricks Pattern | Fabric Pattern |
|---|---|
dbutils.fs.ls(...) | mssparkutils.fs.ls(...) |
dbutils.fs.cp(...) | mssparkutils.fs.cp(...) |
dbutils.secrets.get(scope=..., key=...) | mssparkutils.credentials.getSecret(akv_uri, secret_name) |
dbutils.notebook.run(...) | mssparkutils.notebook.run(...) (same API) |
dbutils.widgets.text(...) | mssparkutils.notebook.exit(...) + Pipeline parameters |
%pip install pkg | %pip install pkg (works) — prefer Environment file |
%conda install pkg | Environment file conda_dependencies block |
%sql magic | %%sql magic (note double percent) |
%md magic | %%md magic (double percent) |
%run ./other_notebook | %run ./other_notebook (works) |
display(df) | display(df) (works in Fabric) |
spark.conf.set("spark.databricks.delta...") | Map to spark.conf.set("spark.microsoft.delta...") where applicable |
| Cluster init script | Environment custom library or pip requirements |
/Workspace/Repos/... paths | Fabric Git Integration paths |
/dbfs/... paths | OneLake Files/ paths |
Code-Level Shim¶
Add this small shim at the top of each migrated notebook for backwards-compatible code that uses dbutils:
# Fabric notebook compatibility shim
try:
dbutils # Already defined in Databricks
except NameError:
from notebookutils import mssparkutils
# Lightweight dbutils proxy
class _DBUtilsShim:
fs = mssparkutils.fs
notebook = mssparkutils.notebook
secrets = mssparkutils.credentials # API differences — see migration script
dbutils = _DBUtilsShim()
✅ Verification: Run the migrated notebook end-to-end. Output must match Databricks output (row counts exact; aggregations within float tolerance).
💡 Tip: Run the conversion script in batch mode. It rewrites known patterns automatically and produces a list of manual-review hotspots:
python 02_notebook_migration.py \ --source-export ./databricks-export/Workspace/ \ --target-dir ./fabric-notebooks/ \ --apply-shim⚠️ Gotcha:
dbutils.widgetsand Fabric notebook parameters are not 1:1. Fabric uses notebook parameter cells (tagged withparameters) and pipeline activity parameters. The conversion script flags every widget and recommends a parameter-cell rewrite.
Step 5 — Migrate Databricks Workflows → Fabric Data Pipelines¶
Workflows JSON definitions translate to Fabric Pipeline JSON via the conversion script. Most task types have direct Fabric activity equivalents.
python 03_workflow_migration.py \
--source-jobs-folder ./databricks-export/Jobs/ \
--target-workspace "$FABRIC_WS_ID" \
--output-dir ./fabric-pipelines/
Task / Activity Compatibility¶
| Databricks Task Type | Fabric Activity | Notes |
|---|---|---|
notebook_task | Notebook | Re-point notebook reference |
python_wheel_task | Spark Job Definition | Repackage wheel |
spark_jar_task | Spark Job Definition | Repackage JAR |
spark_submit_task | Spark Job Definition | Translate args |
sql_task (DBSQL query) | Script (T-SQL) or Lookup | Re-point to Fabric Warehouse / SQL endpoint |
pipeline_task (DLT) | Materialized Lake Views or chained notebooks | Manual rewrite (Step 6) |
dbt_task | Notebook with dbt-fabric | See dbt integration |
run_job_task | Invoke Pipeline | Direct map |
condition_task (If/Else) | If Condition | Direct map |
for_each_task | ForEach | Direct map |
| Job parameters | Pipeline parameters / Variable Library | Promote to Variable Library where reused |
| Job triggers (cron) | Pipeline schedule trigger | Direct map |
| Job triggers (file arrival) | Storage event trigger / Eventstream | Re-author |
| Job clusters | Fabric Spark capacity (autoscale) | No explicit cluster — Fabric manages |
⚠️ Gotcha: Continuous job triggers in Databricks (always-on streaming jobs) become Fabric streaming jobs running against Eventstream — not Pipelines. The conversion script flags these and points to Tutorial 04 (Real-Time Analytics) for re-authoring.
💡 Tip: Promote any parameter referenced by ≥ 3 jobs to the Variable Library. Cleanup opportunity during migration.
Step 6 — Migrate Delta Live Tables → Materialized Lake Views¶
DLT pipelines are declarative. Pure-SQL DLT maps cleanly to Materialized Lake Views (MLV — Wave 9 feature). Python DLT pipelines need rewrite.
SQL DLT → Materialized Lake View¶
-- Databricks DLT (Python or SQL):
CREATE OR REFRESH STREAMING TABLE bronze_slot_telemetry
AS SELECT * FROM cloud_files("/mnt/raw/slots/", "json");
CREATE OR REFRESH LIVE TABLE silver_slot_telemetry
COMMENT "Cleansed slot telemetry"
AS SELECT
machine_id,
CAST(event_time AS TIMESTAMP) AS event_time,
coin_in,
coin_out
FROM LIVE.bronze_slot_telemetry
WHERE coin_in IS NOT NULL;
-- Fabric Materialized Lake View:
CREATE MATERIALIZED LAKE VIEW silver_slot_telemetry
AS SELECT
machine_id,
CAST(event_time AS TIMESTAMP) AS event_time,
coin_in,
coin_out
FROM bronze_slot_telemetry
WHERE coin_in IS NOT NULL;
Python DLT → Scheduled Notebook¶
If the DLT pipeline uses @dlt.table decorators with custom Python logic, rewrite as a scheduled Fabric notebook with explicit MERGE INTO:
# Databricks DLT:
# import dlt
# @dlt.table
# @dlt.expect_or_drop("valid_amount", "coin_in > 0")
# def silver_slot_telemetry():
# return dlt.read_stream("bronze_slot_telemetry")...
# Fabric scheduled notebook with quality enforcement:
df = (spark.readStream.format("delta")
.table("lh_bronze.bronze_slot_telemetry")
.filter("coin_in > 0")) # equivalent to expect_or_drop
(df.writeStream
.format("delta")
.outputMode("append")
.option("checkpointLocation", "Files/checkpoints/silver_slot_telemetry")
.trigger(availableNow=True)
.toTable("lh_silver.silver_slot_telemetry"))
✅ Verification: Compare last 7 days of output between DLT pipeline and the migrated MLV / notebook. Row counts and aggregations must match.
⚠️ Gotcha: DLT Expectations with
expect_or_failaction have no direct MLV equivalent. Migrate as Great Expectations checks in the upstream notebook, or as aCHECKconstraint when the table is rewritten.
Step 7 — Migrate MLflow Models¶
Databricks MLflow → Fabric MLflow is a registry export-import.
# Step 7a: Export from Databricks
pip install mlflow-export-import
mlflow-export-import export-models \
--output-dir ./mlflow-export/ \
--models "fraud_detection,player_ltv,slot_anomaly" \
--stages "Production,Staging" \
--notebook-formats SOURCE \
--databricks-host "$DATABRICKS_HOST" \
--databricks-token "$DATABRICKS_TOKEN"
# Step 7b: Import to Fabric MLflow
mlflow-export-import import-models \
--input-dir ./mlflow-export/ \
--tracking-uri "$FABRIC_MLFLOW_URI" \
--experiment-name-prefix "migrated_"
| Item | Databricks | Fabric | Migration |
|---|---|---|---|
| Tracking server | Managed MLflow | Fabric MLflow (per workspace) | Re-point MLFLOW_TRACKING_URI |
| Registered models | UC models schema | Fabric Model Registry | mlflow-export-import |
| Model versions | Stages (Staging, Production, Archived) | Stages (preserved) | Stages map directly |
| Run artifacts | DBFS / UC Volumes | OneLake Files/mlflow-artifacts/ | Bulk copy during export |
| Model Serving endpoints | Databricks Model Serving | Fabric Real-Time Endpoints | Re-deploy registered model |
| Feature Store | UC Feature Store | Fabric Lakehouse + Semantic Link | Republish features |
ML Model Parity Validation¶
Score the same hold-out dataset on both platforms; predictions must match within tolerance.
# Validate model parity
import mlflow
src_model = mlflow.pyfunc.load_model("models:/fraud_detection/Production",
tracking_uri=DATABRICKS_TRACKING_URI)
tgt_model = mlflow.pyfunc.load_model("models:/migrated_fraud_detection/Production",
tracking_uri=FABRIC_TRACKING_URI)
src_pred = src_model.predict(holdout_df)
tgt_pred = tgt_model.predict(holdout_df)
# Classification: exact match expected
assert (src_pred == tgt_pred).all(), "Prediction mismatch!"
# Regression: within 1e-6 tolerance
# np.allclose(src_pred, tgt_pred, atol=1e-6)
⚠️ Gotcha: Models trained on Photon-specific UDFs may produce slightly different predictions in Fabric Spark due to floating-point operation ordering. Tolerance ≤ 1e-6 is normal; > 1e-3 is a red flag — investigate.
Step 8 — Migrate DBSQL → Fabric Warehouse / SQL Endpoint¶
Two paths depending on workload:
| DBSQL Workload | Recommended Target | Why |
|---|---|---|
| Read-only BI queries on Delta tables | Fabric SQL endpoint over Lakehouse | Zero ETL — just query the same Delta |
| Heavy stored procedures, multi-statement transactions | Fabric Warehouse | Full T-SQL DDL + DML |
| Power BI Direct Query | Direct Lake (preferred) | Read-from-OneLake, no refresh |
| Tabular semantic models | Fabric semantic model | Re-author or use Power BI desktop migrate |
DDL conversion via the helper:
python 04_dbsql_ddl_conversion.py \
--source-host "$DBSQL_HOST" \
--target-warehouse "wh-analytics-prod" \
--target-workspace "$FABRIC_WS_ID" \
--output-ddl ./output-ddl/
| DBSQL Type | Fabric Warehouse Type | Notes |
|---|---|---|
BIGINT | BIGINT | Direct |
INT | INT | Direct |
DECIMAL(p,s) | DECIMAL(p,s) | Direct |
STRING | VARCHAR(8000) (sized) | Choose explicit length; VARCHAR(MAX) for unbounded |
BOOLEAN | BIT | Direct |
TIMESTAMP | DATETIME2(6) | Recommended |
DATE | DATE | Direct |
BINARY | VARBINARY(MAX) | Direct |
ARRAY<T> | NOT SUPPORTED in Warehouse | Move to Lakehouse + Spark; or normalize |
MAP<K,V> | NOT SUPPORTED | Convert to JSON VARCHAR(MAX) |
STRUCT<...> | NOT SUPPORTED | Convert to JSON VARCHAR(MAX) |
💡 Tip: Most BI dashboards backed by DBSQL don't need a Warehouse — point them at the Lakehouse SQL endpoint instead. Same Delta tables, no ETL.
Step 9 — Iceberg Interoperability (UniForm Tables)¶
If your Databricks tables use Delta UniForm (write Delta, expose as Iceberg), Fabric reads them directly via OneLake Iceberg shortcuts. No conversion required.
# Create an Iceberg shortcut
az fabric onelake-shortcut create \
--workspace-id "$WS_ID" \
--lakehouse-id "$LH_ID" \
--name "databricks-iceberg" \
--type "iceberg" \
--path "Tables/databricks_iceberg" \
--target-uri "abfss://uniform@adlsdb01.dfs.core.windows.net/iceberg-tables/"
✅ Verification:
df = spark.read.format("iceberg").load("Tables/databricks-iceberg/fact_table") print(df.count()) # Match Databricks Iceberg query💡 Tip: For read-only consumption from Snowflake, BigQuery, or Trino in parallel, UniForm + Iceberg shortcut means one copy of data, three engines.
Step 10 — Photon vs Fabric Spark — Performance Considerations¶
| Concern | Databricks Photon | Fabric Spark |
|---|---|---|
| Vectorized engine | Photon (C++ runtime) | Native Execution Engine (NEE) |
| Format optimization | Delta + Photon-aware caching | Delta + V-Order + NEE caching |
| BI workload latency | Sub-second on hot cache | Sub-second on Direct Lake (no cache needed) |
| Function coverage | Wide; some Photon-only | Growing; check NEE coverage matrix |
| Separate billing | Photon cluster surcharge | None — included in F-SKU |
⚠️ Gotcha: Some Photon-accelerated SQL functions (e.g.,
regexp_extract_all) may run on JVM Spark in Fabric (not NEE) — slower for big datasets. Assessment script flags Photon-heavy workloads. For these, plan an extra perf-tuning pass post-migration.
Step 11 — Unity Catalog → OneLake Catalog (Governance Migration)¶
Unity Catalog organizes objects as catalog.schema.table with grants and lineage. OneLake Catalog provides equivalent capability inside Fabric, with Microsoft Purview for org-wide governance.
| Unity Catalog Concept | Fabric Equivalent | Migration Approach |
|---|---|---|
catalog | Fabric domain or workspace | Map 1:1 (domain for orgs; workspace for teams) |
schema | Lakehouse / Warehouse | Direct map |
table | Delta table | Direct map |
view | View (T-SQL or Spark) | DDL conversion |
volume | Lakehouse Files section | Copy non-tabular files |
function | Spark UDF / T-SQL function | Re-register |
model | MLflow model | See Step 7 |
GRANT SELECT ON TABLE | OneLake Security row/column policies | Re-author |
GRANT MODIFY | Workspace role (Contributor) | Map principals to roles |
| Lineage | Microsoft Purview lineage | Auto-captured once Purview connected |
| Tags / labels | Sensitivity labels (M365) | Re-apply via Purview |
python 05_unity_catalog_export.py \
--metastore-id "$UC_METASTORE_ID" \
--output-dir ./uc-export/
# Outputs:
# uc-export/catalogs.json
# uc-export/schemas.json
# uc-export/tables.json
# uc-export/grants.json ← critical for security migration
# uc-export/lineage.json ← reference for Purview re-bootstrap
💡 Tip: Connect Microsoft Purview to the Fabric tenant before migration. Lineage captures from day one and you don't have to backfill.
Step 12 — Size Fabric Capacity from DBU Consumption¶
python 01_assessment.py --command capacity-recommendation \
--inventory ./assessment-output/ \
--dbu-30d ./assessment-output/dbu_consumption_30d.csv
Output: F-SKU recommendation with rationale.
DBU → CU Rough Sizing (Starting Point — Validate Empirically)¶
| Databricks Workload (avg active DBU/hr) | Approximate Fabric F-SKU | Notes |
|---|---|---|
| ~1 DBU/hr (small dev) | F8 | One small team |
| ~4 DBU/hr | F16 | Single project, light Power BI |
| ~10 DBU/hr | F32 | Mid-size team, Bronze/Silver/Gold |
| ~25 DBU/hr | F64 | POC/production baseline |
| ~50 DBU/hr | F128 | Multi-team, RTI workloads |
| ~100 DBU/hr | F256 | Enterprise scale |
| ~200 DBU/hr | F512 | Multi-LOB |
| ~400 DBU/hr | F1024 | Federation / very large estate |
| ~800 DBU/hr | F2048 | Top-tier enterprise |
⚠️ Gotcha: This table is a starting point only. Photon-DBU and serverless-SQL-DBU consume different ratios in Databricks; in Fabric all draw from the same CU pool plus Power BI users. Always measure during coexistence and adjust before cutover.
💡 Tip: Use capacity planning best practices for the full sizing methodology and CU smoothing patterns.
Step 13 — Validate Migration¶
Validation runs four categories of checks: row counts, content hashes, query parity, and ML model parity.
python 06_validation_suite.py \
--source-databricks-host "$DATABRICKS_HOST" \
--source-databricks-token "$DATABRICKS_TOKEN" \
--target-workspace "$FABRIC_WS_ID" \
--tables-yaml ./tables-to-validate.yaml \
--models-yaml ./models-to-validate.yaml \
--output-report ./validation-report.html
Four-Tier Validation¶
Tier 1: ROW COUNT ──▶ count(*) match per table
Tier 2: CONTENT HASH ──▶ sha2(concat_ws('|', columns)) match per row
Tier 3: QUERY PARITY ──▶ representative aggregations match within tolerance
Tier 4: MODEL PARITY ──▶ predictions on hold-out match within tolerance
| Check | Pass Threshold | Action on Fail |
|---|---|---|
| Row count | exact match | Re-run incremental load, check filter predicates |
| Content hash | 100% match | Investigate type conversion / NULL handling / collation |
| Aggregation parity | within 0.001% | Type precision review |
| Query latency | within 2× source p95 | Tune V-Order, partitioning, NEE-eligible rewrites |
| Model prediction parity (classification) | exact match | Investigate Photon-only function in featurization |
| Model prediction parity (regression) | within 1e-6 | Check float ordering; usually safe |
✅ Verification: All four tiers pass for ≥ 95% of in-scope assets before declaring migration complete.
Step 14 — Coexistence Period (2-4 Weeks)¶
During coexistence: - Both Databricks and Fabric read the same OneLake-shortcut Delta tables - Writes go to one truth — pick Fabric early; replicate read-only mirror to Databricks if needed - BI repoints to Fabric (Direct Lake); Databricks SQL warehouses suspended for cost savings - Daily validation report confirms parity
💡 Tip: Pause expensive Databricks job clusters and DBSQL warehouses during coexistence to halve coexistence-period cost. They're not running production traffic — Fabric is.
Step 15 — Cutover¶
When validation has been green for 7 consecutive days:
- Freeze Databricks — pause Workflows, set Unity Catalog tables to read-only
- Final delta sync — capture last incremental changes
- Switch BI — repoint Power BI workspaces to Fabric semantic models / Direct Lake
- Switch consumers — apps, APIs, dbt projects to Fabric endpoints
- Switch ML serving — re-route inference traffic to Fabric Real-Time Endpoints
- Monitor for 48 hours — close on-call bridge after no incidents
- Decommission (Step 16)
Step 16 — Decommission Databricks¶
# After 30-day grace period with all validation green:
# 1. Disable all Workflows
databricks jobs list --output JSON | jq -r '.jobs[].job_id' \
| xargs -I {} databricks jobs delete --job-id {}
# 2. Stop and pin DBSQL warehouses (no auto-restart)
databricks warehouses list --output JSON | jq -r '.warehouses[].id' \
| xargs -I {} databricks warehouses stop --id {}
# 3. Backup workspace metadata for audit
python 07_backup_databricks_metadata.py \
--databricks-host "$DATABRICKS_HOST" \
--databricks-token "$DATABRICKS_TOKEN" \
--output ./databricks-backup/
# 4. After 90 more days, delete clusters + DBSQL warehouses
# 5. After 180 days total, delete workspace if no other workloads remain
⚠️ Gotcha: ADLS Gen2 / S3 storage referenced by OneLake shortcuts is NOT decommissioned — Fabric reads in place. Only the Databricks compute + control plane is removed. If you also want to retire the storage account, copy data to OneLake first via Step 3 Option B.
✅ Verification (Final Checklist)¶
- All in-scope Databricks Delta tables accessible via Fabric Lakehouse / Warehouse
- Four-tier validation (row count + hash + query + model) green for 7+ consecutive days
- OneLake shortcuts to ADLS Gen2 / S3 reading at performance parity
- All Databricks notebooks ported,
dbutilsreplaced withmssparkutils - All Databricks Workflows converted to Fabric Pipelines
- All DLT pipelines mapped to Materialized Lake Views or scheduled notebooks
- MLflow registry exported from Databricks and imported to Fabric MLflow
- Model serving endpoints redeployed as Fabric Real-Time Endpoints
- DBSQL workloads on Fabric Warehouse / SQL endpoint with Power BI repointed
- Unity Catalog grants re-authored as OneLake Security policies
- Microsoft Purview connected; lineage captured
- F-SKU sized correctly (CU usage in steady-state band — see SLO/SLI doc)
- Databricks Workflows disabled, DBSQL warehouses stopped
- Cutover postmortem published
- DBU spend → CU spend cost reduction realized
🧹 Cleanup¶
This tutorial creates assessment artifacts in ./assessment-output/, conversion artifacts in ./fabric-notebooks/ and ./fabric-pipelines/, and a metadata backup in ./databricks-backup/. Retain for audit; archive to long-term storage.
If your assessment ran against a real Databricks workspace, the read-only API calls leave no trace. No rollback needed.
🚦 Next Steps¶
- Tutorial 43 — Redshift → Fabric — for shops also retiring AWS Redshift
- Tutorial 44 — BigQuery → Fabric — for GCP exits
- Migration Patterns Best Practices — cross-source patterns
- Spark Runtime Migration — DBR → Fabric Spark Runtime details
- OneLake Iceberg Interop — UniForm and cross-engine reads
- dbt Fabric Integration — for shops migrating dbt projects
- fabric-cicd Deployment — automate the deployment of migrated artifacts
🛠️ Troubleshooting¶
| Issue | Cause | Resolution |
|---|---|---|
| OneLake shortcut shows 0 rows from ADLS | Workspace Identity lacks Storage Blob Data Reader | Grant role on storage account or container |
dbutils is undefined in Fabric notebook | Databricks-specific module | Apply shim from Step 4 OR replace with mssparkutils |
%conda install fails in Fabric | %conda not supported | Use Environment file conda dependencies, or %pip install |
| Photon-only function fails on Fabric | Function not in NEE coverage | Rewrite using portable Spark SQL; flag in assessment |
MLflow import errors on models/<n>/MLmodel | Stage encoding differences across MLflow versions | Pin mlflow-export-import to same major MLflow version on both sides |
DLT expect_or_fail has no equivalent | DLT-specific quality control | Add Great Expectations checkpoint upstream; or CHECK constraint |
| Row counts off by ~1% post-migration | Time-zone drift on partitioned timestamps | Cast partition columns to TIMESTAMP_NTZ explicitly |
| Direct Lake fallback to DirectQuery | Schema not V-Ordered or unsupported types | Run OPTIMIZE table VORDER; check VARCHAR(MAX) usage |
| Cluster init script broke on migration | Init scripts not portable | Repackage as Fabric Environment custom library |
| F-SKU throttles after cutover | Concurrent BI users + Spark + ML draw same CU pool | Scale F-SKU; review throttling runbook |
📁 Key Files Referenced¶
| Step | File |
|---|---|
| 1, 2, 12 | tutorials/42-databricks-to-fabric/01_assessment.py (planned, batch 4b) |
| 3 | tutorials/42-databricks-to-fabric/01_delta_compatibility.py (planned, batch 4b) |
| 4 | tutorials/42-databricks-to-fabric/02_notebook_migration.py (planned, batch 4b) |
| 5 | tutorials/42-databricks-to-fabric/02_workflow_migration.md (planned, batch 4b) |
| 7 | tutorials/42-databricks-to-fabric/03_mlflow_migration.py (planned, batch 4b) |
| 8 | tutorials/42-databricks-to-fabric/04_dbsql_ddl_conversion.py (planned, batch 4b) |
| 11 | tutorials/42-databricks-to-fabric/05_unity_catalog_export.py (planned, batch 4b) |
| 13 | tutorials/42-databricks-to-fabric/06_validation_suite.py (planned, batch 4b) |
| 16 | tutorials/42-databricks-to-fabric/07_backup_databricks_metadata.py (planned, batch 4b) |
| All | data_generation/generators/migration/databricks_workload_inventory.py (planned) |
📚 References¶
Databricks Documentation¶
Microsoft Documentation¶
- Fabric migration guide (Databricks)
- OneLake shortcuts
- Fabric Spark runtime
- Fabric MLflow
- Materialized Lake Views
Related Tutorials & Docs¶
- Tutorial 41 — Synapse → Fabric (Wave 4 anchor)
- Tutorial 24 — Snowflake → Fabric
- Tutorial 10 — Teradata → Fabric
- Migration Patterns
- Spark Runtime Migration
- OneLake Iceberg Interop
- dbt Fabric Integration
- Capacity Planning
- fabric-cicd Deployment