Tutorial — Convert a Databricks PySpark Notebook to Fabric¶
Status: Authored 2026-04-30
Audience: Data engineers performing their first Databricks-to-Fabric notebook migration.
Scope: Step-by-step walkthrough converting a real-world PySpark notebook with dbutils, Delta tables, library dependencies, and a downstream Power BI Direct Lake report.
Prerequisites¶
Before starting this tutorial, you need:
- A Fabric workspace (F64 or higher for Power BI Premium features)
- A Lakehouse created in the workspace
- Access to the source ADLS Gen2 storage account (for OneLake shortcuts)
- The original Databricks notebook exported as `.py` or `.ipynb`
- Familiarity with PySpark and the Fabric notebook interface
Scenario¶
You have a Databricks PySpark notebook that:
- Reads raw customer CSV files from ADLS (`/mnt/raw/customers/`)
- Cleans and transforms the data (deduplicate, standardize fields)
- Writes a Silver-tier Delta table (`production.silver.customers_clean`)
- A Power BI Import semantic model refreshes from this table via DBSQL
We will convert this to a Fabric notebook that:
- Reads the same data via a OneLake shortcut
- Runs the same transformations
- Writes to a Fabric Lakehouse table
- Powers a Power BI Direct Lake report (no scheduled refresh needed)
Step 1: Export the Databricks notebook¶
In Databricks:
- Open the notebook
- Click File > Export > Source File (.py) or IPython Notebook (.ipynb)
- Save to your local machine
Original Databricks notebook (`customer_etl.py`):
# Databricks notebook source
# MAGIC %md
# MAGIC # Customer ETL Pipeline
# MAGIC Reads raw CSV, cleans, writes Silver Delta table.
# COMMAND ----------
# Install dependencies
# MAGIC %pip install phonenumbers==8.13.0 email-validator==2.0.0
# COMMAND ----------
import phonenumbers
from email_validator import validate_email
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
# COMMAND ----------
# MAGIC %md
# MAGIC ## Configuration
# COMMAND ----------
dbutils.widgets.text("input_path", "/mnt/raw/customers/", "Input Path")
dbutils.widgets.text("output_table", "production.silver.customers_clean", "Output Table")
dbutils.widgets.dropdown("mode", "overwrite", ["overwrite", "append"], "Write Mode")
input_path = dbutils.widgets.get("input_path")
output_table = dbutils.widgets.get("output_table")
write_mode = dbutils.widgets.get("mode")
# COMMAND ----------
# MAGIC %md
# MAGIC ## Read raw data
# COMMAND ----------
# Read CSVs from mounted ADLS
df_raw = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(input_path)
)
print(f"Raw records: {df_raw.count()}")
# COMMAND ----------
# MAGIC %md
# MAGIC ## Clean and transform
# COMMAND ----------
# UDF: Standardize phone numbers
@F.udf(StringType())
def standardize_phone(phone):
    if phone is None:
        return None
    try:
        parsed = phonenumbers.parse(phone, "US")
        return phonenumbers.format_number(parsed, phonenumbers.PhoneNumberFormat.E164)
    except Exception:
        return None

# UDF: Validate email
@F.udf(StringType())
def validate_email_udf(email):
    if email is None:
        return None
    try:
        validated = validate_email(email, check_deliverability=False)
        return validated.normalized
    except Exception:
        return None
# Apply transformations
df_clean = (
    df_raw
    .dropDuplicates(["customer_id"])
    .withColumn("email_clean", validate_email_udf(F.col("email")))
    .withColumn("phone_clean", standardize_phone(F.col("phone")))
    .withColumn("state_upper", F.upper(F.trim(F.col("state"))))
    .withColumn("created_date", F.to_date(F.col("created_at")))
    .withColumn("etl_timestamp", F.current_timestamp())
    .filter(F.col("customer_id").isNotNull())
    .select(
        "customer_id",
        "first_name",
        "last_name",
        "email_clean",
        "phone_clean",
        "state_upper",
        "created_date",
        "etl_timestamp",
    )
)
print(f"Clean records: {df_clean.count()}")
# COMMAND ----------
# MAGIC %md
# MAGIC ## Write Silver Delta table
# COMMAND ----------
(
    df_clean.write
    .format("delta")
    .mode(write_mode)
    .option("overwriteSchema", "true")
    .saveAsTable(output_table)
)
# COMMAND ----------
# MAGIC %sql
# MAGIC -- Verify
# MAGIC SELECT COUNT(*) as total_records FROM production.silver.customers_clean
# COMMAND ----------
# MAGIC %md
# MAGIC ## Optimize table
# COMMAND ----------
spark.sql(f"OPTIMIZE {output_table} ZORDER BY (state_upper, created_date)")
# COMMAND ----------
dbutils.notebook.exit(f"SUCCESS: {df_clean.count()} records written to {output_table}")
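Before converting by hand, it can help to inventory the Databricks-specific constructs in the exported source. A minimal sketch in plain Python (the patterns below are illustrative, not an exhaustive list):

```python
import re

# Patterns that commonly need attention in a Databricks-to-Fabric move.
# Illustrative only -- extend this table for your own codebase.
DATABRICKS_PATTERNS = {
    "dbutils call": re.compile(r"\bdbutils\.\w+"),
    "%pip magic": re.compile(r"#\s*MAGIC\s+%pip"),
    "DBFS mount path": re.compile(r"/mnt/[\w/]+"),
}

def inventory(source: str) -> dict:
    """Count occurrences of each Databricks-specific pattern in a .py export."""
    return {name: len(pat.findall(source)) for name, pat in DATABRICKS_PATTERNS.items()}

# On the real export you would run: inventory(open("customer_etl.py").read())
sample = (
    '# MAGIC %pip install phonenumbers==8.13.0\n'
    'input_path = dbutils.widgets.get("input_path")\n'
    'df = spark.read.csv("/mnt/raw/customers/")\n'
    'dbutils.notebook.exit("done")\n'
)
print(inventory(sample))
# -> {'dbutils call': 2, '%pip magic': 1, 'DBFS mount path': 1}
```

Each non-zero count corresponds to a conversion step later in this tutorial (environment, parameter cell, shortcut path, exit call).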
Step 2: Create OneLake shortcut to source data¶
Before converting the notebook, make the source CSV files accessible in Fabric.
- Open your Fabric workspace
- Open (or create) a Lakehouse named `bronze_lakehouse`
- In the Lakehouse explorer, right-click Files > New shortcut
- Select Azure Data Lake Storage Gen2
- Enter the connection details:
  - URL: `https://<storageaccount>.dfs.core.windows.net`
  - Sub path: `/<container>/raw/customers`
  - Shortcut name: `raw_customers`
- Authenticate with your Entra ID credentials or a service principal
The CSV files now appear at Files/raw_customers/ in the Lakehouse. No data was copied.
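The `/mnt/...` paths in the Databricks code map directly onto the shortcut's `Files/...` paths. If your migration has many mount points, a small translation table keeps the rewrite mechanical. A sketch (the single mapping entry is this tutorial's; yours will differ):

```python
# Mount-to-shortcut mapping for this migration; add one entry per mount point.
MOUNT_TO_SHORTCUT = {
    "/mnt/raw/customers/": "Files/raw_customers/",
}

def to_fabric_path(databricks_path: str) -> str:
    """Rewrite a DBFS mount path to its OneLake shortcut equivalent."""
    for mount, shortcut in MOUNT_TO_SHORTCUT.items():
        if databricks_path.startswith(mount):
            return shortcut + databricks_path[len(mount):]
    raise ValueError(f"No shortcut mapping for {databricks_path}")

print(to_fabric_path("/mnt/raw/customers/2026/04/part-001.csv"))
# -> Files/raw_customers/2026/04/part-001.csv
```

Raising on an unmapped path is deliberate: a silent pass-through would let a forgotten mount slip into the converted notebook.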
Step 3: Create a Fabric environment for libraries¶
- In the workspace, click New > Environment
- Name it
customer-etl-env - Under Public libraries, add:
phonenumbers==8.13.0email-validator==2.0.0
- Click Publish and wait for the environment to build (~2-5 minutes)
Step 4: Create the Fabric notebook¶
- In the workspace, click New > Notebook
- Name it
customer_etl - Attach the
customer-etl-envenvironment (top toolbar > Environment dropdown) - Attach to the
bronze_lakehouseas the default Lakehouse
Step 5: Convert the notebook code¶
Create the following cells in the Fabric notebook:
Cell 1 (Markdown):
# Customer ETL Pipeline
Reads raw CSV from OneLake shortcut, cleans, writes Silver Lakehouse table.
Converted from Databricks notebook `customer_etl.py`.
Cell 2 (PySpark) -- Imports:
# Libraries are installed via the Fabric environment (customer-etl-env)
# No need for %pip install in the notebook
import phonenumbers
from email_validator import validate_email
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
Cell 3 (PySpark) -- Configuration:
# Parameter cell -- mark this cell via "Toggle parameter cell" in Fabric.
# A Data Pipeline notebook activity (or mssparkutils.notebook.run arguments)
# overrides these variables by name at run time.
# The values below are defaults for interactive development.
input_path = "Files/raw_customers/"
output_table = "customers_clean"
write_mode = "overwrite"

print(f"Input: {input_path}")
print(f"Output table: {output_table}")
print(f"Write mode: {write_mode}")
Cell 4 (PySpark) -- Read raw data:
# Read CSVs from OneLake shortcut (no /mnt/ needed)
df_raw = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(input_path)
)

print(f"Raw records: {df_raw.count()}")
df_raw.printSchema()
Cell 5 (PySpark) -- Transform:
# UDF: Standardize phone numbers (same as Databricks)
@F.udf(StringType())
def standardize_phone(phone):
    if phone is None:
        return None
    try:
        parsed = phonenumbers.parse(phone, "US")
        return phonenumbers.format_number(parsed, phonenumbers.PhoneNumberFormat.E164)
    except Exception:
        return None

# UDF: Validate email (same as Databricks)
@F.udf(StringType())
def validate_email_udf(email):
    if email is None:
        return None
    try:
        validated = validate_email(email, check_deliverability=False)
        return validated.normalized
    except Exception:
        return None

# Apply transformations (same logic)
df_clean = (
    df_raw
    .dropDuplicates(["customer_id"])
    .withColumn("email_clean", validate_email_udf(F.col("email")))
    .withColumn("phone_clean", standardize_phone(F.col("phone")))
    .withColumn("state_upper", F.upper(F.trim(F.col("state"))))
    .withColumn("created_date", F.to_date(F.col("created_at")))
    .withColumn("etl_timestamp", F.current_timestamp())
    .filter(F.col("customer_id").isNotNull())
    .select(
        "customer_id",
        "first_name",
        "last_name",
        "email_clean",
        "phone_clean",
        "state_upper",
        "created_date",
        "etl_timestamp",
    )
)
print(f"Clean records: {df_clean.count()}")
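If you want to sanity-check the cleaning rules before a full Spark run, the non-UDF logic (dedup, null filter, state normalization) can be exercised with plain Python on a handful of rows. A sketch; note that Spark's `dropDuplicates` keeps an arbitrary row per key, whereas this keeps the first, and the phone/email UDFs are out of scope here because they need their libraries:

```python
def clean_rows(rows):
    """Mirror the notebook's dedup, null-filter, and state rules on plain dicts."""
    seen = set()
    out = []
    for row in rows:
        cid = row.get("customer_id")
        # filter(customer_id IS NOT NULL) + dropDuplicates(["customer_id"])
        if cid is None or cid in seen:
            continue
        seen.add(cid)
        state = row.get("state") or ""
        # F.upper(F.trim(col("state")))
        out.append({**row, "state_upper": state.strip().upper()})
    return out

rows = [
    {"customer_id": 1, "state": " wa "},
    {"customer_id": 1, "state": "or"},      # duplicate id, dropped
    {"customer_id": None, "state": "ca"},   # null id, dropped
]
result = clean_rows(rows)
print(len(result), result[0]["state_upper"])  # -> 1 WA
```

This kind of miniature fixture also doubles as input for the Step 8 validation, since expected counts are known in advance.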
Cell 6 (PySpark) -- Write Silver table:
# Write to Lakehouse table (replaces Unity Catalog table path)
(
    df_clean.write
    .format("delta")
    .mode(write_mode)
    .option("overwriteSchema", "true")
    .saveAsTable(output_table)
)
print(f"Table '{output_table}' written successfully.")
Cell 7 (Spark SQL) -- Verify:
-- Verify row count (use cell type: Spark SQL)
SELECT
    COUNT(*) AS total_records,
    COUNT(DISTINCT customer_id) AS unique_customers,
    MIN(created_date) AS earliest,
    MAX(created_date) AS latest
FROM bronze_lakehouse.customers_clean
Cell 8 (PySpark) -- Optimization note:
# Fabric auto-applies V-Order optimization on write.
# Manual OPTIMIZE + ZORDER is available but usually unnecessary.
# Uncomment if you need explicit optimization:
# spark.sql(f"OPTIMIZE {output_table} ZORDER BY (state_upper, created_date)")
print("V-Order auto-optimization applied during write.")
Cell 9 (PySpark) -- Exit:
record_count = spark.table(output_table).count()
mssparkutils.notebook.exit(f"SUCCESS: {record_count} records in {output_table}")
Step 6: Schedule with a Data Pipeline¶
- In the workspace, click New > Data Pipeline
- Name it
customer_etl_pipeline - Add a Notebook activity:
- Notebook:
customer_etl - Parameters:
input_path:Files/raw_customers/output_table:customers_cleanmode:overwrite
- Notebook:
- Add a Schedule trigger:
- Recurrence: Daily at 06:00 UTC
- Or: Storage event trigger on
Files/raw_customers/for event-driven runs
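The pipeline's notebook-activity parameters simply take precedence over the notebook's own defaults, matched by name. The effective configuration on a scheduled run can be pictured as a dict merge (a sketch; Fabric performs this injection for you, the values are this tutorial's):

```python
# Defaults from the notebook's configuration cell (interactive values).
defaults = {
    "input_path": "Files/raw_customers/",
    "output_table": "customers_clean",
    "write_mode": "overwrite",
}

# Values the Data Pipeline's notebook activity supplies on a scheduled run.
pipeline_params = {"write_mode": "append"}

# Pipeline parameters win over notebook defaults, name by name:
effective = {**defaults, **pipeline_params}
print(effective["write_mode"])  # -> append
```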
Step 7: Create a Direct Lake Power BI report¶
- In the Lakehouse, switch to the SQL analytics endpoint (view selector, top-right)
- In the SQL endpoint view, click New semantic model
- Select the `customers_clean` table
- Name the semantic model `Customers - Direct Lake`
- The semantic model is created in Direct Lake mode -- no import, no scheduled refresh
- Open the semantic model and click New report
- Build your report:
  - Card visual: Total Customers (`COUNT(customer_id)`)
  - Bar chart: Customers by State (`state_upper`)
  - Table: Customer list with filters
- Save the report
Result: The Power BI report reads directly from the Delta files in OneLake. When the notebook runs and updates the table, the report shows fresh data automatically. No scheduled refresh, no DBSQL warehouse running.
Step 8: Validate migration¶
Run both the Databricks notebook and the Fabric notebook against the same source data and compare:
| Validation check | Databricks result | Fabric result | Match? |
|---|---|---|---|
| Raw record count | __ | __ | [ ] |
| Clean record count | __ | __ | [ ] |
| Unique customer IDs | __ | __ | [ ] |
| Records per state (top 5) | __ | __ | [ ] |
| Email validation failures | __ | __ | [ ] |
| Phone standardization failures | __ | __ | [ ] |
| Schema (column names + types) | __ | __ | [ ] |
If all checks pass, the migration is validated.
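Once you have collected the numbers from both runs, filling in the Match? column can be scripted. A minimal sketch (the check names and values below are placeholders, not real results):

```python
def compare_results(databricks: dict, fabric: dict) -> dict:
    """Return a Match? flag for every validation check present in both result sets."""
    return {check: databricks[check] == fabric[check]
            for check in databricks.keys() & fabric.keys()}

# Placeholder numbers -- substitute the counts from your own runs.
databricks = {"raw_count": 10_000, "clean_count": 9_412, "unique_ids": 9_412}
fabric = {"raw_count": 10_000, "clean_count": 9_412, "unique_ids": 9_410}

matches = compare_results(databricks, fabric)
print(matches["raw_count"], matches["unique_ids"])  # -> True False
```

Any `False` flag (here, the unique-ID mismatch) warrants a row-level diff before proceeding to decommissioning.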
Step 9: Decommission Databricks notebook¶
After 2 weeks of parallel operation:
- Archive the Databricks notebook to Git (if not already)
- Disable the Databricks job schedule
- Remove the DBSQL endpoint (if only used for this Power BI model)
- Update documentation to reference the Fabric notebook
- Remove the old Power BI Import semantic model
Summary of changes¶
| Original (Databricks) | Converted (Fabric) | Change type |
|---|---|---|
| `%pip install` | Fabric environment | Library management |
| `dbutils.widgets.get()` | Parameter cell | Parameterization |
| `/mnt/raw/customers/` | `Files/raw_customers/` (shortcut) | Path |
| `production.silver.customers_clean` | `customers_clean` (Lakehouse table) | Table reference |
| `OPTIMIZE ... ZORDER BY` | V-Order auto-applied | Optimization |
| `%sql` magic | Spark SQL cell type | Cell type |
| `dbutils.notebook.exit()` | `mssparkutils.notebook.exit()` | Notebook API |
| Power BI Import + DBSQL | Direct Lake (no refresh) | BI model |
| Databricks Workflow job | Fabric Data Pipeline | Scheduling |
Total code changes: ~15 lines modified out of ~80 lines of PySpark. The transformation logic is identical.
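During the parallel-run window (Step 9), shared helper modules can stay portable across both platforms with a tiny compatibility shim for widget lookups. A sketch; the `WidgetShim` class is this tutorial's invention, not a Databricks or Fabric API:

```python
class WidgetShim:
    """Minimal stand-in for dbutils.widgets when running outside Databricks."""

    def __init__(self, params: dict):
        self._params = params

    def get(self, name: str) -> str:
        return self._params[name]

try:
    dbutils  # defined only by the Databricks runtime  # noqa: F821
    widgets = dbutils.widgets  # noqa: F821
except NameError:
    # Not on Databricks: back the same interface with a plain dict
    # (in Fabric, populate this from the notebook's parameter cell).
    widgets = WidgetShim({"input_path": "Files/raw_customers/",
                          "mode": "overwrite"})

print(widgets.get("mode"))  # -> overwrite
```

Shared transformation code can then call `widgets.get(...)` unchanged on either platform until Databricks is decommissioned.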
Related¶
- Notebook Migration -- complete reference for all notebook patterns
- Unity Catalog Migration -- table reference mapping
- Feature Mapping -- notebook and dev tools section
- Best Practices -- notebook conversion checklist
- Parent guide: 5-phase migration
Maintainers: csa-inabox core team
Source finding: CSA-0083 (HIGH, XL) -- approved via AQ-0010 ballot B6
Last updated: 2026-04-30