🔁 Tutorial 42 Reference: Databricks Workflows → Fabric Pipelines + Spark Job Definitions
Last Updated: 2026-04-27 | Phase: 14 (Wave 4) | Companion to Tutorial 42 — Databricks → Fabric Status: ✅ Final | Maintainer: Platform Team
Third-party references — publicly sourced, good-faith comparison
This page references non-Microsoft products and services. That information is drawn from each vendor's publicly available documentation and is offered for honest, good-faith comparison only. This is a personal project written from a Microsoft Fabric and Azure perspective; it does not claim expertise in, or authority over, any third-party product, and nothing here is an official statement by, or endorsed by, those vendors. Capabilities, pricing, and features change often — always verify against the vendor's current official documentation. Where a third-party offering is the stronger choice, we say so plainly.
📖 Overview
This is the deep-dive reference for Step 5 of Tutorial 42: converting Databricks Workflows / Jobs to Fabric Data Pipelines and Spark Job Definitions. The parent tutorial gives you the high-level activity-mapping table; this doc gives you the canonical translation matrix, cluster/trigger/parameter handling, three worked examples (simple notebook, multi-task DAG, and DLT → Materialized Lake View), DLT-specific guidance, bulk-export tooling, and migration anti-patterns.
Use this reference when:
- You need the full task-type → activity map (not just the common ones)
- You're translating cluster specs, init scripts, or library installs into Fabric Environments
- You're rewriting DLT pipelines as Materialized Lake Views or scheduled notebooks
- You're scripting bulk export of Workflow JSON from a large Databricks estate
- You want the anti-patterns so you don't lift-and-shift things that should be redesigned
🧭 Workflow Type Mapping
The first decision is which Fabric construct each Databricks workflow becomes. Most jobs map to a Fabric Pipeline, but a few map to a Spark Job Definition, an Eventstream, or a Materialized Lake View.
| Databricks Construct | Fabric Equivalent | Notes |
|---|---|---|
| Workflow Job (multi-task DAG) | Fabric Pipeline | Tasks → activities; dependencies → activity links |
| Single-task scheduled notebook | Fabric Pipeline with one Notebook activity, OR notebook with native schedule | Use Pipeline if it needs a Variable Library or chained activities; use native schedule for simple repeats |
| JAR / Wheel / spark-submit job | Spark Job Definition (SJD) | Headless Spark execution; not a Pipeline activity by itself — invoked from a Pipeline if orchestrated |
| Job cluster (ephemeral) | Fabric Spark capacity (auto-allocated within F-SKU) | No 1:1 cluster object — Fabric autoscales sessions out of CU pool |
| All-purpose cluster (interactive) | Fabric Spark session (notebook attached to Lakehouse) | For interactive dev only; not a deploy target |
| Job cluster pools | No direct equivalent | F-SKU CU pool replaces the warm-pool concept |
| DLT Pipeline (declarative) | Materialized Lake View (MLV) for SQL DLT, OR scheduled notebook chain for Python DLT | See DLT-Specific Migration below |
| Continuous (always-on) job | Fabric streaming notebook + Eventstream trigger | Not a Pipeline — re-author against Eventstream/Eventhouse |
| Webhook-triggered job | Pipeline invoked via REST API + Logic App / Power Automate | Storage event triggers also possible |
⚠️ Gotcha: Don't reflexively put every Databricks job inside a Pipeline. A heavy production Spark batch (gigabytes of input, a single executable script) is usually a better fit as a Spark Job Definition. Pipelines are for orchestration of activities; SJDs are for execution of code. See Spark Job Definitions.
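A first-pass triage of exported job definitions can be sketched in a few lines. This is a minimal sketch, assuming each job's `settings` block has the shape returned by the Jobs API; the function name `classify_job` and the category labels are hypothetical, and the result is only a suggestion for a human to confirm.

```python
from typing import Any

# Task keys that the mapping tables on this page send to a Spark Job Definition.
SJD_TASKS = {"spark_jar_task", "spark_python_task", "python_wheel_task", "spark_submit_task"}


def classify_job(settings: dict[str, Any]) -> str:
    """Suggest a Fabric construct for one Databricks job `settings` block."""
    tasks = settings.get("tasks", [])
    if "continuous" in settings:
        # Always-on job: re-author against Eventstream, not a Pipeline.
        return "Eventstream"
    if any("pipeline_task" in task for task in tasks):
        # DLT pipeline: manual rewrite as MLV or notebook chain.
        return "DLT rewrite"
    if len(tasks) == 1 and SJD_TASKS.intersection(tasks[0]):
        # Single headless JAR / wheel / script task.
        return "Spark Job Definition"
    return "Pipeline"


if __name__ == "__main__":
    example = {"tasks": [{"task_key": "main", "spark_jar_task": {}}]}
    print(classify_job(example))
```

Multi-task jobs that contain a JAR or wheel task still come back as "Pipeline" here, on the assumption that the Pipeline would invoke an SJD for that task.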
🧩 Task Type Mapping (Canonical Reference)
Every Databricks task type, with the recommended Fabric construct.
| Databricks Task Type | Fabric Equivalent | Notes |
|---|---|---|
| notebook_task | Notebook activity | Direct map: re-point notebook reference; widget params → activity parameters |
| notebook_task (heavy ETL) | Spark Job Definition activity | Convert notebook to .py if you want headless execution and faster cold-start |
| spark_jar_task | Spark Job Definition (Java/Scala) | Repackage JAR; reference from SJD; pass args via SJD command line |
| spark_python_task (Python script) | Spark Job Definition (Python) OR Notebook with %run | SJD if it's truly headless; Notebook if you want widget-style params |
| python_wheel_task | Spark Job Definition + Environment with custom .whl | Upload wheel to Environment; reference entry-point in SJD |
| spark_submit_task | Spark Job Definition | Translate --conf, --py-files, --jars into SJD reference files + Spark properties |
| sql_task (DBSQL query) | Script (T-SQL) OR Lookup OR Stored procedure activity | Lookup if you need return values; Script for fire-and-forget DDL/DML |
| sql_task (DBSQL alert) | Pipeline + Web activity (to alerting system) OR Activator | Alert logic re-authored as expression on Lookup output |
| sql_task (DBSQL dashboard refresh) | Power BI semantic model refresh activity | Dashboards re-authored as Power BI reports |
| dbt_task | Notebook with dbt-fabric adapter OR external dbt runner via Web activity | See dbt Fabric Integration |
| pipeline_task (DLT) | Materialized Lake View (SQL DLT) OR scheduled Notebook chain (Python DLT) | See DLT-Specific Migration |
| for_each_task | ForEach activity | Direct map; iteration over array parameter |
| condition_task (If/Else, run-if) | If Condition, Switch, or Until activity | If/Else → If Condition; multi-branch → Switch; loop-with-exit → Until |
| run_job_task (sub-job) | Invoke Pipeline activity (a.k.a. Execute Pipeline) | Direct map: child pipeline reference |
| Webhook task | Web activity | HTTP POST/GET; supports auth via Workspace Identity or Key Vault |
| Custom workflow task (e.g., Airflow operator) | Custom code in Notebook activity | Translate operator logic into PySpark/Python |
| Job-completion trigger | Invoke Pipeline activity in chain | Chain pipelines instead of triggering on completion |
| File-arrival trigger | Storage event trigger on Pipeline (via Logic App + REST) | Native file triggers landing in late 2026; Logic App bridge today |
💡 Tip: Anything labelled "custom workflow task" in Databricks (where teams have built their own operator framework) should be the first thing you redesign — don't port the framework, port the intent. A Notebook activity calling a small Python module beats reimplementing the operator runtime.
☁️ Cluster Configuration Translation
Databricks job-cluster definitions live inside the Job JSON (under new_cluster). Fabric does not have explicit cluster objects — instead, Spark sessions are allocated against the F-SKU CU pool, with Spark properties, library lists, and resources declared in a Fabric Environment.
| Databricks Cluster Field | Fabric Equivalent | Notes |
|---|---|---|
| spark_version (e.g., 14.3.x-scala2.12) | Fabric Spark Runtime (1.3 / 2.0) | Pick the runtime closest to DBR major; see Spark Runtime Migration |
| node_type_id (e.g., Standard_E8s_v3) | No equivalent; sized by CU | Fabric chooses node sizing from F-SKU; tune via spark.executor.memory if needed |
| driver_node_type_id | No equivalent | Driver memory tuned via spark.driver.memory Spark property |
| num_workers (fixed) | No equivalent | Fabric autoscales; control with spark.dynamicAllocation.maxExecutors |
| autoscale.min_workers / max_workers | spark.dynamicAllocation.minExecutors / maxExecutors | Fabric handles allocation; you set the bounds |
| spark_conf (dict) | Environment Spark properties | Direct copy with key remapping (spark.databricks.* → spark.microsoft.* where applicable) |
| spark_env_vars | Environment YAML env: section | Set via Environment, not per-session |
| init_scripts | Environment Resources + custom library | Repackage init logic as a .whl or .tar.gz and upload to Environment |
| cluster_log_conf (DBFS log path) | Workspace Monitoring | Logs go to workspace monitoring eventhouse; see Workspace Monitoring |
| aws_attributes / azure_attributes | F-SKU region + Workspace Identity | Identity grants replace IAM role passthrough |
| instance_pool_id | No equivalent; F-SKU CU pool | Warm-pool semantics built into capacity |
| enable_elastic_disk | Always-on in Fabric | No setting required |
| runtime_engine: PHOTON | Fabric Native Execution Engine (NEE) | NEE is the default for V-Order Delta workloads |
| Library install (PyPI) | Environment Public Libraries | Add to Environment; Publish required |
| Library install (Maven) | Environment custom JAR upload | Download JAR, upload to Environment |
| Library install (CRAN) | Environment R libraries | R support varies by Fabric Spark runtime — check coverage |
| Cluster policies | Fabric capacity governance + Environment governance | Use workspace roles + Environment publish gates |
For full Environment authoring detail (YAML, library priority, conflict resolution), see Spark Environments & Job Definitions.
Example: Cluster Spec → Environment YAML
# Databricks job-cluster (excerpt from Job JSON):
# {
# "new_cluster": {
# "spark_version": "14.3.x-scala2.12",
# "node_type_id": "Standard_E8s_v3",
# "autoscale": {"min_workers": 2, "max_workers": 10},
# "spark_conf": {
# "spark.sql.shuffle.partitions": "400",
# "spark.databricks.delta.optimizeWrite.enabled": "true"
# },
# "init_scripts": [{"workspace": {"destination": "/Shared/init/install_geo.sh"}}],
# "spark_env_vars": {"PYTHONHASHSEED": "0"}
# }
# }
# Fabric Environment (env-migrated-bronze.yml):
name: env-migrated-bronze
runtime: 1.3
dependencies:
  - geopy==2.4.1
  - h3==3.7.7
  - shapely==2.0.3
custom_libraries:
  - geo_helpers-1.0.0-py3-none-any.whl   # repackaged from init_scripts
spark_properties:
  spark.sql.shuffle.partitions: "400"
  spark.databricks.delta.optimizeWrite.enabled: "true"
  spark.dynamicAllocation.minExecutors: "2"
  spark.dynamicAllocation.maxExecutors: "10"
env:
  PYTHONHASHSEED: "0"
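The mechanical part of this translation can be scripted. The following is a minimal sketch, assuming the target is a dictionary with the same keys as the illustrative Environment layout used on this page (`spark_properties`, `env`); the function name `cluster_to_environment` and the `manual_review` key are hypothetical. The runtime choice and init-script repackaging are deliberately left to a person.

```python
from typing import Any


def cluster_to_environment(name: str, new_cluster: dict[str, Any]) -> dict[str, Any]:
    """Map a Databricks new_cluster block onto the Environment layout shown on this page."""
    props = dict(new_cluster.get("spark_conf", {}))
    autoscale = new_cluster.get("autoscale")
    if autoscale:
        # Autoscale bounds become dynamic-allocation bounds.
        props["spark.dynamicAllocation.minExecutors"] = str(autoscale["min_workers"])
        props["spark.dynamicAllocation.maxExecutors"] = str(autoscale["max_workers"])
    elif "num_workers" in new_cluster:
        # Fixed-size cluster: cap executors at the old worker count.
        props["spark.dynamicAllocation.maxExecutors"] = str(new_cluster["num_workers"])
    return {
        "name": name,
        "spark_properties": props,
        "env": dict(new_cluster.get("spark_env_vars", {})),
        # Init scripts cannot be translated mechanically; list them for repackaging.
        "manual_review": [
            script["workspace"]["destination"]
            for script in new_cluster.get("init_scripts", [])
            if "workspace" in script
        ],
    }


if __name__ == "__main__":
    cluster = {
        "spark_version": "14.3.x-scala2.12",
        "autoscale": {"min_workers": 2, "max_workers": 10},
        "spark_conf": {"spark.sql.shuffle.partitions": "400"},
        "spark_env_vars": {"PYTHONHASHSEED": "0"},
    }
    print(cluster_to_environment("env-migrated-bronze", cluster))
```

The sketch copies `spark_conf` keys unchanged; any `spark.databricks.*` remapping would still need a reviewed lookup table.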
⏰ Trigger / Schedule Translation
| Databricks Trigger | Fabric Equivalent | Notes |
|---|---|---|
| schedule (cron) | Pipeline schedule trigger | Translate the Quartz expression to a Fabric frequency/interval schedule; map timezone_id to the Fabric time zone name |
| continuous (always-on streaming) | Eventstream trigger OR scheduled Notebook with availableNow=True at high frequency | Re-author against Eventstream; see Real-Time Intelligence |
| file_arrival trigger | Storage event trigger on Pipeline via Logic App + REST API call | Native triggers landing late 2026; Logic App bridge today |
| job_completion trigger | Invoke Pipeline activity in a parent pipeline | Chain pipelines explicitly instead of relying on event chaining |
| pause_status: PAUSED | Disable schedule on Pipeline | Direct toggle in Pipeline portal or REST |
| max_concurrent_runs | Pipeline concurrency setting | Fabric default 1; raise if your job is reentrant |
| Retry policy (max_retries, min_retry_interval_millis) | Activity-level Retry + Retry interval | Set per activity, not per pipeline |
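The retry row is simple enough to show as code. This is a minimal sketch, assuming the activity `policy` keys `retry` and `retryIntervalInSeconds` used in the pipeline examples on this page; the function name `retry_policy` is hypothetical. In Databricks, `max_retries: -1` means retry indefinitely, which has no bounded equivalent, so the sketch refuses to guess.

```python
def retry_policy(task: dict) -> dict:
    """Translate Databricks task retry settings into an activity policy block."""
    max_retries = task.get("max_retries", 0)
    if max_retries < 0:
        # -1 means "retry indefinitely" in Databricks; pick a bound by hand.
        raise ValueError("unbounded retries need a manual decision")
    interval_ms = task.get("min_retry_interval_millis", 0)
    # Round up so the new interval is never shorter than the old one.
    interval_s = -(-interval_ms // 1000)
    return {"retry": max_retries, "retryIntervalInSeconds": interval_s}


if __name__ == "__main__":
    print(retry_policy({"max_retries": 2, "min_retry_interval_millis": 300000}))
```

Check the allowed range for the retry interval in the current Fabric documentation before applying the result in bulk.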
Example: Cron Schedule
// Databricks Job schedule:
"schedule": {
"quartz_cron_expression": "0 0 2 * * ?",
"timezone_id": "America/Los_Angeles",
"pause_status": "UNPAUSED"
}
// Fabric Pipeline schedule:
{
"type": "Schedule",
"frequency": "Day",
"interval": 1,
"startTime": "2026-04-27T02:00:00",
"timeZone": "Pacific Standard Time"
}
💡 Tip: Quartz cron uses 6 fields, or 7 with the optional year (seconds come first, and ? is allowed in the day fields); Fabric uses ISO 8601 frequency/interval or simple cron (5 fields). The conversion script translates Quartz → Fabric, but always spot-check edge cases (e.g., L for last-day-of-month).
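The field-count difference is the part that trips people up, so here is a minimal sketch of the Quartz → 5-field step. It is not the conversion script referenced above; the function name `quartz_to_five_field` is hypothetical, and it assumes the target accepts standard 5-field cron. Anything it cannot translate safely (non-zero seconds, a year restriction, `L`, `W`, `#`, or numeric day-of-week, where Quartz counts from 1 = Sunday) raises so a person can review it.

```python
def quartz_to_five_field(expr: str) -> str:
    """Convert a simple Quartz cron expression to 5-field cron, or raise."""
    fields = expr.split()
    if len(fields) not in (6, 7):
        raise ValueError(f"expected 6 or 7 Quartz fields, got {len(fields)}")
    seconds, minute, hour, dom, month, dow = fields[:6]
    if seconds != "0":
        raise ValueError("non-zero seconds cannot be expressed in 5-field cron")
    if len(fields) == 7 and fields[6] != "*":
        raise ValueError("year restrictions need manual review")
    if any(ch in dom for ch in "LW") or any(ch in dow for ch in "L#"):
        raise ValueError("Quartz-only specials (L, W, #) need manual review")
    if any(ch.isdigit() for ch in dow):
        # Quartz numbers days 1-7 from Sunday; standard cron uses 0-6.
        raise ValueError("numeric day-of-week needs manual review")
    # '?' means "no specific value" in Quartz; '*' is the closest 5-field form.
    dom = "*" if dom == "?" else dom
    dow = "*" if dow == "?" else dow
    return " ".join([minute, hour, dom, month, dow])


if __name__ == "__main__":
    print(quartz_to_five_field("0 0 2 * * ?"))  # prints: 0 2 * * *
```

Failing loudly on the special cases is a deliberate choice: a silently wrong schedule is harder to catch than a job flagged for manual conversion.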
🔣 Parameter Handling
| Concept | Databricks | Fabric | Migration Notes |
|---|---|---|---|
| Job-level parameters | parameters[] array on Job | Pipeline parameters | Direct map; types: string, int, bool, array |
| Task-level parameters | notebook_task.base_parameters | Activity parameters | Direct map per Notebook activity |
| Default values | default_value on parameter | Pipeline parameter defaultValue | Direct map |
| Variable substitution | {{job.parameters.xyz}} | @pipeline().parameters.xyz | Different syntax — script does the rewrite |
| Per-environment values | Per-job override or shell injection | Variable Library binding | Promote shared values to a Variable Library |
| Inter-task data passing | dbutils.jobs.taskValues.get/set(...) | Activity output → expression @activity('PrevActivity').output.xxx | Different model — see Cluster Reuse / Inter-Task Data |
| Dynamic value (date) | {{start_time.iso_date}} | @formatDateTime(utcNow(), 'yyyy-MM-dd') | Different expression language |
| Run ID | {{job.run_id}} | @pipeline().RunId | Direct map |
Variable Library Binding
Promote anything used by ≥ 3 jobs to a Fabric Variable Library, with per-stage values (Dev / Test / Prod). See Wave 7 — Variable Library Setup for the full pattern.
// Pipeline parameter bound to Variable Library:
{
"name": "lakehouse_id",
"type": "string",
"defaultValue": "@variableLibrary('shared').lakehouse_id"
}
⚠️ Gotcha: Databricks {{job.parameters.xyz}} inside notebook code does nothing at notebook runtime; it's only resolved by Workflows. In Fabric, the same pattern requires mssparkutils.notebook.exit() for return values and parameter cells (tagged parameters) for input. The conversion script flags every {{job.parameters.*}} reference in notebook cells for rewrite.
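The substitution rewrite in the table above can be sketched with one regular expression. This is a minimal sketch rather than the conversion script itself; the names `rewrite_placeholder` and `find_placeholders` are hypothetical, and the mappings are the ones listed in the Parameter Handling table. It only translates values that consist of a single placeholder, because a placeholder embedded in a longer string needs a hand-written expression.

```python
import re

# Fixed mappings from the Parameter Handling table.
FIXED = {
    "job.run_id": "@pipeline().RunId",
    "start_time.iso_date": "@formatDateTime(utcNow(), 'yyyy-MM-dd')",
}
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")
PREFIX = "job.parameters."


def rewrite_placeholder(value: str) -> str:
    """Translate a value that is exactly one Databricks placeholder."""
    match = PLACEHOLDER.fullmatch(value.strip())
    if not match:
        return value  # literal or mixed string: leave for manual review
    ref = match.group(1)
    if ref.startswith(PREFIX):
        return "@pipeline().parameters." + ref[len(PREFIX):]
    return FIXED.get(ref, value)


def find_placeholders(text: str) -> list[str]:
    """List placeholder references in notebook source so they can be flagged."""
    return PLACEHOLDER.findall(text)


if __name__ == "__main__":
    print(rewrite_placeholder("{{job.parameters.xyz}}"))
    print(find_placeholders('path = "/data/{{job.parameters.region}}/in"'))
```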
🔄 Cluster Reuse / Inter-Task Data
Databricks workflows can share one job cluster across tasks (cheaper, faster). Fabric allocates sessions from the capacity's CU pool, so there's no "reuse a cluster" knob, but you can get comparable throughput by chaining activities tightly.
Inter-Task Data Passing
| Pattern | Databricks | Fabric |
|---|---|---|
| Pass scalar from task A to task B | dbutils.jobs.taskValues.set("k", v) then taskValues.get(taskKey="A", key="k") | mssparkutils.notebook.exit(json.dumps({"k": v})) then @json(activity('A').output.result.exitValue).k |
| Share large DataFrame | Cluster-scoped temp view across tasks | Write to a Lakehouse staging table; downstream reads it |
| Share file artifact | DBFS path | OneLake Files/ path passed as parameter |
| Conditional branch on prior result | dbutils.jobs.taskValues + condition_task | If Condition activity with @equals(activity('A').output.result.exitValue, 'OK') |
Example: Notebook Returning a Value
# Notebook A (Fabric):
result = compute_something()
mssparkutils.notebook.exit(str(result)) # always cast to str
⚠️ Gotcha: notebook.exit() only supports a string return value. For complex objects, return JSON and parse downstream with @json(activity('A').output.result.exitValue).
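Returning several values through the single string is easiest with a small helper. This is a minimal sketch; `build_exit_payload` is a hypothetical name, and the `mssparkutils` call appears only as a comment because that module exists only inside the Fabric runtime.

```python
import json


def build_exit_payload(values: dict) -> str:
    """Serialise task outputs into the one string that notebook.exit() accepts."""
    # default=str keeps dates and decimals from raising; sort_keys gives stable output.
    return json.dumps(values, default=str, sort_keys=True)


# In the Fabric notebook, the last cell would call:
#   mssparkutils.notebook.exit(build_exit_payload({"status": "OK", "rows": 1250}))

if __name__ == "__main__":
    payload = build_exit_payload({"status": "OK", "rows": 1250})
    print(payload)  # prints: {"rows": 1250, "status": "OK"}
```

Keep the payload small. Anything larger than a handful of scalars belongs in a staging table, as the table above recommends.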
🧪 Concrete Conversion Examples
Three end-to-end examples covering the most common migration shapes.
Example 1: Simple Scheduled Notebook
A daily notebook that runs at 02:00 UTC, reads yesterday's data, writes Bronze.
Source: Databricks Job JSON
{
"name": "daily_bronze_slot_ingest",
"schedule": {
"quartz_cron_expression": "0 0 2 * * ?",
"timezone_id": "UTC",
"pause_status": "UNPAUSED"
},
"tasks": [{
"task_key": "ingest",
"notebook_task": {
"notebook_path": "/Repos/casino/bronze/01_bronze_slot_telemetry",
"base_parameters": {
"run_date": "{{start_time.iso_date}}",
"source_db": "SlotManagement"
}
},
"new_cluster": {
"spark_version": "14.3.x-scala2.12",
"node_type_id": "Standard_E8s_v3",
"autoscale": {"min_workers": 2, "max_workers": 6}
}
}]
}
Target: Fabric Pipeline JSON
{
"name": "pl_daily_bronze_slot_ingest",
"properties": {
"parameters": {
"run_date": {
"type": "string",
"defaultValue": "@formatDateTime(utcNow(), 'yyyy-MM-dd')"
},
"source_db": { "type": "string", "defaultValue": "SlotManagement" }
},
"activities": [{
"name": "IngestBronzeSlots",
"type": "TridentNotebook",
"typeProperties": {
"notebookId": "<fabric-notebook-id>",
"workspaceId": "<workspace-id>",
"parameters": {
"run_date": { "value": "@pipeline().parameters.run_date", "type": "string" },
"source_db": { "value": "@pipeline().parameters.source_db", "type": "string" }
}
},
"policy": { "retry": 2, "retryIntervalInSeconds": 300 }
}]
},
"schedule": {
"frequency": "Day", "interval": 1,
"startTime": "2026-04-28T02:00:00", "timeZone": "UTC"
}
}
Runtime config (autoscale, spark_version) moves from the cluster spec to the Environment attached to the notebook. See cluster translation.
Example 2: Multi-Task DAG with Dependencies
Bronze → Silver → Gold, with Silver fanning out per region (ForEach).
Source: Databricks DAG
[bronze_load] ── depends_on ──▶ [silver_us, silver_eu, silver_apac] (parallel)
│ │ │
└───┬───┴───────┘
▼
[gold_kpis]
{
"name": "etl_full",
"tasks": [
{ "task_key": "bronze_load",
"notebook_task": { "notebook_path": "/bronze/01_load" } },
{ "task_key": "silver_us",
"depends_on": [{"task_key": "bronze_load"}],
"notebook_task": { "notebook_path": "/silver/02_clean",
"base_parameters": {"region": "US"} } },
{ "task_key": "silver_eu",
"depends_on": [{"task_key": "bronze_load"}],
"notebook_task": { "notebook_path": "/silver/02_clean",
"base_parameters": {"region": "EU"} } },
{ "task_key": "silver_apac",
"depends_on": [{"task_key": "bronze_load"}],
"notebook_task": { "notebook_path": "/silver/02_clean",
"base_parameters": {"region": "APAC"} } },
{ "task_key": "gold_kpis",
"depends_on": [
{"task_key": "silver_us"},
{"task_key": "silver_eu"},
{"task_key": "silver_apac"}
],
"notebook_task": { "notebook_path": "/gold/03_kpis" } }
]
}
Target: Fabric Pipeline
Replace the three parallel tasks with a single ForEach + Notebook activity:
{
"name": "pl_etl_full",
"activities": [
{
"name": "BronzeLoad",
"type": "TridentNotebook",
"typeProperties": { "notebookId": "<bronze-nb>" }
},
{
"name": "SilverFanout",
"type": "ForEach",
"dependsOn": [{ "activity": "BronzeLoad", "dependencyConditions": ["Succeeded"] }],
"typeProperties": {
"items": { "value": "@createArray('US','EU','APAC')", "type": "Expression" },
"isSequential": false,
"batchCount": 3,
"activities": [{
"name": "SilverClean",
"type": "TridentNotebook",
"typeProperties": {
"notebookId": "<silver-nb>",
"parameters": {
"region": { "value": "@item()", "type": "string" }
}
}
}]
}
},
{
"name": "GoldKpis",
"type": "TridentNotebook",
"dependsOn": [{ "activity": "SilverFanout", "dependencyConditions": ["Succeeded"] }],
"typeProperties": { "notebookId": "<gold-nb>" }
}
]
}
Three lessons from this conversion:
- Replicated tasks → ForEach. Don't mechanically port three near-identical tasks; collapse them.
- Parallelism via batchCount. Set batchCount: 3 (with isSequential: false) to match the Databricks parallel fan-out.
- Single notebook with a parameter. The three Databricks tasks already ran the same parameterised notebook (/silver/02_clean); Fabric ForEach exposes that more directly.
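Spotting replicated tasks by eye does not scale to a large export, so the check can be scripted. This is a minimal sketch, assuming the task list has the Job JSON shape shown above; `find_foreach_candidates` is a hypothetical name. It groups notebook tasks that share a notebook path and the same upstream dependencies, and reports the parameter sets that would become the ForEach items.

```python
from collections import defaultdict


def find_foreach_candidates(tasks: list[dict]) -> list[tuple]:
    """Return (notebook_path, upstream_keys, parameter_sets) for replicated tasks."""
    groups = defaultdict(list)
    for task in tasks:
        notebook = task.get("notebook_task")
        if not notebook:
            continue  # only notebook tasks are considered in this sketch
        upstream = tuple(sorted(d["task_key"] for d in task.get("depends_on", [])))
        key = (notebook["notebook_path"], upstream)
        groups[key].append(notebook.get("base_parameters", {}))
    # A group with more than one member is a ForEach candidate.
    return [
        (path, list(upstream), params)
        for (path, upstream), params in groups.items()
        if len(params) > 1
    ]


if __name__ == "__main__":
    tasks = [
        {"task_key": "bronze_load", "notebook_task": {"notebook_path": "/bronze/01_load"}},
        {"task_key": "silver_us", "depends_on": [{"task_key": "bronze_load"}],
         "notebook_task": {"notebook_path": "/silver/02_clean", "base_parameters": {"region": "US"}}},
        {"task_key": "silver_eu", "depends_on": [{"task_key": "bronze_load"}],
         "notebook_task": {"notebook_path": "/silver/02_clean", "base_parameters": {"region": "EU"}}},
    ]
    print(find_foreach_candidates(tasks))
```

Tasks that share a notebook but differ in cluster spec or retry settings are still grouped here, so review each candidate before collapsing it.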
Example 3: DLT Pipeline → Materialized Lake View
A SQL DLT pipeline that lands a streaming Bronze table and a Silver projection.
Source: Databricks DLT (SQL)
CREATE OR REFRESH STREAMING TABLE bronze_slot_telemetry
COMMENT "Raw slot events from Eventhub"
AS SELECT * FROM cloud_files(
"abfss://raw@adlsdb01.dfs.core.windows.net/slots/",
"json",
map("cloudFiles.schemaLocation", "abfss://raw@adlsdb01.dfs.core.windows.net/_schemas/slots/")
);
CREATE OR REFRESH LIVE TABLE silver_slot_telemetry (
CONSTRAINT valid_coin EXPECT (coin_in >= 0) ON VIOLATION DROP ROW
)
COMMENT "Cleansed slot telemetry"
AS SELECT
machine_id,
CAST(event_time AS TIMESTAMP) AS event_time,
coin_in,
coin_out,
coin_in - coin_out AS net_revenue
FROM LIVE.bronze_slot_telemetry
WHERE event_time IS NOT NULL;
Target: Fabric Materialized Lake View
The Bronze (autoLoader) leg becomes an Eventstream (or Copy Job CDC), and the Silver leg becomes an MLV:
-- Silver as Materialized Lake View:
CREATE MATERIALIZED LAKE VIEW lh_silver.silver_slot_telemetry
AS SELECT
machine_id,
CAST(event_time AS TIMESTAMP) AS event_time,
coin_in,
coin_out,
coin_in - coin_out AS net_revenue
FROM lh_bronze.bronze_slot_telemetry
WHERE event_time IS NOT NULL
AND coin_in >= 0; -- DLT EXPECT enforced as filter (drop semantics)
For the expect rule (ON VIOLATION DROP ROW), the conversion above embeds the predicate directly. For richer rules use a Great Expectations checkpoint in the upstream notebook — see Wave 3 Data Contract Suite.
⚠️ Gotcha: DLT ON VIOLATION FAIL UPDATE (block-on-bad-data) has no MLV equivalent. Author it as a GE checkpoint on the Bronze table that fails the upstream pipeline run. Don't try to encode it as a CHECK constraint: a Delta CHECK aborts the write, not the read.
🌊 DLT-Specific Migration
Because DLT is the most opinionated thing in Databricks Workflows, it's worth its own checklist.
Decorator / SQL → Fabric Mapping
| DLT Construct | Fabric Equivalent | Notes |
|---|---|---|
| @dlt.table (batch) | Materialized Lake View (CREATE MATERIALIZED LAKE VIEW) | Direct rewrite if the body is SQL |
| @dlt.view | View (CREATE VIEW) | Non-materialised; same as Spark/T-SQL view |
| @dlt.table (Python with custom logic) | Scheduled notebook with MERGE INTO | Manual rewrite required |
| CREATE OR REFRESH STREAMING TABLE | Eventstream → Lakehouse OR Copy Job CDC | Use Eventstream for true streaming; Copy Job for incremental batch |
| cloud_files(...) (autoLoader) | Eventstream with file-arrival source OR Copy Job CDC with file-arrival trigger | autoLoader's schema-tracking → Copy Job's auto-schema-evolve |
| @dlt.expect("rule", "predicate") | Filter in MLV body (drop) OR GE checkpoint (warn/quarantine) | Most direct: encode predicate in MLV WHERE clause |
| @dlt.expect_or_drop | Filter in MLV WHERE clause | Direct |
| @dlt.expect_or_fail | GE checkpoint on upstream table that fails the pipeline | No MLV equivalent for fail-on-violation |
| @dlt.expect_all_or_drop({...}) | Multiple AND-ed filter predicates in MLV WHERE | Concatenate predicates |
| APPLY CHANGES INTO (CDC merge) | Copy Job CDC activity OR Notebook with MERGE INTO | Copy Job is the no-code path; see Copy Job CDC |
| DLT pipeline mode = triggered | Pipeline schedule | Direct |
| DLT pipeline mode = continuous | Eventstream + streaming Notebook (availableNow=False) | Re-author |
| DLT target (catalog destination) | Lakehouse + schema (lh_silver.<schema>.<table>) | Direct |
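For the drop-style expectations, "concatenate predicates" can be shown concretely. This is a minimal sketch; `expectations_to_where` is a hypothetical name, and the input is assumed to be the same name → predicate dictionary that `@dlt.expect_all_or_drop` takes. Each predicate is wrapped in parentheses so that an `OR` inside one rule cannot change the meaning of the combined filter.

```python
def expectations_to_where(expectations: dict[str, str]) -> str:
    """Combine drop-style expectation predicates into one WHERE clause."""
    if not expectations:
        raise ValueError("at least one expectation is required")
    # Rule names are dropped; only the predicates survive in the filter.
    return "WHERE " + "\n  AND ".join(f"({pred})" for pred in expectations.values())


if __name__ == "__main__":
    rules = {
        "valid_coin": "coin_in >= 0",
        "has_time": "event_time IS NOT NULL",
    }
    print(expectations_to_where(rules))
```

The rule names are lost in this translation, so the per-rule violation counts that DLT reports would have to come from a separate data-quality check.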
autoLoader → Eventstream / Copy Job
# Databricks autoLoader (DLT or notebook):
df = (spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", "/_schemas/slots/")
.load("/mnt/raw/slots/"))
# Fabric replacement A — Eventstream (true streaming):
# Source: ADLS Gen2 file source on /raw/slots/
# Destination: Lakehouse table lh_bronze.bronze_slot_telemetry
# Configure schema inference + retention
# Fabric replacement B — Copy Job CDC (incremental batch):
# Source: ADLS Gen2 folder with file-arrival watermark
# Destination: Lakehouse table
# Schedule: every 5/15/60 minutes
💡 Tip: The vast majority of DLT autoLoader pipelines are batch every N minutes in disguise — they don't actually need true streaming. A Copy Job CDC every 5 minutes is cheaper and simpler than maintaining an always-on Eventstream. Pick the streaming path only when sub-minute latency is a hard requirement.
🛠️ Migration Tooling
Bulk Export of Workflow JSON
Databricks REST API (/api/2.1/jobs/list and /api/2.1/jobs/get) returns the canonical Job definition. The bulk-export script saves every job as a JSON file for batch conversion.
PowerShell: export_databricks_jobs.ps1
param(
    [Parameter(Mandatory)] [string] $DatabricksHost,
    [Parameter(Mandatory)] [string] $DatabricksToken,
    [string] $OutputDir = "./databricks-export/Jobs"
)
New-Item -ItemType Directory -Force -Path $OutputDir | Out-Null
$headers = @{ Authorization = "Bearer $DatabricksToken" }
$page = $null
$jobs = @()
# Page through the job list until no next_page_token is returned.
do {
    $url = "$DatabricksHost/api/2.1/jobs/list?limit=25"
    if ($page) { $url += "&page_token=$page" }
    $resp = Invoke-RestMethod -Uri $url -Headers $headers
    if ($resp.jobs) { $jobs += $resp.jobs }
    $page = $resp.next_page_token
} while ($page)
# Fetch each full job definition and save it under a filesystem-safe name.
foreach ($job in $jobs) {
    $detail = Invoke-RestMethod -Uri "$DatabricksHost/api/2.1/jobs/get?job_id=$($job.job_id)" -Headers $headers
    $safeName = ($detail.settings.name -replace '[^a-zA-Z0-9_-]', '_')
    $detail | ConvertTo-Json -Depth 20 | Set-Content -Path "$OutputDir/$safeName.json"
    Write-Host "Exported: $safeName"
}
Write-Host "Total jobs exported: $($jobs.Count)"
Python equivalent: export_databricks_jobs.py
import argparse
import json
from pathlib import Path

import requests


def get_json(url, headers):
    """GET a Databricks REST endpoint and fail loudly on HTTP errors."""
    r = requests.get(url, headers=headers, timeout=60)
    r.raise_for_status()
    return r.json()


def main():
    p = argparse.ArgumentParser()
    p.add_argument("--host", required=True)
    p.add_argument("--token", required=True)
    p.add_argument("--output-dir", default="./databricks-export/Jobs")
    args = p.parse_args()
    Path(args.output_dir).mkdir(parents=True, exist_ok=True)
    host = args.host.rstrip("/")
    headers = {"Authorization": f"Bearer {args.token}"}
    jobs, page = [], None
    # Page through the job list until no next_page_token is returned.
    while True:
        url = f"{host}/api/2.1/jobs/list?limit=25"
        if page:
            url += f"&page_token={page}"
        r = get_json(url, headers)
        jobs.extend(r.get("jobs", []))
        page = r.get("next_page_token")
        if not page:
            break
    # Fetch each full definition; jobs whose sanitised names collide overwrite each other.
    for job in jobs:
        d = get_json(f"{host}/api/2.1/jobs/get?job_id={job['job_id']}", headers)
        name = "".join(c if c.isalnum() or c in "_-" else "_" for c in d["settings"]["name"])
        Path(args.output_dir, f"{name}.json").write_text(json.dumps(d, indent=2))
        print(f"Exported: {name}")
    print(f"Total: {len(jobs)}")


if __name__ == "__main__":
    main()
Conversion: Scriptable vs Manual
| Aspect | Scriptable | Manual Review |
|---|---|---|
| Schedule cron → Fabric schedule | ✅ | n/a |
| notebook_task → Notebook activity | ✅ | Re-point notebook ID |
| condition_task (simple If/Else) | ✅ | n/a |
| for_each_task → ForEach | ✅ | n/a |
| run_job_task → Invoke Pipeline | ✅ | n/a |
| Job parameter → Pipeline parameter | ✅ | n/a |
| spark_jar_task / spark_python_task → SJD | ⚠️ partial | Repackage artifact, set entry-point |
| DLT pipeline → MLV | ❌ | Author MLV by hand from DLT SQL |
| Continuous trigger → Eventstream | ❌ | Re-author against Eventstream |
| dbt_task → dbt-fabric notebook | ⚠️ partial | Re-point profile; verify adapter coverage |
| Webhook task → Web activity | ✅ | Re-bind auth (Workspace Identity / KV) |
| Job-cluster spec → Environment | ⚠️ partial | Repackage init scripts as .whl/.tar.gz |
| Custom workflow framework tasks | ❌ | Redesign; don't port the framework |
⚡ Performance Considerations
| Concern | Databricks | Fabric |
|---|---|---|
| Cold start for job-cluster | 60-180 sec to provision new cluster | 10-30 sec for Spark session within F-SKU |
| Warm start (cluster pool) | ~30 sec (warm pool) | ~5-10 sec (CU pool already warm) |
| Cost per minute of idle | High (cluster billed even idle) | F-SKU is fixed monthly — idle minutes don't cost extra |
| Cost per active DBU | Variable (tier × DBU) | Smoothed against F-SKU |
| Scale-out latency | 30-90 sec to add workers | 10-30 sec autoscale within session |
| Photon vs NEE | Photon paid surcharge | NEE included; coverage growing |
| Streaming always-on cost | Cluster billed 24×7 | F-SKU absorbs; bound by CU watermark |
💡 Tip: Under the fixed-cost F-SKU model, an idle pipeline adds no marginal cost because the capacity is already paid for. In Databricks the temptation was to consolidate jobs onto fewer clusters to save cost; in Fabric, splitting work across more pipelines is fine and improves observability.
🚫 Anti-Patterns
Don't lift-and-shift these — redesign during migration.
- Lift-and-shift "all-purpose cluster" jobs. All-purpose clusters were cheap for interactive dev, expensive for prod jobs. In Fabric, split: notebooks for dev, Spark Job Definitions for prod batch.
- One giant Workflow with 50+ tasks. Re-architect into smaller Pipelines invoked via Invoke Pipeline. Easier to monitor, restart, and version.
- Continuous-mode jobs that are really micro-batch. If the job triggers every minute, it's not streaming — make it a Pipeline schedule or Copy Job CDC at a sane cadence.
- DLT pipelines for non-streaming workloads. Static Bronze→Silver chains belong in MLVs or scheduled notebooks; don't re-create DLT runtime overhead.
- dbutils.jobs.taskValues for large payloads. Don't pipe DataFrames through task values; write to a staging Lakehouse table.
- Photon-only SQL inside notebooks. Identify Photon-only functions during assessment and rewrite using portable Spark SQL before migration.
- Copying init scripts byte-for-byte. Init scripts that apt-get install system packages don't fit the Environment model. Repackage the intent (the libraries you needed) as a .whl/.tar.gz.
- One Variable Library per pipeline. Use one shared Variable Library per environment (Dev/Test/Prod), not per pipeline. Keeps secrets and connection strings consolidated.
- Using Pipeline parameters as feature flags. If you find yourself toggling logic with a parameter, split into two pipelines or use Deployment Pipeline rules.
- Re-creating DLT expect_or_fail as Delta CHECK constraints. CHECK aborts writes (not what you want); use a GE checkpoint that fails the upstream activity.
✅ Implementation Checklist
Use this as the workflow-migration sub-checklist for Step 5 of Tutorial 42.
- Bulk-exported all Job JSON via REST API (export_databricks_jobs.ps1 / .py)
- Categorised every job: Pipeline, SJD, Eventstream, MLV, decommission
- Identified every pipeline_task (DLT) for manual rewrite
- Identified every continuous trigger for Eventstream re-author
- Mapped every run_job_task to Invoke Pipeline activity
- Mapped every for_each_task to ForEach activity
- Mapped every condition_task to If/Switch/Until activity
- Translated every new_cluster spec to a Fabric Environment YAML
- Repackaged init scripts as custom .whl / .tar.gz
- Translated every cron schedule to Fabric Pipeline schedule
- Translated dbutils.jobs.taskValues calls to mssparkutils.notebook.exit() + activity outputs
- Promoted shared parameters (used by ≥3 jobs) to Variable Library
- Wired retry policy on each activity (was: per-job)
- Wired storage-event triggers via Logic App (where file-arrival was used)
- Authored DLT-replacement MLVs and verified row-count parity vs DLT outputs
- Authored DLT expect_or_fail rules as upstream GE checkpoints
- Verified each converted Pipeline runs end-to-end against test data
- Disabled (paused) the corresponding Databricks Workflow before Fabric Pipeline goes live
- Captured pre-migration SLA metrics (duration, cost) and post-migration deltas
📚 References
Databricks Documentation
- Databricks Jobs REST API 2.1
- Databricks Workflows overview
- Delta Live Tables — Python and SQL APIs
- Cluster configuration
Microsoft Fabric Documentation
- Fabric Data Pipelines overview
- Pipeline activities reference
- Spark Job Definitions
- Materialized Lake Views
- Eventstream sources
- Copy Job
Wave 4 Cross-References
- Tutorial 42 — Databricks → Fabric (parent tutorial — Step 5 anchors here)
- Tutorial 41 — Synapse → Fabric (style anchor; overlapping Pipeline patterns)
- Spark Environments & Job Definitions (Environment YAML, SJD authoring)
- Deployment Pipelines (stage promotion + Variable Library)
- Pipelines & Data Movement Best Practices (ETL/ELT, copy activity tuning)
- Materialized Lake Views (DLT replacement)
- Copy Job CDC (autoLoader replacement)
- Real-Time Intelligence (continuous-trigger replacement)
- dbt Fabric Integration (dbt_task migration)
- Spark Runtime Migration (DBR → Fabric Spark runtime)
- Testing Strategies — Data Contract Suites (GE checkpoints replacing DLT expects)
- fabric-cicd Deployment (deploying converted Pipelines)
- Workspace Monitoring (replacement for cluster_log_conf)