CDP Data Engineering Migration: CDE, CML, CDW to Azure
A detailed guide for migrating Cloudera Data Platform (CDP) components -- Data Engineering (CDE), Machine Learning (CML), and Data Warehouse (CDW) -- to Databricks, Azure ML, and Microsoft Fabric.
Overview
CDP represents Cloudera's modern platform, built on Kubernetes and available in Private Cloud and Public Cloud editions. Organizations on CDP are in a different position than CDH shops: the infrastructure is more modern, the APIs are cleaner, and the migration paths are more direct. However, the core economic and strategic arguments for Azure migration still apply -- CDP licensing costs are rising, and Azure-native services provide broader capabilities at lower total cost.
This guide covers three CDP-specific components that require dedicated migration approaches beyond the core HDFS/Hive/Spark/Oozie playbook.
1. CDE Virtual Clusters to Databricks Workspaces
Architecture comparison
| CDE concept | Databricks equivalent | Notes |
| --- | --- | --- |
| CDE Service | Databricks account | Top-level container for all resources. |
| Virtual Cluster | Databricks Workspace | Isolated environment with its own compute, notebooks, and jobs. |
| CDE Spark job | Databricks Job (Spark task) | Spark application submitted for scheduled or on-demand execution. |
| CDE resource (files/archives) | Databricks DBFS / Unity Catalog Volumes | File storage for job dependencies. |
| CDE job run | Databricks Job Run | Individual execution of a job. |
| CDE CLI | Databricks CLI / REST API | Command-line management interface. |
| CDE API | Databricks REST API / SDK | Programmatic access to all functionality. |
CDE job definition to Databricks job
CDE job definition (CDE CLI format):

```yaml
name: daily-sales-etl
type: spark
application-file: s3a://cde-resources/jobs/sales_etl.py
driver-cores: 2
driver-memory: 4g
executor-cores: 4
executor-memory: 8g
num-executors: 10
conf:
  spark.sql.shuffle.partitions: 400
schedule:
  enabled: true
  cron-expression: "0 2 * * *"
  start: "2025-01-01T00:00:00Z"
```
Databricks Job definition (Jobs API 2.1 format):

```json
{
  "name": "daily-sales-etl",
  "tasks": [
    {
      "task_key": "run_sales_etl",
      "spark_python_task": {
        "python_file": "dbfs:/jobs/sales_etl.py"
      },
      "new_cluster": {
        "spark_version": "15.4.x-scala2.12",
        "node_type_id": "Standard_DS4_v2",
        "driver_node_type_id": "Standard_DS4_v2",
        "autoscale": {
          "min_workers": 4,
          "max_workers": 10
        },
        "spark_conf": {
          "spark.sql.shuffle.partitions": "400"
        }
      }
    }
  ],
  "schedule": {
    "quartz_cron_expression": "0 0 2 * * ?",
    "timezone_id": "UTC"
  }
}
```
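The same job can be created programmatically instead of via raw JSON. Below is a minimal sketch using the Databricks SDK for Python (`databricks-sdk`), assuming authentication is already configured via a profile or environment variables; the paths, node types, and schedule mirror the JSON above.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute, jobs

w = WorkspaceClient()

# Create the migrated CDE job as a Databricks job with a job cluster
created = w.jobs.create(
    name="daily-sales-etl",
    tasks=[
        jobs.Task(
            task_key="run_sales_etl",
            spark_python_task=jobs.SparkPythonTask(
                python_file="dbfs:/jobs/sales_etl.py"
            ),
            new_cluster=compute.ClusterSpec(
                spark_version="15.4.x-scala2.12",
                node_type_id="Standard_DS4_v2",
                autoscale=compute.AutoScale(min_workers=4, max_workers=10),
                spark_conf={"spark.sql.shuffle.partitions": "400"},
            ),
        )
    ],
    # Quartz cron: note the extra seconds field compared to CDE's Unix cron
    schedule=jobs.CronSchedule(
        quartz_cron_expression="0 0 2 * * ?", timezone_id="UTC"
    ),
)
print(f"Created job {created.job_id}")
```

Driver sizing defaults to the worker node type when `driver_node_type_id` is omitted, which matches the symmetric sizing in the JSON example.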
Key differences in job management
| CDE behavior | Databricks behavior | Migration note |
| --- | --- | --- |
| Virtual cluster has fixed compute | Clusters are per-job or shared (job clusters vs all-purpose) | Use job clusters for production; all-purpose clusters for development. |
| Resources uploaded via CDE CLI | Libraries attached via cluster or job config | Upload JARs/wheels to DBFS or Unity Catalog Volumes (see the sketch after this table). |
| CDE manages Spark versions per VC | Databricks Runtime version set per cluster | Choose the DBR version in the cluster config. |
| CDE auto-scales within VC limits | Databricks auto-scales per cluster policy | Set min/max workers in the cluster spec or policy. |
| CDE resource isolation via namespace | Databricks workspace isolation + Unity Catalog | Workspace-level isolation; data access via Unity Catalog grants. |
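As a replacement for `cde resource upload`, job dependencies can be pushed to a Unity Catalog Volume with the Files API. A minimal sketch with `databricks-sdk`; the volume path and wheel name are hypothetical examples.

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Upload a wheel that jobs will reference as a library
with open("dist/sales_etl-1.0-py3-none-any.whl", "rb") as f:
    w.files.upload(
        "/Volumes/main/etl/libs/sales_etl-1.0-py3-none-any.whl",
        f,
        overwrite=True,
    )
```

The uploaded file can then be attached in a job definition via `"libraries": [{"whl": "/Volumes/main/etl/libs/sales_etl-1.0-py3-none-any.whl"}]`.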
2. CDE Airflow to Databricks Workflows
CDE includes Apache Airflow for orchestration. This is one of the more straightforward migrations: Databricks Workflows provides a native replacement for Spark-only DAGs, and Azure Data Factory (ADF) provides a broader cross-service orchestration layer.
Migration targets
| CDE Airflow pattern | Target | When to use |
| --- | --- | --- |
| Simple Spark DAG (all tasks are Spark) | Databricks Workflows (multi-task job) | All tasks run on Databricks compute. |
| Mixed DAG (Spark + SQL + shell + API) | ADF Pipeline | Cross-service orchestration (Databricks + SQL + Logic Apps). |
| Complex DAG with branching/dynamic tasks | Databricks Workflows + ADF | Databricks for compute tasks; ADF for cross-service logic. |
| Airflow sensors (file/time/external) | ADF triggers (schedule/event/tumbling window) | ADF trigger types replace Airflow sensor patterns. |
Airflow DAG to Databricks Workflow conversion
CDE Airflow DAG (before):

```python
from datetime import datetime

from airflow import DAG
from cloudera.cdp.airflow.operators.cde_operator import CDEJobRunOperator

dag = DAG(
    'daily_sales_pipeline',
    schedule_interval='0 2 * * *',
    start_date=datetime(2025, 1, 1),
    catchup=False,
)

extract = CDEJobRunOperator(
    task_id='extract_orders',
    job_name='extract_orders_job',
    dag=dag,
)

transform = CDEJobRunOperator(
    task_id='transform_sales',
    job_name='transform_sales_job',
    dag=dag,
)

load = CDEJobRunOperator(
    task_id='load_warehouse',
    job_name='load_warehouse_job',
    dag=dag,
)

extract >> transform >> load
```
Databricks Workflow (after):

```json
{
  "name": "daily_sales_pipeline",
  "tasks": [
    {
      "task_key": "extract_orders",
      "spark_python_task": {
        "python_file": "dbfs:/jobs/extract_orders.py"
      },
      "new_cluster": {
        "spark_version": "15.4.x-scala2.12",
        "node_type_id": "Standard_DS4_v2",
        "autoscale": { "min_workers": 2, "max_workers": 8 }
      }
    },
    {
      "task_key": "transform_sales",
      "depends_on": [{ "task_key": "extract_orders" }],
      "spark_python_task": {
        "python_file": "dbfs:/jobs/transform_sales.py"
      },
      "new_cluster": {
        "spark_version": "15.4.x-scala2.12",
        "node_type_id": "Standard_DS4_v2",
        "autoscale": { "min_workers": 2, "max_workers": 10 }
      }
    },
    {
      "task_key": "load_warehouse",
      "depends_on": [{ "task_key": "transform_sales" }],
      "spark_python_task": {
        "python_file": "dbfs:/jobs/load_warehouse.py"
      },
      "new_cluster": {
        "spark_version": "15.4.x-scala2.12",
        "node_type_id": "Standard_DS4_v2",
        "num_workers": 4
      }
    }
  ],
  "schedule": {
    "quartz_cron_expression": "0 0 2 * * ?",
    "timezone_id": "UTC"
  },
  "email_notifications": {
    "on_failure": ["data-eng@example.com"]
  }
}
```
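After creating the workflow, one way to validate the converted DAG end to end is to trigger it once and wait for the terminal state. A minimal sketch with `databricks-sdk`, assuming `job_id` is whatever the create call returned:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# run_now returns a waiter; .result() blocks until the run reaches a
# terminal state (SUCCESS, FAILED, ...)
run = w.jobs.run_now(job_id=job_id).result()
print(run.state.result_state)  # expect SUCCESS before cutting over the DAG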
Airflow operator mapping
| Airflow operator (CDE) | Databricks Workflow / ADF equivalent | Notes |
| --- | --- | --- |
| CDEJobRunOperator | Databricks Spark task | Direct mapping. |
| BashOperator | ADF Custom Activity (Azure Batch) | Shell scripts run on Azure Batch. |
| PythonOperator | Databricks Python task / Azure Functions | Python scripts as Spark tasks or serverless Functions. |
| SqlSensor | ADF Lookup activity + Until loop | Poll the database until the condition is met. |
| FileSensor | ADF Get Metadata + Until loop / Event Grid trigger | File-arrival detection (see the polling sketch after this table). |
| ExternalTaskSensor | ADF Execute Pipeline with dependency | Cross-pipeline dependencies. |
| BranchPythonOperator | ADF If Condition / Switch | Conditional branching. |
| TriggerDagRunOperator | ADF Execute Pipeline activity | Trigger another pipeline/workflow. |
| EmailOperator | Logic App (triggered by ADF/Databricks webhook) | Email notifications via Logic App. |
| SlackWebhookOperator | Logic App (Slack connector) | Slack alerts via Logic App. |
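For the FileSensor row, an ADF Get Metadata + Until loop is the managed option; where the polling must live in code instead, a small helper against ADLS Gen2 behaves the same way. A sketch assuming the `azure-identity` and `azure-storage-file-datalake` packages, with a placeholder storage account and a hypothetical `landing` filesystem:

```python
import time

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    "https://storageaccount.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
fs = service.get_file_system_client("landing")


def wait_for_file(path: str, timeout_s: int = 3600, poll_s: int = 60) -> bool:
    """Poll until `path` exists, mirroring an Airflow FileSensor."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if fs.get_file_client(path).exists():
            return True
        time.sleep(poll_s)
    return False
```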
3. CML to Azure ML + Databricks ML
Architecture comparison
| CML component | Azure equivalent | Notes |
| --- | --- | --- |
| CML Workspace | Azure ML Workspace / Databricks Workspace | Both provide Jupyter-style environments. |
| CML Session | Azure ML Compute Instance / Databricks cluster | Interactive compute for development. |
| CML Experiments | MLflow on Databricks / Azure ML Experiments | MLflow tracking is available on both platforms. |
| CML Models (registry) | Databricks Model Registry / Azure ML Model Registry | Model versioning and stage management. |
| CML Model Serving | Databricks Model Serving / Azure ML Managed Endpoints | Real-time inference endpoints. |
| CML Applied ML Prototypes (AMPs) | Databricks Solution Accelerators | Pre-built templates for common ML patterns. |
| CML Projects (Git-backed) | Databricks Repos / Azure ML linked repos | Git integration for version control. |
| CML Jobs (scheduled) | Databricks Jobs / Azure ML Pipelines | Scheduled ML training and scoring. |
Migration decision: Azure ML vs Databricks ML
| Use case | Choose Azure ML | Choose Databricks ML |
| --- | --- | --- |
| Heavy Spark-based feature engineering | No | Yes (native Spark) |
| Traditional ML (scikit-learn, XGBoost) | Yes | Yes |
| Deep learning (PyTorch, TensorFlow) | Yes (GPU clusters) | Yes (GPU clusters) |
| LLM fine-tuning | Yes (Azure AI Foundry) | Yes (Foundation Model APIs) |
| AutoML | Yes (Azure ML AutoML) | Yes (Databricks AutoML) |
| Responsible AI dashboard | Yes | No |
| Already using Databricks for data | No | Yes (unified platform) |
| Complex pipeline orchestration | Yes (Azure ML Pipelines) | Yes (Databricks Workflows) |
| Need endpoint autoscaling | Yes (managed online endpoints) | Yes (Model Serving) |
| Real-time feature serving | Yes (Azure ML managed feature store) | Yes (Databricks Feature Store) |
Recommendation: If your data engineering runs on Databricks, use Databricks ML for tight integration. If you need Responsible AI dashboards, LLM fine-tuning with Azure AI Foundry, or complex multi-step ML pipelines, use Azure ML. Many organizations use both.
CML model migration script
```python
# Step 1: Export the model from CML (run inside the CML workspace)
import cmlapi
import mlflow

# CML API client (useful for enumerating the models and jobs to migrate)
client = cmlapi.default_client()

# Download the production model artifacts from CML's MLflow tracking server
mlflow.set_tracking_uri("https://cml-workspace.example.com/mlflow")
model_uri = "models:/sales_forecast/Production"
local_path = mlflow.artifacts.download_artifacts(model_uri, dst_path="/tmp/models")

# Transfer the artifacts to ADLS, e.g.:
# azcopy copy /tmp/models "https://storage.dfs.core.windows.net/ml/models/sales_forecast/" --recursive

# Step 2: Register the model on Databricks (run on Databricks)
import mlflow

mlflow.set_registry_uri("databricks-uc")

# Re-log the transferred artifacts (from
# abfss://ml@storage.dfs.core.windows.net/models/sales_forecast/) as an
# MLflow run on Databricks; run_id below is the ID of that re-logging run.
mlflow.register_model(
    f"runs:/{run_id}/model",
    "ml_catalog.models.sales_forecast",
)

# Step 3: Deploy as a serving endpoint
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import EndpointCoreConfigInput, ServedModelInput

w = WorkspaceClient()
w.serving_endpoints.create(
    name="sales-forecast-endpoint",
    config=EndpointCoreConfigInput(
        served_models=[
            ServedModelInput(
                model_name="ml_catalog.models.sales_forecast",
                model_version="1",
                workload_size="Small",
                scale_to_zero_enabled=True,
            )
        ]
    ),
)
```
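Once the endpoint reports READY, a quick smoke test confirms the migrated model scores the same as it did on CML. A sketch against the endpoint's invocations API; the workspace URL and feature names are hypothetical, and `token` is assumed to hold a valid PAT or OAuth token.

```python
import requests

url = (
    "https://adb-1234567890123456.7.azuredatabricks.net"
    "/serving-endpoints/sales-forecast-endpoint/invocations"
)
headers = {"Authorization": f"Bearer {token}"}
# dataframe_records is one of the accepted Model Serving input formats
payload = {"dataframe_records": [{"store_id": 42, "week": "2025-06-02"}]}

resp = requests.post(url, headers=headers, json=payload, timeout=30)
resp.raise_for_status()
print(resp.json())  # {"predictions": [...]}
```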
4. CDP Data Warehouse (CDW) to Databricks SQL + Fabric
CDW architecture to Azure mapping
| CDW component | Azure equivalent | Notes |
| --- | --- | --- |
| CDW Hive Virtual Warehouse | Databricks SQL Warehouse | HiveQL to Spark SQL conversion (see playbook Section 6). |
| CDW Impala Virtual Warehouse | Databricks SQL Warehouse | See Impala Migration. |
| CDW auto-scaling | Databricks SQL Serverless auto-scaling | More granular scaling on Databricks. |
| CDW Data Visualization | Power BI | Richer visualization; Direct Lake for lakehouse data. |
| CDW query isolation | Databricks SQL multi-cluster auto-scaling | Each concurrent user group gets its own cluster. |
| CDW Hue interface | Databricks SQL Editor | SQL editor with autocomplete, query history. |
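Provisioning the replacement for a CDW virtual warehouse is a single API call. A minimal sketch with `databricks-sdk`; the name, size, and auto-stop values are illustrative, not prescriptive.

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Multi-cluster scaling (max_num_clusters) is the analogue of CDW query
# isolation: concurrency spills over onto additional clusters.
warehouse = w.warehouses.create(
    name="cdw-hive-replacement",
    cluster_size="Medium",
    min_num_clusters=1,
    max_num_clusters=4,
    auto_stop_mins=20,
    enable_serverless_compute=True,
).result()  # waits until the warehouse is running
print(warehouse.id)
```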
CDW to Fabric SQL endpoint (alternative)
For organizations adopting Microsoft Fabric, CDW workloads can also target the Fabric SQL endpoint:
| CDW feature | Fabric SQL endpoint | Notes |
| --- | --- | --- |
| Interactive SQL on lake data | Fabric SQL endpoint (read Delta via OneLake) | T-SQL syntax instead of HiveQL/Impala SQL. |
| BI serving | Power BI Direct Lake mode | Sub-second dashboard refresh. |
| Data exploration | Fabric Lakehouse notebooks | PySpark + SQL in Fabric notebooks. |
| Scheduled queries | Fabric Data Pipeline (notebook activity) | Scheduled notebook execution. |
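Because the Fabric SQL endpoint speaks the SQL Server TDS protocol, existing SQL tooling connects through the standard ODBC driver. A sketch assuming `pyodbc` with ODBC Driver 18 installed; the server, database, and table names are placeholders.

```python
import pyodbc

# Fabric SQL endpoints authenticate with Entra ID; interactive auth shown here
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<workspace-guid>.datawarehouse.fabric.microsoft.com;"
    "Database=sales_lakehouse;"
    "Authentication=ActiveDirectoryInteractive;"
)

# T-SQL against Delta tables exposed through OneLake
for row in conn.execute("SELECT TOP 5 * FROM dbo.daily_sales"):
    print(row)
```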
When to choose Fabric vs Databricks SQL:
| Scenario | Fabric SQL endpoint | Databricks SQL |
| --- | --- | --- |
| Organization is Microsoft 365-heavy | Yes | Maybe |
| Heavy Spark workloads alongside SQL | Maybe | Yes |
| Need T-SQL compatibility | Yes | No (Spark SQL) |
| Need Unity Catalog governance | No | Yes |
| BI-primary workload (Power BI) | Yes (Direct Lake) | Yes (via connector) |
| Mixed workload (ETL + SQL + ML) | Maybe | Yes |
5. Migration order for CDP components
Recommended sequence
```mermaid
flowchart TD
    A[1. Migrate CDW Hive VW<br/>to Databricks SQL] --> B[2. Migrate CDW Impala VW<br/>to Databricks SQL]
    B --> C[3. Migrate CDE Spark Jobs<br/>to Databricks Jobs]
    C --> D[4. Migrate CDE Airflow<br/>to Databricks Workflows / ADF]
    D --> E[5. Migrate CML Experiments<br/>to Databricks ML / Azure ML]
    E --> F[6. Migrate CML Model Serving<br/>to Databricks Serving / Azure ML Endpoints]
    F --> G[7. Decommission CDP]
```
Rationale for this order
- CDW first: SQL workloads are the easiest to validate (row counts, checksums) and have the highest business visibility (dashboards break immediately if wrong).
- CDE Spark next: Spark code is highly portable; the main changes are path updates and YARN config removal.
- CDE Airflow after Spark: Orchestration migration depends on the compute tasks being available on the target platform.
- CML last: ML workloads are often the most self-contained and can continue running on CML while other components migrate.
6. CDP vs CDH migration differences
If you are migrating from CDP rather than CDH, several things are easier:
| Migration aspect | CDH migration | CDP migration |
| --- | --- | --- |
| Data location | HDFS on bare metal; requires Data Box or network transfer | If CDP Public Cloud: data already in cloud storage |
| Spark version | CDH ships Spark 2.x (old); upgrade to Spark 3.x needed | CDP ships Spark 3.x; direct port to Databricks |
| Hive version | CDH ships Hive 2.x; more syntax differences | CDP ships Hive 3.x; fewer syntax changes |
| Kerberos | Deep Kerberos integration in all services | CDP supports Kerberos but also token-based auth |
| Container awareness | CDH is bare-metal/VM only | CDP Private Cloud runs on Kubernetes; familiar concepts |
| API maturity | CDH APIs are older; more manual work | CDP APIs are modern REST; easier to script migration |
| MLflow | Not available on CDH | CML includes MLflow; experiments port directly |
CDP migration checklist
- Inventory CDE virtual clusters, Spark jobs, Airflow DAGs, CML projects, and CDW virtual warehouses
- Migrate CDW Hive and Impala virtual warehouses to Databricks SQL; validate with row counts and checksums
- Convert CDE Spark jobs to Databricks Jobs (storage paths, cluster specs, no YARN configs)
- Rebuild CDE Airflow DAGs as Databricks Workflows or ADF pipelines
- Export CML experiments and models via MLflow; re-register them in Unity Catalog or Azure ML
- Cut over model serving endpoints and BI dashboards, then decommission CDP
Next steps
- Review the Migration Playbook for the full HDFS/Hive/Spark/Oozie migration
- See the Impala Migration Guide for CDW Impala-specific conversion
- Review the Benchmarks for CDP vs Azure performance data
- Read the Best Practices for cluster-by-cluster migration strategy
Last updated: 2026-04-30 · Maintainers: CSA-in-a-Box core team