CDP Data Engineering Migration: CDE, CML, CDW to Azure

Overview

CDP is Cloudera's modern platform, built on Kubernetes and available in Private Cloud and Public Cloud editions. Organizations on CDP are in a different position than CDH shops: the infrastructure is more modern, the APIs are cleaner, and the migration paths are more direct. The core economic and strategic arguments for Azure migration still apply, however: CDP licensing costs are rising, and Azure-native services can provide broader capabilities at a lower total cost.

This guide covers three CDP-specific components (CDE, CML, and CDW) that require dedicated migration approaches beyond the core HDFS/Hive/Spark/Oozie playbook.

1. CDE Virtual Clusters to Databricks Workspaces

Architecture comparison

CDE concept	Databricks equivalent	Notes
CDE Service	Databricks account	Top-level container for all resources.
Virtual Cluster	Databricks Workspace	Isolated environment with its own compute, notebooks, and jobs.
CDE Spark job	Databricks Job (Spark task)	Spark application submitted for scheduled or on-demand execution.
CDE resource (files/archives)	Databricks DBFS / Unity Catalog Volumes	File storage for job dependencies.
CDE job run	Databricks Job Run	Individual execution of a job.
CDE CLI	Databricks CLI / REST API	Command-line management interface.
CDE API	Databricks REST API / SDK	Programmatic access to all functionality.

CDE job definition to Databricks job

# CDE job definition (CDE CLI format)
name: daily-sales-etl
type: spark
application-file: s3a://cde-resources/jobs/sales_etl.py
driver-cores: 2
driver-memory: 4g
executor-cores: 4
executor-memory: 8g
num-executors: 10
conf:
    spark.sql.shuffle.partitions: 400
schedule:
    enabled: true
    cron-expression: "0 2 * * *"
    start: "2025-01-01T00:00:00Z"

// Databricks Job definition (Jobs API 2.1 format)
{
    "name": "daily-sales-etl",
    "tasks": [
        {
            "task_key": "run_sales_etl",
            "spark_python_task": {
                "python_file": "dbfs:/jobs/sales_etl.py"
            },
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",
                "node_type_id": "Standard_DS4_v2",
                "driver_node_type_id": "Standard_DS4_v2",
                "autoscale": {
                    "min_workers": 4,
                    "max_workers": 10
                },
                "spark_conf": {
                    "spark.sql.shuffle.partitions": "400"
                }
            }
        }
    ],
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",
        "timezone_id": "UTC"
    }
}
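
The field-by-field translation above can be sketched in Python. This is a minimal sketch, not a complete converter: `cron_to_quartz` and `cde_job_to_databricks` are hypothetical helper names, the CDE job is assumed to be already parsed into a dict with the keys shown above, and mapping `num-executors` one-to-one onto `num_workers` is a simplifying assumption (a worker node may have room for more than one executor).

```python
import json


def cron_to_quartz(expr: str) -> str:
    """Convert a 5-field cron expression to a 6-field Quartz expression."""
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 cron fields, got {len(fields)}: {expr!r}")
    minute, hour, dom, month, dow = fields
    # Quartz requires '?' in exactly one of day-of-month / day-of-week.
    if dow == "*":
        dow = "?"
    elif any(ch.isdigit() for ch in dow):
        # Numeric weekdays are numbered differently in Quartz; not handled here.
        raise ValueError("use weekday names (MON, TUE, ...) instead of numbers")
    elif dom == "*":
        dom = "?"
    else:
        raise ValueError("cannot restrict both day-of-month and day-of-week")
    # Quartz adds a leading seconds field.
    return " ".join(["0", minute, hour, dom, month, dow])


def cde_job_to_databricks(cde_job, python_file,
                          node_type_id="Standard_DS4_v2",
                          spark_version="15.4.x-scala2.12"):
    """Build a Jobs API 2.1 style payload from a parsed CDE job definition."""
    payload = {
        "name": cde_job["name"],
        "tasks": [{
            "task_key": "run_" + cde_job["name"].replace("-", "_"),
            "spark_python_task": {"python_file": python_file},
            "new_cluster": {
                "spark_version": spark_version,
                "node_type_id": node_type_id,
                "driver_node_type_id": node_type_id,
                # Assumption: one worker per CDE executor.
                "num_workers": int(cde_job["num-executors"]),
                # Databricks expects Spark conf values as strings.
                "spark_conf": {k: str(v) for k, v in cde_job.get("conf", {}).items()},
            },
        }],
    }
    schedule = cde_job.get("schedule", {})
    if schedule.get("enabled"):
        payload["schedule"] = {
            "quartz_cron_expression": cron_to_quartz(schedule["cron-expression"]),
            "timezone_id": "UTC",
        }
    return payload


cde_job = {
    "name": "daily-sales-etl",
    "num-executors": 10,
    "conf": {"spark.sql.shuffle.partitions": 400},
    "schedule": {"enabled": True, "cron-expression": "0 2 * * *"},
}
payload = cde_job_to_databricks(cde_job, "dbfs:/jobs/sales_etl.py")
print(json.dumps(payload, indent=4))
```

The sketch uses a fixed `num_workers` to mirror the fixed executor count in the CDE definition; switching to an `autoscale` block, as in the hand-written JSON above, is a separate sizing decision.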

Key differences in job management

CDE behavior	Databricks behavior	Migration note
Virtual cluster has fixed CPU and memory quotas	Clusters are per-job or shared (job clusters vs all-purpose)	Use job clusters for production; all-purpose for development.
Resources uploaded via CDE CLI	Libraries attached via cluster config or job config	Upload JARs/wheels to DBFS or Unity Catalog Volumes.
CDE manages Spark versions per VC	Databricks Runtime version set per cluster	Choose DBR version in cluster config.
CDE auto-scales within VC limits	Databricks auto-scales per cluster policy	Set min/max workers in cluster or policy.
CDE resource isolation via namespace	Databricks workspace isolation + Unity Catalog	Workspace-level isolation; data access via Unity Catalog grants.
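
As an illustration of the resource-to-library row above, the following sketch turns a list of CDE resource file names into the `libraries` entries that a Databricks job task accepts. The helper name `libraries_from_resources` and the volume path `/Volumes/main/etl/deps` are hypothetical, and the files are assumed to have been uploaded to that Unity Catalog Volume already.

```python
from pathlib import PurePosixPath


def libraries_from_resources(file_names, volume_root):
    """Map CDE resource file names to Databricks task library entries."""
    libraries = []
    for name in file_names:
        path = str(PurePosixPath(volume_root) / name)
        suffix = PurePosixPath(name).suffix.lower()
        if suffix == ".whl":
            libraries.append({"whl": path})
        elif suffix == ".jar":
            libraries.append({"jar": path})
        # Other files (scripts, configs) are not libraries;
        # reference them by path from the job code instead.
    return libraries


resources = ["etl_utils-1.0-py3-none-any.whl", "udfs.jar", "settings.conf"]
libs = libraries_from_resources(resources, "/Volumes/main/etl/deps")
print(libs)
```

The resulting list can be placed under a task's `libraries` key in the job definition.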

2. CDE Airflow to Databricks Workflows

CDE includes Apache Airflow for orchestration. This is one of the more straightforward migrations because Databricks Workflows provides a native alternative, and Azure Data Factory (ADF) provides a broader orchestration layer.

Migration targets

CDE Airflow pattern	Target	When to use
Simple Spark DAG (all tasks are Spark)	Databricks Workflows (multi-task job)	All tasks run on Databricks compute.
Mixed DAG (Spark + SQL + shell + API)	ADF Pipeline	Cross-service orchestration (Databricks + SQL + Logic Apps).
Complex DAG with branching/dynamic tasks	Databricks Workflows + ADF	Databricks for compute tasks; ADF for cross-service logic.
Airflow sensors (file/time/external)	ADF triggers (schedule/event/tumbling window)	ADF trigger types replace Airflow sensor patterns.
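
The decision table above can be approximated with a small classifier over the operator class names used in a DAG. This is a rough sketch under stated assumptions: `choose_target` is a hypothetical helper, any class name ending in `Sensor` is treated as a sensor, and dynamic task generation cannot be detected from operator names, so such DAGs still need manual review.

```python
SPARK_OPERATORS = {"CDEJobRunOperator"}
BRANCHING_OPERATORS = {"BranchPythonOperator"}


def choose_target(operators):
    """Suggest a migration target from the operator class names in one DAG."""
    ops = set(operators)
    if not ops:
        raise ValueError("DAG has no operators")
    sensors = {op for op in ops if op.endswith("Sensor")}
    core = ops - sensors
    if core & BRANCHING_OPERATORS:
        target = "Databricks Workflows + ADF"
    elif core and core <= SPARK_OPERATORS:
        target = "Databricks Workflows (multi-task job)"
    else:
        target = "ADF Pipeline"
    notes = []
    if sensors:
        notes.append("Replace sensors with ADF triggers: " + ", ".join(sorted(sensors)))
    return target, notes


print(choose_target(["CDEJobRunOperator", "CDEJobRunOperator"]))
print(choose_target(["CDEJobRunOperator", "BashOperator"]))
print(choose_target(["CDEJobRunOperator", "BranchPythonOperator", "FileSensor"]))
```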

Airflow DAG to Databricks Workflow conversion

# CDE Airflow DAG (before)
from airflow import DAG
from cloudera.cdp.airflow.operators.cde_operator import CDEJobRunOperator
from datetime import datetime

dag = DAG(
    'daily_sales_pipeline',
    schedule_interval='0 2 * * *',
    start_date=datetime(2025, 1, 1),
    catchup=False
)

extract = CDEJobRunOperator(
    task_id='extract_orders',
    job_name='extract_orders_job',
    dag=dag
)

transform = CDEJobRunOperator(
    task_id='transform_sales',
    job_name='transform_sales_job',
    dag=dag
)

load = CDEJobRunOperator(
    task_id='load_warehouse',
    job_name='load_warehouse_job',
    dag=dag
)

extract >> transform >> load

// Databricks Workflow (after)
{
    "name": "daily_sales_pipeline",
    "tasks": [
        {
            "task_key": "extract_orders",
            "spark_python_task": {
                "python_file": "dbfs:/jobs/extract_orders.py"
            },
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",
                "node_type_id": "Standard_DS4_v2",
                "autoscale": { "min_workers": 2, "max_workers": 8 }
            }
        },
        {
            "task_key": "transform_sales",
            "depends_on": [{ "task_key": "extract_orders" }],
            "spark_python_task": {
                "python_file": "dbfs:/jobs/transform_sales.py"
            },
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",
                "node_type_id": "Standard_DS4_v2",
                "autoscale": { "min_workers": 2, "max_workers": 10 }
            }
        },
        {
            "task_key": "load_warehouse",
            "depends_on": [{ "task_key": "transform_sales" }],
            "spark_python_task": {
                "python_file": "dbfs:/jobs/load_warehouse.py"
            },
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",
                "node_type_id": "Standard_DS4_v2",
                "num_workers": 4
            }
        }
    ],
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",
        "timezone_id": "UTC"
    },
    "email_notifications": {
        "on_failure": ["data-eng@example.com"]
    }
}
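
For DAGs with more than a handful of tasks, the `depends_on` entries can be generated from the dependency edges rather than written by hand. The sketch below assumes a hypothetical helper `build_workflow` and a dict that maps each task to its upstream tasks. It also uses one shared job cluster (`job_clusters` with `job_cluster_key`) instead of a `new_cluster` per task, which is a design choice of this sketch and not something the conversion above requires.

```python
import json


def build_workflow(name, upstream, script_root, cluster):
    """Build a multi-task job payload; `upstream` maps task -> list of parents."""
    tasks = []
    for task_key, parents in upstream.items():
        unknown = [p for p in parents if p not in upstream]
        if unknown:
            raise ValueError(f"{task_key} depends on unknown tasks: {unknown}")
        task = {
            "task_key": task_key,
            "job_cluster_key": "shared_etl_cluster",
            "spark_python_task": {"python_file": f"{script_root}/{task_key}.py"},
        }
        if parents:
            # Equivalent of Airflow's `parent >> task`.
            task["depends_on"] = [{"task_key": p} for p in parents]
        tasks.append(task)
    return {
        "name": name,
        "job_clusters": [
            {"job_cluster_key": "shared_etl_cluster", "new_cluster": cluster}
        ],
        "tasks": tasks,
    }


# extract >> transform >> load, expressed as task -> upstream tasks
upstream = {
    "extract_orders": [],
    "transform_sales": ["extract_orders"],
    "load_warehouse": ["transform_sales"],
}
cluster = {
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "Standard_DS4_v2",
    "autoscale": {"min_workers": 2, "max_workers": 10},
}
workflow = build_workflow("daily_sales_pipeline", upstream, "dbfs:/jobs", cluster)
print(json.dumps(workflow, indent=4))
```

A shared job cluster avoids paying cluster start-up time for each task, at the cost of sizing one cluster for the heaviest task.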

Airflow operator mapping

Airflow operator (CDE)	Databricks Workflow / ADF equivalent	Notes
`CDEJobRunOperator`	Databricks Spark task	Direct mapping.
`BashOperator`	ADF Custom Activity (Azure Batch)	Shell scripts run on Azure Batch.
`PythonOperator`	Databricks Python task / Azure Functions	Python scripts as Spark tasks or serverless Functions.
`SqlSensor`	ADF Lookup Activity + Until loop	Poll database until condition met.
`FileSensor`	ADF GetMetadata + Until loop / Event Grid trigger	File arrival detection.
`ExternalTaskSensor`	ADF Execute Pipeline with dependency	Cross-pipeline dependencies.
`BranchPythonOperator`	ADF If Condition / Switch	Conditional branching.
`TriggerDagRunOperator`	ADF Execute Pipeline activity	Trigger another pipeline/workflow.
`EmailOperator`	Logic App (triggered by ADF/Databricks webhook)	Email notifications via Logic App.
`SlackWebhookOperator`	Logic App (Slack connector)	Slack alerts via Logic App.
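
Before converting DAGs, it can help to inventory which operators are actually in use. The following sketch parses DAG source with Python's `ast` module and looks up each operator in a dict that copies the mapping table above. `inventory_operators` is a hypothetical helper, and matching on names that end in `Operator` or `Sensor` is a heuristic that will miss custom classes named differently.

```python
import ast
from collections import Counter

# Copied from the operator mapping table above.
OPERATOR_TARGETS = {
    "CDEJobRunOperator": "Databricks Spark task",
    "BashOperator": "ADF Custom Activity (Azure Batch)",
    "PythonOperator": "Databricks Python task / Azure Functions",
    "SqlSensor": "ADF Lookup Activity + Until loop",
    "FileSensor": "ADF GetMetadata + Until loop / Event Grid trigger",
    "ExternalTaskSensor": "ADF Execute Pipeline with dependency",
    "BranchPythonOperator": "ADF If Condition / Switch",
    "TriggerDagRunOperator": "ADF Execute Pipeline activity",
    "EmailOperator": "Logic App (triggered by ADF/Databricks webhook)",
    "SlackWebhookOperator": "Logic App (Slack connector)",
}


def inventory_operators(dag_source):
    """Count calls to classes whose names end in Operator or Sensor."""
    counts = Counter()
    for node in ast.walk(ast.parse(dag_source)):
        if isinstance(node, ast.Call):
            func = node.func
            # Handles both `BashOperator(...)` and `module.BashOperator(...)`.
            name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", "")
            if name.endswith(("Operator", "Sensor")):
                counts[name] += 1
    return counts


dag_source = '''
extract = CDEJobRunOperator(task_id="extract_orders", job_name="extract_orders_job")
notify = EmailOperator(task_id="notify", to="data-eng@example.com",
                       subject="done", html_content="ok")
wait = FileSensor(task_id="wait_for_file", filepath="/data/in/orders.csv")
'''

counts = inventory_operators(dag_source)
for name, n in sorted(counts.items()):
    target = OPERATOR_TARGETS.get(name, "no mapping in table; review manually")
    print(f"{name} x{n} -> {target}")
```

The source is only parsed, never executed, so the scan can run on a machine without Airflow installed.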

3. CML to Azure ML + Databricks ML

Architecture comparison

CML component	Azure equivalent	Notes
CML Workspace	Azure ML Workspace / Databricks Workspace	Both provide Jupyter-style environments.
CML Session	Azure ML Compute Instance / Databricks cluster	Interactive compute for development.
CML Experiments	MLflow on Databricks / Azure ML Experiments	MLflow tracking is available on both platforms.
CML Models (registry)	Databricks Model Registry / Azure ML Model Registry	Model versioning and stage management.
CML Model Serving	Databricks Model Serving / Azure ML Managed Endpoints	Real-time inference endpoints.
CML Applied ML Prototypes (AMPs)	Databricks Solution Accelerators	Pre-built templates for common ML patterns.
CML Projects (Git-backed)	Databricks Repos / Azure ML linked repos	Git integration for version control.
CML Jobs (scheduled)	Databricks Jobs / Azure ML Pipelines	Scheduled ML training and scoring.

Migration decision: Azure ML vs Databricks ML

Use case	Choose Azure ML	Choose Databricks ML
Heavy Spark-based feature engineering	No	Yes (native Spark)
Traditional ML (scikit-learn, XGBoost)	Yes	Yes
Deep learning (PyTorch, TensorFlow)	Yes (GPU clusters)	Yes (GPU clusters)
LLM fine-tuning	Yes (Azure AI Foundry)	Yes (Foundation Model APIs)
AutoML	Yes (Azure ML AutoML)	Yes (Databricks AutoML)
Responsible AI dashboard	Yes	No
Already using Databricks for data	No	Yes (unified platform)
Complex pipeline orchestration	Yes (Azure ML Pipelines)	Databricks Workflows
Need endpoint autoscaling	Yes (managed online endpoints)	Yes (Model Serving)
Real-time feature serving	Via Databricks Feature Store	Yes (Databricks Feature Store)

Recommendation: If your data engineering runs on Databricks, use Databricks ML for tight integration. If you need Responsible AI dashboards, LLM fine-tuning with Azure AI Foundry, or complex multi-step ML pipelines, use Azure ML. Many organizations use both.

CML model migration script

# Step 1: Export model from CML (run in a CML session)
import cmlapi
import mlflow

# Connect to CML (client is available for CML API calls;
# the MLflow download below does not use it)
client = cmlapi.default_client()

# Download the Production-stage model artifacts to local disk
mlflow.set_tracking_uri("https://cml-workspace.example.com/mlflow")
model_uri = "models:/sales_forecast/Production"
local_path = mlflow.artifacts.download_artifacts(artifact_uri=model_uri, dst_path="/tmp/models")

# Transfer to ADLS Gen2 (AzCopy takes https:// URLs, not abfss://):
# azcopy copy "/tmp/models/*" "https://storage.dfs.core.windows.net/ml/models/sales_forecast/" --recursive

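# Step 2: Register model on Databricks (run in a Databricks notebook)
import mlflow

mlflow.set_registry_uri("databricks-uc")

# Model artifacts copied to ADLS in Step 1
model_path = "abfss://ml@storage.dfs.core.windows.net/models/sales_forecast/"

# Copy the artifacts to driver-local disk (dbutils is provided by the notebook runtime)
local_model_dir = "/tmp/models/sales_forecast"
dbutils.fs.cp(model_path, f"file:{local_model_dir}", recurse=True)

# Log the artifacts under a new run so there is a runs:/ URI to register
with mlflow.start_run(run_name="import_sales_forecast") as run:
    mlflow.log_artifacts(local_model_dir, artifact_path="model")

# Register in Unity Catalog (the MLmodel file must include a model signature)
mlflow.register_model(
    f"runs:/{run.info.run_id}/model",
    "ml_catalog.models.sales_forecast"
)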

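# Step 3: Deploy as serving endpoint
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import EndpointCoreConfigInput, ServedEntityInput

w = WorkspaceClient()

# The SDK expects typed config objects rather than a plain dict
w.serving_endpoints.create(
    name="sales-forecast-endpoint",
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name="ml_catalog.models.sales_forecast",
                entity_version="1",
                workload_size="Small",
                scale_to_zero_enabled=True,
            )
        ]
    ),
)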
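
Because the model artifacts are copied between platforms in Step 1, it may be worth confirming that nothing was dropped or altered in transit before registering the model. The sketch below builds a SHA-256 manifest of a model directory and compares two manifests. `artifact_manifest` and `diff_manifests` are hypothetical helpers, and the temporary directories stand in for the CML export and the copy pulled back from ADLS.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def artifact_manifest(root):
    """Return {relative_path: sha256_hex} for every file under root."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def diff_manifests(source, target):
    """Return (files missing on target, files whose contents differ)."""
    missing = sorted(set(source) - set(target))
    changed = sorted(k for k in source.keys() & target.keys() if source[k] != target[k])
    return missing, changed


# Demo with stand-in directories instead of real CML and ADLS locations.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "cml_export"
    src.mkdir()
    (src / "MLmodel").write_text("flavors: {}\n")
    dst = Path(tmp) / "adls_copy"
    shutil.copytree(src, dst)
    clean = diff_manifests(artifact_manifest(src), artifact_manifest(dst))

    # Simulate a corrupted copy.
    (dst / "MLmodel").write_text("corrupted\n")
    dirty = diff_manifests(artifact_manifest(src), artifact_manifest(dst))

print(clean)
print(dirty)
```

The manifest from the CML side can be saved as JSON alongside the artifacts and compared on the Databricks side after the copy.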

4. CDP Data Warehouse (CDW) to Databricks SQL + Fabric

CDW architecture to Azure mapping

CDW component	Azure equivalent	Notes
CDW Hive Virtual Warehouse	Databricks SQL Warehouse	HiveQL to Spark SQL conversion (see playbook Section 6).
CDW Impala Virtual Warehouse	Databricks SQL Warehouse	See Impala Migration.
CDW auto-scaling	Databricks SQL Serverless auto-scaling	More granular scaling on Databricks.
CDW Data Visualization	Power BI	Richer visualization; Direct Lake for lakehouse data.
CDW query isolation	Separate Databricks SQL warehouses, each with multi-cluster auto-scaling	Give each user group its own warehouse; clusters are added within a warehouse as concurrency grows.
CDW Hue interface	Databricks SQL Editor	SQL editor with autocomplete, query history.

CDW to Fabric SQL endpoint (alternative)

For organizations adopting Microsoft Fabric, CDW workloads can also target Fabric SQL endpoint:

CDW feature	Fabric SQL endpoint	Notes
Interactive SQL on lake data	Fabric SQL endpoint (read Delta via OneLake)	T-SQL syntax instead of HiveQL/Impala SQL.
BI serving	Power BI Direct Lake mode	Reads Delta tables in OneLake directly, with no scheduled import refresh.
Data exploration	Fabric Lakehouse notebooks	PySpark + SQL in Fabric notebooks.
Scheduled queries	Fabric Data Pipeline (notebook activity)	Scheduled notebook execution.

When to choose Fabric vs Databricks SQL:

Scenario	Fabric SQL endpoint	Databricks SQL
Organization is Microsoft 365-heavy	Yes	Maybe
Heavy Spark workloads alongside SQL	Maybe	Yes
Need T-SQL compatibility	Yes	No (Spark SQL)
Need Unity Catalog governance	No	Yes
BI-primary workload (Power BI)	Yes (Direct Lake)	Yes (via connector)
Mixed workload (ETL + SQL + ML)	Maybe	Yes

5. Migration order for CDP components

Recommended sequence

flowchart TD
    A[1. Migrate CDW Hive VW<br/>to Databricks SQL] --> B[2. Migrate CDW Impala VW<br/>to Databricks SQL]
    B --> C[3. Migrate CDE Spark Jobs<br/>to Databricks Jobs]
    C --> D[4. Migrate CDE Airflow<br/>to Databricks Workflows / ADF]
    D --> E[5. Migrate CML Experiments<br/>to Databricks ML / Azure ML]
    E --> F[6. Migrate CML Model Serving<br/>to Databricks Serving / Azure ML Endpoints]
    F --> G[7. Decommission CDP]

Rationale for this order

CDW first: SQL workloads are the easiest to validate (row counts, checksums) and have the highest business visibility (dashboards break immediately if wrong).
CDE Spark next: Spark code is highly portable; the main changes are path updates and removal of cluster-specific Spark configuration (for example, YARN settings carried over from CDH).
CDE Airflow after Spark: Orchestration migration depends on the compute tasks being available on the target platform.
CML last: ML workloads are often the most self-contained and can continue running on CML while other components migrate.
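
The row-count and checksum validation mentioned for CDW can be sketched as a comparison of per-table metrics collected on each side. This is a minimal sketch: `compare_table_metrics` is a hypothetical helper, the table names and numbers are made-up sample data, and the metrics are assumed to have been computed already with queries that produce comparable checksums on both engines.

```python
def compare_table_metrics(source, target):
    """Compare {table: (row_count, checksum)} dicts and list discrepancies."""
    problems = []
    for table, (src_rows, src_sum) in sorted(source.items()):
        if table not in target:
            problems.append(f"{table}: missing on target")
            continue
        tgt_rows, tgt_sum = target[table]
        if src_rows != tgt_rows:
            problems.append(f"{table}: row count {src_rows} != {tgt_rows}")
        elif src_sum != tgt_sum:
            problems.append(f"{table}: checksum mismatch")
    return problems


# Made-up sample metrics: CDW on the left, Databricks SQL on the right.
cdw_metrics = {
    "sales.orders": (1000, "a1b2"),
    "sales.customers": (250, "c3d4"),
    "sales.returns": (40, "e5f6"),
}
dbsql_metrics = {
    "sales.orders": (1000, "a1b2"),
    "sales.customers": (249, "c3d4"),
}

problems = compare_table_metrics(cdw_metrics, dbsql_metrics)
for line in problems:
    print(line)
```

An empty result is a reasonable gate for cutting a dashboard over to the new warehouse.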

6. CDP vs CDH migration differences

If you are migrating from CDP rather than CDH, several things are easier:

Migration aspect	CDH migration	CDP migration
Data location	HDFS on bare metal; requires Data Box or network transfer	If CDP Public Cloud: data already in cloud storage
Spark version	CDH ships Spark 2.x (old); upgrade to Spark 3.x needed	CDP ships Spark 3.x; direct port to Databricks
Hive version	CDH ships Hive 2.x; more syntax differences	CDP ships Hive 3.x; fewer syntax changes
Kerberos	Deep Kerberos integration in all services	CDP supports Kerberos but also token-based auth
Container awareness	CDH is bare-metal/VM only	CDP Private Cloud runs on Kubernetes; familiar concepts
API maturity	CDH APIs are older; more manual work	CDP APIs are modern REST; easier to script migration
MLflow	Not available on CDH	CML includes MLflow; experiments port directly

CDP migration checklist

Next steps

Review the Migration Playbook for the full HDFS/Hive/Spark/Oozie migration
See the Impala Migration Guide for CDW Impala-specific conversion
Review the Benchmarks for CDP vs Azure performance data
Read the Best Practices for cluster-by-cluster migration strategy

Last updated: 2026-04-30
Maintainers: CSA-in-a-Box core team