Best Practices: Cloudera to Azure Migration¶
Migration strategy, planning patterns, team structure, and common pitfalls for organizations moving from CDH or CDP to Azure.
1. Cluster-by-cluster migration strategy¶
Large Cloudera deployments often run multiple clusters serving different teams or workloads. Do not migrate them all at once. Use a cluster-by-cluster approach that proves the pattern on the lowest-risk cluster first and then accelerates through the remaining clusters.
Cluster prioritization matrix¶
| Factor | Weight | How to score |
|---|---|---|
| Business criticality | High | Revenue-generating = high risk; internal analytics = lower risk |
| Technical complexity | High | UDF count, NiFi flow count, custom SerDes = higher complexity |
| Data volume | Medium | > 100 TB requires Data Box or WANdisco; extends the timeline |
| Team readiness | Medium | Team with cloud experience migrates faster |
| Interdependencies | High | Clusters that feed downstream clusters must migrate in order |
| License expiration | High | Clusters with expiring CDH support migrate first |
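To turn the matrix into an ordering, a simple weighted score is usually enough. The sketch below is illustrative only: the weights, the 1-5 scores, and the cluster names are assumptions to replace with your own assessment, and a cluster with imminent license expiration may jump the queue regardless of its score.

```python
# Illustrative weighted scoring for cluster migration order.
# Weights mirror the matrix above (High = 3, Medium = 2); scores run from 1 (low risk) to 5 (high risk).
WEIGHTS = {"business_criticality": 3, "technical_complexity": 3, "data_volume": 2,
           "team_readiness": 2, "interdependencies": 3}

clusters = {  # hypothetical example clusters
    "dev-test":  {"business_criticality": 1, "technical_complexity": 2, "data_volume": 1,
                  "team_readiness": 2, "interdependencies": 1},
    "analytics": {"business_criticality": 3, "technical_complexity": 3, "data_volume": 3,
                  "team_readiness": 3, "interdependencies": 2},
    "prod-etl":  {"business_criticality": 5, "technical_complexity": 4, "data_volume": 4,
                  "team_readiness": 3, "interdependencies": 5},
}

def migration_risk(scores: dict) -> int:
    """Higher score = higher migration risk; migrate lower-scoring clusters first."""
    return sum(WEIGHTS[factor] * value for factor, value in scores.items())

for name, scores in sorted(clusters.items(), key=lambda kv: migration_risk(kv[1])):
    print(f"{name}: {migration_risk(scores)}")
```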
Recommended cluster migration order¶
flowchart TD
A[Cluster 1: Dev/Test<br/>Low risk, small data<br/>Prove the pattern] --> B[Cluster 2: Analytics/BI<br/>Medium risk, read-heavy<br/>High visibility]
B --> C[Cluster 3: ETL/Data Engineering<br/>Medium risk, write-heavy<br/>Most Spark jobs]
C --> D[Cluster 4: Production/Revenue<br/>High risk, critical SLAs<br/>Migrate last]
D --> E[Decommission all CDH hardware]
What each cluster migration proves¶
| Cluster | What it proves | What you learn |
|---|---|---|
| Dev/Test | Data transfer pipeline works. Delta conversion works. Spark code ports cleanly. | Transfer throughput, conversion issues, team familiarity. |
| Analytics/BI | Impala-to-Databricks SQL works. BI tool connections work. Users can use the new platform. | SQL dialect issues, BI reconnection patterns, user training needs. |
| ETL/Data Engineering | Spark jobs run in production. Oozie-to-ADF works. NiFi-to-ADF works. | Job scheduling patterns, error handling, monitoring setup. |
| Production/Revenue | Full platform works at scale under SLAs. Security model is complete. | Cutover procedure, parallel-run validation, incident response on Azure. |
2. CDP vs CDH: differences that affect migration planning¶
If you are migrating from CDP rather than CDH, several aspects of the migration are different. Plan accordingly.
Migration planning matrix¶
| Aspect | CDH migration | CDP Private Cloud migration | CDP Public Cloud migration |
|---|---|---|---|
| Data location | HDFS on bare metal; full data lift required | HDFS on bare metal/VM; full data lift required | Cloud object storage; data lift is minimal |
| Spark version | Spark 2.x; upgrade to 3.x during migration | Spark 3.x; direct port to Databricks | Spark 3.x; direct port to Databricks |
| Hive version | Hive 2.x; more syntax differences vs Spark SQL | Hive 3.x; fewer differences | Hive 3.x; fewer differences |
| Security model | Kerberos + Ranger (or legacy Sentry) | Kerberos + Ranger + Knox | IDBroker + Ranger + Knox |
| Container platform | None (bare metal) | Kubernetes (ECS/OCP); teams may have K8s experience | Managed by Cloudera; teams may lack K8s experience |
| API maturity | Older CM API; more manual work | CDP CLI + REST API; scriptable | CDP CLI + REST API; scriptable |
| CDE / CML / CDW | Not available | Available; migration to Databricks/Azure ML | Available; migration to Databricks/Azure ML |
| Expected timeline | Longer (data lift + format conversion + code port) | Medium (data lift + code port; less format conversion) | Shorter (minimal data lift; code port only) |
CDP Public Cloud advantage¶
Organizations on CDP Public Cloud have a significant head start: their data is already in cloud object storage (S3, ADLS, or GCS). The migration is primarily a compute and governance layer swap:
- Point Databricks at existing cloud storage
- Convert tables to Delta format (in place, using CONVERT TO DELTA, as sketched below)
- Port Spark jobs (remove Cloudera-specific configs)
- Replace Ranger with Unity Catalog grants
- Replace CDE with Databricks Workflows
- Decommission CDP Public Cloud subscription
This can be accomplished in 8-12 weeks for a typical deployment, versus 16-30 weeks for a CDH on-prem migration.
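As a minimal sketch of the in-place conversion step, assuming Databricks can already reach the existing storage account; the table names, storage path, and partition spec are placeholders:

```python
# In-place conversion of existing Parquet data to Delta -- no data copy required.
# Table names, path, and partition columns below are placeholders.

# Convert a path-based Parquet dataset:
spark.sql("""
    CONVERT TO DELTA parquet.`abfss://data@<storage-account>.dfs.core.windows.net/warehouse/sales/orders`
    PARTITIONED BY (order_date DATE)
""")

# Or convert a table already registered in the metastore:
spark.sql("CONVERT TO DELTA sales.orders")
```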
3. Service decomposition strategy¶
Cloudera clusters bundle many services on shared infrastructure. On Azure, each service runs independently. This decomposition is an opportunity to right-size each component.
Decomposition map¶
flowchart LR
subgraph CDH["CDH Cluster (monolithic)"]
HDFS
YARN
Hive
Spark
Impala
Oozie
Kafka
NiFi
Ranger
Atlas
ZK[ZooKeeper]
end
subgraph Azure["Azure (decomposed)"]
ADLS[ADLS Gen2]
DBX[Databricks]
ADF[Data Factory]
EH[Event Hubs]
PV[Purview]
UC[Unity Catalog]
AM[Azure Monitor]
end
HDFS --> ADLS
YARN --> DBX
Hive --> DBX
Spark --> DBX
Impala --> DBX
Oozie --> ADF
Kafka --> EH
NiFi --> ADF
Ranger --> UC
Atlas --> PV
ZK --> AM
Right-sizing each service¶
| CDH service | CDH resource allocation | Azure allocation | Savings mechanism |
|---|---|---|---|
| HDFS | 3x replication across all DataNodes | ADLS Gen2 (storage-level redundancy) | Eliminate 67% storage overhead from 3x replication. |
| Spark | Fixed YARN queue (e.g., 50% of cluster) | Databricks auto-scaling (0 to N workers) | Pay only during job execution; terminate when idle. |
| Impala | Dedicated Impala daemons (always on) | Databricks SQL Serverless (scale to zero) | Pay per query; no always-on daemons. |
| Kafka | Dedicated Kafka brokers (always on) | Event Hubs (auto-inflate TUs) | Scale throughput units based on actual traffic. |
| NiFi | Dedicated NiFi cluster (always on) | ADF (per-activity pricing) | Pay per pipeline run; no always-on cluster. |
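Most of the compute savings in this table come from replacing always-on capacity with job-scoped clusters. A minimal sketch of a Databricks job-cluster spec follows; the runtime version, VM size, and worker counts are illustrative assumptions, not recommendations:

```python
# Illustrative Databricks job-cluster spec replacing a fixed YARN queue.
# spark_version, node_type_id, and worker counts are placeholders -- size them to your workload.
new_cluster = {
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "Standard_E8ds_v5",
    "autoscale": {"min_workers": 2, "max_workers": 12},    # scale with the job, not the cluster
    "azure_attributes": {
        "availability": "SPOT_WITH_FALLBACK_AZURE",         # spot capacity where the job tolerates eviction
    },
    "spark_conf": {
        "spark.databricks.delta.optimizeWrite.enabled": "true"
    },
}
# Attach this spec to each job task (Jobs API, Terraform, or asset bundle);
# the cluster exists only for the duration of the run, so nothing is billed while idle.
```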
4. Parallel-run strategy¶
Never cut over without a parallel-run period. This is the single most important risk mitigation step.
Parallel-run architecture¶
flowchart TD
SOURCE[Data Sources<br/>RDBMS, SFTP, APIs, Kafka] --> CDH[CDH Cluster<br/>Existing pipelines]
SOURCE --> AZURE[Azure Platform<br/>Migrated pipelines]
CDH --> CDH_OUT[CDH Output<br/>Tables, reports, exports]
AZURE --> AZURE_OUT[Azure Output<br/>Delta tables, Power BI, exports]
CDH_OUT --> COMPARE[Comparison Engine<br/>Row counts, checksums, KPIs]
AZURE_OUT --> COMPARE
COMPARE --> REPORT[Validation Report<br/>Match / Mismatch per dataset]
Parallel-run rules¶
- Duration: Minimum 2 weeks for each cluster migration. 4 weeks for production/revenue clusters.
- Scope: Every pipeline, every scheduled query, every report must run on both platforms.
- Comparison: Automated daily comparison of row counts, column checksums, and key business metrics.
- Acceptance criteria: Zero data discrepancies for 5 consecutive business days before cutover.
- Rollback plan: CDH cluster remains available for 30 days after cutover as a safety net.
Parallel-run cost impact¶
Running both platforms simultaneously costs approximately 1.3-1.5x the steady-state cost of CDH alone. Budget for this overhead for 2-8 weeks depending on the number of clusters and validation complexity.
Validation automation¶
# Automated comparison script (runs daily on Databricks)
from datetime import datetime, timezone

tables_to_compare = [
    ("silver.sales.orders", "hdfs_mirror.sales.orders"),
    ("silver.sales.customers", "hdfs_mirror.sales.customers"),
    ("silver.inventory.products", "hdfs_mirror.inventory.products"),
]

results = []
for azure_table, cdh_table in tables_to_compare:
    azure_count = spark.table(azure_table).count()
    cdh_count = spark.table(cdh_table).count()
    match = azure_count == cdh_count
    results.append({
        "table": azure_table,
        "azure_count": azure_count,
        "cdh_count": cdh_count,
        "match": match,
        "diff": azure_count - cdh_count,
        # Use a plain Python timestamp; a Spark Column cannot be embedded in a row
        "timestamp": datetime.now(timezone.utc),
    })
    if not match:
        print(f"MISMATCH: {azure_table}: Azure={azure_count}, CDH={cdh_count}")

# Write results to validation tracking table
spark.createDataFrame(results).write \
    .format("delta") \
    .mode("append") \
    .saveAsTable("audit.migration_validation")
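Row counts alone will not catch value-level drift, and the rules above also call for column checksums. One way to add that, sketched below, is an order-independent checksum built from Spark's xxhash64 over every column; it assumes both tables expose the same columns in the same order, and columns with nondeterministic representations (floats, maps, differing timestamp precision) may need normalization first.

```python
from pyspark.sql import functions as F

def table_checksum(table_name: str):
    """Order-independent checksum: sum of per-row 64-bit hashes across all columns."""
    df = spark.table(table_name)
    return (
        df.select(F.xxhash64(*df.columns).cast("decimal(38,0)").alias("row_hash"))
          .agg(F.sum("row_hash").alias("checksum"))
          .first()["checksum"]
    )

for azure_table, cdh_table in tables_to_compare:
    if table_checksum(azure_table) != table_checksum(cdh_table):
        print(f"CHECKSUM MISMATCH: {azure_table} vs {cdh_table}")
```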
5. Decommission timeline¶
Phased decommission plan¶
| Phase | Duration | Action | Risk mitigation |
|---|---|---|---|
| Pre-cutover | 2-4 weeks | Parallel run; both systems active | Full rollback capability. |
| Cutover | 1 day | Redirect data sources to Azure endpoints | CDH still running as read-only backup. |
| Post-cutover bake | 30 days | CDH read-only; Azure is primary | Can re-enable CDH pipelines if critical issue found. |
| CDH shutdown | 1 day | Stop all CDH services | Data archived to ADLS for reference if needed. |
| Hardware decommission | 2-4 weeks | Wipe disks, return/recycle hardware | Asset management and disposal. |
| License termination | Next renewal date | Do not renew Cloudera Enterprise license | Confirm with procurement. |
Pre-decommission checklist¶
- All data validated on Azure (row counts, checksums, KPIs match)
- All pipelines running on Azure for 2+ weeks without data issues
- All BI tools reconnected to Azure (Databricks SQL, Power BI)
- All users trained on Azure platform
- All monitoring and alerting configured on Azure
- All security policies (Ranger) recreated on Unity Catalog
- Incident response runbook updated for Azure
- CDH data archived to ADLS (cold/archive tier) for compliance/reference
- CDH Kerberos/keytab references removed from all scripts
- Cloudera Manager alerts disabled to prevent false alarms during shutdown
- Stakeholder sign-off on cutover and decommission
6. Team structure and roles¶
During migration (temporary team augmentation)¶
| Role | Count | Responsibilities |
|---|---|---|
| Migration lead | 1 | Overall migration planning, timeline, stakeholder communication |
| Data engineer (Spark/Hive) | 2-3 | Spark job porting, Hive-to-dbt conversion, UDF rewrites |
| Data engineer (NiFi/ADF) | 1-2 | NiFi-to-ADF pipeline conversion, integration testing |
| Impala/SQL specialist | 1 | Impala-to-Databricks SQL conversion, BI tool reconnection |
| Security engineer | 1 | Ranger-to-Unity Catalog policy migration, Kerberos removal |
| Azure platform engineer | 1 | Landing zone setup, networking, IAM, monitoring |
| QA / validation | 1 | Parallel-run validation, data quality checks, reporting |
Post-migration (steady state)¶
| Role | Count | Responsibilities |
|---|---|---|
| Platform engineer | 1-2 | Azure infrastructure, Databricks workspace management, monitoring |
| Data engineer | 2-4 | dbt models, ADF pipelines, Databricks jobs, Delta table optimization |
| Analytics engineer | 1-2 | Power BI semantic models, dbt metrics, data documentation |
| Data governance | 0.5-1 | Purview classifications, Unity Catalog grants, data quality |
The post-migration team is typically 30-50% smaller than the CDH operations team because managed services eliminate infrastructure management work.
7. Common pitfalls and how to avoid them¶
Pitfall 1: Trying to replicate CDH architecture on Azure¶
Problem: Teams deploy Azure VMs, install open-source Hadoop components, and try to recreate the CDH cluster in the cloud.
Why it happens: Familiarity bias. The team knows Hadoop and tries to minimize learning.
Solution: Use Azure-native managed services. The whole point of migration is to stop managing Hadoop infrastructure. Deploying Hadoop on Azure VMs gives you the worst of both worlds: cloud costs without cloud benefits.
Pitfall 2: Underestimating UDF migration¶
Problem: Teams discover during Phase 4 that they have 50+ Java UDFs, custom SerDes, and GenericUDAFs that cannot run on Databricks without rewriting.
Why it happens: UDFs are invisible in cluster inventory scripts. They live in JAR files on HDFS or in Maven repositories, not in the Hive metastore.
Solution: Inventory UDFs in Phase 1. Grep all Hive scripts for ADD JAR, CREATE FUNCTION, TRANSFORM USING. Prototype replacements before Phase 4.
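A minimal inventory sketch follows; the repository path and file extensions are placeholders for wherever your Hive and Impala scripts are kept.

```python
# Scan Hive scripts for UDF-related statements that will need a Databricks replacement.
# The path and extensions are placeholders.
import re
from pathlib import Path

UDF_PATTERN = re.compile(
    r"\b(ADD\s+JAR|CREATE\s+(TEMPORARY\s+)?FUNCTION|TRANSFORM\s*\(|USING\s+JAR)\b",
    re.IGNORECASE,
)

hits = []
for script in Path("/repos/hive-scripts").rglob("*"):
    if script.suffix.lower() not in {".hql", ".sql", ".q"}:
        continue
    for lineno, line in enumerate(script.read_text(errors="ignore").splitlines(), start=1):
        if UDF_PATTERN.search(line):
            hits.append(f"{script}:{lineno}: {line.strip()}")

print("\n".join(hits))
print(f"{len(hits)} UDF references found")
```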
Pitfall 3: Ignoring small-file compaction¶
Problem: Migrated data performs poorly on Azure because millions of small files from CDH streaming ingestion or Hive dynamic partitions were transferred as-is.
Why it happens: The transfer tool (AzCopy, Data Box) copies files faithfully; it does not compact them.
Solution: Compact during Delta conversion. Target 256 MB - 1 GB per file. Enable auto-optimization on all Delta tables post-migration.
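A minimal sketch of the post-migration compaction step is below. The table name is a placeholder; the `delta.autoOptimize.*` table properties are Databricks features, and the OPTIMIZE target-size setting is a Databricks-specific config.

```python
# One-off compaction after conversion, then write-time optimization going forward.
# Table name is a placeholder; target file size here is ~512 MB, within the 256 MB - 1 GB range above.
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", str(512 * 1024 * 1024))

spark.sql("OPTIMIZE silver.sales.orders")  # bin-pack the small files copied from HDFS

spark.sql("""
    ALTER TABLE silver.sales.orders SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")
```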
Pitfall 4: Skipping the parallel-run¶
Problem: The team cuts over to Azure after unit testing individual pipelines but without running both systems on production data simultaneously. Discrepancies are discovered by business users, not by the migration team.
Why it happens: Parallel-run is expensive (1.3-1.5x cost) and time-consuming. Teams under deadline pressure skip it.
Solution: Budget for 2-4 weeks of parallel-run per cluster. Automate comparison. This is the single most important risk mitigation step. Skipping it costs more in incident response than it saves in parallel-run costs.
Pitfall 5: Big-bang migration of all clusters¶
Problem: The team plans to migrate all CDH clusters simultaneously to minimize the total migration window.
Why it happens: Management wants to minimize the period of paying for both CDH and Azure.
Solution: Migrate cluster by cluster. The first cluster takes longest (learning curve, tooling setup). Subsequent clusters go 2-3x faster because the team has established patterns. The sequential approach is actually faster in total wall-clock time because it avoids the coordination overhead and incident recovery costs of a big-bang.
Pitfall 6: Mechanical Oozie conversion¶
Problem: The team converts every Oozie action to an ADF activity 1:1, producing brittle ADF pipelines that are harder to maintain than the original Oozie XML.
Why it happens: Conversion tools and scripts generate 1:1 mappings without architectural judgment.
Solution: Redesign complex Oozie workflows. Use the migration as an opportunity to simplify DAGs, leverage dbt's ref() dependency management, and separate orchestration (ADF) from transformation (dbt/Databricks).
Pitfall 7: Not cleaning up Kerberos references¶
Problem: Migrated Spark scripts fail on Databricks because they contain hardcoded kinit calls, keytab paths, and Kerberos principal references.
Why it happens: Kerberos is so deeply embedded in CDH that teams forget to search for all references.
Solution: In Phase 1, grep all scripts, configuration files, and job definitions for: kinit, keytab, principal, krb5.conf, REALM, KDC. Create a remediation checklist.
Pitfall 8: Ignoring HDFS replication factor¶
Problem: The team configures ADLS Gen2 with extra copies of data, tripling storage costs, because they are accustomed to HDFS's 3x replication.
Why it happens: HDFS defaults to replication factor 3. Teams assume they need the same on Azure.
Solution: ADLS Gen2 handles redundancy at the storage layer (LRS = 3 copies within a datacenter, ZRS = 3 copies across availability zones, GRS = 6 copies across regions). Do not replicate data at the application level. This mistake alone can triple your storage bill.
8. Change management¶
Training plan¶
| Audience | Training topic | Duration | Format |
|---|---|---|---|
| Data engineers | Databricks workspace, notebooks, Jobs | 2 days | Hands-on workshop |
| Data engineers | dbt fundamentals + Databricks dbt integration | 1 day | Hands-on workshop |
| Data engineers | ADF pipeline development | 1 day | Hands-on workshop |
| SQL analysts | Databricks SQL Editor, dialect differences | 0.5 day | Demo + practice |
| BI developers | Power BI + Databricks SQL connector | 0.5 day | Demo + practice |
| Data governance | Purview + Unity Catalog | 1 day | Hands-on workshop |
| Platform team | Azure Monitor, cost management, Databricks admin | 2 days | Hands-on workshop |
| All users | ADLS Gen2 storage navigation, Azure Portal basics | 0.5 day | Self-paced |
Communication cadence¶
| Communication | Frequency | Audience | Content |
|---|---|---|---|
| Migration status update | Weekly | All stakeholders | Progress, blockers, next steps |
| Technical sync | Twice weekly | Migration team | Technical issues, architecture decisions |
| Executive briefing | Bi-weekly | CIO/CDO | Timeline, budget, risk, decisions needed |
| User readiness update | Bi-weekly | End users | Training schedule, what is changing, FAQ |
| Post-migration retrospective | Once | All | Lessons learned, what worked, what to improve |
9. Success criteria¶
Define these before migration starts. Get stakeholder sign-off.
| Criterion | Metric | Target |
|---|---|---|
| Data accuracy | Row count and checksum match between CDH and Azure | 100% match for all migrated datasets |
| Query performance | P95 query latency on Databricks SQL vs Impala | Within 120% of Impala baseline (or better) |
| Pipeline reliability | ADF/Databricks Workflow success rate | > 99% over 2-week validation period |
| Cost | Monthly Azure run-rate vs monthly CDH cost | ≤ 65% of CDH cost by month 3 post-migration |
| Operational overhead | Platform team hours per week | ≤ 50% of CDH operational hours |
| User satisfaction | Survey of data engineers and analysts | ≥ 80% rate new platform as "good" or "excellent" |
| Security compliance | All Ranger policies recreated on Unity Catalog | 100% policy coverage, verified by audit |
| CDH decommission | CDH hardware decommissioned | Within 60 days of cutover |
Next steps¶
- Use the Migration Playbook for the detailed phased plan
- Review the TCO Analysis to build the financial case
- See the Benchmarks for performance comparison data
- Start hands-on with the NiFi to ADF Tutorial or Impala to Databricks Tutorial
Last updated: 2026-04-30 Maintainers: CSA-in-a-Box core team