Best Practices: Hadoop to Azure Migration
Operational best practices for planning, executing, and completing a Hadoop-to-Azure migration, covering cluster decomposition, parallel-run validation, decommission planning, edge-case handling, team retraining, and risk mitigation.
1. Cluster decomposition strategy
The decomposition principle
Do not migrate your Hadoop cluster as a monolith. Decompose it into independent workloads and migrate each one separately. This reduces risk, enables parallel work streams, and allows early wins.
Step 1: Inventory and classify
Create a workload inventory with four tiers:
| Tier | Description | Migration action | Typical percentage |
| A — Direct port | Spark/Hive jobs reading Parquet/Delta, standard patterns | Port to Databricks/Fabric with minimal changes | 20-30% |
| B — Modernize | Hive SQL workloads, Pig scripts, simple MapReduce | Convert to dbt + SparkSQL | 20-30% |
| C — Re-platform | HBase, Storm, Flink, custom YARN apps | Redesign for Cosmos DB, Functions, Databricks Streaming | 10-20% |
| D — Decommission | Stale data, abandoned jobs, unused tables | Archive evidence and delete | 30-50% |
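A scripted first pass over the inventory makes the tiering repeatable before analysts review edge cases by hand. The sketch below is a minimal example, assuming a hypothetical workload_inventory.csv with job_name, engine, format, and days_since_last_run columns; the rules mirror the table above and are a starting point, not a final classification.

# classify_workloads.py — illustrative first-pass tiering (assumed inventory fields)
import csv

def classify(job):
    """Assign a migration tier to one inventory record using simplified heuristics."""
    if int(job["days_since_last_run"]) > 365:
        return "D"  # stale: archive evidence and delete
    if job["engine"] in ("hbase", "storm", "flink", "custom_yarn"):
        return "C"  # re-platform
    if job["engine"] in ("hive", "pig", "mapreduce"):
        return "B"  # modernize to dbt + SparkSQL
    if job["engine"] == "spark" and job["format"] in ("parquet", "delta"):
        return "A"  # direct port
    return "B"      # default: review manually, treat as modernize

with open("workload_inventory.csv") as f:
    for job in csv.DictReader(f):
        print(job["job_name"], classify(job))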
Step 2: Identify Tier D first
Hadoop clusters accumulate dead weight over their typical 7-15 year lifespan. Before migrating anything, identify what should not migrate:
-- Find tables with no reads in the last 12 months (Hive audit log)
-- Export from Ranger audit or Hive metastore access logs
SELECT
db_name,
table_name,
MAX(access_time) AS last_accessed,
DATEDIFF(CURRENT_DATE, MAX(access_time)) AS days_since_access
FROM hive_audit_log
GROUP BY db_name, table_name
HAVING DATEDIFF(CURRENT_DATE, MAX(access_time)) > 365
ORDER BY days_since_access DESC;
# Find HDFS directories with no modification in 12+ months
hdfs dfs -ls -R /user/hive/warehouse/ | \
awk '{print $6, $7, $8}' | \
sort | \
awk -v cutoff=$(date -d "12 months ago" +%Y-%m-%d) '$1 < cutoff {print}'
Step 3: Build the migration sequence
Order migrations by value and dependency:
Phase 1 (Weeks 1-8):
├── HDFS → ADLS Gen2 (bulk copy) [Tier A]
├── Kafka → Event Hubs [Tier A]
└── Sqoop → ADF connectors [Tier A]
Phase 2 (Weeks 8-20):
├── Hive SQL → dbt + SparkSQL [Tier B]
├── Spark jobs → Databricks jobs [Tier A/B]
└── Oozie → ADF / Databricks Workflows [Tier B]
Phase 3 (Weeks 20-36):
├── HBase → Cosmos DB [Tier C]
├── Storm/Flink → Databricks Streaming [Tier C]
└── Custom YARN apps → AKS / Functions [Tier C]
Phase 4 (Weeks 36-48):
├── Security migration (Ranger → Purview/Unity Catalog)
├── Parallel-run validation
└── Decommission planning
2. Parallel-run validation
Why parallel-run matters
Never cut over from Hadoop to Azure based solely on unit tests. Run both systems in parallel and compare outputs at every stage.
Parallel-run architecture
Data Sources
├──→ Hadoop cluster (existing pipeline) ──→ Hadoop outputs
│                                                 │
│                                           Reconciliation
│                                         (daily comparison)
│                                                 │
└──→ Azure pipeline (new) ─────────────────→ Azure outputs
Reconciliation framework
# reconciliation.py — Daily comparison of Hadoop vs Azure outputs
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
def reconcile_table(hadoop_path, azure_path, key_columns, metric_columns, table_name):
    """Compare Hadoop and Azure outputs for a single table."""
    hadoop_df = spark.read.parquet(hadoop_path)
    azure_df = spark.read.format("delta").load(azure_path)

    results = {
        "table": table_name,
        "hadoop_rows": hadoop_df.count(),
        "azure_rows": azure_df.count(),
        "row_match": False,
        "metric_matches": {},
    }

    # Row count comparison
    results["row_match"] = results["hadoop_rows"] == results["azure_rows"]

    # Aggregate metric comparison
    for col in metric_columns:
        h_val = hadoop_df.agg(F.sum(col)).collect()[0][0] or 0
        a_val = azure_df.agg(F.sum(col)).collect()[0][0] or 0
        # Allow 0.01% tolerance for floating-point differences
        tolerance = abs(h_val * 0.0001) if h_val else 0.01
        results["metric_matches"][col] = abs(h_val - a_val) < tolerance

    # Key-level comparison (find mismatched rows)
    hadoop_keys = hadoop_df.select(*key_columns).distinct()
    azure_keys = azure_df.select(*key_columns).distinct()
    missing_in_azure = hadoop_keys.subtract(azure_keys)
    missing_in_hadoop = azure_keys.subtract(hadoop_keys)
    results["missing_in_azure"] = missing_in_azure.count()
    results["missing_in_hadoop"] = missing_in_hadoop.count()

    return results
# Run reconciliation for all migrated tables
tables_to_reconcile = [
    {
        "table_name": "daily_revenue",
        "hadoop_path": "hdfs:///analytics/daily_revenue/",
        "azure_path": "abfss://gold@storage.dfs.core.windows.net/daily_revenue/",
        "key_columns": ["report_date"],
        "metric_columns": ["total_revenue", "total_orders"],
    },
    {
        "table_name": "customer_360",
        "hadoop_path": "hdfs:///analytics/customer_360/",
        "azure_path": "abfss://gold@storage.dfs.core.windows.net/customer_360/",
        "key_columns": ["customer_id"],
        "metric_columns": ["lifetime_revenue", "lifetime_orders"],
    },
]
for table_config in tables_to_reconcile:
    result = reconcile_table(**table_config)
    print(f"\n{'='*60}")
    print(f"Table: {result['table']}")
    print(f"  Row count: Hadoop={result['hadoop_rows']:,}  Azure={result['azure_rows']:,}  Match={result['row_match']}")
    for metric, match in result["metric_matches"].items():
        print(f"  {metric}: Match={match}")
    print(f"  Missing in Azure: {result['missing_in_azure']}")
    print(f"  Extra in Azure: {result['missing_in_hadoop']}")
Parallel-run duration
| Workload type | Minimum parallel-run | Recommended |
| Daily batch ETL | 14 days | 30 days |
| Weekly reports | 4 weeks | 8 weeks |
| Monthly aggregations | 2 months | 3 months |
| Streaming pipelines | 7 days | 14 days |
| Ad-hoc queries | N/A (validate with test queries) | 5-10 representative queries |
Exit criteria for parallel-run
| Criterion | Threshold |
| Row count match | 100% for all tables |
| Aggregate metric match | Within 0.01% for all numeric columns |
| Schema match | 100% column name and type match |
| Missing records | 0 in either direction |
| Performance | Azure pipeline completes within 1.5x of Hadoop time or better |
| Consumer acceptance | All downstream consumers validate Azure outputs |
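The first four criteria can be checked mechanically from the reconciliation results produced in section 2. A minimal sketch, reusing reconcile_table and tables_to_reconcile from the framework above; schema, performance, and consumer acceptance still require separate checks and manual sign-off.

def meets_exit_criteria(result):
    """Evaluate one reconciliation result against the automated exit criteria."""
    checks = {
        "row_count_match": result["row_match"],
        "metrics_within_tolerance": all(result["metric_matches"].values()),
        "no_records_missing_in_azure": result["missing_in_azure"] == 0,
        "no_records_missing_in_hadoop": result["missing_in_hadoop"] == 0,
    }
    for name, passed in checks.items():
        print(f"  {name}: {'PASS' if passed else 'FAIL'}")
    return all(checks.values())

# Gate cutover on every migrated table passing the automated checks
ready_for_cutover = all(meets_exit_criteria(reconcile_table(**cfg)) for cfg in tables_to_reconcile)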
3. Decommission planning
Decommission sequence
Do not shut down Hadoop all at once. Decommission workload by workload:
For each workload:
1. Complete parallel-run (exit criteria met)
2. Redirect consumers to Azure endpoints
3. Make Hadoop data read-only (hdfs dfs -chmod -R 555 on the HDFS path; see the sketch after this list)
4. Monitor for 30 days (catch any missed consumers)
5. Archive Hadoop data to ADLS cold/archive tier
6. Free YARN capacity (reduce cluster allocation for this workload)
7. After all workloads decommissioned:
a. Final HDFS snapshot to ADLS
b. Export Ranger policies and Atlas metadata (archive)
c. Shut down cluster services (Hive, Spark, HBase)
d. Power off nodes
e. Terminate Cloudera/HDP license
f. Reclaim hardware or cancel cloud subscription
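Step 3 and the snapshot portion of step 7a lend themselves to small helper scripts. The sketch below is illustrative, assuming the migration host has an hdfs client configured and admin rights for snapshot commands; the warehouse path and snapshot name are placeholders, and the copy of the snapshot to ADLS is a separate azcopy/DistCp step.

# decommission_helpers.py — sketch of step 3 and the snapshot in step 7a (placeholder paths)
import subprocess

def hdfs(*args):
    """Run an hdfs CLI command and raise if it fails."""
    subprocess.run(["hdfs", *args], check=True)

def make_read_only(path):
    # Step 3: freeze the workload's HDFS data before the 30-day monitoring window
    hdfs("dfs", "-chmod", "-R", "555", path)

def snapshot(path, name):
    # Step 7a: capture a final HDFS snapshot that can then be copied to ADLS
    hdfs("dfsadmin", "-allowSnapshot", path)
    hdfs("dfs", "-createSnapshot", path, name)

make_read_only("/user/hive/warehouse/daily_revenue")
snapshot("/user/hive/warehouse/daily_revenue", "pre-decommission")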
License termination timeline
| Vendor | Typical notice period | Key actions |
| Cloudera | 90 days before renewal | Notify account team; do not auto-renew |
| Hortonworks (legacy) | N/A (end of life) | Cancel support contract |
| AWS EMR | Immediate (pay-as-you-go) | Terminate clusters, cancel reserved instances |
| Azure HDInsight | Immediate (pay-as-you-go) | Delete clusters |
Data retention for compliance
Before decommissioning, ensure compliance requirements are met:
# Archive critical HDFS data to ADLS Archive tier for compliance
azcopy copy \
"/mnt/hdfs-export/compliance-data/" \
"https://archivestorage.dfs.core.windows.net/compliance/?<SAS>" \
--recursive \
--put-md5
# Set lifecycle policy for automatic deletion after retention period
az storage account management-policy create \
--account-name archivestorage \
--resource-group <resource-group> \
--policy '{
"rules": [{
"name": "compliance-retention",
"type": "Lifecycle",
"definition": {
"filters": {"blobTypes": ["blockBlob"], "prefixMatch": ["compliance/"]},
"actions": {"baseBlob": {"delete": {"daysAfterModificationGreaterThan": 2555}}}
}
}]
}'
4. Handling edge cases
HBase edge cases
| Edge case | Challenge | Recommended approach |
| HBase with 100+ column families | Cosmos DB document size limits (2 MB) | Split into multiple containers by access pattern |
| HBase coprocessors for real-time indexing | No direct equivalent | Use Change Feed + Azure Functions |
| HBase with Phoenix SQL layer | Phoenix-specific SQL extensions | Rewrite queries for Cosmos DB SQL or move to Azure SQL |
| HBase TTL (cell-level expiration) | Cosmos DB has item-level TTL | Map cell TTL to item TTL; may require schema redesign |
| HBase with multi-version cells | Cosmos DB has no cell versioning | Store versions as separate documents or use Change Feed |
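For the multi-version case, one common pattern is to flatten each HBase cell version into its own document keyed by row key, qualifier, and timestamp. The sketch below is a minimal illustration with made-up field names and no SDK calls; document IDs and the partition key would need to follow your actual access pattern.

def hbase_versions_to_documents(row_key, column_family, cells):
    """Flatten HBase multi-version cells into one Cosmos DB document per version.
    cells: {qualifier: [(timestamp_ms, value), ...]} as returned by an HBase scan."""
    docs = []
    for qualifier, versions in cells.items():
        for ts, value in versions:
            docs.append({
                "id": f"{row_key}:{qualifier}:{ts}",   # unique per cell version
                "rowKey": row_key,                      # candidate partition key
                "columnFamily": column_family,
                "qualifier": qualifier,
                "versionTimestamp": ts,
                "value": value,
            })
    return docs

# Example: two historical versions of one cell become two documents
print(hbase_versions_to_documents("cust-42", "profile",
                                  {"email": [(1714000000000, "a@x.com"), (1713000000000, "b@x.com")]}))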
Kafka edge cases
| Edge case | Challenge | Recommended approach |
| Kafka Streams application | Stateful processing | Evaluate Event Hubs support; may need Databricks Streaming |
| Kafka transactions | Event Hubs does not support transactions | Design for at-least-once; add idempotency in consumer |
| Custom Kafka serdes | Must register in Schema Registry | Migrate to Azure Schema Registry |
| Kafka MirrorMaker | Event Hubs has different replication model | Use Event Hubs Geo-DR or multi-namespace |
| Low-latency requirements (<5ms) | Event Hubs has higher latency than bare-metal Kafka | Use Event Hubs Premium or evaluate Confluent on Azure |
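Without Kafka transactions, design consumers so that a redelivered event is harmless. One common pattern is to derive a deterministic key from the event payload and skip keys that have already been processed. The sketch below is deliberately SDK-free: the in-memory set stands in for a durable idempotency store (a keyed table or Cosmos DB container), and handle_event is a hypothetical placeholder for the real side effect.

processed = set()   # stand-in for a durable idempotency store

def handle_event(body):
    """Placeholder for the real business logic (writes, API calls, ...)."""
    print("processing", body)

def process_once(body, event_key):
    """At-least-once delivery plus an idempotency check approximates effectively-once processing."""
    if event_key in processed:
        return False          # redelivered duplicate: skip side effects
    handle_event(body)
    processed.add(event_key)  # record the key only after the side effect succeeds
    return True

# Derive event_key from a business identifier in the payload, not from the Event Hubs offset
process_once({"order_id": 42, "status": "shipped"}, "order-42-v3")
process_once({"order_id": 42, "status": "shipped"}, "order-42-v3")   # duplicate: ignored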
Oozie edge cases
| Edge case | Challenge | Recommended approach |
| Custom Java actions | ADF does not run custom Java | Convert to Databricks JAR task or Azure Function |
| Oozie SLA monitoring | ADF has different SLA model | Use ADF custom metrics + Azure Monitor alerts |
| Oozie shared library | Centralized JAR management | Use Databricks cluster libraries or Unity Catalog volumes |
| Complex decision trees (100+ nodes) | ADF If/Switch has limits | Decompose into multiple ADF pipelines |
| Oozie bundle (multi-coordinator) | No direct ADF equivalent | Create ADF trigger groups or use Databricks multi-task jobs |
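For a custom Java action, the usual landing spot is a Databricks JAR task. The sketch below shows the shape of a Jobs API 2.1 create call built in Python; the workspace URL, token, JAR location, main class, and cluster settings are placeholders, and a real migration would also carry over the Oozie action's retry and notification settings.

import requests

# Minimal Jobs API 2.1 payload for a single JAR task (all values are placeholders)
job_payload = {
    "name": "migrated-oozie-java-action",
    "tasks": [{
        "task_key": "run_java_action",
        "spark_jar_task": {
            "main_class_name": "com.example.LegacyAction",
            "parameters": ["--run-date", "2026-01-01"],
        },
        "libraries": [{"jar": "dbfs:/FileStore/jars/legacy-action.jar"}],
        "new_cluster": {
            "spark_version": "15.4.x-scala2.12",
            "node_type_id": "Standard_D4ds_v5",
            "num_workers": 2,
        },
    }],
}

resp = requests.post(
    "https://<workspace-url>/api/2.1/jobs/create",
    headers={"Authorization": "Bearer <token>"},
    json=job_payload,
)
print(resp.status_code, resp.json())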
Streaming edge cases
| Edge case | Challenge | Recommended approach |
| Storm bolts with external state | Stateful processing redesign | Move to Databricks Structured Streaming with state store |
| Sub-second latency requirements | Databricks micro-batch adds latency | Evaluate continuous processing mode or Azure Stream Analytics |
| Complex event processing (CEP) | Spark Structured Streaming lacks native CEP | Use Azure Stream Analytics for CEP patterns |
| Flink savepoints/checkpoints | Not compatible with Spark | Restart streaming from Event Hubs offset (replay) |
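When a Storm or Flink job is rebuilt on Databricks, a common starting point is Structured Streaming reading Event Hubs through its Kafka-compatible endpoint. The sketch below assumes a placeholder namespace, event hub name, and connection string, reuses the gold-layer storage account from section 2, and shows a simple watermarked aggregation rather than a full CEP rewrite.

from pyspark.sql import functions as F

# Placeholder Event Hubs namespace, hub name, and connection string
conn = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=..."

events = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093")
    .option("subscribe", "<eventhub-name>")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.sasl.jaas.config",
            f'org.apache.kafka.common.security.plain.PlainLoginModule required '
            f'username="$ConnectionString" password="{conn}";')
    .load())

# Replace a Storm bolt's rolling count with a watermarked per-minute aggregation
counts = (events
    .withWatermark("timestamp", "10 minutes")
    .groupBy(F.window("timestamp", "1 minute"))
    .count())

(counts.writeStream.format("delta")
    .option("checkpointLocation", "abfss://gold@storage.dfs.core.windows.net/_checkpoints/event_counts/")
    .outputMode("append")
    .start("abfss://gold@storage.dfs.core.windows.net/event_counts/"))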
5. Team retraining
Skills transfer matrix
| Hadoop skill | Azure equivalent skill | Training path | Duration |
| HDFS administration | ADLS Gen2 management | Azure Storage learning path | 1-2 days |
| YARN capacity management | Databricks cluster policies | Databricks admin training | 2-3 days |
| Hive SQL | SparkSQL + dbt | dbt Fundamentals + Databricks SQL | 3-5 days |
| Spark on YARN | Spark on Databricks | Databricks Developer training | 2-3 days |
| Oozie workflow design | ADF pipeline design | ADF learning path | 2-3 days |
| Ranger policy management | Purview + Unity Catalog | Microsoft Security learning path | 2-3 days |
| Kerberos administration | Entra ID management | Microsoft Identity learning path | 2-3 days |
| HBase administration | Cosmos DB management | Cosmos DB learning path | 3-5 days |
| Kafka administration | Event Hubs management | Event Hubs learning path | 1-2 days |
| Atlas catalog management | Purview governance | Purview learning path | 2-3 days |
Recommended certification paths
| Role | Certification | Provider |
| Data engineer | Databricks Certified Data Engineer Associate | Databricks |
| Data engineer | Azure Data Engineer Associate (DP-203) | Microsoft |
| Platform engineer | Azure Administrator Associate (AZ-104) | Microsoft |
| Security engineer | Azure Security Engineer Associate (AZ-500) | Microsoft |
| Analytics engineer | dbt Analytics Engineering Certification | dbt Labs |
| Data architect | Azure Solutions Architect Expert (AZ-305) | Microsoft |
Training timeline
Month 1-2 (Pre-migration):
├── Azure Fundamentals (AZ-900) for entire team
├── Databricks Lakehouse Fundamentals for data engineers
└── ADF fundamentals for workflow engineers
Month 3-4 (During migration):
├── Hands-on labs with migrated workloads
├── Databricks Developer training (data engineers)
└── dbt Fundamentals (analytics engineers)
Month 5-6 (Post-migration):
├── Advanced Databricks (performance tuning, streaming)
├── Purview governance training
└── Certification exams
Knowledge transfer from Hadoop to Azure
| Hadoop knowledge | How it transfers | What is new |
| SQL (HiveQL) | 95% transfers directly to SparkSQL | Delta Lake DDL, MERGE, OPTIMIZE |
| PySpark | 90% transfers directly | Databricks widgets, dbutils, Unity Catalog APIs |
| Data modeling | Fully transfers | Medallion architecture (bronze/silver/gold) pattern |
| ETL design patterns | Mostly transfers | dbt incremental models, ADF event triggers |
| Security concepts | Concepts transfer (RBAC, ACLs) | Entra ID, managed identity, Unity Catalog |
| Performance tuning | Concepts transfer (partitioning, predicate pushdown) | Photon, AQE, Z-ORDER, liquid clustering |
| Monitoring | Concepts transfer (dashboards, alerts) | Azure Monitor, Databricks Overwatch, ADF Monitor |
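For the SQL row, the "what is new" column amounts to a handful of Delta statements. The example below is a minimal illustration run through spark.sql; gold.daily_revenue and the updates source view are placeholders.

# Delta statements a HiveQL developer meets first: MERGE and OPTIMIZE
spark.sql("""
    MERGE INTO gold.daily_revenue AS target
    USING updates AS source
      ON target.report_date = source.report_date
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
spark.sql("OPTIMIZE gold.daily_revenue ZORDER BY (report_date)")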
6. Risk mitigation
Top 10 migration risks and mitigations
| # | Risk | Likelihood | Impact | Mitigation |
| 1 | Data loss during transfer | Low | Critical | Checksum validation, parallel-run, HDFS snapshots before migration |
| 2 | Performance regression | Medium | High | Benchmark before/after, tune Databricks clusters, use Photon |
| 3 | Security gap during transition | Medium | Critical | Map all Ranger policies before migration, validate with security team |
| 4 | Missed downstream consumers | High | High | Inventory all consumers before cutover, 30-day read-only period |
| 5 | Budget overrun (parallel costs) | Medium | Medium | Time-box parallel-run, decommission aggressively |
| 6 | Team skill gaps | Medium | Medium | Start training early, pair senior and junior engineers |
| 7 | HBase migration complexity | High | High | Start HBase migration early, allow 2x estimated time |
| 8 | Oozie workflow translation errors | High | Medium | Test each workflow independently, automated regression tests |
| 9 | Network bandwidth for data transfer | Medium | Medium | Use ExpressRoute, Data Box for large datasets |
| 10 | Vendor lock-in concerns | Low | Medium | Use Delta Lake (open format), dbt (portable SQL), standard APIs |
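For risk 1, checksum validation can be scripted against the blobs that azcopy --put-md5 wrote in section 3. A minimal sketch, assuming the azure-storage-blob package and placeholder paths and credentials, and that the exported source files are still available on the export host:

import hashlib
from azure.storage.blob import BlobClient

def md5_matches(local_path, account_url, container, blob_name, credential):
    """Compare a local file's MD5 with the Content-MD5 recorded on the uploaded blob."""
    with open(local_path, "rb") as f:
        local_md5 = hashlib.md5(f.read()).digest()
    blob = BlobClient(account_url=account_url, container_name=container,
                      blob_name=blob_name, credential=credential)
    remote_md5 = blob.get_blob_properties().content_settings.content_md5
    return remote_md5 is not None and local_md5 == bytes(remote_md5)

# Example with placeholder file, container, and credential
print(md5_matches("/mnt/hdfs-export/compliance-data/part-00000.parquet",
                  "https://archivestorage.blob.core.windows.net",
                  "compliance", "compliance-data/part-00000.parquet", "<sas-or-key>"))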
7. Operational checklists
Pre-migration checklist
During-migration checklist
Post-migration checklist
Last updated: 2026-04-30 · Maintainers: CSA-in-a-Box core team · Related: Migration Hub | TCO Analysis | Benchmarks