🔄 Hadoop Migration Workshop
Migrate on-premises Hadoop workloads to Azure. Learn assessment, planning, and execution strategies.
🎯 Learning Objectives
- Assess on-premises Hadoop clusters
- Plan migration strategy
- Migrate data and workloads
- Optimize for Azure
- Validate and cutover
📋 Prerequisites
- An on-premises Hadoop cluster, or access to one
- Azure subscription with adequate quota
- HDInsight or Databricks experience
- Understanding of Hadoop architecture
🔍 Step 1: Assessment
Inventory Collection
# Collect cluster metrics
yarn node -list > cluster-nodes.txt
hdfs dfsadmin -report > hdfs-report.txt
yarn application -list -appStates ALL > applications.txt
hive -e "SHOW TABLES" > hive-tables.txt
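Once the reports are collected, the storage summary can be pulled out programmatically to size the migration. The following is a minimal sketch, assuming the usual `hdfs dfsadmin -report` layout in which a cluster-wide summary precedes the per-DataNode sections. The function name `parse_hdfs_summary` and the sample report text are illustrative, not part of any Hadoop tooling.

```python
import re

# Matches summary lines such as "DFS Used: 1099511627776 (1 TB)".
# "DFS Used%:" is deliberately not matched because of the "%" before the colon.
_SUMMARY_LINE = re.compile(
    r"^(Configured Capacity|Present Capacity|DFS Remaining|DFS Used):\s+(\d+)"
)


def parse_hdfs_summary(report_text: str) -> dict[str, int]:
    """Return cluster-wide byte counts from `hdfs dfsadmin -report` output."""
    summary: dict[str, int] = {}
    for line in report_text.splitlines():
        match = _SUMMARY_LINE.match(line.strip())
        # Keep only the first occurrence: per-DataNode sections repeat the labels.
        if match and match.group(1) not in summary:
            summary[match.group(1)] = int(match.group(2))
    return summary


# Hypothetical excerpt standing in for the contents of hdfs-report.txt.
SAMPLE_REPORT = """\
Configured Capacity: 4000000000000 (3.64 TB)
Present Capacity: 3800000000000 (3.46 TB)
DFS Remaining: 2800000000000 (2.55 TB)
DFS Used: 1000000000000 (931.32 GB)
DFS Used%: 26.32%

Live datanodes (2):
Name: 10.0.0.11:9866 (worker-1)
Configured Capacity: 2000000000000 (1.82 TB)
DFS Used: 500000000000 (465.66 GB)
"""

summary = parse_hdfs_summary(SAMPLE_REPORT)
print(f"Raw HDFS usage: {summary['DFS Used'] / 1e12:.2f} TB")
```

In practice the text would come from `open("hdfs-report.txt").read()`. The raw usage figure typically includes block replicas, so the logical data volume to copy is likely smaller.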
Workload Analysis
- Identify data sources and sizes
- Map job dependencies
- Document SLAs and performance requirements
- List security and compliance needs
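Mapping job dependencies pays off when deciding the order in which jobs move. One way to sketch this is a topological sort over the dependency map, using Python's standard `graphlib` module (Python 3.9+). The job names below are hypothetical, and `migration_batches` is an illustrative helper, not an existing tool.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: each key depends on the jobs listed in its set.
job_dependencies = {
    "ingest_sales": set(),
    "clean_sales": {"ingest_sales"},
    "daily_report": {"clean_sales"},
    "ml_features": {"clean_sales"},
}


def migration_batches(dependencies):
    """Group jobs into batches; each batch depends only on earlier batches."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for stable output
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = migration_batches(job_dependencies)
for number, batch in enumerate(batches, start=1):
    print(f"Batch {number}: {', '.join(batch)}")
```

Jobs in the same batch have no dependencies on each other, so they can be migrated and tested in parallel.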
📊 Step 2: Migration Strategy
Lift and Shift vs Modernization
Lift and Shift (HDInsight)
- ✅ Fastest migration
- ✅ Minimal code changes
- ❌ Limited modernization
Modernize (Databricks/Synapse)
- ✅ Better performance
- ✅ Modern features
- ❌ More effort
Migration Phases
- Pilot - 1-2 workloads
- Wave 1 - Non-critical workloads
- Wave 2 - Production workloads
- Decommission - Turn off on-prem
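The phases above can be turned into a first-draft plan from the workload inventory. This is a rough sketch under two assumptions that are not stated in the workshop: the pilot takes the smallest non-critical workloads, and "critical" stands in for "production". The `Workload` class, `assign_waves` function, and sample workloads are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    critical: bool   # assumption: critical == production workload
    size_gb: float


def assign_waves(workloads, pilot_size=2):
    """Assign workloads to Pilot, Wave 1, and Wave 2 by criticality and size."""
    # Assumption: the smallest non-critical workloads make the safest pilot.
    non_critical = sorted(
        (w for w in workloads if not w.critical), key=lambda w: w.size_gb
    )
    return {
        "Pilot": [w.name for w in non_critical[:pilot_size]],
        "Wave 1": [w.name for w in non_critical[pilot_size:]],
        "Wave 2": [w.name for w in workloads if w.critical],
    }


workloads = [
    Workload("clickstream_etl", critical=False, size_gb=800),
    Workload("billing", critical=True, size_gb=1200),
    Workload("log_archive", critical=False, size_gb=50),
    Workload("adhoc_reports", critical=False, size_gb=200),
]

plan = assign_waves(workloads)
for phase, names in plan.items():
    print(f"{phase}: {', '.join(names) or '(none)'}")
```

A real plan would also weigh dependencies and SLAs from the assessment step; size and criticality alone are only a starting point.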
🚀 Step 3: Data Migration
Use AzCopy or DistCp
# DistCp from on-prem to Azure
hadoop distcp \
hdfs://onprem-namenode:8020/data/* \
wasb://container@storageaccount.blob.core.windows.net/data/
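# AzCopy reads from a local filesystem, not from hdfs:// URIs, so stage the data on an edge node first.
# /mnt/staging/data is a hypothetical local path; authenticate first with `azcopy login` or append a SAS token to the URL.
hdfs dfs -get /data /mnt/staging/data
azcopy copy \
"/mnt/staging/data/*" \
"https://storageaccount.blob.core.windows.net/container/data" \
--recursive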
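Before choosing between a network copy and an offline transfer, it helps to estimate how long the copy would take. The arithmetic is data volume in bits divided by effective link throughput. The sketch below assumes decimal units (1 TB = 10^12 bytes) and a utilization factor that you would measure for your own link; the function name and the 70% default are illustrative.

```python
def estimate_transfer_days(data_tb, link_gbps, utilization=0.7):
    """Estimate days needed to copy data_tb terabytes over a link_gbps link.

    utilization is the fraction of the link actually available to the copy.
    """
    if data_tb < 0 or link_gbps <= 0 or not 0 < utilization <= 1:
        raise ValueError("expected data_tb >= 0, link_gbps > 0, 0 < utilization <= 1")
    bits = data_tb * 1e12 * 8                      # TB -> bytes -> bits
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 86_400                        # seconds per day


days = estimate_transfer_days(data_tb=100, link_gbps=1)
print(f"100 TB over 1 Gbps at 70% utilization: about {days:.1f} days")
```

If the estimate runs to weeks, that is usually a signal to consider a dedicated circuit or an offline transfer option, and to plan for incremental syncs before cutover.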
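🔧 Step 4: Workload Migration
Hive Scripts
-- Migrate Hive tables
-- Many Hive versions reject CREATE EXTERNAL TABLE ... AS SELECT, so create the table first, then load it.
-- LIKE copies the schema and storage format of sales_onprem; to convert to ORC, declare the columns explicitly with STORED AS ORC.
CREATE EXTERNAL TABLE sales_azure LIKE sales_onprem
LOCATION 'wasb://data@storageaccount.blob.core.windows.net/sales/';
INSERT OVERWRITE TABLE sales_azure
SELECT * FROM sales_onprem;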
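MapReduce to Spark
# Modernize MapReduce to Spark: a mapper/reducer pair becomes a DataFrame aggregation
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sales-by-category").getOrCreate()
# Assumes sales.csv has a header row; without header=True the columns are named _c0, _c1, ...
# inferSchema=True makes "amount" numeric so it can be summed
df = spark.read.csv("wasb:///data/sales.csv", header=True, inferSchema=True)
# Total amount per category (the work a reducer keyed on category would do)
result = df.groupBy("category").sum("amount")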
✅ Step 5: Validation
- Compare data counts
- Run test queries
- Benchmark performance
- Verify security
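Comparing data counts is the easiest of these checks to automate. A minimal sketch, assuming you have already collected per-table row counts from both environments (for example with `SELECT COUNT(*)` queries) into two dictionaries. The table names, counts, and the `compare_row_counts` helper are hypothetical.

```python
def compare_row_counts(source_counts, target_counts):
    """Return {table: (source, target)} for every table whose counts differ.

    A table missing on one side is reported with None for that side.
    """
    mismatches = {}
    for table in sorted(set(source_counts) | set(target_counts)):
        src = source_counts.get(table)
        tgt = target_counts.get(table)
        if src != tgt:
            mismatches[table] = (src, tgt)
    return mismatches


# Hypothetical counts gathered from the on-premises and Azure clusters.
onprem = {"sales": 1_250_000, "customers": 48_200, "returns": 9_310}
azure = {"sales": 1_250_000, "customers": 48_150}

mismatches = compare_row_counts(onprem, azure)
for table, (src, tgt) in mismatches.items():
    print(f"{table}: on-prem={src}, azure={tgt}")
```

Matching counts do not prove the data is identical, so it is worth pairing this with checksums or column-level aggregates on a sample of tables before cutover.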
📚 Resources
Last Updated: January 2025