Tutorial: Migrate HDFS Data to ADLS Gen2¶
A step-by-step tutorial that walks through an end-to-end HDFS to ADLS Gen2 migration: provisioning, data transfer, format conversion, catalog registration, and data validation.
Prerequisites¶
Before starting this tutorial, you need:
- An on-premises Hadoop cluster with HDFS access (or equivalent cloud Hadoop)
- An Azure subscription with Contributor access
- Azure CLI installed (az command available)
- Network connectivity between Hadoop cluster and Azure (ExpressRoute, VPN, or internet)
- A Databricks workspace (for format conversion and validation)
What you will build¶
By the end of this tutorial, you will have:
- An ADLS Gen2 storage account with hierarchical namespace
- HDFS data copied to ADLS Gen2 via DistCp
- Data converted from ORC/Parquet to Delta Lake
- Tables registered in Unity Catalog (Databricks)
- Validated data integrity between source and target
Estimated time¶
| Step | Duration |
|---|---|
| Step 1: Provision Azure resources | 15 minutes |
| Step 2: Configure ABFS driver on Hadoop | 30 minutes |
| Step 3: Run DistCp for bulk transfer | Varies (1 hour per 10 TB on ExpressRoute) |
| Step 4: Convert to Delta Lake | 30-60 minutes |
| Step 5: Register in Unity Catalog | 15 minutes |
| Step 6: Validate data integrity | 30 minutes |
| Total (excluding transfer time) | ~2.5 hours + transfer |
Step 1: Provision Azure resources¶
1.1 Create a resource group¶
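A minimal sketch, assuming the Azure CLI is already logged in (az login) and using the resource group name and region referenced throughout this tutorial:
# Create the resource group that will hold all migration resources
az group create \
--name rg-hadoop-migration \
--location eastus2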
1.2 Create an ADLS Gen2 storage account¶
az storage account create \
--name hadoopmigrationlake \
--resource-group rg-hadoop-migration \
--location eastus2 \
--sku Standard_ZRS \
--kind StorageV2 \
--hns true \
--allow-blob-public-access false \
--min-tls-version TLS1_2 \
--require-infrastructure-encryption true
Key parameters:
- --hns true enables hierarchical namespace (required for HDFS compatibility)
- --sku Standard_ZRS provides zone-redundant storage (3 copies across availability zones)
- --require-infrastructure-encryption true enables double encryption
1.3 Create containers (file systems)¶
# Raw zone: landing area for migrated data (original format)
az storage fs create \
--name raw \
--account-name hadoopmigrationlake
# Silver zone: cleansed and converted data (Delta format)
az storage fs create \
--name silver \
--account-name hadoopmigrationlake
# Gold zone: business-level aggregates (Delta format)
az storage fs create \
--name gold \
--account-name hadoopmigrationlake
1.4 Get storage account key (for DistCp authentication)¶
STORAGE_KEY=$(az storage account keys list \
--account-name hadoopmigrationlake \
--query '[0].value' -o tsv)
echo "Storage key retrieved (do not log this value in production)"
For production migrations, use a service principal or SAS token instead of a storage account key.
Step 2: Configure ABFS driver on Hadoop cluster¶
2.1 Determine your Hadoop version¶
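The JARs you download in the next step must match the cluster's Hadoop version. One way to check it (exact output varies by distribution):
# Print the Hadoop version on any cluster node
hadoop version
# The first line of output reports the version, e.g. "Hadoop 3.1.1"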
2.2 Download Azure Storage JARs¶
Download JARs matching your Hadoop version from Maven Central:
# For Hadoop 3.1.x
HADOOP_AZURE_VERSION="3.1.1"
AZURE_STORAGE_VERSION="7.0.1"
cd /tmp
wget "https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-azure/${HADOOP_AZURE_VERSION}/hadoop-azure-${HADOOP_AZURE_VERSION}.jar"
wget "https://repo1.maven.org/maven2/com/microsoft/azure/azure-storage/${AZURE_STORAGE_VERSION}/azure-storage-${AZURE_STORAGE_VERSION}.jar"
wget "https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-azure-datalake/${HADOOP_AZURE_VERSION}/hadoop-azure-datalake-${HADOOP_AZURE_VERSION}.jar"
2.3 Install JARs on all nodes¶
# Copy to Hadoop common lib on every node
# (Use your cluster management tool — Ansible, Puppet, Ambari, CM)
# hadoop classpath may list the lib dir as a glob (".../common/lib/*"); strip it so cp targets the directory
HADOOP_LIB=$(hadoop classpath | tr ':' '\n' | grep "share/hadoop/common/lib" | head -1 | sed 's|/\*$||')
sudo cp /tmp/hadoop-azure-*.jar ${HADOOP_LIB}/
sudo cp /tmp/azure-storage-*.jar ${HADOOP_LIB}/
sudo cp /tmp/hadoop-azure-datalake-*.jar ${HADOOP_LIB}/
2.4 Configure core-site.xml¶
Add Azure storage configuration to core-site.xml on the cluster:
<!-- Add to core-site.xml on all nodes -->
<property>
<name>fs.azure.account.key.hadoopmigrationlake.dfs.core.windows.net</name>
<value>YOUR_STORAGE_ACCOUNT_KEY</value> <!-- paste the key retrieved in step 1.4; core-site.xml does not expand shell variables -->
<description>Storage account key for ADLS Gen2</description>
</property>
<property>
<name>fs.abfss.impl</name>
<value>org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem</value>
</property>
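If you follow the step 1.4 recommendation to use a service principal instead of an account key, the ABFS driver can authenticate with OAuth instead. A sketch of the alternative core-site.xml properties, with placeholder client ID, secret, and tenant ID (property names are scoped to the storage account as shown):
<!-- Alternative to the account key: OAuth with a service principal -->
<property>
<name>fs.azure.account.auth.type.hadoopmigrationlake.dfs.core.windows.net</name>
<value>OAuth</value>
</property>
<property>
<name>fs.azure.account.oauth.provider.type.hadoopmigrationlake.dfs.core.windows.net</name>
<value>org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider</value>
</property>
<property>
<name>fs.azure.account.oauth2.client.id.hadoopmigrationlake.dfs.core.windows.net</name>
<value>YOUR_CLIENT_ID</value>
</property>
<property>
<name>fs.azure.account.oauth2.client.secret.hadoopmigrationlake.dfs.core.windows.net</name>
<value>YOUR_CLIENT_SECRET</value>
</property>
<property>
<name>fs.azure.account.oauth2.client.endpoint.hadoopmigrationlake.dfs.core.windows.net</name>
<value>https://login.microsoftonline.com/YOUR_TENANT_ID/oauth2/token</value>
</property>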
2.5 Verify connectivity¶
# Test ABFS access from Hadoop cluster
hdfs dfs -ls abfss://raw@hadoopmigrationlake.dfs.core.windows.net/
# Should return empty listing (no files yet)
Step 3: Run DistCp for bulk transfer¶
3.1 Inventory source data¶
# List top-level directories and sizes
hdfs dfs -du -s -h /user/hive/warehouse/*
# Example output:
# 12.3 T /user/hive/warehouse/orders
# 8.7 T /user/hive/warehouse/customers
# 3.1 T /user/hive/warehouse/products
# 45.6 G /user/hive/warehouse/regions
# 120.2 T /user/hive/warehouse/events
3.2 Start with a small table (proof of concept)¶
# Copy the smallest table first as a proof of concept
hadoop distcp \
-m 4 \
-bandwidth 100 \
-log /tmp/distcp-log-regions \
hdfs:///user/hive/warehouse/regions \
abfss://raw@hadoopmigrationlake.dfs.core.windows.net/hive/warehouse/regions
3.3 Verify the small table copy¶
# Check file count and size on source
hdfs dfs -count /user/hive/warehouse/regions
# Example: 1 12 48682345678 /user/hive/warehouse/regions
# Check on target
hdfs dfs -count abfss://raw@hadoopmigrationlake.dfs.core.windows.net/hive/warehouse/regions
# Should match source
3.4 Copy all tables (full migration)¶
# Full migration with higher parallelism
hadoop distcp \
-m 100 \
-bandwidth 500 \
-update \
-strategy dynamic \
-log /tmp/distcp-log-full \
hdfs:///user/hive/warehouse/ \
abfss://raw@hadoopmigrationlake.dfs.core.windows.net/hive/warehouse/
# Monitor progress
yarn application -list # Find DistCp application ID
yarn logs -applicationId application_XXXX_YYYY # View logs
3.5 Handle failures¶
If DistCp fails partway through:
# Re-run with -update flag (only copies new/modified files)
hadoop distcp \
-m 100 \
-bandwidth 500 \
-update \
-strategy dynamic \
-log /tmp/distcp-log-retry \
hdfs:///user/hive/warehouse/ \
abfss://raw@hadoopmigrationlake.dfs.core.windows.net/hive/warehouse/
The -update flag ensures DistCp only copies files that do not exist on the target or have different sizes.
Step 4: Convert to Delta Lake (Databricks)¶
4.1 Open a Databricks notebook¶
Log into your Databricks workspace and create a new Python notebook.
4.2 Configure storage access¶
# Cell 1: Configure ADLS access
# (If using managed identity, this is automatic on Databricks)
# If using account key:
spark.conf.set(
"fs.azure.account.key.hadoopmigrationlake.dfs.core.windows.net",
dbutils.secrets.get(scope="migration", key="storage-key")
)
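If you instead use a service principal (recommended for production in step 1.4), the Spark-side configuration mirrors the core-site.xml OAuth sketch from step 2.4. The secret scope name matches the cell above; the key names for the client ID, client secret, and tenant ID are assumptions for this sketch:
# Alternative: OAuth with a service principal (secret key names are placeholders)
account = "hadoopmigrationlake.dfs.core.windows.net"
spark.conf.set(f"fs.azure.account.auth.type.{account}", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{account}",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{account}",
    dbutils.secrets.get(scope="migration", key="sp-client-id"))
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{account}",
    dbutils.secrets.get(scope="migration", key="sp-client-secret"))
tenant_id = dbutils.secrets.get(scope="migration", key="sp-tenant-id")
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{account}",
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")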
4.3 List migrated tables¶
# Cell 2: List tables in raw zone
tables = dbutils.fs.ls("abfss://raw@hadoopmigrationlake.dfs.core.windows.net/hive/warehouse/")
for t in tables:
print(f"{t.name:40s} {t.size:>15,} bytes")
4.4 Convert a single table (proof of concept)¶
# Cell 3: Convert regions table (smallest)
source_path = "abfss://raw@hadoopmigrationlake.dfs.core.windows.net/hive/warehouse/regions"
target_path = "abfss://silver@hadoopmigrationlake.dfs.core.windows.net/regions"
# Read source (auto-detect format: ORC or Parquet)
try:
df = spark.read.format("orc").load(source_path)
source_format = "ORC"
except Exception:
df = spark.read.format("parquet").load(source_path)
source_format = "Parquet"
print(f"Source format: {source_format}")
print(f"Row count: {df.count():,}")
print(f"Schema:")
df.printSchema()
# Write as Delta
df.write.format("delta") \
.mode("overwrite") \
.save(target_path)
print(f"Successfully converted {source_format} to Delta at {target_path}")
4.5 Convert all tables (batch)¶
# Cell 4: Batch convert all tables
from pyspark.sql import functions as F
results = []
for table_dir in tables:
table_name = table_dir.name.rstrip("/")
source_path = table_dir.path
target_path = f"abfss://silver@hadoopmigrationlake.dfs.core.windows.net/{table_name}"
try:
# Attempt to read as ORC first, then Parquet
try:
df = spark.read.format("orc").load(source_path)
fmt = "ORC"
except Exception:
df = spark.read.format("parquet").load(source_path)
fmt = "Parquet"
row_count = df.count()
# Detect partition columns (Hive-style partition directories)
partition_cols = [c for c in df.columns
if c in ("year", "month", "day", "date", "region", "country")]
# Write as Delta
writer = df.write.format("delta").mode("overwrite")
if partition_cols:
writer = writer.partitionBy(*partition_cols)
writer.save(target_path)
results.append((table_name, fmt, row_count, "SUCCESS", ", ".join(partition_cols) or "none"))
print(f" [OK] {table_name}: {row_count:,} rows ({fmt} -> Delta)")
except Exception as e:
results.append((table_name, "unknown", 0, f"FAILED: {str(e)[:100]}", ""))
print(f" [FAIL] {table_name}: {str(e)[:100]}")
# Display summary
results_df = spark.createDataFrame(results, ["table", "source_format", "rows", "status", "partitions"])
display(results_df)
Step 5: Register in Unity Catalog¶
5.1 Create catalog and schema¶
-- Cell 5: Create catalog and schema
CREATE CATALOG IF NOT EXISTS migration;
CREATE SCHEMA IF NOT EXISTS migration.silver;
5.2 Register each table¶
# Cell 6: Register all Delta tables in Unity Catalog
for table_dir in tables:
table_name = table_dir.name.rstrip("/")
target_path = f"abfss://silver@hadoopmigrationlake.dfs.core.windows.net/{table_name}"
try:
spark.sql(f"""
CREATE TABLE IF NOT EXISTS migration.silver.{table_name}
USING DELTA
LOCATION '{target_path}'
""")
print(f" [OK] Registered migration.silver.{table_name}")
except Exception as e:
print(f" [FAIL] {table_name}: {str(e)[:100]}")
5.3 Verify catalog registration¶
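A quick check from a SQL cell, using the catalog and schema created above (regions is the small proof-of-concept table from step 3.2):
-- List registered tables
SHOW TABLES IN migration.silver;
-- Spot-check that a registered table is queryable
SELECT COUNT(*) AS row_count FROM migration.silver.regions;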
Step 6: Validate data integrity¶
6.1 Row count validation¶
# Cell 9: Compare row counts between raw (source format) and silver (Delta)
validation = []
for table_dir in tables:
table_name = table_dir.name.rstrip("/")
raw_path = table_dir.path
silver_path = f"abfss://silver@hadoopmigrationlake.dfs.core.windows.net/{table_name}"
try:
# Count in raw zone
try:
raw_count = spark.read.format("orc").load(raw_path).count()
except Exception:
raw_count = spark.read.format("parquet").load(raw_path).count()
# Count in silver zone (Delta)
silver_count = spark.read.format("delta").load(silver_path).count()
match = raw_count == silver_count
validation.append((table_name, raw_count, silver_count, match))
except Exception as e:
validation.append((table_name, -1, -1, False))
val_df = spark.createDataFrame(validation, ["table", "raw_count", "delta_count", "match"])
display(val_df)
# Assert all match
failures = val_df.filter("match = false")
if failures.count() > 0:
print("VALIDATION FAILED for the following tables:")
display(failures)
else:
print("ALL TABLES VALIDATED SUCCESSFULLY")
6.2 Aggregate validation (spot check)¶
# Cell 10: Aggregate validation for a key table
table_name = "orders" # Replace with your largest/most important table
raw_path = f"abfss://raw@hadoopmigrationlake.dfs.core.windows.net/hive/warehouse/{table_name}"
silver_path = f"abfss://silver@hadoopmigrationlake.dfs.core.windows.net/{table_name}"
try:
raw_df = spark.read.format("orc").load(raw_path)
except Exception:
raw_df = spark.read.format("parquet").load(raw_path)
silver_df = spark.read.format("delta").load(silver_path)
# Compare key metrics
for df, label in [(raw_df, "RAW"), (silver_df, "DELTA")]:
stats = df.agg(
F.count("*").alias("row_count"),
F.sum("amount").alias("total_amount"),
F.countDistinct("customer_id").alias("unique_customers"),
F.min("order_date").alias("min_date"),
F.max("order_date").alias("max_date")
).collect()[0]
print(f"\n{label}:")
print(f" Row count: {stats['row_count']:,}")
print(f" Total amount: {stats['total_amount']:,.2f}")
print(f" Unique customers: {stats['unique_customers']:,}")
print(f" Date range: {stats['min_date']} to {stats['max_date']}")
6.3 Schema comparison¶
# Cell 11: Schema comparison
table_name = "orders"
raw_path = f"abfss://raw@hadoopmigrationlake.dfs.core.windows.net/hive/warehouse/{table_name}"
silver_path = f"abfss://silver@hadoopmigrationlake.dfs.core.windows.net/{table_name}"
try:
raw_schema = spark.read.format("orc").load(raw_path).schema
except Exception:
raw_schema = spark.read.format("parquet").load(raw_path).schema
silver_schema = spark.read.format("delta").load(silver_path).schema
# Compare field by field
print(f"{'Column':<30} {'Raw Type':<20} {'Delta Type':<20} {'Match'}")
print("-" * 90)
for raw_field in raw_schema.fields:
delta_field = silver_schema[raw_field.name] if raw_field.name in silver_schema.fieldNames() else None
if delta_field:
match = str(raw_field.dataType) == str(delta_field.dataType)
print(f"{raw_field.name:<30} {str(raw_field.dataType):<20} {str(delta_field.dataType):<20} {'OK' if match else 'MISMATCH'}")
else:
print(f"{raw_field.name:<30} {str(raw_field.dataType):<20} {'MISSING':<20} FAIL")
Step 7: Optimize Delta tables (post-migration)¶
-- Cell 12: Optimize large tables
OPTIMIZE migration.silver.orders ZORDER BY (customer_id, order_date);
OPTIMIZE migration.silver.events ZORDER BY (event_type, event_date);
-- Enable auto-optimization for ongoing writes
ALTER TABLE migration.silver.orders SET TBLPROPERTIES (
'delta.autoOptimize.optimizeWrite' = 'true',
'delta.autoOptimize.autoCompact' = 'true'
);
Cleanup and next steps¶
What to do after successful migration¶
- Run delta sync: Re-run DistCp with -update to catch any files written during migration (a sketch follows this list)
- Migrate metastore: Follow Hive Migration to convert DDL and register tables
- Migrate Spark jobs: Follow Spark Migration to port job submission
- Migrate security: Follow Security Migration to port ACLs and policies
- Parallel run: Run workloads on both Hadoop and Azure for 14+ days
- Cutover: Switch consumers to Azure endpoints
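The delta sync in the first item is the same command as step 3.5. A sketch, optionally adding -delete so files removed from HDFS since the bulk copy are also removed from the target (check the paths carefully before using -delete, since it deletes target files not present on the source):
# Final delta sync before cutover: copies new/changed files, removes deleted ones
hadoop distcp \
-m 100 \
-bandwidth 500 \
-update \
-delete \
-strategy dynamic \
-log /tmp/distcp-log-delta-sync \
hdfs:///user/hive/warehouse/ \
abfss://raw@hadoopmigrationlake.dfs.core.windows.net/hive/warehouse/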
Troubleshooting¶
| Problem | Solution |
|---|---|
| DistCp fails with ClassNotFoundException | ABFS JARs not on Hadoop classpath; re-install on all nodes |
| DistCp extremely slow | Check network bandwidth; consider Data Box for >100 TB |
| Delta conversion OOM | Increase Databricks cluster size; process tables one at a time |
| Row count mismatch after conversion | Check for _SUCCESS, .hive-staging*, and .crc files in raw |
| Schema mismatch | Complex types (STRUCT, MAP) may serialize differently; verify manually |
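For the row count mismatch row above, one way to check the raw zone for those leftover metadata files (the orders path follows the tutorial's naming):
# List Hive metadata files copied from HDFS that can skew file-count and size comparisons
hdfs dfs -ls -R abfss://raw@hadoopmigrationlake.dfs.core.windows.net/hive/warehouse/orders \
| grep -E '_SUCCESS|\.hive-staging|\.crc'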
Related¶
- HDFS Migration Guide — detailed migration reference
- Hive Migration — next step: metastore migration
- Spark Migration — next step: job migration
- Migration Hub — full migration center
Last updated: 2026-04-30
Maintainers: CSA-in-a-Box core team
Related: HDFS Migration | Hive Migration | Migration Hub