HDFS to ADLS Gen2 Migration¶
A comprehensive guide for migrating data from Hadoop Distributed File System (HDFS) to Azure Data Lake Storage Gen2, covering bulk transfer, format conversion, partition preservation, small-file compaction, and data validation.
Overview¶
HDFS to ADLS Gen2 is the foundational migration step. Every other component — Hive, Spark, HBase, Oozie — depends on data being accessible in Azure storage. The good news: ADLS Gen2 exposes an HDFS-compatible API through the ABFS driver (the abfss:// URI scheme), which means most Spark and Hive code requires only a URI change.
This guide covers:
- Understanding the HDFS-compatible API on ADLS
- Bulk data transfer with DistCp and AzCopy
- File format conversion (ORC, Parquet, Avro to Delta Lake)
- Partition layout preservation
- Solving the small-file problem with Delta compaction
- Snapshot and versioning equivalents
- Data validation strategies
- Worked example with end-to-end commands
1. HDFS-compatible API on ADLS Gen2¶
ADLS Gen2 exposes an HDFS-compatible REST API via the Azure Blob File System (ABFS) driver. This is not an emulation layer — it is a native protocol that supports:
- Hierarchical namespace (real directories, not prefix-based simulation)
- POSIX ACLs
- Atomic rename operations
- Append operations
- abfss:// URI scheme (TLS-encrypted by default)
URI mapping¶
# Hadoop HDFS
hdfs://namenode.cluster.local:8020/user/hive/warehouse/orders
# ADLS Gen2 equivalent
abfss://raw@mystorageaccount.dfs.core.windows.net/user/hive/warehouse/orders
# Breakdown
# abfss:// → protocol (ABFS over TLS)
# raw → container (file system) name
# @mystorageaccount.dfs.core.windows.net → storage account DFS endpoint
# /user/hive/warehouse/orders → path (identical to HDFS path)
Spark configuration¶
# In Spark session configuration (Databricks or Fabric)
spark.conf.set(
"fs.azure.account.key.mystorageaccount.dfs.core.windows.net",
dbutils.secrets.get(scope="storage", key="account-key")
)
# Or using managed identity / service principal (recommended)
spark.conf.set(
"fs.azure.account.auth.type.mystorageaccount.dfs.core.windows.net",
"OAuth"
)
spark.conf.set(
"fs.azure.account.oauth.provider.type.mystorageaccount.dfs.core.windows.net",
"org.apache.hadoop.fs.azurebfs.oauth2.MsiTokenProvider"
)
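If you authenticate with a service principal rather than managed identity, the ABFS driver takes the client credentials through a parallel set of settings. A minimal sketch, assuming the client ID, client secret, and tenant ID are stored in a secret scope named storage (the scope and key names are illustrative):
# Service principal (client credentials) variant
spark.conf.set(
    "fs.azure.account.auth.type.mystorageaccount.dfs.core.windows.net",
    "OAuth"
)
spark.conf.set(
    "fs.azure.account.oauth.provider.type.mystorageaccount.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider"
)
spark.conf.set(
    "fs.azure.account.oauth2.client.id.mystorageaccount.dfs.core.windows.net",
    dbutils.secrets.get(scope="storage", key="sp-client-id")      # illustrative key name
)
spark.conf.set(
    "fs.azure.account.oauth2.client.secret.mystorageaccount.dfs.core.windows.net",
    dbutils.secrets.get(scope="storage", key="sp-client-secret")  # illustrative key name
)
spark.conf.set(
    "fs.azure.account.oauth2.client.endpoint.mystorageaccount.dfs.core.windows.net",
    f"https://login.microsoftonline.com/{dbutils.secrets.get(scope='storage', key='tenant-id')}/oauth2/token"
)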
Code change: minimal¶
# Before (HDFS)
df = spark.read.parquet("hdfs://namenode:8020/data/orders/")
# After (ADLS Gen2)
df = spark.read.parquet("abfss://raw@storage.dfs.core.windows.net/data/orders/")
That is the only code change required for most Spark jobs. The ABFS driver handles all storage operations natively.
2. Bulk data transfer¶
Option A: DistCp (Hadoop native)¶
DistCp (Distributed Copy) is the standard Hadoop tool for large-scale data movement. It runs as a MapReduce job that copies files in parallel across the cluster.
Prerequisites:
- Azure ABFS driver JARs installed on the Hadoop cluster
- Storage account access configured (shared key, SAS token, or service principal)
- Network connectivity (ExpressRoute or VPN recommended for >10 TB)
Basic DistCp command:
hadoop distcp \
-Dfs.azure.account.key.mystorageaccount.dfs.core.windows.net=<key> \
-m 100 \
-bandwidth 500 \
-update \
-strategy dynamic \
hdfs://namenode:8020/user/hive/warehouse/ \
abfss://raw@mystorageaccount.dfs.core.windows.net/hive/warehouse/
Parameter explanation:
| Parameter | Purpose |
|---|---|
| -m 100 | Use 100 mapper tasks for the parallel copy |
| -bandwidth 500 | Limit each mapper to 500 MB/s (prevents saturating the network) |
| -update | Copy only files that are new or modified (incremental sync) |
| -strategy dynamic | Dynamic work distribution (handles skewed file sizes) |
Performance expectations:
| Network | Bandwidth | 100 TB transfer time |
|---|---|---|
| ExpressRoute 10 Gbps | ~1 GB/s effective | ~28 hours |
| VPN (1 Gbps) | ~100 MB/s effective | ~12 days |
| Internet (100 Mbps) | ~10 MB/s effective | ~120 days |
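These figures are simple size-over-throughput arithmetic; a quick sketch you can rerun with your own dataset size and measured effective throughput:
def transfer_time_days(size_tb: float, effective_mb_per_s: float) -> float:
    # Wall-clock estimate: dataset size divided by sustained effective throughput
    seconds = (size_tb * 1_000_000) / effective_mb_per_s  # decimal units: 1 TB = 1,000,000 MB
    return seconds / 86_400

print(f"{transfer_time_days(100, 1000):.1f} days")  # ExpressRoute ~1 GB/s  -> ~1.2 days (~28 hours)
print(f"{transfer_time_days(100, 100):.1f} days")   # VPN ~100 MB/s         -> ~11.6 days
print(f"{transfer_time_days(100, 10):.1f} days")    # Internet ~10 MB/s     -> ~115.7 days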
Recommendation: Use ExpressRoute for any migration over 10 TB. For migrations over 100 TB, consider Azure Data Box for the initial bulk load, followed by DistCp for incremental sync.
Option B: AzCopy¶
AzCopy is Microsoft's command-line tool for high-performance data transfer to and from Azure Storage. It runs outside the Hadoop cluster and copies data directly.
When to use AzCopy instead of DistCp:
- HDFS data is accessible via NFS or local mount
- You want to copy from a staging server rather than running MapReduce
- The Hadoop cluster cannot run additional MapReduce jobs (resource-constrained)
# Copy from local/NFS mount to ADLS Gen2
azcopy copy \
"/mnt/hdfs-export/hive/warehouse/" \
"https://mystorageaccount.dfs.core.windows.net/raw/hive/warehouse/?<SAS>" \
--recursive \
--put-md5 \
--log-level=INFO
Option C: Azure Data Box (>100 TB)¶
For very large datasets, Azure Data Box provides offline transfer:
- Order a Data Box Heavy (up to 1 PB)
- Copy data from HDFS to Data Box via NFS mount
- Ship Data Box to Azure datacenter
- Data appears in ADLS Gen2
- Run DistCp for delta sync (files changed after Data Box snapshot)
3. File format conversion¶
Why convert to Delta Lake?¶
| Feature | Parquet/ORC on HDFS | Delta Lake on ADLS |
|---|---|---|
| ACID transactions | No | Yes |
| Time travel | No | Yes (30-day default, configurable) |
| Schema evolution | Manual | Managed (additive by default, mergeSchema option) |
| MERGE (upserts) | Not supported | First-class operation |
| Small-file compaction | Manual scripts | OPTIMIZE command |
| Z-ORDER indexing | Not available | Built-in data skipping |
ORC to Delta¶
# Read ORC from HDFS or ADLS staging area
df = spark.read.format("orc").load(
"abfss://raw@storage.dfs.core.windows.net/staging/orders_orc/"
)
# Write as Delta to curated zone
df.write.format("delta") \
.mode("overwrite") \
.partitionBy("year", "month") \
.save("abfss://silver@storage.dfs.core.windows.net/orders/")
# Register as table in Unity Catalog
spark.sql("""
CREATE TABLE silver.orders
USING DELTA
LOCATION 'abfss://silver@storage.dfs.core.windows.net/orders/'
""")
Parquet to Delta (in-place conversion)¶
If data is already in Parquet format, Delta supports in-place conversion without rewriting the data files:
-- Convert existing Parquet directory to Delta (no data rewrite)
CONVERT TO DELTA parquet.`abfss://raw@storage.dfs.core.windows.net/orders_parquet/`
PARTITIONED BY (year INT, month INT);
This adds a Delta transaction log to the existing Parquet files. The underlying data files are not copied or rewritten.
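The same in-place conversion is available from PySpark through the DeltaTable API; a short sketch:
from delta.tables import DeltaTable

# Adds a _delta_log to the existing Parquet files; no data is rewritten
DeltaTable.convertToDelta(
    spark,
    "parquet.`abfss://raw@storage.dfs.core.windows.net/orders_parquet/`",
    "year INT, month INT"  # partition schema matching the directory layout
)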
Avro to Delta¶
# Read Avro
df = spark.read.format("avro").load(
"abfss://raw@storage.dfs.core.windows.net/staging/events_avro/"
)
# Write as Delta
df.write.format("delta") \
.mode("overwrite") \
.save("abfss://silver@storage.dfs.core.windows.net/events/")
CSV/Text to Delta¶
# Read CSV with schema inference
df = spark.read.format("csv") \
.option("header", "true") \
.option("inferSchema", "true") \
.load("abfss://raw@storage.dfs.core.windows.net/staging/legacy_csv/")
# Write as Delta
df.write.format("delta") \
.mode("overwrite") \
.save("abfss://silver@storage.dfs.core.windows.net/legacy_data/")
4. Partition layout preservation¶
HDFS partition structure¶
Hive-style partitioning creates directory trees:
/user/hive/warehouse/orders/
year=2024/
month=01/
part-00000.parquet
part-00001.parquet
month=02/
...
year=2025/
...
ADLS Gen2 partition structure¶
Delta Lake supports the same Hive-style partitioning. DistCp preserves the directory structure, so partitions map 1:1:
abfss://silver@storage.dfs.core.windows.net/orders/
year=2024/
month=01/
part-00000.snappy.parquet
part-00001.snappy.parquet
month=02/
...
year=2025/
...
Partition optimization: from Hive partitioning to liquid clustering¶
For new workloads, consider using Databricks liquid clustering instead of traditional partitioning:
-- Traditional Hive-style partitioning (compatible but not optimal)
CREATE TABLE silver.orders
USING DELTA
PARTITIONED BY (year, month)
LOCATION 'abfss://silver@storage.dfs.core.windows.net/orders/';
-- Modern liquid clustering (better for most query patterns)
CREATE TABLE silver.orders_v2
USING DELTA
CLUSTER BY (order_date, customer_id)
LOCATION 'abfss://silver@storage.dfs.core.windows.net/orders_v2/';
Liquid clustering automatically handles data layout optimization. It replaces the need for manual partitioning decisions and Z-ORDER operations.
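Liquid clustering and Hive-style partitioning are mutually exclusive on a table, but an existing unpartitioned Delta table can be switched over with ALTER TABLE, and the new layout is applied incrementally by later OPTIMIZE runs. A sketch, assuming a Databricks runtime with liquid clustering support (the clustering keys are illustrative):
# Enable liquid clustering on an existing, unpartitioned Delta table
spark.sql("ALTER TABLE silver.events CLUSTER BY (event_date, device_id)")

# Clustering is applied by OPTIMIZE; no ZORDER BY clause is needed
spark.sql("OPTIMIZE silver.events")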
5. The small-file problem and Delta compaction¶
The problem¶
HDFS clusters accumulate millions of small files over time:
- Streaming jobs that write micro-batches every few seconds
- Hive INSERT INTO statements that create one file per insert
- MapReduce jobs that create one file per mapper
- Failed/retried jobs that leave orphan files
Small files cause:
- NameNode memory pressure (HDFS) / transaction log bloat (Delta)
- Slow query performance (many file opens per query)
- Increased storage API costs (ADLS charges per transaction)
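Before deciding how aggressively to compact, it helps to measure the damage. A minimal audit sketch for a Databricks notebook (dbutils), listing the data files behind one table and summarizing their sizes; the path and the 32 MB threshold are illustrative:
def list_files(path):
    # Recursively list data files, skipping the Delta transaction log
    entries = []
    for info in dbutils.fs.ls(path):
        if info.name == "_delta_log/":
            continue
        if info.name.endswith("/"):
            entries.extend(list_files(info.path))
        else:
            entries.append(info)
    return entries

files = list_files("abfss://silver@storage.dfs.core.windows.net/orders/")
sizes_mb = [f.size / 1024 / 1024 for f in files]

print(f"files: {len(files)}")
print(f"avg size: {sum(sizes_mb) / len(sizes_mb):.1f} MB")
print(f"under 32 MB: {sum(1 for s in sizes_mb if s < 32)}")  # prime candidates for OPTIMIZE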
The Delta solution: OPTIMIZE¶
-- Compact small files into optimal-sized files (target: 1 GB per file)
OPTIMIZE silver.orders;
-- Compact with Z-ORDER for query acceleration
OPTIMIZE silver.orders
ZORDER BY (customer_id, order_date);
-- Compact specific partitions only
OPTIMIZE silver.orders
WHERE year = 2025 AND month = 4;
Automated compaction¶
In Databricks, enable auto-compaction:
ALTER TABLE silver.orders
SET TBLPROPERTIES (
'delta.autoOptimize.optimizeWrite' = 'true',
'delta.autoOptimize.autoCompact' = 'true'
);
Pre-migration compaction¶
Before migrating from HDFS, consider compacting small files on the source to reduce transfer time:
# On Hadoop: merge small files using Spark
spark-submit --class com.example.SmallFileCompactor \
--master yarn \
--num-executors 20 \
compact-job.jar \
--input hdfs:///user/hive/warehouse/orders/ \
--output hdfs:///staging/orders_compacted/ \
--target-size 256mb
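The compactor class above is a placeholder; the core of such a job is just read, repartition toward a target file size, write. A rough PySpark sketch of the same idea run on the Hadoop cluster, ignoring Hive partitioning for brevity (paths, the 256 MB target, and the use of Spark's JVM gateway to size the input are all illustrative):
import math

source = "hdfs:///user/hive/warehouse/orders/"
target = "hdfs:///staging/orders_compacted/"
target_file_bytes = 256 * 1024 * 1024

df = spark.read.format("orc").load(source)

# Size the output by total input bytes, via the Hadoop FileSystem API on the JVM gateway
jvm = spark.sparkContext._jvm
src_path = jvm.org.apache.hadoop.fs.Path(source)
fs = src_path.getFileSystem(spark.sparkContext._jsc.hadoopConfiguration())
total_bytes = fs.getContentSummary(src_path).getLength()

num_files = max(1, math.ceil(total_bytes / target_file_bytes))
df.repartition(num_files).write.format("orc").mode("overwrite").save(target)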
6. HDFS snapshots vs ADLS soft delete and Delta time travel¶
HDFS snapshots¶
HDFS provides directory-level snapshots: read-only, point-in-time copies of a directory tree.
# Create HDFS snapshot
hdfs dfs -createSnapshot /user/hive/warehouse/orders snapshot_20250430
# Access snapshot data
hdfs dfs -ls /user/hive/warehouse/orders/.snapshot/snapshot_20250430/
Azure equivalents¶
| HDFS feature | Azure equivalent | Scope |
|---|---|---|
| HDFS directory snapshot | ADLS Gen2 soft delete (1-365 day retention) | Container or blob level |
| HDFS snapshot diff | Delta Lake time travel | Table level |
| HDFS snapshot for backup | ADLS blob versioning + lifecycle management | Blob level |
Delta time travel¶
Delta time travel is more powerful than HDFS snapshots because it operates at the table level with full query support:
-- Query data as it was 7 days ago
SELECT * FROM silver.orders TIMESTAMP AS OF '2025-04-23';
-- Query data as it was at a specific version
SELECT * FROM silver.orders VERSION AS OF 42;
-- Restore table to a previous version
RESTORE TABLE silver.orders TO VERSION AS OF 42;
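The same capabilities are exposed in PySpark: timestampAsOf / versionAsOf read options for point-in-time queries, and the DeltaTable API for restore. A short sketch:
from delta.tables import DeltaTable

# Point-in-time reads
df_last_week = (spark.read.format("delta")
    .option("timestampAsOf", "2025-04-23")
    .load("abfss://silver@storage.dfs.core.windows.net/orders/"))

df_v42 = (spark.read.format("delta")
    .option("versionAsOf", 42)
    .load("abfss://silver@storage.dfs.core.windows.net/orders/"))

# Restore the table to a previous version
DeltaTable.forName(spark, "silver.orders").restoreToVersion(42)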
7. Data validation¶
Checksum-based validation¶
# Generate HDFS checksums
hdfs dfs -checksum /user/hive/warehouse/orders/year=2025/month=04/part-00000.parquet
# Generate ADLS checksum (via AzCopy with MD5)
azcopy copy --put-md5 ...
# Then validate MD5 matches via Azure Storage API
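One caveat: hdfs dfs -checksum returns a composite CRC-based value, which is not directly comparable to an MD5 hash. A practical approach is to hash the source files where they are readable as ordinary files (for example the NFS export used for AzCopy) and compare against the Content-MD5 property that azcopy --put-md5 stores on each blob. A sketch using the azure-storage-blob SDK; the account, container, paths, and credential are illustrative:
import hashlib
from azure.storage.blob import BlobServiceClient

def local_md5(path: str) -> bytes:
    # MD5 of a local or NFS-mounted file, read in 8 MB chunks
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8 * 1024 * 1024), b""):
            h.update(chunk)
    return h.digest()

service = BlobServiceClient(
    account_url="https://mystorageaccount.blob.core.windows.net",
    credential="<account-key-or-sas>",
)
blob = service.get_blob_client(
    container="raw",
    blob="hive/warehouse/orders/year=2025/month=04/part-00000.parquet",
)

# Content-MD5 is populated by AzCopy when run with --put-md5
remote_md5 = bytes(blob.get_blob_properties().content_settings.content_md5)

assert remote_md5 == local_md5(
    "/mnt/hdfs-export/hive/warehouse/orders/year=2025/month=04/part-00000.parquet"
)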
Row count validation¶
# Count rows in HDFS source
hdfs_count = spark.read.parquet("hdfs:///user/hive/warehouse/orders/").count()
# Count rows in ADLS target
adls_count = spark.read.format("delta").load(
"abfss://silver@storage.dfs.core.windows.net/orders/"
).count()
assert hdfs_count == adls_count, f"Row count mismatch: HDFS={hdfs_count}, ADLS={adls_count}"
Schema validation¶
# Compare schemas
hdfs_schema = spark.read.parquet("hdfs:///user/hive/warehouse/orders/").schema
adls_schema = spark.read.format("delta").load(
"abfss://silver@storage.dfs.core.windows.net/orders/"
).schema
assert hdfs_schema == adls_schema, "Schema mismatch detected"
Aggregate validation¶
from pyspark.sql import functions as F

# Compare key aggregates
hdfs_agg = spark.read.parquet("hdfs:///user/hive/warehouse/orders/") \
.agg(
F.count("*").alias("row_count"),
F.sum("amount").alias("total_amount"),
F.countDistinct("customer_id").alias("unique_customers"),
F.min("order_date").alias("min_date"),
F.max("order_date").alias("max_date")
).collect()[0]
adls_agg = spark.read.format("delta").load(
"abfss://silver@storage.dfs.core.windows.net/orders/"
).agg(
F.count("*").alias("row_count"),
F.sum("amount").alias("total_amount"),
F.countDistinct("customer_id").alias("unique_customers"),
F.min("order_date").alias("min_date"),
F.max("order_date").alias("max_date")
).collect()[0]
# Compare each metric
for field in ["row_count", "total_amount", "unique_customers", "min_date", "max_date"]:
    assert hdfs_agg[field] == adls_agg[field], f"Mismatch in {field}"
8. Worked example: end-to-end HDFS to ADLS migration¶
Scenario¶
Migrate a 50 TB Hive warehouse from a 60-node Cloudera CDH cluster to ADLS Gen2, converting ORC to Delta Lake.
Step 1: Inventory¶
# Count files and total size per table
hdfs dfs -du -s -h /user/hive/warehouse/* | sort -rh | head -20
Step 2: Provision Azure resources¶
# Create storage account with hierarchical namespace
az storage account create \
--name migrationlake \
--resource-group rg-migration \
--location eastus2 \
--sku Standard_ZRS \
--kind StorageV2 \
--hns true
# Create containers
az storage fs create --name raw --account-name migrationlake
az storage fs create --name silver --account-name migrationlake
az storage fs create --name gold --account-name migrationlake
Step 3: Install ABFS driver on Hadoop cluster¶
# Download and install azure-storage JARs to Hadoop classpath
# (Version depends on your Hadoop version)
cp azure-storage-*.jar $HADOOP_HOME/share/hadoop/common/lib/
cp hadoop-azure-*.jar $HADOOP_HOME/share/hadoop/common/lib/
Step 4: Bulk copy with DistCp¶
# Phase 1: Initial bulk copy
hadoop distcp \
-Dfs.azure.account.key.migrationlake.dfs.core.windows.net=$STORAGE_KEY \
-m 200 \
-bandwidth 500 \
-log /tmp/distcp-log \
hdfs://namenode:8020/user/hive/warehouse/ \
abfss://raw@migrationlake.dfs.core.windows.net/hive/warehouse/
Step 5: Convert ORC to Delta (Databricks notebook)¶
import os
from pyspark.sql import functions as F

# List all tables in the migrated warehouse
tables = dbutils.fs.ls("abfss://raw@migrationlake.dfs.core.windows.net/hive/warehouse/")

for table_dir in tables:
    table_name = table_dir.name.rstrip("/")
    source_path = table_dir.path
    target_path = f"abfss://silver@migrationlake.dfs.core.windows.net/{table_name}/"

    print(f"Converting {table_name}...")

    # Read ORC
    df = spark.read.format("orc").load(source_path)

    # Write as Delta with same partitioning
    partition_cols = [c for c in df.columns if c.startswith("year") or c.startswith("month")]
    writer = df.write.format("delta").mode("overwrite")
    if partition_cols:
        writer = writer.partitionBy(*partition_cols)
    writer.save(target_path)

    # Register in Unity Catalog
    spark.sql(f"""
        CREATE TABLE IF NOT EXISTS silver.{table_name}
        USING DELTA
        LOCATION '{target_path}'
    """)

    print(f"  Registered silver.{table_name}")
Step 6: Validate¶
# Run validation for each table
validation_results = []

for table_dir in tables:
    table_name = table_dir.name.rstrip("/")
    source_count = spark.read.format("orc").load(table_dir.path).count()
    target_count = spark.table(f"silver.{table_name}").count()
    match = source_count == target_count
    validation_results.append((table_name, source_count, target_count, match))

# Display results
validation_df = spark.createDataFrame(
    validation_results,
    ["table", "source_rows", "target_rows", "match"]
)
validation_df.show(100, truncate=False)
Step 7: Delta sync (catch changes during migration)¶
# Run DistCp in update mode to catch files changed during conversion
hadoop distcp \
-Dfs.azure.account.key.migrationlake.dfs.core.windows.net=$STORAGE_KEY \
-m 50 \
-update \
hdfs://namenode:8020/user/hive/warehouse/ \
abfss://raw@migrationlake.dfs.core.windows.net/hive/warehouse/
Common issues and solutions¶
| Issue | Cause | Solution |
|---|---|---|
| DistCp fails with OOM | Too many small files overwhelming NameNode | Use -strategy dynamic and reduce -m count |
| ABFS driver not found | Missing JARs on Hadoop classpath | Install hadoop-azure and azure-storage JARs |
| Permission denied on ADLS | Incorrect authentication configuration | Use shared key, SAS, or service principal with Storage Blob Data Contributor role |
| Slow transfer speed | Network bottleneck (VPN) | Use ExpressRoute or Azure Data Box for large datasets |
| File count mismatch | .hive-staging or _SUCCESS files | Filter these during validation; they are metadata, not data |
| ORC to Delta schema mismatch | Hive complex types (STRUCT, MAP, ARRAY) | Verify complex type compatibility; most work natively |
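For the file-count mismatch above, it helps to exclude Hadoop metadata artifacts before comparing source and target counts. A small self-contained sketch for a Databricks notebook (the marker list and path are illustrative):
# Artifacts that carry no data and should not be counted
METADATA_MARKERS = ("_SUCCESS", "_committed", "_started", ".hive-staging", ".crc")

def is_data_file(name: str) -> bool:
    return not any(marker in name for marker in METADATA_MARKERS)

entries = dbutils.fs.ls(
    "abfss://raw@migrationlake.dfs.core.windows.net/hive/warehouse/orders/year=2025/month=04/"
)
data_files = [e for e in entries if not e.name.endswith("/") and is_data_file(e.name)]
print(f"data files (metadata excluded): {len(data_files)}")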
Related¶
- Tutorial: HDFS to ADLS Gen2 — step-by-step tutorial
- Hive Migration — metastore and SQL migration
- Feature Mapping — all component mappings
- Migration Hub — full migration center
Last updated: 2026-04-30 Maintainers: CSA-in-a-Box core team Related: Tutorial: HDFS to ADLS | Hive Migration | Migration Hub