Tutorial: Migrate an S3 Bucket to ADLS Gen2¶
Status: Authored 2026-04-30. Audience: Data engineers migrating AWS S3 storage to Azure Data Lake Storage Gen2 as part of a broader AWS-to-Azure analytics migration. Prerequisite knowledge: AWS S3, Azure Storage, CLI tools, basic networking. Time estimate: 4-8 hours for a single bucket (excluding data transfer time for large datasets).
Overview¶
This tutorial walks through migrating a single S3 bucket to ADLS Gen2, setting up a hybrid bridge via OneLake shortcuts, converting file formats to Delta Lake, and validating data parity. By the end, your data lives natively on ADLS Gen2 with Delta Lake tables registered in Unity Catalog.
AWS comparison: In AWS, your data lake is S3 buckets + Glue Catalog. In Azure, the equivalent is ADLS Gen2 containers + Unity Catalog (runtime metadata) + Purview (enterprise governance). OneLake provides a unified namespace across all storage, similar to how S3 Access Points abstract bucket topology.
Prerequisites¶
Tools¶
| Tool | Minimum version | Install |
|---|---|---|
| AWS CLI | 2.x | Official installer (MSI, pkg, or zip bundle); `pip install awscli` installs only the 1.x CLI |
| Azure CLI | 2.60+ | curl -sL https://aka.ms/InstallAzureCLIDeb \| sudo bash |
| AzCopy | 10.24+ | Download |
| Databricks CLI | 0.220+ | Standalone binary (install script or Homebrew); `pip install databricks-cli` installs only the legacy 0.18 CLI |
| jq | 1.6+ | sudo apt install jq or brew install jq |
AWS access¶
- IAM credentials with `s3:ListBucket`, `s3:GetObject`, and `s3:GetBucketLocation` on the source bucket.
- If using AzCopy's S3-to-Azure copy directly: the bucket must allow public listing, or you must generate pre-signed URLs or use an IAM principal with programmatic access (access key and secret, as in Step 4).
Azure access¶
- An Azure subscription (commercial or Azure Government).
- Permissions: `Contributor` on the resource group, `Storage Blob Data Contributor` on the target storage account.
- A Databricks workspace with Unity Catalog enabled (for the format conversion steps).
Step 1: Inventory the S3 bucket¶
Before migrating anything, understand what you are moving.
# Set the source bucket
SOURCE_BUCKET="s3://acme-analytics-raw"
# List top-level prefixes (folders)
aws s3 ls ${SOURCE_BUCKET}/ --summarize
# Get total object count and size
aws s3 ls ${SOURCE_BUCKET} --recursive --summarize | tail -2
# Example output:
# Total Objects: 1,247,832
# Total Size: 2.3 TiB
# Inventory file types
aws s3 ls ${SOURCE_BUCKET} --recursive | \
awk '{print $4}' | \
sed 's/.*\.//' | \
sort | uniq -c | sort -rn | head -20
# Example output:
# 834210 parquet
# 201445 json
# 112034 csv
# 100143 orc
# Check bucket region
aws s3api get-bucket-location --bucket acme-analytics-raw
# Example: {"LocationConstraint": "us-gov-west-1"}
# List lifecycle rules (to replicate on Azure)
aws s3api get-bucket-lifecycle-configuration --bucket acme-analytics-raw
Record the following in your migration tracker:
| Attribute | Value |
|---|---|
| Bucket name | acme-analytics-raw |
| Region | us-gov-west-1 |
| Total size | 2.3 TiB |
| Object count | 1,247,832 |
| Primary formats | Parquet (67%), JSON (16%), CSV (9%), ORC (8%) |
| Lifecycle rules | 90-day transition to S3-IA, 365-day to Glacier |
| Encryption | SSE-S3 (AES-256) |
| Versioning | Enabled |
AWS comparison: In AWS, `aws s3 ls --recursive --summarize` is the standard inventory command. In Azure, the equivalent is `az storage blob list --account-name <name> --container-name <name> --query "[].{name:name, size:properties.contentLength}" -o table`.
Step 2: Create the ADLS Gen2 storage account¶
# Variables
RESOURCE_GROUP="rg-analytics-migration"
STORAGE_ACCOUNT="acmeanalyticsgov" # Must be globally unique, 3-24 chars, lowercase
LOCATION="usgovvirginia" # Use "eastus2" for commercial Azure
SUBSCRIPTION="<your-subscription-id>"
# Set subscription
az account set --subscription ${SUBSCRIPTION}
# Create resource group (if it doesn't exist)
az group create \
--name ${RESOURCE_GROUP} \
--location ${LOCATION}
# Create ADLS Gen2 storage account with hierarchical namespace
az storage account create \
--name ${STORAGE_ACCOUNT} \
--resource-group ${RESOURCE_GROUP} \
--location ${LOCATION} \
--sku Standard_ZRS \
--kind StorageV2 \
--hns true \
--min-tls-version TLS1_2 \
--allow-blob-public-access false \
--require-infrastructure-encryption true \
--encryption-services blob file \
--tags environment=migration project=analytics-migration
# Create containers matching the medallion architecture
for CONTAINER in bronze silver gold archive; do
az storage container create \
--name ${CONTAINER} \
--account-name ${STORAGE_ACCOUNT} \
--auth-mode login
done
# Verify HNS is enabled (this is what makes it "Gen2")
az storage account show \
--name ${STORAGE_ACCOUNT} \
--resource-group ${RESOURCE_GROUP} \
--query "isHnsEnabled"
# Expected: true
AWS comparison: In AWS, you create an S3 bucket with `aws s3 mb`. ADLS Gen2 is a storage account with hierarchical namespace (HNS) enabled. HNS gives you true directory operations (rename is O(1), not O(n) like S3 prefix renames). The `--hns true` flag is critical -- without it, you get Blob Storage, not Data Lake Storage Gen2.
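If you want to exercise those HNS directory semantics directly, the snippet below renames a directory atomically through the Data Lake SDK. It is a minimal sketch: the azure-storage-file-datalake and azure-identity packages are assumed to be installed, and the example directory paths are hypothetical.

# Sketch: atomic directory rename on an HNS-enabled account.
# Assumes: pip install azure-storage-file-datalake azure-identity
# The directory names below are illustrative, not part of the tutorial's dataset layout.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://acmeanalyticsgov.dfs.core.usgovcloudapi.net",
    credential=DefaultAzureCredential(),
)
bronze = service.get_file_system_client("bronze")

# On HNS this is a single metadata operation; on flat blob storage (or S3),
# a "rename" means copying and deleting every object under the prefix.
bronze.get_directory_client("sales/2024").rename_directory(new_name="bronze/sales/2024-archived")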
Set up lifecycle management (matching S3 lifecycle rules)¶
# Create lifecycle policy JSON
cat > /tmp/lifecycle-policy.json << 'EOF'
{
"rules": [
{
"enabled": true,
"name": "archive-old-bronze",
"type": "Lifecycle",
"definition": {
"actions": {
"baseBlob": {
"tierToCool": { "daysAfterModificationGreaterThan": 90 },
"tierToArchive": { "daysAfterModificationGreaterThan": 365 }
}
},
"filters": {
"blobTypes": ["blockBlob"],
"prefixMatch": ["bronze/"]
}
}
}
]
}
EOF
az storage account management-policy create \
--account-name ${STORAGE_ACCOUNT} \
--resource-group ${RESOURCE_GROUP} \
--policy @/tmp/lifecycle-policy.json
Step 3: Configure networking¶
Choose one of the following approaches based on your migration scenario.
Option A: ExpressRoute / VPN (recommended for production)¶
If you have an ExpressRoute circuit or site-to-site VPN between AWS and Azure:
# Enable private endpoint for the storage account
az network private-endpoint create \
--name pe-${STORAGE_ACCOUNT} \
--resource-group ${RESOURCE_GROUP} \
--vnet-name vnet-analytics \
--subnet snet-private-endpoints \
--private-connection-resource-id $(az storage account show \
--name ${STORAGE_ACCOUNT} \
--resource-group ${RESOURCE_GROUP} \
--query id -o tsv) \
--group-id blob \
--connection-name plc-${STORAGE_ACCOUNT}
# Repeat with --group-id dfs so the ADLS Gen2 (abfss://) endpoint is also reachable privately
# Disable public network access after private endpoint is confirmed
az storage account update \
--name ${STORAGE_ACCOUNT} \
--resource-group ${RESOURCE_GROUP} \
--public-network-access Disabled
Option B: Public endpoint with IP restrictions (for initial migration)¶
# Allow only your migration server's IP
MIGRATION_SERVER_IP="203.0.113.50"
az storage account network-rule add \
--account-name ${STORAGE_ACCOUNT} \
--resource-group ${RESOURCE_GROUP} \
--ip-address ${MIGRATION_SERVER_IP}
az storage account update \
--name ${STORAGE_ACCOUNT} \
--resource-group ${RESOURCE_GROUP} \
--default-action Deny
AWS comparison: In AWS, you restrict bucket access via bucket policies and VPC endpoints. In Azure, you use network rules (IP allowlists), private endpoints (equivalent to VPC endpoints), and service endpoints. The Private Endpoint model is more explicit than S3's VPC endpoint -- each storage account gets a dedicated NIC in your VNet.
Step 4: Run AzCopy from S3 to ADLS Gen2¶
AzCopy natively supports S3-to-Azure transfers without an intermediate staging location.
Generate a SAS token for the target¶
# Generate a SAS token valid for 7 days (GNU date syntax; on macOS use: date -u -v+7d '+%Y-%m-%dT%H:%MZ')
END_DATE=$(date -u -d "+7 days" '+%Y-%m-%dT%H:%MZ')
SAS_TOKEN=$(az storage account generate-sas \
--account-name ${STORAGE_ACCOUNT} \
--permissions rwdlacup \
--resource-types sco \
--services b \
--expiry ${END_DATE} \
--output tsv)
Set AWS credentials for AzCopy¶
# AzCopy reads these environment variables for S3 access
export AWS_ACCESS_KEY_ID="<your-access-key>"
export AWS_SECRET_ACCESS_KEY="<your-secret-key>"
# For GovCloud:
export AWS_REGION="us-gov-west-1"
Run the transfer¶
# Full bucket copy: S3 → ADLS Gen2 bronze container
azcopy copy \
"https://s3-us-gov-west-1.amazonaws.com/acme-analytics-raw/" \
"https://${STORAGE_ACCOUNT}.blob.core.usgovcloudapi.net/bronze/?${SAS_TOKEN}" \
--recursive \
--s2s-preserve-access-tier=false \
--include-pattern "*.parquet;*.json;*.csv;*.orc" \
--log-level INFO \
--cap-mbps 1000
# Monitor progress (AzCopy logs to ~/.azcopy/)
azcopy jobs list
azcopy jobs show <job-id>
# For very large migrations (10+ TiB), run in parallel by prefix:
for PREFIX in sales/ inventory/ customers/ logs/; do
azcopy copy \
"https://s3-us-gov-west-1.amazonaws.com/acme-analytics-raw/${PREFIX}" \
"https://${STORAGE_ACCOUNT}.blob.core.usgovcloudapi.net/bronze/${PREFIX}?${SAS_TOKEN}" \
--recursive \
--s2s-preserve-access-tier=false \
--log-level INFO &
done
wait
Expected transfer rates:
| Network path | Throughput | 2 TiB estimate |
|---|---|---|
| Public internet | 200-500 Mbps | 9-22 hours |
| ExpressRoute 1 Gbps | 800-900 Mbps | 5-6 hours |
| ExpressRoute 10 Gbps | 5-8 Gbps | 35-55 minutes |
AWS comparison: In AWS, cross-region replication or `aws s3 sync` handles bucket-to-bucket copies. AzCopy is the Azure equivalent of `aws s3 sync` but with native S3-source support. For datasets over 50 TiB, consider Azure Data Box instead of network transfer.
Step 5: Set up OneLake shortcut to S3 (hybrid bridge)¶
During the migration period, consumers need to read from both S3 (historical data not yet migrated) and ADLS Gen2 (newly landing data). OneLake shortcuts solve this without copying data.
Create the shortcut via Fabric REST API¶
# Prerequisites: a Fabric workspace and lakehouse
WORKSPACE_ID="<fabric-workspace-id>"
LAKEHOUSE_ID="<fabric-lakehouse-id>"
# Create an S3 shortcut in OneLake
curl -X POST \
"https://api.fabric.microsoft.com/v1/workspaces/${WORKSPACE_ID}/items/${LAKEHOUSE_ID}/shortcuts" \
-H "Authorization: Bearer $(az account get-access-token --resource https://api.fabric.microsoft.com --query accessToken -o tsv)" \
-H "Content-Type: application/json" \
-d '{
"path": "Tables/s3_raw_sales",
"name": "s3_raw_sales",
"target": {
"amazonS3": {
"location": "https://s3.us-gov-west-1.amazonaws.com",
"subpath": "acme-analytics-raw/sales/",
"connectionId": "<s3-connection-id>"
}
}
}'
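Before building tables on top of the shortcut, confirm it was actually created. A sketch using the Fabric List Shortcuts endpoint (assumes the requests package is installed and the same Azure CLI token acquisition as the curl call above):

# Sketch: list the lakehouse's shortcuts to confirm s3_raw_sales exists.
import json
import subprocess
import requests

WORKSPACE_ID = "<fabric-workspace-id>"
LAKEHOUSE_ID = "<fabric-lakehouse-id>"

# Same token acquisition as the curl example above, shelled out from Python.
token = subprocess.run(
    ["az", "account", "get-access-token",
     "--resource", "https://api.fabric.microsoft.com",
     "--query", "accessToken", "-o", "tsv"],
    capture_output=True, text=True, check=True,
).stdout.strip()

resp = requests.get(
    f"https://api.fabric.microsoft.com/v1/workspaces/{WORKSPACE_ID}/items/{LAKEHOUSE_ID}/shortcuts",
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))  # look for an entry named s3_raw_sales under Tables/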
Register the shortcut in Databricks via Unity Catalog¶
-- In Databricks SQL, create an external location pointing to the S3 shortcut
CREATE EXTERNAL LOCATION IF NOT EXISTS s3_bridge_raw
URL 'abfss://raw@onelake.dfs.fabric.microsoft.com/sales/'
WITH (STORAGE CREDENTIAL onelake_credential);
-- Create a table over the shortcut (read-only during migration)
CREATE TABLE IF NOT EXISTS migration_bridge.raw.sales_s3
USING PARQUET
LOCATION 'abfss://raw@onelake.dfs.fabric.microsoft.com/sales/';
AWS comparison: In AWS, Athena's federated queries or Redshift Spectrum let you query data in external locations. OneLake shortcuts serve the same purpose but work across cloud boundaries -- your Databricks queries read S3 data through OneLake without copying it. This is the single most valuable migration pattern: keep S3 read-only while Azure warms up.
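One lightweight way to make the eventual cutover painless is to put a view in front of whichever table is currently authoritative, so downstream jobs never reference the shortcut table directly. This is an illustrative sketch, not a required step; the sales_current view name is hypothetical.

# Sketch: a thin indirection layer for the parallel-run window.
# Downstream jobs read analytics_prod.bronze.sales_current, so the Step 8 cutover
# becomes a one-line view change instead of edits to every consumer.
spark.sql("""
    CREATE OR REPLACE VIEW analytics_prod.bronze.sales_current AS
    SELECT * FROM migration_bridge.raw.sales_s3   -- still served through the S3 shortcut
""")

# After Step 7 validation passes, repoint the view at the native Delta table:
spark.sql("""
    CREATE OR REPLACE VIEW analytics_prod.bronze.sales_current AS
    SELECT * FROM analytics_prod.bronze.sales
""")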
Step 6: Convert Parquet/ORC to Delta Lake format¶
Once data is on ADLS Gen2, convert it to Delta Lake for ACID transactions, time travel, and Z-ordering.
Databricks notebook: Bulk format conversion¶
# Notebook: convert_to_delta.py
# Run on a Databricks cluster with Unity Catalog enabled
from pyspark.sql.functions import input_file_name, current_timestamp

# Configuration
STORAGE_ACCOUNT = "acmeanalyticsgov"
SOURCE_CONTAINER = "bronze"
TARGET_CATALOG = "analytics_prod"
TARGET_SCHEMA = "bronze"

source_base = f"abfss://{SOURCE_CONTAINER}@{STORAGE_ACCOUNT}.dfs.core.usgovcloudapi.net"

# List of datasets to convert (prefix, format, partition_cols)
datasets = [
    ("sales/", "parquet", ["year", "month"]),
    ("inventory/", "parquet", ["region"]),
    ("customers/", "json", []),
    ("logs/", "orc", ["date"]),
]

for prefix, fmt, partition_cols in datasets:
    source_path = f"{source_base}/{prefix}"
    table_name = prefix.rstrip("/").replace("/", "_")
    full_table = f"{TARGET_CATALOG}.{TARGET_SCHEMA}.{table_name}"
    print(f"Converting {source_path} ({fmt}) -> {full_table}")

    # Read source format
    df = spark.read.format(fmt).load(source_path)

    # Add metadata columns
    df = (df.withColumn("_source_file", input_file_name())
            .withColumn("_ingested_at", current_timestamp()))

    # Write as Delta with optional partitioning
    writer = df.write.format("delta").mode("overwrite")
    if partition_cols:
        writer = writer.partitionBy(*partition_cols)
    writer.saveAsTable(full_table)

    # Optimize the new Delta table
    spark.sql(f"OPTIMIZE {full_table}")

    row_count = spark.sql(f"SELECT COUNT(*) AS cnt FROM {full_table}").first().cnt
    print(f" -> {full_table}: {row_count:,} rows written")
Run OPTIMIZE and ZORDER for query performance¶
-- After conversion, optimize tables for common query patterns
OPTIMIZE analytics_prod.bronze.sales
ZORDER BY (product_id, region);
OPTIMIZE analytics_prod.bronze.inventory
ZORDER BY (sku, warehouse_id);
-- Verify table properties
DESCRIBE EXTENDED analytics_prod.bronze.sales;
-- Look for: Provider = delta, Location = abfss://...
AWS comparison: In AWS, you might use Glue jobs or Athena CTAS to convert between formats. In Azure, Databricks notebooks that read the source format and write Delta (`spark.read.format("parquet").load(path).write.format("delta")`) serve the same purpose. The key difference is that Delta Lake tables are ACID-compliant -- you get time travel, schema enforcement, and `MERGE` operations that Parquet-on-S3 lacks without Iceberg/Hudi.
Step 7: Validate data parity¶
Never trust a migration without validation. Run these checks for every dataset.
Row count validation¶
# Databricks notebook: validate_parity.py
datasets_to_validate = [
    {
        "name": "sales",
        "s3_table": "migration_bridge.raw.sales_s3",
        "azure_table": "analytics_prod.bronze.sales",
        "key_columns": ["order_id"],
        "measure_columns": ["quantity", "gross_amount"],
    }
]

results = []
for ds in datasets_to_validate:
    # Row counts
    s3_count = spark.sql(f"SELECT COUNT(*) AS cnt FROM {ds['s3_table']}").first().cnt
    az_count = spark.sql(f"SELECT COUNT(*) AS cnt FROM {ds['azure_table']}").first().cnt

    # Aggregate checksums on measure columns
    measures = ", ".join([f"SUM(CAST({c} AS DOUBLE)) AS sum_{c}" for c in ds["measure_columns"]])
    s3_sums = spark.sql(f"SELECT {measures} FROM {ds['s3_table']}").first()
    az_sums = spark.sql(f"SELECT {measures} FROM {ds['azure_table']}").first()

    match = (s3_count == az_count)
    for c in ds["measure_columns"]:
        s3_val = getattr(s3_sums, f"sum_{c}")
        az_val = getattr(az_sums, f"sum_{c}")
        if abs(s3_val - az_val) > 0.01:
            match = False

    results.append({
        "dataset": ds["name"],
        "s3_rows": s3_count,
        "azure_rows": az_count,
        "row_match": s3_count == az_count,
        "checksum_match": match,
    })
    print(f"{ds['name']}: S3={s3_count:,} Azure={az_count:,} Match={match}")
# Create summary table
results_df = spark.createDataFrame(results)
results_df.display()
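Aggregate sums can miss offsetting errors (one row too high, another too low). Where you need stricter parity, a row-level hash comparison is a reasonable follow-up; the sketch below assumes the compared columns cast cleanly to strings and excludes the migration-added metadata columns.

# Sketch: row-level parity via per-row hashes (stricter than aggregate checksums).
from pyspark.sql.functions import col, concat_ws, sha2

def row_hashes(table_name):
    df = spark.table(table_name)
    # Exclude migration-added columns such as _source_file / _ingested_at.
    cols = sorted(c for c in df.columns if not c.startswith("_"))
    # Note: concat_ws skips nulls, so null vs empty string hash identically -- acceptable for a sketch.
    return df.select(sha2(concat_ws("||", *[col(c).cast("string") for c in cols]), 256).alias("row_hash"))

s3_hashes = row_hashes("migration_bridge.raw.sales_s3")
az_hashes = row_hashes("analytics_prod.bronze.sales")

# Empty results on both sides mean exact row-level parity (duplicates included, via exceptAll).
missing_in_azure = s3_hashes.exceptAll(az_hashes)
extra_in_azure = az_hashes.exceptAll(s3_hashes)
print(f"missing in Azure: {missing_in_azure.count():,}, extra in Azure: {extra_in_azure.count():,}")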
Schema comparison¶
-- Compare schemas between S3 source and Delta target
DESCRIBE TABLE migration_bridge.raw.sales_s3;
DESCRIBE TABLE analytics_prod.bronze.sales;
-- Check for type mismatches (common: string vs int, timestamp precision)
-- Unity Catalog exposes column metadata through information_schema
SELECT
  COALESCE(s.column_name, a.column_name) AS col_name,
  s.data_type AS s3_type,
  a.data_type AS azure_type,
  CASE WHEN s.data_type = a.data_type THEN 'MATCH' ELSE 'MISMATCH' END AS status
FROM (
  SELECT column_name, data_type
  FROM migration_bridge.information_schema.columns
  WHERE table_schema = 'raw' AND table_name = 'sales_s3'
) s
FULL OUTER JOIN (
  SELECT column_name, data_type
  FROM analytics_prod.information_schema.columns
  WHERE table_schema = 'bronze' AND table_name = 'sales'
    AND substring(column_name, 1, 1) <> '_'  -- exclude the added _source_file / _ingested_at columns
) a ON s.column_name = a.column_name;
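The same check can be scripted across every dataset instead of one table at a time; a short sketch reusing the datasets_to_validate list from the notebook above:

# Sketch: programmatic schema diff for every dataset in datasets_to_validate.
for ds in datasets_to_validate:
    s3_types = dict(spark.table(ds["s3_table"]).dtypes)
    az_types = {name: dtype
                for name, dtype in spark.table(ds["azure_table"]).dtypes
                if not name.startswith("_")}  # ignore _source_file / _ingested_at

    for name in sorted(set(s3_types) | set(az_types)):
        s3_t, az_t = s3_types.get(name), az_types.get(name)
        if s3_t != az_t:
            print(f"{ds['name']}.{name}: s3={s3_t} azure={az_t} -> MISMATCH")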
Step 8: Update downstream consumers¶
Once validation passes, redirect consumers from S3 to ADLS Gen2.
Update Databricks notebooks/jobs¶
# Before (reading from S3 via shortcut)
df = spark.read.table("migration_bridge.raw.sales_s3")
# After (reading from native Delta on ADLS Gen2)
df = spark.read.table("analytics_prod.bronze.sales")
Update ADF pipelines¶
In Azure Data Factory, update the linked service and dataset so the file path points to the native ADLS Gen2 location instead of the OneLake shortcut (for example, abfss://bronze@acmeanalyticsgov.dfs.core.usgovcloudapi.net/sales/).
Update Power BI semantic models¶
If Power BI reports connect via DirectQuery or Direct Lake:
- Open the semantic model in Power BI Desktop.
- Change the data source from the shortcut path to the ADLS Gen2 native path.
- Publish and verify the report refreshes successfully.
Decommission the OneLake shortcut¶
After all consumers are migrated and validated (recommended: 2-week parallel-run minimum):
# Remove the S3 shortcut from OneLake
curl -X DELETE \
"https://api.fabric.microsoft.com/v1/workspaces/${WORKSPACE_ID}/items/${LAKEHOUSE_ID}/shortcuts/s3_raw_sales" \
-H "Authorization: Bearer $(az account get-access-token --resource https://api.fabric.microsoft.com --query accessToken -o tsv)"
Troubleshooting¶
AzCopy fails with "AuthorizationFailure"¶
- Verify the SAS token has not expired.
- Ensure `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are set in the environment.
- For GovCloud, verify the S3 endpoint URL uses `s3-us-gov-west-1.amazonaws.com`.
AzCopy transfer is slow¶
- Check that `--cap-mbps` is not set too low.
- For ExpressRoute, verify the circuit is not saturated (check metrics in the Azure portal).
- Split large prefixes into parallel AzCopy jobs (see Step 4 parallel example).
- Consider Azure Data Box for datasets over 50 TiB.
Delta conversion fails with OutOfMemoryError¶
- Increase the cluster driver and worker memory.
- Process datasets in smaller partitions: `.write.partitionBy("year", "month")`.
- Use Auto Loader for incremental ingestion instead of batch conversion (see the sketch below).
Row count mismatch after migration¶
- Check for S3 versioning: if enabled, `aws s3 ls --recursive` counts the latest version only, but there may be delete markers.
- Check for files that arrived in S3 after the AzCopy snapshot.
- Verify write mode and filters: re-running the conversion in append mode duplicates rows; the added `_source_file` and `_ingested_at` metadata columns do not change row counts.
OneLake shortcut returns "Forbidden"¶
- Verify the S3 connection in Fabric has valid AWS credentials.
- Ensure the IAM role used by the connection has `s3:GetObject` and `s3:ListBucket`.
- Check that the S3 bucket policy does not deny cross-account access.
Next steps¶
- Convert more buckets: Repeat this tutorial for each S3 bucket in your migration plan.
- Set up incremental sync: For buckets receiving ongoing writes, configure Auto Loader to continuously ingest new files from ADLS Gen2.
- Migrate Redshift: See tutorial-redshift-to-fabric.md for warehouse migration.
- Migrate Glue ETL: See tutorial-glue-to-adf-dbt.md for ETL pipeline conversion.
- Review best practices: See best-practices.md for migration patterns and common pitfalls.
Related resources¶
- AWS-to-Azure migration playbook -- full capability mapping and phased plan
- Benchmarks -- performance and cost comparisons
- `csa_platform/unity_catalog_pattern/onelake_config.yaml` -- OneLake shortcut configuration
- `docs/adr/0003-delta-lake-over-iceberg-and-parquet.md` -- why Delta Lake is the primary format
- `docs/COST_MANAGEMENT.md` -- storage cost optimization
Last updated: 2026-04-30 Maintainers: CSA-in-a-Box core team