Storage Migration: S3 to ADLS Gen2 and OneLake¶
A deep-dive guide for data engineers migrating Amazon S3 storage to Azure Data Lake Storage Gen2 and Microsoft OneLake.
Executive summary¶
S3 is the substrate of the AWS analytics estate. Every other AWS analytics service --- Redshift Spectrum, EMR, Glue, Athena --- reads from and writes to S3. Migrating storage is therefore the most consequential decision in an AWS-to-Azure migration: get it right, and every other service migration becomes simpler.
The key insight is that storage migration does not have to be a big-bang cutover. OneLake S3 shortcuts provide zero-copy, read-only access to S3 buckets from Azure, allowing the compute layer to migrate first while data moves incrementally. This document covers architecture comparison, migration patterns, file format considerations, data transfer strategies, access control mapping, and worked examples.
Architecture comparison¶
S3 data model¶
AWS Account
└── S3 Bucket (globally unique name)
├── Prefix (virtual directory)
│ ├── Object (file)
│ └── Object
└── Prefix
└── Object
- Flat namespace. S3 uses key-value pairs; directories are simulated via prefix delimiters.
- Bucket policies + IAM. Access control combines bucket policies, IAM policies, and S3 Access Points.
- Storage classes. Standard, Intelligent-Tiering, Standard-IA, One Zone-IA, Glacier Instant Retrieval, Glacier Flexible Retrieval, Glacier Deep Archive.
- Event notifications. S3 emits events to SQS, SNS, Lambda, or EventBridge.
ADLS Gen2 data model¶
Azure Subscription
└── Storage Account (globally unique name)
└── Container (equivalent to bucket)
├── Directory (true hierarchical namespace)
│ ├── File
│ └── File
└── Directory
└── File
- Hierarchical namespace (HNS). ADLS Gen2 provides true directory operations (rename, delete directory) with atomic guarantees. This is a fundamental improvement over S3's flat namespace for analytics workloads (see the rename sketch after this list).
- RBAC + ACLs. Azure RBAC at the container/account level; POSIX-like ACLs at the directory/file level.
- Access tiers. Hot, Cool, Cold, Archive (per-blob or per-account default).
- Event Grid. ADLS Gen2 emits BlobCreated/BlobDeleted events to Event Grid.
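To make the HNS point concrete, the following is a minimal sketch of an atomic directory rename using the azure-storage-file-datalake SDK. The account, container, and directory paths are placeholders, and credentials are assumed to come from azure-identity.
# Minimal sketch: atomic directory rename on ADLS Gen2 (hierarchical namespace).
# Account, container, and directory paths below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://acmeanalyticsgov.dfs.core.usgovcloudapi.net",
    credential=DefaultAzureCredential(),
)
fs = service.get_file_system_client("raw")

# One metadata operation renames the directory and everything under it;
# the same change on S3 would mean copying and deleting every object.
directory = fs.get_directory_client("sales/2026-04-15")
directory.rename_directory(new_name="raw/sales/dt=2026-04-15")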
OneLake data model¶
Fabric Tenant
└── Workspace
└── Lakehouse
├── Tables (managed Delta tables)
└── Files (unmanaged files)
└── Shortcut → S3 bucket (zero-copy)
- OneLake shortcuts. Provide a virtual mount point to external storage (S3, ADLS Gen2, GCS) without data movement.
- Unified namespace. All Fabric artifacts (lakehouses, warehouses, semantic models) share a single OneLake storage layer.
Key architectural differences¶
| Dimension | S3 | ADLS Gen2 | OneLake |
|---|---|---|---|
| Namespace | Flat (prefix-based) | Hierarchical (HNS) | Hierarchical (Fabric-managed) |
| Directory rename | O(n) --- copies all objects | O(1) --- atomic | O(1) --- managed |
| Access control | Bucket policies + IAM + ACLs | RBAC + POSIX ACLs | Fabric workspace permissions |
| Storage tiers | 7 classes | 4 tiers (hot/cool/cold/archive) | Hot (managed by Fabric) |
| Encryption | SSE-S3, SSE-KMS, SSE-C | Microsoft-managed or CMK via Key Vault | Fabric-managed encryption |
| Versioning | Object-level versioning | Blob-level versioning + Delta time travel | Delta time travel |
| Event system | S3 Events to SQS/SNS/Lambda/EventBridge | Event Grid | Fabric Data Activator |
| Maximum object size | 5 TB | ~190 TB (block blob) | Fabric-managed |
| API | S3 REST API | Azure Blob REST API + ADLS Gen2 DFS API | OneLake REST API |
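To illustrate the API row, here is a sketch of the same file written once through the S3 REST API (boto3) and once through the ADLS Gen2 DFS API (azure-storage-file-datalake). Bucket, container, and path names are illustrative placeholders.
# Illustrative sketch: the same Parquet file written via both storage APIs.
# Bucket, container, and path names are placeholders.
import boto3
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

payload = open("part-00000.parquet", "rb").read()

# S3 REST API: a flat key-value put under a prefix.
boto3.client("s3").put_object(
    Bucket="acme-analytics-raw",
    Key="sales/date=2026-04-15/part-00000.parquet",
    Body=payload,
)

# ADLS Gen2 DFS API: a file created inside a real directory hierarchy.
adls = DataLakeServiceClient(
    account_url="https://acmeanalyticsgov.dfs.core.usgovcloudapi.net",
    credential=DefaultAzureCredential(),
)
adls.get_file_system_client("raw") \
    .get_file_client("sales/date=2026-04-15/part-00000.parquet") \
    .upload_data(payload, overwrite=True)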
Migration patterns¶
Pattern 1: OneLake S3 shortcuts (zero-copy bridge)¶
When to use: Immediate compute migration; S3 stays source-of-truth during transition.
How it works:
- Create a OneLake shortcut pointing to the S3 bucket or prefix.
- Databricks and Fabric read S3 data through the shortcut with no data copy.
- New writes land on ADLS Gen2 from day one.
- Over time, flip individual datasets from S3-backed to ADLS-native.
Advantages:
- Zero data movement on day one
- Compute migration can proceed immediately
- Fallback: remove shortcuts and return to pure AWS
- No upfront bulk transfer; cross-cloud egress is incurred only for the data actually read (and OneLake caching can reduce repeated reads)
Limitations:
- Read-only from the Azure side (cannot write back to S3 via shortcut)
- Latency depends on cross-cloud network path
- S3 access credentials must be managed in Fabric
Configuration example:
{
"shortcutType": "AmazonS3",
"path": "Tables/raw_sales",
"target": {
"connectionId": "connection-guid",
"location": "s3://acme-analytics-raw/sales/",
"subpath": ""
}
}
Cross-reference: csa_platform/unity_catalog_pattern/onelake_config.yaml
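Shortcuts can also be created programmatically. The sketch below posts a shortcut definition to the OneLake shortcuts REST API; the workspace and lakehouse GUIDs, connection ID, and bearer token are placeholders, and the payload shape (name/path split, https-style bucket location) differs slightly from the configuration snippet above, so verify it against the current Fabric REST API reference.
# Hedged sketch: create the S3 shortcut through the Fabric (OneLake) REST API.
# Workspace/lakehouse GUIDs, connection ID, and token are placeholders; confirm
# the endpoint and payload schema against the current Fabric API reference.
import requests

FABRIC_API = "https://api.fabric.microsoft.com/v1"
workspace_id = "<workspace-guid>"
lakehouse_id = "<lakehouse-item-guid>"
token = "<entra-id-bearer-token>"  # e.g. acquired via azure-identity

shortcut = {
    "name": "raw_sales",
    "path": "Tables",
    "target": {
        "amazonS3": {
            "location": "https://acme-analytics-raw.s3.amazonaws.com",
            "subpath": "/sales",
            "connectionId": "connection-guid",
        }
    },
}

resp = requests.post(
    f"{FABRIC_API}/workspaces/{workspace_id}/items/{lakehouse_id}/shortcuts",
    headers={"Authorization": f"Bearer {token}"},
    json=shortcut,
)
resp.raise_for_status()
print(resp.json())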
Pattern 2: AzCopy bulk transfer¶
When to use: Full data migration for hot/warm data; datasets under 100 TB.
# Install AzCopy (pre-installed on Azure Cloud Shell)
# Authenticate to both S3 and ADLS Gen2
# Copy from S3 to ADLS Gen2
azcopy copy \
"https://s3.amazonaws.com/acme-analytics-raw/" \
"https://acmeanalyticsgov.dfs.core.usgovcloudapi.net/raw/" \
--recursive \
--s2s-preserve-access-tier=false \
--include-pattern "*.parquet;*.json;*.csv" \
--log-level INFO
# Preview a re-copy without transferring any data
azcopy copy \
"https://s3.amazonaws.com/acme-analytics-raw/" \
"https://acmeanalyticsgov.dfs.core.usgovcloudapi.net/raw/" \
--recursive \
--s2s-preserve-access-tier=false \
--dry-run
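For an independent check after the copy, beyond AzCopy's own job summary, the following sketch compares object counts and total bytes between the S3 bucket and the ADLS Gen2 container. It assumes boto3 and azure-storage-file-datalake are installed and reuses the names from the commands above; adjust the prefix if the copy was filtered by extension.
# Minimal sketch: compare object count and total bytes on both sides of the copy.
# Assumes boto3 and azure-storage-file-datalake; names reuse the commands above.
import boto3
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

def s3_stats(bucket, prefix=""):
    count, size = 0, 0
    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            count += 1
            size += obj["Size"]
    return count, size

def adls_stats(account_url, container, path=None):
    count, size = 0, 0
    fs = DataLakeServiceClient(account_url, credential=DefaultAzureCredential()) \
        .get_file_system_client(container)
    for p in fs.get_paths(path=path, recursive=True):
        if not p.is_directory:
            count += 1
            size += p.content_length
    return count, size

src = s3_stats("acme-analytics-raw")
dst = adls_stats("https://acmeanalyticsgov.dfs.core.usgovcloudapi.net", "raw")
print(f"S3   : {src[0]} objects, {src[1]} bytes")
print(f"ADLS : {dst[0]} files, {dst[1]} bytes")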
Performance tuning:
- AzCopy automatically parallelizes across multiple connections.
- Set AZCOPY_CONCURRENCY_VALUE to increase parallelism (default: 300).
- Use --block-size-mb to optimize for large files (default: 8 MB; use 100 MB for Parquet).
- Expect 5-10 Gbps throughput over ExpressRoute; 1-2 Gbps over VPN.
Pattern 3: ADF Copy Activity¶
When to use: Ongoing incremental sync; orchestrated migration with monitoring.
{
"name": "CopyS3ToADLS",
"type": "Copy",
"inputs": [
{
"referenceName": "S3RawDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "ADLSRawDataset",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "ParquetSource",
"storeSettings": {
"type": "AmazonS3ReadSettings",
"recursive": true,
"wildcardFolderPath": "sales/*",
"wildcardFileName": "*.parquet",
"modifiedDatetimeStart": {
"value": "@pipeline().parameters.watermark",
"type": "Expression"
}
}
},
"sink": {
"type": "ParquetSink",
"storeSettings": {
"type": "AzureBlobFSWriteSettings"
}
},
"enableStaging": false,
"parallelCopies": 32
}
}
ADF linked service for S3:
{
"name": "AmazonS3LinkedService",
"type": "AmazonS3",
"typeProperties": {
"accessKeyId": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "KeyVaultLinkedService",
"type": "LinkedServiceReference"
},
"secretName": "aws-access-key-id"
},
"secretAccessKey": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "KeyVaultLinkedService",
"type": "LinkedServiceReference"
},
"secretName": "aws-secret-access-key"
},
"serviceUrl": "https://s3.us-gov-west-1.amazonaws.com"
}
}
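Once the linked service and datasets are deployed, the pipeline containing this copy activity can be started from orchestration code with a watermark value. Below is a hedged sketch using the azure-mgmt-datafactory SDK; the subscription, resource group, factory, and pipeline names are placeholders.
# Hedged sketch: trigger the incremental copy pipeline with a watermark value.
# Subscription, resource group, factory, and pipeline names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf = DataFactoryManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
)

run = adf.pipelines.create_run(
    resource_group_name="rg-analytics-migration",
    factory_name="adf-acme-migration",
    pipeline_name="CopyS3ToADLSPipeline",
    parameters={"watermark": "2026-04-15T00:00:00Z"},
)
print(f"Started pipeline run {run.run_id}")

# Poll the run until it finishes (statuses include InProgress, Succeeded, Failed).
status = adf.pipeline_runs.get(
    resource_group_name="rg-analytics-migration",
    factory_name="adf-acme-migration",
    run_id=run.run_id,
)
print(status.status)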
Pattern 4: Azure Data Box (offline transfer)¶
When to use: Large cold/archive datasets (50 TB+) where network transfer is impractical.
- Order Azure Data Box (80 TB usable) or Data Box Heavy (770 TB usable) from Azure portal.
- Copy S3 data to local staging via aws s3 sync.
- Load data onto Data Box via NFS/SMB mount or AzCopy.
- Ship Data Box to Azure datacenter.
- Data appears in target ADLS Gen2 storage account.
- Verify with checksums (a manifest sketch follows below).
Timeline: 7-10 business days from order to data availability. Cost: ~$0.01/GB (Data Box) or ~$0.003/GB (Data Box Heavy at scale).
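For the checksum verification step, a simple sketch that builds an MD5 manifest of the local staging directory before it is loaded onto the Data Box; the same manifest can later be compared against the files that land in ADLS Gen2. The staging path and manifest location are placeholders.
# Simple sketch: build an MD5 manifest of the Data Box staging directory.
# Staging path and manifest location are placeholders.
import csv
import hashlib
from pathlib import Path

STAGING_DIR = Path("/mnt/staging")       # local copy produced by aws s3 sync
MANIFEST = Path("databox_manifest.csv")

def md5sum(path, chunk_size=8 * 1024 * 1024):
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

with MANIFEST.open("w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["relative_path", "size_bytes", "md5"])
    for file_path in sorted(STAGING_DIR.rglob("*")):
        if file_path.is_file():
            writer.writerow([
                file_path.relative_to(STAGING_DIR).as_posix(),
                file_path.stat().st_size,
                md5sum(file_path),
            ])
print(f"Wrote manifest to {MANIFEST}")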
File format considerations¶
Current state on AWS¶
| Format | Typical usage | Azure handling |
|---|---|---|
| Parquet | Most S3 data lakes; Athena/Glue default | Read natively; convert to Delta for ACID |
| ORC | EMR Hive workloads | Read natively in Databricks; convert to Delta |
| CSV/JSON | Raw ingestion layers | Ingest via Auto Loader with schema inference |
| Avro | Streaming / schema evolution | Read natively; convert to Delta |
| Hudi | CDC / upsert workloads on EMR | Read in Databricks; migrate to Delta for writes |
| Iceberg | Modern Glue / Athena v3 | Read natively in Databricks (ADR-0003); Delta is write target |
Delta Lake conversion¶
For tables that will be actively queried and written to on Azure, convert to Delta Lake format:
-- In Databricks SQL: convert Parquet directory to Delta
CONVERT TO DELTA parquet.`abfss://raw@acmeanalyticsgov.dfs.core.usgovcloudapi.net/sales/`
PARTITIONED BY (date STRING);
-- Verify
DESCRIBE HISTORY sales_prod.bronze.raw_sales;
# In Databricks notebook: read Parquet from S3, write Delta to ADLS
df = spark.read.parquet("s3a://acme-analytics-raw/sales/")
df.write \
.format("delta") \
.mode("overwrite") \
.partitionBy("date") \
.option("overwriteSchema", "true") \
.save("abfss://raw@acmeanalyticsgov.dfs.core.usgovcloudapi.net/sales/")
Z-Order optimization after conversion:
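A minimal sketch in a Databricks notebook; customer_id is a placeholder for whichever high-cardinality column your queries filter on most.
# Run after the Delta conversion; customer_id is a placeholder Z-order column.
spark.sql("""
    OPTIMIZE sales_prod.bronze.raw_sales
    ZORDER BY (customer_id)
""")
# OPTIMIZE rewrites data files; superseded files can be removed later with VACUUM.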
S3 bucket policy to ADLS ACL mapping¶
S3 bucket policy example¶
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AllowAnalyticsTeamRead",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::123456789012:role/analytics-team-role"
},
"Action": ["s3:GetObject", "s3:ListBucket"],
"Resource": [
"arn:aws:s3:::acme-analytics-curated",
"arn:aws:s3:::acme-analytics-curated/*"
]
}
]
}
Equivalent ADLS Gen2 RBAC + ACL¶
# RBAC: assign Storage Blob Data Reader at container level
az role assignment create \
--role "Storage Blob Data Reader" \
--assignee-object-id <entra-group-object-id> \
--scope "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/acmeanalyticsgov/blobServices/default/containers/curated"
# ACL: grant read + execute at directory level (if finer granularity needed)
az storage fs access set \
--acl "group:<entra-group-object-id>:r-x" \
--path "sales/" \
--file-system curated \
--account-name acmeanalyticsgov \
--auth-mode login
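Note that az storage fs access set applies the ACL only to the sales/ directory itself; existing files underneath need a recursive update (and a default ACL if new children should inherit the entry). A minimal sketch using the azure-storage-file-datalake SDK, reusing the group object ID, account, container, and path from the CLI example:
# Minimal sketch: propagate the read + execute ACL entry to existing paths
# under sales/. Group object ID, account, container, and path reuse the CLI
# example above.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

GROUP_OID = "<entra-group-object-id>"

service = DataLakeServiceClient(
    account_url="https://acmeanalyticsgov.dfs.core.usgovcloudapi.net",
    credential=DefaultAzureCredential(),
)
directory = service.get_file_system_client("curated").get_directory_client("sales")

# update_access_control_recursive merges the given entry into each existing ACL
# instead of replacing the whole ACL.
result = directory.update_access_control_recursive(acl=f"group:{GROUP_OID}:r-x")
print(f"Updated {result.counters.directories_successful} directories and "
      f"{result.counters.files_successful} files")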
Common IAM-to-RBAC translations¶
| S3/IAM action | Azure RBAC role | Scope |
|---|---|---|
| s3:GetObject | Storage Blob Data Reader | Container or directory |
| s3:PutObject | Storage Blob Data Contributor | Container or directory |
| s3:DeleteObject | Storage Blob Data Contributor | Container or directory |
| s3:ListBucket | Storage Blob Data Reader | Container |
| s3:GetBucketPolicy | Reader (ARM) | Storage account |
| s3:PutBucketPolicy | Storage Account Contributor | Storage account |
| KMS kms:Decrypt | Key Vault Crypto Service Encryption User | Key Vault |
Versioning and lifecycle equivalents¶
S3 versioning to ADLS versioning¶
| S3 feature | ADLS Gen2 equivalent | Notes |
|---|---|---|
| Object versioning | Blob versioning | Enable at storage account level |
| Version-specific GET | Version ID-based access | Same pattern, different API |
| Lifecycle rule: delete old versions | Lifecycle management: delete previous versions | 1:1 rule translation |
| MFA Delete | Soft delete + legal hold | Different mechanism, same outcome |
Recommended approach: Use Delta Lake time travel instead of blob-level versioning for analytics tables. Time travel provides table-level version history with SQL syntax:
-- Query data as of a specific version
SELECT * FROM sales_prod.gold.fact_sales VERSION AS OF 42;
-- Query data as of a timestamp
SELECT * FROM sales_prod.gold.fact_sales TIMESTAMP AS OF '2026-04-15';
-- Restore a previous version
RESTORE TABLE sales_prod.gold.fact_sales TO VERSION AS OF 42;
S3 lifecycle to ADLS lifecycle¶
{
"rules": [
{
"enabled": true,
"name": "tier-to-cool-90d",
"type": "Lifecycle",
"definition": {
"actions": {
"baseBlob": {
"tierToCool": {
"daysAfterModificationGreaterThan": 90
},
"tierToArchive": {
"daysAfterModificationGreaterThan": 365
},
"delete": { "daysAfterModificationGreaterThan": 2555 }
}
},
"filters": {
"blobTypes": ["blockBlob"],
"prefixMatch": ["raw/", "bronze/"]
}
}
}
]
}
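The same policy can be applied from code. Below is a hedged sketch using the azure-mgmt-storage SDK; the subscription and resource group are placeholders, and parameter shapes can vary between SDK versions, so check the installed release.
# Hedged sketch: apply the lifecycle policy above via azure-mgmt-storage.
# Subscription and resource group are placeholders; verify parameter shapes
# against the installed SDK version.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

client = StorageManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
)

policy = {
    "policy": {
        "rules": [
            {
                "enabled": True,
                "name": "tier-to-cool-90d",
                "type": "Lifecycle",
                "definition": {
                    "actions": {
                        "baseBlob": {
                            "tierToCool": {"daysAfterModificationGreaterThan": 90},
                            "tierToArchive": {"daysAfterModificationGreaterThan": 365},
                            "delete": {"daysAfterModificationGreaterThan": 2555},
                        }
                    },
                    "filters": {
                        "blobTypes": ["blockBlob"],
                        "prefixMatch": ["raw/", "bronze/"],
                    },
                },
            }
        ]
    }
}

client.management_policies.create_or_update(
    resource_group_name="rg-analytics-migration",
    account_name="acmeanalyticsgov",
    management_policy_name="default",
    properties=policy,
)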
Data transfer strategies¶
Decision matrix¶
| Data volume | Latency tolerance | Pattern | Estimated time |
|---|---|---|---|
| < 1 TB | Hours | AzCopy over internet | 1-4 hours |
| 1-10 TB | Hours | AzCopy over VPN/ExpressRoute | 4-24 hours |
| 10-50 TB | Days | ADF Copy Activity (parallel) | 1-5 days |
| 50-100 TB | Days | AzCopy over ExpressRoute (10 Gbps) | 2-5 days |
| 100+ TB | Weeks | Azure Data Box + parallel AzCopy | 2-4 weeks |
| Ongoing sync | Real-time | OneLake shortcuts (zero-copy) | Immediate |
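The time estimates above are simple arithmetic: volume divided by the sustained throughput you expect on the link (the AzCopy tuning notes above suggest 5-10 Gbps on ExpressRoute and 1-2 Gbps over VPN). A small helper to sanity-check an estimate:
# Back-of-the-envelope transfer time: volume / sustained throughput.
# Pass the throughput you expect to sustain, not the nominal line rate.
def transfer_hours(volume_tb, sustained_gbps):
    bits = volume_tb * 8 * 1000**4            # TB -> bits (decimal units)
    return bits / (sustained_gbps * 1e9) / 3600

print(f"{transfer_hours(10, 1.25):.0f} h")    # ~18 h: 10 TB over a 1.25 Gbps VPN tunnel
print(f"{transfer_hours(100, 5):.0f} h")      # ~44 h: 100 TB at 5 Gbps sustained on ExpressRoute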
Network considerations for federal¶
- ExpressRoute (recommended): Dedicated private connectivity between AWS Direct Connect peering and Azure ExpressRoute. Provides consistent bandwidth (1-10 Gbps) and does not traverse the public internet.
- Site-to-site VPN: Encrypted tunnel over the internet. Lower bandwidth (1.25 Gbps per tunnel) but simpler to set up.
- Internet transfer: Viable for small datasets; use HTTPS with AzCopy's built-in encryption.
- Data Box: Physical transfer; avoids network entirely. Required for air-gapped environments.
Cross-cloud egress cost management¶
S3 data transfer out to the internet costs $0.09/GB for the first 10 TB. For a 100 TB migration, the egress cost alone is approximately $8,500. Strategies to minimize:
- Use OneLake shortcuts during the bridge phase to avoid copying data until necessary.
- Compress before transfer. Parquet is already compressed, but raw CSV/JSON should be gzipped.
- Transfer during off-peak. AWS provides no off-peak pricing, but your network links may have less contention.
- Budget egress separately. Include $50K-$200K for cross-cloud transfer in the migration budget (see TCO Analysis).
Medallion architecture mapping¶
| AWS layer | S3 prefix convention | Azure layer | ADLS container | OneLake equivalent |
|---|---|---|---|---|
| Raw / Landing | s3://bucket/raw/ | Bronze | raw / bronze container | Lakehouse Files/raw/ |
| Cleaned / Staging | s3://bucket/stage/ | Silver | silver container | Lakehouse Tables/silver_* |
| Curated / Analytics | s3://bucket/curated/ | Gold | gold container | Lakehouse Tables/gold_* |
| Archive | s3://bucket/archive/ (Glacier) | Archive tier | archive container (archive tier) | N/A --- use ADLS lifecycle |
Cross-reference: examples/commerce/, examples/noaa/ for worked medallion implementations.
Last updated: 2026-04-30
Maintainers: CSA-in-a-Box core team
Related: Migration Center | Compute Migration | Feature Mapping | Migration Playbook