AWS-to-Azure Migration Best Practices¶
Status: Authored 2026-04-30 Audience: Migration leads, solution architects, and program managers planning or executing an AWS analytics estate migration to csa-inabox on Azure. Scope: Covers the organizational, technical, and operational practices that distinguish successful migrations from failed ones.
Overview¶
Migrating an AWS analytics estate (Redshift, EMR, Glue, Athena, S3) to Azure is a 30-40 week program for mid-to-large federal tenants. The technical steps are documented in the companion tutorials. This document covers the practices that make or break the migration: assessment, architecture decisions, identity mapping, validation, team structure, and risk mitigation.
1. Pre-migration assessment checklist¶
Complete every item before writing a single line of migration code.
Infrastructure inventory¶
- List all AWS accounts in scope (production, staging, dev, sandbox).
- Inventory every Redshift cluster: node type, node count, database count, total storage used.
- Inventory every EMR cluster: instance types, job count, average runtime, permanent vs transient.
- Inventory every Glue job: type (Spark/Python Shell), DPU allocation, schedule, last run status.
- Inventory every Glue Crawler: databases scanned, table count, schedule.
- Inventory every Athena workgroup: query volume, average scan size, saved queries.
- Inventory every S3 bucket: size, object count, access frequency, lifecycle rules, versioning.
- Inventory Kinesis/MSK streams: shard count, throughput, consumer count.
- Document all IAM roles with analytics permissions: who uses what, where.
- Export CloudTrail logs covering 90 days of analytics API calls.
Dependency mapping¶
- Map every cross-service dependency: which Glue jobs write to which Redshift tables, which EMR jobs read which S3 prefixes.
- Identify external consumers: BI tools, APIs, downstream systems, partner feeds.
- Document event-driven patterns: S3 notifications to Lambda/SQS, EventBridge rules triggering Glue.
- Identify shadow consumers: teams or systems accessing S3/Redshift without formal registration.
- Map data lineage: source to bronze to silver to gold to consumer for each data product.
Compliance review¶
- Confirm target Azure region meets data residency requirements (Azure Government for FedRAMP High/IL4/IL5).
- Review ITAR constraints: data cannot leave US sovereign boundaries during or after migration.
- Map AWS compliance evidence (CloudTrail, Config, GuardDuty) to Azure equivalents (Monitor, Policy, Defender for Cloud).
- Verify that every AWS service in scope has an equivalent at the required compliance tier on Azure Gov (check
docs/GOV_SERVICE_MATRIX.md). - Plan for audit evidence preservation: CloudTrail logs, S3 access logs, Redshift audit logs must be archived before decommission.
Cost baseline¶
- Export 12 months of AWS Cost Explorer data for all analytics services.
- Document current commitment models: Reserved Instances, Savings Plans, Redshift Reserved Nodes.
- Calculate current effective $/TB/month for storage across all tiers.
- Calculate current effective $/query-hour for Redshift and Athena.
- Run
scripts/deploy/estimate-costs.shagainst the target Azure configuration.
2. Network architecture¶
Recommended network topology for migration¶
AWS VPC (GovCloud) Azure VNet (Gov)
+-------------------+ +-------------------+
| Redshift | | Databricks |
| EMR | ExpressRoute | ADF (SHIR) |
| S3 Endpoints |<-------------->| ADLS Gen2 (PE) |
| Glue | or VPN | Key Vault (PE) |
+-------------------+ +-------------------+
ExpressRoute (recommended for production migrations)¶
- When: Data transfer > 5 TB, ongoing hybrid period > 4 weeks, latency-sensitive reads.
- Setup: Provision an ExpressRoute circuit with Microsoft peering. Connect to the Azure VNet where ADLS Gen2 and Databricks reside.
- Bandwidth: Start with 1 Gbps; upgrade to 10 Gbps if bulk transfer windows are tight.
- Cost: $300-3,000/month depending on circuit speed and provider.
Site-to-site VPN (acceptable for smaller migrations)¶
- When: Data transfer < 5 TB, migration duration < 4 weeks, no ongoing hybrid reads.
- Setup: Azure VPN Gateway (VpnGw2) connected to AWS Virtual Private Gateway.
- Bandwidth: Up to 1.25 Gbps aggregate (multiple tunnels).
- Cost: $150-400/month.
Public internet with IP restrictions (dev/test only)¶
- When: Non-production environments, initial testing, proof of concept.
- Setup: AzCopy over public internet with storage account network rules restricting to source IPs.
- Risk: Slower, subject to ISP variability, no SLA.
DNS considerations¶
- Configure Azure Private DNS zones for ADLS Gen2 private endpoints.
- If using split-horizon DNS, ensure Databricks clusters resolve ADLS Gen2 to private IPs.
- During hybrid period, AWS workloads need DNS resolution to Azure private endpoints via ExpressRoute/VPN.
3. Identity mapping: IAM roles to Entra groups¶
Mapping strategy¶
| AWS IAM construct | Azure equivalent | Migration notes |
|---|---|---|
| IAM User (programmatic) | Service Principal or Managed Identity | Prefer managed identity for Azure-native workloads |
| IAM User (console) | Entra ID user | Federated identity via Entra ID |
| IAM Role (EC2/EMR instance profile) | User-Assigned Managed Identity | Attach to Databricks workspace or VM |
| IAM Role (Glue service role) | ADF Managed Identity | ADF gets a system-assigned MI at creation |
| IAM Role (cross-account) | Cross-tenant service principal | Rare; usually same-tenant in Azure |
| IAM Group | Entra ID Security Group | Map 1:1; use for RBAC assignments |
| IAM Policy (inline) | Azure Role Definition (custom) | Avoid custom roles; use built-in where possible |
| IAM Policy (managed) | Azure Built-in Role | See role mapping table below |
| S3 Bucket Policy | Storage Account RBAC + ACL | RBAC preferred over ACLs on ADLS Gen2 |
| Lake Formation permissions | Unity Catalog grants | GRANT SELECT ON TABLE ... |
| KMS Key Policy | Key Vault access policy or RBAC | Key Vault RBAC is the modern approach |
Common role mappings¶
| AWS managed policy | Azure built-in role | Scope |
|---|---|---|
AmazonS3ReadOnlyAccess | Storage Blob Data Reader | Storage account or container |
AmazonS3FullAccess | Storage Blob Data Contributor | Storage account or container |
AmazonRedshiftReadOnlyAccess | Unity Catalog SELECT grant | Catalog/schema/table |
AmazonRedshiftFullAccess | Unity Catalog ALL PRIVILEGES + Storage Blob Data Contributor | Workspace + storage |
AWSGlueServiceRole | ADF system-assigned managed identity + Data Factory Contributor | ADF instance |
AmazonAthenaFullAccess | Databricks SQL Warehouse access + Storage Blob Data Reader | Workspace + storage |
AmazonEMRFullAccessPolicy_v2 | Contributor on Databricks workspace | Resource group |
Service account migration checklist¶
- Export all IAM roles used by analytics services:
aws iam list-roles --query 'Roles[?contains(RoleName,analytics) || contains(RoleName,glue) || contains(RoleName,redshift) || contains(RoleName,emr)]'. - For each role, create an Entra ID security group with equivalent membership.
- Assign Azure RBAC roles at the appropriate scope (resource group, storage account, Databricks workspace).
- Create managed identities for service-to-service authentication.
- Test each identity mapping in a dev environment before production cutover.
- Document the mapping in the migration tracker for audit purposes.
4. Data migration patterns¶
Pattern A: Parallel ingestion (recommended for most migrations)¶
Day 1: OneLake shortcut to S3 (read-only bridge)
New writes land on ADLS Gen2
Consumers read from both via Unity Catalog
Day N: Backfill historical data from S3 to ADLS Gen2 (AzCopy)
Validate parity per dataset
Day N+M: Flip individual datasets from S3-backed to ADLS-native
Remove OneLake shortcuts one by one
Final: S3 becomes archive; ADLS Gen2 is source-of-truth
Pros: Zero downtime, gradual cutover, easy rollback per dataset. Cons: Dual-cloud egress cost during bridge period, requires discipline to track shortcut cleanup.
Pattern B: Big-bang cutover (for small estates or hard deadlines)¶
Day 0: Freeze all AWS writes
Day 1: AzCopy full transfer S3 → ADLS Gen2
Day 2: Validate parity (row counts, checksums)
Day 3: Redirect all consumers to Azure
Day 4: Decommission AWS
Pros: Clean cutover, no hybrid complexity, lowest total cost. Cons: Downtime required, high risk if validation fails, no rollback after decommission.
Pattern C: Hybrid indefinite (for multi-cloud mandates)¶
Day 1: OneLake shortcuts to S3 (permanent bridge)
Some workloads stay on AWS (IL6, deep SageMaker)
Analytics workloads move to Azure
Delta Sharing for cross-cloud data exchange
Ongoing: Two clouds, clear ownership boundaries
Pros: No forced migration of workloads that work well on AWS. Cons: Ongoing cross-cloud egress, dual operational burden, two sets of governance.
5. Glue Catalog preservation strategies¶
The Glue Data Catalog often contains years of accumulated metadata, partitions, and schema evolution history. Do not discard it.
Strategy 1: Export and replay (recommended)¶
- Export all Glue Catalog databases, tables, and partitions via
aws glue get-tables. - Convert Glue table definitions to Unity Catalog
CREATE TABLEstatements. - Register each table in Unity Catalog pointing to the migrated ADLS Gen2 location.
- Purview scans Unity Catalog and inherits the metadata.
Strategy 2: Federated catalog (for hybrid periods)¶
- Keep the Glue Catalog running during migration.
- Use Databricks Lakehouse Federation to query Glue-backed tables from Databricks.
- Gradually migrate tables from Glue to Unity Catalog as datasets are validated.
- Decommission Glue Catalog only after all tables are migrated.
Strategy 3: Purview S3 connector (for governance continuity)¶
- Configure Purview to scan S3 buckets directly (supported via the AWS connector).
- Purview discovers and classifies data in S3 alongside ADLS Gen2 data.
- As datasets migrate, Purview lineage automatically updates.
- This provides a single governance view across both clouds during migration.
6. Parallel-run approach for validation¶
How to run parallel validation¶
- Select validation datasets: Pick 5-10 representative tables covering different shapes (large fact tables, small dimensions, time-series, CDC).
- Run both pipelines: Execute the AWS pipeline (Glue/EMR/Redshift) and the Azure pipeline (ADF/dbt/Databricks) on the same input data.
- Compare outputs:
- Row counts (must match exactly).
- Aggregate checksums (SUM, COUNT DISTINCT on key columns -- must match within 0.01%).
- Sample row comparison (random sample of 1000 rows, field-by-field comparison).
- Schema comparison (column names, types, nullability).
- Duration: Run in parallel for at least 5 business days. Extend to 10 days for mission-critical pipelines.
- Acceptance criteria: Zero row-count mismatches, < 0.01% aggregate deviation, zero schema mismatches.
Automated validation framework¶
Use dbt tests + a validation notebook (see the companion tutorials for code). Automate the comparison to run daily during the parallel period. Alert on any deviation.
7. Common pitfalls (and solutions)¶
Pitfall 1: Trying to replicate AWS service topology exactly¶
What happens: Teams map every AWS service to an Azure "equivalent" one-to-one, resulting in a complex architecture with too many moving parts.
Solution: Consolidate. The five-service AWS analytics estate (Redshift + EMR + Glue + Athena + S3) collapses to three core services in csa-inabox (Databricks + ADF + ADLS Gen2). Let the architecture simplify rather than mirroring complexity.
Pitfall 2: Ignoring S3 event-driven patterns that need rearchitecting¶
What happens: S3 event notifications trigger Lambda functions, SQS queues, or SNS topics. These patterns do not have a direct lift-and-shift to Azure.
Solution: Map each S3 event pattern to its Azure equivalent early in discovery:
- S3 → Lambda: ADLS Gen2 → Event Grid → Azure Functions.
- S3 → SQS → consumer: ADLS Gen2 → Event Grid → Service Bus → consumer.
- S3 → SNS fan-out: ADLS Gen2 → Event Grid → multiple subscribers.
Pitfall 3: Underestimating Redshift SQL dialect differences¶
What happens: Teams assume Redshift SQL is "just PostgreSQL" and that Databricks SQL is close enough. In practice, there are 25+ dialect differences that cause silent data errors or query failures.
Solution: Use the SQL dialect conversion table to systematically convert every query. Run automated regression tests comparing Redshift and Databricks results for every converted query.
Pitfall 4: Not leveraging OneLake shortcuts for hybrid periods¶
What happens: Teams try to migrate all S3 data before starting any Azure workloads, creating a multi-month delay before Azure shows value.
Solution: Day 1: set up OneLake shortcuts to S3. Databricks reads S3 through the shortcut while new writes land on ADLS Gen2. This lets Azure workloads start immediately without waiting for data transfer to complete.
Pitfall 5: Migrating Glue Crawlers without rethinking the pattern¶
What happens: Teams build Purview scan jobs that replicate every Glue Crawler's behavior, including crawling raw data on a schedule.
Solution: Glue Crawlers serve two purposes: schema discovery and partition registration. In Azure, Databricks Auto Loader handles schema inference and evolution at read time, and Delta Lake manages partitions natively. You only need Purview scans for governance (classification, lineage) -- not for runtime schema discovery.
Pitfall 6: Under-provisioning the migration network¶
What happens: Teams try to migrate 20+ TB over a 100 Mbps VPN, resulting in weeks-long transfer times and missed deadlines.
Solution: Calculate transfer time before starting. At 100 Mbps, 20 TB takes approximately 18 days of continuous transfer. Budget for ExpressRoute (1 Gbps or 10 Gbps) or Azure Data Box for large migrations.
Pitfall 7: Forgetting to migrate Redshift user permissions¶
What happens: Data is migrated but access controls are not. Users either cannot access data or have excessive permissions on Azure.
Solution: Export Redshift permissions (SELECT * FROM svl_user_grants) and map them to Unity Catalog grants before cutover. Test with actual user accounts in a staging environment.
Pitfall 8: Not planning for Glue job bookmark state¶
What happens: Glue jobs use bookmarks for incremental processing. After migration, dbt incremental models need equivalent state initialization.
Solution: For each Glue job with bookmarks enabled, determine the bookmark state (last processed timestamp or file). Initialize the dbt incremental model with a full refresh, then switch to incremental mode. The {% if is_incremental() %} block handles ongoing incremental runs.
Pitfall 9: Skipping performance testing before cutover¶
What happens: Data and logic are migrated correctly, but Databricks SQL query performance is worse than expected because tables are not optimized.
Solution: After loading Delta tables, run OPTIMIZE ... ZORDER BY on every table. Verify that partition column choices align with query filter patterns. Run the top 20 most expensive queries (from Redshift profiling) on Databricks SQL and compare latency.
Pitfall 10: Decommissioning AWS before archiving audit evidence¶
What happens: AWS accounts are shut down before CloudTrail logs, S3 access logs, and Redshift audit logs are preserved, creating compliance gaps.
Solution: Before decommissioning any AWS resource, archive all audit logs to a long-term retention location (S3 Glacier or ADLS Gen2 Archive tier). Federal compliance frameworks (FedRAMP, CMMC) require audit log retention for 1-3 years minimum.
8. Team structure recommendations¶
Recommended migration team composition¶
| Role | Count | Responsibilities |
|---|---|---|
| Migration lead / architect | 1 | Overall architecture, decision-making, risk management |
| Data engineer (AWS-focused) | 2-3 | Redshift profiling, Glue job analysis, S3 inventory, UNLOAD operations |
| Data engineer (Azure-focused) | 2-3 | ADF pipelines, dbt models, Databricks configuration, Delta Lake optimization |
| Platform engineer | 1-2 | Networking (ExpressRoute/VPN), identity (Entra ID), Bicep deployments |
| Security / compliance lead | 1 | IAM mapping, compliance evidence, audit log preservation |
| BI developer | 1 | Power BI semantic models, QuickSight-to-Power BI report conversion |
| QA / validation engineer | 1 | Data parity validation, regression testing, parallel-run monitoring |
| Program manager | 1 | Timeline, risk register, stakeholder communication |
Scaling guidance¶
- Small estate (< 10 Glue jobs, < 5 Redshift schemas): 4-6 people, 12-16 weeks.
- Medium estate (10-50 Glue jobs, 5-20 Redshift schemas): 8-12 people, 20-28 weeks.
- Large estate (50+ Glue jobs, 20+ Redshift schemas, streaming): 12-18 people, 30-40 weeks.
9. Timeline estimation by deployment size¶
Small migration (< 5 TB data, < 10 pipelines)¶
| Phase | Duration | Activities |
|---|---|---|
| Discovery | 1 week | Inventory, dependency mapping |
| Landing zone | 2 weeks | Bicep deployment, networking, identity |
| Data migration | 1 week | AzCopy transfer, Delta conversion |
| Pipeline migration | 2-3 weeks | Glue to ADF+dbt, Athena to Databricks SQL |
| Validation | 1 week | Parallel run, parity checks |
| Cutover | 1 week | Consumer redirect, decommission |
| Total | 8-10 weeks |
Medium migration (5-50 TB, 10-50 pipelines)¶
| Phase | Duration | Activities |
|---|---|---|
| Discovery | 2-3 weeks | Full inventory, wave planning |
| Landing zone | 3-4 weeks | Bicep, ExpressRoute, identity mapping |
| Pilot domain | 4-6 weeks | One end-to-end domain migrated |
| Redshift migration | 6-8 weeks | Schema, data, SQL conversion (overlaps) |
| Pipeline migration | 4-6 weeks | All Glue jobs converted (overlaps) |
| Streaming migration | 2-3 weeks | Kinesis to Event Hubs (if applicable) |
| Validation | 2-3 weeks | Parallel run, regression testing |
| Cutover + decommission | 2-3 weeks | Staged cutover, audit log archive |
| Total | 20-28 weeks |
Large migration (50+ TB, 50+ pipelines, streaming, ML)¶
Follow the phased plan in the main playbook: 30-40 weeks across 6 phases.
10. Risk mitigation¶
Risk register template¶
| # | Risk | Likelihood | Impact | Mitigation | Owner |
|---|---|---|---|---|---|
| 1 | Data loss during S3-to-ADLS transfer | Low | Critical | Checksum validation per dataset; S3 versioning preserved; rollback to S3 source | Data engineer |
| 2 | Redshift SQL conversion introduces silent errors | Medium | High | Automated regression test suite comparing Redshift and Databricks results for all converted queries | QA engineer |
| 3 | ExpressRoute provisioning delay | Medium | Medium | Order circuit 4-6 weeks before data transfer phase; fall back to VPN for small datasets | Platform engineer |
| 4 | Glue Catalog metadata loss | Low | High | Export full catalog before migration; validate Unity Catalog table count matches Glue table count | Data engineer |
| 5 | Compliance evidence gap during migration | Medium | Critical | Archive all AWS audit logs before decommission; run Purview scans on day 1; maintain dual-cloud evidence | Security lead |
| 6 | Consumer disruption during cutover | Medium | High | Staged cutover with OneLake shortcuts; 2-week parallel run; rollback plan per dataset | Migration lead |
| 7 | Databricks SQL performance regression | Low | Medium | Run top-20 queries on Databricks SQL in staging; OPTIMIZE+ZORDER before cutover; right-size SQL Warehouses | Data engineer |
| 8 | Budget overrun from cross-cloud egress | Medium | Medium | Budget $90/TB for S3 egress; use ExpressRoute to reduce cost; migrate hot data first, leave cold data on S3 via shortcuts | Program manager |
| 9 | Team skill gap on Azure/Databricks | High | Medium | 2-week training sprint before migration starts; pair AWS-skilled and Azure-skilled engineers | Migration lead |
| 10 | Shadow consumers discovered mid-migration | High | Medium | CloudTrail analysis for S3 GetObject callers; Redshift query log analysis for all connecting applications | Data engineer |
Related resources¶
- AWS-to-Azure migration playbook -- full capability mapping and phased plan
- S3 to ADLS tutorial -- storage migration step-by-step
- Redshift to Fabric tutorial -- warehouse migration step-by-step
- Glue to ADF + dbt tutorial -- ETL pipeline migration step-by-step
- Benchmarks -- performance and cost comparisons
docs/COST_MANAGEMENT.md-- Azure cost optimizationdocs/GOV_SERVICE_MATRIX.md-- Azure Government service availabilitydocs/adr/0004-bicep-over-terraform.md-- IaC choice rationalecsa_platform/multi_synapse/rbac_templates/-- RBAC template patterns
Last updated: 2026-04-30 Maintainers: CSA-in-a-Box core team