Home > Docs > Research > Platform Research Report
CSA-in-a-Box: Comprehensive Platform Research Report¶
Note
Quick Summary: Deep research report for building a complete Cloud-Scale Analytics / Data Mesh / Data Fabric platform in Azure as an open-source alternative to Microsoft Fabric — covers CSA architecture (DMLZ + DLZ + Medallion), Data Mesh domain ownership, Data Fabric metadata layer, component mapping from Fabric to Azure PaaS, 50+ required Azure services, 4-subscription deployment strategy with VNet address spaces and RBAC, Well-Architected best practices, zero-trust networking, data classification, FinOps, DR/BCDR, and IaC reference templates.
Date: 2026-04-09 Purpose: Deep research for building a complete Cloud-Scale Analytics / Data Mesh / Data Fabric platform in Azure as an open-source alternative to Microsoft Fabric.
📑 Table of Contents¶
- 🏗️ 1. Azure Cloud-Scale Analytics Architecture
- 🔀 2. Data Mesh Architecture in Azure
- 🧵 3. Data Fabric Architecture
- 🔄 4. Microsoft Fabric Alternative Components
- ☁️ 5. Required Azure Services for a Complete Platform
- 📦 6. Deployment Strategy for 4 Azure Subscriptions
- 📋 7. Best Practices and Standards
- 📚 8. Reference Templates and IaC
- 🔗 9. Sources and References
- 📊 Appendix A: Service SKU Recommendations
- 📛 Appendix B: Naming Convention
- 🚀 Appendix C: Deployment Order
🏗️ 1. Azure Cloud-Scale Analytics Architecture¶
1.1 Overview and Current Status¶
Microsoft's Cloud-Scale Analytics (CSA) was the reference architecture for building enterprise data platforms on Azure. It was part of the Cloud Adoption Framework (CAF) and provided prescriptive guidance for data landing zones, governance, and scalable analytics.
Warning
As of early 2026, Microsoft has deprecated the Cloud-Scale Analytics scenario. The deprecation notice states: "The Cloud-Scale Analytics scenario has been deprecated and is no longer maintained or supported. To ensure only the best guidance is surfaced, this guidance will be deleted April 2026." Microsoft redirects to their new "Unify your data platform" guidance at https://aka.ms/cafdata.
What this means for csa-inabox: The CSA architecture remains the best-documented and most comprehensive open reference for building a modular, enterprise-grade data platform on Azure. While Microsoft is consolidating guidance (likely pushing toward Fabric), the architectural patterns, landing zone structure, and IaC templates remain valid and are exactly what we need. Our project preserves and extends these patterns with open-source tooling.
1.2 Core Architectural Concepts¶
CSA consists of two primary architectural constructs:
Data Management Landing Zone (DMLZ)¶
A separate Azure subscription that provides centralized governance for the entire analytics platform:
| Component | Azure Service | Purpose |
|---|---|---|
| Data Catalog | Microsoft Purview | Register, classify, discover data sources across all landing zones |
| Data Governance | Purview + Unity Catalog | Centralized access control, auditing, lineage, data quality |
| Primary Data Management | Purview + custom | Master data management, golden records |
| Data Sharing & Contracts | Entra ID Entitlement Mgmt | Access packages, sharing policies |
| API Catalog | Azure API Management | Standardized API documentation and governance |
| Data Quality | Purview Data Quality | Quality metrics, validation, monitoring |
| Data Modeling Repository | ER/Studio, custom | Centralized entity relationship models |
| Container Registry | Azure Container Registry | Standard containers for data science |
| Service Layer | Custom microservices | Data marketplace, operations console, automation |
| Networking | VNet, DNS, Peering | Hub connectivity to all data landing zones |
| Security | Key Vault, Defender | Centralized secrets and threat protection |
Important
The DMLZ must be deployed as a separate subscription under a management group with appropriate governance policies. It connects to data landing zones via VNet peering and to the connectivity subscription.
Data Landing Zone (DLZ)¶
Each DLZ is a separate Azure subscription that hosts analytics workloads for a specific domain or business unit:
| Layer | Resource Groups | Purpose |
|---|---|---|
| Platform Services (Required) | network-rg, security-rg | VNet, NSGs, route tables, monitoring, Defender |
| Core Services (Required) | storage-rg, runtimes-rg, mgmt-rg, external-data-rg | Data lakes, shared IRs, CI/CD agents, external storage |
| Core Services (Optional) | data-ingestion-rg, shared-applications-rg | ADF, SQL metastore, shared Databricks |
| Data Application (Optional) | data-application-rg (one or more) | Per-application resources |
| Reporting (Optional) | reporting-rg | Visualization, Power BI gateways |
1.3 Data Lake Architecture (Medallion Pattern)¶
Each DLZ provisions three ADLS Gen2 storage accounts forming a logical data lake:
| Lake # | Layer | Medallion | Containers | Description |
|---|---|---|---|---|
| 1 | Raw | Bronze | landing, conformance | Immutable source data, data quality gates |
| 2 | Enriched | Silver | standardized | Merged, cleansed, type-aligned data |
| 2 | Curated | Gold | data-products | Aggregated, modeled, consumption-ready |
| 3 | Development | N/A | analytics-sandbox, synapse-primary-* | Exploratory sandboxes, workspace storage |
Container Folder Structures
Container Folder Structure (Raw/Landing):
Landing/
Log/{Application Name}/
Master and Reference/{Source System}/
Telemetry/{Source System}/{Application}/
Transactional/{Source System}/{Entity}/{Version}/
Delta/{rundate=YYYY-MM-DD}/
Full/
Container Folder Structure (Raw/Conformance):
Conformance/
Transactional/{Source System}/{Entity}/{Version}/
Delta/
Input/{rundate=YYYY-MM-DD}/
Output/{rundate=YYYY-MM-DD}/
Error/{rundate=YYYY-MM-DD}/
Full/
Input/{rundate=YYYY-MM-DD}/
Output/{rundate=YYYY-MM-DD}/
Error/{rundate=YYYY-MM-DD}/
Container Folder Structure (Enriched/Standardized):
Standardized/
Transactional/{Source System}/{Entity}/{Version}/
General/{rundate=YYYY-MM-DD}/
Sensitive/{rundate=YYYY-MM-DD}/
Container Folder Structure (Curated/Data Products):
Key Configuration:
- Enable Hierarchical Namespace (HNS) on all ADLS Gen2 accounts
- Use ACLs + Microsoft Entra groups for fine-grained access control
- Separate
GeneralandSensitivefolders for data classification - Store data in Delta Lake format (Parquet + transaction log)
1.4 Hub-Spoke Network Topology¶
The networking architecture follows a hub-spoke model integrated with Azure Landing Zones:
graph TB
subgraph Connectivity["Connectivity Sub (Hub VNet)"]
FW["Azure Firewall"]
ER["ExpressRoute"]
VPN["VPN Gateway"]
end
subgraph DMLZ["DMLZ VNet"]
PV["Purview"]
GOV["Governance"]
ACR["ACR"]
end
subgraph DLZ1["DLZ-1 VNet"]
ST1["Storage"]
CMP1["Compute"]
ADF1["ADF"]
end
subgraph DLZ2["DLZ-2 VNet"]
ST2["Storage"]
CMP2["Compute"]
ADF2["ADF"]
end
Connectivity -->|"VNet Peering"| DMLZ
Connectivity -->|"VNet Peering"| DLZ1
Connectivity -->|"VNet Peering"| DLZ2
DMLZ -->|"VNet Peering"| DLZ1
DMLZ -->|"VNet Peering"| DLZ2 Design Principles:
- All PaaS services use Private Endpoints (no public IPs)
- VNet peering between DMLZ and each DLZ
- VNet peering between DLZs for cross-domain data sharing
- Central Azure Private DNS zones for endpoint resolution
- DNS A-records automated via Azure Policy (
deployIfNotExists) - Site-to-Site VPN for third-party cloud connectivity
- ExpressRoute for on-premises connectivity through the hub
- NSGs and route tables per subnet
- Azure Firewall in the hub for traffic inspection
Private DNS Zones Required
privatelink.blob.core.windows.netprivatelink.dfs.core.windows.netprivatelink.database.windows.netprivatelink.sql.azuresynapse.netprivatelink.dev.azuresynapse.netprivatelink.azuresynapse.netprivatelink.vaultcore.azure.netprivatelink.datafactory.azure.netprivatelink.adf.azure.comprivatelink.purview.azure.comprivatelink.purviewstudio.azure.comprivatelink.servicebus.windows.netprivatelink.azuredatabricks.netprivatelink.azurecr.ioprivatelink.monitor.azure.com
1.5 Integration with Azure Landing Zones (ALZ)¶
CSA builds on top of Azure Landing Zones. The ALZ reference architecture provides:
Management Group Hierarchy:
Tenant Root Group
├── Platform
│ ├── Management (Log Analytics, Automation, Sentinel)
│ ├── Connectivity (Hub VNet, Firewall, ExpressRoute, DNS)
│ └── Identity (Domain Controllers, Entra ID Connect)
├── Landing Zones
│ ├── Corp (Internal workloads with private connectivity)
│ │ ├── DMLZ Sub (Data Management Landing Zone)
│ │ ├── DLZ-Dev Sub (Development Data Landing Zone)
│ │ └── DLZ-Prod Sub (Production Data Landing Zone)
│ └── Online (Internet-facing workloads)
├── Sandbox (Experimentation)
└── Decommissioned
Platform Landing Zone Subscriptions:
- Management Subscription — Log Analytics workspace, Azure Monitor, Automation, Sentinel
- Connectivity Subscription — Hub VNet, Azure Firewall, ExpressRoute, VPN, DNS zones
- Identity Subscription — AD domain controllers (if needed)
Application Landing Zone Subscriptions (for data): 4. DMLZ Subscription — Data governance, Purview, ACR, shared services 5. DLZ Subscription(s) — One per data domain or environment
🔀 2. Data Mesh Architecture in Azure¶
2.1 Core Principles¶
Data mesh, as defined by Zhamak Dehghani, is an architectural pattern built on four principles:
- Domain-Oriented Data Ownership — Data is owned by domain teams who understand it best
- Data as a Product — Data products are first-class citizens with defined quality, SLAs, and discoverability
- Self-Serve Data Infrastructure Platform — A platform that enables domain teams to build data products autonomously
- Federated Computational Governance — Governance policies automated and embedded in the platform
2.2 Mapping Data Mesh to Azure / CSA¶
| Data Mesh Concept | CSA Implementation | Azure Services |
|---|---|---|
| Data Domain | Data Landing Zone (one per domain) | Azure Subscription + VNet + resource groups |
| Data Product | Data Application resource group | ADLS Gen2, ADF, Databricks, SQL |
| Self-Serve Platform | DMLZ + automated provisioning | IaC templates, Azure DevOps, Policy |
| Federated Governance | DMLZ + Purview + Policy | Purview, Azure Policy, Entra ID |
| Data Catalog | Centralized in DMLZ | Microsoft Purview |
| Domain Team | Data Application Team | Entra security groups |
| Data Contract | Sharing repository in DMLZ | Purview policies, Entra entitlement mgmt |
2.3 Data Domains¶
Three criteria for defining data domains:
- Long-term ownership — Boundaries must support identified, persistent owners
- Match reality — Domains should reflect actual business operations, not theoretical concepts
- Atomic integrity — Don't combine unrelated areas into a single domain
Domain examples: Sales, Marketing, Finance, Supply Chain, Customer, Product
2.4 Data Products¶
A successful data product must be:
- Usable — Has users outside the immediate data domain
- Valuable — Maintains value over time
- Feasible — Can be built from available data and technology
Data product components:
- Data (the actual datasets)
- Code assets (generation, delivery, pipelines)
- Metadata (descriptions, schemas, lineage, quality metrics)
- Policies (access control, retention, classification)
Delivery formats: API, table, dataset in data lake, report, stream
2.5 Self-Serve Data Infrastructure¶
The DMLZ provides automation services (not products, but patterns to implement):
| Service | Scope |
|---|---|
| DLZ Provisioning | Creates new data landing zones via IaC |
| Data Product Onboarding | Creates resource groups, configures resources |
| Data Agnostic Ingestion | Metadata-driven ingestion engine using ADF + SQL metastore |
| Metadata Service | Exposes and creates platform metadata |
| Access Provisioning | Creates access packages and approval workflows |
| Data Lifecycle | Manages retention, cold storage, deletion |
| Domain Onboarding | Captures domain metadata, creates domain infrastructure |
2.6 Federated Governance¶
Implementation approach:
- Automated policies via Azure Policy (enforced at management group level)
- Code-first — Standards, policies, and platform deployment as code
- Purview for cross-domain data cataloging, classification, and lineage
- Unity Catalog (if using Databricks) for workspace-level governance
- Entra ID entitlement management for access request/approval workflows
2.7 Unity Catalog vs Purview for Governance¶
| Aspect | Microsoft Purview | Databricks Unity Catalog |
|---|---|---|
| Scope | Tenant-wide, all Azure data sources | Databricks workspaces only |
| Cataloging | All Azure + 100+ connectors | Databricks tables, volumes, models |
| Access Control | Policy-based, integrated with Entra ID | Fine-grained table/column ACLs |
| Lineage | Cross-service, ADF, Synapse | Spark jobs, notebooks, SQL |
| Data Quality | Built-in quality rules | Partner integrations |
| Classification | Automatic sensitivity classification | Manual tags + Purview integration |
| Cost | Included with Azure (consumption model) | Included with Databricks Premium |
| Recommendation | Use as enterprise-wide catalog | Use in addition to Purview for Databricks-specific governance |
Note
Best Practice: Use both together. Purview provides the enterprise-wide catalog, classification, and cross-platform lineage. Unity Catalog provides fine-grained access control, auditing, and lineage within the Databricks ecosystem. They are complementary, not competing.
🧵 3. Data Fabric Architecture¶
3.1 How Data Fabric Differs from Data Mesh¶
| Aspect | Data Mesh | Data Fabric |
|---|---|---|
| Philosophy | Organizational (decentralized ownership) | Technical (automated integration) |
| Core Driver | Domain teams own their data products | Metadata/AI automates data integration |
| Governance | Federated (domain teams + central policies) | Centralized (knowledge graph-driven) |
| Integration | Self-serve, domain teams build pipelines | Automated, AI discovers and integrates |
| Best For | Large orgs with autonomous business units | Orgs needing to unify disparate data sources |
| Key Technology | Self-serve platform + catalog | Knowledge graph + metadata layer + ML |
| Data Ownership | Distributed to domains | Central with virtual access |
3.2 Data Fabric Core Components¶
-
Knowledge Graph / Metadata Layer
- Active metadata that learns from usage patterns
- Automatic relationship discovery between data assets
- Semantic layer that provides business context
- Azure Services: Purview, Cosmos DB (graph API), custom knowledge graph
-
Automated Data Integration
- AI-driven data discovery and cataloging
- Automatic schema mapping and transformation
- Self-optimizing data pipelines
- Azure Services: ADF, Purview auto-classification, Azure ML
-
Unified Access Layer
- Virtual data access without physical movement
- Polyglot query (SQL, Spark, API)
- Azure Services: Synapse Serverless SQL, Databricks SQL Warehouses
-
Governance and Security
- Policy-driven, automated enforcement
- Data lineage tracking
- Compliance monitoring
- Azure Services: Purview, Azure Policy, Defender for Cloud
3.3 Azure Services for Data Fabric Patterns¶
| Pattern | Azure Services |
|---|---|
| Metadata Layer | Purview (catalog, lineage, classification) |
| Knowledge Graph | Cosmos DB Gremlin API, Purview relationships |
| Virtual Access | Synapse Serverless SQL pools (query across data lakes) |
| Automated Integration | ADF metadata-driven pipelines, Purview auto-scan |
| Semantic Layer | Power BI datasets, Azure Analysis Services |
| Data Virtualization | Synapse Serverless SQL, Databricks lakehouse federation |
3.4 Hybrid Approach for csa-inabox¶
For csa-inabox, we recommend a hybrid data mesh + data fabric approach:
- Data Mesh organizational model — Domain teams own data products in DLZs
- Data Fabric technical layer — Centralized metadata, automated discovery, virtual access via DMLZ
- This gives the best of both: decentralized ownership with automated, AI-driven governance
🔄 4. Microsoft Fabric Alternative Components¶
This section maps each Microsoft Fabric capability to equivalent Azure services for our open-source platform.
4.1 Complete Component Mapping¶
| Capability | Microsoft Fabric | csa-inabox Alternative | Open Source? | Notes |
|---|---|---|---|---|
| Data Lakehouse | Fabric Lakehouse (OneLake) | ADLS Gen2 + Delta Lake | Delta Lake: Yes | Delta Lake is open source, ADLS is PaaS |
| Spark Compute | Fabric Spark | Synapse Spark Pools / Databricks | Spark: Yes | Apache Spark is open source |
| Data Warehouse | Fabric Warehouse | Synapse Dedicated SQL / Serverless SQL | No | Azure-managed service |
| Data Integration | Fabric Data Pipelines | Azure Data Factory | No | ADF is Azure-managed |
| Real-Time Analytics | Fabric Real-Time Intelligence | Azure Data Explorer + Event Hubs | No | ADX is Azure-managed |
| Power BI | Fabric Power BI | Power BI (standalone) | No | Same service, different licensing |
| Data Governance | Fabric Governance | Purview + Unity Catalog | Unity Catalog: Partially | Purview is Azure-managed |
| AI/ML | Fabric Data Science | Azure ML + Databricks ML | MLflow: Yes | MLflow is open source |
| Data Engineering | Fabric Data Engineering | Spark on Synapse/Databricks | Spark: Yes | |
| Data Activator | Fabric Data Activator | Event Grid + Logic Apps + Functions | No | Azure-managed |
| Mirroring | Fabric Mirroring | ADF CDC + Debezium | Debezium: Yes | Open source CDC |
| Shortcuts | OneLake Shortcuts | ADLS linked services + mount points | N/A | Native ADLS capability |
4.2 Detailed Component Analysis¶
4.2.1 Data Lakehouse: Delta Lake on ADLS Gen2
Architecture:
- Storage: ADLS Gen2 with Hierarchical Namespace enabled
- Table Format: Delta Lake (open-source, Apache 2.0 license)
- Features: ACID transactions, time travel, schema evolution, unified batch/streaming
- Alternative formats: Apache Iceberg, Apache Hudi (for future flexibility)
Configuration Recommendations:
- Enable soft delete with 7-day retention
- Use lifecycle management policies for hot/cool/archive tiering
- Enable versioning for critical datasets
- Storage redundancy: LRS for dev, GRS for production
- Use customer-managed keys (CMK) via Key Vault for encryption
4.2.2 Compute: Synapse Spark Pools vs Databricks
Synapse Spark Pools:
- Integrated with Synapse workspace
- Auto-pause and auto-scale
- Good for organizations already using Synapse ecosystem
- Lower cost for intermittent workloads
- Limited compared to Databricks for advanced ML
Azure Databricks:
- Most feature-rich Spark platform
- Unity Catalog for governance
- Photon engine for SQL performance
- MLflow integration for ML lifecycle
- Delta Live Tables for declarative ETL
- Cluster policies for cost control
- Premium tier required for Unity Catalog
Recommendation: Use Databricks as primary compute for its superior governance (Unity Catalog), ML capabilities, and ecosystem. Use Synapse Serverless SQL for ad-hoc querying of the data lake. This provides the best combination of capabilities.
4.2.3 Data Warehouse: Synapse SQL Options
Synapse Dedicated SQL Pool:
- Traditional MPP data warehouse
- Persistent storage and compute
- Best for large-scale, predictable analytical workloads
- Expensive (pay-per-hour even when idle)
- Consider for enterprise reporting marts
Synapse Serverless SQL Pool:
- Query data in-place in ADLS Gen2
- Pay-per-query model
- Excellent for exploratory analysis
- Can create views over Delta Lake tables
- No data movement required
- Great complement to the lakehouse pattern
Recommendation: Use Serverless SQL as primary query engine for the lakehouse. Use Dedicated SQL only for specific high-performance reporting marts if needed. This dramatically reduces cost.
4.2.4 Data Integration: Azure Data Factory
Core Capabilities:
- 100+ connectors (on-premises, cloud, SaaS)
- Metadata-driven pipeline patterns
- Self-hosted Integration Runtime for on-premises connectivity
- Mapping data flows for code-free transformation
- Tumbling window and event-driven triggers
- Integration with Purview for lineage
Metadata-Driven Ingestion Pattern:
- Store connection metadata in Azure SQL Database
- ADF reads metadata to dynamically generate pipelines
- Parameterized datasets and linked services
- Single pipeline template handles multiple source types
- This is the recommended "data agnostic ingestion engine" in CSA
Alternative/Complement: Use dbt (already in the csa-inabox repo) for transformation logic. ADF handles orchestration and data movement; dbt handles SQL-based transformations in the lakehouse.
4.2.5 Real-Time Analytics
Azure Data Explorer (ADX / Kusto):
- Purpose-built for log and telemetry analytics
- KQL (Kusto Query Language) for ad-hoc analysis
- Sub-second query performance on massive datasets
- Streaming ingestion from Event Hubs, IoT Hub
- Time series analysis built-in
Azure Event Hubs:
- Managed Kafka-compatible event streaming
- Millions of events per second
- Event capture to ADLS Gen2 (Avro format)
- Kafka protocol support (Schema Registry)
Real-Time Architecture:
4.2.6 Power BI
Power BI standalone works identically to Fabric Power BI for reporting:
- Direct Lake mode (connect to Delta Lake tables)
- Import mode (for high-performance dashboards)
- DirectQuery (for real-time data)
- Premium capacity for large-scale deployment
- Power BI Embedded for ISV scenarios
Recommendation: Use Power BI Premium Per Capacity for the organization, connected directly to Delta Lake tables via Synapse Serverless SQL or Databricks SQL Warehouses.
4.2.7 AI/ML
Azure Machine Learning:
- End-to-end ML lifecycle
- Responsible AI dashboard
- Managed compute instances and clusters
- Automated ML (AutoML)
- Model registry and deployment
Databricks ML:
- MLflow (open source) for experiment tracking
- Feature Store
- Model Serving
- AutoML
- Deep learning with GPU clusters
Azure OpenAI Service:
- GPT-4, embeddings, DALL-E
- Private endpoint support
- Content filtering
- Fine-tuning capabilities
Recommendation: Use Databricks ML + MLflow as primary ML platform (open-source ML tracking). Add Azure ML for specialized needs (AutoML, Responsible AI). Use Azure OpenAI for generative AI workloads.
☁️ 5. Required Azure Services for a Complete Platform¶
5.1 Complete Service Catalog¶
Networking Services
| Service | Purpose | Subscription |
|---|---|---|
| Azure Virtual Network (VNet) | Network isolation per landing zone | All |
| VNet Peering | Hub-spoke connectivity | All |
| Private Endpoints | Secure PaaS access | All |
| Azure Private DNS Zones | Private endpoint resolution | Connectivity |
| Network Security Groups (NSG) | Subnet-level traffic filtering | All |
| Route Tables (UDR) | Force traffic through firewall | All |
| Azure Firewall | Central traffic inspection, FQDN filtering | Connectivity |
| Azure Bastion | Secure RDP/SSH to jumpboxes | Connectivity |
| ExpressRoute | On-premises connectivity | Connectivity |
| VPN Gateway | Site-to-Site for third-party clouds | Connectivity |
Identity & Access Services
| Service | Purpose | Subscription |
|---|---|---|
| Microsoft Entra ID | Identity provider | Tenant-wide |
| Managed Identities | Service-to-service auth (no passwords) | All |
| RBAC | Role-based access control | All |
| Service Principals | CI/CD and automation authentication | All |
| Entra ID Entitlement Mgmt | Self-service access request/approval | Tenant-wide |
| Privileged Identity Management | Just-in-time admin access | Tenant-wide |
| Conditional Access | Context-aware access policies | Tenant-wide |
Storage Services
| Service | Purpose | Subscription |
|---|---|---|
| ADLS Gen2 (HNS-enabled) | Data lake storage (3 per DLZ) | DLZ |
| Azure Blob Storage | External data staging | DLZ |
| Azure SQL Database | Metadata stores, ADF metastore | DLZ, DMLZ |
| Cosmos DB | Knowledge graph (optional), app data | DMLZ |
Compute Services
| Service | Purpose | Subscription |
|---|---|---|
| Azure Databricks | Primary Spark compute, ML, SQL analytics | DLZ |
| Synapse Analytics | Serverless SQL, optional Spark pools | DLZ |
| Azure Kubernetes Service (AKS) | Microservices, API hosting (optional) | DMLZ |
| Azure Functions | Event-driven automation | DLZ, DMLZ |
| Virtual Machines | Self-hosted IR, jumpboxes | DLZ |
Data Integration Services
| Service | Purpose | Subscription |
|---|---|---|
| Azure Data Factory | Orchestration, data movement | DLZ |
| Self-Hosted Integration Runtime | On-premises/private network access | DLZ |
| Event Hubs | Real-time event ingestion | DLZ |
| Event Grid | Event-driven triggers | DLZ |
| Service Bus | Reliable message queuing | DLZ, DMLZ |
| IoT Hub | IoT device telemetry (optional) | DLZ |
Governance Services
| Service | Purpose | Subscription |
|---|---|---|
| Microsoft Purview | Data catalog, classification, lineage | DMLZ (tenant-scoped) |
| Azure Policy | Governance enforcement | Management Group |
| Databricks Unity Catalog | Spark-level data governance | DLZ (Databricks) |
| Management Groups | Policy hierarchy, RBAC inheritance | Tenant-wide |
Security Services
| Service | Purpose | Subscription |
|---|---|---|
| Azure Key Vault | Secrets, keys, certificates | All |
| Microsoft Defender for Cloud | Threat protection, security posture | All |
| Microsoft Sentinel | SIEM, threat detection, SOAR | Management |
| Defender for Identity | Identity threat detection | Management |
| Defender for Storage | Storage threat protection | DLZ |
| Defender for SQL | Database threat protection | DLZ |
Monitoring Services
| Service | Purpose | Subscription |
|---|---|---|
| Azure Monitor | Platform metrics and alerts | All |
| Log Analytics Workspace | Centralized logging | Management |
| Application Insights | Application performance monitoring | DMLZ, DLZ |
| Azure Workbooks | Custom dashboards | Management |
| Diagnostic Settings | Service-level telemetry | All |
AI/ML Services
| Service | Purpose | Subscription |
|---|---|---|
| Azure Machine Learning | ML lifecycle management | DLZ |
| Azure OpenAI Service | Generative AI workloads | DLZ |
| Azure Cognitive Services | Pre-built AI (Vision, Language, etc.) | DLZ |
| Databricks ML / MLflow | Experiment tracking, model registry | DLZ |
5.2 Service Dependencies Map¶
graph TB
EntraID["Microsoft Entra ID<br/>(Tenant-wide identity)"]
subgraph Mgmt["Management Subscription"]
LA["Log Analytics"]
Sentinel["Sentinel"]
Auto["Automation"]
Monitor["Monitor"]
end
subgraph Conn["Connectivity Subscription"]
Hub["Hub VNet"]
FW["Firewall"]
ER["ExpressRoute"]
DNS["DNS Zones"]
Bastion["Bastion"]
end
subgraph DMLZSub["DMLZ Subscription"]
Purview["Purview"]
KV1["Key Vault"]
ACR["ACR"]
SQLDB["SQL Database"]
APIM["API Management"]
end
subgraph DLZSub["DLZ Subscription(s)"]
ADLS["ADLS Gen2 x3"]
DBX["Databricks"]
Synapse["Synapse"]
ADF["ADF + IR"]
KV2["Key Vault"]
EH["Event Hubs"]
AML["Azure ML"]
PBI["Power BI"]
end
EntraID --> Mgmt
EntraID --> Conn
EntraID --> DMLZSub
EntraID --> DLZSub
Conn -->|"VNet Peering"| DMLZSub
Conn -->|"VNet Peering"| DLZSub
DMLZSub -->|"Peering + Scanning"| DLZSub 📦 6. Deployment Strategy for 4 Azure Subscriptions¶
6.1 Recommended Subscription Layout¶
For the csa-inabox initial deployment with 4 subscriptions:
Tenant Root Group
└── csa-inabox (Management Group)
├── Platform (Management Group)
│ ├── Sub 1: csa-platform (Management + Connectivity + Identity)
│ │ ├── rg-management (Log Analytics, Automation, Sentinel)
│ │ ├── rg-connectivity (Hub VNet, Firewall, DNS zones)
│ │ └── rg-identity (Optional AD DCs)
│ │
│ └── Sub 2: csa-governance (Data Management Landing Zone)
│ ├── rg-governance (Purview, Policy artifacts)
│ ├── rg-network (DMLZ VNet, peering, private endpoints)
│ ├── rg-shared-services (ACR, API Management, Key Vault)
│ └── rg-service-layer (Automation microservices, data marketplace)
│
└── Landing Zones (Management Group)
├── Sub 3: csa-data-nonprod (Development + Test DLZ)
│ ├── rg-network (DLZ VNet, NSGs, route tables)
│ ├── rg-security (Key Vault, Defender)
│ ├── rg-storage (ADLS Gen2 x3: raw, enriched, dev)
│ ├── rg-compute (Databricks, Synapse)
│ ├── rg-integration (ADF, shared IR, Event Hubs)
│ ├── rg-data-app-{name} (Per data product)
│ └── rg-monitoring (Diagnostic settings)
│
└── Sub 4: csa-data-prod (Production DLZ)
├── rg-network
├── rg-security
├── rg-storage
├── rg-compute
├── rg-integration
├── rg-data-app-{name}
└── rg-monitoring
6.2 Alternative: Scale-Out Layout¶
For larger deployments, expand to domain-based DLZs:
Landing Zones (Management Group)
├── Sub: csa-dlz-finance (Finance domain DLZ)
├── Sub: csa-dlz-sales (Sales domain DLZ)
├── Sub: csa-dlz-marketing (Marketing domain DLZ)
└── Sub: csa-dlz-operations (Operations domain DLZ)
6.3 Cross-Subscription Networking¶
VNet Address Space Allocation:
Hub VNet (csa-platform): 10.0.0.0/16
- GatewaySubnet: 10.0.0.0/24
- AzureFirewallSubnet: 10.0.1.0/24
- AzureBastionSubnet: 10.0.2.0/24
- ManagementSubnet: 10.0.10.0/24
DMLZ VNet (csa-governance): 10.1.0.0/16
- PurviewSubnet: 10.1.1.0/24
- SharedServicesSubnet: 10.1.2.0/24
- PrivateEndpointSubnet: 10.1.10.0/24
DLZ-NonProd VNet: 10.2.0.0/16
- DatabricksPublicSubnet: 10.2.1.0/24
- DatabricksPrivateSubnet: 10.2.2.0/24
- SynapseSubnet: 10.2.3.0/24
- PrivateEndpointSubnet: 10.2.10.0/24
- IntegrationSubnet: 10.2.11.0/24
- DataApp1Subnet: 10.2.20.0/24
DLZ-Prod VNet: 10.3.0.0/16
- (Same structure as NonProd)
Peering Configuration:
| From | To | Direction | Purpose |
|---|---|---|---|
| Hub | DMLZ | Bidirectional | Governance connectivity |
| Hub | DLZ-NonProd | Bidirectional | Internet/on-premises access |
| Hub | DLZ-Prod | Bidirectional | Internet/on-premises access |
| DMLZ | DLZ-NonProd | Bidirectional | Purview scanning, governance |
| DMLZ | DLZ-Prod | Bidirectional | Purview scanning, governance |
| DLZ-NonProd | DLZ-Prod | Optional* | Cross-environment data sharing |
*Cross-DLZ peering between non-prod and prod should be avoided in most cases for isolation.
6.4 Policy Inheritance and Management Group Hierarchy¶
| Management Group | Key Policies |
|---|---|
| Root (csa-inabox) | Require tags (cost center, environment, owner); Allowed locations; Audit diagnostic settings; Deny public IP addresses |
| Platform | Require resource locks on critical resources; Deny unauthorized resource types |
| Landing Zones | Deploy Private DNS zone groups (deployIfNotExists); Require HTTPS/TLS; Deploy Defender for Cloud; Deny public network access on storage accounts; Require encryption at rest with CMK; Deny creation of classic resources |
6.5 RBAC Strategy¶
| Role | Scope | Principals |
|---|---|---|
| Owner | Management Group: csa-inabox | Platform team (PIM-protected) |
| Contributor | Sub: csa-platform | Platform operations team |
| Network Contributor | Hub VNet resource group | Network team |
| Contributor | Sub: csa-governance | Data governance team |
| Purview Data Curator | Purview account | Data stewards |
| Contributor | DLZ resource groups | Data domain teams |
| User Access Administrator | DLZ subscription | Data platform team |
| Private DNS Zone Contributor | DNS resource group | Service principals for DLZ deployments |
| Network Contributor | DLZ subnets (child scope) | Data application team service principals |
| Reader | Sub: csa-data-prod | All data consumers |
| Storage Blob Data Reader | ADLS Gen2 curated | Data analysts, BI users |
| Storage Blob Data Contributor | ADLS Gen2 raw/enriched | Data engineers |
📋 7. Best Practices and Standards¶
7.1 Azure Well-Architected Framework for Data Platforms¶
Reliability:
- Use GRS (Geo-Redundant Storage) for production data lakes
- Configure ADLS Gen2 soft delete and versioning
- Design for regional failover of critical services
- Implement retry policies in all data pipelines
- Use Availability Zones for Databricks, Synapse, Key Vault
Security:
- Zero-trust: Never trust, always verify
- Private endpoints for all PaaS services (no public access)
- Customer-managed keys for encryption at rest
- TLS 1.2+ for all data in transit
- Managed identities over service principals where possible
- Azure Policy to enforce security baseline
- Microsoft Defender for all data services
Cost Optimization:
- Synapse Serverless SQL (pay-per-query) over Dedicated SQL
- Databricks auto-scaling and auto-terminate policies
- ADLS lifecycle management (hot → cool → archive)
- Reserved instances for predictable Databricks/Synapse workloads
- Azure Advisor cost recommendations
- Tag-based cost allocation and chargeback
Operational Excellence:
- Infrastructure as Code (Bicep/Terraform) for all deployments
- CI/CD pipelines for data platform changes
- GitOps for configuration management
- Centralized monitoring via Log Analytics
- Automated alerting on pipeline failures
- Runbooks for common operational tasks
Performance Efficiency:
- Delta Lake Z-ordering for query optimization
- Databricks Photon engine for SQL workloads
- Partition pruning in data lake queries
- Caching strategies in Power BI and Synapse
- Right-sizing compute clusters
7.2 Zero-Trust Network Architecture¶
Principles applied to data platforms:
-
Verify Explicitly — Every access request is fully authenticated and authorized
- Managed identities for service-to-service
- Entra ID for user access
- Conditional access policies
-
Use Least-Privilege Access — Just-in-time, just-enough access
- PIM for admin roles
- ACLs on data lake folders (not container level)
- Granular RBAC roles
-
Assume Breach — Minimize blast radius
- Network segmentation (subnets per workload)
- Private endpoints (no public access surface)
- NSGs with deny-all default
- Azure Firewall for egress filtering
- Micro-segmentation within DLZs
Network Zero-Trust Checklist:
- All PaaS services behind private endpoints
- Public network access disabled on all storage accounts
- Azure Firewall filtering all outbound traffic
- NSGs on every subnet with explicit allow rules
- No public IP addresses on any resource
- DNS resolution through private DNS zones only
- Jumpbox + Bastion for administrative access
- TLS 1.2 minimum on all services
7.3 Data Classification and Sensitivity Labeling¶
Classification Levels:
| Level | Label | Description | Handling |
|---|---|---|---|
| 1 | Public | Approved for public release | No restrictions |
| 2 | Internal | General business data | Require authentication |
| 3 | Confidential | Sensitive business data | Encrypt, limited access |
| 4 | Highly Confidential | PII, financial, regulated | Encrypt, audit, MFA, DLP |
| 5 | Restricted | Top secret, regulated | Full audit, approval workflow |
Implementation:
- Microsoft Purview auto-classification scans data lakes
- Sensitivity labels applied to ADLS containers/folders
- Data in
Sensitivefolders gets Confidential or higher classification - Azure Information Protection for document-level classification
- DLP policies via Purview to prevent data exfiltration
- Classification drives access control and retention policies
7.4 Cost Management and FinOps¶
Cost Levers for Data Platforms:
| Service | Cost Strategy |
|---|---|
| ADLS Gen2 | Lifecycle policies: hot (30 days) → cool (90 days) → archive |
| Databricks | Auto-terminate (15 min idle), spot instances for batch, cluster pools |
| Synapse | Serverless SQL (pay-per-TB scanned), auto-pause dedicated pools |
| ADF | Pipeline optimization, avoid excessive data movement |
| Event Hubs | Right-size throughput units, auto-inflate |
| Key Vault | Standard tier (not Premium) unless HSM required |
| Purview | Scan scheduling (not continuous) |
FinOps Practices:
- Tag all resources with:
costCenter,environment,owner,project - Use Azure Cost Management + Billing for dashboards
- Set budgets and alerts per subscription and resource group
- Weekly cost review by platform team
- Reserved Instances for Databricks (1-year or 3-year)
- Dev/test pricing for non-production subscriptions
- Auto-shutdown schedules for non-production compute
- Right-sizing reviews quarterly
7.5 Disaster Recovery and Business Continuity¶
RPO/RTO Targets by Data Layer:
| Layer | RPO | RTO | Strategy |
|---|---|---|---|
| Raw (Bronze) | 24 hours | 4 hours | Source re-ingest + GRS |
| Enriched (Silver) | 4 hours | 2 hours | GRS + reprocessing pipelines |
| Curated (Gold) | 1 hour | 1 hour | GRS + hot standby |
| Metadata (Purview) | 0 (continuous) | 1 hour | Managed service redundancy |
BCDR Architecture:
- ADLS Gen2: GRS or GZRS for cross-region replication
- Azure SQL: Active geo-replication or auto-failover groups
- Key Vault: Soft delete + purge protection enabled
- Databricks: Multi-region workspace deployment (active-passive)
- ADF: Pipeline definitions in git (redeploy from source)
- Purview: Tenant-scoped, inherently resilient
Backup Strategy:
- ADLS: Blob versioning + soft delete + point-in-time restore
- Azure SQL: Automated backups (7-35 day retention)
- Databricks: Unity Catalog metadata backed by managed storage
- Key Vault: Soft-delete (90 days) + purge protection
- IaC: All infrastructure defined in code (Bicep/Terraform in git)
📚 8. Reference Templates and IaC¶
8.1 Microsoft Official Templates¶
Microsoft provides official Bicep/ARM templates in GitHub:
| Repository | Purpose | Deployment |
|---|---|---|
| Azure/data-management-zone | DMLZ template (Purview, governance, networking) | One per platform |
| Azure/data-landing-zone | DLZ template (storage, compute, integration) | One per DLZ |
| Azure/data-product-batch | Batch data processing workload | One+ per DLZ |
| Azure/data-product-streaming | Streaming data processing workload | One+ per DLZ |
| Azure/data-product-analytics | Analytics and data science workload | One+ per DLZ |
Template Stack: Bicep (67.5%), PowerShell, Shell Deployment Methods: Azure Portal (Deploy to Azure), GitHub Actions, Azure DevOps
8.2 Template Architecture¶
data-management-zone/
├── infra/
│ ├── main.json (ARM template entry point)
│ └── modules/ (Bicep modules)
│ ├── Purview/
│ ├── Network/
│ │ ├── privateDnsZones/
│ │ └── virtualNetworkLinks/
│ ├── KeyVault/
│ ├── ContainerRegistry/
│ └── ...
├── code/ (Application code)
├── docs/ (Documentation)
├── .github/ (GitHub Actions workflows)
└── .ado/ (Azure DevOps pipelines)
data-landing-zone/
├── infra/
│ ├── main.json
│ └── modules/
│ ├── Storage/ (ADLS Gen2 accounts)
│ ├── Databricks/
│ ├── Synapse/
│ ├── DataFactory/
│ ├── Network/
│ └── ...
├── code/
├── docs/
├── .github/
└── .ado/
8.3 csa-inabox Deployment Approach¶
For our project, we should:
- Fork/adapt the Microsoft templates as our baseline
- Add Terraform support alongside Bicep (for multi-cloud flexibility)
- Create a unified deployment orchestrator that:
- Deploys the platform subscription (management + connectivity)
- Deploys the DMLZ subscription
- Deploys DLZ subscriptions
- Configures cross-subscription networking
- Sets up governance policies
- Add open-source tooling layer:
- dbt for SQL transformations
- Great Expectations for data quality
- Apache Airflow or Dagster for orchestration (optional)
- MLflow for ML lifecycle
- OpenMetadata or DataHub as Purview alternative (optional)
8.4 Existing csa-inabox Assets¶
Based on the current repo structure:
deploy/arm/— ARM templates (e.g., Purview)deploy/bicep/— Bicep modules (DMLZ, network, private DNS zones)scripts/sql/— SQL scripts, Hive metastore notebooktools/dbt/— dbt environment for transformationscodeqlDB/— Code quality database
🔗 9. Sources and References¶
Microsoft Official Documentation¶
- Cloud-Scale Analytics Overview — Main CSA scenario (deprecated, redirects to Unify your data platform)
- Data Landing Zone Architecture — DLZ component architecture
- Data Management Landing Zone — DMLZ governance architecture
- What is Data Mesh? — Data mesh principles on Azure
- Network Topology and Connectivity — Networking architecture
- Identity and Access Management — IAM for data platforms
- Security, Governance, and Compliance — Security architecture
- Data Lake Zones and Containers — Medallion architecture
- What is an Azure Landing Zone? — ALZ reference architecture
- Subscription Considerations — Subscription organization
- Unify Your Data Platform (new guidance) — Replacement for CSA
GitHub Repositories¶
- Azure/data-management-zone — DMLZ Bicep templates
- Azure/data-landing-zone — DLZ Bicep templates
- Azure/data-product-batch — Batch processing template
- Azure/data-product-streaming — Streaming template
- Azure/data-product-analytics — Analytics/DS template
Architecture Frameworks¶
- Azure Well-Architected Framework — Five pillars
- Cloud Adoption Framework — Cloud adoption methodology
- Azure Landing Zone Accelerator — IaC deployment
Open Source Projects¶
- Delta Lake — Open-source lakehouse storage layer
- Apache Spark — Distributed compute engine
- dbt — SQL transformation framework
- MLflow — ML lifecycle management
- Great Expectations — Data quality validation
- Debezium — Change data capture
📊 Appendix A: Service SKU Recommendations¶
| Service | Development | Production |
|---|---|---|
| ADLS Gen2 | Standard LRS, Hot | Standard GRS, Hot/Cool tiering |
| Databricks | Standard, Standard_DS3_v2 | Premium, Standard_DS4_v2+, Unity Catalog |
| Synapse | Serverless SQL only | Serverless SQL + DW100c (if dedicated needed) |
| ADF | Pay-as-you-go | Pay-as-you-go (consider reserved DIU) |
| Key Vault | Standard | Standard (Premium for HSM) |
| SQL Database | Basic/S0 (metastore) | S1/S2 (metastore) |
| Event Hubs | Basic (1 TU) | Standard (auto-inflate to 20 TU) |
| Purview | N/A (tenant-scoped) | N/A (tenant-scoped) |
| Azure Firewall | Basic | Standard or Premium |
| Log Analytics | Pay-as-you-go | Commitment tier (100GB/day+) |
📛 Appendix B: Naming Convention¶
{resourceType}-{project}-{environment}-{region}-{instance}
Examples:
rg-csa-prod-eastus2-001 (Resource Group)
st-csa-prod-eastus2-raw (Storage Account - raw lake)
st-csa-prod-eastus2-enriched (Storage Account - enriched lake)
st-csa-prod-eastus2-dev (Storage Account - development lake)
adf-csa-prod-eastus2-001 (Data Factory)
dbw-csa-prod-eastus2-001 (Databricks Workspace)
syn-csa-prod-eastus2-001 (Synapse Workspace)
kv-csa-prod-eastus2-001 (Key Vault)
vnet-csa-prod-eastus2-001 (Virtual Network)
pep-csa-prod-eastus2-st-raw (Private Endpoint for raw storage)
nsg-csa-prod-eastus2-dbw-pub (NSG for Databricks public subnet)
pdz-blob-core-windows-net (Private DNS Zone)
🚀 Appendix C: Deployment Order¶
graph TD
subgraph P1["Phase 1: Foundation"]
MG["1. Management Groups<br/>+ Policy Definitions"]
PLAT["2. Platform Subscription"]
MG --> PLAT
PLAT --> LA["2a. Log Analytics"]
PLAT --> HUB["2b. Hub VNet + Firewall"]
PLAT --> PDNS["2c. Private DNS Zones"]
PLAT --> BAST["2d. Azure Bastion"]
end
subgraph P2["Phase 2: Governance"]
DMLZ["3. DMLZ Subscription"]
DMLZ --> DMLZV["3a. DMLZ VNet + Peering"]
DMLZ --> KV["3b. Key Vault"]
DMLZ --> PV["3c. Purview"]
DMLZ --> ACR["3d. ACR"]
DMLZ --> SQL["3e. Shared SQL DB"]
end
subgraph P3["Phase 3: Data Landing Zones"]
DLZ["4. DLZ Subscription<br/>(NonProd → Prod)"]
DLZ --> DLZV["4a. DLZ VNet + Peering"]
DLZ --> NSG["4b. NSGs + Route Tables"]
DLZ --> KV2["4c. Key Vault"]
DLZ --> ADLS["4d. ADLS Gen2 x3 + PE"]
DLZ --> DBX["4e. Databricks + PE"]
DLZ --> SYN["4f. Synapse + PE"]
DLZ --> ADF["4g. ADF + PE + SHIR"]
DLZ --> EH["4h. Event Hubs + PE"]
end
subgraph P4["Phase 4: Governance Config"]
PVC["5. Purview Config"]
PVC --> REG["5a. Register sources"]
PVC --> SCAN["5b. Scanning schedules"]
PVC --> CLASS["5c. Classification rules"]
UC["6. Unity Catalog Config"]
UC --> META["6a. Create metastore"]
UC --> EXT["6b. External locations"]
UC --> CAT["6c. Catalogs + schemas"]
end
subgraph P5["Phase 5: Data Products"]
DP["7. Data Product Templates"]
DP --> SRC["7a. Source-aligned apps"]
DP --> CONS["7b. Consumer-aligned apps"]
DP --> ING["7c. ADF ingestion pipelines"]
DP --> DBT["7d. dbt transformations"]
end
P1 --> P2 --> P3 --> P4 --> P5 This report was compiled from Microsoft's official Cloud Adoption Framework documentation, GitHub reference templates, and Azure architecture guidance. The information is current as of April 2026.
🔗 Related Documentation¶
- ARCHITECTURE.md — Platform architecture overview
- PLATFORM_SERVICES.md — Platform services reference and SKU details
- MULTI_REGION.md — Multi-region deployment for high availability