
CSA-in-a-Box: Comprehensive Platform Research Report

Comparative positioning note

This document is written from the perspective of Microsoft Azure, Cloud Scale Analytics, and CSA Loom. Any description of third-party or competing products, services, pricing, or capabilities is derived from publicly available documentation and sources believed accurate at the time of writing, and is provided for general comparison only. We do not claim expertise in, or authority over, any non-Microsoft product or service; the respective vendor's official documentation is the authoritative source for their offerings, which may change over time. Nothing here is intended to disparage any vendor — where a competing product has genuine advantages, we aim to note them honestly. Verify all third-party details against the vendor's current official documentation before making decisions.

Note

Quick Summary: Deep research report for building a complete Cloud-Scale Analytics / Data Mesh / Data Fabric platform in Azure as an open-source alternative to Microsoft Fabric — covers CSA architecture (DMLZ + DLZ + Medallion), Data Mesh domain ownership, Data Fabric metadata layer, component mapping from Fabric to Azure PaaS, 50+ required Azure services, 4-subscription deployment strategy with VNet address spaces and RBAC, Well-Architected best practices, zero-trust networking, data classification, FinOps, DR/BCDR, and IaC reference templates.

Date: 2026-04-09

Purpose: Deep research for building a complete Cloud-Scale Analytics / Data Mesh / Data Fabric platform in Azure as an open-source alternative to Microsoft Fabric.

📑 Table of Contents

🏗️ 1. Azure Cloud-Scale Analytics Architecture
🔀 2. Data Mesh Architecture in Azure
🧵 3. Data Fabric Architecture
🔄 4. Microsoft Fabric Alternative Components
- 4.1 Complete Component Mapping
- 4.2 Detailed Component Analysis
☁️ 5. Required Azure Services for a Complete Platform
- 5.1 Complete Service Catalog
- 5.2 Service Dependencies Map
📦 6. Deployment Strategy for 4 Azure Subscriptions
📋 7. Best Practices and Standards
📚 8. Reference Templates and IaC
🔗 9. Sources and References
📊 Appendix A: Service SKU Recommendations
📛 Appendix B: Naming Convention
🚀 Appendix C: Deployment Order

🏗️ 1. Azure Cloud-Scale Analytics Architecture

1.1 Overview and Current Status

Microsoft's Cloud-Scale Analytics (CSA) was the reference architecture for building enterprise data platforms on Azure. It was part of the Cloud Adoption Framework (CAF) and provided prescriptive guidance for data landing zones, governance, and scalable analytics.

Warning

As of early 2026, Microsoft has deprecated the Cloud-Scale Analytics scenario. The deprecation notice states: "The Cloud-Scale Analytics scenario has been deprecated and is no longer maintained or supported. To ensure only the best guidance is surfaced, this guidance will be deleted April 2026." Microsoft redirects to their new "Unify your data platform" guidance at https://aka.ms/cafdata.

What this means for csa-inabox: In our assessment, the CSA architecture remains a well-documented and comprehensive open reference for building a modular, enterprise-grade data platform on Azure. While Microsoft is consolidating guidance (likely pushing toward Fabric), the architectural patterns, landing zone structure, and IaC templates remain valid and fit this project's needs. Our project preserves and extends these patterns with open-source tooling.

1.2 Core Architectural Concepts

CSA consists of two primary architectural constructs:

Data Management Landing Zone (DMLZ)

A separate Azure subscription that provides centralized governance for the entire analytics platform:

Component	Azure Service	Purpose
Data Catalog	Microsoft Purview	Register, classify, discover data sources across all landing zones
Data Governance	Purview + Unity Catalog	Centralized access control, auditing, lineage, data quality
Primary Data Management	Purview + custom	Master data management, golden records
Data Sharing & Contracts	Entra ID Entitlement Mgmt	Access packages, sharing policies
API Catalog	Azure API Management	Standardized API documentation and governance
Data Quality	Purview Data Quality	Quality metrics, validation, monitoring
Data Modeling Repository	ER/Studio, custom	Centralized entity relationship models
Container Registry	Azure Container Registry	Standard containers for data science
Service Layer	Custom microservices	Data marketplace, operations console, automation
Networking	VNet, DNS, Peering	Hub connectivity to all data landing zones
Security	Key Vault, Defender	Centralized secrets and threat protection

Important

The DMLZ must be deployed as a separate subscription under a management group with appropriate governance policies. It connects to data landing zones via VNet peering and to the connectivity subscription.

Data Landing Zone (DLZ)

Each DLZ is a separate Azure subscription that hosts analytics workloads for a specific domain or business unit:

Layer	Resource Groups	Purpose
Platform Services (Required)	`network-rg`, `security-rg`	VNet, NSGs, route tables, monitoring, Defender
Core Services (Required)	`storage-rg`, `runtimes-rg`, `mgmt-rg`, `external-data-rg`	Data lakes, shared IRs, CI/CD agents, external storage
Core Services (Optional)	`data-ingestion-rg`, `shared-applications-rg`	ADF, SQL metastore, shared Databricks
Data Application (Optional)	`data-application-rg` (one or more)	Per-application resources
Reporting (Optional)	`reporting-rg`	Visualization, Power BI gateways

1.3 Data Lake Architecture (Medallion Pattern)

Each DLZ provisions three ADLS Gen2 storage accounts forming a logical data lake:

Lake #	Layer	Medallion	Containers	Description
1	Raw	Bronze	`landing`, `conformance`	Immutable source data, data quality gates
2	Enriched	Silver	`standardized`	Merged, cleansed, type-aligned data
2	Curated	Gold	`data-products`	Aggregated, modeled, consumption-ready
3	Development	N/A	`analytics-sandbox`, `synapse-primary-*`	Exploratory sandboxes, workspace storage

Container Folder Structures

Container Folder Structure (Raw/Landing):

Landing/
  Log/{Application Name}/
  Master and Reference/{Source System}/
  Telemetry/{Source System}/{Application}/
  Transactional/{Source System}/{Entity}/{Version}/
    Delta/{rundate=YYYY-MM-DD}/
    Full/

Container Folder Structure (Raw/Conformance):

Conformance/
  Transactional/{Source System}/{Entity}/{Version}/
    Delta/
      Input/{rundate=YYYY-MM-DD}/
      Output/{rundate=YYYY-MM-DD}/
      Error/{rundate=YYYY-MM-DD}/
    Full/
      Input/{rundate=YYYY-MM-DD}/
      Output/{rundate=YYYY-MM-DD}/
      Error/{rundate=YYYY-MM-DD}/

Container Folder Structure (Enriched/Standardized):

Standardized/
  Transactional/{Source System}/{Entity}/{Version}/
    General/{rundate=YYYY-MM-DD}/
    Sensitive/{rundate=YYYY-MM-DD}/

Container Folder Structure (Curated/Data Products):

{Data Product}/
  {Entity}/{Version}/
    General/{rundate=YYYY-MM-DD}/
    Sensitive/{rundate=YYYY-MM-DD}/
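The folder conventions above can be sketched as a small set of path builders. This is a minimal illustration, not part of the CSA guidance: the function names are hypothetical, and it assumes the `{rundate=YYYY-MM-DD}` placeholder renders as a Hive-style `rundate=2026-04-09` partition folder.

```python
from datetime import date

CONFORMANCE_STAGES = ("Input", "Output", "Error")


def _rundate(run_date: date) -> str:
    # Assumption: {rundate=YYYY-MM-DD} becomes a Hive-style partition folder.
    return f"rundate={run_date.isoformat()}"


def landing_path(source_system: str, entity: str, version: str,
                 run_date: date, full_load: bool = False) -> str:
    """Transactional path inside the raw `landing` container."""
    base = f"Landing/Transactional/{source_system}/{entity}/{version}"
    # In the landing layout, Full/ has no rundate subfolder.
    return f"{base}/Full/" if full_load else f"{base}/Delta/{_rundate(run_date)}/"


def conformance_path(source_system: str, entity: str, version: str,
                     run_date: date, stage: str, full_load: bool = False) -> str:
    """Transactional path inside the raw `conformance` container."""
    if stage not in CONFORMANCE_STAGES:
        raise ValueError(f"stage must be one of {CONFORMANCE_STAGES}")
    load = "Full" if full_load else "Delta"
    base = f"Conformance/Transactional/{source_system}/{entity}/{version}"
    return f"{base}/{load}/{stage}/{_rundate(run_date)}/"


def standardized_path(source_system: str, entity: str, version: str,
                      run_date: date, sensitive: bool = False) -> str:
    """Path inside the enriched `standardized` container."""
    classification = "Sensitive" if sensitive else "General"
    base = f"Standardized/Transactional/{source_system}/{entity}/{version}"
    return f"{base}/{classification}/{_rundate(run_date)}/"


def data_product_path(data_product: str, entity: str, version: str,
                      run_date: date, sensitive: bool = False) -> str:
    """Path inside the curated `data-products` container."""
    classification = "Sensitive" if sensitive else "General"
    return f"{data_product}/{entity}/{version}/{classification}/{_rundate(run_date)}/"
```

Centralizing path construction in one module keeps ingestion, conformance, and curation jobs from drifting apart on folder naming.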

Key Configuration:

Enable Hierarchical Namespace (HNS) on all ADLS Gen2 accounts
Use ACLs + Microsoft Entra groups for fine-grained access control
Separate General and Sensitive folders for data classification
Store data in Delta Lake format (Parquet + transaction log)

1.4 Hub-Spoke Network Topology

The networking architecture follows a hub-spoke model integrated with Azure Landing Zones:

graph TB
    subgraph Connectivity["Connectivity Sub (Hub VNet)"]
        FW["Azure Firewall"]
        ER["ExpressRoute"]
        VPN["VPN Gateway"]
    end

    subgraph DMLZ["DMLZ VNet"]
        PV["Purview"]
        GOV["Governance"]
        ACR["ACR"]
    end

    subgraph DLZ1["DLZ-1 VNet"]
        ST1["Storage"]
        CMP1["Compute"]
        ADF1["ADF"]
    end

    subgraph DLZ2["DLZ-2 VNet"]
        ST2["Storage"]
        CMP2["Compute"]
        ADF2["ADF"]
    end

    Connectivity -->|"VNet Peering"| DMLZ
    Connectivity -->|"VNet Peering"| DLZ1
    Connectivity -->|"VNet Peering"| DLZ2
    DMLZ -->|"VNet Peering"| DLZ1
    DMLZ -->|"VNet Peering"| DLZ2

Design Principles:

All PaaS services use Private Endpoints (no public IPs)
VNet peering between DMLZ and each DLZ
VNet peering between DLZs for cross-domain data sharing
Central Azure Private DNS zones for endpoint resolution
DNS A-records automated via Azure Policy (deployIfNotExists)
Site-to-Site VPN for third-party cloud connectivity
ExpressRoute for on-premises connectivity through the hub
NSGs and route tables per subnet
Azure Firewall in the hub for traffic inspection

Private DNS Zones Required

privatelink.blob.core.windows.net
privatelink.dfs.core.windows.net
privatelink.database.windows.net
privatelink.sql.azuresynapse.net
privatelink.dev.azuresynapse.net
privatelink.azuresynapse.net
privatelink.vaultcore.azure.net
privatelink.datafactory.azure.net
privatelink.adf.azure.com
privatelink.purview.azure.com
privatelink.purviewstudio.azure.com
privatelink.servicebus.windows.net
privatelink.azuredatabricks.net
privatelink.azurecr.io
privatelink.monitor.azure.com
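To illustrate how a private endpoint's A-record lands in one of these zones, the sketch below maps a service's public FQDN to the matching private DNS zone and record name. It covers only a subset of the zones listed above, and the helper name `private_zone_for` and the example resource names are hypothetical.

```python
# Public DNS suffix -> private DNS zone (subset of the zones listed above).
# Note that Key Vault's private zone name differs from its public suffix.
PRIVATE_ZONES = {
    "blob.core.windows.net": "privatelink.blob.core.windows.net",
    "dfs.core.windows.net": "privatelink.dfs.core.windows.net",
    "database.windows.net": "privatelink.database.windows.net",
    "vault.azure.net": "privatelink.vaultcore.azure.net",
    "servicebus.windows.net": "privatelink.servicebus.windows.net",
}


def private_zone_for(fqdn: str) -> tuple[str, str]:
    """Return (private_zone, record_name) for a public service FQDN."""
    host = fqdn.lower().rstrip(".")
    # Check longer suffixes first so the most specific zone wins.
    for suffix in sorted(PRIVATE_ZONES, key=len, reverse=True):
        if host.endswith("." + suffix):
            record_name = host[: -(len(suffix) + 1)]
            return PRIVATE_ZONES[suffix], record_name
    raise KeyError(f"no private DNS zone configured for {fqdn}")
```

For example, `private_zone_for("csalake.dfs.core.windows.net")` resolves to the `privatelink.dfs.core.windows.net` zone with record name `csalake`.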

1.5 Integration with Azure Landing Zones (ALZ)

CSA builds on top of Azure Landing Zones. The ALZ reference architecture provides:

Management Group Hierarchy:

Tenant Root Group
├── Platform
│   ├── Management        (Log Analytics, Automation, Sentinel)
│   ├── Connectivity      (Hub VNet, Firewall, ExpressRoute, DNS)
│   └── Identity          (Domain Controllers, Microsoft Entra Connect)
├── Landing Zones
│   ├── Corp              (Internal workloads with private connectivity)
│   │   ├── DMLZ Sub      (Data Management Landing Zone)
│   │   ├── DLZ-Dev Sub   (Development Data Landing Zone)
│   │   └── DLZ-Prod Sub  (Production Data Landing Zone)
│   └── Online            (Internet-facing workloads)
├── Sandbox               (Experimentation)
└── Decommissioned

Platform Landing Zone Subscriptions:

Management Subscription — Log Analytics workspace, Azure Monitor, Automation, Sentinel
Connectivity Subscription — Hub VNet, Azure Firewall, ExpressRoute, VPN, DNS zones
Identity Subscription — AD domain controllers (if needed)

Application Landing Zone Subscriptions (for data):

DMLZ Subscription: Data governance, Purview, ACR, shared services
DLZ Subscription(s): One per data domain or environment

🔀 2. Data Mesh Architecture in Azure

2.1 Core Principles

Data mesh, as defined by Zhamak Dehghani, is an architectural pattern built on four principles:

Domain-Oriented Data Ownership — Data is owned by domain teams who understand it best
Data as a Product — Data products are first-class citizens with defined quality, SLAs, and discoverability
Self-Serve Data Infrastructure Platform — A platform that enables domain teams to build data products autonomously
Federated Computational Governance — Governance policies automated and embedded in the platform

2.2 Mapping Data Mesh to Azure / CSA

Data Mesh Concept	CSA Implementation	Azure Services
Data Domain	Data Landing Zone (one per domain)	Azure Subscription + VNet + resource groups
Data Product	Data Application resource group	ADLS Gen2, ADF, Databricks, SQL
Self-Serve Platform	DMLZ + automated provisioning	IaC templates, Azure DevOps, Policy
Federated Governance	DMLZ + Purview + Policy	Purview, Azure Policy, Entra ID
Data Catalog	Centralized in DMLZ	Microsoft Purview
Domain Team	Data Application Team	Entra security groups
Data Contract	Sharing repository in DMLZ	Purview policies, Entra entitlement mgmt

2.3 Data Domains

Three criteria for defining data domains:

Long-term ownership — Boundaries must support identified, persistent owners
Match reality — Domains should reflect actual business operations, not theoretical concepts
Atomic integrity — Don't combine unrelated areas into a single domain

Domain examples: Sales, Marketing, Finance, Supply Chain, Customer, Product

2.4 Data Products

A successful data product must be:

Usable — Has users outside the immediate data domain
Valuable — Maintains value over time
Feasible — Can be built from available data and technology

Data product components:

Data (the actual datasets)
Code assets (generation, delivery, pipelines)
Metadata (descriptions, schemas, lineage, quality metrics)
Policies (access control, retention, classification)

Delivery formats: API, table, dataset in data lake, report, stream
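The four components and the delivery formats listed above could be captured in a simple descriptor that onboarding automation validates. This is a sketch only: the `DataProduct` class, its field names, and the example values are hypothetical and not part of any Azure or CSA API.

```python
from dataclasses import dataclass, field

# Delivery formats named in the text above, normalized to short keys.
DELIVERY_FORMATS = {"api", "table", "dataset", "report", "stream"}


@dataclass
class DataProduct:
    name: str
    domain: str
    datasets: list = field(default_factory=list)      # the actual data
    code_assets: list = field(default_factory=list)   # generation, delivery, pipelines
    metadata: dict = field(default_factory=dict)      # descriptions, schemas, lineage
    policies: dict = field(default_factory=dict)      # access, retention, classification
    delivery_formats: set = field(default_factory=set)

    def problems(self) -> list:
        """List missing components and unrecognized delivery formats."""
        parts = ("datasets", "code_assets", "metadata", "policies")
        issues = [f"missing {part}" for part in parts if not getattr(self, part)]
        unknown = set(self.delivery_formats) - DELIVERY_FORMATS
        if unknown:
            issues.append(f"unknown delivery formats: {sorted(unknown)}")
        return issues


orders = DataProduct(
    name="orders",
    domain="sales",
    datasets=["orders_daily"],
    code_assets=["pipelines/orders_ingest"],
    metadata={"owner": "sales-data-team"},
    policies={"classification": "General"},
    delivery_formats={"table", "api"},
)
```

A check like `orders.problems()` returning an empty list could gate registration of the product in the catalog.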

2.5 Self-Serve Data Infrastructure

The DMLZ provides automation services (not products, but patterns to implement):

Service	Scope
DLZ Provisioning	Creates new data landing zones via IaC
Data Product Onboarding	Creates resource groups, configures resources
Data Agnostic Ingestion	Metadata-driven ingestion engine using ADF + SQL metastore
Metadata Service	Exposes and creates platform metadata
Access Provisioning	Creates access packages and approval workflows
Data Lifecycle	Manages retention, cold storage, deletion
Domain Onboarding	Captures domain metadata, creates domain infrastructure

2.6 Federated Governance

Implementation approach:

Automated policies via Azure Policy (enforced at management group level)
Code-first — Standards, policies, and platform deployment as code
Purview for cross-domain data cataloging, classification, and lineage
Unity Catalog (if using Databricks) for workspace-level governance
Entra ID entitlement management for access request/approval workflows

2.7 Unity Catalog vs Purview for Governance

Aspect	Microsoft Purview	Databricks Unity Catalog
Scope	Tenant-wide, all Azure data sources	Databricks workspaces only
Cataloging	All Azure + 100+ connectors	Databricks tables, volumes, models
Access Control	Policy-based, integrated with Entra ID	Fine-grained table/column ACLs
Lineage	Cross-service, ADF, Synapse	Spark jobs, notebooks, SQL
Data Quality	Built-in quality rules	Partner integrations
Classification	Automatic sensitivity classification	Manual tags + Purview integration
Cost	Billed separately on a consumption model	Included with Databricks Premium
Recommendation	Use as enterprise-wide catalog	Use in addition to Purview for Databricks-specific governance

Note

Best Practice: Use both together. Purview provides the enterprise-wide catalog, classification, and cross-platform lineage. Unity Catalog provides fine-grained access control, auditing, and lineage within the Databricks ecosystem. They are complementary, not competing.

🧵 3. Data Fabric Architecture

3.1 How Data Fabric Differs from Data Mesh

Aspect	Data Mesh	Data Fabric
Philosophy	Organizational (decentralized ownership)	Technical (automated integration)
Core Driver	Domain teams own their data products	Metadata/AI automates data integration
Governance	Federated (domain teams + central policies)	Centralized (knowledge graph-driven)
Integration	Self-serve, domain teams build pipelines	Automated, AI discovers and integrates
Best For	Large orgs with autonomous business units	Orgs needing to unify disparate data sources
Key Technology	Self-serve platform + catalog	Knowledge graph + metadata layer + ML
Data Ownership	Distributed to domains	Central with virtual access

3.2 Data Fabric Core Components

Knowledge Graph / Metadata Layer
- Active metadata that learns from usage patterns
- Automatic relationship discovery between data assets
- Semantic layer that provides business context
- Azure Services: Purview, Cosmos DB (graph API), custom knowledge graph
Automated Data Integration
- AI-driven data discovery and cataloging
- Automatic schema mapping and transformation
- Self-optimizing data pipelines
- Azure Services: ADF, Purview auto-classification, Azure ML
Unified Access Layer
- Virtual data access without physical movement
- Polyglot query (SQL, Spark, API)
- Azure Services: Synapse Serverless SQL, Databricks SQL Warehouses
Governance and Security
- Policy-driven, automated enforcement
- Data lineage tracking
- Compliance monitoring
- Azure Services: Purview, Azure Policy, Defender for Cloud

3.3 Azure Services for Data Fabric Patterns

Pattern	Azure Services
Metadata Layer	Purview (catalog, lineage, classification)
Knowledge Graph	Cosmos DB Gremlin API, Purview relationships
Virtual Access	Synapse Serverless SQL pools (query across data lakes)
Automated Integration	ADF metadata-driven pipelines, Purview auto-scan
Semantic Layer	Power BI datasets, Azure Analysis Services
Data Virtualization	Synapse Serverless SQL, Databricks lakehouse federation

3.4 Hybrid Approach for csa-inabox

For csa-inabox, we recommend a hybrid data mesh + data fabric approach:

Data Mesh organizational model — Domain teams own data products in DLZs
Data Fabric technical layer — Centralized metadata, automated discovery, virtual access via DMLZ
This gives the best of both: decentralized ownership with automated, AI-driven governance

🔄 4. Microsoft Fabric Alternative Components

This section maps each Microsoft Fabric capability to equivalent Azure services for our open-source platform.

4.1 Complete Component Mapping

Capability	Microsoft Fabric	csa-inabox Alternative	Open Source?	Notes
Data Lakehouse	Fabric Lakehouse (OneLake)	ADLS Gen2 + Delta Lake	Delta Lake: Yes	Delta Lake is open source, ADLS is PaaS
Spark Compute	Fabric Spark	Synapse Spark Pools / Databricks	Spark: Yes	Apache Spark is open source
Data Warehouse	Fabric Warehouse	Synapse Dedicated SQL / Serverless SQL	No	Azure-managed service
Data Integration	Fabric Data Pipelines	Azure Data Factory	No	ADF is Azure-managed
Real-Time Analytics	Fabric Real-Time Intelligence	Azure Data Explorer + Event Hubs	No	ADX is Azure-managed
Power BI	Fabric Power BI	Power BI (standalone)	No	Same service, different licensing
Data Governance	Fabric Governance	Purview + Unity Catalog	Unity Catalog: Partially	Purview is Azure-managed
AI/ML	Fabric Data Science	Azure ML + Databricks ML	MLflow: Yes	MLflow is open source
Data Engineering	Fabric Data Engineering	Spark on Synapse/Databricks	Spark: Yes	Apache Spark is open source
Data Activator	Fabric Data Activator	Event Grid + Logic Apps + Functions	No	Azure-managed
Mirroring	Fabric Mirroring	ADF CDC + Debezium	Debezium: Yes	Open source CDC
Shortcuts	OneLake Shortcuts	ADLS linked services + mount points	N/A	Native ADLS capability

4.2 Detailed Component Analysis

4.2.1 Data Lakehouse: Delta Lake on ADLS Gen2

Architecture:

Storage: ADLS Gen2 with Hierarchical Namespace enabled
Table Format: Delta Lake (open-source, Apache 2.0 license)
Features: ACID transactions, time travel, schema evolution, unified batch/streaming
Alternative formats: Apache Iceberg, Apache Hudi (for future flexibility)

Configuration Recommendations:

Enable soft delete with 7-day retention
Use lifecycle management policies for hot/cool/archive tiering
Enable versioning for critical datasets
Storage redundancy: LRS for dev, GRS for production
Use customer-managed keys (CMK) via Key Vault for encryption

4.2.2 Compute: Synapse Spark Pools vs Databricks

Synapse Spark Pools:

Integrated with Synapse workspace
Auto-pause and auto-scale
Good for organizations already using Synapse ecosystem
Lower cost for intermittent workloads
Limited compared to Databricks for advanced ML

Azure Databricks:

Most feature-rich Spark platform
Unity Catalog for governance
Photon engine for SQL performance
MLflow integration for ML lifecycle
Delta Live Tables for declarative ETL
Cluster policies for cost control
Premium tier required for Unity Catalog

Recommendation: Use Databricks as primary compute for its superior governance (Unity Catalog), ML capabilities, and ecosystem. Use Synapse Serverless SQL for ad-hoc querying of the data lake. This provides the best combination of capabilities.

4.2.3 Data Warehouse: Synapse SQL Options

Synapse Dedicated SQL Pool:

Traditional MPP data warehouse
Persistent storage and compute
Best for large-scale, predictable analytical workloads
Expensive (billed per hour while running, even when idle, unless the pool is paused)
Consider for enterprise reporting marts

Synapse Serverless SQL Pool:

Query data in-place in ADLS Gen2
Pay-per-query model
Excellent for exploratory analysis
Can create views over Delta Lake tables
No data movement required
Great complement to the lakehouse pattern

Recommendation: Use Serverless SQL as primary query engine for the lakehouse. Use Dedicated SQL only for specific high-performance reporting marts if needed. This avoids paying for provisioned warehouse compute that sits idle between queries.

4.2.4 Data Integration: Azure Data Factory

Core Capabilities:

100+ connectors (on-premises, cloud, SaaS)
Metadata-driven pipeline patterns
Self-hosted Integration Runtime for on-premises connectivity
Mapping data flows for code-free transformation
Tumbling window and event-driven triggers
Integration with Purview for lineage

Metadata-Driven Ingestion Pattern:

Store connection metadata in Azure SQL Database
ADF reads metadata to dynamically generate pipelines
Parameterized datasets and linked services
Single pipeline template handles multiple source types
This is the recommended "data agnostic ingestion engine" in CSA
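The pattern above can be sketched in a few lines: metadata rows (which would live in the Azure SQL metastore) are turned into parameter sets for one generic pipeline. The column names, the pipeline name `pl_generic_ingest`, and the parameter names are hypothetical; a real deployment would pass these values to ADF rather than build them in a standalone script.

```python
from datetime import date

# Stand-in for rows read from the metadata store (hypothetical columns).
METADATA_ROWS = [
    {"source_system": "sap", "source_type": "sqlserver", "source_object": "dbo.Orders",
     "entity": "orders", "version": "v1", "enabled": True},
    {"source_system": "crm", "source_type": "rest", "source_object": "/contacts",
     "entity": "contacts", "version": "v1", "enabled": False},
]


def build_pipeline_runs(rows: list, run_date: date) -> list:
    """Turn enabled metadata rows into parameter sets for one generic pipeline."""
    runs = []
    for row in rows:
        if not row.get("enabled", True):
            continue  # disabled sources are skipped, not deleted
        sink_path = (
            f"Landing/Transactional/{row['source_system']}/"
            f"{row['entity']}/{row['version']}/Delta/rundate={run_date.isoformat()}/"
        )
        runs.append({
            "pipeline": "pl_generic_ingest",  # hypothetical template pipeline name
            "parameters": {
                "sourceType": row["source_type"],
                "sourceObject": row["source_object"],
                "sinkContainer": "landing",
                "sinkPath": sink_path,
            },
        })
    return runs


RUNS = build_pipeline_runs(METADATA_ROWS, date(2026, 4, 9))
```

Adding a new source then becomes a metadata insert rather than a new pipeline definition, which is the main appeal of the approach.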

Alternative/Complement: Use dbt (already in the csa-inabox repo) for transformation logic. ADF handles orchestration and data movement; dbt handles SQL-based transformations in the lakehouse.

4.2.5 Real-Time Analytics

Azure Data Explorer (ADX / Kusto):

Purpose-built for log and telemetry analytics
KQL (Kusto Query Language) for ad-hoc analysis
Sub-second query performance on massive datasets
Streaming ingestion from Event Hubs, IoT Hub
Time series analysis built-in

Azure Event Hubs:

Managed Kafka-compatible event streaming
Millions of events per second
Event capture to ADLS Gen2 (Avro format)
Schema Registry for managing event schemas

Real-Time Architecture:

Sources → Event Hubs → Databricks Structured Streaming → Delta Lake
                    └→ Azure Data Explorer (for operational analytics)
                    └→ Event Hubs Capture → ADLS Gen2 (archival)

4.2.6 Power BI

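Standalone Power BI offers the same core reporting capabilities as Power BI in Fabric, with the exception of Direct Lake:

Direct Lake mode (Fabric-only: it reads Delta tables from OneLake and requires a Fabric capacity, so it is not available in this architecture)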
Import mode (for high-performance dashboards)
DirectQuery (for real-time data)
Premium capacity for large-scale deployment
Power BI Embedded for ISV scenarios

Recommendation: Use Power BI Premium Per Capacity for the organization, connected directly to Delta Lake tables via Synapse Serverless SQL or Databricks SQL Warehouses.

4.2.7 AI/ML

Azure Machine Learning:

End-to-end ML lifecycle
Responsible AI dashboard
Managed compute instances and clusters
Automated ML (AutoML)
Model registry and deployment

Databricks ML:

MLflow (open source) for experiment tracking
Feature Store
Model Serving
AutoML
Deep learning with GPU clusters

Azure OpenAI Service:

GPT-4, embeddings, DALL-E
Private endpoint support
Content filtering
Fine-tuning capabilities

Recommendation: Use Databricks ML + MLflow as primary ML platform (open-source ML tracking). Add Azure ML for specialized needs (AutoML, Responsible AI). Use Azure OpenAI for generative AI workloads.

☁️ 5. Required Azure Services for a Complete Platform

5.1 Complete Service Catalog

Networking Services

Service	Purpose	Subscription
Azure Virtual Network (VNet)	Network isolation per landing zone	All
VNet Peering	Hub-spoke connectivity	All
Private Endpoints	Secure PaaS access	All
Azure Private DNS Zones	Private endpoint resolution	Connectivity
Network Security Groups (NSG)	Subnet-level traffic filtering	All
Route Tables (UDR)	Force traffic through firewall	All
Azure Firewall	Central traffic inspection, FQDN filtering	Connectivity
Azure Bastion	Secure RDP/SSH to jumpboxes	Connectivity
ExpressRoute	On-premises connectivity	Connectivity
VPN Gateway	Site-to-Site for third-party clouds	Connectivity

Identity & Access Services

Service	Purpose	Subscription
Microsoft Entra ID	Identity provider	Tenant-wide
Managed Identities	Service-to-service auth (no passwords)	All
RBAC	Role-based access control	All
Service Principals	CI/CD and automation authentication	All
Entra ID Entitlement Mgmt	Self-service access request/approval	Tenant-wide
Privileged Identity Management	Just-in-time admin access	Tenant-wide
Conditional Access	Context-aware access policies	Tenant-wide

Storage Services

Service	Purpose	Subscription
ADLS Gen2 (HNS-enabled)	Data lake storage (3 per DLZ)	DLZ
Azure Blob Storage	External data staging	DLZ
Azure SQL Database	Metadata stores, ADF metastore	DLZ, DMLZ
Cosmos DB	Knowledge graph (optional), app data	DMLZ

Compute Services

Service	Purpose	Subscription
Azure Databricks	Primary Spark compute, ML, SQL analytics	DLZ
Synapse Analytics	Serverless SQL, optional Spark pools	DLZ
Azure Kubernetes Service (AKS)	Microservices, API hosting (optional)	DMLZ
Azure Functions	Event-driven automation	DLZ, DMLZ
Virtual Machines	Self-hosted IR, jumpboxes	DLZ

Data Integration Services

Service	Purpose	Subscription
Azure Data Factory	Orchestration, data movement	DLZ
Self-Hosted Integration Runtime	On-premises/private network access	DLZ
Event Hubs	Real-time event ingestion	DLZ
Event Grid	Event-driven triggers	DLZ
Service Bus	Reliable message queuing	DLZ, DMLZ
IoT Hub	IoT device telemetry (optional)	DLZ

Governance Services

Service	Purpose	Subscription
Microsoft Purview	Data catalog, classification, lineage	DMLZ (tenant-scoped)
Azure Policy	Governance enforcement	Management Group
Databricks Unity Catalog	Spark-level data governance	DLZ (Databricks)
Management Groups	Policy hierarchy, RBAC inheritance	Tenant-wide

Security Services

Service	Purpose	Subscription
Azure Key Vault	Secrets, keys, certificates	All
Microsoft Defender for Cloud	Threat protection, security posture	All
Microsoft Sentinel	SIEM, threat detection, SOAR	Management
Defender for Identity	Identity threat detection	Management
Defender for Storage	Storage threat protection	DLZ
Defender for SQL	Database threat protection	DLZ

Monitoring Services

Service	Purpose	Subscription
Azure Monitor	Platform metrics and alerts	All
Log Analytics Workspace	Centralized logging	Management
Application Insights	Application performance monitoring	DMLZ, DLZ
Azure Workbooks	Custom dashboards	Management
Diagnostic Settings	Service-level telemetry	All

AI/ML Services

Service	Purpose	Subscription
Azure Machine Learning	ML lifecycle management	DLZ
Azure OpenAI Service	Generative AI workloads	DLZ
Azure AI services (formerly Cognitive Services)	Pre-built AI (Vision, Language, etc.)	DLZ
Databricks ML / MLflow	Experiment tracking, model registry	DLZ

5.2 Service Dependencies Map

graph TB
    EntraID["Microsoft Entra ID<br/>(Tenant-wide identity)"]

    subgraph Mgmt["Management Subscription"]
        LA["Log Analytics"]
        Sentinel["Sentinel"]
        Auto["Automation"]
        Monitor["Monitor"]
    end

    subgraph Conn["Connectivity Subscription"]
        Hub["Hub VNet"]
        FW["Firewall"]
        ER["ExpressRoute"]
        DNS["DNS Zones"]
        Bastion["Bastion"]
    end

    subgraph DMLZSub["DMLZ Subscription"]
        Purview["Purview"]
        KV1["Key Vault"]
        ACR["ACR"]
        SQLDB["SQL Database"]
        APIM["API Management"]
    end

    subgraph DLZSub["DLZ Subscription(s)"]
        ADLS["ADLS Gen2 x3"]
        DBX["Databricks"]
        Synapse["Synapse"]
        ADF["ADF + IR"]
        KV2["Key Vault"]
        EH["Event Hubs"]
        AML["Azure ML"]
        PBI["Power BI"]
    end

    EntraID --> Mgmt
    EntraID --> Conn
    EntraID --> DMLZSub
    EntraID --> DLZSub

    Conn -->|"VNet Peering"| DMLZSub
    Conn -->|"VNet Peering"| DLZSub
    DMLZSub -->|"Peering + Scanning"| DLZSub

📦 6. Deployment Strategy for 4 Azure Subscriptions

6.1 Recommended Subscription Layout

For the csa-inabox initial deployment with 4 subscriptions:

Tenant Root Group
└── csa-inabox (Management Group)
    ├── Platform (Management Group)
    │   ├── Sub 1: csa-platform      (Management + Connectivity + Identity)
    │   │   ├── rg-management        (Log Analytics, Automation, Sentinel)
    │   │   ├── rg-connectivity      (Hub VNet, Firewall, DNS zones)
    │   │   └── rg-identity          (Optional AD DCs)
    │   │
    │   └── Sub 2: csa-governance    (Data Management Landing Zone)
    │       ├── rg-governance        (Purview, Policy artifacts)
    │       ├── rg-network           (DMLZ VNet, peering, private endpoints)
    │       ├── rg-shared-services   (ACR, API Management, Key Vault)
    │       └── rg-service-layer     (Automation microservices, data marketplace)
    │
    └── Landing Zones (Management Group)
        ├── Sub 3: csa-data-nonprod  (Development + Test DLZ)
        │   ├── rg-network           (DLZ VNet, NSGs, route tables)
        │   ├── rg-security          (Key Vault, Defender)
        │   ├── rg-storage           (ADLS Gen2 x3: raw, enriched, dev)
        │   ├── rg-compute           (Databricks, Synapse)
        │   ├── rg-integration       (ADF, shared IR, Event Hubs)
        │   ├── rg-data-app-{name}   (Per data product)
        │   └── rg-monitoring        (Diagnostic settings)
        │
        └── Sub 4: csa-data-prod     (Production DLZ)
            ├── rg-network
            ├── rg-security
            ├── rg-storage
            ├── rg-compute
            ├── rg-integration
            ├── rg-data-app-{name}
            └── rg-monitoring

6.2 Alternative: Scale-Out Layout

For larger deployments, expand to domain-based DLZs:

Landing Zones (Management Group)
├── Sub: csa-dlz-finance     (Finance domain DLZ)
├── Sub: csa-dlz-sales       (Sales domain DLZ)
├── Sub: csa-dlz-marketing   (Marketing domain DLZ)
└── Sub: csa-dlz-operations  (Operations domain DLZ)

6.3 Cross-Subscription Networking

VNet Address Space Allocation:

Hub VNet (csa-platform):     10.0.0.0/16
  - GatewaySubnet:           10.0.0.0/24
  - AzureFirewallSubnet:     10.0.1.0/24
  - AzureBastionSubnet:      10.0.2.0/24
  - ManagementSubnet:        10.0.10.0/24

DMLZ VNet (csa-governance):  10.1.0.0/16
  - PurviewSubnet:           10.1.1.0/24
  - SharedServicesSubnet:    10.1.2.0/24
  - PrivateEndpointSubnet:   10.1.10.0/24

DLZ-NonProd VNet:            10.2.0.0/16
  - DatabricksPublicSubnet:  10.2.1.0/24
  - DatabricksPrivateSubnet: 10.2.2.0/24
  - SynapseSubnet:           10.2.3.0/24
  - PrivateEndpointSubnet:   10.2.10.0/24
  - IntegrationSubnet:       10.2.11.0/24
  - DataApp1Subnet:          10.2.20.0/24

DLZ-Prod VNet:               10.3.0.0/16
  - (Same structure as NonProd)
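Because peered VNets cannot have overlapping address spaces, it is worth checking an allocation like the one above before deployment. The sketch below uses Python's standard `ipaddress` module; the dictionary layout and function name are illustrative, and the DLZ-Prod subnets are derived on the assumption that "same structure" means the same third octets under `10.3.0.0/16`.

```python
import ipaddress

NONPROD_SUBNETS = ["10.2.1.0/24", "10.2.2.0/24", "10.2.3.0/24",
                   "10.2.10.0/24", "10.2.11.0/24", "10.2.20.0/24"]

# VNet name -> (address space, subnets), taken from the allocation above.
VNETS = {
    "hub": ("10.0.0.0/16", ["10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24", "10.0.10.0/24"]),
    "dmlz": ("10.1.0.0/16", ["10.1.1.0/24", "10.1.2.0/24", "10.1.10.0/24"]),
    "dlz-nonprod": ("10.2.0.0/16", NONPROD_SUBNETS),
    # Assumption: prod mirrors nonprod with the second octet changed to 3.
    "dlz-prod": ("10.3.0.0/16", [s.replace("10.2.", "10.3.", 1) for s in NONPROD_SUBNETS]),
}


def validate_address_plan(vnets: dict) -> list:
    """Return a list of human-readable problems; an empty list means the plan is valid."""
    problems = []
    spaces = {name: ipaddress.ip_network(cidr) for name, (cidr, _) in vnets.items()}
    names = sorted(spaces)
    # 1. VNet address spaces must not overlap (required for peering).
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if spaces[a].overlaps(spaces[b]):
                problems.append(f"VNets {a} and {b} overlap")
    for name, (_, subnets) in vnets.items():
        nets = [ipaddress.ip_network(s) for s in subnets]
        # 2. Every subnet must sit inside its VNet's address space.
        for net in nets:
            if not net.subnet_of(spaces[name]):
                problems.append(f"{net} is outside {name} ({spaces[name]})")
        # 3. Subnets within one VNet must not overlap each other.
        for i, a in enumerate(nets):
            for b in nets[i + 1:]:
                if a.overlaps(b):
                    problems.append(f"subnets {a} and {b} overlap in {name}")
    return problems
```

Running `validate_address_plan(VNETS)` on the plan above returns an empty list, and the same check can run in CI whenever a new DLZ is added.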

Peering Configuration:

From	To	Direction	Purpose
Hub	DMLZ	Bidirectional	Governance connectivity
Hub	DLZ-NonProd	Bidirectional	Internet/on-premises access
Hub	DLZ-Prod	Bidirectional	Internet/on-premises access
DMLZ	DLZ-NonProd	Bidirectional	Purview scanning, governance
DMLZ	DLZ-Prod	Bidirectional	Purview scanning, governance
DLZ-NonProd	DLZ-Prod	Optional*	Cross-environment data sharing

*Cross-DLZ peering between non-prod and prod should be avoided in most cases for isolation.
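In Azure, a bidirectional peering is two peering links, one created on each VNet. The sketch below expands the table above into those individual links. The link naming scheme and dictionary keys are hypothetical; the two gateway flags mirror the `allowGatewayTransit` and `useRemoteGateways` peering settings and assume the hub hosts the VPN/ExpressRoute gateway.

```python
HUB = "hub"

# Required peerings from the table above; nonprod <-> prod is intentionally omitted.
PAIRS = [
    ("hub", "dmlz"),
    ("hub", "dlz-nonprod"),
    ("hub", "dlz-prod"),
    ("dmlz", "dlz-nonprod"),
    ("dmlz", "dlz-prod"),
]


def peering_links(pairs: list, hub: str = HUB) -> list:
    """Expand each bidirectional pair into the two one-way links Azure needs."""
    links = []
    for a, b in pairs:
        for local, remote in ((a, b), (b, a)):
            links.append({
                "name": f"peer-{local}-to-{remote}",  # hypothetical naming scheme
                "local_vnet": local,
                "remote_vnet": remote,
                # The hub side shares its gateway; the spoke side consumes it.
                "allow_gateway_transit": local == hub,
                "use_remote_gateways": remote == hub,
            })
    return links


LINKS = peering_links(PAIRS)
```

The five bidirectional rows therefore translate into ten link resources, which is the number an IaC template for this layout would need to declare.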

6.4 Policy Inheritance and Management Group Hierarchy

Management Group	Key Policies
Root (csa-inabox)	Require tags (cost center, environment, owner); Allowed locations; Audit diagnostic settings; Deny public IP addresses
Platform	Require resource locks on critical resources; Deny unauthorized resource types
Landing Zones	Deploy Private DNS zone groups (deployIfNotExists); Require HTTPS/TLS; Deploy Defender for Cloud; Deny public network access on storage accounts; Require encryption at rest with CMK; Deny creation of classic resources
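Policies assigned at a management group apply to everything beneath it, so a subscription under Landing Zones is subject to both the root and the Landing Zones assignments. The sketch below models that inheritance with plain dictionaries; it is an illustration of the table above, not a call into the Azure Policy API.

```python
# Child management group -> parent (None marks the top of this hierarchy).
PARENT = {
    "csa-inabox": None,
    "Platform": "csa-inabox",
    "Landing Zones": "csa-inabox",
}

# Policies assigned directly at each management group (from the table above).
ASSIGNED = {
    "csa-inabox": [
        "Require tags (cost center, environment, owner)",
        "Allowed locations",
        "Audit diagnostic settings",
        "Deny public IP addresses",
    ],
    "Platform": [
        "Require resource locks on critical resources",
        "Deny unauthorized resource types",
    ],
    "Landing Zones": [
        "Deploy Private DNS zone groups (deployIfNotExists)",
        "Require HTTPS/TLS",
        "Deploy Defender for Cloud",
        "Deny public network access on storage accounts",
        "Require encryption at rest with CMK",
        "Deny creation of classic resources",
    ],
}


def effective_policies(group: str) -> list:
    """Policies in force at a management group, inherited ones first."""
    chain = []
    current = group
    while current is not None:
        chain.append(current)
        current = PARENT[current]
    policies = []
    for name in reversed(chain):  # root first, then descendants
        policies.extend(ASSIGNED.get(name, []))
    return policies
```

Under this model the Platform group ends up with six policies in force and Landing Zones with ten.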

6.5 RBAC Strategy

Role	Scope	Principals
Owner	Management Group: csa-inabox	Platform team (PIM-protected)
Contributor	Sub: csa-platform	Platform operations team
Network Contributor	Hub VNet resource group	Network team
Contributor	Sub: csa-governance	Data governance team
Purview Data Curator	Purview account	Data stewards
Contributor	DLZ resource groups	Data domain teams
User Access Administrator	DLZ subscription	Data platform team
Private DNS Zone Contributor	DNS resource group	Service principals for DLZ deployments
Network Contributor	DLZ subnets (child scope)	Data application team service principals
Reader	Sub: csa-data-prod	All data consumers
Storage Blob Data Reader	ADLS Gen2 curated	Data analysts, BI users
Storage Blob Data Contributor	ADLS Gen2 raw/enriched	Data engineers

📋 7. Best Practices and Standards

7.1 Azure Well-Architected Framework for Data Platforms

Reliability:

Use GRS (Geo-Redundant Storage) for production data lakes
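Configure ADLS Gen2 soft delete for blobs and containers (blob versioning is not supported on hierarchical-namespace accounts at the time of writing; verify current feature support)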
Design for regional failover of critical services
Implement retry policies in all data pipelines
Use Availability Zones for Databricks, Synapse, Key Vault
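As an illustration of the retry-policy item, the following is a minimal sketch of retry with exponential backoff and jitter. The function name and defaults are hypothetical; a real pipeline would usually catch only errors known to be transient rather than every exception.

```python
import random
import time

def retry(operation, attempts=4, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call operation(); on failure wait base_delay * 2**n (plus jitter) and retry."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:  # sketch only: narrow this to transient errors
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 10))

# Usage: a step that fails twice before succeeding.
calls = {"count": 0}

def flaky_load():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "loaded"

print(retry(flaky_load, sleep=lambda seconds: None))  # loaded
```

The `sleep` parameter is injected so the behavior can be tested without waiting; in production the default `time.sleep` applies.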

Security:

Zero-trust: Never trust, always verify
Private endpoints for all PaaS services (no public access)
Customer-managed keys for encryption at rest
TLS 1.2+ for all data in transit
Managed identities over service principals where possible
Azure Policy to enforce security baseline
Microsoft Defender for all data services

Cost Optimization:

Synapse Serverless SQL (billed per TB of data processed) over dedicated SQL pools for intermittent or ad hoc workloads
Databricks auto-scaling and auto-terminate policies
ADLS lifecycle management (hot → cool → archive)
Reserved capacity or pre-purchase plans for predictable Databricks/Synapse workloads
Azure Advisor cost recommendations
Tag-based cost allocation and chargeback

Operational Excellence:

Infrastructure as Code (Bicep/Terraform) for all deployments
CI/CD pipelines for data platform changes
GitOps for configuration management
Centralized monitoring via Log Analytics
Automated alerting on pipeline failures
Runbooks for common operational tasks

Performance Efficiency:

Delta Lake Z-ordering for query optimization
Databricks Photon engine for SQL workloads
Partition pruning in data lake queries
Caching strategies in Power BI and Synapse
Right-sizing compute clusters

7.2 Zero-Trust Network Architecture

Principles applied to data platforms:

Verify Explicitly — Every access request is fully authenticated and authorized
- Managed identities for service-to-service
- Entra ID for user access
- Conditional access policies
Use Least-Privilege Access — Just-in-time, just-enough access
- PIM for admin roles
- ACLs on data lake folders (not container level)
- Granular RBAC roles
Assume Breach — Minimize blast radius
- Network segmentation (subnets per workload)
- Private endpoints (no public access surface)
- NSGs with deny-all default
- Azure Firewall for egress filtering
- Micro-segmentation within DLZs

Network Zero-Trust Checklist:

All PaaS services behind private endpoints
Public network access disabled on all storage accounts
Azure Firewall filtering all outbound traffic
NSGs on every subnet with explicit allow rules
No public IP addresses on any resource
DNS resolution through private DNS zones only
Jumpbox + Bastion for administrative access
TLS 1.2 minimum on all services
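Several checklist items can be expressed as automated checks over a resource inventory. The sketch below is illustrative only: the dictionary keys (`private_endpoint`, `public_network_access`, `public_ip`, `min_tls`) are hypothetical field names, not an Azure API schema, and missing fields are treated as non-compliant.

```python
def parse_tls(version):
    """Turn '1.2' into (1, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))

def zero_trust_findings(resource):
    """Return the checklist items a resource description violates."""
    findings = []
    if not resource.get("private_endpoint", False):
        findings.append("not behind a private endpoint")
    if resource.get("public_network_access", True):
        findings.append("public network access enabled")
    if resource.get("public_ip"):
        findings.append("has a public IP address")
    if parse_tls(resource.get("min_tls", "1.0")) < (1, 2):
        findings.append("minimum TLS below 1.2")
    return findings

storage = {
    "name": "raw-lake",
    "private_endpoint": True,
    "public_network_access": False,
    "public_ip": None,
    "min_tls": "1.2",
}
legacy = {"name": "legacy-app", "public_ip": "203.0.113.10", "min_tls": "1.0"}

print(zero_trust_findings(storage))      # []
print(len(zero_trust_findings(legacy)))  # 4
```

In practice Azure Policy would enforce most of these; a script like this is mainly useful for reporting across an exported inventory.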

7.3 Data Classification and Sensitivity Labeling

Classification Levels:

Level	Label	Description	Handling
1	Public	Approved for public release	No restrictions
2	Internal	General business data	Require authentication
3	Confidential	Sensitive business data	Encrypt, limited access
4	Highly Confidential	PII, financial, regulated	Encrypt, audit, MFA, DLP
5	Restricted	Top secret, regulated	Full audit, approval workflow

Implementation:

Microsoft Purview auto-classification scans data lakes
Sensitivity labels applied to ADLS containers/folders
Data in Sensitive folders gets Confidential or higher classification
Microsoft Purview Information Protection (formerly Azure Information Protection) for document-level classification
DLP policies via Purview to prevent data exfiltration
Classification drives access control and retention policies
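Because the levels are ordered, they map naturally onto an ordered enumeration. The sketch below encodes the table and applies one assumed rule that is not stated in the source: a dataset takes the level of its most sensitive column.

```python
from enum import IntEnum

class Classification(IntEnum):
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    HIGHLY_CONFIDENTIAL = 4
    RESTRICTED = 5

# Handling requirements copied from the classification table.
HANDLING = {
    Classification.PUBLIC: "No restrictions",
    Classification.INTERNAL: "Require authentication",
    Classification.CONFIDENTIAL: "Encrypt, limited access",
    Classification.HIGHLY_CONFIDENTIAL: "Encrypt, audit, MFA, DLP",
    Classification.RESTRICTED: "Full audit, approval workflow",
}

def dataset_classification(column_levels):
    """Assumed rule: a dataset inherits the level of its most sensitive column."""
    return max(column_levels)

level = dataset_classification(
    [Classification.INTERNAL, Classification.HIGHLY_CONFIDENTIAL]
)
print(level.name, "->", HANDLING[level])
```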

7.4 Cost Management and FinOps

Cost Levers for Data Platforms:

Service	Cost Strategy
ADLS Gen2	Lifecycle policies: hot (30 days) → cool (90 days) → archive
Databricks	Auto-terminate (15 min idle), spot instances for batch, cluster pools
Synapse	Serverless SQL (billed per TB processed); pause dedicated SQL pools when idle (for example, via scheduled automation)
ADF	Pipeline optimization, avoid excessive data movement
Event Hubs	Right-size throughput units, auto-inflate
Key Vault	Standard tier (not Premium) unless HSM required
Purview	Scan scheduling (not continuous)
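The ADLS Gen2 lifecycle lever is configured as a storage lifecycle management policy. The sketch below builds such a policy document in Python. It reads the table as "move to cool 30 days after last modification, archive after 90 days", which is one interpretation of the thresholds; the rule name and the `raw/landing` prefix are hypothetical.

```python
import json

def build_lifecycle_policy(prefixes, cool_after_days=30, archive_after_days=90):
    """Build a lifecycle management policy that tiers block blobs down by age."""
    if archive_after_days <= cool_after_days:
        raise ValueError("archive_after_days must be greater than cool_after_days")
    rule = {
        "enabled": True,
        "name": "tier-down-aged-blobs",  # hypothetical rule name
        "type": "Lifecycle",
        "definition": {
            "filters": {"blobTypes": ["blockBlob"], "prefixMatch": list(prefixes)},
            "actions": {
                "baseBlob": {
                    "tierToCool": {
                        "daysAfterModificationGreaterThan": cool_after_days
                    },
                    "tierToArchive": {
                        "daysAfterModificationGreaterThan": archive_after_days
                    },
                }
            },
        },
    }
    return {"rules": [rule]}

policy = build_lifecycle_policy(["raw/landing"])
print(json.dumps(policy, indent=2))
```

The resulting JSON could be supplied to whichever deployment tool manages the storage account; thresholds should be tuned per zone, since curated data is typically read far more often than raw data.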

FinOps Practices:

Tag all resources with: costCenter, environment, owner, project
Use Azure Cost Management + Billing for dashboards
Set budgets and alerts per subscription and resource group
Weekly cost review by platform team
Databricks pre-purchase plan (Databricks commit units) and reserved VM instances for steady workloads (1-year or 3-year terms)
Dev/test pricing for non-production subscriptions
Auto-shutdown schedules for non-production compute
Right-sizing reviews quarterly
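The tagging practice is easy to audit in code. The following minimal sketch reports which of the four required tags are absent or empty; tag names are compared case-insensitively, matching how Azure treats tag names, and the function name is hypothetical.

```python
REQUIRED_TAGS = ("costCenter", "environment", "owner", "project")

def missing_tags(tags):
    """Return required tags that are absent or empty (names are case-insensitive)."""
    present = {name.lower() for name, value in tags.items() if value}
    return [tag for tag in REQUIRED_TAGS if tag.lower() not in present]

print(missing_tags({"CostCenter": "1234", "environment": "prod", "owner": ""}))
# ['owner', 'project']
```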

7.5 Disaster Recovery and Business Continuity

RPO/RTO Targets by Data Layer:

Layer	RPO	RTO	Strategy
Raw (Bronze)	24 hours	4 hours	Source re-ingest + GRS
Enriched (Silver)	4 hours	2 hours	GRS + reprocessing pipelines
Curated (Gold)	1 hour	1 hour	GRS + hot standby
Metadata (Purview)	0 (continuous)	1 hour	Managed service redundancy
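Targets like these are only useful if recovery drills are measured against them. The sketch below encodes the table and compares a drill's observed data loss and downtime with the layer's RPO and RTO; the layer keys and function name are hypothetical.

```python
from datetime import timedelta

# (RPO, RTO) per layer, taken from the table above.
TARGETS = {
    "raw": (timedelta(hours=24), timedelta(hours=4)),
    "enriched": (timedelta(hours=4), timedelta(hours=2)),
    "curated": (timedelta(hours=1), timedelta(hours=1)),
    "metadata": (timedelta(0), timedelta(hours=1)),
}

def meets_targets(layer, data_loss, downtime):
    """True if a drill's data-loss window and downtime are within RPO and RTO."""
    rpo, rto = TARGETS[layer]
    return data_loss <= rpo and downtime <= rto

print(meets_targets("curated", timedelta(minutes=20), timedelta(minutes=45)))  # True
print(meets_targets("enriched", timedelta(hours=5), timedelta(hours=1)))       # False
```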

BCDR Architecture:

ADLS Gen2: GRS or GZRS for cross-region replication
Azure SQL: Active geo-replication or auto-failover groups
Key Vault: Soft delete + purge protection enabled
Databricks: Multi-region workspace deployment (active-passive)
ADF: Pipeline definitions in git (redeploy from source)
Purview: Tenant-scoped, inherently resilient

Backup Strategy:

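ADLS: Soft delete for blobs and containers (blob versioning and point-in-time restore are not supported on hierarchical-namespace accounts at the time of writing; verify current feature support)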
Azure SQL: Automated backups (7-35 day retention)
Databricks: Unity Catalog metadata backed by managed storage
Key Vault: Soft-delete (90 days) + purge protection
IaC: All infrastructure defined in code (Bicep/Terraform in git)

📚 8. Reference Templates and IaC

8.1 Microsoft Official Templates

Microsoft provides official Bicep/ARM templates in GitHub:

Repository	Purpose	Deployment
Azure/data-management-zone	DMLZ template (Purview, governance, networking)	One per platform
Azure/data-landing-zone	DLZ template (storage, compute, integration)	One per DLZ
Azure/data-product-batch	Batch data processing workload	One+ per DLZ
Azure/data-product-streaming	Streaming data processing workload	One+ per DLZ
Azure/data-product-analytics	Analytics and data science workload	One+ per DLZ

Template Stack: Bicep (67.5%), PowerShell, Shell

Deployment Methods: Azure Portal (Deploy to Azure), GitHub Actions, Azure DevOps

8.2 Template Architecture

data-management-zone/
├── infra/
│   ├── main.json          (ARM template entry point)
│   └── modules/           (Bicep modules)
│       ├── Purview/
│       ├── Network/
│       │   ├── privateDnsZones/
│       │   └── virtualNetworkLinks/
│       ├── KeyVault/
│       ├── ContainerRegistry/
│       └── ...
├── code/                  (Application code)
├── docs/                  (Documentation)
├── .github/               (GitHub Actions workflows)
└── .ado/                  (Azure DevOps pipelines)

data-landing-zone/
├── infra/
│   ├── main.json
│   └── modules/
│       ├── Storage/       (ADLS Gen2 accounts)
│       ├── Databricks/
│       ├── Synapse/
│       ├── DataFactory/
│       ├── Network/
│       └── ...
├── code/
├── docs/
├── .github/
└── .ado/

8.3 csa-inabox Deployment Approach

For our project, we should:

Fork/adapt the Microsoft templates as our baseline
Add Terraform support alongside Bicep (for multi-cloud flexibility)
Create a unified deployment orchestrator that:
- Deploys the platform subscription (management + connectivity)
- Deploys the DMLZ subscription
- Deploys DLZ subscriptions
- Configures cross-subscription networking
- Sets up governance policies
Add open-source tooling layer:
- dbt for SQL transformations
- Great Expectations for data quality
- Apache Airflow or Dagster for orchestration (optional)
- MLflow for ML lifecycle
- OpenMetadata or DataHub as Purview alternative (optional)

8.4 Existing csa-inabox Assets

Based on the current repo structure:

deploy/arm/ — ARM templates (e.g., Purview)
deploy/bicep/ — Bicep modules (DMLZ, network, private DNS zones)
scripts/sql/ — SQL scripts, Hive metastore notebook
tools/dbt/ — dbt environment for transformations
codeqlDB/ — Code quality database

🔗 9. Sources and References

Microsoft Official Documentation

Cloud-Scale Analytics Overview — Main CSA scenario (deprecated, redirects to Unify your data platform)
Data Landing Zone Architecture — DLZ component architecture
Data Management Landing Zone — DMLZ governance architecture
What is Data Mesh? — Data mesh principles on Azure
Network Topology and Connectivity — Networking architecture
Identity and Access Management — IAM for data platforms
Security, Governance, and Compliance — Security architecture
Data Lake Zones and Containers — Medallion architecture
What is an Azure Landing Zone? — ALZ reference architecture
Subscription Considerations — Subscription organization
Unify Your Data Platform (new guidance) — Replacement for CSA

GitHub Repositories

Azure/data-management-zone — DMLZ Bicep templates
Azure/data-landing-zone — DLZ Bicep templates
Azure/data-product-batch — Batch processing template
Azure/data-product-streaming — Streaming template
Azure/data-product-analytics — Analytics/DS template

Architecture Frameworks

Azure Well-Architected Framework — Five pillars
Cloud Adoption Framework — Cloud adoption methodology
Azure Landing Zone Accelerator — IaC deployment

Open Source Projects

Delta Lake — Open-source lakehouse storage layer
Apache Spark — Distributed compute engine
dbt — SQL transformation framework
MLflow — ML lifecycle management
Great Expectations — Data quality validation
Debezium — Change data capture

📊 Appendix A: Service SKU Recommendations

Service	Development	Production
ADLS Gen2	Standard LRS, Hot	Standard GRS, Hot/Cool tiering
Databricks	Standard, Standard_DS3_v2	Premium, Standard_DS4_v2+, Unity Catalog
Synapse	Serverless SQL only	Serverless SQL + DW100c (if dedicated needed)
ADF	Pay-as-you-go	Pay-as-you-go (consider reserved DIU)
Key Vault	Standard	Standard (Premium for HSM)
SQL Database	Basic/S0 (metastore)	S1/S2 (metastore)
Event Hubs	Basic (1 TU)	Standard (auto-inflate to 20 TU)
Purview	N/A (tenant-scoped)	N/A (tenant-scoped)
Azure Firewall	Basic	Standard or Premium
Log Analytics	Pay-as-you-go	Commitment tier (100 GB/day+)

📛 Appendix B: Naming Convention

{resourceType}-{project}-{environment}-{region}-{instance}

Storage account names allow only 3-24 lowercase letters and digits, so they
omit the hyphens: {resourceType}{project}{environment}{region}{instance}

Examples:
  rg-csa-prod-eastus2-001          (Resource Group)
  stcsaprodeastus2raw              (Storage Account - raw lake)
  stcsaprodeastus2enriched         (Storage Account - enriched lake)
  stcsaprodeastus2dev              (Storage Account - development lake)
  adf-csa-prod-eastus2-001         (Data Factory)
  dbw-csa-prod-eastus2-001         (Databricks Workspace)
  syn-csa-prod-eastus2-001         (Synapse Workspace)
  kv-csa-prod-eastus2-001          (Key Vault)
  vnet-csa-prod-eastus2-001        (Virtual Network)
  pep-csa-prod-eastus2-st-raw      (Private Endpoint for raw storage)
  nsg-csa-prod-eastus2-dbw-pub     (NSG for Databricks public subnet)
  pdz-blob-core-windows-net        (Private DNS Zone)
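A naming convention is easiest to keep consistent when names are generated rather than typed. The sketch below builds names from the pattern above. Storage account names accept only lowercase letters and digits (3 to 24 characters), so the sketch builds them without hyphens and validates the result. Both function names are hypothetical.

```python
import re

def resource_name(resource_type, project, environment, region, instance):
    """Join the five parts with hyphens, following the general pattern."""
    return "-".join([resource_type, project, environment, region, instance])

def storage_account_name(project, environment, region, instance):
    """Storage accounts allow only 3-24 lowercase letters and digits: no hyphens."""
    name = f"st{project}{environment}{region}{instance}".lower()
    if not re.fullmatch(r"[a-z0-9]{3,24}", name):
        raise ValueError(f"invalid storage account name: {name!r}")
    return name

print(resource_name("rg", "csa", "prod", "eastus2", "001"))   # rg-csa-prod-eastus2-001
print(storage_account_name("csa", "prod", "eastus2", "raw"))  # stcsaprodeastus2raw
```

The 24-character limit is tight: with this project, environment, and region, only eight characters remain for the instance part.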

🚀 Appendix C: Deployment Order

graph TD
    subgraph P1["Phase 1: Foundation"]
        MG["1. Management Groups<br/>+ Policy Definitions"]
        PLAT["2. Platform Subscription"]
        MG --> PLAT
        PLAT --> LA["2a. Log Analytics"]
        PLAT --> HUB["2b. Hub VNet + Firewall"]
        PLAT --> PDNS["2c. Private DNS Zones"]
        PLAT --> BAST["2d. Azure Bastion"]
    end

    subgraph P2["Phase 2: Governance"]
        DMLZ["3. DMLZ Subscription"]
        DMLZ --> DMLZV["3a. DMLZ VNet + Peering"]
        DMLZ --> KV["3b. Key Vault"]
        DMLZ --> PV["3c. Purview"]
        DMLZ --> ACR["3d. ACR"]
        DMLZ --> SQL["3e. Shared SQL DB"]
    end

    subgraph P3["Phase 3: Data Landing Zones"]
        DLZ["4. DLZ Subscription<br/>(NonProd → Prod)"]
        DLZ --> DLZV["4a. DLZ VNet + Peering"]
        DLZ --> NSG["4b. NSGs + Route Tables"]
        DLZ --> KV2["4c. Key Vault"]
        DLZ --> ADLS["4d. ADLS Gen2 x3 + PE"]
        DLZ --> DBX["4e. Databricks + PE"]
        DLZ --> SYN["4f. Synapse + PE"]
        DLZ --> ADF["4g. ADF + PE + SHIR"]
        DLZ --> EH["4h. Event Hubs + PE"]
    end

    subgraph P4["Phase 4: Governance Config"]
        PVC["5. Purview Config"]
        PVC --> REG["5a. Register sources"]
        PVC --> SCAN["5b. Scanning schedules"]
        PVC --> CLASS["5c. Classification rules"]

        UC["6. Unity Catalog Config"]
        UC --> META["6a. Create metastore"]
        UC --> EXT["6b. External locations"]
        UC --> CAT["6c. Catalogs + schemas"]
    end

    subgraph P5["Phase 5: Data Products"]
        DP["7. Data Product Templates"]
        DP --> SRC["7a. Source-aligned apps"]
        DP --> CONS["7b. Consumer-aligned apps"]
        DP --> ING["7c. ADF ingestion pipelines"]
        DP --> DBT["7d. dbt transformations"]
    end

    P1 --> P2 --> P3 --> P4 --> P5
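The phase ordering in the diagram can be treated as a dependency graph, which lets an orchestrator derive a valid deployment order instead of hard-coding one. The following is a minimal sketch using the standard-library `graphlib` module (Python 3.9+). Only the top-level numbered steps are modeled, and the dependency edges are read off the phase sequence as an interpretation of the diagram.

```python
from graphlib import TopologicalSorter

# Each step maps to the steps that must finish first (numbering follows
# the diagram; sub-steps such as 2a-2d are omitted for brevity).
DEPENDS_ON = {
    "1. Management Groups": set(),
    "2. Platform Subscription": {"1. Management Groups"},
    "3. DMLZ Subscription": {"2. Platform Subscription"},
    "4. DLZ Subscription": {"3. DMLZ Subscription"},
    "5. Purview Config": {"4. DLZ Subscription"},
    "6. Unity Catalog Config": {"4. DLZ Subscription"},
    "7. Data Product Templates": {"5. Purview Config", "6. Unity Catalog Config"},
}

# static_order() yields each step only after all of its prerequisites.
order = list(TopologicalSorter(DEPENDS_ON).static_order())
for step in order:
    print(step)
```

Steps 5 and 6 have no dependency on each other, so they may appear in either order and could run in parallel.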

This report was compiled from Microsoft's official Cloud Adoption Framework documentation, GitHub reference templates, and Azure architecture guidance. The information is current as of April 2026.

ARCHITECTURE.md — Platform architecture overview
PLATFORM_SERVICES.md — Platform services reference and SKU details
MULTI_REGION.md — Multi-region deployment for high availability