🏞️ Delta Lakehouse Architecture Description¶
Comparative positioning note
This document is written from the perspective of Microsoft Azure, Cloud Scale Analytics, and CSA Loom. Any description of third-party or competing products, services, pricing, or capabilities is derived from publicly available documentation and sources believed accurate at the time of writing, and is provided for general comparison only. We do not claim expertise in, or authority over, any non-Microsoft product or service; the respective vendor's official documentation is the authoritative source for their offerings, which may change over time. Nothing here is intended to disparage any vendor — where a competing product has genuine advantages, we aim to note them honestly. Verify all third-party details against the vendor's current official documentation before making decisions.
Detailed description and visual representation of the Delta Lakehouse architecture pattern in Azure Synapse Analytics.
🎯 Overview¶
The Delta Lakehouse architecture combines the best features of data lakes and data warehouses, providing a unified platform for both batch and streaming analytics with ACID transaction support, schema enforcement, and time travel capabilities.
📊 Visual Architecture¶
High-Level Architecture Diagram¶
graph TB
subgraph "Data Sources"
IoT[IoT Devices & Sensors]
Apps[Enterprise Applications]
DB[Databases & SaaS]
Files[Files & Logs]
end
subgraph "Ingestion Layer"
ADF[Azure Data Factory<br/>Batch Ingestion]
EH[Event Hubs<br/>Streaming Ingestion]
IoTHub[IoT Hub<br/>Device Telemetry]
end
subgraph "Storage Foundation - Data Lake Gen2"
subgraph "Bronze Layer - Raw Data"
Bronze[Raw Data<br/>As-Is from Source<br/>Immutable Historical Record]
end
subgraph "Silver Layer - Refined Data"
Silver[Cleaned & Validated<br/>Standardized Schema<br/>Deduplicated Records]
end
subgraph "Gold Layer - Business Ready"
Gold[Aggregated & Enriched<br/>Business Logic Applied<br/>Optimized for Analytics]
end
end
subgraph "Delta Lake Storage Format"
DeltaFiles[Parquet Data Files<br/>+<br/>Transaction Logs<br/>+<br/>Statistics & Indexes]
end
subgraph "Compute Engines - Synapse Analytics"
SparkPool[Spark Pools<br/>Data Engineering<br/>ML Workloads]
ServerlessSQL[Serverless SQL Pool<br/>Ad-hoc Queries<br/>Data Exploration]
DedicatedSQL[Dedicated SQL Pool<br/>Enterprise DW<br/>BI Workloads]
end
subgraph "Metadata & Governance"
Metastore[Spark Metastore<br/>Table Definitions<br/>Schema Registry]
Purview[Azure Purview<br/>Data Catalog<br/>Lineage Tracking]
end
subgraph "Consumption Layer"
PowerBI[Power BI<br/>Dashboards & Reports]
AzureML[Azure ML<br/>Model Training & Inference]
APIApps[APIs & Applications<br/>Custom Solutions]
end
IoT --> IoTHub
Apps --> ADF
DB --> ADF
Files --> ADF
IoT --> EH
Apps --> EH
IoTHub --> Bronze
EH --> Bronze
ADF --> Bronze
Bronze --> SparkPool
SparkPool --> Silver
Silver --> SparkPool
SparkPool --> Gold
Bronze -.->|Delta Format| DeltaFiles
Silver -.->|Delta Format| DeltaFiles
Gold -.->|Delta Format| DeltaFiles
DeltaFiles --> ServerlessSQL
DeltaFiles --> DedicatedSQL
DeltaFiles --> SparkPool
SparkPool -.->|Registers Tables| Metastore
Metastore -.->|Shared Metadata| ServerlessSQL
Bronze -.->|Tracks| Purview
Silver -.->|Tracks| Purview
Gold -.->|Tracks| Purview
Gold --> PowerBI
Gold --> AzureML
Gold --> APIApps
ServerlessSQL --> PowerBI
DedicatedSQL --> PowerBI
style Bronze fill:#cd7f32,color:#000
style Silver fill:#c0c0c0,color:#000
style Gold fill:#ffd700,color:#000
style DeltaFiles fill:#e1f5fe
style SparkPool fill:#fff3e0
style ServerlessSQL fill:#e8f5e9
style DedicatedSQL fill:#f3e5f5 🏗️ Architecture Components¶
1. Data Sources Layer¶
Multiple heterogeneous data sources feeding into the lakehouse:
| Source Type | Examples | Ingestion Method | Frequency |
|---|---|---|---|
| IoT Devices | Sensors, smart devices, industrial equipment | IoT Hub, Event Hubs | Real-time streaming |
| Enterprise Applications | ERP, CRM, custom apps | Data Factory, CDC | Batch, scheduled |
| Databases | SQL Server, Oracle, PostgreSQL | Data Factory, linked services | Incremental, CDC |
| Files & Logs | CSV, JSON, Parquet, logs | Data Factory, auto-loader | Batch, event-driven |
2. Ingestion Layer¶
Azure Data Factory¶
- Purpose: Orchestrate batch data ingestion and transformation
- Capabilities:
- 90+ native connectors for data sources
- Code-free ETL pipeline design
- Incremental data loading with watermarking
- Scheduled and trigger-based execution
Event Hubs¶
- Purpose: High-throughput streaming data ingestion
- Capabilities:
- Millions of events per second
- Kafka protocol compatibility
- Event capture to Data Lake
- Partition-based scalability
IoT Hub¶
- Purpose: Specialized IoT device connectivity
- Capabilities:
- Per-device security and authentication
- Bidirectional communication
- Device management and provisioning
- Edge intelligence support
3. Storage Foundation - Medallion Architecture¶
Bronze Layer (Raw Zone)¶
graph LR
subgraph "Bronze Layer Characteristics"
Raw[Raw Data Files]
Immutable[Immutable<br/>Historical Record]
Source[Source Schema<br/>Preserved]
Audit[Full Audit Trail]
end
Raw --> Immutable
Immutable --> Source
Source --> Audit
style Raw fill:#cd7f32
style Immutable fill:#ffe0b2
style Source fill:#ffccbc
style Audit fill:#d7ccc8 Characteristics:
- Data Quality: As-is from source, no transformations
- Schema: Source system schema preserved
- Format: Delta Lake (Parquet + transaction log)
- Purpose: Historical record, replay capability, audit trail
- Retention: Long-term (years), move to archive tier
Example Table Structure:
-- Bronze layer table
CREATE TABLE bronze.iot_telemetry (
event_data STRING, -- Raw JSON payload
event_timestamp TIMESTAMP, -- Source timestamp
device_id STRING,
_ingestion_time TIMESTAMP, -- System ingestion time
_source_file STRING, -- Source file reference
_partition_date DATE -- Partition key
)
USING DELTA
PARTITIONED BY (_partition_date)
LOCATION 'abfss://bronze@datalake.dfs.core.windows.net/iot_telemetry';
Silver Layer (Refined Zone)¶
graph LR
subgraph "Silver Layer Characteristics"
Clean[Cleaned Data]
Validated[Validated & Typed]
Dedup[Deduplicated]
Standard[Standardized Schema]
end
Clean --> Validated
Validated --> Dedup
Dedup --> Standard
style Clean fill:#c0c0c0
style Validated fill:#e0e0e0
style Dedup fill:#f5f5f5
style Standard fill:#eeeeee Characteristics:
- Data Quality: Cleaned, validated, deduplicated
- Schema: Standardized across sources
- Format: Delta Lake with schema enforcement
- Purpose: Source of truth for analytics
- Retention: Medium-term (months to years)
Transformations Applied:
- Data Cleansing
- Null handling and default values
- Invalid data filtering
-
Data type conversions
-
Standardization
- Consistent naming conventions
- Uniform date/time formats
-
Standardized units and measures
-
Deduplication
- Remove exact duplicates
- Handle late-arriving data
- Merge updates and corrections
Example Table Structure:
-- Silver layer table
CREATE TABLE silver.iot_telemetry (
device_id STRING NOT NULL,
event_timestamp TIMESTAMP NOT NULL,
temperature DOUBLE,
humidity DOUBLE,
pressure DOUBLE,
battery_level INT,
device_status STRING,
location_lat DOUBLE,
location_long DOUBLE,
_source STRING,
_processed_time TIMESTAMP,
_quality_score DOUBLE
)
USING DELTA
PARTITIONED BY (DATE(event_timestamp))
LOCATION 'abfss://silver@datalake.dfs.core.windows.net/iot_telemetry';
Gold Layer (Consumption Zone)¶
graph LR
subgraph "Gold Layer Characteristics"
Aggregated[Aggregated Metrics]
Business[Business Logic]
Optimized[Query Optimized]
Enriched[Enriched & Joined]
end
Aggregated --> Business
Business --> Optimized
Optimized --> Enriched
style Aggregated fill:#ffd700
style Business fill:#ffeb3b
style Optimized fill:#fff59d
style Enriched fill:#fff9c4 Characteristics:
- Data Quality: Business-ready, aggregated
- Schema: Dimensional models, business entities
- Format: Delta Lake optimized for queries
- Purpose: Direct consumption by BI tools
- Retention: Active data (days to months)
Data Models:
- Fact Tables: Metrics and measurements
- Dimension Tables: Descriptive attributes
- Aggregated Views: Pre-computed summaries
- Business Entities: Customer 360, Product catalogs
Example Table Structure:
-- Gold layer fact table
CREATE TABLE gold.device_daily_metrics (
device_id STRING NOT NULL,
metric_date DATE NOT NULL,
avg_temperature DOUBLE,
max_temperature DOUBLE,
min_temperature DOUBLE,
avg_humidity DOUBLE,
total_events BIGINT,
uptime_minutes INT,
alert_count INT,
device_type STRING, -- Enriched from dimension
location_name STRING, -- Enriched from dimension
customer_segment STRING -- Business categorization
)
USING DELTA
PARTITIONED BY (metric_date)
LOCATION 'abfss://gold@datalake.dfs.core.windows.net/device_daily_metrics'
TBLPROPERTIES (
'delta.autoOptimize.optimizeWrite' = 'true',
'delta.autoOptimize.autoCompact' = 'true'
);
4. Delta Lake Storage Format¶
Core Components¶
graph TB
subgraph "Delta Lake File Structure"
DataFiles[Parquet Data Files<br/>Columnar Storage<br/>Compressed]
TxLog[Transaction Log<br/>_delta_log/<br/>JSON Logs]
Checkpoints[Checkpoints<br/>Parquet Snapshots<br/>Every 10 commits]
Stats[Statistics<br/>Min/Max Values<br/>Record Counts]
end
subgraph "ACID Guarantees"
Atomicity[Atomicity<br/>All or nothing commits]
Consistency[Consistency<br/>Schema enforcement]
Isolation[Isolation<br/>Snapshot isolation]
Durability[Durability<br/>Persistent storage]
end
DataFiles --> TxLog
TxLog --> Checkpoints
DataFiles --> Stats
TxLog --> Atomicity
TxLog --> Consistency
TxLog --> Isolation
TxLog --> Durability
style DataFiles fill:#e1f5fe
style TxLog fill:#fff3e0
style Checkpoints fill:#f3e5f5
style Stats fill:#e8f5e9 Key Features¶
| Feature | Description | Benefit |
|---|---|---|
| ACID Transactions | Full ACID compliance for data operations | Data reliability and consistency |
| Time Travel | Query historical versions of data | Audit, rollback, reproduce analyses |
| Schema Evolution | Add, rename, change columns safely | Flexibility without breaking changes |
| Upserts/Merges | Efficient UPDATE and MERGE operations | Simplified CDC and SCD patterns |
| Data Skipping | Statistics-based file pruning | Faster queries through data skipping |
| Compaction | Automatic small file optimization | Better query performance |
Delta Lake Operations¶
Write Operations:
# Append new data
df.write.format("delta").mode("append").save("/path/to/delta-table")
# Overwrite partition
df.write.format("delta") \
.mode("overwrite") \
.option("replaceWhere", "date = '2025-01-28'") \
.save("/path/to/delta-table")
# Merge (Upsert)
from delta.tables import DeltaTable
deltaTable = DeltaTable.forPath(spark, "/path/to/delta-table")
deltaTable.alias("target").merge(
updates.alias("source"),
"target.id = source.id"
).whenMatchedUpdateAll() \
.whenNotMatchedInsertAll() \
.execute()
Read Operations:
# Current version
df = spark.read.format("delta").load("/path/to/delta-table")
# Time travel - specific version
df = spark.read.format("delta") \
.option("versionAsOf", 5) \
.load("/path/to/delta-table")
# Time travel - specific timestamp
df = spark.read.format("delta") \
.option("timestampAsOf", "2025-01-28") \
.load("/path/to/delta-table")
5. Compute Engines¶
Synapse Spark Pools¶
graph TB
subgraph "Spark Pool Capabilities"
DataEng[Data Engineering<br/>ETL/ELT Processing]
ML[Machine Learning<br/>Feature Engineering<br/>Model Training]
Streaming[Structured Streaming<br/>Real-time Processing]
Languages[Multi-language<br/>Python, Scala, SQL, .NET]
end
subgraph "Optimizations"
AutoScale[Auto-scaling<br/>Dynamic Allocation]
Caching[Intelligent Caching<br/>Delta Cache]
Optimization[Query Optimization<br/>Adaptive Execution]
end
DataEng --> AutoScale
ML --> Caching
Streaming --> AutoScale
Languages --> Optimization
style DataEng fill:#fff3e0
style ML fill:#f3e5f5
style Streaming fill:#e1f5fe
style Languages fill:#e8f5e9 Primary Use Cases:
- ETL/ELT Processing: Transform data between medallion layers
- Machine Learning: Feature engineering and model training
- Streaming: Real-time data processing with Spark Structured Streaming
- Data Engineering: Complex transformations and data quality rules
Configuration Guidance:
| Workload Type | Node Size | Node Count | Auto-scaling |
|---|---|---|---|
| Development/Testing | Small (4 cores, 32GB) | 3-5 | Enabled |
| Production ETL | Medium (8 cores, 64GB) | 5-20 | Enabled |
| ML Training | Large (16 cores, 128GB) | 10-50 | Enabled |
| Streaming | Medium (8 cores, 64GB) | 5-10 | Limited |
Synapse Serverless SQL Pool¶
Architecture:
graph LR
User[User Query] --> Parser[Query Parser]
Parser --> Optimizer[Query Optimizer]
Optimizer --> Executor[Distributed Executor]
Executor --> DataLake[Data Lake Gen2<br/>Delta Files]
DataLake --> Results[Result Set]
style User fill:#e8f5e9
style Parser fill:#fff3e0
style Optimizer fill:#f3e5f5
style Executor fill:#e1f5fe
style DataLake fill:#fff9c4 Characteristics:
- Pricing: Pay-per-query (per TB scanned)
- Startup: Instant, no cluster management
- Language: T-SQL compatible
- Formats: Parquet, Delta, CSV, JSON
- Best For: Ad-hoc queries, data exploration, BI queries
Query Optimization Tips:
- Partitioning: Leverage partition pruning
- File Size: Larger files (>100MB) perform better
- Statistics: Use OPENROWSET hints
- Caching: Query results cached for 48 hours
Synapse Dedicated SQL Pool¶
__Purpose**: Enterprise data warehousing for consistent, high-performance workloads
__Architecture Pattern**:
graph TB
subgraph "Dedicated SQL Pool"
Control[Control Node<br/>Query Coordination]
Compute1[Compute Node 1]
Compute2[Compute Node 2]
Compute3[Compute Node N]
Storage[Azure Storage<br/>Distributed Storage]
end
Control --> Compute1
Control --> Compute2
Control --> Compute3
Compute1 --> Storage
Compute2 --> Storage
Compute3 --> Storage
style Control fill:#fff3e0
style Compute1 fill:#e1f5fe
style Compute2 fill:#e1f5fe
style Compute3 fill:#e1f5fe
style Storage fill:#e8f5e9 When to Use:
- Consistent query workloads
- Predictable performance requirements
- Complex dimensional models
- High concurrency BI workloads
6. Metadata & Governance¶
Spark Metastore¶
- Purpose: Centralized catalog of tables and schemas
- Scope: Shared across Spark pools
- Compatibility: Hive metastore compatible
- Storage: Backed by Azure SQL Database
Azure Purview¶
- Data Catalog: Discover and understand data assets
- Data Lineage: Track data movement and transformations
- Classification: Automatically classify sensitive data
- Compliance: Support regulatory compliance requirements
🔄 Data Flow Patterns¶
Batch Processing Flow¶
sequenceDiagram
participant Source as Data Source
participant ADF as Data Factory
participant Bronze as Bronze Layer
participant Spark as Spark Pool
participant Silver as Silver Layer
participant Gold as Gold Layer
participant BI as Power BI
Source->>ADF: Extract data
ADF->>Bronze: Load raw data (Delta)
Note over Bronze: Immutable historical record
Bronze->>Spark: Read raw data
Spark->>Spark: Clean & validate
Spark->>Silver: Write refined data (Delta)
Note over Silver: Cleaned, deduplicated
Silver->>Spark: Read refined data
Spark->>Spark: Aggregate & enrich
Spark->>Gold: Write business data (Delta)
Note over Gold: Business-ready metrics
Gold->>BI: Query for visualization
BI->>BI: Render dashboards Streaming Processing Flow¶
sequenceDiagram
participant Device as IoT Devices
participant EH as Event Hubs
participant Spark as Spark Streaming
participant Silver as Silver Layer
participant Gold as Gold Layer
participant RT as Real-time Views
Device->>EH: Send telemetry events
EH->>Spark: Stream events
Spark->>Spark: Transform & validate
Spark->>Silver: Micro-batch writes (Delta)
par Batch Aggregation
Silver->>Spark: Trigger batch job
Spark->>Gold: Write aggregates
and Real-time Aggregation
Spark->>RT: Update real-time metrics
end
Note over Gold,RT: Both paths available for queries Time Travel & Versioning¶
graph LR
subgraph "Delta Table Versions"
V0[Version 0<br/>Initial Load]
V1[Version 1<br/>Update 1]
V2[Version 2<br/>Update 2]
V3[Version 3<br/>Current]
end
subgraph "Query Options"
Current[Current Query<br/>Version 3]
Historic[Time Travel<br/>Version 1]
Rollback[Rollback<br/>Restore Version 2]
end
V0 --> V1
V1 --> V2
V2 --> V3
V3 -.-> Current
V1 -.-> Historic
V2 -.-> Rollback
style V3 fill:#4caf50
style Current fill:#4caf50
style Historic fill:#2196f3
style Rollback fill:#ff9800 💡 Key Benefits¶
1. Unified Platform¶
- Single storage layer for both batch and streaming
- Consistent data format across all compute engines
- Shared metadata accessible from SQL and Spark
2. ACID Guarantees¶
- Reliable writes with transaction support
- Consistent reads with snapshot isolation
- Safe concurrent operations from multiple users
- Data quality with schema enforcement
3. Performance¶
- Data skipping with statistics-based pruning
- Partition pruning for efficient queries
- Compaction for optimal file sizes
- Caching with Delta Cache in Spark
4. Cost Efficiency¶
- Storage costs: Parquet compression reduces storage needs
- Compute costs: Serverless SQL for ad-hoc queries
- Lifecycle management: Auto-archive old data to cool/archive tiers
5. Flexibility¶
- Schema evolution without breaking changes
- Time travel for audit and analysis
- Multiple compute engines for different workloads
- Open format based on Parquet and open-source Delta Lake
🎯 Implementation Best Practices¶
1. Data Organization¶
- Layer Separation: Strictly separate Bronze, Silver, Gold
- Partitioning: Partition by date for time-series data
- File Sizes: Target 100MB-1GB per file
- Naming Conventions: Consistent, descriptive table names
2. Performance Optimization¶
- Z-ordering: Optimize for common query patterns
- Compaction: Regular small file compaction
- Statistics: Maintain up-to-date statistics
- Caching: Leverage Delta Cache for hot data
3. Data Quality¶
- Validation: Implement data quality checks at each layer
- Schema Management: Use schema evolution carefully
- Deduplication: Handle duplicates in Silver layer
- Audit Trail: Preserve raw data in Bronze
4. Governance¶
- Access Control: Implement fine-grained ACLs
- Data Classification: Tag sensitive data
- Lineage: Track data movement with Purview
- Retention: Define retention policies per layer
🔗 Related Resources¶
Architecture Documentation¶
Implementation Guides¶
- Delta Lake Guide
- Synapse Spark Tutorial
- Best Practices
Code Examples¶
Last Updated: 2025-01-28 Architecture Version: 2.0 Pattern: Medallion + Delta Lake