Azure Synapse Analytics Delta Lakehouse Detailed Architecture¶
Overview¶
The Delta Lakehouse architecture combines the best of data lakes and data warehouses, providing ACID transactions, schema enforcement, and time travel capabilities while maintaining the flexibility and scalability of a data lake. This document details the implementation of a Delta Lakehouse using Azure Synapse Analytics.
Core Components¶
Storage Layer¶
Azure Data Lake Storage Gen2 (ADLS Gen2)¶
- Hierarchical namespace for efficient directory/file operations
- Built-in security with Azure Active Directory integration
- Cost-effective storage with tiering capabilities (hot, cool, archive)
- Designed for high throughput and parallelism
Storage Organization¶
datalake/
├── bronze/ # Raw ingested data
│ ├── source1/
│ └── source2/
├── silver/ # Cleaned and transformed data
│ ├── dimension1/
│ └── fact1/
└── gold/ # Business-level aggregated data
├── reports/
└── analytics/
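These zones map directly onto ADLS Gen2 folder paths. The sketch below shows one way to reference them from a Synapse Spark session; the storage account and container names are hypothetical placeholders, and authentication is assumed to be handled by the workspace's managed identity or a linked service.
# Hypothetical storage account and container; adjust to your environment
ACCOUNT = "mydatalake"
CONTAINER = "datalake"
BASE = f"abfss://{CONTAINER}@{ACCOUNT}.dfs.core.windows.net"

BRONZE = f"{BASE}/bronze"   # raw ingested data
SILVER = f"{BASE}/silver"   # cleaned and transformed data
GOLD = f"{BASE}/gold"       # business-level aggregated data

# Example: browse a raw source folder using the Synapse-provided `spark` session
spark.read.json(f"{BRONZE}/source1").printSchema()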
Compute Layer¶
Azure Synapse Spark Pools¶
- Fully managed Apache Spark service
- Autoscaling capabilities based on workload
- Native integration with Delta Lake
- Configurable for memory-optimized or compute-optimized workloads
Pool Configurations¶
| Pool Type | Node Size | Autoscale Range | Use Case |
|---|---|---|---|
| Small | Medium (8 vCores) | 3-10 nodes | Development, testing |
| Medium | Large (16 vCores) | 5-20 nodes | Production ETL |
| Large | XLarge (32 vCores) | 10-40 nodes | Data science workloads |
Delta Lake Integration¶
Key Components¶
- Transaction log for ACID compliance
- Optimistic concurrency control
- Schema enforcement and evolution
- Data skipping and Z-ordering for query optimization
- Time travel capabilities
Implementation¶
# Example of configuring a Spark session with Delta Lake support
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Delta Lake Configuration") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Set Delta-specific defaults: enable Change Data Feed on new tables,
# target 256 MB files for OPTIMIZE, and turn on auto-compaction
spark.conf.set("spark.databricks.delta.properties.defaults.enableChangeDataFeed", "true")
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 256 * 1024 * 1024)  # 256 MB
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
Architecture Patterns¶
Bronze-Silver-Gold Pattern¶
Bronze Layer (Raw Data)¶
- Ingests data in raw format with minimal transformation
- Preserves original data for auditing and reprocessing
- Implemented as Delta tables with schema inference
- Retention policies based on compliance requirements
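A minimal sketch of a bronze-layer ingest under the assumptions above (the zone paths from the storage sketch, a hypothetical CSV landing folder, and the pre-provided `spark` session):
# Land raw CSV files as a bronze Delta table, relying on schema inference
raw_df = (spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(f"{BRONZE}/source1/landing"))

(raw_df.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")   # tolerate new columns arriving from the source
    .save(f"{BRONZE}/source1/delta"))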
Silver Layer (Processed Data)¶
- Cleaned and conformed data
- Standardized formats and data types
- Data quality checks and validation
- Implemented as Delta tables with strict schemas
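A corresponding silver-layer sketch: the bronze data is read back, types are standardized, basic quality rules are applied, and the result is written with a strict (non-merged) schema. Column names such as `order_id` and `amount` are illustrative only.
from pyspark.sql import functions as F

bronze_df = spark.read.format("delta").load(f"{BRONZE}/source1/delta")

silver_df = (bronze_df
    .withColumn("order_date", F.to_date("order_date"))               # standardize types
    .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
    .filter(F.col("order_id").isNotNull())                           # basic data quality check
    .dropDuplicates(["order_id"]))

silver_df.write.format("delta").mode("overwrite").save(f"{SILVER}/fact1")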
Gold Layer (Business Data)¶
- Aggregated, enriched data ready for consumption
- Optimized for specific business domains or use cases
- Often dimensional models or denormalized structures
- Implemented as Delta tables optimized for query performance
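And a gold-layer sketch that aggregates the silver fact table into a reporting-friendly shape, partitioned for common query filters (again with illustrative column names):
from pyspark.sql import functions as F

silver_df = spark.read.format("delta").load(f"{SILVER}/fact1")

gold_df = (silver_df
    .groupBy("customer_id", F.year("order_date").alias("order_year"))
    .agg(F.sum("amount").alias("total_amount"),
         F.count("order_id").alias("order_count")))

(gold_df.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("order_year")       # partition for common report filters
    .save(f"{GOLD}/reports/customer_yearly_sales"))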
Data Ingestion Patterns¶
Batch Ingestion¶
- Using Azure Synapse pipelines for orchestration
- Scheduled or event-triggered processing
- Support for various source formats (CSV, JSON, Parquet, etc.)
- Parallel loading for high-volume data
Stream Ingestion¶
- Integration with Azure Event Hubs or Kafka
- Real-time processing with Structured Streaming
- Delta Lake's support for streaming writes
- Auto-compaction for optimizing small files
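A hedged sketch of a streaming ingest into a bronze Delta table using Structured Streaming's Kafka source (which also works against Event Hubs' Kafka-compatible endpoint). The broker, topic, and checkpoint locations are placeholders, and the SASL/authentication options required for Event Hubs are omitted for brevity.
# Stream events from a Kafka topic into a bronze Delta table
events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093")
    .option("subscribe", "events")
    .option("startingOffsets", "latest")
    .load())

(events.selectExpr("CAST(key AS STRING) AS key",
                   "CAST(value AS STRING) AS payload",
                   "timestamp")
    .writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", f"{BRONZE}/events/_checkpoints")   # required for exactly-once
    .start(f"{BRONZE}/events/delta"))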
Data Processing Patterns¶
ELT (Extract, Load, Transform)¶
- Load raw data into Bronze layer
- Transform in-place using Spark SQL or DataFrame APIs
- Move processed data to Silver and Gold layers
- Leverages Synapse's distributed processing capabilities
- Maintain tables with OPTIMIZE and VACUUM; compute statistics with ANALYZE TABLE where needed
Advanced Features¶
Time Travel and Versioning¶
Delta Lake provides time travel capabilities, allowing queries against previous versions of the data. This is particularly useful for:
- Auditing and compliance
- Debugging and rollback scenarios
- Point-in-time analysis
- Reproducible reporting
-- Query data as of a specific timestamp
SELECT * FROM delta.`/path/to/table` TIMESTAMP AS OF '2025-08-01 00:00:00'
-- Query data as of a specific version
SELECT * FROM delta.`/path/to/table` VERSION AS OF 123
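The same reads can be expressed with the DataFrame API using the `versionAsOf` and `timestampAsOf` reader options:
# Read a previous version of the table via the DataFrame API
v123_df = spark.read.format("delta").option("versionAsOf", 123).load("/path/to/table")

# Read the table as of a specific point in time
snapshot_df = (spark.read.format("delta")
    .option("timestampAsOf", "2025-08-01 00:00:00")
    .load("/path/to/table"))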
Schema Evolution¶
Delta Lake supports schema evolution, allowing tables to adapt as data structures change over time:
- Add new columns
- Change data types (with compatible conversions)
- Rename columns using column mapping
-- Add a new column (existing rows are populated with NULL)
ALTER TABLE delta_table ADD COLUMN new_column STRING;
-- Set a default for future writes (requires the column-defaults table feature)
ALTER TABLE delta_table ALTER COLUMN new_column SET DEFAULT 'default_value';
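On the write side, schema evolution is usually opted into per write with the `mergeSchema` option (or globally via `spark.databricks.delta.schema.autoMerge.enabled`). A brief sketch, assuming `new_df` carries an additional column:
# Append a DataFrame whose schema has gained a column; mergeSchema evolves the table schema
(new_df.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/path/to/table"))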
Change Data Capture (CDC)¶
Delta Lake supports Change Data Feed, enabling downstream systems to consume only changed data:
# Enable CDC on a Delta table
spark.sql("ALTER TABLE delta_table SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

# Read changes between versions
changes = (spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 5)
    .option("endingVersion", 10)
    .table("delta_table"))
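Change feed reads can also be bounded by commit timestamps rather than version numbers, which is often more convenient for scheduled downstream consumers:
# Read changes committed within a time window
changes_by_time = (spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingTimestamp", "2025-08-01 00:00:00")
    .option("endingTimestamp", "2025-08-02 00:00:00")
    .table("delta_table"))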
Optimizations¶
Data Skipping¶
- Delta Lake maintains statistics on data files
- Query predicates use these statistics to skip irrelevant files
- Significantly improves query performance
Z-Ordering¶
- Multi-dimensional clustering technique
- Colocates related data together
- Improves query performance when filtering on Z-ordered columns
File Compaction¶
- Combines small files into larger ones
- Reduces metadata overhead
- Improves scan performance
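These optimizations are typically run as scheduled maintenance jobs. A sketch using the DeltaTable Python API (available when the delta-spark package matches the pool's Delta runtime; the table path and Z-order column are placeholders):
from delta.tables import DeltaTable

table = DeltaTable.forPath(spark, f"{SILVER}/fact1")

# Compact small files and Z-order by a frequently filtered column
table.optimize().executeZOrderBy("customer_id")

# Remove data files no longer referenced by the transaction log
# (subject to the default 7-day retention unless configured otherwise)
table.vacuum()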
Security and Governance¶
Authentication and Authorization¶
Azure Active Directory Integration¶
- Single sign-on with Azure AD
- Role-based access control (RBAC)
- Integration with existing identity systems
- Support for managed identities
Fine-grained Access Control¶
- Table-level and column-level security
- Row-level security through filtered views or predicates over Delta data
- Dynamic data masking for sensitive fields
Data Governance¶
Azure Purview Integration¶
- Automated data discovery and classification
- Data lineage tracking
- Sensitive data identification
- Centralized metadata management
Metadata Management¶
- Schema history tracking
- Transaction history logging
- Origin tracking with detailed provenance
- Integration with external metadata systems
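Much of this metadata can be read directly from the Delta transaction log; for example, a table's commit history (operations, timestamps, and operation metrics) is available through DESCRIBE HISTORY or the DeltaTable API:
from delta.tables import DeltaTable

# Inspect the commit history recorded in the transaction log
history_df = DeltaTable.forPath(spark, f"{SILVER}/fact1").history()
history_df.select("version", "timestamp", "operation", "operationMetrics").show(truncate=False)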
Monitoring and Optimization¶
Performance Monitoring¶
Azure Monitor Integration¶
- Resource utilization tracking
- Query performance metrics
- Cost analysis
- Alerting on performance degradation
Delta-specific Metrics¶
- Transaction log size and growth rate
- Data skipping effectiveness
- Compaction efficiency
- Read/write throughput
Cost Optimization Strategies¶
Storage Optimization¶
- Tiered storage policies
- Data lifecycle management
- Vacuum operations to remove stale files
- Compression settings optimization
Compute Optimization¶
- Right-sizing Spark pools
- Autoscaling configurations
- Workload isolation for predictable performance
- Caching strategies for frequently accessed data
Integration Points¶
Synapse SQL Integration¶
- Query Delta tables directly from Serverless SQL pools
- Create external tables over Delta format
- Join between Delta Lake and other data sources
- Cross-engine queries (Spark and SQL)
Power BI Integration¶
- DirectQuery support for Delta tables
- Composite models combining Delta Lake with other sources
- Incremental refresh based on Delta Lake partitioning
- Enterprise-scale semantic models
Azure Machine Learning¶
- Feature store implementation using Delta Lake
- Model training on Delta tables
- Model deployment with feature versioning
- MLOps workflows with data and model versioning
Reference Implementation¶
For a detailed reference implementation of Delta Lakehouse in Azure Synapse Analytics, refer to the code examples section of this documentation.