# Delta Lakehouse Architecture
## Overview
The Delta Lakehouse architecture combines the flexibility and cost-efficiency of a data lake with the data management and ACID transaction capabilities of a data warehouse. Azure Synapse Analytics provides native integration with Delta Lake format, enabling a modern and efficient lakehouse implementation.
## Architecture Components

### Core Components
- **Azure Data Lake Storage Gen2**
    - Foundation for storing all data in raw, refined, and curated zones
    - Hierarchical namespace for efficient file organization
    - Fine-grained ACLs for security at folder and file levels
- **Delta Lake**
    - Open-source storage layer that brings ACID transactions to data lakes
    - Schema enforcement and evolution capabilities
    - Time travel (data versioning) for auditing and rollbacks
    - Support for optimized Parquet format for performance
- **Azure Synapse Spark Pools**
    - Distributed processing engine for data transformation
    - Native support for the Delta Lake format
    - Scalable compute for batch and stream processing
    - Integration with Azure Machine Learning for advanced analytics
- **Azure Synapse SQL**
    - SQL interface for querying Delta tables
    - Serverless pool for ad-hoc analytics
    - Dedicated pool for enterprise data warehousing
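
To make the Spark-plus-Delta pairing concrete, here is a minimal sketch of writing and reading a Delta table from a Synapse Spark pool, including a time-travel read. The storage account, container, and table path are placeholders, not part of this architecture's contract:

```python
from pyspark.sql import SparkSession

# In a Synapse notebook a session already exists; getOrCreate() reuses it.
spark = SparkSession.builder.getOrCreate()

path = "abfss://data@mystorageaccount.dfs.core.windows.net/refined/customers"  # hypothetical

df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["customer_id", "name"])

# Each write is an ACID transaction recorded in the table's _delta_log/.
df.write.format("delta").mode("overwrite").save(path)

# Time travel: read the table as it existed at an earlier version.
latest = spark.read.format("delta").load(path)
as_of_v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
```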
## Implementation Patterns

### Multi-Zone Data Organization
```
abfss://data@<storage-account>.dfs.core.windows.net/
├── raw/        # Raw ingested data
├── refined/    # Cleansed and conformed data
└── curated/    # Business-ready data products
```
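
In practice the zone roots can be captured once and reused across notebooks; the sketch below promotes a dataset from the raw zone to the refined zone. The storage account, dataset paths, and the `order_id` key are illustrative assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical zone roots; substitute your own container and storage account.
ROOT = "abfss://data@mystorageaccount.dfs.core.windows.net"
RAW, REFINED, CURATED = f"{ROOT}/raw", f"{ROOT}/refined", f"{ROOT}/curated"
# CURATED is used in the gold-layer sketch further below.

# Promote a dataset from the raw zone into the refined zone as a Delta table.
orders = spark.read.json(f"{RAW}/sales/orders/")  # landed as JSON by ingestion
orders.dropDuplicates(["order_id"]) \
      .write.format("delta").mode("overwrite").save(f"{REFINED}/sales/orders")
```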
### Data Flow Diagram
The following diagram illustrates the end-to-end data flow through the Delta Lakehouse architecture:
```mermaid
graph LR
    A[Data Sources] --> B[Ingestion Layer]
    B --> C[Bronze Layer<br/>Raw Data]
    C --> D[Silver Layer<br/>Refined Data]
    D --> E[Gold Layer<br/>Curated Data]
    C --> F[Delta Lake Storage]
    D --> F
    E --> F
    F --> G[Spark Pools<br/>Processing]
    F --> H[Serverless SQL<br/>Querying]
    F --> I[Dedicated SQL<br/>Analytics]
    G --> J[Analytics & BI]
    H --> J
    I --> J
    G --> K[Machine Learning]
    style C fill:#CD7F32
    style D fill:#C0C0C0
    style E fill:#FFD700
    style F fill:#90EE90
```

### Medallion Architecture
The medallion architecture organizes your Delta Lake data into layers with increasing data quality and refinement:
- **Bronze Layer (Raw Data)**
    - Ingestion sink for all source data
    - Preserves original data format and content
    - Minimal transformation, in keeping with an ELT approach
    - Schema-on-read approach
- **Silver Layer (Refined Data)**
    - Cleansed and conformed data
    - Standardized formats and resolved duplicates
    - Common data quality rules applied
    - Typically organized by domain or source system
- **Gold Layer (Curated Data)**
    - Business-level aggregates and metrics
    - Dimensional models for reporting
    - Feature tables for machine learning
    - Optimized for specific analytical use cases
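
As an illustration of the silver-to-gold hop, the sketch below derives a business-level daily aggregate from a silver table; the paths, column names, and metrics are assumptions rather than a prescribed schema:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

root = "abfss://data@mystorageaccount.dfs.core.windows.net"  # hypothetical account

silver = spark.read.format("delta").load(f"{root}/refined/sales/orders")

# Gold: a reporting-shaped daily revenue table, aggregated from silver.
daily_revenue = (
    silver.groupBy(F.to_date("order_ts").alias("order_date"))
          .agg(F.sum("amount").alias("revenue"),
               F.countDistinct("customer_id").alias("distinct_customers"))
)

daily_revenue.write.format("delta").mode("overwrite") \
             .save(f"{root}/curated/sales/daily_revenue")
```

Keeping gold tables shaped for a specific use case, rather than generic, is what lets downstream BI queries stay simple and fast.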
## Performance Optimization

### Delta Optimizations
- Data Skipping: Delta maintains statistics to skip irrelevant files during queries
- Z-Ordering: Multi-dimensional clustering for improved filtering performance
- Compaction: Small file consolidation to optimize read performance
- Caching: Metadata and data caching for frequently accessed tables
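
As a sketch of compaction plus Z-ordering, assuming a Synapse Spark runtime that carries Delta Lake 2.0 or later (where `OPTIMIZE` and Z-ordering landed in open-source Delta); the table path and column are hypothetical:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

path = "abfss://data@mystorageaccount.dfs.core.windows.net/curated/sales/daily_revenue"

# Compact small files and cluster rows by order_date, so per-file min/max
# statistics let queries that filter on order_date skip whole files.
DeltaTable.forPath(spark, path).optimize().executeZOrderBy("order_date")
```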
### Spark Tuning
- Autoscaling: Configure Spark pools to scale based on workload
- Partition Management: Right-size partitions to optimize parallelism
- Memory Configuration: Allocate appropriate memory for shuffle and execution
- Query Plan Optimization: Analyze and tune Spark execution plans
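
Autoscaling is set on the Spark pool itself, but partitioning and shuffle behavior can be tuned per session, as in this sketch; the values shown are illustrative, not recommendations:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.conf.set("spark.sql.adaptive.enabled", "true")   # let AQE coalesce skewed shuffle partitions
spark.conf.set("spark.sql.shuffle.partitions", "200")  # baseline shuffle parallelism
spark.conf.set("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))  # input split size

# Right-size partitions before an expensive wide operation (path is hypothetical).
orders = spark.read.format("delta").load(
    "abfss://data@mystorageaccount.dfs.core.windows.net/refined/sales/orders"
)
orders = orders.repartition(64, "order_date")
```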
## Governance and Security
- Azure Purview Integration: Data cataloging and lineage tracking
- Column-Level Security: Fine-grained access control within tables
- Row-Level Security: Filter data based on user context
- Transparent Data Encryption: Data encryption at rest
## Deployment and DevOps
- Infrastructure as Code: Deploy lakehouse components using ARM templates or Terraform
- CI/CD Pipelines: Automated testing and deployment of Spark notebooks and SQL scripts
- Monitoring: Azure Monitor integration for performance tracking and alerts
- Delta Live Tables: Declarative ETL framework for reliable pipeline development (note: a feature of Azure Databricks rather than of Synapse itself)
## Best Practices
- Implement a systematic approach to schema evolution
- Use appropriate partitioning strategies based on data access patterns
- Apply retention policies to manage data lifecycle efficiently
- Leverage checkpoint files for streaming workloads
- Implement Slowly Changing Dimension patterns for tracking historical changes
- Use Z-Ordering on frequently filtered columns
- Maintain separate compute clusters for ETL and query workloads
- Implement CI/CD practices for Delta table schema changes
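
To make the SCD guidance concrete, here is a minimal Delta `MERGE` upsert in the style of an SCD Type 1 pattern (a Type 2 pattern would additionally manage effective-date and current-flag columns); the table paths and the `customer_id` key are assumptions:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

root = "abfss://data@mystorageaccount.dfs.core.windows.net"  # hypothetical

updates = spark.read.format("delta").load(f"{root}/refined/customer_changes")

# SCD Type 1: update matched rows in place, insert previously unseen keys.
(DeltaTable.forPath(spark, f"{root}/curated/dim_customer").alias("dim")
    .merge(updates.alias("src"), "dim.customer_id = src.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```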