Azure Synapse Analytics Delta Lakehouse Detailed Architecture¶
Overview¶
The Delta Lakehouse architecture combines the best of data lakes and data warehouses, providing ACID transactions, schema enforcement, and time travel capabilities while maintaining the flexibility and scalability of a data lake. This document details the implementation of a Delta Lakehouse using Azure Synapse Analytics.
Core Components¶
Storage Layer¶
Azure Data Lake Storage Gen2 (ADLS Gen2)¶
- Hierarchical namespace for efficient directory/file operations
- Built-in security with Azure Active Directory integration
- Cost-effective storage with tiering capabilities (hot, cool, archive)
- Designed for high throughput and parallelism
Storage Organization¶
datalake/
├── bronze/ # Raw ingested data
│ ├── source1/
│ └── source2/
├── silver/ # Cleaned and transformed data
│ ├── dimension1/
│ └── fact1/
└── gold/ # Business-level aggregated data
├── reports/
└── analytics/
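These zones map directly onto ADLS Gen2 folder paths. The sketch below shows one way to reference them from a Synapse Spark session; the storage account and container names are hypothetical placeholders, and authentication is assumed to be handled by the workspace's managed identity or a linked service.
# Hypothetical storage account and container; adjust to your environment
ACCOUNT = "mydatalake"
CONTAINER = "datalake"
BASE = f"abfss://{CONTAINER}@{ACCOUNT}.dfs.core.windows.net"

BRONZE = f"{BASE}/bronze"   # raw ingested data
SILVER = f"{BASE}/silver"   # cleaned and transformed data
GOLD = f"{BASE}/gold"       # business-level aggregated data

# Example: browse a raw source folder using the Synapse-provided `spark` session
spark.read.json(f"{BRONZE}/source1").printSchema()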
Compute Layer¶
Azure Synapse Spark Pools¶
- Fully managed Apache Spark service
- Autoscaling capabilities based on workload
- Native integration with Delta Lake
- Configurable for memory-optimized or compute-optimized workloads
Pool Configurations¶
| Pool Type | Node Size | Autoscale Range | Use Case |
|---|---|---|---|
| Small | Medium (8 vCores) | 3-10 nodes | Development, testing |
| Medium | Large (16 vCores) | 5-20 nodes | Production ETL |
| Large | XLarge (32 vCores) | 10-40 nodes | Data science workloads |
Delta Lake Integration¶
Key Components¶
- Transaction log for ACID compliance
- Optimistic concurrency control
- Schema enforcement and evolution
- Data skipping and Z-ordering for query optimization
- Time travel capabilities
Implementation¶
# Example of configuring a Spark session with Delta Lake support
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Delta Lake Configuration") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Set Delta-specific defaults: enable Change Data Feed on new tables,
# target 256 MB files for OPTIMIZE, and turn on auto-compaction
spark.conf.set("spark.databricks.delta.properties.defaults.enableChangeDataFeed", "true")
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 256 * 1024 * 1024)  # 256 MB
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
Architecture Patterns¶
Bronze-Silver-Gold Pattern¶
Bronze Layer (Raw Data)¶
- Ingests data in raw format with minimal transformation
- Preserves original data for auditing and reprocessing
- Implemented as Delta tables with schema inference
- Retention policies based on compliance requirements
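A minimal sketch of a bronze-layer ingest under the assumptions above (the zone paths from the storage sketch, a hypothetical CSV landing folder, and the pre-provided `spark` session):
# Land raw CSV files as a bronze Delta table, relying on schema inference
raw_df = (spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(f"{BRONZE}/source1/landing"))

(raw_df.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")   # tolerate new columns arriving from the source
    .save(f"{BRONZE}/source1/delta"))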
Silver Layer (Processed Data)¶
- Cleaned and conformed data
- Standardized formats and data types
- Data quality checks and validation
- Implemented as Delta tables with strict schemas
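A corresponding silver-layer sketch: the bronze data is read back, types are standardized, basic quality rules are applied, and the result is written with a strict (non-merged) schema. Column names such as `order_id` and `amount` are illustrative only.
from pyspark.sql import functions as F

bronze_df = spark.read.format("delta").load(f"{BRONZE}/source1/delta")

silver_df = (bronze_df
    .withColumn("order_date", F.to_date("order_date"))               # standardize types
    .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
    .filter(F.col("order_id").isNotNull())                           # basic data quality check
    .dropDuplicates(["order_id"]))

silver_df.write.format("delta").mode("overwrite").save(f"{SILVER}/fact1")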
Gold Layer (Business Data)¶
- Aggregated, enriched data ready for consumption
- Optimized for specific business domains or use cases
- Often dimensional models or denormalized structures
- Implemented as Delta tables optimized for query performance
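And a gold-layer sketch that aggregates the silver fact table into a reporting-friendly shape, partitioned for common query filters (again with illustrative column names):
from pyspark.sql import functions as F

silver_df = spark.read.format("delta").load(f"{SILVER}/fact1")

gold_df = (silver_df
    .groupBy("customer_id", F.year("order_date").alias("order_year"))
    .agg(F.sum("amount").alias("total_amount"),
         F.count("order_id").alias("order_count")))

(gold_df.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("order_year")       # partition for common report filters
    .save(f"{GOLD}/reports/customer_yearly_sales"))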
Data Ingestion Patterns¶
Batch Ingestion¶
- Using Azure Synapse pipelines for orchestration
- Scheduled or event-triggered processing
- Support for various source formats (CSV, JSON, Parquet, etc.)
- Parallel loading for high-volume data
Stream Ingestion¶
- Integration with Azure Event Hubs or Kafka
- Real-time processing with Structured Streaming
- Delta Lake's support for streaming writes
- Auto-compaction for optimizing small files
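A hedged sketch of a streaming ingest into a bronze Delta table using Structured Streaming's Kafka source (which also works against Event Hubs' Kafka-compatible endpoint). The broker, topic, and checkpoint locations are placeholders, and the SASL/authentication options required for Event Hubs are omitted for brevity.
# Stream events from a Kafka topic into a bronze Delta table
events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093")
    .option("subscribe", "events")
    .option("startingOffsets", "latest")
    .load())

(events.selectExpr("CAST(key AS STRING) AS key",
                   "CAST(value AS STRING) AS payload",
                   "timestamp")
    .writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", f"{BRONZE}/events/_checkpoints")   # required for exactly-once
    .start(f"{BRONZE}/events/delta"))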
Data Processing Patterns¶
ELT (Extract, Load, Transform)¶
- Load raw data into Bronze layer
- Transform in-place using Spark SQL or DataFrame APIs
- Move processed data to Silver and Gold layers
- Leverages Synapse's distributed processing capabilities
- Maintain tables with OPTIMIZE and VACUUM; compute statistics with ANALYZE TABLE where needed
Advanced Features¶
Time Travel and Versioning¶
Delta Lake provides time travel capabilities, allowing queries against previous versions of the data. This is particularly useful for:
- Auditing and compliance
- Debugging and rollback scenarios
- Point-in-time analysis
- Reproducible reporting
-- Query data as of a specific timestamp
SELECT * FROM delta.`/path/to/table` TIMESTAMP AS OF '2025-08-01 00:00:00'
-- Query data as of a specific version
SELECT * FROM delta.`/path/to/table` VERSION AS OF 123
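The same reads can be expressed with the DataFrame API using the `versionAsOf` and `timestampAsOf` reader options:
# Read a previous version of the table via the DataFrame API
v123_df = spark.read.format("delta").option("versionAsOf", 123).load("/path/to/table")

# Read the table as of a specific point in time
snapshot_df = (spark.read.format("delta")
    .option("timestampAsOf", "2025-08-01 00:00:00")
    .load("/path/to/table"))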
Schema Evolution¶
Delta Lake supports schema evolution, allowing tables to adapt as data structures change over time:
- Add new columns
- Change data types (with compatible conversions)
- Rename columns using column mapping
-- Add a new column (existing rows are populated with NULL)
ALTER TABLE delta_table ADD COLUMN new_column STRING;
-- Set a default for future writes (requires the column-defaults table feature)
ALTER TABLE delta_table ALTER COLUMN new_column SET DEFAULT 'default_value';
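On the write side, schema evolution is usually opted into per write with the `mergeSchema` option (or globally via `spark.databricks.delta.schema.autoMerge.enabled`). A brief sketch, assuming `new_df` carries an additional column:
# Append a DataFrame whose schema has gained a column; mergeSchema evolves the table schema
(new_df.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/path/to/table"))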
Change Data Capture (CDC)¶
Delta Lake supports Change Data Feed, enabling downstream systems to consume only changed data:
# Enable CDC on a Delta table
spark.sql("ALTER TABLE delta_table SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

# Read changes between versions
changes = (spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 5)
    .option("endingVersion", 10)
    .table("delta_table"))
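Change feed reads can also be bounded by commit timestamps rather than version numbers, which is often more convenient for scheduled downstream consumers:
# Read changes committed within a time window
changes_by_time = (spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingTimestamp", "2025-08-01 00:00:00")
    .option("endingTimestamp", "2025-08-02 00:00:00")
    .table("delta_table"))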
Optimizations¶
Data Skipping¶
- Delta Lake maintains statistics on data files
- Query predicates use these statistics to skip irrelevant files
- Significantly improves query performance
Z-Ordering¶
- Multi-dimensional clustering technique
- Colocates related data together
- Improves query performance when filtering on Z-ordered columns
File Compaction¶
- Combines small files into larger ones
- Reduces metadata overhead
- Improves scan performance
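These optimizations are typically run as scheduled maintenance jobs. A sketch using the DeltaTable Python API (available when the delta-spark package matches the pool's Delta runtime; the table path and Z-order column are placeholders):
from delta.tables import DeltaTable

table = DeltaTable.forPath(spark, f"{SILVER}/fact1")

# Compact small files and Z-order by a frequently filtered column
table.optimize().executeZOrderBy("customer_id")

# Remove data files no longer referenced by the transaction log
# (subject to the default 7-day retention unless configured otherwise)
table.vacuum()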
Security and Governance¶
Authentication and Authorization¶
Azure Active Directory Integration¶
- Single sign-on with Azure AD
- Role-based access control (RBAC)
- Integration with existing identity systems
- Support for managed identities
Fine-grained Access Control¶
- Table-level and column-level security
- Row-level security through filtered views or predicates over Delta data
- Dynamic data masking for sensitive fields
Data Governance¶
Azure Purview Integration¶
- Automated data discovery and classification
- Data lineage tracking
- Sensitive data identification
- Centralized metadata management
Metadata Management¶
- Schema history tracking
- Transaction history logging
- Origin tracking with detailed provenance
- Integration with external metadata systems
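Much of this metadata can be read directly from the Delta transaction log; for example, a table's commit history (operations, timestamps, and operation metrics) is available through DESCRIBE HISTORY or the DeltaTable API:
from delta.tables import DeltaTable

# Inspect the commit history recorded in the transaction log
history_df = DeltaTable.forPath(spark, f"{SILVER}/fact1").history()
history_df.select("version", "timestamp", "operation", "operationMetrics").show(truncate=False)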
Monitoring and Optimization¶
Performance Monitoring¶
Azure Monitor Integration¶
- Resource utilization tracking
- Query performance metrics
- Cost analysis
- Alerting on performance degradation
Delta-specific Metrics¶
- Transaction log size and growth rate
- Data skipping effectiveness
- Compaction efficiency
- Read/write throughput
Cost Optimization Strategies¶
Storage Optimization¶
- Tiered storage policies
- Data lifecycle management
- Vacuum operations to remove stale files
- Compression settings optimization
Compute Optimization¶
- Right-sizing Spark pools
- Autoscaling configurations
- Workload isolation for predictable performance
- Caching strategies for frequently accessed data
Integration Points¶
Synapse SQL Integration¶
- Query Delta tables directly from Serverless SQL pools
- Create external tables over Delta format
- Join between Delta Lake and other data sources
- Cross-engine queries (Spark and SQL)
Power BI Integration¶
- DirectQuery support for Delta tables
- Composite models combining Delta Lake with other sources
- Incremental refresh based on Delta Lake partitioning
- Enterprise-scale semantic models
Azure Machine Learning¶
- Feature store implementation using Delta Lake
- Model training on Delta tables
- Model deployment with feature versioning
- MLOps workflows with data and model versioning
Reference Implementation¶
For a detailed reference implementation of Delta Lakehouse in Azure Synapse Analytics, refer to the code examples section of this documentation.