Azure Synapse Analytics Delta Lakehouse Detailed Architecture

Overview

The Delta Lakehouse architecture combines the best of data lakes and data warehouses, providing ACID transactions, schema enforcement, and time travel capabilities while maintaining the flexibility and scalability of a data lake. This document details the implementation of a Delta Lakehouse using Azure Synapse Analytics.

Core Components

Storage Layer

Azure Data Lake Storage Gen2 (ADLS Gen2)

  • Hierarchical namespace for efficient directory/file operations
  • Built-in security with Azure Active Directory integration
  • Cost-effective storage with tiering capabilities (hot, cool, archive)
  • Designed for high throughput and parallelism

Storage Organization

datalake/
├── bronze/             # Raw ingested data
│   ├── source1/
│   └── source2/
├── silver/             # Cleaned and transformed data
│   ├── dimension1/
│   └── fact1/
└── gold/               # Business-level aggregated data
    ├── reports/
    └── analytics/

Compute Layer

Azure Synapse Spark Pools

  • Fully managed Apache Spark service
  • Autoscaling capabilities based on workload
  • Native integration with Delta Lake
  • Configurable for memory-optimized or compute-optimized workloads

Pool Configurations

Pool Type   Node Size            Autoscale     Use Case
Small       Medium (8 vCores)    3-10 nodes    Development, testing
Medium      Large (16 vCores)    5-20 nodes    Production ETL
Large       XLarge (32 vCores)   10-40 nodes   Data science workloads

Delta Lake Integration

Key Components

  • Transaction log for ACID compliance
  • Optimistic concurrency control
  • Schema enforcement and evolution
  • Data skipping and Z-ordering for query optimization
  • Time travel capabilities

Implementation

# Example of configuring Spark with Delta Lake
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Delta Lake Configuration") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Setting Delta specific configurations
spark.conf.set("spark.databricks.delta.properties.defaults.enableChangeDataFeed", "true")
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 1024 * 1024 * 256)  # 256MB
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

Architecture Patterns

Bronze-Silver-Gold Pattern

Bronze Layer (Raw Data)

  • Ingests data in raw format with minimal transformation
  • Preserves original data for auditing and reprocessing
  • Implemented as Delta tables with schema inference
  • Retention policies based on compliance requirements

Silver Layer (Processed Data)

  • Cleaned and conformed data
  • Standardized formats and data types
  • Data quality checks and validation
  • Implemented as Delta tables with strict schemas

Gold Layer (Business Data)

  • Aggregated, enriched data ready for consumption
  • Optimized for specific business domains or use cases
  • Often dimensional models or denormalized structures
  • Implemented as Delta tables optimized for query performance
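
As a minimal illustration of promoting data between layers, the sketch below reads a bronze Delta table, applies basic cleansing, and writes the result to the silver layer. The paths, table names, and columns are hypothetical, and the spark session is assumed to be configured as shown earlier.

# Minimal bronze-to-silver promotion sketch; paths and columns are hypothetical
from pyspark.sql import functions as F

bronze_df = spark.read.format("delta") \
    .load("abfss://datalake@<account>.dfs.core.windows.net/bronze/source1")

# Cleanse and conform: drop duplicates, enforce types, remove invalid rows
silver_df = bronze_df \
    .dropDuplicates(["id"]) \
    .withColumn("updated_at", F.to_timestamp("updated_at")) \
    .filter(F.col("id").isNotNull())

# Silver tables use strict schemas, so no schema merging on write
silver_df.write.format("delta") \
    .mode("overwrite") \
    .save("abfss://datalake@<account>.dfs.core.windows.net/silver/dimension1")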

Data Ingestion Patterns

Batch Ingestion

  • Using Azure Synapse pipelines for orchestration
  • Scheduled or event-triggered processing
  • Support for various source formats (CSV, JSON, Parquet, etc.)
  • Parallel loading for high-volume data
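
A scheduled pipeline typically invokes a notebook activity that performs the load itself. A minimal sketch, assuming CSV sources and placeholder paths:

# Batch load of CSV files into a bronze Delta table (paths are placeholders)
raw_df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("abfss://landing@<account>.dfs.core.windows.net/source1/")

# Append preserves previously ingested raw data for audit and reprocessing
raw_df.write.format("delta") \
    .mode("append") \
    .save("abfss://datalake@<account>.dfs.core.windows.net/bronze/source1")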

Stream Ingestion

  • Integration with Azure Event Hubs or Kafka
  • Real-time processing with Structured Streaming
  • Delta Lake's support for streaming writes
  • Auto-compaction for optimizing small files
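
A minimal Structured Streaming sketch, assuming ingestion through Event Hubs' Kafka-compatible endpoint; the namespace, hub name, JAAS configuration, and paths are placeholders:

# Stream events from Event Hubs (Kafka endpoint) into a bronze Delta table
stream_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093") \
    .option("subscribe", "<event-hub-name>") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.mechanism", "PLAIN") \
    .option("kafka.sasl.jaas.config", "<jaas-config-with-connection-string>") \
    .load()

# Persist the raw payload; the checkpoint enables fault-tolerant, exactly-once writes
stream_df.selectExpr("CAST(value AS STRING) AS body", "timestamp") \
    .writeStream \
    .format("delta") \
    .option("checkpointLocation", "abfss://datalake@<account>.dfs.core.windows.net/_checkpoints/source2") \
    .start("abfss://datalake@<account>.dfs.core.windows.net/bronze/source2")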

Data Processing Patterns

ELT (Extract, Load, Transform)

  • Load raw data into Bronze layer
  • Transform in-place using Spark SQL or DataFrame APIs
  • Move processed data to Silver and Gold layers
  • Leverages Synapse's distributed processing capabilities
  • Maintain tables and statistics with OPTIMIZE, VACUUM, and ANALYZE (see the sketch below)
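
A sketch of this flow in Spark SQL, with hypothetical table names:

-- Transform bronze data in place into a silver table
CREATE TABLE silver_dimension1
USING DELTA
AS SELECT DISTINCT id, TRIM(name) AS name, CAST(updated_at AS TIMESTAMP) AS updated_at
FROM bronze_source1
WHERE id IS NOT NULL;

-- Routine maintenance after large loads
OPTIMIZE silver_dimension1;
VACUUM silver_dimension1 RETAIN 168 HOURS;  -- keep 7 days of history for time travel
ANALYZE TABLE silver_dimension1 COMPUTE STATISTICS;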

Advanced Features

Time Travel and Versioning

Delta Lake provides time travel capabilities, allowing queries against previous versions of the data. This is particularly useful for:

  • Auditing and compliance
  • Debugging and rollback scenarios
  • Point-in-time analysis
  • Reproducible reporting

-- Query data as of a specific timestamp
SELECT * FROM delta.`/path/to/table` TIMESTAMP AS OF '2025-08-01 00:00:00'

-- Query data as of a specific version
SELECT * FROM delta.`/path/to/table` VERSION AS OF 123
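
The same reads are available through the DataFrame API:

# Time travel via DataFrame reader options
df_at_time = spark.read.format("delta") \
    .option("timestampAsOf", "2025-08-01 00:00:00") \
    .load("/path/to/table")

df_at_version = spark.read.format("delta") \
    .option("versionAsOf", 123) \
    .load("/path/to/table")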

Schema Evolution

Delta Lake supports schema evolution, allowing tables to adapt as data structures change over time:

  • Add new columns
  • Change data types (with compatible conversions)
  • Rename columns using column mapping

-- Add a new column (existing rows read as NULL for the new column)
ALTER TABLE delta_table ADD COLUMNS (new_column STRING)
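
Schema evolution can also be applied automatically during writes. A minimal sketch, assuming new_df is a DataFrame whose schema contains additional columns:

# mergeSchema adds new columns from the DataFrame to the table schema on append
new_df.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("/path/to/table")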

Change Data Capture (CDC)

Delta Lake supports Change Data Feed, enabling downstream systems to consume only changed data:

# Enable CDC on a Delta table
spark.sql("ALTER TABLE delta_table SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

# Read changes between versions
changes = spark.read.format("delta") \
    .option("readChangeFeed", "true") \
    .option("startingVersion", 5) \
    .option("endingVersion", 10) \
    .table("delta_table")
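
For continuous consumption, the change feed can also be read as a stream rather than a bounded version range:

# Stream changes from a starting version onward
change_stream = spark.readStream.format("delta") \
    .option("readChangeFeed", "true") \
    .option("startingVersion", 5) \
    .table("delta_table")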

Optimizations

Data Skipping

  • Delta Lake maintains statistics on data files
  • Query predicates use these statistics to skip irrelevant files
  • Significantly improves query performance
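
For example, with per-file min/max statistics maintained in the transaction log, a selective predicate such as the following (column name hypothetical) reads only the files whose value ranges can match:

-- Only files whose statistics overlap the predicate are scanned
SELECT * FROM delta_table WHERE date_column = '2025-08-01'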

Z-Ordering

  • Multi-dimensional clustering technique
  • Colocates related data together
  • Improves query performance when filtering on Z-ordered columns

-- Z-order by multiple columns
OPTIMIZE delta_table ZORDER BY (date_column, region_column)

File Compaction

  • Combines small files into larger ones
  • Reduces metadata overhead
  • Improves scan performance

-- Compact files without Z-ordering
OPTIMIZE delta_table

Security and Governance

Authentication and Authorization

Azure Active Directory Integration

  • Single sign-on with Azure AD
  • Role-based access control (RBAC)
  • Integration with existing identity systems
  • Support for managed identities

Fine-grained Access Control

  • Table-level and column-level security
  • Row-level security through filtered views over Delta tables
  • Dynamic data masking for sensitive fields

Data Governance

Azure Purview Integration

  • Automated data discovery and classification
  • Data lineage tracking
  • Sensitive data identification
  • Centralized metadata management

Metadata Management

  • Schema history tracking
  • Transaction history logging
  • Origin tracking with detailed provenance
  • Integration with external metadata systems

Monitoring and Optimization

Performance Monitoring

Azure Monitor Integration

  • Resource utilization tracking
  • Query performance metrics
  • Cost analysis
  • Alerting on performance degradation

Delta-specific Metrics

  • Transaction log size and growth rate
  • Data skipping effectiveness
  • Compaction efficiency
  • Read/write throughput

Cost Optimization Strategies

Storage Optimization

  • Tiered storage policies
  • Data lifecycle management
  • Vacuum operations to remove stale files
  • Compression settings optimization

Compute Optimization

  • Right-sizing Spark pools
  • Autoscaling configurations
  • Workload isolation for predictable performance
  • Caching strategies for frequently accessed data
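
As one example of a caching strategy, a frequently accessed gold table can be pinned in executor memory for repeated interactive queries; the table name is hypothetical:

# Cache a hot gold table for repeated reads
reports_df = spark.table("gold_reports")
reports_df.cache()
reports_df.count()  # an action materializes the cache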

Integration Points

Synapse SQL Integration

  • Query Delta tables directly from Serverless SQL pools
  • Create external tables over Delta format
  • Join between Delta Lake and other data sources
  • Cross-engine queries (Spark and SQL)
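
For example, a serverless SQL pool can read a Delta folder directly with OPENROWSET; the storage account and path below are placeholders:

-- Query a Delta folder from a serverless SQL pool
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://<account>.dfs.core.windows.net/datalake/gold/reports/',
    FORMAT = 'DELTA'
) AS rows;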

Power BI Integration

  • Direct Query support for Delta tables
  • Composite models combining Delta Lake with other sources
  • Incremental refresh based on Delta Lake partitioning
  • Enterprise-scale semantic models

Azure Machine Learning

  • Feature store implementation using Delta Lake
  • Model training on Delta tables
  • Model deployment with feature versioning
  • MLOps workflows with data and model versioning

Reference Implementation

For a detailed reference implementation of Delta Lakehouse in Azure Synapse Analytics, refer to the code examples section of this documentation.