Skip to content
Learn — Azure analytics reference library covering services, architecture patterns, tutorials, solutions, monitoring, DevOps

☁️ Serverless SQL Architecture Description

Status Complexity Pattern

Detailed description and visual representation of Azure Synapse Serverless SQL Pool architecture and query processing.


🎯 Overview

Azure Synapse Serverless SQL Pool provides on-demand T-SQL query capabilities over data in Azure Data Lake without the need to provision or manage compute infrastructure. It operates on a pay-per-query model, charging only for data processed.

📊 Visual Architecture

High-Level Architecture

graph TB
    subgraph "Query Submission Layer"
        Users[Users & Applications]
        SSMS[SQL Server Management Studio]
        PowerBI[Power BI]
        AzurePortal[Azure Portal]
        JDBC[JDBC/ODBC Clients]
    end

    subgraph "Serverless SQL Pool"
        subgraph "Query Processing"
            Parser[Query Parser<br/>T-SQL Parsing]
            Optimizer[Query Optimizer<br/>Cost-based Optimization]
            Planner[Execution Planner<br/>Distributed Plan]
        end

        subgraph "Execution Engine"
            Coordinator[Query Coordinator<br/>Plan Distribution]
            Workers[Compute Workers<br/>Auto-scaled Pool]
            Cache[Result Cache<br/>48-hour TTL]
        end

        subgraph "Metadata Services"
            MetadataDB[Metadata Database<br/>Table Definitions]
            StatService[Statistics Service<br/>Data Profiles]
            SchemaDiscovery[Schema Discovery<br/>File Inference]
        end
    end

    subgraph "Storage Layer"
        subgraph "Data Lake Gen2"
            Delta[Delta Lake Tables<br/>Parquet + Transaction Logs]
            Parquet[Parquet Files<br/>Columnar Storage]
            CSV[CSV Files<br/>Delimited Text]
            JSON[JSON Files<br/>Semi-structured]
        end

        subgraph "External Sources"
            CosmosDB[Cosmos DB<br/>Analytical Store]
            BlobStorage[Blob Storage<br/>Any Format]
        end
    end

    subgraph "Security & Governance"
        AAD[Azure AD<br/>Authentication]
        RBAC[Role-based Access<br/>Permissions]
        ACLs[Storage ACLs<br/>Data-level Security]
    end

    Users --> Parser
    SSMS --> Parser
    PowerBI --> Parser
    AzurePortal --> Parser
    JDBC --> Parser

    Parser --> Optimizer
    Optimizer --> Planner
    Planner --> Coordinator

    Coordinator --> Workers
    Workers --> Cache

    Optimizer -.->|Query Statistics| StatService
    Planner -.->|Schema Info| MetadataDB
    Parser -.->|Auto-discovery| SchemaDiscovery

    Workers --> Delta
    Workers --> Parquet
    Workers --> CSV
    Workers --> JSON
    Workers --> CosmosDB
    Workers --> BlobStorage

    Users -.->|Authenticate| AAD
    Workers -.->|Check Permissions| RBAC
    Workers -.->|Data Access| ACLs

    Cache --> Users

    style Parser fill:#e1f5fe
    style Optimizer fill:#fff3e0
    style Workers fill:#e8f5e9
    style Delta fill:#f3e5f5
    style Cache fill:#fce4ec

🏗️ Architecture Components

1. Query Submission Layer

Multiple client tools and applications can connect to Serverless SQL Pool:

Client Type Protocol Authentication Best For
SQL Server Management Studio TDS (TCP 1433) Azure AD, SQL Auth Interactive queries, development
Azure Data Studio TDS (TCP 1433) Azure AD Cross-platform development
Power BI Direct Query, Import Azure AD Business intelligence, dashboards
JDBC/ODBC Standard drivers Azure AD, connection string Application integration
Azure Portal Web interface Azure AD Quick exploration, testing

Connection String Example:

Server=tcp:your-workspace-ondemand.sql.azuresynapse.net,1433;
Database=your-database;
Authentication=Active Directory Integrated;
Encrypt=yes;
TrustServerCertificate=no;

2. Query Processing Pipeline

Query Parser

graph LR
    SQLQuery[T-SQL Query] --> Lexer[Lexical Analysis<br/>Tokenization]
    Lexer --> Syntax[Syntax Analysis<br/>Parse Tree]
    Syntax --> Semantic[Semantic Analysis<br/>Validation]
    Semantic --> IR[Intermediate<br/>Representation]

    style SQLQuery fill:#e8f5e9
    style Lexer fill:#fff3e0
    style Syntax fill:#e1f5fe
    style Semantic fill:#f3e5f5
    style IR fill:#fce4ec

Parser Functions:

  • Tokenization: Break down SQL text into tokens
  • Syntax Validation: Ensure T-SQL syntax correctness
  • Semantic Analysis: Validate object references, data types
  • IR Generation: Create internal query representation

Query Optimizer

graph TB
    IR[Intermediate<br/>Representation] --> Logical[Logical Plan<br/>Generation]
    Logical --> Candidates[Multiple Plan<br/>Candidates]

    Candidates --> CostEstimation{Cost<br/>Estimation}

    CostEstimation -->|Statistics| DataStats[Data Statistics<br/>Row counts, distributions]
    CostEstimation -->|File Info| FileStats[File Statistics<br/>Size, partitions]
    CostEstimation -->|Index Info| IndexStats[Index Statistics<br/>If available]

    DataStats --> BestPlan[Select Best<br/>Execution Plan]
    FileStats --> BestPlan
    IndexStats --> BestPlan

    BestPlan --> Physical[Physical Plan<br/>Distributed Operations]

    style IR fill:#e1f5fe
    style Logical fill:#fff3e0
    style CostEstimation fill:#f3e5f5
    style BestPlan fill:#4caf50
    style Physical fill:#e8f5e9

Optimization Techniques:

  1. Predicate Pushdown: Filter early in data scan
  2. Partition Elimination: Skip irrelevant partitions
  3. Column Pruning: Read only required columns
  4. Join Optimization: Choose optimal join strategy
  5. Aggregation Pushdown: Aggregate before data transfer

Example Optimization:

-- Original query
SELECT customer_id, SUM(amount) as total
FROM sales.transactions
WHERE transaction_date >= '2025-01-01'
GROUP BY customer_id;

-- Optimized execution:
-- 1. Partition elimination (only 2025 partitions)
-- 2. Column pruning (read only customer_id, amount, transaction_date)
-- 3. Predicate pushdown (filter applied during file scan)
-- 4. Aggregation pushdown (partial aggregates per file)

Execution Planner

graph TB
    PhysicalPlan[Physical<br/>Execution Plan] --> Stages[Divide into<br/>Execution Stages]

    Stages --> Stage1[Stage 1: Data Scan<br/>Parallel file readers]
    Stages --> Stage2[Stage 2: Shuffle<br/>Data redistribution]
    Stages --> Stage3[Stage 3: Aggregation<br/>Final results]

    Stage1 --> Workers1[Worker Pool 1<br/>File Scanning]
    Stage2 --> Workers2[Worker Pool 2<br/>Data Shuffling]
    Stage3 --> Workers3[Worker Pool 3<br/>Aggregation]

    Workers1 --> Exchange1[Data Exchange<br/>Network Transfer]
    Workers2 --> Exchange2[Data Exchange<br/>Network Transfer]

    Exchange1 --> Workers2
    Exchange2 --> Workers3

    Workers3 --> Results[Query Results]

    style PhysicalPlan fill:#e1f5fe
    style Stage1 fill:#fff9c4
    style Stage2 fill:#f3e5f5
    style Stage3 fill:#e8f5e9
    style Results fill:#4caf50

3. Execution Engine

Query Coordinator

Responsibilities:

  • Receive execution plan from planner
  • Allocate compute workers dynamically
  • Distribute work across workers
  • Monitor execution progress
  • Handle failures and retries
  • Collect and merge results
graph TB
    Coordinator[Query Coordinator]

    Coordinator --> Allocation[Worker Allocation<br/>Based on query complexity]
    Coordinator --> Distribution[Work Distribution<br/>Parallel tasks]
    Coordinator --> Monitoring[Execution Monitoring<br/>Progress tracking]
    Coordinator --> ResultMerge[Result Merging<br/>Combine worker outputs]

    Allocation --> Workers[Compute Worker Pool]
    Distribution --> Workers
    Workers --> Monitoring
    Workers --> ResultMerge

    style Coordinator fill:#fff3e0
    style Allocation fill:#e1f5fe
    style Workers fill:#e8f5e9
    style ResultMerge fill:#4caf50

Compute Workers

Auto-scaling Characteristics:

graph LR
    subgraph "Worker Scaling"
        QueryLoad[Query Complexity<br/>& Data Volume]
        ScaleLogic{Scaling<br/>Decision}
        SmallPool[Small Pool<br/>4-8 workers]
        MediumPool[Medium Pool<br/>16-32 workers]
        LargePool[Large Pool<br/>64+ workers]
    end

    QueryLoad --> ScaleLogic
    ScaleLogic -->|Simple query| SmallPool
    ScaleLogic -->|Medium query| MediumPool
    ScaleLogic -->|Complex query| LargePool

    style QueryLoad fill:#e8f5e9
    style SmallPool fill:#c8e6c9
    style MediumPool fill:#fff9c4
    style LargePool fill:#ffccbc

Worker Operations:

Operation Description Parallelization
File Scanning Read data from storage files High (per file)
Filtering Apply WHERE clause predicates High (per partition)
Projection Select specific columns High (per row group)
Aggregation GROUP BY and aggregate functions Medium (partial aggregates)
Joins Combine data from multiple sources Medium (hash/merge joins)
Sorting ORDER BY operations Low (global sort required)

Result Cache

graph TB
    Query[User Query] --> CacheCheck{Result in<br/>Cache?}

    CacheCheck -->|Yes, recent| Cached[Return Cached<br/>Result]
    CacheCheck -->|No| Execute[Execute Query]

    Execute --> Store[Store Result<br/>in Cache]
    Store --> Return[Return Result]

    Cached -.->|TTL: 48 hours| AutoEvict[Auto-eviction]
    Store -.->|Set TTL| TTL[48-hour Expiration]

    style CacheCheck fill:#fff3e0
    style Cached fill:#4caf50
    style Execute fill:#e1f5fe
    style Store fill:#e8f5e9

Cache Characteristics:

  • TTL: 48 hours automatic expiration
  • Invalidation: Automatic when underlying data changes
  • Scope: Per-user and per-query
  • Storage: Temporary blob storage
  • Billing: Cached results not charged on subsequent access

4. Metadata Services

Metadata Database

Storage:

  • Azure SQL Database backend
  • Workspace-level metadata repository
  • Stores table definitions, views, procedures

Metadata Types:

-- External tables over data lake files
CREATE EXTERNAL TABLE sales.transactions (
    transaction_id INT,
    customer_id INT,
    amount DECIMAL(10,2),
    transaction_date DATE
)
WITH (
    LOCATION = '/sales/transactions/',
    DATA_SOURCE = DataLakeSource,
    FILE_FORMAT = ParquetFormat
);

-- Views for business logic
CREATE VIEW sales.monthly_summary AS
SELECT
    YEAR(transaction_date) as year,
    MONTH(transaction_date) as month,
    SUM(amount) as total_sales
FROM sales.transactions
GROUP BY YEAR(transaction_date), MONTH(transaction_date);

-- Stored procedures for reusable logic
CREATE PROCEDURE sales.GetCustomerSummary
    @customer_id INT
AS
BEGIN
    SELECT
        customer_id,
        COUNT(*) as transaction_count,
        SUM(amount) as total_spent
    FROM sales.transactions
    WHERE customer_id = @customer_id
    GROUP BY customer_id;
END;

Statistics Service

Automatic Statistics:

graph LR
    DataScan[File Scan] --> StatCollection[Collect Statistics<br/>During Query]
    StatCollection --> Store[Store in<br/>Metadata DB]
    Store --> Optimizer[Used by<br/>Query Optimizer]

    subgraph "Statistics Collected"
        RowCount[Row Counts]
        NullCount[Null Counts]
        MinMax[Min/Max Values]
        Histogram[Value Histograms]
    end

    StatCollection --> RowCount
    StatCollection --> NullCount
    StatCollection --> MinMax
    StatCollection --> Histogram

    style DataScan fill:#e8f5e9
    style StatCollection fill:#fff3e0
    style Store fill:#e1f5fe
    style Optimizer fill:#4caf50

Manual Statistics:

-- Create statistics manually for better performance
CREATE STATISTICS customer_stats
ON sales.transactions(customer_id)
WITH FULLSCAN;

-- View existing statistics
SELECT * FROM sys.stats
WHERE object_id = OBJECT_ID('sales.transactions');

Schema Discovery

Automatic Schema Inference:

graph TB
    File[Data File] --> Type{File<br/>Type?}

    Type -->|Parquet| ParquetSchema[Read Parquet<br/>Metadata]
    Type -->|Delta| DeltaSchema[Read Delta<br/>Transaction Log]
    Type -->|CSV| CSVInfer[Infer from<br/>First 100 Rows]
    Type -->|JSON| JSONInfer[Infer from<br/>Sample Records]

    ParquetSchema --> Schema[Inferred Schema]
    DeltaSchema --> Schema
    CSVInfer --> Schema
    JSONInfer --> Schema

    Schema --> ExternalTable[Auto-create<br/>External Table]

    style File fill:#e8f5e9
    style ParquetSchema fill:#e1f5fe
    style DeltaSchema fill:#fff3e0
    style Schema fill:#4caf50
    style ExternalTable fill:#f3e5f5

Schema Inference Examples:

-- Automatic schema discovery with OPENROWSET
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://datalake.dfs.core.windows.net/data/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS [result];

-- Schema inference with type specification
SELECT *
FROM OPENROWSET(
    BULK 'https://datalake.dfs.core.windows.net/data/sales/*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) WITH (
    transaction_id INT,
    amount DECIMAL(10,2),
    transaction_date DATE
) AS [result];

💾 Storage Layer Integration

Supported File Formats

Format Read Performance Compression Schema Evolution Best For
Delta Lake Excellent Snappy, Gzip ✅ Yes ACID requirements, updates
Parquet Excellent Snappy, Gzip ⚠️ Limited Columnar analytics
CSV Fair Gzip ❌ No Simple data, compatibility
JSON Good Gzip ⚠️ Limited Semi-structured, flexibility

Query Patterns by Format

Delta Lake Queries

-- Query Delta table directly
SELECT *
FROM OPENROWSET(
    BULK 'https://datalake.dfs.core.windows.net/gold/sales/',
    FORMAT = 'DELTA'
) AS [sales];

-- Time travel query
SELECT *
FROM OPENROWSET(
    BULK 'https://datalake.dfs.core.windows.net/gold/sales/',
    FORMAT = 'DELTA'
) WITH (VERSION = 5) AS [sales];

-- Create external table over Delta
CREATE EXTERNAL TABLE gold.sales
WITH (
    LOCATION = '/gold/sales/',
    DATA_SOURCE = DataLakeSource,
    FILE_FORMAT = DeltaFormat
);

Parquet Queries

-- Direct Parquet query with partition elimination
SELECT
    customer_id,
    SUM(amount) as total
FROM OPENROWSET(
    BULK 'https://datalake.dfs.core.windows.net/sales/year=2025/month=01/*.parquet',
    FORMAT = 'PARQUET'
) AS [result]
WHERE transaction_date >= '2025-01-01'
GROUP BY customer_id;

-- Wildcard patterns for multiple partitions
SELECT *
FROM OPENROWSET(
    BULK 'https://datalake.dfs.core.windows.net/sales/year=2025/month=*/day=*/*.parquet',
    FORMAT = 'PARQUET'
) AS [result]
WHERE filepath(1) IN ('01', '02')  -- Only Jan and Feb
  AND filepath(2) > '15';           -- After 15th of month

Data Access Patterns

graph TB
    subgraph "Access Methods"
        OPENROWSET[OPENROWSET<br/>Ad-hoc queries]
        ExternalTable[EXTERNAL TABLE<br/>Reusable definitions]
        View[VIEW<br/>Business logic layer]
    end

    subgraph "Security Layers"
        AAD[Azure AD<br/>Authentication]
        RBAC[Synapse RBAC<br/>Workspace permissions]
        StorageACL[Storage ACLs<br/>File/folder permissions]
    end

    OPENROWSET --> AAD
    ExternalTable --> AAD
    View --> AAD

    AAD --> RBAC
    RBAC --> StorageACL

    StorageACL --> DataAccess[Data Access<br/>Granted]

    style OPENROWSET fill:#e1f5fe
    style ExternalTable fill:#fff3e0
    style View fill:#e8f5e9
    style DataAccess fill:#4caf50

🔒 Security & Governance

Authentication Flow

sequenceDiagram
    participant User
    participant SynapseSQL as Serverless SQL Pool
    participant AAD as Azure Active Directory
    participant Storage as Data Lake Storage

    User->>SynapseSQL: Connect with Azure AD
    SynapseSQL->>AAD: Validate credentials
    AAD->>SynapseSQL: Return access token

    User->>SynapseSQL: Submit query
    SynapseSQL->>SynapseSQL: Check workspace RBAC
    SynapseSQL->>Storage: Access data with user identity
    Storage->>Storage: Check storage ACLs
    Storage->>SynapseSQL: Return data (if authorized)
    SynapseSQL->>User: Return query results

Authorization Layers

Layer Purpose Granularity Management
Azure AD User authentication User/group identity Azure portal
Synapse RBAC Workspace permissions Workspace/pool level Synapse Studio
SQL Permissions Database object access Table/view/procedure T-SQL GRANT/DENY
Storage ACLs File/folder access File/folder level Storage portal/SDK

Security Best Practices:

-- Grant minimum required permissions
GRANT SELECT ON SCHEMA::sales TO [DataAnalysts];

-- Use database scoped credentials for external data sources
CREATE DATABASE SCOPED CREDENTIAL DataLakeCredential
WITH IDENTITY = 'Managed Identity';

CREATE EXTERNAL DATA SOURCE DataLakeSource
WITH (
    LOCATION = 'https://datalake.dfs.core.windows.net/data',
    CREDENTIAL = DataLakeCredential
);

-- Row-level security with predicates
CREATE SECURITY POLICY SalesFilter
ADD FILTER PREDICATE dbo.fn_securitypredicate(SalesPersonID)
ON sales.transactions
WITH (STATE = ON);

⚡ Performance Optimization

Query Optimization Techniques

1. Partition Elimination

-- Inefficient: Full scan
SELECT * FROM sales.transactions
WHERE transaction_date >= '2025-01-01';

-- Efficient: Partition elimination
SELECT * FROM sales.transactions
WHERE year = 2025 AND month = 1;
-- Folder structure: /sales/year=2025/month=01/

2. Column Pruning

-- Inefficient: Read all columns
SELECT * FROM sales.transactions;

-- Efficient: Select only needed columns
SELECT customer_id, amount, transaction_date
FROM sales.transactions;
-- Only 3 columns read from Parquet

3. Predicate Pushdown

-- Efficient: Filter pushed to file scan
SELECT customer_id, SUM(amount) as total
FROM sales.transactions
WHERE transaction_date >= '2025-01-01'
  AND amount > 100
GROUP BY customer_id;
-- Filters applied during file read, reduces data transfer

4. File Size Optimization

graph LR
    subgraph "File Size Impact"
        TooSmall[Too Small<br/>< 10MB<br/>❌ Overhead]
        Optimal[Optimal<br/>100MB-1GB<br/>✅ Best Performance]
        TooLarge[Too Large<br/>> 1GB<br/>⚠️ Limited Parallelism]
    end

    TooSmall -.->|Consolidate| Optimal
    TooLarge -.->|Partition| Optimal

    style TooSmall fill:#ffebee
    style Optimal fill:#e8f5e9
    style TooLarge fill:#fff9c4

File Consolidation Example:

# Spark job to consolidate small files
df = spark.read.format("delta").load("/path/to/table")
df.coalesce(10).write.format("delta").mode("overwrite").save("/path/to/table")

Cost Optimization

Data Processed Calculation

graph TB
    Query[User Query] --> Parse[Parse Query<br/>Identify Files]
    Parse --> Partition[Apply Partition<br/>Elimination]
    Partition --> Column[Apply Column<br/>Pruning]
    Column --> Scan[Actual Data<br/>Scanned]
    Scan --> Charge[Charge = Data Scanned<br/>$/TB]

    style Query fill:#e8f5e9
    style Partition fill:#fff3e0
    style Column fill:#e1f5fe
    style Scan fill:#f3e5f5
    style Charge fill:#ffccbc

Cost Comparison:

Scenario Files Scanned Data Scanned Cost ($/TB) Total Cost
Full table scan (CSV) 1000 files 10 TB $5 $50
Partition elimination (Parquet) 100 files 1 TB $5 $5
Column pruning (Parquet) 100 files 0.2 TB $5 $1
Cached result (2nd run) 0 files 0 TB $0 $0

Cost Optimization Tips:

  1. Use Parquet/Delta: 10-100x better compression than CSV
  2. Partition Data: Reduce scanned files
  3. Query Only Needed Columns: Reduce scanned data
  4. Use Views: Encapsulate optimization logic
  5. Leverage Cache: Repeated queries are free

📊 Monitoring & Troubleshooting

Query Performance Monitoring

-- View query execution statistics
SELECT
    request_id,
    status,
    total_elapsed_time,
    data_processed_mb,
    result_set_size_kb,
    command
FROM sys.dm_exec_requests
WHERE session_id = SESSION_ID();

-- Historical query performance
SELECT
    request_id,
    start_time,
    end_time,
    DATEDIFF(SECOND, start_time, end_time) as duration_seconds,
    data_processed_mb / 1024.0 as data_processed_gb,
    command
FROM sys.dm_exec_requests_history
WHERE session_id = SESSION_ID()
ORDER BY start_time DESC;

Common Issues & Solutions

Issue Symptom Solution
Slow queries Long execution time Add partitioning, use Parquet, create statistics
High costs Unexpected charges Optimize queries, use views, partition data
Permission errors Access denied Check AAD, RBAC, storage ACLs
Schema errors Column not found Recreate external table, check file format
Timeout errors Query timeout Optimize query, reduce data scanned

🎯 Best Practices

Design Principles

  1. Use External Tables for frequently accessed datasets
  2. Partition Data by commonly filtered columns (date, region, category)
  3. Use Parquet or Delta for optimal performance and cost
  4. Create Statistics on join and filter columns
  5. Implement Views for reusable business logic
  6. Monitor Costs regularly and optimize expensive queries

Anti-Patterns to Avoid

  • ❌ SELECT * queries on large datasets
  • ❌ Querying CSV files for large-scale analytics
  • ❌ Small files (< 10MB) in data lake
  • ❌ Missing partitioning on time-series data
  • ❌ Not leveraging result cache

Architecture Documentation

Implementation Guides

Performance & Cost


Last Updated: 2025-01-28 Architecture Version: 2.0 Pattern: Serverless Query Processing