☁️ Serverless SQL Architecture Description¶

Detailed description and visual representation of Azure Synapse Serverless SQL Pool architecture and query processing.

🎯 Overview¶

Azure Synapse Serverless SQL Pool provides on-demand T-SQL query capabilities over data in Azure Data Lake without the need to provision or manage compute infrastructure. It operates on a pay-per-query model, charging only for data processed.

📊 Visual Architecture¶

High-Level Architecture¶

graph TB
    subgraph "Query Submission Layer"
        Users[Users & Applications]
        SSMS[SQL Server Management Studio]
        PowerBI[Power BI]
        AzurePortal[Azure Portal]
        JDBC[JDBC/ODBC Clients]
    end

    subgraph "Serverless SQL Pool"
        subgraph "Query Processing"
            Parser[Query Parser<br/>T-SQL Parsing]
            Optimizer[Query Optimizer<br/>Cost-based Optimization]
            Planner[Execution Planner<br/>Distributed Plan]
        end

        subgraph "Execution Engine"
            Coordinator[Query Coordinator<br/>Plan Distribution]
            Workers[Compute Workers<br/>Auto-scaled Pool]
            Cache[Result Cache<br/>48-hour TTL]
        end

        subgraph "Metadata Services"
            MetadataDB[Metadata Database<br/>Table Definitions]
            StatService[Statistics Service<br/>Data Profiles]
            SchemaDiscovery[Schema Discovery<br/>File Inference]
        end
    end

    subgraph "Storage Layer"
        subgraph "Data Lake Gen2"
            Delta[Delta Lake Tables<br/>Parquet + Transaction Logs]
            Parquet[Parquet Files<br/>Columnar Storage]
            CSV[CSV Files<br/>Delimited Text]
            JSON[JSON Files<br/>Semi-structured]
        end

        subgraph "External Sources"
            CosmosDB[Cosmos DB<br/>Analytical Store]
            BlobStorage[Blob Storage<br/>Any Format]
        end
    end

    subgraph "Security & Governance"
        AAD[Azure AD<br/>Authentication]
        RBAC[Role-based Access<br/>Permissions]
        ACLs[Storage ACLs<br/>Data-level Security]
    end

    Users --> Parser
    SSMS --> Parser
    PowerBI --> Parser
    AzurePortal --> Parser
    JDBC --> Parser

    Parser --> Optimizer
    Optimizer --> Planner
    Planner --> Coordinator

    Coordinator --> Workers
    Workers --> Cache

    Optimizer -.->|Query Statistics| StatService
    Planner -.->|Schema Info| MetadataDB
    Parser -.->|Auto-discovery| SchemaDiscovery

    Workers --> Delta
    Workers --> Parquet
    Workers --> CSV
    Workers --> JSON
    Workers --> CosmosDB
    Workers --> BlobStorage

    Users -.->|Authenticate| AAD
    Workers -.->|Check Permissions| RBAC
    Workers -.->|Data Access| ACLs

    Cache --> Users

    style Parser fill:#e1f5fe
    style Optimizer fill:#fff3e0
    style Workers fill:#e8f5e9
    style Delta fill:#f3e5f5
    style Cache fill:#fce4ec

🏗️ Architecture Components¶

1. Query Submission Layer¶

Multiple client tools and applications can connect to Serverless SQL Pool:

Client Type	Protocol	Authentication	Best For
SQL Server Management Studio	TDS (TCP 1433)	Azure AD, SQL Auth	Interactive queries, development
Azure Data Studio	TDS (TCP 1433)	Azure AD	Cross-platform development
Power BI	Direct Query, Import	Azure AD	Business intelligence, dashboards
JDBC/ODBC	Standard drivers	Azure AD, connection string	Application integration
Azure Portal	Web interface	Azure AD	Quick exploration, testing

Connection String Example:

Server=tcp:your-workspace-ondemand.sql.azuresynapse.net,1433;
Database=your-database;
Authentication=Active Directory Integrated;
Encrypt=yes;
TrustServerCertificate=no;

2. Query Processing Pipeline¶

Query Parser¶

graph LR
    SQLQuery[T-SQL Query] --> Lexer[Lexical Analysis<br/>Tokenization]
    Lexer --> Syntax[Syntax Analysis<br/>Parse Tree]
    Syntax --> Semantic[Semantic Analysis<br/>Validation]
    Semantic --> IR[Intermediate<br/>Representation]

    style SQLQuery fill:#e8f5e9
    style Lexer fill:#fff3e0
    style Syntax fill:#e1f5fe
    style Semantic fill:#f3e5f5
    style IR fill:#fce4ec

Parser Functions:

Tokenization: Break down SQL text into tokens
Syntax Validation: Ensure T-SQL syntax correctness
Semantic Analysis: Validate object references, data types
IR Generation: Create internal query representation

Query Optimizer¶

graph TB
    IR[Intermediate<br/>Representation] --> Logical[Logical Plan<br/>Generation]
    Logical --> Candidates[Multiple Plan<br/>Candidates]

    Candidates --> CostEstimation{Cost<br/>Estimation}

    CostEstimation -->|Statistics| DataStats[Data Statistics<br/>Row counts, distributions]
    CostEstimation -->|File Info| FileStats[File Statistics<br/>Size, partitions]
    CostEstimation -->|Index Info| IndexStats[Index Statistics<br/>If available]

    DataStats --> BestPlan[Select Best<br/>Execution Plan]
    FileStats --> BestPlan
    IndexStats --> BestPlan

    BestPlan --> Physical[Physical Plan<br/>Distributed Operations]

    style IR fill:#e1f5fe
    style Logical fill:#fff3e0
    style CostEstimation fill:#f3e5f5
    style BestPlan fill:#4caf50
    style Physical fill:#e8f5e9

Optimization Techniques:

Predicate Pushdown: Filter early in data scan
Partition Elimination: Skip irrelevant partitions
Column Pruning: Read only required columns
Join Optimization: Choose optimal join strategy
Aggregation Pushdown: Aggregate before data transfer

Example Optimization:

-- Original query
SELECT customer_id, SUM(amount) as total
FROM sales.transactions
WHERE transaction_date >= '2025-01-01'
GROUP BY customer_id;

-- Optimized execution:
-- 1. Partition elimination (only 2025 partitions)
-- 2. Column pruning (read only customer_id, amount, transaction_date)
-- 3. Predicate pushdown (filter applied during file scan)
-- 4. Aggregation pushdown (partial aggregates per file)

Execution Planner¶

graph TB
    PhysicalPlan[Physical<br/>Execution Plan] --> Stages[Divide into<br/>Execution Stages]

    Stages --> Stage1[Stage 1: Data Scan<br/>Parallel file readers]
    Stages --> Stage2[Stage 2: Shuffle<br/>Data redistribution]
    Stages --> Stage3[Stage 3: Aggregation<br/>Final results]

    Stage1 --> Workers1[Worker Pool 1<br/>File Scanning]
    Stage2 --> Workers2[Worker Pool 2<br/>Data Shuffling]
    Stage3 --> Workers3[Worker Pool 3<br/>Aggregation]

    Workers1 --> Exchange1[Data Exchange<br/>Network Transfer]
    Workers2 --> Exchange2[Data Exchange<br/>Network Transfer]

    Exchange1 --> Workers2
    Exchange2 --> Workers3

    Workers3 --> Results[Query Results]

    style PhysicalPlan fill:#e1f5fe
    style Stage1 fill:#fff9c4
    style Stage2 fill:#f3e5f5
    style Stage3 fill:#e8f5e9
    style Results fill:#4caf50

3. Execution Engine¶

Query Coordinator¶

Responsibilities:

Receive execution plan from planner
Allocate compute workers dynamically
Distribute work across workers
Monitor execution progress
Handle failures and retries
Collect and merge results

graph TB
    Coordinator[Query Coordinator]

    Coordinator --> Allocation[Worker Allocation<br/>Based on query complexity]
    Coordinator --> Distribution[Work Distribution<br/>Parallel tasks]
    Coordinator --> Monitoring[Execution Monitoring<br/>Progress tracking]
    Coordinator --> ResultMerge[Result Merging<br/>Combine worker outputs]

    Allocation --> Workers[Compute Worker Pool]
    Distribution --> Workers
    Workers --> Monitoring
    Workers --> ResultMerge

    style Coordinator fill:#fff3e0
    style Allocation fill:#e1f5fe
    style Workers fill:#e8f5e9
    style ResultMerge fill:#4caf50

Compute Workers¶

Auto-scaling Characteristics:

graph LR
    subgraph "Worker Scaling"
        QueryLoad[Query Complexity<br/>& Data Volume]
        ScaleLogic{Scaling<br/>Decision}
        SmallPool[Small Pool<br/>4-8 workers]
        MediumPool[Medium Pool<br/>16-32 workers]
        LargePool[Large Pool<br/>64+ workers]
    end

    QueryLoad --> ScaleLogic
    ScaleLogic -->|Simple query| SmallPool
    ScaleLogic -->|Medium query| MediumPool
    ScaleLogic -->|Complex query| LargePool

    style QueryLoad fill:#e8f5e9
    style SmallPool fill:#c8e6c9
    style MediumPool fill:#fff9c4
    style LargePool fill:#ffccbc

Worker Operations:

Operation	Description	Parallelization
File Scanning	Read data from storage files	High (per file)
Filtering	Apply WHERE clause predicates	High (per partition)
Projection	Select specific columns	High (per row group)
Aggregation	GROUP BY and aggregate functions	Medium (partial aggregates)
Joins	Combine data from multiple sources	Medium (hash/merge joins)
Sorting	ORDER BY operations	Low (global sort required)

Result Cache¶

graph TB
    Query[User Query] --> CacheCheck{Result in<br/>Cache?}

    CacheCheck -->|Yes, recent| Cached[Return Cached<br/>Result]
    CacheCheck -->|No| Execute[Execute Query]

    Execute --> Store[Store Result<br/>in Cache]
    Store --> Return[Return Result]

    Cached -.->|TTL: 48 hours| AutoEvict[Auto-eviction]
    Store -.->|Set TTL| TTL[48-hour Expiration]

    style CacheCheck fill:#fff3e0
    style Cached fill:#4caf50
    style Execute fill:#e1f5fe
    style Store fill:#e8f5e9

Cache Characteristics:

TTL: 48 hours automatic expiration
Invalidation: Automatic when underlying data changes
Scope: Per-user and per-query
Storage: Temporary blob storage
Billing: Cached results not charged on subsequent access

4. Metadata Services¶

Metadata Database¶

Storage:

Azure SQL Database backend
Workspace-level metadata repository
Stores table definitions, views, procedures

Metadata Types:

-- External tables over data lake files
CREATE EXTERNAL TABLE sales.transactions (
    transaction_id INT,
    customer_id INT,
    amount DECIMAL(10,2),
    transaction_date DATE
)
WITH (
    LOCATION = '/sales/transactions/',
    DATA_SOURCE = DataLakeSource,
    FILE_FORMAT = ParquetFormat
);

-- Views for business logic
CREATE VIEW sales.monthly_summary AS
SELECT
    YEAR(transaction_date) as year,
    MONTH(transaction_date) as month,
    SUM(amount) as total_sales
FROM sales.transactions
GROUP BY YEAR(transaction_date), MONTH(transaction_date);

-- Stored procedures for reusable logic
CREATE PROCEDURE sales.GetCustomerSummary
    @customer_id INT
AS
BEGIN
    SELECT
        customer_id,
        COUNT(*) as transaction_count,
        SUM(amount) as total_spent
    FROM sales.transactions
    WHERE customer_id = @customer_id
    GROUP BY customer_id;
END;

Statistics Service¶

Automatic Statistics:

graph LR
    DataScan[File Scan] --> StatCollection[Collect Statistics<br/>During Query]
    StatCollection --> Store[Store in<br/>Metadata DB]
    Store --> Optimizer[Used by<br/>Query Optimizer]

    subgraph "Statistics Collected"
        RowCount[Row Counts]
        NullCount[Null Counts]
        MinMax[Min/Max Values]
        Histogram[Value Histograms]
    end

    StatCollection --> RowCount
    StatCollection --> NullCount
    StatCollection --> MinMax
    StatCollection --> Histogram

    style DataScan fill:#e8f5e9
    style StatCollection fill:#fff3e0
    style Store fill:#e1f5fe
    style Optimizer fill:#4caf50

Manual Statistics:

-- Create statistics manually for better performance
CREATE STATISTICS customer_stats
ON sales.transactions(customer_id)
WITH FULLSCAN;

-- View existing statistics
SELECT * FROM sys.stats
WHERE object_id = OBJECT_ID('sales.transactions');

Schema Discovery¶

Automatic Schema Inference:

graph TB
    File[Data File] --> Type{File<br/>Type?}

    Type -->|Parquet| ParquetSchema[Read Parquet<br/>Metadata]
    Type -->|Delta| DeltaSchema[Read Delta<br/>Transaction Log]
    Type -->|CSV| CSVInfer[Infer from<br/>First 100 Rows]
    Type -->|JSON| JSONInfer[Infer from<br/>Sample Records]

    ParquetSchema --> Schema[Inferred Schema]
    DeltaSchema --> Schema
    CSVInfer --> Schema
    JSONInfer --> Schema

    Schema --> ExternalTable[Auto-create<br/>External Table]

    style File fill:#e8f5e9
    style ParquetSchema fill:#e1f5fe
    style DeltaSchema fill:#fff3e0
    style Schema fill:#4caf50
    style ExternalTable fill:#f3e5f5

Schema Inference Examples:

-- Automatic schema discovery with OPENROWSET
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://datalake.dfs.core.windows.net/data/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS [result];

-- Schema inference with type specification
SELECT *
FROM OPENROWSET(
    BULK 'https://datalake.dfs.core.windows.net/data/sales/*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) WITH (
    transaction_id INT,
    amount DECIMAL(10,2),
    transaction_date DATE
) AS [result];

💾 Storage Layer Integration¶

Supported File Formats¶

Format	Compression	Schema Evolution	Best For
Delta Lake	Snappy, Gzip	✅ Yes	ACID requirements, updates
Parquet	Snappy, Gzip	⚠️ Limited	Columnar analytics
CSV	Gzip	❌ No	Simple data, compatibility
JSON	Gzip	⚠️ Limited	Semi-structured, flexibility

Query Patterns by Format¶

Delta Lake Queries¶

-- Query Delta table directly
SELECT *
FROM OPENROWSET(
    BULK 'https://datalake.dfs.core.windows.net/gold/sales/',
    FORMAT = 'DELTA'
) AS [sales];

-- Time travel query
SELECT *
FROM OPENROWSET(
    BULK 'https://datalake.dfs.core.windows.net/gold/sales/',
    FORMAT = 'DELTA'
) WITH (VERSION = 5) AS [sales];

-- Create external table over Delta
CREATE EXTERNAL TABLE gold.sales
WITH (
    LOCATION = '/gold/sales/',
    DATA_SOURCE = DataLakeSource,
    FILE_FORMAT = DeltaFormat
);

Parquet Queries¶

-- Direct Parquet query with partition elimination
SELECT
    customer_id,
    SUM(amount) as total
FROM OPENROWSET(
    BULK 'https://datalake.dfs.core.windows.net/sales/year=2025/month=01/*.parquet',
    FORMAT = 'PARQUET'
) AS [result]
WHERE transaction_date >= '2025-01-01'
GROUP BY customer_id;

-- Wildcard patterns for multiple partitions
SELECT *
FROM OPENROWSET(
    BULK 'https://datalake.dfs.core.windows.net/sales/year=2025/month=*/day=*/*.parquet',
    FORMAT = 'PARQUET'
) AS [result]
WHERE filepath(1) IN ('01', '02')  -- Only Jan and Feb
  AND filepath(2) > '15';           -- After 15th of month

Data Access Patterns¶

graph TB
    subgraph "Access Methods"
        OPENROWSET[OPENROWSET<br/>Ad-hoc queries]
        ExternalTable[EXTERNAL TABLE<br/>Reusable definitions]
        View[VIEW<br/>Business logic layer]
    end

    subgraph "Security Layers"
        AAD[Azure AD<br/>Authentication]
        RBAC[Synapse RBAC<br/>Workspace permissions]
        StorageACL[Storage ACLs<br/>File/folder permissions]
    end

    OPENROWSET --> AAD
    ExternalTable --> AAD
    View --> AAD

    AAD --> RBAC
    RBAC --> StorageACL

    StorageACL --> DataAccess[Data Access<br/>Granted]

    style OPENROWSET fill:#e1f5fe
    style ExternalTable fill:#fff3e0
    style View fill:#e8f5e9
    style DataAccess fill:#4caf50

🔒 Security & Governance¶

Authentication Flow¶

sequenceDiagram
    participant User
    participant SynapseSQL as Serverless SQL Pool
    participant AAD as Azure Active Directory
    participant Storage as Data Lake Storage

    User->>SynapseSQL: Connect with Azure AD
    SynapseSQL->>AAD: Validate credentials
    AAD->>SynapseSQL: Return access token

    User->>SynapseSQL: Submit query
    SynapseSQL->>SynapseSQL: Check workspace RBAC
    SynapseSQL->>Storage: Access data with user identity
    Storage->>Storage: Check storage ACLs
    Storage->>SynapseSQL: Return data (if authorized)
    SynapseSQL->>User: Return query results

Authorization Layers¶

Layer	Purpose	Granularity	Management
Azure AD	User authentication	User/group identity	Azure portal
Synapse RBAC	Workspace permissions	Workspace/pool level	Synapse Studio
SQL Permissions	Database object access	Table/view/procedure	T-SQL GRANT/DENY
Storage ACLs	File/folder access	File/folder level	Storage portal/SDK

Security Best Practices:

-- Grant minimum required permissions
GRANT SELECT ON SCHEMA::sales TO [DataAnalysts];

-- Use database scoped credentials for external data sources
CREATE DATABASE SCOPED CREDENTIAL DataLakeCredential
WITH IDENTITY = 'Managed Identity';

CREATE EXTERNAL DATA SOURCE DataLakeSource
WITH (
    LOCATION = 'https://datalake.dfs.core.windows.net/data',
    CREDENTIAL = DataLakeCredential
);

-- Row-level security with predicates
CREATE SECURITY POLICY SalesFilter
ADD FILTER PREDICATE dbo.fn_securitypredicate(SalesPersonID)
ON sales.transactions
WITH (STATE = ON);

⚡ Performance Optimization¶

Query Optimization Techniques¶

1. Partition Elimination¶

-- Inefficient: Full scan
SELECT * FROM sales.transactions
WHERE transaction_date >= '2025-01-01';

-- Efficient: Partition elimination
SELECT * FROM sales.transactions
WHERE year = 2025 AND month = 1;
-- Folder structure: /sales/year=2025/month=01/

2. Column Pruning¶

-- Inefficient: Read all columns
SELECT * FROM sales.transactions;

-- Efficient: Select only needed columns
SELECT customer_id, amount, transaction_date
FROM sales.transactions;
-- Only 3 columns read from Parquet

3. Predicate Pushdown¶

-- Efficient: Filter pushed to file scan
SELECT customer_id, SUM(amount) as total
FROM sales.transactions
WHERE transaction_date >= '2025-01-01'
  AND amount > 100
GROUP BY customer_id;
-- Filters applied during file read, reduces data transfer

4. File Size Optimization¶

graph LR
    subgraph "File Size Impact"
        TooSmall[Too Small<br/>< 10MB<br/>❌ Overhead]
        Optimal[Optimal<br/>100MB-1GB<br/>✅ Best Performance]
        TooLarge[Too Large<br/>> 1GB<br/>⚠️ Limited Parallelism]
    end

    TooSmall -.->|Consolidate| Optimal
    TooLarge -.->|Partition| Optimal

    style TooSmall fill:#ffebee
    style Optimal fill:#e8f5e9
    style TooLarge fill:#fff9c4

File Consolidation Example:

# Spark job to consolidate small files
df = spark.read.format("delta").load("/path/to/table")
df.coalesce(10).write.format("delta").mode("overwrite").save("/path/to/table")

Cost Optimization¶

Data Processed Calculation¶

graph TB
    Query[User Query] --> Parse[Parse Query<br/>Identify Files]
    Parse --> Partition[Apply Partition<br/>Elimination]
    Partition --> Column[Apply Column<br/>Pruning]
    Column --> Scan[Actual Data<br/>Scanned]
    Scan --> Charge[Charge = Data Scanned<br/>$/TB]

    style Query fill:#e8f5e9
    style Partition fill:#fff3e0
    style Column fill:#e1f5fe
    style Scan fill:#f3e5f5
    style Charge fill:#ffccbc

Cost Comparison:

Scenario	Files Scanned	Data Scanned	Cost ($/TB)	Total Cost
Full table scan (CSV)	1000 files	10 TB	$5	$50
Partition elimination (Parquet)	100 files	1 TB	$5	$5
Column pruning (Parquet)	100 files	0.2 TB	$5	$1
Cached result (2^nd run)	0 files	0 TB	$0	$0

Cost Optimization Tips:

Use Parquet/Delta: 10-100x better compression than CSV
Partition Data: Reduce scanned files
Query Only Needed Columns: Reduce scanned data
Use Views: Encapsulate optimization logic
Leverage Cache: Repeated queries are free

📊 Monitoring & Troubleshooting¶

Query Performance Monitoring¶

-- View query execution statistics
SELECT
    request_id,
    status,
    total_elapsed_time,
    data_processed_mb,
    result_set_size_kb,
    command
FROM sys.dm_exec_requests
WHERE session_id = SESSION_ID();

-- Historical query performance
SELECT
    request_id,
    start_time,
    end_time,
    DATEDIFF(SECOND, start_time, end_time) as duration_seconds,
    data_processed_mb / 1024.0 as data_processed_gb,
    command
FROM sys.dm_exec_requests_history
WHERE session_id = SESSION_ID()
ORDER BY start_time DESC;

Common Issues & Solutions¶

Issue	Symptom	Solution
Slow queries	Long execution time	Add partitioning, use Parquet, create statistics
High costs	Unexpected charges	Optimize queries, use views, partition data
Permission errors	Access denied	Check AAD, RBAC, storage ACLs
Schema errors	Column not found	Recreate external table, check file format
Timeout errors	Query timeout	Optimize query, reduce data scanned

🎯 Best Practices¶

Design Principles¶

Use External Tables for frequently accessed datasets
Partition Data by commonly filtered columns (date, region, category)
Use Parquet or Delta for optimal performance and cost
Create Statistics on join and filter columns
Implement Views for reusable business logic
Monitor Costs regularly and optimize expensive queries

Anti-Patterns to Avoid¶

❌ SELECT * queries on large datasets
❌ Querying CSV files for large-scale analytics
❌ Small files (< 10MB) in data lake
❌ Missing partitioning on time-series data
❌ Not leveraging result cache

Architecture Documentation¶

Implementation Guides¶

Performance & Cost¶

Last Updated: 2025-01-28 Architecture Version: 2.0 Pattern: Serverless Query Processing