Serverless SQL Pool Architecture DiagramΒΆ
Comparative positioning note
This document is written from the perspective of Microsoft Azure, Cloud Scale Analytics, and CSA Loom. Any description of third-party or competing products, services, pricing, or capabilities is derived from publicly available documentation and sources believed accurate at the time of writing, and is provided for general comparison only. We do not claim expertise in, or authority over, any non-Microsoft product or service; the respective vendor's official documentation is the authoritative source for their offerings, which may change over time. Nothing here is intended to disparage any vendor β where a competing product has genuine advantages, we aim to note them honestly. Verify all third-party details against the vendor's current official documentation before making decisions.
π Home > π Diagrams > Serverless SQL Architecture
OverviewΒΆ
This diagram illustrates the Azure Synapse Serverless SQL Pool architecture, showing how it enables on-demand querying of data lake files using familiar T-SQL syntax without managing infrastructure or dedicated resources.
Complete Serverless SQL ArchitectureΒΆ
graph TB
subgraph "Data Sources & Storage"
subgraph "Azure Data Lake Storage Gen2"
Raw[Raw Zone<br/>CSV, JSON, Logs]
Parquet[Optimized Zone<br/>Parquet, Delta Lake]
Structured[Structured Zone<br/>Relational Format]
end
Blob[Azure Blob Storage<br/>Archive, Backup]
CosmosDB[Cosmos DB<br/>Analytical Store]
end
subgraph "Serverless SQL Pool"
subgraph "Query Processing"
Parser[SQL Parser<br/>T-SQL Compatibility]
Optimizer[Query Optimizer<br/>Pushdown Predicates]
Executor[Distributed Execution<br/>Auto-scaling]
end
subgraph "Metadata Services"
Schema[Schema Inference<br/>Auto-discovery]
Stats[Statistics<br/>Cost-based Optimization]
Cache[Result Cache<br/>Query Acceleration]
end
subgraph "Security & Access"
AAD[Azure AD Auth<br/>Managed Identity]
ACL[ACL Enforcement<br/>File/Folder Level]
RBAC[RBAC<br/>Database Objects]
end
end
subgraph "Data Virtualization"
ExtTable[External Tables<br/>Schema-on-Read]
OpenRowset[OPENROWSET<br/>Ad-hoc Queries]
Views[Views & Procs<br/>Logical Layer]
ExtDataSource[External Data Sources<br/>Connection Strings]
end
subgraph "Query Interfaces"
SSMS[SQL Server<br/>Management Studio]
AzureStudio[Azure Data<br/>Studio]
PowerBI[Power BI<br/>DirectQuery]
Apps[Applications<br/>JDBC/ODBC]
end
subgraph "Monitoring & Management"
QueryInsights[Query Insights<br/>Performance Analysis]
Monitor[Azure Monitor<br/>Metrics & Logs]
CostMgmt[Cost Management<br/>Data Processed Tracking]
end
%% Data Flow
Raw --> Parser
Parquet --> Parser
Structured --> Parser
Blob --> Parser
CosmosDB --> Parser
Parser --> Optimizer
Optimizer --> Schema
Optimizer --> Stats
Schema --> Executor
Stats --> Executor
Executor --> Cache
Cache --> ExtTable
Cache --> OpenRowset
ExtTable --> Views
OpenRowset --> Views
ExtDataSource -.-> ExtTable
ExtDataSource -.-> OpenRowset
%% Security
AAD -.-> Parser
ACL -.-> Executor
RBAC -.-> Views
%% Query Interface
SSMS --> Parser
AzureStudio --> Parser
PowerBI --> Parser
Apps --> Parser
Views --> PowerBI
Views --> Apps
Views --> SSMS
%% Monitoring
Executor -.-> QueryInsights
Parser -.-> Monitor
Optimizer -.-> CostMgmt
%% Styling
style Raw fill:#E3F2FD
style Parquet fill:#E8F5E9
style Structured fill:#FFF3E0
style Parser fill:#E1F5FE
style Optimizer fill:#E1F5FE
style Executor fill:#E1F5FE
style Schema fill:#F3E5F5
style Stats fill:#F3E5F5
style Cache fill:#C8E6C9
style ExtTable fill:#FFF9C4
style OpenRowset fill:#FFF9C4
style Views fill:#FFF9C4
style PowerBI fill:#FFCCBC Query Processing FlowΒΆ
Ad-hoc Query with OPENROWSETΒΆ
sequenceDiagram
participant User as User/Application
participant SQL as Serverless SQL Pool
participant Meta as Metadata Service
participant Storage as Data Lake Gen2
participant Cache as Result Cache
User->>SQL: Submit OPENROWSET Query
Note over User,SQL: SELECT * FROM OPENROWSET(...)<br/>WHERE date > '2024-01-01'
SQL->>Meta: Infer Schema
Meta->>Storage: Read Sample Files
Storage-->>Meta: Return Sample Data
Meta-->>SQL: Schema Definition
SQL->>SQL: Parse & Optimize Query
Note over SQL: Apply predicate pushdown<br/>Partition elimination
SQL->>Storage: Read Filtered Data
Note over Storage: Only scan required files<br/>Based on predicates
Storage-->>SQL: Return Query Results
SQL->>Cache: Store Result Set
Cache-->>User: Return Results
Note over User,Cache: Subsequent identical queries<br/>served from cache External Table QueryΒΆ
sequenceDiagram
participant User as User/Application
participant SQL as Serverless SQL Pool
participant Meta as Metadata Catalog
participant Storage as Data Lake Gen2
User->>SQL: CREATE EXTERNAL TABLE
Note over User,SQL: Define schema once
SQL->>Meta: Store Table Definition
Meta-->>SQL: Confirmation
User->>SQL: SELECT FROM External Table
SQL->>Meta: Retrieve Schema
Meta-->>SQL: Table Metadata
SQL->>SQL: Query Optimization
Note over SQL: Use statistics if available<br/>Apply filters
SQL->>Storage: Execute Query
Storage-->>SQL: Return Results
SQL-->>User: Query Results Data Virtualization PatternsΒΆ
OPENROWSET Pattern (Schema-on-Read)ΒΆ
graph LR
subgraph "Query Types with OPENROWSET"
Single[Single File<br/>OPENROWSET(...file.parquet)]
Wildcard[Multiple Files<br/>OPENROWSET(.../*.parquet)]
Partitioned[Partitioned Data<br/>OPENROWSET(.../year=*/...)]
end
subgraph "Format Support"
Parquet[Parquet<br/>Columnar]
CSV[CSV<br/>Delimited Text]
JSON[JSON<br/>Semi-structured]
Delta[Delta Lake<br/>Versioned]
end
subgraph "Schema Handling"
Auto[Auto Schema<br/>Inference]
Explicit[Explicit Schema<br/>WITH Clause]
Metadata[Metadata Columns<br/>filename(), filepath()]
end
Single --> Parquet
Wildcard --> CSV
Partitioned --> Delta
Parquet --> Auto
CSV --> Explicit
JSON --> Metadata
style Single fill:#E1F5FE
style Wildcard fill:#E1F5FE
style Partitioned fill:#E1F5FE Example Query:
-- Ad-hoc query with schema inference
SELECT *
FROM OPENROWSET(
BULK 'https://datalake.dfs.core.windows.net/data/sales/*.parquet',
FORMAT = 'PARQUET'
) AS sales
WHERE order_date >= '2024-01-01';
External Table Pattern (Schema-on-Write)ΒΆ
graph TB
subgraph "External Table Components"
DataSource[External Data Source<br/>Connection String]
FileFormat[External File Format<br/>Parquet, CSV, JSON]
ExtTable[External Table<br/>Schema Definition]
end
subgraph "Benefits"
Reuse[Schema Reusability<br/>Define Once]
Security[Security Management<br/>Credentials Abstracted]
Perf[Performance<br/>Statistics Available]
end
subgraph "Management"
Create[CREATE EXTERNAL TABLE]
Alter[ALTER EXTERNAL TABLE]
Drop[DROP EXTERNAL TABLE]
Stats[CREATE STATISTICS]
end
DataSource --> ExtTable
FileFormat --> ExtTable
ExtTable --> Reuse
ExtTable --> Security
ExtTable --> Perf
Create --> ExtTable
Stats --> Perf
style DataSource fill:#FFF3E0
style FileFormat fill:#FFF3E0
style ExtTable fill:#E8F5E9 Example Setup:
-- Create external data source
CREATE EXTERNAL DATA SOURCE SalesDataLake
WITH (
LOCATION = 'https://datalake.dfs.core.windows.net/sales'
);
-- Create external file format
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (
FORMAT_TYPE = PARQUET,
DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);
-- Create external table
CREATE EXTERNAL TABLE SalesOrders (
OrderID INT,
CustomerID INT,
OrderDate DATE,
Amount DECIMAL(10,2)
)
WITH (
LOCATION = '/orders/*.parquet',
DATA_SOURCE = SalesDataLake,
FILE_FORMAT = ParquetFormat
);
-- Query like a regular table
SELECT * FROM SalesOrders WHERE OrderDate >= '2024-01-01';
Query Optimization StrategiesΒΆ
Predicate Pushdown & Partition EliminationΒΆ
graph TB
subgraph "Query Submitted"
Q[SELECT *<br/>FROM sales<br/>WHERE year = 2024<br/>AND region = 'West']
end
subgraph "Optimizer Actions"
Parse[Parse Query<br/>Extract Predicates]
Analyze[Analyze Storage<br/>Partitioning Scheme]
Pushdown[Apply Pushdown<br/>Filter at Source]
Eliminate[Eliminate Partitions<br/>Skip Irrelevant Data]
end
subgraph "Storage Scan"
Before[Before Optimization<br/>Scan 1000 files]
After[After Optimization<br/>Scan 10 files]
end
subgraph "Performance Impact"
Time[90% Faster<br/>Query Time]
Cost[90% Lower<br/>Data Scanned]
Concurrency[More Concurrent<br/>Queries]
end
Q --> Parse
Parse --> Analyze
Analyze --> Pushdown
Pushdown --> Eliminate
Before -.->|Without Optimization| Time
After --> Time
After --> Cost
After --> Concurrency
style Q fill:#FFEBEE
style Parse fill:#E1F5FE
style Eliminate fill:#C8E6C9
style After fill:#C8E6C9
style Time fill:#A5D6A7 Result Set CachingΒΆ
graph LR
subgraph "First Query Execution"
Q1[Query Submitted]
Exec1[Execute Against<br/>Data Lake]
Cache1[Store in<br/>Result Cache]
Return1[Return Results<br/>60 sec]
end
subgraph "Subsequent Executions"
Q2[Same Query]
Check[Check Cache]
Return2[Return from Cache<br/>< 1 sec]
end
subgraph "Cache Management"
TTL[Time-to-Live<br/>24-48 hours]
Invalidate[Invalidate on<br/>Data Change]
Size[Size Limits<br/>Auto-eviction]
end
Q1 --> Exec1
Exec1 --> Cache1
Cache1 --> Return1
Q2 --> Check
Check -->|Cache Hit| Return2
Check -->|Cache Miss| Exec1
Cache1 -.-> TTL
Cache1 -.-> Invalidate
Cache1 -.-> Size
style Cache1 fill:#C8E6C9
style Return2 fill:#A5D6A7
style Return1 fill:#FFF9C4 File Format OptimizationΒΆ
Format ComparisonΒΆ
graph TB
subgraph "File Formats Performance"
direction LR
subgraph "CSV"
CSV1[Text Format]
CSV2[Row-based]
CSV3[No Compression]
CSV4[Slow Queries]
end
subgraph "Parquet"
P1[Binary Format]
P2[Columnar]
P3[Built-in Compression]
P4[Fast Queries]
end
subgraph "Delta Lake"
D1[Parquet + Logs]
D2[ACID Transactions]
D3[Time Travel]
D4[Optimized Performance]
end
end
subgraph "Query Performance"
Fast[< 1 sec]
Medium[1-10 sec]
Slow[> 10 sec]
end
subgraph "Cost per Query"
Low[$]
Med[$$]
High[$$$]
end
CSV4 --> Slow
P4 --> Medium
D4 --> Fast
CSV4 --> High
P4 --> Med
D4 --> Low
style P1 fill:#E8F5E9
style P2 fill:#E8F5E9
style P3 fill:#E8F5E9
style P4 fill:#C8E6C9
style D1 fill:#E8F5E9
style D2 fill:#E8F5E9
style D3 fill:#E8F5E9
style D4 fill:#A5D6A7 Recommended Format StrategyΒΆ
| Data Stage | Format | Reason | Cost Impact |
|---|---|---|---|
| Raw/Landing | CSV, JSON | Source format preservation | Higher query cost |
| Refined/Cleansed | Parquet | Optimized for analytics | Medium cost |
| Curated/Gold | Delta Lake | Best performance + ACID | Lowest cost |
Cost Optimization StrategiesΒΆ
Data Scanned vs Query CostΒΆ
graph TB
subgraph "Factors Affecting Cost"
DataSize[Amount of Data<br/>Scanned in TB]
FileType[File Format<br/>CSV vs Parquet]
Partition[Partition<br/>Elimination]
Compression[Compression<br/>Codec]
end
subgraph "Cost Calculation"
Formula[Cost = Data Scanned (TB)<br/>Γ $5 per TB]
end
subgraph "Optimization Techniques"
Convert[Convert to<br/>Parquet/Delta]
PartitionData[Partition by<br/>Date/Category]
CreateStats[Create<br/>Statistics]
FilterEarly[Apply Filters<br/>in WHERE Clause]
end
subgraph "Cost Savings"
S1[60-80% Reduction<br/>Parquet vs CSV]
S2[80-95% Reduction<br/>With Partitioning]
S3[Result Caching<br/>Near Zero Cost]
end
DataSize --> Formula
FileType --> Formula
Partition --> Formula
Convert --> S1
PartitionData --> S2
CreateStats --> S2
FilterEarly --> S2
style Formula fill:#FFEBEE
style S1 fill:#C8E6C9
style S2 fill:#A5D6A7
style S3 fill:#81C784 Cost Optimization Example:
-- High Cost: Full table scan on CSV
SELECT AVG(amount)
FROM OPENROWSET(
BULK 'https://.../sales/**/*.csv',
FORMAT = 'CSV'
) AS sales;
-- Scans: 100 GB, Cost: $0.50
-- Optimized: Filtered query on Parquet with partitioning
SELECT AVG(amount)
FROM OPENROWSET(
BULK 'https://.../sales/year=2024/month=01/*.parquet',
FORMAT = 'PARQUET'
) AS sales
WHERE region = 'West';
-- Scans: 2 GB, Cost: $0.01 (98% savings)
Security ArchitectureΒΆ
Multi-Layer Security ModelΒΆ
graph TB
subgraph "Authentication Layer"
AAD[Azure Active Directory<br/>Identity Management]
MI[Managed Identity<br/>Service Principal]
SAS[Shared Access Signature<br/>Limited Access]
end
subgraph "Authorization Layer"
StorageRBAC[Storage RBAC<br/>Blob Data Reader]
ACLs[ACLs<br/>File/Folder Permissions]
SQLRBAC[SQL RBAC<br/>Database Roles]
end
subgraph "Data Protection"
Encryption[Encryption at Rest<br/>Storage Service Encryption]
TLS[Encryption in Transit<br/>TLS 1.2+]
Masking[Dynamic Data Masking<br/>Sensitive Fields]
end
subgraph "Access Control"
User[User/Application]
Credential[Database Scoped<br/>Credential]
DataSource[External Data<br/>Source]
Storage[Data Lake Storage]
end
AAD --> StorageRBAC
MI --> StorageRBAC
SAS --> ACLs
User --> AAD
User --> Credential
Credential --> DataSource
DataSource --> Storage
StorageRBAC -.-> Storage
ACLs -.-> Storage
Encryption -.-> Storage
style AAD fill:#E8F5E9
style StorageRBAC fill:#FFF3E0
style Encryption fill:#E1F5FE
style Storage fill:#F3E5F5 Use Case PatternsΒΆ
Pattern 1: Ad-hoc Data ExplorationΒΆ
graph LR
Analyst[Data Analyst] --> SSMS[SQL Server<br/>Management Studio]
SSMS --> Query[OPENROWSET Query<br/>Explore Raw Files]
Query --> DataLake[Data Lake<br/>Raw Zone]
DataLake --> Results[Quick Insights<br/>Pay-per-Query]
style Query fill:#E1F5FE
style Results fill:#C8E6C9 Best For: - Exploring new data sources - One-time analysis - Prototyping queries - Data quality checks
Pattern 2: Self-Service BIΒΆ
graph TB
Users[Business Users] --> PowerBI[Power BI]
PowerBI --> ExtTables[External Tables<br/>Semantic Layer]
ExtTables --> Views[Views & Stored Procs<br/>Business Logic]
Views --> DataLake[Data Lake<br/>Curated Zone]
style ExtTables fill:#FFF3E0
style Views fill:#E8F5E9
style DataLake fill:#E1F5FE Best For: - Business intelligence dashboards - Self-service reporting - Departmental analytics - Cost-effective queries
Pattern 3: Data Lake QueryingΒΆ
graph LR
DataLake[Data Lake<br/>Multi-Zone] --> Bronze[Bronze Layer<br/>OPENROWSET]
DataLake --> Silver[Silver Layer<br/>External Tables]
DataLake --> Gold[Gold Layer<br/>Views]
Bronze --> Exploration[Data Exploration]
Silver --> Integration[Data Integration]
Gold --> Analytics[Business Analytics]
style Bronze fill:#CD7F32,color:#000
style Silver fill:#C0C0C0,color:#000
style Gold fill:#FFD700,color:#000 Best For: - Medallion architecture - Progressive data refinement - Cost-optimized analytics - Scalable queries
Performance Best PracticesΒΆ
Query Optimization ChecklistΒΆ
graph TB
subgraph "Before Writing Query"
Format[β Use Parquet or Delta Lake]
Partition[β Leverage Partitioning]
Columns[β Select Only Needed Columns]
end
subgraph "Query Design"
Where[β Apply WHERE Filters]
Stats[β Create Statistics]
Aggregate[β Aggregate in SQL]
Join[β Optimize JOIN Order]
end
subgraph "Schema Management"
ExtTable[β Use External Tables]
DataTypes[β Explicit Data Types]
Compress[β Enable Compression]
end
subgraph "Monitoring"
QPI[β Review Query Performance Insights]
Cost[β Monitor Data Scanned]
Optimize[β Iteratively Optimize]
end
Format --> Where
Partition --> Where
Where --> ExtTable
Stats --> Join
ExtTable --> QPI
style Format fill:#C8E6C9
style Where fill:#C8E6C9
style ExtTable fill:#C8E6C9
style QPI fill:#C8E6C9 Reference LinksΒΆ
Last Updated: 2025-01-28 Diagram Type: Service Architecture Technology: Azure Synapse Serverless SQL Pool