☁️ Serverless SQL Architecture Description¶
Detailed description and visual representation of Azure Synapse Serverless SQL Pool architecture and query processing.
🎯 Overview¶
Azure Synapse Serverless SQL Pool provides on-demand T-SQL query capabilities over data in Azure Data Lake without the need to provision or manage compute infrastructure. It operates on a pay-per-query model, charging only for data processed.
📊 Visual Architecture¶
High-Level Architecture¶
graph TB
subgraph "Query Submission Layer"
Users[Users & Applications]
SSMS[SQL Server Management Studio]
PowerBI[Power BI]
AzurePortal[Azure Portal]
JDBC[JDBC/ODBC Clients]
end
subgraph "Serverless SQL Pool"
subgraph "Query Processing"
Parser[Query Parser<br/>T-SQL Parsing]
Optimizer[Query Optimizer<br/>Cost-based Optimization]
Planner[Execution Planner<br/>Distributed Plan]
end
subgraph "Execution Engine"
Coordinator[Query Coordinator<br/>Plan Distribution]
Workers[Compute Workers<br/>Auto-scaled Pool]
Cache[Result Cache<br/>48-hour TTL]
end
subgraph "Metadata Services"
MetadataDB[Metadata Database<br/>Table Definitions]
StatService[Statistics Service<br/>Data Profiles]
SchemaDiscovery[Schema Discovery<br/>File Inference]
end
end
subgraph "Storage Layer"
subgraph "Data Lake Gen2"
Delta[Delta Lake Tables<br/>Parquet + Transaction Logs]
Parquet[Parquet Files<br/>Columnar Storage]
CSV[CSV Files<br/>Delimited Text]
JSON[JSON Files<br/>Semi-structured]
end
subgraph "External Sources"
CosmosDB[Cosmos DB<br/>Analytical Store]
BlobStorage[Blob Storage<br/>Any Format]
end
end
subgraph "Security & Governance"
AAD[Azure AD<br/>Authentication]
RBAC[Role-based Access<br/>Permissions]
ACLs[Storage ACLs<br/>Data-level Security]
end
Users --> Parser
SSMS --> Parser
PowerBI --> Parser
AzurePortal --> Parser
JDBC --> Parser
Parser --> Optimizer
Optimizer --> Planner
Planner --> Coordinator
Coordinator --> Workers
Workers --> Cache
Optimizer -.->|Query Statistics| StatService
Planner -.->|Schema Info| MetadataDB
Parser -.->|Auto-discovery| SchemaDiscovery
Workers --> Delta
Workers --> Parquet
Workers --> CSV
Workers --> JSON
Workers --> CosmosDB
Workers --> BlobStorage
Users -.->|Authenticate| AAD
Workers -.->|Check Permissions| RBAC
Workers -.->|Data Access| ACLs
Cache --> Users
style Parser fill:#e1f5fe
style Optimizer fill:#fff3e0
style Workers fill:#e8f5e9
style Delta fill:#f3e5f5
style Cache fill:#fce4ec 🏗️ Architecture Components¶
1. Query Submission Layer¶
Multiple client tools and applications can connect to Serverless SQL Pool:
| Client Type | Protocol | Authentication | Best For |
|---|---|---|---|
| SQL Server Management Studio | TDS (TCP 1433) | Azure AD, SQL Auth | Interactive queries, development |
| Azure Data Studio | TDS (TCP 1433) | Azure AD | Cross-platform development |
| Power BI | Direct Query, Import | Azure AD | Business intelligence, dashboards |
| JDBC/ODBC | Standard drivers | Azure AD, connection string | Application integration |
| Azure Portal | Web interface | Azure AD | Quick exploration, testing |
Connection String Example:
Server=tcp:your-workspace-ondemand.sql.azuresynapse.net,1433;
Database=your-database;
Authentication=Active Directory Integrated;
Encrypt=yes;
TrustServerCertificate=no;
2. Query Processing Pipeline¶
Query Parser¶
graph LR
SQLQuery[T-SQL Query] --> Lexer[Lexical Analysis<br/>Tokenization]
Lexer --> Syntax[Syntax Analysis<br/>Parse Tree]
Syntax --> Semantic[Semantic Analysis<br/>Validation]
Semantic --> IR[Intermediate<br/>Representation]
style SQLQuery fill:#e8f5e9
style Lexer fill:#fff3e0
style Syntax fill:#e1f5fe
style Semantic fill:#f3e5f5
style IR fill:#fce4ec Parser Functions:
- Tokenization: Break down SQL text into tokens
- Syntax Validation: Ensure T-SQL syntax correctness
- Semantic Analysis: Validate object references, data types
- IR Generation: Create internal query representation
Query Optimizer¶
graph TB
IR[Intermediate<br/>Representation] --> Logical[Logical Plan<br/>Generation]
Logical --> Candidates[Multiple Plan<br/>Candidates]
Candidates --> CostEstimation{Cost<br/>Estimation}
CostEstimation -->|Statistics| DataStats[Data Statistics<br/>Row counts, distributions]
CostEstimation -->|File Info| FileStats[File Statistics<br/>Size, partitions]
CostEstimation -->|Index Info| IndexStats[Index Statistics<br/>If available]
DataStats --> BestPlan[Select Best<br/>Execution Plan]
FileStats --> BestPlan
IndexStats --> BestPlan
BestPlan --> Physical[Physical Plan<br/>Distributed Operations]
style IR fill:#e1f5fe
style Logical fill:#fff3e0
style CostEstimation fill:#f3e5f5
style BestPlan fill:#4caf50
style Physical fill:#e8f5e9 Optimization Techniques:
- Predicate Pushdown: Filter early in data scan
- Partition Elimination: Skip irrelevant partitions
- Column Pruning: Read only required columns
- Join Optimization: Choose optimal join strategy
- Aggregation Pushdown: Aggregate before data transfer
Example Optimization:
-- Original query
SELECT customer_id, SUM(amount) as total
FROM sales.transactions
WHERE transaction_date >= '2025-01-01'
GROUP BY customer_id;
-- Optimized execution:
-- 1. Partition elimination (only 2025 partitions)
-- 2. Column pruning (read only customer_id, amount, transaction_date)
-- 3. Predicate pushdown (filter applied during file scan)
-- 4. Aggregation pushdown (partial aggregates per file)
Execution Planner¶
graph TB
PhysicalPlan[Physical<br/>Execution Plan] --> Stages[Divide into<br/>Execution Stages]
Stages --> Stage1[Stage 1: Data Scan<br/>Parallel file readers]
Stages --> Stage2[Stage 2: Shuffle<br/>Data redistribution]
Stages --> Stage3[Stage 3: Aggregation<br/>Final results]
Stage1 --> Workers1[Worker Pool 1<br/>File Scanning]
Stage2 --> Workers2[Worker Pool 2<br/>Data Shuffling]
Stage3 --> Workers3[Worker Pool 3<br/>Aggregation]
Workers1 --> Exchange1[Data Exchange<br/>Network Transfer]
Workers2 --> Exchange2[Data Exchange<br/>Network Transfer]
Exchange1 --> Workers2
Exchange2 --> Workers3
Workers3 --> Results[Query Results]
style PhysicalPlan fill:#e1f5fe
style Stage1 fill:#fff9c4
style Stage2 fill:#f3e5f5
style Stage3 fill:#e8f5e9
style Results fill:#4caf50 3. Execution Engine¶
Query Coordinator¶
Responsibilities:
- Receive execution plan from planner
- Allocate compute workers dynamically
- Distribute work across workers
- Monitor execution progress
- Handle failures and retries
- Collect and merge results
graph TB
Coordinator[Query Coordinator]
Coordinator --> Allocation[Worker Allocation<br/>Based on query complexity]
Coordinator --> Distribution[Work Distribution<br/>Parallel tasks]
Coordinator --> Monitoring[Execution Monitoring<br/>Progress tracking]
Coordinator --> ResultMerge[Result Merging<br/>Combine worker outputs]
Allocation --> Workers[Compute Worker Pool]
Distribution --> Workers
Workers --> Monitoring
Workers --> ResultMerge
style Coordinator fill:#fff3e0
style Allocation fill:#e1f5fe
style Workers fill:#e8f5e9
style ResultMerge fill:#4caf50 Compute Workers¶
Auto-scaling Characteristics:
graph LR
subgraph "Worker Scaling"
QueryLoad[Query Complexity<br/>& Data Volume]
ScaleLogic{Scaling<br/>Decision}
SmallPool[Small Pool<br/>4-8 workers]
MediumPool[Medium Pool<br/>16-32 workers]
LargePool[Large Pool<br/>64+ workers]
end
QueryLoad --> ScaleLogic
ScaleLogic -->|Simple query| SmallPool
ScaleLogic -->|Medium query| MediumPool
ScaleLogic -->|Complex query| LargePool
style QueryLoad fill:#e8f5e9
style SmallPool fill:#c8e6c9
style MediumPool fill:#fff9c4
style LargePool fill:#ffccbc Worker Operations:
| Operation | Description | Parallelization |
|---|---|---|
| File Scanning | Read data from storage files | High (per file) |
| Filtering | Apply WHERE clause predicates | High (per partition) |
| Projection | Select specific columns | High (per row group) |
| Aggregation | GROUP BY and aggregate functions | Medium (partial aggregates) |
| Joins | Combine data from multiple sources | Medium (hash/merge joins) |
| Sorting | ORDER BY operations | Low (global sort required) |
Result Cache¶
graph TB
Query[User Query] --> CacheCheck{Result in<br/>Cache?}
CacheCheck -->|Yes, recent| Cached[Return Cached<br/>Result]
CacheCheck -->|No| Execute[Execute Query]
Execute --> Store[Store Result<br/>in Cache]
Store --> Return[Return Result]
Cached -.->|TTL: 48 hours| AutoEvict[Auto-eviction]
Store -.->|Set TTL| TTL[48-hour Expiration]
style CacheCheck fill:#fff3e0
style Cached fill:#4caf50
style Execute fill:#e1f5fe
style Store fill:#e8f5e9 Cache Characteristics:
- TTL: 48 hours automatic expiration
- Invalidation: Automatic when underlying data changes
- Scope: Per-user and per-query
- Storage: Temporary blob storage
- Billing: Cached results not charged on subsequent access
4. Metadata Services¶
Metadata Database¶
Storage:
- Azure SQL Database backend
- Workspace-level metadata repository
- Stores table definitions, views, procedures
Metadata Types:
-- External tables over data lake files
CREATE EXTERNAL TABLE sales.transactions (
transaction_id INT,
customer_id INT,
amount DECIMAL(10,2),
transaction_date DATE
)
WITH (
LOCATION = '/sales/transactions/',
DATA_SOURCE = DataLakeSource,
FILE_FORMAT = ParquetFormat
);
-- Views for business logic
CREATE VIEW sales.monthly_summary AS
SELECT
YEAR(transaction_date) as year,
MONTH(transaction_date) as month,
SUM(amount) as total_sales
FROM sales.transactions
GROUP BY YEAR(transaction_date), MONTH(transaction_date);
-- Stored procedures for reusable logic
CREATE PROCEDURE sales.GetCustomerSummary
@customer_id INT
AS
BEGIN
SELECT
customer_id,
COUNT(*) as transaction_count,
SUM(amount) as total_spent
FROM sales.transactions
WHERE customer_id = @customer_id
GROUP BY customer_id;
END;
Statistics Service¶
Automatic Statistics:
graph LR
DataScan[File Scan] --> StatCollection[Collect Statistics<br/>During Query]
StatCollection --> Store[Store in<br/>Metadata DB]
Store --> Optimizer[Used by<br/>Query Optimizer]
subgraph "Statistics Collected"
RowCount[Row Counts]
NullCount[Null Counts]
MinMax[Min/Max Values]
Histogram[Value Histograms]
end
StatCollection --> RowCount
StatCollection --> NullCount
StatCollection --> MinMax
StatCollection --> Histogram
style DataScan fill:#e8f5e9
style StatCollection fill:#fff3e0
style Store fill:#e1f5fe
style Optimizer fill:#4caf50 Manual Statistics:
-- Create statistics manually for better performance
CREATE STATISTICS customer_stats
ON sales.transactions(customer_id)
WITH FULLSCAN;
-- View existing statistics
SELECT * FROM sys.stats
WHERE object_id = OBJECT_ID('sales.transactions');
Schema Discovery¶
Automatic Schema Inference:
graph TB
File[Data File] --> Type{File<br/>Type?}
Type -->|Parquet| ParquetSchema[Read Parquet<br/>Metadata]
Type -->|Delta| DeltaSchema[Read Delta<br/>Transaction Log]
Type -->|CSV| CSVInfer[Infer from<br/>First 100 Rows]
Type -->|JSON| JSONInfer[Infer from<br/>Sample Records]
ParquetSchema --> Schema[Inferred Schema]
DeltaSchema --> Schema
CSVInfer --> Schema
JSONInfer --> Schema
Schema --> ExternalTable[Auto-create<br/>External Table]
style File fill:#e8f5e9
style ParquetSchema fill:#e1f5fe
style DeltaSchema fill:#fff3e0
style Schema fill:#4caf50
style ExternalTable fill:#f3e5f5 Schema Inference Examples:
-- Automatic schema discovery with OPENROWSET
SELECT TOP 100 *
FROM OPENROWSET(
BULK 'https://datalake.dfs.core.windows.net/data/sales/*.parquet',
FORMAT = 'PARQUET'
) AS [result];
-- Schema inference with type specification
SELECT *
FROM OPENROWSET(
BULK 'https://datalake.dfs.core.windows.net/data/sales/*.csv',
FORMAT = 'CSV',
PARSER_VERSION = '2.0',
HEADER_ROW = TRUE
) WITH (
transaction_id INT,
amount DECIMAL(10,2),
transaction_date DATE
) AS [result];
💾 Storage Layer Integration¶
Supported File Formats¶
Query Patterns by Format¶
Delta Lake Queries¶
-- Query Delta table directly
SELECT *
FROM OPENROWSET(
BULK 'https://datalake.dfs.core.windows.net/gold/sales/',
FORMAT = 'DELTA'
) AS [sales];
-- Time travel query
SELECT *
FROM OPENROWSET(
BULK 'https://datalake.dfs.core.windows.net/gold/sales/',
FORMAT = 'DELTA'
) WITH (VERSION = 5) AS [sales];
-- Create external table over Delta
CREATE EXTERNAL TABLE gold.sales
WITH (
LOCATION = '/gold/sales/',
DATA_SOURCE = DataLakeSource,
FILE_FORMAT = DeltaFormat
);
Parquet Queries¶
-- Direct Parquet query with partition elimination
SELECT
customer_id,
SUM(amount) as total
FROM OPENROWSET(
BULK 'https://datalake.dfs.core.windows.net/sales/year=2025/month=01/*.parquet',
FORMAT = 'PARQUET'
) AS [result]
WHERE transaction_date >= '2025-01-01'
GROUP BY customer_id;
-- Wildcard patterns for multiple partitions
SELECT *
FROM OPENROWSET(
BULK 'https://datalake.dfs.core.windows.net/sales/year=2025/month=*/day=*/*.parquet',
FORMAT = 'PARQUET'
) AS [result]
WHERE filepath(1) IN ('01', '02') -- Only Jan and Feb
AND filepath(2) > '15'; -- After 15th of month
Data Access Patterns¶
graph TB
subgraph "Access Methods"
OPENROWSET[OPENROWSET<br/>Ad-hoc queries]
ExternalTable[EXTERNAL TABLE<br/>Reusable definitions]
View[VIEW<br/>Business logic layer]
end
subgraph "Security Layers"
AAD[Azure AD<br/>Authentication]
RBAC[Synapse RBAC<br/>Workspace permissions]
StorageACL[Storage ACLs<br/>File/folder permissions]
end
OPENROWSET --> AAD
ExternalTable --> AAD
View --> AAD
AAD --> RBAC
RBAC --> StorageACL
StorageACL --> DataAccess[Data Access<br/>Granted]
style OPENROWSET fill:#e1f5fe
style ExternalTable fill:#fff3e0
style View fill:#e8f5e9
style DataAccess fill:#4caf50 🔒 Security & Governance¶
Authentication Flow¶
sequenceDiagram
participant User
participant SynapseSQL as Serverless SQL Pool
participant AAD as Azure Active Directory
participant Storage as Data Lake Storage
User->>SynapseSQL: Connect with Azure AD
SynapseSQL->>AAD: Validate credentials
AAD->>SynapseSQL: Return access token
User->>SynapseSQL: Submit query
SynapseSQL->>SynapseSQL: Check workspace RBAC
SynapseSQL->>Storage: Access data with user identity
Storage->>Storage: Check storage ACLs
Storage->>SynapseSQL: Return data (if authorized)
SynapseSQL->>User: Return query results Authorization Layers¶
| Layer | Purpose | Granularity | Management |
|---|---|---|---|
| Azure AD | User authentication | User/group identity | Azure portal |
| Synapse RBAC | Workspace permissions | Workspace/pool level | Synapse Studio |
| SQL Permissions | Database object access | Table/view/procedure | T-SQL GRANT/DENY |
| Storage ACLs | File/folder access | File/folder level | Storage portal/SDK |
Security Best Practices:
-- Grant minimum required permissions
GRANT SELECT ON SCHEMA::sales TO [DataAnalysts];
-- Use database scoped credentials for external data sources
CREATE DATABASE SCOPED CREDENTIAL DataLakeCredential
WITH IDENTITY = 'Managed Identity';
CREATE EXTERNAL DATA SOURCE DataLakeSource
WITH (
LOCATION = 'https://datalake.dfs.core.windows.net/data',
CREDENTIAL = DataLakeCredential
);
-- Row-level security with predicates
CREATE SECURITY POLICY SalesFilter
ADD FILTER PREDICATE dbo.fn_securitypredicate(SalesPersonID)
ON sales.transactions
WITH (STATE = ON);
⚡ Performance Optimization¶
Query Optimization Techniques¶
1. Partition Elimination¶
-- Inefficient: Full scan
SELECT * FROM sales.transactions
WHERE transaction_date >= '2025-01-01';
-- Efficient: Partition elimination
SELECT * FROM sales.transactions
WHERE year = 2025 AND month = 1;
-- Folder structure: /sales/year=2025/month=01/
2. Column Pruning¶
-- Inefficient: Read all columns
SELECT * FROM sales.transactions;
-- Efficient: Select only needed columns
SELECT customer_id, amount, transaction_date
FROM sales.transactions;
-- Only 3 columns read from Parquet
3. Predicate Pushdown¶
-- Efficient: Filter pushed to file scan
SELECT customer_id, SUM(amount) as total
FROM sales.transactions
WHERE transaction_date >= '2025-01-01'
AND amount > 100
GROUP BY customer_id;
-- Filters applied during file read, reduces data transfer
4. File Size Optimization¶
graph LR
subgraph "File Size Impact"
TooSmall[Too Small<br/>< 10MB<br/>❌ Overhead]
Optimal[Optimal<br/>100MB-1GB<br/>✅ Best Performance]
TooLarge[Too Large<br/>> 1GB<br/>⚠️ Limited Parallelism]
end
TooSmall -.->|Consolidate| Optimal
TooLarge -.->|Partition| Optimal
style TooSmall fill:#ffebee
style Optimal fill:#e8f5e9
style TooLarge fill:#fff9c4 File Consolidation Example:
# Spark job to consolidate small files
df = spark.read.format("delta").load("/path/to/table")
df.coalesce(10).write.format("delta").mode("overwrite").save("/path/to/table")
Cost Optimization¶
Data Processed Calculation¶
graph TB
Query[User Query] --> Parse[Parse Query<br/>Identify Files]
Parse --> Partition[Apply Partition<br/>Elimination]
Partition --> Column[Apply Column<br/>Pruning]
Column --> Scan[Actual Data<br/>Scanned]
Scan --> Charge[Charge = Data Scanned<br/>$/TB]
style Query fill:#e8f5e9
style Partition fill:#fff3e0
style Column fill:#e1f5fe
style Scan fill:#f3e5f5
style Charge fill:#ffccbc Cost Comparison:
| Scenario | Files Scanned | Data Scanned | Cost ($/TB) | Total Cost |
|---|---|---|---|---|
| Full table scan (CSV) | 1000 files | 10 TB | $5 | $50 |
| Partition elimination (Parquet) | 100 files | 1 TB | $5 | $5 |
| Column pruning (Parquet) | 100 files | 0.2 TB | $5 | $1 |
| Cached result (2nd run) | 0 files | 0 TB | $0 | $0 |
Cost Optimization Tips:
- Use Parquet/Delta: 10-100x better compression than CSV
- Partition Data: Reduce scanned files
- Query Only Needed Columns: Reduce scanned data
- Use Views: Encapsulate optimization logic
- Leverage Cache: Repeated queries are free
📊 Monitoring & Troubleshooting¶
Query Performance Monitoring¶
-- View query execution statistics
SELECT
request_id,
status,
total_elapsed_time,
data_processed_mb,
result_set_size_kb,
command
FROM sys.dm_exec_requests
WHERE session_id = SESSION_ID();
-- Historical query performance
SELECT
request_id,
start_time,
end_time,
DATEDIFF(SECOND, start_time, end_time) as duration_seconds,
data_processed_mb / 1024.0 as data_processed_gb,
command
FROM sys.dm_exec_requests_history
WHERE session_id = SESSION_ID()
ORDER BY start_time DESC;
Common Issues & Solutions¶
| Issue | Symptom | Solution |
|---|---|---|
| Slow queries | Long execution time | Add partitioning, use Parquet, create statistics |
| High costs | Unexpected charges | Optimize queries, use views, partition data |
| Permission errors | Access denied | Check AAD, RBAC, storage ACLs |
| Schema errors | Column not found | Recreate external table, check file format |
| Timeout errors | Query timeout | Optimize query, reduce data scanned |
🎯 Best Practices¶
Design Principles¶
- Use External Tables for frequently accessed datasets
- Partition Data by commonly filtered columns (date, region, category)
- Use Parquet or Delta for optimal performance and cost
- Create Statistics on join and filter columns
- Implement Views for reusable business logic
- Monitor Costs regularly and optimize expensive queries
Anti-Patterns to Avoid¶
- ❌ SELECT * queries on large datasets
- ❌ Querying CSV files for large-scale analytics
- ❌ Small files (< 10MB) in data lake
- ❌ Missing partitioning on time-series data
- ❌ Not leveraging result cache
🔗 Related Resources¶
Architecture Documentation¶
Implementation Guides¶
Performance & Cost¶
Last Updated: 2025-01-28 Architecture Version: 2.0 Pattern: Serverless Query Processing