Skip to content

Video Script: Serverless SQL Mastery

Duration: 30 minutes Level: Intermediate Version: 1.0

Video Metadata

  • Title: Serverless SQL Pool Mastery - Query Data Lakes at Scale
  • Duration: 30:00
  • Target Audience: Data analysts, SQL developers
  • Skill Level: Intermediate
  • Prerequisites:
  • Strong SQL knowledge
  • Understanding of data lake concepts
  • Completed Synapse Fundamentals video
  • Access to Synapse workspace
  • Tools Required:
  • Azure Synapse workspace
  • Azure Data Lake Storage with sample data
  • SQL client (optional)

Learning Objectives

By the end of this video, viewers will be able to:

  1. Query files directly in data lakes using T-SQL
  2. Optimize Serverless SQL queries for cost and performance
  3. Create external tables and views for data lake access
  4. Implement security and access control patterns
  5. Troubleshoot common query issues
  6. Understand billing and cost optimization strategies

Video Script

Opening Hook (0:00 - 0:45)

[SCENE 1: Dynamic Query Visualization] [Background: SQL query executing across distributed data lake files]

NARRATOR: "Query petabytes of data without provisioning infrastructure. Pay only for the data you scan. Access CSV, JSON, Parquet, and Delta files using familiar T-SQL syntax. This is Serverless SQL in Azure Synapse Analytics."

[VISUAL: Cost meter showing minimal charge] - Query size: 2.5 TB - Data scanned: 125 GB (optimized) - Cost: $0.63 - Execution time: 8 seconds

NARRATOR: "In this masterclass, I'll reveal every technique for querying data lakes efficiently, securely, and cost-effectively. Let's make you a Serverless SQL expert!"

[TRANSITION: Zoom into Synapse Studio]

Introduction & Concepts (0:45 - 3:30)

[SCENE 2: Concept Overview]

NARRATOR: "Serverless SQL Pool - also called SQL on-demand - is revolutionary because it separates compute from storage."

[VISUAL: Traditional vs Serverless architecture diagram]

Traditional Data Warehouse: ```textData → ETL → Load → Warehouse → Query Cost: Always running ($$$) Latency: Hours to load data

**Serverless SQL**:
```textData → Query in Place
Cost: Pay per query ($)
Latency: Immediate access

Key Capabilities: - Query files without moving data - Support for structured and semi-structured data - T-SQL syntax (familiar to SQL developers) - Automatic schema inference - Integration with Power BI and other tools - Consumption-based pricing

[TRANSITION: First query demonstration]

Section 1: Basic Querying Patterns (3:30 - 10:00)

[SCENE 3: Screen Recording - Query Editor]

Querying CSV Files (3:30 - 5:00)

NARRATOR: "Let's start with the most common scenario - querying CSV files in your data lake."

[VISUAL: Open new SQL script]

Query 1 - Basic CSV Query:

-- Query CSV files with automatic schema inference
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/raw/sales/*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS sales_data

[RUN QUERY]

NARRATOR: "Notice PARSER_VERSION 2.0 - this gives better performance and more accurate type inference than version 1.0."

Query 2 - Explicit Schema:

-- Define explicit schema for better performance
SELECT
    customer_id,
    product_name,
    sales_amount,
    sale_date
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/raw/sales/2024/*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
)
WITH (
    customer_id INT,
    product_name VARCHAR(200),
    sales_amount DECIMAL(18,2),
    sale_date DATE
) AS sales_data
WHERE
    sale_date >= '2024-01-01'
    AND sales_amount > 100
ORDER BY
    sales_amount DESC

[RUN QUERY, show execution time]

NARRATOR: "Explicit schemas are faster because Serverless SQL doesn't need to infer types. This also gives you control over data type conversions."

Querying Parquet Files (5:00 - 6:30)

NARRATOR: "Parquet files are columnar and compressed - perfect for analytics."

Query 3 - Parquet Query:

-- Query Parquet files (much faster than CSV)
SELECT
    product_category,
    COUNT(*) as transaction_count,
    SUM(sales_amount) as total_sales,
    AVG(sales_amount) as avg_sale,
    MIN(sale_date) as first_sale,
    MAX(sale_date) as last_sale
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/curated/sales_partitioned/**/*.parquet',
    FORMAT = 'PARQUET'
) AS sales_parquet
WHERE
    sales_parquet.filepath(1) = 'year=2024'  -- Partition pruning
    AND sales_parquet.filepath(2) IN ('month=01', 'month=02', 'month=03')
GROUP BY
    product_category
HAVING
    SUM(sales_amount) > 100000
ORDER BY
    total_sales DESC

[RUN QUERY]

NARRATOR: "The filepath() function enables partition pruning - scanning only relevant files. This dramatically reduces cost and improves performance."

Querying JSON Files (6:30 - 8:00)

NARRATOR: "JSON is common for semi-structured data like logs and API responses."

Query 4 - JSON Query:

-- Query JSON files and extract nested properties
SELECT
    JSON_VALUE(doc, '$.userId') AS user_id,
    JSON_VALUE(doc, '$.event.type') AS event_type,
    JSON_VALUE(doc, '$.event.timestamp') AS event_timestamp,
    JSON_QUERY(doc, '$.event.properties') AS event_properties,
    JSON_VALUE(doc, '$.session.id') AS session_id
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/logs/events/2024/01/**/*.json',
    FORMAT = 'CSV',
    FIELDTERMINATOR = '0x0b',
    FIELDQUOTE = '0x0b',
    ROWTERMINATOR = '0x0b'
) WITH (doc NVARCHAR(MAX)) AS events
CROSS APPLY OPENJSON(doc)
WHERE
    JSON_VALUE(doc, '$.event.type') = 'purchase'

[RUN QUERY]

NARRATOR: "JSON_VALUE extracts scalar values, while JSON_QUERY extracts objects and arrays. Use OPENJSON for complex nested structures."

Querying Delta Lake (8:00 - 10:00)

NARRATOR: "Delta Lake tables provide ACID transactions and metadata - let's query them."

Query 5 - Delta Lake Query:

-- Query Delta Lake tables with time travel
SELECT
    product_id,
    product_name,
    current_price,
    last_updated
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/delta/products/',
    FORMAT = 'DELTA'
) AS products
WHERE
    current_price BETWEEN 50 AND 500

-- Time travel query (requires timestamp)
SELECT
    product_id,
    product_name,
    price_yesterday
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/delta/products/',
    FORMAT = 'DELTA',
    VERSION_AS_OF = (
        SELECT MAX(version) - 1
        FROM OPENROWSET(
            BULK 'https://mydatalake.dfs.core.windows.net/delta/products/_delta_log/*.json',
            FORMAT = 'CSV'
        ) AS log
    )
) AS products_yesterday

NARRATOR: "Delta format automatically tracks schema, handles deletions, and provides time travel capabilities."

[TRANSITION: External tables section]

Section 2: External Tables & Views (10:00 - 16:00)

[SCENE 4: Creating Reusable Objects]

Creating External Data Source (10:00 - 11:00)

NARRATOR: "Instead of repeating URLs, create external data sources for cleaner code."

Query 6 - External Data Source:

-- Create database for external objects
CREATE DATABASE SalesAnalytics
GO

USE SalesAnalytics
GO

-- Create master key for encryption (one time)
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'Strong_Password_123!'
GO

-- Create database scoped credential
CREATE DATABASE SCOPED CREDENTIAL DataLakeCredential
WITH IDENTITY = 'Managed Identity'  -- Or use Shared Access Signature
GO

-- Create external data source
CREATE EXTERNAL DATA SOURCE ContosoDatalake
WITH (
    LOCATION = 'https://mydatalake.dfs.core.windows.net/curated',
    CREDENTIAL = DataLakeCredential
)
GO

[EXECUTE statements one by one]

NARRATOR: "Now queries can reference ContosoDatalake instead of full URLs. Much cleaner and easier to manage."

Creating External File Formats (11:00 - 12:00)

Query 7 - File Formats:

-- Define reusable file formats
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
)
GO

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '"',
        FIRST_ROW = 2,  -- Skip header
        USE_TYPE_DEFAULT = FALSE,
        ENCODING = 'UTF8'
    )
)
GO

CREATE EXTERNAL FILE FORMAT DeltaFormat
WITH (
    FORMAT_TYPE = DELTA
)
GO

Creating External Tables (12:00 - 14:00)

NARRATOR: "External tables provide a relational interface to data lake files."

Query 8 - External Table:

-- Create external table for sales data
CREATE EXTERNAL TABLE dbo.Sales
(
    sale_id BIGINT,
    customer_id INT,
    product_id INT,
    product_category VARCHAR(100),
    sales_amount DECIMAL(18,2),
    quantity INT,
    sale_date DATE,
    sale_timestamp DATETIME2
)
WITH (
    LOCATION = 'sales_partitioned/**/*.parquet',
    DATA_SOURCE = ContosoDatalake,
    FILE_FORMAT = ParquetFormat
)
GO

-- Query like a regular table
SELECT
    product_category,
    COUNT(*) as sales_count,
    SUM(sales_amount) as revenue
FROM dbo.Sales
WHERE sale_date >= DATEADD(month, -3, GETDATE())
GROUP BY product_category
ORDER BY revenue DESC

[RUN QUERY]

NARRATOR: "External tables enable Power BI, SSMS, and other tools to query data lake files as if they were database tables."

Creating Views (14:00 - 16:00)

NARRATOR: "Views add business logic and simplify complex queries."

Query 9 - Views:

-- Create view with business logic
CREATE VIEW dbo.SalesQuarterlySummary
AS
SELECT
    DATEPART(YEAR, sale_date) as sale_year,
    DATEPART(QUARTER, sale_date) as sale_quarter,
    product_category,
    COUNT(*) as transaction_count,
    SUM(sales_amount) as total_revenue,
    AVG(sales_amount) as avg_transaction_value,
    SUM(quantity) as units_sold
FROM dbo.Sales
GROUP BY
    DATEPART(YEAR, sale_date),
    DATEPART(QUARTER, sale_date),
    product_category
GO

-- Create view joining multiple sources
CREATE VIEW dbo.EnrichedSales
AS
SELECT
    s.sale_id,
    s.sale_date,
    s.sales_amount,
    c.customer_name,
    c.customer_tier,
    p.product_name,
    p.category_name
FROM dbo.Sales s
INNER JOIN dbo.Customers c ON s.customer_id = c.customer_id
INNER JOIN dbo.Products p ON s.product_id = p.product_id
GO

-- Query the view
SELECT * FROM dbo.SalesQuarterlySummary
WHERE sale_year = 2024
ORDER BY total_revenue DESC

[RUN QUERY]

[TRANSITION: Performance optimization]

Section 3: Performance Optimization (16:00 - 22:00)

[SCENE 5: Optimization Techniques]

Understanding Query Costs (16:00 - 17:30)

NARRATOR: "Serverless SQL bills based on data scanned. Optimization directly reduces costs."

[VISUAL: Show cost breakdown]

Cost Calculation: ```textCost = Data Scanned (TB) × $5.00 per TB

Example: - Unoptimized: 500 GB scanned = $2.50 - Optimized: 50 GB scanned = $0.25 - Savings: 90% reduction

**Cost Optimization Principles**:
```sql
-- ❌ BAD: Scans all columns
SELECT * FROM dbo.Sales

-- ✅ GOOD: Only scans needed columns
SELECT sale_id, sales_amount, sale_date
FROM dbo.Sales

-- ❌ BAD: Scans all files
SELECT * FROM dbo.Sales
WHERE product_category = 'Electronics'

-- ✅ GOOD: Partition pruning
SELECT * FROM dbo.Sales
WHERE sales.filepath(1) = 'category=Electronics'

Partition Pruning (17:30 - 19:00)

NARRATOR: "Partition pruning is the single most effective optimization technique."

Query 10 - Partition Pruning Examples:

-- Leveraging partition structure: /year=2024/month=01/day=15/
SELECT
    customer_id,
    SUM(sales_amount) as total_spent
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/sales/**/*.parquet',
    FORMAT = 'PARQUET'
) AS sales
WHERE
    sales.filepath(1) = 'year=2024'
    AND sales.filepath(2) IN ('month=01', 'month=02', 'month=03')
GROUP BY customer_id

-- Combined with column pruning
SELECT
    -- Only select needed columns
    product_id,
    SUM(quantity) as units_sold
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/sales/year=2024/month=01/**/*.parquet',
    FORMAT = 'PARQUET'
) AS sales
GROUP BY product_id

-- View query cost
-- Check "Data processed (MB)" in query results

[RUN QUERY, highlight data scanned metric]

File Format Selection (19:00 - 20:00)

NARRATOR: "File format dramatically impacts performance. Let's compare."

Performance Comparison:

-- CSV Query (slowest)
SELECT COUNT(*) FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/data/sales.csv',
    FORMAT = 'CSV', PARSER_VERSION = '2.0', HEADER_ROW = TRUE
) AS data
-- Result: 45 seconds, 2.5 GB scanned

-- Parquet Query (fast)
SELECT COUNT(*) FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/data/sales.parquet',
    FORMAT = 'PARQUET'
) AS data
-- Result: 3 seconds, 450 MB scanned

-- Parquet with compression (fastest)
-- Snappy or GZIP compression
-- Result: 2 seconds, 180 MB scanned

Recommendations: - ✅ Use Parquet for large datasets (columnar, compressed) - ✅ Use Delta for transactional workloads (ACID, versioning) - ⚠️ Use CSV only for compatibility (slow, inefficient) - ⚠️ Use JSON for semi-structured data (flexible but slower)

Result Set Caching (20:00 - 21:00)

NARRATOR: "Serverless SQL automatically caches query results for 48 hours."

Query 11 - Caching Demonstration:

-- First run: scans data lake
SELECT
    product_category,
    COUNT(*) as count,
    SUM(sales_amount) as revenue
FROM dbo.Sales
WHERE sale_date >= '2024-01-01'
GROUP BY product_category
-- Execution: 8 seconds, 2.1 GB scanned

-- Second run: uses cache
-- Same query executed again
-- Execution: <1 second, 0 GB scanned, $0 cost!

NARRATOR: "Identical queries use cached results at no charge. This is perfect for Power BI dashboards with multiple viewers."

Statistics and Query Plans (21:00 - 22:00)

Query 12 - Query Insights:

-- Enable actual execution plan
SET STATISTICS IO ON
SET STATISTICS TIME ON

-- Run query
SELECT
    c.customer_tier,
    COUNT(DISTINCT s.customer_id) as customer_count,
    SUM(s.sales_amount) as total_revenue
FROM dbo.Sales s
INNER JOIN dbo.Customers c ON s.customer_id = c.customer_id
WHERE s.sale_date >= '2024-01-01'
GROUP BY c.customer_tier

-- Review execution plan for:
-- - Data scanned per table
-- - Join strategy used
-- - Aggregation method

[SHOW execution plan in SSMS or Synapse Studio]

[TRANSITION: Security section]

Section 4: Security & Access Control (22:00 - 26:00)

[SCENE 6: Security Patterns]

Authentication Methods (22:00 - 23:00)

NARRATOR: "Serverless SQL supports multiple authentication methods."

Authentication Options:

-- 1. Azure AD Authentication (recommended)
-- Users authenticate with Azure AD credentials
-- No credentials in connection string

-- 2. Managed Identity
CREATE DATABASE SCOPED CREDENTIAL ManagedIdentityCredential
WITH IDENTITY = 'Managed Identity'
GO

-- 3. Shared Access Signature (SAS)
CREATE DATABASE SCOPED CREDENTIAL SASCredential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = 'sv=2021-06-08&ss=bfqt&srt=sco&sp=rwdlacupiytfx...'
GO

-- 4. Storage Account Key (least secure)
CREATE DATABASE SCOPED CREDENTIAL AccountKeyCredential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = 'your-storage-account-key'
GO

Best Practice: Use Azure AD authentication for users, Managed Identity for services.

Row-Level Security (23:00 - 24:30)

NARRATOR: "Implement row-level security to control data access by user."

Query 13 - Row-Level Security:

-- Create security predicate function
CREATE FUNCTION dbo.fn_SecurityPredicate(@Region VARCHAR(50))
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
    SELECT 1 AS fn_SecurityPredicate_result
    WHERE
        @Region = USER_NAME()  -- Simplified example
        OR IS_MEMBER('SalesManagers') = 1  -- Managers see all regions
GO

-- Create security policy
CREATE SECURITY POLICY dbo.RegionalSalesFilter
ADD FILTER PREDICATE dbo.fn_SecurityPredicate(region)
ON dbo.Sales
WITH (STATE = ON)
GO

-- Now queries automatically filter by user
SELECT * FROM dbo.Sales
-- Users only see their region
-- Managers see all regions

Column-Level Security (24:30 - 25:30)

Query 14 - Column Masking:

-- Create view with conditional masking
CREATE VIEW dbo.CustomersMasked
AS
SELECT
    customer_id,
    -- Mask email for non-privileged users
    CASE
        WHEN IS_MEMBER('DataPrivacyOfficers') = 1
        THEN email
        ELSE LEFT(email, 3) + '***' + RIGHT(email, CHARINDEX('@', REVERSE(email)))
    END AS email,
    -- Mask phone
    CASE
        WHEN IS_MEMBER('DataPrivacyOfficers') = 1
        THEN phone
        ELSE 'XXX-XXX-' + RIGHT(phone, 4)
    END AS phone,
    customer_name,
    city,
    state
FROM dbo.Customers
GO

-- Grant access to masked view
GRANT SELECT ON dbo.CustomersMasked TO PublicUsers
GO

Access Control Best Practices (25:30 - 26:00)

Security Checklist:

-- ✅ DO:
-- Use Azure AD authentication
-- Implement row-level security for multi-tenant scenarios
-- Use managed identities for service principals
-- Encrypt sensitive data at rest
-- Audit data access with Azure Monitor
-- Grant minimum required permissions

-- ❌ DON'T:
-- Store credentials in code or connection strings
-- Use storage account keys (use SAS or managed identity)
-- Grant broad SELECT permissions
-- Expose PII without masking
-- Skip access auditing

[TRANSITION: Troubleshooting]

Section 5: Troubleshooting & Best Practices (26:00 - 29:00)

[SCENE 7: Common Issues]

Common Errors and Solutions (26:00 - 27:30)

NARRATOR: "Let's solve the most common Serverless SQL issues."

Error 1: Authentication Failure

-- Error: "Failed to execute query. Error: Access to the path is forbidden"

-- Solution 1: Grant RBAC permissions
-- Azure Portal > Storage Account > Access Control (IAM)
-- Add role: "Storage Blob Data Reader" or "Storage Blob Data Contributor"

-- Solution 2: Verify credential
CREATE DATABASE SCOPED CREDENTIAL FixedCredential
WITH IDENTITY = 'Managed Identity'  -- Ensure managed identity is enabled
GO

-- Solution 3: Check firewall rules
-- Azure Portal > Storage Account > Networking
-- Add your IP or enable "Allow Azure services"

Error 2: Timeout Errors

-- Error: "Query timeout expired"

-- Solution: Optimize query
-- 1. Add WHERE clause to reduce data scanned
SELECT * FROM dbo.Sales
WHERE sale_date >= DATEADD(month, -1, GETDATE())  -- Add filter

-- 2. Use partition pruning
WHERE sales.filepath(1) = 'year=2024'

-- 3. Reduce result set size
SELECT TOP 1000 * FROM dbo.Sales  -- Limit rows

-- 4. Create external table for repeated queries

Error 3: Schema Mismatch

-- Error: "Column name 'xyz' is not a valid column name"

-- Solution: Define explicit schema
SELECT * FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/data/file.csv',
    FORMAT = 'CSV', PARSER_VERSION = '2.0'
) WITH (
    column1 VARCHAR(100),
    column2 INT,
    column3 DECIMAL(18,2)
) AS data

Performance Best Practices Summary (27:30 - 29:00)

NARRATOR: "Here's your checklist for optimal Serverless SQL performance."

Best Practices:

-- 1. File Format
 Use Parquet with Snappy compression
 Partition large files by date or category
 Keep file sizes between 100MB-1GB

-- 2. Query Optimization
 Select only needed columns
 Use WHERE clauses on partition columns
 Create external tables for repeated queries
 Leverage result caching (identical queries)
 Use statistics for query optimization

-- 3. Cost Management
 Monitor data scanned in query results
 Set up cost alerts in Azure Monitor
 Use views to enforce column selection
 Educate users on cost-effective patterns

-- 4. Security
 Use Azure AD authentication
 Implement row/column-level security
 Audit access with diagnostic logs
 Never expose credentials in queries

-- 5. Monitoring
 Enable diagnostic logging
 Track query performance in Azure Monitor
 Set up alerts for failed queries
 Review cost trends weekly

[TRANSITION: Conclusion]

Conclusion & Next Steps (29:00 - 30:00)

[SCENE 8: Presenter Summary]

NARRATOR: "You're now equipped to master Serverless SQL in Azure Synapse!"

Key Takeaways: - ✅ Serverless SQL enables pay-per-query data lake access - ✅ File format and partitioning are critical for performance - ✅ External tables provide relational interface to files - ✅ Security can be implemented at row, column, and object levels - ✅ Optimization directly impacts both cost and performance

Next Steps: 1. Practice: Create external tables for your data lake 2. Optimize: Implement partition pruning in existing queries 3. Secure: Set up row-level security for multi-tenant data 4. Monitor: Enable diagnostic logging and cost tracking 5. Integrate: Connect Power BI to Serverless SQL endpoint

Resources: - Serverless SQL Best Practices - T-SQL Reference - Cost Management Guide

NARRATOR: "Thanks for watching! Next, explore Delta Lake Essentials to learn about ACID transactions in your data lake. Don't forget to subscribe!"

[FADE OUT]

Production Notes

Visual Assets Required

  • Architecture diagrams (traditional vs serverless)
  • Cost comparison charts
  • Query execution visualizations
  • Security pattern diagrams
  • Performance metrics overlay

Demo Requirements

  • Sample CSV, Parquet, JSON, Delta files
  • Pre-populated external tables
  • Security policies configured
  • Query performance metrics enabled

Production Status: Status

Last Updated: January 2025