Azure Synapse Analytics Guide

See also: generic Azure reference

For service-agnostic deep-dive content on Azure Synapse Analytics — architecture, feature reference, code samples, and patterns independent of CSA-in-a-Box — see Azure Synapse Analytics in the reference library.

Overview

Azure Synapse Analytics is a unified analytics service that brings together enterprise data warehousing, big data analytics, data integration, and visualization under a single control plane. Within CSA-in-a-Box, Synapse serves as the primary analytics runtime for Azure Government deployments and a complementary engine alongside Databricks in Azure Commercial.

Synapse provides five core capabilities:

| Capability | Engine | Billing Model | Best For |
| --- | --- | --- | --- |
| Serverless SQL Pool | T-SQL over ADLS | Pay-per-TB scanned | Ad-hoc exploration, Gold views, data virtualization |
| Dedicated SQL Pool | MPP (formerly SQL DW) | Reserved DWU capacity | Sub-second BI, high-concurrency dashboards |
| Apache Spark Pool | Managed Spark 3.x | Compute-hour | ML, complex ETL, Delta Lake processing |
| Data Explorer Pool | Kusto (KQL) | Compute-hour | Log analytics, time-series, near-real-time ingestion |
| Synapse Pipelines | ADF-compatible | Activity runs + DIU-hours | Orchestration, data movement, scheduling |

When to choose Synapse

See the Fabric vs. Databricks vs. Synapse decision tree for a structured comparison. In short: choose Synapse when you need Azure Government compliance (IL5/IL6), existing dedicated SQL pool estates, or serverless SQL-first data virtualization.


Architecture

The following diagram shows how Synapse integrates with the CSA-in-a-Box medallion architecture on ADLS Gen2 and serves downstream consumers.

graph LR
    subgraph "ADLS Gen2 (Medallion)"
        B[(Bronze<br/>Raw Parquet/JSON)]
        S[(Silver<br/>Cleansed Delta)]
        G[(Gold<br/>Curated Delta)]
    end

    subgraph "Synapse Workspace"
        SQ[Serverless SQL Pool]
        DQ[Dedicated SQL Pool]
        SP[Spark Pool]
        DEP[Data Explorer Pool]
        PL[Synapse Pipelines]
    end

    subgraph Consumers
        PBI[Power BI]
        API[Data APIs]
        NB[Notebooks / Data Science]
        DASH[Operational Dashboards]
    end

    B -->|OPENROWSET| SQ
    S -->|External Tables| SQ
    G -->|CETAS / Views| SQ
    G -->|Loaded via Pipelines| DQ
    B -->|PySpark Read| SP
    S -->|Delta Merge| SP
    SP -->|Write Gold| G
    PL -->|Orchestrate| SP
    PL -->|Copy Activity| DQ
    DEP -->|Ingest Streaming| DASH

    SQ --> PBI
    DQ --> PBI
    DQ --> API
    SP --> NB
    DEP --> DASH

Serverless SQL Pool

Serverless SQL is the default entry point for data exploration in Synapse. There is no infrastructure to provision — queries run on-demand against files in ADLS Gen2 and you pay only for the data scanned.

OPENROWSET Queries

Query Parquet, Delta, CSV, or JSON files directly without loading them into a table.

-- Query Parquet files in the Silver layer
SELECT
    customer_id,
    order_date,
    total_amount
FROM OPENROWSET(
    BULK 'https://<storage>.dfs.core.windows.net/silver/sales/orders/**',
    FORMAT = 'PARQUET'
) AS orders
WHERE order_date >= '2026-01-01';

-- Query Delta Lake tables (specify Delta format)
SELECT *
FROM OPENROWSET(
    BULK 'https://<storage>.dfs.core.windows.net/gold/dim_customer/',
    FORMAT = 'DELTA'
) AS dim_customer;

External Tables

For frequently queried datasets, create external tables to avoid repeating OPENROWSET boilerplate. External tables also support column statistics.

-- Step 1: Scoped credential for ADLS access
CREATE DATABASE SCOPED CREDENTIAL adls_credential
WITH IDENTITY = 'Managed Identity';

-- Step 2: External data source
CREATE EXTERNAL DATA SOURCE gold_lake
WITH (
    LOCATION   = 'https://<storage>.dfs.core.windows.net/gold/',
    CREDENTIAL = adls_credential
);

-- Step 3: External file format
CREATE EXTERNAL FILE FORMAT parquet_format
WITH (FORMAT_TYPE = PARQUET);

-- Step 4: External table
CREATE EXTERNAL TABLE dbo.dim_product (
    product_id   INT,
    product_name NVARCHAR(200),
    category     NVARCHAR(100),
    unit_price   DECIMAL(10,2)
)
WITH (
    LOCATION     = 'dim_product/',
    DATA_SOURCE  = gold_lake,
    FILE_FORMAT  = parquet_format
);

CETAS (CREATE EXTERNAL TABLE AS SELECT)

CETAS materializes query results back to ADLS as Parquet, enabling you to build the Gold layer directly from serverless SQL.

CREATE EXTERNAL TABLE gold.fact_daily_sales
WITH (
    LOCATION     = 'fact_daily_sales/',
    DATA_SOURCE  = gold_lake,
    FILE_FORMAT  = parquet_format
)
AS
SELECT
    CAST(order_date AS DATE)     AS sale_date,
    product_id,
    COUNT(*)                     AS order_count,
    SUM(total_amount)            AS revenue
FROM OPENROWSET(
    BULK 'https://<storage>.dfs.core.windows.net/silver/sales/orders/**',
    FORMAT = 'PARQUET'
) AS orders
GROUP BY CAST(order_date AS DATE), product_id;

Gold Layer Views

Create views on top of external tables or OPENROWSET to present a clean semantic layer for Power BI and downstream APIs.

CREATE VIEW gold.vw_customer_lifetime_value AS
SELECT
    c.customer_id,
    c.customer_name,
    c.segment,
    COUNT(DISTINCT o.order_id) AS total_orders,
    SUM(o.total_amount)        AS lifetime_value,
    MAX(o.order_date)          AS last_order_date
FROM gold.dim_customer  c
JOIN gold.fact_orders    o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name, c.segment;

Performance Tips

| Technique | Impact | Details |
| --- | --- | --- |
| Partition elimination | High | Partition data by date or region; serverless prunes partitions automatically when WHERE filters align |
| File sizing | High | Target 128 MB - 1 GB per Parquet file; small files cause excessive I/O overhead |
| Column statistics | Medium | CREATE STATISTICS on external tables helps the optimizer choose better plans |
| Column projection | Medium | Select only needed columns — Parquet is columnar, so fewer columns = less data scanned |
| Row group filtering | Medium | Parquet row-group statistics enable predicate pushdown; sort data on filter columns before writing |
| Result set caching | Low-Med | Repeated identical queries return cached results at no scan cost |
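
Partition elimination and column statistics can be applied directly from serverless SQL. The sketch below is illustrative: it assumes a hypothetical year=*/month=* folder layout under the Silver orders path (so filepath() can prune folders) and adds statistics to the dbo.dim_product external table defined earlier.

-- Partition elimination: filepath(n) returns the value matched by the n-th wildcard in the BULK path
SELECT COUNT(*) AS order_count
FROM OPENROWSET(
    BULK 'https://<storage>.dfs.core.windows.net/silver/sales/orders/year=*/month=*/*.parquet',
    FORMAT = 'PARQUET'
) AS orders
WHERE orders.filepath(1) = '2026'          -- only year=2026 folders are read
  AND orders.filepath(2) IN ('01', '02');  -- only January and February folders

-- Column statistics on the external table defined earlier
CREATE STATISTICS stats_dim_product_category
ON dbo.dim_product (category)
WITH FULLSCAN, NORECOMPUTE;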

Cost Model

Serverless SQL charges $5 per TB scanned (may vary by region). There is no charge for idle time, but poorly written queries that scan entire containers without partition pruning can become expensive quickly. Always use WHERE clauses that align with your partitioning scheme.
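
To keep scans within budget, check how much data the workspace has already processed. A minimal check, assuming the sys.dm_external_data_processed view is available in your serverless SQL database:

-- Data processed by serverless SQL for the current day, week, and month
SELECT type, data_processed_mb
FROM sys.dm_external_data_processed;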


Dedicated SQL Pools

Dedicated SQL pools (formerly Azure SQL Data Warehouse) provide a provisioned MPP engine for sub-second, high-concurrency workloads.

When to Use

  • Power BI dashboards serving 50+ concurrent users requiring sub-second response
  • Star-schema data warehouses with heavy JOIN and aggregation patterns
  • Workloads that need predictable latency independent of query complexity
  • Azure Government environments at IL5/IL6 with high-concurrency BI

Distribution Strategies

Choosing the right distribution is the single most impactful performance decision for dedicated pools.

| Strategy | Syntax | Best For | Watch Out |
| --- | --- | --- | --- |
| Hash | DISTRIBUTION = HASH(column) | Large fact tables (>60M rows); choose a high-cardinality join key | Data skew if the chosen column has low cardinality |
| Round-robin | DISTRIBUTION = ROUND_ROBIN | Staging tables, no dominant join pattern | Every query requires data movement for JOINs |
| Replicate | DISTRIBUTION = REPLICATE | Small dimension tables (<1.5 GB) | Rebuild cost on updates; not for large tables |

-- Hash-distributed fact table
CREATE TABLE dbo.fact_sales
WITH (
    DISTRIBUTION = HASH(customer_id),
    CLUSTERED COLUMNSTORE INDEX
)
AS SELECT * FROM staging.raw_sales;

-- Replicated dimension table
CREATE TABLE dbo.dim_date
WITH (
    DISTRIBUTION = REPLICATE,
    CLUSTERED COLUMNSTORE INDEX
)
AS SELECT * FROM staging.date_dimension;

Indexing Strategies

| Index Type | Use Case | Notes |
| --- | --- | --- |
| Clustered Columnstore (default) | Fact tables, analytics workloads | Best compression, batch-mode execution; requires >1M rows per distribution for optimal segment quality |
| Heap | Staging / temp tables | Fastest for bulk insert; no ordering overhead |
| Clustered Index | Lookup-heavy dimensions | B-tree for point lookups; avoid on wide fact tables |
| Nonclustered | Secondary access paths | Adds overhead on writes; use sparingly |
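
A common load pattern pairs a heap staging table with the hash-distributed columnstore fact table shown earlier. The sketch below is illustrative: the staging table name, column list, and source path are assumptions, and COPY INTO authenticates with the workspace managed identity.

-- Heap staging table: fastest target for bulk loads, no index maintenance
CREATE TABLE staging.orders_stage
(
    order_id     BIGINT,
    customer_id  INT,
    order_date   DATE,
    total_amount DECIMAL(18, 2)
)
WITH (
    DISTRIBUTION = ROUND_ROBIN,
    HEAP
);

-- Bulk load from ADLS, then CTAS into the columnstore fact table as shown above
COPY INTO staging.orders_stage
FROM 'https://<storage>.dfs.core.windows.net/silver/sales/orders/*.parquet'
WITH (
    FILE_TYPE  = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity')
);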

Workload Management

Dedicated pools support workload groups and classifiers to allocate resources across competing workloads.

-- Create a workload group for BI queries
CREATE WORKLOAD GROUP bi_reports
WITH (
    MIN_PERCENTAGE_RESOURCE = 20,
    MAX_PERCENTAGE_RESOURCE = 60,
    CAP_PERCENTAGE_RESOURCE = 60,
    REQUEST_MIN_RESOURCE_GRANT_PERCENT = 5,
    REQUEST_MAX_RESOURCE_GRANT_PERCENT = 10
);

-- Classify BI service account into the group
CREATE WORKLOAD CLASSIFIER bi_classifier
WITH (
    WORKLOAD_GROUP = 'bi_reports',
    MEMBERNAME     = 'powerbi_svc',
    IMPORTANCE     = ABOVE_NORMAL
);

Auto-Pause and Geo-Backup

| Feature | Configuration | Notes |
| --- | --- | --- |
| Auto-pause | Portal > Scale > Enable auto-pause | Pauses after inactivity (min 60 min); first query takes ~2 min to resume |
| Geo-backup | Automatic (once per day to the paired region) | RPO = 24 hours; restore into the paired region from the latest geo-backup |
| Restore points | Automatic every 8 hours + user-defined | Retained for 7 days (automatic) or until deleted (user-defined) |

Auto-pause in production

Auto-pause is excellent for dev/test pools but should be evaluated carefully for production. The ~2 minute resume time may violate latency SLAs for interactive BI dashboards.


Spark Pools

Synapse Spark pools provide a managed Apache Spark environment integrated into the Synapse workspace, sharing the same metadata store and security boundary.

Notebook Development

Synapse notebooks support Python, Scala, SQL, R, and .NET (C#/F#). Notebooks can reference data through the shared Synapse metastore.

# Read Delta Lake table from Gold layer
df = spark.read.format("delta").load(
    "abfss://gold@<storage>.dfs.core.windows.net/fact_sales/"
)

# Transform and write back
from pyspark.sql import functions as F

daily_summary = (
    df.groupBy(F.col("sale_date"), F.col("region"))
      .agg(
          F.count("order_id").alias("order_count"),
          F.sum("total_amount").alias("revenue")
      )
)

daily_summary.write.format("delta").mode("overwrite").save(
    "abfss://gold@<storage>.dfs.core.windows.net/fact_daily_summary/"
)

Delta Lake Support

Synapse Spark has native Delta Lake support. Use Delta for all Silver and Gold layer tables to get ACID transactions, time travel, and schema enforcement.

# Merge (upsert) pattern for slowly changing dimensions
from delta.tables import DeltaTable

target = DeltaTable.forPath(
    spark, "abfss://gold@<storage>.dfs.core.windows.net/dim_customer/"
)
source = spark.read.format("delta").load(
    "abfss://silver@<storage>.dfs.core.windows.net/customers_staging/"
)

target.alias("t").merge(
    source.alias("s"),
    "t.customer_id = s.customer_id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()

Integration with Synapse SQL

Spark tables created through the shared metastore are automatically visible in serverless SQL pool, enabling cross-engine queries.

# Create a managed Spark table visible to serverless SQL
daily_summary.write.format("delta").mode("overwrite") \
    .option("path", "abfss://gold@<storage>.dfs.core.windows.net/fact_daily/") \
    .saveAsTable("gold_db.fact_daily_summary")

-- Query the same table from serverless SQL
SELECT * FROM gold_db.dbo.fact_daily_summary
WHERE sale_date >= '2026-04-01';

Configuration Tuning

| Setting | Recommended Value | Rationale |
| --- | --- | --- |
| spark.executor.instances | 2-10 (auto-scale) | Match to data volume; start small |
| spark.sql.shuffle.partitions | 2x executor cores | Avoid too many small tasks |
| spark.sql.files.maxPartitionBytes | 128m | Align with ADLS block size |
| spark.databricks.delta.optimizeWrite | true | Coalesce small files on write |
| spark.databricks.delta.autoCompact | true | Background compaction |

Library Management

Install Python/R packages at the pool level (shared) or session level (ad-hoc).

# requirements.txt uploaded to Spark pool configuration
pandas==2.2.0
scikit-learn==1.4.0
azure-identity==1.15.0
great-expectations==0.18.0

Session-scoped packages

For ad-hoc exploration, use session-scoped %pip install or %conda install magic commands to add packages for the current notebook session without modifying the pool configuration.


Synapse Pipelines

Synapse Pipelines share the same underlying engine as Azure Data Factory and support identical activities, linked services, integration runtimes, and data flows.

Synapse Pipelines vs. Azure Data Factory

| Aspect | Synapse Pipelines | Azure Data Factory |
| --- | --- | --- |
| Integration runtime | Shared with Synapse workspace | Standalone service |
| Notebook activity | Native Synapse notebook execution | Requires Databricks or HDInsight linked service |
| SQL pool activity | Direct dedicated/serverless SQL execution | Requires linked service |
| Data flows | Identical capability | Identical capability |
| Git integration | Synapse workspace Git | ADF-specific Git |
| Pricing | Same per-activity pricing | Same per-activity pricing |

When to use which

Use Synapse Pipelines when your orchestration is tightly coupled to Synapse SQL or Spark pools. Use standalone ADF when you need a centralized orchestration layer across multiple Synapse workspaces, Databricks, or non-Synapse compute.

Parameterized Pipeline Example

{
    "name": "MedallionPipeline",
    "properties": {
        "parameters": {
            "sourceDate": { "type": "string", "defaultValue": "" },
            "layerTarget": { "type": "string", "defaultValue": "silver" }
        },
        "activities": [
            {
                "name": "BronzeToSilver",
                "type": "SynapseNotebook",
                "linkedServiceName": { "referenceName": "SynapseWorkspace" },
                "typeProperties": {
                    "notebook": { "referenceName": "bronze_to_silver" },
                    "parameters": {
                        "source_date": {
                            "value": "@pipeline().parameters.sourceDate",
                            "type": "Expression"
                        }
                    },
                    "sparkPool": { "referenceName": "csa_spark_pool" }
                }
            }
        ]
    }
}

Synapse and CSA-in-a-Box

Synapse as an Alternative to Databricks

Within CSA-in-a-Box, Synapse is positioned as the primary analytics engine in scenarios where Databricks is unavailable or less suitable. The decision is driven by three factors.

flowchart TD
    Q1{"Azure Government<br/>IL5 / IL6?"} -->|Yes| REC_SYN["Use Synapse"]
    Q1 -->|No| Q2
    Q2{"Existing dedicated SQL pool<br/>or SQL DW estate?"} -->|Yes| REC_SYN
    Q2 -->|No| Q3
    Q3{"SQL-first team,<br/>minimal Spark needs?"} -->|Yes| REC_SYN
    Q3 -->|No| REC_DBR["Use Databricks"]

    REC_SYN --> NOTE_SYN["Serverless SQL + Dedicated Pools<br/>+ Spark Pools as needed"]
    REC_DBR --> NOTE_DBR["Unity Catalog + Photon<br/>+ MLflow"]

    style REC_SYN fill:#0078D4,color:#fff
    style REC_DBR fill:#FF3621,color:#fff

See also: ADR-0002 (Databricks over OSS Spark) and the full decision tree.

Medallion Architecture on Synapse

The same Bronze/Silver/Gold pattern used with Databricks maps directly onto Synapse engines.

| Layer | Primary Engine | Storage Format | Notes |
| --- | --- | --- | --- |
| Bronze | Synapse Pipelines (Copy Activity) | Parquet / raw JSON | Land source data as-is |
| Silver | Spark Pool notebooks | Delta Lake | Cleanse, deduplicate, type-cast |
| Gold | Serverless SQL (CETAS) or Spark | Delta Lake / Parquet | Aggregated, business-aligned datasets |
| Serving | Dedicated SQL Pool or serverless SQL views | Columnar / views | Optimized for BI and API consumption |

dbt-synapse Adapter

CSA-in-a-Box supports dbt-synapse for SQL-based transformations on dedicated pools.

# profiles.yml
csa_synapse:
    target: dev
    outputs:
        dev:
            type: synapse
            driver: "ODBC Driver 18 for SQL Server"
            host: "<workspace>.sql.azuresynapse.net"
            port: 1433
            database: "csa_dedicated_pool"
            schema: "gold"
            authentication: "ActiveDirectoryServicePrincipal"
            tenant_id: "{{ env_var('AZURE_TENANT_ID') }}"
            client_id: "{{ env_var('AZURE_CLIENT_ID') }}"
            client_secret: "{{ env_var('AZURE_CLIENT_SECRET') }}"

dbt-synapse vs. dbt-databricks

If your CSA-in-a-Box deployment uses both engines, maintain separate dbt profiles and share model SQL through the ref() function. See ADR-0013 (dbt as canonical transformation) for the canonical dbt strategy.


Managed Private Endpoints

Synapse supports workspace-managed VNets for network isolation. All outbound traffic from managed Spark pools and integration runtimes routes through managed private endpoints.

Configuration

graph LR
    subgraph "Synapse Managed VNet"
        IR[Managed Integration Runtime]
        SP_N[Spark Pool]
    end

    subgraph "Customer VNet"
        PE_ADLS[Private Endpoint<br/>ADLS Gen2]
        PE_SQL[Private Endpoint<br/>Azure SQL]
        PE_KV[Private Endpoint<br/>Key Vault]
    end

    IR -->|Managed PE| PE_ADLS
    IR -->|Managed PE| PE_SQL
    SP_N -->|Managed PE| PE_KV

| Endpoint Target | Status | Approval |
| --- | --- | --- |
| ADLS Gen2 (storage account) | Auto-approved (same tenant) | Automatic |
| Azure SQL / Synapse dedicated pool | Auto-approved (same tenant) | Automatic |
| Key Vault | Auto-approved (same tenant) | Automatic |
| Third-party / cross-tenant | Pending | Manual approval by resource owner |

Managed VNet limitations

When a workspace managed VNet is enabled, all Spark pools and integration runtimes use managed compute. You cannot use self-hosted integration runtimes within the managed VNet — deploy them in a separate workspace or use a standard ADF instance.


Monitoring

Dedicated Pool DMVs

-- Active queries and resource consumption
SELECT
    r.request_id,
    r.session_id,
    r.status,
    r.submit_time,
    r.total_elapsed_time / 1000.0 AS elapsed_sec,
    r.resource_class,
    r.command
FROM sys.dm_pdw_exec_requests r
WHERE r.status = 'Running'
ORDER BY r.submit_time;

-- Distribution data skew detection
SELECT
    tb.name                        AS table_name,
    CAST(MAX(row_count) AS FLOAT)
        / NULLIF(MIN(row_count), 0) AS skew_ratio
FROM sys.dm_pdw_nodes_db_partition_stats ps
JOIN sys.pdw_nodes_tables              nt ON ps.object_id = nt.object_id
JOIN sys.pdw_table_mappings            tm ON nt.name = tm.physical_name
JOIN sys.tables                        tb ON tm.object_id = tb.object_id
GROUP BY tb.name
HAVING MAX(row_count) > 0
ORDER BY skew_ratio DESC;

Query Store

Enable Query Store on dedicated pools to track query performance over time, identify regressions, and find optimization opportunities.

ALTER DATABASE csa_dedicated_pool
SET QUERY_STORE = ON (
    OPERATION_MODE = READ_WRITE,
    MAX_STORAGE_SIZE_MB = 1024,
    INTERVAL_LENGTH_MINUTES = 30
);
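
Once Query Store is capturing data, the standard catalog views can surface the heaviest queries. A sketch, assuming Query Store has been enabled as above:

-- Top 10 queries by average duration across captured intervals
SELECT TOP 10
    q.query_id,
    t.query_sql_text,
    rs.avg_duration / 1000.0 AS avg_duration_ms,
    rs.count_executions
FROM sys.query_store_query         q
JOIN sys.query_store_query_text    t  ON q.query_text_id = t.query_text_id
JOIN sys.query_store_plan          p  ON q.query_id      = p.query_id
JOIN sys.query_store_runtime_stats rs ON p.plan_id       = rs.plan_id
ORDER BY rs.avg_duration DESC;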

Azure Monitor Integration

| Signal | Diagnostic Setting | Destination |
| --- | --- | --- |
| SQL requests | SynapseSqlPoolExecRequests | Log Analytics workspace |
| DMS workers | SynapseSqlPoolDmsWorkers | Log Analytics workspace |
| Wait stats | SynapseSqlPoolWaits | Log Analytics workspace |
| Spark applications | SynapseSparkApplications | Log Analytics workspace |
| Pipeline runs | SynapsePipelineRuns | Log Analytics workspace |

// Diagnostic settings (Bicep excerpt)
resource synapseDiag 'Microsoft.Insights/diagnosticSettings@2021-05-01-preview' = {
  name: 'synapse-diagnostics'
  scope: synapseWorkspace
  properties: {
    workspaceId: logAnalyticsWorkspace.id
    logs: [
      { category: 'SynapseSqlPoolExecRequests', enabled: true }
      { category: 'SynapseSqlPoolDmsWorkers', enabled: true }
      { category: 'SynapseSqlPoolWaits', enabled: true }
      { category: 'SynapseSparkApplications', enabled: true }
      { category: 'SynapsePipelineRuns', enabled: true }
    ]
  }
}

Cost Optimization

Serverless vs. Dedicated Decision Tree

flowchart TD
    START{"What is the workload pattern?"} -->|Ad-hoc, exploratory| SLESS["Serverless SQL"]
    START -->|Scheduled dashboards,<br/><10 concurrent users| SLESS
    START -->|High concurrency,<br/>sub-second SLA| DED["Dedicated SQL Pool"]
    START -->|Predictable daily load,<br/>can tolerate pause| DED_PAUSE["Dedicated Pool<br/>+ Auto-Pause"]
    START -->|Intermittent, bursty,<br/>unpredictable| SLESS

    SLESS --> COST_S["~$5 / TB scanned<br/>Zero idle cost"]
    DED --> COST_D["DW100c = ~$1.20/hr<br/>24/7 unless paused"]
    DED_PAUSE --> COST_P["Same DWU rate<br/>but $0 when paused"]

    style SLESS fill:#4CAF50,color:#fff
    style DED fill:#FF9800,color:#fff
    style DED_PAUSE fill:#2196F3,color:#fff

Pause/Resume Automation

Automate dedicated pool pause/resume to avoid paying for idle compute during nights and weekends.

# Pause (e.g., via Azure Automation Runbook at 8 PM)
az synapse sql pool pause \
    --workspace-name csa-synapse-ws \
    --name csa_dedicated_pool \
    --resource-group csa-analytics-rg

# Resume (e.g., at 6 AM)
az synapse sql pool resume \
    --workspace-name csa-synapse-ws \
    --name csa_dedicated_pool \
    --resource-group csa-analytics-rg

Additional Cost Levers

| Lever | Savings | Implementation |
| --- | --- | --- |
| Reserved capacity | Up to 65% vs. pay-as-you-go | 1-year or 3-year commitment for dedicated pools |
| Result set caching | Eliminates repeated scans | ALTER DATABASE ... SET RESULT_SET_CACHING ON |
| Materialized views | Precomputed aggregations | CREATE MATERIALIZED VIEW on hot query patterns |
| Serverless cost cap | Budget protection | Set daily/weekly data-processed limits in workspace settings |
| Spark auto-scale | Right-size compute | Configure min/max nodes; pools scale down after idle timeout |
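
Result set caching and materialized views are each a single DDL statement on the dedicated pool. A sketch, assuming the csa_dedicated_pool database and illustrative column names on dbo.fact_sales:

-- Enable result set caching (run while connected to the master database)
ALTER DATABASE csa_dedicated_pool
SET RESULT_SET_CACHING ON;

-- Materialized view for a hot aggregation pattern
CREATE MATERIALIZED VIEW gold.mv_product_revenue
WITH (DISTRIBUTION = HASH(product_id))
AS
SELECT
    product_id,
    SUM(total_amount) AS revenue,
    COUNT_BIG(*)      AS order_count
FROM dbo.fact_sales
GROUP BY product_id;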

Related guide

For platform-wide FinOps strategies including Synapse, see the Cost Optimization Best Practices.


Security

Authentication and Authorization

| Method | Use Case | Configuration |
| --- | --- | --- |
| Azure AD (Entra ID) | Interactive users, SSO | Default; workspace-level AAD integration |
| Managed Identity | Service-to-service (ADLS, Key Vault) | System-assigned or user-assigned MI |
| Service Principal | CI/CD, dbt, automation | Register app in Entra ID; assign Synapse RBAC roles |
| SQL Authentication | Legacy compatibility only | Disabled by default in CSA-in-a-Box; enable only if required |
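
Entra ID principals still need a database user inside each SQL database they query. A minimal sketch, assuming the powerbi_svc service principal from the workload-management example exists in the tenant:

-- Map the Entra ID service principal to a database user and grant read access
CREATE USER [powerbi_svc] FROM EXTERNAL PROVIDER;
ALTER ROLE db_datareader ADD MEMBER [powerbi_svc];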

Synapse RBAC Roles

| Role | Scope | Permissions |
| --- | --- | --- |
| Synapse Administrator | Workspace | Full control over all artifacts and compute |
| Synapse SQL Administrator | Workspace / Pool | Manage dedicated and serverless SQL |
| Synapse Spark Administrator | Workspace / Pool | Manage Spark pools and notebooks |
| Synapse Contributor | Workspace | Create/edit artifacts; cannot manage access |
| Synapse Artifact User | Workspace | Read and execute published artifacts |
| Synapse Credential User | Workspace | Use credentials in pipelines and notebooks |

Column-Level and Row-Level Security

-- Column-level security: restrict SSN visibility
GRANT SELECT ON dbo.fact_customers
    (customer_id, customer_name, segment, lifetime_value)
    TO [bi_analysts];
-- SSN column is NOT in the grant list — invisible to bi_analysts

-- Row-level security: filter by region
CREATE FUNCTION dbo.fn_region_filter(@region NVARCHAR(50))
RETURNS TABLE
WITH SCHEMABINDING
AS
    RETURN SELECT 1 AS result
    WHERE @region = USER_NAME()
       OR USER_NAME() = 'dbo';

CREATE SECURITY POLICY dbo.RegionFilter
ADD FILTER PREDICATE dbo.fn_region_filter(region)
ON dbo.fact_sales
WITH (STATE = ON);

Dynamic Data Masking

-- Mask PII columns
ALTER TABLE dbo.dim_customer
ALTER COLUMN email ADD MASKED WITH (FUNCTION = 'email()');

ALTER TABLE dbo.dim_customer
ALTER COLUMN phone ADD MASKED WITH (FUNCTION = 'partial(0,"XXX-XXX-",4)');

ALTER TABLE dbo.dim_customer
ALTER COLUMN ssn ADD MASKED WITH (FUNCTION = 'default()');

-- Grant unmask to privileged role
GRANT UNMASK TO [compliance_officers];

Defense in depth

Combine column-level security (structural access control), row-level security (data-level filtering), and dynamic data masking (display-level obfuscation) for a layered approach. See the Security & Compliance Best Practices for the full defense-in-depth model.


Anti-Patterns

| Anti-Pattern | Problem | Correct Approach |
| --- | --- | --- |
| Running dedicated pools 24/7 for dev/test | Burning budget on idle compute | Enable auto-pause or use serverless SQL for dev |
| Storing small CSV files in Bronze | Serverless SQL scans are expensive per-file for many small files | Convert to Parquet/Delta; target 128 MB+ file sizes |
| Round-robin distribution on fact tables | Every JOIN requires full data movement | Use HASH distribution on the primary join key |
| Skipping column statistics on external tables | Serverless SQL optimizer makes poor choices | Create statistics on filter and join columns |
| Using dedicated pools for ad-hoc exploration | Over-provisioning for unpredictable workloads | Use serverless SQL; switch to dedicated only for production BI |
| Single-file Delta tables in Gold | No partition pruning, full scan on every query | Partition by date or key dimension; maintain a reasonable file count |
| Granting Synapse Administrator broadly | Excessive privilege; violates least-privilege | Use granular RBAC roles scoped to specific pools or artifacts |
| Ignoring data skew on hash-distributed tables | One distribution does all the work; queries slow down | Profile data cardinality; choose high-cardinality distribution columns |

Do / Don't Quick Reference

| Do | Don't |
| --- | --- |
| Use serverless SQL for exploration and Gold-layer views | Use dedicated pools for ad-hoc queries |
| Partition Gold tables by date or key dimension | Store everything in a single flat folder |
| Set cost caps on serverless SQL consumption | Let serverless queries run uncapped against large datasets |
| Enable auto-pause on non-production dedicated pools | Leave dev/test pools running 24/7 |
| Use managed identity for ADLS access | Embed storage keys in connection strings |
| Create external tables with statistics for repeated queries | Re-run OPENROWSET ad-hoc queries in production |
| Use Delta Lake format for Silver and Gold | Use CSV or uncompressed JSON for analytics layers |
| Configure workload management for mixed workloads | Let BI and ETL queries compete for the same resources |
| Enable diagnostic settings to Log Analytics | Troubleshoot blind without telemetry |
| Use managed private endpoints for network isolation | Allow public network access in production |

Pre-Deployment Checklist

  • Synapse workspace deployed via Bicep with managed VNet enabled
  • ADLS Gen2 storage provisioned with Bronze/Silver/Gold containers
  • Managed identity granted Storage Blob Data Contributor on ADLS
  • Serverless SQL database created with external data sources and credentials
  • Dedicated SQL pool sized appropriately (start DW100c, scale as needed)
  • Spark pool configured with auto-scale (min 3 / max 10 nodes)
  • Diagnostic settings streaming to Log Analytics workspace
  • Managed private endpoints created and approved for ADLS, Key Vault, and SQL
  • Synapse RBAC roles assigned per least-privilege principle
  • Column-level and row-level security configured for PII/sensitive data
  • Dynamic data masking applied to PII columns
  • Cost caps configured for serverless SQL (daily data-processed limit)
  • Auto-pause enabled on non-production dedicated pools
  • dbt-synapse profile configured and tested (if using dbt)
  • Pipeline schedules and alerting configured
  • Geo-backup restore tested in paired region

Cross-References

| Topic | Link |
| --- | --- |
| Synapse vs. Databricks vs. Fabric decision tree | fabric-vs-databricks-vs-synapse.md |
| ADR-0002: Databricks over OSS Spark | 0002-databricks-over-oss-spark.md |
| ADR-0013: dbt as canonical transformation | 0013-dbt-as-canonical-transformation.md |
| Databricks Guide (complementary engine) | DATABRICKS_GUIDE.md |
| Cost Optimization Best Practices | cost-optimization.md |
| Security & Compliance Best Practices | security-compliance.md |
| OSS Ecosystem (Trino as serverless SQL alternative) | oss-ecosystem.md |
| APIM Data Mesh Gateway (API serving layer) | apim-data-mesh-gateway.md |
| Disaster Recovery | DR.md |