# On-Premises SQL Server Integration
## Overview
Every federal agency, defense contractor, and regulated enterprise has the same conversation at some point: "We need cloud analytics, but we cannot move the database." The reasons vary — FISMA boundary constraints, application coupling, change-control overhead, or simply that the SQL Server instance is so deeply embedded in operations that migration risk outweighs any benefit.
CSA-in-a-Box is designed for exactly this scenario. You do not need to migrate your SQL Server to get value from the platform. Instead, you integrate it — bringing analytical copies, virtual references, or real-time replicas into the Azure analytics stack while the operational database stays on-premises, untouched, under the same controls it has always been under.
This guide covers five integration patterns, from zero-copy virtualization to near-real-time replication. Each pattern has different trade-offs in latency, cost, complexity, and compliance posture. The decision tree at the end will help you pick the right one.
!!! note "SQL Server 2022 Is the Target"
    This guide targets SQL Server 2022 (16.x), which introduced significant hybrid features — Synapse Link, S3 object storage integration, contained availability groups, and more. Most patterns also work with SQL Server 2019 and 2016, but several features (especially Synapse Link and S3 integration) require 2022. Version requirements are noted in each section.
!!! tip "Related Guides"
    - Self-Hosted Integration Runtime — deploying and operating the SHIR that bridges on-prem to Azure
    - Azure Data Factory Setup — ADF pipeline artifacts and CI/CD
    - Multi-Cloud Data Virtualization — broader virtualization patterns across clouds
    - Performance Tuning — optimizing pipeline and query performance
## Architecture
The following diagram shows how on-premises SQL Server connects to the CSA-in-a-Box analytics platform through multiple integration paths.
graph TB
subgraph OnPrem["On-Premises Data Center"]
SQLPRI["SQL Server 2022<br/>Primary (OLTP)"]
SQLSEC["SQL Server 2022<br/>AG Secondary (Read-Only)"]
SHIR["Self-Hosted IR<br/>(2+ nodes, HA)"]
ARC_AGENT["Azure Arc Agent"]
SQLPRI --- SQLSEC
SQLPRI --- ARC_AGENT
end
subgraph Network["Network Layer"]
ER["ExpressRoute<br/>(Production)"]
VPN["Site-to-Site VPN<br/>(Dev/Test)"]
end
subgraph Azure["Azure (CSA-in-a-Box)"]
subgraph Ingest["Ingestion Layer"]
ADF["Azure Data Factory"]
SYNLINK["Synapse Link<br/>for SQL Server"]
MILINK["MI Link<br/>(SQL MI)"]
end
subgraph Store["Storage Layer"]
ADLS["ADLS Gen2<br/>Bronze (Delta Lake)"]
ONELAKE["Fabric OneLake"]
MI["Azure SQL MI<br/>(Read Replica)"]
end
subgraph Transform["Transformation Layer"]
DBT["dbt Core<br/>Silver + Gold Models"]
SYNAPSE["Synapse Serverless SQL<br/>External Tables"]
end
subgraph Serve["Serving Layer"]
PBI["Power BI<br/>Dashboards + Reports"]
PURVIEW["Microsoft Purview<br/>Catalog + Lineage"]
end
end
SHIR -->|HTTPS/443| ER
ARC_AGENT -->|HTTPS/443| ER
ER --> ADF
ER --> SYNLINK
ER --> MILINK
VPN -.->|Dev/Test Only| ADF
ADF -->|CDC / Incremental| ADLS
SYNLINK -->|Zero-ETL| ADLS
MILINK -->|Near-Real-Time| MI
ADLS --> DBT
MI --> ONELAKE
DBT --> ADLS
ADLS --> SYNAPSE
SYNAPSE --> PBI
ONELAKE --> PBI
ADLS --> PURVIEW
MI --> PURVIEW
SQLSEC -.->|Analytics Queries| SHIR
style SQLPRI fill:#2d6a4f,color:#fff
style SQLSEC fill:#40916c,color:#fff
style ADLS fill:#0078d4,color:#fff
style MI fill:#0078d4,color:#fff
style PBI fill:#f2c811,color:#000

## SQL Server 2022 Feature Highlights for Hybrid
SQL Server 2022 is the most Azure-connected release in SQL Server history. These features are directly relevant to hybrid integration with CSA-in-a-Box.
| Feature | What It Does | CSA-in-a-Box Relevance |
|---|---|---|
| Azure Synapse Link for SQL Server | Zero-ETL replication of operational data to a Synapse dedicated SQL pool | Eliminates ADF pipelines for simple analytical scenarios |
| Managed Instance Link | Near-real-time replication to Azure SQL MI (RPO ~seconds) | Full read replica in Azure for BI and analytics |
| S3 Object Storage Integration | BACKUP/RESTORE, BULK INSERT, OPENROWSET against S3-compatible endpoints | Direct data exchange between SQL Server and lakehouse storage (via an S3-compatible gateway) |
| Azure Arc-enabled SQL Server | Unified management, inventory, and governance from Azure portal | Single pane of glass for hybrid SQL estate |
| Ledger Tables | Immutable, cryptographically verifiable audit trail | Compliance and tamper-evidence for regulated data |
| Query Store Improvements | Query Store hints, CE feedback, plan forcing enhancements | Better performance without application changes |
| Intelligent Query Processing (IQP) | DOP feedback, CE feedback, optimized plan forcing | Automatic query optimization for analytical workloads |
| Parameter Sensitive Plan (PSP) Optimization | Multiple plans for a single parameterized query | Eliminates parameter sniffing issues in reporting queries |
| Contained Availability Groups | AG-level system databases (master, msdb, logins) | Simplified AG management, easier failover for analytics |
| T-SQL Enhancements | GREATEST, LEAST, DATETRUNC, GENERATE_SERIES, STRING_SPLIT with ordinal, JSON improvements | Cleaner transformation logic in source queries |
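To make the last row concrete, here is a hedged sketch of how the 2022 T-SQL additions tighten a typical source query. It assumes the illustrative `dbo.Orders` table used throughout this guide:

```sql
-- Gap-free 90-day revenue series using GENERATE_SERIES and DATETRUNC
-- (SQL Server 2022+; table and column names are illustrative)
SELECT
    cal.bucket_day,
    COALESCE(SUM(o.TotalAmount), 0) AS daily_revenue
FROM GENERATE_SERIES(0, 89) AS s
CROSS APPLY (SELECT DATEADD(DAY, -s.value, DATETRUNC(DAY, GETDATE()))) AS cal(bucket_day)
LEFT JOIN dbo.Orders AS o
    ON DATETRUNC(DAY, o.OrderDate) = cal.bucket_day
GROUP BY cal.bucket_day
ORDER BY cal.bucket_day;
```

Before 2022, the same calendar spine typically required a numbers table or a recursive CTE.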
### Version Compatibility Matrix
| Pattern | Minimum SQL Server Version |
|---|---|
| Azure Arc Registration | SQL Server 2012+ |
| Self-Hosted IR + ADF | SQL Server 2008+ |
| AG Read Replicas | SQL Server 2012+ (Enterprise) |
| MI Link | SQL Server 2016+ |
| Synapse Link for SQL Server | SQL Server 2022 |
| S3 Object Storage Integration | SQL Server 2022 |
## Pattern 1: Azure Arc + Data Virtualization (Query in Place)
**When to use:** Low-frequency analytics, compliance constraints that prevent copying data outside the security boundary, or initial integration before committing to a replication pattern.
- **Latency:** Real-time (queries execute against live data)
- **Data movement:** None (queries travel to the data, not the other way around)
- **Complexity:** Low
### How It Works
Azure Arc extends Azure management to your on-premises SQL Server. Once registered, the instance appears in the Azure portal alongside your cloud resources — same RBAC, same policy, same Purview catalog. You then create external tables in Synapse Serverless SQL that reference the on-prem data through a Self-Hosted Integration Runtime.
### Step 1: Register SQL Server with Azure Arc
# Download and install the Azure Connected Machine agent on the SQL Server host
# (Run on the SQL Server machine as Administrator)
# Set variables (PowerShell, since this runs on the Windows host)
$RESOURCE_GROUP = "csa-inabox-rg"
$LOCATION = "eastus"
$SUBSCRIPTION_ID = "<your-subscription-id>"
$TENANT_ID = "<your-tenant-id>"
# Install the Arc agent (Windows)
# Download from: https://aka.ms/AzureConnectedMachineAgent
msiexec /i AzureConnectedMachineAgent.msi /l*v installationlog.txt /qn
# Connect to Azure Arc
& "$env:ProgramW6432\AzureConnectedMachineAgent\azcmagent.exe" connect `
--resource-group $RESOURCE_GROUP `
--location $LOCATION `
--subscription-id $SUBSCRIPTION_ID `
--tenant-id $TENANT_ID
# Verify connection
& "$env:ProgramW6432\AzureConnectedMachineAgent\azcmagent.exe" show
# Register the SQL Server extension on the Arc-enabled server (from Azure CLI)
az connectedmachine extension create \
--machine-name "sql-server-prod-01" \
--resource-group "$RESOURCE_GROUP" \
--name "WindowsAgent.SqlServer" \
--type "WindowsAgent.SqlServer" \
--publisher "Microsoft.AzureData" \
--location "$LOCATION" \
--settings '{
"SqlManagement": { "IsEnabled": true },
"LicenseType": "Paid"
}'
# Verify Arc-enabled SQL Server registration
az sql server-arc show \
--name "sql-server-prod-01" \
--resource-group "$RESOURCE_GROUP"
### Step 2: Configure Purview Scanning
Once the SQL Server is Arc-registered, Microsoft Purview can scan it for metadata, classifications, and lineage.
# Register the Arc-enabled SQL Server as a Purview data source
# (Purview portal → Data Map → Register → SQL Server on Azure Arc)
# Using Purview REST API:
az rest --method PUT \
--uri "https://<purview-account>.purview.azure.com/scan/datasources/sql-onprem-prod?api-version=2022-07-01-preview" \
--body '{
"kind": "AzureArcEnabledSqlServer",
"properties": {
"serverEndpoint": "sql-server-prod-01.contoso.local",
"resourceGroup": "csa-inabox-rg",
"subscriptionId": "<subscription-id>",
"location": "eastus",
"collection": {
"referenceName": "on-premises-sources",
"type": "CollectionReference"
}
}
}'
### Step 3: Create External Tables in Synapse Serverless
-- In Synapse Serverless SQL pool
-- Create a master key for credential encryption
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong-password>';
-- Create a credential for the Self-Hosted IR connection
CREATE DATABASE SCOPED CREDENTIAL SqlServerOnPremCred
WITH IDENTITY = 'sql-analytics-reader',
SECRET = '<password>';
-- Create an external data source pointing to the on-prem SQL Server
-- (sqlserver:// data sources are a PolyBase data virtualization feature;
-- verify support in your target Synapse/SQL environment and ensure a
-- network path from the query engine to the on-prem instance)
CREATE EXTERNAL DATA SOURCE SqlServerOnPrem
WITH (
LOCATION = 'sqlserver://sql-server-prod-01.contoso.local',
CREDENTIAL = SqlServerOnPremCred
);
-- Define external tables over the on-prem objects (schemas must match the source)
CREATE EXTERNAL TABLE dbo.Customers_OnPrem (
CustomerID INT,
CustomerName NVARCHAR(200)
)
WITH (LOCATION = 'dbo.Customers', DATA_SOURCE = SqlServerOnPrem);
CREATE EXTERNAL TABLE dbo.Orders_OnPrem (
OrderID INT,
CustomerID INT,
OrderDate DATETIME2,
TotalAmount DECIMAL(18, 2)
)
WITH (LOCATION = 'dbo.Orders', DATA_SOURCE = SqlServerOnPrem);
-- Query on-prem data directly (no copy)
SELECT
c.CustomerID,
c.CustomerName,
SUM(o.TotalAmount) AS TotalRevenue,
COUNT(o.OrderID) AS OrderCount
FROM dbo.Customers_OnPrem AS c
JOIN dbo.Orders_OnPrem AS o
ON c.CustomerID = o.CustomerID
WHERE o.OrderDate >= DATEADD(MONTH, -3, GETDATE())
GROUP BY c.CustomerID, c.CustomerName
ORDER BY TotalRevenue DESC;
!!! warning "Performance Considerations"
    Data virtualization queries execute against the remote SQL Server in real time. Large analytical queries (full table scans, complex aggregations) will impact the on-prem instance. Use this pattern for low-frequency, small-result-set queries — not for building dashboards that 50 users refresh every 5 minutes. For heavier workloads, use Pattern 2 or Pattern 3.
## Pattern 2: Self-Hosted IR + ADF (Incremental Copy)
**When to use:** Regular analytical refresh (hourly, daily), medallion architecture pattern, or when you need the data materialized in the lakehouse for dbt transformations.
- **Latency:** Minutes to hours (depends on pipeline schedule)
- **Data movement:** Incremental (only changed rows)
- **Complexity:** Medium
This is the most common integration pattern for CSA-in-a-Box. Data flows from SQL Server through a Self-Hosted Integration Runtime into ADLS Bronze as Delta Lake tables, then dbt transforms it through Silver and Gold layers.
### Prerequisites
- Self-Hosted IR deployed (see SHIR guide)
- CDC or Change Tracking enabled on source tables
- ADF instance with linked services configured
- ADLS Gen2 with Bronze container
### Step 1: Enable CDC on Source Tables
-- On the on-premises SQL Server (run as sysadmin)
-- Enable CDC at the database level
USE [OperationalDB];
GO
EXEC sys.sp_cdc_enable_db;
GO
-- Enable CDC on specific tables
-- Only enable on tables you actually need for analytics
EXEC sys.sp_cdc_enable_table
@source_schema = N'dbo',
@source_name = N'Orders',
@role_name = N'cdc_reader',
@supports_net_changes = 1;
GO
EXEC sys.sp_cdc_enable_table
@source_schema = N'dbo',
@source_name = N'Customers',
@role_name = N'cdc_reader',
@supports_net_changes = 1;
GO
EXEC sys.sp_cdc_enable_table
@source_schema = N'dbo',
@source_name = N'Products',
@role_name = N'cdc_reader',
@supports_net_changes = 1;
GO
-- Verify CDC is enabled
SELECT name, is_cdc_enabled
FROM sys.databases
WHERE name = 'OperationalDB';
SELECT s.name AS schema_name, t.name AS table_name, t.is_tracked_by_cdc
FROM sys.tables t
JOIN sys.schemas s ON t.schema_id = s.schema_id
WHERE t.is_tracked_by_cdc = 1;
### Step 2: Create the CDC Reader Query
-- Query to extract net changes since the last LSN
-- This is what the ADF pipeline executes via the Self-Hosted IR
DECLARE @from_lsn binary(10), @to_lsn binary(10);
-- @from_lsn is passed as a pipeline parameter (last successful LSN)
SET @from_lsn = @LastProcessedLSN;
SET @to_lsn = sys.fn_cdc_get_max_lsn();
-- Extract net changes (INSERT, UPDATE, DELETE)
SELECT
o.__$operation AS cdc_operation,
-- 1 = DELETE, 2 = INSERT, 4 = UPDATE (after image)
o.OrderID,
o.CustomerID,
o.OrderDate,
o.TotalAmount,
o.Status,
o.ModifiedDate,
@to_lsn AS _cdc_lsn,
SYSUTCDATETIME() AS _extracted_at
FROM
cdc.fn_cdc_get_net_changes_dbo_Orders(
@from_lsn,
@to_lsn,
N'all' -- 'all' returns the 1/2/4 operation codes documented above
) AS o;
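The pipeline below invokes this extraction through a stored procedure, `dbo.usp_ExtractCDCChanges`, which is referenced but not defined elsewhere in this guide. A minimal sketch that wraps the query above (shown for `dbo.Orders` only; per-table dispatch and logging are omitted):

```sql
-- Minimal sketch of the extraction procedure the ADF pipeline calls
CREATE OR ALTER PROCEDURE dbo.usp_ExtractCDCChanges
    @FromLSN BINARY(10),
    @ToLSN   BINARY(10),
    @TableName NVARCHAR(128)   -- reserved for logging / per-table dispatch
AS
BEGIN
    SET NOCOUNT ON;
    SELECT
        o.__$operation AS cdc_operation,
        o.OrderID, o.CustomerID, o.OrderDate,
        o.TotalAmount, o.Status, o.ModifiedDate,
        @ToLSN AS _cdc_lsn,
        SYSUTCDATETIME() AS _extracted_at
    FROM cdc.fn_cdc_get_net_changes_dbo_Orders(@FromLSN, @ToLSN, N'all') AS o;
END;
GO
```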
### Step 3: ADF Pipeline Definition
{
"name": "pl_ingest_sqlserver_cdc_to_bronze",
"properties": {
"description": "Incremental CDC load from on-prem SQL Server to ADLS Bronze (Delta)",
"activities": [
{
"name": "Lookup_LastLSN",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "AzureSqlSource",
"sqlReaderQuery": "SELECT LastLSN FROM dbo.WatermarkTable WHERE TableName = '@{pipeline().parameters.TableName}'"
},
"dataset": {
"referenceName": "ds_watermark_table",
"type": "DatasetReference"
}
}
},
{
"name": "Lookup_CurrentLSN",
"type": "Lookup",
"dependsOn": [
{
"activity": "Lookup_LastLSN",
"dependencyConditions": ["Succeeded"]
}
],
"typeProperties": {
"source": {
"type": "SqlServerSource",
"sqlReaderQuery": "SELECT sys.fn_cdc_get_max_lsn() AS CurrentLSN"
},
"dataset": {
"referenceName": "ds_sqlserver_onprem",
"type": "DatasetReference"
}
}
},
{
"name": "Copy_CDC_Changes",
"type": "Copy",
"dependsOn": [
{
"activity": "Lookup_CurrentLSN",
"dependencyConditions": ["Succeeded"]
}
],
"typeProperties": {
"source": {
"type": "SqlServerSource",
"sqlReaderStoredProcedureName": "dbo.usp_ExtractCDCChanges",
"storedProcedureParameters": {
"FromLSN": {
"value": "@activity('Lookup_LastLSN').output.firstRow.LastLSN"
},
"ToLSN": {
"value": "@activity('Lookup_CurrentLSN').output.firstRow.CurrentLSN"
},
"TableName": {
"value": "@pipeline().parameters.TableName"
}
}
},
"sink": {
"type": "ParquetSink",
"storeSettings": {
"type": "AzureBlobFSWriteSettings"
},
"formatSettings": {
"type": "ParquetWriteSettings"
}
}
},
"inputs": [
{
"referenceName": "ds_sqlserver_onprem",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "ds_adls_bronze_delta",
"type": "DatasetReference",
"parameters": {
"Container": "bronze",
"FolderPath": "@concat('sql_server/', pipeline().parameters.TableName, '/', formatDateTime(utcNow(), 'yyyy/MM/dd'))"
}
}
]
},
{
"name": "Update_Watermark",
"type": "SqlServerStoredProcedure",
"dependsOn": [
{
"activity": "Copy_CDC_Changes",
"dependencyConditions": ["Succeeded"]
}
],
"typeProperties": {
"storedProcedureName": "dbo.usp_UpdateWatermark",
"storedProcedureParameters": {
"TableName": {
"value": "@pipeline().parameters.TableName"
},
"LastLSN": {
"value": "@activity('Lookup_CurrentLSN').output.firstRow.CurrentLSN"
}
}
}
}
],
"parameters": {
"TableName": { "type": "string" }
}
}
}
### Step 4: Watermark Pattern (for Tables Without CDC)
Not every table supports CDC. For tables where CDC is not enabled (or not feasible), use a watermark column.
-- Watermark table in Azure SQL (or ADLS metadata)
CREATE TABLE dbo.WatermarkTable (
TableName NVARCHAR(128) PRIMARY KEY,
LastLSN BINARY(10) NULL,
LastModifiedDate DATETIME2 NULL,
LastRowVersion BIGINT NULL,
LastRunTime DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
);
-- Insert initial watermarks
INSERT INTO dbo.WatermarkTable (TableName, LastModifiedDate)
VALUES
('Products', '1900-01-01'),
('Inventory', '1900-01-01');
-- ADF source query using ModifiedDate watermark
SELECT *
FROM dbo.Products
WHERE ModifiedDate > '@{activity('Lookup_LastWatermark').output.firstRow.LastModifiedDate}'
AND ModifiedDate <= '@{activity('Lookup_CurrentWatermark').output.firstRow.CurrentMax}';
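The pipeline's final activity calls `dbo.usp_UpdateWatermark`, which is likewise referenced but not defined in this guide. A minimal sketch against the watermark table above:

```sql
-- Minimal sketch of the watermark-update procedure the pipeline calls
CREATE OR ALTER PROCEDURE dbo.usp_UpdateWatermark
    @TableName NVARCHAR(128),
    @LastLSN BINARY(10) = NULL,
    @LastModifiedDate DATETIME2 = NULL
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE dbo.WatermarkTable
    SET LastLSN          = COALESCE(@LastLSN, LastLSN),
        LastModifiedDate = COALESCE(@LastModifiedDate, LastModifiedDate),
        LastRunTime      = SYSUTCDATETIME()
    WHERE TableName = @TableName;
END;
GO
```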
### Step 5: dbt Incremental Model Consuming CDC Data
-- models/silver/stg_orders.sql
{{
config(
materialized='incremental',
unique_key='order_id',
incremental_strategy='merge',
file_format='delta',
on_schema_change='sync_all_columns'
)
}}
WITH source AS (
SELECT
OrderID AS order_id,
CustomerID AS customer_id,
OrderDate AS order_date,
TotalAmount AS total_amount,
Status AS order_status,
ModifiedDate AS modified_date,
cdc_operation,
_cdc_lsn,
_extracted_at
FROM {{ source('bronze_sql_server', 'orders') }}
{% if is_incremental() %}
WHERE _extracted_at > (SELECT MAX(_extracted_at) FROM {{ this }})
{% endif %}
),
-- Apply CDC operations: filter out deletes, keep latest version
deduplicated AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY order_id
ORDER BY _cdc_lsn DESC, _extracted_at DESC
) AS rn
FROM source
)
SELECT
order_id,
customer_id,
order_date,
total_amount,
order_status,
modified_date,
_extracted_at,
CASE WHEN cdc_operation = 1 THEN TRUE ELSE FALSE END AS is_deleted
FROM deduplicated
WHERE rn = 1
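Downstream gold models should exclude the soft-deleted rows this model flags. A minimal dbt sketch (the model name `fct_orders` is illustrative):

```sql
-- models/gold/fct_orders.sql (sketch): hide soft-deleted orders from consumers
SELECT
    order_id,
    customer_id,
    order_date,
    total_amount,
    order_status
FROM {{ ref('stg_orders') }}
WHERE NOT is_deleted
```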
## Pattern 3: SQL Server Read Replicas + Analytics
**When to use:** Near-real-time analytics without impacting the OLTP primary, or when you need to offload heavy reporting queries from production.
- **Latency:** Seconds (synchronous commit) to minutes (asynchronous commit)
- **Data movement:** Automatic (AG replication)
- **Complexity:** Medium-High (requires SQL Server Enterprise Edition)
### How It Works
Always On Availability Groups provide a read-only secondary replica that receives real-time transaction log updates from the primary. Analytics tools (ADF, Power BI, dbt) connect to the secondary, keeping the primary free for OLTP workloads.
### Step 1: Configure Always On Availability Group
-- On the PRIMARY replica
-- (Assumes Windows Server Failover Clustering is already configured)
-- Enable Always On at the instance level (requires restart)
-- Run in SQL Server Configuration Manager or via PowerShell:
-- Enable-SqlAlwaysOn -ServerInstance "SQL-PRIMARY" -Force
-- Create the Availability Group with a read-only secondary
CREATE AVAILABILITY GROUP [AG_Analytics]
WITH (
AUTOMATED_BACKUP_PREFERENCE = SECONDARY,
DB_FAILOVER = ON,
CLUSTER_TYPE = WSFC
)
FOR DATABASE [OperationalDB]
REPLICA ON
N'SQL-PRIMARY' WITH (
ENDPOINT_URL = N'TCP://sql-primary.contoso.local:5022',
FAILOVER_MODE = AUTOMATIC,
AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
SECONDARY_ROLE (
ALLOW_CONNECTIONS = NO
)
),
N'SQL-SECONDARY' WITH (
ENDPOINT_URL = N'TCP://sql-secondary.contoso.local:5022',
FAILOVER_MODE = MANUAL,
AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
SECONDARY_ROLE (
ALLOW_CONNECTIONS = READ_ONLY,
READ_ONLY_ROUTING_URL = N'TCP://sql-secondary.contoso.local:1433'
)
);
-- Configure read-only routing
ALTER AVAILABILITY GROUP [AG_Analytics]
MODIFY REPLICA ON N'SQL-PRIMARY' WITH (
PRIMARY_ROLE (
READ_ONLY_ROUTING_LIST = (N'SQL-SECONDARY', N'SQL-PRIMARY')
)
);
-- Create the listener
ALTER AVAILABILITY GROUP [AG_Analytics]
ADD LISTENER N'ag-analytics-listener' (
WITH IP (
(N'10.0.1.100', N'255.255.255.0')
),
PORT = 1433
);
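The script above runs on the primary only. The secondary must also join the AG once the database has been seeded or restored WITH NORECOVERY; a minimal sketch of the secondary-side commands:

```sql
-- On the SECONDARY replica: join the AG, then join the database to it
ALTER AVAILABILITY GROUP [AG_Analytics] JOIN;
ALTER DATABASE [OperationalDB] SET HADR AVAILABILITY GROUP = [AG_Analytics];
```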
### Step 2: ADF Linked Service Pointing to Read-Only Replica
{
"name": "ls_sqlserver_readonly_replica",
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=ag-analytics-listener.contoso.local;Database=OperationalDB;ApplicationIntent=ReadOnly;Encrypt=True;TrustServerCertificate=False;"
},
"userName": "adf-analytics-reader",
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "ls_keyvault",
"type": "LinkedServiceReference"
},
"secretName": "sql-analytics-reader-password"
}
},
"connectVia": {
"referenceName": "SelfHostedIR",
"type": "IntegrationRuntimeReference"
}
}
}
!!! note "ApplicationIntent=ReadOnly"
    The ApplicationIntent=ReadOnly connection string parameter tells the AG listener to route the connection to a readable secondary via the read-only routing list. This is the key setting that offloads analytics from the primary. Without it, connections go to the primary.
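To confirm routing is working, open a connection through the listener with ApplicationIntent=ReadOnly and check which replica answered; a quick verification sketch:

```sql
-- Run over a connection opened with ApplicationIntent=ReadOnly
SELECT @@SERVERNAME AS connected_replica;   -- expect SQL-SECONDARY
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Updateability') AS updateability;
-- READ_ONLY confirms you landed on the readable secondary
```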
### Log Shipping Alternative (Simpler, No Enterprise Required)
For smaller deployments where Enterprise Edition licensing is not available, log shipping provides a simpler (but less current) alternative.
-- On the PRIMARY server: configure log backup job
BACKUP LOG [OperationalDB]
TO DISK = N'\\fileshare\logship\OperationalDB_log.trn'
WITH NOFORMAT, NOINIT, COMPRESSION;
-- Schedule this every 5-15 minutes via SQL Agent
-- On the SECONDARY server: restore with STANDBY (allows read access)
RESTORE LOG [OperationalDB]
FROM DISK = N'\\fileshare\logship\OperationalDB_log.trn'
WITH STANDBY = N'D:\Standby\OperationalDB_undo.ldf';
-- Users are disconnected during restore, then can read again
!!! warning "Log Shipping Limitations"
    Log shipping with STANDBY mode disconnects all users during each restore cycle (typically every 5-15 minutes). This is acceptable for batch analytics but not for interactive dashboards. For interactive workloads, use AG read replicas (Enterprise) or MI Link (Pattern 4).
## Pattern 4: Managed Instance Link to Azure SQL MI
**When to use:** You need a full read replica in Azure for analytics and BI, with near-real-time data currency and no SHIR dependency for query execution.
- **Latency:** Seconds (RPO of ~seconds for near-real-time replication)
- **Data movement:** Continuous (log-based replication to Azure SQL MI)
- **Complexity:** Medium
### How It Works
The Managed Instance Link uses distributed availability groups to replicate data from on-premises SQL Server to Azure SQL MI in near-real-time. The MI instance serves as a read-only replica that lives in Azure — no Self-Hosted IR needed for analytics queries because the data is already in Azure.
Once the data is in MI, you can mirror it to Fabric OneLake for lakehouse analytics, or query it directly from Power BI and Synapse.
graph LR
subgraph OnPrem["On-Premises"]
SQL["SQL Server 2022<br/>Primary (Read-Write)"]
end
subgraph Azure["Azure"]
MI["Azure SQL MI<br/>Read-Only Replica"]
FABRIC["Fabric OneLake<br/>Mirrored Database"]
PBI["Power BI<br/>DirectQuery"]
SYNAPSE["Synapse Serverless<br/>External Tables"]
end
SQL -->|MI Link<br/>Near-Real-Time<br/>Log Replication| MI
MI -->|Fabric Mirroring<br/>Delta Lake| FABRIC
MI --> PBI
FABRIC --> SYNAPSE
FABRIC --> PBI
style SQL fill:#2d6a4f,color:#fff
style MI fill:#0078d4,color:#fff
style FABRIC fill:#0078d4,color:#fff

### Step 1: Prerequisites
| Requirement | Details |
|---|---|
| SQL Server Version | 2016 SP3, 2019 CU17+, or 2022 CU1+ |
| Azure SQL MI | Business Critical tier (for real-time read replica) or General Purpose |
| Network | ExpressRoute or S2S VPN with <100ms latency between on-prem and MI |
| Certificates | Database mirroring endpoint certificate exchange |
### Step 2: Configure MI Link
-- On the ON-PREMISES SQL Server
-- 1. Create a master key and a certificate for endpoint authentication
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong-password>';
GO
CREATE CERTIFICATE OnPrem_MI_Link_Cert
WITH SUBJECT = 'MI Link Certificate - On-Premises',
EXPIRY_DATE = '2030-12-31';
GO
-- 2. Create a database mirroring endpoint secured by the certificate
-- (if an endpoint already exists, ALTER it to use the certificate instead)
CREATE ENDPOINT dbm_endpoint
STATE = STARTED
AS TCP (LISTENER_PORT = 5022)
FOR DATABASE_MIRRORING (
ROLE = ALL,
AUTHENTICATION = CERTIFICATE OnPrem_MI_Link_Cert,
ENCRYPTION = REQUIRED ALGORITHM AES
);
GO
-- 3. Back up the certificate to exchange with MI
BACKUP CERTIFICATE OnPrem_MI_Link_Cert
TO FILE = 'C:\Certs\OnPrem_MI_Link_Cert.cer';
GO
# Create the MI Link using Azure CLI
az sql mi link create \
--resource-group "csa-inabox-rg" \
--managed-instance "csa-sql-mi" \
--distributed-availability-group-name "dag-onprem-to-mi" \
--primary-availability-group-name "AG_OnPrem" \
--secondary-availability-group-name "AG_MI" \
--source-endpoint "TCP://sql-server-prod-01.contoso.local:5022" \
--databases "[{\"databaseName\": \"OperationalDB\"}]" \
--replication-mode "Async"
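Once the link is created, you can confirm it is healthy from the on-prem side with the HADR DMVs; a hedged sketch, assuming the link surfaces as a distributed availability group:

```sql
-- On the on-premises SQL Server: check distributed AG / link health
SELECT
    ag.name AS dag_name,
    ars.role_desc,
    ars.connected_state_desc,
    ars.synchronization_health_desc
FROM sys.availability_groups AS ag
JOIN sys.dm_hadr_availability_replica_states AS ars
    ON ag.group_id = ars.group_id
WHERE ag.is_distributed = 1;
```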
### Step 3: Enable Fabric Mirroring from MI
Once the MI Link is active and data is flowing, enable Fabric Mirroring to automatically replicate MI tables into OneLake as Delta Lake.
1. Open the Fabric portal and create a new Mirrored Database item
2. Select Azure SQL Managed Instance as the source
3. Authenticate using MI's managed identity or SQL authentication
4. Select the tables to mirror
5. Fabric continuously replicates changes to OneLake in Delta Lake format
-- In Fabric SQL analytics endpoint (auto-created with mirroring)
-- Query the mirrored data using T-SQL
SELECT
c.CustomerName,
COUNT(o.OrderID) AS order_count,
SUM(o.TotalAmount) AS total_revenue
FROM dbo.Customers c
JOIN dbo.Orders o ON c.CustomerID = o.CustomerID
GROUP BY c.CustomerName
ORDER BY total_revenue DESC;
!!! info "Latency Chain"
    The total end-to-end latency is: SQL Server primary (seconds) -> Azure SQL MI (seconds) -> Fabric OneLake (minutes). For most analytical workloads, this sub-5-minute latency is more than sufficient.
## Pattern 5: SQL Server 2022 S3 Object Storage Integration
**When to use:** Simple data exchange between SQL Server and the lakehouse, bulk export/import scenarios, or backup to cloud storage. Requires SQL Server 2022.
- **Latency:** On-demand (push or pull)
- **Data movement:** Explicit (user-initiated)
- **Complexity:** Low
SQL Server 2022 introduced native support for S3-compatible object storage. ADLS Gen2 does not natively expose an S3 API, so reaching the CSA-in-a-Box lakehouse over the s3:// protocol requires an S3-compatible gateway (for example, MinIO) in front of the storage account; SQL Server 2022's native Azure connectors (abs:// and adls:// external data sources, and BACKUP TO URL) are the alternative for direct access. The examples below use the S3 protocol through such a gateway.
### Step 1: Create S3 Credentials in SQL Server
-- On SQL Server 2022
-- Create a credential for the S3-compatible endpoint fronting ADLS
-- (the SECRET is the colon-separated access key pair)
CREATE DATABASE SCOPED CREDENTIAL adls_s3_credential
WITH
IDENTITY = 'S3 Access Key',
SECRET = '<access-key-id>:<secret-access-key>';
GO
-- Create an external data source for the lakehouse over the S3 protocol
-- (requires the PolyBase feature; the LOCATION is the S3-compatible
-- gateway endpoint, not the storage account's blob endpoint)
CREATE EXTERNAL DATA SOURCE adls_bronze
WITH (
LOCATION = 's3://<s3-gateway-endpoint>/<bucket>',
CREDENTIAL = adls_s3_credential
);
GO
### Step 2: Read Parquet from ADLS Directly
-- Query Parquet files in the lakehouse directly from SQL Server 2022
SELECT
CustomerID,
CustomerName,
Region,
Segment
FROM OPENROWSET(
BULK '/bronze/sql_server/customers/*.parquet',
DATA_SOURCE = 'adls_bronze',
FORMAT = 'PARQUET'
) AS customers;
### Step 3: Export Data to ADLS
-- Bulk export from SQL Server to ADLS as Parquet
-- (Uses CETAS - CREATE EXTERNAL TABLE AS SELECT;
-- requires sp_configure 'allow polybase export', 1)
-- First, create an external file format
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);
GO
-- Export a summary table to the Gold layer
CREATE EXTERNAL TABLE dbo.MonthlySalesSummary
WITH (
LOCATION = '/gold/monthly_sales/',
DATA_SOURCE = adls_bronze,
FILE_FORMAT = ParquetFormat
)
AS
SELECT
YEAR(OrderDate) AS order_year,
MONTH(OrderDate) AS order_month,
Region,
COUNT(*) AS order_count,
SUM(TotalAmount) AS total_revenue
FROM dbo.Orders o
JOIN dbo.Customers c ON o.CustomerID = c.CustomerID
GROUP BY YEAR(OrderDate), MONTH(OrderDate), Region;
### Step 4: Backup to ADLS
-- SQL Server 2022 can back up directly to S3-compatible storage.
-- BACKUP TO URL over s3:// authenticates via a server-level credential
-- whose name matches the URL prefix (WITH CREDENTIAL is not used here)
CREATE CREDENTIAL [s3://<s3-gateway-endpoint>/backups]
WITH IDENTITY = 'S3 Access Key',
SECRET = '<access-key-id>:<secret-access-key>';
GO
BACKUP DATABASE [OperationalDB]
TO URL = 's3://<s3-gateway-endpoint>/backups/OperationalDB_full.bak'
WITH COMPRESSION, STATS = 10;
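It is worth verifying that the backup is readable before depending on it. Assuming the same URL-matched credential, a quick check (confirm VERIFYONLY support against your specific S3 endpoint):

```sql
-- Verify the backup without restoring it
RESTORE VERIFYONLY
FROM URL = 's3://<s3-gateway-endpoint>/backups/OperationalDB_full.bak';
```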
!!! tip "When to Use S3 Integration vs. ADF"
    Use S3 integration for DBA-driven, ad-hoc, or scheduled exports that the SQL Server team manages. Use ADF for orchestrated, monitored, enterprise pipelines with retry logic, alerting, and lineage tracking. Both can coexist.
## Network Architecture
All integration patterns depend on reliable, secure connectivity between on-premises and Azure.
graph TB
subgraph OnPrem["On-Premises Network"]
SQLSERVER["SQL Server 2022"]
SHIR_NODE1["SHIR Node 1"]
SHIR_NODE2["SHIR Node 2"]
FW_ONPREM["On-Prem Firewall"]
end
subgraph Edge["Network Edge"]
ER_CIRCUIT["ExpressRoute Circuit<br/>1 Gbps+ (Production)"]
VPN_GW["VPN Gateway<br/>VpnGw2 (Dev/Test)"]
end
subgraph Azure["Azure Hub VNet"]
ER_GW["ExpressRoute Gateway<br/>ErGw1Az"]
VPN_GW_AZ["VPN Gateway<br/>VpnGw2Az"]
FW_AZ["Azure Firewall<br/>Premium SKU"]
end
subgraph Spoke["CSA-in-a-Box Spoke VNet"]
PE_ADLS["Private Endpoint<br/>ADLS Gen2"]
PE_ADF["Private Endpoint<br/>ADF"]
PE_KV["Private Endpoint<br/>Key Vault"]
PE_MI["Private Endpoint<br/>SQL MI"]
PE_SYNAPSE["Private Endpoint<br/>Synapse"]
end
SQLSERVER --> FW_ONPREM
SHIR_NODE1 --> FW_ONPREM
SHIR_NODE2 --> FW_ONPREM
FW_ONPREM -->|TCP 443 Only| ER_CIRCUIT
FW_ONPREM -.->|TCP 443 Only| VPN_GW
ER_CIRCUIT --> ER_GW
VPN_GW --> VPN_GW_AZ
ER_GW --> FW_AZ
VPN_GW_AZ --> FW_AZ
FW_AZ --> PE_ADLS
FW_AZ --> PE_ADF
FW_AZ --> PE_KV
FW_AZ --> PE_MI
FW_AZ --> PE_SYNAPSE
style ER_CIRCUIT fill:#2d6a4f,color:#fff
style FW_AZ fill:#d62828,color:#fff

### Connectivity Requirements by Pattern
| Pattern | Required Connectivity | Ports | Recommended Network |
|---|---|---|---|
| Arc + Virtualization | Arc agent -> Azure, SHIR -> ADF | TCP 443 outbound | ExpressRoute or VPN |
| ADF + SHIR (CDC) | SHIR -> ADF, SHIR -> SQL Server | TCP 443 outbound, TCP 1433 internal | ExpressRoute |
| AG Read Replica | SHIR -> AG Listener | TCP 443 outbound, TCP 1433 internal | ExpressRoute |
| MI Link | SQL Server -> Azure SQL MI | TCP 5022, TCP 11000-11999 | ExpressRoute (required) |
| S3 Integration | SQL Server -> ADLS S3 endpoint | TCP 443 outbound | ExpressRoute or VPN |
### Private Endpoint Configuration
# Create private endpoints for all Azure services accessed from on-prem
# ADLS Gen2
az network private-endpoint create \
--name "pe-adls-csa" \
--resource-group "csa-inabox-rg" \
--vnet-name "vnet-csa-spoke" \
--subnet "snet-private-endpoints" \
--private-connection-resource-id "/subscriptions/.../storageAccounts/csaadls" \
--group-ids "dfs" \
--connection-name "adls-privatelink"
# Azure Data Factory
az network private-endpoint create \
--name "pe-adf-csa" \
--resource-group "csa-inabox-rg" \
--vnet-name "vnet-csa-spoke" \
--subnet "snet-private-endpoints" \
--private-connection-resource-id "/subscriptions/.../factories/csa-adf" \
--group-ids "dataFactory" \
--connection-name "adf-privatelink"
# Key Vault
az network private-endpoint create \
--name "pe-kv-csa" \
--resource-group "csa-inabox-rg" \
--vnet-name "vnet-csa-spoke" \
--subnet "snet-private-endpoints" \
--private-connection-resource-id "/subscriptions/.../vaults/csa-keyvault" \
--group-ids "vault" \
--connection-name "kv-privatelink"
## Best Practices
| Do | Don't |
|---|---|
| Use CDC or Change Tracking for incremental loads | Perform nightly full table scans in production |
| Point analytics queries at AG read replicas or MI | Run analytical queries against the OLTP primary |
| Deploy SHIR in HA (2+ nodes) behind a load balancer | Run a single SHIR node with no redundancy |
| Use Azure Arc for governance even without data copy | Ignore on-prem SQL Server in your Azure governance story |
| Encrypt data in transit (TLS 1.2+) and at rest (TDE) | Send data over unencrypted connections |
| Monitor replication lag for MI Link and AG replicas | Assume replication is always current |
| Test failover scenarios (what happens when the VPN drops?) | Discover network failures in production |
| Use Private Endpoints for all Azure service connections | Expose Azure services on public endpoints |
| Size SHIR VMs based on data volume and concurrent jobs | Use the smallest VM and wonder why copies are slow |
| Coordinate DDL changes with pipeline updates | ALTER TABLE on CDC-enabled tables without updating ADF |
### CDC Configuration Guidelines
-- Monitor CDC transaction log growth
-- CDC uses the transaction log; heavy write tables will generate large logs
SELECT
DB_NAME() AS database_name,
name AS log_file,
CAST(size * 8.0 / 1024 AS DECIMAL(10,2)) AS size_mb,
CAST(FILEPROPERTY(name, 'SpaceUsed') * 8.0 / 1024 AS DECIMAL(10,2)) AS used_mb,
CAST((size - FILEPROPERTY(name, 'SpaceUsed')) * 8.0 / 1024 AS DECIMAL(10,2)) AS free_mb
FROM sys.database_files
WHERE type_desc = 'LOG';
-- Configure the CDC capture job to keep up with write volume
-- (job settings are changed via sys.sp_cdc_change_job)
EXEC sys.sp_cdc_change_job
@job_type = N'capture',
@maxtrans = 500, -- max transactions per scan cycle
@maxscans = 10, -- max scan cycles per poll
@pollinginterval = 5; -- polling interval in seconds
-- Adjust the cleanup job retention to prevent change-table bloat
-- (in minutes; default is 3 days = 4320)
EXEC sys.sp_cdc_change_job
@job_type = N'cleanup',
@retention = 4320;
-- Changes take effect after the job is restarted
-- (sys.sp_cdc_stop_job / sys.sp_cdc_start_job)
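To inspect the current job settings and surface capture errors early, a short sketch using the CDC helper procedure and error DMV:

```sql
-- Inspect capture/cleanup job settings and recent CDC errors
EXEC sys.sp_cdc_help_jobs;

SELECT TOP (20) entry_time, error_number, error_message
FROM sys.dm_cdc_errors
ORDER BY entry_time DESC;
```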
### Monitoring Replication Lag
-- For AG replicas: check synchronization health
SELECT
ag.name AS ag_name,
ar.replica_server_name,
drs.database_id,
DB_NAME(drs.database_id) AS database_name,
drs.synchronization_state_desc,
drs.synchronization_health_desc,
drs.log_send_queue_size AS log_send_queue_kb,
drs.redo_queue_size AS redo_queue_kb,
drs.last_commit_time
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar
ON drs.replica_id = ar.replica_id
JOIN sys.availability_groups ag
ON ar.group_id = ag.group_id
WHERE drs.is_local = 0; -- Remote replicas only
# For MI Link: check replication status via Azure CLI
az sql mi link show \
--resource-group "csa-inabox-rg" \
--managed-instance "csa-sql-mi" \
--distributed-availability-group-name "dag-onprem-to-mi" \
--query "{status: replicationState, lag: lagTimeSeconds}"
## Lessons Learned
### Self-Hosted IR Is a Single Point of Failure
Always deploy the Self-Hosted IR with 2+ nodes in a high-availability configuration. A single SHIR node means a single Windows update, reboot, or hardware failure stops all data movement. The CSA-in-a-Box SHIR module deploys on a VMSS with auto-scaling 1-10 nodes specifically to avoid this. See the SHIR guide for HA configuration details.
### CDC Has Overhead — Enable Selectively
CDC adds overhead to the transaction log. Every insert, update, and delete on a CDC-enabled table generates additional log records. Enable CDC only on tables you actually need for analytics. Monitor transaction log growth weekly. For high-volume tables (millions of rows/day), consider Change Tracking instead — it has lower overhead but provides less detail (no column-level before/after images).
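Change Tracking is recommended here but not shown elsewhere in this guide. A minimal sketch of enabling it and reading changes incrementally; `dbo.Inventory` and its key column are illustrative, and the table must have a primary key:

```sql
-- Enable Change Tracking at the database and table level
ALTER DATABASE [OperationalDB]
SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 3 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.Inventory
ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = OFF);

-- Incremental read: rows changed since the last synced version
DECLARE @last_sync BIGINT = 0;   -- persisted by the pipeline watermark
SELECT ct.SYS_CHANGE_OPERATION, i.*
FROM CHANGETABLE(CHANGES dbo.Inventory, @last_sync) AS ct
LEFT JOIN dbo.Inventory AS i ON i.InventoryID = ct.InventoryID;

SELECT CHANGE_TRACKING_CURRENT_VERSION() AS next_sync_version;
```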
### ExpressRoute Takes Weeks to Provision
Do not wait until the sprint before go-live to order an ExpressRoute circuit. Provisioning can take 2-8 weeks depending on the connectivity provider and peering location. Start the network provisioning workstream at project kickoff. Use Site-to-Site VPN for initial development and testing while ExpressRoute is being provisioned.
### Schema Changes Break CDC
When a DBA runs ALTER TABLE on a CDC-enabled table (adding columns, changing data types), the CDC capture instance becomes out of sync. This silently breaks downstream pipelines. Always coordinate DDL changes with the analytics team. The process is: (1) disable CDC on the table, (2) make the DDL change, (3) re-enable CDC with a new capture instance, (4) update ADF pipeline mappings.
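In T-SQL terms, the coordinated change looks roughly like this (the column and capture-instance names are illustrative):

```sql
-- 1. Disable the existing capture instance
EXEC sys.sp_cdc_disable_table
    @source_schema = N'dbo',
    @source_name = N'Orders',
    @capture_instance = N'dbo_Orders';

-- 2. Make the DDL change
ALTER TABLE dbo.Orders ADD PromoCode NVARCHAR(20) NULL;

-- 3. Re-enable CDC with a new capture instance
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name = N'Orders',
    @role_name = N'cdc_reader',
    @capture_instance = N'dbo_Orders_v2',
    @supports_net_changes = 1;

-- 4. Update the ADF pipeline mappings to the new capture instance
```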
### Full Loads Are Expensive — Resist the Temptation
The first instinct is always "let's just do a nightly full refresh." For a 10-million-row table, that means copying 10 million rows every night even if only 500 changed. This wastes SHIR bandwidth, ADLS storage, and compute time. Invest in CDC or Change Tracking from day one. The setup cost is paid once; the savings compound every pipeline run.
## Anti-Patterns
### Direct Linked Server Queries from Synapse to On-Prem
Creating a linked server in Synapse that queries on-prem SQL Server directly seems convenient, but it has catastrophic performance characteristics. Every query traverses the network, there is no caching layer, and complex joins between cloud and on-prem tables result in full data shuffles over the WAN. Use materialized copies (Pattern 2) or MI Link (Pattern 4) instead.
### Full Table Copies Every Night Instead of Incremental
Copying entire tables nightly is a hallmark of "we didn't have time to set up CDC." It wastes network bandwidth, storage, and compute. Worse, it creates a 24-hour data freshness gap. Even if you start with full copies, add a backlog task to migrate to incremental loads within the first two sprints.
### No Network Redundancy (Single VPN Tunnel)
A single Site-to-Site VPN tunnel is acceptable for dev/test environments. It is not acceptable for production. VPN tunnels drop, ISPs have outages, and when your tunnel is down, all data movement stops. For production workloads, use ExpressRoute with a VPN failover, or dual ExpressRoute circuits for mission-critical scenarios.
### Ignoring SQL Server Version Requirements
Synapse Link for SQL Server, S3 object storage integration, and several other hybrid features require SQL Server 2022. MI Link requires SQL Server 2016+. Do not design an architecture around features your SQL Server version does not support. Check the version compatibility matrix at the top of this guide before committing to a pattern.
### Running Analytics on the OLTP Primary
Connecting Power BI, ADF, or ad-hoc analytical queries directly to the production OLTP primary is the fastest path to an incident. Analytical queries (table scans, large aggregations, complex joins) compete with transactional workloads for CPU, memory, and I/O. Use AG read replicas (Pattern 3), MI Link (Pattern 4), or materialized copies (Pattern 2) to isolate analytical workloads.
## Decision Tree
Use this flowchart to select the right integration pattern for your scenario.
flowchart TD
START(["On-Prem SQL Server<br/>Integration Needed"]) --> Q1{"Do you need<br/>real-time data<br/>in Azure?"}
Q1 -->|Yes| Q2{"Is Azure SQL MI<br/>an option?"}
Q1 -->|No| Q3{"Can you copy data<br/>to the lakehouse?"}
Q2 -->|Yes| P4["Pattern 4: MI Link<br/>Near-real-time replica in Azure<br/>+ Fabric mirroring to OneLake"]
Q2 -->|No| Q4{"Enterprise Edition<br/>available?"}
Q4 -->|Yes| P3["Pattern 3: AG Read Replica<br/>Read-only secondary<br/>+ SHIR for analytics"]
Q4 -->|No| P2_RT["Pattern 2: ADF + SHIR<br/>High-frequency CDC pipeline<br/>(5-15 min intervals)"]
Q3 -->|Yes| Q5{"SQL Server 2022?"}
Q3 -->|No| P1["Pattern 1: Azure Arc +<br/>Data Virtualization<br/>Query in place, no data copy"]
Q5 -->|Yes| Q6{"Who manages<br/>the pipeline?"}
Q5 -->|No| P2["Pattern 2: ADF + SHIR<br/>CDC incremental pipeline<br/>to ADLS Bronze"]
Q6 -->|DBA / SQL Team| P5["Pattern 5: S3 Integration<br/>SQL Server pushes data<br/>to ADLS directly"]
Q6 -->|Data Engineering| P2
P1 --> DONE(["Configure + Deploy"])
P2 --> DONE
P2_RT --> DONE
P3 --> DONE
P4 --> DONE
P5 --> DONE
style START fill:#1b4332,color:#fff
style DONE fill:#0078d4,color:#fff
style P1 fill:#40916c,color:#fff
style P2 fill:#40916c,color:#fff
style P2_RT fill:#40916c,color:#fff
style P3 fill:#40916c,color:#fff
style P4 fill:#40916c,color:#fff
style P5 fill:#40916c,color:#fff

## Quick Comparison
| Criteria | Pattern 1: Arc + Virtualization | Pattern 2: ADF + SHIR (CDC) | Pattern 3: AG Read Replica | Pattern 4: MI Link | Pattern 5: S3 Integration |
|---|---|---|---|---|---|
| Data Freshness | Real-time | Minutes-hours | Seconds-minutes | Seconds | On-demand |
| Data Movement | None | Incremental copy | AG replication | Log replication | Explicit push/pull |
| Min SQL Version | 2012 | 2008 | 2012 (Enterprise) | 2016 | 2022 |
| Azure Cost | Low (compute only) | Medium (ADF + ADLS) | Medium (SHIR + ADLS) | High (SQL MI) | Low (ADLS only) |
| On-Prem Impact | Medium (query load) | Low (CDC reader) | Low (log stream) | Low (log stream) | Low (on-demand) |
| Complexity | Low | Medium | Medium-High | Medium | Low |
| dbt Compatible | Limited | Full | Full | Full (via Fabric) | Full |
| Best For | Compliance constraints | Medallion architecture | Real-time + isolation | Full Azure replica | DBA-managed exchange |
## Cross-References
- Self-Hosted Integration Runtime — deploying, scaling, and monitoring the SHIR that bridges on-prem to Azure
- Azure Data Factory Setup — ADF pipeline artifacts, linked services, and CI/CD
- Multi-Cloud Data Virtualization — broader virtualization patterns including AWS, GCP, and SaaS sources
- Performance Tuning — optimizing pipeline throughput and query performance
- Security & Compliance — encryption, access control, and compliance patterns
- Disaster Recovery — DR strategies including hybrid failover scenarios
- Batch vs. Streaming Decision — choosing between batch CDC and streaming replication
- Materialize vs. Virtualize Decision — when to copy data vs. query in place