Multi-Cloud Data Virtualization with Azure at the Core¶
Overview¶
Every enterprise of meaningful scale has the same problem: data is everywhere. Customer records live in AWS RDS. Transaction logs stream into GCP BigQuery. The ERP system runs on-premises on Oracle. CRM data sits in Salesforce. And someone in finance has a critical spreadsheet on SharePoint.
The traditional answer — copy everything into one place — creates more problems than it solves:
- Cost: Storing duplicate petabytes across clouds doubles or triples storage spend.
- Governance: Every copy is a governance liability. Which copy is authoritative? Who has access to copy #3?
- Latency: Batch ETL pipelines introduce hours or days of delay. The data is always stale.
- Data residency: Regulations like GDPR, CCPA, and sovereign data laws may prohibit moving data across regions or borders.
- Complexity: N sources × M destinations = an exponential integration problem.
Data virtualization is the alternative: query data where it lives. Instead of moving bytes, you move queries. A virtualization layer presents a unified semantic model over physically distributed data, and the compute travels to the data — not the other way around.
What Data Virtualization Is Not
Data virtualization is not a replacement for a data warehouse. It is a complementary pattern. Some data should be materialized (copied, transformed, stored) for performance. Other data should be virtualized (queried in place) for freshness, cost, or compliance. The art is knowing which is which.
Why Azure Is Uniquely Positioned as the Core¶
Azure is the only cloud platform that provides the complete stack required to serve as the core of a multi-cloud data virtualization architecture. No other cloud comes close to this combination:
| Capability | Azure Service | What It Does |
|---|---|---|
| Multi-cloud management plane | Azure Arc | Extends Azure governance, policy, and monitoring to any infrastructure — AWS, GCP, on-prem, edge |
| Zero-copy cross-cloud data access | Microsoft Fabric OneLake Shortcuts | Creates virtual references to S3, GCS, and ADLS data — no data movement |
| Serverless cross-cloud SQL | Synapse Analytics Serverless SQL | Queries Parquet, Delta, CSV across any storage using T-SQL via OPENROWSET |
| Real-time BI across clouds | Power BI DirectQuery + Composite Models | Queries multiple sources live in a single report without importing data |
| Unified multi-cloud governance | Microsoft Purview | Scans, catalogs, classifies, and tracks lineage across AWS, GCP, on-prem, and SaaS |
| Cross-cloud data integration | Azure Data Factory | 100+ connectors with Mapping Data Flows for in-place transformation |
AWS and GCP each have strong analytics services — but they are designed to work within their own ecosystems. Azure is designed to work across all of them.
Architecture¶
The following diagram shows the complete multi-cloud data virtualization architecture with Azure at the core.
graph TB
subgraph Sources["Data Sources"]
AWS["☁️ AWS<br/>S3 · Redshift · RDS · DynamoDB"]
GCP["☁️ GCP<br/>GCS · BigQuery · Cloud SQL"]
ONPREM["🏢 On-Premises<br/>SQL Server · Oracle · Hadoop · File Shares"]
SAAS["📦 SaaS<br/>Salesforce · SAP · Dynamics 365 · ServiceNow"]
end
subgraph ARC["Azure Arc Layer"]
ARCMGMT["Arc Resource Bridge<br/>Unified management plane"]
ARCDATA["Arc-enabled Data Services<br/>SQL MI · PostgreSQL"]
ARCK8S["Arc-enabled Kubernetes<br/>Run Azure services anywhere"]
end
subgraph Governance["Microsoft Purview"]
CATALOG["Unified Data Catalog<br/>Cross-cloud asset inventory"]
LINEAGE["Lineage Tracking<br/>Source → Transform → Report"]
CLASSIFY["Data Classification<br/>Sensitivity labels across clouds"]
POLICY["Access Policies<br/>Consistent RBAC everywhere"]
end
subgraph Core["Azure Core Hub"]
ONELAKE["OneLake Shortcuts<br/>Zero-copy references to S3, GCS, ADLS"]
SYNAPSE["Synapse Serverless SQL<br/>OPENROWSET + External Tables"]
ADF["Azure Data Factory<br/>100+ linked services"]
FABRIC["Microsoft Fabric<br/>Unified analytics platform"]
end
subgraph Medallion["Medallion Architecture"]
BRONZE["🥉 Bronze Layer<br/>Virtual references + raw copies"]
SILVER["🥈 Silver Layer<br/>Cleansed, conformed, joined"]
GOLD["🥇 Gold Layer<br/>Analytics-ready, business models"]
end
subgraph Serving["Serving Layer"]
PBI["Power BI<br/>DirectQuery + Composite Models"]
SQLEP["SQL Endpoint<br/>T-SQL access for tools"]
API["REST APIs<br/>Programmatic access"]
NOTEBOOK["Notebooks<br/>Spark + Python analytics"]
end
AWS --> ARCMGMT
GCP --> ARCMGMT
ONPREM --> ARCMGMT
SAAS --> ADF
ARCMGMT --> CATALOG
ARCMGMT --> ONELAKE
ADF --> BRONZE
ONELAKE --> BRONZE
SYNAPSE --> BRONZE
BRONZE --> SILVER
SILVER --> GOLD
GOLD --> PBI
GOLD --> SQLEP
GOLD --> API
GOLD --> NOTEBOOK
CATALOG --> LINEAGE
LINEAGE --> CLASSIFY
CLASSIFY --> POLICY Why Azure as the Core — Technical Comparison¶
This is not a marketing claim. It is an architectural reality based on what each cloud platform actually offers today.
| Capability | Azure | AWS | GCP |
|---|---|---|---|
| Multi-cloud management plane | Azure Arc — projects Azure Resource Manager to any infra (AWS EC2, GCP VMs, on-prem servers, edge) | None. AWS Systems Manager is AWS-only. | Anthos — Kubernetes-focused, limited data service support |
| Cross-cloud data shortcuts | OneLake shortcuts to S3, GCS, ADLS, Dataverse — zero-copy, metadata-only references | None. Lake Formation is S3-only. | BigQuery Omni runs on AWS/Azure but queries GCS natively only |
| Serverless cross-cloud SQL | Synapse serverless SQL pools — OPENROWSET queries Parquet/Delta/CSV in S3, GCS, ADLS, HTTP endpoints | Athena — S3 only, no native cross-cloud | BigQuery — GCS only for external tables, limited cross-cloud |
| Unified multi-cloud governance | Microsoft Purview — scans AWS S3, RDS, Redshift, GCS, BigQuery, Snowflake, Oracle, SAP, Salesforce, on-prem SQL | AWS Glue Data Catalog — AWS services only | Dataplex — GCP services only |
| Hybrid identity | Microsoft Entra ID + Arc — single identity plane across all clouds and on-prem | IAM — AWS-scoped, federate via SAML/OIDC only | Cloud Identity — GCP-scoped |
| BI with live cross-cloud queries | Power BI DirectQuery + composite models — 100+ native connectors, query any source live | QuickSight — limited DirectQuery, primarily SPICE import | Looker — strong modeling but limited live connectivity |
| Data integration connectors | ADF — 100+ built-in connectors including SAP, Salesforce, Oracle, Dynamics, mainframes | Glue — fewer connectors, AWS-focused | Dataflow — GCP ecosystem focused |
The Key Differentiator
Azure Arc + Purview + OneLake shortcuts is a combination no other cloud can replicate. It gives you a single management plane, a single governance catalog, and zero-copy data access — across every cloud and on-premises environment — from one control point.
Data Virtualization Patterns¶
Pattern 1: OneLake Shortcuts (Zero-Copy Federation)¶
OneLake shortcuts create virtual pointers to data in external storage systems. When you query a shortcut, OneLake reads directly from the source — no data is copied, no storage is consumed in OneLake.
Supported shortcut targets:
- Amazon S3 (and S3-compatible storage)
- Google Cloud Storage
- Azure Data Lake Storage Gen2
- Dataverse
How it works:
- You create a shortcut in a Fabric lakehouse that points to an S3 bucket (or GCS bucket, ADLS container).
- The shortcut appears as a folder in your lakehouse — indistinguishable from local data.
- Spark notebooks, SQL endpoints, and Power BI all query the shortcut as if the data were local.
- OneLake issues read requests to the source storage at query time. No intermediate copies.
Step-by-step: Create an S3 shortcut
import requests
# Create a shortcut to an AWS S3 bucket in a Fabric lakehouse
url = "https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}/items/{lakehouse_id}/shortcuts"
payload = {
"path": "Tables/aws_customers",
"name": "aws_customers",
"target": {
"amazonS3": {
"location": "https://my-bucket.s3.us-east-1.amazonaws.com",
"subpath": "/customers/delta/",
"connectionId": "{connection_id}" # Pre-configured S3 connection
}
}
}
headers = {
"Authorization": f"Bearer {access_token}",
"Content-Type": "application/json"
}
response = requests.post(url, json=payload, headers=headers)
print(response.status_code, response.json())
- Open your Fabric lakehouse.
- Right-click Tables → New shortcut.
- Select Amazon S3 as the source.
- Enter the S3 bucket URL and select your connection.
- Choose the subfolder containing your Delta/Parquet data.
- Name the shortcut and click Create.
Query the shortcut via SQL endpoint:
-- This queries AWS S3 data live — no copy exists in OneLake
SELECT
customer_id,
customer_name,
region,
lifetime_value
FROM lakehouse1.dbo.aws_customers
WHERE region = 'us-east-1'
AND lifetime_value > 10000
ORDER BY lifetime_value DESC;
Shortcut Performance Depends on Source Latency
OneLake shortcuts query the source storage at runtime. If your Fabric workspace is in East US and your S3 bucket is in eu-west-1, every query pays ~100-150ms of cross-region latency per request. Co-locate your Fabric capacity region with your source data region when possible. For high-frequency queries, consider materializing the data instead.
When to use OneLake shortcuts:
- Large datasets (TB+) that are too expensive to copy
- Data that must remain in the source cloud for residency or compliance
- Analytics workloads with moderate query frequency (dozens of queries/day, not thousands)
- Cross-cloud data exploration and ad-hoc analytics
Pattern 2: Synapse Serverless SQL (OPENROWSET / External Tables)¶
Synapse serverless SQL pools let you query files in any storage account using standard T-SQL — without provisioning any compute in advance. You pay only for the data scanned.
Step-by-step: Query S3 data via OPENROWSET
-- Step 1: Create a database-scoped credential for AWS S3
CREATE DATABASE SCOPED CREDENTIAL s3_credential
WITH IDENTITY = 'S3 Access Key',
SECRET = '{"AccessKeyId":"AKIA...","SecretAccessKey":"wJalr..."}';
GO
-- Step 2: Create an external data source pointing to S3
CREATE EXTERNAL DATA SOURCE aws_sales_data
WITH (
LOCATION = 's3://my-bucket.s3.us-east-1.amazonaws.com/',
CREDENTIAL = s3_credential
);
GO
-- Step 3: Query Parquet files directly with OPENROWSET
SELECT
order_id,
customer_id,
order_date,
total_amount,
currency
FROM OPENROWSET(
BULK 'sales/2024/**/*.parquet',
DATA_SOURCE = 'aws_sales_data',
FORMAT = 'PARQUET'
) AS orders
WHERE order_date >= '2024-01-01'
AND total_amount > 500;
Create a persistent external table for repeated use:
-- External file format
CREATE EXTERNAL FILE FORMAT parquet_format
WITH (FORMAT_TYPE = PARQUET);
GO
-- External table — reusable, looks like a normal table
CREATE EXTERNAL TABLE dbo.aws_sales (
order_id BIGINT,
customer_id BIGINT,
order_date DATE,
total_amount DECIMAL(18,2),
currency VARCHAR(3)
)
WITH (
LOCATION = 'sales/2024/',
DATA_SOURCE = aws_sales_data,
FILE_FORMAT = parquet_format
);
GO
-- Now query like any table
SELECT * FROM dbo.aws_sales WHERE currency = 'USD';
OPENROWSET File Limit
OPENROWSET has a limit of approximately 10,000 files per query. If your S3 bucket has heavily partitioned data with many small files, you will hit this limit. Partition wisely — use date-based partitioning with fewer, larger files rather than thousands of micro-files.
When to use Synapse serverless:
- Ad-hoc data exploration across cloud storage
- Lightweight, infrequent queries that don't justify OneLake shortcuts
- T-SQL-native teams who need cross-cloud access without learning Spark
- Cost-sensitive workloads — pay only for TB scanned ($5/TB)
Pattern 3: ADF Mapping Data Flows (Transform at Source)¶
Azure Data Factory connects to 100+ data sources via linked services. Mapping Data Flows provide a visual, code-free transformation layer that runs on managed Spark.
When to use ADF instead of shortcuts:
- SaaS sources (Salesforce, SAP, Dynamics, ServiceNow) — these don't expose storage-layer access
- On-premises databases — behind firewalls, accessible only via Self-Hosted Integration Runtime
- Data that needs transformation before landing — deduplication, PII masking, schema normalization
Step-by-step: Ingest Salesforce data to Bronze layer
- Create a Self-Hosted Integration Runtime (for on-prem) or use Azure IR (for cloud/SaaS).
- Create a Linked Service for Salesforce:
- Authentication: OAuth2 or username/password + security token
- Connection: Salesforce API endpoint
- Create a Mapping Data Flow:
- Source: Salesforce
AccountandOpportunityobjects - Transformations: Select columns, rename to snake_case, add
_ingested_attimestamp - Sink: ADLS Gen2 Bronze container in Delta format
- Source: Salesforce
- Create a Pipeline that triggers the data flow on schedule.
{
"name": "SalesforceToLakehouse",
"properties": {
"activities": [
{
"name": "IngestAccounts",
"type": "ExecuteDataFlow",
"dataflow": { "referenceName": "df_salesforce_accounts" },
"compute": {
"coreCount": 8,
"computeType": "MemoryOptimized"
}
}
],
"parameters": {
"incrementalDate": { "type": "string" }
}
}
}
Mapping Data Flow Cost
Mapping Data Flows spin up dedicated Spark clusters. An 8-core MemoryOptimized cluster costs approximately $0.96/hour. For small datasets, consider Copy Activity (no Spark) instead. Reserve Data Flows for transformations that genuinely require Spark.
Pattern 4: Power BI Composite Models (DirectQuery + Import)¶
Composite models let a single Power BI dataset combine DirectQuery tables (queried live) with Import tables (cached in memory). This is the serving-layer pattern for multi-cloud virtualization.
Architecture:
graph LR
ONPREM_SQL["On-Prem SQL Server<br/>(via Gateway)"] -->|DirectQuery| PBI["Power BI<br/>Composite Model"]
GOLD_LAKE["ADLS Gold Layer<br/>(Delta Tables)"] -->|Import| PBI
SYNAPSE["Synapse Serverless<br/>(External Tables)"] -->|DirectQuery| PBI
PBI --> DASHBOARD["Executive Dashboard"] Step-by-step:
- Connect to on-premises SQL Server via an On-Premises Data Gateway in DirectQuery mode.
- Connect to ADLS Gold layer via Synapse SQL endpoint — import aggregated dimension tables.
- Create relationships between DirectQuery fact tables and imported dimension tables.
- Build reports that combine real-time on-prem operational data with historical cloud analytics.
Aggregation Tables for DirectQuery Performance
DirectQuery performance is bounded by source query speed. Create aggregation tables in Power BI that pre-summarize DirectQuery data at higher grain levels. Power BI automatically routes queries to aggregations when possible, falling back to DirectQuery for detail-level queries.
DirectQuery Anti-Pattern
Do not use DirectQuery against sources with >5 second query response times. Users will experience unacceptable report load times. If the source is slow, import the data or create a materialized view closer to the consumer.
Azure Arc for Multi-Cloud Governance¶
Azure Arc extends the Azure management plane to infrastructure running anywhere. For data virtualization, Arc provides three critical capabilities:
Arc-Enabled SQL Server¶
Register any SQL Server instance — on-premises, in AWS EC2, or in GCP Compute Engine — with Azure Arc. Once registered:
- Azure Portal visibility: The SQL Server appears in the Azure Portal alongside cloud-native resources.
- Microsoft Defender for SQL: Threat detection and vulnerability assessment, regardless of hosting location.
- Purview integration: Purview can scan Arc-connected SQL Servers for cataloging and classification.
- Azure Policy: Enforce configuration baselines (encryption, auditing) across all SQL Servers.
Step-by-step: Register an AWS EC2 SQL Server with Arc
# 1. Download and install the Azure Connected Machine agent on the EC2 instance
wget https://aka.ms/azcmagent -O install_linux_azcmagent.sh
bash install_linux_azcmagent.sh
# 2. Connect the machine to Azure Arc
azcmagent connect \
--resource-group "rg-arc-multicloud" \
--tenant-id "{tenant-id}" \
--location "eastus" \
--subscription-id "{subscription-id}" \
--cloud "AzureCloud"
# 3. The SQL Server extension is auto-detected and registered
# Verify in Azure Portal: Azure Arc > SQL Server
Arc-Enabled Data Services¶
Run Azure SQL Managed Instance or Azure Arc-enabled PostgreSQL on any Kubernetes cluster — on-prem, AWS EKS, or GCP GKE. This gives you Azure-managed database services on infrastructure you control.
Arc-Enabled Kubernetes¶
Deploy Azure services (including data services, ML workloads, and App Services) to any CNCF-conformant Kubernetes cluster. This is the foundation for running Azure analytics components at the data source.
Microsoft Purview for Unified Governance¶
Data virtualization without governance is just federation of chaos. Microsoft Purview provides the governance layer that makes multi-cloud virtualization manageable.
Scanning Multi-Cloud Sources¶
Purview natively scans:
| Source Type | Examples | Scan Method |
|---|---|---|
| AWS | S3, RDS (SQL Server, PostgreSQL, MySQL), Redshift, DynamoDB | Purview connectors via Self-Hosted IR or managed VNet |
| GCP | GCS, BigQuery, Cloud SQL | Purview connectors |
| On-premises | SQL Server, Oracle, SAP HANA, Teradata, file shares | Self-Hosted Integration Runtime |
| SaaS | Salesforce, SAP S/4HANA, Dynamics 365, Power BI | Native connectors |
| Azure | ADLS, Synapse, Fabric, SQL Database, Cosmos DB | Native (auto-discovery) |
Unified Data Catalog¶
After scanning, Purview creates a single searchable catalog of every data asset across all clouds:
- Business glossary terms mapped to physical assets regardless of cloud
- Data classification — automatic detection of PII, financial data, health records
- Sensitivity labels — Microsoft Information Protection labels applied to data assets
- Owner assignment — data stewards assigned per asset or collection
Cross-Cloud Lineage¶
Purview tracks data lineage from source through transformation to consumption:
AWS RDS (customers) → ADF Copy → ADLS Bronze → dbt transform → ADLS Silver →
Synapse View → Power BI Report
This lineage is visualized in the Purview portal, showing exactly how data flows across cloud boundaries. When a source schema changes or a quality issue is detected, you can trace impact downstream.
Step-by-step: Register and scan AWS S3 in Purview
- Register the source: Purview portal → Data Map → Register → Amazon S3.
- Configure the connection: Provide the ARN of the IAM role Purview will assume (cross-account access).
- Set up a scan: Choose the S3 bucket(s) to scan, select a scan rule set (system default or custom), set the schedule.
- Run the scan: Purview reads metadata (schema, file types, row counts) and applies classification rules.
- Review in catalog: Assets appear in the Purview catalog with auto-detected classifications and suggested glossary mappings.
Purview Scans Metadata, Not Data
Purview scanning reads metadata and samples data for classification. It does not copy or ingest your data. Scanning an S3 bucket does not incur AWS egress charges beyond the small metadata reads.
Complete Walkthrough: Multi-Cloud Analytics Pipeline¶
Scenario: An enterprise has:
- Customer data in AWS RDS (PostgreSQL)
- Transaction data in GCP BigQuery
- Product catalog in on-premises SQL Server
- CRM data in Salesforce
Goal: Build a unified Customer 360 analytics model queryable from Power BI, without copying all data to Azure.
Step 1: Register All Sources with Azure Arc and Purview¶
# Register the on-prem SQL Server with Azure Arc
azcmagent connect --resource-group rg-multicloud --location eastus ...
# In Purview, register:
# - AWS RDS PostgreSQL (via Self-Hosted IR or managed VNet)
# - GCP BigQuery (native connector)
# - On-prem SQL Server (via Arc + SHIR)
# - Salesforce (native connector)
Run initial Purview scans on all four sources. Review the catalog to understand schemas, data volumes, and classifications.
Step 2: Create OneLake Shortcuts for Cloud Data¶
Export AWS RDS customer data to S3 as Delta tables (scheduled daily via AWS Glue or Lambda). Export GCP BigQuery transaction data to GCS as Parquet (scheduled via BigQuery export).
# Create OneLake shortcut to AWS S3 customer data
shortcut_payload = {
"path": "Tables/aws_customers",
"name": "aws_customers",
"target": {
"amazonS3": {
"location": "https://analytics-export.s3.us-east-1.amazonaws.com",
"subpath": "/customers/delta/",
"connectionId": "{s3_connection_id}"
}
}
}
# Create OneLake shortcut to GCP GCS transaction data
shortcut_payload_gcs = {
"path": "Tables/gcp_transactions",
"name": "gcp_transactions",
"target": {
"googleCloudStorage": {
"location": "https://storage.googleapis.com/analytics-export",
"subpath": "/transactions/parquet/",
"connectionId": "{gcs_connection_id}"
}
}
}
Step 3: Configure ADF for Salesforce and On-Prem¶
Salesforce and on-prem SQL Server cannot be accessed via storage shortcuts. Use ADF:
- Salesforce Linked Service: OAuth2 connection to Salesforce API.
- On-prem SQL Server Linked Service: Self-Hosted Integration Runtime on the on-prem network.
- Pipelines: Incremental extraction to ADLS Bronze layer in Delta format.
Step 4: Build the Bronze Layer¶
| Source | Bronze Method | Refresh Frequency |
|---|---|---|
| AWS RDS → S3 | OneLake shortcut (virtual) | Real-time (reads S3 at query time) |
| GCP BigQuery → GCS | OneLake shortcut (virtual) | Real-time (reads GCS at query time) |
| On-prem SQL Server | ADF copy to ADLS (physical) | Every 4 hours |
| Salesforce | ADF copy to ADLS (physical) | Daily |
Step 5: dbt Silver Layer — Unified Customer 360¶
-- models/silver/customer_360.sql
{{
config(
materialized='table',
file_format='delta',
location_root='abfss://silver@lakehouse.dfs.core.windows.net/customer_360'
)
}}
WITH customers AS (
SELECT
customer_id,
customer_name,
email,
region,
created_at
FROM {{ source('bronze', 'aws_customers') }} -- OneLake shortcut to S3
),
transactions AS (
SELECT
customer_id,
COUNT(*) AS total_orders,
SUM(amount) AS lifetime_spend,
MAX(transaction_date) AS last_order_date
FROM {{ source('bronze', 'gcp_transactions') }} -- OneLake shortcut to GCS
GROUP BY customer_id
),
products_purchased AS (
SELECT
t.customer_id,
ARRAY_AGG(DISTINCT p.category) AS product_categories
FROM {{ source('bronze', 'gcp_transactions') }} t
JOIN {{ source('bronze', 'onprem_products') }} p -- ADF copy
ON t.product_id = p.product_id
GROUP BY t.customer_id
),
crm AS (
SELECT
email,
account_owner,
lead_source,
last_activity_date
FROM {{ source('bronze', 'salesforce_accounts') }} -- ADF copy
)
SELECT
c.customer_id,
c.customer_name,
c.email,
c.region,
c.created_at,
t.total_orders,
t.lifetime_spend,
t.last_order_date,
pp.product_categories,
crm.account_owner,
crm.lead_source,
crm.last_activity_date,
CURRENT_TIMESTAMP() AS _refreshed_at
FROM customers c
LEFT JOIN transactions t ON c.customer_id = t.customer_id
LEFT JOIN products_purchased pp ON c.customer_id = pp.customer_id
LEFT JOIN crm ON c.email = crm.email
Step 6: Gold Layer — Analytics-Ready Datasets¶
-- models/gold/customer_segments.sql
SELECT
customer_id,
customer_name,
region,
lifetime_spend,
total_orders,
CASE
WHEN lifetime_spend > 100000 THEN 'Enterprise'
WHEN lifetime_spend > 10000 THEN 'Mid-Market'
WHEN lifetime_spend > 1000 THEN 'SMB'
ELSE 'Self-Serve'
END AS customer_segment,
CASE
WHEN last_order_date > DATEADD(DAY, -30, CURRENT_DATE()) THEN 'Active'
WHEN last_order_date > DATEADD(DAY, -90, CURRENT_DATE()) THEN 'At Risk'
ELSE 'Churned'
END AS engagement_status
FROM {{ ref('customer_360') }}
Step 7: Power BI Composite Model¶
- DirectQuery to on-prem SQL Server (via gateway) for real-time inventory data.
- Import from Gold layer
customer_segments(refreshed every 4 hours). - DirectQuery to Synapse serverless for ad-hoc drill-through to transaction detail.
- Create relationships between imported dimensions and DirectQuery facts.
Step 8: Purview Lineage¶
After all components are running, Purview automatically captures lineage:
AWS RDS → S3 Export → OneLake Shortcut → dbt Silver → Gold → Power BI Report
GCP BigQuery → GCS Export → OneLake Shortcut ↗
On-prem SQL Server → ADF → Bronze → dbt Silver ↗
Salesforce → ADF → Bronze → dbt Silver ↗
This lineage is visible in the Purview portal. When a schema change occurs in the AWS RDS source, you can trace the impact all the way to the Power BI report.
Best Practices¶
1. Classify Data Before Virtualizing¶
Before federating, classify every dataset:
| Classification | Action | Rationale |
|---|---|---|
| Can move, should move | ADF copy to ADLS | Performance-critical, frequently joined |
| Can move, prefer not to | OneLake shortcut | Large, infrequently queried |
| Cannot move | OneLake shortcut or Synapse serverless | Data residency, compliance, sovereignty |
| Must stay at source | DirectQuery or API | Real-time operational data |
2. Network Architecture¶
Multi-cloud virtualization is only as fast as the network between clouds.
- On-prem to Azure: Azure ExpressRoute (dedicated circuit, predictable latency).
- Azure internal: Private endpoints for all PaaS services (ADLS, Synapse, Fabric).
- AWS to Azure: AWS Direct Connect + Azure ExpressRoute via colocation, or site-to-site VPN.
- GCP to Azure: Google Cloud Interconnect + Azure ExpressRoute, or VPN.
Without Dedicated Interconnect
Cross-cloud queries over the public internet introduce 50-200ms of latency per request and expose data in transit (even with TLS) to public routing. For production workloads, always use dedicated interconnect.
3. Cost Optimization¶
Shortcuts and serverless queries are "free" in compute — but egress from the source cloud is not.
| Source Cloud | Egress Cost | 1 TB/day Cost |
|---|---|---|
| AWS (to internet/other cloud) | $0.09/GB | ~$2,700/month |
| GCP (to internet/other cloud) | $0.12/GB | ~$3,600/month |
| Azure (to internet) | $0.087/GB | ~$2,610/month |
Decision rule: If a dataset is queried more than 3x per day and is <100 GB, it is almost always cheaper to copy it (ADF) than to virtualize it (shortcut), because egress costs compound with every query.
4. Performance Tiering¶
graph TD
A["How often is this data queried?"] -->|"< 1x/day"| B["OneLake Shortcut or Synapse Serverless"]
A -->|"1-10x/day"| C["Depends on size"]
A -->|"> 10x/day"| D["Materialize with ADF"]
C -->|"< 10 GB"| D
C -->|"> 10 GB"| B 5. Security¶
- Managed Identity for all Azure-to-Azure connections (no keys to rotate).
- Azure Key Vault for storing AWS access keys, GCP service account keys, and database passwords.
- Cross-cloud credential rotation: Automate key rotation with Key Vault and ADF parameterized linked services.
- Network isolation: Managed VNet Integration Runtime for ADF; Private endpoints for Synapse and ADLS.
6. Monitoring¶
- Azure Monitor workbooks for cross-cloud pipeline health.
- ADF pipeline run metrics: Success/failure rates, duration, data volumes.
- Synapse serverless: Monitor TB scanned to control costs (set budgets).
- Power BI: Premium capacity metrics for DirectQuery performance.
7. Schema Evolution¶
Federated sources change independently. Protect yourself:
- Schema drift detection in ADF Mapping Data Flows (built-in).
- dbt tests on source schemas (
dbt testwithaccepted_values,not_null,relationships). - Purview schema change alerts — detect when scanned source schemas differ from baseline.
- Backward-compatible external tables — define Synapse external tables with
REJECT_TYPE = VALUE, REJECT_VALUE = 0to fail fast on schema mismatch.
Lessons Learned¶
Virtualization Is Not Free
Every query against a shortcut or external table incurs egress from the source cloud. A dashboard with 50 users refreshing 5 times per day against a 10 GB shortcut generates 2.5 TB/month of egress. At AWS rates, that's $225/month — for a single dataset. Multiply by dozens of datasets and the math can exceed the cost of just copying the data. Always model egress costs before choosing virtualization over replication.
Governance Before Federation
If you don't know what PII exists in your AWS RDS tables today, federating those tables into a cross-cloud analytics model does not solve the problem — it amplifies it. Now that PII is queryable from Power BI by anyone with report access. Run Purview classification scans BEFORE creating shortcuts or external tables.
Start with Gold, Not Bronze
The instinct is to virtualize everything — create shortcuts to every source and build up. Instead, start from the consumer: what Gold-layer datasets do stakeholders actually need? Then work backward to determine which sources feed those datasets. Many source datasets are irrelevant to the analytics layer and don't need federation at all.
Latency Is Architecture
200ms of cross-cloud latency is invisible in a batch pipeline that runs at 2 AM. It is very visible in a Power BI DirectQuery report where every slicer click triggers a cross-cloud round trip. Design your architecture around latency tolerance: batch workloads can virtualize freely; interactive workloads should materialize or use aggregation tables.
Test Failover Scenarios
If your Gold layer depends on an OneLake shortcut to S3, and AWS us-east-1 has an outage, your Gold layer is broken. Your Power BI dashboards are broken. Your executive reporting is broken. Identify single-cloud dependencies in your virtual layer and plan for degraded operation — cached snapshots, fallback views, or delayed refresh tolerances.
Identity Is the Hardest Part
The data plumbing (shortcuts, pipelines, queries) is the easy part. The hard part is: which Azure identity should be authorized to read from AWS S3? How do you map AWS IAM roles to Azure RBAC? How do you audit access across clouds? Invest in cross-cloud identity architecture early. Use Azure Key Vault for credential management and Microsoft Entra ID Workload Identity Federation where possible to avoid long-lived secrets.
Gotchas & Anti-Patterns¶
Don't Virtualize Transactional Workloads
Data virtualization is for analytics (OLAP), not transactions (OLTP). Never use OneLake shortcuts or Synapse serverless as the backend for a transactional application. The latency, consistency guarantees, and concurrency model are wrong.
Don't Skip Network Planning
Cross-cloud data virtualization over the public internet is slow (variable latency), expensive (full egress pricing), and insecure (public routing). Budget for ExpressRoute, Direct Connect, or Cloud Interconnect before committing to the architecture.
Don't Ignore Egress Costs
AWS charges $0.09/GB for data leaving. GCP charges $0.12/GB. If you query a 50 GB dataset via a shortcut 10 times per day, that is 500 GB/day × 30 days × \(0.09/GB = **\)1,350/month** in AWS egress alone. Compare this to storing a copy in ADLS (~$1/TB/month for cool tier). The math often favors replication for high-frequency queries.
Don't Use DirectQuery for Everything
DirectQuery is powerful but not a silver bullet. Every report interaction generates a live query against the source. If the source is slow, the report is slow. If the source is down, the report is down. Import data when freshness tolerance allows it. Use composite models to blend DirectQuery and Import.
Don't Skip Data Contracts
Federated data without contracts is "garbage in from everywhere." Define explicit contracts for each source: expected schema, freshness SLA, quality thresholds, owner contact. Use dbt tests, Great Expectations, or Fabric data quality rules to enforce contracts programmatically.
Don't Assume Source Schemas Are Stable
The AWS team will add a column to the RDS table. The GCP team will change a BigQuery field from STRING to INT64. The Salesforce admin will add a custom field. Implement schema drift detection at every integration point. ADF has built-in drift handling; dbt has source freshness and schema tests.
Cost Optimization¶
Virtualization vs. Replication Cost Comparison¶
| Scenario | Virtualize (Shortcut) | Replicate (ADF Copy) | Winner |
|---|---|---|---|
| 100 GB dataset, queried 1x/week | Egress: ~$0.36/month | Storage: ~\(0.10/month + ADF: ~\)0.25/run | Virtualize — minimal egress |
| 100 GB dataset, queried 10x/day | Egress: ~$2,700/month | Storage: ~\(0.10/month + ADF: ~\)0.25/run | Replicate — egress dominates |
| 1 TB dataset, queried 1x/day | Egress: ~$2,700/month | Storage: ~\(1/month + ADF: ~\)2.50/run | Replicate — storage is cheap |
| 10 TB dataset, queried 1x/month | Egress: ~$900/month | Storage: ~\(10/month + ADF: ~\)25/run | Virtualize — low frequency |
| 10 TB dataset, residency-restricted | N/A — must virtualize | N/A — cannot copy | Virtualize — compliance requires it |
Cross-Cloud Egress Pricing¶
| From → To | Per GB | 1 TB | 10 TB |
|---|---|---|---|
| AWS → Azure/Internet | $0.09 | $90 | $900 |
| GCP → Azure/Internet | $0.12 | $120 | $1,200 |
| Azure → AWS/Internet | $0.087 | $87 | $870 |
| AWS → AWS (cross-region) | $0.02 | $20 | $200 |
| Any cloud → Same region | Free | Free | Free |
Negotiate Committed Use Discounts
All three clouds offer egress discounts for committed volumes. AWS offers Data Transfer Out tiered pricing starting at $0.085/GB for the first 10 TB. GCP offers committed use discounts. Azure offers bandwidth pricing tiers. If you know your egress volume, negotiate.
Decision Tree: Virtualize vs. Replicate¶
- Is data residency or compliance a constraint? → Yes: Virtualize. No data leaves the source.
- Is the dataset queried more than 5x/day? → Yes: Replicate. Egress costs compound fast.
- Is the dataset >1 TB? → Yes: Consider virtualization unless query frequency is high.
- Is real-time freshness required? → Yes: DirectQuery or streaming. Batch replication won't suffice.
- Default: Start with virtualization, monitor egress costs, replicate if costs exceed storage.
CSA-in-a-Box Integration¶
Multi-cloud data virtualization fits directly into the CSA-in-a-Box framework:
Medallion Architecture with Virtual Bronze¶
The Bronze layer in a multi-cloud scenario is a hybrid:
- Virtual Bronze: OneLake shortcuts to S3/GCS — no physical storage, metadata-only references.
- Physical Bronze: ADF copies from SaaS/on-prem sources — raw data in ADLS Delta format.
- Silver/Gold: Always materialized. Joining across virtual sources requires materializing the result.
dbt Models Across Federated Sources¶
dbt models in CSA-in-a-Box can reference both virtual (shortcut) and physical (ADF-landed) Bronze sources. The dbt sources.yml configuration is identical — dbt doesn't know or care whether the underlying table is a shortcut or a physical Delta table.
# models/sources.yml
sources:
- name: bronze
schema: dbo
tables:
- name: aws_customers # OneLake shortcut to S3
- name: gcp_transactions # OneLake shortcut to GCS
- name: onprem_products # ADF copy to ADLS
- name: salesforce_accounts # ADF copy to ADLS
Data Contracts for Cross-Cloud Data Products¶
Each federated source should have an explicit data contract:
# contracts/aws_customers.yml
contract:
name: aws_customers
owner: aws-data-team@company.com
source: AWS RDS PostgreSQL (us-east-1)
access_method: OneLake shortcut via S3 export
freshness_sla: 24 hours
schema:
- name: customer_id
type: BIGINT
nullable: false
tests: [unique, not_null]
- name: email
type: VARCHAR
nullable: false
tests: [not_null]
classification: PII
- name: region
type: VARCHAR
nullable: false
tests: [accepted_values: ["us-east-1", "eu-west-1", "ap-southeast-1"]]
Purview Governance Overlay¶
Purview acts as the governance overlay for all CSA-in-a-Box data products:
- Catalog: Every Bronze, Silver, and Gold asset is registered and classified.
- Lineage: End-to-end lineage from any source cloud through dbt transformation to Power BI.
- Access control: Data access policies enforced consistently regardless of source cloud.
- Quality: Integration with data quality rules to validate contract compliance.