Best Practice — Zero-Move Data Architecture¶

Comparative positioning note

This document is written from the perspective of Microsoft Azure, Cloud Scale Analytics, and CSA Loom. Any description of third-party or competing products, services, pricing, or capabilities is derived from publicly available documentation and sources believed accurate at the time of writing, and is provided for general comparison only. We do not claim expertise in, or authority over, any non-Microsoft product or service; the respective vendor's official documentation is the authoritative source for their offerings, which may change over time. Nothing here is intended to disparage any vendor — where a competing product has genuine advantages, we aim to note them honestly. Verify all third-party details against the vendor's current official documentation before making decisions.

The principle¶

Compute travels to the data. Data does not travel to the compute. Materialize only when freshness, governance, and cost analysis justify it — not by default.

Zero-move is no longer an academic preference. Data residency regulations, sovereign cloud mandates, classified data handling rules, the cost of moving petabytes, and the freshness cost of batch pipelines have made zero-move the architectural default. This page is the opinionated playbook for executing zero-move on Azure.

When to virtualize vs materialize¶

The single most important decision in a zero-move strategy is when to break the rule. The decision table:

Signal	Virtualize	Materialize
Query latency need	Sub-minute reads OK	Sub-second reads required
Query freshness need	Real-time	Daily / hourly tolerable
Query frequency	Low-medium	Very high, repetitive
Data size	Any	Any
Compliance posture	Movement restricted	Movement permitted
Backend load	Low	High concern
Egress cost	High (cross-cloud)	Acceptable for one-time move
Schema stability	Stable	Stable
Aggregation	Light	Heavy / pre-aggregated
Joins across systems	Required	Pre-computed materialized

Default to virtualize. Reach for materialization with documented justification. Materialize the aggregated result, not the source.

The Azure zero-move toolkit¶

1. OneLake Shortcuts¶

OneLake shortcuts create read-only logical references to data in S3, GCS, ADLS Gen2, or other OneLake workspaces. The data appears in OneLake as if local but is queried in place.

Capability	Notes
Sources supported	ADLS Gen2, S3, GCS, OneLake, AWS S3-compatible, Dataverse, on-prem via SHIR
Direction	Read-only
Engines that consume	Fabric Spark, Fabric Warehouse, Fabric KQL, Power BI Direct Lake, Synapse, Databricks
Authentication	Managed identity, service principal, workspace identity
Caching	OneLake caches files for the local region on first access

Use when: cross-cloud data needs to participate in a Fabric / lakehouse query without movement.

2. APIM Façades¶

APIM in front of a non-Azure data source (REST, SQL via DAB, Dataverse Web API, on-prem mainframe with a REST adapter) gives the source the same API surface as any first-party Azure data source.

Use when: the data source has a transactional API; consumers want REST / OData; rate limiting and audit needed.

3. Synapse Serverless SQL via OPENROWSET¶

T-SQL queries directly against Parquet / Delta / CSV files in any storage account or S3-compatible store, no movement, pay per TB scanned.

SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://datasource.blob.core.windows.net/path/*.parquet',
    FORMAT='PARQUET'
) AS rows
WHERE rows.region = 'NW';

Use when: ad-hoc T-SQL access to lake data; serving layer for Power BI Composite Models; cost-bounded ad-hoc analytics.

4. Power BI DirectQuery and Composite Models¶

DirectQuery runs report visuals as queries against the source — no import. Composite Models combine DirectQuery sources with imported tables in one semantic model.

Use when: real-time dashboards on transactional sources; cross-source joins in a report without ETL.

Delta Sharing is the open protocol for sharing Delta tables across organizations and clouds. The receiver queries without copying.

Use when: cross-organization data sharing inside the Databricks ecosystem; partner / vendor data exchange.

6. Dataverse Web API¶

Dataverse data is queryable in place via the OData v4 Web API. See the Dataverse use case. No replication required.

7. Microsoft Graph API¶

M365 content (mail, files, sites, Teams, calendar) is queryable in place via the Graph API. No replication required.

8. Data API Builder (DAB)¶

DAB wraps Azure SQL, Cosmos DB, or PostgreSQL with a REST + GraphQL surface — exposing the database as an API without writing code.

Use when: a SQL backend needs to be a first-class API; combine with APIM for full gateway features.

The decision tree¶

graph TD
    Q1[New data integration request]
    Q1 --> Q2{Is the data in Azure?}
    Q2 -->|Yes, Azure-native| Q3{Existing API on the source?}
    Q2 -->|No, cross-cloud / on-prem| Q4{Can OneLake shortcut reach it?}
    Q3 -->|Yes| R1[Use the API behind APIM]
    Q3 -->|No, SQL backend| R2[Add Data API Builder behind APIM]
    Q4 -->|Yes - ADLS / S3 / GCS| R3[Create OneLake shortcut + Purview catalog entry]
    Q4 -->|No - REST or DB| Q5{Can a REST façade be exposed?}
    Q5 -->|Yes| R4[Stand up APIM façade + Purview entry]
    Q5 -->|No - legacy, batch-only| Q6{Materialization justified?}
    Q6 -->|Yes - approved per policy| R5[Set up ADF / Fabric pipeline + Purview lineage]
    Q6 -->|No| R6[Re-evaluate source modernization with owner]

The defaults in this tree push every request toward virtualization. Materialization is the explicit exception, with documented justification.

Operational guardrails¶

Zero-move without guardrails becomes zero-performance. Five rules:

Rule 1 — Cache aggressively at the gateway¶

APIM cache policies for read-heavy queries:

<cache-lookup vary-by-developer="false" vary-by-developer-groups="false">
  <vary-by-header>Authorization</vary-by-header>
  <vary-by-query-parameter>$select</vary-by-query-parameter>
  <vary-by-query-parameter>$filter</vary-by-query-parameter>
</cache-lookup>
<!-- ... backend call ... -->
<cache-store duration="300" />

Tune duration per data product. A 5-minute cache on read-heavy dashboards routinely cuts backend load 70–90%.

Rule 2 — Pre-aggregate when query frequency justifies¶

If 80% of queries against a virtualized source are the same aggregation, pre-compute it. Either:

Use a Fabric materialized view backed by OneLake
Use a Databricks Delta table refreshed on a schedule
Use a Power BI semantic model with aggregation tables

The principle is the same: materialize the aggregate, not the source.

Rule 3 — Watch backend load¶

Zero-move shifts query load to the source system. Instrument the source. If you see:

Sustained CPU > 70% on the operational store
Connection pool exhaustion
Source-side rate limit triggering

…then push the work upstream — cache more aggressively, pre-aggregate, or materialize a Delta replica refreshed on schedule.

Rule 4 — Egress is the silent cost¶

Cross-cloud virtualization can produce surprising egress bills. Mitigations:

Filter and project at the source ($select, $filter) — never SELECT * across clouds
Use OneLake shortcuts with regional caching to amortize
Compress payloads (GZIP at APIM)
For high-volume sources, materialize daily and virtualize only for fresh-window queries

Rule 5 — Lineage is mandatory¶

Every virtualized source has a Purview lineage entry. Every consumer is tracked. Every join is documented. Without lineage, the architecture rots into ungovernable spaghetti within 18 months.

Federal posture¶

Zero-move is especially valuable in regulated and sovereign environments:

Constraint	Zero-move advantage
ITAR data	Data never crosses authorized boundary
FedRAMP High reciprocity on partner products	APIM brokers without re-credentialing
Cross-agency sharing	Delta Sharing or shortcuts let partner agencies query without copying
Classified data segregation	Compute can run in the higher boundary; lower-boundary data is reached through controlled façades
Data residency by region	Data stays in region; reports federate across regions through APIM

The Microsoft architecture composes across boundaries cleanly because APIM, Entra, Purview, and OneLake are all FedRAMP High accredited and have IL5 / IL6 paths.

Anti-patterns to refuse¶

Anti-pattern	Refuse because
"Let's just copy it to ADLS to make it easier"	Easier today, more expensive forever; creates a governance liability
"We'll cache it in our app, no need for APIM cache"	Caching belongs at the platform, not the app — applies once, governed once
"We don't need lineage for this one"	The one without lineage is the one that breaks an audit
"Materialization is faster, let's just materialize everything"	True for one query; false for the estate; cost compounds linearly
"We can't virtualize because the source is too slow"	Then the source needs to be fixed or augmented — not duplicated
"Egress is fine — it's only a few TB / month"	Egress is the cost line item that surprises every steady-state customer