
ETL vs ELT vs Streaming

Select the optimal data movement strategy for your Fabric workload

Last Updated: 2026-05-05 | Version: 1.0.0


TL;DR

Use ELT as the default pattern in Fabric -- land raw data in OneLake first, then transform with Spark or T-SQL at scale. Switch to ETL when data must be cleansed or filtered before landing (PII stripping, schema normalization from legacy sources). Use Streaming when business decisions depend on sub-minute data freshness (real-time dashboards, fraud detection, IoT telemetry).


When This Question Comes Up

  • Designing the initial data ingestion architecture for a new Fabric workspace
  • Migrating SSIS, ADF, or Informatica pipelines to Fabric
  • A stakeholder demands "real-time" dashboards and you need to assess whether streaming is justified
  • Data sources include a mix of batch databases, file drops, APIs, and event streams
  • Compliance requires PII to be stripped before data lands in the lakehouse

Decision Flowchart

flowchart TD
    START([New data source to ingest]) --> LATENCY{Freshness requirement?}

    LATENCY -->|Real-time<br/>< 1 minute| STREAM_CHECK{Event-driven source?<br/>Kafka, Event Hubs,<br/>IoT Hub, CDC stream}
    LATENCY -->|Near-real-time<br/>1-15 minutes| MICRO{Can source emit<br/>micro-batches?}
    LATENCY -->|Batch<br/>> 15 minutes| TRANSFORM_LOC{Where should<br/>transforms run?}

    STREAM_CHECK -->|Yes| STREAMING[Streaming<br/>Eventstream + Eventhouse]
    STREAM_CHECK -->|No - must poll| MICRO

    MICRO -->|Yes| STREAMING
    MICRO -->|No - batch only| TRANSFORM_LOC

    TRANSFORM_LOC --> PII{Must filter PII<br/>before landing?}

    PII -->|Yes - compliance<br/>requires pre-filter| ETL[ETL<br/>Transform before load]
    PII -->|No - land raw,<br/>transform in place| VOLUME{Data volume<br/>per load?}

    VOLUME -->|< 1 GB| DATAFLOW{Complex transforms<br/>or simple mapping?}
    VOLUME -->|> 1 GB| ELT[ELT<br/>Land then transform<br/>with Spark / T-SQL]

    DATAFLOW -->|Simple mapping,<br/>low-code preferred| DATAFLOW_RESULT[ELT via<br/>Dataflow Gen2]
    DATAFLOW -->|Complex joins,<br/>business logic| ELT

    ETL --> ETL_TOOL{Preferred tooling?}
    ETL_TOOL -->|Low-code| ETL_DF[ETL via<br/>Dataflow Gen2]
    ETL_TOOL -->|Code-first| ETL_NB[ETL via<br/>Spark Notebook<br/>in Pipeline]

    style STREAMING fill:#FF5722,color:#fff
    style ELT fill:#4CAF50,color:#fff
    style DATAFLOW_RESULT fill:#4CAF50,color:#fff
    style ETL_DF fill:#2196F3,color:#fff
    style ETL_NB fill:#2196F3,color:#fff

ELT (Extract-Load-Transform)

When

  • Default pattern for most Fabric workloads
  • Data can be landed raw in Bronze without compliance concerns
  • Transforms benefit from Spark or Warehouse compute at scale
  • The workload follows a medallion architecture (Bronze raw, Silver cleansed, Gold aggregated)

Why

  • Decouples ingestion from transformation, enabling independent scaling
  • Raw data preserved in Bronze for replay, audit, and schema evolution
  • The Spark engine handles large-scale transforms more cost-effectively than pre-load processing
  • Simplifies pipeline orchestration -- Copy Activity lands data, notebooks transform (see the sketch below)
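
A minimal sketch of the transform half of this pattern, assuming a pipeline Copy Activity has already landed raw data as a Delta table in a Bronze lakehouse. The lakehouse, table, and column names here are hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # In a Fabric notebook the `spark` session already exists; creating one
    # explicitly keeps this sketch self-contained.
    spark = SparkSession.builder.getOrCreate()

    # Hypothetical Bronze table landed by Copy Activity (raw, unvalidated).
    bronze = spark.read.table("bronze_lakehouse.sales_orders_raw")

    # Cleanse and conform: drop duplicates, standardize types, and keep only
    # rows with a valid order key.
    silver = (
        bronze.dropDuplicates(["order_id"])
        .withColumn("order_date", F.to_date("order_date"))
        .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
        .filter(F.col("order_id").isNotNull())
    )

    # Full overwrite for simplicity; an incremental MERGE keyed on order_id
    # is the more typical production pattern.
    silver.write.mode("overwrite").saveAsTable("silver_lakehouse.sales_orders")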

Tradeoffs

Dimension | Assessment
Cost | Lower ingestion cost (simple copy); transform cost depends on Spark / SQL CU usage
Latency | Batch-oriented; minimum ~5-15 minute cycle with scheduled pipelines
Compliance | Raw PII lands in Bronze -- requires access controls and retention policies
Skill match | Requires PySpark or T-SQL for the transform layer; Copy Activity is low-code

Anti-patterns

  • Landing PII or regulated data in Bronze without encryption and RBAC (use ETL if pre-filtering is mandatory)
  • Skipping Bronze and transforming directly into Silver (loses raw audit trail)
  • Using Dataflow Gen2 for >1 GB loads when Copy Activity + Spark notebook is faster
  • Running transforms in the pipeline orchestration layer instead of delegating to Spark

ETL (Extract-Transform-Load)

When

  • Compliance mandates PII stripping, masking, or filtering before data lands in OneLake
  • Source schema is messy and requires normalization before landing (legacy mainframe, flat files)
  • Data volume is small enough that pre-load transforms don't create bottlenecks
  • Dataflow Gen2 low-code experience is preferred for citizen integrators

Why

  • Data arrives clean -- the Bronze layer already meets governance requirements (a code-first sketch follows this list)
  • Reduces storage cost by filtering irrelevant records before loading
  • Dataflow Gen2 provides Power Query-style visual transforms for non-developers
  • Useful for edge cases where Spark transforms would be overkill
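
As a code-first illustration of the ETL path (a Spark notebook invoked from a pipeline), the sketch below masks and drops PII before anything is written to OneLake. The staging path, column names, and target table are hypothetical placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical external staging container; nothing has been persisted to
    # OneLake at this point.
    raw = (
        spark.read.option("header", True)
        .csv("abfss://staging@<storage-account>.dfs.core.windows.net/customers/")
    )

    # Transform before load: drop direct identifiers and hash the email so the
    # record landing in Bronze is already pseudonymized.
    cleaned = (
        raw.drop("ssn", "date_of_birth")
        .withColumn("email_hash", F.sha2(F.col("email"), 256))
        .drop("email")
    )

    # Only the filtered, masked result is loaded into the lakehouse.
    cleaned.write.mode("append").saveAsTable("bronze_lakehouse.customers")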

Tradeoffs

Dimension | Assessment
Cost | Transform runs on ingestion compute; can be expensive for large datasets
Latency | Adds processing time before data is available in OneLake
Compliance | Strongest pre-landing compliance posture; PII never touches raw storage
Skill match | Dataflow Gen2 is accessible to Power Query users; Spark ETL needs developers

Anti-patterns

  • Using ETL for multi-terabyte loads (transform bottleneck; use ELT and transform in Spark)
  • Building complex business logic in Dataflow Gen2 when a Spark notebook is more maintainable
  • Losing raw data auditability by filtering too aggressively before landing
  • Chaining 10+ Dataflow transforms when a single Spark notebook would be clearer

Streaming

When

  • Business requires sub-minute data freshness (fraud detection, live dashboards, IoT monitoring)
  • Source natively emits events (Kafka, Azure Event Hubs, IoT Hub, database CDC)
  • Use case involves pattern detection, alerting, or real-time aggregation
  • Real-Time Intelligence (Eventhouse + KQL) is the query layer

Why

  • Eventstream provides managed, no-code stream ingestion into Eventhouse
  • KQL enables sub-second analytical queries on streaming data
  • Data Activator triggers alerts and actions based on real-time conditions
  • Supports both hot path (Eventhouse) and warm path (Lakehouse Delta) simultaneously (see the warm-path sketch below)
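
Eventstream covers both paths without code. Purely as an illustration of the warm path, the sketch below uses Spark Structured Streaming to read from an Event Hubs Kafka endpoint and append raw events to a Lakehouse Delta table; the namespace, topic, connection string, and table names are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Placeholder Event Hubs Kafka endpoint and credentials.
    kafka_options = {
        "kafka.bootstrap.servers": "<namespace>.servicebus.windows.net:9093",
        "subscribe": "device-telemetry",
        "kafka.security.protocol": "SASL_SSL",
        "kafka.sasl.mechanism": "PLAIN",
        "kafka.sasl.jaas.config": (
            'org.apache.kafka.common.security.plain.PlainLoginModule required '
            'username="$ConnectionString" password="<connection-string>";'
        ),
    }

    # Read the event stream and keep the raw payload plus event timestamp.
    events = (
        spark.readStream.format("kafka")
        .options(**kafka_options)
        .load()
        .selectExpr("CAST(value AS STRING) AS body", "timestamp")
    )

    # Append to a Delta table (warm path); Eventhouse remains the hot path
    # for sub-second KQL queries.
    (
        events.writeStream.format("delta")
        .option("checkpointLocation", "Files/checkpoints/device_telemetry")
        .outputMode("append")
        .toTable("lakehouse.device_telemetry_raw")
    )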

Tradeoffs

Dimension | Assessment
Cost | Always-on compute for stream processing; higher baseline cost than batch
Latency | Sub-second to seconds; best freshness available in Fabric
Compliance | Streaming data is harder to mask in-flight; may need downstream PII handling
Skill match | Requires KQL for Eventhouse queries; Eventstream is low-code for ingestion

Anti-patterns

  • Using streaming for data that is truly batch (daily file drops don't need Eventstream)
  • Skipping the warm-path write to Lakehouse and querying Eventhouse for historical analytics
  • Not setting retention policies on Eventhouse -- unbounded storage growth
  • Building complex transforms in Eventstream when a Spark notebook post-ingestion is cleaner

Quick Comparison

Dimension | ELT | ETL | Streaming
Freshness | 5-60 min (batch) | 5-60 min (batch) | Seconds
Complexity | Medium | Medium-High | High
Best Volume | Any (Spark scales) | < 1 GB per load | Continuous events
Transform Engine | Spark / T-SQL | Dataflow Gen2 / Spark | Eventstream / KQL
Raw Preservation | Yes (Bronze) | Partial (pre-filtered) | Yes (Eventhouse + Lakehouse)
Cost Profile | Pay per job | Pay per job | Always-on