🔄 ETL/ELT Comparison Guide — Choosing the Right Data Movement Method

Side-by-Side Comparison of All 5 Fabric Data Ingestion Methods

Last Updated: 2026-04-21 | Version: 1.0.0


🎯 Overview

Microsoft Fabric provides five distinct methods for moving data into OneLake. Each has different trade-offs in complexity, transform capability, latency, cost, and skill requirements. This guide compares all five using the same casino slot telemetry dataset, helping you choose the right method for each workload.

The Five Methods

flowchart LR
    subgraph Methods["📊 Data Movement Methods"]
        CA["1️⃣ Copy Activity<br/>(Data Pipeline)"]
        DF["2️⃣ Dataflow Gen2<br/>(Power Query)"]
        NB["3️⃣ Spark Notebook<br/>(PySpark)"]
        CJ["4️⃣ Copy Job<br/>(Managed CDC)"]
        MR["5️⃣ Mirroring<br/>(Real-time Replication)"]
    end

    CA --- DF --- NB --- CJ --- MR

    style CA fill:#3498DB,stroke:#2471A3,color:#fff
    style DF fill:#2E86C1,stroke:#1B4F72,color:#fff
    style NB fill:#E67E22,stroke:#CA6F1E,color:#fff
    style CJ fill:#6C3483,stroke:#4A235A,color:#fff
    style MR fill:#E74C3C,stroke:#C0392B,color:#fff

📊 Method-by-Method Comparison

Feature Matrix

| Feature | Copy Activity | Dataflow Gen2 | Spark Notebook | Copy Job | Mirroring |
|---------|---------------|---------------|----------------|----------|-----------|
| Skill Level | Low | Low-Medium | High | Low | Low |
| Transform Capability | None (mapping only) | Medium (Power Query M) | Full (arbitrary code) | Minimal (type cast) | None |
| Connectors | 100+ | 300+ | JDBC/ODBC + custom | 10+ | 5 databases |
| Incremental Support | Manual watermark | Built-in incremental refresh | Manual checkpoint | Automatic watermark | Automatic CDC |
| Latency | Minutes (scheduled) | Minutes (scheduled) | Minutes (scheduled) | Minutes (scheduled) | Seconds (continuous) |
| Max Volume | Very large | Moderate (staging helps) | Very large | Large | Full database |
| Output Format | Delta, Parquet, CSV | Delta (to Lakehouse/WH) | Any OneLake format | Delta only | Delta (read-only) |
| Upsert/MERGE | ❌ | ❌ | ✅ Native | ✅ Built-in | N/A (mirror) |
| Pipeline Integration | ✅ Native | ✅ Activity | ✅ Activity | ❌ Standalone | ❌ Standalone |
| Git Support | ✅ | Limited | ✅ | ❌ | ❌ |
| Unit Testing | ❌ | ❌ | ✅ pytest | ❌ | N/A |
| Schema Evolution | Manual | Manual | Programmatic (see sketch below) | Manual | Automatic |
| Error Handling | Retry + failure path | Limited | Full try/except | Retry | Automatic |
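
The "Programmatic" schema evolution entry for Spark refers to behavior you opt into in code rather than a platform default. A minimal sketch of what that looks like with Delta's mergeSchema write option; the sample DataFrame and its promo_code column are hypothetical:

# A new column (promo_code) that the target table does not yet have.
df = spark.createDataFrame(
    [("M-001", "2026-01-01T00:00:00", "BONUS")],
    ["machine_id", "event_timestamp", "promo_code"]
)

# Opt-in schema evolution: the unknown column is added to the Delta table
# on write instead of failing the job. Use deliberately; silent schema
# drift in Bronze can surprise downstream consumers.
(df.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable("lh_bronze.bronze_slot_telemetry")
)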

Complexity Spectrum

Simple ◄────────────────────────────────────────────────► Complex

 Copy Job    Mirroring    Copy Activity    Dataflow Gen2    Spark Notebook
    │            │              │                │                │
    ▼            ▼              ▼                ▼                ▼
 No-code      No-code       Low-code        Low-code         Full code
 wizard       wizard        pipeline        Power Query      PySpark/SQL

🔧 Same Dataset — Five Ways

The following examples all load the same casino slot telemetry table (dbo.SlotEvents from SQL Server) into lh_bronze.bronze_slot_telemetry in the Bronze Lakehouse.
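
All five examples target the same shape. For reference, this is the bronze schema implied by the column mappings below; the decimal precision is an assumption, and the two underscore-prefixed audit columns appear only in the methods that can add them:

from pyspark.sql.types import (StructType, StructField, StringType,
                               TimestampType, DecimalType)

# Bronze target schema implied by the column mappings in the five examples.
# DecimalType(18, 2) for the coin columns is an assumption (currency-style).
bronze_slot_schema = StructType([
    StructField("machine_id", StringType()),
    StructField("casino_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_timestamp", TimestampType()),
    StructField("coin_in", DecimalType(18, 2)),
    StructField("coin_out", DecimalType(18, 2)),
    StructField("_ingestion_timestamp", TimestampType()),  # audit (where supported)
    StructField("_source_system", StringType()),           # audit (where supported)
])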

Method 1: Copy Activity (Data Pipeline)

{
    "name": "CopySlotTelemetry",
    "type": "Copy",
    "inputs": [{
        "referenceName": "SqlServerSlotEvents",
        "type": "DatasetReference"
    }],
    "outputs": [{
        "referenceName": "LakehouseBronzeSlots",
        "type": "DatasetReference"
    }],
    "typeProperties": {
        "source": {
            "type": "SqlServerSource",
            "sqlReaderQuery": "SELECT * FROM dbo.SlotEvents WHERE event_timestamp > '@{pipeline().parameters.LastLoadTime}'"
        },
        "sink": {
            "type": "LakehouseTableSink",
            "tableActionOption": "Append"
        },
        "translator": {
            "type": "TabularTranslator",
            "mappings": [
                {"source": {"name": "MACHINE_ID"}, "sink": {"name": "machine_id"}},
                {"source": {"name": "CASINO_ID"}, "sink": {"name": "casino_id"}},
                {"source": {"name": "EVENT_TYPE"}, "sink": {"name": "event_type"}},
                {"source": {"name": "EVENT_TS"}, "sink": {"name": "event_timestamp"}},
                {"source": {"name": "COIN_IN"}, "sink": {"name": "coin_in"}},
                {"source": {"name": "COIN_OUT"}, "sink": {"name": "coin_out"}}
            ]
        }
    }
}

Pros: Pipeline-native, parallelizable, easy column mapping
Cons: No transforms beyond rename/cast; manual watermark management (sketched below)
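
To make "manual watermark management" concrete, here is a minimal PySpark sketch of the bookkeeping the pipeline does not do for you: read the last high-water mark before the copy, then advance it after a successful run. The etl_watermarks control table and its columns are hypothetical; in a real pipeline this logic typically lives in a Lookup activity plus a follow-up Script or Stored Procedure activity.

from pyspark.sql import functions as F

WATERMARK_TABLE = "lh_bronze.etl_watermarks"  # hypothetical control table

# 1. Read the last successful high-water mark for this source table.
#    The pipeline would pass this value as its LastLoadTime parameter.
last_load_time = (
    spark.table(WATERMARK_TABLE)
    .filter(F.col("source_table") == "dbo.SlotEvents")
    .agg(F.max("watermark_value"))
    .collect()[0][0]
)

# 2. After a successful copy, advance the watermark to the newest event loaded.
new_watermark = (
    spark.table("lh_bronze.bronze_slot_telemetry")
    .agg(F.max("event_timestamp"))
    .collect()[0][0]
)

spark.sql(f"""
    UPDATE {WATERMARK_TABLE}
    SET watermark_value = '{new_watermark}'
    WHERE source_table = 'dbo.SlotEvents'
""")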

Method 2: Dataflow Gen2

let
    Source = Sql.Database("casino-sql.database.windows.net", "SlotManagement"),
    SlotEvents = Source{[Schema="dbo", Item="SlotEvents"]}[Data],
    Renamed = Table.RenameColumns(SlotEvents, {
        {"MACHINE_ID", "machine_id"}, {"CASINO_ID", "casino_id"},
        {"EVENT_TYPE", "event_type"}, {"EVENT_TS", "event_timestamp"},
        {"COIN_IN", "coin_in"}, {"COIN_OUT", "coin_out"}
    }),
    Typed = Table.TransformColumnTypes(Renamed, {
        {"machine_id", type text}, {"casino_id", type text},
        {"event_type", type text}, {"event_timestamp", type datetimezone},
        {"coin_in", Currency.Type}, {"coin_out", Currency.Type}
    }),
    WithMeta = Table.AddColumn(Typed, "_ingestion_timestamp",
        each DateTimeZone.UtcNow(), type datetimezone)
in
    WithMeta

Data Destination: lh_bronze.bronze_slot_telemetry → Append

Pros: Query folding (transforms push down to the source where possible), 300+ connectors, familiar Power Query syntax
Cons: Memory limits (staging needed for large data); no MERGE support

Method 3: Spark Notebook (PySpark)

# Bronze: Slot Telemetry Ingestion (Fabric Spark notebook)

TARGET_TABLE = "lh_bronze.bronze_slot_telemetry"

# Read from SQL Server
jdbc_url = "jdbc:sqlserver://casino-sql.database.windows.net;database=SlotManagement"
df_raw = (spark.read
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.SlotEvents")
    .option("user", mssparkutils.credentials.getSecret("kv-casino", "sql-username"))
    .option("password", mssparkutils.credentials.getSecret("kv-casino", "sql-password"))
    .load()
)

# Transform
from pyspark.sql.functions import col, current_timestamp, lit

df_bronze = (df_raw
    .withColumnRenamed("MACHINE_ID", "machine_id")
    .withColumnRenamed("CASINO_ID", "casino_id")
    .withColumnRenamed("EVENT_TYPE", "event_type")
    .withColumnRenamed("EVENT_TS", "event_timestamp")
    .withColumnRenamed("COIN_IN", "coin_in")
    .withColumnRenamed("COIN_OUT", "coin_out")
    .withColumn("_ingestion_timestamp", current_timestamp())
    .withColumn("_source_system", lit("SlotManagement_OLTP"))
)

# Cache so the JDBC source is read only once for both the write and the count
df_bronze.cache()

# Write to Lakehouse Delta table
df_bronze.write.format("delta").mode("append").saveAsTable(TARGET_TABLE)

print(f"Loaded {df_bronze.count()} rows to {TARGET_TABLE}")

Pros: Full code control, Delta MERGE, unit testable, schema enforcement
Cons: Spark session cold start (30-60s); requires PySpark skills
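
To make the Delta MERGE advantage concrete, the append in the notebook above could be replaced with an idempotent upsert, so re-running the load does not duplicate events. A minimal sketch, assuming machine_id plus event_timestamp uniquely identify an event (adjust the match key to your actual grain):

from delta.tables import DeltaTable

target = DeltaTable.forName(spark, TARGET_TABLE)

(target.alias("t")
    .merge(
        df_bronze.alias("s"),
        "t.machine_id = s.machine_id AND t.event_timestamp = s.event_timestamp"
    )
    .whenMatchedUpdateAll()      # refresh late-arriving corrections
    .whenNotMatchedInsertAll()   # insert new events
    .execute()
)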

Method 4: Copy Job

Copy Job: casino-slot-ingestion
  Source: SQL Server (casino-sql.database.windows.net / SlotManagement)
  Tables:
    - dbo.SlotEvents → bronze_slot_telemetry
      Incremental column: EVENT_TS (timestamp)
      Write mode: Append
  Schedule: Every 15 minutes

Pros: Automatic watermark, zero code, built-in monitoring
Cons: Minimal transforms, limited connectors, Delta output only

Method 5: Mirroring

Mirrored Database: casino-slot-mirror
  Source: SQL Server (casino-sql.database.windows.net / SlotManagement)
  Replicated tables: dbo.SlotEvents, dbo.Players, dbo.Machines
  Mode: Continuous CDC (change data capture)
  Output: Read-only Delta tables in OneLake

Pros: Real-time (seconds latency), zero code, full database replication
Cons: Read-only output, limited to 5 database types, no transforms
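
Because mirrored tables land in OneLake as Delta, downstream workloads read them like any other table. A minimal sketch, assuming the mirrored Players table has been shortcut into lh_bronze under the hypothetical name slot_mirror_players (shortcut setup not shown):

# Read a mirrored table through a OneLake shortcut (names are hypothetical).
# The mirror is read-only: any transforms belong in a downstream Silver notebook.
df_players = spark.read.table("lh_bronze.slot_mirror_players")
print(f"Mirrored player rows available: {df_players.count()}")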


💰 Cost and Performance Comparison

Benchmark: 100K Rows Daily Load

| Metric | Copy Activity | Dataflow Gen2 | Spark Notebook | Copy Job | Mirroring |
|--------|---------------|---------------|----------------|----------|-----------|
| Setup Time | 30 min | 20 min | 1-2 hours | 10 min | 5 min |
| Execution Time | ~2 min | ~3 min | ~4 min (incl. cold start) | ~2 min | Continuous |
| CU Cost/Run | ~0.5% F64 | ~1% F64 | ~1.5% F64 | ~0.5% F64 | ~0.3% F64 (continuous) |
| Daily CU (hourly schedule) | ~12% F64 | ~24% F64 | ~36% F64 | ~12% F64 | ~7% F64 |
| Maintainability | Medium | Medium | High | Low | Low |
| Testability | Low | Low | High (pytest; see sketch below) | Low | N/A |
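
The "High (pytest)" testability rating assumes the notebook's transforms are factored into plain functions that a test suite can import. A minimal sketch with hypothetical module and fixture names (the spark fixture would come from a conftest.py):

# transforms.py (hypothetical module holding the notebook's transform logic)
from pyspark.sql import DataFrame
from pyspark.sql.functions import current_timestamp, lit

def to_bronze(df: DataFrame) -> DataFrame:
    """Rename source columns and stamp ingestion metadata."""
    return (df
        .withColumnRenamed("MACHINE_ID", "machine_id")
        .withColumnRenamed("EVENT_TS", "event_timestamp")
        .withColumn("_ingestion_timestamp", current_timestamp())
        .withColumn("_source_system", lit("SlotManagement_OLTP"))
    )

# test_transforms.py (run with pytest; spark fixture provided by conftest.py)
def test_to_bronze_renames_and_stamps(spark):
    df = spark.createDataFrame([("M-001", "2026-01-01")], ["MACHINE_ID", "EVENT_TS"])
    out = to_bronze(df)
    assert "machine_id" in out.columns
    assert "_source_system" in out.columns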

CU Consumption by Volume

| Daily Volume | Copy Activity | Dataflow Gen2 | Spark Notebook | Copy Job |
|--------------|---------------|---------------|----------------|----------|
| 10K rows | < 0.1% | < 0.2% | ~0.5% | < 0.1% |
| 100K rows | ~0.5% | ~1% | ~1.5% | ~0.5% |
| 1M rows | ~2% | ~4% (staging) | ~3% | ~2% |
| 10M rows | ~8% | ~15% (staging) | ~6% | ~8% |
| 100M rows | ~30% | Not recommended | ~20% | ~30% |

💡 Note: Spark Notebooks become more cost-efficient at very large volumes because Spark's distributed processing amortizes the cold-start overhead. Below 1M rows, Copy Activity and Copy Job are more efficient.
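
A rough back-of-the-envelope model of that crossover. All constants are invented for illustration, chosen only to reproduce the shape of the table above; they are not measured Fabric pricing:

# Illustrative cost model: fixed startup overhead + per-row marginal cost.
# Constants are made up for illustration; they are not Fabric pricing.
SPARK_FIXED = 0.5      # cold start + session overhead (arbitrary CU units)
SPARK_PER_ROW = 1e-8   # distributed processing: low per-row cost
COPY_FIXED = 0.05      # lightweight activity startup
COPY_PER_ROW = 1e-7    # single-node copy: higher per-row cost

for rows in [10_000, 100_000, 1_000_000, 10_000_000, 100_000_000]:
    spark_cost = SPARK_FIXED + SPARK_PER_ROW * rows
    copy_cost = COPY_FIXED + COPY_PER_ROW * rows
    winner = "Spark" if spark_cost < copy_cost else "Copy"
    print(f"{rows:>11,} rows -> Spark {spark_cost:.2f}, Copy {copy_cost:.2f} ({winner})")

With these numbers the crossover lands near 5M rows per run, consistent with the volume table above: fixed overhead dominates at small volumes, per-row cost dominates at large ones.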


🧭 Decision Matrix

Quick Decision Flowchart

flowchart TD
    START["🔄 Need to load data"] --> Q1{"Need real-time<br/>(sub-minute)?"}
    Q1 -->|Yes| Q1a{"Full database<br/>replication?"}
    Q1a -->|Yes| MR["5️⃣ Mirroring"]
    Q1a -->|No| ES["Use Eventstream<br/>(not covered here)"]
    Q1 -->|No| Q2{"Need transforms?"}
    Q2 -->|No| Q3{"Need automatic<br/>incremental?"}
    Q3 -->|Yes| CJ["4️⃣ Copy Job"]
    Q3 -->|No| CA["1️⃣ Copy Activity"]
    Q2 -->|Yes| Q4{"Complex logic?<br/>(MERGE, ML, joins)"}
    Q4 -->|Yes| NB["3️⃣ Spark Notebook"]
    Q4 -->|No| DF["2️⃣ Dataflow Gen2"]

    style MR fill:#E74C3C,stroke:#C0392B,color:#fff
    style CJ fill:#6C3483,stroke:#4A235A,color:#fff
    style CA fill:#3498DB,stroke:#2471A3,color:#fff
    style DF fill:#2E86C1,stroke:#1B4F72,color:#fff
    style NB fill:#E67E22,stroke:#CA6F1E,color:#fff

Scenario-Based Recommendations

| Scenario | Recommended Method | Why |
|----------|--------------------|-----|
| Full table snapshot daily | Copy Activity | Simple, fast, pipeline-native |
| Incremental load every 15 min | Copy Job | Automatic watermark, zero code |
| Analyst-led data prep | Dataflow Gen2 | Power Query familiarity |
| Complex ETL with business rules | Spark Notebook | Full code control, MERGE support |
| Real-time database sync | Mirroring | Continuous CDC, no code |
| Bronze raw ingestion | Copy Activity or Copy Job | No transforms needed |
| Silver cleansing + dedup | Spark Notebook | Delta MERGE for dedup/SCD2 |
| Gold aggregations | Spark Notebook or Dataflow Gen2 | Depends on complexity |
| REST API ingestion | Dataflow Gen2 | Native REST connector + transforms |
| Mainframe / EBCDIC | Spark Notebook | Custom parsing required |
| Power BI semantic model refresh | Dataflow Gen2 | Familiar tooling for BI teams |

🔀 Migration Paths

Upgrading Between Methods

Copy Activity ─────► Copy Job
  When: You need automatic incremental without pipeline complexity

Copy Activity ─────► Spark Notebook
  When: You need transforms, MERGE, or schema enforcement

Dataflow Gen2 ─────► Spark Notebook
  When: Volume exceeds staging limits or you need MERGE/SCD2

Copy Job ──────────► Mirroring
  When: You need real-time and source is a supported database

Spark Notebook ────► Dataflow Gen2
  When: Simplifying for analyst self-service (if transforms are simple enough)

Common Migration Scenarios

| From | To | Trigger | Effort |
|------|----|---------|--------|
| Copy Activity (ADF) | Copy Activity (Fabric Pipeline) | ADF → Fabric migration | Low (same engine) |
| SSIS packages | Spark Notebook | SQL Server migration | High (rewrite) |
| SSIS packages | Dataflow Gen2 | SQL Server migration | Medium (visual rebuild) |
| Power BI Dataflows | Dataflow Gen2 | Power BI → Fabric migration | Low (same M language) |
| Azure Databricks notebooks | Spark Notebooks | Databricks → Fabric migration | Medium (runtime differences) |
| Informatica / Talend | Dataflow Gen2 or Notebook | Legacy ETL modernization | High (rebuild) |

🎰 Casino Recommendations

| Layer | Workload | Method | Rationale |
|-------|----------|--------|-----------|
| Bronze | Slot telemetry (SQL Server) | Copy Job | Auto-incremental, 15-min cadence |
| Bronze | Player CRM (SQL Server) | Mirroring | Real-time player data for floor ops |
| Bronze | Compliance filings (REST API) | Dataflow Gen2 | REST connector + light transforms |
| Bronze | Historical load (one-time) | Copy Activity | Bulk copy, no transforms needed |
| Silver | Slot cleansing + dedup | Spark Notebook | MERGE, DQ scoring, derived KPIs |
| Silver | Player cleansing | Spark Notebook | PII masking, SCD Type 2 |
| Gold | Daily revenue KPIs | Spark Notebook | Complex aggregation + window functions |
| Gold | Player dimensions | Dataflow Gen2 | Simple reshape for BI team self-service |
| Gold | Compliance screening | Spark Notebook | CTR/SAR logic requires code |

Mixed-Method Pipeline

flowchart LR
    subgraph Bronze["🟤 Bronze"]
        CJ1["Copy Job<br/>Slot Events"]
        MR1["Mirroring<br/>Player CRM"]
        DF1["Dataflow Gen2<br/>Compliance API"]
    end

    subgraph Silver["⚪ Silver"]
        NB1["Notebook<br/>Slot Cleansing"]
        NB2["Notebook<br/>Player Cleansing"]
    end

    subgraph Gold["🟡 Gold"]
        NB3["Notebook<br/>Revenue KPIs"]
        DF2["Dataflow Gen2<br/>Player Dim"]
    end

    CJ1 --> NB1 --> NB3
    MR1 --> NB2 --> DF2
    DF1 --> NB1

    style CJ1 fill:#6C3483,stroke:#4A235A,color:#fff
    style MR1 fill:#E74C3C,stroke:#C0392B,color:#fff
    style DF1 fill:#2E86C1,stroke:#1B4F72,color:#fff
    style NB1 fill:#E67E22,stroke:#CA6F1E,color:#fff
    style NB2 fill:#E67E22,stroke:#CA6F1E,color:#fff
    style NB3 fill:#E67E22,stroke:#CA6F1E,color:#fff
    style DF2 fill:#2E86C1,stroke:#1B4F72,color:#fff

🏛️ Federal Agency Recommendations

| Agency | Bronze Method | Silver Method | Rationale |
|--------|---------------|---------------|-----------|
| USDA | Dataflow Gen2 (REST API) | Spark Notebook | API source + complex commodity transforms |
| SBA | Copy Job (Azure SQL) | Spark Notebook | Incremental loan data + fraud scoring |
| NOAA | Copy Job (REST → staging) | Spark Notebook | High-frequency weather data + anomaly detection |
| EPA | Dataflow Gen2 (REST API) | Spark Notebook | API source + AQI calculations |
| DOI | Dataflow Gen2 (REST API) | Spark Notebook | API source + geospatial processing |

📚 References

| Resource | URL |
|----------|-----|
| Data Factory Overview | https://learn.microsoft.com/fabric/data-factory/data-factory-overview |
| Copy Activity | https://learn.microsoft.com/fabric/data-factory/copy-data-activity |
| Dataflow Gen2 | https://learn.microsoft.com/fabric/data-factory/create-first-dataflow-gen2 |
| Spark Notebooks | https://learn.microsoft.com/fabric/data-engineering/how-to-use-notebook |
| Copy Job | https://learn.microsoft.com/fabric/data-factory/copy-job-overview |
| Mirroring | https://learn.microsoft.com/fabric/database/mirrored-database/overview |


📝 Document Metadata

- Author: Documentation Team
- Reviewers: Data Engineering, Data Factory, Analytics, Platform
- Classification: Internal
- Next Review: 2026-07-21