🧭 Decision Guide: Choosing the Right Tool

Last Updated: 2026-04-15 | Version: 2.0 | Status: ✅ Final | Maintainer: Documentation Team

📋 Overview

Microsoft Fabric offers multiple tools for data movement and transformation. This guide helps you choose the right tool based on your specific requirements, data volumes, and team capabilities.


⚡ Quick Decision Matrix

| Scenario | Recommended Tool |
|---|---|
| Large volume bulk copy (EL) | Pipeline Copy Activity |
| Automated CDC replication | Copy Job |
| Low-code transformations | Dataflow Gen2 |
| Real-time streaming | Eventstream |
| Complex code-based transforms | Spark Notebook |
| SQL-centric analytics | Warehouse |
| Data lake analytics | Lakehouse |

🔍 Tool Comparison

Data Movement Tools

| Capability | Pipeline Copy | Copy Job | Dataflow Gen2 | Eventstream | Spark |
|---|---|---|---|---|---|
| Use Case | ETL, bulk copy | CDC, replication | Data wrangling | Event streaming | Data processing |
| Typical Developer | Data engineer | Data integrator | Business analyst | Data engineer | Data engineer |
| Code Required | Low-code | No-code | No-code | Low-code | Code |
| Data Volume | Low to High | Low to High | Low to Medium | Medium to High | Low to High |
| Connectors | 50+ | 50+ | 150+ | CDC, Kafka, etc. | Spark libraries |
| Transform Complexity | Low | Low | Medium to High | Low | High |

Storage Tools

| Capability | Lakehouse | Warehouse | Eventhouse |
|---|---|---|---|
| Primary Use | Data lake + analytics | Data warehousing | Real-time analytics |
| Storage Format | Delta Lake | Delta Lake | KQL Database |
| Query Language | Spark SQL, T-SQL (read) | T-SQL (full) | KQL |
| Transactions | Spark-level | Full ACID | No |
| Best For | Data engineering | Business analytics | Time-series data |
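
To ground the query-language row, here is a minimal PySpark sketch, assuming a Fabric notebook attached to a Lakehouse; the `orders` table and its columns are made up. It writes a Delta table and queries it with Spark SQL. Once saved, the same table is also readable through the Lakehouse SQL analytics endpoint in T-SQL (read-only), which is what "T-SQL (read)" means above.

```python
# Minimal sketch: write a Delta table to a Lakehouse, then query it with Spark SQL.
# Assumes a Fabric notebook with a default Lakehouse attached; names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Fabric notebooks provide `spark` already

orders = spark.createDataFrame(
    [(1, "2026-04-01", 120.50), (2, "2026-04-02", 89.99)],
    ["order_id", "order_date", "amount"],
)

# Saved as a managed Delta table; it also becomes visible (read-only)
# via the Lakehouse SQL analytics endpoint for T-SQL users.
orders.write.format("delta").mode("overwrite").saveAsTable("orders")

spark.sql(
    "SELECT order_date, SUM(amount) AS total FROM orders GROUP BY order_date"
).show()
```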

🔀 Decision Tree: Data Movement

```mermaid
flowchart TD
    A[Data Movement Need] --> B{Real-time?}
    B -->|Yes| C[Eventstream]
    B -->|No| D{Data Volume?}

    D -->|Small < 1GB| E{Transform Complexity?}
    D -->|Medium 1-100GB| F{CDC Required?}
    D -->|Large > 100GB| G[Pipeline Copy Activity]

    E -->|Low| H[Dataflow Gen2]
    E -->|High| I{Code Preferred?}

    I -->|Yes| J[Spark Notebook]
    I -->|No| K[Dataflow with Staging]

    F -->|Yes| L[Copy Job]
    F -->|No| M{Transform Needed?}

    M -->|Yes| N{In-flight Transform?}
    M -->|No| O[Pipeline Copy Activity]

    N -->|Yes| P[Dataflow Gen2]
    N -->|No| Q[Pipeline + Spark]

    G --> R{Parallel Load?}
    R -->|Yes| S[Pipeline with ForEach]
    R -->|No| T[Single Copy Activity]
```

🗄️ Decision Tree: Storage Destination

```mermaid
flowchart TD
    A[Storage Decision] --> B{Primary Workload?}

    B -->|Data Engineering| C{Team Skill?}
    B -->|Business Analytics| D{Query Pattern?}
    B -->|Real-time| E[Eventhouse]

    C -->|Spark/Python| F[Lakehouse]
    C -->|SQL| G{Transaction Needs?}

    G -->|Full ACID| H[Warehouse]
    G -->|Read-heavy| I[Lakehouse SQL Endpoint]

    D -->|Complex SQL| J[Warehouse]
    D -->|Mixed| K[Both + Shortcuts]
    D -->|BI Direct Query| L{Performance Priority?}

    L -->|Speed| M[Warehouse]
    L -->|Flexibility| N[Lakehouse + Direct Lake]
```

💡 Scenario-Based Recommendations

Scenario 1: Enterprise Data Warehouse Migration

Context: Migrating from on-premises Oracle/SQL Server to Fabric

Recommendation:

Source → Pipeline Copy → Lakehouse (Bronze) → Spark (Transform) → Warehouse (Gold)

| Component | Tool | Reason |
|---|---|---|
| Extraction | Pipeline Copy | High volume, parallel copy |
| Landing | Lakehouse | Raw data flexibility |
| Transform | Spark Notebooks | Complex business logic |
| Serving | Warehouse | T-SQL analytics, BI |
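
For intuition, a minimal sketch of the Spark transform step in this flow, assuming the Pipeline Copy activity has already landed a `bronze_sales` Delta table; all table and column names are illustrative:

```python
# Hedged sketch of the Bronze → Gold transform step; names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read raw data landed by the Pipeline Copy activity into the Bronze Lakehouse.
bronze = spark.read.table("bronze_sales")

# Apply business logic: de-duplicate, type-cast, aggregate.
gold = (
    bronze.dropDuplicates(["sale_id"])
    .withColumn("sale_date", F.to_date("sale_date"))
    .groupBy("sale_date", "region")
    .agg(F.sum("amount").alias("total_amount"))
)

# Write the curated result; a final step would load it into the Warehouse.
gold.write.format("delta").mode("overwrite").saveAsTable("gold_sales_daily")
```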

Scenario 2: Real-time Dashboard

Context: Casino floor monitoring with live updates

Recommendation:

IoT Devices → Eventstream ─┬→ Eventhouse → Real-time Dashboard
                           └→ Lakehouse (archive)

| Component | Tool | Reason |
|---|---|---|
| Ingestion | Eventstream | Sub-second latency |
| Analytics | Eventhouse | KQL for time-series |
| Archive | Lakehouse | Long-term storage |
| Visualization | Real-time Dashboard | Live updates |
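
The ingestion step can be fed from application code: an Eventstream custom endpoint source exposes an Event Hubs-compatible connection string, so a producer built on the `azure-eventhub` Python package can publish to it. A minimal sketch, with a placeholder connection string and made-up event fields:

```python
# Illustrative producer for the ingestion step. The connection string below is a
# placeholder for the Eventstream custom endpoint's Event Hubs-compatible string
# (assumed to include the EntityPath); the event payload fields are made up.
import json
import time

from azure.eventhub import EventData, EventHubProducerClient

CONN_STR = "<eventstream-custom-endpoint-connection-string>"  # placeholder
producer = EventHubProducerClient.from_connection_string(CONN_STR)

batch = producer.create_batch()
batch.add(EventData(json.dumps({
    "device_id": "floor-cam-07",  # hypothetical IoT device
    "metric": "occupancy",
    "value": 42,
    "ts": time.time(),
})))
producer.send_batch(batch)
producer.close()
```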

Scenario 3: Self-Service Analytics

Context: Business analysts need to prepare and analyze data

Recommendation:

Various Sources → Dataflow Gen2 → Lakehouse → Semantic Model → Power BI

| Component | Tool | Reason |
|---|---|---|
| Preparation | Dataflow Gen2 | Low-code Power Query |
| Storage | Lakehouse | Flexible schema |
| Modeling | Semantic Model | Business definitions |
| Reporting | Power BI | Self-service BI |

Scenario 4: CDC Replication

Context: Keep Fabric in sync with operational SQL Server

Recommendation:

SQL Server (CDC) → Copy Job → Lakehouse/Warehouse

| Component | Tool | Reason |
|---|---|---|
| Replication | Copy Job | Built-in CDC support |
| Destination | Lakehouse or Warehouse | Based on query needs |
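
Copy Job performs the change capture and apply automatically. For intuition only, the sketch below shows roughly what the apply step amounts to if done by hand with a Delta Lake MERGE in Spark; the table names, join key, and `operation` flag are hypothetical:

```python
# What a CDC apply amounts to, sketched manually with a Delta Lake MERGE.
# Copy Job does this for you; all names here are hypothetical.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

changes = spark.read.table("staged_cdc_changes")      # captured change rows
target = DeltaTable.forName(spark, "dim_customer")    # replicated target table

(
    target.alias("t")
    .merge(changes.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedDelete(condition="s.operation = 'delete'")  # apply deletes
    .whenMatchedUpdateAll()                                 # apply updates
    .whenNotMatchedInsertAll()                              # apply inserts
    .execute()
)
```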

Scenario 5: Complex Data Science Workload

Context: ML model training with feature engineering

Recommendation:

Multiple Sources → Lakehouse (Bronze) → Spark (Feature Engineering) → Lakehouse (Features) → ML Model

| Component | Tool | Reason |
|---|---|---|
| Raw Data | Lakehouse | Schema flexibility |
| Processing | Spark Notebooks | Python ML libraries |
| Features | Lakehouse | Delta Lake versioning |
| Training | Spark MLlib or Notebooks | Distributed compute |
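
A minimal sketch of the feature-engineering and training steps with PySpark and Spark MLlib; the source table, feature columns, and `label` column are illustrative:

```python
# Hedged sketch: engineer features into the Lakehouse, then train a model.
# Table and column names ("bronze_events", the inputCols, "label") are illustrative.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import StandardScaler, VectorAssembler

spark = SparkSession.builder.getOrCreate()

raw = spark.read.table("bronze_events")  # Bronze layer, landed earlier

# Assemble numeric columns into a single feature vector.
assembler = VectorAssembler(
    inputCols=["session_count", "avg_spend", "days_active"],
    outputCol="features_raw",
)
features = assembler.transform(raw)

# Persist engineered features back to the Lakehouse (Delta provides versioning).
features.write.format("delta").mode("overwrite").saveAsTable("ml_features")

# Scale and train on the engineered features (assumes a "label" column exists).
model = Pipeline(stages=[
    StandardScaler(inputCol="features_raw", outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
]).fit(features)
```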

👥 Tool Selection by Persona

| Persona | Primary Tools | Secondary Tools |
|---|---|---|
| Data Engineer | Pipeline, Spark, Lakehouse | Dataflow, Warehouse |
| Data Analyst | Dataflow Gen2, Warehouse | Lakehouse SQL Endpoint |
| BI Developer | Semantic Model, Power BI | Warehouse, Lakehouse |
| Data Scientist | Spark Notebooks, Lakehouse | Eventhouse |
| Integration Specialist | Pipeline, Copy Job | Dataflow, Eventstream |

📈 Performance Considerations

By Data Volume

| Volume | Recommended Approach |
|---|---|
| < 1 GB | Dataflow Gen2 (simple); any tool works |
| 1-10 GB | Pipeline Copy or Dataflow with Fast Copy |
| 10-100 GB | Pipeline with parallel copy |
| 100 GB - 1 TB | Pipeline with logical partitioning |
| > 1 TB | Pipeline + Spark for transforms (see the sketch below) |
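
For the largest tier, a hedged sketch of the Spark side: writing transformed output as a Delta table partitioned on a load-date column so downstream reads and incremental loads can prune partitions. Table and column names are illustrative:

```python
# Hedged sketch of a partitioned Delta write for very large transforms;
# "bronze_transactions" and "load_date" are illustrative names.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

big = spark.read.table("bronze_transactions")

(
    big.repartition("load_date")      # co-locate rows per partition value
    .write.format("delta")
    .partitionBy("load_date")         # physical partitioning enables pruning
    .mode("overwrite")
    .saveAsTable("silver_transactions")
)
```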

By Latency Requirement

| Latency | Recommended Tool |
|---|---|
| Sub-second | Eventstream |
| Minutes | Copy Job (CDC mode) |
| Hourly | Pipeline (scheduled) |
| Daily | Pipeline or Dataflow |

💰 Cost Considerations

Compute Cost Ranking (Lower = Less Expensive)

  1. Dataflow Gen2 - Efficient for small-medium volumes
  2. Pipeline Copy - Optimized for bulk transfer
  3. Copy Job - Similar to Pipeline
  4. Spark - Higher cost, more powerful
  5. Eventstream - Cost based on throughput

Storage Cost

All Fabric storage uses OneLake with the same pricing:

- Lakehouse Tables (Delta)
- Warehouse Tables (Delta)
- Eventhouse (KQL Database)


🔗 Hybrid Patterns

Pipeline + Dataflow

Large Source → Pipeline Copy → Staging Lakehouse → Dataflow (cleanse) → Final Lakehouse

Use when: Need bulk copy performance AND Power Query transformations

Pipeline + Spark

Source → Pipeline Copy → Bronze Lakehouse → Spark Notebook → Silver/Gold Lakehouse

Use when: Need parallel copy AND complex transformations

Eventstream + Lakehouse

Real-time Events → Eventstream → Eventhouse (hot) → Lakehouse (cold)

Use when: Need real-time AND historical analytics


🚫 Anti-Patterns to Avoid

| Anti-Pattern | Problem | Better Approach |
|---|---|---|
| Spark for simple copy | Overhead, complexity | Pipeline Copy Activity |
| Dataflow for 500 GB loads | Poor performance at that scale | Pipeline with partitioning |
| Pipeline for real-time | Not designed for streaming | Eventstream |
| Warehouse for raw data | Unnecessary structure | Lakehouse Bronze layer |
| Lakehouse for all SQL | Missing T-SQL features | Use Warehouse for analytics |

📇 Quick Reference Card

When to Use Each Tool

Pipeline Copy Activity
- ✅ Bulk data movement
- ✅ High volume (10GB+)
- ✅ Scheduled batch loads
- ❌ Real-time streaming
- ❌ Complex transformations

Copy Job
- ✅ CDC/incremental sync
- ✅ Automated replication
- ✅ Simple setup
- ❌ Complex transformations
- ❌ Real-time (near-real-time only)

Dataflow Gen2
- ✅ Business user data prep
- ✅ 150+ connectors
- ✅ Power Query interface
- ❌ Very large volumes (100GB+)
- ❌ Real-time

Eventstream
- ✅ Real-time ingestion
- ✅ Event processing
- ✅ IoT/streaming scenarios
- ❌ Batch processing
- ❌ Historical bulk loads

Spark Notebooks
- ✅ Complex transformations
- ✅ ML/Data Science
- ✅ Code flexibility
- ❌ Simple copy operations
- ❌ Business user self-service


✅ Decision Checklist

Before choosing a tool, answer these questions:

  1. Volume: How much data? (GB/TB)
  2. Latency: Real-time or batch?
  3. Complexity: Simple copy or complex transforms?
  4. Team: Data engineers or business analysts?
  5. Destination: Lakehouse, Warehouse, or both?
  6. Pattern: Full load, incremental, or CDC?
  7. Sources: How many connectors needed?
  8. Code: Code-first or no-code preferred?

📚 Related Documents

| Document | Description |
|---|---|
| Pipelines & Data Movement | Pipeline optimization patterns |
| Dataflows Gen2 | Dataflow performance optimization |
| Spark & Notebooks | Spark best practices |
| Lakehouse Setup | Lakehouse architecture and configuration |
| Warehouse Setup | Warehouse schema design and optimization |
| Real-Time Intelligence | Eventstream and Eventhouse features |
