🧭 Decision Guide: Choosing the Right Tool¶
Last Updated: 2026-04-15 | Version: 2.0 | Status: ✅ Final | Maintainer: Documentation Team
📑 Table of Contents¶
- Overview
- Quick Decision Matrix
- Tool Comparison
- Decision Tree: Data Movement
- Decision Tree: Storage Destination
- Scenario-Based Recommendations
- Tool Selection by Persona
- Performance Considerations
- Cost Considerations
- Hybrid Patterns
- Anti-Patterns to Avoid
- Quick Reference Card
- Decision Checklist
- Related Documents
📋 Overview¶
Microsoft Fabric offers multiple tools for data movement and transformation. This guide helps you choose the right tool based on your specific requirements, data volumes, and team capabilities.
⚡ Quick Decision Matrix¶
| Scenario | Recommended Tool |
|---|---|
| Large volume bulk copy (EL) | Pipeline Copy Activity |
| Automated CDC replication | Copy Job |
| Low-code transformations | Dataflow Gen2 |
| Real-time streaming | Eventstream |
| Complex code-based transforms | Spark Notebook |
| SQL-centric analytics | Warehouse |
| Data lake analytics | Lakehouse |
🔍 Tool Comparison¶
Data Movement Tools¶
| Capability | Pipeline Copy | Copy Job | Dataflow Gen2 | Eventstream | Spark |
|---|---|---|---|---|---|
| Use Case | ETL, bulk copy | CDC, replication | Data wrangling | Event streaming | Data processing |
| Developer | Data engineer | Data integrator | Business analyst | Data engineer | Data engineer |
| Code Required | Low-code | No-code | No-code | Low-code | Code |
| Data Volume | Low to High | Low to High | Low to High | Medium to High | Low to High |
| Connectors | 50+ | 50+ | 150+ | CDC, Kafka, etc. | Spark libraries |
| Transform Complexity | Low | Low | Medium to High | Low | High |
Storage Tools¶
| Capability | Lakehouse | Warehouse | Eventhouse |
|---|---|---|---|
| Primary Use | Data lake + analytics | Data warehousing | Real-time analytics |
| Storage Format | Delta Lake | Delta Lake | KQL Database |
| Query Language | Spark SQL, T-SQL (read) | T-SQL (full) | KQL |
| Transactions | ACID per Delta table | Full ACID (multi-table) | No |
| Best For | Data engineering | Business analytics | Time-series data |
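To make the query-language row concrete, here is a minimal sketch of the Spark side in a Fabric notebook, assuming a Delta table named `sales` already exists in the attached Lakehouse (table and column names are illustrative):

```python
# Runs in a Fabric Spark notebook with a Lakehouse attached;
# `spark` is the session the notebook environment provides.

# Spark SQL over the Lakehouse Delta table (full read/write from Spark).
daily = spark.sql("""
    SELECT order_date, SUM(amount) AS total_sales
    FROM sales
    GROUP BY order_date
""")
daily.show()

# The same table is also exposed read-only through the SQL analytics
# endpoint for T-SQL clients; T-SQL writes require a Warehouse.
```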
🔀 Decision Tree: Data Movement¶
```mermaid
flowchart TD
    A[Data Movement Need] --> B{Real-time?}
    B -->|Yes| C[Eventstream]
    B -->|No| D{Data Volume?}
    D -->|Small < 1GB| E{Transform Complexity?}
    D -->|Medium 1-100GB| F{CDC Required?}
    D -->|Large > 100GB| G[Pipeline Copy Activity]
    E -->|Low| H[Dataflow Gen2]
    E -->|High| I{Code Preferred?}
    I -->|Yes| J[Spark Notebook]
    I -->|No| K[Dataflow with Staging]
    F -->|Yes| L[Copy Job]
    F -->|No| M{Transform Needed?}
    M -->|Yes| N{In-flight Transform?}
    M -->|No| O[Pipeline Copy Activity]
    N -->|Yes| P[Dataflow Gen2]
    N -->|No| Q[Pipeline + Spark]
    G --> R{Parallel Load?}
    R -->|Yes| S[Pipeline with ForEach]
    R -->|No| T[Single Copy Activity]
```

🗄️ Decision Tree: Storage Destination¶
```mermaid
flowchart TD
    A[Storage Decision] --> B{Primary Workload?}
    B -->|Data Engineering| C{Team Skill?}
    B -->|Business Analytics| D{Query Pattern?}
    B -->|Real-time| E[Eventhouse]
    C -->|Spark/Python| F[Lakehouse]
    C -->|SQL| G{Transaction Needs?}
    G -->|Full ACID| H[Warehouse]
    G -->|Read-heavy| I[Lakehouse SQL Endpoint]
    D -->|Complex SQL| J[Warehouse]
    D -->|Mixed| K[Both + Shortcuts]
    D -->|BI Direct Query| L{Performance Priority?}
    L -->|Speed| M[Warehouse]
    L -->|Flexibility| N[Lakehouse + Direct Lake]
```

💡 Scenario-Based Recommendations¶
Scenario 1: Enterprise Data Warehouse Migration¶
Context: Migrating from on-premises Oracle/SQL Server to Fabric
Recommendation:
| Component | Tool | Reason |
|---|---|---|
| Extraction | Pipeline Copy | High volume, parallel copy |
| Landing | Lakehouse | Raw data flexibility |
| Transform | Spark Notebooks | Complex business logic |
| Serving | Warehouse | T-SQL analytics, BI |
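A minimal sketch of the Transform step, assuming the Copy activity has already landed a `bronze_orders` table in the Lakehouse (all table, column, and rule names are illustrative):

```python
# Illustrative medallion-style transform in a Fabric Spark notebook.
# The source table is assumed to have been landed in the Lakehouse
# by a Pipeline Copy activity; names below are placeholders.
from pyspark.sql import functions as F

bronze = spark.read.table("bronze_orders")

silver = (
    bronze
    .dropDuplicates(["order_id"])                 # de-dupe on business key
    .withColumn("order_date", F.to_date("order_ts"))
    .filter(F.col("amount") > 0)                  # example business rule
)

# Write back to the Lakehouse as Delta; the Warehouse (or SQL endpoint)
# then serves the curated result to BI tools.
silver.write.mode("overwrite").saveAsTable("silver_orders")
```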
Scenario 2: Real-time Dashboard¶
Context: Casino floor monitoring with live updates
Recommendation:
| Component | Tool | Reason |
|---|---|---|
| Ingestion | Eventstream | Sub-second latency |
| Analytics | Eventhouse | KQL for time-series |
| Archive | Lakehouse | Long-term storage |
| Visualization | Real-time Dashboard | Live updates |
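Eventstream itself is configured in the Fabric UI, not in code. As a hedged illustration of the archive leg only, the sketch below expresses it as Spark Structured Streaming against a Kafka-compatible endpoint; endpoint, topic, and table names are placeholders, authentication options are omitted, and it assumes the Kafka connector is available in the Spark runtime:

```python
# Sketch: stream events into a Lakehouse Delta table for long-term storage.
stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093")
    .option("subscribe", "floor-telemetry")       # placeholder topic
    .option("startingOffsets", "latest")
    .load()                                       # (auth options omitted)
)

(
    stream.selectExpr("CAST(value AS STRING) AS event_json", "timestamp")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "Files/checkpoints/floor-telemetry")
    .toTable("floor_telemetry_archive")
)
```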
Scenario 3: Self-Service Analytics¶
Context: Business analysts need to prepare and analyze data
Recommendation:
| Component | Tool | Reason |
|---|---|---|
| Preparation | Dataflow Gen2 | Low-code Power Query |
| Storage | Lakehouse | Flexible schema |
| Modeling | Semantic Model | Business definitions |
| Reporting | Power BI | Self-service BI |
Scenario 4: CDC Replication¶
Context: Keep Fabric in sync with operational SQL Server
Recommendation:
| Component | Tool | Reason |
|---|---|---|
| Replication | Copy Job | Built-in CDC support |
| Destination | Lakehouse or Warehouse | Based on query needs |
Scenario 5: Complex Data Science Workload¶
Context: ML model training with feature engineering
Recommendation:
Multiple Sources → Lakehouse (Bronze) → Spark (Feature Engineering) → Lakehouse (Features) → ML Model
| Component | Tool | Reason |
|---|---|---|
| Raw Data | Lakehouse | Schema flexibility |
| Processing | Spark Notebooks | Python ML libraries |
| Features | Lakehouse | Delta Lake versioning |
| Training | Spark MLlib or Notebooks | Distributed compute |
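A minimal sketch of the Features → Training hand-off with Spark MLlib, assuming a `player_features` Delta table produced by the feature-engineering notebook (table, column, and path names are illustrative):

```python
# Train a simple classifier on Lakehouse feature data with Spark MLlib.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

features = spark.read.table("player_features")

# Assemble the illustrative feature columns into a single vector.
assembler = VectorAssembler(
    inputCols=["avg_bet", "session_count", "days_active"],
    outputCol="features",
)
lr = LogisticRegression(labelCol="churned", featuresCol="features")

model = Pipeline(stages=[assembler, lr]).fit(features)

# Persist the trained model to the Lakehouse Files area.
model.write().overwrite().save("Files/models/churn_lr")
```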
👥 Tool Selection by Persona¶
| Persona | Primary Tools | Secondary Tools |
|---|---|---|
| Data Engineer | Pipeline, Spark, Lakehouse | Dataflow, Warehouse |
| Data Analyst | Dataflow Gen2, Warehouse | Lakehouse SQL Endpoint |
| BI Developer | Semantic Model, Power BI | Warehouse, Lakehouse |
| Data Scientist | Spark Notebooks, Lakehouse | Eventhouse |
| Integration Specialist | Pipeline, Copy Job | Dataflow, Eventstream |
📈 Performance Considerations¶
By Data Volume¶
| Volume | Recommended Approach |
|---|---|
| < 1 GB | Dataflow Gen2 (simple), any tool works |
| 1-10 GB | Pipeline Copy or Dataflow with Fast Copy |
| 10-100 GB | Pipeline with parallel copy |
| 100 GB - 1 TB | Pipeline with logical partitioning |
| > 1 TB | Pipeline + Spark for transforms |
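Pipeline Copy's parallelism is configured in the activity settings rather than in code. For the rows where Spark handles the work, the sketch below shows the same logical-partitioning idea expressed as a partitioned JDBC read; connection details and column bounds are placeholders:

```python
# Partitioned extract: Spark issues 32 concurrent range queries
# instead of one serial scan. Choose a numeric, indexed column.
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://<server>;databaseName=<db>")
    .option("dbtable", "dbo.Orders")
    .option("partitionColumn", "OrderId")
    .option("lowerBound", "1")
    .option("upperBound", "100000000")
    .option("numPartitions", "32")
    .option("user", "<user>")
    .option("password", "<password>")
    .load()
)
```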
By Latency Requirement¶
| Latency | Recommended Tool |
|---|---|
| Sub-second | Eventstream |
| Minutes | Copy Job (CDC mode) |
| Hourly | Pipeline (scheduled) |
| Daily | Pipeline or Dataflow |
💰 Cost Considerations¶
Compute Cost Ranking (Lower = Less Expensive)¶
1. Dataflow Gen2 - Efficient for small to medium volumes
2. Pipeline Copy - Optimized for bulk transfer
3. Copy Job - Similar cost profile to Pipeline Copy
4. Spark - Higher cost, more powerful
5. Eventstream - Cost scales with throughput
Storage Cost¶
All Fabric storage uses OneLake with the same pricing:
- Lakehouse Tables (Delta)
- Warehouse Tables (Delta)
- Eventhouse (KQL Database)
🔗 Hybrid Patterns¶
Pipeline + Dataflow¶
Use when: Need bulk copy performance AND Power Query transformations
Pipeline + Spark¶
Use when: Need parallel copy AND complex transformations
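A sketch of the notebook half of this pattern, assuming the pipeline's Copy activity stages raw CSV files under the Lakehouse Files area and then invokes the notebook (paths, parameter, and table names are illustrative):

```python
# Notebook side of the Pipeline + Spark pattern: the Copy activity handles
# fast parallel landing; this notebook handles the complex transform.
load_date = "2026-04-15"  # typically passed in as a pipeline parameter

raw = spark.read.option("header", "true").csv(
    f"Files/staging/orders/{load_date}/"
)

cleaned = (
    raw.dropna(subset=["order_id"])
    .dropDuplicates(["order_id"])
)

cleaned.write.mode("append").saveAsTable("orders_curated")
```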
Eventstream + Lakehouse¶
Use when: Need real-time AND historical analytics
🚫 Anti-Patterns to Avoid¶
| Anti-Pattern | Problem | Better Approach |
|---|---|---|
| Spark for simple copy | Overhead, complexity | Pipeline Copy Activity |
| Dataflow for 500GB | Performance | Pipeline with partitioning |
| Pipeline for real-time | Not designed for streaming | Eventstream |
| Warehouse for raw data | Unnecessary structure | Lakehouse Bronze layer |
| Lakehouse for all SQL | Missing T-SQL features | Use Warehouse for analytics |
📇 Quick Reference Card¶
When to Use Each Tool¶
Pipeline Copy Activity
- ✅ Bulk data movement
- ✅ High volume (10GB+)
- ✅ Scheduled batch loads
- ❌ Real-time streaming
- ❌ Complex transformations

Copy Job
- ✅ CDC/incremental sync
- ✅ Automated replication
- ✅ Simple setup
- ❌ Complex transformations
- ❌ Real-time (near-real-time only)

Dataflow Gen2
- ✅ Business user data prep
- ✅ 150+ connectors
- ✅ Power Query interface
- ❌ Very large volumes (100GB+)
- ❌ Real-time

Eventstream
- ✅ Real-time ingestion
- ✅ Event processing
- ✅ IoT/streaming scenarios
- ❌ Batch processing
- ❌ Historical bulk loads

Spark Notebooks
- ✅ Complex transformations
- ✅ ML/Data Science
- ✅ Code flexibility
- ❌ Simple copy operations
- ❌ Business user self-service
✅ Decision Checklist¶
Before choosing a tool, answer these questions:
- Volume: How much data? (GB/TB)
- Latency: Real-time or batch?
- Complexity: Simple copy or complex transforms?
- Team: Data engineers or business analysts?
- Destination: Lakehouse, Warehouse, or both?
- Pattern: Full load, incremental, or CDC?
- Sources: How many connectors needed?
- Code: Code-first or no-code preferred?
🔗 Related Documents¶
| Document | Description |
|---|---|
| Pipelines & Data Movement | Pipeline optimization patterns |
| Dataflows Gen2 | Dataflow performance optimization |
| Spark & Notebooks | Spark best practices |
| Lakehouse Setup | Lakehouse architecture and configuration |
| Warehouse Setup | Warehouse schema design and optimization |
| Real-Time Intelligence | Eventstream and Eventhouse features |