Decision Trees: Choosing the Right Fabric Component¶
Last Updated: 2026-04-27 | Version: 1.0 | Status: Active | Maintainer: Architecture Team
This document provides structured decision flowcharts for the most common "which Fabric component should I use?" questions. Each section contains a Mermaid flowchart, a comparison table, a "Choose X when..." summary, and a real-world example drawn from the POC.
Table of Contents¶
- 1. Lakehouse vs Warehouse vs SQL Database
- 2. Real-Time Intelligence vs Batch Processing
- 3. Direct Lake vs Import vs DirectQuery
- 4. Notebooks vs Spark Job Definitions vs Pipelines
- 5. Mirroring vs Shortcuts vs Pipeline Copy
- 6. Eventhouse vs Lakehouse for Time-Series
- 7. Variable Libraries vs Pipeline Parameters vs Spark Config
- 8. When to Add a New Vertical / Domain
- 9. Capacity Sizing Decision
- 10. Data Quality Strategy Decision
- Cross-Reference: Related Docs
1. Lakehouse vs Warehouse vs SQL Database¶
When to use which analytical store in Fabric.
```mermaid
flowchart TD
A[Need to store and query data?] --> B{Data format?}
B -->|Structured only<br/>relational schema| C{Need full T-SQL DML?<br/>INSERT/UPDATE/DELETE}
B -->|Semi-structured<br/>JSON, Parquet, CSV| D[Lakehouse]
B -->|Mixed structured<br/>+ unstructured| D
C -->|Yes, full CRUD| E{Operational or<br/>analytical workload?}
C -->|No, read-heavy<br/>analytics only| F{Team prefers<br/>T-SQL or PySpark?}
E -->|Operational / OLTP| G[SQL Database]
E -->|Analytical / OLAP| F
F -->|T-SQL, stored procs,<br/>CTAS patterns| H[Warehouse]
F -->|PySpark, Delta Lake,<br/>notebook-first| D
D --> I{Need Direct Lake<br/>for Power BI?}
I -->|Yes| J[Lakehouse<br/>with V-Order optimization]
I -->|No| K[Lakehouse<br/>standard Delta tables]
```

Comparison Table¶
| Capability | Lakehouse | Warehouse | SQL Database |
|---|---|---|---|
| Engine | Spark + SQL endpoint | T-SQL (Synapse DW) | Full SQL Server engine |
| DML Support | Delta Lake via Spark | T-SQL INSERT/UPDATE/DELETE | Full CRUD + stored procs |
| Query Language | PySpark, Spark SQL, SQL endpoint (read-only) | T-SQL | T-SQL |
| Schema | Schema-on-read or schema-on-write | Schema-on-write | Schema-on-write |
| File Formats | Parquet, CSV, JSON, Delta, Avro | Delta (managed) | Row-store + columnstore |
| Direct Lake | Native support | Via shortcuts | Via mirroring |
| Best For | Data engineering, ML, unstructured data | BI-centric analytics, T-SQL teams | Operational apps, OLTP, CDC |
| Concurrency | Medium | High | Very High |
| Change Tracking | Delta change data feed | N/A | Native CDC, change tracking |
| Cost Profile | CU consumption on Spark | CU consumption on SQL | CU consumption on SQL |
Choose X When...¶
- Choose Lakehouse when your team works primarily in PySpark/Python, you need to process semi-structured data, or you want native Direct Lake connectivity for Power BI. This is the default choice for medallion architecture implementations.
- Choose Warehouse when your team has strong T-SQL skills, you need stored procedures and CTAS patterns, or you are migrating from Synapse Dedicated SQL Pool or traditional data warehouses.
- Choose SQL Database when you need a fully transactional store with OLTP capabilities, your application requires INSERT/UPDATE/DELETE from external apps, or you need built-in change tracking for downstream CDC.
POC Example¶
This POC uses Lakehouse as the primary store for all medallion layers (lh_bronze, lh_silver, lh_gold). The choice was driven by the need to process diverse data formats (Parquet from data generators, JSON from streaming sources, CSV from federal open data downloads) and the PySpark-first notebook workflow. Direct Lake mode connects Gold tables directly to Power BI dashboards without data duplication.
See: Lakehouse/Warehouse/SQL DB Decision Guide | Decision Guide
2. Real-Time Intelligence vs Batch Processing¶
When to stream vs. when to batch.
```mermaid
flowchart TD
A[Data arrives from source] --> B{Latency requirement?}
B -->|Sub-second to<br/>seconds| C[Real-Time Intelligence<br/>Eventstream + Eventhouse]
B -->|Minutes to hours| D{Data volume<br/>per batch?}
B -->|Daily or slower| E[Batch Pipeline<br/>Spark notebooks]
D -->|< 100 MB| F[Micro-batch<br/>Spark Structured Streaming]
D -->|100 MB - 10 GB| G{Transformation<br/>complexity?}
D -->|> 10 GB| E
G -->|Simple filter/project| F
G -->|Complex joins,<br/>aggregations, ML| E
C --> H{Query pattern?}
H -->|Time-series, KQL,<br/>anomaly detection| I[Eventhouse<br/>KQL Database]
H -->|Joins with<br/>dimensional data| J[Eventhouse + Lakehouse<br/>shortcut federation]
```

Comparison Table¶
| Factor | Real-Time Intelligence | Micro-Batch | Full Batch |
|---|---|---|---|
| Latency | Sub-second to seconds | Seconds to minutes | Minutes to hours |
| Ingestion | Eventstream, Event Hub | Spark Structured Streaming | Spark batch read |
| Query | KQL | Spark SQL, PySpark | Spark SQL, PySpark, T-SQL |
| Retention | Hot cache (days) + cold (years) | Delta Lake | Delta Lake |
| Cost | Higher per-event | Medium | Lowest per-GB |
| Complexity | Moderate (KQL, update policies) | Low-Medium | Low |
| Anomaly Detection | Built-in (series_decompose_anomalies) | Custom ML model | Custom ML model |
| Windowed Aggregations | Native (bin, make-series) | Spark window functions | Spark window functions |
Choose X When...¶
- Choose Real-Time Intelligence when you need sub-second alerting (slot machine malfunctions, security breaches), time-series analytics with KQL, or built-in anomaly detection. Cost is higher but justified when immediate action is required.
- Choose Micro-Batch when near-real-time (seconds to minutes) is sufficient, you want to land data directly into the Lakehouse Delta tables, and your transformations are simple enough for Structured Streaming.
- Choose Full Batch when daily or hourly freshness is acceptable, data volumes are large (>10 GB per run), or complex multi-table joins and ML model scoring are required.
POC Example¶
The POC runs both paths for slot telemetry. The 01_realtime_slot_streaming.py notebook feeds Eventstream data into the Eventhouse for sub-second casino floor monitoring (KQL queries in 02_kql_casino_floor.kql). Simultaneously, batch notebooks (01_bronze_slot_telemetry.py through 01_gold_slot_performance.py) process daily aggregations for historical trend analysis and Power BI dashboards.
See: Real-Time Intelligence | Streaming Notebooks
3. Direct Lake vs Import vs DirectQuery¶
How should Power BI connect to your data?
```mermaid
flowchart TD
A[Power BI semantic model<br/>needs data connection] --> B{Data in<br/>OneLake Delta tables?}
B -->|No| C{Can you move<br/>data to OneLake?}
B -->|Yes| D{Dataset size?}
C -->|Yes, via mirroring<br/>or shortcuts| B
C -->|No, external<br/>source only| E[DirectQuery]
D -->|< 10 GB<br/>simple schema| F{Refresh frequency<br/>acceptable?}
D -->|10 GB - 200 GB| G[Direct Lake]
D -->|> 200 GB| H{Complex DAX<br/>calculations?}
F -->|Hourly/daily OK| I[Import Mode]
F -->|Need real-time| G
H -->|Yes, many<br/>calculated tables| I
H -->|No, standard<br/>measures| G
```

Comparison Table¶
| Factor | Direct Lake | Import | DirectQuery |
|---|---|---|---|
| Data Freshness | Near-real-time (as Delta updates) | Stale (scheduled refresh) | Real-time (every query) |
| Query Speed | Fast (in-memory from Delta) | Fastest (fully cached) | Slowest (source query each time) |
| Data Duplication | None (reads OneLake directly) | Full copy in model | None |
| Max Dataset Size | Up to capacity memory | Up to capacity memory | Source-limited |
| Refresh Required | No (auto-sync on Delta change) | Yes (scheduled) | No |
| DAX Support | Most functions (fallback to DQ for unsupported) | Full DAX | Full DAX |
| V-Order Required | Recommended for performance | N/A | N/A |
| Composite Model | Supported | Supported | Supported |
| Best For | Fabric-native analytics | Small datasets, complex DAX | External sources, real-time |
Choose X When...¶
- Choose Direct Lake when your Gold tables are in OneLake Delta format, you want zero-copy freshness without scheduled refreshes, and your DAX patterns avoid calculated tables. This is the recommended default for Fabric-native projects.
- Choose Import when the dataset is small (<10 GB), you need the absolute fastest query performance, or you rely heavily on calculated tables and complex M transformations that Direct Lake does not support.
- Choose DirectQuery when data must stay in an external source (Azure SQL, Dataverse, etc.), you cannot mirror or shortcut the data into OneLake, or strict real-time accuracy is required at every query execution.
POC Example¶
All Gold layer tables in this POC use Direct Lake mode. The 01_gold_slot_performance.py through 19_gold_doj_analytics.py notebooks write V-Order-optimized Delta tables to lh_gold, which Power BI reads directly without scheduled refreshes. The casino executive dashboard updates automatically as new slot telemetry flows through the medallion pipeline.
See: Direct Lake | Power BI Best Practices
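As a sketch of how a Gold notebook might enable V-Order before writing a Direct Lake-ready table — a notebook fragment, not the POC's actual code: `spark` is the session object provided by the Fabric notebook runtime, `df` and the table name are placeholders, and the property name should be verified against your Fabric Spark runtime's documentation.

```python
# Assumed Fabric Spark session property: V-Order-optimize Parquet files
# written by this session so Direct Lake reads them efficiently.
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")

# Write a Gold table as V-Order-optimized Delta (illustrative names).
(df.write
   .format("delta")
   .mode("overwrite")
   .saveAsTable("lh_gold.gold_slot_performance"))
```

Because V-Order is applied at write time, re-enabling it on an existing table requires rewriting (or OPTIMIZE-ing) the underlying files.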
4. Notebooks vs Spark Job Definitions vs Pipelines¶
How should you orchestrate data processing?
```mermaid
flowchart TD
A[Need to run<br/>data processing] --> B{Interactive<br/>development needed?}
B -->|Yes, exploring data,<br/>building logic| C[Notebook]
B -->|No, production<br/>scheduled job| D{Single script or<br/>multi-step workflow?}
D -->|Single .py/.jar<br/>no UI needed| E[Spark Job Definition]
D -->|Multi-step with<br/>dependencies| F{Need conditional logic,<br/>loops, or retry?}
F -->|Yes| G[Pipeline<br/>orchestrating notebooks]
F -->|No, just sequential| H{Need parameterization<br/>across environments?}
H -->|Yes, dev/staging/prod| G
H -->|No, fixed config| I[Pipeline with<br/>notebook activities]
C --> J{Moving to<br/>production?}
J -->|Yes| K[Convert to Spark Job Def<br/>or wrap in Pipeline]
J -->|No, ad-hoc| C
```

Comparison Table¶
| Factor | Notebook | Spark Job Definition | Pipeline |
|---|---|---|---|
| Interaction | Interactive cells, visualization | Headless execution | Visual orchestration |
| Language | PySpark, Scala, R, SQL | Python, Scala, Java (.py/.jar) | Low-code + activity types |
| Scheduling | Basic (notebook scheduler) | Cron-style | Full orchestration (triggers, dependencies) |
| Parameterization | mssparkutils.notebook.getArgument | Command-line args | Pipeline parameters + variables |
| Error Handling | try/except in cells | Application-level | Retry policies, failure paths |
| Version Control | Git integration | Git integration | Git integration (JSON definition) |
| Cost | Session startup overhead | Session startup overhead | Orchestration CU + activity CU |
| Best For | Development, exploration, ad-hoc | Production batch jobs | Multi-step workflows, ETL orchestration |
Choose X When...¶
- Choose Notebook for development, data exploration, building and testing transformation logic, and ad-hoc analysis. Notebooks are the starting point for all data engineering work.
- Choose Spark Job Definition when you have a stable, tested Python/Scala script that needs to run headlessly on a schedule without interactive overhead. Good for single-purpose production jobs.
- Choose Pipeline when you need multi-step orchestration with dependencies (Bronze then Silver then Gold), conditional execution paths, retry policies, parameterized environments, or integration with non-Spark activities (Copy, Dataflow, stored procedures).
POC Example¶
This POC uses notebooks for development and pipelines for orchestration. Each medallion layer has individual notebooks (e.g., 01_bronze_slot_telemetry.py) developed interactively, then orchestrated by a Fabric pipeline that runs Bronze notebooks in parallel, waits for completion, then triggers Silver notebooks sequentially, followed by Gold. Pipeline parameters control batch_date and source_path across environments.
See: Spark Environments & Job Definitions | Pipeline Best Practices | fabric-cicd Deployment
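The stage ordering the pipeline encodes (Bronze in parallel, then Silver, then Gold) can be sketched as a small runner. This is an illustration of the orchestration shape only — the stage and notebook names below are placeholders, not the POC's pipeline definition, and `run_notebook` stands in for whatever actually executes a notebook activity.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stage layout mirroring the POC pipeline: each inner list is
# one stage; notebooks within a stage may run concurrently.
STAGES = [
    ["bronze_slots", "bronze_players", "bronze_transactions"],
    ["silver_slots_cleansed"],
    ["gold_slot_performance"],
]

def run_pipeline(run_notebook):
    """Run each stage to completion before the next starts; notebooks
    within a stage run in parallel (the pipeline's parallel Bronze step)."""
    for stage in STAGES:
        with ThreadPoolExecutor(max_workers=len(stage)) as pool:
            # list() forces all futures to finish before the next stage begins
            list(pool.map(run_notebook, stage))
```

In Fabric itself this dependency graph lives in the pipeline's activity wiring (on-success paths between notebook activities) rather than in code.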
5. Mirroring vs Shortcuts vs Pipeline Copy¶
How should you bring external data into Fabric?
```mermaid
flowchart TD
A[External data source<br/>needs to reach Fabric] --> B{Source type?}
B -->|Azure SQL, Cosmos DB,<br/>Snowflake, supported DB| C{Need continuous<br/>near-real-time sync?}
B -->|ADLS Gen2, S3,<br/>GCS, Dataverse| D{Need to copy<br/>or just reference?}
B -->|On-prem SQL Server,<br/>Oracle, DB2| E[Pipeline Copy Activity<br/>via Data Gateway]
C -->|Yes, CDC/change feed| F[Mirroring]
C -->|No, periodic sync OK| G{Acceptable latency?}
G -->|Minutes| H[Pipeline Copy<br/>with incremental load]
G -->|Hours/daily| H
D -->|Reference in-place,<br/>no data movement| I[Shortcut]
D -->|Need transformation<br/>before use| J{Transform at<br/>read or write time?}
J -->|Read time<br/>schema-on-read| I
J -->|Write time<br/>materialize Delta| H
F --> K{Query via<br/>Lakehouse or Warehouse?}
K -->|Lakehouse| L[Mirror to<br/>Lakehouse tables]
K -->|Warehouse| M[Mirror to<br/>Warehouse tables]
```

Comparison Table¶
| Factor | Mirroring | Shortcuts | Pipeline Copy |
|---|---|---|---|
| Data Movement | Continuous replication | No movement (reference) | Scheduled copy |
| Latency | Near-real-time (seconds-minutes) | Zero (reads source directly) | Minutes to hours |
| Supported Sources | Azure SQL, Cosmos DB, Snowflake, more | ADLS, S3, GCS, Dataverse, OneLake | Any (via connectors + gateways) |
| Data Ownership | Replicated copy in OneLake | Source owns data | Copied to OneLake |
| Transformation | None (raw replication) | None (passthrough) | Full ETL in pipeline |
| Cost | Replication CU + storage | Minimal (query-time CU only) | Copy CU + storage |
| Delta Format | Automatic | Depends on source | Writer controls format |
| Governance | OneLake lineage | Cross-source lineage | Full pipeline lineage |
Choose X When...¶
- Choose Mirroring when you need continuous, near-real-time replication from a supported database (Azure SQL, Cosmos DB, Snowflake) without building custom CDC pipelines. Data is automatically converted to Delta format in OneLake.
- Choose Shortcuts when you want to query data in-place without copying it, the source is ADLS Gen2/S3/GCS/Dataverse, and you want to avoid storage duplication. Ideal for cross-workspace or cross-cloud data federation.
- Choose Pipeline Copy when the source requires a data gateway (on-premises), you need transformation during ingestion, the source is not supported by mirroring or shortcuts, or you need full control over scheduling and error handling.
POC Example¶
The POC demonstrates all three patterns. Shortcuts are used in 17_bronze_shortcut_transformations.py to reference external ADLS data without duplication. Pipeline Copy patterns are documented in streaming notebooks (01_sql_server_cdc.py through 05_oracle_cdc.py) for on-premises source ingestion. Mirroring patterns are covered in Tutorial 08 (Database Mirroring).
See: Mirroring | Shortcut Transformations Notebook | Tutorial 08
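The "Pipeline Copy with incremental load" branch above boils down to a watermark pattern: remember the highest value copied so far and fetch only newer rows on the next run. A minimal sketch of the source-query side — the helper, table, and column names are hypothetical, not taken from the POC:

```python
def incremental_copy_query(table: str, watermark_col: str, last_watermark: str) -> str:
    """Build the source query for an incremental Copy activity: fetch only
    rows newer than the last successfully copied watermark value."""
    return (
        f"SELECT * FROM {table} "
        f"WHERE {watermark_col} > '{last_watermark}' "
        f"ORDER BY {watermark_col}"
    )

# Example: next run picks up only rows modified since the stored watermark.
query = incremental_copy_query("dbo.transactions", "modified_at", "2024-01-01T00:00:00")
```

In a real pipeline the watermark would be persisted (e.g. in a control table) and updated only after the copy succeeds, so a failed run simply retries the same window.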
6. Eventhouse vs Lakehouse for Time-Series¶
Where should time-series data live?
```mermaid
flowchart TD
A[Time-series data<br/>ingestion] --> B{Query pattern?}
B -->|Ad-hoc exploration,<br/>KQL, anomaly detection| C[Eventhouse]
B -->|Batch aggregation,<br/>joins with dims| D[Lakehouse]
B -->|Both real-time<br/>and historical| E[Both: Eventhouse<br/>hot + Lakehouse cold]
C --> F{Retention<br/>requirement?}
F -->|Hot: days/weeks<br/>Cold: months/years| G[Eventhouse with<br/>retention policy]
F -->|All data hot,<br/>sub-second queries| H[Eventhouse with<br/>large hot cache]
D --> I{Ingestion rate?}
I -->|< 1,000 events/sec| J[Lakehouse Delta<br/>append + optimize]
I -->|> 1,000 events/sec| K[Eventhouse for ingest,<br/>shortcut to Lakehouse]
E --> L[Eventstream splits<br/>to both destinations]
```

Comparison Table¶
| Factor | Eventhouse (KQL DB) | Lakehouse (Delta) |
|---|---|---|
| Query Language | KQL | PySpark, Spark SQL |
| Ingestion Rate | Millions of events/sec | Thousands of events/sec (streaming) |
| Hot Cache | Configurable (SSD-backed) | In-memory via Spark session |
| Cold Storage | Automatic tiering | Delta Lake in OneLake |
| Time-Series Functions | Native (make-series, series_decompose, anomaly detection) | Window functions, custom UDFs |
| Retention Policies | Built-in per-table soft/hard delete | Manual via Delta VACUUM + time travel |
| Materialized Views | Native support | Manual Gold aggregation tables |
| Update Policies | Automatic ETL on ingest | Manual notebook/pipeline triggers |
| Direct Lake | Via shortcut | Native |
| Best For | Real-time monitoring, telemetry, logs | Historical analytics, ML training, BI |
Choose X When...¶
- Choose Eventhouse when you need sub-second queries over streaming time-series data, built-in anomaly detection, KQL-native analytics, and automatic retention policies. Ideal for operational monitoring dashboards.
- Choose Lakehouse when time-series data needs to be joined with dimensional tables, used for ML model training, aggregated into Gold KPIs, or served via Direct Lake to Power BI.
- Choose Both (the most common pattern) when you need real-time operational monitoring AND historical batch analytics. Use Eventstream to split ingestion to both destinations, with Eventhouse handling the hot path and Lakehouse handling the cold path.
POC Example¶
Slot machine telemetry follows the dual-path pattern. 01_realtime_slot_streaming.py sends live telemetry to the Eventhouse for KQL-based casino floor monitoring (02_kql_casino_floor.kql), while the batch pipeline processes the same data through Bronze/Silver/Gold for daily reporting. The Eventhouse keeps 7 days of hot cache; the Lakehouse retains full history.
See: Eventhouse Vector Database | Real-Time Intelligence | Real-Time Hub
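Both engines express windowed aggregation the same way underneath: bucket events into fixed time windows, then aggregate each bucket (KQL's `bin` + `summarize`, or Spark window functions). A plain-Python illustration of that shape, with hypothetical event tuples — not code from either engine:

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # fixed reference, avoids timezone-dependent math

def bin_events(events, width_seconds=60):
    """Bucket (timestamp, value) pairs into fixed windows and average each
    bucket -- the shape of KQL's `summarize avg(value) by bin(ts, 1m)`."""
    buckets = defaultdict(list)
    for ts, value in events:
        # Snap the timestamp down to the start of its window
        offset = (ts - EPOCH).total_seconds() % width_seconds
        buckets[ts - timedelta(seconds=offset)].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}
```

The difference between the engines is not the algorithm but where it runs: Eventhouse evaluates this over its hot cache at query time, while the Lakehouse path typically materializes the result into a Gold table.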
7. Variable Libraries vs Pipeline Parameters vs Spark Config¶
How should you manage configuration across environments?
```mermaid
flowchart TD
A[Need to configure<br/>a value] --> B{Scope?}
B -->|Single notebook<br/>or job| C{Sensitive?}
B -->|Across pipeline<br/>activities| D[Pipeline Parameters<br/>+ Variables]
B -->|All notebooks<br/>in environment| E[Spark Environment<br/>Configuration]
C -->|Yes, secret<br/>or credential| F[Key Vault<br/>via mssparkutils.credentials]
C -->|No, plain config| G{Changes per<br/>environment?}
G -->|Yes, dev/staging/prod| H[Pipeline Parameters<br/>with env-specific defaults]
G -->|No, global constant| I[Notebook constants<br/>or Environment config]
D --> J{Need conditional<br/>logic?}
J -->|Yes| K[Pipeline Variables<br/>with expressions]
J -->|No| L[Pipeline Parameters<br/>with defaults]
```

Comparison Table¶
| Mechanism | Scope | Secrets | Environment-Specific | Use Case |
|---|---|---|---|---|
| Notebook Constants | Single notebook | No | No | Fixed config values |
| mssparkutils.notebook.getArgument | Notebook (from caller) | No | Yes (passed by pipeline) | Parameterized notebook runs |
| Pipeline Parameters | Pipeline execution | No | Yes (per trigger/schedule) | Orchestration-level config |
| Pipeline Variables | Pipeline execution | No | Yes (computed at runtime) | Dynamic values, expressions |
| Spark Environment Config | All notebooks in environment | No | Yes (per environment) | Spark session settings |
| Key Vault + mssparkutils.credentials | Any notebook/pipeline | Yes | Yes (per vault) | Connection strings, API keys |
| Workspace Identity | Workspace-scoped | N/A (identity-based) | Yes (per workspace) | Credential-free Azure auth |
Choose X When...¶
- Choose Pipeline Parameters when values need to change per pipeline run or per environment (batch_date, source_path, target_lakehouse). Parameters flow into notebook activities as arguments.
- Choose Spark Environment Config when you need Spark-level settings (spark.sql.shuffle.partitions, library versions) applied consistently across all notebooks in a workspace.
- Choose Key Vault for any sensitive value (connection strings, API keys, SAS tokens). Access via mssparkutils.credentials.getSecret('vault-name', 'secret-name').
- Choose Workspace Identity when you want credential-free authentication to Azure resources (Storage, Key Vault, Purview) from notebooks. No secrets to manage or rotate.
POC Example¶
This POC uses a layered approach. Pipeline parameters pass batch_date and source_path to notebooks. Each notebook reads these via the _get_arg helper shim. Key Vault stores the Event Hub connection string for streaming notebooks. Workspace Identity (deployed via workspace-identity.bicep) enables credential-free access to ADLS Gen2 and Key Vault. Spark session configuration sets spark.sql.shuffle.partitions to 8 for POC-scale data.
See: Spark Environments & Job Definitions | Workspace Identity Module | OneLake Security
8. When to Add a New Vertical / Domain¶
Expanding the POC to a new industry or data domain.
```mermaid
flowchart TD
A[New industry vertical<br/>or data domain] --> B{Source data<br/>available?}
B -->|Real open data<br/>API available| C[Add to<br/>federal_datasets.yaml]
B -->|Synthetic only<br/>no public API| D[Build data generator<br/>in data_generation/generators/]
B -->|Both real<br/>and synthetic| E[Build both paths]
C --> F[Create Bronze<br/>notebook]
D --> F
E --> F
F --> G[Create Silver<br/>notebook]
G --> H[Create Gold<br/>notebook]
H --> I{Compliance<br/>requirements?}
I -->|HIPAA, FedRAMP,<br/>PCI-DSS, etc.| J[Add compliance<br/>controls to Silver]
I -->|Standard| K[Standard DQ<br/>validation only]
J --> L[Create Tutorial<br/>in tutorials/]
K --> L
L --> M[Create Use Case<br/>doc in docs/use-cases/]
M --> N[Add unit tests<br/>in validation/]
N --> O[Update notebooks/README.md<br/>cross-reference index]
```

Vertical Expansion Checklist¶
| Step | File/Location | Template From |
|---|---|---|
| 1. Data config | data_generation/config/federal_datasets.yaml | Existing agency entry |
| 2. Generator | data_generation/generators/ | base_generator.py |
| 3. Bronze notebook | notebooks/bronze/NN_bronze_DOMAIN.py | 12_bronze_usda.py |
| 4. Silver notebook | notebooks/silver/NN_silver_DOMAIN.py | 12_silver_usda.py |
| 5. Gold notebook | notebooks/gold/NN_gold_DOMAIN_analytics.py | 12_gold_usda_analytics.py |
| 6. Tutorial | tutorials/NN-domain-name/ | 32-usda-agriculture/ |
| 7. Use case doc | docs/use-cases/domain-analytics.md | agricultural-analytics.md |
| 8. Unit tests | validation/unit_tests/federal/ | Existing test file |
| 9. Cross-ref | notebooks/README.md | Add row to domain table |
POC Example¶
The DOJ vertical (Phase 13) followed this exact pattern: 18_bronze_doj.py -> 18_silver_doj.py -> 19_gold_doj_analytics.py, with tutorial 38-doj-justice and use case doc federal-justice-analytics.md. Total time from start to merged PR: ~2 hours using the template pattern.
9. Capacity Sizing Decision¶
What Fabric SKU should you deploy?
```mermaid
flowchart TD
A[Choose Fabric SKU] --> B{Workload type?}
B -->|Learning, tutorials,<br/>single developer| C[F2 or F4<br/>$265-530/mo 24/7]
B -->|POC demo,<br/>3-day workshop| D[F64<br/>$35-50 for 3 days]
B -->|Development team,<br/>concurrent users| E{How many<br/>concurrent users?}
B -->|Production| F{Data volume<br/>and concurrency?}
E -->|1-3 developers| G[F4 or F8<br/>pause off-hours]
E -->|4-10 developers| H[F16 or F32<br/>pause off-hours]
F -->|< 100 GB, low concurrency| I[F32]
F -->|100 GB - 1 TB, medium| J[F64]
F -->|> 1 TB, high concurrency| K[F128+]
C --> L{Cost optimization?}
D --> L
G --> L
H --> L
I --> L
J --> L
K --> L
L -->|Yes| M[Pause/resume schedule<br/>+ reserved capacity]
L -->|No| N[Pay-as-you-go]
```

SKU Quick Reference¶
| SKU | CU/sec | Monthly Cost (24/7) | Monthly Cost (8hr weekdays) | Best For |
|---|---|---|---|---|
| F2 | 2 | ~$265 | ~$88 | Individual learning |
| F4 | 4 | ~$530 | ~$176 | Solo development |
| F8 | 8 | ~$1,060 | ~$353 | Small team dev |
| F16 | 16 | ~$2,120 | ~$707 | Team dev + staging |
| F32 | 32 | ~$4,240 | ~$1,413 | Large team + UAT |
| F64 | 64 | ~$8,480 | ~$2,827 | POC recommended / Small prod |
| F128 | 128 | ~$16,960 | ~$5,653 | Medium production |
| F256 | 256 | ~$33,920 | ~$11,307 | Large production |
POC Example¶
This POC targets F64 because it provides sufficient CU for: parallel Bronze notebook execution (6 notebooks concurrently), real-time Eventstream ingestion, interactive notebook development, and Power BI Direct Lake queries -- all without throttling. For development iterations, scale down to F4 and pause overnight to reduce costs by ~93%.
See: Capacity Planning | Cost Estimation
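The SKU table follows simple proportional arithmetic: pay-as-you-go cost scales linearly with CUs and with hours running. A hedged sketch — the per-CU-hour rate below (~$0.18, back-derived from ~$265/month for F2 at 730 hours) is an assumption for illustration; actual rates vary by region and change over time:

```python
CU_HOURLY_RATE = 0.1815  # assumed $/CU-hour, back-derived from ~$265/mo for F2

def monthly_cost(cu: int, hours_on: float = 730.0, rate: float = CU_HOURLY_RATE) -> float:
    """Pay-as-you-go monthly cost: CUs x hours running x per-CU-hour rate."""
    return cu * hours_on * rate

full_time = monthly_cost(64)                # F64 running 24/7 (~$8,480)
part_time = monthly_cost(64, hours_on=730 / 3)  # roughly 8 hours/day
```

The linearity is the practical point: doubling the SKU doubles the bill, and pausing for two-thirds of the day cuts it by two-thirds, which is why pause/resume schedules dominate cost optimization for non-production capacities.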
10. Data Quality Strategy Decision¶
Where should data quality checks run?
```mermaid
flowchart TD
A[Data quality check<br/>needed] --> B{Check type?}
B -->|Schema validation<br/>column types, nulls| C{Which layer?}
B -->|Business rules<br/>thresholds, ranges| D[Silver layer<br/>Delta constraints + code]
B -->|Referential integrity<br/>FK relationships| E[Silver layer<br/>join validation]
B -->|Statistical profiling<br/>distributions, outliers| F[Gold layer<br/>or dedicated DQ notebook]
C -->|Ingest-time<br/>reject bad records| G[Bronze layer<br/>schema-on-read with fallback]
C -->|Post-ingest<br/>quarantine bad records| H[Silver layer<br/>DQ scoring + quarantine table]
D --> I{Enforcement level?}
I -->|Hard reject<br/>fail pipeline| J[Delta CHECK constraints<br/>+ try/except]
I -->|Soft flag<br/>continue with warning| K[DQ score column<br/>+ downstream filter]
F --> L{Tool?}
L -->|Automated suite| M[Great Expectations<br/>checkpoint]
L -->|Ad-hoc profiling| N[Notebook with<br/>profiling functions]
```

DQ Pattern in This POC¶
| Layer | Strategy | Implementation | Example |
|---|---|---|---|
| Bronze | Schema-on-read, append-only, minimal validation | mergeSchema=true, audit columns (_ingested_at, _source_file) | 01_bronze_slot_telemetry.py |
| Silver | Schema enforcement, dedup, DQ scoring | MERGE upsert, dq_score column, NOT NULL constraints | 01_silver_slot_cleansed.py |
| Gold | Business rule validation, threshold checks | Aggregation assertions, CTR >= $10K checks | 03_gold_compliance_reporting.py |
| Validation | Automated test suites | 612 pytest tests + 9 Great Expectations suites | validation/unit_tests/ |
See: Testing Strategies | Medallion Deep Dive
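The Silver-layer "soft flag" path can be sketched as row-level scoring: each failed rule lowers the score, and rows under a threshold are routed to a quarantine table instead of failing the run. The rules and field names below are hypothetical illustrations; the POC's actual dq_score logic lives in its Silver notebooks.

```python
from typing import Callable, Dict

# Hypothetical rules for a telemetry record; each returns True when the row passes.
RULES: Dict[str, Callable[[dict], bool]] = {
    "machine_id_present": lambda r: bool(r.get("machine_id")),
    "amount_non_negative": lambda r: r.get("amount", 0) >= 0,
    "timestamp_present": lambda r: bool(r.get("event_ts")),
}

def dq_score(row: dict) -> float:
    """Fraction of rules passed; becomes the dq_score column in Silver."""
    return sum(rule(row) for rule in RULES.values()) / len(RULES)

def route(row: dict, threshold: float = 0.8) -> str:
    """Soft-flag strategy: quarantine low-scoring rows rather than failing."""
    return "silver" if dq_score(row) >= threshold else "quarantine"
```

In PySpark the same idea becomes a computed dq_score column plus a filter that writes passing rows to the Silver table and the rest to a quarantine table, keeping the pipeline green while preserving the bad records for inspection.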
Cross-Reference: Related Docs¶
| Decision Area | Primary Doc | Supporting Docs |
|---|---|---|
| Lakehouse vs Warehouse vs SQL DB | Decision Guide | Warehouse Setup, SQL Database |
| Real-Time vs Batch | RTI | Streaming Notebooks, Eventstream |
| Direct Lake vs Import vs DQ | Direct Lake | Power BI Best Practices |
| Notebooks vs Jobs vs Pipelines | Spark Environments | Pipeline Patterns, CI/CD |
| Mirroring vs Shortcuts vs Copy | Mirroring | Shortcut Notebook, Source Patterns |
| Eventhouse vs Lakehouse | RTI | Vector DB, Lakehouse Setup |
| Config Management | OneLake Security | Workspace Identity, CMK |
| Adding a New Vertical | Notebook Index | USDA Tutorial, DOJ Tutorial |
| Capacity Sizing | Capacity Planning | Cost Estimation, Alerts & Budgets |
| Data Quality Strategy | Testing Strategies | Medallion Deep Dive, Great Expectations |
| General Decision Guide | 10_DECISION_GUIDE.md | This document (detailed flowcharts) |
Tip: These decision trees capture the common paths. Edge cases exist -- when in doubt, start with the Lakehouse (most flexible) and migrate to a more specialized component once query patterns stabilize.