Lakehouse vs. Warehouse vs. Data Lake¶
TL;DR¶
Mixed BI + ML workloads on >10 TB of open-format data: pick a Lakehouse. Pure sub-second BI at >100 concurrent users: pick a Warehouse. Raw append-only exploration: a Data Lake is fine — but promote curated data to a lakehouse or warehouse.
When this question comes up¶
- Designing the gold layer serving for a new agency workload.
- Evaluating whether to stand up Synapse dedicated pools or rely on a Delta-based lakehouse for BI.
- Debating whether to expose raw ADLS to analysts or insist on a curated tier.
Decision tree¶
flowchart TD
start["Dominant workload shape?"] -->|Structured BI, heavy concurrency| q_bi
start -->|Mixed BI + ML + ad-hoc| q_mixed
start -->|Unstructured exploration| q_lake
q_bi{">100 BI users,<br/>sub-second SLA?"}
q_bi -->|Yes| rec_wh["**Recommend:** Warehouse"]
q_bi -->|No| q_also_ml{"ML on same data?"}
q_also_ml -->|Yes| rec_lh["**Recommend:** Lakehouse"]
q_also_ml -->|No| rec_wh
q_mixed{">10 TB with<br/>open-format need?"}
q_mixed -->|Yes| rec_lh
q_mixed -->|No, SQL-first| rec_wh
q_lake{"Need ACID /<br/>row-column governance?"}
q_lake -->|Yes| rec_lh
q_lake -->|No, append-only| rec_dl["**Recommend:** Data Lake"] Per-recommendation detail¶
Recommend: Lakehouse (Delta/Iceberg on ADLS Gen2)¶
When: Mixed BI + ML, >10 TB growth, need for ACID + time travel + open formats. Why: One copy of data serves all consumers with medallion tiering. Tradeoffs: Cost — storage cheap, compute scales with pattern; Latency — sub-second with Direct Lake; Compliance — full Commercial + Gov via Databricks/Synapse; Skill — Delta literacy required. Anti-patterns:
- No Spark/Delta expertise and <90-day timeline — start with a warehouse.
- Undifferentiated lakehouse with no bronze/silver/gold — becomes a data swamp.
Linked example: examples/usda/
Recommend: Warehouse (Synapse Dedicated Pool / Fabric Warehouse)¶
When: Pure BI at high concurrency, sub-second SLA, read-only gold consumption. Why: Purpose-built MPP SQL, governed semantic models, best concurrency. Tradeoffs: Cost — reserved capacity ($$$); Latency — sub-second at 100+ users; Compliance — full Commercial + Gov IL5; Skill — T-SQL first. Anti-patterns:
- Sparse / intermittent workloads — idle capacity burns budget.
- ML feature generation inside the warehouse — export to lakehouse first.
Linked example: examples/usps/
Recommend: Data Lake (ADLS Gen2 raw)¶
When: Raw landing zone, archival, schema-on-read exploration. Why: Cheapest storage, no lock-in, bring-your-own-compute. Tradeoffs: Cost — lowest; Latency — variable, no caching; Compliance — Bronze posture; Skill — low bar. Anti-patterns:
- Exposed directly to BI consumers — they will build brittle semantic layers.
- Used as system of record for curated products — promote to lakehouse/warehouse.
Linked example: examples/iot-streaming/
Related¶
- Architecture: Storage — OneLake Pattern
- Decision: Delta vs. Iceberg vs. Parquet
- Finding: CSA-0010