📂 OneLake Files vs Tables Decision Guide¶
Choose the Right Storage Strategy for Every Data Asset in Microsoft Fabric
Last Updated: 2026-04-27 | Version: 1.0.0
📑 Table of Contents¶
- 🎯 Overview
- 📊 Decision Matrix
- 📁 Files Section Deep Dive
- 📋 Tables Section Deep Dive
- ⚡ Performance Comparison
- 🔄 Files-to-Tables Promotion
- 🎰 Casino Industry Patterns
- 🏛️ Federal Agency Patterns
- 🚫 Anti-Patterns
- 🌳 Decision Flowchart
- 📚 References
🎯 Overview¶
Every Fabric Lakehouse contains two top-level storage areas: Files and Tables. Both reside in OneLake (backed by ADLS Gen2), but they serve fundamentally different purposes and unlock different capabilities. Choosing incorrectly leads to query failures, broken Direct Lake reports, or wasted compute.
Key Distinction¶
| Aspect | Files Section | Tables Section |
|---|---|---|
| Storage path | Files/… (arbitrary folders) | Tables/… (managed Delta tables) |
| Format | Any (CSV, JSON, Parquet, images, PDF) | Delta Lake only |
| SQL queryable | No (not via SQL endpoint) | Yes (automatic SQL endpoint) |
| Direct Lake | Not compatible | Fully compatible |
| Schema | None enforced | Schema-on-write via Delta |
| Time travel | Not available | Full Delta versioning |
| ACID | No transactional guarantees | Full ACID transactions |
| V-Order | Not applicable | Applied automatically on write |
📊 Decision Matrix¶
Use this matrix to determine where each data asset belongs:
| Data Characteristic | Files | Tables | Rationale |
|---|---|---|---|
| Structured tabular data (CSV, Parquet ingest) | Landing only | Final | Tables provide schema, SQL access, Direct Lake |
| Semi-structured JSON / XML | Yes | After flattening | Store raw; promote flattened version to Tables |
| Unstructured (images, PDFs, audio, video) | Yes | No | Tables cannot store binary blobs efficiently |
| Machine learning model artifacts (.pkl, .onnx) | Yes | No | Not tabular data |
| Streaming landing zone (raw events) | Yes | After micro-batch | Land in Files; structured merge into Tables |
| Reference/lookup data (< 100 MB) | No | Yes | Direct Lake dimension tables |
| Compliance filings (CTR, SAR PDFs) | Yes | Metadata only | Store PDF in Files; index metadata in Tables |
| Configuration files (YAML, JSON config) | Yes | No | Not queryable data |
| Large Parquet datasets (query-ready) | No | Yes | Convert to Delta for SQL + Direct Lake |
| Intermediate Spark outputs | Yes | Only if reused | Temp outputs in Files; durable outputs in Tables |
| External partner data drops | Yes | After validation | Land in Files; validate then promote to Tables |
Quick Rule¶
If a downstream consumer needs to query it with SQL or display it in Power BI, it belongs in Tables. Everything else goes in Files.
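The quick rule can be encoded as a small helper for ingestion frameworks — a minimal sketch, where the `DataAsset` shape and its two flags are illustrative, not a Fabric API:

```python
from dataclasses import dataclass

@dataclass
class DataAsset:
    """Illustrative description of an incoming data asset."""
    name: str
    is_tabular: bool          # structured rows/columns (after any flattening)
    needs_sql_or_bi: bool     # queried via SQL endpoint or shown in Power BI

def storage_section(asset: DataAsset) -> str:
    """Apply the quick rule: SQL/Power BI consumers => Tables, else Files."""
    if asset.is_tabular and asset.needs_sql_or_bi:
        return "Tables"
    return "Files"

print(storage_section(DataAsset("slot_telemetry", True, True)))     # Tables
print(storage_section(DataAsset("sar_filing_pdfs", False, False)))  # Files
```

Non-tabular assets always land in Files; tabular assets without SQL/BI consumers (e.g., config files) stay in Files too.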
📁 Files Section Deep Dive¶
Appropriate Use Cases¶
1. Landing Zone for Raw Ingestion
# Land raw CSV files from external SFTP
raw_path = "Files/landing/casino/slot_telemetry/2026/04/27/"
# Read from landing zone
df_raw = spark.read.format("csv") \
.option("header", "true") \
.option("inferSchema", "true") \
.load(f"abfss://{workspace}@onelake.dfs.fabric.microsoft.com/{lakehouse}/{raw_path}")
# After validation, promote to Tables (see promotion section)
2. Unstructured Data Storage
# Store compliance document scans
doc_path = "Files/compliance/sar_filings/2026/04/"
# Store metadata about the documents in a Delta table
doc_metadata = spark.createDataFrame([
("SAR-2026-04-001", "Files/compliance/sar_filings/2026/04/sar_001.pdf", "2026-04-15", "pending_review"),
("SAR-2026-04-002", "Files/compliance/sar_filings/2026/04/sar_002.pdf", "2026-04-16", "submitted"),
], ["filing_id", "document_path", "filing_date", "status"])
doc_metadata.write.format("delta").mode("append").saveAsTable("gold.sar_filing_index")
3. Multi-Format Data Hub
# Files section supports any format
files_structure = """
Files/
├── landing/
│ ├── csv/ # Partner CSV drops
│ ├── json/ # API response snapshots
│ └── xml/ # Legacy system exports
├── models/
│ ├── onnx/ # ML model artifacts
│ └── checkpoints/ # Training checkpoints
├── compliance/
│ ├── sar_filings/ # PDF scans
│ └── ctr_filings/ # PDF scans
└── temp/
└── spark_staging/ # Intermediate outputs
"""
Files Section Folder Organization¶
| Folder | Purpose | Retention |
|---|---|---|
| landing/ | Raw data drops before validation | 7-30 days |
| archive/ | Processed files moved post-ingestion | 90 days - 7 years |
| models/ | ML artifacts, ONNX models | Versioned indefinitely |
| compliance/ | Regulatory documents (PDF, images) | Per regulation (5-7 years) |
| temp/ | Spark staging, intermediate outputs | 24-48 hours |
| reference/ | Static config files, lookup CSVs | Indefinite |
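To keep notebooks consistent with this layout, path construction can be centralized — a sketch assuming the folder names above and `source/YYYY/MM/DD` date partitioning:

```python
from datetime import date

FILES_ROOT = "Files"

def landing_path(source: str, d: date) -> str:
    """Build a dated landing-zone path following the Files/ layout above."""
    return f"{FILES_ROOT}/landing/{source}/{d:%Y/%m/%d}/"

def archive_path(source: str, d: date) -> str:
    """Mirror of landing_path under archive/, for post-ingestion moves."""
    return f"{FILES_ROOT}/archive/{source}/{d:%Y/%m/%d}/"

print(landing_path("casino/slot_telemetry", date(2026, 4, 27)))
# Files/landing/casino/slot_telemetry/2026/04/27/
```

Centralizing this avoids the subtle path typos that leave orphaned folders behind in the landing zone.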
📋 Tables Section Deep Dive¶
Why Tables for All Structured Data¶
1. Automatic SQL Endpoint
Every Delta table in the Tables section is automatically exposed via the Lakehouse SQL endpoint. No configuration needed.
-- Accessible immediately via SQL endpoint
SELECT machine_id, SUM(coin_in) as total_coin_in
FROM lh_gold.slot_performance_daily
WHERE gaming_date >= '2026-04-01'
GROUP BY machine_id
ORDER BY total_coin_in DESC;
2. Direct Lake Compatibility
Direct Lake mode in Power BI reads directly from Delta tables in the Tables section. Files section data is not accessible via Direct Lake.
# This table is automatically available to Direct Lake
df_gold.write.format("delta") \
.mode("overwrite") \
.option("overwriteSchema", "true") \
.saveAsTable("gold.slot_performance_daily")
# V-Order is applied automatically, optimizing Direct Lake reads
3. Schema Enforcement
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType
slot_schema = StructType([
StructField("machine_id", StringType(), False),
StructField("timestamp", TimestampType(), False),
StructField("coin_in", DoubleType(), True),
StructField("coin_out", DoubleType(), True),
StructField("denomination", DoubleType(), False),
])
# Schema enforcement rejects malformed records
df_validated = spark.read.schema(slot_schema).json("Files/landing/slot_events/")
df_validated.write.format("delta").mode("append").saveAsTable("bronze.slot_telemetry")
4. ACID Transactions and Time Travel
# Time travel for compliance audits
df_as_of = spark.read.format("delta") \
.option("timestampAsOf", "2026-04-15T00:00:00Z") \
.table("silver.player_transactions")
# Restore a table to a prior version after bad data load
from delta.tables import DeltaTable
dt = DeltaTable.forName(spark, "silver.player_transactions")
dt.restoreToVersion(42)
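To find which version to restore, Delta's commit history (one row per commit, with `version` and `timestamp` columns) can be scanned for the last version at or before an audit timestamp. A pure-Python sketch over already-collected history rows — the dict shape mirrors `DeltaTable.history()` output, but the helper itself is illustrative:

```python
from datetime import datetime
from typing import Optional

def version_as_of(history_rows, as_of: datetime) -> Optional[int]:
    """Return the latest Delta version committed at or before `as_of`.

    `history_rows` is a list of dicts like rows collected from
    DeltaTable.history() — each with 'version' and 'timestamp' keys.
    Returns None if no commit predates `as_of`.
    """
    eligible = [r["version"] for r in history_rows if r["timestamp"] <= as_of]
    return max(eligible) if eligible else None

history = [
    {"version": 40, "timestamp": datetime(2026, 4, 10, 8, 0)},
    {"version": 41, "timestamp": datetime(2026, 4, 14, 8, 0)},
    {"version": 42, "timestamp": datetime(2026, 4, 16, 8, 0)},
]
print(version_as_of(history, datetime(2026, 4, 15)))  # 41
```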
Tables with Lakehouse Schemas (GA 2026)¶
# Using Lakehouse Schemas to organize by medallion layer
# Tables appear under: Tables/bronze/*, Tables/silver/*, Tables/gold/*
df.write.format("delta").mode("append").saveAsTable("bronze.slot_telemetry")
df.write.format("delta").mode("overwrite").saveAsTable("silver.slot_cleansed")
df.write.format("delta").mode("overwrite").saveAsTable("gold.slot_kpi_daily")
# SQL access with schema prefix
# SELECT * FROM lh_main.bronze.slot_telemetry
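Because schema-qualified tables map to predictable OneLake folders (`Tables/<schema>/<table>`), tools that need the physical path (shortcuts, external readers) can derive it from the table name. A minimal sketch — the two-part-name-to-folder mapping follows the layout above; the helper itself is an assumption, not a Fabric API:

```python
def table_folder(qualified_name: str, tables_root: str = "Tables") -> str:
    """Map 'schema.table' to its OneLake folder under Tables/."""
    parts = qualified_name.split(".")
    if len(parts) != 2:
        raise ValueError(f"Expected 'schema.table', got: {qualified_name!r}")
    schema, table = parts
    return f"{tables_root}/{schema}/{table}"

print(table_folder("bronze.slot_telemetry"))  # Tables/bronze/slot_telemetry
```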
⚡ Performance Comparison¶
Query Performance¶
The timings below are illustrative orders of magnitude for a ~100M-row dataset; absolute numbers vary with capacity size, data volume, and layout, but the relative differences are what matter:
| Scenario | Files (Parquet) | Tables (Delta) | Delta Advantage |
|---|---|---|---|
| Point lookup by key | ~2.5s | ~0.3s | 8x faster (data skipping) |
| Full scan 100M rows | ~45s | ~38s | 1.2x faster (V-Order) |
| Aggregation with filter | ~12s | ~3s | 4x faster (predicate pushdown + stats) |
| Join two datasets | ~30s | ~8s | 3.7x faster (Z-Order + broadcast) |
| Concurrent reads (10 users) | Inconsistent/slow | Consistent | ACID snapshot isolation |
Write Performance¶
| Scenario | Files (raw write) | Tables (Delta write) | Notes |
|---|---|---|---|
| Append 1M rows | ~5s | ~8s | Delta overhead: transaction log + V-Order |
| Overwrite 10M rows | ~15s | ~20s | Delta: snapshot isolation cost |
| Streaming micro-batch | ~2s | ~5s | Delta: commit protocol overhead |
Key Insight: Tables (Delta) carry slightly higher write latency due to the transaction log and V-Order encoding, but deliver dramatically better read performance. For read-heavy analytical workloads (the primary Fabric use case), Tables are almost always the better choice.
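Since these numbers are workload-dependent, it is worth measuring on your own data. A minimal timing harness — plain Python; in a notebook you would wrap a Spark action such as `df.count()` in the callable (the paths in the commented usage are illustrative):

```python
import time
from statistics import median

def time_query(action, runs: int = 3) -> float:
    """Run `action` (a zero-arg callable that forces a full read,
    e.g. lambda: df.count()) several times; return the median seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    return median(timings)

# Usage sketch inside a Fabric notebook:
# files_s  = time_query(lambda: spark.read.parquet("Files/analytics/player_summary/").count())
# tables_s = time_query(lambda: spark.table("gold.player_summary").count())

elapsed = time_query(lambda: sum(range(100_000)))  # stand-in workload
print(f"median: {elapsed:.4f}s")
```

Median over a few runs smooths out cold-cache effects; for Direct Lake comparisons, measure from the report side instead.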
Storage Overhead¶
| Format | Size on Disk (100M rows) | Relative to CSV Baseline |
|---|---|---|
| CSV (Files) | 12 GB | Baseline |
| Parquet (Files) | 2.1 GB | 82% smaller |
| Delta (Tables) | 2.3 GB | ~10% larger than Parquet (transaction log + metadata) |
| Delta + V-Order (Tables) | 2.3 GB | Same size as Delta, faster reads |
🔄 Files-to-Tables Promotion¶
Standard Promotion Pattern¶
# Step 1: Read from Files landing zone
df_landing = spark.read.format("csv") \
.option("header", "true") \
.schema(defined_schema) \
.load("Files/landing/usda_crop_production/2026/04/")
# Step 2: Validate
from pyspark.sql.functions import col, when, count
null_check = df_landing.select([
count(when(col(c).isNull(), c)).alias(c) for c in df_landing.columns
])
null_counts = null_check.collect()[0]
for field in defined_schema.fields:
if not field.nullable and null_counts[field.name] > 0:
raise ValueError(f"Non-nullable field {field.name} has {null_counts[field.name]} nulls")
# Step 3: Write to Tables as Delta
df_landing.write.format("delta") \
.mode("append") \
.option("mergeSchema", "true") \
.saveAsTable("bronze.usda_crop_production")
# Step 4: Archive processed files
mssparkutils.fs.mv(
"Files/landing/usda_crop_production/2026/04/",
"Files/archive/usda_crop_production/2026/04/"
)
Incremental Promotion with Watermark¶
from pyspark.sql.functions import col, current_timestamp
# Get the high watermark from the last successful load (None on the first run)
watermark_df = spark.sql("SELECT max(ingestion_timestamp) as hwm FROM bronze.epa_air_quality")
high_watermark = watermark_df.collect()[0]["hwm"]
# Read the landing zone, keeping only files newer than the watermark
df_new = spark.read.format("json") \
    .load("Files/landing/epa_air_quality/")
if high_watermark is not None:
    df_new = df_new.filter(col("_metadata.file_modification_time") > high_watermark)
if not df_new.isEmpty():
    df_new.withColumn("ingestion_timestamp", current_timestamp()) \
        .write.format("delta").mode("append") \
        .saveAsTable("bronze.epa_air_quality")
🎰 Casino Industry Patterns¶
| Data Asset | Section | Rationale |
|---|---|---|
| Slot telemetry (real-time events) | Tables | Structured, queryable, Direct Lake KPIs |
| Player card swipe logs | Tables | Structured, joins with player dim |
| Surveillance camera frames | Files | Binary image data |
| Compliance PDFs (CTR, SAR, W-2G) | Files | Unstructured documents |
| Compliance filing metadata | Tables | Queryable index for audit |
| Player loyalty CSV imports | Files → Tables | Land in Files, promote to Tables |
| Cage transaction logs | Tables | Structured financial data |
| Floor layout diagrams (CAD/SVG) | Files | Non-tabular |
🏛️ Federal Agency Patterns¶
| Data Asset | Section | Rationale |
|---|---|---|
| USDA crop production statistics | Tables | Structured, analytical queries |
| SBA loan application CSVs (raw) | Files → Tables | Land raw, validate, promote |
| NOAA weather station JSON feeds | Files → Tables | Semi-structured landing, then flatten |
| EPA air quality sensor readings | Tables | Time-series, KQL-queryable |
| DOI land survey GeoJSON | Files + Tables | GeoJSON in Files, attributes in Tables |
| DOJ case document scans (PDF) | Files | Unstructured compliance documents |
| DOJ case metadata | Tables | Structured index for search/reporting |
🚫 Anti-Patterns¶
Anti-Pattern 1: Storing CSVs Directly in Tables¶
# ❌ WRONG: CSV files placed in Tables/ folder manually
# They won't be recognized as Delta tables and will cause errors
mssparkutils.fs.cp("Files/landing/data.csv", "Tables/data.csv") # BROKEN
# ✅ CORRECT: Read CSV and write as Delta table
df = spark.read.csv("Files/landing/data.csv", header=True)
df.write.format("delta").saveAsTable("bronze.imported_data")
Anti-Pattern 2: Querying Files Section via SQL Endpoint¶
-- ❌ WRONG: Files section is invisible to SQL endpoint
SELECT * FROM "Files/landing/transactions.parquet"; -- ERROR: table not found
-- ✅ CORRECT: Only Tables section is SQL-queryable
SELECT * FROM bronze.transactions;
Anti-Pattern 3: Storing Large Binary Data in Tables¶
# ❌ WRONG: Storing image bytes in Delta tables
df_images = spark.createDataFrame([
("img_001", open("surveillance_001.jpg", "rb").read()),
], ["image_id", "image_bytes"])
df_images.write.format("delta").saveAsTable("bronze.surveillance") # Huge, slow, wasteful
# ✅ CORRECT: Store path reference in table, file in Files
df_index = spark.createDataFrame([
("img_001", "Files/surveillance/2026/04/27/cam_01_001.jpg", "2026-04-27T14:30:00Z"),
], ["image_id", "file_path", "capture_time"])
df_index.write.format("delta").saveAsTable("bronze.surveillance_index")
Anti-Pattern 4: Using Files as Long-Term Queryable Storage¶
# ❌ WRONG: Keeping Parquet in Files and querying with Spark repeatedly
df = spark.read.parquet("Files/analytics/player_summary/") # No caching, no stats, no V-Order
# ✅ CORRECT: Promote to Tables for repeated queries
df.write.format("delta").saveAsTable("gold.player_summary") # V-Order, data skipping, SQL access
Anti-Pattern 5: Skipping the Landing Zone¶
# ❌ WRONG: Writing external data directly to Tables without validation
df_ext = spark.read.csv("Files/shortcuts/external_feed/data.csv", header=True)
df_ext.write.format("delta").saveAsTable("bronze.external")  # No validation, no replay
# ✅ CORRECT: Land first, validate, then promote
# 1. Download to Files/landing/
# 2. Validate schema + quality
# 3. Write validated data to Tables
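The validate step above can start as a row-level check against required fields before promotion. A pure-Python sketch — the field names are illustrative, and in practice the same checks run over a Spark DataFrame as shown in the promotion section:

```python
def validate_rows(rows, required_fields):
    """Split rows (dicts) into (valid, rejected) by non-null required fields."""
    valid, rejected = [], []
    for row in rows:
        missing = [f for f in required_fields if row.get(f) is None]
        (rejected if missing else valid).append(row)
    return valid, rejected

rows = [
    {"loan_id": "SBA-001", "amount": 50000.0},
    {"loan_id": None, "amount": 12000.0},  # fails: loan_id missing
]
valid, rejected = validate_rows(rows, ["loan_id", "amount"])
print(len(valid), len(rejected))  # 1 1
```

Rejected rows can be written to a quarantine folder under Files/ so the batch can be replayed after the source is fixed.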
🌳 Decision Flowchart¶
flowchart TD
A[New Data Asset] --> B{Is it structured tabular data?}
B -->|Yes| C{Is it already in Delta format?}
B -->|No| D{Is it semi-structured?}
C -->|Yes| E[Write directly to Tables]
C -->|No| F[Land in Files → Validate → Promote to Tables]
D -->|Yes, JSON/XML| G[Land in Files → Flatten → Promote to Tables]
D -->|No, binary/unstructured| H[Store in Files permanently]
H --> I{Need queryable metadata?}
I -->|Yes| J[Create metadata index table in Tables]
I -->|No| K[Files only — done]
E --> L{Needs Direct Lake / SQL?}
F --> L
G --> L
L -->|Yes| M[Verify table in Tables section ✅]
L -->|No| N[Consider if Files is sufficient]
🔒 Security & Governance Considerations¶
Access Control Differences¶
| Capability | Files Section | Tables Section |
|---|---|---|
| Workspace roles | Read/Write via workspace membership | Read/Write via workspace membership |
| Item-level sharing | Share lakehouse grants Files access | Share lakehouse grants Tables access |
| SQL endpoint security | N/A (not visible) | Row-level security (RLS) supported |
| OneLake data access roles | Folder-level RBAC (preview) | Table-level RBAC (preview) |
| Sensitivity labels | Applied at lakehouse level | Applied at lakehouse level |
| Column-level security | N/A | Via SQL endpoint views |
Governance Best Practices¶
# Pattern: PII in Files (encrypted), metadata in Tables
# Scenario: Player social security numbers for W-2G compliance
# ❌ WRONG: SSN stored in a Delta table accessible via SQL endpoint
df.withColumn("ssn", col("raw_ssn")) \
.write.format("delta").saveAsTable("silver.player_pii")
# SSN queryable by anyone with SQL endpoint access
# ✅ CORRECT: Hashed SSN in Tables, raw encrypted in Files (restricted)
import hashlib
import os
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType
salt = os.environ.get("FABRIC_POC_HASH_SALT", "")
if not salt:
    raise ValueError("FABRIC_POC_HASH_SALT environment variable required")
@udf(StringType())
def hash_ssn(ssn):
    return hashlib.sha256(f"{salt}{ssn}".encode()).hexdigest()
# Tables: only the hash (for matching, not reversible)
df.withColumn("ssn_hash", hash_ssn(col("raw_ssn"))) \
.drop("raw_ssn") \
.write.format("delta").saveAsTable("silver.player_profile")
# Files: original for compliance retrieval only, in a restricted folder
# (OneLake encrypts data at rest; add application-level encryption before
# writing if your policy requires it — the write below is plain Parquet)
df.select("player_id", "raw_ssn") \
.write.format("parquet") \
.mode("overwrite") \
.save("Files/restricted/player_pii/")
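When PII must surface in Tables at all (e.g., analyst-facing views), masked display values are safer than raw fields. A sketch of a masking helper — it assumes the US SSN `AAA-GG-SSSS` format and complements, rather than replaces, the hash column above:

```python
def mask_ssn(ssn):
    """Mask all but the last four digits of an SSN (assumes AAA-GG-SSSS)."""
    if ssn is None:
        return None
    digits = ssn.replace("-", "")
    if len(digits) != 9 or not digits.isdigit():
        raise ValueError("Expected a 9-digit SSN")
    return f"***-**-{digits[-4:]}"

print(mask_ssn("123-45-6789"))  # ***-**-6789
```

Wrapped in a Spark UDF (as with `hash_ssn`), this yields a display column that satisfies most reporting needs without exposing the raw value.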
Retention & Lifecycle Policies¶
| Section | Data Type | Retention Strategy |
|---|---|---|
| Files/landing/ | Raw drops | Delete after promotion to Tables (7-30 days) |
| Files/archive/ | Processed originals | Per regulatory requirement (5-7 years) |
| Files/compliance/ | Regulatory PDFs | Per regulation (CTR: 5 years, SAR: 5 years) |
| Files/temp/ | Intermediate outputs | 24-48 hours (auto-cleanup) |
| Tables/bronze.* | Raw Delta tables | VACUUM after 7-30 days of time travel |
| Tables/silver.* | Cleansed Delta tables | VACUUM after 7-14 days |
| Tables/gold.* | Aggregated Delta tables | VACUUM after 7 days |
# Automated cleanup for landing zone
from datetime import datetime, timedelta
cutoff = datetime.now() - timedelta(days=30)
landing_files = mssparkutils.fs.ls("Files/landing/")
for item in landing_files:
if item.isDir:
sub_files = mssparkutils.fs.ls(item.path)
for f in sub_files:
mod_time = datetime.fromtimestamp(f.modifyTime / 1000)
if mod_time < cutoff:
mssparkutils.fs.rm(f.path, recurse=True)
print(f"Cleaned: {f.path} (modified {mod_time})")
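The Tables-side counterpart to this cleanup is Delta VACUUM with per-layer retention from the policy table above. A sketch — the retention-hours mapping encodes that table, the statement builder is illustrative, and `VACUUM ... RETAIN n HOURS` is standard Delta SQL you would run via spark.sql in a notebook:

```python
# Retention windows (hours) per medallion layer, per the policy table above
VACUUM_RETENTION_HOURS = {"bronze": 30 * 24, "silver": 14 * 24, "gold": 7 * 24}

def vacuum_statement(qualified_name: str) -> str:
    """Build a VACUUM statement for 'schema.table' using the layer policy."""
    schema = qualified_name.split(".")[0]
    hours = VACUUM_RETENTION_HOURS[schema]
    return f"VACUUM {qualified_name} RETAIN {hours} HOURS"

stmt = vacuum_statement("silver.slot_cleansed")
print(stmt)  # VACUUM silver.slot_cleansed RETAIN 336 HOURS
# In a notebook: spark.sql(stmt)
```

Note that VACUUM permanently removes the data files that time travel depends on, so the retention window caps how far back `timestampAsOf` audits can reach.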
📐 OneLake Shortcuts and External Data¶
Shortcuts in Files vs Tables¶
OneLake shortcuts can point to external storage (ADLS Gen2, S3, GCS) and appear in either section:
| Shortcut Target | Place In | Behavior |
|---|---|---|
| External Delta tables | Tables/ | Full SQL + Direct Lake access |
| External Parquet files | Tables/ (as Delta) or Files/ | Tables if Delta-compatible |
| External CSV / JSON | Files/ | Not queryable via SQL endpoint |
| External blob storage | Files/ | Read via Spark only |
| Another Fabric lakehouse | Tables/ or Files/ | Mirrors source structure |
# Reading data from a shortcut in Files section
df_external = spark.read.format("parquet") \
.load("Files/shortcuts/partner_data/transactions/")
# Promoting shortcut data to a managed Delta table
df_external.write.format("delta") \
.mode("overwrite") \
.saveAsTable("bronze.partner_transactions")
# Now available via SQL endpoint and Direct Lake
📊 Impact on Downstream Services¶
| Service | Files Section | Tables Section |
|---|---|---|
| SQL Endpoint | Not visible | Full access |
| Direct Lake (Power BI) | Not compatible | Fully compatible |
| Warehouse queries | Not accessible | Cross-database queries |
| Dataflows Gen2 | Can read as source | Read/write destination |
| Spark notebooks | Full read/write | Full read/write |
| KQL (Eventhouse) | Not applicable | Shortcut-queryable |
| Pipelines (Copy Activity) | Source/destination | Source/destination |
| OneLake Shortcuts | Supported | Supported |
| Data Activator | Not supported | Supported (via SQL endpoint) |
📚 References¶
- Microsoft Learn: OneLake Lakehouse
- Direct Lake Mode Requirements
- Delta Lake on Fabric
- Lakehouse Schemas
- V-Order Optimization
Next: Lakehouse Schema Versioning | V-Order Tuning Deep Dive