Home > Tutorials > Tutorial 16: Performance Tuning & Optimization
⚡ Tutorial 16: Performance Tuning & Optimization¶
Last Updated: 2026-04-15 | Version: 2.0 Status: ✅ Final | Maintainer: Documentation Team
Third-party references — publicly sourced, good-faith comparison
This page references non-Microsoft products and services. That information is drawn from each vendor's publicly available documentation and is offered for honest, good-faith comparison only. This is a personal project written from a Microsoft Fabric and Azure perspective; it does not claim expertise in, or authority over, any third-party product, and nothing here is an official statement by, or endorsed by, those vendors. Capabilities, pricing, and features change often — always verify against the vendor's current official documentation. Where a third-party offering is the stronger choice, we say so plainly.
⚡ Tutorial 16: Performance Tuning & Optimization¶
| Difficulty | Advanced |
| Time | 2.5 hours |
| Focus | Performance, Optimization, Benchmarking |
Progress Tracker¶
+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+
| 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 |
| SETUP | BRONZE | SILVER | GOLD | RT | PBI | PIPES | GOV | MIRROR | AI/ML |
+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+
| | | | | | | | | | |
+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+
+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+
| 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
| TERADATA| SAS | CI/CD | PLANNING| SECURITY| COST | PERF | MONITOR | DISASTER| E2E |
+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+
| | | | | | | | | | |
+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+
^
YOU ARE HERE
| Navigation | |
|---|---|
| Previous | 15-Cost Management |
| Next | 17-Monitoring & Alerting |
📋 Overview¶
This tutorial provides comprehensive guidance on optimizing performance across all Microsoft Fabric workloads. You will learn techniques for tuning Delta Lake tables, optimizing Spark configurations, improving query performance, and establishing performance baselines for your casino analytics platform.
Performance optimization in Microsoft Fabric spans multiple layers: - Storage Layer - V-Order, partitioning, file compaction - Compute Layer - Spark configuration, resource allocation - Query Layer - Predicate pushdown, join optimization - Semantic Layer - Direct Lake tuning, aggregations - Real-Time Layer - KQL query optimization
🎯 Learning Objectives¶
By the end of this tutorial, you will be able to:
- Implement V-Order optimization for Delta Lake tables
- Design effective partitioning strategies for casino data
- Apply file compaction and Z-Order for multi-column queries
- Optimize Spark configurations for different workload types
- Tune Direct Lake semantic models for faster Power BI reports
- Implement caching strategies for frequently accessed data
- Optimize KQL queries for real-time analytics
- Establish performance baselines and benchmarking practices
- Choose between Lakehouse and Warehouse for specific workloads
- Monitor and continuously improve query performance
🏗️ Performance Optimization Architecture¶
flowchart TB
subgraph Storage["Storage Optimization"]
VORDER[V-Order<br/>Optimization]
PARTITION[Partitioning<br/>Strategy]
COMPACT[File<br/>Compaction]
ZORDER[Z-Order<br/>Indexing]
end
subgraph Compute["Compute Optimization"]
SPARK[Spark<br/>Configuration]
MEMORY[Memory<br/>Management]
PARALLEL[Parallelism<br/>Tuning]
CACHE[Data<br/>Caching]
end
subgraph Query["Query Optimization"]
PREDICATE[Predicate<br/>Pushdown]
COLUMN[Column<br/>Pruning]
JOIN[Join<br/>Optimization]
STATS[Statistics<br/>Collection]
end
subgraph Semantic["Semantic Layer"]
DIRECTLAKE[Direct Lake<br/>Tuning]
AGG[Aggregation<br/>Tables]
COMPOSITE[Composite<br/>Models]
end
subgraph Realtime["Real-Time"]
KQL[KQL Query<br/>Optimization]
MATERIALIZE[Materialized<br/>Views]
UPDATE[Update<br/>Policies]
end
Storage --> Compute
Compute --> Query
Query --> Semantic
Query --> Realtime
style Storage fill:#3b82f6,color:#fff
style Compute fill:#8b5cf6,color:#fff
style Query fill:#f59e0b,color:#fff
style Semantic fill:#10b981,color:#fff
style Realtime fill:#ef4444,color:#fff ✅ Prerequisites¶
Before starting this tutorial, ensure you have:
- Completed Tutorial 00-03 (Environment through Gold Layer)
- Bronze, Silver, and Gold Lakehouses with data loaded
- At least 1 million rows in slot telemetry tables
- Access to Fabric workspace with Contributor permissions
- Basic understanding of Delta Lake and Spark
Tip: Performance tuning is most effective with realistic data volumes. If you have less than 1 million rows, generate additional test data using the data generators.
📦 Step 1: V-Order Optimization for Delta Lake¶
1.1 Understanding V-Order¶

Source: Delta Lake table optimization and V-Order
V-Order is a Fabric-specific write optimization that applies special sorting, row group distribution, and compression to Parquet files. It provides:
- 15-50% faster reads for analytical queries
- Reduced storage costs through better compression
- Optimized for Direct Lake mode in Power BI
flowchart LR
subgraph Without["Without V-Order"]
A1[Row Group 1<br/>Mixed Data]
A2[Row Group 2<br/>Mixed Data]
A3[Row Group 3<br/>Mixed Data]
end
subgraph With["With V-Order"]
B1[Row Group 1<br/>Sorted + Compressed]
B2[Row Group 2<br/>Sorted + Compressed]
B3[Row Group 3<br/>Sorted + Compressed]
end
Without -->|V-Order Applied| With
style Without fill:#fee2e2,color:#000
style With fill:#dcfce7,color:#000 1.2 Enable V-Order for Tables¶
V-Order is enabled by default in Fabric, but you can explicitly control it:
# Cell 1: Check V-Order Status
# ============================
# Verify V-Order is enabled for your tables
from delta.tables import DeltaTable
# Check table properties
table_name = "gold_slot_performance"
delta_table = DeltaTable.forName(spark, table_name)
# Get table properties
properties = delta_table.detail().select("properties").collect()[0][0]
print(f"Table: {table_name}")
print(f"Properties: {properties}")
# V-Order should be enabled by default in Fabric
# Property: delta.parquet.vorder.enabled = true
# Cell 2: Create V-Order Optimized Table
# ======================================
# Explicitly create a table with V-Order optimization
df_slot_perf = spark.table("silver_slot_enriched") \
.groupBy("machine_id", "game_type", "denomination") \
.agg(
F.count("*").alias("total_spins"),
F.sum("coin_in").alias("total_coin_in"),
F.sum("coin_out").alias("total_coin_out"),
F.avg("hold_percentage").alias("avg_hold_pct")
)
# Write with V-Order enabled (default in Fabric)
df_slot_perf.write \
.format("delta") \
.mode("overwrite") \
.option("delta.parquet.vorder.enabled", "true") \
.saveAsTable("gold_slot_performance_vorder")
print("Table created with V-Order optimization")
1.3 V-Order Performance Comparison¶
# Cell 3: Benchmark V-Order Performance
# =====================================
# Compare query performance with and without V-Order
import time
def benchmark_query(table_name: str, query_desc: str, iterations: int = 5):
"""Benchmark a query against a table."""
times = []
for i in range(iterations):
spark.catalog.clearCache()
start = time.time()
df = spark.sql(f"""
SELECT
machine_id,
SUM(coin_in) as total_coin_in,
SUM(coin_out) as total_coin_out,
COUNT(*) as spin_count
FROM {table_name}
WHERE event_date >= '2024-01-01'
GROUP BY machine_id
ORDER BY total_coin_in DESC
LIMIT 100
""")
df.collect()
elapsed = time.time() - start
times.append(elapsed)
avg_time = sum(times) / len(times)
min_time = min(times)
max_time = max(times)
print(f"{query_desc}:")
print(f" Average: {avg_time:.3f}s | Min: {min_time:.3f}s | Max: {max_time:.3f}s")
return avg_time
# Benchmark both tables
time_without = benchmark_query("gold_slot_performance", "Without V-Order")
time_with = benchmark_query("gold_slot_performance_vorder", "With V-Order")
improvement = ((time_without - time_with) / time_without) * 100
print(f"\nPerformance Improvement: {improvement:.1f}%")
🗂️ Step 2: Partitioning Strategies¶
2.1 Partitioning Decision Matrix¶
| Data Type | Recommended Partition | Rationale |
|---|---|---|
| Slot Telemetry | event_date | Time-series queries, daily aggregations |
| Player Activity | event_date | Session analysis by date |
| Financial Transactions | transaction_date | Daily reconciliation |
| Player Profiles | loyalty_tier | Segment-based analysis |
| Compliance Data | report_date, report_type | Regulatory reporting |
2.2 Partition by Date (Most Common)¶
# Cell 4: Date-Based Partitioning
# ===============================
# Optimal for time-series data like slot telemetry
from pyspark.sql import functions as F
# Read source data
df_telemetry = spark.table("bronze_slot_telemetry")
# Add partition column if not exists
df_partitioned = df_telemetry \
.withColumn("event_date", F.to_date("event_timestamp"))
# Write with date partitioning
df_partitioned.write \
.format("delta") \
.mode("overwrite") \
.partitionBy("event_date") \
.option("delta.parquet.vorder.enabled", "true") \
.saveAsTable("silver_slot_telemetry_partitioned")
print("Table partitioned by event_date")
# Verify partition structure
partitions = spark.sql("""
DESCRIBE DETAIL silver_slot_telemetry_partitioned
""").select("numFiles", "sizeInBytes").collect()[0]
print(f"Number of files: {partitions[0]}")
print(f"Total size: {partitions[1] / (1024*1024):.2f} MB")
2.3 Multi-Level Partitioning¶
# Cell 5: Multi-Level Partitioning
# ================================
# For large tables with multiple access patterns
# Casino floor data partitioned by date and region
df_floor = spark.table("bronze_floor_activity")
df_floor_partitioned = df_floor \
.withColumn("event_date", F.to_date("event_timestamp")) \
.withColumn("region", F.col("floor_section"))
# Write with hierarchical partitioning
df_floor_partitioned.write \
.format("delta") \
.mode("overwrite") \
.partitionBy("event_date", "region") \
.saveAsTable("silver_floor_activity_partitioned")
# Query benefits from partition pruning
spark.sql("""
SELECT *
FROM silver_floor_activity_partitioned
WHERE event_date = '2024-01-15'
AND region = 'HIGH_LIMIT'
""").explain(True) # Shows partition pruning in plan
2.4 Partition by Player Tier¶
# Cell 6: Tier-Based Partitioning
# ===============================
# For player analytics with tier-based access patterns
df_player = spark.table("silver_player_enriched")
# Partition by loyalty tier for tier-specific analysis
df_player.write \
.format("delta") \
.mode("overwrite") \
.partitionBy("loyalty_tier") \
.option("delta.parquet.vorder.enabled", "true") \
.saveAsTable("gold_player_by_tier")
# Verify partition distribution
spark.sql("""
SELECT
loyalty_tier,
COUNT(*) as player_count
FROM gold_player_by_tier
GROUP BY loyalty_tier
ORDER BY player_count DESC
""").show()
2.5 Partitioning Best Practices¶
| Practice | Recommendation |
|---|---|
| Partition Column Cardinality | 10-10,000 distinct values |
| Partition Size | Target 100MB - 1GB per partition |
| Over-Partitioning | Avoid too many small files |
| Under-Partitioning | Avoid single huge partitions |
| Query Alignment | Partition on frequently filtered columns |
🔧 Step 3: File Compaction and OPTIMIZE¶
3.1 Understanding Small File Problem¶
flowchart LR
subgraph Before["Before OPTIMIZE"]
F1[5MB]
F2[10MB]
F3[2MB]
F4[8MB]
F5[3MB]
F6[12MB]
end
subgraph After["After OPTIMIZE"]
G1[128MB]
end
Before -->|OPTIMIZE| After
style Before fill:#fee2e2,color:#000
style After fill:#dcfce7,color:#000 3.2 OPTIMIZE Command¶
# Cell 7: File Compaction with OPTIMIZE
# =====================================
# Compact small files into larger, more efficient files
# Check current file count before optimization
before_stats = spark.sql("""
DESCRIBE DETAIL silver_slot_telemetry_partitioned
""").select("numFiles", "sizeInBytes").collect()[0]
print(f"Before OPTIMIZE:")
print(f" Files: {before_stats[0]}")
print(f" Size: {before_stats[1] / (1024*1024):.2f} MB")
print(f" Avg file size: {before_stats[1] / before_stats[0] / (1024*1024):.2f} MB")
# Run OPTIMIZE
spark.sql("""
OPTIMIZE silver_slot_telemetry_partitioned
""")
# Check after optimization
after_stats = spark.sql("""
DESCRIBE DETAIL silver_slot_telemetry_partitioned
""").select("numFiles", "sizeInBytes").collect()[0]
print(f"\nAfter OPTIMIZE:")
print(f" Files: {after_stats[0]}")
print(f" Size: {after_stats[1] / (1024*1024):.2f} MB")
print(f" Avg file size: {after_stats[1] / after_stats[0] / (1024*1024):.2f} MB")
reduction = ((before_stats[0] - after_stats[0]) / before_stats[0]) * 100
print(f"\nFile count reduction: {reduction:.1f}%")
3.3 OPTIMIZE with WHERE Clause¶
# Cell 8: Incremental OPTIMIZE
# ============================
# Optimize only recent partitions for efficiency
from datetime import datetime, timedelta
# Calculate date range for recent data
end_date = datetime.now().strftime('%Y-%m-%d')
start_date = (datetime.now() - timedelta(days=7)).strftime('%Y-%m-%d')
# Optimize only recent partitions
spark.sql(f"""
OPTIMIZE silver_slot_telemetry_partitioned
WHERE event_date >= '{start_date}'
AND event_date <= '{end_date}'
""")
print(f"Optimized partitions from {start_date} to {end_date}")
3.4 Z-Order for Multi-Column Queries¶
Z-Order co-locates related data in the same files, improving query performance when filtering on multiple columns.
# Cell 9: Z-Order Optimization
# ============================
# Optimize for multi-column query patterns
# Z-Order by frequently filtered columns together
spark.sql("""
OPTIMIZE gold_slot_performance
ZORDER BY (machine_id, game_type)
""")
print("Z-Order applied on machine_id and game_type")
# This benefits queries like:
# SELECT * FROM gold_slot_performance
# WHERE machine_id = 'SLOT-001' AND game_type = 'VIDEO_POKER'
3.5 OPTIMIZE Scheduling¶
# Cell 10: Automated OPTIMIZE Pipeline
# ====================================
# Schedule OPTIMIZE as part of data pipeline
def optimize_table_with_logging(table_name: str, zorder_columns: list = None):
"""Optimize a Delta table with logging."""
import time
start = time.time()
# Get before stats
before = spark.sql(f"DESCRIBE DETAIL {table_name}").collect()[0]
# Build OPTIMIZE command
if zorder_columns:
zorder_clause = f"ZORDER BY ({', '.join(zorder_columns)})"
spark.sql(f"OPTIMIZE {table_name} {zorder_clause}")
else:
spark.sql(f"OPTIMIZE {table_name}")
# Get after stats
after = spark.sql(f"DESCRIBE DETAIL {table_name}").collect()[0]
elapsed = time.time() - start
print(f"Table: {table_name}")
print(f" Duration: {elapsed:.2f}s")
print(f" Files: {before['numFiles']} -> {after['numFiles']}")
print(f" Size: {before['sizeInBytes']/(1024*1024):.2f}MB -> {after['sizeInBytes']/(1024*1024):.2f}MB")
return {
"table": table_name,
"duration_seconds": elapsed,
"files_before": before['numFiles'],
"files_after": after['numFiles']
}
# Example: Optimize multiple tables
tables_to_optimize = [
("gold_slot_performance", ["machine_id", "game_type"]),
("gold_player_360", ["player_id"]),
("gold_daily_revenue", ["revenue_date", "property_id"])
]
for table_name, zorder_cols in tables_to_optimize:
try:
optimize_table_with_logging(table_name, zorder_cols)
except Exception as e:
print(f"Error optimizing {table_name}: {e}")
🔍 Step 4: Query Optimization Patterns¶
4.1 Predicate Pushdown¶
flowchart LR
subgraph Without["Without Pushdown"]
A1[Read All Data] --> A2[Filter in Memory]
end
subgraph With["With Pushdown"]
B1[Push Filter to Storage] --> B2[Read Only Matching]
end
style Without fill:#fee2e2,color:#000
style With fill:#dcfce7,color:#000 # Cell 11: Predicate Pushdown Verification
# ========================================
# Ensure filters are pushed down to storage layer
# Good: Filter pushed to storage
query_good = """
SELECT machine_id, SUM(coin_in) as total_coin_in
FROM silver_slot_telemetry_partitioned
WHERE event_date = '2024-01-15'
AND machine_id LIKE 'SLOT-001%'
GROUP BY machine_id
"""
# Explain shows partition pruning and filter pushdown
print("Query Execution Plan:")
spark.sql(query_good).explain(True)
# Look for:
# - PartitionFilters: [event_date = 2024-01-15]
# - PushedFilters: [machine_id LIKE 'SLOT-001%']
4.2 Column Pruning¶
# Cell 12: Column Pruning Best Practices
# ======================================
# Select only needed columns to reduce I/O
# Bad: SELECT * reads all columns
query_bad = """
SELECT *
FROM silver_slot_telemetry_partitioned
WHERE event_date = '2024-01-15'
"""
# Good: Select only needed columns
query_good = """
SELECT
machine_id,
event_timestamp,
coin_in,
coin_out
FROM silver_slot_telemetry_partitioned
WHERE event_date = '2024-01-15'
"""
# Compare read bytes
import time
# Test SELECT *
spark.catalog.clearCache()
start = time.time()
df_bad = spark.sql(query_bad)
df_bad.write.format("noop").mode("overwrite").save()
time_bad = time.time() - start
# Test SELECT specific columns
spark.catalog.clearCache()
start = time.time()
df_good = spark.sql(query_good)
df_good.write.format("noop").mode("overwrite").save()
time_good = time.time() - start
print(f"SELECT *: {time_bad:.3f}s")
print(f"SELECT columns: {time_good:.3f}s")
print(f"Improvement: {((time_bad - time_good) / time_bad) * 100:.1f}%")
4.3 Join Optimization¶
# Cell 13: Join Optimization Strategies
# =====================================
# Optimize joins for casino analytics queries
from pyspark.sql import functions as F
# Read tables
df_transactions = spark.table("silver_slot_transactions")
df_players = spark.table("silver_player_profile")
df_machines = spark.table("dim_slot_machines")
# Strategy 1: Broadcast small tables
# For tables < 10MB, broadcast to all executors
df_result = df_transactions.join(
F.broadcast(df_machines),
on="machine_id",
how="inner"
)
# Strategy 2: Filter before join
# Reduce data volume before expensive joins
df_filtered = df_transactions.filter(
F.col("event_date") == "2024-01-15"
)
df_result = df_filtered.join(
df_players,
on="player_id",
how="inner"
)
# Strategy 3: Use join hints
df_result = df_transactions.hint("merge").join(
df_players,
on="player_id",
how="inner"
)
# View execution plan
df_result.explain(True)
4.4 Aggregate Optimization¶
# Cell 14: Aggregation Optimization
# =================================
# Optimize GROUP BY and window functions
# Pre-aggregate at lower granularity
# Instead of aggregating raw telemetry every time,
# create summary tables
# Hourly slot summary
df_hourly = spark.sql("""
SELECT
machine_id,
date_trunc('hour', event_timestamp) as event_hour,
COUNT(*) as spin_count,
SUM(coin_in) as total_coin_in,
SUM(coin_out) as total_coin_out,
SUM(CASE WHEN is_jackpot THEN 1 ELSE 0 END) as jackpot_count
FROM silver_slot_telemetry_partitioned
GROUP BY
machine_id,
date_trunc('hour', event_timestamp)
""")
df_hourly.write \
.format("delta") \
.mode("overwrite") \
.partitionBy("event_hour") \
.saveAsTable("gold_slot_hourly_summary")
print("Hourly summary table created for faster aggregations")
🏛️ Step 5: Lakehouse vs Warehouse Decision Matrix¶

Source: OneLake, the OneDrive for data
5.1 Workload Comparison¶
| Criteria | Lakehouse | Warehouse |
|---|---|---|
| Query Language | Spark SQL, Python | T-SQL |
| Data Format | Delta Lake (Parquet) | Proprietary |
| Best For | Data engineering, ML, exploration | BI, ad-hoc queries, reporting |
| Concurrency | High (Spark) | Very High (distributed query) |
| Query Patterns | Complex transforms, joins | Star schema, aggregations |
| Direct Lake | Yes (native) | Yes (via SQL endpoint) |
| Cost Model | Compute Units (CU) | Compute Units (CU) |
5.2 When to Use Lakehouse¶
flowchart TD
START[New Workload] --> Q1{Data Engineering<br/>Heavy?}
Q1 -->|Yes| LH[Use Lakehouse]
Q1 -->|No| Q2{ML/AI<br/>Workloads?}
Q2 -->|Yes| LH
Q2 -->|No| Q3{Complex<br/>Transformations?}
Q3 -->|Yes| LH
Q3 -->|No| Q4{Streaming<br/>Data?}
Q4 -->|Yes| LH
Q4 -->|No| WH[Use Warehouse]
style LH fill:#3b82f6,color:#fff
style WH fill:#10b981,color:#fff 5.3 When to Use Warehouse¶
# Cell 15: Warehouse for BI-Optimized Queries
# ===========================================
# Create optimized tables in Warehouse for reporting
# In Fabric Warehouse SQL:
"""
-- Create dimension table with clustered columnstore
CREATE TABLE dim_player (
player_id VARCHAR(50) NOT NULL,
player_name VARCHAR(200),
loyalty_tier VARCHAR(20),
enrollment_date DATE,
home_property VARCHAR(50),
player_segment VARCHAR(50)
);
-- Create fact table with distribution
CREATE TABLE fact_slot_activity (
activity_id BIGINT NOT NULL,
player_id VARCHAR(50),
machine_id VARCHAR(50),
activity_date DATE,
coin_in DECIMAL(18,2),
coin_out DECIMAL(18,2),
theo_win DECIMAL(18,2)
)
WITH (
DISTRIBUTION = HASH(player_id),
CLUSTERED COLUMNSTORE INDEX
);
-- Create statistics for query optimization
CREATE STATISTICS stat_player_id ON fact_slot_activity(player_id);
CREATE STATISTICS stat_activity_date ON fact_slot_activity(activity_date);
"""
print("Warehouse tables optimized for star schema queries")
5.4 Hybrid Architecture¶
flowchart LR
subgraph Ingestion["Data Ingestion"]
STREAM[Streaming Data]
BATCH[Batch Data]
end
subgraph Lakehouse["Lakehouse (Engineering)"]
BRONZE[Bronze Layer]
SILVER[Silver Layer]
ML[ML Models]
end
subgraph Warehouse["Warehouse (BI)"]
DIMS[Dimensions]
FACTS[Fact Tables]
VIEWS[Optimized Views]
end
subgraph Semantic["Semantic Layer"]
MODEL[Semantic Model]
PBI[Power BI Reports]
end
STREAM --> BRONZE
BATCH --> BRONZE
BRONZE --> SILVER
SILVER --> ML
SILVER --> FACTS
SILVER --> DIMS
FACTS --> VIEWS
DIMS --> VIEWS
VIEWS --> MODEL
MODEL --> PBI
style Lakehouse fill:#3b82f6,color:#fff
style Warehouse fill:#10b981,color:#fff
style Semantic fill:#f59e0b,color:#fff 🚀 Step 6: Direct Lake Performance Tuning¶
6.1 Direct Lake Architecture¶

Source: Direct Lake overview
flowchart LR
subgraph OneLake["OneLake Storage"]
DELTA[Delta Tables<br/>V-Order Optimized]
end
subgraph DirectLake["Direct Lake Mode"]
METADATA[Metadata<br/>Cache]
FRAMING[Framing<br/>Engine]
end
subgraph VertiPaq["VertiPaq Engine"]
INMEM[In-Memory<br/>Columns]
QUERY[Query<br/>Processing]
end
subgraph PowerBI["Power BI"]
VISUAL[Visualizations]
end
DELTA --> METADATA
METADATA --> FRAMING
FRAMING --> INMEM
INMEM --> QUERY
QUERY --> VISUAL
style OneLake fill:#3b82f6,color:#fff
style DirectLake fill:#8b5cf6,color:#fff
style VertiPaq fill:#f59e0b,color:#fff 6.2 Direct Lake Optimization Checklist¶
| Optimization | Description | Impact |
|---|---|---|
| V-Order Enabled | Faster column reads | High |
| Small Files Eliminated | OPTIMIZE regularly | High |
| Partition Pruning | Use date partitions | Medium |
| Column Selection | Only needed columns | Medium |
| Row-Level Security | Apply at model level | Low |
6.3 Framing Optimization¶
# Cell 16: Optimize Tables for Direct Lake
# ========================================
# Prepare tables for optimal Direct Lake performance
# List of tables used by Direct Lake semantic model
direct_lake_tables = [
"gold_player_360",
"gold_slot_performance",
"gold_daily_revenue",
"dim_calendar",
"dim_property",
"dim_slot_machine"
]
for table in direct_lake_tables:
print(f"\nOptimizing {table} for Direct Lake...")
# 1. Run OPTIMIZE with V-Order
spark.sql(f"OPTIMIZE {table}")
# 2. Check file statistics
stats = spark.sql(f"DESCRIBE DETAIL {table}").collect()[0]
print(f" Files: {stats['numFiles']}")
print(f" Size: {stats['sizeInBytes'] / (1024*1024):.2f} MB")
print(f" Avg file size: {stats['sizeInBytes'] / max(stats['numFiles'], 1) / (1024*1024):.2f} MB")
# 3. Warn if too many small files
avg_size_mb = stats['sizeInBytes'] / max(stats['numFiles'], 1) / (1024*1024)
if avg_size_mb < 50:
print(f" WARNING: Average file size ({avg_size_mb:.2f} MB) is small. Consider more aggressive OPTIMIZE.")
print("\nDirect Lake optimization complete")
6.4 Semantic Model Aggregations¶
// DAX Measure with Aggregation Awareness
// ======================================
// Create aggregation tables for common patterns
// In Power BI Desktop or Fabric Semantic Model:
// 1. Create aggregation table for daily revenue
Daily Revenue Agg =
SUMMARIZE(
fact_SlotActivity,
dim_Calendar[Date],
dim_Property[PropertyID],
"Total Coin In", SUM(fact_SlotActivity[CoinIn]),
"Total Coin Out", SUM(fact_SlotActivity[CoinOut]),
"Total Theo Win", SUM(fact_SlotActivity[TheoWin]),
"Spin Count", COUNT(fact_SlotActivity[ActivityID])
)
// 2. Set aggregation behavior
// In Model view, select aggregation table
// Map columns to detail table
// Set aggregation type (Sum, Count, etc.)
⚙️ Step 7: Spark Configuration Tuning¶

Source: Configure Apache Spark in Microsoft Fabric
7.1 Key Spark Configurations¶
| Configuration | Default | Recommendation | Use Case |
|---|---|---|---|
spark.sql.shuffle.partitions | 200 | 2x CPU cores | Shuffle operations |
spark.sql.files.maxPartitionBytes | 128MB | 256MB for large files | File reading |
spark.sql.autoBroadcastJoinThreshold | 10MB | 50MB for small dims | Broadcast joins |
spark.sql.adaptive.enabled | true | true | Adaptive query execution |
spark.databricks.delta.optimizeWrite.enabled | true | true | Write optimization |
7.2 Configure Spark Session¶
# Cell 17: Spark Configuration for Casino Analytics
# =================================================
# Optimize Spark settings for casino workloads
# View current configuration
print("Current Spark Configuration:")
print(f" shuffle.partitions: {spark.conf.get('spark.sql.shuffle.partitions')}")
print(f" maxPartitionBytes: {spark.conf.get('spark.sql.files.maxPartitionBytes')}")
print(f" autoBroadcastJoinThreshold: {spark.conf.get('spark.sql.autoBroadcastJoinThreshold')}")
print(f" adaptive.enabled: {spark.conf.get('spark.sql.adaptive.enabled')}")
# Configure for large slot telemetry processing
spark.conf.set("spark.sql.shuffle.partitions", "400") # For F64 capacity
spark.conf.set("spark.sql.files.maxPartitionBytes", "268435456") # 256MB
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "52428800") # 50MB
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
print("\nUpdated Spark Configuration:")
print(f" shuffle.partitions: {spark.conf.get('spark.sql.shuffle.partitions')}")
print(f" maxPartitionBytes: {spark.conf.get('spark.sql.files.maxPartitionBytes')}")
print(f" autoBroadcastJoinThreshold: {spark.conf.get('spark.sql.autoBroadcastJoinThreshold')}")
7.3 Workload-Specific Configurations¶
# Cell 18: Workload-Specific Spark Configs
# ========================================
# Configure based on workload type
def configure_for_workload(workload_type: str):
"""Configure Spark for specific workload types."""
configs = {
"batch_ingestion": {
"spark.sql.shuffle.partitions": "200",
"spark.databricks.delta.optimizeWrite.enabled": "true",
"spark.databricks.delta.autoCompact.enabled": "true",
"spark.sql.files.maxPartitionBytes": "134217728" # 128MB
},
"aggregation": {
"spark.sql.shuffle.partitions": "400",
"spark.sql.adaptive.enabled": "true",
"spark.sql.adaptive.coalescePartitions.enabled": "true",
"spark.sql.autoBroadcastJoinThreshold": "104857600" # 100MB
},
"ml_training": {
"spark.sql.shuffle.partitions": "100",
"spark.sql.execution.arrow.pyspark.enabled": "true",
"spark.sql.execution.arrow.maxRecordsPerBatch": "50000",
"spark.driver.maxResultSize": "4g"
},
"interactive": {
"spark.sql.shuffle.partitions": "50",
"spark.sql.adaptive.enabled": "true",
"spark.sql.adaptive.localShuffleReader.enabled": "true",
"spark.databricks.delta.stalenessLimit": "1h"
}
}
if workload_type not in configs:
print(f"Unknown workload type: {workload_type}")
return
print(f"Configuring Spark for {workload_type}...")
for key, value in configs[workload_type].items():
spark.conf.set(key, value)
print(f" {key} = {value}")
# Example: Configure for aggregation workload
configure_for_workload("aggregation")
💾 Step 8: Caching Strategies¶
8.1 Caching Options in Fabric¶
| Cache Type | Scope | Persistence | Use Case |
|---|---|---|---|
| Result Cache | Query | Session | Repeated queries |
| Dataset Cache | Table/DataFrame | Session | Iterative processing |
| Delta Cache | Files | Disk | Frequently accessed data |
| Direct Lake Cache | Semantic Model | Memory | Power BI queries |
8.2 DataFrame Caching¶
# Cell 19: DataFrame Caching for Iterative Processing
# ===================================================
# Cache frequently accessed DataFrames
# Read dimension tables that will be used multiple times
df_players = spark.table("dim_player")
df_machines = spark.table("dim_slot_machine")
df_calendar = spark.table("dim_calendar")
# Cache small dimension tables
df_players.cache()
df_machines.cache()
df_calendar.cache()
# Force materialization
df_players.count()
df_machines.count()
df_calendar.count()
print("Dimension tables cached:")
print(f" Players: {df_players.count():,} rows")
print(f" Machines: {df_machines.count():,} rows")
print(f" Calendar: {df_calendar.count():,} rows")
# View cache status
print("\nCache Status:")
for (id, rdd) in spark.sparkContext._jsc.sc().getRDDStorageInfo():
print(f" RDD {id}: {rdd.memSize()} bytes in memory")
8.3 Persist with Storage Level¶
# Cell 20: Persist with Custom Storage Levels
# ============================================
# Choose appropriate storage level based on data size
from pyspark import StorageLevel
# For data that fits in memory
df_high_value_players = spark.sql("""
SELECT * FROM gold_player_360
WHERE lifetime_value > 10000
""")
df_high_value_players.persist(StorageLevel.MEMORY_AND_DISK)
# For large data with serialization
df_all_transactions = spark.table("silver_slot_transactions")
df_all_transactions.persist(StorageLevel.MEMORY_AND_DISK_SER)
print("Persisted DataFrames:")
print(f" High-value players: {df_high_value_players.count():,} rows")
print(f" All transactions: {df_all_transactions.count():,} rows")
# Clean up when done
# df_high_value_players.unpersist()
# df_all_transactions.unpersist()
8.4 Delta Cache¶
# Cell 21: Delta Cache Configuration
# ==================================
# Enable Delta caching for frequently accessed tables
# Delta cache is automatic in Fabric, but you can optimize
# Check if Delta cache is being used
spark.sql("""
SELECT
table_name,
cached_bytes,
last_accessed
FROM system.delta_cache_info
ORDER BY last_accessed DESC
""").show(truncate=False)
# Warm the cache for critical tables
critical_tables = [
"gold_player_360",
"gold_slot_performance",
"fact_daily_revenue"
]
for table in critical_tables:
print(f"Warming cache for {table}...")
spark.sql(f"SELECT COUNT(*) FROM {table}").collect()
print("\nDelta cache warmed for critical tables")
📡 Step 9: Real-Time Query Optimization (KQL)¶
9.1 KQL Query Best Practices¶
| Practice | Bad Example | Good Example |
|---|---|---|
| Filter Early | Table \| project cols \| where condition | Table \| where condition \| project cols |
| Limit Results | Table \| summarize ... | Table \| summarize ... \| take 1000 |
| Use Materialized Views | Repeat complex query | Use materialized_view() |
| Index Usage | Full table scan | Use indexed columns |
9.2 Optimized KQL for Casino Real-Time¶
// KQL Query Optimization Examples
// ================================
// Bad: Late filtering
SlotEvents
| project machine_id, event_type, coin_in, event_time
| where event_time > ago(1h)
| where machine_id startswith "SLOT-001"
// Good: Early filtering
SlotEvents
| where event_time > ago(1h)
| where machine_id startswith "SLOT-001"
| project machine_id, event_type, coin_in, event_time
// ================================
// Optimized real-time slot monitoring
SlotEvents
| where event_time > ago(5m)
| summarize
spin_count = count(),
total_coin_in = sum(coin_in),
total_coin_out = sum(coin_out),
jackpot_count = countif(is_jackpot == true),
avg_hold_pct = avg(hold_percentage)
by machine_id, bin(event_time, 1m)
| order by event_time desc
| take 100
// ================================
// Player session real-time tracking
PlayerSessions
| where session_start > ago(30m)
| summarize
active_sessions = dcount(session_id),
total_coin_in = sum(session_coin_in),
avg_session_duration = avg(session_duration_minutes)
by loyalty_tier, bin(session_start, 5m)
| order by session_start desc
// ================================
// Jackpot alert query (optimized for low latency)
SlotEvents
| where event_time > ago(1m)
| where is_jackpot == true
| project
event_time,
machine_id,
player_id,
jackpot_amount = coin_out,
game_type
| order by event_time desc
| take 10
9.3 Materialized Views for Real-Time¶
// Create materialized view for hourly aggregations
// This pre-computes results for faster queries
.create materialized-view with (
lookback = 24h,
autoUpdateSchema = true
) SlotHourlyStats on table SlotEvents
{
SlotEvents
| summarize
spin_count = count(),
total_coin_in = sum(coin_in),
total_coin_out = sum(coin_out),
unique_players = dcount(player_id)
by machine_id, bin(event_time, 1h)
}
// Query the materialized view (much faster)
SlotHourlyStats
| where event_time > ago(24h)
| summarize
daily_spins = sum(spin_count),
daily_coin_in = sum(total_coin_in)
by machine_id
| top 10 by daily_coin_in desc
📊 Step 10: Benchmarking and Baseline Establishment¶
10.1 Benchmark Framework¶
flowchart TD
A[Define Benchmark Queries] --> B[Establish Baseline]
B --> C[Apply Optimizations]
C --> D[Measure Performance]
D --> E{Improvement?}
E -->|Yes| F[Document & Deploy]
E -->|No| G[Rollback & Analyze]
F --> H[Monitor Continuously]
G --> C
style A fill:#3b82f6,color:#fff
style F fill:#10b981,color:#fff
style H fill:#8b5cf6,color:#fff 10.2 Benchmark Query Suite¶
# Cell 22: Performance Benchmark Suite
# ====================================
# Standard queries for performance measurement
import time
import pandas as pd
from typing import Dict, List
class PerformanceBenchmark:
"""Casino analytics performance benchmark suite."""
def __init__(self, spark_session):
self.spark = spark_session
self.results = []
def run_query(self, name: str, query: str, iterations: int = 3) -> Dict:
"""Run a query multiple times and collect metrics."""
times = []
for i in range(iterations):
self.spark.catalog.clearCache()
start = time.time()
df = self.spark.sql(query)
row_count = df.count()
elapsed = time.time() - start
times.append(elapsed)
result = {
"query_name": name,
"avg_time_seconds": sum(times) / len(times),
"min_time_seconds": min(times),
"max_time_seconds": max(times),
"row_count": row_count,
"iterations": iterations
}
self.results.append(result)
return result
def run_benchmark_suite(self) -> pd.DataFrame:
"""Run all benchmark queries."""
queries = {
"slot_daily_summary": """
SELECT
event_date,
COUNT(*) as spin_count,
SUM(coin_in) as total_coin_in,
SUM(coin_out) as total_coin_out
FROM silver_slot_telemetry_partitioned
WHERE event_date >= date_sub(current_date(), 30)
GROUP BY event_date
ORDER BY event_date
""",
"player_360_lookup": """
SELECT *
FROM gold_player_360
WHERE player_id IN (
SELECT player_id
FROM gold_player_360
ORDER BY lifetime_value DESC
LIMIT 100
)
""",
"machine_performance": """
SELECT
machine_id,
game_type,
SUM(coin_in) as total_coin_in,
AVG(hold_percentage) as avg_hold
FROM gold_slot_performance
GROUP BY machine_id, game_type
ORDER BY total_coin_in DESC
LIMIT 50
""",
"revenue_trend": """
SELECT
revenue_date,
property_id,
SUM(gaming_revenue) as total_gaming,
SUM(food_beverage_revenue) as total_fb
FROM gold_daily_revenue
WHERE revenue_date >= date_sub(current_date(), 90)
GROUP BY revenue_date, property_id
ORDER BY revenue_date, property_id
""",
"compliance_check": """
SELECT
player_id,
SUM(CASE WHEN amount >= 10000 THEN 1 ELSE 0 END) as ctr_count,
SUM(amount) as total_amount
FROM silver_financial_transactions
WHERE transaction_date >= date_sub(current_date(), 30)
GROUP BY player_id
HAVING SUM(amount) >= 10000
"""
}
print("Running Performance Benchmark Suite")
print("=" * 60)
for name, query in queries.items():
print(f"\nRunning: {name}")
result = self.run_query(name, query)
print(f" Avg: {result['avg_time_seconds']:.3f}s | Rows: {result['row_count']:,}")
return pd.DataFrame(self.results)
# Run benchmarks
benchmark = PerformanceBenchmark(spark)
results_df = benchmark.run_benchmark_suite()
print("\n" + "=" * 60)
print("BENCHMARK RESULTS SUMMARY")
print("=" * 60)
print(results_df.to_string(index=False))
10.3 Save Baseline Results¶
# Cell 23: Save Benchmark Baseline
# ================================
# Store baseline for comparison after optimizations
from datetime import datetime
# Add metadata to results
results_df["run_timestamp"] = datetime.now()
results_df["environment"] = "production"
results_df["fabric_capacity"] = "F64"
results_df["optimization_version"] = "baseline"
# Convert to Spark DataFrame and save
df_baseline = spark.createDataFrame(results_df)
df_baseline.write \
.format("delta") \
.mode("append") \
.option("mergeSchema", "true") \
.saveAsTable("admin_performance_baselines")
print("Baseline saved to admin_performance_baselines")
# View historical baselines
spark.sql("""
SELECT
query_name,
optimization_version,
avg_time_seconds,
run_timestamp
FROM admin_performance_baselines
ORDER BY run_timestamp DESC
""").show(truncate=False)
10.4 Performance Comparison¶
# Cell 24: Compare Before/After Optimizations
# ===========================================
# Compare performance across optimization versions
def compare_performance():
"""Compare performance across optimization versions."""
comparison = spark.sql("""
WITH baseline AS (
SELECT query_name, avg_time_seconds as baseline_time
FROM admin_performance_baselines
WHERE optimization_version = 'baseline'
),
optimized AS (
SELECT query_name, avg_time_seconds as optimized_time
FROM admin_performance_baselines
WHERE optimization_version = 'optimized_v1'
)
SELECT
b.query_name,
b.baseline_time,
o.optimized_time,
ROUND((b.baseline_time - o.optimized_time) / b.baseline_time * 100, 1)
as improvement_pct
FROM baseline b
JOIN optimized o ON b.query_name = o.query_name
ORDER BY improvement_pct DESC
""")
print("=" * 70)
print("PERFORMANCE COMPARISON: BASELINE vs OPTIMIZED")
print("=" * 70)
comparison.show(truncate=False)
# Calculate overall improvement
overall = spark.sql("""
SELECT
optimization_version,
AVG(avg_time_seconds) as avg_query_time,
SUM(avg_time_seconds) as total_time
FROM admin_performance_baselines
GROUP BY optimization_version
ORDER BY optimization_version
""")
print("\nOVERALL SUMMARY:")
overall.show(truncate=False)
compare_performance()
📋 Performance Optimization Summary Table¶
| Technique | Target | Improvement | When to Apply |
|---|---|---|---|
| V-Order | Storage | 15-50% reads | All tables |
| Partitioning | Storage | 10-100x for filtered queries | Large tables (>1M rows) |
| OPTIMIZE | Storage | 2-10x fewer files | After bulk inserts |
| Z-Order | Storage | 2-5x for multi-column | Frequently co-filtered columns |
| Column Pruning | Query | 20-80% I/O reduction | All queries |
| Predicate Pushdown | Query | 10-100x for filtered | Partition/indexed columns |
| Broadcast Joins | Query | 2-10x join performance | Small dimension tables |
| Caching | Compute | 5-100x repeated queries | Iterative processing |
| Spark Config | Compute | 10-50% overall | Workload-specific |
| Direct Lake Tuning | BI | 2-5x report refresh | Power BI models |
✅ Validation Checklist¶
Before moving to the next tutorial, verify:
- V-Order Enabled - All Gold tables have V-Order optimization
- Partitioning Applied - Large tables partitioned by date/key
- Files Optimized - OPTIMIZE run on all major tables
- Z-Order Applied - Multi-column tables have Z-Order
- Query Plans Verified - Predicate pushdown confirmed
- Baseline Established - Performance metrics saved
- Spark Configured - Optimal settings for workload
- Direct Lake Ready - Tables prepared for semantic model
Verification Commands
### Verify V-Order# Check V-Order is enabled
for table in ["gold_player_360", "gold_slot_performance"]:
props = DeltaTable.forName(spark, table).detail() \
.select("properties").collect()[0][0]
vorder = props.get("delta.parquet.vorder.enabled", "not set")
print(f"{table}: V-Order = {vorder}")
# Verify file compaction
for table in ["gold_player_360", "gold_slot_performance"]:
stats = spark.sql(f"DESCRIBE DETAIL {table}").collect()[0]
avg_size = stats["sizeInBytes"] / max(stats["numFiles"], 1) / (1024*1024)
status = "OK" if avg_size > 50 else "NEEDS OPTIMIZE"
print(f"{table}: {stats['numFiles']} files, {avg_size:.1f} MB avg - {status}")
🔧 Troubleshooting¶
| Issue | Symptom | Solution |
|---|---|---|
| Slow queries after OPTIMIZE | Query time increased | Check Z-Order columns, may need different ordering |
| Out of memory | Job fails with OOM | Reduce shuffle partitions, increase node memory |
| Partition skew | One partition slow | Rebalance data or use salting |
| Cache not working | Repeated queries slow | Check cache size, use MEMORY_AND_DISK |
| Direct Lake fallback | Import mode triggered | Check table size, optimize parquet files |
| Small file problem | Thousands of tiny files | Run OPTIMIZE more frequently |
📌 Best Practices¶
- Measure First - Establish baselines before optimizing
- Optimize Incrementally - One change at a time, measure impact
- Monitor Continuously - Track query performance over time
- Automate Maintenance - Schedule OPTIMIZE and VACUUM
- Right-Size Partitions - 100MB-1GB per partition
- Use V-Order Always - Enable by default in Fabric
- Cache Wisely - Cache dimension tables, not fact tables
- Configure for Workload - Different settings for batch vs. interactive
- Document Changes - Track what optimizations were applied
- Review Regularly - Revisit as data volumes grow
📝 Summary¶
You have successfully learned to optimize performance across Microsoft Fabric:
- Implemented V-Order optimization for Delta Lake tables
- Designed effective partitioning strategies for casino data
- Applied file compaction with OPTIMIZE and Z-Order
- Configured Spark for optimal performance
- Tuned Direct Lake for Power BI semantic models
- Implemented caching strategies for iterative workloads
- Optimized KQL queries for real-time analytics
- Established performance baselines and benchmarking practices
🚀 Next Steps¶
Continue to Tutorial 17: Monitoring & Alerting to learn how to monitor your optimized Fabric environment and set up proactive alerting.
📚 Additional Resources¶
Microsoft Documentation¶
- Delta Lake Optimization
- V-Order in Fabric
- Direct Lake Performance
- Spark Configuration
- KQL Best Practices
Delta Lake Resources¶
Performance Tuning Guides¶
🧭 Navigation¶
| Previous | Up | Next |
|---|---|---|
| ⬅️ 15-Cost Optimization | 📖 Tutorials Index | 17-Monitoring & Alerting ➡️ |