⚡ Spark & Notebooks Best Practices

Last Updated: 2026-04-15 | Version: 2.0 | Status: ✅ Final | Maintainer: Documentation Team

📋 Overview

Apache Spark in Microsoft Fabric provides powerful distributed computing for data engineering and data science workloads. This guide covers Spark optimization, notebook best practices, library management, and performance tuning.

Figure: Notebook Resources and Operations (Source: Microsoft Learn - How to Use Notebooks)


🖥️ Spark Capacity and Cluster Planning

Starter Pool vs Custom Pool

| Pool Type | Use Case | Startup Time | Best For |
|---|---|---|---|
| Starter Pool | Development, testing | 5-10 seconds | Fast iteration, no custom libraries |
| Custom Pool | Production, security | Minutes | MPE, Private Link, custom configs |

Recommendation: Use Starter Pools for development to maximize productivity. Switch to Custom Pools when you need Managed Private Endpoints or Private Link.

Compute Configuration Guidelines

| Scenario | Node Size | Configuration |
|---|---|---|
| Transform-heavy jobs (shuffles, joins) | 16-64 cores | Larger nodes |
| Bursty/unpredictable jobs | Small-Medium | Autoscale + dynamic allocation |
| Many small parallel jobs | Small-Medium | Minimum nodes to avoid cold starts |
| Development work | Small-Medium | Single node mode |
| Large jobs with known partitioning | Match data volume | Manual presizing |
| ML/distributed training | Medium-Large | Maximize parallelism |
| Just Python code | Any | Python kernel |

🚀 Native Execution Engine (NEE)

Enabling NEE

The Native Execution Engine provides 2x-5x performance improvements for many workloads.

Enable at Environment Level:

Workspace Settings > Spark Settings > Native Execution Engine: Enabled

Enable at Session Level (PySpark):

spark.conf.set("spark.native.enabled", "true")

Enable at Session Level (SQL):

SET spark.native.enabled=TRUE;

Performance Impact:

  • Vectorized execution for Spark operations
  • Optimized memory management
  • Reduced serialization overhead
  • Best improvements on analytical queries


⚙️ Session Configuration

%%configure Magic Command

Configure Spark sessions at the beginning of notebooks:

Figure: Spark Session Information (Source: Microsoft Learn - Author and Execute Notebooks)

%%configure
{
    "driverMemory": "28g",
    "driverCores": 4,
    "executorMemory": "28g",
    "executorCores": 4,
    "numExecutors": 2,
    "conf": {
        "spark.sql.shuffle.partitions": "200",
        "spark.sql.files.maxPartitionBytes": "134217728",
        "spark.databricks.delta.autoCompact.enabled": "true"
    }
}

Important Notes:

  • Run %%configure in the first code cell
  • Set the same values for driverMemory and executorMemory
  • Set the same values for driverCores and executorCores
  • In pipeline runs the session cannot be restarted, so %%configure only takes effect when it is the first cell

Key Spark Configurations

| Configuration | Default | Purpose |
|---|---|---|
| spark.sql.shuffle.partitions | 200 | Partitions during shuffle operations |
| spark.sql.files.maxPartitionBytes | 128MB | Max bytes per partition when reading |
| spark.task.cpus | 1 | CPU cores per task |
| spark.databricks.delta.autoCompact.enabled | false | Auto-compact small files |
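
These settings can be read back and overridden per session; a minimal sketch (the override values are illustrative, not recommendations):

# Inspect current values before overriding
print(spark.conf.get("spark.sql.shuffle.partitions"))
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

# Session-level overrides
spark.conf.set("spark.sql.shuffle.partitions", "400")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")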

📖 Read Optimization

Partition Tuning

# Adjust based on data volume
spark.conf.set("spark.sql.files.maxPartitionBytes", "256mb")  # Larger partition size = fewer, larger partitions

# For many small files, reduce partition size
spark.conf.set("spark.sql.files.maxPartitionBytes", "64mb")

Predicate Pushdown

from pyspark.sql.functions import col

# Good: Filter pushed down to the source scan
df = spark.read.format("delta").load("/path") \
    .filter(col("event_date") >= "2024-01-01")

# Bad: Caching first materializes the full table before the filter runs
df = spark.read.format("delta").load("/path").cache()
filtered_df = df.filter(col("event_date") >= "2024-01-01")
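
You can verify pushdown by inspecting the physical plan; the scan node should list the filter under PushedFilters. A minimal sketch:

from pyspark.sql.functions import col

# Confirm the filter reached the scan
df = spark.read.format("delta").load("/path") \
    .filter(col("event_date") >= "2024-01-01")
df.explain()  # look for PushedFilters containing event_date in the scan node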

Column Pruning

# Good: Only read needed columns
df = spark.read.format("delta").load("/path") \
    .select("col1", "col2", "col3")

# Bad: Read all columns
df = spark.read.format("delta").load("/path")

✍️ Write Optimization

Auto Compaction

Enable for pipelines with frequent small writes:

# At session level
spark.conf.set("spark.databricks.delta.autoCompact.enabled", True)

# At table level
spark.sql("""
    ALTER TABLE my_table
    SET TBLPROPERTIES ('delta.autoOptimize.autoCompact' = 'true')
""")

Optimize Write

# Enable optimized writes for better file sizes
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", True)

V-Order for Fabric

V-Order optimization improves read performance across all Fabric engines:

# Write with V-Order
df.write.format("delta") \
    .option("parquet.vorder.enabled", "true") \
    .mode("overwrite") \
    .save("/path/to/table")

🔀 Shuffle Optimization

Tuning Shuffle Partitions

# Default is 200 - adjust based on data size
# Rule of thumb: target 100-200MB per partition after shuffle

# For small datasets
spark.conf.set("spark.sql.shuffle.partitions", "50")

# For large datasets
spark.conf.set("spark.sql.shuffle.partitions", "400")

Adaptive Query Execution (AQE)

AQE automatically optimizes queries at runtime:

# Usually enabled by default
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

🔗 High Concurrency Mode

Benefits

  • Share Spark sessions across multiple notebooks
  • Instant session attachment (no cold start)
  • Better resource utilization

Session Sharing Requirements

Notebooks can share a session if they:

  1. Are run by the same user
  2. Have the same default lakehouse
  3. Have the same Spark configurations
  4. Have the same library packages

Enable High Concurrency

Workspace Settings > Spark Settings > High Concurrency Mode: Enabled

📓 Parallel Notebook Execution

Using runMultiple()

Figure: Assign Notebook Parameters from Pipeline (Source: Microsoft Learn - Author and Execute Notebooks)

# Run notebooks in parallel
mssparkutils.notebook.runMultiple(["NotebookA", "NotebookB", "NotebookC"])

# With DAG structure for dependencies
dag = {
    "activities": [
        {"name": "NotebookA", "path": "NotebookA"},
        {"name": "NotebookB", "path": "NotebookB", "dependencies": ["NotebookA"]},
        {"name": "NotebookC", "path": "NotebookC", "dependencies": ["NotebookA"]}
    ],
    "concurrency": 50,
    "timeoutInSeconds": 43200
}
mssparkutils.notebook.runMultiple(dag)
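
runMultiple handles fan-out; for a single child notebook, mssparkutils.notebook.run takes a timeout (seconds) and a parameter map, and returns whatever the child passes to mssparkutils.notebook.exit. A minimal sketch (notebook name and parameter are illustrative):

# Parent: run one notebook with parameters; returns the child's exit value
result = mssparkutils.notebook.run("NotebookA", 600, {"run_date": "2024-01-01"})
print(result)

# Child (in NotebookA): return a value to the caller
mssparkutils.notebook.exit("rows_processed=12345")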

Concurrency Limits

| Driver Node Size | Max Concurrent Notebooks |
|---|---|
| Small (4 cores) | 4 |
| Medium (8 cores) | 8 |
| Large (16 cores) | 16 |

📦 Library Management

Best Practices by Scenario

| Scenario | Approach |
|---|---|
| Workspace default libraries | Environment attached to the workspace |
| Common libraries across items | Environment attached to notebooks |
| One-time use in interactive sessions | Inline installation (%pip) |
| Production pipelines | Environment (not inline) |

Figure: Manage Environment Resources (Source: Microsoft Learn - How to Use Notebooks)

Environment Libraries

  1. Create environment in workspace
  2. Install required libraries
  3. Attach environment to workspace or specific items

Inline Installation (Development Only)

# Install packages in current session
%pip install pandas==2.0.0
%pip install great-expectations

# After installation, restart Python interpreter
# (but not Spark session)

Warning: Inline installation is disabled by default in pipelines. Not recommended for production.


📊 Resource Profiles

Pre-defined Profiles

# Use predefined resource profiles
spark.conf.set("spark.fabric.profile", "HighConcurrency")

| Profile | Use Case |
|---|---|
| Default | General workloads |
| HighConcurrency | Many small jobs |
| LargeData | Big data processing |
| ML | Machine learning workloads |

🔍 Monitoring and Profiling

Spark History Server

Access detailed execution information:

  • Stage-level details
  • Task-level metrics
  • Skew detection
  • Logical and physical plans

Figure: Spark Jobs Progress Details (Source: Microsoft Learn - Author and Execute Notebooks)

Resource Usage UI

Monitor:

  • Executor utilization
  • Executor scale-up/down patterns
  • Memory usage per stage

Monitoring Hub

30-day metrics for:

  • Notebook execution times
  • Spark Job Definition details
  • Pipeline execution status


🏗️ Delta Lake Best Practices

Table Configuration

# Create table with optimization settings
spark.sql("""
    CREATE TABLE silver_transactions (
        transaction_id STRING,
        amount DECIMAL(18,2),
        event_date DATE
    )
    USING DELTA
    PARTITIONED BY (event_date)
    TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true'
    )
""")

MERGE Operations

from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/path/to/table")

delta_table.alias("target").merge(
    source_df.alias("source"),
    "target.id = source.id"
).whenMatchedUpdate(
    set={"value": "source.value", "updated_at": "current_timestamp()"}
).whenNotMatchedInsert(
    values={"id": "source.id", "value": "source.value", "created_at": "current_timestamp()"}
).execute()

Z-ORDER Clustering

-- Optimize with Z-ORDER for common filter columns
OPTIMIZE my_table ZORDER BY (customer_id, event_date);
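
OPTIMIZE pairs naturally with periodic VACUUM, which removes files no longer referenced by the Delta log; a minimal maintenance sketch (table and columns as in the example above):

# Compact and cluster, then clean up unreferenced files
spark.sql("OPTIMIZE my_table ZORDER BY (customer_id, event_date)")
spark.sql("VACUUM my_table RETAIN 168 HOURS")  # 168 hours = the 7-day default retention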

📂 Notebook Organization

notebooks/
├── _shared/
│   ├── common_functions.py
│   └── config.py
├── bronze/
│   ├── nb_bronze_01_ingest_slot_telemetry.ipynb
│   └── nb_bronze_02_ingest_player_data.ipynb
├── silver/
│   ├── nb_silver_01_cleanse_slot_telemetry.ipynb
│   └── nb_silver_02_cleanse_player_data.ipynb
├── gold/
│   └── nb_gold_01_build_kpis.ipynb
└── ml/
    └── nb_ml_01_churn_prediction.ipynb

Notebook Naming Convention

nb_{layer}_{sequence}_{description}

Examples:
  nb_bronze_01_ingest_slot_telemetry
  nb_silver_02_cleanse_player_data
  nb_gold_03_aggregate_daily_kpis
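
If you want to enforce the convention programmatically, a tiny check is enough; a sketch assuming the layers shown above:

import re

# nb_{layer}_{sequence}_{description}
NOTEBOOK_NAME = re.compile(r"^nb_(bronze|silver|gold|ml)_\d{2}_[a-z0-9_]+$")

assert NOTEBOOK_NAME.match("nb_gold_03_aggregate_daily_kpis")
assert not NOTEBOOK_NAME.match("GoldKPIs_final_v2")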

✅ Performance Checklist

Before Production

  • Enable Native Execution Engine (NEE)
  • Configure appropriate shuffle partitions
  • Enable auto-compaction for write-heavy tables
  • Use V-Order for Fabric-optimized reads
  • Profile notebooks using Spark History Server
  • Test with production-scale data
  • Configure appropriate executor sizing

Ongoing Optimization

  • Monitor executor utilization
  • Check for data skew in shuffle stages
  • Run OPTIMIZE on frequently queried tables
  • Update statistics for query planning (see the sketch after this list)
  • Review and clean up unused notebooks
  • Keep library versions current
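
For the statistics item above, Spark SQL's ANALYZE TABLE collects table- and column-level statistics for the optimizer; a minimal sketch (table and columns illustrative):

# Table-level statistics (row count, size in bytes)
spark.sql("ANALYZE TABLE my_table COMPUTE STATISTICS")

# Column-level statistics for join and filter estimates
spark.sql("ANALYZE TABLE my_table COMPUTE STATISTICS FOR COLUMNS customer_id, event_date")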

📚 Related Documentation

| Document | Description |
|---|---|
| Lakehouse Setup & Organization | Delta Lake configuration and table maintenance |
| Pipelines & Data Movement | Pipeline optimization for data movement |
| Decision Guide | When to use Spark vs other tools |
| Testing Strategies | Unit and integration testing patterns |
| Monitoring & Observability | Spark monitoring and performance tracking |
