🔥 Tutorial 6: Spark Pool Configuration¶
Configure and optimize Apache Spark pools for performance and cost-efficiency. Learn pool sizing, auto-scaling, package management, and performance tuning.
🎯 Learning Objectives¶
- ✅ Create Spark pools with optimal configurations
- ✅ Implement auto-scaling and auto-pause policies
- ✅ Manage Python/Scala packages and library dependencies
- ✅ Tune Spark performance parameters
- ✅ Monitor resource utilization and optimize costs
⏱️ Time Estimate: 30 minutes¶
📋 Prerequisites¶
- ✅ Previous tutorials completed (Synapse workspace, resource group, and storage account in place)
- ✅ Azure CLI with the synapse extension installed
- ✅ workspace-config.json saved from the earlier tutorials
🏗️ Step 1: Create Production Spark Pool¶
# Load workspace settings saved in earlier tutorials
$config = Get-Content "workspace-config.json" | ConvertFrom-Json
# Create medium-sized Spark pool for production workloads
az synapse spark pool create `
--name "sparkmedium" `
--workspace-name $config.WorkspaceName `
--resource-group $config.ResourceGroup `
--spark-version "3.4" `
--node-count 5 `
--node-size Medium `
--node-size-family MemoryOptimized `
--enable-auto-scale true `
--min-node-count 3 `
--max-node-count 10 `
--enable-auto-pause true `
--delay 15 `
--enable-dynamic-exec true `
--tags Environment=Production Workload=Analytics
Write-Host "✅ Production Spark pool created" -ForegroundColor Green
⚙️ Step 2: Configure Spark Settings¶
2.1 Spark Configuration File¶
# spark-defaults.conf
# Dynamic allocation: scale executors with the workload
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 2
spark.dynamicAllocation.maxExecutors 10
# Adaptive query execution: coalesce shuffle partitions at runtime
spark.sql.shuffle.partitions 200
spark.sql.adaptive.enabled true
spark.sql.adaptive.coalescePartitions.enabled true
# Cap input partitions at 128 MB
spark.sql.files.maxPartitionBytes 134217728
# Re-launch straggling tasks speculatively
spark.speculation true
spark.sql.parquet.compression.codec snappy
# Keep event logs for the Spark History Server
spark.eventLog.enabled true
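Once the pool picks up this file (applied via the update command at the end of Step 2.2), you can spot-check a value from any notebook session attached to the pool; spark.conf.get returns the effective session setting:
# Run in a Synapse notebook attached to the pool
print(spark.conf.get("spark.sql.adaptive.enabled"))    # expected: true
print(spark.conf.get("spark.sql.shuffle.partitions"))  # expected: 200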
2.2 Apply Configuration¶
# Store a copy of the configuration in ADLS for reference
# (applying it to the pool is a separate step, shown below)
az storage blob upload `
--account-name $config.StorageAccount `
--container-name "synapsefs" `
--name "spark-config/spark-defaults.conf" `
--file "spark-defaults.conf" `
--auth-mode login
Write-Host "✅ Spark configuration uploaded" -ForegroundColor Green
📦 Step 3: Package Management¶
3.1 Create Requirements File¶
# requirements.txt
# Pin exact versions, and keep them compatible with the pool's
# Synapse runtime (Spark 3.4), which already ships many of these libraries.
pandas==2.1.0
numpy==1.25.0
delta-spark==3.0.0
azure-storage-blob==12.19.0
pyarrow==13.0.0
matplotlib==3.8.0
scikit-learn==1.3.0
3.2 Install Packages¶
# Store a copy of requirements.txt in ADLS for reference
# (the update command below reads the local file)
az storage blob upload `
--account-name $config.StorageAccount `
--container-name "synapsefs" `
--name "spark-config/requirements.txt" `
--file "requirements.txt" `
--auth-mode login
# Update Spark pool to use requirements
az synapse spark pool update `
--name "sparkmedium" `
--workspace-name $config.WorkspaceName `
--resource-group $config.ResourceGroup `
--library-requirements "requirements.txt"
Write-Host "✅ Package requirements configured" -ForegroundColor Green
🎯 Step 4: Performance Tuning¶
4.1 Optimize for Different Workloads¶
ETL Workloads:
spark.conf.set("spark.sql.shuffle.partitions", "200")
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
Machine Learning (executor sizing is read once at session launch, so set these in spark-defaults.conf or at session start via %%configure, shown after this list, rather than with spark.conf.set mid-session):
spark.executor.memory 8g
spark.executor.cores 4
spark.python.worker.memory 2g
Streaming (these two settings apply to the legacy DStream API; Structured Streaming jobs are rate-limited per source instead, e.g. maxOffsetsPerTrigger for Kafka):
spark.conf.set("spark.streaming.backpressure.enabled", "true")
spark.conf.set("spark.streaming.receiver.maxRate", "10000")
4.2 Memory Management¶
# Memory settings are likewise read at session launch; put them in
# spark-defaults.conf (or a %%configure cell) rather than setting them mid-session
spark.executor.memoryOverhead 2g
spark.memory.fraction 0.8
spark.memory.storageFraction 0.3
📊 Step 5: Monitor Performance¶
Synapse has no built-in monitoring.spark_applications table; the query below assumes you have exported Spark application metrics into a table of that shape. If you have not, use the Monitor hub in Synapse Studio or the CLI sketch after the query instead.
-- Query Spark application metrics (assumes an exported metrics table)
SELECT
application_id,
application_name,
start_time,
end_time,
DATEDIFF(second, start_time, end_time) as duration_seconds,
executor_count,
executor_cores_total,
executor_memory_total_gb
FROM monitoring.spark_applications
WHERE start_time >= DATEADD(day, -7, GETDATE())
ORDER BY start_time DESC;
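Without such a table, the CLI can at least list recent Livy sessions on the pool:
# List recent Spark sessions (Livy) on the pool
az synapse spark session list `
--workspace-name $config.WorkspaceName `
--spark-pool-name "sparkmedium" `
--output table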
✅ Validation¶
# Verify Spark pool configuration
az synapse spark pool show `
--name "sparkmedium" `
--workspace-name $config.WorkspaceName `
--resource-group $config.ResourceGroup `
--query "{Name:name, NodeSize:nodeSize, MinNodes:autoScale.minNodeCount, MaxNodes:autoScale.maxNodeCount, AutoPause:autoPause.enabled}" `
--output table
Write-Host "✅ Spark pool validated" -ForegroundColor Green
💡 Best Practices¶
- ✅ Enable auto-pause with a 15-minute delay
- ✅ Use auto-scaling for variable workloads
- ✅ Pin library versions in requirements.txt
- ✅ Enable adaptive query execution
- ✅ Monitor and right-size node count/size
🚀 What's Next?¶
Continue to Tutorial 7: PySpark Data Processing
Tutorial Progress: 6 of 14 completed · Next: 07. PySpark Processing →