# Performance Optimization Best Practices
Best practices for optimizing performance in Cloud Scale Analytics.
## Overview
For detailed guidance, see the Full Performance Guide.
## Quick Reference
### Spark Performance
| Technique | Impact | Implementation |
|---|---|---|
| Partition tuning | High | spark.sql.shuffle.partitions = 200 |
| Broadcast joins | High | Tables under ~10 MB (default autoBroadcastJoinThreshold) |
| Caching | Medium | Frequently reused DataFrames (see sketch below) |
| Predicate pushdown | High | Filter early in transformations (see sketch below) |
```python
# Baseline Spark configuration: a fixed shuffle partition count plus
# Adaptive Query Execution (AQE) to coalesce small partitions at runtime
spark.conf.set("spark.sql.shuffle.partitions", "200")
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
```

```python
# Broadcast join: ship the small table to every executor, avoiding a shuffle
from pyspark.sql.functions import broadcast

result = large_df.join(
    broadcast(small_df),
    "join_key"
)
```
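The last two rows of the table are just as short to apply. The snippet below is a minimal sketch, assuming a hypothetical `events_df` DataFrame with a `region` column: filtering before joins or aggregations lets Spark push the predicate down to the file scan, and `cache()` keeps a frequently reused result in memory.

```python
# Predicate pushdown: filter as early as possible so the scan reads less data
# (events_df and the "region" column are illustrative placeholders)
filtered_df = events_df.filter(events_df.region == "EMEA")

# Caching: persist a DataFrame that several downstream actions will reuse
filtered_df.cache()
filtered_df.count()  # the first action materializes the cache
```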
### SQL Performance
```sql
-- Use a clustered columnstore index for analytics (column list is illustrative)
CREATE TABLE fact_sales
(
    customer_id INT  NOT NULL,
    order_date  DATE NOT NULL,
    amount      DECIMAL(18, 2)
)
WITH (
    DISTRIBUTION = HASH(customer_id),
    CLUSTERED COLUMNSTORE INDEX
);

-- Maintain statistics
UPDATE STATISTICS fact_sales;

-- Use result set caching (first enable it at the database level:
-- ALTER DATABASE ... SET RESULT_SET_CACHING ON, run from master)
SET RESULT_SET_CACHING ON;
```
### Delta Lake Performance
```sql
-- Optimize table layout: compact small files and co-locate rows by common filter keys
OPTIMIZE gold.sales ZORDER BY (customer_id, order_date);

-- Vacuum files no longer referenced by the table (168 hours = 7 days)
VACUUM gold.sales RETAIN 168 HOURS;

-- Auto-optimize on write
ALTER TABLE gold.sales SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true'
);
```
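The same maintenance can also be scheduled from a notebook through the Delta Lake Python API. This is a sketch rather than the only option; it assumes the delta-spark package (Delta Lake 2.0+) is available on the cluster and reuses the gold.sales table from the SQL above.

```python
from delta.tables import DeltaTable

sales = DeltaTable.forName(spark, "gold.sales")

# Equivalent of OPTIMIZE gold.sales ZORDER BY (customer_id, order_date)
sales.optimize().executeZOrderBy("customer_id", "order_date")

# Equivalent of VACUUM gold.sales RETAIN 168 HOURS
sales.vacuum(168)
```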
## Performance Monitoring
Track job durations with a KQL query against the Log Analytics workspace that receives your Synapse diagnostic logs:

```kusto
// Query performance metrics over the last 24 hours
SynapseSparkLogs
| where TimeGenerated > ago(24h)
| summarize
    AvgDuration = avg(DurationMs),
    MaxDuration = max(DurationMs),
    P95Duration = percentile(DurationMs, 95)
    by JobName, bin(TimeGenerated, 1h)
| render timechart
```
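The same query can be run programmatically, for example to feed alerts or custom dashboards. Below is a minimal sketch using the azure-monitor-query and azure-identity packages; the workspace ID is a placeholder you must supply, and the packages themselves are an assumption about your environment.

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Same KQL as above, scoped to the last 24 hours
response = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",  # placeholder
    query="""SynapseSparkLogs
             | where TimeGenerated > ago(24h)
             | summarize AvgDuration = avg(DurationMs),
                         MaxDuration = max(DurationMs),
                         P95Duration = percentile(DurationMs, 95)
               by JobName, bin(TimeGenerated, 1h)""",
    timespan=timedelta(hours=24),
)

# Print each result row (one per job and hourly bin)
for table in response.tables:
    for row in table.rows:
        print(row)
```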
## Related Documentation
- Spark Performance
- SQL Performance
- Delta Lake Optimization
Last Updated: January 2025