Performance Optimization Best Practices

Best practices for optimizing Spark, SQL, and Delta Lake performance in Cloud Scale Analytics.


Overview

For detailed guidance, see the Full Performance Guide.


Quick Reference

Spark Performance

Technique            Impact   Implementation
-------------------  -------  ----------------------------------
Partition tuning     High     spark.sql.shuffle.partitions = 200
Broadcast joins      High     Small tables (< 10 MB)
Caching              Medium   Frequently accessed DataFrames
Predicate pushdown   High     Filter early in transformations

# Baseline Spark configuration
spark.conf.set("spark.sql.shuffle.partitions", "200")  # 200 is the default; tune to shuffle data volume
spark.conf.set("spark.sql.adaptive.enabled", "true")  # adaptive query execution (AQE)
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # let AQE merge small shuffle partitions

# Broadcast join: replicate the small table to every executor to avoid a shuffle
from pyspark.sql.functions import broadcast

result = large_df.join(
    broadcast(small_df),
    "join_key"
)
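
Caching and predicate pushdown from the table above are worth a sketch as well. This is a minimal example, assuming a hypothetical events dataset; the path and column names are illustrative:

# Filter and project as early as possible so the Parquet scan
# reads only the needed row groups and columns
events_df = (
    spark.read.parquet("/data/events")            # hypothetical path
    .filter("event_date >= '2024-01-01'")         # predicate pushed down to the scan
    .select("event_id", "customer_id", "amount")  # column pruning
)

# Cache only DataFrames reused across multiple actions
events_df.cache()           # MEMORY_AND_DISK by default
print(events_df.count())    # first action materializes the cache
# ... further actions reuse the cached data ...
events_df.unpersist()       # release executor memory when finished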

SQL Performance

-- Use a clustered columnstore index for large analytics tables
CREATE TABLE fact_sales
(
    sale_id     BIGINT NOT NULL,  -- illustrative columns
    customer_id INT NOT NULL,
    order_date  DATE NOT NULL
)
WITH (
    DISTRIBUTION = HASH(customer_id),
    CLUSTERED COLUMNSTORE INDEX
);

-- Keep statistics current so the optimizer has accurate row counts
UPDATE STATISTICS fact_sales;

-- Use result set caching (must be enabled at the database level first)
SET RESULT_SET_CACHING ON;
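
Whether a statement was actually served from the result set cache shows up in the result_cache_hit column of sys.dm_pdw_exec_requests. A minimal sketch using pyodbc; the connection string values are placeholders:

import pyodbc

# Placeholder connection details for a dedicated SQL pool
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<workspace>.sql.azuresynapse.net;"
    "DATABASE=<pool>;UID=<user>;PWD=<password>"
)

# result_cache_hit = 1 means the result came from the cache
rows = conn.execute(
    "SELECT TOP 10 request_id, command, result_cache_hit "
    "FROM sys.dm_pdw_exec_requests ORDER BY submit_time DESC"
).fetchall()
for row in rows:
    print(row.request_id, row.result_cache_hit)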

Delta Lake Performance

-- Optimize table layout; Z-order by columns used in filters and joins
OPTIMIZE gold.sales ZORDER BY (customer_id, order_date);

-- Vacuum unreferenced files older than 168 hours (7 days, the default retention)
VACUUM gold.sales RETAIN 168 HOURS;

-- Auto-optimize
ALTER TABLE gold.sales SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact' = 'true'
);
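
To verify the effect of OPTIMIZE, Delta Lake's DESCRIBE DETAIL reports the table's current file count and size. A minimal PySpark sketch, assuming the gold.sales table above:

# Compare numFiles before and after running OPTIMIZE
detail = spark.sql("DESCRIBE DETAIL gold.sales")
detail.select("numFiles", "sizeInBytes").show()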

Performance Monitoring

// Query performance metrics
SynapseSparkLogs
| where TimeGenerated > ago(24h)
| summarize
    AvgDuration = avg(DurationMs),
    MaxDuration = max(DurationMs),
    P95Duration = percentile(DurationMs, 95)
    by JobName, bin(TimeGenerated, 1h)
| render timechart
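
The same kind of query can be run programmatically, for example with the azure-monitor-query package. A minimal sketch; the workspace ID is a placeholder, and it assumes the SynapseSparkLogs table exists in that workspace:

from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Placeholder workspace GUID for your Log Analytics workspace
response = client.query_workspace(
    workspace_id="<workspace-id>",
    query="SynapseSparkLogs | summarize P95 = percentile(DurationMs, 95) by JobName",
    timespan=timedelta(hours=24),
)
for table in response.tables:
    for row in table.rows:
        print(row)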

Last Updated: January 2025