# Performance Optimization Best Practices
Best practices for optimizing performance in Cloud Scale Analytics.
## Overview
For detailed guidance, see the Full Performance Guide.
## Quick Reference
### Spark Performance
| Technique | Impact | Implementation |
|---|---|---|
| Partition tuning | High | spark.sql.shuffle.partitions = 200 |
| Broadcast joins | High | Tables under ~10 MB (default autoBroadcastJoinThreshold) |
| Caching | Medium | Frequently reused DataFrames (see sketch below) |
| Predicate pushdown | High | Filter early in transformations (see sketch below) |
```python
# Baseline Spark configuration: a fixed shuffle partition count plus
# Adaptive Query Execution (AQE) to coalesce small partitions at runtime
spark.conf.set("spark.sql.shuffle.partitions", "200")
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
```

```python
# Broadcast join: ship the small table to every executor, avoiding a shuffle
from pyspark.sql.functions import broadcast

result = large_df.join(
    broadcast(small_df),
    "join_key"
)
```
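The last two rows of the table are just as short to apply. The snippet below is a minimal sketch, assuming a hypothetical `events_df` DataFrame with a `region` column: filtering before joins or aggregations lets Spark push the predicate down to the file scan, and `cache()` keeps a frequently reused result in memory.

```python
# Predicate pushdown: filter as early as possible so the scan reads less data
# (events_df and the "region" column are illustrative placeholders)
filtered_df = events_df.filter(events_df.region == "EMEA")

# Caching: persist a DataFrame that several downstream actions will reuse
filtered_df.cache()
filtered_df.count()  # the first action materializes the cache
```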
### SQL Performance
```sql
-- Use a clustered columnstore index for analytics (column list is illustrative)
CREATE TABLE fact_sales
(
    customer_id INT  NOT NULL,
    order_date  DATE NOT NULL,
    amount      DECIMAL(18, 2)
)
WITH (
    DISTRIBUTION = HASH(customer_id),
    CLUSTERED COLUMNSTORE INDEX
);

-- Maintain statistics
UPDATE STATISTICS fact_sales;

-- Use result set caching (first enable it at the database level:
-- ALTER DATABASE ... SET RESULT_SET_CACHING ON, run from master)
SET RESULT_SET_CACHING ON;
```
### Delta Lake Performance
```sql
-- Optimize table layout: compact small files and co-locate rows by common filter keys
OPTIMIZE gold.sales ZORDER BY (customer_id, order_date);

-- Vacuum files no longer referenced by the table (168 hours = 7 days)
VACUUM gold.sales RETAIN 168 HOURS;

-- Auto-optimize on write
ALTER TABLE gold.sales SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true'
);
```
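The same maintenance can also be scheduled from a notebook through the Delta Lake Python API. This is a sketch rather than the only option; it assumes the delta-spark package (Delta Lake 2.0+) is available on the cluster and reuses the gold.sales table from the SQL above.

```python
from delta.tables import DeltaTable

sales = DeltaTable.forName(spark, "gold.sales")

# Equivalent of OPTIMIZE gold.sales ZORDER BY (customer_id, order_date)
sales.optimize().executeZOrderBy("customer_id", "order_date")

# Equivalent of VACUUM gold.sales RETAIN 168 HOURS
sales.vacuum(168)
```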
## Performance Monitoring
Track job durations with a KQL query against the Log Analytics workspace that receives your Synapse diagnostic logs:

```kusto
// Query performance metrics over the last 24 hours
SynapseSparkLogs
| where TimeGenerated > ago(24h)
| summarize
    AvgDuration = avg(DurationMs),
    MaxDuration = max(DurationMs),
    P95Duration = percentile(DurationMs, 95)
    by JobName, bin(TimeGenerated, 1h)
| render timechart
```
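The same query can be run programmatically, for example to feed alerts or custom dashboards. Below is a minimal sketch using the azure-monitor-query and azure-identity packages; the workspace ID is a placeholder you must supply, and the packages themselves are an assumption about your environment.

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Same KQL as above, scoped to the last 24 hours
response = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",  # placeholder
    query="""SynapseSparkLogs
             | where TimeGenerated > ago(24h)
             | summarize AvgDuration = avg(DurationMs),
                         MaxDuration = max(DurationMs),
                         P95Duration = percentile(DurationMs, 95)
               by JobName, bin(TimeGenerated, 1h)""",
    timespan=timedelta(hours=24),
)

# Print each result row (one per job and hourly bin)
for table in response.tables:
    for row in table.rows:
        print(row)
```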
## Related Documentation
- Spark Performance
- SQL Performance
- Delta Lake Optimization
Last Updated: January 2025