Cost Optimization Best Practices for Azure Synapse Analytics¶
Home > Best Practices > Cost Optimization
Understanding Synapse Analytics Cost Model¶
Cost Components¶
Serverless SQL Pool Costs¶
- Data Scanning: Charged per TB of data processed
- Result Caching: No charge for subsequent queries using cached results
- Resource Management: No charge when idle (pay-per-query model)
Dedicated SQL Pool Costs¶
- Compute Costs: Based on Data Warehouse Units (DWU) and time running
- Storage Costs: Based on volume of data stored in the dedicated pool
- Data Movement: Included in compute costs
Spark Pool Costs¶
- Compute Costs: Based on vCore-hours consumed
- Autoscale Impact: Costs vary based on actual usage with autoscale
- Node Types: Different costs for different node types (memory-optimized vs. compute-optimized)
Storage Costs¶
- ADLS Gen2 Storage: Based on volume of data and storage tier (hot/cool/archive)
- Transaction Costs: Based on number and type of storage operations
Compute Optimization Strategies¶
Serverless SQL Pool Optimization¶
Query Optimization¶
- Minimize Data Scanning: Prune data aggressively
-- Good: Scans less data with partition filtering
SELECT * FROM external_table
WHERE year_partition = 2025 AND month_partition = 1
-- Avoid: Full scan across all partitions
SELECT * FROM external_table
WHERE YEAR(transaction_date) = 2025 AND MONTH(transaction_date) = 1
- Use Appropriate File Formats: Prefer columnar formats (Parquet, ORC) over row-based formats (CSV, JSON)
-- Create external file format for Parquet
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (
FORMAT_TYPE = PARQUET,
DATA_COMPRESSION = 'SNAPPY'
);
- Statistics: Create statistics on frequently filtered columns
-- Create statistics for better query plans
CREATE STATISTICS stats_year ON external_table(year_column);
Result Set Caching¶
- Enable Result Set Caching: Reuse query results for identical queries
- Parameterize Queries: Use parameterized queries to maximize cache hits
Dedicated SQL Pool Optimization¶
Scale Management¶
- Implement Automated Scaling: Scale up/down based on workload patterns
# Scale DW based on schedule
$startTime = (Get-Date).AddHours(1)
$timeZone = [System.TimeZoneInfo]::Local.Id
$schedule = New-AzSynapseWorkspaceManagedSchedule -DayOfWeek Monday, Tuesday, Wednesday, Thursday, Friday -Time "08:00" -TimeZone $timeZone
New-AzSynapseSqlPoolWorkloadManagement -WorkspaceName $workspaceName -SqlPoolName $sqlPoolName -DwuValue 1000 -Schedule $schedule
- Pause During Inactivity: Automatically pause during non-business hours
Resource Classes¶
- Optimize Resource Classes: Use smaller resource classes for simple queries
-- Assign smaller resource class for simple queries
EXEC sp_addrolemember 'smallrc', 'username';
-- Assign larger resource class for complex queries
EXEC sp_addrolemember 'largerc', 'username';
Spark Pool Optimization¶
Autoscale Configuration¶
- Right-Size Min/Max Nodes: Configure appropriate autoscale range
{
"name": "optimizedSparkPool",
"properties": {
"nodeSize": "Small",
"nodeSizeFamily": "MemoryOptimized",
"autoScale": {
"enabled": true,
"minNodeCount": 3,
"maxNodeCount": 10
}
}
}
- Session-Level Configuration: Only request resources needed for each job
# Configure Spark session with appropriate resources
spark.conf.set("spark.executor.instances", "4")
spark.conf.set("spark.executor.memory", "4g")
spark.conf.set("spark.executor.cores", "2")
Node Selection¶
- Use Appropriate Node Types: Select based on workload characteristics
- Memory-optimized for ML and large joins
-
Compute-optimized for ETL and data processing
-
Consider Job Requirements: Match node size to job requirements
Session Management¶
- Session Timeout: Configure appropriate timeout to release resources
{
"name": "optimizedSparkPool",
"properties": {
"sessionLevelPackages": [],
"sparkConfigProperties": {},
"nodeSize": "Small",
"nodeSizeFamily": "MemoryOptimized",
"sessionLevelPackages": [],
"customLibraries": [],
"sparkEventsFolder": "/events",
"autoScale": {
"enabled": true,
"minNodeCount": 3,
"maxNodeCount": 10
},
"isComputeIsolationEnabled": false,
"sessionProperties": {
"driverSize": "Small",
"executorSize": "Small",
"executorCount": 2
},
"defaultSparkLogFolder": "/logs",
"nodeCount": 0,
"dynamicExecutorAllocation": {
"enabled": true,
"minExecutors": 1,
"maxExecutors": 5
},
"coordinatorSize": "Small",
"provisioningState": "Succeeded"
}
}
Storage Optimization Strategies¶
Data Lifecycle Management¶
Storage Tiering¶
- Hot Storage: Use for frequently accessed data (last 30-90 days)
- Cool Storage: Use for infrequently accessed data (older than 90 days)
- Archive Storage: Use for rarely accessed data (compliance/historical)
Automated Tiering¶
- Lifecycle Management Policies: Configure to automatically move data between tiers
{
"rules": [
{
"enabled": true,
"name": "MoveToCoolTier",
"type": "Lifecycle",
"definition": {
"filters": {
"blobTypes": [ "blockBlob" ],
"prefixMatch": [ "data/historical/" ]
},
"actions": {
"baseBlob": {
"tierToCool": { "daysAfterModificationGreaterThan": 90 }
}
}
}
}
]
}
Data Storage Optimization¶
Compression and File Formats¶
- Use Compression: Prefer columnar formats with compression
# Write with compression
df.write.format("parquet") \
.option("compression", "snappy") \
.save("/path/to/data")
- Optimize File Sizes: Target 100MB-1GB per file
Data Cleanup¶
- Remove Duplicate Data: Deduplicate data where possible
- Regular Vacuum: Clean up stale files in Delta tables
- Temporary Data Management: Remove temporary datasets after use
Pipeline Optimization¶
Integration Pipeline Costs¶
Activity Optimization¶
- Combine Activities: Reduce activity runs by combining related operations
- Use Appropriate Integration Runtime: Match the IR to the workload requirements
- Optimize Copy Activity: Configure appropriate compute size for data movement
Monitoring and Debugging¶
- Limit Debug Runs: Use debug runs sparingly
- Optimize Logging: Implement appropriate logging levels
- Use Activity Constraints: Set appropriate timeouts and retry policies
Orchestration Patterns¶
Trigger Optimization¶
- Batch Related Activities: Trigger multiple related activities together
- Use Event-Based Triggers: Trigger only when needed, rather than on schedule
Monitoring and Analysis¶
Cost Monitoring¶
Azure Cost Management¶
- Budget Alerts: Set up alerts for cost thresholds
# Create budget with alert
New-AzConsumptionBudget -Name "SynapseMonthlyBudget" `
-Amount 1000 `
-Category "Cost" `
-TimeGrain "Monthly" `
-StartDate (Get-Date) `
-EndDate (Get-Date).AddYears(1) `
-ContactEmail @("user@contoso.com")
- Cost Analysis: Regularly analyze costs by service, resource, and tag
- Tag Resources: Implement consistent tagging for cost allocation
Resource Utilization Analysis¶
- Monitor Usage Patterns: Track usage to identify optimization opportunities
- Identify Idle Resources: Find and address underutilized resources
- Workload Analysis: Understand peak vs. average requirements
Cost Optimization Workflow¶
Regular Review Process¶
- Monthly Cost Review: Schedule regular cost review meetings
- Cost Optimization Backlog: Maintain a backlog of optimization opportunities
- ROI Analysis: Prioritize optimization efforts by potential savings
Enterprise Strategies¶
Reserved Instances¶
Azure Reservations¶
- Reserved Capacity: Consider 1-year or 3-year reservations for stable workloads
- Reservation Scope: Choose appropriate scope (subscription or resource group)
- Mixed Approach: Use reserved instances for baseline and pay-as-you-go for variable workloads
Enterprise Agreement Benefits¶
EA Optimization¶
- Leverage EA Pricing: Utilize enterprise agreement discounts
- Azure Hybrid Benefit: Apply for eligible workloads
- Enterprise Dev/Test Subscription: Use for non-production environments
Conclusion¶
Cost optimization in Azure Synapse Analytics requires a multi-faceted approach across compute, storage, and operational aspects. By implementing these best practices, organizations can achieve significant cost savings while maintaining performance and meeting business requirements.
Remember that cost optimization is an ongoing process that should be integrated into your regular operational rhythms. Regular monitoring, analysis, and adjustment of your optimization strategies will ensure continued cost efficiency as your workloads evolve.