Cost Optimization Best Practices for Azure Synapse Analytics¶

Home > Best Practices > Cost Optimization

Understanding Synapse Analytics Cost Model¶

Cost Components¶

Serverless SQL Pool Costs¶

Data Scanning: Charged per TB of data processed
Result Caching: No charge for subsequent queries using cached results
Resource Management: No charge when idle (pay-per-query model)

Dedicated SQL Pool Costs¶

Compute Costs: Based on Data Warehouse Units (DWU) and time running
Storage Costs: Based on volume of data stored in the dedicated pool
Data Movement: Included in compute costs

Spark Pool Costs¶

Compute Costs: Based on vCore-hours consumed
Autoscale Impact: Costs vary based on actual usage with autoscale
Node Types: Different costs for different node types (memory-optimized vs. compute-optimized)

Storage Costs¶

ADLS Gen2 Storage: Based on volume of data and storage tier (hot/cool/archive)
Transaction Costs: Based on number and type of storage operations

Compute Optimization Strategies¶

Serverless SQL Pool Optimization¶

Query Optimization¶

Minimize Data Scanning: Prune data aggressively

-- Good: Scans less data with partition filtering
SELECT * FROM external_table
WHERE year_partition = 2025 AND month_partition = 1

-- Avoid: Full scan across all partitions
SELECT * FROM external_table
WHERE YEAR(transaction_date) = 2025 AND MONTH(transaction_date) = 1

Use Appropriate File Formats: Prefer columnar formats (Parquet, ORC) over row-based formats (CSV, JSON)

-- Create external file format for Parquet
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'SNAPPY'
);

Statistics: Create statistics on frequently filtered columns

-- Create statistics for better query plans
CREATE STATISTICS stats_year ON external_table(year_column);

Result Set Caching¶

Enable Result Set Caching: Reuse query results for identical queries

-- Enable result set caching at database level
ALTER DATABASE MyDatabase
SET RESULT_SET_CACHING ON;

Parameterize Queries: Use parameterized queries to maximize cache hits

Dedicated SQL Pool Optimization¶

Scale Management¶

Implement Automated Scaling: Scale up/down based on workload patterns

# Scale DW based on schedule
$startTime = (Get-Date).AddHours(1)
$timeZone = [System.TimeZoneInfo]::Local.Id
$schedule = New-AzSynapseWorkspaceManagedSchedule -DayOfWeek Monday, Tuesday, Wednesday, Thursday, Friday -Time "08:00" -TimeZone $timeZone
New-AzSynapseSqlPoolWorkloadManagement -WorkspaceName $workspaceName -SqlPoolName $sqlPoolName -DwuValue 1000 -Schedule $schedule

Pause During Inactivity: Automatically pause during non-business hours

# Pause SQL pool
Suspend-AzSynapseSqlPool -WorkspaceName $workspaceName -Name $sqlPoolName

Resource Classes¶

Optimize Resource Classes: Use smaller resource classes for simple queries

-- Assign smaller resource class for simple queries
EXEC sp_addrolemember 'smallrc', 'username';

-- Assign larger resource class for complex queries
EXEC sp_addrolemember 'largerc', 'username';

Spark Pool Optimization¶

Autoscale Configuration¶

Right-Size Min/Max Nodes: Configure appropriate autoscale range

{
  "name": "optimizedSparkPool",
  "properties": {
    "nodeSize": "Small",
    "nodeSizeFamily": "MemoryOptimized",
    "autoScale": {
      "enabled": true,
      "minNodeCount": 3,
      "maxNodeCount": 10
    }
  }
}

Session-Level Configuration: Only request resources needed for each job

# Configure Spark session with appropriate resources
spark.conf.set("spark.executor.instances", "4")
spark.conf.set("spark.executor.memory", "4g")
spark.conf.set("spark.executor.cores", "2")

Node Selection¶

Use Appropriate Node Types: Select based on workload characteristics
Memory-optimized for ML and large joins
Compute-optimized for ETL and data processing
Consider Job Requirements: Match node size to job requirements

Session Management¶

Session Timeout: Configure appropriate timeout to release resources

{
  "name": "optimizedSparkPool",
  "properties": {
    "sessionLevelPackages": [],
    "sparkConfigProperties": {},
    "nodeSize": "Small",
    "nodeSizeFamily": "MemoryOptimized",
    "sessionLevelPackages": [],
    "customLibraries": [],
    "sparkEventsFolder": "/events",
    "autoScale": {
      "enabled": true,
      "minNodeCount": 3,
      "maxNodeCount": 10
    },
    "isComputeIsolationEnabled": false,
    "sessionProperties": {
      "driverSize": "Small",
      "executorSize": "Small",
      "executorCount": 2
    },
    "defaultSparkLogFolder": "/logs",
    "nodeCount": 0,
    "dynamicExecutorAllocation": {
      "enabled": true,
      "minExecutors": 1,
      "maxExecutors": 5
    },
    "coordinatorSize": "Small",
    "provisioningState": "Succeeded"
  }
}

Storage Optimization Strategies¶

Data Lifecycle Management¶

Storage Tiering¶

Hot Storage: Use for frequently accessed data (last 30-90 days)
Cool Storage: Use for infrequently accessed data (older than 90 days)
Archive Storage: Use for rarely accessed data (compliance/historical)

Automated Tiering¶

Lifecycle Management Policies: Configure to automatically move data between tiers

{
  "rules": [
    {
      "enabled": true,
      "name": "MoveToCoolTier",
      "type": "Lifecycle",
      "definition": {
        "filters": {
          "blobTypes": [ "blockBlob" ],
          "prefixMatch": [ "data/historical/" ]
        },
        "actions": {
          "baseBlob": {
            "tierToCool": { "daysAfterModificationGreaterThan": 90 }
          }
        }
      }
    }
  ]
}

Data Storage Optimization¶

Compression and File Formats¶

Use Compression: Prefer columnar formats with compression

# Write with compression
df.write.format("parquet") \
    .option("compression", "snappy") \
    .save("/path/to/data")

Optimize File Sizes: Target 100MB-1GB per file

# Control Parquet file size
spark.conf.set("spark.sql.files.maxPartitionBytes", 134217728)  # 128 MB

Data Cleanup¶

Remove Duplicate Data: Deduplicate data where possible
Regular Vacuum: Clean up stale files in Delta tables

-- Remove files no longer needed by the table
VACUUM delta_table RETAIN 7 DAYS

Temporary Data Management: Remove temporary datasets after use

Pipeline Optimization¶

Integration Pipeline Costs¶

Activity Optimization¶

Combine Activities: Reduce activity runs by combining related operations
Use Appropriate Integration Runtime: Match the IR to the workload requirements
Optimize Copy Activity: Configure appropriate compute size for data movement

Monitoring and Debugging¶

Limit Debug Runs: Use debug runs sparingly
Optimize Logging: Implement appropriate logging levels
Use Activity Constraints: Set appropriate timeouts and retry policies

Orchestration Patterns¶

Trigger Optimization¶

Batch Related Activities: Trigger multiple related activities together
Use Event-Based Triggers: Trigger only when needed, rather than on schedule

Monitoring and Analysis¶

Cost Monitoring¶

Azure Cost Management¶

Budget Alerts: Set up alerts for cost thresholds

# Create budget with alert
New-AzConsumptionBudget -Name "SynapseMonthlyBudget" `
    -Amount 1000 `
    -Category "Cost" `
    -TimeGrain "Monthly" `
    -StartDate (Get-Date) `
    -EndDate (Get-Date).AddYears(1) `
    -ContactEmail @("user@contoso.com")

Cost Analysis: Regularly analyze costs by service, resource, and tag
Tag Resources: Implement consistent tagging for cost allocation

{
  "tags": {
    "Environment": "Production",
    "Department": "Finance",
    "Project": "DataWarehouse"
  }
}

Resource Utilization Analysis¶

Monitor Usage Patterns: Track usage to identify optimization opportunities
Identify Idle Resources: Find and address underutilized resources
Workload Analysis: Understand peak vs. average requirements

Cost Optimization Workflow¶

Regular Review Process¶

Monthly Cost Review: Schedule regular cost review meetings
Cost Optimization Backlog: Maintain a backlog of optimization opportunities
ROI Analysis: Prioritize optimization efforts by potential savings

Enterprise Strategies¶

Reserved Instances¶

Azure Reservations¶

Reserved Capacity: Consider 1-year or 3-year reservations for stable workloads
Reservation Scope: Choose appropriate scope (subscription or resource group)
Mixed Approach: Use reserved instances for baseline and pay-as-you-go for variable workloads

Enterprise Agreement Benefits¶

EA Optimization¶

Leverage EA Pricing: Utilize enterprise agreement discounts
Azure Hybrid Benefit: Apply for eligible workloads
Enterprise Dev/Test Subscription: Use for non-production environments

Conclusion¶

Cost optimization in Azure Synapse Analytics requires a multi-faceted approach across compute, storage, and operational aspects. By implementing these best practices, organizations can achieve significant cost savings while maintaining performance and meeting business requirements.

Remember that cost optimization is an ongoing process that should be integrated into your regular operational rhythms. Regular monitoring, analysis, and adjustment of your optimization strategies will ensure continued cost efficiency as your workloads evolve.