General Performance Troubleshooting - CSA in-a-Box¶
Comprehensive guide for diagnosing and resolving performance issues across Cloud Scale Analytics services including Azure Synapse Analytics, Azure Data Factory, Azure Data Lake Storage, and streaming services.
Table of Contents¶
- Overview
- Performance Baseline Metrics
- Common Performance Issues
- Cross-Service Performance
- Diagnostic Approach
- Resolution Strategies
- Prevention Best Practices
- Related Resources
Overview¶
Performance issues in Cloud Scale Analytics can stem from multiple sources: inefficient queries, resource constraints, network bottlenecks, or suboptimal configurations. This guide provides a systematic approach to identifying and resolving performance problems across the entire analytics platform.
⚠️ Important: Always establish performance baselines during normal operations to effectively identify degradation.
Performance Baseline Metrics¶
Key Performance Indicators¶
| Metric Category | KPI | Good | Warning | Critical | Tool |
|---|---|---|---|---|---|
| Query Execution | Average query time | < 5s | 5-30s | > 30s | Azure Synapse |
| Data Ingestion | Pipeline duration | < 10min | 10-30min | > 30min | Azure Data Factory |
| Storage I/O | ADLS latency | < 100ms | 100-500ms | > 500ms | Azure Monitor |
| Streaming | Event processing lag | < 1s | 1-5s | > 5s | Stream Analytics |
| Resource Usage | CPU utilization | < 70% | 70-85% | > 85% | Azure Monitor |
| Memory Usage | Memory pressure | < 80% | 80-90% | > 90% | Azure Monitor |
Establishing Baselines¶
Azure Monitor Query:
// Establish performance baseline over 30 days
let timeRange = 30d;
let metricName = "QueryDuration";
AzureDiagnostics
| where TimeGenerated > ago(timeRange)
| where MetricName == metricName
| summarize
P50 = percentile(MetricValue, 50),
P95 = percentile(MetricValue, 95),
P99 = percentile(MetricValue, 99),
Avg = avg(MetricValue),
Max = max(MetricValue)
by bin(TimeGenerated, 1d), Resource
| render timechart
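If you want to capture these baselines on a schedule rather than ad hoc, the same KQL can be run programmatically. A minimal sketch using the azure-monitor-query SDK; the workspace ID and the output handling are placeholders for your environment:

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

# Placeholder Log Analytics workspace ID for illustration
WORKSPACE_ID = "<log-analytics-workspace-id>"

BASELINE_QUERY = """
AzureDiagnostics
| where MetricName == "QueryDuration"
| summarize
    P50 = percentile(MetricValue, 50),
    P95 = percentile(MetricValue, 95),
    P99 = percentile(MetricValue, 99),
    Avg = avg(MetricValue),
    Max = max(MetricValue)
  by bin(TimeGenerated, 1d), Resource
"""

client = LogsQueryClient(DefaultAzureCredential())
response = client.query_workspace(WORKSPACE_ID, BASELINE_QUERY, timespan=timedelta(days=30))

# Print one row per day/resource so the values can be stored as the baseline
for table in response.tables:
    for row in table.rows:
        print(dict(zip(table.columns, row)))
```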
Common Performance Issues¶
Issue 1: Slow Data Processing¶
Symptoms:
- Pipelines taking longer than expected
- Increasing backlog in processing queues
- Timeout errors in downstream services
Common Causes:
| Cause | Service | Impact | Quick Check |
|---|---|---|---|
| Insufficient compute | Synapse/ADF | High | Check DWU/IR capacity |
| Data skew | Spark/Synapse | High | Review partition distribution |
| Large datasets | All | Medium | Check data volumes |
| Network throttling | Storage | Medium | Monitor network metrics |
| Inefficient transformations | ADF/Spark | High | Review activity execution times |
Step-by-Step Resolution:
1. Identify Bottleneck Component¶
Azure Data Factory:
# Get pipeline run details
az datafactory pipeline-run show \
--factory-name <factory-name> \
--resource-group <rg-name> \
--run-id <run-id>
# Query activity-level metrics
az monitor metrics list \
--resource <resource-id> \
--metric "PipelineSucceededRuns" "ActivitySucceededRuns" \
--start-time 2025-01-01T00:00:00Z \
--end-time 2025-01-10T00:00:00Z
KQL Query for ADF Performance:
ADFPipelineRun
| where TimeGenerated > ago(24h)
| where Status == "Succeeded"
| extend Duration = datetime_diff('second', End, Start)
| summarize
AvgDuration = avg(Duration),
P95Duration = percentile(Duration, 95),
RunCount = count()
by PipelineName
| where AvgDuration > 600 // Pipelines averaging > 10 minutes
| order by AvgDuration desc
2. Analyze Resource Utilization¶
// Check Integration Runtime utilization
ADFActivityRun
| where TimeGenerated > ago(24h)
| extend Duration = datetime_diff('second', End, Start)
| summarize
TotalDuration = sum(Duration),
MaxDuration = max(Duration),
ActivityCount = count()
by IntegrationRuntimeName, ActivityType
| order by TotalDuration desc
3. Scale Resources¶
Azure Synapse:
# Scale SQL pool
az synapse sql pool update \
--name <pool-name> \
--workspace-name <workspace-name> \
--resource-group <rg-name> \
--performance-level DW500c
# Scale Spark pool
az synapse spark pool update \
--name <pool-name> \
--workspace-name <workspace-name> \
--resource-group <rg-name> \
--node-count 10 \
--node-size Medium
Azure Data Factory:
# Scale Integration Runtime
az datafactory integration-runtime update \
--factory-name <factory-name> \
--resource-group <rg-name> \
--name <ir-name> \
--core-count 8 \
--time-to-live 30
Issue 2: High Storage Latency¶
Symptoms:
- Slow read/write operations to Data Lake
- Storage throttling errors (503)
- Increased pipeline execution times
Error Messages: throttling typically surfaces as HTTP 503 (ServerBusy) responses from the storage service.
Resolution Steps:
1. Check Storage Metrics¶
// Storage performance metrics
StorageBlobLogs
| where TimeGenerated > ago(1h)
| where StatusCode >= 500
| summarize
ErrorCount = count(),
AvgLatency = avg(DurationMs),
MaxLatency = max(DurationMs)
by OperationName, StatusCode
| order by ErrorCount desc
2. Identify Throttling¶
# Check storage account metrics
az monitor metrics list \
--resource <storage-account-id> \
--metric "Availability" "SuccessE2ELatency" "Transactions" \
--aggregation Average \
--interval PT1H
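The same check can be automated. A minimal sketch using the azure-monitor-query metrics client to count throttled transactions on the storage account; the resource ID is a placeholder, and the `ResponseType eq 'ServerBusyError'` dimension filter is an assumption to verify against your account's metric dimensions:

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient, MetricAggregationType

# Placeholder resource ID of the storage account
STORAGE_ACCOUNT_ID = "<storage-account-resource-id>"

client = MetricsQueryClient(DefaultAzureCredential())
result = client.query_resource(
    STORAGE_ACCOUNT_ID,
    metric_names=["Transactions"],
    timespan=timedelta(hours=1),
    granularity=timedelta(minutes=5),
    aggregations=[MetricAggregationType.TOTAL],
    # Assumed dimension filter for throttled (HTTP 503) responses
    filter="ResponseType eq 'ServerBusyError'",
)

for metric in result.metrics:
    for series in metric.timeseries:
        for point in series.data:
            if point.total:
                print(f"{point.timestamp}: {int(point.total)} throttled requests")
```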
3. Optimize Access Patterns¶
Implement Retry Logic:
import os

from azure.storage.blob import BlobServiceClient
from azure.core.exceptions import HttpResponseError, ServiceRequestError
from tenacity import retry, wait_exponential, stop_after_attempt

# Connection string sourced from an environment variable (adjust to your secret store)
conn_str = os.environ["AZURE_STORAGE_CONNECTION_STRING"]

@retry(
    wait=wait_exponential(multiplier=1, min=4, max=60),
    stop=stop_after_attempt(5)
)
def read_blob_with_retry(container_name, blob_name):
    """Read a blob with exponential backoff retry."""
    blob_service_client = BlobServiceClient.from_connection_string(conn_str)
    blob_client = blob_service_client.get_blob_client(
        container=container_name,
        blob=blob_name
    )
    try:
        # HttpResponseError covers 503 throttling responses; ServiceRequestError covers transport failures
        return blob_client.download_blob().readall()
    except (HttpResponseError, ServiceRequestError) as e:
        print(f"Storage request failed: {e}")
        raise
Optimize Partition Strategy:
# Poor partition strategy (hot partition)
/data/2025/01/01/all_data.parquet (10GB single file)
# Good partition strategy (distributed load)
/data/year=2025/month=01/day=01/part-00001.parquet (100MB)
/data/year=2025/month=01/day=01/part-00002.parquet (100MB)
...
/data/year=2025/month=01/day=01/part-00100.parquet (100MB)
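One way to produce the distributed layout above from a Spark job, shown as a sketch; the source path, output path, and `event_time` column are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import year, month, dayofmonth

spark = SparkSession.builder.appName("PartitionedWrite").getOrCreate()

# Illustrative source path and timestamp column
df = spark.read.parquet("abfss://data@<account>.dfs.core.windows.net/raw/events")

(
    df.withColumn("year", year("event_time"))
      .withColumn("month", month("event_time"))
      .withColumn("day", dayofmonth("event_time"))
      # Hive-style year=/month=/day= folders with many moderately sized files per day
      .repartition("year", "month", "day")
      .write.mode("overwrite")
      .partitionBy("year", "month", "day")
      .parquet("abfss://data@<account>.dfs.core.windows.net/data")
)
```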
4. Implement Caching¶
import os

import diskcache
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Storage account name sourced from an environment variable (adjust as needed)
account_name = os.environ["ADLS_ACCOUNT_NAME"]
cache = diskcache.Cache('./cache')

@cache.memoize(expire=3600)  # Cache for 1 hour
def read_metadata(file_path):
    """Cache frequently accessed metadata."""
    service_client = DataLakeServiceClient(
        account_url=f"https://{account_name}.dfs.core.windows.net",
        credential=DefaultAzureCredential()
    )
    file_client = service_client.get_file_client(
        file_system="data",
        file_path=file_path
    )
    return file_client.get_file_properties()
Issue 3: Memory Pressure¶
Symptoms:
- Out of memory errors in Spark jobs
- Slow garbage collection
- Job failures with resource exhaustion
Error Messages:
java.lang.OutOfMemoryError: Java heap space
org.apache.spark.shuffle.FetchFailedException: Failed to fetch shuffle blocks
Resolution:
1. Analyze Memory Usage¶
Spark Monitoring Query:
SparkMetrics
| where TimeGenerated > ago(1h)
| where MetricName in ("ExecutorMemoryUsed", "JVMHeapMemory", "ExecutorGCTime")
| summarize
AvgMemory = avg(MetricValue),
MaxMemory = max(MetricValue)
by bin(TimeGenerated, 5m), ExecutorId
| render timechart
2. Optimize Spark Configuration¶
# Spark configuration for memory optimization
spark_config = {
"spark.executor.memory": "16g",
"spark.executor.memoryOverhead": "4g",
"spark.driver.memory": "8g",
"spark.memory.fraction": "0.8",
"spark.memory.storageFraction": "0.3",
"spark.sql.shuffle.partitions": "200",
"spark.default.parallelism": "200",
"spark.sql.adaptive.enabled": "true",
"spark.sql.adaptive.coalescePartitions.enabled": "true"
}
from pyspark.sql import SparkSession

# builder.config() takes individual key/value pairs, so apply each setting in a loop
builder = SparkSession.builder.appName("OptimizedApp")
for key, value in spark_config.items():
    builder = builder.config(key, value)

spark = builder.getOrCreate()
3. Reduce Data Shuffling¶
# Poor: Forces full shuffle
large_df.groupBy("category").count()
# Better: Use broadcast join for small tables
from pyspark.sql.functions import broadcast
large_df.join(
broadcast(small_dim_df),
"category_id"
)
# Best: Pre-partition data
large_df.repartition("category").write.partitionBy("category").parquet("/output")
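Because data skew (see the causes table under Issue 1) often shows up as a few slow shuffle tasks, it helps to check how evenly rows are spread across partitions before tuning further. A small sketch, reusing the `large_df` DataFrame from the examples above:

```python
from pyspark.sql.functions import spark_partition_id, count

# Row count per Spark partition; a handful of partitions far larger than the rest indicates skew
partition_counts = (
    large_df
    .withColumn("partition_id", spark_partition_id())
    .groupBy("partition_id")
    .agg(count("*").alias("rows"))
    .orderBy("rows", ascending=False)
)
partition_counts.show(20)
```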
Cross-Service Performance¶
Data Pipeline End-to-End Performance¶
Monitor Complete Pipeline:
// Correlate performance across services
let startTime = ago(24h);
let endTime = now();
// ADF pipeline execution
let pipelineRuns = ADFPipelineRun
| where TimeGenerated between (startTime .. endTime)
| extend PipelineDuration = datetime_diff('second', End, Start)
| project TimeGenerated, PipelineName, PipelineDuration, Status, RunId;
// Synapse query execution
let synapseQueries = SynapseSqlPoolExecRequests
| where TimeGenerated between (startTime .. endTime)
| extend QueryDuration = TotalElapsedTimeMs / 1000.0
| project TimeGenerated, RequestId, QueryDuration, Status;
// Join and analyze
pipelineRuns
| join kind=inner (synapseQueries) on $left.TimeGenerated == $right.TimeGenerated
| summarize
TotalDuration = sum(PipelineDuration + QueryDuration),
AvgDuration = avg(PipelineDuration + QueryDuration)
by PipelineName
| order by TotalDuration desc
Network Performance¶
Check Network Latency:
# Test connectivity to storage account
az network watcher test-connectivity \
--source-resource <vm-resource-id> \
--dest-address <storage-account>.blob.core.windows.net \
--protocol Tcp \
--dest-port 443
# Check ExpressRoute/VPN performance by querying an existing connection monitor
az network watcher connection-monitor query \
--name <monitor-name> \
--location <region>
Monitor Private Endpoint Performance:
AzureNetworkAnalytics_CL
| where SubType_s == "FlowLog"
| where TimeGenerated > ago(1h)
| summarize
TotalFlows = sum(FlowCount_d),
AvgBytes = avg(BytesSentToSource_d + BytesSentToDest_d)
by SrcIP_s, DestIP_s, DestPort_d
| order by TotalFlows desc
Diagnostic Approach¶
Performance Troubleshooting Workflow¶
graph TD
A[Performance Issue Detected] --> B{Issue Type?}
B -->|Query Slow| C[Check Execution Plans]
B -->|Pipeline Slow| D[Check Activity Durations]
B -->|Storage Slow| E[Check Storage Metrics]
B -->|Memory Issues| F[Check Resource Usage]
C --> G[Analyze Statistics]
D --> H[Analyze Bottlenecks]
E --> I[Check Throttling]
F --> J[Analyze Memory Dumps]
G --> K[Apply Optimizations]
H --> K
I --> K
J --> K
K --> L{Performance Improved?}
L -->|Yes| M[Document Solution]
L -->|No| N[Scale Resources]
N --> O{Performance Improved?}
O -->|Yes| M
O -->|No| P[Escalate to Support]
style A fill:#ff6b6b
style M fill:#51cf66
style P fill:#ffd43b
Comprehensive Diagnostics Script¶
# Performance diagnostics collection script
param(
[string]$SubscriptionId,
[string]$ResourceGroupName,
[string]$WorkspaceName,
[int]$HoursBack = 24
)
$outputPath = ".\performance_diagnostics_$(Get-Date -Format 'yyyyMMdd_HHmmss')"
New-Item -ItemType Directory -Path $outputPath -Force
# Set context
Set-AzContext -SubscriptionId $SubscriptionId
# Collect Synapse metrics
$synapseWorkspace = Get-AzSynapseWorkspace -ResourceGroupName $ResourceGroupName -Name $WorkspaceName
$metrics = @("CPUPercent", "MemoryPercent", "DWULimit", "DWUUsed")
foreach ($metric in $metrics) {
$data = Get-AzMetric -ResourceId $synapseWorkspace.Id `
-MetricName $metric `
-StartTime (Get-Date).AddHours(-$HoursBack) `
-EndTime (Get-Date) `
-TimeGrain 00:05:00 `
-AggregationType Average
$data | ConvertTo-Json -Depth 10 | Out-File "$outputPath\metric_$metric.json"
}
# Collect activity logs
$startTime = (Get-Date).AddHours(-$HoursBack)
$logs = Get-AzActivityLog -ResourceGroupName $ResourceGroupName -StartTime $startTime
$logs | ConvertTo-Json -Depth 10 | Out-File "$outputPath\activity_logs.json"
# Collect pipeline runs (if ADF exists)
$dataFactories = Get-AzDataFactoryV2 -ResourceGroupName $ResourceGroupName
foreach ($df in $dataFactories) {
$pipelineRuns = Get-AzDataFactoryV2PipelineRun `
-ResourceGroupName $ResourceGroupName `
-DataFactoryName $df.DataFactoryName `
-LastUpdatedAfter $startTime
$pipelineRuns | ConvertTo-Json -Depth 10 | Out-File "$outputPath\pipeline_runs_$($df.DataFactoryName).json"
}
Write-Host "Performance diagnostics collected in: $outputPath" -ForegroundColor Green
Resolution Strategies¶
Quick Wins¶
| Optimization | Impact | Effort | Service |
|---|---|---|---|
| Update statistics | High | Low | Synapse SQL |
| Enable result caching | High | Low | Synapse SQL |
| Increase partition count | Medium | Low | Spark |
| Use broadcast joins | High | Low | Spark |
| Enable data compression | Medium | Low | Storage |
| Implement retry logic | Medium | Low | All |
Long-Term Optimizations¶
1. Implement Data Lifecycle Management¶
# Lifecycle management policy definition.
# Note: lifecycle policies are applied through the storage management plane,
# not the azure-storage-blob data-plane SDK; see the sketch after this block.
lifecycle_policy = {
"rules": [
{
"enabled": True,
"name": "move-old-data-to-cool",
"type": "Lifecycle",
"definition": {
"actions": {
"baseBlob": {
"tierToCool": {
"daysAfterModificationGreaterThan": 30
},
"tierToArchive": {
"daysAfterModificationGreaterThan": 90
}
}
},
"filters": {
"prefixMatch": ["data/raw/"],
"blobTypes": ["blockBlob"]
}
}
}
]
}
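The policy dictionary above has to be applied through the storage management plane. A minimal sketch using the azure-mgmt-storage SDK; the subscription, resource group, and account names are placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

# Placeholder identifiers for illustration
subscription_id = "<subscription-id>"
resource_group = "<rg-name>"
account_name = "<storage-account-name>"

client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

# 'default' is the only supported management policy name
client.management_policies.create_or_update(
    resource_group,
    account_name,
    "default",
    {"policy": lifecycle_policy},
)
```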
2. Optimize Table Design¶
-- Create optimized table with appropriate distribution
CREATE TABLE dbo.FactSales_Optimized
WITH (
DISTRIBUTION = HASH(CustomerID), -- Distribute on frequently joined column
CLUSTERED COLUMNSTORE INDEX, -- Best for analytics workloads
PARTITION (
OrderDate RANGE RIGHT FOR VALUES (
'2024-01-01', '2024-02-01', '2024-03-01',
'2024-04-01', '2024-05-01', '2024-06-01'
)
)
)
AS
SELECT * FROM dbo.FactSales;
-- Create statistics on key columns
CREATE STATISTICS stat_customer ON dbo.FactSales_Optimized (CustomerID) WITH FULLSCAN;
CREATE STATISTICS stat_date ON dbo.FactSales_Optimized (OrderDate) WITH FULLSCAN;
3. Implement Caching Strategy¶
from functools import lru_cache
from datetime import datetime, timedelta

class PerformanceOptimizedDataLoader:
    """Data loader with multi-level caching."""

    def __init__(self, storage_account, container):
        self.storage_account = storage_account
        self.container = container
        self._cache = {}
        self._cache_expiry = {}

    @lru_cache(maxsize=128)
    def get_metadata(self, file_path):
        """Cache metadata lookups."""
        # Implementation omitted in this sketch
        pass

    def _load_data(self, file_path):
        """Read the file from storage (implementation depends on your environment)."""
        raise NotImplementedError

    def get_data_with_cache(self, file_path, ttl_minutes=60):
        """Get data with a time-based cache."""
        cache_key = f"{self.container}/{file_path}"
        # Return cached data while it is still valid
        if cache_key in self._cache:
            if datetime.now() < self._cache_expiry.get(cache_key, datetime.min):
                return self._cache[cache_key]
        # Load data
        data = self._load_data(file_path)
        # Update cache
        self._cache[cache_key] = data
        self._cache_expiry[cache_key] = datetime.now() + timedelta(minutes=ttl_minutes)
        return data
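Example usage; the account, container, and file path below are purely illustrative:

```python
# Hypothetical names for illustration
loader = PerformanceOptimizedDataLoader("mystorageacct", "curated")
sales = loader.get_data_with_cache("sales/2025/01/sales.parquet", ttl_minutes=30)
```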
Prevention Best Practices¶
Proactive Monitoring¶
1. Set Up Performance Alerts¶
# Create alert rule for high query duration
az monitor metrics alert create \
--name "HighQueryDuration" \
--resource-group <rg-name> \
--scopes <synapse-workspace-id> \
--condition "avg QueryDuration > 30000" \
--window-size 5m \
--evaluation-frequency 1m \
--action <action-group-id>
# Create alert for pipeline failures
az monitor metrics alert create \
--name "PipelineFailures" \
--resource-group <rg-name> \
--scopes <data-factory-id> \
--condition "count PipelineFailedRuns > 3" \
--window-size 15m \
--action <action-group-id>
2. Performance Dashboard¶
// Create performance dashboard query
let timeRange = 24h;
// Query performance over time
SynapseSqlPoolExecRequests
| where TimeGenerated > ago(timeRange)
| summarize
P50 = percentile(TotalElapsedTimeMs, 50),
P95 = percentile(TotalElapsedTimeMs, 95),
P99 = percentile(TotalElapsedTimeMs, 99)
by bin(TimeGenerated, 1h)
| render timechart with (title="Query Performance Percentiles")
Regular Maintenance¶
Weekly Performance Review Checklist:
- Review slow query reports
- Update statistics on large tables
- Check storage account metrics
- Review pipeline execution times
- Analyze resource utilization trends
- Review and optimize costly queries
- Check for data skew issues
- Validate partition strategies
Monthly Optimization Tasks:
- Comprehensive performance audit
- Review and update baselines
- Optimize frequently accessed datasets
- Review and tune Spark configurations
- Analyze storage access patterns
- Review and optimize data retention policies
- Update documentation with new optimizations
Related Resources¶
Internal Documentation¶
| Resource | Description |
|---|---|
| Synapse Query Performance | Synapse-specific query optimization |
| Databricks Performance | Databricks query and job optimization |
| Best Practices: Performance | General performance best practices |
| Monitoring Setup | Configure comprehensive monitoring |
External Resources¶
| Resource | Link |
|---|---|
| Azure Synapse Performance | Microsoft Docs |
| Spark Performance Tuning | Azure Databricks Docs |
| Storage Performance | ADLS Gen2 Best Practices |
| ADF Performance | Data Factory Performance Guide |
When to Escalate¶
Contact Azure Support if:
- Performance degraded significantly without configuration changes
- Resource scaling doesn't improve performance
- Consistent throttling despite optimization
- Platform-level issues suspected
- Performance SLAs not being met
- Need assistance with capacity planning
Information to Provide:
- Performance baseline metrics
- Recent metric trends and graphs
- Executed optimization attempts
- Resource configuration details
- Sample slow queries with execution plans
- Storage access patterns
💡 Performance Tip: Establish clear performance baselines and monitor trends proactively. Most performance issues are easier to prevent than to troubleshoot after they occur.
Last Updated: 2025-12-10 Version: 1.0.0