Guided Troubleshooting¶
🧭 Interactive Diagnostic Guide: Follow this step-by-step decision tree to diagnose and resolve common Azure Synapse Analytics issues quickly and effectively.
🎯 Quick Issue Identification¶
Start by identifying your issue category:
| Issue Type | Symptoms | Quick Link |
|---|---|---|
| 🔑 Authentication | Access denied, login failures, permission errors | Authentication Issues |
| 🌐 Connectivity | Connection timeouts, network errors, VNet issues | Connectivity Issues |
| ⚡ Performance | Slow queries, high latency, resource bottlenecks | Performance Issues |
| 🏞️ Delta Lake | Table corruption, versioning issues, ACID failures | Delta Lake Issues |
| 🔄 Pipelines | Pipeline failures, activity errors, orchestration issues | Pipeline Issues |
| ☁️ Serverless SQL | Query failures, data access issues, quota exceeded | Serverless SQL Issues |
| 🔥 Spark | Job failures, executor errors, memory issues | Spark Issues |
🔑 Authentication Issues¶
Decision Tree¶
```mermaid
graph TD
A[Authentication Error] --> B{Error Type?}
B -->|Access Denied| C{Using MSI?}
B -->|Login Failed| D{Authentication Method?}
B -->|Permission Error| E{Resource Type?}
C -->|Yes| F[Check MSI Permissions]
C -->|No| G[Check Service Principal]
D -->|Azure AD| H[Verify Azure AD Setup]
D -->|SQL Auth| I[Check SQL Credentials]
D -->|Key/Token| J[Validate API Keys]
E -->|Storage| K[Check RBAC Roles]
E -->|Synapse| L[Check Synapse RBAC]
E -->|Database| M[Check Database Permissions]
F --> N[Solution: Grant Required Roles]
G --> O[Solution: Update SP Credentials]
H --> P[Solution: Azure AD Configuration]
I --> Q[Solution: Reset SQL Password]
J --> R[Solution: Regenerate Keys]
K --> S[Solution: Assign Storage Roles]
L --> T[Solution: Assign Synapse Roles]
M --> U[Solution: Grant DB Permissions]
```
Diagnostic Steps¶
Step 1: Identify Authentication Method¶
Question: What authentication method are you using?
- Azure AD (Recommended) → Go to Step 2A
- Managed Identity → Go to Step 2B
- Service Principal → Go to Step 2C
- SQL Authentication → Go to Step 2D
Step 2A: Azure AD Authentication¶
Symptoms: AADSTS* error codes, token expiration, consent errors
Diagnostic Actions (see the combined sketch below):
- Verify the Azure AD app registration
- Check token validity
- Verify user permissions
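A minimal Azure CLI sketch for these checks; the client ID and user UPN are placeholders you must supply:

```bash
# Confirm the app registration exists (replace <app-client-id>)
az ad app show --id <app-client-id> --query "{name:displayName, appId:appId}"

# Acquire a token for Synapse SQL to confirm sign-in works for the current identity
az account get-access-token --resource https://database.windows.net/ --query expiresOn

# List the user's role assignments
az role assignment list --assignee <user-upn> --all --output table
```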
Solutions:
- Error AADSTS50105 (User not assigned): Assign user to application
- Error AADSTS50126 (Invalid credentials): Reset password or credentials
- Error AADSTS65001 (Consent required): Grant admin consent
📚 Related: Authentication Troubleshooting
Step 2B: Managed Identity¶
Symptoms: Access denied with MSI, identity not found
Diagnostic Actions (see the sketch below):
- Verify MSI is enabled
- Check MSI role assignments
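A quick Azure CLI check, using placeholder workspace, resource-group, and principal-ID values:

```bash
# Show the workspace's managed identity; principalId should be non-empty
az synapse workspace show --name <workspace-name> --resource-group <rg-name> --query identity

# List role assignments held by that identity
az role assignment list --assignee <principal-id> --all --output table
```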
Solutions:
- Enable managed identity on Synapse workspace
- Assign required roles: `Storage Blob Data Contributor`, `Synapse Contributor` (see the sketch below)
- Wait up to 5 minutes for role propagation
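A hedged example of the storage role assignment; the principal ID and resource IDs are placeholders:

```bash
# Grant the workspace identity data access on the storage account
az role assignment create \
  --assignee <principal-id> \
  --role "Storage Blob Data Contributor" \
  --scope /subscriptions/<sub-id>/resourceGroups/<rg-name>/providers/Microsoft.Storage/storageAccounts/<storage-name>
```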
📚 Related: Security Best Practices
Step 2C: Service Principal¶
Symptoms: Client credential flow errors, secret expiration
Diagnostic Actions (see the sketch below):
- Verify the service principal exists
- Check secret expiration
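An Azure CLI sketch for both checks, with placeholder client and app IDs:

```bash
# Confirm the service principal exists
az ad sp show --id <client-id> --query "{name:displayName, appId:appId}"

# List credential expiry dates on the app registration
az ad app credential list --id <app-client-id> --query "[].endDateTime"
```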
Solutions:
- Regenerate expired secrets
- Update application credentials in Key Vault
- Verify client ID and tenant ID are correct
📚 Related: Reference Security
Step 2D: SQL Authentication¶
Symptoms: Login failed for user, password policy errors
Diagnostic Actions (see the sketch below):
- Test the SQL connection
- Check that the login exists
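A minimal `sqlcmd` sketch, assuming the workspace SQL endpoint and a placeholder SQL admin login:

```bash
# Test the connection with SQL authentication
sqlcmd -S <workspace-name>.sql.azuresynapse.net -d master -U <sql-admin-user> -P '<password>' -Q "SELECT 1;"

# Check that the login exists and is not disabled
sqlcmd -S <workspace-name>.sql.azuresynapse.net -d master -U <sql-admin-user> -P '<password>' \
  -Q "SELECT name, is_disabled FROM sys.sql_logins WHERE name = '<login-name>';"
```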
Solutions:
- Reset SQL admin password via Azure Portal
- Ensure password meets complexity requirements
- Check if account is locked or disabled
📚 Related: SQL Performance
🌐 Connectivity Issues¶
Decision Tree¶
```mermaid
graph TD
A[Connection Error] --> B{Error Message?}
B -->|Timeout| C{Public/Private Access?}
B -->|Connection Refused| D{Firewall Rules?}
B -->|Host Not Found| E{DNS Resolution?}
C -->|Public| F[Check Public Access]
C -->|Private| G[Check Private Endpoint]
D -->|Configured| H[Verify IP Whitelisted]
D -->|Not Configured| I[Add Firewall Rule]
E -->|Working| J[Check Network Path]
E -->|Failing| K[Check DNS Config]
F --> L[Solution: Enable Public Access]
G --> M[Solution: Configure VNet]
H --> N[Solution: Update Firewall]
I --> O[Solution: Create Rule]
J --> P[Solution: Check NSG/Route]
K --> Q[Solution: Fix DNS]
```
Diagnostic Steps¶
Step 1: Test Basic Connectivity¶
Question: Can you reach the endpoint?
```powershell
# Test DNS resolution
nslookup <workspace-name>.sql.azuresynapse.net

# Test port connectivity
Test-NetConnection -ComputerName <workspace-name>.sql.azuresynapse.net -Port 1433

# Test HTTPS endpoint
curl https://<workspace-name>.dev.azuresynapse.net
```
Results:
- ✅ Success → Go to Step 2A: Azure AD Authentication
- ❌ DNS Failed → Go to DNS Issues
- ❌ Port Blocked → Go to Firewall Issues
- ❌ HTTPS Failed → Go to Certificate Issues
DNS Issues¶
Diagnostic Actions (see the sketch below):
- Verify DNS settings
- Check the private DNS zone
- Verify the DNS record:

```bash
az network private-dns record-set a list --zone-name privatelink.sql.azuresynapse.net --resource-group <rg-name>
```
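A sketch for the first two checks, assuming a Private Link setup with placeholder names:

```bash
# Resolve the endpoint; behind a private endpoint this should return a private (e.g. 10.x) address
nslookup <workspace-name>.sql.azuresynapse.net

# List private DNS zones in the resource group
az network private-dns zone list --resource-group <rg-name> --output table
```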
Solutions:
- Create private DNS zone if using Private Link
- Link DNS zone to VNet
- Flush the DNS cache: `ipconfig /flushdns`
📚 Related: Connectivity Troubleshooting
Firewall Issues¶
Diagnostic Actions:
- Check workspace firewall rules:

```bash
az synapse workspace firewall-rule list --workspace-name <workspace-name> --resource-group <rg-name>
```

- Verify your client IP (see the sketch below)
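One way to confirm the public IP your client presents, using a public echo service (assumes outbound internet access):

```bash
# The IP returned here is what must appear in the workspace firewall rules
curl https://api.ipify.org
```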
Solutions:
- Add your client IP to the firewall rules (see the sketch below)
- Enable "Allow Azure services and resources to access this workspace"
- For private endpoints, disable public network access
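A hedged Azure CLI example of adding a single-IP rule, with placeholder names:

```bash
# Allow a single client IP through the workspace firewall
az synapse workspace firewall-rule create \
  --name AllowMyClientIP \
  --workspace-name <workspace-name> \
  --resource-group <rg-name> \
  --start-ip-address <client-ip> \
  --end-ip-address <client-ip>
```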
📚 Related: Network Security
Certificate Issues¶
Symptoms: SSL/TLS errors, certificate validation failures
Solutions:
- Update root certificates on client machine
- Add the connection string parameter `TrustServerCertificate=True` (development only)
- Verify the server certificate is valid
⚡ Performance Issues¶
Decision Tree¶
```mermaid
graph TD
A[Slow Performance] --> B{Component?}
B -->|Query| C{Query Type?}
B -->|Pipeline| D{Pipeline Stage?}
B -->|Spark Job| E{Job Phase?}
C -->|SELECT| F[Check Query Plan]
C -->|INSERT/UPDATE| G[Check Table Design]
C -->|JOIN| H[Check Statistics]
D -->|Copy Activity| I[Check Throughput]
D -->|Data Flow| J[Check Partitioning]
D -->|Lookup| K[Check Query Opt]
E -->|Stage Time| L[Check Shuffle]
E -->|Task Skew| M[Check Partitioning]
E -->|Memory| N[Check Config]
F --> O[Solution: Optimize Query]
G --> P[Solution: Add Indexes]
H --> Q[Solution: Update Stats]
I --> R[Solution: Increase DIUs]
J --> S[Solution: Repartition]
K --> T[Solution: Use Parameters]
L --> U[Solution: Broadcast Join]
M --> V[Solution: Repartition Data]
N --> W[Solution: Tune Memory]
```
Diagnostic Steps¶
Step 1: Identify Performance Bottleneck¶
Question: What type of operation is slow?
- SQL Query → SQL Performance Diagnostics
- Spark Job → Spark Performance Diagnostics
- Pipeline → Pipeline Performance Diagnostics
SQL Performance Diagnostics¶
Collect Query Metrics:
```sql
-- Get query execution plan
SET STATISTICS PROFILE ON;
<your-query>
SET STATISTICS PROFILE OFF;

-- Check statistics for the current request
SELECT
    request_id,
    start_time,
    status,
    total_elapsed_time,  -- milliseconds
    row_count
FROM sys.dm_exec_requests
WHERE session_id = @@SPID;

-- Identify expensive operations
SELECT TOP 10
    text,
    total_worker_time/execution_count AS avg_cpu_time,
    total_elapsed_time/execution_count AS avg_elapsed_time
FROM sys.dm_exec_query_stats
CROSS APPLY sys.dm_exec_sql_text(sql_handle)
ORDER BY avg_elapsed_time DESC;
```
Common Issues:
| Symptom | Likely Cause | Solution |
|---|---|---|
| Full table scan | Missing statistics | Update statistics |
| High CPU time | Complex joins | Simplify query, add indexes |
| Memory spills | Insufficient memory grant | Increase DWU, optimize query |
| Lock waits | Concurrent operations | Use snapshot isolation |
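For the "missing statistics" row, a hedged T-SQL sketch; the table and column names (`dbo.FactSales`, `OrderDateKey`) are hypothetical:

```sql
-- Create single-column statistics on a commonly joined/filtered column
CREATE STATISTICS stats_factsales_datekey ON dbo.FactSales (OrderDateKey);

-- Refresh statistics on the table after large loads
UPDATE STATISTICS dbo.FactSales;
```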
📚 Related: SQL Performance Best Practices
Spark Performance Diagnostics¶
Collect Job Metrics:
- Access the Spark UI: `https://<workspace-name>.dev.azuresynapse.net/sparkui`
- Review the job timeline and stages
- Check executor metrics
Common Issues:
| Symptom | Likely Cause | Solution |
|---|---|---|
| Long stage time | Data shuffle | Use broadcast join |
| Task skew | Uneven partitioning | Repartition data |
| OOM errors | Insufficient memory | Increase executor memory |
| Spill to disk | Memory pressure | Optimize transformations |
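For the shuffle case, a minimal PySpark broadcast-join sketch; `large_df`, `small_df`, and the join key are hypothetical names:

```python
from pyspark.sql.functions import broadcast

# Ship the small table to every executor instead of shuffling the large one
result = large_df.join(broadcast(small_df), "customer_id")
```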
📚 Related: Spark Performance Best Practices
Pipeline Performance Diagnostics¶
Check Pipeline Metrics:
- Navigate to Monitor → Pipeline runs
- Review activity durations
- Check data integration units (DIUs)
Common Issues:
| Symptom | Likely Cause | Solution |
|---|---|---|
| Slow copy activity | Low DIUs | Increase DIU count |
| Mapping data flow slow | Small cluster | Scale up integration runtime |
| Lookup timeout | Large result set | Use query optimization |
📚 Related: Pipeline Optimization
🏞️ Delta Lake Issues¶
Common Scenarios¶
Scenario 1: Table Not Found¶
Symptoms: Table or view not found errors
Diagnostic Steps:
```python
# Check if the table exists in the metastore
spark.sql("SHOW TABLES IN <database>").show()

# Check the Delta Lake path (Synapse notebooks expose mssparkutils rather than Databricks' dbutils)
from notebookutils import mssparkutils
mssparkutils.fs.ls("/mnt/delta/<table-path>")

# Verify the Delta transaction log
mssparkutils.fs.ls("/mnt/delta/<table-path>/_delta_log")
```
Solutions:
- Refresh table metadata: `REFRESH TABLE <table_name>`
- Repair partition metadata: `MSCK REPAIR TABLE <table_name>`
- Recreate the table if it is corrupted
📚 Related: Delta Lake Troubleshooting
Scenario 2: Concurrent Write Conflicts¶
Symptoms: ConcurrentAppendException, ConcurrentDeleteException
Solutions:
- Implement optimistic concurrency control
- Use merge operations instead of separate update/insert
- Enable auto-optimize for Delta tables
```python
from delta.tables import DeltaTable

# Enable optimized writes
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")

# Use merge for upserts; `source` is the incoming DataFrame of changes
deltaTable = DeltaTable.forName(spark, "<table_name>")
deltaTable.alias("target").merge(
    source.alias("source"),
    "target.id = source.id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
```
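Where retries are acceptable, a minimal optimistic-retry sketch, assuming the `delta-spark` package (recent releases expose `delta.exceptions.ConcurrentAppendException`); the function and its arguments are hypothetical:

```python
import time

from delta.exceptions import ConcurrentAppendException  # assumed available in your delta-spark version


def merge_with_retry(delta_table, source_df, condition, max_attempts=3):
    """Retry a Delta merge when a concurrent writer wins the commit race."""
    for attempt in range(1, max_attempts + 1):
        try:
            (delta_table.alias("target")
                .merge(source_df.alias("source"), condition)
                .whenMatchedUpdateAll()
                .whenNotMatchedInsertAll()
                .execute())
            return
        except ConcurrentAppendException:
            if attempt == max_attempts:
                raise
            time.sleep(2 ** attempt)  # exponential backoff before retrying
```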
Scenario 3: Performance Degradation¶
Symptoms: Slow reads, increasing query time
Diagnostic Steps:
```python
# Check table statistics
spark.sql("DESCRIBE DETAIL <table_name>").show()

# Check file count and total size
spark.sql("DESCRIBE DETAIL <table_name>").select("numFiles", "sizeInBytes").show()
```
Solutions:
- Run OPTIMIZE: `OPTIMIZE <table_name>`
- Z-ORDER on frequently filtered columns: `OPTIMIZE <table_name> ZORDER BY (column1, column2)`
- Vacuum old files: `VACUUM <table_name> RETAIN 168 HOURS`
📚 Related: Delta Lake Optimization
🔄 Pipeline Issues¶
Common Failure Patterns¶
Pattern 1: Copy Activity Failures¶
Diagnostic Questions:
- What's the error message?
- Is it source or sink failure?
- Is it data-related or connectivity-related?
Common Errors:
| Error Code | Description | Solution |
|---|---|---|
| `UserErrorFailedToReadFromSqlSource` | SQL query error | Validate SQL syntax |
| `UserErrorInvalidColumnMapping` | Schema mismatch | Update column mappings |
| `UserErrorSinkPathNotFound` | Destination path missing | Create target container |
| `UserErrorQuotaExceeded` | DIU limit reached | Request quota increase |
Pattern 2: Data Flow Failures¶
Diagnostic Steps:
- Enable debug mode in Data Flow
- Review data preview at each transformation
- Check cluster logs
Common Issues:
- Schema drift not handled → Enable "Allow schema drift"
- Null values causing errors → Add derived column with null handling
- Memory errors → Increase cluster size or optimize transformations
📚 Related: Pipeline Troubleshooting
☁️ Serverless SQL Issues¶
Common Problems¶
Problem 1: File Format Errors¶
Error: Failed to read parquet file or CSV parsing error
Solutions:
```sql
-- Test the file format with OPENROWSET
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://<storage>.dfs.core.windows.net/container/file.parquet',
    FORMAT = 'PARQUET'
) AS [result];

-- Handle CSV with custom settings
SELECT *
FROM OPENROWSET(
    BULK 'https://<storage>.dfs.core.windows.net/container/file.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    FIRSTROW = 2,
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n'
) AS [result];
```
Problem 2: Performance Issues¶
Symptoms: Slow queries on data lake files
Solutions:
- Create external tables instead of OPENROWSET
- Use partitioned data sources
- Create statistics on external tables
- Use CETAS to cache results (see the sketch below)
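A hedged CETAS sketch; the schema, external data source, and file format names (`curated`, `my_lake`, `parquet_format`) are assumptions that must already exist in your serverless database:

```sql
-- Materialize query results back to the lake, then query the external table instead
CREATE EXTERNAL TABLE curated.daily_sales
WITH (
    LOCATION = 'curated/daily_sales/',   -- folder relative to the data source root
    DATA_SOURCE = my_lake,               -- assumed pre-created external data source
    FILE_FORMAT = parquet_format         -- assumed pre-created Parquet file format
)
AS
SELECT *
FROM OPENROWSET(
    BULK 'https://<storage>.dfs.core.windows.net/container/file.parquet',
    FORMAT = 'PARQUET'
) AS [result];
```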
📚 Related: Serverless SQL Troubleshooting
🔥 Spark Issues¶
Common Job Failures¶
Failure Type 1: Out of Memory¶
Symptoms: java.lang.OutOfMemoryError, executor failures
Diagnostic Steps:
```python
# Check executor and driver memory configuration
spark.conf.get("spark.executor.memory")
spark.conf.get("spark.driver.memory")

# Per-executor memory usage is best inspected in the Spark UI's Executors tab;
# PySpark's status tracker does not expose executor memory metrics directly
```
Solutions:
- Increase executor memory: `spark.conf.set("spark.executor.memory", "8g")` (memory settings are static, so set them in the Spark pool or session configuration before the session starts)
- Increase driver memory: `spark.conf.set("spark.driver.memory", "4g")`
- Optimize data structures (use arrays instead of lists)
- Process data in smaller batches
Failure Type 2: Shuffle Failures¶
Symptoms: Fetch failed, shuffle read errors
Solutions:
- Increase shuffle partitions: `spark.conf.set("spark.sql.shuffle.partitions", "200")`
- Enable adaptive query execution: `spark.conf.set("spark.sql.adaptive.enabled", "true")`
- Use broadcast joins for small tables
📚 Related: Spark Troubleshooting
🆘 Emergency Checklist¶
When all else fails, work through this checklist:
- Check Azure Service Health
- Review Azure Monitor alerts
- Check recent changes in deployment history
- Verify quota limits haven't been reached
- Review activity logs for error patterns
- Test from different network location
- Compare working vs. non-working configuration
- Create minimal reproduction case
- Gather diagnostic information
- Open Azure support ticket
📚 Additional Resources¶
| Resource | Description | Link |
|---|---|---|
| 🔧 Troubleshooting Hub | All troubleshooting guides | Troubleshooting |
| 📖 Best Practices | Prevention strategies | Best Practices |
| 📊 Monitoring Setup | Proactive monitoring | Monitoring |
| 💬 FAQ | Common questions | FAQ |
💡 Pro Tip: Keep a troubleshooting journal documenting issues and resolutions. Patterns often emerge that help prevent future problems.
Last Updated: January 2025