Guided Troubleshooting¶
🧭 Interactive Diagnostic Guide: Follow this step-by-step decision tree to diagnose and resolve common Azure Synapse Analytics issues quickly and effectively.
🎯 Quick Issue Identification¶
Start by identifying your issue category:
| Issue Type | Symptoms | Quick Link |
|---|---|---|
| 🔑 Authentication | Access denied, login failures, permission errors | Authentication Issues |
| 🌐 Connectivity | Connection timeouts, network errors, VNet issues | Connectivity Issues |
| ⚡ Performance | Slow queries, high latency, resource bottlenecks | Performance Issues |
| 🏞️ Delta Lake | Table corruption, versioning issues, ACID failures | Delta Lake Issues |
| 🔄 Pipelines | Pipeline failures, activity errors, orchestration issues | Pipeline Issues |
| ☁️ Serverless SQL | Query failures, data access issues, quota exceeded | Serverless SQL Issues |
| 🔥 Spark | Job failures, executor errors, memory issues | Spark Issues |
🔑 Authentication Issues¶
Decision Tree¶
```mermaid
graph TD
A[Authentication Error] --> B{Error Type?}
B -->|Access Denied| C{Using MSI?}
B -->|Login Failed| D{Authentication Method?}
B -->|Permission Error| E{Resource Type?}
C -->|Yes| F[Check MSI Permissions]
C -->|No| G[Check Service Principal]
D -->|Azure AD| H[Verify Azure AD Setup]
D -->|SQL Auth| I[Check SQL Credentials]
D -->|Key/Token| J[Validate API Keys]
E -->|Storage| K[Check RBAC Roles]
E -->|Synapse| L[Check Synapse RBAC]
E -->|Database| M[Check Database Permissions]
F --> N[Solution: Grant Required Roles]
G --> O[Solution: Update SP Credentials]
H --> P[Solution: Azure AD Configuration]
I --> Q[Solution: Reset SQL Password]
J --> R[Solution: Regenerate Keys]
K --> S[Solution: Assign Storage Roles]
L --> T[Solution: Assign Synapse Roles]
M --> U[Solution: Grant DB Permissions]
```
Diagnostic Steps¶
Step 1: Identify Authentication Method¶
Question: What authentication method are you using?
- Azure AD (Recommended) → Go to Step 2A
- Managed Identity → Go to Step 2B
- Service Principal → Go to Step 2C
- SQL Authentication → Go to Step 2D
Step 2A: Azure AD Authentication¶
Symptoms: AADSTS* error codes, token expiration, consent errors
Diagnostic Actions (see the combined sketch below):
- Verify the Azure AD app registration
- Check token validity
- Verify user permissions
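A minimal Azure CLI sketch for these checks; the client ID and user UPN are placeholders you must supply:

```bash
# Confirm the app registration exists (replace <app-client-id>)
az ad app show --id <app-client-id> --query "{name:displayName, appId:appId}"

# Acquire a token for Synapse SQL to confirm sign-in works for the current identity
az account get-access-token --resource https://database.windows.net/ --query expiresOn

# List the user's role assignments
az role assignment list --assignee <user-upn> --all --output table
```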
Solutions:
- Error AADSTS50105 (User not assigned): Assign user to application
- Error AADSTS50126 (Invalid credentials): Reset password or credentials
- Error AADSTS65001 (Consent required): Grant admin consent
📚 Related: Authentication Troubleshooting
Step 2B: Managed Identity¶
Symptoms: Access denied with MSI, identity not found
Diagnostic Actions (see the sketch below):
- Verify MSI is enabled
- Check MSI role assignments
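A quick Azure CLI check, using placeholder workspace, resource-group, and principal-ID values:

```bash
# Show the workspace's managed identity; principalId should be non-empty
az synapse workspace show --name <workspace-name> --resource-group <rg-name> --query identity

# List role assignments held by that identity
az role assignment list --assignee <principal-id> --all --output table
```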
Solutions:
- Enable managed identity on Synapse workspace
- Assign required roles: `Storage Blob Data Contributor`, `Synapse Contributor` (see the sketch below)
- Wait up to 5 minutes for role propagation
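A hedged example of the storage role assignment; the principal ID and resource IDs are placeholders:

```bash
# Grant the workspace identity data access on the storage account
az role assignment create \
  --assignee <principal-id> \
  --role "Storage Blob Data Contributor" \
  --scope /subscriptions/<sub-id>/resourceGroups/<rg-name>/providers/Microsoft.Storage/storageAccounts/<storage-name>
```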
📚 Related: Security Best Practices
Step 2C: Service Principal¶
Symptoms: Client credential flow errors, secret expiration
Diagnostic Actions (see the sketch below):
- Verify the service principal exists
- Check secret expiration
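An Azure CLI sketch for both checks, with placeholder client and app IDs:

```bash
# Confirm the service principal exists
az ad sp show --id <client-id> --query "{name:displayName, appId:appId}"

# List credential expiry dates on the app registration
az ad app credential list --id <app-client-id> --query "[].endDateTime"
```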
Solutions:
- Regenerate expired secrets
- Update application credentials in Key Vault
- Verify client ID and tenant ID are correct
📚 Related: Reference Security
Step 2D: SQL Authentication¶
Symptoms: Login failed for user, password policy errors
Diagnostic Actions (see the sketch below):
- Test the SQL connection
- Check that the login exists
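A minimal `sqlcmd` sketch, assuming the workspace SQL endpoint and a placeholder SQL admin login:

```bash
# Test the connection with SQL authentication
sqlcmd -S <workspace-name>.sql.azuresynapse.net -d master -U <sql-admin-user> -P '<password>' -Q "SELECT 1;"

# Check that the login exists and is not disabled
sqlcmd -S <workspace-name>.sql.azuresynapse.net -d master -U <sql-admin-user> -P '<password>' \
  -Q "SELECT name, is_disabled FROM sys.sql_logins WHERE name = '<login-name>';"
```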
Solutions:
- Reset SQL admin password via Azure Portal
- Ensure password meets complexity requirements
- Check if account is locked or disabled
📚 Related: SQL Performance
🌐 Connectivity Issues¶
Decision Tree¶
```mermaid
graph TD
A[Connection Error] --> B{Error Message?}
B -->|Timeout| C{Public/Private Access?}
B -->|Connection Refused| D{Firewall Rules?}
B -->|Host Not Found| E{DNS Resolution?}
C -->|Public| F[Check Public Access]
C -->|Private| G[Check Private Endpoint]
D -->|Configured| H[Verify IP Whitelisted]
D -->|Not Configured| I[Add Firewall Rule]
E -->|Working| J[Check Network Path]
E -->|Failing| K[Check DNS Config]
F --> L[Solution: Enable Public Access]
G --> M[Solution: Configure VNet]
H --> N[Solution: Update Firewall]
I --> O[Solution: Create Rule]
J --> P[Solution: Check NSG/Route]
K --> Q[Solution: Fix DNS]
```
Diagnostic Steps¶
Step 1: Test Basic Connectivity¶
Question: Can you reach the endpoint?
```powershell
# Test DNS resolution
nslookup <workspace-name>.sql.azuresynapse.net

# Test port connectivity
Test-NetConnection -ComputerName <workspace-name>.sql.azuresynapse.net -Port 1433

# Test HTTPS endpoint
curl https://<workspace-name>.dev.azuresynapse.net
```
Results:
- ✅ Success → Go to Step 2A: Azure AD Authentication
- ❌ DNS Failed → Go to DNS Issues
- ❌ Port Blocked → Go to Firewall Issues
- ❌ HTTPS Failed → Go to Certificate Issues
DNS Issues¶
Diagnostic Actions (see the sketch below):
- Verify DNS settings
- Check the private DNS zone
- Verify the DNS record:

```bash
az network private-dns record-set a list --zone-name privatelink.sql.azuresynapse.net --resource-group <rg-name>
```
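A sketch for the first two checks, assuming a Private Link setup with placeholder names:

```bash
# Resolve the endpoint; behind a private endpoint this should return a private (e.g. 10.x) address
nslookup <workspace-name>.sql.azuresynapse.net

# List private DNS zones in the resource group
az network private-dns zone list --resource-group <rg-name> --output table
```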
Solutions:
- Create private DNS zone if using Private Link
- Link DNS zone to VNet
- Flush the DNS cache: `ipconfig /flushdns`
📚 Related: Connectivity Troubleshooting
Firewall Issues¶
Diagnostic Actions:
- Check workspace firewall rules:

```bash
az synapse workspace firewall-rule list --workspace-name <workspace-name> --resource-group <rg-name>
```

- Verify your client IP (see the sketch below)
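One way to confirm the public IP your client presents, using a public echo service (assumes outbound internet access):

```bash
# The IP returned here is what must appear in the workspace firewall rules
curl https://api.ipify.org
```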
Solutions:
- Add your client IP to the firewall rules (see the sketch below)
- Enable "Allow Azure services and resources to access this workspace"
- For private endpoints, disable public network access
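A hedged Azure CLI example of adding a single-IP rule, with placeholder names:

```bash
# Allow a single client IP through the workspace firewall
az synapse workspace firewall-rule create \
  --name AllowMyClientIP \
  --workspace-name <workspace-name> \
  --resource-group <rg-name> \
  --start-ip-address <client-ip> \
  --end-ip-address <client-ip>
```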
📚 Related: Network Security
Certificate Issues¶
Symptoms: SSL/TLS errors, certificate validation failures
Solutions:
- Update root certificates on client machine
- Add the connection string parameter `TrustServerCertificate=True` (development only)
- Verify the server certificate is valid
⚡ Performance Issues¶
Decision Tree¶
```mermaid
graph TD
A[Slow Performance] --> B{Component?}
B -->|Query| C{Query Type?}
B -->|Pipeline| D{Pipeline Stage?}
B -->|Spark Job| E{Job Phase?}
C -->|SELECT| F[Check Query Plan]
C -->|INSERT/UPDATE| G[Check Table Design]
C -->|JOIN| H[Check Statistics]
D -->|Copy Activity| I[Check Throughput]
D -->|Data Flow| J[Check Partitioning]
D -->|Lookup| K[Check Query Opt]
E -->|Stage Time| L[Check Shuffle]
E -->|Task Skew| M[Check Partitioning]
E -->|Memory| N[Check Config]
F --> O[Solution: Optimize Query]
G --> P[Solution: Add Indexes]
H --> Q[Solution: Update Stats]
I --> R[Solution: Increase DIUs]
J --> S[Solution: Repartition]
K --> T[Solution: Use Parameters]
L --> U[Solution: Broadcast Join]
M --> V[Solution: Repartition Data]
N --> W[Solution: Tune Memory]
```
Diagnostic Steps¶
Step 1: Identify Performance Bottleneck¶
Question: What type of operation is slow?
- SQL Query → SQL Performance Diagnostics
- Spark Job → Spark Performance Diagnostics
- Pipeline → Pipeline Performance Diagnostics
SQL Performance Diagnostics¶
Collect Query Metrics:
```sql
-- Get query execution plan
SET STATISTICS PROFILE ON;
<your-query>
SET STATISTICS PROFILE OFF;

-- Check statistics for the current request
SELECT
    request_id,
    start_time,
    status,
    total_elapsed_time,  -- milliseconds
    row_count
FROM sys.dm_exec_requests
WHERE session_id = @@SPID;

-- Identify expensive operations
SELECT TOP 10
    text,
    total_worker_time/execution_count AS avg_cpu_time,
    total_elapsed_time/execution_count AS avg_elapsed_time
FROM sys.dm_exec_query_stats
CROSS APPLY sys.dm_exec_sql_text(sql_handle)
ORDER BY avg_elapsed_time DESC;
```
Common Issues:
| Symptom | Likely Cause | Solution |
|---|---|---|
| Full table scan | Missing statistics | Update statistics |
| High CPU time | Complex joins | Simplify query, add indexes |
| Memory spills | Insufficient memory grant | Increase DWU, optimize query |
| Lock waits | Concurrent operations | Use snapshot isolation |
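For the "missing statistics" row, a hedged T-SQL sketch; the table and column names (`dbo.FactSales`, `OrderDateKey`) are hypothetical:

```sql
-- Create single-column statistics on a commonly joined/filtered column
CREATE STATISTICS stats_factsales_datekey ON dbo.FactSales (OrderDateKey);

-- Refresh statistics on the table after large loads
UPDATE STATISTICS dbo.FactSales;
```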
📚 Related: SQL Performance Best Practices
Spark Performance Diagnostics¶
Collect Job Metrics:
- Access the Spark UI: `https://<workspace-name>.dev.azuresynapse.net/sparkui`
- Review the job timeline and stages
- Check executor metrics
Common Issues:
| Symptom | Likely Cause | Solution |
|---|---|---|
| Long stage time | Data shuffle | Use broadcast join |
| Task skew | Uneven partitioning | Repartition data |
| OOM errors | Insufficient memory | Increase executor memory |
| Spill to disk | Memory pressure | Optimize transformations |
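For the shuffle case, a minimal PySpark broadcast-join sketch; `large_df`, `small_df`, and the join key are hypothetical names:

```python
from pyspark.sql.functions import broadcast

# Ship the small table to every executor instead of shuffling the large one
result = large_df.join(broadcast(small_df), "customer_id")
```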
📚 Related: Spark Performance Best Practices
Pipeline Performance Diagnostics¶
Check Pipeline Metrics:
- Navigate to Monitor → Pipeline runs
- Review activity durations
- Check data integration units (DIUs)
Common Issues:
| Symptom | Likely Cause | Solution |
|---|---|---|
| Slow copy activity | Low DIUs | Increase DIU count |
| Mapping data flow slow | Small cluster | Scale up integration runtime |
| Lookup timeout | Large result set | Use query optimization |
📚 Related: Pipeline Optimization
🏞️ Delta Lake Issues¶
Common Scenarios¶
Scenario 1: Table Not Found¶
Symptoms: Table or view not found errors
Diagnostic Steps:
```python
# Check if the table exists in the metastore
spark.sql("SHOW TABLES IN <database>").show()

# Check the Delta Lake path (Synapse notebooks expose mssparkutils rather than Databricks' dbutils)
from notebookutils import mssparkutils
mssparkutils.fs.ls("/mnt/delta/<table-path>")

# Verify the Delta transaction log
mssparkutils.fs.ls("/mnt/delta/<table-path>/_delta_log")
```
Solutions:
- Refresh table metadata: `REFRESH TABLE <table_name>`
- Repair partition metadata: `MSCK REPAIR TABLE <table_name>`
- Recreate the table if it is corrupted
📚 Related: Delta Lake Troubleshooting
Scenario 2: Concurrent Write Conflicts¶
Symptoms: ConcurrentAppendException, ConcurrentDeleteException
Solutions:
- Implement optimistic concurrency control
- Use merge operations instead of separate update/insert
- Enable auto-optimize for Delta tables
```python
from delta.tables import DeltaTable

# Enable optimized writes
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")

# Use merge for upserts; `source` is the incoming DataFrame of changes
deltaTable = DeltaTable.forName(spark, "<table_name>")
deltaTable.alias("target").merge(
    source.alias("source"),
    "target.id = source.id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
```
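Where retries are acceptable, a minimal optimistic-retry sketch, assuming the `delta-spark` package (recent releases expose `delta.exceptions.ConcurrentAppendException`); the function and its arguments are hypothetical:

```python
import time

from delta.exceptions import ConcurrentAppendException  # assumed available in your delta-spark version


def merge_with_retry(delta_table, source_df, condition, max_attempts=3):
    """Retry a Delta merge when a concurrent writer wins the commit race."""
    for attempt in range(1, max_attempts + 1):
        try:
            (delta_table.alias("target")
                .merge(source_df.alias("source"), condition)
                .whenMatchedUpdateAll()
                .whenNotMatchedInsertAll()
                .execute())
            return
        except ConcurrentAppendException:
            if attempt == max_attempts:
                raise
            time.sleep(2 ** attempt)  # exponential backoff before retrying
```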
Scenario 3: Performance Degradation¶
Symptoms: Slow reads, increasing query time
Diagnostic Steps:
```python
# Check table statistics
spark.sql("DESCRIBE DETAIL <table_name>").show()

# Check file count and total size
spark.sql("DESCRIBE DETAIL <table_name>").select("numFiles", "sizeInBytes").show()
```
Solutions:
- Run OPTIMIZE: `OPTIMIZE <table_name>`
- Z-ORDER on frequently filtered columns: `OPTIMIZE <table_name> ZORDER BY (column1, column2)`
- Vacuum old files: `VACUUM <table_name> RETAIN 168 HOURS`
📚 Related: Delta Lake Optimization
🔄 Pipeline Issues¶
Common Failure Patterns¶
Pattern 1: Copy Activity Failures¶
Diagnostic Questions:
- What's the error message?
- Is it source or sink failure?
- Is it data-related or connectivity-related?
Common Errors:
| Error Code | Description | Solution |
|---|---|---|
| `UserErrorFailedToReadFromSqlSource` | SQL query error | Validate SQL syntax |
| `UserErrorInvalidColumnMapping` | Schema mismatch | Update column mappings |
| `UserErrorSinkPathNotFound` | Destination path missing | Create target container |
| `UserErrorQuotaExceeded` | DIU limit reached | Request quota increase |
Pattern 2: Data Flow Failures¶
Diagnostic Steps:
- Enable debug mode in Data Flow
- Review data preview at each transformation
- Check cluster logs
Common Issues:
- Schema drift not handled → Enable "Allow schema drift"
- Null values causing errors → Add derived column with null handling
- Memory errors → Increase cluster size or optimize transformations
📚 Related: Pipeline Troubleshooting
☁️ Serverless SQL Issues¶
Common Problems¶
Problem 1: File Format Errors¶
Error: Failed to read parquet file or CSV parsing error
Solutions:
```sql
-- Test the file format with OPENROWSET
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://<storage>.dfs.core.windows.net/container/file.parquet',
    FORMAT = 'PARQUET'
) AS [result];

-- Handle CSV with custom settings
SELECT *
FROM OPENROWSET(
    BULK 'https://<storage>.dfs.core.windows.net/container/file.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    FIRSTROW = 2,
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n'
) AS [result];
```
Problem 2: Performance Issues¶
Symptoms: Slow queries on data lake files
Solutions:
- Create external tables instead of OPENROWSET
- Use partitioned data sources
- Create statistics on external tables
- Use CETAS to cache results (see the sketch below)
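A hedged CETAS sketch; the schema, external data source, and file format names (`curated`, `my_lake`, `parquet_format`) are assumptions that must already exist in your serverless database:

```sql
-- Materialize query results back to the lake, then query the external table instead
CREATE EXTERNAL TABLE curated.daily_sales
WITH (
    LOCATION = 'curated/daily_sales/',   -- folder relative to the data source root
    DATA_SOURCE = my_lake,               -- assumed pre-created external data source
    FILE_FORMAT = parquet_format         -- assumed pre-created Parquet file format
)
AS
SELECT *
FROM OPENROWSET(
    BULK 'https://<storage>.dfs.core.windows.net/container/file.parquet',
    FORMAT = 'PARQUET'
) AS [result];
```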
📚 Related: Serverless SQL Troubleshooting
🔥 Spark Issues¶
Common Job Failures¶
Failure Type 1: Out of Memory¶
Symptoms: java.lang.OutOfMemoryError, executor failures
Diagnostic Steps:
```python
# Check executor and driver memory configuration
spark.conf.get("spark.executor.memory")
spark.conf.get("spark.driver.memory")

# Per-executor memory usage is best inspected in the Spark UI's Executors tab;
# PySpark's status tracker does not expose executor memory metrics directly
```
Solutions:
- Increase executor memory: `spark.conf.set("spark.executor.memory", "8g")` (memory settings are static, so set them in the Spark pool or session configuration before the session starts)
- Increase driver memory: `spark.conf.set("spark.driver.memory", "4g")`
- Optimize data structures (use arrays instead of lists)
- Process data in smaller batches
Failure Type 2: Shuffle Failures¶
Symptoms: Fetch failed, shuffle read errors
Solutions:
- Increase shuffle partitions: `spark.conf.set("spark.sql.shuffle.partitions", "200")`
- Enable adaptive query execution: `spark.conf.set("spark.sql.adaptive.enabled", "true")`
- Use broadcast joins for small tables
📚 Related: Spark Troubleshooting
🆘 Emergency Checklist¶
When all else fails, work through this checklist:
- Check Azure Service Health
- Review Azure Monitor alerts
- Check recent changes in deployment history
- Verify quota limits haven't been reached
- Review activity logs for error patterns
- Test from different network location
- Compare working vs. non-working configuration
- Create minimal reproduction case
- Gather diagnostic information
- Open Azure support ticket
📚 Additional Resources¶
| Resource | Description | Link |
|---|---|---|
| 🔧 Troubleshooting Hub | All troubleshooting guides | Troubleshooting |
| 📖 Best Practices | Prevention strategies | Best Practices |
| 📊 Monitoring Setup | Proactive monitoring | Monitoring |
| 💬 FAQ | Common questions | FAQ |
💡 Pro Tip: Keep a troubleshooting journal documenting issues and resolutions. Patterns often emerge that help prevent future problems.
Last Updated: January 2025