Troubleshooting Guide¶
Note
Quick Summary: Comprehensive troubleshooting guide covering Bicep deployment failures, dbt issues, data quality problems, Azure Functions, ADF pipelines, Stream Analytics, Databricks, Purview, Great Expectations, Key Vault, Cosmos DB, and CI/CD workflow issues.
📑 Table of Contents¶
- 📦 Bicep Deployment Issues
- 🗄️ dbt Issues
- 📊 Data Quality Issues
- ⚙️ Azure Functions Issues
- 🔄 Deployment Rollback
- 🏗️ Regional Outage / Disaster Recovery
- 🔧 ADF Pipeline Issues
- 📊 Stream Analytics Issues
- ⚡ Databricks Issues
- 📊 Purview Issues
- 🧪 Great Expectations Issues
- 🔒 Key Vault Issues
- 🗄️ Cosmos DB Issues
- 🔄 CI/CD Workflow Issues
📦 Bicep Deployment Issues¶
"Resource provider not registered"¶
Fix: Register the provider:
az provider register --namespace Microsoft.Purview
az provider register --namespace Microsoft.Databricks
az provider register --namespace Microsoft.Synapse
"Template validation failed"¶
Fix: Run bicep build locally first:
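A minimal local validation, assuming the entry template lives at `deploy/bicep/main.bicep` (the path is an assumption based on this repo's layout):

```shell
# Compile the template locally to catch syntax and type errors before deploying
az bicep build --file deploy/bicep/main.bicep

# Optionally run a server-side validation against the target resource group
az deployment group validate \
  --resource-group rg-csadlz-dev \
  --template-file deploy/bicep/main.bicep
```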
"DeploymentFailed - PrivateEndpoint"¶
Private endpoints require:
- The target VNet/subnet exists
- The Private DNS Zone exists and is linked to the VNet
- The subnet has `privateEndpointNetworkPolicies` set to `Disabled`
"Conflict - RoleAssignment"¶
Safe to ignore on re-deployments. The role assignment already exists.
🗄️ dbt Issues¶
"Connection refused" on dbt compile¶
Ensure your profiles.yml has correct Databricks connection info:
csa_analytics:
  target: dev
  outputs:
    dev:
      type: databricks
      catalog: csa_analytics        # Unity Catalog name (adjust to your setup)
      schema: default               # target schema; required by dbt-databricks
      host: "your-workspace.azuredatabricks.net"
      http_path: "/sql/1.0/warehouses/your-warehouse-id"
      token: "{{ env_var('DBT_DATABRICKS_TOKEN') }}"
"Catalog not found"¶
Run Unity Catalog setup first:
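The catalog and target schema must exist before dbt can build into them. A sketch in Databricks SQL (catalog and schema names are assumptions matching the profile above):

```sql
-- Create the catalog and schema if they are missing
CREATE CATALOG IF NOT EXISTS csa_analytics;
CREATE SCHEMA IF NOT EXISTS csa_analytics.dev;
```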
📊 Data Quality Issues¶
Volume check shows "warn" instead of "pass"¶
This means dbt CLI is not available in the current environment. Volume checks require dbt to query actual row counts. Install dbt: pip install dbt-databricks
Freshness check times out¶
Increase the timeout in run_quality_checks.py or check network connectivity to Databricks.
⚙️ Azure Functions Issues¶
"AI client not configured"¶
Set these environment variables in the Function App configuration:
- `AZURE_AI_ENDPOINT`: Your Azure AI Services endpoint URL
- `AZURE_AI_KEY`: Key Vault reference to your AI key
"Event Hub connection failed"¶
Verify the `EVENT_HUB_CONNECTION` app setting points to a valid Event Hub namespace connection string.
🔄 Deployment Rollback¶
If a deployment left Azure in a broken state, see ROLLBACK.md for the step-by-step rollback runbook. It covers Bicep redeploy, ADF pipeline restore, dbt full-refresh, Cosmos DB point-in-time restore, and storage account blob recovery.
🏗️ Regional Outage / Disaster Recovery¶
If the whole primary Azure region is down (not just a deploy gone bad), see DR.md for the failover runbook. It documents RPO/RTO targets per service, the primary/secondary region pairs, and the step-by-step failover and failback procedures.
🔧 ADF Pipeline Issues¶
Pipeline stuck in "InProgress"¶
Check for long-running activities in Monitor > Pipeline runs:
az datafactory pipeline-run query-by-factory \
--factory-name csadlzdevdf \
--resource-group rg-csadlz-dev \
--last-updated-after "2026-01-01T00:00:00Z" \
--last-updated-before "2026-12-31T00:00:00Z" \
--filters '[{"operand":"Status","operator":"Equals","values":["InProgress"]}]'
Fix: Cancel the run and check for:
- Databricks cluster that failed to start
- ADLS permission issues (managed identity needs `Storage Blob Data Contributor`)
- Timeout on Copy activities (increase in pipeline JSON)
"LinkedServiceNotFound" error¶
The linked service must be deployed before the pipeline that references it. Use the deployment script which handles ordering:
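The ordering requirement can be sketched with the az CLI; linked service names and JSON file paths below are illustrative assumptions:

```shell
# Deploy the linked service first...
az datafactory linked-service create \
  --factory-name csadlzdevdf --resource-group rg-csadlz-dev \
  --name ls_databricks --properties @adf/linkedServices/ls_databricks.json

# ...then the pipeline that references it
az datafactory pipeline create \
  --factory-name csadlzdevdf --resource-group rg-csadlz-dev \
  --name pl_medallion --pipeline @adf/pipelines/pl_medallion.json
```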
Trigger not firing¶
- Verify the trigger is started: `az datafactory trigger show --name tr_daily_medallion ...`
- Check the trigger's `runtimeState`; it must be `Started`
- If `Stopped`, start it: `az datafactory trigger start --name tr_daily_medallion ...`
📊 Stream Analytics Issues¶
"Input deserialization error"¶
The incoming event doesn't match the expected schema.
// Check for deserialization errors
AzureDiagnostics
| where Category == "Execution"
| where Level == "Error"
| project TimeGenerated, Message
Fix: Verify the event producer (scripts/streaming/produce_events.py) output matches the SA job input schema. Common issues:
- Missing required fields
- Wrong data types (string vs number)
- Nested JSON not flattened
"Output sink error" (Event Hub / ADX / Blob)¶
- Verify the output connection string is valid
- Check Event Hub namespace isn't throttled (quota exceeded)
- For ADX: verify the table exists and streaming ingestion is enabled
Query syntax error on deployment¶
Test queries locally before deploying:
# Validate ASAQL syntax
az stream-analytics query test --job-name <job> \
--resource-group <rg> \
--query-file scripts/streaming/queries/tumbling_window_event_counts.asaql
⚡ Databricks Issues¶
See the detailed DATABRICKS_GUIDE.md for full coverage. Quick fixes for common issues:
"Cluster terminated unexpectedly"¶
Check the cluster event log for OOM or spot instance eviction:
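One way to pull the event log, assuming the Databricks CLI is configured (the cluster ID is a placeholder; flags may vary by CLI version):

```shell
# List recent lifecycle events for the cluster, including terminations
databricks clusters events --cluster-id <cluster-id> --output JSON
```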
Fix: Increase driver/worker memory or switch from spot to on-demand.
"Delta table version conflict"¶
Concurrent writes to the same Delta table from multiple jobs:
Fix: Enable Delta auto-retry:
Or stagger job schedules to avoid overlap.
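Conflict handling can also live in the job code itself. A minimal Python sketch of retry-on-conflict (the exception-name matching and backoff values are illustrative assumptions to adapt to your Delta error types):

```python
import time

def retry_on_conflict(write_fn, retries=3, backoff_s=5,
                      conflict_errors=("ConcurrentAppendException",)):
    """Retry a Delta write when a concurrent-write conflict surfaces.

    write_fn: zero-argument callable that performs the write.
    Any raised exception whose class name is in conflict_errors is retried;
    other exceptions (and the final failed attempt) propagate to the caller.
    """
    for attempt in range(retries + 1):
        try:
            return write_fn()
        except Exception as exc:
            if type(exc).__name__ not in conflict_errors or attempt == retries:
                raise
            time.sleep(backoff_s * (attempt + 1))  # linear backoff between retries
```

The writer job wraps its `df.write.format("delta")...save()` call in a lambda and passes it to `retry_on_conflict`.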
📊 Purview Issues¶
"Scan failed: Access denied"¶
The Purview managed identity needs access to the data source:
- Storage: Assign `Storage Blob Data Reader` to the Purview MI
- Cosmos DB: Assign `Cosmos DB Account Reader` to the Purview MI
- Databricks: Generate a PAT and store it in Purview credentials
"Classification rules not applied"¶
- Verify custom classification rules are loaded:
- Check that the scan ruleset includes the custom rules
- Re-run the scan after updating classification rules
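Loaded classification rules can be listed through the Purview Scanning REST API; a sketch via `az rest` (the account name is a placeholder and the api-version is an assumption):

```shell
az rest --method GET \
  --url "https://<purview-account>.purview.azure.com/scan/classificationrules?api-version=2022-02-01-preview" \
  --resource "https://purview.azure.net"
```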
"Lineage not showing"¶
For ADF-to-Purview lineage:
- Verify `purviewAccountId` is set in the ADF Bicep parameters
- Check that ADF's managed identity has the `Purview Data Curator` role
- Run a pipeline and wait 5-10 minutes for lineage to propagate
🧪 Great Expectations Issues¶
"No suites configured" warning¶
The GE runner found no suites in quality-rules.yaml:
- Verify `quality-rules.yaml` has a `great_expectations.suites` section
- Check for YAML syntax errors:
  python -c "import yaml; yaml.safe_load(open('csa_platform/governance/dataquality/quality-rules.yaml'))"
"Checkpoint not found"¶
GE checkpoint YAMLs live in great_expectations/checkpoints/. Verify:
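A quick check of what is on disk versus what GE can see (the `checkpoint list` subcommand assumes the legacy GE CLI, available in versions before 0.18):

```shell
# List checkpoint YAMLs on disk
ls great_expectations/checkpoints/

# Confirm GE itself resolves them
great_expectations checkpoint list
```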
"great_expectations not installed"¶
The GE package is optional (200MB+). Install via:
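For example, into the active environment:

```shell
pip install great_expectations
```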
Or use the in-memory fallback by passing sample_data= to run_ge_checkpoints().
🔒 Key Vault Issues¶
"SecretNotFound" or "Forbidden"¶
- Verify the secret exists: `az keyvault secret show --vault-name <vault> --name <secret>`
- Check access policy: the calling identity needs `Get` permission on secrets
- If using RBAC: verify the identity has the `Key Vault Secrets User` role
- Check if Key Vault is behind a private endpoint; the caller must be in the VNet
"Key Vault is soft-deleted"¶
A previously deleted Key Vault with the same name blocks recreation:
az keyvault recover --name <vault> # Recover it
# OR
az keyvault purge --name <vault> # Permanently delete
🗄️ Cosmos DB Issues¶
"Request rate too large" (429 throttling)¶
The container is exceeding its provisioned RU/s:
Fix: Increase RU/s or enable autoscale:
az cosmosdb sql container throughput migrate \
  --resource-group <rg> --account-name <account> \
  --database-name <db> --name <container> \
  --throughput-type autoscale   # skip if already autoscale
az cosmosdb sql container throughput update \
  --resource-group <rg> --account-name <account> \
  --database-name <db> --name <container> \
  --max-throughput 10000
"Partition key not found"¶
Verify the partition key path matches between the Bicep template and the application code. The Cosmos Bicep module sets the partition key during container creation.
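The deployed partition key path can be read back with the az CLI to compare against the application code (all names are placeholders):

```shell
# Prints the partition key paths, e.g. ["/deviceId"]
az cosmosdb sql container show \
  --resource-group <rg> --account-name <account> \
  --database-name <db> --name <container> \
  --query "resource.partitionKey.paths"
```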
🔄 CI/CD Workflow Issues¶
"OIDC token request failed"¶
The GitHub Actions OIDC federation to Azure failed:
- Verify the federated credential exists on the service principal
- Check the `subject` claim matches: `repo:<org>/<repo>:ref:refs/heads/main`
- Verify the `AZURE_CLIENT_ID`, `AZURE_TENANT_ID`, and `AZURE_SUBSCRIPTION_ID` secrets
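The federated credential can be inspected or created with the az CLI (the app client ID and credential name are placeholders):

```shell
# List existing federated credentials on the app registration
az ad app federated-credential list --id <app-client-id>

# Create one scoped to pushes on the main branch
az ad app federated-credential create --id <app-client-id> --parameters '{
  "name": "github-main",
  "issuer": "https://token.actions.githubusercontent.com",
  "subject": "repo:<org>/<repo>:ref:refs/heads/main",
  "audiences": ["api://AzureADTokenExchange"]
}'
```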
Coverage gate failing¶
The `pytest --cov-fail-under=80` gate requires 80% coverage:
Check which files are below threshold and add tests.
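To see which files drag the total down (the package name `csa_platform` is an assumption based on paths elsewhere in this guide):

```shell
# Per-file coverage with the uncovered line numbers listed
pytest --cov=csa_platform --cov-report=term-missing
```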
Bicep what-if PR comment not appearing¶
- The `bicep-whatif.yml` workflow requires Azure OIDC credentials
- It only runs on PRs that modify `deploy/bicep/**` files
- The bot needs write permissions on the PR; check `permissions: pull-requests: write`
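The relevant workflow stanza looks roughly like this (a sketch; only the `paths` filter and `permissions` keys are taken from the points above):

```yaml
# .github/workflows/bicep-whatif.yml (excerpt)
on:
  pull_request:
    paths:
      - "deploy/bicep/**"
permissions:
  id-token: write        # OIDC token exchange for Azure login
  pull-requests: write   # lets the bot post the what-if comment
```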
See also:
- ← Previous: Rollback
- → Next: Supply Chain Security
- ⌂ Index: Documentation home