

❌ Failed Refresh Triage Runbook

Last Updated: 2026-05-05 | Version: 1.0 | Audience: Data engineers, on-call SRE, BI developers | Purpose: Diagnose and recover from semantic model refresh failures, pipeline failures, notebook failures, and Dataflow Gen2 failures in Microsoft Fabric.


📑 Table of Contents

  1. Trigger Conditions
  2. Severity Classification
  3. Decision Flowchart
  4. Step-by-Step Procedure
  5. Failure-Specific Diagnosis
  6. Escalation Path
  7. Post-Incident Review Checklist
  8. Related Documents

Trigger Conditions

Use this runbook when any of the following conditions are observed:

| # | Condition | Detection Source |
|---|-----------|------------------|
| 1 | Semantic model refresh shows Failed status | Power BI workspace → Refresh history |
| 2 | Data Factory pipeline run failed | Data Factory → Monitor → Pipeline runs |
| 3 | Notebook execution failed or timed out | Workspace → Monitor → Notebook runs |
| 4 | Dataflow Gen2 refresh error | Workspace → Dataflow → Refresh history |
| 5 | Scheduled refresh did not start (missed schedule) | Refresh history shows gap; no run recorded |
| 6 | Data Activator alert for stale data in downstream report | Activator reflex notification |
| 7 | Workspace Monitoring shows RefreshFailed events | system.pipeline_runs or system.dataset_refreshes |

Severity Classification

| Severity | Condition | Example | Response SLA |
|----------|-----------|---------|--------------|
| SEV1 | Compliance-critical refresh failed; regulatory reporting stale | CTR daily aggregation pipeline failed; SOX audit dataset >24h stale | 5 min page |
| SEV2 | Production BI report dataset refresh failed; exec dashboard stale | gold.fact_daily_slot_performance dataset refresh failed for 2 consecutive runs | 15 min page |
| SEV3 | Non-critical pipeline or notebook failed; workaround exists | Silver dedup notebook failed but can be re-run manually | 2 hr ack |
| SEV4 | Dev/test refresh failed; no downstream impact | Dev workspace notebook syntax error | 24 hr ack |

Decision Flowchart

```mermaid
flowchart TD
    A([Refresh Failure Detected]) --> B{Which item<br/>type failed?}
    B -->|Semantic Model| C[Check refresh history<br/>error message]
    B -->|Pipeline| D[Check pipeline run<br/>activity errors]
    B -->|Notebook| E[Check notebook output<br/>and Spark logs]
    B -->|Dataflow Gen2| F[Check dataflow<br/>refresh details]

    C --> G{Error type?}
    G -->|Credentials expired| H[Rotate credentials<br/>→ Step 7]
    G -->|Timeout| I[Optimize model or<br/>increase timeout → Step 8]
    G -->|Capacity throttled| J[See Capacity<br/>Throttling Runbook]
    G -->|Data source error| K[Check source<br/>connectivity → Step 9]

    D --> L{Activity error?}
    L -->|Source not found| K
    L -->|Schema mismatch| M[Enable schema evolution<br/>→ Step 10]
    L -->|Auth failure| H
    L -->|Timeout| I

    E --> N{Notebook error?}
    N -->|OOM / SparkException| O[Tune Spark config<br/>→ Step 11]
    N -->|ModuleNotFoundError| P[Fix environment<br/>→ Step 12]
    N -->|Data error| K

    F --> Q{Dataflow error?}
    Q -->|Mashup error| R[Fix M query<br/>→ Step 13]
    Q -->|Gateway offline| S[Restart gateway<br/>→ Step 14]
    Q -->|Timeout| I
```
Step-by-Step Procedure

Phase 1 — Detect and Classify (0–15 min)

Step 1. Identify the failed item type (semantic model, pipeline, notebook, or Dataflow Gen2) and open its run history in the Fabric portal.

Step 2. Copy the full error message from the failed run details. For pipelines, expand each activity to find the first failed activity.

Step 3. Check whether this is a first-time failure or a recurring pattern:

```kusto
// Workspace Monitoring — recent failures for this item
system_pipeline_runs
| where ItemName == "<item-name>"
| where Status == "Failed"
| where Timestamp > ago(7d)
| project Timestamp, Status, ErrorMessage, DurationSeconds
| order by Timestamp desc
```

Step 4. Classify severity using the table above. Open an incident channel if SEV1 or SEV2.

Phase 2 — Diagnose (15–45 min)

Step 5. Determine the error category from the error message:

| Error Pattern | Category | Go To |
|---------------|----------|-------|
| `InvalidCredentials`, `AuthenticationFailed`, `TokenExpired` | Credential | Step 7 |
| `Timeout`, `OperationTimedOut`, `TaskCanceled` | Timeout | Step 8 |
| `SourceNotFound`, `FileNotFound`, `ConnectionFailed` | Source connectivity | Step 9 |
| `InvalidSchema`, `SchemaChanged`, `ColumnNotFound` | Schema mismatch | Step 10 |
| `OutOfMemoryError`, `SparkException`, `ExecutorLost` | Spark resource | Step 11 |
| `ModuleNotFoundError`, `ImportError` | Environment | Step 12 |
| `MashupError`, `DataFormat.Error` | Dataflow M query | Step 13 |
| `GatewayOffline`, `GatewayUnreachable` | Gateway | Step 14 |
| `CapacityThrottled`, `429 Too Many Requests` | Capacity | Capacity Throttling Runbook |
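For monitoring scripts, the lookup in Step 5 can be automated. A minimal sketch, assuming the pattern strings from the table above appear verbatim in the raw error message (the function and constant names are illustrative, not a Fabric API):

```python
# Map error-message substrings to the triage categories from the table above.
# Patterns mirror this runbook's table; extend to match your telemetry.
ERROR_CATEGORIES = [
    (("InvalidCredentials", "AuthenticationFailed", "TokenExpired"), "credential"),
    (("Timeout", "OperationTimedOut", "TaskCanceled"), "timeout"),
    (("SourceNotFound", "FileNotFound", "ConnectionFailed"), "source"),
    (("InvalidSchema", "SchemaChanged", "ColumnNotFound"), "schema"),
    (("OutOfMemoryError", "SparkException", "ExecutorLost"), "spark"),
    (("ModuleNotFoundError", "ImportError"), "environment"),
    (("MashupError", "DataFormat.Error"), "mashup"),
    (("GatewayOffline", "GatewayUnreachable"), "gateway"),
    (("CapacityThrottled", "429"), "capacity"),
]

def classify_error(message: str) -> str:
    """Return the first matching triage category, or 'unknown' (→ Step 6)."""
    for patterns, category in ERROR_CATEGORIES:
        if any(p in message for p in patterns):
            return category
    return "unknown"
```

Note the list is ordered: a message containing both `Timeout` and a gateway code classifies as timeout first, which matches the flowchart's routing.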

Step 6. If the error pattern does not match any category, collect diagnostic artifacts (error message, run ID, timestamps, item name) and proceed to Escalation Path.

Phase 3 — Resolve

Step 7. Credential failure — Rotate the affected credential:

  1. Navigate to Workspace → Settings → Manage connections and gateways.
  2. Edit the connection for the affected data source.
  3. Re-enter or refresh the credential (OAuth, service principal, or key).
  4. Test the connection.
  5. Re-trigger the refresh.

Step 8. Timeout failure — Increase timeout or optimize the operation:

  1. For semantic models: reduce table row counts, add incremental refresh, or remove unnecessary tables.
  2. For pipelines: increase the activity timeout in the pipeline JSON ("timeout": "02:00:00").
  3. For notebooks: partition the data processing into smaller batches.
  4. Re-trigger the refresh.
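The batching in Step 8.3 can be sketched as follows — a hypothetical helper (not a Fabric API) that walks a long backfill window in smaller slices so each notebook run finishes well inside the timeout; the dates and batch size are illustrative:

```python
from datetime import date, timedelta

def date_batches(start: date, end: date, days_per_batch: int):
    """Yield (batch_start, batch_end) half-open windows covering [start, end)."""
    cursor = start
    while cursor < end:
        batch_end = min(cursor + timedelta(days=days_per_batch), end)
        yield cursor, batch_end
        cursor = batch_end

# Example: process a 30-day backfill as 7-day slices instead of one long run.
batches = list(date_batches(date(2026, 4, 1), date(2026, 5, 1), 7))
```

Each `(batch_start, batch_end)` pair would then parameterize one notebook run, so a single slow slice fails and retries on its own.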

Step 9. Source connectivity failure — Verify the data source is reachable:

  1. Test the connection from the Fabric portal (Manage connections → Test connection).
  2. If using an on-premises data gateway, verify the gateway service is running (see Step 14).
  3. Check firewall rules and network connectivity between Fabric and the source.
  4. If the source file was moved or deleted, update the path in the connection settings.
  5. Re-trigger the refresh.

Step 10. Schema mismatch — Reconcile the schema change:

  1. Compare the current source schema against the expected schema.
  2. For Delta Lake notebooks, enable schema evolution:

```python
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
```

  3. For pipelines, update the Copy Activity schema mapping.
  4. For semantic models, open the model in the web editor and refresh the table schema.
  5. Re-trigger the refresh.
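The comparison in Step 10.1 is easy to script. A minimal sketch that diffs the columns actually present in the source against what the pipeline expects; the column names below are illustrative only:

```python
def diff_schema(expected: list[str], actual: list[str]) -> dict:
    """Report columns missing from the source and new columns it added."""
    expected_set, actual_set = set(expected), set(actual)
    return {
        "missing": sorted(expected_set - actual_set),  # these break the load
        "added": sorted(actual_set - expected_set),    # candidates for schema evolution
    }

# Example: the source renamed player_id and added a loyalty_tier column.
result = diff_schema(
    expected=["player_id", "slot_id", "coin_in"],
    actual=["member_id", "slot_id", "coin_in", "loyalty_tier"],
)
```

"Missing" columns need a mapping fix or a source rollback; "added" columns are the case schema evolution (Step 10.2) handles automatically.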

Step 11. Spark resource failure (OOM / executor lost) — Tune Spark configuration:

  1. Increase driver/executor memory in the notebook's Spark configuration (memory settings take effect at session start, so restart the Spark session after changing them):

```python
spark.conf.set("spark.executor.memory", "8g")
spark.conf.set("spark.driver.memory", "8g")
```

  2. Increase shuffle partitions for large datasets:

```python
spark.conf.set("spark.sql.shuffle.partitions", "200")
```

  3. If the notebook is processing a very large file, consider partitioning the input.
  4. Re-trigger the notebook run.

Step 12. Environment / module error — Fix the Spark environment:

  1. Navigate to Workspace → Environments and open the environment attached to the notebook.
  2. Add the missing library to the environment's public libraries or custom libraries.
  3. Publish the environment and wait for it to build.
  4. Re-trigger the notebook run.
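This failure mode can also be caught earlier: a defensive check in the notebook's first cell surfaces every missing library at once instead of failing on the first import. A sketch using only the standard library (the `REQUIRED` list is an example; substitute the libraries your notebook actually needs):

```python
import importlib.util

def missing_modules(required: list[str]) -> list[str]:
    """Return the subset of `required` not importable in this session."""
    return [name for name in required if importlib.util.find_spec(name) is None]

# First notebook cell: fail fast with the full list, not just the first error.
missing = missing_modules(["json", "csv"])  # replace with your notebook's libraries
if missing:
    raise ModuleNotFoundError(
        f"Add these to the attached Environment and re-publish: {missing}"
    )
```

The error message then points the on-call engineer straight at Step 12's fix rather than at an arbitrary mid-notebook traceback.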

Step 13. Dataflow Gen2 M query error — Debug the mashup:

  1. Open the Dataflow Gen2 item and navigate to the failing query step.
  2. Check the error details pane for the specific M expression that failed.
  3. Common fixes: update column names, fix data type conversions, handle nulls with try ... otherwise.
  4. Test the query in the Dataflow editor.
  5. Save and re-trigger the refresh.

Step 14. Gateway offline — Restart the on-premises data gateway:

  1. RDP into the gateway machine (the on-premises data gateway runs on Windows).
  2. Open Services and restart the On-premises data gateway service.
  3. Verify the gateway status in the Fabric portal (Manage connections and gateways → Gateways).
  4. If the gateway remains offline, check for Windows updates, disk space, and network connectivity.
  5. Re-trigger the refresh.

Phase 4 — Verify

Step 15. After re-triggering the refresh, monitor the run to completion:

  - Confirm the run status changes to Succeeded.
  - Verify data freshness in the target table or report.
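When the re-trigger is scripted rather than manual, the monitoring in Step 15 can be automated with a generic polling loop. In this sketch `get_refresh_status` is a caller-supplied placeholder (for example, a function that reads the refresh-history endpoint of your item), not a real Fabric function:

```python
import time

def wait_for_refresh(get_refresh_status, timeout_s: float = 3600,
                     poll_s: float = 30) -> str:
    """Poll a zero-arg status function until the run reaches a terminal state.

    `get_refresh_status` should return 'InProgress', 'Succeeded', or 'Failed'
    (placeholder contract for this sketch). Returns the terminal status, or
    'TimedOut' if the deadline passes first.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_refresh_status()
        if status in ("Succeeded", "Failed"):
            return status
        time.sleep(poll_s)
    return "TimedOut"
```

A returned "Failed" should loop back to Step 5 with the new error message; "TimedOut" suggests the timeout path in Step 8.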

Step 16. If the failure was recurring, set up a Data Activator reflex to alert on consecutive failures:

```kusto
system_dataset_refreshes
| where ItemName == "<dataset-name>"
| where Status == "Failed"
| summarize FailCount = count() by bin(Timestamp, 1h)
| where FailCount >= 2
```
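The same consecutive-failure rule can be checked from a script against refresh-history records, for workspaces where the KQL reflex above is not yet wired up. The record shape below is illustrative; the Workspace Monitoring query remains the source of truth:

```python
def consecutive_failures(statuses: list[str]) -> int:
    """Count trailing consecutive 'Failed' runs; `statuses` is newest-first."""
    count = 0
    for status in statuses:
        if status != "Failed":
            break
        count += 1
    return count

# Alert when the two most recent refreshes both failed, matching the KQL above.
recent = ["Failed", "Failed", "Succeeded", "Failed"]
should_alert = consecutive_failures(recent) >= 2
```

Counting only the trailing streak (rather than all failures in a window) avoids re-paging on an old failure that a later successful run already cleared.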

Step 17. Document the resolution and proceed to the Post-Incident Review Checklist.


Failure-Specific Diagnosis

Semantic Model Refresh

| Error | Root Cause | Resolution |
|-------|------------|------------|
| `CredentialsExpired` | OAuth token expired | Re-authenticate connection |
| `DataSourceNotFound` | Source table dropped or renamed | Update model table source |
| `ProcessingTimeout` | Model too large for refresh window | Enable incremental refresh |
| `CapacityNotAvailable` | Capacity throttled during refresh | See Capacity Throttling |

Pipeline Failures

| Error | Root Cause | Resolution |
|-------|------------|------------|
| `CopyActivity_InvalidSchema` | Source schema changed | Update schema mapping |
| `NotebookActivity_Failed` | Downstream notebook error | Debug notebook independently |
| `DependencyConditionNotMet` | Upstream activity failed | Fix upstream activity first |
| `UserCancelledExecution` | Manual cancellation | Verify intent; re-run if accidental |

Notebook Failures

| Error | Root Cause | Resolution |
|-------|------------|------------|
| `Py4JJavaError: OutOfMemoryError` | Insufficient executor memory | Increase memory or partition data |
| `AnalysisException: Table not found` | Table dropped or path changed | Verify table exists in Lakehouse |
| `IllegalArgumentException` | Bad parameter value | Check parameterized cell values |

Dataflow Gen2 Failures

| Error | Root Cause | Resolution |
|-------|------------|------------|
| `DataFormat.Error` | Source data type mismatch | Add type conversion in M query |
| `Expression.Error` | Null value in non-nullable operation | Add null handling (try ... otherwise) |
| `GatewayTimeout` | Gateway unresponsive | Restart gateway service |

Escalation Path

| Time Elapsed | Action | Contact |
|--------------|--------|---------|
| 0 min | On-call engineer begins triage | On-call rotation |
| 15 min | If SEV1/SEV2 and root cause unclear, escalate to Data Platform Lead | Data Platform Lead |
| 30 min | If compliance refresh (CTR/SAR/W-2G), notify Compliance Officer | Compliance Officer |
| 1 hr | If unresolved, engage Microsoft support | Microsoft Unified Support (Sev B) |
| 2 hr | If SEV1 still unresolved, escalate to VP Engineering | VP Engineering |
| 4 hr | If gateway-related, engage gateway infrastructure team | Infrastructure Team |

Post-Incident Review Checklist

  • Failed item type and name documented
  • Error message and run ID captured
  • Root cause category identified (credential, timeout, source, schema, resource, environment, M query, gateway)
  • Resolution steps taken documented
  • Refresh successfully re-triggered and verified
  • Recurring failure pattern checked — is this the first occurrence or a repeat?
  • Credential rotation scheduled (if credential failure)
  • Timeout / resource settings permanently updated (if applicable)
  • Schema evolution or mapping updated (if schema mismatch)
  • Monitoring alert created for consecutive failures
  • Blameless postmortem completed within 48 hours (SEV1/SEV2 only)

Related Documents

| Document | Description |
|----------|-------------|
| Error Handling & Monitoring | Pipeline error architecture |
| Monitoring & Observability | Dashboard and alert setup |
| Alerting & Data Activator | Data Activator reflex patterns |
| Dataflow Gen2 | Dataflow Gen2 best practices |
| Capacity Throttling | When failure is caused by throttling |
| Incident Response Template | Master incident response structure |
| Testing Strategies | Data quality and integration testing |

⬆️ Back to Top | 📋 Runbook Index | 🏠 Home