
Runbook — Fabric Capacity Management

When to use this runbook: Your Fabric capacity is throttled, queries are slow, jobs are queuing, or CU utilization is sustained above 90%. This covers triage, immediate response by severity, right-sizing guidance, and long-term optimization for Microsoft Fabric capacities (F SKUs).

Before First Use — Customization Checklist

  • Populate the Contact Information table with your Fabric admin and platform engineering leads.
  • Replace <capacity-name> placeholders with your real capacity names (dev / staging / prod).
  • Confirm your Azure subscription IDs for the CLI commands below.
  • Verify the Capacity Metrics app is installed and shared with the on-call rotation.

Symptoms

Symptom | Severity | Likely Cause
Throttling alerts from Fabric portal | P1 | CU utilization sustained above 100%
Slow query performance (>3x normal) | P2 | CU contention from concurrent workloads
Jobs queuing / not starting | P2 | Capacity burst budget exhausted
CU usage sustained >90% | P3 | Under-provisioned capacity or workload growth
Background operations failing | P2 | Throttling rejecting low-priority operations
Scheduled refreshes timing out | P3 | Refresh window overlap with heavy queries
Lakehouse maintenance not completing | P3 | Maintenance throttled by foreground workloads

Triage

Step 1: Confirm throttling is active

  • Open the Fabric Admin Portal > Capacity settings > select the capacity in question.
  • Check the Capacity Metrics app for the last 24 hours. Look for sustained utilization above 100% (the red line).

Warning

Fabric smooths CU consumption over a rolling window. A brief spike above 100% is normal (burst). Sustained utilization above 100% for more than 10 minutes triggers throttling.

Step 2: Identify the utilization pattern

  • Burst overload — short spike (< 10 min) followed by recovery. Typically caused by a single large query or refresh. Usually self-heals.
  • Sustained overload — utilization above 90% for 30+ minutes. Indicates the capacity is under-sized for the current workload mix.
  • Periodic overload — spikes at the same time daily. Points to overlapping scheduled refreshes or batch jobs.
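
If your capacity logs land in a Log Analytics workspace, binning CU consumption by hour over several days makes these three patterns easy to tell apart. A minimal sketch, assuming the same placeholder FabricEvents table and columns used in Step 4 (substitute your actual log schema):

// CU consumed per hour over the last 7 days. Spikes at the same hour each day suggest periodic
// overload; isolated short peaks suggest burst; a persistently high floor suggests sustained overload.
FabricEvents
| where TimeGenerated > ago(7d)
| where CapacityName == "<capacity-name>"
| summarize TotalCU = sum(CUSeconds) by bin(TimeGenerated, 1h)
| order by TimeGenerated asc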

Step 3: Identify top CU consumers

Open the Capacity Metrics app and navigate to Items > sort by CU (s) descending.

  • Record the top 5 items by CU consumption.
  • Note whether the top consumers are interactive (queries, reports) or background (refreshes, dataflows, Spark jobs).
  • Check which workspace owns each top consumer.

Step 4: Check for runaway operations

// Top CU consumers on the capacity over the last 6 hours
// (FabricEvents is a placeholder; point this at your workspace monitoring / Log Analytics table)
FabricEvents
| where TimeGenerated > ago(6h)
| where CapacityName == "<capacity-name>"
| where OperationName == "QueryEnd" or OperationName == "RefreshEnd"
| summarize TotalCU = sum(CUSeconds), Count = count() by WorkspaceName, ItemName
| order by TotalCU desc
| take 20
  • Look for a single item consuming more than 40% of total CU.
  • Check for repeated failures that trigger automatic retries (each retry consumes CU).
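
A hedged sketch for surfacing the retry pattern from logs, using the same placeholder FabricEvents table as above (the Status column is an assumption; adjust to however your schema records failed operations):

// Items failing repeatedly in the last 6 hours. Each automatic retry consumes CU again.
FabricEvents
| where TimeGenerated > ago(6h)
| where CapacityName == "<capacity-name>"
| where Status == "Failed"
| summarize Failures = count(), WastedCU = sum(CUSeconds) by WorkspaceName, ItemName
| where Failures > 3
| order by WastedCU desc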

Step 5: Classify severity and move to response

Condition | Severity
Active throttling, user-facing impact | P1
CU > 90% sustained, reports slow | P2
CU trending upward, no current impact | P3

Response Actions

P1 — Critical: Active Throttling

Danger

Active throttling means user queries are being delayed or rejected. Act immediately.

  • Scale up the capacity to the next SKU tier:
# Requires the Microsoft Fabric extension for the Azure CLI
az fabric capacity update \
  --capacity-name <capacity-name> \
  --resource-group <rg> \
  --sku-name F16   # adjust to next tier above current
  • Pause non-critical workloads — in the Fabric portal, navigate to workspace settings for non-production workspaces and pause scheduled refreshes.
  • Kill runaway Spark sessions — in the Fabric portal, go to Monitoring hub > Spark applications > cancel long-running jobs that are not time-sensitive.
  • Notify stakeholders using the communication template in the Communication Templates section below.
  • Monitor — confirm CU drops below 80% within 15 minutes of scaling.
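
To verify the drop from logs rather than the Capacity Metrics app, a sketch that approximates utilization per 5-minute bin. It assumes the placeholder FabricEvents table from Step 4 and that you scaled to F16 (16 CU/s per the sizing table below); adjust CapacityCUPerSecond to your actual tier:

// Approximate utilization per 5-minute bin since scaling; it should settle below 80%.
let CapacityCUPerSecond = 16;   // F16; change to the SKU you scaled to
FabricEvents
| where TimeGenerated > ago(30m)
| where CapacityName == "<capacity-name>"
| summarize UsedCUSeconds = sum(CUSeconds) by bin(TimeGenerated, 5m)
| extend UtilizationPct = 100.0 * UsedCUSeconds / (CapacityCUPerSecond * 300)
| order by TimeGenerated asc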

P2 — Degraded: High Utilization

  • Identify the top 3 CU consumers from the Capacity Metrics app.
  • For heavy refreshes: reschedule to off-peak hours or stagger start times by 15-minute intervals.
  • For expensive queries: check for missing aggregations, unnecessary DAX calculations, or reports scanning full tables (see Optimization Tactics below).
  • For Spark jobs: review cluster sizing and consider switching to smaller, more frequent batch windows.
  • Plan a capacity SKU review for the next sprint if utilization has trended above 80% for 7+ consecutive days.

P3 — Warning: Approaching Limits

  • Review the scheduled refresh calendar for overlap patterns.
  • Audit workspace assignments — ensure dev/test workloads are not running on the production capacity (see the query sketch after this list).
  • Evaluate whether Import models can be converted to DirectLake to reduce refresh CU cost.
  • Create a capacity planning ticket for the next quarter.
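
For the workspace audit, a sketch over the same placeholder FabricEvents table from Step 4:

// Workspaces consuming CU on this capacity over the last 7 days.
// Anything that looks like dev/test does not belong on the production capacity.
FabricEvents
| where TimeGenerated > ago(7d)
| where CapacityName == "<capacity-name>"
| summarize TotalCU = sum(CUSeconds), DistinctItems = dcount(ItemName) by WorkspaceName
| order by TotalCU desc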

Capacity Sizing Guide

Workload Profile | Recommended SKU | CU/s | Typical Use Case
Small team, < 10 reports, light refresh | F2 | 2 | Proof of concept, dev/test
Department, 10-30 reports, hourly refresh | F4 | 4 | Single team analytics
Multi-team, 30-100 reports, Spark notebooks | F8 | 8 | Cross-functional analytics
Enterprise, 100+ reports, heavy ETL, DirectLake | F16 | 16 | Production analytics platform
Large enterprise, real-time + batch, ML workloads | F32 | 32 | Data platform with ML/AI
Mission-critical, high concurrency, low latency | F64 | 64 | Enterprise-wide, SLA-bound deployments

Tip

Start one SKU below your estimate. Fabric's burst capability handles temporary spikes. Scale up only if sustained utilization exceeds 80% for 7+ days.
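
As a rough worked example: per the table above, an F8 provides 8 CU, i.e. 8 CU-seconds of compute every second, or about 8 × 86,400 ≈ 691,200 CU-seconds per day. If the Capacity Metrics app shows daily consumption sitting comfortably under 80% of that budget (≈553,000 CU-seconds), the smaller SKU is likely sufficient; sustained consumption above that line is the signal to scale up.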

For a deeper breakdown of CU mechanics, burst windows, and smoothing behavior, see Capacity Planning & Cost Optimization on the Supercharge Microsoft Fabric companion site.


Optimization Tactics

Query and Report Optimization

  • Remove unused visuals — each visual on a report page fires a separate query. Fewer visuals = fewer CU seconds.
  • Use aggregations — pre-aggregated tables reduce scan volume by orders of magnitude for summary reports.
  • Avoid complex DAX iterators on large tables — SUMX, FILTER over millions of rows are CU-expensive. Push logic to the lakehouse SQL layer where possible.
  • Enable query caching — in the semantic model (dataset) settings, turn on query caching for stable, frequently viewed reports (available on Fabric capacities).

Storage and Refresh Optimization

  • DirectLake over Import — DirectLake models skip the refresh step entirely, eliminating refresh CU cost. Requires Delta tables in OneLake.
  • Incremental refresh — for Import models that cannot move to DirectLake, configure incremental refresh to process only changed partitions.
  • Materialized views — create materialized views in the SQL analytics endpoint for frequently queried aggregations.

Scheduling and Concurrency

  • Stagger refreshes — space scheduled refreshes at least 10 minutes apart to avoid burst stacking (a sketch for spotting current stacking follows this list).
  • Separate capacities — assign dev/test workspaces to a cheaper F2/F4 capacity. Never share production capacity with ad-hoc exploration.
  • Off-peak heavy jobs — schedule Spark notebooks and large dataflows outside business hours (e.g., 02:00-05:00 UTC).
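
To see where refresh schedules currently stack, a sketch over the placeholder FabricEvents table from Step 4 (refresh end times are used as a rough proxy for schedule slots):

// Refresh activity by 15-minute slot of the day over the last 7 days.
// Clusters of refreshes in the same slot indicate stacked schedules worth staggering.
FabricEvents
| where TimeGenerated > ago(7d)
| where CapacityName == "<capacity-name>"
| where OperationName == "RefreshEnd"
| extend SlotOfDay = bin(TimeGenerated - startofday(TimeGenerated), 15m)
| summarize Refreshes = count(), TotalCU = sum(CUSeconds) by SlotOfDay
| order by Refreshes desc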

Import vs. DirectQuery Decision Matrix

Factor | Import / DirectLake | DirectQuery
Query latency | Sub-second (cached) | Source-dependent
CU cost — query | Lower (compressed scans) | Higher (live queries)
CU cost — refresh | Periodic (Import); none for DirectLake | None
Data freshness | Refresh-dependent | Real-time
Best for | Dashboards, aggregations | Operational reporting

Monitoring Setup

Capacity Metrics App

  • Install the Microsoft Fabric Capacity Metrics app from AppSource.
  • Share the app with the on-call rotation and Fabric admins.
  • Pin the Utilization and Throttling pages to a shared dashboard.

Custom Alerts via Azure Monitor

# Create an alert rule for sustained high CU utilization
# (confirm the exact metric name for your capacity with: az monitor metrics list-definitions --resource <capacity-resource-id>)
az monitor metrics alert create \
  --name "Fabric-CU-High" \
  --resource-group <rg> \
  --scopes "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Fabric/capacities/<capacity-name>" \
  --condition "avg CapacityUtilization > 90" \
  --window-size 30m \
  --evaluation-frequency 5m \
  --action-group <action-group-id> \
  --description "Fabric capacity CU utilization above 90% for 30 minutes"

Key Metrics to Track

Metric | Threshold | Alert Severity
CU utilization (avg 30 min) | > 90% | Warning
CU utilization (avg 30 min) | > 100% | Critical
Throttling events (count/hr) | > 0 | Critical
Queued operations (count) | > 10 | Warning
Refresh duration (p95) | > 2x normal | Warning
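
The first four rows map to the Azure Monitor alert above or to the Capacity Metrics app; refresh duration is easiest to track from logs. A sketch, assuming the placeholder FabricEvents table from Step 4 plus a DurationSeconds column (an assumed field; substitute whatever your schema records):

// p95 refresh duration per item over the last 14 days. Compare against each item's historical normal.
FabricEvents
| where TimeGenerated > ago(14d)
| where CapacityName == "<capacity-name>"
| where OperationName == "RefreshEnd"
| summarize P95DurationSec = percentile(DurationSeconds, 95), Refreshes = count() by WorkspaceName, ItemName
| order by P95DurationSec desc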

Escalation

When to Engage Microsoft Support

  • Throttling persists after scaling to the maximum available SKU.
  • The Capacity Metrics app shows utilization below 50% but throttling is still occurring (potential platform bug).
  • Capacity provisioning or scaling commands fail with Azure errors.
  • Unexplained CU spikes with no matching workload in the Monitoring hub.

How to File a Support Ticket

# File a support ticket from the Azure CLI
# (az support tickets create also requires --title and contact details; run with --help for the full list)
az support tickets create \
  --ticket-name "Fabric-Throttle-$(date +%Y%m%d)" \
  --severity "critical" \
  --problem-classification "/providers/Microsoft.Support/services/fabric/..." \
  --description "Fabric capacity <capacity-name> throttled despite scaling to F64. Details: ..."

Include in the ticket:

  • Capacity name, subscription ID, region.
  • Capacity Metrics app screenshots (last 24h utilization + throttling).
  • List of top CU consumers (workspace, item, CU seconds).
  • Timeline of scaling actions taken.

When to Request a Capacity Increase

  • Sustained utilization above 80% for 14+ consecutive days.
  • Projected workload growth will exceed current SKU within 30 days.
  • New workload (e.g., ML training, large-scale ETL) is being onboarded.

Communication Templates

Internal notification (P1/P2)

Subject: [P1/P2] Fabric Capacity Throttling — <capacity-name>

Detected: <timestamp UTC>
Impact: Reports and refreshes on capacity <capacity-name> are throttled. Users may experience slow or failed queries.
Status: Investigating / Scaling / Resolved
Actions taken: <bullet list>
Next update: <time>


Contact Information

Warning

Action Required: Populate these before first production use.

Role | Contact | Phone | Escalation
Fabric Admin | (your Fabric admin) | (see PagerDuty / OpsGenie) | First responder
Platform Eng Lead | (your platform team lead) | (see PagerDuty / OpsGenie) | P1/P2 escalation
FinOps Lead | (your FinOps contact) | (DL) | Cost/sizing decisions
Microsoft Support | Azure Portal | N/A | Platform-level issues

Drill Log

Run this runbook in tabletop form quarterly. Add one row per drill.

Quarter | Date | Type (tabletop / live) | Scenario exercised | Lead | Gaps identified | Fixes tracked
Q1 — Jan | TBD | TBD | TBD | TBD | TBD | TBD
Q2 — Apr | TBD | TBD | TBD | TBD | TBD | TBD
Q3 — Jul | TBD | TBD | TBD | TBD | TBD | TBD
Q4 — Oct | TBD | TBD | TBD | TBD | TBD | TBD