
🚨 Capacity Throttling Response

Last Updated: 2026-04-27 | Phase: 14 (Wave 1)
Audience: On-call engineers, capacity admins, FinOps, SRE
Purpose: Stop the bleeding when Fabric capacity throttling impacts production workloads: restore CU headroom, drain queued requests, and prevent recurrence.
Anchor: This runbook follows the structure of the Incident Response Template. Read that first if you are new to incident response.


📑 Table of Contents

  1. Symptoms
  2. Severity Classification
  3. Diagnostic Steps
  4. Common Root Causes
  5. Resolution Procedures
  6. Verification
  7. Rollback (Scale-Down)
  8. Post-Incident Actions
  9. Escalation Path
  10. Quick-Reference Commands
  11. Related Runbooks
  12. Related Best-Practice Docs

🔥 Symptoms

Capacity throttling occurs when sustained CU demand exceeds the SKU's allocated CU and Fabric's smoothing buffer is exhausted. Symptoms are graded by severity of throttle state.

| Indicator | Threshold | Where to Observe |
|---|---|---|
| CU Utilization | > 90% sustained for ≥ 15 min | Capacity Metrics App → Utilization tab |
| Smoothing Buffer Used | > 80% of 24-hour smoothing window | Capacity Metrics App → Throttling tab |
| Background Rejection | Any background job rejected (refreshes, scheduled pipelines) | Throttling tab → "Background rejection" timeline |
| Interactive Delay | Interactive queries delayed > 20 sec | Throttling tab → "Interactive delay" timeline |
| Interactive Rejection | Power BI / SQL queries return 429 Too Many Requests | Throttling tab → "Interactive rejection" |
| Queued Requests | capacity_metrics.queued_requests > 0 for ≥ 5 min | Workspace Monitoring → system.capacity_metrics |
| Spark Session Wait | New sessions take > 60 sec to start | Spark monitoring → session wait time |
| Power BI Reports | Report load > 30 sec; refreshes failing with CapacityNotAvailable | Power BI workspace → refresh history |
| SQL Endpoint | Queries fail with Resource governance: too many requests | Warehouse / SQL endpoint query history |
| Eventhouse / KQL | Ingestion lag growing; queries returning LimitExceeded | Real-Time Hub → ingestion metrics |

Throttling vs. rejection: Fabric uses a graduated response. First comes interactive delay (queries slowed), then interactive rejection (queries refused), then background rejection (refreshes refused). Background rejection means smoothing is fully exhausted.


🎯 Severity Classification

Map throttling impact to the SEV matrix from the anchor template.

| Severity | Throttling State | Customer Impact | Response SLA |
|---|---|---|---|
| SEV1 | Background + interactive rejection active across multiple workspaces; SOX/compliance refreshes failing | Region-wide BI outage; pipelines failing; revenue-impacting reports stale | 5 min page |
| SEV2 | Sustained CU > 95% for ≥ 30 min; interactive delay; one workspace down | Power BI users see slow reports; one prod pipeline failing | 15 min page |
| SEV3 | CU 90–95% with intermittent throttling; degraded but workable | Slow queries, no outright failures; non-prod impact only | 2 hr ack |
| SEV4 | Brief CU spike > 90% self-resolving in < 15 min | None; investigate proactively | 24 hr ack |

Decision Tree

flowchart TD
    Start([Throttling alert fires]) --> Q1{CU > 90%<br/>sustained 15 min?}
    Q1 -->|No| SEV4[SEV4 – observe & log]
    Q1 -->|Yes| Q2{Background<br/>rejection active?}
    Q2 -->|Yes| Q3{Multiple workspaces<br/>or compliance refresh?}
    Q3 -->|Yes| SEV1[SEV1 – page IC + scale-up NOW]
    Q3 -->|No| SEV2A[SEV2 – page on-call]
    Q2 -->|No| Q4{Interactive<br/>rejection?}
    Q4 -->|Yes| SEV2B[SEV2 – page on-call]
    Q4 -->|No| Q5{Interactive<br/>delay > 30 min?}
    Q5 -->|Yes| SEV2C[SEV2 – mitigate]
    Q5 -->|No| SEV3[SEV3 – investigate top consumers]
    SEV1 --> Action1[Scale F64→F128 + pause non-critical]
    SEV2A --> Action2[Scale or kill heavy queries]
    SEV2B --> Action2
    SEV2C --> Action2
    SEV3 --> Action3[Identify top CU consumer; defer batch]
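
The same branching as a code sketch you can wire into an alert handler (the boolean inputs are assumptions; derive them from the Capacity Metrics App or the KQL in the next section):

def classify_throttle_severity(
    cu_sustained_15m: bool,           # CU > 90% sustained >= 15 min
    background_rejection: bool,
    multi_workspace_or_compliance: bool,
    interactive_rejection: bool,
    interactive_delay_over_30m: bool,
) -> str:
    """Mirror of the decision tree above; returns the SEV classification."""
    if not cu_sustained_15m:
        return "SEV4"  # observe & log
    if background_rejection:
        return "SEV1" if multi_workspace_or_compliance else "SEV2"
    if interactive_rejection or interactive_delay_over_30m:
        return "SEV2"
    return "SEV3"      # investigate top consumers

# Example: background rejection on a single workspace -> SEV2 (page on-call)
print(classify_throttle_severity(True, True, False, False, False))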

🔍 Diagnostic Steps

1. Confirm Throttle State (Capacity Metrics App)

Navigate to: app.powerbi.com → Apps → Microsoft Fabric Capacity Metrics
1. Select the capacity (e.g., fabric-casino-prod)
2. Open the "Utilization" page → check the CU% timeline (last 24h)
3. Open the "Throttling" page → look for orange/red bands:
   - Green band:  Interactive delay
   - Orange band: Interactive rejection
   - Red band:    Background rejection (CRITICAL)
4. Click the "TimePoint" drill-through to see top items by CU at the spike

2. Pull Top CU Consumers (KQL, Workspace Monitoring)

// Top 10 CU consumers in the last hour, by item, item type, and workspace
FabricCapacityMetrics
| where TimeGenerated > ago(1h)
| summarize TotalCU = sum(CUSeconds), PeakSessions = max(ActiveSessions)
    by ItemName, ItemType, WorkspaceName
| top 10 by TotalCU desc

// Throttle events per 5-min bucket: confirms smoothing exhaustion
FabricCapacityMetrics
| where TimeGenerated > ago(2h)
| summarize
    ThrottleEvents = countif(ThrottleEvent == true),
    QueuedRequests = max(QueuedRequests),
    CUConsumed = sum(CUConsumed),
    CUThrottled = sum(CUThrottled)
    by bin(TimeGenerated, 5m)
| order by TimeGenerated asc

3. Find Long-Running Queries (Workspace Monitoring sql_queries)

-- SQL queries currently running > 60 s: candidates to kill
SELECT
    query_id,
    item_name        AS endpoint,
    user_email,
    SUBSTRING(query_text, 1, 200) AS query_preview,
    DATEDIFF(SECOND, start_time, CURRENT_TIMESTAMP) AS running_seconds,
    rows_scanned,
    cu_seconds
FROM system.sql_queries
WHERE status = 'Running'
  AND start_time < DATEADD(SECOND, -60, CURRENT_TIMESTAMP)
ORDER BY running_seconds DESC;

4. Find Heavy Spark Sessions

-- Active Spark sessions and CU consumption rate
SELECT
    item_name           AS notebook,
    user_email,
    start_time,
    DATEDIFF(SECOND, start_time, CURRENT_TIMESTAMP) AS running_seconds,
    executor_count,
    cu_seconds,
    ROUND(cu_seconds / NULLIF(DATEDIFF(SECOND, start_time, CURRENT_TIMESTAMP), 0), 2)
        AS cu_per_second
FROM system.spark_sessions
WHERE status = 'Running'
ORDER BY cu_seconds DESC;

5. Capacity Status via REST

# Get current SKU + state
az rest --method get \
  --url "https://management.azure.com/subscriptions/${SUB_ID}/resourceGroups/${RG}/providers/Microsoft.Fabric/capacities/${CAPACITY_NAME}?api-version=2023-11-01" \
  --query "{name:name, sku:sku.name, state:properties.state, region:location}"

6. Triage Checklist

  • Confirm the alert is real: Capacity Metrics App shows a red/orange band
  • Note the time of the first throttle event (this becomes T0 for the postmortem timeline)
  • Identify the dominant workload type (Spark / SQL / Pipeline / RTI / Power BI)
  • Identify the top 3 CU consumers by item
  • Check the Microsoft Fabric Status Page for region-level issues
  • Snapshot the Capacity Metrics App (screenshot) as postmortem evidence
  • Open an incident channel per the naming convention

🧭 Common Root Causes

| Root Cause | Indicator | Mitigation Pattern |
|---|---|---|
| Runaway Spark notebook | One notebook consumes > 50% of total CU; cu_per_second outlier | Cancel session; scale up SKU; review notebook for cartesian joins / missing partition filters |
| Power BI mass refresh | Spike at top of hour from scheduled refreshes; RefreshSchedule overlap | Stagger refresh times; convert imports to Direct Lake; raise SKU during refresh window |
| Unindexed Warehouse query | Single query > 5 min; bytes_scanned > 100 GB; no statistics | Kill query; ANALYZE TABLE; add result-set caching; rewrite with predicate pushdown |
| Pipeline parallelism storm | pipeline_runs shows N concurrent activities; queue depth growing | Reduce pipeline maxConcurrency; defer batch jobs; add staggered triggers |
| Eventstream backlog | Eventhouse ingestion lag growing; KQL queries throttled | Scale Eventhouse CU; pause non-critical eventstreams; check upstream Event Hub |
| Smoothing exhaustion | Sustained > 100% CU for > 24 h (smoothing window full) | Scale SKU one tier: burst credit is gone, only base capacity remains |
| Unbounded Direct Lake fallback | Power BI queries falling back to DirectQuery; high SQL CU | Refresh Direct Lake framing; check DirectLakeOnly flag; reload semantic model |
| Noisy neighbor (shared capacity) | Throttle on prod despite normal prod workload; dev workspace busy | Move dev to separate capacity; pause dev workspace temporarily |
| Materialized view rebuild | KQL ingestion CU spike; MV with broad partition scope | Pause MV refresh; rebuild during off-peak; narrow partition |
| AutoML / Data Agent loop | New unfamiliar item in top consumers; sustained Spark CU | Cancel job; gate AutoML in policy; isolate to separate capacity |

🔧 Resolution Procedures

Order of operations: kill the bleeding (cancel the worst offender), then add headroom (scale up), then prevent reflood (defer batch). Do not scale up before identifying the offender; you may be funding a bug.

4.1 Cancel the Worst Offender

Kill a Long-Running Warehouse / SQL Endpoint Query

-- Step 1: find the session_id (Warehouse)
SELECT session_id, login_name, start_time, command, status
FROM sys.dm_exec_sessions
WHERE status = 'running'
  AND DATEDIFF(SECOND, login_time, CURRENT_TIMESTAMP) > 60;

-- Step 2: kill it
KILL 12345;   -- replace with the session_id from step 1

For Lakehouse SQL endpoint (read-only), terminate via portal:

Navigate to: Workspace → Lakehouse → SQL analytics endpoint → Query insights
Select query → Cancel

Cancel a Spark Notebook Session

# Inside a separate notebook, list running Spark applications
import sempy.fabric as fabric

# Returns a pandas DataFrame; column names may vary by sempy version
sessions = fabric.list_items(type="SparkApplication")
print(sessions[sessions["State"] == "Running"][["Name", "Id", "State"]])

# Cancel a specific session (Livy API; the endpoint path may vary by API version)
import requests
from azure.identity import DefaultAzureCredential
token = DefaultAzureCredential().get_token("https://api.fabric.microsoft.com/.default").token

session_id = "<session-guid>"
workspace_id = "<workspace-guid>"
requests.delete(
    f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}/spark/sessions/{session_id}",
    headers={"Authorization": f"Bearer {token}"},
).raise_for_status()

Or via the UI: Workspace → Monitor → Spark applications → select → Cancel.

Pause a Misbehaving Pipeline

# Cancel a running pipeline job instance so it stops consuming CU
# (disable its schedule separately to prevent re-runs; see 4.4)
az rest --method post \
  --url "https://api.fabric.microsoft.com/v1/workspaces/${WS_ID}/dataPipelines/${PIPELINE_ID}/jobs/instances/${RUN_ID}/cancel?jobType=Pipeline" \
  --headers "Content-Type=application/json"

4.2 Scale-Up via Azure REST / CLI (F64 → F128)

# Vertical scale: temporary mitigation. Billing prorates by the minute.
SUB_ID="<subscription-id>"
RG="rg-fabric-prod"
CAPACITY_NAME="fabric-casino-prod"

az rest --method patch \
  --url "https://management.azure.com/subscriptions/${SUB_ID}/resourceGroups/${RG}/providers/Microsoft.Fabric/capacities/${CAPACITY_NAME}?api-version=2023-11-01" \
  --body '{"sku": {"name": "F128", "tier": "Fabric"}}'

# Verify scale-up completed (state should be "Active", sku.name "F128")
az rest --method get \
  --url "https://management.azure.com/subscriptions/${SUB_ID}/resourceGroups/${RG}/providers/Microsoft.Fabric/capacities/${CAPACITY_NAME}?api-version=2023-11-01" \
  --query "{sku:sku.name, state:properties.state}"

Scaling is online (no workspace downtime). Active sessions continue. New CU is available within ~60 sec. Cost doubles per tier.

| From | To | CU Δ | Approx. Hourly Cost Δ (PAYG, East US) |
|---|---|---|---|
| F64 | F128 | +64 | +$11.50/hr |
| F128 | F256 | +128 | +$23/hr |
| F256 | F512 | +256 | +$46/hr |
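
To script the scale-up end to end, a minimal sketch (assuming azure-identity credentials with Contributor rights on the capacity; names are placeholders) that patches the SKU and polls until the capacity reports Active:

import time
import requests
from azure.identity import DefaultAzureCredential

SUB_ID, RG, CAP = "<subscription-id>", "rg-fabric-prod", "fabric-casino-prod"
URL = (
    f"https://management.azure.com/subscriptions/{SUB_ID}/resourceGroups/{RG}"
    f"/providers/Microsoft.Fabric/capacities/{CAP}?api-version=2023-11-01"
)
token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
headers = {"Authorization": f"Bearer {token}"}

# Request the new SKU (same PATCH body as the CLI example above)
requests.patch(URL, headers=headers,
               json={"sku": {"name": "F128", "tier": "Fabric"}}).raise_for_status()

# Poll until provisioning settles; new CU is typically available within ~60 sec
for _ in range(30):
    resp = requests.get(URL, headers=headers).json()
    if resp.get("properties", {}).get("state") == "Active":
        print(f"Now on {resp['sku']['name']}")
        break
    time.sleep(10)
else:
    raise TimeoutError("Capacity did not return to Active; check the Azure portal")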

4.3 Pause Non-Critical Workspaces on Shared Capacity

If multiple workspaces share the throttled capacity, pause the lowest-priority workspace to free CU.

# Pause a non-critical capacity (e.g., dev capacity sharing a tenant)
import requests
from azure.identity import DefaultAzureCredential

token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
sub_id, rg, cap = "<sub>", "rg-fabric-dev", "fabric-casino-dev"

requests.post(
    f"https://management.azure.com/subscriptions/{sub_id}/resourceGroups/{rg}"
    f"/providers/Microsoft.Fabric/capacities/{cap}/suspend?api-version=2023-11-01",
    headers={"Authorization": f"Bearer {token}"},
).raise_for_status()

Note: this pauses the capacity, not a workspace. To shift workspaces off a busy capacity without pausing, reassign:

az rest --method post \
  --url "https://api.fabric.microsoft.com/v1/workspaces/${WS_ID}/assignToCapacity" \
  --body '{"capacityId": "<low-priority-capacity-id>"}'

4.4 Defer Batch Jobs (Pipeline Pause)

# Disable the pipeline's schedule via the Fabric Job Scheduler API; list
# schedules first to get SCHEDULE_ID (PATCH may also need the configuration block)
az rest --method get \
  --url "https://api.fabric.microsoft.com/v1/workspaces/${WS_ID}/items/${PIPELINE_ID}/jobs/Pipeline/schedules"
az rest --method patch \
  --url "https://api.fabric.microsoft.com/v1/workspaces/${WS_ID}/items/${PIPELINE_ID}/jobs/Pipeline/schedules/${SCHEDULE_ID}" \
  --body '{"enabled": false}'

# Or pause a dataset's scheduled refresh via the Power BI REST API
az rest --method patch \
  --url "https://api.powerbi.com/v1.0/myorg/groups/${WS_ID}/datasets/${DS_ID}/refreshSchedule" \
  --body '{"value": {"enabled": false}}'

For ad-hoc deferral, set the pipeline's next-scheduled-run window to off-peak (e.g., shift 9 AM batch to 10 PM).
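
For the time-shift itself, a hedged sketch against the Fabric Job Scheduler API (the Daily-schedule payload shape and required fields are assumptions; verify against the current API reference):

import requests
from azure.identity import DefaultAzureCredential

token = DefaultAzureCredential().get_token("https://api.fabric.microsoft.com/.default").token
ws_id, pipeline_id, schedule_id = "<ws-guid>", "<pipeline-guid>", "<schedule-guid>"

# Move the daily run from 09:00 to 22:00 local time (off-peak)
body = {
    "enabled": True,
    "configuration": {
        "type": "Daily",
        "startDateTime": "2026-04-27T00:00:00",
        "endDateTime": "2026-12-31T23:59:00",
        "localTimeZoneId": "Pacific Standard Time",
        "times": ["22:00"],
    },
}
requests.patch(
    f"https://api.fabric.microsoft.com/v1/workspaces/{ws_id}/items/{pipeline_id}"
    f"/jobs/Pipeline/schedules/{schedule_id}",
    headers={"Authorization": f"Bearer {token}"},
    json=body,
).raise_for_status()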

4.5 Burst & Smoothing Window Tactics

Fabric smooths CU consumption over 24 hours for background operations. If you have unused CU earlier in the day, brief bursts are absorbed without throttling. To recover smoothing capacity:

| Tactic | When to Use | How |
|---|---|---|
| Idle window pause | Throttle event resolved, but smoothing buffer still drained | Pause capacity for 1–2 hr to "reset" the 24h window; incoming work resumes with full burst credit |
| Throttle deferral | Smoothing nearly full, jobs still queueing | Defer all non-critical batch by 4–6 hr; gives the smoothing window time to roll forward |
| Workload move | Single workload (e.g., RTI) saturating | Move workload to a sibling capacity with available headroom (assignToCapacity) |
| Temp scale + defer | Aggressive burst expected (month-end close) | Scale up 2 hr before peak, defer batch from peak hour, scale down 2 hr after |
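
To gauge how much burst credit the smoothing window has left before picking a tactic, a back-of-envelope sketch (simplified model: an F-SKU supplies its CU rating every second, so the 24-hour budget is roughly CU × 86,400 CU-seconds; Fabric's real accounting differs in detail):

# Rough smoothing-headroom estimate (simplified model, not exact Fabric accounting)
SKU_CU = 64                        # F64 supplies ~64 CU per second
WINDOW_SECONDS = 24 * 60 * 60      # 24-hour smoothing window
budget = SKU_CU * WINDOW_SECONDS   # ~5,529,600 CU-seconds of smoothed budget

consumed = 4_700_000               # example: sum of CUConsumed from the KQL above
headroom = budget - consumed
print(f"Headroom: {headroom:,} CU-s ({headroom / budget:.0%} of the window left)")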

4.6 Bulk-Kill Long-Running Queries (SEV1 only)

For SEV1 only, after notifying the workspace-owner channel, iterate sys.dm_exec_sessions and KILL sessions running > 5 min (excluding your own). Use sparingly: kills may cause transaction rollback. A hedged sketch follows.
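
A minimal sketch of the bulk-kill loop (assuming pyodbc with ODBC Driver 18 connecting to the Warehouse SQL endpoint via Entra auth; the endpoint name is a placeholder and the 5-minute threshold mirrors the guidance above):

import pyodbc

# Placeholder connection string; point it at your Warehouse SQL endpoint
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<warehouse-endpoint>.datawarehouse.fabric.microsoft.com;"
    "Database=<warehouse>;Authentication=ActiveDirectoryInteractive;",
    autocommit=True,
)
cur = conn.cursor()

# Sessions running > 5 min, excluding this connection; KILL may roll back transactions
cur.execute("""
    SELECT session_id
    FROM sys.dm_exec_sessions
    WHERE status = 'running'
      AND session_id <> @@SPID
      AND DATEDIFF(SECOND, login_time, CURRENT_TIMESTAMP) > 300
""")
for (sid,) in cur.fetchall():
    print(f"Killing session {sid}")
    cur.execute(f"KILL {sid}")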


✅ Verification

After mitigation, confirm recovery before declaring resolved.

Verification Checklist

  • CU utilization back below 70%, sustained for ≥ 15 min (Capacity Metrics App → Utilization)
  • No active throttle bands on the Throttling page (no orange/red for ≥ 15 min)
  • Smoothing buffer recovering (target < 50% consumed within 1 hr)
  • Queued requests = 0 in system.capacity_metrics
  • Power BI report load < 5 sec for 3 representative sample reports
  • SQL endpoint sample query < 10 sec for a typical Gold table aggregation
  • New Spark session start time < 30 sec
  • Scheduled pipeline batch completes within normal SLA
  • No new alerts firing from Capacity / Throttling alert rules
  • Monitor for 2× the incident duration before declaring resolved (per anchor template §3.3)

Verification KQL

// Confirm the throttle has cleared (last 30 min)
FabricCapacityMetrics
| where TimeGenerated > ago(30m)
| summarize
    MaxCU = max(CUConsumed),
    AvgCU = avg(CUConsumed),
    ThrottleEvents = countif(ThrottleEvent == true),
    MaxQueued = max(QueuedRequests)
| extend Status = case(
    ThrottleEvents > 0,           "STILL THROTTLING",
    AvgCU > 80,                   "ELEVATED - keep watching",
    AvgCU < 70 and MaxQueued == 0,"RECOVERED",
    "MIXED")

βͺ Rollback (Scale-Down)

After a temporary scale-up (e.g., F64 → F128), scale back to baseline once load stabilizes, typically 4–8 hr after recovery.

Pre-Rollback Verification

  • CU has held below 60% of the target SKU for ≥ 4 hr
  • No throttle events in the last 2 hr
  • No queued requests in the last 1 hr
  • Postmortem root cause identified (otherwise leave scaled until the fix lands)
  • Stakeholders notified: the scale-down window is the highest-risk moment for re-throttle

Scale-Down Sequence

# 1. Cordon non-critical workloads (optional safety net)
#    Pause large batch jobs that could refire during/just-after scale-down

# 2. Patch SKU back to baseline
az rest --method patch \
  --url "https://management.azure.com/subscriptions/${SUB_ID}/resourceGroups/${RG}/providers/Microsoft.Fabric/capacities/${CAPACITY_NAME}?api-version=2023-11-01" \
  --body '{"sku": {"name": "F64", "tier": "Fabric"}}'

# 3. Wait 5 min and re-check throttle state
az rest --method get \
  --url "https://management.azure.com/subscriptions/${SUB_ID}/resourceGroups/${RG}/providers/Microsoft.Fabric/capacities/${CAPACITY_NAME}?api-version=2023-11-01" \
  --query "{sku:sku.name, state:properties.state}"

# 4. Resume any cordoned workloads

If Scale-Down Triggers Re-Throttle

  1. Immediately scale back up (F64 → F128).
  2. Reclassify the incident as not resolved; the root cause remains.
  3. Block scale-down until a permanent fix (query optimization, SKU baseline change, workload separation) is deployed.

🧪 Post-Incident Actions

Within 48 hr of resolution (SEV1/SEV2):

FinOps Review

  • Calculate the incident cost: scale-up duration × delta rate (e.g., 4 hr × $11.50 = $46); see the sketch after this list
  • Calculate recurring cost if SKU baseline must rise to prevent recurrence
  • Compare to cost of permanent fix (engineer time Γ— hours)
  • Recommend: hold at higher SKU, return to baseline + fix, or split workloads across capacities
  • Update Capacity Planning doc if baseline changes
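
A quick sanity check for the cost math (delta rates are the assumed PAYG figures from the §4.2 table; substitute your region's pricing):

# Incident-cost sanity check (assumed PAYG deltas from the 4.2 table)
HOURLY_DELTA = {"F64->F128": 11.50, "F128->F256": 23.00, "F256->F512": 46.00}

scale_path, hours_scaled = "F64->F128", 4.0
print(f"One-off incident cost: ${HOURLY_DELTA[scale_path] * hours_scaled:.2f}")

# Recurring cost if the baseline must rise permanently (~730 hr per month)
print(f"Permanent bump: ${HOURLY_DELTA[scale_path] * 730:,.2f}/month")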

Query Optimization Candidates

Pull last 7 days of CU consumption and rank candidates:

-- Top 20 queries by total CU, last 7 days: prioritize for optimization
SELECT TOP 20
    item_name           AS endpoint,
    query_hash,
    SUBSTRING(MAX(query_text), 1, 200)             AS sample_query,
    COUNT(*)                                       AS execution_count,
    ROUND(SUM(cu_seconds), 2)                      AS total_cu_seconds,
    ROUND(AVG(cu_seconds), 2)                      AS avg_cu_per_run,
    ROUND(AVG(bytes_scanned) / 1073741824.0, 2)    AS avg_gb_scanned
FROM system.sql_queries
WHERE start_time >= DATEADD(DAY, -7, CURRENT_TIMESTAMP)
  AND status = 'Succeeded'
GROUP BY item_name, query_hash
ORDER BY total_cu_seconds DESC;

For each top consumer:

  • Verify predicate pushdown (filter pushed to storage)
  • Confirm V-Order on target Delta tables
  • Check for missing statistics (ANALYZE TABLE); a notebook sketch covering these checks follows this list
  • Consider materialized view if pattern is repeated
  • Convert imports to Direct Lake where applicable
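
A minimal Fabric notebook (PySpark) pass over one hot table, covering the statistics, V-Order, and pushdown checks (the table and filter column are placeholders; confirm ANALYZE TABLE and OPTIMIZE ... VORDER support on your runtime version):

# 'spark' is the ambient SparkSession in a Fabric notebook; table name is a placeholder
table = "gold.fact_wagers"

# Recompute statistics so the optimizer can estimate cardinalities
spark.sql(f"ANALYZE TABLE {table} COMPUTE STATISTICS")

# Rewrite files with V-Order for faster Direct Lake / SQL endpoint scans
spark.sql(f"OPTIMIZE {table} VORDER")

# Spot-check predicate pushdown: look for PushedFilters/PartitionFilters in the plan
spark.sql(f"SELECT * FROM {table} WHERE wager_date = '2026-04-27'").explain(True)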

Autoscale Evaluation

  • Decide: manual escalation (page) vs. programmatic auto-scale
  • If auto-scale: define guardrails (max SKU ceiling, scale cooldown, business-hour window)
  • Reference the auto-scale pattern in Capacity Planning doc §Strategy 1; a minimal guardrail sketch follows
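
A minimal guardrail sketch (a decision skeleton only; the ladder, thresholds, and cooldown are placeholder policy, and the actual SKU change is the same PATCH as §4.2):

import time

SKU_LADDER = ["F64", "F128", "F256"]       # F256 acts as the hard ceiling
COOLDOWN_SECONDS = 30 * 60                 # minimum gap between scale actions
SCALE_UP_AT, SCALE_DOWN_AT = 90.0, 50.0    # CU% thresholds (placeholders)

last_action = 0.0

def decide(current_sku: str, cu_percent: float, now: float):
    """Return a target SKU, or None when guardrails say hold."""
    global last_action
    if now - last_action < COOLDOWN_SECONDS:
        return None                        # cooldown guardrail
    idx = SKU_LADDER.index(current_sku)
    if cu_percent >= SCALE_UP_AT and idx + 1 < len(SKU_LADDER):
        last_action = now
        return SKU_LADDER[idx + 1]         # never exceeds the ceiling
    if cu_percent <= SCALE_DOWN_AT and idx > 0:
        last_action = now
        return SKU_LADDER[idx - 1]
    return None

# Example: 93% CU on F64 with no recent action -> scale to F128
print(decide("F64", cu_percent=93.0, now=time.time()))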

Postmortem

Run the full postmortem per the anchor Incident Response Template: timeline from T0 (first throttle event), contributing causes, and action items with owners.

📞 Escalation Path

| Time From Detection | Action | Owner |
|---|---|---|
| 0 min | Page on-call (PagerDuty) | Alert rule |
| 5 min | On-call ack; open incident channel | On-call engineer |
| 15 min (SEV1/SEV2) | Engage Platform Lead | On-call |
| 30 min (SEV1) | Engage Incident Commander; notify VP Eng | Platform Lead |
| 45 min (SEV1) | Notify CTO/CDO; consider tenant-wide banner | VP Eng |
| 60 min (SEV1) | Compliance Officer (if SOX/HIPAA refresh missed); Legal (if SLA breach) | Incident Commander |

External communications follow the Stakeholder Update Template.


⚡ Quick-Reference Commands

Azure CLI: Capacity Lifecycle

# Show current SKU / state
az rest --method get \
  --url "https://management.azure.com/subscriptions/${SUB_ID}/resourceGroups/${RG}/providers/Microsoft.Fabric/capacities/${CAPACITY_NAME}?api-version=2023-11-01"

# Scale up
az rest --method patch \
  --url "https://management.azure.com/subscriptions/${SUB_ID}/resourceGroups/${RG}/providers/Microsoft.Fabric/capacities/${CAPACITY_NAME}?api-version=2023-11-01" \
  --body '{"sku": {"name": "F128", "tier": "Fabric"}}'

# Pause capacity
az rest --method post \
  --url "https://management.azure.com/subscriptions/${SUB_ID}/resourceGroups/${RG}/providers/Microsoft.Fabric/capacities/${CAPACITY_NAME}/suspend?api-version=2023-11-01"

# Resume capacity
az rest --method post \
  --url "https://management.azure.com/subscriptions/${SUB_ID}/resourceGroups/${RG}/providers/Microsoft.Fabric/capacities/${CAPACITY_NAME}/resume?api-version=2023-11-01"

Power BI / Fabric REST: Workload Control

# List capacities and assigned workspaces
az rest --method get \
  --url "https://api.powerbi.com/v1.0/myorg/admin/capacities"

# Reassign workspace to different capacity
az rest --method post \
  --url "https://api.fabric.microsoft.com/v1/workspaces/${WS_ID}/assignToCapacity" \
  --body '{"capacityId": "<target-capacity-id>"}'

# Disable a dataset's scheduled refresh (defer batch)
az rest --method patch \
  --url "https://api.powerbi.com/v1.0/myorg/groups/${WS_ID}/datasets/${DS_ID}/refreshSchedule" \
  --body '{"value": {"enabled": false}}'

KQL: Workspace Monitoring

// Throttle event timeline (5-min buckets)
FabricCapacityMetrics
| where TimeGenerated > ago(4h)
| summarize ThrottleCount = countif(ThrottleEvent == true),
            CU = sum(CUConsumed)
            by bin(TimeGenerated, 5m), WorkloadType
| render timechart

// Cumulative CU over the trailing 24 h: a rough proxy for smoothing-window fill
FabricCapacityMetrics
| where TimeGenerated > ago(24h)
| summarize TotalCU = sum(CUConsumed) by bin(TimeGenerated, 1h)
| extend RollingCU24h = row_cumsum(TotalCU)
| project TimeGenerated, TotalCU, RollingCU24h

Python: Cancel Spark Session

import requests
from azure.identity import DefaultAzureCredential
token = DefaultAzureCredential().get_token("https://api.fabric.microsoft.com/.default").token
requests.delete(
    f"https://api.fabric.microsoft.com/v1/workspaces/{WS_ID}/spark/sessions/{SESS_ID}",
    headers={"Authorization": f"Bearer {token}"},
).raise_for_status()

🔗 Related Runbooks

| Runbook | When to Use |
|---|---|
| Incident Response Template | Anchor: severity matrix, comms tree, postmortem template |
| Pipeline Failure Triage | Pipeline failed because of throttle-induced rejection |
| Auth Failure Playbook | Workspace Identity / SP failures (separate symptom) |
| Multi-Region Failover | Throttling caused by region-wide outage; failover to secondary |
| Tenant Migration (Dev/Staging/Prod) | Workspace reassignment between capacities |
| Data Quality Incident | Throttle masked an upstream data quality breach |
📚 Related Best-Practice Docs

| Document | Description |
|---|---|
| Capacity Planning & Cost Optimization | SKU sizing, smoothing model, auto-scale, V-Order, predicate pushdown |
| Workspace Monitoring | system.capacity_metrics, sql_queries, spark_sessions schemas |
| Monitoring & Observability | Custom dashboards, alert wiring |
| Alerting & Data Activator | Throttle alert rules, Action Group fan-out |
| Error Handling & Monitoring | Pipeline error architecture |
| Performance Optimization | Spark tuning, V-Order, Z-Order |
| Disaster Recovery & BCDR | Capacity failover patterns |
