Runbook — Capacity overrun¶

Symptom¶

CU-equivalent utilization exceeds budget threshold. Cost alert fires. Or per-service throttling alerts (Databricks DBU quota, Power BI F-SKU memory, ADX vCore-second exhaustion).

Diagnosis¶

# 1. Check Console Monitoring Hub → Capacity
# Look at the 24h CU-equivalent chart; identify which service is
# dominant

# 2. Per-service deep dive (LAW KQL)
# Databricks DBU consumption per cluster:
DatabricksClusterEvents
| where TimeGenerated > ago(24h)
| summarize dbu = sum(dbuConsumed) by ClusterName
| order by dbu desc

# Power BI Premium memory:
PowerBICapacityEvents
| where TimeGenerated > ago(24h)
| summarize avg_mem = avg(memoryMb), max_mem = max(memoryMb)
            by WorkspaceName

# ADX vCore-seconds:
ADXIngestionEvents
| where TimeGenerated > ago(24h)
| summarize vcs = sum(vcoreSeconds) by databaseName

# 3. Identify the workload driving overrun
# e.g., scheduled Spark job, semantic model refresh, mass query
# from a Power BI report

Common causes + fixes¶

Cause	Fix
Runaway Databricks job	Cancel + tune SQL / Spark partitioning
Full-table semantic model refresh on huge fact table	Switch to partition refresh in Direct-Lake-Shim policy
ADX query scanning > expected (full table scan instead of partition pruning)	Optimize KQL; add update policies; verify table partitioning
New workload onboarded without capacity sizing	Scale up SKU; or split workload across DLZs
Power BI dataset memory pressure (model > F-SKU memory)	Move tables to DirectQuery; or scale up F-SKU
ADX cold cluster auto-resuming on every query (5-10s overhead each)	Schedule warm-keep cron; or scale up to keep warm
AOAI runaway usage (debug loop in Copilot tool calling)	Investigate user; apply per-user rate limit

Remediation (immediate — stop the bleeding)¶

Pause the runaway resource:
Databricks: cancel job; pause cluster
Power BI Premium: pause capacity (loses queries until resume)
ADX: stop cluster
Notify workload owner + capacity Admin
Scale up if scale-down isn't viable mid-incident:
Console "Admin → Capacity → Scale up [service]"

Remediation (longer-term)¶

Right-size the workload (per-service tuning)
Apply cost-optimization patterns from Cost management
Move to Reservations for stable consumption
Pre-deploy capacity-overrun alerts at 60% / 80% / 95% of budget

Prevention¶

Monthly capacity-forecast review using fiab-capacity-forecast
Per-workload capacity allocation: each DLZ has its own Databricks + ADX + Power BI capacity, billed separately
Tag every resource with CostCenter for charge-back / show-back
Deploy budget alerts at 50% / 80% / 100% of monthly budget

Runbook — Capacity overrun¶

Symptom¶

Diagnosis¶

Common causes + fixes¶

Remediation (immediate — stop the bleeding)¶

Remediation (longer-term)¶

Prevention¶

Related¶