CSA Loom — Operations¶
Day-2 operations for CSA Loom: capacity management, monitoring, cost, DR, upgrades, forward migration. Customer ops teams can run Loom in production using this section + the runbooks.
Topics¶
-
The CU-equivalent model + per-service scaling. Pause/resume patterns. Monitoring capacity-overrun risks.
-
Monitoring Hub deep-dive. Pre-built KQL queries. Per-engine telemetry sources.
-
Cost-optimization patterns. Pause/resume. ADX hot/cold tiering. Power BI smoothing. AOAI provisioned vs PAYG.
-
Per-component DR + RPO/RTO targets. Region pairs. Failover drills.
-
Upgrade lifecycle (
azd upre-run + Console "Updates" pane); single-sub → multi-sub conversion; boundary promotion. -
Forward migration to Microsoft Fabric
The strategic anchor: when Fabric reaches your boundary, migrate forward 1:1 via OneLake shortcut + per-artifact mapping.
Day-2 responsibilities (split between Loom + customer)¶
| Responsibility | Who |
|---|---|
| Container image patching | Loom (push to ACR; customer pulls via Console "Updates") |
| Bicep module updates | Loom (via release tags; customer azd up re-runs) |
| Azure resource patching (Databricks runtime, ADX engine, etc.) | Microsoft (managed services) |
| Capacity scaling decisions | Customer (Console "Admin → Capacity") |
| Workspace creation + lifecycle | Customer (Console "Workspaces" pane) |
| Per-workspace member management | Customer (Entra groups) |
| Incident response for customer-data issues | Customer (with Loom runbooks as reference) |
| Loom Console / parity service incidents | Customer first; escalate via GitHub Issues if blocked |
Runbook index¶
The full runbook library is at runbooks section. Common patterns:
| Runbook | When to use |
|---|---|
| Deploy failure | Initial azd up or DLZ-add fails |
| Direct-Lake-Shim stuck | Power BI semantic model not refreshing |
| Activator rules not firing | Expected Activator action didn't dispatch |
| Mirroring CDC lag | Mirror is more than N minutes behind source |
| Copilot throttling | AOAI 429s in Console |
| Capacity overrun | CU-equivalent exceeds threshold |
| DLZ onboard new domain | Adding agency / mission domain |
| Forward migrate to Fabric | Fabric GA in your boundary |
| Boundary promotion | GCC-H → IL5 promotion |
| Defender AI equivalent SOC | Sentinel pipeline health check |
| MCP troubleshooting | MCP server / wizard issues |
| Purview scan stuck | Catalog scan stalls |
SLAs (operational targets)¶
| Metric | Target |
|---|---|
| Loom Console availability | 99.5% / month |
| Loom Setup Wizard deploy success rate | > 95% |
| Direct-Lake-Shim refresh latency (partition) | p50 < 30 s; p95 < 60 s |
| Activator end-to-end latency | 5-30 s |
| Mirroring CDC steady-state lag | < 60 s |
| Loom Copilot response time | p50 < 3 s; p95 < 10 s |