Loom LAW monitoring + alert pack¶
The standardized Loom Log Analytics workspace (law-csa-loom-<region> in the Admin Plane RG) is the single observability surface for the entire CSA Loom stack. This runbook is the operator query catalog — copy-pasteable KQL for the most-used investigations.
Quick health check — every service in one query¶
let services = dynamic([
'loom-console','loom-orchestrator','loom-copilot','loom-activator',
'loom-mirroring','loom-direct-lake-shim','loom-mcp','loom-presidio-analyzer',
'loom-presidio-anonymizer'
]);
AppRequests
| where TimeGenerated > ago(15m)
| where AppRoleName in (services)
| summarize
requestCount = count(),
successCount = countif(Success == true),
failureCount = countif(Success == false),
p50ms = percentile(DurationMs, 50),
p95ms = percentile(DurationMs, 95),
p99ms = percentile(DurationMs, 99)
by AppRoleName
| extend successRate = round(100.0 * successCount / requestCount, 1)
| order by AppRoleName asc
Use as your morning standup dashboard.
Per-service deep dives¶
Loom Console — auth + RLS¶
AppRequests
| where AppRoleName == 'loom-console'
| where TimeGenerated > ago(1h)
| where Name startswith 'GET /api/workspaces' or Name startswith 'POST /api/workspaces'
| summarize
requests = count(),
auth401 = countif(ResultCode == '401'),
auth403 = countif(ResultCode == '403'),
server500 = countif(ResultCode startswith '5')
by Name, bin(TimeGenerated, 5m)
| render timechart
Loom Activator — fired rules over time¶
AppEvents
| where AppRoleName == 'loom-activator'
| where Name == 'rule.fired'
| extend ruleId = tostring(Properties.ruleId),
primitive = tostring(Properties.primitive)
| summarize fireCount = count() by primitive, bin(TimeGenerated, 15m)
| render timechart
Loom Mirroring — CDC lag¶
The replicator emits a mirror.lag_seconds custom metric each microbatch. Spikes above 60s indicate the replicator can't keep up; escalate per Mirroring CDC lag runbook.
AppMetrics
| where Name == 'mirror.lag_seconds'
| where TimeGenerated > ago(1h)
| extend mirrorId = tostring(Properties['csa-loom.mirror_id'])
| summarize maxLag = max(Sum / Count) by mirrorId, bin(TimeGenerated, 1m)
| render timechart
Loom Direct-Lake Shim — refresh latency SLA¶
The shim emits refresh.duration_ms per partition refresh. The SLA gate is MaxStalenessSeconds declared in the Cosmos refresh-policies container.
AppMetrics
| where Name == 'refresh.duration_ms'
| where TimeGenerated > ago(6h)
| extend modelId = tostring(Properties.semanticModelId),
tableName = tostring(Properties.tableName)
| summarize p50ms = percentile(Sum, 50), p95ms = percentile(Sum, 95), maxMs = max(Sum)
by modelId, tableName, bin(TimeGenerated, 5m)
| where p95ms > 30000 // honest gap: 5-30s; sustained >30s = investigate
Loom Data Agents — NL2SQL accuracy proxy¶
Counts the fraction of generated SQL that succeeded vs failed at the engine layer. Sustained drop = AOAI model behavior change OR schema drift not propagated to the per-agent schema registration.
AppDependencies
| where AppRoleName == 'loom-copilot'
| where Name == 'tool.nl2sql.execute_sql'
| summarize
attempts = count(),
successes = countif(Success == true)
by bin(TimeGenerated, 1h)
| extend successRate = round(100.0 * successes / attempts, 1)
| render timechart
Cross-service correlation¶
Trace a single Setup Wizard deployment end-to-end¶
let deploymentId = 'PASTE-HERE';
union AppRequests, AppDependencies, AppEvents, AppExceptions, AppTraces
| where TimeGenerated > ago(2h)
| where customDimensions has deploymentId
or Properties has deploymentId
or Message has deploymentId
| project TimeGenerated, AppRoleName, ItemType, Name, Success, ResultCode, Message
| order by TimeGenerated asc
Pairs with /api/setup/{deployment_id}/sse for the UI side.
Find the slowest user across all services¶
AppRequests
| where TimeGenerated > ago(1d)
| extend userOid = tostring(customDimensions.user_oid)
| where isnotempty(userOid)
| summarize
requestCount = count(),
p95ms = percentile(DurationMs, 95),
p99ms = percentile(DurationMs, 99)
by userOid, AppRoleName
| top 20 by p95ms desc
Cost monitoring¶
LAW ingestion by service (last 7d, GB)¶
Usage
| where TimeGenerated > ago(7d)
| where IsBillable == true
| summarize totalGB = sum(Quantity) / 1024 by Solution, DataType
| top 20 by totalGB desc
Triggers a budget review if any single service > 5GB/day sustained.
Daily LAW spend trend¶
Usage
| where TimeGenerated > ago(30d)
| where IsBillable == true
| summarize gbIngested = sum(Quantity) / 1024 by bin(TimeGenerated, 1d)
| render timechart
Suggested alert rules (Sentinel-ready)¶
These extend the AI-defense rules already provisioned by monitoring.bicep. Add via additional Microsoft.SecurityInsights/ alertRules resources.
Service down (no requests in 5 min)¶
let services = dynamic([
'loom-console','loom-orchestrator','loom-mcp','loom-activator',
'loom-mirroring','loom-direct-lake-shim'
]);
AppRequests
| where TimeGenerated > ago(5m)
| where AppRoleName in (services)
| summarize requestCount = count() by AppRoleName
| join kind=rightouter (datatable(AppRoleName:string) services
| extend hint = 1) on AppRoleName
| where isnull(requestCount) or requestCount == 0
Severity: High. Triggers AI-defense playbook (extends existing).
Activator action-dispatch failures¶
AppExceptions
| where AppRoleName == 'loom-activator'
| where OuterMessage has 'Action dispatch failed'
| summarize failureCount = count() by bin(TimeGenerated, 5m)
| where failureCount > 5
Direct-Lake refresh SLA violation (sustained)¶
AppMetrics
| where Name == 'refresh.duration_ms'
| where TimeGenerated > ago(15m)
| extend modelId = tostring(Properties.semanticModelId)
| summarize p95ms = percentile(Sum, 95) by modelId
| where p95ms > 60000 // sustained > 60s (2x the honest gap)