Observability Stack: Log Analytics + Workspace Monitoring + Action Groups + Grafana
End-to-End Telemetry, Storage, Alerting & Visualization Blueprint for Microsoft Fabric
Last Updated: 2026-04-27 | Version: 1.0.0 | Phase: 14 Wave 1 (Feature 1.11)
Table of Contents
- Overview
- Reference Architecture
- Telemetry Sources
- Storage Layer Choices
- Standard KQL Library
- Alert Wiring
- Dashboards
- Data Activator for Business Events
- Distributed Tracing
- Cost Considerations
- Implementation Checklist
- Anti-Patterns
- Related Runbooks & Best Practices
- References
Overview
This document is the end-to-end observability blueprint for the Microsoft Fabric platform. It is the integration layer that ties together every signal source, every storage backend, every alert channel, and every visualization tool into a single, coherent stack. Where Monitoring & Observability introduces the pillars and Alerting & Data Activator covers business-event triggers, this guide answers the operational question:
Given a Fabric workspace, where does each signal go, how is it stored, how is it queried, and how does it become an alert or a dashboard?
The stack is opinionated. Pick one primary store per telemetry class (no double-writes), wire alerts through one Action Group taxonomy, and standardize on one dashboard tool per persona. This guide pairs with the Bicep modules landing in batch 1c (action-groups.bicep, log-analytics-workspace.bicep).
Pillars Covered
| Pillar | Primary Sink | Query Surface | Visualization |
|---|---|---|---|
| Capacity & CU | Capacity Metrics App + Workspace Monitoring (Eventhouse) | KQL | Power BI report + Grafana |
| Pipeline / Spark / SQL | Workspace Monitoring (Eventhouse) | KQL | RTI Dashboard + Grafana |
| Diagnostic logs | Log Analytics Workspace | KQL | Grafana + Azure Monitor Workbooks |
| Custom application traces | Application Insights | KQL | Grafana + Application Map |
| Business events | Eventstream → Eventhouse | KQL | Data Activator + RTI Dashboard |
| Audit & security | Microsoft 365 Audit Log + Purview + Log Analytics | KQL | Power BI + Sentinel (optional) |
Companion docs: SLO/SLI for Fabric defines the targets; this document defines the plumbing that measures them.
Reference Architecture
End-to-End Telemetry Flow
flowchart TB
subgraph Sources["Telemetry Sources"]
style Sources fill:#E67E22,color:#fff
F1[Capacity Metrics<br/>CU, throttle, reject]
F2[Notebooks / Spark]
F3[Pipelines / Dataflows]
F4[SQL Endpoint / Warehouse]
F5[Eventstream / Eventhouse]
F6[Power BI / Semantic Models]
F7[Audit Log + Purview]
F8[Custom App Traces<br/>OpenTelemetry]
end
subgraph Diag["Diagnostic Settings"]
style Diag fill:#6C3483,color:#fff
DS1[Per-item Diagnostic Settings]
DS2[Tenant-level Audit Export]
end
subgraph Stores["Storage Layer"]
style Stores fill:#2471A3,color:#fff
WM[(Workspace Monitoring<br/>Eventhouse, KQL native<br/>30 days)]
LA[(Log Analytics<br/>Workspace<br/>Azure Monitor, 90 days)]
AI[(Application Insights<br/>Custom traces / metrics)]
ST[(ADLS Gen2 Archive<br/>Cold tier, 7-10 years)]
end
subgraph Query["Query & Action"]
style Query fill:#27AE60,color:#fff
AG[Action Groups<br/>P1 / P2 / P3 routing]
AR[Azure Monitor<br/>Alert Rules]
DA[Data Activator<br/>Business events]
end
subgraph Viz["Visualization"]
style Viz fill:#1ABC9C,color:#fff
V1[Power BI<br/>Capacity Metrics App]
V2[RTI Dashboard<br/>on Eventhouse]
V3[Grafana<br/>Cross-source dashboards]
V4[FUAM<br/>Fabric Unified Admin Mon.]
end
subgraph Channels["Notification Channels"]
style Channels fill:#C0392B,color:#fff
N1[Teams]
N2[Email / SMS / Voice]
N3[Webhook / Logic App]
N4[ITSM<br/>ServiceNow / PagerDuty]
end
F1 & F2 & F3 & F4 & F5 --> WM
F2 & F3 & F4 & F6 -.diag.-> DS1
F7 --> DS2
F8 --> AI
DS1 & DS2 --> LA
LA & WM -.archive.-> ST
WM & LA & AI --> AR
AR --> AG
F5 --> DA
AG --> N1 & N2 & N3 & N4
DA --> N1 & N3
WM --> V2
WM & LA & AI --> V3
F1 --> V1
LA --> V4
Alert Flow (Severity-Based Routing)
sequenceDiagram
participant Source as Fabric Item
participant Store as Log Analytics / Eventhouse
participant Rule as Alert Rule
participant AG as Action Group
participant Channels as Channels
participant On-Call as On-Call Engineer
Source->>Store: Emit telemetry / log
Store->>Rule: Evaluate KQL on schedule (1-5m)
Rule-->>Rule: Threshold breach?
alt P1 Critical
Rule->>AG: Fire severity=0
AG->>Channels: Teams + Email + SMS + Voice + ITSM
Channels->>On-Call: 15-min ack SLA
else P2 High
Rule->>AG: Fire severity=1
AG->>Channels: Teams + Email
Channels->>On-Call: 1-hr ack SLA
else P3 Medium
Rule->>AG: Fire severity=2
AG->>Channels: Teams only
Channels->>On-Call: 4-hr ack SLA
end
On-Call->>Store: Investigate via dashboard / KQL
On-Call->>Rule: Acknowledge / suppress
Design rule: Every alert in production points to exactly one Action Group. Action Groups own the channel fan-out; alert rules own the detection. This separation lets you change notification policy (e.g., add PagerDuty to all P1 alerts) by editing one Action Group instead of every rule.
Telemetry Sources
The platform produces these telemetry streams. Each one has a recommended primary sink.
| Source | Primary Sink | Secondary Sink | Cadence |
|---|---|---|---|
| Capacity Metrics App (CU%, throttling, smoothing) | Capacity Metrics App + Workspace Monitoring | Log Analytics (export) | 30s |
| Workspace Monitoring (Spark, SQL, pipelines, dataflows) | Eventhouse (native, KQL) | None | Near real-time |
| Pipeline run logs (status, duration, error) | Workspace Monitoring | Log Analytics (Diag Settings) | Per run |
| Spark application logs (driver, executor, spill) | Workspace Monitoring + Log Analytics | App Insights (custom) | Per session |
| SQL endpoint query logs | Workspace Monitoring | Log Analytics (audit) | Per query |
| Power BI usage metrics | Power BI usage dataset | Log Analytics (export) | Hourly |
| Eventstream / Eventhouse internal (lag, ingest errors) | Eventhouse .show commands | Log Analytics (Diag Settings) | Real-time |
| Audit logs (Purview + Security) | M365 Audit Log + Purview | Log Analytics + ADLS archive | 30-min batch |
| Custom application traces (OTel) | Application Insights | Log Analytics (WS-based AI) | Real-time |
| Infrastructure (Storage, Key Vault, network, Defender) | Log Analytics | Sentinel (optional) | 1-5m |
Diagnostic Settings Coverage Matrix
| Fabric Item | Diagnostic Setting Available? | Recommended Categories |
|---|---|---|
| Workspace (top-level) | Yes (preview, 2026) | WorkspaceActivity, Permissions |
| Lakehouse | Indirect via Workspace Monitoring | N/A |
| Notebook | Indirect via Workspace Monitoring | N/A |
| Pipeline | Yes | PipelineRuns, ActivityRuns |
| Eventhouse | Yes | Query, Ingestion, TableUsage |
| Eventstream | Yes | OperationalLogs, StreamLag |
| Power BI Semantic Model | Yes | Engine, AuditLog |
| Capacity | Yes | CapacityMetrics, Throttling |
Operational rule: Enable diagnostic settings on every Fabric item that supports it at provisioning time, not retroactively. Missing logs from the first incident are the costliest gap.
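To confirm those diagnostic settings are actually delivering data, a quick check against the Log Analytics Usage table shows which log types have arrived recently. A minimal sketch follows; the names in expectedTables are illustrative placeholders for whatever tables your diagnostic-settings categories emit.

```kql
// Sketch: confirm each expected log table received billable data in the last 24h (Log Analytics, Usage table)
// Table names in expectedTables are illustrative - adjust to your diagnostic-settings categories
let expectedTables = dynamic(["FabricAuditLog", "AzureDiagnostics", "AppTraces"]);
Usage
| where TimeGenerated > ago(24h) and IsBillable == true
| where DataType in (expectedTables)
| summarize LastSeen = max(TimeGenerated), IngestedMB = round(sum(Quantity), 1) by DataType
// Any expected table missing from the output has sent nothing in the last 24h
```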
Storage Layer Choices
Fabric observability data has three legitimate storage targets. Picking the right one for each signal class avoids both blind spots and double-billing.
Storage Decision Matrix
| Need | Workspace Monitoring (Eventhouse) | Log Analytics | App Insights |
|---|---|---|---|
| Native to Fabric | ✅ first-class Fabric item | ❌ Azure resource | ❌ Azure resource |
| Cost model | Eventhouse OPU (workspace capacity) | Per-GB ingest + retention | Per-GB ingest |
| Default retention | 30 days | 30 days (→ 730) | 90 days (→ 730) |
| Query language | KQL | KQL | KQL |
| Best for | Fabric workload telemetry | Cross-Azure logs, security, infra | App traces, custom metrics |
| OneLake / Direct Lake | ✅ Native | ❌ Export only | ❌ Export only |
| Action Groups | ❌ (use Data Activator) | ✅ | ✅ |
| Sentinel / SOC | ❌ | ✅ Native | ✅ Native |
When to Use Which
flowchart TD
Start[New telemetry stream] --> Q1{Fabric-internal<br/>workload telemetry?}
Q1 -->|Yes| WM[Workspace Monitoring<br/>Eventhouse]
Q1 -->|No| Q2{Custom app trace<br/>OpenTelemetry?}
Q2 -->|Yes| AI[Application Insights]
Q2 -->|No| Q3{Needs Azure Monitor<br/>alerts?}
Q3 -->|Yes| LA[Log Analytics Workspace]
Q3 -->|No| Q4{Business event<br/>requiring action?}
Q4 -->|Yes| DA[Data Activator]
Q4 -->|No, pure archive| ADLS[ADLS Gen2 Archive]
style WM fill:#6C3483,color:#fff
style LA fill:#2471A3,color:#fff
style AI fill:#27AE60,color:#fff
style DA fill:#E67E22,color:#fff
style ADLS fill:#7F8C8D,color:#fff
Retention & Tiering Policy
| Tier | Retention | Cost (relative) | Use Case |
|---|---|---|---|
| Hot – Workspace Monitoring | 30 days | Highest (OPU + capacity) | Real-time ops, KQL queries, dashboards |
| Warm – Log Analytics (Analytics tier) | 31-90 days | Medium | Recent investigations, alert rules |
| Cool – Log Analytics Basic Logs | 8-day query window, 1-year retention | Low | Audit, search-only |
| Archive – Log Analytics Archive | 1-7 years | Lowest | Compliance retention, restore-on-demand |
| Cold – ADLS Gen2 (Cool/Archive) | 7-10 years | Lowest | FedRAMP, NIGC retention, Federal Records Act |
Audit retention requirement: Casino (NIGC §542.17) and Federal (Records Act) mandates require 7+ year retention. Log Analytics archive + ADLS Gen2 archive tier cover this; Workspace Monitoring alone does not.
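A lightweight sanity check is to confirm how far back the audit table actually reaches in the interactive tier. This sketch assumes the FabricAuditLog table used elsewhere in this document; archive-tier data will not show up without a restore or search job, so pair it with a periodic review of the table-level retention settings.

```kql
// Sketch: observed interactive retention for the audit table (archive data needs a restore or search job)
FabricAuditLog
| summarize OldestRecord = min(TimeGenerated), NewestRecord = max(TimeGenerated), Rows = count()
| extend ObservedRetentionDays = datetime_diff('day', NewestRecord, OldestRecord)
```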
Standard KQL Library
Copy-paste runnable queries. Every query specifies the table source. Substitute table names if your environment uses custom export configurations.
1. Capacity CU Saturation (Workspace Monitoring → CapacityMetrics)
CapacityMetrics
| where Timestamp > ago(24h)
| summarize AvgCU=avg(CUPercent), P95CU=percentile(CUPercent,95), MaxCU=max(CUPercent),
ThrottleEvents=countif(IsThrottled), RejectEvents=countif(IsRejected)
by CapacityName, bin(Timestamp, 1h)
| where P95CU > 70 or ThrottleEvents > 0
| order by P95CU desc
2. Top Expensive Queries (Workspace Monitoring → SQLRequests)
SQLRequests
| where StartTime > ago(7d) and Status == "Succeeded"
| summarize ExecCount=count(), TotalCUSec=sum(CPUTimeMs)/1000.0,
AvgDurSec=avg(DurationMs)/1000.0, MaxDurSec=max(DurationMs)/1000.0
by QueryHash=hash(QueryText, 1000), QueryFingerprint=substring(QueryText, 0, 120)
| top 25 by TotalCUSec desc
3. Pipeline Failure Trends (Workspace Monitoring → PipelineRuns)
PipelineRuns
| where StartTime > ago(14d)
| extend ErrorClass = case(
Error has "OutOfMemoryError", "OOM",
Error has "Timeout", "Timeout",
Error has "Authentication", "Auth",
Error has "SchemaMismatch", "SchemaDrift",
Error has "Throttled", "Throttling",
Error == "", "None",
"Other")
| summarize Total=count(), Failed=countif(Status=="Failed"),
SuccessRate=round(100.0*countif(Status=="Succeeded")/count(),1)
by bin(StartTime, 1d), ErrorClass
| order by StartTime desc, Failed desc
4. Authentication Failures (Log Analytics → FabricAuditLog)
FabricAuditLog
| where TimeGenerated > ago(24h)
| where Operation in ("SignInFailure", "AcquireTokenFailed", "ServicePrincipalSignInFailure")
| summarize FailureCount=count(), DistinctTargets=dcount(TargetResource),
FirstFailure=min(TimeGenerated), LastFailure=max(TimeGenerated)
by UserPrincipalName, ClientIP, FailureReason
| where FailureCount >= 5
| order by FailureCount desc
5. Dataset Refresh Durations (Workspace Monitoring → SemanticModelRefreshes)
SemanticModelRefreshes
| where StartTime > ago(30d) and RefreshType == "Scheduled"
| summarize AvgDurMin=round(avg(DurationSeconds)/60.0,1),
P95DurMin=round(percentile(DurationSeconds,95)/60.0,1),
FailureRate=round(100.0*countif(Status=="Failed")/count(),2), Runs=count()
by ModelName, WorkspaceName, bin(StartTime, 1d)
| order by P95DurMin desc
6. Cross-Source Join (Log Analytics + Workspace Monitoring)
Log Analytics holds audit context; Workspace Monitoring holds workload context. Joining them tells you who ran the expensive query.
let wmCluster = "https://trd-xxxxxxxx.kusto.fabric.microsoft.com/WorkspaceMonitoringDB";
let SparkSessions = cluster(wmCluster).database("WorkspaceMonitoringDB").SparkSessions
| where StartTime > ago(24h) and DurationSec > 600
| project SessionId, UserPrincipalName, NotebookName, DurationSec, PeakMemoryGB, StartTime;
FabricAuditLog
| where TimeGenerated > ago(24h) and Operation == "ExecuteNotebook"
| join kind=inner SparkSessions on $left.UserId == $right.UserPrincipalName
| project TimeGenerated, UserPrincipalName, NotebookName, DurationSec, PeakMemoryGB, ClientIP, WorkspaceId
| order by DurationSec desc
7. Data Freshness (Workspace Monitoring → DeltaTableMetrics)
DeltaTableMetrics
| where Timestamp > ago(1h)
| summarize LastWrite=max(LastModifiedTime) by LakehouseName, TableName
| extend StaleMinutes = datetime_diff('minute', now(), LastWrite)
| extend Layer = case(LakehouseName has "bronze","Bronze", LakehouseName has "silver","Silver",
LakehouseName has "gold","Gold","Other")
| extend SLABreach = case(Layer=="Bronze" and StaleMinutes>30, true,
Layer=="Silver" and StaleMinutes>60, true,
Layer=="Gold" and StaleMinutes>240, true, false)
| where SLABreach == true
| order by StaleMinutes desc
8. Spark OOM Detection (Workspace Monitoring → NotebookRuns)
NotebookRuns
| where StartTime > ago(7d) and Status == "Failed"
| where ErrorMessage has_any ("OutOfMemoryError", "Container killed by YARN", "GC overhead")
| summarize OOMCount=count(),
PeakMemGB=round(max(PeakMemoryBytes)/(1024.0*1024*1024),2),
AffectedDays=dcount(bin(StartTime,1d))
by NotebookName, WorkspaceName
| order by OOMCount desc
Alert Wiring
Action Group Taxonomy
Use three Action Groups keyed to severity. This minimizes the number of objects to manage while preserving severity-based routing.
| Action Group | Severity | Channels | Acknowledgement SLA |
|---|---|---|---|
| ag-fabric-p1-critical | Sev 0 (P1) | Teams + Email + SMS + Voice + ITSM (PagerDuty/ServiceNow) + Logic App | 15 min |
| ag-fabric-p2-high | Sev 1 (P2) | Teams + Email + ITSM | 1 hr |
| ag-fabric-p3-medium | Sev 2 (P3) | Teams + Email digest | 4 hr |
Why three, not five or seven? Finer-grained tiers almost always collapse in practice (operators lose track of which tier means what). Three is the sweet spot for severity-based fan-out.
Channel Types (Azure Monitor Action Group)
| Channel | Use | Notes |
|---|---|---|
| Email | All severities | Throttled to ~100 emails/hour per address; use distribution lists |
| SMS | P1 only | Throttled to 1 message per 5 minutes per number; per-message cost |
| Voice (TTS) | P1 only | Per-call cost; reads alert title + resource |
| Webhook / Secure Webhook | P1/P2 | HTTP POST, 5s timeout; secure variant uses AAD |
| Logic App / Azure Function | All severities | Auto-remediation, enrichment, dedupe, ITSM bridging |
| ITSM Connector | P1/P2 | ServiceNow, ServiceDesk Plus, Provance, Cherwell |
| Event Hub | All severities | Forward to SIEM / Sentinel / Splunk |
| Mobile App Push | All severities | Azure Mobile App for engineers |
Bicep Snippet – Action Group + Alert Rule (Placeholder)
Note: The full implementation lands in infra/modules/observability/action-groups.bicep (Phase 14 batch 1c). The snippet below is the contract this best-practice doc commits to. Do not copy verbatim into production until the module ships.
// Placeholder: infra/modules/observability/action-groups.bicep
@description('Severity tier: P1, P2, or P3')
param severity string
param emailRecipients array
param smsRecipients array = [] // P1 only: country code + number
param itsmConnectionId string = '' // PagerDuty / ServiceNow
param logicAppResourceId string = ''
resource ag 'Microsoft.Insights/actionGroups@2023-01-01' = {
name: 'ag-fabric-${toLower(severity)}'
location: 'global'
properties: {
groupShortName: 'fab${toLower(severity)}' // <= 12 chars
enabled: true
emailReceivers: [for (e, i) in emailRecipients: {
name: 'email${i}'
emailAddress: e
useCommonAlertSchema: true
}]
smsReceivers: [for (s, i) in (severity == 'P1' ? smsRecipients : []): {
name: 'sms${i}'
countryCode: split(s, '-')[0]
phoneNumber: split(s, '-')[1]
}]
webhookReceivers: !empty(logicAppResourceId) ? [{
name: 'logic-app-remediation'
serviceUri: 'PLACEHOLDER_LOGIC_APP_URL'
useCommonAlertSchema: true
}] : []
itsmReceivers: !empty(itsmConnectionId) && severity != 'P3' ? [{
name: 'itsm'
workspaceId: 'PLACEHOLDER_WS_ID'
connectionId: itsmConnectionId
ticketConfiguration: '{"PayloadRevision":0,"WorkItemType":"Incident"}'
region: 'eastus2'
}] : []
}
}
// Scheduled KQL alert wired to the Action Group (P2 example)
resource cuThrottleAlert 'Microsoft.Insights/scheduledQueryRules@2023-03-15-preview' = {
name: 'alert-fabric-capacity-throttle'
location: resourceGroup().location
properties: {
severity: 1
enabled: true
evaluationFrequency: 'PT5M'
windowSize: 'PT15M'
scopes: [logAnalyticsWorkspaceId]
criteria: { allOf: [{
query: 'CapacityMetrics | where Timestamp > ago(15m) | summarize MaxCU=max(CUPercent), Throttles=countif(IsThrottled) by CapacityName | where MaxCU > 90 or Throttles > 0'
timeAggregation: 'Count'
operator: 'GreaterThan'
threshold: 0
failingPeriods: { numberOfEvaluationPeriods: 2, minFailingPeriodsToAlert: 2 }
}]}
actions: {
actionGroups: [ag.id]
customProperties: {
runbook: 'https://github.com/.../runbooks/capacity-throttled.md'
severity: 'P2'
}
}
}
}
Suppression and Deduplication
| Technique | Implementation | When to Use |
|---|---|---|
| Cooldown / mute | Action Rule with suppressionConfig | Prevent alert storms on flapping signals (15-60 min) |
| Maintenance window | Action Rule with recurring schedule | Planned deploys, scheduled VACUUM windows |
| Smart Groups | Azure Monitor automatic correlation | Same root cause across multiple resources |
| Alert dedup key | customProperties.dedupKey in webhook payload | Coalesce identical alerts in PagerDuty/ITSM |
| Failing-period gating | failingPeriods.minFailingPeriodsToAlert >= 2 | Eliminate single-evaluation false positives |
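Before wiring suppression, it helps to know which rules are actually noisy. One way to get that view, assuming your alerts flow through Azure Monitor, is an Azure Resource Graph query over fired-alert records; the sketch below ranks rules by fire volume so the worst offenders get dedup keys, cooldowns, or failing-period gating first.

```kql
// Sketch (Azure Resource Graph KQL): rank alert rules by fire volume over the last 7 days
alertsmanagementresources
| where type == "microsoft.alertsmanagement/alerts"
| extend rule = tostring(properties.essentials.alertRule),
         sev = tostring(properties.essentials.severity),
         fired = todatetime(properties.essentials.startDateTime)
| where fired > ago(7d)
| summarize FireCount = count() by rule, sev
| top 10 by FireCount desc
```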
Test Fire-Drill Protocol (Quarterly)
- Pre-flight: Notify on-call, set maintenance window on monitoring channels.
- Inject: Trigger a synthetic alert (failing query, paused synthetic ingest); see the KQL sketch after this list.
- Observe: Alert fired within the window; Action Group fanned out; on-call paged within SLA; runbook URL reachable; ITSM ticket created with correct severity.
- Resolve: Stop synthetic failure; verify auto-resolution.
- Postmortem: Update runbook with anything unclear, missing, or stale. Record drill in log.
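For the Inject step, one low-risk option is a dedicated fire-drill alert rule whose query always returns a row, so any result-count threshold fires without touching real workloads. A minimal sketch, assuming a throwaway rule (e.g., alert-fabric-firedrill, a hypothetical name) pointed at the Action Group under test:

```kql
// Sketch: synthetic fire-drill query - always returns one row, so a "result count > 0" rule fires
// Shape mirrors the capacity-throttle alert so the notification payload looks realistic
print CapacityName = "FIRE-DRILL-SYNTHETIC",
      MaxCU = 999.0,
      Throttles = 1,
      InjectedAtUtc = now()
```

Disable or delete the rule in the Resolve step and confirm auto-resolution behaves the same way it would for a real alert.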
Dashboards
A persona-aligned dashboard set. Avoid duplicating views; assign each persona to one primary tool.
| Persona | Tool | Refresh |
|---|---|---|
| Capacity Admin | Fabric Capacity Metrics App (Power BI) | 30 min |
| Platform Engineer – live ops | RTI Dashboard on Eventhouse | 30 sec |
| Platform Engineer – cross-source | Grafana | 1 min |
| Tenant Admin | FUAM (Power BI / Log Analytics) | Hourly |
| SRE / On-call | Grafana + Azure Monitor Workbooks | 30 sec |
| Compliance Officer | FISMA / NIGC report (Power BI) | Daily |
Power BI – Capacity Metrics App
Install from AppSource for each capacity. See Monitoring & Observability → Capacity Monitoring for the full setup.
RTI Dashboard on Eventhouse
Right tool for Platform Engineer (live ops): 5-second auto-refresh, native KQL editor, drill-through into raw events, direct Data Activator integration. See the Real-Time Intelligence feature doc for setup.
Grafana – Cross-Source Dashboards
Right tool when you need one pane of glass across Workspace Monitoring + Log Analytics + Application Insights + Azure infrastructure metrics. Use Azure Managed Grafana for managed identity auth and AAD SSO.
Sample Dashboard JSON (skeleton)
{
"title": "Fabric Platform Observability",
"uid": "fabric-platform-obs",
"schemaVersion": 39,
"refresh": "1m",
"panels": [
{
"title": "Capacity CU% (P95, last 24h)",
"type": "timeseries",
"datasource": { "type": "grafana-azure-monitor-datasource", "uid": "azmonitor" },
"targets": [{ "queryType": "Azure Log Analytics", "azureLogAnalytics": {
"resource": "/subscriptions/<sub>/resourceGroups/rg-fabric/providers/Microsoft.OperationalInsights/workspaces/law-fabric",
"query": "CapacityMetrics | where Timestamp > ago(24h) | summarize P95=percentile(CUPercent,95) by bin(Timestamp,5m), CapacityName"
}}]
},
{
"title": "Pipeline Success Rate (24h)",
"type": "stat",
"targets": [{ "azureLogAnalytics": {
"query": "PipelineRuns | where StartTime > ago(24h) | summarize SuccessRate=round(100.0*countif(Status=='Succeeded')/count(),1)"
}}],
"fieldConfig": { "defaults": { "thresholds": { "steps": [
{ "color": "red", "value": null }, { "color": "orange", "value": 95 }, { "color": "green", "value": 99 }
]}}}
},
{
"title": "Active Alerts by Severity",
"type": "piechart",
"targets": [{ "azureLogAnalytics": {
"query": "AlertsManagementResources | where properties.essentials.alertState=='New' | summarize count() by tostring(properties.essentials.severity)"
}}]
},
{
"title": "Top Notebooks by Duration (7d)",
"type": "table",
"targets": [{ "azureLogAnalytics": {
"query": "cluster('https://trd-xxx.kusto.fabric.microsoft.com').database('WorkspaceMonitoringDB').NotebookRuns | where StartTime > ago(7d) | summarize AvgDur=avg(DurationSec), Runs=count() by NotebookName | top 20 by AvgDur desc"
}}]
}
]
}
Production tip: Store dashboard JSON in infra/modules/observability/grafana-dashboards/ and deploy via Bicep + REST API. Treat dashboards as code; review changes in PRs.
FUAM (Fabric Unified Admin Monitoring)
FUAM is Microsoft's open-source admin monitoring solution: it ingests tenant audit logs, capacity metrics, workspace inventory, and refresh history into a Lakehouse + Power BI report. See the FUAM feature doc. Use FUAM for tenant-wide admin reporting, chargeback, and 90+ day audit trends; not for real-time ops (use RTI), per-query troubleshooting (use Workspace Monitoring), or sub-minute alerts (use Data Activator).
Data Activator for Business Events
Data Activator (Reflex) is the business-event companion to Azure Monitor Action Groups. The two systems coexist; they are not redundant.
| Use Data Activator for | Use Azure Monitor + Action Groups for |
|---|---|
| Per-entity thresholds (per-machine, per-player, per-station, per-agency) | Platform health (capacity, pipeline reliability, query performance) |
| Streaming-source alerts (Eventstream, Real-Time Hub) | Infrastructure (Key Vault, storage, network) |
| No-code alert authoring by domain owners | Severity-based fan-out to ITSM and on-call |
| Alerts that drive Power Automate workflows | Standardized notification taxonomy across the platform |
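To make the split concrete, the kind of condition Data Activator typically watches is a per-entity streaming aggregation rather than a platform-wide threshold. The sketch below is purely illustrative: SlotMachineEvents and its columns are hypothetical business-event fields landed via Eventstream, not part of any standard schema.

```kql
// Sketch: per-entity business-event condition (hypothetical table landed via Eventstream)
// Each MachineId is evaluated independently - the per-entity pattern Data Activator is built for
SlotMachineEvents
| where EventTime > ago(5m)
| summarize Jackpots = countif(EventType == "Jackpot"), CoinIn = sum(CoinInAmount) by MachineId
| where Jackpots > 0 or CoinIn > 10000
```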
See Alerting & Data Activator best practices and the Data Activator feature doc for configuration patterns.
Distributed Tracing
For applications that span notebooks, Functions, web APIs, and Spark jobs, use W3C Trace Context to correlate spans across components. The Fabric runtime supports OpenTelemetry exports to Application Insights.
Notebook Trace Propagation
# Inside a Fabric notebook: emit OTel spans to App Insights
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
from azure.monitor.opentelemetry.exporter import AzureMonitorTraceExporter
import os
# "traceparent" arrives from the pipeline Notebook activity's baseParameters (W3C format: 00-{trace-id}-{span-id}-{flags})
# and is injected into the notebook's parameters cell as a plain variable
incoming_traceparent = traceparent
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(
AzureMonitorTraceExporter.from_connection_string(os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"])
))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)
ctx = TraceContextTextMapPropagator().extract({"traceparent": incoming_traceparent})
with tracer.start_as_current_span("bronze_ingest", context=ctx) as span:
span.set_attribute("layer", "bronze")
span.set_attribute("source_system", "casino_pos")
span.set_attribute("batch_id", BATCH_ID)
# ... do ingestion work
span.set_attribute("rows_ingested", row_count)
Pipeline Activity → Notebook Header Pass-Through
// Data Factory pipeline: Notebook activity baseParameters
{
"baseParameters": {
"traceparent": "@{activity('Generate Trace').output.traceparent}",
"tracestate": "@{activity('Generate Trace').output.tracestate}"
}
}
Querying End-to-End Traces
// Source: Application Insights – dependencies / requests / traces tables
union dependencies, requests, traces
| where operation_Id == "<trace-id>"
| project timestamp, itemType, name, target, duration, success, customDimensions
| order by timestamp asc
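To find which traces are worth drilling into before filtering on a single operation_Id, the same tables can be rolled up per trace. A minimal sketch, assuming the classic Application Insights query schema where duration is expressed in milliseconds:

```kql
// Sketch: end-to-end latency per trace (classic App Insights tables; duration is in milliseconds)
union requests, dependencies
| where timestamp > ago(24h) and isnotempty(operation_Id)
| summarize Spans = count(),
            StartUtc = min(timestamp),
            EndUtc = max(timestamp + duration * 1ms)
  by operation_Id
| extend EndToEndSec = round(datetime_diff('millisecond', EndUtc, StartUtc) / 1000.0, 2)
| top 20 by EndToEndSec desc
```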
Cost Considerations
Log Analytics Pricing Model
| Tier | Best For | Pricing | Query Cost |
|---|---|---|---|
| Pay-As-You-Go | < 100 GB/day | ~$2.76/GB ingested | Free for 90 days hot |
| Commitment Tiers | 100 GB/day to 5 TB/day | 15-30% discount vs PAYG | Free for 90 days hot |
| Basic Logs | High-volume verbose logs | ~$0.65/GB ingested | $0.005/GB scanned |
| Archive | Compliance retention | ~$0.025/GB/month | Restore-on-demand |
Eventhouse / Workspace Monitoring Cost
Workspace Monitoring runs on the workspace's Fabric capacity (implicit in the SKU; F64 = $5,256/month list). Optimize by disabling monitoring on dev/test workspaces, reducing hot retention on chatty tables (e.g., SQLRequests → 7d), and pre-aggregating with materialized views.
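As a sketch of the pre-aggregation idea, a KQL materialized view keeps an hourly rollup current so dashboards read a few rows per hour instead of raw requests. Column names follow the SQLRequests examples above and are assumptions; if the monitoring database is read-only in your tenant, create the view in an Eventhouse you own over a copy of the data.

```kql
// Sketch: hourly rollup so dashboards hit the materialized view instead of raw SQLRequests rows
// Column names are assumptions - adjust to your Workspace Monitoring schema
.create async materialized-view with (backfill=true) SQLRequestsHourly on table SQLRequests
{
    SQLRequests
    | summarize ExecCount = count(),
                TotalCPUMs = sum(CPUTimeMs),
                MaxDurationMs = max(DurationMs)
      by WorkspaceName, bin(StartTime, 1h)
}
```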
Sampling Strategies
For high-volume telemetry (Spark task-level events, fine-grained traces):
| Strategy | Sample Rate | When |
|---|---|---|
| Always-on | 100% | Errors, P1/P2 alerts, audit |
| Tail-based | 100% errors + 10% successes | Application traces |
| Head-based | 10% | Verbose Spark task events |
| Adaptive | Scales down under load | Dynamic; back off under pressure |
Tiering Policy & Cost Watch
flowchart LR
Hot[Hot – Workspace Monitoring 30d $$$] -->|31d| Warm[Log Analytics 90d $$]
Warm -->|91d| Basic[Basic Logs 1y $]
Basic -->|365d| Archive[Archive Logs 1-7y ¢]
Archive -->|7y| ADLS[ADLS Cold/Archive 10y ¢¢]
style Hot fill:#E74C3C,color:#fff
style Warm fill:#E67E22,color:#fff
style Basic fill:#F39C12,color:#fff
style Archive fill:#3498DB,color:#fff
style ADLS fill:#7F8C8D,color:#fff
// Cost watch – per-table ingestion trend (Log Analytics → Usage table)
Usage
| where TimeGenerated > ago(30d) and IsBillable == true
| summarize IngestedGB = sum(Quantity) / 1024.0 by DataType, bin(TimeGenerated, 1d)
| extend EstMonthlyCost = round(IngestedGB * 30 * 2.76, 2) // PAYG rate
| order by IngestedGB desc
Implementation Checklist
For every new Fabric platform, apply this checklist at provisioning time. Treat the items as code-review gates, not optional polish.
Telemetry Collection
- [ ] Diagnostic Settings on every supported Fabric item (capacity, pipelines, eventhouse, eventstream, semantic models)
- [ ] Workspace Monitoring item provisioned in every production workspace
- [ ] Tenant audit log export to Log Analytics enabled
- [ ] Application Insights provisioned; connection string distributed to notebooks
- [ ] OpenTelemetry instrumentation in shared notebook utilities
Storage
- [ ] Log Analytics workspace deployed via log-analytics-workspace.bicep
- [ ] Retention configured per signal class (90 / 365 / 2555 days)
- [ ] Continuous export to ADLS Gen2 archive for 7+ year domains (casino, federal)
- [ ] Eventhouse retention tuned per table (not default 30 days for everything)
Action Groups & Alerts
- [ ] Three Action Groups deployed via action-groups.bicep (P1 / P2 / P3)
- [ ] Distribution lists for email recipients (no individual addresses)
- [ ] SMS / voice for P1 only via on-call rotation numbers
- [ ] ITSM connector wired (PagerDuty or ServiceNow) for P1 + P2
- [ ] Alert rules deployed via Bicep; no portal-authored rules in production
- [ ] Each alert references a runbook URL in customProperties.runbook
- [ ] Suppression / maintenance windows configured for known deploys
Dashboards
- [ ] Capacity Metrics App installed and pinned for capacity admins
- [ ] RTI Dashboard published for live ops view
- [ ] Azure Managed Grafana deployed with Azure Monitor + Workspace Monitoring data sources
- [ ] FUAM deployed in admin tenant workspace
- [ ] Dashboard JSON committed under infra/modules/observability/grafana-dashboards/
Operational Readiness
- [ ] On-call rotation defined and uploaded to ITSM
- [ ] Runbook bookmarks set for each P1/P2 alert
- [ ] Quarterly fire-drill scheduled and recorded
- [ ] SLO/SLI dashboards published per SLO/SLI for Fabric
- [ ] Postmortem template defined in docs/runbooks/templates/postmortem.md
Anti-Patterns
| # | Anti-Pattern | Problem | Fix |
|---|---|---|---|
| 1 | Double-writing telemetry | Sending the same Spark log to both Workspace Monitoring and Log Analytics "for safety": 2x cost, query divergence, retention drift. | Pick the primary sink per signal class (Decision Matrix). For cross-store correlation, use cross-cluster KQL instead of duplicating. |
| 2 | Per-engineer email alerts | Alerts addressed to individual addresses (alice@contoso.com) disappear on PTO; no audit trail. | Route through distribution lists or on-call rotation services, so membership changes don't require alert-rule changes. |
| 3 | Portal-authored alert rules | Rules created in the portal during incident response, never codified: config drift, lost on DR. | All alert rules in Bicep, reviewed in PRs. Portal authoring only for prototyping; export to Bicep before merging. |
| 4 | One alert per symptom | 200 alert rules, each watching one KPI with its own threshold: fatigue, burnout, paging on noise. | Define SLOs (SLO/SLI for Fabric). Alert on error-budget burn rate, not raw thresholds. |
| 5 | Dashboards without owners | 80 Grafana dashboards, half referencing dead tables, none maintained. | Tag every dashboard with owner=<team> and last_reviewed=<date>. Audit quarterly; archive anything stale for more than a year. |
| 6 | Runbook URLs that 404 | Alert says "see runbook" with a link to a deleted wiki page: no 3am guidance. | Validate runbook URLs in the quarterly fire-drill. Runbooks live in the same repo as alert Bicep. |
| 7 | 30-day retention for audit logs | Default retention for FabricAuditLog left at 30 days: a 7-year compliance audit fails. | Set retention per table based on regulatory requirement (NIGC, BSA/FinCEN, FedRAMP, HIPAA). Configure once in Bicep; verify quarterly. |
| 8 | Sampling away errors | 10% sampling applied to all traces, including errors: 90% of failure traces lost. | Always sample at 100% for errors, P1/P2 alerts, and audit events. Sample only successful high-volume, low-signal events. |
Related Runbooks & Best Practices
Related Runbooks
- Capacity Throttled – first response when CU > 90%
- Pipeline Failed – pipeline triage steps
- Spark OutOfMemory – Spark OOM diagnosis
- Semantic Model Refresh Timeout – Direct Lake refresh recovery
Related Best-Practice Docs
- Monitoring & Observability – pillars, capacity, system tables, runbook templates
- Alerting & Data Activator – Reflex configuration, business-event triggers
- SLO/SLI for Fabric – service-level targets and error budgets
- Error Handling & Monitoring – pipeline error tracking
- SQL Audit Logs Compliance – SQL-level audit logging
- Network Security – diagnostic settings + private endpoint considerations
- Customer-Managed Keys – log data encryption
Related Feature Docs
- Workspace Monitoring – system tables reference
- Data Activator – Reflex configuration
- Real-Time Intelligence – RTI dashboards
- FUAM (Fabric Unified Admin Monitoring) – tenant admin reporting
References
Microsoft Documentation
- Workspace Monitoring overview
- Diagnostic Settings for Fabric
- Azure Monitor Action Groups
- Log Analytics Workspace overview
- Application Insights
- Azure Managed Grafana
- KQL reference
- W3C Trace Context
- OpenTelemetry Python
Operational Best Practices
- Google SRE – Monitoring distributed systems
- SRE Workbook – Alerting on SLOs
- Azure Monitor best practices
- PagerDuty Incident Response
Compliance
- NIGC MICS §542.17
- NIST SP 800-137
- FedRAMP Continuous Monitoring Strategy