
🔭 Observability Stack: Log Analytics + Workspace Monitoring + Action Groups + Grafana

End-to-End Telemetry, Storage, Alerting & Visualization Blueprint for Microsoft Fabric

Last Updated: 2026-04-27 | Version: 1.0.0 | Phase: 14 Wave 1 (Feature 1.11)


🎯 Overview

This document is the end-to-end observability blueprint for the Microsoft Fabric platform. It is the integration layer that ties together every signal source, every storage backend, every alert channel, and every visualization tool into a single, coherent stack. Where Monitoring & Observability introduces the pillars and Alerting & Data Activator covers business-event triggers, this guide answers the operational question:

Given a Fabric workspace, where does each signal go, how is it stored, how is it queried, and how does it become an alert or a dashboard?

The stack is opinionated. Pick one primary store per telemetry class (no double-writes), wire alerts through one Action Group taxonomy, and standardize on one dashboard tool per persona. This guide pairs with the Bicep modules landing in batch 1c (action-groups.bicep, log-analytics-workspace.bicep).

Pillars Covered

| Pillar | Primary Sink | Query Surface | Visualization |
|---|---|---|---|
| Capacity & CU | Capacity Metrics App + Workspace Monitoring (Eventhouse) | KQL | Power BI report + Grafana |
| Pipeline / Spark / SQL | Workspace Monitoring (Eventhouse) | KQL | RTI Dashboard + Grafana |
| Diagnostic logs | Log Analytics Workspace | KQL | Grafana + Azure Monitor Workbooks |
| Custom application traces | Application Insights | KQL | Grafana + Application Map |
| Business events | Eventstream → Eventhouse | KQL | Data Activator + RTI Dashboard |
| Audit & security | Microsoft 365 Audit Log + Purview + Log Analytics | KQL | Power BI + Sentinel (optional) |

Companion docs: SLO/SLI for Fabric defines the targets; this document defines the plumbing that measures them.


๐Ÿ—๏ธ Reference Architecture

End-to-End Telemetry Flow

flowchart TB
    subgraph Sources["🔷 Telemetry Sources"]
        style Sources fill:#E67E22,color:#fff
        F1[Capacity Metrics<br/>CU, throttle, reject]
        F2[Notebooks / Spark]
        F3[Pipelines / Dataflows]
        F4[SQL Endpoint / Warehouse]
        F5[Eventstream / Eventhouse]
        F6[Power BI / Semantic Models]
        F7[Audit Log + Purview]
        F8[Custom App Traces<br/>OpenTelemetry]
    end

    subgraph Diag["🔌 Diagnostic Settings"]
        style Diag fill:#6C3483,color:#fff
        DS1[Per-item Diagnostic Settings]
        DS2[Tenant-level Audit Export]
    end

    subgraph Stores["🗄️ Storage Layer"]
        style Stores fill:#2471A3,color:#fff
        WM[(Workspace Monitoring<br/>Eventhouse - KQL native<br/>30 days)]
        LA[(Log Analytics<br/>Workspace<br/>Azure Monitor - 90 days)]
        AI[(Application Insights<br/>Custom traces / metrics)]
        ST[(ADLS Gen2 Archive<br/>Cold tier, 7-10 years)]
    end

    subgraph Query["🔍 Query & Action"]
        style Query fill:#27AE60,color:#fff
        AG[Action Groups<br/>P1 / P2 / P3 routing]
        AR[Azure Monitor<br/>Alert Rules]
        DA[Data Activator<br/>Business events]
    end

    subgraph Viz["📊 Visualization"]
        style Viz fill:#1ABC9C,color:#fff
        V1[Power BI<br/>Capacity Metrics App]
        V2[RTI Dashboard<br/>on Eventhouse]
        V3[Grafana<br/>Cross-source dashboards]
        V4[FUAM<br/>Fabric Unified Admin Mon.]
    end

    subgraph Channels["📢 Notification Channels"]
        style Channels fill:#C0392B,color:#fff
        N1[Teams]
        N2[Email / SMS / Voice]
        N3[Webhook / Logic App]
        N4[ITSM<br/>ServiceNow / PagerDuty]
    end

    F1 & F2 & F3 & F4 & F5 --> WM
    F2 & F3 & F4 & F6 -.diag.-> DS1
    F7 --> DS2
    F8 --> AI
    DS1 & DS2 --> LA
    LA & WM -.archive.-> ST

    WM & LA & AI --> AR
    AR --> AG
    F5 --> DA
    AG --> N1 & N2 & N3 & N4
    DA --> N1 & N3

    WM --> V2
    WM & LA & AI --> V3
    F1 --> V1
    LA --> V4

Alert Flow (Severity-Based Routing)

sequenceDiagram
    participant Source as Fabric Item
    participant Store as Log Analytics / Eventhouse
    participant Rule as Alert Rule
    participant AG as Action Group
    participant Channels as Channels
    participant OnCall as On-Call Engineer

    Source->>Store: Emit telemetry / log
    Store->>Rule: Evaluate KQL on schedule (1-5m)
    Rule-->>Rule: Threshold breach?
    alt P1 Critical
        Rule->>AG: Fire severity=0
        AG->>Channels: Teams + Email + SMS + Voice + ITSM
        Channels->>OnCall: 15-min ack SLA
    else P2 High
        Rule->>AG: Fire severity=1
        AG->>Channels: Teams + Email
        Channels->>OnCall: 1-hr ack SLA
    else P3 Medium
        Rule->>AG: Fire severity=2
        AG->>Channels: Teams only
        Channels->>OnCall: 4-hr ack SLA
    end
    OnCall->>Store: Investigate via dashboard / KQL
    OnCall->>Rule: Acknowledge / suppress

Design rule: Every alert in production points to exactly one Action Group. Action Groups own the channel fan-out; alert rules own the detection. This separation lets you change notification policy (e.g., add PagerDuty to all P1 alerts) by editing one Action Group instead of every rule.


📡 Telemetry Sources

The platform produces these telemetry streams. Each one has a recommended primary sink.

| Source | Primary Sink | Secondary Sink | Cadence |
|---|---|---|---|
| Capacity Metrics App (CU%, throttling, smoothing) | Capacity Metrics App + Workspace Monitoring | Log Analytics (export) | 30s |
| Workspace Monitoring (Spark, SQL, pipelines, dataflows) | Eventhouse (native, KQL) | none | Near real-time |
| Pipeline run logs (status, duration, error) | Workspace Monitoring | Log Analytics (Diag Settings) | Per run |
| Spark application logs (driver, executor, spill) | Workspace Monitoring + Log Analytics | App Insights (custom) | Per session |
| SQL endpoint query logs | Workspace Monitoring | Log Analytics (audit) | Per query |
| Power BI usage metrics | Power BI usage dataset | Log Analytics (export) | Hourly |
| Eventstream / Eventhouse internal (lag, ingest errors) | Eventhouse .show commands | Log Analytics (Diag Settings) | Real-time |
| Audit logs (Purview + Security) | M365 Audit Log + Purview | Log Analytics + ADLS archive | 30-min batch |
| Custom application traces (OTel) | Application Insights | Log Analytics (WS-based AI) | Real-time |
| Infrastructure (Storage, Key Vault, network, Defender) | Log Analytics | Sentinel (optional) | 1-5m |

Diagnostic Settings Coverage Matrix

| Fabric Item | Diagnostic Setting Available? | Recommended Categories |
|---|---|---|
| Workspace (top-level) | Yes (preview, 2026) | WorkspaceActivity, Permissions |
| Lakehouse | Indirect via Workspace Monitoring | none |
| Notebook | Indirect via Workspace Monitoring | none |
| Pipeline | Yes | PipelineRuns, ActivityRuns |
| Eventhouse | Yes | Query, Ingestion, TableUsage |
| Eventstream | Yes | OperationalLogs, StreamLag |
| Power BI Semantic Model | Yes | Engine, AuditLog |
| Capacity | Yes | CapacityMetrics, Throttling |

Operational rule: Enable diagnostic settings on every Fabric item that supports it at provisioning time, not retroactively. Missing logs from the first incident are the costliest gap.
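The provisioning-time rule above can be enforced as a CI gate. A minimal sketch, assuming a simple inventory format (item `type`, `name`, and enabled `diagnostic_categories`); the dictionary mirrors the coverage matrix above and is illustrative, not a Fabric API:

```python
# Hypothetical provisioning gate: fail the deploy when a Fabric item that
# supports diagnostic settings ships without the recommended categories.
SUPPORTS_DIAGNOSTICS = {
    "Pipeline": ["PipelineRuns", "ActivityRuns"],
    "Eventhouse": ["Query", "Ingestion", "TableUsage"],
    "Eventstream": ["OperationalLogs", "StreamLag"],
    "SemanticModel": ["Engine", "AuditLog"],
    "Capacity": ["CapacityMetrics", "Throttling"],
}

def diagnostics_gaps(items):
    """Return (item_name, missing_categories) for under-instrumented items."""
    gaps = []
    for item in items:
        required = SUPPORTS_DIAGNOSTICS.get(item["type"])
        if required is None:
            continue  # indirect coverage via Workspace Monitoring
        missing = [c for c in required
                   if c not in item.get("diagnostic_categories", [])]
        if missing:
            gaps.append((item["name"], missing))
    return gaps
```

Run it in the provisioning pipeline; a non-empty result blocks the deploy before the first incident, not after.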


๐Ÿ—„๏ธ Storage Layer Choices

Fabric observability data has three legitimate storage targets. Picking the right one for each signal class avoids both blind spots and double-billing.

Storage Decision Matrix

| Need | Workspace Monitoring (Eventhouse) | Log Analytics | App Insights |
|---|---|---|---|
| Native to Fabric | ✅ first-class Fabric item | ❌ Azure resource | ❌ Azure resource |
| Cost model | Eventhouse OPU (workspace capacity) | Per-GB ingest + retention | Per-GB ingest |
| Default retention | 30 days | 30 days (extendable to 730) | 90 days (extendable to 730) |
| Query language | KQL | KQL | KQL |
| Best for | Fabric workload telemetry | Cross-Azure logs, security, infra | App traces, custom metrics |
| OneLake / Direct Lake | ✅ Native | ❌ Export only | ❌ Export only |
| Action Groups | ❌ (use Data Activator) | ✅ | ✅ |
| Sentinel / SOC | ❌ | ✅ Native | ✅ Native |

When to Use Which

flowchart TD
    Start[New telemetry stream] --> Q1{Fabric-internal<br/>workload telemetry?}
    Q1 -->|Yes| WM[Workspace Monitoring<br/>Eventhouse]
    Q1 -->|No| Q2{Custom app trace<br/>OpenTelemetry?}
    Q2 -->|Yes| AI[Application Insights]
    Q2 -->|No| Q3{Needs Azure Monitor<br/>alerts?}
    Q3 -->|Yes| LA[Log Analytics Workspace]
    Q3 -->|No| Q4{Business event<br/>requiring action?}
    Q4 -->|Yes| DA[Data Activator]
    Q4 -->|No โ€” pure archive| ADLS[ADLS Gen2 Archive]

    style WM fill:#6C3483,color:#fff
    style LA fill:#2471A3,color:#fff
    style AI fill:#27AE60,color:#fff
    style DA fill:#E67E22,color:#fff
    style ADLS fill:#7F8C8D,color:#fff
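The same decision tree can be captured as a small routing function, useful as a shared convention in provisioning scripts. A sketch; the stream attribute names are assumptions for illustration:

```python
# Sketch of the sink-selection flowchart above as a routing function.
# Each predicate mirrors one decision node, evaluated top to bottom.
def choose_sink(stream: dict) -> str:
    if stream.get("fabric_workload"):            # Fabric-internal workload telemetry?
        return "Workspace Monitoring (Eventhouse)"
    if stream.get("otel_trace"):                 # custom app trace (OpenTelemetry)?
        return "Application Insights"
    if stream.get("needs_azure_monitor_alerts"): # needs Azure Monitor alerts?
        return "Log Analytics Workspace"
    if stream.get("business_event"):             # business event requiring action?
        return "Data Activator"
    return "ADLS Gen2 Archive"                   # no: pure archive
```

The top-to-bottom predicate order matters: it encodes the flowchart's priority (Fabric-native sinks win over generic Azure ones).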

Retention & Tiering Policy

| Tier | Retention | Cost (relative) | Use Case |
|---|---|---|---|
| Hot (Workspace Monitoring) | 30 days | Highest (OPU + capacity) | Real-time ops, KQL queries, dashboards |
| Warm (Log Analytics Analytics tier) | 31-90 days | Medium | Recent investigations, alert rules |
| Cool (Log Analytics Basic Logs) | 8-day query window, 1-year retention | Low | Audit, search-only |
| Archive (Log Analytics Archive) | 1-7 years | Lowest | Compliance retention, restore-on-demand |
| Cold (ADLS Gen2 Cool/Archive) | 7-10 years | Lowest | FedRAMP, NIGC retention, Federal Records Act |

Audit retention requirement: Casino (NIGC §542.17) and Federal (Records Act) mandate 7+ year retention. Log Analytics archive + ADLS Gen2 archive tier cover this; Workspace Monitoring alone does not.


📚 Standard KQL Library

Copy-paste runnable queries. Every query specifies the table source. Substitute table names if your environment uses custom export configurations.

1. Capacity CU Saturation (Workspace Monitoring → CapacityMetrics)

CapacityMetrics
| where Timestamp > ago(24h)
| summarize AvgCU=avg(CUPercent), P95CU=percentile(CUPercent,95), MaxCU=max(CUPercent),
            ThrottleEvents=countif(IsThrottled), RejectEvents=countif(IsRejected)
    by CapacityName, bin(Timestamp, 1h)
| where P95CU > 70 or ThrottleEvents > 0
| order by P95CU desc

2. Top Expensive Queries (Workspace Monitoring → SQLRequests)

SQLRequests
| where StartTime > ago(7d) and Status == "Succeeded"
| summarize ExecCount=count(), TotalCUSec=sum(CPUTimeMs)/1000.0,
            AvgDurSec=avg(DurationMs)/1000.0, MaxDurSec=max(DurationMs)/1000.0
    by QueryHash=hash(QueryText, 1000), QueryFingerprint=substring(QueryText, 0, 120)
| top 25 by TotalCUSec desc

3. Pipeline Failure Classification (Workspace Monitoring → PipelineRuns)

PipelineRuns
| where StartTime > ago(14d)
| extend ErrorClass = case(
    Error has "OutOfMemoryError", "OOM",
    Error has "Timeout",          "Timeout",
    Error has "Authentication",   "Auth",
    Error has "SchemaMismatch",   "SchemaDrift",
    Error has "Throttled",        "Throttling",
    Error == "",                  "None",
    "Other")
| summarize Total=count(), Failed=countif(Status=="Failed"),
            SuccessRate=round(100.0*countif(Status=="Succeeded")/count(),1)
    by bin(StartTime, 1d), ErrorClass
| order by StartTime desc, Failed desc

4. Authentication Failures (Log Analytics → FabricAuditLog)

FabricAuditLog
| where TimeGenerated > ago(24h)
| where Operation in ("SignInFailure", "AcquireTokenFailed", "ServicePrincipalSignInFailure")
| summarize FailureCount=count(), DistinctTargets=dcount(TargetResource),
            FirstFailure=min(TimeGenerated), LastFailure=max(TimeGenerated)
    by UserPrincipalName, ClientIP, FailureReason
| where FailureCount >= 5
| order by FailureCount desc

5. Dataset Refresh Durations (Workspace Monitoring → SemanticModelRefreshes)

SemanticModelRefreshes
| where StartTime > ago(30d) and RefreshType == "Scheduled"
| summarize AvgDurMin=round(avg(DurationSeconds)/60.0,1),
            P95DurMin=round(percentile(DurationSeconds,95)/60.0,1),
            FailureRate=round(100.0*countif(Status=="Failed")/count(),2), Runs=count()
    by ModelName, WorkspaceName, bin(StartTime, 1d)
| order by P95DurMin desc

6. Cross-Source Join (Log Analytics + Workspace Monitoring)

Log Analytics holds audit context; Workspace Monitoring holds workload context. Joining them tells you who ran the expensive query.

let wmCluster = "https://trd-xxxxxxxx.kusto.fabric.microsoft.com/WorkspaceMonitoringDB";
let SparkSessions = cluster(wmCluster).database("WorkspaceMonitoringDB").SparkSessions
    | where StartTime > ago(24h) and DurationSec > 600
    | project SessionId, UserPrincipalName, NotebookName, DurationSec, PeakMemoryGB, StartTime;
FabricAuditLog
| where TimeGenerated > ago(24h) and Operation == "ExecuteNotebook"
| join kind=inner SparkSessions on $left.UserId == $right.UserPrincipalName
| project TimeGenerated, UserPrincipalName, NotebookName, DurationSec, PeakMemoryGB, ClientIP, WorkspaceId
| order by DurationSec desc

7. Data Freshness (Workspace Monitoring → DeltaTableMetrics)

DeltaTableMetrics
| where Timestamp > ago(1h)
| summarize LastWrite=max(LastModifiedTime) by LakehouseName, TableName
| extend StaleMinutes = datetime_diff('minute', now(), LastWrite)
| extend Layer = case(LakehouseName has "bronze","Bronze", LakehouseName has "silver","Silver",
                      LakehouseName has "gold","Gold","Other")
| extend SLABreach = case(Layer=="Bronze" and StaleMinutes>30, true,
                          Layer=="Silver" and StaleMinutes>60, true,
                          Layer=="Gold"   and StaleMinutes>240, true, false)
| where SLABreach == true
| order by StaleMinutes desc

8. Spark OOM Detection (Workspace Monitoring → NotebookRuns)

NotebookRuns
| where StartTime > ago(7d) and Status == "Failed"
| where ErrorMessage has_any ("OutOfMemoryError", "Container killed by YARN", "GC overhead")
| summarize OOMCount=count(),
            PeakMemGB=round(max(PeakMemoryBytes)/(1024.0*1024*1024),2),
            AffectedDays=dcount(bin(StartTime,1d))
    by NotebookName, WorkspaceName
| order by OOMCount desc

🚨 Alert Wiring

Action Group Taxonomy

Use three Action Groups keyed to severity. This minimizes the number of objects to manage while preserving severity-based routing.

| Action Group | Severity | Channels | Acknowledgement SLA |
|---|---|---|---|
| ag-fabric-p1-critical | Sev 0 (P1) | Teams + Email + SMS + Voice + ITSM (PagerDuty/ServiceNow) + Logic App | 15 min |
| ag-fabric-p2-high | Sev 1 (P2) | Teams + Email + ITSM | 1 hr |
| ag-fabric-p3-medium | Sev 2 (P3) | Teams + Email digest | 4 hr |

Why three, not seven? A four- or five-tier scheme almost always collapses in practice (operators lose track of which tier means what). Three is the sweet spot for severity-based fan-out.
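One way to keep the taxonomy from drifting is to encode it as data that both the Bicep deployment and the runbooks consume. A sketch; the channel lists and ack SLAs mirror the table above, and the structure itself is an assumption, not an Azure API:

```python
# The three-tier taxonomy as data: one Action Group per tier owns the fan-out.
# Ack SLAs are in minutes; azure_severity is the Azure Monitor severity value.
ACTION_GROUPS = {
    "P1": {"name": "ag-fabric-p1-critical", "azure_severity": 0,
           "channels": ["teams", "email", "sms", "voice", "itsm", "logic_app"],
           "ack_sla_min": 15},
    "P2": {"name": "ag-fabric-p2-high", "azure_severity": 1,
           "channels": ["teams", "email", "itsm"],
           "ack_sla_min": 60},
    "P3": {"name": "ag-fabric-p3-medium", "azure_severity": 2,
           "channels": ["teams", "email_digest"],
           "ack_sla_min": 240},
}

def route(tier: str) -> dict:
    """Resolve an alert's tier to its single owning Action Group."""
    return ACTION_GROUPS[tier]
```

Changing notification policy then means editing one entry, which is exactly the property the design rule in the alert-flow section is after.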

Channel Types (Azure Monitor Action Group)

| Channel | Use | Notes |
|---|---|---|
| Email | All severities | Throttled to 100 emails/hr per address; use distribution lists |
| SMS | P1 only | Throttled to 1 message per 5 min per number; per-message cost |
| Voice (TTS) | P1 only | Per-call cost; reads alert title + resource |
| Webhook / Secure Webhook | P1/P2 | HTTP POST with 10s timeout; secure variant uses Microsoft Entra ID auth |
| Logic App / Azure Function | All severities | Auto-remediation, enrichment, dedupe, ITSM bridging |
| ITSM Connector | P1/P2 | ServiceNow, ServiceDesk Plus, Provance, Cherwell |
| Event Hub | All severities | Forward to SIEM / Sentinel / Splunk |
| Mobile App Push | All severities | Azure mobile app for on-call engineers |

Bicep Snippet: Action Group + Alert Rule (Placeholder)

Note: The full implementation lands in infra/modules/observability/action-groups.bicep (Phase 14 batch 1c). The snippet below is the contract this best-practice doc commits to. Do not copy verbatim into production until the module ships.

// Placeholder: infra/modules/observability/action-groups.bicep
@description('Severity tier: P1, P2, or P3')
param severity string
param emailRecipients array
param smsRecipients array = []          // P1 only; entries formatted '<countryCode>-<number>'
param itsmConnectionId string = ''      // PagerDuty / ServiceNow
param logicAppResourceId string = ''

resource ag 'Microsoft.Insights/actionGroups@2023-01-01' = {
  name: 'ag-fabric-${toLower(severity)}'
  location: 'global'
  properties: {
    groupShortName: 'fab${toLower(severity)}'   // <= 12 chars
    enabled: true
    emailReceivers: [for (e, i) in emailRecipients: {
      name: 'email${i}'
      emailAddress: e
      useCommonAlertSchema: true
    }]
    smsReceivers: severity == 'P1' ? [for (s, i) in smsRecipients: {
      name: 'sms${i}'
      countryCode: split(s, '-')[0]
      phoneNumber: split(s, '-')[1]
    }] : []
    webhookReceivers: !empty(logicAppResourceId) ? [{
      name: 'logic-app-remediation'
      serviceUri: 'PLACEHOLDER_LOGIC_APP_URL'
      useCommonAlertSchema: true
    }] : []
    itsmReceivers: !empty(itsmConnectionId) && severity != 'P3' ? [{
      name: 'itsm'
      workspaceId: 'PLACEHOLDER_WS_ID'
      connectionId: itsmConnectionId
      ticketConfiguration: '{"PayloadRevision":0,"WorkItemType":"Incident"}'
      region: 'eastus2'
    }] : []
  }
}

// Scheduled KQL alert wired to the Action Group (P2 example)
param logAnalyticsWorkspaceId string    // resource ID of the workspace the rule queries
resource cuThrottleAlert 'Microsoft.Insights/scheduledQueryRules@2023-03-15-preview' = {
  name: 'alert-fabric-capacity-throttle'
  location: resourceGroup().location
  properties: {
    severity: 1
    enabled: true
    evaluationFrequency: 'PT5M'
    windowSize: 'PT15M'
    scopes: [logAnalyticsWorkspaceId]
    criteria: { allOf: [{
      query: 'CapacityMetrics | where Timestamp > ago(15m) | summarize MaxCU=max(CUPercent), Throttles=countif(IsThrottled) by CapacityName | where MaxCU > 90 or Throttles > 0'
      timeAggregation: 'Count'
      operator: 'GreaterThan'
      threshold: 0
      failingPeriods: { numberOfEvaluationPeriods: 2, minFailingPeriodsToAlert: 2 }
    }]}
    actions: {
      actionGroups: [ag.id]
      customProperties: {
        runbook: 'https://github.com/.../runbooks/capacity-throttled.md'
        severity: 'P2'
      }
    }
  }
}

Suppression and Deduplication

| Technique | Implementation | When to Use |
|---|---|---|
| Cooldown / mute | Action Rule with suppressionConfig | Prevent alert storms on flapping signals (15-60 min) |
| Maintenance window | Action Rule with recurring schedule | Planned deploys, scheduled VACUUM windows |
| Smart Groups | Azure Monitor automatic correlation | Same root cause across multiple resources |
| Alert dedup key | customProperties.dedupKey in webhook payload | Coalesce identical alerts in PagerDuty/ITSM |
| Failing-period gating | failingPeriods.minFailingPeriodsToAlert >= 2 | Eliminate single-evaluation false positives |
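For the dedup-key technique, the key must be stable across repeated firings of the same problem. A minimal sketch: hash the identity of the problem (rule name + resource ID) and nothing time-varying, so PagerDuty/ITSM coalesces duplicates into one incident. The truncation length is an arbitrary choice for readability:

```python
import hashlib

def dedup_key(alert_rule: str, resource_id: str) -> str:
    """Stable key for customProperties.dedupKey: same problem -> same key."""
    raw = f"{alert_rule}|{resource_id}".lower()   # normalize case, exclude timestamps
    return hashlib.sha256(raw.encode()).hexdigest()[:16]
```

A Logic App or Function receiver can compute this on the way into the ITSM bridge if the alert rule itself cannot.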

Test Fire-Drill Protocol (Quarterly)

  1. Pre-flight: Notify on-call, set maintenance window on monitoring channels.
  2. Inject: Trigger a synthetic alert (failing query, paused synthetic ingest).
  3. Observe: verify the alert fired within the window, the Action Group fanned out, on-call was paged within SLA, the runbook URL is reachable, and the ITSM ticket was created with the correct severity.
  4. Resolve: Stop synthetic failure; verify auto-resolution.
  5. Postmortem: Update runbook with anything unclear, missing, or stale. Record drill in log.
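The Observe step is easier to automate if the drill asserts on the payload the webhook receivers get (`useCommonAlertSchema: true` above). A hedged sketch: the field paths follow the Azure Monitor common alert schema as documented, but verify them against a real payload from your tenant; the `runbook` key is this document's own `customProperties` convention:

```python
# Extract what the on-call (or the drill harness) needs from an Azure Monitor
# common alert schema payload received on a webhook channel.
def triage(payload: dict) -> dict:
    essentials = payload["data"]["essentials"]
    custom = payload["data"].get("customProperties") or {}
    return {
        "alert": essentials["alertRule"],
        "severity": essentials["severity"],        # "Sev0".."Sev4"
        "fired_at": essentials["firedDateTime"],
        "resources": essentials["alertTargetIDs"],
        "runbook": custom.get("runbook", "MISSING - fix the alert rule"),
    }
```

A drill can then assert `triage(payload)["runbook"]` is reachable and the severity matches the injected failure, turning step 3's checklist into code.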

📊 Dashboards

A persona-aligned dashboard set. Avoid duplicating views; assign each persona to one primary tool.

| Persona | Tool | Refresh |
|---|---|---|
| Capacity Admin | Fabric Capacity Metrics App (Power BI) | 30 min |
| Platform Engineer (live ops) | RTI Dashboard on Eventhouse | 30 sec |
| Platform Engineer (cross-source) | Grafana | 1 min |
| Tenant Admin | FUAM (Power BI / Log Analytics) | Hourly |
| SRE / On-call | Grafana + Azure Monitor Workbooks | 30 sec |
| Compliance Officer | FISMA / NIGC report (Power BI) | Daily |

Power BI: Capacity Metrics App

Install from AppSource for each capacity. See Monitoring & Observability (Capacity Monitoring) for the full setup.

RTI Dashboard on Eventhouse

Right tool for Platform Engineer (live ops): 5-second auto-refresh, native KQL editor, drill-through into raw events, direct Data Activator integration. See the Real-Time Intelligence feature doc for setup.

Grafana: Cross-Source Dashboards

Right tool when you need one pane of glass across Workspace Monitoring + Log Analytics + Application Insights + Azure infrastructure metrics. Use Azure Managed Grafana for managed identity auth and AAD SSO.

Sample Dashboard JSON (skeleton)

{
  "title": "Fabric Platform Observability",
  "uid": "fabric-platform-obs",
  "schemaVersion": 39,
  "refresh": "1m",
  "panels": [
    {
      "title": "Capacity CU% (P95, last 24h)",
      "type": "timeseries",
      "datasource": { "type": "grafana-azure-monitor-datasource", "uid": "azmonitor" },
      "targets": [{ "queryType": "Azure Log Analytics", "azureLogAnalytics": {
        "resource": "/subscriptions/<sub>/resourceGroups/rg-fabric/providers/Microsoft.OperationalInsights/workspaces/law-fabric",
        "query": "CapacityMetrics | where Timestamp > ago(24h) | summarize P95=percentile(CUPercent,95) by bin(Timestamp,5m), CapacityName"
      }}]
    },
    {
      "title": "Pipeline Success Rate (24h)",
      "type": "stat",
      "targets": [{ "azureLogAnalytics": {
        "query": "PipelineRuns | where StartTime > ago(24h) | summarize SuccessRate=round(100.0*countif(Status=='Succeeded')/count(),1)"
      }}],
      "fieldConfig": { "defaults": { "thresholds": { "steps": [
        { "color": "red", "value": null }, { "color": "orange", "value": 95 }, { "color": "green", "value": 99 }
      ]}}}
    },
    {
      "title": "Active Alerts by Severity",
      "type": "piechart",
      "targets": [{ "azureLogAnalytics": {
        "query": "AlertsManagementResources | where properties.essentials.alertState=='New' | summarize count() by tostring(properties.essentials.severity)"
      }}]
    },
    {
      "title": "Top Notebooks by Duration (7d)",
      "type": "table",
      "targets": [{ "azureLogAnalytics": {
        "query": "cluster('https://trd-xxx.kusto.fabric.microsoft.com').database('WorkspaceMonitoringDB').NotebookRuns | where StartTime > ago(7d) | summarize AvgDur=avg(DurationSec), Runs=count() by NotebookName | top 20 by AvgDur desc"
      }}]
    }
  ]
}

Production tip: Store dashboard JSON in infra/modules/observability/grafana-dashboards/ and deploy via Bicep + REST API. Treat dashboards as code; review changes in PRs.

FUAM (Fabric Unified Admin Monitoring)

FUAM is Microsoft's open-source admin monitoring solution: it ingests tenant audit logs, capacity metrics, workspace inventory, and refresh history into a Lakehouse + Power BI report. See the FUAM feature doc. Use FUAM for tenant-wide admin reporting, chargeback, and 90+ day audit trends; do not use it for real-time ops (use RTI), per-query troubleshooting (use Workspace Monitoring), or sub-minute alerts (use Data Activator).


⚡ Data Activator for Business Events

Data Activator (Reflex) is the business-event companion to Azure Monitor Action Groups. The two systems coexist; they are not redundant.

| Use Data Activator for | Use Azure Monitor + Action Groups for |
|---|---|
| Per-entity thresholds (per-machine, per-player, per-station, per-agency) | Platform health (capacity, pipeline reliability, query performance) |
| Streaming-source alerts (Eventstream, Real-Time Hub) | Infrastructure (Key Vault, storage, network) |
| No-code alert authoring by domain owners | Severity-based fan-out to ITSM and on-call |
| Alerts that drive Power Automate workflows | Standardized notification taxonomy across the platform |

See Alerting & Data Activator best practices and the Data Activator feature doc for configuration patterns.


🧵 Distributed Tracing

For applications that span notebooks, Functions, web APIs, and Spark jobs, use W3C Trace Context to correlate spans across components. The Fabric runtime supports OpenTelemetry exports to Application Insights.

Notebook Trace Propagation

# Inside a Fabric notebook: emit OTel spans to App Insights
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
from azure.monitor.opentelemetry.exporter import AzureMonitorTraceExporter
import os

# Propagated header from pipeline activity; W3C format: 00-{trace-id}-{span-id}-{flags}
incoming_traceparent = mssparkutils.notebook.run("get_param", 60, {"name": "traceparent"})

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(
    AzureMonitorTraceExporter.from_connection_string(os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"])
))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

ctx = TraceContextTextMapPropagator().extract({"traceparent": incoming_traceparent})
with tracer.start_as_current_span("bronze_ingest", context=ctx) as span:
    span.set_attribute("layer", "bronze")
    span.set_attribute("source_system", "casino_pos")
    span.set_attribute("batch_id", BATCH_ID)   # BATCH_ID defined earlier in the notebook
    # ... do ingestion work, producing row_count
    span.set_attribute("rows_ingested", row_count)

Pipeline Activity → Notebook Header Pass-Through

// Data Factory pipeline: Notebook activity baseParameters
{
  "baseParameters": {
    "traceparent": "@{activity('Generate Trace').output.traceparent}",
    "tracestate":  "@{activity('Generate Trace').output.tracestate}"
  }
}
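The `traceparent` value passed through `baseParameters` must be a valid W3C Trace Context header. A sketch of generating one (what the hypothetical "Generate Trace" activity would emit): version `00`, a 16-byte trace-id, an 8-byte parent-id, and the sampled flag `01`:

```python
import secrets

def new_traceparent() -> str:
    """Build a W3C traceparent header: 00-{trace-id}-{span-id}-{flags}."""
    trace_id = secrets.token_hex(16)   # 32 hex chars; the spec forbids all-zeros
    span_id = secrets.token_hex(8)     # 16 hex chars
    return f"00-{trace_id}-{span_id}-01"
```

Downstream notebooks extract this with `TraceContextTextMapPropagator` exactly as in the notebook snippet above, so the whole pipeline run shares one trace-id.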

Querying End-to-End Traces

// Source: Application Insights tables dependencies / requests / traces
union dependencies, requests, traces
| where operation_Id == "<trace-id>"
| project timestamp, itemType, name, target, duration, success, customDimensions
| order by timestamp asc

💰 Cost Considerations

Log Analytics Pricing Model

| Tier | Best For | Pricing | Query Cost |
|---|---|---|---|
| Pay-As-You-Go | < 100 GB/day | ~$2.76/GB ingested | Free for 90 days hot |
| Commitment Tiers | 100 GB/day to 5 TB/day | 15-30% discount vs PAYG | Free for 90 days hot |
| Basic Logs | High-volume verbose logs | ~$0.65/GB ingested | $0.005/GB scanned |
| Archive | Compliance retention | ~$0.025/GB/month | Restore-on-demand |

Eventhouse / Workspace Monitoring Cost

Workspace Monitoring runs on the workspace's Fabric capacity (implicit in the SKU; F64 = $5,256/month list). Optimize by disabling monitoring on dev/test workspaces, reducing hot retention on chatty tables (e.g., SQLRequests down to 7d), and pre-aggregating with materialized views.

Sampling Strategies

For high-volume telemetry (Spark task-level events, fine-grained traces):

| Strategy | Sample Rate | When |
|---|---|---|
| Always-on | 100% | Errors, P1/P2 alerts, audit |
| Tail-based | 100% errors + 10% successes | Application traces |
| Head-based | 10% | Verbose Spark task events |
| Adaptive | Varies with load | Dynamically back off under pressure |
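The tail-based row can be sketched as a deterministic keep/drop decision keyed on the trace ID, so every span of a trace gets the same verdict regardless of which worker evaluates it. A minimal sketch; the 10% rate matches the table above:

```python
import hashlib

def keep_trace(trace_id: str, is_error: bool, success_rate: float = 0.10) -> bool:
    """Tail-based sampling: always keep errors, keep a stable fraction of successes."""
    if is_error:
        return True  # errors are always-on (100%)
    # Hash the trace id into a 0-99 bucket; the same trace always lands
    # in the same bucket, so the decision is consistent across spans.
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 100
    return bucket < success_rate * 100
```

This is the property anti-pattern 8 below demands: sampling that can never drop a failure trace.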

Tiering Policy & Cost Watch

flowchart LR
    Hot[Hot: Workspace Monitoring 30d $$$] -->|31d| Warm[Log Analytics 90d $$]
    Warm -->|91d| Basic[Basic Logs 1y $]
    Basic -->|365d| Archive[Archive Logs 1-7y ¢]
    Archive -->|7y| ADLS[ADLS Cold/Archive 10y ¢¢]
    style Hot fill:#E74C3C,color:#fff
    style Warm fill:#E67E22,color:#fff
    style Basic fill:#F39C12,color:#fff
    style Archive fill:#3498DB,color:#fff
    style ADLS fill:#7F8C8D,color:#fff

// Cost watch: per-table ingestion trend (Log Analytics → Usage)
Usage
| where TimeGenerated > ago(30d) and IsBillable == true
| summarize IngestedGB = sum(Quantity) / 1024.0 by DataType, bin(TimeGenerated, 1d)
| extend EstMonthlyCost = round(IngestedGB * 30 * 2.76, 2)  // PAYG rate
| order by IngestedGB desc
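The `EstMonthlyCost` arithmetic in the query above, as a plain function. The ~$2.76/GB PAYG rate is region-dependent; treat it as an assumption, not a price sheet:

```python
def est_monthly_cost(gb_per_day: float, rate_per_gb: float = 2.76) -> float:
    """Extrapolate one day's billable ingest to a 30-day month at a per-GB rate."""
    return round(gb_per_day * 30 * rate_per_gb, 2)
```

Useful in a budget-gate script: flag any DataType whose projected monthly cost crosses a threshold and consider moving it to Basic Logs.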

✅ Implementation Checklist

For every new Fabric platform, apply this checklist at provisioning time. Treat these as code-review gates, not optional polish.

Telemetry Collection

- [ ] Diagnostic Settings on every supported Fabric item (capacity, pipelines, eventhouse, eventstream, semantic models)
- [ ] Workspace Monitoring item provisioned in every production workspace
- [ ] Tenant audit log export to Log Analytics enabled
- [ ] Application Insights provisioned; connection string distributed to notebooks
- [ ] OpenTelemetry instrumentation in shared notebook utilities

Storage

- [ ] Log Analytics workspace deployed via log-analytics-workspace.bicep
- [ ] Retention configured per signal class (90 / 365 / 2555 days)
- [ ] Continuous export to ADLS Gen2 archive for 7+ year domains (casino, federal)
- [ ] Eventhouse retention tuned per table (not default 30 days for everything)

Action Groups & Alerts

- [ ] Three Action Groups deployed via action-groups.bicep (P1 / P2 / P3)
- [ ] Distribution lists for email recipients (no individual addresses)
- [ ] SMS / voice for P1 only via on-call rotation numbers
- [ ] ITSM connector wired (PagerDuty or ServiceNow) for P1 + P2
- [ ] Alert rules deployed via Bicep; no portal-authored rules in production
- [ ] Each alert references a runbook URL in customProperties.runbook
- [ ] Suppression / maintenance windows configured for known deploys

Dashboards

- [ ] Capacity Metrics App installed and pinned for capacity admins
- [ ] RTI Dashboard published for live ops view
- [ ] Azure Managed Grafana deployed with Azure Monitor + Workspace Monitoring data sources
- [ ] FUAM deployed in admin tenant workspace
- [ ] Dashboard JSON committed under infra/modules/observability/grafana-dashboards/

Operational Readiness

- [ ] On-call rotation defined and uploaded to ITSM
- [ ] Runbook bookmarks set for each P1/P2 alert
- [ ] Quarterly fire-drill scheduled and recorded
- [ ] SLO/SLI dashboards published per SLO/SLI for Fabric
- [ ] Postmortem template defined in docs/runbooks/templates/postmortem.md


🚫 Anti-Patterns

| # | Anti-Pattern | Problem | Fix |
|---|---|---|---|
| 1 | Double-writing telemetry | Sending the same Spark log to both Workspace Monitoring and Log Analytics "for safety": 2x cost, query divergence, retention drift. | Pick the primary sink per signal class (Decision Matrix). For cross-store correlation, use cross-cluster KQL; don't duplicate. |
| 2 | Per-engineer email alerts | Alerts addressed to individual addresses (alice@contoso.com) disappear on PTO; no audit trail. | Route through distribution lists or on-call rotation services. Membership changes ≠ alert-rule changes. |
| 3 | Portal-authored alert rules | Rules created in the portal during incident response and never codified: config drift, lost on DR. | All alert rules in Bicep, reviewed in PRs. Portal authoring only for prototyping; export to Bicep before merging. |
| 4 | One alert per symptom | 200 alert rules, each watching one KPI with its own threshold: fatigue, burnout, paging on noise. | Define SLOs (SLO/SLI for Fabric). Alert on error-budget burn rate, not raw thresholds. |
| 5 | Dashboards without owners | 80 Grafana dashboards, half referencing dead tables, none maintained. | Tag every dashboard owner=<team> and last_reviewed=<date>. Quarterly audit; archive anything more than 1 year stale. |
| 6 | Runbook URLs that 404 | Alert says "see runbook" with a link to a deleted wiki page: no 3am guidance. | Validate runbook URLs in the quarterly fire-drill. Runbooks live in the same repo as alert Bicep. |
| 7 | 30-day retention for audit logs | Default retention for FabricAuditLog left at 30 days: a 7-year compliance audit fails. | Set retention per table based on regulatory requirement (NIGC, BSA/FinCEN, FedRAMP, HIPAA). Configure once in Bicep; verify quarterly. |
| 8 | Sampling errors | 10% sampling applied to all traces, including errors: 90% of failure traces lost. | Always sample at 100% for errors, P1/P2 alerts, and audit events. Sample only successful high-volume, low-signal events. |


📚 References

Microsoft Documentation

- Workspace Monitoring overview
- Diagnostic Settings for Fabric
- Azure Monitor Action Groups
- Log Analytics Workspace overview
- Application Insights
- Azure Managed Grafana
- KQL reference
- W3C Trace Context
- OpenTelemetry Python

Operational Best Practices

- Google SRE: Monitoring Distributed Systems
- SRE Workbook: Alerting on SLOs
- Azure Monitor best practices
- PagerDuty Incident Response

Compliance

- NIGC MICS §542.17
- NIST SP 800-137
- FedRAMP Continuous Monitoring Strategy

