
📊 Monitoring & Observability for Microsoft Fabric

Unified Telemetry, Dashboards, Alerting & Runbooks for Fabric Workloads

Last Updated: 2026-04-13 | Version: 1.0.0


🎯 Overview

Monitoring and observability in Microsoft Fabric requires a unified strategy that spans capacity utilization, workspace health, pipeline execution, Spark job performance, data quality, and business SLA compliance. This guide establishes the monitoring patterns for the casino gaming and federal agency POC, covering the built-in Fabric monitoring capabilities (Capacity Metrics app, system tables, Admin Monitoring workspace), custom Real-Time Intelligence dashboards, Data Activator alerting, and operational runbooks for incident response.

Monitoring Pillars

| Pillar | What It Answers | Fabric Tooling |
|---|---|---|
| Capacity | Are we within compute/memory limits? Will we throttle? | Capacity Metrics app, Admin APIs |
| Pipelines | Did ETL jobs succeed? How long did they take? | Pipeline monitoring, system tables |
| Spark | Are notebooks running efficiently? Any OOM failures? | Spark monitoring, Spark UI, system tables |
| Data Quality | Is data arriving on time? Are quality gates passing? | Great Expectations, custom metrics |
| Security | Are there unauthorized access attempts or policy violations? | Unified Audit Log, Purview |
| SLA | Are we meeting business freshness and availability requirements? | Custom KPI dashboards |

Observability Maturity Levels

| Level | Description | Capabilities |
|---|---|---|
| L1 — Reactive | Respond when users report issues | Manual checking, no alerts |
| L2 — Proactive | Alerts before users notice | Threshold alerts, email notifications |
| L3 — Predictive | Anticipate issues before they occur | Trend analysis, capacity forecasting |
| L4 — Autonomous | Self-healing with automated remediation | Data Activator + Power Automate flows |

Target for this POC: Level 3 (Predictive) for capacity and pipeline monitoring, Level 2 (Proactive) for data quality and security.


đŸ—ī¸ Architecture

Unified Monitoring Architecture

flowchart TB
    subgraph Telemetry["Telemetry Sources"]
        style Telemetry fill:#E67E22,color:#fff
        T1[Capacity Metrics<br/>CU, memory, throttling]
        T2[Pipeline Runs<br/>status, duration, errors]
        T3[Spark Jobs<br/>stages, tasks, memory]
        T4[SQL Queries<br/>duration, rows, errors]
        T5[Audit Events<br/>access, permissions]
        T6[Data Quality<br/>GE results, row counts]
    end

    subgraph Collection["Collection Layer"]
        style Collection fill:#6C3483,color:#fff
        C1[Admin Monitoring<br/>Workspace<br/>System Tables]
        C2[Eventhouse<br/>KQL Database]
        C3[Lakehouse<br/>Metrics Tables]
    end

    subgraph Storage["Storage & Analytics"]
        style Storage fill:#2471A3,color:#fff
        S1["KQL Queries<br/>Real-time analysis"]
        S2["Semantic Model<br/>Historical trends"]
        S3["Delta Tables<br/>Long-term retention"]
    end

    subgraph Presentation["Dashboards & Alerts"]
        style Presentation fill:#27AE60,color:#fff
        D1[RTI Dashboard<br/>Real-time ops view]
        D2[Power BI Report<br/>Historical analysis]
        D3[Data Activator<br/>Threshold alerts]
        D4[Teams Channel<br/>Notifications]
        D5[Power Automate<br/>Automated response]
    end

    T1 --> C1
    T2 --> C1
    T3 --> C1
    T4 --> C1
    T5 --> C2
    T6 --> C3
    C1 --> S1
    C1 --> S2
    C2 --> S1
    C3 --> S3
    S1 --> D1
    S1 --> D3
    S2 --> D2
    S3 --> D2
    D3 --> D4
    D3 --> D5

Data Flow for Monitoring Telemetry

sequenceDiagram
    participant Source as Fabric Workload
    participant System as System Tables
    participant KQL as Eventhouse (KQL)
    participant Dashboard as RTI Dashboard
    participant Activator as Data Activator
    participant Teams as Teams Channel

    Source->>System: Emit telemetry (automatic)
    System->>KQL: Ingest via Eventstream
    KQL->>Dashboard: Real-time query (30s refresh)
    KQL->>Activator: Evaluate trigger conditions
    Activator-->>Teams: Alert: Pipeline failed!
    Dashboard->>Dashboard: Auto-refresh every 30 seconds
    Note over System,KQL: Historical data retained<br/>in system tables for 30 days

⚡ Capacity Monitoring

Fabric Capacity Metrics App

The Capacity Metrics app is the primary tool for monitoring Fabric capacity utilization. Install it from AppSource for each capacity you manage.
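For scripted checks alongside the app, the capacity inventory can also be pulled from the admin REST API. A minimal sketch, assuming a service principal that has been granted admin API access; the tenant, app ID, and secret are placeholders:

# Hedged sketch: list capacities and their SKUs via the Power BI admin REST API.
# Assumes a service principal with admin API access; IDs below are placeholders.
import requests
from azure.identity import ClientSecretCredential

credential = ClientSecretCredential(
    tenant_id="<tenant-id>", client_id="<app-id>", client_secret="<secret>"
)
token = credential.get_token("https://analysis.windows.net/powerbi/api/.default")

resp = requests.get(
    "https://api.powerbi.com/v1.0/myorg/admin/capacities",
    headers={"Authorization": f"Bearer {token.token}"},
    timeout=30,
)
resp.raise_for_status()
for cap in resp.json().get("value", []):
    # Each entry includes the capacity SKU and state (e.g. Active, Paused)
    print(cap["displayName"], cap["sku"], cap["state"])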

Key Metrics to Monitor:

| Metric | Description | Warning Threshold | Critical Threshold |
|---|---|---|---|
| CU Utilization (%) | Compute Units consumed vs. available | > 70% sustained | > 90% sustained |
| Interactive CU (%) | CU consumed by interactive queries | > 80% | > 95% |
| Background CU (%) | CU consumed by background jobs (refresh, pipelines) | > 70% | > 85% |
| Throttling Events | Number of throttling events in the period | Any occurrence | Sustained throttling |
| Rejection Events | Requests rejected due to capacity exhaustion | Any occurrence | Multiple per hour |
| Overages | CU debt carried forward (smoothing window) | > 10 min carry | > 30 min carry |
| Memory (GB) | Memory utilization for Spark and SQL | > 70% | > 85% |

KQL Queries for Capacity Health

// Capacity utilization over the last 24 hours (15-minute windows)
FabricCapacityMetrics
| where TimeGenerated > ago(24h)
| summarize
    AvgCU = avg(CUPercent),
    MaxCU = max(CUPercent),
    ThrottleCount = countif(IsThrottled == true),
    RejectCount = countif(IsRejected == true)
    by bin(TimeGenerated, 15m)
| order by TimeGenerated desc

// Top CU consumers by workspace and item type
FabricCapacityMetrics
| where TimeGenerated > ago(1h)
| summarize TotalCU = sum(CUSeconds) by WorkspaceName, ItemType, ItemName
| top 20 by TotalCU desc
| project WorkspaceName, ItemType, ItemName, TotalCU_Minutes = round(TotalCU / 60.0, 2)

// Detect throttling patterns — when does throttling occur most?
FabricCapacityMetrics
| where TimeGenerated > ago(7d)
| where IsThrottled == true
| summarize ThrottleEvents = count() by bin(TimeGenerated, 1h), DayOfWeek = dayofweek(TimeGenerated)
| order by ThrottleEvents desc

Capacity Forecasting

// 7-day CU trend for capacity planning
FabricCapacityMetrics
| where TimeGenerated > ago(7d)
| summarize DailyCU = avg(CUPercent) by bin(TimeGenerated, 1d)
| extend Day = format_datetime(TimeGenerated, 'yyyy-MM-dd')
| project Day, DailyCU
| order by Day asc
// Plot this as a line chart — if the trend is increasing, plan for SKU upgrade
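To turn the trend into a forecast, a least-squares fit over the daily averages is enough for SKU planning. A minimal sketch, assuming the daily values were exported from the query above (the sample numbers are illustrative):

# Trend sketch: fit a line to daily CU averages and estimate when utilization
# would cross a planning threshold. Input values are illustrative.
import numpy as np

daily_cu = [52.1, 54.8, 53.9, 57.2, 58.0, 61.3, 62.7]
days = np.arange(len(daily_cu))
slope, intercept = np.polyfit(days, daily_cu, 1)

threshold = 80.0  # plan the SKU upgrade before sustained 80% CU
if slope > 0:
    days_to_threshold = (threshold - daily_cu[-1]) / slope
    print(f"Trend: +{slope:.2f}%/day; ~{days_to_threshold:.0f} days to {threshold:.0f}% CU")
else:
    print("CU trend is flat or decreasing; no upgrade signal")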

Capacity Alert Thresholds

| Alert | Condition | Severity | Action |
|---|---|---|---|
| CU > 70% for 30 min | Sustained high utilization | Warning | Review top consumers |
| CU > 90% for 15 min | Near capacity limit | Critical | Pause non-essential workloads |
| Throttling detected | Any throttle event | High | Check spike source, consider scaling |
| Rejection detected | Request rejected | Critical | Immediate: scale capacity or pause jobs |
| Memory > 85% | High memory pressure | High | Review Spark configs, reduce concurrency |
| Overage > 30 min | Extended CU debt | Warning | Reschedule background jobs to off-peak |

🔍 Workspace Monitoring

System Tables (Admin Monitoring Workspace)

Fabric provides system tables in the Admin Monitoring workspace that expose operational telemetry:

| System Table | Contents | Retention |
|---|---|---|
| pipeline_runs | Pipeline execution history | 30 days |
| notebook_runs | Spark notebook execution history | 30 days |
| sql_queries | SQL endpoint query history | 30 days |
| capacity_metrics | CU consumption telemetry | 30 days |
| audit_events | User and system actions | 90 days |

Pipeline Monitoring

// Pipeline run summary — last 24 hours
FabricPipelineRuns
| where StartTime > ago(24h)
| summarize
    TotalRuns = count(),
    Succeeded = countif(Status == "Succeeded"),
    Failed = countif(Status == "Failed"),
    Cancelled = countif(Status == "Cancelled"),
    InProgress = countif(Status == "InProgress"),
    AvgDurationMin = round(avg(DurationInSeconds) / 60.0, 2),
    MaxDurationMin = round(max(DurationInSeconds) / 60.0, 2)
    by PipelineName, WorkspaceName
| extend SuccessRate = round(todouble(Succeeded) / TotalRuns * 100, 1)
| order by SuccessRate asc

// Failed pipelines with error details
FabricPipelineRuns
| where StartTime > ago(24h)
| where Status == "Failed"
| project
    PipelineName,
    WorkspaceName,
    StartTime,
    DurationMin = round(DurationInSeconds / 60.0, 2),
    ErrorMessage = tostring(parse_json(Error).message),
    ErrorCode = tostring(parse_json(Error).errorCode)
| order by StartTime desc

Spark Job Monitoring

// Spark notebook performance — last 24 hours
FabricNotebookRuns
| where StartTime > ago(24h)
| summarize
    TotalRuns = count(),
    AvgDurationMin = round(avg(DurationInSeconds) / 60.0, 2),
    MaxDurationMin = round(max(DurationInSeconds) / 60.0, 2),
    FailedRuns = countif(Status == "Failed"),
    AvgSpillGB = round(avg(DiskSpillBytes) / (1024.0 * 1024 * 1024), 2)
    by NotebookName, WorkspaceName
| extend FailRate = round(todouble(FailedRuns) / TotalRuns * 100, 1)
| order by AvgDurationMin desc

// Spark OOM detection — jobs with excessive memory usage
FabricNotebookRuns
| where StartTime > ago(7d)
| where Status == "Failed"
| where ErrorMessage has_any ("OutOfMemoryError", "java.lang.OutOfMemoryError", "Container killed by YARN")
| project
    NotebookName,
    WorkspaceName,
    StartTime,
    ErrorMessage,
    PeakMemoryGB = round(PeakMemoryBytes / (1024.0 * 1024 * 1024), 2),
    ExecutorCount
| order by StartTime desc

SQL Endpoint Monitoring

// Slow SQL queries — duration > 30 seconds
FabricSQLQueries
| where StartTime > ago(24h)
| where DurationInSeconds > 30
| project
    QueryText = substring(QueryText, 0, 200),
    WorkspaceName,
    UserEmail,
    DurationSec = round(DurationInSeconds, 1),
    RowsReturned,
    StartTime
| order by DurationSec desc
| take 20

Pipeline History Dashboard Query

// 7-day pipeline success/failure trend for dashboard
FabricPipelineRuns
| where StartTime > ago(7d)
| summarize
    Succeeded = countif(Status == "Succeeded"),
    Failed = countif(Status == "Failed")
    by bin(StartTime, 1d)
| extend SuccessRate = round(todouble(Succeeded) / (Succeeded + Failed) * 100, 1)
| project Day = format_datetime(StartTime, 'yyyy-MM-dd'), Succeeded, Failed, SuccessRate
| order by Day asc

📈 Custom Dashboards

RTI Dashboard Patterns

Real-Time Intelligence dashboards provide live operational views with auto-refresh capabilities:

Operations Dashboard Layout

┌─────────────────────────────────────────────────────────┐
│                    FABRIC OPERATIONS CENTER              │
│              Last Refresh: 2026-04-13 14:30:00 UTC       │
├──────────────┬──────────────┬──────────────┬────────────┤
│  CU Usage    │  Pipeline    │  Spark Jobs  │  Alerts    │
│  ████░░ 68%  │  ✅ 142/150  │  ✅ 28/30    │  ⚠️ 2     │
│  Warning: 70%│  ❌ 8 failed │  ❌ 2 OOM    │  🔴 0     │
├──────────────┴──────────────┴──────────────┴────────────┤
│                  CU Utilization (24h)                     │
│  100%│                                                   │
│   80%│          ▄▄▄▄                                     │
│   60%│    ▄▄▄▄▄█████▄▄▄▄▄▄▄                             │
│   40%│▄▄▄█████████████████████▄▄▄▄                       │
│   20%│████████████████████████████████                    │
│     0├────────────────────────────────                    │
│      00:00  04:00  08:00  12:00  16:00  20:00            │
├──────────────────────────────────────────────────────────┤
│              Pipeline Status (Last 24h)                   │
│  Pipeline Name          │ Status │ Duration │ Last Run   │
│  pl_bronze_slot_ingest  │  ✅    │  4m 23s  │ 14:15 UTC  │
│  pl_silver_slot_cleanse │  ✅    │  8m 12s  │ 14:20 UTC  │
│  pl_gold_kpi_compute    │  ❌    │  --      │ 14:25 UTC  │
│  pl_federal_usda_ingest │  ✅    │  3m 44s  │ 13:00 UTC  │
└──────────────────────────────────────────────────────────┘

KQL for Dashboard Tiles

// Tile 1: Current CU utilization (single value)
FabricCapacityMetrics
| where TimeGenerated > ago(5m)
| summarize CurrentCU = round(avg(CUPercent), 1)

// Tile 2: Pipeline success rate (last 24h)
FabricPipelineRuns
| where StartTime > ago(24h)
| summarize
    Total = count(),
    Succeeded = countif(Status == "Succeeded")
| extend SuccessRate = round(todouble(Succeeded) / Total * 100, 1)
| project SuccessRate, Succeeded, Total

// Tile 3: Active alerts count
FabricAlerts
| where TimeGenerated > ago(1h)
| where State == "Active"
| summarize
    Critical = countif(Severity == "Critical"),
    Warning = countif(Severity == "Warning"),
    Info = countif(Severity == "Info")

// Tile 4: Data freshness — last ingestion timestamp per table
FabricDeltaTableMetrics
| summarize LastUpdate = max(LastModifiedTimestamp) by TableName, LakehouseName
| extend FreshnessMinutes = datetime_diff('minute', now(), LastUpdate)
| extend Status = iff(FreshnessMinutes > 60, "⚠️ Stale", "✅ Fresh")
| order by FreshnessMinutes desc
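The same freshness check can also run inside a Fabric notebook straight from Delta commit history. A minimal sketch, assuming the notebook-provided spark session and illustrative table names; commit timestamps are treated as UTC:

# Freshness sketch for a Fabric notebook: read the latest Delta commit per
# table and flag anything older than 60 minutes. Table names are illustrative;
# `spark` is the notebook-provided session.
from datetime import datetime, timezone

tables = ["bronze_slot_events", "silver_slot_events", "gold_slot_kpis"]
for t in tables:
    # DESCRIBE HISTORY returns commits newest-first; row 0 is the last write
    last_commit = spark.sql(f"DESCRIBE HISTORY {t} LIMIT 1").first()["timestamp"]
    age_min = (datetime.now(timezone.utc)
               - last_commit.replace(tzinfo=timezone.utc)).total_seconds() / 60
    print(f"{t}: last write {age_min:.0f} min ago", "⚠️ Stale" if age_min > 60 else "✅ Fresh")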

Power BI Historical Dashboard

For historical trend analysis, create a Power BI report connected to system tables via DirectQuery or Import:

| Page | Content | Refresh |
|---|---|---|
| Executive Summary | KPI cards (uptime, success rate, CU avg), 30-day trend | Daily |
| Capacity | CU utilization heatmap, throttling events, top consumers | Hourly |
| Pipelines | Success/failure trend, duration distribution, error categories | Hourly |
| Spark | Job duration trends, OOM occurrences, spill detection | Hourly |
| Data Quality | GE suite pass rates, row count trends, quarantine volume | Daily |
| Security | Access anomalies, permission changes, failed logins | Daily |

🚨 Alerting Strategy

Data Activator Integration

Data Activator triggers are the primary alerting mechanism for Fabric-native monitoring:

flowchart LR
    subgraph Sources["Monitoring Sources"]
        style Sources fill:#E67E22,color:#fff
        S1[Capacity Metrics]
        S2[Pipeline Status]
        S3[Spark Failures]
        S4[Data Quality]
        S5[Security Events]
    end

    subgraph Activator["Data Activator"]
        style Activator fill:#6C3483,color:#fff
        R1[reflex_platform_capacity]
        R2[reflex_platform_pipelines]
        R3[reflex_platform_spark]
        R4[reflex_data_quality]
        R5[reflex_security_audit]
    end

    subgraph Actions["Alert Actions"]
        style Actions fill:#27AE60,color:#fff
        A1[Teams: #fabric-alerts]
        A2[Email: on-call team]
        A3[Power Automate: auto-remediate]
        A4[PagerDuty: critical escalation]
    end

    S1 --> R1
    S2 --> R2
    S3 --> R3
    S4 --> R4
    S5 --> R5
    R1 -->|Warning| A1
    R1 -->|Critical| A2
    R2 -->|Failed| A1
    R2 -->|Failed 3x| A4
    R3 -->|OOM| A1
    R4 -->|DQ < 70%| A1
    R5 -->|Unauthorized| A4

Severity Classification

| Severity | Description | Response Time | Notification Channel | Escalation |
|---|---|---|---|---|
| P1 — Critical | Data loss, security breach, complete outage | 15 min | PagerDuty + Teams + Email + Phone | Immediate to VP |
| P2 — High | Pipeline failure, capacity throttling, compliance SLA miss | 1 hour | Teams + Email | After 2 hours to Manager |
| P3 — Medium | Slow queries, data freshness warning, non-critical job failure | 4 hours | Teams | After 24 hours to Lead |
| P4 — Low | Informational, optimization opportunities | Next business day | Email digest | None |
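In an automated dispatcher, this classification reduces to a routing table keyed by severity. A minimal sketch; the channel senders are placeholders, not a real notification API:

# Routing sketch: the severity table above expressed as data. Senders are
# placeholders to be replaced with real Teams/PagerDuty/email integrations.
ROUTING = {
    "P1": {"channels": ["pagerduty", "teams", "email", "phone"], "escalate_to": "VP", "escalate_after_h": 0},
    "P2": {"channels": ["teams", "email"], "escalate_to": "Manager", "escalate_after_h": 2},
    "P3": {"channels": ["teams"], "escalate_to": "Lead", "escalate_after_h": 24},
    "P4": {"channels": ["email_digest"], "escalate_to": None, "escalate_after_h": None},
}

def route_alert(severity, message):
    rule = ROUTING[severity]
    for channel in rule["channels"]:
        print(f"[{channel}] {severity}: {message}")  # replace with real senders
    return rule  # escalation metadata for the on-call workflow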

Alert Configuration Matrix

| Alert | Severity | Condition | Cooldown | Channel |
|---|---|---|---|---|
| Capacity > 90% | P2 | CU > 90% for 15 min | 30 min | Teams + Email |
| Capacity throttling | P2 | Any throttle event | 15 min | Teams + Email |
| Request rejection | P1 | Any rejection event | None | PagerDuty + Teams |
| Pipeline failed | P3 | Status = Failed | Per pipeline | Teams |
| Pipeline failed 3x consecutive | P2 | 3 consecutive failures | 1 hour | Teams + Email |
| Bronze ingestion > 2h late | P3 | No new rows in 2 hours | 2 hours | Teams |
| Gold refresh > 4h late | P2 | Refresh SLA breach | 4 hours | Teams + Email |
| Spark OOM | P3 | OutOfMemory error | Per notebook | Teams |
| DQ score < 70% | P2 | Great Expectations suite failure | 1 hour | Teams + Email |
| Unauthorized access (403) | P1 | Audit log 403 event | None | PagerDuty + Security |
| Permission change | P4 | Workspace role modified | 24 hours | Email digest |
| Quarantine > 1000 records | P3 | Quarantine table threshold | 4 hours | Teams |

Alert Fatigue Prevention

| Strategy | Implementation |
|---|---|
| Cooldown periods | Suppress duplicate alerts for 15–60 min after first firing |
| Severity-based routing | Only P1/P2 go to PagerDuty; P3/P4 go to Teams only |
| Business hours | P3/P4 alerts suppressed outside business hours (except casino 24/7) |
| Aggregation | Batch P4 alerts into daily digest emails |
| Auto-acknowledge | Auto-close alerts when the condition resolves |
| Threshold tuning | Review alert thresholds monthly; adjust based on false-positive rate |
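Cooldown periods are the highest-leverage strategy in this table, and the core logic is just a last-fired timestamp per alert key. A minimal sketch of the suppression check, using hypothetical helper names:

# Cooldown sketch: suppress re-firing of the same alert key within its
# configured window (15–60 min depending on the alert).
from datetime import datetime, timedelta

_last_fired = {}  # alert key -> datetime of last delivery

def should_fire(alert_key, cooldown_min, now=None):
    """Return True only if the alert is outside its cooldown window."""
    now = now or datetime.utcnow()
    last = _last_fired.get(alert_key)
    if last is not None and now - last < timedelta(minutes=cooldown_min):
        return False  # duplicate within cooldown — suppress
    _last_fired[alert_key] = now
    return True

For example, should_fire("capacity_gt_90", cooldown_min=30) returns False for any repeat firing inside the 30-minute window.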

Escalation Matrix

flowchart TB
    subgraph Alert["Alert Fired"]
        style Alert fill:#E67E22,color:#fff
        A[Alert triggered]
    end

    subgraph L1["Level 1: On-Call Engineer"]
        style L1 fill:#27AE60,color:#fff
        L1A[Acknowledge within SLA]
        L1B{Resolved?}
    end

    subgraph L2["Level 2: Team Lead"]
        style L2 fill:#2471A3,color:#fff
        L2A[Escalated after timeout]
        L2B{Resolved?}
    end

    subgraph L3["Level 3: Management"]
        style L3 fill:#6C3483,color:#fff
        L3A[VP / Director notification]
        L3B[War room activated]
    end

    A --> L1A
    L1A --> L1B
    L1B -->|Yes| Done[✅ Resolved + Postmortem]
    L1B -->|No - after 2h| L2A
    L2A --> L2B
    L2B -->|Yes| Done
    L2B -->|No - after 4h| L3A
    L3A --> L3B
    L3B --> Done

📋 Runbooks

Runbook 1: Capacity Throttled

Trigger: CU utilization > 90% for 15+ minutes with throttling events.

Severity: P2 — High

Steps:

1. Assess — Open the Capacity Metrics app and identify the top CU consumers by workspace and item type:

    FabricCapacityMetrics
    | where TimeGenerated > ago(1h)
    | where IsThrottled == true
    | summarize TotalCU = sum(CUSeconds) by WorkspaceName, ItemType, ItemName
    | top 10 by TotalCU desc

2. Immediate Relief — Pause non-essential background jobs:
   - Cancel any running ad-hoc notebooks
   - Pause scheduled refreshes for non-critical semantic models
   - Defer any data generation or load-testing jobs

3. Root Cause — Determine whether the spike is expected (batch window) or anomalous:
   - Check if a large backfill or historical reload is running
   - Check if a Spark notebook has a Cartesian join or data explosion
   - Check if multiple Power BI reports triggered simultaneous refreshes

4. Scale (if needed) — Temporarily scale the capacity:

    # Scale F64 to F128 for emergency capacity
    az resource update \
      --resource-group rg-fabric-poc \
      --name fabric-capacity-poc \
      --resource-type Microsoft.Fabric/capacities \
      --set sku.name=F128

5. Resolve — After the spike passes:
   - Scale back to the original SKU
   - Document the incident
   - Adjust scheduling to prevent recurrence

6. Postmortem — Within 24 hours, document:
   - Timeline of events
   - Root cause
   - Impact (queries throttled, users affected)
   - Preventive measures

Runbook 2: Pipeline Failed

Trigger: Pipeline status = Failed.

Severity: P3 (single failure) or P2 (3+ consecutive failures).

Steps:

1. Assess — Check the pipeline run history:

    FabricPipelineRuns
    | where PipelineName == "pl_bronze_slot_ingest"
    | where StartTime > ago(24h)
    | project StartTime, Status, DurationInSeconds, Error
    | order by StartTime desc

2. Identify the Failing Activity — Determine which activity in the pipeline failed:
   - Copy activity: Source connectivity? Schema drift? Timeout?
   - Notebook activity: Spark error? Data quality gate? OOM?
   - Dataflow activity: Refresh timeout? Memory limit?

3. Common Fixes:

| Error | Likely Cause | Fix |
|---|---|---|
| Connection timeout | Source system down | Verify source, retry with backoff |
| Schema mismatch | Source schema changed | Update schema mapping, re-run |
| OutOfMemoryError | Data volume spike | Increase Spark executor memory |
| Authentication error | Credential expired | Rotate SPN secret or refresh token |
| Throttling | Capacity overloaded | Reschedule to off-peak |

4. Retry — Re-run the failed pipeline (see the backoff sketch after this runbook):
   - If transient error: retry immediately
   - If schema/data issue: fix and re-run
   - If capacity issue: wait for capacity availability

5. Notify — If the failure affects an SLA:
   - Post to the #fabric-alerts Teams channel
   - Update the on-call log
   - Notify downstream consumers
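Where a retry is safe to automate, step 4 can be scripted against the Fabric REST job-scheduler endpoint with exponential backoff. A hedged sketch: the endpoint shape should be confirmed against the current Fabric REST API docs, and the IDs and token acquisition are placeholders:

# Hedged sketch: re-run a pipeline with exponential backoff via the Fabric
# REST "run on demand item job" endpoint. IDs and the bearer token are
# placeholders; confirm the endpoint against current Fabric REST docs.
import time
import requests

def rerun_pipeline(workspace_id, pipeline_id, token, max_attempts=3):
    url = (f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}"
           f"/items/{pipeline_id}/jobs/instances?jobType=Pipeline")
    for attempt in range(max_attempts):
        resp = requests.post(url, headers={"Authorization": f"Bearer {token}"}, timeout=30)
        if resp.status_code == 202:  # job accepted by the scheduler
            return True
        if attempt < max_attempts - 1:
            time.sleep(2 ** attempt * 60)  # back off: 1 min, 2 min, 4 min
    return False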

Runbook 3: Spark OutOfMemory (OOM)

Trigger: Notebook fails with java.lang.OutOfMemoryError or Container killed by YARN.

Severity: P3 — Medium.

Steps:

1. Assess — Check which notebook and dataset failed:

    FabricNotebookRuns
    | where Status == "Failed"
    | where ErrorMessage has "OutOfMemoryError"
    | project NotebookName, StartTime, PeakMemoryGB = round(PeakMemoryBytes / 1073741824.0, 2), ExecutorCount
    | order by StartTime desc

2. Diagnose — Common OOM causes:
   - Collect to driver: df.collect() on a large DataFrame
   - Broadcast join: broadcasting a table that is too large
   - Skewed partition: one partition holds far more data than the rest (see the salting sketch after this runbook)
   - Cartesian join: a missing join condition creates an exploding dataset
   - Window function without partition: Window.orderBy() without .partitionBy()

3. Fix — Spark configuration:

    # Increase executor memory
    spark.conf.set("spark.executor.memory", "16g")
    spark.conf.set("spark.driver.memory", "8g")

    # Disable broadcast for large tables
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    # Enable adaptive query execution for skew handling
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

4. Fix — Code changes:

    from pyspark.sql.window import Window

    # Replace df.collect() with df.toPandas() on small datasets,
    # or pull a bounded sample: df.limit(1000).collect()

    # Add partition columns to window functions
    # Before (OOM risk): the whole DataFrame becomes one partition
    Window.orderBy("timestamp")
    # After (safe): work is split per machine
    Window.partitionBy("machine_id").orderBy("timestamp")

    # Repartition skewed data
    df = df.repartition(200, "machine_id")

5. Verify — Re-run and monitor the Spark UI for memory utilization.
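When AQE's skew handling is not enough, the skewed-partition case from step 2 can be handled by manually salting the join key. A minimal PySpark sketch, assuming placeholder DataFrames large_df and small_df that join on machine_id:

# Salting sketch for a skewed join (illustrative names). Hot keys on the large
# side are spread across N salt buckets; the small side is replicated once per
# bucket so every salted key still finds its match.
from pyspark.sql import functions as F

N = 16  # salt buckets; size to the observed skew
large = large_df.withColumn("salt", (F.rand() * N).cast("long"))
small = small_df.crossJoin(spark.range(N).withColumnRenamed("id", "salt"))

joined = large.join(small, on=["machine_id", "salt"]).drop("salt")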


Runbook 4: Semantic Model Refresh Timeout

Trigger: Power BI semantic model refresh exceeds 2-hour timeout.

Severity: P3 — Medium (P2 if Gold KPI model for executives).

Steps:

1. Assess — Check the refresh history in the Fabric portal:
   - Go to workspace → semantic model → Refresh history
   - Note the duration, rows processed, and tables refreshed

2. Diagnose — Common timeout causes:
   - Direct Lake fallback to DirectQuery (table too large for memory)
   - Inefficient DAX measures computed during refresh
   - Too many tables in a full refresh (use incremental)
   - Source lakehouse running OPTIMIZE/VACUUM during the refresh

3. Fix — Direct Lake optimization (see the maintenance sketch after this runbook):
   - Reduce row groups: merge small Parquet files (OPTIMIZE)
   - Enable V-Order: better compression for Direct Lake
   - Split large tables: partition by date, refresh only recent partitions
   - Remove unused columns: fewer columns = faster segment loading

4. Fix — Incremental refresh:
   - Configure an incremental refresh policy (last 3 months hot, archive cold)
   - Ensure the partition key column is a datetime type
   - Set the refresh/detection window appropriately

5. Monitor — After the fix, confirm the refresh completes within SLA.
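The OPTIMIZE/V-Order fix from step 3 can run as a small maintenance notebook scheduled outside the refresh window. A minimal sketch, assuming the notebook-provided spark session and an illustrative Gold table name; VORDER is Fabric's Spark SQL extension:

# Maintenance sketch: compact small Parquet files and apply V-Order so Direct
# Lake segment loading stays fast. Run outside the semantic model refresh window.
print(spark.sql("DESCRIBE DETAIL gold_slot_kpis").select("numFiles").first()[0], "files before")

spark.sql("OPTIMIZE gold_slot_kpis VORDER")  # compact + V-Order rewrite
spark.sql("VACUUM gold_slot_kpis")           # drop files past the retention window (default 7 days)

print(spark.sql("DESCRIBE DETAIL gold_slot_kpis").select("numFiles").first()[0], "files after")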


Runbook Template

# Runbook: {Issue Name}

**Trigger:** {What fires this runbook}
**Severity:** {P1/P2/P3/P4}
**Owner:** {Team or role responsible}
**Last Tested:** {Date of last runbook drill}

## Assessment
1. {First diagnostic step}
2. {KQL query or UI check}

## Diagnosis
| Symptom | Likely Cause | Fix |
|---------|-------------|-----|
| {symptom} | {cause} | {fix} |

## Remediation
1. {Step-by-step fix}
2. {Verification}

## Escalation
- If not resolved in {N} hours, escalate to {role}
- If customer-impacting, notify {stakeholder}

## Post-Incident
- [ ] Document timeline
- [ ] Identify root cause
- [ ] Implement preventive measure
- [ ] Update this runbook if needed

🎰 Casino NOC Dashboard

24/7 Network Operations Center

Casino gaming operations run 24/7/365. The NOC dashboard must provide at-a-glance status for the gaming floor data platform:

flowchart TB
    subgraph NOC["Casino NOC Dashboard"]
        style NOC fill:#6C3483,color:#fff
        subgraph Status["System Status"]
            HS[Health Score: 98.5%<br/>🟢 Healthy]
        end
        subgraph Pipelines["ETL Status"]
            PS[Bronze: ✅ 4m ago<br/>Silver: ✅ 8m ago<br/>Gold: ✅ 15m ago]
        end
        subgraph Compliance["Compliance"]
            CS[CTR Filed: 12 today<br/>SAR Review: 3 pending<br/>W-2G: 47 today]
        end
        subgraph Floor["Floor Metrics (Live)"]
            FM[Active Machines: 2,847<br/>Occupancy: 78%<br/>Coin-In/Hr: $1.2M]
        end
    end

    subgraph Alerts["Active Alerts"]
        style Alerts fill:#E67E22,color:#fff
        A1["âš ī¸ Gold refresh delayed<br/>15 min over SLA"]
        A2["â„šī¸ USDA API rate limit<br/>Retry in 60s"]
    end

    NOC --> Alerts

Casino-Specific Monitoring KPIs

| KPI | Source | Refresh Rate | SLA |
|---|---|---|---|
| Slot coin-in per hour | Eventstream → Eventhouse | Real-time (30s) | < 1 min latency |
| Active machine count | Bronze ingestion count | Every 5 minutes | < 10 min freshness |
| CTR filing count (today) | Gold compliance table | Every 15 minutes | < 30 min freshness |
| SAR pending review count | Gold compliance table | Every 15 minutes | < 30 min freshness |
| Hold percentage (running) | Gold KPI table | Every 15 minutes | < 30 min freshness |
| Floor occupancy (%) | Eventstream telemetry | Real-time (30s) | < 1 min latency |
| Pipeline health (last 1h) | System tables | Every 5 minutes | 100% visibility |
| Data freshness (all layers) | Delta table metadata | Every 10 minutes | < 15 min freshness |

Casino Alert Customization

| Alert | Casino-Specific Condition | Why |
|---|---|---|
| Slot telemetry gap > 5 min | No events from a floor section | Machine down or network issue |
| CTR threshold breach | Transaction > $10,000 detected | Immediate filing required |
| SAR pattern detected | Multiple $8K–$9.9K transactions | Structuring investigation |
| Gold KPI stale > 30 min | KPI dashboard not updated | Gaming commission reporting SLA |
| Compliance report generation failed | Daily compliance extract failed | Regulatory filing deadline |
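The telemetry-gap condition can also be evaluated by a scheduled job outside Data Activator, for example with the azure-kusto-data client against the Eventhouse query URI. A hedged sketch; the cluster URI, database, table, and column names are all illustrative:

# Hedged sketch: detect floor sections that have gone silent for 5+ minutes by
# querying the Eventhouse with azure-kusto-data. All names are illustrative.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

kcsb = KustoConnectionStringBuilder.with_aad_device_authentication("https://<eventhouse-query-uri>")
client = KustoClient(kcsb)

GAP_QUERY = """
SlotTelemetry
| summarize LastEvent = max(EventTimestamp) by FloorSection
| where LastEvent < ago(5m)
"""

for row in client.execute("SlotMonitoringDB", GAP_QUERY).primary_results[0]:
    print(f"⚠️ No telemetry from {row['FloorSection']} since {row['LastEvent']}")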

Callout — Casino 24/7 Operations: All casino alerts fire 24/7 with no business-hours suppression. The on-call rotation follows the casino's shift pattern (day/swing/graveyard). Compliance alerts (CTR, SAR) always escalate to the Compliance Manager regardless of shift.


đŸ›ī¸ Federal FISMA Continuous Monitoring

Continuous Monitoring Framework

FISMA requires continuous monitoring of security controls. Fabric monitoring maps to the NIST SP 800-137 continuous monitoring process:

flowchart TB
    subgraph NIST["NIST SP 800-137 Process"]
        style NIST fill:#6C3483,color:#fff
        N1[Define Strategy]
        N2[Establish Program]
        N3[Implement Program]
        N4[Analyze & Report]
        N5[Respond to Findings]
        N6[Review & Update]
    end

    subgraph Fabric["Fabric Implementation"]
        style Fabric fill:#2471A3,color:#fff
        F1[Monitoring requirements<br/>per agency ATO]
        F2[System tables +<br/>Eventhouse + Activator]
        F3[KQL queries +<br/>RTI dashboards]
        F4[Monthly POA&M +<br/>Weekly security report]
        F5[Runbook execution +<br/>Remediation tracking]
        F6[Quarterly control<br/>assessment review]
    end

    N1 --> F1
    N2 --> F2
    N3 --> F3
    N4 --> F4
    N5 --> F5
    N6 --> F6

FISMA Control Monitoring Mapping

| NIST Control | Control Name | Monitoring Method | Frequency |
|---|---|---|---|
| AC-2(4) | Automated Audit Actions | Audit log query for account create/modify/delete | Real-time |
| AC-6(9) | Log Use of Privileged Functions | Audit log query for admin actions | Real-time |
| AU-6 | Audit Record Review | Weekly security report from Unified Audit Log | Weekly |
| CA-7 | Continuous Monitoring | System tables + RTI dashboards | Continuous |
| CM-3 | Configuration Change Control | Git integration + pipeline deploy logs | Per change |
| IR-5 | Incident Monitoring | Data Activator alerts + incident log | Real-time |
| RA-5 | Vulnerability Monitoring | Security scan results, Defender alerts | Daily |
| SI-4 | System Monitoring | Capacity metrics, access logs, anomaly detection | Continuous |

Federal Security Monitoring KQL

// FISMA AU-6: Weekly audit record review
// Export for POA&M reporting
FabricAuditLogs
| where TimeGenerated > ago(7d)
| where ActionCategory in ("UserAccess", "PermissionChange", "DataExport", "AdminAction")
| summarize EventCount = count() by
    ActionCategory,
    ActionName = tostring(parse_json(Properties).operationName),
    UserEmail = tostring(parse_json(Properties).userEmail)
| order by EventCount desc

// FISMA AC-2(4): Account management audit
FabricAuditLogs
| where TimeGenerated > ago(24h)
| where ActionName in ("AddWorkspaceUser", "RemoveWorkspaceUser", "UpdateWorkspaceAccess")
| project
    Timestamp = TimeGenerated,
    Action = ActionName,
    TargetUser = tostring(parse_json(Properties).targetUserEmail),
    PerformedBy = tostring(parse_json(Properties).performedBy),
    WorkspaceName = tostring(parse_json(Properties).workspaceName),
    NewRole = tostring(parse_json(Properties).role)
| order by Timestamp desc

// FISMA SI-4: Anomaly detection — unusual data export volume
FabricAuditLogs
| where TimeGenerated > ago(24h)
| where ActionName in ("ExportData", "DownloadReport", "ExportVisualData")
| summarize ExportCount = count(), TotalRows = sum(tolong(parse_json(Properties).rowCount))
    by UserEmail = tostring(parse_json(Properties).userEmail)
| where ExportCount > 50 or TotalRows > 1000000
| project UserEmail, ExportCount, TotalRows, Alert = "Unusual export volume"

Federal Monthly Compliance Report Template

# Generate FISMA continuous monitoring monthly report
from datetime import datetime

report_period = datetime.utcnow().strftime("%Y-%m")

sections = {
    "Access Control Events": """
        FabricAuditLogs
        | where TimeGenerated > ago(30d)
        | where ActionCategory == "UserAccess"
        | summarize count() by ActionName
    """,
    "Permission Changes": """
        FabricAuditLogs
        | where TimeGenerated > ago(30d)
        | where ActionCategory == "PermissionChange"
        | summarize count() by ActionName, WorkspaceName
    """,
    "Data Export Activity": """
        FabricAuditLogs
        | where TimeGenerated > ago(30d)
        | where ActionName has "Export"
        | summarize count() by UserEmail, ActionName
    """,
    "Capacity Health": """
        FabricCapacityMetrics
        | where TimeGenerated > ago(30d)
        | summarize AvgCU = avg(CUPercent), MaxCU = max(CUPercent),
                    ThrottleCount = countif(IsThrottled)
        | project AvgCU, MaxCU, ThrottleCount
    """,
    "Pipeline Reliability": """
        FabricPipelineRuns
        | where StartTime > ago(30d)
        | summarize Total = count(), Failed = countif(Status == "Failed")
        | extend Reliability = round((Total - Failed) * 100.0 / Total, 2)
    """
}

print(f"📋 FISMA Continuous Monitoring Report — {report_period}")
print(f"Generated: {datetime.utcnow().isoformat()}")
print("=" * 60)
for section, query in sections.items():
    print(f"\n## {section}")
    print(f"Query:\n{query.strip()}")
    # Execute KQL queries and format results

Callout — Agency-Specific ATO Requirements: Each federal agency has its own Authority to Operate (ATO) with specific monitoring requirements. The monitoring strategy must be customized per agency. The queries above provide a baseline — work with each agency's ISSO (Information System Security Officer) to confirm required monitoring controls.


🔧 Operational Maturity Model

Assessment Checklist

| Capability | L1 Reactive | L2 Proactive | L3 Predictive | L4 Autonomous |
|---|---|---|---|---|
| Capacity monitoring | Manual check | Threshold alerts | Trend forecasting | Auto-scale |
| Pipeline monitoring | Check on failure report | Alert on failure | Predict failures from trends | Auto-retry with backoff |
| Spark monitoring | Review after OOM | Alert on OOM | Memory trend alerts | Auto-tune Spark config |
| Data quality | User reports bad data | GE suite alerts | DQ trend dashboard | Auto-quarantine + notify |
| Security | Annual audit review | Real-time access alerts | Anomaly detection | Auto-revoke suspicious access |
| Incident response | Ad-hoc troubleshooting | Runbooks available | Runbooks tested monthly | Auto-remediation flows |
| Reporting | On-demand | Scheduled weekly | Real-time dashboard | Self-service |

Maturity Progression Plan

Phase 1 (Week 1-2): Foundation
  ✅ Install Capacity Metrics app
  ✅ Enable system tables in Admin Monitoring workspace
  ✅ Create basic KQL queries for pipeline/Spark monitoring
  ✅ Set up Teams channel for alerts

Phase 2 (Week 3-4): Proactive Alerting
  ✅ Configure Data Activator Reflex items
  ✅ Implement alert severity classification
  ✅ Create initial runbooks for top 4 scenarios
  ✅ Set up on-call rotation

Phase 3 (Month 2): Custom Dashboards
  ✅ Build RTI operations dashboard
  ✅ Build Power BI historical trends report
  ✅ Implement data freshness monitoring
  ✅ Create domain-specific dashboards (Casino NOC, Federal FISMA)

Phase 4 (Month 3): Predictive + Automation
  ⬜ Implement capacity trend forecasting
  ⬜ Add anomaly detection for data quality
  ⬜ Create Power Automate auto-remediation flows
  ⬜ Conduct first runbook drill

âš ī¸ Common Issues

| Issue | Symptom | Root Cause | Resolution |
|---|---|---|---|
| System tables empty | No data in monitoring queries | Admin Monitoring workspace not enabled | Enable via the Fabric admin portal |
| Alert storms | Hundreds of alerts in minutes | Missing cooldown periods | Add 15–60 min cooldown per alert |
| Dashboard stale | RTI dashboard shows old data | Eventstream disconnected | Reconnect the Eventstream source |
| False positives | Alert fires during maintenance | No maintenance-window suppression | Add maintenance windows to alert rules |
| KQL query timeout | Complex query exceeds the 30s limit | Scanning too much data | Add time filters, reduce scope |
| Capacity spike at midnight | Unexpected CU burst | All scheduled refreshes at the same time | Stagger refresh schedules |
| Missing audit events | Security report incomplete | Audit log search limited to 90 days | Export to long-term storage (ADLS Gen2) |
| OOM in monitoring notebook | The monitoring notebook itself OOMs | Querying too much history | Limit the lookback window, use incremental loads |


