# Monitoring & Observability for Microsoft Fabric

**Unified Telemetry, Dashboards, Alerting & Runbooks for Fabric Workloads**

Last Updated: 2026-04-13 | Version: 1.0.0
## Table of Contents

- Overview
- Architecture
- Capacity Monitoring
- Workspace Monitoring
- Custom Dashboards
- Alerting Strategy
- Runbooks
- Casino NOC Dashboard
- Federal FISMA Continuous Monitoring
- Operational Maturity Model
- Common Issues
- References
## Overview
Monitoring and observability in Microsoft Fabric requires a unified strategy that spans capacity utilization, workspace health, pipeline execution, Spark job performance, data quality, and business SLA compliance. This guide establishes the monitoring patterns for the casino gaming and federal agency POC, covering the built-in Fabric monitoring capabilities (Capacity Metrics app, system tables, Admin Monitoring workspace), custom Real-Time Intelligence dashboards, Data Activator alerting, and operational runbooks for incident response.
### Monitoring Pillars
| Pillar | What It Answers | Fabric Tooling |
|---|---|---|
| Capacity | Are we within compute/memory limits? Will we throttle? | Capacity Metrics app, Admin APIs |
| Pipelines | Did ETL jobs succeed? How long did they take? | Pipeline monitoring, system tables |
| Spark | Are notebooks running efficiently? Any OOM failures? | Spark monitoring, Spark UI, system tables |
| Data Quality | Is data arriving on time? Are quality gates passing? | Great Expectations, custom metrics |
| Security | Are there unauthorized access attempts or policy violations? | Unified Audit Log, Purview |
| SLA | Are we meeting business freshness and availability requirements? | Custom KPI dashboards |
### Observability Maturity Levels

| Level | Description | Capabilities |
|---|---|---|
| L1 - Reactive | Respond when users report issues | Manual checking, no alerts |
| L2 - Proactive | Alerts before users notice | Threshold alerts, email notifications |
| L3 - Predictive | Anticipate issues before they occur | Trend analysis, capacity forecasting |
| L4 - Autonomous | Self-healing with automated remediation | Data Activator + Power Automate flows |
**Target for this POC:** Level 3 (Predictive) for capacity and pipeline monitoring; Level 2 (Proactive) for data quality and security.
## Architecture

### Unified Monitoring Architecture
```mermaid
flowchart TB
    subgraph Telemetry["Telemetry Sources"]
        style Telemetry fill:#E67E22,color:#fff
        T1[Capacity Metrics<br/>CU, memory, throttling]
        T2[Pipeline Runs<br/>status, duration, errors]
        T3[Spark Jobs<br/>stages, tasks, memory]
        T4[SQL Queries<br/>duration, rows, errors]
        T5[Audit Events<br/>access, permissions]
        T6[Data Quality<br/>GE results, row counts]
    end
    subgraph Collection["Collection Layer"]
        style Collection fill:#6C3483,color:#fff
        C1[Admin Monitoring<br/>Workspace<br/>System Tables]
        C2[Eventhouse<br/>KQL Database]
        C3[Lakehouse<br/>Metrics Tables]
    end
    subgraph Storage["Storage & Analytics"]
        style Storage fill:#2471A3,color:#fff
        S1["KQL Queries<br/>Real-time analysis"]
        S2["Semantic Model<br/>Historical trends"]
        S3["Delta Tables<br/>Long-term retention"]
    end
    subgraph Presentation["Dashboards & Alerts"]
        style Presentation fill:#27AE60,color:#fff
        D1[RTI Dashboard<br/>Real-time ops view]
        D2[Power BI Report<br/>Historical analysis]
        D3[Data Activator<br/>Threshold alerts]
        D4[Teams Channel<br/>Notifications]
        D5[Power Automate<br/>Automated response]
    end
    T1 --> C1
    T2 --> C1
    T3 --> C1
    T4 --> C1
    T5 --> C2
    T6 --> C3
    C1 --> S1
    C1 --> S2
    C2 --> S1
    C3 --> S3
    S1 --> D1
    S1 --> D3
    S2 --> D2
    S3 --> D2
    D3 --> D4
    D3 --> D5
```

### Data Flow for Monitoring Telemetry
```mermaid
sequenceDiagram
    participant Source as Fabric Workload
    participant System as System Tables
    participant KQL as Eventhouse (KQL)
    participant Dashboard as RTI Dashboard
    participant Activator as Data Activator
    participant Teams as Teams Channel
    Source->>System: Emit telemetry (automatic)
    System->>KQL: Ingest via Eventstream
    KQL->>Dashboard: Real-time query (30s refresh)
    KQL->>Activator: Evaluate trigger conditions
    Activator-->>Teams: Alert: Pipeline failed!
    Dashboard->>Dashboard: Auto-refresh every 30 seconds
    Note over System,KQL: Historical data retained<br/>in system tables for 30 days
```

## Capacity Monitoring
### Fabric Capacity Metrics App
The Capacity Metrics app is the primary tool for monitoring Fabric capacity utilization. Install it from AppSource for each capacity you manage.
Key Metrics to Monitor:
| Metric | Description | Warning Threshold | Critical Threshold |
|---|---|---|---|
| CU Utilization (%) | Compute Units consumed vs. available | > 70% sustained | > 90% sustained |
| Interactive CU (%) | CU consumed by interactive queries | > 80% | > 95% |
| Background CU (%) | CU consumed by background jobs (refresh, pipelines) | > 70% | > 85% |
| Throttling Events | Number of throttling events in the period | Any occurrence | Sustained throttling |
| Rejection Events | Requests rejected due to capacity exhaustion | Any occurrence | Multiple per hour |
| Overages | CU debt carried forward (smoothing window) | > 10 min carry | > 30 min carry |
| Memory (GB) | Memory utilization for Spark and SQL | > 70% | > 85% |
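The warning/critical thresholds above lend themselves to a simple evaluation function, for example inside a monitoring notebook. A minimal sketch, assuming the metric names and data shape below (they are illustrative, not a Fabric API):

```python
# Hedged sketch: classify a capacity reading against the thresholds in the
# table above. The metric keys and this function are illustrative only.
THRESHOLDS = {
    "cu_percent":     (70, 90),   # (warning, critical), per the table
    "interactive_cu": (80, 95),
    "background_cu":  (70, 85),
    "memory_percent": (70, 85),
}

def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning', or 'critical' for a single reading."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"

print(classify("cu_percent", 68))     # ok
print(classify("memory_percent", 88)) # critical
```

Note that the table's "sustained" qualifiers mean a single reading should not fire an alert on its own; sustained-window detection is sketched under Capacity Alert Thresholds below.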
### KQL Queries for Capacity Health

```kql
// Capacity utilization over the last 24 hours (15-minute windows)
FabricCapacityMetrics
| where TimeGenerated > ago(24h)
| summarize
    AvgCU = avg(CUPercent),
    MaxCU = max(CUPercent),
    ThrottleCount = countif(IsThrottled == true),
    RejectCount = countif(IsRejected == true)
    by bin(TimeGenerated, 15m)
| order by TimeGenerated desc
```

```kql
// Top CU consumers by workspace and item type
FabricCapacityMetrics
| where TimeGenerated > ago(1h)
| summarize TotalCU = sum(CUSeconds) by WorkspaceName, ItemType, ItemName
| top 20 by TotalCU desc
| project WorkspaceName, ItemType, ItemName, TotalCU_Minutes = round(TotalCU / 60.0, 2)
```

```kql
// Detect throttling patterns - when does throttling occur most?
FabricCapacityMetrics
| where TimeGenerated > ago(7d)
| where IsThrottled == true
| summarize ThrottleEvents = count() by bin(TimeGenerated, 1h), DayOfWeek = dayofweek(TimeGenerated)
| order by ThrottleEvents desc
```
### Capacity Forecasting

```kql
// 7-day CU trend for capacity planning
FabricCapacityMetrics
| where TimeGenerated > ago(7d)
| summarize DailyCU = avg(CUPercent) by bin(TimeGenerated, 1d)
| extend Day = format_datetime(TimeGenerated, 'yyyy-MM-dd')
| project Day, DailyCU
| order by Day asc
// Plot this as a line chart - if the trend is increasing, plan for a SKU upgrade
```
### Capacity Alert Thresholds
| Alert | Condition | Severity | Action |
|---|---|---|---|
| CU > 70% for 30 min | Sustained high utilization | Warning | Review top consumers |
| CU > 90% for 15 min | Near capacity limit | Critical | Pause non-essential workloads |
| Throttling detected | Any throttle event | High | Check spike source, consider scaling |
| Rejection detected | Request rejected | Critical | Immediate: scale capacity or pause jobs |
| Memory > 85% | High memory pressure | High | Review Spark configs, reduce concurrency |
| Overage > 30 min | Extended CU debt | Warning | Reschedule background jobs to off-peak |
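Most of these alerts are "sustained" conditions (for example, CU > 70% for 30 minutes), so the evaluation must track how long the breach has lasted, not just the latest sample. A hedged sketch of sustained-breach detection over timestamped samples (the data shape is assumed, not a Fabric API; Data Activator expresses the same idea declaratively):

```python
from datetime import datetime, timedelta

def sustained_breach(samples, threshold=70.0, window=timedelta(minutes=30)):
    """samples: (timestamp, cu_percent) pairs sorted by time.
    True only when readings stay above `threshold` for at least `window`."""
    breach_start = None
    for ts, cu in samples:
        if cu > threshold:
            if breach_start is None:
                breach_start = ts
            if ts - breach_start >= window:
                return True
        else:
            breach_start = None  # any sample at/below threshold resets the clock
    return False

base = datetime(2026, 4, 13, 9, 0)
high = [(base + timedelta(minutes=5 * i), 82.0) for i in range(7)]
print(sustained_breach(high))  # True: 30 minutes above 70%
```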
## Workspace Monitoring

### System Tables (Admin Monitoring Workspace)
Fabric provides system tables in the Admin Monitoring workspace that expose operational telemetry:
| System Table | Contents | Retention |
|---|---|---|
| `pipeline_runs` | Pipeline execution history | 30 days |
| `notebook_runs` | Spark notebook execution history | 30 days |
| `sql_queries` | SQL endpoint query history | 30 days |
| `capacity_metrics` | CU consumption telemetry | 30 days |
| `audit_events` | User and system actions | 90 days |
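Because the built-in retention is capped at 30-90 days, telemetry needed for longer-term trend analysis or FISMA evidence should be copied into your own Delta tables before it ages out. A hedged sketch of the age check that would drive such an archive job (retention values follow the table above; the function and margin are illustrative):

```python
from datetime import datetime, timedelta

# Retention windows (days) from the system-table list above
RETENTION_DAYS = {
    "pipeline_runs": 30,
    "notebook_runs": 30,
    "sql_queries": 30,
    "capacity_metrics": 30,
    "audit_events": 90,
}

def needs_archive(table: str, record_ts: datetime, now: datetime,
                  margin_days: int = 7) -> bool:
    """True when a record is within `margin_days` of aging out of the
    built-in retention window and should be copied to long-term storage."""
    age_days = (now - record_ts).days
    return age_days >= RETENTION_DAYS[table] - margin_days
```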
### Pipeline Monitoring

```kql
// Pipeline run summary - last 24 hours
FabricPipelineRuns
| where StartTime > ago(24h)
| summarize
    TotalRuns = count(),
    Succeeded = countif(Status == "Succeeded"),
    Failed = countif(Status == "Failed"),
    Cancelled = countif(Status == "Cancelled"),
    InProgress = countif(Status == "InProgress"),
    AvgDurationMin = round(avg(DurationInSeconds) / 60.0, 2),
    MaxDurationMin = round(max(DurationInSeconds) / 60.0, 2)
    by PipelineName, WorkspaceName
| extend SuccessRate = round(todouble(Succeeded) / TotalRuns * 100, 1)
| order by SuccessRate asc
```

```kql
// Failed pipelines with error details
FabricPipelineRuns
| where StartTime > ago(24h)
| where Status == "Failed"
| project
    PipelineName,
    WorkspaceName,
    StartTime,
    DurationMin = round(DurationInSeconds / 60.0, 2),
    ErrorMessage = tostring(parse_json(Error).message),
    ErrorCode = tostring(parse_json(Error).errorCode)
| order by StartTime desc
```
### Spark Job Monitoring

```kql
// Spark notebook performance - last 24 hours
FabricNotebookRuns
| where StartTime > ago(24h)
| summarize
    TotalRuns = count(),
    AvgDurationMin = round(avg(DurationInSeconds) / 60.0, 2),
    MaxDurationMin = round(max(DurationInSeconds) / 60.0, 2),
    FailedRuns = countif(Status == "Failed"),
    AvgSpillGB = round(avg(DiskSpillBytes) / (1024.0 * 1024 * 1024), 2)
    by NotebookName, WorkspaceName
| extend FailRate = round(todouble(FailedRuns) / TotalRuns * 100, 1)
| order by AvgDurationMin desc
```

```kql
// Spark OOM detection - jobs with excessive memory usage
FabricNotebookRuns
| where StartTime > ago(7d)
| where Status == "Failed"
| where ErrorMessage has_any ("OutOfMemoryError", "java.lang.OutOfMemoryError", "Container killed by YARN")
| project
    NotebookName,
    WorkspaceName,
    StartTime,
    ErrorMessage,
    PeakMemoryGB = round(PeakMemoryBytes / (1024.0 * 1024 * 1024), 2),
    ExecutorCount
| order by StartTime desc
```
### SQL Endpoint Monitoring

```kql
// Slow SQL queries - duration > 30 seconds
FabricSQLQueries
| where StartTime > ago(24h)
| where DurationInSeconds > 30
| project
    QueryText = substring(QueryText, 0, 200),
    WorkspaceName,
    UserEmail,
    DurationSec = round(DurationInSeconds, 1),
    RowsReturned,
    StartTime
| order by DurationSec desc
| take 20
```
### Pipeline History Dashboard Query

```kql
// 7-day pipeline success/failure trend for the dashboard
FabricPipelineRuns
| where StartTime > ago(7d)
| summarize
    Succeeded = countif(Status == "Succeeded"),
    Failed = countif(Status == "Failed")
    by bin(StartTime, 1d)
| extend SuccessRate = round(todouble(Succeeded) / (Succeeded + Failed) * 100, 1)
| project Day = format_datetime(StartTime, 'yyyy-MM-dd'), Succeeded, Failed, SuccessRate
| order by Day asc
```
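When building alert logic on top of these rates, it helps to unit test the arithmetic outside KQL. A minimal Python sketch of the same success-rate computation (the status strings are assumed to match the pipeline system table):

```python
from collections import Counter

def pipeline_summary(statuses):
    """Mirror of the KQL summarize above: run counts plus success rate
    rounded to one decimal, or None when there were no runs."""
    counts = Counter(statuses)
    total = sum(counts.values())
    succeeded = counts["Succeeded"]
    return {
        "total": total,
        "succeeded": succeeded,
        "failed": counts["Failed"],
        "success_rate": round(succeeded / total * 100, 1) if total else None,
    }

print(pipeline_summary(["Succeeded"] * 9 + ["Failed"]))  # success_rate: 90.0
```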
## Custom Dashboards

### RTI Dashboard Patterns

Real-Time Intelligence dashboards provide live operational views with auto-refresh capabilities:

### Operations Dashboard Layout
```text
┌──────────────────────────────────────────────────────────┐
│                 FABRIC OPERATIONS CENTER                 │
│           Last Refresh: 2026-04-13 14:30:00 UTC          │
├──────────────┬──────────────┬──────────────┬─────────────┤
│  CU Usage    │  Pipeline    │  Spark Jobs  │  Alerts     │
│  ▓▓▓▓ 68%    │  ✅ 142/150  │  ✅ 28/30    │  ⚠️ 2       │
│  Warning: 70%│  8 failed    │  2 OOM       │  🔴 0       │
├──────────────┴──────────────┴──────────────┴─────────────┤
│  CU Utilization (24h): area chart, 0-100% y-axis,        │
│  x-axis from 00:00 to 20:00                              │
├──────────────────────────────────────────────────────────┤
│  Pipeline Status (Last 24h)                              │
│  Pipeline Name          │ Status │ Duration │ Last Run   │
│  pl_bronze_slot_ingest  │   ✅   │ 4m 23s   │ 14:15 UTC  │
│  pl_silver_slot_cleanse │   ✅   │ 8m 12s   │ 14:20 UTC  │
│  pl_gold_kpi_compute    │   ❌   │ --       │ 14:25 UTC  │
│  pl_federal_usda_ingest │   ✅   │ 3m 44s   │ 13:00 UTC  │
└──────────────────────────────────────────────────────────┘
```
### KQL for Dashboard Tiles

```kql
// Tile 1: Current CU utilization (single value)
FabricCapacityMetrics
| where TimeGenerated > ago(5m)
| summarize CurrentCU = round(avg(CUPercent), 1)
```

```kql
// Tile 2: Pipeline success rate (last 24h)
FabricPipelineRuns
| where StartTime > ago(24h)
| summarize
    Total = count(),
    Succeeded = countif(Status == "Succeeded")
| extend SuccessRate = round(todouble(Succeeded) / Total * 100, 1)
| project SuccessRate, Succeeded, Total
```

```kql
// Tile 3: Active alerts count
FabricAlerts
| where TimeGenerated > ago(1h)
| where State == "Active"
| summarize
    Critical = countif(Severity == "Critical"),
    Warning = countif(Severity == "Warning"),
    Info = countif(Severity == "Info")
```

```kql
// Tile 4: Data freshness - last ingestion timestamp per table
FabricDeltaTableMetrics
| summarize LastUpdate = max(LastModifiedTimestamp) by TableName, LakehouseName
| extend FreshnessMinutes = datetime_diff('minute', now(), LastUpdate)
| extend Status = iff(FreshnessMinutes > 60, "⚠️ Stale", "✅ Fresh")
| order by FreshnessMinutes desc
```
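The Tile 4 freshness logic is worth unit testing before wiring it into a dashboard or alert. A hedged Python equivalent (the 60-minute staleness cutoff comes from the KQL above; everything else is illustrative):

```python
from datetime import datetime, timedelta

def freshness_status(last_update: datetime, now: datetime,
                     stale_after: timedelta = timedelta(minutes=60)):
    """Return (minutes since last update, 'Fresh' or 'Stale'),
    mirroring the Tile 4 query."""
    elapsed = now - last_update
    minutes = int(elapsed.total_seconds() // 60)
    return minutes, ("Stale" if elapsed > stale_after else "Fresh")

now = datetime(2026, 4, 13, 14, 30)
print(freshness_status(now - timedelta(minutes=90), now))  # (90, 'Stale')
```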
### Power BI Historical Dashboard
For historical trend analysis, create a Power BI report connected to system tables via DirectQuery or Import:
| Page | Content | Refresh |
|---|---|---|
| Executive Summary | KPI cards (uptime, success rate, CU avg), 30-day trend | Daily |
| Capacity | CU utilization heatmap, throttling events, top consumers | Hourly |
| Pipelines | Success/failure trend, duration distribution, error categories | Hourly |
| Spark | Job duration trends, OOM occurrences, spill detection | Hourly |
| Data Quality | GE suite pass rates, row count trends, quarantine volume | Daily |
| Security | Access anomalies, permission changes, failed logins | Daily |
## Alerting Strategy

### Data Activator Integration
Data Activator triggers are the primary alerting mechanism for Fabric-native monitoring:
```mermaid
flowchart LR
    subgraph Sources["Monitoring Sources"]
        style Sources fill:#E67E22,color:#fff
        S1[Capacity Metrics]
        S2[Pipeline Status]
        S3[Spark Failures]
        S4[Data Quality]
        S5[Security Events]
    end
    subgraph Activator["Data Activator"]
        style Activator fill:#6C3483,color:#fff
        R1[reflex_platform_capacity]
        R2[reflex_platform_pipelines]
        R3[reflex_platform_spark]
        R4[reflex_data_quality]
        R5[reflex_security_audit]
    end
    subgraph Actions["Alert Actions"]
        style Actions fill:#27AE60,color:#fff
        A1[Teams: #fabric-alerts]
        A2[Email: on-call team]
        A3[Power Automate: auto-remediate]
        A4[PagerDuty: critical escalation]
    end
    S1 --> R1
    S2 --> R2
    S3 --> R3
    S4 --> R4
    S5 --> R5
    R1 -->|Warning| A1
    R1 -->|Critical| A2
    R2 -->|Failed| A1
    R2 -->|Failed 3x| A4
    R3 -->|OOM| A1
    R4 -->|DQ < 70%| A1
    R5 -->|Unauthorized| A4
```

### Severity Classification
| Severity | Description | Response Time | Notification Channel | Escalation |
|---|---|---|---|---|
| P1 - Critical | Data loss, security breach, complete outage | 15 min | PagerDuty + Teams + Email + Phone | Immediate to VP |
| P2 - High | Pipeline failure, capacity throttling, compliance SLA miss | 1 hour | Teams + Email | After 2 hours to Manager |
| P3 - Medium | Slow queries, data freshness warning, non-critical job failure | 4 hours | Teams | After 24 hours to Lead |
| P4 - Low | Informational, optimization opportunities | Next business day | Email digest | None |
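Capturing these routing rules in one lookup keeps Reflex actions, Power Automate flows, and custom scripts consistent. A hedged sketch (channel names and SLA minutes follow the table; the structure itself is illustrative):

```python
# Severity -> (notification channels, response SLA in minutes), per the table.
# "Next business day" is approximated as 24 hours.
ROUTING = {
    "P1": (["pagerduty", "teams", "email", "phone"], 15),
    "P2": (["teams", "email"], 60),
    "P3": (["teams"], 240),
    "P4": (["email_digest"], 24 * 60),
}

def route(severity: str):
    """Return (channels, response_sla_minutes) for an alert severity."""
    return ROUTING[severity]

channels, sla = route("P2")
print(channels, sla)  # ['teams', 'email'] 60
```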
### Alert Configuration Matrix
| Alert | Severity | Condition | Cooldown | Channel |
|---|---|---|---|---|
| Capacity > 90% | P2 | CU > 90% for 15 min | 30 min | Teams + Email |
| Capacity throttling | P2 | Any throttle event | 15 min | Teams + Email |
| Request rejection | P1 | Any rejection event | None | PagerDuty + Teams |
| Pipeline failed | P3 | Status = Failed | Per pipeline | Teams |
| Pipeline failed 3x consecutive | P2 | 3 consecutive failures | 1 hour | Teams + Email |
| Bronze ingestion > 2h late | P3 | No new rows in 2 hours | 2 hours | Teams |
| Gold refresh > 4h late | P2 | Refresh SLA breach | 4 hours | Teams + Email |
| Spark OOM | P3 | OutOfMemory error | Per notebook | Teams |
| DQ score < 70% | P2 | Great Expectations suite failure | 1 hour | Teams + Email |
| Unauthorized access (403) | P1 | Audit log 403 event | None | PagerDuty + Security |
| Permission change | P4 | Workspace role modified | 24 hours | Email digest |
| Quarantine > 1000 records | P3 | Quarantine table threshold | 4 hours | Teams |
### Alert Fatigue Prevention
| Strategy | Implementation |
|---|---|
| Cooldown periods | Suppress duplicate alerts for 15-60 min after first firing |
| Severity-based routing | Only P1/P2 go to PagerDuty; P3/P4 go to Teams only |
| Business hours | P3/P4 alerts suppressed outside business hours (except casino 24/7) |
| Aggregation | Batch P4 alerts into daily digest emails |
| Auto-acknowledge | Auto-close alerts when condition resolves |
| Threshold tuning | Review alert thresholds monthly; adjust based on false positive rate |
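The cooldown strategy above reduces to a small piece of state per alert name. A hedged sketch of a cooldown gate (Data Activator has its own cooldown settings; this is only a model for custom alerting code):

```python
from datetime import datetime, timedelta

class CooldownGate:
    """Suppress duplicate alerts for a cooldown period after the first
    firing - one of the fatigue-prevention strategies above."""
    def __init__(self, cooldown: timedelta = timedelta(minutes=30)):
        self.cooldown = cooldown
        self._last_fired = {}

    def should_fire(self, alert_name: str, now: datetime) -> bool:
        last = self._last_fired.get(alert_name)
        if last is not None and now - last < self.cooldown:
            return False  # still inside the cooldown window; suppress
        self._last_fired[alert_name] = now
        return True

gate = CooldownGate()
t0 = datetime(2026, 4, 13, 9, 0)
print(gate.should_fire("cu_high", t0))                          # True
print(gate.should_fire("cu_high", t0 + timedelta(minutes=10)))  # False
```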
### Escalation Matrix
```mermaid
flowchart TB
    subgraph Alert["Alert Fired"]
        style Alert fill:#E67E22,color:#fff
        A[Alert triggered]
    end
    subgraph L1["Level 1: On-Call Engineer"]
        style L1 fill:#27AE60,color:#fff
        L1A[Acknowledge within SLA]
        L1B{Resolved?}
    end
    subgraph L2["Level 2: Team Lead"]
        style L2 fill:#2471A3,color:#fff
        L2A[Escalated after timeout]
        L2B{Resolved?}
    end
    subgraph L3["Level 3: Management"]
        style L3 fill:#6C3483,color:#fff
        L3A[VP / Director notification]
        L3B[War room activated]
    end
    A --> L1A
    L1A --> L1B
    L1B -->|Yes| Done[✅ Resolved + Postmortem]
    L1B -->|No - after 2h| L2A
    L2A --> L2B
    L2B -->|Yes| Done
    L2B -->|No - after 4h| L3A
    L3A --> L3B
    L3B --> Done
```

## Runbooks
### Runbook 1: Capacity Throttled

**Trigger:** CU utilization > 90% for 15+ minutes with throttling events.
**Severity:** P2 (High)

**Steps:**
1. **Assess**: Open the Capacity Metrics app. Identify the top CU consumers by workspace and item type.
2. **Immediate relief**: Pause non-essential background jobs:
   - Cancel any running ad-hoc notebooks
   - Pause scheduled refreshes for non-critical semantic models
   - Defer any data generation or load-testing jobs
3. **Root cause**: Determine whether the spike is expected (batch window) or anomalous:
   - Check whether a large backfill or historical reload is running
   - Check whether a Spark notebook has a Cartesian join or data explosion
   - Check whether multiple Power BI reports triggered simultaneous refreshes
4. **Scale (if needed)**: Temporarily scale the capacity up.
5. **Resolve**: After the spike passes:
   - Scale back to the original SKU
   - Document the incident
   - Adjust scheduling to prevent recurrence
6. **Postmortem**: Within 24 hours, document:
   - Timeline of events
   - Root cause
   - Impact (queries throttled, users affected)
   - Preventive measures
### Runbook 2: Pipeline Failed

**Trigger:** Pipeline status = Failed.
**Severity:** P3 (single failure) or P2 (3+ consecutive failures)

**Steps:**
1. **Assess**: Check the pipeline run history.
2. **Identify the failing activity**: Determine which activity in the pipeline failed:
   - Copy activity: source connectivity? Schema drift? Timeout?
   - Notebook activity: Spark error? Data quality gate? OOM?
   - Dataflow activity: refresh timeout? Memory limit?
3. **Common fixes**:

| Error | Likely Cause | Fix |
|---|---|---|
| Connection timeout | Source system down | Verify the source, retry with backoff |
| Schema mismatch | Source schema changed | Update the schema mapping, re-run |
| OutOfMemoryError | Data volume spike | Increase Spark executor memory |
| Authentication error | Credential expired | Rotate the SPN secret or refresh the token |
| Throttling | Capacity overloaded | Reschedule to off-peak |

4. **Retry**: Re-run the failed pipeline:
   - Transient error: retry immediately
   - Schema or data issue: fix, then re-run
   - Capacity issue: wait for capacity availability
5. **Notify**: If the failure affects an SLA:
   - Post to the #fabric-alerts Teams channel
   - Update the on-call log
   - Notify downstream consumers
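The "retry with backoff" fix for transient errors can be automated rather than done by hand. A hedged sketch of exponential backoff around a re-run call, where `run` and `is_transient` are placeholder callables standing in for your pipeline-trigger API and error classifier:

```python
import time

def retry_with_backoff(run, is_transient, max_attempts=3, base_delay_s=60):
    """Call `run()`; on a transient failure, sleep base_delay_s * 2**attempt
    seconds and retry, up to max_attempts. Non-transient errors re-raise
    immediately so schema/data issues are fixed by hand, per the runbook."""
    for attempt in range(max_attempts):
        try:
            return run()
        except Exception as exc:
            if not is_transient(exc) or attempt == max_attempts - 1:
                raise
            time.sleep(base_delay_s * 2 ** attempt)
```

A connection timeout would retry after 60 s, then 120 s; a schema mismatch would surface immediately.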
### Runbook 3: Spark OutOfMemory (OOM)

**Trigger:** Notebook fails with `java.lang.OutOfMemoryError` or `Container killed by YARN`.
**Severity:** P3 (Medium)

**Steps:**
1. **Assess**: Identify which notebook and dataset failed.
2. **Diagnose**: Common OOM causes:
   - Collect to driver: `df.collect()` on a large DataFrame
   - Broadcast join: broadcasting a table that is too large
   - Skewed partition: one partition holds significantly more data than the rest
   - Cartesian join: a missing join condition creates an exploding dataset
   - Window function without partition: `Window.orderBy()` without `.partitionBy()`
3. **Fix (Spark configuration)**:

```python
# Increase executor memory
spark.conf.set("spark.executor.memory", "16g")
spark.conf.set("spark.driver.memory", "8g")

# Disable broadcast for large tables
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

# Enable adaptive query execution for skew handling
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```

4. **Fix (code changes)**:

```python
# Replace .collect() with .toPandas() on small datasets,
# or use .limit(N).collect()

# Add partition columns to window functions
# Before (OOM risk): Window.orderBy("timestamp")
# After (safe):      Window.partitionBy("machine_id").orderBy("timestamp")

# Repartition skewed data
df = df.repartition(200, "machine_id")
```

5. **Verify**: Re-run and monitor the Spark UI for memory utilization.
### Runbook 4: Semantic Model Refresh Timeout

**Trigger:** Power BI semantic model refresh exceeds the 2-hour timeout.
**Severity:** P3 (Medium); P2 if it is the Gold KPI model used by executives.

**Steps:**
1. **Assess**: Check the refresh history in the Fabric portal:
   - Go to the workspace > semantic model > Refresh history
   - Note the duration, rows processed, and tables refreshed
2. **Diagnose**: Common timeout causes:
   - Direct Lake fallback to DirectQuery (table too large for memory)
   - Inefficient DAX measures computed during refresh
   - Too many tables in a full refresh (use incremental refresh)
   - Source lakehouse running OPTIMIZE/VACUUM during the refresh
3. **Fix (Direct Lake optimization)**:
4. **Fix (incremental refresh)**:
5. **Monitor**: After the fix, confirm the refresh completes within the SLA.
### Runbook Template

```markdown
# Runbook: {Issue Name}

**Trigger:** {What fires this runbook}
**Severity:** {P1/P2/P3/P4}
**Owner:** {Team or role responsible}
**Last Tested:** {Date of last runbook drill}

## Assessment
1. {First diagnostic step}
2. {KQL query or UI check}

## Diagnosis
| Symptom | Likely Cause | Fix |
|---------|-------------|-----|
| {symptom} | {cause} | {fix} |

## Remediation
1. {Step-by-step fix}
2. {Verification}

## Escalation
- If not resolved in {N} hours, escalate to {role}
- If customer-impacting, notify {stakeholder}

## Post-Incident
- [ ] Document timeline
- [ ] Identify root cause
- [ ] Implement preventive measure
- [ ] Update this runbook if needed
```
## Casino NOC Dashboard

### 24/7 Network Operations Center
Casino gaming operations run 24/7/365. The NOC dashboard must provide at-a-glance status for the gaming floor data platform:
```mermaid
flowchart TB
    subgraph NOC["Casino NOC Dashboard"]
        style NOC fill:#6C3483,color:#fff
        subgraph Status["System Status"]
            HS[Health Score: 98.5%<br/>🟢 Healthy]
        end
        subgraph Pipelines["ETL Status"]
            PS[Bronze: ✅ 4m ago<br/>Silver: ✅ 8m ago<br/>Gold: ✅ 15m ago]
        end
        subgraph Compliance["Compliance"]
            CS[CTR Filed: 12 today<br/>SAR Review: 3 pending<br/>W-2G: 47 today]
        end
        subgraph Floor["Floor Metrics (Live)"]
            FM[Active Machines: 2,847<br/>Occupancy: 78%<br/>Coin-In/Hr: $1.2M]
        end
    end
    subgraph Alerts["Active Alerts"]
        style Alerts fill:#E67E22,color:#fff
        A1["⚠️ Gold refresh delayed<br/>15 min over SLA"]
        A2["ℹ️ USDA API rate limit<br/>Retry in 60s"]
    end
    NOC --> Alerts
```

### Casino-Specific Monitoring KPIs
| KPI | Source | Refresh Rate | SLA |
|---|---|---|---|
| Slot coin-in per hour | Eventstream â Eventhouse | Real-time (30s) | < 1 min latency |
| Active machine count | Bronze ingestion count | Every 5 minutes | < 10 min freshness |
| CTR filing count (today) | Gold compliance table | Every 15 minutes | < 30 min freshness |
| SAR pending review count | Gold compliance table | Every 15 minutes | < 30 min freshness |
| Hold percentage (running) | Gold KPI table | Every 15 minutes | < 30 min freshness |
| Floor occupancy (%) | Eventstream telemetry | Real-time (30s) | < 1 min latency |
| Pipeline health (last 1h) | System tables | Every 5 minutes | 100% visibility |
| Data freshness (all layers) | Delta table metadata | Every 10 minutes | < 15 min freshness |
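The freshness SLAs in this table can be checked in a single pass each time the NOC dashboard refreshes. A hedged sketch (the KPI keys and observed-freshness input are illustrative names, not Fabric identifiers; the SLA minutes follow the table):

```python
# KPI -> freshness SLA in minutes, following the table above (names assumed)
KPI_SLA_MINUTES = {
    "slot_coin_in_latency": 1,
    "active_machine_count": 10,
    "ctr_filing_count": 30,
    "hold_percentage": 30,
    "data_freshness_all_layers": 15,
}

def sla_breaches(observed_minutes: dict) -> list:
    """Return the KPIs whose observed freshness exceeds their SLA."""
    return sorted(kpi for kpi, minutes in observed_minutes.items()
                  if minutes > KPI_SLA_MINUTES.get(kpi, float("inf")))

print(sla_breaches({"ctr_filing_count": 45, "active_machine_count": 4}))
# ['ctr_filing_count']
```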
### Casino Alert Customization
| Alert | Casino-Specific Condition | Why |
|---|---|---|
| Slot telemetry gap > 5 min | No events from a floor section | Machine down or network issue |
| CTR threshold breach | Transaction > $10,000 detected | Immediate filing required |
| SAR pattern detected | Multiple $8K-$9.9K transactions | Structuring investigation |
| Gold KPI stale > 30 min | KPI dashboard not updated | Gaming commission reporting SLA |
| Compliance report generation failed | Daily compliance extract failed | Regulatory filing deadline |
> **Casino 24/7 Operations:** All casino alerts fire 24/7 with no business-hours suppression. The on-call rotation follows the casino's shift pattern (day/swing/graveyard). Compliance alerts (CTR, SAR) always escalate to the Compliance Manager regardless of shift.
## Federal FISMA Continuous Monitoring

### Continuous Monitoring Framework
FISMA requires continuous monitoring of security controls. Fabric monitoring maps to NIST SP 800-137 continuous monitoring:
```mermaid
flowchart TB
    subgraph NIST["NIST SP 800-137 Process"]
        style NIST fill:#6C3483,color:#fff
        N1[Define Strategy]
        N2[Establish Program]
        N3[Implement Program]
        N4[Analyze & Report]
        N5[Respond to Findings]
        N6[Review & Update]
    end
    subgraph Fabric["Fabric Implementation"]
        style Fabric fill:#2471A3,color:#fff
        F1[Monitoring requirements<br/>per agency ATO]
        F2[System tables +<br/>Eventhouse + Activator]
        F3[KQL queries +<br/>RTI dashboards]
        F4[Monthly POA&M +<br/>Weekly security report]
        F5[Runbook execution +<br/>Remediation tracking]
        F6[Quarterly control<br/>assessment review]
    end
    N1 --> F1
    N2 --> F2
    N3 --> F3
    N4 --> F4
    N5 --> F5
    N6 --> F6
```

### FISMA Control Monitoring Mapping
| NIST Control | Control Name | Monitoring Method | Frequency |
|---|---|---|---|
| AC-2(4) | Automated Audit Actions | Audit log query for account create/modify/delete | Real-time |
| AC-6(9) | Log Use of Privileged Functions | Audit log query for Admin actions | Real-time |
| AU-6 | Audit Record Review | Weekly security report from Unified Audit Log | Weekly |
| CA-7 | Continuous Monitoring | System tables + RTI dashboards | Continuous |
| CM-3 | Configuration Change Control | Git integration + pipeline deploy logs | Per change |
| IR-5 | Incident Monitoring | Data Activator alerts + incident log | Real-time |
| RA-5 | Vulnerability Monitoring | Security scan results, Defender alerts | Daily |
| SI-4 | System Monitoring | Capacity metrics, access logs, anomaly detection | Continuous |
### Federal Security Monitoring KQL

```kql
// FISMA AU-6: Weekly audit record review
// Export for POA&M reporting
FabricAuditLogs
| where TimeGenerated > ago(7d)
| where ActionCategory in ("UserAccess", "PermissionChange", "DataExport", "AdminAction")
| summarize EventCount = count() by
    ActionCategory,
    ActionName = tostring(parse_json(Properties).operationName),
    UserEmail = tostring(parse_json(Properties).userEmail)
| order by EventCount desc
```

```kql
// FISMA AC-2(4): Account management audit
FabricAuditLogs
| where TimeGenerated > ago(24h)
| where ActionName in ("AddWorkspaceUser", "RemoveWorkspaceUser", "UpdateWorkspaceAccess")
| project
    Timestamp = TimeGenerated,
    Action = ActionName,
    TargetUser = tostring(parse_json(Properties).targetUserEmail),
    PerformedBy = tostring(parse_json(Properties).performedBy),
    WorkspaceName = tostring(parse_json(Properties).workspaceName),
    NewRole = tostring(parse_json(Properties).role)
| order by Timestamp desc
```

```kql
// FISMA SI-4: Anomaly detection - unusual data export volume
FabricAuditLogs
| where TimeGenerated > ago(24h)
| where ActionName in ("ExportData", "DownloadReport", "ExportVisualData")
| summarize ExportCount = count(), TotalRows = sum(tolong(parse_json(Properties).rowCount))
    by UserEmail = tostring(parse_json(Properties).userEmail)
| where ExportCount > 50 or TotalRows > 1000000
| project UserEmail, ExportCount, TotalRows, Alert = "Unusual export volume"
```
### Federal Monthly Compliance Report Template

```python
# Generate the FISMA continuous monitoring monthly report
from datetime import datetime

report_period = datetime.utcnow().strftime("%Y-%m")

sections = {
    "Access Control Events": """
FabricAuditLogs
| where TimeGenerated > ago(30d)
| where ActionCategory == "UserAccess"
| summarize count() by ActionName
""",
    "Permission Changes": """
FabricAuditLogs
| where TimeGenerated > ago(30d)
| where ActionCategory == "PermissionChange"
| summarize count() by ActionName, WorkspaceName
""",
    "Data Export Activity": """
FabricAuditLogs
| where TimeGenerated > ago(30d)
| where ActionName has "Export"
| summarize count() by UserEmail, ActionName
""",
    "Capacity Health": """
FabricCapacityMetrics
| where TimeGenerated > ago(30d)
| summarize AvgCU = avg(CUPercent), MaxCU = max(CUPercent),
            ThrottleCount = countif(IsThrottled)
| project AvgCU, MaxCU, ThrottleCount
""",
    "Pipeline Reliability": """
FabricPipelineRuns
| where StartTime > ago(30d)
| summarize Total = count(), Failed = countif(Status == "Failed")
| extend Reliability = round((Total - Failed) * 100.0 / Total, 2)
""",
}

print(f"FISMA Continuous Monitoring Report - {report_period}")
print(f"Generated: {datetime.utcnow().isoformat()}")
print("=" * 60)
for section, query in sections.items():
    print(f"\n## {section}")
    print(f"Query:\n{query.strip()}")
    # TODO: execute each KQL query against the Eventhouse and format results
```
> **Agency-Specific ATO Requirements:** Each federal agency has its own Authority to Operate (ATO) with specific monitoring requirements, so the monitoring strategy must be customized per agency. The queries above provide a baseline; work with each agency's ISSO (Information System Security Officer) to confirm the required monitoring controls.
## Operational Maturity Model

### Assessment Checklist
| Capability | L1 Reactive | L2 Proactive | L3 Predictive | L4 Autonomous |
|---|---|---|---|---|
| Capacity monitoring | Manual check | Threshold alerts | Trend forecasting | Auto-scale |
| Pipeline monitoring | Check on failure report | Alert on failure | Predict failures from trends | Auto-retry with backoff |
| Spark monitoring | Review after OOM | Alert on OOM | Memory trend alerts | Auto-tune Spark config |
| Data quality | User reports bad data | GE suite alerts | DQ trend dashboard | Auto-quarantine + notify |
| Security | Annual audit review | Real-time access alerts | Anomaly detection | Auto-revoke suspicious |
| Incident response | Ad-hoc troubleshooting | Runbooks available | Runbooks tested monthly | Auto-remediation flows |
| Reporting | On-demand | Scheduled weekly | Real-time dashboard | Self-service |
### Maturity Progression Plan

**Phase 1 (Weeks 1-2): Foundation**

- [x] Install the Capacity Metrics app
- [x] Enable system tables in the Admin Monitoring workspace
- [x] Create basic KQL queries for pipeline/Spark monitoring
- [x] Set up a Teams channel for alerts

**Phase 2 (Weeks 3-4): Proactive Alerting**

- [x] Configure Data Activator Reflex items
- [x] Implement alert severity classification
- [x] Create initial runbooks for the top 4 scenarios
- [x] Set up the on-call rotation

**Phase 3 (Month 2): Custom Dashboards**

- [x] Build the RTI operations dashboard
- [x] Build the Power BI historical trends report
- [x] Implement data freshness monitoring
- [x] Create domain-specific dashboards (Casino NOC, Federal FISMA)

**Phase 4 (Month 3): Predictive + Automation**

- [ ] Implement capacity trend forecasting
- [ ] Add anomaly detection for data quality
- [ ] Create Power Automate auto-remediation flows
- [ ] Conduct the first runbook drill
## Common Issues
| Issue | Symptom | Root Cause | Resolution |
|---|---|---|---|
| System tables empty | No data in monitoring queries | Admin Monitoring workspace not enabled | Enable via Fabric Admin portal |
| Alert storms | Hundreds of alerts in minutes | Missing cooldown periods | Add 15-60 min cooldown per alert |
| Dashboard stale | RTI dashboard shows old data | Eventstream disconnected | Reconnect Eventstream source |
| False positives | Alert fires during maintenance | No maintenance window suppression | Add maintenance window to alert rules |
| KQL query timeout | Complex query exceeds 30s limit | Scanning too much data | Add time filters, reduce scope |
| Capacity spike at midnight | Unexpected CU burst | All scheduled refreshes at same time | Stagger refresh schedules |
| Missing audit events | Security report incomplete | Audit log search limited to 90 days | Export to long-term storage (ADLS Gen2) |
| OOM during monitoring notebook | Monitoring notebook itself OOMs | Querying too much history | Limit lookback window, use incremental |
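For the "capacity spike at midnight" issue, staggering is simple arithmetic: spread the refresh start times evenly across an off-peak window instead of scheduling every job at 00:00. A minimal sketch of the offset calculation (the window length is an assumption; pick one that fits your batch schedule):

```python
def staggered_offsets(n_jobs: int, window_minutes: int = 120) -> list:
    """Evenly spaced start offsets (minutes after the window opens)
    for n_jobs scheduled refreshes."""
    if n_jobs <= 0:
        return []
    step = window_minutes / n_jobs
    return [round(i * step) for i in range(n_jobs)]

print(staggered_offsets(6))  # six refreshes across 2 hours: [0, 20, 40, 60, 80, 100]
```

With a 00:00 window open, the six refreshes would start at 00:00, 00:20, 00:40, and so on.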
## References

### Microsoft Documentation
- Fabric Capacity Metrics app
- Admin Monitoring workspace
- System tables in Fabric
- Data Activator overview
- Real-Time Intelligence dashboards
- KQL overview
- Unified audit log in Fabric
### Compliance & Frameworks
- NIST SP 800-137 â Continuous Monitoring
- FISMA Implementation Guide
- NIGC MICS Compliance
- FedRAMP Continuous Monitoring Strategy Guide
### Operational Best Practices

### Related Documents
- Error Handling & Monitoring â Pipeline error tracking and retry patterns
- Alerting & Data Activator â Detailed Data Activator configuration
- Performance & Parallelism â Spark optimization and capacity management
- SQL Audit Logs Compliance â SQL-level audit logging
- Real-Time Intelligence â RTI dashboard and Eventhouse setup