📡 Tutorial 17: Monitoring and Observability for Microsoft Fabric
Last Updated: 2026-04-15 | Version: 2.0 | Status: ✅ Final | Maintainer: Documentation Team
Third-party references — publicly sourced, good-faith comparison
This page references non-Microsoft products and services. That information is drawn from each vendor's publicly available documentation and is offered for honest, good-faith comparison only. This is a personal project written from a Microsoft Fabric and Azure perspective; it does not claim expertise in, or authority over, any third-party product, and nothing here is an official statement by, or endorsed by, those vendors. Capabilities, pricing, and features change often — always verify against the vendor's current official documentation. Where a third-party offering is the stronger choice, we say so plainly.
| Difficulty | Intermediate |
| Time | 120 minutes |
| Focus | Monitoring, Alerting, Diagnostics |
Progress Tracker
+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+
| 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 |
| SETUP | BRONZE | SILVER | GOLD | RT | PBI | PIPES | GOV | MIRROR | AI/ML |
+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+
| [x] | [x] | [x] | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+
+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+
| 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|TERADATA | SAS | CI/CD |PLANNING | SECURITY| TESTING | PERF | MONITOR | SHARING | CAPSTONE|
+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+
| [x] | [x] | [x] | [x] | [x] | [x] | [x] | [*] | [ ] | [ ] |
+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+
^
YOU ARE HERE
| Navigation | |
|---|---|
| Previous | 16-Performance Optimization |
| Next | 18-Data Sharing |
📋 Overview
This tutorial provides comprehensive guidance on implementing monitoring and observability for Microsoft Fabric environments. You will learn how to monitor capacity utilization, track pipeline and notebook performance, configure alerts, analyze logs, and build custom monitoring dashboards for your casino analytics platform.
Effective monitoring ensures:
- Proactive issue detection before user impact
- Capacity planning based on actual utilization
- Compliance auditing for gaming regulations
- Cost optimization through resource tracking
- SLA management with measurable metrics
🎯 Learning Objectives
By the end of this tutorial, you will be able to:
- Configure Microsoft Fabric Capacity Metrics app
- Set up Azure Monitor integration for Fabric workloads
- Create Log Analytics workspace for centralized logging
- Build custom monitoring dashboards in Power BI
- Monitor key metrics: CU utilization, query duration, refresh failures
- Configure Azure Monitor alerts with action groups
- Analyze diagnostic logs and activity logs
- Use Spark UI for notebook debugging and optimization
- Implement real-time monitoring for Eventstreams
- Create incident response playbooks for casino operations
Monitoring Architecture

Source: Microsoft Fabric Capacity Metrics App
flowchart TB
subgraph Fabric["Microsoft Fabric Workloads"]
CAP[Fabric Capacity]
WS[Workspaces]
PIPE[Pipelines]
NB[Notebooks]
SEM[Semantic Models]
ES[Eventstreams]
EH[Eventhouse]
end
subgraph Collection["Data Collection"]
DIAG[Diagnostic Settings]
ACTIVITY[Activity Logs]
CAP_METRICS[Capacity Metrics]
SPARK_UI[Spark UI / Livy]
end
subgraph Storage["Log Storage"]
LA[Log Analytics<br/>Workspace]
SA[Storage Account<br/>Archive]
EH_LOGS[Event Hub<br/>Streaming]
end
subgraph Analysis["Analysis & Alerting"]
KQL[KQL Queries]
ALERTS[Azure Monitor<br/>Alerts]
ACTION[Action Groups]
DASHBOARD[Power BI<br/>Dashboards]
end
subgraph Response["Response"]
EMAIL[Email]
TEAMS[Teams]
SMS[SMS]
LOGIC[Logic Apps]
RUNBOOK[Azure Runbooks]
end
Fabric --> DIAG
Fabric --> ACTIVITY
CAP --> CAP_METRICS
NB --> SPARK_UI
DIAG --> LA
ACTIVITY --> LA
CAP_METRICS --> LA
DIAG --> SA
DIAG --> EH_LOGS
LA --> KQL
KQL --> ALERTS
KQL --> DASHBOARD
ALERTS --> ACTION
ACTION --> EMAIL
ACTION --> TEAMS
ACTION --> SMS
ACTION --> LOGIC
LOGIC --> RUNBOOK
style CAP fill:#0078D4,color:#fff
style LA fill:#68217A,color:#fff
style ALERTS fill:#E74C3C,color:#fff
style DASHBOARD fill:#F2C811,color:#000
Prerequisites
Before starting this tutorial, ensure you have:
- Completed Tutorials 00-06 (Foundation through Pipelines)
- Fabric capacity with Admin access
- Azure subscription with Contributor role
- Log Analytics workspace (or permissions to create one)
- Power BI Pro or PPU license for dashboard creation
- Basic understanding of KQL (Kusto Query Language)
Note: Some monitoring features require Fabric capacity Admin permissions. Coordinate with your tenant administrator if needed.
Step 1: Microsoft Fabric Capacity Metrics App
1.1 Install the Capacity Metrics App
The Microsoft Fabric Capacity Metrics app provides out-of-the-box monitoring for capacity utilization.
- Navigate to Microsoft AppSource
- Search for "Microsoft Fabric Capacity Metrics"
- Click Get it now
- Select your workspace for installation
- Connect to your Fabric capacity
Alternative Installation via Fabric Portal:
- Open Fabric Portal
- Navigate to Settings > Admin portal
- Select Capacity settings
- Click on your capacity
- Select Metrics app > Install
1.2 Key Metrics in the App
| Metric | Description | Threshold |
|---|---|---|
| CU Utilization % | Capacity units (CUs) consumed as a share of available capacity | Warning: >70%, Critical: >90% |
| Throttling Events | Times capacity was throttled | Any occurrence needs investigation |
| Overload Minutes | Minutes in overloaded state | Should be minimal |
| Interactive vs Background | Workload distribution | Balance based on priority |
| Timepoint Analysis | CU usage over time | Identify peak usage patterns |
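The warning and critical bands in the table can be expressed as a small helper, which is handy for unit-testing alert logic before wiring it into a dashboard. This is a minimal sketch: the function name and the `OK`/`WARNING`/`CRITICAL` labels are illustrative assumptions, and only the 70% and 90% boundaries come from the table.

```python
def classify_cu_utilization(percent: float) -> str:
    """Map a CU utilization percentage to the table's warning/critical bands."""
    if percent < 0:
        raise ValueError("utilization cannot be negative")
    # Check the stricter band first so values above 90% are not reported as WARNING.
    if percent > 90:
        return "CRITICAL"
    if percent > 70:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    for sample in (45.0, 70.0, 82.5, 95.0):
        print(f"{sample:5.1f}% -> {classify_cu_utilization(sample)}")
```

Both thresholds are strict (`>`), matching the table, so a reading of exactly 70% or 90% stays in the lower band.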
1.3 Capacity Metrics Dashboard Overview
flowchart LR
subgraph Overview["Capacity Overview"]
CU[CU Utilization]
THROTTLE[Throttle Status]
ITEMS[Active Items]
end
subgraph Breakdown["Workload Breakdown"]
INTERACTIVE[Interactive<br/>Queries, Reports]
BACKGROUND[Background<br/>Refreshes, Pipelines]
SPARK[Spark<br/>Notebooks]
end
subgraph Trends["Trend Analysis"]
HOURLY[Hourly Patterns]
DAILY[Daily Patterns]
WEEKLY[Weekly Patterns]
end
Overview --> Breakdown --> Trends
style CU fill:#E74C3C,color:#fff
style THROTTLE fill:#F39C12,color:#fff
style ITEMS fill:#27AE60,color:#fff
1.4 Understanding CU Consumption
Casino POC - Typical CU Distribution:
+----------------------------------+
| CU CONSUMPTION |
+----------------------------------+
| Semantic Model Refresh 35% |
| Pipeline Execution 25% |
| Notebook/Spark Jobs 20% |
| Interactive Queries 15% |
| Real-Time Analytics 5% |
+----------------------------------+
| Total Daily CU Budget 100% |
+----------------------------------+
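To turn the percentage split above into absolute numbers, multiply each share by the capacity's daily CU-second allowance. The sketch below assumes a hypothetical 64-CU capacity (for example an F64) and treats the shares as a fixed planning budget; real consumption is smoothed and will not follow the split exactly.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Shares copied from the distribution above (percent of the daily CU budget).
WORKLOAD_SHARES = {
    "Semantic Model Refresh": 35,
    "Pipeline Execution": 25,
    "Notebook/Spark Jobs": 20,
    "Interactive Queries": 15,
    "Real-Time Analytics": 5,
}


def daily_cu_seconds(capacity_units: int) -> int:
    """A capacity with N CUs provides N CU-seconds for every second of the day."""
    return capacity_units * SECONDS_PER_DAY


def budget_by_workload(capacity_units: int) -> dict[str, float]:
    """Split the daily CU-second allowance according to WORKLOAD_SHARES."""
    total = daily_cu_seconds(capacity_units)
    return {name: total * share / 100 for name, share in WORKLOAD_SHARES.items()}


if __name__ == "__main__":
    # 64 is an assumed capacity size, not a value taken from the tutorial.
    for name, cu_seconds in budget_by_workload(64).items():
        print(f"{name:<24} {cu_seconds:>12,.0f} CU-seconds/day")
```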
Step 2: Azure Monitor Integration
2.1 Create Log Analytics Workspace
Azure Portal:
- Navigate to Azure Portal > Create a resource
- Search for "Log Analytics Workspace"
- Click Create
Configuration:
| Setting | Value |
|---|---|
| Subscription | Your Azure subscription |
| Resource Group | rg-fabric-monitoring |
| Name | law-fabric-casino-poc |
| Region | Same as Fabric capacity |
| Pricing Tier | Per GB (recommended) |
Azure CLI:
# Create resource group
az group create \
--name rg-fabric-monitoring \
--location eastus2
# Create Log Analytics workspace
az monitor log-analytics workspace create \
--resource-group rg-fabric-monitoring \
--workspace-name law-fabric-casino-poc \
--location eastus2 \
--sku PerGB2018 \
--retention-time 90
PowerShell:
# Create resource group
New-AzResourceGroup -Name "rg-fabric-monitoring" -Location "eastus2"
# Create Log Analytics workspace
New-AzOperationalInsightsWorkspace `
-ResourceGroupName "rg-fabric-monitoring" `
-Name "law-fabric-casino-poc" `
-Location "eastus2" `
-Sku "PerGB2018" `
-RetentionInDays 90
2.2 Configure Diagnostic Settings for Fabric
Enable Diagnostics via Azure Portal:
- Navigate to your Fabric Capacity in Azure Portal
- Select Diagnostic settings under Monitoring
- Click + Add diagnostic setting
Diagnostic Setting Configuration:
| Setting | Value |
|---|---|
| Name | ds-fabric-to-loganalytics |
| Logs | All categories enabled |
| Metrics | AllMetrics |
| Destination | Log Analytics workspace |
| Workspace | law-fabric-casino-poc |
Available Diagnostic Categories:
| Category | Description |
|---|---|
| Engine | Query engine operations |
| AllMetrics | Capacity-level metrics |
| Audit | Security and access events |
Azure CLI:
# Get Fabric capacity resource ID
CAPACITY_ID=$(az resource show \
--resource-group rg-fabric \
--resource-type "Microsoft.Fabric/capacities" \
--name "fabric-casino-poc" \
--query id -o tsv)
# Get Log Analytics workspace ID
LA_WORKSPACE_ID=$(az monitor log-analytics workspace show \
--resource-group rg-fabric-monitoring \
--workspace-name law-fabric-casino-poc \
--query id -o tsv)
# Create diagnostic setting
az monitor diagnostic-settings create \
--name "ds-fabric-to-loganalytics" \
--resource "$CAPACITY_ID" \
--workspace "$LA_WORKSPACE_ID" \
--logs '[{"category": "Engine", "enabled": true}, {"category": "Audit", "enabled": true}]' \
--metrics '[{"category": "AllMetrics", "enabled": true}]'
2.3 Configure Power BI Audit Logs
Power BI audit logs provide detailed tracking of user activities:
- Navigate to Microsoft 365 Admin Center
- Go to Settings > Org settings > Services
- Select Power BI
- Enable Audit logs
Export to Log Analytics via Sentinel or Defender:
# Configure audit log export using Microsoft Graph
$params = @{
displayName = "Export Power BI Audit Logs"
logTypes = @("PowerBIActivity")
destination = @{
logAnalyticsWorkspaceId = "/subscriptions/{sub-id}/resourceGroups/rg-fabric-monitoring/providers/Microsoft.OperationalInsights/workspaces/law-fabric-casino-poc"
}
}
Step 3: KQL Queries for Log Analytics
3.1 Capacity Utilization Queries
Query: CU Utilization Over Time
// Fabric Capacity CU Utilization - Last 24 Hours
FabricCapacityMetrics
| where TimeGenerated >= ago(24h)
| where MetricName == "CUUtilization"
| summarize
AvgUtilization = avg(MetricValue),
MaxUtilization = max(MetricValue),
MinUtilization = min(MetricValue)
by bin(TimeGenerated, 1h)
| order by TimeGenerated desc
| render timechart
with (title="CU Utilization (24 Hours)")
Query: Throttling Events
// Identify Throttling Events
FabricCapacityMetrics
| where TimeGenerated >= ago(7d)
| where MetricName == "ThrottlingCount"
| where MetricValue > 0
| summarize
ThrottleCount = sum(MetricValue),
ThrottleMinutes = count()
by bin(TimeGenerated, 1h)
| order by ThrottleMinutes desc
| take 50
Query: Overloaded Periods
// Capacity Overload Analysis
FabricCapacityMetrics
| where TimeGenerated >= ago(7d)
| where MetricName == "OverloadMinutes"
| where MetricValue > 0
| summarize
TotalOverloadMinutes = sum(MetricValue)
by bin(TimeGenerated, 1d), WorkspaceName
| order by TotalOverloadMinutes desc
3.2 Pipeline Monitoring Queries
Query: Pipeline Run Status
// Pipeline Execution Summary - Last 7 Days
FabricPipelineRuns
| where TimeGenerated >= ago(7d)
| summarize
SuccessCount = countif(Status == "Succeeded"),
FailedCount = countif(Status == "Failed"),
InProgressCount = countif(Status == "InProgress"),
CancelledCount = countif(Status == "Cancelled"),
TotalRuns = count(),
AvgDurationSeconds = avg(DurationSeconds)
by PipelineName
| extend
SuccessRate = round(100.0 * SuccessCount / TotalRuns, 2),
AvgDurationMinutes = round(AvgDurationSeconds / 60, 2)
| project
PipelineName,
TotalRuns,
SuccessRate,
FailedCount,
AvgDurationMinutes
| order by FailedCount desc
Query: Failed Pipeline Analysis
// Failed Pipeline Details
FabricPipelineRuns
| where TimeGenerated >= ago(24h)
| where Status == "Failed"
| project
TimeGenerated,
PipelineName,
WorkspaceName,
ErrorMessage,
DurationSeconds,
ActivityName
| order by TimeGenerated desc
Query: Long-Running Pipelines
// Identify Long-Running Pipelines
FabricPipelineRuns
| where TimeGenerated >= ago(7d)
| where Status == "Succeeded"
| where DurationSeconds > 1800 // > 30 minutes
| summarize
AvgDuration = avg(DurationSeconds),
MaxDuration = max(DurationSeconds),
RunCount = count()
by PipelineName
| extend
AvgDurationMinutes = round(AvgDuration / 60, 1),
MaxDurationMinutes = round(MaxDuration / 60, 1)
| order by MaxDurationMinutes desc
3.3 Semantic Model Refresh Queries
Query: Refresh Status Summary
// Semantic Model Refresh Summary
PowerBIDatasetRefresh
| where TimeGenerated >= ago(7d)
| summarize
SuccessCount = countif(RefreshStatus == "Completed"),
FailedCount = countif(RefreshStatus == "Failed"),
TotalRefreshes = count(),
AvgDurationMinutes = avg(DurationSeconds) / 60
by DatasetName, WorkspaceName
| extend SuccessRate = round(100.0 * SuccessCount / TotalRefreshes, 2)
| order by FailedCount desc
Query: Refresh Failure Details
// Failed Refresh Analysis
PowerBIDatasetRefresh
| where TimeGenerated >= ago(24h)
| where RefreshStatus == "Failed"
| project
TimeGenerated,
DatasetName,
WorkspaceName,
ErrorMessage,
RefreshType,
DurationSeconds
| order by TimeGenerated desc
3.4 Query Performance Monitoring
Query: Slow Queries Analysis
// Identify Slow DAX/SQL Queries
FabricQueryEvents
| where TimeGenerated >= ago(24h)
| where DurationMs > 10000 // > 10 seconds
| summarize
Count = count(),
AvgDurationMs = avg(DurationMs),
MaxDurationMs = max(DurationMs)
by QueryType, DatasetName, UserPrincipalName
| order by MaxDurationMs desc
| take 50
Query: Query Error Analysis
// Query Errors by Type
FabricQueryEvents
| where TimeGenerated >= ago(7d)
| where QueryStatus == "Error"
| summarize ErrorCount = count() by ErrorCode, ErrorMessage
| order by ErrorCount desc
| take 20
3.5 Casino-Specific Compliance Queries
Query: Compliance Job Monitoring
// Compliance Pipeline Execution Status
FabricPipelineRuns
// Look back further than 24 hours so pipelines that have stopped running still appear
| where TimeGenerated >= ago(7d)
| where PipelineName has_any ("compliance", "ctr", "sar", "aml", "kyc")
// Keep the most recent run per pipeline, with its status and duration
| summarize arg_max(TimeGenerated, Status, DurationSeconds) by PipelineName
| extend LastRun = TimeGenerated
| extend HoursSinceLastRun = datetime_diff('hour', now(), LastRun)
| extend AlertLevel = case(
Status != "Succeeded", "CRITICAL",
HoursSinceLastRun > 24, "WARNING",
"OK"
)
| project
PipelineName,
LastRun,
Status,
HoursSinceLastRun,
AlertLevel
| order by AlertLevel asc
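The alert-level rules in the compliance query are easy to get subtly wrong, so it can help to mirror them in a small function with unit tests. The sketch below is an illustrative assumption rather than part of the platform: the function name is hypothetical, and it measures elapsed time directly, so its result can differ from KQL's `datetime_diff('hour', ...)` by up to an hour near the boundary.

```python
from datetime import datetime, timedelta, timezone


def compliance_alert_level(status: str, last_run: datetime, now: datetime) -> str:
    """Return CRITICAL, WARNING, or OK for the latest run of a compliance pipeline."""
    # A failed (or otherwise unsuccessful) latest run always wins.
    if status != "Succeeded":
        return "CRITICAL"
    hours_since_run = (now - last_run).total_seconds() / 3600
    # A successful but stale run means the daily job may have been skipped.
    if hours_since_run > 24:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    now = datetime(2026, 4, 15, 12, 0, tzinfo=timezone.utc)
    print(compliance_alert_level("Succeeded", now - timedelta(hours=3), now))
    print(compliance_alert_level("Succeeded", now - timedelta(hours=30), now))
    print(compliance_alert_level("Failed", now - timedelta(hours=1), now))
```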
Query: CTR Threshold Monitoring
// Cash Transactions Approaching CTR Threshold ($10,000)
// This would query a custom table if you're logging transaction summaries
FabricCustomLogs
| where TimeGenerated >= ago(1h)
| where LogType == "TransactionAggregate"
| where TotalCashIn >= 8000 or TotalCashOut >= 8000
| project
TimeGenerated,
PlayerId,
TotalCashIn,
TotalCashOut,
AlertLevel = case(
// A CTR is required for cash totals of more than $10,000
TotalCashIn > 10000 or TotalCashOut > 10000, "CTR_REQUIRED",
TotalCashIn >= 9000 or TotalCashOut >= 9000, "APPROACHING_CTR",
"MONITOR"
)
| order by TimeGenerated desc
Step 4: Configure Azure Monitor Alerts
4.1 Create Action Groups
Action groups define who gets notified and how.
Azure Portal:
- Navigate to Azure Monitor > Alerts
- Click Action groups > + Create
Action Group Configuration:
| Setting | Value |
|---|---|
| Subscription | Your subscription |
| Resource Group | rg-fabric-monitoring |
| Action Group Name | ag-fabric-alerts |
| Display Name | Fabric Alerts |
Notification Types:
| Type | Configuration | Use Case |
|---|---|---|
| Email | fabric-team@casino.com | All alerts |
| SMS | +1-555-123-4567 | Critical only |
| Azure Mobile App | Team members | All alerts |
| Voice Call | On-call number | Critical only |
Actions (Automations):
| Action | Purpose |
|---|---|
| Logic App | Custom notification workflows |
| Azure Function | Automated remediation |
| Webhook | Integration with ticketing systems |
| ITSM | ServiceNow/Jira integration |
PowerShell - Create Action Group:
# Create action group with email and SMS receivers (cmdlets from Az.Monitor 5.0 or later)
$emailReceiver = New-AzActionGroupEmailReceiverObject `
-Name "FabricTeamEmail" `
-EmailAddress "fabric-team@casino.com"
$smsReceiver = New-AzActionGroupSmsReceiverObject `
-Name "OnCallSMS" `
-CountryCode "1" `
-PhoneNumber "5551234567"
New-AzActionGroup `
-ResourceGroupName "rg-fabric-monitoring" `
-Name "ag-fabric-alerts" `
-GroupShortName "FabricAG" `
-EmailReceiver $emailReceiver `
-SmsReceiver $smsReceiver `
-Location "Global"
4.2 Create Alert Rules
Alert 1: High CU Utilization
# Alert when CU utilization exceeds 85%
$condition = New-AzMetricAlertRuleV2Criteria `
-MetricName "CUUtilization" `
-Operator GreaterThan `
-Threshold 85 `
-TimeAggregation Average
Add-AzMetricAlertRuleV2 `
-Name "alert-high-cu-utilization" `
-ResourceGroupName "rg-fabric-monitoring" `
-WindowSize 00:15:00 `
-Frequency 00:05:00 `
-TargetResourceId "/subscriptions/{sub}/resourceGroups/rg-fabric/providers/Microsoft.Fabric/capacities/fabric-casino-poc" `
-Condition $condition `
-ActionGroupId "/subscriptions/{sub}/resourceGroups/rg-fabric-monitoring/providers/Microsoft.Insights/actionGroups/ag-fabric-alerts" `
-Severity 2 `
-Description "Fabric capacity CU utilization exceeds 85%"
Alert 2: Pipeline Failure
// Log Analytics Alert Query - Pipeline Failures
FabricPipelineRuns
| where TimeGenerated >= ago(15m)
| where Status == "Failed"
| summarize FailureCount = count() by PipelineName
| where FailureCount > 0
Alert Configuration:
| Setting | Value |
|---|---|
| Alert rule name | alert-pipeline-failure |
| Severity | 1 - Error |
| Evaluation frequency | Every 5 minutes |
| Lookback period | Last 15 minutes |
| Threshold | Greater than 0 |
Alert 3: Semantic Model Refresh Failure
// Log Analytics Alert Query - Refresh Failures
PowerBIDatasetRefresh
| where TimeGenerated >= ago(15m)
| where RefreshStatus == "Failed"
| project DatasetName, WorkspaceName, ErrorMessage, TimeGenerated
Alert 4: Capacity Throttling
// Log Analytics Alert Query - Throttling Detected
FabricCapacityMetrics
| where TimeGenerated >= ago(5m)
| where MetricName == "ThrottlingCount"
| where MetricValue > 0
| summarize ThrottleEvents = sum(MetricValue)
| where ThrottleEvents > 0
4.3 Alert Rule Summary
| Alert Name | Severity | Condition | Action |
|---|---|---|---|
| High CU Utilization | Warning (2) | CU > 85% for 15 min | Email team |
| Critical CU Utilization | Critical (0) | CU > 95% for 5 min | Email + SMS + Teams |
| Pipeline Failure | Error (1) | Any failure | Email team |
| Compliance Pipeline Failure | Critical (0) | CTR/SAR pipeline fail | Email + SMS + Call |
| Refresh Failure | Warning (2) | Model refresh fails | Email team |
| Throttling Detected | Warning (2) | Any throttle event | Email team |
| Storage Growth | Info (3) | Storage > 80% | Email admin |
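If notifications are dispatched from custom code (for example a Logic App or function called by the action group), the routing in the summary table can be held as data rather than scattered through conditionals. The following is a sketch under that assumption; the channel identifiers and the email fallback for unknown alerts are illustrative choices, not something the table specifies.

```python
# Routing copied from the Action column of the alert rule summary.
ALERT_ROUTES = {
    "High CU Utilization": ("email",),
    "Critical CU Utilization": ("email", "sms", "teams"),
    "Pipeline Failure": ("email",),
    "Compliance Pipeline Failure": ("email", "sms", "voice"),
    "Refresh Failure": ("email",),
    "Throttling Detected": ("email",),
    "Storage Growth": ("email",),
}


def channels_for(alert_name: str) -> tuple[str, ...]:
    """Return the notification channels for an alert.

    Unknown alert names fall back to email so that a newly added rule is
    never dropped silently (an assumption made for this sketch).
    """
    return ALERT_ROUTES.get(alert_name, ("email",))


if __name__ == "__main__":
    for name in ("Compliance Pipeline Failure", "Storage Growth", "New Rule"):
        print(f"{name}: {', '.join(channels_for(name))}")
```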
Step 5: Custom Monitoring Dashboard
5.1 Power BI Dashboard Architecture
flowchart TB
subgraph DataSources["Data Sources"]
LA[Log Analytics<br/>Direct Query]
CAP_APP[Capacity Metrics<br/>App Data]
CUSTOM[Custom Logs<br/>Delta Tables]
end
subgraph Dashboard["Monitoring Dashboard"]
subgraph Overview["Executive Overview"]
HEALTH[System Health]
SLA[SLA Status]
ALERTS[Active Alerts]
end
subgraph Capacity["Capacity Metrics"]
CU[CU Utilization]
THROTTLE[Throttle Status]
TREND[Usage Trends]
end
subgraph Workloads["Workload Status"]
PIPE_STATUS[Pipeline Status]
REFRESH[Refresh Status]
NOTEBOOKS[Notebook Runs]
end
subgraph Casino["Casino Operations"]
COMPLIANCE[Compliance Jobs]
REALTIME[Real-Time Floor]
PLAYER[Player Analytics]
end
end
DataSources --> Dashboard
LA --> Overview
LA --> Workloads
CAP_APP --> Capacity
CUSTOM --> Casino
style LA fill:#68217A,color:#fff
style HEALTH fill:#27AE60,color:#fff
style COMPLIANCE fill:#E74C3C,color:#fff
5.2 Dashboard DAX Measures
Measure: Pipeline Success Rate
Pipeline Success Rate =
VAR SuccessCount =
CALCULATE(
COUNTROWS('PipelineRuns'),
'PipelineRuns'[Status] = "Succeeded"
)
VAR TotalCount = COUNTROWS('PipelineRuns')
RETURN
DIVIDE(SuccessCount, TotalCount, 0) * 100
Measure: Average CU Utilization
Measure: Compliance Job Health
Compliance Health Status =
VAR LastCTRRun =
CALCULATE(
MAX('PipelineRuns'[EndTime]),
'PipelineRuns'[PipelineName] = "CTR_Daily_Job"
)
VAR HoursSinceRun =
DATEDIFF(LastCTRRun, NOW(), HOUR)
VAR LastStatus =
CALCULATE(
SELECTEDVALUE('PipelineRuns'[Status]),
'PipelineRuns'[EndTime] = LastCTRRun
)
RETURN
SWITCH(TRUE(),
LastStatus <> "Succeeded", "CRITICAL",
HoursSinceRun > 24, "WARNING",
"HEALTHY"
)
Measure: SLA Compliance Percentage
SLA Compliance % =
VAR OnTimeRefreshes =
CALCULATE(
COUNTROWS('RefreshHistory'),
'RefreshHistory'[DurationMinutes] <= 30,
'RefreshHistory'[Status] = "Completed"
)
VAR TotalRefreshes =
COUNTROWS('RefreshHistory')
RETURN
DIVIDE(OnTimeRefreshes, TotalRefreshes, 0) * 100
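A quick worked example makes the SLA measure concrete: a refresh counts as on time only if it completed and took 30 minutes or less, and the denominator is every refresh, including failures. The Python sketch below reproduces that arithmetic on made-up sample rows; the field names are assumptions chosen to resemble the `RefreshHistory` columns.

```python
def sla_compliance_pct(refreshes: list[dict], max_minutes: float = 30) -> float:
    """Percentage of refreshes that completed within max_minutes.

    Returns 0.0 for an empty list, mirroring DIVIDE(..., 0) in the measure.
    """
    if not refreshes:
        return 0.0
    on_time = sum(
        1
        for r in refreshes
        if r["status"] == "Completed" and r["duration_minutes"] <= max_minutes
    )
    return on_time / len(refreshes) * 100


if __name__ == "__main__":
    # Illustrative sample data, not real refresh history.
    sample = [
        {"status": "Completed", "duration_minutes": 12},
        {"status": "Completed", "duration_minutes": 30},
        {"status": "Completed", "duration_minutes": 41},  # late
        {"status": "Failed", "duration_minutes": 5},      # fast but failed
    ]
    print(f"SLA compliance: {sla_compliance_pct(sample):.1f}%")
```

In this sample two of four refreshes qualify, so the measure evaluates to 50%; the late refresh and the failed refresh both count against the SLA.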
5.3 Key Visuals for Dashboard
Visual 1: System Health Card
+----------------------------------+
| SYSTEM HEALTH |
| HEALTHY |
| 99.8% |
| Uptime Last 30 Days |
+----------------------------------+
Visual 2: CU Utilization Gauge
+----------------------------------+
| CU UTILIZATION |
| |
| [=====> ] 62% |
| Current / 100% Capacity |
| |
| Avg: 58% Peak: 87% |
+----------------------------------+
Visual 3: Pipeline Status Matrix
+----------------------------------------------+
| PIPELINE STATUS (Last 24 Hours) |
+----------------------------------------------+
| Pipeline | Runs | Success | Failed |
+----------------------------------------------+
| Bronze Ingestion | 24 | 24 | 0 |
| Silver Transform | 12 | 11 | 1 |
| Gold Aggregation | 6 | 6 | 0 |
| CTR Compliance | 1 | 1 | 0 |
| Player Analytics | 4 | 4 | 0 |
+----------------------------------------------+
| TOTAL | 47 | 46 | 1 |
+----------------------------------------------+
Step 6: Spark UI for Notebook Debugging
6.1 Accessing Spark UI
From Fabric Portal:
- Open your Notebook
- Run a Spark job
- Click on the Spark Jobs tab at the bottom
- Click View Spark UI for detailed analysis
Spark UI Sections:
| Tab | Purpose | Key Metrics |
|---|---|---|
| Jobs | Overall job status | Duration, stages, tasks |
| Stages | Stage-level details | Shuffle read/write, task distribution |
| Storage | Cached RDDs/DataFrames | Memory usage, fraction cached |
| Environment | Spark configuration | Settings, classpath |
| Executors | Executor health | Memory, cores, task stats |
| SQL | Query execution plans | DAG visualization |
6.2 Common Performance Issues in Spark UI
Issue 1: Data Skew
Symptoms in Spark UI:
- One task takes much longer than others
- Uneven data distribution across partitions
Solution:
# Repartition by a more uniform key
df = df.repartition(200, "player_id")
# Or use salting for hot keys
from pyspark.sql.functions import concat, lit, col, rand
df_salted = df.withColumn(
"salted_key",
concat(col("hot_key"), lit("_"), (rand() * 10).cast("int"))
)
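To see why salting helps, the following pure-Python simulation assigns keys to partitions with a deterministic hash and compares the largest partition before and after salting. It is a sketch only: `zlib.crc32` stands in for Spark's partitioner, and the row counts, key names, and partition count are invented for illustration.

```python
import random
import zlib
from collections import Counter


def partition_of(key: str, num_partitions: int) -> int:
    """Deterministic stand-in for a hash partitioner."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def max_partition_load(keys: list[str], num_partitions: int) -> int:
    """Number of rows in the most heavily loaded partition."""
    loads = Counter(partition_of(k, num_partitions) for k in keys)
    return max(loads.values())


def salt(key: str, buckets: int, rng: random.Random) -> str:
    """Append a random bucket number, as the salted_key column does."""
    return f"{key}_{rng.randrange(buckets)}"


rng = random.Random(42)  # fixed seed so the run is repeatable
# 9,000 rows for one hot key plus 1,000 rows spread over 100 ordinary keys.
keys = ["VIP_PLAYER"] * 9000 + [f"player_{i % 100}" for i in range(1000)]
salted = [salt(k, 10, rng) for k in keys]

unsalted_max = max_partition_load(keys, 8)
salted_max = max_partition_load(salted, 8)

if __name__ == "__main__":
    print(f"largest partition without salting: {unsalted_max} rows")
    print(f"largest partition with salting:    {salted_max} rows")
```

Without salting, every row for the hot key lands in one partition. Note that when the salted key is used in a join, the other side of the join generally has to be expanded with every salt value so that matching rows still meet.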
Issue 2: Shuffle Spill to Disk
Symptoms in Spark UI:
- "Shuffle Spill (Disk)" > 0 in stage details
- Slow stage completion
Solution:
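# Raise the shuffle partition count so each partition is smaller
spark.conf.set("spark.sql.shuffle.partitions", 400)
# Executor memory is fixed once the session starts, so it cannot be changed
# with spark.conf.set here; set it in the session or environment configuration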
Issue 3: Small Files Problem
# Compact files before processing
df.coalesce(100).write.mode("overwrite").parquet("path")
# Or use Delta Lake OPTIMIZE
spark.sql("OPTIMIZE lakehouse.bronze_slot_telemetry")
6.3 Notebook Performance Logging
Add Performance Metrics to Notebooks:
# Cell: Performance Tracking Setup
import time
from datetime import datetime
class PerformanceTracker:
    """Track notebook cell execution performance."""

    def __init__(self):
        self.metrics = []

    def start_cell(self, cell_name: str):
        """Start timing a cell."""
        self.current_cell = cell_name
        self.start_time = time.time()

    def end_cell(self, rows_processed: int = 0):
        """End timing and record metrics."""
        duration = time.time() - self.start_time
        self.metrics.append({
            "cell_name": self.current_cell,
            "duration_seconds": round(duration, 2),
            "rows_processed": rows_processed,
            "timestamp": datetime.now().isoformat()
        })
        print(f"Cell '{self.current_cell}' completed in {duration:.2f}s ({rows_processed:,} rows)")

    def summary(self):
        """Print performance summary."""
        total_duration = sum(m["duration_seconds"] for m in self.metrics)
        total_rows = sum(m["rows_processed"] for m in self.metrics)
        print("\n" + "="*50)
        print("NOTEBOOK PERFORMANCE SUMMARY")
        print("="*50)
        for m in self.metrics:
            print(f" {m['cell_name']}: {m['duration_seconds']}s")
        print("-"*50)
        print(f" TOTAL: {total_duration:.2f}s, {total_rows:,} rows")
        print("="*50)
        return self.metrics
# Initialize tracker
perf = PerformanceTracker()
# Cell: Example Usage
perf.start_cell("Read Bronze Data")
df = spark.read.table("bronze.slot_telemetry")
row_count = df.count()
perf.end_cell(row_count)
# Cell: Final Summary
perf.summary()
# Optionally log to Delta table for historical tracking
from pyspark.sql import Row
metrics_df = spark.createDataFrame([Row(**m) for m in perf.metrics])
metrics_df.write.mode("append").saveAsTable("monitoring.notebook_performance")
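One limitation of paired start/end calls is that a cell which raises never records its timing. A context manager is a possible alternative that records the duration even on failure. The sketch below is independent of the tracker class above; `timed_cell` and its record fields are hypothetical names chosen for this example.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_cell(metrics: list, cell_name: str):
    """Time the wrapped block and append a record to metrics, even if it raises."""
    record = {"cell_name": cell_name, "rows_processed": 0}
    start = time.perf_counter()
    try:
        # The caller can update the record, e.g. to set rows_processed.
        yield record
    finally:
        record["duration_seconds"] = round(time.perf_counter() - start, 2)
        metrics.append(record)


metrics = []
with timed_cell(metrics, "Example Step") as cell:
    # Stand-in for real work such as df.count()
    cell["rows_processed"] = sum(1 for _ in range(1000))

if __name__ == "__main__":
    for m in metrics:
        print(f"{m['cell_name']}: {m['duration_seconds']}s, {m['rows_processed']:,} rows")
```

Because the record is appended in the `finally` clause, a failing step still shows up in the metrics list, which makes it easier to see where a notebook run stopped.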
Step 7: Pipeline Monitoring
7.1 Pipeline Run History Analysis
Access Pipeline Monitoring:
- Open your Data Pipeline in Fabric
- Click View run history
- Analyze individual runs
Key Metrics to Track:
| Metric | Description | Target |
|---|---|---|
| Duration | Total pipeline runtime | Within SLA |
| Activity Count | Number of activities | Stable |
| Data Moved | Rows/bytes processed | Consistent |
| Retry Count | Automatic retries | Zero |
| Error Rate | Failed activities | < 1% |
7.2 Pipeline Alerting Script
PowerShell: Check Pipeline Status
<#
.SYNOPSIS
Monitor Fabric pipeline runs and send alerts.
#>
param(
[string]$WorkspaceId,
[string]$PipelineName,
[int]$LookbackMinutes = 60
)
# Authenticate
Connect-PowerBIServiceAccount
# Get pipeline runs
$uri = "https://api.fabric.microsoft.com/v1/workspaces/$WorkspaceId/pipelines"
$headers = @{
# Get-PowerBIAccessToken -AsString already returns the token prefixed with "Bearer "
"Authorization" = (Get-PowerBIAccessToken -AsString)
}
$pipelines = Invoke-RestMethod -Uri $uri -Headers $headers
$targetPipeline = $pipelines.value | Where-Object { $_.displayName -eq $PipelineName }
if (-not $targetPipeline) {
Write-Error "Pipeline '$PipelineName' not found"
exit 1
}
# Get recent runs
$runsUri = "https://api.fabric.microsoft.com/v1/workspaces/$WorkspaceId/pipelines/$($targetPipeline.id)/runs"
$runs = Invoke-RestMethod -Uri $runsUri -Headers $headers
# Analyze results
$recentRuns = $runs.value | Where-Object {
$runTime = [DateTime]::Parse($_.startTime)
$runTime -gt (Get-Date).AddMinutes(-$LookbackMinutes)
}
$failedRuns = $recentRuns | Where-Object { $_.status -eq "Failed" }
if ($failedRuns.Count -gt 0) {
Write-Host "ALERT: $($failedRuns.Count) failed pipeline runs in last $LookbackMinutes minutes!"
foreach ($run in $failedRuns) {
Write-Host " - Run ID: $($run.id)"
Write-Host " Status: $($run.status)"
Write-Host " Start: $($run.startTime)"
Write-Host " Error: $($run.error.message)"
}
# Send alert (integrate with your notification system)
# Send-AlertEmail -Subject "Pipeline Failure: $PipelineName" -Body $alertBody
exit 1
}
Write-Host "All pipeline runs successful in last $LookbackMinutes minutes"
exit 0
Step 8: Real-Time Monitoring for Eventstreams
8.1 Eventstream Health Monitoring
sequenceDiagram
participant Source as Event Source<br/>(IoT/Kafka)
participant ES as Eventstream
participant EH as Eventhouse
participant ALERT as Alert System
loop Every 5 minutes
ALERT->>ES: Check ingestion metrics
ES-->>ALERT: Events/second, lag, errors
alt Lag > threshold
ALERT->>ALERT: Trigger lag alert
end
alt Error rate > 0
ALERT->>ALERT: Trigger error alert
end
end
Source->>ES: Stream events
ES->>EH: Ingest to KQL
EH-->>ALERT: Query metrics
8.2 Eventstream Monitoring KQL
Query: Ingestion Lag Monitoring
// Monitor real-time ingestion lag
EventstreamMetrics
| where TimeGenerated >= ago(1h)
| where MetricName == "IngestionLatencyMs"
| summarize
AvgLatencyMs = avg(MetricValue),
P95LatencyMs = percentile(MetricValue, 95),
MaxLatencyMs = max(MetricValue)
by bin(TimeGenerated, 5m), StreamName
| where P95LatencyMs > 5000 // Alert if P95 > 5 seconds
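The query alerts on P95 rather than the average because a few slow events can hide behind a healthy-looking mean. The sketch below illustrates this with made-up latencies and a nearest-rank percentile. The nearest-rank definition is a choice made for this example, and KQL's `percentile()` is an estimate that may return slightly different values.

```python
import math


def percentile_nearest_rank(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Illustrative 5-minute window: 18 fast events and 2 very slow ones.
latencies_ms = [120] * 18 + [6000, 9000]
avg_latency_ms = sum(latencies_ms) / len(latencies_ms)
p95_latency_ms = percentile_nearest_rank(latencies_ms, 95)

if __name__ == "__main__":
    print(f"average: {avg_latency_ms:.0f} ms")
    print(f"p95:     {p95_latency_ms:.0f} ms")
```

Here the average is 858 ms, well under the 5,000 ms threshold, while the P95 is 6,000 ms and would trigger the alert.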
Query: Throughput Analysis
// Events per second throughput
EventstreamMetrics
| where TimeGenerated >= ago(1h)
| where MetricName == "EventsReceived"
| summarize EventsPerSecond = sum(MetricValue) / 60 by bin(TimeGenerated, 1m), StreamName
| render timechart
8.3 Casino Floor Real-Time Alerts
Slot Machine Anomaly Detection:
// Detect unusual slot machine patterns (potential malfunction or fraud)
SlotMachineTelemetry
| where ingestion_time() >= ago(5m)
| summarize
TotalSpins = count(),
TotalWin = sum(win_amount),
WinRate = countif(win_amount > 0) * 100.0 / count()
by machine_id, TimeWindow = bin(ingestion_time(), 1m)
| where WinRate > 80 // Unusually high win rate
or TotalSpins > 200 // Excessive spin rate (potential bot)
or TotalWin > 10000 // Large payouts
| project
TimeWindow,
machine_id,
TotalSpins,
WinRate,
TotalWin,
AlertType = case(
WinRate > 80, "HIGH_WIN_RATE",
TotalSpins > 200, "EXCESSIVE_SPINS",
TotalWin > 10000, "LARGE_PAYOUT",
"UNKNOWN"
)
Player Behavior Monitoring:
// Monitor player session patterns for compliance
PlayerSessions
| where ingestion_time() >= ago(1h)
| summarize
SessionDuration = datetime_diff('minute', max(event_time), min(event_time)),
TotalWagered = sum(wager_amount),
TotalWon = sum(win_amount)
by player_id, session_id
| where TotalWagered > 5000 // High wagering activity
or SessionDuration > 240 // Session > 4 hours (responsible gaming)
| project
player_id,
session_id,
SessionDuration,
TotalWagered,
TotalWon,
AlertType = case(
TotalWagered > 10000, "HIGH_WAGERING",
SessionDuration > 360, "EXTENDED_SESSION",
"MONITOR"
)
Step 9: Incident Response Playbooks
9.1 Capacity Overload Playbook
flowchart TD
DETECT[Alert: CU > 90%] --> ASSESS{Duration?}
ASSESS -->|< 5 min| WAIT[Monitor for 5 min]
WAIT --> ASSESS
ASSESS -->|> 5 min| IDENTIFY[Identify Top Consumers]
IDENTIFY --> CHECK{Critical Workload?}
CHECK -->|Yes| SCALE[Consider Scaling Up]
CHECK -->|No| THROTTLE[Throttle Non-Critical]
THROTTLE --> OPTIONS{Options}
OPTIONS --> PAUSE_REFRESH[Pause Model Refreshes]
OPTIONS --> DELAY_PIPELINE[Delay Pipelines]
OPTIONS --> LIMIT_QUERIES[Limit Interactive]
SCALE --> DOCUMENT[Document Incident]
PAUSE_REFRESH --> DOCUMENT
DELAY_PIPELINE --> DOCUMENT
LIMIT_QUERIES --> DOCUMENT
DOCUMENT --> POSTMORTEM[Post-Incident Review]
style DETECT fill:#E74C3C,color:#fff
style SCALE fill:#27AE60,color:#fff
style DOCUMENT fill:#3498DB,color:#fff
Step-by-Step Capacity Overload Response:
## INCIDENT RESPONSE: Capacity Overload
### 1. DETECTION (0-5 minutes)
- [ ] Acknowledge alert
- [ ] Open Capacity Metrics app
- [ ] Identify current CU utilization percentage
- [ ] Check for throttling events
### 2. ASSESSMENT (5-10 minutes)
- [ ] Identify top 3 consuming workloads
- [ ] Determine if workloads are critical
- [ ] Check scheduled vs. ad-hoc activities
- [ ] Review recent deployments/changes
### 3. MITIGATION (10-30 minutes)
**Option A: Non-Critical Workloads**
- [ ] Pause non-critical semantic model refreshes
- [ ] Delay scheduled pipeline runs
- [ ] Contact heavy report users
**Option B: Critical Workloads**
- [ ] Initiate capacity scale-up (if available)
- [ ] Engage capacity administrator
- [ ] Communicate to stakeholders
### 4. RESOLUTION
- [ ] Verify CU utilization returns to normal
- [ ] Re-enable paused workloads gradually
- [ ] Confirm no data loss or failures
### 5. POST-INCIDENT
- [ ] Document timeline and actions
- [ ] Identify root cause
- [ ] Update capacity planning
- [ ] Review alert thresholds
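The first branches of the capacity overload flowchart can also be captured as a small decision function, for instance to suggest a next step in an alert notification. This is a sketch, not an automated remediation: the function name and return labels are hypothetical, and treating exactly 5 minutes as sustained is an assumption because the flowchart leaves that boundary open.

```python
def overload_next_step(cu_percent: float, minutes_over: float, critical_workload: bool) -> str:
    """Suggest the next playbook step for a capacity overload alert."""
    # The playbook is triggered only when utilization is above 90%.
    if cu_percent <= 90:
        return "NO_ACTION"
    # Short spikes are watched before anyone intervenes.
    if minutes_over < 5:
        return "MONITOR"
    # Sustained overload: protect critical workloads, throttle the rest.
    if critical_workload:
        return "CONSIDER_SCALE_UP"
    return "THROTTLE_NON_CRITICAL"


if __name__ == "__main__":
    print(overload_next_step(93.0, 2, critical_workload=False))
    print(overload_next_step(93.0, 12, critical_workload=True))
    print(overload_next_step(93.0, 12, critical_workload=False))
```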
9.2 Compliance Job Failure Playbook
Critical: CTR/SAR Pipeline Failure
## INCIDENT RESPONSE: Compliance Pipeline Failure
### SEVERITY: CRITICAL
### SLA: Must be resolved within 4 hours
### 1. IMMEDIATE ACTIONS (0-15 minutes)
- [ ] Acknowledge alert immediately
- [ ] Page compliance team lead
- [ ] Open pipeline run history
- [ ] Identify failure point and error message
### 2. DIAGNOSIS (15-30 minutes)
- [ ] Check data source connectivity
- [ ] Verify input data availability
- [ ] Review error logs in detail
- [ ] Check for schema changes
### 3. RESOLUTION PATHS
**Path A: Data Issue**
- [ ] Identify missing/corrupt data
- [ ] Trigger source system refresh
- [ ] Re-run pipeline with corrected data
**Path B: System Issue**
- [ ] Check Fabric capacity status
- [ ] Verify service health
- [ ] Restart failed activities
- [ ] Escalate if infrastructure issue
**Path C: Code Issue**
- [ ] Review recent code changes
- [ ] Rollback if recent deployment
- [ ] Fix and redeploy
- [ ] Trigger manual run
### 4. VERIFICATION
- [ ] Confirm successful pipeline completion
- [ ] Verify output data accuracy
- [ ] Check downstream dependencies
- [ ] Confirm compliance reports generated
### 5. DOCUMENTATION (MANDATORY)
- [ ] Log incident in compliance tracker
- [ ] Document root cause
- [ ] Note any regulatory impact
- [ ] Schedule post-mortem if required
### ESCALATION CONTACTS
- Compliance Team Lead: compliance-lead@casino.com
- Data Platform On-Call: +1-555-DATA-911
- Gaming Commission (if required): regulator@gaming.gov
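To make the time boxes in this playbook concrete, the sketch below maps elapsed time since detection to the playbook phase and tracks the 4-hour SLA. The function name `incident_status` and the returned dictionary keys are illustrative assumptions. Only the 15-minute and 30-minute phase boundaries and the 4-hour limit come from the playbook.

```python
from datetime import datetime, timedelta

SLA = timedelta(hours=4)  # from the playbook: must be resolved within 4 hours


def incident_status(detected_at: datetime, now: datetime) -> dict:
    """Return the expected playbook phase and SLA position for an open incident."""
    elapsed = now - detected_at
    minutes = elapsed.total_seconds() / 60
    if minutes < 15:
        phase = "1. IMMEDIATE ACTIONS"
    elif minutes < 30:
        phase = "2. DIAGNOSIS"
    else:
        phase = "3. RESOLUTION PATHS"
    return {
        "phase": phase,
        "deadline": detected_at + SLA,
        "remaining": max(SLA - elapsed, timedelta(0)),  # never negative
        "breached": elapsed > SLA,
    }
```

Pass timezone-aware datetimes for both arguments (for example UTC) so that the subtraction is well defined across daylight-saving changes.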
9.3 Runbook Template¶
## RUNBOOK: [Incident Type]
### Metadata
- **Owner:** [Team Name]
- **Last Updated:** [Date]
- **Review Frequency:** Quarterly
- **Severity:** [Critical/High/Medium/Low]
### Alert Information
- **Alert Name:** [Name from Azure Monitor]
- **Threshold:** [Condition that triggers]
- **Notification:** [Who is notified]
### Pre-Requisites
- [ ] Access to Fabric Admin portal
- [ ] Access to Azure Monitor
- [ ] Access to Log Analytics
- [ ] Required permissions: [List]
### Diagnostic Steps
1. [First diagnostic step]
2. [Second diagnostic step]
3. [Additional steps...]
### Resolution Steps
1. [First resolution step]
2. [Second resolution step]
3. [Additional steps...]
### Rollback Procedure
1. [If resolution fails, rollback step 1]
2. [Rollback step 2]
### Post-Incident Checklist
- [ ] Incident documented
- [ ] Stakeholders notified
- [ ] Root cause identified
- [ ] Prevention measures identified
- [ ] Runbook updated if needed
### Related Resources
- [Link to monitoring dashboard]
- [Link to architecture docs]
- [Link to escalation procedures]
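A runbook copied from this template is only useful once every bracketed placeholder has been replaced. As a rough sketch, the check below lists placeholders that are still unfilled. The function name `unfilled_placeholders` is hypothetical, and the pattern assumes placeholders look like `[Team Name]`, while checkboxes (`[ ]`, `[x]`) and markdown links (`[text](url)`) should be ignored.

```python
import re

# Bracketed text that is neither a checkbox ("[ ]", "[x]") nor a markdown link "[text](url)".
PLACEHOLDER = re.compile(r"\[(?![ xX]\])([^\]]+)\](?!\()")


def unfilled_placeholders(runbook_text: str) -> list[str]:
    """Return the text of every template placeholder still present in a runbook."""
    return PLACEHOLDER.findall(runbook_text)
```

An empty list means no template placeholders remain. It does not validate that the filled-in values are correct.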
Step 10: SLA Tracking and Reporting¶
10.1 SLA Definition Table¶
| Service | Metric | Target | Measurement |
|---|---|---|---|
| Bronze Ingestion | Freshness | < 15 min | Time from source to Bronze |
| Silver Transform | Freshness | < 1 hour | Time from Bronze to Silver |
| Gold Aggregation | Freshness | < 4 hours | Time from Silver to Gold |
| Model Refresh | Duration | < 30 min | End-to-end refresh time |
| Report Load | Response | < 5 sec | Time to first visual |
| Pipeline Success | Rate | > 99% | Successful runs / total runs |
| Capacity Availability | Uptime | > 99.9% | Available / total time |
| CTR Reporting | Timeliness | < 24 hours | From transaction to report |
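The table mixes "lower is better" targets (freshness, duration) with "higher is better" targets (success rate, uptime), so any automated check needs to carry the comparison direction with each threshold. The sketch below shows one way to encode that. The `SLA_TARGETS` structure and `meets_sla` function are illustrative names, and hour-based targets are converted to minutes as an assumption about how measurements are recorded.

```python
import operator

# (comparison, threshold, unit) per service, transcribed from the SLA definition table.
# Hour-based targets are converted to minutes (assumption: measurements arrive in minutes).
SLA_TARGETS = {
    "Bronze Ingestion": (operator.lt, 15, "min"),
    "Silver Transform": (operator.lt, 60, "min"),
    "Gold Aggregation": (operator.lt, 4 * 60, "min"),
    "Model Refresh": (operator.lt, 30, "min"),
    "Report Load": (operator.lt, 5, "sec"),
    "Pipeline Success": (operator.gt, 99.0, "%"),
    "Capacity Availability": (operator.gt, 99.9, "%"),
    "CTR Reporting": (operator.lt, 24 * 60, "min"),
}


def meets_sla(service: str, measured: float) -> bool:
    """Compare one measurement with its target, using strict < or > as in the table."""
    compare, threshold, _unit = SLA_TARGETS[service]
    return compare(measured, threshold)
```

Because the table states strict inequalities, a measurement exactly equal to the target counts as a miss here. Adjust the operators if your SLA wording is inclusive.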
10.2 SLA Tracking KQL Query¶
// Weekly SLA Compliance Report
let SLATargets = datatable(ServiceName:string, TargetMinutes:int)[
    "Bronze Ingestion", 15,
    "Silver Transform", 60,
    "Gold Aggregation", 240,
    "Model Refresh", 30
];
FabricPipelineRuns
| where TimeGenerated >= ago(7d)
| where Status == "Succeeded"
// Divide by 60.0 so whole-number DurationSeconds values are not truncated by integer division
| extend DurationMinutes = DurationSeconds / 60.0
| join kind=inner SLATargets on $left.PipelineName == $right.ServiceName
| extend WithinSLA = DurationMinutes <= TargetMinutes
| summarize
    TotalRuns = count(),
    WithinSLARuns = countif(WithinSLA),
    AvgDuration = avg(DurationMinutes),
    P95Duration = percentile(DurationMinutes, 95)
    by ServiceName, TargetMinutes
| extend
    SLACompliance = round(100.0 * WithinSLARuns / TotalRuns, 2),
    Status = case(
        100.0 * WithinSLARuns / TotalRuns >= 99, "GREEN",
        100.0 * WithinSLARuns / TotalRuns >= 95, "YELLOW",
        "RED"
    )
| project
    ServiceName,
    TotalRuns,
    SLACompliance,
    TargetMinutes,
    AvgDuration = round(AvgDuration, 1),
    P95Duration = round(P95Duration, 1),
    Status
| order by SLACompliance asc
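The compliance arithmetic in the query can be checked by hand with a few lines of Python. This sketch mirrors the query's per-service logic (runs at or under the target count as within SLA, with 99% and 95% cut-offs for the status). The function name `sla_status` is an assumption, and the input is a plain list of run durations rather than a Log Analytics result.

```python
def sla_status(durations_min: list[float], target_min: float) -> tuple[float, str]:
    """Return (compliance percentage, status) for one service's successful runs."""
    if not durations_min:
        raise ValueError("no runs to evaluate")
    # Same test as the query: DurationMinutes <= TargetMinutes.
    within = sum(1 for d in durations_min if d <= target_min)
    compliance = 100.0 * within / len(durations_min)
    if compliance >= 99:
        status = "GREEN"
    elif compliance >= 95:
        status = "YELLOW"
    else:
        status = "RED"
    return round(compliance, 2), status
```

For example, 200 runs against a 15-minute target with 2 runs over the limit gives 99.0% and GREEN, while 3 slow runs out of 100 gives 97.0% and YELLOW.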
10.3 Weekly SLA Report Template¶
+==============================================================+
| MICROSOFT FABRIC SLA REPORT |
| Week of January 21-27, 2026 |
+==============================================================+
EXECUTIVE SUMMARY
-----------------
Overall SLA Compliance: 98.7%
Critical Incidents: 1
Capacity Uptime: 99.95%
DETAILED METRICS
----------------
+----------------------+--------+--------+--------+--------+
| Service | Target | Actual | SLA % | Status |
+----------------------+--------+--------+--------+--------+
| Bronze Ingestion | 15 min | 8 min | 100% | GREEN |
| Silver Transform | 60 min | 45 min | 99.2% | GREEN |
| Gold Aggregation | 4 hrs | 2.5 hrs| 98.5% | YELLOW |
| Model Refresh | 30 min | 22 min | 97.8% | YELLOW |
| CTR Compliance | 24 hrs | 6 hrs | 100% | GREEN |
+----------------------+--------+--------+--------+--------+
INCIDENTS THIS WEEK
-------------------
1. [Jan 22, 14:30] Pipeline Failure - Silver Transform
Duration: 45 minutes
Impact: Delayed Gold layer refresh by 1 hour
Root Cause: Network timeout to source system
Resolution: Retry succeeded after network recovery
CAPACITY UTILIZATION
--------------------
Average CU: 62%
Peak CU: 89% (Jan 24, 08:00 - month-end processing)
Throttling Events: 0
RECOMMENDATIONS
---------------
1. Review Model Refresh SLA - consider optimization
2. Add buffer capacity for month-end processing
3. Implement proactive network monitoring
Report generated: 2026-01-28 06:00:00 UTC
Next review: Weekly team meeting
+==============================================================+
Validation Checklist¶
Before completing this tutorial, verify:
- Capacity Metrics App - Installed and connected to your capacity
- Log Analytics Workspace - Created and receiving logs
- Diagnostic Settings - Configured for Fabric capacity
- KQL Queries - Tested and returning expected results
- Alert Rules - Created for critical metrics
- Action Groups - Configured with correct notifications
- Monitoring Dashboard - Built in Power BI with key visuals
- Spark UI - Can access and interpret performance data
- Incident Playbooks - Documented for key scenarios
- SLA Definitions - Established with tracking queries
Verification Commands
### Check Log Analytics Connection
// Verify data is flowing to Log Analytics
search *
| where TimeGenerated >= ago(1h)
| summarize count() by Type
| order by count_ desc
# List all alert rules in resource group
Get-AzMetricAlertRuleV2 -ResourceGroupName "rg-fabric-monitoring"
Troubleshooting¶
| Issue | Cause | Solution |
|---|---|---|
| No data in Log Analytics | Diagnostic settings not configured | Enable diagnostics on Fabric capacity |
| Alerts not firing | Threshold too high or wrong metric | Review alert conditions and test |
| Capacity Metrics app error | Permission issue | Verify Fabric Admin access |
| KQL query timeout | Query too broad | Add time filters and optimize |
| Action group not sending | Email blocked or wrong address | Verify recipients and check spam |
| Spark UI not loading | Session ended | Restart notebook session |
Best Practices¶
- Start with Essentials - Monitor CU utilization, pipeline failures, and refresh status first
- Right-Size Alerts - Avoid alert fatigue by setting appropriate thresholds
- Use Action Groups - Centralize notification management
- Document Runbooks - Have procedures ready before incidents occur
- Review Weekly - Analyze trends and adjust thresholds
- Automate Responses - Use Logic Apps for common remediation
- Archive Logs - Keep 90+ days for compliance and analysis
- Test Alerts - Regularly verify notifications are working
- Track SLAs - Measure what matters to the business
- Continuous Improvement - Update monitoring as workloads evolve
Summary¶
In this tutorial, you implemented monitoring and observability for Microsoft Fabric. You:
- Configured the Fabric Capacity Metrics app for near-real-time capacity monitoring
- Set up Azure Monitor integration with Log Analytics
- Created KQL queries for pipeline, refresh, and query analysis
- Implemented alerts with action groups for proactive notification
- Built custom Power BI dashboards for operational visibility
- Used Spark UI for notebook debugging and optimization
- Established incident response playbooks for casino operations
- Defined SLAs and created tracking mechanisms
Effective monitoring helps your casino analytics platform operate reliably, meet compliance requirements, and surface the insights needed for continuous optimization.
Next Steps¶
Continue to Tutorial 18: Data Sharing and Collaboration to learn how to securely share data across teams and external partners using Fabric's collaboration features.
Additional Resources¶
- Microsoft Fabric Monitoring Documentation
- Azure Monitor Overview
- Log Analytics Workspace
- KQL Reference
- Power BI Monitoring
- Fabric Capacity Metrics App
- Azure Monitor Alerts
- Spark UI Documentation