Workspace Monitoring - Observability for Fabric Workloads
Queryable System Tables for Spark, SQL, Pipeline, and Dataflow Activity
Last Updated: 2026-04-13 | Version: 1.0.0
Table of Contents
- Overview
- Architecture
- Available System Tables
- Querying Patterns
- Alerting and Automation
- Casino Implementation
- Federal Agency Implementation
- Dashboard Templates
- Performance and Retention
- Limitations
- References
Overview
Workspace Monitoring is a Fabric workspace item (Preview) that provides queryable system tables containing operational metadata about every workload execution within a workspace. Rather than navigating scattered monitoring UIs for Spark, SQL, pipelines, and dataflows, Workspace Monitoring consolidates execution history, performance metrics, resource consumption, and failure details into a single Lakehouse that you can query with Spark SQL, KQL, or any SQL-compatible tool.
Think of it as an observability layer purpose-built for Microsoft Fabric -- comparable to sys.dm_exec_query_stats in SQL Server, but covering the full spectrum of Fabric workloads (Spark notebooks, SQL queries, Data Factory pipelines, Dataflow Gen2 activities, and capacity utilization) in one place.
Key Capabilities
| Capability | Description |
|---|---|
| Unified Execution History | Track every Spark session, SQL query, pipeline run, and dataflow activity in one location |
| Performance Metrics | Duration, resource consumption (CU seconds), rows processed, bytes scanned |
| Failure Analysis | Error messages, stack traces, and failure root cause for debugging failed executions |
| Capacity Monitoring | Track compute unit (CU) consumption by workload type and time period |
| Cross-Workload Correlation | Join system tables to trace end-to-end data pipeline execution |
| Self-Service Querying | Use standard Spark SQL to query system tables -- no admin portal access required |
| Automated Alerting | Combine with Data Activator or Power Automate for threshold-based alerts |
How Workspace Monitoring Fits in the Fabric Ecosystem
flowchart TB
subgraph Workloads["Fabric Workloads"]
NB["Notebooks<br/>(Spark Sessions)"]
SQL["SQL Queries<br/>(Warehouse/Lakehouse)"]
PL["Pipelines<br/>(Data Factory)"]
DF["Dataflows<br/>(Dataflow Gen2)"]
CP["Capacity<br/>(CU Consumption)"]
end
subgraph Monitoring["Workspace Monitoring"]
ST["System Tables<br/>(Monitoring Lakehouse)"]
end
subgraph Consumers["Consumers"]
SPK["Spark SQL Queries"]
RTD["Real-Time Dashboards"]
PBI["Power BI Reports"]
DA["Data Activator<br/>Alerts"]
PA["Power Automate<br/>Workflows"]
end
Workloads -->|"Automatic logging"| Monitoring
Monitoring --> Consumers
style Workloads fill:#6C3483,stroke:#4A235A,color:#fff
style Monitoring fill:#E67E22,stroke:#CA6F1E,color:#fff
style Consumers fill:#27AE60,stroke:#1E8449,color:#fff

Workspace Monitoring vs. Other Monitoring Approaches
| Approach | Scope | Query Method | Granularity | Latency |
|---|---|---|---|---|
| Workspace Monitoring | Full workspace workload history | Spark SQL, KQL | Per-session, per-query, per-run | Minutes (near real-time) |
| Admin Monitoring Portal | Tenant-level capacity overview | Web UI only | Per-capacity | Minutes |
| Fabric Capacity Metrics App | Capacity consumption analytics | Power BI report | Per-hour, per-day | Hours |
| Azure Monitor Integration | Infrastructure and platform events | KQL (Log Analytics) | Per-event | Minutes |
| Custom Notebook Logging | Application-level telemetry | Custom implementation | Developer-defined | Real-time |
Architecture
Workspace Monitoring operates by automatically capturing execution metadata from all Fabric workloads within a workspace and writing it to system tables in a dedicated Monitoring Lakehouse.
Component Architecture
flowchart LR
subgraph Sources["Workload Engines"]
SP["Spark Engine<br/>notebook sessions"]
SQ["SQL Engine<br/>warehouse queries"]
PL["Pipeline Engine<br/>orchestration runs"]
DF["Dataflow Engine<br/>dataflow activities"]
CAP["Capacity Manager<br/>CU allocation"]
end
subgraph Collector["Monitoring Collector"]
EVT["Event Capture"]
ENR["Metadata Enrichment"]
BUF["Write Buffer"]
end
subgraph Storage["Monitoring Lakehouse"]
T1["spark_sessions"]
T2["sql_queries"]
T3["pipeline_runs"]
T4["dataflow_activities"]
T5["capacity_metrics"]
end
subgraph Output["Analytics"]
QRY["Spark SQL Queries"]
DSH["Monitoring Dashboard"]
ALR["Alert Rules"]
end
Sources --> Collector --> Storage --> Output
style Sources fill:#6C3483,stroke:#4A235A,color:#fff
style Collector fill:#E67E22,stroke:#CA6F1E,color:#fff
style Storage fill:#2471A3,stroke:#1A5276,color:#fff
style Output fill:#27AE60,stroke:#1E8449,color:#fff

Data Flow
sequenceDiagram
participant WL as Fabric Workload
participant MC as Monitoring Collector
participant ML as Monitoring Lakehouse
participant USR as Analyst
WL->>MC: Execution event (start/complete/fail)
MC->>MC: Enrich with workspace, item, user metadata
MC->>ML: Write to system table (Delta format)
Note over MC,ML: 2-5 minute freshness lag
USR->>ML: SELECT * FROM system.spark_sessions
ML-->>USR: Execution history results
Note over WL,USR: All workloads logged automatically

Enabling Workspace Monitoring
Workspace Settings → Monitoring (Preview)
├── Enable Workspace Monitoring → On
├── Monitoring Lakehouse → Auto-created (or select existing)
├── Retention Period → 30 days (default, configurable to 90 days)
└── System Tables → Auto-provisioned
Note: Enabling Workspace Monitoring creates a dedicated Monitoring Lakehouse in the workspace. This Lakehouse consumes OneLake storage and capacity CU for the system table writes. The overhead is typically less than 1% of total workspace CU consumption.
Available System Tables
Workspace Monitoring provisions five system tables in the Monitoring Lakehouse. Each table captures a different workload type with its specific execution metadata.
System Table Summary
| Table | Workload | Key Metrics | Granularity |
|---|---|---|---|
spark_sessions | Notebook and Spark Job executions | Duration, CU consumption, row/byte counts, status | Per Spark session |
sql_queries | Warehouse and Lakehouse SQL queries | Duration, rows scanned, bytes processed, query text | Per SQL statement |
pipeline_runs | Data Factory pipeline executions | Duration, activity count, status, error messages | Per pipeline run |
dataflow_activities | Dataflow Gen2 refresh operations | Duration, rows processed, partition count, status | Per dataflow refresh |
capacity_metrics | Workspace-level compute consumption | CU seconds by workload type, throttling events | Per 30-second interval |
Table Schemas

spark_sessions
Captures every Spark session initiated by notebooks, Spark Job Definitions, or Lakehouse scheduled jobs.
| Column | Type | Description |
|---|---|---|
session_id | string | Unique Spark session identifier |
workspace_id | string | Workspace GUID |
workspace_name | string | Workspace display name |
item_id | string | Notebook or Spark Job item GUID |
item_name | string | Item display name (notebook name) |
user_id | string | Entra ID of the user who initiated the session |
user_email | string | Email of the initiating user |
start_time | timestamp | Session start timestamp (UTC) |
end_time | timestamp | Session end timestamp (UTC) |
duration_seconds | long | Total session duration in seconds |
status | string | Session status: Succeeded, Failed, Cancelled, Running |
error_message | string | Error details if status = Failed |
spark_pool | string | Starter pool or custom pool name |
executor_count | int | Number of Spark executors allocated |
cu_seconds | decimal | Total compute unit seconds consumed |
rows_read | long | Total rows read across all stages |
rows_written | long | Total rows written across all stages |
bytes_read | long | Total bytes read from storage |
bytes_written | long | Total bytes written to storage |
shuffle_bytes | long | Total bytes shuffled between executors |
trigger_type | string | Interactive, Scheduled, Pipeline, API |
sql_queries
Captures every SQL query executed against Warehouse or Lakehouse SQL analytics endpoints.
| Column | Type | Description |
|---|---|---|
query_id | string | Unique query execution identifier |
workspace_id | string | Workspace GUID |
item_id | string | Warehouse or Lakehouse item GUID |
item_name | string | Warehouse or Lakehouse name |
item_type | string | Warehouse or Lakehouse |
user_id | string | Entra ID of the user who executed the query |
user_email | string | Email of the executing user |
query_text | string | Full SQL query text (up to 64 KB) |
query_hash | string | Hash of the query text for deduplication |
start_time | timestamp | Query start timestamp (UTC) |
end_time | timestamp | Query end timestamp (UTC) |
duration_ms | long | Query duration in milliseconds |
status | string | Succeeded, Failed, Cancelled |
error_message | string | Error details if status = Failed |
rows_returned | long | Number of rows in the result set |
rows_scanned | long | Total rows scanned from storage |
bytes_scanned | long | Total bytes scanned from storage |
cu_seconds | decimal | Compute unit seconds consumed |
source | string | SSMS, Notebook, PowerBI, ThirdParty, FabricPortal |
pipeline_runs
Captures every Data Factory pipeline execution, including scheduled, triggered, and manual runs.
| Column | Type | Description |
|---|---|---|
run_id | string | Unique pipeline run identifier |
workspace_id | string | Workspace GUID |
pipeline_id | string | Pipeline item GUID |
pipeline_name | string | Pipeline display name |
start_time | timestamp | Pipeline run start timestamp (UTC) |
end_time | timestamp | Pipeline run end timestamp (UTC) |
duration_seconds | long | Total pipeline duration in seconds |
status | string | Succeeded, Failed, Cancelled, InProgress |
error_message | string | Error details for failed activities |
activity_count | int | Total number of activities in the pipeline |
activities_succeeded | int | Count of activities that succeeded |
activities_failed | int | Count of activities that failed |
trigger_type | string | Scheduled, Manual, Tumbling, Event |
trigger_name | string | Name of the trigger (for scheduled/event runs) |
parameters | string | JSON string of pipeline parameters |
cu_seconds | decimal | Total CU seconds consumed by all activities |
dataflow_activities
Captures Dataflow Gen2 refresh operations.
| Column | Type | Description |
|---|---|---|
activity_id | string | Unique dataflow activity identifier |
workspace_id | string | Workspace GUID |
dataflow_id | string | Dataflow item GUID |
dataflow_name | string | Dataflow display name |
start_time | timestamp | Refresh start timestamp (UTC) |
end_time | timestamp | Refresh end timestamp (UTC) |
duration_seconds | long | Total refresh duration in seconds |
status | string | Succeeded, Failed, Cancelled |
error_message | string | Error details if status = Failed |
rows_processed | long | Total rows processed during refresh |
partition_count | int | Number of partitions refreshed |
refresh_type | string | Full, Incremental |
cu_seconds | decimal | Compute unit seconds consumed |
capacity_metrics
Provides 30-second interval snapshots of capacity utilization across the workspace.
| Column | Type | Description |
|---|---|---|
timestamp | timestamp | Metric collection timestamp (UTC) |
workspace_id | string | Workspace GUID |
workload_type | string | Spark, SQL, Pipeline, Dataflow, Other |
cu_allocated | decimal | CU allocated for the workload at this interval |
cu_consumed | decimal | CU actually consumed at this interval |
cu_throttled | decimal | CU requested but throttled due to capacity limits |
active_sessions | int | Number of active sessions/queries for this workload |
queued_requests | int | Number of requests waiting for capacity |
throttle_event | boolean | Whether a throttle event occurred in this interval |
Querying Patterns

Basic Monitoring Queries

Spark Session History
-- Last 24 hours of Spark notebook executions
SELECT
item_name AS notebook,
user_email,
start_time,
duration_seconds,
status,
cu_seconds,
rows_written,
ROUND(bytes_written / 1048576.0, 2) AS mb_written
FROM system.spark_sessions
WHERE start_time >= DATEADD(HOUR, -24, CURRENT_TIMESTAMP)
ORDER BY start_time DESC;
Slow Query Detection
-- SQL queries exceeding 60 seconds in the last 7 days
SELECT
item_name AS data_source,
user_email,
SUBSTRING(query_text, 1, 200) AS query_preview,
duration_ms / 1000.0 AS duration_seconds,
rows_scanned,
ROUND(bytes_scanned / 1073741824.0, 2) AS gb_scanned,
cu_seconds
FROM system.sql_queries
WHERE start_time >= DATEADD(DAY, -7, CURRENT_TIMESTAMP)
AND duration_ms > 60000
AND status = 'Succeeded'
ORDER BY duration_ms DESC;
Pipeline Failure Trends
-- Daily pipeline failure rate over the last 30 days
SELECT
CAST(start_time AS DATE) AS run_date,
COUNT(*) AS total_runs,
SUM(CASE WHEN status = 'Succeeded' THEN 1 ELSE 0 END) AS succeeded,
SUM(CASE WHEN status = 'Failed' THEN 1 ELSE 0 END) AS failed,
ROUND(
SUM(CASE WHEN status = 'Failed' THEN 1.0 ELSE 0 END) /
COUNT(*) * 100, 2
) AS failure_rate_pct
FROM system.pipeline_runs
WHERE start_time >= DATEADD(DAY, -30, CURRENT_TIMESTAMP)
GROUP BY CAST(start_time AS DATE)
ORDER BY run_date DESC;
Advanced Analysis Patterns

Capacity Consumption by Workload
-- Hourly CU consumption breakdown by workload type (last 7 days)
SELECT
DATE_TRUNC('hour', timestamp) AS hour,
workload_type,
ROUND(SUM(cu_consumed), 2) AS total_cu,
MAX(active_sessions) AS peak_sessions,
SUM(CASE WHEN throttle_event THEN 1 ELSE 0 END) AS throttle_count
FROM system.capacity_metrics
WHERE timestamp >= DATEADD(DAY, -7, CURRENT_TIMESTAMP)
GROUP BY DATE_TRUNC('hour', timestamp), workload_type
ORDER BY hour DESC, total_cu DESC;
Top CU Consumers
-- Top 10 notebooks by total CU consumption this month
SELECT
item_name AS notebook,
user_email,
COUNT(*) AS run_count,
ROUND(SUM(cu_seconds), 2) AS total_cu_seconds,
ROUND(AVG(cu_seconds), 2) AS avg_cu_per_run,
ROUND(AVG(duration_seconds), 0) AS avg_duration_sec
FROM system.spark_sessions
WHERE start_time >= DATE_TRUNC('month', CURRENT_TIMESTAMP)
AND status = 'Succeeded'
GROUP BY item_name, user_email
ORDER BY total_cu_seconds DESC
LIMIT 10;
End-to-End Pipeline Tracing
-- Correlate pipeline runs with the Spark sessions they triggered
SELECT
pr.pipeline_name,
pr.run_id,
pr.start_time AS pipeline_start,
pr.duration_seconds AS pipeline_duration,
ss.item_name AS notebook_triggered,
ss.start_time AS notebook_start,
ss.duration_seconds AS notebook_duration,
ss.status AS notebook_status,
ss.cu_seconds AS notebook_cu
FROM system.pipeline_runs pr
LEFT JOIN system.spark_sessions ss
ON ss.trigger_type = 'Pipeline'
AND ss.start_time BETWEEN pr.start_time AND pr.end_time
AND ss.workspace_id = pr.workspace_id
WHERE pr.start_time >= DATEADD(DAY, -7, CURRENT_TIMESTAMP)
ORDER BY pr.start_time DESC;
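The join above attributes a Spark session to a pipeline run by trigger type plus time-window containment -- the schema has no explicit run-ID foreign key. A minimal Python sketch of that predicate (a hypothetical helper, not a Fabric API), useful for reasoning about edge cases:

```python
from datetime import datetime

def attributed_to_run(session_start, trigger_type, run_start, run_end):
    """Mirror the SQL join predicate: a session belongs to a pipeline run
    if it was pipeline-triggered and started inside the run's time window
    (boundaries inclusive, matching BETWEEN)."""
    return trigger_type == "Pipeline" and run_start <= session_start <= run_end

# Hypothetical pipeline run window: 02:00 - 03:00 UTC
run = (datetime(2026, 4, 10, 2, 0), datetime(2026, 4, 10, 3, 0))

print(attributed_to_run(datetime(2026, 4, 10, 2, 15), "Pipeline", *run))     # True
print(attributed_to_run(datetime(2026, 4, 10, 2, 15), "Interactive", *run))  # False
print(attributed_to_run(datetime(2026, 4, 10, 3, 30), "Pipeline", *run))     # False
```

Note the blind spot this shares with the SQL: two pipeline runs with overlapping windows will both claim the same session.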
Query Pattern Analysis
-- Identify the most frequently executed query patterns
SELECT
query_hash,
SUBSTRING(MIN(query_text), 1, 150) AS query_pattern,
COUNT(*) AS execution_count,
ROUND(AVG(duration_ms), 0) AS avg_duration_ms,
ROUND(AVG(rows_scanned), 0) AS avg_rows_scanned,
ROUND(SUM(cu_seconds), 2) AS total_cu_seconds,
MIN(source) AS primary_source
FROM system.sql_queries
WHERE start_time >= DATEADD(DAY, -7, CURRENT_TIMESTAMP)
AND status = 'Succeeded'
GROUP BY query_hash
HAVING COUNT(*) >= 5
ORDER BY total_cu_seconds DESC;
KQL Querying (via Eventhouse Shortcut)
If you create an Eventhouse shortcut to the Monitoring Lakehouse, you can use KQL for more advanced time-series analysis:
// Spark session duration anomaly detection
spark_sessions
| where start_time > ago(30d)
| where status == "Succeeded"
| summarize AvgDuration = avg(duration_seconds), StdDuration = stdev(duration_seconds) by item_name
| join kind=inner (
spark_sessions
| where start_time > ago(1d)
| where status == "Succeeded"
) on item_name
| extend ZScore = (duration_seconds - AvgDuration) / StdDuration
| where abs(ZScore) > 3
| project item_name, start_time, duration_seconds, AvgDuration, ZScore
| order by abs(ZScore) desc
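If you want to validate the anomaly rule before wiring it into KQL, the same z-score logic is easy to prototype locally. A self-contained Python sketch with hypothetical baseline durations (statistics.stdev is the sample standard deviation, matching KQL's stdev()):

```python
from statistics import mean, stdev

# Hypothetical 30-day baseline of successful-run durations (seconds) per notebook
baseline = {
    "01_gold_slot_performance": [300, 310, 295, 305, 290, 315, 300, 298],
}

def flag_anomalies(item_name, recent_durations, threshold=3.0):
    """Return (duration, z_score) pairs whose |z-score| exceeds the threshold
    relative to the notebook's baseline -- same rule as the KQL above."""
    history = baseline[item_name]
    mu, sigma = mean(history), stdev(history)
    anomalies = []
    for d in recent_durations:
        z = (d - mu) / sigma
        if abs(z) > threshold:
            anomalies.append((d, round(z, 2)))
    return anomalies

# A 900-second run against a ~300-second baseline is flagged; 305 is not
print(flag_anomalies("01_gold_slot_performance", [305, 900]))
```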
Alerting and Automation

Data Activator Integration
Combine Workspace Monitoring with Data Activator to create automated alerts for operational issues:
flowchart LR
subgraph Monitoring["Monitoring Lakehouse"]
ST["System Tables"]
end
subgraph Activator["Data Activator"]
R1["Rule: Pipeline Failure"]
R2["Rule: Slow Query"]
R3["Rule: CU Throttling"]
R4["Rule: Notebook Error"]
end
subgraph Actions["Actions"]
EM["Email Alert"]
TM["Teams Message"]
PA["Power Automate"]
end
ST --> Activator --> Actions
style Monitoring fill:#2471A3,stroke:#1A5276,color:#fff
style Activator fill:#E67E22,stroke:#CA6F1E,color:#fff
style Actions fill:#27AE60,stroke:#1E8449,color:#fff

Alert Rule Examples
| Alert | System Table | Condition | Action |
|---|---|---|---|
| Pipeline Failure | pipeline_runs | status = 'Failed' | Email data engineering team |
| Slow Query | sql_queries | duration_ms > 300000 (5 min) | Teams notification to DBA |
| CU Throttling | capacity_metrics | throttle_event = true for 3+ consecutive intervals | Alert capacity admin |
| Notebook Error | spark_sessions | status = 'Failed' AND item_name LIKE '%compliance%' | Page on-call compliance engineer |
| Capacity Spike | capacity_metrics | cu_consumed > 0.8 * cu_allocated for 30+ minutes | Email workspace admin |
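The CU Throttling rule fires only on three or more consecutive throttled intervals, which a plain per-row threshold cannot express. A sketch of the run-length check, assuming a time-ordered list of throttle_event values from capacity_metrics:

```python
def sustained_throttling(events, min_consecutive=3):
    """Return True if throttle_event was set in min_consecutive or more
    consecutive 30-second intervals (the alert condition above)."""
    run = 0
    for throttled in events:
        run = run + 1 if throttled else 0
        if run >= min_consecutive:
            return True
    return False

# Three consecutive throttled intervals near the end -> alert fires
intervals = [False, True, True, False, True, True, True, False]
print(sustained_throttling(intervals))  # True
```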
Power Automate Webhook Pattern
{
"trigger": "data_activator",
"rule": "pipeline_failure",
"flow_actions": [
{
"action": "teams_post",
"channel": "#data-engineering-alerts",
"message": "π΄ Pipeline '{pipeline_name}' FAILED\nRun ID: {run_id}\nError: {error_message}\nDuration: {duration_seconds}s\nWorkspace: {workspace_name}"
},
{
"action": "create_devops_work_item",
"title": "Pipeline failure: {pipeline_name}",
"description": "Automated work item from Workspace Monitoring alert",
"priority": 2
}
]
}
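The {placeholder} fields above are filled in at alert time. A hedged Python sketch of that substitution using str.format_map, with unknown placeholders left intact rather than raising (the payload shape is illustrative, not the actual Power Automate contract):

```python
class SafeDict(dict):
    # Leave unknown placeholders intact instead of raising KeyError
    def __missing__(self, key):
        return "{" + key + "}"

template = ("Pipeline '{pipeline_name}' FAILED\n"
            "Run ID: {run_id}\nError: {error_message}")

# Hypothetical alert payload fields, mirroring pipeline_runs columns
alert = {"pipeline_name": "pl-nightly-ctr", "run_id": "abc-123",
         "error_message": "Timeout in activity 'CopyCTR'"}

print(template.format_map(SafeDict(alert)))
```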
Casino Implementation
Casino gaming workloads have strict execution requirements -- compliance batch processing must complete within regulatory windows, and failures during peak hours require immediate attention.
Compliance Batch Monitoring
The casino domain runs nightly compliance batch notebooks that process CTR filings, SAR detections, and W-2G calculations. Workspace Monitoring tracks these critical executions.
flowchart TB
subgraph Compliance["Compliance Pipelines"]
PL1["pl-nightly-ctr<br/>CTR batch processing"]
PL2["pl-nightly-sar<br/>SAR pattern detection"]
PL3["pl-nightly-w2g<br/>W-2G generation"]
PL4["pl-daily-audit<br/>Audit trail compilation"]
end
subgraph Monitoring["Workspace Monitoring"]
PR["pipeline_runs"]
SS["spark_sessions"]
end
subgraph Alerts["Alerts"]
A1["Compliance batch failed"]
A2["Batch exceeded SLA window"]
A3["Data quality check failed"]
end
Compliance -->|"Execution metadata"| Monitoring --> Alerts
style Compliance fill:#6C3483,stroke:#4A235A,color:#fff
style Monitoring fill:#E67E22,stroke:#CA6F1E,color:#fff
style Alerts fill:#C0392B,stroke:#922B21,color:#fff

Compliance SLA Monitoring
-- Check if nightly compliance pipelines completed within SLA
-- SLA: all compliance pipelines must complete before 6:00 AM
-- (assumes runs start after midnight in the same timezone as the stored
-- timestamps; shift the cutoff if your system tables store UTC)
SELECT
pipeline_name,
start_time,
end_time,
duration_seconds,
status,
CASE
WHEN end_time > CAST(CAST(start_time AS DATE) AS TIMESTAMP) + INTERVAL '6' HOUR
THEN 'SLA BREACH'
ELSE 'WITHIN SLA'
END AS sla_status,
error_message
FROM system.pipeline_runs
WHERE pipeline_name LIKE 'pl-nightly-%'
AND start_time >= DATEADD(DAY, -7, CURRENT_TIMESTAMP)
ORDER BY start_time DESC;
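The same cutoff logic can be unit-tested outside SQL. A Python sketch that mirrors the CASE expression, under the same assumption that a run starts after midnight on its batch date:

```python
from datetime import datetime, time

def sla_status(start: datetime, end: datetime, deadline=time(6, 0)):
    """'SLA BREACH' if the run finished after the deadline on its start date,
    mirroring the CASE expression in the query above."""
    cutoff = datetime.combine(start.date(), deadline)
    return "SLA BREACH" if end > cutoff else "WITHIN SLA"

run_start = datetime(2026, 4, 10, 1, 30)
print(sla_status(run_start, datetime(2026, 4, 10, 5, 45)))  # WITHIN SLA
print(sla_status(run_start, datetime(2026, 4, 10, 6, 10)))  # SLA BREACH
```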
Peak Hour Error Detection
-- Notebooks that failed during casino peak hours (6 PM - 2 AM)
SELECT
item_name AS notebook,
user_email,
start_time,
status,
error_message
FROM system.spark_sessions
WHERE status = 'Failed'
AND start_time >= DATEADD(DAY, -1, CURRENT_TIMESTAMP)
AND (HOUR(start_time) >= 18 OR HOUR(start_time) < 2)
ORDER BY start_time DESC;
Slot Performance Notebook Tracking
-- Track execution history of the slot performance Gold notebook
SELECT
start_time,
duration_seconds,
cu_seconds,
rows_written,
status,
ROUND(bytes_written / 1048576.0, 2) AS mb_written
FROM system.spark_sessions
WHERE item_name = '01_gold_slot_performance'
AND start_time >= DATEADD(DAY, -30, CURRENT_TIMESTAMP)
ORDER BY start_time DESC;
Cost Attribution by Gaming Domain
-- Monthly CU consumption breakdown for casino workloads
SELECT
CASE
WHEN item_name LIKE '%slot%' THEN 'Slot Analytics'
WHEN item_name LIKE '%player%' THEN 'Player Analytics'
WHEN item_name LIKE '%compliance%' OR item_name LIKE '%ctr%' OR item_name LIKE '%sar%' THEN 'Compliance'
WHEN item_name LIKE '%gold%' THEN 'Gold Layer'
WHEN item_name LIKE '%silver%' THEN 'Silver Layer'
WHEN item_name LIKE '%bronze%' THEN 'Bronze Layer'
ELSE 'Other'
END AS domain,
COUNT(*) AS execution_count,
ROUND(SUM(cu_seconds), 2) AS total_cu_seconds,
ROUND(AVG(duration_seconds), 0) AS avg_duration,
SUM(CASE WHEN status = 'Failed' THEN 1 ELSE 0 END) AS failures
FROM system.spark_sessions
WHERE start_time >= DATE_TRUNC('month', CURRENT_TIMESTAMP)
GROUP BY
CASE
WHEN item_name LIKE '%slot%' THEN 'Slot Analytics'
WHEN item_name LIKE '%player%' THEN 'Player Analytics'
WHEN item_name LIKE '%compliance%' OR item_name LIKE '%ctr%' OR item_name LIKE '%sar%' THEN 'Compliance'
WHEN item_name LIKE '%gold%' THEN 'Gold Layer'
WHEN item_name LIKE '%silver%' THEN 'Silver Layer'
WHEN item_name LIKE '%bronze%' THEN 'Bronze Layer'
ELSE 'Other'
END
ORDER BY total_cu_seconds DESC;
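The CASE ladder above is first-match-wins, so ordering matters: a notebook named 01_gold_slot_performance lands in Slot Analytics, not Gold Layer. A Python mirror of the rules, handy for unit-testing the classification before editing the SQL (note that 'ctr' matches as a substring, so unrelated names containing those letters would also classify as Compliance):

```python
# Ordered rules: first match wins, same as the SQL CASE expression
RULES = [
    ("Slot Analytics", ["slot"]),
    ("Player Analytics", ["player"]),
    ("Compliance", ["compliance", "ctr", "sar"]),
    ("Gold Layer", ["gold"]),
    ("Silver Layer", ["silver"]),
    ("Bronze Layer", ["bronze"]),
]

def classify(item_name: str) -> str:
    """Assign an item to a cost-attribution domain by substring match."""
    name = item_name.lower()
    for domain, keywords in RULES:
        if any(k in name for k in keywords):
            return domain
    return "Other"

print(classify("01_gold_slot_performance"))  # Slot Analytics (slot wins over gold)
print(classify("nb_ctr_batch"))              # Compliance
```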
Federal Agency Implementation
Federal agencies require comprehensive audit trails for FISMA compliance, capacity dashboards per agency workspace, and execution monitoring that supports continuous Authority to Operate (ATO) evidence collection.
FISMA Execution Audit
Federal information systems must maintain audit records of all data processing activities. Workspace Monitoring system tables serve as a primary evidence source for FISMA audit requirements.
flowchart TB
subgraph Agencies["Federal Agency Workspaces"]
W1["ws-usda<br/>USDA workloads"]
W2["ws-sba<br/>SBA workloads"]
W3["ws-noaa<br/>NOAA workloads"]
W4["ws-epa<br/>EPA workloads"]
W5["ws-doi<br/>DOI workloads"]
end
subgraph Monitoring["Per-Workspace Monitoring"]
M1["Monitoring Lakehouse<br/>ws-usda"]
M2["Monitoring Lakehouse<br/>ws-sba"]
M3["Monitoring Lakehouse<br/>ws-noaa"]
M4["Monitoring Lakehouse<br/>ws-epa"]
M5["Monitoring Lakehouse<br/>ws-doi"]
end
subgraph Central["Central Audit"]
CL["Consolidated<br/>Audit Dashboard"]
RPT["FISMA Evidence<br/>Report"]
end
Agencies --> Monitoring --> Central
style Agencies fill:#6C3483,stroke:#4A235A,color:#fff
style Monitoring fill:#E67E22,stroke:#CA6F1E,color:#fff
style Central fill:#27AE60,stroke:#1E8449,color:#fff

FISMA Audit Report
-- FISMA-aligned execution audit report
-- Requirement: Log all data access and processing activities
SELECT
'SPARK' AS workload_type,
item_name AS resource,
user_email AS principal,
start_time AS event_time,
status AS outcome,
CONCAT('Duration: ', duration_seconds, 's, CU: ', ROUND(cu_seconds, 2)) AS details
FROM system.spark_sessions
WHERE start_time >= DATEADD(DAY, -30, CURRENT_TIMESTAMP)
UNION ALL
SELECT
'SQL' AS workload_type,
item_name AS resource,
user_email AS principal,
start_time AS event_time,
status AS outcome,
CONCAT('Rows scanned: ', rows_scanned, ', Duration: ', duration_ms, 'ms') AS details
FROM system.sql_queries
WHERE start_time >= DATEADD(DAY, -30, CURRENT_TIMESTAMP)
UNION ALL
SELECT
'PIPELINE' AS workload_type,
pipeline_name AS resource,
'system' AS principal,
start_time AS event_time,
status AS outcome,
CONCAT('Activities: ', activity_count, ', Duration: ', duration_seconds, 's') AS details
FROM system.pipeline_runs
WHERE start_time >= DATEADD(DAY, -30, CURRENT_TIMESTAMP)
ORDER BY event_time DESC;
Capacity Dashboard Per Agency
-- Agency workspace capacity utilization summary
SELECT
workload_type,
DATE_TRUNC('day', timestamp) AS day,
ROUND(SUM(cu_consumed), 2) AS daily_cu_consumed,
ROUND(AVG(cu_consumed), 4) AS avg_cu_per_interval,
MAX(active_sessions) AS peak_active_sessions,
SUM(CASE WHEN throttle_event THEN 1 ELSE 0 END) AS throttle_events
FROM system.capacity_metrics
WHERE timestamp >= DATEADD(DAY, -30, CURRENT_TIMESTAMP)
GROUP BY workload_type, DATE_TRUNC('day', timestamp)
ORDER BY day DESC, daily_cu_consumed DESC;
Federal Data Pipeline Health
-- Health check for federal agency bronze/silver/gold pipelines
SELECT
pipeline_name,
COUNT(*) AS total_runs,
SUM(CASE WHEN status = 'Succeeded' THEN 1 ELSE 0 END) AS succeeded,
SUM(CASE WHEN status = 'Failed' THEN 1 ELSE 0 END) AS failed,
ROUND(
SUM(CASE WHEN status = 'Succeeded' THEN 1.0 ELSE 0 END) / COUNT(*) * 100, 2
) AS success_rate_pct,
ROUND(AVG(duration_seconds), 0) AS avg_duration_sec,
MAX(end_time) AS last_run
FROM system.pipeline_runs
WHERE start_time >= DATEADD(DAY, -30, CURRENT_TIMESTAMP)
GROUP BY pipeline_name
ORDER BY success_rate_pct ASC;
Cross-Agency Comparison
If each agency workspace has Monitoring enabled, an authorized admin can create shortcuts across workspaces to build a consolidated view:
-- Compare notebook execution patterns across agency workspaces
-- (requires shortcuts from each workspace's monitoring lakehouse)
SELECT
'USDA' AS agency,
COUNT(*) AS notebook_runs,
ROUND(SUM(cu_seconds), 2) AS total_cu,
SUM(CASE WHEN status = 'Failed' THEN 1 ELSE 0 END) AS failures
FROM usda_monitoring.spark_sessions
WHERE start_time >= DATE_TRUNC('month', CURRENT_TIMESTAMP)
UNION ALL
SELECT 'EPA', COUNT(*), ROUND(SUM(cu_seconds), 2),
SUM(CASE WHEN status = 'Failed' THEN 1 ELSE 0 END)
FROM epa_monitoring.spark_sessions
WHERE start_time >= DATE_TRUNC('month', CURRENT_TIMESTAMP)
UNION ALL
SELECT 'NOAA', COUNT(*), ROUND(SUM(cu_seconds), 2),
SUM(CASE WHEN status = 'Failed' THEN 1 ELSE 0 END)
FROM noaa_monitoring.spark_sessions
WHERE start_time >= DATE_TRUNC('month', CURRENT_TIMESTAMP)
ORDER BY total_cu DESC;
Dashboard Templates

Operations Dashboard Layout
┌──────────────────────────────────────────────────────────────┐
│ Workspace Monitoring - Operations Dashboard                  │
│ Filters: [Time Range ▼] [Workload Type ▼] [Status ▼]         │
├───────────────┬───────────────┬───────────────┬──────────────┤
│ Total Runs    │ Success Rate  │ Avg Duration  │ CU Used      │
│ 1,247         │ 96.3%         │ 4.2 min       │ 15,230 CU    │
│ Δ 12% vs LW   │ Δ 0.5%        │ Δ 8%          │ Δ 5%         │
├───────────────┴───────────────┴───────────────┴──────────────┤
│ CU Consumption by Workload Type (Stacked Area Chart)         │
│ [Spark | SQL | Pipeline | Dataflow over last 7 days]         │
├──────────────────────────────┬───────────────────────────────┤
│ Recent Failures              │ Slowest Executions            │
│ [Table: name, time, error]   │ [Table: name, duration, CU]   │
│ [Sorted by most recent]      │ [Sorted by duration DESC]     │
├──────────────────────────────┴───────────────────────────────┤
│ Daily Execution Volume (Bar Chart)                           │
│ [Stacked: Succeeded | Failed | Cancelled over 30 days]       │
└──────────────────────────────────────────────────────────────┘
Compliance Operations Dashboard (Casino)

┌──────────────────────────────────────────────────────────────┐
│ Compliance Pipeline Monitoring                               │
│ Filters: [Date Range ▼] [Pipeline ▼]                         │
├───────────────┬───────────────┬───────────────┬──────────────┤
│ CTR Pipeline  │ SAR Pipeline  │ W-2G Pipeline │ Audit        │
│ ✔ Last: 2h    │ ✔ Last: 3h    │ ✔ Last: 1h    │ ✔ Last: 4h   │
│ Avg: 45 min   │ Avg: 1.2 hr   │ Avg: 20 min   │ Avg: 35 min  │
├───────────────┴───────────────┴───────────────┴──────────────┤
│ Compliance Pipeline Duration Trend (7 days)                  │
│ [Line chart showing execution time per pipeline]             │
├──────────────────────────────┬───────────────────────────────┤
│ SLA Breaches (Last 30d)      │ Pipeline Failures             │
│ [Table: pipeline, date,      │ [Table: pipeline, time,       │
│  actual vs SLA]              │  error, impact]               │
└──────────────────────────────┴───────────────────────────────┘
Performance and Retention

Retention Configuration
| Setting | Default | Maximum | Recommendation |
|---|---|---|---|
| System table retention | 30 days | 90 days | 30 days for operational monitoring; archive to separate Lakehouse for longer retention |
| Query text retention | 30 days | 90 days | Store query text for debugging; be aware of storage impact for high-volume workloads |
| Capacity metrics retention | 30 days | 90 days | 90 days recommended for trend analysis and capacity planning |
| Freshness lag | 2-5 minutes | N/A | System tables are near-real-time, not live; expect 2-5 minute delay |
Storage Impact
| Workspace Profile | Events/Day | Storage/Month | CU Overhead |
|---|---|---|---|
| Small (< 50 executions/day) | 50 | < 100 MB | < 0.1% |
| Medium (50-500 executions/day) | 250 | 500 MB - 2 GB | 0.1-0.5% |
| Large (500-5000 executions/day) | 2,500 | 2-10 GB | 0.5-1.0% |
| Very Large (5000+ executions/day) | 10,000+ | 10-50 GB | 1.0-2.0% |
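The table scales roughly linearly with event volume. A back-of-envelope estimator, assuming -- purely as an illustration -- about 60 KB of compressed Delta metadata per execution event, which lands near the Small and Large rows above:

```python
def monthly_storage_mb(events_per_day, kb_per_event=60.0, days=30):
    """Rough monthly system-table storage estimate in MB.
    kb_per_event is an illustrative assumption, not a documented figure."""
    return events_per_day * kb_per_event * days / 1024

for profile, events in [("Small", 50), ("Medium", 250), ("Large", 2500)]:
    print(f"{profile}: ~{monthly_storage_mb(events):.0f} MB/month")
```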
Long-Term Archival
For FISMA compliance and audit requirements exceeding the 90-day system table retention:
# Archive monitoring data to a long-term retention Lakehouse
# Run as a scheduled notebook monthly
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Read current month's data from monitoring system tables
current_month = spark.sql("""
SELECT *, current_timestamp() AS archived_at
FROM system.spark_sessions
WHERE start_time >= DATE_TRUNC('month', DATEADD(MONTH, -1, CURRENT_TIMESTAMP))
AND start_time < DATE_TRUNC('month', CURRENT_TIMESTAMP)
""")
# Write to long-term archive Lakehouse
current_month.write \
.format("delta") \
.mode("append") \
.partitionBy("status") \
.save("Tables/archive_spark_sessions")
Tip: For federal workloads requiring 7-year retention (per NARA records schedules), implement a monthly archival pipeline that copies system table data to a dedicated retention Lakehouse with appropriate sensitivity labels and Purview governance.
Limitations

Current Limitations (Preview)
| Limitation | Details | Workaround |
|---|---|---|
| Preview Status | Feature is in Public Preview; schema may change before GA | Avoid building production-critical alerting during preview |
| Freshness Lag | System tables have 2-5 minute delay | Use real-time dashboards with Data Activator for sub-minute alerting |
| Single Workspace Scope | Each Monitoring Lakehouse covers one workspace only | Use shortcuts to consolidate across workspaces |
| No Cross-Workspace Joins | Cannot natively join system tables across workspaces | Create shortcuts in a central workspace for cross-workspace analysis |
| Query Text Truncation | query_text column is truncated at 64 KB | Sufficient for most queries; very large dynamic SQL may be truncated |
| Dataflow V1 Not Covered | Only Dataflow Gen2 is captured; legacy Dataflow V1 is excluded | Migrate to Dataflow Gen2 for monitoring coverage |
| No Event-Level Granularity | Pipeline monitoring is per-run, not per-activity | Use Data Factory monitoring API for per-activity detail |
| Storage Consumption | System tables consume OneLake storage and CU for writes | Monitor overhead via capacity_metrics table; adjust retention if needed |
| Read-Only | Cannot modify or delete system table records | By design -- ensures audit integrity |
Supported vs. Unsupported Workloads
| Workload | Supported | Notes |
|---|---|---|
| Spark Notebooks | Yes | Full session metadata and resource consumption |
| Spark Job Definitions | Yes | Same schema as notebook sessions |
| SQL Warehouse Queries | Yes | Query text, duration, rows, CU |
| Lakehouse SQL Endpoint | Yes | Same schema as warehouse queries |
| Data Factory Pipelines | Yes | Pipeline-level (not activity-level) |
| Dataflow Gen2 | Yes | Refresh-level metadata |
| Dataflow V1 | No | Legacy workload not supported |
| Power BI Report Renders | No | Use Power BI activity log instead |
| Eventstream Processing | No | Use RTI monitoring in Eventhouse |
| Data Activator | No | Use Activator's built-in history |
References
| Resource | URL |
|---|---|
| Workspace Monitoring Overview | https://learn.microsoft.com/fabric/admin/monitoring-workspace |
| Fabric Capacity Metrics | https://learn.microsoft.com/fabric/enterprise/metrics-app |
| System Tables Reference | https://learn.microsoft.com/fabric/admin/monitoring-workspace-system-tables |
| Data Activator Alerts | https://learn.microsoft.com/fabric/real-time-intelligence/data-activator/data-activator-introduction |
| FISMA Compliance Requirements | https://www.cisa.gov/topics/cyber-threats-and-advisories/federal-information-security-modernization-act |
| Fabric CU Consumption | https://learn.microsoft.com/fabric/enterprise/fabric-operations |
Related Documents
- Real-Time Intelligence -- RTI-specific monitoring via Eventhouse
- Fabric IQ -- Natural language queries against monitoring data
- Data Governance Deep Dive -- Governance and audit trail architecture
- SQL Audit Logs Compliance -- SQL-level audit logging
- Performance Best Practices -- Optimizing workload performance
- Architecture -- System architecture overview
Document Metadata - Author: Documentation Team - Reviewers: Data Engineering, Platform Operations, Security, Compliance - Classification: Internal - Next Review: 2026-07-13