🔔 Alerting & Data Activator Best Practices
Last Updated: 2026-04-15 | Version: 2.0 | Status: ✅ Final | Maintainer: Documentation Team
📖 Overview
Data Activator is Microsoft Fabric's no-code experience for automatically monitoring data and triggering actions when conditions are met. This guide covers Reflex item configuration, alert patterns for casino gaming compliance, federal agency data pipelines, and healthcare workloads, along with integration patterns for Teams, Email, Power Automate, and Azure Logic Apps. It also addresses alert fatigue prevention and operational runbooks.
🧭 Table of Contents
- ⚙️ Data Activator Fundamentals
- 🔧 Reflex Item Configuration
- 🚨 Alert Patterns
- 🔗 Integration Patterns
- 🏢 Domain-Specific Alert Configurations
- 🔇 Alert Fatigue Prevention
- 📊 Monitoring Dashboards
- 📞 On-Call Rotation Patterns
- 📓 Runbook Templates
- ✅ Testing and Validation
⚙️ Data Activator Fundamentals
What is Data Activator?
Data Activator is a Fabric workload that lets you monitor data streams and take automated action when patterns or conditions are detected. It eliminates the need for custom polling services or scheduled checks.
Core Concepts
| Concept | Description |
|---|---|
| Reflex item | A Fabric item that contains triggers, conditions, and actions |
| Object | A monitored entity (e.g., a pipeline, a table, a KPI metric) |
| Trigger | A rule that evaluates a condition on an object's property |
| Condition | A logical expression (threshold, change detection, absence) |
| Action | What happens when the condition is met (email, Teams, Power Automate) |
Architecture

```mermaid
flowchart LR
    subgraph Sources["Data Sources"]
        A[Power BI Report]
        B[Eventstream]
        C[Real-Time Dashboard]
        D[Fabric Pipeline Events]
    end
    subgraph DA["Data Activator"]
        E[Reflex Item]
        E --> F[Object: Pipeline Health]
        E --> G[Object: Data Quality]
        E --> H[Object: SLA Compliance]
        F --> I[Trigger: Error Threshold]
        G --> J[Trigger: DQ Score < 70]
        H --> K[Trigger: No Data in 4h]
    end
    subgraph Actions["Actions"]
        L[Teams Message]
        M[Email Alert]
        N[Power Automate Flow]
    end
    Sources --> DA
    I --> L
    I --> M
    J --> L
    K --> N
```

Setting Up Data Activator
Step 1: Create a Reflex item
- Navigate to your Fabric workspace
- Select New > Reflex
- Name it following the convention:
`reflex_{domain}_{alert_category}` (e.g., `reflex_casino_compliance_alerts`)
Step 2: Connect to a data source
Data Activator can monitor data from:
- Power BI visuals -- right-click a visual and select "Set alert"
- Eventstreams -- route events directly to a Reflex
- Real-Time Dashboards -- set alerts on KQL-backed visuals
- Custom events -- send events via the Reflex REST API
Step 3: Define objects and triggers
Objects represent the entities you want to monitor. Each object has properties that can be tracked over time.
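Custom events (the last source type above) are pushed to the Reflex item over HTTP. The endpoint URL, authentication, and payload schema come from your Reflex item's connection settings in the Fabric portal; everything below -- the endpoint constant, field names, and the `send_reflex_event` helper -- is an illustrative sketch, not the documented API surface.

```python
import json
from datetime import datetime, timezone
from urllib.request import Request, urlopen

# Placeholder -- copy the real URL (and auth requirements) from the Reflex
# item's custom-events connection settings in the Fabric portal.
REFLEX_ENDPOINT = "https://<your-reflex-event-endpoint>"

def send_reflex_event(object_id: str, properties: dict) -> None:
    """Push one event; Data Activator maps its fields onto object properties."""
    event = {
        "id": object_id,  # identifies the monitored object instance
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **properties,     # e.g. dq_score, table_name, check_type
    }
    req = Request(
        REFLEX_ENDPOINT,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(req) as resp:
        resp.read()

# e.g. report a silver-layer data quality score
send_reflex_event("silver.player_transactions", {"dq_score": 64, "check_type": "completeness"})
```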
🔧 Reflex Item Configuration
Naming Convention
`reflex_{domain}_{alert_category}`
Examples:
- `reflex_casino_compliance` -- Casino compliance alerts
- `reflex_federal_data_freshness` -- Federal agency SLA alerts
- `reflex_healthcare_hipaa` -- HIPAA compliance alerts
- `reflex_platform_pipeline_health` -- Cross-domain pipeline alerts
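If Reflex items are provisioned by script, the convention can be enforced mechanically. A minimal sketch; the regex simply encodes the `reflex_{domain}_{alert_category}` pattern above:

```python
import re

# Encodes reflex_{domain}_{alert_category}: lowercase snake_case segments
REFLEX_NAME = re.compile(r"^reflex_[a-z0-9]+(?:_[a-z0-9]+)+$")

def is_valid_reflex_name(name: str) -> bool:
    return REFLEX_NAME.match(name) is not None

assert is_valid_reflex_name("reflex_casino_compliance")
assert is_valid_reflex_name("reflex_federal_data_freshness")
assert not is_valid_reflex_name("ReflexCasinoAlerts")
```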
Object Design
Design objects to represent logical monitoring targets:
| Object | Properties | Source |
|---|---|---|
| Pipeline Run | status, duration, error_count, activity_name | Pipeline events |
| Data Quality Score | dq_score, table_name, check_type, failed_checks | DQ results table |
| Data Freshness | last_load_time, hours_since_update, table_name | Ingestion metadata |
| Compliance Filing | filing_type, deadline, status, days_remaining | Compliance tracker |
| Streaming Lag | consumer_lag, topic, partition, lag_seconds | Eventstream metrics |
Trigger Configuration
Each trigger evaluates a condition and fires an action when met.
Threshold trigger:
Object: Data Quality Score
Property: dq_score
Condition: Is less than 70
Evaluate: Every 15 minutes
Action: Send Teams message
Change detection trigger:
Object: Pipeline Run
Property: status
Condition: Changes to "Failed"
Evaluate: On each event
Action: Send email + Teams message
Absence trigger (data freshness):
Object: Data Freshness
Property: last_load_time
Condition: No new value in 4 hours
Evaluate: Every 30 minutes
Action: Trigger Power Automate flow
🚨 Alert Patterns
Pattern 1: Data Quality Threshold Alerts
Monitor data quality scores across all medallion layers and alert when scores drop below acceptable levels.
```mermaid
flowchart TD
    A[DQ Check Completes] --> B{Score >= 90?}
    B -->|Yes| C[Green: No Action]
    B -->|No| D{Score >= 70?}
    D -->|Yes| E[Yellow: Teams Warning]
    D -->|No| F{Score >= 50?}
    F -->|Yes| G[Orange: Email + Teams]
    F -->|No| H[Red: Email + Teams + PagerDuty]
    H --> I[Halt Downstream Processing]
```

Threshold configuration:
| DQ Score Range | Severity | Action | Pipeline Behavior |
|---|---|---|---|
| 90-100 | INFO | Dashboard update only | Continue processing |
| 70-89 | WARNING | Teams channel notification | Continue with flag |
| 50-69 | HIGH | Email + Teams + incident ticket | Pause downstream, quarantine |
| 0-49 | CRITICAL | Email + Teams + PagerDuty | Halt all processing, escalate |
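The same bands are often needed outside Data Activator too, for example in the notebook that records DQ results and decides whether downstream steps may run. A minimal sketch of the table above; the channel names are placeholders for your own dispatch mechanism:

```python
from typing import NamedTuple

class DQAction(NamedTuple):
    severity: str
    channels: tuple[str, ...]
    halt_downstream: bool

def classify_dq_score(score: float) -> DQAction:
    """Map a 0-100 data quality score to the severity bands in the table above."""
    if score >= 90:
        return DQAction("INFO", ("dashboard",), False)
    if score >= 70:
        return DQAction("WARNING", ("teams",), False)          # continue with flag
    if score >= 50:
        return DQAction("HIGH", ("email", "teams", "ticket"), True)      # pause + quarantine
    return DQAction("CRITICAL", ("email", "teams", "pagerduty"), True)   # halt + escalate

print(classify_dq_score(64))
# DQAction(severity='HIGH', channels=('email', 'teams', 'ticket'), halt_downstream=True)
```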
Pattern 2: Pipeline Failure Alerts
Detect pipeline failures and route alerts based on the pipeline domain and error classification.
Trigger configuration:
Object: Pipeline Run
Property: status
Condition: Equals "Failed"
Filter: environment = "PROD"
Action routing:
IF pipeline_name CONTAINS "compliance" OR "ctr" OR "sar"
THEN PagerDuty + Teams + Email (CRITICAL)
ELSE IF pipeline_name CONTAINS "gold"
THEN Teams + Email (HIGH)
ELSE IF pipeline_name CONTAINS "silver"
THEN Teams (MEDIUM)
ELSE
THEN Dashboard only (LOW)
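When this routing runs in a notebook or a Power Automate step instead of inside the Reflex item, the same first-match rule list looks like this (a sketch; the channel names are placeholders):

```python
def route_pipeline_failure(pipeline_name: str) -> tuple[str, list[str]]:
    """Return (severity, channels) using the first matching rule, top to bottom."""
    name = pipeline_name.lower()
    if any(kw in name for kw in ("compliance", "ctr", "sar")):
        return "CRITICAL", ["pagerduty", "teams", "email"]
    if "gold" in name:
        return "HIGH", ["teams", "email"]
    if "silver" in name:
        return "MEDIUM", ["teams"]
    return "LOW", ["dashboard"]

assert route_pipeline_failure("pl_casino_ctr_filing")[0] == "CRITICAL"
assert route_pipeline_failure("pl_gold_daily_agg")[0] == "HIGH"
```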
Pattern 3: Data Freshness / SLA Alerts
Monitor that each data source is refreshed within its SLA window.
SLA definitions:
| Data Source | Expected Frequency | SLA Warning | SLA Breach |
|---|---|---|---|
| Casino slot telemetry | Every 5 minutes | 15 min late | 30 min late |
| Player transactions | Every 15 minutes | 30 min late | 1 hour late |
| USDA crop data | Daily (6 AM UTC) | 2 hours late | 6 hours late |
| NOAA weather data | Hourly | 2 hours late | 4 hours late |
| EPA AQI data | Hourly | 3 hours late | 6 hours late |
| DOI resource data | Daily | 4 hours late | 12 hours late |
| SBA loan data | Weekly | 1 day late | 3 days late |
| Tribal healthcare records | Daily | 4 hours late | 8 hours late |
Reflex trigger for data freshness:
Object: Data Freshness
Property: hours_since_update
Trigger: SLA Warning
Condition: hours_since_update > sla_warning_hours
Action: Teams notification to #{domain}-data channel
Trigger: SLA Breach
Condition: hours_since_update > sla_breach_hours
Action: Email to data owner + Teams notification + create incident
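A table-driven check keeps one trigger definition serving every source. A minimal sketch using a subset of the SLA table above; thresholds are expressed as hours since the last successful load, adapted from the "late" columns, and the source keys are illustrative:

```python
from datetime import datetime, timezone

# (warning_hours, breach_hours) -- hours since last successful load,
# adapted from the SLA table above (subset shown)
SLA_HOURS = {
    "casino_slot_telemetry": (0.25, 0.5),  # 5-min feed, 15 min / 30 min late
    "usda_crop_data": (26, 30),            # daily feed, 2 h / 6 h late
    "noaa_weather": (3, 5),                # hourly feed, 2 h / 4 h late
    "tribal_healthcare": (28, 32),         # daily feed, 4 h / 8 h late
}

def freshness_status(source: str, last_load_time: datetime) -> str:
    hours = (datetime.now(timezone.utc) - last_load_time).total_seconds() / 3600
    warning, breach = SLA_HOURS[source]
    if hours > breach:
        return "BREACH"   # email data owner + Teams + create incident
    if hours > warning:
        return "WARNING"  # Teams notification to the domain channel
    return "ON_TIME"
```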
Pattern 4: Anomaly Detection Alerts
Detect statistical anomalies in data volumes and key metrics.
Volume anomaly trigger:
Object: Ingestion Metrics
Property: record_count
Condition: Percentage change from 7-day rolling average
WARNING: record_count < 50% of average OR > 200% of average
CRITICAL: record_count < 20% of average OR > 500% of average
Action:
WARNING -> Teams notification
CRITICAL -> Email + Teams + investigation trigger
Metric anomaly detection (KQL-backed):

```kusto
// Flag days whose volume deviates more than `threshold` standard deviations
// from the source's average over the lookback window (Z-score test).
// KQL has no window functions, so the baseline is computed with a
// summarize + join rather than a rolling average.
let lookback = 30d;
let threshold = 3.0; // standard deviations
let daily = ingestion_metrics
    | where timestamp > ago(lookback)
    | summarize daily_count = sum(record_count) by bin(timestamp, 1d), source_system;
daily
| join kind=inner (
    daily
    | summarize baseline_avg = avg(daily_count), baseline_std = stdev(daily_count)
        by source_system
) on source_system
| extend z_score = (daily_count - baseline_avg) / baseline_std
| where abs(z_score) > threshold
```
Pattern 5: Compliance Breach Alerts
Domain-specific compliance monitoring with zero tolerance for missed deadlines.
Casino compliance triggers:
| Trigger | Condition | SLA | Action |
|---|---|---|---|
| CTR filing deadline | CTR not filed within 15 days | 15 calendar days | CRITICAL: Compliance Officer + CTO |
| SAR pattern detected | Structuring pattern identified | Immediate review | HIGH: BSA Officer + Investigation Team |
| W-2G issuance | Jackpot > threshold, W-2G not generated | Before player leaves | CRITICAL: Floor Manager + Compliance |
| MICS audit gap | Required MICS check not completed | Daily | HIGH: Compliance Team |
Federal agency compliance triggers:
| Trigger | Condition | SLA | Action |
|---|---|---|---|
| USDA food recall detected | Recall notice in USDA feed | 1 hour | CRITICAL: Notify downstream consumers |
| NOAA severe weather | Extreme weather event in monitored area | 15 minutes | HIGH: Operations + Safety teams |
| EPA AQI hazardous | AQI > 300 in covered region | 30 minutes | HIGH: Public health notification |
| HIPAA breach indicator | PHI access pattern anomaly | Immediate | CRITICAL: Privacy Officer + Security |
🔗 Integration Patterns
Microsoft Teams Integration
Direct Reflex-to-Teams:
Data Activator can send messages directly to Teams channels or individual users.
Reflex Trigger -> Teams Adaptive Card
Channel: #{workspace}-alerts
Card includes:
- Alert severity badge
- Pipeline/table name
- Error summary
- Link to Fabric monitoring hub
- Action buttons: Acknowledge, Investigate, Dismiss
Teams Adaptive Card template (JSON):
```json
{
  "type": "AdaptiveCard",
  "version": "1.4",
  "body": [
    {
      "type": "Container",
      "style": "attention",
      "items": [
        {
          "type": "TextBlock",
          "text": "${severity} Alert: ${alert_title}",
          "weight": "Bolder",
          "size": "Large"
        }
      ]
    },
    {
      "type": "FactSet",
      "facts": [
        {"title": "Pipeline:", "value": "${pipeline_name}"},
        {"title": "Error:", "value": "${error_classification}"},
        {"title": "Time:", "value": "${error_timestamp}"},
        {"title": "Environment:", "value": "${environment}"},
        {"title": "Source:", "value": "${source_system}"}
      ]
    },
    {
      "type": "TextBlock",
      "text": "${error_message}",
      "wrap": true,
      "maxLines": 3
    }
  ],
  "actions": [
    {
      "type": "Action.OpenUrl",
      "title": "View in Fabric",
      "url": "${fabric_monitoring_url}"
    },
    {
      "type": "Action.OpenUrl",
      "title": "View Runbook",
      "url": "${runbook_url}"
    }
  ]
}
```
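If you post this card through a Teams incoming webhook (for example from a Power Automate HTTP action or a notebook) rather than Data Activator's built-in Teams action, it must be wrapped in a `message` envelope. A sketch, assuming the template above is loaded as a string; the webhook URL is a placeholder and the token substitution is deliberately naive:

```python
import json
from urllib.request import Request, urlopen

WEBHOOK_URL = "https://<tenant>.webhook.office.com/webhookb2/<id>"  # placeholder

def post_alert_card(card_template: str, values: dict) -> None:
    """Fill ${...} tokens in the card template, wrap it, and post it to Teams."""
    card_json = card_template
    for key, val in values.items():
        # Naive substitution: real code should JSON-escape val first
        card_json = card_json.replace("${" + key + "}", str(val))
    envelope = {
        "type": "message",
        "attachments": [{
            "contentType": "application/vnd.microsoft.card.adaptive",
            "content": json.loads(card_json),
        }],
    }
    req = Request(
        WEBHOOK_URL,
        data=json.dumps(envelope).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urlopen(req).read()
```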
Email Alert Integration
Email alert configuration:
To: ${alert_recipients} (based on domain + severity)
Subject: [${severity}] Fabric Alert: ${alert_title} - ${pipeline_name}
Body:
Alert Details:
- Pipeline: ${pipeline_name}
- Activity: ${activity_name}
- Classification: ${error_classification}
- Time: ${error_timestamp}
- Error: ${error_message}
Recommended Action:
${runbook_first_action}
Links:
- Fabric Monitoring: ${monitoring_url}
- Error Dashboard: ${dashboard_url}
- Runbook: ${runbook_url}
Email recipient mapping:
| Domain | Severity | Recipients |
|---|---|---|
| Casino Compliance | CRITICAL | compliance-officer@, cto@, oncall-engineer@ |
| Casino Compliance | HIGH | bsa-officer@, data-engineering@ |
| Federal (all) | CRITICAL | agency-liaison@, pm@, oncall-engineer@ |
| Federal (all) | HIGH | data-engineering@, pm@ |
| Healthcare/HIPAA | CRITICAL | privacy-officer@, legal@, ciso@, oncall-engineer@ |
| Healthcare/HIPAA | HIGH | privacy-officer@, data-engineering@ |
| Platform | CRITICAL | oncall-engineer@, engineering-manager@ |
| Platform | HIGH | data-engineering@ |
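The mapping table translates directly into a lookup keyed by (domain, severity). A minimal sketch with the addresses from the table and a conservative default for unmapped combinations; the domain key strings are illustrative:

```python
RECIPIENTS = {
    ("casino_compliance", "CRITICAL"): ["compliance-officer@", "cto@", "oncall-engineer@"],
    ("casino_compliance", "HIGH"):     ["bsa-officer@", "data-engineering@"],
    ("federal", "CRITICAL"):           ["agency-liaison@", "pm@", "oncall-engineer@"],
    ("federal", "HIGH"):               ["data-engineering@", "pm@"],
    ("healthcare", "CRITICAL"):        ["privacy-officer@", "legal@", "ciso@", "oncall-engineer@"],
    ("healthcare", "HIGH"):            ["privacy-officer@", "data-engineering@"],
    ("platform", "CRITICAL"):          ["oncall-engineer@", "engineering-manager@"],
    ("platform", "HIGH"):              ["data-engineering@"],
}

def alert_recipients(domain: str, severity: str) -> list[str]:
    # MEDIUM/LOW and unmapped combinations fall back to the engineering list
    return RECIPIENTS.get((domain, severity), ["data-engineering@"])
```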
Power Automate Integration
Use Power Automate for complex multi-step alert workflows that go beyond simple notifications.
```mermaid
flowchart TD
    A[Data Activator Trigger] --> B[Power Automate Flow]
    B --> C[Parse Alert Payload]
    C --> D{Severity?}
    D -->|CRITICAL| E[Create Incident in ServiceNow/Jira]
    D -->|HIGH| F[Create Work Item]
    D -->|MEDIUM| G[Add to Backlog]
    E --> H[Send Teams Alert with Incident Link]
    F --> H
    G --> I[Send Teams Summary]
    H --> J[Update Error Record: alert_sent = 1]
    I --> J
    J --> K{Auto-remediation Available?}
    K -->|Yes| L[Trigger Remediation Flow]
    K -->|No| M[Wait for Manual Resolution]
    L --> N[Verify Remediation]
    N -->|Success| O[Auto-resolve Error]
    N -->|Failed| P[Escalate to Next Tier]
```

Power Automate flow examples:
| Flow Name | Trigger | Actions |
|---|---|---|
| FA_Critical_Pipeline_Alert | Reflex fires CRITICAL | Create incident, notify Teams, page on-call |
| FA_DQ_Quarantine_Handler | DQ score < 50 | Halt downstream pipelines, create investigation task |
| FA_SLA_Breach_Escalation | SLA breached by 2x | Escalate to management, send stakeholder update |
| FA_Auto_Retry_Transient | Transient error after max retries | Trigger pipeline re-run with increased resources |
| FA_Compliance_Deadline_Warning | Filing deadline in 48 hours | Notify compliance team, create priority task |
Azure Logic Apps Integration
For enterprise-grade integrations that need guaranteed delivery, use Azure Logic Apps.
```mermaid
flowchart LR
    A[Reflex Trigger] --> B[Event Grid]
    B --> C[Logic App]
    C --> D{Route by Type}
    D -->|Incident| E[ServiceNow API]
    D -->|Notification| F[SendGrid Email]
    D -->|Escalation| G[PagerDuty API]
    D -->|Audit| H[Log Analytics]
    E --> I[Return Incident ID]
    I --> J[Update Fabric Error Record]
```

🏢 Domain-Specific Alert Configurations
Casino Gaming Compliance
Reflex: reflex_casino_compliance
Objects:
- CTR Filing Status
- SAR Detection Pipeline
- W-2G Generation
- MICS Audit Checks
- Player Exclusion List
Triggers:
1. CTR_Deadline_Warning
Property: days_to_deadline
Condition: days_to_deadline <= 3
Action: Email compliance team
2. CTR_Deadline_Critical
Property: days_to_deadline
Condition: days_to_deadline <= 1
Action: PagerDuty + Email + Teams
3. SAR_Pattern_Detected
Property: structuring_score
Condition: structuring_score > 0.8
Action: Email BSA officer + create investigation
4. W2G_Generation_Failed
Condition: jackpot_amount > threshold AND w2g_status = "failed"
Action: CRITICAL alert to floor manager
5. Player_Exclusion_Sync_Failed
Condition: sync_status = "failed"
Action: HIGH alert - manual verification required
USDA Agricultural Data
Reflex: reflex_usda_data_monitoring
Triggers:
1. Food_Recall_Alert
Source: USDA recall feed
Condition: New recall notice detected
Action: CRITICAL - notify all downstream consumers, halt affected pipelines
2. Crop_Data_Freshness
Source: NASS crop production table
Condition: No update in 36 hours (expected daily)
Action: HIGH - check NASS API status, notify data team
3. USDA_API_Rate_Limit
Source: Pipeline error log
Condition: 429 errors > 5 in 1 hour
Action: MEDIUM - reduce polling frequency, notify ops
4. Livestock_Anomaly
Source: Gold layer livestock metrics
Condition: Value deviates > 3 sigma from 30-day average
Action: MEDIUM - flag for analyst review
NOAA Weather Data
Reflex: reflex_noaa_weather_alerts
Triggers:
1. Severe_Weather_Event
Source: NOAA weather alerts feed
Condition: Severity = "Extreme" or "Severe"
Action: HIGH - notify operations, trigger contingency pipelines
2. Weather_Data_Gap
Source: Hourly station data
Condition: > 20% of stations missing in reporting window
Action: MEDIUM - log gap, check NOAA service status
3. Climate_Data_Anomaly
Source: Temperature/precipitation aggregates
Condition: Daily value > historical max/min for region
Action: LOW - flag for climatology review
EPA Air Quality
Reflex: reflex_epa_aqi_monitoring
Triggers:
1. Hazardous_AQI
Source: EPA AQI real-time feed
Condition: AQI > 300 (Hazardous)
Action: CRITICAL - public health notification chain
2. Unhealthy_AQI
Source: EPA AQI real-time feed
Condition: AQI > 150 (Unhealthy)
Action: HIGH - notify health and safety teams
3. AQI_Data_Staleness
Source: Ingestion metadata
Condition: No new AQI data in 3 hours
Action: MEDIUM - check EPA AirNow API
4. Emission_Threshold_Breach
Source: Gold layer emission aggregates
Condition: Facility emission exceeds permit threshold
Action: HIGH - notify environmental compliance
Tribal Healthcare (HIPAA)
Reflex: reflex_healthcare_hipaa
Triggers:
1. PHI_Access_Anomaly
Source: Audit log analysis
Condition: Access pattern deviates from baseline (volume, timing, user)
Action: CRITICAL - notify Privacy Officer + Security
2. Consent_Record_Missing
Source: Patient processing pipeline
Condition: Record processed without valid consent on file
Action: CRITICAL - halt processing, quarantine record
3. De_Identification_Failure
Source: PHI de-identification pipeline
Condition: Output contains PII patterns (SSN, full name, DOB combination)
Action: CRITICAL - halt output, purge, notify Privacy Officer
4. Healthcare_Data_Freshness
Source: Ingestion metadata
Condition: Clinical data > 8 hours stale
Action: HIGH - notify clinical data team
5. 42CFR_Part2_Violation_Risk
Source: Substance abuse data pipeline
Condition: Data accessed by non-authorized role
Action: CRITICAL - immediate access revocation + investigation
DOT/FAA Transportation
Reflex: reflex_dot_faa_monitoring
Triggers:
1. Safety_Incident_Detected
Source: FAA incident data feed
Condition: New incident classified as "serious" or "fatal"
Action: CRITICAL - notify safety analysis team
2. FedRAMP_Compliance_Check
Source: Security compliance pipeline
Condition: FedRAMP control check failure
Action: HIGH - notify security team + compliance officer
3. Transportation_Data_SLA
Source: Ingestion metadata
Condition: DOT data > 12 hours stale
Action: MEDIUM - check source availability
🔇 Alert Fatigue Prevention
Alert fatigue occurs when too many notifications desensitize the team, causing critical alerts to be missed. Follow these strategies to maintain alert effectiveness.
Severity Level Discipline
Only use CRITICAL and HIGH for situations that genuinely require immediate human intervention.
| Severity | Criteria | Expected Volume |
|---|---|---|
| CRITICAL | Data loss risk, compliance violation, security breach | < 1 per week |
| HIGH | Production impact, SLA breach, data quality failure | < 5 per week |
| MEDIUM | Degraded performance, non-critical pipeline failure | < 20 per week |
| LOW | Informational, minor anomaly, dev/staging issues | Unlimited (dashboard only) |
Rule of thumb: If you receive more than 3 CRITICAL alerts per week on average, your thresholds are too sensitive.
Alert Grouping
Group related alerts to reduce noise:
Instead of:
Alert: Pipeline A failed (activity 1)
Alert: Pipeline A failed (activity 2)
Alert: Pipeline A failed (activity 3)
Send:
Alert: Pipeline A failed (3 activities). First failure: activity 1 at 10:15 UTC
Implementation: Use the correlation_id field to group alerts from the same pipeline run.
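A sketch of the grouping step, assuming each raw alert record carries the `correlation_id`, `pipeline_name`, `activity_name`, and `error_timestamp` fields used elsewhere in this guide:

```python
from collections import defaultdict

def group_alerts(raw_alerts: list[dict]) -> list[str]:
    """Collapse per-activity failures from one pipeline run into one message."""
    runs: dict[str, list[dict]] = defaultdict(list)
    for alert in raw_alerts:
        runs[alert["correlation_id"]].append(alert)
    messages = []
    for alerts in runs.values():
        alerts.sort(key=lambda a: a["error_timestamp"])
        first = alerts[0]
        messages.append(
            f"Alert: {first['pipeline_name']} failed ({len(alerts)} activities). "
            f"First failure: {first['activity_name']} at {first['error_timestamp']}"
        )
    return messages
```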
Cool-Down Periods
Prevent repeated alerts for the same issue:
| Severity | Cool-Down Period | Behavior |
|---|---|---|
| CRITICAL | 15 minutes | Re-alert if still unresolved after cool-down |
| HIGH | 1 hour | Re-alert once after cool-down, then daily digest |
| MEDIUM | 4 hours | Daily digest only |
| LOW | No repeat | Dashboard only, never re-alert |
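A minimal in-memory sketch of the cool-down check. A production version would persist last-delivery times in a table so suppression survives restarts, and would add the digest behaviors from the table, which are omitted here:

```python
from datetime import datetime, timedelta, timezone

COOL_DOWN = {
    "CRITICAL": timedelta(minutes=15),
    "HIGH": timedelta(hours=1),
    "MEDIUM": timedelta(hours=4),
}
_last_sent: dict[str, datetime] = {}  # alert_key -> last delivery time

def should_send(alert_key: str, severity: str) -> bool:
    """True if no alert with this key was delivered within its cool-down window."""
    if severity == "LOW":
        return False  # dashboard only, never re-alert
    now = datetime.now(timezone.utc)
    last = _last_sent.get(alert_key)
    if last is not None and now - last < COOL_DOWN[severity]:
        return False  # still cooling down
    _last_sent[alert_key] = now
    return True
```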
Alert Suppression Windows
Suppress alerts during known maintenance windows:
Suppression Rule: maintenance_window
Schedule: Sunday 02:00-06:00 UTC
Affected: All non-CRITICAL alerts
Behavior: Queue alerts, deliver as batch summary after window
Suppression Rule: deployment_window
Trigger: When deployment pipeline starts
Duration: Until deployment pipeline completes + 30 minutes
Affected: MEDIUM and LOW alerts
Behavior: Suppress entirely
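The fixed weekly window reduces to a simple time test. A sketch of the `maintenance_window` rule above; queueing and post-window batch delivery are stubbed out, and `send` stands in for your real channel dispatch:

```python
from datetime import datetime, time, timezone

def send(alert: dict) -> None:
    print("deliver:", alert)  # stand-in for real channel dispatch

def in_maintenance_window(ts: datetime | None = None) -> bool:
    """Sunday 02:00-06:00 UTC, per the suppression rule above (Sunday == weekday 6)."""
    ts = ts or datetime.now(timezone.utc)
    return ts.weekday() == 6 and time(2, 0) <= ts.time() < time(6, 0)

def deliver_or_queue(alert: dict, queue: list) -> None:
    # CRITICAL alerts bypass the window; everything else joins the batch summary
    if alert["severity"] != "CRITICAL" and in_maintenance_window():
        queue.append(alert)
    else:
        send(alert)
```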
Weekly Alert Health Review
Schedule a weekly review to calibrate alert effectiveness:
## Weekly Alert Health Review
| Metric | Target | This Week | Status |
|--------|--------|-----------|--------|
| CRITICAL alerts | < 1/week | __ | OK/REVIEW |
| HIGH alerts | < 5/week | __ | OK/REVIEW |
| False positive rate | < 10% | __% | OK/REVIEW |
| Mean acknowledgment time | < 15 min (CRITICAL) | __ min | OK/REVIEW |
| Unacknowledged alerts | 0 | __ | OK/REVIEW |
### Actions from Review:
- [ ] Adjust threshold for [alert name] -- too sensitive/not sensitive enough
- [ ] Remove/merge alert [name] -- redundant with [other alert]
- [ ] Add alert for [gap identified]
📊 Monitoring Dashboards
Power BI Alert Operations Dashboard
Create a dedicated Power BI report for alert operations:
Page 1: Alert Overview
| Visual | Type | Data |
|---|---|---|
| Active Alerts by Severity | Stacked bar chart | Count of unresolved alerts by severity |
| Alert Trend (7 days) | Line chart | Alert count per hour, colored by severity |
| Mean Time to Acknowledge | Card | Average minutes from alert to first response |
| Mean Time to Resolve | Card | Average minutes from alert to resolution |
| Alert Distribution by Domain | Donut chart | Percentage of alerts per domain |
| Top 5 Alerting Pipelines | Table | Pipeline name, alert count, last alert time |
Page 2: SLA Compliance
| Visual | Type | Data |
|---|---|---|
| SLA Status by Source | Matrix | Source vs SLA status (On Time / Warning / Breached) |
| Data Freshness Heatmap | Matrix | Source x Hour showing last update recency |
| SLA Breach Trend | Area chart | Breach count by day over 30 days |
| Active SLA Breaches | Table | Source, expected time, actual time, breach duration |
Page 3: Alert Effectiveness
| Visual | Type | Data |
|---|---|---|
| False Positive Rate | Gauge | Percentage of alerts dismissed as false positive |
| Alert Resolution by Classification | Bar chart | Resolution rate by error classification |
| Noise Ratio | KPI | Ratio of LOW/MEDIUM alerts to CRITICAL/HIGH |
| Repeat Alert Rate | Card | Percentage of alerts that are repeats of unresolved issues |
KQL Real-Time Dashboard
For near-zero-latency monitoring using Eventhouse/KQL:
```kusto
// Active alerts summary (refresh every 30 seconds)
pipeline_errors
| where is_resolved == false
| where environment == "PROD"
| summarize
    alert_count = count(),
    oldest_alert = min(error_timestamp),
    newest_alert = max(error_timestamp)
    by severity
| extend hours_oldest = datetime_diff('hour', now(), oldest_alert)
// KQL case() takes boolean predicates, not a switch value
| extend severity_rank = case(
    severity == "CRITICAL", 1,
    severity == "HIGH", 2,
    severity == "MEDIUM", 3,
    severity == "LOW", 4,
    5)
| order by severity_rank asc
```
```kusto
// Data freshness monitor
ingestion_metadata
| summarize last_load = max(load_timestamp) by source_system, table_name
| extend hours_since_update = datetime_diff('hour', now(), last_load)
| extend freshness_status = case(
    hours_since_update <= 1, "Fresh",
    hours_since_update <= 4, "Warning",
    hours_since_update <= 8, "Stale",
    "Critical"
)
| order by hours_since_update desc
```
📞 On-Call Rotation Patterns
Rotation Structure
Primary On-Call: Handles all CRITICAL and HIGH alerts
Secondary On-Call: Backup if primary doesn't acknowledge within 15 minutes
Domain Expert: Escalation for domain-specific issues (compliance, healthcare)
Rotation: Weekly, handoff on Monday 09:00 UTC
On-Call Schedule Template
| Week | Primary | Secondary | Casino Expert | Federal Expert | Healthcare Expert |
|---|---|---|---|---|---|
| 1 | Engineer A | Engineer B | Analyst C | Analyst D | Analyst E |
| 2 | Engineer B | Engineer C | Analyst C | Analyst D | Analyst E |
| 3 | Engineer C | Engineer A | Analyst C | Analyst D | Analyst E |
Handoff Checklist
## On-Call Handoff: Week N -> Week N+1
### Open Issues
- [ ] Issue 1: [description, current status, next step]
- [ ] Issue 2: [description, current status, next step]
### Recent Changes
- [Date]: Deployed [change] to [environment]
- [Date]: Modified alert threshold for [alert name]
### Known Risks
- [Risk 1]: [description and mitigation]
### Key Contacts
- Casino Compliance: [name, phone]
- Healthcare Privacy: [name, phone]
- Federal Liaison: [name, phone]
- Engineering Manager: [name, phone]
📓 Runbook Templates
Template: Pipeline Failure Runbook
# Runbook: Pipeline Failure - [Pipeline Name]
## Alert Details
- **Severity:** [CRITICAL / HIGH / MEDIUM / LOW]
- **Domain:** [Casino / Federal / Healthcare / Platform]
- **Expected SLA:** [Response time requirement]
## Symptoms
- Pipeline run failed with status [status]
- Error message: [typical error pattern]
- Affected downstream: [list of dependent pipelines/reports]
## Diagnosis Steps
### Step 1: Check Pipeline Run Details
1. Open Fabric workspace > Monitoring Hub
2. Find the failed pipeline run
3. Review the failed activity and error message
### Step 2: Check Error Classification
1. Query the pipeline_errors table:
```sql
SELECT * FROM dbo.pipeline_errors
WHERE pipeline_run_id = '<run_id>'
ORDER BY error_timestamp DESC;
```
2. Note the error_classification and severity
### Step 3: Classification-Specific Diagnosis
**TRANSIENT:**
- Check source system availability
- Review retry count (did retries exhaust?)
- Check Fabric capacity utilization in Admin Portal
**PERMANENT:**
- Check for schema changes in source
- Verify connection credentials
- Review recent code deployments
**DATA_QUALITY:**
- Check quarantine table for affected records
- Review data quality check results
- Contact source system owner if needed
**PERMISSION:**
- Verify service principal/managed identity permissions
- Check workspace access settings
- Review recent permission changes
## Resolution Steps
1. [Step-by-step resolution for this specific pipeline]
2. [Include commands, UI steps, and verification]
## Verification
- [ ] Pipeline re-run succeeds
- [ ] Downstream pipelines complete successfully
- [ ] Data quality scores are within threshold
- [ ] No data gaps in target tables
## Post-Incident
- [ ] Update error record: resolved = true
- [ ] Document root cause in resolution_notes
- [ ] Create follow-up task if structural fix needed
- [ ] Update this runbook if new failure mode discovered
Template: Data Freshness SLA Breach
# Runbook: Data Freshness SLA Breach - [Source System]
## Alert Details
- **Source:** [USDA / NOAA / EPA / Casino Telemetry / etc.]
- **Expected Frequency:** [Hourly / Daily / Weekly]
- **SLA Warning:** [Hours since expected update]
- **SLA Breach:** [Hours since expected update]
## Diagnosis Steps
### Step 1: Verify Source Availability
1. Check source API/endpoint status: [URL]
2. Check for known outages: [Status page URL]
3. Test connectivity from Fabric gateway (if applicable)
### Step 2: Check Pipeline Status
1. Is the ingestion pipeline running?
2. Was it scheduled but didn't trigger?
3. Did it run but produce 0 records?
### Step 3: Check for Upstream Issues
1. Has the source schema changed?
2. Is the source providing data (check their last update time)?
3. Are there API quota/rate limit issues?
## Resolution by Root Cause
**Source is down:**
- Document the outage
- Set up monitoring for source recovery
- Notify stakeholders about expected delay
**Pipeline failed silently:**
- Check pipeline error logs
- Review notification configuration
- Fix and re-run pipeline
**Schema change:**
- Compare expected vs actual schema
- Update mapping configuration
- Test with sample data before full re-run
## Verification
- [ ] Data is flowing again
- [ ] No gaps in the time series
- [ ] Downstream aggregations are recalculated
- [ ] SLA breach documented for reporting
✅ Testing and Validation
Alert Testing Checklist
Before deploying alerts to production, validate each alert:
| Test | Method | Expected Result |
|---|---|---|
| Trigger fires correctly | Inject test data that meets condition | Alert received within expected latency |
| Correct recipients | Verify delivery to all configured channels | All recipients receive the alert |
| Alert content is actionable | Review message content | Contains enough info to diagnose without context-switching |
| Cool-down works | Trigger twice in rapid succession | Second alert is suppressed within cool-down window |
| Escalation works | Don't acknowledge within SLA | Escalation notification sent to next tier |
| Suppression window works | Trigger during maintenance window | Alert is queued, not delivered immediately |
| False positive rate | Run for 1 week in shadow mode | < 10% false positive rate |
Shadow Mode Deployment
Deploy new alerts in shadow mode first:
- Create the Reflex trigger with all conditions
- Set the action to log only (write to a shadow_alerts table)
- Run for 1-2 weeks
- Analyze false positive rate and alert volume
- Adjust thresholds based on observed patterns
- Enable live alerting once confident
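A sketch of the log-only action from the second step, appending would-be alerts to a `shadow_alerts` table for later calibration; the Spark session, schema, and table name assume a Fabric notebook context and are illustrative:

```python
from datetime import datetime, timezone

def log_shadow_alert(spark, trigger_name: str, payload: dict) -> None:
    """Record what *would* have alerted, instead of notifying anyone."""
    rows = [(trigger_name, str(payload), datetime.now(timezone.utc))]
    df = spark.createDataFrame(
        rows, "trigger_name string, payload string, fired_at timestamp"
    )
    df.write.mode("append").saveAsTable("monitoring.shadow_alerts")
```

Alert volume and false-positive rate over the shadow period can then be queried straight from this table to calibrate thresholds before enabling live actions.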
⭐ Summary
Effective alerting in Microsoft Fabric requires:
- Data Activator as the primary trigger engine for no-code, real-time monitoring
- Domain-aware alert routing that sends compliance and healthcare alerts to the right people
- Multi-channel delivery through Teams, Email, Power Automate, and PagerDuty
- Alert fatigue prevention through severity discipline, grouping, cool-downs, and weekly reviews
- Operational dashboards for visibility into alert health and SLA compliance
- Runbook discipline so every alert has a clear resolution path
- Testing before deployment to validate thresholds and reduce false positives
📚 Related Documents
| Document | Description |
|---|---|
| Error Handling & Monitoring | Error architecture and classification |
| Performance & Parallelism | Performance monitoring baselines |
| Disaster Recovery | Recovery procedures |
| Security Guide | Compliance and access controls |