Alerting Strategies¶
Best practices for alerting across Azure analytics platforms.
Overview¶
Effective alerting enables proactive issue detection while minimizing alert fatigue. This guide covers strategies for configuring meaningful alerts.
Alert Hierarchy¶
Severity Levels¶
| Severity | Response Time | Examples |
|---|---|---|
| Sev 0 (Critical) | Immediate | Data loss risk, security breach |
| Sev 1 (High) | < 1 hour | Pipeline failures, service degradation |
| Sev 2 (Medium) | < 4 hours | Performance degradation, warnings |
| Sev 3 (Low) | Next business day | Informational, capacity planning |
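Encoding this matrix as configuration keeps routing and reporting logic consistent with the documented SLAs. A minimal sketch; the names and the one-day approximation of "next business day" are illustrative, not a required schema:
# Minimal sketch: the severity matrix above as configuration.
# Names are illustrative; "next business day" is approximated as one day.
from datetime import timedelta

SEVERITY_SLA = {
    "sev0": {"label": "Critical", "response": timedelta(0)},
    "sev1": {"label": "High", "response": timedelta(hours=1)},
    "sev2": {"label": "Medium", "response": timedelta(hours=4)},
    "sev3": {"label": "Low", "response": timedelta(days=1)},
}

def is_response_overdue(severity: str, minutes_open: int) -> bool:
    """Return True when an open alert has exceeded its response-time SLA."""
    sla = SEVERITY_SLA.get(severity)
    return sla is not None and timedelta(minutes=minutes_open) > sla["response"]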
Alert Categories¶
categories:
  availability:
    - Service health
    - Endpoint connectivity
    - Resource availability
  performance:
    - Query latency
    - Throughput degradation
    - Resource utilization
  data_quality:
    - Schema violations
    - Null rate thresholds
    - Freshness SLAs
  security:
    - Authentication failures
    - Access anomalies
    - Compliance violations
  cost:
    - Budget thresholds
    - Anomalous spending
    - Resource waste
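Assigning a category up front also makes the taxonomy enforceable at deploy time. A minimal sketch that validates alert definitions against the categories above; the `category` field on alert definitions is an assumption for illustration:
# Minimal sketch: reject alert definitions whose category is not in the
# taxonomy above. The `category` field on definitions is an assumption.
ALERT_CATEGORIES = {
    "availability", "performance", "data_quality", "security", "cost",
}

def validate_alert_definition(alert: dict) -> None:
    """Raise if an alert definition uses an unknown category."""
    category = alert.get("category")
    if category not in ALERT_CATEGORIES:
        raise ValueError(f"Unknown alert category: {category!r}")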
Alert Configuration¶
Azure Monitor Alerts¶
{
  "type": "Microsoft.Insights/metricAlerts",
  "apiVersion": "2018-03-01",
  "name": "synapse-pipeline-failure-alert",
  "properties": {
    "description": "Alert when Synapse pipeline runs end in failure",
    "severity": 2,
    "enabled": true,
    "scopes": ["/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.Synapse/workspaces/{workspace}"],
    "evaluationFrequency": "PT5M",
    "windowSize": "PT15M",
    "criteria": {
      "odata.type": "Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria",
      "allOf": [
        {
          "name": "FailedPipelineRuns",
          "metricName": "IntegrationPipelineRunsEnded",
          "dimensions": [
            {
              "name": "Result",
              "operator": "Include",
              "values": ["Failed"]
            }
          ],
          "operator": "GreaterThan",
          "threshold": 0,
          "timeAggregation": "Total"
        }
      ]
    },
    "actions": [
      {
        "actionGroupId": "/subscriptions/{sub}/resourceGroups/{rg}/providers/microsoft.insights/actionGroups/data-platform-ops"
      }
    ]
  }
}
Log-Based Alerts¶
// Pipeline failure alert query: one row per pipeline per 15-minute window
// (every summarized row already represents at least one failure)
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.SYNAPSE"
| where Category == "IntegrationPipelineRuns"
| where Status == "Failed"
| summarize FailureCount = count() by PipelineName, bin(TimeGenerated, 15m)
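Before wiring a query like this into a scheduled alert rule, it helps to dry-run it against the workspace. A minimal sketch using the azure-monitor-query SDK; the workspace ID is a placeholder, and a complete (non-partial) result is assumed:
# Minimal sketch: dry-run the failure query before attaching it to an alert
# rule. Requires `azure-identity` and `azure-monitor-query`; the workspace ID
# is a placeholder, and a complete (non-partial) result is assumed.
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

QUERY = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.SYNAPSE"
| where Category == "IntegrationPipelineRuns"
| where Status == "Failed"
| summarize FailureCount = count() by PipelineName, bin(TimeGenerated, 15m)
"""

client = LogsQueryClient(DefaultAzureCredential())
response = client.query_workspace(
    "<workspace-id>", QUERY, timespan=timedelta(hours=24)
)
for table in response.tables:
    for row in table.rows:
        print(dict(zip(table.columns, row)))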
Alert Patterns¶
Threshold-Based¶
# Dynamic threshold calculation
import numpy as np

def calculate_dynamic_threshold(
    metric_history: list,
    sensitivity: str = "medium"
) -> tuple:
    """Calculate dynamic alert thresholds based on historical data."""
    data = np.array(metric_history)
    mean = np.mean(data)
    std = np.std(data)
    # Wider bands mean fewer alerts; narrower bands mean more sensitivity
    multipliers = {
        "low": 3.0,     # Fewer alerts
        "medium": 2.0,  # Balanced
        "high": 1.5     # More sensitive
    }
    multiplier = multipliers.get(sensitivity, 2.0)
    upper_threshold = mean + (std * multiplier)
    lower_threshold = max(0, mean - (std * multiplier))
    return lower_threshold, upper_threshold
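For example, feeding in a window of recent latency samples (values illustrative):
# Example: derive alert bounds from recent latency samples (ms).
history = [220, 235, 210, 260, 245, 230, 250, 240]  # illustrative values
low, high = calculate_dynamic_threshold(history, sensitivity="medium")
print(f"Alert outside [{low:.0f} ms, {high:.0f} ms]")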
Anomaly Detection¶
// Anomaly detection for query performance: compare the last hour to a
// 7-day baseline using a z-score. The single-row baseline is attached to
// every current bin via a constant join key (KQL joins cannot use !=).
let baseline =
    PerformanceMetrics
    | where TimeGenerated > ago(7d) and TimeGenerated < ago(1d)
    | summarize AvgLatency = avg(QueryLatencyMs), StdDevLatency = stdev(QueryLatencyMs)
    | extend JoinKey = 1;
PerformanceMetrics
| where TimeGenerated > ago(1h)
| summarize CurrentLatency = avg(QueryLatencyMs) by bin(TimeGenerated, 5m)
| extend JoinKey = 1
| join kind=inner baseline on JoinKey
| extend Zscore = (CurrentLatency - AvgLatency) / StdDevLatency
| where abs(Zscore) > 3
| project TimeGenerated, CurrentLatency, Zscore, Alert = "Anomaly Detected"
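The same 3-sigma test is easy to reproduce client-side when tuning the threshold. A minimal sketch with illustrative numbers:
# Minimal sketch: the same 3-sigma test as the KQL above, client-side.
# Baseline statistics would come from the 7-day window; values illustrative.
def is_anomalous(current: float, baseline_mean: float, baseline_std: float,
                 z_threshold: float = 3.0) -> bool:
    """Flag a reading whose z-score exceeds the threshold."""
    if baseline_std == 0:
        return False  # no variance in the baseline; cannot score
    return abs(current - baseline_mean) / baseline_std > z_threshold

print(is_anomalous(current=950.0, baseline_mean=400.0, baseline_std=120.0))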
Composite Alerts¶
{
  "type": "Microsoft.Insights/scheduledQueryRules",
  "apiVersion": "2021-08-01",
  "name": "pipeline-health-composite-alert",
  "properties": {
    "displayName": "Pipeline Health Composite Alert",
    "description": "Alert when more than three pipeline failures occur in the evaluation window",
    "severity": 1,
    "enabled": true,
    "scopes": ["/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.OperationalInsights/workspaces/{workspace}"],
    "evaluationFrequency": "PT5M",
    "windowSize": "PT15M",
    "criteria": {
      "allOf": [
        {
          "query": "AzureDiagnostics | where Category == 'IntegrationPipelineRuns' | where Status == 'Failed' | summarize FailCount = count() | where FailCount > 3",
          "timeAggregation": "Count",
          "operator": "GreaterThan",
          "threshold": 0
        }
      ]
    }
  }
}
Alert Routing¶
Action Groups¶
resource actionGroup 'Microsoft.Insights/actionGroups@2022-06-01' = {
  name: 'ag-data-platform'
  location: 'global'
  properties: {
    groupShortName: 'DataPlatform'
    enabled: true
    emailReceivers: [
      {
        name: 'PlatformTeam'
        emailAddress: 'data-platform@company.com'
        useCommonAlertSchema: true
      }
    ]
    smsReceivers: [
      {
        name: 'OnCall'
        countryCode: '1'
        phoneNumber: '5551234567'
      }
    ]
    webhookReceivers: [
      {
        name: 'PagerDuty'
        serviceUri: 'https://events.pagerduty.com/integration/{key}/enqueue'
        useCommonAlertSchema: true
      }
      {
        name: 'Teams'
        serviceUri: 'https://outlook.office.com/webhook/{id}'
        useCommonAlertSchema: true
      }
    ]
    azureFunctionReceivers: [
      {
        name: 'AlertProcessor'
        functionAppResourceId: '/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.Web/sites/{app}'
        functionName: 'ProcessAlert'
        httpTriggerUrl: 'https://{app}.azurewebsites.net/api/ProcessAlert'
        useCommonAlertSchema: true
      }
    ]
  }
}
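Because the receivers above all opt into the common alert schema, the ProcessAlert function can parse a single payload shape regardless of which alert fired. A minimal sketch of such a function (Python v1 programming model); the field names follow the common alert schema, and the handling logic is illustrative:
# Minimal sketch: an HTTP-triggered Azure Function that receives common
# alert schema payloads. Field names follow the common alert schema;
# what you do with them is up to your routing logic.
import logging
import azure.functions as func

def main(req: func.HttpRequest) -> func.HttpResponse:
    payload = req.get_json()
    essentials = payload.get("data", {}).get("essentials", {})
    logging.info(
        "Alert %s is %s (severity %s)",
        essentials.get("alertRule"),
        essentials.get("monitorCondition"),  # "Fired" or "Resolved"
        essentials.get("severity"),
    )
    return func.HttpResponse(status_code=200)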
Escalation Policy¶
class AlertEscalation:
    """Manage alert escalation based on severity and age."""

    # Keys are minutes an alert has been open; values are who to notify
    ESCALATION_RULES = {
        "sev0": {
            0: ["oncall_primary"],
            5: ["oncall_primary", "oncall_secondary"],
            15: ["oncall_primary", "oncall_secondary", "manager"],
            30: ["oncall_primary", "oncall_secondary", "manager", "director"]
        },
        "sev1": {
            0: ["team_channel"],
            30: ["oncall_primary"],
            60: ["oncall_primary", "manager"]
        }
    }

    def get_escalation_targets(self, severity: str, minutes_open: int) -> list:
        """Get notification targets based on alert age."""
        rules = self.ESCALATION_RULES.get(severity, {})
        targets = []
        # Walk thresholds in ascending order; keep the last one already passed
        for threshold, recipients in sorted(rules.items()):
            if minutes_open >= threshold:
                targets = recipients
        return targets
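For example, a Sev 0 alert that has been open for 20 minutes has passed the 0-, 5-, and 15-minute thresholds, so it pages the primary, the secondary, and the manager:
# Example: escalation targets for a Sev 0 alert open for 20 minutes.
escalation = AlertEscalation()
print(escalation.get_escalation_targets("sev0", minutes_open=20))
# ['oncall_primary', 'oncall_secondary', 'manager']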
Alert Suppression¶
Maintenance Windows¶
from datetime import datetime, timezone

def should_suppress_alert(alert: dict, maintenance_schedule: dict) -> bool:
    """Check if the alert should be suppressed during maintenance."""
    current_time = datetime.now(timezone.utc)
    for window in maintenance_schedule.get("windows", []):
        # Window timestamps are expected to be ISO-8601 with a UTC offset
        start = datetime.fromisoformat(window["start"])
        end = datetime.fromisoformat(window["end"])
        resources = window.get("resources", [])
        if start <= current_time <= end:
            # An empty resource list means the window covers everything
            if not resources or alert["resource"] in resources:
                return True
    return False
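A usage example with an illustrative schedule; the call returns True only while the window is active:
# Example: an alert on a workspace inside an active window is suppressed.
schedule = {
    "windows": [
        {
            "start": "2025-01-15T02:00:00+00:00",
            "end": "2025-01-15T06:00:00+00:00",
            "resources": ["synapse-prod-workspace"],  # illustrative name
        }
    ]
}
alert = {"resource": "synapse-prod-workspace", "name": "QueryLatency"}
print(should_suppress_alert(alert, schedule))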
Deduplication¶
// Deduplicate alerts: collapse repeats into one row per alert and resource
// per 15-minute window, keeping first/last occurrence and the repeat count
Alerts
| where TimeGenerated > ago(1h)
| summarize
    FirstOccurrence = min(TimeGenerated),
    LastOccurrence = max(TimeGenerated),
    Count = count()
  by AlertName, ResourceId, bin(TimeGenerated, 15m)
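Outside of KQL, the same idea reduces to computing a stable deduplication key per alert identity and time bucket. A minimal sketch; the key fields are illustrative:
# Minimal sketch: collapse repeated alerts into one per 15-minute window by
# hashing the alert identity plus its time bucket. Key fields illustrative.
import hashlib
from datetime import datetime

def dedup_key(alert_name: str, resource_id: str,
              fired_at: datetime, window_minutes: int = 15) -> str:
    """Hash the alert identity plus its time bucket."""
    bucket = int(fired_at.timestamp() // (window_minutes * 60))
    raw = f"{alert_name}|{resource_id}|{bucket}"
    return hashlib.sha256(raw.encode()).hexdigest()

seen: set[str] = set()

def is_duplicate(alert_name: str, resource_id: str, fired_at: datetime) -> bool:
    """Return True when an equivalent alert was already seen this window."""
    key = dedup_key(alert_name, resource_id, fired_at)
    if key in seen:
        return True
    seen.add(key)
    return False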
Best Practices¶
Alert Hygiene¶
| Practice | Description |
|---|---|
| Review weekly | Check for noisy or ignored alerts |
| Document runbooks | Link alerts to remediation procedures |
| Test regularly | Verify alert routing works |
| Track MTTR | Measure mean time to resolution |
| Tune thresholds | Adjust based on false positive rate |
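Tracking MTTR only takes a record of when each alert opened and resolved. A minimal sketch; the record shape is an assumption for illustration:
# Minimal sketch: mean time to resolution over a list of alert records.
# The record shape (opened/resolved datetimes) is an assumption.
from datetime import timedelta

def mean_time_to_resolution(alerts: list) -> timedelta:
    """Average resolution time across resolved alerts."""
    resolved = [a for a in alerts if a.get("resolved") is not None]
    if not resolved:
        return timedelta(0)
    total = sum(((a["resolved"] - a["opened"]) for a in resolved), timedelta(0))
    return total / len(resolved)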
Related Documentation¶
Last Updated: January 2025