Monitoring & Observability

Overview

Effective observability for data platforms rests on three pillars: metrics, logs, and traces. Together they answer what happened, why it happened, and how long it took.

| Pillar  | Purpose                          | Azure Service             |
| ------- | -------------------------------- | ------------------------- |
| Metrics | Numerical health & performance   | Azure Monitor Metrics     |
| Logs    | Detailed event records           | Log Analytics / Sentinel  |
| Traces  | End-to-end request/pipeline flow | Application Insights      |

Cross-reference

For the canonical log event schema used across all CSA-in-a-Box pipelines, see LOG_SCHEMA.md.


Observability Architecture

flowchart LR
    subgraph Azure Resources
        ADLS[ADLS Gen2]
        ADF[Data Factory]
        DBX[Databricks]
        SYN[Synapse]
        KV[Key Vault]
        EH[Event Hubs]
    end

    DS[Diagnostic Settings]

    ADLS --> DS
    ADF  --> DS
    DBX  --> DS
    SYN  --> DS
    KV   --> DS
    EH   --> DS

    DS --> LAW[Log Analytics Workspace]

    LAW --> Alerts[Alert Rules]
    LAW --> Dash[Azure Dashboards]
    LAW --> WB[Workbooks]
    LAW --> Grafana[Azure Managed Grafana]
    LAW --> PBI[Power BI]

    Alerts --> AG[Action Groups]
    AG --> Email[Email / Teams]
    AG --> PD[PagerDuty / ServiceNow]

Log Analytics Workspace Design

Centralized vs. Per-Domain Workspace

| Approach    | Pros                                                  | Cons                                           |
| ----------- | ----------------------------------------------------- | ---------------------------------------------- |
| Centralized | Single pane of glass, simpler RBAC, easy correlation  | Noisy-neighbour risk, harder cost attribution  |
| Per-Domain  | Cost isolation, contained blast radius, domain RBAC   | Harder cross-domain queries, more overhead     |

Recommendation

Use a single centralized workspace with table-level RBAC and resource-context access for most CSA-in-a-Box deployments. Split only when regulatory boundaries require it.
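If a regulatory split is unavoidable, cross-domain correlation is still possible with the KQL workspace() function, at the cost of fan-out queries. A minimal sketch, with placeholder workspace names:

// Workspace names below are placeholders for per-domain workspaces.
union
    workspace('law-domain-finance').ADFPipelineRun,
    workspace('law-domain-clickstream').ADFPipelineRun
| where TimeGenerated > ago(1h)
| where Status == "Failed"
| summarize Failures = count() by PipelineName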

Data Retention Policies

| Tier              | Retention     | Cost Posture | Use Case                       |
| ----------------- | ------------- | ------------ | ------------------------------ |
| Interactive (hot) | 30 days       | Higher       | Active troubleshooting         |
| Archive (cold)    | Up to 7 years | Low          | Compliance, audit, historical  |

Warning

Archive data requires a restore operation (minutes to hours) before querying. Plan retention tiers during design, not after cost overruns.

Table Plans

| Plan      | Query Performance | Ingestion Cost | Best For                            |
| --------- | ----------------- | -------------- | ----------------------------------- |
| Analytics | Full KQL          | Standard       | Security, pipeline monitoring       |
| Basic     | Limited           | ~60 % cheaper  | High-volume verbose logs            |
| Auxiliary | Minimal           | Lowest         | Compliance archives, raw telemetry  |

Cost Management for High-Volume Logs

  • Set daily caps per table to prevent runaway ingestion.
  • Use data collection rules (DCR) to filter and transform before ingestion.
  • Move verbose diagnostic logs (e.g., Databricks driver logs) to Basic tier.
  • Review the Usage table monthly for the top ingesting tables; see the expanded query after this list.
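A fuller version of that monthly review, assuming billable ingestion only (Quantity in the Usage table is reported in MB):

Usage
| where TimeGenerated > ago(30d)
| where IsBillable == true
| summarize IngestedGB = sum(Quantity) / 1024 by DataType
| sort by IngestedGB desc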

Workspace RBAC

| Role                       | Scope               | Access                            |
| -------------------------- | ------------------- | --------------------------------- |
| Log Analytics Reader       | Workspace           | Read all tables                   |
| Log Analytics Contributor  | Workspace           | Manage settings + read/write      |
| Custom – Domain Reader     | Table / Resource    | Read only domain-specific tables  |
| Custom – Security Analyst  | SecurityEvent table | Read security logs only           |

Diagnostic Settings

Non-Negotiable

Every deployed Azure resource MUST have Diagnostic Settings enabled. This is enforced via Azure Policy in CSA-in-a-Box landing zones.

What to Collect

| Resource     | Metrics    | Log Categories                                |
| ------------ | ---------- | --------------------------------------------- |
| ADLS Gen2    | AllMetrics | StorageRead, StorageWrite, StorageDelete      |
| Databricks   | —          | dbfs, clusters, jobs, notebook, secrets       |
| Synapse      | AllMetrics | SQLSecurityAuditEvents, IntegrationPipelineRuns |
| Data Factory | AllMetrics | PipelineRuns, TriggerRuns, ActivityRuns       |
| Key Vault    | AllMetrics | AuditEvent                                    |
| Event Hubs   | AllMetrics | ArchiveLogs, OperationalLogs, AutoScaleLogs   |

Bicep Module — Diagnostic Settings

// Shared module: route diagnostics to the central Log Analytics workspace.
// Bicep cannot scope an extension resource to a raw resource-ID string, so the
// target is referenced as an existing resource. Adjust the type and apiVersion
// line per resource kind; Key Vault is shown here as an assumed example.
@description('Name of the target resource to attach diagnostic settings to.')
param resourceName string
param workspaceId string
param logsEnabled bool = true
param metricsEnabled bool = true

resource target 'Microsoft.KeyVault/vaults@2023-07-01' existing = {
  name: resourceName
}

resource diagnostics 'Microsoft.Insights/diagnosticSettings@2021-05-01-preview' = {
  name: 'diag-to-law'
  scope: target
  properties: {
    workspaceId: workspaceId
    logs: logsEnabled ? [
      {
        categoryGroup: 'allLogs'
        enabled: true
      }
    ] : []
    metrics: metricsEnabled ? [
      {
        category: 'AllMetrics'
        enabled: true
      }
    ] : []
  }
}

Note

Reference the shared diagnosticSettings module in modules/monitoring/diagnostic-settings.bicep when adding new resources to the landing zone.


Custom Metrics & KPIs

Data Pipeline Metrics

| Metric                 | Source       | Threshold              | Alert Action              |
| ---------------------- | ------------ | ---------------------- | ------------------------- |
| pipeline.records_in    | ADF / Spark  | Δ > 50 % from baseline | Warning → investigate     |
| pipeline.records_out   | ADF / Spark  | Δ > 50 % from baseline | Warning → investigate     |
| pipeline.duration_sec  | ADF Activity | > 2× p95               | Warning → review compute  |
| pipeline.failure_count | ADF / dbt    | ≥ 1                    | Critical → on-call page   |
| pipeline.freshness_lag | Custom / dbt | > SLA window           | Critical → escalate       |
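The > 2× p95 duration threshold can be evaluated directly in Log Analytics. A sketch against the ADFActivityRun diagnostic table, assuming a 30-day per-activity baseline:

let baseline = ADFActivityRun
    | where TimeGenerated > ago(30d)
    | where Status == "Succeeded"
    | extend DurationSec = datetime_diff('second', End, Start)
    | summarize p95 = percentile(DurationSec, 95) by ActivityName;
ADFActivityRun
| where TimeGenerated > ago(1h)
| where isnotempty(End)
| extend DurationSec = datetime_diff('second', End, Start)
| join kind=inner (baseline) on ActivityName
| where DurationSec > 2 * p95
| project TimeGenerated, PipelineName, ActivityName, DurationSec, p95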

Data Quality Metrics

| Metric                     | Source          | Threshold    | Alert Action           |
| -------------------------- | --------------- | ------------ | ---------------------- |
| quality.test_pass_rate     | dbt test        | < 100 %      | Warning → review       |
| quality.null_pct           | Custom check    | > column SLA | Warning → data owner   |
| quality.schema_drift_count | Schema registry | ≥ 1          | Warning → review PR    |
| quality.duplicate_rate     | Custom check    | > 0.1 %      | Warning → investigate  |
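The quality.duplicate_rate check is most naturally a query over the data itself. A sketch against a hypothetical custom table (MyEntity_CL and OrderId_s are placeholder names):

// MyEntity_CL and OrderId_s are placeholders for a domain table and its business key.
MyEntity_CL
| where TimeGenerated > ago(1d)
| summarize Total = count(), DistinctKeys = dcount(OrderId_s)  // dcount is approximate
| extend DuplicateRatePct = round(100.0 * (Total - DistinctKeys) / Total, 3)
| where DuplicateRatePct > 0.1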

Platform Metrics

| Metric                     | Source     | Threshold        | Alert Action            |
| -------------------------- | ---------- | ---------------- | ----------------------- |
| platform.cluster_util_pct  | Databricks | > 85 % sustained | Warning → scale review  |
| platform.storage_growth_gb | ADLS       | > budget + 20 %  | Info → capacity plan    |
| platform.query_latency_p95 | Synapse    | > 30 s           | Warning → tune query    |
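platform.storage_growth_gb can be derived from the same AzureMetrics data used for the capacity alert below. A rough sketch of 30-day net growth per storage account:

AzureMetrics
| where ResourceProvider == "MICROSOFT.STORAGE"
| where MetricName == "UsedCapacity"
| summarize DailyGB = max(Maximum) / (1024 * 1024 * 1024) by ResourceId, Day = bin(TimeGenerated, 1d)
// max - min approximates net growth; treat it as an upper bound if usage dips.
| summarize GrowthGB = max(DailyGB) - min(DailyGB) by ResourceId
| order by GrowthGB desc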

Alert Rules

Alert Taxonomy

| Severity | Response Time     | Example                              |
| -------- | ----------------- | ------------------------------------ |
| Sev 0    | 15 min            | Production pipeline complete failure |
| Sev 1    | 1 hour            | Data freshness SLA breach            |
| Sev 2    | 4 hours           | Storage capacity > 80 %              |
| Sev 3    | Next business day | Cost anomaly detected                |
| Sev 4    | Informational     | Schema drift detected                |

Pipeline Failure Alert — KQL

ADFPipelineRun
| where Status == "Failed"
| where TimeGenerated > ago(15m)
| summarize FailureCount = count() by PipelineName, ResourceId
| where FailureCount >= 1

Data Freshness SLA Breach — KQL

let sla_hours = 4;
CustomMetrics_CL
| where MetricName_s == "pipeline.freshness_lag"
| where Value_d > (sla_hours * 3600)
| project TimeGenerated, Pipeline_s, LagSeconds = Value_d

Storage Capacity Warning — KQL

AzureMetrics
| where ResourceProvider == "MICROSOFT.STORAGE"
| where MetricName == "UsedCapacity"
| summarize CurrentBytes = max(Maximum) by ResourceId
| extend CurrentGB = CurrentBytes / (1024*1024*1024)
| extend ThresholdGB = 1024  // adjust per account
| where CurrentGB > ThresholdGB * 0.8

Bicep — Scheduled Query Alert Rule

param location string = resourceGroup().location
param logAnalyticsWorkspaceId string
param actionGroupId string

resource pipelineFailureAlert 'Microsoft.Insights/scheduledQueryRules@2023-03-15-preview' = {
  name: 'alert-pipeline-failure'
  location: location
  properties: {
    displayName: 'Pipeline Failure Detected'
    description: 'Fires when any ADF pipeline fails in the last 15 minutes.'
    severity: 0
    enabled: true
    evaluationFrequency: 'PT5M'
    windowSize: 'PT15M'
    scopes: [ logAnalyticsWorkspaceId ]
    criteria: {
      allOf: [
        {
          query: '''
            ADFPipelineRun
            | where Status == "Failed"
            | summarize FailureCount = count() by PipelineName
            | where FailureCount >= 1
          '''
          timeAggregation: 'Count'
          operator: 'GreaterThanOrEqual'
          threshold: 1
        }
      ]
    }
    actions: {
      actionGroups: [ actionGroupId ]
    }
  }
}

Azure CLI — Quick Alert Creation

az monitor scheduled-query create \
  --name "alert-freshness-sla" \
  --resource-group rg-monitoring \
  --scopes "/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.OperationalInsights/workspaces/{ws}" \
  --condition "count 'CustomMetrics_CL | where MetricName_s == \"pipeline.freshness_lag\" | where Value_d > 14400' > 0" \
  --severity 1 \
  --evaluation-frequency 5m \
  --window-size 15m \
  --action-groups "/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.Insights/actionGroups/ag-oncall"

SLO / SLA Tracking

Define SLOs Per Data Product

| Data Product | Freshness SLO      | Availability SLO | Quality SLO           |
| ------------ | ------------------ | ---------------- | --------------------- |
| Finance      | Daily, ≤ 06:00 UTC | 99.5 %           | 100 % test pass rate  |
| Clickstream  | ≤ 15 min latency   | 99.0 %           | < 0.5 % null in keys  |
| Customer 360 | ≤ 1 hour latency   | 99.5 %           | 100 % test pass rate  |

Error Budgets

An error budget is 1 − SLO. For a 99.5 % availability SLO over 30 days:

  • Budget: 0.5 % × 30 days = 3.6 hours of allowed downtime per month.
  • Track burn rate: if > 2× normal, trigger review (a query sketch follows this list).
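A minimal burn-rate sketch, assuming run outcomes land in a hypothetical PipelineRuns_CL custom table:

// PipelineRuns_CL and Status_s are placeholders for a custom run-outcome table.
let slo = 0.995;
PipelineRuns_CL
| where TimeGenerated > ago(30d)
| summarize Total = count(), Failed = countif(Status_s == "Failed")
| extend Availability = 1.0 * (Total - Failed) / Total
// 100 % means the 30-day error budget is fully spent.
| extend BudgetUsedPct = round(100.0 * (1.0 - Availability) / (1.0 - slo), 1)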

SLO Dashboard Design

┌─────────────────────────────────────────────┐
│  Data Product SLO Summary                   │
├──────────────┬──────────┬──────────┬────────┤
│ Product      │ Fresh ✅ │ Avail ✅ │ Qual ⚠ │
│ Finance      │ 100 %    │ 99.8 %   │ 98.1 % │
│ Clickstream  │ 99.2 %   │ 99.5 %   │ 99.7 % │
│ Customer 360 │ 100 %    │ 99.9 %   │ 100 %  │
└──────────────┴──────────┴──────────┴────────┘
  ✅ Within SLO   ⚠ Budget < 50%   🔴 SLO breached
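The freshness tiles can be computed from the same custom metric used in the SLA-breach alert above, for example:

let sla_seconds = 4 * 3600;
CustomMetrics_CL
| where TimeGenerated > ago(30d)
| where MetricName_s == "pipeline.freshness_lag"
| summarize WithinSla = countif(Value_d <= sla_seconds), Total = count() by Pipeline_s
| extend FreshnessSloPct = round(100.0 * WithinSla / Total, 1)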

Reporting Cadence

| Cadence   | Audience      | Content                            |
| --------- | ------------- | ---------------------------------- |
| Real-time | Engineering   | Live SLO dashboards, alert feed    |
| Weekly    | Platform team | SLO compliance, error budget burn  |
| Monthly   | Leadership    | Trend analysis, capacity forecast  |
| Quarterly | Stakeholders  | SLA report, improvement roadmap    |

Incident Response

Runbook Integration

Cross-reference

Runbooks live in the runbooks/ directory. Each alert rule should link to a specific runbook via the alert description or Action Group webhook payload.

Integration Patterns

flowchart LR
    Alert[Azure Alert] --> AG[Action Group]
    AG --> LApp[Logic App / Function]
    LApp --> PD[PagerDuty]
    LApp --> SN[ServiceNow]
    LApp --> Teams[Teams Channel]
    PD --> OnCall[On-Call Engineer]
    SN --> Ticket[Incident Ticket]

  • PagerDuty: Use the Azure Monitor integration (Events API v2) and route by alert severity.
  • ServiceNow: Use the Azure–ServiceNow connector or a Logic App to create incidents with the CI and assignment group pre-populated.
  • Teams: Post adaptive cards via an incoming webhook for Sev 2–4 alerts.

Post-Incident Review Template

## Post-Incident Review — [Title]

**Date:** YYYY-MM-DD
**Severity:** Sev X
**Duration:** HH:MM
**Impact:** [Users/pipelines affected]

### Timeline

| Time (UTC) | Event                 |
| ---------- | --------------------- |
| HH:MM      | Alert fired           |
| HH:MM      | Engineer paged        |
| HH:MM      | Root cause identified |
| HH:MM      | Mitigation applied    |
| HH:MM      | Resolved              |

### Root Cause

[Description]

### Contributing Factors

- [Factor 1]
- [Factor 2]

### Action Items

- [ ] [Preventive action] — Owner — Due date
- [ ] [Detective action] — Owner — Due date

### Lessons Learned

[What we will do differently]

Escalation Matrix

| Severity | First Responder   | Escalation (30 min) | Escalation (1 hr) |
| -------- | ----------------- | ------------------- | ----------------- |
| Sev 0    | On-call engineer  | Platform lead       | Director of Eng   |
| Sev 1    | On-call engineer  | Platform lead       | —                 |
| Sev 2    | Assigned engineer | On-call engineer    | —                 |
| Sev 3–4  | Team backlog      | —                   | —                 |

Dashboards

Dashboard Strategy

| Dashboard             | Audience     | Tool                  | Refresh   |
| --------------------- | ------------ | --------------------- | --------- |
| Executive Overview    | Leadership   | Azure Dashboard       | 15 min    |
| Engineering Deep-Dive | Platform     | Azure Managed Grafana | 1 min     |
| Business Metrics      | Stakeholders | Power BI              | Scheduled |

Key Layout Recommendations

!!! tip "Dashboard Design" - Top row: SLO status tiles (green/amber/red). - Second row: Pipeline run status (last 24 h), failure trend (7 d). - Third row: Resource utilization (CPU, memory, storage). - Bottom row: Cost trend (MTD vs. budget).

  • Keep executive dashboards to ≤ 6 tiles. One glance, one decision.
  • Use Grafana variables to let engineers filter by workspace, pipeline, or date range; a variable query sketch follows this list.
  • Power BI reports should connect via Log Analytics data export or Azure Data Explorer for large volumes — avoid direct KQL queries in Power BI for > 30 days of data.
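A Grafana variable backed by a Log Analytics data source is simply a KQL query that returns one column; for example, a pipeline picker could use:

ADFPipelineRun
| where TimeGenerated > ago(7d)
| distinct PipelineName
| order by PipelineName asc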

Operational Health Checks

Daily Health Check — KQL Queries

// 1. Failed pipelines in last 24 hours
ADFPipelineRun
| where TimeGenerated > ago(24h)
| where Status == "Failed"
| summarize Failures = count() by PipelineName
| order by Failures desc

// 2. Ingestion anomalies (volume drop > 50 %)
let baseline = toscalar(
    Usage
    | where TimeGenerated between (ago(8d) .. ago(1d))
    | where DataType == "ADFPipelineRun"
    | summarize DailyQty = sum(Quantity) by bin(TimeGenerated, 1d)
    | summarize avg(DailyQty)  // average daily volume, comparable to TodayQty below
);
Usage
| where TimeGenerated > ago(1d)
| where DataType == "ADFPipelineRun"
| summarize TodayQty = sum(Quantity)
| where TodayQty < baseline * 0.5

// 3. Workspace ingestion by table (cost check)
Usage
| where TimeGenerated > ago(1d)
| summarize IngestedMB = sum(Quantity) by DataType
| order by IngestedMB desc
| take 10

Weekly Capacity Review

  • Review storage growth vs. forecast.
  • Review Databricks cluster utilization and right-size.
  • Check the Log Analytics ingestion trend for unexpected spikes; see the comparison query below.
  • Validate alert rules fired correctly (no stale/silent alerts).
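For the ingestion-trend check above, a week-over-week comparison sketch (Quantity in the Usage table is reported in MB):

Usage
| where TimeGenerated > ago(14d)
| where IsBillable == true
| extend Week = iff(TimeGenerated > ago(7d), "current", "previous")
| summarize IngestedGB = round(sum(Quantity) / 1024, 1) by DataType, Week
| order by DataType asc, Week asc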

Monthly Cost & Performance Review

  • Compare actual Azure Monitor costs to budget.
  • Identify top 3 cost drivers and optimization opportunities.
  • Review SLO compliance across all data products.
  • Evaluate if any tables should move between Analytics / Basic / Auxiliary plans.
  • Rotate or archive old Workbooks and dashboards no longer in use.

Anti-Patterns

| ❌ Don't | ✅ Do |
| --- | --- |
| Deploy resources without diagnostic settings | Enforce via Azure Policy — deny if missing |
| Ingest all logs at Analytics tier | Use Basic/Auxiliary for verbose, low-query logs |
| Create alerts with no linked runbook | Every alert → Action Group → runbook link |
| Set alert thresholds based on gut feel | Use p95/p99 baselines from 30 days of data |
| Build one mega-dashboard for all audiences | Separate executive, engineering, and business views |
| Ignore error budgets | Track burn rate; freeze releases if budget exhausted |
| Alert on every warning-level event | Aggregate and alert on trends, not individual events |
| Store secrets in alert payloads | Reference Key Vault; never embed credentials in alerts |
| Skip post-incident reviews | Conduct a blameless review within 48 hours |
| Query Log Analytics directly from Power BI for large ranges | Use data export → ADX or Synapse for historical BI |

Observability Readiness Checklist

  • Log Analytics workspace provisioned with retention policies configured.
  • Diagnostic settings enabled for every deployed resource.
  • Table plans assigned (Analytics / Basic / Auxiliary) per log type.
  • Daily ingestion cap and budget alerts configured.
  • Workspace RBAC applied — least-privilege, table-level where needed.
  • Custom pipeline metrics emitted (records, duration, freshness).
  • Data quality metrics emitted (test pass rate, null %, schema drift).
  • Alert rules created for Sev 0–2 scenarios with action groups.
  • Every alert rule links to a runbook.
  • SLOs defined per data product (freshness, availability, quality).
  • Error budgets tracked and visible on dashboard.
  • Executive dashboard deployed (≤ 6 tiles).
  • Engineering dashboard deployed (Grafana with drill-down).
  • Daily, weekly, and monthly health check cadence established.
  • Incident response escalation matrix documented and communicated.
  • Post-incident review template adopted by the team.