
🎯 SLO/SLI Definitions for Fabric Workspaces

A Concrete Service-Level Framework for Production Microsoft Fabric Capacities

Last Updated: 2026-04-27 | Phase: 14 (Wave 1) | Version: 1.0.0



🎯 Why SLO/SLI for Fabric

Most teams running Microsoft Fabric workloads in production have implicit expectations: "the report should be fresh by 9 AM," "the pipeline should usually succeed," "queries should be fast." Implicit expectations breed silent disagreements between engineering, product, and stakeholders — and they make on-call decisions arbitrary ("is this slow enough to page someone?").

This document establishes an explicit Service-Level Objective (SLO) framework for Microsoft Fabric workspaces, grounded in the Google SRE methodology and adapted to Fabric's unique primitives: capacity units, Direct Lake, Eventstreams, semantic model refreshes, and pipeline orchestration.

What Explicit SLOs Buy You

| Benefit | Why It Matters in Fabric |
|---------|--------------------------|
| Decision authority for on-call | "CU at 92% for 6 minutes" is no longer a judgment call — the runbook knows whether to page |
| Aligned expectations | Product owners, capacity planners, and engineers agree on "good enough" before incidents happen |
| Error budget as a feature flag | When budget burns fast, freeze risky deploys; when budget is healthy, ship features |
| Capacity ROI | Tie SKU spend (F64 → F128) to measurable user-facing outcomes, not gut feel |
| Postmortem rigor | Every SEV1/SEV2 retrospective starts with "which SLO breached, by how much" |
| Compliance evidence | NIGC MICS, FedRAMP, and HIPAA auditors increasingly ask for documented availability targets |

When NOT to Define an SLO

  • Throwaway dev workspaces — overhead exceeds value
  • Brand-new pipelines (<30 days in production) — no baseline; you'll set targets blind
  • One-off analyst exploration — measure operational items, not human curiosity
  • Synthetic test workloads — they exist to break things; don't conflate with user impact

Rule: Every workspace tagged environment=production MUST have an SLO document on file. Workspaces tagged staging SHOULD. dev and sandbox are exempt.


📖 Glossary

| Term | One-Sentence Definition |
|------|-------------------------|
| SLI (Service Level Indicator) | A directly measurable, customer-meaningful metric (e.g., the percentage of pipeline runs in the last hour that succeeded). |
| SLO (Service Level Objective) | A target value or range for an SLI over a defined window (e.g., "99.5% of pipeline runs succeed over a rolling 28 days"). |
| SLA (Service Level Agreement) | A contractual promise to a customer, usually weaker than the internal SLO and with financial penalties — Fabric publishes its own SLAs; this doc is about your internal targets. |
| Error Budget | (1 − SLO) × time_window — the amount of "bad" allowed before stakeholders should care; e.g., 99.5% over 28d = 3.36 hours of allowable badness. |
| Burn Rate | How fast you are consuming the error budget; a 1× burn rate exhausts the budget exactly at window-end, 14.4× exhausts it in under 2 days. |
| Good event / Bad event | The atoms an SLI counts (e.g., a successful refresh = good, a failed refresh = bad). |
| Customer-meaningful | An SLI a real user would notice changing (latency, success, freshness) — not CPU utilization. |

🟡 The Four Golden Signals (Adapted for Fabric)

Google SRE defines four signals every service should measure: Latency, Traffic, Errors, Saturation. The Fabric mapping:

| Golden Signal | Generic Definition | Fabric Concrete Metrics | System Table / Source |
|---|---|---|---|
| Latency | How long requests take | Lakehouse / Warehouse query duration; Power BI report load; semantic model refresh time; Eventstream end-to-end delay | FabricSQLQueries, FabricSemanticModelRefreshes, FabricEventstreamMetrics |
| Traffic | Demand on the system | Queries/min, pipeline runs/hr, eventstream messages/sec, active concurrent users | FabricSQLQueries, FabricPipelineRuns, FabricEventstreamMetrics |
| Errors | Rate of failed requests | Pipeline failures, refresh failures, query timeouts, auth rejections, GE quality gate failures | FabricPipelineRuns, FabricSemanticModelRefreshes, FabricAuditEvents |
| Saturation | "Fullness" of the system | Capacity Unit (CU) utilization %, Spark executor pressure, Eventhouse ingestion lag, Eventhouse storage % | FabricCapacityMetrics, FabricSparkSessions, FabricEventhouseMetrics |
flowchart LR
    subgraph Signals["Four Golden Signals"]
        L[Latency<br/>p50/p95/p99]
        T[Traffic<br/>QPS, runs/hr]
        E[Errors<br/>fail %]
        S[Saturation<br/>CU %, lag]
    end

    subgraph SLI["SLI Catalog"]
        SLI1[Query latency]
        SLI2[Pipeline success]
        SLI3[Refresh freshness]
        SLI4[Stream lag]
        SLI5[CU saturation]
    end

    subgraph SLO["SLO Targets"]
        SLO1[p95 < 10s 99% of time]
        SLO2[99.5% success]
        SLO3[<2hr stale 99.9% of time]
        SLO4[<5min lag 99% of time]
        SLO5[CU<80% 99% of time]
    end

    subgraph Action["Action"]
        A1[Burn rate alert]
        A2[Page on-call]
        A3[Runbook execution]
    end

    Signals --> SLI
    SLI --> SLO
    SLO --> Action

    style Signals fill:#E67E22,color:#fff
    style SLI fill:#6C3483,color:#fff
    style SLO fill:#2471A3,color:#fff
    style Action fill:#27AE60,color:#fff

Rule of thumb: If you cannot map a metric to one of the four signals, it is operational telemetry — not an SLI.


The SLI catalogs below are starting points. Every workspace must explicitly adopt its SLIs, each with a named owner — defaults are not contracts.

Lakehouse / Warehouse Query Latency

| SLI | Definition | Measurement |
|---|---|---|
| Query Latency p50 | 50th percentile of completed query duration | percentile(DurationMs, 50) over 5-min window |
| Query Latency p95 | 95th percentile — most users feel this | percentile(DurationMs, 95) over 5-min window |
| Query Latency p99 | 99th percentile — tail risk for power users / dashboards | percentile(DurationMs, 99) over 5-min window |
| Query Success Rate | % of queries that completed without error | count(Status == "Succeeded") / count(*) |

Scope guidance: Filter to user-initiated queries only. Exclude system maintenance queries (OPTIMIZE, VACUUM, statistics refresh) — they pollute the distribution. (This removes non-user events from the SLI population; it is not the same as excluding maintenance windows, which the Anti-Patterns section warns against.)

Pipeline Success Rate

| SLI | Definition | Measurement |
|---|---|---|
| Pipeline Success Rate (rolling 7d) | % of pipeline runs that succeeded over last 7 days | count(Status == "Succeeded") / count(Status in ("Succeeded","Failed")) |
| Pipeline Activity Success Rate | Per-activity granularity (Copy, Notebook, etc.) | count(ActivityStatus == "Succeeded") / count(*) |
| Pipeline Duration p95 | Tail latency of orchestration | percentile(EndTime - StartTime, 95) |

Cancellation handling: Manually cancelled runs count as neither good nor bad — exclude from the denominator.

Dataset Refresh Success Rate

| SLI | Definition | Measurement |
|---|---|---|
| Semantic Model Refresh Success Rate | % of scheduled refreshes that completed | count(RefreshStatus == "Completed") / count(*) |
| Refresh Freshness | Max age of last successful refresh at any point | now() - max(LastSuccessfulRefreshTime) |
| Refresh Duration p95 | How long refreshes typically take | percentile(DurationMs, 95) |

Eventstream End-to-End Latency

| SLI | Definition | Measurement |
|---|---|---|
| Eventstream E2E Latency p95 | Time from event arrival to destination commit | percentile(SinkCommitTime - EventTime, 95) |
| Eventstream Throughput | Messages successfully processed per second | sum(MessagesProcessed) / 60, per minute |
| Eventstream DLQ Rate | % of messages routed to dead-letter queue | count(DLQ) / count(Total) |
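
The E2E latency row has a full query later in this document; the DLQ rate does not, so here is a minimal sketch. The metric name "MessagesDeadLettered" and the pivot shape are assumptions to validate against your Eventstream metric schema:

// DLQ rate sketch ("MessagesDeadLettered" is an assumed metric name; verify before alerting on it)
FabricEventstreamMetrics
| where TimeGenerated > ago(1h)
| where MetricName in ("MessagesProcessed", "MessagesDeadLettered")
| summarize Value = sum(MetricValue) by EventstreamName, MetricName
| evaluate pivot(MetricName, sum(Value))
| extend DLQRatePct = round(100.0 * MessagesDeadLettered / (MessagesProcessed + MessagesDeadLettered), 3)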

Eventhouse Ingestion Lag

| SLI | Definition | Measurement |
|---|---|---|
| Ingestion Lag p95 | 95th percentile delay from arrival to queryable | percentile(IngestedTime - ArrivalTime, 95) |
| Ingestion Failure Rate | % of ingestion batches that fail | count(Failed) / count(Total) |
| Storage % Used | Eventhouse storage relative to capacity allocation | StorageUsedGB / StorageAllocatedGB |
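
Ingestion lag and failure rate have full queries later in this document; storage does not. A minimal sketch, assuming FabricEventhouseMetrics exposes the StorageUsedGB and StorageAllocatedGB columns named in the measurement above:

// Storage % used sketch (column names assumed from the measurement definition; verify against your schema)
FabricEventhouseMetrics
| where TimeGenerated > ago(15m)
| summarize UsedGB = max(StorageUsedGB), AllocatedGB = max(StorageAllocatedGB) by Database
| extend StoragePct = round(100.0 * UsedGB / AllocatedGB, 1)
| order by StoragePct desc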

Capacity CU Saturation

| SLI | Definition | Measurement |
|---|---|---|
| CU Saturation % | Percent of allocated CUs in use (smoothed) | CUSecondsUsed / CUSecondsAllocated × 100 |
| Throttling Events | Count of distinct minute-windows where throttling fired | countif(Throttled == true) per minute |
| Carry-Forward Debt | CU debt rolled into next 24h smoothing window | sum(CarryForwardCUSeconds) |
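
Saturation and throttling have full queries later in this document; carry-forward debt can be trended the same way. A sketch, assuming the CarryForwardCUSeconds column from the measurement above exists in FabricCapacityMetrics:

// Carry-forward debt trend (CarryForwardCUSeconds is assumed from the measurement definition)
FabricCapacityMetrics
| where TimeGenerated > ago(24h)
| summarize CarryForwardDebt = sum(CarryForwardCUSeconds) by bin(TimeGenerated, 1h), CapacityName
| order by TimeGenerated desc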

Authentication Success Rate

| SLI | Definition | Measurement |
|---|---|---|
| Auth Success Rate (Workspace Identity) | % of WI auth attempts succeeding | count(Result == "Success") / count(*) |
| Auth Success Rate (Service Principal) | Same, scoped to SPs accessing data | per-SP filter |
| Token Refresh Success | % of token-refresh calls succeeding | count(TokenRefresh == "Success") / count(*) |

Power BI Report Load Time

| SLI | Definition | Measurement |
|---|---|---|
| Initial Render p95 | Time to first interactive paint of the report | percentile(InitialRenderMs, 95) |
| Visual Query p95 | Per-visual DAX query duration | percentile(VisualQueryMs, 95) |
| Report Open Success Rate | % of report opens completing without error | count(Status == "Success") / count(*) |

Targets are tiered. Pick the tier that matches the workspace's business criticality. Higher tiers cost more (engineering effort, capacity headroom, on-call hours). Don't aspire upward unless funded.

Tier Definitions

| Tier | Use For | Engineering Cost | Capacity Headroom |
|---|---|---|---|
| Aspirational (4 nines) | SOX-reporting Gold marts; compliance dashboards (NIGC, FedRAMP) | High — multi-region, blue/green deploys, hot standby | 50%+ |
| Standard (3 nines) | Production Casino floor, federal agency operational reports | Medium — alerting + tested runbooks + on-call | 25–40% |
| Best-effort (2 nines) | Internal analytics, non-customer-facing exploratory marts | Low — best-effort during business hours | 10–20% |

SLO Target Table

| SLI | Aspirational | Standard | Best-Effort |
|---|---|---|---|
| Query Latency p95 (Lakehouse/Warehouse) | < 5s — 99.9% of windows | < 10s — 99% of windows | < 30s — 95% of windows |
| Query Latency p99 | < 15s — 99% | < 30s — 99% | < 60s — 95% |
| Pipeline Success Rate (rolling 7d) | 99.9% | 99.5% | 98% |
| Semantic Model Refresh Success | 99.9% | 99.5% | 99% |
| Refresh Freshness (Direct Lake reframe / import) | < 30 min stale — 99.9% | < 2 hr stale — 99.9% | < 6 hr stale — 99% |
| Eventstream E2E Latency p95 | < 30s | < 2 min | < 5 min |
| Eventhouse Ingestion Lag p95 | < 1 min | < 5 min | < 15 min |
| Capacity CU Saturation | < 70% sustained — 99.9% | < 80% sustained — 99% | < 90% sustained — 95% |
| Authentication Success Rate | 99.99% | 99.9% | 99.5% |
| Power BI Initial Render p95 | < 3s | < 8s | < 20s |

Casino POC default: Standard tier for lh_silver / lh_gold consumed by the Casino floor NOC dashboard; Aspirational for fact_compliance_summary and CTR/SAR pipelines (regulatory).

Federal POC default: Standard for all operational marts; Aspirational for fact_compliance_summary equivalents and any FedRAMP-monitored continuous-monitoring feeds.


💰 Error Budget Methodology

Calculating the Budget

Error Budget = (1 − SLO) × Time Window
| SLO | 28-day window | Per-day equivalent |
|---|---|---|
| 99.0% | 6.72 hours | 14.4 minutes |
| 99.5% | 3.36 hours | 7.2 minutes |
| 99.9% | 40.3 minutes | 1.44 minutes |
| 99.95% | 20.2 minutes | 43.2 seconds |
| 99.99% | 4.03 minutes | 8.64 seconds |
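
The table rows fall directly out of the formula. A sanity check for the 99.5% row, runnable in any KQL editor:

// Error budget arithmetic for SLO = 99.5% over a 28-day window
print BudgetHours = (100.0 - 99.5) / 100.0 * 28 * 24,           // 3.36 hours per 28-day window
      BudgetMinutesPerDay = (100.0 - 99.5) / 100.0 * 24 * 60    // 7.2 minutes per day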

28-Day Rolling vs Calendar Month

| Approach | Pros | Cons | Recommendation |
|---|---|---|---|
| 28-day rolling | Smooth — never resets at month boundary; matches SRE-book canonical practice | Harder to communicate to non-engineers ("which 28 days?") | Default for engineering-owned SLOs |
| Calendar month | Easy to explain; aligns with month-end reporting; resets are predictable | Encourages risky behavior at month-start ("we have a fresh budget!") | Use for stakeholder-facing reports only |
| Quarterly | Aligns with planning cadence; less alert noise | Slow signal — bad behavior hides for weeks | Avoid for primary SLOs |

Recommended: Compute SLO compliance on a 28-day rolling window for engineering decisions. Roll up to a monthly stakeholder report for external comms. Never use both as gates simultaneously.

What to Do When Budget Is Burned

| Budget Remaining | Engineering Posture |
|---|---|
| > 50% | Ship features. Take risks. Run experiments. |
| 25–50% | Ship features but increase deploy review scrutiny; pause optional risky changes. |
| 10–25% | Freeze non-essential changes. Focus rotation on reliability work. |
| 0–10% | Production freeze. Only fixes that improve reliability. Page leadership. |
| < 0% (over-budget) | All hands. Postmortem all incidents in window. No new features until budget restored. |

Burn Rate Math

Burn Rate = (% of budget consumed in window W) / (W as fraction of total period)

A burn rate of 1 means you'll exhaust the budget exactly at window end. A burn rate of 14.4 over 1 hour means you'll exhaust the entire 28-day budget in under 2 days if it continues — that's pageable.
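
Equivalently, burn rate is the observed bad-event fraction divided by the allowed fraction (1 − SLO). A minimal sketch:

// Burn rate = observed failure rate / allowed failure rate
let SLO = 99.5;
let ObservedFailRate = 0.072;   // e.g., 7.2% of the last hour's pipeline runs failed
print BurnRate = ObservedFailRate / ((100.0 - SLO) / 100.0)   // 14.4 -> budget gone in under 2 days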


🔥 Burn-Rate Alerts (Multi-Window)

Single-threshold alerts are noisy and slow. The SRE community's solution: multi-window, multi-burn-rate alerts, modeled on Google's recommendations.

The Three-Tier Pattern

| Window | Burn Rate Threshold | Time-to-Exhaust Budget | Severity | Action |
|---|---|---|---|---|
| 1 hour | 14.4× | < 2 days | SEV1 — page immediately | "Fast burn" — service is on fire |
| 6 hours | 6× | < ~5 days | SEV2 — page during business hours | "Sustained burn" — investigate today |
| 24 hours | 1× | 28 days (window end) | SEV3 — ticket, weekly review | "Slow burn" — backlog item |

Why Multi-Window

A single 14.4× burn-rate breach over 5 minutes could just be an alert flap. Requiring it sustained over 1 hour dramatically reduces false pages while still catching real outages within ~10 min of onset (because the 1-hour window fires when its rolling SLI breaches, not at hour-end).
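
The Workbook's full recipe pairs each long window with a short window (for example, 1 hour with 5 minutes) and pages only when both breach, which also stops paging quickly once the incident recovers. A minimal sketch of the fast-burn pair, reusing the assumed FabricPipelineRuns schema from the queries later in this document:

// Multi-window fast-burn check: page only if both the 1h and 5m windows exceed 14.4x
let SLO = 99.5;
let Threshold = (100.0 - SLO) / 100.0 * 14.4;   // 7.2% failure rate
let FailRate1h = toscalar(
    FabricPipelineRuns
    | where TimeGenerated > ago(1h) and Status in ("Succeeded", "Failed")
    | summarize 1.0 * countif(Status == "Failed") / count());
let FailRate5m = toscalar(
    FabricPipelineRuns
    | where TimeGenerated > ago(5m) and Status in ("Succeeded", "Failed")
    | summarize 1.0 * countif(Status == "Failed") / count());
print FailRate1h, FailRate5m, PageOnCall = FailRate1h >= Threshold and FailRate5m >= Threshold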

Worked Example: Pipeline Success SLO 99.5%

SLO: 99.5% (over 28 days)
Allowable failure rate: 0.5%
Total budget: 0.5% × 28d = 3.36 hours of "down" pipelines

Fast-burn alert (1h, 14.4×):
  Threshold = 0.5% × 14.4 = 7.2% failure rate over 1 hour
  → If >7.2% of last hour's pipeline runs failed, page on-call (SEV1)

Sustained-burn alert (6h, 6×):
  Threshold = 0.5% × 6 = 3.0% failure rate over 6 hours
  → If >3% of last 6 hours' runs failed, page during business hours (SEV2)

Slow-burn alert (24h, 1×):
  Threshold = 0.5% × 1 = 0.5% failure rate over 24 hours
  → If sustained, ticket for review (SEV3)

The numbers 14.4 and 6 come directly from Chapter 5 of the Site Reliability Workbook (Google) and have been validated across thousands of services. Don't invent your own without evidence.


💻 KQL Queries for Each SLI

All queries below run against the Workspace Monitoring Eventhouse system tables. Adjust workspace names and time ranges before use.

Query Latency p95 (Lakehouse/Warehouse)

// Lakehouse/Warehouse query latency p50/p95/p99 over 5-min buckets
FabricSQLQueries
| where TimeGenerated > ago(1h)
| where Status == "Succeeded"
| where QueryType in ("LakehouseSQL", "WarehouseSQL")
| where IsSystemQuery == false   // exclude OPTIMIZE/VACUUM
| summarize
    p50 = percentile(DurationMs, 50),
    p95 = percentile(DurationMs, 95),
    p99 = percentile(DurationMs, 99),
    QueryCount = count()
    by bin(TimeGenerated, 5m), WorkspaceName
| order by TimeGenerated desc

Pipeline Success Rate (Rolling 7d)

FabricPipelineRuns
| where TimeGenerated > ago(7d)
| where Status in ("Succeeded", "Failed")    // exclude "Cancelled"
| summarize
    Total = count(),
    Succeeded = countif(Status == "Succeeded"),
    Failed = countif(Status == "Failed")
    by WorkspaceName, PipelineName
| extend SuccessRate = round(100.0 * Succeeded / Total, 4)
| extend MeetsSLO_99_5 = SuccessRate >= 99.5
| order by SuccessRate asc

Pipeline Burn-Rate Alert (Fast Burn, 1h, 14.4×)

let SLO = 99.5;
let BurnRate = 14.4;
let FailThreshold = (100.0 - SLO) * BurnRate / 100.0;   // 0.072 = 7.2%
FabricPipelineRuns
| where TimeGenerated > ago(1h)
| where Status in ("Succeeded", "Failed")
| summarize Total = count(), Failed = countif(Status == "Failed") by WorkspaceName
| extend FailRate = 1.0 * Failed / Total
| where FailRate >= FailThreshold and Total >= 10   // require min sample size
| project WorkspaceName, Total, Failed, FailRate, AlertSeverity = "SEV1"

Semantic Model Refresh Freshness

// Time since last successful refresh, per dataset
FabricSemanticModelRefreshes
| where TimeGenerated > ago(7d)
| where Status == "Completed"
| summarize LastSuccess = max(EndTime) by WorkspaceName, DatasetName
| extend StalenessMinutes = datetime_diff('minute', now(), LastSuccess)
| extend MeetsSLO_StandardTier = StalenessMinutes <= 120   // < 2 hr
| order by StalenessMinutes desc

Semantic Model Refresh Success Rate (28d Rolling)

FabricSemanticModelRefreshes
| where TimeGenerated > ago(28d)
| summarize
    Total = count(),
    Completed = countif(Status == "Completed"),
    Failed = countif(Status == "Failed")
    by WorkspaceName, DatasetName
| extend SuccessRate = round(100.0 * Completed / Total, 4)
| extend BudgetRemaining_99_5 =
    iff(SuccessRate >= 99.5,
        round(((SuccessRate - 99.5) / 0.5) * 100, 1),  // % of budget left
        0.0)
| order by BudgetRemaining_99_5 asc

Eventstream E2E Latency

FabricEventstreamMetrics
| where TimeGenerated > ago(1h)
| where MetricName == "EndToEndLatencyMs"
| summarize
    p50 = percentile(MetricValue, 50),
    p95 = percentile(MetricValue, 95),
    p99 = percentile(MetricValue, 99)
    by bin(TimeGenerated, 1m), EventstreamName
| extend MeetsSLO_StandardTier = p95 <= 120000   // < 2 min

Eventhouse Ingestion Lag

// Ingestion failures (management command; run it separately from the query below, as the two cannot be batched)
.show ingestion failures
| where FailedOn > ago(1h)
| summarize Failures = count() by Table, FailureKind
| order by Failures desc

// Lag distribution from system metrics
FabricEventhouseMetrics
| where TimeGenerated > ago(1h)
| where MetricName == "IngestionLagSeconds"
| summarize p95 = percentile(MetricValue, 95) by bin(TimeGenerated, 1m), Database, Table
| extend MeetsSLO_StandardTier = p95 <= 300   // < 5 min

Capacity CU Saturation

// 5-minute rolling CU utilization vs capacity allocation
FabricCapacityMetrics
| where TimeGenerated > ago(1h)
| summarize
    CUUsed = sum(CUSeconds),
    CUAvailable = max(CapacityCUSecondsAvailable)
    by bin(TimeGenerated, 5m), CapacityName
| extend SaturationPct = round(100.0 * CUUsed / CUAvailable, 2)
| extend MeetsSLO_StandardTier = SaturationPct < 80
| order by TimeGenerated desc

Capacity Throttling Burn (Fast Burn)

let SLO_NotThrottled = 99.0;
let BurnRate = 14.4;
let ThrottleThresholdPct = (100.0 - SLO_NotThrottled) * BurnRate;   // 14.4%
FabricCapacityMetrics
| where TimeGenerated > ago(1h)
| summarize
    Total = count(),
    Throttled = countif(Throttled == true)
    by CapacityName
| extend ThrottlePct = round(100.0 * Throttled / Total, 2)
| where ThrottlePct >= ThrottleThresholdPct
| project CapacityName, ThrottlePct, AlertSeverity = "SEV1", Runbook = "capacity-throttling-response.md"

Authentication Success Rate

FabricAuditEvents
| where TimeGenerated > ago(1h)
| where Operation in ("WorkspaceIdentityAuth", "ServicePrincipalAuth")
| summarize
    Total = count(),
    Failed = countif(Result == "Failure")
    by Operation, WorkspaceName
| extend SuccessRate = round(100.0 * (Total - Failed) / Total, 4)
| extend MeetsSLO_StandardTier = SuccessRate >= 99.9

Power BI Report Load Time

FabricPowerBIReportRenders
| where TimeGenerated > ago(1h)
| where ReportLifecycle == "InitialRender"
| summarize
    p50 = percentile(DurationMs, 50),
    p95 = percentile(DurationMs, 95),
    p99 = percentile(DurationMs, 99)
    by bin(TimeGenerated, 5m), ReportName
| extend MeetsSLO_StandardTier = p95 <= 8000   // < 8s
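
A companion sketch for the Visual Query p95 SLI from the catalog; the ReportLifecycle value "VisualQuery" and the VisualName column are assumptions to validate against the actual table:

// Per-visual DAX query latency (the lifecycle value and VisualName column are assumed)
FabricPowerBIReportRenders
| where TimeGenerated > ago(1h)
| where ReportLifecycle == "VisualQuery"
| summarize p95 = percentile(DurationMs, 95) by bin(TimeGenerated, 5m), ReportName, VisualName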

🚨 Wiring SLOs → Alerts → Pages

flowchart LR
    subgraph Capture["1. SLI Capture"]
        I1[Workspace Monitoring<br/>System Tables]
        I2[Capacity Metrics App]
    end

    subgraph Compute["2. SLO Computation"]
        C1[Scheduled KQL queries]
        C2[28-day rolling windows]
        C3[Burn-rate calculation]
    end

    subgraph Alert["3. Alert Routing"]
        A1[Data Activator rules]
        A2[Azure Monitor alerts]
        A3[Action Group<br/>email + SMS + webhook]
    end

    subgraph Page["4. Paging"]
        P1[PagerDuty / Opsgenie]
        P2[On-call rotation]
    end

    subgraph Respond["5. Runbook"]
        R1[Severity-specific runbook]
        R2[Incident channel opened]
        R3[Postmortem]
    end

    Capture --> Compute --> Alert --> Page --> Respond

    style Capture fill:#E67E22,color:#fff
    style Compute fill:#6C3483,color:#fff
    style Alert fill:#2471A3,color:#fff
    style Page fill:#C0392B,color:#fff
    style Respond fill:#27AE60,color:#fff

SEV1 Trigger Conditions (Page Immediately)

A SEV1 fires when any of these are true on a production workspace:

  • Fast-burn breach (1h, 14.4×) on any 99.9%+ SLI — service is melting
  • Capacity throttled for ≥ 5 sustained minutes
  • All pipelines failing (success rate < 50% in last hour, sample ≥ 10; see the sketch after this list)
  • Auth success rate < 95% in last 15 min (auth meltdown)
  • Eventhouse ingestion lag > 30 min (real-time path is broken)
  • Compliance feed (CTR/SAR/W-2G or FedRAMP CDM) stale beyond regulatory window

Routes to: Incident Response Template + matched specialist runbook.
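
The all-pipelines-failing condition, for example, can be checked with the assumed FabricPipelineRuns schema used throughout this document:

// SEV1 check: workspace-wide pipeline failure (success rate < 50% over the last hour, min 10 runs)
FabricPipelineRuns
| where TimeGenerated > ago(1h)
| where Status in ("Succeeded", "Failed")
| summarize Total = count(), Succeeded = countif(Status == "Succeeded") by WorkspaceName
| where Total >= 10 and 100.0 * Succeeded / Total < 50
| project WorkspaceName, Total, Succeeded, AlertSeverity = "SEV1"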

SEV2 Trigger Conditions (Page During Business Hours)

  • Sustained-burn breach (6h, 6×) on any standard-tier SLI
  • Single critical pipeline failed > 2 hours
  • Semantic model refresh failed for prod report
  • CU saturation > 90% sustained for 30+ min
  • Eventstream DLQ rate > 1% sustained

SEV3 Trigger Conditions (Ticket, Review)

  • Slow-burn breach (24h, 1×) — budget will exhaust at month-end if unchanged
  • Single non-critical pipeline failed
  • Intermittent slow queries (p95 elevated but inside SLO)
  • Non-prod refresh failures

Slow Burn vs Fast Burn — Visual

                 ┌──────────────────────────────┐
                 │  Error Budget = 100% (start) │
                 └──────────────┬───────────────┘
        ┌───────────────────────┼───────────────────────┐
        │                       │                       │
  Fast burn (14.4×)      Sustained (6×)          Slow burn (1×)
   1h window              6h window               24h window
        │                       │                       │
        ▼                       ▼                       ▼
   PAGE NOW              PAGE TODAY              TICKET (review)
   SEV1                  SEV2                    SEV3
   "fire"                "smoke"                 "smell"

📅 SLO Review Cadence

SLOs are not set-and-forget. The Fabric platform ships features quarterly; user expectations evolve; capacity SKUs change. Recalibrate explicitly.

Quarterly SLO Review

| Activity | Owner | Output |
|---|---|---|
| Pull last 90 days of SLI data per workspace | Platform team | Compliance scorecard |
| Identify SLOs consistently met (>99.9% of windows) | Platform team | "Tighten target?" candidates |
| Identify SLOs frequently breached (<95% of windows) | Service owners | "Loosen target or invest?" decisions |
| Review postmortems, map to SLOs | Incident Commanders | New SLIs that should have caught the issue |
| Survey customers / stakeholders on perceived issues | Product owners | Gap between measured and felt reliability |
| Update SLO docs in Archon | Service owners | Versioned SLO doc per workspace |

When to Recalibrate Outside Cadence

  • Major capacity change (F64 → F128 or vice versa)
  • New regulatory requirement (e.g., new state gaming rule, new federal directive)
  • Third consecutive month of breach — target is unrealistic
  • Two postmortems for same root cause — SLI gap detected
  • Customer complaint with no matching alert — SLI missing entirely

Recalibration Decision Framework

SLO consistently met (>99.9% of windows)?
  ├─ Yes ─→ Is engineering investing effort to maintain it?
  │          ├─ Yes → Tighten target, free engineering for new work
  │          └─ No  → Leave alone (it's working)
  └─ No  ─→ Are user complaints aligned with breach pattern?
             ├─ Yes → Invest in reliability OR loosen target with stakeholder buy-in
             └─ No  → SLI is wrong (measuring something users don't feel) — redefine

🚫 Anti-Patterns

❌ "100% Reliability" SLO

Problem: "We need 100% uptime." Why wrong: No real system is 100%. Setting 100% means you can never deploy, never experiment, and your error budget is zero — every incident is a crisis. Fix: Pick a realistic tier (99.5%, 99.9%) with explicit budget. Embrace that the budget will be spent.

❌ Measuring CPU Instead of Customer Outcome

Problem: SLO defined as "Spark cluster CPU < 80%." Why wrong: Users do not feel CPU. They feel slow queries. Fix: Translate to a customer-meaningful metric (query latency p95, refresh duration).

❌ One SLO for the Whole Workspace

Problem: "Workspace SLO = 99.5%." Why wrong: A workspace is not a service. A failed nightly internal report and a failed CTR compliance pipeline get the same response — wrong incentives. Fix: SLOs per-item-type or per-business-process. CTR pipeline gets Aspirational; analyst exploratory dataset gets Best-effort.

❌ Single-Threshold Alert ("p95 > 10s for 5 min")

Problem: Alert fires on every brief spike. Why wrong: Alert fatigue → on-call ignores the page → real incidents miss the SLA. Fix: Multi-window burn-rate alerts (1h/6h/24h, 14.4×/6×/1×).

❌ SLO Defined by Engineering Alone

Problem: No product/business owner agreed to the target. Why wrong: When budget burns and engineering wants to freeze deploys, business pushes back. No prior agreement = no authority. Fix: Every SLO has a co-signing service owner (engineering) and business owner (product/agency lead).

❌ Calendar-Month Budget Reset

Problem: Budget refills on the 1st of every month. Why wrong: Encourages "let's ship risky stuff at the start of the month, we have a fresh budget." Fix: 28-day rolling window. Always trailing, never resets.

❌ Alerting on the SLI, Not the Burn Rate

Problem: Alert fires when "p95 > 10s right now." Why wrong: A 30-second blip pages you. A slow degradation over 3 weeks goes unnoticed until budget exhausts. Fix: Alert on burn rate over multiple windows.

❌ Excluding "Maintenance" From SLI Calculation

Problem: "OPTIMIZE was running, that's why queries were slow — exclude it." Why wrong: Users still felt slow queries. Reality doesn't have asterisks. Fix: Include it. If maintenance hurts users, schedule it differently — don't paper over the SLI.

❌ Anti-Pattern Summary

| Anti-Pattern | Risk | Fix |
|---|---|---|
| 100% SLO | Production freeze forever | Pick 99.5–99.99% tier |
| CPU-based SLI | Doesn't reflect user pain | Customer-meaningful metric |
| Workspace-level SLO | One size for many services | Per-item or per-process SLOs |
| Single-threshold alert | Flap fatigue | Multi-window burn-rate |
| Engineer-only SLO | No authority during budget burn | Co-sign with business owner |
| Calendar-month budget | Risky behavior at month-start | 28-day rolling window |
| Static threshold alerts | Misses slow degradation | Burn-rate alerts |
| Excluding maintenance | SLI lies about user experience | Include all user-facing time |

📋 Sample SLO Document Template

Copy this section into your workspace SLO doc and fill in the brackets. Store at docs/slos/{workspace-name}.md and link from your Archon project.

# SLO Document — {Workspace Name}

**Version:** {1.0.0}
**Last Reviewed:** {YYYY-MM-DD}
**Next Review:** {YYYY-MM-DD} (quarterly)
**Service Owner (Engineering):** {@username}
**Business Owner (Product / Agency):** {@username}
**Tier:** {Aspirational / Standard / Best-Effort}

## Workspace Scope

- **Workspace ID:** {GUID}
- **Capacity:** {F64 / F128 / etc.}
- **Items in scope for SLOs:**
  - {lh_silver_casino — Silver lakehouse}
  - {lh_gold_casino — Gold lakehouse}
  - {sm_casino_floor — Semantic model}
  - {pl_bronze_to_silver — Pipeline}
  - {es_slot_telemetry — Eventstream}

## SLIs and SLOs

### SLI 1: {Name}

- **Definition:** {one sentence — what counts as good vs bad}
- **Source:** {KQL query or dashboard}
- **SLO target:** {e.g., 99.5% over 28-day rolling window}
- **Error budget:** {N min/hr per 28d}
- **Fast-burn alert (1h, 14.4×):** {threshold}
- **Sustained-burn alert (6h, 6×):** {threshold}
- **Slow-burn alert (24h, 1×):** {threshold}
- **Runbook on breach:** {link to runbook}

### SLI 2: {Name}

{repeat structure}

### SLI 3: {Name}

{repeat structure}

## Compliance & Regulatory Notes

- **NIGC / FedRAMP / HIPAA implications:** {if any — explicit yes/no}
- **Customer-facing SLA reference:** {if external SLA backs this internal SLO}

## Review History

| Date | Change | Reason | Approver |
|------|--------|--------|----------|
| {YYYY-MM-DD} | Initial | First production deploy | {@user} |
| {YYYY-MM-DD} | Tightened p95 from 10s → 8s | Budget unused 6 months | {@user} |

## Action Items From Last Review

| ID | Item | Owner | Due | Status |
|----|------|-------|-----|--------|
| {AI-1} | {action} | {@user} | {date} | {todo / done} |

## Notes

{Free-form context — special handling, known seasonal patterns, related projects}

| Runbook | When SLO Breach Triggers It |
|---|---|
| Incident Response Template | Master template — every SEV1/SEV2 starts here |
| Capacity Throttling Response | CU saturation SLO breach |
| Pipeline Failure Triage | Pipeline success-rate burn |
| Auth Failure Playbook | Authentication SLO breach |
| Multi-Region Failover | Region-wide SLO breach |
| Data Quality Incident | Quality-gate SLO breach |
| Tenant Migration (Dev/Staging/Prod) | Rollback after deploy-induced burn |

| Document | Relationship |
|---|---|
| Monitoring & Observability | Telemetry collection that feeds the SLIs |
| Capacity Planning & Cost Optimization | Capacity sizing to satisfy CU saturation SLO |
| Error Handling & Monitoring | Pipeline error architecture |
| Alerting & Data Activator | Wiring layer for burn-rate alerts |
| Testing Strategies | Pre-prod gates that protect SLOs |
| Disaster Recovery / BCDR | RTO/RPO SLOs for failover |

| Document | Relationship |
|---|---|
| Workspace Monitoring | Source of truth for FabricCapacityMetrics, FabricPipelineRuns, FabricSemanticModelRefreshes |
| Real-Time Intelligence | Eventstream / Eventhouse SLI source |
| Direct Lake | Refresh-freshness SLI considerations |

📚 References

SRE Foundations

  • Google, Site Reliability Engineering (the "SRE Book"), Chapter 4: Service Level Objectives. https://sre.google/sre-book/service-level-objectives/
  • Google, The Site Reliability Workbook, Chapter 5: Alerting on SLOs (multi-window, multi-burn-rate). https://sre.google/workbook/alerting-on-slos/
  • Google, The Site Reliability Workbook, Chapter 2: Implementing SLOs. https://sre.google/workbook/implementing-slos/

