Change Management for Fabric Platforms
RFC, CAB, Freeze Windows, and Risk-Tiered Approval for Microsoft Fabric
Last Updated: 2026-04-27 | Version: 1.0.0 | Phase 14 Wave 1, Feature 1.10
Table of Contents
- Why Change Management for Fabric
- Change Lifecycle
- Change Classification Matrix
- RFC Template
- CAB (Change Advisory Board) Process
- Freeze Windows
- Change Calendar
- Risk Assessment Framework
- Rollback Policy
- Integration with fabric-cicd + Deployment Pipelines
- Audit Trail Requirements
- Post-Change Verification
- Failed Change Procedure
- Anti-Patterns
- Templates Provided
- Related Runbooks & Best-Practice Docs
Why Change Management for Fabric
Microsoft Fabric platforms span shared capacity (F64), workspace identities, OneLake security, deployment pipelines, and dozens of interdependent items (Lakehouses, Notebooks, Semantic Models, Pipelines, SQL Databases, Eventhouses). A single uncoordinated change can throttle the capacity and block every workspace on it, break Direct Lake semantic models that BI executives depend on, violate compliance posture (CMK, OAP, audit-log retention) for federal workloads, drop Bronze data that cannot be re-derived (breaking 7-year retention obligations), or cascade through Deployment Pipelines without an approved rollback path.
Change management exists to make changes predictable, reviewable, reversible, and auditable, without slowing the team down for low-risk work. The framework below is ITIL-aware but tuned for Fabric: every guardrail maps to a specific Fabric concept (capacity, workspace, item, pipeline stage).
| Goal | Mechanism | Non-Goal |
|---|---|---|
| Predictability | Classified changes follow known paths | Replacing source control / PR review (this wraps them) |
| Reviewability | Normal/Major changes have RFC + CAB record | Slowing emergency response (see Hotfix path) |
| Reversibility | Every change has a rollback plan tested in Staging | Documenting every notebook tweak (Standard changes need only a PR) |
| Auditability | Every deploy traces PR → RFC → CAB → commit SHA | n/a |
| Velocity | Standard (pre-approved) changes auto-deploy with no CAB friction | n/a |
Change Lifecycle
The lifecycle below applies to every change that lands in Dev, Staging, or Prod. The depth of review scales with the change classification.
flowchart LR
A[Idea / Need] --> B[Draft RFC]
B --> C{Classify Change}
C -->|Standard| D1[PR Review<br/>Auto-deploy Dev]
C -->|Normal| D2[CAB Review<br/>SLA 48 hr]
C -->|Major| D3[CAB + Exec Approval]
C -->|Emergency| D4[Hotfix Path<br/>Post-hoc CAB]
D1 --> E[Deploy]
D2 --> E
D3 --> E
D4 --> E
E --> F[Post-Change<br/>Verification]
F --> G{Stable?}
G -->|Yes| H[Close RFC]
G -->|No| I[Trigger Rollback<br/>Open Incident]
I --> J[Failed-Change<br/>Post-Mortem]
J --> H
style D4 fill:#E74C3C,color:#fff
style D3 fill:#E67E22,color:#fff
style D2 fill:#2471A3,color:#fff
style D1 fill:#27AE60,color:#fff
style I fill:#C0392B,color:#fff
Cross-reference: The mechanics of how a change moves Dev → Staging → Prod live in the Tenant Migration Runbook. This document is the policy layer above that runbook.
Change Classification Matrix
Every change is classified into exactly one of four tiers. The classification drives the approval path, deployment path, and rollback expectations.
| Tier | Definition | Examples (Fabric) | Approval | Deploy Path | Rollback SLA |
|---|---|---|---|---|---|
| Standard | Pre-approved, automated, low-risk. Documented pattern with proven safety. | Notebook parameter tweak; GE expectation rule update; semantic model measure rename; minor DAX change; non-breaking tutorial doc edit | PR review only (≥1 reviewer); no CAB | Auto-deploy Dev → Staging on merge; manual one-click to Prod | <15 min |
| Normal | Discretionary change with non-trivial blast radius. CAB review required. | New Bronze ingestion; Silver schema additive change; new pipeline; capacity scaling within tier (F64 → F128); new Power BI app workspace; OneLake shortcut addition | PR + CAB (1 quorum, 48-hr SLA) | Standard 3-stage promotion (Dev → Staging → Prod) with manual Prod gate | <30 min |
| Major | High blast radius or compliance-impacting. Executive approval required. | New workspace tied to capacity; capacity SKU change >2 tiers (F64 → F256); region migration; CMK key rotation; OneLake security policy change; tenant-level setting change; breaking schema change in Silver/Gold | PR + CAB + executive approval (Platform Lead and Service Owner) | 3-stage with mandatory rollback test in Staging; dual-approver Prod gate | <1 hr (and rollback must be rehearsed in Staging before Prod deploy) |
| Emergency / Hotfix | Production-impacting incident requiring immediate fix. Post-hoc CAB review. | Patching a SEV1 Prod data corruption; reverting a bad deploy; emergency capacity scale-up to relieve throttling; revoking a leaked credential | On-call engineer + Incident Commander; CAB reviews after the fact at next session | Hotfix path per Tenant Migration Runbook §Hotfix | <15 min (this is the rollback for SEV1) |
Decision Tree: Which Tier Is My Change?
flowchart TD
Start([New Change Proposed]) --> Q1{Active SEV1/SEV2<br/>incident driving<br/>this change?}
Q1 -->|Yes| Emergency[Emergency / Hotfix]
Q1 -->|No| Q2{Touches Prod<br/>capacity, tenant settings,<br/>or compliance posture?}
Q2 -->|Yes| Major[Major]
Q2 -->|No| Q3{New item, schema change,<br/>or first-time deploy<br/>to Prod?}
Q3 -->|Yes| Normal[Normal]
Q3 -->|No| Q4{Matches a documented<br/>pre-approved pattern<br/>e.g. param change,<br/>GE rule, doc edit?}
Q4 -->|Yes| Standard[Standard]
Q4 -->|No| Normal
style Emergency fill:#E74C3C,color:#fff
style Major fill:#E67E22,color:#fff
style Normal fill:#2471A3,color:#fff
style Standard fill:#27AE60,color:#fff
Rule of thumb: If you are not sure, classify higher. CAB can de-escalate; you cannot un-deploy a Major change that was reviewed as Standard.
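The decision tree can also be read as an ordered series of checks. The sketch below is illustrative only: `ChangeFacts` and `classify_change` are hypothetical names rather than existing tooling, and the yes/no answers are assumed to come from the change author.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeFacts:
    """Answers to the four decision-tree questions (field names are illustrative)."""
    active_sev1_or_sev2: bool          # Q1: is an active SEV1/SEV2 driving the change?
    touches_prod_platform: bool        # Q2: Prod capacity, tenant settings, or compliance posture?
    new_item_or_schema_change: bool    # Q3: new item, schema change, or first-time Prod deploy?
    matches_preapproved_pattern: bool  # Q4: documented pre-approved pattern?


def classify_change(facts: ChangeFacts) -> str:
    """Walk the questions in order and return the first tier that applies."""
    if facts.active_sev1_or_sev2:
        return "Emergency"
    if facts.touches_prod_platform:
        return "Major"
    if facts.new_item_or_schema_change:
        return "Normal"
    if facts.matches_preapproved_pattern:
        return "Standard"
    # Q4 answered "No": anything that is not a known pattern falls back to Normal.
    return "Normal"
```

Because the checks run top-down, a change that matches a pre-approved pattern but also touches tenant settings still comes out as Major, which mirrors the "classify higher" rule of thumb.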
RFC Template
Every Normal, Major, or Emergency change requires an RFC (Request for Change). Standard changes only need a well-described PR. Copy this template into a new file under docs/rfcs/RFC-YYYYMMDD-NN-short-title.md or paste it into the PR description.
# RFC-YYYYMMDD-NN: <Short Title>
**Author:** <name@org> | **Created:** YYYY-MM-DD
**Status:** Draft | Submitted | Approved | Rejected | Deployed | Rolled Back | Closed
**Change Type:** Standard | Normal | Major | Emergency
**Risk Score:** Low (4-6) | Medium (7-9) | High (10-12) | Critical (13-16) (see Risk Framework)
**Related Incident:** INC-YYYYMMDD-NN (if applicable)
## 1. Summary
One paragraph: what is changing, who benefits, why now.
## 2. Affected Systems
- **Workspaces:** casino-fabric-{dev,staging,prod}
- **Capacities:** fabric-eastus2-f64
- **Items:** lh_bronze.slot_telemetry, nb_01_bronze_slot_telemetry, pipeline_casino_daily
- **Downstream Consumers:** Power BI report "Casino Floor Daily KPI"; Data Agent "casino-compliance-bot"
- **Upstream Dependencies:** None | <list>
- **Compliance Scope:** NIGC MICS / FedRAMP / HIPAA / 42 CFR Part 2 / None
## 3. Rollout Plan
**3.1 Pre-Checks (must pass before deploy)**
- [ ] CI green (validate, test, lint, GE checkpoints)
- [ ] Bicep what-if reviewed (infra changes)
- [ ] Stakeholders notified in #release-comms (24 hr Normal, 72 hr Major)
- [ ] Outside freeze window; no active SEV1/SEV2 incidents
**3.2 Deploy Steps**
1. Merge PR #<num> to `main` (squash-and-merge)
2. Auto-deploy to Dev via `Deploy Fabric Items` workflow
3. `gh workflow run deploy-fabric.yml -f target_environment=staging`
4. Smoke tests: `python scripts/fabric_smoke_test.py --workspace-id $STAGING_WS`
5. `gh workflow run deploy-fabric.yml -f target_environment=production` (dual-approver for Major)
**3.3 Post-Checks**
- [ ] Smoke tests pass in Prod
- [ ] Power BI dataset refresh succeeds
- [ ] Capacity utilization ≤ baseline + 10% over 30 min
- [ ] No new alerts in Data Activator
- [ ] Audit-log entry visible in Workspace Monitoring
## 4. Rollback Plan
**Trigger conditions:** Smoke-test failure; capacity utilization >90% sustained 15 min; SEV2+ opened; GE regression.
**Steps:** <git revert SHA / fabric-cicd redeploy from prior tag / Deployment Pipelines backwards-deploy>; verification; notify channels.
**Rehearsal evidence (Major required):** <link to Staging rollback run>.
## 5. Testing Evidence
- PR: <link> | CI run: <link> | Staging deploy: <link> | GE checkpoint: <link> | Manual notes: <link>
## 6. Reviewer Signoffs
| Role | Name | Approval | Date |
|------|------|----------|------|
| Author | | [x] | |
| Peer Reviewer (PR) | | [ ] | |
| CAB Reviewer | | [ ] | |
| Platform Lead (Major) | | [ ] | |
| Service Owner (Major) | | [ ] | |
| Compliance Officer (compliance-impacting) | | [ ] | |
## 7. Deployment Window (UTC)
- **Planned start:** YYYY-MM-DD HH:MM | **Planned end:** YYYY-MM-DD HH:MM
- **Monitoring window:** 2× expected stabilization (per risk score)
## 8. Communication Plan
Pre-deploy T-24h, T-0, T+30m, and post-monitoring updates, all in #release-comms; RFC updated on close.
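The `RFC-YYYYMMDD-NN-short-title.md` naming convention lends itself to a simple automated check. The following is a minimal sketch, assuming a two-digit sequence number and a lowercase hyphenated title; the function name and the returned keys are hypothetical.

```python
import re
from datetime import datetime

# Assumed shape: RFC-<8-digit date>-<2-digit sequence>-<lowercase-hyphenated-title>.md
RFC_NAME = re.compile(r"^RFC-(\d{8})-(\d{2})-([a-z0-9]+(?:-[a-z0-9]+)*)\.md$")


def parse_rfc_filename(name: str):
    """Return the parsed parts of an RFC filename, or None if it is malformed."""
    match = RFC_NAME.match(name)
    if match is None:
        return None
    try:
        # Reject impossible calendar dates such as 20260231.
        created = datetime.strptime(match.group(1), "%Y%m%d").date()
    except ValueError:
        return None
    return {
        "created": created,
        "sequence": int(match.group(2)),
        "slug": match.group(3),
    }
```

A check like this could run in CI against new files under `docs/rfcs/`, though whether to enforce it there is a team decision.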
CAB (Change Advisory Board) Process
Cadence
| Session | Day | Time (UTC) | Focus |
|---|---|---|---|
| Tuesday CAB | Tuesday | 14:00 | Normal changes for current week + Major change pre-reads |
| Thursday CAB | Thursday | 14:00 | Major change approvals + Emergency post-hoc reviews + retro on prior week |
Quorum
| Required | Role |
|---|---|
| Yes | Platform Lead (chair) |
| Yes | Senior Data Engineer |
| Yes | Compliance / Security rep (required for Major or compliance-impacting Normal) |
| Yes | Service Owner of the affected domain (Casino, Federal-USDA, Federal-DOJ, etc.) |
| Optional | Power BI / BI rep, FinOps rep, On-call engineer |
Quorum minimum: 3 voters including the Platform Lead. No vote without compliance rep for compliance-scope changes.
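The quorum rule reduces to three conditions. As a rough sketch, with role strings taken from the table above and a hypothetical `has_quorum` helper:

```python
def has_quorum(voter_roles: list, compliance_scope: bool = False) -> bool:
    """Check the quorum rule for one CAB vote.

    voter_roles holds one role string per voter who is present.
    """
    if len(voter_roles) < 3:
        return False  # fewer than the minimum of 3 voters
    if "Platform Lead" not in voter_roles:
        return False  # the chair must be among the voters
    if compliance_scope and "Compliance / Security rep" not in voter_roles:
        return False  # compliance-scope changes need the compliance rep
    return True
```

For example, `has_quorum(["Platform Lead", "Senior Data Engineer", "Service Owner"], compliance_scope=True)` is `False` because the compliance rep is absent.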
Decision Criteria
A change is approved when all of the following hold:
- RFC is complete (no missing sections, evidence links resolve)
- Risk score and classification match the change content (CAB may re-classify)
- Rollback plan is concrete and (for Major) rehearsed in Staging
- Deployment window is outside any active freeze or has a freeze exemption
- No active SEV1/SEV2 incidents on affected systems
- Service Owner of the affected domain has signed off
Approval Workflow
flowchart LR
A[Author opens PR] --> B[Author drafts RFC]
B --> C[Author applies label<br/>change-tier:normal/major]
C --> D[CAB triage bot<br/>auto-assigns reviewers]
D --> E{Tier?}
E -->|Standard| F1[PR review only]
E -->|Normal| F2[Tuesday CAB queue]
E -->|Major| F3[Tuesday pre-read +<br/>Thursday vote]
E -->|Emergency| F4[Skip CAB,<br/>Thursday post-hoc review]
F1 --> G[Approve & merge]
F2 --> G
F3 --> G
F4 --> H[Hotfix deploy]
H --> F4b[Post-hoc CAB record]
style F4 fill:#E74C3C,color:#fff
style F3 fill:#E67E22,color:#fff
style F2 fill:#2471A3,color:#fff
style F1 fill:#27AE60,color:#fff
Tooling
The CAB workflow is implemented entirely with GitHub primitives. No separate tool is required, but the workflow maps cleanly to ServiceNow/Jira if your org uses one:
| Function | GitHub Primitive | ServiceNow / Jira Equivalent |
|---|---|---|
| RFC document | PR description or docs/rfcs/*.md | CHG record |
| Classification | PR label change-tier:{standard,normal,major,emergency} | CHG type field |
| CAB queue | GitHub Project board "CAB" with columns Triage / Review / Approved / Deployed | CHG queue |
| Approval | PR approving review by CAB member | CHG approval action |
| Audit trail | PR + commit SHA + Action run | CHG record + attachments |
SLAs for CAB Review
| Tier | Review SLA | Window |
|---|---|---|
| Standard | n/a (PR review only; team SLA is 1 business day) | n/a |
| Normal | 48 business hours from RFC submission | Reviewed at next Tuesday CAB |
| Major | 5 business days (must hit one Tuesday pre-read + one Thursday vote) | Reviewed across Tuesday + Thursday |
| Emergency | 0 (deploy first, review at next Thursday CAB) | Post-hoc within 7 days |
Freeze Windows
Freeze windows block Normal and Major changes from reaching Production. Standard and Emergency changes can still proceed under the rules below. Freeze windows are calendarized at the start of each fiscal year and published in Change Calendar.
| Freeze | Default Window | Affects | Exemption |
|---|---|---|---|
| Holiday Freeze | Last Wednesday before US Thanksgiving through Jan 5 (next year) | Normal, Major | CAB unanimous + VP Eng approval |
| Quarter-End Freeze | Last 3 business days of each quarter | Major only | CAB unanimous |
| Major Release Freeze | T-2 business days before any Major release through T+1 business day after | Normal, Major (other than the release itself) | CAB chair + Service Owner |
| Compliance / Audit Freeze | Annual SOC 2 / NIGC / FedRAMP audit window (typically 2 weeks, scheduled) | Normal, Major touching compliance scope | Compliance Officer + CAB unanimous |
| Capacity-Risk Freeze | When F64 utilization >85% for 7 days running | Major capacity changes | Platform Lead (after capacity review) |
Exemption Process
- Author opens RFC with the `freeze-exemption` label.
- RFC must include: business justification, rollback rehearsal evidence, and reduced-blast-radius plan (e.g., deploy to one workspace first).
- CAB votes at next session (or async via PR review for time-sensitive cases).
- Required votes per the table above. Unanimous means every quorum voter votes yes; abstentions count as "no".
- Exemption is logged in the RFC and the freeze calendar entry; auditors will see both.
Standard changes during freeze: Standard changes (parameter tweaks, GE rule updates, doc edits) are allowed during all freezes except the Compliance / Audit Freeze. They still need PR review and must pass CI.
Emergency changes during freeze: Always allowed; they exist precisely to handle production-impacting issues during freezes.
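A freeze gate comes down to asking whether a tier is blocked on a given day. The sketch below is one possible shape for that check, not the implemented CI gate: `FreezeWindow`, `last_business_days_of_quarter`, and `blocking_freezes` are hypothetical names, and public holidays are ignored when counting business days.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class FreezeWindow:
    name: str
    start: date               # inclusive
    end: date                 # inclusive
    blocked_tiers: frozenset  # e.g. frozenset({"Normal", "Major"})


def last_business_days_of_quarter(year: int, quarter: int, count: int = 3) -> list:
    """Return the last `count` weekdays (Mon-Fri) of a quarter, oldest first."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1-4")
    end_month = quarter * 3
    if end_month == 12:
        day = date(year, 12, 31)
    else:
        # First day of the following month minus one day is the quarter's last day.
        day = date(year, end_month + 1, 1) - timedelta(days=1)
    days = []
    while len(days) < count:
        if day.weekday() < 5:  # 0-4 are Monday-Friday
            days.append(day)
        day -= timedelta(days=1)
    return sorted(days)


def blocking_freezes(tier: str, day: date, windows: list) -> list:
    """Names of the freeze windows that block `tier` on `day`."""
    if tier == "Emergency":
        return []  # Emergency changes are always allowed
    return [w.name for w in windows
            if w.start <= day <= w.end and tier in w.blocked_tiers]
```

For Q2 2026, `last_business_days_of_quarter(2026, 2)` yields June 26, 29, and 30, so a Quarter-End Freeze built from those dates would block a Major change on Monday, June 29, but not a Normal one.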
Change Calendar
A representative quarter (Q2 2026) showing freezes, release trains, and audit windows. Each team should publish its own calendar at the start of the quarter and link it from the CAB Project board.
gantt
title Q2 2026 Change Calendar (representative)
dateFormat YYYY-MM-DD
axisFormat %b %d
section Release Trains
Bi-weekly release (Wave 1) :active, r1, 2026-04-08, 1d
Bi-weekly release (Wave 2) : r2, 2026-04-22, 1d
Bi-weekly release (Wave 3) : r3, 2026-05-06, 1d
Bi-weekly release (Wave 4) : r4, 2026-05-20, 1d
Bi-weekly release (Wave 5) : r5, 2026-06-03, 1d
section Major Release
Phase 14 GA (Major) :crit, m1, 2026-06-10, 3d
Major Release Freeze :crit, mf, 2026-06-08, 5d
section Quarter-End Freeze
Q2 Quarter-End Freeze :crit, qf, 2026-06-26, 5d
section Audits
SOC 2 Type II evidence collection :crit, a1, 2026-05-12, 14d
section CAB Sessions
Tuesday CAB :milestone, c1, 2026-04-07, 0d
Thursday CAB :milestone, c2, 2026-04-09, 0d
A live calendar is planned for `docs/operations/change-calendar.yaml` (Phase 14 Wave 2). Freeze ranges in that file will feed CI gates.
Risk Assessment Framework
Every RFC computes a risk score from four factors. The score determines the required approvals, the rollback rehearsal expectation, and the post-deploy monitoring window. The framework is deliberately simple: auditors and on-call engineers should be able to read the score and immediately know what was approved.
Factor 1: Blast Radius
| Score | Definition | Examples |
|---|---|---|
| 1 | Single notebook / pipeline within one workspace | Notebook param tweak |
| 2 | Single workspace, multiple items | New Bronze ingestion in casino-fabric-prod |
| 3 | Single capacity, multiple workspaces | OneLake security policy on F64 |
| 4 | Tenant-wide / multi-capacity | Tenant setting change; CMK rotation |
Factor 2: Reversibility
| Score | Definition | Examples |
|---|---|---|
| 1 | Instant revert (one git revert + redeploy ≤ 15 min) | Notebook code change |
| 2 | Staged revert (Deployment Pipelines backwards-deploy ≤ 1 hr) | Pipeline / semantic model change |
| 3 | Hours-long restore (Lakehouse table restore from time-travel or backup) | Schema change requiring data rewrite |
| 4 | Data-affecting / irreversible (Bronze data lost; capacity SKU change with billing impact) | VACUUM with short retention; downscale capacity |
Factor 3: User Impact
| Score | Definition |
|---|---|
| 1 | None: no user-visible effect |
| 2 | Read-only degraded: slower BI but no errors |
| 3 | Write-blocking: pipeline failures or partial unavailability |
| 4 | Full outage: workspace or capacity unavailable |
Factor 4: Compliance Impact
| Score | Definition |
|---|---|
| 1 | None |
| 2 | Documentation / audit-log change only (no control change) |
| 3 | Modifies a compliance control (CMK, OAP, audit retention, RBAC scope) |
| 4 | Affects regulated data path (Bronze for CTR/SAR, HIPAA PHI, 42 CFR Part 2, FedRAMP boundary) |
Risk Score → Required Approvals
The risk score is the sum of the four factors. The mapping below is enforced by the CAB and reflected in PR labels.
| Sum | Risk | Required Approvals | Rollback Rehearsal | Monitoring Window |
|---|---|---|---|---|
| 4–6 | Low | Standard PR review (1 reviewer) | Not required | 30 min |
| 7–9 | Medium | CAB quorum (Normal change) | Recommended | 60 min |
| 10–12 | High | CAB unanimous (Major change) | Required in Staging | 2× expected stabilization, min 2 hr |
| 13–16 | Critical | CAB unanimous + VP Eng + Compliance Officer | Required in Staging + Pre-prod data dry-run | 24 hr with on-call coverage |
Worked example: adding a new Bronze ingestion for DOJ federal data.
Blast radius = 2 (single workspace, new item). Reversibility = 1 (drop the table). User impact = 1 (none; net new). Compliance impact = 4 (FedRAMP boundary).
Sum = 2 + 1 + 1 + 4 = 8 → Medium / Normal change. CAB quorum required, rollback rehearsal recommended, 60-min monitoring. Compliance rep is mandatory at CAB because Factor 4 ≥ 3.
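The scoring arithmetic is small enough to sketch directly. The `risk_band` function below is a hypothetical helper that sums the four factors, maps the total to the bands in the table, and flags when the compliance rep is mandatory (Factor 4 of 3 or more, as in the worked example).

```python
def risk_band(blast_radius: int, reversibility: int,
              user_impact: int, compliance_impact: int):
    """Return (total, band, compliance_rep_required) for one RFC."""
    factors = (blast_radius, reversibility, user_impact, compliance_impact)
    for value in factors:
        if not 1 <= value <= 4:
            raise ValueError("each factor must be scored 1-4")
    total = sum(factors)  # always between 4 and 16
    if total <= 6:
        band = "Low"
    elif total <= 9:
        band = "Medium"
    elif total <= 12:
        band = "High"
    else:
        band = "Critical"
    return total, band, compliance_impact >= 3


# The DOJ Bronze ingestion example: 2 + 1 + 1 + 4
print(risk_band(2, 1, 1, 4))
```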
Rollback Policy
Every change has a rollback plan. No exceptions. The rollback plan is part of the RFC and is reviewed by CAB. A change without a rollback plan is rejected at triage.
Rollback SLAs
| Trigger | Target Rollback Time |
|---|---|
| SEV1 incident caused by a deploy | 15 minutes to begin rollback (this is the Failed Change Procedure) |
| SEV2 incident caused by a deploy | 30 minutes to begin rollback |
| Smoke-test failure in Prod | Begin rollback during the same deploy window; do not leave Prod in a broken state |
| GE checkpoint regression | Within 1 hour, after on-call engineer confirms regression is caused by the deploy |
Rollback Mechanisms (in preference order)
- `git revert` + redeploy via `fabric-cicd`: preferred for code/config changes. Linear history is preserved by squash-merge, so the revert is a single commit.
- Deployment Pipelines backwards-deploy: when a previously-approved Staging state is known good, deploy Staging → Prod.
- fabric-cicd redeploy from prior Git tag: `gh workflow run deploy-fabric.yml -f git_ref=v1.2.3 -f target_environment=production`.
- Lakehouse time-travel: `RESTORE TABLE lh_bronze.slot_telemetry TO VERSION AS OF <n>` for data-level regressions.
- Bicep redeploy from prior parameter file: for infra-level rollbacks (capacity, networking, CMK).
Rollback Testing: Mandatory for Major
Major changes must rehearse the rollback in Staging before Prod deploy. The CAB will reject Major RFCs whose rollback rehearsal evidence link is missing or returns an empty workflow run.
Rehearsal procedure:
1. Deploy the change to Staging.
2. Smoke-test Staging and confirm the working state.
3. Execute the rollback steps from the RFC against Staging.
4. Smoke-test Staging and confirm the rollback restores the prior state.
5. Re-deploy the change to Staging (return to working state).
6. Attach all four workflow run links to the RFC under §4 Rollback Plan.
Execution mechanics: detailed step-by-step rollback procedures (including data-affecting rollbacks, Power BI rebinding, and CMK rotation rollback) live in Tenant Migration Runbook → Rollback Procedure. This document is policy; that runbook is execution.
Integration with fabric-cicd + Deployment Pipelines
Change management is enforced in code via the existing CI/CD plumbing. There is no separate ticketing system to learn; labels on a PR drive the workflow.
flowchart LR
A[Author opens PR] --> B[Apply label<br/>change-tier:*]
B --> C[Required CI checks<br/>validate · test · lint · GE]
C --> D{Tier?}
D -->|Standard| E1[PR approval ≥ 1]
D -->|Normal| E2[PR + CAB label<br/>cab-approved]
D -->|Major| E3[PR + CAB +<br/>major-approved label]
D -->|Emergency| E4[Hotfix bypass<br/>requires emergency-approved]
E1 --> F[Merge to main]
E2 --> F
E3 --> F
E4 --> F
F --> G[fabric-cicd<br/>auto-deploy Dev]
G --> H[Manual gh workflow run<br/>→ Staging]
H --> I[Smoke tests +<br/>Deployment Pipelines<br/>stage gate]
I --> J[Manual gh workflow run<br/>→ Production<br/>dual-approver for Major]
style E4 fill:#E74C3C,color:#fff
style E3 fill:#E67E22,color:#fff
style E2 fill:#2471A3,color:#fff
style E1 fill:#27AE60,color:#fff
Branch Protection Required Settings
These are enforced on main and codify the policy above:
Settings → Branches → Branch protection rule for "main"
[x] Require a pull request before merging
[x] Require approvals: 1 (Standard); bot enforces ≥2 if change-tier:major label present
[x] Require status checks: validate, test, lint, ge-checkpoint, cab-label-required
[x] Require conversation resolution before merging
[x] Do not allow bypassing the above settings
[x] Restrict who can push to matching branches: platform-leads team only
Required CI Checks
| Check | Purpose | Blocking |
|---|---|---|
| validate | Bicep build + fabric-cicd dry-run | Yes |
| test | pytest validation/unit_tests/ (612 tests) | Yes |
| lint | ruff, bicep linter, markdownlint | Yes |
| ge-checkpoint | Great Expectations bronze + silver | Yes |
| cab-label-required | Bot fails if change-tier:* label is missing | Yes |
| cab-approval-required | Bot fails if Normal/Major lacks cab-approved label | Yes |
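The two label-based checks in the table can be sketched as a pure function over a PR's label set. This is an illustration of the logic only, not the bot's actual code: `check_cab_labels` is a hypothetical name, and requiring exactly one `change-tier:*` label is an assumption drawn from the rule that each change belongs to exactly one tier.

```python
VALID_TIERS = {"standard", "normal", "major", "emergency"}


def check_cab_labels(labels: set) -> dict:
    """Evaluate cab-label-required and cab-approval-required for one PR."""
    tiers = {label.split(":", 1)[1] for label in labels
             if label.startswith("change-tier:")}
    # Exactly one recognised tier label must be present.
    label_ok = len(tiers) == 1 and tiers <= VALID_TIERS
    tier = next(iter(tiers)) if label_ok else None
    # Only Normal and Major changes need the cab-approved label.
    needs_cab = tier in {"normal", "major"}
    approval_ok = label_ok and (not needs_cab or "cab-approved" in labels)
    return {
        "cab-label-required": label_ok,
        "cab-approval-required": approval_ok,
    }
```

A PR labelled only `change-tier:normal` passes the first check and fails the second until `cab-approved` is added.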
Deployment Pipelines Stage Gating
For teams using Deployment Pipelines (Path B in the Tenant Migration Runbook), the same labels gate stage promotion:
| Stage | Auto-deploy on merge | Requires cab-approved | Requires Dual Approver |
|---|---|---|---|
| Dev | Yes | No | No |
| Staging | No (manual gh workflow run) | Yes (Normal+) | No |
| Production | No (manual + GitHub Environment approval) | Yes (Normal+) | Yes (Major) |
See fabric-cicd-deployment.md for the underlying workflow definition and deployment-pipelines.md for the native portal-based path.
Audit Trail Requirements
Every change must be traceable end-to-end from business request to production deployment. Auditors will trace any single Prod deploy back through this chain:
RFC ID → CAB approval record → PR # → Commit SHA → CI run → fabric-cicd workflow run → Workspace audit log entry
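One way to make the chain checkable is to model each link as a field and report whichever links are missing. The sketch below assumes nothing about where the values come from; `AuditChain` and `missing_links` are hypothetical names.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class AuditChain:
    """One field per link in the chain, in trace order."""
    rfc_id: Optional[str] = None
    cab_approval_record: Optional[str] = None
    pr_number: Optional[int] = None
    commit_sha: Optional[str] = None
    ci_run: Optional[str] = None
    deploy_workflow_run: Optional[str] = None
    workspace_audit_log_entry: Optional[str] = None


def missing_links(chain: AuditChain) -> list:
    """Return the names of links that are unset or empty, in chain order."""
    return [f.name for f in fields(chain)
            if getattr(chain, f.name) in (None, "")]
```

An empty result means every link is populated; it does not verify that the referenced records actually exist.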
Required Artifacts (per change)
| Artifact | Where Stored | Retention | Owner |
|---|---|---|---|
| RFC | docs/rfcs/RFC-*.md (Git) or PR description | 7 years (matches Bronze retention floor) | RFC Author |
| CAB minutes | GitHub Project board card + docs/cab-minutes/YYYY-MM-DD.md | 7 years | CAB Chair |
| PR + review history | GitHub | Indefinite (Git) | n/a |
| CI run | GitHub Actions | 90 days (re-runnable from commit SHA) | n/a |
| Deploy workflow run | GitHub Actions + workspace audit log | 90 days (Actions) + per SQL Audit Logs Compliance (workspace) | Platform Lead |
| Approver identity & timestamp | GitHub Environment approval log | Indefinite | n/a |
Federal / Regulated Workloads
Federal agencies (USDA, SBA, NOAA, EPA, DOI, DOJ, DOT/FAA, Tribal Healthcare) and casino (NIGC) workloads require additional capture:
- Approver identity must include MFA assertion (enforced via Conditional Access, not in this doc)
- Audit log must be exported to immutable storage (Azure Storage with legal hold); see SQL Audit Logs Compliance
- For Major changes touching FedRAMP boundary: dual-approver names captured in RFC §6 and surfaced in workspace audit log
Post-Change Verification
A change is not complete when the deploy workflow turns green. It is complete when the monitoring window has elapsed without regression.
Automated Smoke Tests
Every Prod deploy runs the smoke-test script automatically after the deploy step.
Smoke tests verify:
- All notebooks are executable (test-run a sample bronze, silver, gold notebook)
- Lakehouse schemas match the prior Prod schemas (DESCRIBE EXTENDED)
- Pipeline test-run completes end-to-end without error
- Power BI dataset refresh succeeds
- Sample queries return expected row counts (within ±5% of prior baseline)
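The row-count check in the last bullet is a relative-tolerance comparison. A minimal sketch, assuming the baseline is the prior Prod row count; `within_tolerance` is a hypothetical helper and is not taken from the smoke-test script itself.

```python
def within_tolerance(actual: int, baseline: int, tolerance: float = 0.05) -> bool:
    """True when `actual` is within +/- `tolerance` (as a fraction) of `baseline`."""
    if actual < 0 or baseline < 0:
        raise ValueError("row counts cannot be negative")
    if baseline == 0:
        # A relative tolerance is undefined against zero; require an exact match.
        return actual == 0
    return abs(actual - baseline) <= tolerance * baseline
```

With a baseline of 1,000 rows, 1,040 rows passes and 1,060 rows fails.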
Manual Checks Per RFC Checklist
The RFC §3.3 Post-Checks section is treated as a literal checklist: the author ticks each box and updates the RFC to Status: Deployed. Missing checks block RFC closure.
Monitoring Window
| Risk | Window |
|---|---|
| Low | 30 min |
| Medium | 60 min |
| High | 2× expected stabilization, minimum 2 hr |
| Critical | 24 hr with on-call coverage |
During the window the author must stay reachable. New alerts in Data Activator or Workspace Monitoring during the window count as a regression and trigger the Failed Change Procedure.
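The window lookup is a fixed mapping except for the High band, which takes the larger of twice the expected stabilization time and two hours. A short sketch, with `monitoring_window` as a hypothetical helper:

```python
from datetime import timedelta


def monitoring_window(risk: str,
                      expected_stabilization: timedelta = timedelta(0)) -> timedelta:
    """Return the post-deploy monitoring window for a risk band."""
    if risk == "Low":
        return timedelta(minutes=30)
    if risk == "Medium":
        return timedelta(minutes=60)
    if risk == "High":
        # 2x expected stabilization, but never less than 2 hours.
        return max(2 * expected_stabilization, timedelta(hours=2))
    if risk == "Critical":
        return timedelta(hours=24)
    raise ValueError(f"unknown risk band: {risk!r}")
```

A High-risk change expected to stabilize in 45 minutes still gets the 2-hour floor; one expected to take 90 minutes gets 3 hours.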
Failed Change Procedure
A change is "failed" if any of the following are true within the monitoring window:
- Smoke test fails in Prod
- New SEV1 or SEV2 incident is opened on the affected systems
- GE checkpoint regression (any expectation that passed pre-deploy now fails)
- Capacity utilization >90% sustained for 15 minutes (when not previously)
- Power BI dataset refresh fails on a previously-passing dataset
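Of these conditions, the "sustained" capacity rule is the one most open to interpretation. The sketch below shows one reading: a run of consecutive samples above the threshold whose first and last timestamps are at least 15 minutes apart. The `sustained_above` name and the `(timestamp, percent)` sample shape are assumptions, samples are expected in time order, and the "when not previously" baseline comparison is left out.

```python
from datetime import datetime, timedelta


def sustained_above(samples, threshold: float = 90.0,
                    duration: timedelta = timedelta(minutes=15)) -> bool:
    """True if utilization stays above `threshold` for at least `duration`.

    samples: iterable of (timestamp, utilization_percent), oldest first.
    """
    run_start = None
    for timestamp, percent in samples:
        if percent > threshold:
            if run_start is None:
                run_start = timestamp  # a new run above the threshold begins
            if timestamp - run_start >= duration:
                return True
        else:
            run_start = None  # any dip resets the run
    return False
```

With samples every 5 minutes, four consecutive readings of 95% (spanning 15 minutes) trip the condition, while a single dip to 80% in the middle resets the clock.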
Procedure
flowchart LR
A[Failure detected] --> B[Trigger rollback<br/>per RFC §4]
B --> C[Open incident<br/>per Incident Response Template]
C --> D[Notify CAB chair +<br/>Service Owner]
D --> E[Stabilize Prod]
E --> F{Major change?}
F -->|Yes| G[Mandatory<br/>post-mortem ≤ 5 days]
F -->|No| H[Optional post-mortem<br/>at CAB discretion]
G --> I[Update RFC<br/>Status: Rolled Back]
H --> I
style A fill:#E74C3C,color:#fff
Required Artifacts on Failed Change
- Incident record: per Incident Response Template. Link the RFC ID in the incident.
- Rolled-back state confirmation: smoke tests pass post-rollback.
- Failed-change post-mortem (Major only, recommended for Normal): see Templates Provided.
- CAB notification: author posts the failure summary in the CAB channel within 1 business hour of detection.
Repeat-offender rule: Any author/team with two failed Major changes in a rolling 90-day window must run their next Major RFC through a pre-CAB technical deep-dive before submission.
Anti-Patterns
1. Verbal CAB Approval
Problem: "I asked the Platform Lead in the hallway and they said it's fine." Symptom: no audit trail; approver does not remember the conversation six months later. Fix: all approvals captured in the PR or CAB minutes. If it is not in writing, it did not happen.
2. Classifying Down to Avoid CAB
Problem: a schema change in Silver labeled Standard "because it is just a column add." Symptom: downstream Direct Lake semantic model breaks; no review caught it. Fix: use the Decision Tree. Any schema change in a layer downstream consumers depend on is Normal, minimum.
3. Rollback Plan = "Redeploy Prior Commit"
Problem: one-line rollback plan, no verification, no data-level steps. Symptom: on-call discovers the prior commit restores only code state, not data state, because Bronze writes are append-only. Fix: rollback plan addresses code and data state; Major changes rehearse in Staging.
4. Friday-Afternoon Prod Deploys
Problem: Normal change squeezed into Prod late Friday before a long weekend. Symptom: regression detected Saturday; author unreachable; SEV2 stretches into 36-hr incident. Fix: Prod deploys land Monday through Thursday, before 16:00 local of the on-call rotation. Friday Prod deploys require Platform Lead approval + explicit on-call handoff.
5. "Emergency" Used to Skip CAB
Problem: routine changes get emergency-approved to bypass the 48-hr Normal SLA. Symptom: auditors flag emergency abuse. Fix: emergency requires a linked active SEV1/SEV2 incident; CAB chair audits emergency usage monthly and revokes the label from abusers.
6. No Monitoring Window
Problem: author merges, deploy turns green, author closes laptop. Symptom: users find the regression hours later; on-call inherits a problem they did not ship. Fix: author stays reachable for the full monitoring window per Risk Framework.
Summary
| Anti-Pattern | Risk | Fix |
|---|---|---|
| Verbal CAB approval | No audit trail | Capture in writing |
| Classifying down | Skipped review | Use decision tree |
| Thin rollback plan | Bad rollback under pressure | Rehearse in Staging (Major) |
| Friday Prod deploys | On-call ambushed | Mon–Thu, before 16:00 |
| "Emergency" abuse | CAB bypass | Require linked SEV1/SEV2 |
| No monitoring window | Users find the regression | Author on-call per risk score |
Templates Provided
The following templates live alongside this doc in docs/best-practices/operations/templates/ (added in Phase 14 Wave 2). Until then, copy from the inline blocks below or from this document directly.
1. RFC Template
See § RFC Template above. Copy into docs/rfcs/RFC-YYYYMMDD-NN-short-title.md.
2. Failed-Change Post-Mortem Template
# Failed-Change Post-Mortem: RFC-YYYYMMDD-NN
**Date:** YYYY-MM-DD HH:MM UTC | **RFC:** RFC-... | **Incident:** INC-...
**Change Author:** <name> | **Post-Mortem Author:** <name; NOT the change author>
## 1. Timeline (UTC)
| Time | Event |
|------|-------|
| HH:MM | Change deployed to Prod |
| HH:MM | First alert / user report |
| HH:MM | Rollback initiated |
| HH:MM | Prod stable |
## 2. What Happened
Blameless 1-2 paragraph summary.
## 3. Root Cause
Technical root cause. Not "human error": what allowed it to reach Prod?
## 4. Why CAB Did Not Catch It
Gap in RFC, rollback plan, rehearsal, or CAB checklist.
## 5. Action Items
| # | Action | Owner | Due |
|---|--------|-------|-----|
## 6. Process Updates
Linked PRs updating RFC template, decision tree, freeze rules, or CAB checklist.
3. CAB Minutes Template
# CAB Minutes: YYYY-MM-DD (Tuesday | Thursday)
**Chair:** <name> | **Quorum:** Yes / No | **Voters:** <list>
## RFCs Reviewed
| RFC | Tier | Risk | Decision | Conditions |
|-----|------|------|----------|------------|
| RFC-... | Normal | 8 | Approved | Monitor 60 min |
## Emergency Post-Hoc Reviews
| RFC | Incident | Decision | Action Items |
|-----|----------|----------|--------------|
## Freeze Exemptions
<list with vote tallies, or "none">
## Process Carry-Overs
- ...
Related Runbooks & Best-Practice Docs
Runbooks (Execution)
- Tenant Migration: Dev → Staging → Prod (the runbook this policy wraps)
- Incident Response Template: incident path triggered by failed changes
- Capacity Throttling Response | Pipeline Failure Triage | Auth Failure Playbook
Best-Practice Docs (Adjacent Policy)
- fabric-cicd Deployment: CI/CD machinery enforcing this policy
- Deployment Pipelines: native ALM tool
- Capacity Planning & Cost Optimization | Disaster Recovery & BCDR
- Network Security | Identity & RBAC Patterns | SQL Audit Logs Compliance
- Customer-Managed Keys | Outbound Access Protection
External References
- ITIL 4: Change Enablement (concept anchor)
- Microsoft Fabric: Deployment Pipelines REST API | fabric-cicd library
- GitHub: Branch Protection | Environments & Required Reviewers
Maintained by Platform Engineering · Reviewed quarterly · Last review 2026-04-27