🚀 Tutorial 39: End-to-End MLOps — Train → Register → Deploy → Monitor → Retrain¶
Last Updated: 2026-04-27 | Version: 1.0 | Status: ✅ Final | Maintainer: Documentation Team
🚀 Tutorial 39: End-to-End MLOps on Microsoft Fabric¶
| Difficulty | ⭐⭐⭐⭐ Advanced |
| Time | ⏱️ 120-180 minutes |
| Focus | Production ML lifecycle: Train, Register, Validate, Deploy, Monitor, Retrain |
| Phase | Phase 14 Wave 2 — Feature 2.14 (canonical end-to-end MLOps walkthrough) |
📊 Progress Tracker¶
┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│  27  │  28  │  29  │  30  │  31  │  32  │  33  │  34  │  35  │  36  │  37  │  38  │  39  │
│VIDEO │ MOVE │GEOLC │TRIBL │ DOT  │USDA  │ SBA  │NOAA  │ EPA  │ DOI  │GRAPH │ DOJ  │MLOPS │
├──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┤
│  ✅  │  ✅  │  ✅  │  ✅  │  ✅  │  ✅  │  ✅  │  ✅  │  ✅  │  ✅  │  ✅  │  ✅  │  🔵  │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
                                                                                       ▲
                                                                                YOU ARE HERE
| Navigation | |
|---|---|
| ⬅️ Previous | 38-DOJ Justice Analytics |
| ➡️ Next | Tutorial 40 — RAG Production (planned) |
📖 What You'll Build¶
By the end of this tutorial you will have shipped a production-grade casino slot revenue forecasting model on Microsoft Fabric — from Git-tracked source notebooks all the way through to a live ML Model Endpoint with drift-driven automated retraining. This is the canonical Phase 14 Wave 2 walkthrough that exercises every Wave 2 doc and notebook in a single coherent pipeline: MLflow registry, validation gates, champion/challenger evaluation, batch + online deployment, drift detection, alert wiring, and Logic-App-driven retraining.
You will not just train a model — you will operate it. When you finish, the model will be in MLflow's Production stage, exposed through a live REST endpoint, and protected by a closed-loop drift → alert → retrain → re-validate → re-promote pipeline.
┌──────────────────────────────────────────────────────────────────────────────────┐
│  🚀 END-TO-END MLOps WALKTHROUGH: CASINO SLOT REVENUE FORECAST                   │
│                                                                                  │
│  Train ──▶ Register ──▶ Validate ──▶ Promote ──▶ Deploy ──▶ Monitor ──▶ Retrain │
│    │          │            │            │          │           │           │     │
│  MLflow    ML Model     5 Gates     Stage Trans  Endpoint   Drift NB    Logic    │
│  Tracking  Registry     pytest      MlflowClient + Batch    ml_drift_*  App      │
│                                                                                  │
│  All artifacts in OneLake  │  All code in Git  │  All promotions audited         │
└──────────────────────────────────────────────────────────────────────────────────┘
💡 Why this tutorial matters
Most Fabric ML demos stop at "I trained a model." This one stops at "the model is live, healthy, monitored, and retrains itself." Every step references a real notebook or doc in this repo, so you are not following pseudocode — you are operating the actual production backbone.
🎯 Learning Objectives¶
By the end of this tutorial, you will be able to:
- Configure a Fabric workspace with Git integration, MLflow tracking, and ML Model items enabled
- Run the canonical `04_mlops_model_registry.py` notebook to train a baseline + challenger model with full reproducibility metadata
- Read MLflow experiment metrics, artifacts, and signatures from the registry UI and via `MlflowClient`
- Author and run all five validation gates (performance, holdout stability, fairness, latency, calibration) and interpret pass/fail
- Promote a model through `None → Staging → Production` using stage transitions (never UI clicks for prod)
- Set up a Fabric Pipeline that runs holdout-based champion/challenger evaluation on a schedule
- Deploy a registered model as an ML Model Endpoint (Preview) via the Fabric REST API and smoke-test it under 200ms p99
- Wire a batch inference Fabric Pipeline that loads `models:/{name}/Production` and writes scores to Gold
- Run `05_drift_detection.py` to populate `lh_gold.ml_drift_metrics` and `lh_gold.ml_retrain_triggers`
- Wire drift alerts to an Azure Monitor Action Group and trigger a Logic App on threshold breach
- Close the loop: drift alert → retrain pipeline → re-validate → re-promote → audit log entry
- Use the Production Readiness Checklist from `mlops-fabric-production.md` to certify the model
📋 Prerequisites¶
Before starting this tutorial, ensure you have:
- Completed Tutorial 00: Environment Setup
- Completed Tutorial 01: Bronze Layer, 02: Silver Layer, 03: Gold Layer
- Completed Tutorial 09: Advanced AI/ML — MLflow basics
- Fabric workspace on F64+ capacity (F2 will run training but cannot host an ML Model Endpoint at production traffic)
- ML Model items enabled in the workspace (Workspace Settings → Data Science → ML Model)
- ML Model Endpoint (Preview) enabled in the tenant (Admin Portal → Tenant Settings → Data Science)
- GitHub repository for CI/CD (this repo is fine — fork or clone)
- Service Principal with Workspace Contributor role on the Fabric workspace, plus a stored secret in GitHub Actions secrets named `AZURE_CREDENTIALS`
- Azure CLI ≥ 2.60, `pytest` ≥ 8.0, and Python 3.11 available locally for validation gates
⚠️ Capacity gotcha
ML Model Endpoints currently require an F-SKU capacity in Active state. A paused capacity will return `503` on every endpoint request even though the deployment shows `Succeeded`. Verify capacity state before debugging endpoint timeouts.
🏗️ Architecture Diagram¶
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#4527A0','primaryTextColor':'#fff','primaryBorderColor':'#311B92','lineColor':'#5E35B1','secondaryColor':'#EDE7F6','tertiaryColor':'#fff'}}}%%
flowchart TB
subgraph DataLayer["📊 Data Layer (OneLake)"]
Bronze[(🥉 Bronze<br/>Raw Slot Telemetry)]
Silver[(🥈 Silver<br/>Cleansed Spins)]
Gold[(🥇 Gold<br/>fact_daily_slot_revenue)]
end
subgraph DevPlane["🧪 Development & CI/CD"]
Git["📦 GitHub Repo"]
GHA["⚙️ GitHub Actions<br/>(fabric-cicd)"]
Tests["🧪 Validation Gates<br/>(pytest)"]
end
subgraph TrainPlane["🏋️ Training & Registry"]
NB04["📓 04_mlops_model_registry.py<br/>(Baseline + Challenger)"]
MLF["🧪 MLflow<br/>Experiment Tracking"]
REG["🗂️ ML Model Registry<br/>None → Staging → Production"]
AUDIT["📝 ml_promotion_audit"]
end
subgraph ServePlane["🚢 Serving"]
BATCH["📦 Batch Inference<br/>Fabric Pipeline"]
EP["⚡ ML Model Endpoint<br/>(Preview)"]
SCORES[(slot_revenue_forecasts)]
end
subgraph ObsPlane["📈 Monitoring & Retrain"]
NB05["📓 05_drift_detection.py"]
DRIFT[("ml_drift_metrics<br/>ml_retrain_triggers")]
AM["🚨 Azure Monitor<br/>Action Group"]
LA["🔁 Logic App<br/>Retrain Trigger"]
end
Bronze --> Silver --> Gold
Gold --> NB04
NB04 --> MLF
MLF --> REG
Git --> GHA
GHA --> Tests
Tests -->|approved| REG
REG --> AUDIT
REG --> BATCH
REG --> EP
BATCH --> SCORES
EP --> SCORES
SCORES --> NB05
NB05 --> DRIFT
DRIFT --> AM
AM --> LA
LA -.->|trigger retrain| NB04
style DataLayer fill:#E3F2FD
style DevPlane fill:#FFF3E0
style TrainPlane fill:#EDE7F6
style ServePlane fill:#E8F5E9
style ObsPlane fill:#FFEBEE

| Component | Fabric Item | Purpose |
|---|---|---|
| Source data | Lakehouse lh_gold | fact_daily_slot_revenue populated by Tutorials 01-03 |
| Training notebook | Notebook | 04_mlops_model_registry.py — full lifecycle anchor |
| Experiment tracking | MLflow (built-in) | Per-run params, metrics, artifacts, lineage |
| Registry | ML Model item | Versioned models with Staging and Production stages |
| Validation gates | GitHub Actions + pytest | Gate every promotion in CI |
| Online serving | ML Model Endpoint (Preview) | REST API for low-latency scoring |
| Batch serving | Data Pipeline + Notebook | Nightly bulk scoring |
| Drift monitor | Notebook + Eventhouse | 05_drift_detection.py on schedule |
| Alerts | Azure Monitor Action Group | Fan out to Teams + Logic App |
| Retrain trigger | Logic App | Calls Fabric REST /jobs/instances to start retrain pipeline |
🛠️ Step 1: Workspace + Git Integration¶
1.1 Provision the workspace¶
Create (or reuse) a Fabric workspace named ws-mlops-poc. Assign it to the F64 capacity (or higher). Confirm the workspace settings:
License mode: Premium / Fabric capacity
Capacity: cap-fabricpoc-prod (F64)
Workspace ID: <copy from workspace URL>
Default lakehouses: lh_bronze, lh_silver, lh_gold
Data Science: Enabled
ML Model items: Enabled
ML Model Endpoint: Enabled (Preview)
Git integration: Connected to <your-org>/Suppercharge_Microsoft_Fabric, branch main
1.2 Connect Git¶
In the workspace, click Workspace settings → Git integration → Connect. Select your repo, branch main, and root directory /. Click Sync so notebooks under notebooks/ml/ materialize as Fabric items.
✅ Verification: After sync completes, confirm you can see `04_mlops_model_registry`, `05_drift_detection`, `06_feature_store_demo`, `07_rag_eventhouse_vector`, and `08_responsible_ai_audit` listed as Notebook items in the workspace.
1.3 Configure GitHub Actions secrets¶
In the GitHub repo, add these secrets (Settings → Secrets and variables → Actions):
AZURE_CREDENTIALS # SP json: {clientId, clientSecret, tenantId, subscriptionId}
FABRIC_WORKSPACE_ID # GUID
FABRIC_TENANT_ID # GUID
FABRIC_SP_OBJECT_ID # SP object id (workspace contributor)
💡 Tip: Use a dedicated SP per environment (`sp-fabric-mlops-dev`, `-staging`, `-prod`). Never share SPs across environments, because that breaks audit trails.

⚠️ Gotcha: The SP needs both Workspace Contributor and ML Model Read/Write role. Assign via Fabric workspace → Manage access; the Azure role alone is insufficient.
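A malformed `AZURE_CREDENTIALS` secret tends to surface only as a failed login deep inside a workflow run. The sketch below checks the JSON shape before you paste it into GitHub. The helper name `validate_sp_credentials` is hypothetical, and it assumes that all four keys listed in the secret's comment above are required.

```python
import json

# Keys taken from the AZURE_CREDENTIALS comment in Step 1.3.
REQUIRED_KEYS = ("clientId", "clientSecret", "tenantId", "subscriptionId")


def validate_sp_credentials(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["top-level value must be a JSON object"]
    problems = []
    for key in REQUIRED_KEYS:
        value = payload.get(key)
        # Each key must be present and hold a non-blank string.
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty key: {key}")
    return problems
```

This only validates structure; it cannot tell whether the secret is current or whether the SP has the workspace roles described above.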
🛠️ Step 2: Clone the Repo and Configure Local Env¶
git clone https://github.com/fgarofalo56/Suppercharge_Microsoft_Fabric.git
cd Suppercharge_Microsoft_Fabric
cp .env.example .env
Edit .env:
FABRIC_WORKSPACE_ID=<paste from Step 1.1>
FABRIC_TENANT_ID=<paste from Step 1.1>
FABRIC_POC_HASH_SALT=<a long random string for PII hashing — see Phase 11 fix>
GIT_SHA=$(git rev-parse HEAD)
GIT_BRANCH=$(git rev-parse --abbrev-ref HEAD)
MLFLOW_RUN_INTENT=production-candidate
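The `$(...)` values for `GIT_SHA` and `GIT_BRANCH` are expanded only when the file is sourced by a shell; a dotenv-style loader will typically read them as literal text. As a minimal sketch, assuming `git` is on the PATH, the same two values can be resolved in Python. `resolve_git_metadata` is a hypothetical helper, and the `unknown` fallback is an assumption rather than repo behavior.

```python
import subprocess


def _git(*args: str) -> str:
    """Run a git command and return its trimmed stdout, or 'unknown' on failure."""
    try:
        out = subprocess.run(
            ["git", *args], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        # git missing, or the current directory is not a repository
        return "unknown"
    return out.stdout.strip() or "unknown"


def resolve_git_metadata() -> dict[str, str]:
    """Mirror the GIT_SHA and GIT_BRANCH entries from the .env example."""
    return {
        "GIT_SHA": _git("rev-parse", "HEAD"),
        "GIT_BRANCH": _git("rev-parse", "--abbrev-ref", "HEAD"),
    }
```

Calling `os.environ.update(resolve_git_metadata())` before training would make the values visible to code that reads them from the environment.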
Install Python dependencies:
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
pip install -r validation/requirements.txt
✅ Verification: `pytest validation/unit_tests/ -q` should print `612 passed`.
🛠️ Step 3: Provision Bronze/Silver/Gold + ML Lakehouse¶
The MLOps pipeline reads from lh_gold.fact_daily_slot_revenue and writes back to lh_gold.* tables. Confirm all four lakehouses exist:
| Lakehouse | Purpose |
|---|---|
| `lh_bronze` | Raw slot telemetry from Tutorial 01 |
| `lh_silver` | Cleansed spins from Tutorial 02 |
| `lh_gold` | Fact tables + ML output tables |
| `lh_ml_artifacts` | (optional) MLflow large artifacts, model files > 100 MB |
If missing, create them via the workspace UI or via Bicep:
az deployment sub create --location eastus2 \
--template-file infra/main.bicep \
--parameters infra/environments/dev/dev.bicepparam
✅ Verification: From any notebook, `spark.sql("SHOW DATABASES").show()` lists `lh_bronze`, `lh_silver`, `lh_gold`.
🛠️ Step 4: Populate Gold (Tutorials 01–03)¶
Run Tutorials 01, 02, 03 end-to-end so that lh_gold.fact_daily_slot_revenue has at least 365 days of data. The MLOps notebook synthesizes plausible data if the table is empty, but you'll get more meaningful drift signals with real medallion output.
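# Quick check from a Fabric notebook
from pyspark.sql import functions as F
df = spark.table("lh_gold.fact_daily_slot_revenue")
print(f"Rows: {df.count():,}")
# One pass for min, max, and the distinct-date count that the verification asks for
lo, hi, days = df.agg(F.min("play_date"), F.max("play_date"), F.countDistinct("play_date")).first()
print(f"Date range: {lo} to {hi} ({days:,} distinct dates)")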
✅ Verification: At least 365 distinct dates and 1M+ rows.
⚠️ Gotcha: If `play_date` is a string instead of a date, the training notebook will fail with `cannot resolve 'datediff'`. Cast to `DATE` in Silver before promoting to Gold.
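The 365-distinct-dates requirement can hide gaps: a table may span more than a year and still miss days. The following is a sketch of a coverage check over date values already collected to the driver. `date_coverage` is a hypothetical helper, not part of the repo, and it expects real `datetime.date` objects, so it will also fail fast if `play_date` is still a string.

```python
from datetime import date, timedelta


def date_coverage(play_dates: list[date], min_days: int = 365) -> dict:
    """Summarize distinct-day coverage and list any missing calendar days."""
    distinct = sorted(set(play_dates))
    if not distinct:
        return {"distinct_days": 0, "missing": [], "ok": False}
    first, last = distinct[0], distinct[-1]
    # Every calendar day between the first and last observed date, inclusive.
    expected = {first + timedelta(days=i) for i in range((last - first).days + 1)}
    missing = sorted(expected - set(distinct))
    return {
        "distinct_days": len(distinct),
        "missing": missing,
        "ok": len(distinct) >= min_days,
    }
```

In a Fabric notebook the input could come from `[r[0] for r in df.select("play_date").distinct().collect()]`, which is small enough to collect because it holds one value per day.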
🛠️ Step 5: Train the Baseline Model¶
Open notebooks/ml/04_mlops_model_registry.py. This is the anchor notebook for the entire tutorial — every subsequent step references it.
5.1 Run the notebook¶
In the workspace, open the notebook, attach to lh_gold as the default lakehouse, and click Run all. The notebook will:
- Load (or synthesize) one year of `fact_daily_slot_revenue`
- Train a Ridge regression baseline
- Train a RandomForestRegressor challenger
- Log both to MLflow with full lineage metadata
- Evaluate against a fixed holdout split
- Write results to `lh_gold.ml_champion_challenger`
- Write a promotion audit row to `lh_gold.ml_promotion_audit`
Expected runtime on F64: 90–180 seconds.
Expected console output:
MLflow tracking URI: https://api.fabric.microsoft.com/...
Experiment: /Shared/casino-slot-revenue-forecast
Code SHA: <git sha> | Branch: main | Actor: <your alias>
Loaded 365 rows from lh_gold.fact_daily_slot_revenue
Baseline (Ridge): R²=0.74 RMSE=$8,142 MAPE=4.1%
Challenger (RF): R²=0.82 RMSE=$6,710 MAPE=3.3%
Registered: casino-slot-revenue-forecast-baseline v1
Registered: casino-slot-revenue-forecast-challenger v1
✅ Verification: In ML experiments → /Shared/casino-slot-revenue-forecast you see two runs with `r2`, `rmse`, `mape` metrics and a `model/` artifact each.

💡 Tip: Set `MLFLOW_RUN_INTENT=exploratory` in your env when you're just experimenting; `production-candidate` is reserved for runs that actually go through validation gates.
🛠️ Step 6: Inspect the MLflow Registry¶
Navigate to Workspace → ML Model items and open casino-slot-revenue-forecast-challenger. The detail pane shows:
- Versions: `v1` (just created)
- Stage: `None`
- Source run: link to the experiment run
- Metrics, Parameters, Artifacts: captured from the `mlflow.log_*` calls
- Lineage: `lh_gold.fact_daily_slot_revenue` listed as input (because we logged `training_data_table` + `training_data_version`)
# Programmatic inspection (run in any notebook)
from mlflow.tracking import MlflowClient
client = MlflowClient()
for v in client.search_model_versions("name='casino-slot-revenue-forecast-challenger'"):
    # A model version carries tags, not metrics; follow run_id to read the run's metrics
    print(f"v{v.version} | stage={v.current_stage} | run={v.run_id} | tags={v.tags}")
✅ Verification: At least one version with `current_stage='None'` and a non-empty `run_id`.

⚠️ Gotcha: If the registry shows `Stage: None` but no metrics, you registered the model outside an active run (`mlflow.start_run()` not used). Re-run cells inside a `with mlflow.start_run()` block.
🛠️ Step 7: Train a Stronger Challenger¶
Edit the hyperparameters at the top of 04_mlops_model_registry.py:
# Bump RF capacity for the challenger
RF_N_ESTIMATORS = 400 # was 200
RF_MAX_DEPTH = 12 # was 8
RF_MIN_SAMPLES_LEAF = 2 # was 5
Re-run the notebook. Expected runtime ~3 minutes on F64. The challenger should now have higher R² and lower RMSE than v1.
Expected output:
Challenger v2: R²=0.88 RMSE=$5,420 MAPE=2.7%
Improvement vs v1: +7.3% R², -19.2% RMSE
✅ Verification: `casino-slot-revenue-forecast-challenger` now has `v1` and `v2` in the registry. v2's metrics are strictly better than v1's.

💡 Tip: Use MLflow's parent/child run pattern for hyperparameter sweeps: wrap the loop in `with mlflow.start_run() as parent` and use `mlflow.start_run(nested=True)` for each trial. The registry then shows the search space cleanly.
🛠️ Step 8: Run Validation Gates¶
The five gates (defined in mlops-fabric-production.md § Validation Gates) must all pass before any model enters Production.
# From repo root
export MODEL_NAME=casino-slot-revenue-forecast-challenger
export MODEL_VERSION=2
pytest validation/ml/test_validation_gates.py -v
Expected output:
test_gate_1_performance_threshold[v2] PASSED (R²=0.88 ≥ 0.95 × baseline 0.74)
test_gate_2_holdout_stability[v2] PASSED (drift 1.8% < 5% tol)
test_gate_3_fairness[v2] SKIPPED (slot revenue is non-regulated)
test_gate_4_latency_p99[v2] PASSED (p99 47ms < 200ms target)
test_gate_5_calibration[v2] PASSED (ECE 0.06 < 0.10)
============================ 4 passed, 1 skipped in 18.3s ============================
If any gate fails, the test prints the exact metric, threshold, and remediation hint. Do not proceed until all non-skipped gates pass.
✅ Verification: `pytest` exits 0; the GitHub Actions check `ML Model Promotion Gate` is green on the open PR.

⚠️ Gotcha: Gate 4 (latency) tests against a local load of the model, not a live endpoint. A model that passes Gate 4 locally can still time out on the endpoint due to instance sizing; you'll re-test in Step 13.
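The contents of `validation/ml/test_validation_gates.py` are not shown here, so the following is only a sketch of the threshold arithmetic implied by the expected output above: R² at least 0.95 × baseline, holdout drift under 5%, p99 under 200ms, and ECE under 0.10. The names `GateResult` and `evaluate_gates` are hypothetical, and the fairness gate is omitted because it is skipped for this model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    name: str
    passed: bool
    detail: str


def evaluate_gates(
    r2: float,
    baseline_r2: float,
    holdout_drift_pct: float,
    p99_ms: float,
    ece: float,
) -> list[GateResult]:
    """Apply the thresholds shown in the expected pytest output."""
    floor = 0.95 * baseline_r2
    return [
        GateResult("performance", r2 >= floor, f"R2 {r2:.2f} vs floor {floor:.3f}"),
        GateResult("holdout_stability", holdout_drift_pct < 5.0,
                   f"drift {holdout_drift_pct:.1f}% vs 5% tolerance"),
        GateResult("latency_p99", p99_ms < 200.0, f"p99 {p99_ms:.0f}ms vs 200ms"),
        GateResult("calibration", ece < 0.10, f"ECE {ece:.2f} vs 0.10"),
    ]


# Values copied from the v2 sample output in this step.
results = evaluate_gates(r2=0.88, baseline_r2=0.74, holdout_drift_pct=1.8, p99_ms=47, ece=0.06)
for gate in results:
    print(f"{'PASSED' if gate.passed else 'FAILED'}  {gate.name}: {gate.detail}")
```

Keeping each gate as a named result with a detail string matches the behavior described above, where a failing gate reports the exact metric and threshold.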
🛠️ Step 9: Promote Challenger to Staging¶
Promotion uses MlflowClient.transition_model_version_stage — never the UI for production-bound models, because UI clicks bypass the audit log.
from mlflow.tracking import MlflowClient
client = MlflowClient()
client.transition_model_version_stage(
name="casino-slot-revenue-forecast-challenger",
version=2,
stage="Staging",
archive_existing_versions=False, # keep v1 as None for rollback rehearsal
)
# Append to audit log
spark.sql("""
INSERT INTO lh_gold.ml_promotion_audit
SELECT
uuid() AS event_id,
current_timestamp() AS event_time,
'casino-slot-revenue-forecast-challenger' AS model_name,
2 AS version,
'None' AS from_stage,
'Staging' AS to_stage,
current_user() AS actor,
'<git_sha>' AS git_sha,
'gates 1,2,4,5 passed' AS rationale
""")
✅ Verification: `client.get_model_version(name, 2).current_stage == 'Staging'`. A new row appears in `lh_gold.ml_promotion_audit`.
🛠️ Step 10: Holdout-Based Champion/Challenger Pipeline¶
A challenger must beat the champion on N consecutive evaluation windows before earning Production. Build a Fabric Pipeline that runs this comparison nightly.
10.1 Create the pipeline¶
In the workspace, + New → Data Pipeline → pl_champion_challenger_eval. Add a single Notebook activity:
Activity: Run notebook
Notebook: 04_mlops_model_registry (sub-section: champion_challenger_eval)
Default lakehouse: lh_gold
Parameters:
EVAL_WINDOW_DAYS: 1
CHAMPION_NAME: casino-slot-revenue-forecast-baseline
CHALLENGER_NAME: casino-slot-revenue-forecast-challenger
CONSECUTIVE_WINS_REQUIRED: 7
Schedule: Daily 02:30 UTC
Timeout: 30 min
Retry: 1
10.2 What it does¶
Each run loads yesterday's actuals, scores both Production-staged baseline and Staging-staged challenger, and appends to lh_gold.ml_champion_challenger:
SELECT eval_date, model_name, stage, rmse, mape, r2, run_id
FROM lh_gold.ml_champion_challenger
ORDER BY eval_date DESC
LIMIT 20;
✅ Verification: After two manual runs, the table has at least 4 rows (champion + challenger × 2 days).
💡 Tip: Use a Fabric Activator rule on this table that fires when the challenger has won 7 consecutive days — that's your "ready for Production" signal.
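The "7 consecutive wins" rule is easy to get subtly wrong, for example by counting wins that are not contiguous. A minimal sketch of the streak logic follows. `consecutive_wins` is a hypothetical helper, and it assumes that a win means strictly lower RMSE than the champion on the same `eval_date` and that a missing evaluation day resets the streak.

```python
from datetime import date, timedelta


def consecutive_wins(daily_rmse: dict[date, tuple[float, float]]) -> int:
    """Count the challenger's current winning streak, ending at the latest date.

    daily_rmse maps eval_date -> (champion_rmse, challenger_rmse).
    """
    streak = 0
    expected = max(daily_rmse, default=None)
    # Walk backwards one calendar day at a time from the most recent evaluation.
    while expected in daily_rmse:
        champion, challenger = daily_rmse[expected]
        if challenger >= champion:
            break  # a tie or a loss ends the streak
        streak += 1
        expected -= timedelta(days=1)
    return streak
```

The input dictionary could be built from the `lh_gold.ml_champion_challenger` query above by pivoting the champion and challenger rows for each `eval_date`.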
🛠️ Step 11: Promote to Production¶
Once the challenger has won 7 consecutive days (or you manually approve a hotfix), promote with archive_existing_versions=True so that any version of this model already in Production is archived automatically:
client.transition_model_version_stage(
name="casino-slot-revenue-forecast-challenger",
version=2,
stage="Production",
archive_existing_versions=True,
)
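Append the matching row to lh_gold.ml_promotion_audit as in Step 9, with from_stage 'Staging' and to_stage 'Production'. That row is the defensible record for SOX/GDPR/internal-audit reviewers, so preserve it.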
SELECT * FROM lh_gold.ml_promotion_audit
WHERE to_stage = 'Production'
ORDER BY event_time DESC
LIMIT 5;
✅ Verification: The model's `Production` stage now points to v2. v1 stays in `None` and remains available for rollback, because `archive_existing_versions=True` only archives versions of this model that were already in `Production`. A new audit row exists with `to_stage='Production'`.

⚠️ Gotcha: Consumers should reference `models:/{name}/Production`, not `models:/{name}/2`. Pinning to a version means rollback requires a code change. Stage references roll back via `transition_model_version_stage` with no consumer changes.
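One way to enforce stage references is a small lint that flags any `models:/` URI whose final segment is numeric. The sketch below is an illustration, not the repo's CI check: `find_version_pins` and the regular expression are assumptions, and the pattern only covers the `models:/<name>/<digits>` form.

```python
import re

# Matches models:/<name>/<digits>, i.e. a pinned version rather than a stage.
PINNED_URI = re.compile(r"models:/[^/\s\"']+/\d+(?![\w.])")


def find_version_pins(source: str) -> list[tuple[int, str]]:
    """Return (line_number, matched_uri) for every pinned model URI in source."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in PINNED_URI.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits
```

A CI job could read each consumer file, call `find_version_pins` on its text, and fail the build when the returned list is non-empty.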
🛠️ Step 12: Deploy as ML Model Endpoint (Preview)¶
Online endpoints expose the model as a low-latency REST API. Deploy via the Fabric REST API so the deployment is reproducible (and Git-trackable):
TOKEN=$(az account get-access-token --resource https://api.fabric.microsoft.com/ \
--query accessToken -o tsv)
curl -X POST \
"https://api.fabric.microsoft.com/v1/workspaces/${FABRIC_WORKSPACE_ID}/mlmodels/${MODEL_ID}/endpoints" \
-H "Authorization: Bearer ${TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"name": "slot-revenue-prod",
"modelVersion": "2",
"instanceType": "Standard_DS3_v2",
"minInstances": 2,
"maxInstances": 10,
"trafficSplit": { "2": 100 }
}'
The deployment takes 5–10 minutes. Poll for status:
curl -s "https://api.fabric.microsoft.com/v1/workspaces/${FABRIC_WORKSPACE_ID}/mlmodels/${MODEL_ID}/endpoints/slot-revenue-prod" \
-H "Authorization: Bearer ${TOKEN}" | jq '.status'
✅ Verification: `status` transitions `Provisioning → Updating → Succeeded`. The response payload includes a `scoringUri` you'll use in Step 13.

⚠️ Gotcha: `minInstances: 1` is allowed but breaks high availability: a single-instance restart causes 30+ seconds of 503s. Always run with `minInstances: 2` in production.
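Polling by hand with `curl` works for a one-off deployment, but a scripted deployment needs a bounded wait. The sketch below separates the polling loop from the HTTP call so the loop can be tested on its own. `wait_for_endpoint` is a hypothetical helper, the `Failed` status is an assumed terminal state that this tutorial does not show, and the 900-second default simply leaves margin over the 5–10 minute deployment time.

```python
import time
from typing import Callable


def wait_for_endpoint(
    fetch_status: Callable[[], str],
    timeout_s: float = 900.0,
    interval_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the endpoint reports Succeeded, or raise on failure/timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status == "Succeeded":
            return status
        if status == "Failed":  # assumed terminal state, not shown in the tutorial
            raise RuntimeError("endpoint deployment failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_s:.0f}s")
        sleep(interval_s)
```

In practice `fetch_status` would issue the same GET as the `curl` poll above and return the `status` field of the JSON response.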
🛠️ Step 13: Smoke-Test the Endpoint¶
SCORING_URI=<paste from Step 12 response>
curl -X POST "${SCORING_URI}" \
-H "Authorization: Bearer ${TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"input_data": {
"columns": ["day_of_week", "is_holiday", "promo_active", "lag_7_revenue", "lag_30_revenue"],
"data": [[5, 0, 1, 412900.0, 405200.0]]
}
}'
Expected response:
{
"predictions": [428173.42],
"model_name": "casino-slot-revenue-forecast-challenger",
"model_version": "2",
"latency_ms": 47
}
Run a load test with 1,000 sequential calls and confirm p99 latency:
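The load-test script itself is not shown in this tutorial, so the following is a minimal sketch rather than the repo's tooling. It times sequential calls and reports a nearest-rank p99. `percentile` and `load_test` are hypothetical helpers, and `call` stands for whatever function sends one scoring request to `SCORING_URI`.

```python
import math
import time
from typing import Callable


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample covering pct% of the data."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def load_test(call: Callable[[], None], n: int = 1000) -> dict[str, float]:
    """Invoke call() n times sequentially and report latency in milliseconds."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        call()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return {
        "p50_ms": percentile(latencies, 50),
        "p99_ms": percentile(latencies, 99),
        "max_ms": max(latencies),
    }
```

Because the timing wraps the whole call, the result includes network and client overhead, so it can differ from the `latency_ms` value the endpoint reports for itself.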
✅ Verification: Endpoint returns `200` on every request, p99 < 200ms, and the response includes the `model_version` you deployed.

💡 Tip: Capture the load-test results as evidence for the Production Readiness Checklist (Step 20).
🛠️ Step 14: Set Up Batch Inference Pipeline¶
Most casino consumers (BI dashboards, daily forecasting reports) prefer batch over online. Add a second pipeline pl_batch_score_slot_revenue:
Activity: Run notebook
Notebook: batch_score_slot_revenue (~30 lines, see snippet below)
Default lakehouse: lh_gold
Schedule: Daily 03:30 UTC (after pipeline pl_champion_challenger_eval)
SLA: complete within 60 min
Notebook body:
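import mlflow
import pandas as pd
from mlflow.tracking import MlflowClient
MODEL_NAME = "casino-slot-revenue-forecast-challenger"
# Assumed to match the feature order of the endpoint payload in Step 13
FEATURE_COLS = ["day_of_week", "is_holiday", "promo_active", "lag_7_revenue", "lag_30_revenue"]
# Load by stage so a rollback needs no code change
model = mlflow.sklearn.load_model(f"models:/{MODEL_NAME}/Production")
# Record the version actually serving Production instead of hard-coding "2"
prod_version = MlflowClient().get_latest_versions(MODEL_NAME, stages=["Production"])[0].version
unscored = (spark.table("lh_silver.daily_slot_aggregates")
    .filter("predicted_revenue IS NULL")
    .toPandas())
unscored["predicted_revenue"] = model.predict(unscored[FEATURE_COLS])
unscored["model_version"] = str(prod_version)
unscored["scored_at"] = pd.Timestamp.now(tz="UTC")
(spark.createDataFrame(unscored)
    .write.mode("append")
    .saveAsTable("lh_gold.slot_revenue_forecasts"))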
✅ Verification: After one run, `lh_gold.slot_revenue_forecasts` has new rows with non-null `predicted_revenue` and the correct `model_version`.
🛠️ Step 15: Set Up Drift Detection¶
Open notebooks/ml/05_drift_detection.py. The notebook computes four drift types — data, prediction, performance, concept — and writes results to lh_gold.ml_drift_metrics and lh_gold.ml_retrain_triggers.
15.1 First manual run¶
Attach to lh_gold, Run all. Expected runtime ~2 minutes.
Expected output:
Reference window: 90d (2026-01-27 → 2026-04-26), 270 rows
Current window: 7d (2026-04-20 → 2026-04-26), 35 rows
PSI per feature:
day_of_week: 0.04 (no drift)
is_holiday: 0.07 (no drift)
promo_active: 0.18 (moderate)
lag_7_revenue: 0.09 (no drift)
Prediction drift PSI: 0.06 (no drift)
Realized RMSE delta: +3.1% (within 10% tolerance)
Importance Spearman ρ: 0.93 (no concept drift)
→ No retrain trigger written.
15.2 Schedule it¶
Wrap the notebook in pipeline pl_drift_detection:
Schedule: Daily 04:00 UTC
Timeout: 30 min
Retry: 1
On failure: Email distribution list mlops-oncall@yourorg.com
✅ Verification: After 2 runs, `lh_gold.ml_drift_metrics` has two `run_id`s and per-feature rows. `lh_gold.ml_retrain_triggers` is empty (no drift yet).

⚠️ Gotcha: PSI requires both windows to have ≥ 30 samples per bucket. With < 7 days of production data, PSI returns `NaN` and the notebook logs a `[skipped: insufficient data]` warning rather than triggering retrain. Don't disable that guard.
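For readers who have not worked with PSI before, the sketch below shows the standard Population Stability Index over pre-binned counts, PSI = Σ (aᵢ − eᵢ) · ln(aᵢ / eᵢ), where eᵢ and aᵢ are the reference and current bucket shares. It is not the code in `05_drift_detection.py`. The `eps` floor for empty buckets and the label `significant` are assumptions; the 0.10 and 0.20 bands are chosen to agree with the sample output and the alert threshold used in this tutorial.

```python
import math


def psi(expected_counts: list[int], actual_counts: list[int], eps: float = 1e-6) -> float:
    """Population Stability Index over counts that share the same buckets."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both windows must use the same buckets")
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    if e_total == 0 or a_total == 0:
        return float("nan")  # an empty window has no distribution to compare
    total = 0.0
    for e_count, a_count in zip(expected_counts, actual_counts):
        # Floor each share at eps so the logarithm stays defined.
        e = max(e_count / e_total, eps)
        a = max(a_count / a_total, eps)
        total += (a - e) * math.log(a / e)
    return total


def classify(value: float) -> str:
    """Map a PSI value to a drift band."""
    if math.isnan(value):
        return "skipped: insufficient data"
    if value < 0.10:
        return "no drift"
    if value <= 0.20:
        return "moderate"
    return "significant"
```

Returning `NaN` for an empty window, instead of raising, mirrors the guard described in the gotcha above: missing data is reported as skipped and never treated as drift.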
🛠️ Step 16: Wire Drift Alerts to an Action Group¶
Drift values landing in a Delta table do nothing on their own. Wire them to an Azure Monitor Action Group so they page someone.
16.1 Stream metrics to Workspace Monitoring¶
Enable workspace monitoring on lh_gold (Workspace settings → Monitoring → Enable). Drift writes from 05_drift_detection.py automatically appear in the WorkspaceMonitoring log table.
16.2 Create the alert rule¶
// Azure Monitor scheduled query alert
WorkspaceMonitoring
| where TableName == "ml_drift_metrics"
| where ModelName == "casino-slot-revenue-forecast-challenger"
| where MetricName in ("psi_overall", "performance_rmse_delta_pct")
| where Window == "7d"
| summarize MaxPSI = max(MetricValue) by ModelName, bin(TimeGenerated, 1h)
| where MaxPSI > 0.20
Action Group ag-mlops-drift:
| Action | Target |
|---|---|
| Email | mlops-oncall@yourorg.com |
| Teams webhook | #mlops-alerts channel |
| Logic App | la-retrain-slot-revenue (Step 17) |
See docs/best-practices/operations/observability-stack.md for the canonical Action Group wiring pattern.
✅ Verification: Manually inject a row with `MetricValue=0.30` into `lh_gold.ml_drift_metrics`. Within 5 minutes the alert fires, an email arrives, and the Logic App run history shows a triggered run.
🛠️ Step 17: Retrain via Logic App¶
The Logic App's job is simple: when triggered by the Action Group, call the Fabric REST API to start the training pipeline.
{
"definition": {
"triggers": {
"When_Action_Group_fires": { "type": "Request", "kind": "Http" }
},
"actions": {
"Trigger_Retrain_Pipeline": {
"type": "Http",
"inputs": {
"method": "POST",
"uri": "https://api.fabric.microsoft.com/v1/workspaces/@{parameters('workspaceId')}/items/@{parameters('pipelineId')}/jobs/instances?jobType=Pipeline",
"headers": { "Authorization": "Bearer @{parameters('spToken')}" },
"body": {
"executionData": {
"parameters": {
"trigger_reason": "drift_alert",
"alert_id": "@triggerBody()?['alertId']"
}
}
}
}
}
}
}
}
Pipeline pl_retrain_slot_revenue chains:
- Run `04_mlops_model_registry.py` (trains new challenger v3)
- Run `validation/ml/run_gates.py` (all 5 gates)
- Conditional: if all gates pass, transition v3 to `Staging`
- Send Teams message with summary + link to MLflow run
✅ Verification: Trigger the alert manually (Step 16.2 verification). Within 10 minutes, a new model version exists in the registry at `Staging`. The audit log captures the trigger reason `drift_alert`.
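Before relying on the Logic App, it can help to fire the same request from a terminal and confirm that the pipeline starts. The sketch below builds the POST that the Logic App definition above describes, using only the standard library. `build_retrain_request` is a hypothetical helper, and the token is assumed to come from `az account get-access-token` as in Step 12.

```python
import json
import urllib.request

FABRIC_API = "https://api.fabric.microsoft.com/v1"


def build_retrain_request(workspace_id: str, pipeline_id: str, token: str,
                          alert_id: str) -> urllib.request.Request:
    """Build (but do not send) the job-instance POST used by the Logic App."""
    url = (f"{FABRIC_API}/workspaces/{workspace_id}/items/{pipeline_id}"
           "/jobs/instances?jobType=Pipeline")
    # Same body shape as the Logic App action's "body" field.
    body = {"executionData": {"parameters": {
        "trigger_reason": "drift_alert",
        "alert_id": alert_id,
    }}}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` sends it. Building the request separately keeps the URL and payload inspectable without touching the live workspace.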
🛠️ Step 18: Verify the Closed Loop¶
You now have a fully closed loop. Run an end-to-end smoke test:
| Step | Action | Expected |
|---|---|---|
| 1 | Inject synthetic drift: INSERT INTO lh_gold.ml_drift_metrics ... MetricValue=0.30 | Row written |
| 2 | Wait 5 min for KQL alert window | Action Group fires |
| 3 | Check Logic App run history | One successful run |
| 4 | Check Fabric pipeline pl_retrain_slot_revenue | One running instance |
| 5 | Wait ~5 min for retrain | New challenger version registered |
| 6 | Check lh_gold.ml_promotion_audit | New row, to_stage='Staging', rationale mentions drift |
| 7 | Check Teams #mlops-alerts | Summary message posted |
| 8 | Re-run drift detection on fresh challenger | PSI back below 0.10 |
✅ Verification: All 8 rows green within 30 minutes of the injection.
💡 Tip: Run this smoke test on a weekly cadence as part of your DR/runbook practice. A loop that works once but is never re-tested rots silently.
🛠️ Step 19: Cost Dashboard¶
Adapt the patterns from docs/best-practices/llm-cost-tracking.md for your ML model's surfaces:
| Surface | Driver | Where to track |
|---|---|---|
| Training | Spark CU-hours × runs/week | Capacity Metrics App + spark.fabric.cost_center tag |
| Endpoint | Instance count × hours + per-1k inferences | Endpoint Metrics blade |
| Drift detection | Spark CU-hours × daily runs | Capacity Metrics App |
| Storage | OneLake bytes × $/GB | OneLake size in workspace settings |
Tag every Spark session:
spark.conf.set("spark.fabric.cost_center", "casino-data-science")
spark.conf.set("spark.fabric.model", "casino-slot-revenue-forecast-challenger")
spark.conf.set("spark.fabric.intent", "production-scoring")
Build a Power BI cost dashboard with one card per surface and a treemap by cost_center / model.
✅ Verification: After 7 days, the dashboard shows non-zero values for all four surfaces, with `casino-data-science` accounting for the bulk.
🛠️ Step 20: Production Readiness Checklist¶
Walk through the canonical checklist from mlops-fabric-production.md § Production Readiness Checklist. At minimum, certify:
- Model in `Production` stage; `Staging` and `None` versions retained for rollback
- Source notebook + dependencies under Git, tagged with the deployed commit SHA
- All 5 validation gates green in CI on the deployed commit
- Endpoint deployed with `minInstances ≥ 2`; load-test p99 < SLO captured as evidence
- Batch inference pipeline scheduled, last 7 runs successful
- Drift detection scheduled, last 7 runs successful, no unresolved triggers
- Action Group wired with Email + Teams + Logic App
- Retrain Logic App tested end-to-end within last 30 days
- `ml_promotion_audit` retention ≥ 7 years (regulated) or ≥ 1 year (non-regulated)
- Cost dashboard live with `cost_center` and `model` tags
- On-call runbook published referencing this tutorial + drift doc + observability stack
- Postmortem template ready (`docs/best-practices/operations/incident-response-runbook.md`)
✅ Verification: Every box ticked, with screenshots / queries / pipeline runs as evidence in your team's project wiki.
✅ Final Verification¶
Run these queries from any notebook to confirm the whole stack is alive:
-- 1. Production model exists
SELECT * FROM lh_gold.ml_promotion_audit
WHERE to_stage='Production' AND model_name LIKE 'casino-slot-revenue-forecast%'
ORDER BY event_time DESC LIMIT 1;
-- 2. Endpoint healthy (run from terminal)
-- curl ${SCORING_URI}/health → expect 200, p99 < 200ms
-- 3. Drift table populated
SELECT COUNT(*) AS rows, MAX(run_timestamp) AS last_run
FROM lh_gold.ml_drift_metrics;
-- 4. Alerting fires (manual test)
-- Inject fake drift row → Action Group activity logged within 5 min
-- 5. Retrain pipeline triggers on alert
SELECT * FROM lh_gold.ml_promotion_audit
WHERE rationale LIKE '%drift%' ORDER BY event_time DESC LIMIT 5;
| Verification | Pass Criterion |
|---|---|
| Model in Production | At least 1 row in ml_promotion_audit with to_stage='Production' |
| Endpoint < 200ms | Load test p99 < 200ms |
| Drift table writes | last_run within last 24h |
| Alert fires | Action Group history shows trigger within 5 min of injection |
| Retrain triggers | Audit row with rationale LIKE '%drift%' within 10 min of alert |
🧹 Cleanup¶
This tutorial leaves several resources running. To avoid F64 capacity charges:
# 1. Pause the endpoint (keeps definition, frees compute)
curl -X PATCH "https://api.fabric.microsoft.com/v1/workspaces/${FABRIC_WORKSPACE_ID}/mlmodels/${MODEL_ID}/endpoints/slot-revenue-prod" \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"minInstances": 0, "maxInstances": 0}'
# 2. Disable schedules
# Pipelines → pl_champion_challenger_eval → Schedule → Off
# Pipelines → pl_batch_score_slot_revenue → Schedule → Off
# Pipelines → pl_drift_detection → Schedule → Off
# 3. Vacuum tables to reclaim storage
spark.sql("VACUUM lh_gold.ml_drift_metrics RETAIN 168 HOURS")
spark.sql("VACUUM lh_gold.ml_champion_challenger RETAIN 168 HOURS")
spark.sql("VACUUM lh_gold.slot_revenue_forecasts RETAIN 168 HOURS")
# 4. Archive (do NOT delete) MLflow runs older than 90 days
# Apply OneLake lifecycle policy on the experiment artifact path
⚠️ Gotcha: Never delete `ml_promotion_audit` rows. Archive to cold storage if needed, but the audit log is your defensible record for compliance reviews.
🔧 Troubleshooting¶
| # | Symptom | Likely Cause | Fix |
|---|---|---|---|
| 1 | 401 Unauthorized on REST API | SP token expired or wrong scope | Re-run az account get-access-token --resource https://api.fabric.microsoft.com/; SP needs Workspace Contributor + ML Model RW |
| 2 | Capacity throttle: 429 Too Many Requests during training | Concurrent jobs exceed CU budget | Stagger pipeline schedules; increase capacity or use SJD with reserved pool |
| 3 | MLflow UI returns 404 Experiment not found | Experiment created in a different workspace | Use absolute path /Shared/{name} (leading slash); workspace-relative paths break across workspaces |
| 4 | Endpoint smoke test times out (>30s) | Capacity paused, or minInstances=0 | az fabric capacity show → confirm state=Active; PATCH endpoint with minInstances ≥ 2 |
| 5 | transition_model_version_stage returns Permission denied | SP lacks ML Model Write role | Workspace → Manage access → assign SP as Contributor and model-level RW |
| 6 | Drift notebook returns NaN PSI | Current window < 30 samples | Backfill slot_revenue_forecasts with at least 30 days, or shorten reference window temporarily |
| 7 | Logic App fires but pipeline never starts | Wrong pipelineId parameter, or SP lacks permissions on the pipeline | Confirm GUIDs in Logic App parameters; add SP to pipeline Manage permissions |
| 8 | Promoted to Production but consumers still see old model | Consumers pinned a version (models:/{name}/2) instead of stage (models:/{name}/Production) | Code-review every consumer to use stage references; add a CI lint to forbid version pins |
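Row 6 is easier to reason about with the PSI arithmetic in view. The sketch below is one way such a guard could look and is not the implementation in notebooks/ml/05_drift_detection.py. The 30-sample floor comes from the table; the function name `psi`, the equal-width binning, the bin count, and the `EPSILON` smoothing constant are illustrative assumptions. It returns `None` instead of `NaN` when a window is too small, so callers can tell "not enough data" apart from a real score.

```python
import math

MIN_SAMPLES = 30   # floor taken from row 6 of the table above
EPSILON = 1e-6     # assumed smoothing constant so empty bins never hit log(0)


def psi(reference, current, n_bins=10):
    """Population Stability Index, or None when a window is too small to score."""
    if len(reference) < MIN_SAMPLES or len(current) < MIN_SAMPLES:
        return None
    low, high = min(reference), max(reference)
    if high == low:
        return None  # a constant reference column cannot be binned
    width = (high - low) / n_bins

    def bin_fractions(values):
        counts = [0] * n_bins
        for value in values:
            index = int((value - low) / width)
            index = min(max(index, 0), n_bins - 1)  # clamp outliers into edge bins
            counts[index] += 1
        return [max(count / len(values), EPSILON) for count in counts]

    ref_frac = bin_fractions(reference)
    cur_frac = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac))


# Synthetic data for illustration only.
reference = [float(i % 50) for i in range(500)]
shifted = [value + 20.0 for value in reference]

print(psi(reference, reference))       # identical windows
print(psi(reference, shifted))         # shifted window
print(psi(reference, reference[:10]))  # too few samples
```

Bins are derived from the reference window only, so the same edges apply to both windows and a shift shows up as mass moving between bins.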
🗂️ Key Files Referenced
| Step | Source File |
|---|---|
| 1 | infra/main.bicep, infra/modules/security/workspace-identity.bicep |
| 1 | .github/workflows/deploy-fabric.yml |
| 1 | scripts/fabric-cicd-deploy.py |
| 4 | Tutorial 01, Tutorial 02, Tutorial 03 |
| 5–11 | notebooks/ml/04_mlops_model_registry.py |
| 8 | docs/best-practices/mlops-fabric-production.md § Validation Gates |
| 12–13 | Fabric REST API (ML Model Endpoints — Preview) |
| 14 | notebooks/ml/01_ml_player_churn_prediction.py (batch scoring pattern) |
| 15 | notebooks/ml/05_drift_detection.py |
| 15 | docs/best-practices/model-monitoring-drift-detection.md |
| 16 | docs/best-practices/operations/observability-stack.md |
| 16 | docs/best-practices/operations/slo-sli-fabric.md |
| 19 | docs/best-practices/llm-cost-tracking.md |
| 20 | docs/best-practices/mlops-fabric-production.md § Production Readiness Checklist |
📋 Best Practices Summary
- Stage references, never version pins. Consumers reference models:/{name}/Production. Pinning to 2 makes rollback a code change instead of a transition_model_version_stage call.
- Audit every promotion programmatically. ml_promotion_audit is your defensible record. Never delete rows; archive to cold storage if size becomes an issue.
- Gate before promote, always. Five gates run in CI on every PR touching notebooks/ml/**. A model that bypasses gates because "it's just a hotfix" will eventually be the model that breaks production.
- minInstances ≥ 2 for online endpoints. Single-instance endpoints have a 30-second restart hole. HA is not optional in production.
- Champion/challenger evaluation is continuous, not one-shot. A challenger that wins one day might lose seven of the next ten. Require N consecutive wins (we use 7) before promotion.
- Drift detection runs on schedule, not on demand. Drift that sits undetected for a week is drift that already cost you money. 24h SLO on detection.
- Wire alerts to runbooks, not to inboxes. An email no one reads is not an alert. Every Action Group must include either a paging system (PagerDuty) or a tested Logic App.
- Test the closed loop weekly. Inject synthetic drift, watch it propagate through alert → Logic App → retrain → re-validate → re-promote. Loops rot silently.
- Tag everything for cost attribution: cost_center, model, intent. Without tags you cannot answer "how much did this model cost last quarter", and finance will ask.
- Production Readiness Checklist before go-live, not after. A model in production without the checklist signed off is a liability. The checklist is short; complete it.
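The "N consecutive wins" rule from the champion/challenger practice can be expressed in a few lines. This is a sketch under stated assumptions, not the logic in pl_champion_challenger_eval: `trailing_win_streak` and `ready_to_promote` are hypothetical helpers, and the input is assumed to be a chronological list of daily booleans (True when the challenger beat the champion that day).

```python
REQUIRED_CONSECUTIVE_WINS = 7  # the N this tutorial uses


def trailing_win_streak(daily_results):
    """Count consecutive challenger wins at the end of a chronological list."""
    streak = 0
    for won in reversed(daily_results):
        if not won:
            break
        streak += 1
    return streak


def ready_to_promote(daily_results, required=REQUIRED_CONSECUTIVE_WINS):
    """True only when the most recent `required` days are all challenger wins."""
    return trailing_win_streak(daily_results) >= required


# Illustrative histories, oldest day first.
steady_winner = [True, False] + [True] * 7
flaky_winner = [True] * 6 + [False] + [True] * 3

print(ready_to_promote(steady_winner))
print(ready_to_promote(flaky_winner))
```

Counting only the trailing streak means a single loss resets the clock, which is the point of the rule: a challenger with many wins overall but a recent loss still has to re-earn promotion.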
✅ Summary
Congratulations — you have shipped a production-grade ML model on Microsoft Fabric.
What You Accomplished
- Provisioned a Git-integrated F64 workspace with ML Model items and Endpoints enabled
- Trained a baseline + challenger model with full reproducibility metadata (data version, code SHA, env, params)
- Inspected the MLflow registry programmatically and via the UI
- Authored and executed all five validation gates in CI
- Promoted a model through None → Staging → Production using stage transitions, with a defensible audit trail
- Built a Fabric Pipeline for continuous champion/challenger evaluation
- Deployed the Production model as an ML Model Endpoint with HA (minInstances=2) and load-tested it to a p99 latency under 200 ms
- Wired a batch inference pipeline writing to lh_gold.slot_revenue_forecasts
- Scheduled drift detection writing to lh_gold.ml_drift_metrics and lh_gold.ml_retrain_triggers
- Wired drift alerts through Azure Monitor → Action Group → Logic App → Fabric Pipeline
- Verified the closed retrain → re-validate → re-promote loop end-to-end
- Built a cost dashboard with per-model attribution
- Certified production readiness against the canonical checklist
Key Takeaways
| Concept | Key Point |
|---|---|
| Reproducibility | Every model run captures data version, code SHA, env, params — recoverable from artifacts alone |
| Validation gates | Performance, holdout, fairness, latency, calibration — all in CI, not manual |
| Stage discipline | models:/{name}/Production for consumers; promotion via MlflowClient, not UI |
| Closed-loop retraining | Drift → alert → Logic App → retrain → gate → re-promote — tested weekly |
| Cost attribution | Tag every Spark session and endpoint with cost_center, model, intent |
| Audit trail | ml_promotion_audit is non-negotiable; preserved indefinitely |
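Stage discipline is easiest to keep when a machine enforces it. The following is a minimal sketch of a CI lint that flags version-pinned model URIs; it is an illustration, not a check shipped with this repository. `VERSION_PIN`, `find_version_pins`, and the model name `slot_revenue` in the sample are hypothetical, and the pattern assumes MLflow-style URIs of the form models:/{name}/{version-or-stage}.

```python
import re

# Matches model URIs whose last segment is purely numeric, e.g. models:/slot_revenue/2
VERSION_PIN = re.compile(r"models:/[^/\s]+/\d+(?![\w.])")


def find_version_pins(source_text):
    """Return (line_number, stripped_line) pairs that contain a version-pinned URI."""
    return [
        (number, line.strip())
        for number, line in enumerate(source_text.splitlines(), start=1)
        if VERSION_PIN.search(line)
    ]


# Illustrative consumer code; `slot_revenue` is a made-up model name.
sample = '''
champion = mlflow.pyfunc.load_model("models:/slot_revenue/Production")
pinned = mlflow.pyfunc.load_model("models:/slot_revenue/2")
'''

for number, line in find_version_pins(sample):
    print(f"line {number}: {line}")
```

In a CI job, the same function could be run over every file under the consumer code paths and the job failed when the returned list is non-empty.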
🚀 Next Steps
Next Tutorial: Tutorial 40 — RAG Production (planned). Take what you learned here and apply it to retrieval-augmented generation: vector store on Eventhouse, prompt versioning in MLflow, drift on retrieval quality, and the same gate-promote-monitor loop for LLM-backed apps.
Related Wave 2 docs (data management — see docs/best-practices/operations/):
- Model Monitoring & Drift Detection — deep dive on the four drift types
- Feature Store on OneLake — point-in-time correctness and feature reuse
- Responsible AI Framework — fairness, explainability, governance
- LLM Cost Tracking — adapted patterns for ML cost attribution
- Observability Stack — Action Groups, runbooks, SLO/SLI
- Incident Response Runbook — postmortems, on-call
Related notebooks:
- 06_feature_store_demo.py: point-in-time feature retrieval
- 07_rag_eventhouse_vector.py: RAG patterns on Eventhouse
- 08_responsible_ai_audit.py: fairness audit on the casino models
📚 References
| Resource | Link |
|---|---|
| MLOps anchor doc | docs/best-practices/mlops-fabric-production.md |
| Drift detection anchor doc | docs/best-practices/model-monitoring-drift-detection.md |
| Registry notebook | notebooks/ml/04_mlops_model_registry.py |
| Drift notebook | notebooks/ml/05_drift_detection.py |
| Fabric MLflow docs | Microsoft Learn — MLflow in Fabric |
| ML Model Endpoints (Preview) | Microsoft Learn — ML Model Endpoints |
| fabric-cicd | docs/best-practices/fabric-cicd-deployment.md |