# MLOps for Fabric Production

End-to-End ML Lifecycle Management on Microsoft Fabric (Anchor Doc for Wave 2)

Last Updated: 2026-04-27 | Version: 1.0.0 | Anchor for: Wave 2 ML/AI doc set
## Table of Contents

- Overview
- MLOps Reference Architecture
- Model Registry (MLflow in Fabric)
- Experiment Tracking
- Training Pipelines
- Model Validation Gates
- Deployment Patterns
- Canary, A/B, and Champion-Challenger
- Production Monitoring
- Retraining Triggers
- Cost Attribution & FinOps
- Casino Implementation
- Federal Implementation
- Anti-Patterns
- Production Readiness Checklist
- References
## Overview

MLOps is the discipline of running machine-learning workloads in production with the same rigor we apply to software: version control, automated testing, CI/CD, observability, incident response, and SLOs. Most Fabric customers ship demo-grade ML (a notebook that produces a model) and stop there. This document covers the production backbone every Fabric ML team needs.
What "Production-Grade ML" Means in Fabric¶
| Aspect | Demo-Grade | Production-Grade |
|---|---|---|
| Versioning | Notebook in workspace | Code in Git, models in MLflow registry, data in Delta with time-travel |
| Reproducibility | "Works on my notebook" | Every model has captured: code SHA, data version, env, hyperparameters |
| Validation | Visual inspection of metrics | Automated gates: holdout AUC, drift, fairness, latency, calibration |
| Deployment | Manual export → batch score | CI/CD via fabric-cicd, canary rollout, automated rollback |
| Monitoring | None or "looks fine" | Drift, performance decay, business KPIs, alerts wired to runbooks |
| Retraining | "When someone notices" | Automated triggers: schedule, drift, performance, data volume |
| Incident response | Ad-hoc panic | Runbooks, on-call rotation, postmortems |
### Where Fabric Fits
Fabric provides the integrated ML platform: MLflow tracking, ML Model items, ML Model Endpoints (preview), AutoML, Spark for distributed training, OneLake for feature storage, and Workspace Monitoring for telemetry. This guide ties those pieces together into a coherent production workflow.
Scope: This is the anchor doc for Phase 14 Wave 2. Sub-topics get their own deep-dive docs: model monitoring & drift detection, feature store on OneLake, responsible AI, LLM cost tracking, RAG patterns, prompt engineering, evaluation harnesses.
## MLOps Reference Architecture
flowchart LR
subgraph DataLayer["Data Layer (OneLake)"]
    Bronze[(Bronze)]
    Silver[(Silver)]
    Gold[(Gold)]
    FS[(Feature Store)]
end
subgraph DevPlane["Development Plane"]
    NB[Notebook<br/>or SJD]
    Exp[MLflow<br/>Experiment]
    Reg[MLflow<br/>Model Registry]
end
subgraph CICD["CI/CD"]
    Git[(Git)]
    FCICD[fabric-cicd]
    Tests[Validation<br/>Gates]
end
subgraph ServePlane["Serving Plane"]
    Batch[Batch Inference<br/>Pipeline]
    Online[ML Model<br/>Endpoint]
    Stream[Eventstream<br/>Scoring]
end
subgraph ObsPlane["Observability"]
    WM[Workspace<br/>Monitoring]
    Drift[Drift<br/>Detector]
    Alerts[Action<br/>Groups]
end
Silver --> NB
Gold --> NB
FS --> NB
NB --> Exp
Exp --> Reg
Git --> FCICD
FCICD --> Tests
Tests -->|approved| Reg
Reg --> Batch
Reg --> Online
Reg --> Stream
Batch --> Gold
Online --> WM
Stream --> WM
WM --> Drift
Drift --> Alerts
Alerts -.->|retrain trigger| NB

### Component Map
| Component | Fabric Item | Purpose |
|---|---|---|
| Feature Store | Lakehouse table or shortcut | Versioned features w/ point-in-time correctness |
| Experiment Tracking | MLflow (built-in) | Every training run logged |
| Model Registry | ML Model item | Versioned, governed model artifacts |
| Training Job | Spark Job Definition or Notebook | Reproducible training runs |
| Batch Inference | Data Pipeline + Notebook | Scheduled scoring of large batches |
| Online Inference | ML Model Endpoint (Preview) | RESTful serving for low-latency apps |
| Stream Inference | Eventstream + Notebook activity | Real-time scoring on event streams |
| Drift Monitoring | Workspace Monitoring + Eventhouse | Statistical + performance drift |
| Alerting | Action Groups + Data Activator | Fan-out to PagerDuty / Teams / runbook |
| CI/CD | GitHub Actions + fabric-cicd | Promote model + code dev → staging → prod |
## Model Registry (MLflow in Fabric)
Fabric's MLflow tracking server is workspace-scoped. The ML Model item wraps an MLflow registered model with Fabric-native governance, lineage, and access control.
### Registry Stages

None ──▶ Staging ──▶ Production ──▶ Archived
                         ▲              │
                         └──────────────┘
                          (rollback path)
Use stage transitions, not version-pinning, for rollouts. Consumers reference models:/{name}/Production rather than a hardcoded version.
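For example, a minimal consumer-side sketch of loading by stage reference; the model name is the casino example used throughout, and `mlflow.pyfunc.load_model` is the flavor-agnostic MLflow loader:

```python
import mlflow.pyfunc

# Resolves whichever version currently holds the Production stage.
# A stage transition in the registry changes what this returns,
# with no change to consumer code.
model = mlflow.pyfunc.load_model(
    "models:/casino-slot-revenue-forecast-prophet/Production"
)
```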
### Naming Conventions

{domain}-{task}-{algo} → model name

casino-slot-revenue-forecast-prophet
casino-fraud-detection-lightgbm
financial-aml-graph-gnn
healthcare-readmission-xgboost
### Logging a Run (canonical pattern)
import os

import mlflow
from mlflow.models.signature import infer_signature

mlflow.set_experiment("/Shared/casino-slot-revenue-forecast")

with mlflow.start_run() as run:
    mlflow.log_params({
        "algo": "prophet",
        "lookback_days": 90,
        "seasonality_mode": "multiplicative",
    })

    model = train_prophet(...)
    metrics = evaluate(model, holdout_df)
    mlflow.log_metrics({
        "rmse": metrics["rmse"],
        "mape": metrics["mape"],
        "smape": metrics["smape"],
    })

    # Capture data lineage explicitly
    mlflow.log_param("training_data_table", "lh_gold.fact_daily_slot_revenue")
    mlflow.log_param("training_data_version", spark.sql(
        "DESCRIBE HISTORY lh_gold.fact_daily_slot_revenue LIMIT 1"
    ).collect()[0]["version"])

    # Capture code SHA for reproducibility
    mlflow.log_param("git_sha", os.environ.get("GIT_SHA", "unknown"))

    signature = infer_signature(holdout_df.drop(columns=["revenue"]), holdout_df["revenue"])
    mlflow.prophet.log_model(
        model,
        "model",
        signature=signature,
        registered_model_name="casino-slot-revenue-forecast-prophet",
    )
### Promoting to Production
Use the MLflow client; never click-promote in the UI for production models.
from mlflow.tracking import MlflowClient

client = MlflowClient()
client.transition_model_version_stage(
    name="casino-slot-revenue-forecast-prophet",
    version=42,
    stage="Production",
    archive_existing_versions=True,  # auto-archive old prod
)
Promotion must be gated by Validation Gates (CI check before merge). Manual promotion is only allowed for hotfixes via the tenant migration runbook.
## Experiment Tracking

### Discipline Rules
- One experiment per (domain, task), not per developer, not per branch
- Every run captures: data version, code SHA, env (Spark runtime, lib versions), params, metrics, artifacts (model + plots)
- Tag runs with: branch, PR number, author, intent (`exploratory`, `baseline`, `production-candidate`)
- Parent/child runs for hyperparameter sweeps, so the parent shows the search space and the children show individual trials (see the sketch after this list)
- Never delete runs; archive instead. Runs are evidence for incidents and audits
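A minimal sketch of the parent/child pattern using MLflow nested runs; the parameter grid and placeholder metric are illustrative, not a real sweep:

```python
import mlflow

mlflow.set_experiment("/Shared/casino-slot-revenue-forecast")

search_space = [{"lookback_days": d} for d in (30, 60, 90)]  # illustrative grid

with mlflow.start_run(run_name="sweep-lookback"):
    # The parent run records the search space once
    mlflow.log_param("search_space", str(search_space))
    for params in search_space:
        # Each trial is a child run nested under the parent
        with mlflow.start_run(nested=True):
            mlflow.log_params(params)
            # train + evaluate would run here; rmse is a placeholder metric
            mlflow.log_metric("rmse", 0.0)
```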
### Tagging Pattern
mlflow.set_tags({
    "branch": os.environ.get("GIT_BRANCH"),
    "pr_number": os.environ.get("GITHUB_PR_NUMBER"),
    "author": os.environ.get("GITHUB_ACTOR"),
    "intent": "production-candidate",  # or "exploratory" / "baseline" / "rollback-test"
    "domain": "casino",
    "compliance_review": "approved-2026-04-15",  # if model touches regulated data
})
## Training Pipelines

Fabric supports three patterns, ordered by use-case complexity:
### Pattern 1: Notebook-Driven (development & light prod)
Best for: data scientists prototyping, batch retraining < 1 hour, single GPU tasks.
# Fabric Data Pipeline
- name: train-slot-revenue-forecast
  type: TridentNotebook
  notebook: 02_ml_revenue_forecast_train
  parameters:
    training_window_days: 365
    target_table: lh_gold.fact_daily_slot_revenue
  trigger: schedule(0 2 * * *)  # daily 2am
### Pattern 2: Spark Job Definition (heavy distributed training)
Best for: large datasets > 100M rows, distributed training, custom Spark configs.
# spark_job_definition.json
{
  "executableFile": "train.py",
  "defaultLakehouse": "lh_silver",
  "command": [
    "spark-submit",
    "--conf", "spark.executor.instances=20",
    "--conf", "spark.executor.memory=32g",
    "--py-files", "src.zip",
    "train.py",
    "--training-window", "365",
    "--target", "lh_gold.fact_daily_revenue"
  ]
}
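A sketch of the train.py entry point the job definition above invokes, assuming a plain argparse interface for the --training-window and --target flags; the training body itself is elided:

```python
# train.py - entry point sketch for the Spark Job Definition above
import argparse


def main() -> None:
    parser = argparse.ArgumentParser(description="Distributed training entry point")
    parser.add_argument("--training-window", type=int, required=True,
                        help="Training window in days")
    parser.add_argument("--target", required=True,
                        help="Target Delta table, e.g. lh_gold.fact_daily_revenue")
    args = parser.parse_args()

    # Training body goes here: read the target table over the configured window,
    # then train and log to MLflow exactly as in the canonical logging pattern above.
    print(f"training window={args.training_window}d, target={args.target}")


if __name__ == "__main__":
    main()
```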
### Pattern 3: AutoML Experiment (automated baseline)
Best for: tabular tasks, fast time-to-baseline, when human selection adds little.
from synapse.ml.train import AutoMLConfig

config = AutoMLConfig(
    task="regression",
    primary_metric="r2_score",
    training_data=df_train,
    label_column_name="revenue",
    cv_folds=5,
    experiment_timeout_hours=2,
    max_concurrent_iterations=4,
)
config.run()
Pattern 3 (AutoML) is great for the first model in a domain. Move to Pattern 1 or 2 once you understand the problem space and need custom features / custom loss / custom evaluation.
## Model Validation Gates
Every model promotion to Production must pass these gates automatically (in CI, not manually):
### Gate 1 – Performance Threshold
def gate_performance(metrics: dict, baseline_metrics: dict) -> bool:
    """New model must be ≥ 95% of baseline performance and above an absolute floor."""
    if metrics["auc"] < 0.75:  # absolute floor
        return False
    if metrics["auc"] < 0.95 * baseline_metrics["auc"]:  # relative floor
        return False
    return True
### Gate 2 – Holdout Stability
Run inference on a fixed, never-touched holdout set and compare. Catches data leakage.
def gate_holdout(model, holdout_path: str, max_drift_pct: float = 5.0) -> bool:
    # The fixed holdout lives in the lakehouse; score it as pandas for simplicity
    holdout = spark.table(holdout_path).toPandas()
    pred_drift = (model.predict(holdout) - holdout["expected_pred"]).abs().mean()
    return pred_drift / holdout["expected_pred"].mean() < max_drift_pct / 100
### Gate 3 – Fairness (regulated domains only)
For lending, healthcare, hiring, criminal justice. See Responsible AI Framework.
def gate_fairness(model, holdout, protected_attrs: list[str]) -> bool:
    for attr in protected_attrs:
        dpr = demographic_parity_ratio(model, holdout, attr)
        if dpr < 0.8:  # 80% rule
            return False
    return True
### Gate 4 – Latency (online endpoints only)
import numpy

def gate_latency(model_uri: str, sample_df, p99_ms_target: int = 100) -> bool:
    latencies = [time_inference(model_uri, sample_df.iloc[[i]]) for i in range(1000)]
    p99 = numpy.percentile(latencies, 99)
    return p99 < p99_ms_target
### Gate 5 – Calibration
Probabilities should be calibrated; predicted 70% should mean ~70% positive in reality.
import numpy as np
from sklearn.calibration import calibration_curve

def gate_calibration(y_true, y_pred_proba, max_ece: float = 0.1) -> bool:
    # Unweighted approximation of expected calibration error over 10 bins
    frac_pos, mean_pred = calibration_curve(y_true, y_pred_proba, n_bins=10)
    ece = np.mean(np.abs(frac_pos - mean_pred))
    return ece < max_ece
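How the gates combine into the single pass/fail check that CI runs is up to the team; one hedged sketch, using the gate functions defined above (argument shapes and the protected-attribute list are illustrative):

```python
def run_all_gates(model, model_uri, metrics, baseline_metrics, holdout_path,
                  sample_df, y_true, y_pred_proba,
                  regulated: bool = False, online: bool = False) -> bool:
    # Every applicable gate must pass before promotion
    checks = [
        gate_performance(metrics, baseline_metrics),
        gate_holdout(model, holdout_path),
        gate_calibration(y_true, y_pred_proba),
    ]
    if online:
        checks.append(gate_latency(model_uri, sample_df))
    if regulated:
        # Protected attributes are illustrative; align with your compliance review
        checks.append(gate_fairness(model, spark.table(holdout_path),
                                    protected_attrs=["sex", "age_band"]))
    return all(checks)
```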
### CI Wiring (.github/workflows/ml-promotion.yml)
name: ML Model Promotion Gate
on:
  pull_request:
    paths: ['notebooks/ml/**', 'src/ml/**']
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pytest tests/ml/test_validation_gates.py
      - run: python scripts/run_holdout_gate.py
      - run: python scripts/run_fairness_gate.py
        if: contains(github.event.pull_request.labels.*.name, 'regulated-domain')
## Deployment Patterns

### Pattern A: Batch Inference (most common)
# notebooks/ml/batch_score.py
import mlflow

model_uri = "models:/casino-slot-revenue-forecast-prophet/Production"
model = mlflow.prophet.load_model(model_uri)

unscored = spark.table("lh_silver.daily_slot_aggregates").filter("predicted_revenue IS NULL")
predictions = model.predict(unscored.toPandas())

spark.createDataFrame(predictions).write.mode("append").saveAsTable("lh_gold.slot_revenue_forecasts")
Trigger: Fabric Pipeline scheduled nightly. SLA: complete within 2-hour window.
### Pattern B: Online Endpoint (low-latency apps)
# Deploy via REST API (or Fabric Portal)
import requests

response = requests.post(
    f"https://api.fabric.microsoft.com/v1/workspaces/{ws_id}/mlmodels/{model_id}/endpoints",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "name": "slot-revenue-prod",
        "modelVersion": "42",
        "instanceType": "Standard_DS3_v2",
        "minInstances": 2,   # for HA
        "maxInstances": 10,  # autoscale
        "trafficSplit": {"42": 100},
    },
)
Use online endpoints when:

- App needs single-record latency < 200 ms
- Personalization or real-time decisioning
- API consumed by external systems
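On the consumer side, a rough sketch of calling such an endpoint, assuming an MLflow-style JSON scoring contract; the URL shape, payload schema, and feature names are assumptions, not the documented preview API:

```python
import requests

# Hypothetical scoring call; endpoint_url, token, and the payload schema
# are placeholders - check the endpoint's actual contract.
endpoint_url = "https://<endpoint-host>/score"
payload = {
    "dataframe_records": [
        {"machine_id": "SL-1042", "day_of_week": 5, "lookback_avg": 1873.4}
    ]
}
response = requests.post(
    endpoint_url,
    headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    json=payload,
    timeout=5,  # keep client timeouts tight on low-latency paths
)
prediction = response.json()
```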
### Pattern C: Stream Inference (real-time)
# Eventstream → notebook activity → predict → Eventhouse
import mlflow
from pyspark.sql.functions import struct, udf

model_uri = "models:/casino-fraud-detection-lightgbm/Production"
model = mlflow.lightgbm.load_model(model_uri)

@udf("double")
def score(features):
    return float(model.predict(features))

stream = (spark.readStream.format("eventhubs").options(**eh_conf).load()
    .withColumn("fraud_score", score(struct("amount", "merchant_cat", ...)))
    .writeStream.format("delta").outputMode("append")
    .toTable("lh_gold.fraud_scores"))
## Canary, A/B, and Champion-Challenger

### Canary Rollout
Use ML Model Endpoint traffic splits:
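For example, a sketch of starting a 90/10 canary by updating the endpoint's traffic split. It reuses the request shape from Pattern B; the exact route and body for updating an existing endpoint are assumptions against the preview API:

```python
import requests

# Hypothetical update call: send 10% of traffic to the new version (43)
# and keep 90% on the current production version (42).
response = requests.patch(
    f"https://api.fabric.microsoft.com/v1/workspaces/{ws_id}/mlmodels/{model_id}/endpoints/slot-revenue-prod",
    headers={"Authorization": f"Bearer {token}"},
    json={"trafficSplit": {"42": 90, "43": 10}},
)
response.raise_for_status()
```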
Monitor key metrics for 24-72 hours. Promote to 50/50, then 100%/0%. Roll back immediately if:
- Error rate increases
- Latency p99 degrades
- Business KPI moves in the wrong direction (calibration check)
### A/B Testing

Two models, both production-quality. Split deterministically by user/transaction ID hash. Run for 2+ weeks. Apply a statistical test on the primary metric. Pre-register the test plan before running; no fishing for significance.
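A minimal sketch of deterministic assignment: hash the user ID so the same user always lands in the same arm (the 50/50 split and salt are illustrative):

```python
import hashlib


def ab_arm(user_id: str, salt: str = "slot-forecast-ab-2026") -> str:
    # Stable hash -> bucket in [0, 100); the same user_id always gets the same arm
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "champion" if bucket < 50 else "challenger"


print(ab_arm("player-000123"))
```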
### Champion-Challenger
Continuous evaluation: production = champion. New candidate = challenger. Both score same data. Compare blind. When challenger beats champion on agreed metrics for N consecutive evaluation windows, promote challenger to champion.
# Daily challenger evaluation
champion = mlflow.pyfunc.load_model("models:/{name}/Production")
challenger = mlflow.pyfunc.load_model("models:/{name}/Staging")

for _ in range(num_days):
    today = ...  # today's actual data
    champ_metrics = evaluate(champion, today)
    chall_metrics = evaluate(challenger, today)
    log_to_table("lh_gold.ml_champion_challenger", champ_metrics, chall_metrics)

# Promotion check (e.g., 14-day rolling window, p < 0.05)
if challenger_wins_significantly(window_days=14):
    promote_to_production(challenger)
## Production Monitoring
See Model Monitoring & Drift Detection for full coverage. Briefly, every production model emits:
| Signal | Type | Alert Threshold |
|---|---|---|
| Prediction distribution | Statistical drift (PSI, KS) | PSI > 0.2 |
| Input feature drift | Per-feature drift | Top-3 features PSI > 0.25 |
| Performance | Realized vs predicted | AUC degradation > 5% |
| Latency | p99 inference time | > target SLO |
| Error rate | 5xx + timeout | > 1% sustained 5 min |
| Volume | Predictions per hour | < 50% of baseline |
| Calibration | Reliability diagram | ECE > 0.1 |
Wire these signals to Action Groups via the observability stack.
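For reference, a minimal sketch of the PSI calculation behind the thresholds above; the binning scheme is illustrative, and the drift-detection doc covers the full approach:

```python
import numpy as np


def population_stability_index(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    # Bin edges come from the reference (training-time) distribution
    edges = np.histogram_bin_edges(expected, bins=bins)
    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Avoid log(0) and divide-by-zero on empty bins
    exp_pct = np.clip(exp_pct, 1e-6, None)
    act_pct = np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))
```

PSI > 0.2 on the prediction distribution is the alert threshold used in the table above.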
## Retraining Triggers
| Trigger Type | Detection | Action |
|---|---|---|
| Schedule | Cron | Retrain weekly/monthly regardless of drift |
| Drift | Workspace Monitoring KQL | Retrain when PSI > 0.2 sustained |
| Performance | Realized metric below threshold | Retrain when AUC drops 5% |
| Volume | New labeled data > N | Retrain when training set grows ≥ 10% (see the sketch after this table) |
| Concept change | Business event flag | Manual: regulation change, product change, market shift |
| Anomaly | Outlier rate spike | Investigate first, retrain only after root cause found |
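A hedged sketch of the volume trigger: compare the current labeled-row count in the training table against the count recorded on the last production training run. The training_row_count param, last_production_run_id, and trigger_retraining_pipeline() are assumed conventions, not Fabric APIs:

```python
import mlflow

# Current labeled volume in the training table (casino example)
current_rows = spark.table("lh_gold.fact_daily_slot_revenue").count()

# Row count logged as a param on the last production training run (assumed convention)
last_trained_rows = int(
    mlflow.get_run(last_production_run_id).data.params["training_row_count"]
)

if current_rows >= 1.10 * last_trained_rows:  # >= 10% growth, per the table above
    trigger_retraining_pipeline()  # e.g., call the pipeline job API shown in the next section
```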
### Drift-Driven Retrain (KQL → Action Group → Pipeline)
// Workspace Monitoring KQL
ModelDriftMetrics
| where ModelName == "casino-slot-revenue-forecast-prophet"
| where Window == "7d"
| where PSI > 0.2
| top 1 by TimeGenerated
Wire as Azure Monitor scheduled query alert โ Action Group โ Logic App โ Fabric Pipeline trigger (/v1/workspaces/{ws}/items/{pipeline}/jobs/instances?jobType=Pipeline).
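A sketch of the final hop: the Logic App (or any caller) starting the retraining pipeline through the job-instances endpoint quoted above. ws_id, pipeline_id, and token are placeholders, and the executionData payload shape is an assumption:

```python
import requests

response = requests.post(
    f"https://api.fabric.microsoft.com/v1/workspaces/{ws_id}/items/{pipeline_id}/jobs/instances?jobType=Pipeline",
    headers={"Authorization": f"Bearer {token}"},
    json={"executionData": {"parameters": {"trigger_reason": "drift_psi_gt_0.2"}}},
)
response.raise_for_status()  # 202 Accepted means the pipeline run was queued
```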
## Cost Attribution & FinOps

### Cost Surfaces
| Surface | Driver | Mitigation |
|---|---|---|
| Spark training | Executor count × duration | Right-size; use Job pools, not interactive sessions |
| AutoML | Trial count × duration × parallelism | Cap trial budget, use early stopping |
| Online endpoint | Instance count × hours + per-1,000 inferences | Autoscale min 1 / max N; batch when possible |
| MLflow artifacts | OneLake storage | Lifecycle policy: archive runs > 90 days |
| Drift monitoring | Eventhouse + KQL | Sample, don't score every record |
| LLM API calls | Token count | See LLM Cost Tracking |
### Cost Attribution

Tag every job with cost_center, domain, model, intent. Roll up via Workspace Monitoring + Capacity Metrics → Power BI cost dashboard.
spark.conf.set("spark.fabric.cost_center", "casino-data-science")
spark.conf.set("spark.fabric.model", "casino-slot-revenue-forecast-prophet")
## Casino Implementation
| Model | Use Case | Pattern | Compliance Notes |
|---|---|---|---|
| Slot Revenue Forecast | Daily revenue projection per machine | Batch (Pattern A) | None (aggregated only) |
| Player Churn | Identify at-risk players | Batch | PII handling: features hashed |
| Fraud Detection | Real-time AML/structuring | Stream (Pattern C) | BSA/SAR (see compliance layer) |
| Slot Anomaly | Hardware fault prediction | Stream | None (operational only) |
| Marketing Lift | Promotion uplift modeling | Batch | Opt-in tracking required |
See notebooks/ml/01_ml_player_churn_prediction.py and notebooks/ml/02_ml_fraud_detection.py.
## Federal Implementation
| Model | Agency | Use Case | Compliance |
|---|---|---|---|
| Crop Yield Forecast | USDA | Regional yield prediction | None (public data) |
| Loan Default Risk | SBA | Underwriting | ECOA, fairness gate required |
| Storm Severity | NOAA | Severity classification | None (public safety) |
| Air Quality Forecast | EPA | AQI prediction | None (public data) |
| Earthquake Magnitude | DOI/USGS | Real-time magnitude | None (public safety) |
| Antitrust Risk Score | DOJ | Merger review prioritization | Sensitive (restricted access) |
For regulated federal use cases (DOJ, lending), the Responsible AI Framework governance gates are mandatory.
## Anti-Patterns
| Anti-Pattern | Why It Hurts | What to Do Instead |
|---|---|---|
| Notebook-only model | Not reproducible; no version, no rollback | Register every prod model in MLflow |
| Manual UI promotion to Production | No audit trail, no validation gates | CI-driven promotion via MLflow client |
| Same model serves dev + prod traffic | Bad deploy = customer impact | Separate workspaces; canary rollout |
| No holdout set | Reported metrics overfit to dev/test split | Permanent, never-touched holdout in OneLake |
| Drift detection without retraining trigger | Alerts fatigue, no action | Wire alerts to retraining pipeline |
| Single-version pin in client code | Forces redeploy for every model update | Use stage references (models:/{name}/Production) |
| Logging metrics but not data version | Can't reproduce a bad run | Always log Delta version of training data |
| No fairness check on regulated models | Legal/compliance liability | Mandatory gate for lending, healthcare, hiring |
| AutoML in production without review | Hidden complexity, hidden failure modes | Use AutoML for baseline; productionize as Pattern 1 or 2 |
| No retraining schedule for stable models | Quiet decay, late detection | Monthly retrain even without drift signal |
## Production Readiness Checklist
Before promoting any model to Production stage:
- Code in Git, on `main` branch, PR reviewed
- Model registered in MLflow with name, version, stage
- Training data version logged (Delta version or partition spec)
- Code SHA logged
- Environment captured (Spark runtime, lib versions, env file)
- All 5 validation gates pass: performance, holdout, fairness (if regulated), latency (if online), calibration
- Holdout set excluded from training/validation (audit-trail confirmed)
- Monitoring wired: drift, performance, latency, error, volume
- Alerts wired to Action Group with correct severity routing
- Runbook exists for the model's failure modes
- Rollback procedure documented and tested
- Retraining trigger defined (schedule, drift, performance, or hybrid)
- Cost-center tag set
- Model card published (purpose, training data, performance, limitations, fairness)
- On-call team notified of new model
- Postmortem template created for first 30-day incident review
- Compliance sign-off obtained (if regulated domain)
## References

### Microsoft Fabric Documentation

### Industry Standards
- Google SRE Book – SLO/SLI principles applied to ML
- Microsoft Responsible AI Standard
- Martin Fowler – Continuous Delivery for Machine Learning (CD4ML)
### Related Wave 2 Docs
- Model Monitoring & Drift Detection
- Feature Store on OneLake
- Responsible AI Framework
- LLM Cost Tracking
- RAG Patterns Deep Dive
- Prompt Engineering for Fabric
- LLM Evaluation Harness
### Related Operational Docs (Wave 1)
- Incident Response Template
- SLO/SLI for Fabric
- Tenant Migration Runbook
- Observability Stack
- On-Call Rotation Handbook