🚀 MLOps for Fabric Production¶

End-to-End ML Lifecycle Management on Microsoft Fabric — Anchor Doc for Wave 2

Last Updated: 2026-04-27 | Version: 1.0.0 | Anchor for: Wave 2 ML/AI doc set

📑 Table of Contents¶

🎯 Overview
🏗️ MLOps Reference Architecture
🗂️ Model Registry (MLflow in Fabric)
🧪 Experiment Tracking
🏋️ Training Pipelines
🚦 Model Validation Gates
🚢 Deployment Patterns
🐤 Canary, A/B, and Champion-Challenger
📈 Production Monitoring
🔁 Retraining Triggers
💰 Cost Attribution & FinOps
🎰 Casino Implementation
🏛️ Federal Implementation
🚫 Anti-Patterns
📋 Production Readiness Checklist
📚 References

🎯 Overview¶

MLOps is the discipline of running machine-learning workloads in production with the same rigor we apply to software: version control, automated testing, CI/CD, observability, incident response, and SLOs. Most Fabric customers ship demo-grade ML — a notebook that produces a model — and stop there. This document covers the production backbone every Fabric ML team needs.

What "Production-Grade ML" Means in Fabric¶

Aspect	Demo-Grade	Production-Grade
Versioning	Notebook in workspace	Code in Git, models in MLflow registry, data in Delta with time-travel
Reproducibility	"Works on my notebook"	Every model has captured: code SHA, data version, env, hyperparameters
Validation	Visual inspection of metrics	Automated gates: holdout AUC, drift, fairness, latency, calibration
Deployment	Manual export → batch score	CI/CD via fabric-cicd, canary rollout, automated rollback
Monitoring	None or "looks fine"	Drift, performance decay, business KPIs, alerts wired to runbooks
Retraining	"When someone notices"	Automated triggers: schedule, drift, performance, data volume
Incident response	Ad-hoc panic	Runbooks, on-call rotation, postmortems

Where Fabric Fits¶

Fabric provides the integrated ML platform: MLflow tracking, ML Model items, ML Model Endpoints (preview), AutoML, Spark for distributed training, OneLake for feature storage, and Workspace Monitoring for telemetry. This guide ties those pieces together into a coherent production workflow.

📝 Scope: This is the anchor doc for Phase 14 Wave 2. Sub-topics get their own deep-dive docs: model monitoring & drift detection, feature store on OneLake, responsible AI, LLM cost tracking, RAG patterns, prompt engineering, evaluation harnesses.

🏗️ MLOps Reference Architecture¶

flowchart LR
    subgraph DataLayer["📊 Data Layer (OneLake)"]
        Bronze[(🥉 Bronze)]
        Silver[(🥈 Silver)]
        Gold[(🥇 Gold)]
        FS[(🏪 Feature Store)]
    end

    subgraph DevPlane["🧪 Development Plane"]
        NB[📓 Notebook<br/>or SJD]
        Exp[🧪 MLflow<br/>Experiment]
        Reg[🗂️ MLflow<br/>Model Registry]
    end

    subgraph CICD["🔄 CI/CD"]
        Git[(📦 Git)]
        FCICD[fabric-cicd]
        Tests[🧪 Validation<br/>Gates]
    end

    subgraph ServePlane["🚢 Serving Plane"]
        Batch[📦 Batch Inference<br/>Pipeline]
        Online[⚡ ML Model<br/>Endpoint]
        Stream[🌊 Eventstream<br/>Scoring]
    end

    subgraph ObsPlane["📈 Observability"]
        WM[Workspace<br/>Monitoring]
        Drift[Drift<br/>Detector]
        Alerts[Action<br/>Groups]
    end

    Silver --> NB
    Gold --> NB
    FS --> NB
    NB --> Exp
    Exp --> Reg
    Git --> FCICD
    FCICD --> Tests
    Tests -->|approved| Reg
    Reg --> Batch
    Reg --> Online
    Reg --> Stream
    Batch --> Gold
    Online --> WM
    Stream --> WM
    WM --> Drift
    Drift --> Alerts
    Alerts -.->|retrain trigger| NB

Component Map¶

Component	Fabric Item	Purpose
Feature Store	Lakehouse table or shortcut	Versioned features w/ point-in-time correctness
Experiment Tracking	MLflow (built-in)	Every training run logged
Model Registry	ML Model item	Versioned, governed model artifacts
Training Job	Spark Job Definition or Notebook	Reproducible training runs
Batch Inference	Data Pipeline + Notebook	Scheduled scoring of large batches
Online Inference	ML Model Endpoint (Preview)	RESTful serving for low-latency apps
Stream Inference	Eventstream + Notebook activity	Real-time scoring on event streams
Drift Monitoring	Workspace Monitoring + Eventhouse	Statistical + performance drift
Alerting	Action Groups + Data Activator	Fan-out to PagerDuty / Teams / runbook
CI/CD	GitHub Actions + fabric-cicd	Promote model + code dev → staging → prod

🗂️ Model Registry (MLflow in Fabric)¶

Fabric's MLflow tracking server is workspace-scoped. The ML Model item wraps an MLflow registered model with Fabric-native governance, lineage, and access control.

Registry Stages¶

   None ──▶ Staging ──▶ Production ──▶ Archived
              ▲             │
              └─────────────┘
              (rollback path)

Use stage transitions, not version-pinning, for rollouts. Consumers reference models:/{name}/Production rather than a hardcoded version.
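The stage diagram above can be read as a small state machine. The sketch below is a hypothetical helper, not part of MLflow or Fabric: `ALLOWED_TRANSITIONS`, `can_transition`, and `stage_uri` are illustrative names, and the rollback edge is assumed to run from Production back to Staging, as the diagram draws it.

```python
# Hypothetical helper that encodes the diagram's transitions; not an MLflow API.
ALLOWED_TRANSITIONS = {
    "None": {"Staging"},
    "Staging": {"Production"},
    "Production": {"Archived", "Staging"},  # the Staging edge is the rollback path (assumed)
    "Archived": set(),
}


def can_transition(current: str, target: str) -> bool:
    """Return True if the diagram allows moving from `current` to `target`."""
    return target in ALLOWED_TRANSITIONS.get(current, set())


def stage_uri(model_name: str, stage: str = "Production") -> str:
    """Build the stage-based URI that consumers use instead of a pinned version."""
    if stage not in ALLOWED_TRANSITIONS or stage == "None":
        raise ValueError(f"not a servable stage: {stage!r}")
    return f"models:/{model_name}/{stage}"
```

A promotion script could call `can_transition` before asking the registry to move a version, so that an accidental jump from None straight to Production is rejected early.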

Naming Conventions¶

{domain}-{task}-{algo}                          ← model name
casino-slot-revenue-forecast-prophet
casino-fraud-detection-lightgbm
financial-aml-graph-gnn
healthcare-readmission-xgboost
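A naming convention is easier to keep when it is checked automatically. The sketch below is one possible validator, under stated assumptions: names are lowercase alphanumeric segments joined by hyphens, the first segment is the domain, the last is the algorithm, and everything in between is the task. `MODEL_NAME_RE`, `is_valid_model_name`, and `split_model_name` are hypothetical names.

```python
import re

# Assumed rule: at least three lowercase alphanumeric segments joined by hyphens.
MODEL_NAME_RE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+){2,}$")


def is_valid_model_name(name: str) -> bool:
    """Check the overall shape of a {domain}-{task}-{algo} name."""
    return MODEL_NAME_RE.fullmatch(name) is not None


def split_model_name(name: str) -> dict:
    """Split a name into domain, task, and algo.

    The first segment is treated as the domain and the last as the algo
    (an assumption), so the task may span several hyphenated words.
    """
    if not is_valid_model_name(name):
        raise ValueError(f"invalid model name: {name!r}")
    parts = name.split("-")
    return {"domain": parts[0], "task": "-".join(parts[1:-1]), "algo": parts[-1]}
```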

Logging a Run (canonical pattern)¶

import os

import mlflow
from mlflow.models.signature import infer_signature

mlflow.set_experiment("/Shared/casino-slot-revenue-forecast")

with mlflow.start_run() as run:
    mlflow.log_params({
        "algo": "prophet",
        "lookback_days": 90,
        "seasonality_mode": "multiplicative",
    })

    model = train_prophet(...)

    metrics = evaluate(model, holdout_df)
    mlflow.log_metrics({
        "rmse": metrics["rmse"],
        "mape": metrics["mape"],
        "smape": metrics["smape"],
    })

    # Capture data lineage explicitly
    mlflow.log_param("training_data_table", "lh_gold.fact_daily_slot_revenue")
    mlflow.log_param("training_data_version", spark.sql(
        "DESCRIBE HISTORY lh_gold.fact_daily_slot_revenue LIMIT 1"
    ).collect()[0]["version"])

    # Capture code SHA for reproducibility
    mlflow.log_param("git_sha", os.environ.get("GIT_SHA", "unknown"))

    # holdout_df is a pandas DataFrame here, so drop the label by column name
    signature = infer_signature(holdout_df.drop(columns="revenue"), holdout_df["revenue"])
    mlflow.prophet.log_model(
        model,
        "model",
        signature=signature,
        registered_model_name="casino-slot-revenue-forecast-prophet",
    )

Promoting to Production¶

Use the MLflow client; never click-promote in the UI for production models.

from mlflow.tracking import MlflowClient
client = MlflowClient()

client.transition_model_version_stage(
    name="casino-slot-revenue-forecast-prophet",
    version=42,
    stage="Production",
    archive_existing_versions=True,  # auto-archive old prod
)

Promotion must be gated by Validation Gates (CI check before merge). Manual promotion is only allowed for hotfixes via the tenant migration runbook.

🧪 Experiment Tracking¶

Discipline Rules¶

One experiment per (domain, task) — not per developer, not per branch
Every run captures: data version, code SHA, env (Spark runtime, lib versions), params, metrics, artifacts (model + plots)
Tag runs with: branch, PR number, author, intent (exploratory, baseline, production-candidate)
Parent/child runs for hyperparameter sweeps so the parent shows the search space and the children show individual trials
Never delete runs — archive, don't delete; runs are evidence for incidents and audits

Tagging Pattern¶

import os

import mlflow

mlflow.set_tags({
    "branch": os.environ.get("GIT_BRANCH", "unknown"),
    "pr_number": os.environ.get("GITHUB_PR_NUMBER", "unknown"),
    "author": os.environ.get("GITHUB_ACTOR", "unknown"),
    "intent": "production-candidate",  # or "exploratory" / "baseline" / "rollback-test"
    "domain": "casino",
    "compliance_review": "approved-2026-04-15",  # if model touches regulated data
})
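The tagging rules above are only useful if runs that miss them are caught. The following sketch shows one way to check a tag dictionary before a run is accepted as a production candidate. `REQUIRED_TAGS`, `ALLOWED_INTENTS`, and `tag_problems` are hypothetical names, and the required set is taken from the discipline rules (branch, PR number, author, intent).

```python
REQUIRED_TAGS = ("branch", "pr_number", "author", "intent")
ALLOWED_INTENTS = {"exploratory", "baseline", "production-candidate", "rollback-test"}


def tag_problems(tags: dict) -> list:
    """Return human-readable problems; an empty list means the tags pass."""
    problems = [f"missing tag: {key}" for key in REQUIRED_TAGS if not tags.get(key)]
    intent = tags.get("intent")
    if intent and intent not in ALLOWED_INTENTS:
        problems.append(f"unknown intent: {intent}")
    return problems
```

A CI step could fail the build whenever `tag_problems` returns a non-empty list for a run tagged `production-candidate`.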

🏋️ Training Pipelines¶

Fabric supports three training patterns, chosen by use-case complexity:

Pattern 1: Notebook-Driven (development & light prod)¶

Best for: data scientists prototyping, batch retraining < 1 hour, single-GPU tasks.

# Fabric Data Pipeline
- name: train-slot-revenue-forecast
  type: TridentNotebook
  notebook: 02_ml_revenue_forecast_train
  parameters:
    training_window_days: 365
    target_table: lh_gold.fact_daily_slot_revenue
  trigger: schedule(0 2 * * *)  # daily 2am

Pattern 2: Spark Job Definition (heavy distributed training)¶

Best for: large datasets > 100M rows, distributed training, custom Spark configs.

# spark_job_definition.json
{
    "executableFile": "train.py",
    "defaultLakehouse": "lh_silver",
    "command": [
        "spark-submit",
        "--conf", "spark.executor.instances=20",
        "--conf", "spark.executor.memory=32g",
        "--py-files", "src.zip",
        "train.py",
        "--training-window", "365",
        "--target", "lh_gold.fact_daily_revenue"
    ]
}

Pattern 3: AutoML Experiment (automated baseline)¶

Best for: tabular tasks, fast time-to-baseline, when human selection adds little.

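from flaml import AutoML  # Fabric's AutoML experience builds on FLAML

automl = AutoML()
automl.fit(
    dataframe=df_train,           # pandas DataFrame that includes the label column
    label="revenue",
    task="regression",
    metric="r2",
    eval_method="cv",
    n_splits=5,                   # 5-fold cross-validation
    time_budget=2 * 60 * 60,      # seconds (2 hours)
    n_concurrent_trials=4,
    use_spark=True,               # run the parallel trials on the Spark pool
)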

Pattern 3 (AutoML) is great for the first model in a domain. Move to Pattern 1 or 2 once you understand the problem space and need custom features / custom loss / custom evaluation.

🚦 Model Validation Gates¶

Every model promotion to Production must pass these gates automatically (in CI, not manually):

Gate 1 — Performance Threshold¶

def gate_performance(metrics: dict, baseline_metrics: dict) -> bool:
    """New model must reach at least 95% of baseline AUC and clear the absolute floor."""
    if metrics["auc"] < 0.75:  # absolute floor
        return False
    if metrics["auc"] < 0.95 * baseline_metrics["auc"]:  # relative floor
        return False
    return True

Gate 2 — Holdout Stability¶

Run inference on a fixed, never-touched holdout set and compare the predictions with the expected predictions stored alongside it. This catches data leakage.

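def gate_holdout(model, holdout_path: str, max_drift_pct: float = 5.0) -> bool:
    # The holdout is small and fixed, so pull it into pandas for the arithmetic below
    holdout = spark.table(holdout_path).toPandas()
    features = holdout.drop(columns=["expected_pred"])
    pred_drift = (model.predict(features) - holdout["expected_pred"]).abs().mean()
    return pred_drift / holdout["expected_pred"].mean() < max_drift_pct / 100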

Gate 3 — Fairness (regulated domains only)¶

For lending, healthcare, hiring, criminal justice. See Responsible AI Framework.

def gate_fairness(model, holdout, protected_attrs: list[str]) -> bool:
    for attr in protected_attrs:
        dpr = demographic_parity_ratio(model, holdout, attr)
        if dpr < 0.8:  # 80% rule
            return False
    return True
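Gate 3 calls `demographic_parity_ratio` without showing it. The sketch below is one minimal way to compute the ratio from predictions and group labels. It is an illustration with a different signature from the call above (it takes predictions rather than a model), and it assumes binary 0/1 predictions.

```python
from collections import defaultdict


def demographic_parity_ratio(y_pred, sensitive) -> float:
    """Lowest group selection rate divided by the highest group selection rate.

    y_pred: iterable of 0/1 predictions; sensitive: group label per prediction.
    """
    totals, positives = defaultdict(int), defaultdict(int)
    for pred, group in zip(y_pred, sensitive):
        totals[group] += 1
        positives[group] += int(pred)
    if not totals:
        raise ValueError("no predictions supplied")
    rates = [positives[group] / totals[group] for group in totals]
    if max(rates) == 0:
        return 1.0  # nobody is selected in any group, which is treated as parity
    return min(rates) / max(rates)
```

With the 80% rule used in the gate, a ratio of 0.5 (one group selected half as often as another) would fail.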

Gate 4 — Latency (online endpoints only)¶

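import numpy as np

def gate_latency(model_uri: str, sample_df, p99_ms_target: int = 100) -> bool:
    # time_inference is a project helper (not shown), assumed to return milliseconds per call
    n_samples = min(1000, len(sample_df))  # avoid indexing past a short sample
    latencies = [time_inference(model_uri, sample_df.iloc[[i]]) for i in range(n_samples)]
    return np.percentile(latencies, 99) < p99_ms_target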
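The latency gate depends on a timing helper that is not shown. A minimal sketch of how such a measurement could work follows; `p99_latency_ms` is a hypothetical name, `predict` is any single-record callable, and the percentile uses the nearest-rank method, which can differ slightly from NumPy's default interpolation.

```python
import math
import time


def p99_latency_ms(predict, rows, warmup: int = 5) -> float:
    """Time `predict(row)` for every row and return the nearest-rank p99 in ms."""
    if not rows:
        raise ValueError("need at least one row to time")
    for row in rows[:warmup]:  # warm caches before measuring
        predict(row)
    latencies = []
    for row in rows:
        start = time.perf_counter()
        predict(row)
        latencies.append((time.perf_counter() - start) * 1000.0)
    latencies.sort()
    rank = math.ceil(0.99 * len(latencies))  # nearest-rank percentile
    return latencies[rank - 1]
```

Timing in-process calls like this excludes network and serialization overhead, so an endpoint gate would normally time the full HTTP round trip instead.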

Gate 5 — Calibration¶

Probabilities should be calibrated; predicted 70% should mean ~70% positive in reality.

# expected_calibration_error is assumed to be a project helper; it is not imported from scikit-learn here
def gate_calibration(y_true, y_pred_proba, max_ece: float = 0.1) -> bool:
    ece = expected_calibration_error(y_true, y_pred_proba, n_bins=10)
    return ece < max_ece
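The calibration gate calls `expected_calibration_error` without defining it. One common formulation, sketched here with equal-width probability bins, is the sample-weighted average gap between the observed positive rate and the mean predicted probability in each bin. This is an illustrative implementation, not necessarily the one the project uses.

```python
def expected_calibration_error(y_true, y_pred_proba, n_bins: int = 10) -> float:
    """Weighted mean of |observed positive rate - mean predicted probability|
    over equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for label, prob in zip(y_true, y_pred_proba):
        index = min(int(prob * n_bins), n_bins - 1)  # prob == 1.0 lands in the last bin
        bins[index].append((label, prob))
    total = sum(len(members) for members in bins)
    if total == 0:
        raise ValueError("no predictions supplied")
    ece = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(label for label, _ in members) / len(members)
        predicted = sum(prob for _, prob in members) / len(members)
        ece += (len(members) / total) * abs(observed - predicted)
    return ece
```

A model that predicts 0.75 for four records of which three are positive has zero error under this definition, which matches the "predicted 70% should mean ~70% positive" intuition.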

CI Wiring (`.github/workflows/ml-promotion.yml`)¶

name: ML Model Promotion Gate
on:
  pull_request:
    paths: ['notebooks/ml/**', 'src/ml/**']
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pytest tests/ml/test_validation_gates.py
      - run: python scripts/run_holdout_gate.py
      - run: python scripts/run_fairness_gate.py
        if: contains(github.event.pull_request.labels.*.name, 'regulated-domain')
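The workflow above runs each gate as a separate script, and the scripts themselves are not shown. As a rough sketch of the shape such a script could take, the helper below runs a set of gates, prints a result per gate, and returns an exit code that CI can act on. `run_gates` is a hypothetical name, and each gate is assumed to be a zero-argument callable returning a truthy value on success.

```python
def run_gates(gates: dict) -> int:
    """Run each zero-argument gate and return a process exit code (0 = all passed)."""
    failed = []
    for name, gate in gates.items():
        try:
            passed = bool(gate())
        except Exception as exc:  # a crashing gate counts as a failure
            print(f"[ERROR] {name}: {exc}")
            passed = False
        print(f"[{'PASS' if passed else 'FAIL'}] {name}")
        if not passed:
            failed.append(name)
    return 1 if failed else 0
```

A gate script would typically end with `sys.exit(run_gates({...}))` so that any failing gate blocks the pull request.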

🚢 Deployment Patterns¶

Pattern A: Batch Inference (most common)¶

# notebooks/ml/batch_score.py
import mlflow

model_uri = "models:/casino-slot-revenue-forecast-prophet/Production"
model = mlflow.prophet.load_model(model_uri)

unscored = spark.table("lh_silver.daily_slot_aggregates").filter("predicted_revenue IS NULL")
predictions = model.predict(unscored.toPandas())
spark.createDataFrame(predictions).write.mode("append").saveAsTable("lh_gold.slot_revenue_forecasts")

Trigger: Fabric Pipeline scheduled nightly. SLA: complete within 2-hour window.

Pattern B: Online Endpoint (low-latency apps)¶

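# Deploy via REST API (or Fabric Portal). ML Model Endpoints are in preview, so treat the route and
# payload fields below as illustrative and confirm them against the current Fabric REST API reference.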
import requests

response = requests.post(
    f"https://api.fabric.microsoft.com/v1/workspaces/{ws_id}/mlmodels/{model_id}/endpoints",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "name": "slot-revenue-prod",
        "modelVersion": "42",
        "instanceType": "Standard_DS3_v2",
        "minInstances": 2,  # for HA
        "maxInstances": 10,  # autoscale
        "trafficSplit": {"42": 100},
    },
    timeout=30,
)
response.raise_for_status()  # fail loudly if the deployment request is rejected

Use online endpoints when:

App needs single-record latency < 200ms
Personalization or real-time decisioning
API consumed by external systems

Pattern C: Stream Inference (real-time)¶

# Eventstream → notebook activity → predict → Eventhouse
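import mlflow
import numpy as np
from pyspark.sql.functions import struct, udf

model_uri = "models:/casino-fraud-detection-lightgbm/Production"
model = mlflow.lightgbm.load_model(model_uri)

@udf("double")
def score(features):
    # features arrives as a Row; the model expects a 2-D array (a single row here)
    return float(model.predict(np.array([list(features)]))[0])

stream = (spark.readStream.format("eventhubs").options(**eh_conf).load()
    .withColumn("fraud_score", score(struct("amount", "merchant_cat", ...)))
    .writeStream.format("delta").outputMode("append")
    # checkpoint_path is a placeholder; needed unless a default checkpoint location is configured
    .option("checkpointLocation", checkpoint_path)
    .toTable("lh_gold.fraud_scores"))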

🐤 Canary, A/B, and Champion-Challenger¶

Canary Rollout¶

Use ML Model Endpoint traffic splits:

v41 (current prod): 95% traffic
v42 (new):           5% traffic

Monitor key metrics for 24-72 hours. Promote to 50/50, then 100%/0%. Roll back immediately if:

Error rate increases
Latency p99 degrades
Business KPI moves in the wrong direction (calibration check)
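The rollback conditions above can be turned into an explicit check so that the decision is not left to judgment during an incident. The sketch below is one possible encoding. `CanaryWindow`, `should_roll_back`, and all tolerance defaults are assumptions for illustration, and the business KPI is assumed to be one where higher is better.

```python
from dataclasses import dataclass


@dataclass
class CanaryWindow:
    error_rate: float       # fraction of failed requests
    p99_latency_ms: float
    kpi: float              # business KPI where higher is better (assumed)


def should_roll_back(baseline: CanaryWindow, canary: CanaryWindow,
                     error_tol: float = 0.0, latency_tol: float = 0.10,
                     kpi_tol: float = 0.02) -> bool:
    """Roll back if the canary is worse than the baseline beyond the tolerances."""
    if canary.error_rate > baseline.error_rate + error_tol:
        return True
    if canary.p99_latency_ms > baseline.p99_latency_ms * (1 + latency_tol):
        return True
    if canary.kpi < baseline.kpi * (1 - kpi_tol):
        return True
    return False
```

With a baseline of 1% errors, 80 ms p99, and a KPI of 100, a canary at 85 ms stays live under the default 10% latency tolerance, while a canary at 95 ms is rolled back.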

A/B Testing¶

Two models, both production-quality. Split deterministically by user/transaction ID hash. Run 2+ weeks. Statistical test on primary metric. Pre-register the test plan before running — no fishing for significance.
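A deterministic hash split means the same user or transaction always sees the same model, with no assignment table to store. The sketch below shows one way to do it with SHA-256. `assign_variant` and the idea of salting with an experiment name are illustrative choices, not something the source prescribes.

```python
import hashlib


def assign_variant(unit_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user or transaction ID to 'A' or 'B'.

    Salting with the experiment name keeps assignments independent across tests.
    """
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "B" if bucket < treatment_share else "A"
```

Python's built-in `hash()` is avoided here because it is randomized per process for strings, which would break determinism across scoring jobs.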

Champion-Challenger¶

Continuous evaluation: production = champion. New candidate = challenger. Both score same data. Compare blind. When challenger beats champion on agreed metrics for N consecutive evaluation windows, promote challenger to champion.

# Daily challenger evaluation
import mlflow

champion = mlflow.pyfunc.load_model(f"models:/{name}/Production")
challenger = mlflow.pyfunc.load_model(f"models:/{name}/Staging")

for _ in range(num_days):
    today = ...  # today's actual data
    champ_metrics = evaluate(champion, today)
    chall_metrics = evaluate(challenger, today)
    log_to_table("lh_gold.ml_champion_challenger", champ_metrics, chall_metrics)

# Promotion check (e.g., 14-day rolling window, p<0.05)
if challenger_wins_significantly(window_days=14):
    promote_to_production(challenger)
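The "N consecutive evaluation windows" rule can be checked with a few lines once per-window scores are stored. The sketch below covers only that streak rule, not the significance test mentioned in the comment above. `challenger_wins_streak` and `min_margin` are hypothetical, and higher scores are assumed to be better.

```python
def challenger_wins_streak(champ_scores, chall_scores, n_windows: int,
                           min_margin: float = 0.0) -> bool:
    """True if the challenger beat the champion in each of the last n windows.

    Scores are ordered oldest to newest, and higher is assumed to be better.
    """
    if len(champ_scores) != len(chall_scores):
        raise ValueError("score series must have equal length")
    if n_windows <= 0 or len(champ_scores) < n_windows:
        return False
    recent = zip(champ_scores[-n_windows:], chall_scores[-n_windows:])
    return all(chall > champ + min_margin for champ, chall in recent)
```

Requiring a streak rather than a single win guards against promoting on one lucky window, and a non-zero `min_margin` filters out wins that are within noise.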

📈 Production Monitoring¶

See Model Monitoring & Drift Detection for full coverage. Briefly, every production model emits:

Signal	Type	Alert Threshold
Prediction distribution	Statistical drift (PSI, KS)	PSI > 0.2
Input feature drift	Per-feature drift	Top-3 features PSI > 0.25
Performance	Realized vs predicted	AUC degradation > 5%
Latency	p99 inference time	> target SLO
Error rate	5xx + timeout	> 1% sustained 5 min
Volume	Predictions per hour	< 50% of baseline
Calibration	Reliability diagram	ECE > 0.1

Wire these signals to Action Groups via the observability stack.
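Several thresholds in the table are expressed as PSI (Population Stability Index). For readers who have not computed it before, the sketch below shows one common formulation: bin the baseline and current samples the same way, then sum `(actual - expected) * ln(actual / expected)` over the bins. The equal-width binning, the clamping of out-of-range values, and the `eps` floor are implementation choices assumed here.

```python
import math


def psi(expected, actual, n_bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a baseline and a current sample.

    Bins are equal-width over the baseline's range; out-of-range values are
    clamped into the edge bins. `eps` avoids log(0) for empty bins.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant baseline

    def shares(values):
        counts = [0] * n_bins
        for value in values:
            index = int((value - lo) / width)
            counts[max(0, min(index, n_bins - 1))] += 1
        return [max(count / len(values), eps) for count in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Identical samples give a PSI of zero, and a sample shifted into the upper half of the baseline range lands far above the 0.2 alert threshold.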

🔁 Retraining Triggers¶

Trigger Type	Detection	Action
Schedule	Cron	Retrain weekly/monthly regardless of drift
Drift	Workspace Monitoring KQL	Retrain when PSI > 0.2 sustained
Performance	Realized metric below threshold	Retrain when AUC drops 5%
Volume	New labeled data > N	Retrain when training set grows ≥ 10%
Concept change	Business event flag	Manual: regulation change, product change, market shift
Anomaly	Outlier rate spike	Investigate first, retrain only after root cause found
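The trigger table can be collapsed into a single decision function that an orchestrator evaluates on a schedule. The sketch below is one hypothetical encoding. `ModelHealth` and `retrain_reason` are illustrative names, "AUC drops 5%" is read as a relative drop against the AUC at promotion time, the "sustained" condition on PSI is not modeled, and the 30-day maximum age is an assumed default.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelHealth:
    psi: float                  # prediction-distribution PSI
    auc: float                  # realized AUC on recent labeled data
    baseline_auc: float         # AUC at promotion time
    new_labeled_rows: int
    training_rows: int
    days_since_training: int
    outlier_spike: bool = False


def retrain_reason(h: ModelHealth, max_age_days: int = 30) -> Optional[str]:
    """Return the first trigger that fires, or None if no retrain is needed."""
    if h.outlier_spike:
        return None  # per the table: investigate first, do not auto-retrain
    if h.psi > 0.2:
        return "drift"
    if h.auc < 0.95 * h.baseline_auc:
        return "performance"
    if h.new_labeled_rows >= 0.10 * h.training_rows:
        return "volume"
    if h.days_since_training >= max_age_days:
        return "schedule"
    return None
```

Returning the reason rather than a bare boolean lets the retraining run be tagged with why it was started, which helps later audits.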

Drift-Driven Retrain (KQL → Action Group → Pipeline)¶

// Workspace Monitoring KQL
ModelDriftMetrics
| where ModelName == "casino-slot-revenue-forecast-prophet"
| where Window == "7d"
| where PSI > 0.2
| top 1 by TimeGenerated

Wire as Azure Monitor scheduled query alert → Action Group → Logic App → Fabric Pipeline trigger (/v1/workspaces/{ws}/items/{pipeline}/jobs/instances?jobType=Pipeline).

💰 Cost Attribution & FinOps¶

Cost Surfaces¶

Surface	Driver	Mitigation
Spark training	Executor count × duration	Right-size; use Job pools, not interactive sessions
AutoML	trial count × duration × parallelism	Cap trial budget, use early stopping
Online endpoint	Instance count × hours + per-1000 inference	Autoscale min 1 / max N; batch when possible
MLflow artifacts	OneLake storage	Lifecycle policy: archive runs > 90 days
Drift monitoring	Eventhouse + KQL	Sample, don't score every record
LLM API calls	Token count	See LLM Cost Tracking

Cost Attribution¶

Tag every job with cost_center, domain, model, intent. Roll up via Workspace Monitoring + Capacity Metrics → Power BI cost dashboard.

spark.conf.set("spark.fabric.cost_center", "casino-data-science")
spark.conf.set("spark.fabric.model", "casino-slot-revenue-forecast-prophet")
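Once jobs carry a `cost_center` tag, the roll-up itself is a simple aggregation. The sketch below approximates Spark training cost as executors x hours x rate, following the driver listed in the cost surfaces table. `roll_up_cost` and the job dictionary keys are hypothetical, and the rate is a stand-in for whatever unit cost your capacity reporting yields.

```python
from collections import defaultdict


def roll_up_cost(jobs, rate_per_executor_hour: float) -> dict:
    """Sum estimated Spark cost per cost_center.

    Each job is a dict with cost_center, executors, and duration_hours;
    cost is approximated as executors x hours x rate (the rate is a placeholder).
    """
    totals = defaultdict(float)
    for job in jobs:
        cost = job["executors"] * job["duration_hours"] * rate_per_executor_hour
        totals[job["cost_center"]] += cost
    return dict(totals)
```

The same grouping could be keyed on `model` or `intent` instead to see, for example, how much is spent on exploratory runs versus production retraining.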

🎰 Casino Implementation¶

Model	Use Case	Pattern	Compliance Notes
Slot Revenue Forecast	Daily revenue projection per machine	Batch (Pattern A)	None — aggregated only
Player Churn	Identify at-risk players	Batch	PII handling: features hashed
Fraud Detection	Real-time AML/structuring	Stream (Pattern C)	BSA/SAR — see compliance layer
Slot Anomaly	Hardware fault prediction	Stream	None — operational only
Marketing Lift	Promotion uplift modeling	Batch	Opt-in tracking required

See notebooks/ml/01_ml_player_churn_prediction.py and notebooks/ml/02_ml_fraud_detection.py.

🏛️ Federal Implementation¶

Model	Agency	Use Case	Compliance
Crop Yield Forecast	USDA	Regional yield prediction	None — public data
Loan Default Risk	SBA	Underwriting	ECOA, fairness gate required
Storm Severity	NOAA	Severity classification	None — public safety
Air Quality Forecast	EPA	AQI prediction	None — public data
Earthquake Magnitude	DOI/USGS	Real-time magnitude	None — public safety
Antitrust Risk Score	DOJ	Merger review prioritization	Sensitive — restricted access

For regulated federal use cases (DOJ, lending), the Responsible AI Framework governance gates are mandatory.

🚫 Anti-Patterns¶

Anti-Pattern	Why It Hurts	What to Do Instead
Notebook-only model	Not reproducible; no version, no rollback	Register every prod model in MLflow
Manual UI promotion to Production	No audit trail, no validation gates	CI-driven promotion via MLflow client
Same model serves dev + prod traffic	Bad deploy = customer impact	Separate workspaces; canary rollout
No holdout set	Reported metrics overfit to dev/test split	Permanent, never-touched holdout in OneLake
Drift detection without retraining trigger	Alert fatigue, no action	Wire alerts to retraining pipeline
Single-version pin in client code	Forces redeploy for every model update	Use stage references (`models:/{name}/Production`)
Logging metrics but not data version	Can't reproduce a bad run	Always log Delta version of training data
No fairness check on regulated models	Legal/compliance liability	Mandatory gate for lending, healthcare, hiring
AutoML in production without review	Hidden complexity, hidden failure modes	Use AutoML for baseline; productionize as Pattern 1 or 2
No retraining schedule for stable models	Quiet decay, late detection	Monthly retrain even without drift signal

📋 Production Readiness Checklist¶

Before promoting any model to Production stage:

📚 References¶

Microsoft Fabric Documentation¶

Industry Standards¶

Google SRE Book — SLO/SLI principles applied to ML
Microsoft Responsible AI Standard
Martin Fowler — Continuous Delivery for Machine Learning (CD4ML)
