
🚀 MLOps for Fabric Production

End-to-End ML Lifecycle Management on Microsoft Fabric (Anchor Doc for Wave 2)



Last Updated: 2026-04-27 | Version: 1.0.0 | Anchor for: Wave 2 ML/AI doc set




🎯 Overview

MLOps is the discipline of running machine-learning workloads in production with the same rigor we apply to software: version control, automated testing, CI/CD, observability, incident response, and SLOs. Most Fabric customers ship demo-grade ML (a notebook that produces a model) and stop there. This document covers the production backbone every Fabric ML team needs.

What "Production-Grade ML" Means in Fabric

| Aspect | Demo-Grade | Production-Grade |
|---|---|---|
| Versioning | Notebook in workspace | Code in Git, models in the MLflow registry, data in Delta with time travel |
| Reproducibility | "Works on my notebook" | Every model captures code SHA, data version, environment, hyperparameters |
| Validation | Visual inspection of metrics | Automated gates: holdout AUC, drift, fairness, latency, calibration |
| Deployment | Manual export → batch score | CI/CD via fabric-cicd, canary rollout, automated rollback |
| Monitoring | None or "looks fine" | Drift, performance decay, business KPIs, alerts wired to runbooks |
| Retraining | "When someone notices" | Automated triggers: schedule, drift, performance, data volume |
| Incident response | Ad-hoc panic | Runbooks, on-call rotation, postmortems |

Where Fabric Fits

Fabric provides the integrated ML platform: MLflow tracking, ML Model items, ML Model Endpoints (preview), AutoML, Spark for distributed training, OneLake for feature storage, and Workspace Monitoring for telemetry. This guide ties those pieces together into a coherent production workflow.

๐Ÿ“ Scope: This is the anchor doc for Phase 14 Wave 2. Sub-topics get their own deep-dive docs: model monitoring & drift detection, feature store on OneLake, responsible AI, LLM cost tracking, RAG patterns, prompt engineering, evaluation harnesses.


๐Ÿ—๏ธ MLOps Reference Architecture

flowchart LR
    subgraph DataLayer["📊 Data Layer (OneLake)"]
        Bronze[(🥉 Bronze)]
        Silver[(🥈 Silver)]
        Gold[(🥇 Gold)]
        FS[(🏪 Feature Store)]
    end

    subgraph DevPlane["🧪 Development Plane"]
        NB[📓 Notebook<br/>or SJD]
        Exp[🧪 MLflow<br/>Experiment]
        Reg[🗂️ MLflow<br/>Model Registry]
    end

    subgraph CICD["🔄 CI/CD"]
        Git[(📦 Git)]
        FCICD[fabric-cicd]
        Tests[🧪 Validation<br/>Gates]
    end

    subgraph ServePlane["🚢 Serving Plane"]
        Batch[📦 Batch Inference<br/>Pipeline]
        Online[⚡ ML Model<br/>Endpoint]
        Stream[🌊 Eventstream<br/>Scoring]
    end

    subgraph ObsPlane["📈 Observability"]
        WM[Workspace<br/>Monitoring]
        Drift[Drift<br/>Detector]
        Alerts[Action<br/>Groups]
    end

    Silver --> NB
    Gold --> NB
    FS --> NB
    NB --> Exp
    Exp --> Reg
    Git --> FCICD
    FCICD --> Tests
    Tests -->|approved| Reg
    Reg --> Batch
    Reg --> Online
    Reg --> Stream
    Batch --> Gold
    Online --> WM
    Stream --> WM
    WM --> Drift
    Drift --> Alerts
    Alerts -.->|retrain trigger| NB

Component Map

| Component | Fabric Item | Purpose |
|---|---|---|
| Feature Store | Lakehouse table or shortcut | Versioned features with point-in-time correctness |
| Experiment Tracking | MLflow (built-in) | Every training run logged |
| Model Registry | ML Model item | Versioned, governed model artifacts |
| Training Job | Spark Job Definition or Notebook | Reproducible training runs |
| Batch Inference | Data Pipeline + Notebook | Scheduled scoring of large batches |
| Online Inference | ML Model Endpoint (Preview) | RESTful serving for low-latency apps |
| Stream Inference | Eventstream + Notebook activity | Real-time scoring on event streams |
| Drift Monitoring | Workspace Monitoring + Eventhouse | Statistical + performance drift |
| Alerting | Action Groups + Data Activator | Fan-out to PagerDuty / Teams / runbook |
| CI/CD | GitHub Actions + fabric-cicd | Promote model + code dev → staging → prod |

๐Ÿ—‚๏ธ Model Registry (MLflow in Fabric)

Fabric's MLflow tracking server is workspace-scoped. The ML Model item wraps an MLflow registered model with Fabric-native governance, lineage, and access control.

Registry Stages

   None ──▶ Staging ──▶ Production ──▶ Archived
              ▲             │
              └─────────────┘
              (rollback path)

Use stage transitions, not version-pinning, for rollouts. Consumers reference models:/{name}/Production rather than a hardcoded version.
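For example, a consuming notebook resolves whichever version currently holds the Production stage at load time (minimal sketch; the generic pyfunc flavor is shown, swap in the flavor-specific loader where needed):

import mlflow

# Stage reference: promoting a new version changes what this URI resolves to,
# with no change on the consumer side.
model = mlflow.pyfunc.load_model("models:/casino-slot-revenue-forecast-prophet/Production")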

Naming Conventions

{domain}-{task}-{algo}                          ← model name
casino-slot-revenue-forecast-prophet
casino-fraud-detection-lightgbm
financial-aml-graph-gnn
healthcare-readmission-xgboost

Logging a Run (canonical pattern)

import os

import mlflow
from mlflow.models.signature import infer_signature

mlflow.set_experiment("/Shared/casino-slot-revenue-forecast")

with mlflow.start_run() as run:
    mlflow.log_params({
        "algo": "prophet",
        "lookback_days": 90,
        "seasonality_mode": "multiplicative",
    })

    model = train_prophet(...)

    metrics = evaluate(model, holdout_df)
    mlflow.log_metrics({
        "rmse": metrics["rmse"],
        "mape": metrics["mape"],
        "smape": metrics["smape"],
    })

    # Capture data lineage explicitly
    mlflow.log_param("training_data_table", "lh_gold.fact_daily_slot_revenue")
    mlflow.log_param("training_data_version", spark.sql(
        "DESCRIBE HISTORY lh_gold.fact_daily_slot_revenue LIMIT 1"
    ).collect()[0]["version"])

    # Capture code SHA for reproducibility
    mlflow.log_param("git_sha", os.environ.get("GIT_SHA", "unknown"))

    signature = infer_signature(holdout_df.drop(columns=["revenue"]), holdout_df["revenue"])
    mlflow.prophet.log_model(
        model,
        "model",
        signature=signature,
        registered_model_name="casino-slot-revenue-forecast-prophet",
    )

Promoting to Production

Use the MLflow client; never click-promote in the UI for production models.

from mlflow.tracking import MlflowClient
client = MlflowClient()

client.transition_model_version_stage(
    name="casino-slot-revenue-forecast-prophet",
    version=42,
    stage="Production",
    archive_existing_versions=True,  # auto-archive old prod
)

Promotion must be gated by Validation Gates (CI check before merge). Manual promotion is only allowed for hotfixes via the tenant migration runbook.


🧪 Experiment Tracking

Discipline Rules

  1. One experiment per (domain, task), not per developer and not per branch
  2. Every run captures: data version, code SHA, environment (Spark runtime, library versions), params, metrics, artifacts (model + plots)
  3. Tag runs with: branch, PR number, author, intent (exploratory, baseline, production-candidate)
  4. Parent/child runs for hyperparameter sweeps, so the parent shows the search space and the children show individual trials (see the sketch after this list)
  5. Never delete runs; archive instead, because runs are evidence for incidents and audits
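A minimal sketch of rule 4 using MLflow nested runs (the parameter grid is illustrative):

import itertools
import mlflow

grid = {"changepoint_prior_scale": [0.01, 0.1], "seasonality_prior_scale": [1.0, 10.0]}

with mlflow.start_run(run_name="prophet-sweep"):
    mlflow.log_param("search_space", str(grid))
    for cps, sps in itertools.product(*grid.values()):
        with mlflow.start_run(run_name=f"cps={cps}-sps={sps}", nested=True):
            mlflow.log_params({"changepoint_prior_scale": cps, "seasonality_prior_scale": sps})
            # train and evaluate one trial here, then log its metrics, e.g.
            # mlflow.log_metric("rmse", trial_rmse)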

Tagging Pattern

mlflow.set_tags({
    "branch": os.environ.get("GIT_BRANCH"),
    "pr_number": os.environ.get("GITHUB_PR_NUMBER"),
    "author": os.environ.get("GITHUB_ACTOR"),
    "intent": "production-candidate",  # or "exploratory" / "baseline" / "rollback-test"
    "domain": "casino",
    "compliance_review": "approved-2026-04-15",  # if model touches regulated data
})

๐Ÿ‹๏ธ Training Pipelines

Three patterns supported in Fabric, by use-case complexity:

Pattern 1: Notebook-Driven (development & light prod)

Best for: data scientists prototyping, batch retraining < 1 hour, single GPU tasks.

# Fabric Data Pipeline
- name: train-slot-revenue-forecast
  type: TridentNotebook
  notebook: 02_ml_revenue_forecast_train
  parameters:
    training_window_days: 365
    target_table: lh_gold.fact_daily_slot_revenue
  trigger: schedule(0 2 * * *)  # daily 2am
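On the notebook side, the pipeline parameters above land in the notebook's parameter cell (sketch; the variable names must match the parameter keys in the pipeline activity, and the cell must be toggled as a parameter cell in the Fabric notebook UI):

# Parameter cell: the defaults below are overridden by the pipeline at run time
training_window_days = 365
target_table = "lh_gold.fact_daily_slot_revenue"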

Pattern 2: Spark Job Definition (heavy distributed training)

Best for: large datasets > 100M rows, distributed training, custom Spark configs.

# spark_job_definition.json
{
    "executableFile": "train.py",
    "defaultLakehouse": "lh_silver",
    "command": [
        "spark-submit",
        "--conf", "spark.executor.instances=20",
        "--conf", "spark.executor.memory=32g",
        "--py-files", "src.zip",
        "train.py",
        "--training-window", "365",
        "--target", "lh_gold.fact_daily_revenue"
    ]
}

Pattern 3: AutoML Experiment (automated baseline)

Best for: tabular tasks, fast time-to-baseline, when human selection adds little.

# Fabric AutoML is built on FLAML
from flaml import AutoML

pdf = df_train.toPandas()              # FLAML trains on pandas data

automl = AutoML()
automl.fit(
    X_train=pdf.drop(columns=["revenue"]),
    y_train=pdf["revenue"],
    task="regression",
    metric="r2",
    eval_method="cv",
    n_splits=5,
    time_budget=7200,                  # seconds (2-hour trial budget)
)

Pattern 3 (AutoML) is great for the first model in a domain. Move to Pattern 1 or 2 once you understand the problem space and need custom features / custom loss / custom evaluation.


🚦 Model Validation Gates

Every model promotion to Production must pass these gates automatically (in CI, not manually):

Gate 1 – Performance Threshold

def gate_performance(metrics: dict, baseline_metrics: dict) -> bool:
    """New model must be โ‰ฅ 95% of baseline performance + within absolute floor."""
    if metrics["auc"] < 0.75:  # absolute floor
        return False
    if metrics["auc"] < 0.95 * baseline_metrics["auc"]:  # relative floor
        return False
    return True

Gate 2 – Holdout Stability

Run inference on a fixed, never-touched holdout set and compare predictions against the stored expected values. Catches data leakage and silent regressions between versions.

def gate_holdout(model, holdout_table: str, max_drift_pct: float = 5.0) -> bool:
    holdout = spark.table(holdout_table).toPandas()
    preds = model.predict(holdout.drop(columns=["expected_pred"]))
    pred_drift = (preds - holdout["expected_pred"]).abs().mean()
    return pred_drift / holdout["expected_pred"].mean() < max_drift_pct / 100

Gate 3 – Fairness (regulated domains only)

For lending, healthcare, hiring, criminal justice. See Responsible AI Framework.

def gate_fairness(model, holdout, protected_attrs: list[str]) -> bool:
    # demographic_parity_ratio: project helper (e.g., wrapping fairlearn.metrics.demographic_parity_ratio)
    for attr in protected_attrs:
        dpr = demographic_parity_ratio(model, holdout, attr)
        if dpr < 0.8:  # 80% rule
            return False
    return True

Gate 4 – Latency (online endpoints only)

import numpy as np

def gate_latency(model_uri: str, sample_df, p99_ms_target: int = 100) -> bool:
    # time_inference: project helper returning wall-clock milliseconds for one single-row predict
    latencies = [time_inference(model_uri, sample_df.iloc[[i]]) for i in range(min(1000, len(sample_df)))]
    return np.percentile(latencies, 99) < p99_ms_target

Gate 5 – Calibration

Probabilities should be calibrated; predicted 70% should mean ~70% positive in reality.

import numpy as np
from sklearn.calibration import calibration_curve

def gate_calibration(y_true, y_pred_proba, max_ece: float = 0.1) -> bool:
    prob_true, prob_pred = calibration_curve(y_true, y_pred_proba, n_bins=10)
    ece = np.abs(prob_true - prob_pred).mean()  # unweighted approximation of expected calibration error
    return ece < max_ece

CI Wiring (.github/workflows/ml-promotion.yml)

name: ML Model Promotion Gate
on:
  pull_request:
    paths: ['notebooks/ml/**', 'src/ml/**']
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pytest tests/ml/test_validation_gates.py
      - run: python scripts/run_holdout_gate.py
      - run: python scripts/run_fairness_gate.py
        if: contains(github.event.pull_request.labels.*.name, 'regulated-domain')
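The gates run on the PR; the promotion itself runs after merge to main. A minimal sketch of the script that merge job could call (the script path and tag key are illustrative; it assumes CI registered the validated candidate into Staging):

# scripts/promote_model.py
import os
from mlflow.tracking import MlflowClient

MODEL_NAME = "casino-slot-revenue-forecast-prophet"
client = MlflowClient()

# Latest version sitting in Staging is the validated candidate
candidate = client.get_latest_versions(MODEL_NAME, stages=["Staging"])[0]

client.transition_model_version_stage(
    name=MODEL_NAME,
    version=candidate.version,
    stage="Production",
    archive_existing_versions=True,
)
client.set_model_version_tag(MODEL_NAME, candidate.version, "promoted_by", os.environ.get("GITHUB_ACTOR", "ci"))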

🚢 Deployment Patterns

Pattern A: Batch Inference (most common)

# notebooks/ml/batch_score.py
model_uri = f"models:/casino-slot-revenue-forecast-prophet/Production"
model = mlflow.prophet.load_model(model_uri)

unscored = spark.table("lh_silver.daily_slot_aggregates").filter("predicted_revenue IS NULL")
predictions = model.predict(unscored.toPandas())
spark.createDataFrame(predictions).write.mode("append").saveAsTable("lh_gold.slot_revenue_forecasts")

Trigger: Fabric Pipeline scheduled nightly. SLA: complete within 2-hour window.

Pattern B: Online Endpoint (low-latency apps)

# Deploy via REST API (or Fabric Portal)
import requests

response = requests.post(
    f"https://api.fabric.microsoft.com/v1/workspaces/{ws_id}/mlmodels/{model_id}/endpoints",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "name": "slot-revenue-prod",
        "modelVersion": "42",
        "instanceType": "Standard_DS3_v2",
        "minInstances": 2,  # for HA
        "maxInstances": 10,  # autoscale
        "trafficSplit": {"42": 100},
    }
)

Use online endpoints when:

  • The app needs single-record latency < 200 ms
  • Personalization or real-time decisioning is required
  • The API is consumed by external systems

Pattern C: Stream Inference (real-time)

# Eventstream → notebook activity → predict → Eventhouse
import mlflow
from pyspark.sql.functions import struct, udf

model_uri = "models:/casino-fraud-detection-lightgbm/Production"
model = mlflow.lightgbm.load_model(model_uri)

@udf("double")
def score(features):
    # features arrives as a Row struct; LightGBM expects a 2-D array of feature values
    return float(model.predict([list(features)])[0])

stream = (spark.readStream.format("eventhubs").options(**eh_conf).load()
    .withColumn("fraud_score", score(struct("amount", "merchant_cat", ...)))
    .writeStream.format("delta").outputMode("append")
    .toTable("lh_gold.fraud_scores"))

๐Ÿค Canary, A/B, and Champion-Challenger

Canary Rollout

Use ML Model Endpoint traffic splits (request sketch after the rollback criteria below):

v41 (current prod): 95% traffic
v42 (new):           5% traffic

Monitor key metrics for 24-72 hours. Promote to 50/50, then 100/0. Roll back immediately if:

  • Error rate increases
  • Latency p99 degrades
  • A business KPI moves in the wrong direction (calibration check)

A/B Testing

Two models, both production-quality. Split deterministically by user/transaction ID hash (sketch below). Run 2+ weeks. Statistical test on the primary metric. Pre-register the test plan before running; no fishing for significance.
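A minimal sketch of deterministic assignment (function and variant names are illustrative):

import hashlib

def assign_variant(user_id: str, treatment_share: float = 0.5) -> str:
    # The same ID always lands in the same bucket, regardless of when it is scored
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return "B" if bucket < treatment_share * 10_000 else "A"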

Champion-Challenger

Continuous evaluation: production = champion. New candidate = challenger. Both score same data. Compare blind. When challenger beats champion on agreed metrics for N consecutive evaluation windows, promote challenger to champion.

# Daily challenger evaluation
import mlflow

champion = mlflow.pyfunc.load_model(f"models:/{name}/Production")
challenger = mlflow.pyfunc.load_model(f"models:/{name}/Staging")

for _ in range(num_days):
    today = ...  # today's actual data
    champ_metrics = evaluate(champion, today)
    chall_metrics = evaluate(challenger, today)
    log_to_table("lh_gold.ml_champion_challenger", champ_metrics, chall_metrics)

# Promotion check (e.g., 14-day rolling window, p<0.05)
if challenger_wins_significantly(window_days=14):
    promote_to_production(challenger)
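challenger_wins_significantly is left to the team; one hedged sketch is a one-sided Wilcoxon test over the daily metric pairs logged above (assumes lower RMSE is better and that the comparison table has eval_date, champion_rmse, and challenger_rmse columns):

from scipy.stats import wilcoxon

def challenger_wins_significantly(window_days: int = 14, alpha: float = 0.05) -> bool:
    df = (spark.table("lh_gold.ml_champion_challenger")
          .orderBy("eval_date", ascending=False)
          .limit(window_days)
          .toPandas())
    # One-sided: champion RMSE greater than challenger RMSE across the window
    _, p_value = wilcoxon(df["champion_rmse"], df["challenger_rmse"], alternative="greater")
    return p_value < alpha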

📈 Production Monitoring

See Model Monitoring & Drift Detection for full coverage. Briefly, every production model emits:

| Signal | Type | Alert Threshold |
|---|---|---|
| Prediction distribution | Statistical drift (PSI, KS) | PSI > 0.2 |
| Input feature drift | Per-feature drift | Top-3 features PSI > 0.25 |
| Performance | Realized vs predicted | AUC degradation > 5% |
| Latency | p99 inference time | > target SLO |
| Error rate | 5xx + timeouts | > 1% sustained 5 min |
| Volume | Predictions per hour | < 50% of baseline |
| Calibration | Reliability diagram | ECE > 0.1 |

Wire to Action Groups via observability stack.
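Since PSI is the workhorse threshold above, a minimal reference implementation (binning is a judgment call; deciles of the training-time distribution are used here):

import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, n_bins: int = 10) -> float:
    # Bin edges come from the reference (training-time) distribution
    edges = np.unique(np.quantile(expected, np.linspace(0, 1, n_bins + 1)))
    edges[0], edges[-1] = -np.inf, np.inf
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)  # avoid log(0) and divide-by-zero
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))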


๐Ÿ” Retraining Triggers

| Trigger Type | Detection | Action |
|---|---|---|
| Schedule | Cron | Retrain weekly/monthly regardless of drift |
| Drift | Workspace Monitoring KQL | Retrain when PSI > 0.2 sustained |
| Performance | Realized metric below threshold | Retrain when AUC drops 5% |
| Volume | New labeled data > N | Retrain when training set grows ≥ 10% |
| Concept change | Business event flag | Manual: regulation change, product change, market shift |
| Anomaly | Outlier rate spike | Investigate first, retrain only after root cause found |

Drift-Driven Retrain (KQL → Action Group → Pipeline)

// Workspace Monitoring KQL
ModelDriftMetrics
| where ModelName == "casino-slot-revenue-forecast-prophet"
| where Window == "7d"
| where PSI > 0.2
| top 1 by TimeGenerated

Wire as Azure Monitor scheduled query alert → Action Group → Logic App → Fabric Pipeline trigger (/v1/workspaces/{ws}/items/{pipeline}/jobs/instances?jobType=Pipeline).
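The last hop is a POST to the Fabric on-demand job API; a minimal sketch of what the Logic App (or a small function) sends (IDs are placeholders, the token needs execute permission on the pipeline, and the executionData parameter payload is illustrative):

import requests

requests.post(
    f"https://api.fabric.microsoft.com/v1/workspaces/{ws_id}/items/{pipeline_id}/jobs/instances?jobType=Pipeline",
    headers={"Authorization": f"Bearer {token}"},
    json={"executionData": {"parameters": {"trigger_reason": "psi_drift"}}},
)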


💰 Cost Attribution & FinOps

Cost Surfaces

| Surface | Driver | Mitigation |
|---|---|---|
| Spark training | Executor count × duration | Right-size; use job pools, not interactive sessions |
| AutoML | Trial count × duration × parallelism | Cap trial budget, use early stopping |
| Online endpoint | Instance count × hours + per-1,000 inferences | Autoscale min 1 / max N; batch when possible |
| MLflow artifacts | OneLake storage | Lifecycle policy: archive runs > 90 days |
| Drift monitoring | Eventhouse + KQL | Sample; don't score every record |
| LLM API calls | Token count | See LLM Cost Tracking |

Cost Attribution

Tag every job with cost_center, domain, model, intent. Roll up via Workspace Monitoring + Capacity Metrics → Power BI cost dashboard.

spark.conf.set("spark.fabric.cost_center", "casino-data-science")
spark.conf.set("spark.fabric.model", "casino-slot-revenue-forecast-prophet")

🎰 Casino Implementation

| Model | Use Case | Pattern | Compliance Notes |
|---|---|---|---|
| Slot Revenue Forecast | Daily revenue projection per machine | Batch (Pattern A) | None; aggregated only |
| Player Churn | Identify at-risk players | Batch | PII handling: features hashed |
| Fraud Detection | Real-time AML/structuring | Stream (Pattern C) | BSA/SAR; see compliance layer |
| Slot Anomaly | Hardware fault prediction | Stream | None; operational only |
| Marketing Lift | Promotion uplift modeling | Batch | Opt-in tracking required |

See notebooks/ml/01_ml_player_churn_prediction.py and notebooks/ml/02_ml_fraud_detection.py.


๐Ÿ›๏ธ Federal Implementation

| Model | Agency | Use Case | Compliance |
|---|---|---|---|
| Crop Yield Forecast | USDA | Regional yield prediction | None; public data |
| Loan Default Risk | SBA | Underwriting | ECOA; fairness gate required |
| Storm Severity | NOAA | Severity classification | None; public safety |
| Air Quality Forecast | EPA | AQI prediction | None; public data |
| Earthquake Magnitude | DOI/USGS | Real-time magnitude | None; public safety |
| Antitrust Risk Score | DOJ | Merger review prioritization | Sensitive; restricted access |

For regulated federal use cases (DOJ, lending), the Responsible AI Framework governance gates are mandatory.


🚫 Anti-Patterns

| Anti-Pattern | Why It Hurts | What to Do Instead |
|---|---|---|
| Notebook-only model | Not reproducible; no version, no rollback | Register every prod model in MLflow |
| Manual UI promotion to Production | No audit trail, no validation gates | CI-driven promotion via the MLflow client |
| Same model serves dev + prod traffic | Bad deploy = customer impact | Separate workspaces; canary rollout |
| No holdout set | Reported metrics overfit to the dev/test split | Permanent, never-touched holdout in OneLake |
| Drift detection without retraining trigger | Alert fatigue, no action | Wire alerts to the retraining pipeline |
| Single-version pin in client code | Forces a redeploy for every model update | Use stage references (models:/{name}/Production) |
| Logging metrics but not data version | Can't reproduce a bad run | Always log the Delta version of training data |
| No fairness check on regulated models | Legal/compliance liability | Mandatory gate for lending, healthcare, hiring |
| AutoML in production without review | Hidden complexity, hidden failure modes | Use AutoML for the baseline; productionize as Pattern 1 or 2 |
| No retraining schedule for stable models | Quiet decay, late detection | Monthly retrain even without a drift signal |

📋 Production Readiness Checklist

Before promoting any model to Production stage:

  • Code in Git, on main branch, PR reviewed
  • Model registered in MLflow with name, version, stage
  • Training data version logged (Delta version or partition spec)
  • Code SHA logged
  • Environment captured (Spark runtime, lib versions, env file)
  • All 5 validation gates pass: performance, holdout, fairness (if regulated), latency (if online), calibration
  • Holdout set excluded from training/validation (audit-trail confirmed)
  • Monitoring wired: drift, performance, latency, error, volume
  • Alerts wired to Action Group with correct severity routing
  • Runbook exists for the model's failure modes
  • Rollback procedure documented and tested
  • Retraining trigger defined (schedule, drift, performance, or hybrid)
  • Cost-center tag set
  • Model card published (purpose, training data, performance, limitations, fairness)
  • On-call team notified of new model
  • Postmortem template created for first 30-day incident review
  • Compliance sign-off obtained (if regulated domain)


โฌ†๏ธ Back to Top | ๐Ÿ“š Best Practices Index | ๐Ÿ  Home