# Model Monitoring & Drift Detection

Production Drift Detection on Microsoft Fabric: Workspace Monitoring + Eventhouse + KQL
Last Updated: 2026-04-27 | Version: 1.0.0 | Wave 2 Sub-Topic of: MLOps for Fabric Production
## Table of Contents
- Overview
- Drift Type Taxonomy
- Statistical Methods
- Reference Architecture
- Implementation in Fabric
- KQL Drift Library
- Performance Drift Patterns
- Concept Drift Detection
- Alert Wiring
- Retraining Trigger Patterns
- False Positive Mitigation
- Casino Implementation
- Federal Implementation
- Anti-Patterns
- Implementation Checklist
- References
## Overview
Models decay. The world they predict on changes faster than the world they were trained on. Without drift detection, you only learn that a model has gone bad after it has hurt the business: missed revenue, compliance failures, customer harm. Drift detection is the "smoke alarm" of production ML: it fires before the fire spreads.
This is the deep-dive companion to the MLOps for Fabric Production anchor. It covers four drift categories, five statistical methods, KQL implementations against Workspace Monitoring + Eventhouse, alert wiring, retraining triggers, and false-positive mitigation.
### Why Four Drift Types, Not Just "Drift"
Most teams say "drift" and mean "the input data looks different." That's only one of four failure modes:
| Drift Type | What Changes | Business Impact | Detection Cost |
|---|---|---|---|
| Data drift | Input feature distributions | Predictions on data the model never saw | Low: log inputs, compare distributions |
| Prediction drift | Output distribution | Score distribution shifts on similar inputs | Low: log predictions, compare |
| Performance drift | Accuracy / AUC / RMSE | Model decisions get worse | Medium: needs ground truth |
| Concept drift | Relationship between X and y | Same inputs → different correct answer | High: importance shift, A/B compare |
Each type has a different signal, test, and remediation. A single "drift detected" alert is useless; on-call needs to know which kind fired so they can retrain, investigate the input pipeline, or escalate to product.
### Where This Lives in Fabric
Three native surfaces: Workspace Monitoring (built-in endpoint metrics; KQL-queryable), Eventhouse (high-cardinality storage for prediction logs, feature snapshots, drift metrics), Real-Time Dashboards (trend visualization). Add a Lakehouse-stored reference distribution (the training-time feature snapshot) and you have full drift coverage.
> **Scope:** This doc covers detection. For remediation, see MLOps for Fabric Production § Retraining Triggers and § Canary, A/B, and Champion-Challenger.
## Drift Type Taxonomy
| Drift Type | Formal Definition | Signal | Tests | Typical Threshold |
|---|---|---|---|---|
| Data drift (covariate shift) | P(X) changes; P(y\|X) stable | Per-feature distribution shift | PSI, KS, Chi-square, Wasserstein | PSI > 0.2 (moderate), > 0.25 (severe) |
| Prediction drift | P(ŷ) changes | Output score distribution shift | PSI on score buckets, KS on continuous score | PSI > 0.2 |
| Performance drift | Realized metric degrades | AUC / accuracy / RMSE vs reference window | Direct metric comparison + significance test | AUC drop > 5% or below absolute floor |
| Concept drift | P(y\|X) changes; relationship shifts | Performance drops despite stable inputs | Sliding-window performance, feature-importance shift | Performance drift with PSI < 0.1 on inputs |
| Label drift (prior shift) | P(y) changes | Target-class distribution shift | Chi-square on label distribution | Chi-square p < 0.01, large effect size |
### Decision Tree: Which Drift Fired?
```
Performance dropped?
├── Yes → Check input PSI
│   ├── PSI high (> 0.2) → Data drift caused it; investigate upstream
│   ├── PSI normal (< 0.1) → Concept drift; relationship has changed (retrain, possibly rebuild)
│   └── PSI medium (0.1–0.2) → Mixed; investigate both
│
└── No (performance stable) → Is prediction distribution shifted?
    ├── Yes + inputs shifted → Data drift, but model is robust (monitor; lower urgency)
    ├── Yes + inputs stable → Suspicious; likely instrumentation bug, not real drift
    └── No → Healthy
```
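The same triage can be encoded as a helper for alert enrichment. A minimal sketch (function name and return labels are illustrative; thresholds mirror the tree above):

```python
def classify_drift(perf_dropped: bool, input_psi: float, prediction_psi: float) -> str:
    """Triage which drift type fired, mirroring the decision tree above."""
    if perf_dropped:
        if input_psi > 0.2:
            return "data-drift"                     # investigate upstream
        if input_psi < 0.1:
            return "concept-drift"                  # retrain, possibly rebuild
        return "mixed"                              # 0.1-0.2: investigate both
    if prediction_psi > 0.2:
        if input_psi > 0.2:
            return "data-drift-model-robust"        # monitor; lower urgency
        return "suspected-instrumentation-bug"      # outputs moved, inputs didn't
    return "healthy"
```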
### Label Drift: A Special Case
Label drift (P(y) shift) is detectable only when ground truth arrives. In casino fraud detection, labels arrive after investigators close the case, sometimes weeks later. In USDA crop yield, labels arrive at harvest, months later. Label drift detection must operate on the delayed label stream and compare it to the prior-period label distribution (a sketch follows). See § Performance Drift Patterns, Pattern B (Proxy Metrics), for what to do while waiting.
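A minimal sketch of that check, assuming a `lh_silver.realized_outcomes` table with `ActualLabel` and `CloseDate` columns (names illustrative):

```python
from scipy.stats import chisquare

# Prior-period vs recent label mix on the delayed label stream
outcomes = spark.table("lh_silver.realized_outcomes")
prior = outcomes.filter(
        "CloseDate BETWEEN date_sub(current_date(), 180) AND date_sub(current_date(), 90)") \
    .groupBy("ActualLabel").count().toPandas().set_index("ActualLabel")["count"]
recent = outcomes.filter("CloseDate > date_sub(current_date(), 90)") \
    .groupBy("ActualLabel").count().toPandas().set_index("ActualLabel")["count"]
recent, prior = recent.align(prior, fill_value=1)
expected = prior / prior.sum() * recent.sum()   # prior mix scaled to recent volume
stat, p = chisquare(recent, expected)
if p < 0.01:
    print(f"Label drift suspected: chi2={stat:.1f}, p={p:.2e}")
```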
## Statistical Methods
Five methods cover almost every drift detection scenario you'll meet. Each is implementable in KQL or PySpark and tested against the reference distribution stored in the Lakehouse.
### PSI (Population Stability Index)
Use when: Feature is binned (continuous discretized into deciles, or categorical). Most common drift metric.
Formula: `PSI = Σ (actual% - expected%) × ln(actual% / expected%)`
Thresholds (industry-standard):
| PSI | Interpretation | Action |
|---|---|---|
| < 0.1 | No significant change | Continue monitoring |
| 0.1–0.2 | Moderate change | Investigate; raise WARN |
| > 0.2 | Significant change | Trigger retraining evaluation |
| > 0.25 | Severe drift | Page on-call; freeze new traffic |
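To make the formula and the bands concrete, a toy calculation with four buckets (numbers invented):

```python
import math

expected = [0.25, 0.25, 0.25, 0.25]   # reference bucket shares
actual   = [0.10, 0.20, 0.30, 0.40]   # current bucket shares
psi = sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))
print(round(psi, 3))   # 0.228 -> "significant change": trigger retraining evaluation
```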
KQL:
```kql
// PSI between current 7-day window and the stored reference distribution
let model = "casino-player-churn-lightgbm";
let total = toscalar(PredictionLog
    | where ModelName == model and TimeGenerated > ago(7d)
    | count);
let reference = ReferenceDistribution
    | where ModelName == model and FeatureName == "avg_daily_spend"
    | project Bucket, ExpectedPct;
let current = PredictionLog
    | where ModelName == model and TimeGenerated > ago(7d)
    // Bucket scheme must match how the reference was built (decile edges in
    // Step 2); fixed-width bins shown for brevity
    | summarize ActualCount = count() by Bucket = tostring(bin(avg_daily_spend, 50.0))
    | extend ActualPct = todouble(ActualCount) / total;
reference
| join kind=fullouter (current) on Bucket
| extend ExpectedPct = coalesce(ExpectedPct, 0.0001), ActualPct = coalesce(ActualPct, 0.0001)
| extend PSI_contrib = (ActualPct - ExpectedPct) * log(ActualPct / ExpectedPct)
| summarize PSI = sum(PSI_contrib)
```
PySpark (offline reference build):
```python
from pyspark.sql import functions as F
from pyspark.ml.feature import Bucketizer

def compute_psi(ref_df, cur_df, feature, n_bins=10):
    # Decile edges from the reference frame; interior edges become split points
    qs = ref_df.approxQuantile(feature, [i / n_bins for i in range(n_bins + 1)], 0.001)
    splits = [float("-inf")] + sorted(set(qs[1:-1])) + [float("inf")]
    bucketize = Bucketizer(splits=splits, inputCol=feature, outputCol="bucket",
                           handleInvalid="keep").transform
    ref = bucketize(ref_df).groupBy("bucket").count().withColumnRenamed("count", "rc")
    cur = bucketize(cur_df).groupBy("bucket").count().withColumnRenamed("count", "cc")
    rt, ct = ref_df.count(), cur_df.count()
    joined = ref.join(cur, "bucket", "fullouter").fillna(1)  # smooth empty buckets
    return joined.withColumn(
        "psi", (F.col("cc") / ct - F.col("rc") / rt) * F.log((F.col("cc") / ct) / (F.col("rc") / rt))
    ).agg({"psi": "sum"}).collect()[0][0]
```
### KS Test (Kolmogorov-Smirnov)
Use when: Feature is continuous and you don't want to discretize.
Formula: Maximum vertical distance between the two empirical CDFs. KS statistic = max |F_ref(x) - F_cur(x)|.
Threshold: Reject null at p < 0.01 with effect size > 0.1 (D-statistic). Bonferroni-correct when running across many features.
KQL:
```kql
// KS approximation via percentile comparison (see Query 2 in the KQL Drift
// Library for the full version, including the reference lookup)
let model = "casino-player-churn-lightgbm";
PredictionLog
| where ModelName == model and TimeGenerated > ago(7d)
| extend v = todouble(parse_json(Features).avg_session_minutes)
| summarize (p5, p10, p25, p50, p75, p90, p95) =
    percentiles(v, 5, 10, 25, 50, 75, 90, 95)
// compare each percentile to the stored reference percentile; the max gap is
// the KS-like statistic
```

KQL approximates KS via percentile binning. For a full KS p-value, run a daily PySpark job (`scipy.stats.ks_2samp`) over a sample, as sketched below.
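A sketch of that daily job, assuming `reference_df` and `current_df` are Spark frames of reference and current traffic (feature list illustrative):

```python
from scipy.stats import ks_2samp

FEATURES = ["avg_daily_spend", "avg_session_minutes", "days_since_last_visit"]
alpha = 0.01 / len(FEATURES)                # Bonferroni-adjusted significance
ref_pd = reference_df.select(*FEATURES).sample(0.1).toPandas()
cur_pd = current_df.select(*FEATURES).sample(0.1).toPandas()
for feat in FEATURES:
    stat, p = ks_2samp(ref_pd[feat].dropna(), cur_pd[feat].dropna())
    if p < alpha and stat > 0.1:            # significance + effect-size gate
        print(f"KS drift on {feat}: D={stat:.3f}, p={p:.2e}")
```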
### Chi-Square (Categorical Drift)
Use when: Feature is categorical (region, channel, device type).
Threshold: p < 0.01 with Cramér's V > 0.1 (effect size).
KQL:
// Chi-square contributions per category
let reference = ReferenceDistribution
| where ModelName == "casino-fraud-lightgbm" and FeatureName == "merchant_category"
| project Category, ExpectedCount;
let current = PredictionLog
| where ModelName == "casino-fraud-lightgbm"
| where TimeGenerated > ago(1d)
| summarize ObservedCount = count() by Category = merchant_category;
reference
| join kind=fullouter (current) on Category
| extend ExpectedCount = coalesce(ExpectedCount, 1.0), ObservedCount = coalesce(ObservedCount, 1.0)
| extend chi_contrib = pow(ObservedCount - ExpectedCount, 2) / ExpectedCount
| summarize chi_square = sum(chi_contrib), df = count() - 1
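KQL returns the raw statistic; the p-value and effect size are computed downstream. A sketch, assuming `chi_square`, current sample size `n`, and category count `k` are exported from the query above:

```python
from scipy.stats import chi2

p_value = chi2.sf(chi_square, df=k - 1)
# Cohen's w; equals Cramér's V when comparing two distributions
effect = (chi_square / n) ** 0.5
drifted = p_value < 0.01 and effect > 0.1
```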
### Wasserstein Distance (Earth Mover's Distance)
Use when: You want a single distance that respects the ordering of bins, expressed in the feature's own units. Better than PSI when bin boundaries shift slightly.
Formula: Minimum "work" (mass × distance) to transform one distribution into another.
Threshold: Domain-dependent; calibrate against historic stable periods (e.g., > 2σ of baseline = drift).
PySpark (using scipy):
```python
import mlflow
from scipy.stats import wasserstein_distance

# reference_df / current_df: Spark frames of reference and current traffic
ref_sample = reference_df.select("amount").sample(0.1).toPandas()["amount"]
cur_sample = current_df.select("amount").sample(0.1).toPandas()["amount"]
wd = wasserstein_distance(ref_sample, cur_sample)
mlflow.log_metric("wasserstein_amount", wd)
```
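Calibrating the "> 2σ of baseline" band is mechanical once you keep a history of the metric. A sketch with invented weekly values:

```python
import numpy as np

stable_wds = [0.8, 1.1, 0.9, 1.0, 1.2, 0.95]   # WD per week over a known-stable period
threshold = np.mean(stable_wds) + 2 * np.std(stable_wds)
drifted = wd > threshold                        # wd from the block above
```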
### Jensen-Shannon Divergence
Use when: Both distributions are categorical or binned, and you want a symmetric, bounded [0, 1] metric.
Formula: JSD(P‖Q) = 0.5 × KL(P‖M) + 0.5 × KL(Q‖M), where M = 0.5 × (P + Q).
Threshold: JSD > 0.1 = notable; JSD > 0.2 = severe (similar to PSI bands).
```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def jsd(p, q, epsilon=1e-9):
    p = np.asarray(p, dtype=float) + epsilon
    q = np.asarray(q, dtype=float) + epsilon
    p /= p.sum(); q /= q.sum()
    # scipy returns the JS *distance* (sqrt of the divergence), so square it;
    # base=2 keeps the result bounded in [0, 1], matching the bands above
    return jensenshannon(p, q, base=2) ** 2
```
### Method Selection Matrix
| Feature Type | Primary | Backup | Notes |
|---|---|---|---|
| Continuous, well-distributed | PSI | KS | PSI for dashboards; KS for significance |
| Continuous, heavy-tailed | Wasserstein | PSI on log-bins | EMD handles outliers better |
| Categorical, low cardinality (< 50) | Chi-square | JSD | |
| Categorical, high cardinality (> 50) | JSD on top-N | PSI on grouped buckets | Group rare categories into "other" |
| Binary | PSI | Two-proportion z-test | Simplest case |
## Reference Architecture
```mermaid
flowchart LR
    subgraph Serving["Production Serving"]
        EP[ML Model<br/>Endpoint]
        BATCH[Batch Inference<br/>Pipeline]
        STREAM[Stream Inference<br/>Eventstream]
    end
    subgraph Logging["Prediction Logging"]
        LOG[Log Hook]
        EH[(Eventhouse<br/>PredictionLog)]
    end
    subgraph Reference["Reference"]
        REF[(Lakehouse<br/>ReferenceDistribution)]
        SNAP[Training-time<br/>Snapshot]
    end
    subgraph Drift["Drift Detection"]
        SCH[Scheduled<br/>Notebook]
        KQL[KQL Drift<br/>Queries]
        DM[(Eventhouse<br/>DriftMetrics)]
    end
    subgraph Action["Action"]
        DASH[Real-Time<br/>Dashboard]
        ALERT[Action Group]
        RT[Retrain<br/>Pipeline]
    end
    EP --> LOG
    BATCH --> LOG
    STREAM --> LOG
    LOG --> EH
    SNAP --> REF
    REF --> KQL
    EH --> KQL
    SCH --> KQL
    KQL --> DM
    DM --> DASH
    DM --> ALERT
    ALERT -.->|drift sustained| RT
    RT -.->|new model| EP
```

### Component Responsibilities
| Component | Fabric Item | Responsibility |
|---|---|---|
| Log Hook | Notebook helper / endpoint sidecar | Emit (input, prediction, model_version, timestamp) per call |
| PredictionLog | Eventhouse table (hot 30d / cold 365d) | Append-only inference log |
| ReferenceDistribution | Delta table in lh_gold (mirrored to Eventhouse via Shortcut) | Training-time bin/percentile/category counts |
| Drift Notebook | Notebook + Pipeline (hourly/daily) | Joins log + reference; writes DriftMetrics |
| DriftMetrics | Eventhouse table | (model, feature, metric, value, window_end) time-series |
| Dashboard | Real-Time Dashboard | Per-feature drift, prediction histogram, performance trend |
| Alert | Workspace Monitoring → Action Group | KQL-driven scheduled alerts |
| Retrain Trigger | Logic App / Function + Fabric Pipeline | Consume alerts β kick off training |
## Implementation in Fabric
### Step 1: Log Predictions to Eventhouse
Wrap every inference call to emit a structured record: model name, model version, request id, inputs (or hashed/quantized for PII), prediction, score, latency, timestamp.
```python
# notebooks/ml/_helpers/prediction_logger.py
import io
import json
import uuid
from datetime import datetime

from azure.kusto.data import KustoConnectionStringBuilder
from azure.kusto.ingest import QueuedIngestClient, IngestionProperties, DataFormat

_client = QueuedIngestClient(
    KustoConnectionStringBuilder.with_aad_managed_service_identity_authentication(
        "https://ingest-<cluster>.kusto.fabric.microsoft.com"))

def log_prediction(model_name, model_version, request_id, features, prediction, score, latency_ms):
    record = {
        "TimeGenerated": datetime.utcnow().isoformat(),
        "ModelName": model_name, "ModelVersion": model_version,
        "RequestId": request_id or str(uuid.uuid4()),
        "Features": json.dumps(features), "Prediction": str(prediction),
        "Score": float(score), "LatencyMs": float(latency_ms),
    }
    # Queued ingestion expects a stream, not raw bytes
    _client.ingest_from_stream(
        io.BytesIO(json.dumps(record).encode("utf-8")),
        IngestionProperties(database="ml_monitoring", table="PredictionLog",
                            data_format=DataFormat.JSON))
```
> **Hot path note:** For low-latency endpoints, use fire-and-forget logging (queue, don't block the response). For batch scoring, log in micro-batches of 10-50K records using `ingest_from_dataframe`.
### Step 2: Build Reference Distribution at Training Time
```python
# notebooks/ml/02_build_reference_distribution.py
import mlflow
from pyspark.sql import functions as F
from pyspark.ml.feature import Bucketizer

NUMERIC = ["avg_daily_spend", "avg_session_minutes", "days_since_last_visit"]
CATEGORICAL = ["preferred_game", "tier", "channel"]
MODEL, VERSION = "casino-player-churn-lightgbm", "v42"

ref_df = spark.table("lh_gold.gold_player_360_training")
total = ref_df.count()
records = []
for feat in NUMERIC:
    # Decile edges define the buckets; interior edges become Bucketizer splits
    qs = ref_df.approxQuantile(feat, [i / 10 for i in range(11)], 0.001)
    splits = [float("-inf")] + sorted(set(qs[1:-1])) + [float("inf")]
    binned = Bucketizer(splits=splits, inputCol=feat, outputCol="bucket",
                        handleInvalid="keep").transform(ref_df) \
        .withColumn("bucket", F.col("bucket").cast("string"))
    for r in binned.groupBy("bucket").count().collect():
        records.append((MODEL, VERSION, feat, r["bucket"], r["count"] / total, qs))
for feat in CATEGORICAL:
    for r in ref_df.groupBy(feat).count().orderBy(F.desc("count")).limit(50).collect():
        records.append((MODEL, VERSION, feat, r[feat], r["count"] / total, None))
spark.createDataFrame(records,
    ["ModelName", "ModelVersion", "FeatureName", "Bucket", "ExpectedPct", "Quantiles"]) \
    .write.mode("overwrite").saveAsTable("lh_gold.reference_distribution")
mlflow.log_param("reference_distribution_table", "lh_gold.reference_distribution")
mlflow.log_param("reference_distribution_version",
    spark.sql("DESCRIBE HISTORY lh_gold.reference_distribution LIMIT 1").collect()[0]["version"])
```
### Step 3: Schedule Drift Computation
```yaml
# Fabric Pipeline (drift_detection_daily)
- name: compute-drift-casino-churn
  type: TridentNotebook
  notebook: 03_drift_detection_daily
  parameters:
    model_name: casino-player-churn-lightgbm
    window_hours: 24
    reference_table: lh_gold.reference_distribution
    output_eventhouse_db: ml_monitoring
    output_table: DriftMetrics
  trigger: schedule(0 4 * * *)   # daily at 04:00
```
The notebook reads PredictionLog (last 24h) + ReferenceDistribution, computes PSI/KS/Chi-square per feature, and writes to DriftMetrics.
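A skeleton of that notebook (feature list illustrative; `compute_psi_vs_reference` is a hypothetical helper that joins the current window to the stored reference distribution, in the spirit of `compute_psi` above):

```python
# notebooks/ml/03_drift_detection_daily.py (skeleton)
from datetime import datetime

model_name = "casino-player-churn-lightgbm"   # pipeline parameter
cur_df = spark.table("ml_monitoring.PredictionLog") \
    .filter(f"ModelName = '{model_name}' "
            "AND TimeGenerated > current_timestamp() - INTERVAL 24 HOURS")
ref_df = spark.table("lh_gold.reference_distribution").filter(f"ModelName = '{model_name}'")

rows = []
for feat in ["avg_daily_spend", "avg_session_minutes", "days_since_last_visit"]:
    psi = compute_psi_vs_reference(ref_df, cur_df, feat)   # hypothetical helper
    rows.append((datetime.utcnow(), model_name, feat, "PSI", float(psi)))

spark.createDataFrame(rows,
    ["TimeGenerated", "ModelName", "FeatureName", "MetricName", "MetricValue"]) \
    .write.mode("append").saveAsTable("ml_monitoring.DriftMetrics")
```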
### Step 4: Real-Time Dashboard
Build a Real-Time Dashboard against DriftMetrics with tiles for:
- Top-5 drifted features over the last 24 hours (heatmap)
- PSI trend per feature, last 30 days (line chart)
- Prediction-score histogram, last 24h vs reference (overlay)
- Performance metrics (when ground truth available) vs reference window
See Real-Time Intelligence for dashboard authoring patterns.
## KQL Drift Library
Five runnable queries. All assume PredictionLog and DriftMetrics tables in Eventhouse, and ReferenceDistribution materialized into Eventhouse via Shortcut from lh_gold.
### Query 1: PSI per feature, last 24 hours
```kql
// PSI per feature over the last 24 hours. Works as-is for categorical
// features; numeric features must be bucketed with the reference's decile
// edges before the join.
let model = "casino-player-churn-lightgbm";
let win_start = ago(1d);
let total = toscalar(PredictionLog
    | where ModelName == model and TimeGenerated > win_start
    | count);
PredictionLog
| where ModelName == model and TimeGenerated > win_start
| extend feats = parse_json(Features)
| mv-expand feats
| extend FeatureName = tostring(bag_keys(feats)[0]),
         FeatureValue = tostring(feats[tostring(bag_keys(feats)[0])])
| summarize ActualCount = count() by FeatureName, Bucket = FeatureValue
| extend ActualPct = todouble(ActualCount) / total
| join kind=inner (
    ReferenceDistribution
    | where ModelName == model
    | project FeatureName, Bucket, ExpectedPct
) on FeatureName, Bucket
| extend ExpectedPct = iff(ExpectedPct == 0, 0.0001, ExpectedPct),
         ActualPct = iff(ActualPct == 0, 0.0001, ActualPct)
| extend psi_contrib = (ActualPct - ExpectedPct) * log(ActualPct / ExpectedPct)
| summarize PSI = sum(psi_contrib) by FeatureName
| order by PSI desc
```
### Query 2: KS-like statistic per numeric feature
```kql
// Percentile-gap approximation of KS. Assumes the reference stores percentile
// values in ExpectedPct under Bucket = "p5" / "p50" / "p95".
let model = "casino-player-churn-lightgbm";
let ref_p5  = toscalar(ReferenceDistribution
    | where ModelName == model and FeatureName == "avg_daily_spend" and Bucket == "p5"
    | project ExpectedPct);
let ref_p50 = toscalar(ReferenceDistribution
    | where ModelName == model and FeatureName == "avg_daily_spend" and Bucket == "p50"
    | project ExpectedPct);
let ref_p95 = toscalar(ReferenceDistribution
    | where ModelName == model and FeatureName == "avg_daily_spend" and Bucket == "p95"
    | project ExpectedPct);
PredictionLog
| where ModelName == model and TimeGenerated > ago(1d)
| extend feats = parse_json(Features), v = todouble(feats.avg_daily_spend)
| summarize (p5, p50, p95) = percentiles(v, 5, 50, 95)
| extend FeatureName = "avg_daily_spend",
         KS_stat = max_of(abs(p5 - ref_p5), abs(p50 - ref_p50), abs(p95 - ref_p95))
| project FeatureName, KS_stat
```
### Query 3: Prediction distribution shift (output drift)
```kql
let model = "casino-player-churn-lightgbm";
let total = toscalar(PredictionLog
    | where ModelName == model and TimeGenerated > ago(1d)
    | count);
let cur = PredictionLog
    | where ModelName == model and TimeGenerated > ago(1d)
    | summarize cur_count = count() by ScoreBucket = bin(Score, 0.1)
    | extend cur_pct = todouble(cur_count) / total;
let ref = ReferenceDistribution
    | where ModelName == model and FeatureName == "__prediction__"
    | project ScoreBucket = todouble(Bucket), ref_pct = ExpectedPct;
cur
| join kind=fullouter (ref) on ScoreBucket
| extend cur_pct = coalesce(cur_pct, 0.0001), ref_pct = coalesce(ref_pct, 0.0001)
| extend psi_contrib = (cur_pct - ref_pct) * log(cur_pct / ref_pct)
| summarize PredictionPSI = sum(psi_contrib)
```
### Query 4: Performance vs reference window
```kql
let model = "casino-player-churn-lightgbm";
let perf_ref = toscalar(DriftMetrics
    | where ModelName == model and MetricName == "AUC_baseline"
    | top 1 by TimeGenerated desc
    | project MetricValue);
DriftMetrics
| where ModelName == model and MetricName == "AUC"
| where TimeGenerated > ago(7d)
| summarize CurAUC = avg(MetricValue)
| extend RefAUC = perf_ref,
         AUC_delta = CurAUC - perf_ref,
         AUC_pct_change = (CurAUC - perf_ref) / perf_ref * 100
```
> AUC is computed in the daily reconciliation notebook (`sklearn.metrics.roc_auc_score`) and written to `DriftMetrics`. KQL handles the trend comparison.
### Query 5: Top-N drifted features (alert candidate)
```kql
DriftMetrics
| where ModelName == "casino-player-churn-lightgbm"
| where MetricName == "PSI" and TimeGenerated > ago(1d)
| summarize MaxPSI = max(MetricValue) by FeatureName
| where MaxPSI > 0.2
| top 10 by MaxPSI desc
| extend Severity = case(MaxPSI > 0.25, "P1", MaxPSI > 0.2, "P2", "P3")
```
## Performance Drift Patterns
Performance drift requires ground truth. Three patterns by label arrival latency:
### Pattern A: Realized vs Predicted (immediate labels)
For models where ground truth arrives within minutes/hours: click-through, fraud-claim resolution, real-time forecasting.
```python
from datetime import datetime

import mlflow
from sklearn.metrics import roc_auc_score, brier_score_loss

preds = spark.sql("""
    SELECT p.Score, r.ActualLabel
    FROM ml_monitoring.PredictionLog p
    JOIN lh_silver.realized_outcomes r ON p.RequestId = r.RequestId
    WHERE p.TimeGenerated >= current_date() - INTERVAL 1 DAY
      AND p.ModelName = 'casino-fraud-lightgbm'
""").toPandas()
auc = roc_auc_score(preds.ActualLabel, preds.Score)
brier = brier_score_loss(preds.ActualLabel, preds.Score)
mlflow.log_metric("daily_auc", auc)
mlflow.log_metric("daily_brier", brier)
spark.createDataFrame([(datetime.utcnow(), "casino-fraud-lightgbm", "AUC", auc),
                       (datetime.utcnow(), "casino-fraud-lightgbm", "Brier", brier)],
                      ["TimeGenerated", "ModelName", "MetricName", "MetricValue"]) \
    .write.mode("append").saveAsTable("ml_monitoring.DriftMetrics")
```
### Pattern B: Proxy Metrics (delayed labels)
When labels arrive in days/weeks (USDA crop yield, casino player churn over 90 days):
| Proxy | What It Approximates | Caveat |
|---|---|---|
| Score distribution shift | Performance drift | Only fires if model becomes systematically over/under-confident |
| Feature drift × model coefficients | Performance drift on linear models | Approximate; doesn't capture interaction effects |
| Calibration on partial labels | Full performance | Selection bias if early-labeling is non-random |
| Coverage rate (predictions made vs target volume) | Operational health | Detects upstream pipeline problems, not model issues |
| Confidence concentration | Decision quality | Spikes in low-confidence predictions = trouble |
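A sketch of the confidence-concentration proxy from the table above (band and threshold are illustrative and should be calibrated against stable history):

```python
from pyspark.sql import functions as F

# Share of predictions in the low-confidence band (score near 0.5)
low_conf_share = (
    spark.table("ml_monitoring.PredictionLog")
    .filter("ModelName = 'casino-player-churn-lightgbm' "
            "AND TimeGenerated > current_timestamp() - INTERVAL 24 HOURS")
    .agg(F.avg(F.when(F.col("Score").between(0.4, 0.6), 1.0).otherwise(0.0)))
    .first()[0]
)
if low_conf_share > 0.3:
    print(f"Confidence concentration spike: {low_conf_share:.1%} of scores in [0.4, 0.6]")
```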
### Pattern C: A/B Holdout vs Production
For mission-critical models, deliberately route 5% of traffic to a holdout that uses the previous model. Compare performance side-by-side as labels arrive. This isolates "is it the model" from "is it the world."
```dax
-- Power BI / DAX measure for holdout-vs-prod
PerformanceDelta =
VAR ProdAUC = CALCULATE(AVERAGE(DriftMetrics[MetricValue]),
    DriftMetrics[Group] = "production",
    DriftMetrics[MetricName] = "AUC")
VAR HoldoutAUC = CALCULATE(AVERAGE(DriftMetrics[MetricValue]),
    DriftMetrics[Group] = "holdout-v41",
    DriftMetrics[MetricName] = "AUC")
RETURN ProdAUC - HoldoutAUC
```
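Routing the 5% holdout deterministically keeps the comparison honest: hashing the request (or player) id means the same entity always lands in the same arm. A minimal sketch (function and field names illustrative):

```python
import hashlib

def route(entity_id: str, holdout_pct: float = 0.05) -> str:
    """Deterministic traffic split; tag the arm in the prediction log's Group field."""
    bucket = int(hashlib.sha256(entity_id.encode()).hexdigest(), 16) % 10_000
    return "holdout-v41" if bucket < holdout_pct * 10_000 else "production"
```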
## Concept Drift Detection
Concept drift is the hardest. Inputs look fine, but the relationship between inputs and target has changed. Three signals:
### Signal 1: Performance drops while inputs stable
```kql
// Concept drift candidate: AUC fell, but input PSI is normal
let model = "casino-player-churn-lightgbm";
let perf_drop = DriftMetrics
    | where ModelName == model and MetricName == "AUC"
    | where TimeGenerated > ago(7d)
    | summarize CurAUC = avg(MetricValue);
let baseline = DriftMetrics
    | where ModelName == model and MetricName == "AUC_baseline"
    | top 1 by TimeGenerated desc
    | project BaseAUC = MetricValue;
let max_input_psi = DriftMetrics
    | where ModelName == model and MetricName == "PSI" and TimeGenerated > ago(7d)
    | summarize MaxInputPSI = max(MetricValue);
perf_drop
| extend BaseAUC = toscalar(baseline), MaxInputPSI = toscalar(max_input_psi)
| extend AUC_drop_pct = (BaseAUC - CurAUC) / BaseAUC * 100
| extend ConceptDriftSuspected = AUC_drop_pct > 5 and MaxInputPSI < 0.1
```
### Signal 2: Tree-based feature importance shift
Train a shadow model on a recent labeled window and compare feature importances to production. Spearman ρ < 0.7 between importance vectors signals the relationship has changed.
```python
import lightgbm as lgb
import mlflow
from scipy.stats import spearmanr

# Production model's importances
prod = mlflow.lightgbm.load_model("models:/casino-player-churn-lightgbm/Production")
prod_imp = dict(zip(prod.feature_name(), prod.feature_importance()))

# Shadow model on the most recent labeled window (assumes feature columns are
# materialized alongside ActualLabel in the log)
recent = spark.table("ml_monitoring.PredictionLog") \
    .filter("TimeGenerated > current_date() - INTERVAL 14 DAY AND ActualLabel IS NOT NULL") \
    .toPandas()
feats = sorted(prod_imp.keys())
shadow = lgb.LGBMClassifier(n_estimators=200).fit(recent[feats], recent["ActualLabel"])
shadow_imp = dict(zip(shadow.feature_name_, shadow.feature_importances_))

rho, _ = spearmanr([prod_imp[f] for f in feats], [shadow_imp[f] for f in feats])
mlflow.log_metric("feature_importance_spearman", rho)
if rho < 0.7:
    raise RuntimeError("Concept drift suspected: feature importances diverged")
```
### Signal 3: Sliding-window comparison
Slice predictions into rolling windows (e.g., 7d × 4 weeks). Performance trending down monotonically, even with stable inputs, is the textbook concept-drift signature.
```kql
DriftMetrics
| where ModelName == "casino-player-churn-lightgbm" and MetricName == "AUC"
| where TimeGenerated > ago(28d)
| summarize WeekAvg = avg(MetricValue) by Week = bin(TimeGenerated, 7d)
| order by Week asc
| extend WoW_delta = WeekAvg - prev(WeekAvg)
```

If WoW_delta stays negative for three or more consecutive weeks, treat it as concept drift and trigger retraining with fresh labels (not augmented historical data).
## Alert Wiring
Drift alerts must be actionable, deduplicated, and routed to the right team. Reuse the patterns from SLO/SLI for Fabric and Observability Stack.
### Alert Severity Matrix
| Signal | P1 (Page) | P2 (Ticket) | P3 (Dashboard) |
|---|---|---|---|
| Input PSI (any feature) | > 0.25 sustained 3 windows | > 0.2 sustained 3 windows | > 0.15 single window |
| Prediction PSI | > 0.25 | > 0.2 | > 0.15 |
| AUC drop vs baseline | > 10% | > 5% | > 2% |
| Concept drift (importance ρ) | < 0.5 | < 0.7 | < 0.85 |
| Volume drop | > 50% drop | > 25% drop | > 10% drop |
| Calibration ECE | > 0.15 | > 0.1 | > 0.05 |
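The calibration row uses Expected Calibration Error. A minimal implementation for the daily reconciliation notebook (bin count illustrative):

```python
import numpy as np

def ece(scores: np.ndarray, labels: np.ndarray, n_bins: int = 10) -> float:
    """Expected Calibration Error: traffic-weighted |mean score - realized rate| per bin."""
    scores = np.clip(scores, 0.0, 1.0 - 1e-9)   # keep 1.0 inside the last bin
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    err, total = 0.0, len(scores)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (scores >= lo) & (scores < hi)
        if mask.any():
            err += mask.sum() / total * abs(scores[mask].mean() - labels[mask].mean())
    return err
```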
### Scheduled Query Alert (Workspace Monitoring)
```yaml
# Azure Monitor scheduled query alert pointing at the Fabric KQL endpoint
name: drift-p1-casino-churn
query: |
  DriftMetrics
  | where ModelName == "casino-player-churn-lightgbm" and MetricName == "PSI"
  | where TimeGenerated > ago(72h)
  | summarize WindowsOver = countif(MetricValue > 0.25) by FeatureName
  | where WindowsOver >= 3
schedule: every 1 hour
threshold: results > 0
severity: 1
action_group: ml-oncall-pagerduty   # P1 -> PagerDuty; P2 -> Teams; trigger -> retraining Logic App
```
## Retraining Trigger Patterns
See MLOps for Fabric Production § Retraining Triggers for the master list. Drift-specific patterns:
| Pattern | Trigger | Action | Cool-down |
|---|---|---|---|
| Drift-only | PSI > 0.2 sustained 3 windows | Kick off retrain pipeline | 7 days (don't loop) |
| Performance-only | AUC drop > 5% for 5 days | Retrain with fresh labels | 14 days |
| Hybrid | (PSI > 0.2) AND (AUC drop > 3%) | Retrain + investigate input pipeline | 7 days |
| Concept-drift | Importance ρ < 0.7 + perf drop | Retrain + flag for product/data review | 30 days; involves human |
| Manual override | Product event (regulation, market shift) | Force retrain, ignore cool-down | n/a |
### Cool-Down Logic
Prevent retrain storms. After triggering a retrain, suppress drift-driven retrains for the cool-down window. Alerts still fire (humans should know), but the automated pipeline is gated.
```kql
let last_retrain = toscalar(DriftMetrics
    | where ModelName == "casino-player-churn-lightgbm" and MetricName == "RetrainTriggered"
    | top 1 by TimeGenerated desc
    | project TimeGenerated);
DriftMetrics
| where ModelName == "casino-player-churn-lightgbm" and MetricName == "PSI"
| where TimeGenerated > ago(72h) and TimeGenerated > last_retrain + 7d   // cool-down gate
| summarize WindowsOver = countif(MetricValue > 0.2) by FeatureName
| where WindowsOver >= 3
```
## False Positive Mitigation
Most drift alerts are false positives the first time you turn on monitoring. Tune for signal, not noise.
| Source | Symptom | Mitigation |
|---|---|---|
| Seasonality | Every Friday/weekend looks like drift | Build seasonal reference distributions keyed on (day_of_week, hour, is_holiday); compare current to the matching cell |
| Holidays / events | Black Friday, Super Bowl, hurricanes spike features | Maintain a SpecialEventCalendar Lakehouse table; drift queries leftanti-join to exclude or annotate |
| Intentional product changes | New game / promotion launches; eligibility rule changes | ChatOps suppression: /ml suppress drift <model> --reason "X" --until <date> writes to a suppression table consulted by alert queries |
| Small samples | PSI > 0.2 on < 1,000 records by chance | Require minimum sample size in the alert query; widen window or skip when traffic is low |
| Multiple comparisons | 2-3 false positives across 50 features at p < 0.05 | Bonferroni (p_adj = p × n) or FDR (Benjamini-Hochberg; see the sketch below) |
| Reference drift | Baseline ages out at 6-12 months | Refresh reference distribution on every retrain; version it (ref_v42, ref_v43) |
```kql
// Sample-size gate applied to every PSI alert
let n_samples = toscalar(PredictionLog
    | where ModelName == "casino-player-churn-lightgbm" and TimeGenerated > ago(1d)
    | count);
DriftMetrics
| where ModelName == "casino-player-churn-lightgbm" and MetricName == "PSI"
| extend SampleSize = n_samples
| where SampleSize >= 1000   // suppress small-sample noise
```
```python
# Seasonal reference build: keyed lookup at drift-compute time
from pyspark.sql import functions as F

ref_seasonal = training_df.groupBy("day_of_week", "hour", "is_holiday") \
    .agg(F.expr("percentile_approx(amount, 0.5)").alias("p50_amount"))
ref_seasonal.write.mode("overwrite").saveAsTable("lh_gold.reference_distribution_seasonal")
```
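For the multiple-comparisons row, a Benjamini-Hochberg gate over per-feature p-values (a sketch; `q` is the target false-discovery rate):

```python
def bh_reject(p_values: dict, q: float = 0.05) -> list:
    """Benjamini-Hochberg: return features whose drift survives FDR control."""
    items = sorted(p_values.items(), key=lambda kv: kv[1])
    n = len(items)
    keep_upto = -1
    for i, (_, p) in enumerate(items):
        if p <= (i + 1) / n * q:
            keep_upto = i                  # largest rank passing the BH condition
    return [feat for feat, _ in items[:keep_upto + 1]]
```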
## Casino Implementation
| Model | Drift Focus | Method | Threshold | Action |
|---|---|---|---|---|
| Player Churn (LightGBM) | Input drift on spend/visit features; quarterly concept drift | PSI per feature; weekly AUC vs baseline | PSI > 0.2 (3 days); AUC drop > 5% | Retrain monthly + drift-trigger |
| Fraud Detection (Stream) | Score shift; investigator-label performance | Prediction PSI hourly; AUC weekly | PSI > 0.15; AUC drop > 3% | Page on-call; freeze new merchants if severe |
| Slot Anomaly (Stream) | Telemetry channel drift | Wasserstein on coin_in/out, payout_ratio | > 2Ο baseline | Investigate hardware; retrain monthly |
| Marketing Lift (Batch) | Concept drift after promotion changes | Sliding-window AUC; importance shift | ρ < 0.7 | Pause promotion; retrain on post-change data |
> **Compliance:** Fraud-model drift is a BSA/SAR concern; a drifting fraud model that misses structuring is a compliance failure. Drift alerts on `casino-fraud-lightgbm` are P0; page the AML team, not just on-call ML.
## Federal Implementation
| Model | Agency | Drift Focus | Method | Action |
|---|---|---|---|---|
| Crop Yield Forecast | USDA | Climate / sensor reference shift | Wasserstein on weather; PSI on soil moisture | Annual retrain post-harvest; mid-season alert PSI > 0.25 |
| AQI Forecast | EPA | Sensor drift; fire-season shift | PSI per sensor cohort; 24h-out reconciliation | Weekly perf check; retrain quarterly + wildfire transitions |
| Loan Default Risk | SBA | Macro shift (rates, unemployment) | PSI on macro features; fairness drift | Retrain quarterly; fairness gate every retrain |
| Storm Severity | NOAA | Rare-event class imbalance drift | Chi-square on class distribution; recall on severe class | Never auto-retrain; human review (public safety) |
| Earthquake Magnitude | DOI/USGS | Sensor calibration drift | Wasserstein on seismograph readings | Sensor-level alerts; retrain annually |
Public-safety models (NOAA storm, USGS earthquake) require human-in-the-loop retraining. Drift fires; retraining is manually approved by the agency's chief scientist. Never auto-promote.
## Anti-Patterns
| Anti-Pattern | Why It Hurts | What to Do Instead |
|---|---|---|
| One PSI threshold for all features | High-cardinality and low-traffic features generate noise | Per-feature thresholds calibrated against historic stable windows |
| Drift alerts without retraining wiring | Alert fatigue; engineers stop reading drift channel | Wire P1 alerts to a runbook with retraining trigger |
| Reference distribution never refreshed | After 12 months, baseline is wrong; everything looks like drift | Refresh reference on every model retrain; version it |
| Detecting only data drift | Misses concept drift completely; performance silently rots | Detect all four: data, prediction, performance, concept |
| Logging predictions without request_id | Can't join to realized labels later | Always log a stable request_id per prediction |
| Same window length for all models | Hourly model on 30-day window = stale; weekly model on 1-day window = noisy | Match drift window to inference cadence |
| No seasonality adjustment | Every weekend looks like a P1 drift event | Seasonal reference distributions or holiday calendar joins |
| Auto-retraining without cool-down | Drift fires → retrain → new model drifts → retrain (loop) | Cool-down window (7-30 days) per trigger pattern |
| Drift dashboard nobody opens | Detection without observation = false security | Embed drift metrics in the same dashboard as business KPIs |
| No fairness drift on regulated models | Lending/healthcare model fairness can degrade silently | Fairness metric tracked alongside performance for regulated domains |
## Implementation Checklist
Before declaring a model "monitored":
- Prediction logging hook deployed (every inference emits a structured record)
- PredictionLog table in Eventhouse with hot/cold tiering policy
- Reference distribution built at training time and stored in Lakehouse with version pinned to MLflow run
- Reference distribution materialized to Eventhouse via Shortcut for KQL joins
- PSI computed per feature, scheduled at appropriate cadence (hourly for streaming, daily for batch)
- Prediction-output PSI computed (output drift)
- Performance metric pipeline in place (realized vs predicted, A/B holdout, or proxy metrics)
- Concept-drift detection: importance shift + sliding-window performance
- DriftMetrics table populated and dashboarded
- Real-Time Dashboard published with: top-N drifted features, prediction histogram, performance trend
- Alert thresholds calibrated against 30-day stable history (no first-day-on alerts at default thresholds)
- Alerts wired to Action Groups with P1/P2/P3 routing
- Retraining trigger Logic App / Function deployed and tested end-to-end
- Cool-down logic implemented to prevent retrain storms
- Seasonality / holiday calendar joins applied to drift queries
- Suppression mechanism for intentional product changes (ChatOps or table-driven)
- Fairness-drift detection (regulated domains only)
- Runbook published: "What to do when drift fires"
- On-call team trained on the runbook
- Quarterly drift-detection review: are thresholds still right? Any silent failures?
## References
### Microsoft Fabric Documentation
- Workspace Monitoring
- Eventhouse and KQL Database
- ML Model Endpoints β Monitoring (Preview)
- MLflow Tracking in Fabric
- Real-Time Dashboards
### Industry Standards & Papers
- Gama et al., "A Survey on Concept Drift Adaptation" (ACM Computing Surveys, 2014)
- Lipton et al., "Detecting and Correcting for Label Shift with Black Box Predictors" (ICML 2018)
- Rabanser et al., "Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift" (NeurIPS 2019)
- Google, "The ML Test Score: A Rubric for ML Production Readiness"
- Microsoft, "Responsible AI Standard" (Reliability & Safety Goal)
### Related Wave 2 Docs
- MLOps for Fabric Production (anchor doc)
- Feature Store on OneLake
- Responsible AI Framework
- LLM Cost Tracking
- LLM Evaluation Harness
### Related Operational Docs (Wave 1)
- SLO/SLI for Fabric
- Observability Stack

### Related Existing Docs
- AutoML & Model Endpoints
- Real-Time Intelligence
- Eventhouse Vector Database
- Monitoring & Observability
- Alerting with Data Activator