
ML Operations Best Practices



Best practices for machine learning operations (MLOps) in Cloud Scale Analytics: feature management, experiment tracking, model registry, monitoring, and CI/CD.


MLOps Lifecycle

flowchart LR
    A[Data Prep] --> B[Training]
    B --> C[Validation]
    C --> D[Deployment]
    D --> E[Monitoring]
    E --> |Drift Detected| A
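The loop above can be sketched in plain Python to make the control flow concrete. This is a minimal illustration only: all function names and the drift threshold are hypothetical placeholders, not a real API.

```python
# Hypothetical sketch of the MLOps lifecycle loop shown in the diagram.

def prepare_data():
    # Data Prep: return training data (stubbed here)
    return [1.0, 2.0, 3.0]

def train(data):
    # Training: fit a trivial "model" (just the mean of the data)
    return sum(data) / len(data)

def validate(model):
    # Validation: gate deployment on a quality check
    return model > 0

def drift_detected(model, live_data):
    # Monitoring: flag drift when live data moves away from the model
    live_mean = sum(live_data) / len(live_data)
    return abs(live_mean - model) > 1.0  # threshold is illustrative

def run_lifecycle(live_batches):
    # Deployment + Monitoring: retrain whenever drift is detected
    model = None
    for batch in live_batches:
        if model is None or drift_detected(model, batch):
            data = prepare_data()
            model = train(data)
            assert validate(model)  # deploy only after validation passes
    return model
```

In a real pipeline each step would be a scheduled job (see the CI/CD section below); the point here is only the feedback edge from Monitoring back to Data Prep.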

Key Practices

1. Feature Store

from databricks.feature_engineering import FeatureEngineeringClient, FeatureLookup

fe = FeatureEngineeringClient()

# Create feature table
fe.create_table(
    name="ml.features.customer_features",
    primary_keys=["customer_id"],
    df=customer_features_df,
    description="Customer features for churn prediction"
)

# Use features in training
training_set = fe.create_training_set(
    df=labels_df,
    feature_lookups=[
        FeatureLookup(
            table_name="ml.features.customer_features",
            lookup_key="customer_id"
        )
    ],
    label="churned"
)

2. Experiment Tracking

import mlflow

mlflow.set_experiment("/Experiments/churn-prediction")

params = {
    "n_estimators": 100,
    "max_depth": 6,
    "learning_rate": 0.1
}

with mlflow.start_run():
    # Log parameters
    mlflow.log_params(params)

    # Train model
    model = train_model(params)

    # Log metrics
    mlflow.log_metrics({
        "accuracy": 0.92,
        "f1": 0.89,
        "auc": 0.95
    })

    # Log model
    mlflow.sklearn.log_model(model, "model")

3. Model Registry

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Register the model from a completed run (replace abc123 with the run ID)
model_version = mlflow.register_model(
    "runs:/abc123/model",
    "churn_prediction_model"
)

# Transition to production
# Note: stages are deprecated in recent MLflow releases in favor of aliases
client.transition_model_version_stage(
    name="churn_prediction_model",
    version=model_version.version,
    stage="Production"
)

4. Model Monitoring

# Monitor the inference table for drift
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import (
    MonitorInferenceLog,
    MonitorInferenceLogProblemType,
)

w = WorkspaceClient()

monitor = w.quality_monitors.create(
    table_name="ml.inference.predictions",
    assets_dir="/ml/monitoring",
    output_schema_name="ml.monitoring",
    inference_log=MonitorInferenceLog(
        granularities=["1 day"],
        # Column names must match the inference table's schema
        timestamp_col="timestamp",
        model_id_col="model_version",
        problem_type=MonitorInferenceLogProblemType.PROBLEM_TYPE_CLASSIFICATION,
        prediction_col="prediction",
        label_col="actual",
    ),
)
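The monitor above reports drift against the inference table. To illustrate the kind of signal behind such a report, here is a self-contained sketch of the Population Stability Index (PSI), a common drift statistic; it is pure Python with no Databricks dependency, and the bin count and smoothing constant are illustrative choices.

```python
import math

def psi(reference, live, bins=10):
    """Population Stability Index between two numeric samples.

    Rule of thumb: PSI < 0.1 stable, 0.1-0.25 moderate shift,
    > 0.25 significant drift.
    """
    lo, hi = min(reference), max(reference)

    def bucket_fractions(sample):
        counts = [0] * bins
        for x in sample:
            # Clamp into [0, bins-1] so out-of-range live values still count
            i = min(max(int((x - lo) / (hi - lo) * bins), 0), bins - 1)
            counts[i] += 1
        # Smooth zero buckets to avoid log(0)
        return [max(c / len(sample), 1e-6) for c in counts]

    ref_frac = bucket_fractions(reference)
    live_frac = bucket_fractions(live)
    return sum((l - r) * math.log(l / r) for r, l in zip(ref_frac, live_frac))
```

For example, `psi` is near zero when the live sample matches the reference distribution and grows quickly as the live sample shifts.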

CI/CD Pipeline

stages:
  - stage: Test
    jobs:
      - job: UnitTests
        steps:
          - script: pytest tests/

  - stage: Train
    jobs:
      - job: TrainModel
        steps:
          - script: databricks jobs run-now --job-id $(JOB_ID)

  - stage: Deploy
    condition: succeeded()
    jobs:
      - deployment: DeployModel
        environment: production
        steps:
          - script: python deploy.py
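The UnitTests job above runs pytest against a tests/ directory. A minimal sketch of what one such test might look like, gating on an evaluation threshold; the function name and threshold are illustrative placeholders, not part of this project:

```python
# tests/test_model.py — illustrative only; evaluate_model and the
# 0.75 threshold stand in for the project's real evaluation step.

def evaluate_model(predictions, labels):
    # Toy accuracy computation: fraction of matching predictions
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)

def test_accuracy_meets_threshold():
    predictions = [1, 0, 1, 1]
    labels = [1, 0, 1, 0]
    assert evaluate_model(predictions, labels) >= 0.75
```

Failing tests fail the Test stage, which blocks the Train and Deploy stages via the pipeline's `succeeded()` condition.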


Last Updated: January 2025