🤖 Machine Learning on Databricks¶

Build machine learning pipelines on Databricks. Learn MLflow, AutoML, and model deployment.

🎯 Learning Objectives¶

Build ML models with Spark MLlib
Track experiments with MLflow
Use AutoML for automated model selection
Deploy models for inference
Implement feature engineering

📋 Prerequisites¶

Databricks workspace - Quickstart
Python ML libraries - scikit-learn, pandas
Basic ML concepts - Classification, regression

🧠 Step 1: Load and Prepare Data¶

# Load data
df = spark.read.csv("/data/customer-churn.csv", header=True, inferSchema=True)

# Feature engineering
from pyspark.sql.functions import *

df_features = df.withColumn(
    "total_spend",
    col("monthly_charges") * col("tenure")
)

# Split data
train_df, test_df = df_features.randomSplit([0.8, 0.2], seed=42)

🏗️ Step 2: Build ML Pipeline¶

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import LogisticRegression

# Feature transformation
assembler = VectorAssembler(
    inputCols=["tenure", "monthly_charges", "total_spend"],
    outputCol="features"
)

# Model
lr = LogisticRegression(
    featuresCol="features",
    labelCol="churn",
    maxIter=10
)

# Pipeline
pipeline = Pipeline(stages=[assembler, lr])

# Train
model = pipeline.fit(train_df)

📊 Step 3: Track with MLflow¶

import mlflow
import mlflow.spark

# Start run
with mlflow.start_run():
    # Train model
    model = pipeline.fit(train_df)

    # Log parameters
    mlflow.log_param("maxIter", 10)

    # Evaluate
    predictions = model.transform(test_df)
    accuracy = evaluator.evaluate(predictions)

    # Log metrics
    mlflow.log_metric("accuracy", accuracy)

    # Log model
    mlflow.spark.log_model(model, "model")

🚀 Step 4: Deploy Model¶

# Load model
model = mlflow.spark.load_model("runs:/run-id/model")

# Batch prediction
predictions = model.transform(new_data)

# Save results
predictions.write.format("delta").save("/ml/predictions")

📚 Resources¶

Last Updated: January 2025