🤖 Machine Learning on Databricks
Build machine learning pipelines on Databricks. Learn MLflow, AutoML, and model deployment.
🎯 Learning Objectives
- Build ML models with Spark MLlib
- Track experiments with MLflow
- Use AutoML for automated model selection
- Deploy models for inference
- Implement feature engineering
📋 Prerequisites
- Databricks workspace - Quickstart
- Python ML libraries - scikit-learn, pandas
- Basic ML concepts - Classification, regression
🧠 Step 1: Load and Prepare Data
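from pyspark.sql.functions import col

# Load the churn dataset; inferSchema makes Spark scan the file to guess column types
df = spark.read.csv("/data/customer-churn.csv", header=True, inferSchema=True)

# Feature engineering: approximate lifetime spend as monthly charges times tenure
df_features = df.withColumn(
    "total_spend",
    col("monthly_charges") * col("tenure")
)

# Split into roughly 80% train and 20% test; the fixed seed keeps the split repeatable
train_df, test_df = df_features.randomSplit([0.8, 0.2], seed=42)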
🏗️ Step 2: Build ML Pipeline
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import LogisticRegression

# Feature transformation: combine the numeric columns into a single vector column
assembler = VectorAssembler(
    inputCols=["tenure", "monthly_charges", "total_spend"],
    outputCol="features"
)

# Model: labelCol must be numeric (0/1); if churn is stored as text,
# index it first with a StringIndexer stage
lr = LogisticRegression(
    featuresCol="features",
    labelCol="churn",
    maxIter=10
)

# Pipeline: stages run in order, so the assembler output feeds the classifier
pipeline = Pipeline(stages=[assembler, lr])

# Train
model = pipeline.fit(train_df)
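To make the model stage less of a black box, here is a minimal plain-Python sketch of what a fitted binary logistic regression computes for one row: a weighted sum of the assembled features is passed through a sigmoid, and the result is compared with a 0.5 threshold. The weights, intercept, and sample row are hypothetical values chosen for illustration, not output from the pipeline above.

```python
import math


def churn_probability(features, weights, intercept):
    """Return sigmoid(w . x + b), the probability of the positive class."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def predict(features, weights, intercept, threshold=0.5):
    """Return 1 when the probability exceeds the threshold, otherwise 0."""
    return 1 if churn_probability(features, weights, intercept) > threshold else 0


# Hypothetical coefficients for [tenure, monthly_charges, total_spend]
weights = [-0.05, 0.03, -0.0002]
intercept = -1.0

# Hypothetical customer: 12 months tenure, 70 per month, 840 total spend
row = [12.0, 70.0, 840.0]
prob = churn_probability(row, weights, intercept)
label = predict(row, weights, intercept)
```

In the Spark pipeline, the same per-row values appear in the `probability` and `prediction` columns of `model.transform(...)`.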
📊 Step 3: Track with MLflow
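import mlflow
import mlflow.spark
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Accuracy evaluator: compares the "prediction" column with the "churn" label
evaluator = MulticlassClassificationEvaluator(
    labelCol="churn",
    predictionCol="prediction",
    metricName="accuracy"
)

# Everything logged inside the with-block is attached to a single MLflow run
with mlflow.start_run():
    # Train model
    model = pipeline.fit(train_df)
    # Log parameters (read from the estimator so the logged value cannot drift)
    mlflow.log_param("maxIter", lr.getMaxIter())
    # Evaluate on the held-out test set
    predictions = model.transform(test_df)
    accuracy = evaluator.evaluate(predictions)
    # Log metrics
    mlflow.log_metric("accuracy", accuracy)
    # Log the fitted pipeline under the artifact path "model"
    mlflow.spark.log_model(model, "model")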
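The accuracy metric logged in this step is the fraction of rows whose predicted label matches the true label. The following plain-Python sketch shows that calculation on a handful of made-up labels; the function name and sample values are illustrative assumptions, and on real data the Spark evaluator remains the right tool.

```python
def accuracy_score(labels, predictions):
    """Return the share of positions where the prediction equals the label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Made-up example: 1 = churned, 0 = stayed
labels = [1, 0, 0, 1, 0]
predictions = [1, 0, 1, 1, 0]

# Four of the five predictions match the labels
accuracy = accuracy_score(labels, predictions)
```

Churn datasets are often imbalanced, so accuracy alone can look high even when few churners are caught; it may be worth logging a second metric alongside it.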
🚀 Step 4: Deploy Model
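import mlflow.spark

# Load the logged model; replace run-id with the ID of the MLflow run from Step 3
model = mlflow.spark.load_model("runs:/run-id/model")
# Batch prediction; new_data is a Spark DataFrame with the same input columns as train_df
predictions = model.transform(new_data)
# Save results in Delta format
predictions.write.format("delta").save("/ml/predictions")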
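The `runs:/run-id/model` string follows MLflow's `runs:/<run_id>/<artifact_path>` URI scheme, where the artifact path is the second argument passed to `log_model` in Step 3. A small helper like the one below can build the URI from a run ID instead of hand-editing the string. `build_model_uri` and the sample ID are hypothetical; in a notebook the real ID would typically come from `run.info.run_id` when the run is opened with `mlflow.start_run() as run`.

```python
def build_model_uri(run_id: str, artifact_path: str = "model") -> str:
    """Build a runs:/ URI that points at a model logged in an MLflow run."""
    run_id = run_id.strip()
    if not run_id:
        raise ValueError("run_id must not be empty")
    # Trim stray slashes so the URI has exactly one separator per segment
    return f"runs:/{run_id}/{artifact_path.strip('/')}"


# Hypothetical run ID, used only to show the resulting format
example_uri = build_model_uri("0123456789abcdef")
```

The resulting string can be passed to `mlflow.spark.load_model` in place of the hard-coded URI.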
📚 Resources
Last Updated: January 2025