
Data Scientist Quickstart — Your First ML Experiment in 30 Minutes

Estimated time: 30 minutes | Difficulty: Beginner | What you'll build: Train a simple regression model on the CSA Gold layer, log metrics and artifacts with MLflow autologging, register the model in the MLflow Model Registry, and optionally deploy a scoring endpoint through Azure ML or Databricks Model Serving.


Prerequisites

Before you begin, make sure the following are in place:

  • Azure subscription with Contributor access to the resource group
  • Databricks workspace provisioned in the Data Landing Zone (see tutorials/01-foundation-platform)
  • Python 3.10+ installed locally (for optional local testing)
  • Gold layer data in ADLS Gen2 or OneLake — the dbt pipeline must have run at least once (see QUICKSTART.md)
  • MLflow available in the Databricks workspace (included in Databricks Runtime ML by default)

Architecture

graph LR
    subgraph "CSA Gold Layer"
        GOLD[Gold Delta Tables<br/>ADLS Gen2 / OneLake]
    end

    subgraph "Databricks Workspace"
        FE[Feature Engineering<br/>PySpark Notebook]
        TRAIN[Model Training<br/>scikit-learn]
    end

    subgraph "MLflow"
        TRACK[Experiment Tracking<br/>Params · Metrics · Artifacts]
        REG[Model Registry<br/>Staging → Production]
    end

    subgraph "Serving"
        AML[Azure ML Managed Endpoint]
        DBS[Databricks Model Serving]
    end

    GOLD --> FE --> TRAIN --> TRACK --> REG
    REG --> AML
    REG --> DBS

Data flows left to right: Gold tables feed a PySpark feature-engineering notebook, which passes a pandas DataFrame to scikit-learn. MLflow autologging captures parameters, metrics, and the serialized model. The best run is promoted to the Model Registry and deployed to a scoring endpoint.


Step 1 — Access Your Databricks Workspace

  1. Open the Azure portal and navigate to your Data Landing Zone resource group.
  2. Click the Azure Databricks Service resource and select Launch Workspace.
  3. In the Databricks sidebar, choose Compute. Use an existing cluster with Databricks Runtime ML 14.x+, or create one:
| Setting | Value |
| --- | --- |
| Cluster Mode | Single Node (sufficient for 30 min) |
| Runtime Version | 14.3 LTS ML (or latest ML LTS) |
| Node Type | Standard_DS3_v2 |
| Auto-termination | 30 minutes |

Wait for the cluster to reach the Running state.
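If you script your workspace instead of clicking through the UI, the same cluster can be sketched as a JSON spec for the Databricks CLI. The field names follow the Clusters API; the exact `spark_version` string and the CLI flag syntax (`--json` vs `--json-file`, which differs between CLI versions) are assumptions to verify against your installed CLI.

```shell
# Sketch: single-node ML cluster spec using Clusters API field names.
cat > quickstart-cluster.json <<'EOF'
{
  "cluster_name": "csa-quickstart",
  "spark_version": "14.3.x-cpu-ml-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "num_workers": 0,
  "autotermination_minutes": 30,
  "spark_conf": {
    "spark.databricks.cluster.profile": "singleNode",
    "spark.master": "local[*]"
  },
  "custom_tags": {"ResourceClass": "SingleNode"}
}
EOF

databricks clusters create --json @quickstart-cluster.json
```

The `spark_conf` and `custom_tags` entries are what distinguish a single-node cluster from a zero-worker standard cluster.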


Step 2 — Load Gold Data

Create a new Python notebook. In the first cell, read the Gold-layer fact table from ADLS Gen2 (or OneLake if Fabric is enabled).

# Cell 1 — Read the gold fact table
gold_path = "abfss://gold@<your-storage-account>.dfs.core.windows.net/fact_sales"
df_gold = spark.read.format("delta").load(gold_path)
display(df_gold.limit(10))
Expected output: A table showing 10 rows from `fact_sales` with columns such as `sale_id`, `customer_key`, `product_key`, `date_key`, `quantity`, `unit_price`, and `total_amount`.

Tip: Replace `<your-storage-account>` with the ADLS Gen2 account name from your Data Landing Zone. For OneLake, use `abfss://<workspace-id>@onelake.dfs.fabric.microsoft.com/<lakehouse>/Tables/fact_sales`.


Step 3 — Feature Engineering

Build a feature set by joining dimensions, extracting date features, and selecting numeric columns.

# Cell 2 — Join dimensions and engineer features
dim_date_path    = "abfss://gold@<your-storage-account>.dfs.core.windows.net/dim_date"
dim_product_path = "abfss://gold@<your-storage-account>.dfs.core.windows.net/dim_product"
df_date    = spark.read.format("delta").load(dim_date_path)
df_product = spark.read.format("delta").load(dim_product_path)

df_features = (
    df_gold
    .join(df_date, df_gold.date_key == df_date.date_key, "left")
    .join(df_product, df_gold.product_key == df_product.product_key, "left")
    .select(
        "quantity", "unit_price",
        df_date.month.alias("sale_month"),
        df_date.day_of_week.alias("sale_dow"),
        df_product.category_id.alias("product_category"),
        "total_amount",
    )
    .na.drop()
)

display(df_features.describe())
# Cell 3 — Convert to pandas for scikit-learn
pdf = df_features.toPandas()
print(f"Feature set shape: {pdf.shape}")
Expected output
Feature set shape: (84521, 6)
Your exact row count will differ depending on the seed data loaded.
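Note that `toPandas()` collects the entire feature set onto the driver, which can exhaust driver memory on large Gold tables. One defensive pattern is to cap the number of rows pulled locally before converting; this is a sketch, and both `MAX_DRIVER_ROWS` and the `sample_fraction` helper are illustrative names, not part of the tutorial's pipeline.

```python
# Sketch: cap how many rows are collected to the driver before toPandas().
# MAX_DRIVER_ROWS is an illustrative threshold -- tune it for your node size.
MAX_DRIVER_ROWS = 500_000

def sample_fraction(total_rows: int, max_rows: int = MAX_DRIVER_ROWS) -> float:
    """Return a Bernoulli sampling fraction that keeps roughly max_rows rows."""
    return 1.0 if total_rows <= max_rows else max_rows / total_rows

# In the notebook (uses df_features from Cell 2):
# frac = sample_fraction(df_features.count())
# pdf = (df_features.sample(fraction=frac, seed=42)
#        if frac < 1.0 else df_features).toPandas()
```

Sampling keeps the quickstart interactive; for full-data training you would switch to a distributed trainer rather than raise the cap.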

Step 4 — Train the Model

Use scikit-learn with MLflow autologging. Autologging captures hyperparameters, metrics, and the serialized model automatically.

# Cell 4 — Train a regression model with MLflow autologging
import mlflow
import mlflow.sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

mlflow.sklearn.autolog(log_input_examples=True, silent=True)

# Prepare train / test split
TARGET = "total_amount"
FEATURES = [c for c in pdf.columns if c != TARGET]

X_train, X_test, y_train, y_test = train_test_split(
    pdf[FEATURES], pdf[TARGET], test_size=0.2, random_state=42
)

mlflow.set_experiment("/Shared/csa-inabox/sales-regression")

with mlflow.start_run(run_name="gbr-quickstart") as run:
    model = GradientBoostingRegressor(
        n_estimators=200, max_depth=5,
        learning_rate=0.1, random_state=42,
    )
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    run_id = run.info.run_id
    print(f"MLflow Run ID: {run_id}")
Expected output
MLflow Run ID: a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4

What autologging captures: `n_estimators`, `max_depth`, `learning_rate`, training metrics such as `training_score`, `training_mean_squared_error`, and `training_r2_score`, the serialized model, and an input example for schema inference.


Step 5 — Evaluate the Model

Review metrics and visualize predictions.

# Cell 5 — Evaluation metrics
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"RMSE : {rmse:.4f}")
print(f"MAE  : {mae:.4f}")
print(f"R^2  : {r2:.4f}")
Expected output
RMSE : 12.3456
MAE  : 8.7654
R^2  : 0.9123
Exact values depend on seed data volume and distribution.
# Cell 6 — Predicted vs. Actual scatter plot
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(y_test, y_pred, alpha=0.3, s=10)
ax.plot([y_test.min(), y_test.max()],
        [y_test.min(), y_test.max()], "r--", lw=2)
ax.set_xlabel("Actual")
ax.set_ylabel("Predicted")
ax.set_title("Predicted vs. Actual — Sales Total Amount")
plt.tight_layout()
mlflow.log_figure(fig, "predicted_vs_actual.png")
display(fig)
Expected output: A scatter plot with points clustered along the red 45-degree line, indicating strong predictive performance.

Step 6 — Register the Model

Promote the trained model to the MLflow Model Registry, assigning a version and stage for downstream deployment.

# Cell 7 — Register the model
model_name = "csa-sales-regression"

model_uri = f"runs:/{run_id}/model"
result = mlflow.register_model(model_uri, model_name)

print(f"Model registered: {result.name} v{result.version}")
Expected output
Model registered: csa-sales-regression v1
# Cell 8 — Transition to Staging
from mlflow.tracking import MlflowClient

client = MlflowClient()
client.transition_model_version_stage(
    name=model_name, version=result.version, stage="Staging",
)
print(f"Model {model_name} v{result.version} moved to Staging")
Expected output
Model csa-sales-regression v1 moved to Staging
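Once a version is in Staging, downstream jobs can load it by stage rather than by run ID, so they always pick up the currently promoted version. A minimal sketch follows; the `registry_uri` helper is illustrative, while `mlflow.pyfunc.load_model` is MLflow's generic model loader.

```python
# Sketch: build a registry URI of the form models:/<name>/<stage> so batch
# jobs resolve whatever version currently holds that stage.
def registry_uri(name: str, stage: str = "Staging") -> str:
    return f"models:/{name}/{stage}"

# In the notebook (requires the model registered in Cells 7-8):
# import mlflow.pyfunc
# loaded = mlflow.pyfunc.load_model(registry_uri("csa-sales-regression"))
# print(loaded.predict(X_test.head()))
```

Loading by stage decouples scoring jobs from training runs: promoting v2 to Staging redirects consumers without any code change.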

Step 7 — Deploy for Scoring

With the model registered, deploy it as a REST endpoint. Two paths are available.

Option A — Databricks Model Serving

  1. In the Databricks sidebar, navigate to Serving.
  2. Click Create serving endpoint, select csa-sales-regression, and choose the latest version.
  3. Set compute size to Small for testing and click Create (provisions in 5-10 minutes).
# Cell 9 — Test the serving endpoint
import requests, json

endpoint_url = (
    "https://<databricks-host>/serving-endpoints/csa-sales-regression/invocations"
)
token = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()

payload = {
    "dataframe_records": [
        {"quantity": 5, "unit_price": 29.99, "sale_month": 3,
         "sale_dow": 2, "product_category": 7}
    ]
}

response = requests.post(
    endpoint_url,
    headers={"Authorization": f"Bearer {token}",
             "Content-Type": "application/json"},
    json=payload,
)
print(json.dumps(response.json(), indent=2))
Expected output
{
    "predictions": [149.95]
}

Option B — Azure ML Managed Online Endpoint

For production workloads requiring autoscaling or blue-green deployments, deploy through Azure ML (see guides/azure-ai-foundry.md for workspace setup).

az ml online-endpoint create --name csa-sales-endpoint \
    --resource-group <rg-name> --workspace-name <aml-workspace>

az ml online-deployment create --name blue \
    --endpoint-name csa-sales-endpoint \
    --model azureml:csa-sales-regression@latest \
    --instance-type Standard_DS3_v2 --instance-count 1

Test with az ml online-endpoint invoke.
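The invocation can be sketched as follows; the request file mirrors the `dataframe_records` payload used in Cell 9, and `sample-request.json` is an illustrative filename.

```shell
# Sketch: score one record against the managed online endpoint.
# sample-request.json mirrors the MLflow payload format from Cell 9.
cat > sample-request.json <<'EOF'
{
  "dataframe_records": [
    {"quantity": 5, "unit_price": 29.99, "sale_month": 3,
     "sale_dow": 2, "product_category": 7}
  ]
}
EOF

az ml online-endpoint invoke --name csa-sales-endpoint \
    --resource-group <rg-name> --workspace-name <aml-workspace> \
    --request-file sample-request.json
```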

For a complete ML lifecycle walkthrough including CI/CD for models, see examples/ml-lifecycle.md.


Troubleshooting

| Symptom | Likely Cause | Resolution |
| --- | --- | --- |
| `AnalysisException: Path does not exist` | Gold path incorrect or data not loaded | Verify the ADLS path; confirm the dbt pipeline ran |
| Cluster fails to start | Quota exceeded or SKU unavailable | Check Azure quotas; try a different node type |
| `ModuleNotFoundError: sklearn` | Cluster not using Runtime ML | Switch to a Databricks Runtime ML version |
| MLflow experiment not visible | Created in a different workspace folder | Search the Experiments sidebar for `/Shared/csa-inabox/sales-regression` |
| `RESOURCE_DOES_NOT_EXIST` on register | Run ID stale or cleaned up | Re-run the training cell for a fresh run ID |
| Serving endpoint returns 503 | Still provisioning or model load failed | Wait 5-10 min; check the endpoint logs |
| `PermissionDenied` on ADLS read | Missing Storage Blob Data Reader role | Assign the role at the storage-account level |

What's Next