# Data Scientist Quickstart
Last Updated: 2026-05-05 | Role: Data Scientist | Goal: Build, train, and deploy ML models using Fabric's integrated data science capabilities, from exploration through production serving.
## Persona & Typical Day
You explore data, build predictive models, run experiments, and deploy trained models to production endpoints. A typical day involves querying Lakehouse tables with Spark, running feature engineering notebooks, training models with AutoML or Spark ML, evaluating experiment metrics, and collaborating with data engineers to get the right features into gold tables.
You care about reproducibility, experiment tracking, model accuracy, feature quality, and being able to iterate quickly without managing infrastructure.
## Your First 30 Minutes
Follow these steps to train and evaluate your first model in Fabric:
1. **Set up your workspace and Lakehouses** - Ensure you have access to bronze/silver/gold tables that contain training data. Tutorial 00: Environment Setup
2. **Explore gold-layer features** - Review existing gold tables and understand the data available for modeling. Tutorial 03: Gold Layer
3. **Run an AutoML experiment** - Use Fabric's AutoML to automatically train and compare models on a forecasting or classification task. AutoML & Model Endpoints
4. **Explore Semantic Link** - Use Semantic Link to bridge Power BI semantic models and Spark notebooks for integrated analysis. Semantic Link
5. **Try AI Functions for compliance scoring** - See how AI Functions can enrich data with LLM-powered transformations inline in Spark. Tutorial 09: Advanced AI/ML
## Your First Week
| Day | Focus | Resource |
|---|---|---|
| 1 | Complete 30-minute path above | Tutorials 00, 03, 09 + AutoML docs |
| 2 | Build a churn prediction model with Spark ML | ML Notebook: Player Churn |
| 3 | Set up the Feature Store for reusable features | Feature Store Guide |
| 4 | Implement vector search with Eventhouse | Vector Database |
| 5 | Deploy a model and configure drift detection | MLOps Production Guide |
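Day 3's Feature Store work exists to prevent training/serving skew: one feature definition, consumed by both the training and inference paths. A minimal stdlib sketch of that idea (the feature name and schema are hypothetical, not the Fabric Feature Store API):

```python
# Hypothetical shared feature definition. The point: one definition,
# two consumers, so training and serving can never disagree.
def sessions_per_day(total_sessions: int, days_active: int) -> float:
    """Feature: average sessions per active day (illustrative)."""
    return total_sessions / max(days_active, 1)

# Training path: compute the feature over historical rows.
history = [
    {"total_sessions": 42, "days_active": 14},
    {"total_sessions": 5, "days_active": 0},
]
train_features = [
    sessions_per_day(r["total_sessions"], r["days_active"]) for r in history
]

# Serving path: the SAME function scores a live request.
live = {"total_sessions": 42, "days_active": 14}
serve_feature = sessions_per_day(live["total_sessions"], live["days_active"])

print(train_features, serve_feature)
```

A real feature store adds versioning, lineage, and point-in-time lookups on top of this, but the skew guarantee comes from exactly this sharing of definitions.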
## Key Features for Data Scientists
| Feature | Doc Link | Why It Matters |
|---|---|---|
| AutoML & Model Endpoints | AutoML Guide | Rapid model training with automatic algorithm selection and hyperparameter tuning |
| Semantic Link | Semantic Link | Bridge Power BI models and Spark notebooks - query semantic models from Python |
| Vector Database (Eventhouse) | Vector DB Guide | Store and search embeddings for RAG, similarity search, and recommendation systems |
| AI Functions | AI Copilot | LLM-powered inline data transformations in Spark notebooks |
| Feature Store | Feature Store | Centralized, versioned feature management for consistent model training |
| MLOps Production | MLOps Guide | Model registry, deployment, A/B testing, and production monitoring |
| Drift Detection | Drift Detection | Detect when production data distribution shifts from training data |
| Responsible AI | Responsible AI | Fairness, explainability, and bias detection frameworks |
| RAG Patterns | RAG Deep Dive | Retrieval-augmented generation patterns for knowledge-grounded AI |
| Prompt Engineering | Prompt Engineering | Best practices for working with LLMs in Fabric notebooks |
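The vector-database row above boils down to nearest-neighbour ranking over embeddings. A minimal numpy sketch of cosine-similarity retrieval (toy vectors and document IDs; not the Eventhouse/KQL API, where the vectors would live in a table with a vector column):

```python
import numpy as np

# Toy embedding store (illustrative 3-dimensional vectors).
doc_ids = ["faq-1", "faq-2", "faq-3"]
embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.1],
    [0.7, 0.6, 0.2],
])

def top_k_cosine(query: np.ndarray, vectors: np.ndarray, k: int = 2):
    """Rank stored vectors by cosine similarity to the query."""
    sims = vectors @ query / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    )
    order = np.argsort(-sims)[:k]
    return [(doc_ids[i], float(sims[i])) for i in order]

query = np.array([1.0, 0.0, 0.0])
print(top_k_cosine(query, embeddings))
```

In a RAG pipeline, the top-k documents returned by this lookup become the grounding context passed to the LLM.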
## Common Pitfalls
- **Training on raw Bronze data** - Bronze tables contain duplicates, nulls, and schema inconsistencies. Always train on validated Silver or curated Gold tables.
- **Skipping experiment tracking** - Without tracked metrics, parameters, and artifacts, you cannot reproduce results or compare runs. Use the MLflow experiment tracking built into Fabric.
- **Building features in notebooks instead of the Feature Store** - Ad hoc feature engineering in notebooks leads to training/serving skew. Use the Feature Store for consistent, reusable features across training and inference.
- **Ignoring model drift after deployment** - A model that was accurate at training time degrades as real-world data distributions shift. Set up drift detection monitoring from day one. See the Drift Detection Guide.
- **Not leveraging Semantic Link** - Many data scientists query raw tables when the semantic model already encodes the right business logic (measures, relationships). Semantic Link lets you use those definitions directly in Spark.
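The drift pitfall can be made concrete with a simple distribution-shift statistic. A numpy sketch using the Population Stability Index (PSI) on one feature; the 0.2 alert threshold is a common rule of thumb, not a Fabric default, and a real monitor would run this per feature on a schedule:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between training-time and live samples.
    Rule of thumb (an assumption, tune per use case): PSI > 0.2 = drift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid log(0) / division by zero on empty bins.
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 5000)    # feature at training time
stable = rng.normal(0.0, 1.0, 5000)   # production, no shift
shifted = rng.normal(0.8, 1.3, 5000)  # production, shifted distribution

print(f"stable PSI:  {psi(train, stable):.3f}")
print(f"shifted PSI: {psi(train, shifted):.3f}")
```

The stable sample stays well under the threshold while the shifted one exceeds it, which is the signal a drift monitor would alert on.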
## Related Resources
- **AutoML & Model Endpoints** - Automated model training, comparison, and deployment to real-time scoring endpoints.
- **Vector Database** - Eventhouse-based vector storage for embeddings, similarity search, and RAG applications.
- **MLOps Production** - End-to-end ML lifecycle management from experiment to production serving.
- **RAG Patterns** - Retrieval-augmented generation architectures using Fabric's data and compute.