Data Scientist Quickstart

Last Updated: 2026-05-05 | Role: Data Scientist | Goal: Build, train, and deploy ML models using Fabric's integrated data science capabilities — from exploration through production serving.


Persona & Typical Day

You explore data, build predictive models, run experiments, and deploy trained models to production endpoints. A typical day involves querying Lakehouse tables with Spark, running feature engineering notebooks, training models with AutoML or Spark ML, evaluating experiment metrics, and collaborating with data engineers to get the right features into gold tables.

You care about reproducibility, experiment tracking, model accuracy, feature quality, and being able to iterate quickly without managing infrastructure.


Your First 30 Minutes

Follow these steps to train and evaluate your first model in Fabric:

  1. Set up your workspace and Lakehouses - Ensure you have access to bronze/silver/gold tables that contain training data. Tutorial 00: Environment Setup

  2. Explore gold-layer features - Review existing gold tables and understand the data available for modeling. Tutorial 03: Gold Layer

  3. Run an AutoML experiment - Use Fabric's AutoML to automatically train and compare models on a forecasting or classification task. AutoML & Model Endpoints

  4. Explore Semantic Link - Use Semantic Link to bridge Power BI semantic models and Spark notebooks for integrated analysis. Semantic Link

  5. Try AI Functions for compliance scoring - See how AI Functions can enrich data with LLM-powered transformations inline in Spark. Tutorial 09: Advanced AI/ML
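The AutoML step in the path above automates what is, at heart, a train-compare-select loop. A dependency-free sketch of that loop (the threshold "models" and toy data are invented for illustration and are not Fabric's AutoML API):

```python
import random

random.seed(42)

# Toy labeled data: label is 1 when the feature exceeds 0.5 (invented stand-in
# for a real Lakehouse training table).
data = [(x, int(x > 0.5)) for x in (random.random() for _ in range(200))]
train, test = data[:150], data[150:]

# Candidate "models": threshold classifiers with one hyperparameter each.
# Real AutoML searches whole algorithm families this way, just far smarter.
candidates = [{"threshold": t} for t in (0.3, 0.4, 0.5, 0.6, 0.7)]

def accuracy(params, rows):
    return sum((x > params["threshold"]) == bool(y) for x, y in rows) / len(rows)

# Evaluate every candidate on the training split and keep the best.
leaderboard = sorted(candidates, key=lambda p: accuracy(p, train), reverse=True)
best = leaderboard[0]
print(best, round(accuracy(best, test), 3))
```

The point is the shape of the loop: a search space, a scoring metric, and a leaderboard — AutoML's value is running this at scale with smarter search and automatic logging.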


Your First Week

Day | Focus | Resource
1 | Complete 30-minute path above | Tutorials 00, 03, 09 + AutoML docs
2 | Build a churn prediction model with Spark ML | ML Notebook: Player Churn
3 | Set up the Feature Store for reusable features | Feature Store Guide
4 | Implement vector search with Eventhouse | Vector Database
5 | Deploy a model and configure drift detection | MLOps Production Guide
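Day 2's churn model is a binary classifier at its core. A stdlib-only sketch of the training loop (the one-feature toy data and 14-day churn cutoff are invented; in Fabric you would use Spark ML's LogisticRegression on gold-table features, distributed and regularized):

```python
import math
import random

random.seed(0)

# Toy churn rows: feature = days since last login (scaled to [0, 1]),
# label = 1 if the player churned (here, inactive for more than 14 days).
rows = [(d / 30.0, int(d > 14)) for d in (random.uniform(0, 30) for _ in range(300))]

def sigmoid(z):
    if z < -500:  # avoid math.exp overflow for extreme inputs
        return 0.0
    return 1.0 / (1.0 + math.exp(-z))

# Plain batch gradient descent on log-loss.
w = b = 0.0
lr = 1.0
for _ in range(4000):
    gw = gb = 0.0
    for x, y in rows:
        err = sigmoid(w * x + b) - y
        gw += err * x
        gb += err
    w -= lr * gw / len(rows)
    b -= lr * gb / len(rows)

acc = sum((sigmoid(w * x + b) > 0.5) == bool(y) for x, y in rows) / len(rows)
print(round(acc, 3))
```

Feature scaling matters here: dividing days by 30 keeps the gradient steps stable at a fixed learning rate, which Spark ML handles for you via standardization.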

Key Features for Data Scientists

Feature | Doc Link | Why It Matters
AutoML & Model Endpoints | AutoML Guide | Rapid model training with automatic algorithm selection and hyperparameter tuning
Semantic Link | Semantic Link | Bridge Power BI models and Spark notebooks - query semantic models from Python
Vector Database (Eventhouse) | Vector DB Guide | Store and search embeddings for RAG, similarity search, and recommendation systems
AI Functions | AI Copilot | LLM-powered inline data transformations in Spark notebooks
Feature Store | Feature Store | Centralized, versioned feature management for consistent model training
MLOps Production | MLOps Guide | Model registry, deployment, A/B testing, and production monitoring
Drift Detection | Drift Detection | Detect when production data distribution shifts from training data
Responsible AI | Responsible AI | Fairness, explainability, and bias detection frameworks
RAG Patterns | RAG Deep Dive | Retrieval-augmented generation patterns for knowledge-grounded AI
Prompt Engineering | Prompt Engineering | Best practices for working with LLMs in Fabric notebooks
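The vector database row above reduces to nearest-neighbor search over embeddings. A minimal cosine-similarity sketch in plain Python (the three-dimensional toy vectors stand in for real embeddings; Eventhouse performs this ranking at scale):

```python
import math

# Toy "embedding store": id -> vector. Real embeddings have hundreds of dims.
store = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def top_k(query, k=2):
    # Rank stored vectors by cosine similarity to the query vector.
    ranked = sorted(store, key=lambda i: cosine(store[i], query), reverse=True)
    return ranked[:k]

print(top_k([1.0, 0.05, 0.0]))  # doc_a then doc_b
```

The same ranking underlies RAG retrieval and recommendations; the vector store's job is doing it efficiently over millions of vectors with approximate-nearest-neighbor indexes.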

Common Pitfalls

  1. Training on raw Bronze data - Bronze tables contain duplicates, nulls, and schema inconsistencies. Always train on validated Silver or curated Gold tables.

  2. Skipping experiment tracking - Without tracking metrics, parameters, and artifacts, you cannot reproduce results or compare runs. Use MLflow experiment tracking built into Fabric.

  3. Building features in notebooks instead of the Feature Store - Ad-hoc feature engineering in notebooks leads to training/serving skew. Use the Feature Store for consistent, reusable features across training and inference.

  4. Ignoring model drift after deployment - A model that was accurate at training time degrades as real-world data distributions shift. Set up drift detection monitoring from day one. See the Drift Detection Guide.

  5. Not leveraging Semantic Link - Many data scientists query raw tables when the semantic model already has the right business logic (measures, relationships). Semantic Link lets you use those definitions directly in Spark.
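Pitfall 4's drift monitoring often starts with a distribution-distance check between training and live data. A stdlib sketch of the population stability index (the bin edges and any alert threshold you pick are illustrative conventions, not Fabric defaults):

```python
import math

def psi(expected, actual, edges):
    """Population Stability Index between two samples, binned by `edges`."""
    def shares(sample):
        counts = [0] * (len(edges) + 1)
        for x in sample:
            i = sum(x > e for e in edges)  # index of the bin containing x
            counts[i] += 1
        # Small floor avoids log(0) for empty bins.
        return [max(c / len(sample), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
live_same = list(train)
live_shifted = [x + 0.5 for x in train]
edges = [0.25, 0.5, 0.75]

print(psi(train, live_same, edges))     # ~0: no drift
print(psi(train, live_shifted, edges))  # large: drift — investigate or retrain
```

A common rule of thumb treats PSI above roughly 0.2 as meaningful drift; production monitors run checks like this per feature on a schedule.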


Deep-Dive Guides

  • AutoML & Model Endpoints - Automated model training, comparison, and deployment to real-time scoring endpoints. AutoML Guide

  • Vector Database - Eventhouse-based vector storage for embeddings, similarity search, and RAG applications. Vector DB Guide

  • MLOps Production - End-to-end ML lifecycle management from experiment to production serving. MLOps Guide

  • RAG Patterns - Retrieval-augmented generation architectures using Fabric's data and compute. RAG Deep Dive
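The RAG pattern above can be sketched end to end in a few lines: retrieve the most relevant chunks, then build a grounded prompt. Here term-overlap scoring stands in for embedding similarity, and the chunk texts are invented; a real pipeline ranks by vector similarity and sends the prompt to an LLM:

```python
# Minimal retrieve-then-generate skeleton.
chunks = [
    "Gold tables hold curated, analysis-ready features.",
    "Bronze tables store raw ingested events.",
    "Drift detection compares live data to training data.",
]

def tokens(text):
    # Crude tokenizer: lowercase, strip basic punctuation, split on spaces.
    return set(text.lower().replace("?", " ").replace(".", " ").replace(",", " ").split())

def build_prompt(query, k=2):
    # Retrieve: rank chunks by how many query terms they share.
    ranked = sorted(chunks, key=lambda c: len(tokens(query) & tokens(c)), reverse=True)
    context = "\n".join(f"- {c}" for c in ranked[:k])
    # Ground: constrain the model to the retrieved context.
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("Which tables hold curated features?"))
```

Swapping the scorer for embedding lookups against the vector store turns this toy into the standard retrieval layer; the prompt-assembly step stays essentially the same.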