# Data Scientist Quickstart
Last Updated: 2026-05-05 | Role: Data Scientist | Goal: Build, train, and deploy ML models using Fabric's integrated data science capabilities, from exploration through production serving.
## Persona & Typical Day
You explore data, build predictive models, run experiments, and deploy trained models to production endpoints. A typical day involves querying Lakehouse tables with Spark, running feature engineering notebooks, training models with AutoML or Spark ML, evaluating experiment metrics, and collaborating with data engineers to get the right features into gold tables.
You care about reproducibility, experiment tracking, model accuracy, feature quality, and being able to iterate quickly without managing infrastructure.
## Your First 30 Minutes
Follow these steps to train and evaluate your first model in Fabric:
1. **Set up your workspace and Lakehouses** - Ensure you have access to bronze/silver/gold tables that contain training data. Tutorial 00: Environment Setup
2. **Explore gold-layer features** - Review existing gold tables and understand the data available for modeling. Tutorial 03: Gold Layer
3. **Run an AutoML experiment** - Use Fabric's AutoML to automatically train and compare models on a forecasting or classification task. AutoML & Model Endpoints
4. **Explore Semantic Link** - Use Semantic Link to bridge Power BI semantic models and Spark notebooks for integrated analysis. Semantic Link
5. **Try AI Functions for compliance scoring** - See how AI Functions can enrich data with LLM-powered transformations inline in Spark. Tutorial 09: Advanced AI/ML
## Your First Week
| Day | Focus | Resource |
|---|---|---|
| 1 | Complete 30-minute path above | Tutorials 00, 03, 09 + AutoML docs |
| 2 | Build a churn prediction model with Spark ML | ML Notebook: Player Churn |
| 3 | Set up the Feature Store for reusable features | Feature Store Guide |
| 4 | Implement vector search with Eventhouse | Vector Database |
| 5 | Deploy a model and configure drift detection | MLOps Production Guide |
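Day 3's Feature Store work exists to prevent training/serving skew: one feature definition, consumed by both the training and inference paths. A minimal stdlib sketch of that idea (the feature name and schema are hypothetical, not the Fabric Feature Store API):

```python
# Hypothetical shared feature definition. The point: one definition,
# two consumers, so training and serving can never disagree.
def sessions_per_day(total_sessions: int, days_active: int) -> float:
    """Feature: average sessions per active day (illustrative)."""
    return total_sessions / max(days_active, 1)

# Training path: compute the feature over historical rows.
history = [
    {"total_sessions": 42, "days_active": 14},
    {"total_sessions": 5, "days_active": 0},
]
train_features = [
    sessions_per_day(r["total_sessions"], r["days_active"]) for r in history
]

# Serving path: the SAME function scores a live request.
live = {"total_sessions": 42, "days_active": 14}
serve_feature = sessions_per_day(live["total_sessions"], live["days_active"])

print(train_features, serve_feature)
```

A real feature store adds versioning, lineage, and point-in-time lookups on top of this, but the skew guarantee comes from exactly this sharing of definitions.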
## Key Features for Data Scientists
| Feature | Doc Link | Why It Matters |
|---|---|---|
| AutoML & Model Endpoints | AutoML Guide | Rapid model training with automatic algorithm selection and hyperparameter tuning |
| Semantic Link | Semantic Link | Bridge Power BI models and Spark notebooks - query semantic models from Python |
| Vector Database (Eventhouse) | Vector DB Guide | Store and search embeddings for RAG, similarity search, and recommendation systems |
| AI Functions | AI Copilot | LLM-powered inline data transformations in Spark notebooks |
| Feature Store | Feature Store | Centralized, versioned feature management for consistent model training |
| MLOps Production | MLOps Guide | Model registry, deployment, A/B testing, and production monitoring |
| Drift Detection | Drift Detection | Detect when production data distribution shifts from training data |
| Responsible AI | Responsible AI | Fairness, explainability, and bias detection frameworks |
| RAG Patterns | RAG Deep Dive | Retrieval-augmented generation patterns for knowledge-grounded AI |
| Prompt Engineering | Prompt Engineering | Best practices for working with LLMs in Fabric notebooks |
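The vector-database row above boils down to nearest-neighbour ranking over embeddings. A minimal numpy sketch of cosine-similarity retrieval (toy vectors and document IDs; not the Eventhouse/KQL API, where the vectors would live in a table with a vector column):

```python
import numpy as np

# Toy embedding store (illustrative 3-dimensional vectors).
doc_ids = ["faq-1", "faq-2", "faq-3"]
embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.1],
    [0.7, 0.6, 0.2],
])

def top_k_cosine(query: np.ndarray, vectors: np.ndarray, k: int = 2):
    """Rank stored vectors by cosine similarity to the query."""
    sims = vectors @ query / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    )
    order = np.argsort(-sims)[:k]
    return [(doc_ids[i], float(sims[i])) for i in order]

query = np.array([1.0, 0.0, 0.0])
print(top_k_cosine(query, embeddings))
```

In a RAG pipeline, the top-k documents returned by this lookup become the grounding context passed to the LLM.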
## Common Pitfalls
- **Training on raw Bronze data** - Bronze tables contain duplicates, nulls, and schema inconsistencies. Always train on validated Silver or curated Gold tables.
- **Skipping experiment tracking** - Without tracked metrics, parameters, and artifacts, you cannot reproduce results or compare runs. Use the MLflow experiment tracking built into Fabric.
- **Building features in notebooks instead of the Feature Store** - Ad hoc feature engineering in notebooks leads to training/serving skew. Use the Feature Store for consistent, reusable features across training and inference.
- **Ignoring model drift after deployment** - A model that was accurate at training time degrades as real-world data distributions shift. Set up drift detection monitoring from day one. See the Drift Detection Guide.
- **Not leveraging Semantic Link** - Many data scientists query raw tables when the semantic model already encodes the right business logic (measures, relationships). Semantic Link lets you use those definitions directly in Spark.
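The drift pitfall can be made concrete with a simple distribution-shift statistic. A numpy sketch using the Population Stability Index (PSI) on one feature; the 0.2 alert threshold is a common rule of thumb, not a Fabric default, and a real monitor would run this per feature on a schedule:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between training-time and live samples.
    Rule of thumb (an assumption, tune per use case): PSI > 0.2 = drift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid log(0) / division by zero on empty bins.
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 5000)    # feature at training time
stable = rng.normal(0.0, 1.0, 5000)   # production, no shift
shifted = rng.normal(0.8, 1.3, 5000)  # production, shifted distribution

print(f"stable PSI:  {psi(train, stable):.3f}")
print(f"shifted PSI: {psi(train, shifted):.3f}")
```

The stable sample stays well under the threshold while the shifted one exceeds it, which is the signal a drift monitor would alert on.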
## Related Resources
- **AutoML & Model Endpoints** - Automated model training, comparison, and deployment to real-time scoring endpoints.
- **Vector Database** - Eventhouse-based vector storage for embeddings, similarity search, and RAG applications.
- **MLOps Production** - End-to-end ML lifecycle management from experiment to production serving.
- **RAG Patterns** - Retrieval-augmented generation architectures using Fabric's data and compute.