⭐ Data Modeling & Star Schema Best Practices for Microsoft Fabric¶
Design Performant Dimensional Models Optimized for Direct Lake and Power BI
Last Updated: 2026-04-21 | Version: 1.0.0
📑 Table of Contents¶
- 🎯 Overview
- 📐 Dimensional Modeling Principles
- 📊 Fact Table Design
- 🏷️ Dimension Design
- 🔗 Advanced Dimension Patterns
- 🌐 Conformed Dimensions
- ⚡ Direct Lake Optimization
- 🎰 Casino Implementation
- 🏛️ Federal Implementation
- 🚫 Anti-Patterns
- 📉 Limitations
- 📚 References
🎯 Overview¶
Dimensional modeling (star schema) is the foundational data modeling pattern for the Gold layer in Microsoft Fabric. A well-designed star schema maximizes Direct Lake performance, simplifies Power BI measures, and provides intuitive navigation for business users.
Why Star Schema in Fabric?¶
| Benefit | Description |
|---|---|
| Direct Lake performance | Narrow fact tables with integer keys load column segments efficiently |
| DAX simplicity | Relationships auto-resolve; no complex CALCULATE + FILTER chains |
| Business readability | Dimensions use business names; facts contain measures analysts understand |
| Reusability | Conformed dimensions serve multiple fact tables and subject areas |
| Scalability | Add new facts/dimensions without restructuring existing models |
| Power BI compatibility | Star schema is the recommended model for Power BI semantic models |
Star Schema vs Snowflake in Fabric¶
| Aspect | Star Schema | Snowflake Schema |
|---|---|---|
| Dimension normalization | Denormalized (flat) | Normalized (multiple tables) |
| Query performance | Faster (fewer joins) | Slower (more joins) |
| Direct Lake compatibility | ✅ Optimal | ⚠️ Suboptimal (extra joins) |
| Storage | Slightly more (denormalized) | Slightly less |
| DAX complexity | Simple relationships | Requires bridge logic |
| Maintenance | Easier (fewer tables) | Harder (more tables to update) |
| Recommendation for Fabric | ✅ Always preferred | ❌ Avoid unless required for extreme cardinality |
erDiagram
FACT_TABLE {
bigint fact_id PK
int date_key FK
int dim_a_key FK
int dim_b_key FK
decimal measure_1
decimal measure_2
int count_measure
}
DIM_DATE {
int date_key PK
date full_date
int year
int quarter
int month
string day_name
}
DIM_A {
int dim_a_key PK
string name
string category
string attribute_1
}
DIM_B {
int dim_b_key PK
string description
string group
}
FACT_TABLE }o--|| DIM_DATE : date_key
FACT_TABLE }o--|| DIM_A : dim_a_key
FACT_TABLE }o--|| DIM_B : dim_b_key
📐 Dimensional Modeling Principles¶
The Four-Step Design Process¶
flowchart LR
S1["1️⃣ Select<br/>Business Process"] --> S2["2️⃣ Declare<br/>Grain"]
S2 --> S3["3️⃣ Identify<br/>Dimensions"]
S3 --> S4["4️⃣ Identify<br/>Facts (Measures)"]
style S1 fill:#6C3483,color:#fff
style S2 fill:#2471A3,color:#fff
style S3 fill:#27AE60,color:#fff
style S4 fill:#E67E22,color:#fff
| Step | Question to Answer | Example (Slot Performance) |
|---|---|---|
| 1. Business process | What process are we measuring? | Slot machine play sessions |
| 2. Grain | What does one row represent? | One machine, one day |
| 3. Dimensions | How do users slice/filter? | Date, Machine, Property, Player |
| 4. Facts | What are we measuring? | coin_in, coin_out, net_win, handle_pulls |
Grain Definition Rules¶
The grain is the most critical decision. It determines what one row in the fact table represents.
| Rule | Description |
|---|---|
| Atomic grain | Prefer the finest level of detail available |
| Single grain per fact | Never mix grains in one fact table |
| Document the grain | Add a comment to every fact table defining the grain |
| Grain drives dimensions | Every dimension must be at or above the grain level |
| Aggregate separately | Pre-aggregated KPIs belong in separate fact tables |
# Example: Grain documentation in table creation
# GRAIN: One row per machine per day per property
from pyspark.sql import functions as F

df_fact = (
spark.table("lh_silver.slot_telemetry_cleansed")
.groupBy("machine_id", "property_id",
F.date_trunc("day", "event_timestamp").alias("metric_date"))
.agg(
F.sum("coin_in").alias("coin_in"),
F.sum("coin_out").alias("coin_out"),
F.sum("theoretical_win").alias("theoretical_win"),
F.count("*").alias("event_count"),
F.countDistinct("player_id").alias("unique_players"),
)
)
📊 Fact Table Design¶
Measure Types¶
| Type | Definition | Example | Aggregation |
|---|---|---|---|
| Additive | Can be summed across all dimensions | coin_in, net_win, event_count | SUM always valid |
| Semi-additive | Can be summed across some dimensions, but not across time | account_balance, inventory_count | SUM across non-time dimensions; AVG/LAST over time (see sketch below) |
| Non-additive | Cannot be summed across any dimension | hold_percentage, avg_session_minutes | Recalculate from components |
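Semi-additive handling is easiest to see in code. The following is a minimal PySpark sketch (table and column names are assumed from the examples below) that derives a deterministic end-of-day reading by explicitly ordering events within each machine-day:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
# Latest meter reading per machine per day; the explicit ordering makes
# the "last" value deterministic (illustrative column names)
w = Window.partitionBy("machine_key", "date_key").orderBy(F.col("event_timestamp").desc())
df_eod_meter = (
    spark.table("lh_silver.slot_telemetry_cleansed")
    .withColumn("_rn", F.row_number().over(w))
    .filter("_rn = 1")
    .select("date_key", "machine_key", "cumulative_meter")
)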
Fact Table Patterns¶
# Transaction fact: finest grain, additive measures
# GRAIN: One row per slot event (handle pull)
df_fact_slot_events = (
spark.table("lh_silver.slot_telemetry_cleansed")
.select(
F.monotonically_increasing_id().alias("event_key"),
F.col("date_key"), # FK to dim_date
F.col("machine_key"), # FK to dim_machine
F.col("property_key"), # FK to dim_property
F.col("player_key"), # FK to dim_player
F.col("coin_in"), # Additive
F.col("coin_out"), # Additive
F.col("jackpot_amount"), # Additive
F.col("theoretical_win"), # Additive
)
)
# Periodic snapshot fact: regular intervals, semi-additive measures
# GRAIN: One row per machine per day (end-of-day snapshot)
df_fact_daily_snapshot = (
spark.table("lh_silver.slot_telemetry_cleansed")
.groupBy("date_key", "machine_key", "property_key")
.agg(
F.sum("coin_in").alias("daily_coin_in"), # Additive
        F.last("cumulative_meter").alias("meter_reading"), # Semi-additive; note F.last is non-deterministic in groupBy unless input is ordered
F.avg("session_duration").alias("avg_session"), # Non-additive
)
)
# Accumulating snapshot fact: tracks process milestones
# GRAIN: One row per compliance filing (lifecycle tracking)
df_fact_compliance = (
spark.table("lh_silver.ctr_filings_verified")
.select(
"filing_key",
"date_filed_key", # When filed
"date_reviewed_key", # When reviewed (nullable until reviewed)
"date_submitted_key", # When submitted to FinCEN (nullable)
"filing_amount", # Additive
"filing_status", # Degenerate dimension
)
)
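Unlike transaction facts, an accumulating snapshot row is updated in place as milestones occur. A minimal merge sketch, assuming lh_gold.fact_compliance already exists as a Delta table:
from delta.tables import DeltaTable
(
    DeltaTable.forName(spark, "lh_gold.fact_compliance").alias("t")
    .merge(df_fact_compliance.alias("s"), "t.filing_key = s.filing_key")
    .whenMatchedUpdate(set={  # refresh milestone keys as the filing progresses
        "date_reviewed_key": "s.date_reviewed_key",
        "date_submitted_key": "s.date_submitted_key",
        "filing_status": "s.filing_status",
    })
    .whenNotMatchedInsertAll()
    .execute()
)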
Factless Fact Tables¶
Factless fact tables record events that have no measures — they simply capture the occurrence of a dimensional intersection.
# Factless: Player presence on the floor (no monetary measure)
# GRAIN: One row per player per machine per hour of presence
df_fact_player_presence = (
spark.table("lh_silver.player_activity_validated")
.select(
"date_key", "hour_key", "player_key", "machine_key", "property_key"
)
)
# Use: COUNT(*) to answer "How many players were on floor section A at 9 PM?"
🏷️ Dimension Design¶
Slowly Changing Dimensions (SCD)¶
| Type | Behavior | Use Case | Storage Impact |
|---|---|---|---|
| SCD Type 1 | Overwrite old value | Correcting data errors, attributes where history is irrelevant | Lowest |
| SCD Type 2 | Add new row, expire old row | Track full history (player tier changes, address moves) | Highest |
| SCD Type 3 | Add new column for previous value | Track limited history (current + previous only) | Medium |
SCD Type 2 Implementation¶
# SCD Type 2: dim_player with full history tracking
from delta.tables import DeltaTable
from pyspark.sql import functions as F
def apply_scd2(spark, target_table: str, source_df, key_col: str, tracked_cols: list):
    """Generic SCD Type 2 merge for any dimension."""
    target = DeltaTable.forName(spark, target_table)

    # Find source records whose tracked attributes differ from the current
    # target rows; <=> is null-safe equality, so NULL transitions are detected
    change_condition = " OR ".join([f"NOT (src.{c} <=> tgt.{c})" for c in tracked_cols])
    df_changes = (
        source_df.alias("src")
        .join(
            spark.table(target_table).filter("_is_current = true").alias("tgt"),
            on=F.expr(f"src.{key_col} = tgt.{key_col}"),
            how="inner",
        )
        .filter(change_condition)
    )
    if df_changes.count() == 0:
        return  # No changes

    df_changed_keys = df_changes.select(f"src.{key_col}").distinct()

    # Expire the current version of each changed record
    (
        target.alias("tgt")
        .merge(df_changed_keys.alias("chg"),
               f"tgt.{key_col} = chg.{key_col} AND tgt._is_current = true")
        .whenMatchedUpdate(set={
            "_is_current": "false",
            "_effective_end": "current_timestamp()",
        })
        .execute()
    )

    # Insert the new current version of each changed record
    df_new = (
        source_df.alias("src")
        .join(df_changed_keys.alias("chg"),
              on=F.expr(f"src.{key_col} = chg.{key_col}"),
              how="inner")
        .select("src.*")  # drop the duplicate join key carried in from chg
        # NOTE: monotonically_increasing_id() is not stable across runs; see Limitations
        .withColumn("_surrogate_key", F.monotonically_increasing_id())
        .withColumn("_is_current", F.lit(True))
        .withColumn("_effective_start", F.current_timestamp())
        .withColumn("_effective_end", F.lit(None).cast("timestamp"))
    )
    df_new.write.format("delta").mode("append").saveAsTable(target_table)
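A sample invocation might look like the following; the Silver source table and tracked columns are illustrative assumptions:
apply_scd2(
    spark,
    target_table="lh_gold.dim_player",
    source_df=spark.table("lh_silver.players_cleansed"),  # hypothetical source
    key_col="player_id",
    tracked_cols=["loyalty_tier", "age_group", "property_id"],
)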
SCD Type 2 Schema¶
CREATE TABLE lh_gold.dim_player (
player_key BIGINT, -- Surrogate key
player_id STRING, -- Natural/business key
player_name STRING,
loyalty_tier STRING, -- Tracked: SCD2 triggers on change
email_domain STRING,
age_group STRING,
enrollment_date DATE,
property_id STRING, -- Home property
_is_current BOOLEAN, -- Current row flag
_effective_start TIMESTAMP, -- Row validity start
_effective_end TIMESTAMP -- Row validity end (NULL = current)
) USING DELTA;
Role-Playing Dimensions¶
A single dimension table used multiple times in a fact table via different foreign keys.
erDiagram
fact_compliance_filing {
bigint filing_key PK
int date_filed_key FK
int date_reviewed_key FK
int date_submitted_key FK
decimal filing_amount
}
dim_date {
int date_key PK
date full_date
string month_name
int fiscal_year
}
fact_compliance_filing }o--|| dim_date : "date_filed_key"
fact_compliance_filing }o--|| dim_date : "date_reviewed_key"
fact_compliance_filing }o--|| dim_date : "date_submitted_key" Power BI Note: In Direct Lake semantic models, create three inactive relationships and use
USERELATIONSHIP()in DAX measures to activate the desired date role.
// DAX: Use role-playing date dimension
Filed Count = CALCULATE(
COUNT(fact_compliance_filing[filing_key]),
USERELATIONSHIP(fact_compliance_filing[date_filed_key], dim_date[date_key])
)
Submitted Count = CALCULATE(
COUNT(fact_compliance_filing[filing_key]),
USERELATIONSHIP(fact_compliance_filing[date_submitted_key], dim_date[date_key])
)
🔗 Advanced Dimension Patterns¶
Junk Dimensions¶
Combine low-cardinality flags and indicators into a single dimension instead of polluting the fact table.
# Junk dimension: Combine boolean/flag attributes
df_junk = (
spark.table("lh_silver.slot_telemetry_cleansed")
.select("is_jackpot", "is_bonus_round", "is_loyalty_play", "denomination_type")
.distinct()
.withColumn("junk_key", F.monotonically_increasing_id())
)
df_junk.write.format("delta").mode("overwrite").saveAsTable("lh_gold.dim_play_attributes")
# Result: ~50 rows instead of four flag columns repeated on a billion-row fact table
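To stamp the key onto fact rows, join back on the flag combination and keep only the integer junk_key — a sketch using the same column names as above:
df_fact_flagged = (
    spark.table("lh_silver.slot_telemetry_cleansed")
    .join(
        spark.table("lh_gold.dim_play_attributes"),
        ["is_jackpot", "is_bonus_round", "is_loyalty_play", "denomination_type"],
        "left",
    )
    .drop("is_jackpot", "is_bonus_round", "is_loyalty_play", "denomination_type")
)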
Degenerate Dimensions¶
Dimension attributes stored directly on the fact table (no separate dimension table needed).
| Example | Why Degenerate |
|---|---|
transaction_id | Unique per transaction; no attributes to store |
filing_number | Reference number with no additional attributes |
batch_id | Operational metadata; no business attributes |
Bridge Tables (Many-to-Many)¶
erDiagram
fact_player_session {
bigint session_key PK
int date_key FK
int player_key FK
int machine_key FK
decimal total_wager
}
bridge_player_promotion {
int player_key FK
int promotion_key FK
decimal weight_factor
}
dim_promotion {
int promotion_key PK
string promotion_name
string promotion_type
date start_date
date end_date
}
fact_player_session }o--|| bridge_player_promotion : player_key
bridge_player_promotion }o--|| dim_promotion : promotion_key
# Bridge table: Player can have multiple active promotions
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df_bridge = (
spark.table("lh_silver.player_promotions_validated")
.select("player_key", "promotion_key")
.withColumn("weight_factor",
F.lit(1.0) / F.count("*").over(Window.partitionBy("player_key"))
)
)
df_bridge.write.format("delta").mode("overwrite").saveAsTable("lh_gold.bridge_player_promotion")
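The weight_factor exists to prevent double counting when aggregating through the bridge. A sketch of a weighted allocation (table names follow the diagram above):
df_promo_wager = (
    spark.table("lh_gold.fact_player_session").alias("f")
    .join(spark.table("lh_gold.bridge_player_promotion").alias("b"), "player_key")
    .groupBy("b.promotion_key")
    .agg(F.sum(F.col("f.total_wager") * F.col("b.weight_factor")).alias("allocated_wager"))
)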
🌐 Conformed Dimensions¶
Conformed dimensions are shared across multiple fact tables and subject areas, ensuring consistent filtering and aggregation.
Core Conformed Dimensions¶
| Dimension | Shared Across | Key Attributes |
|---|---|---|
dim_date | All fact tables | full_date, year, quarter, month, fiscal_year, is_weekend, is_holiday |
dim_property | Casino facts, compliance facts | property_id, property_name, city, state, region |
dim_player | Session facts, compliance facts, marketing facts | player_id, loyalty_tier, enrollment_date |
dim_agency | All federal fact tables | agency_code, agency_name, bureau, sub_agency |
Date Dimension Generator¶
# Generate dim_date: covers 10 years with fiscal calendar
from datetime import date, timedelta
def generate_dim_date(start_year: int = 2020, end_year: int = 2030) -> list[dict]:
    """Generate a comprehensive date dimension."""
    rows = []
    d = date(start_year, 1, 1)
    key = 1
    while d <= date(end_year, 12, 31):
        # Federal fiscal year runs Oct-Sep and is named for its ending year
        fiscal_year = d.year + 1 if d.month >= 10 else d.year
        rows.append({
            "date_key": key,
            "full_date": d,
            "year": d.year,
            "quarter": (d.month - 1) // 3 + 1,
            "month": d.month,
            "month_name": d.strftime("%B"),
            "day_of_week": d.isoweekday(),
            "day_name": d.strftime("%A"),
            "is_weekend": d.isoweekday() >= 6,
            "is_holiday": False,  # Populate from holiday calendar
            "fiscal_year": fiscal_year,
            "fiscal_quarter": ((d.month - 10) % 12) // 3 + 1,
            "week_of_year": d.isocalendar()[1],
        })
        key += 1
        d += timedelta(days=1)
    return rows
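The generated rows can then be materialized to the Gold lakehouse, for example:
df_dim_date = spark.createDataFrame(generate_dim_date())
df_dim_date.write.format("delta").mode("overwrite").saveAsTable("lh_gold.dim_date")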
⚡ Direct Lake Optimization¶
Designing star schemas specifically for Direct Lake performance in Fabric.
Optimization Guidelines¶
| Guideline | Rationale |
|---|---|
| Use integer surrogate keys | Integer keys compress better than string keys in columnar format |
| Keep fact tables narrow | Fewer columns = faster column segment loading by Direct Lake |
| Enable V-Order | Fabric's columnar optimization for Direct Lake read patterns |
| Avoid calculated columns | Complex DAX calculated columns force fallback to DirectQuery |
| Flatten nested types | Direct Lake does not support STRUCT, ARRAY, or MAP types |
| Pre-aggregate when possible | Reduces row count; Direct Lake scans fewer segments |
| Partition Gold tables | Partition by date for time-range filtering efficiency |
V-Order Configuration¶
# Enable V-Order for all Gold table writes
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")
# Write fact table with V-Order
(
df_fact.write
.format("delta")
.mode("overwrite")
.option("replaceWhere", f"metric_date = '{target_date}'")
.saveAsTable("lh_gold.fact_slot_metrics")
)
Direct Lake Fallback Prevention¶
| Fallback Trigger | Prevention |
|---|---|
| Complex calculated columns | Pre-compute in Spark; store as physical columns (example below) |
| Unsupported DAX functions | Use supported subset; check compatibility list |
| Row-level security with complex rules | Simplify RLS predicates |
| Too many columns (>500) | Split into multiple fact/dimension tables |
| Large cardinality columns | Move to dimension; use integer FK |
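As an example of the first row above, a per-row metric that might otherwise be a DAX calculated column can be pre-computed in Spark and stored physically (columns taken from the casino model):
from pyspark.sql import functions as F
# Physical Delta column replaces an equivalent DAX calculated column
df_fact = df_fact.withColumn("net_win", F.col("coin_in") - F.col("coin_out"))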
🎰 Casino Implementation¶
Casino Star Schema¶
erDiagram
fact_slot_performance {
bigint metric_key PK
int date_key FK
int machine_key FK
int property_key FK
int player_key FK
decimal coin_in
decimal coin_out
decimal net_win
decimal theoretical_win
decimal jackpot_amount
int handle_pulls
int session_count
}
dim_date {
int date_key PK
date full_date
int year
int month
int day_of_week
boolean is_weekend
boolean is_holiday
}
dim_machine {
int machine_key PK
string machine_id
string machine_type
string manufacturer
string denomination
string floor_section
string zone
date install_date
}
dim_property {
int property_key PK
string property_id
string property_name
string city
string state
string region
string license_number
}
dim_player {
int player_key PK
string player_id
string loyalty_tier
date enrollment_date
string age_group
boolean _is_current
}
fact_compliance {
bigint filing_key PK
int date_filed_key FK
int property_key FK
string filing_type
decimal filing_amount
string filing_status
}
fact_slot_performance }o--|| dim_date : date_key
fact_slot_performance }o--|| dim_machine : machine_key
fact_slot_performance }o--|| dim_property : property_key
fact_slot_performance }o--|| dim_player : player_key
fact_compliance }o--|| dim_date : date_filed_key
fact_compliance }o--|| dim_property : property_key
Casino KPI Measures (DAX)¶
// Hold Percentage (non-additive — must recalculate from components)
Hold % = DIVIDE(
SUM(fact_slot_performance[coin_in]) - SUM(fact_slot_performance[coin_out]),
SUM(fact_slot_performance[coin_in]),
0
)
// Win Per Unit Per Day (WPUPD)
WPUPD = DIVIDE(
SUM(fact_slot_performance[net_win]),
DISTINCTCOUNT(fact_slot_performance[machine_key]) *
DISTINCTCOUNT(fact_slot_performance[date_key]),
0
)
// Compliance: CTR Filing Rate
CTR Filing Rate = DIVIDE(
CALCULATE(COUNT(fact_compliance[filing_key]),
fact_compliance[filing_type] = "CTR"),
CALCULATE(COUNT(fact_slot_performance[metric_key]),
fact_slot_performance[coin_in] >= 10000),
0
)
🏛️ Federal Implementation¶
Federal Star Schema: Grant Disbursement¶
erDiagram
fact_grant_disbursement {
bigint disbursement_key PK
int date_key FK
int agency_key FK
int program_key FK
int recipient_key FK
decimal obligation_amount
decimal disbursement_amount
decimal remaining_balance
int transaction_count
}
dim_agency {
int agency_key PK
string agency_code
string agency_name
string bureau
string sub_agency
string cfda_number
}
dim_program {
int program_key PK
string program_id
string program_name
string program_type
string funding_source
int fiscal_year
}
dim_recipient {
int recipient_key PK
string recipient_id
string recipient_name
string recipient_type
string state
string congressional_district
}
dim_date {
int date_key PK
date full_date
int fiscal_year
int fiscal_quarter
}
fact_grant_disbursement }o--|| dim_date : date_key
fact_grant_disbursement }o--|| dim_agency : agency_key
fact_grant_disbursement }o--|| dim_program : program_key
fact_grant_disbursement }o--|| dim_recipient : recipient_key
Agency-Specific Fact Tables¶
| Agency | Fact Table | Grain | Key Measures |
|---|---|---|---|
| USDA | fact_crop_yield | One crop, one state, one year | yield_per_acre, production_tons, harvested_acres |
| SBA | fact_loan_approval | One loan application | loan_amount, approval_amount, jobs_supported |
| NOAA | fact_weather_observation | One station, one hour | temperature, precipitation, wind_speed |
| EPA | fact_air_quality | One monitor, one day | aqi, pm25_concentration, ozone_ppm |
| DOI | fact_recreation_visit | One facility, one month | visitor_count, revenue, capacity_utilization |
Cross-Agency Conformed Dimensions¶
# dim_agency: Conformed across all federal fact tables
df_dim_agency = spark.createDataFrame([
(1, "USDA", "Dept of Agriculture", "NASS", "National Agricultural Statistics Service", "10.001"),
(2, "SBA", "Small Business Administration", "ODA", "Office of Disaster Assistance", "59.008"),
(3, "NOAA", "Natl Oceanic & Atmospheric Admin", "NWS", "National Weather Service", "11.467"),
(4, "EPA", "Environmental Protection Agency", "OAR", "Office of Air and Radiation", "66.034"),
(5, "DOI", "Dept of the Interior", "NPS", "National Park Service", "15.916"),
], ["agency_key", "agency_code", "agency_name", "bureau", "sub_agency", "cfda_number"])
df_dim_agency.write.format("delta").mode("overwrite").saveAsTable("lh_gold.dim_agency")
🚫 Anti-Patterns¶
Anti-Pattern 1: One Big Table (OBT)¶
Problem: Flattening all dimensions into the fact table to create a single wide table.
Impact: 200+ columns; slow Direct Lake loads; impossible to maintain conformed dimensions.
Fix: Use proper star schema with narrow fact tables and separate dimensions.
❌ fact_everything (machine_name, machine_type, floor_section, property_name, city, state, ...)
✅ fact_slot_metrics (machine_key, property_key, ...) + dim_machine + dim_property
Anti-Pattern 2: Snowflaking Dimensions¶
Problem: Normalizing dimension tables into sub-dimensions (e.g., dim_machine → dim_manufacturer → dim_country).
Impact: Extra joins degrade Direct Lake and DAX performance.
Fix: Denormalize into flat dimensions. Storage is cheap; query performance is valuable.
Anti-Pattern 3: String Keys in Fact Tables¶
Problem: Using machine_id VARCHAR(50) as the FK instead of machine_key INT.
Impact: String keys consume 5–10x more storage in columnar format; slower joins.
Fix: Always use integer surrogate keys in fact tables.
Anti-Pattern 4: Mixed Grain¶
Problem: Storing daily and monthly records in the same fact table.
Impact: Aggregations produce incorrect results; SUM double-counts.
Fix: Separate fact tables per grain: fact_daily_metrics and fact_monthly_summary.
Anti-Pattern 5: Calculated Columns for Complex Logic¶
Problem: Adding DAX calculated columns with complex business logic to Direct Lake models.
Impact: Forces Direct Lake fallback to DirectQuery mode; performance degrades drastically.
Fix: Pre-compute all complex metrics in PySpark and store as physical Delta columns.
Anti-Pattern Summary¶
| Anti-Pattern | Impact | Fix |
|---|---|---|
| One Big Table | Slow Direct Lake, unmaintainable | Star schema |
| Snowflaking | Extra joins, slow DAX | Denormalize dimensions |
| String keys in facts | Storage waste, slow joins | Integer surrogate keys |
| Mixed grain | Incorrect aggregations | Separate fact tables |
| Complex calculated columns | Direct Lake fallback | Pre-compute in Spark |
📉 Limitations¶
| Limitation | Impact | Workaround |
|---|---|---|
| Direct Lake max columns per table | ~500 columns before degradation | Split wide tables into star schema |
| No STRUCT/ARRAY in Direct Lake | Cannot use nested types in Gold | Flatten in Silver → Gold |
| SCD Type 2 increases row count | Historical rows inflate dimension size | Archive expired rows periodically |
| Bridge tables add DAX complexity | Many-to-many requires CROSSFILTER or weighting | Document patterns; keep bridges small |
| V-Order write overhead | Slightly slower writes (~5–10%) | Only enable for Gold tables consumed by Power BI |
| No auto-surrogate key generation | Must generate surrogate keys manually | Use monotonically_increasing_id() or a deterministic hash (see sketch below) |
| Liquid clustering preview only | Not GA for production use | Use Z-ORDER until GA |
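Where keys must remain stable across table rewrites, a deterministic hash of the natural key is a common alternative to monotonically_increasing_id(), as noted in the table above — a sketch:
from pyspark.sql import functions as F
# xxhash64 yields a stable 64-bit integer key derived from the natural key
df_dim = df_dim.withColumn("machine_key", F.xxhash64("machine_id"))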
📚 References¶
Microsoft Documentation¶
- Star schema design guidance
- Direct Lake overview
- Direct Lake default Power BI semantic model
- V-Order optimization
- Relationships in Power BI
- Role-playing dimensions
- DAX USERELATIONSHIP
- Lakehouse schemas
Compliance¶
- NIGC MICS data retention
- DATA Act (Federal spending transparency)
- OMB Circular A-11 (Federal program reporting)
Related Documents¶
- Medallion Architecture Deep Dive — Bronze/Silver/Gold layer patterns
- Performance & Parallelism — Spark and query optimization
- Data Governance Deep Dive — Classification, sensitivity labels, RLS
- Direct Lake — Direct Lake connectivity patterns
- Capacity Planning & Cost Optimization — SKU sizing for model workloads
Document Metadata¶
| Field | Value |
|---|---|
| Title | Data Modeling & Star Schema Best Practices |
| Category | Best Practices — Data Modeling |
| Author | Supercharge Microsoft Fabric POC Team |
| Version | 1.0.0 |
| Created | 2026-04-21 |
| Last Updated | 2026-04-21 |
| Applicable SKUs | F2–F2048 |
| Industries | Casino/Gaming, Federal Government |