🛒 Tutorial 49: Retail & CPG — Demand Forecasting and Customer Analytics¶

Difficulty: Advanced Duration: 90 minutes Prerequisites: Fabric workspace with F64 capacity, Lakehouse, Eventhouse

📋 Overview¶

In this tutorial you will build an end-to-end retail analytics solution on Microsoft Fabric for a 450-store omnichannel retailer. You will:

Generate synthetic POS transaction data (PCI-DSS compliant — no real PANs)
Ingest into a Bronze lakehouse with schema enforcement
Cleanse, deduplicate, and enrich in Silver
Build demand-forecast features and a Customer 360 (RFM + CLV) in Gold
Explore results with SQL and Power BI Direct Lake

🏗️ Architecture¶

graph TD
    A[🏪 POS Data Generator] -->|Parquet| B[Bronze: bronze_retail_pos]
    B --> C[Silver: silver_retail_sales]
    C --> D1[Gold: gold_retail_demand]
    C --> D2[Gold: gold_retail_customer_360]
    D1 --> E[📊 Power BI Dashboard]
    D2 --> E

    style A fill:#f9f,stroke:#333
    style B fill:#cd7f32,stroke:#333,color:#fff
    style C fill:#c0c0c0,stroke:#333
    style D1 fill:#ffd700,stroke:#333
    style D2 fill:#ffd700,stroke:#333
    style E fill:#4169e1,stroke:#333,color:#fff

📦 What You'll Build¶

Layer	Table	Description
Bronze	`bronze_retail_pos`	Raw POS transactions with metadata
Silver	`silver_retail_sales`	Deduped, validated, enriched sales
Gold	`gold_retail_demand`	SKU-Store-Day aggregates + rolling features
Gold	`gold_retail_customer_360`	RFM scores, CLV tiers, segments

🔧 Prerequisites¶

Microsoft Fabric workspace (F64 or trial capacity)
Three Lakehouses created: lh_bronze, lh_silver, lh_gold
Python environment with numpy, pandas, faker, tqdm
Eventhouse (optional, for real-time inventory queries)

📝 Step 1: Generate Synthetic POS Data¶

1.1 Install Dependencies¶

pip install numpy pandas faker tqdm pyarrow

1.2 Generate Data¶

from data_generation.generators.retail.sales_generator import RetailSalesGenerator

gen = RetailSalesGenerator(
    seed=42,
    num_stores=450,
    num_skus=5000,
    num_customers=50000,
)

# Generate 100K transactions
df = gen.generate(num_records=100_000)
print(f"Generated {len(df):,} records")
print(df.head())

# Save to Parquet for Fabric upload
gen.to_parquet(df, "output/bronze_retail_pos.parquet")

1.3 Verify Data Quality¶

# PCI-DSS: confirm no raw PANs
assert not df["card_token"].str.match(r"^\d{13,19}$").any(), "PAN leak!"

# Check distributions
print(df["category"].value_counts(normalize=True))
print(f"Unique stores: {df['store_id'].nunique()}")
print(f"Loyalty rate: {df['loyalty_id'].notna().mean():.1%}")

✅ Checkpoint: You should have a Parquet file with ~100K rows, no raw card numbers, and ~75% loyalty linkage rate.

📝 Step 2: Upload to Fabric¶

2.1 Upload Parquet File¶

Open your lh_bronze Lakehouse in the Fabric portal
Click Get data → Upload files
Navigate to output/bronze_retail_pos.parquet
Upload to Files/output/

2.2 Verify Upload¶

-- In Lakehouse SQL endpoint
SELECT COUNT(*) FROM bronze_retail_pos;
-- Expected: ~100,000

✅ Checkpoint: File visible in the Lakehouse Files section.

📝 Step 3: Bronze Ingestion¶

3.1 Import Notebook¶

In your workspace, click Import notebook
Select notebooks/bronze/53_retail_pos.py
Attach the notebook to lh_bronze

3.2 Run the Notebook¶

Execute all cells. The notebook will:

Read the Parquet file with explicit schema enforcement
Validate non-null txn_id, store_id, sku
Check for PCI-DSS compliance (no raw PAN)
Append to bronze_retail_pos Delta table with ingestion metadata

3.3 Verify¶

SELECT
    COUNT(*) AS total_rows,
    COUNT(DISTINCT txn_id) AS unique_txns,
    MIN(txn_timestamp) AS earliest,
    MAX(txn_timestamp) AS latest
FROM lh_bronze.bronze_retail_pos;

✅ Checkpoint: Bronze table populated, all quality checks passed, zero PAN violations.

📝 Step 4: Silver Transformation¶

4.1 Import & Run Notebook¶

Import notebooks/silver/53_retail_cleansed.py
Attach to lh_silver (with access to lh_bronze)
Run all cells

4.2 What Happens¶

Step	Transformation
Deduplication	Remove duplicate `txn_id` (keep latest)
Validation	Filter `qty > 0` and `unit_price > 0`; quarantine bad rows
Returns	Flag returns with `is_return = true`
Enrichment	Join product master (cost) and store master (city, state)
Margin	Calculate `gross_margin = line_total - (qty × product_cost)`
PCI-DSS	Drop `card_token` column (not needed for analytics)

4.3 Verify¶

SELECT
    category,
    COUNT(*) AS txns,
    ROUND(SUM(line_total), 2) AS revenue,
    SUM(CASE WHEN is_return THEN 1 ELSE 0 END) AS returns
FROM lh_silver.silver_retail_sales
GROUP BY category
ORDER BY revenue DESC;

✅ Checkpoint: Silver table has fewer rows than bronze (dedup + quarantine), no card_token column.

📝 Step 5: Gold — Demand Features & Customer 360¶

5.1 Import & Run Notebook¶

Import notebooks/gold/53_retail_demand_forecast.py
Attach to lh_gold (with access to lh_silver)
Run all cells

5.2 Demand Forecast Features¶

The notebook creates gold_retail_demand with:

Feature	Description
`daily_qty`	Total units sold per SKU-Store-Day
`daily_revenue`	Total revenue per SKU-Store-Day
`sales_ma_7d`	7-day moving average of daily_qty
`sales_ma_14d`	14-day moving average
`sales_ma_28d`	28-day moving average
`seasonal_index`	Ratio of current qty to 28-day average
`day_of_week`	Calendar feature (1=Sun .. 7=Sat)
`week_of_year`	Calendar feature

5.3 Customer 360¶

The notebook creates gold_retail_customer_360 with:

Metric	Description
`recency_days`	Days since last purchase
`frequency`	Number of distinct transactions
`monetary`	Total spend
`r_score` / `f_score` / `m_score`	Quintile scores (1-5)
`rfm_segment`	Champions, Loyal, New, At Risk, Hibernating
`clv_12m`	12-month projected CLV
`clv_tier`	Platinum / Gold / Silver / Bronze

5.4 Verify¶

-- Demand features
SELECT category, region,
    COUNT(*) AS records,
    ROUND(AVG(sales_ma_7d), 1) AS avg_ma7
FROM lh_gold.gold_retail_demand
GROUP BY category, region
ORDER BY avg_ma7 DESC
LIMIT 10;

-- Customer 360
SELECT rfm_segment, clv_tier,
    COUNT(*) AS customers,
    ROUND(AVG(clv_12m), 2) AS avg_clv
FROM lh_gold.gold_retail_customer_360
GROUP BY rfm_segment, clv_tier
ORDER BY avg_clv DESC;

✅ Checkpoint: Both gold tables populated. Champions segment has highest CLV.

📝 Step 6: Power BI Dashboard (Optional)¶

6.1 Create a Semantic Model¶

In lh_gold, click New semantic model
Select gold_retail_demand and gold_retail_customer_360
Create a Direct Lake connection

6.2 Suggested Visuals¶

Visual	Data
Line chart	Daily qty by category over time
Heat map	Seasonal index by day-of-week × month
Bar chart	Revenue by region and store format
Donut chart	Customer distribution by RFM segment
KPI cards	Total revenue, avg CLV, stockout rate

6.3 Row-Level Security¶

Apply RLS so regional managers only see their region:

[Region] = LOOKUPVALUE(
    UserRegion[Region],
    UserRegion[UPN], USERPRINCIPALNAME()
)

✅ Checkpoint: Dashboard renders with Direct Lake (no import needed).

📝 Step 7: Run Unit Tests¶

cd validation/unit_tests
pytest retail/test_sales_generator.py -v

Expected output:

test_generate_sales .............. PASSED
test_qty_positive ................ PASSED
test_price_positive .............. PASSED
test_category_valid .............. PASSED
test_loyalty_id_format ........... PASSED
test_reproducibility ............. PASSED

✅ Checkpoint: All 6 tests pass.

🎯 Summary¶

What You Did	Result
Generated PCI-DSS compliant POS data	No raw PANs in the lakehouse
Built Bronze → Silver → Gold pipeline	Deduped, validated, enriched
Created demand forecast features	SKU-Store-Day with rolling averages
Built Customer 360	RFM + CLV + segments for 12M loyalty members
Connected Power BI with Direct Lake	Sub-second dashboards, no import

⏭️ Next Steps¶

Add weather data join for improved forecast accuracy
Implement promotion lift measurement with causal inference
Set up Eventstream for real-time POS ingestion
Build vendor scorecard in supply chain gold layer

Tutorial 49 — Retail & CPG Vertical Last Updated: 2026-04-27

← Previous48 — Insurance ClaimsRead more →Next →50 — Manufacturing IoTRead more →