🎬 Tutorial 54: Media & Entertainment Audience Analytics¶

Build a complete audience analytics pipeline for a streaming media platform on Microsoft Fabric. You will ingest 2B+ playback events per day through the medallion architecture, sessionize viewing behavior, build content performance dashboards, and produce user engagement segments for churn prediction and content recommendations.

📋 Prerequisites¶

Microsoft Fabric workspace with F64 (or higher) capacity
Lakehouse items: lh_bronze, lh_silver, lh_gold
Python 3.10+ with numpy, pandas, faker installed locally (for data generation)
Basic familiarity with PySpark and Delta Lake

🎯 Learning Objectives¶

By the end of this tutorial you will be able to:

Generate realistic streaming playback events with COPPA-compliant age gating
Ingest raw events into a Bronze Delta table with quality checks
Sessionize events in the Silver layer with bot filtering and heartbeat dedup
Build Gold-layer content performance KPIs and user engagement segments
Query real-time engagement metrics with KQL in Eventhouse
Understand COPPA and GDPR compliance patterns in media analytics

🏗️ Architecture¶

graph TD
    A[Playback Clients] -->|Events| B[Event Hub / Eventstream]
    B --> C[Eventhouse - Real-Time KQL]
    B --> D[Bronze: bronze_media_events]
    D --> E[Silver: silver_media_sessions]
    E --> F[Gold: gold_content_performance]
    E --> G[Gold: gold_user_segments]
    E --> H[Gold: gold_content_affinity]
    F --> I[Power BI Direct Lake]
    G --> I
    H --> J[Recommendation API]

    style A fill:#f9f,stroke:#333
    style C fill:#ff9,stroke:#333
    style I fill:#9f9,stroke:#333

📊 Data Model¶

erDiagram
    BRONZE_MEDIA_EVENTS {
        string event_id PK
        string user_id FK
        string content_id FK
        string event_type
        timestamp event_timestamp
        int position_sec
        string device_type
        int bitrate_kbps
        string plan_tier
        string age_bucket
        string region
    }
    SILVER_MEDIA_SESSIONS {
        string session_id PK
        string user_id FK
        string content_id FK
        timestamp session_start
        timestamp session_end
        int watch_duration_sec
        double completion_pct
        double quality_score
        string age_bucket
    }
    GOLD_CONTENT_PERFORMANCE {
        string content_id PK
        bigint unique_viewers
        double total_watch_hours
        double avg_completion_pct
        double trending_score
    }
    GOLD_USER_SEGMENTS {
        string user_id PK
        string segment
        double watch_hours_30d
        double churn_probability
    }

    BRONZE_MEDIA_EVENTS ||--o{ SILVER_MEDIA_SESSIONS : sessionized
    SILVER_MEDIA_SESSIONS }o--|| GOLD_CONTENT_PERFORMANCE : aggregated
    SILVER_MEDIA_SESSIONS }o--|| GOLD_USER_SEGMENTS : segmented

🚀 Step-by-Step Instructions¶

Step 1: Generate Sample Data¶

Generate playback events locally using the media event generator.

from data_generation.generators.media.event_generator import MediaEventGenerator

gen = MediaEventGenerator(seed=42)
df = gen.generate(num_records=50000)
gen.to_parquet(df, "output/bronze_media_events.parquet")

print(f"Generated {len(df):,} events")
print(f"Event types: {df['event_type'].value_counts().to_dict()}")
print(f"Age buckets: {df['age_bucket'].value_counts().to_dict()}")

✅ Verification: You should see ~50,000 events with five event types (play, pause, seek, stop, heartbeat) and three age buckets (child ~12%, teen ~15%, adult ~73%).

Step 2: Upload to Fabric Lakehouse¶

Upload the generated Parquet file to your Bronze lakehouse.

Open your Fabric workspace
Navigate to lh_bronze → Files → output/
Upload bronze_media_events.parquet

✅ Verification: The file appears under Files/output/ in the lakehouse explorer.

Step 3: Run Bronze Notebook¶

Import and run notebooks/bronze/58_media_events.py.

In Fabric, select Import notebook
Upload 58_media_events.py
Attach to your lakehouse and run all cells

Expected Output: - Schema with 14 columns printed - All critical field null checks show PASS - Event type and device distribution tables displayed

✅ Verification:

SELECT count(*) FROM lh_bronze.bronze_media_events
-- Should return ~50,000 rows

Step 4: Run Silver Notebook¶

Import and run notebooks/silver/58_media_sessions.py.

Import 58_media_sessions.py
Run all cells

Key Transformations: - Heartbeat deduplication removes ~10-15% of events - Sessionization groups events by user+content with 30-min gap - Bot filter removes sessions < 5 seconds or > 24 hours

✅ Verification:

SELECT
    count(*) as total_sessions,
    count(distinct user_id) as unique_users,
    avg(completion_pct) as avg_completion,
    avg(quality_score) as avg_quality
FROM lh_silver.silver_media_sessions

Step 5: Run Gold Notebook¶

Import and run notebooks/gold/58_media_recommendations.py.

Import 58_media_recommendations.py
Run all cells

Three output tables are created:

Table	Purpose
`gold_content_performance`	Per-content KPIs with trending score
`gold_user_segments`	User engagement segments (power/casual/at-risk/dormant)
`gold_content_affinity`	User-content affinity for recommendations

✅ Verification:

-- Check segment distribution
SELECT segment, count(*) as users, avg(churn_probability) as avg_churn
FROM lh_gold.gold_user_segments
GROUP BY segment
ORDER BY segment

Expected segments: power (~15%), casual (~45%), at_risk (~30%), dormant (~10%).

Step 6: Explore Content Performance¶

Query the top trending content:

SELECT content_id, unique_viewers, total_watch_hours,
       round(avg_completion_pct, 2) as completion,
       round(trending_score, 1) as trending
FROM lh_gold.gold_content_performance
ORDER BY trending_score DESC
LIMIT 10

Step 7: Real-Time KQL Queries (Optional)¶

If you have an Eventhouse configured with the Eventstream, run these KQL queries:

Live Concurrent Viewers:

PlaybackEvents
| where EventTimestamp > ago(5m)
| where EventType in ("play", "heartbeat")
| summarize ConcurrentViewers = dcount(UserId) by bin(EventTimestamp, 1m)
| render timechart

Buffering Alert:

PlaybackEvents
| where EventTimestamp > ago(10m)
| where EventType == "buffering"
| summarize BufferEvents = count(), AffectedUsers = dcount(UserId) by Region
| where BufferEvents > 100

✅ Verification: Charts render with real-time data points updating every minute.

🔒 Compliance Notes¶

COPPA¶

Requirement	Implementation
Age-gated profiles	`age_bucket` field on every event; child = under 13
Data minimization	Child events exclude seek granularity, device fingerprints
No behavioral ads	AVOD child profiles use contextual targeting only
Parental access	Erasure/export via self-service portal
Retention limit	Child data: 30 days (vs. 90 days adult)

Right	Implementation
Consent	Granular opt-in for analytics, personalization, advertising
Erasure	DELETE by user_id across Bronze/Silver/Gold + Eventhouse purge
Portability	JSON export of viewing history and preferences
Pseudonymization	user_id is pseudonymized; PII mapping stored externally

🧹 Cleanup¶

To remove tutorial resources:

DROP TABLE IF EXISTS lh_bronze.bronze_media_events;
DROP TABLE IF EXISTS lh_silver.silver_media_sessions;
DROP TABLE IF EXISTS lh_gold.gold_content_performance;
DROP TABLE IF EXISTS lh_gold.gold_user_segments;
DROP TABLE IF EXISTS lh_gold.gold_content_affinity;

✅ Progress Tracker¶

Step	Task	Status
1	Generate sample data	⬜
2	Upload to Fabric	⬜
3	Run Bronze notebook	⬜
4	Run Silver notebook	⬜
5	Run Gold notebook	⬜
6	Explore content performance	⬜
7	Real-time KQL queries	⬜

Next Tutorial: Tutorial 55 →

← Previous53 — Pharma Clinical TrialsRead more →Next →OverviewRead more →

🎬 Tutorial 54: Media & Entertainment Audience Analytics¶

📋 Prerequisites¶

🎯 Learning Objectives¶

🏗️ Architecture¶

📊 Data Model¶

🚀 Step-by-Step Instructions¶

Step 1: Generate Sample Data¶

Step 2: Upload to Fabric Lakehouse¶

Step 3: Run Bronze Notebook¶

Step 4: Run Silver Notebook¶

Step 5: Run Gold Notebook¶

Step 6: Explore Content Performance¶

Step 7: Real-Time KQL Queries (Optional)¶

🔒 Compliance Notes¶

COPPA¶

GDPR¶

🧹 Cleanup¶

📚 Related Resources¶

✅ Progress Tracker¶