Media & Entertainment: Audience Analytics on Microsoft Fabric¶
Executive Summary¶
| Attribute | Detail |
|---|---|
| Vertical | Media & Entertainment (Streaming) |
| Platform Scale | 8M subscribers, 50K content titles, 2B playback events/day |
| Fabric SKU | F64 (P1 equivalent) |
| Monthly Cost | ~$25,000 |
| Primary ROI | Churn reduction ($4.2M/yr) + Ad revenue optimization ($1.8M/yr) |
| Compliance | COPPA, GDPR, CCPA |
| Key Workloads | Eventstream, Eventhouse, Lakehouse, Power BI Direct Lake |
Business Context¶
A streaming media platform operates a hybrid monetization model with three tiers:
- Free (AVOD): Ad-supported, limited catalog access, 2.1M users
- Basic: Ad-free, standard quality, 3.4M subscribers at $9.99/mo
- Premium: 4K HDR, offline downloads, family profiles, 2.5M subscribers at $15.99/mo
The platform generates approximately 2 billion playback events daily across web, mobile, smart TV, and gaming console clients. Content spans original productions, licensed movies, live sports, and a growing kids library subject to COPPA regulations.
Key Business Questions¶
- Which content titles drive the highest engagement and lowest churn?
- How do we optimize ad placement in AVOD without degrading watch time?
- What is the true cost-per-viewing-hour for licensed vs. original content?
- Which subscriber segments are at risk of churning in the next 30 days?
- How do viewing patterns differ across age-gated (COPPA) profiles?
Architecture Overview¶
Playback Clients (Web/Mobile/TV/Console)
|
Event Collector API
|
Azure Event Hubs (32 partitions)
|
+---------Fabric Eventstream---------+
| |
Eventhouse (KQL DB) Lakehouse (Delta)
Real-time engagement Medallion pipeline
dashboards, alerts Bronze -> Silver -> Gold
| |
+----------Power BI Direct Lake-------+
|
Audience Dashboards & Recommendation API
Data Flow¶
- Ingest: Playback clients emit events (play, pause, seek, stop, heartbeat) via Event Collector API to Azure Event Hubs.
- Real-Time Path: Fabric Eventstream routes events to Eventhouse for sub-second KQL queries powering live engagement dashboards and anomaly detection (e.g., buffering spike alerts).
- Batch Path: Eventstream also lands raw events in the Bronze lakehouse layer as Delta tables partitioned by date.
- Silver Layer: Events are sessionized (30-minute gap = new session), quality-scored, and bot traffic is filtered.
- Gold Layer: Aggregated content performance, user engagement segments, and content affinity matrices feed Power BI and the recommendation engine.
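The Silver-layer sessionization rule (a 30-minute inactivity gap closes a session) can be sketched in plain Python; the tuple layout and function name are illustrative, not the pipeline's actual code:

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # inactivity gap that starts a new session

def sessionize(events):
    """Group (user_id, content_id, timestamp) tuples into sessions.

    Events must be sorted by timestamp within each (user_id, content_id).
    Returns {(user_id, content_id): [[session_start, session_end], ...]}.
    """
    sessions = {}
    for user_id, content_id, ts in events:
        key = (user_id, content_id)
        bucket = sessions.setdefault(key, [])
        if bucket and ts - bucket[-1][1] < SESSION_GAP:
            bucket[-1][1] = ts        # within the gap: extend the open session
        else:
            bucket.append([ts, ts])   # first event or gap exceeded: new session
    return sessions
```

In the real pipeline this grouping runs in Spark, but the gap logic is the same per key.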
Medallion Architecture¶
Bronze: Raw Playback Events¶
Table: bronze_media_events
| Column | Type | Description |
|---|---|---|
| event_id | STRING | Unique event identifier (UUID) |
| user_id | STRING | Pseudonymized user identifier |
| content_id | STRING | Content title identifier |
| event_type | STRING | play, pause, seek, stop, heartbeat |
| event_timestamp | TIMESTAMP | Client-side event time (UTC) |
| position_sec | INT | Playback position in seconds |
| device_type | STRING | web, mobile, smart_tv, console |
| bitrate_kbps | INT | Current streaming bitrate |
| plan_tier | STRING | free, basic, premium |
| age_bucket | STRING | child, teen, adult (COPPA routing) |
| region | STRING | ISO 3166-1 region code |
| _ingested_at | TIMESTAMP | Bronze ingestion timestamp |
| _source | STRING | Generator/pipeline identifier |
| _batch_id | STRING | Batch identifier |
- Volume: ~2B events/day, ~60B events/month
- Partitioning: _bronze_load_date
- Retention: 90 days raw, then archive to cold storage
Silver: Sessionized Viewing¶
Table: silver_media_sessions
| Column | Type | Description |
|---|---|---|
| session_id | STRING | Derived session identifier |
| user_id | STRING | Pseudonymized user identifier |
| content_id | STRING | Content title identifier |
| session_start | TIMESTAMP | First event in session |
| session_end | TIMESTAMP | Last event in session |
| watch_duration_sec | INT | Total active viewing time |
| content_duration_sec | INT | Total content duration |
| completion_pct | DOUBLE | watch_duration / content_duration |
| pause_count | INT | Number of pause events |
| seek_count | INT | Number of seek events |
| avg_bitrate_kbps | INT | Average streaming quality |
| quality_score | DOUBLE | Composite QoE metric (0-1) |
| device_type | STRING | Primary device for session |
| plan_tier | STRING | User plan at time of viewing |
| is_binge | BOOLEAN | Part of 3+ consecutive episodes |
| age_bucket | STRING | COPPA age classification |
Transformations:
- Sessionization: events grouped by user_id + content_id with a 30-minute inactivity gap
- Bot filtering: sessions < 5 seconds or > 24 hours excluded
- Heartbeat deduplication: collapse consecutive heartbeats within a 15-second window
- Quality score: weighted combination of bitrate stability, rebuffer ratio, and startup time
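A minimal sketch of the composite quality score. The document names the three inputs but not the weights, so the 0.4/0.4/0.2 split and the 10-second startup cap below are assumptions:

```python
def quality_score(bitrate_stability: float, rebuffer_ratio: float,
                  startup_time_sec: float, max_startup_sec: float = 10.0) -> float:
    """Composite QoE score in [0, 1]; weights are illustrative assumptions.

    bitrate_stability: 1.0 = constant bitrate, 0.0 = constant switching
    rebuffer_ratio:    buffering time / total watch time
    startup_time_sec:  time from pressing play to first frame
    """
    startup_penalty = min(startup_time_sec / max_startup_sec, 1.0)
    score = (0.4 * bitrate_stability
             + 0.4 * (1.0 - min(rebuffer_ratio, 1.0))
             + 0.2 * (1.0 - startup_penalty))
    return round(score, 3)
```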
Gold: Content Performance¶
Table: gold_content_performance
| Column | Type | Description |
|---|---|---|
| content_id | STRING | Content identifier |
| title | STRING | Content title |
| genre | STRING | Primary genre |
| period | DATE | Aggregation period (daily) |
| unique_viewers | BIGINT | Distinct viewers |
| total_watch_hours | DOUBLE | Sum of watch duration |
| avg_completion_pct | DOUBLE | Average completion rate |
| trending_score | DOUBLE | Weighted recency + velocity |
| cost_per_view_hour | DOUBLE | License cost / watch hours |
| is_original | BOOLEAN | Platform original content |
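The trending_score column is described only as "weighted recency + velocity"; one plausible shape is below, with the exponential half-life and the day-over-day growth bonus as assumptions:

```python
def trending_score(viewers_by_day, half_life_days: float = 2.0) -> float:
    """Recency-weighted viewer count with a velocity bonus (assumed formula).

    viewers_by_day: list of daily unique-viewer counts, most recent last.
    Recent days are weighted by exponential decay with the given half-life;
    positive day-over-day growth multiplies the score up.
    """
    if not viewers_by_day:
        return 0.0
    decay = 0.5 ** (1.0 / half_life_days)
    recency, weight = 0.0, 1.0
    for count in reversed(viewers_by_day):
        recency += weight * count
        weight *= decay
    velocity = 0.0
    if len(viewers_by_day) >= 2 and viewers_by_day[-2] > 0:
        velocity = (viewers_by_day[-1] - viewers_by_day[-2]) / viewers_by_day[-2]
    return recency * (1.0 + max(velocity, 0.0))
```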
Gold: User Engagement Segments¶
Table: gold_user_segments
| Column | Type | Description |
|---|---|---|
| user_id | STRING | Pseudonymized user identifier |
| segment | STRING | power, casual, at_risk, dormant |
| watch_hours_30d | DOUBLE | Last 30 days viewing |
| sessions_30d | INT | Session count last 30 days |
| genres_watched | INT | Genre diversity |
| avg_completion | DOUBLE | Average completion rate |
| days_since_last | INT | Days since last session |
| churn_probability | DOUBLE | ML-predicted churn risk |
| plan_tier | STRING | Current subscription tier |
| cohort_month | DATE | Signup cohort |
Segment Definitions:
- Power: 20+ hours/month, 15+ sessions, completion > 70%
- Casual: 5-20 hours/month, 5-15 sessions
- At-Risk: < 5 hours/month and declining, or 14+ days inactive
- Dormant: 30+ days since last session
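The segment definitions translate directly into a rule function. The "declining" check compares against the prior 30-day window, which is an assumption; the thresholds themselves come from the definitions above:

```python
def assign_segment(watch_hours_30d: float, sessions_30d: int,
                   avg_completion: float, days_since_last: int,
                   watch_hours_prev_30d: float) -> str:
    """Map a user's recent activity to an engagement segment.

    Dormant and at-risk rules are checked first so inactivity always
    wins over raw volume.
    """
    if days_since_last >= 30:
        return "dormant"
    if days_since_last >= 14 or (watch_hours_30d < 5
                                 and watch_hours_30d < watch_hours_prev_30d):
        return "at_risk"
    if watch_hours_30d >= 20 and sessions_30d >= 15 and avg_completion > 0.70:
        return "power"
    return "casual"
```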
Analytics Use Cases¶
1. Audience Engagement Analytics¶
Objective: Understand viewing behavior to optimize content scheduling and personalization.
Key Metrics:
- Watch Time per Session: average and P95 session duration by content genre
- Completion Rate: percentage of content watched to completion (critical for series)
- Binge Score: consecutive episodes watched in a single sitting (3+ = binge)
- Time-to-First-Play: latency from app open to first playback event
KQL Real-Time Query (Eventhouse):
PlaybackEvents
| where EventTimestamp > ago(1h)
| summarize
ActiveViewers = dcount(UserId),
AvgBitrateKbps = avg(BitrateKbps),
BufferingEvents = countif(EventType == "buffering")
by bin(EventTimestamp, 1m), ContentId
| order by ActiveViewers desc
| take 20
Business Impact: Content teams use engagement data to greenlight Season 2 renewals. Titles with > 75% completion rate and > 40% binge score receive priority renewal consideration.
2. Content Recommendation Engine¶
Objective: Improve content discovery using collaborative filtering enhanced with content-based features.
Approach:
- Collaborative Filtering: user-content interaction matrix factorized via ALS (Alternating Least Squares) in PySpark MLlib
- Content-Based Features: genre, cast, director, release year, content rating, duration
- Hybrid Score: 0.7 * collaborative + 0.3 * content-based, with COPPA-safe filtering for child profiles
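The hybrid blend and the COPPA-safe filter can be expressed as a small scoring function (names illustrative; the 0.7/0.3 weights are from the approach above):

```python
def hybrid_score(collab: float, content_based: float,
                 is_child_profile: bool, in_kids_catalog: bool) -> float:
    """Blend recommendation scores: 0.7 * collaborative + 0.3 * content-based.

    Child (COPPA) profiles receive content-based scores only, and only
    for titles in the kids catalog; everything else scores 0.
    """
    if is_child_profile:
        return content_based if in_kids_catalog else 0.0
    return 0.7 * collab + 0.3 * content_based
```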
Content Affinity Matrix:
# Gold layer: user-genre affinity for recommendation input
from pyspark.sql import functions as F

affinity = (
    sessions_df.groupBy("user_id", "genre")
    .agg(
        F.sum("watch_duration_sec").alias("total_watch_sec"),
        F.avg("completion_pct").alias("avg_completion"),
        F.count("session_id").alias("session_count"),
    )
    # Note: the three terms sit on very different scales (seconds vs. a
    # 0-100 percentage vs. a count); normalize each to [0, 1] before the
    # weighted blend in production, or watch time will dominate the score.
    .withColumn(
        "affinity_score",
        F.col("total_watch_sec") * 0.4
        + F.col("avg_completion") * 100 * 0.4
        + F.col("session_count") * 0.2,
    )
)
COPPA Constraint: Child profiles (age_bucket = "child") only receive recommendations from the kids catalog. No cross-profile recommendation leakage.
3. Ad Monetization Optimization (AVOD Tier)¶
Objective: Maximize ad revenue for free-tier users without degrading engagement.
Key Metrics:
- Impression Yield: ad impressions per content hour viewed
- Fill Rate: percentage of ad slots filled by demand partners
- CPM (Cost per Mille): revenue per 1,000 ad impressions
- Ad-Induced Drop-Off: percentage of users who stop viewing during or after an ad break
Optimization Strategy:
- Dynamic ad load: reduce ad frequency for users showing engagement decline
- Contextual targeting: match ad categories to content genre (no behavioral tracking for COPPA profiles)
- A/B test ad pod lengths: 15s vs. 30s vs. 60s pre-roll impact on completion rates
Gold Metrics:
| Metric | Target | Current |
|---|---|---|
| Impression Yield | 12/hr | 10.3/hr |
| Fill Rate | 95% | 91% |
| Average CPM | $18.00 | $15.40 |
| Ad Drop-Off Rate | < 8% | 11.2% |
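The table's first three metrics fall out of raw counters, following the definitions above (function name illustrative):

```python
def ad_metrics(impressions: int, content_hours: float,
               slots_offered: int, slots_filled: int,
               ad_revenue_usd: float) -> dict:
    """Compute AVOD gold metrics from raw daily counters."""
    return {
        # ad impressions per content hour viewed
        "impression_yield_per_hr": impressions / content_hours,
        # percentage of ad slots filled by demand partners
        "fill_rate_pct": 100.0 * slots_filled / slots_offered,
        # revenue per 1,000 impressions
        "cpm_usd": 1000.0 * ad_revenue_usd / impressions,
    }
```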
4. Content Acquisition ROI¶
Objective: Evaluate whether licensed content justifies its acquisition cost.
Key Metrics:
- Cost per Viewing Hour (CPVH): total license cost / total viewing hours
- Catalog Utilization: percentage of catalog titles viewed at least once in 30 days
- Marginal Viewer Contribution: incremental subscribers attributed to a title
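The first two metrics are simple ratios; a sketch is below (marginal viewer contribution requires an attribution model and is omitted; the function name is illustrative):

```python
def acquisition_metrics(license_cost_usd: float, viewing_hours: float,
                        titles_viewed_30d: int, catalog_size: int) -> dict:
    """Content acquisition ROI metrics per the definitions above."""
    return {
        # license spend divided by hours actually watched
        "cpvh_usd": license_cost_usd / viewing_hours,
        # share of catalog titles viewed at least once in 30 days
        "catalog_utilization_pct": 100.0 * titles_viewed_30d / catalog_size,
    }
```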
Benchmarks:
| Content Type | Avg CPVH | Catalog Utilization |
|---|---|---|
| Originals | $0.42 | 89% |
| Licensed Movies | $0.78 | 62% |
| Licensed Series | $0.55 | 71% |
| Live Sports | $1.20 | 95% |
5. Subscriber Lifecycle Management¶
Objective: Reduce churn through proactive engagement interventions.
Lifecycle Stages:
1. Trial (Day 1-7): onboarding optimization, content sampling breadth
2. Activation (Day 8-30): habit formation, push notification cadence tuning
3. Engagement (Month 2-12): content personalization, feature adoption
4. Decay (declining usage): win-back campaigns, plan downgrade offers
5. Churn (cancellation): exit survey analysis, re-acquisition targeting
Churn Prediction Features:
- Watch hours trend (7d, 14d, 30d rolling)
- Session frequency decline rate
- Genre diversity contraction
- Device usage pattern changes
- Payment failure history
- Support ticket sentiment
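The rolling watch-hours trend can be sketched as a window-over-window ratio; comparing each window to the one before it is an assumed formulation:

```python
def watch_trend(daily_hours, window: int) -> float:
    """Rolling watch-hours trend: last `window` days over the `window`
    days before that (1.0 = flat, < 1.0 = declining).

    daily_hours: list of daily watch hours, most recent last.
    Assumes at least 2 * window days of history.
    """
    recent = sum(daily_hours[-window:])
    prior = sum(daily_hours[-2 * window:-window])
    if prior == 0:
        return 1.0 if recent == 0 else float("inf")
    return recent / prior
```

The same function computed at windows of 7, 14, and 30 days yields the three trend features listed above.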
Intervention Triggers:
| Signal | Action | Expected Lift |
|---|---|---|
| 50% watch time decline over 14d | Personalized email with curated picks | +12% retention |
| No session in 10 days | Push notification with new release | +8% reactivation |
| Payment failure | Proactive support outreach | +22% save rate |
| Trial day 3, < 2 sessions | In-app content spotlight | +15% trial conversion |
Compliance Framework¶
COPPA (Children's Online Privacy Protection Act)¶
The platform serves a kids content library requiring COPPA compliance for users under 13.
Requirements Implemented:
- Age-Gated Profiles: Each account designates child profiles during setup. Child profiles are tagged with age_bucket = "child" at the event level.
- Parental Consent: Verifiable parental consent (VPC) is obtained before creating child profiles. Consent records are stored separately with an audit trail.
- Data Minimization: Child profile events collect only:
  - Content interaction (play/pause/stop) -- no seek granularity
  - Device type (category only, no device fingerprint)
  - No geolocation, no behavioral advertising identifiers
- No Behavioral Advertising: AVOD ads shown to child profiles use contextual targeting only (genre-based). No user-level tracking or cross-session profiling.
- Data Retention: Child profile event data is retained for a maximum of 30 days (vs. 90 days for adult profiles).
- Parental Access: Parents can request full export or deletion of child profile data via a self-service portal.
Implementation in Medallion:
- Bronze: age_bucket field populated from profile metadata
- Silver: COPPA filter applied -- child sessions excluded from behavioral models
- Gold: child engagement metrics aggregated at content level only (no user-level); recommendations use content-based filtering only (no collaborative)
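The data-minimization rules can be sketched as a pre-model filter; the allowed-event and dropped-field sets below are assumptions drawn from the COPPA requirements listed earlier, not the pipeline's actual code:

```python
CHILD_ALLOWED_EVENTS = {"play", "pause", "stop"}   # no seek granularity
CHILD_DROPPED_FIELDS = {"user_id", "region"}       # no user-level or geo data

def minimize_child_event(event: dict):
    """Strip child-profile events down to COPPA-permitted fields.

    Returns None for event types children should not emit; otherwise
    returns a copy without user-level identifiers so the event can only
    feed content-level aggregates. Adult events pass through unchanged.
    Field names follow the bronze schema above.
    """
    if event.get("age_bucket") != "child":
        return dict(event)
    if event.get("event_type") not in CHILD_ALLOWED_EVENTS:
        return None
    return {k: v for k, v in event.items() if k not in CHILD_DROPPED_FIELDS}
```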
GDPR (General Data Protection Regulation)¶
Requirements Implemented:
- Consent Management: Users provide granular consent for:
  - Essential (playback functionality) -- always required
  - Analytics (engagement measurement) -- opt-in
  - Personalization (recommendations) -- opt-in
  - Advertising (behavioral ads for AVOD) -- opt-in
- Right to Erasure (Article 17): A deletion request triggers:
  - Immediate: remove user from active recommendation models
  - 24 hours: purge user_id from Silver/Gold aggregations
  - 72 hours: delete all Bronze event records matching user_id
  - Confirmation: erasure completion notification to the user
- Data Portability (Article 20): Users can export their viewing history, preferences, and profile data in machine-readable JSON format.
- Purpose Limitation: Event data collected for playback functionality is not repurposed for advertising without an explicit consent upgrade.
- Pseudonymization: All user identifiers in the lakehouse are pseudonymized. The mapping table (user_id <-> PII) is stored in a separate, access-controlled system outside Fabric.
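One common way to implement such pseudonymization is a keyed HMAC; the document does not specify the actual mechanism, so this is a sketch:

```python
import hashlib
import hmac

def pseudonymize(raw_user_id: str, secret_key: bytes) -> str:
    """Keyed pseudonymization: HMAC-SHA256 of the raw identifier.

    The key lives outside Fabric alongside the PII mapping table. The
    same input always maps to the same pseudonym, so joins across
    layers still work, but the raw ID cannot be recovered without the key.
    """
    return hmac.new(secret_key, raw_user_id.encode("utf-8"),
                    hashlib.sha256).hexdigest()
```

A plain unkeyed hash would be weaker: anyone who can enumerate candidate IDs could reverse it by brute force.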
GDPR Erasure Pipeline:
# Triggered by erasure request via API
import re
from datetime import datetime, timezone

def process_erasure(user_id: str):
    """Delete user data across all medallion layers."""
    # Guard against SQL injection: user_id is interpolated into SQL below
    if not re.fullmatch(r"[A-Za-z0-9_-]+", user_id):
        raise ValueError(f"invalid user_id: {user_id!r}")
    # 1. Gold: remove from segments and affinity tables
    spark.sql(f"DELETE FROM gold_user_segments WHERE user_id = '{user_id}'")
    # 2. Silver: remove sessionized data
    spark.sql(f"DELETE FROM silver_media_sessions WHERE user_id = '{user_id}'")
    # 3. Bronze: remove raw events (partitioned scan)
    spark.sql(f"DELETE FROM bronze_media_events WHERE user_id = '{user_id}'")
    # 4. Eventhouse: KQL purge, issued separately
    # .purge table PlaybackEvents records where UserId == user_id
    # 5. Audit log
    log_erasure_completion(user_id, datetime.now(timezone.utc))
Cost Model¶
Monthly Fabric Consumption (~$25,000/month on F64)¶
| Workload | CU Allocation | Monthly Cost | Notes |
|---|---|---|---|
| Eventstream | 25% | $6,250 | 2B events/day ingestion |
| Eventhouse (KQL) | 20% | $5,000 | Real-time dashboards, 7-day hot cache |
| Lakehouse (Spark) | 30% | $7,500 | Medallion ETL, daily batch |
| Power BI Direct Lake | 15% | $3,750 | 50+ dashboards, 200 users |
| OneLake Storage | 10% | $2,500 | ~15 TB Delta + archive |
ROI Analysis¶
| Revenue Driver | Annual Impact |
|---|---|
| Churn reduction (1.5pp improvement) | $4,200,000 |
| Ad revenue optimization (+18% CPM) | $1,800,000 |
| Content acquisition savings (CPVH) | $900,000 |
| Trial conversion improvement (+3pp) | $600,000 |
| Total Annual Benefit | $7,500,000 |
| Annual Fabric Cost | $300,000 |
| ROI | 25:1 |
Real-Time Intelligence¶
Eventhouse KQL Queries¶
Live Concurrent Viewers:
PlaybackEvents
| where EventTimestamp > ago(5m)
| where EventType in ("play", "heartbeat")
| summarize ConcurrentViewers = dcount(UserId) by bin(EventTimestamp, 1m)
| render timechart
Content Trending Detection:
PlaybackEvents
| where EventTimestamp > ago(1h)
| summarize Viewers = dcount(UserId) by ContentId
| join kind=inner (
PlaybackEvents
| where EventTimestamp between (ago(2h) .. ago(1h))
| summarize PrevViewers = dcount(UserId) by ContentId
) on ContentId
| extend TrendPct = round((Viewers - PrevViewers) * 100.0 / PrevViewers, 1)
| where TrendPct > 50
| order by TrendPct desc
Quality of Experience Alert:
PlaybackEvents
| where EventTimestamp > ago(10m)
| where EventType == "buffering"
| summarize BufferEvents = count(), AffectedUsers = dcount(UserId) by Region
| where BufferEvents > 1000
| order by BufferEvents desc
Power BI Dashboards¶
Dashboard 1: Content Command Center¶
Purpose: Real-time content performance for programming executives.
Visuals: - Live concurrent viewer count (card with sparkline) - Top 20 trending titles (bar chart, hourly refresh) - Completion rate heatmap by genre x day-of-week - New release first-48-hour trajectory vs. benchmark
Dashboard 2: Subscriber Health¶
Purpose: Lifecycle management for growth and retention teams.
Visuals: - Subscriber funnel: Trial -> Active -> At-Risk -> Churned - Cohort retention curves (month-over-month) - Engagement segment distribution (power/casual/at-risk/dormant) - Churn prediction risk scores with intervention recommendations
Dashboard 3: Ad Revenue Operations¶
Purpose: AVOD monetization for ad sales and ad ops teams.
Visuals: - Impression yield and fill rate trends - CPM by genre and daypart - Ad-induced drop-off by ad pod length - Revenue forecast vs. actual (daily)
Implementation Checklist¶
- Deploy Eventstream with Event Hub source (32 partitions)
- Configure Eventhouse KQL database with 7-day retention
- Create Bronze notebook: 58_media_events.py
- Create Silver notebook: 58_media_sessions.py
- Create Gold notebook: 58_media_recommendations.py
- Configure COPPA age-gate filtering in Silver pipeline
- Implement GDPR erasure pipeline with audit logging
- Build Power BI Direct Lake semantic model
- Deploy Content Command Center dashboard
- Deploy Subscriber Health dashboard
- Deploy Ad Revenue Operations dashboard
- Set up KQL alerting for QoE degradation
- Configure OneLake data retention policies (30d child, 90d adult)
- Validate COPPA data minimization in child profile events
- Load test: 2B events/day sustained throughput