Tutorial 48: Insurance Claims Analytics on Microsoft Fabric¶

Learning Objectives¶

By the end of this tutorial you will be able to:

Generate synthetic P&C insurance claims data across four lines of business
Ingest raw claims into a Bronze lakehouse with schema enforcement
Validate, deduplicate, and enrich claims in a Silver layer
Build actuarial loss development triangles in the Gold layer
Compute fraud score features for SIU referral
Create a Power BI dashboard for claims analytics

Architecture Overview¶

graph LR
    A[Claims Generator] -->|Parquet| B[Files/landing/insurance/]
    B --> C[Bronze: 52_insurance_claims.py]
    C --> D[Silver: 52_insurance_validated.py]
    D --> E[Gold: 52_insurance_predictions.py]
    E --> F[gold_insurance_loss_triangles]
    E --> G[gold_insurance_fraud_scores]
    E --> H[gold_insurance_loss_ratios]
    E --> I[gold_insurance_kpis]
    F --> J[Power BI Direct Lake]
    G --> J
    H --> J
    I --> J

graph TB
    subgraph Bronze
        B1[bronze_insurance_claims]
    end
    subgraph Silver
        S1[silver_insurance_claims]
    end
    subgraph Gold
        G1[gold_insurance_loss_triangles]
        G2[gold_insurance_fraud_scores]
        G3[gold_insurance_loss_ratios]
        G4[gold_insurance_kpis]
    end
    B1 --> S1
    S1 --> G1
    S1 --> G2
    S1 --> G3
    S1 --> G4

Prerequisites¶

Microsoft Fabric workspace with F64 capacity
Python 3.10+ (for data generation)
Fabric Lakehouse: lh_bronze, lh_silver, lh_gold
Required Python packages: numpy, pandas, faker, tqdm

Step 1 -- Generate Synthetic Insurance Data¶

Install Dependencies¶

pip install numpy pandas faker tqdm

Generate Claims Data¶

from datetime import datetime
from data_generation.generators.insurance.claims_generator import InsuranceClaimsGenerator

generator = InsuranceClaimsGenerator(
    seed=42,
    num_policies=5_000,
    num_adjusters=50,
    start_date=datetime(2023, 1, 1),
    end_date=datetime(2025, 12, 31),
)

# Generate 10,000 claims
claims_df = generator.generate(num_records=10_000)
print(f"Generated {len(claims_df):,} claims")
print(f"Fraud rate: {claims_df['fraud_flag'].mean():.1%}")
print(f"LOB distribution:\n{claims_df['line_of_business'].value_counts()}")

# Save to Parquet for upload
generator.to_parquet(claims_df, "output/insurance/claims.parquet")

Verify the Data¶

# Check key distributions
print(f"\nStatus distribution:")
print(claims_df["status"].value_counts())

print(f"\nReserve statistics:")
print(claims_df["reserve_amt"].describe())

print(f"\nStates covered: {claims_df['state'].nunique()}")

Checkpoint: You should see ~10,000 claims with ~2% fraud rate, 4 LOBs, and 30 states.

Step 2 -- Upload Data to Fabric Lakehouse¶

Option A: Manual Upload¶

Open your Fabric workspace
Navigate to lh_bronze lakehouse
Click Files > Upload > Upload folder
Upload the output/insurance/ folder to Files/landing/insurance/

Option B: Using mssparkutils¶

from notebookutils import mssparkutils

# Copy from local to lakehouse
mssparkutils.fs.cp(
    "file:///tmp/insurance/claims.parquet",
    "Files/landing/insurance/claims.parquet",
)

Checkpoint: Verify files appear under Files/landing/insurance/ in the lakehouse explorer.

Step 3 -- Bronze Ingestion¶

Import and Run Notebook¶

Import notebooks/bronze/52_insurance_claims.py into your Fabric workspace
Attach to a Spark cluster
Run all cells

What This Notebook Does¶

Step	Action	Output
Schema definition	Defines 14-column schema with types	StructType
Read landing	Reads Parquet with schema enforcement	Raw DataFrame
Add metadata	Adds `_ingested_at`, `_source_file`	Enriched DataFrame
Quality check	Counts records, unique claims, total reserves	Console summary
Write bronze	Appends to `bronze_insurance_claims` Delta table	Delta table

Verify Bronze Table¶

SELECT
    COUNT(*) as total_records,
    COUNT(DISTINCT claim_id) as unique_claims,
    COUNT(DISTINCT policy_id) as linked_policies,
    SUM(reserve_amt) as total_reserves
FROM lh_bronze.bronze_insurance_claims

Checkpoint: Record count matches generated data. All claims have valid claim_id.

Step 4 -- Silver Validation¶

Import and Run Notebook¶

Import notebooks/silver/52_insurance_validated.py
Run all cells

Validation Steps¶

#	Validation	Action on Failure
1	Deduplication	Keep latest ingestion per claim_id
2	Policy linkage	Remove claims with null/empty policy_id
3	Report lag	Calculate `report_lag_days = report_dt - loss_dt`; drop if negative
4	Development age	Calculate months since loss date
5	Loss type standardization	Lowercase, trim, underscores
6	Reserve validation	Flag active claims with zero reserves
7	PII masking	SHA-256 hash of claimant_name

Verify Silver Table¶

SELECT
    line_of_business,
    COUNT(*) as claims,
    AVG(report_lag_days) as avg_report_lag,
    AVG(reserve_amt) as avg_reserve,
    SUM(CASE WHEN reserve_warning IS NOT NULL THEN 1 ELSE 0 END) as reserve_warnings
FROM lh_silver.silver_insurance_claims
GROUP BY line_of_business

Checkpoint: No negative report lags. PII column replaced with hash. Loss types are standardized.

Step 5 -- Gold Analytics¶

Import and Run Notebook¶

Import notebooks/gold/52_insurance_predictions.py
Run all cells

Gold Tables Produced¶

Loss Ratios (`gold_insurance_loss_ratios`)¶

SELECT line_of_business, accident_year, accident_quarter,
       claim_count, incurred_losses, loss_ratio_estimate
FROM lh_gold.gold_insurance_loss_ratios
ORDER BY line_of_business, accident_year, accident_quarter

Fraud Scores (`gold_insurance_fraud_scores`)¶

-- High-risk claims for SIU review
SELECT claim_id, line_of_business, state, loss_type,
       reserve_amt, fraud_score, fraud_flag
FROM lh_gold.gold_insurance_fraud_scores
WHERE fraud_score > 75
ORDER BY fraud_score DESC
LIMIT 20

Loss Triangles (`gold_insurance_loss_triangles`)¶

-- Actuarial loss development for Auto
SELECT accident_period, dev_quarter,
       cumulative_incurred, cumulative_paid, claim_count
FROM lh_gold.gold_insurance_loss_triangles
WHERE line_of_business = 'auto'
ORDER BY accident_period, dev_quarter

KPIs (`gold_insurance_kpis`)¶

SELECT line_of_business, accident_year,
       total_claims, closure_rate, subro_recovery_rate, avg_report_lag
FROM lh_gold.gold_insurance_kpis
ORDER BY line_of_business, accident_year

Checkpoint: All 4 gold tables populated. Fraud scores range 0-100. Loss triangles show increasing cumulative incurred by development quarter.

Step 6 -- Power BI Dashboard¶

Create Direct Lake Semantic Model¶

In your Fabric workspace, click New > Semantic model
Select Direct Lake connection
Add gold tables: gold_insurance_loss_ratios, gold_insurance_fraud_scores, gold_insurance_kpis

Suggested Dashboard Pages¶

Page 1: Executive Summary¶

Card: Total claims, total incurred, overall loss ratio
Line chart: Loss ratio trend by quarter
Bar chart: Claims count by LOB
Map: Incurred losses by state

Page 2: Fraud Analytics¶

Histogram: Fraud score distribution
Table: Top 20 high-risk claims (score > 75)
Bar chart: Fraud referrals by LOB
KPI: SIU referral count vs. confirmed fraud rate

Page 3: Actuarial View¶

Matrix: Loss development triangle (accident period x development quarter)
Line chart: Cumulative incurred by accident period
Bar chart: Closure rate by LOB and quarter
Card: Average report lag, average reserve

Page 4: Subrogation¶

Bar chart: Recovery rate by LOB
Table: Top subrogation opportunities
Trend: Quarterly subrogation recovery trend

Step 7 -- NAIC Compliance Checks¶

Data Quality for Statutory Reporting¶

-- Reconciliation check: Bronze vs Silver counts
SELECT 'bronze' as layer, COUNT(*) as records FROM lh_bronze.bronze_insurance_claims
UNION ALL
SELECT 'silver', COUNT(*) FROM lh_silver.silver_insurance_claims
UNION ALL
SELECT 'gold_triangles', COUNT(*) FROM lh_gold.gold_insurance_loss_triangles
UNION ALL
SELECT 'gold_fraud', COUNT(*) FROM lh_gold.gold_insurance_fraud_scores

Reserve Adequacy Check¶

-- Compare case reserves to paid for closed claims
SELECT
    line_of_business,
    COUNT(*) as closed_claims,
    AVG(reserve_amt) as avg_case_reserve,
    AVG(paid_amt) as avg_paid,
    AVG(reserve_amt - paid_amt) as avg_reserve_redundancy
FROM lh_silver.silver_insurance_claims
WHERE status IN ('closed_paid')
GROUP BY line_of_business

Checkpoint: Bronze-to-silver record count variance is explained by deduplication and validation removals.

Troubleshooting¶

Issue	Cause	Solution
Empty bronze table	Landing path incorrect	Verify `Files/landing/insurance/` contains Parquet files
Zero silver records	All claims failed validation	Check policy_id is populated in source data
Negative report lag	Loss date after report date in source	Data quality issue in generator; re-generate with valid dates
No fraud scores > 75	Low fraud rate + low severity	Expected with 2% fraud rate; adjust thresholds
Triangle gaps	Insufficient data coverage	Generate data spanning at least 8 quarters

Summary¶

In this tutorial you built an end-to-end insurance claims analytics pipeline:

Layer	Table	Records	Purpose
Bronze	`bronze_insurance_claims`	~10,000	Raw ingestion with audit trail
Silver	`silver_insurance_claims`	~9,900+	Validated, deduplicated, PII-masked
Gold	`gold_insurance_loss_ratios`	~200+	Profitability by LOB/state/quarter
Gold	`gold_insurance_fraud_scores`	~9,900+	ML fraud feature vectors
Gold	`gold_insurance_loss_triangles`	~100+	Actuarial development triangles
Gold	`gold_insurance_kpis`	~50+	Closure rate, subrogation, lag metrics

Key capabilities demonstrated: - Medallion architecture for insurance domain - Actuarial loss triangle construction - Fraud scoring feature engineering - NAIC compliance data quality controls - Power BI Direct Lake integration

Next: Tutorial 49 | Back to Tutorials

← Previous47 — Financial ServicesRead more →Next →49 — Retail & CPGRead more →

Tutorial 48: Insurance Claims Analytics on Microsoft Fabric¶

Learning Objectives¶

Architecture Overview¶

Prerequisites¶

Step 1 -- Generate Synthetic Insurance Data¶

Install Dependencies¶

Generate Claims Data¶

Verify the Data¶

Step 2 -- Upload Data to Fabric Lakehouse¶

Option A: Manual Upload¶

Option B: Using mssparkutils¶

Step 3 -- Bronze Ingestion¶

Import and Run Notebook¶

What This Notebook Does¶

Verify Bronze Table¶

Step 4 -- Silver Validation¶

Import and Run Notebook¶

Validation Steps¶

Verify Silver Table¶

Step 5 -- Gold Analytics¶

Import and Run Notebook¶

Gold Tables Produced¶

Loss Ratios (gold_insurance_loss_ratios)¶

Fraud Scores (gold_insurance_fraud_scores)¶

Loss Triangles (gold_insurance_loss_triangles)¶

KPIs (gold_insurance_kpis)¶

Step 6 -- Power BI Dashboard¶

Create Direct Lake Semantic Model¶

Suggested Dashboard Pages¶

Page 1: Executive Summary¶

Page 2: Fraud Analytics¶

Page 3: Actuarial View¶

Page 4: Subrogation¶

Step 7 -- NAIC Compliance Checks¶

Data Quality for Statutory Reporting¶

Reserve Adequacy Check¶

Troubleshooting¶

Related Resources¶

Summary¶

Loss Ratios (`gold_insurance_loss_ratios`)¶

Fraud Scores (`gold_insurance_fraud_scores`)¶

Loss Triangles (`gold_insurance_loss_triangles`)¶

KPIs (`gold_insurance_kpis`)¶