📊 Medallion Architecture - Sample Data

This directory contains sample datasets for the Medallion Architecture tutorial.

📁 Structure

data/
├── bronze/          # Raw data as received from source systems
│   ├── sales/       # Sales transaction data
│   ├── customers/   # Customer master data
│   └── products/    # Product catalog data
├── silver/          # Cleaned and validated data (generated by processing)
└── gold/            # Business aggregates (generated by processing)

🎯 Sample Data Sets

Sales Data (Bronze)

File: bronze/sales/sales_2024.csv

Sample e-commerce sales transactions:

transaction_id,customer_id,product_id,transaction_timestamp,amount,quantity,currency
TXN001,CUST001,PROD123,2024-01-15T10:30:00Z,99.99,1,USD
TXN002,CUST002,PROD456,2024-01-15T11:45:00Z,149.50,2,USD

Characteristics:

  • 10,000 sample transactions
  • Date range: Jan 2024 - Dec 2024
  • Includes some data quality issues for demonstration
  • ~1MB file size
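
As a minimal sketch of how the Bronze sales file could be read into typed records, the snippet below parses the two sample rows shown above using only the Python standard library. The helper name `parse_sales` is hypothetical, and the rows are inlined so the example runs without the actual file.

```python
import csv
import io
from datetime import datetime
from decimal import Decimal

# The two sample rows from this README, inlined so the sketch is self-contained.
SAMPLE_CSV = """transaction_id,customer_id,product_id,transaction_timestamp,amount,quantity,currency
TXN001,CUST001,PROD123,2024-01-15T10:30:00Z,99.99,1,USD
TXN002,CUST002,PROD456,2024-01-15T11:45:00Z,149.50,2,USD
"""


def parse_sales(text):
    """Parse sales CSV text into typed dictionaries (hypothetical helper)."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = dict(raw)
        # Swap the trailing "Z" for an explicit offset so fromisoformat
        # also accepts the value on Python versions older than 3.11.
        row["transaction_timestamp"] = datetime.fromisoformat(
            raw["transaction_timestamp"].replace("Z", "+00:00")
        )
        # Decimal avoids binary floating-point rounding on currency values.
        row["amount"] = Decimal(raw["amount"])
        row["quantity"] = int(raw["quantity"])
        rows.append(row)
    return rows


sales = parse_sales(SAMPLE_CSV)
print(len(sales), sales[0]["amount"])
```

To read the real file, the same function could be fed the contents of bronze/sales/sales_2024.csv. Because that file intentionally contains quality issues, some rows may fail these strict conversions and would need handling first.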

Customer Data (Bronze)

File: bronze/customers/customers.json

Customer master data with demographics:

{
  "customer_id": "CUST001",
  "customer_name": "John Doe",
  "customer_email": "john.doe@example.com",
  "customer_segment": "Premium",
  "registration_date": "2023-06-15",
  "country": "USA",
  "lifetime_value": 2500.00
}

Characteristics:

  • 5,000 customer records
  • Includes email, segment, location
  • Some duplicate records for quality testing
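
The duplicate records mentioned above can be collapsed with a simple keep-first rule. The sketch below assumes customers.json is a JSON array of objects, which this README does not confirm. The `dedupe_customers` helper and the second customer ("Jane Roe") are invented for illustration.

```python
import json

# Hypothetical input: a JSON array of customer objects. Whether the real
# customers.json is an array or one object per line is an assumption.
SAMPLE_JSON = """[
  {"customer_id": "CUST001", "customer_name": "John Doe",
   "customer_email": "john.doe@example.com"},
  {"customer_id": "CUST001", "customer_name": "John Doe",
   "customer_email": "john.doe@example.com"},
  {"customer_id": "CUST002", "customer_name": "Jane Roe",
   "customer_email": null}
]"""


def dedupe_customers(records):
    """Keep the first record seen for each customer_id, preserving order."""
    first_seen = {}
    for record in records:
        # setdefault only stores the record if the key is not present yet.
        first_seen.setdefault(record["customer_id"], record)
    return list(first_seen.values())


customers = dedupe_customers(json.loads(SAMPLE_JSON))
print([c["customer_id"] for c in customers])
```

Keeping the first occurrence is only one possible policy. A real pipeline might instead prefer the most recently updated or most complete record.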

Product Data (Bronze)

File: bronze/products/products.parquet

Product catalog with pricing and categories:

Column        Type     Description
product_id    string   Unique product identifier
product_name  string   Product display name
category      string   Product category
subcategory   string   Product subcategory
price         decimal  List price in USD
cost          decimal  Cost of goods
supplier_id   string   Supplier identifier

Characteristics:

  • 1,000 products across 20 categories
  • Parquet format for efficient storage
  • ~500KB file size

🔄 Data Generation

To generate fresh sample data:

# Run the data generator script
python scripts/generate_sample_data.py --output bronze/

# Specify number of records
python scripts/generate_sample_data.py --sales 50000 --customers 10000

📊 Data Quality Issues

The sample data intentionally includes common quality issues:

Issue Type      Frequency  Example
Duplicates      ~2%        Same transaction_id multiple times
Missing Values  ~1%        NULL customer_email
Invalid Data    ~0.5%      Negative amounts, future dates
Format Issues   ~1%        Inconsistent date formats
Outliers        ~0.5%      Extremely high amounts

These issues are resolved in the Silver layer processing.
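
To make the cleaning rules concrete, here is a plain-Python sketch of how a few of the issues above might be filtered out. It is not the tutorial's Silver-layer code, which runs in Spark notebooks. The function names, the second date format, and the sample rows TXN003 to TXN005 are assumptions made for illustration. Missing values and outliers are left out of this sketch.

```python
from datetime import datetime, timezone

# Formats tried in order. The second is a hypothetical example of an
# "inconsistent" format, not one confirmed to occur in the sample data.
DATE_FORMATS = ("%Y-%m-%dT%H:%M:%SZ", "%d/%m/%Y %H:%M")


def parse_timestamp(value):
    """Return a UTC-aware datetime, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    return None


def clean_sales(rows, now):
    """Drop duplicate IDs, bad or future timestamps, and negative amounts."""
    seen_ids = set()
    cleaned = []
    for row in rows:
        if row["transaction_id"] in seen_ids:
            continue  # duplicate transaction_id
        timestamp = parse_timestamp(row["transaction_timestamp"])
        if timestamp is None or timestamp > now:
            continue  # unparseable or future-dated
        amount = float(row["amount"])
        if amount < 0:
            continue  # negative amount
        seen_ids.add(row["transaction_id"])
        cleaned.append({**row, "transaction_timestamp": timestamp, "amount": amount})
    return cleaned


rows = [
    {"transaction_id": "TXN001", "transaction_timestamp": "2024-01-15T10:30:00Z", "amount": "99.99"},
    {"transaction_id": "TXN001", "transaction_timestamp": "2024-01-15T10:30:00Z", "amount": "99.99"},
    {"transaction_id": "TXN003", "transaction_timestamp": "15/01/2024 11:45", "amount": "149.50"},
    {"transaction_id": "TXN004", "transaction_timestamp": "2024-01-16T09:00:00Z", "amount": "-20.00"},
    {"transaction_id": "TXN005", "transaction_timestamp": "2031-01-01T00:00:00Z", "amount": "10.00"},
]
cleaned = clean_sales(rows, now=datetime(2025, 1, 1, tzinfo=timezone.utc))
print([r["transaction_id"] for r in cleaned])
```

Passing `now` explicitly keeps the future-date check deterministic, which makes the rule easier to test than calling the clock inside the function.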

💾 Data Sizes

Layer   Records     Size    Format
Bronze  16,000      ~2MB    CSV, JSON, Parquet
Silver  15,600      ~1.5MB  Delta Lake
Gold    Aggregated  ~500KB  Delta Lake
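
The record counts can be cross-checked against the per-dataset figures given earlier in this README. The short calculation below only does that arithmetic. It does not say which specific records are dropped between Bronze and Silver.

```python
# Per-dataset record counts as stated in the dataset descriptions.
bronze_counts = {"sales": 10_000, "customers": 5_000, "products": 1_000}

bronze_total = sum(bronze_counts.values())  # matches the Bronze row
silver_total = 15_600                       # from the Silver row
removed = bronze_total - silver_total
removed_share = removed / bronze_total

print(f"{bronze_total} bronze, {removed} removed ({removed_share:.1%})")
```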

🔒 Data Privacy

All sample data is synthetic and generated specifically for this tutorial:

  • No real customer information
  • No real transaction data
  • No real PII (Personally Identifiable Information); names and emails are synthetic
  • Safe for public distribution

📚 Usage in Tutorial

  1. Bronze Layer: Upload raw data to Data Lake
  2. Silver Layer: Process with Spark notebooks
  3. Gold Layer: Create aggregates and dimensions
  4. Validation: Query through Synapse SQL

See the Medallion Architecture Tutorial for complete instructions.
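
As a rough illustration of what a Gold-layer aggregate could look like, the sketch below totals `amount` per customer in plain Python. The choice of aggregate, the `revenue_by_customer` helper, and the third row (TXN003) are assumptions. The tutorial itself builds its aggregates with Spark and Delta Lake.

```python
from collections import defaultdict
from decimal import Decimal


def revenue_by_customer(rows):
    """Sum the amount column per customer_id (hypothetical Gold aggregate)."""
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for row in rows:
        totals[row["customer_id"]] += Decimal(row["amount"])
    return dict(totals)


rows = [
    {"transaction_id": "TXN001", "customer_id": "CUST001", "amount": "99.99"},
    {"transaction_id": "TXN002", "customer_id": "CUST002", "amount": "149.50"},
    # Hypothetical extra row so one customer has more than one transaction.
    {"transaction_id": "TXN003", "customer_id": "CUST001", "amount": "25.01"},
]
gold = revenue_by_customer(rows)
print(gold["CUST001"], gold["CUST002"])
```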


Last Updated: 2025-12-12
Data Version: 1.0.0