📊 Medallion Architecture - Sample Data
This directory contains sample datasets for the Medallion Architecture tutorial.
📁 Structure
data/
├── bronze/ # Raw data as received from source systems
│ ├── sales/ # Sales transaction data
│ ├── customers/ # Customer master data
│ └── products/ # Product catalog data
├── silver/ # Cleaned and validated data (generated by processing)
└── gold/ # Business aggregates (generated by processing)
🎯 Sample Data Sets
Sales Data (Bronze)
File: bronze/sales/sales_2024.csv
Sample e-commerce sales transactions:
transaction_id,customer_id,product_id,transaction_timestamp,amount,quantity,currency
TXN001,CUST001,PROD123,2024-01-15T10:30:00Z,99.99,1,USD
TXN002,CUST002,PROD456,2024-01-15T11:45:00Z,149.50,2,USD
Characteristics:
- 10,000 sample transactions
- Date range: Jan 2024 - Dec 2024
- Includes some data quality issues for demonstration
- ~1MB file size
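As a minimal sketch of how the columns above might be typed after loading, the following parses the two sample rows with the standard library. The rows are inlined so the example runs without the file, and the helper names `parse_sales_row` and `load_sales` are hypothetical, not part of the tutorial code. Because the real file deliberately contains malformed rows, a strict parser like this one would raise on some of them.

```python
import csv
import io
from datetime import datetime
from decimal import Decimal

# The two sample rows shown above, inlined so the sketch runs without the file.
SAMPLE_CSV = """transaction_id,customer_id,product_id,transaction_timestamp,amount,quantity,currency
TXN001,CUST001,PROD123,2024-01-15T10:30:00Z,99.99,1,USD
TXN002,CUST002,PROD456,2024-01-15T11:45:00Z,149.50,2,USD
"""


def parse_sales_row(row: dict) -> dict:
    """Convert the string fields of one CSV row to typed values (hypothetical helper)."""
    return {
        **row,
        # Swap the trailing "Z" for an explicit offset so fromisoformat
        # also accepts the value on Python versions older than 3.11.
        "transaction_timestamp": datetime.fromisoformat(
            row["transaction_timestamp"].replace("Z", "+00:00")
        ),
        # Decimal avoids binary floating-point rounding on currency amounts.
        "amount": Decimal(row["amount"]),
        "quantity": int(row["quantity"]),
    }


def load_sales(text_stream) -> list:
    """Read every row from a text stream containing the sales CSV."""
    return [parse_sales_row(r) for r in csv.DictReader(text_stream)]


rows = load_sales(io.StringIO(SAMPLE_CSV))
```

To read the actual file instead, one could pass `open("bronze/sales/sales_2024.csv", newline="")` to `load_sales`, wrapping `parse_sales_row` in error handling for the intentionally bad rows.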
Customer Data (Bronze)
File: bronze/customers/customers.json
Customer master data with demographics:
{
"customer_id": "CUST001",
"customer_name": "John Doe",
"customer_email": "john.doe@example.com",
"customer_segment": "Premium",
"registration_date": "2023-06-15",
"country": "USA",
"lifetime_value": 2500.00
}
Characteristics:
- 5,000 customer records
- Includes email, segment, location
- Some duplicate records for quality testing
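The duplicate records mentioned above can be illustrated with a small sketch that keeps the first record seen for each `customer_id`. It assumes the file holds a JSON array of objects shaped like the example (the actual layout of `customers.json` is not specified here), and the second customer in the inline data is made up for illustration. `dedupe_customers` is a hypothetical helper, not the tutorial's Silver logic.

```python
import json

# Inline stand-in for customers.json. Assumption: the file is a JSON array.
# CUST001 is the example record above, repeated to simulate a duplicate;
# CUST002 is a made-up record for illustration only.
SAMPLE_JSON = """
[
  {"customer_id": "CUST001", "customer_name": "John Doe",
   "customer_email": "john.doe@example.com", "customer_segment": "Premium",
   "registration_date": "2023-06-15", "country": "USA", "lifetime_value": 2500.00},
  {"customer_id": "CUST002", "customer_name": "Jane Roe",
   "customer_email": "jane.roe@example.com", "customer_segment": "Standard",
   "registration_date": "2023-07-01", "country": "USA", "lifetime_value": 300.00},
  {"customer_id": "CUST001", "customer_name": "John Doe",
   "customer_email": "john.doe@example.com", "customer_segment": "Premium",
   "registration_date": "2023-06-15", "country": "USA", "lifetime_value": 2500.00}
]
"""


def dedupe_customers(records: list) -> list:
    """Keep the first record for each customer_id, preserving input order."""
    first_seen = {}
    for record in records:
        # setdefault only stores the record if the key is not present yet.
        first_seen.setdefault(record["customer_id"], record)
    return list(first_seen.values())


customers = json.loads(SAMPLE_JSON)
unique_customers = dedupe_customers(customers)
```

Keeping the first occurrence is only one possible policy; a pipeline might instead prefer the most recently updated record if the source provides a timestamp for that.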
Product Data (Bronze)
File: bronze/products/products.parquet
Product catalog with pricing and categories:
| Column | Type | Description |
|---|---|---|
| product_id | string | Unique product identifier |
| product_name | string | Product display name |
| category | string | Product category |
| subcategory | string | Product subcategory |
| price | decimal | List price in USD |
| cost | decimal | Cost of goods |
| supplier_id | string | Supplier identifier |
Characteristics:
- 1,000 products across 20 categories
- Parquet format for efficient storage
- ~500KB file size
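One way to picture the column table above is as a typed record. The sketch below mirrors the seven columns in a dataclass and adds a few basic sanity checks; the class name `Product`, the helper `validate_product`, and the example values are all hypothetical and are not read from `products.parquet`.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Product:
    """One row of the product catalog, following the column table above."""
    product_id: str
    product_name: str
    category: str
    subcategory: str
    price: Decimal  # list price in USD
    cost: Decimal   # cost of goods
    supplier_id: str


def validate_product(product: Product) -> list:
    """Return a list of problems; an empty list means no issue was found."""
    problems = []
    if not product.product_id:
        problems.append("missing product_id")
    if product.price < 0:
        problems.append("negative price")
    if product.cost < 0:
        problems.append("negative cost")
    return problems


# Hypothetical values for illustration; not taken from the sample file.
example = Product(
    product_id="PROD123",
    product_name="Example Widget",
    category="Widgets",
    subcategory="Small Widgets",
    price=Decimal("99.99"),
    cost=Decimal("60.00"),
    supplier_id="SUP001",
)
example_problems = validate_product(example)
```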
🔄 Data Generation
To generate fresh sample data:
# Run the data generator script
python scripts/generate_sample_data.py --output bronze/
# Specify number of records
python scripts/generate_sample_data.py --sales 50000 --customers 10000
📊 Data Quality Issues
The sample data intentionally includes common quality issues:
| Issue Type | Approx. Share of Records | Example |
|---|---|---|
| Duplicates | ~2% | Same transaction_id multiple times |
| Missing Values | ~1% | NULL customer_email |
| Invalid Data | ~0.5% | Negative amounts, future dates |
| Format Issues | ~1% | Inconsistent date formats |
| Outliers | ~0.5% | Extremely high amounts |
These issues are resolved in the Silver layer processing.
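To make the issue types in the table concrete, here is a plain-Python sketch of the kind of rules a cleaning step could apply to sales rows: drop repeated `transaction_id` values, normalize timestamps, and reject future dates and negative or missing amounts. It is not the tutorial's Silver implementation. The alternative date formats, the assumption that all timestamps are UTC, and the function names are illustrative guesses, and every sample row except TXN001 is made up.

```python
from datetime import datetime, timezone

# Assumed input formats; the real data may use different ones.
DATE_FORMATS = ("%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%d %H:%M:%S", "%d/%m/%Y %H:%M")


def parse_timestamp(value):
    """Try each known format; return a UTC datetime, or None if none match."""
    if not value:
        return None
    for fmt in DATE_FORMATS:
        try:
            # Assumption: every timestamp is in UTC regardless of format.
            return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    return None


def clean_sales(rows, now):
    """Return rows that pass the duplicate, date, and amount checks."""
    seen_ids = set()
    cleaned = []
    for row in rows:
        txn_id = row.get("transaction_id")
        if not txn_id or txn_id in seen_ids:
            continue  # missing ID, or an ID that was already accepted
        timestamp = parse_timestamp(row.get("transaction_timestamp"))
        if timestamp is None or timestamp > now:
            continue  # unparseable or future-dated
        try:
            amount = float(row["amount"])
        except (KeyError, TypeError, ValueError):
            continue  # missing or non-numeric amount
        if amount < 0:
            continue  # negative amounts are treated as invalid
        seen_ids.add(txn_id)
        cleaned.append(
            {**row, "transaction_timestamp": timestamp.isoformat(), "amount": amount}
        )
    return cleaned


# TXN001 comes from the sample above; the remaining rows are made up.
raw_rows = [
    {"transaction_id": "TXN001", "transaction_timestamp": "2024-01-15T10:30:00Z", "amount": "99.99"},
    {"transaction_id": "TXN001", "transaction_timestamp": "2024-01-15T10:30:00Z", "amount": "99.99"},
    {"transaction_id": "TXN003", "transaction_timestamp": "2024-03-01T08:00:00Z", "amount": "-20.00"},
    {"transaction_id": "TXN004", "transaction_timestamp": "2999-01-01T00:00:00Z", "amount": "10.00"},
    {"transaction_id": "TXN005", "transaction_timestamp": "2024-02-01 09:00:00", "amount": "15.00"},
    {"transaction_id": "TXN006", "transaction_timestamp": "2024-02-02T09:00:00Z", "amount": ""},
]
cleaned_rows = clean_sales(raw_rows, now=datetime(2025, 1, 1, tzinfo=timezone.utc))
```

Passing `now` explicitly keeps the future-date check reproducible. Outlier handling is left out because it depends on a threshold that this page does not state.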
💾 Data Sizes
| Layer | Records | Size | Format |
|---|---|---|---|
| Bronze | 16,000 | ~2MB | CSV, JSON, Parquet |
| Silver | 15,600 | ~1.5MB | Delta Lake |
| Gold | Aggregated | ~500KB | Delta Lake |
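The record counts in this table can be cross-checked against the per-dataset figures given earlier. The short calculation below uses only numbers stated on this page: the Bronze total is the sum of the three datasets, and the gap between Bronze and Silver is 400 records, or 2.5% of the Bronze total.

```python
# Record counts from the "Characteristics" lists of each Bronze dataset.
bronze_counts = {"sales": 10_000, "customers": 5_000, "products": 1_000}
bronze_total = sum(bronze_counts.values())

# Silver record count from the table above.
silver_total = 15_600

removed = bronze_total - silver_total
retention = silver_total / bronze_total

print(f"Bronze total: {bronze_total:,}")
print(f"Removed in Silver: {removed:,} ({1 - retention:.1%})")
```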
🔒 Data Privacy
All sample data is synthetic and generated specifically for this tutorial:
- No real customer information
- No real transaction data
- No real PII (personally identifiable information); names and email addresses are fabricated
- Safe for public distribution
📚 Usage in Tutorial
- Bronze Layer: Upload raw data to Data Lake
- Silver Layer: Process with Spark notebooks
- Gold Layer: Create aggregates and dimensions
- Validation: Query through Synapse SQL
See the Medallion Architecture Tutorial for complete instructions.
Last Updated: 2025-12-12
Data Version: 1.0.0