🏛️ Medallion Architecture - Complete Tutorial¶

Comparative positioning note

This document is written from the perspective of Microsoft Azure, Cloud Scale Analytics, and CSA Loom. Any description of third-party or competing products, services, pricing, or capabilities is derived from publicly available documentation and sources believed accurate at the time of writing, and is provided for general comparison only. We do not claim expertise in, or authority over, any non-Microsoft product or service; the respective vendor's official documentation is the authoritative source for their offerings, which may change over time. Nothing here is intended to disparage any vendor — where a competing product has genuine advantages, we aim to note them honestly. Verify all third-party details against the vendor's current official documentation before making decisions.

Complete walkthrough tutorial for implementing the Medallion Architecture pattern on Azure. Build a production-ready data lakehouse with Bronze, Silver, and Gold layers using Azure Synapse Analytics, Data Lake Gen2, and Delta Lake.

🎯 Tutorial Overview¶

What You'll Build¶

A complete data lakehouse implementing the Medallion Architecture with three data quality layers:

graph LR
    A[Raw Data Sources] --> B[🥉 Bronze Layer<br/>Raw Ingestion]
    B --> C[🥈 Silver Layer<br/>Cleaned & Validated]
    C --> D[🥇 Gold Layer<br/>Business Ready]
    D --> E[Analytics & BI]
    D --> F[ML Models]
    D --> G[Applications]

    style A fill:#e3f2fd
    style B fill:#cd7f32
    style C fill:#c0c0c0
    style D fill:#ffd700
    style E fill:#c8e6c9
    style F fill:#ffccbc
    style G fill:#f8bbd0

Learning Objectives¶

By completing this tutorial, you will:

✅ Understand the Medallion Architecture pattern and its benefits
✅ Deploy Azure Synapse Analytics workspace with Spark pools
✅ Configure Data Lake Gen2 with proper folder structure
✅ Implement Bronze, Silver, and Gold data layers
✅ Process data through quality stages using Delta Lake
✅ Query data using Synapse SQL Serverless
✅ Monitor data pipelines and quality metrics
✅ Secure data access with proper RBAC and encryption

Architecture Components¶

Component	Purpose	Azure Service
Data Ingestion	Load raw data	Azure Data Factory
Bronze Layer	Raw data storage	Data Lake Gen2 + Delta Lake
Silver Layer	Cleaned data	Data Lake Gen2 + Delta Lake
Gold Layer	Business aggregates	Data Lake Gen2 + Delta Lake
Processing Engine	Transform data	Synapse Spark Pools
Query Engine	Ad-hoc queries	Synapse SQL Serverless
Orchestration	Pipeline management	Synapse Pipelines
Monitoring	Observability	Azure Monitor + Log Analytics

📋 Prerequisites¶

Before starting this tutorial, ensure you have completed the Prerequisites Guide and have:

Azure Subscription with Contributor access
Azure CLI installed and configured
VS Code with Azure extensions
Python 3.8+ environment set up
Jupyter notebook environment ready
Git repository cloned locally

Estimated Cost: $10-20 for completing this tutorial (remember to clean up resources after)

1. Architecture Deep Dive¶

🏛️ Medallion Architecture Explained¶

The Medallion Architecture is a data design pattern that organizes data in a lakehouse into three layers of increasing quality and refinement:

🥉 Bronze Layer (Raw)¶

Purpose: Preserve raw data exactly as received from source systems.

Characteristics:

Immutable raw data
Original format and structure
Minimal transformations
Audit trail preservation
Schema-on-read approach

Use Cases:

Data lineage and auditing
Reprocessing with new logic
Debugging data quality issues
Regulatory compliance

🥈 Silver Layer (Refined)¶

Purpose: Cleaned, validated, and enriched data ready for analytics.

Characteristics:

Data quality rules applied
Deduplication and cleansing
Schema validation
Type conversions
Business key enrichment

Use Cases:

Data science and ML features
Departmental analytics
Data exploration
Report data sources

🥇 Gold Layer (Curated)¶

Purpose: Business-level aggregates optimized for consumption.

Characteristics:

Denormalized structures
Pre-aggregated metrics
Dimension conforming
Star/snowflake schemas
Performance optimized

Use Cases:

BI dashboards and reports
Executive KPIs
Application integration
Real-time applications

Key Benefits¶

Benefit	Description
Data Quality	Progressive improvement through layers
Flexibility	Reprocess data without affecting sources
Performance	Query optimized curated data
Governance	Clear lineage and ownership
Scalability	Handle petabyte-scale data
Cost Effective	Optimize storage and compute separately

2. Environment Setup¶

Set Environment Variables¶

Create a .env file or export these variables:

# Azure Configuration
export AZURE_SUBSCRIPTION_ID="your-subscription-id"
export AZURE_TENANT_ID="your-tenant-id"
export LOCATION="eastus"
export RESOURCE_GROUP_NAME="rg-medallion-tutorial"

# Resource Names (must be globally unique)
export SYNAPSE_WORKSPACE_NAME="synapse-medallion-$(whoami)-${RANDOM}"
export STORAGE_ACCOUNT_NAME="samedallion$(whoami)${RANDOM}"
export DATA_LAKE_NAME="datalake-medallion"

# Configuration
export SPARK_POOL_NAME="sparkpool01"
export SQL_ADMIN_USER="sqladmin"
export SQL_ADMIN_PASSWORD="$(openssl rand -base64 32)"

# Save SQL password securely
echo $SQL_ADMIN_PASSWORD > .sql_password.txt
chmod 600 .sql_password.txt

# Login to Azure
az login

# Set subscription
az account set --subscription $AZURE_SUBSCRIPTION_ID

# Verify current account
az account show --output table

Create Resource Group¶

# Create resource group
az group create \
  --name $RESOURCE_GROUP_NAME \
  --location $LOCATION \
  --tags "Environment=Tutorial" "Pattern=Medallion" "CostCenter=Learning"

# Verify creation
az group show --name $RESOURCE_GROUP_NAME --output table

3. Deploy Infrastructure¶

Option A: Deploy Using Bicep (Recommended)¶

The Infrastructure as Code (IaC) approach for repeatable deployments.

Step 1: Review the Bicep Template

The complete Bicep template is located at: infrastructure/tutorials/batch/medallion-architecture/main.bicep

Key resources deployed:

Azure Data Lake Gen2 storage account
Azure Synapse Analytics workspace
Apache Spark pool (auto-scaling)
SQL Serverless endpoint
Log Analytics workspace
Azure Key Vault for secrets

Step 2: Deploy Infrastructure

# Navigate to infrastructure directory
cd infrastructure/tutorials/batch/medallion-architecture/

# Validate Bicep template
az deployment group validate \
  --resource-group $RESOURCE_GROUP_NAME \
  --template-file main.bicep \
  --parameters parameters.json

# Deploy infrastructure
az deployment group create \
  --name medallion-deployment-$(date +%Y%m%d-%H%M%S) \
  --resource-group $RESOURCE_GROUP_NAME \
  --template-file main.bicep \
  --parameters parameters.json \
  --parameters \
    synapseWorkspaceName=$SYNAPSE_WORKSPACE_NAME \
    storageAccountName=$STORAGE_ACCOUNT_NAME \
    sqlAdministratorLogin=$SQL_ADMIN_USER \
    sqlAdministratorLoginPassword=$SQL_ADMIN_PASSWORD \
    location=$LOCATION

# Monitor deployment
az deployment group list \
  --resource-group $RESOURCE_GROUP_NAME \
  --output table

Deployment Time: 10-15 minutes

Option B: Deploy Using Azure CLI¶

Quick deployment using CLI commands:

# Create storage account with Data Lake Gen2
az storage account create \
  --name $STORAGE_ACCOUNT_NAME \
  --resource-group $RESOURCE_GROUP_NAME \
  --location $LOCATION \
  --sku Standard_LRS \
  --kind StorageV2 \
  --hierarchical-namespace true \
  --enable-large-file-share \
  --tags "Layer=Storage" "Pattern=Medallion"

# Create Synapse workspace
az synapse workspace create \
  --name $SYNAPSE_WORKSPACE_NAME \
  --resource-group $RESOURCE_GROUP_NAME \
  --location $LOCATION \
  --storage-account $STORAGE_ACCOUNT_NAME \
  --file-system $DATA_LAKE_NAME \
  --sql-admin-login-user $SQL_ADMIN_USER \
  --sql-admin-login-password $SQL_ADMIN_PASSWORD \
  --tags "Service=Synapse" "Pattern=Medallion"

# Create Spark pool
az synapse spark pool create \
  --name $SPARK_POOL_NAME \
  --resource-group $RESOURCE_GROUP_NAME \
  --workspace-name $SYNAPSE_WORKSPACE_NAME \
  --node-count 3 \
  --node-size Small \
  --enable-auto-scale true \
  --min-node-count 3 \
  --max-node-count 10 \
  --spark-version 3.3

# Enable firewall rule for your IP
MY_IP=$(curl -s https://ifconfig.me)
az synapse workspace firewall-rule create \
  --name AllowMyIP \
  --resource-group $RESOURCE_GROUP_NAME \
  --workspace-name $SYNAPSE_WORKSPACE_NAME \
  --start-ip-address $MY_IP \
  --end-ip-address $MY_IP

Verify Deployment¶

# List deployed resources
az resource list \
  --resource-group $RESOURCE_GROUP_NAME \
  --output table

# Get Synapse workspace URL
az synapse workspace show \
  --name $SYNAPSE_WORKSPACE_NAME \
  --resource-group $RESOURCE_GROUP_NAME \
  --query connectivityEndpoints.web \
  --output tsv

4. Create Bronze Layer¶

Bronze Layer Structure¶

Create the folder structure in Data Lake Gen2:

# Get storage account key
STORAGE_KEY=$(az storage account keys list \
  --resource-group $RESOURCE_GROUP_NAME \
  --account-name $STORAGE_ACCOUNT_NAME \
  --query "[0].value" \
  --output tsv)

# Create bronze layer folders
az storage fs directory create \
  --name bronze/sales/2024 \
  --file-system $DATA_LAKE_NAME \
  --account-name $STORAGE_ACCOUNT_NAME \
  --account-key $STORAGE_KEY

az storage fs directory create \
  --name bronze/customers/2024 \
  --file-system $DATA_LAKE_NAME \
  --account-name $STORAGE_ACCOUNT_NAME \
  --account-key $STORAGE_KEY

az storage fs directory create \
  --name bronze/products/2024 \
  --file-system $DATA_LAKE_NAME \
  --account-name $STORAGE_ACCOUNT_NAME \
  --account-key $STORAGE_KEY

Upload Sample Data¶

Use the provided sample data or generate your own:

# Navigate to sample data directory
cd examples/architecture-patterns/batch/data/

# Upload bronze layer data
az storage blob upload-batch \
  --destination $DATA_LAKE_NAME/bronze/sales/2024 \
  --source ./bronze/sales/ \
  --account-name $STORAGE_ACCOUNT_NAME \
  --account-key $STORAGE_KEY

az storage blob upload-batch \
  --destination $DATA_LAKE_NAME/bronze/customers/2024 \
  --source ./bronze/customers/ \
  --account-name $STORAGE_ACCOUNT_NAME \
  --account-key $STORAGE_KEY

az storage blob upload-batch \
  --destination $DATA_LAKE_NAME/bronze/products/2024 \
  --source ./bronze/products/ \
  --account-name $STORAGE_ACCOUNT_NAME \
  --account-key $STORAGE_KEY

Bronze Layer Notebook¶

Open and run the Jupyter notebook: examples/architecture-patterns/batch/notebooks/01-bronze-layer.ipynb

Key Concepts Demonstrated:

Read raw CSV/JSON/Parquet files
Write to Delta Lake format
Preserve original structure
Add audit columns (ingestion_timestamp, source_system)
Implement idempotent loads

5. Build Silver Layer¶

Silver Layer Transformations¶

The Silver layer applies data quality rules and standardization.

Notebook: examples/architecture-patterns/batch/notebooks/02-silver-layer.ipynb

Key Transformations¶

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from delta.tables import *

# Initialize Spark session
spark = SparkSession.builder \
    .appName("Medallion-Silver-Layer") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Read from Bronze
bronze_sales = spark.read.format("delta").load("abfss://datalake@storage/bronze/sales")

# Silver transformations
silver_sales = bronze_sales \
    .dropDuplicates(["transaction_id"]) \
    .filter(col("amount") > 0) \
    .withColumn("transaction_date", to_date(col("transaction_timestamp"))) \
    .withColumn("amount_usd", col("amount").cast("decimal(18,2)")) \
    .withColumn("processed_timestamp", current_timestamp()) \
    .select(
        "transaction_id",
        "customer_id",
        "product_id",
        "transaction_date",
        "amount_usd",
        "quantity",
        "processed_timestamp"
    )

# Write to Silver Delta table
silver_sales.write \
    .format("delta") \
    .mode("overwrite") \
    .option("mergeSchema", "true") \
    .save("abfss://datalake@storage/silver/sales")

Data Quality Checks¶

Implement quality validations:

# Quality checks
total_bronze_records = bronze_sales.count()
total_silver_records = silver_sales.count()
rejected_records = total_bronze_records - total_silver_records

# Log quality metrics
quality_metrics = {
    "layer": "silver",
    "table": "sales",
    "timestamp": datetime.now(),
    "bronze_records": total_bronze_records,
    "silver_records": total_silver_records,
    "rejected_records": rejected_records,
    "quality_score": (total_silver_records / total_bronze_records) * 100
}

print(f"Quality Score: {quality_metrics['quality_score']:.2f}%")

6. Construct Gold Layer¶

Gold Layer Aggregations¶

Create business-ready aggregates and star schema.

Notebook: examples/architecture-patterns/batch/notebooks/03-gold-layer.ipynb

Create Fact Tables¶

# Daily sales aggregates
gold_daily_sales = silver_sales \
    .groupBy("transaction_date", "customer_id", "product_id") \
    .agg(
        sum("amount_usd").alias("total_amount"),
        sum("quantity").alias("total_quantity"),
        count("transaction_id").alias("transaction_count"),
        avg("amount_usd").alias("avg_transaction_amount")
    ) \
    .withColumn("processed_timestamp", current_timestamp())

# Write to Gold layer
gold_daily_sales.write \
    .format("delta") \
    .mode("overwrite") \
    .partitionBy("transaction_date") \
    .save("abfss://datalake@storage/gold/sales_daily")

Create Dimension Tables¶

# Customer dimension
dim_customers = silver_customers \
    .select(
        "customer_id",
        "customer_name",
        "customer_email",
        "customer_segment",
        "registration_date",
        "lifetime_value"
    ) \
    .dropDuplicates(["customer_id"])

dim_customers.write \
    .format("delta") \
    .mode("overwrite") \
    .save("abfss://datalake@storage/gold/dim_customers")

7. Query and Analyze Data¶

Create SQL Serverless Views¶

Connect to Synapse SQL Serverless and create views:

-- Create database
CREATE DATABASE MedallionGold;
GO

USE MedallionGold;
GO

-- Create external data source
CREATE EXTERNAL DATA SOURCE DataLakeStorage
WITH (
    LOCATION = 'abfss://datalake@{storage_account}.dfs.core.windows.net/'
);
GO

-- Create gold layer view
CREATE OR ALTER VIEW vw_sales_daily
AS
SELECT 
    transaction_date,
    customer_id,
    product_id,
    total_amount,
    total_quantity,
    transaction_count,
    avg_transaction_amount
FROM OPENROWSET(
    BULK 'gold/sales_daily',
    DATA_SOURCE = 'DataLakeStorage',
    FORMAT = 'DELTA'
) AS [result];
GO

-- Query the view
SELECT 
    transaction_date,
    SUM(total_amount) as daily_revenue,
    SUM(transaction_count) as daily_transactions
FROM vw_sales_daily
GROUP BY transaction_date
ORDER BY transaction_date DESC;

Power BI Integration¶

Notebook: examples/architecture-patterns/batch/notebooks/04-powerbi-integration.ipynb

Connect Power BI Desktop to Synapse SQL Serverless:

Open Power BI Desktop
Get Data → Azure → Azure Synapse Analytics SQL
Server: {synapse-workspace}.sql.azuresynapse.net
Database: MedallionGold
Select views: vw_sales_daily, vw_customer_metrics
Load data and create visualizations

8. Implement Monitoring¶

Configure Azure Monitor¶

# Create Log Analytics workspace
az monitor log-analytics workspace create \
  --resource-group $RESOURCE_GROUP_NAME \
  --workspace-name law-medallion-tutorial \
  --location $LOCATION

# Link to Synapse workspace
WORKSPACE_ID=$(az monitor log-analytics workspace show \
  --resource-group $RESOURCE_GROUP_NAME \
  --workspace-name law-medallion-tutorial \
  --query customerId \
  --output tsv)

az synapse workspace update \
  --name $SYNAPSE_WORKSPACE_NAME \
  --resource-group $RESOURCE_GROUP_NAME \
  --workspace-log-analytics-workspace $WORKSPACE_ID

Create Alerts¶

# Alert on pipeline failures
az monitor metrics alert create \
  --name "Pipeline-Failure-Alert" \
  --resource-group $RESOURCE_GROUP_NAME \
  --scopes "/subscriptions/$AZURE_SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP_NAME/providers/Microsoft.Synapse/workspaces/$SYNAPSE_WORKSPACE_NAME" \
  --condition "count PipelineFailedRuns > 0" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --severity 2 \
  --description "Alert when pipeline runs fail"

Monitor Notebook¶

Notebook: examples/architecture-patterns/batch/notebooks/05-monitoring.ipynb

Query logs and metrics:

from azure.monitor.query import LogsQueryClient
from azure.identity import DefaultAzureCredential

# Initialize client
credential = DefaultAzureCredential()
client = LogsQueryClient(credential)

# Query pipeline metrics
query = """
SynapsePipelineRuns
| where TimeGenerated > ago(24h) |
| summarize |
    TotalRuns = count(),
    SuccessfulRuns = countif(Status == 'Succeeded'),
    FailedRuns = countif(Status == 'Failed')
| project 
    TotalRuns,
    SuccessfulRuns,
    FailedRuns,
    SuccessRate = (SuccessfulRuns * 100.0) / TotalRuns
"""

response = client.query_workspace(workspace_id=WORKSPACE_ID, query=query, timespan=timedelta(days=1))

9. Security and Governance¶

Implement RBAC¶

# Assign Storage Blob Data Contributor to Synapse workspace
az role assignment create \
  --assignee-object-id $(az synapse workspace show --name $SYNAPSE_WORKSPACE_NAME --resource-group $RESOURCE_GROUP_NAME --query identity.principalId --output tsv) \
  --assignee-principal-type ServicePrincipal \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/$AZURE_SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP_NAME/providers/Microsoft.Storage/storageAccounts/$STORAGE_ACCOUNT_NAME"

Enable Encryption¶

# Enable encryption at rest
az storage account update \
  --name $STORAGE_ACCOUNT_NAME \
  --resource-group $RESOURCE_GROUP_NAME \
  --encryption-services blob file

# Enable TDE for Synapse dedicated pools
# (Configure in Synapse Studio)

Data Catalog Integration¶

Document tables in Azure Purview or Synapse:

# Add table descriptions and metadata
spark.sql("""
  ALTER TABLE gold.sales_daily 
  SET TBLPROPERTIES (
    'description' = 'Daily aggregated sales metrics',
    'owner' = 'DataEngineering',
    'refresh_frequency' = 'daily',
    'data_classification' = 'internal'
  )
""")

10. Cleanup and Cost Management¶

Delete Resources¶

⚠️ Warning: This will permanently delete all resources and data.

# Delete resource group and all resources
az group delete \
  --name $RESOURCE_GROUP_NAME \
  --yes \
  --no-wait

# Verify deletion
az group show --name $RESOURCE_GROUP_NAME

Cost Analysis¶

Review costs incurred during the tutorial:

# Get cost analysis
az consumption usage list \
  --start-date $(date -d '1 day ago' +%Y-%m-%d) \
  --end-date $(date +%Y-%m-%d) \
  --query "[?tags.Environment=='Tutorial'].{Service:meterName, Cost:pretaxCost}" \
  --output table

Cost Optimization Tips¶

For production deployments:

Auto-pause Spark pools when not in use
Use Azure Synapse Serverless for ad-hoc queries
Implement data lifecycle policies (hot/cool/archive tiers)
Right-size compute resources based on workload
Use Reserved Capacity for predictable workloads
Enable Azure Hybrid Benefit if applicable

📚 Additional Resources¶

Documentation¶

Sample Code¶

Notebooks: examples/architecture-patterns/batch/notebooks/
Infrastructure: infrastructure/tutorials/batch/medallion-architecture/
Sample Data: examples/architecture-patterns/batch/data/

Next Steps¶

Hub & Spoke Tutorial - Traditional data warehouse
Lambda Architecture Tutorial - Add streaming
Data Mesh Tutorial - Scale to enterprise

🎉 Congratulations¶

You've successfully built a production-ready Medallion Architecture on Azure! You now have:

✅ Three-tier data lakehouse with Bronze, Silver, and Gold layers
✅ Data quality progressive improvement pipeline
✅ Scalable processing using Synapse Spark
✅ Ad-hoc querying with SQL Serverless
✅ Monitoring and alerting configured
✅ Security best practices implemented

Share your success! Tweet your completed architecture diagram with #AzureSynapseAnalytics #MedallionArchitecture

Last Updated: 2025-12-12
Tutorial Version: 1.0
Estimated Completion Time: 2-3 hours
Difficulty:

🏛️ Medallion Architecture - Complete Tutorial¶

🎯 Tutorial Overview¶

What You'll Build¶

Learning Objectives¶

Architecture Components¶

📋 Prerequisites¶

📖 Table of Contents¶

1. Architecture Deep Dive¶

🏛️ Medallion Architecture Explained¶

🥉 Bronze Layer (Raw)¶

🥈 Silver Layer (Refined)¶

🥇 Gold Layer (Curated)¶

Key Benefits¶

2. Environment Setup¶

Set Environment Variables¶

Verify Azure Login¶

Create Resource Group¶

3. Deploy Infrastructure¶

Option A: Deploy Using Bicep (Recommended)¶

Option B: Deploy Using Azure CLI¶

Verify Deployment¶

4. Create Bronze Layer¶

Bronze Layer Structure¶

Upload Sample Data¶

Bronze Layer Notebook¶

5. Build Silver Layer¶

Silver Layer Transformations¶

Key Transformations¶

Data Quality Checks¶

6. Construct Gold Layer¶

Gold Layer Aggregations¶

Create Fact Tables¶

Create Dimension Tables¶

7. Query and Analyze Data¶

Create SQL Serverless Views¶

Power BI Integration¶

8. Implement Monitoring¶

Configure Azure Monitor¶

Create Alerts¶

Monitor Notebook¶

9. Security and Governance¶

Implement RBAC¶

Enable Encryption¶

Data Catalog Integration¶

10. Cleanup and Cost Management¶

Delete Resources¶

Cost Analysis¶

Cost Optimization Tips¶

📚 Additional Resources¶

Documentation¶

Sample Code¶

Next Steps¶

🎉 Congratulations¶