🏛️ Medallion Architecture - Complete Tutorial¶
Comparative positioning note
This document is written from the perspective of Microsoft Azure, Cloud Scale Analytics, and CSA Loom. Any description of third-party or competing products, services, pricing, or capabilities is derived from publicly available documentation and sources believed accurate at the time of writing, and is provided for general comparison only. We do not claim expertise in, or authority over, any non-Microsoft product or service; the respective vendor's official documentation is the authoritative source for their offerings, which may change over time. Nothing here is intended to disparage any vendor — where a competing product has genuine advantages, we aim to note them honestly. Verify all third-party details against the vendor's current official documentation before making decisions.
Complete walkthrough tutorial for implementing the Medallion Architecture pattern on Azure. Build a production-ready data lakehouse with Bronze, Silver, and Gold layers using Azure Synapse Analytics, Data Lake Gen2, and Delta Lake.
🎯 Tutorial Overview¶
What You'll Build¶
A complete data lakehouse implementing the Medallion Architecture with three data quality layers:
graph LR
A[Raw Data Sources] --> B[🥉 Bronze Layer<br/>Raw Ingestion]
B --> C[🥈 Silver Layer<br/>Cleaned & Validated]
C --> D[🥇 Gold Layer<br/>Business Ready]
D --> E[Analytics & BI]
D --> F[ML Models]
D --> G[Applications]
style A fill:#e3f2fd
style B fill:#cd7f32
style C fill:#c0c0c0
style D fill:#ffd700
style E fill:#c8e6c9
style F fill:#ffccbc
style G fill:#f8bbd0 Learning Objectives¶
By completing this tutorial, you will:
- ✅ Understand the Medallion Architecture pattern and its benefits
- ✅ Deploy Azure Synapse Analytics workspace with Spark pools
- ✅ Configure Data Lake Gen2 with proper folder structure
- ✅ Implement Bronze, Silver, and Gold data layers
- ✅ Process data through quality stages using Delta Lake
- ✅ Query data using Synapse SQL Serverless
- ✅ Monitor data pipelines and quality metrics
- ✅ Secure data access with proper RBAC and encryption
Architecture Components¶
| Component | Purpose | Azure Service |
|---|---|---|
| Data Ingestion | Load raw data | Azure Data Factory |
| Bronze Layer | Raw data storage | Data Lake Gen2 + Delta Lake |
| Silver Layer | Cleaned data | Data Lake Gen2 + Delta Lake |
| Gold Layer | Business aggregates | Data Lake Gen2 + Delta Lake |
| Processing Engine | Transform data | Synapse Spark Pools |
| Query Engine | Ad-hoc queries | Synapse SQL Serverless |
| Orchestration | Pipeline management | Synapse Pipelines |
| Monitoring | Observability | Azure Monitor + Log Analytics |
📋 Prerequisites¶
Before starting this tutorial, ensure you have completed the Prerequisites Guide and have:
- Azure Subscription with Contributor access
- Azure CLI installed and configured
- VS Code with Azure extensions
- Python 3.8+ environment set up
- Jupyter notebook environment ready
- Git repository cloned locally
Estimated Cost: $10-20 for completing this tutorial (remember to clean up resources after)
📖 Table of Contents¶
- Architecture Deep Dive
- Environment Setup
- Deploy Infrastructure
- Create Bronze Layer
- Build Silver Layer
- Construct Gold Layer
- Query and Analyze Data
- Implement Monitoring
- Security and Governance
- Cleanup and Cost Management
1. Architecture Deep Dive¶
🏛️ Medallion Architecture Explained¶
The Medallion Architecture is a data design pattern that organizes data in a lakehouse into three layers of increasing quality and refinement:
🥉 Bronze Layer (Raw)¶
Purpose: Preserve raw data exactly as received from source systems.
Characteristics:
- Immutable raw data
- Original format and structure
- Minimal transformations
- Audit trail preservation
- Schema-on-read approach
Use Cases:
- Data lineage and auditing
- Reprocessing with new logic
- Debugging data quality issues
- Regulatory compliance
🥈 Silver Layer (Refined)¶
Purpose: Cleaned, validated, and enriched data ready for analytics.
Characteristics:
- Data quality rules applied
- Deduplication and cleansing
- Schema validation
- Type conversions
- Business key enrichment
Use Cases:
- Data science and ML features
- Departmental analytics
- Data exploration
- Report data sources
🥇 Gold Layer (Curated)¶
Purpose: Business-level aggregates optimized for consumption.
Characteristics:
- Denormalized structures
- Pre-aggregated metrics
- Dimension conforming
- Star/snowflake schemas
- Performance optimized
Use Cases:
- BI dashboards and reports
- Executive KPIs
- Application integration
- Real-time applications
Key Benefits¶
| Benefit | Description |
|---|---|
| Data Quality | Progressive improvement through layers |
| Flexibility | Reprocess data without affecting sources |
| Performance | Query optimized curated data |
| Governance | Clear lineage and ownership |
| Scalability | Handle petabyte-scale data |
| Cost Effective | Optimize storage and compute separately |
2. Environment Setup¶
Set Environment Variables¶
Create a .env file or export these variables:
# Azure Configuration
export AZURE_SUBSCRIPTION_ID="your-subscription-id"
export AZURE_TENANT_ID="your-tenant-id"
export LOCATION="eastus"
export RESOURCE_GROUP_NAME="rg-medallion-tutorial"
# Resource Names (must be globally unique)
export SYNAPSE_WORKSPACE_NAME="synapse-medallion-$(whoami)-${RANDOM}"
export STORAGE_ACCOUNT_NAME="samedallion$(whoami)${RANDOM}"
export DATA_LAKE_NAME="datalake-medallion"
# Configuration
export SPARK_POOL_NAME="sparkpool01"
export SQL_ADMIN_USER="sqladmin"
export SQL_ADMIN_PASSWORD="$(openssl rand -base64 32)"
# Save SQL password securely
echo $SQL_ADMIN_PASSWORD > .sql_password.txt
chmod 600 .sql_password.txt
Verify Azure Login¶
# Login to Azure
az login
# Set subscription
az account set --subscription $AZURE_SUBSCRIPTION_ID
# Verify current account
az account show --output table
Create Resource Group¶
# Create resource group
az group create \
--name $RESOURCE_GROUP_NAME \
--location $LOCATION \
--tags "Environment=Tutorial" "Pattern=Medallion" "CostCenter=Learning"
# Verify creation
az group show --name $RESOURCE_GROUP_NAME --output table
3. Deploy Infrastructure¶
Option A: Deploy Using Bicep (Recommended)¶
The Infrastructure as Code (IaC) approach for repeatable deployments.
Step 1: Review the Bicep Template
The complete Bicep template is located at: infrastructure/tutorials/batch/medallion-architecture/main.bicep
Key resources deployed:
- Azure Data Lake Gen2 storage account
- Azure Synapse Analytics workspace
- Apache Spark pool (auto-scaling)
- SQL Serverless endpoint
- Log Analytics workspace
- Azure Key Vault for secrets
Step 2: Deploy Infrastructure
# Navigate to infrastructure directory
cd infrastructure/tutorials/batch/medallion-architecture/
# Validate Bicep template
az deployment group validate \
--resource-group $RESOURCE_GROUP_NAME \
--template-file main.bicep \
--parameters parameters.json
# Deploy infrastructure
az deployment group create \
--name medallion-deployment-$(date +%Y%m%d-%H%M%S) \
--resource-group $RESOURCE_GROUP_NAME \
--template-file main.bicep \
--parameters parameters.json \
--parameters \
synapseWorkspaceName=$SYNAPSE_WORKSPACE_NAME \
storageAccountName=$STORAGE_ACCOUNT_NAME \
sqlAdministratorLogin=$SQL_ADMIN_USER \
sqlAdministratorLoginPassword=$SQL_ADMIN_PASSWORD \
location=$LOCATION
# Monitor deployment
az deployment group list \
--resource-group $RESOURCE_GROUP_NAME \
--output table
Deployment Time: 10-15 minutes
Option B: Deploy Using Azure CLI¶
Quick deployment using CLI commands:
# Create storage account with Data Lake Gen2
az storage account create \
--name $STORAGE_ACCOUNT_NAME \
--resource-group $RESOURCE_GROUP_NAME \
--location $LOCATION \
--sku Standard_LRS \
--kind StorageV2 \
--hierarchical-namespace true \
--enable-large-file-share \
--tags "Layer=Storage" "Pattern=Medallion"
# Create Synapse workspace
az synapse workspace create \
--name $SYNAPSE_WORKSPACE_NAME \
--resource-group $RESOURCE_GROUP_NAME \
--location $LOCATION \
--storage-account $STORAGE_ACCOUNT_NAME \
--file-system $DATA_LAKE_NAME \
--sql-admin-login-user $SQL_ADMIN_USER \
--sql-admin-login-password $SQL_ADMIN_PASSWORD \
--tags "Service=Synapse" "Pattern=Medallion"
# Create Spark pool
az synapse spark pool create \
--name $SPARK_POOL_NAME \
--resource-group $RESOURCE_GROUP_NAME \
--workspace-name $SYNAPSE_WORKSPACE_NAME \
--node-count 3 \
--node-size Small \
--enable-auto-scale true \
--min-node-count 3 \
--max-node-count 10 \
--spark-version 3.3
# Enable firewall rule for your IP
MY_IP=$(curl -s https://ifconfig.me)
az synapse workspace firewall-rule create \
--name AllowMyIP \
--resource-group $RESOURCE_GROUP_NAME \
--workspace-name $SYNAPSE_WORKSPACE_NAME \
--start-ip-address $MY_IP \
--end-ip-address $MY_IP
Verify Deployment¶
# List deployed resources
az resource list \
--resource-group $RESOURCE_GROUP_NAME \
--output table
# Get Synapse workspace URL
az synapse workspace show \
--name $SYNAPSE_WORKSPACE_NAME \
--resource-group $RESOURCE_GROUP_NAME \
--query connectivityEndpoints.web \
--output tsv
4. Create Bronze Layer¶
Bronze Layer Structure¶
Create the folder structure in Data Lake Gen2:
# Get storage account key
STORAGE_KEY=$(az storage account keys list \
--resource-group $RESOURCE_GROUP_NAME \
--account-name $STORAGE_ACCOUNT_NAME \
--query "[0].value" \
--output tsv)
# Create bronze layer folders
az storage fs directory create \
--name bronze/sales/2024 \
--file-system $DATA_LAKE_NAME \
--account-name $STORAGE_ACCOUNT_NAME \
--account-key $STORAGE_KEY
az storage fs directory create \
--name bronze/customers/2024 \
--file-system $DATA_LAKE_NAME \
--account-name $STORAGE_ACCOUNT_NAME \
--account-key $STORAGE_KEY
az storage fs directory create \
--name bronze/products/2024 \
--file-system $DATA_LAKE_NAME \
--account-name $STORAGE_ACCOUNT_NAME \
--account-key $STORAGE_KEY
Upload Sample Data¶
Use the provided sample data or generate your own:
# Navigate to sample data directory
cd examples/architecture-patterns/batch/data/
# Upload bronze layer data
az storage blob upload-batch \
--destination $DATA_LAKE_NAME/bronze/sales/2024 \
--source ./bronze/sales/ \
--account-name $STORAGE_ACCOUNT_NAME \
--account-key $STORAGE_KEY
az storage blob upload-batch \
--destination $DATA_LAKE_NAME/bronze/customers/2024 \
--source ./bronze/customers/ \
--account-name $STORAGE_ACCOUNT_NAME \
--account-key $STORAGE_KEY
az storage blob upload-batch \
--destination $DATA_LAKE_NAME/bronze/products/2024 \
--source ./bronze/products/ \
--account-name $STORAGE_ACCOUNT_NAME \
--account-key $STORAGE_KEY
Bronze Layer Notebook¶
Open and run the Jupyter notebook: examples/architecture-patterns/batch/notebooks/01-bronze-layer.ipynb
Key Concepts Demonstrated:
- Read raw CSV/JSON/Parquet files
- Write to Delta Lake format
- Preserve original structure
- Add audit columns (ingestion_timestamp, source_system)
- Implement idempotent loads
5. Build Silver Layer¶
Silver Layer Transformations¶
The Silver layer applies data quality rules and standardization.
Notebook: examples/architecture-patterns/batch/notebooks/02-silver-layer.ipynb
Key Transformations¶
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from delta.tables import *
# Initialize Spark session
spark = SparkSession.builder \
.appName("Medallion-Silver-Layer") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.getOrCreate()
# Read from Bronze
bronze_sales = spark.read.format("delta").load("abfss://datalake@storage/bronze/sales")
# Silver transformations
silver_sales = bronze_sales \
.dropDuplicates(["transaction_id"]) \
.filter(col("amount") > 0) \
.withColumn("transaction_date", to_date(col("transaction_timestamp"))) \
.withColumn("amount_usd", col("amount").cast("decimal(18,2)")) \
.withColumn("processed_timestamp", current_timestamp()) \
.select(
"transaction_id",
"customer_id",
"product_id",
"transaction_date",
"amount_usd",
"quantity",
"processed_timestamp"
)
# Write to Silver Delta table
silver_sales.write \
.format("delta") \
.mode("overwrite") \
.option("mergeSchema", "true") \
.save("abfss://datalake@storage/silver/sales")
Data Quality Checks¶
Implement quality validations:
# Quality checks
total_bronze_records = bronze_sales.count()
total_silver_records = silver_sales.count()
rejected_records = total_bronze_records - total_silver_records
# Log quality metrics
quality_metrics = {
"layer": "silver",
"table": "sales",
"timestamp": datetime.now(),
"bronze_records": total_bronze_records,
"silver_records": total_silver_records,
"rejected_records": rejected_records,
"quality_score": (total_silver_records / total_bronze_records) * 100
}
print(f"Quality Score: {quality_metrics['quality_score']:.2f}%")
6. Construct Gold Layer¶
Gold Layer Aggregations¶
Create business-ready aggregates and star schema.
Notebook: examples/architecture-patterns/batch/notebooks/03-gold-layer.ipynb
Create Fact Tables¶
# Daily sales aggregates
gold_daily_sales = silver_sales \
.groupBy("transaction_date", "customer_id", "product_id") \
.agg(
sum("amount_usd").alias("total_amount"),
sum("quantity").alias("total_quantity"),
count("transaction_id").alias("transaction_count"),
avg("amount_usd").alias("avg_transaction_amount")
) \
.withColumn("processed_timestamp", current_timestamp())
# Write to Gold layer
gold_daily_sales.write \
.format("delta") \
.mode("overwrite") \
.partitionBy("transaction_date") \
.save("abfss://datalake@storage/gold/sales_daily")
Create Dimension Tables¶
# Customer dimension
dim_customers = silver_customers \
.select(
"customer_id",
"customer_name",
"customer_email",
"customer_segment",
"registration_date",
"lifetime_value"
) \
.dropDuplicates(["customer_id"])
dim_customers.write \
.format("delta") \
.mode("overwrite") \
.save("abfss://datalake@storage/gold/dim_customers")
7. Query and Analyze Data¶
Create SQL Serverless Views¶
Connect to Synapse SQL Serverless and create views:
-- Create database
CREATE DATABASE MedallionGold;
GO
USE MedallionGold;
GO
-- Create external data source
CREATE EXTERNAL DATA SOURCE DataLakeStorage
WITH (
LOCATION = 'abfss://datalake@{storage_account}.dfs.core.windows.net/'
);
GO
-- Create gold layer view
CREATE OR ALTER VIEW vw_sales_daily
AS
SELECT
transaction_date,
customer_id,
product_id,
total_amount,
total_quantity,
transaction_count,
avg_transaction_amount
FROM OPENROWSET(
BULK 'gold/sales_daily',
DATA_SOURCE = 'DataLakeStorage',
FORMAT = 'DELTA'
) AS [result];
GO
-- Query the view
SELECT
transaction_date,
SUM(total_amount) as daily_revenue,
SUM(transaction_count) as daily_transactions
FROM vw_sales_daily
GROUP BY transaction_date
ORDER BY transaction_date DESC;
Power BI Integration¶
Notebook: examples/architecture-patterns/batch/notebooks/04-powerbi-integration.ipynb
Connect Power BI Desktop to Synapse SQL Serverless:
- Open Power BI Desktop
- Get Data → Azure → Azure Synapse Analytics SQL
- Server:
{synapse-workspace}.sql.azuresynapse.net - Database:
MedallionGold - Select views:
vw_sales_daily,vw_customer_metrics - Load data and create visualizations
8. Implement Monitoring¶
Configure Azure Monitor¶
# Create Log Analytics workspace
az monitor log-analytics workspace create \
--resource-group $RESOURCE_GROUP_NAME \
--workspace-name law-medallion-tutorial \
--location $LOCATION
# Link to Synapse workspace
WORKSPACE_ID=$(az monitor log-analytics workspace show \
--resource-group $RESOURCE_GROUP_NAME \
--workspace-name law-medallion-tutorial \
--query customerId \
--output tsv)
az synapse workspace update \
--name $SYNAPSE_WORKSPACE_NAME \
--resource-group $RESOURCE_GROUP_NAME \
--workspace-log-analytics-workspace $WORKSPACE_ID
Create Alerts¶
# Alert on pipeline failures
az monitor metrics alert create \
--name "Pipeline-Failure-Alert" \
--resource-group $RESOURCE_GROUP_NAME \
--scopes "/subscriptions/$AZURE_SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP_NAME/providers/Microsoft.Synapse/workspaces/$SYNAPSE_WORKSPACE_NAME" \
--condition "count PipelineFailedRuns > 0" \
--window-size 5m \
--evaluation-frequency 1m \
--severity 2 \
--description "Alert when pipeline runs fail"
Monitor Notebook¶
Notebook: examples/architecture-patterns/batch/notebooks/05-monitoring.ipynb
Query logs and metrics:
from azure.monitor.query import LogsQueryClient
from azure.identity import DefaultAzureCredential
# Initialize client
credential = DefaultAzureCredential()
client = LogsQueryClient(credential)
# Query pipeline metrics
query = """
SynapsePipelineRuns
| where TimeGenerated > ago(24h) |
| summarize |
TotalRuns = count(),
SuccessfulRuns = countif(Status == 'Succeeded'),
FailedRuns = countif(Status == 'Failed')
| project
TotalRuns,
SuccessfulRuns,
FailedRuns,
SuccessRate = (SuccessfulRuns * 100.0) / TotalRuns
"""
response = client.query_workspace(workspace_id=WORKSPACE_ID, query=query, timespan=timedelta(days=1))
9. Security and Governance¶
Implement RBAC¶
# Assign Storage Blob Data Contributor to Synapse workspace
az role assignment create \
--assignee-object-id $(az synapse workspace show --name $SYNAPSE_WORKSPACE_NAME --resource-group $RESOURCE_GROUP_NAME --query identity.principalId --output tsv) \
--assignee-principal-type ServicePrincipal \
--role "Storage Blob Data Contributor" \
--scope "/subscriptions/$AZURE_SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP_NAME/providers/Microsoft.Storage/storageAccounts/$STORAGE_ACCOUNT_NAME"
Enable Encryption¶
# Enable encryption at rest
az storage account update \
--name $STORAGE_ACCOUNT_NAME \
--resource-group $RESOURCE_GROUP_NAME \
--encryption-services blob file
# Enable TDE for Synapse dedicated pools
# (Configure in Synapse Studio)
Data Catalog Integration¶
Document tables in Azure Purview or Synapse:
# Add table descriptions and metadata
spark.sql("""
ALTER TABLE gold.sales_daily
SET TBLPROPERTIES (
'description' = 'Daily aggregated sales metrics',
'owner' = 'DataEngineering',
'refresh_frequency' = 'daily',
'data_classification' = 'internal'
)
""")
10. Cleanup and Cost Management¶
Delete Resources¶
⚠️ Warning: This will permanently delete all resources and data.
# Delete resource group and all resources
az group delete \
--name $RESOURCE_GROUP_NAME \
--yes \
--no-wait
# Verify deletion
az group show --name $RESOURCE_GROUP_NAME
Cost Analysis¶
Review costs incurred during the tutorial:
# Get cost analysis
az consumption usage list \
--start-date $(date -d '1 day ago' +%Y-%m-%d) \
--end-date $(date +%Y-%m-%d) \
--query "[?tags.Environment=='Tutorial'].{Service:meterName, Cost:pretaxCost}" \
--output table
Cost Optimization Tips¶
For production deployments:
- Auto-pause Spark pools when not in use
- Use Azure Synapse Serverless for ad-hoc queries
- Implement data lifecycle policies (hot/cool/archive tiers)
- Right-size compute resources based on workload
- Use Reserved Capacity for predictable workloads
- Enable Azure Hybrid Benefit if applicable
📚 Additional Resources¶
Documentation¶
Sample Code¶
- Notebooks:
examples/architecture-patterns/batch/notebooks/ - Infrastructure:
infrastructure/tutorials/batch/medallion-architecture/ - Sample Data:
examples/architecture-patterns/batch/data/
Next Steps¶
- Hub & Spoke Tutorial - Traditional data warehouse
- Lambda Architecture Tutorial - Add streaming
- Data Mesh Tutorial - Scale to enterprise
🎉 Congratulations¶
You've successfully built a production-ready Medallion Architecture on Azure! You now have:
- ✅ Three-tier data lakehouse with Bronze, Silver, and Gold layers
- ✅ Data quality progressive improvement pipeline
- ✅ Scalable processing using Synapse Spark
- ✅ Ad-hoc querying with SQL Serverless
- ✅ Monitoring and alerting configured
- ✅ Security best practices implemented
Share your success! Tweet your completed architecture diagram with #AzureSynapseAnalytics #MedallionArchitecture
Last Updated: 2025-12-12
Tutorial Version: 1.0
Estimated Completion Time: 2-3 hours
Difficulty: