🔗 Shared Metadata Architecture Description¶
Comparative positioning note
This document is written from the perspective of Microsoft Azure, Cloud Scale Analytics, and CSA Loom. Any description of third-party or competing products, services, pricing, or capabilities is derived from publicly available documentation and sources believed accurate at the time of writing, and is provided for general comparison only. We do not claim expertise in, or authority over, any non-Microsoft product or service; the respective vendor's official documentation is the authoritative source for their offerings, which may change over time. Nothing here is intended to disparage any vendor — where a competing product has genuine advantages, we aim to note them honestly. Verify all third-party details against the vendor's current official documentation before making decisions.
Detailed description and visual representation of the shared metadata architecture in Azure Synapse Analytics, enabling cross-engine data access and unified governance.
🎯 Overview¶
Shared metadata in Azure Synapse Analytics provides a unified catalog that allows different compute engines (Spark, SQL, Pipelines) to discover and access the same data assets seamlessly. This eliminates data silos, reduces duplication, and enables a truly unified analytics platform.
📊 Visual Architecture¶
High-Level Metadata Architecture¶
graph TB
subgraph "Synapse Workspace"
subgraph "Metadata Management Layer"
MetaService[Metadata Service<br/>Central Orchestrator]
Catalog[Unified Catalog<br/>Cross-engine Registry]
end
subgraph "Compute Engines"
SparkPool[Spark Pools<br/>Hive Metastore]
ServerlessSQL[Serverless SQL Pool<br/>SQL Metadata]
DedicatedSQL[Dedicated SQL Pool<br/>DW Metadata]
Pipeline[Synapse Pipelines<br/>Dataset Metadata]
end
subgraph "Metadata Storage"
SparkMetastore[(Spark Metastore<br/>Hive-compatible<br/>Azure SQL DB)]
SQLMetadata[(SQL Metadata<br/>sys.tables<br/>sys.columns)]
WorkspaceDB[(Workspace Database<br/>Shared Catalog<br/>Azure SQL DB)]
end
subgraph "External Metadata Integration"
Purview[Azure Purview<br/>Data Catalog<br/>Lineage]
ExternalCatalog[External Catalogs<br/>HMS, Glue, etc.]
end
end
subgraph "Data Storage"
DataLake[Data Lake Gen2<br/>Physical Data Files]
DedicatedStorage[Dedicated SQL Storage<br/>Columnar Format]
end
subgraph "Users & Applications"
SparkNotebook[Spark Notebooks<br/>Python, Scala, SQL]
SQLClient[SQL Clients<br/>SSMS, Azure Data Studio]
PowerBI[Power BI<br/>DirectQuery, Import]
DataFactory[Azure Data Factory<br/>External Pipelines]
end
SparkPool -.->|Register Tables| SparkMetastore
ServerlessSQL -.->|Create Views| SQLMetadata
DedicatedSQL -.->|Define Schema| SQLMetadata
SparkMetastore <-.->|Synchronize| MetaService
SQLMetadata <-.->|Synchronize| MetaService
MetaService <-.->|Update| Catalog
Catalog <-.->|Discover| SparkPool
Catalog <-.->|Discover| ServerlessSQL
Catalog <-.->|Discover| DedicatedSQL
SparkPool --> DataLake
ServerlessSQL --> DataLake
DedicatedSQL --> DedicatedStorage
Pipeline --> DataLake
Catalog -.->|Publish| Purview
Purview -.->|Enrich| Catalog
ExternalCatalog -.->|Import| MetaService
WorkspaceDB -.->|Shared Definitions| Catalog
SparkNotebook --> SparkPool
SQLClient --> ServerlessSQL
SQLClient --> DedicatedSQL
PowerBI --> ServerlessSQL
PowerBI --> DedicatedSQL
DataFactory --> Pipeline
style MetaService fill:#fff3e0
style Catalog fill:#e1f5fe
style SparkPool fill:#e8f5e9
style ServerlessSQL fill:#f3e5f5
style DedicatedSQL fill:#fce4ec
style Purview fill:#fff9c4
style DataLake fill:#e0f2f1 🏗️ Architecture Components¶
1. Metadata Management Layer¶
Metadata Service¶
Purpose: Central orchestrator for metadata synchronization across engines
Responsibilities:
- Synchronize metadata between Spark and SQL engines
- Handle schema evolution and versioning
- Manage metadata conflicts and resolution
- Provide unified discovery interface
- Track metadata lineage and dependencies
graph LR
subgraph "Metadata Service Operations"
Register[Register<br/>New Table]
Sync[Synchronize<br/>Metadata]
Discover[Discover<br/>Tables]
Update[Update<br/>Schema]
end
subgraph "Metadata Flow"
SparkCreate[Spark Creates<br/>Delta Table] --> Register
Register --> Catalog
Catalog --> Sync
Sync --> SQLVisible[Visible in<br/>SQL Pool]
SQLVisible --> Discover
SparkUpdate[Spark Updates<br/>Schema] --> Update
Update --> Catalog
end
style Register fill:#e8f5e9
style Sync fill:#e1f5fe
style Catalog fill:#fff3e0
style SQLVisible fill:#f3e5f5 Unified Catalog¶
Data Model:
erDiagram
DATABASE ||--o{ SCHEMA : contains
SCHEMA ||--o{ TABLE : contains
TABLE ||--o{ COLUMN : has
TABLE ||--o{ PARTITION : partitioned_by
TABLE }o--|| STORAGE_LOCATION : stored_in
TABLE }o--o{ TABLE_PROPERTY : has
COLUMN }o--|| DATA_TYPE : typed_as
DATABASE {
string name
string description
timestamp created_time
string owner
}
SCHEMA {
string name
string database_name
string description
timestamp created_time
}
TABLE {
string name
string schema_name
string table_type
string storage_format
string location
timestamp created_time
timestamp modified_time
}
COLUMN {
string name
string table_name
int ordinal_position
string data_type
boolean nullable
string description
}
PARTITION {
string column_name
string data_type
int ordinal_position
} Catalog Queries:
-- Query catalog for all databases
SELECT * FROM sys.databases;
-- Query catalog for tables in a schema
SELECT
t.name as table_name,
s.name as schema_name,
t.type_desc as table_type,
t.create_date,
t.modify_date
FROM sys.tables t
JOIN sys.schemas s ON t.schema_id = s.schema_id
WHERE s.name = 'gold';
-- Query catalog for column information
SELECT
c.name as column_name,
t.name as table_name,
ty.name as data_type,
c.max_length,
c.precision,
c.scale,
c.is_nullable
FROM sys.columns c
JOIN sys.tables t ON c.object_id = t.object_id
JOIN sys.types ty ON c.user_type_id = ty.user_type_id
WHERE t.name = 'customer_metrics';
2. Compute Engine Metadata¶
Spark Metastore¶
Hive Metastore Compatibility:
graph TB
subgraph "Spark Metastore Structure"
DB[Database/Schema]
TB[Tables]
Part[Partitions]
Cols[Columns]
SerDe[SerDe Info<br/>Storage Format]
Stats[Statistics<br/>Row counts, sizes]
end
DB --> TB
TB --> Part
TB --> Cols
TB --> SerDe
TB --> Stats
subgraph "Backend Storage"
AzureSQL[(Azure SQL Database<br/>Metastore Backend)]
end
DB -.-> AzureSQL
TB -.-> AzureSQL
Part -.-> AzureSQL
style DB fill:#fff3e0
style TB fill:#e1f5fe
style AzureSQL fill:#e8f5e9 Spark SQL DDL Examples:
# Create database
spark.sql("""
CREATE DATABASE IF NOT EXISTS sales
COMMENT 'Sales analytics database'
LOCATION 'abfss://gold@datalake.dfs.core.windows.net/sales'
""")
# Create managed Delta table (metadata + data in catalog)
spark.sql("""
CREATE TABLE sales.transactions (
transaction_id BIGINT,
customer_id INT,
amount DECIMAL(10,2),
transaction_date DATE
)
USING DELTA
PARTITIONED BY (transaction_date)
""")
# Create external Delta table (metadata in catalog, data in specified location)
spark.sql("""
CREATE EXTERNAL TABLE sales.customer_dim (
customer_id INT,
customer_name STRING,
email STRING,
segment STRING
)
USING DELTA
LOCATION 'abfss://gold@datalake.dfs.core.windows.net/dimensions/customer'
""")
# Describe table metadata
spark.sql("DESCRIBE EXTENDED sales.transactions").show(100, False)
# Show table properties
spark.sql("SHOW TBLPROPERTIES sales.transactions").show()
SQL Pool Metadata¶
Metadata Views:
-- System catalog views
SELECT * FROM INFORMATION_SCHEMA.TABLES;
SELECT * FROM INFORMATION_SCHEMA.COLUMNS;
SELECT * FROM INFORMATION_SCHEMA.VIEWS;
-- Extended metadata
SELECT * FROM sys.tables;
SELECT * FROM sys.columns;
SELECT * FROM sys.views;
SELECT * FROM sys.external_tables;
SELECT * FROM sys.external_data_sources;
-- Table statistics
SELECT * FROM sys.dm_pdw_nodes_db_partition_stats;
Creating SQL Objects Over Spark Tables:
-- Create external data source pointing to data lake
CREATE EXTERNAL DATA SOURCE DataLakeSource
WITH (
LOCATION = 'abfss://gold@datalake.dfs.core.windows.net',
CREDENTIAL = DataLakeCredential
);
-- Create external file format for Delta
CREATE EXTERNAL FILE FORMAT DeltaFormat
WITH (
FORMAT_TYPE = DELTA
);
-- Create external table over Spark-created Delta table
CREATE EXTERNAL TABLE sales.transactions_sql
WITH (
LOCATION = '/sales/transactions/',
DATA_SOURCE = DataLakeSource,
FILE_FORMAT = DeltaFormat
);
-- Create view with business logic
CREATE VIEW sales.monthly_summary AS
SELECT
YEAR(transaction_date) as year,
MONTH(transaction_date) as month,
SUM(amount) as total_sales,
COUNT(DISTINCT customer_id) as unique_customers
FROM sales.transactions_sql
GROUP BY YEAR(transaction_date), MONTH(transaction_date);
3. Cross-Engine Metadata Sharing¶
Metadata Synchronization Flow¶
sequenceDiagram
participant Spark as Spark Pool
participant HMS as Hive Metastore
participant MetaService as Metadata Service
participant Catalog as Unified Catalog
participant SQL as SQL Pool
Note over Spark,SQL: Create Table in Spark
Spark->>Spark: CREATE TABLE sales.transactions
Spark->>HMS: Register table metadata
HMS->>MetaService: Notify table creation
MetaService->>Catalog: Update unified catalog
Catalog->>SQL: Publish table availability
Note over Spark,SQL: Discover Table in SQL
SQL->>Catalog: Query available tables
Catalog->>SQL: Return table list (including sales.transactions)
SQL->>SQL: Create external table definition
SQL->>Catalog: Register SQL view
Catalog->>MetaService: Update unified catalog
Note over Spark,SQL: Schema Evolution in Spark
Spark->>Spark: ALTER TABLE ADD COLUMN
Spark->>HMS: Update table metadata
HMS->>MetaService: Notify schema change
MetaService->>Catalog: Update unified catalog
Catalog->>SQL: Notify schema change
Note over SQL: Existing queries may need refresh
Note over Spark,SQL: Both engines access same data
Spark->>DataLake: Query data
SQL->>DataLake: Query same data Automatic Table Discovery¶
Spark to SQL Discovery:
# Step 1: Create table in Spark
spark.sql("""
CREATE TABLE gold.product_metrics (
product_id INT,
product_name STRING,
category STRING,
total_sales DECIMAL(15,2),
metric_date DATE
)
USING DELTA
PARTITIONED BY (metric_date)
LOCATION 'abfss://gold@datalake.dfs.core.windows.net/products/metrics'
""")
# Step 2: Automatically visible in Serverless SQL Pool
# (No additional steps needed for discovery)
-- Step 3: Query from SQL Pool
-- Table is discoverable through catalog
SELECT
schema_name,
table_name,
create_date
FROM INFORMATION_SCHEMA.TABLES
WHERE schema_name = 'gold'
AND table_name = 'product_metrics';
-- Step 4: Create external table in SQL to access
CREATE EXTERNAL TABLE gold.product_metrics
WITH (
LOCATION = '/products/metrics/',
DATA_SOURCE = DataLakeSource,
FILE_FORMAT = DeltaFormat
);
-- Step 5: Query the data
SELECT
category,
SUM(total_sales) as category_sales
FROM gold.product_metrics
WHERE metric_date >= '2025-01-01'
GROUP BY category;
4. External Metadata Integration¶
Azure Purview Integration¶
graph TB
subgraph "Synapse Workspace"
SynapseCatalog[Unified Catalog]
SparkTables[Spark Tables]
SQLTables[SQL Tables]
end
subgraph "Azure Purview"
DataCatalog[Data Catalog<br/>Search & Discovery]
Lineage[Data Lineage<br/>End-to-end Tracking]
Classification[Data Classification<br/>Sensitive Data]
Glossary[Business Glossary<br/>Terms & Definitions]
end
subgraph "Enriched Metadata"
Business[Business Context]
Technical[Technical Metadata]
Operational[Operational Metadata]
end
SynapseCatalog -.->|Automatic Scan| DataCatalog
SparkTables -.->|Publish| DataCatalog
SQLTables -.->|Publish| DataCatalog
DataCatalog --> Lineage
DataCatalog --> Classification
DataCatalog --> Glossary
Lineage --> Business
Classification --> Business
Glossary --> Business
SparkTables --> Technical
SQLTables --> Technical
SynapseCatalog --> Operational
Business -.->|Enrich| SynapseCatalog
Technical -.->|Augment| DataCatalog
style SynapseCatalog fill:#e1f5fe
style DataCatalog fill:#fff3e0
style Lineage fill:#e8f5e9
style Classification fill:#f3e5f5 Purview Capabilities:
| Capability | Description | Benefit |
|---|---|---|
| Automatic Scanning | Discover Synapse tables automatically | Complete asset inventory |
| Data Lineage | Track data flow end-to-end | Impact analysis, debugging |
| Classification | Identify sensitive data (PII, PHI) | Compliance, security |
| Business Glossary | Define business terms | Shared understanding |
| Search & Discovery | Find data assets easily | Reduced time to insight |
Lineage Example:
graph LR
Source1[SQL Server<br/>Orders DB] --> ADF1[Data Factory<br/>Copy Activity]
Source2[Cosmos DB<br/>Customer DB] --> ADF2[Data Factory<br/>Copy Activity]
ADF1 --> Bronze1[Bronze Layer<br/>raw_orders]
ADF2 --> Bronze2[Bronze Layer<br/>raw_customers]
Bronze1 --> Spark1[Spark Notebook<br/>Cleanse Orders]
Bronze2 --> Spark2[Spark Notebook<br/>Cleanse Customers]
Spark1 --> Silver1[Silver Layer<br/>orders_cleaned]
Spark2 --> Silver2[Silver Layer<br/>customers_cleaned]
Silver1 --> Spark3[Spark Notebook<br/>Join & Aggregate]
Silver2 --> Spark3
Spark3 --> Gold1[Gold Layer<br/>customer_order_summary]
Gold1 --> PowerBI[Power BI<br/>Sales Dashboard]
style Bronze1 fill:#cd7f32
style Bronze2 fill:#cd7f32
style Silver1 fill:#c0c0c0
style Silver2 fill:#c0c0c0
style Gold1 fill:#ffd700
style PowerBI fill:#e8f5e9 🔄 Metadata Lifecycle¶
Table Creation Lifecycle¶
stateDiagram-v2
[*] --> Planned: Design table schema
Planned --> Created_Spark: Create in Spark
Planned --> Created_SQL: Create in SQL
Created_Spark --> Registered_HMS: Register in Hive Metastore
Created_SQL --> Registered_SQL: Register in SQL Metadata
Registered_HMS --> Synced: Metadata Service sync
Registered_SQL --> Synced: Metadata Service sync
Synced --> Discoverable: Available in Unified Catalog
Discoverable --> In_Use_Spark: Query from Spark
Discoverable --> In_Use_SQL: Query from SQL
Discoverable --> In_Use_Pipeline: Reference in Pipeline
In_Use_Spark --> Schema_Evolution: ALTER TABLE
In_Use_SQL --> Schema_Evolution: ALTER TABLE
Schema_Evolution --> Synced: Re-synchronize metadata
In_Use_Spark --> Deprecated: Mark as deprecated
In_Use_SQL --> Deprecated: Mark as deprecated
Deprecated --> Archived: Move to archive
Archived --> [*]: Metadata retained for audit Schema Evolution Scenarios¶
Adding Columns¶
# Spark: Add column to Delta table
spark.sql("""
ALTER TABLE sales.transactions
ADD COLUMN payment_method STRING
""")
# Automatic sync to catalog
# SQL Pool: Column visible after sync (< 5 minutes)
-- SQL: Verify new column
SELECT column_name, data_type
FROM INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'transactions'
AND column_name = 'payment_method';
Column Type Changes (with Delta)¶
# Spark: Change column type (Delta supports type widening)
spark.sql("""
ALTER TABLE sales.transactions
ALTER COLUMN amount TYPE DECIMAL(15,2)
""")
# Note: Type narrowing may require data validation
# Old data remains readable with type casting
Partition Evolution¶
# Spark: Add new partition column
spark.sql("""
ALTER TABLE sales.transactions
ADD PARTITION (region='US', year=2025)
""")
# Spark: Drop partition
spark.sql("""
ALTER TABLE sales.transactions
DROP PARTITION (region='EU', year=2024)
""")
🎯 Use Cases & Patterns¶
Use Case 1: Unified Data Engineering & Analytics¶
Scenario: Data engineers use Spark for ETL, analysts use SQL for querying
graph LR
subgraph "Data Engineering - Spark"
Engineer[Data Engineer]
SparkNotebook[Spark Notebook]
ETL[ETL Processing<br/>Bronze → Silver → Gold]
end
subgraph "Shared Metadata"
Catalog[Unified Catalog]
end
subgraph "Data Analytics - SQL"
Analyst[Data Analyst]
SQLQuery[SQL Queries]
Dashboard[Power BI Dashboard]
end
Engineer --> SparkNotebook
SparkNotebook --> ETL
ETL -.->|Register Tables| Catalog
Catalog -.->|Discover Tables| SQLQuery
Analyst --> SQLQuery
SQLQuery --> Dashboard
style ETL fill:#fff3e0
style Catalog fill:#e1f5fe
style SQLQuery fill:#e8f5e9
style Dashboard fill:#f3e5f5 Implementation:
# Data Engineer: Create gold tables in Spark
spark.sql("""
CREATE TABLE gold.daily_sales_summary (
sale_date DATE,
region STRING,
total_revenue DECIMAL(15,2),
order_count INT,
avg_order_value DECIMAL(10,2)
)
USING DELTA
PARTITIONED BY (sale_date)
""")
# Populate with ETL logic
daily_summary = spark.sql("""
SELECT
DATE(order_timestamp) as sale_date,
region,
SUM(amount) as total_revenue,
COUNT(*) as order_count,
AVG(amount) as avg_order_value
FROM silver.orders
GROUP BY DATE(order_timestamp), region
""")
daily_summary.write.format("delta") \
.mode("overwrite") \
.partitionBy("sale_date") \
.saveAsTable("gold.daily_sales_summary")
-- Data Analyst: Query in SQL (automatically available)
-- Create external table
CREATE EXTERNAL TABLE gold.daily_sales_summary_sql
WITH (
LOCATION = '/daily_sales_summary/',
DATA_SOURCE = DataLakeSource,
FILE_FORMAT = DeltaFormat
);
-- Query for dashboard
SELECT
region,
SUM(total_revenue) as total_revenue,
SUM(order_count) as total_orders,
AVG(avg_order_value) as avg_order_value
FROM gold.daily_sales_summary_sql
WHERE sale_date >= DATEADD(day, -30, GETDATE())
GROUP BY region
ORDER BY total_revenue DESC;
Use Case 2: Machine Learning Feature Store¶
Scenario: ML engineers create features in Spark, data scientists consume via SQL
graph TB
subgraph "Feature Engineering - Spark"
RawData[Raw Telemetry Data]
FeatureEng[Feature Engineering<br/>Spark Pipeline]
FeatureStore[Feature Store<br/>Delta Tables]
end
subgraph "Shared Metadata"
Catalog[Feature Catalog<br/>Metadata + Lineage]
end
subgraph "Model Training - SQL/Spark"
FeatureSelect[Feature Selection<br/>SQL Queries]
TrainingData[Training Dataset<br/>Exported to ML]
ModelTrain[Model Training<br/>Azure ML / Databricks]
end
RawData --> FeatureEng
FeatureEng --> FeatureStore
FeatureStore -.->|Register Features| Catalog
Catalog -.->|Discover Features| FeatureSelect
FeatureSelect --> TrainingData
TrainingData --> ModelTrain
style FeatureEng fill:#fff3e0
style FeatureStore fill:#e1f5fe
style Catalog fill:#e8f5e9
style ModelTrain fill:#f3e5f5 Use Case 3: Cross-Engine Data Validation¶
Scenario: Validate data quality across Spark and SQL engines
# Spark: Calculate data quality metrics
quality_metrics = spark.sql("""
SELECT
COUNT(*) as total_rows,
COUNT(DISTINCT customer_id) as unique_customers,
SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) as null_amounts,
AVG(amount) as avg_amount,
MIN(transaction_date) as min_date,
MAX(transaction_date) as max_date
FROM silver.transactions
""")
quality_metrics.write.format("delta") \
.mode("overwrite") \
.saveAsTable("monitoring.data_quality_metrics")
-- SQL: Query quality metrics for dashboard
SELECT
total_rows,
unique_customers,
null_amounts,
CAST(null_amounts AS FLOAT) / total_rows * 100 as null_percentage,
avg_amount,
min_date,
max_date
FROM monitoring.data_quality_metrics;
💡 Best Practices¶
Metadata Management¶
- Consistent Naming Conventions
- Use lowercase for database and table names
- Use underscores for multi-word names
-
Prefix with layer (bronze, silver, gold)
-
Documentation
- Add table and column descriptions
- Document data lineage
-
Maintain business glossary
-
Schema Evolution
- Plan for backward compatibility
- Use schema evolution features of Delta Lake
-
Communicate changes to consumers
-
Governance
- Implement access controls at multiple layers
- Use Azure Purview for data catalog
- Track sensitive data classifications
Performance Optimization¶
- Metadata Caching
- Metadata is cached by compute engines
- Cache refresh happens automatically
-
Manual refresh if needed:
REFRESH TABLE tablename -
Statistics Maintenance
- Keep statistics up to date
- Run ANALYZE TABLE periodically
-
Use auto-optimize features in Delta
-
Partition Strategy
- Partition by commonly filtered columns
- Avoid over-partitioning (< 1GB per partition)
- Align partitioning across related tables
🔗 Related Resources¶
Architecture Documentation¶
Implementation Guides¶
Best Practices¶
Last Updated: 2025-01-28 Architecture Version: 2.0 Pattern: Unified Metadata Catalog