🔗 Shared Metadata Store - Azure Synapse Analytics

Unified metadata catalog that enables seamless table sharing across Spark, SQL, and Data Explorer engines in Azure Synapse Analytics.


🌟 Overview

The Shared Metadata Store in Azure Synapse Analytics provides a unified catalog that allows tables created in one compute engine (Spark, SQL, Data Explorer) to be automatically discovered and queried by other engines. This eliminates data silos and enables true polyglot analytics.

🔥 Key Benefits

  • Cross-Engine Discovery: Tables automatically visible across all engines
  • Schema Synchronization: Schema changes propagate automatically
  • Security Inheritance: Access controls apply consistently
  • Data Lineage: Track data flow across engines
  • Simplified Governance: Single source of truth for metadata

🏗️ Architecture

The components and data flows, in Mermaid notation:

graph TB
    subgraph "Shared Metadata Store"
        MetaDB[(Metadata<br/>Database)]
    end

    subgraph "Compute Engines"
        Spark[Spark Pools<br/>Python, Scala, SQL]
        ServerlessSQL[Serverless SQL<br/>T-SQL]
        DedicatedSQL[Dedicated SQL<br/>T-SQL]
        DataExplorer[Data Explorer<br/>KQL]
    end

    subgraph "Storage"
        DataLake[Data Lake<br/>Storage Gen2]
    end

    Spark -->|Register Tables| MetaDB
    ServerlessSQL -->|Query Metadata| MetaDB
    DedicatedSQL -->|Query Metadata| MetaDB
    DataExplorer -->|Query Metadata| MetaDB

    MetaDB -->|Table Locations| DataLake
    Spark -->|Read/Write| DataLake
    ServerlessSQL -->|Read| DataLake
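The register/resolve flow in the diagram can be sketched as a toy model in plain Python: the engine that creates a table registers its name, location, and schema once, and every other engine resolves the same name to the same files. The `SharedMetadataStore` class and its methods are illustrative names, not the actual Synapse service API.

```python
# Toy model of a shared metadata store: one engine registers a table,
# every other engine resolves the same name to the same storage files.
# Class and method names here are illustrative, not Synapse APIs.

class SharedMetadataStore:
    def __init__(self):
        self._tables = {}  # "database.table" -> metadata dict

    def register_table(self, name, location, schema, fmt="delta"):
        """Called by the engine that creates the table (e.g. Spark)."""
        self._tables[name] = {"location": location, "schema": schema, "format": fmt}

    def resolve(self, name):
        """Called by any engine that wants to query the table."""
        if name not in self._tables:
            raise KeyError(f"Table {name} not found in shared catalog")
        return self._tables[name]

store = SharedMetadataStore()

# Spark registers a Delta table once...
store.register_table(
    "sales.customer_orders",
    location="abfss://data@lake.dfs.core.windows.net/delta/customer_orders",
    schema={"customer_id": "STRING", "order_amount": "DECIMAL(18,2)"},
)

# ...and any SQL engine resolves the same name to the same location.
meta = store.resolve("sales.customer_orders")
print(meta["location"])
```

The point of the model: there is exactly one write (registration) and many reads (resolution), which is why schema changes made by one engine are seen by all the others.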

🎯 Cross-Engine Table Sharing

Create Table in Spark, Query in SQL

# Create Delta table in Spark
df = spark.read.parquet("/data/sales/*.parquet")

df.write.format("delta") \
    .mode("overwrite") \
    .saveAsTable("sales.customer_orders")

# Table is synced to the serverless SQL pool's metadata
-- Query the Spark-created table from Serverless SQL
-- (the Spark database appears as a SQL database; tables land in the dbo schema)
SELECT
    customer_id,
    COUNT(*) AS order_count,
    SUM(order_amount) AS total_spent
FROM sales.dbo.customer_orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id
ORDER BY total_spent DESC;

Metadata Synchronization Example

# Spark: Create managed Delta table
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.sales_summary (
        sale_date DATE,
        region STRING,
        product_category STRING,
        total_revenue DECIMAL(18,2),
        order_count INT
    )
    USING DELTA
    PARTITIONED BY (sale_date)
    LOCATION '/delta/sales_summary'
""")

# Insert data
spark.sql("""
    INSERT INTO analytics.sales_summary
    SELECT
        order_date as sale_date,
        region,
        product_category,
        SUM(amount) as total_revenue,
        COUNT(*) as order_count
    FROM sales.customer_orders
    GROUP BY order_date, region, product_category
""")
-- Serverless SQL: table available shortly after metadata sync
SELECT TOP 10 *
FROM analytics.dbo.sales_summary
ORDER BY total_revenue DESC;

-- Check table metadata
SELECT
    TABLE_SCHEMA,
    TABLE_NAME,
    TABLE_TYPE
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME = 'sales_summary';
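The way a single Spark table definition surfaces through both catalogs can be illustrated with a hand-written type mapping from Spark SQL types to the T-SQL types you would expect to see in `INFORMATION_SCHEMA`. The mapping table below is a rough approximation for illustration, not Synapse's exact conversion rules.

```python
# Approximate Spark-SQL -> T-SQL type mapping, to illustrate how one
# table definition surfaces consistently in both engines' catalogs.
# This table is a simplification, not Synapse's exact conversion rules.

SPARK_TO_TSQL = {
    "STRING": "varchar(8000)",
    "INT": "int",
    "BIGINT": "bigint",
    "DATE": "date",
    "DOUBLE": "float",
    "BOOLEAN": "bit",
}

def to_tsql_type(spark_type: str) -> str:
    # DECIMAL(p,s) carries its precision and scale through unchanged
    if spark_type.upper().startswith("DECIMAL"):
        return spark_type.lower()
    return SPARK_TO_TSQL[spark_type.upper()]

# Columns from the analytics.sales_summary example above
spark_schema = {
    "sale_date": "DATE",
    "region": "STRING",
    "total_revenue": "DECIMAL(18,2)",
    "order_count": "INT",
}
tsql_schema = {col: to_tsql_type(t) for col, t in spark_schema.items()}
```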

🔒 Unified Security

Row-Level Security Across Engines

# Define a per-user filter function in Spark
# (simplified: assumes the username itself encodes the region;
#  a real deployment would look the region up from a mapping table)
spark.sql("""
    CREATE OR REPLACE FUNCTION get_user_region()
    RETURNS STRING
    RETURN current_user()
""")

# Create secure view
spark.sql("""
    CREATE OR REPLACE VIEW sales.secure_customer_orders AS
    SELECT *
    FROM sales.customer_orders
    WHERE region = get_user_region()
""")
-- Same security applies in SQL
-- Users only see their region's data
SELECT * FROM sales.secure_customer_orders;
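The effect of the secure view can be modeled outside Spark: every engine's query passes through the same per-user filter before any rows are returned. This is a simplified sketch; the user-to-region mapping below is an assumption for illustration and not part of the original example.

```python
# Simplified model of row-level security enforced through a shared view:
# every engine's query goes through the same per-user filter.
# The user -> region mapping is an assumption for illustration.

USER_REGIONS = {"alice": "EMEA", "bob": "APAC"}

ORDERS = [
    {"customer_id": "c1", "region": "EMEA", "order_amount": 120.0},
    {"customer_id": "c2", "region": "APAC", "order_amount": 80.0},
    {"customer_id": "c3", "region": "EMEA", "order_amount": 45.5},
]

def secure_customer_orders(current_user: str):
    """Rough equivalent of SELECT * FROM sales.secure_customer_orders."""
    region = USER_REGIONS[current_user]
    return [row for row in ORDERS if row["region"] == region]

rows_for_alice = secure_customer_orders("alice")
```

Because the filter lives in the shared view definition rather than in any one engine, no caller can bypass it by switching engines.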

📊 Best Practices

1. Use Managed Tables for Shared Access

# ✅ GOOD: Managed table (metadata tracked)
df.write.format("delta") \
    .mode("overwrite") \
    .saveAsTable("sales.managed_table")

# ❌ LESS IDEAL: Path-only write (not registered in the catalog,
# so other engines cannot discover it by name)
df.write.format("delta") \
    .mode("overwrite") \
    .save("/delta/external_path")

2. Consistent Naming Conventions

# Use database.table naming
spark.sql("CREATE DATABASE IF NOT EXISTS sales")
spark.sql("CREATE DATABASE IF NOT EXISTS analytics")
spark.sql("CREATE DATABASE IF NOT EXISTS ml_features")

# Organize by domain
# sales.customer_orders
# analytics.sales_summary
# ml_features.customer_features
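A small helper can enforce the `database.table` convention before a table is registered. The regex and the allowed domain list below are illustrative choices for this example, not a Synapse requirement.

```python
import re

# Illustrative check for the database.table naming convention above.
# The allowed domains and identifier pattern are example choices.
ALLOWED_DOMAINS = {"sales", "analytics", "ml_features"}
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9_]*\.[a-z][a-z0-9_]*$")

def validate_table_name(name: str) -> bool:
    if not NAME_PATTERN.match(name):
        return False
    database, _ = name.split(".")
    return database in ALLOWED_DOMAINS

print(validate_table_name("sales.customer_orders"))  # True
print(validate_table_name("CustomerOrders"))         # False: no database prefix
```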

3. Schema Evolution Management

# Enable schema evolution for Delta tables
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

# Add new columns safely
spark.sql("""
    ALTER TABLE sales.customer_orders
    ADD COLUMNS (
        customer_segment STRING COMMENT 'Customer tier: Gold, Silver, Bronze',
        lifetime_value DECIMAL(18,2) COMMENT 'Total customer value'
    )
""")

# New columns propagate to the other engines through metadata sync
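Conceptually, `autoMerge` unions the table's current schema with the incoming one, accepting only additive changes. The sketch below is a mental model of that behavior, not Delta Lake's actual implementation.

```python
# Conceptual sketch of Delta schema auto-merge: new columns are added,
# existing columns must keep their types. Not Delta's actual code.

def auto_merge(current: dict, incoming: dict) -> dict:
    merged = dict(current)
    for col, col_type in incoming.items():
        if col in merged and merged[col] != col_type:
            raise TypeError(f"Type change for {col} is not additive: "
                            f"{merged[col]} -> {col_type}")
        merged[col] = col_type
    return merged

current = {"customer_id": "STRING", "order_amount": "DECIMAL(18,2)"}
incoming = {"customer_id": "STRING", "customer_segment": "STRING"}
merged = auto_merge(current, incoming)
```

An incompatible write, such as changing an existing column's type, is rejected rather than merged, which is what keeps the shared schema consistent for every engine reading it.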

Last Updated: 2025-01-28 · Feature: Shared Metadata Catalog · Documentation Status: Complete