🗄️ Unity Catalog¶
See also: CSA-in-a-Box platform guide
This is the generic Azure reference for Databricks Unity Catalog. For how CSA-in-a-Box specifically deploys, configures, and integrates this service, see the platform guide: Databricks Unity Catalog guide.
Unified governance solution for data and AI assets across Azure Databricks workspaces with centralized access control, audit logging, and data discovery.
🌟 Overview¶
Unity Catalog is a unified governance solution for all data and AI assets in Azure Databricks, including files, tables, machine learning models, and dashboards. It provides centralized access control, auditing, lineage, and data discovery across multiple workspaces.
🔥 Key Benefits¶
- Centralized Governance: Single source of truth for all data assets
- Fine-Grained Access Control: Row, column, and attribute-based permissions
- Data Lineage: Automatic tracking from source to consumption
- Data Discovery: Search and browse across all workspaces
- Secure Data Sharing: Share data across workspaces and organizations
- Compliance: Built-in audit logging and access tracking
🏗️ Architecture¶
Unity Catalog Hierarchy¶
graph TB
subgraph "Metastore Level"
Meta[Unity Catalog<br/>Metastore]
end
subgraph "Catalog Level"
Cat1[Production<br/>Catalog]
Cat2[Development<br/>Catalog]
Cat3[Analytics<br/>Catalog]
end
subgraph "Schema Level"
Schema1[Sales Schema]
Schema2[Marketing Schema]
Schema3[Finance Schema]
end
subgraph "Object Level"
Tables[Tables]
Views[Views]
Functions[Functions]
Models[ML Models]
Volumes[Volumes]
end
Meta --> Cat1
Meta --> Cat2
Meta --> Cat3
Cat1 --> Schema1
Cat1 --> Schema2
Cat2 --> Schema3
Schema1 --> Tables
Schema1 --> Views
Schema2 --> Functions
Schema3 --> Models
Schema3 --> Volumes Three-Level Namespace¶
└── catalog # Environment or business unit
└── schema # Functional area or domain
└── table/view # Data asset
Example:
🚀 Getting Started¶
Prerequisites¶
Requirements:
- Azure Databricks Premium or Enterprise tier
- Account Administrator role
- Azure Active Directory integration
- Storage account for metastore
Permissions:
- Account Admin (for metastore creation)
- Workspace Admin (for workspace assignment)
- Metastore Admin (for governance setup)
Create Unity Catalog Metastore¶
Step 1: Create Storage for Metastore¶
# Variables
RESOURCE_GROUP="rg-databricks-unity"
LOCATION="eastus"
STORAGE_ACCOUNT="saunitycatalog"
CONTAINER="metastore"
# Create storage account
az storage account create \
--name $STORAGE_ACCOUNT \
--resource-group $RESOURCE_GROUP \
--location $LOCATION \
--sku Standard_LRS \
--kind StorageV2 \
--enable-hierarchical-namespace true
# Create container
az storage container create \
--name $CONTAINER \
--account-name $STORAGE_ACCOUNT \
--auth-mode login
Step 2: Create Service Principal¶
# Create service principal
SP=$(az ad sp create-for-rbac \
--name "sp-unity-catalog" \
--role "Storage Blob Data Contributor" \
--scopes "/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.Storage/storageAccounts/$STORAGE_ACCOUNT")
# Extract credentials
APP_ID=$(echo $SP | jq -r '.appId')
CLIENT_SECRET=$(echo $SP | jq -r '.password')
TENANT_ID=$(echo $SP | jq -r '.tenant')
echo "Application ID: $APP_ID"
echo "Client Secret: $CLIENT_SECRET"
echo "Tenant ID: $TENANT_ID"
Step 3: Create Metastore (via Databricks UI)¶
- Go to Account Console → Data
- Click Create Metastore
- Configure:
Metastore Name: metastore-prod
Region: East US
Storage Location: abfss://metastore@saunitycatalog.dfs.core.windows.net/
Access Connector:
Application ID: [from step 2]
Directory ID: [tenant-id]
Secret: [client-secret]
- Click Create
Step 4: Assign Workspace to Metastore¶
# Using Databricks SDK
from databricks.sdk import AccountClient
from databricks.sdk.service import catalog
# Initialize account client
account = AccountClient()
# Assign workspace to metastore
account.metastores.assign(
workspace_id="workspace-id",
metastore_id="metastore-id",
default_catalog_name="production"
)
📊 Core Concepts¶
Catalogs¶
Catalogs are the top-level container for organizing data assets, typically by environment or business unit.
-- Create catalog
CREATE CATALOG production
COMMENT 'Production data catalog';
-- Create development catalog
CREATE CATALOG development
COMMENT 'Development and testing catalog';
-- List catalogs
SHOW CATALOGS;
-- Set default catalog
USE CATALOG production;
Schemas (Databases)¶
Schemas organize tables and views by functional area.
-- Create schema
CREATE SCHEMA production.sales
COMMENT 'Sales and revenue data'
LOCATION 'abfss://data@storage.dfs.core.windows.net/sales';
-- Create managed schema (Unity Catalog manages storage)
CREATE SCHEMA production.marketing;
-- List schemas
SHOW SCHEMAS IN production;
-- Use schema
USE production.sales;
Tables¶
Unity Catalog supports both managed and external tables.
-- Managed table (UC manages storage)
CREATE TABLE production.sales.customers (
customer_id BIGINT,
name STRING,
email STRING,
country STRING,
created_date DATE
)
USING DELTA
COMMENT 'Customer master data';
-- External table (you manage storage)
CREATE TABLE production.sales.transactions
USING DELTA
LOCATION 'abfss://data@storage.dfs.core.windows.net/transactions'
COMMENT 'Sales transactions';
-- Create table from query
CREATE TABLE production.sales.monthly_revenue AS
SELECT
DATE_TRUNC('month', order_date) as month,
SUM(amount) as revenue
FROM production.sales.transactions
GROUP BY DATE_TRUNC('month', order_date);
Volumes¶
Volumes provide governed access to non-tabular data (files, images, models).
-- Create managed volume
CREATE VOLUME production.ml.model_artifacts
COMMENT 'ML model artifacts and weights';
-- Create external volume
CREATE EXTERNAL VOLUME production.ml.training_data
LOCATION 'abfss://ml@storage.dfs.core.windows.net/training';
-- List volumes
SHOW VOLUMES IN production.ml;
Access volumes in Python:
# Write to volume
with open('/Volumes/production/ml/model_artifacts/model.pkl', 'wb') as f:
pickle.dump(model, f)
# Read from volume
with open('/Volumes/production/ml/model_artifacts/model.pkl', 'rb') as f:
loaded_model = pickle.load(f)
# List files in volume
dbutils.fs.ls('/Volumes/production/ml/model_artifacts/')
🔒 Access Control¶
Permission Model¶
graph TB
subgraph "Metastore Permissions"
MA[Metastore Admin]
MC[CREATE CATALOG]
MU[USE CATALOG]
end
subgraph "Catalog Permissions"
CA[Catalog Owner]
CS[CREATE SCHEMA]
CU[USE CATALOG]
end
subgraph "Schema Permissions"
SO[Schema Owner]
CT[CREATE TABLE]
SU[USE SCHEMA]
end
subgraph "Table Permissions"
TO[Table Owner]
SEL[SELECT]
MOD[MODIFY]
ALL[ALL PRIVILEGES]
end
MA -->|Grants| MC
MA -->|Grants| MU
CA -->|Grants| CS
SO -->|Grants| CT
TO -->|Grants| SEL
TO -->|Grants| MOD Grant Permissions¶
-- Catalog-level permissions
GRANT USE CATALOG ON CATALOG production TO `data_analysts@company.com`;
GRANT CREATE SCHEMA ON CATALOG production TO `data_engineers@company.com`;
-- Schema-level permissions
GRANT USE SCHEMA ON SCHEMA production.sales TO `sales_team@company.com`;
GRANT CREATE TABLE ON SCHEMA production.sales TO `data_engineers@company.com`;
-- Table-level permissions
GRANT SELECT ON TABLE production.sales.customers TO `analysts@company.com`;
GRANT SELECT, MODIFY ON TABLE production.sales.transactions TO `data_engineers@company.com`;
GRANT ALL PRIVILEGES ON TABLE production.sales.revenue TO `finance_team@company.com`;
-- Grant to groups
GRANT SELECT ON SCHEMA production.sales TO `sales_analysts`;
-- View grants
SHOW GRANTS ON TABLE production.sales.customers;
Row-Level Security¶
-- Create row filter function
CREATE FUNCTION production.sales.region_filter()
RETURN
CASE
WHEN IS_ACCOUNT_GROUP_MEMBER('emea_team') THEN region = 'EMEA'
WHEN IS_ACCOUNT_GROUP_MEMBER('americas_team') THEN region = 'Americas'
WHEN IS_ACCOUNT_GROUP_MEMBER('apac_team') THEN region = 'APAC'
ELSE FALSE
END;
-- Apply row filter to table
ALTER TABLE production.sales.transactions
SET ROW FILTER production.sales.region_filter ON (region);
-- Users will only see rows for their region
SELECT * FROM production.sales.transactions;
Column-Level Security (Masking)¶
-- Create masking function for PII
CREATE FUNCTION production.sales.mask_email(email STRING)
RETURN CASE
WHEN IS_ACCOUNT_GROUP_MEMBER('pii_access') THEN email
ELSE CONCAT(LEFT(email, 3), '***@***.com')
END;
-- Apply column mask
ALTER TABLE production.sales.customers
ALTER COLUMN email SET MASK production.sales.mask_email;
-- Non-privileged users see masked data
SELECT email FROM production.sales.customers;
-- Returns: abc***@***.com
-- Privileged users see actual data
-- Returns: abc@example.com
Dynamic Views for Security¶
-- Create dynamic view based on user context
CREATE VIEW production.sales.customers_by_region AS
SELECT
customer_id,
name,
email,
country,
region
FROM production.sales.customers
WHERE
-- Regional access
CASE
WHEN IS_ACCOUNT_GROUP_MEMBER('emea_team') THEN region = 'EMEA'
WHEN IS_ACCOUNT_GROUP_MEMBER('americas_team') THEN region = 'Americas'
WHEN IS_ACCOUNT_GROUP_MEMBER('global_team') THEN TRUE
ELSE FALSE
END;
-- Grant access to view
GRANT SELECT ON VIEW production.sales.customers_by_region TO `regional_teams`;
🔍 Data Discovery & Lineage¶
Data Discovery¶
-- Search for tables
SHOW TABLES IN production LIKE '*customer*';
-- Search across catalogs
SELECT
catalog_name,
schema_name,
table_name,
table_type,
comment
FROM system.information_schema.tables
WHERE table_name LIKE '%sales%';
-- Get table details
DESCRIBE TABLE EXTENDED production.sales.customers;
-- Get table properties
SHOW TBLPROPERTIES production.sales.customers;
Data Lineage¶
Unity Catalog automatically tracks lineage:
# Query lineage API
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
# Get table lineage
lineage = w.lineage.get_table_lineage(
table_name="production.sales.monthly_revenue"
)
# Upstream dependencies
print("Upstream tables:")
for upstream in lineage.upstream_tables:
print(f" - {upstream.full_name}")
# Downstream consumers
print("Downstream tables:")
for downstream in lineage.downstream_tables:
print(f" - {downstream.full_name}")
Lineage visualization is available in Databricks UI: - Navigate to Data → Select table → Lineage tab
🤝 Data Sharing¶
Delta Sharing¶
Share data securely with external organizations.
Set Up as Provider¶
-- Create share
CREATE SHARE sales_analytics_share
COMMENT 'Quarterly sales data for partners';
-- Add tables to share
ALTER SHARE sales_analytics_share
ADD TABLE production.sales.quarterly_summary;
-- Create recipient
CREATE RECIPIENT partner_corp
COMMENT 'Partner Corporation';
-- Grant access
GRANT SELECT ON SHARE sales_analytics_share TO RECIPIENT partner_corp;
-- Generate activation link
-- (Done via UI or API)
Consume as Recipient¶
# Configure Delta Sharing client
import delta_sharing
# Create profile file
profile = {
"shareCredentialsVersion": 1,
"endpoint": "https://sharing.databricks.com/delta-sharing/",
"bearerToken": "activation-token"
}
# List available shares
client = delta_sharing.SharingClient(profile)
shares = client.list_all_tables()
# Read shared data
df = delta_sharing.load_as_spark("profile.share.schema.table")
display(df)
Cross-Workspace Sharing¶
-- Create external location for sharing
CREATE EXTERNAL LOCATION shared_data
URL 'abfss://shared@storage.dfs.core.windows.net/'
WITH (STORAGE CREDENTIAL shared_credential);
-- Grant access to other workspace
GRANT READ FILES ON EXTERNAL LOCATION shared_data TO `workspace_b_service_principal`;
-- In workspace B, create table referencing shared location
CREATE TABLE workspace_b.shared.data
USING DELTA
LOCATION 'abfss://shared@storage.dfs.core.windows.net/data';
📊 Monitoring & Auditing¶
Audit Logging¶
-- Query audit logs
SELECT
event_time,
user_identity.email,
action_name,
request_params.full_name_arg as object_name,
response.status_code
FROM system.access.audit
WHERE action_name LIKE '%TABLE%'
AND event_date >= CURRENT_DATE - 7
ORDER BY event_time DESC;
-- Track access patterns
SELECT
request_params.full_name_arg as table_name,
COUNT(*) as access_count,
COUNT(DISTINCT user_identity.email) as unique_users
FROM system.access.audit
WHERE action_name = 'getTable'
AND event_date >= CURRENT_DATE - 30
GROUP BY table_name
ORDER BY access_count DESC;
-- Monitor permission changes
SELECT
event_time,
user_identity.email as changed_by,
action_name,
request_params.securable_type,
request_params.securable_full_name,
request_params.principal
FROM system.access.audit
WHERE action_name IN ('createGrant', 'revokeGrant')
AND event_date >= CURRENT_DATE - 7
ORDER BY event_time DESC;
Compliance Reporting¶
-- PII access report
CREATE OR REPLACE VIEW governance.compliance.pii_access_report AS
SELECT
DATE_TRUNC('day', event_time) as access_date,
user_identity.email as user,
request_params.full_name_arg as table_name,
COUNT(*) as access_count
FROM system.access.audit
WHERE request_params.full_name_arg IN (
SELECT CONCAT(catalog_name, '.', schema_name, '.', table_name)
FROM system.information_schema.tables
WHERE comment LIKE '%PII%'
)
GROUP BY access_date, user, table_name;
-- Grant permission report
CREATE OR REPLACE VIEW governance.compliance.permissions_summary AS
SELECT
catalog_name,
schema_name,
table_name,
grantee,
privilege_type,
grantor
FROM system.information_schema.table_privileges
ORDER BY catalog_name, schema_name, table_name;
🔧 Advanced Patterns¶
Multi-Environment Strategy¶
-- Catalog per environment
CREATE CATALOG dev COMMENT 'Development environment';
CREATE CATALOG staging COMMENT 'Staging environment';
CREATE CATALOG prod COMMENT 'Production environment';
-- Consistent schema structure across environments
CREATE SCHEMA dev.sales;
CREATE SCHEMA staging.sales;
CREATE SCHEMA prod.sales;
-- Promote data across environments
CREATE TABLE staging.sales.customers AS
SELECT * FROM dev.sales.customers
WHERE validation_status = 'PASSED';
CREATE TABLE prod.sales.customers AS
SELECT * FROM staging.sales.customers
WHERE ready_for_prod = TRUE;
Data Quality with Constraints¶
-- Add constraints to ensure data quality
ALTER TABLE production.sales.customers
ADD CONSTRAINT valid_email CHECK (email RLIKE '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Z|a-z]{2,}$');
ALTER TABLE production.sales.transactions
ADD CONSTRAINT positive_amount CHECK (amount > 0);
ALTER TABLE production.sales.transactions
ADD CONSTRAINT valid_customer FOREIGN KEY (customer_id) REFERENCES production.sales.customers(customer_id);
-- View constraints
SHOW TBLPROPERTIES production.sales.customers;
Lifecycle Management¶
-- Set table retention
ALTER TABLE production.sales.transactions
SET TBLPROPERTIES (
'delta.logRetentionDuration' = '30 days',
'delta.deletedFileRetentionDuration' = '7 days'
);
-- Implement data retention policy
CREATE OR REPLACE FUNCTION production.sales.cleanup_old_data()
RETURNS VOID
LANGUAGE SQL
AS $$
DELETE FROM production.sales.transactions
WHERE transaction_date < CURRENT_DATE - INTERVAL 7 YEARS;
VACUUM production.sales.transactions RETAIN 168 HOURS;
$$;
-- Schedule cleanup job
CREATE JOB monthly_cleanup
SCHEDULE CRON '0 0 1 * *'
AS CALL production.sales.cleanup_old_data();
💰 Best Practices¶
Organization Structure¶
Recommended Catalog Structure:
- production: Live production data
- staging: Pre-production validation
- development: Development and testing
- analytics: Curated analytics datasets
- sandbox: User experimentation
Schema Organization:
- By business domain: sales, marketing, finance
- By data source: salesforce, google_analytics, erp
- By function: raw, processed, aggregated
Security Best Practices¶
- Principle of Least Privilege: Grant minimum required permissions
- Use Groups: Manage permissions via AAD groups, not individual users
- Audit Regularly: Review access logs and permissions quarterly
- Automate Provisioning: Use CI/CD for permission management
- Encrypt Sensitive Data: Use column masking for PII
- Implement Row-Level Security: For multi-tenant scenarios
Performance Optimization¶
-- Use Delta Lake optimizations
ALTER TABLE production.sales.transactions
SET TBLPROPERTIES (
'delta.autoOptimize.optimizeWrite' = 'true',
'delta.autoOptimize.autoCompact' = 'true'
);
-- Partition large tables
CREATE TABLE production.sales.transactions_partitioned (
transaction_id BIGINT,
customer_id BIGINT,
amount DECIMAL(10,2),
transaction_date DATE
)
USING DELTA
PARTITIONED BY (transaction_date);
-- Z-order for query performance
OPTIMIZE production.sales.transactions
ZORDER BY (customer_id, product_id);
🆘 Troubleshooting¶
Common Issues¶
Issue: Cannot access table after metastore migration
-- Solution: Grant USE CATALOG permission
GRANT USE CATALOG ON CATALOG production TO `user@company.com`;
GRANT USE SCHEMA ON SCHEMA production.sales TO `user@company.com`;
GRANT SELECT ON TABLE production.sales.customers TO `user@company.com`;
Issue: External location access denied
# Solution: Verify storage credential permissions
az role assignment list \
--assignee $APP_ID \
--scope "/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.Storage/storageAccounts/$STORAGE_ACCOUNT"
# Add missing permission
az role assignment create \
--assignee $APP_ID \
--role "Storage Blob Data Contributor" \
--scope "/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.Storage/storageAccounts/$STORAGE_ACCOUNT"
Issue: Table shows as "Not accessible"
-- Check table location
DESCRIBE EXTENDED production.sales.customers;
-- Verify external location permissions
SHOW GRANTS ON EXTERNAL LOCATION data_lake;
-- Re-grant access if needed
GRANT READ FILES, WRITE FILES ON EXTERNAL LOCATION data_lake TO `data_engineers`;
Diagnostic Queries¶
-- Check current user's permissions
SHOW GRANTS ON CATALOG production;
SHOW GRANTS ON SCHEMA production.sales;
SHOW GRANTS ON TABLE production.sales.customers;
-- List all principals with access to a table
SELECT
grantee,
privilege_type,
is_grantable
FROM system.information_schema.table_privileges
WHERE table_catalog = 'production'
AND table_schema = 'sales'
AND table_name = 'customers';
-- Find orphaned tables (no owner)
SELECT
CONCAT(table_catalog, '.', table_schema, '.', table_name) as full_name
FROM system.information_schema.tables
WHERE table_owner IS NULL
OR table_owner = '';
🎯 Migration Guide¶
Migrating from Hive Metastore¶
# Step 1: Sync existing tables to Unity Catalog
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
# List Hive tables
hive_tables = spark.sql("SHOW TABLES IN default").collect()
# Create corresponding Unity Catalog tables
for table in hive_tables:
table_name = table.tableName
# Create managed table in Unity Catalog
spark.sql(f"""
CREATE TABLE production.migrated.{table_name}
AS SELECT * FROM default.{table_name}
""")
print(f"Migrated: {table_name}")
# Step 2: Update queries to use three-level namespace
# Old: SELECT * FROM table_name
# New: SELECT * FROM catalog.schema.table_name
# Step 3: Grant permissions
spark.sql("""
GRANT SELECT ON SCHEMA production.migrated TO `users@company.com`
""")
📚 Related Resources¶
- Azure Databricks Overview
- Delta Live Tables
- Security Best Practices
- Data Governance Patterns
🎯 Next Steps¶
- Set Up External Locations - Configure data access
- Implement Row-Level Security - Fine-grained access
- Configure Delta Sharing - Share data externally
Last Updated: 2025-01-28 Unity Catalog Version: Current Documentation Status: Complete