Skip to content

Databricks Workspace Setup


Databricks Status

Overview

Step-by-step guide for configuring an Azure Databricks workspace for the real-time analytics platform with Unity Catalog, cluster policies, and security best practices.

Prerequisites

  • Azure subscription with Databricks access
  • Resource group created
  • Virtual network configured
  • Azure AD tenant access

Workspace Creation

Create Databricks Workspace

# Create Databricks workspace with VNet injection
az databricks workspace create \
  --resource-group analytics-rg \
  --name databricks-analytics-prod \
  --location eastus \
  --sku premium \
  --managed-resource-group databricks-managed-rg \
  --vnet analytics-spoke-vnet \
  --public-subnet databricks-public \
  --private-subnet databricks-private \
  --prepare-encryption \
  --no-wait

# Enable Unity Catalog
az databricks workspace update \
  --resource-group analytics-rg \
  --name databricks-analytics-prod \
  --enable-no-public-ip true \
  --public-network-access Disabled

Unity Catalog Configuration

Create Metastore

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Create metastore
metastore = w.metastores.create(
    name="analytics_metastore",
    storage_root="abfss://unity-catalog@analyticsstorage.dfs.core.windows.net/",
    region="eastus"
)

# Assign metastore to workspace
w.metastores.assign(
    workspace_id="<workspace-id>",
    metastore_id=metastore.metastore_id,
    default_catalog_name="realtime_analytics"
)

Create Catalog and Schemas

-- Create catalog
CREATE CATALOG IF NOT EXISTS realtime_analytics;

-- Create schemas
CREATE SCHEMA IF NOT EXISTS realtime_analytics.bronze;
CREATE SCHEMA IF NOT EXISTS realtime_analytics.silver;
CREATE SCHEMA IF NOT EXISTS realtime_analytics.gold;

-- Grant permissions
GRANT USE CATALOG ON CATALOG realtime_analytics TO `data-engineers`;
GRANT ALL PRIVILEGES ON SCHEMA realtime_analytics.bronze TO `data-engineers`;

Cluster Configuration

Create Job Cluster Policy

{
  "cluster_type": {
    "type": "fixed",
    "value": "job"
  },
  "spark_version": {
    "type": "regex",
    "pattern": "13\.3\..*-scala.*"
  },
  "node_type_id": {
    "type": "allowlist",
    "values": ["Standard_DS3_v2", "Standard_DS4_v2"]
  },
  "autoscale": {
    "min_workers": {
      "type": "range",
      "minValue": 2,
      "maxValue": 4
    },
    "max_workers": {
      "type": "range",
      "minValue": 8,
      "maxValue": 50
    }
  }
}

Create Interactive Cluster

cluster_config = {
    "cluster_name": "shared-analytics-cluster",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {
        "min_workers": 2,
        "max_workers": 10
    },
    "spark_conf": {
        "spark.databricks.delta.preview.enabled": "true",
        "spark.databricks.delta.autoCompact.enabled": "true"
    },
    "data_security_mode": "USER_ISOLATION"
}

from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
cluster = w.clusters.create(**cluster_config)

Access Control

Configure SCIM Provisioning

  1. Navigate to Azure AD > Enterprise Applications > Databricks
  2. Enable automatic provisioning
  3. Configure attribute mappings
  4. Start provisioning

Grant Table Permissions

-- Grant access to data engineers
GRANT USE CATALOG, USE SCHEMA, SELECT, MODIFY 
ON SCHEMA realtime_analytics.silver 
TO `data-engineers`;

-- Grant read-only to analysts
GRANT USE CATALOG, USE SCHEMA, SELECT 
ON SCHEMA realtime_analytics.gold 
TO `analysts`;

Storage Configuration

Mount ADLS Gen2

# Configure storage access using service principal
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get("kv-secrets", "sp-client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("kv-secrets", "sp-client-secret"),
    "fs.azure.account.oauth2.client.endpoint": f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
}

# Mount storage
dbutils.fs.mount(
    source="abfss://data@analyticsstorage.dfs.core.windows.net/",
    mount_point="/mnt/data",
    extra_configs=configs
)


Last Updated: January 2025 Version: 1.0.0 Status: Production Ready