🏗️ Azure Databricks Workspace Setup¶
Complete guide for creating, configuring, and securing Azure Databricks workspaces for production use.
📋 Table of Contents¶
- Prerequisites
- Workspace Creation
- Network Configuration
- Storage Configuration
- Cluster Configuration
- Security Setup
- Access Control
- Integration Setup
- Best Practices
- Troubleshooting
✅ Prerequisites¶
Required Azure Resources¶
- Azure Subscription with sufficient permissions
- Resource Group for Databricks workspace
- Azure Active Directory access for authentication
- Network Resources (optional): VNet, subnets, NSGs
- Storage Account (optional): For external data storage
Required Permissions¶
| Resource | Role Required | Scope |
|---|---|---|
| Subscription | Contributor or Owner | Subscription level |
| Resource Group | Contributor | Resource group level |
| Network | Network Contributor | VNet level (if using VNet injection) |
| Azure AD | Application Administrator | For service principal creation |
Tools & CLI¶
# Install Azure CLI
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
# Login to Azure
az login
# Install Databricks CLI (the legacy pip-distributed CLI; newer releases ship as a standalone binary)
pip install databricks-cli
# Verify installation
databricks --version
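Before moving on, it is worth a quick check that the az login session is usable from code. A minimal sketch, assuming the azure-identity package is installed (pip install azure-identity); it reuses the Azure CLI's cached credentials:
# Confirm the az login session is usable from Python
# (AzureCliCredential reuses the Azure CLI's cached login)
from azure.identity import AzureCliCredential

token = AzureCliCredential().get_token("https://management.azure.com/.default")
print("Azure CLI credential OK; token expires at", token.expires_on)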
🚀 Workspace Creation¶
Method 1: Azure Portal (Recommended for First-Time Setup)¶
Step 1: Navigate to Azure Portal¶
- Go to Azure Portal
- Click Create a resource
- Search for Azure Databricks
- Click Create
Step 2: Configure Basics¶
Subscription: [Your subscription]
Resource Group: rg-databricks-prod
Workspace Name: dbx-analytics-prod
Region: East US
Pricing Tier: Premium (recommended for production)
💡 Tip: Use Premium tier for production workloads to get RBAC, audit logging, and Unity Catalog support.
Step 3: Configure Networking (Optional)¶
Standard Deployment (Simple):
- No VNet injection
- Databricks-managed networking
- Public IP addresses
VNet Injection (Enterprise):
- Deploy into your VNet
- Private IPs only
- Custom DNS and routing
- Network security controls
Step 4: Advanced Settings¶
Tags:
Environment: Production
Department: Analytics
CostCenter: CC-12345
Managed Resource Group:
Name: rg-databricks-prod-managed
Location: Same as workspace
Encryption:
Enable Customer-Managed Keys: Yes (optional)
Key Vault: kv-databricks-prod
Step 5: Review + Create¶
- Review all settings
- Click Create
- Wait for deployment (5-10 minutes)
Method 2: Azure CLI¶
# Set variables
SUBSCRIPTION_ID="your-subscription-id"
RESOURCE_GROUP="rg-databricks-prod"
LOCATION="eastus"
WORKSPACE_NAME="dbx-analytics-prod"
MANAGED_RG="rg-databricks-prod-managed"
PRICING_TIER="premium"
# Set subscription
az account set --subscription $SUBSCRIPTION_ID
# Create resource group
az group create \
--name $RESOURCE_GROUP \
--location $LOCATION \
--tags Environment=Production Department=Analytics
# Create Databricks workspace
az databricks workspace create \
--resource-group $RESOURCE_GROUP \
--name $WORKSPACE_NAME \
--location $LOCATION \
--sku $PRICING_TIER \
--managed-resource-group $MANAGED_RG \
--tags Environment=Production Department=Analytics
# Get workspace details
az databricks workspace show \
--resource-group $RESOURCE_GROUP \
--name $WORKSPACE_NAME \
--output table
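The same workspace can also be created from Python with the Azure management SDK. A sketch under the assumption that azure-identity and azure-mgmt-databricks are installed, reusing the variable values above:
# Create a Databricks workspace via the Azure management SDK
# (assumes `pip install azure-identity azure-mgmt-databricks`)
from azure.identity import DefaultAzureCredential
from azure.mgmt.databricks import AzureDatabricksManagementClient

subscription_id = "your-subscription-id"
client = AzureDatabricksManagementClient(DefaultAzureCredential(), subscription_id)

poller = client.workspaces.begin_create_or_update(
    resource_group_name="rg-databricks-prod",
    workspace_name="dbx-analytics-prod",
    parameters={
        "location": "eastus",
        "sku": {"name": "premium"},
        "managed_resource_group_id": f"/subscriptions/{subscription_id}/resourceGroups/rg-databricks-prod-managed",
        "tags": {"Environment": "Production", "Department": "Analytics"},
    },
)
workspace = poller.result()  # blocks until the deployment completes
print(workspace.workspace_url)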
Method 3: ARM Template¶
Template: azuredeploy.json
{
"$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"parameters": {
"workspaceName": {
"type": "string",
"metadata": {
"description": "Name of the Azure Databricks workspace"
}
},
"pricingTier": {
"type": "string",
"defaultValue": "premium",
"allowedValues": ["standard", "premium"],
"metadata": {
"description": "Pricing tier for the workspace"
}
},
"location": {
"type": "string",
"defaultValue": "[resourceGroup().location]",
"metadata": {
"description": "Location for all resources"
}
}
},
"variables": {
"managedResourceGroupName": "[concat('databricks-rg-', parameters('workspaceName'), '-', uniqueString(parameters('workspaceName'), resourceGroup().id))]"
},
"resources": [
{
"type": "Microsoft.Databricks/workspaces",
"apiVersion": "2023-02-01",
"name": "[parameters('workspaceName')]",
"location": "[parameters('location')]",
"sku": {
"name": "[parameters('pricingTier')]"
},
"properties": {
"managedResourceGroupId": "[subscriptionResourceId('Microsoft.Resources/resourceGroups', variables('managedResourceGroupName'))]"
}
}
],
"outputs": {
"workspaceId": {
"type": "string",
"value": "[resourceId('Microsoft.Databricks/workspaces', parameters('workspaceName'))]"
},
"workspaceUrl": {
"type": "string",
"value": "[reference(resourceId('Microsoft.Databricks/workspaces', parameters('workspaceName'))).workspaceUrl]"
}
}
}
Deploy:
az deployment group create \
--resource-group $RESOURCE_GROUP \
--template-file azuredeploy.json \
--parameters workspaceName=$WORKSPACE_NAME pricingTier=premium
🌐 Network Configuration¶
Standard Deployment Architecture¶
graph TB
subgraph "Azure Subscription"
subgraph "Customer Resource Group"
DBX[Databricks<br/>Workspace]
end
subgraph "Managed Resource Group"
VNet[Virtual Network<br/>Databricks-Managed]
NSG[Network Security<br/>Groups]
Clusters[Compute<br/>Clusters]
end
end
Users[Users] -->|HTTPS| DBX
DBX -->|Deploys| VNet
VNet --> Clusters
NSG -.Protects.-> Clusters
VNet Injection (Secure Cluster Connectivity)¶
Prerequisites for VNet Injection¶
VNet Requirements:
- Two subnets (public and private)
- Minimum /26 CIDR per subnet (64 addresses)
- No conflicting address spaces
- NSG rules configured
NSG Requirements:
- Allow Azure Databricks control plane communication
- Allow inter-cluster communication
- Allow access to Azure services
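Address-plan mistakes (undersized or overlapping subnets) are a common cause of failed VNet-injected deployments. A small standard-library sketch that checks the requirements above before any resources are created (the CIDR values are illustrative):
# Validate a proposed Databricks subnet plan against the requirements above
# (CIDR values are illustrative — substitute your own address plan)
import ipaddress

vnet = ipaddress.ip_network("10.1.0.0/16")
public_snet = ipaddress.ip_network("10.1.1.0/24")
private_snet = ipaddress.ip_network("10.1.2.0/24")

for snet in (public_snet, private_snet):
    assert snet.subnet_of(vnet), f"{snet} is not inside {vnet}"
    assert snet.prefixlen <= 26, f"{snet} is smaller than the /26 minimum"
assert not public_snet.overlaps(private_snet), "subnets overlap"
print("Subnet plan looks valid")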
Create VNet and Subnets¶
# Create VNet
az network vnet create \
--resource-group $RESOURCE_GROUP \
--name vnet-databricks \
--address-prefix 10.1.0.0/16 \
--location $LOCATION
# Create public subnet
az network vnet subnet create \
--resource-group $RESOURCE_GROUP \
--vnet-name vnet-databricks \
--name snet-databricks-public \
--address-prefix 10.1.1.0/24
# Create private subnet
az network vnet subnet create \
--resource-group $RESOURCE_GROUP \
--vnet-name vnet-databricks \
--name snet-databricks-private \
--address-prefix 10.1.2.0/24
Configure Network Security Groups¶
# Create NSG for public subnet
az network nsg create \
--resource-group $RESOURCE_GROUP \
--name nsg-databricks-public
# Create NSG for private subnet
az network nsg create \
--resource-group $RESOURCE_GROUP \
--name nsg-databricks-private
# Associate NSGs with subnets
az network vnet subnet update \
--resource-group $RESOURCE_GROUP \
--vnet-name vnet-databricks \
--name snet-databricks-public \
--network-security-group nsg-databricks-public
az network vnet subnet update \
--resource-group $RESOURCE_GROUP \
--vnet-name vnet-databricks \
--name snet-databricks-private \
--network-security-group nsg-databricks-private
Required NSG Rules¶
# Allow Azure Databricks control plane (Control Plane to Cluster)
az network nsg rule create \
--resource-group $RESOURCE_GROUP \
--nsg-name nsg-databricks-public \
--name Allow-Databricks-ControlPlane \
--priority 100 \
--direction Inbound \
--access Allow \
--protocol Tcp \
--source-address-prefixes AzureDatabricks \
--source-port-ranges '*' \
--destination-address-prefixes '*' \
--destination-port-ranges 22 443
# Allow inter-cluster communication
az network nsg rule create \
--resource-group $RESOURCE_GROUP \
--nsg-name nsg-databricks-public \
--name Allow-Internal-Communication \
--priority 110 \
--direction Inbound \
--access Allow \
--protocol '*' \
--source-address-prefixes VirtualNetwork \
--source-port-ranges '*' \
--destination-address-prefixes VirtualNetwork \
--destination-port-ranges '*'
Deploy Workspace with VNet Injection¶
# Get VNet ID
VNET_ID=$(az network vnet show \
--resource-group $RESOURCE_GROUP \
--name vnet-databricks \
--query id -o tsv)
# Get subnet IDs
PUBLIC_SUBNET_ID=$(az network vnet subnet show \
--resource-group $RESOURCE_GROUP \
--vnet-name vnet-databricks \
--name snet-databricks-public \
--query id -o tsv)
PRIVATE_SUBNET_ID=$(az network vnet subnet show \
--resource-group $RESOURCE_GROUP \
--vnet-name vnet-databricks \
--name snet-databricks-private \
--query id -o tsv)
# Create workspace with VNet injection
az databricks workspace create \
--resource-group $RESOURCE_GROUP \
--name $WORKSPACE_NAME \
--location $LOCATION \
--sku premium \
--custom-virtual-network-id $VNET_ID \
--custom-public-subnet-name snet-databricks-public \
--custom-private-subnet-name snet-databricks-private \
--enable-no-public-ip
Private Link Configuration¶
graph LR
subgraph "Corporate Network"
Users[Users]
end
subgraph "Azure VNet"
PE[Private<br/>Endpoint]
DBX[Databricks<br/>Workspace]
end
Users -->|Private Connection| PE
PE -->|Private IP| DBX
Enable Private Link:
# Create private endpoint
az network private-endpoint create \
--name pe-databricks \
--resource-group $RESOURCE_GROUP \
--vnet-name vnet-databricks \
--subnet snet-private-endpoints \
--private-connection-resource-id $WORKSPACE_ID \
--group-ids databricks_ui_api \
--connection-name dbx-private-connection
# Create private DNS zone
az network private-dns zone create \
--resource-group $RESOURCE_GROUP \
--name privatelink.azuredatabricks.net
# Link DNS zone to VNet
az network private-dns link vnet create \
--resource-group $RESOURCE_GROUP \
--zone-name privatelink.azuredatabricks.net \
--name databricks-dns-link \
--virtual-network vnet-databricks \
--registration-enabled false
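After linking the DNS zone, confirm that the workspace URL resolves to the private endpoint's IP from inside the VNet. A minimal sketch to run on a VM in the VNet (the hostname is illustrative — use your workspace URL):
# Verify Private Link DNS resolution from inside the VNet
# (hostname is illustrative)
import ipaddress
import socket

hostname = "adb-123456789.12.azuredatabricks.net"
resolved = socket.gethostbyname(hostname)
is_private = ipaddress.ip_address(resolved).is_private
print(f"{hostname} -> {resolved} ({'private' if is_private else 'PUBLIC - check DNS link'})")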
💾 Storage Configuration¶
Default Storage (DBFS)¶
Databricks File System (DBFS) is automatically provisioned with every workspace.
# Access DBFS
dbutils.fs.ls("dbfs:/")
# Upload file to DBFS
dbutils.fs.cp("file:/tmp/data.csv", "dbfs:/FileStore/data.csv")
# List files
display(dbutils.fs.ls("dbfs:/FileStore/"))
⚠️ Warning: The DBFS root is tied to the workspace lifecycle and is deleted with the workspace; it is also workspace-local and hard to govern. Use external storage for production data.
External Storage - Azure Data Lake Storage Gen2¶
Create Storage Account¶
# Create storage account
az storage account create \
--name sadatabricksprod \
--resource-group $RESOURCE_GROUP \
--location $LOCATION \
--sku Standard_LRS \
--kind StorageV2 \
--enable-hierarchical-namespace true
# Create container
az storage container create \
--name data \
--account-name sadatabricksprod \
--auth-mode login
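Before wiring the account into Databricks, you can confirm the container is reachable with your Azure identity. A sketch assuming azure-identity and azure-storage-file-datalake are installed and your identity holds a data-plane role such as Storage Blob Data Contributor on the account:
# Confirm the ADLS Gen2 container is reachable
# (assumes `pip install azure-identity azure-storage-file-datalake`
# and a data-plane role such as Storage Blob Data Contributor)
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://sadatabricksprod.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
fs = service.get_file_system_client("data")
for path in fs.get_paths(recursive=False):
    print(path.name)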
Mount Storage to Databricks¶
Option 1: Mount with a Service Principal (legacy pattern for workspaces without Unity Catalog)
# OAuth configuration for service principal authentication
configs = {
"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": "<application-id>",
"fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<secret-key>"),
"fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}
# Mount storage
dbutils.fs.mount(
source = "abfss://data@sadatabricksprod.dfs.core.windows.net/",
mount_point = "/mnt/data",
extra_configs = configs
)
# Verify mount
display(dbutils.fs.ls("/mnt/data"))
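One practical wrinkle: dbutils.fs.mount raises an error if the mount point already exists, so notebooks that may re-run should guard the call. A short sketch reusing the configs above:
# Mount only if /mnt/data is not already mounted
# (re-running dbutils.fs.mount on an existing mount point raises an error)
if not any(m.mountPoint == "/mnt/data" for m in dbutils.fs.mounts()):
    dbutils.fs.mount(
        source="abfss://data@sadatabricksprod.dfs.core.windows.net/",
        mount_point="/mnt/data",
        extra_configs=configs,
    )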
Option 2: Unity Catalog External Locations (recommended on Unity Catalog-enabled workspaces)
-- Create storage credential
-- (On Azure, Databricks recommends a managed identity via an Access Connector
-- rather than a service principal secret; the exact CREATE STORAGE CREDENTIAL
-- clause varies by release — check the current SQL reference.)
CREATE STORAGE CREDENTIAL azure_storage_credential
WITH (
  AZURE_SERVICE_PRINCIPAL = 'application-id',
  AZURE_CLIENT_SECRET = 'client-secret',
  AZURE_TENANT_ID = 'tenant-id'
);
-- Create external location
CREATE EXTERNAL LOCATION data_lake
URL 'abfss://data@sadatabricksprod.dfs.core.windows.net/'
WITH (STORAGE CREDENTIAL azure_storage_credential);
-- Grant access
GRANT CREATE EXTERNAL TABLE ON EXTERNAL LOCATION data_lake TO `data_engineers`;
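The same Unity Catalog objects can be managed from Python instead of SQL — a sketch assuming the databricks-sdk package and that the storage credential above already exists:
# Create the external location programmatically
# (assumes `pip install databricks-sdk` and an existing storage credential)
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
location = w.external_locations.create(
    name="data_lake",
    url="abfss://data@sadatabricksprod.dfs.core.windows.net/",
    credential_name="azure_storage_credential",
)
print(location.name, location.url)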
⚙️ Cluster Configuration¶
Cluster Types¶
| Type | Use Case | Cost | Lifetime | Best For |
|---|---|---|---|---|
| All-Purpose | Interactive notebooks | Higher DBU | Manual termination | Development, exploration |
| Job | Scheduled workloads | Lower DBU | Job duration only | Production ETL, scheduled jobs |
| SQL Warehouse | SQL analytics | Medium DBU | Auto-scaling | BI tools, SQL queries |
Create All-Purpose Cluster¶
Via Portal:
- Navigate to Compute in Databricks workspace
- Click Create Cluster
- Configure:
Cluster Name: interactive-cluster
Cluster Mode: Standard
Databricks Runtime: 13.3 LTS (Scala 2.12, Spark 3.4.1)
Use Photon Acceleration: Yes (for SQL workloads)
Autopilot Options:
Enable autoscaling local storage: Yes
Auto Termination: 120 minutes
Worker Type: Standard_DS3_v2
Workers: Min 2, Max 8
Driver Type: Standard_DS3_v2
Advanced Options:
Spot instances: 50% (for dev)
Init scripts: /dbfs/init/install-packages.sh
Via CLI:
# Configure Databricks CLI
databricks configure --token
# Create cluster config JSON
cat > cluster-config.json <<EOF
{
"cluster_name": "interactive-cluster",
"spark_version": "13.3.x-scala2.12",
"node_type_id": "Standard_DS3_v2",
"driver_node_type_id": "Standard_DS3_v2",
"autoscale": {
"min_workers": 2,
"max_workers": 8
},
"auto_termination_minutes": 120,
"enable_elastic_disk": true,
"runtime_engine": "PHOTON"
}
EOF
# Create cluster
databricks clusters create --json-file cluster-config.json
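Via Python SDK (a sketch, assuming databricks-sdk is installed; .result() blocks until the cluster reaches RUNNING):
# Create the same cluster with the Databricks SDK
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import AutoScale, RuntimeEngine

w = WorkspaceClient()
cluster = w.clusters.create(
    cluster_name="interactive-cluster",
    spark_version="13.3.x-scala2.12",
    node_type_id="Standard_DS3_v2",
    driver_node_type_id="Standard_DS3_v2",
    autoscale=AutoScale(min_workers=2, max_workers=8),
    autotermination_minutes=120,
    enable_elastic_disk=True,
    runtime_engine=RuntimeEngine.PHOTON,
).result()  # waits until the cluster is RUNNING
print(cluster.cluster_id)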
Cluster Pools¶
Reduce cluster start time by maintaining a pool of idle instances.
{
"instance_pool_name": "general-purpose-pool",
"min_idle_instances": 0,
"max_capacity": 10,
"node_type_id": "Standard_DS3_v2",
"idle_instance_autotermination_minutes": 15,
"preloaded_spark_versions": [
"13.3.x-scala2.12"
]
}
Benefits:
- Substantially faster cluster startup, since idle instances skip VM provisioning
- Cost-effective for frequent cluster creation
- Ideal for job workloads
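A sketch of creating the pool and attaching a cluster to it with the databricks-sdk (cluster name is illustrative):
# Create an instance pool and point new clusters at it
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
pool = w.instance_pools.create(
    instance_pool_name="general-purpose-pool",
    node_type_id="Standard_DS3_v2",
    min_idle_instances=0,
    max_capacity=10,
    idle_instance_autotermination_minutes=15,
    preloaded_spark_versions=["13.3.x-scala2.12"],
)

# Clusters reference the pool instead of a node type
w.clusters.create(
    cluster_name="pooled-job-cluster",
    spark_version="13.3.x-scala2.12",
    instance_pool_id=pool.instance_pool_id,
    num_workers=2,
    autotermination_minutes=60,
)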
🔒 Security Setup¶
Azure Active Directory Integration¶
AAD Authentication:
Azure Databricks workspaces authenticate users through Azure Active Directory (Microsoft Entra ID) by default — no separate SSO configuration is required. Control who can sign in by assigning users and groups to the workspace.
Assign Users:
# Add user to workspace (Contributor on the workspace resource grants workspace admin rights)
az role assignment create \
--assignee user@company.com \
--role "Contributor" \
--scope $WORKSPACE_ID
Secrets Management with Azure Key Vault¶
Create Key Vault¶
# Create Key Vault (soft delete is enabled by default on new vaults;
# the --enable-soft-delete flag has been retired from the Azure CLI)
az keyvault create \
  --name kv-databricks-prod \
  --resource-group $RESOURCE_GROUP \
  --location $LOCATION \
  --enable-purge-protection true
# Add secret
az keyvault secret set \
--vault-name kv-databricks-prod \
--name storage-account-key \
--value "your-storage-account-key"
Configure Databricks Secret Scope¶
# Create secret scope backed by Key Vault
# Note: This is done via UI at https://<databricks-instance>#secrets/createScope
# Use secret in notebook
storage_key = dbutils.secrets.get(scope="key-vault-scope", key="storage-account-key")
spark.conf.set(
"fs.azure.account.key.sadatabricksprod.dfs.core.windows.net",
storage_key
)
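To confirm the Key Vault-backed scope exists, list scopes through the SDK — a minimal sketch assuming databricks-sdk:
# List secret scopes and their backing store
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
for scope in w.secrets.list_scopes():
    print(scope.name, scope.backend_type)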
Enable Audit Logging¶
# Create Log Analytics workspace
az monitor log-analytics workspace create \
--resource-group $RESOURCE_GROUP \
--workspace-name law-databricks
# Get workspace ID
LA_WORKSPACE_ID=$(az monitor log-analytics workspace show \
--resource-group $RESOURCE_GROUP \
--workspace-name law-databricks \
--query id -o tsv)
# Configure diagnostic settings
az monitor diagnostic-settings create \
--name databricks-audit-logs \
--resource $WORKSPACE_ID \
--workspace $LA_WORKSPACE_ID \
--logs '[{"category": "dbfs","enabled": true},{"category": "clusters","enabled": true},{"category": "accounts","enabled": true},{"category": "jobs","enabled": true},{"category": "notebook","enabled": true},{"category": "ssh","enabled": true},{"category": "workspace","enabled": true}]'
👥 Access Control¶
Workspace Access Levels¶
| Role | Permissions | Use Case |
|---|---|---|
| Admin | Full control | Workspace administrators |
| User | Create clusters, notebooks | Data engineers, scientists |
| Reader | View-only access | Auditors, stakeholders |
Table Access Control¶
-- Grant table permissions
GRANT SELECT ON TABLE sales_data TO `analysts@company.com`;
-- Grant schema permissions
GRANT USAGE ON SCHEMA analytics TO `data_engineers`;
-- Row-level security: Unity Catalog attaches a row filter *function*
-- to the table (there is no CREATE ROW FILTER statement)
CREATE OR REPLACE FUNCTION region_filter(region STRING)
RETURN IF(current_user() = 'regional_manager@company.com', region = 'EMEA', TRUE);
ALTER TABLE sales_data SET ROW FILTER region_filter ON (region);
Cluster Policies¶
Enforce organizational standards with cluster policies.
{
"cluster_type": {
"type": "fixed",
"value": "all-purpose"
},
"spark_version": {
"type": "regex",
"pattern": "13\\.3\\..*-scala2\\.12"
},
"node_type_id": {
"type": "allowlist",
"values": ["Standard_DS3_v2", "Standard_DS4_v2"]
},
"autotermination_minutes": {
"type": "range",
"minValue": 10,
"maxValue": 180
}
}
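Policies are registered with the JSON definition serialized as a string — a sketch using the databricks-sdk (the policy name is illustrative):
# Register a cluster policy (the `definition` argument is a JSON string)
import json
from databricks.sdk import WorkspaceClient

policy_definition = {
    "spark_version": {"type": "regex", "pattern": "13\\.3\\..*-scala2\\.12"},
    "node_type_id": {"type": "allowlist", "values": ["Standard_DS3_v2", "Standard_DS4_v2"]},
    "autotermination_minutes": {"type": "range", "minValue": 10, "maxValue": 180},
}

w = WorkspaceClient()
policy = w.cluster_policies.create(
    name="standard-interactive-policy",
    definition=json.dumps(policy_definition),
)
print(policy.policy_id)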
🔗 Integration Setup¶
Power BI Integration¶
# Create a SQL warehouse for Power BI (requires databricks-sdk)
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
# warehouses.create returns a long-running-operation waiter,
# so call .result() to block until the warehouse is ready
warehouse = w.warehouses.create(
    name="powerbi-warehouse",
    cluster_size="2X-Small",
    min_num_clusters=1,
    max_num_clusters=3,
    auto_stop_mins=20
).result()
# Connection details for Power BI come from odbc_params
print(f"Server Hostname: {warehouse.odbc_params.hostname}")
print(f"HTTP Path: {warehouse.odbc_params.path}")
Connect from Power BI:
1. Get Data → Azure → Azure Databricks
2. Enter Server Hostname and HTTP Path
3. Authenticate with Azure AD
4. Select tables and load
Azure Data Factory Integration¶
{
"name": "DatabricksLinkedService",
"type": "Microsoft.DataFactory/factories/linkedservices",
"properties": {
"type": "AzureDatabricks",
"typeProperties": {
"domain": "https://adb-123456789.12.azuredatabricks.net",
"authentication": "MSI",
"workspaceResourceId": "/subscriptions/{subscription}/resourceGroups/{rg}/providers/Microsoft.Databricks/workspaces/{workspace}",
"existingClusterId": "{cluster-id}"
}
}
}
✅ Best Practices¶
Resource Naming Conventions¶
Workspace: dbx-{environment}-{region}-{workload}
Clusters: clstr-{purpose}-{size}
Notebooks: nb-{project}-{function}
Jobs: job-{frequency}-{workload}
Examples:
- dbx-prod-eastus-analytics
- clstr-etl-large
- nb-sales-processing
- job-daily-revenue-calc
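A convention only helps if it is checked. A small sketch that validates names against the patterns above (the regexes are one illustrative rendering of the convention):
# Validate resource names against the naming convention above
# (regexes are an illustrative rendering — adjust to your own rules)
import re

PATTERNS = {
    "workspace": re.compile(r"^dbx-(dev|test|prod)-[a-z0-9]+-[a-z0-9]+$"),
    "cluster": re.compile(r"^clstr-[a-z0-9]+-[a-z0-9]+$"),
    "job": re.compile(r"^job-[a-z0-9]+-[a-z0-9-]+$"),
}

def check(kind: str, name: str) -> bool:
    ok = bool(PATTERNS[kind].match(name))
    print(f"{kind}: {name} -> {'OK' if ok else 'does not match convention'}")
    return ok

check("workspace", "dbx-prod-eastus-analytics")
check("cluster", "clstr-etl-large")
check("job", "job-daily-revenue-calc")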
Cost Management¶
# Set cluster auto-termination in the cluster spec, not via spark.conf
# (auto-termination is a cluster attribute and cannot be changed at runtime)
{
  "auto_termination_minutes": 30
}
# Use spot instances for dev
{
"azure_attributes": {
"spot_bid_max_price": -1,
"first_on_demand": 1,
"availability": "SPOT_WITH_FALLBACK_AZURE"
}
}
# Monitor DBU consumption (join system.billing.list_prices on sku_name for dollar cost)
%sql
SELECT
  usage_date,
  sku_name,
  SUM(usage_quantity) AS dbus_consumed
FROM system.billing.usage
WHERE usage_date >= date_sub(current_date(), 30)
GROUP BY usage_date, sku_name
ORDER BY usage_date DESC
Security Hardening¶
- Enable Private Link for all workspaces
- Use Unity Catalog for centralized governance
- Rotate secrets regularly via Key Vault
- Enable audit logging for compliance
- Implement RBAC at all levels
- Use service principals for automation
- Encrypt data with customer-managed keys
🆘 Troubleshooting¶
Common Setup Issues¶
Issue: Workspace creation fails with VNet injection
# Solution: Verify subnet delegation — both the public and private
# subnets must be delegated to Microsoft.Databricks/workspaces
az network vnet subnet update \
  --resource-group $RESOURCE_GROUP \
  --vnet-name vnet-databricks \
  --name snet-databricks-public \
  --delegations Microsoft.Databricks/workspaces
az network vnet subnet update \
  --resource-group $RESOURCE_GROUP \
  --vnet-name vnet-databricks \
  --name snet-databricks-private \
  --delegations Microsoft.Databricks/workspaces
Issue: Cannot access workspace after creation
# Solution: Add firewall rule
az databricks workspace update \
--resource-group $RESOURCE_GROUP \
--name $WORKSPACE_NAME \
--public-network-access Enabled
Issue: Cluster fails to start
# Check cluster state (assumes databricks-sdk)
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
cluster = w.clusters.get(cluster_id="0123-456789-abc123")
print(cluster.state, cluster.state_message)
Diagnostic Queries¶
-- Check audit logs (the audit table's time column is event_time;
-- action names are camelCase, e.g. createCluster)
SELECT event_time, action_name, user_identity, response
FROM system.access.audit
WHERE action_name ILIKE '%cluster%'
ORDER BY event_time DESC
LIMIT 100;
-- Monitor cluster utilization (per-node metrics live in system.compute.node_timeline)
SELECT
  cluster_id,
  AVG(cpu_user_percent + cpu_system_percent) AS avg_cpu,
  AVG(mem_used_percent) AS avg_memory
FROM system.compute.node_timeline
WHERE start_time >= CURRENT_TIMESTAMP - INTERVAL 1 DAY
GROUP BY cluster_id;
🎯 Next Steps¶
- Configure Unity Catalog - Set up centralized governance
- Create Delta Live Tables Pipeline - Build ETL workflows
- Set Up MLflow - Enable ML lifecycle management
- Implement Security - Harden your deployment
📚 Related Resources¶
- Azure Databricks Documentation
- Networking Best Practices
- Security Hardening Guide
- Cost Optimization
Last Updated: 2025-01-28 · Databricks Runtime: 13.3 LTS · Documentation Status: Complete