🏗️ Azure Databricks Workspace Setup¶
Complete guide for creating, configuring, and securing Azure Databricks workspaces for production use.
📋 Table of Contents¶
- Prerequisites
- Workspace Creation
- Network Configuration
- Storage Configuration
- Cluster Configuration
- Security Setup
- Access Control
- Integration Setup
- Best Practices
- Troubleshooting
✅ Prerequisites¶
Required Azure Resources¶
- Azure Subscription with sufficient permissions
- Resource Group for Databricks workspace
- Azure Active Directory access for authentication
- Network Resources (optional): VNet, subnets, NSGs
- Storage Account (optional): For external data storage
Required Permissions¶
| Resource | Role Required | Scope |
|---|---|---|
| Subscription | Contributor or Owner | Subscription level |
| Resource Group | Contributor | Resource group level |
| Network | Network Contributor | VNet level (if using VNet injection) |
| Azure AD | Application Administrator | For service principal creation |
Tools & CLI¶
# Install Azure CLI
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
# Login to Azure
az login
# Install Databricks CLI (the legacy pip-distributed CLI; newer releases ship as a standalone binary)
pip install databricks-cli
# Verify installation
databricks --version
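Before moving on, it is worth a quick check that the az login session is usable from code. A minimal sketch, assuming the azure-identity package is installed (pip install azure-identity); it reuses the Azure CLI's cached credentials:
# Confirm the az login session is usable from Python
# (AzureCliCredential reuses the Azure CLI's cached login)
from azure.identity import AzureCliCredential

token = AzureCliCredential().get_token("https://management.azure.com/.default")
print("Azure CLI credential OK; token expires at", token.expires_on)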
🚀 Workspace Creation¶
Method 1: Azure Portal (Recommended for First-Time Setup)¶
Step 1: Navigate to Azure Portal¶
- Go to Azure Portal
- Click Create a resource
- Search for Azure Databricks
- Click Create
Step 2: Configure Basics¶
Subscription: [Your subscription]
Resource Group: rg-databricks-prod
Workspace Name: dbx-analytics-prod
Region: East US
Pricing Tier: Premium (recommended for production)
💡 Tip: Use Premium tier for production workloads to get RBAC, audit logging, and Unity Catalog support.
Step 3: Configure Networking (Optional)¶
Standard Deployment (Simple):
- No VNet injection
- Databricks-managed networking
- Public IP addresses
VNet Injection (Enterprise):
- Deploy into your VNet
- Private IPs only
- Custom DNS and routing
- Network security controls
Step 4: Advanced Settings¶
Tags:
Environment: Production
Department: Analytics
CostCenter: CC-12345
Managed Resource Group:
Name: rg-databricks-prod-managed
Location: Same as workspace
Encryption:
Enable Customer-Managed Keys: Yes (optional)
Key Vault: kv-databricks-prod
Step 5: Review + Create¶
- Review all settings
- Click Create
- Wait for deployment (5-10 minutes)
Method 2: Azure CLI¶
# Set variables
SUBSCRIPTION_ID="your-subscription-id"
RESOURCE_GROUP="rg-databricks-prod"
LOCATION="eastus"
WORKSPACE_NAME="dbx-analytics-prod"
MANAGED_RG="rg-databricks-prod-managed"
PRICING_TIER="premium"
# Set subscription
az account set --subscription $SUBSCRIPTION_ID
# Create resource group
az group create \
--name $RESOURCE_GROUP \
--location $LOCATION \
--tags Environment=Production Department=Analytics
# Create Databricks workspace
az databricks workspace create \
--resource-group $RESOURCE_GROUP \
--name $WORKSPACE_NAME \
--location $LOCATION \
--sku $PRICING_TIER \
--managed-resource-group $MANAGED_RG \
--tags Environment=Production Department=Analytics
# Get workspace details
az databricks workspace show \
--resource-group $RESOURCE_GROUP \
--name $WORKSPACE_NAME \
--output table
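The same workspace can also be created from Python with the Azure management SDK. A sketch under the assumption that azure-identity and azure-mgmt-databricks are installed, reusing the variable values above:
# Create a Databricks workspace via the Azure management SDK
# (assumes `pip install azure-identity azure-mgmt-databricks`)
from azure.identity import DefaultAzureCredential
from azure.mgmt.databricks import AzureDatabricksManagementClient

subscription_id = "your-subscription-id"
client = AzureDatabricksManagementClient(DefaultAzureCredential(), subscription_id)

poller = client.workspaces.begin_create_or_update(
    resource_group_name="rg-databricks-prod",
    workspace_name="dbx-analytics-prod",
    parameters={
        "location": "eastus",
        "sku": {"name": "premium"},
        "managed_resource_group_id": f"/subscriptions/{subscription_id}/resourceGroups/rg-databricks-prod-managed",
        "tags": {"Environment": "Production", "Department": "Analytics"},
    },
)
workspace = poller.result()  # blocks until the deployment completes
print(workspace.workspace_url)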
Method 3: ARM Template¶
Template: azuredeploy.json
{
"$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"parameters": {
"workspaceName": {
"type": "string",
"metadata": {
"description": "Name of the Azure Databricks workspace"
}
},
"pricingTier": {
"type": "string",
"defaultValue": "premium",
"allowedValues": ["standard", "premium"],
"metadata": {
"description": "Pricing tier for the workspace"
}
},
"location": {
"type": "string",
"defaultValue": "[resourceGroup().location]",
"metadata": {
"description": "Location for all resources"
}
}
},
"variables": {
"managedResourceGroupName": "[concat('databricks-rg-', parameters('workspaceName'), '-', uniqueString(parameters('workspaceName'), resourceGroup().id))]"
},
"resources": [
{
"type": "Microsoft.Databricks/workspaces",
"apiVersion": "2023-02-01",
"name": "[parameters('workspaceName')]",
"location": "[parameters('location')]",
"sku": {
"name": "[parameters('pricingTier')]"
},
"properties": {
"managedResourceGroupId": "[subscriptionResourceId('Microsoft.Resources/resourceGroups', variables('managedResourceGroupName'))]"
}
}
],
"outputs": {
"workspaceId": {
"type": "string",
"value": "[resourceId('Microsoft.Databricks/workspaces', parameters('workspaceName'))]"
},
"workspaceUrl": {
"type": "string",
"value": "[reference(resourceId('Microsoft.Databricks/workspaces', parameters('workspaceName'))).workspaceUrl]"
}
}
}
Deploy:
az deployment group create \
--resource-group $RESOURCE_GROUP \
--template-file azuredeploy.json \
--parameters workspaceName=$WORKSPACE_NAME pricingTier=premium
🌐 Network Configuration¶
Standard Deployment Architecture¶
graph TB
subgraph "Azure Subscription"
subgraph "Customer Resource Group"
DBX[Databricks<br/>Workspace]
end
subgraph "Managed Resource Group"
VNet[Virtual Network<br/>Databricks-Managed]
NSG[Network Security<br/>Groups]
Clusters[Compute<br/>Clusters]
end
end
Users[Users] -->|HTTPS| DBX
DBX -->|Deploys| VNet
VNet --> Clusters
NSG -.Protects.-> Clusters
VNet Injection (Secure Cluster Connectivity)¶
Prerequisites for VNet Injection¶
VNet Requirements:
- Two subnets (public and private)
- Minimum /26 CIDR per subnet (64 addresses)
- No conflicting address spaces
- NSG rules configured
NSG Requirements:
- Allow Azure Databricks control plane communication
- Allow inter-cluster communication
- Allow access to Azure services
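Address-plan mistakes (undersized or overlapping subnets) are a common cause of failed VNet-injected deployments. A small standard-library sketch that checks the requirements above before any resources are created (the CIDR values are illustrative):
# Validate a proposed Databricks subnet plan against the requirements above
# (CIDR values are illustrative — substitute your own address plan)
import ipaddress

vnet = ipaddress.ip_network("10.1.0.0/16")
public_snet = ipaddress.ip_network("10.1.1.0/24")
private_snet = ipaddress.ip_network("10.1.2.0/24")

for snet in (public_snet, private_snet):
    assert snet.subnet_of(vnet), f"{snet} is not inside {vnet}"
    assert snet.prefixlen <= 26, f"{snet} is smaller than the /26 minimum"
assert not public_snet.overlaps(private_snet), "subnets overlap"
print("Subnet plan looks valid")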
Create VNet and Subnets¶
# Create VNet
az network vnet create \
--resource-group $RESOURCE_GROUP \
--name vnet-databricks \
--address-prefix 10.1.0.0/16 \
--location $LOCATION
# Create public subnet
az network vnet subnet create \
--resource-group $RESOURCE_GROUP \
--vnet-name vnet-databricks \
--name snet-databricks-public \
--address-prefix 10.1.1.0/24
# Create private subnet
az network vnet subnet create \
--resource-group $RESOURCE_GROUP \
--vnet-name vnet-databricks \
--name snet-databricks-private \
--address-prefix 10.1.2.0/24
Configure Network Security Groups¶
# Create NSG for public subnet
az network nsg create \
--resource-group $RESOURCE_GROUP \
--name nsg-databricks-public
# Create NSG for private subnet
az network nsg create \
--resource-group $RESOURCE_GROUP \
--name nsg-databricks-private
# Associate NSGs with subnets
az network vnet subnet update \
--resource-group $RESOURCE_GROUP \
--vnet-name vnet-databricks \
--name snet-databricks-public \
--network-security-group nsg-databricks-public
az network vnet subnet update \
--resource-group $RESOURCE_GROUP \
--vnet-name vnet-databricks \
--name snet-databricks-private \
--network-security-group nsg-databricks-private
Required NSG Rules¶
# Allow Azure Databricks control plane (Control Plane to Cluster)
az network nsg rule create \
--resource-group $RESOURCE_GROUP \
--nsg-name nsg-databricks-public \
--name Allow-Databricks-ControlPlane \
--priority 100 \
--direction Inbound \
--access Allow \
--protocol Tcp \
--source-address-prefixes AzureDatabricks \
--source-port-ranges '*' \
--destination-address-prefixes '*' \
--destination-port-ranges 22 443
# Allow inter-cluster communication
az network nsg rule create \
--resource-group $RESOURCE_GROUP \
--nsg-name nsg-databricks-public \
--name Allow-Internal-Communication \
--priority 110 \
--direction Inbound \
--access Allow \
--protocol '*' \
--source-address-prefixes VirtualNetwork \
--source-port-ranges '*' \
--destination-address-prefixes VirtualNetwork \
--destination-port-ranges '*'
Deploy Workspace with VNet Injection¶
# Get VNet ID
VNET_ID=$(az network vnet show \
--resource-group $RESOURCE_GROUP \
--name vnet-databricks \
--query id -o tsv)
# Get subnet IDs
PUBLIC_SUBNET_ID=$(az network vnet subnet show \
--resource-group $RESOURCE_GROUP \
--vnet-name vnet-databricks \
--name snet-databricks-public \
--query id -o tsv)
PRIVATE_SUBNET_ID=$(az network vnet subnet show \
--resource-group $RESOURCE_GROUP \
--vnet-name vnet-databricks \
--name snet-databricks-private \
--query id -o tsv)
# Create workspace with VNet injection
az databricks workspace create \
--resource-group $RESOURCE_GROUP \
--name $WORKSPACE_NAME \
--location $LOCATION \
--sku premium \
--custom-virtual-network-id $VNET_ID \
--custom-public-subnet-name snet-databricks-public \
--custom-private-subnet-name snet-databricks-private \
--enable-no-public-ip
Private Link Configuration¶
graph LR
subgraph "Corporate Network"
Users[Users]
end
subgraph "Azure VNet"
PE[Private<br/>Endpoint]
DBX[Databricks<br/>Workspace]
end
Users -->|Private Connection| PE
PE -->|Private IP| DBX
Enable Private Link:
# Create private endpoint
az network private-endpoint create \
--name pe-databricks \
--resource-group $RESOURCE_GROUP \
--vnet-name vnet-databricks \
--subnet snet-private-endpoints \
--private-connection-resource-id $WORKSPACE_ID \
--group-ids databricks_ui_api \
--connection-name dbx-private-connection
# Create private DNS zone
az network private-dns zone create \
--resource-group $RESOURCE_GROUP \
--name privatelink.azuredatabricks.net
# Link DNS zone to VNet
az network private-dns link vnet create \
--resource-group $RESOURCE_GROUP \
--zone-name privatelink.azuredatabricks.net \
--name databricks-dns-link \
--virtual-network vnet-databricks \
--registration-enabled false
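After linking the DNS zone, confirm that the workspace URL resolves to the private endpoint's IP from inside the VNet. A minimal sketch to run on a VM in the VNet (the hostname is illustrative — use your workspace URL):
# Verify Private Link DNS resolution from inside the VNet
# (hostname is illustrative)
import ipaddress
import socket

hostname = "adb-123456789.12.azuredatabricks.net"
resolved = socket.gethostbyname(hostname)
is_private = ipaddress.ip_address(resolved).is_private
print(f"{hostname} -> {resolved} ({'private' if is_private else 'PUBLIC - check DNS link'})")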
💾 Storage Configuration¶
Default Storage (DBFS)¶
Databricks File System (DBFS) is automatically provisioned with every workspace.
# Access DBFS
dbutils.fs.ls("dbfs:/")
# Upload file to DBFS
dbutils.fs.cp("file:/tmp/data.csv", "dbfs:/FileStore/data.csv")
# List files
display(dbutils.fs.ls("dbfs:/FileStore/"))
⚠️ Warning: The DBFS root is tied to the workspace lifecycle and is deleted with the workspace; it is also workspace-local and hard to govern. Use external storage for production data.
External Storage - Azure Data Lake Storage Gen2¶
Create Storage Account¶
# Create storage account
az storage account create \
--name sadatabricksprod \
--resource-group $RESOURCE_GROUP \
--location $LOCATION \
--sku Standard_LRS \
--kind StorageV2 \
--enable-hierarchical-namespace true
# Create container
az storage container create \
--name data \
--account-name sadatabricksprod \
--auth-mode login
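Before wiring the account into Databricks, you can confirm the container is reachable with your Azure identity. A sketch assuming azure-identity and azure-storage-file-datalake are installed and your identity holds a data-plane role such as Storage Blob Data Contributor on the account:
# Confirm the ADLS Gen2 container is reachable
# (assumes `pip install azure-identity azure-storage-file-datalake`
# and a data-plane role such as Storage Blob Data Contributor)
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://sadatabricksprod.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
fs = service.get_file_system_client("data")
for path in fs.get_paths(recursive=False):
    print(path.name)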
Mount Storage to Databricks¶
Option 1: Mount with a Service Principal (legacy pattern for workspaces without Unity Catalog)
# OAuth configuration for service principal authentication
configs = {
"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": "<application-id>",
"fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<secret-key>"),
"fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}
# Mount storage
dbutils.fs.mount(
source = "abfss://data@sadatabricksprod.dfs.core.windows.net/",
mount_point = "/mnt/data",
extra_configs = configs
)
# Verify mount
display(dbutils.fs.ls("/mnt/data"))
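One practical wrinkle: dbutils.fs.mount raises an error if the mount point already exists, so notebooks that may re-run should guard the call. A short sketch reusing the configs above:
# Mount only if /mnt/data is not already mounted
# (re-running dbutils.fs.mount on an existing mount point raises an error)
if not any(m.mountPoint == "/mnt/data" for m in dbutils.fs.mounts()):
    dbutils.fs.mount(
        source="abfss://data@sadatabricksprod.dfs.core.windows.net/",
        mount_point="/mnt/data",
        extra_configs=configs,
    )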
Option 2: Unity Catalog External Locations (recommended on Unity Catalog-enabled workspaces)
-- Create storage credential
-- (On Azure, Databricks recommends a managed identity via an Access Connector
-- rather than a service principal secret; the exact CREATE STORAGE CREDENTIAL
-- clause varies by release — check the current SQL reference.)
CREATE STORAGE CREDENTIAL azure_storage_credential
WITH (
  AZURE_SERVICE_PRINCIPAL = 'application-id',
  AZURE_CLIENT_SECRET = 'client-secret',
  AZURE_TENANT_ID = 'tenant-id'
);
-- Create external location
CREATE EXTERNAL LOCATION data_lake
URL 'abfss://data@sadatabricksprod.dfs.core.windows.net/'
WITH (STORAGE CREDENTIAL azure_storage_credential);
-- Grant access
GRANT CREATE EXTERNAL TABLE ON EXTERNAL LOCATION data_lake TO `data_engineers`;
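The same Unity Catalog objects can be managed from Python instead of SQL — a sketch assuming the databricks-sdk package and that the storage credential above already exists:
# Create the external location programmatically
# (assumes `pip install databricks-sdk` and an existing storage credential)
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
location = w.external_locations.create(
    name="data_lake",
    url="abfss://data@sadatabricksprod.dfs.core.windows.net/",
    credential_name="azure_storage_credential",
)
print(location.name, location.url)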
⚙️ Cluster Configuration¶
Cluster Types¶
| Type | Use Case | Cost | Lifetime | Best For |
|---|---|---|---|---|
| All-Purpose | Interactive notebooks | Higher DBU | Manual termination | Development, exploration |
| Job | Scheduled workloads | Lower DBU | Job duration only | Production ETL, scheduled jobs |
| SQL Warehouse | SQL analytics | Medium DBU | Auto-scaling | BI tools, SQL queries |
Create All-Purpose Cluster¶
Via Portal:
- Navigate to Compute in Databricks workspace
- Click Create Cluster
- Configure:
Cluster Name: interactive-cluster
Cluster Mode: Standard
Databricks Runtime: 13.3 LTS (Scala 2.12, Spark 3.4.1)
Use Photon Acceleration: Yes (for SQL workloads)
Autopilot Options:
Enable autoscaling local storage: Yes
Auto Termination: 120 minutes
Worker Type: Standard_DS3_v2
Workers: Min 2, Max 8
Driver Type: Standard_DS3_v2
Advanced Options:
Spot instances: 50% (for dev)
Init scripts: /dbfs/init/install-packages.sh
Via CLI:
# Configure Databricks CLI
databricks configure --token
# Create cluster config JSON
cat > cluster-config.json <<EOF
{
"cluster_name": "interactive-cluster",
"spark_version": "13.3.x-scala2.12",
"node_type_id": "Standard_DS3_v2",
"driver_node_type_id": "Standard_DS3_v2",
"autoscale": {
"min_workers": 2,
"max_workers": 8
},
"auto_termination_minutes": 120,
"enable_elastic_disk": true,
"runtime_engine": "PHOTON"
}
EOF
# Create cluster
databricks clusters create --json-file cluster-config.json
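Via Python SDK (a sketch, assuming databricks-sdk is installed; .result() blocks until the cluster reaches RUNNING):
# Create the same cluster with the Databricks SDK
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import AutoScale, RuntimeEngine

w = WorkspaceClient()
cluster = w.clusters.create(
    cluster_name="interactive-cluster",
    spark_version="13.3.x-scala2.12",
    node_type_id="Standard_DS3_v2",
    driver_node_type_id="Standard_DS3_v2",
    autoscale=AutoScale(min_workers=2, max_workers=8),
    autotermination_minutes=120,
    enable_elastic_disk=True,
    runtime_engine=RuntimeEngine.PHOTON,
).result()  # waits until the cluster is RUNNING
print(cluster.cluster_id)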
Cluster Pools¶
Reduce cluster start time by maintaining a pool of idle instances.
{
"instance_pool_name": "general-purpose-pool",
"min_idle_instances": 0,
"max_capacity": 10,
"node_type_id": "Standard_DS3_v2",
"idle_instance_autotermination_minutes": 15,
"preloaded_spark_versions": [
"13.3.x-scala2.12"
]
}
Benefits:
- Substantially faster cluster startup, since idle instances skip VM provisioning
- Cost-effective for frequent cluster creation
- Ideal for job workloads
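A sketch of creating the pool and attaching a cluster to it with the databricks-sdk (cluster name is illustrative):
# Create an instance pool and point new clusters at it
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
pool = w.instance_pools.create(
    instance_pool_name="general-purpose-pool",
    node_type_id="Standard_DS3_v2",
    min_idle_instances=0,
    max_capacity=10,
    idle_instance_autotermination_minutes=15,
    preloaded_spark_versions=["13.3.x-scala2.12"],
)

# Clusters reference the pool instead of a node type
w.clusters.create(
    cluster_name="pooled-job-cluster",
    spark_version="13.3.x-scala2.12",
    instance_pool_id=pool.instance_pool_id,
    num_workers=2,
    autotermination_minutes=60,
)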
🔒 Security Setup¶
Azure Active Directory Integration¶
AAD Authentication:
Azure Databricks workspaces authenticate users through Azure Active Directory (Microsoft Entra ID) by default — no separate SSO configuration is required. Control who can sign in by assigning users and groups to the workspace.
Assign Users:
# Add user to workspace (Contributor on the workspace resource grants workspace admin rights)
az role assignment create \
--assignee user@company.com \
--role "Contributor" \
--scope $WORKSPACE_ID
Secrets Management with Azure Key Vault¶
Create Key Vault¶
# Create Key Vault (soft delete is enabled by default on new vaults;
# the --enable-soft-delete flag has been retired from the Azure CLI)
az keyvault create \
  --name kv-databricks-prod \
  --resource-group $RESOURCE_GROUP \
  --location $LOCATION \
  --enable-purge-protection true
# Add secret
az keyvault secret set \
--vault-name kv-databricks-prod \
--name storage-account-key \
--value "your-storage-account-key"
Configure Databricks Secret Scope¶
# Create secret scope backed by Key Vault
# Note: This is done via UI at https://<databricks-instance>#secrets/createScope
# Use secret in notebook
storage_key = dbutils.secrets.get(scope="key-vault-scope", key="storage-account-key")
spark.conf.set(
"fs.azure.account.key.sadatabricksprod.dfs.core.windows.net",
storage_key
)
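To confirm the Key Vault-backed scope exists, list scopes through the SDK — a minimal sketch assuming databricks-sdk:
# List secret scopes and their backing store
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
for scope in w.secrets.list_scopes():
    print(scope.name, scope.backend_type)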
Enable Audit Logging¶
# Create Log Analytics workspace
az monitor log-analytics workspace create \
--resource-group $RESOURCE_GROUP \
--workspace-name law-databricks
# Get workspace ID
LA_WORKSPACE_ID=$(az monitor log-analytics workspace show \
--resource-group $RESOURCE_GROUP \
--workspace-name law-databricks \
--query id -o tsv)
# Configure diagnostic settings
az monitor diagnostic-settings create \
--name databricks-audit-logs \
--resource $WORKSPACE_ID \
--workspace $LA_WORKSPACE_ID \
--logs '[{"category": "dbfs","enabled": true},{"category": "clusters","enabled": true},{"category": "accounts","enabled": true},{"category": "jobs","enabled": true},{"category": "notebook","enabled": true},{"category": "ssh","enabled": true},{"category": "workspace","enabled": true}]'
👥 Access Control¶
Workspace Access Levels¶
| Role | Permissions | Use Case |
|---|---|---|
| Admin | Full control | Workspace administrators |
| User | Create clusters, notebooks | Data engineers, scientists |
| Reader | View-only access | Auditors, stakeholders |
Table Access Control¶
-- Grant table permissions
GRANT SELECT ON TABLE sales_data TO `analysts@company.com`;
-- Grant schema permissions
GRANT USAGE ON SCHEMA analytics TO `data_engineers`;
-- Row-level security: Unity Catalog attaches a row filter *function*
-- to the table (there is no CREATE ROW FILTER statement)
CREATE OR REPLACE FUNCTION region_filter(region STRING)
RETURN IF(current_user() = 'regional_manager@company.com', region = 'EMEA', TRUE);
ALTER TABLE sales_data SET ROW FILTER region_filter ON (region);
Cluster Policies¶
Enforce organizational standards with cluster policies.
{
"cluster_type": {
"type": "fixed",
"value": "all-purpose"
},
"spark_version": {
"type": "regex",
"pattern": "13\\.3\\..*-scala2\\.12"
},
"node_type_id": {
"type": "allowlist",
"values": ["Standard_DS3_v2", "Standard_DS4_v2"]
},
"autotermination_minutes": {
"type": "range",
"minValue": 10,
"maxValue": 180
}
}
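Policies are registered with the JSON definition serialized as a string — a sketch using the databricks-sdk (the policy name is illustrative):
# Register a cluster policy (the `definition` argument is a JSON string)
import json
from databricks.sdk import WorkspaceClient

policy_definition = {
    "spark_version": {"type": "regex", "pattern": "13\\.3\\..*-scala2\\.12"},
    "node_type_id": {"type": "allowlist", "values": ["Standard_DS3_v2", "Standard_DS4_v2"]},
    "autotermination_minutes": {"type": "range", "minValue": 10, "maxValue": 180},
}

w = WorkspaceClient()
policy = w.cluster_policies.create(
    name="standard-interactive-policy",
    definition=json.dumps(policy_definition),
)
print(policy.policy_id)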
🔗 Integration Setup¶
Power BI Integration¶
# Create a SQL warehouse for Power BI (requires databricks-sdk)
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
# warehouses.create returns a long-running-operation waiter,
# so call .result() to block until the warehouse is ready
warehouse = w.warehouses.create(
    name="powerbi-warehouse",
    cluster_size="2X-Small",
    min_num_clusters=1,
    max_num_clusters=3,
    auto_stop_mins=20
).result()
# Connection details for Power BI come from odbc_params
print(f"Server Hostname: {warehouse.odbc_params.hostname}")
print(f"HTTP Path: {warehouse.odbc_params.path}")
Connect from Power BI:
1. Get Data → Azure → Azure Databricks
2. Enter Server Hostname and HTTP Path
3. Authenticate with Azure AD
4. Select tables and load
Azure Data Factory Integration¶
{
"name": "DatabricksLinkedService",
"type": "Microsoft.DataFactory/factories/linkedservices",
"properties": {
"type": "AzureDatabricks",
"typeProperties": {
"domain": "https://adb-123456789.12.azuredatabricks.net",
"authentication": "MSI",
"workspaceResourceId": "/subscriptions/{subscription}/resourceGroups/{rg}/providers/Microsoft.Databricks/workspaces/{workspace}",
"existingClusterId": "{cluster-id}"
}
}
}
✅ Best Practices¶
Resource Naming Conventions¶
Workspace: dbx-{environment}-{region}-{workload}
Clusters: clstr-{purpose}-{size}
Notebooks: nb-{project}-{function}
Jobs: job-{frequency}-{workload}
Examples:
- dbx-prod-eastus-analytics
- clstr-etl-large
- nb-sales-processing
- job-daily-revenue-calc
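A convention only helps if it is checked. A small sketch that validates names against the patterns above (the regexes are one illustrative rendering of the convention):
# Validate resource names against the naming convention above
# (regexes are an illustrative rendering — adjust to your own rules)
import re

PATTERNS = {
    "workspace": re.compile(r"^dbx-(dev|test|prod)-[a-z0-9]+-[a-z0-9]+$"),
    "cluster": re.compile(r"^clstr-[a-z0-9]+-[a-z0-9]+$"),
    "job": re.compile(r"^job-[a-z0-9]+-[a-z0-9-]+$"),
}

def check(kind: str, name: str) -> bool:
    ok = bool(PATTERNS[kind].match(name))
    print(f"{kind}: {name} -> {'OK' if ok else 'does not match convention'}")
    return ok

check("workspace", "dbx-prod-eastus-analytics")
check("cluster", "clstr-etl-large")
check("job", "job-daily-revenue-calc")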
Cost Management¶
# Set cluster auto-termination in the cluster spec, not via spark.conf
# (auto-termination is a cluster attribute and cannot be changed at runtime)
{
  "auto_termination_minutes": 30
}
# Use spot instances for dev
{
"azure_attributes": {
"spot_bid_max_price": -1,
"first_on_demand": 1,
"availability": "SPOT_WITH_FALLBACK_AZURE"
}
}
# Monitor DBU consumption (join system.billing.list_prices on sku_name for dollar cost)
%sql
SELECT
  usage_date,
  sku_name,
  SUM(usage_quantity) AS dbus_consumed
FROM system.billing.usage
WHERE usage_date >= date_sub(current_date(), 30)
GROUP BY usage_date, sku_name
ORDER BY usage_date DESC
Security Hardening¶
- Enable Private Link for all workspaces
- Use Unity Catalog for centralized governance
- Rotate secrets regularly via Key Vault
- Enable audit logging for compliance
- Implement RBAC at all levels
- Use service principals for automation
- Encrypt data with customer-managed keys
🆘 Troubleshooting¶
Common Setup Issues¶
Issue: Workspace creation fails with VNet injection
# Solution: Verify subnet delegation — both the public and private
# subnets must be delegated to Microsoft.Databricks/workspaces
az network vnet subnet update \
  --resource-group $RESOURCE_GROUP \
  --vnet-name vnet-databricks \
  --name snet-databricks-public \
  --delegations Microsoft.Databricks/workspaces
az network vnet subnet update \
  --resource-group $RESOURCE_GROUP \
  --vnet-name vnet-databricks \
  --name snet-databricks-private \
  --delegations Microsoft.Databricks/workspaces
Issue: Cannot access workspace after creation
# Solution: Add firewall rule
az databricks workspace update \
--resource-group $RESOURCE_GROUP \
--name $WORKSPACE_NAME \
--public-network-access Enabled
Issue: Cluster fails to start
# Check cluster state (assumes databricks-sdk)
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
cluster = w.clusters.get(cluster_id="0123-456789-abc123")
print(cluster.state, cluster.state_message)
Diagnostic Queries¶
-- Check audit logs (the audit table's time column is event_time;
-- action names are camelCase, e.g. createCluster)
SELECT event_time, action_name, user_identity, response
FROM system.access.audit
WHERE action_name ILIKE '%cluster%'
ORDER BY event_time DESC
LIMIT 100;
-- Monitor cluster utilization (per-node metrics live in system.compute.node_timeline)
SELECT
  cluster_id,
  AVG(cpu_user_percent + cpu_system_percent) AS avg_cpu,
  AVG(mem_used_percent) AS avg_memory
FROM system.compute.node_timeline
WHERE start_time >= CURRENT_TIMESTAMP - INTERVAL 1 DAY
GROUP BY cluster_id;
🎯 Next Steps¶
- Configure Unity Catalog - Set up centralized governance
- Create Delta Live Tables Pipeline - Build ETL workflows
- Set Up MLflow - Enable ML lifecycle management
- Implement Security - Harden your deployment
📚 Related Resources¶
- Azure Databricks Documentation
- Networking Best Practices
- Security Hardening Guide
- Cost Optimization
Last Updated: 2025-01-28 · Databricks Runtime: 13.3 LTS · Documentation Status: Complete