🧪 Azure Databricks¶
See also: CSA-in-a-Box platform guide
This is the generic Azure reference for Azure Databricks. For how CSA-in-a-Box specifically deploys, configures, and integrates this service, see the platform guide: Azure Databricks guide.
Collaborative analytics platform optimized for data science, data engineering, and machine learning workflows on Azure.
🌟 Service Overview¶
Azure Databricks is a unified data analytics platform built on Apache Spark, designed to accelerate innovation by bringing together data science, engineering, and business teams. It provides an optimized environment for big data processing, machine learning, and collaborative analytics with deep integration into the Azure ecosystem.
🔥 Key Value Propositions¶
- Collaborative Workspace: Multi-user notebooks with real-time collaboration and version control
- Advanced ML Capabilities: Native MLflow integration, AutoML, and feature store
- Delta Lake Optimization: Built-in Delta Lake with enhanced performance optimizations
- Multi-Language Support: Python, R, Scala, SQL, and Java in unified workspace
- Enterprise Security: Unity Catalog for centralized governance and access control
- Azure Native Integration: Seamless connectivity with Azure services and data sources
🏗️ Architecture Overview¶
```mermaid
graph TB
    subgraph "Data Sources"
        ADLS[Azure Data Lake<br/>Storage Gen2]
        SQL[Azure SQL<br/>Database]
        Cosmos[Cosmos DB]
        EventHub[Event Hubs]
    end

    subgraph "Azure Databricks Workspace"
        subgraph "Compute"
            Jobs[Job Clusters]
            Interactive[Interactive<br/>Clusters]
            SQL_WH[SQL Warehouses]
        end

        subgraph "Data Management"
            DLT[Delta Live<br/>Tables]
            Delta[Delta Lake]
            Unity[Unity<br/>Catalog]
        end

        subgraph "ML & Analytics"
            MLflow[MLflow]
            AutoML[AutoML]
            FS[Feature<br/>Store]
            NB[Notebooks]
        end
    end

    subgraph "Outputs & Integration"
        PBI[Power BI]
        AML[Azure ML]
        Synapse[Synapse<br/>Analytics]
        Apps[Applications]
    end

    ADLS --> Delta
    SQL --> Delta
    Cosmos --> Delta
    EventHub --> DLT
    Delta --> Jobs
    Delta --> Interactive
    Delta --> SQL_WH
    Jobs --> DLT
    Interactive --> NB
    NB --> MLflow
    NB --> AutoML
    MLflow --> FS
    Unity -.Governance.-> Delta
    Unity -.Governance.-> DLT
    Unity -.Governance.-> FS
    SQL_WH --> PBI
    MLflow --> AML
    Delta --> Synapse
    Jobs --> Apps
```

💰 Pricing Tiers & SKUs¶
🎯 Workspace Editions¶
Azure Databricks workspaces are offered in two tiers (an Enterprise tier exists only on AWS and GCP; on Azure, Unity Catalog and advanced security features ship with Premium):

| Edition | Use Case | Key Features | DBU Pricing |
|---|---|---|---|
| Standard | Development & Testing | Basic security, notebooks, clusters | Base DBU rate |
| Premium | Production & Governance | RBAC, audit logs, job ACLs, Unity Catalog | Higher per-DBU rate (varies by workload type) |
💵 Pricing Components¶
Azure Databricks pricing consists of three components:
1. Virtual Machine Costs¶
- Standard Azure VM pricing for compute nodes
- Varies by region, VM type, and size
- Examples:
- Standard_DS3_v2: ~$0.27/hour (East US)
- Standard_E4s_v3: ~$0.50/hour (East US)
- GPU Standard_NC6s_v3: ~$3.06/hour (East US)
2. Databricks Units (DBU)¶
Databricks Units (DBUs) are units of processing capability, billed per second of cluster uptime. Each VM type emits a fixed number of DBUs per hour, which are charged at a per-DBU rate that depends on workload type and workspace tier. The rates below are illustrative; check the Azure pricing page for current figures:

| Workload Type | Standard ($/DBU) | Premium ($/DBU) |
|---|---|---|
| Jobs Compute | 0.15 | 0.30 |
| All-Purpose Compute | 0.40 | 0.75 |
| Jobs Light Compute | 0.07 | 0.10 |
| SQL Compute (Serverless) | 0.22 | 0.40 |
| DLT Core | 0.20 | 0.38 |
| DLT Pro | 0.25 | 0.54 |
| DLT Advanced | 0.30 | 0.60 |
3. Storage Costs¶
- Delta Lake Tables: Azure Data Lake Storage Gen2 pricing
- DBFS: Included in workspace, backed by Azure Storage
- Logs & Metrics: Azure Monitor pricing
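Putting the first two components together, a cluster's hourly cost is roughly (node count × VM rate) + (node count × DBUs emitted per node-hour × price per DBU). A minimal sketch using the illustrative figures on this page; the ~0.75 DBU/hour emission rate for Standard_DS3_v2 is an assumption, and all rates vary by region and agreement:

```python
# Rough hourly cost model: VM cost plus DBU cost, summed over driver + workers.
# All rates here are illustrative assumptions, not authoritative prices.

def cluster_hourly_cost(nodes: int, vm_rate: float,
                        dbu_per_node_hour: float, dbu_price: float) -> float:
    """Approximate hourly cost for a homogeneous cluster (driver counted as a node)."""
    vm_cost = nodes * vm_rate
    dbu_cost = nodes * dbu_per_node_hour * dbu_price
    return vm_cost + dbu_cost

# Example: 1 driver + 2 workers on Standard_DS3_v2 (~$0.27/hr, ~0.75 DBU/hr each),
# billed as premium-tier jobs compute ($0.30 per DBU).
cost = cluster_hourly_cost(nodes=3, vm_rate=0.27,
                           dbu_per_node_hour=0.75, dbu_price=0.30)
print(f"~${cost:.2f}/hour")
```

The same model scales to monthly estimates by multiplying by expected cluster-hours, which is why auto-termination (reducing those hours) often saves more than VM right-sizing.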
💡 Cost Optimization Strategies¶
```mermaid
graph LR
    A[Cost Optimization] --> B[Right-Size Clusters]
    A --> C[Auto-Scaling]
    A --> D[Auto-Termination]
    A --> E[Spot Instances]
    A --> F[Jobs vs Interactive]
    B --> B1[Use appropriate<br/>VM types]
    C --> C1[Enable min-max<br/>workers]
    D --> D1[Set idle<br/>timeouts]
    E --> E1[Use for fault-<br/>tolerant workloads]
    F --> F1[Jobs clusters<br/>cheaper than<br/>interactive]
```

Best Practices:
- Use Job Clusters for scheduled workloads (less than half the per-DBU rate of all-purpose compute)
- Enable Auto-Termination for interactive clusters (120 minutes recommended)
- Implement Auto-Scaling to match workload demands
- Use Spot/Low-Priority VMs for fault-tolerant batch jobs (up to 80% savings)
- Optimize Cluster Sizes - start small and scale as needed
- Pool Idle Instances for faster startup without idle costs
- Monitor DBU Usage with cost management dashboards
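Several of these practices map directly onto fields of the Databricks Clusters API. A hedged sketch of a cluster spec combining them; the name, node type, and runtime version are the examples used elsewhere on this page, not required values:

```python
# Cluster spec combining auto-scaling, auto-termination, and Azure spot VMs.
# Field names follow the Databricks Clusters API; values are illustrative.
cluster_spec = {
    "cluster_name": "cost-optimized-cluster",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {"min_workers": 2, "max_workers": 8},  # scale with demand
    "autotermination_minutes": 120,                     # stop idle clusters
    "azure_attributes": {
        "first_on_demand": 1,                           # keep the driver on-demand
        "availability": "SPOT_WITH_FALLBACK_AZURE",     # spot VMs, fall back on eviction
        "spot_bid_max_price": -1,                       # pay up to the on-demand price
    },
}
```

Submitting this payload to the Clusters API (or embedding it as `new_cluster` in a job definition) applies all three savings levers at once.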
🎯 When to Use Azure Databricks¶
✅ Ideal Use Cases¶
🔬 Data Science & Machine Learning¶
Why Databricks:

- Collaborative notebook environment for data scientists
- Integrated MLflow for experiment tracking and model management
- AutoML for automated model training
- Feature store for feature engineering and reuse
- GPU support for deep learning workloads
Example: Building and deploying recommendation systems, fraud detection models, predictive maintenance solutions.
🏗️ Advanced Data Engineering¶
Why Databricks:

- Delta Live Tables for declarative ETL pipelines
- Auto-optimize and auto-compaction for Delta Lake
- Advanced data quality checks and monitoring
- Multi-hop architectures (Bronze-Silver-Gold)
Example: Complex ETL/ELT pipelines, data lakehouse implementations, real-time data processing.
📊 Unified Analytics Platform¶
Why Databricks:

- Single platform for batch, streaming, and interactive analytics
- SQL Analytics with BI tool integration
- Collaborative workspace across teams
- Unity Catalog for centralized governance
Example: Enterprise analytics platforms, self-service analytics, cross-functional data teams.
🔄 Real-Time Data Processing¶
Why Databricks:

- Structured Streaming for real-time ETL
- Delta Lake for ACID transactions on streams
- Auto Loader for incremental data ingestion
- Low-latency processing with optimized Spark
Example: IoT data processing, clickstream analytics, real-time recommendations.
⚠️ Consider Alternatives When¶
| Scenario | Better Alternative | Reason |
|---|---|---|
| Simple SQL Analytics | Azure Synapse Serverless SQL | Lower cost for ad-hoc queries |
| Pure Data Warehousing | Azure Synapse Dedicated SQL Pool | Optimized for BI workloads |
| Low-Code ETL | Azure Data Factory | Visual interface, easier for non-developers |
| Small-Scale Processing | Azure Functions, Logic Apps | More cost-effective for small workloads |
| Real-Time Streaming Only | Azure Stream Analytics | Purpose-built for streaming, simpler setup |
🆚 Databricks vs Synapse Analytics¶
| Feature | Azure Databricks | Azure Synapse Analytics |
|---|---|---|
| Primary Focus | Data Science & ML | Data Warehousing & Integration |
| Collaboration | Advanced notebooks | Basic notebooks |
| ML Capabilities | MLflow, AutoML, Feature Store | Basic ML integration |
| SQL Performance | Good (Photon engine) | Excellent (MPP architecture) |
| Data Engineering | Delta Live Tables | Pipelines & Data Flows |
| Learning Curve | Steep (Spark expertise) | Moderate (SQL knowledge) |
| Cost Structure | VM + DBU | DWU or per-query |
| Best For | Data science teams | BI & analytics teams |
🛠️ Core Components¶
📘 Workspace Setup¶
Complete guide for creating and configuring Azure Databricks workspaces.
Covers:

- Workspace creation and networking
- Cluster configuration and pools
- Authentication and access control
- Integration with Azure services
- Best practices and security hardening
🌊 Delta Live Tables¶
Declarative framework for building reliable, maintainable, and testable data pipelines.
Key Features:

- Declarative pipeline definitions
- Automatic dependency management
- Built-in data quality checks
- Pipeline observability and monitoring
- Continuous and triggered execution modes
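To make "declarative" and "data quality checks" concrete, here is a minimal pipeline sketch. It only runs inside a Delta Live Tables pipeline (the `dlt` module and `spark` session are provided by that runtime), and the storage path and table names are illustrative:

```python
import dlt
from pyspark.sql.functions import col

# Bronze: incrementally ingest raw files with Auto Loader.
@dlt.table(comment="Raw sales records ingested from the landing zone")
def sales_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("abfss://container@storage.dfs.core.windows.net/landing/sales")
    )

# Silver: a declared expectation drops rows that fail the quality check,
# and DLT infers the bronze -> silver dependency automatically.
@dlt.table(comment="Cleaned sales records")
@dlt.expect_or_drop("positive_amount", "amount > 0")
def sales_silver():
    return dlt.read_stream("sales_bronze").where(col("order_id").isNotNull())
```

You declare *what* each table is; DLT handles orchestration, retries, and monitoring of the expectation metrics.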
🗄️ Unity Catalog¶
Unified governance solution for data and AI assets across Azure Databricks.
Key Features:

- Centralized metadata management
- Fine-grained access control
- Data lineage and discovery
- Cross-workspace governance
- Secure data sharing
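In practice, governance in Unity Catalog is expressed through its three-level namespace and SQL grants. A short sketch; the catalog, schema, table, and group names are hypothetical:

```sql
-- Three-level namespace: catalog.schema.table
CREATE CATALOG IF NOT EXISTS main;
CREATE SCHEMA IF NOT EXISTS main.analytics;

-- Fine-grained access control (the `data-analysts` group is hypothetical)
GRANT USE CATALOG ON CATALOG main TO `data-analysts`;
GRANT USE SCHEMA ON SCHEMA main.analytics TO `data-analysts`;
GRANT SELECT ON TABLE main.analytics.sales TO `data-analysts`;
```

Because grants live in the metastore rather than in each workspace, the same permissions apply across every workspace attached to the catalog.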
🤖 MLflow Integration¶
End-to-end machine learning lifecycle management with MLflow.
Key Features:

- Experiment tracking and comparison
- Model registry and versioning
- Model deployment and serving
- Feature store integration
- AutoML capabilities
🚀 Quick Start Guide¶
1️⃣ Create Databricks Workspace¶
```bash
# Set variables
RESOURCE_GROUP="rg-databricks-demo"
LOCATION="eastus"
WORKSPACE_NAME="dbx-demo-workspace"
MANAGED_RG="rg-databricks-managed"

# Create resource group
az group create \
  --name $RESOURCE_GROUP \
  --location $LOCATION

# Create Databricks workspace
az databricks workspace create \
  --resource-group $RESOURCE_GROUP \
  --name $WORKSPACE_NAME \
  --location $LOCATION \
  --sku premium \
  --managed-resource-group $MANAGED_RG
```
2️⃣ Create Your First Cluster¶
Via Azure Portal:
- Navigate to your Databricks workspace
- Click "Launch Workspace"
- Go to Compute → Create Cluster
- Configure cluster settings:
```yaml
Cluster Name: demo-cluster
Cluster Mode: Standard
Databricks Runtime: 13.3 LTS (Scala 2.12, Spark 3.4.1)
Autoscaling: Enabled
Auto Termination: 120 minutes
Worker Type: Standard_DS3_v2
Workers: 2-8 (auto-scaling)
Driver Type: Standard_DS3_v2
```
- Click Create Cluster
3️⃣ Create Your First Notebook¶
Python Example - Read and process Delta Lake data:
```python
# Import libraries
from pyspark.sql.functions import col, when

# Read data from Azure Data Lake
df = spark.read.format("delta") \
    .load("abfss://container@storage.dfs.core.windows.net/data/sales")

# Display sample data
display(df.limit(10))

# Perform transformations
transformed_df = df.filter(col("amount") > 100) \
    .withColumn("category",
                when(col("amount") < 500, "Small")
                .when(col("amount") < 2000, "Medium")
                .otherwise("Large"))

# Write to Delta Lake
transformed_df.write.format("delta") \
    .mode("overwrite") \
    .option("mergeSchema", "true") \
    .save("abfss://container@storage.dfs.core.windows.net/data/sales_categorized")

# Create Delta table
spark.sql("""
CREATE TABLE IF NOT EXISTS sales_categorized
USING DELTA
LOCATION 'abfss://container@storage.dfs.core.windows.net/data/sales_categorized'
""")
```

Query using SQL in a new notebook cell (the `%sql` magic switches the cell language):

```sql
%sql
SELECT category, COUNT(*) AS count, SUM(amount) AS total
FROM sales_categorized
GROUP BY category
ORDER BY total DESC
```
SQL Example - Query Delta tables:
```sql
-- Create database
CREATE DATABASE IF NOT EXISTS analytics;

-- Use database
USE analytics;

-- Query Delta table
SELECT
    product_category,
    DATE_TRUNC('month', order_date) AS month,
    COUNT(DISTINCT customer_id) AS unique_customers,
    SUM(amount) AS total_revenue,
    AVG(amount) AS avg_order_value
FROM sales_categorized
WHERE order_date >= '2024-01-01'
GROUP BY product_category, DATE_TRUNC('month', order_date)
ORDER BY month DESC, total_revenue DESC;
```
4️⃣ Schedule a Job¶
Create automated job:
- Go to Workflows → Jobs
- Click Create Job
- Configure job:
```yaml
Job Name: daily-sales-processing
Task Type: Notebook
Notebook Path: /Users/your-email@domain.com/sales-processing
Cluster: New job cluster (cheaper than interactive)
Schedule:
  Type: Cron
  Expression: 0 2 * * * (2 AM daily)
  Timezone: UTC
Notifications:
  On Success: team@company.com
  On Failure: oncall@company.com
Retries: 2
```
- Click Create and Run Now to test
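The same job can be defined programmatically. A sketch of a Jobs API (2.1) payload mirroring the configuration above; note the API uses Quartz cron syntax, so "2 AM daily" becomes `0 0 2 * * ?` rather than the 5-field expression shown in the UI steps:

```python
# Jobs API 2.1 payload mirroring the job configured above (values illustrative).
job_spec = {
    "name": "daily-sales-processing",
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # Quartz syntax: 2 AM daily
        "timezone_id": "UTC",
    },
    "email_notifications": {
        "on_success": ["team@company.com"],
        "on_failure": ["oncall@company.com"],
    },
    "tasks": [
        {
            "task_key": "process-sales",
            "notebook_task": {
                "notebook_path": "/Users/your-email@domain.com/sales-processing"
            },
            "new_cluster": {  # job cluster: lower DBU rate than all-purpose
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 2,
            },
            "max_retries": 2,
        }
    ],
}
```

Posting this to the workspace's `jobs/create` endpoint creates the same schedule, notifications, and retry behavior as the UI walkthrough.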
🔧 Configuration & Management¶
🛡️ Security Best Practices¶
Identity & Access:

- Enable Azure AD integration for SSO
- Use Unity Catalog for fine-grained access control
- Implement least-privilege access principles
- Enable audit logging for compliance

Network Security:

- Deploy in a VNet with Private Link
- Use Azure Private Endpoints for storage
- Configure NSG rules for cluster communication
- Enable firewall rules for workspace access

Data Protection:

- Enable encryption at rest and in transit
- Use customer-managed keys (CMK) for sensitive data
- Implement data masking for PII
- Run regular security scans and compliance checks
⚡ Performance Optimization¶
Cluster Configuration:

- Use the Photon engine for SQL workloads (2-3x faster)
- Enable Delta Cache for frequently accessed data
- Configure appropriate worker and driver sizes
- Use pools for faster cluster startup

Code Optimization:

- Partition data appropriately (date, region, category)
- Use broadcast joins for small tables
- Avoid wide transformations when possible
- Leverage Delta Lake optimizations (Z-ordering, OPTIMIZE)
Delta Lake Tuning:
```python
# Optimize Delta table (co-locate data by frequently filtered columns)
spark.sql("OPTIMIZE sales_data ZORDER BY (date, customer_id)")

# Vacuum old files (168 hours = 7 days)
spark.sql("VACUUM sales_data RETAIN 168 HOURS")

# Enable auto-optimize
spark.sql("""
ALTER TABLE sales_data
SET TBLPROPERTIES (
  delta.autoOptimize.optimizeWrite = true,
  delta.autoOptimize.autoCompact = true
)
""")
```
📊 Monitoring & Observability¶
Built-in Monitoring:

- Cluster metrics and utilization dashboards
- Job run history and performance metrics
- Query performance insights
- Cost tracking and optimization recommendations
Integration with Azure Monitor:
```python
# Configure diagnostic settings (sketch - diagnostic settings are created
# through Azure Monitor against the Databricks workspace resource)
from azure.identity import DefaultAzureCredential
from azure.mgmt.databricks import AzureDatabricksManagementClient

subscription_id = "<subscription-id>"
credential = DefaultAzureCredential()
databricks_client = AzureDatabricksManagementClient(credential, subscription_id)

# Enable diagnostic logging (categories routed to a Log Analytics workspace)
diagnostic_settings = {
    "workspace_id": "/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.OperationalInsights/workspaces/{workspace}",
    "logs": [
        {"category": "jobs", "enabled": True},
        {"category": "clusters", "enabled": True},
        {"category": "notebook", "enabled": True},
    ],
    "metrics": [
        {"category": "AllMetrics", "enabled": True},
    ],
}
```
🔗 Integration Patterns¶
Power BI Integration¶
Direct connectivity for interactive analytics and reporting.
```mermaid
graph LR
    DBX[Databricks SQL<br/>Warehouse] --> PBI[Power BI<br/>Desktop]
    PBI --> Service[Power BI<br/>Service]
    Service --> Dashboards[Interactive<br/>Dashboards]
    Service --> Reports[Reports]
```

Setup:
- Create SQL Warehouse in Databricks
- Get connection string from "Connection Details"
- Connect from Power BI using "Azure Databricks" connector
- Select tables and create visualizations
Azure Machine Learning Integration¶
Seamless MLOps with Azure ML.
```mermaid
graph TB
    DBX[Databricks<br/>Notebooks] --> MLflow[MLflow<br/>Tracking]
    MLflow --> Registry[Model<br/>Registry]
    Registry --> AML[Azure ML<br/>Workspace]
    AML --> Deploy[Model<br/>Deployment]
    Deploy --> Endpoint[REST API<br/>Endpoint]
```

Example:
```python
import mlflow

# Point MLflow at the Azure ML workspace. The real tracking URI comes from
# the workspace (e.g. Workspace.get_mlflow_tracking_uri()); placeholder shown.
mlflow.set_tracking_uri("azureml://workspace-id")

# Log model, parameters, and metrics to Azure ML
with mlflow.start_run():
    mlflow.sklearn.log_model(model, "model")
    mlflow.log_param("algorithm", "random_forest")
    mlflow.log_metric("accuracy", 0.95)
```
Azure Synapse Analytics Integration¶
Unified lakehouse with Synapse.
```python
# In Synapse Spark: register the shared Delta location written by Databricks
spark.sql("""
CREATE DATABASE IF NOT EXISTS dbx_shared
LOCATION 'abfss://container@storage.dfs.core.windows.net/delta'
""")

# Access Delta tables
df = spark.read.table("dbx_shared.sales")
```
📚 Learning Resources¶
🎓 Getting Started¶
📖 Deep Dive Guides¶
🔧 Advanced Topics¶
🆘 Troubleshooting¶
🔍 Common Issues¶
Cluster Startup Issues:

- Cluster won't start
- Node provisioning errors
- Networking problems

Performance Problems:

- Slow queries
- Out-of-memory errors
- Shuffle performance

Data Issues:

- Delta Lake problems
- Data quality issues
- Schema evolution
📞 Getting Help¶
- Azure Support: Official Microsoft support channels
- Databricks Support: Premium tier includes 24/7 support
- Community Forums: Databricks Community, Stack Overflow
- Documentation: Microsoft Learn and Databricks official docs
- GitHub Issues: Report documentation or sample issues
📖 Complete Troubleshooting Guide →
🎯 Next Steps¶
🚀 For Beginners¶
- Complete Workspace Setup - Get your environment ready
- Try Delta Lake - Learn modern data engineering
- Explore Notebooks - Hands-on coding
🏢 For Teams¶
- Implement Unity Catalog - Set up governance
- Design Data Architecture - Plan your lakehouse
- Establish MLOps - Production ML workflows
🔧 For Advanced Users¶
- Optimize Performance - Tune for production
- Implement CI/CD - Automate deployments
- Build Solutions - Enterprise patterns
💬 Feedback¶
Your feedback helps us improve this documentation. Please let us know:
- Was this page helpful?
- What could be improved?
- Are there any errors or unclear sections?
Last Updated: 2025-01-28
Service Version: General Availability
Documentation Status: Complete