# Databricks Guide
> **Note:** Quick summary of this guide: setting up and operating Databricks within CSA-in-a-Box, covering workspace configuration, cluster sizing (dev vs prod), notebook orchestration (three patterns), dbt integration with Unity Catalog, and troubleshooting common issues.
This guide covers setting up and operating Databricks within the CSA-in-a-Box platform, including workspace configuration, notebook orchestration, dbt integration, and Unity Catalog.
## 📑 Table of Contents
- ⚙️ Workspace Setup
- 📁 Notebook Inventory
- 🔄 Notebook Orchestration
- 🗄️ dbt on Databricks
- 📊 Unity Catalog Setup
- 🔧 Troubleshooting
## ⚙️ Workspace Setup

### Prerequisites

- Databricks workspace deployed via Bicep (`deploy/bicep/DLZ/modules/databricks/`)
- ADLS Gen2 storage with Bronze/Silver/Gold containers
- Key Vault with access tokens and connection secrets
- Unity Catalog metastore (see Unity Catalog Setup)
### Cluster Configuration
Interactive cluster (development):
| Setting | Value | Notes |
|---|---|---|
| Runtime | 14.3 LTS (Spark 3.5) | Use LTS for stability |
| Node Type | Standard_DS3_v2 | 4 cores, 14 GB (dev) |
| Min Workers | 1 | Auto-scale from 1 |
| Max Workers | 4 | Cap for cost control |
| Auto-termination | 30 minutes | Save costs |
| Spark Config | See below | ADLS + OpenLineage |
Job cluster (production):
| Setting | Value | Notes |
|---|---|---|
| Runtime | 14.3 LTS (Spark 3.5) | Match dev |
| Node Type | Standard_DS4_v2 | 8 cores, 28 GB |
| Min Workers | 2 | Higher baseline |
| Max Workers | 8 | Scale for load |
| Spot instances | Yes, 50% | Cost savings |
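For reference, the production settings above correspond roughly to the following new-cluster spec in the Databricks Jobs/Clusters API. This is a sketch, not the deployed configuration; the Spark version string, node type, and spot settings should be checked against your workspace:

```python
# Hypothetical job-cluster spec mirroring the production table above
prod_job_cluster = {
    "spark_version": "14.3.x-scala2.12",               # 14.3 LTS (Spark 3.5)
    "node_type_id": "Standard_DS4_v2",                  # 8 cores, 28 GB
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "azure_attributes": {
        "first_on_demand": 1,                            # keep the driver on-demand
        "availability": "SPOT_WITH_FALLBACK_AZURE",      # spot workers with on-demand fallback
    },
    "spark_conf": {
        # Same keys as the "Required Spark configuration" section below
        "spark.databricks.delta.optimizeWrite.enabled": "true",
        "spark.databricks.delta.autoCompact.enabled": "true",
    },
}
```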
### Required Spark configuration

Add these to your cluster's Spark Config:

```
# ADLS Gen2 access via managed identity
spark.hadoop.fs.azure.account.auth.type.<STORAGE>.dfs.core.windows.net OAuth
spark.hadoop.fs.azure.account.oauth.provider.type.<STORAGE>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.MsiTokenProvider

# Delta Lake defaults
spark.databricks.delta.preview.enabled true
spark.databricks.delta.optimizeWrite.enabled true
spark.databricks.delta.autoCompact.enabled true

# OpenLineage (lineage to Purview)
spark.openlineage.transport.type http
spark.openlineage.transport.url https://<PURVIEW_ACCOUNT>.purview.azure.com
spark.openlineage.namespace csa-inabox-databricks
spark.extraListeners io.openlineage.spark.agent.OpenLineageSparkListener
```
> **Important:** Replace `<STORAGE>` with your ADLS account name and `<PURVIEW_ACCOUNT>` with your Purview account name.
## 📁 Notebook Inventory

All notebooks live under `domains/shared/notebooks/databricks/`:

| Notebook | Purpose | Layer |
|---|---|---|
| `bronze_to_silver_spark.py` | Schema enforcement, dedup, validation flags | Bronze → Silver |
| `silver_to_gold_spark.py` | Business aggregations, star schema | Silver → Gold |
| `delta_lake_optimization.py` | OPTIMIZE, VACUUM, Z-ORDER maintenance | All layers |
| `data_quality_monitor.py` | Contract validation, SLA monitoring | Silver/Gold |
| `unity_catalog_setup.py` | Catalog, schema, permissions setup | Infrastructure |
| `orchestration/run_dbt.py` | Universal dbt runner | dbt |
### Naming conventions

- `<source>_to_<target>_spark.py` for ETL notebooks
- `<action>_<subject>.py` for utility notebooks
- Prefix domain-specific notebooks with the domain name: `sales_daily_transform.py`
## 🔄 Notebook Orchestration

### Pattern 1: `dbutils.notebook.run()` (simple)
Chain notebooks within a single Databricks job:
```python
# Master orchestration notebook
results = {}

# Step 1: Bronze -> Silver
results["b2s"] = dbutils.notebook.run(
    "bronze_to_silver_spark",
    timeout_seconds=3600,
    arguments={"domain": "sales", "mode": "incremental"},
)

# Step 2: Silver -> Gold
results["s2g"] = dbutils.notebook.run(
    "silver_to_gold_spark",
    timeout_seconds=3600,
    arguments={"domain": "sales"},
)

# Step 3: Data quality
results["dq"] = dbutils.notebook.run(
    "data_quality_monitor",
    timeout_seconds=1800,
    arguments={"suite": "all"},
)

# Step 4: Delta maintenance (weekly)
from datetime import datetime

if datetime.now().weekday() == 6:  # Sunday
    results["optimize"] = dbutils.notebook.run(
        "delta_lake_optimization",
        timeout_seconds=7200,
        arguments={"tables": "all", "vacuum_hours": "168"},
    )

print(results)
```
### Pattern 2: ADF orchestration (production)

The `pl_medallion_orchestration` ADF pipeline calls notebooks via the Databricks linked service:

```mermaid
graph TD
    T["ADF Trigger<br/>daily 06:00 UTC"] --> P["pl_medallion_orchestration"]
    P --> B["Databricks Activity:<br/>bronze_to_silver_spark"]
    B --> D1["Databricks Activity:<br/>run_dbt.py (dbt run)"]
    D1 --> D2["Databricks Activity:<br/>run_dbt.py (dbt test)"]
    D2 -->|On Failure| W["Teams webhook alert"]
```

See ADF_SETUP.md for pipeline configuration.
### Pattern 3: Databricks Workflows (native)

Create a multi-task job in the Databricks UI (an equivalent Jobs API payload is sketched after these steps):

- Go to **Workflows > Create Job**
- Add tasks in order:
  - Task 1: `bronze_to_silver_spark` (notebook task)
  - Task 2: `run_dbt` (depends on Task 1)
  - Task 3: `data_quality_monitor` (depends on Task 2)
  - Task 4: `delta_lake_optimization` (depends on Task 2, weekly schedule)
- Set the schedule to daily
- Configure email/webhook alerts on failure
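The same job can be defined programmatically. Below is a minimal sketch of a Jobs API 2.1 payload mirroring the task graph above; the job name, notebook paths, and notification address are illustrative assumptions, not values from the repo:

```python
# Hypothetical payload for POST /api/2.1/jobs/create (names and paths are placeholders)
job_spec = {
    "name": "csa_medallion_daily",
    "schedule": {"quartz_cron_expression": "0 0 6 * * ?", "timezone_id": "UTC"},
    "tasks": [
        {
            "task_key": "bronze_to_silver",
            "notebook_task": {"notebook_path": "/Shared/bronze_to_silver_spark"},
        },
        {
            "task_key": "run_dbt",
            "depends_on": [{"task_key": "bronze_to_silver"}],
            "notebook_task": {"notebook_path": "/Shared/orchestration/run_dbt"},
        },
        {
            "task_key": "data_quality",
            "depends_on": [{"task_key": "run_dbt"}],
            "notebook_task": {"notebook_path": "/Shared/data_quality_monitor"},
        },
    ],
    "email_notifications": {"on_failure": ["data-platform@example.com"]},
}
```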
## 🗄️ dbt on Databricks

### Profile configuration

The dbt profile at `domains/shared/dbt/profiles.yml` defines the Databricks connection:
```yaml
csa_analytics:
  target: dev
  outputs:
    dev:
      type: databricks
      host: "{{ env_var('DBT_DATABRICKS_HOST') }}"
      http_path: "{{ env_var('DBT_DATABRICKS_HTTP_PATH') }}"
      token: "{{ env_var('DBT_DATABRICKS_TOKEN') }}"
      catalog: csa_analytics
      schema: dev
      threads: 4
    prod:
      type: databricks
      host: "{{ env_var('DBT_DATABRICKS_HOST') }}"
      http_path: "{{ env_var('DBT_DATABRICKS_HTTP_PATH') }}"
      token: "{{ env_var('DBT_DATABRICKS_TOKEN') }}"
      catalog: csa_analytics
      schema: prod
      threads: 8
```
### Environment variables

Set these in your Databricks cluster environment or local shell:

```bash
export DBT_DATABRICKS_HOST="adb-1234567890.12.azuredatabricks.net"
export DBT_DATABRICKS_HTTP_PATH="/sql/1.0/warehouses/abc123"
export DBT_DATABRICKS_TOKEN="dapi..."
```
### Running dbt from a notebook

The `orchestration/run_dbt.py` notebook wraps dbt CLI execution with parameter validation, timeout handling, and result parsing:

```python
# Databricks notebook
dbutils.widgets.text("dbt_command", "run")
dbutils.widgets.text("dbt_target", "dev")
dbutils.widgets.text("dbt_models", "")

# run_dbt.py handles:
# - Parameter validation (prevents shell injection)
# - dbt CLI execution via subprocess
# - run_results.json parsing for summaries
# - 1-hour timeout with configurable retries
```
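The notebook source is not reproduced here, but the core pattern looks roughly like the sketch below; the validation rules, dbt project path, and summary format are illustrative assumptions:

```python
import json
import re
import subprocess

# Read the widget values defined above (all arrive as strings)
command = dbutils.widgets.get("dbt_command")
target = dbutils.widgets.get("dbt_target")
models = dbutils.widgets.get("dbt_models")

# Validate parameters so widget values cannot smuggle shell metacharacters
if command not in {"run", "test", "seed", "snapshot", "compile"}:
    raise ValueError(f"Unsupported dbt command: {command}")
if models and not re.fullmatch(r"[\w.+ ]+", models):
    raise ValueError(f"Invalid model selector: {models}")

# Build the CLI call as a list (no shell=True), with an explicit timeout
args = ["dbt", command, "--target", target, "--profiles-dir", "."]
if models:
    args += ["--select", models]
result = subprocess.run(
    args, cwd="/Workspace/domains/shared/dbt",  # assumed project location
    capture_output=True, text=True, timeout=3600,
)

# Summarize node statuses from dbt's run_results.json artifact
with open("/Workspace/domains/shared/dbt/target/run_results.json") as f:
    statuses = [r["status"] for r in json.load(f)["results"]]
print(result.returncode, statuses)
```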
### Running dbt locally

```bash
cd domains/shared/dbt
dbt compile --profiles-dir .   # Verify SQL compiles
dbt seed --profiles-dir .      # Load sample data
dbt run --profiles-dir .       # Run models
dbt test --profiles-dir .      # Run tests
dbt snapshot --profiles-dir .  # Run SCD Type 2 snapshots
```
## 📊 Unity Catalog Setup

The `unity_catalog_setup.py` notebook configures:

- Catalog creation — the `csa_analytics` catalog
- Schema per domain — `sales`, `finance`, `shared`
- Schema per layer — `bronze`, `silver`, `gold`, `snapshots`
- RBAC permissions — per-domain access controls mapped to AAD groups
- External locations — ADLS Gen2 paths for Delta tables
Run it once per environment:

```python
dbutils.notebook.run(
    "unity_catalog_setup",
    timeout_seconds=600,
    arguments={"environment": "dev", "dry_run": "true"},
)
```
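Under the hood, this kind of setup reduces to Unity Catalog DDL and GRANT statements along the lines of the sketch below; the group name, storage credential, and storage path are placeholders, not the values the notebook actually uses:

```python
# Illustrative Unity Catalog statements; real names and paths come from notebook parameters
spark.sql("CREATE CATALOG IF NOT EXISTS csa_analytics")
spark.sql("CREATE SCHEMA IF NOT EXISTS csa_analytics.sales")
spark.sql(
    "CREATE EXTERNAL LOCATION IF NOT EXISTS sales_silver "
    "URL 'abfss://silver@<STORAGE>.dfs.core.windows.net/sales' "
    "WITH (STORAGE CREDENTIAL csa_storage_credential)"
)
spark.sql("GRANT USE CATALOG ON CATALOG csa_analytics TO `grp-sales-readers`")
spark.sql("GRANT USE SCHEMA, SELECT ON SCHEMA csa_analytics.sales TO `grp-sales-readers`")
```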
## 🔧 Troubleshooting

### "Cluster not found" or "Cluster terminated"

- Check that the cluster is running in the Compute tab
- Verify the HTTP path matches an active SQL warehouse or cluster
- For auto-terminated clusters, start them manually or increase the auto-termination timeout
"Permission denied" on ADLS paths¶
- Verify the cluster's managed identity has
Storage Blob Data Contributoron the ADLS storage account - Check the Spark config includes the correct OAuth settings
- For Unity Catalog, verify the external location is registered
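A quick way to confirm the cluster can actually reach the account is to list a container path (replace `<STORAGE>` as above):

```python
# Fails with a 403 error if the managed identity lacks the required role
display(dbutils.fs.ls("abfss://bronze@<STORAGE>.dfs.core.windows.net/"))
```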
"Module not found: governance"¶
The governance package must be installed on the cluster:
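For example, from a notebook cell (the wheel path below is hypothetical; point it at wherever your build publishes the governance package):

```python
# Hypothetical wheel location produced by your CI pipeline
%pip install /dbfs/FileStore/wheels/governance-0.1.0-py3-none-any.whl
```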
Or add it to the cluster's Libraries configuration.
### dbt connection timeout

- Verify the SQL warehouse is running
- Check that `DBT_DATABRICKS_HTTP_PATH` matches the warehouse's HTTP path
- Increase `connect_retries` in `profiles.yml` if the warehouse is auto-starting
### Notebook run timeout

The effective ceiling for `dbutils.notebook.run()` is the 48-hour notebook workflow limit, so a missing or overly generous timeout can let a job hang for a long time. Set explicit timeouts to fail fast:
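For example, reusing the Pattern 1 call:

```python
# Fail after one hour instead of letting the child notebook run indefinitely
dbutils.notebook.run(
    "bronze_to_silver_spark",
    timeout_seconds=3600,
    arguments={"domain": "sales", "mode": "incremental"},
)
```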
See the general TROUBLESHOOTING.md for more issues.