
Databricks Guide

Note

Quick Summary: Setting up and operating Databricks within CSA-in-a-Box — workspace configuration, cluster sizing (dev vs prod), notebook orchestration (3 patterns), dbt integration with Unity Catalog, and troubleshooting common issues.

This guide covers setting up and operating Databricks within the CSA-in-a-Box platform, including workspace configuration, notebook orchestration, dbt integration, and Unity Catalog.

📑 Table of Contents

  • ⚙️ Workspace Setup
  • 📁 Notebook Inventory
  • 🔄 Notebook Orchestration
  • 🗄️ dbt on Databricks
  • 📊 Unity Catalog Setup
  • 🔧 Troubleshooting

⚙️ Workspace Setup

Prerequisites

  • Databricks workspace deployed via Bicep (deploy/bicep/DLZ/modules/databricks/)
  • ADLS Gen2 storage with Bronze/Silver/Gold containers
  • Key Vault with access tokens and connection secrets
  • Unity Catalog metastore (see Unity Catalog Setup)

Cluster Configuration

Interactive cluster (development):

| Setting | Value | Notes |
| --- | --- | --- |
| Runtime | 14.3 LTS (Spark 3.5) | Use LTS for stability |
| Node Type | Standard_DS3_v2 | 4 cores, 14 GB (dev) |
| Min Workers | 1 | Auto-scale from 1 |
| Max Workers | 4 | Cap for cost control |
| Auto-termination | 30 minutes | Save costs |
| Spark Config | See below | ADLS + OpenLineage |

Job cluster (production):

| Setting | Value | Notes |
| --- | --- | --- |
| Runtime | 14.3 LTS (Spark 3.5) | Match dev |
| Node Type | Standard_DS4_v2 | 8 cores, 28 GB |
| Min Workers | 2 | Higher baseline |
| Max Workers | 8 | Scale for load |
| Spot instances | Yes, 50% | Cost savings |
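
The same settings can be captured in code rather than clicked together. A minimal sketch using the Clusters API — the workspace URL, token handling, cluster name, and the spot split (4 on-demand of 8 max) are illustrative assumptions, not part of the deployed Bicep:

# Hypothetical sketch: the production job-cluster spec from the table above,
# submitted via the Databricks Clusters API. Host/token/name are assumptions;
# in practice the token comes from Key Vault, never hard-coded.
import requests

HOST = "https://adb-1234567890.12.azuredatabricks.net"
TOKEN = "dapi..."  # read from Key Vault in real use

cluster_spec = {
    "cluster_name": "csa-prod-job-cluster",        # assumed name
    "spark_version": "14.3.x-scala2.12",           # 14.3 LTS (Spark 3.5)
    "node_type_id": "Standard_DS4_v2",             # 8 cores, 28 GB
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "azure_attributes": {                          # ~50% spot with fallback
        "availability": "SPOT_WITH_FALLBACK_AZURE",
        "first_on_demand": 4,
    },
    "spark_conf": {
        "spark.databricks.delta.optimizeWrite.enabled": "true",
        "spark.databricks.delta.autoCompact.enabled": "true",
    },
}

resp = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])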

Required Spark configuration

Add these to your cluster's Spark Config:

# ADLS Gen2 access via managed identity
spark.hadoop.fs.azure.account.auth.type.<STORAGE>.dfs.core.windows.net OAuth
spark.hadoop.fs.azure.account.oauth.provider.type.<STORAGE>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.MsiTokenProvider

# Delta Lake defaults
spark.databricks.delta.preview.enabled true
spark.databricks.delta.optimizeWrite.enabled true
spark.databricks.delta.autoCompact.enabled true

# OpenLineage (lineage to Purview)
spark.openlineage.transport.type http
spark.openlineage.transport.url https://<PURVIEW_ACCOUNT>.purview.azure.com
spark.openlineage.namespace csa-inabox-databricks
spark.extraListeners io.openlineage.spark.agent.OpenLineageSparkListener

Important

Replace <STORAGE> with your ADLS account name and <PURVIEW_ACCOUNT> with your Purview account name.


📁 Notebook Inventory

All notebooks live under domains/shared/notebooks/databricks/:

| Notebook | Purpose | Layer |
| --- | --- | --- |
| bronze_to_silver_spark.py | Schema enforcement, dedup, validation flags | Bronze → Silver |
| silver_to_gold_spark.py | Business aggregations, star schema | Silver → Gold |
| delta_lake_optimization.py | OPTIMIZE, VACUUM, Z-ORDER maintenance | All layers |
| data_quality_monitor.py | Contract validation, SLA monitoring | Silver/Gold |
| unity_catalog_setup.py | Catalog, schema, permissions setup | Infrastructure |
| orchestration/run_dbt.py | Universal dbt runner | dbt |

Naming conventions

  • <source>_to_<target>_spark.py for ETL notebooks
  • <action>_<subject>.py for utility notebooks
  • Prefix domain-specific notebooks with the domain name: sales_daily_transform.py

🔄 Notebook Orchestration

Pattern 1: dbutils.notebook.run() (simple)

Chain notebooks within a single Databricks job:

# Master orchestration notebook
results = {}

# Step 1: Bronze -> Silver
results["b2s"] = dbutils.notebook.run(
    "bronze_to_silver_spark",
    timeout_seconds=3600,
    arguments={"domain": "sales", "mode": "incremental"}
)

# Step 2: Silver -> Gold
results["s2g"] = dbutils.notebook.run(
    "silver_to_gold_spark",
    timeout_seconds=3600,
    arguments={"domain": "sales"}
)

# Step 3: Data quality
results["dq"] = dbutils.notebook.run(
    "data_quality_monitor",
    timeout_seconds=1800,
    arguments={"suite": "all"}
)

# Step 4: Delta maintenance (weekly)
from datetime import datetime
if datetime.now().weekday() == 6:  # Sunday
    results["optimize"] = dbutils.notebook.run(
        "delta_lake_optimization",
        timeout_seconds=7200,
        arguments={"tables": "all", "vacuum_hours": "168"}
    )

print(results)
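
Each entry in results holds whatever string the child notebook passed to dbutils.notebook.exit(). A sketch of acting on those values, assuming (this schema is an assumption, not part of the notebooks above) that each child exits with a small JSON status:

# Assumption: each child notebook ends with something like
#   dbutils.notebook.exit(json.dumps({"status": "ok", "rows": 12345}))
import json

for step, raw in results.items():
    payload = json.loads(raw)                     # parse the child's exit string
    if payload.get("status") != "ok":
        raise RuntimeError(f"Step {step!r} failed: {payload}")
    print(f"{step}: {payload.get('rows', '?')} rows processed")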

Pattern 2: ADF orchestration (production)

The pl_medallion_orchestration ADF pipeline calls notebooks via the Databricks linked service:

graph TD
    T["ADF Trigger<br/>daily 06:00 UTC"] --> P["pl_medallion_orchestration"]
    P --> B["Databricks Activity:<br/>bronze_to_silver_spark"]
    B --> D1["Databricks Activity:<br/>run_dbt.py (dbt run)"]
    D1 --> D2["Databricks Activity:<br/>run_dbt.py (dbt test)"]
    D2 -->|On Failure| W["Teams webhook alert"]

See ADF_SETUP.md for pipeline configuration.

Pattern 3: Databricks Workflows (native)

Create a multi-task job in the Databricks UI (an equivalent Jobs API sketch follows the steps):

  • Go to Workflows > Create Job
  • Add tasks in order:
    • Task 1: bronze_to_silver_spark (notebook task)
    • Task 2: run_dbt (depends on Task 1)
    • Task 3: data_quality_monitor (depends on Task 2)
    • Task 4: delta_lake_optimization (depends on Task 2, weekly schedule)
  • Set the schedule to daily
  • Configure email/webhook alerts on failure
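
The same job can be defined through the Jobs API 2.1 instead of the UI. A hedged sketch — the notebook paths, cluster ID, alert address, and cron expression are assumptions:

# Hypothetical sketch: the Workflows job above as a Jobs API 2.1 payload.
# Paths, <CLUSTER_ID>, and the schedule are assumptions.
import requests

REPO = "/Repos/csa-inabox/domains/shared/notebooks/databricks"

job_spec = {
    "name": "csa-medallion-daily",
    "schedule": {"quartz_cron_expression": "0 0 6 * * ?", "timezone_id": "UTC"},
    "email_notifications": {"on_failure": ["data-team@example.com"]},
    "tasks": [
        {"task_key": "b2s",
         "notebook_task": {"notebook_path": f"{REPO}/bronze_to_silver_spark"},
         "existing_cluster_id": "<CLUSTER_ID>"},
        {"task_key": "dbt",
         "depends_on": [{"task_key": "b2s"}],
         "notebook_task": {"notebook_path": f"{REPO}/orchestration/run_dbt"},
         "existing_cluster_id": "<CLUSTER_ID>"},
        {"task_key": "dq",
         "depends_on": [{"task_key": "dbt"}],
         "notebook_task": {"notebook_path": f"{REPO}/data_quality_monitor"},
         "existing_cluster_id": "<CLUSTER_ID>"},
    ],
}

resp = requests.post(
    "https://<WORKSPACE>.azuredatabricks.net/api/2.1/jobs/create",
    headers={"Authorization": "Bearer <TOKEN>"},
    json=job_spec,
)
resp.raise_for_status()
print("job_id:", resp.json()["job_id"])

Note that all tasks in a job share its schedule, so the weekly delta_lake_optimization task is better kept as a separate job with its own weekly trigger.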

🗄️ dbt on Databricks

Profile configuration

The dbt profile at domains/shared/dbt/profiles.yml defines the Databricks connection:

csa_analytics:
    target: dev
    outputs:
        dev:
            type: databricks
            host: "{{ env_var('DBT_DATABRICKS_HOST') }}"
            http_path: "{{ env_var('DBT_DATABRICKS_HTTP_PATH') }}"
            token: "{{ env_var('DBT_DATABRICKS_TOKEN') }}"
            catalog: csa_analytics
            schema: dev
            threads: 4

        prod:
            type: databricks
            host: "{{ env_var('DBT_DATABRICKS_HOST') }}"
            http_path: "{{ env_var('DBT_DATABRICKS_HTTP_PATH') }}"
            token: "{{ env_var('DBT_DATABRICKS_TOKEN') }}"
            catalog: csa_analytics
            schema: prod
            threads: 8

Environment variables

Set these in your Databricks cluster environment or local shell:

export DBT_DATABRICKS_HOST="adb-1234567890.12.azuredatabricks.net"
export DBT_DATABRICKS_HTTP_PATH="/sql/1.0/warehouses/abc123"
export DBT_DATABRICKS_TOKEN="dapi..."

Running dbt from a notebook

The orchestration/run_dbt.py notebook wraps dbt CLI execution with parameter validation, timeout handling, and result parsing:

# Databricks notebook
dbutils.widgets.text("dbt_command", "run")
dbutils.widgets.text("dbt_target", "dev")
dbutils.widgets.text("dbt_models", "")

# run_dbt.py handles:
# - Parameter validation (prevents shell injection)
# - dbt CLI execution via subprocess
# - run_results.json parsing for summaries
# - 1-hour timeout with configurable retries
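
The notebook itself isn't reproduced here, but a minimal sketch of the same shape — widget reading, a command allow-list, list-form subprocess execution, and run_results.json parsing — with the repo path and allow-list assumed, looks like this:

# Minimal sketch of a run_dbt-style wrapper (not the actual notebook).
# The dbt project path and allowed-command set are assumptions.
import json
import subprocess

ALLOWED = {"run", "test", "seed", "snapshot", "compile"}
DBT_DIR = "/Workspace/Repos/csa-inabox/domains/shared/dbt"

command = dbutils.widgets.get("dbt_command")
target = dbutils.widgets.get("dbt_target")
models = dbutils.widgets.get("dbt_models")

if command not in ALLOWED:  # guard against shell injection via widgets
    raise ValueError(f"Unsupported dbt command: {command!r}")

args = ["dbt", command, "--target", target, "--profiles-dir", "."]
if models:
    args += ["--select", models]

# List-form call with no shell, so widget values are never shell-interpreted
proc = subprocess.run(args, cwd=DBT_DIR, capture_output=True,
                      text=True, timeout=3600)
print(proc.stdout)
if proc.returncode != 0:
    raise RuntimeError(proc.stderr)

# Summarize model outcomes from dbt's run_results.json artifact
with open(f"{DBT_DIR}/target/run_results.json") as f:
    summary = json.load(f)
print({r["unique_id"]: r["status"] for r in summary["results"]})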

Running dbt locally

cd domains/shared/dbt
dbt compile --profiles-dir .    # Verify SQL compiles
dbt seed --profiles-dir .       # Load sample data
dbt run --profiles-dir .        # Run models
dbt test --profiles-dir .       # Run tests
dbt snapshot --profiles-dir .   # Run SCD Type 2 snapshots

📊 Unity Catalog Setup

The unity_catalog_setup.py notebook configures:

  • Catalog creation — csa_analytics catalog
  • Schema per domain — sales, finance, shared
  • Schema per layer — bronze, silver, gold, snapshots
  • RBAC permissions — per-domain access controls mapped to AAD groups
  • External locations — ADLS Gen2 paths for Delta tables

Run it once per environment:

dbutils.notebook.run(
    "unity_catalog_setup",
    timeout_seconds=600,
    arguments={"environment": "dev", "dry_run": "true"}
)
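
Under the hood this boils down to a handful of SQL statements. A sketch of the kind the notebook issues — catalog and schema names come from the bullets above, but the AAD group name and the exact grants are assumptions:

# Illustrative sketch of the statements unity_catalog_setup.py issues;
# the group name `grp-csa-sales-readers` is an assumption.
spark.sql("CREATE CATALOG IF NOT EXISTS csa_analytics")

for schema in ["sales", "finance", "shared",
               "bronze", "silver", "gold", "snapshots"]:
    spark.sql(f"CREATE SCHEMA IF NOT EXISTS csa_analytics.{schema}")

# Per-domain RBAC mapped to an AAD group
spark.sql(
    "GRANT USE SCHEMA, SELECT ON SCHEMA csa_analytics.sales "
    "TO `grp-csa-sales-readers`"
)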

🔧 Troubleshooting

"Cluster not found" or "Cluster terminated"

  • Check that the cluster is running in the Compute tab
  • Verify the HTTP path matches an active SQL warehouse or cluster
  • For auto-terminated clusters, start them manually or increase the timeout

"Permission denied" on ADLS paths

  • Verify the cluster's managed identity has Storage Blob Data Contributor on the ADLS storage account
  • Check the Spark config includes the correct OAuth settings
  • For Unity Catalog, verify the external location is registered
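
A quick way to test access from a notebook, assuming a bronze container on the storage account:

# Fails fast with the underlying 403 if the managed identity lacks access
dbutils.fs.ls("abfss://bronze@<STORAGE>.dfs.core.windows.net/")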

"Module not found: governance"

The governance package must be installed on the cluster:

# In a notebook cell
%pip install /Workspace/Repos/csa-inabox/governance

Or add it to the cluster's Libraries configuration.

dbt connection timeout

  • Verify the SQL warehouse is running
  • Check the DBT_DATABRICKS_HTTP_PATH matches the warehouse's HTTP path
  • Increase connect_retries in profiles.yml if the warehouse is auto-starting (for example, connect_retries: 3 with connect_timeout: 30 under the target output)

Notebook run timeout

dbutils.notebook.run() requires a timeout_seconds argument, and a value of 0 disables the timeout entirely. Set explicit timeouts to fail fast:

dbutils.notebook.run("heavy_job", timeout_seconds=3600)  # 1 hour max
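
Timeouts and failures surface as exceptions from dbutils.notebook.run(), so a thin retry wrapper keeps transient failures from killing the whole run. A sketch — the retry count and back-off are arbitrary choices, not platform defaults:

import time

def run_with_retry(path, timeout_seconds, arguments=None, retries=2):
    """Re-run a notebook on failure; dbutils.notebook.run raises on timeout or error."""
    for attempt in range(retries + 1):
        try:
            return dbutils.notebook.run(path, timeout_seconds, arguments or {})
        except Exception as exc:
            if attempt == retries:
                raise  # out of retries: let the failure propagate
            print(f"Attempt {attempt + 1} failed ({exc}); retrying...")
            time.sleep(30)

run_with_retry("heavy_job", timeout_seconds=3600)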

See the general TROUBLESHOOTING.md for more issues.

