# Databricks CI/CD

CI/CD pipelines for Azure Databricks deployments.
## Overview

This guide covers:

- Databricks Asset Bundles (DABs)
- Notebook and job deployment via Azure DevOps and GitHub Actions
- Testing strategies (unit and integration)
- MLflow model deployment
## Databricks Asset Bundles

### Bundle Configuration

```yaml
# databricks.yml
bundle:
  name: analytics-pipeline

include:
  - resources/*.yml

# Variables referenced in resource files (e.g. ${var.shared_cluster_id})
# must be declared here
variables:
  shared_cluster_id:
    description: "ID of the shared cluster used by the transform task"

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-dev.azuredatabricks.net

  staging:
    mode: development
    workspace:
      host: https://adb-staging.azuredatabricks.net

  prod:
    mode: production
    workspace:
      host: https://adb-prod.azuredatabricks.net
    run_as:
      # The application ID of the deployment service principal
      service_principal_name: sp-prod-deployer
```
### Job Definition

```yaml
# resources/jobs.yml
resources:
  jobs:
    etl_daily:
      name: "ETL Daily Pipeline"

      schedule:
        quartz_cron_expression: "0 0 6 * * ?"
        timezone_id: "UTC"

      tasks:
        - task_key: ingest
          notebook_task:
            # Relative paths resolve from this file's location (resources/)
            notebook_path: ../notebooks/01_ingest.py
          new_cluster:
            spark_version: "13.3.x-scala2.12"
            node_type_id: "Standard_DS3_v2"
            num_workers: 2

        - task_key: transform
          depends_on:
            - task_key: ingest
          notebook_task:
            notebook_path: ../notebooks/02_transform.py
          existing_cluster_id: ${var.shared_cluster_id}

        - task_key: validate
          depends_on:
            - task_key: transform
          notebook_task:
            notebook_path: ../notebooks/03_validate.py

      email_notifications:
        on_failure:
          - data-team@company.com
```
## Azure DevOps Pipeline

### Build and Deploy

```yaml
# azure-pipelines.yml
trigger:
  branches:
    include:
      - main
      - develop
  paths:
    include:
      - databricks/**

pool:
  vmImage: 'ubuntu-latest'

variables:
  - group: databricks-secrets

stages:
  - stage: Validate
    jobs:
      - job: ValidateBundle
        steps:
          - task: UsePythonVersion@0
            inputs:
              versionSpec: '3.10'
          - script: |
              # Bundles require the new Databricks CLI (v0.205+);
              # the legacy pip package `databricks-cli` does not support them
              curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
              pip install databricks-sdk
            displayName: 'Install Databricks CLI and SDK'
          - script: |
              databricks bundle validate
            displayName: 'Validate Bundle'
            workingDirectory: databricks
            env:
              DATABRICKS_HOST: $(DATABRICKS_HOST_DEV)
              DATABRICKS_TOKEN: $(DATABRICKS_TOKEN_DEV)

  - stage: DeployDev
    dependsOn: Validate
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/develop'))
    jobs:
      - deployment: DeployToDev
        environment: 'databricks-dev'
        strategy:
          runOnce:
            deploy:
              steps:
                # Deployment jobs do not check out source by default
                - checkout: self
                - script: |
                    curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
                    databricks bundle deploy --target dev
                  workingDirectory: databricks
                  env:
                    DATABRICKS_HOST: $(DATABRICKS_HOST_DEV)
                    DATABRICKS_TOKEN: $(DATABRICKS_TOKEN_DEV)

  - stage: DeployProd
    dependsOn: Validate
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
    jobs:
      - deployment: DeployToProd
        environment: 'databricks-prod'
        strategy:
          runOnce:
            deploy:
              steps:
                - checkout: self
                - script: |
                    curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
                    databricks bundle deploy --target prod
                  workingDirectory: databricks
                  env:
                    DATABRICKS_HOST: $(DATABRICKS_HOST_PROD)
                    DATABRICKS_TOKEN: $(DATABRICKS_TOKEN_PROD)
```
## GitHub Actions

### Workflow Configuration

```yaml
# .github/workflows/databricks-deploy.yml
name: Databricks Deployment

on:
  push:
    branches: [main, develop]
    paths: ['databricks/**']
  pull_request:
    branches: [main]
    paths: ['databricks/**']

env:
  PYTHON_VERSION: '3.10'

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: ${{ env.PYTHON_VERSION }}
      # Bundles require the new Databricks CLI, not the legacy pip package
      - uses: databricks/setup-cli@main
      - name: Install Dependencies
        run: pip install databricks-sdk pytest
      - name: Validate Bundle
        working-directory: databricks
        run: databricks bundle validate
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST_DEV }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN_DEV }}

  test:
    needs: validate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: ${{ env.PYTHON_VERSION }}
      - name: Install Dependencies
        run: pip install pytest pyspark
      - name: Run Unit Tests
        working-directory: databricks
        run: pytest tests/unit -v

  deploy-dev:
    needs: test
    if: github.ref == 'refs/heads/develop'
    runs-on: ubuntu-latest
    environment: development
    steps:
      - uses: actions/checkout@v3
      - uses: databricks/setup-cli@main
      - name: Deploy to Dev
        working-directory: databricks
        run: databricks bundle deploy --target dev
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST_DEV }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN_DEV }}

  deploy-prod:
    needs: test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v3
      - uses: databricks/setup-cli@main
      - name: Deploy to Prod
        working-directory: databricks
        run: databricks bundle deploy --target prod
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST_PROD }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN_PROD }}
```
## Testing Strategy

### Unit Tests

```python
# tests/unit/test_transformations.py
import pytest
from pyspark.sql import SparkSession

from src.transformations import clean_data, calculate_metrics


@pytest.fixture(scope="session")
def spark():
    return (
        SparkSession.builder
        .master("local[*]")
        .appName("unit-tests")
        .getOrCreate()
    )


def test_clean_data(spark):
    # Arrange
    data = [(1, "Alice", None), (2, "Bob", "active")]
    df = spark.createDataFrame(data, ["id", "name", "status"])

    # Act
    result = clean_data(df)

    # Assert
    assert result.count() == 1
    assert result.filter("status IS NULL").count() == 0


def test_calculate_metrics(spark):
    data = [(1, 100), (1, 200), (2, 150)]
    df = spark.createDataFrame(data, ["customer_id", "amount"])

    result = calculate_metrics(df)

    assert result.count() == 2
    row = result.filter("customer_id = 1").first()
    assert row["total_amount"] == 300
```
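The tests above assume `src/transformations` exposes `clean_data` and `calculate_metrics`. A minimal sketch consistent with those assertions follows; the real implementations may differ:

```python
# src/transformations/__init__.py -- hypothetical implementations matching the tests
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def clean_data(df: DataFrame) -> DataFrame:
    """Drop rows with a NULL status (assumed cleaning rule)."""
    return df.dropna(subset=["status"])


def calculate_metrics(df: DataFrame) -> DataFrame:
    """Aggregate amounts per customer into a total_amount column."""
    return df.groupBy("customer_id").agg(F.sum("amount").alias("total_amount"))
```

Keeping this logic in a plain Python package rather than inside the notebooks is what makes it testable on a local Spark session.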
### Integration Tests

```python
# tests/integration/test_pipeline.py
import pytest
from databricks.sdk import WorkspaceClient

# Note: targets with `mode: development` prefix deployed resource names
# (e.g. "[dev <user>] ETL Daily Pipeline"), so run these tests against a
# target that keeps the plain job name, or adjust the filter accordingly.
JOB_NAME = "ETL Daily Pipeline"


@pytest.fixture
def workspace():
    return WorkspaceClient()


def test_job_exists(workspace):
    """Verify the job was deployed correctly."""
    jobs = workspace.jobs.list(name=JOB_NAME)
    job_list = list(jobs)
    assert len(job_list) == 1


def test_cluster_config(workspace):
    """Verify the ingest task's cluster configuration."""
    job = next(iter(workspace.jobs.list(name=JOB_NAME)))
    job_details = workspace.jobs.get(job.job_id)

    task = job_details.settings.tasks[0]
    cluster = task.new_cluster
    assert cluster.num_workers == 2
    assert cluster.node_type_id == "Standard_DS3_v2"
```
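Beyond these static checks, a post-deployment smoke test can trigger the job and wait for a terminal state. A minimal sketch using the databricks-sdk; the job name and timeout are assumptions to adapt, and note this actually runs the pipeline, so reserve it for environments where that is cheap:

```python
# tests/integration/test_smoke.py -- hypothetical post-deploy smoke test
from datetime import timedelta

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import RunResultState


def test_job_runs_to_success():
    w = WorkspaceClient()
    job = next(iter(w.jobs.list(name="ETL Daily Pipeline")))

    # Trigger a run and block until it finishes (or the timeout expires)
    run = w.jobs.run_now(job_id=job.job_id).result(timeout=timedelta(minutes=30))
    assert run.state.result_state == RunResultState.SUCCESS
```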
## MLflow Model Deployment

### Model Registry CI/CD

```python
# deploy_model.py
from mlflow.tracking import MlflowClient

from databricks.sdk import WorkspaceClient
from databricks.sdk.errors import NotFound
from databricks.sdk.service.serving import (
    EndpointCoreConfigInput,
    ServedEntityInput,
)


def promote_model(model_name: str, version: int, stage: str):
    """Promote a model version to the specified stage."""
    client = MlflowClient()

    # Transition the version, archiving whatever previously held the stage
    client.transition_model_version_stage(
        name=model_name,
        version=version,
        stage=stage,
        archive_existing_versions=True,
    )
    print(f"Model {model_name} v{version} promoted to {stage}")


def deploy_to_serving(model_name: str, endpoint_name: str):
    """Deploy the Production model version to a serving endpoint."""
    w = WorkspaceClient()

    # Get the current Production model version
    client = MlflowClient()
    prod_version = client.get_latest_versions(model_name, stages=["Production"])[0]

    served_entities = [
        ServedEntityInput(
            entity_name=model_name,
            entity_version=str(prod_version.version),
            workload_size="Small",
            scale_to_zero_enabled=True,
        )
    ]

    # Update the endpoint if it exists; otherwise create it
    try:
        w.serving_endpoints.update_config_and_wait(
            name=endpoint_name,
            served_entities=served_entities,
        )
    except NotFound:
        w.serving_endpoints.create_and_wait(
            name=endpoint_name,
            config=EndpointCoreConfigInput(served_entities=served_entities),
        )
```
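To call these functions from a pipeline, `deploy_model.py` needs a small entrypoint. A sketch appended to the script above; the flag names are assumptions, not an established interface:

```python
# Hypothetical CI entrypoint, e.g.:
#   python deploy_model.py --model churn-model --version 3 --endpoint churn-serving
import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Promote and deploy a registered model")
    parser.add_argument("--model", required=True, help="Registered model name")
    parser.add_argument("--version", required=True, type=int, help="Model version to promote")
    parser.add_argument("--endpoint", required=True, help="Serving endpoint name")
    args = parser.parse_args()

    promote_model(args.model, args.version, "Production")
    deploy_to_serving(args.model, args.endpoint)
```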
## Best Practices

### Repository Structure

```text
databricks/
├── databricks.yml          # Bundle configuration
├── resources/
│   ├── jobs.yml            # Job definitions
│   ├── clusters.yml        # Cluster configurations
│   └── pipelines.yml       # DLT pipeline definitions
├── notebooks/
│   ├── 01_ingest.py
│   ├── 02_transform.py
│   └── 03_validate.py
├── src/
│   └── transformations/    # Shared Python modules
├── tests/
│   ├── unit/
│   └── integration/
└── requirements.txt
```
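With this layout, pytest needs the bundle root on `sys.path` so `src.transformations` resolves from the unit tests. A minimal `conftest.py` sketch, assuming tests are invoked from the `databricks/` directory as in the CI examples above:

```python
# tests/conftest.py -- assumed helper so tests can import src.transformations
import sys
from pathlib import Path

# Add the bundle root (the parent of tests/) to the import path
sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
```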
## Security

- Use service principals for deployments instead of personal access tokens (see the sketch after this list)
- Store secrets in Azure Key Vault
- Use Unity Catalog for data governance
- Enable audit logging
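For service-principal deployments, the databricks-sdk can authenticate with Microsoft Entra ID client credentials rather than a PAT. A minimal sketch; the environment variables follow the SDK's unified authentication, and all values shown are placeholders:

```python
# Hypothetical CI auth setup: the SDK picks these up automatically
# (Azure client-secret auth) -- no DATABRICKS_TOKEN needed:
#   ARM_CLIENT_ID=<service principal application ID>
#   ARM_CLIENT_SECRET=<secret pulled from Key Vault>
#   ARM_TENANT_ID=<Entra ID tenant>
#   DATABRICKS_HOST=https://adb-prod.azuredatabricks.net
from databricks.sdk import WorkspaceClient

# Explicit form, equivalent to the environment variables above
w = WorkspaceClient(
    host="https://adb-prod.azuredatabricks.net",
    azure_client_id="<application-id>",
    azure_client_secret="<client-secret>",
    azure_tenant_id="<tenant-id>",
)
print(w.current_user.me().user_name)  # verify the pipeline runs as the SP
```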
Last Updated: January 2025