
Azure Databricks Cost Optimization


💰 Databricks Cost Strategy: Optimize Azure Databricks total cost of ownership (TCO) through intelligent cluster configuration, job scheduling, autoscaling, and workload management.


Overview

Databricks Cost Components

| Component | Pricing Model | Primary Cost Drivers |
|-----------|---------------|----------------------|
| DBU (Databricks Units) | Per DBU-hour | Cluster type, workload, tier |
| Compute (VMs) | Per VM-hour | VM type, number of nodes |
| Storage | Per GB/month | DBFS, Delta Lake tables |
| Data Transfer | Per GB | Cross-region, egress |
| Premium Features | Per feature | Jobs orchestration, ML Runtime |

Quick Wins

  1. Enable Cluster Autoscaling - 30-50% savings on underutilized clusters
  2. Use Spot/Low Priority VMs - Up to 80% discount on VM costs
  3. Auto-Terminate Idle Clusters - Eliminate waste from forgotten clusters
  4. Right-Size VM Types - Match VM specs to workload requirements
  5. Optimize Delta Lake Storage - 40-60% storage savings with OPTIMIZE

Total Potential Savings: 30-50% on Databricks TCO

Databricks Cost Model

Billing Components Breakdown

Total Databricks Cost = DBU Cost + VM Cost + Storage Cost + Data Transfer

Example Calculation (Standard Tier, General Purpose Cluster):
- Cluster: 4 × Standard_DS3_v2 (4 vCPU, 14 GB each)
- Runtime: 8 hours/day, 20 days/month
- Data: 1 TB in Delta Lake

Monthly Costs:
1. DBU Cost: 4 nodes × 0.75 DBU/node-hour × 160 hours × $0.30/DBU = $144
2. VM Cost: 4 nodes × $0.192/hour × 160 hours = $122.88
3. Storage: 1 TB × $0.0184/GB = $18.84
4. Data Transfer: Minimal (same region) = $2

Total: $287.72/month

With Optimizations:
1. DBU Cost (autoscale 2-4 nodes): 3 avg × 0.75 × 160 × $0.30 = $108
2. VM Cost (spot instances, 70% discount): $36.86
3. Storage (optimized): 1 TB × $0.01/GB (cool tier) = $10
4. Data Transfer: $2

Optimized Total: $156.86/month
Savings: $130.86/month (45.5%)
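
The same arithmetic can be wrapped in a small helper for comparing scenarios before touching any clusters. This is a minimal sketch; the DBU, VM, and storage rates are the illustrative figures from the example above, not current list prices.

# Minimal monthly cost estimator mirroring the formula above
# (all rates are example figures; verify against current Azure Databricks pricing)
def estimate_monthly_cost(nodes, hours, dbu_per_node_hour=0.75, dbu_rate=0.30,
                          vm_rate=0.192, vm_discount=0.0,
                          storage_gb=1024, storage_rate=0.0184,
                          data_transfer=2.0):
    """Return an estimated monthly cost breakdown in USD."""
    dbu_cost = nodes * dbu_per_node_hour * hours * dbu_rate
    vm_cost = nodes * vm_rate * hours * (1 - vm_discount)
    storage_cost = storage_gb * storage_rate
    total = dbu_cost + vm_cost + storage_cost + data_transfer
    return {"dbu": dbu_cost, "vm": vm_cost, "storage": storage_cost,
            "transfer": data_transfer, "total": total}

baseline = estimate_monthly_cost(nodes=4, hours=160)             # ≈ $287.72
optimized = estimate_monthly_cost(nodes=3, hours=160,            # autoscaled average
                                  vm_discount=0.70,              # spot pricing
                                  storage_rate=0.01)             # cool tier
print(f"Baseline:  ${baseline['total']:.2f}/month")
print(f"Optimized: ${optimized['total']:.2f}/month (varies with how the spot discount is applied)")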

Cluster Optimization

1. Cluster Autoscaling

Configure Autoscaling:

# Databricks SDK - Create autoscaling cluster
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import AutoScale, AzureAttributes, AzureAvailability

w = WorkspaceClient()

# Create autoscaling cluster on spot VMs with on-demand fallback
cluster = w.clusters.create(
    cluster_name="autoscale-cluster",
    spark_version="13.3.x-scala2.12",
    node_type_id="Standard_DS3_v2",
    autoscale=AutoScale(
        min_workers=2,
        max_workers=8
    ),
    auto_termination_minutes=30,
    enable_elastic_disk=True,
    spark_conf={
        "spark.databricks.delta.optimizeWrite.enabled": "true",
        "spark.databricks.delta.autoCompact.enabled": "true"
    },
    azure_attributes=AzureAttributes(
        availability=AzureAvailability.SPOT_WITH_FALLBACK_AZURE,
        spot_bid_max_price=-1  # -1 caps the bid at the on-demand price
    )
).result()  # wait until the cluster is running

print(f"Created autoscaling cluster: {cluster.cluster_id}")

Terraform Configuration:

resource "databricks_cluster" "autoscale_optimized" {
  cluster_name            = "cost-optimized-cluster"
  spark_version           = "13.3.x-scala2.12"
  node_type_id           = "Standard_DS3_v2"
  autotermination_minutes = 30
  enable_elastic_disk     = true

  autoscale {
    min_workers = 2
    max_workers = 8
  }

  azure_attributes {
    availability            = "SPOT_WITH_FALLBACK_AZURE"
    first_on_demand        = 1
    spot_bid_max_price     = -1
  }

  spark_conf = {
    "spark.databricks.delta.optimizeWrite.enabled" = "true"
    "spark.databricks.delta.autoCompact.enabled"   = "true"
    "spark.speculation" = "true"
  }

  custom_tags = {
    "CostCenter" = "Analytics"
    "Environment" = "Production"
  }
}

Cost Impact: 30-50% reduction with autoscaling

2. Spot/Low-Priority Instances

Spot Instance Best Practices:

# Job cluster with spot instances
job_cluster_config = {
    "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 4,
        "azure_attributes": {
            "availability": "SPOT_WITH_FALLBACK_AZURE",
            "first_on_demand": 1,  # First node always on-demand
            "spot_bid_max_price": -1
        },
        "spark_conf": {
            # Enable speculation for spot instance resilience
            "spark.speculation": "true",
            "spark.speculation.multiplier": "3",
            "spark.speculation.quantile": "0.9"
        }
    }
}

Cost Impact: Up to 80% VM cost savings

3. Pool-Based Clusters

Create Instance Pools:

from databricks.sdk.service.compute import InstancePoolAzureAttributes

# Create instance pool for faster startup and cost savings
pool = w.instance_pools.create(
    instance_pool_name="standard-pool",
    node_type_id="Standard_DS3_v2",
    min_idle_instances=2,
    max_capacity=10,
    idle_instance_autotermination_minutes=15,
    azure_attributes=InstancePoolAzureAttributes(
        availability="SPOT_WITH_FALLBACK_AZURE",
        spot_bid_max_price=-1
    )
)

# Create cluster from pool
cluster_from_pool = w.clusters.create(
    cluster_name="pool-cluster",
    spark_version="13.3.x-scala2.12",
    instance_pool_id=pool.instance_pool_id,
    autoscale=AutoScale(min_workers=2, max_workers=8),
    auto_termination_minutes=20
)

Cost Impact: 20-30% faster cluster startup; idle pool instances accrue VM charges but no DBUs, which reduces idle costs

4. Right-Size VM Selection

VM Selection Guide:

| Workload Type | Recommended VM Series | vCPU:Memory Ratio | Use Case |
|---------------|-----------------------|-------------------|----------|
| Memory-Intensive | E-series (Esv3) | 1:8 | Large aggregations, caching |
| Compute-Intensive | F-series (Fsv2) | 1:2 | Transformations, ML training |
| Balanced | D-series (Dsv3) | 1:4 | General analytics |
| Cost-Optimized | B-series (Bs) | Variable | Dev/test environments |
| GPU Workloads | NC-series | With GPU | Deep learning, AI |

Python VM Recommendation Script:

def recommend_vm_type(workload_characteristics):
    """Recommend VM type based on workload characteristics"""

    memory_per_core = workload_characteristics.get('memory_per_core_gb', 4)
    gpu_required = workload_characteristics.get('gpu', False)
    workload_type = workload_characteristics.get('type', 'balanced')

    if gpu_required:
        return "Standard_NC6s_v3"  # GPU instance

    if workload_type == 'memory_intensive' or memory_per_core > 6:
        return "Standard_E8s_v3"  # 8 vCPU, 64 GB RAM

    elif workload_type == 'compute_intensive' or memory_per_core < 3:
        return "Standard_F8s_v2"  # 8 vCPU, 16 GB RAM

    else:  # Balanced workload
        return "Standard_DS3_v2"  # 4 vCPU, 14 GB RAM

# Example usage
workload = {
    'type': 'memory_intensive',
    'memory_per_core_gb': 8,
    'gpu': False
}

recommended_vm = recommend_vm_type(workload)
print(f"Recommended VM: {recommended_vm}")

Job Scheduling and Automation

1. Optimize Job Scheduling

Schedule Jobs During Off-Peak Hours:

from databricks.sdk.service import compute, jobs

# Create cost-optimized job schedule
job = w.jobs.create(
    name="nightly-etl-job",
    tasks=[
        jobs.Task(
            task_key="etl-task",
            new_cluster=compute.ClusterSpec(
                spark_version="13.3.x-scala2.12",
                node_type_id="Standard_DS3_v2",
                num_workers=4,
                azure_attributes=compute.AzureAttributes(
                    availability=compute.AzureAvailability.SPOT_WITH_FALLBACK_AZURE
                )
            ),
            notebook_task=jobs.NotebookTask(
                notebook_path="/ETL/process_data",
                base_parameters={"date": "{{job.start_time}}"}
            ),
            timeout_seconds=3600
        )
    ],
    schedule=jobs.CronSchedule(
        quartz_cron_expression="0 0 2 * * ?",  # 2 AM daily
        timezone_id="America/New_York"
    ),
    max_concurrent_runs=1,
    tags={
        "cost_center": "analytics",
        "schedule": "off_peak"
    }
)

2. Job Cluster vs Interactive Cluster

Job Cluster Best Practices:

# ✅ GOOD: Use job clusters for automated workloads
job_with_cluster = {
    "name": "data-processing-job",
    "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 4,
        "azure_attributes": {
            "availability": "SPOT_WITH_FALLBACK_AZURE"
        }
    },
    "notebook_task": {
        "notebook_path": "/Jobs/process_data"
    }
}

# ❌ BAD: Don't run scheduled jobs on all-purpose (interactive) clusters;
# they bill at a higher DBU rate and keep accruing charges while idle

Cost Impact: 40-60% savings using job clusters instead of interactive
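
Most of that gap comes from the DBU rate difference between all-purpose and jobs compute, plus the hours an interactive cluster spends idle. A rough back-of-the-envelope comparison, using assumed rates (check the current Azure Databricks price list for your tier and region):

# Rough DBU-cost comparison: all-purpose (interactive) vs. jobs compute
# Rates are assumptions for illustration, not current list prices
ALL_PURPOSE_RATE = 0.55   # $/DBU (assumed premium-tier all-purpose rate)
JOBS_RATE = 0.30          # $/DBU (assumed premium-tier jobs rate)

nodes, dbu_per_node_hour = 4, 0.75
job_hours = 8 * 20            # actual processing time per month
interactive_hours = 12 * 20   # same work on a cluster left up 12 h/day

interactive_cost = nodes * dbu_per_node_hour * interactive_hours * ALL_PURPOSE_RATE
job_cluster_cost = nodes * dbu_per_node_hour * job_hours * JOBS_RATE

print(f"All-purpose cluster: ${interactive_cost:.2f}/month in DBUs")
print(f"Job cluster:         ${job_cluster_cost:.2f}/month in DBUs")
print(f"DBU savings:         {1 - job_cluster_cost / interactive_cost:.0%}")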

3. Workflow Optimization

Multi-Task Jobs with Dependencies:

# Create efficient multi-task workflow, sizing each task's cluster to its work
workflow = w.jobs.create(
    name="optimized-workflow",
    tasks=[
        jobs.Task(
            task_key="extract",
            new_cluster=compute.ClusterSpec(
                spark_version="13.3.x-scala2.12",
                node_type_id="Standard_F4s_v2",  # Compute-optimized
                num_workers=2
            ),
            notebook_task=jobs.NotebookTask(notebook_path="/Workflow/extract")
        ),
        jobs.Task(
            task_key="transform",
            depends_on=[jobs.TaskDependency(task_key="extract")],
            new_cluster=compute.ClusterSpec(
                spark_version="13.3.x-scala2.12",
                node_type_id="Standard_E8s_v3",  # Memory-optimized
                autoscale=compute.AutoScale(min_workers=2, max_workers=8)
            ),
            notebook_task=jobs.NotebookTask(notebook_path="/Workflow/transform")
        ),
        jobs.Task(
            task_key="load",
            depends_on=[jobs.TaskDependency(task_key="transform")],
            new_cluster=compute.ClusterSpec(
                spark_version="13.3.x-scala2.12",
                node_type_id="Standard_DS3_v2",
                num_workers=2
            ),
            notebook_task=jobs.NotebookTask(notebook_path="/Workflow/load")
        )
    ]
)

Delta Lake Cost Optimization

1. OPTIMIZE and VACUUM

Regular Optimization:

-- Optimize Delta tables to reduce file count and improve performance
OPTIMIZE delta.`/mnt/data/sales`
WHERE date >= current_date() - INTERVAL 7 DAYS;

-- Z-order optimization for common queries
OPTIMIZE delta.`/mnt/data/sales`
ZORDER BY (customer_id, product_id);

-- Vacuum old files (after retention period)
VACUUM delta.`/mnt/data/sales` RETAIN 168 HOURS;  -- 7 days

Python Automation:

from delta.tables import DeltaTable

def optimize_delta_tables(table_paths, zorder_columns=None):
    """Optimize Delta tables to reduce storage costs"""

    for table_path in table_paths:
        print(f"Optimizing {table_path}...")

        delta_table = DeltaTable.forPath(spark, table_path)

        # Run OPTIMIZE
        if zorder_columns:
            # Unpack the list: executeZOrderBy takes column names as varargs
            delta_table.optimize().executeZOrderBy(*zorder_columns)
        else:
            delta_table.optimize().executeCompaction()

        # Vacuum old files
        delta_table.vacuum(retentionHours=168)

        # Get metrics
        history = delta_table.history(1)
        print(f"Optimized {table_path}: {history.select('operationMetrics').first()}")

# Schedule daily optimization
tables_to_optimize = [
    "/mnt/data/sales",
    "/mnt/data/customers",
    "/mnt/data/products"
]

optimize_delta_tables(tables_to_optimize, zorder_columns=["date", "region"])

Cost Impact: 40-60% storage reduction, improved query performance

2. Table Properties Optimization

Cost-Optimized Table Configuration:

-- Create Delta table with cost-optimized properties
CREATE TABLE sales_optimized (
    order_id BIGINT,
    customer_id BIGINT,
    amount DECIMAL(10,2),
    order_date DATE
)
USING DELTA
PARTITIONED BY (order_date)
TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact' = 'true',
    'delta.deletedFileRetentionDuration' = 'interval 7 days',
    'delta.logRetentionDuration' = 'interval 30 days',
    'delta.enableChangeDataFeed' = 'false',  -- Disable if not needed
    'delta.checkpoint.writeStatsAsJson' = 'false'  -- Reduce metadata size
);

Storage Cost Management

1. Lifecycle Policies

DBFS Storage Tiering:

# Azure CLI: Configure lifecycle policy for DBFS storage account
az storage account management-policy create \
    --account-name dbfsstorage \
    --resource-group rg-databricks \
    --policy @dbfs-lifecycle.json

dbfs-lifecycle.json:

{
  "rules": [
    {
      "enabled": true,
      "name": "tier-old-data",
      "type": "Lifecycle",
      "definition": {
        "actions": {
          "baseBlob": {
            "tierToCool": {
              "daysAfterModificationGreaterThan": 30
            },
            "tierToArchive": {
              "daysAfterModificationGreaterThan": 90
            }
          }
        },
        "filters": {
          "blobTypes": ["blockBlob"],
          "prefixMatch": ["dbfs/mnt/archive/"]
        }
      }
    }
  ]
}

2. External Storage Configuration

Use External Storage:

# Mount Azure Data Lake Storage Gen2 (you control its tiers and lifecycle policies, unlike DBFS root)
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<client-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("key-vault", "client-secret"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}

dbutils.fs.mount(
    source="abfss://data@storageaccount.dfs.core.windows.net/",
    mount_point="/mnt/external-data",
    extra_configs=configs
)

# Write to external storage so your lifecycle policies can tier older data
df.write \
    .format("delta") \
    .mode("overwrite") \
    .save("/mnt/external-data/tables/sales")

Reserved Capacity

1. Databricks Pre-Purchase Plan

Purchase Reserved DBUs:

# PowerShell: Illustrative sizing for a Databricks pre-purchase plan (DBCU)
# Pre-purchase plans are bought through Azure Reservations or your Microsoft
# account team; the hashtable below only sketches the commitment parameters

$commitment = @{
    ProductName = "Azure Databricks Premium"
    Term = "P3Y"  # 3-year commitment
    Quantity = 100000  # DBUs
    Region = "East US"
}

# Estimated savings: 30-40% over pay-as-you-go

Cost Impact: 30-40% savings with 3-year commitment
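
Commitments only pay off if you actually burn the DBUs. A minimal utilization check, assuming an illustrative 37% discount and a $0.30/DBU pay-as-you-go rate (confirm both with your Microsoft account team):

# Effective $/DBU of a pre-purchase commitment at different utilization levels
# Discount and pay-as-you-go rate are illustrative assumptions
def effective_dbu_rate(committed_dbus, used_dbus, payg_rate=0.30, discount=0.37):
    """Committed spend divided by the DBUs actually consumed."""
    committed_cost = committed_dbus * payg_rate * (1 - discount)
    return committed_cost / max(used_dbus, 1)

commitment = 1_800_000  # ~50k DBUs/month over a 3-year term
for utilization in (1.0, 0.8, 0.6):
    rate = effective_dbu_rate(commitment, commitment * utilization)
    print(f"{utilization:.0%} utilization -> effective ${rate:.3f}/DBU "
          f"(pay-as-you-go: $0.300/DBU)")

With these assumptions, the commitment stops saving money once utilization falls below roughly 63%, which is why a usage forecast belongs in the purchase decision.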

Monitoring and Governance

1. Cost Monitoring Dashboard

Azure Monitor Query:

// Databricks cluster cost analysis
// Field names below depend on how Databricks diagnostic logs land in Log Analytics; adjust to your schema
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.DATABRICKS"
| where Category == "clusters"
| extend
    ClusterName = tostring(parse_json(properties_s).cluster_name),
    ClusterSize = toint(parse_json(properties_s).num_workers),
    VMType = tostring(parse_json(properties_s).node_type_id)
| summarize
    TotalRuntime = sum(toint(parse_json(properties_s).runtime_minutes)),
    AvgClusterSize = avg(ClusterSize)
by ClusterName, VMType
| extend
    EstimatedCost = TotalRuntime * AvgClusterSize * 0.75 * 0.30 / 60  // min × nodes × 0.75 DBU/node-hr × $0.30/DBU ÷ 60
| project ClusterName, VMType, TotalRuntime, AvgClusterSize, EstimatedCost
| order by EstimatedCost desc

2. Automated Cost Alerts

Budget Alert Configuration:

# Create budget for Databricks workspace
az consumption budget create \
    --resource-group rg-databricks \
    --budget-name databricks-monthly-budget \
    --amount 5000 \
    --time-grain Monthly \
    --start-date "2024-01-01" \
    --end-date "2025-12-31" \
    --notification threshold=80 contactEmails="team@company.com" \
    --notification threshold=100 contactEmails="team@company.com,exec@company.com"

3. Cost Tagging Strategy

Tag Resources for Cost Allocation:

# Tag clusters for cost tracking
w.clusters.edit(
    cluster_id=cluster.cluster_id,
    cluster_name="production-etl",
    spark_version="13.3.x-scala2.12",
    node_type_id="Standard_DS3_v2",
    num_workers=4,
    custom_tags={
        "CostCenter": "Analytics",
        "Environment": "Production",
        "Owner": "data-team",
        "Project": "customer-analytics"
    }
)

Implementation Checklist

Immediate Actions (Week 1)

  • Enable autoscaling on all clusters
  • Configure auto-termination (15-30 minutes)
  • Switch to spot instances for job clusters
  • Identify and terminate unused clusters (see the sketch after this list)
  • Review and right-size VM types
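
For the unused-cluster item above, a minimal sketch using the Databricks SDK that flags running, user-created clusters with no auto-termination configured; treat the selection criteria as a starting point rather than a definitive policy:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import ClusterSource, State

w = WorkspaceClient()

# Flag running all-purpose clusters that will never shut themselves down
for c in w.clusters.list():
    if (c.state == State.RUNNING
            and c.cluster_source in (ClusterSource.UI, ClusterSource.API)
            and not c.autotermination_minutes):
        print(f"Review: {c.cluster_name} ({c.cluster_id}) has no auto-termination")
        # After review, terminate with:
        # w.clusters.delete(cluster_id=c.cluster_id)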

Short-Term (Month 1)

  • Migrate interactive workloads to job clusters
  • Implement instance pools
  • Optimize Delta Lake tables (OPTIMIZE/VACUUM)
  • Configure lifecycle policies on DBFS storage
  • Set up cost monitoring and alerts

Mid-Term (Quarter 1)

  • Schedule jobs during off-peak hours
  • Implement table optimization automation
  • Review and optimize cluster configurations
  • Evaluate reserved capacity options
  • Conduct cost allocation analysis

Long-Term (Year 1)

  • Purchase reserved DBUs (if applicable)
  • Implement comprehensive cost governance
  • Optimize cross-workspace data sharing
  • Regular quarterly cost reviews
  • Advanced workload optimization

Cost Optimization ROI

Expected Savings by Category

| Optimization | Implementation Effort | Time to Value | Annual Savings Potential |
|--------------|------------------------|---------------|--------------------------|
| Autoscaling | Low | Immediate | 30-50% |
| Spot Instances | Low | Immediate | 60-80% on VMs |
| Auto-Termination | Low | Immediate | 20-40% |
| Job vs Interactive | Medium | 1 week | 40-60% |
| Delta Optimization | Medium | 2 weeks | 40-60% on storage |
| Reserved Capacity | Low | Immediate | 30-40% on DBUs |

💰 Databricks Cost Optimization is Essential: Databricks can be one of the most expensive components in a data platform. Regular monitoring, right-sizing, and automation are critical to maintaining cost efficiency while delivering value.