Tutorial 14: Cost Optimization¶
Estimated Time: 3-4 hours | Difficulty: Intermediate
Reduce your Azure data platform spend by 30-60% without sacrificing performance. This tutorial walks through budget alerts, auto-shutdown schedules, right-sizing, reserved capacity analysis, storage lifecycle management, and continuous cost monitoring so your CSA-in-a-Box environment runs lean from day one.
Prerequisites¶
- Completed Tutorial 01: Foundation Platform with all resources deployed
- Azure subscription with Contributor role (Owner required for policy assignments)
- Azure CLI 2.50+ with Bicep CLI 0.22+
- Cost Management + Billing reader role at the subscription level
- (Optional) Power BI Desktop for Step 9
az version --output table
az provider show -n Microsoft.CostManagement --query "registrationState" -o tsv
Architecture Diagram¶
graph TB
subgraph CostMgmt["Azure Cost Management"]
VIEWS[Cost Views]
EXPORTS[Cost Exports]
BUDGETS[Budgets]
end
subgraph Alerts["Budget Alerts"]
T80["80% Threshold"]
T100["100% Threshold"]
T120["120% Threshold"]
end
subgraph Actions["Action Groups"]
EMAIL[Email Notifications]
TEAMS[Teams Webhook]
RUNBOOK[Automation Runbook]
end
subgraph Advisor["Azure Advisor"]
RECS[Recommendations]
IMPL[Implementation]
TRACK[Savings Tracking]
end
VIEWS --> BUDGETS
BUDGETS --> T80 & T100 & T120
T80 --> EMAIL
T100 --> EMAIL & TEAMS
T120 --> TEAMS & RUNBOOK
RECS --> IMPL --> TRACK --> VIEWS
style CostMgmt fill:#e6f0ff,stroke:#0078d4
style Alerts fill:#fff3e6,stroke:#ff8c00
style Actions fill:#f0e6ff,stroke:#5c2d91
style Advisor fill:#e6ffe6,stroke:#107c10
Step 1: Set Up Cost Management Fundamentals¶
Enable Cost Management and configure views that show where money is going.
az provider register --namespace Microsoft.CostManagement
az provider register --namespace Microsoft.Consumption
export CSA_PREFIX="csa"
export CSA_ENV="dev"
export CSA_SUBSCRIPTION_ID=$(az account show --query "id" -o tsv)
export CSA_RG_DLZ="${CSA_PREFIX}-rg-dlz-${CSA_ENV}"
Set up a recurring cost export to your storage account:
STORAGE_ACCT=$(az storage account list \
--resource-group "$CSA_RG_DLZ" --query "[0].name" -o tsv)
az costmanagement export create \
--name "csa-daily-cost-export" \
--scope "subscriptions/$CSA_SUBSCRIPTION_ID" \
--type "ActualCost" \
--timeframe "MonthToDate" \
--recurrence "Daily" \
--schedule-status "Active" \
--storage-account-id "/subscriptions/$CSA_SUBSCRIPTION_ID/resourceGroups/$CSA_RG_DLZ/providers/Microsoft.Storage/storageAccounts/$STORAGE_ACCT" \
--storage-container "cost-exports" \
--storage-directory "daily"
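Each export run drops a CSV of line-item costs into the container. Below is a minimal Python sketch of rolling such a file up by resource group; the `ResourceGroup` and `CostInBillingCurrency` column names are assumptions, so check the header row of your actual export:

```python
import csv
import io
from collections import defaultdict

def cost_by_resource_group(reader):
    """Sum cost per resource group from cost-export rows.

    `reader` yields dicts keyed by the export's column names
    (assumed here -- verify against your file's header).
    """
    totals = defaultdict(float)
    for row in reader:
        totals[row["ResourceGroup"]] += float(row["CostInBillingCurrency"])
    return dict(totals)

# Inline sample standing in for a downloaded export file
sample_csv = io.StringIO(
    "ResourceGroup,CostInBillingCurrency\n"
    "csa-rg-dlz-dev,12.50\n"
    "csa-rg-dlz-dev,7.25\n"
    "csa-rg-ml-dev,3.00\n"
)
totals = cost_by_resource_group(csv.DictReader(sample_csv))
print(totals)  # {'csa-rg-dlz-dev': 19.75, 'csa-rg-ml-dev': 3.0}
```

The same aggregation works unchanged on the real files once you download them from the `cost-exports` container.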
Step 2: Implement Resource Tagging¶
Tags are the foundation of cost accountability. Every resource should carry these four tags:
| Tag Key | Purpose | Example Values |
|---|---|---|
| `environment` | Deployment stage | `dev`, `staging`, `prod` |
| `cost-center` | Budget owner | `data-engineering`, `ml-ops` |
| `owner` | Team contact | `platform-team` |
| `project` | Workload identifier | `csa-inabox`, `usda-etl` |
Deploy a policy that audits untagged resources:
// file: deploy/bicep/cost/tag-policy.bicep
targetScope = 'subscription'
param requiredTags array = ['environment', 'cost-center', 'owner', 'project']
resource policyDefinition 'Microsoft.Authorization/policyDefinitions@2023-04-01' = [for tag in requiredTags: {
name: 'require-tag-${tag}'
properties: {
policyType: 'Custom'
mode: 'Indexed'
displayName: 'Require tag: ${tag}'
policyRule: {
if: { field: 'tags[\'${tag}\']', exists: 'false' }
then: { effect: 'audit' }
}
}
}]
az deployment sub create \
--location eastus \
--template-file deploy/bicep/cost/tag-policy.bicep \
--name "tag-policy-$(date +%Y%m%d%H%M)"
# Backfill tags on existing resource group
az tag update \
--resource-id "/subscriptions/$CSA_SUBSCRIPTION_ID/resourceGroups/$CSA_RG_DLZ" \
--operation merge \
--tags environment="$CSA_ENV" cost-center=data-engineering owner=platform-team project=csa-inabox
Warning
Use audit mode first to inventory untagged resources. Switch to deny only after cleanup, otherwise new deployments will fail.
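The audit policy surfaces non-compliant resources in the portal, but the same check is easy to script when triaging a long list. A small sketch of the tag-gap logic, using the four required keys from the table above:

```python
REQUIRED_TAGS = {"environment", "cost-center", "owner", "project"}

def missing_tags(resource_tags):
    """Return the required tag keys absent from a resource's tag dict.

    Handles the common case where an untagged resource reports
    its tags as None rather than an empty dict.
    """
    return sorted(REQUIRED_TAGS - set(resource_tags or {}))

print(missing_tags({"environment": "dev", "owner": "platform-team"}))
# ['cost-center', 'project']
```

Feed it the `tags` field from `az resource list --output json` to build a backfill worklist.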
Step 3: Create Budget Alerts¶
Deploy budgets with escalating thresholds at 80%, 100%, and 120%.
// file: deploy/bicep/cost/budgets.bicep
targetScope = 'subscription'
param monthlyBudget int = 1000
param alertEmails array = ['platform-team@contoso.com']
// utcNow() is only valid as a parameter default value, not inline in properties
param startDate string = utcNow('yyyy-MM-01')
resource budget 'Microsoft.Consumption/budgets@2023-05-01' = {
name: 'csa-monthly-budget'
properties: {
category: 'Cost'
amount: monthlyBudget
timeGrain: 'Monthly'
timePeriod: { startDate: startDate }
notifications: {
warning80: { enabled: true, operator: 'GreaterThanOrEqualTo', threshold: 80, contactEmails: alertEmails, thresholdType: 'Actual' }
critical100: { enabled: true, operator: 'GreaterThanOrEqualTo', threshold: 100, contactEmails: alertEmails, thresholdType: 'Actual' }
overrun120: { enabled: true, operator: 'GreaterThanOrEqualTo', threshold: 120, contactEmails: alertEmails, thresholdType: 'Actual' }
}
}
}
Resource group budget via CLI:
az consumption budget create \
--budget-name "csa-dlz-monthly-budget" \
--resource-group "$CSA_RG_DLZ" \
--amount 500 --time-grain Monthly \
--start-date "$(date +%Y-%m)-01" --end-date "2027-12-31" \
--category Cost
Tip
Add a forecasted budget alongside the actual budget. Forecasted alerts warn you days in advance when Azure predicts you will exceed your target.
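The percentage thresholds translate directly into dollar trigger points. A quick sketch of the arithmetic for the default $1,000 budget:

```python
def alert_amounts(monthly_budget, thresholds=(80, 100, 120)):
    """Dollar amounts at which each percentage threshold fires."""
    return {t: monthly_budget * t / 100 for t in thresholds}

print(alert_amounts(1000))  # {80: 800.0, 100: 1000.0, 120: 1200.0}
```

Note the 120% tier only fires after the money is already spent, which is why the architecture above wires it to an automation runbook rather than just email.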
Step 4: Configure Auto-Shutdown Schedules¶
Dev/test environments running 24/7 can waste 65-75% of their compute budget. Shut them down outside business hours.
az automation account create \
--name "csa-automation-${CSA_ENV}" \
--resource-group "$CSA_RG_DLZ" \
--location eastus
Create a runbook that terminates all running Databricks clusters:
# file: scripts/cost/Stop-DatabricksClusters.ps1
param([string]$ResourceGroupName = "csa-rg-dlz-dev", [string]$WorkspaceName = "csa-dbx-dev")
Connect-AzAccount -Identity
$workspace = Get-AzDatabricksWorkspace -ResourceGroupName $ResourceGroupName -Name $WorkspaceName
$token = (Get-AzAccessToken -ResourceUrl "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d").Token
$headers = @{ "Authorization" = "Bearer $token"; "Content-Type" = "application/json" }
$baseUrl = "https://$($workspace.Url)"
$clusters = (Invoke-RestMethod -Uri "$baseUrl/api/2.0/clusters/list" -Headers $headers).clusters
$running = $clusters | Where-Object { $_.state -eq "RUNNING" }
foreach ($c in $running) {
Write-Output "Terminating: $($c.cluster_name)"
Invoke-RestMethod -Uri "$baseUrl/api/2.0/clusters/delete" -Method POST -Headers $headers `
-Body (@{ cluster_id = $c.cluster_id } | ConvertTo-Json)
}
Schedule it for 7 PM Eastern on weekdays. Note that `--frequency Week --interval 1` recurs weekly from the start time; restrict the schedule to Monday through Friday afterward via Azure Automation's advanced schedule (weekday selection) settings:
az automation schedule create \
--automation-account-name "csa-automation-${CSA_ENV}" \
--resource-group "$CSA_RG_DLZ" \
--name "weekday-shutdown" \
--frequency "Week" --interval 1 \
--start-time "$(date -d 'tomorrow' +%Y-%m-%d)T19:00:00" \
--time-zone "America/New_York"
Warning
Confirm your timezone offset. A wrong timezone means clusters shut down during business hours. Test with a one-time run first.
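The headline 65-75% figure falls straight out of the hours arithmetic: an always-on cluster runs 168 hours a week, while a weekday-only schedule runs far fewer. A quick sketch:

```python
def shutdown_savings(hours_per_weekday, weekend_on=False):
    """Fraction of compute-hours saved versus running 24/7."""
    always_on = 24 * 7  # 168 hours per week
    scheduled = hours_per_weekday * 5 + (24 * 2 if weekend_on else 0)
    return 1 - scheduled / always_on

print(round(shutdown_savings(12), 2))  # 0.64 -- 7am-7pm weekdays
print(round(shutdown_savings(9), 2))   # 0.73 -- 9-hour weekdays
```

A 9-to-12-hour weekday window lands right in the quoted 65-75% band, before counting holidays.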
Step 5: Right-Size Compute Resources¶
Over-provisioned compute is the single largest source of waste.
az advisor recommendation list --category Cost \
--query "[].{Resource:resourceMetadata.resourceId, Savings:extendedProperties.savingsAmount, Action:shortDescription.solution}" \
--output table
Query Databricks cluster utilization to find over-provisioned clusters:
-- Run in Databricks SQL: clusters with low CPU over 14 days
SELECT cluster_name, node_type,
avg(cpu_utilization_percent) AS avg_cpu,
max(cpu_utilization_percent) AS peak_cpu
FROM system.compute.cluster_metrics
WHERE date >= current_date() - INTERVAL 14 DAYS
GROUP BY cluster_name, node_type
HAVING avg_cpu < 30
ORDER BY avg_cpu ASC;
| Avg CPU | Action |
|---|---|
| < 20% | Downsize by 2 VM sizes (e.g., DS5 to DS3) |
| 20-40% | Downsize by 1 VM size |
| 40-70% | Current size is appropriate |
| > 70% | Consider upsizing or enabling autoscale |
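If you want to classify the SQL results programmatically rather than eyeballing them, the table's decision rules are easy to encode. A sketch whose thresholds mirror the table above:

```python
def rightsize_action(avg_cpu):
    """Map 14-day average CPU % to the right-sizing action in the table."""
    if avg_cpu < 20:
        return "downsize 2 sizes"
    if avg_cpu < 40:
        return "downsize 1 size"
    if avg_cpu <= 70:
        return "keep current size"
    return "upsize or enable autoscale"

print(rightsize_action(15))  # downsize 2 sizes
print(rightsize_action(55))  # keep current size
```

Pair it with the `avg_cpu` column from the query to generate a per-cluster action list.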
Tip
For dev/test Synapse dedicated pools, pause the pool when not in use. A paused pool, even a small DW100c, accrues no compute charges (storage is still billed). Use az synapse sql pool pause to stop compute billing immediately.
Step 6: Reserved Instances Analysis¶
Reserved capacity provides 30-72% savings for steady-state workloads at the cost of a 1- or 3-year commitment.
az advisor recommendation list --category Cost \
--query "[?contains(shortDescription.solution, 'Reserved')].{Resource:resourceMetadata.resourceId, AnnualSavings:extendedProperties.annualSavingsAmount, Term:extendedProperties.term}" \
--output table
| Service | 1-Year Savings | 3-Year Savings | Breakeven |
|---|---|---|---|
| VMs (D-series) | 30-40% | 55-65% | 7-8 mo |
| SQL Database | 25-35% | 50-60% | 8-9 mo |
| Cosmos DB | 20-30% | 45-55% | 8-10 mo |
| Databricks DBU | 25-37% | 50-62% | 7-9 mo |
Warning
Reservations are non-refundable. Only reserve for workloads running at 60%+ utilization consistently. Review projected savings in the Azure Portal before committing.
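The breakeven column follows from simple arithmetic: an upfront reservation pays for itself once cumulative pay-as-you-go spend would have crossed its price. A sketch, assuming full upfront payment:

```python
def reservation_breakeven_months(monthly_on_demand, term_years, savings_pct):
    """Months until an upfront reservation beats pay-as-you-go.

    Upfront cost = full-term on-demand cost discounted by the savings
    percentage; breakeven is when month-by-month on-demand spend
    catches up to that amount.
    """
    upfront = monthly_on_demand * 12 * term_years * (1 - savings_pct / 100)
    return upfront / monthly_on_demand

# A 1-year VM reservation at 35% savings pays for itself in ~7.8 months
print(round(reservation_breakeven_months(1000, 1, 35), 1))  # 7.8
```

This matches the 7-8 month breakeven in the VM row; note the monthly rate cancels out, so breakeven depends only on term and discount.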
Step 7: Storage Lifecycle Policies¶
Lifecycle policies automatically move aging data to cheaper tiers. Save the following policy as deploy/bicep/cost/lifecycle-policy.json, the path referenced by the create command below:
{
"rules": [
{
"enabled": true,
"name": "bronze-tiering",
"type": "Lifecycle",
"definition": {
"actions": {
"baseBlob": {
"tierToCool": {
"daysAfterModificationGreaterThan": 30
},
"tierToArchive": {
"daysAfterModificationGreaterThan": 90
},
"delete": { "daysAfterModificationGreaterThan": 365 }
}
},
"filters": {
"blobTypes": ["blockBlob"],
"prefixMatch": ["bronze/"]
}
}
}
]
}
az storage account management-policy create \
--account-name "$STORAGE_ACCT" \
--resource-group "$CSA_RG_DLZ" \
--policy @deploy/bicep/cost/lifecycle-policy.json
Schedule Delta table maintenance to reclaim storage:
-- Databricks scheduled job: remove stale files and compact small files
VACUUM usda_bronze.raw_nass_quickstats RETAIN 168 HOURS;
OPTIMIZE usda_gold.fct_crop_production ZORDER BY (commodity_desc, year);
Tip
Hot-to-cool tiering saves roughly 50% on storage. Archive saves 90% but has retrieval latency (hours). Only archive data you rarely need.
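To sanity-check the tiering savings before enabling the policy, multiply your data volume by per-tier rates. The $/GB-month figures below are illustrative placeholders, not current Azure pricing; substitute your region's actual rates:

```python
# Placeholder $/GB-month rates -- NOT current Azure pricing;
# look up your region and redundancy option before relying on these.
TIER_PRICE = {"hot": 0.018, "cool": 0.010, "archive": 0.002}

def monthly_storage_cost(gb, tier):
    """Approximate monthly capacity cost for `gb` of data in a tier."""
    return gb * TIER_PRICE[tier]

hot = monthly_storage_cost(10_000, "hot")
cool = monthly_storage_cost(10_000, "cool")
print(f"hot=${hot:.0f}/mo cool=${cool:.0f}/mo savings={1 - cool / hot:.0%}")
```

With these placeholder rates, 10 TB moved hot-to-cool drops capacity cost by roughly 44%; real savings also depend on transaction and retrieval charges, which this sketch ignores.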
Step 8: Spot Instance Strategy¶
Spot VMs cost 60-90% less than on-demand but can be evicted with 30 seconds' notice. Use them for fault-tolerant batch workloads only.
Configure a Databricks job cluster with spot workers:
{
"cluster_name": "csa-batch-spot-cluster",
"spark_version": "14.3.x-scala2.12",
"node_type_id": "Standard_DS3_v2",
"num_workers": 4,
"azure_attributes": {
"first_on_demand": 1,
"availability": "SPOT_WITH_FALLBACK_AZURE",
"spot_bid_max_price": -1
},
"autotermination_minutes": 20
}
first_on_demand: 1 keeps the driver on a reliable on-demand node. SPOT_WITH_FALLBACK_AZURE falls back to on-demand when no spot capacity is available. For AKS ML training workloads, add a spot node pool with az aks nodepool add --priority Spot --eviction-policy Delete.
Warning
Never use spot instances for streaming jobs or interactive workloads. Evictions will interrupt your pipeline.
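The blended rate of a spot cluster is easy to estimate from the config above: the `first_on_demand` node pays the full rate while the workers pay the discounted spot rate. A sketch with hypothetical prices:

```python
def blended_hourly_cost(num_workers, on_demand_rate, spot_discount,
                        first_on_demand=1):
    """Hourly VM cost when `first_on_demand` nodes (typically the
    driver) stay on-demand and the workers run on spot capacity."""
    spot_rate = on_demand_rate * (1 - spot_discount)
    return first_on_demand * on_demand_rate + num_workers * spot_rate

# 4 spot workers at a hypothetical 70% discount, $0.50/h VMs,
# plus one on-demand driver
print(round(blended_hourly_cost(4, 0.50, 0.70), 2))  # 1.1
```

Versus $2.50/h all on-demand for the same five nodes, that is a 56% reduction even with the driver protected; the fallback setting trades some of this back whenever spot capacity is unavailable.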
Step 9: Build Cost Dashboards¶
Connect Power BI to Cost Management exports for real-time spend tracking and chargeback.
# Verify the daily export is running
az costmanagement export show \
--name "csa-daily-cost-export" \
--scope "subscriptions/$CSA_SUBSCRIPTION_ID" \
--query "{Name:name, Status:schedule.status}" --output table
Chargeback report by cost-center tag:
az consumption usage list \
--subscription "$CSA_SUBSCRIPTION_ID" \
--start-date "$(date -d '30 days ago' +%Y-%m-%d)" \
--end-date "$(date +%Y-%m-%d)" \
--query "[].{CostCenter:tags.\"cost-center\", Service:meterCategory, Cost:pretaxCost}" \
--output table
In Power BI Desktop: Get Data > Azure > Azure Cost Management, then build charts by service, resource group, and cost-center tag for chargeback.
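The chargeback roll-up itself is a one-pass aggregation over the usage records. A sketch assuming records shaped like the `az consumption usage list` fields used above (`tags`, `pretaxCost`):

```python
from collections import defaultdict

def chargeback(usage_records):
    """Roll up cost per cost-center tag; untagged spend is grouped
    under 'unallocated' so nothing silently disappears."""
    totals = defaultdict(float)
    for rec in usage_records:
        center = (rec.get("tags") or {}).get("cost-center", "unallocated")
        totals[center] += float(rec["pretaxCost"])
    return dict(totals)

records = [
    {"tags": {"cost-center": "data-engineering"}, "pretaxCost": "42.10"},
    {"tags": {"cost-center": "ml-ops"}, "pretaxCost": "10.00"},
    {"tags": None, "pretaxCost": "5.00"},
]
print(chargeback(records))
```

A large `unallocated` bucket is itself a useful signal: it quantifies how much of your spend the tagging policy from Step 2 has not yet captured.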
Step 10: Continuous Optimization¶
Cost optimization is not a one-time project. Establish a monthly review cycle.
echo "=== Monthly Cost Review ==="
# Budget status
az consumption budget list \
--query "[].{Budget:name, Limit:amount, Spent:currentSpend.amount}" --output table
# Advisor recommendations count
az advisor recommendation list --category Cost --query "length(@)"
# Untagged resources
az resource list \
--query "[?tags.environment == null].{Name:name, Type:type, RG:resourceGroup}" --output table
Track your Advisor score over time to confirm optimization efforts are working:
az advisor recommendation list --category Cost \
--query "[].{Impact:impact, Problem:shortDescription.problem, Savings:extendedProperties.savingsAmount}" \
--output table
Tip
Set a calendar reminder for the first Monday of each month to run this review.
Cost Optimization Checklist¶
- Enable auto-shutdown for all dev/test Databricks clusters
- Pause Synapse dedicated SQL pools when not in use
- Set auto-termination (30 min) on all interactive clusters
- Apply lifecycle policies to Bronze storage (hot to cool at 30 days)
- Tag all resources with `environment`, `cost-center`, `owner`, `project`
- Delete unattached managed disks and unused public IPs
- Right-size VMs with < 20% average CPU utilization
- Use spot instances for batch Databricks jobs
- Enable Databricks cluster pools to reduce startup cost
- Set budget alerts at 80%, 100%, and 120% thresholds
- Schedule Data Factory pipelines for off-peak hours only
- Use serverless Synapse SQL instead of dedicated pools for ad-hoc queries
- Evaluate reserved instances for workloads at 60%+ utilization
- Compact Delta tables with OPTIMIZE and ZORDER
- Enable Cost Management anomaly alerts
Typical Savings¶
| Optimization | Typical Savings | Effort |
|---|---|---|
| Auto-shutdown dev/test compute | 65-75% | Low (1 hour) |
| Storage lifecycle (hot to cool) | 40-50% | Low (1 hour) |
| Databricks spot instances (batch) | 60-90% | Medium |
| Right-size underutilized VMs | 20-40% | Medium |
| Reserved instances (1-year) | 30-40% | Low |
| Reserved instances (3-year) | 55-65% | Low |
| Pause Synapse dedicated pools | 100% when idle | Low |
| Delta VACUUM and OPTIMIZE | 10-30% storage | Low |
| Tag-based governance and cleanup | 10-20% | Medium |
| Continuous Advisor recommendations | 5-15% ongoing | Low |
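One caveat when reading this table: savings that apply to the same dollars compound on the remaining spend rather than adding up, so stacking a 70% and a 75% optimization does not yield 145%. A quick sketch:

```python
def stacked_savings(*fractions):
    """Combined savings when each optimization applies to the spend
    left over by the previous one: remaining cost multiplies."""
    remaining = 1.0
    for f in fractions:
        remaining *= 1 - f
    return 1 - remaining

# Auto-shutdown (70%) then spot pricing (75%) on the surviving hours
print(round(stacked_savings(0.70, 0.75), 3))  # 0.925
```

The overall 30-60% reduction quoted in the introduction reflects this compounding across a realistic subset of the table, not a straight sum of every row.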
Troubleshooting¶
| Symptom | Cause | Fix |
|---|---|---|
| Budget alerts not firing | Start date is in the future | Set start-date to the first of the current month |
| Cost exports show $0 | Export created recently | Wait 24 hours; trigger manually with az costmanagement export execute |
| `AuthorizationFailed` on budget create | Missing Cost Management Contributor role | Request the role from your subscription Owner |
| Auto-shutdown runbook fails silently | Managed identity not assigned | Enable system-assigned identity and grant Contributor on the DLZ resource group |
| Lifecycle policy not moving blobs | Policy evaluates once per day | Wait 24-48 hours; verify the policy targets the correct storage account |
| Advisor shows no recommendations | Resources are new | Advisor needs 7-14 days of usage data before generating recommendations |
| Spot instances evicted too frequently | Low spot capacity in region | Use a different VM size or region; enable SPOT_WITH_FALLBACK_AZURE |
Related¶
- Tutorial 01: Foundation Platform -- Deploy the base infrastructure this tutorial optimizes
- Tutorial 02: Governance & Compliance -- Policy enforcement and tagging standards
- Tutorial 10: Azure Monitor -- Monitoring and alerting for cost anomalies
- Azure Cost Management documentation
- Azure Advisor cost recommendations
- Databricks cluster best practices
- ADLS lifecycle management