Tutorial 14: Cost Optimization¶
Estimated Time: 3-4 hours | Difficulty: Intermediate
Reduce your Azure data platform spend by 30-60% without sacrificing performance. This tutorial walks through budget alerts, auto-shutdown schedules, right-sizing, reserved capacity analysis, storage lifecycle management, and continuous cost monitoring so your CSA-in-a-Box environment runs lean from day one.
Prerequisites¶
- Completed Tutorial 01: Foundation Platform with all resources deployed
- Azure subscription with Contributor role (Owner required for policy assignments)
- Azure CLI 2.50+ with Bicep CLI 0.22+
- Cost Management + Billing reader role at the subscription level
- (Optional) Power BI Desktop for Step 9
az version --output table
az provider show -n Microsoft.CostManagement --query "registrationState" -o tsv
Architecture Diagram¶
graph TB
subgraph CostMgmt["Azure Cost Management"]
VIEWS[Cost Views]
EXPORTS[Cost Exports]
BUDGETS[Budgets]
end
subgraph Alerts["Budget Alerts"]
T80["80% Threshold"]
T100["100% Threshold"]
T120["120% Threshold"]
end
subgraph Actions["Action Groups"]
EMAIL[Email Notifications]
TEAMS[Teams Webhook]
RUNBOOK[Automation Runbook]
end
subgraph Advisor["Azure Advisor"]
RECS[Recommendations]
IMPL[Implementation]
TRACK[Savings Tracking]
end
VIEWS --> BUDGETS
BUDGETS --> T80 & T100 & T120
T80 --> EMAIL
T100 --> EMAIL & TEAMS
T120 --> TEAMS & RUNBOOK
RECS --> IMPL --> TRACK --> VIEWS
style CostMgmt fill:#e6f0ff,stroke:#0078d4
style Alerts fill:#fff3e6,stroke:#ff8c00
style Actions fill:#f0e6ff,stroke:#5c2d91
style Advisor fill:#e6ffe6,stroke:#107c10
Step 1: Set Up Cost Management Fundamentals¶
Enable Cost Management and configure views that show where money is going.
az provider register --namespace Microsoft.CostManagement
az provider register --namespace Microsoft.Consumption
export CSA_PREFIX="csa"
export CSA_ENV="dev"
export CSA_SUBSCRIPTION_ID=$(az account show --query "id" -o tsv)
export CSA_RG_DLZ="${CSA_PREFIX}-rg-dlz-${CSA_ENV}"
Set up a recurring cost export to your storage account:
STORAGE_ACCT=$(az storage account list \
--resource-group "$CSA_RG_DLZ" --query "[0].name" -o tsv)
az costmanagement export create \
--name "csa-daily-cost-export" \
--scope "subscriptions/$CSA_SUBSCRIPTION_ID" \
--type "ActualCost" \
--timeframe "MonthToDate" \
--recurrence "Daily" \
--schedule-status "Active" \
--storage-account-id "/subscriptions/$CSA_SUBSCRIPTION_ID/resourceGroups/$CSA_RG_DLZ/providers/Microsoft.Storage/storageAccounts/$STORAGE_ACCT" \
--storage-container "cost-exports" \
--storage-directory "daily"
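Each export run drops a CSV of line-item costs into the container. Below is a minimal Python sketch of rolling such a file up by resource group; the `ResourceGroup` and `CostInBillingCurrency` column names are assumptions, so check the header row of your actual export:

```python
import csv
import io
from collections import defaultdict

def cost_by_resource_group(reader):
    """Sum cost per resource group from cost-export rows.

    `reader` yields dicts keyed by the export's column names
    (assumed here -- verify against your file's header).
    """
    totals = defaultdict(float)
    for row in reader:
        totals[row["ResourceGroup"]] += float(row["CostInBillingCurrency"])
    return dict(totals)

# Inline sample standing in for a downloaded export file
sample_csv = io.StringIO(
    "ResourceGroup,CostInBillingCurrency\n"
    "csa-rg-dlz-dev,12.50\n"
    "csa-rg-dlz-dev,7.25\n"
    "csa-rg-ml-dev,3.00\n"
)
totals = cost_by_resource_group(csv.DictReader(sample_csv))
print(totals)  # {'csa-rg-dlz-dev': 19.75, 'csa-rg-ml-dev': 3.0}
```

The same aggregation works unchanged on the real files once you download them from the `cost-exports` container.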
Step 2: Implement Resource Tagging¶
Tags are the foundation of cost accountability. Every resource should carry these four tags:
| Tag Key | Purpose | Example Values |
|---|---|---|
| `environment` | Deployment stage | `dev`, `staging`, `prod` |
| `cost-center` | Budget owner | `data-engineering`, `ml-ops` |
| `owner` | Team contact | `platform-team` |
| `project` | Workload identifier | `csa-inabox`, `usda-etl` |
Deploy a policy that audits untagged resources:
// file: deploy/bicep/cost/tag-policy.bicep
targetScope = 'subscription'
param requiredTags array = ['environment', 'cost-center', 'owner', 'project']
resource policyDefinition 'Microsoft.Authorization/policyDefinitions@2023-04-01' = [for tag in requiredTags: {
name: 'require-tag-${tag}'
properties: {
policyType: 'Custom'
mode: 'Indexed'
displayName: 'Require tag: ${tag}'
policyRule: {
if: { field: 'tags[\'${tag}\']', exists: 'false' }
then: { effect: 'audit' }
}
}
}]
az deployment sub create \
--location eastus \
--template-file deploy/bicep/cost/tag-policy.bicep \
--name "tag-policy-$(date +%Y%m%d%H%M)"
# Backfill tags on existing resource group
az tag update \
--resource-id "/subscriptions/$CSA_SUBSCRIPTION_ID/resourceGroups/$CSA_RG_DLZ" \
--operation merge \
--tags environment="$CSA_ENV" cost-center=data-engineering owner=platform-team project=csa-inabox
Warning
Use audit mode first to inventory untagged resources. Switch to deny only after cleanup, otherwise new deployments will fail.
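The audit policy surfaces non-compliant resources in the portal, but the same check is easy to script when triaging a long list. A small sketch of the tag-gap logic, using the four required keys from the table above:

```python
REQUIRED_TAGS = {"environment", "cost-center", "owner", "project"}

def missing_tags(resource_tags):
    """Return the required tag keys absent from a resource's tag dict.

    Handles the common case where an untagged resource reports
    its tags as None rather than an empty dict.
    """
    return sorted(REQUIRED_TAGS - set(resource_tags or {}))

print(missing_tags({"environment": "dev", "owner": "platform-team"}))
# ['cost-center', 'project']
```

Feed it the `tags` field from `az resource list --output json` to build a backfill worklist.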
Step 3: Create Budget Alerts¶
Deploy budgets with escalating thresholds at 80%, 100%, and 120%.
// file: deploy/bicep/cost/budgets.bicep
targetScope = 'subscription'
param monthlyBudget int = 1000
param alertEmails array = ['platform-team@contoso.com']
// utcNow() is only valid as a parameter default value, not inline in properties
param startDate string = utcNow('yyyy-MM-01')
resource budget 'Microsoft.Consumption/budgets@2023-05-01' = {
name: 'csa-monthly-budget'
properties: {
category: 'Cost'
amount: monthlyBudget
timeGrain: 'Monthly'
timePeriod: { startDate: startDate }
notifications: {
warning80: { enabled: true, operator: 'GreaterThanOrEqualTo', threshold: 80, contactEmails: alertEmails, thresholdType: 'Actual' }
critical100: { enabled: true, operator: 'GreaterThanOrEqualTo', threshold: 100, contactEmails: alertEmails, thresholdType: 'Actual' }
overrun120: { enabled: true, operator: 'GreaterThanOrEqualTo', threshold: 120, contactEmails: alertEmails, thresholdType: 'Actual' }
}
}
}
Resource group budget via CLI:
az consumption budget create \
--budget-name "csa-dlz-monthly-budget" \
--resource-group "$CSA_RG_DLZ" \
--amount 500 --time-grain Monthly \
--start-date "$(date +%Y-%m)-01" --end-date "2027-12-31" \
--category Cost
Tip
Add a forecasted budget alongside the actual budget. Forecasted alerts warn you days in advance when Azure predicts you will exceed your target.
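The percentage thresholds translate directly into dollar trigger points. A quick sketch of the arithmetic for the default $1,000 budget:

```python
def alert_amounts(monthly_budget, thresholds=(80, 100, 120)):
    """Dollar amounts at which each percentage threshold fires."""
    return {t: monthly_budget * t / 100 for t in thresholds}

print(alert_amounts(1000))  # {80: 800.0, 100: 1000.0, 120: 1200.0}
```

Note the 120% tier only fires after the money is already spent, which is why the architecture above wires it to an automation runbook rather than just email.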
Step 4: Configure Auto-Shutdown Schedules¶
Dev/test environments running 24/7 can waste 65-75% of their compute budget. Shut them down outside business hours.
az automation account create \
--name "csa-automation-${CSA_ENV}" \
--resource-group "$CSA_RG_DLZ" \
--location eastus
Create a runbook that terminates all running Databricks clusters:
# file: scripts/cost/Stop-DatabricksClusters.ps1
param([string]$ResourceGroupName = "csa-rg-dlz-dev", [string]$WorkspaceName = "csa-dbx-dev")
Connect-AzAccount -Identity
$workspace = Get-AzDatabricksWorkspace -ResourceGroupName $ResourceGroupName -Name $WorkspaceName
$token = (Get-AzAccessToken -ResourceUrl "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d").Token
$headers = @{ "Authorization" = "Bearer $token"; "Content-Type" = "application/json" }
$baseUrl = "https://$($workspace.Url)"
$clusters = (Invoke-RestMethod -Uri "$baseUrl/api/2.0/clusters/list" -Headers $headers).clusters
$running = $clusters | Where-Object { $_.state -eq "RUNNING" }
foreach ($c in $running) {
Write-Output "Terminating: $($c.cluster_name)"
Invoke-RestMethod -Uri "$baseUrl/api/2.0/clusters/delete" -Method POST -Headers $headers `
-Body (@{ cluster_id = $c.cluster_id } | ConvertTo-Json)
}
Schedule it for 7 PM Eastern on weekdays. Note that `--frequency Week --interval 1` recurs weekly from the start time; restrict the schedule to Monday through Friday afterward via Azure Automation's advanced schedule (weekday selection) settings:
az automation schedule create \
--automation-account-name "csa-automation-${CSA_ENV}" \
--resource-group "$CSA_RG_DLZ" \
--name "weekday-shutdown" \
--frequency "Week" --interval 1 \
--start-time "$(date -d 'tomorrow' +%Y-%m-%d)T19:00:00" \
--time-zone "America/New_York"
Warning
Confirm your timezone offset. A wrong timezone means clusters shut down during business hours. Test with a one-time run first.
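The headline 65-75% figure falls straight out of the hours arithmetic: an always-on cluster runs 168 hours a week, while a weekday-only schedule runs far fewer. A quick sketch:

```python
def shutdown_savings(hours_per_weekday, weekend_on=False):
    """Fraction of compute-hours saved versus running 24/7."""
    always_on = 24 * 7  # 168 hours per week
    scheduled = hours_per_weekday * 5 + (24 * 2 if weekend_on else 0)
    return 1 - scheduled / always_on

print(round(shutdown_savings(12), 2))  # 0.64 -- 7am-7pm weekdays
print(round(shutdown_savings(9), 2))   # 0.73 -- 9-hour weekdays
```

A 9-to-12-hour weekday window lands right in the quoted 65-75% band, before counting holidays.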
Step 5: Right-Size Compute Resources¶
Over-provisioned compute is the single largest source of waste.
az advisor recommendation list --category Cost \
--query "[].{Resource:resourceMetadata.resourceId, Savings:extendedProperties.savingsAmount, Action:shortDescription.solution}" \
--output table
Query Databricks cluster utilization to find over-provisioned clusters:
-- Run in Databricks SQL: clusters with low CPU over 14 days
SELECT cluster_name, node_type,
avg(cpu_utilization_percent) AS avg_cpu,
max(cpu_utilization_percent) AS peak_cpu
FROM system.compute.cluster_metrics
WHERE date >= current_date() - INTERVAL 14 DAYS
GROUP BY cluster_name, node_type
HAVING avg_cpu < 30
ORDER BY avg_cpu ASC;
| Avg CPU | Action |
|---|---|
| < 20% | Downsize by 2 VM sizes (e.g., DS5 to DS3) |
| 20-40% | Downsize by 1 VM size |
| 40-70% | Current size is appropriate |
| > 70% | Consider upsizing or enabling autoscale |
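If you want to classify the SQL results programmatically rather than eyeballing them, the table's decision rules are easy to encode. A sketch whose thresholds mirror the table above:

```python
def rightsize_action(avg_cpu):
    """Map 14-day average CPU % to the right-sizing action in the table."""
    if avg_cpu < 20:
        return "downsize 2 sizes"
    if avg_cpu < 40:
        return "downsize 1 size"
    if avg_cpu <= 70:
        return "keep current size"
    return "upsize or enable autoscale"

print(rightsize_action(15))  # downsize 2 sizes
print(rightsize_action(55))  # keep current size
```

Pair it with the `avg_cpu` column from the query to generate a per-cluster action list.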
Tip
For dev/test Synapse dedicated pools, pause the pool when not in use. A paused pool, even a small DW100c, accrues no compute charges (storage is still billed). Use az synapse sql pool pause to stop compute billing immediately.
Step 6: Reserved Instances Analysis¶
Reserved capacity provides 30-72% savings for steady-state workloads at the cost of a 1- or 3-year commitment.
az advisor recommendation list --category Cost \
--query "[?contains(shortDescription.solution, 'Reserved')].{Resource:resourceMetadata.resourceId, AnnualSavings:extendedProperties.annualSavingsAmount, Term:extendedProperties.term}" \
--output table
| Service | 1-Year Savings | 3-Year Savings | Breakeven |
|---|---|---|---|
| VMs (D-series) | 30-40% | 55-65% | 7-8 mo |
| SQL Database | 25-35% | 50-60% | 8-9 mo |
| Cosmos DB | 20-30% | 45-55% | 8-10 mo |
| Databricks DBU | 25-37% | 50-62% | 7-9 mo |
Warning
Reservations are non-refundable. Only reserve for workloads running at 60%+ utilization consistently. Review projected savings in the Azure Portal before committing.
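The breakeven column follows from simple arithmetic: an upfront reservation pays for itself once cumulative pay-as-you-go spend would have crossed its price. A sketch, assuming full upfront payment:

```python
def reservation_breakeven_months(monthly_on_demand, term_years, savings_pct):
    """Months until an upfront reservation beats pay-as-you-go.

    Upfront cost = full-term on-demand cost discounted by the savings
    percentage; breakeven is when month-by-month on-demand spend
    catches up to that amount.
    """
    upfront = monthly_on_demand * 12 * term_years * (1 - savings_pct / 100)
    return upfront / monthly_on_demand

# A 1-year VM reservation at 35% savings pays for itself in ~7.8 months
print(round(reservation_breakeven_months(1000, 1, 35), 1))  # 7.8
```

This matches the 7-8 month breakeven in the VM row; note the monthly rate cancels out, so breakeven depends only on term and discount.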
Step 7: Storage Lifecycle Policies¶
Lifecycle policies automatically move aging data to cheaper tiers. Save the following policy as deploy/bicep/cost/lifecycle-policy.json, the path referenced by the create command below:
{
"rules": [
{
"enabled": true,
"name": "bronze-tiering",
"type": "Lifecycle",
"definition": {
"actions": {
"baseBlob": {
"tierToCool": {
"daysAfterModificationGreaterThan": 30
},
"tierToArchive": {
"daysAfterModificationGreaterThan": 90
},
"delete": { "daysAfterModificationGreaterThan": 365 }
}
},
"filters": {
"blobTypes": ["blockBlob"],
"prefixMatch": ["bronze/"]
}
}
}
]
}
az storage account management-policy create \
--account-name "$STORAGE_ACCT" \
--resource-group "$CSA_RG_DLZ" \
--policy @deploy/bicep/cost/lifecycle-policy.json
Schedule Delta table maintenance to reclaim storage:
-- Databricks scheduled job: remove stale files and compact small files
VACUUM usda_bronze.raw_nass_quickstats RETAIN 168 HOURS;
OPTIMIZE usda_gold.fct_crop_production ZORDER BY (commodity_desc, year);
Tip
Hot-to-cool tiering saves roughly 50% on storage. Archive saves 90% but has retrieval latency (hours). Only archive data you rarely need.
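To sanity-check the tiering savings before enabling the policy, multiply your data volume by per-tier rates. The $/GB-month figures below are illustrative placeholders, not current Azure pricing; substitute your region's actual rates:

```python
# Placeholder $/GB-month rates -- NOT current Azure pricing;
# look up your region and redundancy option before relying on these.
TIER_PRICE = {"hot": 0.018, "cool": 0.010, "archive": 0.002}

def monthly_storage_cost(gb, tier):
    """Approximate monthly capacity cost for `gb` of data in a tier."""
    return gb * TIER_PRICE[tier]

hot = monthly_storage_cost(10_000, "hot")
cool = monthly_storage_cost(10_000, "cool")
print(f"hot=${hot:.0f}/mo cool=${cool:.0f}/mo savings={1 - cool / hot:.0%}")
```

With these placeholder rates, 10 TB moved hot-to-cool drops capacity cost by roughly 44%; real savings also depend on transaction and retrieval charges, which this sketch ignores.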
Step 8: Spot Instance Strategy¶
Spot VMs cost 60-90% less than on-demand but can be evicted with 30 seconds' notice. Use them for fault-tolerant batch workloads only.
Configure a Databricks job cluster with spot workers:
{
"cluster_name": "csa-batch-spot-cluster",
"spark_version": "14.3.x-scala2.12",
"node_type_id": "Standard_DS3_v2",
"num_workers": 4,
"azure_attributes": {
"first_on_demand": 1,
"availability": "SPOT_WITH_FALLBACK_AZURE",
"spot_bid_max_price": -1
},
"autotermination_minutes": 20
}
first_on_demand: 1 keeps the driver on a reliable on-demand node. SPOT_WITH_FALLBACK_AZURE falls back to on-demand when no spot capacity is available. For AKS ML training workloads, add a spot node pool with az aks nodepool add --priority Spot --eviction-policy Delete.
Warning
Never use spot instances for streaming jobs or interactive workloads. Evictions will interrupt your pipeline.
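The blended rate of a spot cluster is easy to estimate from the config above: the `first_on_demand` node pays the full rate while the workers pay the discounted spot rate. A sketch with hypothetical prices:

```python
def blended_hourly_cost(num_workers, on_demand_rate, spot_discount,
                        first_on_demand=1):
    """Hourly VM cost when `first_on_demand` nodes (typically the
    driver) stay on-demand and the workers run on spot capacity."""
    spot_rate = on_demand_rate * (1 - spot_discount)
    return first_on_demand * on_demand_rate + num_workers * spot_rate

# 4 spot workers at a hypothetical 70% discount, $0.50/h VMs,
# plus one on-demand driver
print(round(blended_hourly_cost(4, 0.50, 0.70), 2))  # 1.1
```

Versus $2.50/h all on-demand for the same five nodes, that is a 56% reduction even with the driver protected; the fallback setting trades some of this back whenever spot capacity is unavailable.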
Step 9: Build Cost Dashboards¶
Connect Power BI to Cost Management exports for real-time spend tracking and chargeback.
# Verify the daily export is running
az costmanagement export show \
--name "csa-daily-cost-export" \
--scope "subscriptions/$CSA_SUBSCRIPTION_ID" \
--query "{Name:name, Status:schedule.status}" --output table
Chargeback report by cost-center tag:
az consumption usage list \
--subscription "$CSA_SUBSCRIPTION_ID" \
--start-date "$(date -d '30 days ago' +%Y-%m-%d)" \
--end-date "$(date +%Y-%m-%d)" \
--query "[].{CostCenter:tags.\"cost-center\", Service:meterCategory, Cost:pretaxCost}" \
--output table
In Power BI Desktop: Get Data > Azure > Azure Cost Management, then build charts by service, resource group, and cost-center tag for chargeback.
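The chargeback roll-up itself is a one-pass aggregation over the usage records. A sketch assuming records shaped like the `az consumption usage list` fields used above (`tags`, `pretaxCost`):

```python
from collections import defaultdict

def chargeback(usage_records):
    """Roll up cost per cost-center tag; untagged spend is grouped
    under 'unallocated' so nothing silently disappears."""
    totals = defaultdict(float)
    for rec in usage_records:
        center = (rec.get("tags") or {}).get("cost-center", "unallocated")
        totals[center] += float(rec["pretaxCost"])
    return dict(totals)

records = [
    {"tags": {"cost-center": "data-engineering"}, "pretaxCost": "42.10"},
    {"tags": {"cost-center": "ml-ops"}, "pretaxCost": "10.00"},
    {"tags": None, "pretaxCost": "5.00"},
]
print(chargeback(records))
```

A large `unallocated` bucket is itself a useful signal: it quantifies how much of your spend the tagging policy from Step 2 has not yet captured.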
Step 10: Continuous Optimization¶
Cost optimization is not a one-time project. Establish a monthly review cycle.
echo "=== Monthly Cost Review ==="
# Budget status
az consumption budget list \
--query "[].{Budget:name, Limit:amount, Spent:currentSpend.amount}" --output table
# Advisor recommendations count
az advisor recommendation list --category Cost --query "length(@)"
# Untagged resources
az resource list \
--query "[?tags.environment == null].{Name:name, Type:type, RG:resourceGroup}" --output table
Track your Advisor score over time to confirm optimization efforts are working:
az advisor recommendation list --category Cost \
--query "[].{Impact:impact, Problem:shortDescription.problem, Savings:extendedProperties.savingsAmount}" \
--output table
Tip
Set a calendar reminder for the first Monday of each month to run this review.
Cost Optimization Checklist¶
- Enable auto-shutdown for all dev/test Databricks clusters
- Pause Synapse dedicated SQL pools when not in use
- Set auto-termination (30 min) on all interactive clusters
- Apply lifecycle policies to Bronze storage (hot to cool at 30 days)
- Tag all resources with `environment`, `cost-center`, `owner`, `project`
- Delete unattached managed disks and unused public IPs
- Right-size VMs with < 20% average CPU utilization
- Use spot instances for batch Databricks jobs
- Enable Databricks cluster pools to reduce startup cost
- Set budget alerts at 80%, 100%, and 120% thresholds
- Schedule Data Factory pipelines for off-peak hours only
- Use serverless Synapse SQL instead of dedicated pools for ad-hoc queries
- Evaluate reserved instances for workloads at 60%+ utilization
- Compact Delta tables with OPTIMIZE and ZORDER
- Enable Cost Management anomaly alerts
Typical Savings¶
| Optimization | Typical Savings | Effort |
|---|---|---|
| Auto-shutdown dev/test compute | 65-75% | Low (1 hour) |
| Storage lifecycle (hot to cool) | 40-50% | Low (1 hour) |
| Databricks spot instances (batch) | 60-90% | Medium |
| Right-size underutilized VMs | 20-40% | Medium |
| Reserved instances (1-year) | 30-40% | Low |
| Reserved instances (3-year) | 55-65% | Low |
| Pause Synapse dedicated pools | 100% when idle | Low |
| Delta VACUUM and OPTIMIZE | 10-30% storage | Low |
| Tag-based governance and cleanup | 10-20% | Medium |
| Continuous Advisor recommendations | 5-15% ongoing | Low |
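One caveat when reading this table: savings that apply to the same dollars compound on the remaining spend rather than adding up, so stacking a 70% and a 75% optimization does not yield 145%. A quick sketch:

```python
def stacked_savings(*fractions):
    """Combined savings when each optimization applies to the spend
    left over by the previous one: remaining cost multiplies."""
    remaining = 1.0
    for f in fractions:
        remaining *= 1 - f
    return 1 - remaining

# Auto-shutdown (70%) then spot pricing (75%) on the surviving hours
print(round(stacked_savings(0.70, 0.75), 3))  # 0.925
```

The overall 30-60% reduction quoted in the introduction reflects this compounding across a realistic subset of the table, not a straight sum of every row.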
Troubleshooting¶
| Symptom | Cause | Fix |
|---|---|---|
| Budget alerts not firing | Start date is in the future | Set start-date to the first of the current month |
| Cost exports show $0 | Export created recently | Wait 24 hours; trigger manually with az costmanagement export execute |
| `AuthorizationFailed` on budget create | Missing Cost Management Contributor role | Request the role from your subscription Owner |
| Auto-shutdown runbook fails silently | Managed identity not assigned | Enable system-assigned identity and grant Contributor on the DLZ resource group |
| Lifecycle policy not moving blobs | Policy evaluates once per day | Wait 24-48 hours; verify the policy targets the correct storage account |
| Advisor shows no recommendations | Resources are new | Advisor needs 7-14 days of usage data before generating recommendations |
| Spot instances evicted too frequently | Low spot capacity in region | Use a different VM size or region; enable SPOT_WITH_FALLBACK_AZURE |
Related¶
- Tutorial 01: Foundation Platform -- Deploy the base infrastructure this tutorial optimizes
- Tutorial 02: Governance & Compliance -- Policy enforcement and tagging standards
- Tutorial 10: Azure Monitor -- Monitoring and alerting for cost anomalies
- Azure Cost Management documentation
- Azure Advisor cost recommendations
- Databricks cluster best practices
- ADLS lifecycle management