Runbook: Cost Alert Response
Scope: Triage and resolution of Azure Cost Management alerts across all CSA-in-a-Box subscriptions. Covers budget threshold responses, cost anomaly investigation, optimization playbooks, and preventive controls for ongoing cost governance.
Before First Use: Customization Checklist
- Populate the Contact Information table.
- Confirm subscription and resource group naming conventions match your environment.
- Set up budget alerts at 50%, 80%, 100%, and 120% thresholds in Azure Cost Management.
- Confirm tag enforcement policy is active (required tags: `costCenter`, `environment`, `owner`).
- Verify Azure Advisor cost recommendations are reviewed monthly.
Symptoms
| Symptom | Where you see it | Severity |
|---|---|---|
| Budget alert triggered (80%+ threshold) | Email / action group notification | P2–P1 depending on tier |
| Unexpected resource creation | Azure Activity Log, Cost Management anomaly alert | P2 |
| Cost anomaly detected by Azure | Azure Cost Management anomaly alerts | P2 |
| Forecast exceeding budget by > 20% | Azure Cost Management forecast view | P3 |
| Orphaned resources accumulating charges | Azure Advisor, resource group review | P3 |
| Cross-region data transfer spike | Network cost breakdown in Cost Analysis | P3 |
Triage
Step 1: Check Azure Cost Management for anomaly details
- Open Azure Cost Management + Billing and navigate to Cost analysis.
- Set the time range to the current billing period and compare against the prior period.
- Check the Anomaly detection blade for auto-flagged spikes.
# Quick CLI check: current-month usage records per resource instance
# (instanceName is the resource name, not the resource group)
az consumption usage list \
  --start-date "$(date -u +%Y-%m-01)" \
  --end-date "$(date -u +%Y-%m-%d)" \
  --query "sort_by([].{resource:instanceName, cost:pretaxCost, currency:currency}, &cost)" \
  -o table
Step 2: Identify the cost driver
- Use Cost Analysis Group by to drill down by resource group, service name, region, and tag.
- Run the following KQL against your Log Analytics workspace to correlate resource creation with cost spikes:
AzureActivity
| where TimeGenerated > ago(7d)
| where OperationNameValue has "Microsoft.Resources/deployments/write"
or OperationNameValue has "Microsoft.Compute/virtualMachines/write"
or OperationNameValue has "Microsoft.Storage/storageAccounts/write"
| project TimeGenerated, Caller, ResourceGroup, OperationNameValue, ActivityStatusValue
| order by TimeGenerated desc
Step 3: Determine if cost is legitimate or waste
- Cross-reference newly created resources with active project tasks and deployment pipelines.
- Check if the cost driver maps to an approved change request or sprint story.
- If no owner can be identified, treat as potential waste and escalate.
Step 4: Check for orphaned resources
- Run Azure Advisor cost recommendations:
az advisor recommendation list --category Cost \
--query '[].{resource:resourceMetadata.resourceId, impact:impact, problem:shortDescription.problem}' \
-o table
- Look for unattached disks, idle VMs, empty App Service plans, and unused public IPs:
# Unattached managed disks
az disk list --query "[?diskState=='Unattached'].{name:name, rg:resourceGroup, sizeGb:diskSizeGb, sku:sku.name}" -o table
# VMs that are not running: stopped VMs still bill for compute, deallocated VMs still bill for disks
az vm list -d --query "[?powerState!='VM running'].{name:name, rg:resourceGroup, size:hardwareProfile.vmSize}" -o table
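The step above also names unused public IPs and empty App Service plans but gives no command for them. The following is a minimal sketch, assuming the `ipConfiguration`, `natGateway`, and `numberOfSites` properties appear in your CLI version's output. The function names are illustrative and not part of any Azure tool.

```shell
# Public IPs not bound to a NIC, load balancer frontend, or NAT gateway.
list_unused_public_ips() {
  az network public-ip list \
    --query '[?ipConfiguration==`null` && natGateway==`null`].{name:name, rg:resourceGroup, sku:sku.name}' \
    -o table
}

# App Service plans that host no apps but still bill for their instances.
list_empty_app_service_plans() {
  az appservice plan list \
    --query '[?numberOfSites==`0`].{name:name, rg:resourceGroup, sku:sku.name}' \
    -o table
}
```

Call `list_unused_public_ips` and `list_empty_app_service_plans` after `az account set` has selected the subscription under review. Treat the results as candidates for deletion and confirm ownership before removing anything.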
Response Actions by Alert Tier
80% Budget: Review and Forecast
Tip: At 80% you have time to optimize before hard limits. Focus on forecasting and identifying quick wins.
- Review the forecast. Will spend exceed budget at current run rate?
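# Forecast spend for a date range via the Cost Management Forecast REST API
# (the az consumption group has no forecast command; fill in the scope and dates)
az rest --method post \
  --url "https://management.azure.com/subscriptions/<subscription-id>/providers/Microsoft.CostManagement/forecast?api-version=2023-11-01" \
  --body '{"type":"ActualCost","timeframe":"Custom","timePeriod":{"from":"<start-date>","to":"<end-date>"},"dataset":{"granularity":"Daily","aggregation":{"totalCost":{"name":"Cost","function":"Sum"}}},"includeActualCost":true,"includeFreshPartialCost":false}' \
  -o json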
- Identify optimization opportunities. Run through the Common Cost Drivers table below.
- Right-size underutilized resources. Check Azure Advisor for right-sizing recommendations.
- Review Reserved Instance coverage. Are any expiring or underutilized?
- Document findings and share with the cost owner for awareness.
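The run-rate question in the first item above can be answered with simple arithmetic when no forecast API is to hand. This is a rough sketch that assumes spend accrues evenly across the month, which real workloads rarely do. The function names and sample figures are hypothetical.

```shell
# Linear run-rate projection of month-end spend.
# Args: spend_to_date days_elapsed days_in_month
project_month_end() {
  awk -v spend="$1" -v elapsed="$2" -v total="$3" \
    'BEGIN { printf "%.2f\n", spend / elapsed * total }'
}

# Projected spend expressed as a percentage of the budget.
# Args: projected_spend budget
percent_of_budget() {
  awk -v projected="$1" -v budget="$2" \
    'BEGIN { printf "%.1f\n", projected / budget * 100 }'
}

# Hypothetical figures: 8000 spent after 20 of 30 days, against a 10000 budget.
projected="$(project_month_end 8000 20 30)"
echo "Projected month-end spend: $projected"
echo "Percent of budget: $(percent_of_budget "$projected" 10000)"
```

With these sample figures the projection lands at 120% of budget, so an 80% alert on day 20 would already point toward the emergency tier by month end.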
100% Budget: Escalate and Reduce
Warning: Budget is fully consumed. Immediate action required to prevent overspend.
- Notify the cost owner (see Contact Information) with a cost breakdown by service and resource group.
- Implement immediate reductions:
- Shut down non-production environments outside business hours.
- Scale down dev/test clusters to minimum viable size.
- Pause non-critical batch jobs and data pipelines.
- Freeze non-essential deployments until spend is under control.
- Request budget increase if the overspend is due to legitimate, approved growth. Requires finance and management approval.
120% Budget: Emergency Cost Reduction
Danger: Significant overspend. Escalate to management immediately and take emergency measures.
- Escalate to management with a full cost impact analysis.
- Scale down non-production environments to the absolute minimum:
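# Pin non-prod AKS node pools to 1 node (the pool must already have the cluster autoscaler enabled;
# changing min/max counts requires --update-cluster-autoscaler)
az aks nodepool update \
  --resource-group <rg-dev> --cluster-name <aks-dev> \
  --name <nodepool> --update-cluster-autoscaler --min-count 1 --max-count 1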
# Deallocate non-prod VMs
az vm deallocate --resource-group <rg-dev> --name <vm-name>
- Enable auto-shutdown on all dev/test VMs immediately.
- Review and delete orphaned resources identified in triage Step 4.
- Suspend non-critical data pipelines (ADF triggers, Databricks scheduled jobs).
- Schedule a cost review meeting within 48 hours with all cost owners.
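The auto-shutdown item in the list above has no accompanying command. One possible approach is to loop over every VM in a non-production resource group with `az vm auto-shutdown`, which takes the shutdown time as `hhmm` in UTC. The function name and the resource group in the usage note are hypothetical.

```shell
# Enable daily auto-shutdown on every VM in one resource group.
# Args: resource_group shutdown_time (hhmm, UTC)
enable_auto_shutdown() {
  rg="$1"
  shutdown_time="$2"
  az vm list --resource-group "$rg" --query "[].name" -o tsv |
    while IFS= read -r vm; do
      # Skip blank lines in the name list.
      [ -n "$vm" ] || continue
      # Redirect stdin so the inner command cannot consume the VM list.
      az vm auto-shutdown --resource-group "$rg" --name "$vm" \
        --time "$shutdown_time" < /dev/null
    done
}
```

For example, `enable_auto_shutdown rg-dev 2300` schedules a 23:00 UTC shutdown for each VM in `rg-dev`. Convert your local shutdown time to UTC before passing it in.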
Common Cost Drivers
| Service | Common Cause | Fix |
|---|---|---|
| Compute (VMs, AKS) | Oversized VMs, idle clusters, no auto-scaling | Right-size via Advisor, enable cluster autoscaler, auto-shutdown dev/test |
| Storage | Forgotten snapshots, no lifecycle policy, uncompressed data | Apply lifecycle policies, delete stale snapshots, enable compression |
| Networking | Cross-region data transfer, unoptimized egress, idle load balancers | Use VNet peering, enable CDN for static content, remove idle LBs |
| AI/ML (OpenAI, Cognitive) | GPU idle time, prompt waste, no caching | Use spot instances, implement prompt caching, use Batch API |
| Databases (SQL, Cosmos) | Over-provisioned DTUs/RUs, idle replicas, no auto-scale | Enable serverless tier, auto-scale RUs, remove unused replicas |
| App Service | Empty or oversized plans, unused slots | Consolidate apps, scale down plans, delete unused deployment slots |
| Key Vault | High transaction volume from polling | Switch to event-driven refresh, increase cache TTL |
| Log Analytics | Excessive log ingestion, no data retention policy | Set retention policies, filter noisy logs, use Basic tier for low-value tables |
Optimization Playbook
Reserved Instances vs Pay-As-You-Go
- Analyze usage patterns over the last 30–90 days for stable workloads.
- If a resource runs > 60% of the time with predictable sizing, evaluate 1-year or 3-year reservations.
- Use the Azure Reservation recommendation engine:
az consumption reservation recommendation list \
--scope "Single" \
--resource-type "VirtualMachines" \
--look-back-period "Last30Days" \
-o table
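The 60% rule of thumb above can be checked against actual prices. A reservation bills for every hour at the reserved rate, while pay-as-you-go bills only for running hours, so the reservation wins once utilization exceeds the reserved rate divided by the pay-as-you-go rate. The sketch below uses hypothetical rates and an illustrative function name.

```shell
# Break-even utilization (percent of hours) above which a reservation
# costs less than pay-as-you-go.
# Args: payg_hourly_rate reserved_effective_hourly_rate
break_even_utilization() {
  awk -v payg="$1" -v reserved="$2" \
    'BEGIN { printf "%.0f\n", reserved / payg * 100 }'
}

# Hypothetical rates: 0.10/hour pay-as-you-go, 0.06/hour effective reserved rate.
echo "Break-even utilization: $(break_even_utilization 0.10 0.06)%"
```

With a 40% reservation discount the break-even point is 60% utilization, which matches the threshold in the list above. Deeper discounts lower the break-even point.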
Spot instance strategy
- Use spot VMs for fault-tolerant workloads: batch processing, CI/CD agents, dev/test environments.
- Set max price to the pay-as-you-go rate to avoid surprise charges.
- Implement eviction handling in application code.
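Eviction handling usually starts with polling the Azure Instance Metadata Service Scheduled Events endpoint from inside the VM and reacting to a `Preempt` event. The following is a minimal sketch of that polling loop. The drain step is left as a placeholder, and the five-second interval is an assumption rather than a recommendation.

```shell
# Succeeds (exit 0) when the Scheduled Events document on stdin
# contains a Preempt event, which signals a spot eviction.
has_preempt_event() {
  grep -Eq '"EventType"[[:space:]]*:[[:space:]]*"Preempt"'
}

# Poll the metadata endpoint until an eviction notice arrives.
poll_for_eviction() {
  while true; do
    if curl -s --max-time 5 -H "Metadata:true" \
      "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01" |
      has_preempt_event; then
      echo "Eviction notice received"
      # Placeholder: checkpoint state and drain in-flight work here.
      return 0
    fi
    sleep 5
  done
}
```

Run `poll_for_eviction` as a background process on the spot VM and hook your own checkpoint logic in at the placeholder. Splitting the detection into `has_preempt_event` keeps it testable without access to the metadata endpoint.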
Dev/test vs production pricing
- Ensure all non-production subscriptions use Dev/Test pricing (Enterprise Agreement benefit).
- Validate the subscription offer type.
- Move dev/test workloads to dev/test subscriptions if they are running on production pricing.
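The offer type check above has no command attached. One way to approximate it is to read the subscription's `quotaId` from the Azure Resource Manager subscriptions API and test whether it names a Dev/Test offer. This sketch assumes Dev/Test quota IDs contain the string `DevTest` (for example `MSDNDevTest_2014-09-01`), so verify that against your own subscriptions. The function names are illustrative.

```shell
# Fetch the quota ID that encodes a subscription's offer type.
# Args: subscription_id
subscription_quota_id() {
  az rest --method get \
    --url "https://management.azure.com/subscriptions/$1?api-version=2020-01-01" \
    --query "subscriptionPolicies.quotaId" -o tsv
}

# Succeeds (exit 0) when the quota ID looks like a Dev/Test offer.
# Assumption: Dev/Test quota IDs contain "DevTest".
is_dev_test_quota() {
  case "$1" in
    *DevTest*) return 0 ;;
    *) return 1 ;;
  esac
}
```

A typical check is `is_dev_test_quota "$(subscription_quota_id <subscription-id>)" || echo "not on Dev/Test pricing"`.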
Resource tagging for chargeback
Enforce the following tags on every resource for cost allocation:
| Tag | Purpose | Example |
|---|---|---|
| `costCenter` | Finance chargeback code | CC-1234 |
| `environment` | Deployment environment | dev, staging, prod |
| `owner` | Team or individual responsible | platform-team |
| `project` | Project or workload name | csa-inabox |
| `autoShutdown` | Eligible for scheduled shutdown | true / false |
- Enforce tags via Azure Policy (deny deployment if required tags are missing).
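Before switching the policy to deny, it can help to list resources that already lack the required tags. The sketch below builds a JMESPath filter for `az resource list` from a list of tag names. The helper name is hypothetical, and the filter only covers resource types that `az resource list` returns.

```shell
# Build a JMESPath query matching resources that are missing any given tag.
# Args: one or more tag names
missing_tag_filter() {
  filter=""
  for tag in "$@"; do
    if [ -n "$filter" ]; then
      filter="$filter || "
    fi
    # `null` in backticks is the JMESPath null literal.
    filter="${filter}tags.${tag} == "'`null`'
  done
  printf '[?%s].{name:name, rg:resourceGroup, type:type}\n' "$filter"
}
```

Pass the result straight to the CLI, for example `az resource list --query "$(missing_tag_filter costCenter environment owner)" -o table`.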
Azure Advisor recommendations
- Review Azure Advisor cost recommendations weekly, using the `az advisor recommendation list --category Cost` command from triage Step 4.
- Track recommendation adoption rate as a KPI in monthly cost reviews.
Preventive Controls
Budget alerts setup
- Configure budget alerts at multiple thresholds:
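# Example shows the 80% notification; repeat the entry for each threshold you alert on.
# Flag support varies by CLI version; confirm with: az consumption budget create --help
az consumption budget create \
--budget-name "csa-monthly-budget" \
--category cost \
--amount 10000 --time-grain Monthly \
--start-date "$(date -u +%Y-%m-01)" \
--end-date "$(date -u -d '+1 year' +%Y-%m-01)" \
--resource-group <rg> \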
--notifications '{
"Actual_GreaterThan_80_Percent": {
"enabled": true,
"operator": "GreaterThan",
"threshold": 80,
"contactEmails": ["platform-team@contoso.com"],
"contactRoles": ["Owner"]
}
}'
Policy-based resource restrictions
- Deny creation of expensive SKUs in non-production subscriptions:
{
"if": {
"allOf": [
{ "field": "type", "equals": "Microsoft.Compute/virtualMachines" },
{
"field": "Microsoft.Compute/virtualMachines/sku.name",
"notIn": ["Standard_B2s", "Standard_B2ms", "Standard_D2s_v5"]
},
{
"field": "[concat('tags[', 'environment', ']')]",
"in": ["dev", "test"]
}
]
},
"then": { "effect": "deny" }
}
- Restrict expensive regions unless explicitly approved.
- Limit the number of public IP addresses per subscription.
Tag enforcement
- Deploy the `Require a tag and its value on resources` built-in policy for `costCenter`, `environment`, and `owner`.
- Use the `Modify` effect to auto-apply default tags to resources that are missing them.
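Scheduled shutdown
- Apply auto-shutdown to all dev/test VMs (default 7:00 PM local time).
- Use Azure Automation runbooks or the Start/Stop VMs v2 solution for scheduled VM shutdown. Start/Stop VMs v2 manages VMs only, so for AKS clusters schedule `az aks stop` and `az aks start` from a runbook or pipeline.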
- Verify shutdown compliance weekly:
AzureActivity
| where TimeGenerated > ago(7d)
| where OperationNameValue =~ "Microsoft.Compute/virtualMachines/deallocate/action"
| summarize shutdownCount = count() by ResourceGroup, bin(TimeGenerated, 1d)
| order by TimeGenerated desc
Reporting
Monthly cost review template
Use the following agenda for monthly cost review meetings:
- Budget vs actual — current month and trailing 3-month trend.
- Top 5 cost drivers — by service, resource group, and tag.
- Anomalies — any flagged anomalies and their root cause.
- Optimization actions — completed and planned.
- Reservation coverage — utilization and upcoming expirations.
- Advisor score — number of open vs adopted recommendations.
- Forecast — projected spend for the next 30, 60, 90 days.
Chargeback dashboard
- Build an Azure Cost Management workbook grouped by the `costCenter` tag.
- Share the workbook with finance and department leads monthly.
- Include per-team trend lines and month-over-month delta.
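Trend analysis KQL
// Daily cost trend by resource group, last 30 days.
// Assumes cost data is exported to the workspace as a metric named "CostUSD".
// AzureMetrics carries no cost data by default, so adjust the table and
// metric name to match your export.
AzureMetrics
| where TimeGenerated > ago(30d)
| where MetricName == "CostUSD"
| summarize dailyCost = sum(Total) by bin(TimeGenerated, 1d), ResourceGroup
| render timechart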
Contact Information
Warning: Action required. Populate these before first production use.
| Role | Contact | Phone | Escalation |
|---|---|---|---|
| Cost Owner / FinOps Lead | (set via your org's finance team) | (office hours) | Budget exceeded events |
| Platform Team Lead | (set via your org's platform team) | (see PagerDuty / OpsGenie) | Resource optimization |
| Subscription Owner | (per-subscription — see governance RBAC) | (DL) | Policy exceptions |
| Azure Support | Case via Portal | N/A | Billing disputes, reservation issues |
Related Documentation
- OpenAI Throttling — AI/ML cost drivers and optimization
- Key Rotation — Credential lifecycle (cost-neutral but related governance)
- Tenant Onboarding — Budget setup for new tenants
- DR Drill — Cost implications of DR failover
- Data Pipeline Failure — Pipeline cost during failure/retry storms