# Cost Optimization Best Practices

## Overview
Cost optimization is a first-class architecture concern in CSA-in-a-Box. Every design decision — from cluster sizing to storage tiering — carries a direct cost implication. This guide provides concrete, actionable strategies to minimize spend without sacrificing performance or reliability.
!!! info "Related Guide"
    For operational cost management workflows (budgets, alerts, chargebacks), see the Cost Management Guide.
The cost optimization lifecycle is a continuous process, not a one-time exercise:
```mermaid
graph LR
    A[📋 Plan] --> B[📊 Monitor]
    B --> C[🔧 Optimize]
    C --> D[🤖 Automate]
    D --> A
    style A fill:#4CAF50,color:#fff
    style B fill:#2196F3,color:#fff
    style C fill:#FF9800,color:#fff
    style D fill:#9C27B0,color:#fff
```

| Phase | Activities |
|---|---|
| Plan | Right-size resources, choose pricing tiers, estimate monthly cost |
| Monitor | Track spend vs budget, identify anomalies, review utilization |
| Optimize | Eliminate waste, consolidate workloads, apply tiering |
| Automate | Auto-pause, lifecycle policies, scheduled scaling |
## Compute Cost Optimization

### Databricks
| Strategy | Savings Estimate | Complexity |
|---|---|---|
| Auto-termination (10 min) | 30–60% | Low |
| Spot instances for jobs | 60–80% | Medium |
| Jobs clusters over all-purpose | 40–70% | Low |
| Cluster policies | 20–40% | Medium |
| Photon acceleration | 20–50% | Low |
**Jobs clusters vs. all-purpose clusters:** Jobs clusters spin up for a specific job and terminate as soon as it finishes; all-purpose clusters stay running for interactive use. Always use jobs clusters for production pipelines.
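A minimal sketch of such a job submitted with the Databricks CLI; the job name, notebook path, and node counts are illustrative placeholders:

```bash
# Hypothetical job definition: runs on an ephemeral jobs cluster that uses
# Azure spot VMs with on-demand fallback and terminates when the run finishes.
databricks jobs create --json '{
  "name": "nightly-bronze-ingest",
  "tasks": [{
    "task_key": "ingest",
    "notebook_task": { "notebook_path": "/Pipelines/bronze_ingest" },
    "new_cluster": {
      "spark_version": "14.3.x-scala2.12",
      "node_type_id": "Standard_DS3_v2",
      "num_workers": 2,
      "azure_attributes": { "availability": "SPOT_WITH_FALLBACK_AZURE" }
    }
  }]
}'
```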
!!! example "Cluster Policy: Enforce Cost Controls"

```json
{
  "cluster_name": { "type": "fixed", "value": "job-cluster-{user}" },
  "autotermination_minutes": { "type": "range", "minValue": 10, "maxValue": 60, "defaultValue": 15 },
  "spark_conf.spark.databricks.cluster.profile": { "type": "fixed", "value": "serverless" },
  "node_type_id": { "type": "allowlist", "values": ["Standard_DS3_v2", "Standard_DS4_v2", "Standard_E4ds_v5"] },
  "num_workers": { "type": "range", "minValue": 1, "maxValue": 8, "defaultValue": 2 },
  "azure_attributes.availability": { "type": "fixed", "value": "SPOT_WITH_FALLBACK_AZURE" },
  "custom_tags.CostCenter": { "type": "fixed", "value": "analytics" }
}
```
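Policies are registered through the Cluster Policies API, where the definition travels as an escaped JSON string. A sketch with the Databricks CLI (the policy name is a placeholder, and the definition is truncated to a single rule for brevity):

```bash
# Hypothetical: register a cost-control policy; "definition" must be a
# stringified JSON document containing rules like those shown above.
databricks cluster-policies create --json '{
  "name": "cost-controlled-jobs",
  "definition": "{\"autotermination_minutes\": {\"type\": \"range\", \"minValue\": 10, \"maxValue\": 60, \"defaultValue\": 15}}"
}'
```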
### Synapse Analytics
| Pool Type | Pricing Model | Best For |
|---|---|---|
| Serverless SQL pool | Pay-per-query (TB) | Ad-hoc queries, exploration |
| Dedicated SQL pool | Reserved DWU/hour | Predictable, high-concurrency |
- Serverless SQL pools charge ~$5/TB scanned — ideal for infrequent or exploratory queries
- Dedicated SQL pools start at ~$1.20/hour (DW100c) — only cost-effective at sustained utilization >60%
- Pause dedicated pools during off-hours to eliminate idle charges (see the pause/resume sketch below)
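Pausing and resuming can be scripted rather than done by hand. A minimal sketch with the Azure CLI, assuming a hypothetical pool `dwpool` in workspace `csa-synapse`:

```bash
# Pause the dedicated SQL pool at the end of the business day...
az synapse sql pool pause \
  --name dwpool \
  --workspace-name csa-synapse \
  --resource-group rg-csa-prod

# ...and resume it the next morning (e.g., from a scheduled runbook or pipeline)
az synapse sql pool resume \
  --name dwpool \
  --workspace-name csa-synapse \
  --resource-group rg-csa-prod
```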
### Microsoft Fabric
Fabric uses Capacity Units (CUs). Sizing depends on workload:
| Environment | SKU | CUs | Cost (approx/month) | Use Case |
|---|---|---|---|---|
| Dev/Test | F2 | 2 | ~$260 | Individual developer work |
| Team Dev | F8 | 8 | ~$1,040 | Shared development |
| Staging | F32 | 32 | ~$4,160 | Integration testing |
| Production | F64 | 64 | ~$8,320 | Production workloads |
| Large Prod | F128 | 128 | ~$16,640 | High-concurrency production |
!!! example "Auto-Pause Fabric Capacity (Bicep)"

```bicep
resource fabricCapacity 'Microsoft.Fabric/capacities@2023-11-01' = {
name: 'csa-dev-fabric'
location: resourceGroup().location
sku: {
name: 'F2'
tier: 'Fabric'
}
properties: {}
}
// Auto-pause via Azure Automation runbook
resource automationAccount 'Microsoft.Automation/automationAccounts@2023-11-01' = {
name: 'csa-cost-automation'
location: resourceGroup().location
properties: {
sku: {
name: 'Basic'
}
}
}
resource pauseSchedule 'Microsoft.Automation/automationAccounts/schedules@2023-11-01' = {
parent: automationAccount
name: 'pause-fabric-evenings'
properties: {
frequency: 'Day'
interval: 1
startTime: '2024-01-01T18:00:00-05:00'
timeZone: 'Eastern Standard Time'
}
}
resource resumeSchedule 'Microsoft.Automation/automationAccounts/schedules@2023-11-01' = {
parent: automationAccount
name: 'resume-fabric-mornings'
properties: {
frequency: 'Day'
interval: 1
startTime: '2024-01-01T07:00:00-05:00'
timeZone: 'Eastern Standard Time'
}
}
```
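The schedules above only define when the runbook fires; the runbook itself still has to call the capacity's suspend and resume actions. A minimal sketch of those calls using `az rest` against the ARM endpoints (the subscription ID and resource group are placeholders; the API version matches the Bicep above):

```bash
# Suspend the Fabric capacity (evening schedule)
az rest --method post \
  --url "https://management.azure.com/subscriptions/<sub-id>/resourceGroups/rg-csa-dev/providers/Microsoft.Fabric/capacities/csa-dev-fabric/suspend?api-version=2023-11-01"

# Resume the Fabric capacity (morning schedule)
az rest --method post \
  --url "https://management.azure.com/subscriptions/<sub-id>/resourceGroups/rg-csa-dev/providers/Microsoft.Fabric/capacities/csa-dev-fabric/resume?api-version=2023-11-01"
```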
### Spark Right-Sizing
- **Adaptive Query Execution (AQE):** Enable `spark.sql.adaptive.enabled=true` to automatically coalesce shuffle partitions and optimize join strategies at runtime.
- **Photon acceleration:** Use Photon-enabled runtimes for ETL workloads — typically 2–5× faster, reducing DBU consumption proportionally.
- **Executor sizing:** Start with 4 cores / 16 GB per executor. Scale horizontally (more executors) rather than vertically (bigger executors) for better spot instance tolerance.
- **Auto-scaling:** Set min workers = 1, max workers = workload peak. Databricks auto-scaling adds/removes nodes based on pending tasks (a combined cluster-spec sketch follows this list).
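A sketch of a cluster spec combining these settings: Photon runtime, AQE enabled, 1–8 worker autoscaling, and short auto-termination. The cluster name and node type are illustrative:

```bash
# Hypothetical interactive cluster: Photon runtime, AQE on, 1-8 worker autoscale
databricks clusters create --json '{
  "cluster_name": "etl-dev",
  "spark_version": "14.3.x-photon-scala2.12",
  "node_type_id": "Standard_E4ds_v5",
  "autoscale": { "min_workers": 1, "max_workers": 8 },
  "autotermination_minutes": 15,
  "spark_conf": { "spark.sql.adaptive.enabled": "true" }
}'
```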
### Compute Do's and Don'ts
| ✅ Do | ❌ Don't | Cost Impact |
| ------------------------------------------------ | ------------------------------------------- | ----------- |
| Use jobs clusters for pipelines | Run all-purpose clusters 24/7 | Save 40–70% |
| Set auto-termination to 10–15 min | Leave default auto-termination at 120 min | Save 30–60% |
| Use spot instances for fault-tolerant jobs | Use on-demand for all workloads | Save 60–80% |
| Enforce cluster policies | Allow users to create unlimited clusters | Save 20–40% |
| Pause dedicated SQL pools on nights/weekends | Run dedicated pools 24/7 | Save 65% |
| Use serverless SQL for < 1 TB/day queries | Provision dedicated pool for ad-hoc queries | Save 70–90% |
| Pause Fabric dev capacity outside business hours | Run dev capacity 24/7 | Save 65% |
| Enable Photon for ETL workloads | Use standard runtime for heavy transforms | Save 20–50% |
---
## Storage Cost Optimization
### ADLS Gen2 Tiering
Azure Data Lake Storage Gen2 supports three access tiers with dramatically different pricing:
| Tier | Storage (per TB/month) | Read (per 10K ops) | Write (per 10K ops) | Min Retention | Best For |
| ----------- | ---------------------- | ------------------ | ------------------- | ------------- | --------------------------- |
| **Hot** | ~$20.80 | $0.0044 | $0.055 | None | Active data, frequent reads |
| **Cool** | ~$10.00 | $0.011 | $0.110 | 30 days | Infrequent access data |
| **Archive** | ~$1.80 | $5.50 | $0.110 | 180 days | Compliance, long-term |
!!! info "CSA-in-a-Box Retention Strategy"
| Medallion Layer | Hot Retention | Cool Retention | Archive Retention |
| --------------- | ------------- | -------------- | ----------------- |
| **Bronze** | 30 days | 90 days | 7 years |
| **Silver** | 90 days | 1 year | 3 years |
| **Gold** | 1 year | 2 years | As needed |
!!! example "Lifecycle Management Policy"
```json
{
"rules": [
{
"enabled": true,
"name": "bronze-tiering",
"type": "Lifecycle",
"definition": {
"filters": {
"blobTypes": ["blockBlob"],
"prefixMatch": ["bronze/"]
},
"actions": {
"baseBlob": {
"tierToCool": {
"daysAfterModificationGreaterThan": 30
},
"tierToArchive": {
"daysAfterModificationGreaterThan": 120
},
"delete": {
"daysAfterModificationGreaterThan": 2555
}
}
}
}
},
{
"enabled": true,
"name": "silver-tiering",
"type": "Lifecycle",
"definition": {
"filters": {
"blobTypes": ["blockBlob"],
"prefixMatch": ["silver/"]
},
"actions": {
"baseBlob": {
"tierToCool": {
"daysAfterModificationGreaterThan": 90
},
"tierToArchive": {
"daysAfterModificationGreaterThan": 365
}
}
}
}
}
]
}
```
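A sketch of applying the policy with the Azure CLI, assuming it is saved locally as `policy.json` (the account and resource-group names are placeholders):

```bash
# Apply (or overwrite) the lifecycle management policy on the storage account
az storage account management-policy create \
  --account-name csadatalake \
  --resource-group rg-csa-prod \
  --policy @policy.json
```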
### Delta Lake Optimization
Small files are the #1 hidden cost driver in lakehouse architectures. Each small file increases storage transactions and degrades query performance.
```sql
-- Remove old files (default retention: 7 days)
VACUUM bronze.raw_events RETAIN 168 HOURS;
-- Compact small files into larger ones
OPTIMIZE silver.customer_transactions
ZORDER BY (customer_id, transaction_date);
-- Check table statistics
DESCRIBE DETAIL silver.customer_transactions;
```

- Run `VACUUM` weekly on Bronze tables, daily on high-write Silver/Gold tables
- Run `OPTIMIZE` after large batch loads or when file count exceeds 10× the ideal
- Use `ZORDER` on columns used in `WHERE` and `JOIN` filters
- Compression: use Zstandard (`zstd`) for Parquet; it is 10–20% smaller than Snappy with comparable decompression speed
## Network & Egress Cost Management

### Egress Pricing
| Transfer Type | Cost per GB |
|---|---|
| Same region (intra-region) | Free |
| Cross-region (within Azure) | $0.02 |
| Azure → Internet | $0.087 |
| AWS → Azure (cross-cloud) | $0.09 |
| GCP → Azure (cross-cloud) | $0.12 |
!!! warning "Cross-cloud egress adds up fast"
    Transferring 10 TB/month from AWS to Azure costs ~$900/month in egress alone. For multi-cloud scenarios, see the Multi-Cloud Data Virtualization use case.
### Strategies to Minimize Egress
- Co-locate compute with storage. Always run Databricks/Synapse/Fabric in the same region as your ADLS storage account.
- Use Private Endpoints. Traffic over Private Link stays on the Azure backbone, incurring no egress charges (see the sketch after this list).
- Cache at edge. For repeated cross-region reads, use Azure CDN or a regional cache layer.
- Prefer data virtualization over data copying for cross-cloud queries — query in place rather than replicating.
- Batch transfers during off-peak. Azure ExpressRoute circuits have predictable pricing vs metered egress.
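A minimal sketch of creating a private endpoint to an ADLS Gen2 account's `dfs` endpoint; the VNet, subnet, and account names are hypothetical:

```bash
# Resolve the storage account's resource ID
STORAGE_ID=$(az storage account show \
  --name csadatalake \
  --resource-group rg-csa-prod \
  --query id --output tsv)

# Create a private endpoint so compute-to-storage traffic stays on the backbone
az network private-endpoint create \
  --name pe-csadatalake-dfs \
  --resource-group rg-csa-prod \
  --vnet-name vnet-csa \
  --subnet snet-private-endpoints \
  --private-connection-resource-id "$STORAGE_ID" \
  --group-id dfs \
  --connection-name csadatalake-dfs-conn
```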
## Fabric Capacity Planning

### CU Sizing Guide
Capacity Units (CUs) are consumed by all Fabric workloads (Lakehouse, Warehouse, Pipelines, Notebooks, Reports). Key concepts:
- **Smoothing:** Fabric averages CU consumption over a 5-minute window, preventing short spikes from throttling
- **Bursting:** Workloads can temporarily burst up to 2× the provisioned CUs, with the excess repaid over the next 24 hours (for example, an F64 that bursts to 128 CUs for one hour accrues 64 CU-hours to be smoothed out over the following day)
- **Throttling:** At sustained consumption >100% of capacity, background jobs are delayed; at >200%, interactive queries are throttled as well
### Auto-Scale Configuration

```json
{
  "capacityName": "csa-prod-fabric",
  "autoScale": {
    "enabled": true,
    "maxCUs": 128,
    "scaleUpThresholdPercent": 80,
    "scaleDownThresholdPercent": 30,
    "cooldownMinutes": 15
  }
}
```
### Pause Schedules for Dev/Test
| Environment | Active Hours | Days | Monthly Savings |
|---|---|---|---|
| Dev (F2) | 08:00–18:00 ET | Mon–Fri | ~65% |
| Staging (F32) | 06:00–22:00 ET | Mon–Fri | ~55% |
| Production | 24/7 | All | 0% (always on) |
### Monitoring Capacity Utilization
Track these metrics in the Fabric Capacity Metrics app:
- CU utilization % — target < 70% sustained
- Throttling events — should be zero in production
- Bursting duration — track to avoid sustained overuse
- Per-workload CU breakdown — identify cost-heavy workloads
## Azure Cost Management

### Budget Alerts
Set up tiered alerts to catch spend increases early:
!!! example "Azure CLI: Budget Alerts"

```bash
# Create a monthly budget with alerts at 50%, 75%, 90%, and a 100% forecast
az consumption budget create \
  --budget-name "csa-platform-monthly" \
  --amount 15000 \
  --time-grain Monthly \
  --start-date "2024-01-01" \
  --end-date "2025-12-31" \
  --resource-group "rg-csa-prod" \
  --category Cost \
  --notifications '{
    "alert-50": { "enabled": true, "operator": "GreaterThanOrEqualTo", "threshold": 50, "contactEmails": ["platform-team@agency.gov"], "thresholdType": "Actual" },
    "alert-75": { "enabled": true, "operator": "GreaterThanOrEqualTo", "threshold": 75, "contactEmails": ["platform-team@agency.gov"], "thresholdType": "Actual" },
    "alert-90": { "enabled": true, "operator": "GreaterThanOrEqualTo", "threshold": 90, "contactEmails": ["platform-team@agency.gov", "finance@agency.gov"], "thresholdType": "Actual" },
    "forecast-100": { "enabled": true, "operator": "GreaterThanOrEqualTo", "threshold": 100, "contactEmails": ["platform-team@agency.gov", "finance@agency.gov"], "thresholdType": "Forecasted" }
  }'
```
### Cost Allocation Tags
Apply consistent tags to every resource for chargeback and cost attribution:
| Tag Key | Example Values | Purpose |
|---|---|---|
| `Environment` | `dev`, `staging`, `prod` | Environment segmentation |
| `Domain` | `finance`, `hr`, `compliance` | Business domain ownership |
| `Team` | `data-eng`, `analytics`, `infra` | Team chargeback |
| `CostCenter` | `CC-1234` | Financial tracking |
| `Project` | `csa-inabox` | Project-level attribution |
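Tags can be applied or corrected in bulk with the Azure CLI. A sketch against a hypothetical storage account (`Merge` adds or updates the listed tags without touching others):

```bash
# Merge the standard cost-allocation tags onto an existing resource
RESOURCE_ID=$(az resource show \
  --resource-group rg-csa-prod \
  --name csadatalake \
  --resource-type "Microsoft.Storage/storageAccounts" \
  --query id --output tsv)

az tag update --resource-id "$RESOURCE_ID" --operation Merge \
  --tags Environment=prod Domain=finance Team=data-eng CostCenter=CC-1234 Project=csa-inabox
```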
### Azure Advisor Recommendations
Review Azure Advisor weekly for the following (a CLI query appears after the list):
- Right-sizing — VMs and SQL pools running under 30% utilization
- Reserved instances — workloads running 24/7 that would benefit from 1-year or 3-year reservations
- Shutdown recommendations — idle resources with zero utilization
- Storage optimization — blobs that could be moved to cooler tiers
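These reviews can be pulled from the CLI instead of the portal; a minimal sketch:

```bash
# List open cost recommendations for the current subscription
az advisor recommendation list --category Cost --output table
```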
### Reserved Instances
For predictable, always-on workloads, Reserved Instances provide significant savings:
| Resource | Pay-as-you-go | 1-Year RI | 3-Year RI | Savings (3Y) |
|---|---|---|---|---|
| Databricks (E4ds) | $0.313/hr | $0.198/hr | $0.130/hr | 58% |
| SQL Dedicated Pool | $1.20/hr | $0.78/hr | $0.52/hr | 57% |
| Fabric F64 | $11.39/hr | $7.90/hr | $5.70/hr | 50% |
## Cost Monitoring Dashboard

### Key Metrics to Track
| Metric | Target | Alert Threshold |
|---|---|---|
| Monthly spend vs budget | < 80% of budget | > 90% of budget |
| Daily spend delta | < 10% day-over-day | > 25% spike |
| Compute utilization | > 60% | < 30% (over-provisioned) |
| Storage growth rate | < 5% month-over-month | > 15% month-over-month |
| Idle resource count | 0 | > 0 |
| Reserved instance coverage | > 70% | < 50% |
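The daily-spend delta can be computed from raw consumption records; a sketch using `az consumption usage list` (the dates are placeholders, and `jq` is assumed to be installed):

```bash
# Total yesterday's pre-tax cost across the subscription
az consumption usage list \
  --start-date "2024-01-14" --end-date "2024-01-15" \
  --query "[].pretaxCost" --output json | jq 'map(tonumber) | add'
```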
### Anomaly Detection
Set up anomaly detection alerts in Azure Cost Management:
- Sudden spend spikes — a runaway pipeline or misconfigured auto-scale
- New high-cost resources — unplanned deployments
- Unusual egress patterns — potential data exfiltration or misconfigured networking
- Weekend/holiday spend increases — workloads that should be paused
## Cost Anti-Patterns
**Anti-Pattern: All-Purpose Clusters Running 24/7**

- **Problem:** Interactive clusters left running overnight and on weekends.
- **Cost:** A 4-node Standard_DS4_v2 cluster costs ~$2,000/month running 24/7 vs ~$600/month with business-hours-only auto-termination.
- **Fix:** Set auto-termination to 10–15 minutes. Use jobs clusters for all automated workloads.
**Anti-Pattern: Hot Storage for Archival Data**

- **Problem:** Keeping 5+ years of Bronze data in the Hot tier.
- **Cost:** 50 TB in Hot costs ~$1,040/month; the same data in Archive costs ~$90/month.
- **Fix:** Apply lifecycle management policies to move data to Cool after 30 days and Archive after 120 days.
**Anti-Pattern: Unoptimized Delta Tables (Millions of Small Files)**

- **Problem:** Streaming ingestion creates millions of small Parquet files. Every `LIST` and `GET` operation costs money and degrades performance.
- **Cost:** 1M small files can cost 10–50× more in transaction fees than 1,000 properly sized files.
- **Fix:** Schedule regular `OPTIMIZE` and `VACUUM` operations. Use auto-compaction (`spark.databricks.delta.autoCompact.enabled=true`).
**Anti-Pattern: No Auto-Termination on Dev Clusters**

- **Problem:** Developers spin up clusters and forget to terminate them.
- **Cost:** A forgotten 2-node cluster costs ~$30/day.
- **Fix:** Enforce auto-termination via cluster policies (max 15 minutes for dev clusters).
**Anti-Pattern: Over-Provisioned Dedicated SQL Pools**

- **Problem:** Running DW1000c for workloads that only need DW100c.
- **Cost:** DW1000c runs ~$12/hour vs ~$1.20/hour for DW100c, a 10× overspend.
- **Fix:** Start with the smallest pool, monitor query performance, and scale up only when queue times exceed SLA.
## Monthly Cost Estimation Table
Reference pricing for CSA-in-a-Box platform sizes (all prices approximate, USD):
| Component | Small (Dev/POC) | Medium (Dept) | Large (Enterprise) |
|---|---|---|---|
| Databricks (DBUs) | $500 | $3,000 | $15,000 |
| Fabric Capacity | $260 (F2) | $4,160 (F32) | $16,640 (F128) |
| ADLS Gen2 Storage | $50 (2 TB Hot) | $300 (20 TB mix) | $1,200 (100 TB mix) |
| Synapse Serverless | $25 | $200 | $1,000 |
| Networking/Egress | $10 | $100 | $500 |
| Key Vault + Monitoring | $50 | $150 | $500 |
| Azure DevOps / CI-CD | $0 (free tier) | $30 | $150 |
| Azure Cost Management | Free | Free | Free |
| Total (estimated) | ~$895/month | ~$7,940/month | ~$34,990/month |
!!! tip "Optimization Impact"
    Applying the strategies in this guide typically reduces spend by 30–50% from unoptimized baselines. With auto-pause, spot instances, and storage tiering applied, a medium platform can realistically operate at roughly $4,000/month (see the table below).
| Optimization Applied | Small Savings | Medium Savings | Large Savings |
|---|---|---|---|
| Auto-pause dev/staging compute | $100 | $1,500 | $5,000 |
| Spot instances for batch jobs | $150 | $1,200 | $6,000 |
| Storage lifecycle tiering | $15 | $100 | $500 |
| Right-sized Fabric capacity | $0 | $800 | $3,000 |
| Reserved instances (1-year) | $0 | $600 | $4,000 |
| Total optimized estimate | ~$630 | ~$3,740 | ~$16,490 |