# Cost Optimization Best Practices

## Overview
Cost optimization is a first-class architecture concern in CSA-in-a-Box. Every design decision — from cluster sizing to storage tiering — carries a direct cost implication. This guide provides concrete, actionable strategies to minimize spend without sacrificing performance or reliability.
!!! info "Related Guide"
    For operational cost management workflows (budgets, alerts, chargebacks), see the Cost Management Guide.
The cost optimization lifecycle is a continuous process, not a one-time exercise:
```mermaid
graph LR
    A[📋 Plan] --> B[📊 Monitor]
    B --> C[🔧 Optimize]
    C --> D[🤖 Automate]
    D --> A
    style A fill:#4CAF50,color:#fff
    style B fill:#2196F3,color:#fff
    style C fill:#FF9800,color:#fff
    style D fill:#9C27B0,color:#fff
```

| Phase | Activities |
|---|---|
| Plan | Right-size resources, choose pricing tiers, estimate monthly cost |
| Monitor | Track spend vs budget, identify anomalies, review utilization |
| Optimize | Eliminate waste, consolidate workloads, apply tiering |
| Automate | Auto-pause, lifecycle policies, scheduled scaling |
## Compute Cost Optimization

### Databricks
| Strategy | Savings Estimate | Complexity |
|---|---|---|
| Auto-termination (10 min) | 30–60% | Low |
| Spot instances for jobs | 60–80% | Medium |
| Jobs clusters over all-purpose | 40–70% | Low |
| Cluster policies | 20–40% | Medium |
| Photon acceleration | 20–50% | Low |
**Jobs clusters vs. all-purpose clusters:** Jobs clusters spin up for a specific job and terminate as soon as it finishes; all-purpose clusters stay running for interactive use. Always use jobs clusters for production pipelines.
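A minimal sketch of such a job submitted with the Databricks CLI; the job name, notebook path, and node counts are illustrative placeholders:

```bash
# Hypothetical job definition: runs on an ephemeral jobs cluster that uses
# Azure spot VMs with on-demand fallback and terminates when the run finishes.
databricks jobs create --json '{
  "name": "nightly-bronze-ingest",
  "tasks": [{
    "task_key": "ingest",
    "notebook_task": { "notebook_path": "/Pipelines/bronze_ingest" },
    "new_cluster": {
      "spark_version": "14.3.x-scala2.12",
      "node_type_id": "Standard_DS3_v2",
      "num_workers": 2,
      "azure_attributes": { "availability": "SPOT_WITH_FALLBACK_AZURE" }
    }
  }]
}'
```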
!!! example "Cluster Policy: Enforce Cost Controls"

```json
{
  "cluster_name": { "type": "fixed", "value": "job-cluster-{user}" },
  "autotermination_minutes": { "type": "range", "minValue": 10, "maxValue": 60, "defaultValue": 15 },
  "spark_conf.spark.databricks.cluster.profile": { "type": "fixed", "value": "serverless" },
  "node_type_id": { "type": "allowlist", "values": ["Standard_DS3_v2", "Standard_DS4_v2", "Standard_E4ds_v5"] },
  "num_workers": { "type": "range", "minValue": 1, "maxValue": 8, "defaultValue": 2 },
  "azure_attributes.availability": { "type": "fixed", "value": "SPOT_WITH_FALLBACK_AZURE" },
  "custom_tags.CostCenter": { "type": "fixed", "value": "analytics" }
}
```
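Policies are registered through the Cluster Policies API, where the definition travels as an escaped JSON string. A sketch with the Databricks CLI (the policy name is a placeholder, and the definition is truncated to a single rule for brevity):

```bash
# Hypothetical: register a cost-control policy; "definition" must be a
# stringified JSON document containing rules like those shown above.
databricks cluster-policies create --json '{
  "name": "cost-controlled-jobs",
  "definition": "{\"autotermination_minutes\": {\"type\": \"range\", \"minValue\": 10, \"maxValue\": 60, \"defaultValue\": 15}}"
}'
```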
### Synapse Analytics
| Pool Type | Pricing Model | Best For |
|---|---|---|
| Serverless SQL pool | Pay-per-query (TB) | Ad-hoc queries, exploration |
| Dedicated SQL pool | Reserved DWU/hour | Predictable, high-concurrency |
- Serverless SQL pools charge ~$5/TB scanned — ideal for infrequent or exploratory queries
- Dedicated SQL pools start at ~$1.20/hour (DW100c) — only cost-effective at sustained utilization >60%
- Pause dedicated pools during off-hours to eliminate idle charges (see the pause/resume sketch below)
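Pausing and resuming can be scripted rather than done by hand. A minimal sketch with the Azure CLI, assuming a hypothetical pool `dwpool` in workspace `csa-synapse`:

```bash
# Pause the dedicated SQL pool at the end of the business day...
az synapse sql pool pause \
  --name dwpool \
  --workspace-name csa-synapse \
  --resource-group rg-csa-prod

# ...and resume it the next morning (e.g., from a scheduled runbook or pipeline)
az synapse sql pool resume \
  --name dwpool \
  --workspace-name csa-synapse \
  --resource-group rg-csa-prod
```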
### Microsoft Fabric
Fabric uses Capacity Units (CUs). Sizing depends on workload:
| Environment | SKU | CUs | Cost (approx/month) | Use Case |
|---|---|---|---|---|
| Dev/Test | F2 | 2 | ~$260 | Individual developer work |
| Team Dev | F8 | 8 | ~$1,040 | Shared development |
| Staging | F32 | 32 | ~$4,160 | Integration testing |
| Production | F64 | 64 | ~$8,320 | Production workloads |
| Large Prod | F128 | 128 | ~$16,640 | High-concurrency production |
!!! example "Auto-Pause Fabric Capacity (Bicep)"

```bicep
resource fabricCapacity 'Microsoft.Fabric/capacities@2023-11-01' = {
name: 'csa-dev-fabric'
location: resourceGroup().location
sku: {
name: 'F2'
tier: 'Fabric'
}
properties: {}
}
// Auto-pause via Azure Automation runbook
resource automationAccount 'Microsoft.Automation/automationAccounts@2023-11-01' = {
name: 'csa-cost-automation'
location: resourceGroup().location
properties: {
sku: {
name: 'Basic'
}
}
}
resource pauseSchedule 'Microsoft.Automation/automationAccounts/schedules@2023-11-01' = {
parent: automationAccount
name: 'pause-fabric-evenings'
properties: {
frequency: 'Day'
interval: 1
startTime: '2024-01-01T18:00:00-05:00'
timeZone: 'Eastern Standard Time'
}
}
resource resumeSchedule 'Microsoft.Automation/automationAccounts/schedules@2023-11-01' = {
parent: automationAccount
name: 'resume-fabric-mornings'
properties: {
frequency: 'Day'
interval: 1
startTime: '2024-01-01T07:00:00-05:00'
timeZone: 'Eastern Standard Time'
}
}
```
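The schedules above only define when the runbook fires; the runbook itself still has to call the capacity's suspend and resume actions. A minimal sketch of those calls using `az rest` against the ARM endpoints (the subscription ID and resource group are placeholders; the API version matches the Bicep above):

```bash
# Suspend the Fabric capacity (evening schedule)
az rest --method post \
  --url "https://management.azure.com/subscriptions/<sub-id>/resourceGroups/rg-csa-dev/providers/Microsoft.Fabric/capacities/csa-dev-fabric/suspend?api-version=2023-11-01"

# Resume the Fabric capacity (morning schedule)
az rest --method post \
  --url "https://management.azure.com/subscriptions/<sub-id>/resourceGroups/rg-csa-dev/providers/Microsoft.Fabric/capacities/csa-dev-fabric/resume?api-version=2023-11-01"
```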
### Spark Right-Sizing
- **Adaptive Query Execution (AQE):** Enable `spark.sql.adaptive.enabled=true` to automatically coalesce shuffle partitions and optimize join strategies at runtime.
- **Photon acceleration:** Use Photon-enabled runtimes for ETL workloads — typically 2–5× faster, reducing DBU consumption proportionally.
- **Executor sizing:** Start with 4 cores / 16 GB per executor. Scale horizontally (more executors) rather than vertically (bigger executors) for better spot instance tolerance.
- **Auto-scaling:** Set min workers = 1, max workers = workload peak. Databricks auto-scaling adds/removes nodes based on pending tasks (a combined cluster-spec sketch follows this list).
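A sketch of a cluster spec combining these settings: Photon runtime, AQE enabled, 1–8 worker autoscaling, and short auto-termination. The cluster name and node type are illustrative:

```bash
# Hypothetical interactive cluster: Photon runtime, AQE on, 1-8 worker autoscale
databricks clusters create --json '{
  "cluster_name": "etl-dev",
  "spark_version": "14.3.x-photon-scala2.12",
  "node_type_id": "Standard_E4ds_v5",
  "autoscale": { "min_workers": 1, "max_workers": 8 },
  "autotermination_minutes": 15,
  "spark_conf": { "spark.sql.adaptive.enabled": "true" }
}'
```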
### Compute Do's and Don'ts
| ✅ Do | ❌ Don't | Cost Impact |
| ------------------------------------------------ | ------------------------------------------- | ----------- |
| Use jobs clusters for pipelines | Run all-purpose clusters 24/7 | Save 40–70% |
| Set auto-termination to 10–15 min | Leave default auto-termination at 120 min | Save 30–60% |
| Use spot instances for fault-tolerant jobs | Use on-demand for all workloads | Save 60–80% |
| Enforce cluster policies | Allow users to create unlimited clusters | Save 20–40% |
| Pause dedicated SQL pools on nights/weekends | Run dedicated pools 24/7 | Save 65% |
| Use serverless SQL for < 1 TB/day queries | Provision dedicated pool for ad-hoc queries | Save 70–90% |
| Pause Fabric dev capacity outside business hours | Run dev capacity 24/7 | Save 65% |
| Enable Photon for ETL workloads | Use standard runtime for heavy transforms | Save 20–50% |
---
## Storage Cost Optimization
### ADLS Gen2 Tiering
Azure Data Lake Storage Gen2 supports three access tiers with dramatically different pricing:
| Tier | Storage (per TB/month) | Read (per 10K ops) | Write (per 10K ops) | Min Retention | Best For |
| ----------- | ---------------------- | ------------------ | ------------------- | ------------- | --------------------------- |
| **Hot** | ~$20.80 | $0.0044 | $0.055 | None | Active data, frequent reads |
| **Cool** | ~$10.00 | $0.011 | $0.110 | 30 days | Infrequent access data |
| **Archive** | ~$1.80 | $5.50 | $0.110 | 180 days | Compliance, long-term |
!!! info "CSA-in-a-Box Retention Strategy"
| Medallion Layer | Hot Retention | Cool Retention | Archive Retention |
| --------------- | ------------- | -------------- | ----------------- |
| **Bronze** | 30 days | 90 days | 7 years |
| **Silver** | 90 days | 1 year | 3 years |
| **Gold** | 1 year | 2 years | As needed |
!!! example "Lifecycle Management Policy"
```json
{
"rules": [
{
"enabled": true,
"name": "bronze-tiering",
"type": "Lifecycle",
"definition": {
"filters": {
"blobTypes": ["blockBlob"],
"prefixMatch": ["bronze/"]
},
"actions": {
"baseBlob": {
"tierToCool": {
"daysAfterModificationGreaterThan": 30
},
"tierToArchive": {
"daysAfterModificationGreaterThan": 120
},
"delete": {
"daysAfterModificationGreaterThan": 2555
}
}
}
}
},
{
"enabled": true,
"name": "silver-tiering",
"type": "Lifecycle",
"definition": {
"filters": {
"blobTypes": ["blockBlob"],
"prefixMatch": ["silver/"]
},
"actions": {
"baseBlob": {
"tierToCool": {
"daysAfterModificationGreaterThan": 90
},
"tierToArchive": {
"daysAfterModificationGreaterThan": 365
}
}
}
}
}
]
}
```
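A sketch of applying the policy with the Azure CLI, assuming it is saved locally as `policy.json` (the account and resource-group names are placeholders):

```bash
# Apply (or overwrite) the lifecycle management policy on the storage account
az storage account management-policy create \
  --account-name csadatalake \
  --resource-group rg-csa-prod \
  --policy @policy.json
```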
### Delta Lake Optimization
Small files are the #1 hidden cost driver in lakehouse architectures. Each small file increases storage transactions and degrades query performance.
```sql
-- Remove old files (default retention: 7 days)
VACUUM bronze.raw_events RETAIN 168 HOURS;
-- Compact small files into larger ones
OPTIMIZE silver.customer_transactions
ZORDER BY (customer_id, transaction_date);
-- Check table statistics
DESCRIBE DETAIL silver.customer_transactions;
```

- Run `VACUUM` weekly on Bronze tables, daily on high-write Silver/Gold tables
- Run `OPTIMIZE` after large batch loads or when file count exceeds 10× the ideal
- Use `ZORDER` on columns used in `WHERE` and `JOIN` filters
- Compression: use Zstandard (`zstd`) for Parquet; it is 10–20% smaller than Snappy with comparable decompression speed
## Network & Egress Cost Management

### Egress Pricing
| Transfer Type | Cost per GB |
|---|---|
| Same region (intra-region) | Free |
| Cross-region (within Azure) | $0.02 |
| Azure → Internet | $0.087 |
| AWS → Azure (cross-cloud) | $0.09 |
| GCP → Azure (cross-cloud) | $0.12 |
!!! warning "Cross-cloud egress adds up fast"
    Transferring 10 TB/month from AWS to Azure costs ~$900/month in egress alone. For multi-cloud scenarios, see the Multi-Cloud Data Virtualization use case.
### Strategies to Minimize Egress
- Co-locate compute with storage. Always run Databricks/Synapse/Fabric in the same region as your ADLS storage account.
- Use Private Endpoints. Traffic over Private Link stays on the Azure backbone, incurring no egress charges (see the sketch after this list).
- Cache at edge. For repeated cross-region reads, use Azure CDN or a regional cache layer.
- Prefer data virtualization over data copying for cross-cloud queries — query in place rather than replicating.
- Batch transfers during off-peak. Azure ExpressRoute circuits have predictable pricing vs metered egress.
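A minimal sketch of creating a private endpoint to an ADLS Gen2 account's `dfs` endpoint; the VNet, subnet, and account names are hypothetical:

```bash
# Resolve the storage account's resource ID
STORAGE_ID=$(az storage account show \
  --name csadatalake \
  --resource-group rg-csa-prod \
  --query id --output tsv)

# Create a private endpoint so compute-to-storage traffic stays on the backbone
az network private-endpoint create \
  --name pe-csadatalake-dfs \
  --resource-group rg-csa-prod \
  --vnet-name vnet-csa \
  --subnet snet-private-endpoints \
  --private-connection-resource-id "$STORAGE_ID" \
  --group-id dfs \
  --connection-name csadatalake-dfs-conn
```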
## Fabric Capacity Planning

### CU Sizing Guide
Capacity Units (CUs) are consumed by all Fabric workloads (Lakehouse, Warehouse, Pipelines, Notebooks, Reports). Key concepts:
- **Smoothing:** Fabric averages CU consumption over a 5-minute window, preventing short spikes from throttling
- **Bursting:** Workloads can temporarily burst up to 2× the provisioned CUs, with the excess repaid over the next 24 hours (for example, an F64 that bursts to 128 CUs for one hour accrues 64 CU-hours to be smoothed out over the following day)
- **Throttling:** At sustained consumption >100% of capacity, background jobs are delayed; at >200%, interactive queries are throttled as well
### Auto-Scale Configuration

```json
{
  "capacityName": "csa-prod-fabric",
  "autoScale": {
    "enabled": true,
    "maxCUs": 128,
    "scaleUpThresholdPercent": 80,
    "scaleDownThresholdPercent": 30,
    "cooldownMinutes": 15
  }
}
```
### Pause Schedules for Dev/Test
| Environment | Active Hours | Days | Monthly Savings |
|---|---|---|---|
| Dev (F2) | 08:00–18:00 ET | Mon–Fri | ~65% |
| Staging (F32) | 06:00–22:00 ET | Mon–Fri | ~55% |
| Production | 24/7 | All | 0% (always on) |
### Monitoring Capacity Utilization
Track these metrics in the Fabric Capacity Metrics app:
- CU utilization % — target < 70% sustained
- Throttling events — should be zero in production
- Bursting duration — track to avoid sustained overuse
- Per-workload CU breakdown — identify cost-heavy workloads
## Azure Cost Management

### Budget Alerts
Set up tiered alerts to catch spend increases early:
!!! example "Azure CLI: Budget Alerts"

```bash
# Create a monthly budget with alerts at 50%, 75%, 90%, and a 100% forecast
az consumption budget create \
  --budget-name "csa-platform-monthly" \
  --amount 15000 \
  --time-grain Monthly \
  --start-date "2024-01-01" \
  --end-date "2025-12-31" \
  --resource-group "rg-csa-prod" \
  --category Cost \
  --notifications '{
    "alert-50": { "enabled": true, "operator": "GreaterThanOrEqualTo", "threshold": 50, "contactEmails": ["platform-team@agency.gov"], "thresholdType": "Actual" },
    "alert-75": { "enabled": true, "operator": "GreaterThanOrEqualTo", "threshold": 75, "contactEmails": ["platform-team@agency.gov"], "thresholdType": "Actual" },
    "alert-90": { "enabled": true, "operator": "GreaterThanOrEqualTo", "threshold": 90, "contactEmails": ["platform-team@agency.gov", "finance@agency.gov"], "thresholdType": "Actual" },
    "forecast-100": { "enabled": true, "operator": "GreaterThanOrEqualTo", "threshold": 100, "contactEmails": ["platform-team@agency.gov", "finance@agency.gov"], "thresholdType": "Forecasted" }
  }'
```
### Cost Allocation Tags
Apply consistent tags to every resource for chargeback and cost attribution:
| Tag Key | Example Values | Purpose |
|---|---|---|
| `Environment` | `dev`, `staging`, `prod` | Environment segmentation |
| `Domain` | `finance`, `hr`, `compliance` | Business domain ownership |
| `Team` | `data-eng`, `analytics`, `infra` | Team chargeback |
| `CostCenter` | `CC-1234` | Financial tracking |
| `Project` | `csa-inabox` | Project-level attribution |
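Tags can be applied or corrected in bulk with the Azure CLI. A sketch against a hypothetical storage account (`Merge` adds or updates the listed tags without touching others):

```bash
# Merge the standard cost-allocation tags onto an existing resource
RESOURCE_ID=$(az resource show \
  --resource-group rg-csa-prod \
  --name csadatalake \
  --resource-type "Microsoft.Storage/storageAccounts" \
  --query id --output tsv)

az tag update --resource-id "$RESOURCE_ID" --operation Merge \
  --tags Environment=prod Domain=finance Team=data-eng CostCenter=CC-1234 Project=csa-inabox
```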
### Azure Advisor Recommendations
Review Azure Advisor weekly for the following (a CLI query appears after the list):
- Right-sizing — VMs and SQL pools running under 30% utilization
- Reserved instances — workloads running 24/7 that would benefit from 1-year or 3-year reservations
- Shutdown recommendations — idle resources with zero utilization
- Storage optimization — blobs that could be moved to cooler tiers
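These reviews can be pulled from the CLI instead of the portal; a minimal sketch:

```bash
# List open cost recommendations for the current subscription
az advisor recommendation list --category Cost --output table
```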
### Reserved Instances
For predictable, always-on workloads, Reserved Instances provide significant savings:
| Resource | Pay-as-you-go | 1-Year RI | 3-Year RI | Savings (3Y) |
|---|---|---|---|---|
| Databricks (E4ds) | $0.313/hr | $0.198/hr | $0.130/hr | 58% |
| SQL Dedicated Pool | $1.20/hr | $0.78/hr | $0.52/hr | 57% |
| Fabric F64 | $11.39/hr | $7.90/hr | $5.70/hr | 50% |
## Cost Monitoring Dashboard

### Key Metrics to Track
| Metric | Target | Alert Threshold |
|---|---|---|
| Monthly spend vs budget | < 80% of budget | > 90% of budget |
| Daily spend delta | < 10% day-over-day | > 25% spike |
| Compute utilization | > 60% | < 30% (over-provisioned) |
| Storage growth rate | < 5% month-over-month | > 15% month-over-month |
| Idle resource count | 0 | > 0 |
| Reserved instance coverage | > 70% | < 50% |
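The daily-spend delta can be computed from raw consumption records; a sketch using `az consumption usage list` (the dates are placeholders, and `jq` is assumed to be installed):

```bash
# Total yesterday's pre-tax cost across the subscription
az consumption usage list \
  --start-date "2024-01-14" --end-date "2024-01-15" \
  --query "[].pretaxCost" --output json | jq 'map(tonumber) | add'
```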
### Anomaly Detection
Set up anomaly detection alerts in Azure Cost Management:
- Sudden spend spikes — a runaway pipeline or misconfigured auto-scale
- New high-cost resources — unplanned deployments
- Unusual egress patterns — potential data exfiltration or misconfigured networking
- Weekend/holiday spend increases — workloads that should be paused
## Cost Anti-Patterns
**Anti-Pattern: All-Purpose Clusters Running 24/7**

- **Problem:** Interactive clusters left running overnight and on weekends.
- **Cost:** A 4-node Standard_DS4_v2 cluster costs ~$2,000/month running 24/7 vs ~$600/month with business-hours-only auto-termination.
- **Fix:** Set auto-termination to 10–15 minutes. Use jobs clusters for all automated workloads.
**Anti-Pattern: Hot Storage for Archival Data**

- **Problem:** Keeping 5+ years of Bronze data in the Hot tier.
- **Cost:** 50 TB in Hot costs ~$1,040/month; the same data in Archive costs ~$90/month.
- **Fix:** Apply lifecycle management policies to move data to Cool after 30 days and Archive after 120 days.
**Anti-Pattern: Unoptimized Delta Tables (Millions of Small Files)**

- **Problem:** Streaming ingestion creates millions of small Parquet files. Every `LIST` and `GET` operation costs money and degrades performance.
- **Cost:** 1M small files can cost 10–50× more in transaction fees than 1,000 properly sized files.
- **Fix:** Schedule regular `OPTIMIZE` and `VACUUM` operations. Use auto-compaction (`spark.databricks.delta.autoCompact.enabled=true`).
**Anti-Pattern: No Auto-Termination on Dev Clusters**

- **Problem:** Developers spin up clusters and forget to terminate them.
- **Cost:** A forgotten 2-node cluster costs ~$30/day.
- **Fix:** Enforce auto-termination via cluster policies (max 15 minutes for dev clusters).
**Anti-Pattern: Over-Provisioned Dedicated SQL Pools**

- **Problem:** Running DW1000c for workloads that only need DW100c.
- **Cost:** DW1000c runs ~$12/hour vs ~$1.20/hour for DW100c, a 10× overspend.
- **Fix:** Start with the smallest pool, monitor query performance, and scale up only when queue times exceed SLA.
## Monthly Cost Estimation Table
Reference pricing for CSA-in-a-Box platform sizes (all prices approximate, USD):
| Component | Small (Dev/POC) | Medium (Dept) | Large (Enterprise) |
|---|---|---|---|
| Databricks (DBUs) | $500 | $3,000 | $15,000 |
| Fabric Capacity | $260 (F2) | $4,160 (F32) | $16,640 (F128) |
| ADLS Gen2 Storage | $50 (2 TB Hot) | $300 (20 TB mix) | $1,200 (100 TB mix) |
| Synapse Serverless | $25 | $200 | $1,000 |
| Networking/Egress | $10 | $100 | $500 |
| Key Vault + Monitoring | $50 | $150 | $500 |
| Azure DevOps / CI-CD | $0 (free tier) | $30 | $150 |
| Azure Cost Management | Free | Free | Free |
| Total (estimated) | ~$895/month | ~$7,940/month | ~$34,990/month |
!!! tip "Optimization Impact"
    Applying the strategies in this guide typically reduces spend by 30–50% from unoptimized baselines. With auto-pause, spot instances, and storage tiering applied, a medium platform can realistically operate at roughly $4,000/month (see the table below).
| Optimization Applied | Small Savings | Medium Savings | Large Savings |
|---|---|---|---|
| Auto-pause dev/staging compute | $100 | $1,500 | $5,000 |
| Spot instances for batch jobs | $150 | $1,200 | $6,000 |
| Storage lifecycle tiering | $15 | $100 | $500 |
| Right-sized Fabric capacity | $0 | $800 | $3,000 |
| Reserved instances (1-year) | $0 | $600 | $4,000 |
| Total optimized estimate | ~$630 | ~$3,740 | ~$16,490 |