Performance Testing — Benchmarking and Capacity Planning¶
TL;DR: Performance testing validates that CSA-in-a-Box meets SLAs under realistic load. This guide covers methodology for each service tier -- Databricks SQL, Fabric, Cosmos DB, Power BI, API endpoints, and streaming -- plus the automated regression pipeline that guards against silent degradation.
Testing Architecture¶
Every performance test follows the same loop: generate load, observe telemetry, analyse results.
```mermaid
flowchart LR
    subgraph LG[Load Generators]
        ALT[Azure Load Testing]
        K6[k6]
        LOC[Locust]
        DBT[dbt benchmark harness]
    end
    subgraph TS[Target Services]
        DBSQL[Databricks SQL]
        FAB[Fabric SQL / Lakehouse]
        CDB[Cosmos DB]
        API[Portal API endpoints]
        PBI[Power BI]
        EH[Event Hubs / Streaming]
    end
    subgraph MON[Monitoring]
        AM[Azure Monitor]
        AI[Application Insights]
        LA[Log Analytics]
    end
    subgraph AN[Analysis]
        KQL[KQL queries]
        RPT[Benchmark reports]
    end
    LG --> TS
    TS --> MON
    MON --> AN
    AN -->|feedback| LG
```

Each load generator emits structured results (JSON or CSV) archived as CI artifacts. Azure Monitor and App Insights capture server-side telemetry.
Query Benchmarking Methodology¶
Reliable benchmarks require controlled conditions.
Protocol¶
- Warm-up runs -- Execute the query 2-3 times before recording to populate caches and compile query plans.
- Statistical significance -- Run a minimum of 5 recorded iterations. Report p50, p95, and p99 latencies, not averages.
- Isolation -- Use a dedicated cluster or SQL pool to eliminate noise.
- Consistent dataset -- Pin the dataset version (Delta snapshot or database restore point) for comparability.
- Record environment -- Log cluster size, SKU, config, and dataset row count alongside every result set.
Benchmark Harness Template¶
"""Benchmark harness for SQL queries against Databricks or Fabric."""
import json, statistics, time
from pathlib import Path
def run_benchmark(execute_fn, queries: dict, iterations=5, warmup=2, output=None):
results = {}
for name, sql in queries.items():
for _ in range(warmup):
execute_fn(sql)
durations = []
for _ in range(iterations):
t0 = time.perf_counter()
execute_fn(sql)
durations.append(time.perf_counter() - t0)
durations.sort()
results[name] = {
"p50": durations[len(durations) // 2],
"p95": durations[int(len(durations) * 0.95)],
"p99": durations[int(len(durations) * 0.99)],
"mean": statistics.mean(durations),
"stdev": statistics.stdev(durations) if len(durations) > 1 else 0,
}
print(f"{name}: p50={results[name]['p50']:.3f}s p95={results[name]['p95']:.3f}s")
if output:
Path(output).write_text(json.dumps(results, indent=2))
return results
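A minimal invocation sketch, assuming the `databricks-sql-connector` package; the hostname, warehouse path, token, and query are placeholders:

```python
from databricks import sql  # pip install databricks-sql-connector

# Placeholder workspace values -- substitute your own.
conn = sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",
    http_path="/sql/1.0/warehouses/abc123def456",
    access_token="<token>",
)

def execute(query: str) -> None:
    with conn.cursor() as cur:
        cur.execute(query)
        cur.fetchall()  # drain the result set so timing covers the full round trip

run_benchmark(
    execute,
    queries={"daily_sales": "SELECT COUNT(*) FROM gold.daily_sales"},  # hypothetical
    iterations=10,
    output="bench-results.json",
)
```

Draining the result set matters: without the `fetchall()`, the timer would capture only query submission, not result delivery.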
Databricks Sizing¶
Match cluster configuration to the workload pattern.
| Workload | Workers | Instance Type | Autoscale | Photon | Spot | Notes |
|---|---|---|---|---|---|---|
| ETL batch | 4-16 | Standard_D16ds_v5 | 4-16 | Off | Yes | Cost-optimize; spot tolerates retries |
| Interactive / BI | 2-8 | Standard_E8ds_v5 | 2-8 | On | No | Low latency; Photon speeds scans 2-8x |
| Streaming | 4-8 | Standard_D8ds_v5 | Fixed | Off | No | Stable executor count avoids rebalance |
| ML training | 4-32 | Standard_NC6s_v3 (GPU) | 4-32 | N/A | Yes | GPU spot saves 60-80% |
| SQL Warehouse | Auto | Serverless | Auto | On | N/A | Best for ad-hoc; pay per query |
Cost-Performance Sweet Spot
Start with the smallest autoscale range that keeps queue time under 30 seconds. Monitor `cluster.spark.executor.count` in Ganglia or Azure Monitor to verify the cluster actually scales before widening the range.
Photon Increases DBU Rate
Photon DBUs cost ~2x standard DBUs. Enable Photon only when scan-heavy queries dominate and the wall-clock savings outweigh the DBU premium. Benchmark both configurations before committing.
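Whether Photon pays off reduces to a simple inequality: at a ~2x DBU rate, Photon is cost-neutral when it exactly halves wall-clock time. A back-of-the-envelope check (the multiplier and timings here are illustrative; plug in your own benchmark numbers):

```python
def photon_saves_money(baseline_sec: float, photon_sec: float,
                       dbu_multiplier: float = 2.0) -> bool:
    """DBU cost scales with runtime * rate, so Photon wins only when
    the wall-clock speedup exceeds the DBU premium."""
    speedup = baseline_sec / photon_sec
    return speedup > dbu_multiplier

# A scan-heavy query drops from 120 s to 45 s (2.7x speedup): worth it.
print(photon_saves_money(120, 45))  # True
# A join-heavy query only drops from 120 s to 80 s (1.5x): not worth it.
print(photon_saves_money(120, 80))  # False
```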
Fabric Capacity Planning¶
Microsoft Fabric uses a Capacity Unit (CU) consumption model. Every operation draws from a shared CU pool.
F-SKU Comparison¶
| SKU | CU Seconds / hr | Spark vCores | SQL Parallelism | Typical Use |
|---|---|---|---|---|
| F2 | 7,200 | 8 | 4 | Dev / sandbox |
| F8 | 28,800 | 32 | 16 | Single workload prod |
| F16 | 57,600 | 64 | 32 | Multi-workload prod |
| F64 | 230,400 | 256 | 128 | Enterprise analytics |
| F128+ | 460,800+ | 512+ | 256+ | Data warehouse / max scale |
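The CU-seconds column is simply the SKU number times 3,600: an F-SKU provides as many CUs as its number. A minimal sizing helper built on that relationship (a sketch; the SKU list and 20% headroom default are assumptions, and the result should be validated against the Capacity Metrics app):

```python
F_SKUS = [2, 4, 8, 16, 32, 64, 128, 256, 512]

def smallest_fabric_sku(sustained_cu: float, headroom: float = 0.2) -> str:
    """Pick the smallest F-SKU whose CU count covers sustained demand plus headroom.

    sustained_cu: average CU consumption reported by the Capacity Metrics app.
    """
    required = sustained_cu * (1 + headroom)
    for sku in F_SKUS:
        if sku >= required:
            return f"F{sku}"
    return f"above F{F_SKUS[-1]}: consider splitting workloads across capacities"

print(smallest_fabric_sku(50))  # F64 -- 50 CU * 1.2 headroom = 60 CU
```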
Smoothing vs Bursting¶
Fabric smooths CU consumption over 5 minutes for interactive operations and 24 hours for background operations. Short spikes are absorbed; sustained load above capacity triggers throttling. Scale up when the Capacity Metrics app shows sustained utilization above 80% or queue delays appear.
CU Monitoring Query¶
```kusto
// Fabric CU consumption over the last 24 hours
FabricCapacityMetrics
| where Timestamp > ago(24h)
| summarize
    AvgCU = avg(CUConsumption),
    MaxCU = max(CUConsumption),
    P95CU = percentile(CUConsumption, 95)
    by bin(Timestamp, 15m), WorkloadType
| order by Timestamp desc
| render timechart
```
Cosmos DB Throughput¶
Every Cosmos DB operation costs Request Units (RU). Estimate consumption before provisioning.
| Operation | Typical RU Cost | Notes |
|---|---|---|
| Point read (1 KB, by id + partition key) | 1 RU | Cheapest operation |
| Point read (10 KB) | ~3 RU | Scales with document size |
| Single-document write (1 KB) | ~5 RU | Includes indexing |
| Query (single partition, 10 results) | 5-20 RU | Depends on index hits |
| Query (cross-partition, 100 results) | 50-500 RU | Avoid in hot paths |
| Aggregate (cross-partition COUNT) | 100-5,000 RU | Use change feed + materialized view instead |
Rule of thumb: provision `peak_ops_per_second * avg_RU_per_op * 1.2` (20% headroom). Use autoscale (100-max RU/s) for variable workloads.
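A worked example of that rule of thumb (the request rate and per-operation cost are illustrative):

```python
peak_ops_per_second = 400   # expected peak request rate
avg_ru_per_op = 6           # measured via x-ms-request-charge, not guessed
headroom = 1.2              # 20% buffer

provisioned_ru = peak_ops_per_second * avg_ru_per_op * headroom
print(provisioned_ru)  # 2880 -> provision ~2,900 RU/s, or autoscale with max 3,000
```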
Partition Key Impact¶
A poorly chosen partition key forces cross-partition queries and creates hot partitions. Benchmark with the target partition key before going to production. Use `response_headers["x-ms-request-charge"]` or `page.request_charge` from the Python SDK to measure actual RU cost per query.
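A minimal measurement sketch using the `azure-cosmos` Python SDK; the endpoint, key, database, container, and query are placeholders, and `client_connection.last_response_headers` is a semi-internal but commonly used attribute:

```python
from azure.cosmos import CosmosClient

# Placeholder connection values -- substitute your own.
client = CosmosClient("https://<account>.documents.azure.com", credential="<key>")
container = client.get_database_client("csa").get_container_client("events")

# Single-partition query: pass the partition key explicitly.
items = list(container.query_items(
    query="SELECT * FROM c WHERE c.customerId = @id",
    parameters=[{"name": "@id", "value": "42"}],
    partition_key="42",
))

# RU charge of the most recent response.
charge = container.client_connection.last_response_headers["x-ms-request-charge"]
print(f"{len(items)} items, {charge} RU")
```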
Use the Capacity Calculator
The Azure Cosmos DB capacity calculator takes document size, index policy, read/write ratio, and query patterns as input and returns a provisioned throughput recommendation.
Power BI Performance¶
Connectivity Mode Comparison¶
| Mode | Latency | Data Freshness | Best For |
|---|---|---|---|
| Import | Fastest (in-memory) | Stale until refresh | Dashboards with scheduled refresh |
| DirectQuery | 2-30 s per visual | Real-time | Operational reports over large tables |
| Direct Lake | Sub-second | Near real-time | Fabric lakehouse (preferred) |
| Composite | Mixed | Mixed | Combine hot Import + cold DirectQuery |
DAX Query Optimization¶
Slow visuals are almost always slow DAX. Use Performance Analyzer (View > Performance Analyzer) to identify expensive visuals, then optimize:
- Replace `CALCULATE` + `FILTER(ALL(...))` with `CALCULATETABLE`
- Use `SUMMARIZECOLUMNS` instead of `ADDCOLUMNS` + `SUMMARIZE`
- Avoid iterators (`SUMX`, `AVERAGEX`) over large tables; pre-aggregate in the lakehouse
- Set `IsAvailableInMDX = false` on columns not used in Excel pivot tables
Direct Lake Guardrails¶
Direct Lake falls back to DirectQuery if a single query exceeds memory or row-count thresholds. Monitor fallback events with:
```kusto
PowerBIDatasetEvent
| where EventType == "DirectLakeFallback"
| summarize FallbackCount = count() by DatasetName, bin(Timestamp, 1h)
| order by Timestamp desc
```
API Endpoint Testing¶
The portal API layer sits behind APIM and Azure Functions. Load test it to find the concurrency ceiling before users do.
k6 Script Example¶
```javascript
// tests/load/k6_portal_api.js
import http from "k6/http";
import { check, sleep } from "k6";

export const options = {
  stages: [
    { duration: "30s", target: 20 }, // ramp up
    { duration: "2m", target: 50 },  // sustained load
    { duration: "30s", target: 0 },  // ramp down
  ],
  thresholds: {
    http_req_duration: ["p(95)<500", "p(99)<1500"],
    http_req_failed: ["rate<0.01"],
  },
};

export default function () {
  const res = http.get(`${__ENV.BASE_URL}/api/v1/health`, {
    headers: { "x-api-key": __ENV.API_KEY },
  });
  check(res, { "status 200": (r) => r.status === 200 });
  sleep(1);
}
```
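Run it with the target environment injected through k6's `-e` flags, e.g. `k6 run -e BASE_URL=https://<portal-host> -e API_KEY=<key> tests/load/k6_portal_api.js` (host and key are placeholders).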
Latency Targets¶
| Endpoint Category | p50 | p95 | p99 | Max Concurrent Users |
|---|---|---|---|---|
| Health / readiness | < 50 ms | < 100 ms | < 200 ms | N/A |
| CRUD (single record) | < 200 ms | < 500 ms | < 1,500 ms | 200 |
| Search / list | < 500 ms | < 1,500 ms | < 3,000 ms | 100 |
| AI enrichment | < 2,000 ms | < 5,000 ms | < 10,000 ms | 50 |
Streaming Throughput¶
Event Hubs Partition Sizing¶
Each Event Hubs partition supports a maximum of 1 MB/s ingress and 2 MB/s egress. Size partitions to peak throughput plus headroom.
Formula: `partitions = ceil(peak_MB_per_sec) * 1.5` (round up, 50% headroom); a worked sketch follows the table.
| Throughput Tier | Partitions | Consumer Groups | Notes |
|---|---|---|---|
| Low (< 5 MB/s) | 8 | 2-4 | Default for dev/test |
| Medium (5-50 MB/s) | 32 | 4-8 | Standard production |
| High (50-200 MB/s) | 128 | 8-16 | Dedicated tier required |
| Ultra (> 200 MB/s) | 256+ | 16+ | Premium tier, contact support |
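The formula expressed as a sketch (the tier table above rounds results to operationally convenient values; this computes the raw number):

```python
import math

def eventhub_partitions(peak_mb_per_sec: float, headroom: float = 1.5) -> int:
    """ceil(peak MB/s) * headroom, rounded up -- each partition ingests 1 MB/s."""
    return math.ceil(math.ceil(peak_mb_per_sec) * headroom)

for peak in (4, 20, 100):
    print(f"{peak} MB/s -> {eventhub_partitions(peak)} partitions")
# 4 MB/s -> 6, 20 MB/s -> 30, 100 MB/s -> 150
```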
End-to-End Latency Measurement¶
Measure from event publish to dashboard update by embedding a timestamp in the payload and computing the delta at the consumer.
```kusto
// Consumer-side: measure end-to-end latency
TelemetryEvents
| extend PublishedAt = unixtime_seconds_todatetime(todouble(parsed_payload.published_at))
| extend E2ELatencyMs = datetime_diff("millisecond", ingestion_time(), PublishedAt)
| summarize P50 = percentile(E2ELatencyMs, 50),
            P95 = percentile(E2ELatencyMs, 95),
            P99 = percentile(E2ELatencyMs, 99)
    by bin(ingestion_time(), 5m)
| render timechart
```
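On the producer side, embedding the timestamp is one line of payload construction. A sketch using the `azure-eventhub` Python SDK; the connection string, hub name, and payload fields are placeholders:

```python
import json
import time
from azure.eventhub import EventHubProducerClient, EventData

# Placeholder connection details -- substitute your own.
producer = EventHubProducerClient.from_connection_string(
    "<connection-string>", eventhub_name="telemetry"
)

with producer:
    batch = producer.create_batch()
    batch.add(EventData(json.dumps({
        "device_id": "sensor-01",
        "reading": 23.7,
        "published_at": time.time(),  # the consumer query computes E2E latency from this
    })))
    producer.send_batch(batch)
```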
Capacity Planning Formulas¶
Use these rules of thumb to project when a scale-up is needed.
Storage growth: `projected_storage_TB = current_TB * (1 + daily_growth_rate) ^ days`
Compute scaling trigger: scale when average CPU or CU utilization exceeds 70% sustained over 30 minutes.
Node addition trigger: add workers when Spark task queue depth exceeds 2 * total_cores for more than 5 minutes.
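A worked example of the storage projection (current size and growth rate are illustrative):

```python
current_tb = 10.0
daily_growth_rate = 0.005   # 0.5% per day, measured over the last quarter
days = 180

projected_tb = current_tb * (1 + daily_growth_rate) ** days
print(f"{projected_tb:.1f} TB")  # ~24.5 TB -- schedule the scale-up before day 180
```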
Decision Tree¶
```mermaid
flowchart TD
    A[Performance issue detected] --> B{Which layer?}
    B -->|Query latency| C{Single query or all queries?}
    B -->|Throughput ceiling| D{Compute or I/O bound?}
    B -->|Streaming lag| E[Add Event Hubs partitions]
    C -->|Single query| F[Optimize query / indexes]
    C -->|All queries| G{"CPU > 70%?"}
    G -->|Yes| H[Scale up compute SKU or add workers]
    G -->|No| I[Check I/O: small files, missing partition pruning]
    D -->|Compute| H
    D -->|I/O| J[Run OPTIMIZE, check file sizes, add caching]
```

Automated Performance Regression¶
The `load-tests.yml` workflow provides on-demand performance regression detection. It is manual-dispatch only because tests hit live environments and incur cost.
How It Works¶
- Dispatch -- trigger via GitHub Actions UI, selecting a target, VU count, and duration.
- Execution -- runs Locust, k6, or the dbt benchmark harness against the target environment.
- Baseline comparison -- dbt benchmarks compare against files in `tests/load/baselines/`; a 20% regression triggers a warning annotation (see the sketch below).
- Artifact upload -- reports are uploaded as workflow artifacts (14-90 day retention).
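A minimal sketch of the comparison logic, assuming result files shaped like the harness output above; the 20% default mirrors the workflow's warning threshold:

```python
import json
from pathlib import Path

def compare_to_baseline(current_path: str, baseline_path: str,
                        threshold: float = 0.20) -> list:
    """Return one warning per query whose p95 regressed beyond the threshold."""
    current = json.loads(Path(current_path).read_text())
    baseline = json.loads(Path(baseline_path).read_text())
    warnings = []
    for name, base in baseline.items():
        cur = current.get(name)
        if cur is None:
            continue  # query removed or renamed; handle separately if needed
        regression = (cur["p95"] - base["p95"]) / base["p95"]
        if regression > threshold:
            warnings.append(
                f"{name}: p95 {base['p95']:.2f}s -> {cur['p95']:.2f}s (+{regression:.0%})"
            )
    return warnings
```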
Establishing a Baseline¶
```bash
# Run the benchmark harness locally, then commit the output as a baseline
python tests/load/benchmark_dbt_models.py \
  --target dev \
  --models tag:silver \
  --runs 5 \
  --output tests/load/baselines/dbt-bench-silver.baseline.json

git add tests/load/baselines/
git commit -m "perf: establish silver model benchmark baseline"
```
Regression Detection Thresholds¶
| Metric | Warning | Failure |
|---|---|---|
| Query p50 latency | > 10% regression | > 25% regression |
| Query p95 latency | > 15% regression | > 30% regression |
| Throughput (req/s) | > 10% decrease | > 20% decrease |
| Error rate | > 0.5% | > 2% |
Environment Drift
Baselines are only valid when the environment matches. Re-establish baselines after cluster resizing, SKU changes, or major data volume increases. Comparing a 4-node baseline to an 8-node run is meaningless.
Anti-Patterns¶
Testing in Shared Environments
Shared clusters introduce uncontrollable variance. Another user's query can spike CPU and skew your p95. Always benchmark on dedicated, isolated compute.
Insufficient Warm-Up
The first query execution compiles plans, populates caches, and may trigger JIT compilation. Including it inflates p50 by 2-10x. Always discard at least 2 warm-up iterations.
Ignoring Tail Latency
A p50 of 200 ms with a p99 of 15 s means 1 in 100 users waits 15 seconds. SLAs should be defined on p95 or p99, not averages or medians alone.
Benchmarking with Unrealistic Data Volumes
A query that runs in 500 ms on 1 GB of data may take 5 minutes on 500 GB. Always test with production-scale data or a statistically representative sample (minimum 10% of production volume).
Single-Run Conclusions
One run proves nothing. Variance from GC pauses, network jitter, or autoscale lag can swing results 30%+. Report confidence intervals or at minimum run 5 iterations.
Benchmark Result Template¶
Use this table to record and compare benchmark runs across configurations.
| Metric | Baseline | Current | Delta | Status |
|---|---|---|---|---|
| Test date | 2026-04-01 | 2026-04-15 | -- | -- |
| Environment | 4x D16ds_v5 | 4x D16ds_v5 | Same | -- |
| Dataset rows | 50M | 52M | +4% | -- |
| Query A p50 | 1.2 s | 1.3 s | +8% | Pass |
| Query A p95 | 2.1 s | 2.5 s | +19% | Warning |
| Query A p99 | 3.8 s | 4.0 s | +5% | Pass |
| Query B p50 | 0.4 s | 0.4 s | 0% | Pass |
| Query B p95 | 0.9 s | 1.5 s | +67% | Fail |
| Throughput | 120 req/s | 115 req/s | -4% | Pass |
| Error rate | 0.1% | 0.2% | +0.1pp | Pass |
Copy this template into your benchmark reports. Include environment details, dataset version, and commit SHA so results are reproducible.
Related¶
- Performance Tuning -- Delta Lake, Spark, and query-level optimization techniques
- Cost Optimization -- Balancing performance investment against cloud spend
- Databricks Best Practices -- Cluster policies, workspace layout, and operational hardening
- Cosmos DB Guide -- Throughput provisioning, partition design, and change feed patterns
- Event Hubs Guide -- Partition sizing, consumer groups, and capture configuration
- Power BI Guide -- Connectivity modes, dataset refresh, and governance