LLM Cost Tracking & FinOps for AI Workloads¶
Token Economics, Budgeting, Rate Limiting, and Fallback Strategies for Fabric AI Workloads
Last Updated: 2026-04-27 | Version: 1.0.0 | Anchor: MLOps for Fabric Production (Wave 2)
Table of Contents¶
- Why LLM Cost Tracking
- LLM Cost Surfaces in Fabric
- Token Economics 101
- Pricing Snapshot
- Reference Architecture
- Tracking Implementation
- Cost Attribution Tags
- Budgeting & Alerts
- Rate Limiting Patterns
- Caching Strategies
- Fallback Model Strategy
- RAG-Specific Cost Patterns
- Optimization Techniques
- KQL Cost Library
- Cost Dashboard
- Casino Implementation
- Federal Implementation
- Anti-Patterns
- Implementation Checklist
- References
Why LLM Cost Tracking¶
Fabric's AI surface area exploded in 2025-2026: Copilot in every workload, Data Agents on every domain, AI Functions on every Spark cluster, Eventhouse vector search, and an open door to Azure OpenAI / Anthropic / OpenAI from any notebook. Each of these consumes tokens. Tokens cost real money, and unlike CU consumption, they don't show up cleanly on a single Fabric capacity meter.
This document is the FinOps backbone for AI workloads. It covers how to instrument every LLM call, attribute spend to the team or feature that triggered it, set budgets that actually block runaway spend, and design fallback chains that degrade quality before they degrade your P&L.
What "Production-Grade LLM Cost Control" Looks Like¶
| Aspect | No Discipline | Production-Grade |
|---|---|---|
| Visibility | "Azure OpenAI bill went up; investigating" | Per-user, per-workload, per-model token spend in real time |
| Attribution | Single shared API key, no tags | Every call tagged: cost_center, business_unit, project, workload, env |
| Budgets | Annual budget, reviewed quarterly | Daily/weekly/monthly budgets per workload with hard cutoffs |
| Rate limiting | Provider-side 429s only | Token-bucket per user + adaptive throttling near budget |
| Caching | None | Prompt cache + response cache + embedding cache; hit-rate tracked |
| Fallback | Hard fail if Opus is rate-limited | Graceful degrade: Opus → Sonnet → Haiku → cached response |
| Optimization | One model for everything | Triage with small model; escalate only on uncertainty |
| Audit | "Look at the bill in 30 days" | Per-call log to Eventhouse, queryable, alertable, joinable to business KPIs |
Observed Waste Patterns¶
These show up in nearly every LLM rollout that skips this discipline:
- Demo loops left running - a notebook that re-asks the same prompt every minute for "live demo" purposes, forgotten on a Friday, $4K by Monday.
- No prompt caching - Anthropic and Azure OpenAI both support caching the static system prompt; teams pay full input rate for the same 8K-token preamble on every call.
- Wrong-model-for-task - using GPT-4o or Claude Opus to extract a date from a sentence (a job for gpt-4o-mini or Haiku at ~1/15 the cost).
- AI Function row explosion - applying a per-row LLM call across a 50M-row Bronze table without sampling or filtering first.
- Embedding regeneration - re-embedding the same documents nightly because nobody hashed the source text and stored a vector cache.
- Reasoning leak in agents - a Data Agent or custom agent that loops on tool calls, racking up 40+ reasoning turns before erroring out, with no max-step guard.
- Verbose system prompts - a 12K-token system prompt for a chat that gets 200-token user inputs: an input-heavy cost ratio that stays invisible without a dashboard.
- No fallback path - provider rate-limits → app crashes → engineers retry the entire batch → costs double.
Scope: This is the LLM-cost sub-doc of the Wave 2 anchor mlops-fabric-production.md. For broader Fabric capacity FinOps see finops-cost-governance.md and capacity-planning-cost-optimization.md.
LLM Cost Surfaces in Fabric¶
Every dollar of LLM spend in Fabric flows through one of these surfaces. Track each separately.
| # | Surface | Driver | Billing Model | Visibility | Mitigation |
|---|---|---|---|---|---|
| 1 | Fabric Copilot (chat in workspace, notebook code-gen, DAX Copilot) | Tokens consumed, billed against Copilot capacity | Capacity Units (CU) on F-SKU | Capacity Metrics App, Workspace Monitoring | Designate Copilot capacity; throttle by user/workload tenant settings |
| 2 | Data Agents (Q&A reasoning, NL2SQL/DAX/KQL) | Tokens per turn × turns per session | CU on F-SKU + provider-side reasoning tokens | Workspace Monitoring + Data Agent audit logs | Cap turns; trim few-shot examples; restrict source count |
| 3 | AI Functions (ai.classify, ai.extract, ai.translate, ai.summarize) | Per-row API call × row count | CU + per-row | Spark UI, Workspace Monitoring | Pre-filter, sample, or batch; cache by content hash |
| 4 | Custom LLM calls from notebooks (Azure OpenAI, OpenAI direct, Anthropic, Mistral via AI Foundry) | Tokens × model tier | Provider direct billing (Azure subscription or external) | Only what you instrument | This doc: wrap every call |
| 5 | Embeddings generation (vector DB, RAG indexing) | Tokens × embedding model | Provider per-1M-tokens | Provider portal | Hash content; reuse vectors; batch API |
| 6 | Fine-tuning | Training tokens + hosted deployment hours | Provider per-1K-tokens + per-hour | Provider portal | Rare; require approval; prefer prompt engineering + RAG first |
| 7 | Vector store (Eventhouse vector index, Azure AI Search, Cosmos for NoSQL) | Storage + per-query | Storage GB-month + per-query unit | Service-specific | Right-size index; partition by tenant; cold tier old vectors |
Attribution gap: Surfaces 1-3 are billed via Fabric capacity, so they show up in CU metrics but get aggregated. Surfaces 4-7 are billed via your Azure subscription or an external provider and do not appear in Fabric metrics at all. The only reliable single pane of glass is the llm_usage Eventhouse table this doc defines.
Token Economics 101¶
Input vs. Output¶
LLM pricing is asymmetric: output tokens cost 3-5× input tokens on most providers. A chat that returns a 4-line answer to a 200-line context is almost entirely input cost. A code-gen call that emits a 2K-line file is mostly output cost. Track both separately; averaging hides where the money actually goes.
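A back-of-envelope sketch of that asymmetry. The rates are the gpt-4o list prices from the Pricing Snapshot, hardcoded here purely for illustration; production code should read rates from a pricing config table, and the token counts are made-up examples:

```python
# Illustration only: gpt-4o list prices hardcoded for the math; real code
# should pull rates from a pricing config table, never embed them.
IN_RATE, OUT_RATE = 2.50, 10.00  # USD per 1M tokens

def call_cost(prompt_tokens: int, completion_tokens: int) -> tuple:
    """Return (input_cost, output_cost) in USD for one call."""
    return (prompt_tokens / 1e6 * IN_RATE,
            completion_tokens / 1e6 * OUT_RATE)

# Big-context chat: 6,000 tokens in, 100 out -> input-dominated
chat_in, chat_out = call_cost(6_000, 100)
# Code-gen: 500 tokens in, 4,000 out -> output-dominated
gen_in, gen_out = call_cost(500, 4_000)
```

Despite output's 4× rate, the chat call's spend is almost all input; the code-gen call's is almost all output. An averaged "cost per call" metric would hide both skews.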
Cached Input Tokens¶
Both Anthropic and Azure OpenAI support prompt caching: the static prefix (system prompt, few-shot examples, retrieved docs) is cached server-side and billed at a fraction of the input rate (typically 10% on Anthropic, 50% on Azure OpenAI). Cache TTL is usually 5 min (Anthropic) or session-scoped (Azure OpenAI). Restructure prompts to put static content first and dynamic content last to maximize cache hits.
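A minimal sketch of the expected savings, assuming an Anthropic Sonnet-class rate ($3.00/1M input, $0.30/1M cached read) and ignoring the one-time cache-write premium; all token counts are illustrative:

```python
# Expected input cost per call for a static prefix + dynamic suffix, given a
# cache hit rate. Rates are assumptions for illustration, not live prices.
def expected_input_cost(static_t, dynamic_t, hit_rate,
                        in_rate=3.00, cached_rate=0.30):
    static_cost = static_t / 1e6 * (hit_rate * cached_rate
                                    + (1 - hit_rate) * in_rate)
    return static_cost + dynamic_t / 1e6 * in_rate

no_cache = expected_input_cost(8_000, 500, hit_rate=0.0)
warm     = expected_input_cost(8_000, 500, hit_rate=0.9)
```

With a 90% hit rate on an 8K-token prefix, per-call input cost drops by roughly 4×, which is why prefix ordering matters so much.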
Reasoning Tokens¶
Reasoning models (o1, o3, Claude with extended thinking) emit hidden "thinking" tokens that you pay for but don't see. These can dwarf visible output. Always log reasoning_tokens separately from completion_tokens.
Model-Tier Pricing Curve¶
Within a provider family, pricing typically follows a 1×, 5×, 25× curve from small → medium → flagship. This is the foundation of the fallback strategy: pick the right rung for the task.
| Tier | Use For | Typical Cost Multiplier |
|---|---|---|
| Small (Haiku, gpt-4o-mini, gpt-3.5-turbo) | Triage, classification, extraction, routing | 1× |
| Medium (Sonnet, gpt-4o) | Most chat, Q&A, summarization | 5× |
| Flagship (Opus, o1/o3, gpt-4 turbo) | Complex reasoning, multi-step planning, code-gen | 15-30× |
Pricing Snapshot¶
⚠️ Pricing changes frequently. All numbers below are USD per 1M tokens, captured 2026-04-27. Refresh quarterly by re-pulling from the provider URLs in References. Do not embed these in code; pull live from a config table (config_llm_pricing) that your finance team owns.
Azure OpenAI (East US, Pay-As-You-Go)¶
| Model | Input ($/1M) | Cached Input ($/1M) | Output ($/1M) | Notes |
|---|---|---|---|---|
| gpt-4o (2024-11-20) | $2.50 | $1.25 | $10.00 | Flagship multimodal |
| gpt-4o-mini | $0.15 | $0.075 | $0.60 | Default for triage |
| gpt-4-turbo | $10.00 | n/a | $30.00 | Legacy → migrate to gpt-4o |
| gpt-3.5-turbo | $0.50 | n/a | $1.50 | Legacy → migrate to gpt-4o-mini |
| o1 | $15.00 | $7.50 | $60.00 | Reasoning; outputs include reasoning tokens |
| o3-mini | $1.10 | $0.55 | $4.40 | Reasoning, lower tier |
Source: https://azure.microsoft.com/en-us/pricing/details/cognitive-services/openai-service/ (captured 2026-04-27)
Anthropic Claude (Direct API or via Azure AI Foundry)¶
| Model | Input ($/1M) | Cached Read ($/1M) | Cache Write ($/1M) | Output ($/1M) | Notes |
|---|---|---|---|---|---|
| Claude Opus 4.x | $15.00 | $1.50 | $18.75 | $75.00 | Flagship |
| Claude Sonnet 4.x | $3.00 | $0.30 | $3.75 | $15.00 | Default chat / agents |
| Claude Haiku 4.x | $0.80 | $0.08 | $1.00 | $4.00 | Triage / extraction |
Source: https://www.anthropic.com/pricing (captured 2026-04-27)
Embedding Models¶
| Model | Provider | $/1M input tokens | Dimensions |
|---|---|---|---|
| text-embedding-3-small | Azure OpenAI | $0.020 | 1536 |
| text-embedding-3-large | Azure OpenAI | $0.130 | 3072 |
| text-embedding-ada-002 | Azure OpenAI | $0.100 | 1536 |
| voyage-3 | Voyage AI (Anthropic-recommended) | $0.060 | 1024 |
Source: https://openai.com/api/pricing/ + https://docs.voyageai.com/docs/pricing (captured 2026-04-27)
Pricing change protocol: Add a quarterly task to your team's planning rhythm. Update the config_llm_pricing table; do not edit this doc's numbers without also updating the capture date.
Reference Architecture¶
flowchart LR
  subgraph Caller["Caller"]
NB[Notebook /<br/>Pipeline /<br/>Agent]
end
  subgraph Middleware["LLM Middleware"]
TC[Token Counter]
RL[Rate Limiter<br/>Token Bucket]
CACHE[Response<br/>Cache]
BUDGET[Budget Check]
FB[Fallback<br/>Router]
end
  subgraph Providers["Providers"]
AOAI[Azure<br/>OpenAI]
ANTH[Anthropic]
OAI[OpenAI<br/>Direct]
end
  subgraph Telemetry["Telemetry"]
EH[(Eventhouse<br/>llm_usage)]
DASH[Power BI<br/>Cost Dashboard]
ALERT[Action<br/>Groups]
end
NB --> TC --> RL --> BUDGET
BUDGET -->|"under budget"| CACHE
BUDGET -.->|"over budget"| ALERT
CACHE -->|"miss"| FB
CACHE -.->|"hit"| NB
FB --> AOAI
FB --> ANTH
FB --> OAI
AOAI --> EH
ANTH --> EH
OAI --> EH
EH --> DASH
EH --> ALERT
ALERT -.->|"throttle"| RL
style Middleware fill:#6C3483,stroke:#4A235A,color:#fff
style Providers fill:#2471A3,stroke:#1A5276,color:#fff
  style Telemetry fill:#27AE60,stroke:#1E8449,color:#fff
Component Map¶
| Component | Implementation | Purpose |
|---|---|---|
| Token Counter | tiktoken for OpenAI, anthropic.count_tokens for Claude | Count tokens before the call for budget pre-check |
| Rate Limiter | Redis token bucket or in-memory per worker | Throttle per user/tenant |
| Budget Check | KQL on llm_usage rolling window | Block when day/week/month cap reached |
| Response Cache | Redis or Cosmos for NoSQL keyed on prompt hash | Skip provider call entirely on cache hit |
| Fallback Router | Try-catch chain with model-tier ladder | Graceful degradation on rate limit / budget |
| Eventhouse llm_usage | KQL DB | Single source of truth for cost analytics |
| Action Groups | Azure Monitor → Teams / PagerDuty / Email | Wired into observability stack |
Tracking Implementation¶
The cardinal rule: no LLM call without a usage record. Wrap every call with a decorator that logs to Eventhouse on completion (success or failure).
llm_usage Schema¶
.create table llm_usage (
timestamp: datetime,
request_id: string,
tenant_id: string,
user_id: string,
workload: string, // chat | completion | embedding | agent | aifunc
surface: string, // copilot | data_agent | ai_function | custom | embedding
provider: string, // azure_openai | anthropic | openai
model: string, // gpt-4o | claude-sonnet-4 | text-embedding-3-large
prompt_tokens: long,
cached_tokens: long,
    completion_tokens: long,
reasoning_tokens: long,
total_tokens: long,
cost_usd: real,
latency_ms: long,
cache_hit: bool,
fallback_count: int, // how many tiers we tried before success
status: string, // ok | rate_limited | budget_block | error
error_message: string,
cost_center: string,
business_unit: string,
project: string,
environment: string, // dev | staging | prod
prompt_sha256: string, // for cache lookup; never store raw prompt
    completion_sha256: string
)
PII rule: Never store the raw prompt or completion text in llm_usage. Hash with SHA-256 (truncated to the first 16 hex chars is fine for cache lookup). For deeper debugging, route a sampled subset (1%) into a separate llm_traces table that has tighter access control and a 30-day retention policy.
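A minimal sketch of that 1% sampling. The helper name is hypothetical; the idea is to key sampling on request_id so a retried request lands in the same bucket and its traces stay correlated:

```python
import hashlib

def should_trace(request_id: str, rate: float = 0.01) -> bool:
    """Deterministic sampling: hash request_id to [0, 1) and compare to rate,
    so the same request always makes the same trace/no-trace decision."""
    h = int(hashlib.sha256(request_id.encode()).hexdigest()[:8], 16)
    return h / 0x100000000 < rate
```

Call it inside the tracking decorator before writing to llm_traces; deterministic sampling also makes trace volume predictable for the retention budget.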
PySpark Decorator (drop-in)¶
# notebooks/utils/llm_tracking.py
import hashlib
import json
import os
import time
import uuid
from contextvars import ContextVar
from datetime import datetime, timezone
from functools import wraps
from pyspark.sql import SparkSession
# Context-local attribution (set per session/request)
_ctx_tenant = ContextVar("tenant_id", default="unknown")
_ctx_user = ContextVar("user_id", default="unknown")
_ctx_cost_center = ContextVar("cost_center", default="unallocated")
_ctx_workload = ContextVar("workload", default="custom")
_ctx_project = ContextVar("project", default="unknown")
_ctx_env = ContextVar("environment", default=os.getenv("FABRIC_ENV", "dev"))
def set_attribution(*, tenant_id, user_id, cost_center, workload, project,
                    environment=None):
    """Call once at the top of any notebook or pipeline activity."""
    _ctx_tenant.set(tenant_id)
    _ctx_user.set(user_id)
    _ctx_cost_center.set(cost_center)
    _ctx_workload.set(workload)
    _ctx_project.set(project)
    if environment:
        _ctx_env.set(environment)
def _sha16(s: str) -> str:
    return hashlib.sha256(s.encode("utf-8")).hexdigest()[:16]
def _price(provider: str, model: str, prompt_t: int, cached_t: int,
           completion_t: int, reasoning_t: int) -> float:
    """
    Look up prices from a Spark table that finance owns.
    Never hardcode pricing here.
    """
    spark = SparkSession.builder.getOrCreate()
    row = (spark.table("lh_gold.config_llm_pricing")
           .filter(f"provider = '{provider}' AND model = '{model}'")
           .first())
    if row is None:
        return 0.0
    inp = (prompt_t - cached_t) / 1_000_000 * row.input_price_per_1m
    cin = cached_t / 1_000_000 * row.cached_input_price_per_1m
    out = (completion_t + reasoning_t) / 1_000_000 * row.output_price_per_1m
    return round(inp + cin + out, 6)
def _emit_to_eventhouse(record: dict) -> None:
    """
    Append a single record to the llm_usage Eventhouse table.
    Use Eventstream or a direct ingest endpoint in real deployment.
    """
    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([record])
    (df.write
       .format("delta")
       .mode("append")
       .saveAsTable("llm_eventhouse.llm_usage"))
def track_llm(provider: str, surface: str = "custom"):
    """
    Decorator: wrap any function that returns a (response, usage) tuple where
    usage has prompt_tokens / completion_tokens / cached_tokens / reasoning_tokens.
    """
    def deco(fn):
        @wraps(fn)
        def wrapped(*args, **kwargs):
            t0 = time.time()
            request_id = str(uuid.uuid4())
            status = "ok"
            err_msg = ""
            usage = None
            response = None
            try:
                response, usage = fn(*args, **kwargs)
            except Exception as e:
                status = "error"
                err_msg = str(e)[:500]
                raise
            finally:
                latency_ms = int((time.time() - t0) * 1000)
                u = usage or {}
                pt = int(u.get("prompt_tokens", 0))
                ct = int(u.get("cached_tokens", 0))
                ot = int(u.get("completion_tokens", 0))
                rt = int(u.get("reasoning_tokens", 0))
                model = u.get("model", kwargs.get("model", "unknown"))
                prompt_text = kwargs.get("prompt", "") or json.dumps(
                    kwargs.get("messages", []))
                completion_text = ""
                if response is not None:
                    completion_text = str(response)[:8192]
                _emit_to_eventhouse({
                    "timestamp": datetime.now(timezone.utc),
                    "request_id": request_id,
                    "tenant_id": _ctx_tenant.get(),
                    "user_id": _ctx_user.get(),
                    "workload": _ctx_workload.get(),
                    "surface": surface,
                    "provider": provider,
                    "model": model,
                    "prompt_tokens": pt,
                    "cached_tokens": ct,
                    "completion_tokens": ot,
                    "reasoning_tokens": rt,
                    "total_tokens": pt + ot + rt,
                    "cost_usd": _price(provider, model, pt, ct, ot, rt),
                    "latency_ms": latency_ms,
                    "cache_hit": bool(u.get("cache_hit", False)),
                    "fallback_count": int(u.get("fallback_count", 0)),
                    "status": status,
                    "error_message": err_msg,
                    "cost_center": _ctx_cost_center.get(),
                    "business_unit": u.get("business_unit", ""),
                    "project": _ctx_project.get(),
                    "environment": _ctx_env.get(),
                    "prompt_sha256": _sha16(prompt_text),
                    "completion_sha256": _sha16(completion_text),
                })
            return response
        return wrapped
    return deco
Usage Example¶
from utils.llm_tracking import track_llm, set_attribution
from anthropic import Anthropic
set_attribution(
tenant_id="casino-prod",
user_id="floor-manager-42",
cost_center="casino-data-science",
workload="agent",
project="floor-monitoring",
)
client = Anthropic()
@track_llm(provider="anthropic", surface="custom")
def ask_claude(prompt: str, model: str = "claude-sonnet-4-5"):
    resp = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    usage = {
        "model": model,
        "prompt_tokens": resp.usage.input_tokens,
        "cached_tokens": getattr(resp.usage, "cache_read_input_tokens", 0),
        "completion_tokens": resp.usage.output_tokens,
        "reasoning_tokens": 0,
    }
    return resp.content[0].text, usage
answer = ask_claude("Summarize today's CTR filings.")
Apply the decorator to AI Functions too by wrapping the Spark UDF that calls ai.classify / ai.extract. For Data Agents, instrument the SDK client; for Copilot, ingest the FabricAuditLogs.DataAgentQuery events into llm_usage via Eventstream (best-effort token counts).
Cost Attribution Tags¶
Every record in llm_usage carries five mandatory tags. Make them required at attribution time; fail closed if any are missing.
| Tag | Example | Source | Why |
|---|---|---|---|
| cost_center | casino-data-science | Org chart / HR system | Finance chargeback |
| business_unit | gaming-ops | Org chart | P&L roll-up |
| project | floor-monitoring | Project tracker (Archon, Jira) | Feature-level ROI |
| workload | chat / completion / embedding / agent / aifunc | Caller-declared | Cost-by-pattern analysis |
| environment | dev / staging / prod | FABRIC_ENV env var | Prevent dev runaway from blocking prod |
Mandatory Tag Enforcement¶
def set_attribution(**kwargs):
    required = {"tenant_id", "user_id", "cost_center", "workload", "project"}
    missing = required - kwargs.keys()
    if missing:
        raise ValueError(f"Missing required attribution tags: {missing}")
    # ... set context vars
Note: Mirror these tags into the Spark conf so Fabric capacity FinOps cost rollups align with LLM cost rollups: spark.conf.set("spark.fabric.cost_center", cost_center).
Budgeting & Alerts¶
Budget Hierarchy¶
Tenant
└── Business Unit (monthly budget)
    └── Cost Center (weekly budget)
        └── Project (daily budget)
            └── User (per-session quota)
Higher levels are soft caps (alert + report). Project- and user-level are hard caps (block via rate limiter / circuit breaker).
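A minimal sketch of that soft/hard split as a decision function. In a real deployment, spend would come from the llm_usage rolling-window query and the limits from config_llm_budgets; here they are plain arguments:

```python
def budget_decision(spend_usd: float, soft_limit_usd: float,
                    hard_limit_usd: float) -> str:
    """Soft cap -> alert but keep serving; hard cap -> block the caller."""
    if spend_usd >= hard_limit_usd:
        return "block"   # hard cap: middleware refuses the call
    if spend_usd >= soft_limit_usd:
        return "alert"   # soft cap: notify owners, keep serving
    return "ok"
```

The middleware checks this before every provider call; "block" should raise the same BudgetBlockError the fallback router catches.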
Budget Configuration Table¶
-- lh_gold.config_llm_budgets
CREATE TABLE lh_gold.config_llm_budgets (
scope_type STRING, -- 'tenant' | 'business_unit' | 'cost_center' | 'project' | 'user'
scope_value STRING, -- e.g. 'casino-data-science'
period STRING, -- 'daily' | 'weekly' | 'monthly'
soft_limit_usd DOUBLE, -- alert at this point
hard_limit_usd DOUBLE, -- block at this point
action_group_id STRING, -- Azure Monitor Action Group resource id
contact STRING -- Teams channel or email
)
USING DELTA;
Budget-Burn KQL¶
// Soft-limit burn detector โ runs every 15 minutes via Workspace Monitoring scheduled query
let budgets = externaldata(scope_type:string, scope_value:string, period:string,
soft_limit_usd:real, hard_limit_usd:real)
[@'https://{onelake}/lh_gold/Tables/config_llm_budgets']
with(format='parquet');
let current_time = now();
let day_start = startofday(current_time);
llm_usage
| where timestamp >= day_start
| summarize spend_usd = sum(cost_usd) by cost_center
| join kind=inner (
budgets | where period == "daily"
| project scope_value, soft_limit_usd, hard_limit_usd
) on $left.cost_center == $right.scope_value
| extend pct_burned = round(spend_usd / hard_limit_usd * 100, 1)
| where spend_usd >= soft_limit_usd
| project cost_center, spend_usd, soft_limit_usd, hard_limit_usd, pct_burned
| order by pct_burned desc
Wire this query as an Azure Monitor scheduled query alert → Action Group → Teams + on-call. Severity:
- ≥ soft_limit (alert): Sev 3
- ≥ 90% of hard_limit (warn): Sev 2
- ≥ hard_limit (block): Sev 1 + automated rate-limiter tightening (see Rate Limiting)
For cross-references on Action Group wiring, see monitoring-observability.md and alerting-data-activator.md.
Rate Limiting Patterns¶
Provider-side 429s are a last line of defense. Build your own first.
Token Bucket per User/Tenant¶
# Redis-backed token bucket
import redis
import time

r = redis.Redis(host="redis.fabric.local")

def take_token(key: str, capacity: int, refill_per_sec: float) -> bool:
    # Note: this read-modify-write is not atomic across workers. For strict
    # enforcement under high concurrency, move the logic into a Redis Lua
    # script so the refill and the take happen in a single round trip.
    now = time.time()
    state = r.hgetall(f"bucket:{key}") or {}
    tokens = float(state.get(b"tokens", capacity))
    last = float(state.get(b"last", now))
    tokens = min(capacity, tokens + (now - last) * refill_per_sec)
    if tokens < 1:
        return False
    r.hset(f"bucket:{key}", mapping={"tokens": tokens - 1, "last": now})
    return True

def call_with_bucket(user_id, fn, *args, **kwargs):
    if not take_token(f"user:{user_id}", capacity=60, refill_per_sec=1.0):
        raise RuntimeError("Rate limit: 60 calls/min per user")
    return fn(*args, **kwargs)
Adaptive Throttling¶
When a cost center is at 80% of its daily budget, automatically reduce its bucket refill rate. The KQL alert above can write back to a runtime_throttle table that the bucket reader honors:
def effective_refill_rate(cost_center: str, base: float) -> float:
    burn_pct = lookup_burn_pct(cost_center)  # from runtime_throttle table
    if burn_pct < 50:
        return base
    if burn_pct < 80:
        return base * 0.5
    if burn_pct < 95:
        return base * 0.2
    return 0.0  # hard stop
Circuit Breaker on Repeated 429s¶
class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_seconds=60):
        self.failures = 0
        self.opened_at = None
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds

    def call(self, fn, *args, **kwargs):
        if self.opened_at and time.time() - self.opened_at < self.reset_seconds:
            raise RuntimeError("Circuit open: provider unhealthy")
        try:
            result = fn(*args, **kwargs)
            self.failures = 0
            return result
        except Exception as e:
            if "429" in str(e) or "rate" in str(e).lower():
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = time.time()
            raise
Middleware vs API Management¶
| Approach | When | Pros | Cons |
|---|---|---|---|
| Python middleware (this doc) | Notebook-first teams, single-tenant, fast iteration | Easy, instrumented, in-process | Per-runtime; no central enforcement |
| Azure API Management (APIM) | Multi-tenant, multi-app, central policy | Centralized, language-agnostic, has built-in token-rate-limiting policy for AOAI | Operational overhead; another hop |
| Hybrid | Production at scale | APIM for hard limits + Python decorator for attribution | More moving parts |
For multi-tenant SaaS architectures, see multi-tenant-workspace-architecture.md; APIM-fronted routing is the recommended pattern there.
Caching Strategies¶
Three Layers, Three Hit Rates¶
| Layer | What's cached | Provider support | Target hit rate |
|---|---|---|---|
| Prompt prefix cache | Static system prompt + few-shot + retrieved docs | Anthropic native, AOAI native | ≥ 60% |
| Response cache | Full prompt → completion mapping | DIY (Redis/Cosmos) | ≥ 25% (workload-dependent) |
| Embedding cache | Source-text → vector mapping | DIY (Delta or Cosmos) | ≥ 80% (after warm-up) |
Prompt Prefix Cache (Anthropic example)¶
resp = client.messages.create(
model="claude-sonnet-4-5",
system=[
{
"type": "text",
"text": LARGE_STATIC_SYSTEM_PROMPT,
"cache_control": {"type": "ephemeral"}, # 5-min cache
}
],
messages=[{"role": "user", "content": user_question}],
max_tokens=512,
)
# resp.usage.cache_read_input_tokens > 0 on hit
Response Cache (DIY)¶
import json
import hashlib
def cache_key(model: str, messages: list, temp: float) -> str:
    canonical = json.dumps({"m": model, "msgs": messages, "t": temp},
                           sort_keys=True)
    return f"llm:{hashlib.sha256(canonical.encode()).hexdigest()}"

def cached_call(client, model, messages, temperature=0.0, ttl=3600):
    if temperature > 0.2:
        # don't cache non-deterministic prompts
        return client.messages.create(model=model, messages=messages,
                                      temperature=temperature, max_tokens=1024)
    key = cache_key(model, messages, temperature)
    hit = r.get(key)
    if hit:
        return json.loads(hit)  # mark cache_hit=True in usage
    resp = client.messages.create(model=model, messages=messages,
                                  temperature=temperature, max_tokens=1024)
    r.setex(key, ttl, json.dumps(resp.model_dump()))
    return resp
Embedding Cache¶
Hash source text; check vector store; only embed on miss. For RAG ingestion, a content-addressed vector cache typically achieves 95%+ hit rate after the first full crawl, because most documents don't change between runs.
def embed_with_cache(texts: list[str]) -> list[list[float]]:
    hashes = [hashlib.sha256(t.encode()).hexdigest() for t in texts]
    cached = lookup_vectors(hashes)  # Delta table read
    misses = [(h, t) for h, t in zip(hashes, texts) if h not in cached]
    if misses:
        new_vecs = embed_api([t for _, t in misses])
        store_vectors([(h, v) for (h, _), v in zip(misses, new_vecs)])
        cached.update(zip([h for h, _ in misses], new_vecs))
    return [cached[h] for h in hashes]
Fallback Model Strategy¶
Quality gracefully degrades; cost gracefully degrades; the user gets an answer.
Tiered Ladder¶
Tier 1 (flagship): claude-opus-4 or o1
Tier 2 (default): claude-sonnet-4 or gpt-4o
Tier 3 (triage): claude-haiku-4 or gpt-4o-mini
Tier 4 (cached): last-known-good response from cache
Tier 5 (graceful): "I'm at capacity right now; please try again."
Implementation¶
LADDER = [
("anthropic", "claude-opus-4"),
("anthropic", "claude-sonnet-4-5"),
("anthropic", "claude-haiku-4-5"),
]
def resilient_call(messages, ladder=LADDER, fallback_count=0):
    last_err = None
    for provider, model in ladder:
        try:
            return invoke(provider, model, messages,
                          fallback_count=fallback_count)
        except (RateLimitError, BudgetBlockError) as e:
            last_err = e
            fallback_count += 1
            continue
    cached = lookup_response_cache(messages)
    if cached:
        return cached
    raise RuntimeError(f"All tiers exhausted: {last_err}")
Quality Degradation Acceptance¶
For every workload, define what quality drop is acceptable when degrading. Document this in the workload's runbook.
| Workload | Tier 1 → Tier 2 acceptable? | Tier 2 → Tier 3 acceptable? | Stale cache acceptable? |
|---|---|---|---|
| Casino compliance Q&A | ✅ (sub-second SAR detection paramount) | ⚠️ flag in response | ❌ |
| Marketing copy generation | ✅ | ✅ | ✅ (24h) |
| Code generation in Copilot | ✅ | ❌ (drops too far) | ❌ |
| Embedding for RAG | ❌ (vector dim must match) | ❌ | ✅ |
User-Visible Communication¶
When falling back, tell the user. Add a banner: "Currently using a faster, lighter-weight model; answers may be less detailed." Don't silently degrade; users will report the drop as a bug, and you'll waste engineering hours diagnosing intentional behavior.
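A minimal sketch of surfacing that banner, assuming the middleware records fallback_count as in the llm_usage schema above (the helper name is illustrative):

```python
def with_degradation_notice(answer: str, fallback_count: int) -> str:
    """Prefix a visible notice whenever a lower tier served the request."""
    if fallback_count > 0:
        return ("[Notice] Currently using a faster, lighter-weight model; "
                "answers may be less detailed.\n\n" + answer)
    return answer
```

Attach this at the response-assembly layer so every downstream surface (chat UI, API, Teams bot) inherits the same honest signal.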
RAG-Specific Cost Patterns¶
RAG (retrieval-augmented generation) cost splits into three knobs; tune each one independently.
| Knob | Trade-off | Cost lever |
|---|---|---|
| Top-K retrieval | More context → better answer + higher input cost | Start at K=5; A/B vs K=10; rarely > 20 |
| Reranking | Reranker model adds latency + a small per-doc cost; lifts quality 5-15% | Use a small reranker (e.g. Cohere Rerank, ~$1/1K) only when retrieval has > 3 close neighbors |
| Generate-only fallback | Skip retrieval entirely for "small-talk" queries | Classify intent first with gpt-4o-mini; route generate-only when retrieval not needed |
Triage-First RAG¶
def smart_rag(question: str):
    intent = classify_intent(question, model="gpt-4o-mini")  # cheap triage
    if intent == "smalltalk":
        return generate_only(question, model="gpt-4o-mini")
    docs = retrieve(question, top_k=5)
    if max_score(docs) < 0.6:  # low confidence: widen, then rerank back down
        docs = retrieve(question, top_k=20)
        docs = rerank(question, docs, top_k=5)
    return generate(question, docs, model="claude-sonnet-4-5")
For deeper coverage see features/rag-patterns-deep-dive.md (Wave 2 sibling).
Optimization Techniques¶
Smaller Models for Triage; Large Models for Hard Cases¶
Cascade: classify with Haiku → if confidence < 0.8, escalate to Sonnet → if still uncertain, Opus. This pattern has delivered 5-10× cost reductions on production help-desk and classification workloads with no measurable quality loss.
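A minimal sketch of that confidence cascade. Here classify(model, text) -> (label, confidence) is a hypothetical caller-supplied wrapper around the provider client:

```python
TIERS = ["claude-haiku-4-5", "claude-sonnet-4-5", "claude-opus-4"]

def cascade_classify(text, classify, tiers=TIERS, threshold=0.8):
    """Walk up the tier ladder until a model is confident enough."""
    label, conf = None, 0.0
    for model in tiers:
        label, conf = classify(model, text)
        if conf >= threshold:
            return label, model  # stop at the cheapest confident tier
    return label, tiers[-1]     # accept the flagship answer regardless
```

Most inputs exit at the Haiku rung, so the blended per-call cost sits near the small-model rate while hard cases still reach the flagship.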
Few-Shot vs Zero-Shot Trade-Off¶
Each example added to the prompt costs ~50-500 input tokens every call. Run an eval: does adding example N+1 actually move accuracy more than the cost? If a model is already at 95% with 3 examples, adding 7 more is pure waste.
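To make that eval concrete, price a single example. A sketch using the gpt-4o-mini input rate from the snapshot (an assumption; substitute your actual model's rate):

```python
def example_daily_cost(example_tokens: int, daily_calls: int,
                       in_rate_per_1m: float = 0.15) -> float:
    """Daily input-token bill for one few-shot example riding every call."""
    return example_tokens * daily_calls / 1e6 * in_rate_per_1m

# A 300-token example on a 100K-call/day workload costs ~$4.50/day (~$135/month):
cost = example_daily_cost(300, 100_000)
```

Compare that recurring bill against the measured accuracy lift of the extra example; if the lift is within noise, drop it.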
Structured Outputs¶
Use JSON-schema mode (Azure OpenAI) or tool-use (Anthropic) to constrain output to exactly the fields you need. A natural-language summary that should produce 5 fields can balloon to 800 output tokens; the same call with a schema produces 80.
resp = client.messages.create(
model="claude-sonnet-4-5",
tools=[{
"name": "extract_ctr",
"input_schema": {
"type": "object",
"properties": {
"amount": {"type": "number"},
"player_id": {"type": "string"},
"timestamp": {"type": "string"},
},
"required": ["amount", "player_id", "timestamp"],
}
}],
tool_choice={"type": "tool", "name": "extract_ctr"},
messages=[{"role": "user", "content": transaction_text}],
)
System Prompt Caching¶
Move every byte of static content (instructions, schemas, few-shot examples, retrieved boilerplate) before the dynamic content, and mark it cache-eligible. A 12K-token system prompt that hits cache costs the same as 1.2K input tokens on Anthropic.
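A quick arithmetic check of that claim, using the Sonnet-class rates from the snapshot ($3.00/1M input, $0.30/1M cached read; assumed here for illustration):

```python
# A 12K-token system prompt at the ~10% cached-read rate bills like
# 1.2K tokens at the full input rate.
IN_RATE, CACHED_RATE = 3.00, 0.30   # USD per 1M tokens (assumed rates)
miss_cost = 12_000 / 1e6 * IN_RATE      # full-rate read on a cache miss
hit_cost  = 12_000 / 1e6 * CACHED_RATE  # cached read on a hit
equiv     = 1_200 / 1e6 * IN_RATE       # 1.2K tokens at full rate
```

On a high-QPS chat workload where nearly every call is a cache hit, this 10× reduction on the static prefix usually dominates all other input-side savings.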
Batch API for Non-Real-Time¶
Both Azure OpenAI and Anthropic offer batch APIs at ~50% of synchronous pricing, with 24-hour SLA. Use them for: nightly report generation, embedding backfills, bulk classification of historical Bronze data, fine-tuning data prep.
# Anthropic Message Batches API โ 50% off, 24h SLA
batch = client.messages.batches.create(
requests=[
{"custom_id": f"row-{i}",
"params": {"model": "claude-haiku-4-5",
"max_tokens": 256,
"messages": [{"role": "user", "content": row}]}}
for i, row in enumerate(rows)
]
)
KQL Cost Library¶
Five core queries (plus one bonus) cover 90% of the questions finance and engineering will ask. Save these as Workspace Monitoring Saved Queries; pin them to the Cost Dashboard.
1. Top Spenders by User (last 7 days)¶
llm_usage
| where timestamp > ago(7d) and status == "ok"
| summarize spend_usd = sum(cost_usd),
calls = count(),
avg_tokens = avg(total_tokens)
by user_id, cost_center
| top 25 by spend_usd desc
2. Token Consumption by Workload Over Time¶
llm_usage
| where timestamp > ago(30d)
| summarize tokens = sum(total_tokens) by bin(timestamp, 1d), workload
| render timechart with (title="Daily token consumption by workload")
3. Cache Hit Rate (rolling 7-day)¶
llm_usage
| where timestamp > ago(7d) and workload != "embedding"
| summarize hits = countif(cache_hit == true),
total = count()
by bin(timestamp, 1h), workload
| extend hit_rate_pct = round(100.0 * hits / total, 1)
| project timestamp, workload, hit_rate_pct
| render timechart
4. Cost per Business Unit (month-to-date)¶
llm_usage
| where startofmonth(timestamp) == startofmonth(now())
| summarize mtd_spend_usd = sum(cost_usd),
calls = count(),
avg_latency_ms = avg(latency_ms)
by business_unit, environment
| order by mtd_spend_usd desc
5. Anomaly Detection: Sudden Spike¶
// Compare the current hour's spend to a 7-day same-hour baseline; flag if 3x higher
let hourly = llm_usage
    | where timestamp between (ago(7d) .. ago(1h))
    | summarize hour_spend = sum(cost_usd)
        by hod = hourofday(timestamp), bin(timestamp, 1h), cost_center;
let baseline = hourly
    | summarize avg_hourly = avg(hour_spend) by hod, cost_center;
llm_usage
| where timestamp > ago(1h)
| summarize current = sum(cost_usd) by hod = hourofday(timestamp), cost_center
| join kind=inner baseline on hod, cost_center
| extend ratio = current / iff(avg_hourly == 0, 0.0001, avg_hourly)
| where ratio > 3.0 and current > 1.0  // ignore noise under $1
| project hod, cost_center, current, avg_hourly, ratio
| order by ratio desc
6. Bonus โ Most Expensive Models per Token Returned¶
// "Are we paying flagship rates for completions Sonnet would have done fine?"
llm_usage
| where timestamp > ago(7d) and workload in ("chat", "completion", "agent")
| summarize cost_per_1k_completion = sum(cost_usd) * 1000.0 / sum(completion_tokens),
calls = count()
by model, workload
| where calls > 100
| order by cost_per_1k_completion desc
๐ฅ๏ธ Cost Dashboard¶
Build a Power BI report on the Eventhouse llm_usage KQL table via Direct Lake (no import refresh needed; sub-minute freshness). Pin the following pages.
| Page | Visuals | Slicers | Refresh |
|---|---|---|---|
| Today | Total spend (card), spend trend (line), top 10 users (bar), cache hit rate (gauge) | environment, business_unit | Direct Lake (live) |
| Workload Breakdown | Stacked area: spend by workload over 30d; donut: spend by surface | environment, cost_center | Direct Lake |
| Budget Burn-Down | Burn-down per cost_center vs daily/weekly/monthly limits; over-budget table | period | Direct Lake |
| Top Users / Top Prompts | Top 25 users, top 25 prompt_sha256 hashes by spend | last 7/30/90 days | Direct Lake |
| Cache & Fallback | Cache hit rate trend, fallback count distribution, % of calls reaching tier 3+ | workload | Direct Lake |
| Anomalies | Spike detector output (query 5), error rate, 429 count | last 24h / 7d | Direct Lake |
For Direct Lake setup details see features/direct-lake.md.
๐ก One-pager export: Schedule a daily PDF export of the Today + Budget Burn-Down pages to finance and engineering leadership Teams channels via Power Automate. Visibility drives behavior change.
๐ฐ Casino Implementation¶
Copilot Cost Attribution to Teams¶
Casino uses Fabric Copilot heavily: floor managers ask DAX questions in Power BI; data engineers use notebook Copilot for PySpark generation; compliance officers use Q&A in Power BI. Without attribution, all three look like one bucket.
Setup:
1. Designate a single F-SKU as the Copilot capacity (per Phase 9 fabric-iq.md).
2. Create three workspaces: casino-ops, casino-eng, casino-compliance. Tag each with cost_center.
3. Workspace Monitoring exports FabricAuditLogs into the same Eventhouse as llm_usage. Join on workspace_id to attribute Copilot spend per team.
4. A daily KQL job rolls the results forward into llm_usage with surface = 'copilot', with cost_usd estimated as CU consumption × the Copilot rate card.
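Step 4's CU-to-USD roll-forward can be sketched as below. The rate constant (`COPILOT_USD_PER_CU`) and the workspace-to-cost-center map are illustrative assumptions, not published prices; substitute your own rate card and workspace tags.

```python
# Hypothetical CU -> USD roll-forward per cost_center (step 4 above).
# COPILOT_USD_PER_CU is an assumed rate, NOT a published price.
COPILOT_USD_PER_CU = 0.09

WORKSPACE_COST_CENTER = {
    "casino-ops": "CC-OPS",
    "casino-eng": "CC-ENG",
    "casino-compliance": "CC-COMP",
}

def attribute_copilot_spend(cu_by_workspace: dict) -> dict:
    """Roll daily CU consumption per workspace up to cost_center USD estimates."""
    spend = {}
    for workspace, cu in cu_by_workspace.items():
        # Untagged workspaces surface as their own bucket so they get fixed, not hidden
        cc = WORKSPACE_COST_CENTER.get(workspace, "CC-UNTAGGED")
        spend[cc] = spend.get(cc, 0.0) + cu * COPILOT_USD_PER_CU
    return spend
```

Surfacing an explicit `CC-UNTAGGED` bucket is deliberate: it makes missing workspace tags visible on the dashboard instead of silently dropping that spend.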
Data Agents in Floor Monitoring¶
The da-casino-compliance Data Agent (see features/data-agents.md) handles ~5,000 turns/day from compliance officers. Each turn averages 3K reasoning tokens.
Controls applied:
- Daily project budget: $300 (project = floor-monitoring)
- Per-user quota: 200 turns/day (token bucket: capacity 200, refill 1 turn per 7 minutes)
- Few-shot trim: 4 examples per data source (was 12; cut input tokens 35% with no measurable accuracy drop)
- Source cap: 3 data sources max (was 5; reduces NL2X routing tokens)
- Fallback: Sonnet 4 → Haiku 4 when budget burn > 80%
- Result: $9.4K/mo → $4.1K/mo, accuracy unchanged at 91%.
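The per-user quota above is a standard token bucket (capacity 200, refill one turn per 7 minutes, i.e. 1/420 per second); a minimal in-process sketch. Production versions would persist state per user_id (e.g. in Redis) rather than in process memory.

```python
import time

class TokenBucket:
    """Per-user turn quota: capacity 200 turns, refilling 1 turn every 7 minutes."""

    def __init__(self, capacity: float = 200.0, refill_per_sec: float = 1 / 420):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity          # start full
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                    # deny; caller surfaces a quota message
```

A burst-friendly shape: a user can spend all 200 turns quickly, then trickles back at the refill rate instead of being hard-blocked until midnight.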
AI Functions on Bronze¶
Compliance team uses ai.classify to auto-tag SAR-suspicious narrative fields. Originally this ran on every Bronze row nightly (14M calls/night). Now:
- Pre-filter Bronze to only rows where txn_amount BETWEEN 8000 AND 9999
- Hash the narrative; skip the call if the hash matches yesterday's run
- Use claude-haiku-4-5 instead of Sonnet
- Result: 14M → 22K calls/night; cost down 99.4%; same SAR detection rate.
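The pre-filter and hash-skip guards combine into one pass before any model call is issued. A minimal sketch, assuming rows arrive as dicts with txn_amount and narrative fields (illustrative names, not the actual Bronze schema).

```python
import hashlib

def rows_to_classify(todays_rows, yesterdays_hashes):
    """Apply both cost guards before calling ai.classify: amount band, then hash-skip."""
    keep = []
    for row in todays_rows:
        if not 8000 <= row["txn_amount"] <= 9999:
            continue  # outside the structuring band -> never sent to the model
        digest = hashlib.sha256(row["narrative"].encode("utf-8")).hexdigest()
        if digest in yesterdays_hashes:
            continue  # narrative unchanged since yesterday -> reuse prior label
        keep.append((digest, row))
    return keep
```

Persist the emitted digests after each run; they become `yesterdays_hashes` for the next night.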
๐๏ธ Federal Implementation¶
DOJ Data Agents¶
The DOJ antitrust review agent (da-doj-antitrust) operates on case files and merger filings. High-stakes: quality matters more than cost, but rate limiting and audit are mandatory under DOJ governance.
Controls:
- Tier 1 only (Opus 4): no automatic fallback; quality is non-negotiable
- Hard daily budget: $1,200/day; on breach, block new turns (Sev 1 alert; engineering pages the Director of AI)
- Audit table: every prompt + completion logged to llm_traces with 7-year retention per the federal records schedule
- Cross-geo disabled: all calls pinned to US-Gov regions (FedRAMP High)
- PII scan: every prompt runs through Purview DLP before reaching the model
USDA Copilot¶
USDA economists use Fabric Copilot in Power BI for crop-production analysis. Lower-stakes; cost matters more than flagship quality.
Controls:
- Default to gpt-4o-mini for DAX generation; escalate to gpt-4o only via the user's "Try harder" button
- Monthly cost-center budget: $4,000
- Cache hit target: 70% (most NASS questions repeat across users)
- Embedding cache hit rate: 99% (NASS commodity descriptions don't change between runs)
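The default-then-escalate control is a one-line model switch; a sketch assuming an OpenAI-SDK-shaped client object, with the "Try harder" click surfaced as a boolean flag (function and parameter names here are illustrative).

```python
def generate_dax(question: str, client, try_harder: bool = False) -> str:
    """Cheap model by default; flagship only when the user explicitly escalates."""
    model = "gpt-4o" if try_harder else "gpt-4o-mini"
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content
```

Log which flag produced each completion; the ratio of escalated to default calls tells you whether the cheap tier is actually good enough.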
๐ซ Anti-Patterns¶
| Anti-Pattern | Why It Hurts | What to Do Instead |
|---|---|---|
| No middleware; direct provider calls everywhere | No attribution, no caching, no fallback, no budget enforcement | Wrap every call with the track_llm decorator |
| Hardcoded prices in code | Drift from reality the day a provider changes pricing | Pricing in config_llm_pricing Delta table; refresh quarterly |
| Storing raw prompts/completions in llm_usage | Data exfiltration risk; bloated KQL table; regulatory exposure | Hash; sample 1% to llm_traces with strict ACL |
| Provider-side rate limit as primary defense | App crashes on 429; users retry the same call 5× | Token bucket per user; circuit breaker; fallback chain |
| One model for every workload | Paying flagship rates for triage; bad latency on simple tasks | Tier ladder: small for triage, escalate on uncertainty |
| Verbose system prompts not marked for caching | Pay full input rate every call for the same 8K boilerplate | Mark the static prefix cache_control: ephemeral; reorder static-first |
| No anomaly alert on llm_usage | Demo loop runs all weekend; first signal is the monthly bill | Wire query 5 to an Action Group; Sev 2 on 3× spike |
| Letting agents loop without a max-step guard | Agent retries tools 40 times; runaway reasoning cost | Cap turns; cap reasoning tokens; circuit-break on repeated tool errors |
๐ Implementation Checklist¶
Use this before declaring an LLM workload "production-ready". Ties into the broader MLOps production checklist.
Tracking & attribution
- [ ] Every LLM call goes through the track_llm decorator (or equivalent)
- [ ] llm_usage Eventhouse table created with the full schema
- [ ] All five attribution tags (cost_center, business_unit, project, workload, environment) are required at call time
- [ ] Pricing table (config_llm_pricing) maintained; last updated within 90 days
- [ ] Raw prompts/completions hashed (not stored); llm_traces 1% sample with ACL

Budgeting & alerts
- [ ] Daily, weekly, monthly budgets defined in config_llm_budgets for every cost_center and project
- [ ] Soft-limit alert wired to Teams (Sev 3)
- [ ] Hard-limit block enforced via rate limiter (Sev 1)
- [ ] Anomaly detector (3× spike) wired to an Action Group
- [ ] Daily cost dashboard auto-emailed to engineering + finance leadership

Rate limiting & caching
- [ ] Per-user token bucket implemented
- [ ] Adaptive throttling kicks in at 80% daily burn
- [ ] Circuit breaker on repeated 429s
- [ ] Prompt prefix cache enabled where supported (Anthropic cache_control / AOAI)
- [ ] Response cache (Redis/Cosmos) with hit-rate tracking
- [ ] Embedding cache with content-hash key

Fallback & resilience
- [ ] Tiered model ladder defined per workload
- [ ] Quality degradation acceptance documented per workload
- [ ] User-visible banner when running degraded
- [ ] All-tiers-exhausted path returns a cached response or a graceful message (never a stack trace)

Optimization
- [ ] Triage-first cascade (small → large) implemented for batch workloads
- [ ] System prompts ordered static-first
- [ ] Structured outputs (JSON schema / tool use) used where possible
- [ ] Batch API used for non-real-time workloads
- [ ] AI Function row-explosion guard: pre-filter or sample before applying

Governance
- [ ] Mandatory tags enforced (fail closed)
- [ ] Audit logs joined with Purview for sensitive workloads
- [ ] Federal/regulated workloads pinned to compliant regions
- [ ] On-call runbook covers: budget breach, provider outage, runaway loop
๐ References¶
Provider Pricing (refresh quarterly)¶
| Provider | URL | Captured |
|---|---|---|
| Azure OpenAI Service Pricing | https://azure.microsoft.com/en-us/pricing/details/cognitive-services/openai-service/ | 2026-04-27 |
| OpenAI API Pricing | https://openai.com/api/pricing/ | 2026-04-27 |
| Anthropic API Pricing | https://www.anthropic.com/pricing | 2026-04-27 |
| Voyage AI Embedding Pricing | https://docs.voyageai.com/docs/pricing | 2026-04-27 |
| Anthropic Prompt Caching | https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching | 2026-04-27 |
| AOAI Prompt Caching | https://learn.microsoft.com/azure/ai-services/openai/how-to/prompt-caching | 2026-04-27 |
| Anthropic Message Batches | https://docs.anthropic.com/en/api/messages-batches | 2026-04-27 |
Microsoft Fabric Documentation¶
- AI Services in Fabric
- AI Functions
- Fabric Copilot Capacity
- Workspace Monitoring
- Eventhouse Vector Search
Wave 2 Cross-References (Anchor & Siblings)¶
- MLOps for Fabric Production โ Wave 2 anchor
- Model Monitoring & Drift Detection
- Feature Store on OneLake
- Responsible AI Framework
- RAG Patterns Deep Dive
- Prompt Engineering for Fabric
- LLM Evaluation Harness
Wave 1 Operational Docs¶
- Monitoring & Observability
- Alerting & Data Activator
- FinOps & Cost Governance โ Fabric-capacity sibling
- Capacity Planning & Cost Optimization
- Multi-Tenant Workspace Architecture