💸 LLM Cost Tracking & FinOps for AI Workloads

Token Economics, Budgeting, Rate Limiting, and Fallback Strategies for Fabric AI Workloads

Last Updated: 2026-04-27 | Version: 1.0.0 | Anchor: MLOps for Fabric Production (Wave 2)


🎯 Why LLM Cost Tracking

Fabric's AI surface area exploded in 2025-2026: Copilot in every workload, Data Agents on every domain, AI Functions on every Spark cluster, Eventhouse vector search, and an open door to Azure OpenAI / Anthropic / OpenAI from any notebook. Each of these consumes tokens. Tokens cost real money, and unlike CU consumption they don't show up cleanly on a single Fabric capacity meter.

This document is the FinOps backbone for AI workloads. It covers how to instrument every LLM call, attribute spend to the team or feature that triggered it, set budgets that actually block runaway spend, and design fallback chains that degrade quality before they degrade your P&L.

What "Production-Grade LLM Cost Control" Looks Like

| Aspect | No Discipline | Production-Grade |
|---|---|---|
| Visibility | "Azure OpenAI bill went up - investigating" | Per-user, per-workload, per-model token spend in real time |
| Attribution | Single shared API key, no tags | Every call tagged: cost_center, business_unit, project, workload, env |
| Budgets | Annual budget, reviewed quarterly | Daily/weekly/monthly budgets per workload with hard cutoffs |
| Rate limiting | Provider-side 429s only | Token-bucket per user + adaptive throttling near budget |
| Caching | None | Prompt cache + response cache + embedding cache; hit-rate tracked |
| Fallback | Hard fail if Opus is rate-limited | Graceful degrade: Opus → Sonnet → Haiku → cached response |
| Optimization | One model for everything | Triage with small model; escalate only on uncertainty |
| Audit | "Look at the bill in 30 days" | Per-call log to Eventhouse, queryable, alertable, joinable to business KPIs |

Observed Waste Patterns

These show up in nearly every LLM rollout that skips this discipline:

  1. Demo loops left running: a notebook that re-asks the same prompt every minute for "live demo" purposes, forgotten on a Friday, $4K by Monday.
  2. No prompt caching: Anthropic and Azure OpenAI both support caching the static system prompt; teams pay full input rate for the same 8K-token preamble on every call.
  3. Wrong model for the task: using GPT-4o or Claude Opus to extract a date from a sentence (a job for gpt-4o-mini or Haiku at ~1/15 the cost).
  4. AI Function row explosion: applying a per-row LLM call across a 50M-row Bronze table without sampling or filtering first.
  5. Embedding regeneration: re-embedding the same documents nightly because nobody hashed the source text and stored a vector cache.
  6. Reasoning leak in agents: a Data Agent or custom agent loops on tool calls, racking up 40+ reasoning turns before erroring out, with no max-step guard.
  7. Verbose system prompts: a 12K-token system prompt for a chat that gets 200-token user inputs; an input-heavy cost ratio that stays invisible without a dashboard.
  8. No fallback path: the provider rate-limits → the app crashes → engineers re-try the entire batch → costs double.

๐Ÿ“ Scope: This is the LLM-cost sub-doc of the Wave 2 anchor mlops-fabric-production.md. For broader Fabric capacity FinOps see finops-cost-governance.md and capacity-planning-cost-optimization.md.


🧾 LLM Cost Surfaces in Fabric

Every dollar of LLM spend in Fabric flows through one of these surfaces. Track each separately.

| # | Surface | Driver | Billing Model | Visibility | Mitigation |
|---|---|---|---|---|---|
| 1 | Fabric Copilot (chat in workspace, notebook code-gen, DAX Copilot) | Tokens consumed, billed against Copilot capacity | Capacity Units (CU) on F-SKU | Capacity Metrics App, Workspace Monitoring | Designate Copilot capacity; throttle by user/workload tenant settings |
| 2 | Data Agents (Q&A reasoning, NL2SQL/DAX/KQL) | Tokens per turn × turns per session | CU on F-SKU + provider-side reasoning tokens | Workspace Monitoring + Data Agent audit logs | Cap turns; trim few-shot examples; restrict source count |
| 3 | AI Functions (ai.classify, ai.extract, ai.translate, ai.summarize) | Per-row API call × row count | CU + per-row | Spark UI, Workspace Monitoring | Pre-filter, sample, or batch; cache by content hash |
| 4 | Custom LLM calls from notebooks (Azure OpenAI, OpenAI direct, Anthropic, Mistral via AI Foundry) | Tokens × model tier | Provider direct billing (Azure subscription or external) | Only what you instrument | This doc: wrap every call |
| 5 | Embeddings generation (vector DB, RAG indexing) | Tokens × embedding model | Provider per-1M-tokens | Provider portal | Hash content; reuse vectors; batch API |
| 6 | Fine-tuning | Training tokens + hosted deployment hours | Provider per-1K-tokens + per-hour | Provider portal | Rare; require approval; prefer prompt engineering + RAG first |
| 7 | Vector store (Eventhouse vector index, Azure AI Search, Cosmos for NoSQL) | Storage + per-query | Storage GB-month + per-query unit | Service-specific | Right-size index; partition by tenant; cold-tier old vectors |

💡 Attribution gap: Surfaces 1–3 are billed via Fabric capacity, so they show up in CU metrics but get aggregated. Surfaces 4–7 are billed via your Azure subscription or an external provider and do not appear in Fabric metrics at all. The only reliable single pane of glass is the llm_usage Eventhouse table this doc defines.


🧮 Token Economics 101

Input vs. Output

LLM pricing is asymmetric: output tokens cost 3-5× input tokens on most providers. A chat that returns a 4-line answer to a 200-line context is almost entirely input cost. A code-gen call that emits a 2K-line file is mostly output cost. Track both separately; averaging hides where the money actually goes.
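
A quick sketch of why the split matters. The per-1M rates below are illustrative placeholders, not a provider quote; pull real numbers from config_llm_pricing:

# Illustrative only: rates are placeholders, not provider pricing.
INPUT_RATE_PER_1M = 3.00     # $/1M input tokens (assumed)
OUTPUT_RATE_PER_1M = 15.00   # $/1M output tokens (assumed, 5x input)

def call_cost(prompt_tokens: int, completion_tokens: int) -> float:
    return (prompt_tokens / 1_000_000 * INPUT_RATE_PER_1M
            + completion_tokens / 1_000_000 * OUTPUT_RATE_PER_1M)

call_cost(20_000, 150)    # long-context chat: ~$0.062, ~96% of it is input
call_cost(800, 6_000)     # code generation:   ~$0.092, ~97% of it is output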

Cached Input Tokens

Both Anthropic and Azure OpenAI support prompt caching: the static prefix (system prompt, few-shot examples, retrieved docs) is cached server-side and billed at a fraction of the input rate (typically 10% on Anthropic, 50% on Azure OpenAI). Cache TTL is usually 5 min (Anthropic) or session-scoped (Azure OpenAI). Restructure prompts to put static content first and dynamic content last to maximize cache hits.

Reasoning Tokens

Reasoning models (o1, o3, Claude with extended thinking) emit hidden "thinking" tokens that you pay for but don't see. These can dwarf visible output. Always log reasoning_tokens separately from completion_tokens.
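
A minimal sketch of extracting that split from an OpenAI-style response object. The usage.completion_tokens_details.reasoning_tokens path is how recent OpenAI SDKs expose it; treat the attribute path, and whether completion_tokens includes the reasoning count, as assumptions to verify against your SDK version:

def extract_token_counts(resp) -> dict:
    """Defensive extraction; assumes completion_tokens includes reasoning tokens."""
    usage = resp.usage
    details = getattr(usage, "completion_tokens_details", None)
    reasoning = getattr(details, "reasoning_tokens", 0) or 0
    return {
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens - reasoning,  # visible output only
        "reasoning_tokens": reasoning,
    }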

Model-Tier Pricing Curve

Within a provider family, pricing typically follows a 1×, 5×, 25× curve from small → medium → flagship. This is the foundation of the fallback strategy: pick the right rung for the task.

| Tier | Use For | Typical Cost Multiplier |
|---|---|---|
| Small (Haiku, gpt-4o-mini, gpt-3.5-turbo) | Triage, classification, extraction, routing | 1× |
| Medium (Sonnet, gpt-4o) | Most chat, Q&A, summarization | 5× |
| Flagship (Opus, o1/o3, gpt-4 turbo) | Complex reasoning, multi-step planning, code-gen | 15-30× |

💵 Pricing Snapshot

โš ๏ธ Pricing changes frequently. All numbers below are USD per 1M tokens, captured 2026-04-27. Refresh quarterly by re-pulling from the provider URLs in References. Do not embed these in code; pull live from a config table (config_llm_pricing) that your finance team owns.

Azure OpenAI (East US, Pay-As-You-Go)

| Model | Input ($/1M) | Cached Input ($/1M) | Output ($/1M) | Notes |
|---|---|---|---|---|
| gpt-4o (2024-11-20) | $2.50 | $1.25 | $10.00 | Flagship multimodal |
| gpt-4o-mini | $0.15 | $0.075 | $0.60 | Default for triage |
| gpt-4-turbo | $10.00 | n/a | $30.00 | Legacy - migrate to gpt-4o |
| gpt-3.5-turbo | $0.50 | n/a | $1.50 | Legacy - migrate to gpt-4o-mini |
| o1 | $15.00 | $7.50 | $60.00 | Reasoning; outputs include reasoning tokens |
| o3-mini | $1.10 | $0.55 | $4.40 | Reasoning, lower tier |

Source: https://azure.microsoft.com/en-us/pricing/details/cognitive-services/openai-service/ (captured 2026-04-27)

Anthropic Claude (Direct API or via Azure AI Foundry)

| Model | Input ($/1M) | Cached Read ($/1M) | Cache Write ($/1M) | Output ($/1M) | Notes |
|---|---|---|---|---|---|
| Claude Opus 4.x | $15.00 | $1.50 | $18.75 | $75.00 | Flagship |
| Claude Sonnet 4.x | $3.00 | $0.30 | $3.75 | $15.00 | Default chat / agents |
| Claude Haiku 4.x | $0.80 | $0.08 | $1.00 | $4.00 | Triage / extraction |

Source: https://www.anthropic.com/pricing (captured 2026-04-27)

Embedding Models

| Model | Provider | $/1M Input Tokens | Dimensions |
|---|---|---|---|
| text-embedding-3-small | Azure OpenAI | $0.020 | 1536 |
| text-embedding-3-large | Azure OpenAI | $0.130 | 3072 |
| text-embedding-ada-002 | Azure OpenAI | $0.100 | 1536 |
| voyage-3 | Voyage AI (Anthropic-recommended) | $0.060 | 1024 |

Source: https://openai.com/api/pricing/ + https://docs.voyageai.com/docs/pricing (captured 2026-04-27)

💡 Pricing change protocol: Add a quarterly task to your team's planning rhythm. Update the config_llm_pricing table; do not edit this doc's numbers without also updating the capture date.


๐Ÿ—๏ธ Reference Architecture

flowchart LR
    subgraph Caller["📓 Caller"]
        NB[Notebook /<br/>Pipeline /<br/>Agent]
    end

    subgraph Middleware["🛡️ LLM Middleware"]
        TC[Token Counter]
        RL[Rate Limiter<br/>Token Bucket]
        CACHE[Response<br/>Cache]
        BUDGET[Budget Check]
        FB[Fallback<br/>Router]
    end

    subgraph Providers["🤖 Providers"]
        AOAI[Azure<br/>OpenAI]
        ANTH[Anthropic]
        OAI[OpenAI<br/>Direct]
    end

    subgraph Telemetry["📈 Telemetry"]
        EH[(Eventhouse<br/>llm_usage)]
        DASH[Power BI<br/>Cost Dashboard]
        ALERT[Action<br/>Groups]
    end

    NB --> TC --> RL --> BUDGET
    BUDGET -->|"under budget"| CACHE
    BUDGET -.->|"over budget"| ALERT
    CACHE -->|"miss"| FB
    CACHE -.->|"hit"| NB
    FB --> AOAI
    FB --> ANTH
    FB --> OAI
    AOAI --> EH
    ANTH --> EH
    OAI --> EH
    EH --> DASH
    EH --> ALERT
    ALERT -.->|"throttle"| RL

    style Middleware fill:#6C3483,stroke:#4A235A,color:#fff
    style Providers fill:#2471A3,stroke:#1A5276,color:#fff
    style Telemetry fill:#27AE60,stroke:#1E8449,color:#fff

Component Map

| Component | Implementation | Purpose |
|---|---|---|
| Token Counter | tiktoken for OpenAI, anthropic.count_tokens for Claude | Count tokens before the call for budget pre-check (sketch below) |
| Rate Limiter | Redis token bucket or in-memory per worker | Throttle per user/tenant |
| Budget Check | KQL on llm_usage rolling window | Block when day/week/month cap reached |
| Response Cache | Redis or Cosmos for NoSQL keyed on prompt hash | Skip provider call entirely on cache hit |
| Fallback Router | Try-catch chain with model-tier ladder | Graceful degradation on rate limit / budget |
| Eventhouse | llm_usage KQL DB | Single source of truth for cost analytics |
| Action Groups | Azure Monitor → Teams / PagerDuty / Email | Wired into observability stack |
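
A minimal pre-check sketch for the Token Counter row above. The tiktoken encoding fallback and the Anthropic count_tokens call are assumptions to verify against the SDK versions you have installed:

import tiktoken
from anthropic import Anthropic

_anthropic = Anthropic()

def estimate_openai_tokens(text: str, model: str = "gpt-4o") -> int:
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        enc = tiktoken.get_encoding("o200k_base")   # assumed fallback for the 4o family
    return len(enc.encode(text))

def estimate_anthropic_tokens(messages: list[dict], model: str = "claude-sonnet-4-5") -> int:
    return _anthropic.messages.count_tokens(model=model, messages=messages).input_tokens

# Refuse to spend before the provider ever sees the prompt.
MAX_PROMPT_TOKENS = 30_000   # illustrative per-call ceiling

def precheck(prompt: str, model: str = "gpt-4o") -> None:
    if estimate_openai_tokens(prompt, model) > MAX_PROMPT_TOKENS:
        raise ValueError("Prompt exceeds the per-call token ceiling; trim context first.")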

๐Ÿ› ๏ธ Tracking Implementation

The cardinal rule: no LLM call without a usage record. Wrap every call with a decorator that logs to Eventhouse on completion (success or failure).

llm_usage Schema

.create table llm_usage (
    timestamp:        datetime,
    request_id:       string,
    tenant_id:        string,
    user_id:          string,
    workload:         string,         // chat | completion | embedding | agent | aifunc
    surface:          string,         // copilot | data_agent | ai_function | custom | embedding
    provider:         string,         // azure_openai | anthropic | openai
    model:            string,         // gpt-4o | claude-sonnet-4 | text-embedding-3-large
    prompt_tokens:    long,
    cached_tokens:    long,
    completion_tokens:long,
    reasoning_tokens: long,
    total_tokens:     long,
    cost_usd:         real,
    latency_ms:       long,
    cache_hit:        bool,
    fallback_count:   int,            // how many tiers we tried before success
    status:           string,         // ok | rate_limited | budget_block | error
    error_message:    string,
    cost_center:      string,
    business_unit:    string,
    project:          string,
    environment:      string,         // dev | staging | prod
    prompt_sha256:    string,         // for cache lookup; never store raw prompt
    completion_sha256:string
)

๐Ÿ” PII rule: Never store the raw prompt or completion text in llm_usage. Hash with SHA-256 (truncated to first 16 hex chars is fine for cache lookup). For deeper debugging, route a sampled subset (1%) into a separate llm_traces table that has tighter access control and a 30-day retention policy.

PySpark Decorator (drop-in)

# notebooks/utils/llm_tracking.py
import hashlib
import json
import os
import time
import uuid
from contextvars import ContextVar
from datetime import datetime, timezone
from functools import wraps

from pyspark.sql import SparkSession

# Context-local attribution (set per session/request)
_ctx_tenant = ContextVar("tenant_id", default="unknown")
_ctx_user = ContextVar("user_id", default="unknown")
_ctx_cost_center = ContextVar("cost_center", default="unallocated")
_ctx_workload = ContextVar("workload", default="custom")
_ctx_project = ContextVar("project", default="unknown")
_ctx_env = ContextVar("environment", default=os.getenv("FABRIC_ENV", "dev"))


def set_attribution(*, tenant_id, user_id, cost_center, workload, project,
                    environment=None):
    """Call once at the top of any notebook or pipeline activity."""
    _ctx_tenant.set(tenant_id)
    _ctx_user.set(user_id)
    _ctx_cost_center.set(cost_center)
    _ctx_workload.set(workload)
    _ctx_project.set(project)
    if environment:
        _ctx_env.set(environment)


def _sha16(s: str) -> str:
    return hashlib.sha256(s.encode("utf-8")).hexdigest()[:16]


def _price(provider: str, model: str, prompt_t: int, cached_t: int,
           completion_t: int, reasoning_t: int) -> float:
    """
    Look up prices from a Spark table that finance owns.
    Never hardcode pricing here.
    """
    spark = SparkSession.builder.getOrCreate()
    row = (spark.table("lh_gold.config_llm_pricing")
           .filter(f"provider = '{provider}' AND model = '{model}'")
           .first())
    if row is None:
        return 0.0
    inp = (prompt_t - cached_t) / 1_000_000 * row.input_price_per_1m
    cin = cached_t / 1_000_000 * row.cached_input_price_per_1m
    out = (completion_t + reasoning_t) / 1_000_000 * row.output_price_per_1m
    return round(inp + cin + out, 6)


def _emit_to_eventhouse(record: dict) -> None:
    """
    Append a single record to the llm_usage Eventhouse table.
    Use Eventstream or a direct ingest endpoint in real deployment.
    """
    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([record])
    (df.write
       .format("delta")
       .mode("append")
       .saveAsTable("llm_eventhouse.llm_usage"))


def track_llm(provider: str, surface: str = "custom"):
    """
    Decorator: wrap any function that returns a (response, usage) tuple where
    usage has prompt_tokens / completion_tokens / cached_tokens / reasoning_tokens.
    """
    def deco(fn):
        @wraps(fn)
        def wrapped(*args, **kwargs):
            t0 = time.time()
            request_id = str(uuid.uuid4())
            status = "ok"
            err_msg = ""
            usage = None
            response = None
            try:
                response, usage = fn(*args, **kwargs)
            except Exception as e:
                status = "error"
                err_msg = str(e)[:500]
                raise
            finally:
                latency_ms = int((time.time() - t0) * 1000)
                u = usage or {}
                pt = int(u.get("prompt_tokens", 0))
                ct = int(u.get("cached_tokens", 0))
                ot = int(u.get("completion_tokens", 0))
                rt = int(u.get("reasoning_tokens", 0))
                model = u.get("model", kwargs.get("model", "unknown"))
                prompt_text = kwargs.get("prompt", "") or json.dumps(
                    kwargs.get("messages", []))
                completion_text = ""
                if response is not None:
                    completion_text = str(response)[:8192]
                _emit_to_eventhouse({
                    "timestamp":         datetime.now(timezone.utc),
                    "request_id":        request_id,
                    "tenant_id":         _ctx_tenant.get(),
                    "user_id":           _ctx_user.get(),
                    "workload":          _ctx_workload.get(),
                    "surface":           surface,
                    "provider":          provider,
                    "model":             model,
                    "prompt_tokens":     pt,
                    "cached_tokens":     ct,
                    "completion_tokens": ot,
                    "reasoning_tokens":  rt,
                    "total_tokens":      pt + ot + rt,
                    "cost_usd":          _price(provider, model, pt, ct, ot, rt),
                    "latency_ms":        latency_ms,
                    "cache_hit":         bool(u.get("cache_hit", False)),
                    "fallback_count":    int(u.get("fallback_count", 0)),
                    "status":            status,
                    "error_message":     err_msg,
                    "cost_center":       _ctx_cost_center.get(),
                    "business_unit":     u.get("business_unit", ""),
                    "project":           _ctx_project.get(),
                    "environment":       _ctx_env.get(),
                    "prompt_sha256":     _sha16(prompt_text),
                    "completion_sha256": _sha16(completion_text),
                })
            return response
        return wrapped
    return deco

Usage Example

from utils.llm_tracking import track_llm, set_attribution
from anthropic import Anthropic

set_attribution(
    tenant_id="casino-prod",
    user_id="floor-manager-42",
    cost_center="casino-data-science",
    workload="agent",
    project="floor-monitoring",
)

client = Anthropic()


@track_llm(provider="anthropic", surface="custom")
def ask_claude(prompt: str, model: str = "claude-sonnet-4-5"):
    resp = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    usage = {
        "model": model,
        "prompt_tokens": resp.usage.input_tokens,
        "cached_tokens": getattr(resp.usage, "cache_read_input_tokens", 0),
        "completion_tokens": resp.usage.output_tokens,
        "reasoning_tokens": 0,
    }
    return resp.content[0].text, usage


answer = ask_claude(prompt="Summarize today's CTR filings.")  # keyword arg so the decorator can hash the prompt

💡 Apply the decorator to AI Functions too by wrapping the Spark UDF that calls ai.classify / ai.extract. For Data Agents, instrument the SDK client; for Copilot, ingest the FabricAuditLogs.DataAgentQuery events into llm_usage via Eventstream (best-effort token counts).


๐Ÿท๏ธ Cost Attribution Tags

Every record in llm_usage carries five mandatory tags. Make them required at attribution time โ€” fail closed if any are missing.

| Tag | Example | Source | Why |
|---|---|---|---|
| cost_center | casino-data-science | Org chart / HR system | Finance chargeback |
| business_unit | gaming-ops | Org chart | P&L roll-up |
| project | floor-monitoring | Project tracker (Archon, Jira) | Feature-level ROI |
| workload | chat / completion / embedding / agent / aifunc | Caller-declared | Cost-by-pattern analysis |
| environment | dev / staging / prod | FABRIC_ENV env var | Prevent dev runaway from blocking prod |

Mandatory Tag Enforcement

def set_attribution(**kwargs):
    required = {"tenant_id", "user_id", "cost_center", "workload", "project"}
    missing = required - kwargs.keys()
    if missing:
        raise ValueError(f"Missing required attribution tags: {missing}")
    # ... set context vars

๐Ÿ“ Note: Mirror these tags into the Spark conf so Fabric capacity FinOps cost rollups align with LLM cost rollups: spark.conf.set("spark.fabric.cost_center", cost_center).


📊 Budgeting & Alerts

Budget Hierarchy

Tenant
└── Business Unit (monthly budget)
    └── Cost Center (weekly budget)
        └── Project (daily budget)
            └── User (per-session quota)

Higher levels are soft caps (alert + report). Project- and user-level are hard caps (block via rate limiter / circuit breaker).
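
A sketch of the project-level hard cap as an in-process gate, assuming the llm_usage and config_llm_budgets tables defined in this doc. The Spark reads are placeholders (cache the budget row and pre-aggregate spend in practice), and BudgetBlockError is the exception the fallback router later in this doc catches:

from datetime import datetime, timezone

from pyspark.sql import SparkSession
from pyspark.sql import functions as F


class BudgetBlockError(RuntimeError):
    """Raised when a project has burned its daily hard limit."""


def _spend_today_usd(project: str) -> float:
    spark = SparkSession.builder.getOrCreate()
    today = datetime.now(timezone.utc).date().isoformat()
    row = (spark.table("llm_eventhouse.llm_usage")
           .filter(F.col("project") == project)
           .filter(F.to_date("timestamp") == F.lit(today))
           .agg(F.sum("cost_usd").alias("spend"))
           .first())
    return float(row.spend or 0.0)


def _daily_hard_limit_usd(project: str) -> float:
    spark = SparkSession.builder.getOrCreate()
    row = (spark.table("lh_gold.config_llm_budgets")
           .filter("scope_type = 'project' AND period = 'daily'")
           .filter(F.col("scope_value") == project)
           .first())
    return float(row.hard_limit_usd) if row else float("inf")


def budget_gate(project: str) -> None:
    """Call before every LLM request; raises instead of spending."""
    if _spend_today_usd(project) >= _daily_hard_limit_usd(project):
        raise BudgetBlockError(f"Daily hard limit reached for project '{project}'")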

Budget Configuration Table

-- lh_gold.config_llm_budgets
CREATE TABLE lh_gold.config_llm_budgets (
    scope_type      STRING,   -- 'tenant' | 'business_unit' | 'cost_center' | 'project' | 'user'
    scope_value     STRING,   -- e.g. 'casino-data-science'
    period          STRING,   -- 'daily' | 'weekly' | 'monthly'
    soft_limit_usd  DOUBLE,   -- alert at this point
    hard_limit_usd  DOUBLE,   -- block at this point
    action_group_id STRING,   -- Azure Monitor Action Group resource id
    contact         STRING    -- Teams channel or email
)
USING DELTA;

Budget-Burn KQL

// Soft-limit burn detector โ€” runs every 15 minutes via Workspace Monitoring scheduled query
let budgets = externaldata(scope_type:string, scope_value:string, period:string,
                            soft_limit_usd:real, hard_limit_usd:real)
              [@'https://{onelake}/lh_gold/Tables/config_llm_budgets']
              with(format='parquet');
let day_start = startofday(now());
llm_usage
| where timestamp >= day_start
| summarize spend_usd = sum(cost_usd) by cost_center
| join kind=inner (
    budgets | where period == "daily"
            | project scope_value, soft_limit_usd, hard_limit_usd
  ) on $left.cost_center == $right.scope_value
| extend pct_burned = round(spend_usd / hard_limit_usd * 100, 1)
| where spend_usd >= soft_limit_usd
| project cost_center, spend_usd, soft_limit_usd, hard_limit_usd, pct_burned
| order by pct_burned desc

Wire this query as an Azure Monitor scheduled query alert → Action Group → Teams + on-call. Severity:

- ≥ soft_limit (alert): Sev 3
- ≥ 90% of hard_limit (warn): Sev 2
- ≥ hard_limit (block): Sev 1 + automated rate-limiter tightening (see Rate Limiting)

For cross-references on Action Group wiring, see monitoring-observability.md and alerting-data-activator.md.


🚦 Rate Limiting Patterns

Provider-side 429s are a last line of defense. Build your own first.

Token Bucket per User/Tenant

# Redis-backed token bucket
import redis
import time

r = redis.Redis(host="redis.fabric.local")

def take_token(key: str, capacity: int, refill_per_sec: float) -> bool:
    now = time.time()
    pipe = r.pipeline()
    pipe.hgetall(f"bucket:{key}")
    state = pipe.execute()[0] or {}
    tokens = float(state.get(b"tokens", capacity))
    last = float(state.get(b"last", now))
    tokens = min(capacity, tokens + (now - last) * refill_per_sec)
    if tokens < 1:
        return False
    tokens -= 1
    r.hset(f"bucket:{key}", mapping={"tokens": tokens, "last": now})
    return True


def call_with_bucket(user_id, fn, *args, **kwargs):
    if not take_token(f"user:{user_id}", capacity=60, refill_per_sec=1.0):
        raise RuntimeError("Rate limit: 60 calls/min per user")
    return fn(*args, **kwargs)

Adaptive Throttling

When a cost center is at 80% of its daily budget, automatically reduce its bucket refill rate. The KQL alert above can write back to a runtime_throttle table that the bucket reader honors:

def effective_refill_rate(cost_center: str, base: float) -> float:
    burn_pct = lookup_burn_pct(cost_center)  # from runtime_throttle table
    if burn_pct < 50:
        return base
    if burn_pct < 80:
        return base * 0.5
    if burn_pct < 95:
        return base * 0.2
    return 0  # hard stop
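
lookup_burn_pct is left abstract above; a minimal version reading the runtime_throttle write-back table could look like this (the lh_gold location and the column names are assumptions):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

def lookup_burn_pct(cost_center: str) -> float:
    """Latest burn percentage written back by the budget-burn alert (cache this in practice)."""
    spark = SparkSession.builder.getOrCreate()
    row = (spark.table("lh_gold.runtime_throttle")
           .filter(F.col("cost_center") == cost_center)
           .orderBy(F.col("updated_at").desc())
           .first())
    return float(row.burn_pct) if row else 0.0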

Circuit Breaker on Repeated 429s

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_seconds=60):
        self.failures = 0
        self.opened_at = None
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds

    def call(self, fn, *args, **kwargs):
        if self.opened_at and time.time() - self.opened_at < self.reset_seconds:
            raise RuntimeError("Circuit open โ€” provider unhealthy")
        try:
            result = fn(*args, **kwargs)
            self.failures = 0
            return result
        except Exception as e:
            if "429" in str(e) or "rate" in str(e).lower():
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = time.time()
            raise

Middleware vs API Management

| Approach | When | Pros | Cons |
|---|---|---|---|
| Python middleware (this doc) | Notebook-first teams, single-tenant, fast iteration | Easy, instrumented, in-process | Per-runtime; no central enforcement |
| Azure API Management (APIM) | Multi-tenant, multi-app, central policy | Centralized, language-agnostic, built-in token-rate-limiting policy for AOAI | Operational overhead; another hop |
| Hybrid | Production at scale | APIM for hard limits + Python decorator for attribution | More moving parts |

๐Ÿ“ For multi-tenant SaaS architectures, see multi-tenant-workspace-architecture.md โ€” APIM-fronted routing is the recommended pattern there.


โ™ป๏ธ Caching Strategies

Three Layers, Three Hit Rates

| Layer | What's Cached | Provider Support | Target Hit Rate |
|---|---|---|---|
| Prompt prefix cache | Static system prompt + few-shot + retrieved docs | Anthropic native, AOAI native | ≥ 60% |
| Response cache | Full prompt → completion mapping | DIY (Redis/Cosmos) | ≥ 25% (workload-dependent) |
| Embedding cache | Source-text → vector mapping | DIY (Delta or Cosmos) | ≥ 80% (after warm-up) |

Prompt Prefix Cache (Anthropic example)

resp = client.messages.create(
    model="claude-sonnet-4-5",
    system=[
        {
            "type": "text",
            "text": LARGE_STATIC_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},   # 5-min cache
        }
    ],
    messages=[{"role": "user", "content": user_question}],
    max_tokens=512,
)
# resp.usage.cache_read_input_tokens > 0 on hit

Response Cache (DIY)

import json
import hashlib

# `r` is the Redis client created in the Rate Limiting section

def cache_key(model: str, messages: list, temp: float) -> str:
    canonical = json.dumps({"m": model, "msgs": messages, "t": temp},
                           sort_keys=True)
    return f"llm:{hashlib.sha256(canonical.encode()).hexdigest()}"


def cached_call(client, model, messages, temperature=0.0, ttl=3600):
    if temperature > 0.2:
        # don't cache non-deterministic prompts
        return client.messages.create(model=model, messages=messages,
                                      temperature=temperature, max_tokens=1024)
    key = cache_key(model, messages, temperature)
    hit = r.get(key)
    if hit:
        return json.loads(hit)  # mark cache_hit=True in usage
    resp = client.messages.create(model=model, messages=messages,
                                  temperature=temperature, max_tokens=1024)
    r.setex(key, ttl, json.dumps(resp.model_dump()))
    return resp

Embedding Cache

Hash source text; check vector store; only embed on miss. For RAG ingestion, a content-addressed vector cache typically achieves 95%+ hit rate after the first full crawl, because most documents don't change between runs.

def embed_with_cache(texts: list[str]) -> list[list[float]]:
    hashes = [hashlib.sha256(t.encode()).hexdigest() for t in texts]
    cached = lookup_vectors(hashes)        # Delta table read
    misses = [(h, t) for h, t in zip(hashes, texts) if h not in cached]
    if misses:
        new_vecs = embed_api([t for _, t in misses])
        store_vectors([(h, v) for (h, _), v in zip(misses, new_vecs)])
        cached.update(zip([h for h, _ in misses], new_vecs))
    return [cached[h] for h in hashes]

๐Ÿ” Fallback Model Strategy

Quality gracefully degrades; cost gracefully degrades; the user gets an answer.

Tiered Ladder

Tier 1 (flagship):  claude-opus-4    or  o1
Tier 2 (default):   claude-sonnet-4  or  gpt-4o
Tier 3 (triage):    claude-haiku-4   or  gpt-4o-mini
Tier 4 (cached):    last-known-good response from cache
Tier 5 (graceful):  "I'm at capacity right now - please try again."

Implementation

LADDER = [
    ("anthropic", "claude-opus-4"),
    ("anthropic", "claude-sonnet-4-5"),
    ("anthropic", "claude-haiku-4-5"),
]

def resilient_call(messages, ladder=LADDER, fallback_count=0):
    """Walk the tier ladder. Assumes invoke(), RateLimitError, BudgetBlockError,
    and lookup_response_cache() are supplied by the surrounding middleware."""
    last_err = None
    for provider, model in ladder:
        try:
            return invoke(provider, model, messages,
                          fallback_count=fallback_count)
        except RateLimitError as e:
            last_err = e
            fallback_count += 1
            continue
        except BudgetBlockError as e:
            last_err = e
            fallback_count += 1
            continue
    cached = lookup_response_cache(messages)
    if cached:
        return cached
    raise RuntimeError(f"All tiers exhausted: {last_err}")

Quality Degradation Acceptance

For every workload, define what quality drop is acceptable when degrading. Document this in the workload's runbook.

| Workload | Tier 1 → Tier 2 Acceptable? | Tier 2 → Tier 3 Acceptable? | Stale Cache Acceptable? |
|---|---|---|---|
| Casino compliance Q&A | ✅ (sub-second SAR detection paramount) | ⚠️ flag in response | ❌ |
| Marketing copy generation | ✅ | ✅ | ✅ (24h) |
| Code generation in Copilot | ✅ | ❌ (drops too far) | ✅ |
| Embedding for RAG | ❌ (vector dim must match) | ❌ | ❌ |

User-Visible Communication

When falling back, tell the user. Add a banner: "Currently using a faster, lighter-weight model - answers may be less detailed." Don't silently degrade; users will report the drop as a bug, and you'll waste engineering hours diagnosing intentional behavior.
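
One way to surface that banner: return a degraded flag next to the answer instead of bare text. Sketch only, reusing the LADDER, invoke(), and exception names from the fallback router above:

DEGRADED_BANNER = ("Currently using a faster, lighter-weight model - "
                   "answers may be less detailed.")

def answer_with_banner(messages) -> dict:
    for tier, (provider, model) in enumerate(LADDER):
        try:
            answer = invoke(provider, model, messages, fallback_count=tier)
            return {"answer": answer,
                    "degraded": tier > 0,
                    "banner": DEGRADED_BANNER if tier > 0 else None}
        except (RateLimitError, BudgetBlockError):
            continue
    return {"answer": None, "degraded": True,
            "banner": "I'm at capacity right now - please try again."}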


๐Ÿ” RAG-Specific Cost Patterns

RAG (retrieval-augmented generation) cost splits into three knobs; tune each one independently.

| Knob | Trade-off | Cost Lever |
|---|---|---|
| Top-K retrieval | More context → better answer + higher input cost | Start at K=5; A/B vs K=10; rarely > 20 |
| Reranking | Reranker model adds latency + a small per-doc cost; lifts quality 5-15% | Use a small reranker (e.g. Cohere Rerank, ~$1/1K) only when retrieval has > 3 close neighbors |
| Generate-only fallback | Skip retrieval entirely for "small-talk" queries | Classify intent first with gpt-4o-mini; route generate-only when retrieval not needed |

Triage-First RAG

def smart_rag(question: str):
    intent = classify_intent(question, model="gpt-4o-mini")  # cheap
    if intent == "smalltalk":
        return generate_only(question, model="gpt-4o-mini")
    docs = retrieve(question, top_k=5)
    if max_score(docs) < 0.6:                                 # low confidence
        docs = retrieve(question, top_k=20)
        docs = rerank(question, docs, top_k=5)
    return generate(question, docs, model="claude-sonnet-4-5")

For deeper coverage see features/rag-patterns-deep-dive.md (Wave 2 sibling).


โš™๏ธ Optimization Techniques

Smaller Models for Triage; Large Models for Hard Cases

Cascade: classify with Haiku → if confidence < 0.8, escalate to Sonnet → if still uncertain, Opus. Measured 5-10× cost reduction on production help-desk and classification workloads at no measurable quality loss.
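
A sketch of that cascade. classify_with_confidence() is a placeholder for however you obtain a label plus a confidence score (structured output with a self-reported confidence, or logprobs):

CASCADE = ["claude-haiku-4-5", "claude-sonnet-4-5", "claude-opus-4"]

def cascade_classify(text: str, threshold: float = 0.8) -> str:
    label, confidence = None, 0.0
    for model in CASCADE:
        label, confidence = classify_with_confidence(text, model=model)  # placeholder helper
        if confidence >= threshold:
            return label        # the cheap model was confident enough; stop escalating
    return label                # flagship's answer, even if still uncertain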

Few-Shot vs Zero-Shot Trade-Off

Each example added to the prompt costs ~50-500 input tokens every call. Run an eval: does adding example N+1 actually move accuracy more than the cost? If a model is already at 95% with 3 examples, adding 7 more is pure waste.
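
A sketch of that marginal-value eval; eval_accuracy(), count_tokens(), and the labeled eval_set are placeholders for your own eval harness:

def few_shot_curve(examples: list[str], eval_set, max_examples: int = 10) -> list[dict]:
    rows = []
    for n in range(min(max_examples, len(examples)) + 1):
        shots = examples[:n]
        accuracy = eval_accuracy(eval_set, few_shot=shots)    # placeholder eval harness
        extra_tokens = sum(count_tokens(s) for s in shots)    # paid on every future call
        rows.append({"examples": n, "accuracy": accuracy,
                     "extra_prompt_tokens": extra_tokens})
    return rows   # stop where accuracy flattens but token cost keeps climbing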

Structured Outputs

Use JSON-schema mode (Azure OpenAI) or tool-use (Anthropic) to constrain output to exactly the fields you need. A natural-language summary that should produce 5 fields can balloon to 800 output tokens; the same call with a schema produces 80.

resp = client.messages.create(
    model="claude-sonnet-4-5",
    tools=[{
        "name": "extract_ctr",
        "input_schema": {
            "type": "object",
            "properties": {
                "amount": {"type": "number"},
                "player_id": {"type": "string"},
                "timestamp": {"type": "string"},
            },
            "required": ["amount", "player_id", "timestamp"],
        }
    }],
    tool_choice={"type": "tool", "name": "extract_ctr"},
    messages=[{"role": "user", "content": transaction_text}],
)

System Prompt Caching

Move every byte of static content (instructions, schemas, few-shot examples, retrieved boilerplate) before the dynamic content, and mark it cache-eligible. A 12K-token system prompt that hits cache costs the same as 1.2K input tokens on Anthropic.
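
The arithmetic behind that claim, using the Sonnet rates from the pricing snapshot above (cached read at 10% of the input rate):

# 12K-token static prefix on Claude Sonnet 4.x, using the snapshot rates above.
full_cost = 12_000 * 3.00 / 1_000_000     # $0.036 per call at the full input rate
cached_cost = 12_000 * 0.30 / 1_000_000   # $0.0036 per call on a cache hit
equivalent_tokens = cached_cost / (3.00 / 1_000_000)   # ~1,200 tokens at the full rate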

Batch API for Non-Real-Time

Both Azure OpenAI and Anthropic offer batch APIs at ~50% of synchronous pricing, with 24-hour SLA. Use them for: nightly report generation, embedding backfills, bulk classification of historical Bronze data, fine-tuning data prep.

# Anthropic Message Batches API โ€” 50% off, 24h SLA
batch = client.messages.batches.create(
    requests=[
        {"custom_id": f"row-{i}",
         "params": {"model": "claude-haiku-4-5",
                    "max_tokens": 256,
                    "messages": [{"role": "user", "content": row}]}}
        for i, row in enumerate(rows)
    ]
)

📈 KQL Cost Library

Five queries that cover 90% of the questions finance and engineering will ask. Save these as Workspace Monitoring Saved Queries; pin to the Cost Dashboard.

1. Top Spenders by User (last 7 days)

llm_usage
| where timestamp > ago(7d) and status == "ok"
| summarize spend_usd = sum(cost_usd),
            calls     = count(),
            avg_tokens = avg(total_tokens)
          by user_id, cost_center
| top 25 by spend_usd desc

2. Token Consumption by Workload Over Time

llm_usage
| where timestamp > ago(30d)
| summarize tokens = sum(total_tokens) by bin(timestamp, 1d), workload
| render timechart with (title="Daily token consumption by workload")

3. Cache Hit Rate (rolling 7-day)

llm_usage
| where timestamp > ago(7d) and workload != "embedding"
| summarize hits = countif(cache_hit == true),
            total = count()
          by bin(timestamp, 1h), workload
| extend hit_rate_pct = round(100.0 * hits / total, 1)
| project timestamp, workload, hit_rate_pct
| render timechart

4. Cost per Business Unit (month-to-date)

llm_usage
| where startofmonth(timestamp) == startofmonth(now())
| summarize mtd_spend_usd = sum(cost_usd),
            calls         = count(),
            avg_latency_ms = avg(latency_ms)
          by business_unit, environment
| order by mtd_spend_usd desc

5. Anomaly Detection - Sudden Spike

// Compare the current hour to a 7-day same-hour baseline; flag if 3x higher
let lookback = 7d;
let baseline = llm_usage
    | where timestamp between (ago(lookback) .. ago(1h))
    | summarize hourly_spend = sum(cost_usd) by bin(timestamp, 1h), cost_center
    | summarize avg_hourly = avg(hourly_spend) by hour = hourofday(timestamp), cost_center;
llm_usage
| where timestamp > ago(1h)
| summarize current = sum(cost_usd) by hour = hourofday(timestamp), cost_center
| join kind=inner baseline on hour, cost_center
| extend ratio = current / iff(avg_hourly == 0, 0.0001, avg_hourly)
| where ratio > 3.0 and current > 1.0  // ignore noise under $1
| project hour, cost_center, current, avg_hourly, ratio
| order by ratio desc

6. Bonus - Most Expensive Models per Token Returned

// "Are we paying flagship rates for completions Sonnet would have done fine?"
llm_usage
| where timestamp > ago(7d) and workload in ("chat", "completion", "agent")
| summarize cost_per_1k_completion = sum(cost_usd) * 1000.0 / sum(completion_tokens),
            calls = count()
          by model, workload
| where calls > 100
| order by cost_per_1k_completion desc

๐Ÿ–ฅ๏ธ Cost Dashboard

Build a Power BI report on the Eventhouse llm_usage KQL table via Direct Lake (no import refresh needed; sub-minute freshness). Pin the following pages.

| Page | Visuals | Slicers | Refresh |
|---|---|---|---|
| Today | Total spend (card), spend trend (line), top 10 users (bar), cache hit rate (gauge) | environment, business_unit | Direct Lake (live) |
| Workload Breakdown | Stacked area: spend by workload over 30d; donut: spend by surface | environment, cost_center | Direct Lake |
| Budget Burn-Down | Burn-down per cost_center vs daily/weekly/monthly limits; over-budget table | period | Direct Lake |
| Top Users / Top Prompts | Top 25 users, top 25 prompt_sha256 hashes by spend | last 7/30/90 days | Direct Lake |
| Cache & Fallback | Cache hit rate trend, fallback count distribution, % of calls reaching tier 3+ | workload | Direct Lake |
| Anomalies | Spike detector output (query 5), error rate, 429 count | last 24h / 7d | Direct Lake |

For Direct Lake setup details see features/direct-lake.md.

💡 One-pager export: Schedule a daily PDF export of the Today + Budget Burn-Down pages to finance and engineering leadership Teams channels via Power Automate. Visibility drives behavior change.


🎰 Casino Implementation

Copilot Cost Attribution to Teams

Casino uses Fabric Copilot heavily: floor managers ask DAX questions in Power BI; data engineers use notebook Copilot for PySpark generation; compliance officers use Q&A in Power BI. Without attribution, all three look like one bucket.

Setup:

1. Designate a single F-SKU as the Copilot capacity (per Phase 9 fabric-iq.md).
2. Create three workspaces: casino-ops, casino-eng, casino-compliance. Tag each with cost_center.
3. Workspace Monitoring exports FabricAuditLogs into the same Eventhouse as llm_usage. Join on workspace_id to attribute Copilot spend per team.
4. A daily KQL job rolls the result forward into llm_usage with surface = 'copilot', with cost_usd estimated from CU consumption × the Copilot rate card.

Data Agents in Floor Monitoring

The da-casino-compliance Data Agent (see features/data-agents.md) handles ~5,000 turns/day from compliance officers. Each turn averages 3K reasoning tokens.

Controls applied:

- Daily project budget: $300 (project = floor-monitoring)
- Per-user quota: 200 turns/day (token bucket: capacity 200, refill 1/7 min)
- Few-shot trim: 4 examples per data source (was 12; cut input tokens 35% with no measurable accuracy drop)
- Source cap: 3 data sources max (was 5; reduces NL2X routing tokens)
- Fallback: Sonnet 4 → Haiku 4 when budget burn > 80%
- Result: $9.4K/mo → $4.1K/mo, accuracy unchanged at 91%

AI Functions on Bronze

Compliance team uses ai.classify to auto-tag SAR-suspicious narrative fields. Originally this ran on every Bronze row nightly: 14M calls/night. Now:

- Pre-filter Bronze to only rows where txn_amount BETWEEN 8000 AND 9999
- Hash the narrative; skip if the hash matches yesterday's run
- Use claude-haiku-4-5 instead of Sonnet
- Result: 14M → 22K calls/night; cost down 99.4%; same SAR detection rate


๐Ÿ›๏ธ Federal Implementation

DOJ Data Agents

The DOJ antitrust review agent (da-doj-antitrust) operates on case files and merger filings. It is high-stakes: quality matters more than cost, but rate limiting and auditing are mandatory under DOJ governance.

Controls:

- Tier 1 only (Opus 4): no automatic fallback; quality is non-negotiable
- Hard daily budget: $1,200/day; on breach, block new turns (Sev 1 alert; engineering pages the Director of AI)
- Audit table: every prompt + completion logged to llm_traces with 7-year retention per the federal records schedule
- Cross-geo disabled: all calls pinned to US-Gov regions (FedRAMP High)
- PII scan: every prompt run through Purview DLP before reaching the model

USDA Copilot

USDA economists use Fabric Copilot in Power BI for crop-production analysis. Lower-stakes; cost matters more than flagship quality.

Controls:

- Default to gpt-4o-mini for DAX generation; escalate to gpt-4o only on the user's "Try harder" button
- Monthly cost-center budget: $4,000
- Cache hit target: 70% (most NASS questions repeat across users)
- Embedding cache: 99% (NASS commodity descriptions don't change between runs)


🚫 Anti-Patterns

| Anti-Pattern | Why It Hurts | What to Do Instead |
|---|---|---|
| No middleware: direct provider calls everywhere | No attribution, no caching, no fallback, no budget enforcement | Wrap every call with the track_llm decorator |
| Hardcoded prices in code | Drift from reality the day a provider changes pricing | Pricing in config_llm_pricing Delta table; refresh quarterly |
| Storing raw prompts/completions in llm_usage | Data exfiltration risk; bloated KQL table; regulatory exposure | Hash; sample 1% to llm_traces with strict ACL |
| Provider-side rate limit as primary defense | App crashes on 429; users retry the same call 5× | Token bucket per user; circuit breaker; fallback chain |
| One model for every workload | Paying flagship rates for triage; bad latency on simple tasks | Tier ladder: small for triage, escalate on uncertainty |
| Verbose system prompts not marked for caching | Pay full input rate every call for the same 8K boilerplate | Mark static prefix cache_control ephemeral; reorder static-first |
| No anomaly alert on llm_usage | Demo loop runs all weekend; first signal is the monthly bill | Wire query 5 to an Action Group; Sev 2 on 3× spike |
| Letting agents loop without max-step guard | Agent retries tools 40 times; runaway reasoning cost | Cap turns; cap reasoning tokens; circuit-break on repeated tool errors (sketch below) |
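
A sketch of the max-step guard from the last row of the table. call_model() and run_tool() are placeholders for your agent loop, and the ceilings are illustrative:

MAX_TURNS = 12
MAX_REASONING_TOKENS = 50_000

def run_agent(task: str) -> str:
    reasoning_spent = 0
    history = [{"role": "user", "content": task}]
    for _turn in range(MAX_TURNS):
        step = call_model(history)                   # placeholder: returns text or a tool call
        reasoning_spent += step.reasoning_tokens
        if reasoning_spent > MAX_REASONING_TOKENS:
            raise RuntimeError("Reasoning-token budget exhausted for this task")
        if step.tool_call is None:
            return step.text                         # final answer
        history.append(run_tool(step.tool_call))     # placeholder: feed the tool result back
    raise RuntimeError(f"Agent exceeded {MAX_TURNS} turns without finishing")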

📋 Implementation Checklist

Use this before declaring an LLM workload "production-ready". Ties into the broader MLOps production checklist.

Tracking & attribution - [ ] Every LLM call goes through the track_llm decorator (or equivalent) - [ ] llm_usage Eventhouse table created with full schema - [ ] All five attribution tags (cost_center, business_unit, project, workload, environment) are required at call time - [ ] Pricing table (config_llm_pricing) maintained; last updated within 90 days - [ ] Raw prompts/completions hashed (not stored); llm_traces 1% sample with ACL

Budgeting & alerts - [ ] Daily, weekly, monthly budgets defined in config_llm_budgets for every cost_center and project - [ ] Soft-limit alert wired to Teams (Sev 3) - [ ] Hard-limit block enforced via rate limiter (Sev 1) - [ ] Anomaly detector (3ร— spike) wired to Action Group - [ ] Daily cost dashboard auto-emailed to engineering + finance leadership

Rate limiting & caching - [ ] Per-user token bucket implemented - [ ] Adaptive throttling kicks in at 80% daily burn - [ ] Circuit breaker on repeated 429s - [ ] Prompt prefix cache enabled where supported (Anthropic cache_control / AOAI) - [ ] Response cache (Redis/Cosmos) with hit-rate tracking - [ ] Embedding cache with content-hash key

Fallback & resilience - [ ] Tiered model ladder defined per workload - [ ] Quality degradation acceptance documented per workload - [ ] User-visible banner when running degraded - [ ] All-tiers-exhausted path returns cached response or graceful message (never a stack trace)

Optimization - [ ] Triage-first cascade (small โ†’ large) implemented for batch workloads - [ ] System prompts ordered static-first - [ ] Structured outputs (JSON schema / tool use) used where possible - [ ] Batch API used for non-real-time workloads - [ ] AI Function row-explosion guard: pre-filter or sample before applying

Governance - [ ] Mandatory tags enforced (fail closed) - [ ] Audit logs joined with Purview for sensitive workloads - [ ] Federal/regulated workloads pinned to compliant regions - [ ] On-call runbook covers: budget breach, provider outage, runaway loop


📚 References

Provider Pricing (refresh quarterly)

| Resource | URL | Captured |
|---|---|---|
| Azure OpenAI Service Pricing | https://azure.microsoft.com/en-us/pricing/details/cognitive-services/openai-service/ | 2026-04-27 |
| OpenAI API Pricing | https://openai.com/api/pricing/ | 2026-04-27 |
| Anthropic API Pricing | https://www.anthropic.com/pricing | 2026-04-27 |
| Voyage AI Embedding Pricing | https://docs.voyageai.com/docs/pricing | 2026-04-27 |
| Anthropic Prompt Caching | https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching | 2026-04-27 |
| AOAI Prompt Caching | https://learn.microsoft.com/azure/ai-services/openai/how-to/prompt-caching | 2026-04-27 |
| Anthropic Message Batches | https://docs.anthropic.com/en/api/messages-batches | 2026-04-27 |

Microsoft Fabric Documentation

Wave 2 Cross-References (Anchor & Siblings)

Wave 1 Operational Docs


โฌ†๏ธ Back to Top | ๐Ÿ“š Best Practices Index | ๐Ÿ  Home