Pattern — LLMOps & Evaluation

TL;DR: Treat your prompts, retrievers, and agents like code: version-controlled, eval-on-PR, content-safety-guarded, drift-monitored in production. The platform pattern is Azure AI Content Safety in front, eval suite in CI, Application Insights telemetry, periodic drift detection against a holdout set.

Problem

LLM features fail differently from traditional ML:

  • Prompts work in dev, fail in prod with real distributions
  • Models change underneath you (Azure OpenAI version updates)
  • Retrievers go stale as the corpus changes
  • Agents loop, hallucinate, refuse legitimate requests, comply with adversarial ones
  • "It works for the demo" ≠ "it works for 10,000 users / day"

LLMOps is MLOps + content safety + retrieval health + agent observability — and most teams underinvest until production breaks.

Architecture

flowchart TB
    subgraph Dev[Dev / CI]
        Prompts[Prompts in git<br/>+ tracked versions]
        Eval[Eval suite<br/>50-500 questions]
        ContentTest[Content Safety<br/>red-team test]
        EvalCI[CI: eval + content-safety<br/>on every PR<br/>fails build on regression]
    end

    subgraph Run[Runtime]
        User[User] --> APIM[APIM<br/>+ rate limit]
        APIM --> Guard[Content Safety<br/>input filter]
        Guard --> Agent[Agent / RAG]
        Agent --> Retr[Retriever<br/>AI Search / Fabric]
        Agent --> AOAI[Azure OpenAI]
        AOAI --> GuardOut[Content Safety<br/>output filter]
        GuardOut --> User

        Agent -. trace .-> AI[Application Insights<br/>+ token usage<br/>+ retrieval recall<br/>+ refusal rate]
    end

    subgraph Prod[Production monitoring]
        AI --> Workbook[Workbook:<br/>quality metrics]
        AI --> Alerts[Alerts:<br/>refusal spike,<br/>latency p95,<br/>token spike]
        AI --> Drift[Weekly drift job<br/>re-eval on prod sample]
        Drift -. regression .-> Slack[Notify team]
    end

    Prompts --> Run

Pattern: prompts in git

Treat prompts like SQL — version-controlled, code-reviewed, tested:

prompts/
  triage/
    v1.txt              # initial
    v2.txt              # current
    eval/
      questions.jsonl   # 100 representative inputs + expected_topic
      adversarial.jsonl # 50 prompt-injection / jailbreak attempts

Don't store prompts in app config or environment variables — they belong in git with PR review.
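Loading a versioned prompt and its eval cases is then plain file I/O. A minimal sketch against the tree above; load_prompt and load_eval_cases are illustrative helpers (not the platform's API), and the per-line JSONL schema is assumed to carry the input plus expected_topic:

import json
from pathlib import Path

PROMPT_DIR = Path("prompts/triage")

def load_prompt(version: str = "v2") -> str:
    # One file per prompt version, reviewed like any other code change
    return (PROMPT_DIR / f"{version}.txt").read_text()

def load_eval_cases(name: str = "questions") -> list[dict]:
    # Works for questions.jsonl and adversarial.jsonl alike
    with open(PROMPT_DIR / "eval" / f"{name}.jsonl") as f:
        return [json.loads(line) for line in f if line.strip()]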

Pattern: eval-on-PR

Every PR that touches a prompt, retriever, or model selection runs:

# .github/workflows/llm-evals.yml
name: LLM Evals
on:
    pull_request:
        paths: ["prompts/**", "apps/copilot/**"]

jobs:
    eval:
        runs-on: ubuntu-latest
        steps:
            - uses: actions/checkout@v4
            - run: |
                  python -m apps.copilot.evals.runner \
                    --prompts prompts/triage/v2.txt \
                    --questions prompts/triage/eval/questions.jsonl \
                    --baseline reports/triage-baseline.json \
                    --threshold 0.85 \
                    --output reports/triage-pr.json

            - run: |
                  python -m apps.copilot.evals.compare \
                    --baseline reports/triage-baseline.json \
                    --candidate reports/triage-pr.json \
                    --max-regression 0.05  # fail if quality drops >5%

The baseline is committed to the repo and updated when a PR that intentionally improves quality is merged.
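The heart of the compare step is a simple gate. A sketch, assuming each report stores per-case results with a numeric score in [0, 1] (the actual report schema lives in apps/copilot/evals/):

import json
import sys

def mean_score(path: str) -> float:
    with open(path) as f:
        report = json.load(f)
    scores = [r["score"] for r in report["results"]]
    return sum(scores) / len(scores)

baseline = mean_score("reports/triage-baseline.json")
candidate = mean_score("reports/triage-pr.json")

if baseline - candidate > 0.05:  # --max-regression 0.05, as an absolute drop
    print(f"FAIL: quality dropped {baseline - candidate:.1%} vs baseline")
    sys.exit(1)  # non-zero exit fails the PR build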

See apps/copilot/evals/ for the platform's eval framework.

Pattern: eval framework choice

| Framework | Use for |
| --- | --- |
| Azure AI Evaluation (Foundry) | Native Azure integration, GPT-as-judge metrics, RAI evaluation |
| Promptfoo | Lightweight, YAML-based, great for local + CI iteration |
| DeepEval | Python-native, pytest integration, robust assertions |
| RAGAS | RAG-specific metrics (faithfulness, context precision/recall) |
| TruLens | RAG + agent observability with feedback functions |
| Custom (in-platform) | When the metric is domain-specific and standard frameworks miss it |

The platform ships a custom framework under apps/copilot/evals/ because docs-Q&A has specific metrics (citation quality, refusal calibration). Most teams should use Promptfoo + Azure AI Evaluation as the starting combo.
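As an example of a metric that off-the-shelf frameworks don't ship, here is an illustrative citation check in the spirit of the platform's framework (the real implementation lives in apps/copilot/evals/):

def citation_precision(cited_ids: set[str], retrieved_ids: set[str]) -> float:
    """Share of cited doc IDs that actually appear in the retrieved set."""
    if not cited_ids:
        return 1.0  # nothing cited, nothing mis-cited; refusals are scored separately
    return len(cited_ids & retrieved_ids) / len(cited_ids)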

Pattern: content safety in front and behind

Both input AND output go through Azure AI Content Safety:

from azure.ai.contentsafety import ContentSafetyClient
from azure.ai.contentsafety.models import AnalyzeTextOptions

cs = ContentSafetyClient(...)

MEDIUM = 4  # default severity scale: 0 = safe, 2 = low, 4 = medium, 6 = high

def max_severity(result) -> int:
    return max((c.severity or 0) for c in result.categories_analysis)

def chat(user_input: str) -> str:
    # 1. Input filter — block obvious adversarial / harmful
    in_check = cs.analyze_text(AnalyzeTextOptions(text=user_input))
    if max_severity(in_check) >= MEDIUM:
        return "I can't help with that."

    # 2. Generate
    response = aoai.chat.completions.create(...)
    answer = response.choices[0].message.content

    # 3. Output filter — block harmful generation (rare but happens)
    out_check = cs.analyze_text(AnalyzeTextOptions(text=answer))
    if max_severity(out_check) >= MEDIUM:
        log_filtered_output(...)
        return "Let me try that again. [refusal]"

    return answer

For RAG / agent surfaces, also run prompt injection detection (Azure AI Content Safety ships this as Prompt Shields) on retrieved context, not just on the user's message.
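A sketch of screening retrieved chunks through the Prompt Shields REST endpoint; the request/response shape follows the Content Safety docs, but verify the api-version available in your region and whether your SDK version wraps this natively:

import requests

def injection_detected(endpoint: str, key: str, user_prompt: str, docs: list[str]) -> bool:
    r = requests.post(
        f"{endpoint}/contentsafety/text:shieldPrompt",
        params={"api-version": "2024-09-01"},
        headers={"Ocp-Apim-Subscription-Key": key},
        json={"userPrompt": user_prompt, "documents": docs},
    )
    r.raise_for_status()
    body = r.json()
    # Attack detected either in the user prompt or in any retrieved document
    return body["userPromptAnalysis"]["attackDetected"] or any(
        d["attackDetected"] for d in body.get("documentsAnalysis", [])
    )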

Pattern: retrieval health monitoring

For RAG systems, the retriever is half the quality story. Monitor:

| Metric | Why | Healthy threshold |
| --- | --- | --- |
| Recall@K on eval set | "Does it find the right docs?" | >0.8 for K=5 |
| Citation precision | "Of cited docs, how many were actually used?" | >0.7 |
| Empty retrieval rate | "How often does it return nothing relevant?" | <5% |
| Latency p95 | UX | <300 ms for retrieval |
| Index freshness | Stale indexes degrade quality silently | Match SLA per corpus |

When recall drops, you usually need: better embeddings, better chunking, hybrid search (vector + keyword), or reranking.
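A minimal sketch of the first and third metrics over the eval set; retrieve stands in for your retriever, and expected_doc_ids is an assumed field on each eval case:

def retrieval_health(cases: list[dict], retrieve, k: int = 5) -> tuple[float, float]:
    recall_sum, empty = 0.0, 0
    for case in cases:
        got = {doc.id for doc in retrieve(case["input"], top=k)}
        if not got:
            empty += 1
            continue  # counts as zero recall for this case
        expected = set(case["expected_doc_ids"])
        recall_sum += len(expected & got) / len(expected)
    return recall_sum / len(cases), empty / len(cases)  # mean recall@k, empty rate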

Pattern: production telemetry

Every chat / agent invocation emits a structured trace to Application Insights:

{
    "request_id": "req-abc123",
    "user_id_hash": "...",
    "input_tokens": 250,
    "output_tokens": 180,
    "model": "gpt-4o-mini",
    "model_version": "2024-07-18",
    "retrieved_doc_ids": ["doc-1", "doc-7"],
    "retrieved_doc_count": 2,
    "latency_ms": 1240,
    "content_safety_input_severity": "safe",
    "content_safety_output_severity": "safe",
    "agent_iterations": 1,
    "tools_called": ["search_docs"],
    "outcome": "success",
    "user_thumbs": null // populated on feedback
}

This telemetry powers:

  • Cost dashboards (token spend by feature / user)
  • Quality dashboards (thumbs up/down rate, refusal rate)
  • Drift detection (refusal-rate increase = something changed)
  • Incident investigation (full trace per request_id)
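One way to emit that trace is as span attributes through the azure-monitor-opentelemetry distro. A sketch, where run_chat is a placeholder for your pipeline and returns the metadata dict shown above:

from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace

configure_azure_monitor()  # reads APPLICATIONINSIGHTS_CONNECTION_STRING
tracer = trace.get_tracer("copilot")

def traced_chat(user_input: str) -> str:
    with tracer.start_as_current_span("chat") as span:
        answer, meta = run_chat(user_input)  # hypothetical pipeline entry point
        for key, value in meta.items():
            if value is not None:  # e.g. user_thumbs before feedback arrives
                span.set_attribute(key, value)  # surfaces as customDimensions
        return answer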

Pattern: drift detection

A weekly job re-runs your eval suite against a sample of production traces (not just curated dev questions):

1. Sample 200 prod requests from last week
2. Re-score them with the same eval framework
3. Compare scores to the baseline (last week, last month, all-time)
4. Alert if the median score drops >5% week-over-week (see the sketch below)
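Step 4 is a small check. A sketch, assuming the eval framework returns one score per sampled request:

from statistics import median

def drift_alert(this_week: list[float], last_week: list[float], tol: float = 0.05) -> bool:
    drop = median(last_week) - median(this_week)
    return drop > tol * median(last_week)  # relative >5% drop => notify the team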

Drift causes:

  • Azure OpenAI model version updated
  • Corpus changed (silver/gold tables refreshed; index stale)
  • User input distribution shifted (new product, new market)
  • Adversarial users probing

Anti-patterns

| Anti-pattern | What to do instead |
| --- | --- |
| Prompts in environment variables | Git, with PR review |
| "It works for the demo" | Run an eval suite of 50+ representative inputs |
| One-shot eval at launch, never re-run | Eval-on-PR + weekly drift job |
| Blind trust in AOAI defaults | Configure content filters explicitly per deployment |
| No telemetry | Application Insights structured trace per request |
| Tuning prompts blind | Track every change; A/B in production with feature flags |
| "Bigger model = better" | Often a better retriever + smaller model wins |