

πŸ”Ž RAG Patterns Deep Dive β€” Production Retrieval-Augmented Generation on Fabric

Beyond Demo-Grade RAG β€” Chunking, Embedding, Hybrid Retrieval, Reranking, and Evaluation

Last Updated: 2026-04-27 | Version: 1.0.0 | Wave 2 Feature: 2.6


🎯 Overview

Retrieval-Augmented Generation (RAG) grounds an LLM's response in a curated corpus of your own documents. Instead of relying solely on the model's parametric knowledge, RAG retrieves relevant passages from a vector store at query time and augments the prompt with that context, letting the model generate an answer that cites real sources. RAG is the dominant pattern for enterprise AI assistants because it produces fewer hallucinations, supports citations, respects access control, and stays current as documents change β€” all without retraining the model.

This document goes beyond the basic "embed your PDFs, run cosine similarity, stuff into a prompt" tutorials. It covers the full production stack: chunking strategies that preserve meaning, hybrid retrieval that beats pure vector search, reranking that fixes the top-K, generation strategies for long contexts, and evaluation harnesses (Ragas, MRR, nDCG) that turn RAG from a demo into a measurable system.

When to Use RAG

| Scenario | Use RAG | Use Fine-Tuning | Use Plain Prompting |
|---|---|---|---|
| Knowledge changes frequently (policies, prices, regulations) | βœ… Yes | ❌ No (retraining needed each change) | ❌ Stale |
| Need citations and provenance | βœ… Yes (chunk IDs are first-class) | ⚠️ Hard (knowledge baked in) | ❌ No source |
| Per-tenant or per-user data isolation | βœ… Yes (filter at retrieval time) | ❌ One model per tenant β€” expensive | ❌ Not isolated |
| Teach the model a new style or format | ❌ No | βœ… Yes | ⚠️ Limited |
| Specialized vocabulary, domain jargon | ⚠️ Partial (helps with facts) | βœ… Yes (helps with idiom) | ❌ Often wrong |
| Compliance: must answer only from approved corpus | βœ… Yes (closed-book mode) | ⚠️ Hard to constrain | ❌ Open-ended |
| Fast iteration on knowledge base | βœ… Yes (re-index, no retrain) | ❌ Slow (retrain cycle) | ❌ Hard-coded |
| Cost per query at scale | ⚠️ Moderate (retrieval + LLM) | ⚠️ Low inference, high training | βœ… Lowest |

πŸ“ Rule of thumb: If the answer should change when a document changes, use RAG. If you need to teach the model how to write rather than what to know, fine-tune. Most enterprise assistants want both β€” RAG for facts, light fine-tuning for tone.

Key Building Blocks

| Component | Role | Common Choices |
|---|---|---|
| Chunker | Splits documents into retrievable units | Recursive, semantic, document-structure |
| Embedder | Maps text β†’ dense vector | text-embedding-3-small/large, BGE, E5 |
| Vector Store | Persists vectors with metadata, supports ANN search | Eventhouse Vector16, AI Search, pgvector |
| Retriever | Returns candidate chunks for a query | Cosine, BM25, hybrid (RRF) |
| Reranker | Reorders top-N candidates by true relevance | Cross-encoder, Cohere Rerank, LLM-judge |
| Generator | Produces final answer from retrieved context | GPT-4o, GPT-4o-mini, Claude Sonnet |
| Evaluator | Measures retrieval and generation quality | Ragas, MRR, nDCG, LLM-as-judge |

πŸ—οΈ Reference Architecture

End-to-End RAG Pipeline

flowchart LR
    subgraph Ingest["πŸ“₯ Ingestion"]
        DOC["πŸ“„ Documents<br/>PDF, DOCX, HTML, MD"]
        PARSE["πŸ”§ Parser<br/>Extract Text + Structure"]
        CHUNK["βœ‚οΈ Chunker<br/>Split into Passages"]
    end

    subgraph Index["🧬 Indexing"]
        EMB["πŸ€– Embedder<br/>text-embedding-3-large"]
        BRZ[("πŸ₯‰ Bronze<br/>Raw Chunks")]
        SLV[("πŸ₯ˆ Silver<br/>Embedded Chunks")]
        EH[("⚑ Eventhouse<br/>Vector16 + BM25")]
    end

    subgraph Query["πŸ” Query Path"]
        Q["❓ User Query"]
        QE["πŸ€– Query Embedder"]
        VEC["πŸ“ Vector Search<br/>Cosine"]
        BM["πŸ”€ BM25 Search<br/>Keyword"]
        RRF["βš—οΈ RRF Fusion"]
        RR["πŸ† Cross-Encoder<br/>Reranker"]
    end

    subgraph Generate["✨ Generation"]
        CTX["πŸ“‘ Context<br/>Top-K Chunks"]
        LLM["🧠 LLM<br/>GPT-4o"]
        ANS["πŸ’¬ Answer<br/>+ Citations"]
    end

    subgraph Observe["πŸ“ˆ Observability"]
        LOG["πŸ“ Query Log"]
        EVAL["πŸ“ Ragas<br/>Evaluator"]
    end

    DOC --> PARSE --> CHUNK --> BRZ
    BRZ --> EMB --> SLV --> EH

    Q --> QE --> VEC
    Q --> BM
    EH --> VEC
    EH --> BM
    VEC --> RRF
    BM --> RRF
    RRF --> RR --> CTX --> LLM --> ANS

    ANS --> LOG --> EVAL

    style Ingest fill:#2471A3,stroke:#1A5276,color:#fff
    style Index fill:#E67E22,stroke:#CA6F1E,color:#fff
    style Query fill:#6C3483,stroke:#4A235A,color:#fff
    style Generate fill:#27AE60,stroke:#1E8449,color:#fff
    style Observe fill:#7B241C,stroke:#641E16,color:#fff

Why Hybrid Retrieval?

Pure vector search misses lexical matches β€” exact identifiers, codes, acronyms, and rare terms. Pure BM25 misses semantic matches β€” paraphrases, synonyms, and conceptual overlap. Production RAG fuses both via Reciprocal Rank Fusion (RRF), which consistently outperforms either alone in benchmarks (BEIR, MS MARCO, MTEB).

| Query Type | Vector Wins | BM25 Wins | Hybrid Wins |
|---|---|---|---|
| "How do I file a CTR?" | βœ… Semantic intent | ⚠️ Match "CTR" only | βœ… Best |
| "31 CFR 1010.311" | ❌ Vague | βœ… Exact code | βœ… Best |
| "Patron deposited 9500 cash" | ⚠️ OK | βœ… Exact amount | βœ… Best |
| "What is structuring?" | βœ… Definition | ⚠️ Limited | βœ… Best |

Eventhouse as the Primary Store

Eventhouse with Vector16 encoding is our default vector store: it co-locates structured filters (date, doc type, tenant) with vector similarity in a single KQL query, eliminating the cross-system joins that plague Pinecone/Weaviate setups. See Eventhouse Vector Database for the storage primitives.


βœ‚οΈ Chunking Strategies

Chunking determines what the retriever can find. Too large β†’ diluted embeddings, irrelevant context floods the prompt. Too small β†’ lost coreference and context. The right strategy depends on document structure and query patterns.

Strategy Comparison

| Strategy | How It Works | Pros | Cons | Best For |
|---|---|---|---|---|
| Fixed-Size (chars) | Slice every N characters | Simple, fast | Breaks sentences, hurts retrieval | Quick prototypes only |
| Fixed-Size (tokens) | Slice every N tokens with overlap | Predictable embedding cost | Still cuts mid-thought | Uniform content, transcripts |
| Recursive Splitting | Split by ΒΆ β†’ sentence β†’ word until ≀ N tokens | Respects structure, simple | Can still split semantic units | General-purpose default |
| Semantic Chunking | Split where adjacent sentence embeddings diverge (sketch below) | Preserves topical coherence | 5-10Γ— slower at ingest | High-value corpora, long docs |
| Document-Structure | Split by headers (H1/H2/H3), tables, sections | Mirrors author intent | Requires structured input | Markdown, DOCX, HTML |
| Sliding Window | Overlapping windows (e.g., 512 tokens, 64 overlap) | Reduces boundary loss | Higher storage (1.1-1.3Γ—) | Q&A over narrative text |
| Late Chunking | Embed full doc, then pool over spans | Each chunk has full-doc context | Requires long-context embedder | Legal, scientific, contracts |
| Parent-Child | Embed small child chunks, return larger parent | Precise retrieval, rich context | Two-tier index complexity | Compliance, runbooks |
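A minimal semantic-chunking sketch, assuming an embed_batch(texts) helper (such as the Azure OpenAI wrapper used in the embedding notebook later in this document) and a naive sentence splitter; the similarity threshold and sentence cap are illustrative, not tuned values:

import re
import numpy as np

def semantic_chunk(text: str, threshold: float = 0.75, max_sentences: int = 40) -> list[str]:
    """Split where adjacent sentence embeddings diverge (cosine similarity drops)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) <= 1:
        return sentences
    vecs = np.asarray(embed_batch(sentences), dtype=float)    # one vector per sentence
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)       # normalize for cosine similarity
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = float(vecs[i - 1] @ vecs[i])                    # similarity of adjacent sentences
        if sim < threshold or len(current) >= max_sentences:  # topic shift -> start a new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks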

Concrete Sizes β€” What Works in Practice

| Content Type | Chunk Size | Overlap | Strategy |
|---|---|---|---|
| Policy / regulation | 400-600 tokens | 50-100 | Document-structure β†’ recursive |
| Runbook / SOP | 200-400 tokens | 50 | Document-structure (per step) |
| Q&A pairs / FAQs | 150-300 tokens | 0 | One Q+A per chunk |
| Long narrative (legal, scientific) | 600-1000 tokens | 100-150 | Late chunking or semantic |
| Code / API docs | 250-500 tokens | 0 | Function/class boundaries |
| Transcripts | 300-500 tokens | 50 | Time-windowed |
| Tables | Per-row or per-table | 0 | Serialize to text + metadata |

Trade-Off Triangle

                     Context Preservation
                              β–²
                              β”‚
                              β”‚
                  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                  β”‚ Late      β”‚ Document- β”‚
                  β”‚ Chunking  β”‚ Structure β”‚
                  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
                  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                  β”‚ Semantic  β”‚ Recursive β”‚
                  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
            ◀─────────────────┼─────────────────▢
   Retrieval Precision   Storage / Compute Cost

Recursive Splitting β€” Reference Implementation

# Databricks notebook source
# COMMAND ----------
# MAGIC %md
# MAGIC ## Recursive Chunking with Token-Aware Boundaries

# COMMAND ----------

import tiktoken
from typing import List

ENC = tiktoken.encoding_for_model("text-embedding-3-large")

def recursive_chunk(
    text: str,
    max_tokens: int = 500,
    overlap_tokens: int = 50,
    separators: List[str] = None
) -> List[str]:
    """
    Recursively split text on a hierarchy of separators until each chunk
    is under max_tokens. Adds overlap to reduce boundary loss.
    """
    if separators is None:
        separators = ["\n\n", "\n", ". ", " ", ""]

    def _count(t: str) -> int:
        return len(ENC.encode(t))

    if _count(text) <= max_tokens:
        return [text]

    # Try each separator until we get small-enough pieces
    for sep in separators:
        if sep == "":
            # Last resort: hard split at token boundary
            tokens = ENC.encode(text)
            return [
                ENC.decode(tokens[i:i + max_tokens])
                for i in range(0, len(tokens), max_tokens - overlap_tokens)
            ]
        parts = text.split(sep)
        if len(parts) > 1:
            # Greedily pack parts into chunks under max_tokens
            chunks, current = [], ""
            for part in parts:
                candidate = current + (sep if current else "") + part
                if _count(candidate) <= max_tokens:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    if _count(part) > max_tokens:
                        # Recurse on oversized part
                        chunks.extend(
                            recursive_chunk(part, max_tokens, overlap_tokens, separators)
                        )
                        current = ""
                    else:
                        current = part
            if current:
                chunks.append(current)
            # Add overlap by prepending tail of previous chunk
            return _add_overlap(chunks, overlap_tokens)
    return [text]

def _add_overlap(chunks: List[str], overlap_tokens: int) -> List[str]:
    if overlap_tokens <= 0 or len(chunks) <= 1:
        return chunks
    result = [chunks[0]]
    for prev, curr in zip(chunks[:-1], chunks[1:]):
        prev_tokens = ENC.encode(prev)
        tail = ENC.decode(prev_tokens[-overlap_tokens:])
        result.append(tail + " " + curr)
    return result

πŸ’‘ Tip: Always include the document title and section heading at the top of each chunk before embedding. This single change typically lifts recall@10 by 3-7% because the embedder gets contextual signal it would otherwise miss.


🧬 Embedding Models

The embedding model sets the ceiling on retrieval quality. Choose carefully β€” switching models means re-embedding the entire corpus.

Model Comparison (April 2026)

| Model | Provider | Dimensions | Max Tokens | Cost / 1M Tokens | MTEB Score | Notes |
|---|---|---|---|---|---|---|
| text-embedding-3-small | Azure OpenAI | 1536 | 8191 | $0.02 | 62.3 | Default for most use cases |
| text-embedding-3-large | Azure OpenAI | 3072 | 8191 | $0.13 | 64.6 | Highest accuracy, supports dim reduction |
| text-embedding-ada-002 | Azure OpenAI | 1536 | 8191 | $0.10 | 60.9 | Legacy β€” avoid for new work |
| BGE-large-en-v1.5 | BAAI (open-source) | 1024 | 512 | self-host | 64.2 | Excellent for English, free |
| E5-mistral-7b-instruct | Open-source | 4096 | 32768 | self-host (GPU) | 66.6 | Top accuracy, expensive to host |
| bge-m3 | BAAI (open-source) | 1024 | 8192 | self-host | 65.1 | Multilingual + multi-vector |
| multilingual-e5-large | Open-source | 1024 | 512 | self-host | 64.4 | 100+ languages |
| Cohere embed-english-v3.0 | Cohere | 1024 | 512 | $0.10 | 64.5 | Good with int8 quantization |

Selecting the Right Model

| Constraint | Recommended Model |
|---|---|
| Lowest cost, English-only, good-enough accuracy | text-embedding-3-small |
| Highest accuracy, latency-tolerant | text-embedding-3-large (3072 dims) |
| FedRAMP / data residency boundary | Self-hosted BGE-large on Azure VM in compliant region |
| Multilingual (>2 languages) | bge-m3 or multilingual-e5-large |
| Long documents (>8K tokens per chunk) | bge-m3 or E5-mistral-7b |
| Storage-constrained | text-embedding-3-large with dimensions=512 reduction |

Dimensionality vs Quality vs Cost

OpenAI's text-embedding-3-* models support Matryoshka representation learning β€” you can truncate the vector to a smaller dimension and lose minimal accuracy:

| Dimensions | Storage / 1M Vectors (Vector16) | MTEB Drop vs Full | Use Case |
|---|---|---|---|
| 3072 (full large) | ~5.9 GB | baseline | Highest precision |
| 1536 | ~2.9 GB | -0.3% | Default |
| 1024 | ~2.0 GB | -0.7% | Budget-conscious |
| 512 | ~1.0 GB | -1.5% | Very large corpora |
| 256 | ~0.5 GB | -3.2% | Diminishing returns |
# Truncate at embedding time
response = client.embeddings.create(
    input=text,
    model="text-embedding-3-large",
    dimensions=1024  # Matryoshka truncation
)

Multilingual Considerations

| Issue | Mitigation |
|---|---|
| Cross-lingual retrieval drift | Use bge-m3 or multilingual-e5-large |
| Mixed-language documents | Detect language per chunk, store language metadata |
| Translation in pipeline | Embed in source language, translate post-retrieval |
| Code-mixed text (e.g., English + Spanish) | Multilingual model β€” never English-only |

Update Cadence β€” Re-Embedding

You must re-embed when:

- Switching models (any β†’ any) β€” embeddings live in different vector spaces
- Moving to a different version within the same family (e.g., ada-002 β†’ 3-small)
- Changing dimensions (even via truncation)
- Making a major chunking strategy change

You don't need to re-embed when:

- Adding new documents (just embed those)
- Updating metadata (filters, tags)
- Tuning retrieval weights

⚠️ Plan your model selection like a database schema migration. Re-embedding 10M chunks at $0.02/1M tokens Γ— ~500 tokens/chunk costs ~$100, which is cheap β€” but the runtime is hours and you must coordinate a dual-index cutover.


πŸ’Ύ Storing Embeddings in Eventhouse

See Eventhouse Vector Database for setup primitives. Here we cover the production schema and hybrid index patterns.

Production Schema

.create table rag_chunks (
    chunk_id: string,             // UUID per chunk
    document_id: string,          // Parent document
    document_title: string,
    document_uri: string,         // Source URL or OneLake path
    chunk_index: int,             // Position in document
    chunk_text: string,           // The text content
    token_count: int,
    section_path: string,         // Breadcrumb: "Doc > H1 > H2"
    doc_type: string,             // policy, runbook, faq, regulation
    tenant_id: string,            // For multi-tenant isolation
    sensitivity_label: string,    // public, internal, confidential, restricted
    language: string,             // ISO 639-1
    created_at: datetime,
    updated_at: datetime,
    embedding_model: string,      // For migration tracking
    embedding: dynamic            // Vector16-encoded
)

// Vector16 encoding cuts storage 50%
.alter column rag_chunks.embedding policy encoding type = 'Vector16'

// Hot cache recent + frequently retrieved
.alter table rag_chunks policy caching hot = 90d

Hybrid Index β€” Vector + BM25

KQL has native vector similarity (series_cosine_similarity) and full-text predicates (has, contains_cs, matches regex). BM25 scoring is not built in, so approximate keyword relevance with per-term occurrence counts:

// Keyword search with a simple term-frequency relevance score
let user_query = "structuring transactions to avoid CTR";
rag_chunks
| where doc_type in ("regulation", "policy")
| where chunk_text has_any (split(user_query, " "))
| extend kw_score =                                   // simplified BM25 stand-in
    countof(chunk_text, "structuring") * 2.0 +
    countof(chunk_text, "CTR") * 2.0 +
    countof(chunk_text, "transaction") * 1.0
| top 50 by kw_score desc

Hybrid Search via Reciprocal Rank Fusion (RRF)

RRF combines rankings from independent retrievers without needing comparable scores. The formula: RRF(d) = Ξ£ 1 / (k + rank_i(d)), typically with k = 60.

// Hybrid retrieval with RRF
let user_query = "what triggers a CTR for cash transactions";
let query_vec = toscalar(
    evaluate ai_embed_text(
        user_query,
        'https://your-aoai.openai.azure.com',
        'text-embedding-3-large'
    ) | project embedding
);
let k = 60;
let top_n = 50;
// Vector ranking
let vec_results = rag_chunks
    | where tenant_id == 'casino-prod'
    | where sensitivity_label in ('public', 'internal')
    | extend sim = series_cosine_similarity(embedding, query_vec)
    | top top_n by sim desc
    | extend vec_rank = row_number()
    | project chunk_id, vec_rank;
// Keyword ranking β€” split query, score by term hits
let bm25_results = rag_chunks
    | where tenant_id == 'casino-prod'
    | where sensitivity_label in ('public', 'internal')
    | extend kw_score =
        countof(chunk_text, "CTR", "regex") * 3.0 +
        countof(chunk_text, "cash", "regex") * 1.5 +
        countof(chunk_text, "transaction", "regex") * 1.5 +
        countof(chunk_text, "trigger", "regex") * 1.0
    | where kw_score > 0
    | top top_n by kw_score desc
    | extend kw_rank = row_number()
    | project chunk_id, kw_rank;
// RRF fusion
vec_results
| join kind=fullouter bm25_results on chunk_id
| extend chunk_id = coalesce(chunk_id, chunk_id1)
| extend rrf_score =
    iff(isnotnull(vec_rank), 1.0 / (k + vec_rank), 0.0) +
    iff(isnotnull(kw_rank),  1.0 / (k + kw_rank),  0.0)
| top 20 by rrf_score desc
| join kind=inner rag_chunks on chunk_id
| project chunk_id, chunk_text, document_title, section_path,
          rrf_score, vec_rank, kw_rank

Index Configuration Tips

| Knob | Recommendation | Why |
|---|---|---|
| Vector16 encoding | Always on | 50% storage, <0.3% accuracy loss |
| Hot cache window | 30-90 days for active corpora | Sub-second similarity at scale |
| Partition by tenant_id | If multi-tenant | Engine prunes other tenants |
| Partition by doc_type | If queries filter on type | Skips irrelevant rows |
| chunk_text extent | Default | Already inverted-indexed |
| Materialized view for top-K | If repeating queries | Pre-compute frequent answers |

πŸ” Retrieval Patterns

Beyond plain vector search, several patterns dramatically improve recall and precision.

Pattern Comparison

| Pattern | Latency | Cost | Recall@10 Lift | Implementation |
|---|---|---|---|---|
| Pure vector | ~50ms | 1Γ— | baseline | series_cosine_similarity |
| BM25 only | ~30ms | 0.5Γ— | -5% to +5% (varies) | KQL keyword matching |
| Hybrid (RRF) | ~80ms | 1.5Γ— | +8 to +15% | Both + fusion |
| Multi-query | ~150ms + LLM | 3-5Γ— | +5 to +10% | LLM expands to N queries |
| HyDE | ~120ms + LLM | 2Γ— | +3 to +8% | LLM generates hypothetical doc |
| Self-query | ~100ms + LLM | 2Γ— | +10% (filter precision) | LLM extracts metadata filters |
| Parent-child | ~80ms | 1Γ— | +5 to +12% | Embed children, return parents |

Pure Vector Retrieval

let query_vec = toscalar(
    evaluate ai_embed_text('your question',
        'https://your-aoai.openai.azure.com',
        'text-embedding-3-large') | project embedding
);
rag_chunks
| where tenant_id == 'casino-prod'
| extend sim = series_cosine_similarity(embedding, query_vec)
| where sim > 0.55           // similarity floor
| top 10 by sim desc

Multi-Query Retrieval (Query Expansion)

The LLM rephrases the user query into N variants. Each variant is embedded and retrieved; results are deduplicated and merged. This catches paraphrases the original wording missed.

EXPANSION_PROMPT = """Generate 4 alternative phrasings of this question that would
match the same answer in a knowledge base. Return ONLY the questions, one per line.

Question: {query}

Alternatives:"""

def multi_query_retrieve(query: str, top_k: int = 10) -> list[dict]:
    # Step 1: Generate variants
    variants = chat_complete(EXPANSION_PROMPT.format(query=query)).split("\n")
    variants = [v.strip() for v in variants if v.strip()][:4]
    all_queries = [query] + variants

    # Step 2: Retrieve for each
    seen_ids, results = set(), []
    for q in all_queries:
        for hit in vector_search(q, top_k=top_k):
            if hit["chunk_id"] not in seen_ids:
                seen_ids.add(hit["chunk_id"])
                results.append(hit)

    # Step 3: Rerank merged set
    return rerank(query, results)[:top_k]

HyDE β€” Hypothetical Document Embedding

Counter-intuitive but effective: ask the LLM to write a fake answer to the query, embed that, and search. The hypothetical answer often shares more vocabulary with real documents than the query does.

HYDE_PROMPT = """Write a concise paragraph (3-5 sentences) that would appear in
a regulatory document and directly answer the following question. Do not say
'I don't know' β€” write a plausible passage even if you must guess details.

Question: {query}

Passage:"""

def hyde_retrieve(query: str, top_k: int = 10) -> list[dict]:
    fake_passage = chat_complete(HYDE_PROMPT.format(query=query))
    return vector_search(fake_passage, top_k=top_k)

⚠️ HyDE caveat: It can hallucinate domain-specific terminology that misleads retrieval. Ablate before adopting β€” measure recall@10 with and without on a held-out eval set.

Self-Query β€” LLM Extracts Filters

For queries like "SAR filings from Q1 2026 about structuring", an LLM extracts structured filters (doc_type=SAR, filing_date >= 2026-01-01, semantic_query="structuring") so the retriever can pre-filter:

SELF_QUERY_PROMPT = """Extract structured filters from this question. Return JSON:
{{"semantic_query": str, "filters": {{"doc_type": str|null, "date_from": str|null,
"date_to": str|null, "tags": list[str]}}}}

Schema fields: doc_type ∈ {{"CTR","SAR","W2G","policy","regulation"}};
date format YYYY-MM-DD; tags free-form.

Question: {query}
JSON:"""

Then build the KQL where clause from the parsed filters.
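A sketch of that step, assuming the LLM returns the JSON shape shown above; build_kql_filter is a hypothetical helper, and it filters on created_at because the rag_chunks schema in this document has no dedicated filing-date column:

import json

def build_kql_filter(llm_json: str) -> tuple[str, str]:
    """Turn self-query JSON into (KQL where-clause, semantic query)."""
    parsed  = json.loads(llm_json)
    filters = parsed.get("filters", {}) or {}
    clauses = []
    if filters.get("doc_type"):
        clauses.append(f"doc_type == '{filters['doc_type']}'")
    if filters.get("date_from"):
        clauses.append(f"created_at >= datetime({filters['date_from']})")
    if filters.get("date_to"):
        clauses.append(f"created_at <= datetime({filters['date_to']})")
    where_clause = " and ".join(clauses) if clauses else "true"
    return where_clause, parsed["semantic_query"]

# "SAR filings from Q1 2026 about structuring" might yield:
#   doc_type == 'SAR' and created_at >= datetime(2026-01-01) and created_at <= datetime(2026-03-31)
#   semantic_query: "structuring"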

Parent-Child Retrieval

Embed small chunks (200 tokens) for retrieval precision, but return the parent chunk (1000 tokens) for generation context. Avoids the precision-vs-context dilemma entirely.

// Two tables: rag_chunks_small (embedded), rag_chunks_parent (text only)
rag_chunks_small
| extend sim = series_cosine_similarity(embedding, query_vec)
| top 10 by sim desc
| join kind=inner (rag_chunks_parent) on parent_id
| summarize sim = max(sim) by parent_id, parent_text   // dedupe parents, keep best child score
| top 10 by sim desc
| project parent_id, parent_text, sim

πŸ† Reranking

Initial retrieval is fast but coarse. A reranker re-scores the top-N (typically N=20-50) with a more expensive model and produces the top-K (typically K=3-10) that goes into the LLM prompt.

Reranker Comparison

| Approach | Latency / 50 docs | Cost | Quality (NDCG@10) | When to Use |
|---|---|---|---|---|
| Cross-encoder (BGE-Reranker-v2) | ~200-400ms | self-host | +12-18% | Default for production |
| Cohere Rerank 3 | ~150ms | $1.00 / 1K queries | +15-20% | Hosted, no GPU ops |
| Voyage rerank-2 | ~180ms | $0.05 / 1M tokens | +14-18% | Cost-sensitive, hosted |
| LLM-as-reranker (GPT-4o-mini) | ~800-1500ms | ~$0.0015 / query | +20-25% | High-stakes, low-volume |
| No rerank (pure RRF) | ~80ms | 0 | baseline | Latency-critical paths |

Cross-Encoder Reranking

A cross-encoder takes (query, candidate) as a pair and outputs a relevance score. Slower than bi-encoder retrieval but far more accurate because it sees both sides simultaneously.

from sentence_transformers import CrossEncoder

# Load once at app start
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", max_length=512)

def rerank(query: str, candidates: list[dict], top_k: int = 5) -> list[dict]:
    pairs = [(query, c["chunk_text"]) for c in candidates]
    scores = reranker.predict(pairs, batch_size=32)
    for c, s in zip(candidates, scores):
        c["rerank_score"] = float(s)
    return sorted(candidates, key=lambda x: -x["rerank_score"])[:top_k]

Cohere Rerank API

import os
import cohere

co = cohere.Client(api_key=os.environ["COHERE_API_KEY"])

def cohere_rerank(query: str, candidates: list[dict], top_k: int = 5) -> list[dict]:
    docs = [c["chunk_text"] for c in candidates]
    resp = co.rerank(
        model="rerank-3.5",
        query=query,
        documents=docs,
        top_n=top_k,
        return_documents=False
    )
    return [
        {**candidates[r.index], "rerank_score": r.relevance_score}
        for r in resp.results
    ]

LLM-as-Reranker

For the highest-quality reranking on small candidate sets (10-20 docs), use an LLM directly:

LLM_RERANK_PROMPT = """Rate each passage's relevance to the query on a 0-10 scale.
Return ONLY a JSON array of integers, one per passage, in order.

Query: {query}

Passages:
{passages}

Scores (JSON array):"""

πŸ’‘ LLM rerankers are slow but powerful for compliance Q&A, where even a handful of wrong answers per day from a faster reranker would be unacceptable.
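A minimal sketch of using that prompt, assuming the same chat_complete(prompt) helper as the retrieval examples and that the model returns a clean JSON array (production code should guard against malformed output):

import json

def llm_rerank(query: str, candidates: list[dict], top_k: int = 5) -> list[dict]:
    passages = "\n\n".join(f"[{i + 1}] {c['chunk_text']}" for i, c in enumerate(candidates))
    raw      = chat_complete(LLM_RERANK_PROMPT.format(query=query, passages=passages))
    scores   = json.loads(raw)                      # e.g. [7, 2, 9, ...], one score per passage
    for c, s in zip(candidates, scores):
        c["rerank_score"] = float(s)
    return sorted(candidates, key=lambda x: -x["rerank_score"])[:top_k]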

Score Fusion β€” Reciprocal Rank Fusion

When combining multiple retrievers (vector, BM25, possibly multiple rerankers), RRF is the default fusion method:

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Each ranking is an ordered list of chunk_ids. Returns merged ranking."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0) + 1.0 / (k + rank)
    return [cid for cid, _ in sorted(scores.items(), key=lambda x: -x[1])]

Cost vs Quality Trade-Off

| Latency Budget (P95) | Recommendation |
|---|---|
| < 500ms | RRF only, no reranker |
| 500ms - 1.5s | RRF + cross-encoder (BGE-reranker-v2) |
| 1.5s - 3s | RRF + Cohere/Voyage |
| > 3s | RRF + LLM-as-reranker (GPT-4o-mini) |

πŸͺ‘ Generation Strategies

How you stuff retrieved context into the LLM prompt matters as much as what you retrieve.

Strategy Comparison

| Strategy | Latency | Cost | Faithfulness | When to Use |
|---|---|---|---|---|
| Stuffing | 1Γ— | 1Γ— | High | Top-K fits in context window (default) |
| Map-Reduce | NΓ— LLM calls | NΓ— | Medium | Many chunks, need parallelism |
| Refine | Sequential N calls | NΓ— | High | Iterative summarization |
| Tree summarization | log(N) levels | log(N)Γ— | Medium-High | Very large corpora |
| Citation-first | 1Γ— + parsing | 1Γ— | Highest | Compliance, legal |

Stuffing (Default)

Concatenate all retrieved chunks into the prompt with clear separators and citation markers.

STUFF_TEMPLATE = """You are a compliance analyst assistant. Answer the question using
ONLY the passages below. Cite each claim with the passage id like [P3]. If the
passages do not contain the answer, say "I don't have that information."

Passages:
{passages}

Question: {question}

Answer (with citations):"""

def build_stuffed_prompt(query: str, chunks: list[dict]) -> str:
    passages = "\n\n".join(
        f"[P{i+1}] (source: {c['document_title']} > {c['section_path']})\n{c['chunk_text']}"
        for i, c in enumerate(chunks)
    )
    return STUFF_TEMPLATE.format(passages=passages, question=query)

Map-Reduce

For 50+ chunks where stuffing exceeds the context window:

  1. Map: For each chunk, ask the LLM to extract relevant facts.
  2. Reduce: Combine extracted facts into a final answer.
MAP_PROMPT = """Extract any facts from this passage that help answer the question.
Return only the facts, one per line. If none, return 'NONE'.

Question: {question}
Passage: {passage}

Facts:"""

REDUCE_PROMPT = """Synthesize a final answer using these extracted facts. Cite
sources by [P#].

Question: {question}
Facts:
{facts}

Answer:"""

Refine

Iterative β€” pass current answer + next chunk + ask LLM to refine. Best for evolving summaries (e.g., "summarize all SAR filings on Player X").
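A minimal sketch of the refine loop; the prompt wording here is illustrative, not a canonical template:

REFINE_PROMPT = """Current answer:
{current}

New passage [P{idx}]:
{passage}

Rewrite the answer so it also reflects the new passage, keeping existing [P#]
citations and citing the new passage where used.

Question: {question}

Refined answer:"""

def refine_answer(question: str, chunks: list[dict]) -> str:
    answer = "No information yet."
    for i, c in enumerate(chunks):
        answer = chat_complete(REFINE_PROMPT.format(
            current=answer, idx=i + 1, passage=c["chunk_text"], question=question))
    return answer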

Tree Summarization

Group chunks into batches of B, summarize each batch, then summarize the summaries recursively. O(log_B(N)) levels for N chunks.
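A sketch of the recursion, assuming each chunk already carries its [P#] marker in the text so citations survive across levels:

def tree_summarize(question: str, chunks: list[str], batch_size: int = 5) -> str:
    """Summarize batches of B chunks, then recurse on the summaries until one remains."""
    if len(chunks) == 1:
        return chunks[0]
    summaries = []
    for i in range(0, len(chunks), batch_size):
        batch = "\n\n".join(chunks[i:i + batch_size])
        summaries.append(chat_complete(
            f"Summarize the following passages as they relate to the question "
            f"'{question}'. Keep any [P#] citations.\n\n{batch}"))
    return tree_summarize(question, summaries, batch_size)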

Citation Tracking β€” Always On for Compliance

import re

CITATION_RE = re.compile(r"\[P(\d+)\]")

def extract_citations(answer: str, chunks: list[dict]) -> list[dict]:
    """Map [P#] markers in answer to chunk metadata."""
    cited_idxs = {int(m.group(1)) - 1 for m in CITATION_RE.finditer(answer)}
    return [
        {
            "marker": f"P{i+1}",
            "chunk_id": chunks[i]["chunk_id"],
            "document_title": chunks[i]["document_title"],
            "document_uri": chunks[i]["document_uri"],
            "section_path": chunks[i]["section_path"]
        }
        for i in cited_idxs if i < len(chunks)
    ]

πŸ’‘ For regulated domains, render citations as clickable links to the source document and require the LLM to cite at least once per claim β€” strip the answer if no citations are present.
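A small sketch of that enforcement step, building on extract_citations above; the rendered source format is illustrative:

def enforce_citations(answer: str, chunks: list[dict]) -> str:
    citations = extract_citations(answer, chunks)
    if not citations:
        # Strip uncited answers rather than surfacing unverifiable claims
        return "I don't have that information."
    sources = "\n".join(
        f"[{c['marker']}] {c['document_title']} β€” {c['document_uri']}" for c in citations
    )
    return f"{answer}\n\nSources:\n{sources}"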


πŸ“ Evaluation Metrics

You cannot improve what you don't measure. RAG evaluation has two layers: retrieval and generation.

Retrieval Metrics

| Metric | Formula | Range | What It Measures |
|---|---|---|---|
| Recall@K | \|relevant ∩ retrieved@K\| / \|relevant\| | 0-1 | Did we find the right docs in top-K? |
| Precision@K | \|relevant ∩ retrieved@K\| / K | 0-1 | Are top-K all relevant? |
| MRR | mean(1 / rank of first relevant) | 0-1 | How fast does relevant appear? |
| nDCG@K | DCG@K / IDCG@K | 0-1 | Position-weighted relevance |
| Hit Rate@K | fraction of queries with β‰₯1 relevant in top-K | 0-1 | Coverage |
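Minimal binary-relevance implementations of these metrics, where relevant is the set of ground-truth chunk_ids for one query and retrieved is the ranked list the retriever returned:

import math

def recall_at_k(relevant: set[str], retrieved: list[str], k: int) -> float:
    return len(relevant & set(retrieved[:k])) / max(len(relevant), 1)

def precision_at_k(relevant: set[str], retrieved: list[str], k: int) -> float:
    return len(relevant & set(retrieved[:k])) / k

def mrr(relevant: set[str], retrieved: list[str]) -> float:
    # Reciprocal rank for one query; average across the eval set for MRR
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(relevant: set[str], retrieved: list[str], k: int) -> float:
    dcg  = sum(1.0 / math.log2(rank + 1)
               for rank, chunk_id in enumerate(retrieved[:k], start=1)
               if chunk_id in relevant)
    idcg = sum(1.0 / math.log2(rank + 1)
               for rank in range(1, min(len(relevant), k) + 1))
    return dcg / idcg if idcg else 0.0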

Production Thresholds (Retrieval)

| Metric | Acceptable | Good | Excellent |
|---|---|---|---|
| Recall@10 | > 0.70 | > 0.85 | > 0.93 |
| MRR | > 0.55 | > 0.70 | > 0.82 |
| nDCG@10 | > 0.60 | > 0.75 | > 0.85 |
| Hit Rate@5 | > 0.85 | > 0.92 | > 0.97 |

Generation Metrics β€” Ragas Framework

Ragas is the standard open-source evaluation framework for RAG. Key metrics:

| Metric | What It Measures | How |
|---|---|---|
| Faithfulness | Are claims in the answer supported by retrieved context? | LLM-judge: extract claims β†’ verify each |
| Answer Relevance | Does the answer address the question? | Generate questions from answer, compare to original |
| Context Precision | Are retrieved chunks ranked by relevance? | LLM-judge per chunk vs ground-truth answer |
| Context Recall | Did retrieval cover everything in the ground-truth answer? | LLM-judge: is each ground-truth fact present? |
| Answer Correctness | Semantic + factual match to ground truth | Composite of similarity + factual overlap |

Production Thresholds (Generation)

| Metric | Acceptable | Good | Excellent |
|---|---|---|---|
| Faithfulness | > 0.80 | > 0.90 | > 0.96 |
| Answer Relevance | > 0.75 | > 0.85 | > 0.92 |
| Context Precision | > 0.70 | > 0.82 | > 0.90 |
| Context Recall | > 0.75 | > 0.85 | > 0.92 |

Ragas β€” Reference Implementation

from ragas import evaluate
from ragas.metrics import (
    faithfulness, answer_relevancy,
    context_precision, context_recall
)
from datasets import Dataset

eval_data = Dataset.from_dict({
    "question":    [r["query"]              for r in eval_set],
    "answer":      [r["generated_answer"]   for r in eval_set],
    "contexts":    [r["retrieved_chunks"]   for r in eval_set],
    "ground_truth":[r["expected_answer"]    for r in eval_set],
})

scores = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)

print(scores)
# {'faithfulness': 0.92, 'answer_relevancy': 0.88,
#  'context_precision': 0.84, 'context_recall': 0.86}

End-to-End Evaluation

| Method | Cost | Reliability |
|---|---|---|
| Human evaluation | High ($$$) | Gold standard |
| LLM-as-judge | Moderate | 0.7-0.85 correlation with humans |
| Automated benchmarks (Ragas, ARES) | Low | Best for regression detection |

⚠️ Build an eval set on day 1. 50-100 high-quality (question, ideal-answer, source-chunks) tuples curated by domain experts is enough to detect regressions and run A/B comparisons. Without it, every "improvement" is a guess.
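A sketch of a nightly retrieval-regression driver over such an eval set, reusing the metric helpers above and a hybrid_retrieve function like the one in the Fabric implementation section; the eval-set row shape is an assumption:

def run_retrieval_regression(eval_set: list[dict], k: int = 10) -> dict:
    """eval_set rows: {"query": str, "relevant_chunk_ids": set[str], ...}."""
    recalls, reciprocal_ranks = [], []
    for row in eval_set:
        retrieved = [c["chunk_id"] for c in hybrid_retrieve(row["query"])]
        recalls.append(recall_at_k(row["relevant_chunk_ids"], retrieved, k))
        reciprocal_ranks.append(mrr(row["relevant_chunk_ids"], retrieved))
    return {
        "recall_at_10": sum(recalls) / len(recalls),
        "mrr":          sum(reciprocal_ranks) / len(reciprocal_ranks),
    }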


βš™οΈ Implementation in Fabric

A reference three-notebook + pipeline pattern, fully runnable on Fabric F64.

Eventhouse Setup

.create database db_rag
.create table db_rag.rag_chunks (
    chunk_id: string, document_id: string, document_title: string,
    document_uri: string, chunk_index: int, chunk_text: string,
    token_count: int, section_path: string, doc_type: string,
    tenant_id: string, sensitivity_label: string, language: string,
    created_at: datetime, updated_at: datetime,
    embedding_model: string, embedding: dynamic
)
.alter column db_rag.rag_chunks.embedding policy encoding type = 'Vector16'
.alter table db_rag.rag_chunks policy caching hot = 90d

.create table db_rag.rag_query_log (
    query_id: string, query_text: string, retrieved_ids: dynamic,
    rerank_scores: dynamic, generated_answer: string,
    latency_ms: int, total_tokens: int, cost_usd: real,
    user_id: string, ts: datetime
)

Notebook 1 β€” Ingestion + Chunking β†’ Bronze

# Notebook: 18_bronze_rag_chunking.py
# COMMAND ----------
# MAGIC %md
# MAGIC ## Bronze: Document Ingestion + Chunking
# MAGIC Reads documents from OneLake, parses, chunks, writes Delta Bronze table.

# COMMAND ----------

import uuid
from pyspark.sql import Row
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, TimestampType)
from datetime import datetime
import notebookutils as mssparkutils

BRONZE_PATH = "abfss://<workspace>@onelake.dfs.fabric.microsoft.com/lh_bronze.Lakehouse/Tables/bronze_rag_chunks"
SOURCE_DIR  = "abfss://<workspace>@onelake.dfs.fabric.microsoft.com/lh_bronze.Lakehouse/Files/rag_corpus"

# COMMAND ----------

# Load chunker (defined in shared utils)
from rag_utils.chunker import recursive_chunk
from rag_utils.parser  import parse_document  # PDF/DOCX/MD/HTML

def process_document(file_path: str, doc_type: str, tenant_id: str) -> list[Row]:
    title, sections = parse_document(file_path)
    rows = []
    for sec_path, sec_text in sections:
        for idx, chunk_text in enumerate(
            recursive_chunk(sec_text, max_tokens=500, overlap_tokens=50)
        ):
            rows.append(Row(
                chunk_id=str(uuid.uuid4()),
                document_id=file_path.split("/")[-1],
                document_title=title,
                document_uri=file_path,
                chunk_index=idx,
                chunk_text=f"{title}\n{sec_path}\n\n{chunk_text}",  # contextual prefix
                token_count=len(chunk_text.split()) * 4 // 3,       # rough
                section_path=sec_path,
                doc_type=doc_type,
                tenant_id=tenant_id,
                sensitivity_label="internal",
                language="en",
                created_at=datetime.utcnow(),
                updated_at=datetime.utcnow(),
                embedding_model=None,
                embedding=None
            ))
    return rows

# COMMAND ----------

files = mssparkutils.fs.ls(SOURCE_DIR)
all_rows = []
for f in files:
    all_rows.extend(process_document(f.path, doc_type="policy", tenant_id="casino-prod"))

df = spark.createDataFrame(all_rows)
df.write.format("delta").mode("append").save(BRONZE_PATH)
print(f"Wrote {df.count()} chunks to Bronze")

Notebook 2 β€” Embedding β†’ Silver β†’ Eventhouse

# Notebook: 18_silver_rag_embed.py
# COMMAND ----------
# MAGIC %md
# MAGIC ## Silver: Embed Bronze Chunks β†’ Eventhouse Vector Table

# COMMAND ----------

from datetime import datetime
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

# Bronze Delta table written by Notebook 1 (18_bronze_rag_chunking)
BRONZE_PATH     = "abfss://<workspace>@onelake.dfs.fabric.microsoft.com/lh_bronze.Lakehouse/Tables/bronze_rag_chunks"
AOAI_ENDPOINT   = "https://your-aoai.openai.azure.com"
AOAI_DEPLOYMENT = "text-embedding-3-large"
EMBED_DIMS      = 1024  # Matryoshka truncation

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)
client = AzureOpenAI(
    azure_endpoint=AOAI_ENDPOINT,
    azure_ad_token_provider=token_provider,
    api_version="2024-10-21"
)

def embed_batch(texts: list[str]) -> list[list[float]]:
    resp = client.embeddings.create(
        input=texts, model=AOAI_DEPLOYMENT, dimensions=EMBED_DIMS
    )
    return [d.embedding for d in resp.data]

# COMMAND ----------

bronze = spark.read.format("delta").load(BRONZE_PATH).filter("embedding IS NULL")
print(f"To embed: {bronze.count()}")

# Process in batches of 100 (AOAI rate-limit friendly)
BATCH = 100
rows = bronze.collect()
embedded = []
for i in range(0, len(rows), BATCH):
    batch = rows[i:i+BATCH]
    vectors = embed_batch([r.chunk_text for r in batch])
    for r, v in zip(batch, vectors):
        d = r.asDict()
        d["embedding"] = v
        d["embedding_model"] = f"{AOAI_DEPLOYMENT}-{EMBED_DIMS}d"
        d["updated_at"] = datetime.utcnow()
        embedded.append(d)

# COMMAND ----------

# Write to Eventhouse via Kusto SDK
from azure.kusto.data import KustoConnectionStringBuilder
from azure.kusto.ingest import QueuedIngestClient, IngestionProperties, DataFormat

KUSTO_URI    = "https://<eventhouse>.kusto.fabric.microsoft.com"
KUSTO_DB     = "db_rag"
KUSTO_TABLE  = "rag_chunks"

kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(KUSTO_URI)
ingest = QueuedIngestClient(kcsb)
props  = IngestionProperties(database=KUSTO_DB, table=KUSTO_TABLE,
                             data_format=DataFormat.JSON)

import json, tempfile
with tempfile.NamedTemporaryFile(suffix=".json", delete=False, mode="w") as fh:
    for r in embedded:
        fh.write(json.dumps(r, default=str) + "\n")
    fh.flush()
    ingest.ingest_from_file(fh.name, ingestion_properties=props)

print(f"Ingested {len(embedded)} chunks into Eventhouse")

Notebook 3 β€” Query β†’ Retrieve β†’ Rerank β†’ Generate β†’ Log

# Notebook: 18_gold_rag_query.py
# COMMAND ----------
# MAGIC %md
# MAGIC ## Gold: End-to-End RAG Query with Hybrid Retrieval, Reranking, Logging

# COMMAND ----------

import time, uuid, json
from datetime import datetime
from azure.kusto.data import KustoClient
from openai import AzureOpenAI
from sentence_transformers import CrossEncoder

# kcsb, token_provider, AOAI_ENDPOINT and KUSTO_DB are the same values
# configured in Notebook 2 (18_silver_rag_embed)
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", max_length=512)
kusto    = KustoClient(kcsb)
chat     = AzureOpenAI(azure_endpoint=AOAI_ENDPOINT,
                       azure_ad_token_provider=token_provider,
                       api_version="2024-10-21")
CHAT_DEPLOYMENT = "gpt-4o"

# COMMAND ----------

HYBRID_KQL = """
let user_query = '{q}';
let query_vec = toscalar(
    evaluate ai_embed_text(user_query, '{ep}', '{m}') | project embedding
);
let k = 60; let top_n = 50;
let vec_r = rag_chunks
    | where tenant_id == '{tenant}'
    | extend sim = series_cosine_similarity(embedding, query_vec)
    | top top_n by sim desc
    | extend vec_rank = row_number()
    | project chunk_id, vec_rank;
let bm_r = rag_chunks
    | where tenant_id == '{tenant}'
    | where chunk_text has_any (split(user_query, ' '))
    | extend kw_score = countof(chunk_text, user_query, 'normal')
    | top top_n by kw_score desc
    | extend kw_rank = row_number()
    | project chunk_id, kw_rank;
vec_r
| join kind=fullouter bm_r on chunk_id
| extend chunk_id = coalesce(chunk_id, chunk_id1)
| extend rrf = iff(isnotnull(vec_rank), 1.0/(k+vec_rank), 0.0)
              + iff(isnotnull(kw_rank),  1.0/(k+kw_rank),  0.0)
| top 30 by rrf desc
| join kind=inner rag_chunks on chunk_id
| project chunk_id, chunk_text, document_title, section_path,
          document_uri, rrf
"""

def hybrid_retrieve(query: str, tenant: str = "casino-prod") -> list[dict]:
    kql = HYBRID_KQL.format(
        q=query.replace("'", "''"),
        ep=AOAI_ENDPOINT, m="text-embedding-3-large", tenant=tenant
    )
    resp = kusto.execute(KUSTO_DB, kql)
    return [r.to_dict() for r in resp.primary_results[0]]

def rerank(query: str, candidates: list[dict], top_k: int = 5) -> list[dict]:
    pairs  = [(query, c["chunk_text"]) for c in candidates]
    scores = reranker.predict(pairs, batch_size=32)
    for c, s in zip(candidates, scores):
        c["rerank_score"] = float(s)
    return sorted(candidates, key=lambda x: -x["rerank_score"])[:top_k]

# COMMAND ----------

ANSWER_PROMPT = """You are a casino compliance analyst. Answer ONLY from the passages
below. Cite each claim as [P#]. If passages don't cover the question, say
"I don't have that information."

Passages:
{passages}

Question: {q}

Answer (cite as [P#]):"""

def generate(query: str, chunks: list[dict]) -> dict:
    passages = "\n\n".join(
        f"[P{i+1}] ({c['document_title']} > {c['section_path']})\n{c['chunk_text']}"
        for i, c in enumerate(chunks)
    )
    resp = chat.chat.completions.create(
        model=CHAT_DEPLOYMENT,
        messages=[{"role": "user",
                   "content": ANSWER_PROMPT.format(passages=passages, q=query)}],
        temperature=0.1,
        max_tokens=600
    )
    return {
        "answer": resp.choices[0].message.content,
        "tokens_in":  resp.usage.prompt_tokens,
        "tokens_out": resp.usage.completion_tokens
    }

# COMMAND ----------

def rag(query: str, tenant: str = "casino-prod", user_id: str = "anonymous") -> dict:
    qid = str(uuid.uuid4())
    t0  = time.time()
    candidates = hybrid_retrieve(query, tenant)
    t1  = time.time()
    top = rerank(query, candidates, top_k=5)
    t2  = time.time()
    out = generate(query, top)
    t3  = time.time()

    log_row = {
        "query_id": qid, "query_text": query,
        "retrieved_ids": [c["chunk_id"] for c in top],
        "rerank_scores": [c["rerank_score"] for c in top],
        "generated_answer": out["answer"],
        "latency_ms": int((t3 - t0) * 1000),
        "total_tokens": out["tokens_in"] + out["tokens_out"],
        "cost_usd": (out["tokens_in"]/1e6)*2.50 + (out["tokens_out"]/1e6)*10.00,
        "user_id": user_id,
        "ts": datetime.utcnow().isoformat()
    }
    # Async log to Eventhouse rag_query_log
    log_to_kusto(log_row)
    return {**out, **log_row,
            "retrieve_ms": int((t1-t0)*1000),
            "rerank_ms":   int((t2-t1)*1000),
            "generate_ms": int((t3-t2)*1000)}

# COMMAND ----------

result = rag("What triggers a CTR for cash transactions over $10,000?")
print(result["answer"])

Pipeline Orchestration

| Stage | Schedule | Notebook | Output |
|---|---|---|---|
| Document ingest | Hourly (or on-event) | 18_bronze_rag_chunking | Delta Bronze |
| Embed new chunks | Every 30 min | 18_silver_rag_embed | Eventhouse |
| Eval regression | Nightly | 18_gold_rag_eval | Ragas scores β†’ Power BI |
| Online query | On-demand | 18_gold_rag_query (or REST) | User-facing |

Wire via Fabric Data Pipeline with notebook activities and Eventhouse activity for KQL setup. See fabric-cicd Deployment for promoting across dev/test/prod.


πŸ›‘οΈ Production Concerns

Latency Budgets

| Stage | Budget (P95) | Optimization |
|---|---|---|
| Query embedding | 80ms | Cache common queries |
| Hybrid retrieval | 150ms | Hot cache, partition pruning |
| Reranking (BGE-v2) | 300ms | GPU instance, batch=32 |
| LLM generation | 1500ms | Stream tokens to user |
| Total E2E | < 2.5s | β€” |

Caching Strategy

| Cache | What | Hit Rate Target | Storage |
|---|---|---|---|
| Query embedding cache | query_text β†’ vector | > 30% | Redis or in-memory LRU |
| Answer cache | query_text β†’ (answer, citations) | 10-25% | Eventhouse query_results_cache |
| Document cache | chunk_id β†’ text | > 80% | Eventhouse hot cache |
| Negative cache | "I don't know" results | 5-10% | Short TTL (1h) |

Cost Management

Track per-query cost: embed + retrieve + rerank + generate. For GPT-4o at $2.50/1M input tokens and $10/1M output tokens, with an average of 4K input + 400 output tokens per query, generation alone is ~$0.014/query. At 10K queries/day that is ~$140/day, or roughly $51K/year.

Levers:

- Use gpt-4o-mini for routine queries; reserve gpt-4o for hard ones (router pattern)
- Smaller embeddings (1024 dims vs 3072) β€” 50% storage savings
- Aggressive answer caching β€” a semantic cache keyed on embedding similarity (sketch below)
- Truncate context to top-5 chunks unless evidence shows top-10 helps
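A minimal in-memory sketch of that semantic answer cache; in practice you would bound its size and back it with the Eventhouse answer cache noted above:

import numpy as np

class SemanticAnswerCache:
    """Serve a cached answer when a new query embeds close to a previous one."""
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, dict]] = []   # (normalized query vector, result)

    def get(self, query_vec: list[float]) -> dict | None:
        v = np.asarray(query_vec, dtype=float)
        v = v / np.linalg.norm(v)
        for cached_vec, result in self.entries:
            if float(cached_vec @ v) >= self.threshold:     # cosine similarity on unit vectors
                return result
        return None

    def put(self, query_vec: list[float], result: dict) -> None:
        v = np.asarray(query_vec, dtype=float)
        self.entries.append((v / np.linalg.norm(v), result))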

See LLM Cost Tracking for detailed FinOps patterns.

Privacy and PII

| Risk | Mitigation |
|---|---|
| PII in retrieved chunks leaks to LLM | Pre-redact at chunking time (Presidio, regex) |
| Embedded PII inferable from vectors | Salt and hash identifiers before chunking |
| Cross-tenant data bleed | Hard-filter by tenant_id in every KQL query β€” RLS as defense-in-depth |
| Audit failures | Log every (query, retrieved_ids, user_id) to Eventhouse for 7 years |
| Data residency (FedRAMP) | Self-host embedder; AOAI in compliant region; no cross-border transit |

Hallucination Detection

Even faithful prompts can produce hallucinations. Defenses:

  1. Citation enforcement β€” strip answers without [P#] markers
  2. Faithfulness check β€” Ragas faithfulness < 0.85 β†’ flag for review
  3. Claim verification β€” extract claims, verify each against retrieved chunks via NLI model
  4. Refusal training β€” system prompt: "If passages don't answer the question, say 'I don't have that information.'"
  5. Confidence threshold β€” if top-1 rerank score < 0.4, refuse to answer
def detect_hallucination(answer: str, chunks: list[dict]) -> dict:
    has_citations = bool(re.search(r"\[P\d+\]", answer))
    refuses_when_uncertain = "I don't have that information" in answer
    # Optional: run Ragas faithfulness on (answer, chunks)
    return {
        "has_citations": has_citations,
        "appears_to_refuse": refuses_when_uncertain,
        "warning": (not has_citations and not refuses_when_uncertain)
    }

🎰 Casino Implementation

Use Case 1 β€” Compliance Q&A Bot

A Data Agent grounded by RAG over: NIGC MICS, BSA/AML regulations, internal SOPs, prior CTR/SAR narratives, training materials.

| Corpus | Chunks | Update Cadence |
|---|---|---|
| NIGC MICS Title 25 CFR Part 543 | ~800 | Annual + erratum |
| BSA / 31 CFR Part 1010 + 1021 | ~600 | Annual |
| Internal compliance SOPs | ~1200 | Quarterly |
| Historical SAR narratives (last 3 years, redacted) | ~5000 | Daily append |
| W-2G threshold guidance (IRS Pub 515, 3079) | ~300 | Annual |

Sample queries the agent must handle:

- "When does a slot machine jackpot trigger W-2G?"
- "How is structuring defined under 31 USC 5324?"
- "Has any patron been flagged for similar behavior to John Doe's last filing?"
- "What are the SAR filing deadlines after detection?"

Eval set: 150 expert-curated (question, ideal-answer, source-citations) tuples maintained by the BSA Officer. Nightly Ragas regression.

Use Case 2 β€” Operations Runbook Chat

RAG over runbooks/ directory: incident response, escalation paths, vendor SLAs, surveillance procedures. Floor managers query via Teams.

| Metric | Target |
|---|---|
| Latency (P95) | < 3s |
| Answer-with-citation rate | > 95% |
| Faithfulness (Ragas) | > 0.92 |
| Refusal rate (out-of-scope) | > 0.85 (correctly refuses non-runbook questions) |

flowchart LR
    subgraph Sources["πŸ“š Sources"]
        MICS["NIGC MICS"]
        BSA["BSA / 31 CFR"]
        SOP["Internal SOPs"]
        SAR["Historical SARs<br/>(redacted)"]
    end
    subgraph Pipeline["πŸ”„ Pipeline"]
        ING["Ingest +<br/>Chunk"]
        EMB["Embed"]
        EH["Eventhouse"]
    end
    subgraph Query["πŸ’¬ Query"]
        TEAMS["Teams /<br/>Copilot Studio"]
        AGT["Compliance<br/>Agent"]
        RAG["Hybrid + Rerank"]
        LLM["GPT-4o"]
    end
    Sources --> Pipeline --> Query
    style Sources fill:#2471A3,stroke:#1A5276,color:#fff
    style Pipeline fill:#E67E22,stroke:#CA6F1E,color:#fff
    style Query fill:#27AE60,stroke:#1E8449,color:#fff

πŸ›οΈ Federal Implementation

USDA β€” Crop Guidance Q&A

RAG over USDA Farm Service Agency handbooks, RMA crop insurance bulletins, NRCS conservation practice standards. Producers query via web portal.

| Corpus | Source | Volume |
|---|---|---|
| FSA Handbooks (1-FLP, 2-FLP, etc.) | fsa.usda.gov | ~50,000 chunks |
| RMA crop insurance handbooks | rma.usda.gov | ~30,000 chunks |
| NRCS practice standards | nrcs.usda.gov | ~12,000 chunks |
| Title 7 CFR | ecfr.gov | ~20,000 chunks |

Sample query: "What's the prevented planting payment factor for soybeans in Iowa for 2026?" β€” agent must combine RMA actuarial documents and current bulletins, cite handbook section.

DOJ β€” Case Law Retrieval

RAG over DOJ-released opinions, OLC memoranda, public US Attorney's Manual / Justice Manual sections, and SAR-related civil case summaries. Used by AUSAs for prior-art research.

| Concern | Mitigation |
|---|---|
| Privileged content | Tag at ingest; filter at retrieval by clearance level |
| Citation accuracy | Always include exact case citation [V###, F.### (Cir. Year)] |
| FedRAMP boundary | Self-hosted BGE-large embedder; AOAI in Gov region; air-gapped option |
| Hallucination on legal facts | LLM-as-judge faithfulness check; require manual review for filings |

Cross-Agency Eval Benchmark

| Agency | Corpus Size | Recall@10 | Faithfulness | E2E P95 |
|---|---|---|---|---|
| Casino Compliance | 8K chunks | 0.91 | 0.94 | 1.8s |
| USDA Producer Q&A | 110K chunks | 0.86 | 0.91 | 2.4s |
| DOJ Case Law | 250K chunks | 0.83 | 0.93 | 3.1s |
| EPA Regulations | 65K chunks | 0.88 | 0.92 | 2.0s |
| NOAA Severe Wx Guidance | 18K chunks | 0.92 | 0.95 | 1.6s |

🚫 Anti-Patterns

| Anti-Pattern | Why It Fails | Fix |
|---|---|---|
| Embedding entire documents as one chunk | Diluted vector, no precise retrieval, LLM context overflow | Chunk to 200-600 tokens with structure |
| No reranker | Top-K from bi-encoder is noisy; precision plateaus | Add cross-encoder rerank β€” cheapest 10%+ quality lift |
| Pure vector, no BM25 | Misses exact codes, IDs, acronyms | Hybrid retrieval with RRF |
| No eval set | Can't tell if changes improve or regress | 50-100 expert-curated tuples on day 1 |
| Same model for embed and chat | Wasted: chat models aren't trained for retrieval | Use dedicated embedding model |
| Ignoring citations | Compliance failure; users can't verify | Enforce [P#] markers, strip answers without them |
| No re-embedding plan when changing models | Mixed-model index returns garbage | Treat embeddings as schema; plan migrations |
| Stuffing top-50 into context "to be safe" | LLM gets lost-in-the-middle, costs explode, latency tanks | Top-3 to top-5 reranked chunks is usually optimal |

πŸ“‹ Production Checklist

Pre-Launch

  • Eval set of β‰₯ 50 (question, answer, citations) tuples curated by SME
  • Recall@10 > 0.85 on eval set
  • Ragas faithfulness > 0.90 on eval set
  • Hybrid retrieval (vector + BM25) with RRF
  • Cross-encoder reranker in pipeline
  • Citation extraction implemented and enforced
  • PII redaction at chunk time (Presidio or domain regex)
  • Per-tenant tenant_id filtering in every query (defense-in-depth + RLS)
  • Sensitivity-label filtering (public/internal/confidential/restricted)
  • Refusal prompt: "say 'I don't have that information' when uncertain"
  • Hallucination detector: strip uncited answers
  • Query log table in Eventhouse with 7-year retention
  • Cost tracking per query (embed + retrieve + rerank + generate)
  • Latency P95 < 3s on representative load
  • Embedding model + version captured per chunk for migration safety
  • Re-embed runbook documented

Operational

  • Nightly Ragas regression on eval set (alert if any metric drops > 5%)
  • Weekly review of refused / low-confidence queries (corpus gap analysis)
  • Monthly cost review vs budget
  • Quarterly eval set expansion with newly-discovered failure modes
  • Model version pinned; upgrade path tested before swap
  • Drift monitoring on query distribution (sudden topic shifts)
  • Feedback loop: users can flag bad answers; flags trigger curation
  • Disaster recovery: re-embed time SLO documented (typical: hours)
  • Privacy review: PII never sent to non-compliant LLM
  • Audit log integrity verified weekly (query, user, chunks, answer)

πŸ“š References

Microsoft Learn

| Resource | URL |
|---|---|
| RAG with Azure AI Search | https://learn.microsoft.com/azure/search/retrieval-augmented-generation-overview |
| Eventhouse Vector Database | https://learn.microsoft.com/fabric/real-time-intelligence/vector-database |
| AI Embed Text Plugin (KQL) | https://learn.microsoft.com/kusto/query/ai-embed-text-plugin |
| AI Chat Completion Plugin (KQL) | https://learn.microsoft.com/kusto/query/ai-chat-completion-plugin |
| series_cosine_similarity() | https://learn.microsoft.com/kusto/query/series-cosine-similarity-function |
| Azure OpenAI Embeddings | https://learn.microsoft.com/azure/ai-services/openai/concepts/models#embeddings |
| Fabric Data Agents | https://learn.microsoft.com/fabric/data-science/concept-data-agent |

Frameworks and Tooling

| Resource | URL |
|---|---|
| Ragas β€” RAG Evaluation Framework | https://docs.ragas.io |
| LangChain Retrievers | https://python.langchain.com/docs/concepts/retrievers/ |
| LlamaIndex RAG Patterns | https://docs.llamaindex.ai/en/stable/optimizing/production_rag/ |
| sentence-transformers (BGE) | https://www.sbert.net/ |
| Cohere Rerank | https://docs.cohere.com/docs/rerank-overview |
| Presidio (PII redaction) | https://microsoft.github.io/presidio/ |

Foundational Papers

| Paper | URL |
|---|---|
| Retrieval-Augmented Generation for Knowledge-Intensive NLP (Lewis et al., 2020) | https://arxiv.org/abs/2005.11401 |
| Reciprocal Rank Fusion (Cormack et al., 2009) | https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf |
| Dense Passage Retrieval (Karpukhin et al., 2020) | https://arxiv.org/abs/2004.04906 |
| HyDE: Hypothetical Document Embeddings (Gao et al., 2022) | https://arxiv.org/abs/2212.10496 |
| Lost in the Middle (Liu et al., 2023) | https://arxiv.org/abs/2307.03172 |
| Late Chunking (GΓΌnther et al., 2024) | https://arxiv.org/abs/2409.04701 |
| BGE-Reranker (Xiao et al., 2024) | https://arxiv.org/abs/2402.03216 |
| MTEB Benchmark | https://huggingface.co/spaces/mteb/leaderboard |


πŸ“ Document Metadata - Author: Documentation Team - Reviewers: Data Science, AI/ML, Compliance, Federal Programs - Classification: Internal - Phase: 14 Wave 2 β€” Feature 2.6 - Next Review: 2026-07-27