# ADR 0007 — Azure OpenAI over self-hosted LLM for AI integration

## Context and Problem Statement
The platform exposes AI-integration patterns (RAG over catalog metadata, text-to-SQL, enrichment pipelines) through the csa_platform.ai_integration module. Federal customers need a model endpoint that is FedRAMP-authorized, data-residency-correct, and integrable with the platform's Entra ID + Private Endpoint pattern. We must pick a default model serving path before the AI module stabilizes.
## Decision Drivers
- FedRAMP High authorization of the inference endpoint in Azure Government.
- Data residency and contractual non-training guarantees — customer data in prompts must not be used to train upstream models.
- Private Endpoint support for network isolation — no public egress to an inference API.
- Capability frontier — access to current-generation models (GPT-4o / 4.1 class and embeddings) without customer-owned GPU capacity planning.
- Composability — the model choice should not lock application code to a single SDK; prefer OpenAI-compatible interfaces.
## Considered Options
- Azure OpenAI Service (chosen) — Managed, FedRAMP High in Azure Gov, Private Endpoints, OpenAI SDK-compatible, content filtering built in.
- Self-hosted open-weights model on Azure ML / AKS (Llama 3, Mistral, Phi-3) — Full control, no per-token cost, but customer-owned GPU fleet and weights lifecycle.
- Anthropic / Google / third-party LLM APIs via Azure AI Studio — Model diversity, but mixed Gov availability and separate authorizations.
- On-device / CPU-only small models (Phi-3 mini, DistilBERT-class) — Zero infra cost, but quality floor is too low for production RAG.
## Decision Outcome
Chosen: Option 1 — Azure OpenAI Service as the default model endpoint, accessed through an OpenAI-compatible client so application code can be re-pointed at a self-hosted endpoint (Option 2) if a tenant requires it.
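The "OpenAI-compatible client that can be re-pointed" requirement can be sketched as per-tenant endpoint configuration rather than a hard-coded vendor. This is a hypothetical illustration: the names (`ModelEndpoint`, `client_kwargs`) and the URLs are invented for this sketch and are not platform APIs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelEndpoint:
    provider: str    # "azure_openai" or "self_hosted" (sketch-only labels)
    base_url: str    # reached via Private Endpoint in both cases
    model: str       # Azure deployment name, or the served model's name

def client_kwargs(ep: ModelEndpoint) -> dict:
    """Kwargs for any OpenAI-compatible client (e.g. openai.OpenAI)."""
    if ep.provider == "azure_openai":
        # Azure OpenAI: the credential is an Entra ID token, not a static key.
        return {"base_url": ep.base_url, "api_key": "<entra-id-token>"}
    # Self-hosted fallback speaks the same wire protocol; key is unused.
    return {"base_url": ep.base_url, "api_key": "unused"}

# Illustrative endpoints only — real hostnames are tenant-specific.
azure = ModelEndpoint("azure_openai",
                      "https://tenant-aoai.openai.azure.us/openai/v1", "gpt-4o")
fallback = ModelEndpoint("self_hosted",
                         "http://models.internal:8000/v1", "llama-3-70b-instruct")
# Application code stays identical; only the endpoint record changes.
```

Because both paths present the same request/response shape, activating the Option 2 fallback is a configuration change, not a code change.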
## Consequences
- Positive: FedRAMP High + DoD IL4/IL5 authorization path in Azure Gov.
- Positive: Private Endpoint support removes public-internet egress from the threat model.
- Positive: Entra ID authentication with managed-identity support — no long-lived API keys.
- Positive: Content filtering and jailbreak detection are built into the service — one less thing to implement.
- Positive: OpenAI SDK-compatible surface keeps application code portable.
- Negative: Per-token cost; token budgets are a live FinOps concern (tracked in docs/COST_MANAGEMENT.md).
- Negative: Model versions are Microsoft-controlled — deprecation windows are short and require active version management.
- Negative: Quota + capacity commitments (PTUs) are a procurement process for bursty workloads.
- Neutral: Self-hosted open-weights models remain a supported alternate via Azure ML, behind the same SDK shape.
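The per-token cost consequence can be made concrete with a small estimator over the `usage` block that OpenAI-compatible responses return. The prices below are placeholder values for illustration, not published rates; real rates and the budget envelope belong in the FinOps config referenced in docs/COST_MANAGEMENT.md.

```python
# USD per 1K tokens — illustrative placeholder values only.
PRICE_PER_1K = {
    "gpt-4o": {"prompt": 0.005, "completion": 0.015},
}

def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate one request's cost from the token counts in a response's usage block."""
    p = PRICE_PER_1K[model]
    return (prompt_tokens / 1000) * p["prompt"] + (completion_tokens / 1000) * p["completion"]

def within_envelope(month_cost_usd: float, envelope_usd: float) -> bool:
    """Check accumulated monthly spend against the tenant's FinOps envelope."""
    return month_cost_usd <= envelope_usd
```

Accumulating `request_cost` per tenant is what makes the Validation section's cost criterion measurable rather than anecdotal.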
## Pros and Cons of the Options

### Option 1 — Azure OpenAI Service
- Pros: FedRAMP High; Private Endpoints; Entra ID auth; content filtering; frontier-class models; OpenAI SDK-compatible.
- Cons: Per-token cost; model deprecation churn; quota management.
### Option 2 — Self-hosted open-weights on Azure ML / AKS
- Pros: No per-token cost at scale; full control over model version and weights; model fine-tuning is fully in-tenant.
- Cons: Customer-owned GPU fleet; patching, autoscaling, and observability are customer responsibilities; capability gap vs. frontier models.
### Option 3 — Third-party LLM APIs (Anthropic, Google)
- Pros: Model diversity; strong capabilities; some have competitive non-training guarantees.
- Cons: Separate FedRAMP authorizations; different auth models; additional vendor procurement.
### Option 4 — On-device small models
- Pros: Zero infra cost; offline-capable; trivial data residency.
- Cons: Quality floor too low for production RAG and text-to-SQL on non-trivial schemas.
## Validation
We will know this decision is right if:
- RAG + text-to-SQL use cases meet accuracy targets using Azure OpenAI models without needing a self-hosted fallback.
- Per-tenant monthly inference cost stays within the FinOps envelope set in docs/COST_MANAGEMENT.md.
- If token cost or model deprecation churn exceeds acceptable thresholds, activate the self-hosted fallback path (Option 2) for bulk workloads.
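The validation criteria above can be expressed as a single decision gate. This is a sketch only: the function name and thresholds are hypothetical, and the real thresholds would come from the accuracy targets and FinOps envelope cited above.

```python
def keep_azure_openai(rag_accuracy: float, accuracy_target: float,
                      monthly_cost_usd: float, cost_envelope_usd: float,
                      deprecations_last_quarter: int, churn_limit: int) -> bool:
    """True while all validation criteria hold; False means the Option 2
    self-hosted fallback should be evaluated for bulk workloads."""
    return (rag_accuracy >= accuracy_target
            and monthly_cost_usd <= cost_envelope_usd
            and deprecations_last_quarter <= churn_limit)
```

Encoding the gate as code keeps the "is this decision still right?" review mechanical instead of a judgment call each quarter.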
## References
- Decision tree: RAG vs. Fine-tune vs. Agents
- Related code: csa_platform/ai_integration/, csa_platform/ai_integration/rag/, csa_platform/ai_integration/model_serving/, csa_platform/ai_integration/enrichment/
- Framework controls: NIST 800-53 AC-4 (information flow enforcement via Private Endpoints), SC-7 (boundary protection), SC-8 (TLS in transit), SI-4 (content filtering / monitoring), AU-2 (audit of prompt + completion metadata — not content). See governance/compliance/nist-800-53-rev5.yaml.
- Discussion: CSA-0087