# ADR 0007 — Azure OpenAI over self-hosted LLM for AI integration

## Context and Problem Statement
The platform exposes AI-integration patterns (RAG over catalog metadata, text-to-SQL, enrichment pipelines) through the csa_platform.ai_integration module. Federal customers need a model endpoint that is FedRAMP-authorized, data-residency-correct, and integrable with the platform's Entra ID + Private Endpoint pattern. We must pick a default model serving path before the AI module stabilizes.
## Decision Drivers
- FedRAMP High authorization of the inference endpoint in Azure Government.
- Data residency and contractual non-training guarantees — customer data in prompts must not be used to train upstream models.
- Private Endpoint support for network isolation — no public egress to an inference API.
- Capability frontier — access to current-generation models (GPT-4o / 4.1 class and embeddings) without customer-owned GPU capacity planning.
- Composability — the model choice should not lock application code to a single SDK; prefer OpenAI-compatible interfaces.
## Considered Options
- Azure OpenAI Service (chosen) — Managed, FedRAMP High in Azure Gov, Private Endpoints, OpenAI SDK-compatible, content filtering built in.
- Self-hosted open-weights model on Azure ML / AKS (Llama 3, Mistral, Phi-3) — Full control, no per-token cost, but customer-owned GPU fleet and weights lifecycle.
- Anthropic / Google / third-party LLM APIs via Azure AI Studio — Model diversity, but mixed Gov availability and separate authorizations.
- On-device / CPU-only small models (Phi-3 mini, DistilBERT-class) — Zero infra cost, but quality floor is too low for production RAG.
## Decision Outcome
Chosen: Option 1 — Azure OpenAI Service as the default model endpoint, accessed through an OpenAI-compatible client so application code can be re-pointed at a self-hosted endpoint (Option 2) if a tenant requires it.
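The "OpenAI-compatible client that can be re-pointed" requirement can be sketched as per-tenant endpoint configuration rather than a hard-coded vendor. This is a hypothetical illustration: the names (`ModelEndpoint`, `client_kwargs`) and the URLs are invented for this sketch and are not platform APIs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelEndpoint:
    provider: str    # "azure_openai" or "self_hosted" (sketch-only labels)
    base_url: str    # reached via Private Endpoint in both cases
    model: str       # Azure deployment name, or the served model's name

def client_kwargs(ep: ModelEndpoint) -> dict:
    """Kwargs for any OpenAI-compatible client (e.g. openai.OpenAI)."""
    if ep.provider == "azure_openai":
        # Azure OpenAI: the credential is an Entra ID token, not a static key.
        return {"base_url": ep.base_url, "api_key": "<entra-id-token>"}
    # Self-hosted fallback speaks the same wire protocol; key is unused.
    return {"base_url": ep.base_url, "api_key": "unused"}

# Illustrative endpoints only — real hostnames are tenant-specific.
azure = ModelEndpoint("azure_openai",
                      "https://tenant-aoai.openai.azure.us/openai/v1", "gpt-4o")
fallback = ModelEndpoint("self_hosted",
                         "http://models.internal:8000/v1", "llama-3-70b-instruct")
# Application code stays identical; only the endpoint record changes.
```

Because both paths present the same request/response shape, activating the Option 2 fallback is a configuration change, not a code change.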
## Consequences
- Positive: FedRAMP High + DoD IL4/IL5 authorization path in Azure Gov.
- Positive: Private Endpoint support removes public-internet egress from the threat model.
- Positive: Entra ID authentication with managed-identity support — no long-lived API keys.
- Positive: Content filtering and jailbreak detection are built into the service — one less thing to implement.
- Positive: OpenAI SDK-compatible surface keeps application code portable.
- Negative: Per-token cost; token budgets are a live FinOps concern (tracked in docs/COST_MANAGEMENT.md).
- Negative: Model versions are Microsoft-controlled — deprecation windows are short and require active version management.
- Negative: Quota + capacity commitments (PTUs) are a procurement process for bursty workloads.
- Neutral: Self-hosted open-weights models remain a supported alternate via Azure ML, behind the same SDK shape.
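The per-token cost consequence can be made concrete with a small estimator over the `usage` block that OpenAI-compatible responses return. The prices below are placeholder values for illustration, not published rates; real rates and the budget envelope belong in the FinOps config referenced in docs/COST_MANAGEMENT.md.

```python
# USD per 1K tokens — illustrative placeholder values only.
PRICE_PER_1K = {
    "gpt-4o": {"prompt": 0.005, "completion": 0.015},
}

def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate one request's cost from the token counts in a response's usage block."""
    p = PRICE_PER_1K[model]
    return (prompt_tokens / 1000) * p["prompt"] + (completion_tokens / 1000) * p["completion"]

def within_envelope(month_cost_usd: float, envelope_usd: float) -> bool:
    """Check accumulated monthly spend against the tenant's FinOps envelope."""
    return month_cost_usd <= envelope_usd
```

Accumulating `request_cost` per tenant is what makes the Validation section's cost criterion measurable rather than anecdotal.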
## Pros and Cons of the Options

### Option 1 — Azure OpenAI Service
- Pros: FedRAMP High; Private Endpoints; Entra ID auth; content filtering; frontier-class models; OpenAI SDK-compatible.
- Cons: Per-token cost; model deprecation churn; quota management.
### Option 2 — Self-hosted open-weights on Azure ML / AKS
- Pros: No per-token cost at scale; full control over model version and weights; model fine-tuning is fully in-tenant.
- Cons: Customer-owned GPU fleet; patching, autoscaling, and observability are customer responsibilities; capability gap vs. frontier models.
### Option 3 — Third-party LLM APIs (Anthropic, Google)
- Pros: Model diversity; strong capabilities; some have competitive non-training guarantees.
- Cons: Separate FedRAMP authorizations; different auth models; additional vendor procurement.
### Option 4 — On-device small models
- Pros: Zero infra cost; offline-capable; trivial data residency.
- Cons: Quality floor too low for production RAG and text-to-SQL on non-trivial schemas.
## Validation
We will know this decision is right if:
- RAG + text-to-SQL use cases meet accuracy targets using Azure OpenAI models without needing a self-hosted fallback.
- Per-tenant monthly inference cost stays within the FinOps envelope set in docs/COST_MANAGEMENT.md.
- If token cost or model deprecation churn exceeds acceptable thresholds, activate the self-hosted fallback path (Option 2) for bulk workloads.
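The validation criteria above can be expressed as a single decision gate. This is a sketch only: the function name and thresholds are hypothetical, and the real thresholds would come from the accuracy targets and FinOps envelope cited above.

```python
def keep_azure_openai(rag_accuracy: float, accuracy_target: float,
                      monthly_cost_usd: float, cost_envelope_usd: float,
                      deprecations_last_quarter: int, churn_limit: int) -> bool:
    """True while all validation criteria hold; False means the Option 2
    self-hosted fallback should be evaluated for bulk workloads."""
    return (rag_accuracy >= accuracy_target
            and monthly_cost_usd <= cost_envelope_usd
            and deprecations_last_quarter <= churn_limit)
```

Encoding the gate as code keeps the "is this decision still right?" review mechanical instead of a judgment call each quarter.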
## References
- Decision tree: RAG vs. Fine-tune vs. Agents
- Related code: csa_platform/ai_integration/, csa_platform/ai_integration/rag/, csa_platform/ai_integration/model_serving/, csa_platform/ai_integration/enrichment/
- Framework controls: NIST 800-53 AC-4 (information flow enforcement via Private Endpoints), SC-7 (boundary protection), SC-8 (TLS in transit), SI-4 (content filtering / monitoring), AU-2 (audit of prompt + completion metadata — not content). See governance/compliance/nist-800-53-rev5.yaml.
- Discussion: CSA-0087