Best Practice — Multi-Model AI Orchestration¶

Comparative positioning note. This document is written from the perspective of Microsoft Azure, Cloud Scale Analytics, and CSA Loom. Any description of third-party or competing products, services, pricing, or capabilities is derived from publicly available documentation and sources believed accurate at the time of writing, and is provided for general comparison only. We do not claim expertise in, or authority over, any non-Microsoft product or service; the respective vendor's official documentation is the authoritative source for their offerings, which may change over time. Nothing here is intended to disparage any vendor — where a competing product has genuine advantages, we aim to note them honestly. Verify all third-party details against the vendor's current official documentation before making decisions.

The principle¶

No single model wins every task. The platform decision is the model-routing layer, not the model. Build the routing layer on day one; pick models as commodities.

Three years into the LLM era, frontier, open-weight, sovereign, and small task models all coexist in production. A serious enterprise architecture treats the model as a configurable backend, not a hardcoded dependency. This page is the opinionated playbook for orchestrating multiple models behind a single gateway on Azure.

The model portfolio¶

Most production deployments end up with five tiers of models in parallel:

Tier	Examples	Hosting	Typical use
Frontier general	GPT-4o, GPT-4.1, o-series, Claude (via Foundry MaaS where available)	Azure OpenAI / Foundry MaaS	Reasoning, broad NL, code, agentic
Open-weight (cost / sovereign)	Llama 3.x, Mistral Large, DeepSeek-V3, Phi-4	Foundry MaaS	Cheaper inference, data-boundary sensitivity
Small task-specific	Phi-4-mini, distilled classifiers, embedding models	Foundry MaaS or containerized	High-volume cheap inference; edge
Domain fine-tunes	Custom-tuned models on enterprise data	Foundry custom-deployment	Specialized tasks (legal, scientific, mission)
External / sovereign	Bedrock-hosted, sovereign LLM gateways, partner-hosted	Brokered via APIM	Existing investments, sovereign accreditation

Each tier has cost / latency / capability / boundary characteristics. Routing decisions select among them at runtime.

The routing layer — APIM as multi-model gateway¶

The routing layer lives in APIM, not in application code. Reasons:

One identity / governance / observability surface across every model call
Token budgets per consumer apply across all backends — the consumer is bounded regardless of which model they hit
Caching, content safety, and metric emission apply at the gateway with no per-application code
Model swap-outs happen in policy, not in code — no redeployments to switch a workload from GPT-4o to Mistral

Minimum policy set for a multi-model API¶

<policies>
  <inbound>
    <base />
    <validate-jwt header-name="Authorization" failed-validation-httpcode="401">
      <openid-config url="https://login.microsoftonline.com/{tenant}/v2.0/.well-known/openid-configuration" />
    </validate-jwt>
    <!-- Per-subscription token budget — protects against runaway agent loops -->
    <azure-openai-token-limit
        counter-key="@(context.Subscription.Id)"
        tokens-per-minute="100000"
        estimate-prompt-tokens="true"
        remaining-tokens-header-name="x-ratelimit-remaining-tokens" />
    <!-- Semantic cache — cuts cost 30-70% on repeated questions -->
    <azure-openai-semantic-cache-lookup
        score-threshold="0.85"
        embeddings-backend-id="aoai-embeddings"
        embeddings-deployment-id="text-embedding-3-large"
        ignore-system-messages="true" />
    <!-- Content safety inline -->
    <llm-content-safety backend-id="content-safety">
      <categories>
        <category name="Hate" threshold="4" />
        <category name="Violence" threshold="4" />
        <category name="SelfHarm" threshold="4" />
        <category name="Sexual" threshold="4" />
      </categories>
    </llm-content-safety>
    <!-- Backend pool selection by route -->
    <set-backend-service backend-id="aoai-pool" />
  </inbound>
  <backend><forward-request /></backend>
  <outbound>
    <base />
    <!-- Cache the response keyed by semantic similarity -->
    <azure-openai-semantic-cache-store duration="3600" />
    <!-- Emit token usage for chargeback -->
    <azure-openai-emit-token-metric>
      <dimension name="subscription-id" />
      <dimension name="model-id" />
      <dimension name="api-name" />
    </azure-openai-emit-token-metric>
  </outbound>
</policies>

This is the production-grade answer to four hard problems at once: cost control, latency control, content safety, and chargeback telemetry. No native equivalent exists on a competing cloud's API gateway, nor on a competing integration platform.

Routing strategies¶

Pick one explicit strategy per API endpoint. Mixed strategies in one code path are a pathology.

Strategy A — Static routing¶

One endpoint → one model. Simplest. Use for capability-bound workloads (e.g., a vision-capable endpoint that requires GPT-4o).

Strategy B — Cost-aware tiered routing¶

Default to a cheaper model; escalate to frontier on signal:

graph LR
    REQ[Request] --> CLASS[Classifier<br/>Phi-4-mini · 1ms]
    CLASS -->|simple| SMALL[Mistral / Phi]
    CLASS -->|complex| LARGE[GPT-4o]
    CLASS -->|reasoning| REASON[o-series]

Use when: workload has a wide cost / complexity distribution; latency tolerates an extra hop.

Strategy C — Capability-bound routing¶

Choose model per declared capability requirement in the request:

POST /v1/chat/completions
x-ms-model-class: vision-and-reasoning

APIM evaluates the header and routes to the matching backend.

Strategy D — Multi-region fallback pool¶

One logical model, several regional deployments behind it. APIM backend pool with circuit breakers handles 429 / quota exhaustion automatically:

graph LR
    CLIENT --> APIM
    APIM --> AOAI1[AOAI East US]
    APIM --> AOAI2[AOAI West US]
    APIM --> AOAI3[AOAI Europe]
    AOAI1 -.-> CB[Circuit breaker on 429]
    AOAI2 -.-> CB
    AOAI3 -.-> CB

Use as the default for any production AOAI deployment.

Strategy E — Tenant-bound routing¶

Different consumers route to different backends — for sovereign-data boundaries, partner-specific fine-tunes, or cost tiering by SLA.

<choose>
  <when condition="@(context.Subscription.Name == "tenant-sovereign")">
    <set-backend-service backend-id="aoai-sovereign-pool" />
  </when>
  <when condition="@(context.Subscription.Name == "tenant-cost-tier")">
    <set-backend-service backend-id="maas-mistral-pool" />
  </when>
  <otherwise>
    <set-backend-service backend-id="aoai-default-pool" />
  </otherwise>
</choose>

Cost governance¶

Token budgets¶

Every consuming subscription has a token-per-minute budget enforced at APIM. Tiers:

Tier	Tokens/min	Tokens/month	Typical use
Sandbox	10,000	5M	Developer exploration
Dev / test	50,000	50M	Application QA
Production small	100,000	500M	Standard production workload
Production large	1,000,000	Negotiated	Heavy production workload

Budgets are enforced not requested. Consumers who blow past their budget get 429; FinOps reviews escalations.

Semantic caching¶

For FAQ-style traffic and repeated agent prompts, semantic caching cuts cost 30–70%:

score-threshold 0.85 is a sane default; 0.9 for higher safety, 0.8 for higher hit rate
ignore-system-messages true — most prompt variation is in system role
TTL — 1 hour for highly dynamic data, 24 hours for stable knowledge

Measure hit rate per API; tune threshold per workload.

Chargeback¶

azure-openai-emit-token-metric with subscription-id, model-id, and api-name dimensions feeds a chargeback dashboard. Hook into FinOps monthly reporting:

Dimension	What it shows
`subscription-id`	Which consuming application
`model-id`	Cost driver: frontier vs commodity
`api-name`	Which workload (chat, embeddings, etc.)
`tenant` (custom)	Per-tenant billing in multi-tenant scenarios

Content safety¶

Three layers, applied in order:

Prompt-side guard — APIM llm-content-safety policy on inbound request body
Model-side guard — Azure OpenAI / Foundry built-in content filters at model invocation
Response-side guard — APIM llm-content-safety on outbound response body

For agentic workloads, add prompt-injection detection via Azure AI Content Safety Prompt Shields on tool-call inputs and outputs.

For mission / regulated environments, the policy thresholds tighten:

Category	Standard	Mission / regulated
Hate	4	2
Violence	4	2
Self-harm	4	2
Sexual	4	2
Prompt-injection	On	On with response blocking, not just annotation

Evaluation discipline¶

Models are commodity; prompts and tools are the differentiated asset. Treat them as code:

Prompts versioned in source control
Eval datasets versioned in source control
Foundry Evaluations runs continuously in CI on prompt or model changes
Regression on capability metrics blocks deployment
A/B testing in production with shadow traffic before flipping route weights

The eval-on-every-prompt-change discipline catches the silent regressions that destroy agent workloads.

Lifecycle management¶

Model versions, fine-tunes, prompt versions, and agent definitions all need lifecycle treatment:

Artifact	Lifecycle
Model deployment	Version-pinned in APIM backend; rotation via new deployment + DNS swap
Prompt	Versioned in repo; deployed via CI; A/B tested
Fine-tune	Catalogued in Purview AI registry; eval results attached
Agent	Catalogued in Foundry Agent registry; tied to MCP tool definitions; lifecycle owner named
MCP server	Catalogued in APIM as an API; OpenAPI / tool manifest published

Retirement is announced. Deprecated artifacts return 410 after sunset.

The MCP layer¶

Model Context Protocol servers — exposing tools and resources to model clients — sit behind APIM in the recommended architecture:

graph LR
    AGENT[Agent] --> APIM
    APIM --> MCP1[MCP: Dataverse]
    APIM --> MCP2[MCP: Graph]
    APIM --> MCP3[MCP: Databricks]
    APIM --> MCP4[MCP: EAM]
    APIM --> MCP5[MCP: Web]

APIM in front of MCP gives:

Token budgets across all MCP tool calls per consumer
Per-tool authorization via Entra scope mapping
Observability per tool with App Insights dimensions
Easy retirement / replacement of one MCP server without changing the agent
Versioning of tool catalogs

See the APIM + MCP guide for implementation.

Anti-patterns to refuse¶

Anti-pattern	Refuse because
Hardcoding a model SDK in the application	Wrong layer; the model should be configurable backend
Letting each app implement its own token budgeting	Re-implements platform capability; loses cross-app limits
Skipping semantic cache "until we measure cost"	The cost shows up when it's too late; semantic cache is cheap to enable and cheap to remove
Letting each app implement its own content safety	Inconsistent; an audit finding waiting to happen
Letting each app log its own token usage	Chargeback breaks at the moment you need it
Deploying a fine-tune without an eval baseline	Regression-prone; impossible to operate at scale
"We don't need MCP yet"	Build the gateway with MCP in mind; bolting it on later costs more