Multi-Cloud AI — OpenAI-compatible APIs as the open contract¶

Comparative positioning note

This document is written from the perspective of Microsoft Azure, Cloud Scale Analytics, and CSA Loom. Any description of third-party or competing products, services, pricing, or capabilities is derived from publicly available documentation and sources believed accurate at the time of writing, and is provided for general comparison only. We do not claim expertise in, or authority over, any non-Microsoft product or service; the respective vendor's official documentation is the authoritative source for their offerings, which may change over time. Nothing here is intended to disparage any vendor — where a competing product has genuine advantages, we aim to note them honestly. Verify all third-party details against the vendor's current official documentation before making decisions.

The model market converged on the OpenAI chat-completions contract between 2024 and 2026. Every major provider now exposes a wrapper or native endpoint that speaks the same OpenAI SDK calls. That convergence is the AI layer's equivalent of Delta Lake — an open standard that every vendor implements, so the application becomes portable across model providers without code changes.

The right pattern is: applications talk OpenAI SDK; an orchestrator (Azure AI Foundry) routes the call to whichever backend the policy says to use. Swapping models becomes a config change, not a code change.

The architecture¶

flowchart LR
    APP["Application<br/>(OpenAI SDK call)"]
    FOUNDRY["Azure AI Foundry<br/>(policy router + safety + tracing)"]
    AOAI["Azure OpenAI<br/>(GPT-4o, GPT-4.1)"]
    BEDROCK["AWS Bedrock<br/>(Claude, Llama via wrapper)"]
    VERTEX["GCP Vertex<br/>(Gemini via wrapper)"]
    ANTHROPIC["Anthropic direct<br/>(Claude)"]
    LOCAL["On-prem Ollama / vLLM<br/>(air-gapped)"]

    APP --> FOUNDRY
    FOUNDRY --> AOAI
    FOUNDRY --> BEDROCK
    FOUNDRY --> VERTEX
    FOUNDRY --> ANTHROPIC
    FOUNDRY --> LOCAL

    classDef anchor fill:#0078D4,stroke:#fff,color:#fff,stroke-width:2px
    classDef peer fill:#5C2D91,stroke:#fff,color:#fff,stroke-width:2px
    classDef oss fill:#107C10,stroke:#fff,color:#fff,stroke-width:2px

    class FOUNDRY,AOAI anchor
    class BEDROCK,VERTEX,ANTHROPIC peer
    class LOCAL oss

The OpenAI-compatible API contract¶

The contract is small enough to fit on a card:

POST /v1/chat/completions — model name + messages array → completion.
POST /v1/embeddings — model name + input → vector.
POST /v1/images/generations — model name + prompt → image URL.
GET /v1/models — list available models.

Every provider's wrapper accepts the same payload shape. The provider may add provider-specific fields (Bedrock has anthropic_version, Vertex has safety_settings), but the required fields are common. Applications written against the OpenAI SDK pass the same payload to any of them.

Provider matrix¶

Provider	Endpoint	OpenAI compat	Best for
Azure OpenAI	`https://{resource}.openai.azure.com/...`	Native (it is OpenAI)	Primary, regulated data, Azure-resident workloads
OpenAI direct	`https://api.openai.com/v1/...`	Native	Frontier model access, fastest GPT-4.x updates
AWS Bedrock	`https://bedrock-runtime.{region}.amazonaws.com/openai/v1/...`	Wrapper since 2024	Claude, Llama, Titan; AWS-resident workloads
GCP Vertex	`https://{region}-aiplatform.googleapis.com/openai/v1/...`	Wrapper since 2024	Gemini, PaLM; GCP-resident workloads
Anthropic direct	`https://api.anthropic.com/v1/...` (compat endpoint)	Wrapper	Claude with no cloud intermediary
Ollama	`http://{host}:11434/v1/...`	Native	Local dev, air-gapped, on-prem
vLLM	configurable	Native	High-throughput self-hosted
LiteLLM	configurable	Native	Universal proxy with cost tracking
TGI (HuggingFace)	configurable	Native	Self-hosted with HF model zoo

Azure AI Foundry as the orchestrator¶

Foundry is the right orchestrator anchor because:

It speaks the OpenAI contract natively as both client and server.
Built-in content safety — Azure AI Content Safety scans prompts and completions for harm categories.
Built-in evaluation — Prompt Flow runs evaluation suites against any backend.
Built-in tracing — OpenTelemetry-native, exports to Application Insights.
Policy router — route by cost, latency, data residency, or model quality.
Connection registry — register Bedrock, Vertex, Anthropic, Ollama as connections; switch via config.

Applications never call backends directly. They call Foundry. The backend choice is operational policy.

Policy patterns¶

The policy router is where the multi-cloud value shows up. Common policies:

Cost-based routing¶

# Route long, low-stakes prompts to cheaper backends
routes:
  - match:
      prompt_tokens: ">2000"
      sensitivity: "low"
    backend: gcp-vertex-gemini-flash
  - match:
      sensitivity: "high"
    backend: azure-openai-gpt-4o
  - default:
      backend: azure-openai-gpt-4o-mini

Data-residency routing¶

# Keep regulated data in-region
routes:
  - match:
      data_classification: "restricted"
      region: "us-gov-virginia"
    backend: azure-openai-gov
  - match:
      data_classification: "restricted"
      region: "eu"
    backend: azure-openai-westeurope
  - default:
      backend: azure-openai-eastus2

Model-quality routing¶

# Route to the best model for the task type
routes:
  - match:
      task: "code_generation"
    backend: anthropic-claude-sonnet-4
  - match:
      task: "rag_synthesis"
    backend: azure-openai-gpt-4o
  - match:
      task: "embedding"
    backend: azure-openai-text-embedding-3-large
  - default:
      backend: azure-openai-gpt-4o-mini

Fallback routing¶

# Failover on backend outage
backends:
  - name: primary
    endpoint: azure-openai-eastus2
  - name: secondary
    endpoint: aws-bedrock-claude-sonnet
  - name: tertiary
    endpoint: gcp-vertex-gemini-pro
strategy: failover_on_5xx

Embeddings + vector store portability¶

Embeddings have their own portability story. The vector dimensions of text-embedding-3-large (3072) are different from bedrock-titan-embed-v2 (1024) which is different from vertex-text-embedding-005 (768). Embeddings from different models are not interchangeable.

The discipline: pick one embedding model and stick with it across the entire vector store. Treat the embedding model as part of the schema, not a swappable choice. If you must change embedding model, re-embed the entire corpus.

The recommended default is Azure OpenAI text-embedding-3-large (3072 dims) with the vector store on Azure AI Search or Postgres pgvector. Both are first-class supported and the 3072-dim embeddings have strong cross-domain retrieval quality.

On-prem and air-gapped patterns¶

Some workloads cannot leave on-prem. The OpenAI-compatible contract covers this too:

Ollama — runs any GGUF-format model on commodity hardware, exposes OpenAI-compatible endpoint.
vLLM — high-throughput self-hosted serving, OpenAI-compatible.
LiteLLM — proxy that exposes OpenAI-compatible endpoint over any backend (including on-prem); useful for a single consistent interface across a mix of cloud + on-prem.
Azure AI Foundry connected to Arc-enabled on-prem nodes — the same Foundry policy router can route to on-prem Ollama when the data classification requires it.

Air-gapped pattern: Foundry runs in a disconnected Azure Stack Hub or sovereign cloud; backends are all on-prem Ollama / vLLM; the application code is identical to the connected case.

Anti-patterns¶

Application calls Azure OpenAI directly. This locks the application to Azure OpenAI. Always go through Foundry (or LiteLLM as a lighter proxy) so backend swap is config-only.
Hard-coding model names in the application. Use a model alias resolved at runtime by the orchestrator. Application asks for code-model-fast, orchestrator resolves to whatever the current best fit is.
Different embedding model per use case. Vector stores cannot mix dimensions. Pick one embedding model per corpus and stick with it.
No content safety on non-Azure backends. Bedrock and Vertex have their own safety; Ollama has none. Foundry's content safety layer covers all of them uniformly. Use it.
No cost tracking. Foundry's tracing exports per-call cost to App Insights. Use it; otherwise multi-backend deployments develop runaway-spend habits.