Guide — APIM + MCP Layered Orchestration¶
Freshness
Validated against: Azure API Management (AI-gateway policies: token-limit, token-metric, semantic caching) fronting Model Context Protocol (MCP) servers + Azure OpenAI / multi-model routing — as of 2026-06-02. MCP is an evolving spec and APIM's AI-gateway + MCP support is moving quickly; verify the protocol version and APIM policy surface against the current docs before deploying.
The pattern¶
Model Context Protocol (MCP) servers expose tools (callable functions) and resources (readable data) to LLM clients over a uniform protocol. Agent clients call MCP servers; MCP servers call backends.
The naive deployment puts each agent directly in front of its own MCP servers. That fails in production for four reasons:
- No shared rate limit / token budget. One bad agent loop drains shared quotas across multiple agents.
- No central auth. Each agent / MCP server pairing reimplements auth.
- No central observability. Token usage, tool latency, failure rates fragment across teams.
- No central lifecycle. Retiring a tool means touching every consumer.
The production pattern puts APIM between agents and MCP servers — making MCP servers first-class APIs in the gateway.
Architecture¶
graph TB
subgraph Agents["Agentic clients"]
AGT1[Copilot Studio agent]
AGT2[Foundry Agent Service]
AGT3[Semantic Kernel / AutoGen agent]
AGT4[GitHub Copilot extension]
end
subgraph Gateway["Azure API Management"]
APIM["APIM Premium v2<br/>JWT validation · Token quotas<br/>Semantic cache · Content safety<br/>Observability · Catalog"]
end
subgraph MCPTier["MCP Server Tier"]
MCP_DV[MCP: Dataverse domain<br/>tools: query_records, create_record, ...]
MCP_GR[MCP: Graph domain<br/>tools: search_email, list_files, ...]
MCP_DBX[MCP: Databricks domain<br/>tools: run_query, get_table_schema]
MCP_EAM[MCP: EAM domain<br/>tools: get_work_orders, update_work_order]
MCP_WEB[MCP: Web search<br/>tools: search, fetch_url]
end
subgraph Backends["Backend systems"]
DV[(Dataverse)]
GR[(Microsoft Graph)]
DBX[(Databricks SQL)]
EAM[(EAM REST API)]
WEB[(Bing / Tavily)]
end
Agents --> APIM
APIM --> MCP_DV
APIM --> MCP_GR
APIM --> MCP_DBX
APIM --> MCP_EAM
APIM --> MCP_WEB
MCP_DV --> DV
MCP_GR --> GR
MCP_DBX --> DBX
MCP_EAM --> EAM
MCP_WEB --> WEB
Why layered¶
| Property | Direct MCP | Layered (APIM + MCP) |
|---|---|---|
| Token budget per consumer | Per-MCP | Cross-MCP (one budget regardless of which tool) |
| Auth | Per-MCP | One Entra-issued token validated at gateway |
| Per-tool authorization | Per-MCP custom | APIM scope mapping |
| Observability | Per-MCP | Unified App Insights with dimensions |
| Caching | Per-MCP | Semantic cache shared across MCPs for repeated queries |
| Retirement | Touch every agent | Update APIM routing |
| Multi-agent reuse | Re-implement per agent | Configure once, reuse across agents |
Concrete benefits in production¶
Token exhaustion guard¶
One runaway agent loop can drain a $40k token budget overnight. APIM's azure-openai-token-limit policy (llm-token-limit is the equivalent for other LLM backends), applied at the gateway, caps it:
<azure-openai-token-limit
    counter-key="@(context.Subscription.Id)"
    tokens-per-minute="100000"
    estimate-prompt-tokens="true"
    remaining-tokens-header-name="x-ratelimit-remaining-tokens" />
The budget is enforced regardless of which MCP server the agent calls or which backend that MCP server hits, because the counter is keyed on the subscription rather than on the tool.
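How a single counter key becomes a budget shared across tools can be sketched in a few lines of Python. This is an illustrative fixed-window counter under simplifying assumptions, not APIM's implementation; `TokenBudget` and `try_consume` are hypothetical names.

```python
import time


class TokenBudget:
    """Fixed-window tokens-per-minute budget keyed by consumer (illustrative only)."""

    def __init__(self, tokens_per_minute: int, clock=time.monotonic):
        self.limit = tokens_per_minute
        self.clock = clock
        # counter key -> (window start, tokens used in that window)
        self.windows = {}

    def try_consume(self, counter_key: str, tokens: int) -> bool:
        """Record usage and return True if the key has budget left, else False."""
        now = self.clock()
        start, used = self.windows.get(counter_key, (now, 0))
        if now - start >= 60:
            # A new minute has started: reset the counter for this key.
            start, used = now, 0
        if used + tokens > self.limit:
            self.windows[counter_key] = (start, used)
            return False  # a gateway would answer 429 at this point
        self.windows[counter_key] = (start, used + tokens)
        return True


budget = TokenBudget(tokens_per_minute=100_000)
# Two different tools called under the same subscription share one budget.
print(budget.try_consume("sub-a", 60_000))  # True
print(budget.try_consume("sub-a", 50_000))  # False: would exceed 100,000
print(budget.try_consume("sub-b", 50_000))  # True: separate key, separate budget
```

The tool name never appears in the key, which is why the cap holds no matter which MCP server a call is routed to.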
Cost management¶
azure-openai-emit-token-metric emits token counts to App Insights as custom metrics. Adding agent-id, tool-name, and subscription-id dimensions gives a chargeback report:
customMetrics
| where name == "Total tokens"
| extend agent = tostring(customDimensions["agent-id"])
| extend tool = tostring(customDimensions["tool-name"])
| extend sub = tostring(customDimensions["subscription-id"])
| summarize tokens = sum(value) by agent, tool, sub, bin(timestamp, 1d)
| render columnchart
This is the production-grade answer to "which agent / tool combination is costing us money."
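The grouping that the query performs can also be reproduced offline, for example against exported metric rows. The sketch below assumes rows shaped like `customMetrics` records with the three dimensions used above; the sample rows are made up for illustration and `summarize_tokens` is a hypothetical helper.

```python
from collections import defaultdict
from datetime import datetime


def summarize_tokens(rows):
    """Sum token counts per (agent, tool, subscription, day)."""
    totals = defaultdict(int)
    for row in rows:
        dims = row["customDimensions"]
        day = datetime.fromisoformat(row["timestamp"]).date().isoformat()
        key = (
            dims.get("agent-id", "unknown"),
            dims.get("tool-name", "unknown"),
            dims.get("subscription-id", "unknown"),
            day,
        )
        totals[key] += row["value"]
    return dict(totals)


# Made-up sample rows; real rows would come from an App Insights export.
rows = [
    {"timestamp": "2026-06-01T09:15:00", "value": 1200,
     "customDimensions": {"agent-id": "ops-bot", "tool-name": "/work-orders",
                          "subscription-id": "sub-a"}},
    {"timestamp": "2026-06-01T17:40:00", "value": 800,
     "customDimensions": {"agent-id": "ops-bot", "tool-name": "/work-orders",
                          "subscription-id": "sub-a"}},
    {"timestamp": "2026-06-02T08:05:00", "value": 500,
     "customDimensions": {"agent-id": "ops-bot", "tool-name": "/work-orders",
                          "subscription-id": "sub-a"}},
]

for key, tokens in sorted(summarize_tokens(rows).items()):
    print(key, tokens)
```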
Semantic caching across tools¶
When two agents ask similar questions of the same MCP, the gateway embeds the incoming prompt, compares it against the embeddings of cached prompts, and on a close enough match returns the cached result without hitting the backend:
<azure-openai-semantic-cache-lookup
    score-threshold="0.85"
    embeddings-backend-id="aoai-embeddings"
    ignore-system-messages="true" />
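Especially valuable for read-heavy MCPs (Graph search, Dataverse queries, EAM lookups). Tune score-threshold against real traffic: a threshold that is too loose returns a cached answer to a different question.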
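The lookup logic can be sketched as follows. This is a toy model, not the gateway's implementation: `SemanticCache`, `toy_embed`, and the vocabulary are hypothetical, the embedding is a bag-of-words stand-in for a real embeddings model, and `min_similarity` here means "higher is stricter", which may not match the direction of APIM's score-threshold.

```python
import math

# Tiny fixed vocabulary for the toy embedding (assumption for illustration).
VOCAB = ["open", "work", "orders", "site", "weather", "today"]


def toy_embed(text):
    """Bag-of-words vector over VOCAB; stands in for an embeddings model."""
    words = text.lower().replace("?", "").split()
    return [float(words.count(w)) for w in VOCAB]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed, min_similarity):
        self.embed = embed
        self.min_similarity = min_similarity
        self.entries = []  # list of (prompt embedding, cached response)

    def store(self, prompt, response):
        self.entries.append((self.embed(prompt), response))

    def lookup(self, prompt):
        """Return the best cached response above the threshold, else None."""
        query = self.embed(prompt)  # the prompt is embedded on every lookup
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine_similarity(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.min_similarity else None


cache = SemanticCache(toy_embed, min_similarity=0.85)
cache.store("open work orders site A", "42 open work orders")
print(cache.lookup("work orders open site A"))  # 42 open work orders
print(cache.lookup("weather today"))            # None
```

A hit saves the backend call but still costs one embedding request, which is worth including when estimating the savings.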
MCP server skeleton¶
A minimal MCP server in Python, designed to sit behind APIM:
# mcp_server_eam.py
import os

import httpx
from azure.identity.aio import ManagedIdentityCredential
from mcp.server.fastmcp import FastMCP

# Bind to all interfaces so the container ingress can reach the server.
app = FastMCP("eam-domain", host="0.0.0.0", port=8000)

EAM_API_BASE = os.environ["EAM_API_BASE"]
# Assumes the EAM API's Entra application ID URI equals its base URL.
EAM_SCOPE = EAM_API_BASE + "/.default"

# Identity for downstream calls to the EAM REST API. The async credential
# keeps token acquisition from blocking the event loop.
cred = ManagedIdentityCredential()


async def _auth_headers() -> dict[str, str]:
    token = await cred.get_token(EAM_SCOPE)
    return {"Authorization": f"Bearer {token.token}"}


@app.tool()
async def get_work_orders(
    site: str | None = None,
    status: str | None = None,
    priority: str | None = None,
    limit: int = 50,
) -> list[dict]:
    """Return work orders, optionally filtered by site, status, priority."""
    params = {"limit": str(limit)}
    if site:
        params["site"] = site
    if status:
        params["status"] = status
    if priority:
        params["priority"] = priority
    async with httpx.AsyncClient(timeout=30) as client:
        resp = await client.get(
            f"{EAM_API_BASE}/work-orders", params=params, headers=await _auth_headers()
        )
        resp.raise_for_status()
        return resp.json()["value"]


@app.tool()
async def update_work_order(work_order_id: str, status: str, notes: str = "") -> dict:
    """Update a work order's status; requires the Eam.Write scope at APIM."""
    body = {"status": status, "notes": notes}
    async with httpx.AsyncClient(timeout=30) as client:
        resp = await client.patch(
            f"{EAM_API_BASE}/work-orders/{work_order_id}", json=body, headers=await _auth_headers()
        )
        resp.raise_for_status()
        return resp.json()


if __name__ == "__main__":
    # Serve over HTTP so APIM can forward requests; the default transport is stdio.
    app.run(transport="streamable-http")
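Deploy as a container in Azure Container Apps. The container does not need a public endpoint; it only has to be reachable from APIM over private networking (for example, internal ingress on a VNet that APIM is integrated with, or a private endpoint).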
APIM policy fragment for an MCP-fronted API¶
<policies>
    <inbound>
        <base />
        <!-- Entra-issued JWT; the validated token is kept for scope mapping -->
        <validate-jwt header-name="Authorization" failed-validation-httpcode="401" output-token-variable-name="jwt">
            <openid-config url="https://login.microsoftonline.com/{tenant}/v2.0/.well-known/openid-configuration" />
            <required-claims>
                <claim name="aud"><value>api://your-mcp-app-id</value></claim>
            </required-claims>
        </validate-jwt>
        <!-- Read the granted scopes (space-separated "scp" claim) from the validated token;
             later policies map these scopes to authorized MCP tools -->
        <set-variable name="grantedScopes" value="@(((Jwt)context.Variables["jwt"]).Claims.GetValueOrDefault("scp", ""))" />
        <!-- Per-subscription token budget across all MCPs -->
        <azure-openai-token-limit
            counter-key="@(context.Subscription.Id)"
            tokens-per-minute="100000"
            estimate-prompt-tokens="true" />
        <!-- Per-subscription request rate (catches non-LLM tool calls too) -->
        <rate-limit-by-key calls="600" renewal-period="60" counter-key="@(context.Subscription.Id)" />
        <!-- Semantic cache for read-heavy tools -->
        <azure-openai-semantic-cache-lookup
            score-threshold="0.85"
            embeddings-backend-id="aoai-embeddings"
            ignore-system-messages="true" />
        <!-- Content safety on tool inputs: prompt-injection detection via Prompt Shields -->
        <llm-content-safety backend-id="content-safety" shield-prompt="true" />
        <!-- Route to the MCP backend -->
        <set-backend-service backend-id="mcp-eam-domain" />
    </inbound>
    <backend>
        <forward-request />
    </backend>
    <outbound>
        <base />
        <azure-openai-semantic-cache-store duration="600" />
        <azure-openai-emit-token-metric>
            <dimension name="subscription-id" value="@(context.Subscription.Id)" />
            <dimension name="tool-name" value="@(context.Request.OriginalUrl.Path)" />
            <dimension name="agent-id" value="@(context.Request.Headers.GetValueOrDefault("X-Agent-Id", "unknown"))" />
        </azure-openai-emit-token-metric>
    </outbound>
</policies>
Agent-side wiring¶
Copilot Studio¶
In Copilot Studio, register the MCP-fronted API as a custom connector built from the OpenAPI document APIM publishes. The Copilot Studio runtime calls APIM; APIM calls MCP; MCP calls the backend.
Foundry Agent Service¶
import json

from azure.ai.agents.models import OpenApiAnonymousAuthDetails, OpenApiTool

# OpenAPI document for the MCP-fronted API, exported from APIM and saved locally
with open("eam-mcp-openapi.json") as f:
    spec = json.load(f)

eam_tool = OpenApiTool(
    name="enterprise_asset_management",
    description="Tools for querying and updating work orders, assets, and maintenance events.",
    spec=spec,
    # Anonymous auth sends no credentials, so it only works if the gateway does not
    # require a token for this API. Use a managed-identity or connection-based auth
    # type instead when validate-jwt is enforced at APIM.
    auth=OpenApiAnonymousAuthDetails(),
)
Semantic Kernel¶
var kernel = Kernel.CreateBuilder()
    .AddAzureOpenAIChatCompletion(...)
    .Build();

await kernel.ImportPluginFromOpenApiAsync(
    "EamMcp",
    new Uri("https://yourapim.azure-api.net/eam-mcp/openapi.json"),
    new OpenApiFunctionExecutionParameters {
        AuthCallback = AddApimSubscriptionKeyHeader
    }
);
In every case, the agent is configured against the APIM-fronted MCP, not against the MCP directly.
Per-tool authorization¶
Use Entra scopes to express which tools a given consumer can invoke:
| Scope | Tool surface |
|---|---|
| Eam.Read | get_work_orders, get_asset_history, list_sites |
| Eam.Write | update_work_order, create_work_order |
| Eam.Admin | retire_asset, modify_pm_schedule |
The APIM policy maps the requested tool to its required scope and rejects calls whose token lacks that scope. Enforcement happens at the gateway rather than in MCP code, so the MCP server does not need to know about Entra.
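The mapping logic itself is small. The Python sketch below mirrors the table above and is meant only to make the decision rule concrete; in the layered pattern the check lives in APIM policy, and `TOOL_SCOPES` and `is_authorized` are hypothetical names. It assumes the token's `scp` claim is a space-separated string and that scopes are not hierarchical, so Eam.Admin does not imply Eam.Read.

```python
# Tool name -> scope required to call it (taken from the table above).
TOOL_SCOPES = {
    "get_work_orders": "Eam.Read",
    "get_asset_history": "Eam.Read",
    "list_sites": "Eam.Read",
    "update_work_order": "Eam.Write",
    "create_work_order": "Eam.Write",
    "retire_asset": "Eam.Admin",
    "modify_pm_schedule": "Eam.Admin",
}


def is_authorized(tool_name: str, scp_claim: str) -> bool:
    """Check a tool against the space-separated scp claim of a validated token."""
    required = TOOL_SCOPES.get(tool_name)
    if required is None:
        return False  # tools missing from the map are denied by default
    return required in scp_claim.split()


print(is_authorized("get_work_orders", "Eam.Read"))              # True
print(is_authorized("update_work_order", "Eam.Read"))            # False
print(is_authorized("update_work_order", "Eam.Read Eam.Write"))  # True
```

Denying unmapped tools by default means a newly added tool stays unreachable until someone assigns it a scope.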
Observability — what to dashboard¶
The operational dashboard for an agent estate:
| Panel | Query |
|---|---|
| Tokens per agent per day | `customMetrics \| where name == "Total tokens" \| summarize sum(value) by tostring(customDimensions["agent-id"]), bin(timestamp, 1d)` |
| Tokens per tool per day | `customMetrics \| where name == "Total tokens" \| summarize sum(value) by tostring(customDimensions["tool-name"]), bin(timestamp, 1d)` |
| Tool call success rate | `ApiManagementGatewayLogs \| summarize total = count(), errors = countif(ResponseCode >= 400) by ApiId \| extend success_rate = (total - errors) * 100.0 / total` |
| Cache hit rate | `customMetrics \| where name in ("Cache hit", "Cache miss") \| summarize hits = sumif(value, name == "Cache hit"), total = sum(value) \| extend hit_rate = hits * 100.0 / total` (assumes the gateway emits both as custom metrics) |
| Rate-limit hits | `ApiManagementGatewayLogs \| where ResponseCode == 429 \| summarize count() by ApimSubscriptionId, ApiId` |
| Top consumers | `ApiManagementGatewayLogs \| summarize calls = count() by ApimSubscriptionId \| top 10 by calls` |
Set alerts on token-budget-near-exhaustion, sustained rate-limit hits, and tool error rate above threshold.
Anti-patterns¶
| Anti-pattern | Why to avoid it |
|---|---|
| Connecting agents directly to MCP servers without APIM | Loses the cross-tool budget, unified observability, and consolidated auth |
| One giant MCP server for everything | Poor cohesion and a large blast radius; split servers by domain |
| Each agent stands up its own MCP servers | Wastes engineering effort; loses the portfolio view |
| MCP servers calling backends through their own ad-hoc REST clients | The MCP should call backends through APIM where possible to keep one identity model |
| Skipping semantic cache "for now" | Cost discipline is easier to build in from the start than to retrofit |
| MCP servers exposing free-form SQL | Tools should be parameterized; raw SQL lets a prompt-injected agent run arbitrary queries against the backend |