Guide: APIM + MCP Layered Orchestration

Freshness

Validated against: Azure API Management (AI-gateway policies: token-limit, token-metric, semantic caching) fronting Model Context Protocol (MCP) servers + Azure OpenAI / multi-model routing — as of 2026-06-02. MCP is an evolving spec and APIM's AI-gateway + MCP support is moving quickly; verify the protocol version and APIM policy surface against the current docs before deploying.

The pattern

Model Context Protocol (MCP) servers expose tools (callable functions) and resources (readable data) to LLM clients over a uniform protocol. Agent clients call MCP servers; MCP servers call backends.
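To make "uniform protocol" concrete: MCP messages are JSON-RPC 2.0, with `tools/call` used to invoke a tool and `resources/read` used to fetch a resource. The sketch below only builds the request bodies; the site code and the `eam://` resource URI are hypothetical values chosen for illustration, and transport, session setup, and initialization are left out.

```python
import json
from itertools import count

_request_ids = count(1)  # each request in a session needs its own id


def build_tool_call(tool_name: str, arguments: dict) -> dict:
    """JSON-RPC 2.0 request asking an MCP server to invoke a tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


def build_resource_read(uri: str) -> dict:
    """JSON-RPC 2.0 request asking an MCP server to return a resource."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "resources/read",
        "params": {"uri": uri},
    }


# Hypothetical site code and resource URI, for illustration only.
tool_request = build_tool_call("get_work_orders", {"site": "PLANT-01", "limit": 10})
resource_request = build_resource_read("eam://sites/PLANT-01")
print(json.dumps(tool_request, indent=2))
```

Because every tool call has this same shape, a gateway in front of the MCP server can inspect `params.name` to learn which tool is being requested.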

The naive deployment puts each agent directly in front of its own MCP servers. That fails in production for four reasons:

No shared rate limit / token budget. One bad agent loop drains shared quotas across multiple agents.
No central auth. Each agent / MCP server pairing reimplements auth.
No central observability. Token usage, tool latency, failure rates fragment across teams.
No central lifecycle. Retiring a tool means touching every consumer.

The production pattern puts APIM between agents and MCP servers — making MCP servers first-class APIs in the gateway.

Architecture

graph TB
    subgraph Agents["Agentic clients"]
        AGT1[Copilot Studio agent]
        AGT2[Foundry Agent Service]
        AGT3[Semantic Kernel / AutoGen agent]
        AGT4[GitHub Copilot extension]
    end

    subgraph Gateway["Azure API Management"]
        APIM["APIM Premium v2<br/>JWT validation · Token quotas<br/>Semantic cache · Content safety<br/>Observability · Catalog"]
    end

    subgraph MCPTier["MCP Server Tier"]
        MCP_DV[MCP: Dataverse domain<br/>tools: query_records, create_record, ...]
        MCP_GR[MCP: Graph domain<br/>tools: search_email, list_files, ...]
        MCP_DBX[MCP: Databricks domain<br/>tools: run_query, get_table_schema]
        MCP_EAM[MCP: EAM domain<br/>tools: get_work_orders, update_work_order]
        MCP_WEB[MCP: Web search<br/>tools: search, fetch_url]
    end

    subgraph Backends["Backend systems"]
        DV[(Dataverse)]
        GR[(Microsoft Graph)]
        DBX[(Databricks SQL)]
        EAM[(EAM REST API)]
        WEB[(Bing / Tavily)]
    end

    Agents --> APIM
    APIM --> MCP_DV
    APIM --> MCP_GR
    APIM --> MCP_DBX
    APIM --> MCP_EAM
    APIM --> MCP_WEB

    MCP_DV --> DV
    MCP_GR --> GR
    MCP_DBX --> DBX
    MCP_EAM --> EAM
    MCP_WEB --> WEB

Why layered

Property	Direct MCP	Layered (APIM + MCP)
Token budget per consumer	Per-MCP	Cross-MCP (one budget regardless of which tool)
Auth	Per-MCP	One Entra-issued token validated at gateway
Per-tool authorization	Per-MCP custom	APIM scope mapping
Observability	Per-MCP	Unified App Insights with dimensions
Caching	Per-MCP	Semantic cache shared across MCPs for repeated queries
Retirement	Touch every agent	Update APIM routing
Multi-agent reuse	Re-implement per agent	Configure once, reuse across agents

Concrete benefits in production

Token exhaustion guard

One runaway agent loop can drain a $40k token budget overnight. APIM's azure-openai-token-limit policy (llm-token-limit is the provider-agnostic equivalent), applied at the gateway, caps it:

<azure-openai-token-limit
    counter-key="@(context.Subscription.Id)"
    tokens-per-minute="100000"
    estimate-prompt-tokens="true"
    remaining-tokens-header-name="x-ratelimit-remaining-tokens" />

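The limit is enforced at the gateway, so it applies regardless of which MCP server the agent calls or which backend that MCP server hits. Note that tokens-per-minute caps the rate of spend rather than the total; for a cap over a longer period the policy also accepts token-quota and token-quota-period (verify both against the current policy reference).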
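The idea of one counter per consumer, shared across every MCP behind the gateway, can be sketched in a few lines. This is a toy fixed-window limiter and not APIM's actual algorithm; the class name, the 60-second window, and the injected clock are assumptions made for illustration.

```python
import time
from collections import defaultdict


class TokenBudget:
    """Toy per-key tokens-per-minute limiter using fixed 60-second windows."""

    def __init__(self, tokens_per_minute: int, clock=time.monotonic):
        self.limit = tokens_per_minute
        self.clock = clock
        # (key, window index) -> tokens consumed; old windows are never purged here
        self.used = defaultdict(int)

    def try_consume(self, key: str, tokens: int) -> tuple[bool, int]:
        """Return (allowed, remaining tokens) for `key` in the current window."""
        window = int(self.clock() // 60)
        spent = self.used[(key, window)]
        if spent + tokens > self.limit:
            return False, self.limit - spent
        self.used[(key, window)] = spent + tokens
        return True, self.limit - spent - tokens


# A fake clock keeps the example deterministic.
now = [0.0]
budget = TokenBudget(100_000, clock=lambda: now[0])

first = budget.try_consume("sub-a", 60_000)   # call routed to the EAM MCP
second = budget.try_consume("sub-a", 50_000)  # call routed to the Graph MCP, same budget
other = budget.try_consume("sub-b", 10_000)   # a different subscription is unaffected
now[0] = 61.0                                 # next window
third = budget.try_consume("sub-a", 50_000)
print(first, second, other, third)
```

The key point is that the counter is keyed by consumer, not by tool, so spend on one MCP reduces what is left for the others.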

Cost management

azure-openai-emit-token-metric emits token counts to App Insights. Adding agent-id and tool-name dimensions alongside the subscription ID gives a chargeback report:

customMetrics
| where name =~ "Total tokens"  // =~ is case-insensitive; check the exact metric name your gateway emits
| extend agent = tostring(customDimensions["agent-id"])
| extend tool = tostring(customDimensions["tool-name"])
| extend sub = tostring(customDimensions["subscription-id"])
| summarize tokens = sum(value) by agent, tool, sub, bin(timestamp, 1d)
| render columnchart

This is the production-grade answer to "which agent / tool combination is costing us money."
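Turning those token counts into money is a small aggregation step. The sketch below mirrors the query's grouping by agent and tool in plain Python; the agent names, token counts, and the flat price per 1,000 tokens are made-up illustration values, and real pricing usually differs by model and by prompt versus completion tokens.

```python
from collections import defaultdict


def chargeback(rows: list[dict], price_per_1k_tokens: float) -> dict[tuple[str, str], float]:
    """Sum tokens per (agent, tool) pair and convert the totals to cost."""
    tokens = defaultdict(int)
    for row in rows:
        tokens[(row["agent"], row["tool"])] += row["tokens"]
    return {key: total / 1000 * price_per_1k_tokens for key, total in tokens.items()}


# Hypothetical rows, shaped like the output of the query above.
rows = [
    {"agent": "copilot-maint", "tool": "get_work_orders", "tokens": 12_000},
    {"agent": "copilot-maint", "tool": "get_work_orders", "tokens": 8_000},
    {"agent": "foundry-planner", "tool": "run_query", "tokens": 50_000},
]
report = chargeback(rows, price_per_1k_tokens=0.01)

# Most expensive combinations first.
for (agent, tool), cost in sorted(report.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{agent:16} {tool:16} {cost:8.2f}")
```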

Semantic caching across tools

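When two agents ask similar questions of the same MCP, the gateway returns the cached result without hitting the backend. The lookup still computes an embedding of the incoming prompt, because that is how similarity to the cached prompts is measured: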

<azure-openai-semantic-cache-lookup
    score-threshold="0.85"
    embeddings-backend-id="aoai-embeddings"
    ignore-system-messages="true" />

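Especially valuable for read-heavy MCPs (Graph search, Dataverse queries, EAM lookups). Check how score-threshold is defined in the current policy reference before tuning it: in APIM, lower values require closer semantic matches, so 0.85 is a permissive setting.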

MCP server skeleton

A minimal MCP server in Python, designed to sit behind APIM:

# mcp_server_eam.py
import os

import httpx
from azure.identity.aio import ManagedIdentityCredential
from mcp.server.fastmcp import FastMCP

# Listen on all interfaces so APIM can reach the container (the port is an example).
app = FastMCP("eam-domain", host="0.0.0.0", port=8000)

EAM_API_BASE = os.environ["EAM_API_BASE"]
# Assumed setting: the EAM API's Entra scope, e.g. "api://<eam-app-id>/.default".
# The token scope is the API's application ID URI, which need not match its base URL.
EAM_API_SCOPE = os.environ["EAM_API_SCOPE"]
# Identity for the downstream call to the EAM REST API
cred = ManagedIdentityCredential()


async def auth_headers() -> dict[str, str]:
    """Fetch a bearer token for the EAM API without blocking the event loop."""
    token = await cred.get_token(EAM_API_SCOPE)
    return {"Authorization": f"Bearer {token.token}"}


@app.tool()
async def get_work_orders(
    site: str | None = None,
    status: str | None = None,
    priority: str | None = None,
    limit: int = 50,
) -> list[dict]:
    """Return open work orders, optionally filtered by site, status, priority."""
    filters = {"site": site, "status": status, "priority": priority}
    params = {"limit": str(limit)}
    params.update({k: v for k, v in filters.items() if v})  # send only the filters that were set
    async with httpx.AsyncClient(timeout=30) as client:
        resp = await client.get(f"{EAM_API_BASE}/work-orders", params=params, headers=await auth_headers())
        resp.raise_for_status()
        return resp.json()["value"]


@app.tool()
async def update_work_order(work_order_id: str, status: str, notes: str = "") -> dict:
    """Update a work order's status; requires the Eam.Write scope at APIM."""
    body = {"status": status, "notes": notes}
    async with httpx.AsyncClient(timeout=30) as client:
        resp = await client.patch(f"{EAM_API_BASE}/work-orders/{work_order_id}", json=body, headers=await auth_headers())
        resp.raise_for_status()
        return resp.json()


if __name__ == "__main__":
    # Serve over HTTP; the default stdio transport cannot sit behind a gateway.
    app.run(transport="streamable-http")

Deploy as a container in Azure Container Apps. The container does not need public ingress; it only has to be reachable from APIM, for example over a private endpoint.

APIM policy fragment for an MCP-fronted API

<policies>
  <inbound>
    <base />
    <!-- Entra-issued JWT; the validated token is kept in the "jwt" context variable -->
    <validate-jwt header-name="Authorization" failed-validation-httpcode="401" output-token-variable-name="jwt">
      <openid-config url="https://login.microsoftonline.com/{tenant}/v2.0/.well-known/openid-configuration" />
      <required-claims>
        <claim name="aud"><value>api://your-mcp-app-id</value></claim>
      </required-claims>
    </validate-jwt>

    <!-- Read the granted scopes from the validated token's "scp" claim (space-separated).
         Mapping scopes to the tools they authorize still has to be added per operation. -->
    <set-variable name="grantedScopes" value="@(((Jwt)context.Variables["jwt"]).Claims.GetValueOrDefault("scp", ""))" />

    <!-- Per-subscription token budget across all MCPs -->
    <azure-openai-token-limit
        counter-key="@(context.Subscription.Id)"
        tokens-per-minute="100000"
        estimate-prompt-tokens="true" />

    <!-- Per-subscription request rate (catches non-LLM tool calls too) -->
    <rate-limit-by-key calls="600" renewal-period="60" counter-key="@(context.Subscription.Id)" />

    <!-- Semantic cache for read-heavy tools -->
    <azure-openai-semantic-cache-lookup
        score-threshold="0.85"
        embeddings-backend-id="aoai-embeddings"
        ignore-system-messages="true" />

    <!-- Prompt-injection screening on tool inputs (Prompt Shields) -->
    <llm-content-safety backend-id="content-safety" shield-prompt="true" />

    <!-- Token metrics with chargeback dimensions; this policy runs in the inbound section -->
    <azure-openai-emit-token-metric>
      <dimension name="subscription-id" value="@(context.Subscription.Id)" />
      <dimension name="tool-name" value="@(context.Request.OriginalUrl.Path)" />
      <dimension name="agent-id" value="@(context.Request.Headers.GetValueOrDefault("X-Agent-Id", "unknown"))" />
    </azure-openai-emit-token-metric>

    <!-- Route to the MCP backend -->
    <set-backend-service backend-id="mcp-eam-domain" />
  </inbound>

  <backend>
    <forward-request />
  </backend>

  <outbound>
    <base />
    <azure-openai-semantic-cache-store duration="600" />
  </outbound>
</policies>

Agent-side wiring

Copilot Studio

In Copilot Studio, register the MCP-fronted API as a custom connector built from the OpenAPI document APIM publishes. The Copilot Studio runtime calls APIM; APIM calls MCP; MCP calls the backend.

Foundry Agent Service

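import json

from azure.ai.agents.models import (
    OpenApiManagedAuthDetails,
    OpenApiManagedSecurityScheme,
    OpenApiTool,
)

# Load the OpenAPI document exported from APIM and saved locally
with open("eam-mcp-openapi.json") as f:
    spec = json.load(f)

# The gateway validates an Entra JWT, so the tool has to send one. Anonymous auth
# sends no credentials and would be rejected by validate-jwt with a 401.
auth = OpenApiManagedAuthDetails(
    security_scheme=OpenApiManagedSecurityScheme(audience="api://your-mcp-app-id")
)

eam_tool = OpenApiTool(
    name="enterprise_asset_management",
    description="Tools for querying and updating work orders, assets, and maintenance events.",
    spec=spec,
    auth=auth,  # APIM validates the token; the MCP server never sees Entra
)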

Semantic Kernel

var kernel = Kernel.CreateBuilder()
    .AddAzureOpenAIChatCompletion(...)
    .Build();

await kernel.ImportPluginFromOpenApiAsync(
    "EamMcp",
    new Uri("https://yourapim.azure-api.net/eam-mcp/openapi.json"),
    new OpenApiFunctionExecutionParameters {
        AuthCallback = AddApimSubscriptionKeyHeader
    }
);

In every case, the agent is configured against the APIM-fronted MCP, not against the MCP directly.

Per-tool authorization

Use Entra scopes to express which tools a given consumer can invoke:

Scope	Tool surface
`Eam.Read`	get_work_orders, get_asset_history, list_sites
`Eam.Write`	update_work_order, create_work_order
`Eam.Admin`	retire_asset, modify_pm_schedule

The APIM policy maps each requested tool to its required scope and rejects calls whose token lacks that scope. Because this is enforced at the gateway rather than in MCP code, the MCP server doesn't have to know about Entra.
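The check itself is a lookup plus a membership test. The sketch below expresses it in Python for readability; in practice it would live in an APIM policy expression. The mapping follows the table above, and the deny-by-default handling of unknown tools is an assumption of this sketch rather than something the guide specifies.

```python
# Tool name -> scope the caller's token must carry (from the table above).
REQUIRED_SCOPE = {
    "get_work_orders": "Eam.Read",
    "get_asset_history": "Eam.Read",
    "list_sites": "Eam.Read",
    "update_work_order": "Eam.Write",
    "create_work_order": "Eam.Write",
    "retire_asset": "Eam.Admin",
    "modify_pm_schedule": "Eam.Admin",
}


def is_authorized(tool_name: str, scp_claim: str) -> bool:
    """Return True if the space-separated `scp` claim grants access to the tool."""
    required = REQUIRED_SCOPE.get(tool_name)
    if required is None:
        return False  # assumption: tools missing from the map are denied
    return required in scp_claim.split()


print(is_authorized("get_work_orders", "Eam.Read"))             # reader calling a read tool
print(is_authorized("update_work_order", "Eam.Read"))           # reader calling a write tool
print(is_authorized("update_work_order", "Eam.Read Eam.Write")) # token with both scopes
```

Scopes are matched exactly here, so `Eam.Write` does not imply `Eam.Read`; a consumer that needs both must be granted both.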

Observability: what to dashboard

The operational dashboard for an agent estate:

Panel	Query
Tokens per agent per day	`customMetrics \| where name =~ "Total tokens" \| summarize sum(value) by tostring(customDimensions["agent-id"]), bin(timestamp, 1d)`
Tokens per tool per day	`customMetrics \| where name =~ "Total tokens" \| summarize sum(value) by tostring(customDimensions["tool-name"]), bin(timestamp, 1d)`
Tool call success rate	`ApiManagementGatewayLogs \| summarize total = count(), errors = countif(ResponseCode >= 400) by ApiId \| extend success_rate = (total - errors) * 100.0 / total`
Cache hit rate	`customMetrics \| where name == "Cache hit"` against `Cache miss`
Rate-limit hits	`ApiManagementGatewayLogs \| where ResponseCode == 429 \| summarize count() by ApimSubscriptionId, ApiId`
Top consumers	`ApiManagementGatewayLogs \| summarize calls = count() by ApimSubscriptionId \| top 10 by calls`

Set alerts on token-budget-near-exhaustion, sustained rate-limit hits, and tool error rate above threshold.
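As a rough illustration of those three alert conditions, the sketch below evaluates them over one aggregation window. The field names and the default thresholds (90% of budget, 5% errors, 10% throttled) are hypothetical; real alerts would normally be defined as Azure Monitor alert rules over the queries in the table.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Aggregates for one consumer over one evaluation window."""
    tokens_used: int
    token_budget: int
    total_calls: int
    error_calls: int      # responses with status >= 400
    throttled_calls: int  # responses with status 429


def fired_alerts(
    stats: WindowStats,
    budget_ratio: float = 0.9,
    error_rate: float = 0.05,
    throttle_rate: float = 0.10,
) -> list[str]:
    """Return the names of the alert conditions that hold for this window."""
    fired = []
    if stats.tokens_used >= budget_ratio * stats.token_budget:
        fired.append("token-budget-near-exhaustion")
    if stats.total_calls:  # avoid dividing by zero on idle windows
        if stats.throttled_calls / stats.total_calls >= throttle_rate:
            fired.append("sustained-rate-limit-hits")
        if stats.error_calls / stats.total_calls >= error_rate:
            fired.append("tool-error-rate-above-threshold")
    return fired


busy = WindowStats(tokens_used=95_000, token_budget=100_000,
                   total_calls=200, error_calls=4, throttled_calls=30)
idle = WindowStats(tokens_used=0, token_budget=100_000,
                   total_calls=0, error_calls=0, throttled_calls=0)
print(fired_alerts(busy))
print(fired_alerts(idle))
```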

Anti-patterns

Anti-pattern	Refuse because
Putting MCP servers directly in front of agents without APIM	Loses cross-tool budget, observability, auth consolidation
One giant MCP server for everything	Cohesion / blast-radius — break by domain
Each agent stands up its own MCP servers	Wastes engineering; loses portfolio view
MCP servers fronting their own ad-hoc REST clients	The MCP should call backends through APIM where possible to keep one identity model
Skipping semantic cache "for now"	Cost discipline is best built from the start
MCP servers exposing free-form SQL	Tools should be parameterized; raw SQL invites prompt injection on backends