Pattern — Observability with OpenTelemetry¶
TL;DR: OpenTelemetry SDK in every service, Application Insights as the backend, W3C Trace Context propagated end-to-end, structured logs with correlation IDs everywhere, dashboards in Azure Workbooks, SLI/SLO + error budgets to drive engineering effort. Don't roll your own.

Problem¶
A modern data platform spans half a dozen services per request — portal → BFF → AI service → vector search → AOAI → response. Without distributed tracing, debugging a "this one request was slow" report means hunting through six separate log streams and hoping the timestamps line up. With OpenTelemetry, every log line and every span carries the same trace ID, and one click in App Insights shows you the full request waterfall.
Architecture¶
```mermaid
flowchart LR
    subgraph User
        Browser[Browser]
    end
    subgraph Portal[Portal — React + FastAPI BFF]
        React[React app<br/>OTel JS SDK]
        BFF[FastAPI BFF<br/>OTel Python SDK<br/>+ auto-instrumentation]
    end
    subgraph AI[AI Services]
        RAG[RAG service<br/>OTel SDK]
        AOAI[Azure OpenAI<br/>logs trace context<br/>via App Insights connector]
        Search[AI Search<br/>diagnostic logs]
    end
    subgraph Backend[Data Backend]
        DAB[Data API Builder]
        Synapse[Synapse SQL]
    end
    subgraph Telemetry[Application Insights / Log Analytics]
        AI1[Distributed traces]
        AI2[Structured logs]
        AI3[Metrics]
        AI4[Workbooks + alerts]
    end
    Browser -- traceparent --> React
    React -- traceparent --> BFF
    BFF -- traceparent --> RAG
    BFF -- traceparent --> DAB
    RAG -- request --> AOAI
    RAG -- request --> Search
    DAB -- query --> Synapse
    React -. spans .-> AI1
    BFF -. spans + logs .-> AI1
    RAG -. spans + logs .-> AI1
    AOAI -. AI logs .-> AI2
    Search -. diagnostic .-> AI2
    DAB -. spans .-> AI1
    Synapse -. diagnostic .-> AI2
    AI1 --> AI4
    AI2 --> AI4
    AI3 --> AI4
```

Pattern: end-to-end trace propagation¶
Every service:

- Reads the incoming `traceparent` header (W3C Trace Context)
- Creates a child span for its work
- Adds business-relevant attributes (`user.id`, `tenant.id`, `request_id`, `model`, `tokens`)
- Propagates `traceparent` on outbound calls
- Emits structured logs that include `trace_id` + `span_id`
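On the wire this is a single HTTP header, e.g. `traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01`: version, 16-byte trace ID, 8-byte parent span ID, and trace flags. Note this is the same trace ID used in the waterfall example further down.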
Python (FastAPI)¶
```python
import os

from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from azure.monitor.opentelemetry import configure_azure_monitor

# One-line setup — auto-instruments FastAPI, requests, urllib3, sqlalchemy, etc.
configure_azure_monitor(
    connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"]
)

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)
tracer = trace.get_tracer(__name__)

@app.post("/chat")
async def chat(req: ChatRequest):
    # Child span under the request span the instrumentation already created
    with tracer.start_as_current_span("rag-grounding") as span:
        span.set_attributes({
            "user.id_hash": hash_user(req.user_id),
            "tenant.id": req.tenant_id,
            "rag.corpus": req.corpus,
        })
        # ... do work
        span.set_attribute("rag.docs_retrieved", len(docs))
```
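Auto-instrumented clients (`requests`, `urllib3`, etc.) add the `traceparent` header for you; for a hand-rolled outbound call, inject the current context explicitly. A minimal sketch, assuming the same global setup as above (the `call_rag` helper and its URL are illustrative):

```python
import httpx
from opentelemetry.propagate import inject

async def call_rag(payload: dict) -> dict:
    headers: dict[str, str] = {}
    inject(headers)  # writes traceparent (and tracestate) for the current span
    async with httpx.AsyncClient() as client:
        # URL is a placeholder for the downstream RAG service
        resp = await client.post(
            "https://rag.internal/ground", json=payload, headers=headers
        )
        resp.raise_for_status()
        return resp.json()
```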
TypeScript (React)¶
```typescript
import { WebTracerProvider, BatchSpanProcessor } from "@opentelemetry/sdk-trace-web";
import { AzureMonitorTraceExporter } from "@azure/monitor-opentelemetry-exporter";
import { registerInstrumentations } from "@opentelemetry/instrumentation";
import { FetchInstrumentation } from "@opentelemetry/instrumentation-fetch";

const provider = new WebTracerProvider();
provider.addSpanProcessor(
  new BatchSpanProcessor(
    new AzureMonitorTraceExporter({
      connectionString: import.meta.env.VITE_APPINSIGHTS_CONNECTION_STRING,
    }),
  ),
);
provider.register();

registerInstrumentations({
  instrumentations: [
    new FetchInstrumentation({
      // attach traceparent only to our own API origin
      propagateTraceHeaderCorsUrls: [/api\.mycompany\.com/],
    }),
  ],
});
```
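One gotcha: `propagateTraceHeaderCorsUrls` only controls the client side. For cross-origin calls the BFF's CORS policy must also allow the `traceparent` request header, or preflighted requests will fail.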
Pattern: structured logging with correlation¶
Every log line includes the trace ID:
```python
import structlog
from opentelemetry import trace

def add_trace_context(logger, method_name, event_dict):
    # get_current_span() returns a non-recording INVALID_SPAN rather than
    # None when there is no active trace, so check the context is valid.
    ctx = trace.get_current_span().get_span_context()
    if ctx.is_valid:
        event_dict["trace_id"] = format(ctx.trace_id, "032x")
        event_dict["span_id"] = format(ctx.span_id, "016x")
    return event_dict

structlog.configure(processors=[
    add_trace_context,
    structlog.processors.JSONRenderer(),
])

log = structlog.get_logger()
log.info("rag_retrieved", docs_count=len(docs), user_id_hash=hash_user(uid))
```
In App Insights you can join logs to traces by trace_id.
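For programmatic digging, a minimal sketch of that join using the `azure-monitor-query` client (assumes a workspace-based App Insights resource; the workspace ID is a placeholder):

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Everything recorded for one trace: requests, dependencies, and log lines.
# Table names are the workspace-based App Insights schema; OperationId
# carries the W3C trace ID.
query = """
union withsource=SourceTable AppRequests, AppDependencies, AppTraces
| where OperationId == '4bf92f3577b34da6a3ce929d0e0e4736'
| project TimeGenerated, SourceTable, Name, Message, DurationMs
| order by TimeGenerated asc
"""

response = client.query_workspace(
    "<log-analytics-workspace-id>",  # placeholder
    query,
    timespan=timedelta(hours=24),
)
for table in response.tables:
    for row in table.rows:
        print(row)
```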
Pattern: SLI / SLO / error budget¶
Define what "working" means as numbers:
| Service | SLI | SLO | Error budget |
|---|---|---|---|
| Portal API | success rate (HTTP 2xx/3xx / total) | 99.9% over 30 days | 43 minutes / month |
| Portal API | latency p95 | <500ms over 30 days | 5% of requests can exceed |
| RAG service | retrieval recall@5 (eval) | >0.85 | 15% of queries can be sub-recall |
| AOAI surface | refusal rate | <2% | spike = drift, not error budget |
When you blow your error budget, engineering work shifts from features to reliability until you're back inside budget. This forces honest conversations.
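The budget numbers in the table fall out of simple arithmetic; a sketch:

```python
# Error budget for the 99.9% / 30-day availability SLO above.
slo = 0.999
window_minutes = 30 * 24 * 60               # 43,200 minutes in the window
budget_minutes = window_minutes * (1 - slo)
print(budget_minutes)                       # ~43.2 -> the "43 minutes / month" row

def budget_remaining(failed_minutes: float) -> float:
    """Fraction of the error budget still unspent."""
    return 1 - failed_minutes / budget_minutes

print(budget_remaining(30.0))               # ~0.31: 30 bad minutes leaves ~31% of budget
```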
Pattern: dashboards in Azure Workbooks¶
Don't reinvent. Workbooks ship a few good templates and the platform extends them:
- Service Health — uptime, error rate, latency p50/p95/p99 per service
- Trace Explorer — slowest 100 traces, filter by service / operation / error
- AOAI Cost — token usage by deployment / day / feature, $ trend
- RAG Quality — recall, citation precision, refusal rate over time
- Drift — weekly eval scores vs baseline
Workbook JSON lives under `deploy/bicep/shared/modules/workbooks/` (planned — see roadmap). Until then, Workbook templates can be exported from the portal and committed.
Pattern: alerts that don't page-fatigue¶
| Alert | Threshold | Severity |
|---|---|---|
| Service down (availability test fails 3+ times in 5 min) | 0% success | Sev 1 page |
| Error rate >5% over 5min | >5% | Sev 2 page |
| Latency p95 >2x baseline for 10min | 2× baseline | Sev 3 ticket |
| AOAI throttled >10% over 1hr | >10% throttle | Sev 3 ticket |
| AOAI cost >2x daily baseline | 2× daily | Sev 3 ticket (cost control) |
| Refusal rate >5% sustained 1hr | >5% | Sev 3 ticket (drift) |
| Eval score regression in CI | >5% drop | block PR |
Rules of thumb:
- Sev 1 wakes someone up. Reserve for "service is down."
- Sev 2 pages during business hours. "Significant degradation."
- Sev 3 is a ticket for next-day investigation. "Something's off."
- More than 1-2 Sev 1 pages / week → recalibrate; alerts are too noisy.
Pattern: correlation across the boundary¶
For a chat request:
```
trace_id = 4bf92f3577b34da6a3ce929d0e0e4736

[browser] span: page-load              trace_id=4bf9...  duration=200ms
[browser] span: HTTP POST /chat        trace_id=4bf9...  duration=1240ms
[BFF]       span: chat handler         trace_id=4bf9...  duration=1230ms
[BFF]         span: rate-limit check   trace_id=4bf9...  duration=2ms
[BFF]         span: rag.grounding      trace_id=4bf9...  duration=850ms
[RAG]           span: retrieve         trace_id=4bf9...  duration=120ms
[RAG]           span: aoai-completion  trace_id=4bf9...  duration=720ms
[BFF]         span: response-build     trace_id=4bf9...  duration=10ms
```
Click any span in App Insights → see the full waterfall + every log line associated with each span.
Anti-patterns¶
| Anti-pattern | What to do |
|---|---|
| Each service has its own trace ID | Propagate traceparent end-to-end |
| Logs as freeform text | JSON structured with trace_id + span_id |
| Alert on everything | Alert on SLO violations only |
| Workbooks built once, never updated | Treat as code; commit JSON; review with PRs |
| App Insights connection string in code | Key Vault reference, MI auth |
| OTel SDK + custom telemetry library | Pick one — OTel is the standard |
Related¶
- ADR 0020 — Portal Observability and Rate Limiting
- ADR 0021 — Two Rate Limiters
- Best Practices — Monitoring & Observability
- LOG_SCHEMA.md
- Compliance — SOC 2 Type II (CC7 Monitoring)
- Azure Monitor OpenTelemetry: https://learn.microsoft.com/azure/azure-monitor/app/opentelemetry-overview