Best Practices: Observability Migration to Azure Monitor¶
Audience: Platform Engineers, SREs, DevOps Leads
Last updated: 2026-04-30
Overview¶
This document captures proven best practices for migrating from third-party observability platforms to Azure Monitor. These recommendations are based on patterns observed across enterprise and federal migrations.
1. Incremental migration strategy¶
Start with new applications¶
The lowest-risk migration approach is to instrument all new applications with Azure Monitor from day one while keeping existing applications on the current vendor. This creates a growing Azure Monitor footprint without disrupting existing operations.
Implementation:
- Add the Azure Monitor OpenTelemetry Distro (Application Insights) to all new application templates and scaffolding
- Configure new AKS clusters with Container Insights enabled by default
- Deploy new VMs with Azure Monitor Agent via Azure Policy
- Existing applications migrate on a service-by-service basis during scheduled maintenance windows
Benefits:
- Zero disruption to existing monitoring
- Team gains Azure Monitor experience on lower-risk applications
- Natural migration as applications are rewritten or updated
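One quick way to confirm the footprint is actually growing is to check which machines are reporting through the Azure Monitor Agent. A minimal sketch against the Heartbeat table (the Category value shown assumes AMA; adjust if legacy agents are still reporting):

// Machines reporting via the Azure Monitor Agent in the last 24h
Heartbeat
| where TimeGenerated > ago(24h)
| where Category == "Azure Monitor Agent"
| summarize LastHeartbeat = max(TimeGenerated) by Computer, OSType
| order by LastHeartbeat desc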
Migration priority order¶
Migrate observability components in this recommended order:
- Infrastructure monitoring (VM Insights, Container Insights) -- lowest risk, most standardized
- Log ingestion (Azure Monitor Agent, DCR) -- high value, well-understood patterns
- Metrics and dashboards -- visible impact, manageable scope
- APM (Application Insights) -- highest value, most migration effort
- Alerting -- migrate last, after data sources are established
- Synthetic monitoring -- can run in parallel indefinitely
2. Dual-shipping: Run both platforms in parallel¶
Why dual-ship¶
Never attempt a big-bang cutover. Dual-shipping telemetry to both the existing vendor and Azure Monitor for 2-4 weeks provides:
- Validation: Confirm data completeness and accuracy
- Confidence: Operations teams can compare dashboards and alerts side-by-side
- Rollback: If Azure Monitor reveals gaps, the existing vendor continues providing coverage
- Training: Teams learn KQL and Azure Monitor workflows while the safety net is in place
OpenTelemetry Collector for dual-shipping¶
The OpenTelemetry Collector is the ideal intermediary for dual-shipping. Applications send telemetry to the Collector once; the Collector forwards to both destinations.
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 10s
    send_batch_size: 1024
exporters:
  # Azure Monitor (target)
  azuremonitor:
    connection_string: ${APPLICATIONINSIGHTS_CONNECTION_STRING}
  # Datadog (existing - temporary during migration)
  datadog:
    api:
      key: ${DD_API_KEY}
      site: datadoghq.com
  # Log Analytics (for logs)
  azuremonitor/logs:
    connection_string: ${APPLICATIONINSIGHTS_CONNECTION_STRING}
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [azuremonitor, datadog]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [azuremonitor, datadog]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [azuremonitor/logs, datadog]
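Once both exporters are live, spot-check completeness on the Azure Monitor side and compare the numbers against the existing vendor's dashboards. A minimal sketch, assuming workspace-based Application Insights (AppRequests table) with the service name captured in AppRoleName:

// Hourly request volume per service during the dual-ship window
AppRequests
| where TimeGenerated > ago(24h)
| summarize Requests = count() by AppRoleName, bin(TimeGenerated, 1h)
| render timechart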
Dual-shipping cost management¶
Dual-shipping doubles telemetry volume temporarily. Manage costs by:
- Sampling aggressively during the dual-ship period (25-50%)
- Dual-shipping only for services actively being migrated (not the entire fleet)
- Limiting the dual-ship period to 2-4 weeks per service
- Using Basic logs tier for the duplicate data stream on Azure Monitor
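It also helps to watch where the duplicate volume is actually landing so the overlap does not quietly inflate the bill. A sketch against the Usage table (billable ingestion per data type over the last week):

// Billable ingestion by data type during the dual-ship window
Usage
| where TimeGenerated > ago(7d)
| where IsBillable == true
| summarize IngestedGB = sum(Quantity) / 1024.0 by DataType
| order by IngestedGB desc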
3. OpenTelemetry as the abstraction layer¶
Vendor-neutral instrumentation¶
The single most important architectural decision for a sustainable observability strategy is to instrument with OpenTelemetry rather than vendor-specific SDKs.
Why:
- Applications are instrumented once with OpenTelemetry APIs
- The backend (Azure Monitor, Datadog, Grafana Cloud, etc.) is a deployment-time configuration decision
- Future vendor changes require zero application code changes
- OpenTelemetry is a CNCF standard with broad industry support
How:
// .NET example -- vendor-neutral instrumentation
using System.Diagnostics;
using System.Diagnostics.Metrics;
using Azure.Monitor.OpenTelemetry.AspNetCore;

var builder = WebApplication.CreateBuilder(args);

// These use OpenTelemetry APIs, NOT vendor-specific APIs
var activitySource = new ActivitySource("MyApp.Orders");
var meter = new Meter("MyApp.Metrics");
var counter = meter.CreateCounter<long>("orders_created");

// The exporter configuration determines where telemetry goes.
// Swap Azure Monitor for any OTel-compatible backend with zero code changes.
builder.Services.AddOpenTelemetry()
    .UseAzureMonitor();                        // Azure Monitor exporter
    // .WithTracing(b => b.AddOtlpExporter()) // Or OTLP to any backend
Avoid vendor-specific SDK features¶
When migrating to Azure Monitor, resist the temptation to use Application Insights-specific APIs (e.g., TelemetryClient.TrackEvent()) for new instrumentation. Use OpenTelemetry APIs exclusively for new code. The Azure Monitor OpenTelemetry distribution transparently translates OTel telemetry to Application Insights format.
Exceptions where Application Insights-specific features are justified:
- Snapshot Debugger (no OTel equivalent)
- Live Metrics (no OTel equivalent)
- Release annotations (API-only feature)
4. Cost optimization¶
Tier routing strategy¶
Route logs to the appropriate tier based on value and query frequency.
flowchart TD
A["Incoming Logs"] --> B{"Critical for<br/>alerting?"}
B -->|Yes| C["Analytics Tier<br/>($2.76/GB or commitment)"]
B -->|No| D{"Queried more than<br/>weekly?"}
D -->|Yes| C
D -->|No| E{"Compliance<br/>retention needed?"}
E -->|Yes| F["Basic Tier<br/>($0.88/GB) → Archive"]
E -->|No| G{"Any query<br/>value?"}
G -->|Yes| F
G -->|No| H["Drop at DCR<br/>(no ingestion cost)"]
Specific routing recommendations¶
| Log source | Tier | Rationale |
|---|---|---|
| Application exceptions | Analytics | Active alerting and investigation |
| API request traces | Analytics | Performance monitoring and SLO tracking |
| Security audit logs | Analytics | Compliance and security alerting |
| Infrastructure syslog (warn+) | Analytics | Operational alerting |
| Application debug/trace logs | Basic | Rarely queried; ad-hoc investigation |
| CDN/WAF access logs | Basic | Forensic investigation only |
| Health check probes | Drop (DCR filter) | No analytical value; generates noise |
| Kubernetes liveness/readiness probes | Drop (DCR filter) | High volume; no analytical value |
| Verbose SDK telemetry | Drop (DCR filter) | Internal SDK diagnostics; not needed in production |
DCR transformation to drop noise¶
// Drop health check requests before ingestion
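// Applied as the transformKql property of a DCR data flow; assumes
// request-style logs that expose Url and UserAgent columns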
source
| where not(Url has "/health" or Url has "/ready" or Url has "/alive")
| where not(UserAgent has "kube-probe" or UserAgent has "ELB-HealthChecker")
Commitment tier selection¶
Choose your commitment tier based on 30 days of observed ingestion.
// Query to determine daily ingestion volume
Usage
| where TimeGenerated > ago(30d)
| summarize DailyGB = sum(Quantity) / 1024.0 by bin(TimeGenerated, 1d)
| summarize
AvgDailyGB = avg(DailyGB),
P90DailyGB = percentile(DailyGB, 90),
MaxDailyGB = max(DailyGB)
Decision rule: Commit to the tier at or below your P90 daily ingestion. You pay for the full committed capacity every day whether you fill it or not, so it is usually cheaper to absorb per-GB overage charges on occasional peak days than to pay for committed capacity that sits idle most of the month.
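As a rough sketch, the ingestion query above can be extended to map P90 onto a suggested tier; the tier sizes below are illustrative and should be checked against current Azure Monitor pricing before committing.

// Continuation of the ingestion query above: pick the largest commitment
// tier at or below P90 daily ingestion (illustrative tier sizes)
| extend SuggestedTierGB = case(
    P90DailyGB >= 5000, 5000,
    P90DailyGB >= 2000, 2000,
    P90DailyGB >= 1000, 1000,
    P90DailyGB >= 500, 500,
    P90DailyGB >= 400, 400,
    P90DailyGB >= 300, 300,
    P90DailyGB >= 200, 200,
    P90DailyGB >= 100, 100,
    0) // below 100 GB/day, pay-as-you-go is usually the better fit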
Application Insights sampling¶
Enable adaptive sampling for production workloads. Configure sampling overrides to keep 100% of high-value telemetry.
{
"sampling": {
"percentage": 25,
"overrides": [
{ "telemetryType": "exception", "percentage": 100 },
{
"telemetryType": "request",
"attributes": [
{
"key": "http.status_code",
"value": "5.*",
"matchType": "regexp"
}
],
"percentage": 100
},
{
"telemetryType": "dependency",
"attributes": [{ "key": "success", "value": "false" }],
"percentage": 100
}
]
}
}
5. Workspace design¶
Single workspace vs multiple workspaces¶
| Pattern | Use case | Trade-offs |
|---|---|---|
| Single workspace | Small-medium organizations; simplicity | Simplest queries; single RBAC boundary; risk of noisy-neighbor query impact |
| Per-environment (dev/staging/prod) | Environment isolation | Clean separation; no risk of dev data in prod queries; slightly more complex cross-env analysis |
| Per-team + shared | Large organizations with data ownership boundaries | Team autonomy; table-level RBAC; cross-workspace queries possible but more complex |
| Per-compliance-boundary | Federal (IL4 vs IL5, classified vs unclassified) | Mandatory for compliance; separate regions and access controls |
Recommendation for CSA-in-a-Box: Start with a single workspace for the platform (Fabric, Databricks, ADF, Purview diagnostics) and separate workspace(s) for application teams. Cross-workspace queries in KQL (workspace('other-workspace').TableName) enable correlation when needed.
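A minimal cross-workspace sketch, assuming a platform workspace holding AzureDiagnostics and an application workspace named app-workspace with workspace-based Application Insights tables (the workspace name and ADF column names are placeholders to adapt):

// Failed ADF pipeline runs alongside application exceptions, last 24h
union
    (AzureDiagnostics
    | where ResourceProvider == "MICROSOFT.DATAFACTORY" and status_s == "Failed"
    | project TimeGenerated, Source = "ADF pipeline", Detail = pipelineName_s),
    (workspace('app-workspace').AppExceptions
    | project TimeGenerated, Source = "Application", Detail = ExceptionType)
| where TimeGenerated > ago(24h)
| order by TimeGenerated desc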
Workspace architecture for CSA-in-a-Box¶
flowchart TD
subgraph Platform["Platform Workspace"]
PF["Fabric Diagnostics"]
PD["Databricks Diagnostics"]
PA["ADF Pipeline Logs"]
PP["Purview Scan Logs"]
PK["Key Vault Audit"]
PS["Storage Access Logs"]
end
subgraph App["Application Workspace"]
AA["Application Insights (APM)"]
AL["Custom App Logs"]
AM["Custom Metrics"]
end
subgraph Security["Security Workspace (optional)"]
SE["Entra ID Sign-ins"]
SD["Defender Alerts"]
SA["Activity Logs"]
end
Platform --> |"cross-workspace query"| App
App --> |"cross-workspace query"| Platform
Security --> |"Sentinel integration"| Platform
6. CSA-in-a-Box monitoring integration¶
Enable diagnostic settings for all CSA-in-a-Box components¶
Ensure every deployed CSA-in-a-Box resource sends diagnostics to the central Log Analytics workspace.
// Template pattern for diagnostic settings
resource diagnosticSetting 'Microsoft.Insights/diagnosticSettings@2021-05-01-preview' = {
scope: targetResource
name: 'diag-to-law'
properties: {
workspaceId: platformWorkspace.id
logs: [
{ categoryGroup: 'allLogs', enabled: true }
]
metrics: [
{ category: 'AllMetrics', enabled: true }
]
}
}
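After the diagnostic settings are deployed, verify that each component is actually landing in the workspace. A sketch assuming the components write to the shared AzureDiagnostics table (some services use resource-specific tables instead):

// Resource providers and categories that have reported in the last 24h
AzureDiagnostics
| where TimeGenerated > ago(24h)
| summarize Records = count(), LastSeen = max(TimeGenerated) by ResourceProvider, Category
| order by ResourceProvider asc, Category asc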
CSA-in-a-Box platform health workbook¶
Create a unified workbook that shows the health of all CSA-in-a-Box components.
// ADF pipeline success rate (last 24h)
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.DATAFACTORY"
| where Category == "PipelineRuns"
| where TimeGenerated > ago(24h)
| summarize
Total = count(),
Succeeded = countif(status_s == "Succeeded"),
Failed = countif(status_s == "Failed")
| extend SuccessRate = round(Succeeded * 100.0 / Total, 2)
// Databricks cluster utilization
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.DATABRICKS"
| where Category == "clusters"
| where TimeGenerated > ago(1h)
| summarize AvgNodes = avg(toint(num_workers_s)) by bin(TimeGenerated, 5m)
| render timechart
// Purview scan status
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.PURVIEW"
| where Category == "ScanStatusLogEvent"
| where TimeGenerated > ago(7d)
| summarize Count = count() by resultType_s
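Depending on which components you surface, a Key Vault panel is a natural addition; a sketch (the AzureDiagnostics column names for Key Vault audit events may differ in your workspace):

// Key Vault access failures (last 24h)
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.KEYVAULT"
| where Category == "AuditEvent"
| where TimeGenerated > ago(24h)
| where httpStatusCode_d >= 400
| summarize Failures = count() by OperationName, CallerIPAddress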
7. Team enablement¶
KQL training plan¶
| Week | Topic | Resources |
|---|---|---|
| 1 | KQL basics: where, project, summarize, order by | KQL Quick Reference |
| 2 | Time series: bin, render timechart, make-series | KQL tutorials in Azure portal |
| 3 | Joins and subqueries: join, let, materialize | Log Analytics demo workspace |
| 4 | Advanced: parse, extract, reduce, evaluate | Team query library |
Query library¶
Maintain a shared KQL query library (as a Workbook or in Git) with pre-built queries for common investigation patterns. This accelerates adoption by giving teams working examples rather than requiring them to write KQL from scratch.
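A library entry can be as simple as a commented query with a one-line description of when to reach for it; a sketch assuming workspace-based Application Insights:

// "Which operations are failing right now?" -- top failing requests, last hour
AppRequests
| where TimeGenerated > ago(1h) and Success == false
| summarize Failures = count() by Name, ResultCode
| top 10 by Failures desc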
Runbook integration¶
Update operational runbooks to reference Azure Monitor:
- Replace "open Datadog and search for..." with "open Log Analytics and run this KQL query..."
- Include direct links to Azure Monitor alert investigation views
- Add workbook URLs for common troubleshooting dashboards
8. Operational readiness checklist¶
Before decommissioning the source vendor, verify:
- All hosts and containers are reporting via Azure Monitor Agent
- All applications are instrumented with Application Insights
- Top 20 dashboards are recreated in Workbooks or Managed Grafana
- All critical alert rules are migrated and validated
- Alert notification channels (PagerDuty, Slack, ServiceNow) are tested
- On-call teams are trained on KQL and Azure Monitor workflows
- Runbooks are updated with Azure Monitor references
- Data ingestion volume matches expected levels
- Commitment tier is set appropriately
- Basic logs and archive tiers are configured for cost optimization
- Sampling is enabled and tuned for production workloads
- Cross-workspace queries work for platform-to-application correlation
- CSA-in-a-Box diagnostic settings are active for all components
- Cost alerts are configured (Azure Cost Management)
- Compliance requirements are met (retention, encryption, access control)
- Vendor contract termination notice has been prepared
Related: Migration Playbook | TCO Analysis | Benchmarks | Federal Migration Guide