Why AKS: Executive Brief for Container Platform Modernization

Status: Authored 2026-04-30

Audience: Federal CIOs, CTOs, and platform engineering leadership evaluating Azure Kubernetes Service against self-managed Kubernetes and Red Hat OpenShift.

Purpose: Provide an evidence-based business case for AKS adoption, with an honest assessment of where self-managed Kubernetes and OpenShift retain advantages.

The container platform inflection point

Kubernetes won the container orchestration war. It is the undisputed standard for running containerized workloads at scale. But operating Kubernetes clusters is not the same as using Kubernetes. The operational burden of running production Kubernetes -- etcd management, control-plane patching, certificate rotation, upgrade planning, CNI troubleshooting, storage driver maintenance -- consumes 40--60% of a typical platform engineering team's capacity.

This is the core value proposition of AKS: eliminate the undifferentiated heavy lifting of Kubernetes operations so platform teams can focus on developer productivity, application reliability, and data platform integration.

AKS is the fastest-growing service on Azure. Microsoft reports over 29,000 AKS clusters running in Azure Government alone. The service processes over 15 million Kubernetes API requests per second globally. It is not a niche offering -- it is the primary container platform for Azure-native organizations.

1. Managed control plane: the operational argument

What AKS manages for you

Control plane component	Self-managed K8s	OpenShift	AKS
API server	Customer provisions, patches, scales	Red Hat manages (OCP)	Microsoft manages
etcd	Customer manages backup, compaction, defrag	Red Hat manages	Microsoft manages (covered by the control-plane uptime SLA on paid tiers)
Scheduler	Customer configures and monitors	Red Hat manages	Microsoft manages
Controller manager	Customer patches and tunes	Red Hat manages	Microsoft manages
Cloud controller manager	Customer integrates	Red Hat manages	Microsoft manages (Azure-native)
Certificate rotation	Customer automates (cert-manager or manual)	Automated (OCP)	Automated (AKS)
Kubernetes upgrades	Customer plans, tests, executes (4-month cadence)	Red Hat release cadence (OCP versions)	Auto-upgrade channels (patch, stable, rapid, node-image)
etcd backup	Customer scripts and verifies	Automated (OCP)	Microsoft manages
API server scaling	Customer sizes and scales	Red Hat manages	Auto-scales based on load
Control plane HA	Customer configures multi-master	Built-in (3+ masters)	Built-in (Azure-managed, SLA-backed)
Control plane cost	Hardware/VM cost + labor	Included in OCP subscription	Free (free tier)

The control plane cost: free

The AKS Free tier provides a fully managed control plane at zero cost. The Standard tier ($0.10/cluster/hour, ~$73/month) adds a financially backed uptime SLA for the control plane (99.95% for clusters that use availability zones, 99.9% otherwise) and support for larger clusters. Long-term support (LTS) Kubernetes versions require the Premium tier.

Compare this to self-managed Kubernetes, where the control plane runs on 3--5 dedicated servers (or VMs) that must be provisioned, patched, backed up, and replaced, or to OpenShift, where the control plane is managed by Red Hat but the subscription cost for 50 worker nodes ranges from $200K to $500K per year depending on the support tier and deployment model.

Upgrade automation

Kubernetes releases every four months. Each release has a 14-month support window. Self-managed clusters require teams to plan, test, and execute upgrades manually -- a process that typically takes 2--4 weeks per cluster, including staging validation, workload compatibility testing, and production rollout.

AKS auto-upgrade channels reduce this to a configuration choice:

Channel	Behavior	Best for
`none`	No automatic upgrades	Teams wanting full control
`patch`	Auto-applies patch versions (e.g., 1.29.2 to 1.29.4)	Most production clusters
`stable`	Auto-upgrades to the latest supported patch of minor version N-1, where N is the newest supported minor version	Teams comfortable with minor version changes
`rapid`	Auto-upgrades to latest supported version	Dev/test environments
`node-image`	Auto-updates node OS images weekly	Security-focused teams

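Combined with planned maintenance windows, AKS upgrade automation removes most of the manual execution work from what is typically the single most time-consuming operational task for Kubernetes platform teams. Workload compatibility testing remains the team's responsibility.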
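The channel semantics in the table can be sketched in a few lines of Python. This is an illustration only, not AKS code: `upgrade_target` and its inputs are hypothetical names, `stable` is assumed to track the minor version one behind the newest supported minor, and the sketch ignores the rule that minor versions cannot be skipped during an upgrade.

```python
from typing import List, Optional, Tuple

Version = Tuple[int, int, int]


def parse(version: str) -> Version:
    """Turn '1.29.4' into (1, 29, 4) so versions compare numerically."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)


def upgrade_target(channel: str, current: str, supported: List[str]) -> Optional[str]:
    """Return the Kubernetes version the channel would move to, or None."""
    if channel in ("none", "node-image"):
        # Neither channel changes the Kubernetes version;
        # node-image only refreshes the node OS image.
        return None

    cur = parse(current)
    versions = sorted(parse(v) for v in supported)
    minors = sorted({v[:2] for v in versions})

    if channel == "patch":
        wanted_minor = cur[:2]  # stay on the current minor
    elif channel == "stable":
        # Assumption: newest minor minus one (N-1), if one exists.
        wanted_minor = minors[-2] if len(minors) >= 2 else minors[-1]
    elif channel == "rapid":
        wanted_minor = minors[-1]  # newest supported minor
    else:
        raise ValueError(f"unknown channel: {channel}")

    candidates = [v for v in versions if v[:2] == wanted_minor]
    best = max(candidates, default=cur)
    # Never report a downgrade or a no-op as an upgrade.
    return ".".join(str(n) for n in best) if best > cur else None
```

With a supported list of 1.28.5, 1.29.2, 1.29.4, and 1.30.1, a cluster on 1.29.2 using `patch` would move to 1.29.4, while the same cluster on `rapid` would move to 1.30.1.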

2. Azure integration: the ecosystem argument

Identity: Entra ID native

AKS integrates natively with Entra ID (formerly Azure AD) for both cluster administration and workload identity:

Cluster RBAC: Entra ID groups map directly to Kubernetes ClusterRoleBindings and RoleBindings. No separate identity provider configuration. No OIDC federation setup. No LDAP connector maintenance.
Workload Identity: Pods authenticate to Azure services (Key Vault, Storage, SQL, Cosmos DB) using federated identity credentials -- no secrets, no managed identity pods, no token refresh logic. The pod's service account token is exchanged for an Entra ID token transparently.
Conditional Access: Apply Entra Conditional Access policies to kubectl access -- require MFA, compliant devices, specific network locations, or risk-based evaluation before allowing cluster administration.
Privileged Identity Management (PIM): Just-in-time elevation for cluster-admin access with approval workflows, time-limited activation, and audit trails.

Compare this to self-managed Kubernetes where identity integration requires deploying and maintaining an OIDC provider (Dex, Keycloak), configuring API server flags, managing certificate rotation for OIDC endpoints, and building custom webhook authenticators.

Secrets: Azure Key Vault integration

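The Azure Key Vault Secrets Provider (CSI driver) mounts Key Vault secrets, keys, and certificates directly into pods as volumes, and can optionally sync them into Kubernetes Secrets for use as environment variables. With autorotation enabled, mounted files are refreshed as Key Vault values change; environment variables only pick up new values when the pod restarts. No sidecar containers. No init containers pulling secrets at startup. No custom operators watching Secret resources.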

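# SecretProviderClass: tells the CSI driver which Key Vault objects to mount
# into the pod, and optionally mirrors them into a Kubernetes Secret.
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
    name: azure-kv-secrets
spec:
    provider: azure
    parameters:
        usePodIdentity: "false"
        clientID: "your-workload-identity-client-id" # placeholder: identity used for Workload Identity auth
        keyvaultName: "kv-csa-prod"
        tenantId: "your-tenant-id"
        objects: |
            array:
              - |
                objectName: db-connection-string
                objectType: secret
              - |
                objectName: tls-cert
                objectType: secret
    # Optional: sync mounted objects into a Kubernetes Secret
    # (the Secret is created only once a pod mounts the volume).
    secretObjects:
        - secretName: app-secrets
          type: Opaque
          data:
              - objectName: db-connection-string
                key: DATABASE_URL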
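On the application side, a workload can read the mounted object as a file and fall back to the synced environment variable. The following is a minimal sketch under stated assumptions: the mount path `/mnt/secrets-store` and the helper name `load_secret` are hypothetical and depend on how the pod spec mounts the CSI volume and references the `app-secrets` Secret.

```python
import os
from pathlib import Path


def load_secret(object_name: str, env_var: str,
                mount_dir: str = "/mnt/secrets-store") -> str:
    """Read a secret from the CSI volume, falling back to an env var.

    The mounted file is preferred because it can be refreshed when the
    secret rotates; an environment variable is fixed at container start.
    """
    path = Path(mount_dir) / object_name
    if path.is_file():
        return path.read_text(encoding="utf-8").strip()

    value = os.environ.get(env_var)
    if value is None:
        raise KeyError(
            f"secret {object_name!r} not mounted and {env_var} is not set"
        )
    return value
```

Calling `load_secret("db-connection-string", "DATABASE_URL")` on each new connection, rather than once at import time, lets the application pick up rotated values without a restart.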

Container Registry: ACR integration

Azure Container Registry integrates with AKS through managed identity -- no imagePullSecrets, no registry credential rotation, no Docker config secrets to manage. AKS authenticates to ACR automatically. ACR Tasks provide cloud-native image builds (replacing Jenkins-based Docker builds or OpenShift Source-to-Image). Defender for Containers scans images in ACR automatically and blocks deployment of vulnerable images through admission control.

Monitoring: Azure Monitor Container Insights

Container Insights provides out-of-the-box monitoring for AKS clusters:

Metrics: node CPU/memory/disk, pod CPU/memory/network, container restarts, OOMKills
Logs: container stdout/stderr, Kubernetes events, audit logs
Live data: real-time container logs and metrics in the Azure Portal
Prometheus integration: Azure Monitor managed service for Prometheus for native Prometheus metric collection without running your own Prometheus server
Grafana integration: Azure Managed Grafana for dashboards without running your own Grafana instance
Alerts: pre-built alert rules for common failure modes (node not ready, pod crash loops, high CPU/memory)
Cost analysis: per-namespace and per-workload cost allocation

Compare this to self-managed clusters where monitoring requires deploying the kube-prometheus-stack (Prometheus Operator, Alertmanager, Grafana, node-exporter, kube-state-metrics), sizing persistent storage for metrics retention, managing Prometheus federation for multi-cluster, and maintaining Grafana dashboards.

Networking: Azure-native

Azure CNI Overlay: high-performance pod networking without consuming VNet IP addresses per pod
Azure CNI powered by Cilium: eBPF-based networking with advanced network policy, observability, and encryption
Azure Load Balancer: automatic provisioning for Service type: LoadBalancer
Application Gateway Ingress Controller (AGIC): L7 load balancing with WAF, SSL termination, and URL-based routing integrated with Azure Application Gateway
Private clusters: API server accessible only through Private Link -- no public endpoint
Azure Private Link: connect AKS pods to Azure PaaS services (Storage, SQL, Cosmos, Key Vault) through private endpoints

Policy: Azure Policy for Kubernetes

Azure Policy for Kubernetes (built on OPA Gatekeeper) provides:

Pre-built policy initiatives for CIS benchmarks, Pod Security Standards, and STIG baselines
Custom policy definitions using Rego
Compliance reporting in Azure Portal
Audit and deny enforcement modes
Policy exemptions for specific namespaces or workloads

3. CNCF conformance: the portability argument

AKS is CNCF Certified Kubernetes. Every standard Kubernetes API, resource type, and behavior works exactly as specified. This means:

Helm charts that run on self-managed Kubernetes run on AKS without modification
Kubernetes operators (Prometheus Operator, cert-manager, external-dns, Strimzi, Spark Operator) deploy and operate identically
kubectl, Kustomize, Skaffold, Tilt, and every standard Kubernetes toolchain work without changes
CRDs and custom controllers work without modification
Network policies (Calico, Cilium) work without modification
CSI drivers (beyond Azure-native) can be installed for specialized storage needs

The migration path from self-managed Kubernetes to AKS is workload-transparent for standard Kubernetes resources. The effort is in infrastructure configuration (networking, storage classes, identity), not in application changes.

For OpenShift, the effort is higher because OpenShift extends Kubernetes with non-standard resources (Routes, DeploymentConfigs, BuildConfigs, ImageStreams, SCCs). These require conversion to standard Kubernetes equivalents. See the Feature Mapping for detailed conversion guidance.
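As an illustration of what such a conversion involves, the sketch below maps a simple OpenShift Route onto a standard `networking.k8s.io/v1` Ingress. It is not the Feature Mapping's method: `route_to_ingress` is a hypothetical helper, the ingress class is supplied by the caller, the `<name>-tls` Secret name is an assumption (a Route embeds its certificate inline, while an Ingress references a Secret), and only edge TLS termination is handled.

```python
from typing import Any, Dict


def route_to_ingress(route: Dict[str, Any], ingress_class: str) -> Dict[str, Any]:
    """Convert a simple OpenShift Route manifest into an Ingress manifest."""
    name = route["metadata"]["name"]
    spec = route["spec"]

    # A Route's targetPort may be a number or a named port.
    # Assumption: default to port 80 when the Route omits it.
    target_port = spec.get("port", {}).get("targetPort", 80)
    if isinstance(target_port, int):
        port = {"number": target_port}
    else:
        port = {"name": target_port}

    ingress: Dict[str, Any] = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {
            "name": name,
            "namespace": route["metadata"].get("namespace", "default"),
        },
        "spec": {
            "ingressClassName": ingress_class,
            "rules": [{
                "host": spec["host"],
                "http": {"paths": [{
                    "path": spec.get("path", "/"),
                    "pathType": "Prefix",
                    "backend": {"service": {"name": spec["to"]["name"],
                                            "port": port}},
                }]},
            }],
        },
    }

    # Edge termination maps onto Ingress TLS; passthrough and re-encrypt
    # need controller-specific settings and are left out of this sketch.
    if spec.get("tls", {}).get("termination") == "edge":
        ingress["spec"]["tls"] = [{
            "hosts": [spec["host"]],
            "secretName": f"{name}-tls",  # hypothetical Secret name
        }]
    return ingress
```

DeploymentConfigs, BuildConfigs, ImageStreams, and SCCs have no equally mechanical mapping, which is why the conversion effort is higher for OpenShift than for self-managed Kubernetes.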

4. Cost: the financial argument

AKS control plane pricing

Tier	Control plane cost	SLA	Key features
Free	$0	No SLA (99.5% design target)	Managed control plane; recommended for clusters under 10 nodes
Standard	$0.10/cluster/hour (~$73/month)	99.95% uptime SLA with availability zones (99.9% without)	Financially backed uptime SLA; larger cluster scale
Premium	$0.60/cluster/hour (~$438/month)	Same as Standard	All Standard features + long-term support (LTS) Kubernetes versions
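The monthly figures in the table follow directly from the hourly rates. A short check, assuming a 730-hour month (8,760 hours per year divided by 12) and the list prices quoted above; `control_plane_cost` is a hypothetical helper name:

```python
HOURS_PER_YEAR = 24 * 365              # 8,760
HOURS_PER_MONTH = HOURS_PER_YEAR / 12  # 730.0


def control_plane_cost(hourly_rate: float, hours: float) -> float:
    """Cost in dollars for one cluster's control plane over a period."""
    return round(hourly_rate * hours, 2)


standard_monthly = control_plane_cost(0.10, HOURS_PER_MONTH)        # 73.0
premium_monthly = control_plane_cost(0.60, HOURS_PER_MONTH)         # 438.0
standard_three_year = control_plane_cost(0.10, HOURS_PER_YEAR * 3)  # 2628.0
```

The three-year Standard figure of $2,628 per cluster is the source of the roughly $2.6K control-plane line used in a three-year cost comparison.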

Savings sources

Control plane infrastructure: eliminate 3--5 control plane servers per cluster (~$50K--$100K/year per cluster in hardware + hosting)
Platform engineering FTEs: reduce from 6--8 FTEs (self-managed) to 2--3 FTEs (AKS) -- the team shifts from "keep Kubernetes alive" to "make developers productive"
OpenShift subscription elimination: $200K--$500K/year for a 50-node deployment
Container registry: ACR (roughly $0.667/day list price for the Standard SKU) replaces self-hosted Harbor or Quay ($50K--$100K/year)
Monitoring infrastructure: Container Insights + Managed Prometheus replaces self-hosted Prometheus stack (~$100K--$200K/year in infrastructure + labor)
Spot instances: AKS supports Azure Spot VMs for batch and fault-tolerant workloads at up to 90% discount
Reserved Instances: 1-year (up to 38% savings) or 3-year (up to 56% savings) commitments on worker node VMs
Cluster autoscaler + node auto-provisioning: right-size infrastructure automatically, avoiding persistent over-provisioning

Cost comparison: 50-node deployment (3-year TCO)

Component	Self-managed K8s	OpenShift 4.x	AKS Standard
Control plane	$450K (hardware + ops)	Included in sub	$2.6K (standard tier)
Worker nodes	$1.8M (servers + DC)	$1.8M (servers + DC)	$1.2M (Azure VMs, 1yr RI)
Platform subscription	N/A	$1.2M (OCP Premium)	N/A
Registry	$300K (Harbor)	Included (Quay)	$18K (ACR Standard)
Monitoring	$450K (Prometheus stack)	$300K	$150K (Container Insights)
FTEs (platform team)	$3.0M (8 FTEs)	$2.1M (6 FTEs)	$1.2M (3 FTEs)
Networking	$300K	$300K	$200K (ExpressRoute)
3-year total	$6.3M	$5.7M	$2.8M

See the detailed TCO analysis for a rigorous comparison across small, medium, and large deployment scenarios.
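The totals in the 50-node comparison can be re-derived from its line items. The sketch below uses only the figures from that table, expressed in thousands of dollars; the dictionary layout and function names are illustrative.

```python
# Three-year line items from the 50-node comparison, in $K.
TCO_3YR_K = {
    "Self-managed K8s": {
        "Control plane": 450, "Worker nodes": 1800, "Platform subscription": 0,
        "Registry": 300, "Monitoring": 450, "FTEs": 3000, "Networking": 300,
    },
    "OpenShift 4.x": {
        "Control plane": 0, "Worker nodes": 1800, "Platform subscription": 1200,
        "Registry": 0, "Monitoring": 300, "FTEs": 2100, "Networking": 300,
    },
    "AKS Standard": {
        "Control plane": 2.6, "Worker nodes": 1200, "Platform subscription": 0,
        "Registry": 18, "Monitoring": 150, "FTEs": 1200, "Networking": 200,
    },
}


def totals(table):
    """Sum each platform's line items (still in $K)."""
    return {platform: round(sum(items.values()), 1)
            for platform, items in table.items()}


def savings_vs(table, baseline, candidate):
    """Fractional saving of candidate relative to baseline."""
    t = totals(table)
    return 1 - t[candidate] / t[baseline]
```

The sums come to $6,300K, $5,700K, and $2,770.6K, matching the rounded $6.3M, $5.7M, and $2.8M totals. On these inputs AKS is about 56% cheaper than self-managed Kubernetes and about 51% cheaper than OpenShift, with the platform-team FTE line contributing the largest single difference.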

5. Copilot in AKS: the AI-assisted operations argument

Copilot in AKS brings natural-language Kubernetes operations to the Azure Portal:

Troubleshoot clusters: "Why are pods in namespace production crashing?" Copilot queries cluster metrics, logs, and events to provide a diagnosis.
Generate YAML: "Create a deployment for a Python Flask app with 3 replicas and a readiness probe on /health" generates valid Kubernetes YAML.
Explain resources: "Explain why this pod is in CrashLoopBackOff" analyzes container logs, events, and resource limits to identify root cause.
Optimize configurations: "Suggest resource limits for this deployment based on last 7 days of metrics" analyzes Container Insights data.
Policy guidance: "What Azure Policies are blocking this deployment?" identifies which Gatekeeper constraints are preventing admission.

This is not a replacement for experienced platform engineers. It is a force multiplier -- reducing mean-time-to-diagnosis for common operational issues from hours to minutes, and enabling application developers to self-service basic Kubernetes operations without deep platform expertise.

6. Automatic upgrades and maintenance: the reliability argument

AKS Automatic (Preview to GA 2025--2026)

AKS Automatic represents the fully managed AKS experience:

Automatic node pool sizing and VM selection based on workload requirements
Automatic scaling (node autoscaler + KEDA)
Automatic upgrades (Kubernetes version + node image)
Automatic security patching (node OS + runtime)
Pre-configured with best practices (Pod Security Standards, network policy, Container Insights, Defender for Containers)
Azure CNI Overlay with Cilium (default)

For new deployments, AKS Automatic reduces the decision surface from dozens of configuration choices to a single az aks create --sku automatic command.

Long-term support (LTS)

The AKS Premium tier offers Long-Term Support (LTS) Kubernetes versions with 2 years of support (the standard community-support year plus a further year of Microsoft-provided patches), compared to the standard 1 year. For federal agencies with slower upgrade cadences or certification requirements, LTS reduces the upgrade pressure from annual to biennial.

7. CSA-in-a-Box integration: the data platform argument

AKS is not just a container runtime in the CSA-in-a-Box architecture -- it is a first-class compute tier for data workloads:

Workload	AKS role	CSA-in-a-Box integration
Spark on Kubernetes	Spark Operator runs Spark drivers and executors as pods on AKS	Jobs read/write ADLS Gen2 via managed identity; metadata in Unity Catalog; lineage in Purview
Model serving	Triton / vLLM / TorchServe on GPU node pools (NC, ND series)	Models registered in AI Foundry; endpoints in data marketplace; inference logs to Container Insights
Event-driven ETL	KEDA-scaled consumers reading Event Hubs / Kafka	Writes to medallion architecture on ADLS Gen2; schema registry integration
dbt execution	CronJobs running dbt-core containers	Transforms against Databricks SQL or Fabric SQL; contracts validated in CI
Data APIs	REST/GraphQL APIs serving data products	Entra Workload Identity auth; AGIC routing; Purview-governed data access
Notebook execution	Papermill / Jupyter containers for scheduled notebook runs	Output artifacts to ADLS Gen2; metadata to Purview

This integration means organizations migrating to AKS can simultaneously modernize their data platform -- running containerized data workloads on the same infrastructure that serves their application workloads.

8. Where self-managed Kubernetes and OpenShift still win

This section exists because an honest assessment is more useful than a sales pitch.

Self-managed Kubernetes advantages

Total control: you own every binary, configuration flag, and kernel parameter. For workloads requiring custom kernel modules, specific Linux distributions, or exotic hardware (FPGAs, custom NICs), self-managed K8s gives you control AKS cannot.
Air-gapped environments: fully disconnected networks with no Azure connectivity. AKS on Azure Local (formerly Azure Stack HCI) addresses some of this, but the most restrictive air-gapped environments (SCIF, submarine, forward-deployed) need self-managed Kubernetes.
Multi-cloud portability: if your strategy requires identical Kubernetes configurations across AWS, GCP, Azure, and on-prem, running self-managed K8s on VMs in each environment provides the highest portability (though at the highest operational cost), because managed services such as EKS, GKE, and AKS each differ in defaults and add-ons.
Cost at extreme scale: for organizations running 500+ nodes per cluster with mature operations teams, the per-node cost of self-managed K8s on bare metal can be lower than Azure VM pricing. This only applies to organizations with existing data center capacity and staff.

OpenShift advantages

Developer experience: OpenShift's integrated developer console, Source-to-Image builds, and opinionated project model provide a more complete developer experience out of the box than AKS, which is more of a building-blocks platform.
Operator ecosystem: OperatorHub + OLM provides a curated, tested operator catalog with lifecycle management. AKS extensions are growing but not yet as comprehensive.
Enterprise Linux: OpenShift runs on Red Hat CoreOS with automated host management. Organizations with deep Red Hat relationships and RHEL standardization may prefer this.
Service Mesh: OpenShift Service Mesh (Istio-based) is deeply integrated with the platform, including the console, monitoring, and routing. AKS Istio addon is functional but less integrated.
Existing investment: teams with years of OpenShift operational knowledge, custom operators built with the Operator SDK, and CI/CD pipelines using BuildConfigs and ImageStreams face real migration costs. Azure Red Hat OpenShift (ARO) preserves this investment on Azure.

Decision framework

Choose AKS if: standard Kubernetes workloads, Azure-first strategy, cost optimization priority, small-to-medium platform team, new container platform deployment, or data platform integration with CSA-in-a-Box.
Choose ARO if: deep OpenShift dependency (Routes, SCC, OLM, BuildConfigs), Red Hat enterprise agreement, existing OCP operational expertise, and willingness to pay the ARO premium for OpenShift compatibility.
Stay self-managed if: air-gapped with no Azure connectivity, extreme bare-metal performance requirements, multi-cloud parity mandate, or 500+ node clusters with mature dedicated operations teams and existing data center capacity.
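The framework above can be written down as a small decision function, which is useful as a checklist when assessing several clusters. This is a sketch under stated assumptions: the field names are hypothetical, and the precedence (self-managed blockers first, then ARO, then AKS as the default) is one reading of the three rules rather than something the framework prescribes.

```python
from dataclasses import dataclass


@dataclass
class PlatformProfile:
    # Conditions from "Stay self-managed if"
    air_gapped_no_azure: bool = False
    bare_metal_performance: bool = False
    multi_cloud_parity: bool = False
    large_clusters_mature_ops: bool = False  # 500+ nodes, own data center
    # Conditions from "Choose ARO if"
    deep_openshift_dependency: bool = False  # Routes, SCC, OLM, BuildConfigs
    accepts_aro_premium: bool = False


def recommend(profile: PlatformProfile) -> str:
    """Return 'self-managed', 'ARO', or 'AKS' for a cluster profile."""
    stay_self_managed = (
        profile.air_gapped_no_azure
        or profile.bare_metal_performance
        or profile.multi_cloud_parity
        or profile.large_clusters_mature_ops
    )
    if stay_self_managed:
        return "self-managed"
    if profile.deep_openshift_dependency and profile.accepts_aro_premium:
        return "ARO"
    # Standard Kubernetes workloads with an Azure-first strategy.
    return "AKS"
```

A team with deep OpenShift dependencies that is not willing to pay the ARO premium falls through to AKS, which is where the conversion effort for OpenShift-specific resources has to be budgeted.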

9. Migration risk assessment

Risk	Probability	Impact	Mitigation
OpenShift-specific CRD conversion	Medium	Medium	Feature mapping guide + pilot migration validate conversion
Persistent volume data loss	Low	High	Velero backup/restore with validation; dual-run during transition
Network policy incompatibility	Low	Medium	Test Calico/Cilium policies on AKS before cutover
CI/CD pipeline disruption	Medium	Medium	Parallel pipeline execution during transition
Performance regression	Low	Medium	Benchmark source cluster; validate on AKS before cutover
Compliance gap during transition	Low	High	Pre-deploy Azure Policy + Defender for Containers before workload migration
Team skill gap	Medium	Medium	Microsoft FastTrack for AKS; training budget for platform team
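To prioritize mitigation work, the qualitative ratings in the table can be turned into a simple probability-times-impact score. The 1-3 numeric scale is an assumption introduced here for illustration; the ratings themselves are taken from the table.

```python
LEVEL = {"Low": 1, "Medium": 2, "High": 3}  # assumed numeric scale

# (risk, probability, impact) as rated in the table above
RISKS = [
    ("OpenShift-specific CRD conversion", "Medium", "Medium"),
    ("Persistent volume data loss", "Low", "High"),
    ("Network policy incompatibility", "Low", "Medium"),
    ("CI/CD pipeline disruption", "Medium", "Medium"),
    ("Performance regression", "Low", "Medium"),
    ("Compliance gap during transition", "Low", "High"),
    ("Team skill gap", "Medium", "Medium"),
]


def rank_risks(risks):
    """Score each risk as probability x impact, highest first.

    sorted() is stable, so risks with equal scores keep table order.
    """
    scored = [(name, LEVEL[prob] * LEVEL[impact])
              for name, prob, impact in risks]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

On this scale the three Medium/Medium risks (CRD conversion, pipeline disruption, skill gap) score 4 and rank ahead of the two Low/High risks, which score 3. Teams that weight impact more heavily than probability may prefer a different scale.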

10. Next steps

Read the detailed analysis: TCO Analysis for cost justification, Feature Mapping for technical assessment
Assess your current state: inventory clusters, workloads, and dependencies using the Cluster Migration discovery checklist
Run a pilot: follow the Tutorial: App Migration to migrate a single application end-to-end
Plan the migration: use the phased project plan in the Migration Playbook as a starting template
Engage Microsoft FastTrack: request a migration assessment from the AKS FastTrack team for clusters with 50+ nodes or complex workloads

Maintainers: CSA-in-a-Box core team

Last updated: 2026-04-30

Related: Migration Playbook | Migration Center | Federal Guide