Status: Authored 2026-04-30
Audience: Platform engineers, SREs, and architects operating AKS clusters for federal workloads, especially those integrating with CSA-in-a-Box data services.
Scope: Cluster design, node pool strategy, namespace organization, monitoring, GitOps adoption, security hardening, and CSA-in-a-Box integration patterns.
## 1. Cluster design

### Multi-cluster strategy

For federal deployments, use separate clusters per environment and security boundary:
| Cluster | Purpose | AKS tier | Node pools | Notes |
|---|---|---|---|---|
| aks-dev | Development and testing | Free | 1 system + 1 workload | Smallest viable; Spot nodes for cost savings |
| aks-staging | Pre-production validation | Standard | 1 system + 2 workload | Mirrors production topology |
| aks-prod | Production workloads | Standard or Premium | 1 system + 3-5 workload | Full HA, zone-redundant |
| aks-data | CSA-in-a-Box data workloads | Standard | 1 system + GPU + high-mem | Spark, model serving, dbt |
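As a sketch, the aks-prod row translates to `az aks create` flags roughly as follows. The resource group, cluster name, and region are hypothetical, and the security flags anticipate the hardening baseline in section 6; verify the flag set against your az CLI version before use.

```bash
# A sketch only: rg-aks-prod, aks-prod-govva-platform, and usgovvirginia are hypothetical names
az aks create \
  --resource-group rg-aks-prod \
  --name aks-prod-govva-platform \
  --location usgovvirginia \
  --tier standard \
  --enable-private-cluster \
  --enable-aad --enable-azure-rbac --disable-local-accounts \
  --enable-oidc-issuer --enable-workload-identity \
  --network-plugin azure --network-plugin-mode overlay \
  --network-policy azure \
  --zones 1 2 3 \
  --nodepool-name system \
  --node-count 3 \
  --node-vm-size Standard_D4s_v5
```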
### Single cluster with namespace isolation (alternative)

For smaller deployments, a single cluster with namespace isolation is acceptable:

```yaml
# Namespace hierarchy
namespaces:
  system:
    - kube-system         # K8s system pods
    - ingress-nginx       # Ingress controller
    - cert-manager        # TLS certificate management
    - monitoring          # Prometheus, Grafana
    - flux-system         # GitOps controller
    - gatekeeper-system   # Policy enforcement
    - velero              # Backup and restore
  workloads:
    - production          # Production applications
    - staging             # Staging applications
    - development         # Development sandboxes
  data:
    - spark-jobs          # Spark Operator workloads
    - model-serving       # ML model inference
    - data-pipelines      # dbt, event consumers
    - data-apis           # Data product APIs
```
### Cluster naming convention

```
aks-{environment}-{region}-{purpose}
```

Examples:

- `aks-prod-govva-platform`
- `aks-prod-govva-data`
- `aks-dev-govva-sandbox`
## 2. Node pool strategy

### Recommended node pool architecture

```mermaid
flowchart TB
subgraph AKS["AKS Cluster"]
SYS["System Pool<br>D4s_v5 x3<br>Taint: CriticalAddonsOnly"]
GEN["General Workload Pool<br>D8s_v5 x5-20<br>Autoscaler: 5-20"]
MEM["Memory-Optimized Pool<br>E16s_v5 x2-10<br>Taint: workload=memory"]
GPU_P["GPU Pool<br>NC24ads_A100 x0-4<br>Taint: nvidia.com/gpu"]
SPOT["Spot Pool<br>D8s_v5 x0-30<br>Taint: spot"]
FIPS_P["FIPS Pool<br>D8s_v5 x3-15<br>FIPS-enabled image"]
end
SYS --> |CoreDNS, metrics-server, CSI| SYS
GEN --> |API servers, web apps, workers| GEN
MEM --> |Spark executors, caches, databases| MEM
GPU_P --> |Triton, vLLM, training jobs| GPU_P
SPOT --> |Batch jobs, CI/CD builds, Spark| SPOT
FIPS_P --> |IL5 workloads, PII processing| FIPS_P
```

### Node pool sizing guidelines

| Workload type | VM series | Sizing rule | Autoscaler min/max |
|---|---|---|---|
| System pods | D4s_v5 | 3 nodes (one per zone) | 3/5 |
| General workloads | D8s_v5 | 2-3 pods per vCPU (packing) | 3/20 |
| Memory-intensive (Spark, Redis) | E16s_v5 or E32s_v5 | Memory-bound; 70% utilization target | 2/10 |
| GPU inference | NC24ads_A100_v4 | 1 model per GPU (simple) or MIG (multi-model) | 0/4 |
| GPU training | ND96isr_H100_v5 | Job-based; scale to 0 when idle | 0/8 |
| Batch / CI builds | D8s_v5 (Spot) | Scale to 0 when idle; tolerate eviction | 0/30 |
| FIPS workloads | D8s_v5 (FIPS image) | Dedicated pool for IL5+ workloads | 3/15 |
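As a hedged example, the general workload pool and the FIPS pool from this table might be added like this (cluster and pool names are hypothetical); the flags map to the best practices listed below:

```bash
# General workload pool: zone-redundant, autoscaled, larger pod density and OS disk
az aks nodepool add \
  --resource-group rg-aks-prod --cluster-name aks-prod-govva-platform \
  --name workload --node-vm-size Standard_D8s_v5 \
  --zones 1 2 3 --enable-cluster-autoscaler --min-count 3 --max-count 20 \
  --max-pods 110 --node-osdisk-size 128

# FIPS-enabled pool for IL5+ workloads, tainted so only tolerating pods land on it
az aks nodepool add \
  --resource-group rg-aks-prod --cluster-name aks-prod-govva-platform \
  --name fipspool --node-vm-size Standard_D8s_v5 --enable-fips-image \
  --zones 1 2 3 --enable-cluster-autoscaler --min-count 3 --max-count 15 \
  --node-taints workload=fips:NoSchedule
```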
### Best practices for node pools

- Always use availability zones (zones 1, 2, 3) for production node pools
- Taint the system pool with `CriticalAddonsOnly=true:NoSchedule` to prevent application pods from landing on system nodes
- Taint specialized pools (GPU, memory, Spot) and use tolerations to target workloads explicitly
- Scale GPU and Spot pools to 0 when idle; unused capacity then costs nothing
- Use `max-pods=110`; the default of 30 is too low for most workloads
- Set the OS disk type to Managed and the OS disk size to 128 GB minimum for production

## 3. Namespace organization

### Resource isolation per namespace

```yaml
# Template for production namespace
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    # Pod Security Standards
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
    # Metadata
    team: platform
    environment: production
    cost-center: "CC-12345"
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: resource-limits
  namespace: production
spec:
  hard:
    requests.cpu: "40"
    requests.memory: "80Gi"
    limits.cpu: "80"
    limits.memory: "160Gi"
    persistentvolumeclaims: "20"
    services.loadbalancers: "3"
    count/deployments.apps: "50"
    count/services: "50"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      max:
        cpu: "8"
        memory: "16Gi"
      min:
        cpu: "50m"
        memory: "64Mi"
      type: Container
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```
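To sanity-check the two policies, a throwaway pod can confirm that DNS resolves while other egress is blocked. This is only a sketch: the `--overrides` JSON is needed because the namespace enforces the restricted Pod Security Standard, which rejects pods without a compliant security context.

```bash
# Hypothetical smoke test: DNS should succeed (allow-dns), wget should time out (default-deny)
kubectl run netpol-test -n production --rm -it --restart=Never --image=busybox:1.36 \
  --overrides='{
    "spec": {"containers": [{
      "name": "netpol-test", "image": "busybox:1.36", "stdin": true, "tty": true,
      "command": ["sh", "-c",
        "nslookup kubernetes.default && wget -T 5 -q -O- http://example.com"],
      "securityContext": {
        "runAsNonRoot": true, "runAsUser": 65534,
        "allowPrivilegeEscalation": false,
        "capabilities": {"drop": ["ALL"]},
        "seccompProfile": {"type": "RuntimeDefault"}
      }
    }]}}'
```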
## 4. Monitoring and observability

### Monitoring stack architecture

```mermaid
flowchart LR
subgraph AKS["AKS Cluster"]
APP[Application Pods]
OMS[Azure Monitor Agent<br>Container Insights]
PROM[Managed Prometheus<br>Metric Collection]
end
subgraph Azure["Azure Monitor"]
LAW[Log Analytics<br>Workspace]
GRAF[Managed Grafana<br>Dashboards]
ALERT[Azure Monitor<br>Alerts]
AI[Application<br>Insights]
end
APP --> OMS
APP --> PROM
OMS --> LAW
PROM --> GRAF
LAW --> ALERT
APP --> AI
style AKS fill:#0078d4,color:#fff
style Azure fill:#5c2d91,color:#fff
```

### Essential alerts

```yaml
# Critical alerts to configure
alerts:
  cluster:
    - name: NodeNotReady
      condition: "Node status NotReady for > 5 minutes"
      severity: Critical
    - name: ClusterAutoscalerFailed
      condition: "Autoscaler unable to provision nodes"
      severity: Critical
    - name: PVCPendingBound
      condition: "PVC pending for > 10 minutes"
      severity: Warning
  workload:
    - name: PodCrashLooping
      condition: "Pod restart count > 5 in 10 minutes"
      severity: Critical
    - name: ContainerOOMKilled
      condition: "Container OOMKilled event"
      severity: Warning
    - name: HPAMaxedOut
      condition: "HPA at maxReplicas for > 30 minutes"
      severity: Warning
    - name: HighErrorRate
      condition: "HTTP 5xx rate > 5% for > 5 minutes"
      severity: Critical
  security:
    - name: DefenderThreatDetected
      condition: "Defender for Containers alert fired"
      severity: Critical
    - name: PolicyViolation
      condition: "Azure Policy deny event"
      severity: Warning
    - name: UnauthorizedAPIAccess
      condition: "401/403 to API server > 10 in 5 minutes"
      severity: Warning
```
### Log retention strategy

| Log category | Retention | Purpose |
|---|---|---|
| Container stdout/stderr | 30 days | Debugging and troubleshooting |
| Kubernetes events | 30 days | Cluster state changes |
| kube-audit | 90 days (federal minimum) | Security audit trail |
| kube-audit-admin | 1 year (for IL5/STIG compliance) | Admin action audit |
| Defender alerts | 1 year | Security incident investigation |
| Azure Activity Log | 90 days | Azure resource changes |
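Retention can be set per table in the Log Analytics workspace rather than workspace-wide. A sketch, assuming a hypothetical workspace `law-aks-prod` and diagnostic settings in resource-specific mode (which produces the `AKSAudit` and `AKSAuditAdmin` tables):

```bash
# 90-day retention for audit logs, 1 year for admin audit logs
az monitor log-analytics workspace table update \
  -g rg-aks-prod --workspace-name law-aks-prod \
  --name AKSAudit --retention-time 90
az monitor log-analytics workspace table update \
  -g rg-aks-prod --workspace-name law-aks-prod \
  --name AKSAuditAdmin --retention-time 365
```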
## 5. GitOps adoption

### GitOps repository structure

```
aks-gitops/
  clusters/
    aks-prod-govva/
      flux-system/            # Flux bootstrap (auto-generated)
      infrastructure.yaml     # Kustomization pointing to infrastructure/
      applications.yaml       # Kustomization pointing to apps/
  infrastructure/
    sources/                  # Helm repositories, Git sources
      helm-repos.yaml
    controllers/
      ingress-nginx/
        namespace.yaml
        helmrelease.yaml
      cert-manager/
        namespace.yaml
        helmrelease.yaml
      kube-prometheus-stack/
        namespace.yaml
        helmrelease.yaml
    policies/
      pod-security.yaml
      network-policies.yaml
      resource-quotas.yaml
  applications/
    base/
      api-server/
        deployment.yaml
        service.yaml
        ingress.yaml
        kustomization.yaml
    overlays/
      production/
        kustomization.yaml
        patches/
      staging/
        kustomization.yaml
        patches/
```
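Bootstrapping Flux against this layout might look like the following sketch (the Git URL and org are hypothetical); Flux then reconciles everything under `clusters/aks-prod-govva/`:

```bash
# Point Flux at the cluster directory in the hypothetical aks-gitops repo
flux bootstrap git \
  --url=ssh://git@github.com/example-org/aks-gitops \
  --branch=main \
  --path=clusters/aks-prod-govva

# Verify that the infrastructure and applications Kustomizations reconcile
flux get kustomizations -A
```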
### GitOps best practices

- Separate infrastructure and application repositories: infrastructure changes have different review and approval requirements
- Use Kustomize overlays for environment-specific configuration (never duplicate manifests)
- Pin Helm chart versions; never use `latest` or `*` in HelmRelease version constraints
- Require PR approval for production changes: GitOps is only as good as the review process
- Use Flux image automation for automated image updates in non-production environments
- Encrypt secrets in Git with SOPS + Azure Key Vault, or reference Key Vault via SecretProviderClass and do not store secrets in Git at all

## 6. Security hardening best practices

### Defense in depth layers

| Layer | Control | Implementation |
|---|---|---|
| Cluster | Private cluster, Entra ID auth, local accounts disabled | Cluster creation flags |
| Network | Default-deny network policies, Azure Firewall egress | NetworkPolicy + Azure Firewall |
| Node | FIPS images, CIS benchmarks, auto-patching | Node pool config + Azure Policy |
| Container | PSS restricted, read-only filesystem, drop all capabilities | Pod security context |
| Image | Approved registries only, vulnerability scanning, image signing | Azure Policy + Defender + Notation |
| Identity | Workload Identity (no secrets), Key Vault secrets, MFA for admins | Workload Identity + Key Vault + Conditional Access |
| Runtime | Defender runtime protection, binary drift detection | Defender for Containers |
| Data | Encryption at rest (CMK), encryption in transit (TLS/mTLS) | Azure Disk encryption + Istio |
### Security baseline checklist

```bash
# Verify security configuration
# 1. Local accounts disabled
az aks show -g rg-aks-prod -n aks-prod --query disableLocalAccounts
# 2. Entra ID integration enabled
az aks show -g rg-aks-prod -n aks-prod --query aadProfile.managed
# 3. Private cluster
az aks show -g rg-aks-prod -n aks-prod --query apiServerAccessProfile.enablePrivateCluster
# 4. Defender enabled
az aks show -g rg-aks-prod -n aks-prod --query securityProfile.defender
# 5. Workload Identity enabled
az aks show -g rg-aks-prod -n aks-prod --query securityProfile.workloadIdentity.enabled
# 6. FIPS node pools
az aks nodepool show -g rg-aks-prod --cluster-name aks-prod -n fipspool --query enableFIPS
# 7. Azure Policy assigned
az policy assignment list --scope "/subscriptions/$SUB_ID/resourceGroups/rg-aks-prod" --query "[].displayName"
```
## 7. Cost optimization

### Right-sizing workloads

```bash
# Use Container Insights to identify over-provisioned workloads
# Azure Portal > AKS > Insights > Containers
# Look for: CPU request >> CPU usage, Memory request >> Memory usage
# CLI: check resource usage vs requests
kubectl top pods -A --sort-by=cpu | head -20
kubectl top pods -A --sort-by=memory | head -20
```
### Reserved Instances strategy

| Workload type | Commitment | Recommendation |
|---|---|---|
| System pool (always running) | 3-year RI | 56% savings |
| Production workload pool (steady-state) | 1-year RI | 38% savings |
| GPU pool (intermittent) | Pay-as-you-go + Spot | Use Spot for training; PAYG for inference |
| Spot pool (batch jobs) | Spot pricing | Up to 90% savings |
| Dev/staging | Azure Savings Plan | Flexible across regions and sizes |
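For the Spot pool, a hedged `az aks nodepool add` sketch (names are hypothetical). `--spot-max-price -1` caps the price at the current pay-as-you-go rate, and AKS applies the `kubernetes.azure.com/scalesetpriority=spot:NoSchedule` taint to Spot pools automatically, so batch workloads opt in via a toleration:

```bash
# Spot pool for batch jobs; autoscaler allows scale-to-zero when idle
az aks nodepool add \
  --resource-group rg-aks-prod --cluster-name aks-prod-govva-platform \
  --name spotpool --node-vm-size Standard_D8s_v5 \
  --priority Spot --eviction-policy Delete --spot-max-price -1 \
  --enable-cluster-autoscaler --min-count 0 --max-count 30
```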
### Cost allocation with namespaces

```bash
# Enable cost allocation in Container Insights
# Azure Portal > AKS > Cost Analysis > Namespaces
# Attribute costs to teams/projects via namespace labels:
# cost-center, team, project, environment
```

## 8. CSA-in-a-Box containerized data services integration

### Spark on Kubernetes

```yaml
# Spark Operator SparkApplication on AKS
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: daily-sales-etl
  namespace: spark-jobs
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: csainaboxacr.azurecr.io/spark/spark-py:3.5.0
  mainApplicationFile: "abfss://code@stcsainbox.dfs.core.windows.net/jobs/daily_sales_etl.py"
  sparkVersion: "3.5.0"
  driver:
    cores: 2
    memory: "4g"
    serviceAccount: spark-sa
    labels:
      azure.workload.identity/use: "true"
    nodeSelector:
      agentpool: workload
  executor:
    cores: 4
    instances: 8
    memory: "8g"
    nodeSelector:
      agentpool: highmem
    tolerations:
      - key: workload
        value: memory-intensive
        effect: NoSchedule
  sparkConf:
    spark.hadoop.fs.azure.account.auth.type: OAuth
    spark.hadoop.fs.azure.account.oauth.provider.type: org.apache.hadoop.fs.azurebfs.oauth2.WorkloadIdentityTokenProvider
    spark.sql.catalog.unity: com.databricks.sql.catalog.UnityCatalog
    spark.sql.catalog.unity.uri: "https://adb-xxxx.azuredatabricks.net"
    spark.kubernetes.driver.volumes.persistentVolumeClaim.scratch.mount.path: /scratch
    spark.kubernetes.driver.volumes.persistentVolumeClaim.scratch.options.claimName: spark-scratch
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 30
```
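Assuming the manifest above is saved as `daily-sales-etl.yaml` (a hypothetical filename), a typical submit-and-watch loop looks like this; the Spark Operator names the driver pod `<app>-driver`:

```bash
kubectl apply -f daily-sales-etl.yaml
kubectl -n spark-jobs get sparkapplication daily-sales-etl
# Follow the driver logs while the job runs
kubectl -n spark-jobs logs -f daily-sales-etl-driver
```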
### Model serving on GPU node pools

```yaml
# Triton Inference Server on AKS GPU nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-inference
  namespace: model-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: triton
  template:
    metadata:
      labels:
        app: triton
        azure.workload.identity/use: "true"
    spec:
      serviceAccountName: model-serving-sa
      containers:
        - name: triton
          image: csainaboxacr.azurecr.io/nvidia/tritonserver:24.01-py3
          args:
            - tritonserver
            - --model-repository=abfss://models@stcsainbox.dfs.core.windows.net/triton/
          ports:
            - containerPort: 8000
              name: http
            - containerPort: 8001
              name: grpc
            - containerPort: 8002
              name: metrics
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "32Gi"
            requests:
              nvidia.com/gpu: 1
              cpu: "8"
              memory: "16Gi"
      nodeSelector:
        agentpool: gpu
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
```
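A quick smoke test, via a hypothetical port-forward, using Triton's standard KServe v2 HTTP endpoints:

```bash
# Confirm pods landed on the GPU pool, then probe readiness and the model list
kubectl -n model-serving get pods -l app=triton -o wide
kubectl -n model-serving port-forward deploy/triton-inference 8000:8000 &
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready  # expect 200
curl -s -X POST http://localhost:8000/v2/repository/index                       # list loaded models
```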
### Event-driven data consumer with KEDA

```yaml
# KEDA-scaled Event Hubs consumer
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: eventhub-consumer
  namespace: data-pipelines
spec:
  scaleTargetRef:
    name: event-consumer
  pollingInterval: 15
  cooldownPeriod: 120
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: azure-eventhub
      metadata:
        connectionFromEnv: EVENTHUB_CONNECTION
        storageConnectionFromEnv: CHECKPOINT_STORAGE_CONNECTION
        consumerGroup: "$Default"
        unprocessedEventThreshold: "100"
        blobContainer: "checkpoints"
```
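KEDA materializes a ScaledObject as an HPA named `keda-hpa-<scaledobject>`, so both objects are worth checking when scaling misbehaves:

```bash
kubectl -n data-pipelines get scaledobject eventhub-consumer
kubectl -n data-pipelines get hpa keda-hpa-eventhub-consumer
# Conditions show READY/ACTIVE status and any trigger errors
kubectl -n data-pipelines describe scaledobject eventhub-consumer
```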
### dbt runner as CronJob

```yaml
# Scheduled dbt run on AKS
apiVersion: batch/v1
kind: CronJob
metadata:
  name: dbt-daily-run
  namespace: data-pipelines
spec:
  schedule: "0 2 * * *"  # 02:00 UTC daily
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            azure.workload.identity/use: "true"
        spec:
          serviceAccountName: dbt-sa
          containers:
            - name: dbt
              image: csainaboxacr.azurecr.io/dbt/dbt-databricks:1.8.0
              command: ["dbt"]
              args:
                - "run"
                - "--profiles-dir=/profiles"
                - "--project-dir=/dbt"
                - "--select"
                - "tag:daily"
              volumeMounts:
                - name: dbt-project
                  mountPath: /dbt
                - name: dbt-profiles
                  mountPath: /profiles
              resources:
                requests:
                  cpu: "500m"
                  memory: "1Gi"
                limits:
                  cpu: "2"
                  memory: "4Gi"
          volumes:
            - name: dbt-project
              configMap:
                name: dbt-project-config
            - name: dbt-profiles
              secret:
                secretName: dbt-profiles
          restartPolicy: OnFailure
```
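For an ad-hoc run outside the schedule, a Job can be created from the CronJob template (the job name here is arbitrary):

```bash
kubectl -n data-pipelines create job dbt-adhoc --from=cronjob/dbt-daily-run
kubectl -n data-pipelines logs -f job/dbt-adhoc
```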
## 9. Operational runbooks

### Common operational procedures

| Procedure | Frequency | Tool | Notes |
|---|---|---|---|
| Node pool scaling | As needed | `az aks nodepool scale` or autoscaler | Autoscaler handles most cases |
| Kubernetes upgrade | Quarterly (or per auto-upgrade channel) | `az aks upgrade` or auto-upgrade | Test in staging first |
| Node image upgrade | Weekly (auto) or monthly (manual) | Auto-upgrade channel: NodeImage | Security patches |
| Backup | Daily (Velero) | Velero CronJob | PV data + K8s resources |
| Certificate rotation | Automatic (AKS) | N/A | AKS handles cluster cert rotation |
| Log retention cleanup | Monthly | Log Analytics retention policy | Automated via policy |
| Cost review | Monthly | Azure Cost Management | Per-namespace cost allocation |
| Security scan review | Weekly | Defender for Containers dashboard | Remediate critical/high CVEs |
| Policy compliance review | Weekly | Azure Policy compliance dashboard | Address non-compliant resources |
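For the backup row, a hedged sketch of the Velero schedule (assumes Velero is already installed with an Azure Blob backup storage location; the name and namespace list are illustrative):

```bash
# Daily 03:00 UTC backup of key namespaces, retained 30 days (720h)
velero schedule create daily-cluster-backup \
  --schedule="0 3 * * *" \
  --include-namespaces production,spark-jobs,data-pipelines \
  --ttl 720h

velero backup get   # confirm scheduled backups are completing
```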
### Incident response for AKS

```bash
# Quick triage commands
# 1. Cluster health
kubectl get nodes
# componentstatuses is deprecated; fall back to the raw health endpoint
kubectl get componentstatuses 2>/dev/null || kubectl get --raw /healthz

# 2. Pod issues
kubectl get pods -A --field-selector status.phase!=Running,status.phase!=Succeeded

# 3. Events (last hour)
kubectl get events -A --sort-by='.lastTimestamp' | tail -30

# 4. Resource pressure
kubectl top nodes
kubectl describe nodes | grep -A 5 "Conditions"

# 5. Network issues
kubectl get networkpolicies -A
kubectl get svc -A --field-selector spec.type=LoadBalancer

# 6. Storage issues
kubectl get pvc -A --field-selector status.phase!=Bound
```
## 10. Migration readiness checklist

Before considering the migration complete, verify all best practices are implemented:

- Cluster design: multi-cluster or namespace isolation configured
- Node pools: system, workload, GPU, Spot, FIPS pools as needed
- Networking: private cluster, Azure CNI (Overlay or Cilium), network policies
- Security: Entra ID, Workload Identity, Key Vault, Defender, Azure Policy
- Monitoring: Container Insights, Managed Prometheus, Managed Grafana, alerts
- GitOps: Flux or ArgoCD for cluster and application configuration
- Backup: Velero scheduled backups to Azure Blob
- Cost: Reserved Instances for steady-state, Spot for batch, namespace cost allocation
- Compliance: FedRAMP/STIG/CIS policies assigned and compliant
- CSA-in-a-Box: Spark Operator, model serving, KEDA, dbt runners deployed (if applicable)
- Runbooks: operational procedures documented and tested
- DR: backup/restore tested; cross-region strategy documented
- Training: platform team trained on AKS operations

Maintainers: CSA-in-a-Box core team
Last updated: 2026-04-30
Related: Cluster Migration | Federal Migration Guide | Benchmarks