OSS Stack Monitoring Guide¶
[!NOTE] TL;DR: Deploy Prometheus + Grafana via kube-prometheus-stack to monitor the OSS data platform on AKS. Includes per-service exporters, key metrics, alerting rules, and a sample Grafana dashboard.
Table of Contents¶
- Monitoring Architecture
- Prometheus + Grafana Deployment
- Service Exporters
- Key Metrics per Service
- Alerting Rules
- Sample Grafana Dashboard
Monitoring Architecture¶
┌─────────────────────────────────────────────────┐
│                   AKS Cluster                   │
│                                                 │
│ ┌──────────┐  ┌──────────┐  ┌──────────────┐    │
│ │  Atlas   │  │  Trino   │  │  OpenSearch  │    │
│ │  :21000  │  │  :8080   │  │    :9200     │    │
│ └────┬─────┘  └────┬─────┘  └──────┬───────┘    │
│      │ /metrics    │ :9404         │ _nodes     │
│      ▼             ▼               ▼            │
│ ┌──────────────────────────────────────────┐    │
│ │           Prometheus (scrape)            │    │
│ │              + AlertManager              │    │
│ └─────────────────┬────────────────────────┘    │
│                   │                             │
│ ┌─────────────────▼────────────────────────┐    │
│ │                 Grafana                  │    │
│ │           Dashboards + Alerts            │    │
│ └──────────────────────────────────────────┘    │
└─────────────────────────────────────────────────┘
Prometheus + Grafana Deployment¶
Install the community kube-prometheus-stack Helm chart:
# Add Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install kube-prometheus-stack
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword="<secure-password>" \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false \
  --set grafana.service.type=LoadBalancer \
  --values monitoring-values.yaml
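Verify the rollout before wiring up exporters. A quick sanity check; the service name follows the chart's <release>-grafana pattern for the release name monitoring used above, so adjust it if yours differs:
# All pods in the monitoring namespace should reach Running/Ready
kubectl get pods -n monitoring
# Reach Grafana locally (the chart names the service <release>-grafana)
kubectl -n monitoring port-forward svc/monitoring-grafana 3000:80
# then open http://localhost:3000 and log in as admin / <secure-password>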
monitoring-values.yaml (overrides)¶
prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: managed-csi
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    additionalScrapeConfigs:
      - job_name: atlas
        metrics_path: /api/atlas/admin/metrics
        static_configs:
          - targets: ["csa-oss-atlas:21000"]
      # Trino's /v1/info returns JSON, not Prometheus metrics; scrape the JMX
      # exporter sidecar instead (see Service Exporters below; assumes the
      # Trino Service also exposes the sidecar port 9404)
      - job_name: trino
        metrics_path: /metrics
        static_configs:
          - targets: ["csa-oss-trino:9404"]
      # Superset's /health is a liveness probe, not a metrics endpoint; scrape
      # the statsd-exporter sidecar instead (see Service Exporters below;
      # assumes the Superset Service also exposes port 9102)
      - job_name: superset
        metrics_path: /metrics
        static_configs:
          - targets: ["csa-oss-superset:9102"]
grafana:
  persistence:
    enabled: true
    size: 10Gi
    storageClassName: managed-csi
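After installing with these values, confirm Prometheus picked up the extra jobs. The Prometheus service name below follows the chart's naming for the monitoring release; check kubectl get svc -n monitoring if yours differs:
kubectl -n monitoring port-forward svc/monitoring-kube-prometheus-prometheus 9090 &
# each job (atlas, trino, superset, ...) should report health "up"
curl -s localhost:9090/api/v1/targets | \
  jq -r '.data.activeTargets[] | "\(.labels.job): \(.health)"'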
Service Exporters¶
Apache Atlas¶
Atlas exposes admin metrics at /api/atlas/admin/metrics. Depending on the Atlas build, this endpoint may return JSON rather than the Prometheus exposition format; if yours does, front it with a JSON-to-Prometheus bridge such as the prometheus-community json_exporter. To scrape it, use a ServiceMonitor:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: atlas-metrics
  namespace: monitoring
spec:
  namespaceSelector:
    any: true  # the Atlas Service lives outside the monitoring namespace
  selector:
    matchLabels:
      app.kubernetes.io/component: atlas
  endpoints:
    - port: http
      path: /api/atlas/admin/metrics
      interval: 30s
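It's worth confirming the endpoint answers before Prometheus scrapes it. A quick check, assuming the csa-oss-atlas Service from the scrape config and Atlas's default admin/admin credentials (substitute your own):
kubectl port-forward svc/csa-oss-atlas 21000 &
curl -s -u admin:admin http://localhost:21000/api/atlas/admin/metrics | head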
Trino¶
Trino exposes its runtime metrics over JMX. Run the Prometheus JMX exporter as a sidecar on the coordinator (and workers) to translate them:
# Add to the Trino coordinator pod spec
- name: jmx-exporter
  image: bitnami/jmx-exporter:0.20.0
  # standalone mode takes <listen-port> <config-file> as arguments
  args: ["9404", "/etc/jmx-exporter/config.yaml"]
  ports:
    - containerPort: 9404
  env:
    - name: SERVICE_PORT
      value: "9404"
  volumeMounts:
    - name: jmx-config
      mountPath: /etc/jmx-exporter/config.yaml
      subPath: config.yaml
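Two prerequisites here are assumptions that must match your Trino deployment: JMX must be enabled on the Trino JVM (for example -Dcom.sun.management.jmxremote.port=9081 in jvm.config), and the jmx-config ConfigMap must tell the exporter where to connect and which MBeans to translate. A minimal sketch, using Trino's QueryManager MBean and the hypothetical JMX port 9081:
apiVersion: v1
kind: ConfigMap
metadata:
  name: jmx-config
data:
  config.yaml: |
    # port the Trino JVM exposes JMX on (must match jvm.config)
    hostPort: localhost:9081
    rules:
      # translate the QueryManager MBean into the metric names used below
      - pattern: 'trino.execution<name=QueryManager><>RunningQueries'
        name: trino_running_queries
        type: GAUGE
      - pattern: 'trino.execution<name=QueryManager><>QueuedQueries'
        name: trino_queued_queries
        type: GAUGE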
Apache NiFi¶
NiFi publishes Prometheus metrics through its PrometheusReportingTask: enable it in the NiFi UI under Controller Settings → Reporting Tasks, where it serves /metrics on port 9092 by default. Two sketches follow: a Service exposing that port, then the ServiceMonitor that scrapes it.
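First, a hypothetical Service giving the ServiceMonitor's port: metrics a named port to match; the name, labels, and selector here are assumptions to align with your NiFi manifests:
apiVersion: v1
kind: Service
metadata:
  name: csa-oss-nifi-metrics
  labels:
    app.kubernetes.io/component: nifi
spec:
  selector:
    app.kubernetes.io/component: nifi
  ports:
    - name: metrics
      port: 9092       # PrometheusReportingTask default port
      targetPort: 9092
Then the ServiceMonitor: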
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nifi-metrics
  namespace: monitoring
spec:
  namespaceSelector:
    any: true  # the NiFi Service lives outside the monitoring namespace
  selector:
    matchLabels:
      app.kubernetes.io/component: nifi
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s
Apache Superset¶
Superset emits metrics over StatsD. Deploy a StatsD-to-Prometheus exporter as a sidecar:
# statsd-exporter sidecar for Superset
- name: statsd-exporter
  image: prom/statsd-exporter:v0.26.0
  ports:
    - containerPort: 9102
      name: metrics
    - containerPort: 9125
      protocol: UDP
      name: statsd
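The sidecar only translates what Superset sends, and Superset does not emit StatsD by default. Enable its bundled StatsdStatsLogger in superset_config.py, pointed at the sidecar's UDP port; a sketch, where how you inject the file (ConfigMap, your chart's configOverrides, etc.) depends on your deployment:
# Append to superset_config.py
cat >> superset_config.py <<'EOF'
from superset.stats_logger import StatsdStatsLogger

# localhost:9125 is the statsd-exporter sidecar's UDP port from the snippet above
STATS_LOGGER = StatsdStatsLogger(host="localhost", port=9125, prefix="superset")
EOF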
OpenSearch¶
OpenSearch's native stats APIs (/_nodes/stats, /_cluster/health) return JSON, which Prometheus cannot scrape directly. The ServiceMonitor below instead targets /_prometheus/metrics, which is served by the community prometheus-exporter plugin (installation sketched after the manifest):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: opensearch-metrics
  namespace: monitoring
spec:
  namespaceSelector:
    any: true  # the OpenSearch Service lives outside the monitoring namespace
  selector:
    matchLabels:
      app.kubernetes.io/component: opensearch
  endpoints:
    - port: http
      path: /_prometheus/metrics
      interval: 30s
      scheme: https
      tlsConfig:
        insecureSkipVerify: true
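The plugin is not bundled with OpenSearch: install it on every node, with a release that exactly matches your OpenSearch version. The URL below is a template; pick a real <version> from the project's releases page:
bin/opensearch-plugin install \
  "https://github.com/Aiven-Open/prometheus-exporter-plugin-for-opensearch/releases/download/<version>/prometheus-exporter-<version>.zip"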
Key Metrics per Service¶
Atlas¶
| Metric | Description | Warning | Critical |
|---|---|---|---|
| atlas_entity_count | Total entities in catalog | - | - |
| atlas_api_latency_ms | REST API response time | > 500ms | > 2000ms |
| atlas_jvm_heap_used | JVM heap utilization | > 80% | > 95% |
Trino¶
| Metric | Description | Warning | Critical |
|---|---|---|---|
| trino_running_queries | Active query count | > 50 | > 100 |
| trino_queued_queries | Queued queries | > 10 | > 50 |
| trino_blocked_queries | Blocked queries | > 5 | > 20 |
| trino_failed_queries_total | Cumulative failures | rate > 5/min | rate > 20/min |
| trino_worker_memory_used_bytes | Worker memory | > 80% | > 95% |
Superset¶
| Metric | Description | Warning | Critical |
|---|---|---|---|
| superset_query_duration_seconds | Query execution time | > 30s | > 120s |
| superset_cache_hit_ratio | Cache effectiveness | < 60% | < 30% |
| superset_active_async_queries | Async query backlog | > 20 | > 50 |
OpenSearch¶
| Metric | Description | Warning | Critical |
|---|---|---|---|
| opensearch_cluster_health_status | Cluster health | yellow | red |
| opensearch_jvm_mem_heap_used_percent | JVM heap | > 80% | > 95% |
| opensearch_fs_total_available_in_bytes | Disk space | < 20% free | < 10% free |
| opensearch_search_query_time_in_millis | Search latency | > 500ms | > 2000ms |
| opensearch_indexing_index_total | Indexing rate | - | drop > 50% |
Airflow¶
| Metric | Description | Warning | Critical |
|---|---|---|---|
| airflow_scheduler_heartbeat | Scheduler alive | 2 missed heartbeats | 5 missed heartbeats |
| airflow_dag_processing_total_parse_time | DAG parse time | > 30s | > 120s |
| airflow_ti_failures | Task instance failures | > 5/hour | > 20/hour |
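Airflow emits the metrics above over StatsD, the same pattern as Superset. A sketch of the pod-spec environment routing them through a statsd-exporter; the env var names are Airflow 2.x's [metrics] settings, while the exporter address is an assumption for this cluster:
# Add to the Airflow scheduler/webserver/worker pod specs
env:
  - name: AIRFLOW__METRICS__STATSD_ON
    value: "True"
  - name: AIRFLOW__METRICS__STATSD_HOST
    value: "statsd-exporter.monitoring.svc"  # assumed exporter address
  - name: AIRFLOW__METRICS__STATSD_PORT
    value: "9125"
  - name: AIRFLOW__METRICS__STATSD_PREFIX
    value: "airflow"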
Alerting Rules¶
# prometheus-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: csa-oss-alerts
  namespace: monitoring
  labels:
    release: monitoring  # match the chart's default rule selector for this release
spec:
  groups:
    - name: csa-oss-critical
      rules:
        - alert: OpenSearchClusterRed
          expr: opensearch_cluster_health_status{color="red"} == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "OpenSearch cluster is RED"
            description: "Cluster {{ $labels.cluster }} has been in RED state for 5 minutes."
        - alert: TrinoWorkerDown
          # assumes workers are scraped with a component="worker" label
          # (e.g. via a PodMonitor); adjust to your scrape setup
          expr: up{job="trino", component="worker"} == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Trino worker is down"
        - alert: HighDiskUsage
          expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 90
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "Disk usage above 90% on {{ $labels.instance }}"
        - alert: HighMemoryUsage
          # the "> 0" guard skips containers with no memory limit set
          expr: container_memory_working_set_bytes{container!=""} / (container_spec_memory_limit_bytes{container!=""} > 0) > 0.95
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Container {{ $labels.container }} memory usage above 95%"
        - alert: AirflowSchedulerDown
          expr: absent(airflow_scheduler_heartbeat)
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Airflow scheduler heartbeat missing"
        - alert: TrinoQueryLatencyHigh
          expr: histogram_quantile(0.95, rate(trino_query_execution_time_seconds_bucket[5m])) > 120
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Trino p95 query latency above 2 minutes"
Sample Grafana Dashboard¶
Import the following JSON as a Grafana dashboard. It provides overview panels for all OSS services:
{
  "dashboard": {
    "title": "CSA OSS Stack Overview",
    "uid": "csa-oss-overview",
    "tags": ["csa", "oss"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Service Health",
        "type": "stat",
        "gridPos": { "h": 4, "w": 24, "x": 0, "y": 0 },
        "targets": [
          {
            "expr": "up{job=~\"atlas|trino|superset|opensearch|airflow\"}",
            "legendFormat": "{{job}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "mappings": [
              {
                "type": "value",
                "options": {
                  "0": { "text": "DOWN", "color": "red" },
                  "1": { "text": "UP", "color": "green" }
                }
              }
            ]
          }
        }
      },
      {
        "title": "Trino — Active Queries",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 4 },
        "targets": [
          { "expr": "trino_running_queries", "legendFormat": "Running" },
          { "expr": "trino_queued_queries", "legendFormat": "Queued" }
        ]
      },
      {
        "title": "OpenSearch — Cluster Health",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 12, "y": 4 },
        "targets": [
          {
            "expr": "opensearch_cluster_health_active_shards",
            "legendFormat": "Active Shards"
          },
          {
            "expr": "opensearch_cluster_health_relocating_shards",
            "legendFormat": "Relocating"
          }
        ]
      },
      {
        "title": "Memory Usage by Service",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 12 },
        "targets": [
          {
            "expr": "container_memory_working_set_bytes{container=~\"atlas|trino|superset|opensearch|airflow\"} / 1024 / 1024 / 1024",
            "legendFormat": "{{container}} (GB)"
          }
        ]
      },
      {
        "title": "Disk Usage (PVCs)",
        "type": "bargauge",
        "gridPos": { "h": 8, "w": 12, "x": 12, "y": 12 },
        "targets": [
          {
            "expr": "(1 - kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) * 100",
            "legendFormat": "{{persistentvolumeclaim}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 70, "color": "yellow" },
                { "value": 90, "color": "red" }
              ]
            },
            "unit": "percent"
          }
        }
      }
    ],
    "refresh": "30s",
    "time": { "from": "now-6h", "to": "now" }
  }
}
To import: Grafana UI → Dashboards → Import → paste JSON → Load.
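Alternatively, push it through Grafana's HTTP API; the JSON above is already wrapped in the {"dashboard": ...} envelope the endpoint expects, and the host and credentials below are placeholders for your deployment:
# POST the dashboard; add "overwrite": true beside "dashboard" to replace an existing copy
curl -s -X POST "http://<grafana-host>/api/dashboards/db" \
  -u "admin:<secure-password>" \
  -H "Content-Type: application/json" \
  -d @dashboard.json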