OSS Stack Monitoring Guide¶
[!NOTE] TL;DR: Deploy Prometheus + Grafana via kube-prometheus-stack to monitor the OSS data platform on AKS. Includes per-service exporters, key metrics, alerting rules, and a sample Grafana dashboard.
Table of Contents¶
- Monitoring Architecture
- Prometheus + Grafana Deployment
- Service Exporters
- Key Metrics per Service
- Alerting Rules
- Sample Grafana Dashboard
Monitoring Architecture¶
┌─────────────────────────────────────────────────┐
│                   AKS Cluster                   │
│                                                 │
│ ┌──────────┐  ┌──────────┐  ┌──────────────┐    │
│ │  Atlas   │  │  Trino   │  │  OpenSearch  │    │
│ │  :21000  │  │  :8080   │  │    :9200     │    │
│ └────┬─────┘  └────┬─────┘  └──────┬───────┘    │
│      │ /metrics    │ :9404         │ _nodes     │
│      ▼             ▼               ▼            │
│ ┌──────────────────────────────────────────┐    │
│ │           Prometheus (scrape)            │    │
│ │              + AlertManager              │    │
│ └─────────────────┬────────────────────────┘    │
│                   │                             │
│ ┌─────────────────▼────────────────────────┐    │
│ │                 Grafana                  │    │
│ │           Dashboards + Alerts            │    │
│ └──────────────────────────────────────────┘    │
└─────────────────────────────────────────────────┘
Prometheus + Grafana Deployment¶
Install the community kube-prometheus-stack Helm chart:
# Add Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install kube-prometheus-stack
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword="<secure-password>" \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false \
  --set grafana.service.type=LoadBalancer \
  --values monitoring-values.yaml
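Verify the rollout before wiring up exporters. A quick sanity check; the service name follows the chart's <release>-grafana pattern for the release name monitoring used above, so adjust it if yours differs:
# All pods in the monitoring namespace should reach Running/Ready
kubectl get pods -n monitoring
# Reach Grafana locally (the chart names the service <release>-grafana)
kubectl -n monitoring port-forward svc/monitoring-grafana 3000:80
# then open http://localhost:3000 and log in as admin / <secure-password>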
monitoring-values.yaml (overrides)¶
prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: managed-csi
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    additionalScrapeConfigs:
      - job_name: atlas
        metrics_path: /api/atlas/admin/metrics
        static_configs:
          - targets: ["csa-oss-atlas:21000"]
      # Trino's /v1/info returns JSON, not Prometheus metrics; scrape the JMX
      # exporter sidecar instead (see Service Exporters below; assumes the
      # Trino Service also exposes the sidecar port 9404)
      - job_name: trino
        metrics_path: /metrics
        static_configs:
          - targets: ["csa-oss-trino:9404"]
      # Superset's /health is a liveness probe, not a metrics endpoint; scrape
      # the statsd-exporter sidecar instead (see Service Exporters below;
      # assumes the Superset Service also exposes port 9102)
      - job_name: superset
        metrics_path: /metrics
        static_configs:
          - targets: ["csa-oss-superset:9102"]
grafana:
  persistence:
    enabled: true
    size: 10Gi
    storageClassName: managed-csi
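After installing with these values, confirm Prometheus picked up the extra jobs. The Prometheus service name below follows the chart's naming for the monitoring release; check kubectl get svc -n monitoring if yours differs:
kubectl -n monitoring port-forward svc/monitoring-kube-prometheus-prometheus 9090 &
# each job (atlas, trino, superset, ...) should report health "up"
curl -s localhost:9090/api/v1/targets | \
  jq -r '.data.activeTargets[] | "\(.labels.job): \(.health)"'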
Service Exporters¶
Apache Atlas¶
Atlas exposes admin metrics at /api/atlas/admin/metrics. Depending on the Atlas build, this endpoint may return JSON rather than the Prometheus exposition format; if yours does, front it with a JSON-to-Prometheus bridge such as the prometheus-community json_exporter. To scrape it, use a ServiceMonitor:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: atlas-metrics
  namespace: monitoring
spec:
  namespaceSelector:
    any: true  # the Atlas Service lives outside the monitoring namespace
  selector:
    matchLabels:
      app.kubernetes.io/component: atlas
  endpoints:
    - port: http
      path: /api/atlas/admin/metrics
      interval: 30s
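It's worth confirming the endpoint answers before Prometheus scrapes it. A quick check, assuming the csa-oss-atlas Service from the scrape config and Atlas's default admin/admin credentials (substitute your own):
kubectl port-forward svc/csa-oss-atlas 21000 &
curl -s -u admin:admin http://localhost:21000/api/atlas/admin/metrics | head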
Trino¶
Trino exposes its runtime metrics over JMX. Run the Prometheus JMX exporter as a sidecar on the coordinator (and workers) to translate them:
# Add to the Trino coordinator pod spec
- name: jmx-exporter
  image: bitnami/jmx-exporter:0.20.0
  # standalone mode takes <listen-port> <config-file> as arguments
  args: ["9404", "/etc/jmx-exporter/config.yaml"]
  ports:
    - containerPort: 9404
  env:
    - name: SERVICE_PORT
      value: "9404"
  volumeMounts:
    - name: jmx-config
      mountPath: /etc/jmx-exporter/config.yaml
      subPath: config.yaml
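Two prerequisites here are assumptions that must match your Trino deployment: JMX must be enabled on the Trino JVM (for example -Dcom.sun.management.jmxremote.port=9081 in jvm.config), and the jmx-config ConfigMap must tell the exporter where to connect and which MBeans to translate. A minimal sketch, using Trino's QueryManager MBean and the hypothetical JMX port 9081:
apiVersion: v1
kind: ConfigMap
metadata:
  name: jmx-config
data:
  config.yaml: |
    # port the Trino JVM exposes JMX on (must match jvm.config)
    hostPort: localhost:9081
    rules:
      # translate the QueryManager MBean into the metric names used below
      - pattern: 'trino.execution<name=QueryManager><>RunningQueries'
        name: trino_running_queries
        type: GAUGE
      - pattern: 'trino.execution<name=QueryManager><>QueuedQueries'
        name: trino_queued_queries
        type: GAUGE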
Apache NiFi¶
NiFi publishes Prometheus metrics through its PrometheusReportingTask: enable it in the NiFi UI under Controller Settings → Reporting Tasks, where it serves /metrics on port 9092 by default. Two sketches follow: a Service exposing that port, then the ServiceMonitor that scrapes it.
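First, a hypothetical Service giving the ServiceMonitor's port: metrics a named port to match; the name, labels, and selector here are assumptions to align with your NiFi manifests:
apiVersion: v1
kind: Service
metadata:
  name: csa-oss-nifi-metrics
  labels:
    app.kubernetes.io/component: nifi
spec:
  selector:
    app.kubernetes.io/component: nifi
  ports:
    - name: metrics
      port: 9092       # PrometheusReportingTask default port
      targetPort: 9092
Then the ServiceMonitor: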
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nifi-metrics
  namespace: monitoring
spec:
  namespaceSelector:
    any: true  # the NiFi Service lives outside the monitoring namespace
  selector:
    matchLabels:
      app.kubernetes.io/component: nifi
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s
Apache Superset¶
Superset emits metrics over StatsD. Deploy a StatsD-to-Prometheus exporter as a sidecar:
# statsd-exporter sidecar for Superset
- name: statsd-exporter
  image: prom/statsd-exporter:v0.26.0
  ports:
    - containerPort: 9102
      name: metrics
    - containerPort: 9125
      protocol: UDP
      name: statsd
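The sidecar only translates what Superset sends, and Superset does not emit StatsD by default. Enable its bundled StatsdStatsLogger in superset_config.py, pointed at the sidecar's UDP port; a sketch, where how you inject the file (ConfigMap, your chart's configOverrides, etc.) depends on your deployment:
# Append to superset_config.py
cat >> superset_config.py <<'EOF'
from superset.stats_logger import StatsdStatsLogger

# localhost:9125 is the statsd-exporter sidecar's UDP port from the snippet above
STATS_LOGGER = StatsdStatsLogger(host="localhost", port=9125, prefix="superset")
EOF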
OpenSearch¶
OpenSearch's native stats APIs (/_nodes/stats, /_cluster/health) return JSON, which Prometheus cannot scrape directly. The ServiceMonitor below instead targets /_prometheus/metrics, which is served by the community prometheus-exporter plugin (installation sketched after the manifest):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: opensearch-metrics
  namespace: monitoring
spec:
  namespaceSelector:
    any: true  # the OpenSearch Service lives outside the monitoring namespace
  selector:
    matchLabels:
      app.kubernetes.io/component: opensearch
  endpoints:
    - port: http
      path: /_prometheus/metrics
      interval: 30s
      scheme: https
      tlsConfig:
        insecureSkipVerify: true
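The plugin is not bundled with OpenSearch: install it on every node, with a release that exactly matches your OpenSearch version. The URL below is a template; pick a real <version> from the project's releases page:
bin/opensearch-plugin install \
  "https://github.com/Aiven-Open/prometheus-exporter-plugin-for-opensearch/releases/download/<version>/prometheus-exporter-<version>.zip"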
Key Metrics per Service¶
Atlas¶
| Metric | Description | Warning | Critical |
|---|---|---|---|
| atlas_entity_count | Total entities in catalog | - | - |
| atlas_api_latency_ms | REST API response time | > 500ms | > 2000ms |
| atlas_jvm_heap_used | JVM heap utilization | > 80% | > 95% |
Trino¶
| Metric | Description | Warning | Critical |
|---|---|---|---|
| trino_running_queries | Active query count | > 50 | > 100 |
| trino_queued_queries | Queued queries | > 10 | > 50 |
| trino_blocked_queries | Blocked queries | > 5 | > 20 |
| trino_failed_queries_total | Cumulative failures | rate > 5/min | rate > 20/min |
| trino_worker_memory_used_bytes | Worker memory | > 80% | > 95% |
Superset¶
| Metric | Description | Warning | Critical |
|---|---|---|---|
| superset_query_duration_seconds | Query execution time | > 30s | > 120s |
| superset_cache_hit_ratio | Cache effectiveness | < 60% | < 30% |
| superset_active_async_queries | Async query backlog | > 20 | > 50 |
OpenSearch¶
| Metric | Description | Warning | Critical |
|---|---|---|---|
| opensearch_cluster_health_status | Cluster health | yellow | red |
| opensearch_jvm_mem_heap_used_percent | JVM heap | > 80% | > 95% |
| opensearch_fs_total_available_in_bytes | Disk space | < 20% free | < 10% free |
| opensearch_search_query_time_in_millis | Search latency | > 500ms | > 2000ms |
| opensearch_indexing_index_total | Indexing rate | - | drop > 50% |
Airflow¶
| Metric | Description | Warning | Critical |
|---|---|---|---|
| airflow_scheduler_heartbeat | Scheduler alive | 2 missed heartbeats | 5 missed heartbeats |
| airflow_dag_processing_total_parse_time | DAG parse time | > 30s | > 120s |
| airflow_ti_failures | Task instance failures | > 5/hour | > 20/hour |
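Airflow emits the metrics above over StatsD, the same pattern as Superset. A sketch of the pod-spec environment routing them through a statsd-exporter; the env var names are Airflow 2.x's [metrics] settings, while the exporter address is an assumption for this cluster:
# Add to the Airflow scheduler/webserver/worker pod specs
env:
  - name: AIRFLOW__METRICS__STATSD_ON
    value: "True"
  - name: AIRFLOW__METRICS__STATSD_HOST
    value: "statsd-exporter.monitoring.svc"  # assumed exporter address
  - name: AIRFLOW__METRICS__STATSD_PORT
    value: "9125"
  - name: AIRFLOW__METRICS__STATSD_PREFIX
    value: "airflow"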
Alerting Rules¶
# prometheus-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: csa-oss-alerts
  namespace: monitoring
  labels:
    release: monitoring  # match the chart's default rule selector for this release
spec:
  groups:
    - name: csa-oss-critical
      rules:
        - alert: OpenSearchClusterRed
          expr: opensearch_cluster_health_status{color="red"} == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "OpenSearch cluster is RED"
            description: "Cluster {{ $labels.cluster }} has been in RED state for 5 minutes."
        - alert: TrinoWorkerDown
          # assumes workers are scraped with a component="worker" label
          # (e.g. via a PodMonitor); adjust to your scrape setup
          expr: up{job="trino", component="worker"} == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Trino worker is down"
        - alert: HighDiskUsage
          expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 90
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "Disk usage above 90% on {{ $labels.instance }}"
        - alert: HighMemoryUsage
          # the "> 0" guard skips containers with no memory limit set
          expr: container_memory_working_set_bytes{container!=""} / (container_spec_memory_limit_bytes{container!=""} > 0) > 0.95
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Container {{ $labels.container }} memory usage above 95%"
        - alert: AirflowSchedulerDown
          expr: absent(airflow_scheduler_heartbeat)
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Airflow scheduler heartbeat missing"
        - alert: TrinoQueryLatencyHigh
          expr: histogram_quantile(0.95, rate(trino_query_execution_time_seconds_bucket[5m])) > 120
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Trino p95 query latency above 2 minutes"
Sample Grafana Dashboard¶
Import the following JSON as a Grafana dashboard. It provides overview panels for all OSS services:
{
  "dashboard": {
    "title": "CSA OSS Stack Overview",
    "uid": "csa-oss-overview",
    "tags": ["csa", "oss"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Service Health",
        "type": "stat",
        "gridPos": { "h": 4, "w": 24, "x": 0, "y": 0 },
        "targets": [
          {
            "expr": "up{job=~\"atlas|trino|superset|opensearch|airflow\"}",
            "legendFormat": "{{job}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "mappings": [
              {
                "type": "value",
                "options": {
                  "0": { "text": "DOWN", "color": "red" },
                  "1": { "text": "UP", "color": "green" }
                }
              }
            ]
          }
        }
      },
      {
        "title": "Trino — Active Queries",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 4 },
        "targets": [
          { "expr": "trino_running_queries", "legendFormat": "Running" },
          { "expr": "trino_queued_queries", "legendFormat": "Queued" }
        ]
      },
      {
        "title": "OpenSearch — Cluster Health",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 12, "y": 4 },
        "targets": [
          {
            "expr": "opensearch_cluster_health_active_shards",
            "legendFormat": "Active Shards"
          },
          {
            "expr": "opensearch_cluster_health_relocating_shards",
            "legendFormat": "Relocating"
          }
        ]
      },
      {
        "title": "Memory Usage by Service",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 12 },
        "targets": [
          {
            "expr": "container_memory_working_set_bytes{container=~\"atlas|trino|superset|opensearch|airflow\"} / 1024 / 1024 / 1024",
            "legendFormat": "{{container}} (GB)"
          }
        ]
      },
      {
        "title": "Disk Usage (PVCs)",
        "type": "bargauge",
        "gridPos": { "h": 8, "w": 12, "x": 12, "y": 12 },
        "targets": [
          {
            "expr": "(1 - kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) * 100",
            "legendFormat": "{{persistentvolumeclaim}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 70, "color": "yellow" },
                { "value": 90, "color": "red" }
              ]
            },
            "unit": "percent"
          }
        }
      }
    ],
    "refresh": "30s",
    "time": { "from": "now-6h", "to": "now" }
  }
}
To import: Grafana UI → Dashboards → Import → paste JSON → Load.
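Alternatively, push it through Grafana's HTTP API; the JSON above is already wrapped in the {"dashboard": ...} envelope the endpoint expects, and the host and credentials below are placeholders for your deployment:
# POST the dashboard; add "overwrite": true beside "dashboard" to replace an existing copy
curl -s -X POST "http://<grafana-host>/api/dashboards/db" \
  -u "admin:<secure-password>" \
  -H "Content-Type: application/json" \
  -d @dashboard.json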