HDInsight Monitoring¶

Home | Monitoring | HDInsight Monitoring

Comprehensive monitoring guide for Azure HDInsight clusters.

Overview¶

This guide covers monitoring for:

Spark clusters
Kafka clusters
HBase clusters
Cluster health and resource utilization
Job and application monitoring

Azure Monitor Integration¶

Enable Diagnostic Settings¶

# Enable monitoring for HDInsight cluster
az hdinsight monitor enable \
    --name spark-cluster \
    --resource-group rg-hdinsight \
    --workspace "/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.OperationalInsights/workspaces/{law}"

# Verify monitoring status
az hdinsight monitor show \
    --name spark-cluster \
    --resource-group rg-hdinsight

Log Categories¶

Category	Description	Retention
AmbariMetrics	Cluster metrics from Ambari	30 days
YarnMetrics	YARN resource manager metrics	30 days
SparkApplications	Spark job metrics	30 days
KafkaMetrics	Kafka broker metrics	30 days
HBaseMetrics	HBase region server metrics	30 days

Spark Cluster Monitoring¶

KQL Queries¶

// Spark application summary
HDInsightSparkApplications
| where TimeGenerated > ago(24h)
| summarize
    TotalApps = count(),
    Succeeded = countif(State == "FINISHED"),
    Failed = countif(State == "FAILED"),
    Running = countif(State == "RUNNING")
    by ClusterName, bin(TimeGenerated, 1h)

// Long-running Spark jobs
HDInsightSparkApplications
| where TimeGenerated > ago(24h)
| extend DurationMinutes = (EndTime - StartTime) / 1m
| where DurationMinutes > 60
| project ClusterName, ApplicationId, Name, DurationMinutes, State
| order by DurationMinutes desc

// Executor failures
HDInsightSparkLogs
| where TimeGenerated > ago(24h)
| where Message contains "executor" and Message contains "failed"
| summarize FailureCount = count() by ClusterName, bin(TimeGenerated, 1h)

Spark UI Metrics¶

# Collect Spark metrics programmatically
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Get application metrics
metrics = {
    "application_id": sc.applicationId,
    "executor_count": len(sc._jsc.sc().getExecutorMemoryStatus()),
    "default_parallelism": sc.defaultParallelism,
    "active_jobs": len(sc.statusTracker().getActiveJobIds()),
    "completed_jobs": sc.statusTracker().getJobIdsForGroup()
}

print(metrics)

Kafka Cluster Monitoring¶

Broker Metrics¶

// Kafka broker health
HDInsightKafkaMetrics
| where TimeGenerated > ago(1h)
| where MetricName in ("UnderReplicatedPartitions", "OfflinePartitionsCount", "ActiveControllerCount")
| summarize Value = avg(Value) by ClusterName, MetricName, BrokerId, bin(TimeGenerated, 5m)
| render timechart

// Message throughput
HDInsightKafkaMetrics
| where TimeGenerated > ago(24h)
| where MetricName in ("MessagesInPerSec", "BytesInPerSec", "BytesOutPerSec")
| summarize AvgValue = avg(Value) by ClusterName, MetricName, bin(TimeGenerated, 1h)
| render timechart

// Consumer lag
HDInsightKafkaMetrics
| where TimeGenerated > ago(1h)
| where MetricName == "ConsumerLag"
| summarize MaxLag = max(Value) by ConsumerGroup, Topic, Partition
| where MaxLag > 1000

Alert Thresholds¶

Metric	Warning	Critical
UnderReplicatedPartitions	> 0 for 5 min	> 0 for 15 min
OfflinePartitions	> 0	> 0
Consumer Lag	> 10000	> 100000
Disk Usage	> 70%	> 85%

HBase Monitoring¶

Region Server Metrics¶

// HBase region server health
HDInsightHBaseMetrics
| where TimeGenerated > ago(1h)
| where MetricName in ("regionServerCount", "deadRegionServers", "averageLoad")
| summarize Value = avg(Value) by ClusterName, MetricName, bin(TimeGenerated, 5m)

// Request latency
HDInsightHBaseMetrics
| where TimeGenerated > ago(24h)
| where MetricName in ("readRequestLatency_mean", "writeRequestLatency_mean")
| summarize AvgLatency = avg(Value) by MetricName, bin(TimeGenerated, 1h)
| render timechart

// Region count per server
HDInsightHBaseMetrics
| where TimeGenerated > ago(1h)
| where MetricName == "regionCount"
| summarize RegionCount = sum(Value) by RegionServer
| order by RegionCount desc

Resource Utilization¶

Cluster Health Dashboard¶

// Node health overview
HDInsightAmbariMetrics
| where TimeGenerated > ago(1h)
| where MetricName in ("cpu_user", "mem_used_percent", "disk_used_percent")
| summarize AvgValue = avg(Value) by ClusterName, NodeName, MetricName, bin(TimeGenerated, 5m)

// YARN resource utilization
HDInsightYarnMetrics
| where TimeGenerated > ago(24h)
| where MetricName in ("AllocatedVCores", "AvailableVCores", "AllocatedMB", "AvailableMB")
| summarize AvgValue = avg(Value) by ClusterName, MetricName, bin(TimeGenerated, 1h)
| render timechart

// Container failures
HDInsightYarnMetrics
| where TimeGenerated > ago(24h)
| where MetricName == "ContainersFailed"
| summarize TotalFailed = sum(Value) by ClusterName, bin(TimeGenerated, 1h)

Alerting Configuration¶

Critical Alerts¶

{
    "alerts": [
        {
            "name": "HDInsight Node Down",
            "query": "HDInsightAmbariMetrics | where MetricName == 'host_state' and Value != 'HEALTHY'",
            "threshold": 0,
            "severity": 0,
            "frequency": "PT5M"
        },
        {
            "name": "High CPU Utilization",
            "query": "HDInsightAmbariMetrics | where MetricName == 'cpu_user' and Value > 85",
            "threshold": 0,
            "severity": 2,
            "frequency": "PT5M"
        },
        {
            "name": "Kafka Under-Replicated Partitions",
            "query": "HDInsightKafkaMetrics | where MetricName == 'UnderReplicatedPartitions' and Value > 0",
            "threshold": 0,
            "severity": 1,
            "frequency": "PT5M"
        }
    ]
}

Ambari Dashboard¶

Access Ambari¶

Navigate to Azure Portal > HDInsight cluster
Click "Ambari home" under Cluster dashboards
Use cluster credentials to log in

Key Dashboards¶

Hosts: View all cluster nodes and their health
Services: Monitor individual services (HDFS, YARN, Spark, etc.)
Alerts: View active alerts and history
Configs: Review and modify service configurations

Last Updated: January 2025

HDInsight Monitoring¶

Overview¶

Azure Monitor Integration¶

Enable Diagnostic Settings¶

Log Categories¶

Spark Cluster Monitoring¶

KQL Queries¶

Spark UI Metrics¶

Kafka Cluster Monitoring¶

Broker Metrics¶

Alert Thresholds¶

HBase Monitoring¶

Region Server Metrics¶

Resource Utilization¶

Cluster Health Dashboard¶

Alerting Configuration¶

Critical Alerts¶

Ambari Dashboard¶

Access Ambari¶

Key Dashboards¶

Related Documentation¶