
⚡ Spark on HDInsight


Master Apache Spark on HDInsight. Learn in-memory processing, DataFrames, SQL, and streaming analytics.

🎯 Learning Objectives

  • Create HDInsight Spark cluster
  • Work with Spark DataFrames and SQL
  • Implement batch and streaming processing
  • Optimize Spark jobs for performance
  • Integrate with Azure services

📋 Prerequisites

  • Azure subscription
  • HDInsight basics (covered in the HDInsight Quickstart)
  • Python or Scala knowledge
  • Understanding of distributed systems

🚀 Step 1: Create Spark Cluster

# Azure CLI (the cluster also needs HTTP and SSH credentials;
# the passwords below are placeholders)
az hdinsight create \
  --name spark-cluster-01 \
  --resource-group rg-hdinsight \
  --type spark \
  --component-version Spark=3.1 \
  --cluster-tier Standard \
  --workernode-count 2 \
  --http-password "<cluster-login-password>" \
  --ssh-password "<ssh-password>" \
  --storage-account mystorageaccount

📊 Step 2: Spark DataFrames

# Create DataFrame from CSV
# (wasb:/// resolves to the cluster's default Azure Storage container)
df = spark.read.csv(
    "wasb:///data/sales.csv",
    header=True,
    inferSchema=True
)

# Show DataFrame
df.show()

# DataFrame operations
df_filtered = df.filter(df.amount > 100)
df_grouped = df.groupBy("category").sum("amount")

🔥 Step 3: Spark SQL

# Register temp view
df.createOrReplaceTempView("sales")

# SQL query
result = spark.sql("""
    SELECT
        category,
        COUNT(*) as orders,
        SUM(amount) as revenue
    FROM sales
    GROUP BY category
    ORDER BY revenue DESC
""")

result.show()

🌊 Step 4: Structured Streaming

# Read stream from Event Hubs (requires the azure-event-hubs-spark
# connector; ehConf is a dict holding the encrypted connection string
# and consumer options)
stream_df = spark.readStream \
    .format("eventhubs") \
    .options(**ehConf) \
    .load()

# Process stream and print running counts to the console
# (assumes the binary `body` column has already been parsed into
# a "category" column, e.g. with from_json)
query = stream_df \
    .groupBy("category") \
    .count() \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

# Block until the streaming query is stopped
query.awaitTermination()

⚡ Performance Optimization

  • Cache frequently used DataFrames
  • Partition data appropriately
  • Use broadcast joins for small tables
  • Configure executor memory and cores

📚 Next Steps


Last Updated: January 2025