# ⚡ Spark on HDInsight

Master Apache Spark on HDInsight. Learn in-memory processing, DataFrames, Spark SQL, and streaming analytics.
## 🎯 Learning Objectives
- Create HDInsight Spark cluster
- Work with Spark DataFrames and SQL
- Implement batch and streaming processing
- Optimize Spark jobs for performance
- Integrate with Azure services
## 📋 Prerequisites
- Azure subscription
- HDInsight experience (see the HDInsight Quickstart)
- Python or Scala knowledge
- Understanding of distributed systems
## 🚀 Step 1: Create a Spark Cluster
```bash
# Azure CLI (the worker-node flag is --workernode-count; the cluster
# login and SSH passwords are required — replace the placeholders)
az hdinsight create \
  --name spark-cluster-01 \
  --resource-group rg-hdinsight \
  --type spark \
  --component-version Spark=3.1 \
  --cluster-tier Standard \
  --workernode-count 2 \
  --storage-account mystorageaccount \
  --http-password '<cluster-login-password>' \
  --ssh-password '<ssh-password>'
```
## 📊 Step 2: Spark DataFrames
```python
# Create a DataFrame from CSV in the cluster's default storage container
df = spark.read.csv(
    "wasb:///data/sales.csv",
    header=True,
    inferSchema=True,
)

# Show the first rows
df.show()

# DataFrame operations
df_filtered = df.filter(df.amount > 100)
df_grouped = df.groupBy("category").sum("amount")
```
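To make the semantics of `filter` and `groupBy(...).sum(...)` concrete, here is a plain-Python sketch of what those two operations compute; the sample rows are made up for illustration and are not from the tutorial's dataset:

```python
from collections import defaultdict

# Hypothetical sample rows, standing in for the sales.csv data
rows = [
    {"category": "books", "amount": 250},
    {"category": "books", "amount": 80},
    {"category": "toys", "amount": 120},
]

# Equivalent of df.filter(df.amount > 100): keep only qualifying rows
filtered = [r for r in rows if r["amount"] > 100]

# Equivalent of df.groupBy("category").sum("amount") on the full data
sums = defaultdict(int)
for r in rows:
    sums[r["category"]] += r["amount"]

print(filtered)    # rows with amount > 100
print(dict(sums))  # per-category totals
```

Spark runs the same logic partitioned across the cluster's workers; the per-partition partial sums are merged in a shuffle step.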
## 🔥 Step 3: Spark SQL
```python
# Register a temporary view so the DataFrame can be queried with SQL
df.createOrReplaceTempView("sales")

# SQL query
result = spark.sql("""
    SELECT
        category,
        COUNT(*) AS orders,
        SUM(amount) AS revenue
    FROM sales
    GROUP BY category
    ORDER BY revenue DESC
""")
result.show()
```
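The query above is standard SQL, so you can reason about it locally. The following sketch runs the same aggregation with the stdlib `sqlite3` module, purely to illustrate what Spark SQL computes; the rows are made up:

```python
import sqlite3

# In-memory table standing in for the "sales" temp view
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("books", 250), ("books", 80), ("toys", 120)],
)

# Same query shape as the spark.sql(...) call above
rows = conn.execute("""
    SELECT category, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM sales
    GROUP BY category
    ORDER BY revenue DESC
""").fetchall()

print(rows)  # (category, order count, revenue) tuples, highest revenue first
```

The difference is scale: Spark parses the identical SQL into a distributed plan and executes the `GROUP BY` across all worker nodes.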
## 🌊 Step 4: Structured Streaming
```python
# Read a stream from Azure Event Hubs (requires the Event Hubs Spark
# connector package on the cluster; ehConf holds the connection settings)
stream_df = spark.readStream \
    .format("eventhubs") \
    .options(**ehConf) \
    .load()

# Process the stream: maintain a running count per category and print
# the full result table to the console after every micro-batch
query = stream_df \
    .groupBy("category") \
    .count() \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()
```
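Structured Streaming processes data in micro-batches, and `outputMode("complete")` re-emits the entire updated result table after each one. This toy plain-Python simulation shows that behavior; the batch contents are invented for the sketch:

```python
from collections import Counter

# Hypothetical micro-batches of category events arriving over time
batches = [
    ["books", "toys"],
    ["books"],
    ["toys", "toys"],
]

counts = Counter()
outputs = []
for batch in batches:
    counts.update(batch)          # incremental state, like groupBy().count()
    outputs.append(dict(counts))  # "complete" mode emits the whole table

for snapshot in outputs:
    print(snapshot)
```

In real Spark the `counts` state is fault-tolerant and checkpointed, and each snapshot corresponds to one console batch printed by the query above.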
## ⚡ Performance Optimization
- Cache frequently used DataFrames
- Partition data appropriately
- Use broadcast joins for small tables
- Configure executor memory and cores
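The broadcast-join tip deserves a closer look. When one side of a join is small, shipping it whole to every worker as a lookup table avoids shuffling the large side. A minimal plain-Python sketch of that idea, with made-up data:

```python
# Small dimension table: fits in memory, so it can be "broadcast"
# to every worker as an ordinary hash map
small = {"books": 0.10, "toys": 0.20}  # category -> discount rate

# Large fact table: stays partitioned; each row is joined locally
big = [("books", 100), ("toys", 50), ("books", 30)]

# Each worker probes the broadcast map — no shuffle of `big` needed
joined = [(cat, amt, small[cat]) for cat, amt in big if cat in small]
print(joined)
```

In PySpark the equivalent hint is `df_big.join(broadcast(df_small), "category")`; Spark also broadcasts automatically below the `spark.sql.autoBroadcastJoinThreshold` size.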
## 📚 Next Steps
Last Updated: January 2025