🐘 Azure HDInsight Quickstart
Get started with Azure HDInsight. Learn to create Hadoop clusters and run big data workloads on Azure.
🎯 Learning Objectives
After completing this quickstart, you will be able to:
- Understand what Azure HDInsight is and its capabilities
- Create an HDInsight cluster with Hadoop
- Upload data to cluster storage
- Run MapReduce and Hive jobs
- Query data with HiveQL
- Monitor cluster performance
📋 Prerequisites
- Azure subscription - Create free account
- Azure Storage account - Create one
- Basic SQL knowledge - Understanding of SELECT, WHERE, JOIN
- SSH client (optional) - For cluster access
🔍 What is Azure HDInsight?
Azure HDInsight is a fully managed, cloud-based service for open-source analytics frameworks:
- Hadoop - Batch processing with MapReduce
- Spark - Fast in-memory processing
- HBase - NoSQL database
- Kafka - Event streaming
- Interactive Query - Interactive Hive (LLAP)
Key Features

- ✅ Fully managed Hadoop clusters
- ✅ Enterprise-grade security
- ✅ Integration with Azure services
- ✅ Cost-effective with auto-scaling
- ✅ Support for multiple frameworks
When to Use HDInsight
✅ Good For:
- Migrating on-premises Hadoop workloads
- Batch ETL processing
- Log and event analytics
- Data warehousing
- Machine learning at scale
❌ Consider Alternatives For:
- Real-time analytics (use Databricks or Synapse)
- Small datasets (use Synapse Serverless)
- Managed notebooks (use Databricks)
🚀 Step 1: Create HDInsight Cluster

Using Azure Portal

1. Navigate to the Azure Portal
   - Go to portal.azure.com
   - Click "Create a resource"
   - Search for "HDInsight"
   - Click "Create"
2. Configure Basics
   - Subscription: select your subscription
   - Resource Group: create "rg-hdinsight-quickstart"
   - Cluster Name: "hdinsight-quickstart-[yourname]"
   - Region: select the nearest region
   - Cluster Type: Hadoop
   - Version: latest available (e.g., Hadoop 3.1.1)
   - Tier: Standard
3. Configure Security + Networking
   - Cluster Login Username: admin
   - Cluster Login Password: create a strong password
   - SSH Username: sshuser
   - SSH Password: the same or a different password
4. Configure Storage
   - Primary Storage Type: Azure Storage or ADLS Gen2
   - Storage Account: choose an existing account or create a new one
   - Container (Azure Storage) or Filesystem (ADLS Gen2): create a new one named "hdinsight"
5. Configure Scale
   - Head nodes: 2 (default)
   - Worker nodes: 2 (minimum for this quickstart)
   - Node Size: D13 v2 (or smaller for cost savings)
6. Review and Create
   - Click "Review + create", then "Create"
   - Wait 15-20 minutes for deployment
📂 Step 2: Upload Sample Data

Create Sample Data

Create `sales.csv`:

```csv
order_id,product,category,amount,order_date
1001,Laptop,Electronics,1299.99,2024-01-15
1002,Chair,Furniture,249.99,2024-01-15
1003,Monitor,Electronics,399.99,2024-01-16
1004,Desk,Furniture,549.99,2024-01-16
1005,Keyboard,Electronics,89.99,2024-01-17
```
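Before touching the cluster, you can sanity-check the sample file locally with a few lines of plain Python (the CSV content is embedded here so the snippet is self-contained):

```python
import csv
import io

# The sample file from above, embedded for a self-contained check.
SALES_CSV = """order_id,product,category,amount,order_date
1001,Laptop,Electronics,1299.99,2024-01-15
1002,Chair,Furniture,249.99,2024-01-15
1003,Monitor,Electronics,399.99,2024-01-16
1004,Desk,Furniture,549.99,2024-01-16
1005,Keyboard,Electronics,89.99,2024-01-17
"""

# DictReader consumes the header row, much like Hive's
# skip.header.line.count table property will later.
rows = list(csv.DictReader(io.StringIO(SALES_CSV)))
print(len(rows), rows[0]["product"])  # 5 Laptop
```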
Upload to Cluster Storage

Using Azure Storage Explorer or the Azure Portal:

1. Navigate to the storage account
2. Go to the "hdinsight" container
3. Create a folder named "data"
4. Upload sales.csv to data/sales.csv
Using Azure CLI

```bash
# Set variables
STORAGE_ACCOUNT="your-storage-account"
CONTAINER="hdinsight"

# Upload file
az storage blob upload \
  --account-name $STORAGE_ACCOUNT \
  --container-name $CONTAINER \
  --name data/sales.csv \
  --file sales.csv \
  --auth-mode login
```
🔍 Step 3: Create Hive Table

Access Hive View
- Navigate to HDInsight cluster in Azure Portal
- Click "Cluster dashboards" → "Ambari home"
- Login with cluster credentials
- Click Hive View icon (9 squares grid)
Create External Table

```sql
-- Create database
CREATE DATABASE IF NOT EXISTS sales_db;
USE sales_db;

-- Create external table over the CSV files in the data/ folder
CREATE EXTERNAL TABLE sales (
    order_id INT,
    product STRING,
    category STRING,
    amount DECIMAL(10,2),
    order_date DATE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'wasb://hdinsight@your-storage-account.blob.core.windows.net/data/'
TBLPROPERTIES ("skip.header.line.count"="1");

-- Verify data loaded
SELECT * FROM sales LIMIT 10;
```
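The CREATE TABLE statement maps each CSV column onto a Hive type. A quick local sketch of that mapping in Python, where `Decimal` and `date` stand in for Hive's DECIMAL(10,2) and DATE:

```python
from decimal import Decimal
from datetime import date

# One raw CSV record and the Hive column types declared for the external table.
raw = ["1001", "Laptop", "Electronics", "1299.99", "2024-01-15"]

record = {
    "order_id": int(raw[0]),                   # INT
    "product": raw[1],                         # STRING
    "category": raw[2],                        # STRING
    "amount": Decimal(raw[3]),                 # DECIMAL(10,2)
    "order_date": date.fromisoformat(raw[4]),  # DATE
}
print(record["amount"] * 2)  # exact decimal arithmetic, no float rounding
```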
📊 Step 4: Query Data with HiveQL

Basic Queries

```sql
-- Sales by category
SELECT
    category,
    COUNT(*) AS order_count,
    SUM(amount) AS total_sales,
    AVG(amount) AS avg_order_value
FROM sales
GROUP BY category
ORDER BY total_sales DESC;

-- Top products
SELECT
    product,
    SUM(amount) AS revenue
FROM sales
GROUP BY product
ORDER BY revenue DESC
LIMIT 5;
```
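You can check what the category query should return by replaying the aggregation locally in plain Python, no cluster needed (`Decimal` mirrors Hive's DECIMAL arithmetic):

```python
from decimal import Decimal
from collections import defaultdict

# (category, amount) pairs from sales.csv
orders = [
    ("Electronics", Decimal("1299.99")),
    ("Furniture", Decimal("249.99")),
    ("Electronics", Decimal("399.99")),
    ("Furniture", Decimal("549.99")),
    ("Electronics", Decimal("89.99")),
]

# GROUP BY category: accumulate totals and counts per category
totals = defaultdict(Decimal)
counts = defaultdict(int)
for category, amount in orders:
    totals[category] += amount
    counts[category] += 1

# ORDER BY total_sales DESC
by_sales = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(by_sales)  # Electronics first
```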
Advanced Analysis

```sql
-- Daily sales with running total
SELECT
    order_date,
    SUM(amount) AS daily_sales,
    SUM(SUM(amount)) OVER (
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total
FROM sales
GROUP BY order_date
ORDER BY order_date;
```
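The running-total window can likewise be verified locally; Python's `itertools.accumulate` plays the role of the UNBOUNDED PRECEDING window frame:

```python
from decimal import Decimal
from itertools import accumulate

# Daily sums from sales.csv (the GROUP BY order_date step)
daily_sales = {
    "2024-01-15": Decimal("1299.99") + Decimal("249.99"),
    "2024-01-16": Decimal("399.99") + Decimal("549.99"),
    "2024-01-17": Decimal("89.99"),
}

# ORDER BY order_date, then a cumulative sum over the ordered rows
dates = sorted(daily_sales)
running = list(accumulate(daily_sales[d] for d in dates))
for d, total in zip(dates, running):
    print(d, daily_sales[d], total)
```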
💻 Step 5: Run MapReduce Job (Optional)

Word Count Example

```bash
# SSH into the cluster
ssh sshuser@your-cluster-name-ssh.azurehdinsight.net

# Run word count on the sample data shipped with the cluster
hadoop jar \
  /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar \
  wordcount \
  wasb:///example/data/gutenberg/davinci.txt \
  wasb:///example/data/WordCountOutput

# View results
hdfs dfs -cat /example/data/WordCountOutput/part-r-00000
```
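The wordcount job follows the classic map/shuffle/reduce pattern: map each line to (word, 1) pairs, then sum the counts per word. A toy local sketch of the same logic in Python (the sample lines are made up for illustration):

```python
from collections import Counter
from itertools import chain

lines = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog",
]

# Map phase: each line emits (word, 1) pairs.
mapped = chain.from_iterable(((w, 1) for w in line.split()) for line in lines)

# Shuffle + reduce phase: group by word and sum the ones.
counts = Counter()
for word, one in mapped:
    counts[word] += one

print(counts.most_common(1))  # [('the', 3)]
```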
🎯 Step 6: Create Managed Table

```sql
-- Create a managed ORC copy of the data for better performance
CREATE TABLE sales_managed
STORED AS ORC
AS
SELECT * FROM sales;

-- Query managed table (faster)
SELECT * FROM sales_managed;
```
📈 Step 7: Monitor Cluster

Ambari Dashboard
- Navigate to cluster → Ambari home
- View dashboard metrics:
- CPU usage
- Memory usage
- Disk I/O
- YARN applications
YARN Resource Manager
- Click "YARN" in left menu
- Click "Quick Links" → "Resource Manager UI"
- View running applications
- Check job history
⚡ Performance Optimization

Optimize Table Format

```sql
-- Use ORC format for better performance
CREATE TABLE sales_orc
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY")
AS SELECT * FROM sales;
```
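To see why a compressed columnar layout pays off: values within a single column repeat heavily, and repetitive data compresses extremely well. A toy Python illustration (`zlib` stands in for SNAPPY, which is not in the standard library):

```python
import zlib

# Simulate one column of a large table: the same category value repeated.
# Columnar formats like ORC store such runs together, so they compress well.
column = ("Electronics\n" * 10_000).encode()
packed = zlib.compress(column)
print(len(column), len(packed))  # compressed size is a tiny fraction of raw
```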
Partitioning

```sql
-- Partition by date for better query performance
CREATE TABLE sales_partitioned (
    order_id INT,
    product STRING,
    category STRING,
    amount DECIMAL(10,2)
)
PARTITIONED BY (order_date DATE)
STORED AS ORC;

-- Allow dynamic partition inserts (strict mode would reject the statement below)
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Insert data; the partition column must come last in the SELECT list
INSERT INTO sales_partitioned PARTITION (order_date)
SELECT order_id, product, category, amount, order_date
FROM sales;
```
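Partitioning physically groups rows by the partition column, so a filter on `order_date` reads only the matching files ("partition pruning") instead of scanning the whole table. A toy in-memory analogy in Python:

```python
from collections import defaultdict

rows = [
    (1001, "Laptop", "2024-01-15"),
    (1002, "Chair", "2024-01-15"),
    (1003, "Monitor", "2024-01-16"),
    (1004, "Desk", "2024-01-16"),
    (1005, "Keyboard", "2024-01-17"),
]

# "Write path": group rows into one bucket per order_date, like one
# storage directory per partition.
partitions = defaultdict(list)
for order_id, product, order_date in rows:
    partitions[order_date].append((order_id, product))

# "Read path": a filter on the partition column touches only one bucket.
hit = partitions["2024-01-16"]
print(hit)
```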
Enable Compression

```sql
-- Compress job output for better storage
SET hive.exec.compress.output=true;
-- Modern property name (the older mapred.output.compression.codec is deprecated)
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
```
🔧 Troubleshooting

Common Issues
Cannot Access Data
- ✅ Verify storage account connection
- ✅ Check firewall rules
- ✅ Ensure correct path format (wasb:// or abfs://)
Query Fails with "Out of Memory"
- ✅ Increase node size
- ✅ Add more worker nodes
- ✅ Optimize query (use filters)
- ✅ Use partitioning
Cluster Creation Fails
- ✅ Check subscription quotas
- ✅ Verify VM availability in region
- ✅ Ensure storage account accessible
Slow Performance
- ✅ Use ORC format
- ✅ Partition tables
- ✅ Increase cluster size
- ✅ Enable vectorization
🎓 Next Steps

Beginner Practice
- Load your own data
- Create multiple tables
- Join tables in queries
- Export results to storage
Intermediate Topics
- HDInsight Spark
- HDInsight HBase
- Schedule jobs with Azure Data Factory
- Implement security with ESP
Advanced Topics
- Hadoop Migration
- Kafka Streaming
- Custom scripts and actions
- Multi-cluster architectures
🧹 Cleanup

To avoid charges, delete the resources when you're finished: in the Azure Portal, open the "rg-hdinsight-quickstart" resource group and delete it, or run `az group delete --name rg-hdinsight-quickstart --yes --no-wait` from the CLI.

💰 Cost Tip: HDInsight clusters incur charges the entire time they're running, even when idle. Delete them when not in use!
🎉 Congratulations!
You've successfully:
- ✅ Created an HDInsight Hadoop cluster
- ✅ Uploaded and queried data
- ✅ Used Hive for SQL-like analysis
- ✅ Optimized tables for performance
- ✅ Monitored cluster resources
Ready for enterprise big data processing!
Last Updated: January 2025 · Tutorial Version: 1.0