🐘 Azure HDInsight Quickstart
Get started with Azure HDInsight. Learn to create Hadoop clusters and run big data workloads on Azure.
🎯 Learning Objectives
After completing this quickstart, you will be able to:
- Understand what Azure HDInsight is and its capabilities
- Create an HDInsight cluster with Hadoop
- Upload data to cluster storage
- Run MapReduce and Hive jobs
- Query data with HiveQL
- Monitor cluster performance
📋 Prerequisites
- Azure subscription - Create free account
- Azure Storage account - Create one
- Basic SQL knowledge - Understanding of SELECT, WHERE, JOIN
- SSH client (optional) - For cluster access
🔍 What is Azure HDInsight?
Azure HDInsight is a fully managed, cloud-based service for open-source analytics frameworks:
- Hadoop - Batch processing with MapReduce
- Spark - Fast in-memory processing
- HBase - NoSQL database
- Kafka - Event streaming
- Interactive Query - Interactive Hive (LLAP)
Key Features

- ✅ Fully managed Hadoop clusters
- ✅ Enterprise-grade security
- ✅ Integration with Azure services
- ✅ Cost-effective with auto-scaling
- ✅ Support for multiple frameworks
When to Use HDInsight
✅ Good For:
- Migrating on-premises Hadoop workloads
- Batch ETL processing
- Log and event analytics
- Data warehousing
- Machine learning at scale
❌ Consider Alternatives For:
- Real-time analytics (use Databricks or Synapse)
- Small datasets (use Synapse Serverless)
- Managed notebooks (use Databricks)
🚀 Step 1: Create HDInsight Cluster

Using Azure Portal

1. Navigate to the Azure Portal
   - Go to portal.azure.com
   - Click "Create a resource"
   - Search for "HDInsight"
   - Click "Create"
2. Configure Basics
   - Subscription: select your subscription
   - Resource Group: create "rg-hdinsight-quickstart"
   - Cluster Name: "hdinsight-quickstart-[yourname]"
   - Region: select the nearest region
   - Cluster Type: Hadoop
   - Version: latest available (e.g., Hadoop 3.1.1)
   - Tier: Standard
3. Configure Security + Networking
   - Cluster Login Username: admin
   - Cluster Login Password: create a strong password
   - SSH Username: sshuser
   - SSH Password: the same or a different password
4. Configure Storage
   - Primary Storage Type: Azure Storage or ADLS Gen2
   - Storage Account: choose an existing account or create a new one
   - Container (Azure Storage) or Filesystem (ADLS Gen2): create a new one named "hdinsight"
5. Configure Scale
   - Head nodes: 2 (default)
   - Worker nodes: 2 (minimum for this quickstart)
   - Node Size: D13 v2 (or smaller for cost savings)
6. Review and Create
   - Click "Review + create", then "Create"
   - Wait 15-20 minutes for deployment
📂 Step 2: Upload Sample Data

Create Sample Data

Create `sales.csv`:

```csv
order_id,product,category,amount,order_date
1001,Laptop,Electronics,1299.99,2024-01-15
1002,Chair,Furniture,249.99,2024-01-15
1003,Monitor,Electronics,399.99,2024-01-16
1004,Desk,Furniture,549.99,2024-01-16
1005,Keyboard,Electronics,89.99,2024-01-17
```
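Before touching the cluster, you can sanity-check the sample file locally with a few lines of plain Python (the CSV content is embedded here so the snippet is self-contained):

```python
import csv
import io

# The sample file from above, embedded for a self-contained check.
SALES_CSV = """order_id,product,category,amount,order_date
1001,Laptop,Electronics,1299.99,2024-01-15
1002,Chair,Furniture,249.99,2024-01-15
1003,Monitor,Electronics,399.99,2024-01-16
1004,Desk,Furniture,549.99,2024-01-16
1005,Keyboard,Electronics,89.99,2024-01-17
"""

# DictReader consumes the header row, much like Hive's
# skip.header.line.count table property will later.
rows = list(csv.DictReader(io.StringIO(SALES_CSV)))
print(len(rows), rows[0]["product"])  # 5 Laptop
```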
Upload to Cluster Storage

Using Azure Storage Explorer or the Azure Portal:

1. Navigate to the storage account
2. Go to the "hdinsight" container
3. Create a folder named "data"
4. Upload sales.csv to data/sales.csv
Using Azure CLI

```bash
# Set variables
STORAGE_ACCOUNT="your-storage-account"
CONTAINER="hdinsight"

# Upload file
az storage blob upload \
  --account-name $STORAGE_ACCOUNT \
  --container-name $CONTAINER \
  --name data/sales.csv \
  --file sales.csv \
  --auth-mode login
```
🔍 Step 3: Create Hive Table

Access Hive View
- Navigate to HDInsight cluster in Azure Portal
- Click "Cluster dashboards" → "Ambari home"
- Login with cluster credentials
- Click Hive View icon (9 squares grid)
Create External Table

```sql
-- Create database
CREATE DATABASE IF NOT EXISTS sales_db;
USE sales_db;

-- Create external table over the CSV files in the data/ folder
CREATE EXTERNAL TABLE sales (
    order_id INT,
    product STRING,
    category STRING,
    amount DECIMAL(10,2),
    order_date DATE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'wasb://hdinsight@your-storage-account.blob.core.windows.net/data/'
TBLPROPERTIES ("skip.header.line.count"="1");

-- Verify data loaded
SELECT * FROM sales LIMIT 10;
```
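The CREATE TABLE statement maps each CSV column onto a Hive type. A quick local sketch of that mapping in Python, where `Decimal` and `date` stand in for Hive's DECIMAL(10,2) and DATE:

```python
from decimal import Decimal
from datetime import date

# One raw CSV record and the Hive column types declared for the external table.
raw = ["1001", "Laptop", "Electronics", "1299.99", "2024-01-15"]

record = {
    "order_id": int(raw[0]),                   # INT
    "product": raw[1],                         # STRING
    "category": raw[2],                        # STRING
    "amount": Decimal(raw[3]),                 # DECIMAL(10,2)
    "order_date": date.fromisoformat(raw[4]),  # DATE
}
print(record["amount"] * 2)  # exact decimal arithmetic, no float rounding
```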
📊 Step 4: Query Data with HiveQL

Basic Queries

```sql
-- Sales by category
SELECT
    category,
    COUNT(*) AS order_count,
    SUM(amount) AS total_sales,
    AVG(amount) AS avg_order_value
FROM sales
GROUP BY category
ORDER BY total_sales DESC;

-- Top products
SELECT
    product,
    SUM(amount) AS revenue
FROM sales
GROUP BY product
ORDER BY revenue DESC
LIMIT 5;
```
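You can check what the category query should return by replaying the aggregation locally in plain Python, no cluster needed (`Decimal` mirrors Hive's DECIMAL arithmetic):

```python
from decimal import Decimal
from collections import defaultdict

# (category, amount) pairs from sales.csv
orders = [
    ("Electronics", Decimal("1299.99")),
    ("Furniture", Decimal("249.99")),
    ("Electronics", Decimal("399.99")),
    ("Furniture", Decimal("549.99")),
    ("Electronics", Decimal("89.99")),
]

# GROUP BY category: accumulate totals and counts per category
totals = defaultdict(Decimal)
counts = defaultdict(int)
for category, amount in orders:
    totals[category] += amount
    counts[category] += 1

# ORDER BY total_sales DESC
by_sales = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(by_sales)  # Electronics first
```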
Advanced Analysis

```sql
-- Daily sales with running total
SELECT
    order_date,
    SUM(amount) AS daily_sales,
    SUM(SUM(amount)) OVER (
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total
FROM sales
GROUP BY order_date
ORDER BY order_date;
```
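The running-total window can likewise be verified locally; Python's `itertools.accumulate` plays the role of the UNBOUNDED PRECEDING window frame:

```python
from decimal import Decimal
from itertools import accumulate

# Daily sums from sales.csv (the GROUP BY order_date step)
daily_sales = {
    "2024-01-15": Decimal("1299.99") + Decimal("249.99"),
    "2024-01-16": Decimal("399.99") + Decimal("549.99"),
    "2024-01-17": Decimal("89.99"),
}

# ORDER BY order_date, then a cumulative sum over the ordered rows
dates = sorted(daily_sales)
running = list(accumulate(daily_sales[d] for d in dates))
for d, total in zip(dates, running):
    print(d, daily_sales[d], total)
```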
💻 Step 5: Run MapReduce Job (Optional)

Word Count Example

```bash
# SSH into the cluster
ssh sshuser@your-cluster-name-ssh.azurehdinsight.net

# Run word count on the sample data shipped with the cluster
hadoop jar \
  /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar \
  wordcount \
  wasb:///example/data/gutenberg/davinci.txt \
  wasb:///example/data/WordCountOutput

# View results
hdfs dfs -cat /example/data/WordCountOutput/part-r-00000
```
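The wordcount job follows the classic map/shuffle/reduce pattern: map each line to (word, 1) pairs, then sum the counts per word. A toy local sketch of the same logic in Python (the sample lines are made up for illustration):

```python
from collections import Counter
from itertools import chain

lines = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog",
]

# Map phase: each line emits (word, 1) pairs.
mapped = chain.from_iterable(((w, 1) for w in line.split()) for line in lines)

# Shuffle + reduce phase: group by word and sum the ones.
counts = Counter()
for word, one in mapped:
    counts[word] += one

print(counts.most_common(1))  # [('the', 3)]
```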
🎯 Step 6: Create Managed Table

```sql
-- Create a managed ORC copy of the data for better performance
CREATE TABLE sales_managed
STORED AS ORC
AS
SELECT * FROM sales;

-- Query managed table (faster)
SELECT * FROM sales_managed;
```
📈 Step 7: Monitor Cluster

Ambari Dashboard
- Navigate to cluster → Ambari home
- View dashboard metrics:
- CPU usage
- Memory usage
- Disk I/O
- YARN applications
YARN Resource Manager
- Click "YARN" in left menu
- Click "Quick Links" → "Resource Manager UI"
- View running applications
- Check job history
⚡ Performance Optimization

Optimize Table Format

```sql
-- Use ORC format for better performance
CREATE TABLE sales_orc
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY")
AS SELECT * FROM sales;
```
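To see why a compressed columnar layout pays off: values within a single column repeat heavily, and repetitive data compresses extremely well. A toy Python illustration (`zlib` stands in for SNAPPY, which is not in the standard library):

```python
import zlib

# Simulate one column of a large table: the same category value repeated.
# Columnar formats like ORC store such runs together, so they compress well.
column = ("Electronics\n" * 10_000).encode()
packed = zlib.compress(column)
print(len(column), len(packed))  # compressed size is a tiny fraction of raw
```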
Partitioning

```sql
-- Partition by date for better query performance
CREATE TABLE sales_partitioned (
    order_id INT,
    product STRING,
    category STRING,
    amount DECIMAL(10,2)
)
PARTITIONED BY (order_date DATE)
STORED AS ORC;

-- Allow dynamic partition inserts (strict mode would reject the statement below)
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Insert data; the partition column must come last in the SELECT list
INSERT INTO sales_partitioned PARTITION (order_date)
SELECT order_id, product, category, amount, order_date
FROM sales;
```
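Partitioning physically groups rows by the partition column, so a filter on `order_date` reads only the matching files ("partition pruning") instead of scanning the whole table. A toy in-memory analogy in Python:

```python
from collections import defaultdict

rows = [
    (1001, "Laptop", "2024-01-15"),
    (1002, "Chair", "2024-01-15"),
    (1003, "Monitor", "2024-01-16"),
    (1004, "Desk", "2024-01-16"),
    (1005, "Keyboard", "2024-01-17"),
]

# "Write path": group rows into one bucket per order_date, like one
# storage directory per partition.
partitions = defaultdict(list)
for order_id, product, order_date in rows:
    partitions[order_date].append((order_id, product))

# "Read path": a filter on the partition column touches only one bucket.
hit = partitions["2024-01-16"]
print(hit)
```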
Enable Compression

```sql
-- Compress job output for better storage
SET hive.exec.compress.output=true;
-- Modern property name (the older mapred.output.compression.codec is deprecated)
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
```
🔧 Troubleshooting

Common Issues
Cannot Access Data
- ✅ Verify storage account connection
- ✅ Check firewall rules
- ✅ Ensure correct path format (wasb:// or abfs://)
Query Fails with "Out of Memory"
- ✅ Increase node size
- ✅ Add more worker nodes
- ✅ Optimize query (use filters)
- ✅ Use partitioning
Cluster Creation Fails
- ✅ Check subscription quotas
- ✅ Verify VM availability in region
- ✅ Ensure storage account accessible
Slow Performance
- ✅ Use ORC format
- ✅ Partition tables
- ✅ Increase cluster size
- ✅ Enable vectorization
🎓 Next Steps

Beginner Practice
- Load your own data
- Create multiple tables
- Join tables in queries
- Export results to storage
Intermediate Topics
- HDInsight Spark
- HDInsight HBase
- Schedule jobs with Azure Data Factory
- Implement security with ESP
Advanced Topics
- Hadoop Migration
- Kafka Streaming
- Custom scripts and actions
- Multi-cluster architectures
🧹 Cleanup

To avoid charges, delete the resources when you're finished: in the Azure Portal, open the "rg-hdinsight-quickstart" resource group and delete it, or run `az group delete --name rg-hdinsight-quickstart --yes --no-wait` from the CLI.

💰 Cost Tip: HDInsight clusters incur charges the entire time they're running, even when idle. Delete them when not in use!
🎉 Congratulations!
You've successfully:
- ✅ Created an HDInsight Hadoop cluster
- ✅ Uploaded and queried data
- ✅ Used Hive for SQL-like analysis
- ✅ Optimized tables for performance
- ✅ Monitored cluster resources
Ready for enterprise big data processing!
Last Updated: January 2025 · Tutorial Version: 1.0