🐘 Azure HDInsight Quickstart

Get started with Azure HDInsight. Learn to create Hadoop clusters and run big data workloads on Azure.

🎯 Learning Objectives

After completing this quickstart, you will be able to:

  • Understand what Azure HDInsight is and its capabilities
  • Create an HDInsight cluster with Hadoop
  • Upload data to cluster storage
  • Run MapReduce and Hive jobs
  • Query data with HiveQL
  • Monitor cluster performance

📋 Prerequisites

  • Azure subscription - Create free account
  • Azure Storage account - Create one
  • Basic SQL knowledge - Understanding of SELECT, WHERE, JOIN
  • SSH client (optional) - For cluster access

🔍 What is Azure HDInsight?

Azure HDInsight is a fully managed, cloud-based service for open-source analytics frameworks:

  • Hadoop - Batch processing with MapReduce
  • Spark - Fast in-memory processing
  • HBase - NoSQL database
  • Kafka - Event streaming
  • Interactive Query - Interactive Hive (LLAP)

Key Features

✅ Fully managed Hadoop clusters
✅ Enterprise-grade security
✅ Integration with Azure services
✅ Cost-effective with auto-scaling
✅ Support for multiple frameworks

When to Use HDInsight

Good For:

  • Migrating on-premises Hadoop workloads
  • Batch ETL processing
  • Log and event analytics
  • Data warehousing
  • Machine learning at scale

Consider Alternatives For:

  • Real-time analytics (use Databricks or Synapse)
  • Small datasets (use Synapse Serverless)
  • Managed notebooks (use Databricks)

🚀 Step 1: Create HDInsight Cluster

Using Azure Portal

  1. Navigate to Azure Portal
     • Go to portal.azure.com
     • Click "Create a resource"
     • Search for "HDInsight"
     • Click "Create"

  2. Configure Basics
     • Subscription: Select subscription
     • Resource Group: Create "rg-hdinsight-quickstart"
     • Cluster Name: "hdinsight-quickstart-[yourname]"
     • Region: Select nearest region
     • Cluster Type: Hadoop
     • Version: Latest (e.g., Hadoop 3.1.1)
     • Tier: Standard

  3. Configure Security + Networking
     • Cluster Login Username: admin
     • Cluster Login Password: Create a strong password
     • SSH Username: sshuser
     • SSH Password: Same or different password

  4. Configure Storage
     • Primary Storage Type: Azure Storage or ADLS Gen2
     • Select a Storage Account: Choose existing or create new
     • Container: Create new "hdinsight"
     • Filesystem: "hdinsight" (for ADLS Gen2)

  5. Configure Scale
     • Head nodes: 2 (default)
     • Worker nodes: 2 (minimum for this quickstart)
     • Node Size: D13 v2 (or smaller for cost savings)

  6. Review and Create
     • Click "Review + create"
     • Click "Create"
     • Wait 15-20 minutes for deployment

📂 Step 2: Upload Sample Data

Create Sample Data

Create sales.csv:

order_id,product,category,amount,order_date
1001,Laptop,Electronics,1299.99,2024-01-15
1002,Chair,Furniture,249.99,2024-01-15
1003,Monitor,Electronics,399.99,2024-01-16
1004,Desk,Furniture,549.99,2024-01-16
1005,Keyboard,Electronics,89.99,2024-01-17
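If you prefer to generate the file rather than type it, a minimal Python sketch (run locally, in the directory you will upload from) writes the same five rows:

```python
import csv

# The same five sample orders shown above.
rows = [
    (1001, "Laptop", "Electronics", "1299.99", "2024-01-15"),
    (1002, "Chair", "Furniture", "249.99", "2024-01-15"),
    (1003, "Monitor", "Electronics", "399.99", "2024-01-16"),
    (1004, "Desk", "Furniture", "549.99", "2024-01-16"),
    (1005, "Keyboard", "Electronics", "89.99", "2024-01-17"),
]

with open("sales.csv", "w", newline="") as f:
    writer = csv.writer(f)
    # Header row matches the column names used in the Hive table later.
    writer.writerow(["order_id", "product", "category", "amount", "order_date"])
    writer.writerows(rows)
```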

Upload to Cluster Storage

# Using Azure Storage Explorer or Azure Portal
# 1. Navigate to storage account
# 2. Go to "hdinsight" container
# 3. Create folder "data"
# 4. Upload sales.csv to data/sales.csv

Using Azure CLI

# Set variables
STORAGE_ACCOUNT="your-storage-account"
CONTAINER="hdinsight"

# Upload file
az storage blob upload \
  --account-name $STORAGE_ACCOUNT \
  --container-name $CONTAINER \
  --name data/sales.csv \
  --file sales.csv \
  --auth-mode login

🔍 Step 3: Create Hive Table

Access Hive View

  1. Navigate to HDInsight cluster in Azure Portal
  2. Click "Cluster dashboards" → "Ambari home"
  3. Login with cluster credentials
  4. Click Hive View icon (9 squares grid)

Create External Table

-- Create database
CREATE DATABASE IF NOT EXISTS sales_db;

USE sales_db;

-- Create external table
CREATE EXTERNAL TABLE sales (
    order_id INT,
    product STRING,
    category STRING,
    amount DECIMAL(10,2),
    order_date DATE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'wasb://hdinsight@your-storage-account.blob.core.windows.net/data/'
TBLPROPERTIES ("skip.header.line.count"="1");

-- Verify data loaded
SELECT * FROM sales LIMIT 10;

📊 Step 4: Query Data with HiveQL

Basic Queries

-- Total sales
SELECT
    SUM(amount) as total_sales,
    COUNT(*) as order_count
FROM sales;
-- Sales by category
SELECT
    category,
    COUNT(*) as order_count,
    SUM(amount) as total_sales,
    AVG(amount) as avg_order_value
FROM sales
GROUP BY category
ORDER BY total_sales DESC;
-- Top products
SELECT
    product,
    SUM(amount) as revenue
FROM sales
GROUP BY product
ORDER BY revenue DESC
LIMIT 5;
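To sanity-check the category aggregation before running it on the cluster, here is a hypothetical local equivalent in Python over the five sample rows (the expected totals follow directly from sales.csv above):

```python
from collections import defaultdict

# (category, amount) pairs from the sample orders in sales.csv.
orders = [
    ("Electronics", 1299.99), ("Furniture", 249.99),
    ("Electronics", 399.99), ("Furniture", 549.99),
    ("Electronics", 89.99),
]

# Equivalent of: SELECT category, COUNT(*), SUM(amount), AVG(amount)
#                FROM sales GROUP BY category ORDER BY SUM(amount) DESC
totals = defaultdict(lambda: [0, 0.0])  # category -> [order_count, total_sales]
for category, amount in orders:
    totals[category][0] += 1
    totals[category][1] += amount

for category, (count, total) in sorted(totals.items(), key=lambda kv: -kv[1][1]):
    print(f"{category}: {count} orders, total {total:.2f}, avg {total / count:.2f}")
```

Running the HiveQL above on the sample data should yield the same numbers: Electronics at 1789.97 across 3 orders, Furniture at 799.98 across 2.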

Advanced Analysis

-- Daily sales with running total
SELECT
    order_date,
    SUM(amount) as daily_sales,
    SUM(SUM(amount)) OVER (
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) as running_total
FROM sales
GROUP BY order_date
ORDER BY order_date;
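The windowed SUM above accumulates each day's total into everything seen so far. A small Python sketch of the same running total, using the daily sums implied by the sample data:

```python
from itertools import accumulate

# Daily totals computed from the sample rows: (order_date, daily_sales).
daily = [
    ("2024-01-15", 1299.99 + 249.99),
    ("2024-01-16", 399.99 + 549.99),
    ("2024-01-17", 89.99),
]

# Equivalent of SUM(...) OVER (ORDER BY order_date ROWS UNBOUNDED PRECEDING):
# each row carries the cumulative sum of all daily totals up to that date.
running = list(accumulate(sales for _, sales in daily))
for (date, sales), total in zip(daily, running):
    print(f"{date}: daily {sales:.2f}, running {total:.2f}")
```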

💻 Step 5: Run MapReduce Job (Optional)

Word Count Example

# SSH into cluster
ssh sshuser@your-cluster-name-ssh.azurehdinsight.net

# Run word count on sample data
hadoop jar \
  /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar \
  wordcount \
  wasb:///example/data/gutenberg/davinci.txt \
  wasb:///example/data/WordCountOutput

# View results
hdfs dfs -cat /example/data/WordCountOutput/part-r-00000
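Conceptually, the example job tokenizes each line in the map phase, emits a (word, 1) pair per token, and sums per word in the reduce phase. A minimal local sketch of that logic (the two input lines here are a stand-in for davinci.txt, not its contents):

```python
from collections import Counter

# Tiny stand-in input for the map phase.
lines = [
    "the quick brown fox",
    "the lazy dog",
]

# Map: split each line into tokens; Reduce: sum the counts per word.
counts = Counter()
for line in lines:
    counts.update(line.split())

# The reducer output corresponds to part-r-00000: word<TAB>count.
for word, n in sorted(counts.items()):
    print(f"{word}\t{n}")
```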

🎯 Step 6: Create Managed Table

-- Create managed table for better performance
CREATE TABLE sales_managed
STORED AS ORC
AS
SELECT * FROM sales;

-- Query managed table (faster)
SELECT * FROM sales_managed;

📈 Step 7: Monitor Cluster

Ambari Dashboard

  1. Navigate to cluster → Ambari home
  2. View dashboard metrics:
     • CPU usage
     • Memory usage
     • Disk I/O
     • YARN applications

YARN Resource Manager

  1. Click "YARN" in left menu
  2. Click "Quick Links" → "Resource Manager UI"
  3. View running applications
  4. Check job history

⚡ Performance Optimization

Optimize Table Format

-- Use ORC format for better performance
CREATE TABLE sales_orc
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY")
AS SELECT * FROM sales;

Partitioning

-- Partition by date for better query performance
CREATE TABLE sales_partitioned (
    order_id INT,
    product STRING,
    category STRING,
    amount DECIMAL(10,2)
)
PARTITIONED BY (order_date DATE)
STORED AS ORC;

-- Enable dynamic partitioning (required for INSERT ... PARTITION without a static value)
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Insert data
INSERT INTO sales_partitioned PARTITION (order_date)
SELECT order_id, product, category, amount, order_date
FROM sales;
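Partitioning helps because Hive stores each order_date value as its own directory, so a date filter reads only the matching directories instead of scanning the whole table. An illustrative local sketch of that layout (the order_date=... directory names follow the Hive convention; the rows are the sample data):

```python
from collections import defaultdict

# (order_id, amount, order_date) for the sample rows.
rows = [
    (1001, 1299.99, "2024-01-15"), (1002, 249.99, "2024-01-15"),
    (1003, 399.99, "2024-01-16"), (1004, 549.99, "2024-01-16"),
    (1005, 89.99, "2024-01-17"),
]

# Hive writes one directory per partition value: .../order_date=YYYY-MM-DD/
partitions = defaultdict(list)
for order_id, amount, order_date in rows:
    partitions[f"order_date={order_date}"].append((order_id, amount))

# A query with WHERE order_date = '2024-01-16' touches only this directory.
print(sorted(partitions))
print(partitions["order_date=2024-01-16"])
```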

Enable Compression

-- Enable compression for job output
SET hive.exec.compress.output=true;
-- mapred.output.compression.codec is the deprecated name; use the mapreduce.* property
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

🔧 Troubleshooting

Common Issues

Cannot Access Data

  • ✅ Verify storage account connection
  • ✅ Check firewall rules
  • ✅ Ensure correct path format (wasb:// or abfs://)

Query Fails with "Out of Memory"

  • ✅ Increase node size
  • ✅ Add more worker nodes
  • ✅ Optimize query (use filters)
  • ✅ Use partitioning

Cluster Creation Fails

  • ✅ Check subscription quotas
  • ✅ Verify VM availability in region
  • ✅ Ensure storage account accessible

Slow Performance

  • ✅ Use ORC format
  • ✅ Partition tables
  • ✅ Increase cluster size
  • ✅ Enable vectorization

🎓 Next Steps

Beginner Practice

  • Load your own data
  • Create multiple tables
  • Join tables in queries
  • Export results to storage


🧹 Cleanup

To avoid charges:

# Delete resource group
az group delete --name rg-hdinsight-quickstart --yes --no-wait

💰 Cost Tip: HDInsight clusters incur charges while running. Delete when not in use!

🎉 Congratulations!

You've successfully:

✅ Created an HDInsight Hadoop cluster
✅ Uploaded and queried data
✅ Used Hive for SQL-like analysis
✅ Optimized tables for performance
✅ Monitored cluster resources

Ready for enterprise big data processing!


Last Updated: January 2025 | Tutorial Version: 1.0