# 💾 Azure Data Lake Storage Gen2 Quickstart
Get started with Azure Data Lake Storage Gen2 in under an hour. Learn to create storage, organize data with hierarchical namespaces, and access data efficiently.
## 🎯 Learning Objectives
After completing this quickstart, you will be able to:
- Understand what ADLS Gen2 is and when to use it
- Create a storage account with hierarchical namespace enabled
- Upload and organize files in containers and directories
- Set access controls and permissions
- Access data using Azure Storage Explorer and Python SDK
## 📋 Prerequisites

Before starting, ensure you have:

- Azure subscription - [create a free account](https://azure.microsoft.com/free)
- Azure Portal access - [portal.azure.com](https://portal.azure.com)
- [Azure Storage Explorer](https://azure.microsoft.com/features/storage-explorer/)
- Python 3.7+ (optional, for the SDK examples) - [python.org](https://www.python.org/downloads/)
## 🔍 What is Azure Data Lake Storage Gen2?
ADLS Gen2 combines the best of Azure Blob Storage and Azure Data Lake Storage Gen1, providing:
- Hierarchical namespace - File system semantics (directories, ACLs)
- Hadoop compatibility - Works with HDInsight, Databricks, Synapse
- High performance - Optimized for analytics workloads
- Low cost - Blob storage pricing with enterprise features
### Key Concepts
- Storage Account: Top-level container for storage resources
- Container: Similar to a bucket or file system root
- Directory: Hierarchical folder structure
- File: The actual data objects
- ACLs: POSIX-style access control lists
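These concepts map directly onto the `abfss://` URL scheme used throughout this tutorial. As an illustrative sketch (a hypothetical helper, not part of the Azure SDK), a small parser can pull the pieces apart:

```python
from urllib.parse import urlparse


def parse_abfss(url: str) -> dict:
    """Split an abfss:// URL into account, container, directory, and file.

    Format: abfss://<container>@<account>.dfs.core.windows.net/<dir>/<file>
    Illustrative only - not an Azure SDK function.
    """
    parsed = urlparse(url)
    container, _, host = parsed.netloc.partition("@")
    account = host.split(".")[0]
    directory, _, file_name = parsed.path.lstrip("/").rpartition("/")
    return {
        "account": account,
        "container": container,
        "directory": directory,
        "file": file_name,
    }


print(parse_abfss("abfss://data@adlsquickstart.dfs.core.windows.net/raw/sales.csv"))
# → {'account': 'adlsquickstart', 'container': 'data', 'directory': 'raw', 'file': 'sales.csv'}
```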
### When to Use ADLS Gen2
✅ Good For:
- Big data analytics with Spark, Hive, or Databricks
- Data lake architectures
- Machine learning data storage
- Data warehousing staging areas
- Hierarchical data organization
❌ Not Ideal For:
- Small file storage (use Blob Storage)
- Database storage (use SQL Database)
- Frequent small updates (use Cosmos DB)
## 🚀 Step 1: Create Storage Account

### Using Azure Portal
1. __Navigate to Azure Portal__
    - Go to [portal.azure.com](https://portal.azure.com)
    - Click "Create a resource"
    - Search for "Storage account"
    - Click "Create"
2. __Configure Basics__
    - Subscription: Select your subscription
    - Resource Group: Create new "rg-adls-quickstart"
    - Storage Account Name: "adlsquickstart[yourname]" (lowercase, no spaces)
    - Region: Select the nearest region
    - Performance: Standard (or Premium for high IOPS)
    - Redundancy: LRS (locally-redundant storage) for this quickstart
3. __Enable Hierarchical Namespace__
    - Click the "Advanced" tab
    - Hierarchical namespace: check "Enable"
    - This is what makes it a Data Lake Storage Gen2 account!
4. __Review and Create__
    - Click "Review + create"
    - Click "Create"
    - Wait 1-2 minutes for deployment
💡 Important: Hierarchical namespace cannot be changed after creation!
## 📂 Step 2: Create Container and Directories

### Using Azure Portal
1. __Navigate to Storage Account__
    - Go to your storage account
    - Click "Containers" in the left menu
2. __Create Container__
    - Click "+ Container"
    - Name: "data"
    - Public access level: Private (no anonymous access)
    - Click "Create"
3. __Create Directory Structure__
    - Click on the "data" container
    - Click "+ Add Directory"
    - Create these directories:
        - `raw` - for raw ingested data
        - `processed` - for processed/transformed data
        - `curated` - for business-ready data
### Using Azure CLI

```bash
# Set variables
RESOURCE_GROUP="rg-adls-quickstart"
STORAGE_ACCOUNT="adlsquickstart$RANDOM"
LOCATION="eastus"

# Create the resource group (skip if it already exists)
az group create --name $RESOURCE_GROUP --location $LOCATION

# Create storage account with hierarchical namespace
az storage account create \
    --name $STORAGE_ACCOUNT \
    --resource-group $RESOURCE_GROUP \
    --location $LOCATION \
    --sku Standard_LRS \
    --kind StorageV2 \
    --enable-hierarchical-namespace true

# Create container
az storage container create \
    --name data \
    --account-name $STORAGE_ACCOUNT \
    --auth-mode login

# Create directories
az storage fs directory create \
    --name raw \
    --file-system data \
    --account-name $STORAGE_ACCOUNT \
    --auth-mode login

az storage fs directory create \
    --name processed \
    --file-system data \
    --account-name $STORAGE_ACCOUNT \
    --auth-mode login

az storage fs directory create \
    --name curated \
    --file-system data \
    --account-name $STORAGE_ACCOUNT \
    --auth-mode login

echo "Storage Account: $STORAGE_ACCOUNT"
```
## 📤 Step 3: Upload Data

### Create Sample Data File

Create a file named `sales.csv`:
```csv
order_id,customer_id,product,category,amount,order_date
1001,C101,Laptop,Electronics,1299.99,2024-01-15
1002,C102,Chair,Furniture,249.99,2024-01-15
1003,C101,Monitor,Electronics,399.99,2024-01-16
1004,C103,Desk,Furniture,549.99,2024-01-16
1005,C102,Keyboard,Electronics,89.99,2024-01-17
```
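If you prefer to generate the file rather than type it, this short script writes the same rows to `sales.csv` using only the standard library:

```python
import csv

# The sample dataset shown above
rows = [
    ("order_id", "customer_id", "product", "category", "amount", "order_date"),
    (1001, "C101", "Laptop", "Electronics", 1299.99, "2024-01-15"),
    (1002, "C102", "Chair", "Furniture", 249.99, "2024-01-15"),
    (1003, "C101", "Monitor", "Electronics", 399.99, "2024-01-16"),
    (1004, "C103", "Desk", "Furniture", 549.99, "2024-01-16"),
    (1005, "C102", "Keyboard", "Electronics", 89.99, "2024-01-17"),
]

# Write it to sales.csv in the current directory
with open("sales.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```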
### Using Azure Storage Explorer

1. __Open Storage Explorer__
    - Launch Azure Storage Explorer
    - Sign in with your Azure account
2. __Navigate to Container__
    - Expand your subscription
    - Expand "Storage Accounts"
    - Expand your storage account
    - Expand "Blob Containers"
    - Click the "data" container
3. __Upload File__
    - Click the "Upload" button
    - Select "Upload Files"
    - Choose `sales.csv`
    - Set the destination to `raw/sales.csv`
    - Click "Upload"
### Using Azure Portal

1. Navigate to the "data" container
2. Click on the "raw" directory
3. Click "Upload"
4. Select `sales.csv`
5. Click "Upload"
### Using Python SDK

```python
"""Upload a file to ADLS Gen2."""
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Configuration
STORAGE_ACCOUNT = "your-storage-account-name"
CONTAINER = "data"
DIRECTORY = "raw"
FILE_NAME = "sales.csv"

# Create service client
credential = DefaultAzureCredential()
service_client = DataLakeServiceClient(
    account_url=f"https://{STORAGE_ACCOUNT}.dfs.core.windows.net",
    credential=credential,
)

# Get file system (container), directory, and file clients
file_system_client = service_client.get_file_system_client(CONTAINER)
directory_client = file_system_client.get_directory_client(DIRECTORY)
file_client = directory_client.get_file_client(FILE_NAME)

# Upload the file
with open(FILE_NAME, "rb") as data:
    file_client.upload_data(data, overwrite=True)

print(f"✅ Uploaded {FILE_NAME} to {DIRECTORY}/")
```
## 🔐 Step 4: Set Access Controls
ADLS Gen2 supports both RBAC and ACLs for fine-grained access control.
### Assign RBAC Role

1. Navigate to the Storage Account
2. Click "Access Control (IAM)" in the left menu
3. Click "+ Add" > "Add role assignment"
4. __Select Role__
    - Search for "Storage Blob Data Contributor"
    - Click the role
    - Click "Next"
5. __Assign Access__
    - Select "User, group, or service principal"
    - Click "+ Select members"
    - Search for your user or app
    - Click "Select"
    - Click "Review + assign"
### Set Directory ACLs

```bash
# Get your user object ID
USER_ID=$(az ad signed-in-user show --query id --output tsv)

# Set the ACL on the raw directory (read, write, execute).
# Note: --acl replaces the whole ACL, so include the base
# user/group/other entries alongside the new user entry.
az storage fs access set \
    --acl "user::rwx,group::r-x,other::---,user:${USER_ID}:rwx" \
    --path raw \
    --file-system data \
    --account-name $STORAGE_ACCOUNT \
    --auth-mode login
```
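The `--acl` value is a comma-separated list of POSIX-style `scope:qualifier:permissions` entries. As a quick illustration of that format (a hypothetical helper, not part of the Azure CLI or SDK):

```python
def parse_acl_entry(entry: str) -> dict:
    """Decode one POSIX-style ACL entry such as 'user:1234:rwx'.

    Illustrative only - shows the format az storage fs access set expects.
    An empty qualifier (e.g. 'user::rwx') refers to the owning user/group.
    """
    scope, qualifier, perms = entry.split(":")
    return {
        "scope": scope,                  # user, group, other, or mask
        "qualifier": qualifier or None,  # Azure AD object ID, or None for owner
        "read": "r" in perms,
        "write": "w" in perms,
        "execute": "x" in perms,
    }


print(parse_acl_entry("user:1234:rwx"))
# → {'scope': 'user', 'qualifier': '1234', 'read': True, 'write': True, 'execute': True}
```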
## 📥 Step 5: Access and Query Data

### Using Azure Storage Explorer

1. Navigate to your file in Storage Explorer
2. Right-click the file
3. Select "Open" to download and view
4. Select "Properties" to see metadata
### Using Python SDK

```python
"""Read a file from ADLS Gen2."""
from io import StringIO

import pandas as pd
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Configuration
STORAGE_ACCOUNT = "your-storage-account-name"
CONTAINER = "data"
FILE_PATH = "raw/sales.csv"

# Create service client
credential = DefaultAzureCredential()
service_client = DataLakeServiceClient(
    account_url=f"https://{STORAGE_ACCOUNT}.dfs.core.windows.net",
    credential=credential,
)

# Get file system and file clients
file_system_client = service_client.get_file_system_client(CONTAINER)
file_client = file_system_client.get_file_client(FILE_PATH)

# Download and read the file
download = file_client.download_file()
content = download.readall().decode("utf-8")

# Parse CSV
df = pd.read_csv(StringIO(content))
print(df.head())
print(f"\nTotal rows: {len(df)}")
```
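Once the CSV is in a DataFrame, you can explore it locally before moving to Spark. For example, assuming pandas is installed, summing the sample data by category:

```python
from io import StringIO

import pandas as pd

# The sample sales.csv content from Step 3
csv_text = """order_id,customer_id,product,category,amount,order_date
1001,C101,Laptop,Electronics,1299.99,2024-01-15
1002,C102,Chair,Furniture,249.99,2024-01-15
1003,C101,Monitor,Electronics,399.99,2024-01-16
1004,C103,Desk,Furniture,549.99,2024-01-16
1005,C102,Keyboard,Electronics,89.99,2024-01-17"""

df = pd.read_csv(StringIO(csv_text))

# Total revenue per category
totals = df.groupby("category")["amount"].sum()
print(totals)  # Electronics 1789.97, Furniture 799.98
```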
### Using Azure Synapse or Databricks

```python
# Synapse Spark
df = spark.read.csv(
    "abfss://data@your-storage-account.dfs.core.windows.net/raw/sales.csv",
    header=True,
    inferSchema=True,
)
df.show()

# Databricks
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("abfss://data@your-storage-account.dfs.core.windows.net/raw/sales.csv")
df.display()
```
## 🏗️ Best Practices

### Directory Organization

```text
data/
├── raw/                      # Raw ingested data (immutable)
│   ├── year=2024/
│   │   ├── month=01/
│   │   │   └── sales_20240115.csv
├── processed/                # Cleaned and transformed data
│   ├── year=2024/
│   │   ├── month=01/
│   │   │   └── sales_cleaned.parquet
└── curated/                  # Business-ready datasets
    └── sales_summary/
        └── sales_monthly.parquet
```
### __Naming Conventions__
- Use lowercase for consistency
- Use hyphens for multi-word names
- Include date/time in file names: `sales_20240115_103045.csv`
- Use partitioning for large datasets: `year=2024/month=01/`
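A small helper can enforce these conventions when building upload paths. The function below is an illustrative sketch (not an SDK API), combining lowercase names, date partitioning, and timestamped file names:

```python
from datetime import datetime


def partitioned_path(dataset: str, ts: datetime) -> str:
    """Build a lowercase, hyphenated, date-partitioned path following
    the naming conventions above. Hypothetical helper, not an SDK call."""
    name = dataset.lower().replace(" ", "-")
    return (
        f"{name}/"
        f"year={ts:%Y}/month={ts:%m}/"
        f"{name}_{ts:%Y%m%d_%H%M%S}.csv"
    )


print(partitioned_path("Sales", datetime(2024, 1, 15, 10, 30, 45)))
# → sales/year=2024/month=01/sales_20240115_103045.csv
```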
### __Performance Optimization__
1. __Use Parquet format__ for analytics (columnar, compressed)
2. __Partition large datasets__ by date or category
3. __Enable lifecycle management__ to move old data to cool/archive tiers
4. __Use concurrent uploads__ for large file sets
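Lifecycle management (point 3) moves data between tiers as it ages. The thresholds below (30 and 180 days) are illustrative assumptions, not Azure defaults, but they show the kind of rule such a policy encodes:

```python
from datetime import date


def suggest_tier(last_modified: date, today: date) -> str:
    """Pick a storage tier by data age.

    Illustrative only: the 30/180-day thresholds are assumptions,
    not Azure defaults - tune them to your access patterns.
    """
    age_days = (today - last_modified).days
    if age_days < 30:
        return "hot"      # frequently accessed
    if age_days < 180:
        return "cool"     # infrequently accessed, cheaper storage
    return "archive"      # rarely accessed, cheapest storage


print(suggest_tier(date(2024, 1, 1), date(2024, 2, 15)))  # → cool
```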
## 🔧 Troubleshooting
### __Common Issues__
__Error: "Hierarchical namespace not enabled"__
- ✅ Create new storage account with HNS enabled
- ❌ Cannot enable on existing account
__Error: "Forbidden" or "Authorization failed"__
- ✅ Check RBAC role assignment
- ✅ Verify ACL permissions
- ✅ Wait 5-10 minutes for permissions to propagate
__Slow Upload/Download__
- ✅ Use Azure Storage Explorer for GUI
- ✅ Use AzCopy for bulk transfers
- ✅ Enable concurrent operations
__Cannot Access from Synapse/Databricks__
- ✅ Verify storage account firewall settings
- ✅ Check managed identity permissions
- ✅ Use correct URL format: `abfss://`
## 🎓 Next Steps
### __Beginner Practice__
- [ ] Create additional directories for your data
- [ ] Upload multiple files
- [ ] Set different ACLs for different users
- [ ] Organize data using partitioning strategy
### __Intermediate Challenges__
- [ ] Implement lifecycle management policies
- [ ] Use AzCopy for bulk data transfer
- [ ] Set up Azure Monitor alerts
- [ ] Integrate with Azure Data Factory
### __Advanced Topics__
- [ ] Implement data lake zones (raw/bronze, processed/silver, curated/gold)
- [ ] Set up Azure Purview for data governance
- [ ] Configure private endpoints
- [ ] Implement encryption with customer-managed keys
## 📚 Additional Resources
### __Documentation__
- [ADLS Gen2 Overview](https://learn.microsoft.com/azure/storage/blobs/data-lake-storage-introduction)
- [Access Control in ADLS Gen2](https://learn.microsoft.com/azure/storage/blobs/data-lake-storage-access-control)
- [Best Practices](https://learn.microsoft.com/azure/storage/blobs/data-lake-storage-best-practices)
### __Next Tutorials__
- [Databricks Quickstart](databricks-quickstart.md) - Use Databricks with ADLS Gen2
- [Delta Lake Basics](delta-lake-basics.md) - Store data in Delta format
- [Data Engineer Path](../learning-paths/data-engineer-path.md)
### __Tools__
- [Azure Storage Explorer](https://azure.microsoft.com/features/storage-explorer/)
- [AzCopy](https://learn.microsoft.com/azure/storage/common/storage-use-azcopy-v10)
- [Azure Data Lake Tools for VS Code](https://marketplace.visualstudio.com/items?itemName=usqlextpublisher.usql-vscode-ext)
## 🧹 Cleanup
To avoid Azure charges, delete resources when done:
```bash
# Delete the resource group (deletes everything in it)
az group delete --name rg-adls-quickstart --yes --no-wait
```

Or use the Azure Portal:
- Navigate to Resource Groups
- Select "rg-adls-quickstart"
- Click "Delete resource group"
- Type resource group name to confirm
- Click "Delete"
## 🎉 Congratulations!
You've successfully:
- ✅ Created an Azure Data Lake Storage Gen2 account
- ✅ Organized data with a hierarchical namespace
- ✅ Uploaded and accessed files
- ✅ Set access controls and permissions
- ✅ Learned best practices for data organization
You're ready to build scalable data lake solutions!
__Next Recommended Tutorial__: [Delta Lake Basics](delta-lake-basics.md) to learn about transactional data lakes
Last Updated: January 2025 | Tutorial Version: 1.0 | Tested with: Azure Storage Account with HNS enabled