
🏞️ Azure Data Lake Storage Gen2

See also: CSA-in-a-Box platform guide

This is the generic Azure reference for Azure Data Lake Storage Gen2. For how CSA-in-a-Box specifically deploys, configures, and integrates this service, see the platform guide: Azure Data Lake Storage Gen2 guide.


Azure Data Lake Storage Gen2 is a highly scalable and secure data lake solution built on Azure Blob Storage with hierarchical namespace capabilities, optimized for big data analytics workloads.


🌟 Service Overview

Azure Data Lake Storage Gen2 (ADLS Gen2) converges the capabilities of Azure Data Lake Storage Gen1 with Azure Blob Storage. It provides a hierarchical file system while maintaining the scalability, security, and cost-effectiveness of Azure Blob Storage, making it the ideal foundation for enterprise data lakes.

🔥 Key Value Propositions

  • Hierarchical Namespace: File and directory-level operations for performance and organization
  • Multi-Protocol Access: Supports both Blob and Data Lake File System (DFS) APIs against the same data (see the sketch after this list)
  • Fine-Grained Security: POSIX-compliant ACLs and Azure RBAC integration
  • Cost-Effective: Same pricing as Blob Storage with added capabilities
  • Massive Scale: Petabyte-scale storage with high throughput
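
A minimal sketch of multi-protocol access with the Python SDKs, assuming the adlsgen2demo account and datalake file system used in the Quick Start below: the same object is reachable through both the DFS and the Blob endpoint.

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient
from azure.storage.filedatalake import DataLakeServiceClient

credential = DefaultAzureCredential()

# DFS endpoint: file-system semantics (directories, ACLs, atomic rename)
dfs = DataLakeServiceClient(
    "https://adlsgen2demo.dfs.core.windows.net", credential=credential
)
file_client = dfs.get_file_system_client("datalake").get_file_client("bronze/data.csv")

# Blob endpoint: the same bytes, addressed as a block blob
blob = BlobServiceClient(
    "https://adlsgen2demo.blob.core.windows.net", credential=credential
)
blob_client = blob.get_blob_client("datalake", "bronze/data.csv")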

🏗️ Architecture Overview

graph TB
    subgraph "Data Ingestion"
        Sources[Data Sources]
        ADF[Data Factory]
        Spark[Spark/Databricks]
        SDK[Azure SDKs]
    end

    subgraph "ADLS Gen2 Storage Account"
        subgraph "Hierarchical Namespace"
            Root[Root Container]
            Bronze[/bronze]
            Silver[/silver]
            Gold[/gold]
        end

        subgraph "Security Layers"
            RBAC[Azure RBAC]
            ACL[POSIX ACLs]
            SAS[Shared Access<br/>Signatures]
            Firewall[Network<br/>Security]
        end

        subgraph "Data Access Tiers"
            Hot[Hot Tier]
            Cool[Cool Tier]
            Archive[Archive Tier]
        end
    end

    subgraph "Analytics & Consumption"
        Synapse[Synapse Analytics]
        Databricks[Databricks]
        PowerBI[Power BI]
        AzureML[Azure ML]
    end

    Sources --> ADF
    ADF --> Root
    Spark --> Root
    SDK --> Root

    Root --> Bronze
    Bronze --> Silver
    Silver --> Gold

    RBAC -.-> Root
    ACL -.-> Root

    Gold --> Synapse
    Gold --> Databricks
    Gold --> PowerBI
    Gold --> AzureML

🛠️ Core Features

🌳 Hierarchical Namespace

Performance

True file system semantics with directory operations and atomic rename.

Key Capabilities:

  • Directory-level operations (rename, delete, move)
  • Atomic rename and move operations, so jobs can commit results in a single metadata operation
  • Improved performance for big data workloads
  • Better organization with folder hierarchies

Best For: Big data analytics, data lake implementations, file-based workloads
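
A minimal sketch of an atomic directory rename with the Python SDK, reusing the account and file-system names from the Quick Start below; the staging/landed paths are illustrative.

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service_client = DataLakeServiceClient(
    account_url="https://adlsgen2demo.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
fs = service_client.get_file_system_client("datalake")

# With hierarchical namespace enabled, the rename is a single atomic
# metadata operation, not a copy-and-delete of every object
directory_client = fs.get_directory_client("bronze/staging")
directory_client.rename_directory(f"{fs.file_system_name}/bronze/landed")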

📖 Detailed Guide →


🔐 Access Control

Security

Multi-layered security with RBAC, ACLs, and encryption.

Security Features:

  • Azure RBAC for management operations
  • POSIX ACLs for file/directory permissions
  • Shared Access Signatures (SAS) for delegated access
  • Azure AD integration for identity management

Best For: Enterprise security requirements, multi-tenant scenarios, fine-grained access
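
A minimal sketch of inspecting and updating ACLs with the Python SDK, assuming the Quick Start account below; <object-id> is a placeholder for an Azure AD object ID.

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

fs = DataLakeServiceClient(
    account_url="https://adlsgen2demo.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
).get_file_system_client("datalake")

directory_client = fs.get_directory_client("bronze")

# Read the current owner, owning group, and ACL entries
print(directory_client.get_access_control()["acl"])

# Merge a read/execute entry for one principal into the ACLs of
# /bronze and everything beneath it
directory_client.update_access_control_recursive(acl="user:<object-id>:r-x")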

📖 Detailed Guide →


♻️ Data Lifecycle Management

Cost

Automated tiering and lifecycle policies for cost optimization.

Lifecycle Features:

  • Rule-based tier transitions (Hot → Cool → Archive)
  • Automated deletion of old data
  • Last access time-based policies
  • Blob snapshots and versions management

Best For: Long-term data retention, cost optimization, compliance requirements
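
A minimal sketch of a rule-based lifecycle policy using the azure-mgmt-storage SDK, assuming the resource names from the Quick Start below; the day thresholds and rule name are illustrative.

from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import (
    DateAfterModification,
    ManagementPolicy,
    ManagementPolicyAction,
    ManagementPolicyBaseBlob,
    ManagementPolicyDefinition,
    ManagementPolicyFilter,
    ManagementPolicyRule,
    ManagementPolicySchema,
)

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Hot -> Cool after 30 days, Cool -> Archive after 180, delete after 365
rule = ManagementPolicyRule(
    name="tier-and-expire",
    type="Lifecycle",
    definition=ManagementPolicyDefinition(
        actions=ManagementPolicyAction(
            base_blob=ManagementPolicyBaseBlob(
                tier_to_cool=DateAfterModification(days_after_modification_greater_than=30),
                tier_to_archive=DateAfterModification(days_after_modification_greater_than=180),
                delete=DateAfterModification(days_after_modification_greater_than=365),
            )
        ),
        filters=ManagementPolicyFilter(
            blob_types=["blockBlob"], prefix_match=["datalake/bronze"]
        ),
    ),
)

client.management_policies.create_or_update(
    "rg-datalake-demo", "adlsgen2demo", "default",
    ManagementPolicy(policy=ManagementPolicySchema(rules=[rule])),
)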

📖 Detailed Guide →


Performance Optimization

Throughput

Techniques and best practices for maximizing performance.

Optimization Areas:

  • Partitioning strategies
  • File size optimization
  • Parallel processing patterns
  • Network throughput tuning

Best For: High-throughput workloads, large-scale processing, performance-critical applications
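
A small sketch of a date-partitioned path convention (Hive-style key=value folders, which Spark and Synapse can prune when filtering on year/month/day); the zone and dataset names are illustrative.

from datetime import date

def partition_path(zone: str, dataset: str, d: date) -> str:
    # Hive-style partition folders let engines skip irrelevant dates
    return (
        f"{zone}/{dataset}/"
        f"year={d.year}/month={d.month:02d}/day={d.day:02d}/part-0000.parquet"
    )

print(partition_path("silver", "sales", date(2024, 3, 15)))
# silver/sales/year=2024/month=03/day=15/part-0000.parquet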

📖 Detailed Guide →


🎯 Common Use Cases

🏗️ Modern Data Lake Architecture

Implement medallion architecture for enterprise data lakes.

Architecture: Bronze → Silver → Gold data zones
Pattern: Medallion Architecture

graph LR
    Raw[Raw Data] --> Bronze[Bronze Layer<br/>Raw Storage]
    Bronze --> Silver[Silver Layer<br/>Cleaned Data]
    Silver --> Gold[Gold Layer<br/>Business-Ready]
    Gold --> BI[Business<br/>Intelligence]
    Gold --> ML[Machine<br/>Learning]

📊 Big Data Analytics

Foundation for Spark, Synapse, and Databricks workloads.

Architecture: ADLS Gen2 + Compute Engines
Pattern: Data Lake Analytics

🔄 Hybrid Data Integration

Connect on-premises and cloud data sources.

Architecture: Data Factory + Private Link + ADLS Gen2
Pattern: Hybrid Integration

📦 Data Archival & Compliance

Long-term retention with cost-effective archival.

Architecture: Lifecycle Policies + Archive Tier
Pattern: Data Retention Strategy


📊 Pricing Guide

💰 Cost Components

| Component     | Pricing Model         | Key Factors              | Optimization Tips      |
|---------------|-----------------------|--------------------------|------------------------|
| Storage       | Per GB/month          | Tier (Hot/Cool/Archive)  | Use lifecycle policies |
| Operations    | Per 10,000 operations | Operation type           | Batch operations       |
| Data Transfer | Per GB                | Egress region            | Use local processing   |
| Metadata      | Included              | -                        | No additional cost     |

💡 Storage Tiers Comparison

| Tier    | Use Case                          | Storage Cost | Access Cost | Minimum Duration |
|---------|-----------------------------------|--------------|-------------|------------------|
| Hot     | Frequently accessed data          | Highest      | Lowest      | None             |
| Cool    | Infrequently accessed (30+ days)  | Lower        | Higher      | 30 days          |
| Archive | Rarely accessed (180+ days)       | Lowest       | Highest     | 180 days         |

🎯 Cost Optimization Strategies

  1. Implement Lifecycle Policies: Auto-transition data to cooler tiers
  2. Optimize File Sizes: Larger files reduce operation costs (see the sketch after this list)
  3. Use Local Redundancy: LRS vs GRS based on requirements
  4. Monitor Access Patterns: Identify candidates for tier changes
  5. Leverage Reserved Capacity: Commit to 1-3 years for discounts
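
A back-of-the-envelope sketch of the file-size point above: operations are billed per 10,000, so file count, not just data volume, drives the operations bill. The numbers are illustrative.

# Writing 1 TiB as many small files vs. fewer large ones: each file
# costs at least one write operation, so file count drives the bill
TOTAL_MB = 1024 * 1024  # 1 TiB expressed in MB

for file_mb in (1, 64, 1024):
    files = TOTAL_MB // file_mb
    print(f"{file_mb:>5} MB files -> at least {files:>9,} write operations")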

📖 Detailed Cost Guide →


🚀 Quick Start Guide

1️⃣ Create Storage Account with Hierarchical Namespace

# Create resource group
az group create --name rg-datalake-demo --location eastus

# Create ADLS Gen2 storage account
az storage account create \
  --name adlsgen2demo \
  --resource-group rg-datalake-demo \
  --location eastus \
  --sku Standard_LRS \
  --kind StorageV2 \
  --enable-hierarchical-namespace true

# Create container (file system)
az storage fs create \
  --name datalake \
  --account-name adlsgen2demo \
  --auth-mode login

2️⃣ Create Directory Structure

# Create medallion architecture folders
az storage fs directory create --name bronze --file-system datalake --account-name adlsgen2demo --auth-mode login
az storage fs directory create --name silver --file-system datalake --account-name adlsgen2demo --auth-mode login
az storage fs directory create --name gold --file-system datalake --account-name adlsgen2demo --auth-mode login

3️⃣ Set Access Control Lists (ACLs)

# Add an entry for an Azure AD principal (ACL entries take object IDs, not UPNs);
# --acl on "access set" replaces the whole ACL, so include the base entries too
az storage fs access set \
  --acl "user::rwx,group::r-x,other::---,user:<object-id>:rwx" \
  --path bronze \
  --file-system datalake \
  --account-name adlsgen2demo \
  --auth-mode login

# Merge a default ACL entry so new items under /bronze inherit it
az storage fs access update-recursive \
  --acl "default:user:<object-id>:rwx" \
  --path bronze \
  --file-system datalake \
  --account-name adlsgen2demo \
  --auth-mode login

4️⃣ Upload Data with Python SDK

from azure.storage.filedatalake import DataLakeServiceClient
from azure.identity import DefaultAzureCredential

# Initialize client
credential = DefaultAzureCredential()
service_client = DataLakeServiceClient(
    account_url="https://adlsgen2demo.dfs.core.windows.net",
    credential=credential
)

# Get file system client
file_system_client = service_client.get_file_system_client("datalake")

# Upload file
file_client = file_system_client.get_file_client("bronze/sales/data.csv")
with open("local_data.csv", "rb") as data:
    file_client.upload_data(data, overwrite=True)

print("File uploaded successfully!")

5️⃣ Query Data with Synapse Serverless SQL

-- Create an external data source (run inside a user database, not master)
CREATE EXTERNAL DATA SOURCE DataLake
WITH (
    LOCATION = 'https://adlsgen2demo.dfs.core.windows.net/datalake'
);

-- Query CSV files
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'bronze/sales/*.csv',
    DATA_SOURCE = 'DataLake',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS sales_data;

🔧 Configuration & Management

🛡️ Security Best Practices

Recommended Security Configuration:

  1. Enable Azure AD Authentication: Use managed identities
  2. Implement Network Security: Private endpoints and firewalls
  3. Use Customer-Managed Keys: For encryption at rest
  4. Enable Soft Delete: Protect against accidental deletion
  5. Configure Access Logging: Monitor all access patterns

# Example: Configure storage account firewall rules
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import (
    IPRule,
    NetworkRuleSet,
    StorageAccountUpdateParameters,
)

subscription_id = "<subscription-id>"
storage_client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

# Deny by default, allow one public range plus trusted Azure services
storage_client.storage_accounts.update(
    resource_group_name="rg-datalake",
    account_name="adlsgen2demo",
    parameters=StorageAccountUpdateParameters(
        network_rule_set=NetworkRuleSet(
            default_action="Deny",
            ip_rules=[IPRule(ip_address_or_range="203.0.113.0/24")],
            virtual_network_rules=[],
            bypass="AzureServices",
        )
    ),
)

📖 Security Guide →

⚡ Performance Tuning

Key Performance Factors:

  • File Size: Optimal range is 256MB - 1GB
  • Partitioning: Use partition columns for filtering
  • Parallel Operations: Leverage multi-threading for uploads/downloads
  • Network Proximity: Co-locate compute and storage

# Example: Parallel file upload (reuses file_system_client from
# Quick Start step 4 above)
from concurrent.futures import ThreadPoolExecutor
import os

def upload_file(file_path, destination_path):
    file_client = file_system_client.get_file_client(destination_path)
    with open(file_path, "rb") as data:
        file_client.upload_data(data, overwrite=True)
    return f"Uploaded {file_path}"

# Upload multiple files in parallel
files = ["file1.csv", "file2.csv", "file3.csv"]
with ThreadPoolExecutor(max_workers=10) as executor:
    results = executor.map(
        lambda f: upload_file(f, f"bronze/{os.path.basename(f)}"),
        files
    )
    for result in results:
        print(result)

📖 Performance Guide →

📊 Monitoring & Diagnostics

Key Metrics to Monitor:

  • Availability: Storage account uptime
  • Latency: End-to-end and server latency
  • Transactions: Success rate and error types
  • Capacity: Used capacity and growth trends

# Example: Query platform metrics with Azure Monitor
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient

subscription_id = "<subscription-id>"
monitor_client = MonitorManagementClient(DefaultAzureCredential(), subscription_id)

# The timespan must be an explicit ISO-8601 start/end interval
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

metrics = monitor_client.metrics.list(
    resource_uri=(
        f"/subscriptions/{subscription_id}/resourceGroups/rg-datalake"
        "/providers/Microsoft.Storage/storageAccounts/adlsgen2demo"
    ),
    timespan=f"{start.isoformat()}/{end.isoformat()}",
    interval="PT5M",
    metricnames="Transactions,Availability,SuccessE2ELatency",
    aggregation="Average"
)

for metric in metrics.value:
    print(f"{metric.name.value}: {metric.timeseries[0].data}")

📖 Monitoring Guide →


🔗 Integration Patterns

Azure Synapse Analytics Integration

Direct integration for serverless and dedicated SQL pools.

-- Define the Parquet file format referenced below (run once per database)
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);

-- Create external table in Synapse
CREATE EXTERNAL TABLE SalesData (
    SaleID INT,
    Product NVARCHAR(100),
    Amount DECIMAL(10,2),
    SaleDate DATE
)
WITH (
    LOCATION = 'gold/sales/',
    DATA_SOURCE = DataLake,
    FILE_FORMAT = ParquetFormat
);

-- Query external data
SELECT Product, SUM(Amount) as TotalSales
FROM SalesData
WHERE SaleDate >= '2024-01-01'
GROUP BY Product;

Azure Databricks Integration

Mount ADLS Gen2 for Spark processing.

# Mount ADLS Gen2 in Databricks
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<client-id>",
    "fs.azure.account.oauth2.client.secret": "<client-secret>",
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}

dbutils.fs.mount(
    source="abfss://datalake@adlsgen2demo.dfs.core.windows.net/",
    mount_point="/mnt/datalake",
    extra_configs=configs
)

# Read data with Spark (assumes the gold zone is stored as Delta tables)
df = spark.read.format("delta").load("/mnt/datalake/gold/sales")
df.show()

Azure Data Factory Integration

Build ETL pipelines with ADLS Gen2 as source and sink.

{
    "name": "CopyToDataLake",
    "type": "Copy",
    "inputs": [{
        "referenceName": "SourceDataset",
        "type": "DatasetReference"
    }],
    "outputs": [{
        "referenceName": "ADLSGen2Dataset",
        "type": "DatasetReference"
    }],
    "typeProperties": {
        "source": {
            "type": "SqlServerSource"
        },
        "sink": {
            "type": "ParquetSink",
            "storeSettings": {
                "type": "AzureBlobFSWriteSettings",
                "copyBehavior": "PreserveHierarchy"
            }
        }
    }
}

📖 Integration Examples →


📚 Learning Resources

🎓 Getting Started

📖 Deep Dive Guides

🔧 Advanced Topics


🆘 Troubleshooting

🔍 Common Issues

📞 Getting Help

  • Azure Support: Create support ticket in Azure Portal
  • Community Forums: Microsoft Q&A, Stack Overflow
  • Documentation: Microsoft Learn official docs
  • GitHub: Azure SDK issues and samples

📖 Troubleshooting Guide →


Microsoft Documentation

Architecture Guidance


Last Updated: 2025-01-28 | Service Version: General Availability | Documentation Status: Complete