🏞️ Azure Data Lake Storage Gen2¶
See also: CSA-in-a-Box platform guide
This is the generic Azure reference for Azure Data Lake Storage Gen2. For how CSA-in-a-Box specifically deploys, configures, and integrates this service, see the platform guide: Azure Data Lake Storage Gen2 guide.
Azure Data Lake Storage Gen2 is a highly scalable and secure data lake solution built on Azure Blob Storage with hierarchical namespace capabilities, optimized for big data analytics workloads.
🌟 Service Overview¶
Azure Data Lake Storage Gen2 (ADLS Gen2) converges the capabilities of Azure Data Lake Storage Gen1 with Azure Blob Storage. It provides a hierarchical file system while maintaining the scalability, security, and cost-effectiveness of Azure Blob Storage, making it the ideal foundation for enterprise data lakes.
🔥 Key Value Propositions¶
- Hierarchical Namespace: File and directory-level operations for performance and organization
- Multi-Protocol Access: Supports both Blob and Data Lake File System (DFS) APIs (see the sketch below)
- Fine-Grained Security: POSIX-compliant ACLs and Azure RBAC integration
- Cost-Effective: Same pricing as Blob Storage with added capabilities
- Massive Scale: Petabyte-scale storage with high throughput
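Multi-protocol access means the same object is reachable through both endpoints. A minimal sketch, assuming the `adlsgen2demo` account and `datalake` file system used in the Quick Start below, writing through the DFS API and reading the same object back through the Blob API:

```python
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient
from azure.storage.filedatalake import DataLakeServiceClient

credential = DefaultAzureCredential()
account = "adlsgen2demo"  # account name from this guide's Quick Start

# Write through the DFS endpoint...
dfs = DataLakeServiceClient(f"https://{account}.dfs.core.windows.net", credential=credential)
file_client = dfs.get_file_system_client("datalake").get_file_client("bronze/hello.txt")
file_client.upload_data(b"hello", overwrite=True)

# ...and read the same object back through the Blob endpoint.
blob = BlobServiceClient(f"https://{account}.blob.core.windows.net", credential=credential)
data = blob.get_blob_client("datalake", "bronze/hello.txt").download_blob().readall()
print(data)
```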
🏗️ Architecture Overview¶
```mermaid
graph TB
    subgraph "Data Ingestion"
        Sources[Data Sources]
        ADF[Data Factory]
        Spark[Spark/Databricks]
        SDK[Azure SDKs]
    end
    subgraph "ADLS Gen2 Storage Account"
        subgraph "Hierarchical Namespace"
            Root[Root Container]
            Bronze["/bronze"]
            Silver["/silver"]
            Gold["/gold"]
        end
        subgraph "Security Layers"
            RBAC[Azure RBAC]
            ACL[POSIX ACLs]
            SAS[Shared Access<br/>Signatures]
            Firewall[Network<br/>Security]
        end
        subgraph "Data Access Tiers"
            Hot[Hot Tier]
            Cool[Cool Tier]
            Archive[Archive Tier]
        end
    end
    subgraph "Analytics & Consumption"
        Synapse[Synapse Analytics]
        Databricks[Databricks]
        PowerBI[Power BI]
        AzureML[Azure ML]
    end
    Sources --> ADF
    ADF --> Root
    Spark --> Root
    SDK --> Root
    Root --> Bronze
    Bronze --> Silver
    Silver --> Gold
    RBAC -.-> Root
    ACL -.-> Root
    Gold --> Synapse
    Gold --> Databricks
    Gold --> PowerBI
    Gold --> AzureML
```

🛠️ Core Features¶
🌳 Hierarchical Namespace¶
True file system semantics with directory operations and atomic rename.
Key Capabilities:
- Directory-level operations (rename, delete, move)
- Atomic rename and delete operations, which analytics engines rely on for reliable job commits
- Improved performance for big data workloads
- Better organization with folder hierarchies
Best For: Big data analytics, data lake implementations, file-based workloads
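Because rename is a single atomic metadata operation under the hierarchical namespace, reorganizing a directory tree never copies data. A minimal sketch with the Python SDK, assuming the `adlsgen2demo` account and `datalake` file system from the Quick Start below, plus a hypothetical `bronze/staging` directory:

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    "https://adlsgen2demo.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
fs = service.get_file_system_client("datalake")

# Atomic rename: the new name must be prefixed with the target file system.
directory = fs.get_directory_client("bronze/staging")
directory.rename_directory("datalake/bronze/2024-01-landing")
```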
🔐 Access Control¶
Multi-layered security with RBAC, ACLs, and encryption.
Security Features:
- Azure RBAC for management operations
- POSIX ACLs for file/directory permissions
- Shared Access Signatures (SAS) for delegated access
- Azure AD integration for identity management
Best For: Enterprise security requirements, multi-tenant scenarios, fine-grained access
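ACLs can also be managed from the Python SDK. A sketch that appends an entry for one principal (the object ID is a placeholder; ACL entries reference Azure AD object IDs rather than user names):

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    "https://adlsgen2demo.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
directory = service.get_file_system_client("datalake").get_directory_client("bronze")

# Read the current ACL, then grant rwx to a principal by object ID (placeholder GUID).
current_acl = directory.get_access_control()["acl"]
directory.set_access_control(
    acl=current_acl + ",user:00000000-0000-0000-0000-000000000000:rwx"
)
```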
♻️ Data Lifecycle Management¶
Automated tiering and lifecycle policies for cost optimization.
Lifecycle Features:
- Rule-based tier transitions (Hot → Cool → Archive)
- Automated deletion of old data
- Last access time-based policies
- Blob snapshots and versions management
Best For: Long-term data retention, cost optimization, compliance requirements
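Lifecycle rules live on the storage account as a management policy. A sketch with the management SDK, assuming the resource names from the Quick Start below; the day thresholds are illustrative:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

subscription_id = "<subscription-id>"  # placeholder
client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

# One rule: tier bronze data to Cool after 30 days, Archive after 180, delete after ~7 years.
client.management_policies.create_or_update(
    "rg-datalake-demo",
    "adlsgen2demo",
    "default",
    {
        "policy": {
            "rules": [
                {
                    "name": "tier-and-expire-bronze",
                    "enabled": True,
                    "type": "Lifecycle",
                    "definition": {
                        "filters": {
                            "blob_types": ["blockBlob"],
                            "prefix_match": ["datalake/bronze"],
                        },
                        "actions": {
                            "base_blob": {
                                "tier_to_cool": {"days_after_modification_greater_than": 30},
                                "tier_to_archive": {"days_after_modification_greater_than": 180},
                                "delete": {"days_after_modification_greater_than": 2555},
                            }
                        },
                    },
                }
            ]
        }
    },
)
```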
⚡ Performance Optimization¶
Techniques and best practices for maximizing performance.
Optimization Areas:
- Partitioning strategies
- File size optimization
- Parallel processing patterns
- Network throughput tuning
Best For: High-throughput workloads, large-scale processing, performance-critical applications
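Partitioning largely comes down to a consistent, Hive-style path convention that query engines can prune on. A small illustration of the layout this implies (pure Python, no Azure calls):

```python
from datetime import date

def partition_path(zone: str, dataset: str, day: date) -> str:
    """Build a Hive-style partition path that Spark/Synapse can prune on."""
    return f"{zone}/{dataset}/year={day:%Y}/month={day:%m}/day={day:%d}"

print(partition_path("bronze", "sales", date(2024, 1, 15)))
# bronze/sales/year=2024/month=01/day=15
# Aim for files in the 256 MB to 1 GB range within each partition;
# many tiny files inflate per-operation costs and slow down listings.
```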
🎯 Common Use Cases¶
🏗️ Modern Data Lake Architecture¶
Implement medallion architecture for enterprise data lakes.
Architecture: Bronze → Silver → Gold data zones
Pattern: Medallion Architecture
```mermaid
graph LR
    Raw[Raw Data] --> Bronze[Bronze Layer<br/>Raw Storage]
    Bronze --> Silver[Silver Layer<br/>Cleaned Data]
    Silver --> Gold[Gold Layer<br/>Business-Ready]
    Gold --> BI[Business<br/>Intelligence]
    Gold --> ML[Machine<br/>Learning]
```

📊 Big Data Analytics¶
Foundation for Spark, Synapse, and Databricks workloads.
Architecture: ADLS Gen2 + Compute Engines
Pattern: Data Lake Analytics
🔄 Hybrid Data Integration¶
Connect on-premises and cloud data sources.
Architecture: Data Factory + Private Link + ADLS Gen2
Pattern: Hybrid Integration
📦 Data Archival & Compliance¶
Long-term retention with cost-effective archival.
Architecture: Lifecycle Policies + Archive Tier
Pattern: Data Retention Strategy
📊 Pricing Guide¶
💰 Cost Components¶
| Component | Pricing Model | Key Factors | Optimization Tips |
|---|---|---|---|
| Storage | Per GB/month | Tier (Hot/Cool/Archive) | Use lifecycle policies |
| Operations | Per 10,000 operations | Operation type | Batch operations |
| Data Transfer | Per GB | Egress volume and destination region | Use local processing |
| Metadata | Included | - | No additional cost |
💡 Storage Tiers Comparison¶
| Tier | Use Case | Storage Cost | Access Cost | Minimum Duration |
|---|---|---|---|---|
| Hot | Frequently accessed data | Highest | Lowest | None |
| Cool | Infrequently accessed (30+ days) | Lower | Higher | 30 days |
| Archive | Rarely accessed (180+ days) | Lowest | Highest | 180 days |
🎯 Cost Optimization Strategies¶
- Implement Lifecycle Policies: Auto-transition data to cooler tiers
- Optimize File Sizes: Larger files reduce operation costs
- Use Local Redundancy: LRS vs GRS based on requirements
- Monitor Access Patterns: Identify candidates for tier changes (a rough cost comparison follows this list)
- Leverage Reserved Capacity: Commit to 1-3 years for discounts
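To reason about tiering decisions, it can help to sketch the trade-off numerically. The per-GB rates below are placeholders, not current Azure prices; check the pricing page for your region:

```python
# Hypothetical per-GB/month rates for illustration only (not real Azure prices).
RATES = {"hot": 0.018, "cool": 0.010, "archive": 0.002}

def monthly_storage_cost(gb_by_tier: dict) -> float:
    """Sum storage cost across tiers for the given GB per tier."""
    return sum(RATES[tier] * gb for tier, gb in gb_by_tier.items())

# 10 TB of mostly cold data: keep it all Hot vs. tier most of it down.
# Access/retrieval costs are omitted; Archive reads are slow and expensive.
all_hot = monthly_storage_cost({"hot": 10_240})
tiered = monthly_storage_cost({"hot": 1_024, "cool": 4_096, "archive": 5_120})
print(f"all hot: ${all_hot:.2f}/mo, tiered: ${tiered:.2f}/mo")
```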
🚀 Quick Start Guide¶
1️⃣ Create Storage Account with Hierarchical Namespace¶
```bash
# Create resource group
az group create --name rg-datalake-demo --location eastus

# Create ADLS Gen2 storage account
az storage account create \
    --name adlsgen2demo \
    --resource-group rg-datalake-demo \
    --location eastus \
    --sku Standard_LRS \
    --kind StorageV2 \
    --enable-hierarchical-namespace true

# Create container (file system); --auth-mode login uses your Azure AD sign-in
az storage fs create \
    --name datalake \
    --account-name adlsgen2demo \
    --auth-mode login
```
2️⃣ Create Directory Structure¶
```bash
# Create medallion architecture folders
az storage fs directory create --name bronze --file-system datalake --account-name adlsgen2demo --auth-mode login
az storage fs directory create --name silver --file-system datalake --account-name adlsgen2demo --auth-mode login
az storage fs directory create --name gold --file-system datalake --account-name adlsgen2demo --auth-mode login
```
3️⃣ Set Access Control Lists (ACLs)¶
```bash
# Grant a principal rwx on /bronze. ACL entries use Azure AD object IDs, not
# user names, and --acl replaces the whole ACL, so include the base entries.
az storage fs access set \
    --acl "user::rwx,group::r-x,other::---,user:<object-id>:rwx" \
    --path bronze \
    --file-system datalake \
    --account-name adlsgen2demo \
    --auth-mode login

# Default ACL entries are inherited by new items created under the directory
az storage fs access set \
    --acl "user::rwx,group::r-x,other::---,user:<object-id>:rwx,default:user::rwx,default:group::r-x,default:other::---,default:user:<object-id>:rwx" \
    --path bronze \
    --file-system datalake \
    --account-name adlsgen2demo \
    --auth-mode login
```
4️⃣ Upload Data with Python SDK¶
```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Initialize client
credential = DefaultAzureCredential()
service_client = DataLakeServiceClient(
    account_url="https://adlsgen2demo.dfs.core.windows.net",
    credential=credential
)

# Get file system client
file_system_client = service_client.get_file_system_client("datalake")

# Upload file
file_client = file_system_client.get_file_client("bronze/sales/data.csv")
with open("local_data.csv", "rb") as data:
    file_client.upload_data(data, overwrite=True)

print("File uploaded successfully!")
```
5️⃣ Query Data with Synapse Serverless SQL¶
```sql
-- Create external data source (serverless SQL pool)
CREATE EXTERNAL DATA SOURCE DataLake
WITH (
    LOCATION = 'https://adlsgen2demo.dfs.core.windows.net/datalake'
);

-- Query CSV files
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'bronze/sales/*.csv',
    DATA_SOURCE = 'DataLake',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS sales_data;
```
🔧 Configuration & Management¶
🛡️ Security Best Practices¶
Recommended Security Configuration:
- Enable Azure AD Authentication: Use managed identities
- Implement Network Security: Private endpoints and firewalls
- Use Customer-Managed Keys: For encryption at rest
- Enable Soft Delete: Protect against accidental deletion
- Configure Access Logging: Monitor all access patterns
```python
# Example: Configure firewall rules (management plane)
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import (
    IPRule, NetworkRuleSet, StorageAccountUpdateParameters
)

subscription_id = "<subscription-id>"  # placeholder
storage_client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

# Deny by default, allow one IP range, and keep trusted Azure services working
storage_client.storage_accounts.update(
    resource_group_name="rg-datalake",
    account_name="adlsgen2demo",
    parameters=StorageAccountUpdateParameters(
        network_rule_set=NetworkRuleSet(
            default_action="Deny",
            ip_rules=[IPRule(ip_address_or_range="203.0.113.0/24")],
            virtual_network_rules=[],
            bypass="AzureServices",
        )
    ),
)
```
📖 Security Guide →
⚡ Performance Tuning¶
Key Performance Factors:
- File Size: Optimal range is 256 MB to 1 GB
- Partitioning: Use partition columns for filtering
- Parallel Operations: Leverage multi-threading for uploads/downloads
- Network Proximity: Co-locate compute and storage
```python
# Example: Parallel file upload (reuses file_system_client from the Quick Start)
from concurrent.futures import ThreadPoolExecutor
import os

def upload_file(file_path, destination_path):
    file_client = file_system_client.get_file_client(destination_path)
    with open(file_path, "rb") as data:
        file_client.upload_data(data, overwrite=True)
    return f"Uploaded {file_path}"

# Upload multiple files in parallel
files = ["file1.csv", "file2.csv", "file3.csv"]
with ThreadPoolExecutor(max_workers=10) as executor:
    results = executor.map(
        lambda f: upload_file(f, f"bronze/{os.path.basename(f)}"),
        files
    )
    for result in results:
        print(result)
```
📊 Monitoring & Diagnostics¶
Key Metrics to Monitor:
- Availability: Storage account uptime
- Latency: End-to-end and server latency
- Transactions: Success rate and error types
- Capacity: Used capacity and growth trends
```python
# Example: Query metrics with Azure Monitor
from datetime import datetime, timedelta, timezone
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient

subscription_id = "<subscription-id>"  # placeholder
monitor_client = MonitorManagementClient(DefaultAzureCredential(), subscription_id)

# timespan is an ISO-8601 start/end pair; here, the last hour
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

metrics = monitor_client.metrics.list(
    resource_uri=f"/subscriptions/{subscription_id}/resourceGroups/rg-datalake/providers/Microsoft.Storage/storageAccounts/adlsgen2demo",
    timespan=f"{start.isoformat()}/{end.isoformat()}",
    interval="PT5M",
    metricnames="Transactions,Availability,SuccessE2ELatency",
    aggregation="Average"
)

for metric in metrics.value:
    print(f"{metric.name.value}: {metric.timeseries[0].data}")
```
🔗 Integration Patterns¶
Azure Synapse Analytics Integration¶
Direct integration for serverless and dedicated SQL pools.
```sql
-- Define the Parquet file format referenced by the table below
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);

-- Create external table in Synapse
CREATE EXTERNAL TABLE SalesData (
    SaleID INT,
    Product NVARCHAR(100),
    Amount DECIMAL(10,2),
    SaleDate DATE
)
WITH (
    LOCATION = 'gold/sales/',
    DATA_SOURCE = DataLake,
    FILE_FORMAT = ParquetFormat
);

-- Query external data
SELECT Product, SUM(Amount) as TotalSales
FROM SalesData
WHERE SaleDate >= '2024-01-01'
GROUP BY Product;
```
Azure Databricks Integration¶
Mount ADLS Gen2 for Spark processing.
```python
# Mount ADLS Gen2 in Databricks (keep the client secret in a secret scope,
# not in notebook code)
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<client-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope>", key="<secret-key>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}

dbutils.fs.mount(
    source="abfss://datalake@adlsgen2demo.dfs.core.windows.net/",
    mount_point="/mnt/datalake",
    extra_configs=configs
)

# Read data using Spark
df = spark.read.format("delta").load("/mnt/datalake/gold/sales")
df.show()
```
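Note that mounts are visible to every user in the workspace; on current runtimes you can also skip mounting and read directly from the `abfss://` URI after applying the same OAuth settings through Spark configuration.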
Azure Data Factory Integration¶
Build ETL pipelines with ADLS Gen2 as source and sink.
```json
{
    "name": "CopyToDataLake",
    "type": "Copy",
    "inputs": [{
        "referenceName": "SourceDataset",
        "type": "DatasetReference"
    }],
    "outputs": [{
        "referenceName": "ADLSGen2Dataset",
        "type": "DatasetReference"
    }],
    "typeProperties": {
        "source": {
            "type": "SqlServerSource"
        },
        "sink": {
            "type": "ParquetSink",
            "storeSettings": {
                "type": "AzureBlobFSWriteSettings",
                "copyBehavior": "PreserveHierarchy"
            }
        }
    }
}
```
📚 Learning Resources¶
📖 Deep Dive Guides¶
- Architecture Patterns
- Best Practices
- Code Examples
🆘 Troubleshooting¶
🔍 Common Issues¶
- Connection & Access Issues
- Performance Problems
- ACL Configuration Errors
📞 Getting Help¶
- Azure Support: Create support ticket in Azure Portal
- Community Forums: Microsoft Q&A, Stack Overflow
- Documentation: Microsoft Learn official docs
- GitHub: Azure SDK issues and samples
Last Updated: 2025-01-28 · Service Version: General Availability · Documentation Status: Complete