Azure Data Lake Storage Gen2¶
See also: generic Azure reference
For service-agnostic deep-dive content on Azure Data Lake Storage Gen2 — architecture, feature reference, code samples, and patterns independent of CSA-in-a-Box — see Azure Data Lake Storage Gen2 in the reference library.
Overview¶
Azure Data Lake Storage Gen2 (ADLS Gen2) is the foundation storage layer for every CSA-in-a-Box deployment. It combines the scalability and cost-efficiency of Azure Blob Storage with a hierarchical namespace (HNS) purpose-built for analytics workloads — giving you POSIX-like directory semantics, fine-grained ACLs, and atomic directory operations that flat object stores cannot provide.
In the CSA-in-a-Box architecture, ADLS Gen2 holds every byte of the medallion lakehouse: raw ingestion files in Bronze, cleansed and conformed Delta tables in Silver, and analytics-ready data products in Gold. Every compute engine — Databricks, Synapse, Fabric, ADF, dbt — reads from and writes to this single storage layer.
Related Guides
| Guide | Purpose |
|---|---|
| Medallion Architecture | Layer design, quality gates, and dbt transformations |
| Data Flow — Medallion | End-to-end ingestion through Bronze/Silver/Gold |
| Security & Compliance | Network isolation, encryption, and Zero Trust |
| Cost Optimization | Storage tiering and spend reduction strategies |
| Disaster Recovery | Replication, failover, and DR drills |
Architecture¶
The storage hierarchy follows a strict convention that aligns with the medallion architecture and enables predictable access control, lifecycle management, and data discovery.
graph TD
SA["Storage Account<br/><code>csainaboxdls</code><br/>HNS enabled, TLS 1.2"]
subgraph Containers
BZ["bronze/<br/>Raw ingestion"]
SV["silver/<br/>Cleansed & conformed"]
GD["gold/<br/>Analytics-ready products"]
SB["sandbox/<br/>Ad-hoc exploration"]
end
SA --> BZ
SA --> SV
SA --> GD
SA --> SB
subgraph "Directory Structure (example: bronze)"
D1["erp/"]
D2["crm/"]
D3["iot/"]
end
BZ --> D1
BZ --> D2
BZ --> D3
subgraph "Entity & Partition Layout"
E1["erp/sales_orders/"]
P1["year=2026/"]
P2["month=04/"]
P3["day=29/"]
F1["part-00000.parquet"]
end
D1 --> E1
E1 --> P1
P1 --> P2
P2 --> P3
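P3 --> F1
Canonical path pattern: container/domain/entity/year=YYYY/month=MM/day=DD/part-file, for example bronze/erp/sales_orders/year=2026/month=04/day=29/part-00000.parquet.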
Container & Directory Design¶
Container-per-Layer (recommended)¶
CSA-in-a-Box uses one container per medallion layer rather than one container per domain. This design simplifies lifecycle policies (one rule per container), aligns ACLs with data maturity, and makes cross-domain joins in Silver and Gold natural.
| Container | Purpose | Typical Writers | Typical Readers |
|---|---|---|---|
| bronze | Raw, append-only ingestion | ADF, Event Hubs Capture | Databricks, dbt |
| silver | Cleansed, conformed, deduplicated | dbt, Databricks jobs | dbt, Databricks, Synapse |
| gold | Star schemas, aggregates, products | dbt, Databricks jobs | Power BI, APIs, ML pipelines |
| sandbox | Ad-hoc exploration, prototyping | Data scientists | Data scientists (self-service) |
Hierarchical Namespace Benefits¶
Enabling HNS is mandatory for CSA-in-a-Box. Without it:
- Directory renames are O(n) object copies instead of atomic metadata operations
- ACLs cannot be applied at the directory level
- Delta Lake VACUUM and OPTIMIZE performance degrades significantly
- Spark partition discovery scans every blob prefix
HNS Cannot Be Added Later
Hierarchical namespace must be enabled at storage account creation time. Converting an existing flat-namespace account requires a migration. Always provision new accounts with HNS enabled.
Path Naming Conventions¶
| Level | Convention | Example |
|---|---|---|
| Container | Lowercase medallion layer | bronze, silver, gold |
| Domain | Snake_case business domain | erp, crm, iot_telemetry |
| Entity | Snake_case source entity | sales_orders, customers |
| Partition | Hive-style key=value | year=2026/month=04/day=29 |
| Files | Engine-generated part files | part-00000-*.snappy.parquet |
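The naming levels in the table compose mechanically into a full path. The following is a minimal sketch, assuming a daily Hive-style partition layout; the function name `lake_path` is hypothetical and not part of CSA-in-a-Box.

```python
from datetime import date


def lake_path(account: str, container: str, domain: str, entity: str, day: date) -> str:
    """Build the abfss:// directory URI for one daily partition of an entity."""
    # Hive-style key=value segments, zero-padded as in the convention table
    partition = f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}"
    return (
        f"abfss://{container}@{account}.dfs.core.windows.net/"
        f"{domain}/{entity}/{partition}"
    )


print(lake_path("csainaboxdls", "bronze", "erp", "sales_orders", date(2026, 4, 29)))
```

Generating paths from one helper rather than formatting strings by hand keeps zero-padding and segment order consistent across pipelines.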
Partition Layout Strategies¶
| Strategy | When to Use | Example Path Suffix |
|---|---|---|
| Date (Y/M/D) | Append-heavy event data, time-series ingestion | year=2026/month=04/day=29 |
| Date (Y/M) | Monthly batch loads, financial closes | year=2026/month=04 |
| Ingestion timestamp | CDC / watermark-based loads | ingest_date=2026-04-29 |
| Region / tenant | Multi-tenant isolation | region=east/tenant=agency_a |
| None (flat) | Small reference tables (< 1 GB) | (files directly under entity) |
Partition Cardinality
Keep partition cardinality under 10,000 per table. Over-partitioning (e.g., partitioning by customer ID with millions of values) creates excessive small files and metadata overhead.
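A quick estimate before choosing a partition scheme can show whether a layout stays under the guideline. This is a rough sketch, assuming a Y/M/D layout optionally combined with one extra key (for example region); the helper names are hypothetical.

```python
from datetime import date

MAX_PARTITIONS = 10_000  # guideline from the note above


def daily_partition_count(start: date, end: date, values_per_day: int = 1) -> int:
    """Leaf partitions for a Y/M/D layout between two dates (inclusive).

    values_per_day is the cardinality of any additional partition key.
    """
    days = (end - start).days + 1
    return days * values_per_day


def within_guideline(count: int) -> bool:
    """True when the partition count stays under the 10,000 guideline."""
    return count < MAX_PARTITIONS


five_years = daily_partition_count(date(2022, 1, 1), date(2026, 12, 31))
print(five_years, within_guideline(five_years))
```

Five years of daily partitions fits comfortably, but multiplying the same range by a ten-value key would not, which suggests moving to a coarser date grain in that case.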
Access Control¶
ADLS Gen2 provides four complementary access control mechanisms. In CSA-in-a-Box deployments, the preferred pattern is RBAC for broad access, ACLs for fine-grained directory permissions, and managed identities for all service-to-service communication.
RBAC (Role-Based Access Control)¶
RBAC roles grant access at the storage account or container level and are the first layer of defense.
| Role | Scope | Grants | CSA-in-a-Box Usage |
|---|---|---|---|
| Storage Blob Data Owner | Storage account | Full control including ACL management | Platform admins only |
| Storage Blob Data Contributor | Container | Read + write + delete (no ACL mgmt) | ADF, Databricks service principals |
| Storage Blob Data Reader | Container | Read-only | Power BI, Synapse serverless, Purview |
| Storage Blob Delegator | Storage account | Generate user-delegation SAS tokens | Rare — short-lived SAS scenarios |
# Assign Reader on gold container to the Power BI service principal
az role assignment create \
--assignee <pbi-service-principal-id> \
--role "Storage Blob Data Reader" \
--scope "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<acct>/blobServices/default/containers/gold"
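The `--scope` value above follows a fixed resource ID layout, so it can be assembled rather than typed. A small sketch, with a hypothetical helper name and placeholder identifiers:

```python
def container_scope(subscription_id: str, resource_group: str, account: str, container: str) -> str:
    """Resource ID of a blob container, usable as an RBAC assignment scope."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{account}"
        f"/blobServices/default/containers/{container}"
    )


# Placeholder values for illustration only
print(container_scope("<sub>", "rg-data-platform", "csainaboxdls", "gold"))
```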
ACLs (Access Control Lists)¶
ACLs provide POSIX-like permissions (read, write, execute) on individual directories and files. They are essential when different teams need different access within the same container.
| ACL Type | Purpose |
|---|---|
| Access ACL | Controls access to the specific object (file or directory) |
| Default ACL | Template inherited by new child objects in a directory |
# Grant the data-engineering group rwx on bronze/erp/ and make it the default for new children.
# "access set" replaces the whole ACL, so the access and default entries go in one call,
# together with the base entries (adjust user::/group::/other:: to your policy).
az storage fs access set \
--acl "user::rwx,group::r-x,other::---,group:<data-eng-group-id>:rwx,default:user::rwx,default:group::r-x,default:other::---,default:group:<data-eng-group-id>:rwx" \
--path erp \
--file-system bronze \
--account-name csainaboxdls \
--auth-mode login
ACL Inheritance
Default ACLs only apply to new child objects created after the default is set. Existing children are unaffected. To backfill existing directory trees, use az storage fs access update-recursive to add entries, or az storage fs access set-recursive to replace the ACLs outright.
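Because access and default entries share one short-form syntax, the named entries for a principal can be generated together. This sketch only builds the string for named entries; it does not include the base `user::`, `group::`, and `other::` entries, and the function name is hypothetical.

```python
def acl_spec(principal_type: str, object_id: str, perms: str, include_default: bool = True) -> str:
    """Build a short-form ACL string for one named user or group.

    perms is a three-character string such as "rwx" or "r-x".
    """
    if principal_type not in {"user", "group"}:
        raise ValueError("principal_type must be 'user' or 'group'")
    # Each position must be its own letter (r, w, x) or a dash
    if len(perms) != 3 or any(c not in (p, "-") for c, p in zip(perms, "rwx")):
        raise ValueError("perms must look like 'rwx', 'r-x', 'r--', ...")
    entry = f"{principal_type}:{object_id}:{perms}"
    entries = [entry]
    if include_default:
        # The default entry is the template inherited by new children
        entries.append(f"default:{entry}")
    return ",".join(entries)


print(acl_spec("group", "<data-eng-group-id>", "rwx"))
```

Emitting both entries from one call makes it harder to grant access on a directory while forgetting the matching default entry.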
Managed Identity Access¶
All CSA-in-a-Box services authenticate to ADLS with Entra ID identities (managed identities where the service supports them, service principals otherwise), never with shared keys or connection strings.
| Service | Identity Type | Recommended Role |
|---|---|---|
| Azure Data Factory | System-assigned MI | Storage Blob Data Contributor |
| Databricks | Service principal (Unity Catalog) | Storage Blob Data Contributor |
| Synapse serverless | Workspace MI | Storage Blob Data Reader |
| Purview | System-assigned MI | Storage Blob Data Reader |
| Power BI | Service principal | Storage Blob Data Reader |
SAS Tokens — When to Avoid¶
Avoid SAS Tokens in Production Pipelines
SAS tokens are opaque, difficult to audit, and cannot be revoked individually. Use them only for time-limited, external-party data sharing where managed identity is not possible. Never embed SAS tokens in application code or pipeline configurations.
If a SAS token is required, always use user-delegation SAS (backed by Entra ID) rather than account-key SAS, and set the shortest practical expiry.
Network Security¶
| Control | Purpose | CSA-in-a-Box Default |
|---|---|---|
| Storage Firewall | Restrict access to allowed VNets and IPs | Enabled |
| Private Endpoints | Route traffic through Azure backbone, no public IP | Enabled (production) |
| Disable public access | Block all internet-routable requests | Yes (production) |
| Trusted Azure services | Allow ADF, Purview, Synapse through firewall | Enabled |
# Create a private endpoint for the DFS sub-resource
az network private-endpoint create \
--name pe-csainaboxdls-dfs \
--resource-group rg-data-platform \
--vnet-name vnet-data \
--subnet snet-private-endpoints \
--private-connection-resource-id <storage-account-resource-id> \
--group-id dfs \
--connection-name pec-csainaboxdls-dfs
Performance¶
File Size Optimization¶
File size is the single biggest performance lever for ADLS Gen2 workloads. The hierarchical namespace accelerates metadata operations, but read throughput is still governed by file size and parallelism.
| File Size | Impact | Recommendation |
|---|---|---|
| < 8 MB | Excessive metadata overhead, slow listing, poor throughput | Avoid — compact immediately |
| 8 MB -- 128 MB | Acceptable for streaming micro-batches | Compact on schedule |
| 256 MB -- 1 GB | Optimal for analytical reads (Spark, Synapse) | Target for Silver/Gold |
| > 2 GB | Diminishing returns, harder to parallelize | Split during write |
Small File Problem¶
Small files are the most common performance anti-pattern in lakehouse storage. They arise from:
- Streaming micro-batches writing every few seconds
- Over-partitioned tables
- CDC pipelines with per-record commits
Remediation:
-- Delta Lake: compact small files into optimal size
OPTIMIZE delta.`abfss://silver@csainaboxdls.dfs.core.windows.net/erp/sales_orders`
WHERE year = 2026 AND month = 4;
-- Delta Lake: Z-order to co-locate related values and improve data skipping
OPTIMIZE delta.`abfss://gold@csainaboxdls.dfs.core.windows.net/erp/fact_sales`
ZORDER BY (customer_id, order_date);
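Before scheduling compaction it can help to estimate how many files a partition should end up with. The sketch below assumes a 512 MB target (inside the 256 MB to 1 GB band recommended above) and counts files under 128 MB as small; both thresholds and the function name are illustrative assumptions.

```python
MB = 1024 * 1024


def compaction_plan(file_sizes_bytes, target_bytes=512 * MB, small_bytes=128 * MB):
    """Summarize a partition: how many small files exist and how many
    files of roughly target_bytes the same data would occupy."""
    total = sum(file_sizes_bytes)
    small = sum(1 for size in file_sizes_bytes if size < small_bytes)
    # Integer ceiling division avoids float rounding on large byte counts
    target_files = -(-total // target_bytes) if total else 0
    return {
        "input_files": len(file_sizes_bytes),
        "small_files": small,
        "total_bytes": total,
        "target_files": target_files,
    }


# 1,000 micro-batch files of 4 MB each
print(compaction_plan([4 * MB] * 1000))
```

A large gap between `input_files` and `target_files` is a reasonable signal that the partition is worth an OPTIMIZE run.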
Throughput Limits¶
| Metric | Standard (GPv2 HNS) | Premium (BlockBlob HNS) |
|---|---|---|
| Max ingress per account | 25 Gbps | 45 Gbps |
| Max egress per account | 50 Gbps | 75 Gbps |
| Max request rate per account | 20,000 IOPS | 75,000 IOPS |
| Max single-file throughput | ~60 MBps | ~250 MBps |
Premium for Hot Path
Use Premium BlockBlobStorage with HNS for streaming ingestion or interactive query workloads where latency matters. Standard GPv2 is sufficient for batch-heavy Bronze/Silver/Gold pipelines.
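The account-level limits translate into a lower bound on transfer time for large loads. A back-of-the-envelope sketch, assuming decimal gigabits (1 Gbps = 10^9 bits per second) and that the workload can actually saturate the account limit, which is rarely true in practice:

```python
def min_transfer_seconds(size_bytes: int, limit_gbps: float) -> float:
    """Best-case seconds to move size_bytes at an account throughput limit."""
    bytes_per_second = limit_gbps * 1e9 / 8  # bits to bytes
    return size_bytes / bytes_per_second


# 3.125 TB at the 25 Gbps standard ingress figure from the table
print(min_transfer_seconds(3_125_000_000_000, 25))
```

Real transfers are usually bounded by client parallelism and file size before the account limit is reached, so treat the result as a floor.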
Parallel Upload / Download¶
# Python SDK: parallel upload with max_concurrency
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Resolves a managed identity, environment credentials, or an az login session
default_credential = DefaultAzureCredential()
service = DataLakeServiceClient(
    account_url="https://csainaboxdls.dfs.core.windows.net",
    credential=default_credential,
)
file_client = service.get_file_client("bronze", "erp/sales_orders/full_extract.parquet")
with open("full_extract.parquet", "rb") as f:
    file_client.upload_data(
        f,
        overwrite=True,
        max_concurrency=8,  # parallel transfer threads
        chunk_size=100 * 1024 * 1024,  # 100 MB chunks
    )
Data Lifecycle Management¶
Lifecycle management policies automatically tier or delete data based on age, reducing storage costs without manual intervention.
Policy Configuration¶
{
"rules": [
{
"enabled": true,
"name": "bronze-tiering",
"type": "Lifecycle",
"definition": {
"actions": {
"baseBlob": {
"tierToCool": {
"daysAfterModificationGreaterThan": 30
},
"tierToArchive": {
"daysAfterModificationGreaterThan": 90
},
"delete": { "daysAfterModificationGreaterThan": 365 }
}
},
"filters": {
"blobTypes": ["blockBlob"],
"prefixMatch": ["bronze/"]
}
}
},
{
"enabled": true,
"name": "gold-tiering",
"type": "Lifecycle",
"definition": {
"actions": {
"baseBlob": {
"tierToCool": { "daysAfterModificationGreaterThan": 90 }
}
},
"filters": {
"blobTypes": ["blockBlob"],
"prefixMatch": ["gold/"]
}
}
}
]
}
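Policies like the one above are easy to get wrong by hand, for instance by archiving before cooling. The following sketch generates the same rule structure and rejects out-of-order thresholds; `tiering_rule` is a hypothetical helper, and the ordering check is a local sanity check rather than something the service is claimed to enforce.

```python
import json


def tiering_rule(name, prefix, cool_days=None, archive_days=None, delete_days=None):
    """Build one lifecycle rule, requiring cool < archive < delete thresholds."""
    steps = [("tierToCool", cool_days), ("tierToArchive", archive_days), ("delete", delete_days)]
    configured = [(action, days) for action, days in steps if days is not None]
    if not configured:
        raise ValueError("at least one action is required")
    days = [d for _, d in configured]
    if any(d < 0 for d in days) or days != sorted(set(days)):
        raise ValueError("thresholds must be non-negative and strictly increasing")
    return {
        "enabled": True,
        "name": name,
        "type": "Lifecycle",
        "definition": {
            "actions": {
                "baseBlob": {
                    action: {"daysAfterModificationGreaterThan": d}
                    for action, d in configured
                }
            },
            "filters": {"blobTypes": ["blockBlob"], "prefixMatch": [prefix]},
        },
    }


policy = {
    "rules": [
        tiering_rule("bronze-tiering", "bronze/", cool_days=30, archive_days=90, delete_days=365),
        tiering_rule("gold-tiering", "gold/", cool_days=90),
    ]
}
print(json.dumps(policy, indent=2))
```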
Cost Impact by Tier¶
Pricing below is approximate (East US, LRS, per GB/month) and should be validated against the Azure pricing calculator.
| Tier | Storage Cost | Read Cost (per 10K ops) | Retrieval Cost | Min Retention | Best For |
|---|---|---|---|---|---|
| Hot | $0.018 | $0.004 | Free | None | Active Silver/Gold tables |
| Cool | $0.010 | $0.010 | Free | 30 days | Bronze after 30 days |
| Cold | $0.0036 | $0.065 | $0.03/GB | 90 days | Bronze after 90 days |
| Archive | $0.0012 | $5.00 | $0.022/GB | 180 days | Compliance / legal retention |
Archive Rehydration
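Archive tier reads require rehydration (1--15 hours for standard, < 1 hour for high priority at higher cost). Never place data in Archive that your pipelines query on a regular schedule. Note also that the Archive tier is not available on ZRS, GZRS, or RA-GZRS accounts, so confirm the account's redundancy before adding a tierToArchive action.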
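The per-GB rates in the table make it simple to estimate what a tiering rule saves. This sketch uses only the approximate storage prices listed above and deliberately ignores transaction, retrieval, and early-deletion charges; the names are hypothetical.

```python
# Approximate storage prices from the table above (USD per GB per month)
STORAGE_PER_GB = {"hot": 0.018, "cool": 0.010, "cold": 0.0036, "archive": 0.0012}


def monthly_storage_cost(gb: float, tier: str) -> float:
    """Storage-only monthly cost for gb gigabytes in the given tier."""
    return gb * STORAGE_PER_GB[tier]


def monthly_saving(gb: float, from_tier: str, to_tier: str) -> float:
    """Storage-only saving from moving gb gigabytes between tiers."""
    return monthly_storage_cost(gb, from_tier) - monthly_storage_cost(gb, to_tier)


# 10 TB (10,000 GB) of Bronze data moved from Hot to Cool
print(round(monthly_saving(10_000, "hot", "cool"), 2))
```

For data that is still read often, the higher read and retrieval prices of cooler tiers can outweigh this saving, so the estimate is most useful for rarely accessed Bronze history.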
Immutable Storage¶
For regulated workloads (FedRAMP, SEC 17a-4, HIPAA), ADLS Gen2 supports immutable storage policies.
| Policy Type | Behavior | Use Case |
|---|---|---|
| Time-based retention | Data cannot be deleted or modified until expiry | Regulatory retention (7 yr) |
| Legal hold | Data is immutable until all holds are cleared | Active litigation, audit |
# Set 7-year immutable retention on the bronze container
az storage container immutability-policy create \
--account-name csainaboxdls \
--container-name bronze \
--period 2555 \
--allow-protected-append-writes true
Redundancy & Disaster Recovery¶
Redundancy Options¶
| Option | Copies | Regions | Durability (annual) | ~Cost Multiplier | CSA-in-a-Box Usage |
|---|---|---|---|---|---|
| LRS | 3 | 1 | 11 nines | 1.0x | Dev/test, sandbox |
| ZRS | 3 | 1 (3 AZs) | 12 nines | 1.25x | Production (single region) |
| GRS | 6 | 2 | 16 nines | 2.0x | DR-required workloads |
| GZRS | 6 | 2 (3 AZs primary) | 16 nines | 2.5x | Mission-critical production |
CSA-in-a-Box Default
Production deployments default to ZRS for cost-effective zone resilience. Enable GRS/GZRS only when cross-region DR is an explicit requirement and the 2x cost premium is justified.
Cross-Region Replication¶
For GRS/GZRS accounts, replication is asynchronous with no SLA on replication lag. For workloads requiring deterministic RPO:
- Object replication rules: replicate specific containers (e.g., gold only) to a secondary storage account in another region
- ADF copy activities: schedule periodic cross-region copies with validation
- azcopy sync: manual or scripted DR as a last resort
# azcopy sync for manual DR (last resort)
azcopy sync \
"https://csainaboxdls.dfs.core.windows.net/gold" \
"https://csainaboxdr.dfs.core.windows.net/gold" \
--recursive \
--delete-destination=false
Object Replication Rules¶
Object replication provides asynchronous, policy-based replication between two storage accounts. Unlike GRS, you choose which containers and prefixes to replicate.
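# Create replication policy: gold container → DR account.
# The policy is created on the destination account first; the resulting policy
# is then applied to the source account.
az storage account or-policy create \
--account-name csainaboxdr \
--source-account csainaboxdls \
--destination-account csainaboxdr \
--source-container gold \
--destination-container gold \
--min-creation-time "2026-01-01T00:00:00Z"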
Delta Lake on ADLS Gen2¶
Delta Lake is the default table format in CSA-in-a-Box. Its transaction log (_delta_log) lives alongside data files in ADLS Gen2, making the storage account the single source of truth.
Transaction Log Performance¶
The Delta transaction log writes a JSON file per commit and periodically consolidates into Parquet checkpoints. On ADLS Gen2 with HNS:
- Listing is fast: with HNS, listing _delta_log reads that directory's entries directly instead of enumerating every blob that shares a prefix
- Checkpointing matters: without checkpoints, readers must replay every JSON commit file
-- Configure checkpoint interval (default is every 10 commits)
ALTER TABLE delta.`abfss://silver@csainaboxdls.dfs.core.windows.net/erp/customers`
SET TBLPROPERTIES ('delta.checkpointInterval' = '10');
VACUUM with ADLS¶
VACUUM removes data files no longer referenced by the transaction log. On ADLS Gen2, this triggers physical deletes against the hierarchical namespace.
-- Remove unreferenced files older than 7 days (default retention)
VACUUM delta.`abfss://silver@csainaboxdls.dfs.core.windows.net/erp/customers`
RETAIN 168 HOURS;
VACUUM and Time Travel
VACUUM permanently deletes old file versions. After vacuuming, time-travel queries beyond the retention window will fail. Set RETAIN to at least your longest query or rollback window.
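One way to apply the rule above is to derive the RETAIN value from the windows you need to protect. A minimal sketch, assuming the 168-hour default is treated as a floor; the helper names are hypothetical.

```python
import math

DEFAULT_RETAIN_HOURS = 168  # 7 days, the default retention noted above


def vacuum_retain_hours(time_travel_days: float, longest_job_hours: float = 0.0) -> int:
    """Retention that covers the rollback window and the longest running
    reader, never dropping below the 7-day default."""
    needed = max(time_travel_days * 24, longest_job_hours, DEFAULT_RETAIN_HOURS)
    return math.ceil(needed)


def vacuum_statement(table_path: str, retain_hours: int) -> str:
    """Render the VACUUM statement for a path-based Delta table."""
    return f"VACUUM delta.`{table_path}` RETAIN {retain_hours} HOURS"


hours = vacuum_retain_hours(time_travel_days=30)
print(vacuum_statement("abfss://silver@csainaboxdls.dfs.core.windows.net/erp/customers", hours))
```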
Concurrent Writes¶
Delta Lake uses optimistic concurrency control (OCC) on ADLS Gen2. Concurrent writers to the same table succeed as long as they do not modify the same partitions. For high-contention tables:
- Enable write-ahead log compaction to reduce commit conflicts
- Partition by a key that distributes writes across non-overlapping partitions
- Use MERGE with narrow predicates to minimize conflict windows
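Under optimistic concurrency a losing writer is expected to retry, so writers on contended tables are often wrapped in a retry loop. The sketch below shows the general pattern with exponential backoff and jitter; `CommitConflict` is a hypothetical stand-in for whatever conflict exception your Delta client raises, and `commit` is any callable that performs the write.

```python
import random
import time


class CommitConflict(Exception):
    """Hypothetical stand-in for an engine-specific commit conflict error."""


def commit_with_retry(commit, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call commit(); on conflict, back off exponentially with jitter and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return commit()
        except CommitConflict:
            if attempt == max_attempts:
                raise  # give up and surface the conflict to the caller
            delay = base_delay * 2 ** (attempt - 1)
            # Jitter spreads out writers that conflicted at the same moment
            sleep(delay + random.uniform(0, delay))


# Demo: a writer that conflicts twice, then succeeds (no real sleeping)
attempts = {"n": 0}


def flaky_commit():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise CommitConflict()
    return "committed"


print(commit_with_retry(flaky_commit, sleep=lambda seconds: None))
```

Retries reduce failed jobs but not contention itself, so they complement the partitioning and predicate advice above rather than replace it.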
Integration with CSA-in-a-Box Services¶
Databricks — Direct Access (recommended)¶
CSA-in-a-Box prefers direct access via abfss:// over legacy DBFS mounts. Direct access integrates with Unity Catalog for governance and supports credential passthrough.
# Read Delta table directly — no mounts needed
df = spark.read.format("delta").load(
"abfss://silver@csainaboxdls.dfs.core.windows.net/erp/customers"
)
# Unity Catalog external location (managed by Terraform/Bicep)
# CREATE EXTERNAL LOCATION silver_erp
# URL 'abfss://silver@csainaboxdls.dfs.core.windows.net/erp'
# WITH (STORAGE CREDENTIAL csa_inabox_cred);
Avoid DBFS Mounts
DBFS mounts (/mnt/...) bypass Unity Catalog governance, do not support fine-grained access, and are deprecated in Databricks. Always use abfss:// paths with external locations.
Synapse Serverless SQL¶
-- External data source pointing to ADLS Gen2
-- (TYPE = HADOOP applies to dedicated SQL pools only; serverless pools omit it)
CREATE EXTERNAL DATA SOURCE GoldLake
WITH (
LOCATION = 'https://csainaboxdls.dfs.core.windows.net/gold'
);
-- Query Delta table via OPENROWSET; BULK points at the table's root folder, without a wildcard
SELECT *
FROM OPENROWSET(
BULK 'erp/fact_sales/',
DATA_SOURCE = 'GoldLake',
FORMAT = 'DELTA'
) AS rows
WHERE year = 2026;
Azure Data Factory¶
ADF accesses ADLS Gen2 through a linked service backed by the factory's managed identity.
{
"name": "ls_adls_csainabox",
"type": "Microsoft.DataFactory/factories/linkedservices",
"properties": {
"type": "AzureBlobFS",
"typeProperties": {
"url": "https://csainaboxdls.dfs.core.windows.net"
},
"connectVia": { "referenceName": "AutoResolveIntegrationRuntime", "type": "IntegrationRuntimeReference" }
}
}
Microsoft Purview¶
Purview scans ADLS Gen2 to populate the data catalog with schema, lineage, and classification metadata. Register the storage account as a data source and schedule scans against each container.
# Register ADLS Gen2 as a Purview data source (via REST API)
# --resource requests a token for the Purview data plane instead of Azure Resource Manager
az rest --method PUT \
--uri "https://<purview-account>.purview.azure.com/scan/datasources/csainaboxdls?api-version=2022-07-01-preview" \
--resource "https://purview.azure.net" \
--body '{
"kind": "AdlsGen2",
"properties": {
"endpoint": "https://csainaboxdls.dfs.core.windows.net",
"resourceGroup": "rg-data-platform",
"subscriptionId": "<sub-id>"
}
}'
dbt — External Locations¶
dbt models in CSA-in-a-Box read from and write to ADLS Gen2 via Databricks external locations or Synapse external tables.
# dbt source configuration (sources.yml)
sources:
  - name: bronze_erp
    schema: bronze_erp
    meta:
      external_location: "abfss://bronze@csainaboxdls.dfs.core.windows.net/erp/{name}"
    tables:
      - name: sales_orders
      - name: customers
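To make the template concrete, the sketch below expands the `external_location` value for each table, assuming the `{name}` placeholder is replaced by the table name. How the placeholder is actually resolved depends on the dbt adapter or macro in use; the function name here is hypothetical.

```python
LOCATION_TEMPLATE = "abfss://bronze@csainaboxdls.dfs.core.windows.net/erp/{name}"


def resolve_locations(template: str, tables: list[str]) -> dict[str, str]:
    """Map each source table to the storage path the template implies."""
    return {table: template.format(name=table) for table in tables}


for table, path in resolve_locations(LOCATION_TEMPLATE, ["sales_orders", "customers"]).items():
    print(f"{table}: {path}")
```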
Monitoring & Alerting¶
Key Metrics¶
| Metric | Source | Alert Threshold | Why It Matters |
|---|---|---|---|
| Used capacity | Azure Monitor | > 80% of quota | Prevent throttling from capacity limits |
| Ingress/Egress | Azure Monitor | Sustained > 80% of limit | Detect throughput bottlenecks |
| Transaction count | Azure Monitor | Spike > 3x baseline | Identify runaway queries or scans |
| E2E latency (P99) | Azure Monitor | > 200 ms (standard) | Detect storage slowdowns |
| Availability | Azure Monitor | < 99.9% | Trigger DR procedures |
| Lifecycle policy actions | Diagnostic logs | Unexpected deletes or tiers | Catch misconfigured policies |
Diagnostic Logging¶
Enable Storage Analytics and diagnostic settings to stream logs to Log Analytics.
# Enable diagnostic settings for ADLS Gen2
az monitor diagnostic-settings create \
--name diag-csainaboxdls \
--resource "<storage-account-resource-id>/blobServices/default" \
--workspace <log-analytics-workspace-id> \
--logs '[
{"category": "StorageRead", "enabled": true, "retentionPolicy": {"enabled": true, "days": 30}},
{"category": "StorageWrite", "enabled": true, "retentionPolicy": {"enabled": true, "days": 30}},
{"category": "StorageDelete","enabled": true, "retentionPolicy": {"enabled": true, "days": 90}}
]' \
--metrics '[{"category": "Transaction", "enabled": true}]'
Capacity Alert¶
# Alert when used capacity exceeds 80% of 50 TiB quota
az monitor metrics alert create \
--name alert-adls-capacity-80pct \
--resource-group rg-data-platform \
--scopes <storage-account-resource-id> \
--condition "avg UsedCapacity > 43980465111040" \
--window-size 1h \
--evaluation-frequency 1h \
--action <action-group-id> \
--description "ADLS Gen2 used capacity exceeds 80% of 50 TiB"
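The threshold in the `--condition` above is 80% of 50 TiB expressed in bytes. The arithmetic can be reproduced for other quotas with a small helper (hypothetical name), using integer math so the value matches exactly:

```python
TIB = 1024 ** 4  # bytes in one tebibyte


def capacity_threshold_bytes(quota_tib: int, percent: int) -> int:
    """Bytes corresponding to the given percentage of a quota in TiB."""
    return quota_tib * TIB * percent // 100


threshold = capacity_threshold_bytes(50, 80)
print(f"avg UsedCapacity > {threshold}")
```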
Anti-Patterns¶
Common Mistakes That Degrade Performance, Security, or Cost
1. Flat namespace storage account for analytics. Without HNS, every directory rename is an O(n) copy. Delta Lake VACUUM and OPTIMIZE become painfully slow. Always enable HNS.
2. Shared access keys in pipeline configurations. Account keys grant full access to the entire storage account, and that access can only be revoked by rotating the key. Use managed identities for all service access.
3. Thousands of tiny files per partition. Streaming micro-batches without compaction create millions of small files. Run OPTIMIZE on a schedule (hourly for Silver, daily for Gold).
4. Archive tier on actively queried data. Archive rehydration takes hours and costs per-GB. Only archive data that is never queried in normal operations.
5. DBFS mounts instead of abfss:// direct access. Mounts bypass Unity Catalog governance and are deprecated. Use external locations with abfss:// paths.
6. Single storage account for all environments. Dev/test workloads sharing a production storage account risk accidental data corruption and create noisy neighbor issues. Use separate accounts per environment.
7. No default ACLs on directories. Without default ACLs, new files inherit no permissions beyond RBAC. Set default ACLs on every domain directory to ensure consistent access.
Do / Don't Quick Reference¶
| Do | Don't |
|---|---|
| Enable HNS at account creation | Create flat-namespace accounts for lakehouse workloads |
| Use abfss:// direct access paths | Mount storage as DBFS /mnt/ |
| Authenticate via managed identity / service principal | Use shared account keys or embed SAS tokens |
| Target 256 MB -- 1 GB file sizes | Leave streaming micro-batch small files uncompacted |
| Apply lifecycle policies per container | Manually move blobs between tiers |
| Set default ACLs on domain directories | Rely on RBAC alone for directory-level access |
| Enable Private Endpoints in production | Allow public network access to production accounts |
| Use ZRS (minimum) for production | Use LRS for production workloads |
| Run OPTIMIZE and VACUUM on schedule | Let Delta transaction logs grow unbounded |
| Monitor capacity and throughput metrics | Ignore storage until throttling occurs |
Deployment Checklist¶
- Storage account created with HNS enabled and TLS 1.2 minimum
- Containers provisioned: bronze, silver, gold, sandbox
- Redundancy set to ZRS (production) or LRS (dev/test)
- Public network access disabled (production)
- Private Endpoints created for dfs and blob sub-resources
- Storage Firewall configured with trusted Azure services allowed
- Managed identities assigned RBAC roles per service (see Access Control table)
- Default ACLs set on all domain directories in each container
- Lifecycle management policies applied (bronze tiering, gold tiering)
- Diagnostic logging enabled to Log Analytics
- Capacity and availability alerts configured
- Purview data source registered and scan scheduled
- Databricks external locations created (Unity Catalog)
- Synapse external data sources configured
- ADF linked service created with managed identity auth
- Soft delete enabled (7-day retention minimum)
- Shared access keys disabled (Entra-only authentication)
Further Reading¶
- Medallion Architecture Best Practices — layer design and quality gates
- Data Flow — Medallion — end-to-end pipeline architecture
- Security & Compliance Best Practices — network isolation and Zero Trust
- Cost Optimization Best Practices — storage spend reduction
- Disaster Recovery Best Practices — replication and failover drills
- Delta Lake over Iceberg & Parquet (ADR-0003) — table format decision
- Databricks Guide — cluster configuration and Unity Catalog setup