Data Retention Policies¶
Home | Best Practices | Cross-Cutting | Retention Policies
Data retention policy implementation for Azure analytics platforms.
Overview¶
Effective data retention ensures compliance, cost optimization, and performance across your data platform.
Retention Strategy¶
Tiered Retention Model¶
| Tier | Retention Period | Storage Class | Use Case |
|---|---|---|---|
| Hot | 0-30 days | Premium/Hot | Active analytics |
| Warm | 30-90 days | Standard/Cool | Recent history |
| Cold | 90-365 days | Cool | Compliance |
| Archive | 1-7 years | Archive | Legal hold |
Implementation Pattern¶
# Delta Lake retention configuration
from delta.tables import DeltaTable
def configure_table_retention(table_path: str, retention_days: int = 30):
"""Configure Delta Lake table retention."""
spark.sql(f"""
ALTER TABLE delta.`{table_path}`
SET TBLPROPERTIES (
'delta.deletedFileRetentionDuration' = 'interval {retention_days} days',
'delta.logRetentionDuration' = 'interval {retention_days} days'
)
""")
# Apply retention with VACUUM
def enforce_retention(table_path: str, retention_hours: int = 168):
"""Remove files older than retention period."""
delta_table = DeltaTable.forPath(spark, table_path)
delta_table.vacuum(retention_hours)
Storage Lifecycle Management¶
Azure Storage Policy¶
{
"rules": [
{
"name": "move-to-cool-tier",
"enabled": true,
"type": "Lifecycle",
"definition": {
"filters": {
"blobTypes": ["blockBlob"],
"prefixMatch": ["bronze/", "silver/"]
},
"actions": {
"baseBlob": {
"tierToCool": {"daysAfterModificationGreaterThan": 30},
"tierToArchive": {"daysAfterModificationGreaterThan": 180},
"delete": {"daysAfterModificationGreaterThan": 730}
}
}
}
},
{
"name": "gold-retention",
"enabled": true,
"type": "Lifecycle",
"definition": {
"filters": {
"blobTypes": ["blockBlob"],
"prefixMatch": ["gold/"]
},
"actions": {
"baseBlob": {
"tierToCool": {"daysAfterModificationGreaterThan": 90},
"delete": {"daysAfterModificationGreaterThan": 2555}
}
}
}
}
]
}
Terraform Configuration¶
resource "azurerm_storage_management_policy" "retention" {
storage_account_id = azurerm_storage_account.datalake.id
rule {
name = "bronze-retention"
enabled = true
filters {
prefix_match = ["bronze/"]
blob_types = ["blockBlob"]
}
actions {
base_blob {
tier_to_cool_after_days_since_modification_greater_than = 30
tier_to_archive_after_days_since_modification_greater_than = 180
delete_after_days_since_modification_greater_than = 730
}
}
}
}
Compliance Requirements¶
Regulatory Mapping¶
| Regulation | Min Retention | Max Retention | Notes |
|---|---|---|---|
| GDPR | None | As needed | Right to erasure |
| HIPAA | 6 years | None | From last use |
| SOX | 7 years | None | Financial records |
| PCI DSS | 1 year | None | Audit logs |
Implementation¶
class RetentionPolicyManager:
"""Manage retention policies by data classification."""
POLICIES = {
"pii": {"retention_days": 365, "archive_days": 730},
"financial": {"retention_days": 2555, "archive_days": 3650},
"operational": {"retention_days": 90, "archive_days": 365},
"telemetry": {"retention_days": 30, "archive_days": 90}
}
def apply_policy(self, table_path: str, classification: str):
"""Apply retention policy based on data classification."""
policy = self.POLICIES.get(classification, self.POLICIES["operational"])
# Configure Delta Lake
configure_table_retention(table_path, policy["retention_days"])
# Schedule VACUUM
self.schedule_vacuum(table_path, policy["retention_days"])
Automated Enforcement¶
Scheduled Jobs¶
# Databricks job for retention enforcement
def daily_retention_job():
"""Daily job to enforce retention policies."""
tables = spark.sql("""
SELECT table_path, classification
FROM governance.table_registry
WHERE retention_enabled = true
""").collect()
for table in tables:
try:
policy_manager.apply_policy(
table.table_path,
table.classification
)
log_success(table.table_path)
except Exception as e:
log_failure(table.table_path, str(e))
alert_on_failure(table.table_path)
Monitoring¶
-- Track retention compliance
SELECT
table_path,
classification,
last_vacuum_date,
DATEDIFF(current_date(), last_vacuum_date) as days_since_vacuum,
CASE
WHEN DATEDIFF(current_date(), last_vacuum_date) > 7 THEN 'OVERDUE'
ELSE 'COMPLIANT'
END as status
FROM governance.retention_audit
ORDER BY days_since_vacuum DESC;
Related Documentation¶
Last Updated: January 2025