
Data Retention Policies


Data retention policy implementation for Azure analytics platforms.


Overview

A deliberate retention policy keeps data only as long as it is needed: it satisfies regulatory obligations, cuts storage cost by moving aging data to cheaper tiers, and keeps active tables small enough for predictable query performance.


Retention Strategy

Tiered Retention Model

| Tier    | Retention Period | Storage Class | Use Case         |
|---------|------------------|---------------|------------------|
| Hot     | 0-30 days        | Premium/Hot   | Active analytics |
| Warm    | 30-90 days       | Standard/Cool | Recent history   |
| Cold    | 90-365 days      | Cool          | Compliance       |
| Archive | 1-7 years        | Archive       | Legal hold       |
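
The tier boundaries above can be encoded once and shared by downstream automation. A minimal sketch: the TIERS constant and tier_for_age helper are illustrative names, not part of any Azure API, and the storage classes are simplified to single blob tier names.

# Hypothetical single source of truth for the tier table above
TIERS = {
    "hot":     {"max_age_days": 30,   "storage_class": "Hot"},
    "warm":    {"max_age_days": 90,   "storage_class": "Cool"},
    "cold":    {"max_age_days": 365,  "storage_class": "Cool"},
    "archive": {"max_age_days": 2555, "storage_class": "Archive"},  # up to ~7 years
}

def tier_for_age(age_days: int) -> str:
    """Return the tier a dataset of the given age should occupy."""
    for name, spec in TIERS.items():
        if age_days <= spec["max_age_days"]:
            return name
    return "archive"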

Implementation Pattern

# Delta Lake retention configuration (assumes a Databricks/Spark
# environment where `spark` is an active SparkSession)
from delta.tables import DeltaTable

def configure_table_retention(table_path: str, retention_days: int = 30):
    """Configure Delta Lake table retention."""
    spark.sql(f"""
        ALTER TABLE delta.`{table_path}`
        SET TBLPROPERTIES (
            'delta.deletedFileRetentionDuration' = 'interval {retention_days} days',
            'delta.logRetentionDuration' = 'interval {retention_days} days'
        )
    """)

# Apply retention with VACUUM. Vacuuming below Delta's default 7-day
# (168-hour) threshold requires first disabling the safety check
# spark.databricks.delta.retentionDurationCheck.enabled.
def enforce_retention(table_path: str, retention_hours: int = 168):
    """Remove data files older than the retention period."""
    delta_table = DeltaTable.forPath(spark, table_path)
    delta_table.vacuum(retention_hours)
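
A brief usage sketch; the abfss:// path is a placeholder:

# Example: 30-day retention on one table (path is a placeholder)
table_path = "abfss://silver@datalake.dfs.core.windows.net/sales_orders"
configure_table_retention(table_path, retention_days=30)
enforce_retention(table_path, retention_hours=30 * 24)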

Storage Lifecycle Management

Azure Storage Policy

{
    "rules": [
        {
            "name": "move-to-cool-tier",
            "enabled": true,
            "type": "Lifecycle",
            "definition": {
                "filters": {
                    "blobTypes": ["blockBlob"],
                    "prefixMatch": ["bronze/", "silver/"]
                },
                "actions": {
                    "baseBlob": {
                        "tierToCool": {"daysAfterModificationGreaterThan": 30},
                        "tierToArchive": {"daysAfterModificationGreaterThan": 180},
                        "delete": {"daysAfterModificationGreaterThan": 730}
                    }
                }
            }
        },
        {
            "name": "gold-retention",
            "enabled": true,
            "type": "Lifecycle",
            "definition": {
                "filters": {
                    "blobTypes": ["blockBlob"],
                    "prefixMatch": ["gold/"]
                },
                "actions": {
                    "baseBlob": {
                        "tierToCool": {"daysAfterModificationGreaterThan": 90},
                        "delete": {"daysAfterModificationGreaterThan": 2555}
                    }
                }
            }
        }
    ]
}
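
The same policy can be deployed without the portal. A minimal sketch using the azure-mgmt-storage SDK, assuming the JSON above is saved as policy.json; the subscription, resource group, and account names are placeholders:

# Deploy the lifecycle policy via the Azure SDK (names are placeholders)
import json
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")
with open("policy.json") as f:
    policy = json.load(f)

# Azure accepts exactly one management policy per account, named "default"
client.management_policies.create_or_update(
    "rg-datalake", "datalakeaccount", "default", {"policy": policy}
)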

Terraform Configuration

resource "azurerm_storage_management_policy" "retention" {
  storage_account_id = azurerm_storage_account.datalake.id

  rule {
    name    = "bronze-retention"
    enabled = true

    filters {
      prefix_match = ["bronze/"]
      blob_types   = ["blockBlob"]
    }

    actions {
      base_blob {
        tier_to_cool_after_days_since_modification_greater_than    = 30
        tier_to_archive_after_days_since_modification_greater_than = 180
        delete_after_days_since_modification_greater_than          = 730
      }
    }
  }
}

Compliance Requirements

Regulatory Mapping

| Regulation | Min Retention | Max Retention | Notes             |
|------------|---------------|---------------|-------------------|
| GDPR       | None          | As needed     | Right to erasure  |
| HIPAA      | 6 years       | None          | From last use     |
| SOX        | 7 years       | None          | Financial records |
| PCI DSS    | 1 year        | None          | Audit logs        |

Implementation

class RetentionPolicyManager:
    """Manage retention policies by data classification."""

    POLICIES = {
        "pii": {"retention_days": 365, "archive_days": 730},
        "financial": {"retention_days": 2555, "archive_days": 3650},  # 7 years (SOX)
        "operational": {"retention_days": 90, "archive_days": 365},
        "telemetry": {"retention_days": 30, "archive_days": 90}
    }

    def apply_policy(self, table_path: str, classification: str):
        """Apply the retention policy that matches a data classification."""
        # Unknown classifications fall back to the operational policy
        policy = self.POLICIES.get(classification, self.POLICIES["operational"])

        # Configure Delta Lake table properties
        configure_table_retention(table_path, policy["retention_days"])

        # Schedule VACUUM
        self.schedule_vacuum(table_path, policy["retention_days"])

    def schedule_vacuum(self, table_path: str, retention_days: int):
        """Run VACUUM; shown inline here, but typically wired to a scheduler."""
        enforce_retention(table_path, retention_hours=retention_days * 24)
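
Usage is a one-liner per table; the path and classification here are illustrative:

# Example: enforce the PII policy on one table (path is illustrative)
manager = RetentionPolicyManager()
manager.apply_policy(
    "abfss://silver@datalake.dfs.core.windows.net/customers", "pii"
)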

Automated Enforcement

Scheduled Jobs

# Databricks job for retention enforcement. Assumes the logging and
# alerting helpers (log_success, log_failure, alert_on_failure) are
# defined elsewhere; a minimal sketch of them appears just below.
policy_manager = RetentionPolicyManager()

def daily_retention_job():
    """Daily job to enforce retention policies on registered tables."""
    tables = spark.sql("""
        SELECT table_path, classification
        FROM governance.table_registry
        WHERE retention_enabled = true
    """).collect()

    for table in tables:
        try:
            policy_manager.apply_policy(
                table.table_path,
                table.classification
            )
            log_success(table.table_path)
        except Exception as e:
            log_failure(table.table_path, str(e))
            alert_on_failure(table.table_path)
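
The helpers referenced above are assumed rather than prescribed. A minimal sketch that records outcomes in the governance.retention_audit table consumed by the monitoring query below; the audit-table schema is an assumption:

# Minimal helper sketches (the audit-table schema is an assumption)
def log_success(table_path: str):
    spark.sql(f"""
        UPDATE governance.retention_audit
        SET last_vacuum_date = current_date()
        WHERE table_path = '{table_path}'
    """)

def log_failure(table_path: str, error: str):
    print(f"Retention enforcement failed for {table_path}: {error}")

def alert_on_failure(table_path: str):
    """Stub: wire this to your alerting channel (webhook, pager, etc.)."""
    pass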

Monitoring

-- Track retention compliance
SELECT
    table_path,
    classification,
    last_vacuum_date,
    DATEDIFF(current_date(), last_vacuum_date) as days_since_vacuum,
    CASE
        WHEN DATEDIFF(current_date(), last_vacuum_date) > 7 THEN 'OVERDUE'
        ELSE 'COMPLIANT'
    END as status
FROM governance.retention_audit
ORDER BY days_since_vacuum DESC;
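
To act on OVERDUE rows programmatically, a small sketch that reuses the alerting stub above; the 7-day threshold mirrors the SQL:

# Page on tables whose VACUUM is more than 7 days overdue
overdue = spark.sql("""
    SELECT table_path
    FROM governance.retention_audit
    WHERE datediff(current_date(), last_vacuum_date) > 7
""").collect()

for row in overdue:
    alert_on_failure(row.table_path)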


Last Updated: January 2025