Key Rotation Runbook (CSA-0059)
Note
Quick Summary: Scheduled and emergency rotation procedures for every credential class in CSA-in-a-Box: Key Vault secrets, Storage account access keys, MSAL / Entra ID token-signing keys, SQL and Cosmos master keys, Databricks PATs, and ADF linked-service credentials. Covers cadence, automation (the secret-rotation Function), manual steps for each class, and verification queries.
Before First Use — Customization Checklist
- Populate the Contact Information table.
- Confirm the Key Vault names per environment (dev / staging / prod / gov-dev / gov-prod) in §3.
- Confirm the secret-rotation Function app name and identity under §4.1.
- Confirm your organization's compliance cadence (NIST SP 800-53 SC-12 leaves the rotation interval to the organization; 90 days is a common choice for high-impact systems).
📑 Table of Contents
- 📋 1. Scope
- 📅 2. Cadence
- 📦 3. Inventory
- 🔒 4. Rotation Procedures
- 🚨 5. Emergency Rotation (Compromise)
- ✅ 6. Verification
- 📋 7. Evidence Preservation
- 📎 8. Contact Information
- 🗓️ 9. Drill Log
- 🔗 10. Related Documentation
📋 1. Scope
Covers scheduled and emergency rotation for every credential surface on the platform. For compromise response, start in §5 and return to the per-credential procedure. For ATO / compliance rotation schedules, see docs/COMPLIANCE.md.
Out of scope: user AAD passwords (governed by Entra ID policy), FIDO2 / certificate re-enrollment (owned by IT endpoint management).
📅 2. Cadence
| Credential class | Scheduled rotation | Automation |
|---|---|---|
| Key Vault secrets (generic) | 90 days | Secret-rotation Function (event-driven) |
| Storage account access keys | 90 days | Key Vault rotation policy + Function |
| MSAL token-signing keys | 180 days | App reg rollover — see §4.4 |
| Cosmos DB primary key | 90 days | Manual + Function syncs Key Vault |
| SQL admin password / DB master key | 90 days | Manual (high risk — run during maintenance) |
| Databricks PATs | 60 days | Manual; migrate to managed identity where possible |
| ADF linked-service creds | 90 days | Automatic via Key Vault reference (no rotation at ADF layer) |
| Service principal client secrets | Migrate to OIDC federated credentials | N/A once migrated |
Important
Scheduled rotations run at 02:00 UTC on Sundays, within the rotation window defined by the environment guards in .github/workflows/deploy.yml. Any manual rotation during business hours must be P2-approved.
📦 3. Inventory
Every rotatable credential has an entry in Key Vault and an owner tag. Audit monthly:
az keyvault secret list --vault-name <vault> \
--query '[].{name:name,enabled:attributes.enabled,expires:attributes.expires,updated:attributes.updated}' \
-o table
// SecretNearExpiry events logged in the last day (Key Vault raises the event 30 days before a secret expires)
AzureDiagnostics
| where ResourceType == "VAULTS"
| where OperationName == "SecretNearExpiry"
| where TimeGenerated > ago(1d)
| project ResourceId, secretName = tostring(Properties.id), expiryEta = tostring(Properties.expiryTime)
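The monthly audit can also flag secrets that have no expiry date at all, since Key Vault only raises a near-expiry event for secrets that carry one. The following is a minimal sketch, assuming a logged-in Azure CLI; the function name `list_secrets_without_expiry` is illustrative and not part of the platform.

```shell
# List the names of secrets in a vault that have no expiry date set.
# list_secrets_without_expiry is an illustrative name, not a platform helper.
list_secrets_without_expiry() {  # usage: list_secrets_without_expiry <vault>
  az keyvault secret list --vault-name "$1" \
    --query '[?attributes.expires==`null`].name' -o tsv
}
```

An empty result for a vault means every secret in it has an expiry date.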
🔒 4. Rotation Procedures
4.1 Automated rotation via secret-rotation Function
The csa_platform/functions/secretRotation/ Function subscribes to Key Vault Microsoft.KeyVault.SecretNearExpiry events and rotates on your behalf. Happy path requires nothing from the operator.
- Confirm the function is healthy:

      az monitor app-insights events show \
        --app <ai-resource> --type requests \
        --query "[?customDimensions.function=='rotateSecret']|[0:10]"

- On failure, the Function publishes to the rotation-failed dead-letter queue. See docs/runbooks/dead-letter.md.
4.2 Key Vault secret (manual)
- Create the new version under the same secret name. Do not create a new name, because downstream references will not follow it.
- Give the old version a 24-hour grace window; do not disable it immediately.
- Confirm consumers have picked up the new version (restart Key Vault-refreshing pods if they pin the version).
- After 24 hours, disable the old version.
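The manual steps above can be sketched with the Azure CLI as follows. This is an illustration under stated assumptions rather than the platform's own tooling: the function names are hypothetical, the new value is read from a file so it stays out of shell history, and the grace window is expressed by setting an expiry on the old version.

```shell
# Sketch of the manual rotation steps in section 4.2.
# All function names here are illustrative, not platform helpers.

# Step 1: write the new value as a new version of the SAME secret name.
rotate_secret() {  # usage: rotate_secret <vault> <secret-name> <value-file>
  az keyvault secret set --vault-name "$1" --name "$2" --file "$3" \
    --query id -o tsv
}

# Step 2: mark the old version to expire at the end of the grace window
# instead of disabling it now. <expires-utc> is an ISO-8601 UTC timestamp
# such as 2030-01-01T00:00:00Z.
expire_old_version() {  # usage: expire_old_version <vault> <name> <old-version> <expires-utc>
  az keyvault secret set-attributes --vault-name "$1" --name "$2" \
    --version "$3" --expires "$4" --query attributes.expires -o tsv
}

# Step 4: disable the old version once the grace window has passed.
disable_old_version() {  # usage: disable_old_version <vault> <name> <old-version>
  az keyvault secret set-attributes --vault-name "$1" --name "$2" \
    --version "$3" --enabled false --query attributes.enabled -o tsv
}
```

The old version identifier can be read with `az keyvault secret list-versions` before step 1 is run.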
4.3 Storage account access keys
Storage accounts expose keys key1 and key2. Rotate one at a time so no consumer ever sees an invalid key.
- Regenerate key2.
- Update the Key Vault reference to point at the new key2.
- Wait 1 hour for all consumers to pick up the change (or force-cycle AKS pods / Function apps).
- Regenerate key1.
- Flip Key Vault back to key1 on the next scheduled rotation.
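One way to script a single regenerate-and-publish step from the list above is sketched here. The function name and the idea of storing the raw key as the Key Vault secret value are assumptions; adapt the last command if your consumers expect a full connection string.

```shell
# Regenerate one storage key and copy its new value into Key Vault.
# rotate_storage_key is an illustrative name, not a platform helper.
rotate_storage_key() {  # usage: rotate_storage_key <rg> <account> <key1|key2> <vault> <secret-name>
  local rg="$1" account="$2" key="$3" vault="$4" secret="$5" value
  # Regenerate the chosen key; stop if the regeneration fails.
  az storage account keys renew --resource-group "$rg" \
    --account-name "$account" --key "$key" --output none || return
  # Read the freshly generated value back.
  value="$(az storage account keys list --resource-group "$rg" \
    --account-name "$account" --query "[?keyName=='$key'].value | [0]" -o tsv)" || return
  # Publish it as a new version of the existing secret.
  az keyvault secret set --vault-name "$vault" --name "$secret" \
    --value "$value" --output none
}
```

Call it with `key2` first, wait for consumers to pick up the new version, then call it with `key1`.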
Tip
Prefer Microsoft Entra ID / managed-identity RBAC over access keys wherever possible. Every rotation is a chance to retire one more key.
4.4 MSAL / Entra ID token-signing keys
For the portal's BFF and MCP surfaces (see CSA-0020 Phase 3), the app registration's client credential is either a secret or a signing certificate. The HMAC-sealed MSAL token cache adds a separate per-node seal key (see portal/shared/api/ docs — treat the seal key like any other KV secret, §4.2).
- Add a new client credential (certificate preferred) without deleting the old one.
- Deploy the new credential to the portal (via Key Vault).
- Confirm token issuance works against the new credential.
- Remove the old credential.
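The add and remove steps above might look like this with the Azure CLI. This is a sketch that assumes a self-signed certificate created directly in Key Vault is acceptable for the app registration; the function names are illustrative.

```shell
# Sketch of the credential rollover in section 4.4.
# Function names are illustrative, not platform helpers.

# Step 1: append a new certificate credential and keep the existing ones.
add_app_certificate() {  # usage: add_app_certificate <app-id> <vault> <cert-name>
  az ad app credential reset --id "$1" --append \
    --create-cert --keyvault "$2" --cert "$3"
}

# Look up the keyId of the credential that is to be retired.
list_app_certificates() {  # usage: list_app_certificates <app-id>
  az ad app credential list --id "$1" --cert \
    --query "[].{keyId:keyId, end:endDateTime}" -o table
}

# Step 4: remove the old certificate credential by its keyId.
remove_app_certificate() {  # usage: remove_app_certificate <app-id> <key-id>
  az ad app credential delete --id "$1" --key-id "$2" --cert
}
```

Run the removal only after token issuance has been confirmed against the new credential.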
4.5 Cosmos DB primary / secondary keys
Cosmos DB follows the same two-key pattern as Storage.
az cosmosdb keys regenerate --name <account> --resource-group <rg> --key-kind secondary
# update Key Vault to use the freshly-rotated secondary
az cosmosdb keys regenerate --name <account> --resource-group <rg> --key-kind primary
# flip Key Vault back to primary next rotation
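The "update Key Vault" comments above could be carried out with a small helper like the one below. It is a sketch only: the function name is hypothetical, and it assumes the Key Vault secret holds the raw account key rather than a connection string.

```shell
# Copy the current Cosmos DB primary or secondary key into Key Vault.
# sync_cosmos_key_to_vault is an illustrative name, not a platform helper.
sync_cosmos_key_to_vault() {  # usage: sync_cosmos_key_to_vault <rg> <account> <primary|secondary> <vault> <secret-name>
  local rg="$1" account="$2" kind="$3" vault="$4" secret="$5" value
  # "keys list" returns primaryMasterKey and secondaryMasterKey.
  value="$(az cosmosdb keys list --resource-group "$rg" --name "$account" \
    --type keys --query "${kind}MasterKey" -o tsv)" || return
  az keyvault secret set --vault-name "$vault" --name "$secret" \
    --value "$value" --output none
}
```

Run it with `secondary` right after the first regenerate command, before the primary key is regenerated.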
4.6 Azure SQL master / SA keys
- Rotate the SQL admin password via portal / CLI. Update the KV entry consumed by portal.shared.api.persistence_factory.
- If the database uses a database master key (DEK rotation), coordinate with the app owner: rotating the master key requires re-encrypting column-level encrypted data and may require a maintenance window.
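For the CLI route, the admin password rotation could be scripted roughly as follows. This is a sketch under assumptions: the function name is hypothetical, `openssl` is available for generating the password, and the Key Vault secret stores the bare password. It does not cover database master key rotation.

```shell
# Rotate the Azure SQL server admin password and publish it to Key Vault.
# rotate_sql_admin_password is an illustrative name, not a platform helper.
rotate_sql_admin_password() {  # usage: rotate_sql_admin_password <rg> <server> <vault> <secret-name>
  local rg="$1" server="$2" vault="$3" secret="$4" password
  # Generate a random password (32 random bytes, base64-encoded).
  password="$(openssl rand -base64 32)" || return
  # Change the password on the server first; stop if that fails so
  # Key Vault never holds a password the server does not accept.
  az sql server update --resource-group "$rg" --name "$server" \
    --admin-password "$password" --output none || return
  az keyvault secret set --vault-name "$vault" --name "$secret" \
    --value "$password" --output none
}
```

Consumers that cache the password will fail to log in between the two commands, which is one reason to run this inside the maintenance window.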
4.7 Databricks personal access tokens
- Prefer service-principal OAuth over PATs. To rotate a PAT, mint the new token, update Key Vault, then revoke the old token so consumers are never left without a valid credential.
- Every PAT must have an expiry < 90 days at creation. Audit via Databricks workspace → User Settings → Access Tokens.
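Minting and revoking a PAT can be done through the Databricks Token REST API, sketched here with `curl`. The function names are illustrative, and the sketch assumes `DATABRICKS_HOST` (the workspace URL) and `DATABRICKS_TOKEN` (a currently valid token) are set in the environment. The comment text must not contain double quotes, because the JSON body is built by plain string interpolation.

```shell
# Sketch of PAT rotation against the Databricks Token API.
# Assumes DATABRICKS_HOST and DATABRICKS_TOKEN are exported.
# Function names are illustrative, not platform helpers.

# Mint a new token with an explicit lifetime in days.
mint_databricks_pat() {  # usage: mint_databricks_pat <comment> <lifetime-days>
  curl --silent --show-error --fail -X POST \
    "${DATABRICKS_HOST}/api/2.0/token/create" \
    -H "Authorization: Bearer ${DATABRICKS_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "{\"comment\": \"$1\", \"lifetime_seconds\": $(( $2 * 86400 ))}"
}

# Revoke an existing token by its token_id.
revoke_databricks_pat() {  # usage: revoke_databricks_pat <token-id>
  curl --silent --show-error --fail -X POST \
    "${DATABRICKS_HOST}/api/2.0/token/delete" \
    -H "Authorization: Bearer ${DATABRICKS_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "{\"token_id\": \"$1\"}"
}
```

The create response carries the new token value and its `token_id`; store the value in Key Vault (§4.2) before revoking the previous token.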
4.8 ADF linked-service credentials
ADF pulls credentials from Key Vault by reference. Rotating the KV secret (§4.2) is sufficient — no redeploy of ADF required. Validate one pipeline run after each rotation.
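The post-rotation pipeline check can be scripted with the Azure CLI `datafactory` extension. The helper below is a minimal sketch: its name is hypothetical, it assumes the extension is installed, and it polls every 15 seconds unless `POLL_SECONDS` is set.

```shell
# Trigger one pipeline run and wait for it to finish.
# validate_adf_pipeline is an illustrative name, not a platform helper.
validate_adf_pipeline() {  # usage: validate_adf_pipeline <rg> <factory> <pipeline>
  local rg="$1" factory="$2" pipeline="$3" run_id status
  run_id="$(az datafactory pipeline create-run --resource-group "$rg" \
    --factory-name "$factory" --name "$pipeline" --query runId -o tsv)" || return
  # Poll until the run leaves a transitional state.
  while :; do
    status="$(az datafactory pipeline-run show --resource-group "$rg" \
      --factory-name "$factory" --run-id "$run_id" --query status -o tsv)" || return
    case "$status" in
      Queued|InProgress|Canceling) sleep "${POLL_SECONDS:-15}" ;;
      *) break ;;
    esac
  done
  echo "run $run_id finished with status: $status"
  # Exit status is zero only when the run succeeded.
  [ "$status" = "Succeeded" ]
}
```

Pick a pipeline that actually uses the rotated linked service, otherwise a successful run proves nothing about the new credential.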
🚨 5. Emergency Rotation (Compromise)
Danger
Start here if a key is suspected to be compromised. Do not wait for the scheduled rotation window.
Run the procedures below in parallel where possible.
- Contain. Rotate the compromised credential immediately using §4. Do not preserve the old version; disable it the moment the new one is in Key Vault.
- Audit. Pull the last 30 days of access logs for every resource the key touched.
- Escalate. Invoke security-incident.md. This is a P1/P2 security event, not just an ops event.
- Rotate adjacent credentials. Any credential that shared the same host / identity / storage path should also be rotated (credential theft rarely stays scoped).
- Document. Add a row to the Drill Log in §9 and file a post-incident review task.
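The Contain and Audit steps above can be sketched as two CLI helpers. Both names are illustrative. The audit helper assumes the `log-analytics` CLI extension is installed, that diagnostics for the affected resource land in the `AzureDiagnostics` table, and that the workspace retains at least 30 days of data.

```shell
# Sketch of the Contain and Audit steps in section 5.
# Function names are illustrative, not platform helpers.

# Contain: disable the compromised version as soon as the new one exists.
disable_compromised_version() {  # usage: disable_compromised_version <vault> <name> <old-version>
  az keyvault secret set-attributes --vault-name "$1" --name "$2" \
    --version "$3" --enabled false --output none
}

# Audit: pull 30 days of activity for one resource from Log Analytics.
pull_access_logs() {  # usage: pull_access_logs <workspace-id> <resource-name>
  az monitor log-analytics query --workspace "$1" --timespan P30D \
    --analytics-query "AzureDiagnostics | where _ResourceId has '$2' | project TimeGenerated, OperationName, ResultType, CallerIPAddress | order by TimeGenerated asc" \
    --output tsv
}
```

Redirect the output of `pull_access_logs` to a file so it can be attached to the incident ticket.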
✅ 6. Verification
After every rotation (scheduled or emergency):
- Every consumer of the rotated credential has issued a successful request within the last hour:

      AzureDiagnostics
      | where TimeGenerated > ago(1h)
      | where ResultType == "Success"
      | where _ResourceId has "<resource>"
      | summarize successCount = count() by _ResourceId

- No 401 / 403 spike post-rotation:

      AzureDiagnostics
      | where TimeGenerated > ago(1h)
      | where ResultType in ("Unauthorized", "Forbidden")
      | summarize c = count() by bin(TimeGenerated, 5m), ResultType

- Key Vault audit shows the rotation event:

      AzureDiagnostics
      | where ResourceType == "VAULTS"
      | where OperationName in ("SecretSet", "KeyUpdate")
      | where TimeGenerated > ago(1h)
📋 7. Evidence Preservation
For emergency rotations, preserve:
- The pre-rotation access log for the compromised resource (export to CSV).
- The Key Vault audit event for the old version being disabled.
- The incident ticket + rotation timestamp.
- The list of adjacent credentials that were rotated as a precaution.
📎 8. Contact Information
Warning
Action Required: Populate these before first production use.
| Role | Contact | Phone | Escalation |
|---|---|---|---|
| Security On-Call | (set via your org's security team) | (see PagerDuty / OpsGenie) | Compromise events |
| Platform Team Lead | (set via your org's platform team) | (see PagerDuty / OpsGenie) | Scheduled rotation issues |
| Data Eng Lead | (set via your org's data eng DL) | (office hours) | ADF / Databricks creds |
| App Reg Owner | (per-app registration — see governance RBAC) | (DL) | MSAL / Entra ID key rollover |
| Azure Support | Case via Portal | N/A | Platform issues |
🗓️ 9. Drill Log
| Quarter | Date | Type (tabletop / live) | Scenario exercised | Lead | Gaps identified | Fixes tracked |
|---|---|---|---|---|---|---|
| Q1 — Jan | TBD | TBD | TBD | TBD | TBD | TBD |
| Q2 — Apr | TBD | TBD | TBD | TBD | TBD | TBD |
| Q3 — Jul | TBD | TBD | TBD | TBD | TBD | TBD |
| Q4 — Oct | TBD | TBD | TBD | TBD | TBD | TBD |
🔗 10. Related Documentation
- Security Incident — Compromise response
- Tenant Onboarding — New-tenant key setup
- Break-Glass Access — Emergency admin flow
- DR Drill — Key Vault restore scenario
- COMPLIANCE — Rotation cadence & regulatory requirements