# DR Drill Runbook (CSA-0073)
> **Note**
> **Quick Summary:** Quarterly disaster-recovery drill that exercises Cosmos failover, Storage failover, Key Vault restore, and Bicep rollback against a scratch subscription. Automated via `.github/workflows/dr-drill.yml` on the first day of each quarter, with on-demand `workflow_dispatch` runs.
This runbook operationalises `docs/DR.md` §4 ("Failover Readiness — Quarterly Drill"). The parent DR document is the authoritative reference for RPO/RTO targets, region pairing, and the real failover procedure; this document covers how we rehearse that procedure safely against a scratch subscription so the runbook stays honest instead of rotting between real incidents.
## 📑 Table of Contents
- 🎯 1. Objectives
- 📅 2. Cadence
- 🔐 3. Required Azure Permissions
- 🧪 4. Scenarios
- 🚀 5. Triggering a Drill
- 📊 6. Results & Reporting
- 📐 7. RPO / RTO Expectations
- 🧰 8. Follow-up Work
- 🔗 9. Related Documentation
## 🎯 1. Objectives
A DR drill is successful when:
- Every scenario executes end-to-end without human intervention during the automated portion.
- Observed RTO per scenario is within the target documented in §7 (and cross-referenced to `docs/DR.md` §1).
- The drill report is posted to the ops channel and archived by the workflow's `report` job.
- Any deviation, flake, or gap is captured as a follow-up task in Archon under the "DR" tag so the runbook can be updated before the next drill.
Drills are not deployments. They never target production and must never mutate resources that real workloads depend on.
## 📅 2. Cadence
- **Scheduled:** `cron: "0 10 1 1,4,7,10 *"` (10:00 UTC on the 1st day of Jan, Apr, Jul, and Oct; trigger block sketched below).
- **On-demand:** via GitHub → Actions → `dr-drill` → Run workflow. Anyone with `write` access to the repository can trigger a drill.
- **Ad hoc post-incident:** after any real DR event, re-run the relevant scenario within five business days to confirm the post-incident patch is effective.
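For orientation, a minimal sketch of the trigger block in `dr-drill.yml`; the input names mirror §5, but the exact defaults in the real workflow may differ:

```yaml
on:
  schedule:
    - cron: "0 10 1 1,4,7,10 *"   # 10:00 UTC on the 1st of Jan/Apr/Jul/Oct
  workflow_dispatch:
    inputs:
      environment:
        description: "Target environment"
        default: "scratch"
      scenarios:
        description: "Comma-separated scenario list (empty = all)"
        default: ""
```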
## 🔐 3. Required Azure Permissions
The drill workflow authenticates via OIDC federated credentials. The service principal behind `AZURE_CLIENT_ID` needs:
| Scope | Role | Why |
|---|---|---|
| `subscriptions/$SCRATCH_SUB` | Contributor | Create + mutate resources during the drill |
| `subscriptions/$SCRATCH_SUB` | User Access Administrator (optional) | Only if the drill provisions role assignments |
| `subscriptions/$SCRATCH_SUB/providers/Microsoft.KeyVault` | Key Vault Administrator (RBAC model) or a vault access-policy entry with `get` / `list` / `recover` / `purge` | Required by `keyvault-restore` |
| `subscriptions/$SCRATCH_SUB/providers/Microsoft.DocumentDB` | Cosmos DB Operator | Required by `cosmos-failover` |
| `subscriptions/$SCRATCH_SUB/providers/Microsoft.Storage` | Storage Account Contributor | Required by `storage-failover` |
The scratch subscription ID is stored in the repository secret `AZURE_SUBSCRIPTION_ID_SCRATCH`. The tenant is the usual `AZURE_TENANT_ID`. The drill must not use the production subscription secret.
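The OIDC wiring in each job follows the standard `azure/login` pattern; a sketch using the secret names above (scratch subscription only):

```yaml
permissions:
  id-token: write   # required for OIDC federated credentials
  contents: read

steps:
  - uses: azure/login@v2
    with:
      client-id: ${{ secrets.AZURE_CLIENT_ID }}
      tenant-id: ${{ secrets.AZURE_TENANT_ID }}
      subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID_SCRATCH }}
```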
> **Warning**
> Real storage-account failovers drop the account to LRS, and the replication seed can take hours to rebuild. That is fine on a scratch account; do not run the `storage-failover` scenario against staging or prod.
## 🧪 4. Scenarios
Each scenario maps to one job in `.github/workflows/dr-drill.yml`. Scenarios can be run individually via the `scenarios` input (see §5).
### 4.1 `cosmos-failover`
Purpose: Validate Cosmos DB multi-region write + automatic failover using `az cosmosdb failover-priority-change`.
Expected behaviour:
- Provision (or reuse) a 2-region Cosmos account in the scratch subscription with `enableAutomaticFailover=true` and `enableMultipleWriteLocations=true`.
- Write a sentinel document to the primary region.
- Force a failover-priority swap so the former secondary becomes the new primary.
- Re-read the sentinel document from the new primary.
- Swap priorities back (failback).
Verification:
- The read after failover must return the sentinel.
- `az cosmosdb show` reflects the expected priority order post-failback.
- End-to-end duration is logged for RTO comparison.
Rollback path: The scenario always finishes by restoring the original priority order. On failure, the on-call runs `az cosmosdb failover-priority-change` manually using the commands in `docs/DR.md` §5.2.
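A minimal sketch of the swap-and-verify core; the resource names and region pair are hypothetical placeholders for whatever the drill provisions:

```bash
#!/usr/bin/env bash
set -euo pipefail
RG=rg-dr-drill; ACCOUNT=cosmos-dr-drill   # hypothetical fixture names

# Failover: promote the former secondary (priority 0 = write region)
az cosmosdb failover-priority-change \
  --resource-group "$RG" --name "$ACCOUNT" \
  --failover-policies westeurope=0 northeurope=1

# Verify the new priority order before re-reading the sentinel
az cosmosdb show --resource-group "$RG" --name "$ACCOUNT" \
  --query "failoverPolicies[].{region:locationName,priority:failoverPriority}" -o table

# Failback: restore the original order
az cosmosdb failover-priority-change \
  --resource-group "$RG" --name "$ACCOUNT" \
  --failover-policies northeurope=0 westeurope=1
```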
### 4.2 `storage-failover`
Purpose: Validate customer-initiated failover on an RA-GRS storage account (`az storage account failover`) and the subsequent re-enablement of geo-replication.
Expected behaviour:
- Provision (or reuse) a scratch `Standard_RAGRS` storage account with a small test container + blob.
- Invoke `az storage account failover --yes`.
- Confirm the blob is readable from the failover region and that the account SKU is now `Standard_LRS`.
- Set the SKU back to `Standard_RAGRS` and wait for `geoReplicationStatus=Live`.
Verification:
- Blob content matches the pre-failover sentinel.
- `az storage account show` reports the expected primary region.
- The failback SKU update succeeds (seed time is logged, not asserted; it can legitimately take hours).
Rollback path: The scenario's final step always resets the SKU to `Standard_RAGRS`. If it fails, the steady-state scratch environment is left in LRS — operator reruns the SKU update manually.
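The failover-and-failback core, sketched with hypothetical fixture names:

```bash
#!/usr/bin/env bash
set -euo pipefail
RG=rg-dr-drill; SA=stdrdrill              # hypothetical fixture names

# Customer-initiated failover; drops the account to LRS
az storage account failover --resource-group "$RG" --name "$SA" --yes

# Confirm the new primary region and the post-failover SKU
az storage account show --resource-group "$RG" --name "$SA" \
  --query "{primary:primaryLocation,sku:sku.name}" -o table

# Failback: re-enable geo-replication, then poll until the seed is live
az storage account update --resource-group "$RG" --name "$SA" --sku Standard_RAGRS
az storage account show --resource-group "$RG" --name "$SA" \
  --expand geoReplicationStats \
  --query "geoReplicationStats.status"    # wait for "Live" (can take hours)
```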
### 4.3 `keyvault-restore`
Purpose: Validate soft-delete + purge-protection recovery using a Key Vault with 90-day soft-delete retention.
Expected behaviour:
- Drill creates a secret in a scratch Key Vault.
- Deletes the secret (soft delete).
- Lists deleted secrets and asserts the secret is present with the expected `scheduledPurgeDate`.
- Recovers the secret via `az keyvault secret recover`.
- Confirms the recovered secret value matches the original.
Verification:
- The recovered value is byte-for-byte identical to the original.
- Key Vault emits audit events for `SecretDelete` and `SecretRecover` (cross-checked against the tamper-evident audit logger wired in CSA-0016).
Rollback path: None required — recovery is the happy path. If recovery fails, the secret remains in soft-delete state and is purged automatically at the end of the retention window.
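The delete-and-recover loop, sketched with a hypothetical vault and secret name:

```bash
#!/usr/bin/env bash
set -euo pipefail
VAULT=kv-dr-drill; NAME=drill-sentinel    # hypothetical fixture names

ORIGINAL=$(az keyvault secret set --vault-name "$VAULT" --name "$NAME" \
  --value "sentinel-$(date -u +%s)" --query value -o tsv)

az keyvault secret delete --vault-name "$VAULT" --name "$NAME"

# Soft-deleted entry must appear with a scheduledPurgeDate
# (deletion is asynchronous; the real script should poll until it shows up)
az keyvault secret list-deleted --vault-name "$VAULT" \
  --query "[?name=='$NAME'].scheduledPurgeDate" -o tsv

az keyvault secret recover --vault-name "$VAULT" --name "$NAME"

# Recovered value must equal the original byte-for-byte
RECOVERED=$(az keyvault secret show --vault-name "$VAULT" --name "$NAME" \
  --query value -o tsv)
[ "$RECOVERED" = "$ORIGINAL" ] || { echo "sentinel mismatch"; exit 1; }
```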
### 4.4 `bicep-rollback`
Purpose: Validate that a previous `main.bicep` can be redeployed and that `az deployment group what-if` produces a clean diff.
Expected behaviour:
- Checks out `main.bicep` at `HEAD~1` (or a pinned "known good" tag).
- Runs `az deployment group what-if` against the scratch RG and records the diff.
- Executes the deployment.
- Redeploys the current `HEAD` `main.bicep` and confirms the resource graph returns to the expected state.
Verification:
- `what-if` exits 0.
- Deployment success is captured in `az deployment group show`.
- The return-to-HEAD deployment is idempotent (`noChange` result).
Rollback path: The scenario always finishes by redeploying `HEAD`, so the scratch RG is left in the current-main state. If any step fails, operator runs the deployment manually per `docs/ROLLBACK.md`.
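A sketch of the rollback-and-return loop, assuming `main.bicep` sits at the repo root (adjust the `git show` path to the template's real location) and a hypothetical scratch RG:

```bash
#!/usr/bin/env bash
set -euo pipefail
RG=rg-dr-drill                            # hypothetical scratch RG

# Materialise the known-good template without moving the working tree
git show HEAD~1:main.bicep > /tmp/rollback.bicep

# Record the diff, then deploy the previous revision
az deployment group what-if --resource-group "$RG" --template-file /tmp/rollback.bicep
az deployment group create --resource-group "$RG" --name drill-rollback \
  --template-file /tmp/rollback.bicep

# Return to HEAD and confirm idempotence via what-if (expect NoChange entries)
az deployment group create --resource-group "$RG" --name drill-return \
  --template-file main.bicep
az deployment group what-if --resource-group "$RG" --template-file main.bicep
```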
## 🚀 5. Triggering a Drill
### Scheduled run
No action required. The cron entry in `dr-drill.yml` fires on the 1st of each quarter at 10:00 UTC. The `report` job posts results even if one or more scenarios fail.
### Manual run (all scenarios)
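Mirroring the single-scenario example below, and assuming the workflow runs every scenario when `scenarios` is left unset:

```bash
gh workflow run dr-drill.yml -f environment=scratch
```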
### Manual run (single scenario)
Use the `scenarios` input to restrict the run to one of: `cosmos-failover`, `storage-failover`, `keyvault-restore`, `bicep-rollback`.
Comma-separated values are also accepted:
```bash
gh workflow run dr-drill.yml \
  -f environment=scratch \
  -f scenarios=cosmos-failover,keyvault-restore
```
### Running against staging
Allowed only for the `bicep-rollback` and `keyvault-restore` scenarios, and only with explicit sign-off from the on-call lead. Storage and Cosmos failover in staging are prohibited because they affect shared paired resources.
## 📊 6. Results & Reporting
- The `report` job in the workflow aggregates per-scenario `needs.<job>.result` and prints a summary line per scenario (`success` | `failure` | `cancelled` | `skipped`); a sketch of the job shape follows this list.
- Results are posted to the ops Teams channel via the webhook stub in the `report` job (wiring TODO; see §8).
- Archive the workflow run URL plus the drill ID (format `drill-YYYYMMDDTHHMMSSZ`) under `reports/dr-drills/<date>.md` in the ops tracker. The file is gitignored; attach it to the Archon "DR drills" document instead.
- Any scenario returning `failure` must open an Archon task tagged `dr-drill-followup` within one business day.
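A sketch of the aggregation shape, assuming job IDs match the scenario names:

```yaml
report:
  needs: [cosmos-failover, storage-failover, keyvault-restore, bicep-rollback]
  if: always()                  # report even when a scenario fails
  runs-on: ubuntu-latest
  steps:
    - run: |
        echo "drill-$(date -u +%Y%m%dT%H%M%SZ)"
        echo "cosmos-failover:  ${{ needs.cosmos-failover.result }}"
        echo "storage-failover: ${{ needs.storage-failover.result }}"
        echo "keyvault-restore: ${{ needs.keyvault-restore.result }}"
        echo "bicep-rollback:   ${{ needs.bicep-rollback.result }}"
```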
## 📐 7. RPO / RTO Expectations
These expectations must stay aligned with `docs/DR.md` §1. If they drift, update both files in the same PR.
| Scenario | RPO (data-loss window) | RTO (recovery time) | Source of truth |
|---|---|---|---|
| `cosmos-failover` | < 15 min | < 30 min | `DR.md` §1 (Cosmos DB, Critical tier) |
| `storage-failover` | < 1 h | < 1 h | `DR.md` §1 (Data Lake Storage Silver + Gold, Critical tier) |
| `keyvault-restore` | N/A (recovery from soft-delete, not replication) | < 15 min | `DR.md` §1 (Key Vault, Critical tier) |
| `bicep-rollback` | N/A (IaC redeploy, no runtime data) | < 30 min | `docs/ROLLBACK.md` |
If a drill's observed RTO exceeds the target by > 25%, treat it as a finding and open a follow-up task — do not silently accept drift.
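One way to mechanise the > 25% check in a drill script; the variable names and values are hypothetical:

```bash
# Flag observed RTO more than 25% over target (integer seconds)
observed=2100; target=1800                # hypothetical measured/target values
if [ "$observed" -gt $(( target * 125 / 100 )) ]; then
  echo "FINDING: RTO ${observed}s exceeds target ${target}s by >25%"
fi
```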
## 🧰 8. Follow-up Work
The workflow lands as a shell, with each scenario job stubbed as `echo "TODO: wire to scripts/drill/<scenario>.sh"`. The following items are known follow-ups (not in scope for CSA-0073):
- Implement `scripts/drill/cosmos-failover.sh`.
- Implement `scripts/drill/storage-failover.sh`.
- Implement `scripts/drill/keyvault-restore.sh`.
- Implement `scripts/drill/bicep-rollback.sh`.
- Wire the `report` job's Teams webhook POST (`TEAMS_OPS_WEBHOOK` secret, not yet created).
- Add a scratch-subscription Bicep module under `deploy/bicep/scratch/` for idempotent provisioning of the drill fixtures (Cosmos + Storage + KV + empty RG).
- Extend the drill `report` job to write an artifact into `reports/dr-drills/` and attach it to the Archon DR drills doc.
- Capture per-step duration metrics and publish them to Log Analytics so trend analysis is possible across quarters.
## 🔗 9. Related Documentation
- Disaster Recovery — Authoritative DR runbook (RPO/RTO, region pairing, real failover procedure)
- Rollback — Deployment rollback procedure
- Security Incident — Sibling operations runbook
- Multi-Region Deployment — Multi-region deployment guide
- `.github/workflows/dr-drill.yml` — The drill workflow itself