Deployment Rollback Runbook
Note
Quick Summary: Step-by-step rollback procedures for failed CSA-in-a-Box deployments — Bicep landing zone redeployment from git tags, ADF pipeline restore, dbt model full-refresh, Cosmos DB point-in-time restore, and storage account blob/container recovery via soft-delete and versioning.
This runbook covers how to recover from a failed or bad deployment. It pairs with the Rollback Deployment GitHub Actions workflow (.github/workflows/rollback.yml) and the point-in-time restore features wired into the Bicep modules.
Danger
Before you start: rollback is a high-risk, privileged operation. Always run the rollback workflow in dry_run: true mode first, read the what-if output, and only flip to a real rollback once the blast radius is clear.
📑 Table of Contents
- 🏷️ 1. Rollback Targets
- 📦 2. Landing Zone (Bicep) Rollback
- 🔧 3. ADF Pipeline Rollback
- 🗄️ 4. dbt Model Rollback
- 🔄 5. Cosmos DB Point-in-Time Restore
- 💾 6. Storage Account Data Recovery
- ✅ 7. Post-Rollback Checklist
- 🧪 8. Testing Cadence
🏷️ 1. Rollback Targets
Every successful, non-dry-run run of the Deploy Infrastructure workflow pushes a tag of the form deploy/<env>-<sha>-<run>, for example deploy/prod-9f3a1e2-47. These tags are the canonical rollback targets: pass one to the Rollback workflow's rollback_ref input to redeploy that exact commit.
List recent deploy tags locally:
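A minimal sketch of that listing, using only standard git commands. The helper name list_deploy_tags is hypothetical, and the example assumes the remote is called origin.

```shell
# List the most recent deploy tags for one environment, newest first.
# Usage: list_deploy_tags <env> [count]
list_deploy_tags() {
  local env="$1" count="${2:-10}"
  git tag --list "deploy/${env}-*" --sort=-creatordate | head -n "$count"
}

# Refresh tags from the remote first so the list is current
# (assumes a remote named origin):
#   git fetch origin --tags
#   list_deploy_tags prod 5
```

Filtering by environment keeps prod tags out of a dev rollback and vice versa.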
📦 2. Landing Zone (Bicep) Rollback
This is the most common path: the last infrastructure deploy broke something, and you want to return the subscription to the previous good state.
- In GitHub, go to Actions → Rollback Deployment → Run workflow.
- Fill in the inputs:
  - environment: dev, test, or prod. Must match where the bad deploy landed.
  - target: alz, dmlz, dlz, or all. Prefer the narrowest target you can justify.
  - rollback_ref: the deploy/<env>-<sha>-<run> tag from the last known-good deploy, or any git ref / SHA that still contains the Bicep templates.
  - confirm: type exactly ROLLBACK. The preflight job will refuse to continue otherwise.
  - dry_run: leave this true for the first run. Review the what-if output, then re-run with dry_run: false.
- The workflow runs a preflight, then redeploys the selected landing zones at the rollback ref, then runs the post-rollback verification job.
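The same run can be started from a terminal instead of the Actions UI. The sketch below assumes the GitHub CLI (gh) is installed and authenticated, and that the workflow inputs are named exactly as listed above. The wrapper name run_rollback is hypothetical, and the tag in the usage comment is only an illustration.

```shell
# Trigger the Rollback Deployment workflow from the GitHub CLI.
# Usage: run_rollback <environment> <target> <rollback_ref> [dry_run]
# dry_run defaults to true so a real rollback has to be requested explicitly.
run_rollback() {
  local environment="$1" target="$2" rollback_ref="$3" dry_run="${4:-true}"
  gh workflow run rollback.yml \
    -f environment="$environment" \
    -f target="$target" \
    -f rollback_ref="$rollback_ref" \
    -f confirm=ROLLBACK \
    -f dry_run="$dry_run"
}

# Example: dry-run a DLZ rollback in dev (hypothetical tag).
#   run_rollback dev dlz deploy/dev-9f3a1e2-46
# After reviewing the what-if output, repeat with a fourth argument of false.
```

Defaulting dry_run to true mirrors the guidance above: the destructive path needs a deliberate extra argument.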
When a rollback is the wrong tool
Warning
- Schema-changing dbt deploys. Bicep rollback will not undo Delta table schema changes. Use dbt model rollback (section 4) first, then Bicep if needed.
- Resource deletions. If the bad deploy deleted resources, plain Bicep redeploy will not recreate data inside them. Use PITR (section 5) or storage soft-delete (section 6) to restore data first, then redeploy.
🔧 3. ADF Pipeline Rollback
ADF pipeline definitions live in the repo under domains/shared/pipelines/adf/. They are deployed by importing the JSON into the ADF instance; there is no per-pipeline rollback primitive in Azure.
Procedure:
- git checkout <good-ref> -- domains/shared/pipelines/adf/
- Import the pipeline JSON files into the ADF instance (via the ADF Studio "Import ARM template" or your CI step).
- Publish and re-trigger.
Note
Prevention: Always run the Bicep What-If workflow on a PR that touches ADF JSON so what-if rejects structurally invalid pipeline changes before they land.
🗄️ 4. dbt Model Rollback
Rolling back a Delta model means running the previous version of the model SQL so the new version's output gets overwritten by a fresh merge/insert.
Procedure:
- git checkout <good-ref> -- domains/shared/dbt/models/
- Run a full-refresh for the affected models.
- Re-run the dependent gold models so the downstream state is consistent.
- Verify the data-quality suite.
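The refresh, downstream rebuild, and verification steps can be sketched as one helper, run from the repository root after the git checkout. Several things here are assumptions rather than facts from this runbook: the function name rollback_dbt_models is hypothetical, the dbt project is assumed to live at domains/shared/dbt, and the `+` suffix selects each model plus everything downstream of it, which may be broader than the gold layer alone. The quality-check command is the one used in the post-rollback checklist.

```shell
# Full-refresh the affected models, rebuild everything downstream of them,
# then run the data-quality suite. Stops at the first failing step.
# Usage: rollback_dbt_models <model> [<model> ...]
rollback_dbt_models() {
  local project_dir="domains/shared/dbt"  # assumed location of dbt_project.yml
  local model downstream=""
  for model in "$@"; do
    downstream="$downstream ${model}+"    # model+ = the model and its descendants
  done
  # $downstream is left unquoted on purpose so each selector is its own argument.
  dbt run --project-dir "$project_dir" --full-refresh --select "$@" &&
    dbt run --project-dir "$project_dir" --select $downstream &&
    python csa_platform/governance/dataquality/run_quality_checks.py --suite all
}

# Example with hypothetical model names:
#   rollback_dbt_models orders customers
```

Chaining with && means a failed full-refresh never reaches the downstream rebuild, so a partial rollback is visible immediately.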
Important
If the bad deploy added or dropped columns, you may have to drop the affected Delta tables by hand before the full-refresh.
🔄 5. Cosmos DB Point-in-Time Restore
Cosmos DB accounts now default to Continuous / Continuous30Days backup (see deploy/bicep/DLZ/modules/cosmos/cosmosdb.bicep), which gives a 30-day point-in-time restore window.
Procedure:
- Identify the timestamp just before the bad deploy from the Deploy Infrastructure workflow run log.
- Restore to a new account, because Cosmos DB cannot restore an account in place.
- Point the application at the restored account and validate. Cosmos DB account names cannot be changed after creation, so cut over by switching the application's endpoint and keys to the restored account rather than by renaming accounts.
- Update the Bicep parameter file so the next deploy references the restored account if you decide to keep it.
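The restore step can be sketched with the Azure CLI's az cosmosdb restore command, which always writes into a new target account. The wrapper name restore_cosmos and every value in the usage comment are hypothetical. The timestamp must be in UTC and fall inside the 30-day window.

```shell
# Restore a Cosmos DB account into a NEW account at a given UTC timestamp.
# Usage: restore_cosmos <rg> <source-account> <target-account> <location> <timestamp>
restore_cosmos() {
  local rg="$1" source="$2" target="$3" location="$4" timestamp="$5"
  az cosmosdb restore \
    --resource-group "$rg" \
    --account-name "$source" \
    --target-database-account-name "$target" \
    --location "$location" \
    --restore-timestamp "$timestamp"
}

# Recent deploy runs, to help pick a timestamp just before the bad one
# (assumes the GitHub CLI is installed):
#   gh run list --workflow "Deploy Infrastructure" --limit 5 \
#     --json createdAt,conclusion,headSha,url
#
# Example with hypothetical names:
#   restore_cosmos rg-dlz-prod cosmos-dlz-prod cosmos-dlz-prod-restored \
#     eastus 2026-04-10T14:00:00Z
```

The restored account is a separate resource, so the original stays available for comparison until the cut-over is validated.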
💾 6. Storage Account Data Recovery
The DLZ storage account now enables blob versioning, change feed, blob soft-delete, container soft-delete, and a 6-day point-in-time restore window. See deploy/bicep/DLZ/modules/storage/storage.bicep.
To recover a deleted blob:
az storage blob undelete \
--account-name <storage-account> \
--container-name <container> \
--name <blob-path> \
--auth-mode login
To recover a deleted container:
az storage container restore \
--account-name <storage-account> \
--name <container> \
--deleted-version <version-id> \
--auth-mode login
To restore the entire account to an earlier point in time (up to 6 days):
# Omitting --blob-range restores the block blobs in every container.
az storage blob restore \
--account-name <storage-account> \
--resource-group <rg> \
--time-to-restore 2026-04-10T14:00:00Z
✅ 7. Post-Rollback Checklist
After any rollback, run through the following before declaring done:
- Confirm the affected workload is functioning (smoke test one representative dbt model + one representative ADF pipeline).
- Confirm data quality checks pass: python csa_platform/governance/dataquality/run_quality_checks.py --suite all.
- Tag the rolled-back state as rollback/<env>-<sha>-<date> for traceability.
- Open a tracking issue describing the root cause of the bad deploy and a prevention action (extra test, extra validation gate, etc.).
- Update .claude/DEVELOPMENT_LOG.md with a short entry so the next session has context.
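The tagging step from the checklist can be sketched as below. The runbook does not fix a date format, so the UTC YYYYMMDD stamp is an assumption, as are the helper name rollback_tag_name and the remote name origin.

```shell
# Build a rollback tag name in the rollback/<env>-<sha>-<date> form.
# Usage: rollback_tag_name <env> <sha> [date]
# The date defaults to today's UTC date as YYYYMMDD (assumed format).
rollback_tag_name() {
  local env="$1" sha="$2" stamp="${3:-$(date -u +%Y%m%d)}"
  printf 'rollback/%s-%s-%s\n' "$env" "$sha" "$stamp"
}

# Tag the commit that is now deployed and publish the tag
# (assumes a remote named origin):
#   sha="$(git rev-parse --short HEAD)"
#   tag="$(rollback_tag_name prod "$sha")"
#   git tag -a "$tag" -m "Rollback of prod to $sha" "$sha"
#   git push origin "$tag"
```

An annotated tag records who created it and when, which helps when the tracking issue is written up later.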
🧪 8. Testing Cadence
Rollback procedures are only useful if they work. Run a rollback drill in dev once per quarter, pointing to a known-good previous deploy/dev-* tag. Capture the run URL in the development log and update this runbook if anything drifts.