Deployment Rollback Runbook

Note

Quick Summary: Step-by-step rollback procedures for failed CSA-in-a-Box deployments — Bicep landing zone redeployment from git tags, ADF pipeline restore, dbt model full-refresh, Cosmos DB point-in-time restore, and storage account blob/container recovery via soft-delete and versioning.

This runbook covers how to recover from a failed or bad deployment. It pairs with the Rollback Deployment GitHub Actions workflow (.github/workflows/rollback.yml) and the point-in-time restore features wired into the Bicep modules.

Danger

Before you start: rollback is a high-risk, privileged operation. Always run the rollback workflow in dry_run: true mode first, read the what-if output, and only flip to a real rollback once the blast radius is clear.

📑 Table of Contents

🏷️ 1. Rollback Targets
📦 2. Landing Zone (Bicep) Rollback
- When a rollback is the wrong tool
🔧 3. ADF Pipeline Rollback
🗄️ 4. dbt Model Rollback
🔄 5. Cosmos DB Point-in-Time Restore
💾 6. Storage Account Data Recovery
✅ 7. Post-Rollback Checklist
🧪 8. Testing Cadence

🏷️ 1. Rollback Targets

Every successful, non-dry-run run of the Deploy Infrastructure workflow pushes a tag of the form:

deploy/<environment>-<short-sha>-<run-number>

For example: deploy/prod-9f3a1e2-47. These tags are the canonical rollback targets — pass one to the Rollback workflow's rollback_ref input to redeploy that exact commit.

List recent deploy tags locally:

git fetch --tags
git tag --list 'deploy/*' --sort=-creatordate | head -20
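When scripting around these tags, it can help to check the shape of a tag before handing it to the workflow. The following is a minimal sketch, assuming the environment is one of dev, test, or prod and that the short SHA is lowercase hex. The function name parse_deploy_tag is hypothetical and not part of the repo.

```shell
# Hypothetical helper: split a deploy tag into environment, short SHA, and run number.
# Assumes the tag shape deploy/<environment>-<short-sha>-<run-number> described above.
parse_deploy_tag() {
  local tag="$1" rest tag_env sha run
  if printf '%s\n' "$tag" | grep -Eq '^deploy/(dev|test|prod)-[0-9a-f]{7,40}-[0-9]+$'; then
    rest=${tag#deploy/}     # e.g. prod-9f3a1e2-47
    tag_env=${rest%%-*}     # text before the first dash
    rest=${rest#*-}         # e.g. 9f3a1e2-47
    sha=${rest%%-*}
    run=${rest#*-}
    echo "env=$tag_env sha=$sha run=$run"
  else
    echo "not a deploy tag: $tag" >&2
    return 1
  fi
}
```

Calling parse_deploy_tag deploy/prod-9f3a1e2-47 prints the three components on one line; anything that does not match the pattern returns a non-zero status.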

📦 2. Landing Zone (Bicep) Rollback

This is the most common path: the last infrastructure deploy broke something, and you want to return the subscription to the previous good state.

1. In GitHub, go to Actions → Rollback Deployment → Run workflow.
2. Fill in the inputs:
   - environment: dev, test, or prod. This must match the environment where the bad deploy landed.
   - target: alz, dmlz, dlz, or all. Prefer the narrowest target you can justify.
   - rollback_ref: the deploy/<env>-<sha>-<run> tag from the last known-good deploy, or any git ref / SHA that still contains the Bicep templates.
   - confirm: type exactly ROLLBACK. The preflight job refuses to continue otherwise.
   - dry_run: leave this true for the first run. Review the what-if output, then re-run with dry_run: false.

The workflow runs a preflight, redeploys the selected landing zones at the rollback ref, and then runs the post-rollback verification job.
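The same dry run can also be started from a terminal instead of the Actions UI. This is a sketch, assuming the GitHub CLI (gh) is installed and authenticated against this repository and that the workflow inputs are named exactly as listed above. The wrapper function rollback_dry_run is illustrative only.

```shell
# Hypothetical wrapper around `gh workflow run`; it is not part of the repo.
# By default it only prints the command for review; set EXECUTE=1 to dispatch it.
rollback_dry_run() {
  local environment="$1" target="$2" rollback_ref="$3"
  set -- gh workflow run rollback.yml \
    -f environment="$environment" \
    -f target="$target" \
    -f rollback_ref="$rollback_ref" \
    -f confirm=ROLLBACK \
    -f dry_run=true
  if [ "${EXECUTE:-0}" = "1" ]; then
    "$@"          # actually dispatch the workflow
  else
    echo "$*"     # print the command so it can be reviewed first
  fi
}
```

For example, rollback_dry_run prod dlz deploy/prod-9f3a1e2-47 prints the command; rerun it with EXECUTE=1 once the inputs look right. The wrapper hard-codes dry_run=true on purpose, so the real rollback still has to be a separate, deliberate step.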

When a rollback is the wrong tool

Warning

- Schema-changing dbt deploys. Bicep rollback will not undo Delta table schema changes. Use dbt model rollback (section 4) first, then Bicep if needed.
- Resource deletions. If the bad deploy deleted resources, a plain Bicep redeploy will not recreate the data inside them. Use PITR (section 5) or storage soft-delete (section 6) to restore data first, then redeploy.

🔧 3. ADF Pipeline Rollback

ADF pipeline definitions live in the repo under domains/shared/pipelines/adf/. They are deployed by importing the JSON into the ADF instance; there is no per-pipeline rollback primitive in Azure.

Procedure:

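1. Restore the pipeline definitions from the last known-good ref:

git checkout <good-ref> -- domains/shared/pipelines/adf/

2. Import the pipeline JSON files into the ADF instance (via the ADF Studio "Import ARM template" option or your CI step).
3. Publish and re-trigger the affected pipelines.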
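Before importing, it is worth confirming that every restored file still parses as JSON, since a stray merge-conflict marker would otherwise only surface in ADF Studio. This is a minimal sketch, assuming python3 is on the PATH. check_adf_json is a hypothetical helper, not a script that exists in the repo.

```shell
# Hypothetical pre-import check: fail if any *.json file under a directory is not valid JSON.
# This checks syntax only; it does not validate the ADF pipeline schema.
check_adf_json() {
  local dir="${1:-domains/shared/pipelines/adf}"
  # find exits non-zero if the batched python3 invocation exits non-zero.
  find "$dir" -type f -name '*.json' -exec python3 -c '
import json, sys
bad = 0
for path in sys.argv[1:]:
    try:
        with open(path, encoding="utf-8") as fh:
            json.load(fh)
    except (OSError, ValueError) as exc:
        print(f"invalid JSON: {path}: {exc}", file=sys.stderr)
        bad = 1
sys.exit(bad)
' {} +
}
```

Run check_adf_json from the repo root after the checkout. A non-zero exit status means at least one file needs attention before the import.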

Note

Prevention: Always run the Bicep What-If workflow on a PR that touches ADF JSON so what-if rejects structurally invalid pipeline changes before they land.

🗄️ 4. dbt Model Rollback

Rolling back a Delta model means re-running the previous version of the model SQL with --full-refresh, so the table is rebuilt and the bad version's output is replaced.

Procedure:

1. From the repo root, restore the model SQL from the last known-good ref:

git checkout <good-ref> -- domains/shared/dbt/models/

2. Run a full-refresh for the affected models and everything downstream of them:

cd domains/shared/dbt
dbt run --select <model_name>+ --full-refresh --target <env>

3. Re-run the dependent gold models so the downstream state is consistent:

dbt run --select gld_customer_lifetime_value gld_daily_order_metrics --full-refresh --target <env>

4. Run the data-quality suite and confirm it passes (the path is relative to domains/shared/dbt):

python ../../../csa_platform/governance/dataquality/run_quality_checks.py --suite all

Important

If the bad deploy added or dropped columns, you may have to drop the affected Delta tables by hand before the full-refresh.
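To work out which models are affected, one option is to diff the models directory between the known-good ref and the bad deploy and turn the changed file names into dbt selectors. This is a sketch, assuming each model lives in a .sql file named after the model (dbt's default convention) and that HEAD is the bad deploy. models_to_selectors is a hypothetical helper.

```shell
# Hypothetical helper: read file paths on stdin and print one "<model_name>+" selector
# per .sql file. The trailing "+" asks dbt to include everything downstream of the model.
models_to_selectors() {
  local path name
  while IFS= read -r path; do
    case "$path" in
      *.sql)
        name=${path##*/}                 # strip the directory part
        printf '%s+\n' "${name%.sql}"    # strip the extension, append "+"
        ;;
    esac
  done
}

# Usage, from the repo root:
#   git diff --name-only <good-ref> HEAD -- domains/shared/dbt/models/ | models_to_selectors
```

Non-SQL files such as schema YAML are skipped, so review those changes separately. The printed selectors can be passed to dbt run --select as in the procedure above.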

🔄 5. Cosmos DB Point-in-Time Restore

Cosmos DB accounts now default to Continuous / Continuous30Days backup (see deploy/bicep/DLZ/modules/cosmos/cosmosdb.bicep), which gives a 30-day point-in-time restore window.

Procedure:

1. Identify the timestamp just before the bad deploy from the Deploy Infrastructure workflow run log.

2. Restore to a new account (Cosmos DB cannot restore an account in place):

az cosmosdb restore \
  --target-database-account-name csa-cosmos-restored \
  --account-name csa-cosmos-dlz \
  --resource-group rg-csa-cosmos \
  --location eastus \
  --restore-timestamp 2026-04-10T14:00:00Z

3. Point the application at the restored account and validate it. Cosmos DB account names cannot be changed after creation, so cut over by updating the application's endpoint and credentials rather than by renaming accounts.
4. If you decide to keep the restored copy, update the Bicep parameter file so the next deploy references the restored account.
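The restore timestamp usually starts from the deploy run's start time. This is a small sketch for turning that into a --restore-timestamp value a few minutes earlier, assuming GNU date (the -d flag is not available in BSD/macOS date). The helper name and the 5-minute default margin are illustrative choices, not values taken from the workflow.

```shell
# Hypothetical helper: print a UTC ISO 8601 timestamp N minutes before the given time.
# Requires GNU date for -d parsing.
restore_timestamp_before() {
  local deploy_time="$1" margin_minutes="${2:-5}" epoch
  epoch=$(date -u -d "$deploy_time" +%s) || return 1
  date -u -d "@$((epoch - margin_minutes * 60))" +%Y-%m-%dT%H:%M:%SZ
}
```

For instance, restore_timestamp_before 2026-04-10T14:05:00Z prints a value that can be passed straight to --restore-timestamp. Check the margin against the run log; a long-running deploy may have started changing data well before it failed.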

💾 6. Storage Account Data Recovery

The DLZ storage account now enables blob versioning, change feed, blob soft-delete, container soft-delete, and a 6-day point-in-time restore window. See deploy/bicep/DLZ/modules/storage/storage.bicep.

To recover a deleted blob:

az storage blob undelete \
  --account-name <storage-account> \
  --container-name <container> \
  --name <blob-path> \
  --auth-mode login

To recover a deleted container:

az storage container restore \
  --account-name <storage-account> \
  --name <container> \
  --deleted-version <version-id> \
  --auth-mode login
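The <version-id> value comes from listing the account's containers with soft-deleted ones included. This is a sketch, assuming the signed-in identity has data-plane access to the account. The wrapper name is hypothetical, and the exact output field names should be checked against your CLI version.

```shell
# Hypothetical wrapper: list containers, including soft-deleted ones.
# In the JSON output, look for the entries flagged as deleted; the version
# reported for such an entry is the value to pass to --deleted-version.
list_deleted_containers() {
  local account="$1"
  az storage container list \
    --account-name "$account" \
    --include-deleted \
    --auth-mode login \
    --output json
}
```

Call it as list_deleted_containers <storage-account>. Soft-deleted containers only appear while they are still inside the container soft-delete retention period.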

To restore the entire account to an earlier point in time (up to 6 days back), pass an empty start and end to --blob-range, which selects every container and blob:

az storage account restore-blob-ranges \
  --account-name <storage-account> \
  --resource-group <rg> \
  --time-to-restore 2026-04-10T14:00:00Z \
  --blob-range "" ""

✅ 7. Post-Rollback Checklist

After any rollback, run through the following before declaring done:

- Confirm the affected workload is functioning: smoke test one representative dbt model and one representative ADF pipeline.
- Confirm the data quality checks pass: python csa_platform/governance/dataquality/run_quality_checks.py --suite all
- Tag the rolled-back state as rollback/<env>-<sha>-<date> for traceability.
- Open a tracking issue describing the root cause of the bad deploy and a prevention action (extra test, extra validation gate, etc.).
- Update .claude/DEVELOPMENT_LOG.md with a short entry so the next session has context.
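The traceability tag can be built in the same spirit as the deploy tags. This is a sketch, assuming <sha> means the short SHA of the commit that was redeployed and <date> is a UTC date in YYYYMMDD form. The runbook does not fix a date format, so that part is an assumption, and rollback_tag_name is a hypothetical helper.

```shell
# Hypothetical helper: build a rollback/<env>-<sha>-<date> tag name.
# The date defaults to today's UTC date as YYYYMMDD (an assumed format).
rollback_tag_name() {
  local environment="$1" sha="$2" day="${3:-$(date -u +%Y%m%d)}"
  echo "rollback/${environment}-${sha}-${day}"
}

# Usage, from a checkout of the rolled-back ref:
#   tag=$(rollback_tag_name prod "$(git rev-parse --short HEAD)")
#   git tag -a "$tag" -m "Rollback of prod" && git push origin "$tag"
```

Using an annotated tag (-a) records who created the tag and when, which is useful when the tracking issue is written up later.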

🧪 8. Testing Cadence

Rollback procedures are only useful if they work. Run a rollback drill in dev once per quarter, pointing to a known-good previous deploy/dev-* tag. Capture the run URL in the development log and update this runbook if anything drifts.
