Rollback via Terraform
Operational runbook for rolling back a bad deploy on the new Terraform-driven pipeline (ADR 0007).
Quick reference
| Scenario | Fastest mitigation | Then |
|---|---|---|
| New deploy made the app worse but service is still healthy | Re-dispatch with the previous good ref | Investigate the bad change in a follow-up PR |
| New deploy made the service unhealthy (5xx, crashloop) | ECS circuit breaker auto-rolls back to previous revision; verify and confirm | Investigate why the circuit breaker triggered |
| Circuit breaker didn't catch it (rare; new task is "healthy" but app behaves wrong) | Pin to previous revision via aws ecs update-service --task-definition <family>:<old-N> | Revert the PR + re-dispatch |
| Bad infra change (Terraform source) blocking deploys | git revert <PR-commit> on main; re-dispatch | Re-architect the change |
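For the "verify and confirm" step in the circuit-breaker row, a minimal sketch of how the rollout state could be checked from the CLI. It assumes the prod cluster, service, and profile names used elsewhere in this runbook; the function names are hypothetical. ECS reports a rolloutState of FAILED on a deployment that did not reach steady state while the circuit breaker is enabled.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, resource names come from this runbook.

# Print one line per deployment: status, rolloutState, task definition ARN.
rollout_states() {
  aws ecs describe-services \
    --cluster askflorence-prod \
    --services askflorence-prod-app \
    --query 'services[0].deployments[].[status,rolloutState,taskDefinition]' \
    --output text \
    --profile askflorence-prod
}

# Succeed only if at least one deployment reports rolloutState FAILED.
circuit_breaker_tripped() {
  rollout_states | awk '$2 == "FAILED" { found = 1 } END { exit !found }'
}

# Example use:
#   circuit_breaker_tripped && echo "circuit breaker rolled the deploy back"
```

The PRIMARY deployment's task definition in the same output shows which revision the service ended up on after the automatic rollback.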
Path A — Redeploy a previous good commit (preferred)
This is the cleanest rollback. The same Terraform-driven pipeline rebuilds + applies a known-good image.
Prod
bash
# Find the last known-good commit
git log --oneline main -20
# Dispatch deploy-prod with that commit
gh workflow run deploy-prod.yml --ref main -f ref=<good-commit-sha>
The --ref main flag selects which branch's copy of the workflow file runs; the ref input determines what gets checked out + built. Approve the GitHub environment protection prompt.
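The prod dispatch above could be wrapped in a small guard. This is a sketch, not part of the pipeline: it assumes the good commit should be reachable from origin/main (the runbook finds it with git log on main), and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# Succeed only if the given commit is reachable from origin/main.
is_on_main() {
  git fetch --quiet origin main &&
    git merge-base --is-ancestor "$1" origin/main
}

# Dispatch deploy-prod for a known-good commit, refusing commits not on main.
redeploy_prod() {
  local sha="$1"
  is_on_main "$sha" || { echo "refusing: $sha is not on main" >&2; return 1; }
  gh workflow run deploy-prod.yml --ref main -f ref="$sha"
}

# Follow the most recent deploy-prod run until it finishes.
# The run can take a few seconds to appear after dispatch.
watch_latest_prod_run() {
  local run_id
  run_id="$(gh run list --workflow deploy-prod.yml --limit 1 \
    --json databaseId --jq '.[0].databaseId')" || return 1
  gh run watch "$run_id" --exit-status
}

# Example use:
#   redeploy_prod <good-commit-sha> && watch_latest_prod_run
```

The environment protection approval still has to be given in GitHub before the run proceeds.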
Staging
bash
# Force-push the staging branch to point at the previous good commit
git push origin <good-commit-sha>:staging --force-with-lease
The deploy workflow fires on the push and applies the previous commit's state. Terraform will see whatever drift exists between the bad apply and the good source, plan to reconcile, and apply.
This is the canonical "fix forward via rollback" path. Use it whenever possible. The state stays clean because every state mutation comes from a real workflow run, not from out-of-band CLI surgery.
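After the staging force-push, it can be worth confirming that the remote branch really points at the intended commit, since --force-with-lease refuses the push if the remote branch moved since the last fetch. A minimal sketch with hypothetical function names:

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# Print the commit the remote staging branch currently points at.
remote_staging_sha() {
  git ls-remote origin refs/heads/staging | cut -f1
}

# Succeed only if remote staging resolves to the expected commit.
# Accepts an abbreviated SHA and expands it locally before comparing.
staging_points_at() {
  local expected
  expected="$(git rev-parse --verify "$1^{commit}")" || return 1
  [ "$(remote_staging_sha)" = "$expected" ]
}

# Example use:
#   staging_points_at <good-commit-sha> || echo "staging is not on the good commit" >&2
```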
Path B — Pin ECS service to a previous task def revision (emergency)
Use only when Path A is too slow (build + apply takes ~5-10 min) AND the regression is severe.
bash
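# Show the service's current deployments and the revision each one runs
aws ecs describe-services \
--cluster askflorence-prod \
--services askflorence-prod-app \
--query 'services[0].deployments[]' \
--profile askflorence-prod
# Once a rollout completes, older deployments drop out of that list; list recent revisions to find <N>
aws ecs list-task-definitions \
--family-prefix askflorence-prod-app-task \
--sort DESC \
--max-items 5 \
--profile askflorence-prod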
# Pin the service to a specific revision
aws ecs update-service \
--cluster askflorence-prod \
--service askflorence-prod-app \
--task-definition askflorence-prod-app-task:<N> \
--profile askflorence-prod
# Wait for stability
aws ecs wait services-stable \
--cluster askflorence-prod \
--services askflorence-prod-app \
--profile askflorence-prod
WARNING: this creates Terraform-state drift. The next terraform apply (deploy or local) will see that the service's task_definition ARN doesn't match what Terraform tracks, plan to "reconcile" by registering a new revision matching the bad source, and re-apply the bad change.
You MUST follow Path B with one of:
- git revert <bad-PR-commit> on main + dispatch a new deploy (preferred). The next deploy applies the reverted source over the manually-pinned revision; the service ends up on the reverted code via a clean Terraform apply.
- Or terraform import / terraform refresh carefully so state matches the pinned reality. This is fragile; reserve for unrecoverable scenarios.
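To see whether drift from a Path B pin is still outstanding, one option is terraform plan with -detailed-exitcode, which exits 0 for an empty plan, 2 when changes are pending, and 1 on error. The sketch below assumes it is run from the environment's Terraform root with the backend already initialized and any required variables supplied; this runbook does not specify those details, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: assumes the current directory is an initialized Terraform root.

# Print "clean", "drift", or "error" based on terraform plan's exit code.
drift_status() {
  local rc=0
  terraform plan -detailed-exitcode -input=false >/dev/null || rc=$?
  case "$rc" in
    0) echo "clean" ;;
    2) echo "drift" ;;
    *) echo "error"; return 1 ;;
  esac
}

# Example use:
#   [ "$(drift_status)" = "clean" ] || echo "state and live infra still differ" >&2
```

A "drift" result after pinning is expected; it should flip to "clean" once the revert has been deployed through the normal workflow.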
Path C — Revert the infra PR + redeploy (when infra source is the problem)
For deploys broken by an infra change (not application code):
bash
git revert <bad-infra-PR-commit>
git push origin <branch-with-revert>
# Open + merge revert PR
# Re-dispatch deploy
The same out-of-band IAM-apply pattern used during ENG-308 + ENG-313 may be needed if the bad change touches the deploy role's permissions and iam:PutRolePolicy is denied (by design: the deploy role can't mutate itself).
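One detail of the git revert step: if the bad PR landed as a merge commit rather than a squash, git revert needs -m to know which parent is the mainline. The runbook does not say which merge strategy the repo uses, so the sketch below handles both; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# Print the number of parents of a commit (2 or more means a merge commit).
parent_count() {
  git rev-list --parents -n 1 "$1" | awk '{ print NF - 1 }'
}

# Revert a commit, passing -m 1 for merge commits so the first parent
# (the branch that was merged into) is treated as the mainline.
revert_commit() {
  local sha="$1"
  if [ "$(parent_count "$sha")" -gt 1 ]; then
    git revert --no-edit -m 1 "$sha"
  else
    git revert --no-edit "$sha"
  fi
}

# Example use:
#   revert_commit <bad-infra-PR-commit>
```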
Rolling back the ENG-277 deploy pipeline itself
If the new Terraform-driven pipeline needs to be reverted entirely:
- Revert PRs #234 (prod) + #264 (staging). This restores lifecycle.ignore_changes = [container_definitions] and the legacy deploy workflow chain.
- Live ECS task def is not affected by the source revert: Terraform respects ignore_changes again, so it leaves the live revision alone.
- Next deploy uses the restored legacy chain (aws ecs describe-task-definition + amazon-ecs-render-task-definition + amazon-ecs-deploy-task-definition).
- The ENG-272 bug class returns. Document the revert reason carefully because the team learned the hard way that detection-only mitigations are insufficient.
This shouldn't be necessary — prod has been stable on the new pipeline since 2026-05-13T07:08Z. But it's a real escape hatch.
Verifying rollback success
After any rollback path:
bash
# Confirm live task def matches expectations
aws ecs describe-task-definition \
--task-definition askflorence-<env>-app-task \
--query 'taskDefinition.{revision:revision,image:containerDefinitions[0].image,envCount:length(containerDefinitions[0].environment),secretCount:length(containerDefinitions[0].secrets)}' \
--profile askflorence-<env>
# Confirm service is stable
aws ecs describe-services \
--cluster askflorence-<env> \
--services askflorence-<env>-app \
--query 'services[0].{desiredCount:desiredCount,runningCount:runningCount,taskDefinition:taskDefinition}' \
--profile askflorence-<env>
# Apex / stage health
curl -sS https://askflorence.health/api/health # or https://stage.askflorence.health/api/health
# Should return {"status":"ok","commit":"<expected-sha>","env":"prod"}
See also
- ADR 0007: the deploy pipeline design
- docs/runbooks/deploy-via-terraform.md: normal deploy operations
- docs/infrastructure/change-log.md: recent changes to look at when diagnosing a regression