Rollback via Terraform

Operational runbook for rolling back a bad deploy on the new Terraform-driven pipeline (ADR 0007).

Quick reference

Scenario	Fastest mitigation	Follow-up
New deploy made the app worse but the service is still healthy	Re-dispatch with the previous good `ref`	Investigate the bad change in a follow-up PR
New deploy made the service unhealthy (5xx, crash loop)	ECS deployment circuit breaker rolls back to the previous revision automatically; verify that it did	Investigate why the circuit breaker triggered
Circuit breaker didn't catch it (rare; new task is "healthy" but the app behaves wrong)	Pin to the previous revision via `aws ecs update-service --task-definition <family>:<old-N>`	Revert the PR and re-dispatch
Bad infra change (Terraform source) blocking deploys	`git revert <PR-commit>` on `main`; re-dispatch	Re-architect the change

Path A — Redeploy a previous good commit (preferred)

This is the cleanest rollback. The same Terraform-driven pipeline rebuilds + applies a known-good image.

Prod

bash

# Find the last known-good commit
git log --oneline main -20

# Dispatch deploy-prod with that commit
gh workflow run deploy-prod.yml --ref main -f ref=<good-commit-sha>

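The ref input (-f ref=...) determines what gets checked out and built; --ref main only selects which branch's copy of the workflow file runs. Approve the GitHub environment protection prompt when it appears.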
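Before dispatching, it can help to confirm that the value passed as ref is a commit SHA that is actually on main, since a typo or a branch name would build something unintended. The following is a minimal sketch, not part of the pipeline: the function names is_commit_sha and rollback_prod are hypothetical, and it assumes git and gh are authenticated for this repository.

```shell
# Hypothetical helpers; the names are illustrative, not part of the pipeline.

# Succeeds only for a 7-40 character hex string (a commit SHA, not a branch
# or tag name that could move between runs).
is_commit_sha() {
  case "$1" in
    "" | *[!0123456789abcdef]*) return 1 ;;
  esac
  [ "${#1}" -ge 7 ] && [ "${#1}" -le 40 ]
}

# Checks that the commit is reachable from origin/main, then dispatches
# the same workflow command shown above.
rollback_prod() {
  sha="$1"
  is_commit_sha "$sha" || { echo "not a commit SHA: $sha" >&2; return 1; }
  git fetch origin main || return 1
  git merge-base --is-ancestor "$sha" origin/main || {
    echo "$sha is not on main" >&2
    return 1
  }
  gh workflow run deploy-prod.yml --ref main -f ref="$sha"
}
```

Usage would be rollback_prod <good-commit-sha>; the function refuses to dispatch if either check fails.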

Staging

bash

# Force-push the staging branch to point at the previous good commit
git push origin <good-commit-sha>:staging --force-with-lease

The deploy workflow fires on the push and applies the previous commit's state. Terraform will see whatever drift exists between the bad apply and the good source, plan to reconcile, and apply.

This is the canonical "fix forward via rollback" path. Use it whenever possible. The state stays clean because every state mutation comes from a real workflow run, not from out-of-band CLI surgery.
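A bare --force-with-lease compares against the local remote-tracking ref, which a background fetch can update without anyone noticing. Pinning the expected value makes the push fail if staging has moved since the bad commit landed. The sketch below is one way to do that; the function name rollback_staging and its arguments are hypothetical.

```shell
# Hypothetical helper: reset staging to a good commit, but only if the remote
# branch still points at the bad commit we expect to replace.
# Pass full SHAs (or make sure both commits exist locally so git can resolve them).
rollback_staging() {
  good_sha="$1"
  bad_sha="$2"
  remote="${3:-origin}"
  git push "$remote" "$good_sha:refs/heads/staging" \
    --force-with-lease="refs/heads/staging:$bad_sha"
}
```

If someone else has pushed to staging in the meantime, the push is rejected and nothing changes, which is the point to stop and coordinate rather than retry with a plain --force.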

Path B — Pin ECS service to a previous task def revision (emergency)

Use only when Path A is too slow (build + apply takes ~5-10 min) AND the regression is severe.

bash

# Identify the previous good task definition revision
aws ecs describe-services \
  --cluster askflorence-prod \
  --services askflorence-prod-app \
  --query 'services[0].deployments[]' \
  --profile askflorence-prod

# Pin the service to a specific revision
aws ecs update-service \
  --cluster askflorence-prod \
  --service askflorence-prod-app \
  --task-definition askflorence-prod-app-task:<N> \
  --profile askflorence-prod

# Wait for stability
aws ecs wait services-stable \
  --cluster askflorence-prod \
  --services askflorence-prod-app \
  --profile askflorence-prod
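The deployments list only covers deployments that are still active or draining, so an older good revision may no longer appear there. As a fallback, the sketch below lists the most recent registered revisions for the family. The family name is taken from the update-service call above; the function names are hypothetical.

```shell
# Hypothetical helpers; the family name comes from the update-service call above.

# Prints "family:revision" from a task definition ARN.
family_rev() {
  printf '%s\n' "${1##*/}"
}

# Lists the five most recent ACTIVE revisions, newest first.
# Note: --family-prefix is a prefix match, so other families sharing the
# prefix would also be listed.
recent_revisions() {
  aws ecs list-task-definitions \
    --family-prefix askflorence-prod-app-task \
    --status ACTIVE \
    --sort DESC \
    --max-items 5 \
    --query 'taskDefinitionArns' \
    --output json \
    --profile askflorence-prod
}
```

The newest entry is usually the bad revision itself; which older one is "good" still has to be confirmed against the deploy history.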

WARNING: this creates Terraform-state drift. The next terraform apply (deploy or local) will see that the service's task_definition ARN doesn't match what Terraform tracks, plan to "reconcile" by registering a new revision matching the bad source, and re-apply the bad change.

You MUST follow Path B with one of:

git revert <bad-PR-commit> on main + dispatch a new deploy (preferred). The next deploy applies the reverted source over the manually-pinned revision; the service ends up on the reverted code via a clean Terraform apply.
Or reconcile state with the pinned reality using terraform import or a refresh-only apply (terraform apply -refresh-only, which replaces the deprecated terraform refresh). This is fragile; reserve it for unrecoverable scenarios.
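To see whether anything is still pending before the next deploy, terraform plan -detailed-exitcode distinguishes "no changes" (exit 0) from "error" (exit 1) and "changes pending" (exit 2). The sketch below wraps that; it assumes it is run from the Terraform root with working credentials, and the function names are hypothetical.

```shell
# Hypothetical helper: translate the exit code of
# `terraform plan -detailed-exitcode` into a short status word.
plan_status() {
  case "$1" in
    0) echo "no-changes" ;;
    2) echo "changes-pending" ;;
    *) echo "error" ;;
  esac
}

# Runs a plan (no apply) and reports whether source and live state differ.
check_drift() {
  terraform plan -detailed-exitcode -input=false >/dev/null
  plan_status "$?"
}
```

A changes-pending result after Path B is expected until the revert has been deployed; it covers both manual drift and unapplied source changes, so read the plan output before drawing conclusions.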

Path C — Revert the infra PR + redeploy (when infra source is the problem)

For deploys broken by an infra change (not application code):

bash

git revert <bad-infra-PR-commit>   # for a merge commit: git revert -m 1 <merge-sha>
git push origin <branch-with-revert>
# Open + merge revert PR
# Re-dispatch deploy

The same out-of-band IAM-apply pattern used during ENG-308 and ENG-313 may be needed if the bad change touches the deploy role's permissions and iam:PutRolePolicy is denied (by design: the deploy role can't mutate itself).

Rolling back the ENG-277 deploy pipeline itself

If the new Terraform-driven pipeline needs to be reverted entirely:

Revert PRs #234 (prod) + #264 (staging) — restores lifecycle.ignore_changes = [container_definitions] and the legacy deploy workflow chain.
Live ECS task def is not affected by the source revert — Terraform respects ignore_changes again, so it leaves the live revision alone.
Next deploy uses the restored legacy chain (aws ecs describe-task-definition + amazon-ecs-render-task-definition + amazon-ecs-deploy-task-definition).
The ENG-272 bug class returns. Document the revert reason carefully because the team learned the hard way that detection-only mitigations are insufficient.

This shouldn't be necessary — prod has been stable on the new pipeline since 2026-05-13T07:08Z. But it's a real escape hatch.

Verifying rollback success

After any rollback path:

bash

# Confirm the task def matches expectations.
# A bare family name resolves to the latest ACTIVE revision; after Path B,
# pass <family>:<N> to inspect the pinned revision instead.
aws ecs describe-task-definition \
  --task-definition askflorence-<env>-app-task \
  --query 'taskDefinition.{revision:revision,image:containerDefinitions[0].image,envCount:length(containerDefinitions[0].environment || `[]`),secretCount:length(containerDefinitions[0].secrets || `[]`)}' \
  --profile askflorence-<env>

# Confirm service is stable
aws ecs describe-services \
  --cluster askflorence-<env> \
  --services askflorence-<env>-app \
  --query 'services[0].{desiredCount:desiredCount,runningCount:runningCount,taskDefinition:taskDefinition}' \
  --profile askflorence-<env>

# Apex / stage health
curl -sS https://askflorence.health/api/health     # or https://stage.askflorence.health/api/health
# Should return {"status":"ok","commit":"<expected-sha>","env":"prod"}
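The health check can be polled until the expected commit shows up, rather than re-running curl by hand. The following is a minimal sketch that assumes the compact JSON shape shown above; the function names are hypothetical, and the expected SHA must be given at the same length the endpoint reports.

```shell
# Hypothetical helper: succeeds when the health JSON reports status "ok" and
# the expected commit. Plain pattern matching keeps it dependency-free; it
# assumes compact JSON with no spaces around the colons.
health_reports_commit() {
  body="$1"
  expected_sha="$2"
  case "$body" in
    *"\"status\":\"ok\""*) ;;
    *) return 1 ;;
  esac
  case "$body" in
    *"\"commit\":\"$expected_sha\""*) return 0 ;;
    *) return 1 ;;
  esac
}

# Polls the endpoint every 10 seconds until it reports the expected commit
# or the attempts run out (default 30).
wait_for_commit() {
  url="$1"
  expected_sha="$2"
  attempts="${3:-30}"
  while [ "$attempts" -gt 0 ]; do
    body=$(curl -sS --max-time 5 "$url") &&
      health_reports_commit "$body" "$expected_sha" && return 0
    attempts=$((attempts - 1))
    sleep 10
  done
  return 1
}
```

For example, wait_for_commit https://askflorence.health/api/health <expected-sha> returns non-zero if the commit never appears, which makes it usable as a gate in a script.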
