Deploying via Terraform

Operational runbook for the deploy pipeline that landed in ENG-277 / ADR 0007.

How deploys fire

| Env | Trigger | Approval |
| --- | --- | --- |
| Prod | gh workflow run deploy-prod.yml --ref <ref> (manual workflow_dispatch) OR GitHub Actions UI | GitHub environment protection rule on production requires Taha to click Approve |
| Staging | git push origin <commit>:staging (push to the staging branch fires automatically) OR gh workflow run deploy-staging.yml | None (auto on push) |

Vercel prod from main is retired as of Phase 10 (2026-04-29). All prod traffic goes through AWS ECS via the prod deploy workflow.

What happens during a deploy run (both envs, same shape)

  1. Validate-secrets gate (ENG-297 / ENG-309). The reusable validate-secrets.yml workflow runs first, assumes the env's dedicated ValidateSecretsRole, and iterates over every secret in the env's Secrets Manager, checking for \n, trailing whitespace, empty values, and placeholder strings. A failure blocks the build job.
  2. Checkout + OIDC + ECR login. Deploy job assumes GitHubActionsDeployRole for the env account.
  3. Build + push image. docker buildx build against linux/amd64, push to ECR with :<sha> tag (prod uses immutable tags; staging additionally pushes :latest).
  4. Ensure-indexes pre-deploy task (ENG-266 Phase 3.5). In-VPC ECS task assumes its own execution role (which has the app_admin_schema admin-tier Mongo secret), runs index creation, awaits completion, checks exit code. Failure aborts the deploy with old service still active.
  5. Setup Terraform. hashicorp/setup-terraform@v3 pinned to 1.14.0.
  6. terraform init in infra/envs/<env>. Assumes TerraformBackendRole in the management account (778477254880) via the deploy role's AssumeRoleBackend Sid.
  7. terraform validate.
  8. terraform apply -auto-approve -input=false -var "app_image_uri=<built-image-uri>". This is the step that changes live infrastructure. Terraform refreshes the env state, computes the diff (image SHA change plus any env var or secret additions in Terraform source), registers a new task definition revision, and updates the ECS service to point at it. The service's deployment_circuit_breaker { enable = true, rollback = true } handles the rolling deployment and rolls back automatically on health failure.
  9. aws ecs wait services-stable with timeout (prod 15min, staging 10min). Polls every 15s until runningCount == desiredCount on the new revision.
  10. Scale to N if first deploy. (Prod: 0→2. Staging: 0→1.) No-op on subsequent deploys.
  11. ALB smoke against origin.<stage.>askflorence.health/api/health — must return 200.
  12. Fetch smoke secrets from Secrets Manager (4 ARNs: mongodb/app-write, resume-token-secret, internal-reminder-token, hubspot-access-token).
  13. Post-deploy smoke via npx tsx scripts/audit/post-deploy-smoke.ts — 11 checks (6 read-path from ENG-272 + 5 write-path from ENG-275).
  14. Report summary to the workflow run summary page.
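
The four checks named in step 1 can be pictured with a small shell function. This is a minimal sketch, not the contents of validate-secrets.yml: the function name check_secret_value and the placeholder list (CHANGEME, TODO, PLACEHOLDER) are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify one secret value using the four checks from step 1.
# Prints the first problem found (or "ok") and returns non-zero on a problem.
check_secret_value() {
  local value="$1"
  if [[ -z "$value" ]]; then
    echo "empty"; return 1
  fi
  # A real newline anywhere in the value, or the literal two characters "\n".
  if [[ "$value" == *$'\n'* || "$value" == *'\n'* ]]; then
    echo "newline"; return 1
  fi
  # Trailing space or tab.
  if [[ "$value" =~ [[:space:]]$ ]]; then
    echo "trailing-whitespace"; return 1
  fi
  # Assumed placeholder strings; the real list is not given in this runbook.
  case "$value" in
    CHANGEME|TODO|PLACEHOLDER) echo "placeholder"; return 1 ;;
  esac
  echo "ok"
}
```

A gate built this way would presumably loop over the env's secrets, fetch each value (for example with aws secretsmanager get-secret-value), and fail the job on the first non-ok result while logging only the secret name, never the value.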

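Deploying a specific commit (rollback or canary)

Prod

bash
# --ref selects which version of the workflow file runs.
# -f ref=... sets the workflow's ref input, which selects the code to check out and build.
gh workflow run deploy-prod.yml --ref <commit-or-branch> -f ref=<commit-or-branch>

The ref input determines what code gets checked out and built. The default is main.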

Staging

bash
# Push the specific commit to the staging branch.
git push origin <commit>:staging --force-with-lease

Or via merge if a clean history is desired:

bash
git checkout staging
git merge --no-ff <commit> -m "deploy: promote <commit-summary> to staging"
git push origin staging
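
Because a push to staging is what fires the deploy, it can help to confirm that the remote branch actually points at the intended commit before waiting on the workflow. The helper below is a sketch under assumptions: the name staging_tip_matches is hypothetical, and it expects the full 40-character SHA.

```bash
# Hypothetical helper: succeed only if origin's staging branch points at the
# given full commit SHA.
staging_tip_matches() {
  local expected_sha="$1"
  local remote_sha
  # git ls-remote prints "<sha><TAB><ref>"; keep the first field.
  remote_sha="$(git ls-remote origin refs/heads/staging | cut -f1)"
  [[ -n "$remote_sha" && "$remote_sha" == "$expected_sha" ]]
}

# Example (not executed here):
#   staging_tip_matches "$(git rev-parse <commit>)" || echo "staging is not at <commit>" >&2
```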

When to expect state-lock contention

  • Within a single env: GitHub Actions concurrency: deploy-<env> group serializes deploys. Two concurrent dispatches queue rather than fight.
  • Cross-env: prod and staging have independent state files. No cross-contention.
  • Local terraform plan/apply from a developer machine WILL conflict with an in-flight deploy. Coordinate via team chat before running local ops against the same env root.
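
One way to back up the "coordinate before running local ops" rule is a guard that checks for an active deploy run first. This is a sketch, not an existing script in the repo: the function names are hypothetical, and it assumes the workflow files are named deploy-<env>.yml as in this runbook.

```bash
# Hypothetical guard: refuse local Terraform work while a deploy run is active.
inflight_deploys() {
  local env="$1"
  # Count workflow runs for this env that are currently in progress.
  gh run list --workflow "deploy-${env}.yml" --status in_progress \
    --json databaseId --jq 'length'
}

safe_to_run_local_terraform() {
  local env="$1" count
  count="$(inflight_deploys "$env")" || return 2   # gh failed: state unknown
  if [[ "$count" -gt 0 ]]; then
    echo "deploy-${env} has ${count} run(s) in progress; wait before running terraform locally" >&2
    return 1
  fi
}

# Example (not executed here):
#   safe_to_run_local_terraform prod && terraform -chdir=infra/envs/prod plan
```

The check only narrows the race window. A deploy can still start after the guard passes, and queued runs are not counted, so the team-chat coordination still applies.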

When to expect refresh slowdowns

  • First deploy after a long quiet period: refresh reads every resource in the env root. This can take 60-90s on prod, which has more state; normal deploys take 30-50s.
  • After a Terraform provider upgrade: refresh may re-read attribute shapes. One-time per provider bump.
  • After unrelated infra source changes that haven't been applied yet: refresh + plan will surface them. Either apply them as part of the deploy (intentional) or rebase the PR so they're already in state (cleaner).
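
To find out ahead of a deploy whether unapplied source changes are waiting in an env root, terraform plan can be run with -detailed-exitcode, which exits 0 for no changes, 2 for pending changes, and 1 on error. The wrapper below is a sketch: the function name is hypothetical, and it assumes the caller is in infra/envs/<env> and passes the image URI that is currently deployed so the image itself does not appear as a diff.

```bash
# Hypothetical pre-deploy check: report whether the env root has unapplied changes.
pending_infra_changes() {
  local current_image_uri="$1" rc=0
  terraform plan -input=false -detailed-exitcode \
    -var "app_image_uri=${current_image_uri}" >/dev/null || rc=$?
  case "$rc" in
    0) echo "clean" ;;
    2) echo "pending-changes" ;;
    *) echo "plan-error"; return 1 ;;
  esac
}
```

This plan takes the state lock like any other local operation, so the contention guidance in the previous section applies to it as well.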

Ensure-indexes interaction

The ensure-indexes task family (<env>-app-ensure-indexes-task) still uses the legacy CLI path (register-task-definition + run-task) and has its own lifecycle.ignore_changes = [container_definitions] in infra/modules/ecs-ensure-indexes/main.tf. This is intentional and tracked as a separate concern — ensure-indexes is a one-shot pre-deploy task with visible non-zero-exit failure mode, not a long-running service. Its drift surface is smaller and less consequential.

If the ensure-indexes task fails (non-zero exit), the deploy aborts BEFORE Terraform apply runs, leaving the old service active. Clean rollback by default.
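
The "run, await completion, check exit code" gate for a one-shot ECS task can be sketched as follows. This is illustrative only and is not the workflow's actual step: the function name, the FARGATE launch type, and the CLUSTER, SUBNETS, and SECURITY_GROUPS variables are all assumptions.

```bash
# Hypothetical sketch of the exit-code gate for a one-shot ECS task.
# CLUSTER, SUBNETS and SECURITY_GROUPS are placeholders the caller must set.
run_one_shot_task() {
  local task_family="$1" task_arn exit_code
  task_arn="$(aws ecs run-task \
    --cluster "$CLUSTER" \
    --task-definition "$task_family" \
    --launch-type FARGATE \
    --network-configuration "awsvpcConfiguration={subnets=[$SUBNETS],securityGroups=[$SECURITY_GROUPS]}" \
    --query 'tasks[0].taskArn' --output text)" || return 1

  # Block until the task reaches STOPPED.
  aws ecs wait tasks-stopped --cluster "$CLUSTER" --tasks "$task_arn" || return 1

  exit_code="$(aws ecs describe-tasks --cluster "$CLUSTER" --tasks "$task_arn" \
    --query 'tasks[0].containers[0].exitCode' --output text)"

  # Anything other than "0" (including a missing exit code) counts as failure.
  [[ "$exit_code" == "0" ]]
}

# Example (not executed here):
#   run_one_shot_task "staging-app-ensure-indexes-task" || exit 1   # abort before terraform apply
```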

Debug-on-failure paths

  • Workflow failed at Build and push image to ECR: docker build or push errored. Check the step log for the actual error.
  • Workflow failed at Run ensure-indexes ECS task: the task ran in-VPC and exited non-zero. Check /aws/ecs/<env>-app-ensure-indexes log group (aws logs tail shown in the workflow's debug-on-failure step).
  • Workflow failed at Terraform init: state backend access issue. Most common: OIDC role's AssumeRoleBackend Sid is missing or the backend KMS key isn't readable. See ENG-308 for the original IAM expansion.
  • Workflow failed at Terraform apply: can mean (a) AccessDenied on a refresh path (deploy role missing a Get* / Describe* action — add to TerraformRefreshReadOnly Sid), (b) AccessDenied on a write path (deploy role can't actually mutate the resource — usually by design; check what Terraform was trying to change), or (c) a real diff that fails to apply (e.g. ECS service circuit breaker auto-rollback). Check the apply step log.
  • Workflow failed at Wait for ECS service stability: new task def is unhealthy; circuit breaker should have auto-rolled back. Confirm with aws ecs describe-services — if the service is still on the old revision, the rollback worked.
  • Workflow failed at Smoke test /api/health: ALB target is reachable but the app is unhealthy. Check /aws/ecs/<env>-app log group.
  • Workflow failed at Post-deploy smoke: one of the 11 checks failed. The script prints which check failed and the response. Before ENG-277 the most common cause was a recently added secret that was present in Terraform source but missing from the live task def's secrets[]. Since ENG-277 that cannot happen by construction, so a failure here points to a real app-side regression.
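
For the ECS stability failure above, the describe-services confirmation can be scripted. A minimal sketch, assuming the task definition ARN that was live before the deploy was recorded somewhere; the function name and its arguments are hypothetical.

```bash
# Hypothetical check: after a failed deploy, is the service back on the task
# definition that was live before the deploy started?
rolled_back_to() {
  local cluster="$1" service="$2" previous_task_def_arn="$3" current
  current="$(aws ecs describe-services --cluster "$cluster" --services "$service" \
    --query 'services[0].taskDefinition' --output text)" || return 2
  [[ "$current" == "$previous_task_def_arn" ]]
}

# Example (not executed here):
#   rolled_back_to "$CLUSTER" "$SERVICE" "$PREVIOUS_ARN" && echo "circuit breaker rollback confirmed"
```

If the check fails, the app log group named in the health-check bullet is the next place to look, for example with aws logs tail and a --since window covering the deploy.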

Why this works (the design intent)

lifecycle.ignore_changes = [container_definitions] on aws_ecs_task_definition is gone. Terraform now owns the task def end-to-end: image SHA via -var, env vars + secret bindings from Terraform source. The deploy workflow's job reduces to "build the image + tell Terraform what SHA to use + smoke-test the result." Every other artifact (revision number, secrets[] composition, environment[] composition) is reconciled by terraform apply against the source-of-truth Terraform code.

Result: the silent-secret-binding-drift bug class (ENG-249, ENG-271, ENG-272, ENG-279) cannot recur. Parity between Terraform source and the live ECS task def is enforced by construction, not by a nightly checker that catches drift after the fact.
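
The reduced job of the workflow ("build the image, tell Terraform what SHA to use") comes down to composing one image URI and passing it as a variable. The helpers below are a sketch under assumptions: the function names, account ID, region, and repository name are hypothetical, while the apply flags and the infra/envs/<env> path are the ones listed in this runbook.

```bash
# Hypothetical helpers: compose the ECR image URI and hand it to Terraform.
ecr_image_uri() {
  local account_id="$1" region="$2" repo="$3" sha="$4"
  echo "${account_id}.dkr.ecr.${region}.amazonaws.com/${repo}:${sha}"
}

apply_image() {
  local env="$1" image_uri="$2"
  # -chdir is a global option and must come before the subcommand.
  terraform -chdir="infra/envs/${env}" apply -auto-approve -input=false \
    -var "app_image_uri=${image_uri}"
}

# Example (not executed here; all values are placeholders):
#   apply_image staging "$(ecr_image_uri 123456789012 us-east-1 app "$GITHUB_SHA")"
```

Everything else in the task definition comes from Terraform source during the apply, which is why the sketch has no step that edits secrets or environment variables.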

See also

  • ADR 0007 — the design decision
  • docs/runbooks/rollback-via-terraform.md — rollback procedures
  • infra/modules/ecs-service/main.tf — the module that owns the task def
  • .github/workflows/deploy-prod.yml + deploy-staging.yml — the workflow definitions

AskFlorence Internal Documentation. Not for public distribution.