ADR 0007 — Terraform owns the ECS task definition; deploys via terraform apply -var app_image_uri=

Status

Accepted — 2026-05-13.

Shipped under ENG-277. Prod went live on the new pipeline 2026-05-13T07:08Z (deploy run 25784042404). Staging mirror landed 2026-05-13 via PR #264.

Supersedes the prior "CI image-swap + Terraform lifecycle.ignore_changes on the task definition" pattern documented in PRs #150 (ENG-272 Layers 1-4) and #162 (ENG-272 Layer 5 drift detection).

Context

The ECS module pinned lifecycle.ignore_changes = [container_definitions] so CI could register new task-definition revisions on every deploy without Terraform fighting it. The block is all-or-nothing on the JSON-encoded container_definitions attribute — when Terraform source added a new secrets[] or environment[] entry, Terraform silently stopped tracking it, and the deploy workflow's describe-task-definition + render-task-definition chain only swapped the image. The new binding never landed on the running task.
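A minimal sketch of the legacy pattern described above, to make the failure mode concrete. Everything except the lifecycle block is an assumption for illustration: the family name, the sizing, the container name, and the three variables are hypothetical, not copied from infra/modules/ecs-service.

```hcl
# Hypothetical inputs, declared so the sketch stands on its own.
variable "container_image" { type = string }
variable "execution_role_arn" { type = string }
variable "hubspot_token_secret_arn" { type = string }

resource "aws_ecs_task_definition" "this" {
  family                   = "app" # hypothetical name
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "512"
  memory                   = "1024"
  execution_role_arn       = var.execution_role_arn

  # container_definitions is a single JSON-encoded string, so Terraform
  # sees the image, secrets[] and environment[] as one attribute.
  container_definitions = jsonencode([
    {
      name      = "app"
      image     = var.container_image
      essential = true
      secrets = [
        # A binding added here after the first apply never reached the
        # running task: the whole attribute was excluded from the plan.
        { name = "HUBSPOT_ACCESS_TOKEN", valueFrom = var.hubspot_token_secret_arn }
      ]
    }
  ])

  lifecycle {
    # All-or-nothing: there is no way to ignore only the image field.
    ignore_changes = [container_definitions]
  }
}
```

With this block in place, terraform plan shows no diff for a newly added secrets entry, and an image-only swap in CI re-registers the live definition without it.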

Four recurrences of this bug class in ten days:

| Date | Issue | What broke |
| --- | --- | --- |
| 2026-05-04 | ENG-249 | HubSpot CRM sync code shipped to apex but never created a contact (silent fire-and-forget — HUBSPOT_ACCESS_TOKEN missing from live task def) |
| 2026-05-08 | ENG-271 | 15-min agent-reminder never fired (silent no-op — SCHEDULER_* env vars missing from live task def) |
| 2026-05-11 | ENG-272 | "Tyler Wood not covered" wrong on the YC application surface (MONGODB_REFERENCE_URI missing from live staging task def) |
| 2026-05-12 | ENG-279 | POST /api/waitlist 500 on the YC link smoke (MONGODB_WRITE_URI + MONGODB_AUDIT_WRITE_URI missing from live staging task def revision 75) |

Each recurrence cost real founder-time to diagnose. Detection layers (Layer 1-5 documented in ENG-272 retro) catch drift between PR-time and runtime, but only AFTER the bad code has shipped and run for some window. The right answer was structural: stop the gap from existing.

Decision

Drop lifecycle.ignore_changes = [container_definitions] from infra/modules/ecs-service. Make Terraform own the whole task definition (including image). Have the deploy workflow drive Terraform with the new image SHA as a per-deploy variable.

Module shape (infra/modules/ecs-service/main.tf):

  • aws_ecs_task_definition.this — no lifecycle.ignore_changes block on container_definitions. Image is set via var.container_image which env callsites plumb through.
  • aws_ecs_service.this — lifecycle.ignore_changes shrunk from [desired_count, task_definition] to [desired_count]. (Kept desired_count so the first-deploy 0→2 scaling step doesn't fight Terraform.)
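The module shape above could look roughly like the following. This is a sketch under stated assumptions, not the contents of infra/modules/ecs-service/main.tf: only var.container_image, the two resource addresses, and the ignore_changes list come from this ADR, and every other variable and argument is hypothetical.

```hcl
variable "container_image" { type = string }

# Hypothetical inputs, declared so the sketch stands on its own.
variable "execution_role_arn" { type = string }
variable "cluster_arn" { type = string }
variable "subnet_ids" { type = list(string) }
variable "security_group_ids" { type = list(string) }
variable "desired_count" { type = number }

resource "aws_ecs_task_definition" "this" {
  family                   = "app" # hypothetical name
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "512"
  memory                   = "1024"
  execution_role_arn       = var.execution_role_arn

  # No lifecycle block: any change to image, secrets[] or environment[]
  # shows up in the plan and produces a new task-definition revision.
  container_definitions = jsonencode([
    {
      name      = "app"
      image     = var.container_image
      essential = true
    }
  ])
}

resource "aws_ecs_service" "this" {
  name            = "app" # hypothetical name
  cluster         = var.cluster_arn
  launch_type     = "FARGATE"
  desired_count   = var.desired_count
  task_definition = aws_ecs_task_definition.this.arn

  network_configuration {
    subnets         = var.subnet_ids
    security_groups = var.security_group_ids
  }

  lifecycle {
    # task_definition is no longer ignored, so the service follows each
    # new revision; only desired_count stays outside Terraform's control.
    ignore_changes = [desired_count]
  }
}
```

Because the service references the task definition's ARN, a changed container_image flows through as a new revision plus a service update within one apply.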

Env callsite shape (infra/envs/{prod,staging}/):

  • ecs.tf — container_image = var.app_image_uri (instead of hardcoded placeholder).
  • variables.tf — declares app_image_uri with default "public.ecr.aws/docker/library/nginx:alpine" so no-arg terraform apply parses.
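A sketch of the two env-side files, assuming the env directory sits at infra/envs/<env> so the module is two levels up. The module label ecs_service and the description text are hypothetical; the variable name and its default are the ones stated above.

```hcl
# variables.tf
variable "app_image_uri" {
  type        = string
  description = "Image URI supplied per deploy by the workflow."

  # Placeholder so a no-arg terraform apply still parses.
  default = "public.ecr.aws/docker/library/nginx:alpine"
}

# ecs.tf
module "ecs_service" {
  source = "../../modules/ecs-service"

  # Previously a hardcoded placeholder; now the per-deploy variable.
  container_image = var.app_image_uri

  # Other module inputs omitted from this sketch.
}
```

The default is only a parse-time convenience under this reading; real deploys are expected to pass -var app_image_uri explicitly.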

Deploy workflow shape (.github/workflows/deploy-{prod,staging}.yml):

  • Build + push image to ECR (unchanged).
  • Ensure-indexes pre-deploy task (unchanged — still uses legacy register-task-definition + run-task CLI path because its own lifecycle.ignore_changes is a separate, smaller-blast-radius concern; future tightening tracked outside this ADR).
  • hashicorp/setup-terraform@v3, pinned to terraform_version: 1.14.0.
  • terraform init -input=false in infra/envs/<env>.
  • terraform validate.
  • terraform apply -auto-approve -input=false -var "app_image_uri=<ecr-uri-from-build-step-output>" (the workflow uses the build step's image-uri output).
  • timeout <secs> aws ecs wait services-stable mirroring legacy wait-for-service-stability semantics.

Consequences

Good

  • The ENG-249 / ENG-271 / ENG-272 / ENG-279 bug class is structurally retired. Terraform-source secrets[] and environment[] additions land on the running task on the next deploy by construction, not by hoping the detection layers catch the gap before users do.
  • Single source of truth. Terraform owns the task definition; the manifest declares users/roles; the workflow only declares "what image SHA." Everything reconciles by Terraform's plan/apply, not by string concatenation in the workflow.
  • Reproducible past deploys. terraform apply -var app_image_uri=<old-sha> reproduces any past deploy. Better than reading old task def revisions out of AWS state.
  • Removed ~600 lines of CI machinery (ENG-272 Layer 5 drift checker + workflow + manual patch helper + custom rendering in deploy.yml — replaced by a single terraform apply).
  • Cleaner SOC 2 / EDE evidence story. Every state change has one auditable source (the workflow run) instead of "the workflow rendered the task def with this content, and Terraform separately said something different which the live state ignored."

Bad

  • Deploy time grows by ~30-60s. Terraform init + plan + apply overhead vs the legacy image-swap chain. Net per-deploy delta: ~30-50s slower (acceptable; deploys aren't time-critical).
  • Deploy workflows gain a Terraform dependency. Workflow now needs setup-terraform action + Terraform state backend access. Both were already used by infra workflows so the patterns existed.
  • Concurrent deploys serialize via state lock. Terraform's state lock means two deploys can't apply in parallel. GH Actions concurrency: deploy-<env> already enforced serial deploys, so no real behavior change.
  • Deploy role needs broader read perms. terraform apply triggers a full state refresh that reads every resource. The original ENG-277 issue text claimed "the deploy role already has terraform-grade perms" — that was wrong. Resolved via ENG-308 (added TerraformRefreshReadOnly Sid with ~50 read-only Get* / Describe* / List* actions across iam, ec2, acm, elbv2, cloudfront, wafv2, scheduler, lambda, logs, kms, sesv2, secretsmanager-metadata-only, route53, ecs, ecr). Documented compliance trade-off; all actions are read-only with no escalation possible.
  • Secret-value read scope is broader than ideal. aws_secretsmanager_secret_version.placeholder resources force the deploy role to have secretsmanager:GetSecretValue on every secret Terraform tracks (the provider's Read function calls it even with lifecycle.ignore_changes = [secret_string]; empirically proven in ENG-313). Current resting scope: arn:aws:secretsmanager:us-east-1:<env-account>:secret:*. This is an accepted trade-off, documented in the DeploySecretsRead Sid comment block. Two architectural paths to safely re-narrow the scope were cancelled as theater under the realistic threat model; see ENG-313, ENG-314, and ENG-315 for the analysis.
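For orientation, the resting secret-read scope described in the last bullet could be expressed as the policy document below. This is a sketch, not the actual DeploySecretsRead definition: it assumes Terraform runs inside the env account (so the caller identity supplies the account ID), and the data source label and policy name are hypothetical.

```hcl
# Assumption: this configuration is applied from within the env account.
data "aws_caller_identity" "current" {}

data "aws_iam_policy_document" "deploy_secrets_read" {
  statement {
    sid     = "DeploySecretsRead"
    effect  = "Allow"
    actions = ["secretsmanager:GetSecretValue"]

    # Every secret in the env account, in us-east-1. Cross-account ARNs
    # never match this pattern.
    resources = [
      "arn:aws:secretsmanager:us-east-1:${data.aws_caller_identity.current.account_id}:secret:*",
    ]
  }
}

resource "aws_iam_policy" "deploy_secrets_read" {
  name   = "deploy-secrets-read" # hypothetical name
  policy = data.aws_iam_policy_document.deploy_secrets_read.json
}
```

The wildcard on the secret name is what makes the scope "env-account wide"; narrowing it would mean enumerating the ARNs of every secret whose version resource Terraform refreshes.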

Compliance posture trade-off (broader deploy-role read access)

The deploy role's broader secret-read scope is documented as an accepted trade-off because:

  • All deploy-role access is gated by OIDC trust scoped to main branch + production environment
  • GitHub environment protection rule requires manual approval per dispatch
  • max_session_duration = 3600 (1h) limits any leaked credential's window
  • No IAM escalation actions (no iam:Create* / Put* / Attach* / Update*) — verified via the failed ENG-313 deploy where iam:PutRolePolicy was correctly denied
  • Resource boundary to env account (cross-account read impossible)
  • CloudTrail records every GetSecretValue with role + ARN attribution
  • Separate ValidateSecretsRole (ENG-309) prevents the validate-secrets workflow from inheriting broad read

Defensible under SOC 2 CC6.1 (logical access) + HIPAA §164.308(a)(4) (information access management). EDE Phase 3 auditor narrative: "deploy role has env-account secret read because Terraform manages secret-version resources; least-privilege scoped to the env account; cross-account read impossible; manual approval gate prevents unauthorized invocation."

Alternatives considered

  1. Narrow lifecycle.ignore_changes to ignore only the image field. Rejected — container_definitions is a JSON-encoded string attribute; AWS provider doesn't expose sub-field-level lifecycle control.
  2. Keep ignore_changes and add stricter manifest↔Terraform PR-time checks. Tried this (ENG-272 Layer 4 + Layer 5 detection). Caught some recurrences after-the-fact but didn't prevent them. Detection is necessary but not sufficient.
  3. Move ECS service management out of Terraform entirely. Rejected — that loses the structural benefits Terraform provides (state lock, plan review, audit trail).
  4. Replace aws_ecs_task_definition with replace_triggered_by on image_uri data source. Rejected — replace_triggered_by is brittle and still suffers the ignore_changes problem on secondary attributes.
  5. CloudFormation StackSets instead of Terraform. Rejected — team has zero CFN tooling; full re-platform of infra layer for marginal benefit.

References

  • ENG-277 — the structural fix issue
  • ENG-272 — the bug class root-cause analysis (Layer 1-5 detection; this ADR retires Layer 5)
  • ENG-279 — the 4th recurrence that motivated the structural fix
  • ENG-308 — TerraformRefreshReadOnly IAM expansion (unblocked the new pipeline)
  • ENG-309 — ValidateSecretsRole separation (preserves SOC 2 role-per-workload posture)
  • ENG-313 — compliance hardening attempt + empirical revert; documented why the secret-read scope stays at env-account
  • PR #234 — PR 1, prod-side ship
  • PR #264 — PR 2, staging-side ship
  • PRs #240, #241, #244, #245, #252, #253, #254, #255, #256, #260, #261 — the IAM-perms unblock chain that followed PR #234
  • docs/runbooks/deploy-via-terraform.md — operational runbook for the new pipeline
  • docs/runbooks/rollback-via-terraform.md — rollback procedures
  • docs/infrastructure/atlas-access-matrix.md — updated to remove the Layer 5 drift-checker reference
  • infra/modules/ecs-service/main.tf:215 (legacy line; now removed) — the historical location of lifecycle.ignore_changes = [container_definitions]
AskFlorence Internal Documentation. Not for public distribution.