Session brief — ENG-279 Mongo user simplification
Linear: ENG-279 · Priority: High · Branch: eng-279-mongo-user-simplification · Worktree: ~/Developer/ask-florence-eng-279-mongo-user-simplification
Goal: Roll back the speculative narrow-scoped Mongo user model (10+ custom roles, recurring silent-regression source) to a clean 3-user functional model per cluster (app_read, app_write, app_audit_writer), aligned with MongoDB's documented best practices. Re-narrow only at Phase 5 PHI introduction, using views + JWT for tenant filtering instead of per-tenant DB users.
ZERO-DOWNTIME CONSTRAINT (read this twice): https://stage.askflorence.health is the Y Combinator demo link; Y Combinator and prospective investors visit this URL. Staging must stay online + functionally correct AT ALL TIMES throughout this work. Same constraint as production: never degrade, never break, never show "covered" → "not covered" or "$X" → "error." A single broken probe on staging during this work is a STOP-the-line incident: roll back the change that caused it before continuing. This brief writes "staging" as if it were a sandbox in some places below for prose readability; treat every staging mention as if it said "STAGING IS YC-FACING PROD-EQUIVALENT" instead.
What you are walking into
The previous session (ENG-214 compliance docs) discovered that the canonical .env.local has MONGODB_URI bound to the wrong narrow user (app_read_staging — the cross-cluster reference reader, not the deployment-local reader). That was the third silent-regression bug in this Mongo-user category in a single week. The pattern:
- ENG-239: narrowed app_read_staging role (intended; clean)
- ENG-271: narrowed staging deployment MONGODB_URI to a new user, but getReferenceDb() silently fell back to it; /api/providers/covered returned HTTP 500 silently on apex
- ENG-272: added app_read_local_staging + removed the silent fallback; fixed deployed staging but missed canonical .env.local
- This session (ENG-279): found the local-side miss
The cycle has consumed multiple emergency sessions and caused at least one user-facing prod outage (ENG-271).
MongoDB itself names this anti-pattern (over-segmentation; custom roles before exhausting built-in; collection-level when DB-level would do). We are explicitly fighting the documented model. ENG-279's plan is to stop.
Required reading (~45 min, IN THIS ORDER)
- ENG-279 issue body + every comment — full problem statement, current-state matrix across 3 envs, proposal, MongoDB pattern reference, acceptance criteria.
- ENG-214 issue body + the ENG-279 cross-link comment — context on the compliance work that surfaced this.
- docs/infrastructure/atlas-access-matrix.md: authoritative current state of all 12 Mongo users + env-var bindings + consumers per cluster. Read end-to-end.
- infra/atlas/access-matrix.ts: TypeScript manifest that generates the doc; this is what you'll be editing for the user model change.
- ADR 0003: original narrow-scoped users decision (this issue supersedes it).
- ADR 0002: append-only audit log (PRESERVED; app_audit_writer keeps the narrow append-only role).
- ADR 0004: cross-cluster PrivateLink (PRESERVED; just the role + user that backs the reader changes).
- src/lib/db.ts: the dual-pool getDb() + getReferenceDb() connection helpers. Read header comments fully; they document the env-var contract.
- infra/envs/prod/ecs.tf + infra/envs/prod/secrets.tf: prod ECS task definition env binding + Secrets Manager declarations.
- infra/envs/staging/ecs.tf + infra/envs/staging/secrets.tf: same for staging.
- docs/security-compliance/access-control-policy.md DB section + docs/security-compliance/hipaa-control-mapping.md §164.308(a)(4) row: the compliance narrative that the simplification must preserve.
Skim: ENG-239, ENG-249/PR91, ENG-271, ENG-272 — the historical narrowing-then-bug cycles. You don't need to fully internalize each; you need to know the failure mode.
What you are building (the end state)
Three users per cluster (6 total), with MongoDB best-practices baked in:
Per cluster (askflorence-prod-01 + askflorence-staging)
| User | Role | Privileges | Env var bindings |
|---|---|---|---|
| app_read | Built-in read@askflorence | DB-wide read on askflorence | MONGODB_URI (prod + staging + local) + MONGODB_REFERENCE_URI (prod via PrivateLink to staging cluster's app_read) |
| app_write | Built-in readWrite@askflorence | DB-wide readWrite on askflorence | MONGODB_WRITE_URI; until Phase 5 lands, the legacy MONGODB_URI_PLANS_WRITE / _SURVEY_WRITE / _WAITLIST_WRITE / _HUBSPOT_SYNC_WRITE env vars also point at this user (no behavioral change for app code; just the underlying user is consolidated) |
| app_audit_writer | Custom role_audit_writer | Append-only: FIND + INSERT on agent_audit_log only | MONGODB_AUDIT_WRITE_URI (new env var; consumed by Phase 5 audit-log writers; falls back to MONGODB_WRITE_URI if unset, but unset is a config bug) |
Total users post-migration: 6 (3 per cluster × 2 clusters). Down from 12 today. Down from the projected 15-20 at Phase 5 launch.
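The three-user model above can be written down as data. The following is a minimal sketch of what such a manifest might look like; the type names and the `expandUsers` / `isAppendOnly` helpers are hypothetical and are not taken from infra/atlas/access-matrix.ts. The privilege entry for the audit writer follows the resource/actions shape that MongoDB custom roles use, and the legacy `_WRITE` env-var aliases are left out for brevity.

```typescript
// Hypothetical manifest shape; the real infra/atlas/access-matrix.ts may differ.
type Cluster = "askflorence-prod-01" | "askflorence-staging";

interface Privilege {
  resource: { db: string; collection: string };
  actions: string[]; // MongoDB privilege actions, e.g. "find", "insert"
}

interface MongoUser {
  name: string;
  // Either a built-in role or a custom role with explicit privileges.
  role:
    | { builtIn: "read" | "readWrite"; db: string }
    | { custom: string; privileges: Privilege[] };
  envVars: string[];
}

const USERS_PER_CLUSTER: MongoUser[] = [
  { name: "app_read", role: { builtIn: "read", db: "askflorence" }, envVars: ["MONGODB_URI", "MONGODB_REFERENCE_URI"] },
  { name: "app_write", role: { builtIn: "readWrite", db: "askflorence" }, envVars: ["MONGODB_WRITE_URI"] },
  {
    name: "app_audit_writer",
    role: {
      custom: "role_audit_writer",
      privileges: [
        { resource: { db: "askflorence", collection: "agent_audit_log" }, actions: ["find", "insert"] },
      ],
    },
    envVars: ["MONGODB_AUDIT_WRITE_URI"],
  },
];

const CLUSTERS: Cluster[] = ["askflorence-prod-01", "askflorence-staging"];

// Expand the per-cluster model into every (cluster, user) pair.
function expandUsers(): { cluster: Cluster; user: MongoUser }[] {
  return CLUSTERS.flatMap((cluster) => USERS_PER_CLUSTER.map((user) => ({ cluster, user })));
}

// True only for a custom role whose every action is "find" or "insert".
function isAppendOnly(user: MongoUser): boolean {
  if (!("custom" in user.role)) return false;
  const allowed = new Set(["find", "insert"]);
  return user.role.privileges.every((p) => p.actions.every((a) => allowed.has(a)));
}
```

Keeping the append-only check next to the manifest means a widened audit role would show up as a failing assertion rather than as a doc diff someone has to notice.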
Why this works
- Pre-PHI today: all askflorence DB content is public CMS marketplace data + agent waitlist PII. No row-level filtering needed at the DB layer. app_read whole-DB read is the right grant.
- Audit-log integrity preserved: app_audit_writer keeps the FIND+INSERT-only role per ADR 0002; this is the one custom role that's compliance-critical.
- Re-narrowing playbook at Phase 5: when PHI lands, use views for per-agent / per-member filtering + JWT in app middleware for tenant identity. Do NOT add per-tenant Mongo users. Cap total users at 5 per cluster (per ENG-279 acceptance criteria).
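To make the Phase 5 playbook concrete, here is a rough sketch of the "views + JWT" idea, under several assumptions: the `members` collection, the `agentId` / `planId` / `status` fields, and the `sub` claim carrying the agent id are all hypothetical, and JWT verification is assumed to have happened upstream. The view object uses the `viewOn` + `pipeline` shape MongoDB accepts when creating a view.

```typescript
// Hypothetical claim and field names; nothing here is taken from the codebase.
interface TenantClaims {
  sub: string; // agent id from an already-verified JWT
}

type Filter = Record<string, unknown>;

// Always AND the tenant constraint with the caller's filter, so a
// caller-supplied filter can narrow the result set but never widen it.
function scopeToTenant(claims: TenantClaims, filter: Filter): Filter {
  if (!claims.sub) throw new Error("missing tenant identity");
  return { $and: [{ agentId: claims.sub }, filter] };
}

// A view definition: the projection hides every field not listed,
// so one shared DB user can read the view without seeing raw documents.
const memberSummaryView = {
  viewOn: "members",
  pipeline: [{ $project: { agentId: 1, planId: 1, status: 1 } }],
};
```

The point of the design is that tenant identity lives in the request (JWT) and field visibility lives in the view, so neither requires a new Mongo user.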
Execution plan — STAGED, additive-first
This is the part that determines whether prod stays up. Atlas state changes first (additive only), then Terraform (additive only), then ECS task definitions (rolling deploy to use new env bindings), THEN verification, THEN deprecation. Never delete an old user until the new task revisions have rolled successfully across all tasks AND verification has passed.
Phase A — Plan + capture pre-change baselines
STEP 1: Capture today's YC-demo baseline on staging + prod BEFORE READING ANYTHING ELSE. The "Tyler Wood + Synthroid + Lipitor covered on 14 plans" contract is the success criterion — you cannot measure success without first knowing the starting state. Run the 8-gate probe (described under Verification Gates) against both https://stage.askflorence.health and https://askflorence.health and save the responses verbatim:
```bash
mkdir -p /tmp/eng-279-baselines
# Capture staging baseline
for gate in eligibility plans providers-tyler-wood drugs-synthroid drugs-lipitor; do
  # ... probe and save response to /tmp/eng-279-baselines/staging-<gate>-pre.json
  :  # no-op so the loop parses until the probe command is filled in
done
# Same for prod
```
If today's plan count is NOT 14, that's interesting: capture whatever number it IS; that becomes the session contract. If "Tyler Wood" or "Synthroid" or "Lipitor" doesn't return covered on staging today, STOP and ask the user: something is already broken pre-session, and that needs to be resolved before any Mongo-user change can ship.
STEP 2: Now do the planning work:
- Open this brief + the required reading in order.
- Write a plan file at ~/.claude/plans/eng-279-mongo-user-simplification-<adjective-noun-adjective>.md (the runtime gives you a name). Plan must include:
  - Reference the captured pre-change baselines (paths under /tmp/eng-279-baselines/)
  - Exact list of Atlas users to create (3 per cluster, 6 total) with the role definitions verbatim
  - Exact list of users to deprecate (12 total: every existing custom + legacy user)
  - Exact env-var binding changes per env (prod ECS, staging ECS, local), in table format
  - Verification probes at each gate (the 8-gate YC-demo smoke + calculator regression + dev probe)
  - Rollback plan for each phase, with the snapshotted task-def paths referenced
  - The session-contract plan count (14 if confirmed; whatever today's number is otherwise)
- Use AskUserQuestion to clarify any ambiguity. Examples that should NOT be assumed:
  - Whether to keep app_admin_schema (it's used for index creation by scripts/db/setup-collections.js; it likely needs migration to app_write or stays as a 4th user; ask)
  - Whether MONGODB_AUDIT_WRITE_URI is wired today or only at Phase 5 (probably Phase 5; confirm before adding env binding)
  - Whether MONGODB_REFERENCE_URI on prod stays pointed at the staging cluster's NEW app_read (via PrivateLink) or gets a dedicated cross-cluster reader user (clean architectural call; recommend reusing app_read since narrowness is no longer the priority; ADR 0004 PrivateLink architecture is preserved either way)
  - Whether today's staging plan count for the YC default scenario is actually 14, or whether the contract should be the actually-captured-today number (likely the latter; ask after capturing the baseline)
- ExitPlanMode for user approval.
Phase B — Atlas changes (additive)
Order matters. Each step is verifiable before the next.
- Staging cluster first. Create app_read@staging (built-in read@askflorence); create app_write@staging (built-in readWrite@askflorence); create app_audit_writer@staging with the custom append-only role (clone from existing role_audit_reader / role_writer_agents definitions). Verify each user can do exactly what they should via direct mongo probes from your laptop (FIND + INSERT smoke tests, plus a denial probe for app_audit_writer UPDATE/REMOVE attempts to prove the append-only property holds).
- Prod cluster second. Same three users created on askflorence-prod-01.
- Do NOT touch any existing user yet. Old users continue to work for old ECS task revisions.
Phase C — Secrets Manager (additive)
- Add new secret entries:
  - staging/mongodb/app-read-v2 (or whatever naming; consider future-proof) ← app_read@staging URI
  - staging/mongodb/app-write-v2 ← app_write@staging URI
  - staging/mongodb/audit-write ← app_audit_writer@staging URI
  - prod/mongodb/app-read-v2 ← app_read@prod URI
  - prod/mongodb/app-write-v2 ← app_write@prod URI
  - prod/mongodb/audit-write ← app_audit_writer@prod URI
- Verify each secret resolves correctly via aws secretsmanager get-secret-value (test from your laptop with active AWS auth).
- Do NOT delete the old secrets yet.
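Since the recurring bug in this category is a URI bound to the wrong user, it may help to check each fetched secret structurally before wiring it anywhere. The sketch below is one possible check, not an existing script: `parseMongoUri` and `checkSecret` are hypothetical names, the parse is deliberately minimal, and it never returns or logs the password.

```typescript
interface ParsedMongoUri {
  scheme: string;   // "mongodb" or "mongodb+srv"
  username: string;
  host: string;
}

// Minimal structural parse. It does not cover every form the MongoDB
// connection-string format allows (a multi-host list is kept as one string).
function parseMongoUri(uri: string): ParsedMongoUri | null {
  const m = /^(mongodb(?:\+srv)?):\/\/([^:@\/]+):([^@\/]+)@([^\/?]+)/.exec(uri);
  if (!m) return null;
  return { scheme: m[1], username: decodeURIComponent(m[2]), host: m[4] };
}

// Returns a list of problems; an empty list means the secret looks right.
function checkSecret(uri: string, expectedUser: string): string[] {
  const parsed = parseMongoUri(uri);
  if (!parsed) return ["malformed connection string"];
  const problems: string[] = [];
  if (parsed.username !== expectedUser) {
    problems.push(`bound to ${parsed.username}, expected ${expectedUser}`);
  }
  return problems;
}
```

For example, `checkSecret(valueOfStagingAppReadSecret, "app_read")` would flag a secret that still carries an app_read_staging credential. The host names used in any test of this are placeholders.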
Phase D — Terraform + ECS task definition rev
Verify-new-users-WORK before swapping any env binding. The cycle that has burned us repeatedly: assume a new credential works → wire it in → discover at runtime it doesn't have the privileges the code needs. Break that cycle by exercising every new user against every consuming code path BEFORE any task-def revision swaps the live binding.
D.0 — Pre-swap verification (NO ECS changes yet)
For each new user created in Phase B, run a direct probe from your laptop using the corresponding secret value:
```bash
# Test app_read@staging can do everything getDb() does
MONGODB_URI="<app_read@staging URI>" npx tsx scripts/debug-probe-new-user.ts \
  --collections "plans,zip_county,regions,plan_years,audit_log,agent_waitlist_submissions,formularies_staging,providers_staging" \
  --action FIND
# Test app_write@staging can do everything getDb() write paths do
MONGODB_URI="<app_write@staging URI>" npx tsx scripts/debug-probe-new-user.ts \
  --collections "plans,agent_waitlist_submissions,agent_survey_responses,hubspot_sync_log,plan_years,zip_county,regions" \
  --action FIND_INSERT_UPDATE_REMOVE
# Test app_audit_writer@staging has FIND + INSERT on agent_audit_log AND IS DENIED UPDATE/REMOVE
MONGODB_URI="<app_audit_writer@staging URI>" npx tsx scripts/debug-probe-new-user.ts \
  --collection agent_audit_log \
  --expect-allow FIND,INSERT \
  --expect-deny UPDATE,REMOVE
```
(You'll need to write scripts/debug-probe-new-user.ts, a small wrapper. Delete it after the migration.)
ALL probes must pass on the new users before swapping any task-def env binding. If a probe fails, fix the Atlas role in Phase B; do not proceed.
Repeat for prod cluster's new users.
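The core of scripts/debug-probe-new-user.ts is comparing what each user could actually do against what it should be able to do. Here is a minimal sketch of that comparison, kept free of the MongoDB driver so it can be read in isolation; `classify` and `evaluate` are hypothetical names. The one driver-specific fact it relies on is that MongoDB reports authorization failures with error code 13 (Unauthorized). The real script would fill `observed` by attempting each operation against the cluster.

```typescript
type Action = "FIND" | "INSERT" | "UPDATE" | "REMOVE";
type Outcome = "allowed" | "denied";

interface Expectation {
  allow: Action[];
  deny: Action[];
}

// Map the result of one attempted operation to a verdict.
// `err` is undefined when the operation succeeded.
function classify(err: unknown): Outcome {
  if (err === null || err === undefined) return "allowed";
  if (typeof err === "object" && (err as { code?: number }).code === 13) return "denied";
  throw err; // network errors, timeouts, etc. are probe failures, not verdicts
}

// Compare observed outcomes with expectations; an empty result means pass.
function evaluate(expect: Expectation, observed: Partial<Record<Action, Outcome>>): string[] {
  const failures: string[] = [];
  for (const action of [...expect.allow, ...expect.deny]) {
    const wanted: Outcome = expect.allow.includes(action) ? "allowed" : "denied";
    const got = observed[action];
    if (got === undefined) failures.push(`${action}: not probed`);
    else if (got !== wanted) failures.push(`${action}: expected ${wanted}, got ${got}`);
  }
  return failures;
}
```

Treating an unprobed action as a failure is deliberate: a probe that silently skips UPDATE would otherwise "prove" the append-only property without testing it.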
D.1 — Staging ECS task-def swap
- Update infra/envs/staging/ecs.tf so the env-var bindings point at the new secrets. Keep the env var NAMES the same (MONGODB_URI, MONGODB_WRITE_URI, MONGODB_REFERENCE_URI) so app code doesn't need to change.
- Add MONGODB_AUDIT_WRITE_URI binding (new env var; safe to add because it falls back to MONGODB_WRITE_URI in src/lib/db.ts if unset).
- terraform plan: review carefully.
- Snapshot the current ECS task-def revision (aws ecs describe-task-definition --task-definition <name> > /tmp/staging-task-def-prerollback.json); this is your rollback artifact.
- terraform apply against staging. ECS rolling deploy fires.
- Watch the staging service stabilize (aws ecs describe-services until runningCount === desiredCount AND rolloutState=COMPLETED on PRIMARY).
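The stabilization condition in the last step can be checked mechanically against the JSON that aws ecs describe-services prints. This is a sketch under assumptions: `isRolloutComplete` is a hypothetical helper, the interfaces cover only the fields it reads, and the extra "no older deployment still running tasks" condition is an addition beyond what the step above states.

```typescript
// Subset of one entry in the describe-services "services" array.
interface Deployment {
  status: string;        // "PRIMARY" | "ACTIVE" | "INACTIVE"
  rolloutState?: string; // "IN_PROGRESS" | "COMPLETED" | "FAILED"
  desiredCount: number;
  runningCount: number;
  taskDefinition: string;
}

interface Service {
  deployments: Deployment[];
}

function isRolloutComplete(service: Service, expectedTaskDef?: string): boolean {
  const primary = service.deployments.find((d) => d.status === "PRIMARY");
  if (!primary) return false;
  if (expectedTaskDef && primary.taskDefinition !== expectedTaskDef) return false;
  // Assumption beyond the brief: also require that no older (ACTIVE)
  // deployment still has running tasks holding the old credentials.
  const draining = service.deployments.some((d) => d.status === "ACTIVE" && d.runningCount > 0);
  return (
    primary.rolloutState === "COMPLETED" &&
    primary.runningCount === primary.desiredCount &&
    !draining
  );
}
```

Passing the expected task-definition ARN guards against declaring success while the PRIMARY deployment is still the old revision.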
D.2 — Staging YC-demo smoke gate (MUST PASS before D.3)
Run the YC-demo smoke test on https://stage.askflorence.health. This is the contract the YC reviewer would hit:
- Calculator basic flow: POST /api/eligibility + POST /api/plans with the demo-default inputs (ZIP 84094 / married couple ages 35+30 / appropriate non-Medicaid income that returns plan results). Expect: plans JSON returned, no 500s.
- Plan count baseline: the canonical doctor + Rx test inputs (call them the "YC default scenario") must return 14 plans. The exact ZIP / income / household composition is in scripts/audit/fixtures/calculator-baseline.json or, if not there, derive it from the current staging behavior BEFORE making any change (capture the pre-change baseline as /tmp/yc-demo-baseline-pre.json). The 14 number is the contract; if pre-change staging returns a different count, capture that number first, and that is the new baseline.
- Provider coverage probe (Tyler Wood): POST /api/providers/covered searching for "Tyler Wood" (a real provider with known coverage across UT plans). Expected: provider returned as covered on the 14 plans the default calculator returns. Specifically, the response payload must show covered: true for the doctor across every plan in the YC default scenario plan list.
- Medication coverage probe (Synthroid): POST /api/drugs/covered with "Synthroid" (levothyroxine sodium, a high-volume formulary entry). Expected: drug returned as covered on the 14 plans, with a tier classification populated (drug_tier = PreferredBrand or Generic per plan).
- Medication coverage probe (Lipitor): POST /api/drugs/covered with "Lipitor" (atorvastatin calcium, another high-volume formulary entry). Expected: drug returned as covered on the 14 plans, with a tier classification populated.
- End-to-end member flow: full happy-path walkthrough in the browser at https://stage.askflorence.health: open the home page → demo calculator → land on plans → click "Check doctor + Rx coverage" → search for Tyler Wood, Synthroid, Lipitor → verify the per-plan coverage indicators render correctly → no console errors, no broken renders.
- Calculator regression: BASELINE_BASE=https://stage.askflorence.health npx tsx scripts/audit/calculator-baseline-diff.ts → ZERO DIFFS.
- Agent flow: submit a synthetic agent waitlist signup at https://stage.askflorence.health/agent-onboarding (use tahaabbasi+yctest-eng279-<timestamp>@me.com). Confirm: Mongo write succeeded (verify via Atlas Admin UI or aws ecs execute-command into the task), SES sent the confirmation email, HubSpot mirror created the contact, no errors in CloudWatch logs. Clean up the test row + HubSpot contact after.
If ANY of these eight gates fails, immediately roll back by re-registering the snapshotted prior task-def revision (aws ecs register-task-definition --cli-input-json file:///tmp/staging-task-def-prerollback.json + aws ecs update-service) and DO NOT proceed to D.3 until the root cause is found and fixed in Phase B (Atlas role) or Phase C (Secrets Manager value).
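One caveat worth checking before relying on the rollback recipe: as far as the AWS CLI documents it, describe-task-definition wraps its output in a taskDefinition key and includes read-only fields that register-task-definition does not accept as input, so the raw snapshot may be rejected by --cli-input-json. A minimal sketch of the trimming step follows; `toRegisterInput` is a hypothetical helper and the field list is an assumption to verify against the current CLI reference. Alternatively, because the prior revision normally stays registered, aws ecs update-service can point --task-definition at that revision directly.

```typescript
// Fields returned by describe-task-definition that are assumed here to be
// output-only and not valid in register-task-definition input.
const READ_ONLY_FIELDS = [
  "taskDefinitionArn",
  "revision",
  "status",
  "requiresAttributes",
  "compatibilities",
  "registeredAt",
  "registeredBy",
  "deregisteredAt",
];

function toRegisterInput(snapshot: Record<string, unknown>): Record<string, unknown> {
  // Unwrap the "taskDefinition" envelope if present.
  const inner = (snapshot.taskDefinition ?? snapshot) as Record<string, unknown>;
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(inner)) {
    if (!READ_ONLY_FIELDS.includes(key)) out[key] = value;
  }
  return out;
}
```

Running the trim once, right after taking the snapshot, means the rollback file is already in registerable form when a gate fails and time matters.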
D.3 — Soak staging for ≥ 30 min after D.2 passes
Hands off. Watch CloudWatch logs (aws logs tail) for any error spike. Watch Atlas Admin UI for any auth-failure spike. After 30 min of clean operation, proceed to D.4.
D.4 — Prod ECS task-def swap
Mirror D.1 + D.2 against prod (infra/envs/prod/ecs.tf + apply + watch + run the YC-demo smoke test against https://askflorence.health).
Same eight-gate smoke test. Same rollback recipe. Same 30-min soak.
Phase E — Canonical .env.local update
- Update /Users/tahaabbasi/Developer/askflorence/.env.local:
  - MONGODB_URI ← app_read@staging connection string
  - MONGODB_WRITE_URI ← app_write@staging connection string
  - MONGODB_REFERENCE_URI ← app_read@staging connection string (same cluster + user; can be identical to MONGODB_URI, but kept as a distinct env var so getReferenceDb() doesn't silently lose its explicit binding contract per ENG-272)
  - MONGODB_AUDIT_WRITE_URI ← app_audit_writer@staging connection string
  - Remove all legacy MONGODB_URI_*_WRITE entries (no consumer will use them after Phase F)
- Restart local dev server. Probe /api/plans + /api/eligibility + /api/providers/covered + /api/drugs/covered against http://localhost:3000 (or whatever port).
- Run calculator regression: npx tsx scripts/audit/calculator-baseline-diff.ts. It must be ZERO DIFFS, with no env-var juggling required.
- Update docs/briefs/SESSION_BRIEF_NEW_WORKTREE.md or whatever the worktree-setup brief is, so future worktrees pick up the right env-var convention. Also update .env.example if one exists.
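The env-var contract this phase sets up (three required bindings, no silent fallback for the reference reader, and an audit-writer fallback that is tolerated but flagged) can be sketched as a resolver. This is illustrative only: `resolveMongoUris` and its return shape are hypothetical and do not describe the actual code in src/lib/db.ts.

```typescript
type Env = Record<string, string | undefined>;

interface MongoUris {
  read: string;
  write: string;
  reference: string;
  auditWrite: string;
  warnings: string[];
}

function required(env: Env, name: string): string {
  const value = env[name];
  if (!value) throw new Error(`${name} is not set`);
  return value;
}

function resolveMongoUris(env: Env): MongoUris {
  const warnings: string[] = [];
  const write = required(env, "MONGODB_WRITE_URI");
  let auditWrite = env.MONGODB_AUDIT_WRITE_URI;
  if (!auditWrite) {
    // Allowed by the brief, but treated as a config bug worth surfacing.
    warnings.push("MONGODB_AUDIT_WRITE_URI unset; falling back to MONGODB_WRITE_URI");
    auditWrite = write;
  }
  return {
    read: required(env, "MONGODB_URI"),
    write,
    // No fallback to MONGODB_URI: the reference binding stays explicit (ENG-272).
    reference: required(env, "MONGODB_REFERENCE_URI"),
    auditWrite,
    warnings,
  };
}
```

Calling `resolveMongoUris(process.env)` at startup would turn a missing MONGODB_REFERENCE_URI into an immediate, loud failure instead of a runtime 500 on the first cross-cluster read.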
Phase F — Code consolidation (remove dead env-var references)
- Grep src/, scripts/, infra/ for the deprecated env-var names: MONGODB_URI_PLANS_WRITE, MONGODB_URI_SURVEY_WRITE, MONGODB_URI_AGENTS_WRITE, MONGODB_URI_AGENTS_ADMIN, MONGODB_URI_AUDIT_READ, MONGODB_URI_WAITLIST_WRITE, MONGODB_URI_HUBSPOT_SYNC_WRITE.
- Replace each call site with MONGODB_WRITE_URI (or MONGODB_AUDIT_WRITE_URI for audit-log writers when those ship in Phase 5).
- Run npx tsc --noEmit clean.
- Run calculator regression. Run a full smoke test of every API route that does a DB write.
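The grep in the first step could also be kept as a small check that a CI guard runs after Phase F, so a deprecated name cannot creep back in. A minimal sketch, with `findDeprecated` as a hypothetical helper that scans one file's text:

```typescript
const DEPRECATED = [
  "MONGODB_URI_PLANS_WRITE",
  "MONGODB_URI_SURVEY_WRITE",
  "MONGODB_URI_AGENTS_WRITE",
  "MONGODB_URI_AGENTS_ADMIN",
  "MONGODB_URI_AUDIT_READ",
  "MONGODB_URI_WAITLIST_WRITE",
  "MONGODB_URI_HUBSPOT_SYNC_WRITE",
];

interface Hit {
  line: number; // 1-based line number within the scanned text
  name: string;
}

function findDeprecated(source: string): Hit[] {
  const hits: Hit[] = [];
  source.split("\n").forEach((text, i) => {
    for (const name of DEPRECATED) {
      // \b avoids matching a longer identifier that merely starts with a deprecated name.
      if (new RegExp(`\\b${name}\\b`).test(text)) hits.push({ line: i + 1, name });
    }
  });
  return hits;
}
```

The surviving names (MONGODB_URI, MONGODB_WRITE_URI, MONGODB_REFERENCE_URI, MONGODB_AUDIT_WRITE_URI) never match, because none of them appears in the deprecated list.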
Phase G — Deprecate old Mongo users (DELETE last)
Only after Phase F is verified. Three rolling deploys later (so no in-flight ECS tasks reference old creds) AND a clean nightly drift check AND a clean calculator regression.
- Delete custom roles: role_writer_survey, role_writer_plans, role_writer_agents, role_admin_agents, role_admin_schema, role_audit_reader, role_reader_reference, role_reader_local_staging, plus role_writer_waitlist and role_writer_hubspot_sync if they exist as separate roles.
- Delete users: app_read_local_staging, app_read_staging, app_writer_survey, app_writer_plans, app_writer_waitlist, app_writer_hubspot_sync, app_writer_agents, app_admin_agents, app_admin_schema, audit_reader.
- Delete prod app-write (Issue #56 exit criterion).
- Delete the corresponding Secrets Manager secrets.
- Final Terraform pass to remove the deleted secret definitions from infra/envs/{prod,staging}/secrets.tf.
Phase H — Documentation
- New ADR docs/adr/0005-mongo-user-simplification.md that supersedes ADR 0003. References MongoDB's documented anti-pattern guidance from the ENG-279 comment. Lays out the 3-user model + the Phase 5 re-narrowing playbook (views + JWT).
- Update infra/atlas/access-matrix.ts to the new 6-user manifest. Run npm run docs:atlas to regenerate docs/infrastructure/atlas-access-matrix.md.
- Touch up docs/security-compliance/access-control-policy.md DB section. The §164.308(a)(4) row in docs/security-compliance/hipaa-control-mapping.md stays accurate (least-privilege is still met at the role-tier level).
- Touch up the now-pointer docs/infrastructure/access-control.md if it still has stale Mongo-user details.
- Add a "What shipped" entry to CLAUDE.md under today's date, no version bump per cadence policy.
- Close out ENG-279: comment with summary + tick all acceptance-criteria checkboxes + move status to In Review.
- Update ENG-214 close-out comment to note the access-control-policy + atlas-access-matrix refreshes shipped.
Verification gates (NON-NEGOTIABLE)
At each gate, all probes must pass before proceeding to the next phase:
Probe set — the YC-demo smoke test
This is the contract. Run all 8 gates against https://stage.askflorence.health after every Atlas / Secrets / Terraform change. Run all 8 against https://askflorence.health after every prod change. Local-only probes (calculator regression on http://localhost:3000) supplement but do not replace the staging + prod gates — staging is the YC link and prod is prod; both are live external-facing surfaces.
| # | Gate | What it tests | Pass criteria |
|---|---|---|---|
| 1 | POST /api/eligibility (UT 84094 + non-Medicaid income) | ZIP lookup + APTC / CSR calc | HTTP 200, eligibility payload returned, no error |
| 2 | POST /api/plans (same inputs as #1) | Plan search + scoring | HTTP 200, plans JSON returned, plan count matches the "YC default scenario" baseline (target 14 plans; capture pre-change baseline if different from 14 and use that as the contract for this session) |
| 3 | POST /api/providers/covered (search "Tyler Wood") | Cross-cluster providers_staging read | HTTP 200, response shows covered: true on each of the 14 plans the calculator returns |
| 4 | POST /api/drugs/covered (search "Synthroid") | Cross-cluster formularies_staging read | HTTP 200, response shows covered: true on each of the 14 plans with drug_tier populated |
| 5 | POST /api/drugs/covered (search "Lipitor") | Same path as #4, different drug | HTTP 200, response shows covered: true on each of the 14 plans with drug_tier populated |
| 6 | End-to-end browser flow (home → calculator → plans → coverage check for the three test items) | Real user journey | All renders correct, no console errors, no broken states; "Tyler Wood + Synthroid + Lipitor all show covered on the 14 plans" |
| 7 | BASELINE_BASE=<env-url> npx tsx scripts/audit/calculator-baseline-diff.ts | Full 12-scenario pipeline regression | ZERO DIFFS |
| 8 | Agent flow — synthetic waitlist signup at /agent-onboarding with tahaabbasi+yctest-eng279-<ts>@me.com | Mongo write + SES send + HubSpot mirror | All three side effects fire correctly; clean up test row + HubSpot contact post-test |
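Gates 2 through 5 share one pass rule: the plan count matches the session baseline, and the searched item is covered on every one of those plans (with drug_tier populated for the drug gates). A sketch of that rule as a pure function follows. The response shape (`planId`, `covered`, `drug_tier` per entry) is an assumption, since the brief does not specify the payloads of /api/providers/covered or /api/drugs/covered, and `checkCoverageGate` is a hypothetical name.

```typescript
// Hypothetical per-plan coverage entry; the real payloads may differ.
interface CoverageEntry {
  planId: string;
  covered: boolean;
  drug_tier?: string;
}

function checkCoverageGate(
  planIds: string[], // plan ids taken from the gate-2 /api/plans response
  coverage: CoverageEntry[],
  opts: { expectedPlanCount: number; requireTier: boolean },
): string[] {
  const failures: string[] = [];
  if (planIds.length !== opts.expectedPlanCount) {
    failures.push(`plan count ${planIds.length}, baseline ${opts.expectedPlanCount}`);
  }
  const byPlan = new Map(coverage.map((c) => [c.planId, c] as const));
  for (const id of planIds) {
    const entry = byPlan.get(id);
    if (!entry) failures.push(`${id}: missing from coverage response`);
    else if (!entry.covered) failures.push(`${id}: covered is false`);
    else if (opts.requireTier && !entry.drug_tier) failures.push(`${id}: drug_tier not populated`);
  }
  return failures;
}
```

`expectedPlanCount` should come from the baseline captured in Phase A rather than a hard-coded 14, so the check still holds if today's staging count turns out to differ.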
Plus the standing CI guards:
- nightly staging-cluster-drift workflow passes (trigger manually via gh workflow run staging-cluster-drift after Atlas changes)
- staging-collections-guard CI workflow passes on the PR
- atlas-env-var-coverage CI workflow passes on the PR
- atlas-docs-sync CI workflow passes on the PR (after npm run docs:atlas regen)
- validate-secrets CI workflow passes on the PR
When ANY probe fails
STOP THE LINE. Roll back the change that caused the failure immediately:
| Failure point | Rollback action |
|---|---|
| Gate 1-8 fails against staging after a Terraform apply | aws ecs register-task-definition --cli-input-json file:///tmp/staging-task-def-prerollback.json + aws ecs update-service. Watch rollout. Re-probe gates. Investigate root cause in Atlas / Secrets layer. |
| Gate 1-8 fails against prod after a Terraform apply | Same as above, against prod task def + service. Page Asad if customer-visible. |
| Pre-swap probe (D.0) fails | Fix the Atlas role grant in the Atlas Admin UI. Re-probe before Terraform changes. |
| validate-secrets CI fails | Inspect: likely a malformed connection string in Secrets Manager. Fix the secret value before re-running. |
| atlas-cluster-drift fails after Phase B (additive create) | Should NOT happen: additive create doesn't change existing user privileges. If it does, the new user's role definition is wider than declared. Inspect role JSON; tighten in Atlas Admin UI. |
NEVER proceed past a failed gate. NEVER. Each Mongo-user regression in our history reached prod or staging because someone "fixed it in the next step" instead of stopping.
When a probe fails
STOP. Roll back the change that caused the failure (Terraform apply of the prior task-def revision; revert the Atlas user role; etc.). Do NOT proceed.
Rollback recipes per phase
- Phase B (new users created): roll back = delete the new users in Atlas Admin UI. No app traffic touches them yet.
- Phase C (new secrets created): roll back = delete the new Secrets Manager secrets. No app traffic uses them yet.
- Phase D (ECS task-def points at new secrets): roll back = re-register the prior task-def revision (aws ecs register-task-definition from saved :N-1 snapshot) + aws ecs update-service. Old secrets are still alive so old creds still work.
- Phase E (canonical .env.local update): roll back = hand-revert from a copy saved before the edit (.env.local is gitignored, so git restore cannot recover it unless the file happens to be tracked).
- Phase F (code env-var consolidation): roll back = git revert <commit>.
- Phase G (delete old users + secrets): NOT rollback-safe without a fresh Atlas user create + secret regenerate. Treat as point of no return: do not enter Phase G unless Phases A–F are all 100% verified across all 3 envs for ≥ 24 hours.
Hard rules
- Staging is the YC demo link — it is prod-equivalent for downtime tolerance. Never degrade, never break, never invalidate the "Tyler Wood + Synthroid + Lipitor covered on 14 plans" smoke test. A failed staging gate is a STOP-the-line incident even if prod is fine. Roll back BEFORE investigating root cause.
- Verify new users WORK before swapping any live binding. Phase D.0 pre-swap probe is non-negotiable — exercise each new user against every code path it will serve, from your laptop, BEFORE any task-def revision change. This is the principle that breaks the historical bug cycle.
- Never delete a Mongo user that has any task definition revision still bound to it. ECS rolling deploys take time; new task revisions stabilize over minutes. Wait for desiredCount === runningCount AND rolloutState=COMPLETED on PRIMARY for the new revision before considering the old credential safe to delete.
- Never commit a Mongo connection string (with password) to git. Pull from AWS Secrets Manager or .env.local (gitignored). Per CLAUDE.md Security rules.
- Never run git add . or git add -A. Stage specific paths (infra/, docs/, CLAUDE.md, etc.). Per CLAUDE.md Security rules.
- The append-only audit-log property is sacred. app_audit_writer keeps FIND+INSERT-only on agent_audit_log. Any change to this role requires a separate ADR + Asad/Taha sign-off. Per ADR 0002.
- Atlas BAA enumeration unchanged. Both askflorence-prod-01 + askflorence-staging stay in scope. No new clusters, no project changes; the simplification is users + roles within existing projects only.
- Calculator regression must pass at EVERY phase boundary. Plus the full 8-gate YC-demo smoke test against staging after every change touching staging, and against prod after every change touching prod.
- Capture pre-change baselines before any change. The "14 plans" + "Tyler Wood + Synthroid + Lipitor covered" contract is what the YC reviewer sees today on staging. If today's staging returns a different number (e.g., 13 or 15 plans), capture that as the session-start baseline FIRST; don't discover mid-session that you changed an output you couldn't measure against. Save baselines at /tmp/yc-demo-baseline-pre-<env>.json before Phase B even starts.
- Snapshot task-def revisions before each Terraform apply: aws ecs describe-task-definition --task-definition <name> > /tmp/<env>-task-def-prerollback.json. This is your rollback artifact.
When a deploy is needed
This work touches prod ECS task definitions, so it requires real terraform apply runs against the prod account. Per CLAUDE.md "Deploy + release cadence policy":
- Wait for explicit Taha approval before EACH prod-side change
- Staging-side changes can ship as part of normal workflow (Taha reviewing after the fact is fine)
- Hotfixes (rollback) can ship immediately if a probe fails
This is a multi-deploy operation. Probably 2 prod deploys (Phase D task-def update + Phase G secrets cleanup). Confirm scope with Taha at plan-approval time.
What success looks like
- All 8 YC-demo verification gates pass against staging at every phase boundary AND at the end. Staging stayed online + correct throughout: Tyler Wood + Synthroid + Lipitor remained covered: true across all 14 (or session-contract-N) plans for the entire session.
- All 8 YC-demo verification gates pass against prod at every phase boundary AND at the end. Same contract on prod.
- ENG-279 acceptance criteria all ticked
- 6 total Mongo users across both clusters (down from 12)
- MONGODB_URI, MONGODB_WRITE_URI, MONGODB_REFERENCE_URI, MONGODB_AUDIT_WRITE_URI are the only Mongo env vars in active code
- Local dev works with vanilla MONGODB_URI + MONGODB_REFERENCE_URI; no env-var juggling required for calculator regression
- ADR 0005 supersedes ADR 0003, documenting the new model + the Phase 5 re-narrowing discipline (views + JWT)
- Atlas access matrix regenerated + clean
- agent_audit_log append-only property still verifiable via the same probe pattern as today
- CLAUDE.md has the "What shipped" entry
- ENG-279 closes (status In Review awaiting Taha sign-off)
- ENG-214 close-out comment updated noting the access-control-policy.md refresh
- Pre-change baselines preserved at /tmp/eng-279-baselines/ until ENG-279 PR merges, as a proof artifact that nothing degraded
Estimated effort
| Phase | Effort |
|---|---|
| A — Plan (reading + writing plan file) | 1 hr |
| B — Atlas additive user creation | 30 min |
| C — Secrets Manager additive | 30 min |
| D — Terraform + ECS rolling deploy (staging + prod with 30-min stable wait between) | 1.5 hr |
| E — .env.local update + local verification | 30 min |
| F — Code env-var consolidation | 1 hr |
| G — Deprecation (only after 24h verified) | 1 hr (next session if 24h elapsed) |
| H — Documentation | 1 hr |
| Total | ~7 hr active + 24h wait between F and G |
Realistically 2 sessions: one to ship Phases A–F (~5 hr), one to do Phase G + H (~2 hr) after the 24h soak.
At plan approval time, surface these decisions to Taha
- Naming of new users + secrets: app_read vs app_read_v2 vs appReader. Consistency matters; pick once.
- Whether to consolidate the legacy MONGODB_URI_*_WRITE env vars now (Phase F) or leave them as aliases for MONGODB_WRITE_URI. Recommend consolidating: fewer env vars = less drift surface.
- Whether the prod app_read@prod cluster user should exist or whether prod reads only from staging cluster via PrivateLink. Current Terraform has prod/mongodb/app-read pointing at prod cluster. Re-confirm the cluster split today is "prod has its own plan data; staging has its own plan data; cross-cluster is only formularies + providers". (Spoiler: that's what the access matrix says + what the live probes confirmed.)
- MONGODB_AUDIT_WRITE_URI introduction now vs at Phase 5: recommend introducing now as a wired-up env var even if no code consumer exists yet, so Phase 5 work can drop the binding straight in.
- Whether to file a separate issue for the access-control-policy.md doc refresh or include it in this PR. Recommend including it, which keeps the doc + the implementation atomically synced.
Cross-references
- Issue: ENG-279
- Related: ENG-214 (the work that surfaced this)
- ADRs: 0001 (project isolation — preserved), 0002 (audit log — preserved), 0003 (narrow-scoped users — SUPERSEDED by this work), 0004 (PrivateLink — preserved)
- Docs to update: docs/infrastructure/atlas-access-matrix.md (auto-regen), docs/security-compliance/access-control-policy.md (DB section), CLAUDE.md (What shipped)
- MongoDB pattern reference: built-in roles, custom roles at DB level, views for filtering, JWT for tenant identity; see ENG-279 comment for the full citation