Session log — 2026-04-21 — Phase 5 staging go-live

Scope

Stand up the AWS staging application stack end-to-end on top of the Phase 3 Terraform scaffolding + Phase 4 staging networking, deploy the Next.js app to stage.askflorence.health, validate every outbound integration (MongoDB Atlas, CMS Marketplace API, AWS SES, PostHog), and get the staging environment to a state where shipping the current Vercel-served app on AWS is a no-risk path. No production traffic moved. Vercel askflorence.health and www.askflorence.health continue to serve production users throughout the session.

Actor

Human: Taha Abbasi.
Agent: Claude Opus 4.7 (1M context), running in Claude Code CLI.

Tickets

Advances Issue #47 from Phase 3 (Terraform scaffolding) through Phase 5.6 (end-to-end SES send proven on the staging app).
Provisions the app_writer_waitlist user in the staging Atlas project — closes the staging-side gap tracked under Issue #56. Prod rollout remains deferred per the plan.

External systems touched

AWS (staging account `549136075525`)

ECR repository askflorence-app created (was missing — Phase 4 had the networking/KMS/secrets only).
ECS cluster askflorence-staging. Fargate capacity providers FARGATE + FARGATE_SPOT. Container Insights enabled.
ECS task definition askflorence-staging-app-task — 0.25 vCPU / 0.5 GB, non-root user nextjs (UID 1001), port 3000, 14-day CloudWatch log retention under CMK alias/askflorence-staging-data. Revisions :1–:8 registered across the session; :8 is the live image on main@04cfd35. Task role policy limits runtime AWS actions to ses:SendEmail/ses:SendRawEmail on account identities + configuration sets.
ECS service askflorence-staging-app: desired count 1, deployment minimum healthy percent 100 / maximum percent 200 for rolling replacement, deployment circuit breaker enabled, target group attached to staging ALB.
ALB askflorence-staging-alb fronting the ECS service in public subnets. HTTPS listener with the stage.askflorence.health ACM certificate; HTTP redirects to HTTPS. Target group askflorence-staging-tg health-checks /api/health.
Secrets Manager — staging/mongodb/waitlist-write rotated to point at the new Atlas user (see Atlas section). All other staging/mongodb/* secrets left untouched.
Task role inline policy widened from identity/stage.askflorence.health to identity/* scoped to the staging account. Rationale: ses:SendEmail authorizes on every identity referenced in the call (From + To/CC/BCC). In SES sandbox, recipients must also be verified identities in the account, so the role needs permission on them too.
IAM: no new roles created. The GitHub Actions deploy role from Phase 3 is the only principal that pushes images and updates the service.
SES — staging ses:SendEmail path exercised successfully from two call sites (direct AWS CLI and /api/waitlist via the ECS task). The AWS/SES/Send CloudWatch metric shows 3 DeliveryAttempts, 0 bounces, 0 rejects. Still in sandbox mode; the production-access request is filed as a separate ticket.
CloudWatch Logs: log group /aws/ecs/askflorence-staging-app captures container stdout/stderr. CMK-encrypted.
Route 53 subzone for stage.askflorence.health (delegated from Cloudflare in Phase 4) now has an A-record alias pointing stage.askflorence.health → staging ALB DNS name.
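
For reference, the widened task role statement described in the list above can be pictured with the TypeScript sketch below. It is an illustration only: the real policy is presumably defined in the Terraform config, the helper name sesSendStatement is hypothetical, and the region is an assumption (this log does not state it). Only the two actions, the identity/* and configuration-set scoping, and the account ID come from this log.

```typescript
// Hypothetical shape of one IAM policy statement (illustration only).
interface PolicyStatement {
  Effect: "Allow";
  Action: string[];
  Resource: string[];
}

// Builds the SES send statement scoped to every identity and configuration set
// in ONE account. The region argument is an assumption; the log does not name it.
function sesSendStatement(region: string, accountId: string): PolicyStatement {
  return {
    Effect: "Allow",
    Action: ["ses:SendEmail", "ses:SendRawEmail"],
    Resource: [
      // identity/* instead of identity/stage.askflorence.health, because SES
      // authorizes against every identity referenced in the call, and sandbox
      // recipients are verified identities in the same account.
      `arn:aws:ses:${region}:${accountId}:identity/*`,
      `arn:aws:ses:${region}:${accountId}:configuration-set/*`,
    ],
  };
}

// Staging account ID from this log; "us-east-1" is a placeholder region.
const stagingStatement = sesSendStatement("us-east-1", "549136075525");
console.log(JSON.stringify(stagingStatement, null, 2));
```

The wildcard is bounded by the account ID in the ARN, so the role still cannot send on identities that belong to any other account.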

MongoDB Atlas (staging project `69e31af12fd2c0aef51bbb41`)

New custom role role_writer_waitlist — 7 actions (FIND, INSERT, UPDATE, REMOVE, CREATE_INDEX, DROP_INDEX, COLL_MOD) scoped to askflorence.agent_waitlist_submissions only.
New database user app_writer_waitlist — bound to role_writer_waitlist. Password (32-char alphanumeric, generated locally, never echoed) written via a temp file to Secrets Manager + .env.staging.local.
Prod project (AskFlorence, 69dc20c64005b222804dafa4) — untouched. No Atlas CLI command in this session targeted the prod project.

Cloudflare + Route 53

Unchanged from Phase 4. Cloudflare remains authoritative for apex askflorence.health; Route 53 holds the delegated stage.askflorence.health subzone. Cloudflare was not touched today.

Vercel

Untouched. No project settings, no env vars, no deployments. askflorence.health and www.askflorence.health continued to serve production traffic through every phase of this session. Four commits land on main today (e24c5ca, 44c1493, 90d05af, 04cfd35); none are promoted to Vercel in this session. A separate deploy step using vercel --prod from a dev machine will roll them forward as a discrete action with its own owner approval.

What shipped (chronological)

Phase 5.5 — email provider abstraction

Code (main@e24c5ca):

New src/lib/email.ts with sendEmail() + getEmailProvider(). Two providers behind a single typed API:
- ResendProvider — existing behavior, unchanged; uses RESEND_API_KEY + fetch("https://api.resend.com/emails").
- SesProvider — new; uses @aws-sdk/client-sesv2 with SESv2Client. Client is lazily constructed so Vercel builds don't require AWS creds at build time.
Provider selected once at module load via EMAIL_PROVIDER env var ("ses" vs "resend"; default is "resend").
Both providers return the same result shape { ok, messageId?, error?, provider } — sendEmail never throws on provider errors, callers inspect result.ok.
Refactored call sites:
- src/app/api/waitlist/route.ts: 3 sends (consumer confirmation, agent confirmation, ops notification) + kept the Resend-specific audience REST sync, now gated behind emailProvider === "resend" so it's a no-op on SES.
- src/app/api/agents/discovery/route.ts: 2 sends (agent confirmation, ops notification). sendResendEmail helper + RESEND_API_BASE constant deleted.
Added dep @aws-sdk/client-sesv2 ^3.1033.0 to package.json.
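
The contract above (provider picked from EMAIL_PROVIDER, one result shape, no throw on provider errors) can be sketched roughly as follows. This is not the contents of src/lib/email.ts: the names EmailMessage, EmailProvider, resolveProviderName and sendVia are hypothetical, the real providers (Resend via fetch, SES via SESv2Client) are replaced by an injected interface so the sketch runs without network or AWS dependencies, and treating any unrecognized value as "resend" is an assumption.

```typescript
type EmailProviderName = "ses" | "resend";

interface EmailMessage {
  from: string;
  to: string[];
  subject: string;
  html: string;
}

// Result shape from the log: { ok, messageId?, error?, provider }.
interface EmailResult {
  ok: boolean;
  messageId?: string;
  error?: string;
  provider: EmailProviderName;
}

// Hypothetical provider interface: resolves to a message ID or throws.
interface EmailProvider {
  name: EmailProviderName;
  send(message: EmailMessage): Promise<string>;
}

// Only the literal "ses" selects SES; unset or anything else means "resend".
// (Handling of unknown values is an assumption, not confirmed by the log.)
function resolveProviderName(env: Record<string, string | undefined>): EmailProviderName {
  return env["EMAIL_PROVIDER"] === "ses" ? "ses" : "resend";
}

// Converts a provider exception into a result object so callers only
// ever need to inspect result.ok.
async function sendVia(provider: EmailProvider, message: EmailMessage): Promise<EmailResult> {
  try {
    const messageId = await provider.send(message);
    return { ok: true, messageId, provider: provider.name };
  } catch (err) {
    const error = err instanceof Error ? err.message : String(err);
    return { ok: false, error, provider: provider.name };
  }
}
```

Under this shape, a call site that must stay Resend-only (such as the audience sync) can branch on the resolved provider name rather than on which API key happens to be present.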

Vercel posture: EMAIL_PROVIDER is unset on Vercel → falls through to the Resend path, RESEND_API_KEY still read, unchanged behavior. Zero runtime change. Verified by npm run build producing a bundle that doesn't pull in the AWS SDK on the Resend code path (tree-shaking).

Phase 5.5a — `EMAIL_FROM_DOMAIN` override

Code (main@44c1493): After the first SES deploy, SES rejected sends from [email protected] (the Resend-verified prod sender, hardcoded in the route files) because staging SES only has stage.askflorence.health verified. Rather than touch the route files or add five separate env vars, extended sendEmail() with a single EMAIL_FROM_DOMAIN env override that rewrites the domain part of every From header at send time. Works for bare addresses (user@domain) and display-name form (Name <user@domain>). Unset on Vercel → no rewrite. Staging ECS sets EMAIL_FROM_DOMAIN=stage.askflorence.health.
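
A minimal sketch of the From-header rewrite described above, assuming a helper named rewriteFromDomain (hypothetical; the actual code lives inside sendEmail()). It handles the two forms the log mentions and leaves anything else untouched. The example addresses use example.com and are placeholders, not the real senders.

```typescript
// Rewrites only the domain part of a From header. When the override is
// unset or empty (the Vercel case), the header is returned unchanged.
function rewriteFromDomain(from: string, overrideDomain: string | undefined): string {
  if (!overrideDomain) return from;

  // Display-name form: Name <user@domain>
  const display = from.match(/^(.*<)([^<>@\s]+)@[^<>\s]+(>\s*)$/);
  if (display) {
    return `${display[1]}${display[2]}@${overrideDomain}${display[3]}`;
  }

  // Bare form: user@domain
  const bare = from.match(/^([^@\s]+)@[^@\s]+$/);
  if (bare) {
    return `${bare[1]}@${overrideDomain}`;
  }

  // Unrecognized shape: leave as-is rather than guess.
  return from;
}

console.log(rewriteFromDomain("noreply@example.com", "stage.askflorence.health"));
console.log(rewriteFromDomain("Florence <noreply@example.com>", "stage.askflorence.health"));
console.log(rewriteFromDomain("noreply@example.com", undefined));
```

Keeping the local part and display name intact means the route files keep their hardcoded senders, and the environment alone decides which verified domain the message leaves from.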

Phase 5.6 — end-to-end SES validation on `/api/waitlist`

Three layers of evidence accumulated before declaring the SES path green:

Direct aws sesv2 send-email from a staging SSO session: MessageId 0100019daf13d623-07efec70-..., email delivered to [email protected]. Proves domain + DKIM + MAIL FROM + IAM at the account level.
ECS task role policy widened from identity/stage.askflorence.health to identity/* (main@90d05af) after the ses:SendEmail call failed with "not authorized to perform ses:SendEmail on resource identity/[email protected]". The rationale is recorded in the task role inline policy entry in the AWS section above.
POST /api/waitlist with [email protected] returned HTTP 200 + a real Mongo waitlist_submission_id. No error log in /aws/ecs/askflorence-staging-app. AWS/SES/Send metric incremented.

Blocker surfaced along the way: the staging Mongo secret staging/mongodb/waitlist-write was a placeholder string (PLACEHOLDER-REPLACE-ME-OUT-OF-BAND) because the parallel Mongo session hadn't provisioned app_writer_waitlist yet. Rather than hack around with a broader user (tried — app_admin_agents doesn't have createIndex on agent_waitlist_submissions either), ran the Atlas CLI flow described in the Atlas section above to create the narrow-scoped user properly.

Phase 5.7 — PostHog server fail-open + staging analytics opt-out

Code (main@04cfd35): Last blocker on the staging app code path was getPostHogClient() throwing on missing token, returning 500 to the caller AFTER the Mongo write + SES send had already succeeded. Two-part fix:

Server client fail-open (src/lib/posthog-server.ts): returns a no-op client (same methods, no-op implementations) when the token is missing OR when DEPLOY_ENV === "staging". Contract is "capture-by-default unless we see a positive signal we're not prod" — critical ordering because Vercel prod doesn't set DEPLOY_ENV, so inverting the rule to "only capture on DEPLOY_ENV=prod" would have silently killed production analytics.
Client host opt-out (instrumentation-client.ts): extended the existing syncNoTrackMode() toggle with an OPT_OUT_HOSTS set containing stage.askflorence.health. The opt-out condition is now hostOptOut || paramOptOut: whichever trigger fires causes opt_out_capturing(), and opt_in_capturing() only runs when both are false. Prod behavior of ?no_track=1 is preserved exactly: add param → opted out; remove param → opted back in (no reload needed). On staging, hostOptOut is always true, so the param is additive but cannot opt back in.
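
The two decisions above reduce to two small predicates, sketched here under stated assumptions. The names AnalyticsClient, noopClient, shouldUseNoopClient and shouldOptOut are hypothetical, and reading the token from NEXT_PUBLIC_POSTHOG_PROJECT_TOKEN on the server is an assumption based on the infra wiring notes. Only DEPLOY_ENV === "staging", OPT_OUT_HOSTS, and the hostOptOut || paramOptOut rule come from this log.

```typescript
interface AnalyticsEvent {
  distinctId: string;
  event: string;
  properties?: Record<string, unknown>;
}

interface AnalyticsClient {
  capture(event: AnalyticsEvent): void;
  shutdown(): Promise<void>;
}

// Same methods as a real client, no side effects.
const noopClient: AnalyticsClient = {
  capture(_event: AnalyticsEvent): void {},
  shutdown(): Promise<void> {
    return Promise.resolve();
  },
};

// Capture by default; go no-op only on a positive signal (missing token or
// explicit staging). Prod on Vercel leaves DEPLOY_ENV unset, so it captures.
function shouldUseNoopClient(env: Record<string, string | undefined>): boolean {
  const tokenMissing = !env["NEXT_PUBLIC_POSTHOG_PROJECT_TOKEN"];
  const isStaging = env["DEPLOY_ENV"] === "staging";
  return tokenMissing || isStaging;
}

const OPT_OUT_HOSTS = new Set<string>(["stage.askflorence.health"]);

// hostname and search are passed in (e.g. location.hostname, location.search)
// so the rule can be exercised outside a browser.
function shouldOptOut(hostname: string, search: string): boolean {
  const hostOptOut = OPT_OUT_HOSTS.has(hostname);
  const paramOptOut = /(?:^\?|&)no_track=1(?:&|$)/.test(search);
  return hostOptOut || paramOptOut;
}
```

Because the host check and the param check are OR-ed, removing ?no_track=1 on staging still leaves shouldOptOut true, which matches the "cannot opt back in" behavior noted above.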

Infra wiring: NEXT_PUBLIC_POSTHOG_PROJECT_TOKEN + NEXT_PUBLIC_POSTHOG_HOST threaded in three places because Next.js inlines NEXT_PUBLIC_* at build time:

Dockerfile: accepted as ARGs and exported as ENV before RUN npm run build so they're baked into the client bundle.
.github/workflows/deploy-staging.yml: passed as --build-args sourced from GitHub Actions variables (not secrets — PostHog project tokens are public and ship in every page load's browser bundle).
infra/envs/staging/ecs.tf: added as plain environment entries so server-side reads at runtime have them too; also makes future token rotation a task-def update, not an image rebuild.

Evidence that the wiring is correct: grepping /_next/static/chunks/0u92fl5tvujj9.js served from stage.askflorence.health finds both the exact token value and the literal string stage.askflorence.health. A follow-up SES send via POST /api/waitlist returned HTTP 200 with no PostHog crash.

Addressed from the Issue #47 docs comment

docs/infrastructure/aws-setup.md — created in this session as the general AWS runbook. Follows the established file naming + frontmatter pattern.
Reference in docs/infrastructure/cloudtrail-setup.md to aws-setup.md will be re-linked once that file's initial commit lands alongside this session log.
ignoreDeadLinks in docs/.vitepress/config.ts tightened to cover only the specific cross-repo Terraform source references that genuinely cannot be fixed without a pattern change. The repo-root SESSION_BRIEF_*.md dead links are handled separately by a follow-up that moves those artifacts into docs/session-log/ over time.

What this session does NOT do (explicit non-goals)

Does not move production traffic. Cloudflare apex DNS still points at Vercel. Nothing in this session affects what a real visitor hitting askflorence.health or www.askflorence.health experiences.
Does not touch prod Atlas. All Mongo operations targeted the staging project (69e31af12fd2c0aef51bbb41); not even read-only discovery commands were run against the prod project.
Does not retire Resend. ResendProvider + EMAIL_PROVIDER=resend code path stays live until Phase 11 post-cutover cleanup.
Does not provision prod AWS. Prod account askflorence-prod (039624954211) stays at Phase 2.5 baseline — no VPC, no ECS, no ALB. Phase 8 is the mirror-from-staging step.
Does not grant SES production access. Staging still needs verified sandbox recipients; taking SES out of sandbox is an AWS-side review on a ticket filed in Phase 5.4.
Does not touch /agents, /agent-onboarding, /agent-discovery page UIs. Route handler code was refactored to use the sendEmail() abstraction, but form flows, validation, copy, and styling are byte-for-byte unchanged from v0.14.0.

Verification

All exercised on the staging ALB hostname stage.askflorence.health, which is reachable globally. None of these steps touched Vercel prod.

GET /api/health → 200 {"status":"ok","commit":"04cfd35...","env":"staging"}.
POST /api/waitlist with {"email":"[email protected]","zip":"10001","interest":"consumer"} → 200 with real waitlist_submission_id; record visible in Atlas agent_waitlist_submissions; SES DeliveryAttempts metric +1; zero error logs.
aws sesv2 get-account shows the account is still in the sandbox (ProductionAccessEnabled is false, expected until production access is granted). SentLast24Hours: 3.
Client-side PostHog bundle verification: curl https://stage.askflorence.health/_next/static/chunks/0u92fl5tvujj9.js | grep -aoE '(phc_Azu[^"]+|stage\.askflorence\.health)' returns both expected strings.
Vercel prod regression sanity: npm run build green; route-handler diff shows no behavioral change when EMAIL_PROVIDER is unset (Resend path identical). Live Vercel deploy not modified.
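
The /api/health check in the list above could be turned into a repeatable assertion along these lines. This is a sketch, not a script used in this session: checkHealth is a hypothetical helper, and it takes the already-parsed JSON body so the HTTP client is left to the caller. The commit is matched by prefix because the log records only the short SHA.

```typescript
// Returns a list of problems; an empty list means the payload looks healthy.
function checkHealth(payload: unknown, expectedEnv: string, commitPrefix: string): string[] {
  if (typeof payload !== "object" || payload === null) {
    return ["payload is not a JSON object"];
  }
  const body = payload as Record<string, unknown>;
  const problems: string[] = [];

  if (body["status"] !== "ok") {
    problems.push(`status is ${JSON.stringify(body["status"])}, expected "ok"`);
  }
  if (body["env"] !== expectedEnv) {
    problems.push(`env is ${JSON.stringify(body["env"])}, expected "${expectedEnv}"`);
  }
  const commit = body["commit"];
  if (typeof commit !== "string" || commit.indexOf(commitPrefix) !== 0) {
    problems.push(`commit does not start with "${commitPrefix}"`);
  }
  return problems;
}

// Example: a payload from the wrong environment is reported, not thrown.
console.log(checkHealth({ status: "ok", commit: "04cfd35", env: "prod" }, "staging", "04cfd35"));
```

Returning problems instead of throwing lets a smoke test print every mismatch from one request, which is handy when both the commit and the env are wrong after a bad deploy.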

Next session priorities

Phase 6 — staging CloudFront distribution + WAFv2 web ACL in front of the ALB. The Route 53 alias record for stage.askflorence.health (in the subzone delegated from Cloudflare) swings from the ALB DNS name → CloudFront distribution. WAF managed rule sets: CommonRuleSet + KnownBadInputs + SQLiRuleSet + AmazonIpReputationList + AnonymousIpList + rate-based rule (2000 req / 5min / IP).
Phase 7 — staging Atlas VPC peering. Replaces NAT EIP 54.164.140.5 currently on the Atlas allowlist with the staging VPC CIDR 10.40.0.0/16. Allowlist tightened to VPC-only.
(Taha) Reply to the AWS SES production-access review email.
(Taha) Fix the trailing \n on CMS_API_KEY on Vercel prod env (staging is already clean).
Once Phase 6 + 7 are green, the staging stack is feature-complete — Phase 8 is mirroring that exact shape into askflorence-prod.