Session log — 2026-04-23 / 2026-04-24 UTC — Phase 10 cutover

Scope

Flip Cloudflare apex DNS from Vercel to the prod CloudFront distribution, migrating live customer traffic to the AWS stack. Identify + fix latent bugs surfaced during post-cutover smoke (Vercel write failure, broken Resend account, missing S3 upload permissions on prod task role). Vercel stays warm as rollback target for the first 48h.

Actor

Human: Taha Abbasi.
Agent: Claude Opus 4.7 (1M context), running in Claude Code CLI.

Tickets

Advances Issue #47 from Phase 8 (prod canary) through Phase 10 (live cutover).
Surfaces a pre-existing Vercel prod bug (empty MONGODB_WRITE_URI for ~2 weeks) — resolved in-session via app-write password rotation and Vercel env repopulation.
Phase 11 items started: SES production-access request filed; decision made to retire Resend rather than revive it.

External systems touched

Cloudflare DNS (`askflorence.health`)

Two records edited via Cloudflare dashboard, both changed from Proxied to DNS-only with TTL 300s:

Record	Before	After
`askflorence.health` (apex)	`A 216.198.79.1` (proxied)	`CNAME d1pnfyzua893hx.cloudfront.net` (DNS only)
`www.askflorence.health`	`CNAME askflorence.health` (proxied)	`CNAME d1pnfyzua893hx.cloudfront.net` (DNS only)

Global DNS propagation observed within 15 seconds. First CloudFront edge log entry from a real user hit came in ~30s after save.

AWS prod (`039624954211`)

Prod ECS task def revisions created: :5 (EMAIL_PROVIDER=resend failover attempt), :6 (back to EMAIL_PROVIDER=ses), :7 (final; adds S3_AGENT_SURVEY_BUCKET=askflorence-data). Service rolled to :7.
Prod task role gained inline policy S3AgentSurveyUploadsWrite via Terraform (infra/envs/prod/ecs.tf).
Prod Atlas IP access list unchanged — 0.0.0.0/0 + 10.20.0.0/16 both present. 0.0.0.0/0 removal deferred until 48h Phase 10 bake completes.

AWS management (`778477254880`)

askflorence-data bucket gained a Terraform-managed bucket policy (first TF-owned aspect of that bucket). New file infra/envs/management/s3-askflorence-data.tf. Preserves existing DenyNonSSLRequests statement and adds cross-account grant AllowProdEcsTaskRolePutAgentSurveyUploads for arn:aws:iam::039624954211:role/askflorence-prod-app-task on s3:PutObject on askflorence-data/agent-survey-uploads/*.
KMS CMK alias/askflorence-data key policy — unchanged. Existing AllowOrgPrincipalsForTfstate statement (ViaService-bound to s3.us-east-1) covers the prod task role's need to GenerateDataKey when writing KMS-encrypted objects.
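
For reference, the resulting two-statement bucket policy can be sketched as a plain document. This is an illustration only, not the contents of infra/envs/management/s3-askflorence-data.tf: the Sid names, principal ARN, action, and prefix come from this log, while the exact shape of the DenyNonSSLRequests statement is assumed to follow the common aws:SecureTransport pattern.

```python
import json

BUCKET = "askflorence-data"
PREFIX = "agent-survey-uploads"
PROD_TASK_ROLE = "arn:aws:iam::039624954211:role/askflorence-prod-app-task"


def build_bucket_policy() -> dict:
    """Return the bucket policy described above as a plain dict."""
    bucket_arn = f"arn:aws:s3:::{BUCKET}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Assumed shape: the usual deny-when-not-TLS statement.
                "Sid": "DenyNonSSLRequests",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
            {
                # Cross-account grant: PutObject only, upload prefix only.
                "Sid": "AllowProdEcsTaskRolePutAgentSurveyUploads",
                "Effect": "Allow",
                "Principal": {"AWS": PROD_TASK_ROLE},
                "Action": "s3:PutObject",
                "Resource": f"{bucket_arn}/{PREFIX}/*",
            },
        ],
    }


def render_policy() -> str:
    """Serialize the policy for review or diffing."""
    return json.dumps(build_bucket_policy(), indent=2)
```

Because the bucket and the role live in different accounts, the grant has to exist on both sides: this bucket policy in the management account and the matching identity policy on the prod task role.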

MongoDB Atlas (prod project `69dc20c64005b222804dafa4`)

app-write user password rotated via atlas dbusers update (safe rotation — pre-existing Vercel bug had MONGODB_WRITE_URI="" so no production consumer relied on the old password).
IP access list entries unchanged.

Vercel

MONGODB_WRITE_URI env var repopulated with the rotated app-write URI.
vercel --prod redeploy triggered so the running functions pick up the new env.
Vercel write path restored to working state (previously broken since 2026-04-16 per env var modification timestamp).
Deployment otherwise unchanged — Vercel stays warm as Phase 10 rollback target.

Resend (external)

Investigated: API key stored in Vercel env has a literal \n (backslash + n) at the end, same bug class as previously-fixed CMS_API_KEY. Stripping the literal \n produces a valid authenticating key.
Resend account's updates.askflorence.health domain has been in status: "failed" since 2026-04-10 — required DKIM CNAMEs were never added to Cloudflare. Vercel email sending therefore stopped ~2 weeks ago (compounded with the empty MONGODB_WRITE_URI bug).
Decision: do not revive Resend. Send path moves to AWS SES full-time. Phase 11 retires the Resend account.

AWS Support (prod account SES case)

SES production-access request filed with conservative transactional framing (< 100/day current, < 500/day 60d ceiling, < 5k/day through end of 2026). AWS initial response asked for email-type detail + bounce/complaint/unsubscribe handling + example content + verified-identity status. Detailed response submitted 2026-04-24T02:05Z. Awaiting approval (typical turnaround 24-72h).

The three bugs surfaced during cutover smoke

(1) Vercel `MONGODB_WRITE_URI=""` — latent since 2026-04-16

Consumer waitlist, agent waitlist, and agent discovery writes on Vercel prod were failing with the code-level error "MONGODB_URI_WAITLIST_WRITE or MONGODB_URI_SURVEY_WRITE or MONGODB_WRITE_URI must be set". Because the UI renders the same "You're on the list" success page regardless of the backend outcome (email is fire-and-forget, and the Mongo write is best-effort after the response is already formed in memory), no user or monitor caught this. There was no alerting on MongoParseError in Vercel logs.
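
A minimal sketch of a write-URI resolver that treats an empty string the same as an unset variable and fails loudly. The variable names are taken from the error message quoted above; the lookup order and the function name resolve_write_uri are assumptions for illustration, not the application's actual code.

```python
import os
from typing import Mapping, Optional

# Names from the error message in this log; the order is an assumption.
CANDIDATES = (
    "MONGODB_URI_WAITLIST_WRITE",
    "MONGODB_URI_SURVEY_WRITE",
    "MONGODB_WRITE_URI",
)


def resolve_write_uri(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the first non-empty candidate URI, or raise if none is usable."""
    env = os.environ if env is None else env
    for name in CANDIDATES:
        # An empty or whitespace-only value counts as "not set".
        value = (env.get(name) or "").strip()
        if value:
            return value
    raise RuntimeError(" or ".join(CANDIDATES) + " must be set")
```

Calling a resolver like this once at startup, or from a health check, would turn an empty value into a visible failure instead of a silent per-request error. Whether that fits the existing routes is not something this log establishes.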

Discovered by reading Vercel env during Phase 8 secret population (Vercel stored the key with an empty value). Confirmed via direct Mongo query — no agent waitlist rows from the Vercel era in the ~2-week window.

Fix: rotate the prod Atlas app-write password (Atlas CLI), populate prod/mongodb/app-write in AWS Secrets Manager, push the same URI to Vercel env, re-deploy Vercel. Verified via POST /api/waitlist at Vercel returning 200 + real _id post-fix.

Lesson: a trailing literal \n on a secret value and an empty string in a required env var are bug classes that need a pre-commit or CI guard. Adding a validation step to a future CI job is captured as a Phase 11 todo.
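
One possible shape for that guard, as a sketch only: it flags missing or empty required variables and any value ending in a literal backslash-n or stray whitespace. The function name find_secret_problems and the idea of passing the environment in as a mapping are assumptions; no such job exists yet per this log.

```python
from typing import Iterable, List, Mapping


def find_secret_problems(env: Mapping[str, str], required: Iterable[str]) -> List[str]:
    """Return human-readable problems; never include the secret values."""
    problems: List[str] = []
    for name in required:
        value = env.get(name)
        if value is None:
            problems.append(f"{name}: missing")
        elif value.strip() == "":
            problems.append(f"{name}: empty")
    for name, value in env.items():
        if value.endswith("\\n"):
            # Backslash followed by 'n', not a real newline character.
            problems.append(f"{name}: ends with literal backslash-n")
        elif value != value.strip():
            problems.append(f"{name}: leading or trailing whitespace")
    return problems
```

A CI step could run this against the exported environment and fail the build when the returned list is non-empty. Reporting only variable names keeps secret material out of build logs.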

(2) Resend API key literal `\n` + domain "failed" status

During AWS SES cutover smoke, a failover to Resend was attempted (via EMAIL_PROVIDER=resend). The Resend API returned "API key is invalid". Hex dump of the Vercel-stored value:

N   q   k   i   \   n   "   \n

That's a literal backslash-n followed by LF, a known bug class from the CMS_API_KEY episode. Stripping the \n (shell: "${V%\\n}") produces a valid key. A subsequent test with the stripped key returned "domain updates.askflorence.health is not verified", which is a separate issue.
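
The same strip can be sketched outside the shell. The helper names below are hypothetical, and the sketch assumes any surrounding quotes from the env file have already been removed from the value.

```python
from typing import List


def tail_hex(value: str, count: int = 4) -> List[str]:
    """Hex codes of the last few bytes, for inspection without printing the key."""
    return [f"{byte:02x}" for byte in value.encode("utf-8")[-count:]]


def strip_literal_newline(value: str) -> str:
    """Drop real trailing CR/LF, then one literal backslash-n suffix."""
    value = value.rstrip("\r\n")
    if value.endswith("\\n"):
        value = value[:-2]
    return value
```

A tail of 5c 6e (backslash, then the letter n) is the signature of this bug, whereas a real newline would show up as 0a.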

Resend dashboard (via API): updates.askflorence.health added 2026-04-10, status failed. No DKIM CNAMEs for Resend in Cloudflare. Resend email sending has been non-functional on this account for ~2 weeks, independently of the Mongo bug.

Decision: AWS SES is the forward path. Resend retires per Phase 11. No need to verify Resend DKIM now.

(3) Prod ECS task role had no S3 upload permission

POST /api/agents/discovery/upload on prod returned 400 "Only PDF, JPG, or PNG files are accepted" after supplying correct docType + blankConfirmed fields. Reading the upload route revealed:

Writes to S3 bucket askflorence-data in management account (778477254880), not prod.
Vercel had cross-account access via static AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY env vars (IAM user creds, retire at Phase 11 post-cutover).
Prod ECS task role had only ses:SendEmail in its inline policy — no S3 grant. Upload path was dead end-to-end.

Proper fix via Terraform (not IAM user keys):

infra/envs/management/s3-askflorence-data.tf — new file. Manages the bucket policy on askflorence-data (first TF-owned aspect of that bucket; the bucket resource itself predates TF and stays unmanaged). Preserves existing DenyNonSSLRequests + adds explicit cross-account grant for the prod task role.
infra/envs/prod/ecs.tf — task role gains S3AgentSurveyUploadsWrite inline policy granting s3:PutObject on the same prefix. Task def env var S3_AGENT_SURVEY_BUCKET=askflorence-data added.
No KMS key policy change — existing AllowOrgPrincipalsForTfstate statement on the mgmt CMK covers S3-via-kms:ViaService.

Verified end-to-end: PDF upload → 200 + object present at askflorence-data/agent-survey-uploads/custom/1776993996441-0a767d98490801537e44789e-consent-template.pdf. GuardDuty Malware Protection scans the new object automatically.

Verification (Phase 9 + Phase 10 combined)

Pre-cutover (Phase 9 gate):

HTTP parity probe: 60/60 PASS across 20 stratified scenarios × 3 endpoints (/api/counties, /api/eligibility, /api/plans). Stratified over federal states (TX, FL, OH, GA, NC, UT including UT's unique age-curve band, AZ, PA) + SBE NY (Manhattan, Rochester, Syracuse), various household sizes 1-4, incomes spanning CSR-94 through no-CSR zones.
Prod canary /api/waitlist POST → 200 with real Mongo _id via peering.
Direct ALB smoke from CI runners via origin.askflorence.health — bypasses WAF false-positive block on GitHub IP ranges.
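
The parity probe's structure (every scenario against every endpoint, old stack versus new) can be sketched as below. The endpoint paths come from this log; the scenario fields, the Fetch callable signature, and the function name run_parity are assumptions, and the real probe's comparison rules may be looser than strict equality.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

ENDPOINTS = ("/api/counties", "/api/eligibility", "/api/plans")

# A fetcher takes (endpoint, scenario) and returns (status_code, parsed_body).
Fetch = Callable[[str, dict], Tuple[int, dict]]


def run_parity(scenarios: List[dict], fetch_old: Fetch, fetch_new: Fetch) -> Dict[str, list]:
    """Compare both stacks on every scenario x endpoint pair."""
    passed: List[tuple] = []
    failed: List[tuple] = []
    for scenario, endpoint in product(scenarios, ENDPOINTS):
        old = fetch_old(endpoint, scenario)
        new = fetch_new(endpoint, scenario)
        # Strict equality on status and body; any difference is a failure.
        (passed if old == new else failed).append((endpoint, scenario))
    return {"passed": passed, "failed": failed}
```

With 20 scenarios and 3 endpoints this yields the 60 comparisons reported above. Injecting the two fetchers keeps the comparison logic testable without network access.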

Post-cutover:

Every public route (/, /plans, /agents, /agent-onboarding, /agent-discovery, /updates, /privacy, /terms) → 200.
POST /api/eligibility with correct nested shape → 200 with real APTC + CSR (APTC=$425, CSR=73% AV Silver for age 35 single $35k Dallas TX).
POST /api/plans same shape → 200 with 100 plans + full cost-share data.
POST /api/waitlist consumer + agent variants → 200 with real Mongo writes.
POST /api/agents/discovery/upload with valid PDF → 200 with real S3 object key.
WAF SQLi probe → 403 blocked.
Response headers clean: server: AskFlorence, HSTS, CSP, X-Frame-Options DENY. No trace of Vercel or Cloudflare proxy.
ECS: 2 HA tasks, rollout COMPLETED, task def :7, 0 × 5xx over 10-min window.
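
The response-header check in the list above can be expressed as a small assertion helper. This is a sketch: the expected server and X-Frame-Options values come from this log, while the specific Vercel and Cloudflare header names treated as leak indicators are an assumed, non-exhaustive list.

```python
from typing import List, Mapping

EXPECTED_VALUES = {"server": "AskFlorence", "x-frame-options": "DENY"}
MUST_BE_PRESENT = ("strict-transport-security", "content-security-policy")
# Headers commonly added by Vercel or the Cloudflare proxy (assumed list).
MUST_BE_ABSENT = ("x-vercel-id", "x-vercel-cache", "cf-ray", "cf-cache-status")


def check_headers(headers: Mapping[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the headers look clean."""
    lowered = {key.lower(): value for key, value in headers.items()}
    problems: List[str] = []
    for name, expected in EXPECTED_VALUES.items():
        if lowered.get(name) != expected:
            problems.append(f"{name}: expected {expected!r}, got {lowered.get(name)!r}")
    for name in MUST_BE_PRESENT:
        if name not in lowered:
            problems.append(f"{name}: missing")
    for name in MUST_BE_ABSENT:
        if name in lowered:
            problems.append(f"{name}: unexpected proxy header")
    return problems
```

Header names are lowercased before comparison because HTTP header names are case-insensitive.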

Follow-ups

T+48h: remove 0.0.0.0/0 from prod Atlas IP access list (closes Vercel's reach into prod Atlas; Vercel keeps running without DB access as a pure DNS-level rollback target).
T+48h: archive Vercel project (don't delete — keep for reference).
T+48h: raise Cloudflare TTL back from 300s to Auto.
SES production-access approval: expected 24-72h from reply submission. Once granted, all email sends resume from SES without any sandbox recipient limitations.
Phase 11 hardening: retire Resend account, finalize PostHog self-host vs replace decision, activate Drata read-only IAM, schedule first pen test, add a secret-validation CI guard to catch future literal-\n bugs.
Phase 12 compliance docs: SOC 2 + HIPAA + EDE control-mapping.