Session log — 2026-04-23 / 2026-04-24 UTC — Phase 10 cutover
Scope
Flip Cloudflare apex DNS from Vercel to the prod CloudFront distribution, migrating live customer traffic to the AWS stack. Identify + fix latent bugs surfaced during post-cutover smoke (Vercel write failure, broken Resend account, missing S3 upload permissions on prod task role). Vercel stays warm as rollback target for the first 48h.
Actor
- Human: Taha Abbasi.
- Agent: Claude Opus 4.7 (1M context), running in Claude Code CLI.
Tickets
- Advances Issue #47 from Phase 8 (prod canary) through Phase 10 (live cutover).
- Surfaces a pre-existing Vercel prod bug (empty `MONGODB_WRITE_URI` for ~2 weeks), resolved in-session via app-write password rotation and Vercel env repopulation.
- Phase 11 items started: SES production-access request filed, Resend retirement confirmed rather than revived.
External systems touched
Cloudflare DNS (askflorence.health)
Two records edited via Cloudflare dashboard, both changed from Proxied to DNS-only with TTL 300s:
| Record | Before | After |
|---|---|---|
| askflorence.health (apex) | A 216.198.79.1 (proxied) | CNAME d1pnfyzua893hx.cloudfront.net (DNS only) |
| www.askflorence.health | CNAME askflorence.health (proxied) | CNAME d1pnfyzua893hx.cloudfront.net (DNS only) |
Global DNS propagation observed within 15 seconds. First CloudFront edge log entry from a real user hit came in ~30s after save.
AWS prod (039624954211)
- Prod ECS task def revision `:5` created (`EMAIL_PROVIDER=resend` failover attempt), `:6` (back to `EMAIL_PROVIDER=ses`), `:7` (final, adds `S3_AGENT_SURVEY_BUCKET=askflorence-data`). Service rolled to `:7`.
- Prod task role gained inline policy `S3AgentSurveyUploadsWrite` via Terraform (`infra/envs/prod/ecs.tf`).
- Prod Atlas IP access list unchanged: `0.0.0.0/0` + `10.20.0.0/16` both present. `0.0.0.0/0` removal deferred until 48h Phase 10 bake completes.
AWS management (778477254880)
- `askflorence-data` bucket gained a Terraform-managed bucket policy (first TF-owned aspect of that bucket). New file `infra/envs/management/s3-askflorence-data.tf`. Preserves existing `DenyNonSSLRequests` statement and adds cross-account grant `AllowProdEcsTaskRolePutAgentSurveyUploads` for `arn:aws:iam::039624954211:role/askflorence-prod-app-task` on `s3:PutObject` on `askflorence-data/agent-survey-uploads/*`.
- KMS CMK `alias/askflorence-data` key policy: unchanged. Existing `AllowOrgPrincipalsForTfstate` statement (ViaService-bound to s3.us-east-1) covers the prod task role's need to `GenerateDataKey` when writing KMS-encrypted objects.
MongoDB Atlas (prod project 69dc20c64005b222804dafa4)
- `app-write` user password rotated via `atlas dbusers update` (safe rotation: the pre-existing Vercel bug had `MONGODB_WRITE_URI=""`, so no production consumer relied on the old password).
- IP access list entries unchanged.
Vercel
- `MONGODB_WRITE_URI` env var repopulated with the rotated `app-write` URI.
- `vercel --prod` redeploy triggered so the running functions pick up the new env.
- Vercel write path restored to working state (previously broken since 2026-04-16 per env var modification timestamp).
- Deployment otherwise unchanged — Vercel stays warm as Phase 10 rollback target.
Resend (external)
- Investigated: API key stored in Vercel env has a literal `\n` (backslash + n) at the end, same bug class as the previously-fixed `CMS_API_KEY`. Stripping the literal `\n` produces a valid authenticating key.
- Resend account's `updates.askflorence.health` domain has been in `status: "failed"` since 2026-04-10 because the required DKIM CNAMEs were never added to Cloudflare. Vercel email sending therefore stopped ~2 weeks ago (compounded with the empty `MONGODB_WRITE_URI` bug).
- Decision: do not revive Resend. Send path moves to AWS SES full-time. Phase 11 retires the Resend account.
AWS Support (prod account SES case)
- SES production-access request filed with conservative transactional framing (< 100/day current, < 500/day 60d ceiling, < 5k/day through end of 2026). AWS initial response asked for email-type detail + bounce/complaint/unsubscribe handling + example content + verified-identity status. Detailed response submitted 2026-04-24T02:05Z. Awaiting approval (typical turnaround 24-72h).
The three bugs surfaced during cutover smoke
(1) Vercel MONGODB_WRITE_URI="" — latent since 2026-04-16
Consumer + agent waitlist + agent discovery writes on Vercel prod were failing with the code-level error `MONGODB_URI_WAITLIST_WRITE or MONGODB_URI_SURVEY_WRITE or MONGODB_WRITE_URI must be set`. Because the UI renders the same "You're on the list" success page regardless of the backend outcome (email is fire-and-forget, the Mongo write is best-effort after the response is already formed in memory), no user or monitor caught this. There was also no alerting on `MongoParseError` in Vercel logs.
Discovered by reading Vercel env during Phase 8 secret population (Vercel stored the key with an empty value). Confirmed via direct Mongo query — no agent waitlist rows from the Vercel era in the ~2-week window.
Fix: rotate the prod Atlas app-write password (Atlas CLI), populate prod/mongodb/app-write in AWS Secrets Manager, push the same URI to Vercel env, re-deploy Vercel. Verified via POST /api/waitlist at Vercel returning 200 + real _id post-fix.
Impact lesson: trailing-\n literal on a secret value, and empty-string on a required env var, are a class of bug that needs a pre-commit or CI guard. Adding a validation step to a future CI job is captured in Phase 11 todo.
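A minimal sketch of what such a guard could look like, assuming the CI job can read the resolved values into a plain mapping. The function name `find_env_problems` and the `REQUIRED` tuple are hypothetical; only the two variable names come from this log.

```python
from typing import Iterable, Mapping

# Hypothetical list of variables treated as required; extend per environment.
REQUIRED = ("MONGODB_WRITE_URI", "CMS_API_KEY")


def find_env_problems(env: Mapping[str, str], required: Iterable[str]) -> list[str]:
    """Return one human-readable problem per bad value; an empty list means clean."""
    problems: list[str] = []
    for name in required:
        value = env.get(name)
        if value is None:
            problems.append(f"{name}: missing")
            continue
        if value.strip() == "":
            problems.append(f"{name}: empty")
            continue
        # Ignore surrounding whitespace and quotes before looking for the
        # two-character sequence backslash + n at the end of the value.
        core = value.strip().strip("\"'")
        if core.endswith("\\n"):
            problems.append(f"{name}: ends with literal backslash-n")
        if value != value.strip():
            problems.append(f"{name}: leading or trailing whitespace")
    return problems
```

A CI step could call `find_env_problems(os.environ, REQUIRED)` and fail the job when the returned list is non-empty. It reports only variable names, never values, so secrets stay out of the build log.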
(2) Resend API key literal \n + domain "failed" status
During AWS SES cutover smoke, a failover to Resend was attempted (via `EMAIL_PROVIDER=resend`). The Resend API returned `API key is invalid`. Hex dump of the tail of the Vercel-stored value:

`N q k i \ n " \n`

That's literal backslash-n followed by LF, a known bug class from the `CMS_API_KEY` episode. Stripping the `\n` (shell: `"${V%\\n}"`) produces a valid key. A subsequent test with the stripped key returned `domain updates.askflorence.health is not verified`, which is a separate issue.
Resend dashboard (via API): updates.askflorence.health added 2026-04-10, status failed. No DKIM CNAMEs for Resend in Cloudflare. Resend email sending has been non-functional on this account for ~2 weeks, independently of the Mongo bug.
Decision: AWS SES is the forward path. Resend retires per Phase 11. No need to verify Resend DKIM now.
(3) Prod ECS task role had no S3 upload permission
POST /api/agents/discovery/upload on prod returned 400 "Only PDF, JPG, or PNG files are accepted" after correct docType + blankConfirmed fields. Reading the upload route revealed:
- Writes to S3 bucket `askflorence-data` in management account (778477254880), not prod.
- Vercel had cross-account access via static `AWS_ACCESS_KEY_ID` + `AWS_SECRET_ACCESS_KEY` env vars (IAM user creds, retire at Phase 11 post-cutover).
- Prod ECS task role had only `ses:SendEmail` in its inline policy, with no S3 grant. Upload path was dead end-to-end.
Proper fix via Terraform (not IAM user keys):
- `infra/envs/management/s3-askflorence-data.tf`: new file. Manages the bucket policy on `askflorence-data` (first TF-owned aspect of that bucket; the bucket resource itself predates TF and stays unmanaged). Preserves existing `DenyNonSSLRequests` + adds explicit cross-account grant for the prod task role.
- `infra/envs/prod/ecs.tf`: task role gains `S3AgentSurveyUploadsWrite` inline policy granting `s3:PutObject` on the same prefix. Task def env var `S3_AGENT_SURVEY_BUCKET=askflorence-data` added.
- No KMS key policy change: existing `AllowOrgPrincipalsForTfstate` statement on the mgmt CMK covers S3-via-`kms:ViaService`.
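Cross-account S3 writes need an allow on both sides: the bucket policy in the owning account and an identity policy on the calling role. The sketch below renders the two documents as Python dicts to show how they pair up. It is an illustration, not the Terraform that was applied. The statement bodies are assumptions: the `DenyNonSSLRequests` body follows the common `aws:SecureTransport` pattern, and the Sid `PutAgentSurveyUploads` is hypothetical.

```python
import json

BUCKET = "askflorence-data"
PREFIX = "agent-survey-uploads/"
PROD_TASK_ROLE = "arn:aws:iam::039624954211:role/askflorence-prod-app-task"
OBJECTS_ARN = f"arn:aws:s3:::{BUCKET}/{PREFIX}*"


def bucket_policy() -> dict:
    """Resource-side policy in the management account (assumed shape)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Pre-existing statement, body assumed from the usual pattern.
                "Sid": "DenyNonSSLRequests",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
            {   # New cross-account grant, scoped to one role and one prefix.
                "Sid": "AllowProdEcsTaskRolePutAgentSurveyUploads",
                "Effect": "Allow",
                "Principal": {"AWS": PROD_TASK_ROLE},
                "Action": "s3:PutObject",
                "Resource": OBJECTS_ARN,
            },
        ],
    }


def task_role_policy() -> dict:
    """Identity-side inline policy on the prod task role (assumed shape)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PutAgentSurveyUploads",  # hypothetical Sid
                "Effect": "Allow",
                "Action": "s3:PutObject",
                "Resource": OBJECTS_ARN,
            }
        ],
    }


def render(policy: dict) -> str:
    """Serialize a policy document for review or diffing."""
    return json.dumps(policy, indent=2, sort_keys=True)
```

Both documents name the same action and the same object ARN, so a prefix change has to land in both files or the upload breaks again.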
Verified end-to-end: PDF upload → 200 + object present at askflorence-data/agent-survey-uploads/custom/1776993996441-0a767d98490801537e44789e-consent-template.pdf. GuardDuty Malware Protection scans the new object automatically.
Verification (Phase 9 + Phase 10 combined)
Pre-cutover (Phase 9 gate):
- HTTP parity probe: 60/60 PASS across 20 stratified scenarios × 3 endpoints (`/api/counties`, `/api/eligibility`, `/api/plans`). Stratified over federal states (TX, FL, OH, GA, NC, UT including UT's unique age-curve band, AZ, PA) + SBE NY (Manhattan, Rochester, Syracuse), various household sizes 1-4, incomes spanning CSR-94 through no-CSR zones.
- Prod canary `/api/waitlist` POST → 200 with real Mongo `_id` via peering.
- Direct ALB smoke from CI runners via `origin.askflorence.health`, which bypasses the WAF false-positive block on GitHub IP ranges.
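For reference, the 60/60 figure is 20 scenarios × 3 endpoints, with each pair compared between the old and new stacks. A minimal sketch of that counting logic follows. It is not the probe that was run: `parity_report`, the `Fetch` callables, and the scenario fields are hypothetical, and the HTTP layer is injected so the comparison can be exercised without network access.

```python
from typing import Any, Callable, Iterable, Sequence

ENDPOINTS = ("/api/counties", "/api/eligibility", "/api/plans")

# A fetcher takes (endpoint, scenario) and returns a comparable response body.
Fetch = Callable[[str, dict], Any]


def parity_report(
    scenarios: Iterable[dict],
    fetch_old: Fetch,
    fetch_new: Fetch,
    endpoints: Sequence[str] = ENDPOINTS,
) -> tuple[int, int, list[tuple[str, dict]]]:
    """Compare both stacks per (scenario, endpoint); return (passed, total, failures)."""
    failures: list[tuple[str, dict]] = []
    total = 0
    for scenario in scenarios:
        for endpoint in endpoints:
            total += 1
            if fetch_old(endpoint, scenario) != fetch_new(endpoint, scenario):
                failures.append((endpoint, scenario))
    return total - len(failures), total, failures
```

With 20 scenarios and identical responses from both fetchers this returns `(60, 60, [])`. Any mismatch is reported with its endpoint and scenario so it can be replayed by hand.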
Post-cutover:
- Every public route (`/`, `/plans`, `/agents`, `/agent-onboarding`, `/agent-discovery`, `/updates`, `/privacy`, `/terms`) → 200.
- `POST /api/eligibility` with correct nested shape → 200 with real APTC + CSR (APTC=$425, CSR=73% AV Silver for age 35 single $35k Dallas TX).
- `POST /api/plans` same shape → 200 with 100 plans + full cost-share data.
- `POST /api/waitlist` consumer + agent variants → 200 with real Mongo writes.
- `POST /api/agents/discovery/upload` with valid PDF → 200 with real S3 object key.
- WAF SQLi probe → 403 blocked.
- Response headers clean: `server: AskFlorence`, HSTS, CSP, `X-Frame-Options DENY`. No trace of Vercel or Cloudflare proxy.
- ECS: 2 HA tasks, rollout COMPLETED, task def `:7`, 0 × 5xx over 10-min window.
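The response-header check above can be made repeatable. A minimal sketch, assuming "no trace" means the absence of `x-vercel-*` and `cf-*` response headers (both platforms emit headers with those prefixes, such as `x-vercel-id` and `cf-ray`). The function name `header_problems` is hypothetical.

```python
from typing import Mapping


def header_problems(headers: Mapping[str, str]) -> list[str]:
    """Return a list of deviations from the expected post-cutover header set."""
    h = {name.lower(): value for name, value in headers.items()}
    problems: list[str] = []
    if h.get("server") != "AskFlorence":
        problems.append(f"unexpected server header: {h.get('server')!r}")
    if "strict-transport-security" not in h:
        problems.append("missing HSTS")
    if "content-security-policy" not in h:
        problems.append("missing CSP")
    if h.get("x-frame-options", "").upper() != "DENY":
        problems.append("X-Frame-Options is not DENY")
    # Assumed markers of the old path: Vercel and Cloudflare proxy headers.
    for name in sorted(h):
        if name.startswith("x-vercel-") or name.startswith("cf-"):
            problems.append(f"proxy header present: {name}")
    return problems
```

Feeding it the headers of a `GET /` response should yield an empty list after cutover. A non-empty list during the 48h bake would suggest traffic is still reaching the old path.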
Next
- T+48h: remove `0.0.0.0/0` from prod Atlas IP access list (closes Vercel's reach into prod Atlas; Vercel keeps running without DB access as a pure DNS-level rollback target).
- T+48h: archive Vercel project (don't delete; keep for reference).
- T+48h: raise Cloudflare TTL back from 300s to Auto.
- SES production-access approval: expected 24-72h from reply submission. Once granted, all email sends resume from SES without any sandbox recipient limitations.
- Phase 11 hardening: retire Resend account, finalize PostHog self-host vs replace decision, activate Drata read-only IAM, schedule first pen test, clean up secret-validation CI guard to catch future literal-`\n` bugs.
- Phase 12 compliance docs: SOC 2 + HIPAA + EDE control-mapping.