Session log — 2026-04-22 / 2026-04-23 UTC — Phase 8 prod AWS mirror
Scope
Build the entire AWS prod stack in askflorence-prod (039624954211), mirroring staging 1:1 with prod-scoped values + HA. Get the current Next.js app (consumer marketplace, agent flows, APIs) running behind a CloudFront + WAF edge on a private canary URL (prod-canary.askflorence.health) for end-to-end validation before Phase 10 cutover. Vercel prod (askflorence.health + www) continues to serve real user traffic throughout. The apex does not move in this phase.
Actor
- Human: Taha Abbasi.
- Agent: Claude Opus 4.7 (1M context), running in Claude Code CLI.
Tickets
- Advances Issue #47 Phase 8.
- Identifies a pre-existing prod Vercel bug worth a follow-up: MONGODB_WRITE_URI on Vercel prod is an empty string, so writes from Vercel have been failing for ~6 days. Taha sets the new URI on Vercel post-session to restore Vercel writes until cutover.
External systems touched
AWS (prod account 039624954211)
- Network module applied: VPC 10.20.0.0/16, 2 AZs (us-east-1a/b), 2 NAT gateways (HA vs staging's 1), 6 VPC endpoints multi-AZ (kms, secretsmanager, bedrock-runtime interface + S3 gateway + ECR api/dkr), 90-day flow-log retention. Disjoint from staging (10.40.0.0/16) and the future log-archive CIDR for eventual org-wide peering.
- KMS: new CMK alias/askflorence-prod, rotation on, 30-day deletion window. Same service-principal grants as staging (Secrets Manager + CloudWatch Logs).
- Secrets Manager: 15 prod shells under prod/*. Populated during the session:
  - prod/mongodb/app-read: from Vercel prod env (the app-read user's SRV URI).
  - prod/mongodb/app-write: freshly generated URI using a rotated app-write password (via atlas dbusers update). Safe rotation because Vercel's MONGODB_WRITE_URI was empty, so nothing on Vercel currently used app-write credentials.
  - prod/mongodb/{waitlist,survey,agents}-write + prod/mongodb/agents-admin: stopgap-populated with the same app-write URI until a follow-up #56 prod session creates narrow-scoped users on the prod Atlas project.
  - prod/mongodb/audit-read: populated with the app-read URI.
  - prod/cms-api-key + prod/resend-api-key + prod/unsubscribe-token-secret + prod/posthog-key: copied from Vercel prod env.
  - Florence / Bedrock / Whisper shells left as PLACEHOLDER.
- ACM cert: askflorence.health + www.askflorence.health + *.askflorence.health in us-east-1 (required for CloudFront). DNS validation via 2 CNAMEs Taha added at Cloudflare. Status ISSUED ~3 min after the records landed.
- SES identity: updates.askflorence.health verified (6 records added at Cloudflare by Taha: 3 DKIM CNAMEs + MX for MAIL FROM + SPF TXT + DMARC TXT p=quarantine). DKIM SUCCESS, MAIL FROM SUCCESS, VerifiedForSending true. SES account still in sandbox (production-access request filed separately).
- ECR: askflorence-app repo with immutable tags (prod-strict: each tag can only be written once, no :latest drift), scan-on-push, 50-image lifecycle retention, CMK-encrypted.
- ECS: cluster askflorence-prod, task definition family askflorence-prod-app-task (0.5 vCPU / 1 GB per task), service askflorence-prod-app desired 2 (HA across AZs), SES-send inline policy on task role, 90-day CloudWatch Logs retention.
- ALB: askflorence-prod-alb-1177205004.us-east-1.elb.amazonaws.com. HTTPS listener with the prod cert, HTTP→HTTPS redirect. Deletion protection ON (prevents an accidental terraform destroy from nuking the hostname CloudFront points at).
- CloudFront distribution E9RU8LOGSYL9I (d1pnfyzua893hx.cloudfront.net). Serves 3 aliases: askflorence.health, www.askflorence.health, prod-canary.askflorence.health. PriceClass_All, HTTP/2+HTTP/3, TLSv1.2_2021, the same WAFv2 rule set used on staging (5 managed groups + rate rule 2000 req/5min/IP). Same response-headers policy: HSTS + CSP + X-Frame-Options DENY + Server override to AskFlorence.
- Atlas prod peering pcx-0cefe999865679045: Atlas-initiated, accepted on AWS, AllowDnsResolutionFromRemoteVpc=true on accepter side, routes added in both prod private route tables (192.168.248.0/21 → pcx). Atlas IP access list adds 10.20.0.0/16. Legacy 0.0.0.0/0 entry tagged dev stays in place until Phase 10 cutover (Vercel still needs reachability).
- deploy-prod.yml GitHub Actions workflow: manual workflow_dispatch trigger (GitHub Team plan doesn't support required-reviewers on private-repo environments; workflow_dispatch is the approval surrogate). OIDC federation to arn:aws:iam::039624954211:role/GitHubActionsDeployRole. Smokes origin.askflorence.health/api/health (direct ALB, not CloudFront) because WAF managed rule groups false-positive-block GitHub Actions runner IPs on the prod-canary.* path.
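The "disjoint CIDR" requirement for peering can be sanity-checked with a few lines of Python. This is a minimal sketch, not part of the session tooling: the three ranges are the ones quoted in this log, while the dictionary labels and the function name are made up for illustration. The future log-archive CIDR is not stated here, so it is left out.

```python
import ipaddress
from itertools import combinations

# CIDRs quoted in this log. The labels are illustrative only.
CIDRS = {
    "prod-vpc": "10.20.0.0/16",
    "staging-vpc": "10.40.0.0/16",
    "atlas-peer": "192.168.248.0/21",
}


def overlapping_pairs(cidrs):
    """Return sorted (name_a, name_b) pairs whose networks overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    return sorted(
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    )


# An empty list means every range can be peered with every other one.
print(overlapping_pairs(CIDRS))
```

Adding a candidate range to the dictionary before allocating it is a cheap way to catch a clash before Terraform or the peering handshake does.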
Cloudflare (zone askflorence.health)
DNS records added manually by Taha during the session. All DNS-only (proxy OFF):
- 2 × CNAME for ACM validation (_<hex>.askflorence.health, _<hex>.www.askflorence.health)
- 3 × CNAME for SES DKIM (<token>._domainkey.updates)
- 1 × MX + 1 × TXT for SES MAIL FROM (mail.updates)
- 1 × TXT for DMARC (_dmarc.updates)
- 1 × CNAME origin.askflorence.health → ALB DNS (CloudFront origin handshake target)
- 1 × CNAME prod-canary.askflorence.health → CloudFront d1pnfyzua893hx.cloudfront.net (private canary URL for validation)
Apex + www CNAMEs stay pointed at Vercel through this phase. Phase 10 is the swap.
MongoDB Atlas (prod project 69dc20c64005b222804dafa4)
- Peering connection from prod project to the new prod AWS VPC. Added project-level route to 10.20.0.0/16 in prod VPC.
- app-write user password rotated via atlas dbusers update. Because Vercel's MONGODB_WRITE_URI is empty (pre-existing Vercel bug), nothing consumer-facing was using the old password, so rotation is a pure no-op from Vercel's perspective.
- IP access list: added 10.20.0.0/16. Kept the legacy 0.0.0.0/0 entry (tag dev) because Vercel still serves real traffic and uses unpredictable egress IPs.
- Prod cluster data: untouched.
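Why the 0.0.0.0/0 entry has to stay until Phase 10 can be shown with a toy model of the access list. This is a sketch under assumptions: the two entries are the ones named in this log, but the function name and the sample client addresses are hypothetical (203.0.113.7 is a documentation-range address standing in for an unpredictable Vercel egress IP, and 10.20.1.15 stands in for an ECS task inside the prod VPC).

```python
import ipaddress


def is_allowed(client_ip, access_list):
    """True if client_ip falls inside any CIDR entry of the access list."""
    ip = ipaddress.ip_address(client_ip)
    return any(ip in ipaddress.ip_network(entry) for entry in access_list)


# Access list as it stands after this session, and as planned after cutover.
CURRENT = ["10.20.0.0/16", "0.0.0.0/0"]
AFTER_CUTOVER = ["10.20.0.0/16"]

for label, entries in (("current", CURRENT), ("after cutover", AFTER_CUTOVER)):
    print(
        label,
        "| vpc task:", is_allowed("10.20.1.15", entries),
        "| public egress:", is_allowed("203.0.113.7", entries),
    )
```

With only the VPC range left, the peered ECS tasks still connect while any public egress address is refused, which is the intended end state once Vercel is retired.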
Vercel
- Not touched. Apex DNS unchanged, env vars unchanged. Vercel continues serving real production traffic through this phase.
GitHub
- Repo variables NEXT_PUBLIC_POSTHOG_PROJECT_TOKEN + NEXT_PUBLIC_POSTHOG_HOST already set during Phase 5.5; the prod workflow reuses them.
- No production environment protection rule created: GitHub Team plan limits required-reviewers to public repos. workflow_dispatch is the approval gate instead.
What shipped (chronological)
- Terraform scaffolding applied in prod. Phase 3 had already planted versions.tf, providers.tf, github-oidc.tf for the prod env; they came up clean on terraform plan. Added network.tf, kms.tf.
- terraform apply 1, 28 resources: VPC + subnets + NAT HA + 6 VPC endpoints + KMS CMK + flow logs + IGW + RTs.
- terraform apply 2, 31 resources: 15 Secrets Manager shells + ACM cert (request only; validation pending DNS). Paused for Taha to add ACM validation CNAMEs at Cloudflare.
- SES identity applied alongside ACM. Paused again for Taha to add the 6 SES records (3 DKIM CNAMEs + MX + 2 TXT) at Cloudflare.
- Polled ACM + SES status until all four signals were green (ACM ISSUED, DKIM SUCCESS, MAIL FROM SUCCESS, VerifiedForSending true). ~5 min total propagation after DNS.
- terraform apply 3, 24 resources: ECR + ECS cluster/task-def/service (desired=0) + ALB + CloudFront distribution (slow first-create) + WAFv2 web ACL + response-headers policy + CloudWatch log groups.
- Secrets populated with values pulled from vercel env pull. Discovered MONGODB_WRITE_URI="" on Vercel; rotated app-write password via Atlas CLI, populated prod/mongodb/app-write, applied the same URI to the 4 narrow-scoped write secrets as a stopgap until the #56 prod session.
- Atlas prod peering handshake via atlas networking peering create aws + aws ec2 accept-vpc-peering-connection. Routes added in both private RTs, allowlist entry added. Terraform-imported the accepter + routes into peering.tf.
- deploy-prod.yml workflow written and pushed to main. First invocation ran, image built, ECS deployed, tasks came up, smoke step blocked by WAF on the runner IP (false positive from AnonymousIpList / AmazonIpReputationList). Fix: switched smoke target from prod-canary.* (CloudFront + WAF) to origin.* (direct ALB). Second invocation failed on ECR immutable-tag rejection of :latest. Fix: stopped pushing :latest on prod + switched buildx cache from inline-in-ECR to GHA cache backend. Third invocation fully green.
- Full canary validation: all endpoints serve correctly on prod-canary.askflorence.health with parity against Vercel responses; Mongo write over the peered network path succeeds with a real waitlist_submission_id; WAF blocks SQLi probes; CloudFront security headers + server override all correct.
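The "poll until all four signals are green" step could be scripted roughly as follows. This is a hedged sketch, not the commands actually run in the session: the target values (ISSUED, SUCCESS, SUCCESS, true) come from this log, but the key names, the function name, and the idea of an injectable fetch_signals callable are assumptions. In practice fetch_signals would wrap whatever ACM and SES status calls the operator prefers.

```python
import time

# Target values are from this log. The key names are hypothetical.
EXPECTED = {
    "acm_status": "ISSUED",
    "dkim_status": "SUCCESS",
    "mail_from_status": "SUCCESS",
    "verified_for_sending": True,
}


def poll_until_green(fetch_signals, expected=EXPECTED, interval_s=15.0,
                     timeout_s=900.0, sleep=time.sleep, clock=time.monotonic):
    """Call fetch_signals() until every expected key matches, or time out."""
    deadline = clock() + timeout_s
    while True:
        signals = fetch_signals()
        pending = sorted(k for k, v in expected.items() if signals.get(k) != v)
        if not pending:
            return signals
        if clock() >= deadline:
            raise TimeoutError(f"still pending: {pending}")
        sleep(interval_s)


# Dry run with a fake fetcher: first poll is still pending, second is green.
fake_states = iter([{"acm_status": "PENDING_VALIDATION"}, dict(EXPECTED)])
result = poll_until_green(lambda: next(fake_states), sleep=lambda s: None)
print(sorted(result))
```

Injecting sleep and clock keeps the loop testable without waiting on real DNS propagation.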
Two gotchas worth preserving
(1) workflow_dispatch as approval gate on private Team-plan repos. The original plan had push: branches: [main] + a production GitHub environment with required-reviewers. Confirmed GitHub Team does not support required-reviewers on environments attached to private repos — that's an Enterprise ($21/user) feature. workflow_dispatch gives us the same "nothing deploys without Taha clicking a button" guarantee without a plan upgrade or making the repo public. Any Vercel-era "release on merge" habit doesn't apply.
(2) Immutable ECR tags + inline buildx cache are incompatible. --cache-to type=inline embeds the cache metadata in the image itself, so reusing the cache means re-pushing a shared tag (:latest) on every build. That is fine with staging's immutable_tags=false. On prod with immutable_tags=true, every cached rebuild attempts a tag rewrite and gets rejected. The resolution: ditch :latest entirely on prod (task defs pin :<sha>, so no one consumes :latest) and use GitHub Actions' layer cache (type=gha) instead. Side benefit: the GHA cache is scoped to the repository (not the account), so cached layers are not shared with other repositories.
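Gotcha (2) can be reproduced with a toy model of a tag-immutable registry. This is only an illustration of the failure mode, not ECR's actual API: the class, the exception, the helper, and the tag and digest values are all hypothetical.

```python
class TagImmutableError(Exception):
    """Raised when a push tries to move an existing tag to a new digest."""


class ImmutableRegistry:
    """Toy registry: once a tag points at a digest, it can never be moved."""

    def __init__(self):
        self._tags = {}

    def push(self, tag, digest):
        existing = self._tags.get(tag)
        if existing is not None and existing != digest:
            raise TagImmutableError(f"tag {tag!r} already points at {existing}")
        self._tags[tag] = digest


def push_build(registry, sha, digest, also_latest):
    """Push one build under :<sha> (and optionally :latest); return rejected tags."""
    rejected = []
    tags = [sha] + (["latest"] if also_latest else [])
    for tag in tags:
        try:
            registry.push(tag, digest)
        except TagImmutableError:
            rejected.append(tag)
    return rejected


registry = ImmutableRegistry()
print(push_build(registry, "1111111", "sha256:one", also_latest=True))  # first build
print(push_build(registry, "2222222", "sha256:two", also_latest=True))  # second build
```

The first build succeeds on both tags; the second is rejected on latest only, which matches the second deploy-prod.yml invocation. Passing also_latest=False makes every build succeed, since each commit SHA tag is written exactly once.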
Verification
From operator laptop, direct public internet, against https://prod-canary.askflorence.health (via Cloudflare CNAME → CloudFront edge → origin.askflorence.health CNAME → ALB → ECS):
- GET /api/health → 200 {"status":"ok","commit":"a189041…","env":"prod"}
- GET /api/counties?state=TX&zip=75001 → 200 identical JSON to Vercel prod (CMS proxy)
- GET /api/counties?state=NY&zip=10001 → 200 identical JSON to Vercel prod (owned-data path)
- POST /api/waitlist → 200 with waitlist_submission_id (Mongo write via peering; NAT never touched)
- Response headers from CloudFront: server: AskFlorence, strict-transport-security: max-age=31536000; includeSubDomains; preload, x-frame-options: DENY, content-security-policy: …
- GET /?id=1' OR '1'='1 → 403 blocked by WAF SQLiRuleSet
- ECS state: desired 2, running 2, rollout COMPLETED, task def revision :4 (after the narrow-user secret populate + force-new-deployment)
- CloudWatch aws-waf-logs-askflorence-prod-web-acl receiving WAF logs
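The response-header check above could be turned into a repeatable assertion for the Phase 9 bake. A minimal sketch, assuming the headers have already been fetched into a dict by whatever HTTP client is in use: the expected values are the ones recorded in this log, the CSP value is elided here so only its presence is checked, and the function and constant names are hypothetical.

```python
# Exact values recorded in this log.
EXPECTED_HEADERS = {
    "server": "AskFlorence",
    "strict-transport-security": "max-age=31536000; includeSubDomains; preload",
    "x-frame-options": "DENY",
}
# The CSP value is elided in the log, so only presence is checked.
REQUIRED_PRESENT = ("content-security-policy",)


def header_problems(headers):
    """Return a list of human-readable mismatches; empty means all good."""
    lowered = {k.lower(): v.strip() for k, v in headers.items()}
    problems = []
    for name, want in EXPECTED_HEADERS.items():
        got = lowered.get(name)
        if got != want:
            problems.append(f"{name}: expected {want!r}, got {got!r}")
    for name in REQUIRED_PRESENT:
        if not lowered.get(name):
            problems.append(f"{name}: missing")
    return problems


# Example with a made-up response that has the wrong frame policy and no CSP.
sample = {
    "Server": "AskFlorence",
    "Strict-Transport-Security": "max-age=31536000; includeSubDomains; preload",
    "X-Frame-Options": "SAMEORIGIN",
}
for problem in header_problems(sample):
    print(problem)
```

Header names are compared case-insensitively because HTTP clients differ in how they normalise them.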
The SES send path exercised by the /api/waitlist flow returned the expected sandbox rejection: the test recipient is verified in the staging SES account, not prod. Non-blocker: the app returns 200 because sendEmail is fire-and-forget; the code path is exercised and will work after SES production-access approval plus prod-side sandbox recipient verification.
What this session does NOT do
- Does not move production DNS. Cloudflare apex still points at Vercel.
- Does not retire Vercel. Vercel continues to serve real users exactly as before.
- Does not populate prod secrets that Florence + Bedrock + Whisper will eventually use — those shells stay as PLACEHOLDER until the relevant workloads ship.
- Does not provision narrow-scoped prod Atlas users. Stopgap points the app_writer_* secrets at the broad app-write URI. The proper scoped users land in a follow-up #56 prod session.
- Does not remove the legacy 0.0.0.0/0 entry from the prod Atlas allowlist. Removing that would break Vercel right now. Phase 10 cutover is where it comes out.
- Does not request SES production access from the prod account. Separate manual request; not blocking because nothing real sends email from the prod AWS stack yet.
Next
- Phase 9: canary bake. Real-ish synthetic traffic against prod-canary.askflorence.health for 48 h. Full audit tier 1-5 parity run. GuardDuty + Security Hub clean. Nothing in Phase 9 touches apex DNS.
- Phase 10: Cloudflare apex CNAME flip of askflorence.health + www from Vercel edges to d1pnfyzua893hx.cloudfront.net. After 48 h of clean cutover: pull 0.0.0.0/0 from prod Atlas allowlist, retire Vercel prod.
- Phase 11: post-cutover hardening. Resend retirement, PostHog self-host/replace decision, Drata read-only role activation, annual pen test vendor selection.
- Phase 12: SOC 2 + HIPAA + EDE control mapping docs closed out against the operating state established from Phase 2 onward.