ADR 0004 — Cross-cluster Atlas reads from prod via AWS PrivateLink
Status
Accepted — 2026-05-08.
Context
The doctor + Rx coverage flow on prod (askflorence.health) needs to read 2.14M NPI provider docs (providers_staging) and 12,557 RxCUI / ~30M drug-plan tuples (formularies_staging). All public CMS marketplace data — non-PHI by classification, sourced from the §1311 MRF ingest pipeline.
This data canonically lives on the staging Atlas cluster (askflorence-staging, project_id 69e31af12fd2c0aef51bbb41, M30 tier ~$382/mo) — that's where the §1311 ingest writes it, where the staging app reads it, and where the non-prod surface for the YC demo URL is exercised.
To make doctor + Rx work on prod we considered three paths:
Path A — Duplicate the data onto prod cluster. Forces prod cluster from M10 HIPAA ($56/mo) to M30 ($382/mo) to handle the collection size + index footprint. Total cost rises from $438/mo to $764/mo (+$326/mo recurring). Cleanest audit boundary but expensive, and it adds public reference data to the prod cluster's PHI audit surface, arguably making future EDE Phase 3 audit harder rather than easier.
Path B — VPC peering between prod VPC and staging Atlas project. Free, AWS-backbone-only. Blocked by an unsolvable CIDR conflict: both Atlas projects use the default 192.168.248.0/21 for their network containers, Atlas does not allow changing CIDRs on existing projects, and re-creating either project means data migration + multi-day operational disruption.
Path B1 — AWS PrivateLink. AWS Interface VPC Endpoint in the prod VPC targets a PrivateLink endpoint service that Atlas exposes for the staging project. AWS-backbone-only at the network layer, TLS at the application layer (doubly protected). Identity-bound at the AWS account level. Doesn't use route-table CIDRs: the endpoint is an ENI in our subnets, traffic flows through it directly, no peering / transit-gateway / route-table involvement. This is the documented pattern Atlas + AWS designed for cross-Atlas-project access where peering doesn't fit. Cost: ~$7-10/mo for the Interface endpoint + negligible AWS data egress.
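The CIDR conflict that blocks Path B can be checked mechanically. The following is a minimal sketch of an IPv4 overlap test, assuming both network containers use the default range quoted above. The helper names (cidrRange, cidrsOverlap) are illustrative and do not come from the repo.

```typescript
// Sketch only: IPv4 CIDR overlap test. Helper names are hypothetical.

interface Range {
  start: number; // first address in the block, as an unsigned 32-bit integer
  end: number; // last address in the block
}

function cidrRange(cidr: string): Range {
  const match = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\/(\d{1,2})$/.exec(cidr);
  if (match === null) {
    throw new Error(`invalid IPv4 CIDR: ${cidr}`);
  }
  const octets = match.slice(1, 5).map(Number);
  const prefix = Number(match[5]);
  if (octets.some((o) => o > 255) || prefix > 32) {
    throw new Error(`invalid IPv4 CIDR: ${cidr}`);
  }
  // Plain arithmetic instead of bit shifts, so values stay unsigned.
  const address = octets.reduce((acc, o) => acc * 256 + o, 0);
  const size = 2 ** (32 - prefix);
  const start = Math.floor(address / size) * size; // drop the host bits
  return { start, end: start + size - 1 };
}

function cidrsOverlap(a: string, b: string): boolean {
  const ra = cidrRange(a);
  const rb = cidrRange(b);
  return ra.start <= rb.end && rb.start <= ra.end;
}

// Both Atlas projects use the same default container range (from this ADR).
const ATLAS_DEFAULT_CIDR = "192.168.248.0/21";
const peeringBlocked = cidrsOverlap(ATLAS_DEFAULT_CIDR, ATLAS_DEFAULT_CIDR);
```

Two identical ranges always overlap, which is why peering cannot be routed here. PrivateLink sidesteps the problem because the endpoint is an ENI inside the prod subnets and no route-table CIDR is involved.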
The three paths were filed and analyzed in docs/decisions/2026-05-03-pivot-cms-api-direct.md "Cross-cluster reference reads via AWS PrivateLink" and the decision-matrix walkthrough on #101.
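The cost figures quoted for the paths can be reproduced with simple arithmetic. This sketch uses only the monthly prices stated in this ADR. The constant and variable names are illustrative, and the "negligible" data egress charge is left out.

```typescript
// Sketch only: reproduces the monthly cost figures quoted in this ADR (USD).
const M10_HIPAA_USD = 56; // prod cluster today
const M30_USD = 382; // staging cluster today, and prod's tier under Path A
const PRIVATELINK_LOW_USD = 7; // lower bound of the "~$7-10/mo" endpoint estimate
const PRIVATELINK_HIGH_USD = 10; // upper bound of the same estimate

const currentTotal = M10_HIPAA_USD + M30_USD; // prod M10 + staging M30
const pathATotal = M30_USD + M30_USD; // prod upgraded to M30
const pathADelta = pathATotal - currentTotal; // recurring increase under Path A

// Path B1 keeps both clusters on their current tiers and adds the endpoint.
const pathB1Low = currentTotal + PRIVATELINK_LOW_USD;
const pathB1High = currentTotal + PRIVATELINK_HIGH_USD;

// Difference between Path A and the most expensive Path B1 estimate.
const savingVsPathA = pathATotal - pathB1High;
```

Under these numbers Path A adds $326/mo, while Path B1 adds at most about $10/mo on top of the current $438/mo.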
Decision
Prod VPC reads non-PHI public CMS reference data from the staging Atlas cluster over AWS PrivateLink. Concretely:
- AWS Interface VPC Endpoint vpce-0c81aea11e29bb928 in prod VPC vpc-09201679b87261b6d, multi-AZ across the prod private subnets.
- Targets Atlas-issued endpoint service com.amazonaws.vpce.us-east-1.vpce-svc-0d8138ea0f6542afa (Atlas endpointId 69fe75c5b02c024f32d2af50).
- Connection authenticated as a read-only app_read_staging user on the askflorence database.
- Connection string lives in AWS Secrets Manager (prod/mongodb/reference-uri) with project CMK encryption.
- Application layer uses a distinct connection pool (getReferenceDb() in src/lib/db.ts) routed via the MONGODB_REFERENCE_URI env var. Falls back to MONGODB_URI when unset, so dev + staging keep working without code changes.
The prod cluster (askflorence-prod-01, M10 HIPAA) remains the only PHI processor. The staging cluster remains the only home for formularies_staging + providers_staging (and the §1311 ingest pipeline that writes them).
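The env-var fallback behind getReferenceDb() might look roughly like the sketch below. It is not the code in src/lib/db.ts. The function name resolveReferenceUri, the returned shape, and the choice to treat a blank value as unset are all assumptions made for illustration. Only the two env var names come from this ADR.

```typescript
// Sketch only: URI selection for the reference connection.
// resolveReferenceUri and ReferenceTarget are hypothetical names.

type Env = Record<string, string | undefined>;

interface ReferenceTarget {
  uri: string;
  crossCluster: boolean; // true when the dedicated reference URI is in use
}

function resolveReferenceUri(env: Env): ReferenceTarget {
  // Assumption: a blank or whitespace-only value counts as "unset".
  const referenceUri = (env["MONGODB_REFERENCE_URI"] ?? "").trim();
  if (referenceUri !== "") {
    return { uri: referenceUri, crossCluster: true };
  }
  // Fallback keeps dev + staging working without a second variable.
  const primaryUri = (env["MONGODB_URI"] ?? "").trim();
  if (primaryUri !== "") {
    return { uri: primaryUri, crossCluster: false };
  }
  throw new Error("Neither MONGODB_REFERENCE_URI nor MONGODB_URI is set");
}
```

In an application this would typically be called with process.env, and the result handed to whatever code creates the separate connection pool. Failing loudly when both variables are missing avoids silently connecting to nothing.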
Consequences
Positive:
- Saves ~$326/mo recurring vs duplicating data onto a prod M30 cluster.
- EDE Phase 3 audit boundary stays clean — PHI lives only on prod cluster; non-PHI public reference data lives only on staging; the cross-cluster path is read-only, AWS-backbone-only, and easy to point at in an audit ("identity-bound, no public IP, doubly-protected encryption").
- Avoids the CIDR conflict that blocks Path B without requiring a project re-creation.
- §1311 delta-aware MRF refresh (#98) gets a clean architectural target: refresh runs in the staging AWS account, prod picks up the refresh automatically via PrivateLink with no prod-side cron, no double-ingest, no cluster cutover.
- PrivateLink is the documented pattern Atlas + AWS designed for this case — it survives the EDE Phase 3 cutover to FedRAMP-authorized Atlas Government with the same architecture.
Accepted costs:
- Two Atlas projects must stay configured + monitored together for the data layer to function. Cross-cluster posture is now load-bearing for the doctor + Rx feature.
- Staging cluster's IP allowlist remains permissive during pre-launch (Taha's laptop + CI runners need IP-based access for ingest). Hardening to "PrivateLink-only" is deferred post-launch when ingest jobs move to ECS Fargate in the staging VPC. Tracked in #71.
- A drift risk exists if a future writer ever puts PHI on the staging cluster — the PrivateLink "non-PHI cross-cluster read" architectural claim would silently break. Mitigated by the CI guard at #100, which ships in two complementary phases:
  - Phase 1 shipped 2026-05-08 — static check at scripts/audit/staging-collections-guard.ts runs on every PR via .github/workflows/staging-collections-guard.yml. Fails the build if any getReferenceDb() call is made against a collection not on STAGING_ALLOWED_COLLECTIONS in src/lib/db.ts. Catches string-literal, dynamic-name, and inline-call patterns. Verified on synthetic violations 2026-05-08.
  - Phase 2 shipped 2026-05-09 — live nightly check at scripts/audit/staging-cluster-drift.ts runs at 08:00 UTC daily via .github/workflows/staging-cluster-drift.yml (cron + manual dispatch). Audits the actual Atlas state of app_read_staging: verifies the user has exactly one role (role_reader_reference@admin) granting only FIND action on exactly askflorence.formularies_staging + askflorence.providers_staging and nothing else (no extra roles, no inheritedRoles, no wider actions, no extra collections, no DB-wide grants). Opens a P1 GitHub issue on drift. Catches the runtime cases the static guard cannot: privilege escalation via Atlas Admin UI, out-of-band role changes, etc. Verified 2026-05-09 against three synthetic violations (extra collection grant / wider action / extra role on user), all caught with correct violation reports.
  - As part of Phase 2, app_read_staging's role was tightened from built-in read@askflorence (whole-DB scope) to custom role_reader_reference@admin (per-collection FIND on the 2 collections actually consumed: formularies_staging + providers_staging). This is the "future audit requires per-collection scoping" follow-up referenced in docs/runbooks/atlas-user-provisioning.md. Verified prod cross-cluster reads (drug + provider tier fallback) remain healthy after the tightening: drug_tier=PreferredBrand and network_tier=Preferred smoke responses byte-identical to baseline.
- Atlas BAA must enumerate both projects in writing — chase tracked in #57.
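The Phase 1 static guard described above can be approximated with a line-based scan. This is a simplified sketch, not the contents of scripts/audit/staging-collections-guard.ts. It assumes calls have the shape getReferenceDb() followed on the same line by .collection(...), and it takes the allowlist as a parameter. The function name findViolations and the sample collection name "users" are hypothetical.

```typescript
// Sketch only: regex-based allowlist guard. Names are hypothetical.

interface Violation {
  line: number; // 1-based line number in the scanned source
  reason: string;
}

function findViolations(source: string, allowed: ReadonlySet<string>): Violation[] {
  const violations: Violation[] = [];
  source.split("\n").forEach((text, index) => {
    // Assumed call shape: getReferenceDb() ... .collection(<argument>) on one line.
    const call = /getReferenceDb\(\)[^\n]*?\.collection\(\s*([^)]*?)\s*\)/g;
    let match: RegExpExecArray | null;
    while ((match = call.exec(text)) !== null) {
      const argument = match[1] ?? "";
      // A plain quoted name can be checked; anything else is treated as dynamic.
      const literal = /^(["'`])([A-Za-z0-9_.-]+)\1$/.exec(argument);
      if (literal === null) {
        violations.push({ line: index + 1, reason: `dynamic collection name: ${argument}` });
      } else if (!allowed.has(literal[2] ?? "")) {
        violations.push({ line: index + 1, reason: `collection not allowed: ${literal[2]}` });
      }
    }
  });
  return violations;
}

// Example input covering an allowed call, an inline call, and a dynamic name.
const sampleSource = [
  'const a = getReferenceDb().collection("formularies_staging");',
  'const b = (await getReferenceDb()).collection("users");',
  "const c = getReferenceDb().collection(name);",
].join("\n");

const sampleViolations = findViolations(
  sampleSource,
  new Set(["formularies_staging", "providers_staging"]),
);
```

A CI wrapper would exit non-zero when the returned list is non-empty. A scan like this misses calls where the database handle is stored in a variable and used on a later line, so a production guard would more plausibly walk the TypeScript AST.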
Alternatives considered
- Path A — Duplicate data onto prod cluster. Rejected. Recurring $326/mo cost is prohibitive for a pre-revenue startup; co-residing public reference data with PHI on the prod audit surface complicates EDE Phase 3 narrative.
- Path B — VPC peering prod VPC ↔ staging Atlas. Rejected. Blocked by CIDR conflict (192.168.248.0/21 on both Atlas projects); resolution would require destroy + recreate of one project, multi-day operational disruption.
- Status quo — keep doctor + Rx feature staging-only. Rejected. Doctor + Rx is a launch-tier feature for askflorence.health; deferring it past launch was not an option per product priority.
Revisit triggers
Switch architecture if any of these fire:
- Staging cluster cost > $500/mo sustained for >2 months → evaluate M20 with delta refresh (#98).
- Cross-cluster read p99 latency > 250ms → evaluate co-locating data on prod cluster (Path A revisited).
- Auditor flags cross-cluster path under EDE Phase 3 review → migrate both clusters to FedRAMP-authorized Atlas Government (architecture transfers; PrivateLink stays).
- Any PHI ever needs to land on the staging cluster → immediate cutover. CI guard #100 is the early-warning system for this.
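The two quantitative triggers above (cost and p99 latency) could be evaluated along these lines. This is a sketch under stated assumptions: "sustained for >2 months" is read as more than two consecutive monthly bills ending with the latest one, p99 is computed with the nearest-rank method, and all function and field names are illustrative. The auditor and PHI triggers are not numeric and are left out.

```typescript
// Sketch only: evaluates the cost and latency revisit triggers. Names are hypothetical.

interface Observations {
  monthlyCostsUsd: readonly number[]; // staging cluster bills, oldest first
  latenciesMs: readonly number[]; // cross-cluster read latencies
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: readonly number[], p: number): number {
  if (samples.length === 0) {
    throw new Error("no samples");
  }
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1] as number;
}

// Counts how many of the most recent consecutive months exceeded the limit.
function trailingMonthsOver(costs: readonly number[], limit: number): number {
  let count = 0;
  for (let i = costs.length - 1; i >= 0; i--) {
    if ((costs[i] as number) <= limit) break;
    count++;
  }
  return count;
}

function firedTriggers(obs: Observations): string[] {
  const fired: string[] = [];
  if (trailingMonthsOver(obs.monthlyCostsUsd, 500) > 2) {
    fired.push("cost");
  }
  if (obs.latenciesMs.length > 0 && percentile(obs.latenciesMs, 99) > 250) {
    fired.push("latency");
  }
  return fired;
}
```

Nearest-rank is used because it always returns an observed sample, which keeps the threshold comparison easy to explain in a review. Interpolating percentile definitions would give slightly different values near the boundary.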
Amendment 2026-05-11 (ENG-257 closeout)
role_reader_reference's canonical scope is four collections, not two: formularies_staging, providers_staging, plans, mrpuf_issuers_staging. All four are part of the §1311 / MRF reference dataset on the staging cluster and share the same non-PHI data classification.
When this ADR shipped on 2026-05-08, the role was tightened to two collections (the runtime-fallback consumers only). On 2026-05-09 the §1311 re-validation audit (ENG-230) needed read access to plans + mrpuf_issuers_staging and the role was widened to four. ENG-257 was filed as the planned narrow-back once the audit cycle closed.
Re-examining on 2026-05-11: the wider scope is the correct permanent posture, not a temporary tradeoff. The role's responsibility is "cross-cluster reads of staging-cluster §1311 reference data for both runtime tier-fallback AND periodic audit re-validation." Both purposes:
- Operate against the same dataset family (§1311 / MRF).
- Share the same data classification (non-PHI public CMS marketplace data).
- Use the same AWS PrivateLink path (no incremental network surface).
- Recur on a known cadence (audit re-validation runs each refresh cycle — ENG-231 makes that ongoing).
Narrowing back to two collections and re-widening on each future audit cycle would be operational churn that delivers no posture benefit — the data classification is identical and the network path is unchanged.
Resolution: re-baseline the matrix at four collections as canonical. No Atlas change (the live role has been at four since 2026-05-09). No code-consumer change (the runtime consumers still touch only formularies_staging + providers_staging; plans + mrpuf_issuers_staging are reachable via the same role for audit harnesses only). The drift CI guard (Phase 2 above) stays green because the matrix matches Atlas state. ENG-257 closed as not planned with this rationale. The ADR's "non-PHI cross-cluster read" architectural claim is unchanged — only the enumeration of reachable collections widens to reflect the role's full purpose.
If a future cycle wants to ratchet back to a 2-collection runtime + dedicated audit_reader user pattern, the original recipe is preserved in GH #122's description.
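The re-baselined four-collection matrix can be expressed as a drift comparison. The sketch below is not scripts/audit/staging-cluster-drift.ts. The ObservedUser shape is a simplified, hypothetical flattening of whatever the Atlas API returns, and findDrift is an illustrative name. The role name, database name, and collection names come from this ADR.

```typescript
// Sketch only: compares an observed user state against the expected matrix.
// ObservedUser and Grant are hypothetical shapes, not the Atlas API schema.

const EXPECTED_ROLE = "role_reader_reference@admin";
const EXPECTED_DB = "askflorence";
const EXPECTED_COLLECTIONS = [
  "formularies_staging",
  "providers_staging",
  "plans",
  "mrpuf_issuers_staging",
];

interface Grant {
  db: string;
  collection: string; // empty string stands for a DB-wide grant in this sketch
  action: string;
}

interface ObservedUser {
  roles: string[];
  inheritedRoles: string[];
  grants: Grant[];
}

function findDrift(user: ObservedUser): string[] {
  const drift: string[] = [];
  if (user.roles.length !== 1 || user.roles[0] !== EXPECTED_ROLE) {
    drift.push(`unexpected roles: ${user.roles.join(", ")}`);
  }
  if (user.inheritedRoles.length > 0) {
    drift.push(`inherited roles present: ${user.inheritedRoles.join(", ")}`);
  }
  const expected = new Set(EXPECTED_COLLECTIONS);
  const covered = new Set<string>();
  for (const grant of user.grants) {
    const target = `${grant.db}.${grant.collection === "" ? "*" : grant.collection}`;
    if (grant.db !== EXPECTED_DB || !expected.has(grant.collection)) {
      drift.push(`grant outside matrix: ${target}`);
    } else if (grant.action !== "FIND") {
      drift.push(`wider action ${grant.action} on ${target}`);
    } else {
      covered.add(grant.collection);
    }
  }
  for (const name of EXPECTED_COLLECTIONS) {
    if (!covered.has(name)) {
      drift.push(`missing FIND grant: ${EXPECTED_DB}.${name}`);
    }
  }
  return drift;
}
```

An empty result corresponds to the guard staying green. Keeping the matrix as a single constant means a future ratchet back to two collections would be a one-line change plus the matching Atlas role update.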
References
- Issue #101 — Phase 11 shipped: 2.14M provider + 12.5K medication data live on prod via cross-cluster Atlas PrivateLink
- Decision doc: docs/decisions/2026-05-03-pivot-cms-api-direct.md "Cross-cluster reference reads via AWS PrivateLink"
- Terraform: infra/envs/prod/atlas-staging-privatelink.tf
- Code: src/lib/db.ts getReferenceDb(), src/lib/drug-tier-fallback.ts
- ADR 0001 — Atlas project isolation — the project-boundary baseline this decision builds on
- ADR 0003 — Narrow-scoped MongoDB users — the role-shape pattern app_read_staging follows
- SOC 2 controls mapping: docs/security-compliance/soc2-control-mapping.md CC6.6 (additional row) + CC6.7
- Vendor register: docs/security-compliance/vendor-register.md MongoDB Atlas row (both project IDs enumerated)
- Cross-references: #57 (BAA enumeration), #71 (staging IP allowlist hardening), #96 (Phase D provider-network fallback — same pattern), #98 (delta-aware MRF refresh), #100 (CI guard)
- AWS PrivateLink for MongoDB Atlas — docs (vendor reference)
- HIPAA §164.312(e)(1) — Transmission Security
- SOC 2 TSC 2017 — CC6.6, CC6.7
- CMS EDE Audit Program Appendix A § 3 (Environment Separation), § 4 (Encryption in Transit)