2026-05-03: Pivot doctor + Rx coverage from §1311 ingest to CMS Marketplace API direct
Decision record. Captures what we built, what we learned, what we paused, and the conditions under which we'd resume the §1311-owned-data approach.
Decision
For the MVP doctor + Rx coverage flow (calculator → takeover → /plans → plan detail → search-doctor / search-Rx), call the CMS Marketplace API at query time. Do NOT use our staging Mongo §1311 mirror.
Pivot date: 2026-05-03. Decided after reaching 99.94% audit match on §1311 staging data via reconciliation. Rationale below.
Why
Two compounding problems with the §1311-owned-data approach:
- Update cadence is unsolved. CMS publishes coverage refreshes daily/sub-daily (per `/versions` endpoint timestamps). Our model assumed annual ingest. Closing that gap is an ongoing pipeline + monitoring + cache-invalidation problem we haven't built. See #86.
- Issuer §1311 MRF and CMS coverage DB disagree at scale. Two different upstreams at the issuer: §1311 publication (compliance obligation) vs HIOS administrative submission (what CMS adjudicates against). Empirical disagreement rates measured on staging: 25896 IA at 8% (2,976 distinct NPIs disagreed across 333,950 NPI×plan tuples), 42326 SC at 0.5%, 50305 SD localized to specific NPIs. Reconciling at scale takes ~85 min per issuer single-threaded, parallelizable to ~30 min, but the result is "100% match against a moving CMS dataset", which is Sisyphean.
For the calculator → plan-detail → doctor/Rx-lookup user flow, CMS Marketplace API gives us the same answer Healthcare.gov shows, with zero ingest infrastructure, zero reconciliation, zero staleness window. That's literally what consumers experience. Ship to validate.
What we shipped (the new path)
- `src/lib/cms-api.ts`: runtime CMS API client with token-bucket throttle, in-memory cache (60s TTL), retry on 5xx
- 4 thin proxy routes: `/api/providers/{autocomplete,covered}`, `/api/drugs/{autocomplete,covered}`
- `DoctorLookupPanel` + `DrugLookupPanel` reusable components
- (Initial) `/plans/[planId]/page.tsx`: later removed in favor of integrating into `/plans` browse + future detail page
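A minimal sketch of the throttle + cache + retry pattern listed above. This is not the contents of `src/lib/cms-api.ts`: the names `TokenBucket`, `TtlCache`, and `isRetryable` are hypothetical, and the clock is injectable only so the behavior can be checked without waiting.

```typescript
// Illustrative only; all names here are assumptions, not the real client.
type Clock = () => number; // returns milliseconds

class TokenBucket {
  private tokens: number;
  private lastRefill: number;
  private readonly capacity: number;
  private readonly refillPerSecond: number;
  private readonly now: Clock;

  constructor(capacity: number, refillPerSecond: number, now: Clock = Date.now) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.now = now;
    this.tokens = capacity;
    this.lastRefill = now();
  }

  // Refills based on elapsed time, then spends one token if available.
  tryTake(): boolean {
    const t = this.now();
    const elapsedSec = (t - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSecond);
    this.lastRefill = t;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: Clock;

  constructor(ttlMs: number, now: Clock = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns the cached value, or undefined if missing or expired.
  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (!hit) return undefined;
    if (hit.expiresAt <= this.now()) {
      this.entries.delete(key);
      return undefined;
    }
    return hit.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// Retry only on 5xx; 4xx responses (including the 400s from oversized batches) are not retried.
function isRetryable(status: number): boolean {
  return status >= 500 && status <= 599;
}
```

A caller would check the cache first, then `tryTake()` before going out to CMS, and queue or reject when the bucket is empty.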
Operational learnings (use these on the next ingest pass)
CMS Marketplace API — verified facts
| Limit | Value | How we know |
|---|---|---|
| Rate limit (per second) | 200 RPS | x-ratelimit-limit-second header |
| Rate limit (per minute) | 1,000 / min | x-ratelimit-limit-minute header |
| Per-call max (NPIs × plans) | 100 combos | tested 10×10 OK, 5×20 / 2×25 / 3×15 → 400 |
| Per-call max NPIs alone | 10 | 25 → 400 "NPIs: the length..." |
| Per-call max plans alone | 10 | inferred symmetrical |
| /providers/autocomplete query format | name only, NOT NPI | tested 5 ways with various ZIPs |
| /drugs/autocomplete query format | substring on drug-name field; multi-token queries fail | "Lipitor 20mg" → 0 results ("20mg" not in any drug-name field) |
| Brand-with-generic-substitution case | returned as coverage: "GenericCovered" + generic_rxcui | UI must surface this gracefully |
| Live updates | CMS coverage data updates daily per /versions | not in CMS docs; verified observationally — coverage and npis datasets stamped <24h ago at audit time |
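The per-call limits in the table imply a batching step before any coverage call. A rough sketch, assuming the 10-plan cap really is symmetrical with the tested 10-NPI cap (the table marks that as inferred); `planCoverageBatches` and `CoverageBatch` are hypothetical names:

```typescript
// Per-call cap: 10 NPIs was tested; 10 plans is inferred, so treat it as an assumption.
const MAX_PER_AXIS = 10;

function chunk<T>(items: readonly T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

interface CoverageBatch {
  planIds: string[];
  npis: string[];
}

// Splits a (plans x NPIs) lookup into calls of at most 10 x 10 combos each.
function planCoverageBatches(
  planIds: readonly string[],
  npis: readonly string[],
): CoverageBatch[] {
  const batches: CoverageBatch[] = [];
  for (const plans of chunk(planIds, MAX_PER_AXIS)) {
    for (const ids of chunk(npis, MAX_PER_AXIS)) {
      batches.push({ planIds: plans, npis: ids });
    }
  }
  return batches;
}
```

The same shape would apply to drugs, with RxCUIs in place of NPIs.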
Performance ceiling at scale (the load-bearing concern)
The CMS-API-direct path does not survive Open Enrollment scale. Math:
Worst-case page load: a /plans visitor with 10 saved doctors + 10 saved drugs viewing a 50-plan list.
- Best-case batched calls: 10 plans × 10 NPIs = 100 combos per call, so 5 plan-batches × (1 doctor-batch + 1 drug-batch) = 10 calls.
- Per visit: ~10 CMS calls.
- At 1,000 concurrent /plans visitors during OE peak: ~10,000 calls in a 30-second window = 333 calls/sec.
- CMS limit: 200 RPS / 1,000 per minute per API key. We'd exhaust the per-minute budget in ~3 seconds. Visitors get throttled or 429'd.
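The arithmetic above, written out so the inputs can be changed. The function names are hypothetical and the 30-second arrival window is the same assumption the worst case uses:

```typescript
// Calls needed for one /plans visit, given the 10 x 10 per-call cap.
function callsPerVisit(plans: number, doctors: number, drugs: number, maxPerAxis = 10): number {
  const planBatches = Math.ceil(plans / maxPerAxis);
  const doctorBatches = Math.ceil(doctors / maxPerAxis);
  const drugBatches = Math.ceil(drugs / maxPerAxis);
  return planBatches * (doctorBatches + drugBatches);
}

// How long a per-minute budget lasts at a sustained call rate.
function secondsToExhaust(budgetPerMinute: number, callsPerSecond: number): number {
  return budgetPerMinute / callsPerSecond;
}

const perVisit = callsPerVisit(50, 10, 10);              // 5 plan-batches x (1 + 1) = 10 calls
const peakCallsPerSecond = (1000 * perVisit) / 30;       // 10,000 calls over 30 s, about 333/sec
const budgetLastsSec = secondsToExhaust(1000, peakCallsPerSecond); // about 3 seconds
```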
The per-key, per-minute budget is the binding constraint, not the per-second one. Mitigations available:
1. Multiple CMS API keys rotated server-side. Each key has its own 1,000/min budget. At 5 keys we'd handle 5,000/min, which buys us ~5,000 concurrent /plans visitors in steady state. CMS issues keys liberally; rotation is cheap.
2. Server-side coverage cache keyed by `(plan_id, npi)` and `(plan_id, rxcui)` tuples with hours-long TTL. Most coverage is stable across days. Massively reduces CMS calls: overlapping users in same metro hitting same plans + same popular drugs would all share cache. Critical for OE.
3. CDN edge cache in front of the proxy routes. CloudFront / Vercel can cache by request body hash, so same-shaped queries hit the edge.
4. Batching across users. If 50 concurrent users all check Lipitor against 20 plans, deduplicate the unique tuples and call CMS once.
5. Tier the coverage check. Show top 3 plans coverage immediately; lazy-load the rest as user scrolls. Reduces avg per-visit calls from ~10 to ~3.
6. Fall back to ingested DB at high load. Behind a feature flag, switch to staging Mongo coverage when CMS rate-limit budget < 50% remaining. The §1311 mirror still has 99.94% accuracy: degraded but live.
For MVP (pre-OE) the CMS-API path is fine. Mitigations 1, 2, 3 should land before OE 2027. Mitigation 6 is the long-term bridge to the §1311 path if we resume it.
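A sketch of how the tuple cache key, cross-user dedup, and the load-based fallback could fit together. None of this exists yet; `tupleKey`, `dedupeQueries`, and `chooseSource` are hypothetical names, and how the remaining budget is read from CMS responses is left open.

```typescript
interface CoverageQuery {
  planId: string;
  npi: string;
}

// Cache / dedup key for a (plan_id, npi) tuple. The same shape works for (plan_id, rxcui).
function tupleKey(q: CoverageQuery): string {
  return q.planId + "|" + q.npi;
}

// Collapses queries from many concurrent users into the unique tuples, preserving first-seen order.
function dedupeQueries(queries: readonly CoverageQuery[]): CoverageQuery[] {
  const seen = new Set<string>();
  const unique: CoverageQuery[] = [];
  for (const q of queries) {
    const key = tupleKey(q);
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(q);
    }
  }
  return unique;
}

type CoverageSource = "cms-api" | "staging-mongo";

// Switches to the staging mirror when the flag is on and less than 50% of the budget remains.
function chooseSource(remaining: number, limit: number, fallbackEnabled: boolean): CoverageSource {
  const budgetLeft = limit > 0 ? remaining / limit : 0;
  return fallbackEnabled && budgetLeft < 0.5 ? "staging-mongo" : "cms-api";
}
```

With multiple keys, `remaining` and `limit` would be summed across the key pool rather than read from a single key.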
Issuer-side gotchas (relevant to any future §1311 ingest)
- IP allowlist / Cloudflare blocking on Medica MO, Dean WI, BCBSFL: our home IP gets 403'd. Solvable only via VPC egress from `askflorence-staging-vpc` NAT gateway. Document the EC2 setup pattern (see `scripts/db/recover-via-vpc-egress.js`).
- Multi-GB published JSONs (Sanford 1.8 GB, MedMutual 2.2 GB, UHC 2.66 GB). Node v25 max string length is 512 MB → must use streaming parser. We used `stream-json` v1.x (NOT v2; v2 has incompatible API).
- URL encoding: issuer indexes contain literal spaces / special chars (e.g. MedMutual's `HIX - MMO 2026 02192026-0331_drugs.json`). Must `encodeURI` before passing to curl, or HTTP=000 at load balancer.
- Non-data file extensions in indexes. MedMutual published `.docx` reference docs alongside `.json` data files. Pipeline must filter to JSON.
- Rate of disagreement is issuer-dependent. Some issuers' MRFs diverge 8% from CMS's coverage view; others <1%. Don't assume uniform reconciliation cost.
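The URL-encoding and file-extension gotchas can be handled in one pre-fetch pass. A small sketch, assuming the index yields raw, not-yet-encoded URL strings (already-encoded URLs would be double-encoded by `encodeURI`); the function names and the example host are made up:

```typescript
// True when the URL path (ignoring query string and fragment) ends in .json.
function isJsonDataFile(url: string): boolean {
  const path = url.split(/[?#]/)[0] ?? "";
  return path.toLowerCase().endsWith(".json");
}

// Drops non-data files such as .docx and percent-encodes spaces and special characters.
// Assumes raw URLs straight from the issuer index.
function prepareMrfUrls(indexUrls: readonly string[]): string[] {
  return indexUrls.filter(isJsonDataFile).map((u) => encodeURI(u));
}

const prepared = prepareMrfUrls([
  "https://issuer.example/mrf/HIX - MMO 2026 02192026-0331_drugs.json",
  "https://issuer.example/mrf/reference guide.docx",
]);
```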
MRF ↔ CMS reconciliation
Any future ingest must reconcile to CMS at write time, not just trust the MRF. CMS is authoritative for user-facing coverage display per Healthcare.gov parity. We built the tooling for this:
- `scripts/db/reconcile-npis-against-cms.js`: per-NPI exhaustive sweep
- `scripts/db/reconcile-issuer-exhaustive.js`: per-issuer sweep (10×10 batched, ~85 min single-threaded per heavy issuer)
Storage decisions for consumer doctor/drug lists
- MVP: `localStorage` with 30-day TTL, visible "Stored on your device — Clear" affordance, no cross-domain tracking, no server-side mirror.
- OE high-volume edge case: future kiosk / agent-assisted mode that bypasses localStorage entirely (`?kiosk=1` or dedicated `/agent` route → uses sessionStorage; clears on tab close). Avoids cross-user PII leak on shared devices in agent storefronts during OE.
- Post-Phase-5 EDE: authed user accounts → server-side mirror under BAA-covered storage, localStorage continues as cache.
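One way the 30-day TTL could be enforced is an expiry stamp stored alongside the list. A sketch under that assumption; `KeyValueStore`, `saveList`, and `loadList` are hypothetical, and the store is passed in so the same code can sit on `localStorage` in the MVP or `sessionStorage` in kiosk mode:

```typescript
// Structural subset of the Web Storage API; localStorage and sessionStorage both satisfy it.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

// Writes the list together with its expiry timestamp.
function saveList(store: KeyValueStore, key: string, items: string[], now: number = Date.now()): void {
  store.setItem(key, JSON.stringify({ items, expiresAt: now + THIRTY_DAYS_MS }));
}

// Returns the saved list, or [] (and clears the key) if it is missing, expired, or corrupt.
function loadList(store: KeyValueStore, key: string, now: number = Date.now()): string[] {
  const raw = store.getItem(key);
  if (raw === null) return [];
  try {
    const parsed = JSON.parse(raw) as { items?: unknown; expiresAt?: unknown };
    if (typeof parsed.expiresAt === "number" && parsed.expiresAt > now && Array.isArray(parsed.items)) {
      return parsed.items.filter((x): x is string => typeof x === "string");
    }
  } catch {
    // Corrupt JSON: fall through and clear the entry.
  }
  store.removeItem(key);
  return [];
}
```

The "Clear" affordance would then just call `removeItem` on the same keys.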
Conditions to resume §1311 owned-data approach
Resume when ANY of:
- Consumer traffic crosses ~5,000 concurrent /plans visitors (CMS API multi-key + cache mitigations stop scaling)
- We need multi-plan compare features that CMS API doesn't expose (e.g. "show all plans that cover Dr. Patel + Lipitor + Eliquis simultaneously")
- We launch features requiring offline analytics on provider/drug data (e.g. "doctors taking new patients in your area" maps; "cheapest-by-tier drug rankings")
- CMS deprecates `/providers/covered` or `/drugs/covered` (signaled in their changelog)
- We have engineering capacity to own the cadence + reconciliation pipeline (~1 dedicated FTE)
Until any of those: keep the CMS-API path.
Pointers (for the next person)
- Ingest scripts live in `scripts/db/` (`ingest-mrf-providers.js`, `ingest-mrf-formularies.js`, `ingest-mrpuf.js`, `recover-*.js`, `reconcile-*.js`)
- Audit harness: `scripts/audit/tier-6-mrf-coverage-validation.js`
- Tier mapping: `scripts/db/lib/mrf-helpers.js` (TIER_MAP, TIER_META; covers 130 raw issuer tier strings)
- Final audit reports: `docs/audits/mrf-ingest-staging-audit-2026-05-03T18-52.md` + `recovery-final-state-2026-05-03.md`
- Stats handoff for marketing: `docs/briefs/homepage-stats-handoff.md`
Staging collections we created — final keep/drop split (revised 2026-05-04 after CMS API surface audit):
KEEP (load-bearing for runtime + scale resilience):
- `formularies_staging` (12,557 RxCUIs, 67,889 drug names, 29.7M drug-plan tuples)
- `providers_staging` (~2.14M docs, 1.75M individuals, 226K facilities)
DROP (ingest-tracking only, no runtime consumer):
- `mrpuf_issuers_staging` (183 issuers across 31 states: file URLs, fetch state, etl bookkeeping)
- `coverage_disagreements_staging` (~2,996 NPIs reconciled to CMS authoritative; historical reconciliation log)
- `mrf_ingest_log_staging` (operational ingest history)
Recreating the dropped three takes minutes (re-derived from a fresh ingest run if we ever resume). Recreating the kept two would take ~6-8 hours of compute (mostly download time + EC2 provisioning for IP-blocked issuers).
Why we kept formularies + providers (revised 2026-05-04)
The original plan (Phase 6) slated all five §1311 staging collections for drop after the CMS-API path validated. We're revising that to keep two of them. Two compounding reasons:
1. CMS API has no endpoint for the all-N coverage UX
Verified against the official CMS OpenAPI spec (downloaded from developer.cms.gov/marketplace-api):
- `PlanSearchFilter.drugs` and `PlanSearchFilter.providers` are plain `[]string` arrays; the schema is closed. OR/union semantics only. Verified by probing: NPI₁ alone returns 23 plans, NPI₂ alone returns 38, both together returns 38 (the union, not the intersection). Adding a fake provider doesn't narrow the result.
- `PlanSearchRequest.sort` enum = `[premium, deductible, oopc, total_costs, quality_rating]`; no `coverage` value. Server explicitly rejects: `invalid Sort "coverage"`.
- No endpoint returns plan + drug-tier inline. `plans/search` plan records carry `formulary_url` + `rx_3mo_mail_order` only; tier requires a separate `/drugs/covered` tuple call.
- No `/drugs/{rxcui}/plans` or `/providers/{npi}/plans` reverse endpoints exist (404 on every shape).
- `/coverage/search` (which I'd hoped was the bulk endpoint) is just a combined autocomplete: query string + zipcode → list of nearby providers + matching drugs. Not a plan filter.
To render our UX (per-card per-item ✓/✗ with tier + copay, sorted by full-coverage match) using only CMS API, we'd need (drug × plan) and (provider × plan) tuple lookups. At ~10 calls per visitor and 1,000 concurrent visitors during OE, that exhausts the 1,000-per-minute single-key budget in ~3 seconds. Mitigation 6 in this doc — fall back to ingested DB at high load — exists exactly for this, and it requires formularies_staging + providers_staging.
2. Healthcare.gov's own UI has the failure mode we want to fix
Confirmed from the live Healthcare.gov UI: when a user adds drug + provider filters, the site does not actually sort or narrow plans by full-coverage match. It shows ✓/✗ checkboxes per plan but leaves the list in default order. A user with multiple drugs/doctors has to page through results to find a plan that covers all their items — which is exactly the friction Florence's UX is built to remove.
Our owned-data path is the only way to deliver "plans that cover ALL your items, sorted to the top" cheaply at OE scale. That UX is a real differentiator vs Healthcare.gov, not just an internal-architecture choice. Keeping formularies + providers is what enables the differentiator.
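To make the target UX concrete: once per-item coverage is available from owned data, "covers ALL your items, sorted to the top" is an intersection check plus a sort. A sketch with hypothetical types (`PlanCoverage` is not an existing model) that ignores tier and copay:

```typescript
interface PlanCoverage {
  planId: string;
  // Item id (NPI or RxCUI) mapped to whether the plan covers it.
  covered: Record<string, boolean>;
}

// Intersection semantics: true only when every wanted item is covered.
function coversAll(plan: PlanCoverage, wanted: readonly string[]): boolean {
  return wanted.every((id) => plan.covered[id] === true);
}

// Sorts plans by number of wanted items covered, descending. Ties keep their incoming order.
function rankByCoverage(plans: readonly PlanCoverage[], wanted: readonly string[]): PlanCoverage[] {
  const hits = (p: PlanCoverage) => wanted.filter((id) => p.covered[id] === true).length;
  return plans
    .map((plan, index) => ({ plan, index, score: hits(plan) }))
    .sort((a, b) => b.score - a.score || a.index - b.index)
    .map((x) => x.plan);
}
```

The tie-break on incoming order means a premium-sorted list stays premium-sorted within each coverage bucket.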
Trade-off we're accepting
Owning the data means we own the freshness problem. CMS publishes coverage updates daily per /versions. We need a refresh cadence + reconciliation pipeline that keeps our copy within an acceptable staleness window. This is real ongoing work — see follow-up tasks below — but the engineering cost is bounded and the §1311 ingest tooling already exists (paused, not deleted).
Refresh strategy is non-negotiable, not optional (call-out)
Without delta-aware refresh, the §1311-owned-data architecture is operationally unsustainable:
- Combined data on disk (post-2026-05-04 ingest baseline): `formularies_staging` ~30M drug-plan tuples + `providers_staging` 2.14M docs = ~22 GB compressed / ~63 GB raw on Atlas
- Full nightly ingest cost sustained: ~$700+/mo Atlas tier (M50+) just to absorb IOPS pressure
- Delta refresh cost sustained: M30 base ($382/mo per the 2026-05-07 invoice) holds the runtime workload comfortably; ingest churn becomes 5-15% of records per week instead of 100% per night
The 2026-05-06 sticker shock confirmed this. The staging cluster ran at M60 ($1.30/hr × 24/7 ≈ $930/mo) for ~4 days during ingest because nobody scaled it back down. May MTD usage hit $423 on staging alone — would have been ~$2,800/mo if left running. The math only works if refresh is delta-aware from day one.
Implementation is tracked at #98 with concrete acceptance criteria. This MUST land before any production cutover — without it, the data goes stale fast (CMS updates coverage + npis every ~19 hours per /versions) AND we burn cluster cost we don't have to.
The pattern is three-layer cache invalidation:
- CMS `/versions` poll (~30 sec, daily): exit early if `coverage` / `npis` / `drugs` haven't advanced. ~99% of weekdays this is the entire run.
- Issuer-file HEAD checks (~2 min when CMS advanced): HTTP HEAD against each issuer's MRF URLs, compare `Last-Modified` + `ETag` to per-issuer state in a new `mrf_file_state_staging` collection. Most issuers republish every 2-4 weeks; ~95% are unchanged on any given day.
- Record-level diff (only on changed files): stream-parse, hash each NPI/RxCUI record, bulk-upsert only the deltas. Real-world: ~5% of records change between issuer file versions.
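The three layers reduce to three small decisions. A sketch of the decision logic only, with no I/O; every type and function name here is hypothetical, the field names on `VersionStamps` are assumed from the dataset names above, and emitting removals for records that disappeared is an assumption beyond the upsert-only wording:

```typescript
// Layer 1: dataset stamps as reported by the CMS versions poll (shape assumed).
interface VersionStamps {
  coverage: string;
  npis: string;
  drugs: string;
}

function cmsAdvanced(prev: VersionStamps, curr: VersionStamps): boolean {
  return prev.coverage !== curr.coverage || prev.npis !== curr.npis || prev.drugs !== curr.drugs;
}

// Layer 2: per-file state as it might be stored in mrf_file_state_staging.
interface FileState {
  lastModified: string | null;
  etag: string | null;
}

// A file with no stored state is treated as changed so it gets fetched once.
function fileChanged(stored: FileState | undefined, head: FileState): boolean {
  if (!stored) return true;
  return stored.etag !== head.etag || stored.lastModified !== head.lastModified;
}

// Layer 3: compare record id -> content hash maps and keep only the deltas.
interface RecordDelta {
  upserts: string[];
  removals: string[];
}

function diffRecords(prev: Map<string, string>, next: Map<string, string>): RecordDelta {
  const upserts: string[] = [];
  const removals: string[] = [];
  next.forEach((hash, id) => {
    if (prev.get(id) !== hash) upserts.push(id);
  });
  prev.forEach((_hash, id) => {
    if (!next.has(id)) removals.push(id);
  });
  return { upserts, removals };
}
```

On a no-op day only `cmsAdvanced` runs, which is what keeps the daily job in the seconds range.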
Atlas tier strategy that follows from this:
- Base: M30 ($382/mo, real rate from 2026-05-07 invoice) — handles runtime queries comfortably with the indexed lookups on (rxcui × plan_id) and (npi × plan_id)
- Burst: scale to M50/M60 only during weekly full reconciliation (Sundays 03:00 UTC, ~2-4 hours of compute) — programmatic via Atlas CLI / API, return to M30 when done
- Annual cost: ~$5K/yr base + ~$200/yr burst vs ~$11K/yr if M60 is left running
Cross-cluster reference reads via AWS PrivateLink (decided 2026-05-08)
After the staging M60 → M30 resize landed, the migration question became: do we move the reference data (formularies_staging + providers_staging) to the prod cluster, or keep it on staging and have prod read it cross-cluster?
Decision: keep canonical reference data on staging cluster, prod reads via AWS PrivateLink.
Cost framing: at our actual Atlas pricing, M10 → M30 on prod is +$326/mo. Over a year that's ~$3,900 of recurring spend just to host non-PHI public CMS marketplace data on prod. Compared to ~$8/mo for a PrivateLink endpoint that delivers the same parity, the migrate-to-prod option is hard to justify pre-funding.
Why PrivateLink and not the simpler patterns:
| Path | Why ruled out |
|---|---|
| Migrate data to prod cluster | $326/mo recurring, also bloats prod's audit surface with 2.14M provider + 12.5K formulary records that don't need to be there pre-EDE |
| Standard VPC peering (mirror Phase 7/8) | Both Atlas projects use the same default CIDR 192.168.248.0/21 per AWS region; prod VPC's route table can't have two routes to the same destination CIDR. Architectural blocker. |
| IP allowlist of prod NAT EIP on staging Atlas | Public-internet path, fragile to NAT EIP rotation, leaves a permanent allow-from-NAT rule auditors will question. Worst on every dimension. |
| AWS PrivateLink | Chosen. Identity-bound at AWS account level, AWS-backbone-only, no public IP exposure, no CIDR conflicts. Atlas-supported pattern. |
Compliance posture (the dispositive reasoning):
- Atlas BAA covers both projects: the MongoDB BAA is signed at the organization level, not per-project. Both `askflorence-prod` and `askflorence-staging` inherit the same BAA. Action: get both project IDs explicitly enumerated in writing with MongoDB Sales (tracked in #57); one-line email, free.
- EDE Phase 3 audit posture: this IS the correct segmentation. Auditors flag unsegmented architectures where PHI mingles with non-PHI. Our design (PHI ultimately on prod cluster, non-PHI public §1311 marketplace data isolated on staging cluster, one-way private read prod → staging) is textbook data-classification segmentation. Non-PHI Atlas project is out of scope of EDE Phase 3 audit by definition (same posture as calling CMS Marketplace API directly). Action: explicit data-classification entry in our SSP describing staging cluster as public marketplace reference data only.
- HIPAA §164.312 Technical Safeguards: PrivateLink + Atlas TLS 1.2+ + Atlas audit logging satisfies access control, audit, integrity, authentication, transmission security. §164.312 doesn't formally bind the staging cluster (non-PHI), but mirroring the controls is defense-in-depth + future-proofing.
- AWS Security Reference Architecture alignment: cross-account/cross-project PrivateLink for data-classification-segmented workloads is the recommended SRA approach for regulated workloads. Not a novel design.
Implementation (committed 2026-05-08):
- Atlas-side endpoint service: `com.amazonaws.vpce.us-east-1.vpce-svc-0d8138ea0f6542afa` (Atlas endpointId `69fe75c5b02c024f32d2af50`, staging project `69e31af12fd2c0aef51bbb41`)
- AWS-side VPC interface endpoint: `vpce-0c81aea11e29bb928` in prod VPC `vpc-09201679b87261b6d`, both private subnets, dedicated SG `askflorence-prod-atlas-staging-privatelink`
- Terraform: `infra/envs/prod/atlas-staging-privatelink.tf` (named `af-prod-to-staging-reference-pl` for audit-trail clarity)
- Code: `src/lib/db.ts` exposes `getReferenceDb()` alongside `getDb()`. Prod ECS sets `MONGODB_REFERENCE_URI` to the Atlas-issued private connection string for the staging cluster; falls back to `MONGODB_URI` when unset (dev + staging deploy posture).
- Cost: ~$8/mo for the PrivateLink endpoint hours
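The fallback rule for the reference connection string is small enough to show in isolation. This is a sketch of the rule as described, not the code in `src/lib/db.ts`; `resolveReferenceUri` and the example URIs are hypothetical:

```typescript
type Env = Record<string, string | undefined>;

// Prefers MONGODB_REFERENCE_URI; falls back to MONGODB_URI when it is unset or empty.
function resolveReferenceUri(env: Env): string {
  const uri = env["MONGODB_REFERENCE_URI"] || env["MONGODB_URI"];
  if (!uri) {
    throw new Error("Neither MONGODB_REFERENCE_URI nor MONGODB_URI is set");
  }
  return uri;
}

// Prod posture: reference reads go to the staging cluster over PrivateLink.
const prodUri = resolveReferenceUri({
  MONGODB_URI: "mongodb+srv://prod.example.invalid/app",
  MONGODB_REFERENCE_URI: "mongodb+srv://staging-pl.example.invalid/app",
});

// Dev / staging posture: one cluster serves both roles.
const devUri = resolveReferenceUri({ MONGODB_URI: "mongodb://localhost:27017/app" });
```

This is why a later move to a single cluster is an env var flip: unsetting `MONGODB_REFERENCE_URI` sends reference reads to the main cluster.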
When to revisit (move data to prod cluster):
- EDE Phase 3 / carrier BAA cutover — once prod owns PHI behind BAA, the M30 tier is justified and operational separation between clusters compounds the audit benefit
- Funding lands — $326/mo recurring becomes acceptable when runway isn't tight
- Cross-cluster latency becomes user-perceptible — if PrivateLink-mediated reads add measurable p95 latency on the /plans surface, migrate to single-cluster
- Staging cluster availability becomes a real risk — at scale, if staging Atlas downtime starts producing prod degradation we can't tolerate
Until any of those: stay on the cross-cluster PrivateLink pattern. It's the right answer for our pre-funding posture and the right migration foundation when the trigger fires (env var flip, no code change).
Open mitigations being implemented alongside this decision:
- CI guard against data-classification drift, tracked at #100 / ENG-239. Both phases shipped:
  - Phase 1 shipped 2026-05-08: static CI guard at `scripts/audit/staging-collections-guard.ts` runs on every PR via `.github/workflows/staging-collections-guard.yml`. Fails the build if any `getReferenceDb()` call accesses a collection not on `STAGING_ALLOWED_COLLECTIONS` (defined in `src/lib/db.ts`). Catches string-literal, dynamic-name, and inline-call patterns; verified against 3 synthetic violation cases the same day.
  - Phase 2 shipped 2026-05-09: live nightly drift check at `scripts/audit/staging-cluster-drift.ts` runs at 08:00 UTC daily via `.github/workflows/staging-cluster-drift.yml`. Audits the actual Atlas state of `app_read_staging` (the cross-cluster reader): verifies the user has exactly one role (`role_reader_reference@admin`) granting only `FIND` on exactly the expected 2 collections (`formularies_staging` + `providers_staging`) and nothing else. Opens a P1 GitHub issue on drift. Catches the runtime cases the static guard cannot: privilege escalation via Atlas Admin UI, out-of-band role changes, etc. Verified against 3 synthetic violations (extra collection / wider action / extra role on user) the same day. As part of Phase 2 the user's role was tightened from built-in `read@askflorence` to per-collection custom role; verified prod cross-cluster reads still healthy after the swap.

  The hand-discipline backstop (only public CMS-derived fields written to staging by ingest scripts) remains as defense-in-depth.
- Staging Atlas IP access list hardening (post-launch): once we cut over fully to AWS-prod and don't need ad-hoc script access from local IPs, the staging cluster's IP allowlist should be reduced to internal CIDRs + PrivateLink only (removing the temp-audit IP and any laptop IPs). Tracked under #71 (Phase 12 compliance docs). NOT done tonight because we're still pre-launch and the §1311 ingest scripts run from operator machines.
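The Phase 2 drift rule (exactly one role, only `FIND`, exactly two collections) can be expressed as a pure check over a snapshot of the user's grants. A sketch only: the `DbUserState` shape is an assumption and not the Atlas Admin API response format, and `findDrift` is not the function in `scripts/audit/staging-cluster-drift.ts`.

```typescript
interface Privilege {
  collection: string;
  actions: string[];
}

// Assumed snapshot shape; the real script would build this from Atlas.
interface DbUserState {
  roles: string[];
  privileges: Privilege[];
}

const EXPECTED_ROLE = "role_reader_reference@admin";
const EXPECTED_COLLECTIONS = ["formularies_staging", "providers_staging"];

// Returns one message per violation; an empty array means no drift.
function findDrift(user: DbUserState): string[] {
  const problems: string[] = [];
  if (user.roles.length !== 1 || user.roles[0] !== EXPECTED_ROLE) {
    problems.push("unexpected roles: " + user.roles.join(", "));
  }
  for (const p of user.privileges) {
    if (EXPECTED_COLLECTIONS.indexOf(p.collection) === -1) {
      problems.push("extra collection: " + p.collection);
    }
    for (const action of p.actions) {
      if (action !== "FIND") {
        problems.push("wider action " + action + " on " + p.collection);
      }
    }
  }
  for (const name of EXPECTED_COLLECTIONS) {
    const granted = user.privileges.some(
      (p) => p.collection === name && p.actions.indexOf("FIND") !== -1,
    );
    if (!granted) problems.push("missing FIND on " + name);
  }
  return problems;
}
```

The three synthetic violations listed for Phase 2 (extra collection, wider action, extra role) each map to one branch above.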
Follow-up tasks (separate work, not blocking the demo)
For tomorrow / next week / before OE 2027:
- #92: Re-validate §1311 staging audit at 100% match. Tier-6 harness landed at 99.94% (16 NPIs reconciled). Re-run against current dataset to measure drift since 2026-05-03.
- #93: Design refresh cadence for formularies + providers. CMS `/versions` says coverage data updates sub-daily. Pick: nightly full re-ingest, `/versions`-driven incremental, hybrid, or drift-detection-only. Cost vs staleness trade-off.
- #94: Investigate alternative data sources. §1311 MRFs are one path. Time-box research on NPPES, NIPR, per-issuer REST APIs, third-party aggregators. One-page comparison + recommendation.
- #95: Decide fallback ordering. Owned-data-first with CMS fallback on miss (fastest, owns staleness risk) vs CMS-first with staging fallback on rate-limit (always-fresh, owned only on overflow). Pick after cadence + audit-revalidation land.
- #96: Phase D provider-network fallback parity. Mirror `src/lib/drug-tier-fallback.ts` for providers (`src/lib/provider-network-fallback.ts`). Required for Mitigation 6 on the doctor side.
- #98: Delta-aware MRF refresh pipeline (LOAD-BEARING). Three-layer cache invalidation (CMS `/versions` poll → issuer-file HEAD checks → record-level hash diff). Without this, §1311-owned-data is operationally unsustainable. New `mrf_file_state_staging` collection tracks per-issuer per-file state (`Last-Modified`, `ETag`, content hash, last-fetch timestamps). Daily run completes in <5 min on no-op days, <30 min on typical change days. GitHub Actions cron primary; ECS scheduled task fallback for VPC-egress issuers. Must land before any production cutover: the 2026-05-06 M60 sticker shock proved the math only works with delta-aware refresh.
- Drop the three ingest-tracking collections via `scripts/db/drop-mrf-staging-collections.js` (explicit allowlist, dry-run by default, requires `MONGODB_DROP_URI` privileged user for `--apply`). Phase 6 of the original plan, scope reduced from 5 → 3.
These are open tasks, not blockers. The current code already uses formularies_staging for the Eliquis-class CMS tier-omission case (see src/lib/drug-tier-fallback.ts) and the MVP path is shippable as-is. The above items harden the path for OE scale.
Linked
- #17: Phase C drug formulary (paused)
- #18: Phase D provider directory (paused)
- #53: Phase E plan detail page (now CMS-API path)
- #86: CMS data is not annual; rethink ingest cadence
- #87: Use `/versions` endpoint for incremental re-ingest
- #92: Re-validate §1311 staging audit at 100% match
- #93: Design refresh cadence for formularies + providers
- #94: Investigate alternative data sources
- #95: Decide fallback ordering: owned-first vs CMS-first
- #96: Phase D provider-network fallback parity
- #98: Delta-aware MRF refresh pipeline (load-bearing for sustainable §1311 ownership)