2026-05-03: Pivot doctor + Rx coverage from §1311 ingest to CMS Marketplace API direct

Decision record. Captures what we built, what we learned, what we paused, and the conditions under which we'd resume the §1311-owned-data approach.

Decision

For the MVP doctor + Rx coverage flow (calculator → takeover → /plans → plan detail → search-doctor / search-Rx), call the CMS Marketplace API at query time. Do NOT use our staging Mongo §1311 mirror.

Pivot date: 2026-05-03. Decided after reaching 99.94% audit match on §1311 staging data via reconciliation. Rationale below.

Why

Two compounding problems with the §1311-owned-data approach:

Update cadence is unsolved. CMS publishes coverage refreshes daily/sub-daily (per /versions endpoint timestamps). Our model assumed annual ingest. Closing that gap is an ongoing pipeline + monitoring + cache-invalidation problem we haven't built. See #86.
Issuer §1311 MRF and CMS coverage DB disagree at scale. Two different upstreams at the issuer — §1311 publication (compliance obligation) vs HIOS administrative submission (what CMS adjudicates against). Empirical disagreement rates measured on staging: 25896 IA at 8% (2,976 distinct NPIs disagreed across 333,950 NPI×plan tuples), 42326 SC at 0.5%, 50305 SD localized to specific NPIs. Reconciling at scale takes ~85 min per issuer single-threaded, parallelizable to ~30 min, but the result is "100% match against a moving CMS dataset" — Sisyphean.

For the calculator → plan-detail → doctor/Rx-lookup user flow, CMS Marketplace API gives us the same answer Healthcare.gov shows, with zero ingest infrastructure, zero reconciliation, zero staleness window. That's literally what consumers experience. Ship to validate.

What we shipped (the new path)

src/lib/cms-api.ts — runtime CMS API client with token-bucket throttle, in-memory cache (60s TTL), retry on 5xx
4 thin proxy routes: /api/providers/{autocomplete,covered}, /api/drugs/{autocomplete,covered}
DoctorLookupPanel + DrugLookupPanel reusable components
(Initial) /plans/[planId]/page.tsx — later removed in favor of integrating into /plans browse + future detail page
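
A minimal sketch of the throttle-plus-cache idea behind the runtime client, under stated assumptions. This is not the contents of src/lib/cms-api.ts: TokenBucket, TtlCache, and the injectable clock are hypothetical names, and only the 60s TTL figure comes from this record.

```typescript
// Hypothetical sketch; not the shipped src/lib/cms-api.ts.
type Clock = () => number; // returns milliseconds

class TokenBucket {
  private tokens: number;
  private last: number;
  private readonly capacity: number;
  private readonly refillPerSec: number;
  private readonly now: Clock;

  constructor(capacity: number, refillPerSec: number, now: Clock = Date.now) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.now = now;
    this.tokens = capacity;
    this.last = now();
  }

  // Refill based on elapsed time, then spend one token if available.
  tryTake(): boolean {
    const t = this.now();
    const elapsedSec = (t - this.last) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.last = t;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: Clock;

  constructor(ttlMs: number, now: Clock = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (!hit) return undefined;
    if (hit.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: evict lazily on read
      return undefined;
    }
    return hit.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

Injecting the clock keeps both pieces deterministic under test; a caller would check the cache first and only spend a token on a miss.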

Operational learnings (use these on the next ingest pass)

CMS Marketplace API — verified facts

Limit	Value	How we know
Rate limit (per second)	200 RPS	`x-ratelimit-limit-second` header
Rate limit (per minute)	1,000 / min	`x-ratelimit-limit-minute` header
Per-call max (NPIs × plans)	100 combos	tested 10×10 OK, 5×20 / 2×25 / 3×15 → 400
Per-call max NPIs alone	10	25 → 400 "NPIs: the length..."
Per-call max plans alone	10	inferred symmetrical
`/providers/autocomplete` query format	name only, NOT NPI	tested 5 ways with various ZIPs
`/drugs/autocomplete` query format	substring on drug-name field; multi-token queries fail	"Lipitor 20mg" → 0 results ("20mg" not in any drug-name field)
Brand-with-generic-substitution case	returned as `coverage: "GenericCovered"` + `generic_rxcui`	UI must surface this gracefully
Live updates	CMS coverage data updates daily per `/versions`	not in CMS docs; verified observationally — `coverage` and `npis` datasets stamped <24h ago at audit time
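
The per-call limits in the table translate directly into a batching helper. A sketch, assuming the observed ceiling of 10 NPIs and 10 plans per call holds; chunk and buildCoverageBatches are hypothetical names, not functions from our codebase.

```typescript
// Assumption from the table above: at most 10 NPIs and 10 plans per call.
const MAX_PER_DIMENSION = 10;

function chunk<T>(items: readonly T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

interface CoverageBatch {
  npis: string[];
  planIds: string[];
}

// Splits an arbitrary NPI x plan request into calls that stay within
// the per-dimension limit (and therefore within 100 combos per call).
function buildCoverageBatches(
  npis: readonly string[],
  planIds: readonly string[],
): CoverageBatch[] {
  const batches: CoverageBatch[] = [];
  for (const planChunk of chunk(planIds, MAX_PER_DIMENSION)) {
    for (const npiChunk of chunk(npis, MAX_PER_DIMENSION)) {
      batches.push({ npis: npiChunk, planIds: planChunk });
    }
  }
  return batches;
}
```

With 10 NPIs and a 50-plan list this produces 5 batches; the same helper works for RxCUIs if the drug endpoint has the same limits, which we have not verified separately here.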

Performance ceiling at scale (the load-bearing concern)

The CMS-API-direct path does not survive Open Enrollment scale. Math:

Worst-case page load: a /plans visitor with 10 saved doctors + 10 saved drugs viewing a 50-plan list.
Best-case batched calls: 10 plans × 10 NPIs = 100 combos per call, so 5 plan-batches × (1 doctor-batch + 1 drug-batch) = 10 calls.
Per visit: ~10 CMS calls.
At 1,000 concurrent /plans visitors during OE peak: ~10,000 calls in a 30-second window = 333 calls/sec.
CMS limit: 200 RPS / 1,000 per minute per API key. We'd exhaust the per-minute budget in about 3 seconds (1,000 ÷ 333 ≈ 3). Visitors get throttled or 429'd.
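
The same arithmetic as a small script, so the inputs are easy to vary. All figures are the ones stated above; the variable names are ours for illustration.

```typescript
// Worst-case /plans page load, using the figures from this section.
const plansPerPage = 50;
const plansPerCall = 10;
const lookupKinds = 2; // one doctor batch + one drug batch per plan batch
const callsPerVisit = Math.ceil(plansPerPage / plansPerCall) * lookupKinds; // 10

const visitors = 1000;
const windowSec = 30;
const callsPerSec = (visitors * callsPerVisit) / windowSec; // about 333

const perMinuteBudget = 1000; // per API key
const secondsToExhaust = perMinuteBudget / callsPerSec; // about 3

console.log({ callsPerVisit, callsPerSec, secondsToExhaust });
```

Sustained over a full minute, a single key supports only about 16.7 calls/sec (1,000 ÷ 60), far below the 200 RPS per-second ceiling.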

The per-minute budget on a single key is the binding constraint, not the per-second limit. Mitigations available:

1. Multiple CMS API keys rotated server-side. Each key has its own 1,000/min budget. At 5 keys we'd handle 5,000/min, which buys us ~5,000 concurrent /plans visitors in steady state. CMS issues keys liberally; rotation is cheap.
2. Server-side coverage cache keyed by (plan_id, npi) and (plan_id, rxcui) tuples with hours-long TTL. Most coverage is stable across days. Massively reduces CMS calls: overlapping users in the same metro hitting the same plans + same popular drugs would all share the cache. Critical for OE.
3. CDN edge cache in front of the proxy routes. CloudFront / Vercel can cache by request body hash, so same-shaped queries hit the edge.
4. Batching across users. If 50 concurrent users all check Lipitor against 20 plans, deduplicate the unique tuples and call CMS once.
5. Tier the coverage check. Show coverage for the top 3 plans immediately; lazy-load the rest as the user scrolls. Reduces avg per-visit calls from ~10 to ~3.
6. Fall back to ingested DB at high load. Behind a feature flag, switch to staging Mongo coverage when CMS rate-limit budget < 50% remaining. The §1311 mirror still has 99.94% accuracy: degraded but live.
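
The coverage-cache and cross-user batching mitigations can share one mechanism: cache settled tuples, and let concurrent lookups for the same tuple reuse a single in-flight request. A minimal sketch under assumptions: CoverageCache and TupleFetcher are hypothetical names, and the fetcher stands in for whatever actually calls CMS.

```typescript
// Hypothetical sketch of a tuple-keyed cache with in-flight de-duplication.
type TupleFetcher<V> = (planId: string, itemId: string) => Promise<V>;

class CoverageCache<V> {
  private readonly settled = new Map<string, { value: V; expiresAt: number }>();
  private readonly inFlight = new Map<string, Promise<V>>();
  private readonly fetcher: TupleFetcher<V>;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(fetcher: TupleFetcher<V>, ttlMs: number, now: () => number = Date.now) {
    this.fetcher = fetcher;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // itemId is an NPI or an RxCUI; the key is the (plan, item) tuple.
  get(planId: string, itemId: string): Promise<V> {
    const key = `${planId}|${itemId}`;

    const hit = this.settled.get(key);
    if (hit && hit.expiresAt > this.now()) return Promise.resolve(hit.value);

    // Another caller is already fetching this tuple: share its promise.
    const pending = this.inFlight.get(key);
    if (pending) return pending;

    const request = this.fetcher(planId, itemId)
      .then((value) => {
        this.settled.set(key, { value, expiresAt: this.now() + this.ttlMs });
        return value;
      })
      .finally(() => {
        // Runs on success and failure, so errors are never cached.
        this.inFlight.delete(key);
      });
    this.inFlight.set(key, request);
    return request;
  }
}
```

This is per-process only; sharing across server instances would need an external store, which this sketch does not cover.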

For MVP (pre-OE) the CMS-API path is fine. Mitigations 1, 2, 3 should land before OE 2027. Mitigation 6 is the long-term bridge to the §1311 path if we resume it.
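
The high-load fallback reduces to a small routing decision. A sketch, assuming we can observe the remaining per-minute budget somehow (how it is obtained is left open here); chooseCoverageSource and BudgetSnapshot are hypothetical names.

```typescript
type CoverageSource = "cms-api" | "staging-mirror";

interface BudgetSnapshot {
  limitPerMinute: number;
  remainingThisMinute: number;
}

// Route to the staging mirror only when the feature flag is on and
// less than `threshold` of the per-minute budget remains.
function chooseCoverageSource(
  budget: BudgetSnapshot,
  fallbackEnabled: boolean,
  threshold = 0.5,
): CoverageSource {
  if (!fallbackEnabled) return "cms-api";
  const fractionRemaining = budget.remainingThisMinute / budget.limitPerMinute;
  return fractionRemaining < threshold ? "staging-mirror" : "cms-api";
}
```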

Issuer-side gotchas (relevant to any future §1311 ingest)

IP allowlist / Cloudflare blocking on Medica MO, Dean WI, BCBSFL — our home IP gets 403'd. Solvable only via VPC egress from askflorence-staging-vpc NAT gateway. Document the EC2 setup pattern (see scripts/db/recover-via-vpc-egress.js).
Multi-GB published JSONs (Sanford 1.8 GB, MedMutual 2.2 GB, UHC 2.66 GB). Node v25 max string length is 512 MB → must use streaming parser. We used stream-json v1.x (NOT v2 — v2 has incompatible API).
URL encoding: issuer indexes contain literal spaces / special chars (e.g. MedMutual's HIX - MMO 2026 02192026-0331_drugs.json). Must encodeURI before passing to curl, or HTTP=000 at load balancer.
Non-data file extensions in indexes. MedMutual published .docx reference docs alongside .json data files. Pipeline must filter to JSON.
Rate of disagreement is issuer-dependent. Some issuers' MRFs diverge 8% from CMS's coverage view; others <1%. Don't assume uniform reconciliation cost.
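
Two of the gotchas above (non-JSON entries and unencoded URLs) can be handled in one pre-fetch step. A sketch, assuming index entries are raw, not-yet-encoded absolute URLs; toFetchableJsonUrls is a hypothetical helper, and the example.com host is a placeholder.

```typescript
// Keeps only .json entries and percent-encodes them for fetching.
// Assumption: inputs are raw URLs. Already-encoded URLs would be
// double-encoded by encodeURI ("%20" becomes "%2520").
function toFetchableJsonUrls(indexUrls: readonly string[]): string[] {
  return indexUrls
    .filter((u) => {
      const path = u.split(/[?#]/)[0] ?? "";
      return path.toLowerCase().endsWith(".json");
    })
    .map((u) => encodeURI(u));
}

const urls = toFetchableJsonUrls([
  "https://example.com/HIX - MMO 2026 02192026-0331_drugs.json",
  "https://example.com/reference guide.docx",
]);
console.log(urls);
```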

MRF ↔ CMS reconciliation

Any future ingest must reconcile to CMS at write time, not just trust the MRF. CMS is authoritative for user-facing coverage display per Healthcare.gov parity. We built the tooling for this:

scripts/db/reconcile-npis-against-cms.js — per-NPI exhaustive sweep
scripts/db/reconcile-issuer-exhaustive.js — per-issuer sweep (10×10 batched, ~85 min single-threaded per heavy issuer)
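
The core of write-time reconciliation is a tuple-level comparison in which CMS wins. A minimal sketch of that step only; it is not the logic of the scripts above, and CoverageTuple and reconcile are hypothetical names.

```typescript
interface CoverageTuple {
  npi: string;
  planId: string;
  covered: boolean;
}

// Compares MRF-derived tuples against CMS answers for the same tuples.
// Where they disagree, the correction carries the CMS value.
function reconcile(mrf: readonly CoverageTuple[], cms: readonly CoverageTuple[]) {
  const cmsByKey = new Map<string, boolean>();
  for (const t of cms) cmsByKey.set(`${t.npi}|${t.planId}`, t.covered);

  const corrections: CoverageTuple[] = [];
  for (const t of mrf) {
    const cmsCovered = cmsByKey.get(`${t.npi}|${t.planId}`);
    if (cmsCovered !== undefined && cmsCovered !== t.covered) {
      corrections.push({ ...t, covered: cmsCovered });
    }
  }

  return {
    corrections,
    disagreeingNpis: new Set(corrections.map((c) => c.npi)).size,
    tupleDisagreementRate: mrf.length === 0 ? 0 : corrections.length / mrf.length,
  };
}
```

Reporting both distinct NPIs and the tuple-level rate matters because the two can differ widely for the same issuer.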

Storage decisions for consumer doctor/drug lists

MVP: localStorage with 30-day TTL, visible "Stored on your device — Clear" affordance, no cross-domain tracking, no server-side mirror.
OE high-volume edge case: future kiosk / agent-assisted mode that bypasses localStorage entirely (?kiosk=1 or dedicated /agent route → uses sessionStorage; clears on tab close). Avoids cross-user PII leak on shared devices in agent storefronts during OE.
Post-Phase-5 EDE: authed user accounts → server-side mirror under BAA-covered storage, localStorage continues as cache.
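
A sketch of the MVP storage behavior with the storage backend passed in, so the kiosk mode can hand in sessionStorage instead of localStorage. KeyValueStore, saveList, and loadList are hypothetical names; only the 30-day TTL comes from this record.

```typescript
// Both window.localStorage and window.sessionStorage satisfy this shape.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

function saveList(store: KeyValueStore, key: string, items: string[], now: number = Date.now()): void {
  store.setItem(key, JSON.stringify({ items, expiresAt: now + THIRTY_DAYS_MS }));
}

// Returns the saved list, or [] (and clears the entry) if it is
// missing, expired, or malformed.
function loadList(store: KeyValueStore, key: string, now: number = Date.now()): string[] {
  const raw = store.getItem(key);
  if (raw === null) return [];
  try {
    const parsed = JSON.parse(raw) as { items?: unknown; expiresAt?: unknown };
    if (Array.isArray(parsed.items) && typeof parsed.expiresAt === "number" && parsed.expiresAt > now) {
      return parsed.items.filter((x): x is string => typeof x === "string");
    }
  } catch {
    // Malformed JSON: fall through and clear the entry.
  }
  store.removeItem(key);
  return [];
}
```

The "Clear" affordance would call removeItem on the same key.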

Conditions to resume §1311 owned-data approach

Resume when ANY of:

Consumer traffic crosses ~5,000 concurrent /plans visitors (CMS API multi-key + cache mitigations stop scaling)
We need multi-plan compare features that CMS API doesn't expose (e.g. "show all plans that cover Dr. Patel + Lipitor + Eliquis simultaneously")
We launch features requiring offline analytics on provider/drug data (e.g. "doctors taking new patients in your area" maps; "cheapest-by-tier drug rankings")
CMS deprecates /providers/covered or /drugs/covered (signaled in their changelog)
We have engineering capacity to own the cadence + reconciliation pipeline (~1 dedicated FTE)

Until any of those: keep the CMS-API path.

Pointers (for the next person)

Ingest scripts live in scripts/db/ (ingest-mrf-providers.js, ingest-mrf-formularies.js, ingest-mrpuf.js, recover-*.js, reconcile-*.js)
Audit harness: scripts/audit/tier-6-mrf-coverage-validation.js
Tier mapping: scripts/db/lib/mrf-helpers.js (TIER_MAP, TIER_META — covers 130 raw issuer tier strings)
Final audit reports: docs/audits/mrf-ingest-staging-audit-2026-05-03T18-52.md + recovery-final-state-2026-05-03.md
Stats handoff for marketing: docs/briefs/homepage-stats-handoff.md

Staging collections we created — final keep/drop split (revised 2026-05-04 after CMS API surface audit):

KEEP (load-bearing for runtime + scale resilience):

formularies_staging (12,557 RxCUIs, 67,889 drug names, 29.7M drug-plan tuples)
providers_staging (~2.14M docs, 1.75M individuals, 226K facilities)

DROP (ingest-tracking only, no runtime consumer):

mrpuf_issuers_staging (183 issuers across 31 states — file URLs, fetch state, etl bookkeeping)
coverage_disagreements_staging (~2,996 NPIs reconciled to CMS authoritative — historical reconciliation log)
mrf_ingest_log_staging (operational ingest history)

Recreating the dropped three takes minutes (re-derived from a fresh ingest run if we ever resume). Recreating the kept two would take ~6-8 hours of compute (mostly download time + EC2 provisioning for IP-blocked issuers).

Why we kept formularies + providers (revised 2026-05-04)

The original plan (Phase 6) slated all five §1311 staging collections for drop after the CMS-API path validated. We're revising that to keep two of them. Two compounding reasons:

1. CMS API has no endpoint for the all-N coverage UX

Verified against the official CMS OpenAPI spec (downloaded from developer.cms.gov/marketplace-api):

PlanSearchFilter.drugs and PlanSearchFilter.providers are plain []string arrays; the schema is closed. OR/union semantics only. Verified by probing: NPI₁ alone returns 23 plans, NPI₂ alone returns 38, both together return 38 (the union; an intersection could be at most 23). Adding a fake provider doesn't narrow the result.
PlanSearchRequest.sort enum = [premium, deductible, oopc, total_costs, quality_rating] — no coverage value. Server explicitly rejects invalid Sort "coverage".
No endpoint returns plan + drug-tier inline. plans/search plan records carry formulary_url + rx_3mo_mail_order only — tier requires a separate /drugs/covered tuple call.
No /drugs/{rxcui}/plans or /providers/{npi}/plans reverse endpoints exist (404 on every shape).
/coverage/search (which we'd hoped was the bulk endpoint) is just a combined autocomplete: query string + zipcode → list of nearby providers + matching drugs. Not a plan filter.

To render our UX (per-card per-item ✓/✗ with tier + copay, sorted by full-coverage match) using only CMS API, we'd need (drug × plan) and (provider × plan) tuple lookups. At ~10 calls per visitor and 1,000 concurrent visitors during OE, that exhausts the 1,000-per-minute single-key budget in ~3 seconds. Mitigation 6 in this doc — fall back to ingested DB at high load — exists exactly for this, and it requires formularies_staging + providers_staging.
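
Once the per-tuple answers are in hand (from either source), the sort itself is simple. A sketch of the "full-coverage match first" ordering; PlanCoverage, coversAll, and rankByCoverage are hypothetical names, and coverage is reduced to a boolean here, ignoring tier and copay.

```typescript
interface PlanCoverage {
  planId: string;
  // itemId (NPI or RxCUI) mapped to whether the plan covers it
  items: Record<string, boolean>;
}

function coversAll(plan: PlanCoverage, wanted: readonly string[]): boolean {
  return wanted.every((id) => plan.items[id] === true);
}

// Plans covering more of the user's items sort first. Array.prototype.sort
// is stable, so ties keep their incoming order (e.g. by premium).
function rankByCoverage(plans: readonly PlanCoverage[], wanted: readonly string[]): PlanCoverage[] {
  const score = (p: PlanCoverage) => wanted.filter((id) => p.items[id] === true).length;
  return [...plans].sort((a, b) => score(b) - score(a));
}
```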

2. Healthcare.gov's own UI has the failure mode we want to fix

Confirmed from the live Healthcare.gov UI: when a user adds drug + provider filters, the site does not actually sort or narrow plans by full-coverage match. It shows ✓/✗ checkboxes per plan but leaves the list in default order. A user with multiple drugs/doctors has to page through results to find a plan that covers all their items — which is exactly the friction Florence's UX is built to remove.

Our owned-data path is the only way to deliver "plans that cover ALL your items, sorted to the top" cheaply at OE scale. That UX is a real differentiator vs Healthcare.gov, not just an internal-architecture choice. Keeping formularies + providers is what enables the differentiator.

Trade-off we're accepting

Owning the data means we own the freshness problem. CMS publishes coverage updates daily per /versions. We need a refresh cadence + reconciliation pipeline that keeps our copy within an acceptable staleness window. This is real ongoing work — see follow-up tasks below — but the engineering cost is bounded and the §1311 ingest tooling already exists (paused, not deleted).

Refresh strategy is non-negotiable (call-out)

Without delta-aware refresh, the §1311-owned-data architecture is operationally unsustainable:

Combined data on disk (post-2026-05-04 ingest baseline): formularies_staging ~30M drug-plan tuples + providers_staging 2.14M docs = ~22 GB compressed / ~63 GB raw on Atlas
Full nightly ingest cost sustained: ~$700+/mo Atlas tier (M50+) just to absorb IOPS pressure
Delta refresh cost sustained: M30 base ($382/mo) holds the runtime workload comfortably; ingest churn becomes 5-15% of records per week instead of 100% per night

The 2026-05-06 sticker shock confirmed this. The staging cluster ran at M60 ($1.30/hr × 24/7 ≈ $930/mo) for ~4 days during ingest because nobody scaled it back down. May MTD usage hit $423 on staging alone — would have been ~$2,800/mo if left running. The math only works if refresh is delta-aware from day one.

Implementation is tracked at #98 with concrete acceptance criteria. This MUST land before any production cutover — without it, the data goes stale fast (CMS updates coverage + npis every ~19 hours per /versions) AND we burn cluster cost we don't have to.

The pattern is three-layer cache invalidation:

CMS /versions poll (~30 sec, daily) — exit early if coverage/npis/drugs haven't advanced. ~99% of weekdays this is the entire run.
Issuer-file HEAD checks (~2 min when CMS advanced) — HTTP HEAD against each issuer's MRF URLs, compare Last-Modified + ETag to per-issuer state in a new mrf_file_state_staging collection. Most issuers republish every 2-4 weeks; ~95% are unchanged on any given day.
Record-level diff (only on changed files) — stream-parse, hash each NPI/RxCUI record, bulk-upsert only the deltas. Real-world: ~5% of records change between issuer file versions.
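
The three layers can be sketched as three pure checks. Everything here is illustrative: the /versions shape (dataset name mapped to a version stamp) is an assumption, the function names are hypothetical, and the FNV-1a hash is a stand-in; a real pipeline would more likely use a cryptographic digest such as SHA-256 over a canonical serialization.

```typescript
// Layer 1: has any dataset we care about advanced since the last run?
type VersionStamps = Record<string, string>; // assumed shape
function cmsAdvanced(
  prev: VersionStamps,
  curr: VersionStamps,
  datasets: readonly string[] = ["coverage", "npis", "drugs"],
): boolean {
  return datasets.some((d) => prev[d] !== curr[d]);
}

// Layer 2: compare HEAD validators against stored per-file state.
interface FileState {
  lastModified: string | null;
  etag: string | null;
}
function fileChanged(stored: FileState | undefined, head: FileState): boolean {
  if (stored === undefined) return true; // never seen before
  if (head.etag === null && head.lastModified === null) return true; // cannot prove unchanged
  return stored.etag !== head.etag || stored.lastModified !== head.lastModified;
}

// Layer 3: hash each record and keep only the ones whose hash moved.
// Note: JSON.stringify is key-order sensitive, so this assumes a stable
// field order in the parsed records.
function hashRecord(record: unknown): string {
  const s = String(JSON.stringify(record));
  let h = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16);
}

function changedIds(
  storedHashes: ReadonlyMap<string, string>,
  incoming: ReadonlyArray<{ id: string; record: unknown }>,
): string[] {
  return incoming
    .filter(({ id, record }) => storedHashes.get(id) !== hashRecord(record))
    .map(({ id }) => id);
}
```

Each layer gates the next: no CMS advance means no HEAD checks, and an unchanged file means no stream-parse.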

Atlas tier strategy that follows from this:

Base: M30 ($382/mo, real rate from 2026-05-07 invoice) — handles runtime queries comfortably with the indexed lookups on (rxcui × plan_id) and (npi × plan_id)
Burst: scale to M50/M60 only during weekly full reconciliation (Sundays 03:00 UTC, ~2-4 hours of compute) — programmatic via Atlas CLI / API, return to M30 when done
Annual cost: ~$5K/yr base + ~$200/yr burst vs ~$11K/yr if M60 is left running
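
A quick check of the annual figures, using the rates quoted in this record. Assumptions labeled loudly: the burst is costed as the hourly difference between M60 and M30, the weekly burst runs the full 4 hours, and scaling transitions are free.

```typescript
const HOURS_PER_YEAR = 24 * 365; // 8,760
const m30PerMonth = 382; // from the 2026-05-07 invoice
const m60PerHour = 1.3;

const baseAnnual = m30PerMonth * 12; // 4,584
const m30PerHour = baseAnnual / HOURS_PER_YEAR; // about 0.52

// Upper end of the 2-4 hour weekly reconciliation window.
const burstHoursPerYear = 52 * 4;
const burstAnnual = (m60PerHour - m30PerHour) * burstHoursPerYear; // about 162

const m60AlwaysOnAnnual = m60PerHour * HOURS_PER_YEAR; // about 11,388

console.log({ baseAnnual, burstAnnual, m60AlwaysOnAnnual });
```

Under these assumptions the rounded figures above hold: roughly $4.6K base, under $200 burst, and about $11.4K for an always-on M60.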

Cross-cluster reference reads via AWS PrivateLink (decided 2026-05-08)

After the staging M60 → M30 resize landed, the migration question became: do we move the reference data (formularies_staging + providers_staging) to the prod cluster, or keep it on staging and have prod read it cross-cluster?

Decision: keep canonical reference data on staging cluster, prod reads via AWS PrivateLink.

Cost framing: at our actual Atlas pricing, M10 → M30 on prod is +$326/mo. Over a year that's ~$3,900 of recurring spend just to host non-PHI public CMS marketplace data on prod. Compared to ~$8/mo for a PrivateLink endpoint that delivers the same parity, the migrate-to-prod option is hard to justify pre-funding.

Why PrivateLink and not the simpler patterns:

Path	Why ruled out
Migrate data to prod cluster	$326/mo recurring, also bloats prod's audit surface with 2.14M provider + 12.5K formulary records that don't need to be there pre-EDE
Standard VPC peering (mirror Phase 7/8)	Both Atlas projects use the same default CIDR `192.168.248.0/21` per AWS region; prod VPC's route table can't have two routes to the same destination CIDR. Architectural blocker.
IP allowlist of prod NAT EIP on staging Atlas	Public-internet path, fragile to NAT EIP rotation, leaves a permanent allow-from-NAT rule auditors will question. Worst on every dimension.
AWS PrivateLink	Chosen. Identity-bound at AWS account level, AWS-backbone-only, no public IP exposure, no CIDR conflicts. Atlas-supported pattern.

Compliance posture (the dispositive reasoning):

Atlas BAA covers both projects — MongoDB BAA is signed at the organization level, not per-project. Both askflorence-prod and askflorence-staging inherit the same BAA. Action: get both project IDs explicitly enumerated in writing with MongoDB Sales (tracked in #57) — one-line email, free.
EDE Phase 3 audit posture: this IS the correct segmentation. Auditors flag unsegmented architectures where PHI mingles with non-PHI. Our design — PHI ultimately on prod cluster, non-PHI public §1311 marketplace data isolated on staging cluster, one-way private read prod → staging — is textbook data-classification segmentation. Non-PHI Atlas project is out of scope of EDE Phase 3 audit by definition (same posture as calling CMS Marketplace API directly). Action: explicit data-classification entry in our SSP describing staging cluster as public marketplace reference data only.
HIPAA §164.312 Technical Safeguards: PrivateLink + Atlas TLS 1.2+ + Atlas audit logging satisfies access control, audit, integrity, authentication, transmission security. §164.312 doesn't formally bind the staging cluster (non-PHI), but mirroring the controls is defense-in-depth + future-proofing.
AWS Security Reference Architecture alignment: cross-account/cross-project PrivateLink for data-classification-segmented workloads is the recommended SRA approach for regulated workloads. Not a novel design.

Implementation (committed 2026-05-08):

Atlas-side endpoint service: com.amazonaws.vpce.us-east-1.vpce-svc-0d8138ea0f6542afa (Atlas endpointId 69fe75c5b02c024f32d2af50, staging project 69e31af12fd2c0aef51bbb41)
AWS-side VPC interface endpoint: vpce-0c81aea11e29bb928 in prod VPC vpc-09201679b87261b6d, both private subnets, dedicated SG askflorence-prod-atlas-staging-privatelink
Terraform: infra/envs/prod/atlas-staging-privatelink.tf (named af-prod-to-staging-reference-pl for audit-trail clarity)
Code: src/lib/db.ts exposes getReferenceDb() alongside getDb(). Prod ECS sets MONGODB_REFERENCE_URI to the Atlas-issued private connection string for the staging cluster; falls back to MONGODB_URI when unset (dev + staging deploy posture).
Cost: ~$8/mo for the PrivateLink endpoint hours
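
The fallback rule for the reference connection string can be sketched as a pure function. This is not the code in src/lib/db.ts: resolveMongoUris is a hypothetical name, and treating an empty MONGODB_REFERENCE_URI as unset is our assumption.

```typescript
type Env = Record<string, string | undefined>;

// Reference reads use MONGODB_REFERENCE_URI when set; otherwise they
// share the primary connection (dev + staging deploy posture).
function resolveMongoUris(env: Env): { primary: string; reference: string } {
  const primary = env["MONGODB_URI"];
  if (!primary) throw new Error("MONGODB_URI is required");
  const reference = env["MONGODB_REFERENCE_URI"] || primary;
  return { primary, reference };
}
```

A caller would pass process.env; keeping the rule in one place is what makes a later move to a single cluster an env var flip.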

When to revisit (move data to prod cluster):

EDE Phase 3 / carrier BAA cutover — once prod owns PHI behind BAA, the M30 tier is justified and operational separation between clusters compounds the audit benefit
Funding lands — $326/mo recurring becomes acceptable when runway isn't tight
Cross-cluster latency becomes user-perceptible — if PrivateLink-mediated reads add measurable p95 latency on the /plans surface, migrate to single-cluster
Staging cluster availability becomes a real risk — at scale, if staging Atlas downtime starts producing prod degradation we can't tolerate

Until any of those: stay on the cross-cluster PrivateLink pattern. It's the right answer for our pre-funding posture and the right migration foundation when the trigger fires (env var flip, no code change).

Open mitigations being implemented alongside this decision:

CI guard against data-classification drift — tracked at #100 / ENG-239. Both phases shipped:
- Phase 1 shipped 2026-05-08: static CI guard at scripts/audit/staging-collections-guard.ts runs on every PR via .github/workflows/staging-collections-guard.yml. Fails the build if any getReferenceDb() call accesses a collection not on STAGING_ALLOWED_COLLECTIONS (defined in src/lib/db.ts). Catches string-literal, dynamic-name, and inline-call patterns; verified against 3 synthetic violation cases the same day.
- Phase 2 shipped 2026-05-09: live nightly drift check at scripts/audit/staging-cluster-drift.ts runs at 08:00 UTC daily via .github/workflows/staging-cluster-drift.yml. Audits the actual Atlas state of app_read_staging (the cross-cluster reader): verifies the user has exactly one role (role_reader_reference@admin) granting only FIND on exactly the expected 2 collections (formularies_staging + providers_staging) and nothing else. Opens a P1 GitHub issue on drift. Catches the runtime cases the static guard cannot — privilege escalation via Atlas Admin UI, out-of-band role changes, etc. Verified against 3 synthetic violations (extra collection / wider action / extra role on user) the same day. As part of Phase 2 the user's role was tightened from built-in read@askflorence to per-collection custom role; verified prod cross-cluster reads still healthy after the swap.
The hand-discipline backstop (only public CMS-derived fields written to staging by ingest scripts) remains as defense-in-depth.
Staging Atlas IP access list hardening (post-launch) — once we cut over fully to AWS-prod and don't need ad-hoc script access from local IPs, the staging cluster's IP allowlist should be reduced to internal CIDRs + PrivateLink only (removing the temp-audit IP and any laptop IPs). Tracked under #71 (Phase 12 compliance docs). NOT done tonight because we're still pre-launch and the §1311 ingest scripts run from operator machines.
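
To illustrate the idea behind the Phase 1 static guard, here is a deliberately simplified sketch. It is not scripts/audit/staging-collections-guard.ts: it only catches string-literal collection names (the real guard also covers dynamic-name and inline-call patterns), the .collection(...) call shape is an assumption, and the allowlist contents are inferred from the two collections named in this record.

```typescript
// Assumed contents; the real list lives in src/lib/db.ts.
const STAGING_ALLOWED_COLLECTIONS = new Set(["formularies_staging", "providers_staging"]);

// Returns collection names reached via getReferenceDb() that are not
// on the allowlist. String-literal names only.
function findViolations(source: string): string[] {
  const pattern = /getReferenceDb\(\)\s*\)?\s*\.collection\(\s*(["'`])([^"'`]+)\1\s*\)/g;
  const violations: string[] = [];
  for (const match of source.matchAll(pattern)) {
    const name = match[2];
    if (name !== undefined && !STAGING_ALLOWED_COLLECTIONS.has(name)) {
      violations.push(name);
    }
  }
  return violations;
}
```

A CI wrapper would run this over changed files and fail the build on any non-empty result.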

Follow-up tasks (separate work, not blocking the demo)

For tomorrow / next week / before OE 2027:

#92 — Re-validate §1311 staging audit at 100% match. Tier-6 harness landed at 99.94% (16 NPIs reconciled). Re-run against current dataset to measure drift since 2026-05-03.
#93 — Design refresh cadence for formularies + providers. CMS /versions says coverage data updates sub-daily. Pick: nightly full re-ingest, /versions-driven incremental, hybrid, or drift-detection-only. Cost vs staleness trade-off.
#94 — Investigate alternative data sources. §1311 MRFs are one path. Time-box research on NPPES, NIPR, per-issuer REST APIs, third-party aggregators. One-page comparison + recommendation.
#95 — Decide fallback ordering. Owned-data-first with CMS fallback on miss (fastest, owns staleness risk) vs CMS-first with staging fallback on rate-limit (always-fresh, owned only on overflow). Pick after cadence + audit-revalidation land.
#96 — Phase D provider-network fallback parity. Mirror src/lib/drug-tier-fallback.ts for providers (src/lib/provider-network-fallback.ts). Required for Mitigation 6 on the doctor side.
#98 — Delta-aware MRF refresh pipeline (LOAD-BEARING). Three-layer cache invalidation (CMS /versions poll → issuer-file HEAD checks → record-level hash diff). Without this, §1311-owned-data is operationally unsustainable. New mrf_file_state_staging collection tracks per-issuer per-file state (Last-Modified, ETag, content hash, last-fetch timestamps). Daily run completes in <5 min on no-op days, <30 min on typical change days. GitHub Actions cron primary; ECS scheduled task fallback for VPC-egress issuers. Must land before any production cutover — the 2026-05-06 M60 sticker shock proved the math only works with delta-aware refresh.
Drop the three ingest-tracking collections via scripts/db/drop-mrf-staging-collections.js (explicit allowlist, dry-run by default, requires MONGODB_DROP_URI privileged user for --apply). Phase 6 of the original plan, scope reduced from 5 → 3.

These are open tasks, not blockers. The current code already uses formularies_staging for the Eliquis-class CMS tier-omission case (see src/lib/drug-tier-fallback.ts) and the MVP path is shippable as-is. The above items harden the path for OE scale.

Linked

#17 — Phase C drug formulary (paused)
#18 — Phase D provider directory (paused)
#53 — Phase E plan detail page (now CMS-API path)
#86 — CMS data is not annual; rethink ingest cadence
#87 — Use /versions endpoint for incremental re-ingest
#92 — Re-validate §1311 staging audit at 100% match
#93 — Design refresh cadence for formularies + providers
#94 — Investigate alternative data sources
#95 — Decide fallback ordering: owned-first vs CMS-first
#96 — Phase D provider-network fallback parity
#98 — Delta-aware MRF refresh pipeline (load-bearing for sustainable §1311 ownership)