SBE state watchouts + decisions
For any work touching SBE state data (CA, NY, PA, NJ, MA, WA, CO, CT, MD, etc.) — read this doc BEFORE designing or implementing. It captures state-specific decisions already made (data sources, coarsening trade-offs, anon-endpoint legalities, gaps in our plan store) so we don't re-derive them every time. When a session opens a new state's work, add a section here.
Index:
- California — Phase C/D complete (one open gap: ENG-410)
- New York — plans ✅, coverage NOT ingested (ENG-412); NPPES-native so no CA bridge gap
- Pennsylvania — Phase 0 research locked (ENG-418); GetInsured-stack pilot, federal APTC only, no SBP
- New Jersey — all 3 phases shipped (ENG-438) + 2026-06-10 Horizon plan-id repair (ENG-447)
- Illinois — ALL PHASES COMPLETE 2026-06-10 (ENG-448); GetInsured stack, 13 RAs / 7 carriers
- Virginia — ALL PHASES COMPLETE 2026-06-10 (ENG-450); GetInsured stack, 12 RAs / 6 carriers
- Nevada — Phase A + partial B/C 2026-06-10 (ENG-451); GetInsured stack, 4 RAs / 9 carriers
- Template for new states — what to investigate first
California
Status: Phase C (drugs) ✅ shipped (ENG-395 / PR #505). Phase D (providers) ✅ ingested (ENG-408 / PR #TBD). Route extension state-aware ✅ shipped (ENG-407 / PR #510). SBE estimate Wave 1+2 ✅ shipped (ENG-373/374).
Data sources
| Layer | Source | Auth | Schema notes |
|---|---|---|---|
| Plans + pricing | Covered California rating-area data (Wave 2 / ENG-374) | curated harvest | 169 CA 2026 plans in plans collection, 11 carriers, puf: {} empty (see below) |
| Drug formularies | Carrier marketplace formulary PDFs (Kaiser, Blue Shield, Anthem, Health Net, Molina, LA Care, SHARP, IEHP, CCHP, Valley Health, Western Health) | public PDFs | parsed into formularies_staging, 17,447 drugs × 169 plans, tagged source: ca_<carrier>_2026_marketplace_formulary |
| Provider directory | CalHEERS anon endpoint backed by Symphony (IHA + Availity, SB 137) | anon (no auth) | 165,974 providers in providers_staging namespaced _id: "ca-sym:<providerId>", tagged source: ca_symphony_2026 |
| Live provider search | Same anon endpoint via coveredca-provider-proxy.ts | anon | 24h cache, ~5sec per (zip, radius, year) query |
Standing decisions
1. CA plan docs have empty puf: {}
PR #460 / ENG-374 ingested CA pricing + plan metadata but did NOT puf-augment. Implications:
- puf.networkId not available → provider-plan mapping uses HIOS issuer prefix (see #2)
- puf.formularyId not available → drug coverage uses formularies_staging via state-aware dispatch (ENG-407)
- puf.qualityRating not available → no star ratings on CA plan cards (acceptable v1)
- puf.sbcScenarios not available → no SBC year-cost scenarios on plan detail page (acceptable v1)
Recommendation if puf-augment desired later: parse CC-published plan detail templates + carrier MRFs for the cost-share + scenario data, write to puf.* on each CA plan doc (same shape as FFM).
2. Provider-plan mapping is HIOS-prefix coarse, not per-network precise
Symphony returns networkId like 40513CAN011-2600 (Kaiser network 011, segment 2600). Without puf.networkId on our plan docs, we can't do per-network mapping. Instead we use the first 5 chars (HIOS issuer prefix) — every provider whose networkId starts with 40513 is treated as in-network for ALL Kaiser CA plans.
Why this is acceptable for CC: Covered California's Standard Benefit Designs (SBP) mean marketplace plans within a metal tier share standardized cost-sharing AND typically share the provider directory at the carrier level. A doctor credentialed in Kaiser's CA network is effectively in-network for all Kaiser CA marketplace plans.
Trade-off: can over-attribute when a carrier runs distinct narrow networks (HMO vs PPO vs EPO). Recommendation: refine to per-network when we license Symphony commercially OR ingest CA carrier MRFs.
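A minimal sketch of the coarse match described in this decision. The function names are hypothetical (not taken from the codebase); only the "first 5 chars = HIOS issuer prefix" rule comes from the text above.

```typescript
// Hypothetical helpers illustrating the HIOS-prefix coarsening.
// Assumption: both Symphony networkIds and our plan IDs start with the 5-char HIOS issuer ID.
const HIOS_PREFIX_LENGTH = 5;

function hiosIssuerPrefix(id: string): string {
  return id.slice(0, HIOS_PREFIX_LENGTH);
}

// Coarse check: a provider counts as in-network for a plan when ANY of its
// Symphony networkIds shares the plan's issuer prefix. Network number and
// segment (e.g. "N011-2600") are deliberately ignored.
function isInNetworkCoarse(providerNetworkIds: string[], planId: string): boolean {
  const planPrefix = hiosIssuerPrefix(planId);
  return providerNetworkIds.some(
    (networkId) => hiosIssuerPrefix(networkId) === planPrefix,
  );
}
```

Because the network number and segment are dropped, an HMO-only provider also matches the same carrier's PPO plans. That is the over-attribution named in the trade-off.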
3. NPI bridge NOT NEEDED for the customer flow (decided 2026-05-28)
Symphony anon doesn't return NPPES NPI (0/85,735 probed). I initially flagged this as a major gap.
Founder reframe: real customer flow is ZIP-first — enter ZIP → see plans → search doctors within those plans. They never come in holding an NPPES NPI. Doctor search returns Symphony-providerId-keyed results that link directly to plans via the HIOS prefix map. Saved doctors carry Symphony providerId, NOT NPI. The "is THIS doctor I already saved on this plan?" check uses the same providerId we wrote at save time.
Therefore: NPI bridge is solving a problem nobody has. external_ids.npi: null on every CA Symphony doc is fine. If we ever surface NPPES-keyed doctors in the CA UX later (e.g. agent-side workflow), we'd reopen this. Not before.
4. accepting defaulted to "accepting" on every CA Symphony row (decided 2026-05-28)
Symphony anon doesn't return the new-patients-accepting flag. We default to "accepting".
Founder reframe: if a customer's existing doctor isn't taking new patients on their new plan, "they just go get a new doctor." Not a launch blocker. Florence flow surfaces 3-5 PCP suggestions per plan — if one doesn't pan out, the user picks another.
SB 137 sub-argument: California SB 137 mandates monthly Symphony directory refresh — providers who stop accepting are supposed to be removed from the directory within 30 days. So the directory's currency rule is itself a soft proxy for "currently accepting." Not airtight but reasonable.
If we ever need the explicit flag later: Symphony commercial license includes it OR live-verify against the anon endpoint at suggest time.
5. Tier-aware copays are NOT a CA gap (decided 2026-05-28)
Initially flagged as a limitation. Wrong on inspection.
Reality: doctor copays come from the plan doc (plan.copays.primaryCare), not from the provider's network tier. For CC SBP plans there's effectively one in-network tier — providers are in-network or not. Multi-tier networks (where Preferred costs $10 / Standard costs $30 on the SAME plan) exist on some FFM commercial plans but not on CC marketplace.
Therefore: defaulting network_tier: "InNetwork" on every CA Symphony row is correct. Plan card shows $5 primary care visit from the plan doc; in-network/out-of-network is the only check needed.
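Decisions #3 to #5 together fix the defaults written on every CA Symphony row. A sketch of that mapping, assuming a crawl row that carries at least providerId; the interface and function names are illustrative, not the ingest script's actual types.

```typescript
// Assumed minimal crawl-row shape; the real Symphony payload has more fields.
interface SymphonyCrawlRow {
  providerId: string;
}

// Field names (_id, source, external_ids.npi, accepting, network_tier) follow
// this doc; the interface itself is a sketch.
interface CaProviderDoc {
  _id: string;
  source: "ca_symphony_2026";
  external_ids: { npi: string | null };
  accepting: "accepting";
  network_tier: "InNetwork";
}

function toCaProviderDoc(row: SymphonyCrawlRow): CaProviderDoc {
  return {
    _id: `ca-sym:${row.providerId}`, // namespaced so it cannot collide with NPI-keyed docs
    source: "ca_symphony_2026",
    external_ids: { npi: null },     // decision #3: anon endpoint returns no NPPES NPI
    accepting: "accepting",          // decision #4: flag not returned, default applied
    network_tier: "InNetwork",       // decision #5: one in-network tier on CC plans
  };
}
```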
6. Symphony anon endpoint legal posture
Scraping CalHEERS' anonymous SPA endpoint is unofficial. Acceptable as backend interim solution. NOT marketable as "powered by Symphony" until we license directly from IHA. Symphony customer login at symphony.iha.org; IHA Oakland 510-208-1740 for downstream-data-consumer subscription pricing (no public price sheet; expect $5-20K/yr).
When to license: when CA-derived revenue justifies the cost AND any of these become acute:
- We need per-network precision (kills the HIOS-prefix coarsening of decision #2)
- We need explicit accepting flag (decision #4 upgrade)
- We start marketing CA-specific UX externally and want the legal cover
- We add another CA state-specific feature that benefits from the richer schema (languages, sex, board cert, education, etc.)
7. Provider DISCOVERY vs COVERAGE — the /plans name-search gap (found + verified 2026-05-28, ENG-408 post-ship trace)
This is the one genuinely open CA gap. There are TWO ways a user surfaces a provider, and they use DIFFERENT identifier systems:
| Surface | Discovery path | ID returned | Coverage check | CA status |
|---|---|---|---|---|
| Florence ("show doctors on this plan") | /api/providers/suggest → CalHEERS Symphony | Symphony providerId | matches our providers_staging (_id: ca-sym:<id>) | ✅ Works |
| /plans CoveragePanel ("search MY doctor by name") | /api/providers/autocomplete → NPPES (national NPI registry) | NPPES npi (10-digit) | queries providers_staging by npi | ❌ Broken for CA |
Why /plans is broken for CA providers: NPPES autocomplete does return CA doctors (NPPES is the national federal registry, state-agnostic), so discovery looks like it works. But it hands back the doctor's NPPES NPI. Our CA Symphony docs are keyed by Symphony providerId with external_ids.npi: null — there is no NPPES↔Symphony bridge. So /api/providers/covered finds no match → returns NotCovered for EVERY name-searched CA doctor, even in-network ones.
Verified on apex 2026-05-28: NPPES NPI 1265150155 (Riley Smith LCSW, SF) checked against Blue Shield plan 70285CA135000101 with state=CA → NotCovered, despite the carrier being in-network. This is the NPI-bridge gap (decision #3) biting the /plans surface specifically.
This does NOT affect: drugs (RxCUI is a single national identifier — search AND coverage both use it, fully works end-to-end for CA), CA pricing, CA subsidy, or the Florence provider flow.
The contained fix (not yet built): /api/providers/search-ca already exists (PR #505, Symphony-backed) but is wired to NO UI. For CA, point the /plans CoveragePanel doctor-search at it instead of /api/providers/autocomplete. It returns Symphony providerIds, which match our ingested data → coverage closes the loop. Saved CA doctors would carry providerIds (consistent with the ZIP-first flow the founder described). Medium effort; needs a ProviderHit-shape adapter for the Symphony response. File as a ticket when /plans CA provider-coverage becomes a priority — if Florence is the primary CA consumer surface, this can wait.
Decision posture: CA is "sufficiently working" for pricing + subsidy + drugs (full) + Florence-flow providers. The /plans name-search provider coverage is the single known gap, blocked on either (a) wiring search-ca into CoveragePanel for CA, or (b) the Symphony license / NPPES cross-reference that closes the NPI bridge generally.
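Option (a) amounts to choosing the discovery endpoint by state so the ID that comes back is the ID coverage is keyed on. A sketch under that assumption; providerSearchRoute and its return shape are hypothetical, and the ProviderHit-shape adapter mentioned above is not shown.

```typescript
type ProviderIdKind = "symphonyProviderId" | "nppesNpi";

interface ProviderSearchRoute {
  endpoint: string;
  idKind: ProviderIdKind;
}

// Hypothetical dispatch for the /plans CoveragePanel doctor search.
// CA goes to the Symphony-backed route so saved doctors carry providerIds
// that exist in providers_staging; every other state keeps NPPES autocomplete.
function providerSearchRoute(state: string): ProviderSearchRoute {
  if (state.trim().toUpperCase() === "CA") {
    return { endpoint: "/api/providers/search-ca", idKind: "symphonyProviderId" };
  }
  return { endpoint: "/api/providers/autocomplete", idKind: "nppesNpi" };
}
```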
Known limitations (acceptable for v1)
- Multiple practice addresses per provider — current crawl captures 1 address per Symphony provider. Symphony's anon endpoint returns one row per provider per zip query; cross-zip dedupe folds duplicates by providerId without merging addresses. Could re-crawl per-provider but adds 165K serial API calls. Not a UX blocker.
- Phone number, languages, sex, education, board certification, facility type — none returned by anon endpoint. License Symphony to unlock.
- Provider photo / profile / reviews — out of scope for v1 across all states.
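The single-address limitation follows from how cross-zip dedupe works. A sketch of that fold, assuming the first row seen for a providerId wins (the source does not say which row is kept); the row shape is illustrative.

```typescript
// Assumed crawl-row shape: one row per provider per zip query.
interface CrawlRow {
  providerId: string;
  zip: string;
  address: string;
}

// Fold duplicates by providerId WITHOUT merging addresses: later rows for the
// same provider are dropped, so only one practice address survives.
function dedupeByProviderId(rows: CrawlRow[]): CrawlRow[] {
  const firstSeen = new Map<string, CrawlRow>();
  for (const row of rows) {
    if (!firstSeen.has(row.providerId)) {
      firstSeen.set(row.providerId, row);
    }
  }
  return Array.from(firstSeen.values());
}
```

Merging addresses instead would mean collecting every address per providerId into an array, which is what a multi-address re-crawl would feed.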
Follow-ups (Linear-tracked, NOT blocking CA launch)
- [ ] /plans CA provider-search → search-ca wiring (decision #7, [ENG-410]): the one user-facing gap. Point CoveragePanel doctor-search at /api/providers/search-ca for CA so it returns Symphony providerIds that match ingested data. File when /plans CA provider coverage is prioritized.
- [ ] Symphony license inquiry: IHA Oakland (no ticket filed; file when revenue justifies). Closes decision #3 (NPI bridge), #4 (accepting flag), #2 (per-network) at once.
- [ ] Per-network refinement (decision #2 upgrade) — depends on license OR CA MRF ingest
- [ ] CA carrier §1311 MRF ingest — bigger project (~weeks); only if Symphony license declined AND we need NPI/tier data
- [ ] Multi-address re-crawl — only if user research surfaces a need
- [ ] CA puf augment (decision #1) — only if star ratings or SBC scenarios become a real ask for CA
- [ ] ~471 SF/Oakland providers from crawl cold-start — RECOVERED 2026-05-28 (idempotent re-ingest). Crawl now has retry-with-backoff (PR #525) so it can't recur.
Verified end-to-end state (apex, 2026-05-28)
| Capability | Status |
|---|---|
| CA plan pricing (/api/plans) | ✅ 28 real SF plans, CAPS/CAPC math |
| CA subsidy estimate (/api/sbe-estimate) | ✅ federal APTC math |
| CA drugs: search → coverage | ✅ FULL — national RxCUI search + formularies_staging coverage |
| CA provider coverage-check API (/api/providers/covered, state=CA) | ✅ Symphony providerId → Covered/InNetwork |
| Florence provider suggestions (/api/providers/suggest) | ✅ Symphony-backed |
| /plans doctor search-by-name → coverage | ❌ NPPES NPI can't bridge to Symphony providerId (decision #7) |
| Data integrity | ✅ 165,974 CA providers, 17,447 CA drugs; FFM cohort byte-identical (2,145,064) |
Cross-references
- ENG-395 — Phase C/D ingestion (Done)
- ENG-407 — State-aware route dispatch (Done, PR #510)
- ENG-408 — CA provider ingest: 165,974 providers + crawl retry (Done, PRs #523/#524/#525)
- ADR 0004 — Cross-cluster Atlas PrivateLink (why staging holds provider/formulary data)
- docs/data-sources/ca-phase-c-d-ingestion-playbook.md: full methodology for future per-state replays
- scripts/db/ingest-ca-providers.cjs: provider crawl/map/ingest/verify CLI (run via Fargate smoke-runner task family)
New York
Status (verified 2026-05-28 apex + DB trace): plans + pricing ✅; drug + provider coverage ❌ NOT ingested. Phase C/D ingest is ENG-412.
⚠️ Correction: the prior version of this section claimed NY had "FFM-style provider data inherited from legacy ingest." That was wrong — verified empirically. There is no NY provider or drug coverage data in our collections.
Verified state
| Capability | NY status | Evidence |
|---|---|---|
| ZIP → county | ✅ works | 10001 → New York County, 11201 → Kings |
| ZIP → plans + pricing + subsidy | ✅ works | 282 NY 2026 plans; calculateNyEligibility() in owned-plans.ts |
| Doctor search by name → matched plans | ❌ broken | 0 NY providers in providers_staging across all 8 major carriers (Fidelis 25303, Healthfirst 91237, MetroPlus 11177, Excellus/Highmark 78124, MVP 56184, EmblemHealth 88582, CDPHP 94788, Oscar 74289) |
| Medication search → matched plans | ❌ broken | 0 of 282 NY plans cover atorvastatin; NY formularies not ingested. (Only stale partial 74289NY* Oscar docs exist — 4,023 — and they don't map to current plans / common drugs. /api/drugs/covered NY Fidelis + Lipitor → NotCovered.) |
Key facts (locked)
- NY is NPPES-NPI-native: NY State of Health (NYSOH) + NY DFS use the national NPI registry, NOT a Symphony-style internal ID. This is the big advantage over CA: once NY providers are ingested keyed by NPPES NPI (_id: npi, like FFM), the /api/providers/autocomplete (NPPES) → /api/providers/covered join works with no bridge gap. The CA limitation (ENG-410, NPPES↔Symphony) does NOT apply to NY. NY will be the first fully-complete SBE provider surface.
- puf IS populated for NY (unlike CA's empty puf; see decision #1 in the CA section). So NY provider-plan mapping could use puf.networkId per-network precision instead of CA's HIOS-prefix coarsening. Verify during Phase D.
- NY plan IDs use 14-char FFM format (42640NY0320001), distinguishable from CA's 16-char only by the state field, never plan-ID format. This is exactly why coverage dispatch keys off state (usesOwnedCoverageData, ENG-411), not plan-ID regex.
- Coverage dispatch already routes NY correctly: NY is in OWNED_COVERAGE_STATES (ENG-411). No route changes needed; the flow lights up the moment data lands in formularies_staging + providers_staging.
- Data sources DIFFER from CA and need a discovery pass (ENG-412 Phase 0): NY carrier formulary PDFs/JSON for drugs (CA carrier-PDF playbook is the template); NYSOH provider-search endpoint (probe for a CalHEERS-style anon endpoint) / NY DFS / NY DOH network-adequacy / per-carrier directories / NY §1311 MRFs for providers. See docs/data-sources/ny-phase-c-d-ingestion-playbook.md (created during ENG-412 Phase 0).
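A sketch of why dispatch keys off state. The names OWNED_COVERAGE_STATES and usesOwnedCoverageData come from this doc, but the bodies below are assumptions, as is CA's membership in the set (the text only confirms NY).

```typescript
// Assumed contents: NY is confirmed above (ENG-411); CA is included here as an assumption.
const OWNED_COVERAGE_STATES: ReadonlySet<string> = new Set(["CA", "NY"]);

// State-keyed dispatch: the plan ID is never inspected.
function usesOwnedCoverageData(state: string): boolean {
  return OWNED_COVERAGE_STATES.has(state.trim().toUpperCase());
}

// Counter-example: a format-only check cannot tell NY from an FFM state,
// because both use the same 14-char HIOS layout.
function looksLikeFfmPlanIdFormat(planId: string): boolean {
  return /^\d{5}[A-Z]{2}\d{7}$/.test(planId);
}
```

With this shape, lighting up a new owned state is a one-line set change rather than a new regex branch.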
Follow-up
- ENG-412: NY Phase C/D ingest (composite: source discovery → drug formularies → provider directory → verify). Same safety covenants as ENG-395/408 (snapshot, FFM+CA byte-identical, cluster guard, collection allowlist, $addToSet, dry-run, Fargate in-VPC).
Pennsylvania
Status (Phase A complete + live on apex, 2026-06-02): plans ✅ ingested on BOTH clusters (staging + prod), 291 HIOS plans across 9 RAs, ENG-418 PR #566 + A.4 prod-cluster apply. Apex smoke verified across 5 anchor ZIPs covering 5 of the 9 rating areas (19103 RA8 / 16501 RA1 / 15834 RA2 / 18015 RA6 / 17101 RA9); real PA plans returned in every case (Jefferson / UPMC / Geisinger / Capital BC top per RA). Cohort guard: non-PA 2026 unchanged at 4,495 on both clusters. Drug + provider coverage NOT ingested. Phase B (drugs) + Phase C (providers) come next, mirroring the NY ENG-412 sequencing (plans → drugs → providers). Phase A is the pilot for the GetInsured 7-state stack (PA / NJ / IL / VA / NV / NM / ME); the same scraper + normalize + write pattern ports to each sibling state in ~1 day after.
Decision 10 (added 2026-06-02, ENG-418 A.4): plans + zip_county must live on the PROD cluster, not just staging. Apex /api/counties + /api/plans resolve via getDb() → MONGODB_URI → askflorence-prod-01.njkihm (the prod M10 cluster), NOT the staging cluster (askflorence-staging.efsikmv — that's the cross-cluster reference cluster for formularies_staging + providers_staging only, per ADR 0004). Initial Phase A.2.2 ingest wrote only to staging because that's what local .env.local points at; apex stayed broken until prod-cluster apply landed. Going forward, every new state ingest writes to PROD first, then mirrors to staging via copy-ca-data-prod-to-staging.cjs pattern. The playbook checklist + cluster-targeting section in sbe-ingestion-playbook.md are the canonical reference now. Prod snapshots for this fix: B 6a1e6c1d085fb664b689eec7 (pre-apply) + C 6a1e6e23923b91f0f761d244 (post-apply).
Decision 11 (added 2026-06-10, ENG-451 NV close-out — founder-flagged): when an issuer→network mapping is ambiguous, or a scraped artifact (MRF / formulary URL) disagrees with the per-plan scrape data, DRIVE THE LIVE STATE EXCHANGE via Chrome MCP before concluding "gap" or "separate network." This is now a required diagnostic step, not optional. The exchange is the source of truth for what an issuer actually sells and which network/directory each plan uses — it outranks the MRF, which can be filed under a parent/reporting-entity HIOS.
Method (worked example — NV Community Care):
- Open the SBE's anonymous shop/window-shop tool (e.g. Nevada Health Link → enroll.nevadahealthlink.com), enter dummy applicant data (ZIP + age + income), reach the plan list.
- Use the issuer/carrier filter to isolate the carrier in question. Read the plan names it returns and click into a plan → provider directory / find-a-doctor link + formulary link.
- Compare against what we ingested. Community Care Health Plan of Nevada (HIOS 11765) filtered to "Battle Born State Plan Anthem Bronze/Gold/Silver" plans whose directory is anthem.com/find-care?alphaprefix=NVD, i.e. it IS Anthem's NV HMO entity on the Anthem find-care network. The Elevance PROVIDERS_NV.json MRF lists providers only under the parent HIOS 60156, so a naive "11765 not in the MRF → separate network → no data" read is wrong; the SBE shows the true mapping.
Corroborating signals already in our scrape (check these too — they often answer it without the MCP): per-plan networkName (Community Care's = "Anthem Battle Born ... Network"), networkURL / providerLink (= anthem.com/find-care...), and the plan name branding ("... Anthem Bronze ..."). Lesson: MRF reporting-entity HIOS ≠ marketed-issuer HIOS; co-branded affiliates file their MRF under the parent. Also: when a scraped doc URL 404s / shows "under construction" (NV Imperial's stale FNAV siteCode 5227003519), check the carrier's OWN live doc page for a fresher URL (8528279638) before declaring it unpublished.
Decision 12 (added 2026-06-10, NV/VA scenario audit — CRITICAL bug class): independent cities must NOT get " County" appended to plans.countiesServed. searchOwnedPlans filters by the county name from zip_county, so a plan's countiesServed entry MUST exactly match the zip_county name or that ZIP returns ZERO plans. The plan-build derived names by appending the literal " County" to the FIPS-to-county value. For independent cities (whose FIPS-map value already ends in "City") that produced names like "Virginia Beach City County" and "Carson City County", which do NOT match zip_county's "Virginia Beach City" / "Carson City" — so the ZIP shows "no plans available." This silently broke all 38 VA independent cities (Virginia Beach, Richmond, Norfolk, Chesapeake, Newport News, Hampton, Alexandria…) + NV Carson City, live on apex. Fix: only append " County" when the name does not already end in " City" (name.endsWith(" City") ? name : name + " County") — patched in all 6 build-{state}-plans-from-scrape-2026.ts. Data repair: scripts/db/fix-independent-city-counties.cjs (idempotent, cohort-neutral; staging done, prod pending next deploy). Only states with independent cities are affected (VA, NV; MD-Baltimore / MO-St.Louis if ever owned). Run scripts/audit/nv-scenario-audit.ts (per-RA × household × FPL sweep) + the cross-state county-name mismatch check for every new owned state. NY/CA/PA/NJ/IL verified clean.
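The Decision 12 fix and the mismatch check can be sketched as follows. The function names are illustrative; the rule itself (append " County" only when the name does not already end in " City") is the one stated above.

```typescript
// Normalize a FIPS-map county value into the name written to plans.countiesServed.
// Independent cities ("Virginia Beach City", "Carson City") are passed through unchanged.
function toCountyServedName(fipsCountyName: string): string {
  return fipsCountyName.endsWith(" City")
    ? fipsCountyName
    : `${fipsCountyName} County`;
}

// Cross-check: every countiesServed entry must exactly match a zip_county name,
// otherwise searchOwnedPlans returns zero plans for ZIPs in that county.
function findCountyMismatches(
  countiesServed: string[],
  zipCountyNames: ReadonlySet<string>,
): string[] {
  return countiesServed.filter((county) => !zipCountyNames.has(county));
}
```

One thing to check against zip_county: Virginia also has true counties whose names end in "City" (Charles City County, James City County). Whether the FIPS map stores them with or without the suffix is not stated here, so the mismatch check remains the safety net for those two.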
What we know
| Attribute | Value | Source |
|---|---|---|
| Marketplace name | Pennie | pennie.com |
| Platform vendor | GetInsured (same stack as IL/NJ/VA/NV/NM/ME/ID/MN/GA — 10 SBE deployments) | GetInsured press release |
| Public shop-and-compare | enroll.pennie.com/hix/preeligibility (Savings Calculator entry) → plan results page (URL pattern TBD, behind /hix/) | Browse-then-register flow per Pennie homepage |
| Plan management authority | Pennsylvania Insurance Department (PID) — handles plan certification + rate filings; submissions live at pa.gov/agencies/insurance/posted-filings-reports-company-orders/product-and-rate-filings/aca-health-rate-filings/ | PID 2026 rate release |
| Rating areas | 9 (numbered 1-9, county-aggregated) | PID rate filings reference RAs 1-9 |
| Plan ID format | 14-char FFM-style HIOS (e.g. 33709PA1420005 Highmark) | Highmark SBC PDFs at shop.highmark.com/content/sbcs/2026/... |
| 2026 issuer count | 14 carriers | healthinsurance.org PA 2026 |
| Major carriers | Highmark (multiple legal entities: Benefits Group, Coverage Advantage, BCBS), Independence Blue Cross (QCC), UPMC Health Options + Health Plan, Capital BlueCross, Geisinger Quality Options + Health Plan, Ambetter (rebranded from PA Health & Wellness), Partners, etc. | PID 2026 rate filings |
| State subsidy | None active for 2026 — Act 54 of 2024 created the "State Health Insurance Exchange Affordability Program" but it is NOT funded for 2026 | PHLP analysis |
| BHP (Basic Health Program) | None — only NY, MN, OR have BHPs | n/a |
| CSR for 2026 | Standard federal CSR (100-250% FPL, Silver only) — but federal CSR funding eliminated for 2026 → all PA silver plans carry a CSR Defunding Adjustment load (e.g. Partners 1.22x) baked into gross premium | PID 2026 rate releases |
| SLCSP-by-county reference | agency.pennie.com/wp-content/uploads/2025/11/Second-Lowest-Cost-Silver-Plan-by-County-PY-2026.pdf (PHIEA agency partner site) | direct L1 validation source |
| Drug formularies (Phase B — future) | Per-carrier (Highmark Essential Formulary; UPMC, IBC, Capital, Geisinger, Ambetter equivalents) — same per-carrier PDF/JSON pattern as NY | Highmark formulary pages at highmark.com/member/.../medical-drug-formulary |
| Provider directories (Phase C — future) | Per-carrier (no statewide CA-Symphony-style centralized directory; no NY-DOH-PNDS-style open dataset confirmed yet — needs separate research pass) | TBD |
Standing decisions (locked at Phase 0)
1. PA = federal APTC only — no calculatePaEligibility() needed
PA's Act 54 program is unfunded for 2026. There is no state subsidy to layer on. So Phase A.5 wiring needs NO per-state branch in /api/eligibility and NO calculatePaEligibility() helper. PA falls through the FFM proxy branch like any other federal state served from our own DB.
If Act 54 ever gets funded (watch the next PA budget cycle), reopen this — add a calculatePaEligibility() mirroring calculateCaEligibility(), branch /api/eligibility, and update docs/data-sources/state-subsidies.md. Until then, PA is the simplest state to ingest: pricing + CSR variants + done.
2. CSR Defunding Adjustment is already baked into Pennie's posted premiums — do NOT double-apply
For 2026 federal CSR cost-reimbursement was eliminated. PA mandated all silver plans (on AND off exchange) carry a CSR defunding load (e.g. Partners 1.22x, varies by carrier). The scrape captures Pennie's displayed premium as-is — the load is already in there. Our APTC math at runtime uses the same SLCSP as Pennie's calculator, so we benefit from the same "silver loading" effect (APTC bumps up across all metals because the benchmark went up). This is exactly the FFM-state behavior under the same federal change — no PA-special logic.
3. NO Standard Benefit Plan mandate — carriers file plans + CSR variants individually
Unlike California's CC SBP mandate where every issuer's Silver-94 has IDENTICAL deductible/copay/MOOP, PA does not mandate standard plan designs. Each carrier files Silver + its CSR variants per the federal CFR §156.420 AV targets, but the actual cost-share values differ per plan. Implications:
- Each Silver plan needs its OWN scraped CSR variants (94/87/73). We cannot apply a single canonical PA-SBP cost-share table the way scripts/db/data/sbe-showcase-plans-2026.ts does for CA.
- Scraper must drive Pennie's "view plan details" / SBC link per Silver plan to capture per-plan CSR cost-share, OR derive from carrier-published SBC PDFs (Highmark already has them at shop.highmark.com/content/sbcs/2026/...).
- Cleaner fallback: scrape silver gross premium per RA only, then derive CSR variants from federal CFR §156.420 AV targets applied to the plan's filed Silver base. Less accurate per-plan but matches the FFM cost-share pattern we already serve. Pick the trade-off in Phase A.1.
4. Plan IDs are 14-char FFM-style HIOS (observed in Highmark SBCs)
PA plan IDs look like 33709PA1420005 (5-char issuer + state + plan + variant) — identical format to FFM and NY. Coverage dispatch keys off state field (usesOwnedCoverageData, ENG-411), not plan-ID regex, so this Just Works once PA is added to OWNED_DATA_STATES. Same precedent as NY decision #3.
5. GetInsured stack scraper IS the long-term investment, not a PA-specific tool
Pennie's /hix/ is GetInsured's white-label SPA. Once we drive it cleanly headless once (PA), the scraper ports to NJ (GetCoveredNJ) / IL (Get Covered Illinois) / VA (Virginia's Insurance Marketplace) / NV (Nevada Health Link) / NM (beWellnm) / ME (CoverME) with mostly theme/selector deltas + per-state anchor-ZIP-per-RA lists. The PA build IS the M1 milestone investment — every other GetInsured state is ~1 day after.
6. NPPES-NPI native (assumption — to verify in Phase C)
PA uses CMS QHP certification (PID does plan management; carriers file rate templates with the standard FFM HIOS schema). Provider directories from PA carriers (Highmark / UPMC / IBC etc.) are commercial PPO/HMO networks that publish NPPES-NPI-keyed data via their member portals and via §1311 MRFs. Assumption: when we ingest PA providers (Phase C), they'll be NPPES-NPI-keyed end-to-end (like NY, unlike CA Symphony). No NPI bridge gap expected. Verify during Phase C source-discovery (ENG-419 or equivalent).
Phase 0 acceptance items (already locked above)
- [x] Marketplace identity + tech stack (GetInsured) → enables shared scraper investment
- [x] Rating area count (9) — exact ZIP→RA mapping deferred to Phase A.0.5 (live PID chart + CMS PA-GRA fetch needed; CMS endpoint timed out during research)
- [x] State subsidy posture (federal APTC only; no calculatePaEligibility() needed)
- [x] BHP posture (none)
- [x] Plan ID format (14-char FFM HIOS — confirmed via Highmark SBC URLs)
- [x] CSR structure (federal standard, defunding-load already in posted premiums, NO SBP mandate)
- [x] Drug formulary general approach (per-carrier PDFs — Phase B deferral OK)
- [x] Provider directory general approach (per-carrier or §1311 MRFs — Phase C deferral OK, NPPES-NPI assumed)
Phase A.0.5 — gate items RESOLVED (2026-06-01, live verification)
All four Phase A.0.5 gate items closed in one session via live Chrome-driven verification of enroll.pennie.com.
Resolution 1: County → Rating Area mapping (CMS PA-GRA → superseded by PHIEA SLCSP PDF)
CMS PA-GRA page consistently timed out via WebFetch. Better source found: the PHIEA Second-Lowest-Cost-Silver-Plan-by-County-PY-2026.pdf directly encodes county → rating area for all 67 PA counties AND the benchmark SLCSP plan + age-40 premium per county (the L1 validation source).
Extraction: pdftotext -layout + Python parser with PA county FIPS lookup table (eliminates the Centre County wrap risk where the PDF puts the county name on a separate line because Centre is the only PA county with partial-county service-area splits — 24 ZIPs in one set, 7 in another, both happen to share the same SLCSP for 2026).
Result: 67/67 counties extracted, all 9 RAs covered, anchor county per RA picked:
| RA | Anchor County | FIPS | Anchor ZIP | SLCSP age 40 | Benchmark Silver |
|---|---|---|---|---|---|
| 1 | Erie | 42049 | TBD | $487.34 | UPMC Advantage Silver |
| 2 | Cameron | 42023 | TBD | $830.19 | Geisinger Marketplace |
| 3 | Lackawanna | 42069 | TBD | $667.87 | Oscar Silver Classic |
| 4 | Allegheny | 42003 | TBD | $487.34 | UPMC Advantage Silver |
| 5 | Cambria | 42021 | TBD | $649.74 | Highmark Direct Blue EPO Premier Silver $0 |
| 6 | Northampton | 42095 | TBD | $544.16 | Jefferson Silver |
| 7 | Lancaster | 42071 | TBD | $593.37 | Ambetter Complete Silver |
| 8 | Philadelphia | 42101 | 19103 | $468.28 | Jefferson Silver |
| 9 | Dauphin | 42043 | TBD | $842.07 | Highmark Direct Blue EPO Silver $6000 |
Anchor ZIP per county picked in Phase A.1 (county FIPS → primary urban ZIP via USPS lookup).
Resolution 2: Anonymous browse on enroll.pennie.com — CONFIRMED end-to-end
Drove headless Chrome through the full prescreener → preferences → plan-selection flow with NO login required at any step:
- https://enroll.pennie.com/prescreener/ shows H1 "Shop for Health Coverage"; Login is a separate top-right link not gating the page
- After Continue: form schema = Plan Year (select 2026) + ZIP (text) + DOB (mm/dd/yyyy per household member) + tobacco/native/coverage checkboxes + Annual Tax Household Income (text). Add Spouse + Add Dependent buttons for household expansion.
- /prescreener/results returns "Estimated Tax Credit of $X/month" + "Your monthly premium may be about $Y/month" + "The second-lowest cost Silver level health plan (SLCSP) used to calculate the APTC shown above is $Z"
- /hix/private/preferences is an optional 4-step preferences wizard (providers / facilities / drugs / plan preferences), all skippable with "View Plans"
- /hix/private/planselection?insuranceType=HEALTH is the actual plan list, with 73 plans for the Philadelphia scenario
private/ in the URL is misleading — it is NOT account-gated; it just refers to the post-prescreener private-shopping context. Anon-browse works the full depth of the funnel.
L1 validation hit: Philadelphia ZIP 19103 / single age 40 / $50K income / 2026 plan year → Pennie computed SLCSP = $468/month, PHIEA PDF benchmark = $468.28/month. Delta: $0.28, consistent with Pennie displaying whole dollars. APTC math: $50K × 9.96% / 12 = $415/mo cap → APTC = SLCSP − cap = $468 − $415 = $53/mo, which matched Pennie's figure exactly.
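The L1 check is a two-line calculation. A sketch that reproduces it, assuming whole-dollar rounding at each step (the rounding rule is an assumption, not documented Pennie behavior) and using hypothetical function names.

```typescript
// Monthly cap = annual income × expected contribution rate / 12.
// The 9.96% rate is the one used in the validation scenario above.
function monthlyContributionCap(annualIncome: number, contributionRate: number): number {
  return Math.round((annualIncome * contributionRate) / 12);
}

// APTC = benchmark SLCSP premium minus the monthly cap, floored at zero.
function monthlyAptc(slcspMonthly: number, monthlyCap: number): number {
  return Math.max(0, Math.round(slcspMonthly - monthlyCap));
}

const capForScenario = monthlyContributionCap(50_000, 0.0996); // Philadelphia, single age 40
const aptcForScenario = monthlyAptc(468.28, capForScenario);   // PHIEA benchmark for RA8
```

For the Philadelphia scenario this yields a $415 cap and a $53 credit, matching the figures above whether the SLCSP input is $468 or $468.28.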
Resolution 3: PHIEA SLCSP PDF parsed — 67/67 counties → structured JSON
Saved to worktree at .tmp-eng-418/pa-counties-2026.json (gitignored). Will move to scripts/db/data/pa-rating-areas-2026.ts during Phase A.1 build.
Resolution 4: CSR variant approach — SCRAPE via Pennie's JSON API (no SPA scraping needed!)
🎯 MAJOR FINDING: Pennie has an undocumented JSON API endpoint that backs the plan list page:
```
GET /hix/private/getIndividualPlans?_=<random>
Cookie: <session cookie from prescreener flow>
→ application/json (2.3 MB for 73 plans × 145 fields each)
```

Sample response (Philadelphia anchor scenario, 2026, single 40yo, $50K, household=1):

```json
{
  "id": 27365,
  "issuerPlanNumber": "15983PA001000601", // 14-char HIOS + 2-char CSR variant suffix
  "name": "Focused Silver",
  "level": "SILVER",
  "issuer": "Ambetter Health of Pennsylvania, Inc.",
  "issuerId": "15983",
  "networkType": "HMO",
  "premiumBeforeCredit": 510.82,
  "premiumAfterCredit": 457.82,
  "annualPremiumBeforeCredit": 6129.84,
  "aptc": 53.00,
  "costSharingReductions": 0.00,
  "costSharing": "CS1", // CS1=no CSR / CS2=CSR-73 / CS3=CSR-87 / CS4=CSR-94
  "deductible": 6301,
  "intgMediDrugDeductible": 6300,
  "medicalDeductible": null,
  "drugDeductible": null,
  "oopMax": 8400,
  "intgMediDrugOopMax": 8400,
  "qualityRating": 3,
  "overAllQuality": ...,
  "sbcUrl": "https://...",
  "planBrochureUrl": "https://...",
  "providerLink": "https://...",
  "formularyUrl": "https://...",
  "hsa": false,
  "isPuf": true, // ← this plan is in CMS PUF data
  "benefitsCoverage": [...], // full benefits list
  "planCosts": {...}, // cost-share breakdowns per benefit
  "coinsurance": ...
  // + ~125 more fields including doctors[], facilities[], etc. when preferences are set
}
```

This makes the Phase A.1 scraper trivial:
- Playwright drives the prescreener (4-5 actions: navigate → fill ZIP/DOB/income → Continue → View Plans) to establish session cookies
- ONE `fetch('/hix/private/getIndividualPlans')` call returns ALL plans for that scenario as JSON
- To capture per-CSR-variant cost-shares: re-run prescreener at 4 income buckets per anchor ZIP (>250% FPL no CSR, 200-250% CSR-73, 150-200% CSR-87, <150% CSR-94). Pennie returns CSR-adjusted `deductible`/`oopMax`/`costSharing` per income bucket.
- Total API calls: 9 RAs × 4 CSR income buckets = 36 API calls + 36 prescreener flows for the full PA Phase A scrape. Estimated wall-time: ~30 min with conservative rate-limiting.
Carrier coverage verified (Philadelphia RA8 scenario): Ambetter Health of Pennsylvania, Oscar, Highmark Blue Shield, Independence Blue Cross, Jefferson Health Plans, Partners Insurance Company. 73 plans total: 27 Gold, 23 Silver, 23 Bronze. 6 issuers — matches expected RA8 carrier count.
This eliminates two open questions at once:
- Decision #3 trade-off resolved → SCRAPE per-CSR-variant directly from Pennie (the API returns CSR-adjusted cost-shares per income bucket, the same fidelity as `puf.csrVariants` for FFM).
- HIOS plan ID question resolved → the `issuerPlanNumber` field returns the full 14-char HIOS + 2-char variant suffix on every plan. No carrier SBC URL parsing needed.
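As an illustration of how the 16-character `issuerPlanNumber` and the `costSharing` code could be decoded during normalization, here is a small sketch. The helper names (`splitIssuerPlanNumber`, `CSR_BUCKETS`, `PlanKey`) are hypothetical; the CS1–CS4 meanings and the bucket labels (`no_csr`, `csr_73`, `csr_87`, `csr_94`) come from the sample response and the scraper CLI flags in this doc.

```typescript
// Hypothetical helpers; not the actual normalize-step code.
type CsrCode = "CS1" | "CS2" | "CS3" | "CS4";

const CSR_BUCKETS: Record<CsrCode, string> = {
  CS1: "no_csr",
  CS2: "csr_73",
  CS3: "csr_87",
  CS4: "csr_94",
};

interface PlanKey {
  hiosPlanId: string; // 14-char HIOS plan ID
  variantSuffix: string; // 2-char CSR variant suffix
  issuerPrefix: string; // leading 5-digit issuer ID
}

function splitIssuerPlanNumber(issuerPlanNumber: string): PlanKey {
  if (issuerPlanNumber.length !== 16) {
    throw new Error(`expected 16 chars, got ${issuerPlanNumber.length}`);
  }
  return {
    hiosPlanId: issuerPlanNumber.slice(0, 14),
    variantSuffix: issuerPlanNumber.slice(14),
    issuerPrefix: issuerPlanNumber.slice(0, 5),
  };
}

const key = splitIssuerPlanNumber("15983PA001000601");
console.log(key, CSR_BUCKETS["CS1"]);
```

Grouping scraped records by `hiosPlanId` and keying variants by bucket label is one way the per-HIOS-plan + per-variant merge could be organized.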
Phase A.1 scraper architecture (now locked)
scripts/db/scrape-getinsured-2026.ts (shared GetInsured-stack scraper, PA is first consumer):
```
For each (state, anchor_ZIP_per_RA × 4 CSR income buckets):
  1. Launch Playwright
  2. navigate /prescreener/
  3. Wait for Continue button
  4. Fill: Plan Year=2026, ZIP, DOB (age 40 = 1986-06-01), income (per CSR bucket)
  5. Click Continue → wait for /prescreener/results
  6. Click Next → wait for /hix/private/preferences
  7. Click View Plans → wait for /hix/private/planselection
  8. fetch('/hix/private/getIndividualPlans') with session cookie
  9. Save raw JSON to .scratch/pa-{ra}-{csr_bucket}-2026.json
```

Output: 36 JSON files (per-RA × per-income-bucket). Normalize step merges per-HIOS-plan + per-variant into the canonical plans collection doc shape.
Phase A — additional standing decisions captured during ENG-418 implementation
7. Multi-county ZIPs trigger a county-dropdown on Pennie's prescreener (found 2026-06-01)
ZIPs on county borders (e.g. 15834 Emporium → Cameron / Elk / Potter; 18015 Bethlehem → Northampton / Lehigh) trigger a "Select your county" dropdown in /prescreener/. Without picking, Continue stays disabled. Scraper handles this via <select id="field-:r12:"> whose options encode county FIPS as values (e.g. Cameron=42023). This is the single biggest cause of pre-fix RA2 + RA6 timeouts — easy to miss without live verification. Sibling GetInsured states (NJ/VA/NV/NM/ME) will likely have the same UI pattern.
8. CSR-94 income bucket blocked by Medicaid-likely interstitial (found 2026-06-01)
The CSR-94 silver variant (100-150% FPL band) is impossible to capture via Pennie's prescreener at our anchor scenario (single 40yo). At $20K (128% FPL) Pennie redirects to a Medicaid-likely screen; at $22K (141% FPL) it stays on the marketplace path but still fails to advance past the form Continue click — likely Pennie's UI blocks the flow at very-low single-person income.
Workaround: the CSR-94 variant is federally defined under 45 CFR §156.420(a)(1) (AV target = 0.94). Derive from base Silver cost-shares + the 94% AV target, OR capture using a 2-person household at low income (where Medicaid threshold shifts). Today we ship with CSR-73 + CSR-87 only and document CSR-94 as a planned A.4.1 follow-up. The /plans UI gracefully renders without CSR-94 (falls back to base silver display).
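The FPL percentages quoted for the $20K and $22K probes can be sanity-checked with a short classifier. This is a sketch only: `csrBucketFor` is a hypothetical name, the single-person FPL of $15,650 is an assumption (it reproduces the 128% and 141% figures above), and the Medicaid-likely interstitial is not modeled.

```typescript
// Assumed single-person poverty guideline; pass a different value per household size.
const ASSUMED_FPL_SINGLE = 15650;

function fplPercent(annualIncome: number, fpl: number): number {
  return (annualIncome / fpl) * 100;
}

// Bands follow the four income buckets used for the scrape:
// >250% no CSR, 200-250% CSR-73, 150-200% CSR-87, up to 150% CSR-94.
function csrBucketFor(annualIncome: number, fpl: number): string {
  const pct = fplPercent(annualIncome, fpl);
  if (pct > 250) return "no_csr";
  if (pct > 200) return "csr_73";
  if (pct > 150) return "csr_87";
  return "csr_94";
}

for (const income of [20000, 22000, 50000]) {
  const pct = Math.round(fplPercent(income, ASSUMED_FPL_SINGLE));
  console.log(income, `${pct}% FPL`, csrBucketFor(income, ASSUMED_FPL_SINGLE));
}
```

Both low-income probes land in the CSR-94 band on paper, which is consistent with the failure being a prescreener gate rather than a bucket-selection error.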
9. zip_county collection needed sbeRedirect cleanup post-apply (found 2026-06-02)
Even after PA was removed from STATE_BASED_MARKETPLACES + added to OWNED_DATA_STATES, /api/counties continued to return SBE-redirect for PA ZIPs because the zip_county collection still had sbeRedirect field set on all 2,323 PA ZIPs from a legacy seed. The /api/counties route reads sbeRedirect from each zip_county doc and short-circuits when all docs in a ZIP have it. Cleanup script (one-time):
```js
db.zip_county.updateMany(
  { state: 'PA' },
  { $set: { regionId: <from PA_FIPS_TO_RATING_AREA> }, $unset: { sbeRedirect: '' } }
)
```

Add to the standard SBE ingest playbook checklist as a Phase A.4 step. Every state moving from SBE-redirect → owned-data needs this cleanup. CA + NY were either ingested without an sbeRedirect-seeded zip_county OR cleaned at the same time.
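Because a multi-RA state needs a different regionId per county, the one-liner has to be expanded into one update per county FIPS. A minimal sketch of building those operations follows; `buildZipCountyCleanupOps` is a hypothetical helper, and the `countyFips` field name on zip_county docs is an assumption that should be checked against the real schema.

```typescript
// Hypothetical builder; the countyFips field name is assumed, not confirmed.
interface ZipCountyUpdate {
  updateMany: {
    filter: { state: string; countyFips: string };
    update: { $set: { regionId: number }; $unset: { sbeRedirect: string } };
  };
}

function buildZipCountyCleanupOps(
  state: string,
  fipsToRatingArea: Record<string, number>,
): ZipCountyUpdate[] {
  return Object.entries(fipsToRatingArea).map(([countyFips, regionId]) => ({
    updateMany: {
      filter: { state, countyFips },
      update: { $set: { regionId }, $unset: { sbeRedirect: "" } },
    },
  }));
}

// One-county subset for illustration: Philadelphia (FIPS 42101) is in RA8.
const cleanupOps = buildZipCountyCleanupOps("PA", { "42101": 8 });
console.log(JSON.stringify(cleanupOps, null, 2));
// In a real script these ops would be passed to the collection's bulkWrite.
```

Keeping the builder pure makes it easy to dry-run and diff the operations before any write is attempted.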
Verified end-to-end state (apex, 2026-06-02)
| Capability | Status | Notes |
|---|---|---|
| PA ZIP → county (zip_county) | ✅ 2,323 ZIPs with regionId + no sbeRedirect | All 67 counties mapped via PA_FIPS_TO_RATING_AREA |
| PA plan pricing (/api/plans via owned data) | ✅ 291 HIOS plans across 9 RAs | L1 verified vs PHIEA SLCSP (4 RAs $0.00 exact; all 9 RAs ≤ $1.49) |
| PA subsidy estimate (/api/eligibility FFM passthrough) | ✅ via FFM proxy | No calculatePaEligibility() needed (federal APTC only) |
| PA drugs: search → coverage | 🚧 Phase B — formularies_staging not yet populated for PA | Per-carrier formulary scrape needed (Highmark / UPMC / IBC / Geisinger / Ambetter / Oscar / Capital / Partners / Jefferson). Mirror NY ENG-412 Phase 1 pattern. URLs stored on each plan as puf.urls.formulary for deep-link fallback. |
| PA providers: search → coverage | 🚧 Phase C — providers_staging not yet populated for PA | Source TBD (PA may not have NY-DOH-PNDS equivalent open data). Likely path: per-carrier directory scrape OR §1311 MRF ingest. NPPES-NPI-native expected (decision #6). URLs stored on each plan as puf.urls.providerDirectory for deep-link fallback. |
| FFM + CA + NY byte-identical post-apply | ✅ 4,495 → 4,495 plans (cohort guard PASSED) | Snapshot B 6a1e44f83a4bf6ebedefa3a0 (pre-apply) + Snapshot C 6a1e45b729c719743c525854 (post-apply) |
Phase B + C follow-ups (parallel to NY ENG-412 sequencing)
Both phases follow the exact playbook NY established. PA-specific notes:
Phase B — drug formulary ingest (mirror NY ENG-412 Phase 1):
- Harvest per-carrier formulary URLs for PA (12 carriers: Ambetter, Oscar, Highmark BCBS / Inc / BCBS, IBC, Jefferson, Partners, Geisinger HP / QO, UPMC, Capital). PA-Phase-A artifact already has `puf.urls.formulary` per plan; extract dedup'd list.
- Parse per-carrier formulary docs (PDF / JSON / web search-tool endpoint per carrier) → drug name / strength / NDC.
- Resolve NDC → RxCUI via RxNav.
- Write to `formularies_staging` collection, keyed `pa:<rxcui>` namespace (matches NY's `ny:<rxcui>` convention).
- Snapshot pre-apply → dry-run → audit 100% → founder-gated `--apply`.
- Apex smoke: PA drug-coverage search → real coverage data (not null).
Phase C — provider directory ingest ✅ COMPLETE 2026-06-05 (ENG-437)
Pre-Phase-C research turned up NO PA DOH NPI open-dataset equivalent to NY PNDS. Ingest went per-carrier across all 9 PA marketplace carriers using 5 distinct patterns documented in sbe-ingestion-playbook.md → "Provider-coverage ingest patterns" section. Locked per-carrier decisions for PA:
| # | Carrier | HIOS | Plans | NPIs | Method (Pattern) | Source |
|---|---|---|---|---|---|---|
| 1 | Ambetter (Centene) | 15983 | 12 | 111,885 | A (§1311 MRF) | api.centene.com/ambetter/reference/cms-data-index.json (national TOC, NPI-keyed) |
| 2 | Highmark BS PA | 79962 | 13 | 92,605 | A (§1311 MRF streaming gz) | mrfdata.hmhs.com/.../highmark-bsp-index.json (Cloudfront-signed; 6 marketplace-relevant files of 268 fetched + per-file plan attribution) |
| 3 | IBX 31609 (QCC core) | 31609 | 8 | 809,576 | A (§1311 MRF, partial 80/525 files) | ihg-dart-edw-mrf-prod-public/qcc/.../index.json. Heavy 17B0/17D0 file families server-throttled at 24-parallel; accepted partial — high NPI overlap meant ~80-90% of unique network captured |
| 4 | IBX 33871 (Indep Admin sister line) | 33871 | 12 | 809,576 (additive) | E (sister-line attribution) | Risk-accepted — providers contract at issuer level not per product line. Applied 12 plan IDs to all 809k IBX-31609 NPIs via $addToSet |
| 5 | Oscar | 98517 | 10 | 6,491 | A (§1311 MRF, single JSON) | hioscar-cms-tic-us-east-1.s3.amazonaws.com/oscar/.../index.json — 2 reporting_structures for 98517 prefix, 1 actual in-network JSON of 13.6 MB (vs 176 OBH adjudication zips) |
| 6 | UPMC | 16322 | 37 | 9,560 | C (PDF + NPPES fuzzy match) | 4 marketplace network PDFs at upmc.widen.net/view/pdf/... (Partner/Select/Premium/Standard — 88 MB total, 15,848 pages). pdfplumber column-aware extraction → name+county fuzzy match against 329,902-NPI NPPES PA baseline. 77% exact_last_first_county precision |
| 7 | Geisinger HP + Quality Options | 22444 + 75729 | 37 | 6,190 | B (SPA-API direct NPI) | ghpproviders.geisinger.org HealthSparq backend — POST /healthsparq/public/service/v4/search exposes NPI in providerResults[].attributes[] where key=="NPI". Chrome MCP residential session bypasses Radware. Iterate 26-letter + 2-letter prefix sweeps (300-result cap per query) |
| 8 | Jefferson + Partners (shared network) | 93909 + 19702 | 18 | 156,089 | D (SE PA county heuristic) | HealthTrio Connect (jhp.healthtrioconnect.com) is server-rendered HTML, Cloudflare bot-shield + aggressive rate-limiting (7 HTTP 429 in 4 min). PIVOTED to county heuristic: Philadelphia + Bucks + Chester + Montgomery + Delaware + Lehigh + Northampton + Berks = 340 ZIPs |
| 9 | Capital BC | 45127 | 18 | 46,377 | D (Central PA county heuristic) | Public path member-auth gated, HIOS-search-gate rejects HIOS 45127. Heuristic on Dauphin + Lancaster + Lebanon + Cumberland + York + Adams + Franklin + Perry = 226 ZIPs |
| | TOTAL | | 165 | | | |
Cohort guard preserved across all 9 ingests: providers_staging non-pa count = 2,514,054 (drift=0 every apply).
Lessons locked in (use for any future PA-Phase-C-equivalent state):
- Always sniff XHR before declaring "NPI not exposed". Geisinger's rendered HTML showed no NPI but their HealthSparq API returned it in JSON. I almost defaulted to fuzzy match. Pattern B should be your second check, BEFORE Pattern C or D.
- Vendor-specific patterns repeat across states. HealthSparq (Geisinger PA, NM Presbyterian, several BCBS plans) all share `/healthsparq/public/service/v4/search` with NPI in attributes. Capture this knowledge once.
- HealthTrio Connect is server-rendered HTML, not SPA. Don't waste time looking for a JSON API; go straight to Pattern C (HTML scrape + fuzzy match) or Pattern D (county heuristic) based on rate-limit tolerance.
- County heuristic over-attribution is acceptable for narrow-network carriers. Jefferson + Capital BC chose D because A/B/C all failed within a reasonable effort budget. Over-attribution rate ~30-40% but the alternative is zero coverage which blocks enrollment journeys entirely. UI mitigates via "verify with carrier" copy.
- Sister-line additive (Pattern E) saves hours of duplicate work. IBX 33871 reused 31609's 809k NPIs in a 32-min `updateMany`. Always check if the carrier has a sister HIOS prefix already ingested.
- PDFs use widen.net or similar CDN behind viewer pages. The viewer URL is the public-facing link, but the actual PDF binary is in `window.viewerPdfUrl` inside the viewer HTML (a signed CDN URL with ~24hr expiry). Always grep viewer HTML for `viewerPdfUrl`.
- CMS §1311 MRF gz files are huge but most NPIs cluster in first few MB. Stream + gunzip + early-terminate at `"in_network"` marker token = 270× speedup. Don't fetch the entire file.
- Plan→network mapping is often NULL in PUF. Accept over-attribution to all carrier plans (IBX 33871, UPMC, Jefferson, CapBC all did this). Document the risk in commit message.
- Pre/post release-snapshot tags are critical when shipping deploys back-to-back. All 4 deploys this session followed the `pre-eng-XXX-<carrier>-YYYYMMDD-HHMM` → merge → `post-eng-XXX-<carrier>-YYYYMMDD-HHMM` pattern. Clean rollback points.
- Squash-merge causes "DIRTY/CONFLICTING" PR state on the next iteration. Always merge `origin/main` into the integration branch before opening the next PR; auto-resolve content conflicts via `git checkout --ours` for state docs.
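The stream-and-early-terminate lesson above can be sketched as an incremental scanner that is fed decompressed text chunks and reports when to stop. This is an illustrative sketch, not the ingesters' actual code: the `NpiScanner` class is hypothetical, the `"npi": [...]` regex mirrors the one described for the NJ Oscar file, and in a real script the chunks would come from a gunzip stream that is destroyed once `feed` returns true.

```typescript
// Hypothetical incremental scanner; wire feed() to a gunzip stream in practice.
const STOP_MARKER = '"in_network"';
const TAIL_KEEP = 16; // enough to catch a marker split across two chunks

class NpiScanner {
  readonly npis = new Set<string>();
  private carry = "";
  done = false;

  /** Consumes one text chunk; returns true once the stop marker was seen. */
  feed(chunk: string): boolean {
    if (this.done) return true;
    let text = this.carry + chunk;
    const stop = text.indexOf(STOP_MARKER);
    if (stop !== -1) {
      text = text.slice(0, stop); // ignore everything after the marker
      this.done = true;
    }
    const re = /"npi"\s*:\s*\[([^\]]*)\]/g;
    let lastEnd = 0;
    let m: RegExpExecArray | null;
    while ((m = re.exec(text)) !== null) {
      for (const id of m[1].match(/\b\d{10}\b/g) ?? []) this.npis.add(id);
      lastEnd = re.lastIndex;
    }
    if (!this.done) {
      // Keep an unfinished "npi" array, else just a short tail of the chunk.
      const open = text.lastIndexOf('"npi"');
      this.carry =
        open >= lastEnd
          ? text.slice(open)
          : text.slice(Math.max(lastEnd, text.length - TAIL_KEEP));
    }
    return this.done;
  }
}

// Simulated chunk boundaries: an NPI array and the marker are both split.
const scanner = new NpiScanner();
const chunks = [
  '{"provider_references":[{"provider_groups":[{"npi":[1234567890,12345',
  '67891],"tin":{}}]}],"in_net',
  'work":[{"npi":[9999999999]}]}',
];
let chunksRead = 0;
for (const c of chunks) {
  chunksRead++;
  if (scanner.feed(c)) break; // a real script would destroy the stream here
}
console.log([...scanner.npis], chunksRead);
```

The carry buffer is the part that is easy to get wrong: without it, NPIs or the stop marker that straddle a chunk boundary are silently missed.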
Cross-references
- ENG-418 — PA Phase A ingest + GetInsured scraper pilot (this work — Phase A.1 / A.2 / A.3 / A.4 all shipped)
- `docs/data-sources/sbe-ingestion-playbook.md`: methodology
- `docs/data-sources/state-subsidies.md`: confirms PA is NOT in state-subsidy table
- `scripts/db/scrape-getinsured-2026.ts`: the scraper (PA first consumer; NJ/VA/NV/NM/ME inherit)
- `scripts/db/build-pa-plans-from-scrape-2026.ts`: per-(HIOS,RA)-input → HIOS-grouped doc-shape normalize
- `scripts/db/write-pa-plans-2026.ts`: dry-run/--apply/--rollback ingest to `plans` collection
- `scripts/db/validate-pa-plans-2026.ts`: 4-layer validation (L1 PHIEA SLCSP match, L2 structural, L3 cross-RA spread, L4 federal regression)
- `scripts/db/scrape-covered-ca-2026.ts`: the original CA scraper that ENG-418 forked from
- `scripts/db/data/ca-rating-areas-2026.ts`: anchor-ZIP-per-RA shape to mirror
New Jersey
Marketplace: GetCoveredNJ (enroll.getcovered.nj.gov) — runs on the same GetInsured stack as Pennie. PA ↔ NJ parity is significant.
Status as of 2026-06-05 (ENG-438):
- Phase A LIVE on apex: 61 plans across 6 carriers via PA-pattern scraper extension + dedicated build/write/validate forks.
- Phase A.5 (QRS): pending; fork `augment-pa-quality-ratings-2026.ts`
- Phase B (drug formularies): research only, see below
- Phase C (provider directory): not started — same per-carrier discovery as PA Phase C
6 carriers + plan counts (HIOS 5-digit issuer prefix)
| HIOS prefix | Carrier | Plans |
|---|---|---|
| 91762 | AmeriHealth Ins Company of NJ | 18 |
| 91661 | Horizon Blue Cross Blue Shield of NJ | 14 |
| 23818 | Oscar Garden State Insurance Corporation | 13 |
| 37777 | UnitedHealthcare Insurance Company | 8 |
| 17970 | WellCare Health Insurance Company of NJ (Centene) | 7 |
| 77606 | AmeriHealth HMO, Inc. | 1 |
Locked decisions
1. Single statewide rating area (RA1). Per CMS Geographic Rating Areas (45 CFR 156.255). Plan list invariant across all 21 NJ counties. `NJ_RATING_AREAS_2026` has 1 entry; `NJ_FIPS_TO_RATING_AREA` maps every county FIPS → 1.
2. Federal age curve, no state subsidy line. NJ has the Health Insurance Premium Security Plan (HIPS) reinsurance but it's baked into carrier-filed rates, not a separate premium line like CA's CAPS/CAPC. Treat NJ as federal-style: federal APTC only, single `aptc` field.
3. Anchor: ZIP 07102 (Newark / Essex County). Largest NJ city. SLCSP age-40 = $507.91 (computed from 2026 no_csr scrape; NJ has no public PHIEA-style benchmark table).
4. Form differs from Pennie in one place: single-page (no intro click). GetCoveredNJ's `/prescreener/` lands directly on the form. PA has a 2-step flow (intro page → form page). Parameterized via `introH1Pattern: null` for NJ in `STATE_CONFIGS`. All other form selectors identical: `id="coverage-year-select"`, label-based ZIP + income, `id*="birthdate-picker"`, `id="coverage"`, `data-testid="btn-see-savings"` Continue.
5. CSR-94 bucket needs $22,500 income (not PA's $22K). GetCoveredNJ's prescreener is more aggressive than Pennie about Medicaid redirect: at PA's tuned $22K (146% FPL) the form gates to NJ FamilyCare even though it's well above the 138% expansion threshold. Bumping to $22,500 (149.4% FPL, JUST below the 150% CSR-94 upper bound) clears the gate cleanly. Wired in `STATE_CONFIGS.NJ.csr94IncomeOverride = 22500`. Verified 2026-06-05: 61 plans, 21/21 non-HSA Silvers populated with `csrVariants["94"]`. HSA Silvers correctly skip CSRs per ACA's high-deductible requirement. Phase A now consistent with FFM/CA/PA (full CSR matrix). Any future GetInsured-stack state with a similar redirect quirk should set `csr94IncomeOverride` rather than re-derive.
6. Eligibility route needs explicit NJ branch. Adding NJ to OWNED_DATA_STATES routes /api/plans correctly (auto-via isOwnedDataState) but /api/eligibility falls through to CMS Marketplace API which doesn't host NJ SBE plans → 502. Add `calculateNjEligibility()` + NJ branch in eligibility/route.ts mirroring the PA pattern.
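To make the two scraper-facing decisions concrete (no intro page, CSR-94 income override), here is a sketch of what a per-state config lookup could look like. Only `introH1Pattern: null` and `csr94IncomeOverride = 22500` for NJ come from this doc; the `StateConfig` shape, the helper names, the PA intro pattern placeholder, and treating $22K as the default are assumptions for illustration.

```typescript
// Illustrative shape only; the real STATE_CONFIGS in the scraper may differ.
interface StateConfig {
  introH1Pattern: RegExp | null; // null = form is on the landing page
  csr94IncomeOverride?: number; // income used for the CSR-94 bucket
}

const DEFAULT_CSR94_INCOME = 22000; // assumed default, PA's tuned value

const STATE_CONFIGS: Record<string, StateConfig> = {
  PA: { introH1Pattern: /.+/ }, // placeholder pattern, real H1 text not shown here
  NJ: { introH1Pattern: null, csr94IncomeOverride: 22500 },
};

function configFor(state: string): StateConfig {
  const cfg = STATE_CONFIGS[state];
  if (!cfg) throw new Error(`no STATE_CONFIGS entry for ${state}`);
  return cfg;
}

function csr94IncomeFor(state: string): number {
  return configFor(state).csr94IncomeOverride ?? DEFAULT_CSR94_INCOME;
}

function hasIntroPage(state: string): boolean {
  return configFor(state).introH1Pattern !== null;
}

console.log(csr94IncomeFor("NJ"), csr94IncomeFor("PA"), hasIntroPage("NJ"));
```

Keeping quirks as data rather than as `if (state === "NJ")` branches is what lets sibling GetInsured states reuse the scraper unchanged.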
Scraper code-paths to know
Same as PA — just route through STATE_CONFIGS["NJ"]:
```bash
# Single bucket smoke
npx tsx scripts/db/scrape-getinsured-2026.ts --state NJ --ra 1 --csr no_csr \
  --out scripts/db/data/nj-plans-scraped-2026

# Parallel all buckets (DON'T include csr_94; see #5 above)
for b in no_csr csr_87 csr_73; do
  npx tsx scripts/db/scrape-getinsured-2026.ts --state NJ --ra 1 --csr "$b" --out ... &
done
wait  # let the background scrapes finish before building

# Build → validate → write
npx tsx scripts/db/build-nj-plans-from-scrape-2026.ts
npx tsx scripts/db/validate-nj-plans-2026.ts
WRITE_CONFIRM=yes npx tsx scripts/db/write-nj-plans-2026.ts --apply
```

zip_county post-write
After the plans write succeeds, NJ ZIPs in zip_county still carry the legacy sbeRedirect: {state: "NJ", marketplace: "Get Covered NJ ..."} field set from the 2026-04-30 CMS seed. Until cleared, /api/counties for NJ ZIPs returns the SBE redirect even though plans exist. Run after Phase A.4:
```javascript
await coll.updateMany(
  { state: "NJ" },
  { $set: { regionId: 1 }, $unset: { sbeRedirect: "" } }
);
```

Affects 714 NJ ZIPs. Same pattern applies for any future SBE state graduating from redirect-only to owned-data.
Phase B (drug formularies) — handoff notes
mrpuf_issuers_staging only contains CMS-FFM issuers. NJ SBE carriers (Horizon, AmeriHealth, Oscar, UHC, WellCare) are NOT there. Phase B for NJ requires per-carrier §1311 MRF discovery + ingest (PA Phase C patterns A–E in sbe-ingestion-playbook.md).
Predicted carrier → pattern mapping (verify before committing):
- Horizon BCBS NJ → Pattern A (§1311 MRF). Horizon publishes nationally at horizonblue.com/transparency. Pull index.json, filter to NJ plan IDs (HIOS prefix 91661).
- AmeriHealth NJ → Pattern A. Parent Independence Health Group (PA's IBX) publishes a national MRF at ibx.com/individuals-and-families/member-resources/transparency-coverage. Filter to NJ HIOS prefix 91762.
- Oscar Garden State → Pattern A. Oscar's MRF was already used in PA Phase C (HIOS 98517, per the PA per-carrier table). NJ-side HIOS prefix 23818; same MRF infrastructure expected.
- UnitedHealthcare → Pattern A. UHC publishes nationally; NJ filtering via HIOS prefix 37777.
- WellCare / Centene → Pattern A. Centene has unified §1311 MRF for all states (Ambetter brand). HIOS prefix 17970.
Recommend tackling 1–2 carriers per attended session given per-carrier discovery friction.
Phase B post-ingest — REQUIRED for NJ (ENG-425)
After each NJ carrier's drug docs land in formularies_staging, run:
```bash
node scripts/db/derive-drug-search-index.js --apply
```

Without this, the newly-ingested NJ-only meds will NOT appear in /plans drug NAME search (the route falls back to CMS autocomplete which doesn't have SBE-only drugs). The derive reads the WHOLE formularies_staging and rebuilds the search read-model with brand/generic strength parity + commonly-covered-form-first ranking. It is idempotent and additive; --rollback exists.
Sync with main first to pick up the derive script + scripts/db/lib/drug-search-derive.js + scripts/db/lib/rxnav-resolver.cjs (cleaner RxCUI resolution) + the /api/drugs/search route + the db.ts allowlist + infra/atlas/access-matrix.ts (now lists drug_search_index for app_read_staging — 5 collections). If the NJ Phase B PR also edits that user's grants, keep drug_search_index in the merged list and run npx tsx scripts/audit/staging-cluster-drift.ts to confirm the live role matches the merged manifest.
Grouping behavior the derive applies (good to know when NJ carrier meds surface with unusual name formats): groups by (ingredient, form, year) from each doc's true canonical, unit-normalizes mcg→mg, excludes combos, defends alias pollution (brand kept only if it appears on ≥2 rxcuis + searchText is curated, not raw aliases), safe salt-merge, dose-orders strengths, orders each strength's rxcuis broadest-coverage-first. Spot-check NJ ingest results with npx tsx scripts/audit/drug-search-parity.ts.
If a NJ-only drug name has unusual formatting and doesn't surface in the search, that's a derive-pollution issue — not a coverage issue (per-rxcui coverage at /api/drugs/covered is unaffected).
Files (NJ-specific)
- `scripts/db/data/nj-rating-areas-2026.ts`: single-RA anchor + 21-county FIPS lookup
- `scripts/db/build-nj-plans-from-scrape-2026.ts`: forked from PA
- `scripts/db/write-nj-plans-2026.ts`: forked from PA (founder-gated)
- `scripts/db/validate-nj-plans-2026.ts`: forked from PA (4-layer)
- `scripts/db/scrape-getinsured-2026.ts`: parameterized PA+NJ via `STATE_CONFIGS`
- `apps/web/src/lib/owned-plans.ts`: `calculateNjEligibility()`
- `apps/web/src/app/api/eligibility/route.ts`: NJ branch
- `scripts/db/ingest-nj-providers-{centene,amerihealth,uhc,uhc-resume,horizon,oscar}.cjs`: per-carrier Phase C ingesters (ENG-438)
Phase B + Phase C ACTUAL outcomes (2026-06-09, ENG-438 COMPLETE)
Phase B (drug formularies) and Phase C (provider directories) both shipped 5/5 carriers. Three predictions in the original handoff notes above were wrong — captured below so future state work doesn't repeat the mis-prediction.
Phase B — formularies_staging (staging cluster only, per ADR 0004):
⚠️ HIOS label correction (2026-06-10, ENG-447): earlier revisions of the two tables below had the Horizon↔AmeriHealth HIOS prefixes swapped. Live `plans` collection truth: 91661 = Horizon BCBS NJ (14 plans), 91762 = AmeriHealth Ins Company of NJ (18 plans), 77606 = AmeriHealth HMO (1 plan). The swap propagated into the Phase C Horizon provider ingester and caused a real shipped data bug (see "Post-ship repair" below). Carrier↔HIOS labels must come from the live `plans` collection, never from doc tables.
| Carrier | HIOS | RxCUIs | Predicted source | Actual source |
|---|---|---|---|---|
| Centene/Ambetter | 17970 | 4,305 | Pattern A (national MRF) | Pattern A ✓ — api.centene.com/ambetter/reference/cms-data-index.json |
| Oscar Garden State | 23818 | 4,014 | Pattern A | Pattern A ✓ — hioscar-cms-tic-us-east-1.s3.amazonaws.com/oscar/20260601_oscar_index.json |
| AmeriHealth NJ | 91762 (+77606 HMO) | 3,977 | Pattern A | Pattern A ✓ — Independence Health Group MRF |
| Horizon BCBS NJ | 91661 | 4,270 | Pattern A | PDF parse — horizonblue.com/transparency is Incapsula-WAF-blocked at curl level; used published 2026 formulary PDF instead |
| UnitedHealthcare NJ | 37777 | 3,570 | Pattern A | OptumRx 2-token SPA — UHC's xnjdruglist2026 URL redirects to welcome.optumrx.com/.../ClientFormulary?var=GPX526NJ (NJ-coded variant of NY's GPX426NY pattern from ENG-412). Drove Chrome MCP to capture authorization + profile-token headers; replayed POST new.optumrx.com/formulary/drugs-by-alphabet × 26 letters + drug-results for 11,832 names → 18,536 NDCs → 4,890 covered → 4,017 NDCs resolved via RxNav → 3,570 unique RxCUIs |
Phase C — providers_staging (staging cluster only):
| Carrier | HIOS | NPIs | Discovery |
|---|---|---|---|
| Centene/Ambetter | 17970 | 112,161 | Pattern A direct §1311 MRF (predicted correctly) |
| AmeriHealth NJ | 91762 (+77606 HMO) | 978 | Pattern A IHG GCS bucket (storage.googleapis.com/ihg-dart-edw-mrf-prod-public/ahnj/2026-05-01_ahnj_index.json) |
| UnitedHealthcare NJ | 37777 | 106,742 | UHC TIC blobs API discovery. transparency-in-coverage.uhc.com is a Gatsby SPA that fetches an undocumented /api/v1/uhc/blobs/ returning 86,514 per-employer + per-network MRF download URLs (each a pre-signed Azure SAS valid through 2030-02-16). Pattern: drive the SPA in Chrome MCP, watch network tab, capture the blobs URL, then curl directly. The Oxford-Health-Insurance-Inc TOC (15.6 MB, 13 reporting_structures, 16 unique in-network files) is the NJ subsidiary; 6 of 13 RS reference the Metro-Network rate file (387.7 MB gz, 7.5 GB unpacked) — that's the NJ marketplace network ("New Jersey Oxford Metro Network", delsys=928, per UHC's own Rally landing). NPIs all live in provider_references at file head; count stabilized at 106,742 within the first 19 MB of stream. |
| Horizon BCBS NJ | 91661 | 95,979 | Sapphire MRF Hub bypass of Incapsula. horizonblue.com TIC pages are Incapsula-WAF-blocked at curl level (identical 1032-byte HTML challenge for every URL). The actual TIC is at the vendor portal horizonblue.sapphiremrfhub.com (Sapphire Digital MRF Hub, owned by Zelis) — NOT Incapsula-protected. Discovered by driving www.horizonblue.com/transparency-in-coverage via Chrome MCP; the 404'd page DOM contained the outbound link to the Sapphire URL. Both Horizon TOCs (Healthcare Services parent + Healthcare of NJ Inc subsidiary) are EIN-keyed (not HIOS-keyed); marketplace files identified by filename convention: MCEX = Managed Care EXchange, OMT1/OMT2 = Omnia Tier 1/2, Individual / Individual-Small-Group. All 4 selected files share the same provider_references block at file head and reference the same Horizon BCBSNJ master network under different network-product labels. |
| Oscar Garden State | 23818 | 22,658 | TOC-listed file undercounts; sibling file is correct. Oscar's TOC references 20260601-oscar-002-in-network.json (853 KB, only 452 NPIs — a thin slice covering a subset of billing codes). The comprehensive 20260601-oscar-001-in-network.json (92 MB, 22,658 NPIs) exists at the same S3 path but is NOT listed in the TOC. Pattern: when an SBE carrier's TOC-listed file looks improbably small for the state's market size, probe sibling files at the same path. Both 001/002 are at Oscar's own §1311 publication infrastructure, so 001 is defensible as a Pattern A source. Oscar's file uses CMS schema v2.0.0 with indexed provider_references (numeric indices into a top-level array at file tail), but the same "npi": [<10-digit codes>] regex extracts NPIs correctly. |
Post-ship repair (2026-06-10, ENG-447) — Horizon provider plan-id mis-attribution
The Layer 5 invariants backfill (ENG-447) found that ingest-nj-providers-horizon.cjs had shipped with NJ_HIOS_PREFIX = "91762" — AmeriHealth's prefix (the doc-table swap above made it into the code). All 95,979 Horizon NPIs were attributed to AmeriHealth's plan IDs: Horizon's 14 plans (91661NJ*) had ZERO provider coverage, AmeriHealth's plans carried Horizon's directory. Secondary bug: ingest-nj-providers-amerihealth.cjs filtered its plan list to ^91762, silently dropping the AmeriHealth HMO sister plan 77606NJ0040066 (zero providers).
Repaired on staging by scripts/db/repair-nj-horizon-plan-attribution.cjs (3 passes: $addToSet correct 91661NJ entries → $pull mis-attributed → $addToSet 77606 on AmeriHealth docs; no inserts/deletes, cohort byte-identical, 0 errors). Post-repair attribution: 91661NJ = 95,979 / 91762NJ = 1,318 / 77606NJ = 1,318. Snapshots: pre 6a29227e764ea3c204b89fa1, post 6a2924c081d7b2013563a5b0. Both ingester constants fixed in the same commit. Drug data (Phase B) was never affected.
Lesson (now enforced by Layer 5): load carrier plan IDs from the live plans collection by issuer-verified HIOS prefix, and cross-check the resulting per-carrier attribution counts against the plans collection's issuer names before --apply. The invariants check (npm run audit:sbe-invariants) now pins these attributions permanently.
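A pre-apply guard in the spirit of this lesson might look like the sketch below. It is not the Layer 5 implementation: `verifyPrefixAttribution` and the `planId` field name are hypothetical, and apart from `77606NJ0040066` the plan IDs in the sample data are made up for the example. The `issuer` field mirrors the `plans.issuer` field mentioned elsewhere in this doc.

```typescript
// Hypothetical guard: fail fast when a HIOS prefix resolves to the wrong issuer.
interface PlanDoc {
  planId: string; // assumed field name for the HIOS plan ID
  issuer: string;
}

function verifyPrefixAttribution(
  plans: PlanDoc[],
  hiosPrefix: string,
  expectedIssuer: RegExp,
): string[] {
  const matched = plans.filter((p) => p.planId.startsWith(hiosPrefix));
  if (matched.length === 0) {
    throw new Error(`no plans found for HIOS prefix ${hiosPrefix}`);
  }
  const wrong = matched.filter((p) => !expectedIssuer.test(p.issuer));
  if (wrong.length > 0) {
    throw new Error(
      `prefix ${hiosPrefix} resolves to "${wrong[0].issuer}", expected ${expectedIssuer}`,
    );
  }
  return matched.map((p) => p.planId);
}

// Sample docs; in practice these would be read from the live plans collection.
const livePlans: PlanDoc[] = [
  { planId: "91661NJ0000001", issuer: "Horizon Blue Cross Blue Shield of NJ" },
  { planId: "91762NJ0000001", issuer: "AmeriHealth Ins Company of NJ" },
  { planId: "77606NJ0040066", issuer: "AmeriHealth HMO, Inc." },
];

const horizonPlanIds = verifyPrefixAttribution(livePlans, "91661", /Horizon/i);
console.log(horizonPlanIds);
```

With this check in front of the ingester, a constant such as `NJ_HIOS_PREFIX = "91762"` in a Horizon script would raise before any write instead of shipping mis-attributed NPIs.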
End-state staging totals (2026-06-09):
- `providers_staging`: 3,996,317 total NPIs · NJ = 265,562 · cohort guard non-NJ = 3,730,755 (locked across all 3 new ingests, drift 0 each time, errors 0 each time)
- `formularies_staging`: 34,035 unique RxCUIs across year 2026 (NJ carriers contribute ~20K across 5 carriers, after dedup-by-rxcui)
- `plans` (BOTH staging + prod): 60 NJ marketplace plans across 5 HIOS prefixes (61 including a 1-plan sub-prefix)
M10 staging Atlas ingest watchouts (apply to ALL SBE states going forward):
- BATCH=250, NOT 1000. First UHC ingest used BATCH=1000 + no per-bulkWrite `maxTimeMS`; hung at batch 17 (committed 16K NPIs, then sat idle on Mongo I/O for 30+ minutes). Resume script with BATCH=250 + `maxTimeMS: 30000` per bulkWrite + `writeConcern: {w:1, wtimeout:30000}` ran 90K more docs in ~6 min at 262 docs/s steady. Rate drops to ~160 docs/s for carriers that overlap with many prior carriers (each `$addToSet` on `plans[]` scans the array).
- Live progress logging from the script itself. `tail -40` post-process buffers all output, so it stays invisible until exit. Use `process.stdout.write` per batch with timestamp + percentage + rate + ETA + running upsert/modify/error counts.
- Use indexed queries for sanity checks. `countDocuments({_id: /^nj:/})` with the `_id` index returns in seconds; `countDocuments({entity_type: 1})` without an `entity_type` index times out at 25 s on a 4M-doc collection.
- The `entity_type` field is unreliable. Ingest scripts default new NPIs to `entity_type: 1` (via `$setOnInsert`) regardless of whether the NPI is actually Type 1 (individual) or Type 2 (organization). For accurate Type 1/2 splits, cross-reference NPPES.
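The batching and progress-logging watchouts can be sketched as two small helpers. This is a sketch under stated assumptions: `toBatches` and `progressLine` are hypothetical names and the log format is illustrative; only the batch size of 250 and the logged quantities (percentage, rate, ETA) come from the list above. The actual bulkWrite call with its timeout options is left out.

```typescript
// Hypothetical helpers for a batched ingest loop.
const BATCH = 250;

function toBatches<T>(items: T[], size: number = BATCH): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

function progressLine(done: number, total: number, elapsedMs: number): string {
  const seconds = Math.max(elapsedMs / 1000, 0.001); // avoid divide-by-zero
  const rate = done / seconds;
  const etaSec = rate > 0 ? Math.round((total - done) / rate) : 0;
  const pct = ((done / total) * 100).toFixed(1);
  return `${done}/${total} (${pct}%) ${Math.round(rate)} docs/s ETA ${etaSec}s`;
}

// Example: 1,000 docs split into batches, with a progress line after 2 batches.
const demoBatches = toBatches(Array.from({ length: 1000 }, (_, i) => i));
const demoLine = progressLine(500, 1000, 2000);
console.log(demoBatches.length, demoLine);
// In a real script each batch would be written, then the line emitted
// immediately via process.stdout.write(demoLine + "\n").
```

Emitting one line per batch from inside the loop is what keeps a stall visible within seconds instead of after a 30-minute hang.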
Marketing/stats numbers refreshed (2026-06-09, post-NJ Phase C)
After Phase C closed, the home page + agents + team page coverage stats were refreshed to reflect actual staging Atlas counts:
| Stat | Old (FFM-only 2026-05) | New (post-NJ) |
|---|---|---|
| Doctors | 1.75M | 3M+ (NPI total × 70% Type 1 NPPES national ratio, conservative round) |
| Medications | 67,000 | 170,000 (RxCUI × 5.19 NDC ratio measured from UHC NJ Phase B real data) |
| Plans | 4,326 | 4,847 (exact) |
| Carriers | 183 (was HIOS prefix count) | 151 (distinct legal-entity plans.issuer field — more honest unit) |
| States | 31 | 34 (added NJ + 2 others) |
Surfaces updated: apps/web/src/app/_home/target-body.ts, apps/web/src/app/agents/page.tsx, apps/web/src/app/team/how-we-work/page.tsx, apps/web/src/app/agents/opengraph-image.tsx, apps/web/src/app/creative-adbundance/page.tsx. Historical pages (recruit letters, update entries, CLAUDE.md "What shipped" log) intentionally preserved as snapshots.
Illinois
Marketplace: Get Covered Illinois (getcovered.illinois.gov, enrollment at enroll.getcovered.illinois.gov) — GetInsured stack, same /prescreener/ + /hix/private/getIndividualPlans layout as Pennie/GetCoveredNJ. IL is a brand-new SBE for plan year 2026 (moved off healthcare.gov 2026-01-01).
Status (2026-06-10): ALL PHASES COMPLETE — Phase A on BOTH clusters (271 plans, 13 RAs, 7 carriers) + A.5 QRS (233/271 rated, 86%) + Phase B drugs 7/7 carriers + Phase C providers 7/7 carriers (all on staging per ADR 0004) + IL in OWNED_COVERAGE_STATES + drug-search re-derived + Layer 5 213 checks green. End-to-end smoke: atorvastatin Covered w/ correct tiers on BCBS/Molina/Oscar; provider coverage resolving per-network. Awaits deploy (prod zip_county cleanup is the deploy-time step). Per-carrier fill details in the playbook IL rows; key IL-specific patterns: HCSC aca-json/<st>/index_<st>.json is the §1311 index; Oscar+Cigna scrub SBE states from their national §1311 files (TIC is the fallback); Molina IL drugs = PDF-only (tier legend ≠ CA's); Cigna IL = county heuristic.
Phase A verification record: L1 13/13 RAs SLCSP round-trip Δ$0.00 + APTC-implied cross-check (~$2.5 in 11/13; RA10 $26.01 / RA12 $7.65 — non-EHB benchmark adjustment); L4 wave1 ZERO diffs vs local AND prod apex + calculator-baseline ZERO DIFFS 17/17; Layer 5 207 checks 0 failed across NJ/PA/CA/NY/IL. Snapshots: staging B 6a29945b0eebcb61c72bf795 / C 6a29965c91222ec041a9e70d; prod B 6a29945d66b0b868639cf192 / C 6a29965bbc3a6e4857ab6e07. Cohort guards: non-IL 2026 = 4,847 unchanged on both clusters.
⚠️ zip_county cleanup is split by cluster (deliberate): STAGING cleaned 2026-06-10 (2,073 ZIPs — needed for local verification; stage.askflorence.health degrades gracefully, same IL banner one step later). PROD cleanup is a DEPLOY-TIME step — run WRITE_CONFIRM=yes IL_ZIPCLEANUP_URI=<prod> npx tsx scripts/db/cleanup-il-zip-county-2026.ts --apply together with the code deploy, so apex IL users never hit a no-redirect/no-plans intermediate state (inverse of the PA A.4 lesson: plans data is additive-safe pre-deploy, the redirect flip is NOT).
Wave1-regression lesson (caught here, applies to every future state graduation): the harness's FEDERAL_30 list still contained IL/GA/KY from their FFM era, and its ZIP pool ALSO filters on zip_county's sbeRedirect — so cleaning IL's ZIPs silently injected 2,073 candidates and shifted all 100 seeded scenarios. Fixed by removing IL/GA/KY from the list. Separately, the committed wave1 fixture was STALE since ENG-414 (June 1): 13 non-expansion-state scenarios legitimately changed (coverage-gap → full-subsidy bump) and the fixture was never re-captured (wave1 only runs in preflight --full). Verified identical 13-scenario signature against prod apex pre-recapture, then re-captured; ZERO diffs vs local + prod after. When graduating a state: check the wave1 FEDERAL_30 list, and expect to re-capture deliberately if upstream behavior changed since the last capture.
Phase 0 locked decisions
- The playbook's old "IL is SBE-FP / already covered via FFM PUF" row was WRONG for 2026. Live DB check: 0 IL plans, all 2,073 IL ZIPs carry `sbeRedirect`. IL requires the GetInsured scrape. (The claim may have described an earlier plan year.)
- 13 rating areas, county-based, unchanged from PY2025. County→RA source: CMS IL-GRA page, which 403s in WebFetch but serves fine to `curl` with a browser UA (same datacenter-blocking class as PA's CMS-GRA timeout). All 102 counties mapped + cross-checked against the IDOI "2026 Analysis of Illinois On-Exchange Plans" PDF (idoi.illinois.gov). RA1 = Cook alone; RA13 = 27 southern counties.
- 7 carriers, 285 on-exchange plans statewide per IDOI (down from 11 carriers in 2025; Aetna ×2, Health Alliance, Quartz exited): BCBS IL (HCSC, the only statewide issuer, sole issuer in 63 counties), Ambetter of Illinois (Celtic, HIOS 27833), Cigna IL, Molina, Oscar, UnitedHealthcare, MercyCare (WI-border counties).
- No state premium subsidy for 2026: federal APTC only. No `calculateIlEligibility()` math needed; whether the eligibility ROUTE needs an IL branch (NJ-style, because CMS API 502s for SBE states) vs FFM passthrough (PA-style) gets verified during Phase A.4.
- IL prescreener = PA-style intro page (H1 "Shop for Health Coverage" → Continue → form H1 "Estimated monthly premium for 2026"), with TWO deltas wired into the shared scraper 2026-06-10: (a) no `#coverage-year-select` on the form, so the year-select fill is now feature-detected; (b) the intro H1 renders before the Continue button does, and IL's SPA navigates synchronously on click, so `clickButtonsByText()` (new helper) waits for the button to exist and tolerates execution-context teardown. PA/NJ behavior unchanged.
- IL results page shows APTC but NO SLCSP sentence. `prescreener.slcspMonthly` is null in every IL scrape. SLCSP benchmark (expectedSlcspAge40) is backfilled from the no_csr scrape's 2nd-lowest silver (NJ precedent) + independently cross-checked against the APTC-implied SLCSP (aptc + income × applicable%/12) where APTC > 0.
- csr_94 bucket is Medicaid-gated at the default $22K income (like PA/NJ). Resolution via `csr94IncomeOverride` probing after the main matrix (NJ needed $22,500 = 149.4% FPL).
- Phase B head start: 5 IL HIOS prefixes ALREADY have year-2026 drug data in `formularies_staging` from the national §1311 MRF ingest (`ffm_1311_mrf` tag): 11574 (4,063 drugs), 27833 Ambetter (4,302), 42529 (3,805), 53882 (4,700), 54322 (4,571). Issuers publish §1311 MRFs regardless of FFM/SBE status, so the May-2026 FFM sweep captured IL plan ids. Phase B = verify these against the scraped 2026 hiosIds + fill the missing carriers, not a from-scratch ingest.
- Anchor ZIPs (all verified single-county in zip_county): RA1 60602 Cook · RA2 60085 Lake · RA3 60187 DuPage · RA4 60435 Will · RA5 61101 Winnebago · RA6 61201 Rock Island · RA7 61602 Peoria · RA8 61701 McLean · RA9 61820 Champaign · RA10 62701 Sangamon · RA11 62626 Macoupin · RA12 62220 St. Clair · RA13 62959 Williamson (Carbondale's 62901 is multi-county, so avoided).
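The APTC-implied cross-check described above can be sketched as follows. The function and field names are hypothetical (they are not taken from the scraper); only the formula `aptc + income × applicable%/12` and the "where APTC > 0" condition come from the decision text, and the tolerance is a caller-supplied assumption.

```typescript
// Hypothetical helper names; only the formula comes from the decision above.
interface SlcspCheck {
  impliedSlcsp: number; // dollars per month
  delta: number; // absolute difference vs the backfilled benchmark, in dollars
  ok: boolean;
}

/** APTC-implied SLCSP = aptc + income * applicablePct / 12, rounded to cents. */
function impliedSlcspMonthly(
  aptcMonthly: number,
  annualIncome: number,
  applicablePct: number, // as a fraction, e.g. 0.04 for 4%
): number {
  const expectedContribution = (annualIncome * applicablePct) / 12;
  return Math.round((aptcMonthly + expectedContribution) * 100) / 100;
}

/** Returns null when APTC is zero, because the identity only holds for APTC > 0. */
function crossCheckSlcsp(
  backfilledSlcsp: number,
  aptcMonthly: number,
  annualIncome: number,
  applicablePct: number,
  toleranceDollars: number,
): SlcspCheck | null {
  if (aptcMonthly <= 0) return null;
  const impliedSlcsp = impliedSlcspMonthly(aptcMonthly, annualIncome, applicablePct);
  const delta = Math.abs(impliedSlcsp - backfilledSlcsp);
  return { impliedSlcsp, delta, ok: delta <= toleranceDollars };
}
```

A null result means the scenario cannot be cross-checked this way, which is distinct from a failed check.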
Files (IL-specific)
- `scripts/db/data/il-rating-areas-2026.ts`: 13 anchors + 102-county FIPS→RA
- `scripts/db/build-il-plans-from-scrape-2026.ts`: forked from NJ; CSR variants now fill per-key across ALL RAs (the NJ/PA first-RA-only guard would lose variants when a later RA supplies a bucket the anchor RA missed)
- `scripts/db/write-il-plans-2026.ts` / `scripts/db/validate-il-plans-2026.ts`: forked from NJ
- `scripts/db/scrape-getinsured-2026.ts`: `STATE_CONFIGS.IL` + the two IL deltas (decision #5)
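A minimal sketch of the per-key CSR fill mentioned for the build script, assuming each rating-area scrape yields a map of CSR bucket to variant. The `RaScrape` shape and `fillVariantsPerKey` name are hypothetical; the real build script's types are not shown in this doc.

```typescript
// Hypothetical shape: one entry per rating area, variants keyed by CSR bucket.
interface RaScrape<T> {
  ratingArea: number;
  variants: Record<string, T | undefined>;
}

/**
 * Fill each CSR bucket from the first rating area that supplies it.
 * A first-RA-only guard would instead keep only the anchor RA's buckets
 * and drop any bucket that only a later RA provides.
 */
function fillVariantsPerKey<T>(scrapes: RaScrape<T>[]): Record<string, T> {
  const merged: Record<string, T> = {};
  for (const scrape of scrapes) {
    for (const [bucket, variant] of Object.entries(scrape.variants)) {
      if (variant !== undefined && !(bucket in merged)) {
        merged[bucket] = variant;
      }
    }
  }
  return merged;
}
```

With this approach the anchor RA still wins whenever it has a bucket; later RAs only contribute buckets that are still missing.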
Virginia
Marketplace: Virginia's Insurance Marketplace (marketplace.virginia.gov, enrollment at enroll.marketplace.virginia.gov) — GetInsured stack, SBE since 2024. Fourth GetInsured consumer (PA → NJ → IL → VA).
Status (2026-06-10, ENG-450): ALL PHASES COMPLETE. Phase A on BOTH clusters (69 plans, 12 RAs, 6 carriers, QRS 65/69 = 94%); Phase B drugs 6/6; Phase C providers 6/6 (staging per ADR 0004); VA in OWNED_DATA_STATES + OWNED_COVERAGE_STATES; Layer 5 243 checks green across all six states. Snapshots: staging B 6a29bce5/C 6a29be91/post-BC 6a29da55; prod B 6a29bce6/C 6a29be90. Prod zip_county cleanup is a deploy-time step (cleanup-va-zip-county-2026.ts, IL-pattern).
Locked decisions + carrier map
- 12 RAs / 133 county-equivalents (CMS VA-GRA via curl; WebFetch 403s). VA independent cities are separate FIPS; our `zip_county` stores them as `"<Name> County"`; Radford (51750) + Salem (51775) need name aliases when joining against CMS's "City" naming.
- Federal APTC only; CMS API rejects VA → NJ-style eligibility branch via the generalized `calculateOwnedSbeEligibility()`. ⚠️ Process lesson: a case-sensitive copy of the IL branch left `calculateIlEligibility` in place, so VA briefly computed IL SLCSP locally. The end-to-end smoke caught it; ALWAYS smoke the new state's eligibility against the scrape benchmark before shipping.
- csr_94 at $22,500 (same override as NJ/IL). VA's results page DISPLAYS the SLCSP (unlike IL) but with different phrasing than PA, so `prescreener.slcspMonthly` stays null; the APTC-implied cross-check matched ≤$0.49 on 12/12 RAs.
- Anthem 2026 drugs have NO machine-readable file. Elevance's FNAV JSON (`publish/143/40/drugs.json`, found by sweeping the id range for `88380VA`) is 2025-only (072-series plan ids). The 2026 product (099-series) exists only as the FNAV-hosted PDF (2026_Select_4_Tier_VA_IND.pdf, all 14 plans share it) → Molina-IL-style PDF parse + RxCUI fan-out → 8,297 RxCUIs.
- UHC drugs via OptumRx `GPX526VA`: the predicted one-char swap from NJ's GPX526NJ. Headless Playwright from a residential IP needed NO Chrome MCP (unlike the NJ session): 2-token capture + drugs-by-alphabet/drug-results replay → 3,604 RxCUIs (NJ parity: 3,570).
- Kaiser: per-state §1311 path (`healthy.kaiserpermanente.org/content/dam/kporg/data/va/`). Drugs current; the provider file is stale at year-2023 (KP stopped refreshing when VA left the FFM). Year-waiver applied (plan ids match the scraped 2026 hiosIds exactly; KP's integrated network is minimal-drift): 1,901 NPIs. KP's tier vocabulary is `TIER-ONE..FOUR`, added to the ingester's FFM_TIER_VOCAB after the first apply landed UNCLASSIFIED (pulled by source tag + re-applied; ALWAYS review the dry-run tier list BEFORE applying).
- Anthem providers: `www22.elevancehealth.com/cms/PROVIDERS_VA.json` (per-state files; the index lists FFM states only but the VA file exists unlisted): 54,383 NPIs, 14/14 plans. Oscar: TIC network 027 → 3,816. Cigna: county heuristic (RAs 7/10/11, 278 ZIPs) → 12,772. Sentara/UHC: ffm-swept (57K/43K).
- L3 warns heavily for VA: 8 of 12 RAs share the identical $493.61 benchmark (Anthem prices its benchmark statewide). Genuine, not broken RA partitioning.
- Legacy `89242VA*` entries (Anthem's prior HIOS) exist in formularies_staging from the old sweep. They don't serve 2026 plans; ignore in counts.
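The "review the dry-run tier list BEFORE applying" lesson from the Kaiser bullet can be sketched as a pre-apply audit that counts tier labels the vocabulary cannot classify. `TIER_VOCAB` here is a hypothetical stand-in for the ingester's FFM_TIER_VOCAB (whose real contents are not shown in this doc), and expanding `TIER-ONE..FOUR` to four labels is an assumption.

```typescript
// Hypothetical vocabulary; the four labels are an assumed expansion of TIER-ONE..FOUR.
const TIER_VOCAB: Record<string, number> = {
  "TIER-ONE": 1,
  "TIER-TWO": 2,
  "TIER-THREE": 3,
  "TIER-FOUR": 4,
};

/** Count labels the vocabulary cannot classify, so a dry run can surface them. */
function unclassifiedTiers(
  labels: string[],
  vocab: Record<string, number>,
): Map<string, number> {
  const unknown = new Map<string, number>();
  for (const raw of labels) {
    const label = raw.trim().toUpperCase();
    if (!Object.prototype.hasOwnProperty.call(vocab, label)) {
      unknown.set(label, (unknown.get(label) ?? 0) + 1);
    }
  }
  return unknown;
}
```

A non-empty result during the dry run is the signal to extend the vocabulary first, rather than applying and pulling the docs back by source tag afterwards.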
Nevada
Marketplace: Nevada Health Link (enroll.nevadahealthlink.com) — GetInsured stack, NJ-style direct form (no intro page). Fifth GetInsured consumer.
Status (2026-06-10, ENG-451): Phase A LIVE both clusters (135 plans, 4 RAs, 9 carriers, QRS 106/135). Phase B drugs 9/9 carriers; Phase C providers 7/9. NV in OWNED_DATA_STATES + OWNED_COVERAGE_STATES. Snapshots: staging B 6a29e184/C 6a29e34f/postFill 6a29f036/postDrugFill 6a2a00c7/preCCrefill 6a2a13f1/preImperialRx 6a2a1e23; prod B 6a29e186/C 6a29e34e. Prod zip_county cleanup deploy-time (cleanup-nv-zip-county-2026.ts). Final gaps (2 of 9, evidenced carrier-side): CareSource 35107 providers (SPA API host-blocked + §1311 MRF broken) and Imperial 43314 providers (directory PDF 0 NPIs, online tool empty iframe, no §1311 MRF). All 9 carriers have drugs; 7 have providers. Two earlier "gaps" were diagnostic errors corrected by driving the live SBE (see Decision 11): Community Care 11765 providers (= Anthem Battle Born network) + Imperial 43314 drugs (live FNAV siteCode 8528279638, not the stale scraped one).
Locked decisions + carrier map
- 4 RAs / 17 counties (CMS NV-GRA via curl; incl. Carson City independent city). RA1 = Clark+Nye (Vegas), RA2 = Washoe (Reno), RA3 = Carson/Douglas/Lyon/Storey, RA4 = 10 rural.
- Federal APTC only; csr_94 $22,500 (NJ/IL/VA pattern). NV results page displays SLCSP. `calculateNvEligibility()` via the generalized helper, with the VA case-lesson applied (verified the branch calls the NV helper, not IL/VA). L1 + APTC-implied ≤$0.32 on 4/4.
- 9 carriers, the batch's most: Health Plan of Nevada 95865 (31 plans, Sierra/UHC HMO), Ambetter/SilverSummit 45142 (30), Anthem 60156 (22), CareSource 35107 (17), Hometown Health 41094 (13), SelectHealth 84445 (10), Community Care 11765 (5), Imperial 43314 (5), Molina 79363 (2).
- Coverage reachability: Ambetter + Molina + SelectHealth complete (drugs+providers, ffm-swept). CareSource drugs ffm-swept (providers gap). Anthem 60156 + Community Care 11765 share ONE FNAV PDF (`2026_Select_4_Tier_NV_IND.pdf`, publisher 143, same parser as VA Anthem) → 8,560 RxCUIs attributed to the union of both carriers' 27 plans. HPN/Hometown/Imperial: no clean §1311 source (not in MR-PUF; CareSource/HPN §1311 index URLs all 404).
- Provider fills found on a second pass (the first pass stopped too early; that is the lesson): Phase C initially shipped 3/9, rationalized as "statewide-HMO heuristic inappropriate." That reasoning is valid for the heuristic, but it masked two carriers with REAL §1311/portal sources that just weren't pursued:
  - Anthem 60156 → `elevancehealth.com/cms/PROVIDERS_NV.json` (15,053 NPIs, 22/22 plans): the EXACT proven pattern from VA's PROVIDERS_VA.json, which I'd used an hour earlier and failed to try for NV. Year-stamped 2025 but plan ids == 2026 hiosIds 22/22 → year-waiver (Kaiser-VA precedent).
  - HPN 95865 (largest carrier, 31 plans) → UHC TIC blobs API (`transparency-in-coverage.uhc.com/api/v1/uhc/blobs/?searchText=Sierra-Health`) → `Sierra-Health---Life---Nevada_Insurer_Commercial-HMO` in-network file (HPN/Sierra is UHC-owned). 5,491 NPIs from `provider_references` at head (NJ-Oxford streaming pattern). Now 5/9 providers (the 5 largest carriers by plan count = 100 of 135 plans). The lesson: before declaring a provider gap, exhaust the proven sibling-state patterns (Elevance `PROVIDERS_<ST>.json`, UHC blobs API, FNAV); only THEN fall back to "no data." The heuristic-is-wrong point still holds for the genuinely-sourceless statewide HMOs.
- Drug fills found on a second push (the user correctly flagged 7/9 wasn't done): HPN 95865 (the LARGEST carrier, 31 plans) drugs via OptumRx ClientFormulary `SE42L77`, the same 2-token replay as UHC NJ/VA but a `ClientFormulary` var, not GPX (found on HPN's PDL page; HPN/Sierra is UHC-owned; 4-tier, no Tier 5) → 4,794 RxCUIs. Hometown 41094 drugs via its IFP Exchange formulary PDF (Optum Rx EHB Base template, bare-integer tiers) → 8,792 RxCUIs. NV drugs now 8/9.
- Corrected gaps after driving the LIVE SBE (2026-06-10 evening, founder-flagged; see Decision 11 below): two of the three "gaps" above were MY diagnostic errors, caught by going to the actual exchange:
  - Community Care 11765 IS Anthem (NOT a separate network). Filtering Nevada Health Link by issuer "Community Care Health Plan of Nevada" returns Anthem-branded "Battle Born State Plan Anthem Bronze/Gold/Silver" plans; their provider directory is `anthem.com/find-care?alphaprefix=NVD`, the SAME find-care network as Anthem's own plans. The scrape corroborates (`networkName: "Anthem Battle Born ... Network"`). So 11765 shares the Anthem find-care network (15,053 NPIs) + Anthem formulary. I had briefly trusted `PROVIDERS_NV.json` (which files providers ONLY under parent HIOS 60156) and wrongly rolled this back. The SBE is the tie-breaker; a co-branded affiliate's MRF is filed under the PARENT HIOS, so its absence ≠ separate network. Providers → 7/9.
  - Imperial 43314 drugs RECOVERABLE. Its formulary IS published at FormularyNavigator siteCode 8528279638 (linked from `exchange.imperialhealthplan.com/nevada/drug-formulary/`). The plan-scrape captured a stale siteCode (5227003519→UnderConstruction.htm), so every prior pass mis-concluded "carrier hasn't published." Scraped via `scrape-pa-formulary-navigator.cjs` (same MMIT platform as PA Highmark/IBX). Drugs → 9/9. Lesson: when a scraped doc URL 404s/under-constructions, check the carrier's OWN live doc page for a fresher URL before declaring it unpublished.
  - Hometown 41094 providers RECOVERED 2026-06-10 (20,148 NPIs): its LEASED-NETWORK §1311 index pointed at a broken `/about-us/` path, but the same single Reno HMO network's in-network file is live at the `/documents/Transparency-in-Coverage/` path (the Caesars employer file; NJ-Oxford shared-network pattern). A 404 on ONE index path ≠ no source.
  - Genuinely-remaining gaps (evidenced, fully probed): CareSource 35107 providers (documented least-compliant §1311 payer: own marketplace MRF index 404/000; UHC hosts only CareSource's employee plans; find-a-doctor SPA API is host-permission-blocked to automation), Imperial 43314 providers (directory PDF has 0 NPIs; online tool is an empty iframe; no §1311 MRF). 2 of 9 (22 of 135 plans).
  - Net: drugs 9/9, providers 7/9, up from the prematurely-reported 8/9 + 6/9, entirely by driving the live exchange to correct two mis-diagnoses.
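The year-waiver test used for Anthem NV and Kaiser VA (a stale-year file is accepted only when its plan ids match the scraped 2026 hiosIds exactly) can be sketched like this. The function name, result shape, and the normalization step are assumptions, not the ingester's actual code.

```typescript
interface WaiverCheck {
  matched: number;
  total: number;
  missing: string[];
  eligible: boolean;
}

/** A stale-year file qualifies only if it covers every scraped current-year plan id. */
function checkYearWaiver(scrapedHiosIds: string[], filePlanIds: string[]): WaiverCheck {
  const normalize = (id: string): string => id.trim().toUpperCase();
  const inFile = new Set(filePlanIds.map(normalize));
  const wanted = Array.from(new Set(scrapedHiosIds.map(normalize)));
  const missing = wanted.filter((id) => !inFile.has(id));
  return {
    matched: wanted.length - missing.length,
    total: wanted.length,
    missing,
    eligible: wanted.length > 0 && missing.length === 0,
  };
}
```

Reporting `matched/total` alongside `missing` gives the same "22/22" style evidence recorded in the bullets above, and a single missing id is enough to deny the waiver.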
Cross-state lessons learned (2026-06-09, post-CA + NY + PA + NJ)
This section synthesizes patterns across four ingested SBEs. Every SBE state going forward should be checked against these heuristics during Phase 0 (research) BEFORE writing code.
A. Provider-discovery patterns by carrier type
| Carrier shape | Examples | Discovery |
|---|---|---|
| National carrier with own TIC SPA | UHC (transparency-in-coverage.uhc.com) | Gatsby/React SPA fronts an undocumented API. Drive in Chrome MCP, watch network tab for the catalogue endpoint (e.g. UHC's /api/v1/uhc/blobs/). Returns Azure SAS-signed URLs valid for years. |
| Vendor-hosted TIC | Horizon → Sapphire MRF Hub (Zelis), some carriers → HealthSparq, BCBS family → various | The www.{carrier}.com TIC page is often Incapsula/Imperva-protected to defeat curl. The vendor portal itself usually isn't. Drive the www. page in Chrome to find the outbound vendor link, then curl the vendor portal directly. |
| Carrier on national §1311 with state-coded var | UHC drugs via OptumRx (GPX426NY → NY, GPX526NJ → NJ, likely GPX[#][STATE] for sibling states) | When you've cracked the carrier in one state, the next state's var is usually a one-char swap. Carry the OptumRx 2-token auth capture pattern over. |
| State-rolled S3 with TOC-listed thin file + bigger sibling | Oscar (oscar-002 853 KB listed in TOC, oscar-001 92 MB at same path with the full data) | When a TOC-listed §1311 file looks improbably small for the state's market size (e.g. <1K NPIs for a state with >10K marketplace members), probe sibling files at the same S3/path prefix. Both files are at the carrier's own §1311 infra so either is defensible. |
| Carrier-direct §1311 MRF | Centene/Ambetter, AmeriHealth via IHG GCS, Highmark via HMHS, Aetna direct | Pattern A as originally documented. These are the easiest. |
B. TOC schema patterns
Most §1311 TOCs are EIN-keyed (employer-reporting), not HIOS-keyed. The marketplace plan (HIOS prefix) is reachable via:
- Reporting structures that group employers sharing a network (`reporting_plans[]` with `plan_id_type: "EIN"`, `in_network_files[]` pointing at shared network rate files)
- Network-level files for "Insurer" reporting structures (Oxford-Health-Insurance-Inc has `Metro-Network`, `Choice-Plus`, `Core`, `Freedom-Network`, etc.); the marketplace network is the one matching the state's filename convention (MCEX, Marketplace, Individual, OMT, etc.)
- Filename heuristics are more reliable than schema for Horizon-style EIN-keyed TOCs:
  - MCEX = Managed Care EXchange
  - OMT1, OMT2 = Omnia Tier 1/2 (Horizon's product family for marketplace)
  - Individual / Individual-Small-Group = filters off large group
  - Recent date suffix (2026-05-DD) > older (2022/2023/2024) leftover files
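The filename heuristics above can be turned into a simple ranking, sketched here. The token list comes from the bullets; the scoring (one point per token, ties broken by the most recent date suffix) and all names are illustrative assumptions, and a human should still confirm the top candidate.

```typescript
// Tokens from the list above; the one-point-per-token weighting is an assumption.
const MARKETPLACE_TOKENS = ["MCEX", "OMT", "MARKETPLACE", "INDIVIDUAL"];

interface RankedFile {
  name: string;
  score: number;
  date: string; // "" when the filename carries no date suffix
}

/** Rank TOC filenames: marketplace tokens first, then the most recent date suffix. */
function rankNetworkFiles(names: string[]): RankedFile[] {
  const ranked = names.map((name): RankedFile => {
    const upper = name.toUpperCase();
    const score = MARKETPLACE_TOKENS.filter((token) => upper.includes(token)).length;
    const dateMatch = name.match(/20\d{2}-\d{2}(?:-\d{2})?/);
    return { name, score, date: dateMatch ? dateMatch[0] : "" };
  });
  return ranked.sort((a, b) => {
    if (a.score !== b.score) return b.score - a.score;
    if (a.date === b.date) return 0;
    return a.date < b.date ? 1 : -1; // ISO-style dates sort lexicographically
  });
}
```

Files without any marketplace token sink to the bottom regardless of date, which matches the "filters off large group" intent.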
C. Streaming + NPI extraction
- NPIs in §1311 in-network rate files almost always live in a `provider_references` block at the file HEAD (before the `"in_network"` array). Read until you hit the `"in_network"` marker, then stop. For a 1 GB gz file, this typically captures all NPIs within the first 1-2 MB of decompressed bytes.
- Oscar's CMS schema v2.0.0 with indexed `provider_references` (numeric indices) is the exception. Read the full file; the same `"npi": [...digits...]` regex catches the trailing dictionary.
- Use a sliding-window text buffer + regex match (`/"npi"\s*:\s*\[([^\]]*)\]/g`). Cap the buffer at ~1 MB to defend against missing-bracket pathological cases.
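A minimal sketch of the head-only extraction described above, written against an iterable of already-decompressed text chunks so it stays independent of any particular stream library. The function name and the 10-digit filter are assumptions; the regex, the `"in_network"` stop marker, and the ~1 MB cap come from the bullets.

```typescript
const MAX_BUFFER = 1024 * 1024; // ~1 MB cap against missing-bracket inputs

/** Collect NPIs from the file head, stopping at the "in_network" marker. */
function extractHeadNpis(chunks: Iterable<string>): Set<string> {
  const npiArray = /"npi"\s*:\s*\[([^\]]*)\]/g;
  const npis = new Set<string>();
  let buffer = "";
  for (const chunk of chunks) {
    buffer += chunk;
    const marker = buffer.indexOf('"in_network"');
    const text = marker === -1 ? buffer : buffer.slice(0, marker);
    let lastEnd = 0;
    npiArray.lastIndex = 0;
    let match: RegExpExecArray | null;
    while ((match = npiArray.exec(text)) !== null) {
      for (const digits of match[1].match(/\d{10}/g) ?? []) npis.add(digits);
      lastEnd = npiArray.lastIndex;
    }
    if (marker !== -1) break; // everything after this is rate data
    // Keep only the unconsumed tail so arrays and markers split across chunks still match.
    buffer = buffer.slice(lastEnd);
    if (buffer.length > MAX_BUFFER) buffer = buffer.slice(-MAX_BUFFER);
  }
  return npis;
}
```

For the Oscar-style exception, the same loop would need to run without the early break so the trailing dictionary is read as well.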
D. Atlas write discipline (this took 30+ minutes to learn the hard way)
For per-carrier provider ingest on M10 staging:
| Setting | Wrong (hangs on M10) | Right |
|---|---|---|
| Batch size | 1000 | 250 |
| Per-`bulkWrite` `maxTimeMS` | omitted | 30000 ms |
| `writeConcern` | default (majority, no timeout) | `{w: 1, wtimeout: 30000}` |
| Logging | end of script | per-batch process.stdout.write with rate + ETA |
| Resume capability | none | --start-line flag so a hung first run can resume cleanly |
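The "Right" column can be sketched as a small batching loop. The write itself is injected as a callback so the sketch does not assert any particular driver signature; `planBatches`, `writeInBatches`, and the option object are hypothetical names, while the values (250, 30000 ms, `{w: 1, wtimeout: 30000}`) are the ones from the table.

```typescript
// Values from the table above; helper and option names are hypothetical.
const BATCH_SIZE = 250;
const WRITE_OPTIONS = { maxTimeMS: 30_000, writeConcern: { w: 1, wtimeout: 30_000 } };

/** Half-open [start, end) ranges, beginning at the resume point (the --start-line value). */
function planBatches(total: number, startLine: number, batchSize: number): Array<[number, number]> {
  const ranges: Array<[number, number]> = [];
  for (let start = startLine; start < total; start += batchSize) {
    ranges.push([start, Math.min(start + batchSize, total)]);
  }
  return ranges;
}

type WriteBatch<T> = (batch: T[], options: typeof WRITE_OPTIONS) => Promise<void>;

/** Write batch by batch, logging position, rate and ETA after every batch. */
async function writeInBatches<T>(
  docs: T[],
  startLine: number,
  writeBatch: WriteBatch<T>,
  log: (line: string) => void,
): Promise<number> {
  const started = Date.now();
  let written = 0;
  for (const [start, end] of planBatches(docs.length, startLine, BATCH_SIZE)) {
    await writeBatch(docs.slice(start, end), WRITE_OPTIONS);
    written += end - start;
    const elapsedSec = Math.max((Date.now() - started) / 1000, 0.001);
    const rate = written / elapsedSec;
    const etaSec = Math.round((docs.length - end) / rate);
    log(`line ${end}/${docs.length} rate=${rate.toFixed(0)}/s eta=${etaSec}s`);
  }
  return written;
}
```

Because each log line reports the last completed line, a hung run can be restarted by passing that number as the next `startLine`.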
E. Cluster targeting (ADR 0004 + ADR 0006)
- Plans live on BOTH clusters. New SBE state ingest writes to `MONGODB_URI` (prod, `askflorence-prod-01.njkihm`) AND to `MONGODB_WRITE_URI` (staging, `askflorence-staging.efsikmv`). The PA Phase A.4 decision (ENG-418, 2026-06-02) is canonical: apex `/api/counties` + `/api/plans` resolve via `getDb()` → prod cluster. Without the prod-cluster mirror, apex returns 502 on the new state's ZIPs even though staging looks fine.
- `providers_staging` + `formularies_staging` live ONLY on staging. Per ADR 0004 cross-cluster Atlas PrivateLink + cost-optimization: prod M10 doesn't carry these collections. All Phase B/C ingest scripts MUST `assertSafeCluster()` against the staging hostname and exit if the URI points at prod. Apex reads coverage data via PrivateLink from staging cluster transparently.
- Snapshot before `--apply`. PA Phase A.4 used Atlas snapshot IDs (e.g. `6a1e6c1d085fb664b689eec7` pre-apply, `6a1e6e23923b91f0f761d244` post-apply) to capture restore points. Every new SBE state's `--apply` step should be preceded by an Atlas snapshot on the destination cluster (staging for Phase B/C, BOTH clusters for Phase A plans).
F. Cohort + cluster guards in every ingester
Every Phase B/C ingester must include these guards. They have prevented real damage twice in this session alone:
```javascript
// Cluster guard: refuse to run unless the URI points at the staging cluster.
function assertSafeCluster(uri) {
  const host = uri.match(/^mongodb(?:\+srv)?:\/\/[^@]*@([^/?]+)/)?.[1];
  if (!host?.includes("askflorence-staging") && !host?.includes("efsikmv")) {
    throw new Error(`CLUSTER GUARD FAIL: host=${host}`);
  }
}

// Cohort guard: pre-flight + post-flight, both indexed queries.
// <state> is a placeholder for the state's _id namespace.
const countNonState = async () =>
  (await coll.countDocuments({})) - (await coll.countDocuments({_id: /^<state>:/}));

const nonStateBefore = await countNonState();
// ... ingest ...
const nonStateAfter = await countNonState();
const drift = Math.abs(nonStateAfter - nonStateBefore);
if (drift > 50) {
  throw new Error(`COHORT GUARD FAIL: drift=${drift}`);
}
```
The user gate on writes: `WRITE_CONFIRM=yes node ingest-...cjs --apply`. The env var is the founder gate; the script must refuse `--apply` without it.
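The write gate can be sketched as a pure function over the argument list and environment, which keeps it easy to test. `resolveRunMode` and the error text are hypothetical; the rule it encodes (refuse `--apply` unless `WRITE_CONFIRM=yes`) is the one stated above.

```typescript
type RunMode = "dry-run" | "apply";

/** Resolve the run mode; --apply is refused unless WRITE_CONFIRM=yes is set. */
function resolveRunMode(
  argv: string[],
  env: Record<string, string | undefined>,
): RunMode {
  if (!argv.includes("--apply")) return "dry-run";
  if (env["WRITE_CONFIRM"] !== "yes") {
    throw new Error("WRITE GATE FAIL: --apply requires WRITE_CONFIRM=yes");
  }
  return "apply";
}
```

In a script this would typically be called once at startup with `process.argv.slice(2)` and `process.env`, before any connection is opened.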
G. Validation discipline (the gap closing now)
The existing per-state Playwright specs (calculator-{ut,ca,sbe-nj}-takeover.spec.ts) test the UI flow but do NOT test the depth of the actual ingest data. The gap closes via the SBE invariants framework being built in a follow-on session — see "Per-state ingest invariants" in sbe-ingestion-playbook.md.
Template for new states
When you open work on a new SBE state (PA, NJ, MA, WA, CO, CT, MD, etc.), investigate these before writing code:
Data-layer investigation
- Marketplace system — name + tech stack (CalHEERS / NYSOH / GetCoveredNJ / etc.). Check for an anon SPA endpoint pattern (same technique that surfaced CalHEERS anon endpoints).
- Provider directory — is there a statewide centralized one (CA-style Symphony) or per-carrier portals? What's the provider ID system (NPPES NPI, state PIN, internal ID)?
- Drug formulary publication format — usually per-carrier PDFs but format varies; check if state mandates a standard layout (CA does via SBP).
- Rating areas — how does the state partition its geography for pricing? Some states have 1-2; CA has 19.
- Plan ID format: 14-char FFM or state-specific? Document the format so `lookupPlanBackend()` can dispatch correctly.
Decisions you'll need to make + document here
- Plan-attribution granularity (per-network vs HIOS-prefix vs other)
- Accepting-status default + rationale
- Tier-aware copays vs single-tier
- Whether `puf` is populated (CA wasn't, which affects every downstream consumer)
- Anon endpoint legal posture + license inquiry path
- NPI bridge necessity (almost always: no — see CA decision #3 reasoning)
Layer 5 invariants (REQUIRED for every state — ENG-447)
- During Phase C ingest, record 5-10 golden NPIs (spread across carriers) + 5-10 golden RxCUIs (covered by multiple carriers). At the END of Phase C, BEFORE the next state's Phase A, append the state's `StateInvariantsConfig` block to `scripts/audit/sbe-state-invariants.ts` (exact plan count, carrier HIOS prefixes from the live `plans` collection, per-carrier NPI/RxCUI floors ~5-15% under verified actuals), run `--capture --state <ST>`, review the fixture diff, run `npm run audit:sbe-invariants`, commit both files. This is what protects YOUR state from the NEXT state's ingest.
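A small sketch of how the per-carrier floors could be derived and checked. The real `StateInvariantsConfig` shape is not shown in this doc, so the record types and function names below are hypothetical; only the "roughly 5-15% under verified actuals" rule comes from the text.

```typescript
/** Floor set marginPct percent under the verified actual (the text suggests ~5-15). */
function invariantFloor(verifiedActual: number, marginPct: number): number {
  return Math.floor((verifiedActual * (100 - marginPct)) / 100);
}

/** Return the carriers whose current count has dropped below its floor. */
function floorViolations(
  currentCounts: Record<string, number>,
  floors: Record<string, number>,
): string[] {
  return Object.keys(floors).filter(
    (carrier) => (currentCounts[carrier] ?? 0) < floors[carrier],
  );
}
```

Setting the floor slightly under the verified count tolerates small legitimate refresh drift while still failing loudly if a later state's ingest deletes or overwrites a carrier's documents.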
Post-ingest (REQUIRED for every state — ENG-425)
- After the state's formulary docs land in `formularies_staging`, rebuild the drug-name search read-model: `node scripts/db/derive-drug-search-index.js --apply`. It re-derives from the WHOLE collection (FFM + all SBE/CA), so the new state's meds become searchable with brand/generic strength parity + commonly-covered-form-first ranking. Search-only; coverage stays per-rxcui. See `docs/decisions/2026-05-09-refresh-cadence.md` § "Post-ingest: rebuild derived collections".
Per-state Linear tags + files
- File a sibling to `docs/data-sources/{state}-phase-c-d-ingestion-playbook.md`
- Add a section to this doc as soon as decisions are made
- Tag Linear issues `[SBE-{state}]` for greppability
Last updated: 2026-06-01 — PA Phase 0 + A.0.5 locked (ENG-418 — GetInsured 7-state stack pilot; Pennie /hix/private/getIndividualPlans JSON API discovered, eliminates SPA scraping); CA decisions 1-6 locked (2026-05-28).