Appearance
SBE Data Ingestion Playbook
Status: Living document. Add a section per state as we onboard. Last updated: 2026-05-24 (CA Wave 2 / ENG-374 in progress) Purpose: The canonical "how do we get a state's plan data" reference. When a future year or future state ingestion starts, this is what the session reads first.
State-Based Marketplaces (SBEs) do NOT file their plan data into the CMS Public Use Files (PUF). Federal PUF only covers the 30 FFM states. For every SBE (CA, PA, NJ, MA, CT, CO, MD, NY, RI, DC, VT, WA, ID, NM, MN, KY, GA, IL, NV, ME, VA), we have to source plan data per state.
This playbook captures the approaches we've used, the validation we run, and the gotchas we hit. Future ingestions reference this and either match an existing pattern or extend the playbook with a new one.
TL;DR — pick your approach by state shape
| State property | Best approach | Why |
|---|---|---|
| Community-rated (same price all ages) | NY-style scrape | Simplest. Region × plan. No age curve. |
| Age-rated, rating-area-bound | CA-style scrape | 1 anchor age × N regions; federal age curve does the rest. |
| Federal PUF coverage exists | PUF ingest | Already-shipping pipeline; one schema for whole state grid. |
| No public marketplace tool, only filings | SERFF rate-filing parse | Last resort. High friction. |
Standing rule: prefer the public-facing marketplace shop tool as the source of truth. It's what consumers see, it embeds all subsidy + filter logic, and it's free to scrape. SERFF is fallback when no public tool exists.
The decision tree
Does the state have a public shop & compare tool that shows all 2026 plans?
├── Yes → does premium vary by ZIP or only by rating area?
│ ├── ZIP-level → scrape per ZIP (NY pattern)
│ └── Rating-area-level → scrape 1 ZIP per rating area (CA pattern)
└── No → is the state on federal PUF?
├── Yes (one of 30 FFM) → ingest via scripts/db/ingest-puf-augment.js
└── No → SERFF rate-filing parse (last resort)
Always validate by:
(1) Subsidy math vs the live state marketplace tool for 1 anchor scenario
(2) Same-rating-area invariance: two ZIPs in same RA must return identical plans
(3) Cross-rating-area difference: two ZIPs in different RAs must differ
(4) Federal regression sweep: 100 federal scenarios stay byte-identicalApproach 1: NY-style scrape
When to use: community-rated state (one price all ages), small number of plans (< 100 statewide).
Used by: New York 2026 (ENG-211, shipped).
Data shape:
- NY State of Health (NYSOH) lists every Silver / Gold / Bronze / Platinum / Catastrophic plan with one premium per (plan x rating region) pair. No age multiplier — community-rated.
- 16 NY rating regions; ~20 plans visible per region; ~320 total plan rows.
Mechanics:
- Anchor demographic: any age, any income (community rating makes age irrelevant).
- Drive the marketplace tool's plan search for one ZIP per rating region.
- Capture: plan name, carrier, metal, premium, deductible, MOOP, copays per plan.
- Cross-reference DFS Exhibit 23 (NY's annual rate filing) for the canonical premium values as a backstop.
Output shape: documents in plans collection with state: "NY", regionId: 1..16, no ageRatesByArea table (community-rated means one premium across all ages).
Validation:
- 11 issuers x 16 regions = 176 SLCSP probes against NYSOH live tool. Result: zero variance.
- DFS Standard Silver chart (32 plans) all match exactly.
- See docs/validation/new-york-2026.md for the full pass record.
Approach 2: CA-style scrape (rating areas + age curve)
When to use: age-rated state where premium varies by rating area, not by ZIP. This is the most common SBE pattern.
Used by: California 2026 Wave 2 (ENG-374, in progress).
Why this works:
By regulation (CA Insurance Code §10965.9, similar laws in CO/CT/MD/WA/etc.), premium does NOT vary by ZIP within a rating area. So one anchor scrape per RA covers every ZIP in that RA. Federal default age curve (45 CFR 147.102, Appendix A) maps the anchor-age premium to all 65 ages. Most SBEs ban tobacco rating in the individual market, eliminating that variable.
Validated approach (proven 2026-05-24 for CA):
- Same-RA invariance: ZIP 94102 (Tenderloin SF) vs ZIP 94110 (Mission SF), both in RA4. Both return identical 24 plans, identical $1,341.65/mo subsidy, identical Kaiser cost-share. ✓
- Cross-RA difference: ZIP 94102 (SF RA4) vs ZIP 90039 (LA RA15). Different subsidy ($1,341.65 vs $848.92), different plan count (24 vs 34), different cheapest carrier (Kaiser vs Molina). ✓
- Subsidy math ($30K SF couple 35+30): predicted $1,344.62 vs CC live $1,341.65 = $2.97 (0.22%) delta. ✓
Anchor demographic (CA 2026):
- 1 person, age 21, household income $500,000 (well above any subsidy threshold = gross premium displayed with zero subsidy applied; clean reading).
- Tobacco N/A (CA bans tobacco rating per Knox-Keene §1357.504).
Mechanics:
- Maintain a
data/{state}-rating-areas-{year}.tsfile with one anchor ZIP per RA. - Playwright scraper drives the public shop tool per anchor ZIP:
- Fill ZIP, income (high), household size 1, age 21
- Submit eligibility, click through preference screens (skip provider + Rx)
- Land on plan list, paginate through all pages (CC uses 10/page MUI Pagination)
- Extract per plan: carrier, plan name, metal tier, gross monthly premium, deductible, primary care copay, generic Rx copay
- Output: structured JSON keyed by
regionId, array of plans per RA. - Normalize step (separate script) converts JSON →
planscollection doc shape. - Apply federal default age curve to derive all 65 ages from the age-21 anchor.
- Apply canonical CSR variants from the state's Standard Benefit Plan Designs (state-mandated, same across all carriers per metal x CSR tier).
Output shape: documents in plans collection with state: "CA", regionId: 1..19, full ageRatesByArea table (derived), csrVariants map (canonical SBP).
Validation:
- 10-scenario harness across all 19 RAs vs CC Shop & Compare live, target < $10 spread.
- See Validation methodology below.
Code locations:
scripts/db/scrape-covered-ca-2026.ts— Playwright harness (CA 2026 specific)scripts/db/data/ca-rating-areas-2026.ts— 19 RA anchor ZIPs + federal default age curve constantsscripts/db/data/ca-plans-scraped-2026.json— raw scrape output
Approach 3: Federal PUF ingest (baseline for 30 FFM states)
When to use: state files into CMS Public Use Files (= one of the 30 FFM states + a few that opted in).
Used by: all 30 FFM states (already shipping, see scripts/db/ingest-puf-augment.js).
Mechanics: documented separately at docs/data-sources/puf-data.md. The PUF pipeline is the simplest path when available — one well-defined schema covers the whole state.
Why this does NOT apply to most SBEs: SBE states file with their state marketplace, not CMS. So PUF rows for SBE states are either missing or stale. Confirm via db.plans.find({state: "X", year: 2026}).count() — if zero, the state isn't on PUF.
Approach 4: SERFF filings (fallback)
When to use: state has no public shop tool that exposes all plans (rare; this is mostly a contingency).
Mechanics: SERFF (System for Electronic Rates & Forms Filing) is the NAIC submission portal. Carriers file rate filings + benefit designs there annually. Requires NAIC registration to access bulk download. Filings are PDF + spreadsheet attachments per carrier.
Why it's last resort: parsing dozens of per-carrier filings is high friction. Use scrape approach 1 or 2 whenever a public marketplace tool exists.
Status: not currently used for any AskFlorence state. Document the parse pattern here if/when adopted.
Validation methodology (applies to every state)
The exact validation steps are state-agnostic. Run all four layers per state per plan year:
Layer 1 — Subsidy math sanity check (1 scenario)
Pick a representative subsidy scenario (e.g., couple ages 35 + 30, $30K income, anchor metro ZIP). Drive the state's live shop tool. Capture the displayed monthly subsidy + cheapest Silver plan.
Run the same scenario through our own pipeline. Compare:
| Check | Tolerance |
|---|---|
| Monthly subsidy total | < 1% delta (target < $5/mo) |
| Cheapest Silver plan name | Exact match |
| Subsidized "You pay" monthly | < $5 delta |
| CSR tier assigned | Exact match |
If any field exceeds tolerance, fix the formula before scraping more data. The CA Wave 2 baseline scenario landed at 0.22% subsidy delta — that's the bar.
Layer 2 — Same-rating-area invariance (2 ZIPs)
Pick two ZIPs in the same rating area (different counties if possible, same county otherwise). Drive both through the same scenario. Plans + subsidy must be byte-identical. This confirms the regulatory model: premium is RA-bound, not ZIP-bound.
If two same-RA ZIPs differ, the assumption is broken — either the rating-area mapping table is wrong, or the state actually does ZIP-level pricing (rare but exists). Either way, halt and investigate before continuing.
Layer 3 — Cross-rating-area difference (2 RAs)
Pick two ZIPs in different rating areas. They MUST return different plan lists or different premiums or both. Confirms the RA boundary works.
Layer 4 — Federal regression sweep (Wave 1 federal flow stays byte-identical)
Every SBE data change must NOT touch the 30-FFM + NY pipeline. Run scripts/audit/wave1-federal-regression.ts --diff after every commit. Result must be 0 diffs across 100 randomized federal scenarios.
This is the bright-line gate. If the federal sweep finds any diff, revert immediately. Federal flow byte-identical is non-negotiable.
Layer 5 — Per-state ingest invariants (REQUIRED post-ingest, every state)
New as of 2026-06-09 (post-NJ Phase C). Layer 1-4 checks test calculator/eligibility math but not the depth of the ingested provider + drug data. The existing per-state Playwright specs (calculator-{ut,ca,sbe-nj}-takeover.spec.ts) verify the UI flow + carrier names appear, but do NOT verify:
- Specific known NPIs return the right
plans[].plan_idset - Per-carrier NPI thresholds (e.g. NJ UHC ≥ 100K — future ingests can't silently drop 50% of NPIs)
- Per-carrier RxCUI thresholds
- Cohort guard reaffirmation outside of the ingest scripts themselves
- Plan count + carrier set invariants per state
SHIPPED 2026-06-10 (ENG-447): scripts/audit/sbe-state-invariants.ts — parameterized, per-state config blocks (NJ + PA + CA + NY backfilled). Wired into npm run preflight -- --full (graceful SKIP when Atlas is unreachable so offline/GHA runs don't false-fail) and standalone as npm run audit:sbe-invariants (runs with --require-db, which ingest sessions MUST use). Golden expectations live in the committed fixture scripts/audit/fixtures/sbe-state-invariants-baseline.json, (re)generated via --capture (calculator-baseline pattern — review the git diff). Measured runtime: ~2 min all states / ~15-45s per --state run on M10 staging.
The backfill immediately caught a real shipped bug (proof of value before the first new-state ingest): NJ Phase C's Horizon provider ingest had used NJ_HIOS_PREFIX = "91762" (AmeriHealth's prefix) instead of 91661, so all 95,979 Horizon NPIs were attributed to AmeriHealth plan IDs — Horizon's 14 plans had ZERO provider coverage and AmeriHealth's plans carried Horizon's directory. Repaired on staging 2026-06-10 by scripts/db/repair-nj-horizon-plan-attribution.cjs (snapshots: pre 6a29227e764ea3c204b89fa1, post 6a2924c081d7b2013563a5b0). See the NJ section of sbe-state-watchouts.md. Moral: carrier↔HIOS labels come from the live plans collection, never from doc tables.
Each state declares:
typescript
{
state: "NJ",
expectedPlanCount: 60,
expectedCarrierHiosPrefixes: ["17970", "91661", "37777", "91762", "23818"],
expectedNonStateBaseline: 3_730_755, // cohort guard
carrierInvariants: [
{ hios: "37777", carrier: "UHC NJ", minNpis: 100_000, minRxCuis: 3_400 },
{ hios: "91762", carrier: "Horizon BCBS NJ", minNpis: 90_000, minRxCuis: 4_000 },
{ hios: "17970", carrier: "Centene/Ambetter", minNpis: 100_000, minRxCuis: 4_000 },
{ hios: "23818", carrier: "Oscar Garden State", minNpis: 20_000, minRxCuis: 3_800 },
{ hios: "91661", carrier: "AmeriHealth NJ", minNpis: 800, minRxCuis: 3_700 },
],
goldenNpis: [
{ npi: "1750636635", expectedPlans: ["37777NJ0100002", "37777NJ0100005", /* ... */] },
// ... 5-10 per state ...
],
goldenDrugs: [
{ rxcui: 197316, name: "atorvastatin 20 MG", expectedTiersByCarrier: { "nj_uhc_2026_marketplace_formulary": "Tier 1", /* ... */ } },
// ... 5-10 per state ...
],
}Backfill status (2026-06-10, ENG-447):
- ✅ NJ — captured post-repair (265,562 NPIs / 11,373 state RxCUI docs; per-carrier floors + 8 golden NPIs + 6 golden drugs)
- ✅ CA — captured (165,974 Symphony providers / 5,447 state RxCUI docs; 11 carriers)
- ✅ PA — captured (1,216,701 NPIs / 30,881 state RxCUI docs; 13 carriers — 33709 + 79279 Highmark entities are drug-floor-only, their provider coverage is the accepted ENG-437 gap)
- ✅ NY — bonus backfill (203,016 NPIs / 7,349 state RxCUI docs; note: Healthfirst 91237 has providers but ZERO formulary docs — known ENG-412 gap, no rx floor set)
- ✅ Wired into preflight
--full+npm run audit:sbe-invariants; any threshold miss fails the run - ✅ "Template for new states" in
sbe-state-watchouts.mdupdated — recording golden NPIs + RxCUIs is now a REQUIRED Phase C capture step
Adding a new state's block: at the END of its Phase C (before the next state's Phase A), append a StateInvariantsConfig to STATE_INVARIANTS (exact plan count, every carrier HIOS prefix from the live plans collection, per-carrier floors ~5-15% under verified actuals, 5-10 golden NPIs across carriers, 5-10 golden RxCUIs), then --capture --state <ST>, review the fixture diff, run a plain check, commit both files.
Layer 6 — Scenario sweep + live-exchange parity (REQUIRED post-ingest, every state)
New as of 2026-06-10 (NV). scripts/audit/sbe-scenario-audit.ts (npm run audit:sbe-scenarios -- --state XX [--parity]) drives the REAL fetchPlansForHousehold pipeline against a dev server for a grid of anchor ZIPs (one per rating area) × household types × FPL bands, and:
- Scenario sweep flags: no plans above 138% FPL, missing Medicaid bump below it,
realPrice > sticker, negative price, and SBE-redirect when the state should be owned. This is the check that caught the independent-city bug (Decision 12) — a whole rating area / 38 VA cities returning zero plans. Run it for every RA, not just the demo ZIP. --parityreplays exchange-captured scenarios fromscripts/audit/fixtures/sbe-parity-baseline.jsonand asserts:- 138-400% FPL → our APTC within tolerance of the LIVE exchange (abs $12 OR 4%, whichever looser — SLCSP plan-selection + FPL-rounding differ slightly). NV verified: ours $293.34 vs Nevada Health Link $298 (Δ$4.66).
- >400% FPL → no subsidy on both (NV: ours $0; exchange "no tax credit, ~$478/mo").
- <138% FPL → INTENTIONAL divergence: expansion-state exchanges route to Medicaid (no marketplace subsidy); WE bump to 138% and show the federal estimate (NV: ours $436 + plans). This is by design — do NOT "fix" it to match the exchange.
Capturing the parity baseline (Decision 11 method): drive the live exchange's anonymous shop via Chrome MCP (e.g. Nevada Health Link → enroll.nevadahealthlink.com/prescreener → "Browse for health & dental plans" → enter dummy ZIP + DOB + income), read the APTC / SLCSP / Medicaid result for ~3 scenarios spanning the bands, and add a states.<ST> block to sbe-parity-baseline.json. Recapture on plan-year change.
SBE audit suite + test philosophy
The three layers below ARE the SBE regression suite. Do NOT run all states in CI (overkill); instead run the suite for every state touched in an SBE addition — adding plans, drugs, providers, or adjusting any of them means re-running these for that state (and its siblings if the change is shared, e.g. a formulary shared across carriers):
| Layer | Command | What it guards |
|---|---|---|
| 5 — data depth | npm run audit:sbe-invariants (or --state XX) | per-carrier NPI/RxCUI floors, golden NPIs/drugs, cohort floor (re-ingest can't silently drop coverage) |
| 6 — runtime + parity | npm run audit:sbe-scenarios -- --state XX --parity | plans return for every RA + household + FPL band; APTC matches the live exchange above 138% FPL; Medicaid-gap divergence intact |
| Playwright | npm run test:e2e (per-state calculator-*-takeover.spec.ts) | end-to-end UI flow for the state |
Plus Layer 4 (wave1-federal-regression --diff, 0 diffs) after every change — the federal/NY flow must stay byte-identical. All layers need a dev server on BASELINE_BASE pointed at the cluster holding the data (formularies_staging + providers_staging are STAGING-cluster only per ADR 0004 — never write provider/drug data to prod).
State-by-state log
New York
| Year | Approach | Status | Notes |
|---|---|---|---|
| 2026 | NY-style scrape (NYSOH) + DFS Exhibit 23 backstop | Shipped (ENG-211) | Community-rated, 16 regions, ~320 plan rows. Validation: new-york-2026.md. Zero variance vs NYSOH live tool. |
California
| Year | Approach | Status | Notes |
|---|---|---|---|
| 2026 Wave 1 | Hand-curated showcase (3-4 plans per state in CA/PA/NJ) | Shipped (ENG-373) | Sourced from CC 2026 SBP designs PDF. Estimate-only surface; not in plans collection. Wave 2 replaces this for CA. |
| 2026 Wave 2 | CA-style scrape (Approach 2), CC Shop & Compare | In progress (ENG-374) | 19 RAs, age 21 anchor, $500K income, household 1. POC validated against $30K SF couple scenario at 0.22% delta. Code: scripts/db/scrape-covered-ca-2026.ts. |
Pennsylvania
| Year | Phase | Status | Notes |
|---|---|---|---|
| 2026 Phase A — plans | GetInsured-stack scrape (Approach 2 variant), Pennie /hix/private/getIndividualPlans JSON API | Shipped + live on apex (ENG-418 PR #566 + A.4 prod-cluster apply 2026-06-02) | 9 RAs, age 40 anchor, single-person household, 4 income buckets (no_csr / csr_73 / csr_87 / csr_94). 291 unique HIOS plans across 12 carriers (Jefferson / UPMC / Geisinger / IBC / Highmark family / Capital BC / Ambetter / Oscar / Partners). L1 validation: 4 RAs $0.00 exact vs PHIEA SLCSP PDF, all 9 RAs ≤ $1.49. Both clusters populated — staging snapshots B 6a1e44f83a4bf6ebedefa3a0 + C 6a1e45b729c719743c525854; prod snapshots B 6a1e6c1d085fb664b689eec7 + C 6a1e6e23923b91f0f761d244. Apex smoke verified 2026-06-02: ZIP 19103 → 73 plans (Jefferson top), 16501 → 61 (UPMC), 15834 → 34 (Geisinger), 18015 → 91 (Jefferson), 17101 → 57 (Capital BC) — all matching pre-apply per-RA breakdown. Cohort guard: non-PA 2026 unchanged at 4,495 on both clusters. Code: scripts/db/scrape-getinsured-2026.ts + build-pa-plans-from-scrape-2026.ts + write-pa-plans-2026.ts + validate-pa-plans-2026.ts + .tmp-eng-418/sync-pa-zip-county-staging-to-prod.mjs (one-shot, archived). PA is the pilot for the GetInsured 7-state batch — NJ / VA / NV / NM / ME inherit this pipeline (~1 day each). |
| 2026 Phase B — drugs | Per-carrier formulary scrape, mirror NY ENG-412 Phase 1 | Backlog | 12 carriers: Ambetter / Oscar / Highmark (3 entities) / IBC / Jefferson / Partners / Geisinger (2 entities) / UPMC / Capital BC. Formulary URLs already captured per plan as puf.urls.formulary — extract dedup'd list. Writes to formularies_staging keyed pa:<rxcui> (matches NY's ny:<rxcui>). |
| 2026 Phase C — providers | TBD source: PA-equivalent of NY-DOH-PNDS open data (if exists) OR per-carrier directory crawl OR §1311 MRFs | Backlog | Provider directory URLs already captured per plan as puf.urls.providerDirectory. NPPES-NPI-native expected (PA carriers use FFM HIOS schema). Writes to providers_staging keyed pa:<npi> (matches NY's ny:<npi>). |
Illinois
| Year | Phase | Status | Notes |
|---|---|---|---|
| 2026 Phase A — plans | GetInsured-stack scrape (third consumer after PA + NJ) | Applied to BOTH clusters 2026-06-10 (ENG-448) — deploy pending | 13 RAs × 4 CSR buckets, 271 HIOS plans, 7 carriers (BCBS IL/HCSC = 207 plans, sole issuer in 63 counties). csr_94 at $22,500 override. L1 13/13 Δ$0.00; L4 wave1 + calculator-baseline ZERO diffs; Layer 5 green. Snapshots staging B 6a29945b…/C 6a29965c…, prod B 6a29945d…/C 6a29965b…. Prod zip_county cleanup is a deploy-time step (cleanup-il-zip-county-2026.ts) — see watchouts IL section. QRS inline (233/271 rated). Catastrophic tier not captured (age-40 anchor). |
| 2026 Phase B — drugs | ffm_1311_mrf sweep head start: 5/7 carriers already current (Oscar/Ambetter/UHC/Cigna/MercyCare, 3.8-4.7K RxCUIs each, plan ids verified == scraped 2026 hiosIds) | COMPLETE 2026-06-10 — 7/7 carriers | Fills: BCBS IL via bcbsil.com/aca-json/il/drugs_il-1.json (3,959 drugs; the HCSC bcbs<st>.com/aca-json/<st>/ pattern, discovered from FFM af_sources + confirmed via index_il.json); Molina IL via ILFormulary2026.pdf parse + RxCUI fan-out (8,539 RxCUIs, 99.9% resolution — Molina publishes NO IL machine-readable drugs; its national cmsjson is FFM-only. ⚠️ Molina IL tier legend = national scheme 1=Preventive, NOT CA's SBP map). derive-drug-search-index --apply re-run (6,482 groups). Issuers publish §1311 MRFs regardless of FFM/SBE status — ALWAYS check existing coverage before a from-scratch Phase B. |
| 2026 Phase C — providers | Same head start: 3/7 current (Ambetter 138,578 / UHC 98,645 / MercyCare 3,412 NPIs on bare-NPI docs — the owned-coverage npi-field lookup serves them as-is) | COMPLETE 2026-06-10 — 7/7 carriers | Fills (all bare-NPI, source-tagged il_*): BCBS IL 93,130 NPIs / 207/207 plans via index_il.json → 24 medical files (4 networks × INDIVIDUAL/GROUP/FACILITY; the filename codes ≠ network names — BAV files hold Blue-Precision); Oscar IL 34,145 via its TIC S3 (042 Select + 056 Choice; national §1311 providers_N.json scrubbed IL); Molina IL 19,412 via marketplace TIC (158 shards, provider_references at head, range-fetch + early-terminate; 152/158 across 3 passes, 6 shards 403'd); Cigna IL 20,444 via Pattern D county heuristic (collar counties; §1311 + TIC both scrubbed IL; over-attribution risk-accepted per PA precedent). IL added to OWNED_COVERAGE_STATES. IL has NO il: namespace — Layer 5 scopes by plan-prefix (framework extended, ENG-448). Layer 5: 213 checks 0 failed. |
Virginia
| Year | Phase | Status | Notes |
|---|---|---|---|
| 2026 Phase A — plans | GetInsured-stack scrape (fourth consumer) | Applied to BOTH clusters 2026-06-10 (ENG-450) — deploy pending | 12 RAs × 4 buckets, 69 plans, 6 carriers. csr_94 $22,500 override. APTC-implied SLCSP cross-check ≤$0.49 12/12. QRS 94%. Snapshots staging 6a29bce5/6a29be91, prod 6a29bce6/6a29be90. Prod zip_county cleanup = deploy-time (cleanup-va-zip-county-2026.ts). |
| 2026 Phase B — drugs | 3 ffm-swept + Kaiser kporg JSON + UHC OptumRx GPX526VA + Anthem 2026 PDF parse (FNAV JSON is 2025-only) | COMPLETE 2026-06-10 — 6/6 | Per-carrier RxCUIs: Anthem 8,297 · Kaiser 6,010 (TIER-ONE..FOUR vocab added) · Cigna 4,690 · Sentara 4,336 · Oscar 3,994 · UHC 3,604. Search read-model re-derived (6,516 groups). |
| 2026 Phase C — providers | 2 ffm-swept + Anthem PROVIDERS_VA.json + Kaiser kporg (year-2023 waiver) + Oscar TIC 027 + Cigna county heuristic | COMPLETE 2026-06-10 — 6/6 | NPIs: Sentara 57,111 · Anthem 54,383 · UHC 43,253 · Cigna 12,772 · Oscar 3,816 · Kaiser 1,901. Bare-NPI model (like IL). Layer 5: 243 checks 0 failed across all six states. |
Nevada
| Year | Phase | Status | Notes |
|---|---|---|---|
| 2026 Phase A — plans | GetInsured-stack scrape (fifth consumer) | Applied BOTH clusters 2026-06-10 (ENG-451) — deploy pending | 4 RAs × 4 buckets, 135 plans, 9 carriers. csr_94 $22,500. L1 + APTC-implied ≤$0.32 on 4/4. QRS 106/135. Snapshots staging 6a29e184/6a29e34f, prod 6a29e186/6a29e34e. Prod zip_county cleanup = deploy-time. |
| 2026 Phase B — drugs | 4 ffm-swept + Anthem/CommunityCare FNAV PDF + HPN OptumRx SE42L77 + Hometown IFP PDF | 8/9 carriers | Ambetter 4,251 · Molina 3,803 · SelectHealth 3,861 · CareSource 3,555 (ffm) · Anthem + Community Care 8,560 (shared PDF) · HPN 4,794 (OptumRx ClientFormulary SE42L77 — largest carrier, 31 plans) · Hometown 8,792 (IFP Exchange PDF). Only Imperial gapped — its FNAV formulary is UnderConstruction.htm (carrier unpublished). Search read-model re-derived. |
| 2026 Phase C — providers | 3 ffm-swept + Anthem (Elevance PROVIDERS_NV.json) + HPN (UHC TIC blobs → Sierra-Nevada HMO) | 5/9 carriers | Ambetter 87,449 · Molina 60,984 · Anthem 15,053 (year-waiver) · HPN 5,491 (largest carrier) · SelectHealth 2,786 = 100 of 135 plans. A second pass found Anthem + HPN after the first pass stopped at 3/9 — Anthem via the SAME Elevance PROVIDERS_<ST>.json pattern used for VA; HPN via the UHC blobs API used for NJ. Lesson: exhaust proven sibling-state patterns before declaring a gap. Remaining gap (CareSource low-compliance §1311, Community Care separate HMO, Hometown/Imperial small portals) = honest "no coverage data" per NJ/PA precedent. Layer 5: 274 checks 0 failed across 7 states. |
Other SBEs (queued)
| State | Marketplace | Expected approach | Status |
|---|---|---|---|
| New Jersey | GetCoveredNJ | GetInsured stack — inherit PA scraper (ENG-418), ~1 day | All 3 phases SHIPPED (ENG-438, 2026-06-09). Phase A: 60 plans across 5 carriers (HIOS 17970/91661/37777/91762/23818) live on BOTH prod + staging clusters. Phase B: 5/5 formularies (Centene 4,305 / Oscar 4,014 / AmeriHealth 3,977 / Horizon 4,270 / UHC 3,570 RxCUIs) on staging. Phase C: 5/5 providers (Centene 112,161 + AmeriHealth 978 + UHC 106,742 + Horizon 95,979 + Oscar 22,658 = 265,562 NJ NPIs) on staging. Cohort guard intact (non-NJ 3,730,755, drift 0 each ingest). UHC required OptumRx GPX526NJ for drugs + transparency-in-coverage.uhc.com/api/v1/uhc/blobs/ discovery for providers. Horizon required horizonblue.sapphiremrfhub.com vendor portal (curl-level Incapsula bypass). Oscar required oscar-001-in-network.json (NOT the TOC-listed oscar-002). Full carrier-by-carrier discovery patterns + watchouts: see sbe-state-watchouts.md NJ section. |
| Virginia | Virginia's Insurance Marketplace | GetInsured stack — inherit PA scraper, ~1 day | M1 follow-up |
| Nevada | Nevada Health Link | GetInsured stack — inherit PA scraper, ~1 day | M1 follow-up |
| New Mexico | beWellnm | GetInsured stack — inherit PA scraper, ~1 day | M1 follow-up |
| Maine | CoverME | GetInsured stack — inherit PA scraper, ~1 day | M1 follow-up |
| Idaho | Your Health Idaho | GetInsured stack — inherit PA scraper, ~1 day | Not in original M1 7-state batch; queue after the 5 above |
| Minnesota | MNsure | GetInsured stack — inherit PA scraper, ~1 day | Not in M1 batch; queue after |
| Georgia | Georgia Access | GetInsured stack — inherit PA scraper, ~1 day | New SBE 2026 |
| Illinois | Get Covered Illinois | SBE-FP — already in CMS FFM PUFs we ingest | Already covered via FFM PUF path (no separate scrape needed) |
| Colorado | Connect for Health Colorado | Approach 2 | Backlog (founder noted "standardized way to quickly import" — promising) |
| Connecticut | Access Health CT | Approach 2 | Backlog (same) |
| Maryland | Maryland Health Connection | Approach 2 | Backlog |
| Washington | Washington Healthplanfinder | Approach 2 | Backlog |
| Idaho | Your Health Idaho | Approach 2 | Backlog |
| New Mexico | beWellnm | Approach 2 | Backlog |
| Massachusetts | Health Connector | Approach 2 | Backlog |
| Rhode Island | HealthSource RI | Approach 2 | Backlog |
| DC | DC Health Link | Approach 2 | Backlog |
| Vermont | Vermont Health Connect | Approach 2 | Backlog |
| Kentucky | kynect | Approach 2 | Backlog |
| Maine | CoverME | Approach 2 | Backlog |
| Minnesota | MNsure | Approach 2 | Backlog |
| Georgia | Georgia Access | Approach 2 | Backlog |
| Illinois | Get Covered Illinois | Approach 2 | Backlog |
| Nevada | Nevada Health Link | Approach 2 | Backlog |
| Virginia | Virginia's Insurance Marketplace | Approach 2 | Backlog |
When adding a new state, fill in the row and link to the validation page once complete.
Prioritized roadmap — 19 remaining SBE states (2026-06-01)
Status as of 2026-06-01: CA + NY shipped (451 plans live in
planscollection). 19 SBE states remain unconfigured: CO, CT, DC, GA, ID, IL, KY, ME, MD, MA, MN, NV, NJ, NM, PA, RI, VT, VA, WA.This section ranks them by leverage (population coverage + ease) and dependency stack (plans → drugs → providers, in that order per state). Per the standing covenant, never start a new state's ingestion without completing this section's planning checklist for it — the per-state research column locks the approach before any code/scrape work.
Ranking criteria
Four signals, listed in priority order:
- Population leverage — how many uninsured-ACA-eligible people does the state add to our addressable surface? Bigger states = bigger TAM unlock.
- Marketplace tool quality — does it have a clean plan-list page that the CA-scrape pattern handles? Or is it built on an older / less consistent framework that needs a one-off scraper?
- Plan-data shape — community-rated (NY pattern, simpler) vs. age-rated rating-area-bound (CA pattern, standard) vs. ZIP-level pricing (rare, would need a per-ZIP scrape).
- State subsidy program — does the state add its own premium subsidy on top of federal APTC (like CA's CAPS+CAPC)? If yes, we need a
calculate{State}Eligibility()helper before the state's pricing is consumer-correct.
Tier 1 — High population, standard pattern, ship-fast candidates
These are large-population states with the CA-pattern (age-rated, rating-area-bound) and either no state subsidy or a well-documented one. Plans-only first (drug + provider ingest queued after pricing lands).
| # | State | Marketplace | Approach | Est. uninsured pop. | State subsidy? | Why ship first |
|---|---|---|---|---|---|---|
| 1 | PA | Pennie | CA-style scrape | ~290K | No (federal APTC only) | 5th-largest state by uninsured; Pennie is built on the Get Insured tech stack — SBE-modern, clean plan list. Wave 1 hand-curated showcase already shipped under ENG-373, so the carrier roster is partially known. |
| 2 | NJ | GetCoveredNJ | CA-style scrape | ~210K | Yes — NJ Health Plan Savings | Same Get Insured tech stack as PA. Wave 1 showcase already shipped (ENG-373). State subsidy adds modest top-up on federal APTC up to 600% FPL — researchable from NJ DOBI publications. |
| 3 | IL | Get Covered Illinois | CA-style scrape | ~370K | No (federal APTC only) | Brand-new SBE for 2026 plan year (transitioned off HC.gov). Tech stack: GetInsured (same as PA/NJ). Largest by uninsured among 2026-new SBEs. |
| 4 | VA | Virginia's Insurance Marketplace | CA-style scrape | ~320K | No (federal APTC only) | Built on GetInsured tech stack. Transitioned to SBE in 2024. High population, expansion state (no ENG-414-class subsidy ambiguity). |
| 5 | GA | Georgia Access | CA-style scrape | ~530K | No (federal APTC only) | Largest state in this set by uninsured. Brand-new SBE for 2026. Pre-flight check needed: confirm Georgia Access publishes a public shop-and-compare tool (some states gate plan browsing behind eligibility submission). Tech stack is custom (not GetInsured). |
Tier 1 rationale: scoring the 5 most populous easy-pattern states first unlocks ~1.7M additional uninsured-ACA-eligible Americans across the platform. The PA/NJ pair has the lowest delta-effort because Wave 1 hand-curation already mapped the carrier roster (ENG-373); the GetInsured tech-stack reuse means a single scraper port covers PA + NJ + IL + VA.
Tier 2 — Medium population, standard pattern
Same approach (CA-style scrape) but lower population leverage per state. Group as a batch after Tier 1 ships.
| # | State | Marketplace | Approach | Est. uninsured pop. | State subsidy? | Notes |
|---|---|---|---|---|---|---|
| 6 | WA | Washington Healthplanfinder | CA-style scrape | ~330K | Yes — Cascade Care subsidy | Custom tech stack (HBE-built). Cascade Care adds state premium assistance on top of federal APTC up to 250% FPL. |
| 7 | MD | Maryland Health Connection | CA-style scrape | ~280K | No (federal APTC only) | Custom MD tech stack. Standardized SBP designs across carriers; canonical CSR variants well-documented. |
| 8 | MA | Massachusetts Health Connector | CA-style scrape (likely) | ~110K | Yes — ConnectorCare | Custom MA tech stack. ConnectorCare is a major state-funded subsidy program (essentially state-augmented Silver). Subsidy modeling adds complexity — research first. |
| 9 | MN | MNsure | CA-style scrape | ~290K | Yes — MinnesotaCare (BHP) | Custom MNsure tech stack. MN runs a Basic Health Program (BHP) for 138-200% FPL like NY's Essential Plan — adds complexity. |
| 10 | CO | Connect for Health Colorado | CA-style scrape | ~210K | Yes — Colorado OmniSalud + reinsurance | Custom CO tech stack. State Reinsurance Program reduces premiums by ~25% across the board; OmniSalud subsidy covers some undocumented residents (out of scope for federal-eligible pricing). |
| 11 | CT | Access Health CT | CA-style scrape | ~140K | No (federal APTC only) | Custom AHCT tech stack. Modest population; relatively straightforward eligibility (no state subsidy program). |
Tier 3 — Smaller population or higher-friction marketplaces
Lower priority by raw population. Cluster as a final batch.
| # | State | Marketplace | Approach | Est. uninsured pop. | State subsidy? | Notes |
|---|---|---|---|---|---|---|
| 12 | NV | Nevada Health Link | CA-style scrape | ~250K | No (federal APTC only) | GetInsured tech stack (same as PA/NJ/IL/VA — once Tier 1 lands, NV could ship as a fast-follow). |
| 13 | KY | kynect | CA-style scrape | ~190K | No (federal APTC only) | Re-launched as SBE in 2022. Custom KY tech stack. |
| 14 | NM | beWellnm | CA-style scrape | ~150K | No (federal APTC only) | GetInsured tech stack. |
| 15 | ME | CoverME.gov | CA-style scrape | ~50K | No (federal APTC only) | GetInsured tech stack. Small population. |
| 16 | RI | HealthSource RI | CA-style scrape | ~30K | No (federal APTC only) | Custom HSRI tech stack. Smallest by uninsured. |
| 17 | DC | DC Health Link | CA-style scrape | ~25K | No (federal APTC only) | Custom DC tech stack. Tiny but politically visible. |
| 18 | VT | Vermont Health Connect | CA-style scrape | ~25K | No (federal APTC only) | Custom VT tech stack. Tiny. |
| 19 | ID | Your Health Idaho | CA-style scrape | ~120K | No (federal APTC only) | Custom YHI tech stack. Only non-expansion-state SBE; CSR-94 still applies via federal APTC at 100-138% FPL the same as any other state (no ENG-414 carveout needed — the SBE redirect fires before our FFM bump logic). |
Per-state research backlog (locks the approach before scraping)
For each state, the following must be answered + documented in sbe-state-watchouts.md BEFORE starting the scraper code. This prevents the kind of stale-data + sync-bug class we hit on CA (Wave 1 stale 2025 carryover) and NY (PNDS specialty-code surprise, UHC GPX426NY-vs-Advantage confusion).
| Question | Why it matters |
|---|---|
| Does the public shop-and-compare tool expose ALL plans without an eligibility submission? | Some states gate plan browsing behind eligibility (NJ does this for some flows). Determines whether the CA "anchor-age + $500K income" trick works as-is. |
| Premium varies by ZIP, by rating area, or community-rated? | Picks the scrape pattern (NY vs CA vs per-ZIP). Confirmed via Layer 2 same-RA invariance test. |
| Does the state publish a Standard Benefit Plan Designs (SBP) PDF for the plan year? | If yes, CSR variants are state-mandated and consistent across carriers (CA pattern); if no, each carrier's CSR variants must be sourced from per-carrier filings or PUF (FFM pattern). |
| State subsidy program shape (APTC top-up, BHP, reinsurance, OmniSalud-style outreach)? | Drives whether we need calculate{State}Eligibility() (NY/CA pattern) OR can rely on federal APTC only (most states). |
| Drug formulary source pattern: per-carrier PDFs? FormularyNavigator? Per-carrier API? PBM aggregator? | Picks the drug-ingest approach. NY taught us per-carrier PDFs + OptumRx-style API for the UHC outlier. |
| Provider directory source pattern: state DOH PNDS-equivalent? FHIR Plan-Net? Per-carrier portals? | Picks the provider-ingest approach. NY used PNDS (state DOH); CA used Symphony (the SBE-licensed directory). Smaller states may have neither — fall back to per-carrier FHIR Plan-Net. |
| Does the state run a Basic Health Program (BHP, NY's Essential Plan equivalent)? | Affects the 138-200% FPL band. MN runs MinnesotaCare; NY runs Essential Plan; no other 2026 SBE does. |
Dependency stack per state (plans → drugs → providers)
The standing pattern that worked for CA + NY:
- Phase A — Plans + pricing (the prerequisite for everything else). Required before anything user-facing works for the state. Validated by the 4-layer methodology above.
- Phase B — Drug formularies →
formularies_staging, source-tagged<state>_<carrier>_<year>_marketplace_formulary. Per-carrier PDFs the typical source. RxCUI resolution + dedup follow. - Phase C — Provider directories →
providers_staging. State-by-state shape varies; NY used PNDS open-data (Socrata bulk), CA used CalHEERS Symphony, smaller states may rely on per-carrier FHIR Plan-Net or licensed feeds.
Each phase ships independently to apex behind the standing covenants (Atlas snapshot → dry-run audit → founder-gated apply → byte-identical guard on FFM + other-SBE cohorts → apex smoke).
Shipping cadence recommendation
- Tier 1 (5 states, ~1.7M uninsured): ship Phase A only first (plans + pricing), batched as one Wave per state. Goal: a TX/FL-equivalent surface for the largest remaining SBEs. Defer drug + provider ingest per-state until pricing is verified live on apex.
- Tier 2 (6 states): ship Phase A in the same batched pattern. Add state-subsidy helpers where required (WA Cascade Care, MA ConnectorCare, MN MinnesotaCare, CO Reinsurance).
- Tier 3 (8 states): ship Phase A in a final batch.
- Drug + provider ingest (Phases B+C) revisits each tier after pricing is live, prioritized by user demand (which states are showing up in waitlist signups + agent referrals).
Estimated effort per state (Phase A only)
Based on NY + CA ingests:
- Scraper port (CA pattern, GetInsured tech stack): ~1 day per state once the first GetInsured scraper is reusable.
- Scraper port (custom tech stack): ~2-3 days per state.
- Validation (4-layer methodology): ~0.5 day per state.
- State subsidy helper (if applicable): ~1-2 days per state (the actuarial math is well-defined; the testing is what takes time).
- Atlas snapshot + dry-run + apply + apex smoke: ~0.5 day per state.
Tier 1 ship estimate: ~10-14 working days for all 5 states (assuming GetInsured scraper reuse across PA/NJ/IL/VA).
Out of scope for THIS plan
- No execution. This document defines the work + priority + dependency order. Each state's actual ship is its own Linear issue + worktree per the standing pattern.
- No deadline commitments. Sequencing depends on which states are showing up in user demand + agent referrals + funding priorities.
- No formulary or provider work. Phase A (plans + pricing) is the gate for everything else; Phase B + C plans are filed per-state after Phase A lands.
Population data caveat
Uninsured population estimates above are directional — sourced from KFF state health facts (https://www.kff.org/state-category/health-coverage-uninsured/) and rounded for prioritization purposes only. Refresh with the next plan-year KFF release before committing to a Wave schedule.
Lessons learned + gotchas
A running list of mistakes + non-obvious findings. Read this before starting a new state ingest.
CA Wave 1 → Wave 2 transition (2026-05-24)
- Stale-year carryover risk: Wave 1 seed used
Silver 70 deductible: $5,400, which was a 2025 carryover. 2026 CA SBP actual is $5,200. Always source the SBP design values from the CURRENT plan year, not the prior year. Cross-check via the state marketplace's live display. - SF rating-area "gap" (ENG-152) was stale data, not a formula bug. Looked like a $173-$263 SF SLCSP delta in Wave 1; rerunning against 2026 source data eliminated it. Don't chase formula bugs without first verifying the source-of-truth year.
LA County RA15/RA16 unification finding (2026-05-25, Wave 2 Block B.5)
- LA County historically splits between RA15 (North/East) and RA16 (South/West) at the ZIP level. CC 2026 PY scrape of both anchors revealed identical carrier roster (6 carriers, 40 total plans) and identical per-plan gross premium for both RAs.
- Verify before assuming a historical split persists year-over-year: scrape both anchor ZIPs and diff the plan lists. If pricing + roster match exactly, the split is administrative-only and you can collapse to one canonical RA.
- Decision for 2026: map all LA County ZIPs to RA15. RA16 remains in the regions collection for completeness but receives no ZIPs in zip_county.
- If a future year reintroduces a meaningful split:
scripts/db/scrape-la-zip-ra-split-2026.tsis kept as scaffolding for per-ZIP classification via subsidy-fingerprint match.
CC Shop & Compare scraper specifics
- Pagination, NOT virtualization. CC's plan list paginates 10 plans per page using MUI Pagination. Click "Next page" button; do NOT try to scroll-infinite-load. Selector:
button[aria-label*="next page" i]. - Eligibility modal. After "See My Results" submit, a modal appears: "This isn't an application for health coverage." Click "Continue" before navigating to results.
- Filter modal on plan list arrival. "Your Health Plan Filters" modal blocks the top of the plan list. Click "OK" to dismiss before extracting cards.
- Plan cards are
<li>elements with dynamic MUI classes. Can't anchor on class. Identify by text content: each plan card's<li>contains BOTH$X.YY /moANDYearly deductible. - Plan name heading contains the whole card text. Don't use
h1/h2/h3.textContentto grab the plan name — it'll include Compare button + cost-share labels. Instead grep for the carrier + plan-line + network-type pattern (e.g.,^(Kaiser|Anthem|...) ... (HMO|PPO|EPO)$). - Combobox is custom, not native
<select>. CC uses MUI Autocomplete. Click the combobox to open, then click the[role=option]element with the desired text. Two comboboxes on the form share name "Dropdown" (Coverage Year + Household Size) — pick by position (.nth(1)). - HDHP / HSA plans don't render a metal pill. They have plan-line names like "Kaiser Bronze 60 HDHP HMO" but no "BRONZE HSA" pill. Derive metal from plan name when pill not found.
- High-income anchor for clean gross premium. With
$500,000income, CC shows zero subsidy and gross premium displays directly — no need to subtract subsidy to derive gross. Cleaner for ingestion math.
PA / GetInsured-stack scraper specifics (ENG-418)
- JSON API beats SPA scraping where it exists. Pennie's
/hix/private/getIndividualPlansreturns the full 73-plan × 145-field response that the React SPA renders. Drive the prescreener form via Playwright JUST to establish the session cookie, then fetch the JSON API for the actual data. 36× faster than DOM-scraping the plan cards. Pattern likely repeats across the GetInsured stack (NJ/VA/NV/NM/ME) since they all run the same upstream platform — try the same endpoint first. - Multi-county ZIPs trigger a
<select>dropdown that gates Continue. Border ZIPs like 15834 (Cameron/Elk/Potter) and 18015 (Bethlehem — Northampton/Lehigh) show a "Select your county" select with options whose VALUES are county FIPS. Handle in the scraper: detect select presence, set value to the anchor RA's county FIPS via the React setter pattern. THIS was the single biggest source of pre-fix RA2 + RA6 timeouts. Validate on at least one multi-county ZIP per state during Phase 0. - CSR-94 income bucket can be unreachable at single-person anchor. Pennie blocks the form at very-low single-person income (~$22K). Workarounds: (a) derive CSR-94 cost-shares from federal 45 CFR §156.420 AV target (0.94) applied to base Silver; (b) use a 2-person household at proportionally low income (Medicaid threshold scales by household size). Document which one you used in the validation page.
- Pennie's
costSharingcodes don't match CMS HIOS variant suffix. Observed: CS5 returned for the csr_87 income bucket (CMS standard would be CS3); CS4 for csr_73 (CMS standard CS2). Don't key on Pennie's code — key on the income bucket you supplied. The cost-share VALUES are correct. - Plan-level fields are HIOS-keyed (one doc per HIOS), NOT per-(HIOS, RA). Look at an existing CA or NY plan in the
planscollection before designing your normalize —regionId: null+premiumsByRatingArea: { 1: X, 4: Y, 8: Z }+ageRatesByArea: { 1: {0-14:…, 64 and over:…}, 4: {…}, ... }is the convention. PA originally emitted per-(HIOS,RA) docs and had to be re-grouped before apply. - Age curve keys are
"0-14"/"15"-"63"/"64 and over", NOT integers 0-64. The"64 and over"key is what makes premiums work for users locked out of Medicare (recent green-card holders, non-citizens, anyone 65+ without 40 quarters of work history) — they pay the federally-capped age-64 rate per 45 CFR 147.102(e)'s 3:1 max age-rating rule. (Aside: CA stored integer 0-64 keys, which is a latent bug — 65+ CA users hit the route's fallback path. Tracked separately.) - Field-name parity matters:
metalis UPPERCASE;planNamenot "name";hsaEligibleat top-level not nested underpuf.planFeatures;program: "QHP"constant; bothpremiumsByRatingAreaANDpremiumsByRegionpopulated identically (CA legacy alias). Diverging from this shape means the /plans surface renders blank for your new state. copaysfield shape varies across cohorts but the runtime tolerates both CMS-canonical keys ("Primary Care Visit to Treat an Injury or Illness") and camelCase shorthand ("primaryCare"). PA uses CMS-canonical with raw Pennie displayVal strings ("$15 Copay", "20% Coinsurance after deductible") — the runtime'sextractCopay()parser handles all formats.
Quality ratings hide in plain sight (ENG-418 A.5 lesson)
- Every SBE plan-search response we've seen so far includes the full CMS QRS issuer rating inline — the same data
scripts/db/ingest-qrs-ratings.jscalls the CMS Marketplace API for on the federal-30 side. The SBE scrape ALREADY has it; the normalize just has to read it. - Pennie field name:
issuerQualityRating(24-field nested object includingQualityRating,GlobalRating,SummaryClinicalQualityManagement,SummaryEnrolleeExperience,SummaryPlanEfficiencyAffordabilityAndManagement,EffectiveDate+ 18 sub-scores by clinical category). Each plan record also has a flatqualityRatingmirror at the top level. - Mapping: Pennie strings → canonical ints (
""→ 0).SummaryClinicalQualityManagement→clinicalQualityManagementRating;SummaryEnrolleeExperience→enrolleeExperienceRating;SummaryPlanEfficiencyAffordabilityAndManagement→planEfficiencyRating. Use the 14-char HIOS prefix (issuerPlanNumber[0:14]) as the dedupe key — multiple variants of the same product share the issuer-level rating. - 217/291 PA plans (74.6%) had ratings; 74 (mostly new carriers + new product lines without 3 enrolled years) returned empty
issuerQualityRating. Federal-30 sees ~98.1% coverage. Both gaps are real — render "Not rated" rather than treating as 0-star. - CC + NYSOH scrapes are expected to follow the same pattern (per-plan inline rating from the upstream CMS QRS feed). Audit those scrapes' raw JSON before designing a separate ingest. Likely 90% of the work is
git grep -i "qualityRating\|starRating\|qrs" scripts/db/data/and a 20-line schema mapper.
Playwright + tsx interop
- Inner function declarations break inside
page.evaluate(() => {}). tsx wraps functions with__name(class)helper calls that don't exist in browser context. Pass the evaluator as a string template literal instead:page.evaluate(\(() => { ... })()`)` — and inline all logic, no inner functions. - Regex escape rules in template-literal evaluators. Use 2 backslashes for regex special chars:
\\bin the template becomes\bin the string the browser parses. Avoid character-class dashes in the middle — put-at the end of the class:[A-Za-z0-9 ()./-].
MongoDB writes (every state, every year)
- Standing covenant for
planscollection writes: dry-run → tier-1..5 audit harness re-runs at 100% → rollback snapshot → founder-gated apply. See docs/validation/audit/ for the harness. - Atlas snapshot BEFORE any
planswrite.atlas backups snapshots create --clusterName <prod-cluster> --retention 30 --description "pre-state-X-Y ingest". Record snapshot ID in the ingest commit message for rollback traceability.
Cluster targeting — plans go to PROD FIRST (caught in ENG-418 A.4 apex smoke)
This is the single most important playbook addition from ENG-418. Plans (and zip_county) on apex are read via getDb() → MONGODB_URI → the prod M10 cluster (askflorence-prod-01.njkihm). The staging cluster (askflorence-staging.efsikmv) is what .env.local points at for local dev + the cross-cluster reference cluster for getReferenceDb() (formularies_staging + providers_staging only). Two separate Atlas clusters in two separate Atlas projects.
CA + NY convention (per scripts/db/copy-ca-data-prod-to-staging.cjs): write plans to PROD cluster first, then mirror to staging. ENG-418 Phase A.2.2 violated this by writing only to staging (because .env.local defaulted there + the write-pa-plans-2026.ts cluster guard was set to staging-only). Result: code went live on apex with OWNED_DATA_STATES containing PA, but apex /api/counties STILL returned sbeRedirect because PROD's zip_county still had sbeRedirect set and PROD's plans had 0 PA rows. Hour of investigation later: write the same data to prod, problem gone.
The two-cluster ingest pattern (apex-correct order):
- Take Snapshot B on prod cluster (
atlas backups snapshots create <prod-cluster> --projectId <prod-project-id> --desc "pre-state-X-Y ingest"). - Get prod-cluster write access. Pattern (CLAUDE.md precedent):bash
# Create temp DB user (45-min auto-expiry): atlas dbusers create --projectId <prod-project-id> \ --username eng<NNN>-<state>-prod-temp \ --password "$(openssl rand -hex 32)" \ --role readWriteAnyDatabase \ --deleteAfter "$(date -u -v+45M '+%Y-%m-%dT%H:%M:%SZ')" # Add your IPv4 to prod allowlist (NOT IPv6 — Atlas rejects): atlas accessLists create "$(curl -s -4 ifconfig.me)" \ --type ipAddress --projectId <prod-project-id> \ --comment "eng<NNN> temp" \ --deleteAfter "$(date -u -v+45M '+%Y-%m-%dT%H:%M:%SZ')" - Run
--dry-runagainst prod URI (verifies cohort baseline + per-RA counts). - Run
--applywithWRITE_CONFIRM=yesenv var against prod URI. Cohort guard MUST show non-PA 2026 count unchanged. - zip_county cleanup on prod cluster (same script as staging, just against prod URI):js
// For each ZIP belonging to the new state, $set regionId (per fips→RA table) + $unset sbeRedirect. // If staging already has the canonical post-cleanup state, simpler: copy staging → prod by {zip, countyFips}. - Apex smoke matrix (curl per anchor ZIP → confirm counties, no sbeRedirect → POST
/api/plans→ confirm non-empty plan list with state's HIOS-prefix carriers). - Take Snapshot C on prod cluster.
- Delete temp DB user + IP allowlist entry (
atlas dbusers delete ... --force+atlas accessLists delete ... --force). - (Optional, cost-savings parity) Mirror prod → staging for the new state:
node scripts/db/copy-ca-data-prod-to-staging.cjsstyle — adapt for the state's data subset.
The write scripts (write-<state>-plans-<year>.ts) MUST accept BOTH clusters in their safeHost regex. The founder-gated WRITE_CONFIRM=yes + cohort guard are the real safety net; the cluster regex just blocks accidental third-cluster runs (e.g., a dev cluster). Example from write-pa-plans-2026.ts:
ts
const safeHost = /askflorence-staging|askflorence-prod-01|atlas-ttlyyd|njkihm/i;Federal regression bright-line
- Never merge an SBE data change without re-running
wave1-federal-regression --diff. Zero diffs is the gate. Wire it into preflight (already done viascripts/preflight.ts --full).
Provider-coverage ingest patterns — PA Phase C playbook (added 2026-06-05)
PA Phase C (ENG-435 + ENG-437) shipped all 9 PA marketplace carriers (165 plans) with provider attribution to providers_staging in one extended session. Discovered 5 distinct ingest patterns that cover the realistic set of carrier-publishing approaches. Every future state's provider ingest should map each carrier to one of these patterns first.
Pattern decision matrix (run this BEFORE coding)
For each carrier, check IN ORDER (cheaper patterns first):
| Step | Check | If yes → use |
|---|---|---|
| 1 | §1311 MRF TOC is publicly accessible (carrier site footer or /transparency-in-coverage) AND TOC contains HIOS marketplace plan IDs | Pattern A — MRF direct |
| 2 | TOC files are CMS-standard .json.gz (Cloudfront-signed) | Pattern A (streaming gz) |
| 3 | Public find-a-doctor SPA exposes NPI in network XHR responses (capture via Chrome MCP fetch-wrapping) | Pattern B — SPA-API direct |
| 4 | Public find-a-doctor page IS NPI-blind (PDF / server-rendered HTML / vCard with no NPI) BUT carrier publishes provider PDFs | Pattern C — PDF + NPPES fuzzy match |
| 5 | Public path member-auth gated OR rate-limited to unviability | Pattern D — county heuristic (sister-line risk-accepted attribution) |
| 6 | Public sister product line shares the same network and we already ingested it | Pattern E — additive attribution (zero new NPIs needed) |
Pattern A — §1311 MRF direct ingest
Recipe:
- Discover TOC URL (carrier transparency page or footer)
- Verify TOC has
plan_id_type === "hios"entries matching state marketplace HIOS prefix - For each
reporting_structurethat lists our marketplace plans, collectin_network_files[]URLs - Stream each file (HTTP fetch); for
.json.gz, gunzip + early-termination at"in_network"marker token (270× speedup vs full parse) - Regex extract NPIs from
"npi"\s*:\s*\[...\]arrays +"npi"\s*:\s*<10-digit>scalars - JSONL write with
our_plan_idspopulated from the 14-char marketplace HIOS lookup --applyupsertspa:<npi>with$setOnInsert(identity locked) +$addToSetplans (idempotent)
Real examples: Ambetter Centene national TOC (api.centene.com/.../cms-data-index.json → 111,885 NPIs), Highmark BS PA (mrfdata.hmhs.com/.../highmark-bsp-index.json → 92,605 NPIs), IBX QCC (ihg-dart-edw-mrf-prod-public/qcc/.../index.json → 809,576 NPIs), Oscar (single in-network JSON file → 6,491 NPIs).
Caveats:
- TOCs vary wildly in size: from one file (Oscar = 13.6 MB) to 525 files (IBX = 75+ GB if all downloaded)
- Cloudfront URLs have signed expiry — fetch within 24 hr of TOC capture
- Server-side rate-limit on heavy file families (e.g., IBX's 17B0/17D0 — terminated connections at 24-parallel fetch). Solution: accept partial + dedupe overlap is high
Key script: scripts/db/ingest-pa-providers-mrf-generic.cjs — env-driven generic implementation. Set CARRIER_NAME, CARRIER_SOURCE_TAG, HIOS_PREFIX, TOC_URL, PLAN_ISSUER_FILTER env vars.
Pattern B — SPA-API direct (NPI in JSON XHR)
Recipe:
- Open carrier's find-a-doctor page via Chrome MCP (residential IP bypasses datacenter bot-shields like Radware)
- Set up
fetchwrapper via Chrome MCP javascript_tool to capture all XHR calls - Trigger a search via the SPA's UI (or direct URL navigation)
- Identify the search endpoint (e.g.,
POST /healthsparq/public/service/v4/search) - Replay the captured body shape from JS in the same browser session (cookies + auth tokens auto-flow)
- Parse JSON response; look for
attributes[]array with{key:"NPI", value:"<10-digit>"}pattern - Iterate via prefix-letter sweep + pagination (HealthSparq caps at 300 per query / 10 per page)
- Apply with cohort guard
Real example: Geisinger HealthSparq API (POST /healthsparq/public/service/v4/search) → 6,190 unique NPIs across 26-letter + 2-letter prefix sweeps in ~25 min. NPI was in providerResults[].attributes[] where key === "NPI".
Caveats:
- 5-15% of carriers expose NPIs in API but filter from rendered HTML — always sniff XHR before deciding "NPI not exposed"
- HealthSparq / similar vendors hard-cap results per query (~300-500) — must iterate via prefix sweeps
- Session timeouts (Geisinger expired after ~20 min sustained queries → HTTP 440)
- Some endpoints prevent re-auth via fetch — must trigger via SPA UI
Pattern detection:
- Vendor tells: HealthSparq, Sapphire (CalHEERS Symphony), HealthTrio Connect (HTML only, NOT JSON), Vitals, MyHealthToolkit
- Test: search a known last name in SPA + check Network tab for
application/jsonresponses containing NPI digits
Pattern C — PDF/HTML directory + NPPES fuzzy match
Recipe (PDF source):
- Discover marketplace PDF download URLs (often via
/pdf-directory/...redirects to widen.net or other CDN) - Real PDF URL hidden in viewer HTML at
window.viewerPdfUrl(signed CDN URL with ~24hr expiry) - Download PDFs (one per network if carrier offers tiered networks)
- Use pdfplumber for column-aware extraction (NOT pdftotext — interleaves multi-column layouts)
- Use NAME-ANCHORED regex extraction over column-aware extraction:
- Provider names follow
Last, First [Middle], CREDpattern (very strong anchor) - Page-header has county name (geographic anchor)
- Skip facility sections (Hospitals, Labs) — focus on PCP/Specialist sections
- Provider names follow
- Output JSONL:
{last_name, first_name, county, credential} - Build NPPES PA index from existing
providers_staging(pa:*docs withname.{first,last}+addresses[].zip) - Build PA ZIP → county lookup from
zip_countycollection - Match by
(last_name, first_name, county)— score: exact_last_first_county (high), exact_last_initial_county (medium), unique_lastname (low) - Apply matched NPIs with
$addToSet plans
Real example: UPMC 4 marketplace PDFs (88 MB, 15,848 pages total) → name extraction → matched 9,560 of 34,541 directory entries (77% exact_last_first_county high-confidence) against 329,902 NPPES PA baseline.
Recipe (HTML source):
- Same as PDF but parse rendered HTML for name/city/zip per provider card
- HealthTrio Connect uses
LI.ht2-ProviderCardwrappers - Cloudflare rate-limits aggressively — must use long throttle (1-2 sec/request)
Caveats:
- Match accuracy 70-95% depending on directory data quality (more credentials = better match)
- Plan→network mapping often UNKNOWN at PUF level (
puf.networkIdnull) — accept sister-line over-attribution to all carrier plans - Address parsing from multi-column PDFs is fragile; ZIP-only matching often sufficient
Key scripts:
scripts/db/upmc-pdf-extract-names.py(pdfplumber + regex name extraction)scripts/db/upmc-fuzzy-match-nppes.cjs(NPPES index + match + apply)
Pattern D — county-heuristic over-attribution
When to use: Carrier's provider directory is unscrapably difficult:
- Member-auth gated (
/secure/member/...) - HIOS-search-gate rejects marketplace HIOS
- HealthTrio HTML directory rate-limits at 4hr+ ETA
- Vendor refuses datacenter IPs across all bypass attempts
Recipe:
- Identify carrier's actual service area (county-level) via marketing materials, broker maps, "where we serve" page
- Resolve county set to all PA ZIPs via
zip_countycollection (distinct('zip', {state:'PA', county:{$in: counties}})) - Count NPPES PA NPIs whose
addresses.zipis in service-area ZIPs (sanity check expected count) updateManywith$addToSet plansfiltered to matched ZIPs
Real examples:
- Jefferson + Partners (HIOS 19702 + 93909) — SE PA (Philadelphia, Bucks, Chester, Montgomery, Delaware, Lehigh, Northampton, Berks) → 156,089 NPIs in 4 min
- Capital BC (HIOS 45127) — Central PA (Dauphin, Lancaster, Lebanon, Cumberland, York + border Adams, Franklin, Perry) → 46,377 NPIs in 80 sec
Risks (always document in commit message):
- ~30-40% over-attribution: providers in service-area counties who DON'T actually contract with the carrier get marked InNetwork
- UI mitigation: copy directing users to verify with carrier directly before booking
- Mitigation justification: same discipline as IBX 33871 sister-line + UPMC all-networks attribution. Better than zero coverage which blocks enrollment journeys
Quality bar to ship over heuristic:
- Coverage on most populous network in service area (>40k providers per major carrier)
- Direct API/MRF/PDF availability that survives a 4-hour effort budget
Pattern E — additive attribution to sister product line
When to use: Same issuer has multiple HIOS prefixes for parallel product lines (HMO + PPO, Advantage + Premier, etc.) and we've already ingested one of them.
Recipe:
- Verify issuer-name match in
/api/plansoutput (both prefixes branded same) - Confirm no separate publicly-discoverable TOC for the secondary prefix
updateManyto add secondary-prefix plan IDs to ALL existing pa:* docs with primary-prefix plan source
Real example: IBX 33871 (12 plans, Independence Administrators line) attributed to all 809,576 IBX-31609 (Independence BC core line) NPIs in 32 min via $addToSet.
Caveats:
- Risk-accept ~5-15% false positive for product-line-restricted providers
- Best for sister lines of same legal issuer (e.g., HMO + PPO of same parent)
- NOT appropriate when "sister" line is a separate underwriter (different BAA, different contracts)
Cohort guard discipline (every Pattern, every apply)
javascript
// Pre-flight: capture invariant
const totalBefore = await coll.estimatedDocumentCount();
const paBefore = await coll.countDocuments({_id: {$regex: "^pa:"}});
const nonPaBefore = totalBefore - paBefore;
log(`Pre-flight: pa:*=${paBefore}, non-pa=${nonPaBefore}`);
// ... ingest ...
// Post-flight: verify
const drift = Math.abs(nonPaAfter - nonPaBefore);
if (drift > 50) throw new Error(`COHORT GUARD FAIL: drift=${drift}`);
log(`✓ Cohort guard preserved (drift=${drift})`);Why: Catches accidental cross-state writes (e.g., bug where pa: prefix gets dropped from _id and writes go to bare-NPI docs touching other states). Across 9 PA carrier ingests, drift was 0 every time — the guard fires cleanly when wrong.
Cluster guard (every Pattern, every apply)
javascript
function assertSafeCluster(uri) {
const m = uri.match(/^mongodb(?:\+srv)?:\/\/[^@]*@([^/?]+)/);
if (!m) throw new Error("Bad URI");
const host = m[1];
if (!host.includes("askflorence-staging") && !host.includes("efsikmv")) {
throw new Error(`CLUSTER GUARD FAIL: host=${host}`);
}
return host;
}Per ADR 0004 — providers_staging writes go to staging cluster only; apex reads via cross-cluster Atlas PrivateLink.
Release snapshot tags (every Pattern, every deploy)
Per ENG-435 + ENG-437 cadence:
- Before merging deploy PR:
git tag pre-eng-XXX-<carrier>-YYYYMMDD-HHMMat main HEAD - After merge:
git tag post-eng-XXX-<carrier>-YYYYMMDD-HHMMat merge commit - Push both. These create clean rollback points if the deploy regresses.
Per-carrier discovery sequence (for any future SBE state)
For each carrier in the new state's marketplace:
- Marketplace HIOS prefix lookup — query
/api/plansfor the state to get prefix → issuer-name map - Transparency-page discovery — search for
<carrier>.com/transparency-in-coverageor footer link "Machine-Readable Files" or "Price Transparency" - TOC verification — fetch TOC URL + grep for HIOS marketplace prefix
- If TOC has marketplace plans → Pattern A
- If TOC missing or empty → find-a-doctor SPA discovery
- Set up Chrome MCP fetch-wrapping + trigger search → check for NPI in JSON
- If NPI in JSON → Pattern B
- If NPI absent but PDF directory available → Pattern C (PDF)
- If NPI absent and only HTML directory + rate-limit testing fails → Pattern D (county heuristic)
- If sister product line exists with already-ingested network → Pattern E
Bias toward earlier (cheaper) patterns. Pattern D is the safety net but accept the over-attribution risk in the commit message.
Adding a new state — checklist
Use this when starting Wave 2 for any SBE not yet ingested.
- [ ] Identify the state's public shop & compare tool URL
- [ ] Run Layer 1-3 validation: 1 subsidy scenario + same-RA test + cross-RA test against the live tool. Block if any layer fails.
- [ ] Source the state's published rating-area → ZIP mapping (state DOI or marketplace publishes this annually)
- [ ] Create
data/{state}-rating-areas-{year}.tswith one anchor ZIP per RA (start with 1 per RA; add per-county anchors if Layer 2 reveals carrier-availability variance within a single RA) - [ ] Source the state's Standard Benefit Plan Designs (state-mandated cost-share canonical values per metal x CSR tier). Most SBEs publish this annually.
- [ ] Branch the CA scraper pattern (
scripts/db/scrape-covered-ca-2026.ts) for the state's tool. Selectors will differ. - [ ] Run the scrape, output JSON.
- [ ] Build the normalize script: JSON →
planscollection doc shape, applying federal default age curve to derive all 65 ages from anchor age. - [ ] Build the state-subsidy helper if the state has its own subsidy program (see
state-subsidies.mdfor the catalogue). - [ ] Add
calculate{State}Eligibility()inapps/web/src/lib/owned-plans.tsmirroringcalculateNyEligibility(). - [ ] Capture
puf.qualityRatingin the normalize step, NOT in a separate ingest. SBE-state scrapes universally expose CMS QRS issuer ratings inline (Pennie:issuerQualityRating.{QualityRating, GlobalRating, SummaryClinicalQualityManagement, SummaryEnrolleeExperience, SummaryPlanEfficiencyAffordabilityAndManagement, EffectiveDate}; equivalent fields expected on Covered California's Shop & Compare scrape + NYSOH plan-list response). Map to the canonical schema ({available, year, globalRating, clinicalQualityManagementRating, enrolleeExperienceRating, planEfficiencyRating, globalNotRatedReason, clinicalNotRatedReason, enrolleeNotRatedReason, efficiencyNotRatedReason, effectiveDate, qualitySource}— same shapescripts/db/ingest-qrs-ratings.jswrites for federal-30). Coverage typically 70-100% (carriers without 3 years of enrollment go unrated; matches federal QRS pattern). DO NOT skip this. PA Phase A.2.2 missed it; had to ship a follow-up ENG-418 A.5 augment to backfill from the same scrape JSONs (zero new scraping needed — the data was already in our hands). - [ ] Atlas snapshot B (pre-apply, mandatory — descriptive name includes pre-apply row counts).
- [ ] Dry-run plans write → audit harness 100% → founder approval → apply.
- [ ] Atlas snapshot C (post-apply marker — descriptive name includes new row counts).
- [ ]
zip_countycleanup (NEW — caught in ENG-418 Phase A.4 apex smoke): every ZIP for the new state hassbeRedirectset from the legacy SBE-redirect seed./api/countiesreads this field and short-circuits, so even after the constants flip the route still returns sbeRedirect. Run:jsVerify post-flight: every ZIP for the state hasdb.zip_county.updateMany( { state: '<NEW_STATE>' }, { $set: { regionId: <fips→ra via state-specific map> }, $unset: { sbeRedirect: '' } } );regionIdpopulated + nosbeRedirectfield. Sample a few ZIPs to confirm the regionId matches the state's RA-by-FIPS table. - [ ] Add state code to
OWNED_DATA_STATESinpackages/shared/src/plans/owned-plans.ts. - [ ] Remove state from
STATE_BASED_MARKETPLACESinpackages/shared/src/constants/index.ts. - [ ] Re-run federal regression sweep. Must stay 0 diffs.
- [ ] Apex smoke — at least one ZIP per RA:
/api/counties?zip=…returns counties (not sbeRedirect),/api/plansreturns real plan list (not empty), one plan-detail load shows copays + SBC link + formulary link + provider directory link. - [ ] Add validation page at
docs/validation/{state}-{year}.mdwith L1-L4 results + snapshot IDs (A, B, C). - [ ] Update the state-by-state log table above with the new row.
- [ ] If the state has carrier-availability quirks within an RA, add per-county anchors AND document the anomaly in "Lessons learned" so the next state doesn't repeat the same investigation.
- [ ] Phase B (drugs) + Phase C (providers) follow plans — file as separate issues mirroring NY ENG-412 Phase 1 + Phase 2. Don't try to do all three in one PR; the snapshot + cohort-guard discipline gets harder to verify per layer.