SBE Data Ingestion Playbook

Status: Living document. Add a section per state as we onboard. Last updated: 2026-05-24 (CA Wave 2 / ENG-374 in progress) Purpose: The canonical "how do we get a state's plan data" reference. When a future year or future state ingestion starts, this is what the session reads first.

State-Based Marketplaces (SBEs) do NOT file their plan data into the CMS Public Use Files (PUF). Federal PUF only covers the 30 FFM states. For every SBE (CA, PA, NJ, MA, CT, CO, MD, NY, RI, DC, VT, WA, ID, NM, MN, KY, GA, IL, NV, ME, VA), we have to source plan data per state.

This playbook captures the approaches we've used, the validation we run, and the gotchas we hit. Future ingestions reference this and either match an existing pattern or extend the playbook with a new one.

TL;DR — pick your approach by state shape

State property	Best approach	Why
Community-rated (same price all ages)	NY-style scrape	Simplest. Region × plan. No age curve.
Age-rated, rating-area-bound	CA-style scrape	1 anchor age × N regions; federal age curve does the rest.
Federal PUF coverage exists	PUF ingest	Already-shipping pipeline; one schema for whole state grid.
No public marketplace tool, only filings	SERFF rate-filing parse	Last resort. High friction.

Standing rule: prefer the public-facing marketplace shop tool as the source of truth. It's what consumers see, it embeds all subsidy + filter logic, and it's free to scrape. SERFF is fallback when no public tool exists.

The decision tree

Does the state have a public shop & compare tool that shows all 2026 plans?
├── Yes → does premium vary by ZIP or only by rating area?
│         ├── ZIP-level                    → scrape per ZIP (NY pattern)
│         └── Rating-area-level            → scrape 1 ZIP per rating area (CA pattern)
└── No  → is the state on federal PUF?
          ├── Yes (one of 30 FFM)          → ingest via scripts/db/ingest-puf-augment.js
          └── No                           → SERFF rate-filing parse (last resort)

Always validate by:
  (1) Subsidy math vs the live state marketplace tool for 1 anchor scenario
  (2) Same-rating-area invariance: two ZIPs in same RA must return identical plans
  (3) Cross-rating-area difference: two ZIPs in different RAs must differ
  (4) Federal regression sweep: 100 federal scenarios stay byte-identical

Approach 1: NY-style scrape

When to use: community-rated state (one price all ages), small number of plans (< 100 statewide).

Used by: New York 2026 (ENG-211, shipped).

Data shape:

NY State of Health (NYSOH) lists every Silver / Gold / Bronze / Platinum / Catastrophic plan with one premium per (plan x rating region) pair. No age multiplier — community-rated.
16 NY rating regions; ~20 plans visible per region; ~320 total plan rows.

Mechanics:

Anchor demographic: any age, any income (community rating makes age irrelevant).
Drive the marketplace tool's plan search for one ZIP per rating region.
Capture: plan name, carrier, metal, premium, deductible, MOOP, copays per plan.
Cross-reference DFS Exhibit 23 (NY's annual rate filing) for the canonical premium values as a backstop.

Output shape: documents in plans collection with state: "NY", regionId: 1..16, no ageRatesByArea table (community-rated means one premium across all ages).

Validation:

11 issuers x 16 regions = 176 SLCSP probes against NYSOH live tool. Result: zero variance.
DFS Standard Silver chart (32 plans) all match exactly.
See docs/validation/new-york-2026.md for the full pass record.

Approach 2: CA-style scrape (rating areas + age curve)

When to use: age-rated state where premium varies by rating area, not by ZIP. This is the most common SBE pattern.

Used by: California 2026 Wave 2 (ENG-374, in progress).

Why this works:

By regulation (CA Insurance Code §10965.9, similar laws in CO/CT/MD/WA/etc.), premium does NOT vary by ZIP within a rating area. So one anchor scrape per RA covers every ZIP in that RA. Federal default age curve (45 CFR 147.102, Appendix A) maps the anchor-age premium to all 65 ages. Most SBEs ban tobacco rating in the individual market, eliminating that variable.

Validated approach (proven 2026-05-24 for CA):

Same-RA invariance: ZIP 94102 (Tenderloin SF) vs ZIP 94110 (Mission SF), both in RA4. Both return identical 24 plans, identical $1,341.65/mo subsidy, identical Kaiser cost-share. ✓
Cross-RA difference: ZIP 94102 (SF RA4) vs ZIP 90039 (LA RA15). Different subsidy ($1,341.65 vs $848.92), different plan count (24 vs 34), different cheapest carrier (Kaiser vs Molina). ✓
Subsidy math ($30K SF couple 35+30): predicted $1,344.62 vs CC live $1,341.65 = $2.97 (0.22%) delta. ✓

Anchor demographic (CA 2026):

1 person, age 21, household income $500,000 (well above any subsidy threshold = gross premium displayed with zero subsidy applied; clean reading).
Tobacco N/A (CA bans tobacco rating per Knox-Keene §1357.504).

Mechanics:

Maintain a data/{state}-rating-areas-{year}.ts file with one anchor ZIP per RA.
Playwright scraper drives the public shop tool per anchor ZIP:
- Fill ZIP, income (high), household size 1, age 21
- Submit eligibility, click through preference screens (skip provider + Rx)
- Land on plan list, paginate through all pages (CC uses 10/page MUI Pagination)
- Extract per plan: carrier, plan name, metal tier, gross monthly premium, deductible, primary care copay, generic Rx copay
Output: structured JSON keyed by regionId, array of plans per RA.
Normalize step (separate script) converts JSON → plans collection doc shape.
Apply federal default age curve to derive all 65 ages from the age-21 anchor.
Apply canonical CSR variants from the state's Standard Benefit Plan Designs (state-mandated, same across all carriers per metal x CSR tier).

Output shape: documents in plans collection with state: "CA", regionId: 1..19, full ageRatesByArea table (derived), csrVariants map (canonical SBP).

Validation:

10-scenario harness across all 19 RAs vs CC Shop & Compare live, target < $10 spread.
See Validation methodology below.

Code locations:

scripts/db/scrape-covered-ca-2026.ts — Playwright harness (CA 2026 specific)
scripts/db/data/ca-rating-areas-2026.ts — 19 RA anchor ZIPs + federal default age curve constants
scripts/db/data/ca-plans-scraped-2026.json — raw scrape output

Approach 3: Federal PUF ingest (baseline for 30 FFM states)

When to use: state files into CMS Public Use Files (= one of the 30 FFM states + a few that opted in).

Used by: all 30 FFM states (already shipping, see scripts/db/ingest-puf-augment.js).

Mechanics: documented separately at docs/data-sources/puf-data.md. The PUF pipeline is the simplest path when available — one well-defined schema covers the whole state.

Why this does NOT apply to most SBEs: SBE states file with their state marketplace, not CMS. So PUF rows for SBE states are either missing or stale. Confirm via db.plans.find({state: "X", year: 2026}).count() — if zero, the state isn't on PUF.

Approach 4: SERFF filings (fallback)

When to use: state has no public shop tool that exposes all plans (rare; this is mostly a contingency).

Mechanics: SERFF (System for Electronic Rates & Forms Filing) is the NAIC submission portal. Carriers file rate filings + benefit designs there annually. Requires NAIC registration to access bulk download. Filings are PDF + spreadsheet attachments per carrier.

Why it's last resort: parsing dozens of per-carrier filings is high friction. Use scrape approach 1 or 2 whenever a public marketplace tool exists.

Status: not currently used for any AskFlorence state. Document the parse pattern here if/when adopted.

Validation methodology (applies to every state)

The exact validation steps are state-agnostic. Run all four layers per state per plan year:

Layer 1 — Subsidy math sanity check (1 scenario)

Pick a representative subsidy scenario (e.g., couple ages 35 + 30, $30K income, anchor metro ZIP). Drive the state's live shop tool. Capture the displayed monthly subsidy + cheapest Silver plan.

Run the same scenario through our own pipeline. Compare:

Check	Tolerance
Monthly subsidy total	< 1% delta (target < $5/mo)
Cheapest Silver plan name	Exact match
Subsidized "You pay" monthly	< $5 delta
CSR tier assigned	Exact match

If any field exceeds tolerance, fix the formula before scraping more data. The CA Wave 2 baseline scenario landed at 0.22% subsidy delta — that's the bar.

Layer 2 — Same-rating-area invariance (2 ZIPs)

Pick two ZIPs in the same rating area (different counties if possible, same county otherwise). Drive both through the same scenario. Plans + subsidy must be byte-identical. This confirms the regulatory model: premium is RA-bound, not ZIP-bound.

If two same-RA ZIPs differ, the assumption is broken — either the rating-area mapping table is wrong, or the state actually does ZIP-level pricing (rare but exists). Either way, halt and investigate before continuing.

Layer 3 — Cross-rating-area difference (2 RAs)

Pick two ZIPs in different rating areas. They MUST return different plan lists or different premiums or both. Confirms the RA boundary works.

Layer 4 — Federal regression sweep (Wave 1 federal flow stays byte-identical)

Every SBE data change must NOT touch the 30-FFM + NY pipeline. Run scripts/audit/wave1-federal-regression.ts --diff after every commit. Result must be 0 diffs across 100 randomized federal scenarios.

This is the bright-line gate. If the federal sweep finds any diff, revert immediately. Federal flow byte-identical is non-negotiable.

Layer 5 — Per-state ingest invariants (REQUIRED post-ingest, every state)

New as of 2026-06-09 (post-NJ Phase C). Layer 1-4 checks test calculator/eligibility math but not the depth of the ingested provider + drug data. The existing per-state Playwright specs (calculator-{ut,ca,sbe-nj}-takeover.spec.ts) verify the UI flow + carrier names appear, but do NOT verify:

Specific known NPIs return the right plans[].plan_id set
Per-carrier NPI thresholds (e.g. NJ UHC ≥ 100K — future ingests can't silently drop 50% of NPIs)
Per-carrier RxCUI thresholds
Cohort guard reaffirmation outside of the ingest scripts themselves
Plan count + carrier set invariants per state

SHIPPED 2026-06-10 (ENG-447): scripts/audit/sbe-state-invariants.ts — parameterized, per-state config blocks (NJ + PA + CA + NY backfilled). Wired into npm run preflight -- --full (graceful SKIP when Atlas is unreachable so offline/GHA runs don't false-fail) and standalone as npm run audit:sbe-invariants (runs with --require-db, which ingest sessions MUST use). Golden expectations live in the committed fixture scripts/audit/fixtures/sbe-state-invariants-baseline.json, (re)generated via --capture (calculator-baseline pattern — review the git diff). Measured runtime: ~2 min all states / ~15-45s per --state run on M10 staging.

The backfill immediately caught a real shipped bug (proof of value before the first new-state ingest): NJ Phase C's Horizon provider ingest had used NJ_HIOS_PREFIX = "91762" (AmeriHealth's prefix) instead of 91661, so all 95,979 Horizon NPIs were attributed to AmeriHealth plan IDs — Horizon's 14 plans had ZERO provider coverage and AmeriHealth's plans carried Horizon's directory. Repaired on staging 2026-06-10 by scripts/db/repair-nj-horizon-plan-attribution.cjs (snapshots: pre 6a29227e764ea3c204b89fa1, post 6a2924c081d7b2013563a5b0). See the NJ section of sbe-state-watchouts.md. Moral: carrier↔HIOS labels come from the live plans collection, never from doc tables.

Each state declares:

typescript

{
  state: "NJ",
  expectedPlanCount: 60,
  expectedCarrierHiosPrefixes: ["17970", "91661", "37777", "91762", "23818"],
  expectedNonStateBaseline: 3_730_755,  // cohort guard
  carrierInvariants: [
    { hios: "37777", carrier: "UHC NJ", minNpis: 100_000, minRxCuis: 3_400 },
    { hios: "91762", carrier: "Horizon BCBS NJ", minNpis: 90_000, minRxCuis: 4_000 },
    { hios: "17970", carrier: "Centene/Ambetter", minNpis: 100_000, minRxCuis: 4_000 },
    { hios: "23818", carrier: "Oscar Garden State", minNpis: 20_000, minRxCuis: 3_800 },
    { hios: "91661", carrier: "AmeriHealth NJ", minNpis: 800, minRxCuis: 3_700 },
  ],
  goldenNpis: [
    { npi: "1750636635", expectedPlans: ["37777NJ0100002", "37777NJ0100005", /* ... */] },
    // ... 5-10 per state ...
  ],
  goldenDrugs: [
    { rxcui: 197316, name: "atorvastatin 20 MG", expectedTiersByCarrier: { "nj_uhc_2026_marketplace_formulary": "Tier 1", /* ... */ } },
    // ... 5-10 per state ...
  ],
}

Backfill status (2026-06-10, ENG-447):

✅ NJ — captured post-repair (265,562 NPIs / 11,373 state RxCUI docs; per-carrier floors + 8 golden NPIs + 6 golden drugs)
✅ CA — captured (165,974 Symphony providers / 5,447 state RxCUI docs; 11 carriers)
✅ PA — captured (1,216,701 NPIs / 30,881 state RxCUI docs; 13 carriers — 33709 + 79279 Highmark entities are drug-floor-only, their provider coverage is the accepted ENG-437 gap)
✅ NY — bonus backfill (203,016 NPIs / 7,349 state RxCUI docs; note: Healthfirst 91237 has providers but ZERO formulary docs — known ENG-412 gap, no rx floor set)
✅ Wired into preflight --full + npm run audit:sbe-invariants; any threshold miss fails the run
✅ "Template for new states" in sbe-state-watchouts.md updated — recording golden NPIs + RxCUIs is now a REQUIRED Phase C capture step

Adding a new state's block: at the END of its Phase C (before the next state's Phase A), append a StateInvariantsConfig to STATE_INVARIANTS (exact plan count, every carrier HIOS prefix from the live plans collection, per-carrier floors ~5-15% under verified actuals, 5-10 golden NPIs across carriers, 5-10 golden RxCUIs), then --capture --state <ST>, review the fixture diff, run a plain check, commit both files.

Layer 6 — Scenario sweep + live-exchange parity (REQUIRED post-ingest, every state)

New as of 2026-06-10 (NV). scripts/audit/sbe-scenario-audit.ts (npm run audit:sbe-scenarios -- --state XX [--parity]) drives the REAL fetchPlansForHousehold pipeline against a dev server for a grid of anchor ZIPs (one per rating area) × household types × FPL bands, and:

Scenario sweep flags: no plans above 138% FPL, missing Medicaid bump below it, realPrice > sticker, negative price, and SBE-redirect when the state should be owned. This is the check that caught the independent-city bug (Decision 12) — a whole rating area / 38 VA cities returning zero plans. Run it for every RA, not just the demo ZIP.
--parity replays exchange-captured scenarios from scripts/audit/fixtures/sbe-parity-baseline.json and asserts:
1. 138-400% FPL → our APTC within tolerance of the LIVE exchange (abs $12 OR 4%, whichever looser — SLCSP plan-selection + FPL-rounding differ slightly). NV verified: ours $293.34 vs Nevada Health Link $298 (Δ$4.66).
2. >400% FPL → no subsidy on both (NV: ours $0; exchange "no tax credit, ~$478/mo").
3. <138% FPL → INTENTIONAL divergence: expansion-state exchanges route to Medicaid (no marketplace subsidy); WE bump to 138% and show the federal estimate (NV: ours $436 + plans). This is by design — do NOT "fix" it to match the exchange.

Capturing the parity baseline (Decision 11 method): drive the live exchange's anonymous shop via Chrome MCP (e.g. Nevada Health Link → enroll.nevadahealthlink.com/prescreener → "Browse for health & dental plans" → enter dummy ZIP + DOB + income), read the APTC / SLCSP / Medicaid result for ~3 scenarios spanning the bands, and add a states.<ST> block to sbe-parity-baseline.json. Recapture on plan-year change.

SBE audit suite + test philosophy

The three layers below ARE the SBE regression suite. Do NOT run all states in CI (overkill); instead run the suite for every state touched in an SBE addition — adding plans, drugs, providers, or adjusting any of them means re-running these for that state (and its siblings if the change is shared, e.g. a formulary shared across carriers):

Layer	Command	What it guards
5 — data depth	`npm run audit:sbe-invariants` (or `--state XX`)	per-carrier NPI/RxCUI floors, golden NPIs/drugs, cohort floor (re-ingest can't silently drop coverage)
6 — runtime + parity	`npm run audit:sbe-scenarios -- --state XX --parity`	plans return for every RA + household + FPL band; APTC matches the live exchange above 138% FPL; Medicaid-gap divergence intact
Playwright	`npm run test:e2e` (per-state `calculator-*-takeover.spec.ts`)	end-to-end UI flow for the state

Plus Layer 4 (wave1-federal-regression --diff, 0 diffs) after every change — the federal/NY flow must stay byte-identical. All layers need a dev server on BASELINE_BASE pointed at the cluster holding the data (formularies_staging + providers_staging are STAGING-cluster only per ADR 0004 — never write provider/drug data to prod).

State-by-state log

New York

Year	Approach	Status	Notes
2026	NY-style scrape (NYSOH) + DFS Exhibit 23 backstop	Shipped (ENG-211)	Community-rated, 16 regions, ~320 plan rows. Validation: new-york-2026.md. Zero variance vs NYSOH live tool.

California

Year	Approach	Status	Notes
2026 Wave 1	Hand-curated showcase (3-4 plans per state in CA/PA/NJ)	Shipped (ENG-373)	Sourced from CC 2026 SBP designs PDF. Estimate-only surface; not in `plans` collection. Wave 2 replaces this for CA.
2026 Wave 2	CA-style scrape (Approach 2), CC Shop & Compare	In progress (ENG-374)	19 RAs, age 21 anchor, $500K income, household 1. POC validated against $30K SF couple scenario at 0.22% delta. Code: `scripts/db/scrape-covered-ca-2026.ts`.

Pennsylvania

Year	Phase	Status	Notes
2026 Phase A — plans	GetInsured-stack scrape (Approach 2 variant), Pennie `/hix/private/getIndividualPlans` JSON API	Shipped + live on apex (ENG-418 PR #566 + A.4 prod-cluster apply 2026-06-02)	9 RAs, age 40 anchor, single-person household, 4 income buckets (no_csr / csr_73 / csr_87 / csr_94). 291 unique HIOS plans across 12 carriers (Jefferson / UPMC / Geisinger / IBC / Highmark family / Capital BC / Ambetter / Oscar / Partners). L1 validation: 4 RAs $0.00 exact vs PHIEA SLCSP PDF, all 9 RAs ≤ $1.49. Both clusters populated — staging snapshots B `6a1e44f83a4bf6ebedefa3a0` + C `6a1e45b729c719743c525854`; prod snapshots B `6a1e6c1d085fb664b689eec7` + C `6a1e6e23923b91f0f761d244`. Apex smoke verified 2026-06-02: ZIP 19103 → 73 plans (Jefferson top), 16501 → 61 (UPMC), 15834 → 34 (Geisinger), 18015 → 91 (Jefferson), 17101 → 57 (Capital BC) — all matching pre-apply per-RA breakdown. Cohort guard: non-PA 2026 unchanged at 4,495 on both clusters. Code: `scripts/db/scrape-getinsured-2026.ts` + `build-pa-plans-from-scrape-2026.ts` + `write-pa-plans-2026.ts` + `validate-pa-plans-2026.ts` + `.tmp-eng-418/sync-pa-zip-county-staging-to-prod.mjs` (one-shot, archived). PA is the pilot for the GetInsured 7-state batch — NJ / VA / NV / NM / ME inherit this pipeline (~1 day each).
2026 Phase B — drugs	Per-carrier formulary scrape, mirror NY ENG-412 Phase 1	Backlog	12 carriers: Ambetter / Oscar / Highmark (3 entities) / IBC / Jefferson / Partners / Geisinger (2 entities) / UPMC / Capital BC. Formulary URLs already captured per plan as `puf.urls.formulary` — extract dedup'd list. Writes to `formularies_staging` keyed `pa:<rxcui>` (matches NY's `ny:<rxcui>`).
2026 Phase C — providers	TBD source: PA-equivalent of NY-DOH-PNDS open data (if exists) OR per-carrier directory crawl OR §1311 MRFs	Backlog	Provider directory URLs already captured per plan as `puf.urls.providerDirectory`. NPPES-NPI-native expected (PA carriers use FFM HIOS schema). Writes to `providers_staging` keyed `pa:<npi>` (matches NY's `ny:<npi>`).

Illinois

Year	Phase	Status	Notes
2026 Phase A — plans	GetInsured-stack scrape (third consumer after PA + NJ)	Applied to BOTH clusters 2026-06-10 (ENG-448) — deploy pending	13 RAs × 4 CSR buckets, 271 HIOS plans, 7 carriers (BCBS IL/HCSC = 207 plans, sole issuer in 63 counties). csr_94 at $22,500 override. L1 13/13 Δ$0.00; L4 wave1 + calculator-baseline ZERO diffs; Layer 5 green. Snapshots staging B `6a29945b…`/C `6a29965c…`, prod B `6a29945d…`/C `6a29965b…`. Prod zip_county cleanup is a deploy-time step (`cleanup-il-zip-county-2026.ts`) — see watchouts IL section. QRS inline (233/271 rated). Catastrophic tier not captured (age-40 anchor).
2026 Phase B — drugs	ffm_1311_mrf sweep head start: 5/7 carriers already current (Oscar/Ambetter/UHC/Cigna/MercyCare, 3.8-4.7K RxCUIs each, plan ids verified == scraped 2026 hiosIds)	COMPLETE 2026-06-10 — 7/7 carriers	Fills: BCBS IL via `bcbsil.com/aca-json/il/drugs_il-1.json` (3,959 drugs; the HCSC `bcbs<st>.com/aca-json/<st>/` pattern, discovered from FFM af_sources + confirmed via `index_il.json`); Molina IL via ILFormulary2026.pdf parse + RxCUI fan-out (8,539 RxCUIs, 99.9% resolution — Molina publishes NO IL machine-readable drugs; its national cmsjson is FFM-only. ⚠️ Molina IL tier legend = national scheme 1=Preventive, NOT CA's SBP map). `derive-drug-search-index --apply` re-run (6,482 groups). Issuers publish §1311 MRFs regardless of FFM/SBE status — ALWAYS check existing coverage before a from-scratch Phase B.
2026 Phase C — providers	Same head start: 3/7 current (Ambetter 138,578 / UHC 98,645 / MercyCare 3,412 NPIs on bare-NPI docs — the owned-coverage npi-field lookup serves them as-is)	COMPLETE 2026-06-10 — 7/7 carriers	Fills (all bare-NPI, source-tagged il_*): BCBS IL 93,130 NPIs / 207/207 plans via `index_il.json` → 24 medical files (4 networks × INDIVIDUAL/GROUP/FACILITY; the filename codes ≠ network names — BAV files hold Blue-Precision); Oscar IL 34,145 via its TIC S3 (042 Select + 056 Choice; national §1311 providers_N.json scrubbed IL); Molina IL 19,412 via marketplace TIC (158 shards, provider_references at head, range-fetch + early-terminate; 152/158 across 3 passes, 6 shards 403'd); Cigna IL 20,444 via Pattern D county heuristic (collar counties; §1311 + TIC both scrubbed IL; over-attribution risk-accepted per PA precedent). IL added to OWNED_COVERAGE_STATES. IL has NO il: namespace — Layer 5 scopes by plan-prefix (framework extended, ENG-448). Layer 5: 213 checks 0 failed.

Virginia

Year	Phase	Status	Notes
2026 Phase A — plans	GetInsured-stack scrape (fourth consumer)	Applied to BOTH clusters 2026-06-10 (ENG-450) — deploy pending	12 RAs × 4 buckets, 69 plans, 6 carriers. csr_94 $22,500 override. APTC-implied SLCSP cross-check ≤$0.49 12/12. QRS 94%. Snapshots staging `6a29bce5`/`6a29be91`, prod `6a29bce6`/`6a29be90`. Prod zip_county cleanup = deploy-time (`cleanup-va-zip-county-2026.ts`).
2026 Phase B — drugs	3 ffm-swept + Kaiser kporg JSON + UHC OptumRx GPX526VA + Anthem 2026 PDF parse (FNAV JSON is 2025-only)	COMPLETE 2026-06-10 — 6/6	Per-carrier RxCUIs: Anthem 8,297 · Kaiser 6,010 (TIER-ONE..FOUR vocab added) · Cigna 4,690 · Sentara 4,336 · Oscar 3,994 · UHC 3,604. Search read-model re-derived (6,516 groups).
2026 Phase C — providers	2 ffm-swept + Anthem PROVIDERS_VA.json + Kaiser kporg (year-2023 waiver) + Oscar TIC 027 + Cigna county heuristic	COMPLETE 2026-06-10 — 6/6	NPIs: Sentara 57,111 · Anthem 54,383 · UHC 43,253 · Cigna 12,772 · Oscar 3,816 · Kaiser 1,901. Bare-NPI model (like IL). Layer 5: 243 checks 0 failed across all six states.

Nevada

Year	Phase	Status	Notes
2026 Phase A — plans	GetInsured-stack scrape (fifth consumer)	Applied BOTH clusters 2026-06-10 (ENG-451) — deploy pending	4 RAs × 4 buckets, 135 plans, 9 carriers. csr_94 $22,500. L1 + APTC-implied ≤$0.32 on 4/4. QRS 106/135. Snapshots staging `6a29e184`/`6a29e34f`, prod `6a29e186`/`6a29e34e`. Prod zip_county cleanup = deploy-time.
2026 Phase B — drugs	4 ffm-swept + Anthem/CommunityCare FNAV PDF + HPN OptumRx SE42L77 + Hometown IFP PDF	8/9 carriers	Ambetter 4,251 · Molina 3,803 · SelectHealth 3,861 · CareSource 3,555 (ffm) · Anthem + Community Care 8,560 (shared PDF) · HPN 4,794 (OptumRx ClientFormulary SE42L77 — largest carrier, 31 plans) · Hometown 8,792 (IFP Exchange PDF). Only Imperial gapped — its FNAV formulary is UnderConstruction.htm (carrier unpublished). Search read-model re-derived.
2026 Phase C — providers	3 ffm-swept + Anthem (Elevance PROVIDERS_NV.json) + HPN (UHC TIC blobs → Sierra-Nevada HMO)	5/9 carriers	Ambetter 87,449 · Molina 60,984 · Anthem 15,053 (year-waiver) · HPN 5,491 (largest carrier) · SelectHealth 2,786 = 100 of 135 plans. A second pass found Anthem + HPN after the first pass stopped at 3/9 — Anthem via the SAME Elevance `PROVIDERS_<ST>.json` pattern used for VA; HPN via the UHC blobs API used for NJ. Lesson: exhaust proven sibling-state patterns before declaring a gap. Remaining gap (CareSource low-compliance §1311, Community Care separate HMO, Hometown/Imperial small portals) = honest "no coverage data" per NJ/PA precedent. Layer 5: 274 checks 0 failed across 7 states.

Other SBEs (queued)

State	Marketplace	Expected approach	Status
New Jersey	GetCoveredNJ	GetInsured stack — inherit PA scraper (ENG-418), ~1 day	All 3 phases SHIPPED (ENG-438, 2026-06-09). Phase A: 60 plans across 5 carriers (HIOS 17970/91661/37777/91762/23818) live on BOTH prod + staging clusters. Phase B: 5/5 formularies (Centene 4,305 / Oscar 4,014 / AmeriHealth 3,977 / Horizon 4,270 / UHC 3,570 RxCUIs) on staging. Phase C: 5/5 providers (Centene 112,161 + AmeriHealth 978 + UHC 106,742 + Horizon 95,979 + Oscar 22,658 = 265,562 NJ NPIs) on staging. Cohort guard intact (non-NJ 3,730,755, drift 0 each ingest). UHC required `OptumRx GPX526NJ` for drugs + `transparency-in-coverage.uhc.com/api/v1/uhc/blobs/` discovery for providers. Horizon required `horizonblue.sapphiremrfhub.com` vendor portal (curl-level Incapsula bypass). Oscar required `oscar-001-in-network.json` (NOT the TOC-listed `oscar-002`). Full carrier-by-carrier discovery patterns + watchouts: see `sbe-state-watchouts.md` NJ section.
Virginia	Virginia's Insurance Marketplace	GetInsured stack — inherit PA scraper, ~1 day	M1 follow-up
Nevada	Nevada Health Link	GetInsured stack — inherit PA scraper, ~1 day	M1 follow-up
New Mexico	beWellnm	GetInsured stack — inherit PA scraper, ~1 day	M1 follow-up
Maine	CoverME	GetInsured stack — inherit PA scraper, ~1 day	M1 follow-up
Idaho	Your Health Idaho	GetInsured stack — inherit PA scraper, ~1 day	Not in original M1 7-state batch; queue after the 5 above
Minnesota	MNsure	GetInsured stack — inherit PA scraper, ~1 day	Not in M1 batch; queue after
Georgia	Georgia Access	GetInsured stack — inherit PA scraper, ~1 day	New SBE 2026
Illinois	Get Covered Illinois	SBE-FP — already in CMS FFM PUFs we ingest	Already covered via FFM PUF path (no separate scrape needed)
Colorado	Connect for Health Colorado	Approach 2	Backlog (founder noted "standardized way to quickly import" — promising)
Connecticut	Access Health CT	Approach 2	Backlog (same)
Maryland	Maryland Health Connection	Approach 2	Backlog
Washington	Washington Healthplanfinder	Approach 2	Backlog
Idaho	Your Health Idaho	Approach 2	Backlog
New Mexico	beWellnm	Approach 2	Backlog
Massachusetts	Health Connector	Approach 2	Backlog
Rhode Island	HealthSource RI	Approach 2	Backlog
DC	DC Health Link	Approach 2	Backlog
Vermont	Vermont Health Connect	Approach 2	Backlog
Kentucky	kynect	Approach 2	Backlog
Maine	CoverME	Approach 2	Backlog
Minnesota	MNsure	Approach 2	Backlog
Georgia	Georgia Access	Approach 2	Backlog
Illinois	Get Covered Illinois	Approach 2	Backlog
Nevada	Nevada Health Link	Approach 2	Backlog
Virginia	Virginia's Insurance Marketplace	Approach 2	Backlog

When adding a new state, fill in the row and link to the validation page once complete.

Prioritized roadmap — 19 remaining SBE states (2026-06-01)

Status as of 2026-06-01: CA + NY shipped (451 plans live in plans collection). 19 SBE states remain unconfigured: CO, CT, DC, GA, ID, IL, KY, ME, MD, MA, MN, NV, NJ, NM, PA, RI, VT, VA, WA.
This section ranks them by leverage (population coverage + ease) and dependency stack (plans → drugs → providers, in that order per state). Per the standing covenant, never start a new state's ingestion without completing this section's planning checklist for it — the per-state research column locks the approach before any code/scrape work.

Ranking criteria

Four signals, listed in priority order:

Population leverage — how many uninsured-ACA-eligible people does the state add to our addressable surface? Bigger states = bigger TAM unlock.
Marketplace tool quality — does it have a clean plan-list page that the CA-scrape pattern handles? Or is it built on an older / less consistent framework that needs a one-off scraper?
Plan-data shape — community-rated (NY pattern, simpler) vs. age-rated rating-area-bound (CA pattern, standard) vs. ZIP-level pricing (rare, would need a per-ZIP scrape).
State subsidy program — does the state add its own premium subsidy on top of federal APTC (like CA's CAPS+CAPC)? If yes, we need a calculate{State}Eligibility() helper before the state's pricing is consumer-correct.

Tier 1 — High population, standard pattern, ship-fast candidates

These are large-population states with the CA-pattern (age-rated, rating-area-bound) and either no state subsidy or a well-documented one. Plans-only first (drug + provider ingest queued after pricing lands).

#	State	Marketplace	Approach	Est. uninsured pop.	State subsidy?	Why ship first
1	PA	Pennie	CA-style scrape	~290K	No (federal APTC only)	5th-largest state by uninsured; Pennie is built on the Get Insured tech stack — SBE-modern, clean plan list. Wave 1 hand-curated showcase already shipped under ENG-373, so the carrier roster is partially known.
2	NJ	GetCoveredNJ	CA-style scrape	~210K	Yes — NJ Health Plan Savings	Same Get Insured tech stack as PA. Wave 1 showcase already shipped (ENG-373). State subsidy adds modest top-up on federal APTC up to 600% FPL — researchable from NJ DOBI publications.
3	IL	Get Covered Illinois	CA-style scrape	~370K	No (federal APTC only)	Brand-new SBE for 2026 plan year (transitioned off HC.gov). Tech stack: GetInsured (same as PA/NJ). Largest by uninsured among 2026-new SBEs.
4	VA	Virginia's Insurance Marketplace	CA-style scrape	~320K	No (federal APTC only)	Built on GetInsured tech stack. Transitioned to SBE in 2024. High population, expansion state (no ENG-414-class subsidy ambiguity).
5	GA	Georgia Access	CA-style scrape	~530K	No (federal APTC only)	Largest state in this set by uninsured. Brand-new SBE for 2026. Pre-flight check needed: confirm Georgia Access publishes a public shop-and-compare tool (some states gate plan browsing behind eligibility submission). Tech stack is custom (not GetInsured).

Tier 1 rationale: scoring the 5 most populous easy-pattern states first unlocks ~1.7M additional uninsured-ACA-eligible Americans across the platform. The PA/NJ pair has the lowest delta-effort because Wave 1 hand-curation already mapped the carrier roster (ENG-373); the GetInsured tech-stack reuse means a single scraper port covers PA + NJ + IL + VA.

Tier 2 — Medium population, standard pattern

Same approach (CA-style scrape) but lower population leverage per state. Group as a batch after Tier 1 ships.

#	State	Marketplace	Approach	Est. uninsured pop.	State subsidy?	Notes
6	WA	Washington Healthplanfinder	CA-style scrape	~330K	Yes — Cascade Care subsidy	Custom tech stack (HBE-built). Cascade Care adds state premium assistance on top of federal APTC up to 250% FPL.
7	MD	Maryland Health Connection	CA-style scrape	~280K	No (federal APTC only)	Custom MD tech stack. Standardized SBP designs across carriers; canonical CSR variants well-documented.
8	MA	Massachusetts Health Connector	CA-style scrape (likely)	~110K	Yes — ConnectorCare	Custom MA tech stack. ConnectorCare is a major state-funded subsidy program (essentially state-augmented Silver). Subsidy modeling adds complexity — research first.
9	MN	MNsure	CA-style scrape	~290K	Yes — MinnesotaCare (BHP)	Custom MNsure tech stack. MN runs a Basic Health Program (BHP) for 138-200% FPL like NY's Essential Plan — adds complexity.
10	CO	Connect for Health Colorado	CA-style scrape	~210K	Yes — Colorado OmniSalud + reinsurance	Custom CO tech stack. State Reinsurance Program reduces premiums by ~25% across the board; OmniSalud subsidy covers some undocumented residents (out of scope for federal-eligible pricing).
11	CT	Access Health CT	CA-style scrape	~140K	No (federal APTC only)	Custom AHCT tech stack. Modest population; relatively straightforward eligibility (no state subsidy program).

Tier 3 — Smaller population or higher-friction marketplaces

Lower priority by raw population. Cluster as a final batch.

#	State	Marketplace	Approach	Est. uninsured pop.	State subsidy?	Notes
12	NV	Nevada Health Link	CA-style scrape	~250K	No (federal APTC only)	GetInsured tech stack (same as PA/NJ/IL/VA — once Tier 1 lands, NV could ship as a fast-follow).
13	KY	kynect	CA-style scrape	~190K	No (federal APTC only)	Re-launched as SBE in 2022. Custom KY tech stack.
14	NM	beWellnm	CA-style scrape	~150K	No (federal APTC only)	GetInsured tech stack.
15	ME	CoverME.gov	CA-style scrape	~50K	No (federal APTC only)	GetInsured tech stack. Small population.
16	RI	HealthSource RI	CA-style scrape	~30K	No (federal APTC only)	Custom HSRI tech stack. Smallest by uninsured.
17	DC	DC Health Link	CA-style scrape	~25K	No (federal APTC only)	Custom DC tech stack. Tiny but politically visible.
18	VT	Vermont Health Connect	CA-style scrape	~25K	No (federal APTC only)	Custom VT tech stack. Tiny.
19	ID	Your Health Idaho	CA-style scrape	~120K	No (federal APTC only)	Custom YHI tech stack. Only non-expansion-state SBE; CSR-94 still applies via federal APTC at 100-138% FPL the same as any other state (no ENG-414 carveout needed — the SBE redirect fires before our FFM bump logic).

Per-state research backlog (locks the approach before scraping)

For each state, the following must be answered + documented in sbe-state-watchouts.md BEFORE starting the scraper code. This prevents the kind of stale-data + sync-bug class we hit on CA (Wave 1 stale 2025 carryover) and NY (PNDS specialty-code surprise, UHC GPX426NY-vs-Advantage confusion).

Question	Why it matters
Does the public shop-and-compare tool expose ALL plans without an eligibility submission?	Some states gate plan browsing behind eligibility (NJ does this for some flows). Determines whether the CA "anchor-age + $500K income" trick works as-is.
Premium varies by ZIP, by rating area, or community-rated?	Picks the scrape pattern (NY vs CA vs per-ZIP). Confirmed via Layer 2 same-RA invariance test.
Does the state publish a Standard Benefit Plan Designs (SBP) PDF for the plan year?	If yes, CSR variants are state-mandated and consistent across carriers (CA pattern); if no, each carrier's CSR variants must be sourced from per-carrier filings or PUF (FFM pattern).
State subsidy program shape (APTC top-up, BHP, reinsurance, OmniSalud-style outreach)?	Drives whether we need `calculate{State}Eligibility()` (NY/CA pattern) OR can rely on federal APTC only (most states).
Drug formulary source pattern: per-carrier PDFs? FormularyNavigator? Per-carrier API? PBM aggregator?	Picks the drug-ingest approach. NY taught us per-carrier PDFs + OptumRx-style API for the UHC outlier.
Provider directory source pattern: state DOH PNDS-equivalent? FHIR Plan-Net? Per-carrier portals?	Picks the provider-ingest approach. NY used PNDS (state DOH); CA used Symphony (the SBE-licensed directory). Smaller states may have neither — fall back to per-carrier FHIR Plan-Net.
Does the state run a Basic Health Program (BHP, NY's Essential Plan equivalent)?	Affects the 138-200% FPL band. MN runs MinnesotaCare; NY runs Essential Plan; no other 2026 SBE does.

Dependency stack per state (plans → drugs → providers)

The standing pattern that worked for CA + NY:

Phase A — Plans + pricing (the prerequisite for everything else). Required before anything user-facing works for the state. Validated by the 4-layer methodology above.
Phase B — Drug formularies → formularies_staging, source-tagged <state>_<carrier>_<year>_marketplace_formulary. Per-carrier PDFs the typical source. RxCUI resolution + dedup follow.
Phase C — Provider directories → providers_staging. State-by-state shape varies; NY used PNDS open-data (Socrata bulk), CA used CalHEERS Symphony, smaller states may rely on per-carrier FHIR Plan-Net or licensed feeds.

Each phase ships independently to apex behind the standing covenants (Atlas snapshot → dry-run audit → founder-gated apply → byte-identical guard on FFM + other-SBE cohorts → apex smoke).

Shipping cadence recommendation

Tier 1 (5 states, ~1.7M uninsured): ship Phase A only first (plans + pricing), batched as one Wave per state. Goal: a TX/FL-equivalent surface for the largest remaining SBEs. Defer drug + provider ingest per-state until pricing is verified live on apex.
Tier 2 (6 states): ship Phase A in the same batched pattern. Add state-subsidy helpers where required (WA Cascade Care, MA ConnectorCare, MN MinnesotaCare, CO Reinsurance).
Tier 3 (8 states): ship Phase A in a final batch.
Drug + provider ingest (Phases B+C) revisits each tier after pricing is live, prioritized by user demand (which states are showing up in waitlist signups + agent referrals).

Estimated effort per state (Phase A only)

Based on NY + CA ingests:

Scraper port (CA pattern, GetInsured tech stack): ~1 day per state once the first GetInsured scraper is reusable.
Scraper port (custom tech stack): ~2-3 days per state.
Validation (4-layer methodology): ~0.5 day per state.
State subsidy helper (if applicable): ~1-2 days per state (the actuarial math is well-defined; the testing is what takes time).
Atlas snapshot + dry-run + apply + apex smoke: ~0.5 day per state.

Tier 1 ship estimate: ~10-14 working days for all 5 states (assuming GetInsured scraper reuse across PA/NJ/IL/VA).

Out of scope for THIS plan

No execution. This document defines the work + priority + dependency order. Each state's actual ship is its own Linear issue + worktree per the standing pattern.
No deadline commitments. Sequencing depends on which states are showing up in user demand + agent referrals + funding priorities.
No formulary or provider work. Phase A (plans + pricing) is the gate for everything else; Phase B + C plans are filed per-state after Phase A lands.

Population data caveat

Uninsured population estimates above are directional — sourced from KFF state health facts (https://www.kff.org/state-category/health-coverage-uninsured/) and rounded for prioritization purposes only. Refresh with the next plan-year KFF release before committing to a Wave schedule.

Lessons learned + gotchas

A running list of mistakes + non-obvious findings. Read this before starting a new state ingest.

CA Wave 1 → Wave 2 transition (2026-05-24)

Stale-year carryover risk: Wave 1 seed used Silver 70 deductible: $5,400, which was a 2025 carryover. 2026 CA SBP actual is $5,200. Always source the SBP design values from the CURRENT plan year, not the prior year. Cross-check via the state marketplace's live display.
SF rating-area "gap" (ENG-152) was stale data, not a formula bug. Looked like a $173-$263 SF SLCSP delta in Wave 1; rerunning against 2026 source data eliminated it. Don't chase formula bugs without first verifying the source-of-truth year.

LA County RA15/RA16 unification finding (2026-05-25, Wave 2 Block B.5)

LA County historically splits between RA15 (North/East) and RA16 (South/West) at the ZIP level. CC 2026 PY scrape of both anchors revealed identical carrier roster (6 carriers, 40 total plans) and identical per-plan gross premium for both RAs.
Verify before assuming a historical split persists year-over-year: scrape both anchor ZIPs and diff the plan lists. If pricing + roster match exactly, the split is administrative-only and you can collapse to one canonical RA.
Decision for 2026: map all LA County ZIPs to RA15. RA16 remains in the regions collection for completeness but receives no ZIPs in zip_county.
If a future year reintroduces a meaningful split: scripts/db/scrape-la-zip-ra-split-2026.ts is kept as scaffolding for per-ZIP classification via subsidy-fingerprint match.

CC Shop & Compare scraper specifics

Pagination, NOT virtualization. CC's plan list paginates 10 plans per page using MUI Pagination. Click "Next page" button; do NOT try to scroll-infinite-load. Selector: button[aria-label*="next page" i].
Eligibility modal. After "See My Results" submit, a modal appears: "This isn't an application for health coverage." Click "Continue" before navigating to results.
Filter modal on plan list arrival. "Your Health Plan Filters" modal blocks the top of the plan list. Click "OK" to dismiss before extracting cards.
Plan cards are <li> elements with dynamic MUI classes. Can't anchor on class. Identify by text content: each plan card's <li> contains BOTH $X.YY /mo AND Yearly deductible.
Plan name heading contains the whole card text. Don't use h1/h2/h3.textContent to grab the plan name — it'll include Compare button + cost-share labels. Instead grep for the carrier + plan-line + network-type pattern (e.g., ^(Kaiser|Anthem|...) ... (HMO|PPO|EPO)$).
Combobox is custom, not native <select>. CC uses MUI Autocomplete. Click the combobox to open, then click the [role=option] element with the desired text. Two comboboxes on the form share name "Dropdown" (Coverage Year + Household Size) — pick by position (.nth(1)).
HDHP / HSA plans don't render a metal pill. They have plan-line names like "Kaiser Bronze 60 HDHP HMO" but no "BRONZE HSA" pill. Derive metal from plan name when pill not found.
High-income anchor for clean gross premium. With $500,000 income, CC shows zero subsidy and gross premium displays directly — no need to subtract subsidy to derive gross. Cleaner for ingestion math.

PA / GetInsured-stack scraper specifics (ENG-418)

JSON API beats SPA scraping where it exists. Pennie's /hix/private/getIndividualPlans returns the full 73-plan × 145-field response that the React SPA renders. Drive the prescreener form via Playwright JUST to establish the session cookie, then fetch the JSON API for the actual data. 36× faster than DOM-scraping the plan cards. Pattern likely repeats across the GetInsured stack (NJ/VA/NV/NM/ME) since they all run the same upstream platform — try the same endpoint first.
Multi-county ZIPs trigger a <select> dropdown that gates Continue. Border ZIPs like 15834 (Cameron/Elk/Potter) and 18015 (Bethlehem — Northampton/Lehigh) show a "Select your county" select with options whose VALUES are county FIPS. Handle in the scraper: detect select presence, set value to the anchor RA's county FIPS via the React setter pattern. THIS was the single biggest source of pre-fix RA2 + RA6 timeouts. Validate on at least one multi-county ZIP per state during Phase 0.
CSR-94 income bucket can be unreachable at single-person anchor. Pennie blocks the form at very-low single-person income (~$22K). Workarounds: (a) derive CSR-94 cost-shares from federal 45 CFR §156.420 AV target (0.94) applied to base Silver; (b) use a 2-person household at proportionally low income (Medicaid threshold scales by household size). Document which one you used in the validation page.
Pennie's costSharing codes don't match CMS HIOS variant suffix. Observed: CS5 returned for the csr_87 income bucket (CMS standard would be CS3); CS4 for csr_73 (CMS standard CS2). Don't key on Pennie's code — key on the income bucket you supplied. The cost-share VALUES are correct.
Plan-level fields are HIOS-keyed (one doc per HIOS), NOT per-(HIOS, RA). Look at an existing CA or NY plan in the plans collection before designing your normalize — regionId: null + premiumsByRatingArea: { 1: X, 4: Y, 8: Z } + ageRatesByArea: { 1: {0-14:…, 64 and over:…}, 4: {…}, ... } is the convention. PA originally emitted per-(HIOS,RA) docs and had to be re-grouped before apply.
Age curve keys are "0-14" / "15"-"63" / "64 and over", NOT integers 0-64. The "64 and over" key is what makes premiums work for users locked out of Medicare (recent green-card holders, non-citizens, anyone 65+ without 40 quarters of work history) — they pay the federally-capped age-64 rate per 45 CFR 147.102(e)'s 3:1 max age-rating rule. (Aside: CA stored integer 0-64 keys, which is a latent bug — 65+ CA users hit the route's fallback path. Tracked separately.)
Field-name parity matters: metal is UPPERCASE; planName not "name"; hsaEligible at top-level not nested under puf.planFeatures; program: "QHP" constant; both premiumsByRatingArea AND premiumsByRegion populated identically (CA legacy alias). Diverging from this shape means the /plans surface renders blank for your new state.
copays field shape varies across cohorts but the runtime tolerates both CMS-canonical keys ("Primary Care Visit to Treat an Injury or Illness") and camelCase shorthand ("primaryCare"). PA uses CMS-canonical with raw Pennie displayVal strings ("$15 Copay", "20% Coinsurance after deductible") — the runtime's extractCopay() parser handles all formats.

Quality ratings hide in plain sight (ENG-418 A.5 lesson)

Every SBE plan-search response we've seen so far includes the full CMS QRS issuer rating inline — the same data scripts/db/ingest-qrs-ratings.js calls the CMS Marketplace API for on the federal-30 side. The SBE scrape ALREADY has it; the normalize just has to read it.
Pennie field name: issuerQualityRating (24-field nested object including QualityRating, GlobalRating, SummaryClinicalQualityManagement, SummaryEnrolleeExperience, SummaryPlanEfficiencyAffordabilityAndManagement, EffectiveDate + 18 sub-scores by clinical category). Each plan record also has a flat qualityRating mirror at the top level.
Mapping: Pennie strings → canonical ints ("" → 0). SummaryClinicalQualityManagement → clinicalQualityManagementRating; SummaryEnrolleeExperience → enrolleeExperienceRating; SummaryPlanEfficiencyAffordabilityAndManagement → planEfficiencyRating. Use the 14-char HIOS prefix (issuerPlanNumber[0:14]) as the dedupe key — multiple variants of the same product share the issuer-level rating.
217/291 PA plans (74.6%) had ratings; 74 (mostly new carriers + new product lines without 3 enrolled years) returned empty issuerQualityRating. Federal-30 sees ~98.1% coverage. Both gaps are real — render "Not rated" rather than treating as 0-star.
CC + NYSOH scrapes are expected to follow the same pattern (per-plan inline rating from the upstream CMS QRS feed). Audit those scrapes' raw JSON before designing a separate ingest. Likely 90% of the work is git grep -i "qualityRating\|starRating\|qrs" scripts/db/data/ and a 20-line schema mapper.

Playwright + tsx interop

Inner function declarations break inside page.evaluate(() => {}). tsx wraps functions with __name(class) helper calls that don't exist in browser context. Pass the evaluator as a string template literal instead: page.evaluate(\(() => { ... })()`)` — and inline all logic, no inner functions.
Regex escape rules in template-literal evaluators. Use 2 backslashes for regex special chars: \\b in the template becomes \b in the string the browser parses. Avoid character-class dashes in the middle — put - at the end of the class: [A-Za-z0-9 ()./-].

MongoDB writes (every state, every year)

Standing covenant for plans collection writes: dry-run → tier-1..5 audit harness re-runs at 100% → rollback snapshot → founder-gated apply. See docs/validation/audit/ for the harness.
Atlas snapshot BEFORE any plans write. atlas backups snapshots create --clusterName <prod-cluster> --retention 30 --description "pre-state-X-Y ingest". Record snapshot ID in the ingest commit message for rollback traceability.

Cluster targeting — plans go to PROD FIRST (caught in ENG-418 A.4 apex smoke)

This is the single most important playbook addition from ENG-418. Plans (and zip_county) on apex are read via getDb() → MONGODB_URI → the prod M10 cluster (askflorence-prod-01.njkihm). The staging cluster (askflorence-staging.efsikmv) is what .env.local points at for local dev + the cross-cluster reference cluster for getReferenceDb() (formularies_staging + providers_staging only). Two separate Atlas clusters in two separate Atlas projects.

CA + NY convention (per scripts/db/copy-ca-data-prod-to-staging.cjs): write plans to PROD cluster first, then mirror to staging. ENG-418 Phase A.2.2 violated this by writing only to staging (because .env.local defaulted there + the write-pa-plans-2026.ts cluster guard was set to staging-only). Result: code went live on apex with OWNED_DATA_STATES containing PA, but apex /api/counties STILL returned sbeRedirect because PROD's zip_county still had sbeRedirect set and PROD's plans had 0 PA rows. Hour of investigation later: write the same data to prod, problem gone.

The two-cluster ingest pattern (apex-correct order):

Take Snapshot B on prod cluster (atlas backups snapshots create <prod-cluster> --projectId <prod-project-id> --desc "pre-state-X-Y ingest").

Get prod-cluster write access. Pattern (CLAUDE.md precedent):

bash

# Create temp DB user (45-min auto-expiry):
atlas dbusers create --projectId <prod-project-id> \
  --username eng<NNN>-<state>-prod-temp \
  --password "$(openssl rand -hex 32)" \
  --role readWriteAnyDatabase \
  --deleteAfter "$(date -u -v+45M '+%Y-%m-%dT%H:%M:%SZ')"
# Add your IPv4 to prod allowlist (NOT IPv6 — Atlas rejects):
atlas accessLists create "$(curl -s -4 ifconfig.me)" \
  --type ipAddress --projectId <prod-project-id> \
  --comment "eng<NNN> temp" \
  --deleteAfter "$(date -u -v+45M '+%Y-%m-%dT%H:%M:%SZ')"

Run --dry-run against prod URI (verifies cohort baseline + per-RA counts).
Run --apply with WRITE_CONFIRM=yes env var against prod URI. Cohort guard MUST show non-PA 2026 count unchanged.

zip_county cleanup on prod cluster (same script as staging, just against prod URI):

// For each ZIP belonging to the new state, $set regionId (per fips→RA table) + $unset sbeRedirect.
// If staging already has the canonical post-cleanup state, simpler: copy staging → prod by {zip, countyFips}.

Apex smoke matrix (curl per anchor ZIP → confirm counties, no sbeRedirect → POST /api/plans → confirm non-empty plan list with state's HIOS-prefix carriers).
Take Snapshot C on prod cluster.
Delete temp DB user + IP allowlist entry (atlas dbusers delete ... --force + atlas accessLists delete ... --force).
(Optional, cost-savings parity) Mirror prod → staging for the new state: node scripts/db/copy-ca-data-prod-to-staging.cjs style — adapt for the state's data subset.

The write scripts (write-<state>-plans-<year>.ts) MUST accept BOTH clusters in their safeHost regex. The founder-gated WRITE_CONFIRM=yes + cohort guard are the real safety net; the cluster regex just blocks accidental third-cluster runs (e.g., a dev cluster). Example from write-pa-plans-2026.ts:

const safeHost = /askflorence-staging|askflorence-prod-01|atlas-ttlyyd|njkihm/i;

Federal regression bright-line

Never merge an SBE data change without re-running wave1-federal-regression --diff. Zero diffs is the gate. Wire it into preflight (already done via scripts/preflight.ts --full).

Provider-coverage ingest patterns — PA Phase C playbook (added 2026-06-05)

PA Phase C (ENG-435 + ENG-437) shipped all 9 PA marketplace carriers (165 plans) with provider attribution to providers_staging in one extended session. Discovered 5 distinct ingest patterns that cover the realistic set of carrier-publishing approaches. Every future state's provider ingest should map each carrier to one of these patterns first.

Pattern decision matrix (run this BEFORE coding)

For each carrier, check IN ORDER (cheaper patterns first):

Step	Check	If yes → use
1	§1311 MRF TOC is publicly accessible (carrier site footer or `/transparency-in-coverage`) AND TOC contains HIOS marketplace plan IDs	Pattern A — MRF direct
2	TOC files are CMS-standard `.json.gz` (Cloudfront-signed)	Pattern A (streaming gz)
3	Public find-a-doctor SPA exposes NPI in network XHR responses (capture via Chrome MCP fetch-wrapping)	Pattern B — SPA-API direct
4	Public find-a-doctor page IS NPI-blind (PDF / server-rendered HTML / vCard with no NPI) BUT carrier publishes provider PDFs	Pattern C — PDF + NPPES fuzzy match
5	Public path member-auth gated OR rate-limited to unviability	Pattern D — county heuristic (sister-line risk-accepted attribution)
6	Public sister product line shares the same network and we already ingested it	Pattern E — additive attribution (zero new NPIs needed)

Pattern A — §1311 MRF direct ingest

Recipe:

Discover TOC URL (carrier transparency page or footer)
Verify TOC has plan_id_type === "hios" entries matching state marketplace HIOS prefix
For each reporting_structure that lists our marketplace plans, collect in_network_files[] URLs
Stream each file (HTTP fetch); for .json.gz, gunzip + early-termination at "in_network" marker token (270× speedup vs full parse)
Regex extract NPIs from "npi"\s*:\s*\[...\] arrays + "npi"\s*:\s*<10-digit> scalars
JSONL write with our_plan_ids populated from the 14-char marketplace HIOS lookup
--apply upserts pa:<npi> with $setOnInsert (identity locked) + $addToSet plans (idempotent)

Real examples: Ambetter Centene national TOC (api.centene.com/.../cms-data-index.json → 111,885 NPIs), Highmark BS PA (mrfdata.hmhs.com/.../highmark-bsp-index.json → 92,605 NPIs), IBX QCC (ihg-dart-edw-mrf-prod-public/qcc/.../index.json → 809,576 NPIs), Oscar (single in-network JSON file → 6,491 NPIs).

Caveats:

TOCs vary wildly in size: from one file (Oscar = 13.6 MB) to 525 files (IBX = 75+ GB if all downloaded)
Cloudfront URLs have signed expiry — fetch within 24 hr of TOC capture
Server-side rate-limit on heavy file families (e.g., IBX's 17B0/17D0 — terminated connections at 24-parallel fetch). Solution: accept partial + dedupe overlap is high

Key script: scripts/db/ingest-pa-providers-mrf-generic.cjs — env-driven generic implementation. Set CARRIER_NAME, CARRIER_SOURCE_TAG, HIOS_PREFIX, TOC_URL, PLAN_ISSUER_FILTER env vars.

Pattern B — SPA-API direct (NPI in JSON XHR)

Recipe:

Open carrier's find-a-doctor page via Chrome MCP (residential IP bypasses datacenter bot-shields like Radware)
Set up fetch wrapper via Chrome MCP javascript_tool to capture all XHR calls
Trigger a search via the SPA's UI (or direct URL navigation)
Identify the search endpoint (e.g., POST /healthsparq/public/service/v4/search)
Replay the captured body shape from JS in the same browser session (cookies + auth tokens auto-flow)
Parse JSON response; look for attributes[] array with {key:"NPI", value:"<10-digit>"} pattern
Iterate via prefix-letter sweep + pagination (HealthSparq caps at 300 per query / 10 per page)
Apply with cohort guard

Real example: Geisinger HealthSparq API (POST /healthsparq/public/service/v4/search) → 6,190 unique NPIs across 26-letter + 2-letter prefix sweeps in ~25 min. NPI was in providerResults[].attributes[] where key === "NPI".

Caveats:

5-15% of carriers expose NPIs in API but filter from rendered HTML — always sniff XHR before deciding "NPI not exposed"
HealthSparq / similar vendors hard-cap results per query (~300-500) — must iterate via prefix sweeps
Session timeouts (Geisinger expired after ~20 min sustained queries → HTTP 440)
Some endpoints prevent re-auth via fetch — must trigger via SPA UI

Pattern detection:

Vendor tells: HealthSparq, Sapphire (CalHEERS Symphony), HealthTrio Connect (HTML only, NOT JSON), Vitals, MyHealthToolkit
Test: search a known last name in SPA + check Network tab for application/json responses containing NPI digits

Pattern C — PDF/HTML directory + NPPES fuzzy match

Recipe (PDF source):

Discover marketplace PDF download URLs (often via /pdf-directory/... redirects to widen.net or other CDN)
Real PDF URL hidden in viewer HTML at window.viewerPdfUrl (signed CDN URL with ~24hr expiry)
Download PDFs (one per network if carrier offers tiered networks)
Use pdfplumber for column-aware extraction (NOT pdftotext — interleaves multi-column layouts)
Use NAME-ANCHORED regex extraction over column-aware extraction:
- Provider names follow Last, First [Middle], CRED pattern (very strong anchor)
- Page-header has county name (geographic anchor)
- Skip facility sections (Hospitals, Labs) — focus on PCP/Specialist sections
Output JSONL: {last_name, first_name, county, credential}
Build NPPES PA index from existing providers_staging (pa:* docs with name.{first,last} + addresses[].zip)
Build PA ZIP → county lookup from zip_county collection
Match by (last_name, first_name, county) — score: exact_last_first_county (high), exact_last_initial_county (medium), unique_lastname (low)
Apply matched NPIs with $addToSet plans

Real example: UPMC 4 marketplace PDFs (88 MB, 15,848 pages total) → name extraction → matched 9,560 of 34,541 directory entries (77% exact_last_first_county high-confidence) against 329,902 NPPES PA baseline.

Recipe (HTML source):

Same as PDF but parse rendered HTML for name/city/zip per provider card
HealthTrio Connect uses LI.ht2-ProviderCard wrappers
Cloudflare rate-limits aggressively — must use long throttle (1-2 sec/request)

Caveats:

Match accuracy 70-95% depending on directory data quality (more credentials = better match)
Plan→network mapping often UNKNOWN at PUF level (puf.networkId null) — accept sister-line over-attribution to all carrier plans
Address parsing from multi-column PDFs is fragile; ZIP-only matching often sufficient

Key scripts:

scripts/db/upmc-pdf-extract-names.py (pdfplumber + regex name extraction)
scripts/db/upmc-fuzzy-match-nppes.cjs (NPPES index + match + apply)

Pattern D — county-heuristic over-attribution

When to use: Carrier's provider directory is unscrapably difficult:

Member-auth gated (/secure/member/...)
HIOS-search-gate rejects marketplace HIOS
HealthTrio HTML directory rate-limits at 4hr+ ETA
Vendor refuses datacenter IPs across all bypass attempts

Recipe:

Identify carrier's actual service area (county-level) via marketing materials, broker maps, "where we serve" page
Resolve county set to all PA ZIPs via zip_county collection (distinct('zip', {state:'PA', county:{$in: counties}}))
Count NPPES PA NPIs whose addresses.zip is in service-area ZIPs (sanity check expected count)
updateMany with $addToSet plans filtered to matched ZIPs

Real examples:

Jefferson + Partners (HIOS 19702 + 93909) — SE PA (Philadelphia, Bucks, Chester, Montgomery, Delaware, Lehigh, Northampton, Berks) → 156,089 NPIs in 4 min
Capital BC (HIOS 45127) — Central PA (Dauphin, Lancaster, Lebanon, Cumberland, York + border Adams, Franklin, Perry) → 46,377 NPIs in 80 sec

Risks (always document in commit message):

~30-40% over-attribution: providers in service-area counties who DON'T actually contract with the carrier get marked InNetwork
UI mitigation: copy directing users to verify with carrier directly before booking
Mitigation justification: same discipline as IBX 33871 sister-line + UPMC all-networks attribution. Better than zero coverage which blocks enrollment journeys

Quality bar to ship over heuristic:

Coverage on most populous network in service area (>40k providers per major carrier)
Direct API/MRF/PDF availability that survives a 4-hour effort budget

Pattern E — additive attribution to sister product line

When to use: Same issuer has multiple HIOS prefixes for parallel product lines (HMO + PPO, Advantage + Premier, etc.) and we've already ingested one of them.

Recipe:

Verify issuer-name match in /api/plans output (both prefixes branded same)
Confirm no separate publicly-discoverable TOC for the secondary prefix
updateMany to add secondary-prefix plan IDs to ALL existing pa:* docs with primary-prefix plan source

Real example: IBX 33871 (12 plans, Independence Administrators line) attributed to all 809,576 IBX-31609 (Independence BC core line) NPIs in 32 min via $addToSet.

Caveats:

Risk-accept ~5-15% false positive for product-line-restricted providers
Best for sister lines of same legal issuer (e.g., HMO + PPO of same parent)
NOT appropriate when "sister" line is a separate underwriter (different BAA, different contracts)

Cohort guard discipline (every Pattern, every apply)

javascript

// Pre-flight: capture invariant
const totalBefore = await coll.estimatedDocumentCount();
const paBefore = await coll.countDocuments({_id: {$regex: "^pa:"}});
const nonPaBefore = totalBefore - paBefore;
log(`Pre-flight: pa:*=${paBefore}, non-pa=${nonPaBefore}`);

// ... ingest ...

// Post-flight: verify
const drift = Math.abs(nonPaAfter - nonPaBefore);
if (drift > 50) throw new Error(`COHORT GUARD FAIL: drift=${drift}`);
log(`✓ Cohort guard preserved (drift=${drift})`);

Why: Catches accidental cross-state writes (e.g., bug where pa: prefix gets dropped from _id and writes go to bare-NPI docs touching other states). Across 9 PA carrier ingests, drift was 0 every time — the guard fires cleanly when wrong.

Cluster guard (every Pattern, every apply)

javascript

function assertSafeCluster(uri) {
  const m = uri.match(/^mongodb(?:\+srv)?:\/\/[^@]*@([^/?]+)/);
  if (!m) throw new Error("Bad URI");
  const host = m[1];
  if (!host.includes("askflorence-staging") && !host.includes("efsikmv")) {
    throw new Error(`CLUSTER GUARD FAIL: host=${host}`);
  }
  return host;
}

Per ADR 0004 — providers_staging writes go to staging cluster only; apex reads via cross-cluster Atlas PrivateLink.

Release snapshot tags (every Pattern, every deploy)

Per ENG-435 + ENG-437 cadence:

Before merging deploy PR: git tag pre-eng-XXX-<carrier>-YYYYMMDD-HHMM at main HEAD
After merge: git tag post-eng-XXX-<carrier>-YYYYMMDD-HHMM at merge commit
Push both. These create clean rollback points if the deploy regresses.

Per-carrier discovery sequence (for any future SBE state)

For each carrier in the new state's marketplace:

Marketplace HIOS prefix lookup — query /api/plans for the state to get prefix → issuer-name map
Transparency-page discovery — search for <carrier>.com/transparency-in-coverage or footer link "Machine-Readable Files" or "Price Transparency"
TOC verification — fetch TOC URL + grep for HIOS marketplace prefix
If TOC has marketplace plans → Pattern A
If TOC missing or empty → find-a-doctor SPA discovery
Set up Chrome MCP fetch-wrapping + trigger search → check for NPI in JSON
If NPI in JSON → Pattern B
If NPI absent but PDF directory available → Pattern C (PDF)
If NPI absent and only HTML directory + rate-limit testing fails → Pattern D (county heuristic)
If sister product line exists with already-ingested network → Pattern E

Bias toward earlier (cheaper) patterns. Pattern D is the safety net but accept the over-attribution risk in the commit message.

Adding a new state — checklist

Use this when starting Wave 2 for any SBE not yet ingested.

[ ] Identify the state's public shop & compare tool URL
[ ] Run Layer 1-3 validation: 1 subsidy scenario + same-RA test + cross-RA test against the live tool. Block if any layer fails.
[ ] Source the state's published rating-area → ZIP mapping (state DOI or marketplace publishes this annually)
[ ] Create data/{state}-rating-areas-{year}.ts with one anchor ZIP per RA (start with 1 per RA; add per-county anchors if Layer 2 reveals carrier-availability variance within a single RA)
[ ] Source the state's Standard Benefit Plan Designs (state-mandated cost-share canonical values per metal x CSR tier). Most SBEs publish this annually.
[ ] Branch the CA scraper pattern (scripts/db/scrape-covered-ca-2026.ts) for the state's tool. Selectors will differ.
[ ] Run the scrape, output JSON.
[ ] Build the normalize script: JSON → plans collection doc shape, applying federal default age curve to derive all 65 ages from anchor age.
[ ] Build the state-subsidy helper if the state has its own subsidy program (see state-subsidies.md for the catalogue).
[ ] Add calculate{State}Eligibility() in apps/web/src/lib/owned-plans.ts mirroring calculateNyEligibility().
[ ] Capture puf.qualityRating in the normalize step, NOT in a separate ingest. SBE-state scrapes universally expose CMS QRS issuer ratings inline (Pennie: issuerQualityRating.{QualityRating, GlobalRating, SummaryClinicalQualityManagement, SummaryEnrolleeExperience, SummaryPlanEfficiencyAffordabilityAndManagement, EffectiveDate}; equivalent fields expected on Covered California's Shop & Compare scrape + NYSOH plan-list response). Map to the canonical schema ({available, year, globalRating, clinicalQualityManagementRating, enrolleeExperienceRating, planEfficiencyRating, globalNotRatedReason, clinicalNotRatedReason, enrolleeNotRatedReason, efficiencyNotRatedReason, effectiveDate, qualitySource} — same shape scripts/db/ingest-qrs-ratings.js writes for federal-30). Coverage typically 70-100% (carriers without 3 years of enrollment go unrated; matches federal QRS pattern). DO NOT skip this. PA Phase A.2.2 missed it; had to ship a follow-up ENG-418 A.5 augment to backfill from the same scrape JSONs (zero new scraping needed — the data was already in our hands).
[ ] Atlas snapshot B (pre-apply, mandatory — descriptive name includes pre-apply row counts).
[ ] Dry-run plans write → audit harness 100% → founder approval → apply.
[ ] Atlas snapshot C (post-apply marker — descriptive name includes new row counts).
[ ] zip_county cleanup (NEW — caught in ENG-418 Phase A.4 apex smoke): every ZIP for the new state has sbeRedirect set from the legacy SBE-redirect seed. /api/counties reads this field and short-circuits, so even after the constants flip the route still returns sbeRedirect. Run:
js
```
db.zip_county.updateMany(
  { state: '<NEW_STATE>' },
  { $set: { regionId: <fips→ra via state-specific map> }, $unset: { sbeRedirect: '' } }
);
```
Verify post-flight: every ZIP for the state has regionId populated + no sbeRedirect field. Sample a few ZIPs to confirm the regionId matches the state's RA-by-FIPS table.
[ ] Add state code to OWNED_DATA_STATES in packages/shared/src/plans/owned-plans.ts.
[ ] Remove state from STATE_BASED_MARKETPLACES in packages/shared/src/constants/index.ts.
[ ] Re-run federal regression sweep. Must stay 0 diffs.
[ ] Apex smoke — at least one ZIP per RA: /api/counties?zip=… returns counties (not sbeRedirect), /api/plans returns real plan list (not empty), one plan-detail load shows copays + SBC link + formulary link + provider directory link.
[ ] Add validation page at docs/validation/{state}-{year}.md with L1-L4 results + snapshot IDs (A, B, C).
[ ] Update the state-by-state log table above with the new row.
[ ] If the state has carrier-availability quirks within an RA, add per-county anchors AND document the anomaly in "Lessons learned" so the next state doesn't repeat the same investigation.
[ ] Phase B (drugs) + Phase C (providers) follow plans — file as separate issues mirroring NY ENG-412 Phase 1 + Phase 2. Don't try to do all three in one PR; the snapshot + cohort-guard discipline gets harder to verify per layer.