SBE state watchouts + decisions ​

For any work touching SBE state data (CA, NY, PA, NJ, MA, WA, CO, CT, MD, etc.) — read this doc BEFORE designing or implementing. It captures state-specific decisions already made (data sources, coarsening trade-offs, anon-endpoint legalities, gaps in our plan store) so we don't re-derive them every time. When a session opens a new state's work, add a section here.

Index:

  • California — Phase C/D complete (one open gap: ENG-410)
  • New York — plans ✅, coverage NOT ingested (ENG-412); NPPES-native so no CA bridge gap
  • Pennsylvania — Phase 0 research locked (ENG-418); GetInsured-stack pilot, federal APTC only, no SBP
  • New Jersey — all 3 phases shipped (ENG-438) + 2026-06-10 Horizon plan-id repair (ENG-447)
  • Illinois — ALL PHASES COMPLETE 2026-06-10 (ENG-448); GetInsured stack, 13 RAs / 7 carriers
  • Virginia — ALL PHASES COMPLETE 2026-06-10 (ENG-450); GetInsured stack, 12 RAs / 6 carriers
  • Nevada — Phase A + partial B/C 2026-06-10 (ENG-451); GetInsured stack, 4 RAs / 9 carriers
  • Template for new states — what to investigate first

California ​

Status: Phase C (drugs) ✅ shipped (ENG-395 / PR #505). Phase D (providers) ✅ ingested (ENG-408 / PR #TBD). Route extension state-aware ✅ shipped (ENG-407 / PR #510). SBE estimate Wave 1+2 ✅ shipped (ENG-373/374).

Data sources ​

| Layer | Source | Auth | Schema notes |
| --- | --- | --- | --- |
| Plans + pricing | Covered California rating-area data (Wave 2 / ENG-374) | curated harvest | 169 CA 2026 plans in plans collection, 11 carriers, puf: {} empty (see below) |
| Drug formularies | Carrier marketplace formulary PDFs (Kaiser, Blue Shield, Anthem, Health Net, Molina, LA Care, SHARP, IEHP, CCHP, Valley Health, Western Health) | public PDFs | parsed into formularies_staging, 17,447 drugs × 169 plans, tagged source: ca_<carrier>_2026_marketplace_formulary |
| Provider directory | CalHEERS anon endpoint backed by Symphony (IHA + Availity, SB 137) | anon (no auth) | 165,974 providers in providers_staging namespaced _id: "ca-sym:<providerId>", tagged source: ca_symphony_2026 |
| Live provider search | Same anon endpoint via coveredca-provider-proxy.ts | anon | 24h cache, ~5sec per (zip, radius, year) query |

Standing decisions ​

1. CA plan docs have empty puf: {} ​

PR #460 / ENG-374 ingested CA pricing + plan metadata but did NOT puf-augment. Implications:

  • puf.networkId not available → provider-plan mapping uses HIOS issuer prefix (see #2)
  • puf.formularyId not available → drug coverage uses formularies_staging via state-aware dispatch (ENG-407)
  • puf.qualityRating not available → no star ratings on CA plan cards (acceptable v1)
  • puf.sbcScenarios not available → no SBC year-cost scenarios on plan detail page (acceptable v1)

Recommendation if puf-augment desired later: parse CC-published plan detail templates + carrier MRFs for the cost-share + scenario data, write to puf.* on each CA plan doc (same shape as FFM).

2. Provider-plan mapping is HIOS-prefix coarse, not per-network precise ​

Symphony returns networkId like 40513CAN011-2600 (Kaiser network 011, segment 2600). Without puf.networkId on our plan docs, we can't do per-network mapping. Instead we use the first 5 chars (HIOS issuer prefix) — every provider whose networkId starts with 40513 is treated as in-network for ALL Kaiser CA plans.

Why this is acceptable for CC: Covered California's Standard Benefit Designs (SBP) mean marketplace plans within a metal tier share standardized cost-sharing AND typically share the provider directory at the carrier level. A doctor credentialed in Kaiser's CA network is effectively in-network for all Kaiser CA marketplace plans.

Trade-off: can over-attribute when a carrier runs distinct narrow networks (HMO vs PPO vs EPO). Recommendation: refine to per-network when we license Symphony commercially OR ingest CA carrier MRFs.
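
A minimal TypeScript sketch of the coarse HIOS-prefix match described in this decision. The networkId value and the Blue Shield plan ID come from this doc; the SymphonyProvider shape, the function names, and the Kaiser-style plan ID are hypothetical and only illustrate the comparison, not the production code.

```typescript
// Hypothetical shape: only the networkId format comes from the doc.
interface SymphonyProvider {
  providerId: string;
  networkId: string; // e.g. "40513CAN011-2600"
}

// The first 5 characters of a HIOS-style ID identify the issuer.
function hiosIssuerPrefix(id: string): string {
  return id.slice(0, 5);
}

// Coarse check: a provider counts as in-network for every plan sold by
// the same issuer, regardless of the specific network segment.
function isInNetworkCoarse(provider: SymphonyProvider, planId: string): boolean {
  return hiosIssuerPrefix(provider.networkId) === hiosIssuerPrefix(planId);
}

const kaiserDoc: SymphonyProvider = {
  providerId: "example-1", // illustrative ID
  networkId: "40513CAN011-2600",
};

console.log(isInNetworkCoarse(kaiserDoc, "40513CA0000000001")); // true (made-up Kaiser-prefixed plan ID)
console.log(isInNetworkCoarse(kaiserDoc, "70285CA135000101")); // false (Blue Shield plan)
```

Because only the issuer prefix is compared, a provider in one of a carrier's narrow networks also matches that carrier's other plans, which is the over-attribution trade-off noted above.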

3. NPI bridge NOT NEEDED for the customer flow (decided 2026-05-28) ​

Symphony anon doesn't return NPPES NPI (0/85,735 probed). I initially flagged this as a major gap.

Founder reframe: real customer flow is ZIP-first — enter ZIP → see plans → search doctors within those plans. They never come in holding an NPPES NPI. Doctor search returns Symphony-providerId-keyed results that link directly to plans via the HIOS prefix map. Saved doctors carry Symphony providerId, NOT NPI. The "is THIS doctor I already saved on this plan?" check uses the same providerId we wrote at save time.

Therefore: NPI bridge is solving a problem nobody has. external_ids.npi: null on every CA Symphony doc is fine. If we ever surface NPPES-keyed doctors in the CA UX later (e.g. agent-side workflow), we'd reopen this. Not before.

4. accepting defaulted to "accepting" on every CA Symphony row (decided 2026-05-28) ​

Symphony anon doesn't return the new-patients-accepting flag. We default to "accepting".

Founder reframe: if a customer's existing doctor isn't taking new patients on their new plan, "they just go get a new doctor." Not a launch blocker. Florence flow surfaces 3-5 PCP suggestions per plan — if one doesn't pan out, the user picks another.

SB 137 sub-argument: California SB 137 mandates monthly Symphony directory refresh — providers who stop accepting are supposed to be removed from the directory within 30 days. So the directory's currency rule is itself a soft proxy for "currently accepting." Not airtight but reasonable.

If we ever need the explicit flag later: Symphony commercial license includes it OR live-verify against the anon endpoint at suggest time.

5. Tier-aware copays are NOT a CA gap (decided 2026-05-28) ​

Initially flagged as a limitation. Wrong on inspection.

Reality: doctor copays come from the plan doc (plan.copays.primaryCare), not from the provider's network tier. For CC SBP plans there's effectively one in-network tier — providers are in-network or not. Multi-tier networks (where Preferred costs $10 / Standard costs $30 on the SAME plan) exist on some FFM commercial plans but not on CC marketplace.

Therefore: defaulting network_tier: "InNetwork" on every CA Symphony row is correct. Plan card shows $5 primary care visit from the plan doc; in-network/out-of-network is the only check needed.
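
Decisions #3 through #5 together fix the defaults written on every CA Symphony row. The sketch below shows one way that mapping could look. The _id format, the source tag, external_ids.npi, accepting, and network_tier follow this doc; the SymphonyRow input shape, the issuerPrefix field, and the function name are assumptions for illustration only.

```typescript
// Raw row from the anon endpoint. Field names are assumptions,
// not the real Symphony response schema.
interface SymphonyRow {
  providerId: string;
  name: string;
  networkId: string;
}

// Staged document. Fields other than _id, source, external_ids,
// accepting, and network_tier are hypothetical.
interface StagedCaProvider {
  _id: string;
  source: "ca_symphony_2026";
  name: string;
  issuerPrefix: string;
  external_ids: { npi: string | null };
  accepting: "accepting";
  network_tier: "InNetwork";
}

function toStagedCaProvider(row: SymphonyRow): StagedCaProvider {
  return {
    _id: `ca-sym:${row.providerId}`, // namespaced so it cannot collide with NPI-keyed docs
    source: "ca_symphony_2026",
    name: row.name,
    issuerPrefix: row.networkId.slice(0, 5), // HIOS issuer prefix (decision #2)
    external_ids: { npi: null }, // decision #3: no NPI bridge
    accepting: "accepting", // decision #4: defaulted
    network_tier: "InNetwork", // decision #5: single in-network tier
  };
}

console.log(toStagedCaProvider({ providerId: "12345", name: "Example Provider", networkId: "40513CAN011-2600" })._id);
// ca-sym:12345
```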

6. Symphony anon endpoint legal posture ​

Scraping CalHEERS' anonymous SPA endpoint is unofficial. Acceptable as backend interim solution. NOT marketable as "powered by Symphony" until we license directly from IHA. Symphony customer login at symphony.iha.org; IHA Oakland 510-208-1740 for downstream-data-consumer subscription pricing (no public price sheet; expect $5-20K/yr).

When to license: when CA-derived revenue justifies the cost AND any of these become acute:

  • We need per-network precision (kills the HIOS-prefix coarsening of decision #2)
  • We need explicit accepting flag (decision #4 upgrade)
  • We start marketing CA-specific UX externally and want the legal cover
  • We add another CA state-specific feature that benefits from the richer schema (languages, sex, board cert, education, etc.)

7. Provider DISCOVERY vs COVERAGE — the /plans name-search gap (found + verified 2026-05-28, ENG-408 post-ship trace) ​

This is the one genuinely open CA gap. There are TWO ways a user surfaces a provider, and they use DIFFERENT identifier systems:

| Surface | Discovery path | ID returned | Coverage check | CA status |
| --- | --- | --- | --- | --- |
| Florence ("show doctors on this plan") | /api/providers/suggest → CalHEERS Symphony | Symphony providerId | matches our providers_staging (_id: ca-sym:<id>) | ✅ Works |
| /plans CoveragePanel ("search MY doctor by name") | /api/providers/autocomplete → NPPES (national NPI registry) | NPPES npi (10-digit) | queries providers_staging by npi | ❌ Broken for CA |

Why /plans is broken for CA providers: NPPES autocomplete does return CA doctors (NPPES is the national federal registry, state-agnostic), so discovery looks like it works. But it hands back the doctor's NPPES NPI. Our CA Symphony docs are keyed by Symphony providerId with external_ids.npi: null — there is no NPPES↔Symphony bridge. So /api/providers/covered finds no match → returns NotCovered for EVERY name-searched CA doctor, even in-network ones.

Verified on apex 2026-05-28: NPPES NPI 1265150155 (Riley Smith LCSW, SF) checked against Blue Shield plan 70285CA135000101 with state=CA → NotCovered, despite the carrier being in-network. This is the NPI-bridge gap (decision #3) biting the /plans surface specifically.

This does NOT affect: drugs (RxCUI is a single national identifier — search AND coverage both use it, fully works end-to-end for CA), CA pricing, CA subsidy, or the Florence provider flow.

The contained fix (not yet built): /api/providers/search-ca already exists (PR #505, Symphony-backed) but is wired to NO UI. For CA, point the /plans CoveragePanel doctor-search at it instead of /api/providers/autocomplete. It returns Symphony providerIds, which match our ingested data → coverage closes the loop. Saved CA doctors would carry providerIds (consistent with the ZIP-first flow the founder described). Medium effort; needs a ProviderHit-shape adapter for the Symphony response. File as a ticket when /plans CA provider-coverage becomes a priority — if Florence is the primary CA consumer surface, this can wait.

Decision posture: CA is "sufficiently working" for pricing + subsidy + drugs (full) + Florence-flow providers. The /plans name-search provider coverage is the single known gap, blocked on either (a) wiring search-ca into CoveragePanel for CA, or (b) the Symphony license / NPPES cross-reference that closes the NPI bridge generally.
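
The contained fix in decision #7 amounts to a per-state choice of search endpoint plus a small response adapter. A rough sketch, assuming a ProviderHit shape and a search-ca response item that are NOT taken from the codebase (only the two endpoint paths and the ProviderHit name appear in this doc):

```typescript
// ProviderHit is named in the doc; these fields are guesses.
interface ProviderHit {
  id: string;
  idKind: "npi" | "symphony";
  displayName: string;
}

// Hypothetical item shape for the /api/providers/search-ca response.
interface SearchCaItem {
  providerId: string;
  fullName: string;
}

// CA searches go to the Symphony-backed route; everything else stays on NPPES.
function doctorSearchEndpoint(state: string): string {
  return state.trim().toUpperCase() === "CA"
    ? "/api/providers/search-ca"
    : "/api/providers/autocomplete";
}

// Adapter so the CoveragePanel can consume Symphony results in the same shape.
function adaptSearchCaItem(item: SearchCaItem): ProviderHit {
  return { id: item.providerId, idKind: "symphony", displayName: item.fullName };
}

console.log(doctorSearchEndpoint("CA")); // /api/providers/search-ca
console.log(doctorSearchEndpoint("NY")); // /api/providers/autocomplete
```

Carrying an explicit idKind on each hit is one way to keep a saved CA doctor from being checked as an NPI later; the real adapter may solve that differently.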

Known limitations (acceptable for v1) ​

  • Multiple practice addresses per provider — current crawl captures 1 address per Symphony provider. Symphony's anon endpoint returns one row per provider per zip query; cross-zip dedupe folds duplicates by providerId without merging addresses. Could re-crawl per-provider but adds 165K serial API calls. Not a UX blocker.
  • Phone number, languages, sex, education, board certification, facility type — none returned by anon endpoint. License Symphony to unlock.
  • Provider photo / profile / reviews — out of scope for v1 across all states.

Follow-ups (Linear-tracked, NOT blocking CA launch) ​

  • [ ] /plans CA provider-search → search-ca wiring (decision #7, [ENG-410]) — the one user-facing gap. Point CoveragePanel doctor-search at /api/providers/search-ca for CA so it returns Symphony providerIds that match ingested data. File when /plans CA provider coverage is prioritized.
  • [ ] Symphony license inquiry — IHA Oakland (no ticket filed; file when revenue justifies). Closes decision #3 (NPI bridge), #4 (accepting flag), #2 (per-network) at once.
  • [ ] Per-network refinement (decision #2 upgrade) — depends on license OR CA MRF ingest
  • [ ] CA carrier §1311 MRF ingest — bigger project (~weeks); only if Symphony license declined AND we need NPI/tier data
  • [ ] Multi-address re-crawl — only if user research surfaces a need
  • [ ] CA puf augment (decision #1) — only if star ratings or SBC scenarios become a real ask for CA
  • [x] ~471 SF/Oakland providers from crawl cold-start: RECOVERED 2026-05-28 (idempotent re-ingest). Crawl now has retry-with-backoff (PR #525) so it can't recur.

Verified end-to-end state (apex, 2026-05-28) ​

| Capability | Status |
| --- | --- |
| CA plan pricing (/api/plans) | ✅ 28 real SF plans, CAPS/CAPC math |
| CA subsidy estimate (/api/sbe-estimate) | ✅ federal APTC math |
| CA drugs: search → coverage | ✅ FULL: national RxCUI search + formularies_staging coverage |
| CA provider coverage-check API (/api/providers/covered, state=CA) | ✅ Symphony providerId → Covered/InNetwork |
| Florence provider suggestions (/api/providers/suggest) | ✅ Symphony-backed |
| /plans doctor search-by-name → coverage | ❌ NPPES NPI can't bridge to Symphony providerId (decision #7) |
| Data integrity | ✅ 165,974 CA providers, 17,447 CA drugs; FFM cohort byte-identical (2,145,064) |

Cross-references ​

  • ENG-395 — Phase C/D ingestion (Done)
  • ENG-407 — State-aware route dispatch (Done, PR #510)
  • ENG-408 — CA provider ingest: 165,974 providers + crawl retry (Done, PRs #523/#524/#525)
  • ADR 0004 — Cross-cluster Atlas PrivateLink (why staging holds provider/formulary data)
  • docs/data-sources/ca-phase-c-d-ingestion-playbook.md — full methodology for future per-state replays
  • scripts/db/ingest-ca-providers.cjs — provider crawl/map/ingest/verify CLI (run via Fargate smoke-runner task family)

New York ​

Status (verified 2026-05-28 apex + DB trace): plans + pricing ✅; drug + provider coverage ❌ NOT ingested. Phase C/D ingest is ENG-412.

⚠️ Correction: the prior version of this section claimed NY had "FFM-style provider data inherited from legacy ingest." That was wrong — verified empirically. There is no NY provider or drug coverage data in our collections.

Verified state ​

| Capability | NY status | Evidence |
| --- | --- | --- |
| ZIP → county | ✅ works | 10001 → New York County, 11201 → Kings |
| ZIP → plans + pricing + subsidy | ✅ works | 282 NY 2026 plans; calculateNyEligibility() in owned-plans.ts |
| Doctor search by name → matched plans | ❌ broken | 0 NY providers in providers_staging across all 8 major carriers (Fidelis 25303, Healthfirst 91237, MetroPlus 11177, Excellus/Highmark 78124, MVP 56184, EmblemHealth 88582, CDPHP 94788, Oscar 74289) |
| Medication search → matched plans | ❌ broken | 0 of 282 NY plans cover atorvastatin; NY formularies not ingested. (Only 4,023 stale partial 74289NY* Oscar docs exist, and they don't map to current plans / common drugs. /api/drugs/covered NY Fidelis + Lipitor → NotCovered.) |

Key facts (locked) ​

  1. NY is NPPES-NPI-native — NY State of Health (NYSOH) + NY DFS use the national NPI registry, NOT a Symphony-style internal ID. This is the big advantage over CA: once NY providers are ingested keyed by NPPES NPI (_id: npi, like FFM), the /api/providers/autocomplete (NPPES) → /api/providers/covered join works with no bridge gap. The CA limitation (ENG-410 — NPPES↔Symphony) does NOT apply to NY. NY will be the first fully-complete SBE provider surface.
  2. puf IS populated for NY (unlike CA's empty puf — decision #1 in the CA section). So NY provider-plan mapping could use puf.networkId per-network precision instead of CA's HIOS-prefix coarsening — verify during Phase D.
  3. NY plan IDs use 14-char FFM format (42640NY0320001) — distinguishable from CA's 16-char only by the state field, never plan-ID format. This is exactly why coverage dispatch keys off state (usesOwnedCoverageData, ENG-411), not plan-ID regex.
  4. Coverage dispatch already routes NY correctly — NY is in OWNED_COVERAGE_STATES (ENG-411). No route changes needed; the flow lights up the moment data lands in formularies_staging + providers_staging.
  5. Data sources DIFFER from CA and need a discovery pass (ENG-412 Phase 0): NY carrier formulary PDFs/JSON for drugs (CA carrier-PDF playbook is the template); NYSOH provider-search endpoint (probe for a CalHEERS-style anon endpoint) / NY DFS / NY DOH network-adequacy / per-carrier directories / NY §1311 MRFs for providers. See docs/data-sources/ny-phase-c-d-ingestion-playbook.md (created during ENG-412 Phase 0).
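
Key facts 3 and 4 can be illustrated with a short sketch of state-keyed dispatch. OWNED_COVERAGE_STATES and usesOwnedCoverageData are names from this doc; the set membership, the function body, and the looksLikeFfmPlanId helper are assumptions written only to show why a plan-ID regex cannot do this job.

```typescript
// Membership shown here is an assumption for illustration.
const OWNED_COVERAGE_STATES: ReadonlySet<string> = new Set(["CA", "NY"]);

// Dispatch keys off the state field, never the plan-ID format.
function usesOwnedCoverageData(state: string): boolean {
  return OWNED_COVERAGE_STATES.has(state.trim().toUpperCase());
}

// Hypothetical helper: 5-digit issuer + 2-letter state + 7 digits = 14 chars.
function looksLikeFfmPlanId(planId: string): boolean {
  return /^\d{5}[A-Z]{2}\d{7}$/.test(planId);
}

// An NY plan ID and a PA plan ID are the same format, so the regex
// cannot tell an owned state from any other 14-char HIOS state.
console.log(looksLikeFfmPlanId("42640NY0320001")); // true
console.log(looksLikeFfmPlanId("33709PA1420005")); // true
console.log(looksLikeFfmPlanId("70285CA135000101")); // false (16-char CA format)
console.log(usesOwnedCoverageData("NY")); // true
```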

Follow-up ​

  • ENG-412 — NY Phase C/D ingest (composite: source discovery → drug formularies → provider directory → verify). Same safety covenants as ENG-395/408 (snapshot, FFM+CA byte-identical, cluster guard, collection allowlist, $addToSet, dry-run, Fargate in-VPC).

Pennsylvania ​

Status (Phase A complete + live on apex, 2026-06-02): plans ✅ ingested on BOTH clusters (staging + prod), 291 HIOS plans across 9 RAs, ENG-418 PR #566 + A.4 prod-cluster apply. Drug + provider coverage NOT ingested.

Apex smoke verified with 5 anchor ZIPs covering rating areas 1, 2, 6, 8, and 9 (19103 RA8 / 16501 RA1 / 15834 RA2 / 18015 RA6 / 17101 RA9); real PA plans were returned in every case (Jefferson / UPMC / Geisinger / Capital BC top per RA). Cohort guard: non-PA 2026 unchanged at 4,495 on both clusters.

Phase B (drugs) and Phase C (providers) come next, mirroring the NY ENG-412 sequencing (plans → drugs → providers). Phase A is the pilot for the GetInsured 7-state stack (PA / NJ / IL / VA / NV / NM / ME); the same scraper + normalize + write pattern ports to each sibling state in ~1 day.

Decision 10 (added 2026-06-02, ENG-418 A.4): plans + zip_county must live on the PROD cluster, not just staging. Apex /api/counties + /api/plans resolve via getDb() → MONGODB_URI → askflorence-prod-01.njkihm (the prod M10 cluster), NOT the staging cluster (askflorence-staging.efsikmv — that's the cross-cluster reference cluster for formularies_staging + providers_staging only, per ADR 0004). Initial Phase A.2.2 ingest wrote only to staging because that's what local .env.local points at; apex stayed broken until prod-cluster apply landed. Going forward, every new state ingest writes to PROD first, then mirrors to staging via copy-ca-data-prod-to-staging.cjs pattern. The playbook checklist + cluster-targeting section in sbe-ingestion-playbook.md are the canonical reference now. Prod snapshots for this fix: B 6a1e6c1d085fb664b689eec7 (pre-apply) + C 6a1e6e23923b91f0f761d244 (post-apply).
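
Decision 10 is the kind of mistake a cluster guard can catch before any write happens. A minimal sketch, assuming the ingest script knows which cluster it intends to target: the two host fragments are copied from this doc, while the function names, the Cluster type, and the example.net domain suffix in the usage line are hypothetical.

```typescript
type Cluster = "prod" | "staging";

// Host fragments as written in the doc; full hostnames are not shown there.
const CLUSTER_HOST_FRAGMENT: Record<Cluster, string> = {
  prod: "askflorence-prod-01.njkihm",
  staging: "askflorence-staging.efsikmv",
};

// Pull the host out of a mongodb:// or mongodb+srv:// URI without logging credentials.
function hostOf(mongoUri: string): string {
  const m = /^mongodb(?:\+srv)?:\/\/(?:[^@/]*@)?([^/?]+)/.exec(mongoUri);
  if (!m) throw new Error("Not a MongoDB connection string");
  return m[1];
}

// Throws unless the URI points at the cluster the caller says it wants.
function assertCluster(mongoUri: string, expected: Cluster): void {
  const host = hostOf(mongoUri);
  if (!host.includes(CLUSTER_HOST_FRAGMENT[expected])) {
    throw new Error(`Refusing to write: expected ${expected} cluster, got host ${host}`);
  }
}

// Placeholder URI: the domain suffix and credentials are made up.
assertCluster("mongodb+srv://user:pw@askflorence-prod-01.njkihm.example.net/db", "prod");
```

With a guard like this, a local .env.local pointing at staging would fail fast on a prod-first ingest instead of silently writing to the wrong cluster.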

Decision 11 (added 2026-06-10, ENG-451 NV close-out — founder-flagged): when an issuer→network mapping is ambiguous, or a scraped artifact (MRF / formulary URL) disagrees with the per-plan scrape data, DRIVE THE LIVE STATE EXCHANGE via Chrome MCP before concluding "gap" or "separate network." This is now a required diagnostic step, not optional. The exchange is the source of truth for what an issuer actually sells and which network/directory each plan uses — it outranks the MRF, which can be filed under a parent/reporting-entity HIOS.

Method (worked example — NV Community Care):

  1. Open the SBE's anonymous shop/window-shop tool (e.g. Nevada Health Link → enroll.nevadahealthlink.com), enter dummy applicant data (ZIP + age + income), reach the plan list.
  2. Use the issuer/carrier filter to isolate the carrier in question. Read the plan names it returns and click into a plan → provider directory / find-a-doctor link + formulary link.
  3. Compare against what we ingested. Community Care Health Plan of Nevada (HIOS 11765) filtered to "Battle Born State Plan Anthem Bronze/Gold/Silver" plans whose directory is anthem.com/find-care?alphaprefix=NVD — i.e. it IS Anthem's NV HMO entity on the Anthem find-care network. The Elevance PROVIDERS_NV.json MRF lists providers only under the parent HIOS 60156, so a naive "11765 not in the MRF → separate network → no data" read is wrong; the SBE shows the true mapping.

Corroborating signals already in our scrape (check these too — they often answer it without the MCP): per-plan networkName (Community Care's = "Anthem Battle Born ... Network"), networkURL / providerLink (= anthem.com/find-care...), and the plan name branding ("... Anthem Bronze ..."). Lesson: MRF reporting-entity HIOS ≠ marketed-issuer HIOS; co-branded affiliates file their MRF under the parent. Also: when a scraped doc URL 404s / shows "under construction" (NV Imperial's stale FNAV siteCode 5227003519), check the carrier's OWN live doc page for a fresher URL (8528279638) before declaring it unpublished.

Decision 12 (added 2026-06-10, NV/VA scenario audit; CRITICAL bug class): independent cities must NOT get " County" appended to plans.countiesServed.

Cause: searchOwnedPlans filters by the county name from zip_county, so a plan's countiesServed entry MUST exactly match the zip_county name or that ZIP returns ZERO plans. The plan-build derived names by appending the literal " County" to the FIPS-to-county value. For independent cities (whose FIPS-map value already ends in "City") that produced names like "Virginia Beach City County" and "Carson City County", which do NOT match zip_county's "Virginia Beach City" / "Carson City", so the ZIP shows "no plans available."

Impact: this silently broke all 38 VA independent cities (Virginia Beach, Richmond, Norfolk, Chesapeake, Newport News, Hampton, Alexandria…) + NV Carson City, live on apex.

Fix: only append " County" when the name does not already end in " City" (name.endsWith(" City") ? name : name + " County"), patched in all 6 build-{state}-plans-from-scrape-2026.ts. Data repair: scripts/db/fix-independent-city-counties.cjs (idempotent, cohort-neutral; staging done, prod pending next deploy).

Scope: only states with independent cities are affected (VA, NV; MD-Baltimore / MO-St.Louis if ever owned). Run scripts/audit/nv-scenario-audit.ts (per-RA × household × FPL sweep) + the cross-state county-name mismatch check for every new owned state. NY/CA/PA/NJ/IL verified clean.
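
The fix and the mismatch check from Decision 12 can be sketched in a few lines. The endsWith expression is the one quoted in this doc; the function names and the sample inputs are hypothetical.

```typescript
// Append " County" only when the FIPS-map name is not an independent city.
function countyDisplayName(fipsName: string): string {
  return fipsName.endsWith(" City") ? fipsName : `${fipsName} County`;
}

// Cross-check: every countiesServed entry must exactly match a zip_county name.
function findCountyMismatches(countiesServed: string[], zipCountyNames: ReadonlySet<string>): string[] {
  return countiesServed.filter((name) => !zipCountyNames.has(name));
}

console.log(countyDisplayName("Virginia Beach City")); // Virginia Beach City
console.log(countyDisplayName("Fairfax")); // Fairfax County

// The pre-fix name shows up as a mismatch against zip_county.
const zipCounty = new Set(["Virginia Beach City", "Fairfax County"]);
console.log(findCountyMismatches(["Virginia Beach City County", "Fairfax County"], zipCounty));
// [ 'Virginia Beach City County' ]
```

One caveat worth checking against the real FIPS map: Virginia also has true counties whose names contain "City" (James City County, Charles City County). If the map stores those without the " County" suffix, the endsWith test alone would leave them unsuffixed, so the mismatch check should be run on them explicitly.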

What we know ​

| Attribute | Value | Source |
| --- | --- | --- |
| Marketplace name | Pennie | pennie.com |
| Platform vendor | GetInsured (same stack as IL/NJ/VA/NV/NM/ME/ID/MN/GA; 10 SBE deployments) | GetInsured press release |
| Public shop-and-compare | enroll.pennie.com/hix/preeligibility (Savings Calculator entry) → plan results page (URL pattern TBD, behind /hix/) | Browse-then-register flow per Pennie homepage |
| Plan management authority | Pennsylvania Insurance Department (PID): handles plan certification + rate filings; submissions live at pa.gov/agencies/insurance/posted-filings-reports-company-orders/product-and-rate-filings/aca-health-rate-filings/ | PID 2026 rate release |
| Rating areas | 9 (numbered 1-9, county-aggregated) | PID rate filings reference RAs 1-9 |
| Plan ID format | 14-char FFM-style HIOS (e.g. 33709PA1420005 Highmark) | Highmark SBC PDFs at shop.highmark.com/content/sbcs/2026/... |
| 2026 issuer count | 14 carriers | healthinsurance.org PA 2026 |
| Major carriers | Highmark (multiple legal entities: Benefits Group, Coverage Advantage, BCBS), Independence Blue Cross (QCC), UPMC Health Options + Health Plan, Capital BlueCross, Geisinger Quality Options + Health Plan, Ambetter (rebranded from PA Health & Wellness), Partners, etc. | PID 2026 rate filings |
| State subsidy | None active for 2026: Act 54 of 2024 created the "State Health Insurance Exchange Affordability Program" but it is NOT funded for 2026 | PHLP analysis |
| BHP (Basic Health Program) | None (only NY, MN, OR have BHPs) | n/a |
| CSR for 2026 | Standard federal CSR (100-250% FPL, Silver only), but federal CSR funding eliminated for 2026 → all PA silver plans carry a CSR Defunding Adjustment load (e.g. Partners 1.22x) baked into gross premium | PID 2026 rate releases |
| SLCSP-by-county reference | agency.pennie.com/wp-content/uploads/2025/11/Second-Lowest-Cost-Silver-Plan-by-County-PY-2026.pdf (PHIEA agency partner site) | direct L1 validation source |
| Drug formularies (Phase B, future) | Per-carrier (Highmark Essential Formulary; UPMC, IBC, Capital, Geisinger, Ambetter equivalents); same per-carrier PDF/JSON pattern as NY | Highmark formulary pages at highmark.com/member/.../medical-drug-formulary |
| Provider directories (Phase C, future) | Per-carrier (no statewide CA-Symphony-style centralized directory; no NY-DOH-PNDS-style open dataset confirmed yet, needs separate research pass) | TBD |

Standing decisions (locked at Phase 0) ​

1. PA = federal APTC only — no calculatePaEligibility() needed ​

PA's Act 54 program is unfunded for 2026. There is no state subsidy to layer on. So Phase A.5 wiring needs NO per-state branch in /api/eligibility and NO calculatePaEligibility() helper. PA falls through the FFM proxy branch like any other federal state served from our own DB.

If Act 54 ever gets funded (watch the next PA budget cycle), reopen this — add a calculatePaEligibility() mirroring calculateCaEligibility(), branch /api/eligibility, and update docs/data-sources/state-subsidies.md. Until then, PA is the simplest state to ingest: pricing + CSR variants + done.

2. CSR Defunding Adjustment is already baked into Pennie's posted premiums — do NOT double-apply ​

For 2026 federal CSR cost-reimbursement was eliminated. PA mandated all silver plans (on AND off exchange) carry a CSR defunding load (e.g. Partners 1.22x, varies by carrier). The scrape captures Pennie's displayed premium as-is — the load is already in there. Our APTC math at runtime uses the same SLCSP as Pennie's calculator, so we benefit from the same "silver loading" effect (APTC bumps up across all metals because the benchmark went up). This is exactly the FFM-state behavior under the same federal change — no PA-special logic.

3. NO Standard Benefit Plan mandate — carriers file plans + CSR variants individually ​

Unlike California's CC SBP mandate where every issuer's Silver-94 has IDENTICAL deductible/copay/MOOP, PA does not mandate standard plan designs. Each carrier files Silver + its CSR variants per the AV targets in 45 CFR §156.420, but the actual cost-share values differ per plan. Implications:

  • Each Silver plan needs its OWN scraped CSR variants (94/87/73) — we cannot apply a single canonical PA-SBP cost-share table the way scripts/db/data/sbe-showcase-plans-2026.ts does for CA.
  • Scraper must drive Pennie's "view plan details" / SBC link per Silver plan to capture per-plan CSR cost-share, OR derive from carrier-published SBC PDFs (Highmark already has them at shop.highmark.com/content/sbcs/2026/...).
  • Cleaner fallback: scrape silver gross premium per RA only, then derive CSR variants from the 45 CFR §156.420 AV targets applied to the plan's filed Silver base. Less accurate per-plan but matches the FFM cost-share pattern we already serve. Pick the trade-off in Phase A.1.

4. Plan IDs are 14-char FFM-style HIOS (observed in Highmark SBCs) ​

PA plan IDs look like 33709PA1420005 (5-char issuer + state + plan + variant), identical in format to FFM and NY. Coverage dispatch keys off the state field (usesOwnedCoverageData, ENG-411), not plan-ID regex, so this Just Works once PA is added to OWNED_DATA_STATES. Same precedent as NY key fact #3.

5. GetInsured stack scraper IS the long-term investment, not a PA-specific tool ​

Pennie's /hix/ is GetInsured's white-label SPA. Once we drive it cleanly headless once (PA), the scraper ports to NJ (GetCoveredNJ) / IL (Get Covered Illinois) / VA (Virginia's Insurance Marketplace) / NV (Nevada Health Link) / NM (beWellnm) / ME (CoverME) with mostly theme/selector deltas + per-state anchor-ZIP-per-RA lists. The PA build IS the M1 milestone investment — every other GetInsured state is ~1 day after.

6. NPPES-NPI native (assumption — to verify in Phase C) ​

PA uses CMS QHP certification (PID does plan management; carriers file rate templates with the standard FFM HIOS schema). Provider directories from PA carriers (Highmark / UPMC / IBC etc.) are commercial PPO/HMO networks that publish NPPES-NPI-keyed data via their member portals and via §1311 MRFs. Assumption: when we ingest PA providers (Phase C), they'll be NPPES-NPI-keyed end-to-end (like NY, unlike CA Symphony). No NPI bridge gap expected. Verify during Phase C source-discovery (ENG-419 or equivalent).

Phase 0 acceptance items (already locked above) ​

  • [x] Marketplace identity + tech stack (GetInsured) → enables shared scraper investment
  • [x] Rating area count (9) — exact ZIP→RA mapping deferred to Phase A.0.5 (live PID chart + CMS PA-GRA fetch needed; CMS endpoint timed out during research)
  • [x] State subsidy posture (federal APTC only — no calculatePaEligibility() needed)
  • [x] BHP posture (none)
  • [x] Plan ID format (14-char FFM HIOS — confirmed via Highmark SBC URLs)
  • [x] CSR structure (federal standard, defunding-load already in posted premiums, NO SBP mandate)
  • [x] Drug formulary general approach (per-carrier PDFs — Phase B deferral OK)
  • [x] Provider directory general approach (per-carrier or §1311 MRFs — Phase C deferral OK, NPPES-NPI assumed)

Phase A.0.5 — gate items RESOLVED (2026-06-01, live verification) ​

All four Phase A.0.5 gate items closed in one session via live Chrome-driven verification of enroll.pennie.com.

Resolution 1: County → Rating Area mapping (CMS PA-GRA → superseded by PHIEA SLCSP PDF) ​

CMS PA-GRA page consistently timed out via WebFetch. Better source found: the PHIEA Second-Lowest-Cost-Silver-Plan-by-County-PY-2026.pdf directly encodes county → rating area for all 67 PA counties AND the benchmark SLCSP plan + age-40 premium per county (the L1 validation source).

Extraction: pdftotext -layout + Python parser with a PA county FIPS lookup table. The lookup table eliminates the Centre County wrap risk: the PDF puts that county name on a separate line because Centre is the only PA county with partial-county service-area splits (24 ZIPs in one set, 7 in another; both happen to share the same SLCSP for 2026).

Result: 67/67 counties extracted, all 9 RAs covered, anchor county per RA picked:

| RA | Anchor County | FIPS | Anchor ZIP | SLCSP age 40 | Benchmark Silver |
| --- | --- | --- | --- | --- | --- |
| 1 | Erie | 42049 | TBD | $487.34 | UPMC Advantage Silver |
| 2 | Cameron | 42023 | TBD | $830.19 | Geisinger Marketplace |
| 3 | Lackawanna | 42069 | TBD | $667.87 | Oscar Silver Classic |
| 4 | Allegheny | 42003 | TBD | $487.34 | UPMC Advantage Silver |
| 5 | Cambria | 42021 | TBD | $649.74 | Highmark Direct Blue EPO Premier Silver $0 |
| 6 | Northampton | 42095 | TBD | $544.16 | Jefferson Silver |
| 7 | Lancaster | 42071 | TBD | $593.37 | Ambetter Complete Silver |
| 8 | Philadelphia | 42101 | 19103 | $468.28 | Jefferson Silver |
| 9 | Dauphin | 42043 | TBD | $842.07 | Highmark Direct Blue EPO Silver $6000 |

Anchor ZIP per county picked in Phase A.1 (county FIPS → primary urban ZIP via USPS lookup).

Resolution 2: Anonymous browse on enroll.pennie.com — CONFIRMED end-to-end ​

Drove headless Chrome through the full prescreener → preferences → plan-selection flow with NO login required at any step:

  1. https://enroll.pennie.com/prescreener/ — H1 "Shop for Health Coverage", Login is a separate top-right link not gating the page
  2. After Continue: form schema = Plan Year (select 2026) + ZIP (text) + DOB (mm/dd/yyyy per household member) + tobacco/native/coverage checkboxes + Annual Tax Household Income (text). Add Spouse + Add Dependent buttons for household expansion.
  3. /prescreener/results — returns "Estimated Tax Credit of $X/month" + "Your monthly premium may be about $Y/month" + "The second-lowest cost Silver level health plan (SLCSP) used to calculate the APTC shown above is $Z"
  4. /hix/private/preferences — optional 4-step preferences wizard (providers / facilities / drugs / plan preferences), all skippable with "View Plans"
  5. /hix/private/planselection?insuranceType=HEALTH — actual plan list with 73 plans for the Philadelphia scenario

private/ in the URL is misleading — it is NOT account-gated; it just refers to the post-prescreener private-shopping context. Anon-browse works the full depth of the funnel.

L1 validation hit: Philadelphia ZIP 19103 / single age 40 / $50K income / 2026 plan year → Pennie computed SLCSP = $468/month, PHIEA PDF benchmark = $468.28/month. Delta: < $0.28. APTC math: $50K × 9.96% / 12 = $415/mo cap → APTC = SLCSP − cap = $53/mo — matched exactly.
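The APTC arithmetic in this spot check can be reproduced in a few lines. This is a minimal sketch rather than the production calculator: `monthlyAptc` and its parameter names are hypothetical, and the 9.96% applicable percentage is taken as an input exactly as quoted above, not derived from the FPL sliding scale.

```typescript
// Hypothetical helper: reproduces the L1 spot-check arithmetic only.
function monthlyAptc(
  annualIncome: number,
  applicablePct: number,
  slcspMonthly: number
): { cap: number; aptc: number } {
  // Expected monthly contribution ("cap"): income x applicable % / 12.
  const cap = (annualIncome * applicablePct) / 12;
  // APTC is the benchmark (SLCSP) premium minus the cap, floored at zero.
  const aptc = Math.max(0, slcspMonthly - cap);
  // Whole-dollar rounding mirrors how the prescreener displays amounts.
  return { cap: Math.round(cap), aptc: Math.round(aptc) };
}

// Philadelphia anchor scenario: $50K income, 9.96%, PHIEA benchmark $468.28.
const philly = monthlyAptc(50000, 0.0996, 468.28);
console.log(philly); // { cap: 415, aptc: 53 }
```

The same call with Pennie's rounded SLCSP of $468 also yields $53, which is why the $0.28 benchmark delta does not change the displayed credit.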

Resolution 3: PHIEA SLCSP PDF parsed — 67/67 counties → structured JSON ​

Saved to worktree at .tmp-eng-418/pa-counties-2026.json (gitignored). Will move to scripts/db/data/pa-rating-areas-2026.ts during Phase A.1 build.

Resolution 4: CSR variant approach — SCRAPE via Pennie's JSON API (no SPA scraping needed!) ​

🎯 MAJOR FINDING: Pennie has an undocumented JSON API endpoint that backs the plan list page:

GET /hix/private/getIndividualPlans?_=<random>
Cookie: <session cookie from prescreener flow>
→ application/json (2.3 MB for 73 plans × 145 fields each)

Sample response (Philadelphia anchor scenario, 2026, single 40yo, $50K, household=1):

json
{
  "id": 27365,
  "issuerPlanNumber": "15983PA001000601",  // 14-char HIOS + 2-char CSR variant suffix
  "name": "Focused Silver",
  "level": "SILVER",
  "issuer": "Ambetter Health of Pennsylvania, Inc.",
  "issuerId": "15983",
  "networkType": "HMO",
  "premiumBeforeCredit": 510.82,
  "premiumAfterCredit": 457.82,
  "annualPremiumBeforeCredit": 6129.84,
  "aptc": 53.00,
  "costSharingReductions": 0.00,
  "costSharing": "CS1",                    // CS1=no CSR / CS2=CSR-73 / CS3=CSR-87 / CS4=CSR-94
  "deductible": 6301,
  "intgMediDrugDeductible": 6300,
  "medicalDeductible": null,
  "drugDeductible": null,
  "oopMax": 8400,
  "intgMediDrugOopMax": 8400,
  "qualityRating": 3,
  "overAllQuality": ...,
  "sbcUrl": "https://...",
  "planBrochureUrl": "https://...",
  "providerLink": "https://...",
  "formularyUrl": "https://...",
  "hsa": false,
  "isPuf": true,                            // ← this plan is in CMS PUF data
  "benefitsCoverage": [...],                // full benefits list
  "planCosts": {...},                       // cost-share breakdowns per benefit
  "coinsurance": ...
  // + ~125 more fields including doctors[], facilities[], etc. when preferences are set
}

This makes the Phase A.1 scraper trivial:

  1. Playwright drives the prescreener (4-5 actions: navigate → fill ZIP/DOB/income → Continue → View Plans) to establish session cookies
  2. ONE fetch('/hix/private/getIndividualPlans') call returns ALL plans for that scenario as JSON
  3. To capture per-CSR-variant cost-shares: re-run prescreener at 4 income buckets per anchor ZIP (>250% FPL no CSR, 200-250% CSR-73, 150-200% CSR-87, <150% CSR-94). Pennie returns CSR-adjusted deductible / oopMax / costSharing per income bucket.
  4. Total API calls: 9 RAs × 4 CSR income buckets = 36 API calls + 36 prescreener flows for the full PA Phase A scrape. Estimated wall-time: ~30 min with conservative rate-limiting.
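The bucket boundaries in step 3 and the call count in step 4 can be sketched as a small scenario matrix. The function and type names here are hypothetical; the bucket labels reuse the `no_csr` / `csr_73` / `csr_87` / `csr_94` names the scraper CLI accepts, and the inclusive handling of the 150 / 200 / 250 boundaries is an assumption of this sketch.

```typescript
type CsrBucket = "no_csr" | "csr_73" | "csr_87" | "csr_94";

// Maps income as a percent of FPL to the CSR tier listed in step 3.
// This sketch does not model the sub-100% FPL / Medicaid case.
function csrBucketForFplPercent(pct: number): CsrBucket {
  if (pct <= 150) return "csr_94";
  if (pct <= 200) return "csr_87";
  if (pct <= 250) return "csr_73";
  return "no_csr";
}

interface Scenario {
  ratingArea: number;
  bucket: CsrBucket;
}

// One prescreener flow + one API call per (rating area, CSR bucket) pair.
function buildScenarioMatrix(ratingAreaCount: number): Scenario[] {
  const buckets: CsrBucket[] = ["no_csr", "csr_73", "csr_87", "csr_94"];
  const scenarios: Scenario[] = [];
  for (let ra = 1; ra <= ratingAreaCount; ra++) {
    for (const bucket of buckets) {
      scenarios.push({ ratingArea: ra, bucket: bucket });
    }
  }
  return scenarios;
}

const paScenarios = buildScenarioMatrix(9);
console.log(paScenarios.length); // 36
```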

Carrier coverage verified (Philadelphia RA8 scenario): Ambetter Health of Pennsylvania, Oscar, Highmark Blue Shield, Independence Blue Cross, Jefferson Health Plans, Partners Insurance Company. 73 plans total: 27 Gold, 23 Silver, 23 Bronze. 6 issuers — matches expected RA8 carrier count.

This eliminates two open questions at once:

  • Decision #3 trade-off resolved → SCRAPE per-CSR-variant directly from Pennie (the API returns CSR-adjusted cost-shares per income bucket — same fidelity as puf.csrVariants for FFM).
  • HIOS plan ID question resolved → issuerPlanNumber field returns the full 14-char HIOS + 2-char variant suffix on every plan. No carrier SBC URL parsing needed.
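Splitting `issuerPlanNumber` into its parts is mechanical. The sketch below assumes the 16-character layout annotated in the sample response (5-digit issuer ID, 2-letter state, 7-character product/component, 2-digit variant); `parseIssuerPlanNumber` and `COST_SHARING_LABELS` are hypothetical names, not identifiers from the scraper.

```typescript
interface ParsedPlanNumber {
  hiosPlanId: string; // 14-char HIOS plan ID
  issuerId: string; // first 5 digits
  state: string; // 2-letter state code
  variant: string; // 2-digit CSR variant suffix
}

function parseIssuerPlanNumber(issuerPlanNumber: string): ParsedPlanNumber {
  // 5 digits + 2 letters + 7 alphanumerics (14 chars), then a 2-digit suffix.
  const m = /^(\d{5})([A-Z]{2})([0-9A-Z]{7})(\d{2})$/.exec(issuerPlanNumber);
  if (m === null) {
    throw new Error("Unexpected issuerPlanNumber format: " + issuerPlanNumber);
  }
  return {
    hiosPlanId: m[1] + m[2] + m[3],
    issuerId: m[1],
    state: m[2],
    variant: m[4],
  };
}

// costSharing codes exactly as annotated in the sample response.
const COST_SHARING_LABELS: Record<string, string> = {
  CS1: "no CSR",
  CS2: "CSR-73",
  CS3: "CSR-87",
  CS4: "CSR-94",
};

const sample = parseIssuerPlanNumber("15983PA001000601");
console.log(sample.hiosPlanId, sample.variant); // 15983PA0010006 01
```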

Phase A.1 scraper architecture (now locked) ​

scripts/db/scrape-getinsured-2026.ts (shared GetInsured-stack scraper, PA is first consumer):

For each (state, anchor_ZIP_per_RA × 4 CSR income buckets):
  1. Launch Playwright
  2. navigate /prescreener/
  3. Wait for Continue button
  4. Fill: Plan Year=2026, ZIP, DOB (age 40 = 1986-06-01), income (per CSR bucket)
  5. Click Continue → wait for /prescreener/results
  6. Click Next → wait for /hix/private/preferences
  7. Click View Plans → wait for /hix/private/planselection
  8. fetch('/hix/private/getIndividualPlans') with session cookie
  9. Save raw JSON to .scratch/pa-{ra}-{csr_bucket}-2026.json

Output: 36 JSON files (per-RA × per-income-bucket). Normalize step merges per-HIOS-plan + per-variant into the canonical plans collection doc shape.
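The normalize step can be pictured as a group-by on the 14-character HIOS ID, with each CSR-bucket scrape contributing one variant. This is only a sketch under assumed shapes: `ScrapedPlan`, `PlanDoc`, and `mergeScrapedPlans` are hypothetical and far smaller than the canonical plans doc, and the `csrVariants` keys ("73" / "87" / "94") follow the `csrVariants["94"]` usage elsewhere in this playbook.

```typescript
// Hypothetical, trimmed-down shapes; real scrape rows carry ~145 fields.
interface ScrapedPlan {
  issuerPlanNumber: string; // 14-char HIOS ID + 2-char variant suffix
  name: string;
  costSharing: string; // "CS1" (no CSR) | "CS2" | "CS3" | "CS4"
  deductible: number;
  oopMax: number;
}

interface CostShares {
  deductible: number;
  oopMax: number;
}

interface PlanDoc {
  hiosPlanId: string;
  name: string;
  base: CostShares | null;
  csrVariants: Record<string, CostShares>; // keyed "73" | "87" | "94"
}

const VARIANT_KEY: Record<string, string> = { CS2: "73", CS3: "87", CS4: "94" };

function mergeScrapedPlans(rows: ScrapedPlan[]): PlanDoc[] {
  const byId: Record<string, PlanDoc> = {};
  const order: string[] = []; // preserves first-seen order of plans
  for (const row of rows) {
    const id = row.issuerPlanNumber.slice(0, 14);
    if (byId[id] === undefined) {
      byId[id] = { hiosPlanId: id, name: row.name, base: null, csrVariants: {} };
      order.push(id);
    }
    const shares: CostShares = { deductible: row.deductible, oopMax: row.oopMax };
    const key = VARIANT_KEY[row.costSharing];
    if (key === undefined) {
      byId[id].base = shares; // CS1 (or an unmapped code) becomes the base plan
    } else {
      byId[id].csrVariants[key] = shares;
    }
  }
  return order.map((id) => byId[id]);
}

// First row uses the sample response values; the second row is made up.
const merged = mergeScrapedPlans([
  { issuerPlanNumber: "15983PA001000601", name: "Focused Silver", costSharing: "CS1", deductible: 6301, oopMax: 8400 },
  { issuerPlanNumber: "15983PA001000605", name: "Focused Silver", costSharing: "CS3", deductible: 800, oopMax: 3000 },
]);
console.log(merged.length); // 1
```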

Phase A — additional standing decisions captured during ENG-418 implementation ​

7. Multi-county ZIPs trigger a county-dropdown on Pennie's prescreener (found 2026-06-01) ​

ZIPs on county borders (e.g. 15834 Emporium → Cameron / Elk / Potter; 18015 Bethlehem → Northampton / Lehigh) trigger a "Select your county" dropdown in /prescreener/. Without picking, Continue stays disabled. Scraper handles this via <select id="field-:r12:"> whose options encode county FIPS as values (e.g. Cameron=42023). This is the single biggest cause of pre-fix RA2 + RA6 timeouts — easy to miss without live verification. Sibling GetInsured states (NJ/VA/NV/NM/ME) will likely have the same UI pattern.

8. CSR-94 income bucket blocked by Medicaid-likely interstitial (found 2026-06-01) ​

The CSR-94 silver variant (100-150% FPL band) is impossible to capture via Pennie's prescreener at our anchor scenario (single 40yo). At $20K (128% FPL) Pennie redirects to a Medicaid-likely screen; at $22K (141% FPL) it stays on the marketplace path but still fails to advance past the form Continue click — likely Pennie's UI blocks the flow at very-low single-person income.

Workaround: the CSR-94 actuarial-value target is FEDERALLY STANDARDIZED under 45 CFR §156.420(a)(1) (AV target = 0.94), although each issuer chooses the deductible and copay design that meets it. Approximate from base Silver cost-shares + the AV target, OR capture using a 2-person household at low income (where the Medicaid threshold shifts). Today we ship with CSR-73 + CSR-87 only and document CSR-94 as a planned A.4.1 follow-up. The /plans UI gracefully renders without CSR-94 (falls back to base silver display).
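The FPL percentages quoted for the single-person scenario, and the way the Medicaid cutoff moves with household size, can be checked with a short calculation. This sketch assumes the 2025 HHS poverty guidelines for the 48 contiguous states ($15,650 for one person, $5,500 per additional person), chosen because they reproduce the 128% and 141% figures above, and a 138% Medicaid expansion threshold; the function names are hypothetical.

```typescript
// Assumed: 2025 HHS poverty guidelines, 48 contiguous states.
const FPL_BASE = 15650;
const FPL_PER_ADDITIONAL = 5500;

function fplForHousehold(householdSize: number): number {
  return FPL_BASE + FPL_PER_ADDITIONAL * (householdSize - 1);
}

function fplPercent(annualIncome: number, householdSize: number): number {
  return (annualIncome / fplForHousehold(householdSize)) * 100;
}

// Income at which a household crosses the assumed 138% Medicaid threshold.
function medicaidCutoffIncome(householdSize: number): number {
  return Math.round(fplForHousehold(householdSize) * 1.38);
}

console.log(Math.round(fplPercent(20000, 1))); // 128
console.log(Math.round(fplPercent(22000, 1))); // 141
console.log(medicaidCutoffIncome(1)); // 21597
console.log(medicaidCutoffIncome(2)); // 29187
```

Under these assumptions a 2-person household has a wider dollar window between the Medicaid cutoff and the 150% CSR-94 ceiling, which is the room the 2-person workaround relies on.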

9. zip_county collection needed sbeRedirect cleanup post-apply (found 2026-06-02) ​

Even after PA was removed from STATE_BASED_MARKETPLACES + added to OWNED_DATA_STATES, /api/counties continued to return SBE-redirect for PA ZIPs because the zip_county collection still had sbeRedirect field set on all 2,323 PA ZIPs from a legacy seed. The /api/counties route reads sbeRedirect from each zip_county doc and short-circuits when all docs in a ZIP have it. Cleanup script (one-time):

js
db.zip_county.updateMany({state:'PA'}, {$set:{regionId:<from PA_FIPS_TO_RATING_AREA>}, $unset:{sbeRedirect:''}})

Add to the standard SBE ingest playbook checklist as a Phase A.4 step. Every state moving from SBE-redirect → owned-data needs this cleanup. CA + NY were either ingested without an sbeRedirect-seeded zip_county OR cleaned at the same time.
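Because `regionId` differs per county, the one-line cleanup above is shorthand for one update per county FIPS. A minimal sketch of how those operations might be built is below. The `countyFips` filter field is an assumption about the `zip_county` doc shape, `buildCleanupOps` is a hypothetical helper, and the three-county map is an excerpt of `PA_FIPS_TO_RATING_AREA` taken from the anchor table, not the full 67-county lookup.

```typescript
// Excerpt only: the real PA_FIPS_TO_RATING_AREA covers all 67 counties.
const PA_FIPS_TO_RATING_AREA_SAMPLE: Record<string, number> = {
  "42049": 1, // Erie
  "42023": 2, // Cameron
  "42101": 8, // Philadelphia
};

interface UpdateManyOp {
  updateMany: {
    filter: { state: string; countyFips: string }; // countyFips is assumed
    update: { $set: { regionId: number }; $unset: { sbeRedirect: string } };
  };
}

// Builds one updateMany per county: set regionId, drop the legacy sbeRedirect.
function buildCleanupOps(
  state: string,
  fipsToRa: Record<string, number>
): UpdateManyOp[] {
  return Object.keys(fipsToRa).map((fips) => ({
    updateMany: {
      filter: { state: state, countyFips: fips },
      update: { $set: { regionId: fipsToRa[fips] }, $unset: { sbeRedirect: "" } },
    },
  }));
}

const cleanupOps = buildCleanupOps("PA", PA_FIPS_TO_RATING_AREA_SAMPLE);
console.log(cleanupOps.length); // 3
```

The resulting array has the operation shape that the MongoDB Node.js driver's `bulkWrite` accepts, so it can be dry-run logged before being applied.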

Verified end-to-end state (apex, 2026-06-02) ​

| Capability | Status | Notes |
|---|---|---|
| PA ZIP → county (zip_county) | ✅ 2,323 ZIPs with regionId + no sbeRedirect | All 67 counties mapped via PA_FIPS_TO_RATING_AREA |
| PA plan pricing (/api/plans via owned data) | ✅ 291 HIOS plans across 9 RAs | L1 verified vs PHIEA SLCSP (4 RAs $0.00 exact; all 9 RAs ≤ $1.49) |
| PA subsidy estimate (/api/eligibility FFM passthrough) | ✅ via FFM proxy | No calculatePaEligibility() needed (federal APTC only) |
| PA drugs: search → coverage | 🚧 Phase B: formularies_staging not yet populated for PA | Per-carrier formulary scrape needed (Highmark / UPMC / IBC / Geisinger / Ambetter / Oscar / Capital / Partners / Jefferson). Mirror NY ENG-412 Phase 1 pattern. URLs stored on each plan as puf.urls.formulary for deep-link fallback. |
| PA providers: search → coverage | 🚧 Phase C: providers_staging not yet populated for PA | Source TBD (PA may not have NY-DOH-PNDS equivalent open data). Likely path: per-carrier directory scrape OR §1311 MRF ingest. NPPES-NPI-native expected (decision #6). URLs stored on each plan as puf.urls.providerDirectory for deep-link fallback. |
| FFM + CA + NY byte-identical post-apply | ✅ 4,495 → 4,495 plans (cohort guard PASSED) | Snapshot B 6a1e44f83a4bf6ebedefa3a0 (pre-apply) + Snapshot C 6a1e45b729c719743c525854 (post-apply) |

Phase B + C follow-ups (parallel to NY ENG-412 sequencing) ​

Both phases follow the exact playbook NY established. PA-specific notes:

Phase B — drug formulary ingest (mirror NY ENG-412 Phase 1):

  1. Harvest per-carrier formulary URLs for PA (12 carriers — Ambetter, Oscar, Highmark BCBS / Inc / BCBS, IBC, Jefferson, Partners, Geisinger HP / QO, UPMC, Capital). PA-Phase-A artifact already has puf.urls.formulary per plan — extract dedup'd list.
  2. Parse per-carrier formulary docs (PDF / JSON / web search-tool endpoint per carrier) → drug name / strength / NDC.
  3. Resolve NDC → RxCUI via RxNav.
  4. Write to formularies_staging collection, keyed pa:<rxcui> namespace (matches NY's ny:<rxcui> convention).
  5. Snapshot pre-apply → dry-run → audit 100% → founder-gated --apply.
  6. Apex smoke: PA drug-coverage search → real coverage data (not null).

Phase C — provider directory ingest ✅ COMPLETE 2026-06-05 (ENG-437)

Pre-Phase-C research turned up NO PA DOH NPI open-dataset equivalent to NY PNDS. Ingest went per-carrier across all 9 PA marketplace carriers using 5 distinct patterns documented in sbe-ingestion-playbook.md → "Provider-coverage ingest patterns" section. Locked per-carrier decisions for PA:

| # | Carrier | HIOS | Plans | NPIs | Method (Pattern) | Source |
|---|---|---|---|---|---|---|
| 1 | Ambetter (Centene) | 15983 | 12 | 111,885 | A (§1311 MRF) | api.centene.com/ambetter/reference/cms-data-index.json (national TOC, NPI-keyed) |
| 2 | Highmark BS PA | 79962 | 13 | 92,605 | A (§1311 MRF streaming gz) | mrfdata.hmhs.com/.../highmark-bsp-index.json (Cloudfront-signed; 6 marketplace-relevant files of 268 fetched + per-file plan attribution) |
| 3 | IBX 31609 (QCC core) | 31609 | 8 | 809,576 | A (§1311 MRF, partial 80/525 files) | ihg-dart-edw-mrf-prod-public/qcc/.../index.json. Heavy 17B0/17D0 file families server-throttled at 24-parallel; accepted partial because high NPI overlap meant ~80-90% of unique network captured |
| 4 | IBX 33871 (Indep Admin sister line) | 33871 | 12 | 809,576 (additive) | E (sister-line attribution) | Risk-accepted: providers contract at issuer level, not per product line. Applied 12 plan IDs to all 809k IBX-31609 NPIs via $addToSet |
| 5 | Oscar | 98517 | 10 | 6,491 | A (§1311 MRF, single JSON) | hioscar-cms-tic-us-east-1.s3.amazonaws.com/oscar/.../index.json: 2 reporting_structures for 98517 prefix, 1 actual in-network JSON of 13.6 MB (vs 176 OBH adjudication zips) |
| 6 | UPMC | 16322 | 37 | 9,560 | C (PDF + NPPES fuzzy match) | 4 marketplace network PDFs at upmc.widen.net/view/pdf/... (Partner/Select/Premium/Standard; 88 MB total, 15,848 pages). pdfplumber column-aware extraction → name+county fuzzy match against 329,902-NPI NPPES PA baseline. 77% exact_last_first_county precision |
| 7 | Geisinger HP + Quality Options | 22444 + 75729 | 37 | 6,190 | B (SPA-API direct NPI) | ghpproviders.geisinger.org HealthSparq backend: POST /healthsparq/public/service/v4/search exposes NPI in providerResults[].attributes[] where key=="NPI". Chrome MCP residential session bypasses Radware. Iterate 26-letter + 2-letter prefix sweeps (300-result cap per query) |
| 8 | Jefferson + Partners (shared network) | 93909 + 19702 | 18 | 156,089 | D (SE PA county heuristic) | HealthTrio Connect (jhp.healthtrioconnect.com) is server-rendered HTML, Cloudflare bot-shield + aggressive rate-limiting (7 HTTP 429 in 4 min). PIVOTED to county heuristic: Philadelphia + Bucks + Chester + Montgomery + Delaware + Lehigh + Northampton + Berks = 340 ZIPs |
| 9 | Capital BC | 45127 | 18 | 46,377 | D (Central PA county heuristic) | Public path member-auth gated, HIOS-search-gate rejects HIOS 45127. Heuristic on Dauphin + Lancaster + Lebanon + Cumberland + York + Adams + Franklin + Perry = 226 ZIPs |
| | TOTAL | | 165 | | | |

Cohort guard preserved across all 9 ingests: providers_staging non-pa count = 2,514,054 (drift=0 every apply).
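For Pattern B (the Geisinger row), the two moving parts are the prefix sweep and pulling the NPI out of each result's attribute list. The sketch below is hedged accordingly: the `value` field name on each attribute is an assumption (only `key == "NPI"` is confirmed above), and both function names are hypothetical.

```typescript
interface HsAttribute {
  key: string;
  value: string; // field name assumed; only key == "NPI" is confirmed
}
interface HsProvider {
  attributes: HsAttribute[];
}
interface HsSearchResponse {
  providerResults: HsProvider[];
}

// Collects every well-formed 10-digit NPI attribute from one search response.
function extractNpisFromSearch(res: HsSearchResponse): string[] {
  const npis: string[] = [];
  for (const provider of res.providerResults) {
    for (const attr of provider.attributes) {
      if (attr.key === "NPI" && /^\d{10}$/.test(attr.value)) {
        npis.push(attr.value);
      }
    }
  }
  return npis;
}

// 26 single-letter prefixes followed by all 676 two-letter prefixes.
function prefixSweep(): string[] {
  const letters = "abcdefghijklmnopqrstuvwxyz".split("");
  const prefixes = letters.slice();
  for (const a of letters) {
    for (const b of letters) {
      prefixes.push(a + b);
    }
  }
  return prefixes;
}

console.log(prefixSweep().length); // 702
```

Two-letter prefixes matter because of the 300-result cap per query: any single-letter query that returns exactly 300 results is likely truncated and needs the finer sweep.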

Lessons locked in (use for any future PA-Phase-C-equivalent state):

  1. Always sniff XHR before declaring "NPI not exposed". Geisinger's rendered HTML showed no NPI but their HealthSparq API returned it in JSON. I almost defaulted to fuzzy match. Pattern B should be your second check, BEFORE Pattern C or D.
  2. Vendor-specific patterns repeat across states. HealthSparq (Geisinger PA, NM Presbyterian, several BCBS plans) all share /healthsparq/public/service/v4/search with NPI in attributes. Capture this knowledge once.
  3. HealthTrio Connect is server-rendered HTML, not SPA. Don't waste time looking for JSON API — go straight to Pattern C (HTML scrape + fuzzy match) or Pattern D (county heuristic) based on rate-limit tolerance.
  4. County heuristic over-attribution is acceptable for narrow-network carriers. Jefferson + Capital BC chose D because A/B/C all failed within a reasonable effort budget. Over-attribution rate ~30-40% but the alternative is zero coverage which blocks enrollment journeys entirely. UI mitigates via "verify with carrier" copy.
  5. Sister-line additive (Pattern E) saves hours of duplicate work. IBX 33871 reused 31609's 809k NPIs in a 32-min updateMany. Always check if the carrier has a sister HIOS prefix already ingested.
  6. PDFs use widen.net or similar CDN behind viewer pages. The viewer URL is the public-facing link, but the actual PDF binary is in window.viewerPdfUrl inside the viewer HTML — signed CDN URL with ~24hr expiry. Always grep viewer HTML for viewerPdfUrl.
  7. CMS §1311 MRF gz files are huge but most NPIs cluster in first few MB. Stream + gunzip + early-terminate at "in_network" marker token = 270× speedup. Don't fetch the entire file.
  8. Plan→network mapping is often NULL in PUF. Accept over-attribution to all carrier plans (IBX 33871, UPMC, Jefferson, CapBC all did this). Document the risk in commit message.
  9. Pre/post release-snapshot tags are critical when shipping deploys back-to-back. All 4 deploys this session followed pre-eng-XXX-<carrier>-YYYYMMDD-HHMM → merge → post-eng-XXX-<carrier>-YYYYMMDD-HHMM pattern. Clean rollback points.
  10. Squash-merge causes "DIRTY/CONFLICTING" PR state on the next iteration. Always merge origin/main into the integration branch before opening the next PR; auto-resolve content conflicts via git checkout --ours for state docs.
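Lesson 7 (stream, then stop at the "in_network" marker) can be sketched without any I/O by treating the decompressed stream as an array of text chunks. This is an illustration only: `collectNpisUntilMarker` is a hypothetical name, wiring it to a real HTTP + gunzip stream is left out, and the `"npi": [...]` regex is the same loose pattern described in this playbook rather than a full JSON parse.

```typescript
// Extracts unique 10-digit NPIs from chunked MRF text, stopping at `marker`.
function collectNpisUntilMarker(chunks: string[], marker: string): string[] {
  const seen: Record<string, boolean> = {};
  const npis: string[] = [];
  let buffer = "";
  for (const chunk of chunks) {
    buffer += chunk;
    // If the marker has arrived, only the text before it is of interest.
    const markerAt = buffer.indexOf(marker);
    const done = markerAt !== -1;
    if (done) buffer = buffer.slice(0, markerAt);

    // Only complete "npi": [...] arrays match; partial ones wait for more data.
    const arrayRe = /"npi"\s*:\s*\[([^\]]*)\]/g;
    let lastEnd = 0;
    let m: RegExpExecArray | null;
    while ((m = arrayRe.exec(buffer)) !== null) {
      lastEnd = arrayRe.lastIndex;
      const ids: string[] = m[1].match(/\b\d{10}\b/g) || [];
      for (const id of ids) {
        if (!seen[id]) {
          seen[id] = true;
          npis.push(id);
        }
      }
    }
    if (done) break; // early-terminate: skip the rest of the file
    buffer = buffer.slice(lastEnd); // keep the unparsed tail for the next chunk
  }
  return npis;
}

// Synthetic chunks: an array split across chunks, then a split marker.
const demoNpis = collectNpisUntilMarker(
  [
    '{"provider_references":[{"provider_groups":[{"npi":[1234567890,',
    '1234567891]}]},{"provider_groups":[{"npi":[1234567890]}]}],"in_net',
    'work":[{"npi":[9999999999]}]}',
  ],
  '"in_network"'
);
console.log(demoNpis); // [ '1234567890', '1234567891' ]
```

Carrying the unparsed tail forward is what lets both an NPI array and the marker itself straddle a chunk boundary without being missed.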

Cross-references ​

  • ENG-418 — PA Phase A ingest + GetInsured scraper pilot (this work — Phase A.1 / A.2 / A.3 / A.4 all shipped)
  • docs/data-sources/sbe-ingestion-playbook.md — methodology
  • docs/data-sources/state-subsidies.md — confirms PA is NOT in state-subsidy table
  • scripts/db/scrape-getinsured-2026.ts — the scraper (PA first consumer; NJ/VA/NV/NM/ME inherit)
  • scripts/db/build-pa-plans-from-scrape-2026.ts — per-(HIOS,RA)-input → HIOS-grouped doc-shape normalize
  • scripts/db/write-pa-plans-2026.ts — dry-run/--apply/--rollback ingest to plans collection
  • scripts/db/validate-pa-plans-2026.ts — 4-layer validation (L1 PHIEA SLCSP match, L2 structural, L3 cross-RA spread, L4 federal regression)
  • scripts/db/scrape-covered-ca-2026.ts — the original CA scraper that ENG-418 forked from
  • scripts/db/data/ca-rating-areas-2026.ts — anchor-ZIP-per-RA shape to mirror

New Jersey ​

Marketplace: GetCoveredNJ (enroll.getcovered.nj.gov) — runs on the same GetInsured stack as Pennie. PA ↔ NJ parity is significant.

Status as of 2026-06-05 (ENG-438):

  • Phase A LIVE on apex: 61 plans across 6 carriers via PA-pattern scraper extension + dedicated build/write/validate forks.
  • Phase A.5 (QRS): pending — fork augment-pa-quality-ratings-2026.ts
  • Phase B (drug formularies): research only — see below
  • Phase C (provider directory): not started — same per-carrier discovery as PA Phase C

6 carriers + plan counts (5-digit HIOS issuer prefix)

| HIOS prefix | Carrier | Plans |
|---|---|---|
| 91762 | AmeriHealth Ins Company of NJ | 18 |
| 91661 | Horizon Blue Cross Blue Shield of NJ | 14 |
| 23818 | Oscar Garden State Insurance Corporation | 13 |
| 37777 | UnitedHealthcare Insurance Company | 8 |
| 17970 | WellCare Health Insurance Company of NJ (Centene) | 7 |
| (sub-prefix) | AmeriHealth HMO, Inc. | 1 |

Locked decisions ​

  1. Single statewide rating area (RA1). Per CMS Geographic Rating Areas (45 CFR 156.255). Plan list invariant across all 21 NJ counties. NJ_RATING_AREAS_2026 has 1 entry; NJ_FIPS_TO_RATING_AREA maps every county FIPS → 1.
  2. Federal age curve, no state subsidy line. NJ has the Health Insurance Premium Security Plan (HIPS) reinsurance but it's baked into carrier-filed rates, not a separate premium line like CA's CAPS/CAPC. Treat NJ as federal-style: federal APTC only, single aptc field.
  3. Anchor: ZIP 07102 (Newark / Essex County). Largest NJ city. SLCSP age-40 = $507.91 (computed from 2026 no_csr scrape; NJ has no public PHIEA-style benchmark table).
  4. Form differs from Pennie in one place: single-page (no intro click). GetCoveredNJ's /prescreener/ lands directly on the form. PA has a 2-step flow (intro page → form page). Parameterized via introH1Pattern: null for NJ in STATE_CONFIGS. All other form selectors identical: id="coverage-year-select", label-based ZIP + income, id*="birthdate-picker", id="coverage", data-testid="btn-see-savings" Continue.
  5. CSR-94 bucket needs $22,500 income (not PA's $22K). GetCoveredNJ's prescreener is more aggressive than Pennie about Medicaid redirect — at PA's tuned $22K (146% FPL) the form gates to NJ FamilyCare even though it's well above the 138% expansion threshold. Bumping to $22,500 (149.4% FPL — JUST below the 150% CSR-94 upper bound) clears the gate cleanly. Wired in STATE_CONFIGS.NJ.csr94IncomeOverride = 22500. Verified 2026-06-05: 61 plans, 21/21 non-HSA Silvers populated with csrVariants["94"]. HSA Silvers correctly skip CSRs per ACA's high-deductible requirement. Phase A now consistent with FFM/CA/PA (full CSR matrix). Any future GetInsured-stack state with a similar redirect quirk should set csr94IncomeOverride rather than re-derive.
  6. Eligibility route needs explicit NJ branch. Adding NJ to OWNED_DATA_STATES routes /api/plans correctly (auto-via isOwnedDataState) but /api/eligibility falls through to CMS Marketplace API which doesn't host NJ SBE plans → 502. Add calculateNjEligibility() + NJ branch in eligibility/route.ts mirroring the PA pattern.
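Decisions 4 and 5 both reduce to per-state entries in `STATE_CONFIGS`. The sketch below shows one plausible shape; only `introH1Pattern` and `csr94IncomeOverride` come from this document, while the type, the `_SKETCH` constant, the helper names, the use of a plain string for the H1 pattern, and the $22,000 default (PA's tuned value) are assumptions for illustration.

```typescript
interface StateConfig {
  // Text expected in the intro-page H1, or null when the form is the landing page.
  introH1Pattern: string | null;
  // Optional per-state income used for the csr_94 bucket.
  csr94IncomeOverride?: number;
}

// Assumed default: the income PA's csr_94 bucket was tuned to.
const DEFAULT_CSR94_INCOME = 22000;

const STATE_CONFIGS_SKETCH: Record<string, StateConfig> = {
  PA: { introH1Pattern: "Shop for Health Coverage" },
  NJ: { introH1Pattern: null, csr94IncomeOverride: 22500 },
};

function configFor(state: string): StateConfig {
  const cfg = STATE_CONFIGS_SKETCH[state];
  if (cfg === undefined) throw new Error("No config for state " + state);
  return cfg;
}

// Income to enter on the prescreener for the csr_94 bucket.
function csr94Income(state: string): number {
  const override = configFor(state).csr94IncomeOverride;
  return override !== undefined ? override : DEFAULT_CSR94_INCOME;
}

// True when the scraper must click through an intro page before the form.
function needsIntroClick(state: string): boolean {
  return configFor(state).introH1Pattern !== null;
}

console.log(csr94Income("NJ"), needsIntroClick("NJ")); // 22500 false
```

Keeping the override optional means a new GetInsured-stack state only needs an entry when its Medicaid redirect behaves differently from Pennie's.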

Scraper code-paths to know ​

Same as PA — just route through STATE_CONFIGS["NJ"]:

bash
# Single bucket smoke
npx tsx scripts/db/scrape-getinsured-2026.ts --state NJ --ra 1 --csr no_csr \
  --out scripts/db/data/nj-plans-scraped-2026

# Parallel all buckets (DON'T include csr_94; see #5 above)
for b in no_csr csr_87 csr_73; do
  npx tsx scripts/db/scrape-getinsured-2026.ts --state NJ --ra 1 --csr "$b" --out ... &
done
wait  # block until every background scrape exits before building

# Build → validate → write
npx tsx scripts/db/build-nj-plans-from-scrape-2026.ts
npx tsx scripts/db/validate-nj-plans-2026.ts
WRITE_CONFIRM=yes npx tsx scripts/db/write-nj-plans-2026.ts --apply

zip_county post-write ​

After the plans write succeeds, NJ ZIPs in zip_county still carry the legacy sbeRedirect: {state: "NJ", marketplace: "Get Covered NJ ..."} field set from the 2026-04-30 CMS seed. Until cleared, /api/counties for NJ ZIPs returns the SBE redirect even though plans exist. Run after Phase A.4:

javascript
await coll.updateMany(
  { state: "NJ" },
  { $set: { regionId: 1 }, $unset: { sbeRedirect: "" } }
);

Affects 714 NJ ZIPs. Same pattern applies for any future SBE state graduating from redirect-only to owned-data.

Phase B (drug formularies) — handoff notes ​

mrpuf_issuers_staging only contains CMS-FFM issuers. NJ SBE carriers (Horizon, AmeriHealth, Oscar, UHC, WellCare) are NOT there. Phase B for NJ requires per-carrier §1311 MRF discovery + ingest (PA Phase C patterns A–E in sbe-ingestion-playbook.md).

Predicted carrier → pattern mapping (verify before committing):

  • Horizon BCBS NJ → Pattern A (§1311 MRF). Horizon publishes nationally at horizonblue.com/transparency. Pull index.json, filter to NJ plan IDs (HIOS prefix 91661).
  • AmeriHealth NJ → Pattern A. Parent Independence Health Group (PA's IBX) publishes a national MRF at ibx.com/individuals-and-families/member-resources/transparency-coverage. Filter to NJ HIOS prefix 91762.
  • Oscar Garden State → Pattern A. Oscar's MRF was already used in PA Phase C (HIOS 98517). NJ-side HIOS prefix 23818; same MRF infrastructure expected.
  • UnitedHealthcare → Pattern A. UHC publishes nationally; NJ filtering via HIOS prefix 37777.
  • WellCare / Centene → Pattern A. Centene has unified §1311 MRF for all states (Ambetter brand). HIOS prefix 17970.

Recommend tackling 1–2 carriers per attended session given per-carrier discovery friction.

Phase B post-ingest — REQUIRED for NJ (ENG-425) ​

After each NJ carrier's drug docs land in formularies_staging, run:

bash
node scripts/db/derive-drug-search-index.js --apply

Without this, the newly-ingested NJ-only meds will NOT appear in /plans drug NAME search (the route falls back to CMS autocomplete which doesn't have SBE-only drugs). The derive reads the WHOLE formularies_staging and rebuilds the search read-model with brand/generic strength parity + commonly-covered-form-first ranking — idempotent and additive; --rollback exists.

Sync with main first to pick up the derive script + scripts/db/lib/drug-search-derive.js + scripts/db/lib/rxnav-resolver.cjs (cleaner RxCUI resolution) + the /api/drugs/search route + the db.ts allowlist + infra/atlas/access-matrix.ts (now lists drug_search_index for app_read_staging — 5 collections). If the NJ Phase B PR also edits that user's grants, keep drug_search_index in the merged list and run npx tsx scripts/audit/staging-cluster-drift.ts to confirm the live role matches the merged manifest.

Grouping behavior the derive applies (good to know when NJ carrier meds surface with unusual name formats): groups by (ingredient, form, year) from each doc's true canonical, unit-normalizes mcg→mg, excludes combos, defends alias pollution (brand kept only if it appears on ≥2 rxcuis + searchText is curated, not raw aliases), safe salt-merge, dose-orders strengths, orders each strength's rxcuis broadest-coverage-first. Spot-check NJ ingest results with npx tsx scripts/audit/drug-search-parity.ts.

If a NJ-only drug name has unusual formatting and doesn't surface in the search, that's a derive-pollution issue — not a coverage issue (per-rxcui coverage at /api/drugs/covered is unaffected).

Files (NJ-specific) ​

  • scripts/db/data/nj-rating-areas-2026.ts — single-RA anchor + 21-county FIPS lookup
  • scripts/db/build-nj-plans-from-scrape-2026.ts — forked from PA
  • scripts/db/write-nj-plans-2026.ts — forked from PA (founder-gated)
  • scripts/db/validate-nj-plans-2026.ts — forked from PA (4-layer)
  • scripts/db/scrape-getinsured-2026.ts — parameterized PA+NJ via STATE_CONFIGS
  • apps/web/src/lib/owned-plans.ts — calculateNjEligibility()
  • apps/web/src/app/api/eligibility/route.ts — NJ branch
  • scripts/db/ingest-nj-providers-{centene,amerihealth,uhc,uhc-resume,horizon,oscar}.cjs — per-carrier Phase C ingesters (ENG-438)

Phase B + Phase C ACTUAL outcomes (2026-06-09, ENG-438 COMPLETE) ​

Phase B (drug formularies) and Phase C (provider directories) both shipped 5/5 carriers. Three predictions in the original handoff notes above were wrong — captured below so future state work doesn't repeat the mis-prediction.

Phase B — formularies_staging (staging cluster only, per ADR 0004):

⚠️ HIOS label correction (2026-06-10, ENG-447): earlier revisions of the two tables below had the Horizon↔AmeriHealth HIOS prefixes swapped. Live plans collection truth: 91661 = Horizon BCBS NJ (14 plans), 91762 = AmeriHealth Ins Company of NJ (18 plans), 77606 = AmeriHealth HMO (1 plan). The swap propagated into the Phase C Horizon provider ingester and caused a real shipped data bug — see "Post-ship repair" below. Carrier↔HIOS labels must come from the live plans collection, never from doc tables.

| Carrier | HIOS | RxCUIs | Predicted source | Actual source |
|---|---|---|---|---|
| Centene/Ambetter | 17970 | 4,305 | Pattern A (national MRF) | Pattern A ✓: api.centene.com/ambetter/reference/cms-data-index.json |
| Oscar Garden State | 23818 | 4,014 | Pattern A | Pattern A ✓: hioscar-cms-tic-us-east-1.s3.amazonaws.com/oscar/20260601_oscar_index.json |
| AmeriHealth NJ | 91762 (+77606 HMO) | 3,977 | Pattern A | Pattern A ✓: Independence Health Group MRF |
| Horizon BCBS NJ | 91661 | 4,270 | Pattern A | PDF parse: horizonblue.com/transparency is Incapsula-WAF-blocked at curl level; used published 2026 formulary PDF instead |
| UnitedHealthcare NJ | 37777 | 3,570 | Pattern A | OptumRx 2-token SPA: UHC's xnjdruglist2026 URL redirects to welcome.optumrx.com/.../ClientFormulary?var=GPX526NJ (NJ-coded variant of NY's GPX426NY pattern from ENG-412). Drove Chrome MCP to capture authorization + profile-token headers; replayed POST new.optumrx.com/formulary/drugs-by-alphabet × 26 letters + drug-results for 11,832 names → 18,536 NDCs → 4,890 covered → 4,017 NDCs resolved via RxNav → 3,570 unique RxCUIs |

Phase C — providers_staging (staging cluster only):

| Carrier | HIOS | NPIs | Discovery |
|---|---|---|---|
| Centene/Ambetter | 17970 | 112,161 | Pattern A direct §1311 MRF (predicted correctly) |
| AmeriHealth NJ | 91762 (+77606 HMO) | 978 | Pattern A IHG GCS bucket (storage.googleapis.com/ihg-dart-edw-mrf-prod-public/ahnj/2026-05-01_ahnj_index.json) |
| UnitedHealthcare NJ | 37777 | 106,742 | UHC TIC blobs API discovery. transparency-in-coverage.uhc.com is a Gatsby SPA that fetches an undocumented /api/v1/uhc/blobs/ returning 86,514 per-employer + per-network MRF download URLs (each a pre-signed Azure SAS valid through 2030-02-16). Pattern: drive the SPA in Chrome MCP, watch network tab, capture the blobs URL, then curl directly. The Oxford-Health-Insurance-Inc TOC (15.6 MB, 13 reporting_structures, 16 unique in-network files) is the NJ subsidiary; 6 of 13 RS reference the Metro-Network rate file (387.7 MB gz, 7.5 GB unpacked), which is the NJ marketplace network ("New Jersey Oxford Metro Network", delsys=928, per UHC's own Rally landing). NPIs all live in provider_references at file head; count stabilized at 106,742 within the first 19 MB of stream. |
| Horizon BCBS NJ | 91661 | 95,979 | Sapphire MRF Hub bypass of Incapsula. horizonblue.com TIC pages are Incapsula-WAF-blocked at curl level (identical 1032-byte HTML challenge for every URL). The actual TIC is at the vendor portal horizonblue.sapphiremrfhub.com (Sapphire Digital MRF Hub, owned by Zelis), which is NOT Incapsula-protected. Discovered by driving www.horizonblue.com/transparency-in-coverage via Chrome MCP; the 404'd page DOM contained the outbound link to the Sapphire URL. Both Horizon TOCs (Healthcare Services parent + Healthcare of NJ Inc subsidiary) are EIN-keyed (not HIOS-keyed); marketplace files identified by filename convention: MCEX = Managed Care EXchange, OMT1/OMT2 = Omnia Tier 1/2, Individual / Individual-Small-Group. All 4 selected files share the same provider_references block at file head and reference the same Horizon BCBSNJ master network under different network-product labels. |
| Oscar Garden State | 23818 | 22,658 | TOC-listed file undercounts; sibling file is correct. Oscar's TOC references 20260601-oscar-002-in-network.json (853 KB, only 452 NPIs; a thin slice covering a subset of billing codes). The comprehensive 20260601-oscar-001-in-network.json (92 MB, 22,658 NPIs) exists at the same S3 path but is NOT listed in the TOC. Pattern: when an SBE carrier's TOC-listed file looks improbably small for the state's market size, probe sibling files at the same path. Both 001/002 are at Oscar's own §1311 publication infrastructure, so 001 is defensible as a Pattern A source. Oscar's file uses CMS schema v2.0.0 with indexed provider_references (numeric indices into a top-level array at file tail), but the same "npi": [<10-digit codes>] regex extracts NPIs correctly. |

Post-ship repair (2026-06-10, ENG-447) — Horizon provider plan-id mis-attribution ​

The Layer 5 invariants backfill (ENG-447) found that ingest-nj-providers-horizon.cjs had shipped with NJ_HIOS_PREFIX = "91762" — AmeriHealth's prefix (the doc-table swap above made it into the code). All 95,979 Horizon NPIs were attributed to AmeriHealth's plan IDs: Horizon's 14 plans (91661NJ*) had ZERO provider coverage, AmeriHealth's plans carried Horizon's directory. Secondary bug: ingest-nj-providers-amerihealth.cjs filtered its plan list to ^91762, silently dropping the AmeriHealth HMO sister plan 77606NJ0040066 (zero providers).

Repaired on staging by scripts/db/repair-nj-horizon-plan-attribution.cjs (3 passes: $addToSet correct 91661NJ entries → $pull mis-attributed → $addToSet 77606 on AmeriHealth docs; no inserts/deletes, cohort byte-identical, 0 errors). Post-repair attribution: 91661NJ = 95,979 / 91762NJ = 1,318 / 77606NJ = 1,318. Snapshots: pre 6a29227e764ea3c204b89fa1, post 6a2924c081d7b2013563a5b0. Both ingester constants fixed in the same commit. Drug data (Phase B) was never affected.

Lesson (now enforced by Layer 5): load carrier plan IDs from the live plans collection by issuer-verified HIOS prefix, and cross-check the resulting per-carrier attribution counts against the plans collection's issuer names before --apply. The invariants check (npm run audit:sbe-invariants) now pins these attributions permanently.
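The cross-check in this lesson amounts to refusing to proceed when a HIOS prefix resolves to plans whose issuer name is not the carrier being ingested. A minimal sketch follows; `PlanRow` and `planIdsForCarrier` are hypothetical, and apart from `77606NJ0040066` the plan IDs in the example are made up for illustration.

```typescript
interface PlanRow {
  planId: string;
  issuer: string;
}

// Returns the plan IDs under `hiosPrefix`, but only if every matching plan's
// issuer name matches `issuerPattern` (pass a regex without the `g` flag).
function planIdsForCarrier(
  plans: PlanRow[],
  hiosPrefix: string,
  issuerPattern: RegExp
): string[] {
  const matched = plans.filter(
    (p) => p.planId.slice(0, hiosPrefix.length) === hiosPrefix
  );
  if (matched.length === 0) {
    throw new Error("No plans found for HIOS prefix " + hiosPrefix);
  }
  const wrong = matched.filter((p) => !issuerPattern.test(p.issuer));
  if (wrong.length > 0) {
    throw new Error(
      "Prefix " + hiosPrefix + " belongs to '" + wrong[0].issuer +
        "', not " + String(issuerPattern)
    );
  }
  return matched.map((p) => p.planId);
}

// Stand-in for rows read from the live plans collection.
const livePlans: PlanRow[] = [
  { planId: "91661NJ0000001", issuer: "Horizon Blue Cross Blue Shield of NJ" },
  { planId: "91762NJ0000001", issuer: "AmeriHealth Ins Company of NJ" },
  { planId: "77606NJ0040066", issuer: "AmeriHealth HMO, Inc." },
];

console.log(planIdsForCarrier(livePlans, "91661", /Horizon/)); // [ '91661NJ0000001' ]
```

With this guard, the swapped constant from the Horizon ingester (`"91762"` checked against `/Horizon/`) throws before any write instead of silently attributing providers to AmeriHealth's plans.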

End-state staging totals (2026-06-09):

  • providers_staging: 3,996,317 total NPIs · NJ = 265,562 · cohort guard non-NJ = 3,730,755 (locked across all 3 new ingests, drift 0 each time, errors 0 each time)
  • formularies_staging: 34,035 unique RxCUIs across year 2026 (NJ carriers contribute ~20K across 5 carriers, after dedup-by-rxcui)
  • plans (BOTH staging + prod): 60 NJ marketplace plans across 5 HIOS prefixes (61 including a 1-plan sub-prefix)

M10 staging Atlas ingest watchouts (apply to ALL SBE states going forward):

  • BATCH=250, NOT 1000. First UHC ingest used BATCH=1000 + no per-bulkWrite maxTimeMS; hung at batch 17 (committed 16K NPIs, then sat idle on Mongo I/O for 30+ minutes). Resume script with BATCH=250 + maxTimeMS: 30000 per bulkWrite + writeConcern: {w:1, wtimeout:30000} ran 90K more docs in ~6 min at 262 docs/s steady. Rate drops to ~160 docs/s for carriers that overlap with many prior carriers (each $addToSet on plans[] scans the array).
  • Live progress logging from the script itself. tail -40 post-process buffers all output — invisible until exit. Use process.stdout.write per batch with timestamp + percentage + rate + ETA + running upsert/modify/error counts.
  • Use indexed queries for sanity checks. countDocuments({_id: /^nj:/}) with the _id index returns in seconds; countDocuments({entity_type: 1}) without an entity_type index times out at 25 s on a 4M-doc collection.
  • The entity_type field is unreliable. Ingest scripts default new NPIs to entity_type: 1 (via $setOnInsert) regardless of whether the NPI is actually Type 1 (individual) or Type 2 (organization). For accurate Type 1/2 splits, cross-reference NPPES.
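The first two watchouts (small batches, progress emitted by the script itself) can be sketched as two pure helpers. Names and the exact line format are hypothetical; the per-batch `maxTimeMS` and `writeConcern` options mentioned above would be passed to the driver call that consumes each batch, which is deliberately left out here.

```typescript
// Splits items into fixed-size batches (the last one may be smaller).
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (batchSize <= 0) throw new Error("batchSize must be positive");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// One progress line per batch: timestamp, percentage, rate, ETA, error count.
function progressLine(
  timestamp: string,
  done: number,
  total: number,
  elapsedMs: number,
  errors: number
): string {
  const rate = elapsedMs > 0 ? done / (elapsedMs / 1000) : 0;
  const pct = total > 0 ? (done / total) * 100 : 100;
  const etaSec = rate > 0 ? Math.round((total - done) / rate) : 0;
  return (
    timestamp + " " + done + "/" + total + " (" + pct.toFixed(1) + "%) " +
    rate.toFixed(0) + " docs/s ETA " + etaSec + "s errors=" + errors
  );
}

const demoBatches = toBatches([1, 2, 3, 4, 5], 2);
console.log(demoBatches.length); // 3
console.log(progressLine("2026-06-09T00:00:00Z", 500, 1000, 2000, 0));
// 2026-06-09T00:00:00Z 500/1000 (50.0%) 250 docs/s ETA 2s errors=0
```

Writing each line as soon as its batch completes, rather than piping through a buffering post-processor, is what makes a stalled batch visible while the script is still running.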

Marketing/stats numbers refreshed (2026-06-09, post-NJ Phase C) ​

After Phase C closed, the home page + agents + team page coverage stats were refreshed to reflect actual staging Atlas counts:

| Stat | Old (FFM-only 2026-05) | New (post-NJ) |
| --- | --- | --- |
| Doctors | 1.75M | 3M+ (NPI total × 70% Type 1 NPPES national ratio, conservative round) |
| Medications | 67,000 | 170,000 (RxCUI × 5.19 NDC ratio measured from UHC NJ Phase B real data) |
| Plans | 4,326 | 4,847 (exact) |
| Carriers | 183 (was HIOS prefix count) | 151 (distinct legal-entity plans.issuer field; more honest unit) |
| States | 31 | 34 (added NJ + 2 others) |

Surfaces updated: apps/web/src/app/_home/target-body.ts, apps/web/src/app/agents/page.tsx, apps/web/src/app/team/how-we-work/page.tsx, apps/web/src/app/agents/opengraph-image.tsx, apps/web/src/app/creative-adbundance/page.tsx. Historical pages (recruit letters, update entries, CLAUDE.md "What shipped" log) intentionally preserved as snapshots.


Illinois ​

Marketplace: Get Covered Illinois (getcovered.illinois.gov, enrollment at enroll.getcovered.illinois.gov) — GetInsured stack, same /prescreener/ + /hix/private/getIndividualPlans layout as Pennie/GetCoveredNJ. IL is a brand-new SBE for plan year 2026 (moved off healthcare.gov 2026-01-01).

Status (2026-06-10): ALL PHASES COMPLETE — Phase A on BOTH clusters (271 plans, 13 RAs, 7 carriers) + A.5 QRS (233/271 rated, 86%) + Phase B drugs 7/7 carriers + Phase C providers 7/7 carriers (all on staging per ADR 0004) + IL in OWNED_COVERAGE_STATES + drug-search re-derived + Layer 5 213 checks green. End-to-end smoke: atorvastatin Covered w/ correct tiers on BCBS/Molina/Oscar; provider coverage resolving per-network. Awaits deploy (prod zip_county cleanup is the deploy-time step). Per-carrier fill details in the playbook IL rows; key IL-specific patterns: HCSC aca-json/<st>/index_<st>.json is the §1311 index; Oscar+Cigna scrub SBE states from their national §1311 files (TIC is the fallback); Molina IL drugs = PDF-only (tier legend ≠ CA's); Cigna IL = county heuristic.

Phase A verification record: L1 13/13 RAs SLCSP round-trip Δ$0.00 + APTC-implied cross-check (~$2.5 in 11/13; RA10 $26.01 / RA12 $7.65 — non-EHB benchmark adjustment); L4 wave1 ZERO diffs vs local AND prod apex + calculator-baseline ZERO DIFFS 17/17; Layer 5 207 checks 0 failed across NJ/PA/CA/NY/IL. Snapshots: staging B 6a29945b0eebcb61c72bf795 / C 6a29965c91222ec041a9e70d; prod B 6a29945d66b0b868639cf192 / C 6a29965bbc3a6e4857ab6e07. Cohort guards: non-IL 2026 = 4,847 unchanged on both clusters.

⚠️ zip_county cleanup is split by cluster (deliberate): STAGING cleaned 2026-06-10 (2,073 ZIPs — needed for local verification; stage.askflorence.health degrades gracefully, same IL banner one step later). PROD cleanup is a DEPLOY-TIME step — run WRITE_CONFIRM=yes IL_ZIPCLEANUP_URI=<prod> npx tsx scripts/db/cleanup-il-zip-county-2026.ts --apply together with the code deploy, so apex IL users never hit a no-redirect/no-plans intermediate state (inverse of the PA A.4 lesson: plans data is additive-safe pre-deploy, the redirect flip is NOT).

Wave1-regression lesson (caught here, applies to every future state graduation): the harness's FEDERAL_30 list still contained IL/GA/KY from their FFM era, and its ZIP pool ALSO filters on zip_county's sbeRedirect — so cleaning IL's ZIPs silently injected 2,073 candidates and shifted all 100 seeded scenarios. Fixed by removing IL/GA/KY from the list. Separately, the committed wave1 fixture was STALE since ENG-414 (June 1): 13 non-expansion-state scenarios legitimately changed (coverage-gap → full-subsidy bump) and the fixture was never re-captured (wave1 only runs in preflight --full). Verified identical 13-scenario signature against prod apex pre-recapture, then re-captured; ZERO diffs vs local + prod after. When graduating a state: check the wave1 FEDERAL_30 list, and expect to re-capture deliberately if upstream behavior changed since the last capture.

Phase 0 locked decisions ​

  1. The playbook's old "IL is SBE-FP / already covered via FFM PUF" row was WRONG for 2026. Live DB check: 0 IL plans, all 2,073 IL ZIPs carry sbeRedirect. IL requires the GetInsured scrape. (The claim may have described an earlier plan year.)
  2. 13 rating areas, county-based, unchanged from PY2025. County→RA source: CMS IL-GRA page — 403s in WebFetch but serves fine to curl with a browser UA (same datacenter-blocking class as PA's CMS-GRA timeout). All 102 counties mapped + cross-checked against the IDOI "2026 Analysis of Illinois On-Exchange Plans" PDF (idoi.illinois.gov). RA1 = Cook alone; RA13 = 27 southern counties.
  3. 7 carriers, 285 on-exchange plans statewide per IDOI (down from 11 carriers in 2025 — Aetna ×2, Health Alliance, Quartz exited): BCBS IL (HCSC — only statewide issuer, sole issuer in 63 counties), Ambetter of Illinois (Celtic, HIOS 27833), Cigna IL, Molina, Oscar, UnitedHealthcare, MercyCare (WI-border counties).
  4. No state premium subsidy for 2026 — federal APTC only. No calculateIlEligibility() math needed; whether the eligibility ROUTE needs an IL branch (NJ-style, because CMS API 502s for SBE states) vs FFM passthrough (PA-style) gets verified during Phase A.4.
  5. IL prescreener = PA-style intro page (H1 "Shop for Health Coverage" → Continue → form H1 "Estimated monthly premium for 2026"), with TWO deltas wired into the shared scraper 2026-06-10: (a) no #coverage-year-select on the form — the year-select fill is now feature-detected; (b) the intro H1 renders before the Continue button does, and IL's SPA navigates synchronously on click — clickButtonsByText() (new helper) waits for the button to exist and tolerates execution-context teardown. PA/NJ behavior unchanged.
  6. IL results page shows APTC but NO SLCSP sentence. prescreener.slcspMonthly is null in every IL scrape. SLCSP benchmark (expectedSlcspAge40) is backfilled from the no_csr scrape's 2nd-lowest silver (NJ precedent) + independently cross-checked against the APTC-implied SLCSP (aptc + income × applicable%/12) where APTC > 0.
  7. csr_94 bucket is Medicaid-gated at the default $22K income (like PA/NJ). Resolution via csr94IncomeOverride probing after the main matrix (NJ needed $22,500 = 149.4% FPL).
  8. Phase B head start: 5 IL HIOS prefixes ALREADY have year-2026 drug data in formularies_staging from the national §1311 MRF ingest (ffm_1311_mrf tag): 11574 (4,063 drugs), 27833 Ambetter (4,302), 42529 (3,805), 53882 (4,700), 54322 (4,571). Issuers publish §1311 MRFs regardless of FFM/SBE status, so the May-2026 FFM sweep captured IL plan ids. Phase B = verify these against the scraped 2026 hiosIds + fill the missing carriers, not a from-scratch ingest.
  9. Anchor ZIPs (all verified single-county in zip_county): RA1 60602 Cook · RA2 60085 Lake · RA3 60187 DuPage · RA4 60435 Will · RA5 61101 Winnebago · RA6 61201 Rock Island · RA7 61602 Peoria · RA8 61701 McLean · RA9 61820 Champaign · RA10 62701 Sangamon · RA11 62626 Macoupin · RA12 62220 St. Clair · RA13 62959 Williamson (Carbondale's 62901 is multi-county — avoided).
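Decision 6's benchmark backfill and cross-check can be sketched as below. This is an illustration only: the plan shape (metal, premiumAge40), the function names, and every number are hypothetical, ties between silver plans are not handled specially, and the applicable percentage is passed in rather than derived.

```javascript
// Second-lowest silver premium for one rating area (monthly, age 40).
function secondLowestSilver(plans) {
  const silver = plans
    .filter((p) => p.metal === "Silver")
    .map((p) => p.premiumAge40)
    .sort((a, b) => a - b);
  if (silver.length < 2) throw new Error("need at least two silver plans");
  return silver[1];
}

// APTC-implied benchmark, per the formula in decision 6:
// aptc + income × applicable% / 12.
function aptcImpliedSlcsp({ aptcMonthly, annualIncome, applicablePct }) {
  return aptcMonthly + (annualIncome * applicablePct) / 12;
}

// Made-up scrape results for one rating area.
const scrapedPlans = [
  { metal: "Bronze", premiumAge40: 300.0 },
  { metal: "Silver", premiumAge40: 410.0 },
  { metal: "Silver", premiumAge40: 395.5 },
  { metal: "Silver", premiumAge40: 402.25 },
];
const backfilled = secondLowestSilver(scrapedPlans);
const implied = aptcImpliedSlcsp({
  aptcMonthly: 302.25,
  annualIncome: 30000,
  applicablePct: 0.04,
});
console.log({ backfilled, implied, delta: Math.abs(backfilled - implied) });
```

The two values come from independent parts of the scrape (the plan list versus the APTC figure), which is why agreement between them is useful evidence that the backfilled benchmark is right.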

Files (IL-specific) ​

  • scripts/db/data/il-rating-areas-2026.ts — 13 anchors + 102-county FIPS→RA
  • scripts/db/build-il-plans-from-scrape-2026.ts — forked from NJ; CSR variants now fill per-key across ALL RAs (the NJ/PA first-RA-only guard would lose variants when a later RA supplies a bucket the anchor RA missed)
  • scripts/db/write-il-plans-2026.ts / scripts/db/validate-il-plans-2026.ts — forked from NJ
  • scripts/db/scrape-getinsured-2026.ts — STATE_CONFIGS.IL + the two IL deltas (decision #5)

Virginia ​

Marketplace: Virginia's Insurance Marketplace (marketplace.virginia.gov, enrollment at enroll.marketplace.virginia.gov) — GetInsured stack, SBE since 2024. Fourth GetInsured consumer (PA → NJ → IL → VA).

Status (2026-06-10, ENG-450): ALL PHASES COMPLETE. Phase A on BOTH clusters (69 plans, 12 RAs, 6 carriers, QRS 65/69 = 94%); Phase B drugs 6/6; Phase C providers 6/6 (staging per ADR 0004); VA in OWNED_DATA_STATES + OWNED_COVERAGE_STATES; Layer 5 243 checks green across all six states. Snapshots: staging B 6a29bce5/C 6a29be91/post-BC 6a29da55; prod B 6a29bce6/C 6a29be90. Prod zip_county cleanup is a deploy-time step (cleanup-va-zip-county-2026.ts, IL-pattern).

Locked decisions + carrier map ​

  1. 12 RAs / 133 county-equivalents (CMS VA-GRA via curl; WebFetch 403s). VA independent cities are separate FIPS — our zip_county stores them as "<Name> County"; Radford (51750) + Salem (51775) need name aliases when joining against CMS's "City" naming.
  2. Federal APTC only; CMS API rejects VA → NJ-style eligibility branch via the generalized calculateOwnedSbeEligibility(). ⚠️ Process lesson: a case-sensitive copy of the IL branch left calculateIlEligibility in place — VA briefly computed IL SLCSP locally. The end-to-end smoke caught it; ALWAYS smoke the new state's eligibility against the scrape benchmark before shipping.
  3. csr_94 at $22,500 (same override as NJ/IL). VA's results page DISPLAYS the SLCSP (unlike IL) but with different phrasing than PA — prescreener.slcspMonthly stays null; the APTC-implied cross-check matched ≤$0.49 on 12/12 RAs.
  4. Anthem 2026 drugs have NO machine-readable file. Elevance's FNAV JSON (publish/143/40/drugs.json, found by sweeping the id range for 88380VA) is 2025-only (072-series plan ids). The 2026 product (099-series) exists only as the FNAV-hosted PDF (2026_Select_4_Tier_VA_IND.pdf, all 14 plans share it) → Molina-IL-style PDF parse + RxCUI fan-out → 8,297 RxCUIs.
  5. UHC drugs via OptumRx GPX526VA — the predicted one-char swap from NJ's GPX526NJ. Headless Playwright from a residential IP needed NO Chrome MCP (unlike the NJ session): 2-token capture + drugs-by-alphabet/drug-results replay → 3,604 RxCUIs (NJ parity: 3,570).
  6. Kaiser: per-state §1311 path (healthy.kaiserpermanente.org/content/dam/kporg/data/va/) — drugs current; the provider file is stale at year-2023 (KP stopped refreshing when VA left the FFM). Year-waiver applied (plan ids match the scraped 2026 hiosIds exactly; KP's integrated network is minimal-drift) — 1,901 NPIs. KP's tier vocabulary is TIER-ONE..FOUR — added to the ingester's FFM_TIER_VOCAB after the first apply landed UNCLASSIFIED (pulled by source tag + re-applied; ALWAYS review the dry-run tier list BEFORE applying).
  7. Anthem providers: www22.elevancehealth.com/cms/PROVIDERS_VA.json (per-state files; the index lists FFM states only but the VA file exists unlisted) — 54,383 NPIs, 14/14 plans. Oscar: TIC network 027 → 3,816. Cigna: county heuristic (RAs 7/10/11, 278 ZIPs) → 12,772. Sentara/UHC: ffm-swept (57K/43K).
  8. L3 warns heavily for VA — 8 of 12 RAs share the identical $493.61 benchmark (Anthem prices its benchmark statewide). Genuine, not broken RA partitioning.
  9. Legacy 89242VA* entries (Anthem's prior HIOS) exist in formularies_staging from the old sweep — they don't serve 2026 plans; ignore in counts.
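The Kaiser tier lesson in decision 6 (review the dry-run tier list before applying) can be made mechanical. The sketch below assumes a simple label-to-tier map; TIER_VOCAB and reviewTiers are hypothetical names, and the real FFM_TIER_VOCAB in the ingester may be shaped differently.

```javascript
// Hypothetical vocabulary: raw carrier tier label -> normalized tier number.
const TIER_VOCAB = new Map([
  ["TIER-ONE", 1],
  ["TIER-TWO", 2],
  ["TIER-THREE", 3],
  ["TIER-FOUR", 4],
]);

// Count every tier label in a dry-run batch and list the ones the vocabulary
// does not know, so they are seen before --apply instead of landing UNCLASSIFIED.
function reviewTiers(drugs, vocab = TIER_VOCAB) {
  const seen = new Map();
  for (const { tier } of drugs) {
    const label = String(tier).trim().toUpperCase();
    seen.set(label, (seen.get(label) ?? 0) + 1);
  }
  const unmapped = [...seen.keys()].filter((label) => !vocab.has(label)).sort();
  return { seen, unmapped };
}

const report = reviewTiers([
  { rxcui: "1", tier: "TIER-ONE" },
  { rxcui: "2", tier: "tier-two" },
  { rxcui: "3", tier: "SPECIALTY" },
]);
console.log(report.unmapped);
```

A dry run that ends with a non-empty unmapped list is the signal to extend the vocabulary first.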

Nevada ​

Marketplace: Nevada Health Link (enroll.nevadahealthlink.com) — GetInsured stack, NJ-style direct form (no intro page). Fifth GetInsured consumer.

Status (2026-06-10, ENG-451): Phase A LIVE both clusters (135 plans, 4 RAs, 9 carriers, QRS 106/135). Phase B drugs 9/9 carriers; Phase C providers 7/9. NV in OWNED_DATA_STATES + OWNED_COVERAGE_STATES. Snapshots: staging B 6a29e184/C 6a29e34f/postFill 6a29f036/postDrugFill 6a2a00c7/preCCrefill 6a2a13f1/preImperialRx 6a2a1e23; prod B 6a29e186/C 6a29e34e. Prod zip_county cleanup deploy-time (cleanup-nv-zip-county-2026.ts). Final gaps (2 of 9, evidenced carrier-side): CareSource 35107 providers (SPA API host-blocked + §1311 MRF broken) and Imperial 43314 providers (directory PDF 0 NPIs, online tool empty iframe, no §1311 MRF). All 9 carriers have drugs; 7 have providers. Two earlier "gaps" were diagnostic errors corrected by driving the live SBE (see Decision 11): Community Care 11765 providers (= Anthem Battle Born network) + Imperial 43314 drugs (live FNAV siteCode 8528279638, not the stale scraped one).

Locked decisions + carrier map ​

  1. 4 RAs / 17 counties (CMS NV-GRA via curl; incl. Carson City independent city). RA1 = Clark+Nye (Vegas), RA2 = Washoe (Reno), RA3 = Carson/Douglas/Lyon/Storey, RA4 = 10 rural.
  2. Federal APTC only; csr_94 $22,500 (NJ/IL/VA pattern). NV results page displays SLCSP. calculateNvEligibility() via the generalized helper — VA case-lesson applied (verified the branch calls the NV helper, not IL/VA). L1 + APTC-implied ≤$0.32 on 4/4.
  3. 9 carriers — the batch's most: Health Plan of Nevada 95865 (31 plans, Sierra/UHC HMO), Ambetter/SilverSummit 45142 (30), Anthem 60156 (22), CareSource 35107 (17), Hometown Health 41094 (13), SelectHealth 84445 (10), Community Care 11765 (5), Imperial 43314 (5), Molina 79363 (2).
  4. Coverage reachability: Ambetter + Molina + SelectHealth complete (drugs+providers, ffm-swept). CareSource drugs ffm-swept (providers gap). Anthem 60156 + Community Care 11765 share ONE FNAV PDF (2026_Select_4_Tier_NV_IND.pdf, publisher 143 — same parser as VA Anthem) → 8,560 RxCUIs attributed to the union of both carriers' 27 plans. HPN/Hometown/Imperial: no clean §1311 source (not in MR-PUF; CareSource/HPN §1311 index URLs all 404).
  5. Provider fills found on a second pass (the first pass stopped too early — lesson): Phase C initially shipped 3/9, rationalized as "statewide-HMO heuristic inappropriate." That reasoning is valid for the heuristic, but it masked two carriers with REAL §1311/portal sources that just weren't pursued:
    • Anthem 60156 → elevancehealth.com/cms/PROVIDERS_NV.json (15,053 NPIs, 22/22 plans) — the EXACT proven pattern from VA's PROVIDERS_VA.json, which I'd used an hour earlier and failed to try for NV. Year-stamped 2025 but plan ids == 2026 hiosIds 22/22 → year-waiver (Kaiser-VA precedent).
    • HPN 95865 (largest carrier, 31 plans) → UHC TIC blobs API (transparency-in-coverage.uhc.com/api/v1/uhc/blobs/?searchText=Sierra-Health) → Sierra-Health---Life---Nevada_Insurer_Commercial-HMO in-network file (HPN/Sierra is UHC-owned). 5,491 NPIs from provider_references at head (NJ-Oxford streaming pattern). Now 5/9 providers (the 5 largest carriers by plan count = 100 of 135 plans). The lesson: before declaring a provider gap, exhaust the proven sibling-state patterns (Elevance PROVIDERS_<ST>.json, UHC blobs API, FNAV) — only THEN fall back to "no data." The heuristic-is-wrong point still holds for the genuinely-sourceless statewide HMOs.
  6. Drug fills found on a second push (the user correctly flagged 7/9 wasn't done): HPN 95865 (the LARGEST carrier, 31 plans) drugs via OptumRx ClientFormulary SE42L77 — same 2-token replay as UHC NJ/VA but a ClientFormulary var, not GPX (found on HPN's PDL page; HPN/Sierra is UHC-owned; 4-tier, no Tier 5) → 4,794 RxCUIs. Hometown 41094 drugs via its IFP Exchange formulary PDF (Optum Rx EHB Base template, bare-integer tiers) → 8,792 RxCUIs. NV drugs now 8/9.
  7. Corrected gaps after driving the LIVE SBE (2026-06-10 evening, founder-flagged — see Decision 11 below): two of the three "gaps" above were MY diagnostic errors, caught by going to the actual exchange:
    • Community Care 11765 IS Anthem (NOT a separate network). Filtering Nevada Health Link by issuer "Community Care Health Plan of Nevada" returns Anthem-branded "Battle Born State Plan Anthem Bronze/Gold/Silver" plans; their provider directory is anthem.com/find-care?alphaprefix=NVD — the SAME find-care network as Anthem's own plans. The scrape corroborates (networkName: "Anthem Battle Born ... Network"). So 11765 shares the Anthem find-care network (15,053 NPIs) + Anthem formulary. I had briefly trusted PROVIDERS_NV.json (which files providers ONLY under parent HIOS 60156) and wrongly rolled this back — the SBE is the tie-breaker; a co-branded affiliate's MRF is filed under the PARENT HIOS, so its absence ≠ separate network. Providers → 7/9.
    • Imperial 43314 drugs RECOVERABLE. Its formulary IS published at FormularyNavigator siteCode 8528279638 (linked from exchange.imperialhealthplan.com/nevada/drug-formulary/). The plan-scrape captured a stale siteCode (5227003519 → UnderConstruction.htm) — every prior pass mis-concluded "carrier hasn't published." Scraped via scrape-pa-formulary-navigator.cjs (same MMIT platform as PA Highmark/IBX). Drugs → 9/9. Lesson: when a scraped doc URL 404s/under-constructions, check the carrier's OWN live doc page for a fresher URL before declaring it unpublished.
    • Hometown 41094 providers RECOVERED 2026-06-10 (20,148 NPIs): its LEASED-NETWORK §1311 index pointed at a broken /about-us/ path, but the same single Reno HMO network's in-network file is live at the /documents/Transparency-in-Coverage/ path (the Caesars employer file — NJ-Oxford shared-network pattern). A 404 on ONE index path ≠ no source.
    • Genuinely-remaining gaps (evidenced, fully probed): CareSource 35107 providers (documented least-compliant §1311 payer — own marketplace MRF index 404/000; UHC hosts only CareSource's employee plans; find-a-doctor SPA API is host-permission-blocked to automation), Imperial 43314 providers (directory PDF has 0 NPIs; online tool is an empty iframe; no §1311 MRF). 2 of 9 (22 of 135 plans).
    • Net: drugs 9/9, providers 7/9 — up from the prematurely-reported 8/9 + 6/9, entirely by driving the live exchange to correct two mis-diagnoses.

Cross-state lessons learned (2026-06-09, post-CA + NY + PA + NJ) ​

This section synthesizes patterns across four ingested SBEs. Every SBE state going forward should be checked against these heuristics during Phase 0 (research) BEFORE writing code.

A. Provider-discovery patterns by carrier type ​

| Carrier shape | Examples | Discovery |
| --- | --- | --- |
| National carrier with own TIC SPA | UHC (transparency-in-coverage.uhc.com) | Gatsby/React SPA fronts an undocumented API. Drive in Chrome MCP, watch network tab for the catalogue endpoint (e.g. UHC's /api/v1/uhc/blobs/). Returns Azure SAS-signed URLs valid for years. |
| Vendor-hosted TIC | Horizon → Sapphire MRF Hub (Zelis), some carriers → HealthSparq, BCBS family → various | The www.{carrier}.com TIC page is often Incapsula/Imperva-protected to defeat curl. The vendor portal itself usually isn't. Drive the www. page in Chrome to find the outbound vendor link, then curl the vendor portal directly. |
| Carrier on national §1311 with state-coded var | UHC drugs via OptumRx (GPX426NY → NY, GPX526NJ → NJ, likely GPX[#][STATE] for sibling states) | When you've cracked the carrier in one state, the next state's var is usually a one-char swap. Carry the OptumRx 2-token auth capture pattern over. |
| State-rolled S3 with TOC-listed thin file + bigger sibling | Oscar (oscar-002 853 KB listed in TOC, oscar-001 92 MB at same path with the full data) | When a TOC-listed §1311 file looks improbably small for the state's market size (e.g. <1K NPIs for a state with >10K marketplace members), probe sibling files at the same S3/path prefix. Both files are at the carrier's own §1311 infra so either is defensible. |
| Carrier-direct §1311 MRF | Centene/Ambetter, AmeriHealth via IHG GCS, Highmark via HMHS, Aetna direct | Pattern A as originally documented. These are the easiest. |

B. TOC schema patterns ​

Most §1311 TOCs are EIN-keyed (employer-reporting), not HIOS-keyed. The marketplace plan (HIOS prefix) is reachable via:

  • Reporting structures that group employers sharing a network (reporting_plans[] with plan_id_type: "EIN", in_network_files[] pointing at shared network rate files)
  • Network-level files for "Insurer" reporting structures (Oxford-Health-Insurance-Inc has Metro-Network, Choice-Plus, Core, Freedom-Network, etc.) — the marketplace network is the one matching the state's filename convention (MCEX, Marketplace, Individual, OMT, etc.)
  • Filename heuristics more reliable than schema for Horizon-style EIN-keyed TOCs:
    • MCEX = Managed Care EXchange
    • OMT1, OMT2 = Omnia Tier 1/2 (Horizon's product family for marketplace)
    • Individual / Individual-Small-Group = filters off large group
    • Recent date suffix (2026-05-DD) > older (2022/2023/2024) leftover files
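One way to apply the filename heuristics above is to rank candidate files by marketplace markers first and date suffix second. This is a sketch under assumptions: the marker list only covers the examples named here, the date pattern assumes a YYYY-MM prefix somewhere in the name, and the filenames are invented for illustration.

```javascript
// Markers that suggest a marketplace / individual-market network file.
const MARKETPLACE_MARKERS = [/MCEX/i, /OMT[12]/i, /Marketplace/i, /Individual/i];

function scoreFilename(name) {
  const markerHits = MARKETPLACE_MARKERS.filter((re) => re.test(name)).length;
  const date = name.match(/(20\d{2})-(\d{2})/); // first YYYY-MM in the name
  const recency = date ? Number(date[1]) * 12 + Number(date[2]) : 0;
  return { name, markerHits, recency };
}

// Most marker hits first; ties broken by the most recent date suffix.
function rankCandidates(names) {
  return names
    .map(scoreFilename)
    .sort((a, b) => b.markerHits - a.markerHits || b.recency - a.recency)
    .map((c) => c.name);
}

console.log(
  rankCandidates([
    "2023-07-01_LargeGroup_in-network-rates.json.gz",
    "2024-01-01_MCEX_in-network-rates.json.gz",
    "2026-05-01_MCEX_in-network-rates.json.gz",
  ])
);
```

The ranking only orders candidates for a human to confirm; it does not replace checking the chosen file against the scraped plan IDs.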

C. Streaming + NPI extraction ​

  • NPIs in §1311 in-network rate files almost always live in a provider_references block at the file HEAD (before the "in_network" array). Read until you hit the "in_network" marker, then stop. For a 1 GB gz file, this typically captures all NPIs within the first 1-2 MB of decompressed bytes.
  • Oscar's CMS schema v2.0.0 with indexed provider_references (numeric indices) is the exception. For those files, read the full file; the same "npi": [...digits...] regex still catches the trailing dictionary.
  • Use a sliding-window text buffer + regex match (/"npi"\s*:\s*\[([^\]]*)\]/g). Cap the buffer at ~1 MB to defend against missing-bracket pathological cases.
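The three points above combine into a small chunk-driven extractor. The sketch below is not the production ingester: createNpiExtractor is a hypothetical name, chunks are assumed to arrive as already-decompressed strings, and the stop-at-marker behavior would be turned off for the Oscar-style files that need a full read.

```javascript
const NPI_ARRAY = /"npi"\s*:\s*\[([^\]]*)\]/g;
const MAX_BUFFER = 1024 * 1024; // ~1 MB cap against missing-bracket input

function createNpiExtractor({ stopAtInNetwork = true } = {}) {
  const npis = new Set();
  let buffer = "";
  let done = false;

  // Feed one decompressed text chunk. Returns false once reading can stop.
  function push(chunk) {
    if (done) return false;
    buffer += chunk;
    if (stopAtInNetwork) {
      const stop = buffer.indexOf('"in_network"');
      if (stop !== -1) {
        buffer = buffer.slice(0, stop); // keep only the head of the file
        done = true;
      }
    }
    let consumed = 0;
    let match;
    NPI_ARRAY.lastIndex = 0;
    while ((match = NPI_ARRAY.exec(buffer)) !== null) {
      for (const digits of match[1].match(/\d+/g) ?? []) npis.add(digits);
      consumed = NPI_ARRAY.lastIndex;
    }
    // Keep the unmatched tail so an array split across chunks still matches later.
    buffer = buffer.slice(consumed);
    if (buffer.length > MAX_BUFFER) buffer = buffer.slice(-MAX_BUFFER);
    return !done;
  }

  return { push, npis };
}

// Usage with three chunks that split both an npi array and the marker.
const extractor = createNpiExtractor();
extractor.push('{"provider_references":[{"provider_groups":[{"npi":[1234567890,10987');
extractor.push('65432],"tin":{}}]}],"in_net');
extractor.push('work":[{"npi":[1111111111]}]}');
console.log([...extractor.npis]);
```

Retaining the unmatched tail is what makes the window "sliding": both a half-received array and a half-received "in_network" marker are completed by the next chunk.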

D. Atlas write discipline (this took 30+ minutes to learn the hard way) ​

For per-carrier provider ingest on M10 staging:

| Setting | Wrong (hangs on M10) | Right |
| --- | --- | --- |
| Batch size | 1000 | 250 |
| Per-bulkWrite maxTimeMS | omitted | 30000 ms |
| writeConcern | default (majority, no timeout) | {w: 1, wtimeout: 30000} |
| Logging | end of script | per-batch process.stdout.write with rate + ETA |
| Resume capability | none | --start-line flag so a hung first run can resume cleanly |
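Put together, the "Right" column could look like the loop below. It is a sketch, not the actual ingest script: batches and ingestInBatches are hypothetical names, coll is assumed to be a MongoDB Node.js driver collection (or anything exposing a compatible bulkWrite), and ops is assumed to hold one write operation per input line so that startLine maps directly onto --start-line.

```javascript
// Yield [startIndex, batch] pairs, skipping everything before startLine.
function* batches(ops, batchSize, startLine = 0) {
  for (let i = startLine; i < ops.length; i += batchSize) {
    yield [i, ops.slice(i, i + batchSize)];
  }
}

async function ingestInBatches(coll, ops, { batchSize = 250, startLine = 0 } = {}) {
  const totals = { upserted: 0, modified: 0 };
  for (const [offset, batch] of batches(ops, batchSize, startLine)) {
    let res;
    try {
      res = await coll.bulkWrite(batch, {
        maxTimeMS: 30000, // fail the batch instead of sitting idle
        writeConcern: { w: 1, wtimeout: 30000 },
      });
    } catch (err) {
      // Surface the offset so the next run can pick up where this one stopped.
      throw new Error(
        `bulkWrite failed at line ${offset}; resume with --start-line ${offset}`,
        { cause: err }
      );
    }
    totals.upserted += res.upsertedCount;
    totals.modified += res.modifiedCount;
    process.stdout.write(
      `${new Date().toISOString()} lines ${offset}-${offset + batch.length - 1} ` +
        `upsert=${totals.upserted} modify=${totals.modified}\n`
    );
  }
  return totals;
}
```

Because a failed batch reports its own starting line, a rerun with that value repeats at most one batch, which is safe when the operations are idempotent upserts.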

E. Cluster targeting (ADR 0004 + ADR 0006) ​

  • Plans live on BOTH clusters. New SBE state ingest writes to MONGODB_URI (prod, askflorence-prod-01.njkihm) AND to MONGODB_WRITE_URI (staging, askflorence-staging.efsikmv). The PA Phase A.4 decision (ENG-418, 2026-06-02) is canonical: apex /api/counties + /api/plans resolve via getDb() → prod cluster. Without the prod-cluster mirror, apex returns 502 on the new state's ZIPs even though staging looks fine.
  • providers_staging + formularies_staging live ONLY on staging. Per ADR 0004 cross-cluster Atlas PrivateLink + cost-optimization: prod M10 doesn't carry these collections. All Phase B/C ingest scripts MUST assertSafeCluster() against the staging hostname and exit if the URI points at prod. Apex reads coverage data via PrivateLink from staging cluster transparently.
  • Snapshot before --apply. PA Phase A.4 used Atlas snapshot IDs (e.g. 6a1e6c1d085fb664b689eec7 pre-apply, 6a1e6e23923b91f0f761d244 post-apply) to capture restore points. Every new SBE state's --apply step should be preceded by an Atlas snapshot on the destination cluster (staging for Phase B/C, BOTH clusters for Phase A plans).

F. Cohort + cluster guards in every ingester ​

Every Phase B/C ingester must include these guards. They have prevented real damage twice in this session alone:

```javascript
// Throws unless the URI's host is the staging cluster.
function assertSafeCluster(uri) {
  const host = uri.match(/^mongodb(?:\+srv)?:\/\/[^@]*@([^/?]+)/)?.[1];
  if (!host?.includes("askflorence-staging") && !host?.includes("efsikmv")) {
    throw new Error(`CLUSTER GUARD FAIL: host=${host}`);
  }
}

// Pre-flight + post-flight, both indexed queries. <state> is a placeholder for the
// state's _id prefix; totalBefore/totalAfter are the collection totals read at each point.
const nonStateBefore = totalBefore - await coll.countDocuments({_id: /^<state>:/});
// ... ingest ...
const nonStateAfter  = totalAfter - await coll.countDocuments({_id: /^<state>:/});
const drift = Math.abs(nonStateAfter - nonStateBefore);
if (drift > 50) {
  throw new Error(`COHORT GUARD FAIL: drift=${drift}`);
}
```

The user gate on writes: WRITE_CONFIRM=yes node ingest-...cjs --apply — the env var is the founder gate; the script must refuse --apply without it.
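A minimal sketch of that gate follows. resolveWriteMode is a hypothetical name, and the only behavior taken from the text is that --apply must be refused unless WRITE_CONFIRM=yes is set; defaulting to a dry run otherwise is an assumption.

```javascript
// Decide whether this run may write. Anything short of an explicit
// WRITE_CONFIRM=yes alongside --apply is refused.
function resolveWriteMode(argv, env) {
  if (!argv.includes("--apply")) return "dry-run";
  if (env.WRITE_CONFIRM !== "yes") {
    throw new Error("--apply refused: set WRITE_CONFIRM=yes to confirm the write");
  }
  return "apply";
}

// Typical call site at the top of an ingest script.
const mode = resolveWriteMode(process.argv.slice(2), process.env);
process.stdout.write(`mode=${mode}\n`);
```

Checking the gate before any connection is opened means a forgotten env var fails fast, ahead of the cluster and cohort guards.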

G. Validation discipline (the gap closing now) ​

The existing per-state Playwright specs (calculator-{ut,ca,sbe-nj}-takeover.spec.ts) test the UI flow but do NOT test the depth of the actual ingest data. The gap closes via the SBE invariants framework being built in a follow-on session — see "Per-state ingest invariants" in sbe-ingestion-playbook.md.


Template for new states ​

When you open work on a new SBE state (PA, NJ, MA, WA, CO, CT, MD, etc.), investigate these before writing code:

Data-layer investigation ​

  1. Marketplace system — name + tech stack (CalHEERS / NYSOH / GetCoveredNJ / etc.). Check for an anon SPA endpoint pattern (same technique that surfaced CalHEERS anon endpoints).
  2. Provider directory — is there a statewide centralized one (CA-style Symphony) or per-carrier portals? What's the provider ID system (NPPES NPI, state PIN, internal ID)?
  3. Drug formulary publication format — usually per-carrier PDFs but format varies; check if state mandates a standard layout (CA does via SBP).
  4. Rating areas — how does the state partition its geography for pricing? Some states have 1-2; CA has 19.
  5. Plan ID format — 14-char FFM or state-specific? Document the format so lookupPlanBackend() can dispatch correctly.

Decisions you'll need to make + document here ​

  • Plan-attribution granularity (per-network vs HIOS-prefix vs other)
  • Accepting-status default + rationale
  • Tier-aware copays vs single-tier
  • Whether puf is populated (CA wasn't — affects every downstream consumer)
  • Anon endpoint legal posture + license inquiry path
  • NPI bridge necessity (almost always: no — see CA decision #3 reasoning)

Layer 5 invariants (REQUIRED for every state — ENG-447) ​

  • During Phase C ingest, record 5-10 golden NPIs (spread across carriers) + 5-10 golden RxCUIs (covered by multiple carriers). At the END of Phase C — BEFORE the next state's Phase A — append the state's StateInvariantsConfig block to scripts/audit/sbe-state-invariants.ts (exact plan count, carrier HIOS prefixes from the live plans collection, per-carrier NPI/RxCUI floors ~5-15% under verified actuals), run --capture --state <ST>, review the fixture diff, run npm run audit:sbe-invariants, commit both files. This is what protects YOUR state from the NEXT state's ingest.
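The floor rule (5-15% under verified actuals) can be illustrated as below. The real StateInvariantsConfig shape is not shown here, so floorFor, checkFloors, and the carrier keys are hypothetical; the two NPI counts are the VA Anthem and Kaiser figures recorded earlier in this section.

```javascript
// Floor a verified count by a safety margin (10% here, inside the 5-15% band).
function floorFor(verifiedActual, margin = 0.1) {
  return Math.floor(verifiedActual * (1 - margin));
}

// Compare live per-carrier counts against their floors; return readable failures.
function checkFloors(floors, observed) {
  const failures = [];
  for (const [carrier, floor] of Object.entries(floors)) {
    const count = observed[carrier] ?? 0;
    if (count < floor) failures.push(`${carrier}: ${count} < floor ${floor}`);
  }
  return failures;
}

// Floors pinned right after the state's own Phase C.
const npiFloors = {
  anthem: floorFor(54383),
  kaiser: floorFor(1901),
};
// A later state's ingest that accidentally removed Kaiser NPIs would surface here.
console.log(checkFloors(npiFloors, { anthem: 54383, kaiser: 1200 }));
```

Setting the floor below the verified actual leaves room for legitimate churn while still catching a later ingest that wipes or misattributes a carrier.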

Post-ingest (REQUIRED for every state — ENG-425) ​

  • After the state's formulary docs land in formularies_staging, rebuild the drug-name search read-model: node scripts/db/derive-drug-search-index.js --apply. It re-derives from the WHOLE collection (FFM + all SBE/CA), so the new state's meds become searchable with brand/generic strength parity + commonly-covered-form-first ranking. Search-only; coverage stays per-rxcui. See docs/decisions/2026-05-09-refresh-cadence.md § "Post-ingest: rebuild derived collections".

Per-state Linear tags + files ​

  • File a sibling to docs/data-sources/{state}-phase-c-d-ingestion-playbook.md
  • Add a section to this doc as soon as decisions are made
  • Tag Linear issues [SBE-{state}] for greppability

Last updated: 2026-06-01 — PA Phase 0 + A.0.5 locked (ENG-418 — GetInsured 7-state stack pilot; Pennie /hix/private/getIndividualPlans JSON API discovered, eliminates SPA scraping); CA decisions 1-6 locked (2026-05-28).


AskFlorence Internal Documentation. Not for public distribution.
