
SBE Data Ingestion Playbook ​

Status: Living document. Add a section per state as we onboard.

Last updated: 2026-05-24 (CA Wave 2 / ENG-374 in progress)

Purpose: The canonical "how do we get a state's plan data" reference. When a future year or future state ingestion starts, this is what the session reads first.

State-Based Exchanges (SBEs, also called state-based marketplaces) do NOT file their plan data into the CMS Public Use Files (PUF). The federal PUF covers only the 30 FFM states. For every SBE (CA, PA, NJ, MA, CT, CO, MD, NY, RI, DC, VT, WA, ID, NM, MN, KY, GA, IL, NV, ME, VA), we have to source plan data per state.

This playbook captures the approaches we've used, the validation we run, and the gotchas we hit. Future ingestions reference this and either match an existing pattern or extend the playbook with a new one.

TL;DR — pick your approach by state shape ​

| State property | Best approach | Why |
| --- | --- | --- |
| Community-rated (same price all ages) | NY-style scrape | Simplest. Region × plan. No age curve. |
| Age-rated, rating-area-bound | CA-style scrape | 1 anchor age × N regions; federal age curve does the rest. |
| Federal PUF coverage exists | PUF ingest | Already-shipping pipeline; one schema for whole state grid. |
| No public marketplace tool, only filings | SERFF rate-filing parse | Last resort. High friction. |

Standing rule: prefer the public-facing marketplace shop tool as the source of truth. It's what consumers see, it embeds all subsidy + filter logic, and it's free to scrape. SERFF is fallback when no public tool exists.


The decision tree ​

Does the state have a public shop & compare tool that shows all 2026 plans?
├── Yes → does premium vary by ZIP or only by rating area?
│         ├── ZIP-level                    → scrape per ZIP (NY pattern)
│         └── Rating-area-level            → scrape 1 ZIP per rating area (CA pattern)
└── No  → is the state on federal PUF?
          ├── Yes (one of 30 FFM)          → ingest via scripts/db/ingest-puf-augment.js
          └── No                           → SERFF rate-filing parse (last resort)

Always validate by:
  (1) Subsidy math vs the live state marketplace tool for 1 anchor scenario
  (2) Same-rating-area invariance: two ZIPs in same RA must return identical plans
  (3) Cross-rating-area difference: two ZIPs in different RAs must differ
  (4) Federal regression sweep: 100 federal scenarios stay byte-identical
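The branch logic of the tree above can be sketched as a small pure function. This is only an illustration: the `StateShape` fields, the `Approach` labels, and `pickApproach` are hypothetical names, not identifiers from the repo.

```typescript
// Sketch only: type and function names are hypothetical, not from the repo.
type Approach =
  | "scrape-per-zip" // tree label: NY pattern
  | "scrape-per-rating-area" // tree label: CA pattern
  | "puf-ingest"
  | "serff-parse";

interface StateShape {
  hasPublicShopTool: boolean; // public shop & compare tool showing all plans
  premiumVariesBy: "zip" | "rating-area";
  onFederalPuf: boolean; // one of the FFM states
}

function pickApproach(shape: StateShape): Approach {
  if (shape.hasPublicShopTool) {
    return shape.premiumVariesBy === "zip"
      ? "scrape-per-zip"
      : "scrape-per-rating-area";
  }
  // No public tool: PUF if the state files there, otherwise SERFF as last resort.
  return shape.onFederalPuf ? "puf-ingest" : "serff-parse";
}

const caLikeState: StateShape = {
  hasPublicShopTool: true,
  premiumVariesBy: "rating-area",
  onFederalPuf: false,
};
console.log(pickApproach(caLikeState));
```

The function only selects an approach; the four validation checks listed in the tree still apply regardless of which branch is taken.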

Approach 1: NY-style scrape ​

When to use: community-rated state (one price all ages), small number of plans (< 100 statewide).

Used by: New York 2026 (ENG-211, shipped).

Data shape:

  • NY State of Health (NYSOH) lists every Silver / Gold / Bronze / Platinum / Catastrophic plan with one premium per (plan x rating region) pair. No age multiplier — community-rated.
  • 16 NY rating regions; ~20 plans visible per region; ~320 total plan rows.

Mechanics:

  1. Anchor demographic: any age, any income (community rating makes age irrelevant).
  2. Drive the marketplace tool's plan search for one ZIP per rating region.
  3. Capture: plan name, carrier, metal, premium, deductible, MOOP, copays per plan.
  4. Cross-reference DFS Exhibit 23 (NY's annual rate filing) for the canonical premium values as a backstop.

Output shape: documents in plans collection with state: "NY", regionId: 1..16, no ageRatesByArea table (community-rated means one premium across all ages).

Validation:

  • 11 issuers x 16 regions = 176 SLCSP probes against NYSOH live tool. Result: zero variance.
  • DFS Standard Silver chart (32 plans) all match exactly.
  • See docs/validation/new-york-2026.md for the full pass record.

Approach 2: CA-style scrape (rating areas + age curve) ​

When to use: age-rated state where premium varies by rating area, not by ZIP. This is the most common SBE pattern.

Used by: California 2026 Wave 2 (ENG-374, in progress).

Why this works:

By regulation (CA Insurance Code §10965.9, similar laws in CO/CT/MD/WA/etc.), premium does NOT vary by ZIP within a rating area. So one anchor scrape per RA covers every ZIP in that RA. Federal default age curve (45 CFR 147.102, Appendix A) maps the anchor-age premium to all 65 ages. Most SBEs ban tobacco rating in the individual market, eliminating that variable.

Validated approach (proven 2026-05-24 for CA):

  • Same-RA invariance: ZIP 94102 (Tenderloin SF) vs ZIP 94110 (Mission SF), both in RA4. Both return identical 24 plans, identical $1,341.65/mo subsidy, identical Kaiser cost-share. ✓
  • Cross-RA difference: ZIP 94102 (SF RA4) vs ZIP 90039 (LA RA15). Different subsidy ($1,341.65 vs $848.92), different plan count (24 vs 34), different cheapest carrier (Kaiser vs Molina). ✓
  • Subsidy math ($30K SF couple 35+30): predicted $1,344.62 vs CC live $1,341.65 = $2.97 (0.22%) delta. ✓

Anchor demographic (CA 2026):

  • 1 person, age 21, household income $500,000. This is well above any subsidy threshold, so the tool displays the gross premium with zero subsidy applied, which gives a clean reading.
  • Tobacco N/A (CA bans tobacco rating per Knox-Keene §1357.504).

Mechanics:

  1. Maintain a data/{state}-rating-areas-{year}.ts file with one anchor ZIP per RA.
  2. Playwright scraper drives the public shop tool per anchor ZIP:
    • Fill ZIP, income (high), household size 1, age 21
    • Submit eligibility, click through preference screens (skip provider + Rx)
    • Land on plan list, paginate through all pages (CC uses 10/page MUI Pagination)
    • Extract per plan: carrier, plan name, metal tier, gross monthly premium, deductible, primary care copay, generic Rx copay
  3. Output: structured JSON keyed by regionId, array of plans per RA.
  4. Normalize step (separate script) converts JSON → plans collection doc shape.
  5. Apply federal default age curve to derive all 65 ages from the age-21 anchor.
  6. Apply canonical CSR variants from the state's Standard Benefit Plan Designs (state-mandated, same across all carriers per metal x CSR tier).

Output shape: documents in plans collection with state: "CA", regionId: 1..19, full ageRatesByArea table (derived), csrVariants map (canonical SBP).
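Mechanics step 5 (deriving every age from the anchor premium) is a ratio of age factors. A minimal sketch, assuming the curve is available as an age → factor map: `AgeCurve` and `deriveAgeRates` are hypothetical names, and the two-entry curve in the usage lines is only an excerpt (1.0 at age 21 and 3.0 at age 64 are assumed endpoints; the full table should come from the constants in ca-rating-areas-2026.ts).

```typescript
// Sketch only: names are hypothetical; the real constants live in the repo.
type AgeCurve = ReadonlyMap<number, number>; // age -> rating factor

function deriveAgeRates(
  anchorPremium: number,
  anchorAge: number,
  curve: AgeCurve
): Map<number, number> {
  const anchorFactor = curve.get(anchorAge);
  if (anchorFactor === undefined || anchorFactor <= 0) {
    throw new Error(`No usable curve factor for anchor age ${anchorAge}`);
  }
  const rates = new Map<number, number>();
  curve.forEach((factor, age) => {
    // Scale the anchor premium by the ratio of factors, rounded to cents.
    const premium = (anchorPremium * factor) / anchorFactor;
    rates.set(age, Math.round(premium * 100) / 100);
  });
  return rates;
}

// Excerpt of a curve (assumed endpoints only), with a made-up $400 anchor premium.
const curveExcerpt: AgeCurve = new Map<number, number>([
  [21, 1.0],
  [64, 3.0],
]);
const derived = deriveAgeRates(400, 21, curveExcerpt);
console.log(derived.get(21), derived.get(64));
```

Dividing by the anchor's own factor keeps the helper usable for states whose anchor is not age 21 (PA uses an age-40 anchor).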

Validation:

  • 10-scenario harness across all 19 RAs vs CC Shop & Compare live, target < $10 spread.
  • See Validation methodology below.

Code locations:

  • scripts/db/scrape-covered-ca-2026.ts — Playwright harness (CA 2026 specific)
  • scripts/db/data/ca-rating-areas-2026.ts — 19 RA anchor ZIPs + federal default age curve constants
  • scripts/db/data/ca-plans-scraped-2026.json — raw scrape output

Approach 3: Federal PUF ingest (baseline for 30 FFM states) ​

When to use: state files into CMS Public Use Files (= one of the 30 FFM states + a few that opted in).

Used by: all 30 FFM states (already shipping, see scripts/db/ingest-puf-augment.js).

Mechanics: documented separately at docs/data-sources/puf-data.md. The PUF pipeline is the simplest path when available — one well-defined schema covers the whole state.

Why this does NOT apply to most SBEs: SBE states file with their state marketplace, not CMS. So PUF rows for SBE states are either missing or stale. Confirm via db.plans.countDocuments({state: "X", year: 2026}); if zero, the state isn't on PUF.


Approach 4: SERFF filings (fallback) ​

When to use: state has no public shop tool that exposes all plans (rare; this is mostly a contingency).

Mechanics: SERFF (System for Electronic Rates & Forms Filing) is the NAIC submission portal. Carriers file rate filings + benefit designs there annually. Requires NAIC registration to access bulk download. Filings are PDF + spreadsheet attachments per carrier.

Why it's last resort: parsing dozens of per-carrier filings is high friction. Use scrape approach 1 or 2 whenever a public marketplace tool exists.

Status: not currently used for any AskFlorence state. Document the parse pattern here if/when adopted.


Validation methodology (applies to every state) ​

The validation steps are state-agnostic. Run Layers 1-4 per state per plan year; Layers 5 and 6 are additionally required after every ingest:

Layer 1 — Subsidy math sanity check (1 scenario) ​

Pick a representative subsidy scenario (e.g., couple ages 35 + 30, $30K income, anchor metro ZIP). Drive the state's live shop tool. Capture the displayed monthly subsidy + cheapest Silver plan.

Run the same scenario through our own pipeline. Compare:

| Check | Tolerance |
| --- | --- |
| Monthly subsidy total | < 1% delta (target < $5/mo) |
| Cheapest Silver plan name | Exact match |
| Subsidized "You pay" monthly | < $5 delta |
| CSR tier assigned | Exact match |

If any field exceeds tolerance, fix the formula before scraping more data. The CA Wave 2 baseline scenario landed at 0.22% subsidy delta — that's the bar.
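The tolerance table can be expressed as a single comparison function. A minimal sketch, assuming both our pipeline and the live tool have been reduced to the same four fields: `SubsidyResult`, `Layer1Report`, and `checkLayer1` are hypothetical names, and the plan name, "You pay" value, and CSR tier in the usage lines are placeholders (only the two subsidy figures come from the CA baseline).

```typescript
// Sketch only: names are hypothetical, not from the repo.
interface SubsidyResult {
  monthlySubsidy: number;
  cheapestSilverPlan: string;
  youPayMonthly: number;
  csrTier: string;
}

interface Layer1Report {
  subsidyDelta: number; // absolute dollars
  subsidyDeltaPct: number; // percent of the live value
  failures: string[];
}

function checkLayer1(ours: SubsidyResult, live: SubsidyResult): Layer1Report {
  const subsidyDelta = Math.abs(ours.monthlySubsidy - live.monthlySubsidy);
  const subsidyDeltaPct =
    live.monthlySubsidy === 0
      ? subsidyDelta === 0 ? 0 : Infinity
      : (subsidyDelta / live.monthlySubsidy) * 100;

  const failures: string[] = [];
  if (subsidyDeltaPct >= 1) failures.push("monthly subsidy total");
  if (ours.cheapestSilverPlan !== live.cheapestSilverPlan) failures.push("cheapest Silver plan name");
  if (Math.abs(ours.youPayMonthly - live.youPayMonthly) >= 5) failures.push("subsidized You-pay monthly");
  if (ours.csrTier !== live.csrTier) failures.push("CSR tier assigned");
  return { subsidyDelta, subsidyDeltaPct, failures };
}

// CA baseline subsidy figures; the other field values are placeholders.
const layer1Report = checkLayer1(
  { monthlySubsidy: 1344.62, cheapestSilverPlan: "Example Silver", youPayMonthly: 10, csrTier: "94" },
  { monthlySubsidy: 1341.65, cheapestSilverPlan: "Example Silver", youPayMonthly: 12, csrTier: "94" }
);
console.log(layer1Report.subsidyDeltaPct.toFixed(2), layer1Report.failures);
```

The 1% bound is treated as the hard gate here and the $5/mo figure as a target only; whether the real harness enforces the dollar target is not stated in this playbook.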

Layer 2 — Same-rating-area invariance (2 ZIPs) ​

Pick two ZIPs in the same rating area (different counties if possible, same county otherwise). Drive both through the same scenario. Plans + subsidy must be byte-identical. This confirms the regulatory model: premium is RA-bound, not ZIP-bound.

If two same-RA ZIPs differ, the assumption is broken — either the rating-area mapping table is wrong, or the state actually does ZIP-level pricing (rare but exists). Either way, halt and investigate before continuing.

Layer 3 — Cross-rating-area difference (2 RAs) ​

Pick two ZIPs in different rating areas. They MUST return different plan lists or different premiums or both. Confirms the RA boundary works.
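Layers 2 and 3 are the same comparison with opposite expectations, so one helper can cover both. A minimal sketch, assuming each probe is reduced to a subsidy plus a list of plan rows: `PlanRow`, `ZipResult`, `fingerprint`, and `checkZipPair` are hypothetical names, and the plan IDs and premiums in the usage lines are made up (the ZIPs, rating areas, and subsidies are the CA figures quoted earlier).

```typescript
// Sketch only: names and plan data are hypothetical.
interface PlanRow { planId: string; premium: number; }
interface ZipResult { zip: string; ratingArea: number; subsidy: number; plans: PlanRow[]; }

// Order-independent serialization of everything that must match within an RA.
function fingerprint(r: ZipResult): string {
  const plans = r.plans
    .slice()
    .sort((a, b) => (a.planId < b.planId ? -1 : a.planId > b.planId ? 1 : 0))
    .map((p) => `${p.planId}:${p.premium.toFixed(2)}`);
  return JSON.stringify({ subsidy: r.subsidy.toFixed(2), plans });
}

function checkZipPair(a: ZipResult, b: ZipResult): { ok: boolean; reason: string } {
  const identical = fingerprint(a) === fingerprint(b);
  if (a.ratingArea === b.ratingArea) {
    // Layer 2: same RA must be identical.
    return identical
      ? { ok: true, reason: "same RA, identical output" }
      : { ok: false, reason: "same RA but output differs: halt and investigate" };
  }
  // Layer 3: different RAs must differ.
  return identical
    ? { ok: false, reason: "different RAs but identical output" }
    : { ok: true, reason: "different RAs, output differs" };
}

const sfTenderloin: ZipResult = { zip: "94102", ratingArea: 4, subsidy: 1341.65, plans: [{ planId: "A", premium: 400 }, { planId: "B", premium: 450 }] };
const sfMission: ZipResult = { zip: "94110", ratingArea: 4, subsidy: 1341.65, plans: [{ planId: "B", premium: 450 }, { planId: "A", premium: 400 }] };
const losAngeles: ZipResult = { zip: "90039", ratingArea: 15, subsidy: 848.92, plans: [{ planId: "C", premium: 350 }] };
console.log(checkZipPair(sfTenderloin, sfMission).reason);
console.log(checkZipPair(sfTenderloin, losAngeles).reason);
```

Sorting before comparison is a design choice in this sketch: it treats display order as irrelevant, which is slightly looser than a literal byte-for-byte comparison of the page.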

Layer 4 — Federal regression sweep (Wave 1 federal flow stays byte-identical) ​

Every SBE data change must NOT touch the 30-FFM + NY pipeline. Run scripts/audit/wave1-federal-regression.ts --diff after every commit. Result must be 0 diffs across 100 randomized federal scenarios.

This is the bright-line gate. If the federal sweep finds any diff, revert immediately. Federal flow byte-identical is non-negotiable.
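For illustration, the "0 diffs" gate amounts to comparing serialized scenario outputs before and after a change. This is not the contents of wave1-federal-regression.ts; it is a minimal sketch under the assumption that each scenario's output is captured as a string keyed by a scenario ID, and `diffScenarioOutputs` is a hypothetical name.

```typescript
// Sketch only: not the real regression script; names are hypothetical.
function diffScenarioOutputs(
  baseline: Record<string, string>,
  current: Record<string, string>
): string[] {
  const ids = Object.keys(baseline).concat(Object.keys(current));
  const seen: Record<string, boolean> = {};
  const diffs: string[] = [];
  for (const id of ids) {
    if (seen[id]) continue;
    seen[id] = true;
    // A missing scenario on either side counts as a diff, as does any byte change.
    if (baseline[id] !== current[id]) diffs.push(id);
  }
  return diffs.sort();
}

const before = { "scenario-001": '{"aptc":512.10}', "scenario-002": '{"aptc":0}' };
const after = { "scenario-001": '{"aptc":512.10}', "scenario-002": '{"aptc":0}' };
const federalDiffs = diffScenarioOutputs(before, after);
console.log(federalDiffs.length === 0 ? "federal flow byte-identical" : `REVERT: ${federalDiffs.join(", ")}`);
```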

Layer 5 — Per-state ingest invariants (REQUIRED post-ingest, every state) ​

New as of 2026-06-09 (post-NJ Phase C). Layers 1-4 test calculator/eligibility math but not the depth of the ingested provider + drug data. The existing per-state Playwright specs (calculator-{ut,ca,sbe-nj}-takeover.spec.ts) verify that the UI flow works and carrier names appear, but do NOT verify:

  • Specific known NPIs return the right plans[].plan_id set
  • Per-carrier NPI thresholds (e.g. NJ UHC ≥ 100K — future ingests can't silently drop 50% of NPIs)
  • Per-carrier RxCUI thresholds
  • Cohort guard reaffirmation outside of the ingest scripts themselves
  • Plan count + carrier set invariants per state

SHIPPED 2026-06-10 (ENG-447): scripts/audit/sbe-state-invariants.ts — parameterized, per-state config blocks (NJ + PA + CA + NY backfilled). Wired into npm run preflight -- --full (graceful SKIP when Atlas is unreachable so offline/GHA runs don't false-fail) and standalone as npm run audit:sbe-invariants (runs with --require-db, which ingest sessions MUST use). Golden expectations live in the committed fixture scripts/audit/fixtures/sbe-state-invariants-baseline.json, (re)generated via --capture (calculator-baseline pattern — review the git diff). Measured runtime: ~2 min all states / ~15-45s per --state run on M10 staging.

The backfill immediately caught a real shipped bug (proof of value before the first new-state ingest): NJ Phase C's Horizon provider ingest had used NJ_HIOS_PREFIX = "91762" (AmeriHealth's prefix) instead of 91661, so all 95,979 Horizon NPIs were attributed to AmeriHealth plan IDs — Horizon's 14 plans had ZERO provider coverage and AmeriHealth's plans carried Horizon's directory. Repaired on staging 2026-06-10 by scripts/db/repair-nj-horizon-plan-attribution.cjs (snapshots: pre 6a29227e764ea3c204b89fa1, post 6a2924c081d7b2013563a5b0). See the NJ section of sbe-state-watchouts.md. Moral: carrier↔HIOS labels come from the live plans collection, never from doc tables.

Each state declares:

typescript
{
  state: "NJ",
  expectedPlanCount: 60,
  expectedCarrierHiosPrefixes: ["17970", "91661", "37777", "91762", "23818"],
  expectedNonStateBaseline: 3_730_755,  // cohort guard
  carrierInvariants: [
    { hios: "37777", carrier: "UHC NJ", minNpis: 100_000, minRxCuis: 3_400 },
    { hios: "91661", carrier: "Horizon BCBS NJ", minNpis: 90_000, minRxCuis: 4_000 },
    { hios: "17970", carrier: "Centene/Ambetter", minNpis: 100_000, minRxCuis: 4_000 },
    { hios: "23818", carrier: "Oscar Garden State", minNpis: 20_000, minRxCuis: 3_800 },
    { hios: "91762", carrier: "AmeriHealth NJ", minNpis: 800, minRxCuis: 3_700 },
  ],
  goldenNpis: [
    { npi: "1750636635", expectedPlans: ["37777NJ0100002", "37777NJ0100005", /* ... */] },
    // ... 5-10 per state ...
  ],
  goldenDrugs: [
    { rxcui: 197316, name: "atorvastatin 20 MG", expectedTiersByCarrier: { "nj_uhc_2026_marketplace_formulary": "Tier 1", /* ... */ } },
    // ... 5-10 per state ...
  ],
}
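To show how the per-carrier floors in a block like this might be enforced, here is a minimal sketch. It is not the code in sbe-state-invariants.ts: `CarrierInvariant`, `CarrierActuals`, and `checkCarrierFloors` are hypothetical names, and the actuals in the usage lines are the UHC NJ counts quoted elsewhere in this playbook (106,742 NPIs, 3,570 RxCUIs).

```typescript
// Sketch only: not the real audit script; names are hypothetical.
interface CarrierInvariant {
  hios: string;
  carrier: string;
  minNpis: number;
  minRxCuis: number;
}

interface CarrierActuals {
  npis: number;
  rxCuis: number;
}

function checkCarrierFloors(
  invariants: CarrierInvariant[],
  actualsByHios: Record<string, CarrierActuals | undefined>
): string[] {
  const failures: string[] = [];
  for (const inv of invariants) {
    const actual = actualsByHios[inv.hios];
    if (!actual) {
      failures.push(`${inv.carrier} (${inv.hios}): no ingested data found`);
      continue;
    }
    if (actual.npis < inv.minNpis) {
      failures.push(`${inv.carrier} (${inv.hios}): ${actual.npis} NPIs < floor ${inv.minNpis}`);
    }
    if (actual.rxCuis < inv.minRxCuis) {
      failures.push(`${inv.carrier} (${inv.hios}): ${actual.rxCuis} RxCUIs < floor ${inv.minRxCuis}`);
    }
  }
  return failures;
}

const uhcFloor: CarrierInvariant = { hios: "37777", carrier: "UHC NJ", minNpis: 100_000, minRxCuis: 3_400 };
console.log(checkCarrierFloors([uhcFloor], { "37777": { npis: 106_742, rxCuis: 3_570 } }));
// A re-ingest that silently drops half the NPIs trips the floor:
console.log(checkCarrierFloors([uhcFloor], { "37777": { npis: 53_371, rxCuis: 3_570 } }));
```

Keying the actuals by HIOS prefix rather than by carrier name follows the moral above: the prefix is the identifier that ties a directory to plan IDs.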

Backfill status (2026-06-10, ENG-447):

  1. ✅ NJ — captured post-repair (265,562 NPIs / 11,373 state RxCUI docs; per-carrier floors + 8 golden NPIs + 6 golden drugs)
  2. ✅ CA — captured (165,974 Symphony providers / 5,447 state RxCUI docs; 11 carriers)
  3. ✅ PA — captured (1,216,701 NPIs / 30,881 state RxCUI docs; 13 carriers — 33709 + 79279 Highmark entities are drug-floor-only, their provider coverage is the accepted ENG-437 gap)
  4. ✅ NY — bonus backfill (203,016 NPIs / 7,349 state RxCUI docs; note: Healthfirst 91237 has providers but ZERO formulary docs — known ENG-412 gap, no rx floor set)
  5. ✅ Wired into preflight --full + npm run audit:sbe-invariants; any threshold miss fails the run
  6. ✅ "Template for new states" in sbe-state-watchouts.md updated — recording golden NPIs + RxCUIs is now a REQUIRED Phase C capture step

Adding a new state's block: at the END of its Phase C (before the next state's Phase A), append a StateInvariantsConfig to STATE_INVARIANTS (exact plan count, every carrier HIOS prefix from the live plans collection, per-carrier floors ~5-15% under verified actuals, 5-10 golden NPIs across carriers, 5-10 golden RxCUIs), then --capture --state <ST>, review the fixture diff, run a plain check, commit both files.

Layer 6 — Scenario sweep + live-exchange parity (REQUIRED post-ingest, every state) ​

New as of 2026-06-10 (NV). scripts/audit/sbe-scenario-audit.ts (npm run audit:sbe-scenarios -- --state XX [--parity]) drives the REAL fetchPlansForHousehold pipeline against a dev server for a grid of anchor ZIPs (one per rating area) × household types × FPL bands, and:

  • Scenario sweep flags: no plans above 138% FPL, missing Medicaid bump below it, realPrice > sticker, negative price, and SBE-redirect when the state should be owned. This is the check that caught the independent-city bug (Decision 12) — a whole rating area / 38 VA cities returning zero plans. Run it for every RA, not just the demo ZIP.
  • --parity replays exchange-captured scenarios from scripts/audit/fixtures/sbe-parity-baseline.json and asserts:
    1. 138-400% FPL → our APTC within tolerance of the LIVE exchange (abs $12 OR 4%, whichever looser — SLCSP plan-selection + FPL-rounding differ slightly). NV verified: ours $293.34 vs Nevada Health Link $298 (Δ$4.66).
    2. >400% FPL → no subsidy on both (NV: ours $0; exchange "no tax credit, ~$478/mo").
    3. <138% FPL → INTENTIONAL divergence: expansion-state exchanges route to Medicaid (no marketplace subsidy); WE bump to 138% and show the federal estimate (NV: ours $436 + plans). This is by design — do NOT "fix" it to match the exchange.

Capturing the parity baseline (Decision 11 method): drive the live exchange's anonymous shop via Chrome MCP (e.g. Nevada Health Link → enroll.nevadahealthlink.com/prescreener → "Browse for health & dental plans" → enter dummy ZIP + DOB + income), read the APTC / SLCSP / Medicaid result for ~3 scenarios spanning the bands, and add a states.<ST> block to sbe-parity-baseline.json. Recapture on plan-year change.
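The three parity rules can be sketched as a band classifier plus a tolerance check. This is an illustration, not the logic in sbe-scenario-audit.ts: the function names are hypothetical, the 4% is assumed to be taken relative to the exchange's figure, a `null` exchange APTC stands for "exchange shows no tax credit", and the below-138% rule is assumed to mean only that we return a positive estimate.

```typescript
// Sketch only: names and the exact assertions are assumptions.
type FplBand = "below-138" | "138-400" | "above-400";

function fplBand(fplPct: number): FplBand {
  if (fplPct < 138) return "below-138";
  if (fplPct <= 400) return "138-400";
  return "above-400";
}

// abs $12 OR 4%, whichever is looser.
function aptcWithinTolerance(ours: number, exchange: number): boolean {
  const tolerance = Math.max(12, 0.04 * exchange);
  return Math.abs(ours - exchange) <= tolerance;
}

function parityOk(fplPct: number, oursAptc: number, exchangeAptc: number | null): boolean {
  const band = fplBand(fplPct);
  if (band === "138-400") {
    return exchangeAptc !== null && aptcWithinTolerance(oursAptc, exchangeAptc);
  }
  if (band === "above-400") {
    return oursAptc === 0 && (exchangeAptc === null || exchangeAptc === 0);
  }
  // below-138: intentional divergence. The exchange routes to Medicaid,
  // while we bump to 138% and show an estimate, so no comparison is made.
  return oursAptc > 0;
}

// NV figures from above; the FPL percentages are made-up placeholders.
console.log(parityOk(200, 293.34, 298));
console.log(parityOk(450, 0, null));
console.log(parityOk(100, 436, null));
```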

SBE audit suite + test philosophy ​

The three layers below ARE the SBE regression suite. Do NOT run all states in CI (overkill); instead run the suite for every state touched in an SBE addition — adding plans, drugs, providers, or adjusting any of them means re-running these for that state (and its siblings if the change is shared, e.g. a formulary shared across carriers):

| Layer | Command | What it guards |
| --- | --- | --- |
| 5 — data depth | npm run audit:sbe-invariants (or --state XX) | per-carrier NPI/RxCUI floors, golden NPIs/drugs, cohort floor (re-ingest can't silently drop coverage) |
| 6 — runtime + parity | npm run audit:sbe-scenarios -- --state XX --parity | plans return for every RA + household + FPL band; APTC matches the live exchange above 138% FPL; Medicaid-gap divergence intact |
| Playwright | npm run test:e2e (per-state calculator-*-takeover.spec.ts) | end-to-end UI flow for the state |

Plus Layer 4 (wave1-federal-regression --diff, 0 diffs) after every change — the federal/NY flow must stay byte-identical. All layers need a dev server on BASELINE_BASE pointed at the cluster holding the data (formularies_staging + providers_staging are STAGING-cluster only per ADR 0004 — never write provider/drug data to prod).


State-by-state log ​

New York ​

| Year | Approach | Status | Notes |
| --- | --- | --- | --- |
| 2026 | NY-style scrape (NYSOH) + DFS Exhibit 23 backstop | Shipped (ENG-211) | Community-rated, 16 regions, ~320 plan rows. Validation: new-york-2026.md. Zero variance vs NYSOH live tool. |

California ​

| Year | Approach | Status | Notes |
| --- | --- | --- | --- |
| 2026 Wave 1 | Hand-curated showcase (3-4 plans per state in CA/PA/NJ) | Shipped (ENG-373) | Sourced from CC 2026 SBP designs PDF. Estimate-only surface; not in plans collection. Wave 2 replaces this for CA. |
| 2026 Wave 2 | CA-style scrape (Approach 2), CC Shop & Compare | In progress (ENG-374) | 19 RAs, age 21 anchor, $500K income, household 1. POC validated against $30K SF couple scenario at 0.22% delta. Code: scripts/db/scrape-covered-ca-2026.ts. |

Pennsylvania ​

| Year | Phase | Status | Notes |
| --- | --- | --- | --- |
| 2026 Phase A — plans | GetInsured-stack scrape (Approach 2 variant), Pennie /hix/private/getIndividualPlans JSON API | Shipped + live on apex (ENG-418 PR #566 + A.4 prod-cluster apply 2026-06-02) | 9 RAs, age 40 anchor, single-person household, 4 income buckets (no_csr / csr_73 / csr_87 / csr_94). 291 unique HIOS plans across 12 carriers (Jefferson / UPMC / Geisinger / IBC / Highmark family / Capital BC / Ambetter / Oscar / Partners). L1 validation: 4 RAs $0.00 exact vs PHIEA SLCSP PDF, all 9 RAs ≤ $1.49. Both clusters populated — staging snapshots B 6a1e44f83a4bf6ebedefa3a0 + C 6a1e45b729c719743c525854; prod snapshots B 6a1e6c1d085fb664b689eec7 + C 6a1e6e23923b91f0f761d244. Apex smoke verified 2026-06-02: ZIP 19103 → 73 plans (Jefferson top), 16501 → 61 (UPMC), 15834 → 34 (Geisinger), 18015 → 91 (Jefferson), 17101 → 57 (Capital BC) — all matching pre-apply per-RA breakdown. Cohort guard: non-PA 2026 unchanged at 4,495 on both clusters. Code: scripts/db/scrape-getinsured-2026.ts + build-pa-plans-from-scrape-2026.ts + write-pa-plans-2026.ts + validate-pa-plans-2026.ts + .tmp-eng-418/sync-pa-zip-county-staging-to-prod.mjs (one-shot, archived). PA is the pilot for the GetInsured 7-state batch — NJ / VA / NV / NM / ME inherit this pipeline (~1 day each). |
| 2026 Phase B — drugs | Per-carrier formulary scrape, mirror NY ENG-412 Phase 1 | Backlog | 12 carriers: Ambetter / Oscar / Highmark (3 entities) / IBC / Jefferson / Partners / Geisinger (2 entities) / UPMC / Capital BC. Formulary URLs already captured per plan as puf.urls.formulary — extract dedup'd list. Writes to formularies_staging keyed pa:<rxcui> (matches NY's ny:<rxcui>). |
| 2026 Phase C — providers | TBD source: PA-equivalent of NY-DOH-PNDS open data (if exists) OR per-carrier directory crawl OR §1311 MRFs | Backlog | Provider directory URLs already captured per plan as puf.urls.providerDirectory. NPPES-NPI-native expected (PA carriers use FFM HIOS schema). Writes to providers_staging keyed pa:<npi> (matches NY's ny:<npi>). |

Illinois ​

| Year | Phase | Status | Notes |
| --- | --- | --- | --- |
| 2026 Phase A — plans | GetInsured-stack scrape (third consumer after PA + NJ) | Applied to BOTH clusters 2026-06-10 (ENG-448) — deploy pending | 13 RAs × 4 CSR buckets, 271 HIOS plans, 7 carriers (BCBS IL/HCSC = 207 plans, sole issuer in 63 counties). csr_94 at $22,500 override. L1 13/13 Δ$0.00; L4 wave1 + calculator-baseline ZERO diffs; Layer 5 green. Snapshots staging B 6a29945b…/C 6a29965c…, prod B 6a29945d…/C 6a29965b…. Prod zip_county cleanup is a deploy-time step (cleanup-il-zip-county-2026.ts) — see watchouts IL section. QRS inline (233/271 rated). Catastrophic tier not captured (age-40 anchor). |
| 2026 Phase B — drugs | ffm_1311_mrf sweep head start: 5/7 carriers already current (Oscar/Ambetter/UHC/Cigna/MercyCare, 3.8-4.7K RxCUIs each, plan ids verified == scraped 2026 hiosIds) | COMPLETE 2026-06-10 — 7/7 carriers | Fills: BCBS IL via bcbsil.com/aca-json/il/drugs_il-1.json (3,959 drugs; the HCSC bcbs<st>.com/aca-json/<st>/ pattern, discovered from FFM af_sources + confirmed via index_il.json); Molina IL via ILFormulary2026.pdf parse + RxCUI fan-out (8,539 RxCUIs, 99.9% resolution — Molina publishes NO IL machine-readable drugs; its national cmsjson is FFM-only. ⚠️ Molina IL tier legend = national scheme 1=Preventive, NOT CA's SBP map). derive-drug-search-index --apply re-run (6,482 groups). Issuers publish §1311 MRFs regardless of FFM/SBE status — ALWAYS check existing coverage before a from-scratch Phase B. |
| 2026 Phase C — providers | Same head start: 3/7 current (Ambetter 138,578 / UHC 98,645 / MercyCare 3,412 NPIs on bare-NPI docs — the owned-coverage npi-field lookup serves them as-is) | COMPLETE 2026-06-10 — 7/7 carriers | Fills (all bare-NPI, source-tagged il_*): BCBS IL 93,130 NPIs / 207/207 plans via index_il.json → 24 medical files (4 networks × INDIVIDUAL/GROUP/FACILITY; the filename codes ≠ network names — BAV files hold Blue-Precision); Oscar IL 34,145 via its TIC S3 (042 Select + 056 Choice; national §1311 providers_N.json scrubbed IL); Molina IL 19,412 via marketplace TIC (158 shards, provider_references at head, range-fetch + early-terminate; 152/158 across 3 passes, 6 shards 403'd); Cigna IL 20,444 via Pattern D county heuristic (collar counties; §1311 + TIC both scrubbed IL; over-attribution risk-accepted per PA precedent). IL added to OWNED_COVERAGE_STATES. IL has NO il: namespace — Layer 5 scopes by plan-prefix (framework extended, ENG-448). Layer 5: 213 checks 0 failed. |

Virginia ​

| Year | Phase | Status | Notes |
| --- | --- | --- | --- |
| 2026 Phase A — plans | GetInsured-stack scrape (fourth consumer) | Applied to BOTH clusters 2026-06-10 (ENG-450) — deploy pending | 12 RAs × 4 buckets, 69 plans, 6 carriers. csr_94 $22,500 override. APTC-implied SLCSP cross-check ≤$0.49 12/12. QRS 94%. Snapshots staging 6a29bce5/6a29be91, prod 6a29bce6/6a29be90. Prod zip_county cleanup = deploy-time (cleanup-va-zip-county-2026.ts). |
| 2026 Phase B — drugs | 3 ffm-swept + Kaiser kporg JSON + UHC OptumRx GPX526VA + Anthem 2026 PDF parse (FNAV JSON is 2025-only) | COMPLETE 2026-06-10 — 6/6 | Per-carrier RxCUIs: Anthem 8,297 · Kaiser 6,010 (TIER-ONE..FOUR vocab added) · Cigna 4,690 · Sentara 4,336 · Oscar 3,994 · UHC 3,604. Search read-model re-derived (6,516 groups). |
| 2026 Phase C — providers | 2 ffm-swept + Anthem PROVIDERS_VA.json + Kaiser kporg (year-2023 waiver) + Oscar TIC 027 + Cigna county heuristic | COMPLETE 2026-06-10 — 6/6 | NPIs: Sentara 57,111 · Anthem 54,383 · UHC 43,253 · Cigna 12,772 · Oscar 3,816 · Kaiser 1,901. Bare-NPI model (like IL). Layer 5: 243 checks 0 failed across all six states. |

Nevada ​

| Year | Phase | Status | Notes |
| --- | --- | --- | --- |
| 2026 Phase A — plans | GetInsured-stack scrape (fifth consumer) | Applied BOTH clusters 2026-06-10 (ENG-451) — deploy pending | 4 RAs × 4 buckets, 135 plans, 9 carriers. csr_94 $22,500. L1 + APTC-implied ≤$0.32 on 4/4. QRS 106/135. Snapshots staging 6a29e184/6a29e34f, prod 6a29e186/6a29e34e. Prod zip_county cleanup = deploy-time. |
| 2026 Phase B — drugs | 4 ffm-swept + Anthem/CommunityCare FNAV PDF + HPN OptumRx SE42L77 + Hometown IFP PDF | 8/9 carriers | Ambetter 4,251 · Molina 3,803 · SelectHealth 3,861 · CareSource 3,555 (ffm) · Anthem + Community Care 8,560 (shared PDF) · HPN 4,794 (OptumRx ClientFormulary SE42L77 — largest carrier, 31 plans) · Hometown 8,792 (IFP Exchange PDF). Only Imperial gapped — its FNAV formulary is UnderConstruction.htm (carrier unpublished). Search read-model re-derived. |
| 2026 Phase C — providers | 3 ffm-swept + Anthem (Elevance PROVIDERS_NV.json) + HPN (UHC TIC blobs → Sierra-Nevada HMO) | 5/9 carriers | Ambetter 87,449 · Molina 60,984 · Anthem 15,053 (year-waiver) · HPN 5,491 (largest carrier) · SelectHealth 2,786 = 100 of 135 plans. A second pass found Anthem + HPN after the first pass stopped at 3/9 — Anthem via the SAME Elevance PROVIDERS_<ST>.json pattern used for VA; HPN via the UHC blobs API used for NJ. Lesson: exhaust proven sibling-state patterns before declaring a gap. Remaining gap (CareSource low-compliance §1311, Community Care separate HMO, Hometown/Imperial small portals) = honest "no coverage data" per NJ/PA precedent. Layer 5: 274 checks 0 failed across 7 states. |

Other SBEs (queued) ​

| State | Marketplace | Expected approach | Status |
| --- | --- | --- | --- |
| New Jersey | GetCoveredNJ | GetInsured stack — inherit PA scraper (ENG-418), ~1 day | All 3 phases SHIPPED (ENG-438, 2026-06-09). Phase A: 60 plans across 5 carriers (HIOS 17970/91661/37777/91762/23818) live on BOTH prod + staging clusters. Phase B: 5/5 formularies (Centene 4,305 / Oscar 4,014 / AmeriHealth 3,977 / Horizon 4,270 / UHC 3,570 RxCUIs) on staging. Phase C: 5/5 providers (Centene 112,161 + AmeriHealth 978 + UHC 106,742 + Horizon 95,979 + Oscar 22,658 = 265,562 NJ NPIs) on staging. Cohort guard intact (non-NJ 3,730,755, drift 0 each ingest). UHC required OptumRx GPX526NJ for drugs + transparency-in-coverage.uhc.com/api/v1/uhc/blobs/ discovery for providers. Horizon required horizonblue.sapphiremrfhub.com vendor portal (curl-level Incapsula bypass). Oscar required oscar-001-in-network.json (NOT the TOC-listed oscar-002). Full carrier-by-carrier discovery patterns + watchouts: see sbe-state-watchouts.md NJ section. |
| Virginia | Virginia's Insurance Marketplace | GetInsured stack — inherit PA scraper, ~1 day | M1 follow-up |
| Nevada | Nevada Health Link | GetInsured stack — inherit PA scraper, ~1 day | M1 follow-up |
| New Mexico | beWellnm | GetInsured stack — inherit PA scraper, ~1 day | M1 follow-up |
| Maine | CoverME | GetInsured stack — inherit PA scraper, ~1 day | M1 follow-up |
| Idaho | Your Health Idaho | GetInsured stack — inherit PA scraper, ~1 day | Not in original M1 7-state batch; queue after the 5 above |
| Minnesota | MNsure | GetInsured stack — inherit PA scraper, ~1 day | Not in M1 batch; queue after |
| Georgia | Georgia Access | GetInsured stack — inherit PA scraper, ~1 day | New SBE 2026 |
| Illinois | Get Covered Illinois | SBE-FP — already in CMS FFM PUFs we ingest | Already covered via FFM PUF path (no separate scrape needed) |
| Colorado | Connect for Health Colorado | Approach 2 | Backlog (founder noted "standardized way to quickly import" — promising) |
| Connecticut | Access Health CT | Approach 2 | Backlog (same) |
| Maryland | Maryland Health Connection | Approach 2 | Backlog |
| Washington | Washington Healthplanfinder | Approach 2 | Backlog |
| Idaho | Your Health Idaho | Approach 2 | Backlog |
| New Mexico | beWellnm | Approach 2 | Backlog |
| Massachusetts | Health Connector | Approach 2 | Backlog |
| Rhode Island | HealthSource RI | Approach 2 | Backlog |
| DC | DC Health Link | Approach 2 | Backlog |
| Vermont | Vermont Health Connect | Approach 2 | Backlog |
| Kentucky | kynect | Approach 2 | Backlog |
| Maine | CoverME | Approach 2 | Backlog |
| Minnesota | MNsure | Approach 2 | Backlog |
| Georgia | Georgia Access | Approach 2 | Backlog |
| Illinois | Get Covered Illinois | Approach 2 | Backlog |
| Nevada | Nevada Health Link | Approach 2 | Backlog |
| Virginia | Virginia's Insurance Marketplace | Approach 2 | Backlog |

When adding a new state, fill in the row and link to the validation page once complete.


Prioritized roadmap — 19 remaining SBE states (2026-06-01) ​

Status as of 2026-06-01: CA + NY shipped (451 plans live in plans collection). 19 SBE states remain unconfigured: CO, CT, DC, GA, ID, IL, KY, ME, MD, MA, MN, NV, NJ, NM, PA, RI, VT, VA, WA.

This section ranks them by leverage (population coverage + ease) and dependency stack (plans → drugs → providers, in that order per state). Per the standing covenant, never start a new state's ingestion without completing this section's planning checklist for it — the per-state research column locks the approach before any code/scrape work.

Ranking criteria ​

Four signals, listed in priority order:

  1. Population leverage — how many uninsured-ACA-eligible people does the state add to our addressable surface? Bigger states = bigger TAM unlock.
  2. Marketplace tool quality — does it have a clean plan-list page that the CA-scrape pattern handles? Or is it built on an older / less consistent framework that needs a one-off scraper?
  3. Plan-data shape — community-rated (NY pattern, simpler) vs. age-rated rating-area-bound (CA pattern, standard) vs. ZIP-level pricing (rare, would need a per-ZIP scrape).
  4. State subsidy program — does the state add its own premium subsidy on top of federal APTC (like CA's CAPS+CAPC)? If yes, we need a calculate{State}Eligibility() helper before the state's pricing is consumer-correct.

Tier 1 — High population, standard pattern, ship-fast candidates ​

These are large-population states with the CA-pattern (age-rated, rating-area-bound) and either no state subsidy or a well-documented one. Plans-only first (drug + provider ingest queued after pricing lands).

| # | State | Marketplace | Approach | Est. uninsured pop. | State subsidy? | Why ship first |
|---|---|---|---|---|---|---|
| 1 | PA | Pennie | CA-style scrape | ~290K | No (federal APTC only) | 5th-largest state by uninsured; Pennie is built on the GetInsured tech stack (SBE-modern, clean plan list). Wave 1 hand-curated showcase already shipped under ENG-373, so the carrier roster is partially known. |
| 2 | NJ | GetCoveredNJ | CA-style scrape | ~210K | Yes: NJ Health Plan Savings | Same GetInsured tech stack as PA. Wave 1 showcase already shipped (ENG-373). State subsidy adds a modest top-up on federal APTC up to 600% FPL; researchable from NJ DOBI publications. |
| 3 | IL | Get Covered Illinois | CA-style scrape | ~370K | No (federal APTC only) | Brand-new SBE for 2026 plan year (transitioned off HC.gov). Tech stack: GetInsured (same as PA/NJ). Second-largest by uninsured among 2026-new SBEs, after GA. |
| 4 | VA | Virginia's Insurance Marketplace | CA-style scrape | ~320K | No (federal APTC only) | Built on GetInsured tech stack. Transitioned to SBE in 2024. High population, expansion state (no ENG-414-class subsidy ambiguity). |
| 5 | GA | Georgia Access | CA-style scrape | ~530K | No (federal APTC only) | Largest state in this set by uninsured. Brand-new SBE for 2026. Pre-flight check needed: confirm Georgia Access publishes a public shop-and-compare tool (some states gate plan browsing behind eligibility submission). Tech stack is custom (not GetInsured). |

Tier 1 rationale: shipping these 5 high-population, easy-pattern states first unlocks ~1.7M additional uninsured-ACA-eligible Americans across the platform. The PA/NJ pair has the lowest delta-effort because Wave 1 hand-curation already mapped the carrier roster (ENG-373); the GetInsured tech-stack reuse means a single scraper port covers PA + NJ + IL + VA.

Tier 2: Medium population, standard pattern

Same approach (CA-style scrape) but lower population leverage per state. Group as a batch after Tier 1 ships.

| # | State | Marketplace | Approach | Est. uninsured pop. | State subsidy? | Notes |
|---|---|---|---|---|---|---|
| 6 | WA | Washington Healthplanfinder | CA-style scrape | ~330K | Yes: Cascade Care subsidy | Custom tech stack (HBE-built). Cascade Care adds state premium assistance on top of federal APTC up to 250% FPL. |
| 7 | MD | Maryland Health Connection | CA-style scrape | ~280K | No (federal APTC only) | Custom MD tech stack. Standardized SBP designs across carriers; canonical CSR variants well-documented. |
| 8 | MA | Massachusetts Health Connector | CA-style scrape (likely) | ~110K | Yes: ConnectorCare | Custom MA tech stack. ConnectorCare is a major state-funded subsidy program (essentially state-augmented Silver). Subsidy modeling adds complexity, so research first. |
| 9 | MN | MNsure | CA-style scrape | ~290K | Yes: MinnesotaCare (BHP) | Custom MNsure tech stack. MN runs a Basic Health Program (BHP) for 138-200% FPL like NY's Essential Plan, which adds complexity. |
| 10 | CO | Connect for Health Colorado | CA-style scrape | ~210K | Yes: Colorado OmniSalud + reinsurance | Custom CO tech stack. State Reinsurance Program reduces premiums by ~25% across the board; OmniSalud subsidy covers some undocumented residents (out of scope for federal-eligible pricing). |
| 11 | CT | Access Health CT | CA-style scrape | ~140K | No (federal APTC only) | Custom AHCT tech stack. Modest population; relatively straightforward eligibility (no state subsidy program). |

Tier 3: Smaller population or higher-friction marketplaces

Lower priority by raw population. Cluster as a final batch.

| # | State | Marketplace | Approach | Est. uninsured pop. | State subsidy? | Notes |
|---|---|---|---|---|---|---|
| 12 | NV | Nevada Health Link | CA-style scrape | ~250K | No (federal APTC only) | GetInsured tech stack (same as PA/NJ/IL/VA; once Tier 1 lands, NV could ship as a fast-follow). |
| 13 | KY | kynect | CA-style scrape | ~190K | No (federal APTC only) | Re-launched as SBE in 2022. Custom KY tech stack. |
| 14 | NM | beWellnm | CA-style scrape | ~150K | No (federal APTC only) | GetInsured tech stack. |
| 15 | ME | CoverME.gov | CA-style scrape | ~50K | No (federal APTC only) | GetInsured tech stack. Small population. |
| 16 | RI | HealthSource RI | CA-style scrape | ~30K | No (federal APTC only) | Custom HSRI tech stack. Among the smallest by uninsured. |
| 17 | DC | DC Health Link | CA-style scrape | ~25K | No (federal APTC only) | Custom DC tech stack. Tiny but politically visible. |
| 18 | VT | Vermont Health Connect | CA-style scrape | ~25K | No (federal APTC only) | Custom VT tech stack. Tiny. |
| 19 | ID | Your Health Idaho | CA-style scrape | ~120K | No (federal APTC only) | Custom YHI tech stack. Only non-expansion-state SBE; CSR-94 still applies via federal APTC at 100-138% FPL the same as any other state (no ENG-414 carveout needed: the SBE redirect fires before our FFM bump logic). |

Per-state research backlog (locks the approach before scraping)

For each state, the following must be answered + documented in sbe-state-watchouts.md BEFORE starting the scraper code. This prevents the class of stale-data and sync bugs we hit on CA (Wave 1 stale 2025 carryover) and NY (PNDS specialty-code surprise, UHC GPX426NY-vs-Advantage confusion).

| Question | Why it matters |
|---|---|
| Does the public shop-and-compare tool expose ALL plans without an eligibility submission? | Some states gate plan browsing behind eligibility (NJ does this for some flows). Determines whether the CA "anchor-age + $500K income" trick works as-is. |
| Premium varies by ZIP, by rating area, or community-rated? | Picks the scrape pattern (NY vs CA vs per-ZIP). Confirmed via Layer 2 same-RA invariance test. |
| Does the state publish a Standard Benefit Plan Designs (SBP) PDF for the plan year? | If yes, CSR variants are state-mandated and consistent across carriers (CA pattern); if no, each carrier's CSR variants must be sourced from per-carrier filings or PUF (FFM pattern). |
| State subsidy program shape (APTC top-up, BHP, reinsurance, OmniSalud-style outreach)? | Drives whether we need calculate{State}Eligibility() (NY/CA pattern) OR can rely on federal APTC only (most states). |
| Drug formulary source pattern: per-carrier PDFs? FormularyNavigator? Per-carrier API? PBM aggregator? | Picks the drug-ingest approach. NY taught us per-carrier PDFs + OptumRx-style API for the UHC outlier. |
| Provider directory source pattern: state DOH PNDS-equivalent? FHIR Plan-Net? Per-carrier portals? | Picks the provider-ingest approach. NY used PNDS (state DOH); CA used Symphony (the SBE-licensed directory). Smaller states may have neither, in which case fall back to per-carrier FHIR Plan-Net. |
| Does the state run a Basic Health Program (BHP, NY's Essential Plan equivalent)? | Affects the 138-200% FPL band. MN runs MinnesotaCare; NY runs Essential Plan; no other 2026 SBE does. |

Dependency stack per state (plans → drugs → providers)

The standing pattern that worked for CA + NY:

  1. Phase A: Plans + pricing (the prerequisite for everything else). Required before anything user-facing works for the state. Validated by the 4-layer methodology above.
  2. Phase B: Drug formularies → formularies_staging, source-tagged <state>_<carrier>_<year>_marketplace_formulary. Per-carrier PDFs are the typical source. RxCUI resolution + dedup follow.
  3. Phase C: Provider directories → providers_staging. State-by-state shape varies; NY used PNDS open-data (Socrata bulk), CA used CalHEERS Symphony, smaller states may rely on per-carrier FHIR Plan-Net or licensed feeds.

Each phase ships independently to apex behind the standing covenants (Atlas snapshot → dry-run audit → founder-gated apply → byte-identical guard on FFM + other-SBE cohorts → apex smoke).

Shipping cadence recommendation

  • Tier 1 (5 states, ~1.7M uninsured): ship Phase A only first (plans + pricing), batched as one Wave per state. Goal: a TX/FL-equivalent surface for the largest remaining SBEs. Defer drug + provider ingest per-state until pricing is verified live on apex.
  • Tier 2 (6 states): ship Phase A in the same batched pattern. Add state-subsidy helpers where required (WA Cascade Care, MA ConnectorCare, MN MinnesotaCare, CO Reinsurance).
  • Tier 3 (8 states): ship Phase A in a final batch.
  • Drug + provider ingest (Phases B+C) revisits each tier after pricing is live, prioritized by user demand (which states are showing up in waitlist signups + agent referrals).

Estimated effort per state (Phase A only)

Based on NY + CA ingests:

  • Scraper port (CA pattern, GetInsured tech stack): ~1 day per state once the first GetInsured scraper is reusable.
  • Scraper port (custom tech stack): ~2-3 days per state.
  • Validation (4-layer methodology): ~0.5 day per state.
  • State subsidy helper (if applicable): ~1-2 days per state (the actuarial math is well-defined; the testing is what takes time).
  • Atlas snapshot + dry-run + apply + apex smoke: ~0.5 day per state.

Tier 1 ship estimate: ~10-14 working days for all 5 states (assuming GetInsured scraper reuse across PA/NJ/IL/VA).
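As a rough cross-check of that estimate, the per-state ranges above can be summed for the five Tier 1 states. This is only a sketch: it assumes a reusable GetInsured scraper already exists (so PA, NJ, IL and VA each cost about 1 day), that GA needs a custom scraper, and that NJ is the only Tier 1 state needing a subsidy helper. The names `EFFORT`, `TIER_1` and `estimateTier` are illustrative, not part of the codebase.

```javascript
// Per-state Phase A effort ranges in working days, copied from the bullets above.
const EFFORT = {
  scraperGetInsured: [1, 1], // assumes the first GetInsured scraper is already reusable
  scraperCustom: [2, 3],
  validation: [0.5, 0.5],
  subsidyHelper: [1, 2],
  snapshotApplySmoke: [0.5, 0.5],
};

// Stack and subsidy flags as listed in the Tier 1 table.
const TIER_1 = [
  { state: "PA", getInsured: true, subsidy: false },
  { state: "NJ", getInsured: true, subsidy: true },
  { state: "IL", getInsured: true, subsidy: false },
  { state: "VA", getInsured: true, subsidy: false },
  { state: "GA", getInsured: false, subsidy: false },
];

// Sum the low and high ends of every applicable line item across the states.
function estimateTier(states) {
  let low = 0;
  let high = 0;
  for (const s of states) {
    const parts = [
      s.getInsured ? EFFORT.scraperGetInsured : EFFORT.scraperCustom,
      EFFORT.validation,
      EFFORT.snapshotApplySmoke,
    ];
    if (s.subsidy) parts.push(EFFORT.subsidyHelper);
    for (const [lo, hi] of parts) {
      low += lo;
      high += hi;
    }
  }
  return { low, high };
}

console.log(estimateTier(TIER_1)); // { low: 12, high: 14 }
```

Under these assumptions the sum lands at 12 to 14 working days, inside the ~10-14 day range quoted above.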

Out of scope for THIS plan

  • No execution. This document defines the work + priority + dependency order. Each state's actual ship is its own Linear issue + worktree per the standing pattern.
  • No deadline commitments. Sequencing depends on which states are showing up in user demand + agent referrals + funding priorities.
  • No formulary or provider work. Phase A (plans + pricing) is the gate for everything else; Phase B + C plans are filed per-state after Phase A lands.

Population data caveat

Uninsured population estimates above are directional — sourced from KFF state health facts (https://www.kff.org/state-category/health-coverage-uninsured/) and rounded for prioritization purposes only. Refresh with the next plan-year KFF release before committing to a Wave schedule.


Lessons learned + gotchas

A running list of mistakes + non-obvious findings. Read this before starting a new state ingest.

CA Wave 1 → Wave 2 transition (2026-05-24)

  • Stale-year carryover risk: Wave 1 seed used Silver 70 deductible: $5,400, which was a 2025 carryover. 2026 CA SBP actual is $5,200. Always source the SBP design values from the CURRENT plan year, not the prior year. Cross-check via the state marketplace's live display.
  • SF rating-area "gap" (ENG-152) was stale data, not a formula bug. Looked like a $173-$263 SF SLCSP delta in Wave 1; rerunning against 2026 source data eliminated it. Don't chase formula bugs without first verifying the source-of-truth year.

LA County RA15/RA16 unification finding (2026-05-25, Wave 2 Block B.5)

  • LA County historically splits between RA15 (North/East) and RA16 (South/West) at the ZIP level. CC 2026 PY scrape of both anchors revealed identical carrier roster (6 carriers, 40 total plans) and identical per-plan gross premium for both RAs.
  • Verify before assuming a historical split persists year-over-year: scrape both anchor ZIPs and diff the plan lists. If pricing + roster match exactly, the split is administrative-only and you can collapse to one canonical RA.
  • Decision for 2026: map all LA County ZIPs to RA15. RA16 remains in the regions collection for completeness but receives no ZIPs in zip_county.
  • If a future year reintroduces a meaningful split: scripts/db/scrape-la-zip-ra-split-2026.ts is kept as scaffolding for per-ZIP classification via subsidy-fingerprint match.
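The "scrape both anchor ZIPs and diff the plan lists" check can be sketched as a pure comparison over two scrape results. The plan shape `{ carrier, planName, grossPremium }` and the function name `diffAnchorScrapes` are assumptions for illustration; the real scrape output may use different field names.

```javascript
// Compare two anchor-ZIP scrapes. Each plan is assumed to look like
// { carrier, planName, grossPremium } (hypothetical field names).
function diffAnchorScrapes(plansA, plansB) {
  const key = (p) => `${p.carrier}|${p.planName}`;
  const mapA = new Map(plansA.map((p) => [key(p), p.grossPremium]));
  const mapB = new Map(plansB.map((p) => [key(p), p.grossPremium]));

  // Roster differences: plans present under one anchor but not the other.
  const onlyInA = [...mapA.keys()].filter((k) => !mapB.has(k));
  const onlyInB = [...mapB.keys()].filter((k) => !mapA.has(k));

  // Pricing differences: same plan, different gross premium.
  const premiumMismatches = [...mapA.keys()]
    .filter((k) => mapB.has(k) && mapB.get(k) !== mapA.get(k))
    .map((k) => ({ plan: k, a: mapA.get(k), b: mapB.get(k) }));

  const identical =
    onlyInA.length === 0 && onlyInB.length === 0 && premiumMismatches.length === 0;
  return { onlyInA, onlyInB, premiumMismatches, identical };
}
```

If `identical` is true for both anchors, the split is a candidate for collapsing to one canonical RA; any roster or premium mismatch means the split is still meaningful for that plan year.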

CC Shop & Compare scraper specifics

  • Pagination, NOT virtualization. CC's plan list paginates 10 plans per page using MUI Pagination. Click "Next page" button; do NOT try to scroll-infinite-load. Selector: button[aria-label*="next page" i].
  • Eligibility modal. After "See My Results" submit, a modal appears: "This isn't an application for health coverage." Click "Continue" before navigating to results.
  • Filter modal on plan list arrival. "Your Health Plan Filters" modal blocks the top of the plan list. Click "OK" to dismiss before extracting cards.
  • Plan cards are <li> elements with dynamic MUI classes. Can't anchor on class. Identify by text content: each plan card's <li> contains BOTH $X.YY /mo AND Yearly deductible.
  • Plan name heading contains the whole card text. Don't use h1/h2/h3.textContent to grab the plan name — it'll include Compare button + cost-share labels. Instead grep for the carrier + plan-line + network-type pattern (e.g., ^(Kaiser|Anthem|...) ... (HMO|PPO|EPO)$).
  • Combobox is custom, not native <select>. CC uses MUI Autocomplete. Click the combobox to open, then click the [role=option] element with the desired text. Two comboboxes on the form share name "Dropdown" (Coverage Year + Household Size) — pick by position (.nth(1)).
  • HDHP / HSA plans don't render a metal pill. They have plan-line names like "Kaiser Bronze 60 HDHP HMO" but no "BRONZE HSA" pill. Derive metal from plan name when pill not found.
  • High-income anchor for clean gross premium. With $500,000 income, CC shows zero subsidy and gross premium displays directly — no need to subtract subsidy to derive gross. Cleaner for ingestion math.
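Two of the bullets above (identifying plan cards by text content, and deriving metal from the plan name when no pill renders) reduce to small text helpers. The sketch below is illustrative only; `looksLikePlanCard` and `deriveMetal` are hypothetical names, and the exact card text on the live site may differ.

```javascript
// A plan card's <li> text contains BOTH a "$X.YY /mo" price and "Yearly deductible".
function looksLikePlanCard(text) {
  return /\$\d[\d,]*\.\d{2}\s*\/mo/.test(text) && text.includes("Yearly deductible");
}

// Prefer the metal pill when present; fall back to the plan name for
// HDHP / HSA plans that render no pill. Returns UPPERCASE metal or null.
function deriveMetal(planName, pillText) {
  const source = pillText || planName;
  const m = source.match(/\b(platinum|gold|silver|bronze)\b/i);
  return m ? m[1].toUpperCase() : null;
}

console.log(deriveMetal("Kaiser Bronze 60 HDHP HMO")); // "BRONZE"
```

Returning the metal in uppercase keeps the result aligned with the uppercase `metal` convention used in the plans collection.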

PA / GetInsured-stack scraper specifics (ENG-418)

  • JSON API beats SPA scraping where it exists. Pennie's /hix/private/getIndividualPlans returns the full 73-plan × 145-field response that the React SPA renders. Drive the prescreener form via Playwright JUST to establish the session cookie, then fetch the JSON API for the actual data. 36× faster than DOM-scraping the plan cards. Pattern likely repeats across the GetInsured stack (NJ/IL/VA/NV/NM/ME) since they all run the same upstream platform, so try the same endpoint first.
  • Multi-county ZIPs trigger a <select> dropdown that gates Continue. Border ZIPs like 15834 (Cameron/Elk/Potter) and 18015 (Bethlehem — Northampton/Lehigh) show a "Select your county" select with options whose VALUES are county FIPS. Handle in the scraper: detect select presence, set value to the anchor RA's county FIPS via the React setter pattern. THIS was the single biggest source of pre-fix RA2 + RA6 timeouts. Validate on at least one multi-county ZIP per state during Phase 0.
  • CSR-94 income bucket can be unreachable at single-person anchor. Pennie blocks the form at very-low single-person income (~$22K). Workarounds: (a) derive CSR-94 cost-shares from federal 45 CFR §156.420 AV target (0.94) applied to base Silver; (b) use a 2-person household at proportionally low income (Medicaid threshold scales by household size). Document which one you used in the validation page.
  • Pennie's costSharing codes don't match CMS HIOS variant suffix. Observed: CS5 returned for the csr_87 income bucket (CMS standard would be CS3); CS4 for csr_73 (CMS standard CS2). Don't key on Pennie's code — key on the income bucket you supplied. The cost-share VALUES are correct.
  • Plan-level fields are HIOS-keyed (one doc per HIOS), NOT per-(HIOS, RA). Look at an existing CA or NY plan in the plans collection before designing your normalize — regionId: null + premiumsByRatingArea: { 1: X, 4: Y, 8: Z } + ageRatesByArea: { 1: {0-14:…, 64 and over:…}, 4: {…}, ... } is the convention. PA originally emitted per-(HIOS,RA) docs and had to be re-grouped before apply.
  • Age curve keys are "0-14" / "15"-"63" / "64 and over", NOT integers 0-64. The "64 and over" key is what makes premiums work for users locked out of Medicare (recent green-card holders, non-citizens, anyone 65+ without 40 quarters of work history) — they pay the federally-capped age-64 rate per 45 CFR 147.102(e)'s 3:1 max age-rating rule. (Aside: CA stored integer 0-64 keys, which is a latent bug — 65+ CA users hit the route's fallback path. Tracked separately.)
  • Field-name parity matters: metal is UPPERCASE; planName not "name"; hsaEligible at top-level not nested under puf.planFeatures; program: "QHP" constant; both premiumsByRatingArea AND premiumsByRegion populated identically (CA legacy alias). Diverging from this shape means the /plans surface renders blank for your new state.
  • copays field shape varies across cohorts but the runtime tolerates both CMS-canonical keys ("Primary Care Visit to Treat an Injury or Illness") and camelCase shorthand ("primaryCare"). PA uses CMS-canonical with raw Pennie displayVal strings ("$15 Copay", "20% Coinsurance after deductible") — the runtime's extractCopay() parser handles all formats.
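The HIOS-keyed document convention and the age-curve key rule can be sketched as two small helpers. The input row shape (`hiosId`, `ratingArea`, `premium`, `ageRates`) is a hypothetical stand-in for whatever the per-(HIOS, RA) scrape output looks like; only `regionId`, `premiumsByRatingArea`, `premiumsByRegion`, `ageRatesByArea` and the three age-key forms come from the notes above.

```javascript
// Age-curve key for an integer age: "0-14", "15".."63", or "64 and over".
function ageCurveKey(age) {
  if (age <= 14) return "0-14";
  if (age >= 64) return "64 and over";
  return String(age);
}

// Collapse per-(HIOS, RA) rows into one document per HIOS ID.
// Row shape { hiosId, ratingArea, premium, ageRates, ...planFields } is hypothetical.
function groupByHios(rows) {
  const byHios = new Map();
  for (const { hiosId, ratingArea, premium, ageRates, ...planFields } of rows) {
    if (!byHios.has(hiosId)) {
      byHios.set(hiosId, {
        ...planFields,
        hiosId,
        regionId: null,
        premiumsByRatingArea: {},
        premiumsByRegion: {}, // legacy alias, populated identically
        ageRatesByArea: {},
      });
    }
    const doc = byHios.get(hiosId);
    doc.premiumsByRatingArea[ratingArea] = premium;
    doc.premiumsByRegion[ratingArea] = premium;
    doc.ageRatesByArea[ratingArea] = ageRates;
  }
  return [...byHios.values()];
}
```

Grouping before the apply step avoids the re-grouping pass PA needed, and routing every age lookup through one key function keeps 65+ users on the "64 and over" rate rather than a fallback path.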

Quality ratings hide in plain sight (ENG-418 A.5 lesson)

  • Every SBE plan-search response we've seen so far includes the full CMS QRS issuer rating inline — the same data scripts/db/ingest-qrs-ratings.js calls the CMS Marketplace API for on the federal-30 side. The SBE scrape ALREADY has it; the normalize just has to read it.
  • Pennie field name: issuerQualityRating (24-field nested object including QualityRating, GlobalRating, SummaryClinicalQualityManagement, SummaryEnrolleeExperience, SummaryPlanEfficiencyAffordabilityAndManagement, EffectiveDate + 18 sub-scores by clinical category). Each plan record also has a flat qualityRating mirror at the top level.
  • Mapping: Pennie strings → canonical ints ("" → 0). SummaryClinicalQualityManagement → clinicalQualityManagementRating; SummaryEnrolleeExperience → enrolleeExperienceRating; SummaryPlanEfficiencyAffordabilityAndManagement → planEfficiencyRating. Use the 14-char HIOS prefix (issuerPlanNumber[0:14]) as the dedupe key — multiple variants of the same product share the issuer-level rating.
  • 217/291 PA plans (74.6%) had ratings; 74 (mostly new carriers + new product lines without 3 enrolled years) returned empty issuerQualityRating. Federal-30 sees ~98.1% coverage. Both gaps are real — render "Not rated" rather than treating as 0-star.
  • CC + NYSOH scrapes are expected to follow the same pattern (per-plan inline rating from the upstream CMS QRS feed). Audit those scrapes' raw JSON before designing a separate ingest. Likely 90% of the work is git grep -i "qualityRating\|starRating\|qrs" scripts/db/data/ and a 20-line schema mapper.
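A minimal sketch of the schema mapper described above, assuming the Pennie field names listed in these bullets. The helper names (`toStars`, `mapIssuerQualityRating`, `ratingsByHiosPrefix`, `renderStars`) are hypothetical, and only the three documented Summary* mappings are included.

```javascript
// Pennie returns ratings as strings; "" (or a missing field) maps to 0.
const toStars = (s) => {
  const n = Number.parseInt(s, 10);
  return Number.isNaN(n) ? 0 : n;
};

// Map one raw plan record to the canonical rating fields.
function mapIssuerQualityRating(plan) {
  const r = plan.issuerQualityRating || {};
  return {
    hiosPrefix: plan.issuerPlanNumber.slice(0, 14), // 14-char dedupe key
    clinicalQualityManagementRating: toStars(r.SummaryClinicalQualityManagement),
    enrolleeExperienceRating: toStars(r.SummaryEnrolleeExperience),
    planEfficiencyRating: toStars(r.SummaryPlanEfficiencyAffordabilityAndManagement),
  };
}

// Variants of the same product share one rating entry, so keep the first per prefix.
function ratingsByHiosPrefix(plans) {
  const out = new Map();
  for (const plan of plans) {
    const mapped = mapIssuerQualityRating(plan);
    if (!out.has(mapped.hiosPrefix)) out.set(mapped.hiosPrefix, mapped);
  }
  return out;
}

// A 0 means "no rating published", not a zero-star plan.
const renderStars = (n) => (n > 0 ? `${n} of 5` : "Not rated");
```

Keeping the "Not rated" decision in the render helper means the stored value can stay a plain integer while the UI never shows an unrated plan as zero stars.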

Playwright + tsx interop

  • Inner function declarations break inside page.evaluate(() => {}). tsx wraps functions with __name(class) helper calls that don't exist in browser context. Pass the evaluator as a string template literal instead: page.evaluate(`(() => { ... })()`), and inline all logic, no inner functions.
  • Regex escape rules in template-literal evaluators. Use 2 backslashes for regex special chars: \\b in the template becomes \b in the string the browser parses. Avoid character-class dashes in the middle — put - at the end of the class: [A-Za-z0-9 ()./-].
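The two rules above can be illustrated with a string evaluator that uses a doubled backslash and an end-of-class dash. This is a sketch rather than the project's actual evaluator: the `li` selector and the network-type regex are illustrative, and `evaluatorSource` is a hypothetical name.

```javascript
// The evaluator is a plain string, so tsx never rewrites it and no __name
// helper is injected. Inside the template literal, "\\b" becomes "\b" in the
// source text the browser parses. The dash sits at the end of the class.
const evaluatorSource = `(() => {
  const names = [];
  const items = document.querySelectorAll("li");
  for (let i = 0; i < items.length; i++) {
    const text = items[i].textContent.trim();
    if (/\\b(HMO|PPO|EPO)$/.test(text) && /^[A-Za-z0-9 ()./-]+$/.test(text)) {
      names.push(text);
    }
  }
  return names;
})()`;

// With Playwright, pass the string directly:
//   const names = await page.evaluate(evaluatorSource);
```

Because the evaluator is just a string, it can be sanity-checked outside the browser by evaluating it against a stubbed `document` before wiring it into a scrape.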

MongoDB writes (every state, every year)

  • Standing covenant for plans collection writes: dry-run → tier-1..5 audit harness re-runs at 100% → rollback snapshot → founder-gated apply. See docs/validation/audit/ for the harness.
  • Atlas snapshot BEFORE any plans write. atlas backups snapshots create <prod-cluster> --projectId <prod-project-id> --retention 30 --desc "pre-state-X-Y ingest". Record snapshot ID in the ingest commit message for rollback traceability.

Cluster targeting: plans go to PROD FIRST (caught in ENG-418 A.4 apex smoke)

This is the single most important playbook addition from ENG-418. Plans (and zip_county) on apex are read via getDb() → MONGODB_URI → the prod M10 cluster (askflorence-prod-01.njkihm). The staging cluster (askflorence-staging.efsikmv) is what .env.local points at for local dev + the cross-cluster reference cluster for getReferenceDb() (formularies_staging + providers_staging only). Two separate Atlas clusters in two separate Atlas projects.

CA + NY convention (per scripts/db/copy-ca-data-prod-to-staging.cjs): write plans to PROD cluster first, then mirror to staging. ENG-418 Phase A.2.2 violated this by writing only to staging (because .env.local defaulted there + the write-pa-plans-2026.ts cluster guard was set to staging-only). Result: code went live on apex with OWNED_DATA_STATES containing PA, but apex /api/counties STILL returned sbeRedirect because PROD's zip_county still had sbeRedirect set and PROD's plans had 0 PA rows. Hour of investigation later: write the same data to prod, problem gone.

The two-cluster ingest pattern (apex-correct order):

  1. Take Snapshot B on prod cluster (atlas backups snapshots create <prod-cluster> --projectId <prod-project-id> --desc "pre-state-X-Y ingest").
  2. Get prod-cluster write access. Pattern (CLAUDE.md precedent):
    bash
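    # Create temp DB user (45-min auto-expiry).
    # Note: `date -v+45M` is BSD/macOS syntax; with GNU date use: date -u -d '+45 minutes' '+%Y-%m-%dT%H:%M:%SZ'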
    atlas dbusers create --projectId <prod-project-id> \
      --username eng<NNN>-<state>-prod-temp \
      --password "$(openssl rand -hex 32)" \
      --role readWriteAnyDatabase \
      --deleteAfter "$(date -u -v+45M '+%Y-%m-%dT%H:%M:%SZ')"
    # Add your IPv4 to prod allowlist (NOT IPv6 — Atlas rejects):
    atlas accessLists create "$(curl -s -4 ifconfig.me)" \
      --type ipAddress --projectId <prod-project-id> \
      --comment "eng<NNN> temp" \
      --deleteAfter "$(date -u -v+45M '+%Y-%m-%dT%H:%M:%SZ')"
  3. Run --dry-run against prod URI (verifies cohort baseline + per-RA counts).
  4. Run --apply with WRITE_CONFIRM=yes env var against prod URI. Cohort guard MUST show non-PA 2026 count unchanged.
  5. zip_county cleanup on prod cluster (same script as staging, just against prod URI):
    js
    // For each ZIP belonging to the new state, $set regionId (per fips→RA table) + $unset sbeRedirect.
    // If staging already has the canonical post-cleanup state, simpler: copy staging → prod by {zip, countyFips}.
  6. Apex smoke matrix (curl per anchor ZIP → confirm counties, no sbeRedirect → POST /api/plans → confirm non-empty plan list with state's HIOS-prefix carriers).
  7. Take Snapshot C on prod cluster.
  8. Delete temp DB user + IP allowlist entry (atlas dbusers delete ... --force + atlas accessLists delete ... --force).
  9. (Optional, cost-savings parity) Mirror prod → staging for the new state: node scripts/db/copy-ca-data-prod-to-staging.cjs style — adapt for the state's data subset.

The write scripts (write-<state>-plans-<year>.ts) MUST accept BOTH clusters in their safeHost regex. The founder-gated WRITE_CONFIRM=yes + cohort guard are the real safety net; the cluster regex just blocks accidental third-cluster runs (e.g., a dev cluster). Example from write-pa-plans-2026.ts:

ts
const safeHost = /askflorence-staging|askflorence-prod-01|atlas-ttlyyd|njkihm/i;

Federal regression bright-line

  • Never merge an SBE data change without re-running wave1-federal-regression --diff. Zero diffs is the gate. Wire it into preflight (already done via scripts/preflight.ts --full).

Provider-coverage ingest patterns: PA Phase C playbook (added 2026-06-05)

PA Phase C (ENG-435 + ENG-437) shipped all 9 PA marketplace carriers (165 plans) with provider attribution to providers_staging in one extended session. Discovered 5 distinct ingest patterns that cover the realistic set of carrier-publishing approaches. Every future state's provider ingest should map each carrier to one of these patterns first.

Pattern decision matrix (run this BEFORE coding)

For each carrier, check IN ORDER (cheaper patterns first):

| Step | Check | If yes → use |
|---|---|---|
| 1 | §1311 MRF TOC is publicly accessible (carrier site footer or /transparency-in-coverage) AND TOC contains HIOS marketplace plan IDs | Pattern A: MRF direct |
| 2 | TOC files are CMS-standard .json.gz (Cloudfront-signed) | Pattern A (streaming gz) |
| 3 | Public find-a-doctor SPA exposes NPI in network XHR responses (capture via Chrome MCP fetch-wrapping) | Pattern B: SPA-API direct |
| 4 | Public find-a-doctor page IS NPI-blind (PDF / server-rendered HTML / vCard with no NPI) BUT carrier publishes provider PDFs | Pattern C: PDF + NPPES fuzzy match |
| 5 | Public path member-auth gated OR rate-limited to unviability | Pattern D: county heuristic (sister-line risk-accepted attribution) |
| 6 | Public sister product line shares the same network and we already ingested it | Pattern E: additive attribution (zero new NPIs needed) |

Pattern A: §1311 MRF direct ingest

Recipe:

  1. Discover TOC URL (carrier transparency page or footer)
  2. Verify TOC has plan_id_type === "hios" entries matching state marketplace HIOS prefix
  3. For each reporting_structure that lists our marketplace plans, collect in_network_files[] URLs
  4. Stream each file (HTTP fetch); for .json.gz, gunzip + early-termination at "in_network" marker token (270× speedup vs full parse)
  5. Regex extract NPIs from "npi"\s*:\s*\[...\] arrays + "npi"\s*:\s*<10-digit> scalars
  6. JSONL write with our_plan_ids populated from the 14-char marketplace HIOS lookup
  7. --apply upserts pa:<npi> with $setOnInsert (identity locked) + $addToSet plans (idempotent)
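Step 5 of the recipe can be sketched as a regex pass over a decoded text chunk. This is a minimal illustration, not the logic in ingest-pa-providers-mrf-generic.cjs: `extractNpis` is a hypothetical name, and a real streaming reader would also need to buffer across chunk boundaries so that an `"npi"` array split between two chunks is not missed.

```javascript
// Pull 10-digit NPIs out of a chunk of in-network MRF text.
function extractNpis(text) {
  const found = new Set();

  // Array form: "npi": [1234567890, 1098765432]
  for (const m of text.matchAll(/"npi"\s*:\s*\[([^\]]*)\]/g)) {
    for (const n of m[1].match(/\b\d{10}\b/g) || []) found.add(n);
  }

  // Scalar form: "npi": 1234567890 (optionally quoted); reject longer digit runs.
  for (const m of text.matchAll(/"npi"\s*:\s*"?(\d{10})"?(?!\d)/g)) {
    found.add(m[1]);
  }

  return found;
}
```

Collecting into a `Set` dedupes within a file; the `$addToSet` upsert in step 7 then handles dedupe across files and carriers.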

Real examples: Ambetter Centene national TOC (api.centene.com/.../cms-data-index.json → 111,885 NPIs), Highmark BS PA (mrfdata.hmhs.com/.../highmark-bsp-index.json → 92,605 NPIs), IBX QCC (ihg-dart-edw-mrf-prod-public/qcc/.../index.json → 809,576 NPIs), Oscar (single in-network JSON file → 6,491 NPIs).

Caveats:

  • TOCs vary wildly in size: from one file (Oscar = 13.6 MB) to 525 files (IBX = 75+ GB if all downloaded)
  • Cloudfront URLs have signed expiry — fetch within 24 hr of TOC capture
  • Server-side rate-limit on heavy file families (e.g., IBX's 17B0/17D0, which terminated connections at 24-parallel fetch). Solution: accept partial downloads, since NPI overlap across files is high after dedupe.

Key script: scripts/db/ingest-pa-providers-mrf-generic.cjs — env-driven generic implementation. Set CARRIER_NAME, CARRIER_SOURCE_TAG, HIOS_PREFIX, TOC_URL, PLAN_ISSUER_FILTER env vars.

Pattern B: SPA-API direct (NPI in JSON XHR)

Recipe:

  1. Open carrier's find-a-doctor page via Chrome MCP (residential IP bypasses datacenter bot-shields like Radware)
  2. Set up fetch wrapper via Chrome MCP javascript_tool to capture all XHR calls
  3. Trigger a search via the SPA's UI (or direct URL navigation)
  4. Identify the search endpoint (e.g., POST /healthsparq/public/service/v4/search)
  5. Replay the captured body shape from JS in the same browser session (cookies + auth tokens auto-flow)
  6. Parse JSON response; look for attributes[] array with {key:"NPI", value:"<10-digit>"} pattern
  7. Iterate via prefix-letter sweep + pagination (HealthSparq caps at 300 per query / 10 per page)
  8. Apply with cohort guard

Real example: Geisinger HealthSparq API (POST /healthsparq/public/service/v4/search) → 6,190 unique NPIs across 26-letter + 2-letter prefix sweeps in ~25 min. NPI was in providerResults[].attributes[] where key === "NPI".
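Steps 6 and 7 of the recipe can be sketched as a response parser plus a prefix expander. The `providerResults[].attributes[]` shape follows the Geisinger example above; the function names and the idea of expanding a one-letter prefix into 26 two-letter prefixes when a query hits the result cap are illustrative assumptions.

```javascript
// Extract NPIs from one search response body (already parsed JSON).
function npisFromSearchResponse(body) {
  const npis = [];
  for (const provider of body.providerResults || []) {
    for (const attr of provider.attributes || []) {
      if (attr.key === "NPI" && /^\d{10}$/.test(String(attr.value))) {
        npis.push(String(attr.value));
      }
    }
  }
  return npis;
}

const ALPHABET = "abcdefghijklmnopqrstuvwxyz".split("");

// When a prefix returns the per-query cap, sweep its 26 longer prefixes instead.
function expandPrefix(prefix) {
  return ALPHABET.map((c) => prefix + c);
}
```

A sweep would start from the 26 single letters and call `expandPrefix` only for the letters whose result count reaches the vendor cap, which keeps the number of requests inside the session-timeout window.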

Caveats:

  • 5-15% of carriers expose NPIs in API but filter from rendered HTML — always sniff XHR before deciding "NPI not exposed"
  • HealthSparq / similar vendors hard-cap results per query (~300-500) — must iterate via prefix sweeps
  • Session timeouts (Geisinger expired after ~20 min sustained queries → HTTP 440)
  • Some endpoints prevent re-auth via fetch — must trigger via SPA UI

Pattern detection:

  • Vendor tells: HealthSparq, Sapphire (CalHEERS Symphony), HealthTrio Connect (HTML only, NOT JSON), Vitals, MyHealthToolkit
  • Test: search a known last name in SPA + check Network tab for application/json responses containing NPI digits

Pattern C: PDF/HTML directory + NPPES fuzzy match

Recipe (PDF source):

  1. Discover marketplace PDF download URLs (often via /pdf-directory/... redirects to widen.net or other CDN)
  2. Real PDF URL hidden in viewer HTML at window.viewerPdfUrl (signed CDN URL with ~24hr expiry)
  3. Download PDFs (one per network if carrier offers tiered networks)
  4. Use pdfplumber for column-aware extraction (NOT pdftotext — interleaves multi-column layouts)
  5. Use NAME-ANCHORED regex extraction over column-aware extraction:
    • Provider names follow Last, First [Middle], CRED pattern (very strong anchor)
    • Page-header has county name (geographic anchor)
    • Skip facility sections (Hospitals, Labs) — focus on PCP/Specialist sections
  6. Output JSONL: {last_name, first_name, county, credential}
  7. Build NPPES PA index from existing providers_staging (pa:* docs with name.{first,last} + addresses[].zip)
  8. Build PA ZIP → county lookup from zip_county collection
  9. Match by (last_name, first_name, county) — score: exact_last_first_county (high), exact_last_initial_county (medium), unique_lastname (low)
  10. Apply matched NPIs with $addToSet plans

Real example: UPMC 4 marketplace PDFs (88 MB, 15,848 pages total) → name extraction → matched 9,560 of 34,541 directory entries against a 329,902-NPI NPPES PA baseline, with 77% of matches at the exact_last_first_county (high-confidence) tier.
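The three scoring tiers in step 9 can be sketched as a cascading match over an in-memory candidate list. This is a simplified illustration, not the logic in upmc-fuzzy-match-nppes.cjs: the candidate shape `{ npi, last, first, county }` is hypothetical, and a real index would be keyed by last name rather than scanned linearly.

```javascript
const norm = (s) => String(s || "").trim().toLowerCase();

// entry: { last_name, first_name, county } from the directory JSONL.
// candidates: NPPES-derived rows { npi, last, first, county } (hypothetical shape).
function matchEntry(entry, candidates) {
  const last = norm(entry.last_name);
  const first = norm(entry.first_name);
  const county = norm(entry.county);

  const sameLast = candidates.filter((c) => norm(c.last) === last);
  const inCounty = sameLast.filter((c) => norm(c.county) === county);

  // High: last + first + county all match exactly, and only one candidate does.
  const exact = inCounty.filter((c) => norm(c.first) === first);
  if (exact.length === 1) {
    return { npi: exact[0].npi, tier: "exact_last_first_county", confidence: "high" };
  }

  // Medium: last + first initial + county, unambiguous.
  const initial = inCounty.filter((c) => norm(c.first)[0] === first[0]);
  if (first && initial.length === 1) {
    return { npi: initial[0].npi, tier: "exact_last_initial_county", confidence: "medium" };
  }

  // Low: the last name is unique across the whole candidate set.
  if (sameLast.length === 1) {
    return { npi: sameLast[0].npi, tier: "unique_lastname", confidence: "low" };
  }
  return null; // ambiguous or no match
}
```

Returning `null` on ambiguity rather than picking a candidate keeps false attributions out of the apply step; the tier label can be logged so the share of high-confidence matches is measurable per carrier.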

Recipe (HTML source):

  • Same as PDF but parse rendered HTML for name/city/zip per provider card
  • HealthTrio Connect uses LI.ht2-ProviderCard wrappers
  • Cloudflare rate-limits aggressively — must use long throttle (1-2 sec/request)

Caveats:

  • Match accuracy 70-95% depending on directory data quality (more credentials = better match)
  • Plan→network mapping often UNKNOWN at PUF level (puf.networkId null) — accept sister-line over-attribution to all carrier plans
  • Address parsing from multi-column PDFs is fragile; ZIP-only matching often sufficient

Key scripts:

  • scripts/db/upmc-pdf-extract-names.py (pdfplumber + regex name extraction)
  • scripts/db/upmc-fuzzy-match-nppes.cjs (NPPES index + match + apply)

Pattern D: county-heuristic over-attribution

When to use: The carrier's provider directory is impractical to scrape:

  • Member-auth gated (/secure/member/...)
  • HIOS-search-gate rejects marketplace HIOS
  • HealthTrio HTML directory rate-limits at 4hr+ ETA
  • Vendor refuses datacenter IPs across all bypass attempts

Recipe:

  1. Identify carrier's actual service area (county-level) via marketing materials, broker maps, "where we serve" page
  2. Resolve county set to all PA ZIPs via zip_county collection (distinct('zip', {state:'PA', county:{$in: counties}}))
  3. Count NPPES PA NPIs whose addresses.zip is in service-area ZIPs (sanity check expected count)
  4. updateMany with $addToSet plans filtered to matched ZIPs
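Step 4 of the recipe can be sketched as a pure builder for the `updateMany` arguments, which makes the filter easy to review before any write. The `pa:` `_id` prefix, `addresses.zip` and `plans` come from the notes above; the assumption that `plans` holds plan-ID strings, and the name `buildCountyHeuristicUpdate`, are illustrative.

```javascript
// Build the filter + update for a county-heuristic attribution.
// serviceAreaZips: ZIPs resolved from the carrier's counties (step 2).
// planIds: the carrier's marketplace plan IDs (assumed to be strings).
function buildCountyHeuristicUpdate(serviceAreaZips, planIds) {
  return {
    // Only touch this state's docs whose address ZIP is inside the service area.
    filter: { _id: { $regex: "^pa:" }, "addresses.zip": { $in: serviceAreaZips } },
    // $addToSet with $each keeps the write idempotent across re-runs.
    update: { $addToSet: { plans: { $each: planIds } } },
  };
}

// Usage with the Node MongoDB driver (not executed here):
//   const { filter, update } = buildCountyHeuristicUpdate(zips, planIds);
//   const expected = await coll.countDocuments(filter); // step 3 sanity check
//   const res = await coll.updateMany(filter, update);
```

Running `countDocuments` with the same filter first gives the expected NPI count from step 3, so a surprising number can stop the run before the write.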

Real examples:

  • Jefferson + Partners (HIOS 19702 + 93909) — SE PA (Philadelphia, Bucks, Chester, Montgomery, Delaware, Lehigh, Northampton, Berks) → 156,089 NPIs in 4 min
  • Capital BC (HIOS 45127) — Central PA (Dauphin, Lancaster, Lebanon, Cumberland, York + border Adams, Franklin, Perry) → 46,377 NPIs in 80 sec

Risks (always document in commit message):

  • ~30-40% over-attribution: providers in service-area counties who DON'T actually contract with the carrier get marked InNetwork
  • UI mitigation: copy directing users to verify with carrier directly before booking
  • Mitigation justification: same discipline as IBX 33871 sister-line + UPMC all-networks attribution. Better than zero coverage which blocks enrollment journeys

Quality bar to ship over heuristic:

  • Coverage on most populous network in service area (>40k providers per major carrier)
  • Direct API/MRF/PDF availability that survives a 4-hour effort budget

Pattern E: additive attribution to sister product line

When to use: Same issuer has multiple HIOS prefixes for parallel product lines (HMO + PPO, Advantage + Premier, etc.) and we've already ingested one of them.

Recipe:

  1. Verify issuer-name match in /api/plans output (both prefixes branded same)
  2. Confirm no separate publicly-discoverable TOC for the secondary prefix
  3. updateMany to add secondary-prefix plan IDs to ALL existing pa:* docs with primary-prefix plan source

Real example: IBX 33871 (12 plans, Independence Administrators line) attributed to all 809,576 IBX-31609 (Independence BC core line) NPIs in 32 min via $addToSet.

Caveats:

  • Risk-accept ~5-15% false positive for product-line-restricted providers
  • Best for sister lines of same legal issuer (e.g., HMO + PPO of same parent)
  • NOT appropriate when "sister" line is a separate underwriter (different BAA, different contracts)

Cohort guard discipline (every Pattern, every apply)

javascript
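// Pre-flight: capture invariant.
// Assumes `coll` is the target collection handle and `log` is the script's logger.
// estimatedDocumentCount() reads collection metadata, so it is fast but approximate.
const totalBefore = await coll.estimatedDocumentCount();
const paBefore = await coll.countDocuments({_id: {$regex: "^pa:"}});
const nonPaBefore = totalBefore - paBefore;
log(`Pre-flight: pa:*=${paBefore}, non-pa=${nonPaBefore}`);

// ... ingest ...

// Post-flight: recount, then verify the non-PA cohort did not move.
const totalAfter = await coll.estimatedDocumentCount();
const paAfter = await coll.countDocuments({_id: {$regex: "^pa:"}});
const nonPaAfter = totalAfter - paAfter;
const drift = Math.abs(nonPaAfter - nonPaBefore);
if (drift > 50) throw new Error(`COHORT GUARD FAIL: drift=${drift}`);
log(`✓ Cohort guard preserved (drift=${drift})`);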

Why: Catches accidental cross-state writes (e.g., bug where pa: prefix gets dropped from _id and writes go to bare-NPI docs touching other states). Across 9 PA carrier ingests, drift was 0 every time — the guard fires cleanly when wrong.

Cluster guard (every Pattern, every apply) ​

javascript
function assertSafeCluster(uri) {
  const m = uri.match(/^mongodb(?:\+srv)?:\/\/[^@]*@([^/?]+)/);
  if (!m) throw new Error("Bad URI");
  const host = m[1];
  if (!host.includes("askflorence-staging") && !host.includes("efsikmv")) {
    throw new Error(`CLUSTER GUARD FAIL: host=${host}`);
  }
  return host;
}

Per ADR 0004 — providers_staging writes go to staging cluster only; apex reads via cross-cluster Atlas PrivateLink.
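
A stricter variant of the guard is sketched below for comparison. It rests on assumptions not stated in the source: that staging hosts are Atlas hostnames under mongodb.net, and that a non-SRV URI may list several hosts. `assertStagingHosts` is a hypothetical name; the two markers are the ones the guard above already checks.

```javascript
// Sketch of a stricter guard. Assumption: staging hosts are Atlas hostnames
// under mongodb.net; the markers mirror the existing guard.
const STAGING_MARKERS = ["askflorence-staging", "efsikmv"];

function assertStagingHosts(uri) {
  // Credentials are optional here; the host list ends at the first "/" or "?".
  const m = uri.match(/^mongodb(?:\+srv)?:\/\/(?:[^@/]*@)?([^/?]+)/);
  if (!m) throw new Error("Bad URI");

  // Non-SRV URIs may list several host:port pairs separated by commas.
  const hosts = m[1].split(",").map((h) => h.replace(/:\d+$/, "").toLowerCase());

  for (const host of hosts) {
    const isAtlas = host.endsWith(".mongodb.net");
    const isStaging = STAGING_MARKERS.some((marker) => host.includes(marker));
    if (!isAtlas || !isStaging) {
      throw new Error(`CLUSTER GUARD FAIL: host=${host}`);
    }
  }
  return hosts;
}
```

Checking every host and the domain suffix rejects a lookalike such as a non-Atlas hostname that merely contains a staging marker, which a plain substring match would accept.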

Release snapshot tags (every Pattern, every deploy) ​

Per ENG-435 + ENG-437 cadence:

  • Before merging deploy PR: git tag pre-eng-XXX-<carrier>-YYYYMMDD-HHMM at main HEAD
  • After merge: git tag post-eng-XXX-<carrier>-YYYYMMDD-HHMM at merge commit
  • Push both. These create clean rollback points if the deploy regresses.
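
The tag names can be generated rather than typed by hand. The helper below is a minimal sketch: `snapshotTag` is a hypothetical name, the ticket and carrier values in the usage line are placeholders, and the use of UTC for the timestamp is an assumption because the source does not say which timezone the stamp uses.

```javascript
// Sketch: builds the pre/post tag names described above. Nothing here runs git.
function snapshotTag(phase, ticket, carrier, when = new Date()) {
  if (phase !== "pre" && phase !== "post") {
    throw new Error(`Bad phase: ${phase}`);
  }
  const pad = (n) => String(n).padStart(2, "0");
  // UTC is an assumption; switch to local getters if the team stamps in local time.
  const stamp =
    `${when.getUTCFullYear()}${pad(when.getUTCMonth() + 1)}${pad(when.getUTCDate())}` +
    `-${pad(when.getUTCHours())}${pad(when.getUTCMinutes())}`;
  return `${phase}-eng-${ticket}-${carrier.toLowerCase()}-${stamp}`;
}

// Usage with placeholder values:
const tag = snapshotTag("pre", "000", "examplecarrier", new Date(Date.UTC(2026, 0, 15, 9, 5)));
// tag === "pre-eng-000-examplecarrier-20260115-0905"
```

The resulting name is then passed to `git tag <name> <commit>` and pushed with `git push origin <name>`.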

Per-carrier discovery sequence (for any future SBE state) ​

For each carrier in the new state's marketplace:

  1. Marketplace HIOS prefix lookup — query /api/plans for the state to get prefix → issuer-name map
  2. Transparency-page discovery — search for <carrier>.com/transparency-in-coverage or footer link "Machine-Readable Files" or "Price Transparency"
  3. TOC verification — fetch TOC URL + grep for HIOS marketplace prefix
  4. If TOC has marketplace plans → Pattern A
  5. If TOC missing or empty → find-a-doctor SPA discovery
  6. Set up Chrome MCP fetch-wrapping + trigger search → check for NPI in JSON
  7. If NPI in JSON → Pattern B
  8. If NPI absent but PDF directory available → Pattern C (PDF)
  9. If NPI absent and only HTML directory + rate-limit testing fails → Pattern D (county heuristic)
  10. If sister product line exists with already-ingested network → Pattern E

Bias toward earlier (cheaper) patterns. Pattern D is the safety net; when you use it, state the accepted over-attribution risk in the commit message.
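
The sequence above can be sketched as a pure decision function that follows the listed order. This is illustrative only: `choosePattern` and the `findings` field names are hypothetical, and each flag records the outcome of one manual discovery step for a single carrier.

```javascript
// Sketch of the discovery sequence as a decision function.
// All field names on `findings` are hypothetical.
function choosePattern(findings) {
  const {
    tocHasMarketplacePlans = false, // steps 3-4
    npiInSpaJson = false,           // steps 6-7
    pdfDirectoryAvailable = false,  // step 8
    htmlDirectoryOnly = false,      // step 9
    rateLimitTestPassed = false,    // step 9
    sisterLineIngested = false,     // step 10
  } = findings;

  // Checked in the same order as the numbered list: first match wins.
  if (tocHasMarketplacePlans) return "A";
  if (npiInSpaJson) return "B";
  if (pdfDirectoryAvailable) return "C";
  if (htmlDirectoryOnly && !rateLimitTestPassed) return "D";
  if (sisterLineIngested) return "E";
  return null; // nothing matched: needs manual investigation
}

// Usage:
// choosePattern({ pdfDirectoryAvailable: true, sisterLineIngested: true }) returns "C"
```

Returning null instead of defaulting to a pattern keeps an undiscovered carrier from silently falling into the county heuristic.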


Adding a new state — checklist ​

Use this when starting Wave 2 for any SBE not yet ingested.

  • [ ] Identify the state's public shop & compare tool URL
  • [ ] Run Layer 1-3 validation: 1 subsidy scenario + same-RA test + cross-RA test against the live tool. Block if any layer fails.
  • [ ] Source the state's published rating-area → ZIP mapping (state DOI or marketplace publishes this annually)
  • [ ] Create data/{state}-rating-areas-{year}.ts with one anchor ZIP per RA (start with 1 per RA; add per-county anchors if Layer 2 reveals carrier-availability variance within a single RA)
  • [ ] Source the state's Standard Benefit Plan Designs (state-mandated cost-share canonical values per metal x CSR tier). Most SBEs publish this annually.
  • [ ] Branch the CA scraper pattern (scripts/db/scrape-covered-ca-2026.ts) for the state's tool. Selectors will differ.
  • [ ] Run the scrape, output JSON.
  • [ ] Build the normalize script: JSON → plans collection doc shape, applying the federal default age curve to derive all 65 ages from the anchor age.
  • [ ] Build the state-subsidy helper if the state has its own subsidy program (see state-subsidies.md for the catalogue).
  • [ ] Add calculate{State}Eligibility() in apps/web/src/lib/owned-plans.ts mirroring calculateNyEligibility().
  • [ ] Capture puf.qualityRating in the normalize step, NOT in a separate ingest. SBE-state scrapes universally expose CMS QRS issuer ratings inline (Pennie: issuerQualityRating.{QualityRating, GlobalRating, SummaryClinicalQualityManagement, SummaryEnrolleeExperience, SummaryPlanEfficiencyAffordabilityAndManagement, EffectiveDate}; equivalent fields expected on Covered California's Shop & Compare scrape + NYSOH plan-list response). Map to the canonical schema ({available, year, globalRating, clinicalQualityManagementRating, enrolleeExperienceRating, planEfficiencyRating, globalNotRatedReason, clinicalNotRatedReason, enrolleeNotRatedReason, efficiencyNotRatedReason, effectiveDate, qualitySource} — same shape scripts/db/ingest-qrs-ratings.js writes for federal-30). Coverage typically 70-100% (carriers without 3 years of enrollment go unrated; matches federal QRS pattern). DO NOT skip this. PA Phase A.2.2 missed it; had to ship a follow-up ENG-418 A.5 augment to backfill from the same scrape JSONs (zero new scraping needed — the data was already in our hands).
  • [ ] Atlas snapshot B (pre-apply, mandatory — descriptive name includes pre-apply row counts).
  • [ ] Dry-run plans write → audit harness 100% → founder approval → apply.
  • [ ] Atlas snapshot C (post-apply marker — descriptive name includes new row counts).
  • [ ] zip_county cleanup (NEW — caught in ENG-418 Phase A.4 apex smoke): every ZIP for the new state has sbeRedirect set from the legacy SBE-redirect seed. /api/counties reads this field and short-circuits, so even after the constants flip the route still returns sbeRedirect. Run:
    js
    db.zip_county.updateMany(
      { state: '<NEW_STATE>' },
      { $set: { regionId: <fips→ra via state-specific map> }, $unset: { sbeRedirect: '' } }
    );
    Verify post-flight: every ZIP for the state has regionId populated + no sbeRedirect field. Sample a few ZIPs to confirm the regionId matches the state's RA-by-FIPS table.
  • [ ] Add state code to OWNED_DATA_STATES in packages/shared/src/plans/owned-plans.ts.
  • [ ] Remove state from STATE_BASED_MARKETPLACES in packages/shared/src/constants/index.ts.
  • [ ] Re-run federal regression sweep. Must stay 0 diffs.
  • [ ] Apex smoke — at least one ZIP per RA: /api/counties?zip=… returns counties (not sbeRedirect), /api/plans returns real plan list (not empty), one plan-detail load shows copays + SBC link + formulary link + provider directory link.
  • [ ] Add validation page at docs/validation/{state}-{year}.md with L1-L4 results + snapshot IDs (A, B, C).
  • [ ] Update the state-by-state log table above with the new row.
  • [ ] If the state has carrier-availability quirks within an RA, add per-county anchors AND document the anomaly in "Lessons learned" so the next state doesn't repeat the same investigation.
  • [ ] Phase B (drugs) + Phase C (providers) follow plans — file as separate issues mirroring NY ENG-412 Phase 1 + Phase 2. Don't try to do all three in one PR; the snapshot + cohort-guard discipline gets harder to verify per layer.
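
For the zip_county cleanup item in the checklist above, regionId differs by county, so one way to apply it is a bulkWrite with one updateMany per county, followed by a post-flight check. This is a minimal sketch under stated assumptions: `countyFips` is a hypothetical field name on zip_county docs, `buildZipCountyOps` and `findUncleanZips` are hypothetical helpers, and the FIPS and rating-area values in the usage comment are made up.

```javascript
// Sketch: one updateMany per county, because regionId differs by county.
// `countyFips` is a hypothetical field name; the FIPS -> RA map is
// state-specific input you supply.
function buildZipCountyOps(state, raByFips) {
  return Object.entries(raByFips).map(([countyFips, regionId]) => ({
    updateMany: {
      filter: { state, countyFips },
      update: { $set: { regionId }, $unset: { sbeRedirect: "" } },
    },
  }));
}

// Post-flight: returns ZIP docs still missing regionId or still carrying sbeRedirect.
function findUncleanZips(docs) {
  return docs.filter((d) => d.regionId == null || "sbeRedirect" in d);
}

// Usage (made-up FIPS codes and rating areas):
// const ops = buildZipCountyOps("<NEW_STATE>", { "00001": 1, "00003": 2 });
// await db.collection("zip_county").bulkWrite(ops);
// const docs = await db.collection("zip_county").find({ state: "<NEW_STATE>" }).toArray();
// if (findUncleanZips(docs).length > 0) throw new Error("zip_county cleanup incomplete");
```

An empty result from `findUncleanZips` covers the two post-flight conditions in the checklist; sampling ZIPs against the state's RA-by-FIPS table is still a separate manual step.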

AskFlorence Internal Documentation. Not for public distribution.