Data sources & ingest cadence
Purpose: Canonical record of every external data source we ingest into MongoDB, with its refresh cadence, lineage, and the script that owns each pipeline. Serves as SOC 2 CC8.1 (Change Management) and CMS EDE Phase 3 data-provenance evidence.
Convention: every row in the Sources table below maps to exactly one ingest script under `scripts/db/` and one MongoDB collection. The script's header docstring embeds the refresh playbook.
Sources
| Domain | Source | Format | License | Cadence | Script | MongoDB collection(s) |
|---|---|---|---|---|---|---|
| ZIP → county → state (SBE-state ZIPs) | CMS Marketplace API /counties/by/zip/{zip} (canonical for byte-for-byte audit parity) | JSON snapshot | Public domain (CMS) | Annual (plan-year transition) | scripts/db/build-cms-snapshot.js → scripts/db/seed-sbe-zips-from-cms.js | zip_county (per-county redirect docs with _seedSource marker) |
| ZIP → county (federal-30 + NY) | U.S. Census ZCTA + hand-curated NY data | JSON | Public / internal | Annual (plan-year transition) | scripts/db/load-zip-county.js | zip_county (county docs) |
| ZIP USPS-completeness universe (federal-30 + NY) | zipcodes npm package (USPS-derived, MIT) - catches PO-Box-only / business-only / single-building ZIPs Census ZCTA misses | CSV snapshot | MIT (npm) | Annual (plan-year transition) - upgrade to HUD ZIP-County crosswalk recommended | scripts/db/build-usps-snapshot.js → scripts/db/audit-federal-completeness-tier-0-5.js → scripts/db/seed-federal-tier-0-5.js | zip_county (_seedSource: "federal-tier-0-5-audit-2026-05-01") |
| Plans, premiums, rate areas (federal-30) | CMS Federal Marketplace PUF (Plan Attributes, Benefits & Cost Sharing, Service Areas) | CSV | Public domain (CMS) | Annual (CMS releases ~Sept) | scripts/db/ingest-puf-augment.js, ingest-qrs-ratings.js | plans, regions, plan_years |
| Plans, premiums (NY) | NY State of Health Essential Plan + DFS rate filings | JSON | Public (NY DFS) | Annual | scripts/db/load-ny-2026.js | plans, regions, plan_years |
| Stale ZIP redirects + PO Box ZIPs | Manual curation from CMS/Census discrepancies | Hardcoded JS array | n/a | Ad-hoc (audit-driven) | scripts/db/fix-stale-zips.js | zip_county (specific overrides) |
| (SUPERSEDED) ZIP → state (SBE redirects, original) | U.S. Census 2020 ZCTA-to-County relationship file | Pipe-delimited TXT | — | — | scripts/db/seed-sbe-zips.js (deprecated) | — |
Why CMS replaced Census ZCTA for SBE-state ZIPs (2026-04-30): the original seed used Census-derived FIPS, which doesn't always agree with what the CMS Marketplace API returns for the same ZIP. Future audits compare our DB byte-for-byte against CMS as the canonical source for ACA marketplace ZIP→county mapping; using a different source breaks that property. The corrective seed switched to querying CMS directly for every SBE-state ZIP, persisting the response to `scripts/db/data/sbe-zip-cms-snapshot.json` for reproducibility, and inserting per-county docs with FIPS anchors. See change-log entries from 2026-04-30 for the full lineage.
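The parity property described above can be checked mechanically. The following is a minimal sketch, not the actual audit script: it assumes the snapshot maps each ZIP to an array of county FIPS codes and that `zip_county` docs expose `zip` and `countyFips` fields. The snapshot shape, the `zip` field name, and the function name `findParityMismatches` are assumptions made for illustration.

```javascript
// Hypothetical sketch: compare the county FIPS set per ZIP in a CMS snapshot
// against the FIPS set found in zip_county docs.
// Assumed shapes (not confirmed by the real scripts):
//   snapshot: { "<zip>": ["<fips>", ...] }
//   dbDocs:   [{ zip: "<zip>", countyFips: "<fips>" }, ...]
function findParityMismatches(snapshot, dbDocs) {
  // Group the DB docs by ZIP so each ZIP maps to a set of FIPS codes.
  const dbByZip = new Map();
  for (const doc of dbDocs) {
    if (!dbByZip.has(doc.zip)) dbByZip.set(doc.zip, new Set());
    dbByZip.get(doc.zip).add(doc.countyFips);
  }

  const mismatches = [];
  // Only ZIPs present in the snapshot are checked.
  for (const [zip, cmsFips] of Object.entries(snapshot)) {
    const cmsSet = new Set(cmsFips);
    const dbSet = dbByZip.get(zip) ?? new Set();
    const missingInDb = [...cmsSet].filter((fips) => !dbSet.has(fips)).sort();
    const extraInDb = [...dbSet].filter((fips) => !cmsSet.has(fips)).sort();
    if (missingInDb.length > 0 || extraInDb.length > 0) {
      mismatches.push({ zip, missingInDb, extraInDb });
    }
  }
  return mismatches;
}
```

An empty result means every snapshot ZIP has exactly the same county set in the DB; any entry lists which FIPS codes CMS returned that the DB lacks, and vice versa.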
Refresh cadence
Every plan-year transition (typically October-November in the year before the plan year, e.g., October 2026 for 2027 plan year) runs the same sequence:
- Refresh `STATE_BASED_MARKETPLACES` in `src/lib/constants.ts`. A state may have transitioned to/from SBE for the new plan year. Cross-check with CMS's Marketplace operating-status page.
- Federal-30 + NY plan + ZIP refresh: primary annual ingest. Owner: `ingest-puf-augment.js` + `load-zip-county.js` + `load-ny-2026.js`.
- SBE-state ZIP refresh: see `scripts/db/seed-sbe-zips.js` header for the full playbook (download Census ZCTA → regenerate snapshot CSV → dry-run → apply staging → apply prod).
- Tier 0 federal-completeness audit + apply: see Tier 0 doc. Catches Census-ZCTA-tracked federal+NY ZIPs missing from DB.
- Tier 0.5 USPS-completeness audit + apply: see Tier 0.5 doc. Catches PO-Box-only / business-only / single-building ZIPs that Census ZCTA misses but USPS recognizes. Playbook: `npm update zipcodes` → `node scripts/db/build-usps-snapshot.js` → `node scripts/db/audit-federal-completeness-tier-0-5.js` (default concurrency=5; CMS rate-limits at concurrency=10) → `node scripts/audit/validate-cms-errors.js --tier=1` for any CMS errors → triage report → phased `seed-federal-tier-0-5.js --apply` per Constraint 2 (with mongodump backup before each batch per Constraint 1). Recommended upgrade: swap `zipcodes` npm for HUD ZIP-County crosswalk (https://www.huduser.gov/portal/datasets/usps_crosswalk.html, quarterly refresh, free + HUD account).
- Tier 1 + Tier 1.5 audits re-run on prod. Expect TRUE 100% pass post-refresh. Run `validate-cms-errors.js --tier=1` and `--tier=1.5` for any leftover errors before declaring 100%.
- Append change-log entry in `change-log.md` with timestamp + commit SHA + counts.
For all the operational patterns (CMS rate-limit defenses, backup discipline, phased apply protocol, audit-script behaviors, post-apply validation), see the audit operations runbook. Required reading for any future plan-year refresh or ad-hoc audit work.
Quarterly (within a plan year) refreshes are optional and only triggered if a known data drift is observed.
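The Tier 0.5 step above depends on keeping the number of in-flight CMS requests below the point where rate limiting starts (default concurrency=5). A bounded-concurrency mapper along these lines is one way to enforce that. This is a sketch only: `mapWithConcurrency` is a hypothetical helper, not a function taken from the audit scripts, and the worker passed to it stands in for whatever per-ZIP CMS lookup the real script performs.

```javascript
// Hypothetical helper: run `worker` over `items` with at most `limit` calls in flight.
// Results are returned in the same order as `items`.
async function mapWithConcurrency(items, limit, worker) {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results = new Array(items.length);
  let next = 0; // index of the next item that has not been claimed yet

  // Each lane claims one item at a time until none are left.
  async function runLane() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const laneCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => runLane()));
  return results;
}
```

Usage would look like `await mapWithConcurrency(zips, 5, lookupZip)`, where `lookupZip` is a placeholder for the per-ZIP request function. If any worker call rejects, the whole call rejects, so a real audit would probably catch errors inside the worker and record them for the `validate-cms-errors.js` pass.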
Provenance + audit trail
- Source URLs: every script's header docstring records the upstream data URL the snapshot was derived from.
- Committed snapshots: source data is committed to the repo under `scripts/db/data/` (e.g., `sbe-zip-state-2020.csv` for SBE ZIPs). Allows reproducibility, airgap-safe re-runs, and `git blame`-able lineage.
- CloudTrail + Atlas audit log: every `--apply` run produces MongoDB write operations recorded by Atlas's audit log (HIPAA tier) and any AWS-side credential fetches recorded by CloudTrail. Cross-reference at audit time.
- Tier audits (`scripts/audit/tier-1` through `tier-5`): structural validation of ingest correctness, run after every refresh.
SBE-state ZIP redirect data — specific notes
The seed-sbe-zips.js pipeline has three layers of safety guards documented in the script header. Key invariants:
- Federal-30 + NY county docs are never modified by this script. Guard 2 detects any `countyFips` field on an existing doc and skips the source row with a CONFLICT log.
- Border ZIPs that span SBE + federal counties (57 such ZIPs identified in 2026 ingest, 12 of which already had `fix-stale-zips.js` redirect entries; 45 of which are skipped as conflicts) require per-county redirect handling, which is out of scope for the current schema. Tracked as future enhancement on Issue #68.
- Marketplace strings are sourced from `STATE_BASED_MARKETPLACES` in `src/lib/constants.ts` (the application's single source of truth). The script copies them inline; a cross-check at script start would catch divergence (TODO if drift becomes a concern).
- IL was added to `STATE_BASED_MARKETPLACES` in this seed (2026-04-30) reflecting Get Covered Illinois launch in 2025 plan year. Prior to this seed, IL ZIPs returned 404 from `/api/counties` (then HTTP-403 from CMS via the temp-fix fallback).
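The Guard 2 invariant in the first bullet can be illustrated with a small pure function. This is a sketch under assumptions, not the code in `seed-sbe-zips.js`: the `zip` field on source rows, the `Map` of existing docs keyed by ZIP, and the name `partitionByGuard2` are all hypothetical. Only the `countyFips` field and the skip-with-CONFLICT behavior come from the description above.

```javascript
// Hypothetical sketch of the Guard 2 check: a source row is skipped whenever any
// existing zip_county doc for the same ZIP already carries a countyFips value.
// existingByZip: Map of zip -> array of existing docs (assumed shape).
function partitionByGuard2(sourceRows, existingByZip) {
  const toInsert = [];
  const conflicts = [];
  for (const row of sourceRows) {
    const existing = existingByZip.get(row.zip) ?? [];
    // Assumption: "has a countyFips field" means the value is neither undefined nor null.
    const hasCountyDoc = existing.some(
      (doc) => doc.countyFips !== undefined && doc.countyFips !== null
    );
    if (hasCountyDoc) {
      conflicts.push({ zip: row.zip, reason: "CONFLICT: existing doc has countyFips" });
    } else {
      toInsert.push(row);
    }
  }
  return { toInsert, conflicts };
}
```

Because the function never mutates `existingByZip`, the federal-30 + NY county docs stay untouched by construction; the `conflicts` array is what would feed a conflict log like the archive below.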
Conflict log archive (45 ZIPs as of 2026-04-30)
Border ZIPs where Census data claims SBE state but our verified federal data wins. Listed for audit + future per-county-redirect feature scope:
ME/NH border: 03579
DE/MD border: 19973 (also covered by fix-stale-zips: 21874, 21912)
VA/WV border: 20135 (also fix-stale-zips: 24604, 24622)
NC/VA border: 27048, 28675
AL/GA border: 36855 (also fix-stale-zips: 30165, 30741)
TN/VA border: 37642, 37752
TN/KY border: 38079, 38549, 42223, 42602, 40965 (last via fix-stale-zips)
IA/MN border: 51360
SD/MN border: 56136, 56144, 56164, 56219, 56220, 56257, 57026, 57030, 57068
ND/MN border: 58030, 58225 (also fix-stale-zips: 56027, 56744)
MT/ID border: 59847
NM border: (fix-stale-zips: 88430, 87328)
CO border: (fix-stale-zips: 81324)

These ZIPs continue to serve federal county data (i.e., the user gets a real plan-search experience for the federal county). Per-county SBE redirect support, where the user could pick "this county is in MD, redirect me" vs "this county is in DE, search plans", is a future enhancement to the /api/counties response shape + frontend useCalculator() consumer.
Optional CMS API cross-check
`scripts/db/seed-sbe-zips.js --verify-cms-sample` cross-checks 200 ZIPs against the CMS Marketplace API at 1 req/sec (~3 minutes). Recommended on the first seed of any plan year; skippable on quarterly refreshes if previously clean. The script refuses to `--apply` if the CMS sample finds mismatches (configurable behavior).
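The cross-check behavior can be sketched as a sequential loop with a fixed pause between requests and an apply gate on the result. This is an illustration only, not the script's implementation: `verifySample`, the injected `fetchFips` function, and the `expectedByZip` shape are assumptions. At the default 1000 ms delay, 200 ZIPs take a little over three minutes, which matches the estimate above.

```javascript
// Hypothetical sketch: check a sample of ZIPs one at a time, pausing between requests,
// and report whether an apply should be allowed.
//   sampleZips:    array of ZIP strings to check
//   expectedByZip: { "<zip>": ["<fips>", ...] } (assumed shape of the local data)
//   fetchFips:     async (zip) => array of FIPS strings; stands in for the CMS call
async function verifySample(sampleZips, expectedByZip, fetchFips, delayMs = 1000) {
  const mismatches = [];
  for (let i = 0; i < sampleZips.length; i++) {
    // Pause before every request except the first to hold roughly 1 req/sec.
    if (i > 0) await new Promise((resolve) => setTimeout(resolve, delayMs));
    const zip = sampleZips[i];
    // Compare as sorted, comma-joined strings so ordering differences do not count.
    const actual = [...(await fetchFips(zip))].sort().join(",");
    const expected = [...(expectedByZip[zip] ?? [])].sort().join(",");
    if (actual !== expected) mismatches.push(zip);
  }
  return {
    checked: sampleZips.length,
    mismatches,
    applyAllowed: mismatches.length === 0,
  };
}
```

Injecting `fetchFips` keeps the loop testable without network access; the caller decides what to do when `applyAllowed` is false, which is where a configurable override would plug in.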
Future improvements
- Self-checking: at script start, parse `src/lib/constants.ts` and assert `STATE_BASED_MARKETPLACES` matches the script's inlined copy. Currently manual.
- Per-county SBE redirect support for border ZIPs (Issue #68 follow-up scope).
- Atlas-CLI-driven snapshot refresh: a one-command `scripts/db/refresh-sbe-snapshot.sh` that pulls the latest Census file, regenerates the CSV, and emits a diff report.
- Annual refresh runbook landed as a GitHub Action that opens a PR with the regenerated CSV + dry-run output for human review.
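The self-checking item could start as something like the sketch below. It assumes `STATE_BASED_MARKETPLACES` is declared as a flat object literal keyed by two-letter state code, with the closing brace at the start of its own line; the real declaration in `src/lib/constants.ts` may differ, and the regex approach is deliberately simple and brittle. The function names are hypothetical, and only the state codes are compared, not the marketplace strings.

```javascript
// Hypothetical sketch: extract the two-letter keys of STATE_BASED_MARKETPLACES
// from the TypeScript source text. Assumes a flat object literal whose closing
// brace sits at the start of its own line.
function extractStateCodes(constantsSource) {
  const block = constantsSource.match(/STATE_BASED_MARKETPLACES[^=]*=\s*\{([\s\S]*?)\n\}/);
  if (!block) throw new Error("STATE_BASED_MARKETPLACES not found in source");
  const keyPattern = /^\s*["']?([A-Z]{2})["']?\s*:/gm;
  return [...block[1].matchAll(keyPattern)].map((m) => m[1]).sort();
}

// Throws when the script's inlined copy has a different set of state codes.
function assertNoDrift(constantsSource, inlinedCopy) {
  const fromConstants = extractStateCodes(constantsSource);
  const fromScript = Object.keys(inlinedCopy).sort();
  const missing = fromConstants.filter((code) => !fromScript.includes(code));
  const extra = fromScript.filter((code) => !fromConstants.includes(code));
  if (missing.length > 0 || extra.length > 0) {
    throw new Error(
      `STATE_BASED_MARKETPLACES drift: missing in script [${missing}], extra in script [${extra}]`
    );
  }
}
```

At script start this would be called with the file contents, for example `assertNoDrift(fs.readFileSync("src/lib/constants.ts", "utf8"), INLINED_COPY)`, where `INLINED_COPY` stands for the script's own copy. A sturdier version would import the constant through a TypeScript-aware loader instead of pattern-matching the source.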