Appearance
Audit operations runbook (zip_county audit family)
Purpose: Consolidated operational guidance for running the zip_county audit family (Tier 0, Tier 0.5, Tier 1, Tier 1.5) cleanly. Distilled from the 2026-05-01 Tier 0.5 session that drove Tier 1 + Tier 1.5 to TRUE 100% match. Applies to any future plan-year ingest, any ad-hoc data-integrity sweep, and any new tier added to the family.
Scope: operational patterns only. The "what does Tier X audit?" content lives in each tier's dedicated validation doc (Tier 0, Tier 0.5, Tier 1, Tier 1.5).
TL;DR for "I'm running an audit, what do I need to know?"
- Backup before any prod write.
mongodumpofzip_county, sha256 the BSON, store under~/Documents/askflorence-db-backups/zip_county/<TAG>/, write aMANIFEST.txt. See "Backup discipline" below for full protocol. - CMS rate limits at concurrency=10. Use concurrency=3-5 for any audit that hits CMS broadly. Build a retry-validation pass for any leftover errors.
- Two-pass pattern is mandatory. Initial audit run + a second retry-validation pass for anything that errored. Never accept "X CMS errors" as a final state — they're UNKNOWN until validated. Use
scripts/audit/validate-cms-errors.js. - Phased apply per Constraint 2. Audit produces report -> stop -> user reviews -> user greenlights specific batches -> apply with backup -> smoke -> repeat. Never auto-insert.
- Audit reports overwrite. Every audit script writes to a fixed path. Re-runs overwrite. Always commit before re-running, or rename outputs (
-post-apply-confirm-<date>.jsonis a good convention). - In-memory state goes stale during long audits. If you apply mid-audit, the audit's in-memory
ourCountiessnapshot is from script start, not current DB. Kill + restart fresh; don't try to race. - Progress files at
/tmp/audit-progress-<name>.jsonresume across runs. This is great for crash recovery, dangerous when you change audit logic and want a fresh evaluation. Delete the progress file when re-running with patched logic.
CMS rate-limit defensive patterns
The CMS Marketplace API will throttle at sustained high-concurrency loads. The 2026-05-01 Tier 0.5 audit hit HTTP 429 on 2,352 of 4,347 ZIPs at concurrency=10. Two patterns to avoid this:
Pattern A: lower default concurrency.
- Tier 0 used concurrency=10 against ~13K residential ZIPs and reported clean. The hot zone is ~20K+ ZIPs at concurrency=10.
- Recommended default for new audits: concurrency=5. If you see ANY 429s in the first ~500 zips, drop to concurrency=3 for the rest.
audit-federal-completeness-tier-0-5.jsdefaults toCMS_CONCURRENCY = 5(was 10; lowered post-session).
Pattern B: retry-validation pass. Even with low concurrency, transient errors happen (network blips, CMS deploys, etc.). Always run a retry-validation pass after an initial audit:
bash
MONGODB_URI=$PROD_READ_URI CMS_API_KEY=$KEY \
node scripts/audit/validate-cms-errors.js --tier=1
MONGODB_URI=$PROD_READ_URI CMS_API_KEY=$KEY \
node scripts/audit/validate-cms-errors.js --tier=1.5validate-cms-errors.js retries each errored ZIP at concurrency=1 with exponential backoff (5s/10s/20s/40s/80s, 5 retries) and re-classifies as match/mismatch/still-failed using the same DB-vs-CMS comparison the audit does. Result drops into audit-tier-N-rate-limit-validation.json.
The 2026-05-01 session validated 33 of 33 Tier 1 + 26 of 26 Tier 1.5 rate-limit errors as MATCH (no real mismatches were hiding). This pattern is now mandatory for closing out any audit.
Acceptance: an audit is "complete" only when:
- exactMatches + retried-validated-as-match == totalZips, OR
- the remaining items have been investigated as real data gaps and either fixed (with their own backup-and-greenlight protocol) or documented as known exclusions.
Backup discipline (Constraint 1)
Every prod write to zip_county requires an immediately-preceding fresh backup. Per-batch, not per-session.
Backup protocol
bash
PROD_READ_URI=$(aws --profile askflorence-prod secretsmanager get-secret-value \
--secret-id prod/mongodb/app-read --query SecretString --output text)
TS=$(date -u +%Y%m%dT%H%M%SZ)
TAG="pre-<batch-descriptor>-${TS}"
BACKUP_DIR="$HOME/Documents/askflorence-db-backups/zip_county/${TAG}"
mkdir -p "$BACKUP_DIR"
# 1. Dump
mongodump --uri="$PROD_READ_URI" --collection=zip_county --out="$BACKUP_DIR/dump"
# 2. Verify record count matches prod
DUMP_BSON="$BACKUP_DIR/dump/askflorence/zip_county.bson"
PROD_COUNT=$(mongosh "$PROD_READ_URI" --quiet --eval 'db.zip_county.countDocuments()')
BSON_COUNT=$(bsondump --quiet "$DUMP_BSON" 2>/dev/null | wc -l | tr -d ' ')
[ "$BSON_COUNT" = "$PROD_COUNT" ] && echo "VERIFIED" || { echo "ABORT - count mismatch"; exit 1; }
# 3. sha256 for tamper detection
SHA256=$(shasum -a 256 "$DUMP_BSON" | awk '{print $1}')
# 4. MANIFEST.txt with backup tag, sha256, doc counts, planned _seedSource marker, restore commands
cat > "$BACKUP_DIR/MANIFEST.txt" <<EOF
backup_tag: ${TAG}
created_utc: $(date -u +%Y-%m-%dT%H:%M:%SZ)
prod_collection: askflorence.zip_county
pre_apply_doc_count: $PROD_COUNT
backup_bson_record_count: $BSON_COUNT
bson_file: dump/askflorence/zip_county.bson
bson_sha256: $SHA256
bson_size_bytes: $(stat -f%z "$DUMP_BSON")
planned_seedSource_marker: <your-marker-here>
batch_scope: <description>
expected_post_apply_count: <pre-count + planned-inserts>
restore_command_targeted_rollback: |
node scripts/db/<your-seed>.js --rollback
restore_command_full_collection_replace: |
mongorestore --uri="\$PROD_WRITE_URI" --nsInclude='askflorence.zip_county' --drop "$BACKUP_DIR/dump"
EOFBackup storage
Currently local-only at ~/Documents/askflorence-db-backups/zip_county/<TAG>/ because s3://askflorence-data/db-backups/ blocks the SSO admin role at the bucket-policy layer. Until that's resolved (separate ops issue), local is the canonical location.
What goes in the BSON
A full zip_county snapshot, not just the affected docs. This makes restore unambiguous: mongorestore --drop replaces the entire collection back to the pre-batch state. Per-doc rollback via seed-script --rollback filters by _seedSource marker (preferred for surgical scenarios), but having the full snapshot is the nuclear-option safety net.
Restore patterns
Targeted (preferred — surgical):
bash
node scripts/db/<seed-script>.js --rollback
# Removes ONLY docs with the seed's _seedSource marker
# Each seed script's --rollback is scoped to its own markerFull collection replace (nuclear — replaces collection):
bash
PROD_WRITE_URI=$(aws --profile askflorence-prod secretsmanager get-secret-value \
--secret-id prod/mongodb/app-write --query SecretString --output text)
mongorestore --uri="$PROD_WRITE_URI" \
--nsInclude='askflorence.zip_county' --drop \
~/Documents/askflorence-db-backups/zip_county/<TAG>/dumpThe full-replace is appropriate when targeted-rollback is insufficient (e.g., docs were mutated post-insert by another process). Confirm with user before triggering.
mongorestore --dryRun quirk
mongorestore --dryRun in MongoDB Database Tools v100.x always reports "0 document(s) restored successfully" even when the BSON file is fully recognized and parseable. This is a misleading display quirk, NOT a failure. The fact that mongorestore prints dry run completed with no errors after checking for collection data in zip_county.bson means the BSON is restorable. Don't trust the "0 documents" line; verify integrity via bsondump | wc -l matching prod count + a sha256 + a sample round-trip instead.
Phased apply discipline (Constraint 2)
The audit does NOT auto-insert. Flow per batch:
- Dry-run audit produces the report (per-class + per-state counts, anomaly flags, sample mismatches, full JSON).
- HARD STOP. Surface the report to the user (status comment on the tracker issue + chat-surface). Wait.
- User makes phased decisions explicitly. Examples: "apply AZ + TX classes only, hold the rest"; "the discrepancy class - investigate the KY airport cases first"; "state X has 50+ gaps - investigate before any apply."
- Execute ONLY user-greenlighted batches, one at a time.
- Each batch: backup -> dry-run preview of THAT batch -> user re-confirms -> apply -> smoke matrix -> status comment on tracker.
- If unsure whether the user has greenlighted a specific apply: ask, don't assume.
Per-batch filters
Seed scripts MUST support per-state and per-class filters:
--state=<comma-list>— restrict to specified states--class=<comma-list>— restrict to specified classes (insertable / discrepancy / non_residential / etc.)--dry-run(default) /--apply/--rollback- Idempotency guard at apply-time (skip if (zip, countyFips) already exists)
- State allowlist enforcement
_seedSource: "<descriptive-marker-with-date>"on every insert
Reference: scripts/db/seed-federal-tier-0-5.js is the canonical example.
Audit-script behaviors to know about
Reports overwrite in place
Every audit script writes to a fixed path:
scripts/db/data/federal-tier-0-5-gap-report-<date>.jsonaudit-tier-1-results.json(project root)audit-tier-1-5-results.json(project root)
Re-runs overwrite. Always commit the original before re-running, or rename outputs to preserve history. The Tier 0.5 session uses -post-apply-confirm-<date>.json to keep the post-apply confirmation distinct from the original triage report.
The root-level audit-tier-*.json files are gitignored to avoid noise. The canonical historical snapshots live under scripts/db/data/tier-N-<date>.json (commit-tracked).
Progress files at /tmp/audit-progress-<name>.json
The audit harness saves progress every 200 zips so a crashed audit can resume. This is great for long runs. It's a footgun when you change audit logic and want a fresh evaluation: the progress file marks the previously-processed zips as done, the re-run skips them, and your patched logic never runs on them.
Always delete the progress file when:
- You changed the audit script's comparison logic
- You applied DB writes that change what counts as a match for previously-audited zips
- You want a clean baseline reading for any reason
bash
rm -f /tmp/audit-progress-tier-1.json
rm -f /tmp/audit-progress-tier-1-5.jsonIn-memory state goes stale during long audits
The audit script loads the entire ourCounties snapshot at start (line ~34 of tier-1-zip-county.js: db.zip_county.aggregate(...)). It then iterates over that loaded list and queries CMS live for each zip.
If you apply DB writes WHILE an audit is running, the audit's in-memory ourCounties for the affected zips is STALE. The audit will compare CMS's response against the OLD DB state and may report a spurious mismatch even though the new DB state is correct.
Pattern: kill + restart the audit after any apply. Don't try to race.
bash
pkill -f "node scripts/audit/tier-N-..."
sleep 2
rm -f /tmp/audit-progress-tier-N.json
node scripts/audit/tier-N-....jsTier 1 unsupported-doc handling (post-2026-05-01 patch)
scripts/audit/tier-1-zip-county.js was patched on 2026-05-01 to pre-fetch all (zip, countyFips) tuples from unsupported-class or non-federal-state docs and subtract them from the CMS-side comparison. Without this patch, future inserts of unsupported docs (territory ZIPs, corporate-only ZIPs, etc.) would surface as "extra in CMS" / "county-count mismatch" false positives.
The patch is permanent. Future Tier 1 runs benefit without action. Same logic should be carried forward if a future Tier 2/3 etc. audit needs equivalent handling.
Post-apply validation
After any prod-write batch, three gates must pass:
| Gate | Script | Expected result |
|---|---|---|
| Calculator baseline diff | npx tsx scripts/audit/calculator-baseline-diff.ts | ZERO DIFFS on all 12 scenarios |
| Tier 1 (federal zip-county) | node scripts/audit/tier-1-zip-county.js | 100% exact match |
| Tier 1.5 (SBE zip-county) | node scripts/audit/tier-1-5-sbe-zip-county.js | 100% exact match |
Plus the relevant tier audit re-run (Tier 0 / Tier 0.5 / etc.) to confirm the gap class is closed.
For audits with leftover CMS errors: ALWAYS run validate-cms-errors.js before declaring 100%.
Documentation conventions
Each audit + apply session lands documentation across THREE files:
docs/validation/<tier-name>.md— the audit report itself: methodology, summary, per-class breakdown, refresh playbook. Stays canonical for future plan-year refreshes.docs/infrastructure/data-sources.md— annual refresh playbook entry. Update with the new audit step + lessons learned.docs/infrastructure/change-log.md— dated entry with backup tags, batch scope, verification matrix, rollback paths, outstanding follow-ups.
Plus this runbook gets updated whenever a new operational pattern emerges that future runs should benefit from.
Cross-references
- Tier 0 audit — original Census-derived federal completeness audit
- Tier 0.5 audit — USPS-derived complement to Tier 0; source of most of these operational patterns
docs/infrastructure/data-sources.md— annual refresh playbookdocs/infrastructure/change-log.md— change historyscripts/audit/validate-cms-errors.js— canonical retry-validation scriptscripts/db/fix-tier-1-completeness-gaps.js— single-doc tuple fix pattern (for ad-hoc data-gap fixes)scripts/db/seed-federal-tier-0-5.js— full-batch seed pattern with--state/--classfilters
Open follow-ups (from the 2026-05-01 session)
- S3 backup access — local-only until ops grants the SSO admin role bucket-policy access on
s3://askflorence-data/db-backups/. - Tier 0.5b - tuple-level completeness sweep — defensive audit that, for every ZIP already in DB, queries CMS and inserts any missing (zip, countyFips) tuples. Catches what zip-level Tier 0.5 misses (the 50613 IA Bremer County pattern). One known instance fixed; could surface more.
- HUD ZIP-County crosswalk upgrade — replaces
zipcodesnpm before next plan-year refresh. Quarterly-refreshed, free + HUD account, catches the npm-stale extras automatically. - Calculator 404 message refinement — frontend-side: "ZIP not recognized; check the digits and try your home address" beats bare "Zip code not found."