Audit operations runbook (zip_county audit family)

Purpose: Consolidated operational guidance for running the zip_county audit family (Tier 0, Tier 0.5, Tier 1, Tier 1.5) cleanly. Distilled from the 2026-05-01 Tier 0.5 session that drove Tier 1 + Tier 1.5 to TRUE 100% match. Applies to any future plan-year ingest, any ad-hoc data-integrity sweep, and any new tier added to the family.
Scope: operational patterns only. The "what does Tier X audit?" content lives in each tier's dedicated validation doc (Tier 0, Tier 0.5, Tier 1, Tier 1.5).

TL;DR for "I'm running an audit, what do I need to know?"

Backup before any prod write. mongodump of zip_county, sha256 the BSON, store under ~/Documents/askflorence-db-backups/zip_county/<TAG>/, write a MANIFEST.txt. See "Backup discipline" below for full protocol.
CMS rate limits at concurrency=10. Use concurrency=3-5 for any audit that hits CMS broadly. Build a retry-validation pass for any leftover errors.
Two-pass pattern is mandatory. Initial audit run + a second retry-validation pass for anything that errored. Never accept "X CMS errors" as a final state — they're UNKNOWN until validated. Use scripts/audit/validate-cms-errors.js.
Phased apply per Constraint 2. Audit produces report -> stop -> user reviews -> user greenlights specific batches -> apply with backup -> smoke -> repeat. Never auto-insert.
Audit reports overwrite. Every audit script writes to a fixed path. Re-runs overwrite. Always commit before re-running, or rename outputs (-post-apply-confirm-<date>.json is a good convention).
In-memory state goes stale during long audits. If you apply mid-audit, the audit's in-memory ourCounties snapshot is from script start, not current DB. Kill + restart fresh; don't try to race.
Progress files at /tmp/audit-progress-<name>.json resume across runs. This is great for crash recovery, dangerous when you change audit logic and want a fresh evaluation. Delete the progress file when re-running with patched logic.

CMS rate-limit defensive patterns

The CMS Marketplace API will throttle at sustained high-concurrency loads. The 2026-05-01 Tier 0.5 audit hit HTTP 429 on 2,352 of 4,347 ZIPs at concurrency=10. Two patterns to avoid this:

Pattern A: lower default concurrency.

Tier 0 used concurrency=10 against ~13K residential ZIPs and reported clean. The hot zone is ~20K+ ZIPs at concurrency=10.
Recommended default for new audits: concurrency=5. If you see ANY 429s in the first ~500 zips, drop to concurrency=3 for the rest.
audit-federal-completeness-tier-0-5.js defaults to CMS_CONCURRENCY = 5 (was 10; lowered post-session).

Pattern B: retry-validation pass. Even with low concurrency, transient errors happen (network blips, CMS deploys, etc.). Always run a retry-validation pass after an initial audit:

bash

MONGODB_URI=$PROD_READ_URI CMS_API_KEY=$KEY \
  node scripts/audit/validate-cms-errors.js --tier=1
MONGODB_URI=$PROD_READ_URI CMS_API_KEY=$KEY \
  node scripts/audit/validate-cms-errors.js --tier=1.5

validate-cms-errors.js retries each errored ZIP at concurrency=1 with exponential backoff (5s/10s/20s/40s/80s, 5 retries) and re-classifies as match/mismatch/still-failed using the same DB-vs-CMS comparison the audit does. Result drops into audit-tier-N-rate-limit-validation.json.

The 2026-05-01 session validated 33 of 33 Tier 1 + 26 of 26 Tier 1.5 rate-limit errors as MATCH (no real mismatches were hiding). This pattern is now mandatory for closing out any audit.

Acceptance: an audit is "complete" only when:

exactMatches + retried-validated-as-match == totalZips, OR
the remaining items have been investigated as real data gaps and either fixed (with their own backup-and-greenlight protocol) or documented as known exclusions.

Backup discipline (Constraint 1)

Every prod write to zip_county requires an immediately-preceding fresh backup. Per-batch, not per-session.

Backup protocol

bash

PROD_READ_URI=$(aws --profile askflorence-prod secretsmanager get-secret-value \
  --secret-id prod/mongodb/app-read --query SecretString --output text)

TS=$(date -u +%Y%m%dT%H%M%SZ)
TAG="pre-<batch-descriptor>-${TS}"
BACKUP_DIR="$HOME/Documents/askflorence-db-backups/zip_county/${TAG}"
mkdir -p "$BACKUP_DIR"

# 1. Dump
mongodump --uri="$PROD_READ_URI" --collection=zip_county --out="$BACKUP_DIR/dump"

# 2. Verify record count matches prod
DUMP_BSON="$BACKUP_DIR/dump/askflorence/zip_county.bson"
PROD_COUNT=$(mongosh "$PROD_READ_URI" --quiet --eval 'db.zip_county.countDocuments()')
BSON_COUNT=$(bsondump --quiet "$DUMP_BSON" 2>/dev/null | wc -l | tr -d ' ')
[ "$BSON_COUNT" = "$PROD_COUNT" ] && echo "VERIFIED" || { echo "ABORT - count mismatch"; exit 1; }

# 3. sha256 for tamper detection
SHA256=$(shasum -a 256 "$DUMP_BSON" | awk '{print $1}')

# 4. MANIFEST.txt with backup tag, sha256, doc counts, planned _seedSource marker, restore commands
cat > "$BACKUP_DIR/MANIFEST.txt" <<EOF
backup_tag: ${TAG}
created_utc: $(date -u +%Y-%m-%dT%H:%M:%SZ)
prod_collection: askflorence.zip_county
pre_apply_doc_count: $PROD_COUNT
backup_bson_record_count: $BSON_COUNT
bson_file: dump/askflorence/zip_county.bson
bson_sha256: $SHA256
bson_size_bytes: $(stat -f%z "$DUMP_BSON")
planned_seedSource_marker: <your-marker-here>
batch_scope: <description>
expected_post_apply_count: <pre-count + planned-inserts>
restore_command_targeted_rollback: |
  node scripts/db/<your-seed>.js --rollback
restore_command_full_collection_replace: |
  mongorestore --uri="\$PROD_WRITE_URI" --nsInclude='askflorence.zip_county' --drop "$BACKUP_DIR/dump"
EOF

Backup storage

Currently local-only at ~/Documents/askflorence-db-backups/zip_county/<TAG>/ because s3://askflorence-data/db-backups/ blocks the SSO admin role at the bucket-policy layer. Until that's resolved (separate ops issue), local is the canonical location.

What goes in the BSON

A full zip_county snapshot, not just the affected docs. This makes restore unambiguous: mongorestore --drop replaces the entire collection back to the pre-batch state. Per-doc rollback via seed-script --rollback filters by _seedSource marker (preferred for surgical scenarios), but having the full snapshot is the nuclear-option safety net.

Restore patterns

Targeted (preferred — surgical):

bash

node scripts/db/<seed-script>.js --rollback
# Removes ONLY docs with the seed's _seedSource marker
# Each seed script's --rollback is scoped to its own marker

Full collection replace (nuclear — replaces collection):

bash

PROD_WRITE_URI=$(aws --profile askflorence-prod secretsmanager get-secret-value \
  --secret-id prod/mongodb/app-write --query SecretString --output text)
mongorestore --uri="$PROD_WRITE_URI" \
  --nsInclude='askflorence.zip_county' --drop \
  ~/Documents/askflorence-db-backups/zip_county/<TAG>/dump

The full-replace is appropriate when targeted-rollback is insufficient (e.g., docs were mutated post-insert by another process). Confirm with user before triggering.

`mongorestore --dryRun` quirk

mongorestore --dryRun in MongoDB Database Tools v100.x always reports "0 document(s) restored successfully" even when the BSON file is fully recognized and parseable. This is a misleading display quirk, NOT a failure. The fact that mongorestore prints dry run completed with no errors after checking for collection data in zip_county.bson means the BSON is restorable. Don't trust the "0 documents" line; verify integrity via bsondump | wc -l matching prod count + a sha256 + a sample round-trip instead.

Phased apply discipline (Constraint 2)

The audit does NOT auto-insert. Flow per batch:

Dry-run audit produces the report (per-class + per-state counts, anomaly flags, sample mismatches, full JSON).
HARD STOP. Surface the report to the user (status comment on the tracker issue + chat-surface). Wait.
User makes phased decisions explicitly. Examples: "apply AZ + TX classes only, hold the rest"; "the discrepancy class - investigate the KY airport cases first"; "state X has 50+ gaps - investigate before any apply."
Execute ONLY user-greenlighted batches, one at a time.
Each batch: backup -> dry-run preview of THAT batch -> user re-confirms -> apply -> smoke matrix -> status comment on tracker.
If unsure whether the user has greenlighted a specific apply: ask, don't assume.

Per-batch filters

Seed scripts MUST support per-state and per-class filters:

--state=<comma-list> — restrict to specified states
--class=<comma-list> — restrict to specified classes (insertable / discrepancy / non_residential / etc.)
--dry-run (default) / --apply / --rollback
Idempotency guard at apply-time (skip if (zip, countyFips) already exists)
State allowlist enforcement
_seedSource: "<descriptive-marker-with-date>" on every insert

Reference: scripts/db/seed-federal-tier-0-5.js is the canonical example.

Audit-script behaviors to know about

Reports overwrite in place

Every audit script writes to a fixed path:

scripts/db/data/federal-tier-0-5-gap-report-<date>.json
audit-tier-1-results.json (project root)
audit-tier-1-5-results.json (project root)

Re-runs overwrite. Always commit the original before re-running, or rename outputs to preserve history. The Tier 0.5 session uses -post-apply-confirm-<date>.json to keep the post-apply confirmation distinct from the original triage report.

The root-level audit-tier-*.json files are gitignored to avoid noise. The canonical historical snapshots live under scripts/db/data/tier-N-<date>.json (commit-tracked).

Progress files at `/tmp/audit-progress-<name>.json`

The audit harness saves progress every 200 zips so a crashed audit can resume. This is great for long runs. It's a footgun when you change audit logic and want a fresh evaluation: the progress file marks the previously-processed zips as done, the re-run skips them, and your patched logic never runs on them.

Always delete the progress file when:

You changed the audit script's comparison logic
You applied DB writes that change what counts as a match for previously-audited zips
You want a clean baseline reading for any reason

bash

rm -f /tmp/audit-progress-tier-1.json
rm -f /tmp/audit-progress-tier-1-5.json

In-memory state goes stale during long audits

The audit script loads the entire ourCounties snapshot at start (line ~34 of tier-1-zip-county.js: db.zip_county.aggregate(...)). It then iterates over that loaded list and queries CMS live for each zip.

If you apply DB writes WHILE an audit is running, the audit's in-memory ourCounties for the affected zips is STALE. The audit will compare CMS's response against the OLD DB state and may report a spurious mismatch even though the new DB state is correct.

Pattern: kill + restart the audit after any apply. Don't try to race.

bash

pkill -f "node scripts/audit/tier-N-..."
sleep 2
rm -f /tmp/audit-progress-tier-N.json
node scripts/audit/tier-N-....js

Tier 1 unsupported-doc handling (post-2026-05-01 patch)

scripts/audit/tier-1-zip-county.js was patched on 2026-05-01 to pre-fetch all (zip, countyFips) tuples from unsupported-class or non-federal-state docs and subtract them from the CMS-side comparison. Without this patch, future inserts of unsupported docs (territory ZIPs, corporate-only ZIPs, etc.) would surface as "extra in CMS" / "county-count mismatch" false positives.

The patch is permanent. Future Tier 1 runs benefit without action. Same logic should be carried forward if a future Tier 2/3 etc. audit needs equivalent handling.

Post-apply validation

After any prod-write batch, three gates must pass:

Gate	Script	Expected result
Calculator baseline diff	`npx tsx scripts/audit/calculator-baseline-diff.ts`	ZERO DIFFS on all 12 scenarios
Tier 1 (federal zip-county)	`node scripts/audit/tier-1-zip-county.js`	100% exact match
Tier 1.5 (SBE zip-county)	`node scripts/audit/tier-1-5-sbe-zip-county.js`	100% exact match

Plus the relevant tier audit re-run (Tier 0 / Tier 0.5 / etc.) to confirm the gap class is closed.

For audits with leftover CMS errors: ALWAYS run validate-cms-errors.js before declaring 100%.

Documentation conventions

Each audit + apply session lands documentation across THREE files:

docs/validation/<tier-name>.md — the audit report itself: methodology, summary, per-class breakdown, refresh playbook. Stays canonical for future plan-year refreshes.
docs/infrastructure/data-sources.md — annual refresh playbook entry. Update with the new audit step + lessons learned.
docs/infrastructure/change-log.md — dated entry with backup tags, batch scope, verification matrix, rollback paths, outstanding follow-ups.

Plus this runbook gets updated whenever a new operational pattern emerges that future runs should benefit from.

Cross-references

Tier 0 audit — original Census-derived federal completeness audit
Tier 0.5 audit — USPS-derived complement to Tier 0; source of most of these operational patterns
docs/infrastructure/data-sources.md — annual refresh playbook
docs/infrastructure/change-log.md — change history
scripts/audit/validate-cms-errors.js — canonical retry-validation script
scripts/db/fix-tier-1-completeness-gaps.js — single-doc tuple fix pattern (for ad-hoc data-gap fixes)
scripts/db/seed-federal-tier-0-5.js — full-batch seed pattern with --state/--class filters

Open follow-ups (from the 2026-05-01 session)

S3 backup access — local-only until ops grants the SSO admin role bucket-policy access on s3://askflorence-data/db-backups/.
Tier 0.5b - tuple-level completeness sweep — defensive audit that, for every ZIP already in DB, queries CMS and inserts any missing (zip, countyFips) tuples. Catches what zip-level Tier 0.5 misses (the 50613 IA Bremer County pattern). One known instance fixed; could surface more.
HUD ZIP-County crosswalk upgrade — replaces zipcodes npm before next plan-year refresh. Quarterly-refreshed, free + HUD account, catches the npm-stale extras automatically.
Calculator 404 message refinement — frontend-side: "ZIP not recognized; check the digits and try your home address" beats bare "Zip code not found."