Audit operations runbook (zip_county audit family)

Purpose: Consolidated operational guidance for running the zip_county audit family (Tier 0, Tier 0.5, Tier 1, Tier 1.5) cleanly. Distilled from the 2026-05-01 Tier 0.5 session that drove Tier 1 + Tier 1.5 to TRUE 100% match. Applies to any future plan-year ingest, any ad-hoc data-integrity sweep, and any new tier added to the family.

Scope: operational patterns only. The "what does Tier X audit?" content lives in each tier's dedicated validation doc (Tier 0, Tier 0.5, Tier 1, Tier 1.5).

TL;DR for "I'm running an audit, what do I need to know?"

  1. Backup before any prod write. mongodump of zip_county, sha256 the BSON, store under ~/Documents/askflorence-db-backups/zip_county/<TAG>/, write a MANIFEST.txt. See "Backup discipline" below for full protocol.
  2. CMS rate limits at concurrency=10. Use concurrency=3-5 for any audit that hits CMS broadly. Build a retry-validation pass for any leftover errors.
  3. Two-pass pattern is mandatory. Initial audit run + a second retry-validation pass for anything that errored. Never accept "X CMS errors" as a final state — they're UNKNOWN until validated. Use scripts/audit/validate-cms-errors.js.
  4. Phased apply per Constraint 2. Audit produces report -> stop -> user reviews -> user greenlights specific batches -> apply with backup -> smoke -> repeat. Never auto-insert.
  5. Audit reports overwrite. Every audit script writes to a fixed path. Re-runs overwrite. Always commit before re-running, or rename outputs (-post-apply-confirm-<date>.json is a good convention).
  6. In-memory state goes stale during long audits. If you apply mid-audit, the audit's in-memory ourCounties snapshot is from script start, not current DB. Kill + restart fresh; don't try to race.
  7. Progress files at /tmp/audit-progress-<name>.json resume across runs. This is great for crash recovery, dangerous when you change audit logic and want a fresh evaluation. Delete the progress file when re-running with patched logic.

CMS rate-limit defensive patterns

The CMS Marketplace API will throttle at sustained high-concurrency loads. The 2026-05-01 Tier 0.5 audit hit HTTP 429 on 2,352 of 4,347 ZIPs at concurrency=10. Two patterns to avoid this:

Pattern A: lower default concurrency.

  • Tier 0 used concurrency=10 against ~13K residential ZIPs and reported clean. The hot zone is ~20K+ ZIPs at concurrency=10.
  • Recommended default for new audits: concurrency=5. If you see ANY 429s in the first ~500 zips, drop to concurrency=3 for the rest.
  • audit-federal-completeness-tier-0-5.js defaults to CMS_CONCURRENCY = 5 (was 10; lowered post-session).
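
The drop-to-3 rule in the second bullet can be sketched as a small pure function. This is an illustration under assumptions, not code from the audit scripts: `chooseConcurrency`, its parameter names, and the constants are hypothetical names chosen to mirror the guidance above. The caller is assumed to invoke it after each batch and feed the returned value back in as `current`.

```javascript
// Hypothetical helper (not part of the audit scripts): choose the CMS
// concurrency for the next batch from what has been observed so far.
const DEFAULT_CONCURRENCY = 5;  // recommended default for new audits
const FALLBACK_CONCURRENCY = 3; // used once any 429 appears early
const WARMUP_ZIPS = 500;        // the "first ~500 zips" observation window

function chooseConcurrency({ zipsProcessed, http429Count, current = DEFAULT_CONCURRENCY }) {
  // Once lowered, stay lowered for the rest of the run.
  if (current <= FALLBACK_CONCURRENCY) return current;
  // Any 429 inside the warm-up window triggers the drop.
  if (zipsProcessed <= WARMUP_ZIPS && http429Count > 0) return FALLBACK_CONCURRENCY;
  return current;
}

console.log(chooseConcurrency({ zipsProcessed: 120, http429Count: 0 })); // 5
console.log(chooseConcurrency({ zipsProcessed: 120, http429Count: 2 })); // 3
```

The guidance does not say what to do about 429s that first appear after the warm-up window, so the sketch leaves the concurrency unchanged in that case.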

Pattern B: retry-validation pass. Even with low concurrency, transient errors happen (network blips, CMS deploys, etc.). Always run a retry-validation pass after an initial audit:

bash
MONGODB_URI=$PROD_READ_URI CMS_API_KEY=$KEY \
  node scripts/audit/validate-cms-errors.js --tier=1
MONGODB_URI=$PROD_READ_URI CMS_API_KEY=$KEY \
  node scripts/audit/validate-cms-errors.js --tier=1.5

validate-cms-errors.js retries each errored ZIP at concurrency=1 with exponential backoff (5s/10s/20s/40s/80s, 5 retries) and re-classifies as match/mismatch/still-failed using the same DB-vs-CMS comparison the audit does. Result drops into audit-tier-N-rate-limit-validation.json.
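
The backoff schedule described above can be sketched as follows. This is not the source of validate-cms-errors.js: `backoffDelaysMs` and `retryWithBackoff` are hypothetical names, and the sketch assumes "5 retries" means one initial attempt followed by up to five retries, one per delay.

```javascript
// Sketch only: the function names and result shape are assumptions.
function backoffDelaysMs(baseMs = 5000, retries = 5) {
  // Doubles each time: 5s, 10s, 20s, 40s, 80s with the defaults.
  return Array.from({ length: retries }, (_, i) => baseMs * 2 ** i);
}

const realSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff(attempt, { delaysMs = backoffDelaysMs(), sleep = realSleep } = {}) {
  let lastError;
  // One initial try, then one retry per delay in the schedule.
  for (let i = 0; i <= delaysMs.length; i++) {
    try {
      return { status: "ok", value: await attempt(), retriesUsed: i };
    } catch (err) {
      lastError = err;
      if (i < delaysMs.length) await sleep(delaysMs[i]);
    }
  }
  return { status: "still-failed", error: lastError, retriesUsed: delaysMs.length };
}

console.log(backoffDelaysMs()); // [ 5000, 10000, 20000, 40000, 80000 ]
```

Passing `sleep` as a parameter keeps the schedule testable without waiting out the real delays. Running ZIPs one at a time through `retryWithBackoff` corresponds to the concurrency=1 behavior described above.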

The 2026-05-01 session validated 33 of 33 Tier 1 + 26 of 26 Tier 1.5 rate-limit errors as MATCH (no real mismatches were hiding). This pattern is now mandatory for closing out any audit.

Acceptance: an audit is "complete" only when:

  • exactMatches + retried-validated-as-match == totalZips, OR
  • the remaining items have been investigated as real data gaps and either fixed (with their own backup-and-greenlight protocol) or documented as known exclusions.
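
The acceptance rule can be written as a predicate. This is a sketch under assumptions: `isAuditComplete`, the `remaining` array, and the `resolution` labels ("fixed", "known-exclusion") are hypothetical and do not reflect the actual report schema.

```javascript
// Hypothetical predicate mirroring the two acceptance conditions above.
function isAuditComplete({ totalZips, exactMatches, retriedValidatedAsMatch, remaining = [] }) {
  // Condition 1: every ZIP is an exact match or was validated as a match on retry.
  if (exactMatches + retriedValidatedAsMatch === totalZips) return true;

  // Condition 2: every leftover ZIP is accounted for and has been resolved.
  const resolved = new Set(["fixed", "known-exclusion"]);
  const accounted = exactMatches + retriedValidatedAsMatch + remaining.length === totalZips;
  return accounted && remaining.every((item) => resolved.has(item.resolution));
}

// Made-up counts: 67 exact matches plus 33 retry-validated matches out of 100.
console.log(isAuditComplete({ totalZips: 100, exactMatches: 67, retriedValidatedAsMatch: 33 })); // true
```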

Backup discipline (Constraint 1)

Every prod write to zip_county requires an immediately-preceding fresh backup. Per-batch, not per-session.

Backup protocol

bash
PROD_READ_URI=$(aws --profile askflorence-prod secretsmanager get-secret-value \
  --secret-id prod/mongodb/app-read --query SecretString --output text)

TS=$(date -u +%Y%m%dT%H%M%SZ)
TAG="pre-<batch-descriptor>-${TS}"
BACKUP_DIR="$HOME/Documents/askflorence-db-backups/zip_county/${TAG}"
mkdir -p "$BACKUP_DIR"

# 1. Dump
mongodump --uri="$PROD_READ_URI" --collection=zip_county --out="$BACKUP_DIR/dump"

# 2. Verify record count matches prod
DUMP_BSON="$BACKUP_DIR/dump/askflorence/zip_county.bson"
PROD_COUNT=$(mongosh "$PROD_READ_URI" --quiet --eval 'db.zip_county.countDocuments()')
BSON_COUNT=$(bsondump --quiet "$DUMP_BSON" 2>/dev/null | wc -l | tr -d ' ')
[ "$BSON_COUNT" = "$PROD_COUNT" ] && echo "VERIFIED" || { echo "ABORT - count mismatch"; exit 1; }

# 3. sha256 for tamper detection
SHA256=$(shasum -a 256 "$DUMP_BSON" | awk '{print $1}')

# 4. MANIFEST.txt with backup tag, sha256, doc counts, planned _seedSource marker, restore commands
cat > "$BACKUP_DIR/MANIFEST.txt" <<EOF
backup_tag: ${TAG}
created_utc: $(date -u +%Y-%m-%dT%H:%M:%SZ)
prod_collection: askflorence.zip_county
pre_apply_doc_count: $PROD_COUNT
backup_bson_record_count: $BSON_COUNT
bson_file: dump/askflorence/zip_county.bson
bson_sha256: $SHA256
bson_size_bytes: $(stat -f%z "$DUMP_BSON")
planned_seedSource_marker: <your-marker-here>
batch_scope: <description>
expected_post_apply_count: <pre-count + planned-inserts>
restore_command_targeted_rollback: |
  node scripts/db/<your-seed>.js --rollback
restore_command_full_collection_replace: |
  mongorestore --uri="\$PROD_WRITE_URI" --nsInclude='askflorence.zip_county' --drop "$BACKUP_DIR/dump"
EOF
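
Before any restore, the sha256 recorded in MANIFEST.txt can be re-checked against the BSON on disk. The following is a minimal sketch, assuming the manifest layout written by the heredoc above (`bson_file:` and `bson_sha256:` lines); `readManifestField` and `verifyBackup` are hypothetical helpers, not existing scripts.

```javascript
const crypto = require("node:crypto");
const fs = require("node:fs");
const path = require("node:path");

// Return the value of a single-line "key: value" entry, or null if absent.
function readManifestField(manifestText, key) {
  const line = manifestText.split("\n").find((l) => l.startsWith(`${key}:`));
  return line ? line.slice(key.length + 1).trim() : null;
}

// Recompute the BSON sha256 and compare it with the manifest entry.
function verifyBackup(backupDir) {
  const manifest = fs.readFileSync(path.join(backupDir, "MANIFEST.txt"), "utf8");
  const expected = readManifestField(manifest, "bson_sha256");
  const bsonRel = readManifestField(manifest, "bson_file");
  if (!expected || !bsonRel) {
    return { ok: false, reason: "manifest missing bson_sha256 or bson_file" };
  }
  // Reads the whole file into memory; a stream would suit very large dumps better.
  const bytes = fs.readFileSync(path.join(backupDir, bsonRel));
  const actual = crypto.createHash("sha256").update(bytes).digest("hex");
  return actual === expected ? { ok: true } : { ok: false, reason: "sha256 mismatch" };
}
```

Usage would look like `verifyBackup(process.env.BACKUP_DIR)`, aborting the restore unless the result is `{ ok: true }`.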

Backup storage

Currently local-only at ~/Documents/askflorence-db-backups/zip_county/<TAG>/ because s3://askflorence-data/db-backups/ blocks the SSO admin role at the bucket-policy layer. Until that's resolved (separate ops issue), local is the canonical location.

What goes in the BSON

A full zip_county snapshot, not just the affected docs. This makes restore unambiguous: mongorestore --drop replaces the entire collection back to the pre-batch state. Per-doc rollback via seed-script --rollback filters by _seedSource marker (preferred for surgical scenarios), but having the full snapshot is the nuclear-option safety net.

Restore patterns

Targeted (preferred — surgical):

bash
node scripts/db/<seed-script>.js --rollback
# Removes ONLY docs with the seed's _seedSource marker
# Each seed script's --rollback is scoped to its own marker

Full collection replace (nuclear — replaces collection):

bash
PROD_WRITE_URI=$(aws --profile askflorence-prod secretsmanager get-secret-value \
  --secret-id prod/mongodb/app-write --query SecretString --output text)
mongorestore --uri="$PROD_WRITE_URI" \
  --nsInclude='askflorence.zip_county' --drop \
  ~/Documents/askflorence-db-backups/zip_county/<TAG>/dump

The full-replace is appropriate when targeted-rollback is insufficient (e.g., docs were mutated post-insert by another process). Confirm with user before triggering.

mongorestore --dryRun quirk

mongorestore --dryRun in MongoDB Database Tools v100.x always reports "0 document(s) restored successfully", even when the BSON file is fully recognized and parseable. This is a misleading display quirk, NOT a failure: a dry run imports nothing, so the restored count stays at zero. If mongorestore logs "checking for collection data in zip_county.bson" and then "dry run completed" with no errors, the BSON is restorable. Don't trust the "0 documents" line; instead verify integrity with three checks: the bsondump | wc -l record count matches the prod count, the sha256 matches the manifest, and a sample round-trip succeeds.

Phased apply discipline (Constraint 2)

The audit does NOT auto-insert. Flow per batch:

  1. Dry-run audit produces the report (per-class + per-state counts, anomaly flags, sample mismatches, full JSON).
  2. HARD STOP. Surface the report to the user (status comment on the tracker issue + chat-surface). Wait.
  3. User makes phased decisions explicitly. Examples: "apply AZ + TX classes only, hold the rest"; "the discrepancy class - investigate the KY airport cases first"; "state X has 50+ gaps - investigate before any apply."
  4. Execute ONLY user-greenlighted batches, one at a time.
  5. Each batch: backup -> dry-run preview of THAT batch -> user re-confirms -> apply -> smoke matrix -> status comment on tracker.
  6. If unsure whether the user has greenlighted a specific apply: ask, don't assume.

Per-batch filters

Seed scripts MUST support per-state and per-class filters, along with the mode flags and safety guards listed here:

  • --state=<comma-list> — restrict to specified states
  • --class=<comma-list> — restrict to specified classes (insertable / discrepancy / non_residential / etc.)
  • --dry-run (default) / --apply / --rollback
  • Idempotency guard at apply-time (skip if (zip, countyFips) already exists)
  • State allowlist enforcement
  • _seedSource: "<descriptive-marker-with-date>" on every insert

Reference: scripts/db/seed-federal-tier-0-5.js is the canonical example.
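
The flag and guard requirements can be sketched as below. This is not the code of seed-federal-tier-0-5.js: the candidate document shape (`zip`, `countyFips`, `state`, `class`), the helper names, and the allowlist argument are assumptions made for illustration.

```javascript
// Illustrative sketch; names and document shapes are hypothetical.
const MODES = new Set(["dry-run", "apply", "rollback"]);

function parseList(value) {
  return value.split(",").map((s) => s.trim()).filter(Boolean);
}

function parseSeedArgs(argv) {
  const opts = { mode: "dry-run", states: null, classes: null }; // dry-run is the default
  for (const arg of argv) {
    if (arg.startsWith("--state=")) {
      opts.states = parseList(arg.slice("--state=".length)).map((s) => s.toUpperCase());
    } else if (arg.startsWith("--class=")) {
      opts.classes = parseList(arg.slice("--class=".length));
    } else if (arg.startsWith("--") && MODES.has(arg.slice(2))) {
      opts.mode = arg.slice(2);
    } else {
      throw new Error(`Unknown argument: ${arg}`);
    }
  }
  return opts;
}

// Decide which candidates would be inserted, applying every guard in order.
function planInserts(candidates, existingKeys, opts, { allowedStates, seedSource }) {
  return candidates
    .filter((c) => allowedStates.has(c.state))                       // state allowlist
    .filter((c) => !opts.states || opts.states.includes(c.state))    // --state filter
    .filter((c) => !opts.classes || opts.classes.includes(c.class))  // --class filter
    .filter((c) => !existingKeys.has(`${c.zip}|${c.countyFips}`))    // idempotency guard
    .map((c) => ({ zip: c.zip, countyFips: c.countyFips, state: c.state, _seedSource: seedSource }));
}
```

In this sketch the plan is computed the same way in every mode, so a dry run previews exactly what `--apply` would write.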

Audit-script behaviors to know about

Reports overwrite in place

Every audit script writes to a fixed path:

  • scripts/db/data/federal-tier-0-5-gap-report-<date>.json
  • audit-tier-1-results.json (project root)
  • audit-tier-1-5-results.json (project root)

Re-runs overwrite. Always commit the original before re-running, or rename outputs to preserve history. The Tier 0.5 session used the suffix -post-apply-confirm-<date>.json to keep the post-apply confirmation distinct from the original triage report.

The root-level audit-tier-*.json files are gitignored to avoid noise. The canonical historical snapshots live under scripts/db/data/tier-N-<date>.json (commit-tracked).
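
One way to preserve a report before a re-run is to copy it to a suffixed name first. This is a minimal sketch: `withSuffix` and `preserveReport` are hypothetical helpers, and the ISO date in the example is an assumed format for `<date>`.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Insert a suffix between the file name and its extension.
function withSuffix(reportPath, suffix) {
  const { dir, name, ext } = path.parse(reportPath);
  return path.join(dir, `${name}${suffix}${ext}`);
}

// Copy the report aside; never clobber an existing preserved copy.
function preserveReport(reportPath, suffix) {
  if (!fs.existsSync(reportPath)) return null; // nothing to preserve yet
  const target = withSuffix(reportPath, suffix);
  if (fs.existsSync(target)) throw new Error(`Refusing to overwrite ${target}`);
  fs.copyFileSync(reportPath, target);
  return target;
}

console.log(withSuffix("audit-tier-1-results.json", "-post-apply-confirm-2026-05-01"));
// audit-tier-1-results-post-apply-confirm-2026-05-01.json
```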

Progress files at /tmp/audit-progress-<name>.json

The audit harness saves progress every 200 zips so a crashed audit can resume. This is great for long runs. It's a footgun when you change audit logic and want a fresh evaluation: the progress file marks the previously-processed zips as done, the re-run skips them, and your patched logic never runs on them.

Always delete the progress file when:

  • You changed the audit script's comparison logic
  • You applied DB writes that change what counts as a match for previously-audited zips
  • You want a clean baseline reading for any reason

bash
rm -f /tmp/audit-progress-tier-1.json
rm -f /tmp/audit-progress-tier-1-5.json
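
The resume behavior, and why deleting the file matters, can be illustrated with a short sketch. The progress-file format used here (a JSON object with a `doneZips` array) and the helper names are assumptions; the real harness may store progress differently.

```javascript
const fs = require("node:fs");

// Load the set of already-processed zips; a missing file means a fresh run.
function loadDoneZips(progressFile) {
  if (!fs.existsSync(progressFile)) return new Set();
  return new Set(JSON.parse(fs.readFileSync(progressFile, "utf8")).doneZips);
}

// Zips still to evaluate: anything recorded as done is skipped, even if
// the comparison logic has changed since it was processed.
function pendingZips(allZips, doneZips) {
  return allZips.filter((zip) => !doneZips.has(zip));
}
```

With a progress file in place, `pendingZips` silently drops every previously processed zip, which is the skip described above. After `rm -f`, `loadDoneZips` returns an empty set and every zip is re-evaluated.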

In-memory state goes stale during long audits

The audit script loads the entire ourCounties snapshot at start (line ~34 of tier-1-zip-county.js: db.zip_county.aggregate(...)). It then iterates over that loaded list and queries CMS live for each zip.

If you apply DB writes WHILE an audit is running, the audit's in-memory ourCounties for the affected zips is STALE. The audit will compare CMS's response against the OLD DB state and may report a spurious mismatch even though the new DB state is correct.

Pattern: kill + restart the audit after any apply. Don't try to race.

bash
pkill -f "node scripts/audit/tier-N-..."
sleep 2
rm -f /tmp/audit-progress-tier-N.json
node scripts/audit/tier-N-....js
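
The stale-snapshot failure mode can be reproduced in a few lines. This is a toy illustration with made-up ZIP and FIPS values; `compareCounties` is a hypothetical stand-in for the audit's DB-vs-CMS comparison.

```javascript
// Order-insensitive comparison of two county FIPS lists.
function compareCounties(dbCounties, cmsCounties) {
  const a = [...dbCounties].sort();
  const b = [...cmsCounties].sort();
  return a.length === b.length && a.every((fips, i) => fips === b[i]) ? "match" : "mismatch";
}

// "Live" DB state and the snapshot the audit loaded at script start.
const liveDb = new Map([["00001", ["99001"]]]);
const snapshot = new Map([...liveDb].map(([zip, fips]) => [zip, [...fips]]));

// An apply lands mid-audit and adds the missing county to the live DB.
liveDb.set("00001", ["99001", "99003"]);

const cmsCounties = ["99001", "99003"]; // what CMS returns for this zip
console.log(compareCounties(snapshot.get("00001"), cmsCounties)); // mismatch (spurious)
console.log(compareCounties(liveDb.get("00001"), cmsCounties));   // match
```

The snapshot comparison reports a mismatch even though the live data is correct, which is why a restart after any apply is the safe pattern.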

Tier 1 unsupported-doc handling (post-2026-05-01 patch)

scripts/audit/tier-1-zip-county.js was patched on 2026-05-01 to pre-fetch all (zip, countyFips) tuples from unsupported-class or non-federal-state docs and subtract them from the CMS-side comparison. Without this patch, future inserts of unsupported docs (territory ZIPs, corporate-only ZIPs, etc.) would surface as "extra in CMS" / "county-count mismatch" false positives.

The patch is permanent. Future Tier 1 runs benefit without action. Same logic should be carried forward if a future Tier 2/3 etc. audit needs equivalent handling.
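
For anyone carrying the logic forward, one plausible reading of the subtraction step is sketched below. It is not the patched code from tier-1-zip-county.js: the tuple key format, the function names, and the choice to drop excluded tuples from both sides before comparing are assumptions.

```javascript
// Key a (zip, countyFips) tuple so it can live in a Set.
const tupleKey = (zip, countyFips) => `${zip}|${countyFips}`;

// excludedDocs: docs already selected as unsupported-class or non-federal-state.
function buildExclusionSet(excludedDocs) {
  return new Set(excludedDocs.map((d) => tupleKey(d.zip, d.countyFips)));
}

// Compare one zip's counties after removing excluded tuples from both sides.
function compareZip(zip, dbFips, cmsFips, excluded) {
  const keep = (fips) => !excluded.has(tupleKey(zip, fips));
  const ours = new Set(dbFips.filter(keep));
  const theirs = new Set(cmsFips.filter(keep));
  const missingInDb = [...theirs].filter((f) => !ours.has(f));
  const extraInDb = [...ours].filter((f) => !theirs.has(f));
  return { match: missingInDb.length === 0 && extraInDb.length === 0, missingInDb, extraInDb };
}
```

Building the exclusion set once, before the per-zip loop, matches the "pre-fetch" wording above and keeps each comparison a cheap set lookup.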

Post-apply validation

After any prod-write batch, three gates must pass:

| Gate | Script | Expected result |
| --- | --- | --- |
| Calculator baseline diff | npx tsx scripts/audit/calculator-baseline-diff.ts | ZERO DIFFS on all 12 scenarios |
| Tier 1 (federal zip-county) | node scripts/audit/tier-1-zip-county.js | 100% exact match |
| Tier 1.5 (SBE zip-county) | node scripts/audit/tier-1-5-sbe-zip-county.js | 100% exact match |

Plus the relevant tier audit re-run (Tier 0 / Tier 0.5 / etc.) to confirm the gap class is closed.

For audits with leftover CMS errors: ALWAYS run validate-cms-errors.js before declaring 100%.
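
The three gates can be chained so that the first failure stops the sequence. This is a sketch under a significant assumption: it treats a non-zero exit code as failure, and this runbook does not confirm that the audit scripts signal a sub-100% result through their exit code. `runGates` and `runInShell` are hypothetical names.

```javascript
const { spawnSync } = require("node:child_process");

const GATES = [
  { name: "Calculator baseline diff", command: "npx tsx scripts/audit/calculator-baseline-diff.ts" },
  { name: "Tier 1 (federal zip-county)", command: "node scripts/audit/tier-1-zip-county.js" },
  { name: "Tier 1.5 (SBE zip-county)", command: "node scripts/audit/tier-1-5-sbe-zip-county.js" },
];

// Default runner: assumes exit code 0 means the gate passed.
function runInShell(command) {
  return spawnSync(command, { shell: true, stdio: "inherit" }).status === 0;
}

// Run gates in order and stop at the first failure.
function runGates(gates = GATES, run = runInShell) {
  for (const gate of gates) {
    if (!run(gate.command)) return { passed: false, failedGate: gate.name };
  }
  return { passed: true, failedGate: null };
}
```

Even when all three commands exit cleanly, the reports still need to be read, and any leftover CMS errors still need the validate-cms-errors.js pass before 100% is declared.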

Documentation conventions

Each audit + apply session lands documentation across THREE files:

  1. docs/validation/<tier-name>.md — the audit report itself: methodology, summary, per-class breakdown, refresh playbook. Stays canonical for future plan-year refreshes.
  2. docs/infrastructure/data-sources.md — annual refresh playbook entry. Update with the new audit step + lessons learned.
  3. docs/infrastructure/change-log.md — dated entry with backup tags, batch scope, verification matrix, rollback paths, outstanding follow-ups.

Plus this runbook gets updated whenever a new operational pattern emerges that future runs should benefit from.

Cross-references

  • Tier 0 audit — original Census-derived federal completeness audit
  • Tier 0.5 audit — USPS-derived complement to Tier 0; source of most of these operational patterns
  • docs/infrastructure/data-sources.md — annual refresh playbook
  • docs/infrastructure/change-log.md — change history
  • scripts/audit/validate-cms-errors.js — canonical retry-validation script
  • scripts/db/fix-tier-1-completeness-gaps.js — single-doc tuple fix pattern (for ad-hoc data-gap fixes)
  • scripts/db/seed-federal-tier-0-5.js — full-batch seed pattern with --state/--class filters

Open follow-ups (from the 2026-05-01 session)

  • S3 backup access — local-only until ops grants the SSO admin role bucket-policy access on s3://askflorence-data/db-backups/.
  • Tier 0.5b - tuple-level completeness sweep — defensive audit that, for every ZIP already in DB, queries CMS and inserts any missing (zip, countyFips) tuples. Catches what zip-level Tier 0.5 misses (the 50613 IA Bremer County pattern). One known instance fixed; could surface more.
  • HUD ZIP-County crosswalk upgrade — replaces zipcodes npm before next plan-year refresh. Quarterly-refreshed, free + HUD account, catches the npm-stale extras automatically.
  • Calculator 404 message refinement — frontend-side: "ZIP not recognized; check the digits and try your home address" beats bare "Zip code not found."

AskFlorence Internal Documentation. Not for public distribution.