Data Classification Policy

Status: Active. Last updated April 12, 2026. Purpose: SOC 2 evidence for CC6.1 (Logical Access), CC6.5 (Data Protection), A1.2 (Availability), and P6.1 (Privacy, Data Use)

Classification Levels

Level	Definition	Examples	Encryption	Retention
Public	No restrictions. Intentionally published.	Plan names, metal levels, issuer names, premium amounts	At rest (AES-256)	Indefinite
Internal	Business-sensitive. Not for external sharing.	SLCSP calculations, data source URLs, API keys	At rest + in transit (TLS)	Duration of use
PII	Personally identifiable information.	Email, name, phone, address	At rest + in transit + field-level (CSFLE)	Per purpose + 7yr audit
PHI	Protected health information (HIPAA).	SSN, DOB, income (with health context), enrollment records	At rest + in transit + field-level (CSFLE + KMS)	Per purpose + 7yr audit
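As an illustration of how the table above could be checked mechanically, the following is a minimal sketch that encodes the required controls per level and reports which ones a given store is missing. The type names, the `Control` vocabulary, and `missingControls` are hypothetical and not part of any existing codebase; the requirements themselves are copied from the table.

```typescript
// Hypothetical encoding of the Classification Levels table.
type Level = "Public" | "Internal" | "PII" | "PHI";
type Control = "atRest" | "inTransit" | "fieldLevel" | "kms";

// Controls the table lists as required for each level.
const REQUIRED: Record<Level, Control[]> = {
  Public: ["atRest"],
  Internal: ["atRest", "inTransit"],
  PII: ["atRest", "inTransit", "fieldLevel"],
  PHI: ["atRest", "inTransit", "fieldLevel", "kms"],
};

// Returns the required controls that are absent from `applied`.
// An empty result means the store meets the table's minimum for that level.
function missingControls(level: Level, applied: Control[]): Control[] {
  return REQUIRED[level].filter((c) => applied.indexOf(c) === -1);
}
```

For example, `missingControls("PHI", ["atRest", "fieldLevel"])` would report `inTransit` and `kms` as gaps.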

Collection Classification

Phase 1 Collections (Active)

Collection	Classification	Contains PII/PHI?	Encryption	Retention	Access
`plan_years`	Public	No	At rest (Atlas default)	Per plan year (keep all years)	`app-read`, `app-write`
`plans`	Public	No	At rest (Atlas default)	Per plan year (keep all years)	`app-read`, `app-write`
`regions`	Public	No	At rest (Atlas default)	Per plan year (keep all years)	`app-read`, `app-write`
`zip_county`	Public	No	At rest (Atlas default)	Indefinite (geographic data)	`app-read`, `app-write`
`audit_log`	Internal	May contain IP addresses	At rest (Atlas default)	7 years (TTL index)	`audit-write` (insert), admin (read)

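Key: The Phase 1 plan-data collections (`plan_years`, `plans`, `regions`, `zip_county`) contain NO PII or PHI. All of their data is publicly available plan information from government sources (DFS filings, marketplace data, CMS PUF). `audit_log` is classified Internal and may contain IP addresses.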
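The 7-year TTL retention listed for `audit_log` can be expressed as a MongoDB TTL index, which takes an `expireAfterSeconds` value on a date field. The sketch below only computes the index specification; the field name `createdAt`, the index name, and the 365-day year (leap days ignored) are assumptions, not the deployed configuration.

```typescript
const SECONDS_PER_DAY = 24 * 60 * 60;

// Converts a retention period in years to seconds, assuming 365-day years.
function ttlSeconds(years: number): number {
  return years * 365 * SECONDS_PER_DAY;
}

interface TtlIndexSpec {
  key: Record<string, 1>;
  options: { expireAfterSeconds: number; name: string };
}

// Builds a TTL index spec for the given date field (hypothetical field name).
function auditLogTtlIndex(dateField: string, years: number): TtlIndexSpec {
  const key: Record<string, 1> = {};
  key[dateField] = 1;
  return {
    key,
    options: { expireAfterSeconds: ttlSeconds(years), name: dateField + "_ttl" },
  };
}

const spec = auditLogTtlIndex("createdAt", 7);
// spec.options.expireAfterSeconds is 7 * 365 * 86400 = 220752000
```

With the official Node.js driver, a spec like this would typically be passed as `collection.createIndex(spec.key, spec.options)`.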

Cross-cluster reference collections (live on staging Atlas, read by prod via AWS PrivateLink — Phase 11)

Collection	Classification	Contains PII/PHI?	Encryption	Retention	Access
`formularies_staging`	Public	No (CMS §1311 MRF formulary data — RxCUI → plan tier mappings)	At rest (Atlas default) + TLS in transit + AWS PrivateLink (network layer)	Per plan year	`app_read_staging` (read-only, prod) + ingest pipeline (write, staging account)
`providers_staging`	Public	No (NPPES public NPI directory — provider name, NPI, specialty, network membership)	At rest (Atlas default) + TLS in transit + AWS PrivateLink (network layer)	Per refresh cycle	`app_read_staging` (read-only, prod) + ingest pipeline (write, staging account)

Location and read path: these collections live ONLY on the staging Atlas cluster (askflorence-staging, project_id 69e31af12fd2c0aef51bbb41). The prod app (askflorence.health) reads them through the AWS PrivateLink endpoint vpce-0c81aea11e29bb928 using the read-only app_read_staging Atlas user. The §1311 ingest pipeline writes them from the staging AWS account; nothing on prod ever writes to these collections.

Why the staging cluster rather than the prod cluster: this keeps the prod cluster on the M10 HIPAA tier ($56/mo) instead of upgrading it to M30 ($382/mo) to handle the 2.14M-doc + 30M-tuple footprint. That saves ~$326/mo recurring ($382 minus $56) while keeping prod's audit boundary clean (only PHI processing on prod cluster). See ADR 0004 for the full decision.

Drift guard: #100 / ENG-239. Two-phase enforcement, both shipped:

Phase 1 (static CI guard) shipped 2026-05-08 — scripts/audit/staging-collections-guard.ts enforces the data-classification contract at PR time: fails the build if any getReferenceDb() call references a collection not on STAGING_ALLOWED_COLLECTIONS (defined in src/lib/db.ts). Workflow at .github/workflows/staging-collections-guard.yml. Allow-list duplicated in the script (defense-in-depth).
Phase 2 (live nightly drift check) shipped 2026-05-09 — scripts/audit/staging-cluster-drift.ts audits the actual Atlas state of app_read_staging (the cross-cluster reader) at 08:00 UTC daily via .github/workflows/staging-cluster-drift.yml. Verifies the user has exactly one custom role (role_reader_reference@admin) granting only FIND on exactly the expected 2 collections (formularies_staging + providers_staging) — opens a P1 GitHub issue on drift. As part of Phase 2 the user's role was tightened from built-in read@askflorence (whole-DB scope) to per-collection custom role role_reader_reference; verified prod cross-cluster reads remain healthy after the tightening.
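The static check described in Phase 1 can be sketched as a scan over source text for collection names passed through `getReferenceDb()`. This is a minimal illustration, not the contents of scripts/audit/staging-collections-guard.ts: the call shape `getReferenceDb(...).collection("name")`, the regex-based matching, the allow-list contents, and the `findViolations` helper are all assumptions.

```typescript
// Assumed allow-list contents, taken from the two collections named in this policy.
const STAGING_ALLOWED_COLLECTIONS = ["formularies_staging", "providers_staging"];

interface Violation {
  file: string;
  collection: string;
}

// Scans file contents (path -> source text) for getReferenceDb(...).collection("x")
// calls and returns every reference to a collection outside the allow-list.
function findViolations(files: Record<string, string>, allowed: string[]): Violation[] {
  const pattern = /getReferenceDb\([^)]*\)\s*\.\s*collection\(\s*["']([^"']+)["']\s*\)/g;
  const violations: Violation[] = [];
  for (const file of Object.keys(files)) {
    pattern.lastIndex = 0; // reset the global regex between files
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(files[file])) !== null) {
      if (allowed.indexOf(match[1]) === -1) {
        violations.push({ file, collection: match[1] });
      }
    }
  }
  return violations;
}
```

A CI wrapper would read the repository's source files into `files` and fail the build when the returned list is non-empty. A regex scan misses dynamically built collection names, so a real guard might prefer an AST-based check.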

Together these protect the classification claim above: Phase 1 catches code-level drift at PR time; Phase 2 catches runtime drift (privilege escalation via Atlas Admin UI, out-of-band role changes, etc.).
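The runtime comparison performed by the nightly check can be illustrated as a pure function over the observed state of the reader user. The `ObservedUser` shape below is a simplified, hypothetical input (the real script reads Atlas state and may structure it differently); the expected role, action, and collections are the ones stated above.

```typescript
const EXPECTED_ROLE = "role_reader_reference@admin";
const EXPECTED_COLLECTIONS = ["formularies_staging", "providers_staging"];

// Hypothetical, flattened view of a database user's roles and privileges.
interface ObservedUser {
  roles: string[]; // formatted as "roleName@database"
  grants: { action: string; collection: string }[];
}

// Returns human-readable findings; an empty array means no drift was detected.
function detectDrift(user: ObservedUser): string[] {
  const findings: string[] = [];
  if (user.roles.length !== 1 || user.roles[0] !== EXPECTED_ROLE) {
    findings.push("unexpected roles: " + user.roles.join(", "));
  }
  for (const grant of user.grants) {
    if (grant.action !== "FIND") {
      findings.push("unexpected action " + grant.action + " on " + grant.collection);
    }
    if (EXPECTED_COLLECTIONS.indexOf(grant.collection) === -1) {
      findings.push("unexpected collection " + grant.collection);
    }
  }
  for (const name of EXPECTED_COLLECTIONS) {
    const covered = user.grants.some((g) => g.collection === name && g.action === "FIND");
    if (!covered) findings.push("missing FIND on " + name);
  }
  return findings;
}
```

In this sketch, a non-empty result is what would trigger the P1 GitHub issue.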

Phase 2 Collections (Future — Not Yet Created)

Collection	Classification	Contains PII/PHI?	Encryption	Retention	Access
`consumers`	PHI	Yes (SSN, name, DOB, address)	At rest + CSFLE + KMS	Per purpose + 7yr audit trail	Scoped (per-consumer access)
`enrollments`	PHI	Yes (links consumer to health plan)	At rest + CSFLE + KMS	Per purpose + 7yr audit trail	Broker (assigned only), consumer (own)
`broker_assignments`	Internal	No (broker business info only)	At rest	Duration of relationship	Admin

Phase 2 requires: MongoDB Client-Side Field Level Encryption (CSFLE) with AWS KMS before these collections are created. See docs/security-compliance/encryption-policy.md for the encryption policy + CSFLE roadmap.
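To make the CSFLE requirement concrete, the sketch below builds the kind of automatic-encryption options the MongoDB driver accepts (`keyVaultNamespace`, `kmsProviders`, `schemaMap`). It is an assumption-laden illustration rather than the planned schema: the field names `ssn`, `dob`, and `address`, the database name parameter, the key vault namespace, and the data key placeholder are all hypothetical. Deterministic encryption is used only where equality lookups would be needed; the other fields use random encryption.

```typescript
const DETERMINISTIC = "AEAD_AES_256_CBC_HMAC_SHA_512-Deterministic";
const RANDOM = "AEAD_AES_256_CBC_HMAC_SHA_512-Random";

// JSON schema for the future `consumers` collection (hypothetical field names).
// `dataKeyId` stands in for the BSON UUID of a data key wrapped by AWS KMS.
function consumersSchema(dataKeyId: unknown) {
  return {
    bsonType: "object",
    encryptMetadata: { keyId: [dataKeyId] },
    properties: {
      ssn: { encrypt: { bsonType: "string", algorithm: DETERMINISTIC } },
      dob: { encrypt: { bsonType: "date", algorithm: RANDOM } },
      address: { encrypt: { bsonType: "string", algorithm: RANDOM } },
    },
  };
}

interface AwsCredentials {
  accessKeyId: string;
  secretAccessKey: string;
}

// Assembles auto-encryption options; the key vault namespace is an assumed value.
function autoEncryptionOptions(dbName: string, dataKeyId: unknown, aws: AwsCredentials) {
  const schemaMap: Record<string, ReturnType<typeof consumersSchema>> = {};
  schemaMap[dbName + ".consumers"] = consumersSchema(dataKeyId);
  return {
    keyVaultNamespace: "encryption.__keyVault",
    kmsProviders: { aws },
    schemaMap,
  };
}
```

The resulting object would be passed as the `autoEncryption` option when constructing a `MongoClient`; credentials should come from a secrets manager rather than source code.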

Data Flow Classification

Data Flow	Classification	Handling
User enters zip + age + income	Not stored	Stateless; used for calculation only; not persisted
Plan search results	Public	Returned to client; no PII
Waitlist email submission	PII	Stored via Resend API; not in MongoDB
Enrollment application (future)	PHI	Field-level encrypted in MongoDB; audit logged
Broker view of consumer data (future)	PHI access event	Decrypted on-demand; time-limited session; audit logged

Source File Classification

Source	Classification	Storage	Retention
DFS Final Exhibit ZIPs	Public (government filings)	S3 + local backup	Indefinite
NYSOH scraped HTML	Public (public marketplace data)	S3 + local backup	Indefinite
CMS PUF CSVs	Public (government data)	S3 + local backup	Indefinite
Official NY documents (PDFs)	Public	S3 + local backup	Indefinite
Data ingestion manifests	Internal	S3 (with source file checksums)	Indefinite

Role-to-Collection Access Matrix

Role	`plan_years`	`plans`	`regions`	`zip_county`	`audit_log`
`app-read`	Read	Read	Read	Read	—
`app-write`	Read/Write	Read/Write	Read/Write	Read/Write	—
`audit-write`	—	—	—	—	Insert only
Atlas admin	Full	Full	Full	Full	Full
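The matrix above can be encoded as data and queried with a deny-by-default check, which is one way to test that application code respects it. This is a sketch under stated assumptions: the `atlas-admin` key, the `Action` vocabulary, and the interpretation of Read/Write as find, insert, update, and delete are illustrative choices, not definitions from this policy.

```typescript
type Access = "none" | "read" | "read-write" | "insert-only" | "full";
type Action = "find" | "insert" | "update" | "delete";

const PLAN_COLLECTIONS = ["plan_years", "plans", "regions", "zip_county"];

// Builds one matrix row: the same access on all plan collections, plus audit_log.
function row(planAccess: Access, auditAccess: Access): Record<string, Access> {
  const r: Record<string, Access> = { audit_log: auditAccess };
  for (const name of PLAN_COLLECTIONS) r[name] = planAccess;
  return r;
}

const MATRIX: Record<string, Record<string, Access>> = {
  "app-read": row("read", "none"),
  "app-write": row("read-write", "none"),
  "audit-write": row("none", "insert-only"),
  "atlas-admin": row("full", "full"), // hypothetical key for the "Atlas admin" row
};

// Assumed mapping from access level to concrete actions.
const PERMITTED: Record<Access, Action[]> = {
  none: [],
  read: ["find"],
  "read-write": ["find", "insert", "update", "delete"],
  "insert-only": ["insert"],
  full: ["find", "insert", "update", "delete"],
};

// Unknown roles or collections fall through to "none", so the check denies by default.
function isAllowed(role: string, collection: string, action: Action): boolean {
  const roleRow = MATRIX[role];
  const access: Access = (roleRow && roleRow[collection]) || "none";
  return PERMITTED[access].indexOf(action) !== -1;
}
```

For instance, `isAllowed("audit-write", "audit_log", "insert")` is true while `isAllowed("audit-write", "audit_log", "find")` is false, matching the "Insert only" cell.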

SOC 2 Control Mapping

Control	Evidence
CC6.1 (Logical Access)	Role-to-collection matrix, minimum necessary access
CC6.5 (Data Protection)	Classification levels, encryption requirements per level
A1.2 (Availability)	Retention policies, backup configuration
P6.1 (Privacy — Data Use)	Data flow classification, "not stored" for anonymous queries