Data Retention Policy
Status: Active. Effective 2026-05-11. Owner: Taha Abbasi (technical implementation) + Asad Khalid (legal / regulatory). Reviewed: Annually + whenever a new collection / data store / regulatory obligation is added.
Purpose
Defines how long AskFlorence retains each class of data, by what mechanism it is deleted, and how the retention claim is auditable. Required artifact for:
- HIPAA §164.316 (documentation requirements; 6-year minimum retention for the documented policies + procedures themselves)
- HIPAA §164.530(j) (record retention for HIPAA-related decisions)
- HIPAA §164.312(b) (audit controls — implicit retention period)
- CMS EDE Phase 3 / MARS-E 2.2 AU-11 (audit record retention)
- SOC 2 CC2.3 (information used to support objectives — retention as part of information lifecycle)
- State breach notification laws (varies; default to longest applicable)
Data classification — source of truth
This policy aligns with the data classification taxonomy. When a row below references a "data class," it refers to the classes defined there: Public / Internal / PII / PHI plus the EDE-introduced classes FTI (Federal Tax Information) and cms_hub (data fetched from the CMS Marketplace hub or HealthCare.gov FFE).
Retention schedule
| Data class | Examples | Retention period | Deletion mechanism | Notes |
|---|---|---|---|---|
| Public | Plan names, premium amounts, ZIP→county mappings, NPI provider directory, RxCUI formulary tier mappings | Indefinite (current plan year + prior years) | None — kept indefinitely as reference data | Refreshed per ingest cadence; superseded versions retained for historical comparison |
| Internal — application telemetry | API access logs, ingest manifests, deploy logs (CloudWatch) | 90 days hot in CloudWatch; 7 years in log-archive S3 (CloudTrail org-trail) | CloudWatch Logs retention policy + S3 lifecycle to Glacier after 1 year | Org-wide CloudTrail trail captures all AWS API events |
| Internal — audit log (DB-layer) | agent_audit_log collection — every auth event, admin action, data change | 7 years minimum (HIPAA §164.312(b)); target 10 years (EDE-safer) | TTL index at the Mongo collection level (set at Phase 5 collection creation alongside the append-only role binding) | Append-only enforced at DB permission layer (ADR 0002); aged-out records cannot be selectively purged before TTL fires |
| Internal — Mongo audit logs (Atlas-side) | Atlas database audit logs (atlasAdmin-level audit) | Atlas-managed retention (90 days default; 12 months on paid tier — confirm tier) | Atlas-managed | Used for incident-response post-mortem reconstruction |
| PII — waitlist / agent waitlist | agent_waitlist_submissions (email, name, phone, NPN, role); consumer-side waitlist (email only) | 6 years from last activity | Manual review at quarterly access review; planned automated TTL post-Phase-5 | The CAN-SPAM "unsubscribe" path triggers a soft-archive within 10 business days (#59). GDPR / CCPA "right to erasure" requests trigger a purge within 30 days, with an audit-log row written. |
| PII — agent discovery survey responses | agent_survey_responses (NPN, agent profile + free-text fields) | 6 years from collection | Manual review + planned automated TTL | Same erasure-on-request path as waitlist. Consent capture is per-record per agent platform compliance. |
| PII — Google Workspace email + Drive content | Founder + ops @askflorence.health mail; team documents | Per Google Workspace Vault retention rules — to be configured before SOC 2 evidence window starts | Google Vault retention rules (default: indefinite) | Vault rules to be applied: mail = 7 years; chat = 1 year; Drive content = until role changes |
| PII — HubSpot CRM | Agent waitlist + survey mirrors (member data never touches HubSpot by design) | 7 years from last contact (HubSpot Marketing Hub default) | HubSpot platform deletion (HubSpot data retention controls) | GDPR-delete endpoint use is restricted — only +test* aliases per the 2026-05-09 incident learning |
| PHI — consumers, enrollments (not yet created) | SSN, DOB, plan-enrollment records | 10 years from last activity (HIPAA 6-year minimum; EDE-safer 10) | TTL index at collection level + CSFLE-encrypted blobs become unrecoverable when KMS CMK is rotated past retention boundary | These collections do NOT exist today. Pre-launch checklist includes: (1) CSFLE + KMS-per-field, (2) TTL index, (3) audit-log row written on insert/update/delete, (4) GDPR / state-AG erasure procedure documented. |
| PHI — files | Income verification PDFs, ID proofing artifacts (not yet collected) | 10 years from collection + immediate redaction on close-of-enrollment for non-PHI fields | S3 Object Lock + lifecycle to Glacier Deep Archive after 1 year + permanent deletion at TTL | S3 bucket policy + Object Lock retention applied at creation time; pre-launch checklist mirrors PHI collections above. |
| FTI — Federal Tax Information (not yet collected) | Income data from IRS Data Hub (FTI as defined by IRS Publication 1075) | Per IRS 1075 — typically until purpose-served, then secure-destroy | IRS-1075-aligned destruction procedure (not yet documented; required before any FTI is collected) | FTI is collected only at enrollment with explicit consent + audit log; never logged in application telemetry; storage path is purpose-bound (eligibility determination) |
| cms_hub — CMS Marketplace API / FFE data | Eligibility determinations, FFM plan inventories, public marketplace data | Indefinite for public; 10 years for any identified-individual-bound determinations | Per CMS EDE program requirements + same TTL as PHI for identified records | Public marketplace data refreshed at ingest cadence; identified records (Phase 5+) follow PHI retention. |
| Secrets — credentials, API keys | AWS Secrets Manager entries; Atlas connection strings; SES domain identities | Until secret value rotation (annual or on-incident); old versions retained 30 days for rollback | Secrets Manager has a 30-day default recovery window; explicit force-delete-without-recovery only with ADR | Rotation cadence in access-control-policy |
| Backups — S3 versioning, Atlas snapshots | Versioned objects in tfstate / data buckets; Atlas continuous snapshots | S3: 90 days for tfstate, lifecycle thereafter; Atlas: per-tier (M10 = 7-day point-in-time + daily snapshots for 30 days) | S3 lifecycle policies + Atlas snapshot retention config | Backups inherit the encryption + classification of the source data |
| Source code (GitHub) | All repositories | Indefinite (commit history is the audit trail) | Branch deletion does not remove history; force-pushes are blocked on main | No PHI / secrets in repo by .gitignore + GitHub secret scanning |
| Compliance documentation (this directory) | Policies, control mappings, runbooks, ADRs, access reviews | 6 years minimum (HIPAA §164.316) — preserved indefinitely as part of git history | Never deleted; superseded versions retained as git history; quarterly access-review documents stamped + archived in-tree | Versioned via git; each annual policy review appends a row to the change-log, never overwrites prior versions |
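
The retention periods in the schedule above reduce to simple date arithmetic. The following is a minimal sketch, assuming each record carries an anchor date (last activity or collection date, depending on the row). The dictionary keys and function names are hypothetical labels chosen for illustration; they are not identifiers used by AskFlorence.

```python
from datetime import date

# Retention periods in years, copied from the schedule above.
# The keys are hypothetical labels, not real collection or class names.
RETENTION_YEARS = {
    "pii_waitlist": 6,       # 6 years from last activity
    "pii_survey": 6,         # 6 years from collection
    "audit_log_minimum": 7,  # 7 years minimum
    "audit_log_target": 10,  # 10-year target
    "phi": 10,               # 10 years from last activity
}


def add_years(start: date, years: int) -> date:
    """Return the same calendar day `years` later; Feb 29 falls back to Feb 28."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        # start is Feb 29 and the target year is not a leap year
        return start.replace(year=start.year + years, day=28)


def retention_deadline(data_class: str, anchor: date) -> date:
    """Earliest date on which a record of `data_class` may be deleted."""
    return add_years(anchor, RETENTION_YEARS[data_class])


def is_past_retention(data_class: str, anchor: date, today: date) -> bool:
    """True once the retention period for the record has fully elapsed."""
    return today >= retention_deadline(data_class, anchor)
```

For example, `retention_deadline("phi", date(2026, 5, 11))` yields `date(2036, 5, 11)`. For the rows anchored on "last activity", the anchor date would need to be refreshed whenever the record is touched.
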
Deletion procedures
Routine deletion (TTL-driven)
- Mongo TTL indexes on PHI / PII collections: configured at collection creation; verified at quarterly access review.
- S3 lifecycle rules: configured at bucket creation; verified at quarterly access review (`aws s3api get-bucket-lifecycle-configuration`).
- CloudWatch Log retention: set per log group at creation.
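
The quarterly checks on these mechanisms could be partly scripted. The sketch below assumes the reviewer has already exported two things as plain data: the index documents for a collection (in the shape MongoDB's `listIndexes` command returns, where a TTL index carries an `expireAfterSeconds` field) and the parsed JSON output of `aws s3api get-bucket-lifecycle-configuration`. The function names are hypothetical, and the script inspects the data only; it does not connect to Atlas or AWS.

```python
from typing import Any, Mapping, Optional, Sequence


def ttl_seconds(indexes: Sequence[Mapping[str, Any]]) -> Optional[int]:
    """Return expireAfterSeconds of the first TTL index, or None if none exists.

    `indexes` is a list of index documents as returned by listIndexes.
    """
    for index in indexes:
        if "expireAfterSeconds" in index:
            return int(index["expireAfterSeconds"])
    return None


def expiration_days(lifecycle: Mapping[str, Any]) -> Optional[int]:
    """Return the shortest Expiration.Days among enabled rules, or None.

    `lifecycle` is the parsed JSON from get-bucket-lifecycle-configuration.
    Rules that only transition objects (no Expiration) are ignored.
    """
    days = [
        rule["Expiration"]["Days"]
        for rule in lifecycle.get("Rules", [])
        if rule.get("Status") == "Enabled" and "Days" in rule.get("Expiration", {})
    ]
    return min(days) if days else None
```

A result of `None` from either function would flag a collection or bucket with no automated deletion configured, which the reviewer can then compare against the retention schedule.
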
Erasure on request (GDPR / CCPA / HIPAA right-of-access-and-amendment)
When a data subject requests erasure:
- Validate the request: confirm identity using the contact email on file + any additional identifier (NPN for agents).
- Document the request: write an audit-log row to `agent_audit_log` (`action: "erasure_request"`).
- Scope: identify every collection + system holding the subject's data. Default scope: Atlas (`agent_waitlist_submissions`, `agent_survey_responses`, future `consumers` / `enrollments`); HubSpot CRM; SES suppression list (if marketing send history exists); CloudWatch Logs (if request mentions a session ID, scrub via PII-redaction script).
- Execute within 30 days:
  - Atlas: hard-delete the record AND write an `erasure_complete` audit-log row.
  - HubSpot: use the `archive` (soft-delete) endpoint, NOT the `gdpr-delete` endpoint, unless the address is unambiguously synthetic. The 2026-05-09 incident with [email protected] is the negative example: gdpr-delete permanently blocklists, and the irreversible portal-level blocklist cannot be lifted even by HubSpot Support.
  - SES: add to suppression list to prevent any future sends.
- Confirm: email confirmation to the requester (using a fresh thread, not the suppression-listed address).
- Retain the audit-log entries: the erasure request + completion rows remain in `agent_audit_log` for the full 7-10 year retention period. The audit-log entries are NOT subject to the erasure (regulatory permitted exception); they are minimized, recording that the erasure occurred, not the erased content.
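
The audit-log side of the procedure above can be sketched as follows. This is an illustration under stated assumptions, not the actual `agent_audit_log` schema: only the `action` values `erasure_request` and `erasure_complete` come from this policy, while the other field names (`request_id`, `systems`, `at`) and the function names are hypothetical. The rows deliberately carry an opaque request ID rather than any of the subject's data, in line with the minimization requirement.

```python
import uuid
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable

ERASURE_WINDOW = timedelta(days=30)  # "within 30 days" per the Execute step
ALLOWED_ACTIONS = {"erasure_request", "erasure_complete"}


def new_request_id() -> str:
    """Opaque identifier that links the request row to the completion row."""
    return str(uuid.uuid4())


def erasure_audit_row(
    action: str, request_id: str, systems: Iterable[str], now: datetime
) -> Dict[str, object]:
    """Build a minimized audit row: what happened and where, never the erased content."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unexpected action: {action}")
    return {
        "action": action,
        "request_id": request_id,
        "systems": sorted(systems),  # e.g. atlas, hubspot, ses
        "at": now.isoformat(),
    }


def erasure_due(received_at: datetime) -> datetime:
    """Latest moment by which the erasure must be executed."""
    return received_at + ERASURE_WINDOW
```

Keeping the email address and other identifiers out of the row means the audit trail can be retained for the full period without re-creating the data that was erased.
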
Decommissioning / migration
When a collection / data store is decommissioned (e.g., Phase 5 schema migration):
- Capture a backup snapshot dated + named with the migration session.
- Migrate live readers + writers to the new collection (`getReferenceDb` pattern, etc.).
- Verify the new collection is operating correctly + the old collection has zero read/write traffic for 30 days minimum.
- Drop the old collection. Audit-log the drop.
- Retain the dated backup snapshot for the full retention period of the data class involved (e.g., if the collection held PHI, retain the snapshot 10 years).
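
Two of the steps above lend themselves to small helpers: naming the snapshot and gating the drop on the 30-day quiet period. The sketch below assumes the last read and write timestamps for the old collection are available from monitoring; the naming format and both function names are hypothetical, not an existing AskFlorence convention.

```python
from datetime import date, datetime, timedelta

QUIET_PERIOD = timedelta(days=30)  # zero read/write traffic for 30 days minimum


def snapshot_name(collection: str, session_id: str, taken_on: date) -> str:
    """Name a backup snapshot with the collection, migration session, and date."""
    return f"{collection}-{session_id}-{taken_on.isoformat()}"


def safe_to_drop(last_read: datetime, last_write: datetime, now: datetime) -> bool:
    """True only if neither a read nor a write has occurred within the quiet period."""
    last_traffic = max(last_read, last_write)
    return now - last_traffic >= QUIET_PERIOD
```

A drop script could refuse to proceed unless `safe_to_drop` returns `True` and a snapshot with the expected name already exists.
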
Vendor-side data
Each vendor BAA (see vendor register) commits the vendor to deletion-on-termination procedures. At vendor retirement:
- Trigger the contract-termination deletion procedure with the vendor.
- Collect a written confirmation of deletion + scheduled-purge date.
- Move the vendor to the "retired" section of the vendor register; preserve the BAA + deletion-confirmation in `docs/infrastructure/evidence/` for the full retention period (6 years HIPAA minimum, 10 years EDE-safer).
Verification
| Cadence | What | How |
|---|---|---|
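| Quarterly access review | Confirm TTL indexes are in place + S3 lifecycle rules are configured | aws s3api get-bucket-lifecycle-configuration + Atlas db.runCommand({listCollections: 1}) review, followed by an index listing per collection (listIndexes), because TTL settings (expireAfterSeconds) appear on indexes, not in the collection list |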
| Annually (audit prep) | Sample a deleted record + confirm it is irretrievable (subject to backup retention) | One synthetic erasure exercise during Q3 review |
| At every vendor retirement | Collect deletion-confirmation; archive in docs/infrastructure/evidence/ | Documented in vendor register row |
| At every collection drop | Audit-log row + dated snapshot | Decommissioning procedure above |
Reference
- Data Classification Policy — source-of-truth for what data is in each class
- Encryption Policy — encryption posture per data class
- Access Control Policy — credential rotation + access cadence
- Incident Response Plan — handling of inadvertent retention-policy violations
- Vendor / Subprocessor Register — vendor-side deletion commitments
- SOC 2 Control Mapping — CC2.3 row
- HIPAA Control Mapping — §164.316 (documentation retention), §164.312(b) (audit retention)
- CMS EDE Appendix A Mapping — §9 (Access Control Logging retention)