S3 data bucket migration runbook

Status: planned, not executed. Target window: Phase 11 (post 48h Phase 10 bake).
Why this exists: the current askflorence-data bucket in the management account (778477254880) predates AWS Organizations and holds two unrelated payloads — PUF source CSVs for the ingest pipeline, and agent-uploaded templates from the runtime app. Both will benefit from being split into purpose-scoped, environment-scoped buckets with clean SOC 2 / EDE audit lines.

Target end-state architecture

Management account (778477254880)
└── askflorence-data-archive                  ← Immutable backup. Object Lock
                                                 COMPLIANCE mode, 7-year retention.
                                                 Replicated-to, never read by runtime.

Staging account (549136075525)
├── askflorence-staging-data                  ← Authoritative PUF source for review.
│   └── puf/<year>/                            Each year's new PUF validated here
│       ├── plan-attributes-puf.csv            BEFORE promotion to prod.
│       ├── benefits-and-cost-sharing-puf.csv
│       └── ...
└── askflorence-staging-agent-uploads         ← Staging-app agent template uploads.
    └── agent-survey-uploads/

Prod account (039624954211)
├── askflorence-prod-data                     ← Authoritative PUF source for runtime.
│   └── puf/<year>/                            Same layout as staging. Populated via
│                                              explicit promotion after staging validates.
└── askflorence-prod-agent-uploads            ← Prod-app agent template uploads.
    └── agent-survey-uploads/

Why this shape

Concern	Current (mgmt bucket)	Target
Blast radius	mgmt compromise reaches prod agent uploads + PUF	Each env owns its own bucket; compromise stays local
Year-over-year PUF release cycle	New PUF ingested directly to prod with no review env	Staging ingests first → audit → promote → prod
Backup/retention	No separate archive; primary == backup	Immutable archive in mgmt with Object Lock; primary is mutable
Auditor framing	"Why is production PHI in the management account?"	Clean per-environment isolation
Lifecycle independence	One bucket, conflicting policies (PUF-indefinite vs agent-7yr)	One policy per bucket, per purpose
Cross-account IAM complexity	Every write needs mgmt bucket policy + prod task role	Same-account writes, simpler IAM
Cost allocation	Prod app costs billed to mgmt account	Prod S3 costs stay in prod account

Why staging-first for PUF

CMS releases updated PUF data each year in Q3/Q4 for the next plan year's open enrollment. The current workflow ingests that data directly to the prod Atlas cluster via scripts/db/ingest-puf-augment.js, with a sanity check afterward via the tier 1-5 audits. No review environment sits between "CMS drops new data" and "prod serves it to real consumers."

The staging-first pattern adds a review gate:

1. New CMS PUF CSVs uploaded to staging bucket askflorence-staging-data/puf/<year>/
2. ingest-puf-augment.js run against staging Atlas with the new year; stores as year: 2027 (or whichever)
3. Tier 1-5 audit harness runs against staging; catches rate drift, shape changes, missing fields, new carrier formats
4. Manual review at stage.askflorence.health with the new year's data
5. Only once staging is clean: aws s3 sync staging bucket → prod bucket, run ingest against prod Atlas, run final audit
6. Replication from the staging and prod buckets delivers the new year to the archive bucket for the permanent historical record (no manual sync; the archive bucket accepts replication writes only)

This mirrors the code promotion flow (main → staging → prod) for data. Any breaking change in CMS's PUF format gets caught on staging, not in production.
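The review gate can be made explicit in tooling. The following is a minimal sketch, not part of the existing scripts: `canPromoteToProd` and the field names on its `status` argument are hypothetical, chosen only to mirror the staging steps listed above.

```javascript
// Hypothetical gate check for promoting a PUF year from staging to prod.
// The gate names are assumptions made for this sketch; they mirror the
// staging steps (upload, ingest, audit, manual review) in this runbook.
const REQUIRED_GATES = [
  "uploadedToStaging",
  "stagingIngestDone",
  "stagingAuditPassed",
  "manualReviewApproved",
];

// Returns { ok, missing }: ok is true only when every gate is exactly true.
function canPromoteToProd(status) {
  const missing = REQUIRED_GATES.filter((gate) => status[gate] !== true);
  return { ok: missing.length === 0, missing };
}
```

A promotion script could call this first and refuse to run the staging-to-prod sync while `missing` is non-empty.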

Migration plan

Phase 11 — Agent-uploads bucket split (low-risk, no data migration)

Why first: the agent-uploads prefix in the current mgmt bucket has exactly one real object today (a smoke-test PDF from the Phase 10 cutover validation). Vercel-era uploads, if any exist, stay where they are as historical. This step has zero user-data migration.

Prod steps:

1. New file infra/envs/prod/s3-agent-uploads.tf:
   - aws_s3_bucket prod_agent_uploads with name askflorence-prod-agent-uploads
   - Bucket encryption: SSE-KMS using module.kms_prod.key_arn
   - Versioning: enabled
   - Public access block: all 4 flags true
   - Bucket policy: DenyNonSSLRequests (mirror of mgmt bucket pattern)
   - Lifecycle: agent-survey-uploads/ → 7-year retention then delete (HIPAA minimum with buffer)
2. Update infra/envs/prod/ecs.tf:
   - Change S3_AGENT_SURVEY_BUCKET=askflorence-data to S3_AGENT_SURVEY_BUCKET=askflorence-prod-agent-uploads
   - Remove the S3AgentSurveyUploadsWrite inline policy's cross-account resource ARN; replace with same-account arn:aws:s3:::askflorence-prod-agent-uploads/agent-survey-uploads/*
3. GuardDuty Malware Protection for S3: add the new bucket to the protected-resources list.
4. aws s3 sync s3://askflorence-data/agent-survey-uploads/ s3://askflorence-prod-agent-uploads/agent-survey-uploads/ to copy any accumulated objects (expected: 0 user data, maybe smoke tests). The destination bucket must already exist, so apply the bucket resource first if step 5 has not yet run.
5. terraform apply + register new ECS task def revision + force-new-deployment.
6. Smoke test: POST /api/agents/discovery/upload from prod with a real PDF → verify object lands in askflorence-prod-agent-uploads, NOT in askflorence-data.
7. Update infra/envs/management/s3-askflorence-data.tf: remove the AllowProdEcsTaskRolePutAgentSurveyUploads statement. Prod task role loses the cross-account grant. Mgmt bucket's agent-survey-uploads/ prefix becomes read-only history.
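The two policy documents named in the prod steps can be sketched as plain JSON objects. This is an illustration under stated assumptions, not the contents of the Terraform files: the builder function names are hypothetical, using the inline policy name as a statement `Sid` is an assumption, and `s3:PutObject` is assumed to be the only action the upload path needs.

```javascript
// Sketch of the bucket policy and task role policy described in this phase.
// Function names are hypothetical; the bucket name and prefix come from the runbook.
const BUCKET = "askflorence-prod-agent-uploads";
const PREFIX = "agent-survey-uploads/";

// Bucket policy: deny every request that does not arrive over TLS.
function denyNonSslBucketPolicy(bucket) {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Sid: "DenyNonSSLRequests",
        Effect: "Deny",
        Principal: "*",
        Action: "s3:*",
        Resource: [`arn:aws:s3:::${bucket}`, `arn:aws:s3:::${bucket}/*`],
        Condition: { Bool: { "aws:SecureTransport": "false" } },
      },
    ],
  };
}

// Task role inline policy: same-account write, scoped to the upload prefix.
// Assumption: s3:PutObject is sufficient for the upload endpoint.
function agentUploadsWritePolicy(bucket, prefix) {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Sid: "S3AgentSurveyUploadsWrite",
        Effect: "Allow",
        Action: ["s3:PutObject"],
        Resource: `arn:aws:s3:::${bucket}/${prefix}*`,
      },
    ],
  };
}
```

Because the bucket uses SSE-KMS, the task role will most likely also need `kms:GenerateDataKey` on the prod key; that statement is left out of this sketch.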

Staging steps (mirror of prod but scoped to staging):

1. New file infra/envs/staging/s3-agent-uploads.tf with askflorence-staging-agent-uploads.
2. Update infra/envs/staging/ecs.tf to point S3_AGENT_SURVEY_BUCKET there.
3. GuardDuty Malware Protection added.
4. Smoke test via stage.askflorence.health.

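Rollback: aws s3 sync in reverse (new bucket → mgmt bucket), revert the env var, and revert the task role policy. If the AllowProdEcsTaskRolePutAgentSurveyUploads statement has already been removed from the mgmt bucket policy, restore it as well; otherwise prod writes to the mgmt bucket will be denied. No user impact during rollback because the active env var is the primary switch.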

Phase 11.5 — Mgmt immutable archive bucket

Purpose: receive replicated copies from both staging and prod data buckets, retain immutably for 7 years. No runtime process reads from this bucket.

Steps:

1. New file infra/envs/management/s3-data-archive.tf:
   - aws_s3_bucket data_archive with name askflorence-data-archive
   - Object Lock enabled at create-time (AWS now also permits enabling it on an existing versioned bucket, but once enabled it can never be disabled)
   - Default retention: 7 years COMPLIANCE mode
   - Versioning: enabled (required for Object Lock)
   - SSE-KMS with mgmt CMK alias/askflorence-data
   - Public access block: all 4 flags true
   - Bucket policy: DenyNonSSLRequests + DenyDeleteObject + only-allow-replication-writes
   - Replication destination configuration accepting writes from staging + prod buckets
2. Cross-account replication IAM:
   - Role askflorence-data-replicator created in each source account (staging and prod), trusted by s3.amazonaws.com. S3 assumes the role named in the source bucket's replication configuration, so the role has to live alongside the source bucket, not in the mgmt account
   - Role policy: read source object versions (s3:GetObjectVersionForReplication and related actions) and replicate into askflorence-data-archive (s3:ReplicateObject, s3:ReplicateTags)
   - Destination-account grants: the askflorence-data-archive bucket policy allows the staging + prod replicator roles to replicate in, and the mgmt CMK key policy allows them to encrypt replicas
3. Lifecycle on archive: Standard → Standard-IA after 30 days → Glacier Deep Archive after 90 days. Deep Archive is ~$0.00099/GB-month; effectively free for PUF volumes.
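The lifecycle tiering and the "effectively free" claim can be checked with a short sketch. The rule object follows the shape S3 accepts for a bucket lifecycle configuration; the rule ID and the function name are hypothetical, the price is the figure quoted in this runbook, and the 0.2 GB volume assumes both ~100 MB PUF copies (staging and prod) end up in the archive.

```javascript
// Lifecycle rule for the archive bucket, in the S3 lifecycle configuration shape.
// The rule ID is a hypothetical name.
const archiveLifecycle = {
  Rules: [
    {
      ID: "archive-tiering",
      Status: "Enabled",
      Filter: { Prefix: "" }, // empty prefix: applies to every object
      Transitions: [
        { Days: 30, StorageClass: "STANDARD_IA" },
        { Days: 90, StorageClass: "DEEP_ARCHIVE" },
      ],
    },
  ],
};

// Price quoted in this runbook; verify against current AWS pricing before use.
const DEEP_ARCHIVE_USD_PER_GB_MONTH = 0.00099;

// Storage-only cost: ignores request, transition, and retrieval charges.
function monthlyStorageCostUsd(sizeGb, usdPerGbMonth) {
  return sizeGb * usdPerGbMonth;
}

// Assumption: ~0.2 GB total once both PUF copies have replicated.
const estimate = monthlyStorageCostUsd(0.2, DEEP_ARCHIVE_USD_PER_GB_MONTH);
```

Under those assumptions the estimate is about $0.0002 per month. Deep Archive also bills a 180-day minimum storage duration, which does not matter for objects that are retained for 7 years anyway.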

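Rollback: destroy in reverse order, which works only while the bucket is still empty. Object Lock COMPLIANCE mode means any object version written during testing cannot be deleted by anyone, including the account root user, until its 7-year retention expires, and a bucket that still holds objects cannot be destroyed. Test this step in a disposable bucket first (with a short retention period) before committing to askflorence-data-archive.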
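To make the consequence concrete, the sketch below estimates when an object written under the default retention becomes deletable. `retainUntil` is a hypothetical helper, and it uses simple calendar-year arithmetic; S3 computes the authoritative retain-until date itself, so treat the result as an approximation.

```javascript
// Approximate the retain-until date for an object version written at `writtenAt`
// under a default retention of `years` years. Hypothetical helper; S3 stores the
// authoritative date on the object version.
function retainUntil(writtenAt, years) {
  const d = new Date(writtenAt.getTime()); // copy so the input is not mutated
  d.setUTCFullYear(d.getUTCFullYear() + years);
  return d;
}

// Example: a test object written on 2026-05-01 stays locked until roughly 2033-05-01.
const example = retainUntil(new Date("2026-05-01T00:00:00Z"), 7);
```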

Phase 11.75 — Staging PUF data bucket + staging-first ingest validation

Purpose: establish staging as the PUF review environment.

Steps:

1. New file infra/envs/staging/s3-puf-data.tf:
   - aws_s3_bucket staging_data with name askflorence-staging-data
   - SSE-KMS staging CMK
   - Versioning enabled
   - Public access block all flags true
   - Replication to askflorence-data-archive in mgmt account (for historical backup)
   - Lifecycle: same as prod bucket (Standard → IA after 90d)
2. aws s3 sync s3://askflorence-data/ s3://askflorence-staging-data/ --exclude 'agent-survey-uploads/*' to copy all PUF years to staging.
3. Update scripts/db/ingest-*.js env var handling:
   - Current scripts read from askflorence-data hard-coded or via env
   - Add S3_PUF_SOURCE_BUCKET env with staging bucket default when running locally against staging Mongo
   - Document the env var contract in docs/infrastructure/mongodb-setup.md
4. Ingest sanity check (critical step, do NOT skip):
   - MONGODB_WRITE_URI=<staging> S3_PUF_SOURCE_BUCKET=askflorence-staging-data node scripts/db/ingest-puf-augment.js --dry-run --year 2026
   - Confirm the dry-run output matches the existing prod Atlas state for year 2026: no drift introduced by the bucket change
   - Run the same thing in --apply mode against a throwaway staging collection; verify collection contents byte-for-byte match what's already in prod Atlas
5. If step 4 passes cleanly, staging data bucket is the authoritative review environment going forward.
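One way to approach the collection comparison in the sanity check is an order-independent digest of the documents. This is a sketch, not the existing audit-parity-check.js: the function names are hypothetical, it assumes documents have been exported as plain JSON-like objects (for example with `_id` already stripped), and it checks logical equality with sorted keys rather than a literal byte comparison.

```javascript
const crypto = require("crypto");

// Serialize a value with object keys sorted, so field order does not affect the result.
// Assumption: values are plain JSON-like data plus Date; other BSON types are not handled.
function canonicalize(value) {
  if (value instanceof Date) return JSON.stringify(value.toISOString());
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const keys = Object.keys(value).sort();
    return `{${keys.map((k) => `${JSON.stringify(k)}:${canonicalize(value[k])}`).join(",")}}`;
  }
  return JSON.stringify(value);
}

// Digest of a whole collection: hash each document, sort the hashes so document
// order is irrelevant, then hash the sorted list.
function collectionDigest(docs) {
  const perDoc = docs
    .map((doc) => crypto.createHash("sha256").update(canonicalize(doc)).digest("hex"))
    .sort();
  return crypto.createHash("sha256").update(perDoc.join("\n")).digest("hex");
}
```

Two collections with equal digests hold the same documents regardless of insertion order; a mismatch says only that something differs, so a per-document diff is still needed to locate it.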

Rollback: if the ingest scripts break, revert S3_PUF_SOURCE_BUCKET to point at the mgmt bucket (unchanged at this point). Scripts resume working. Staging data bucket stays as a replicated copy, not authoritative.
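The env-var switch that this rollback relies on could look like the sketch below. `S3_PUF_SOURCE_BUCKET` and the staging default come from this runbook; the helper name `resolvePufSourceBucket` is hypothetical and the actual scripts may structure this differently.

```javascript
// Hypothetical helper for scripts/db/ingest-*.js: decide which bucket to read PUF CSVs from.
// Default follows the runbook: the staging bucket, for local runs against staging Mongo.
const DEFAULT_PUF_BUCKET = "askflorence-staging-data";

function resolvePufSourceBucket(env) {
  const value = (env.S3_PUF_SOURCE_BUCKET || "").trim();
  return value !== "" ? value : DEFAULT_PUF_BUCKET;
}
```

A script would call `resolvePufSourceBucket(process.env)`; rolling back is then a matter of running with `S3_PUF_SOURCE_BUCKET=askflorence-data`, with no code change.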

Phase 12 — Prod PUF data bucket + promotion flow

Steps:

1. New file infra/envs/prod/s3-puf-data.tf:
   - aws_s3_bucket prod_data with name askflorence-prod-data
   - SSE-KMS prod CMK
   - Versioning enabled
   - Public access block all flags true
   - Replication to askflorence-data-archive in mgmt account
   - Lifecycle: Standard → IA after 90d
2. Initial population: aws s3 sync s3://askflorence-staging-data/ s3://askflorence-prod-data/ (staging → prod for the first time)
3. Update ingest scripts' prod-pointing env:
   - S3_PUF_SOURCE_BUCKET=askflorence-prod-data in any prod-run context
   - This is a scripts-only change; the serving app does not read S3 for PUF data (it reads Atlas)
4. Re-run full audit tier 1-5 against prod Atlas after the first ingest from the new bucket and confirm no regression in serving data.
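Both data buckets replicate into the mgmt archive, and the role S3 uses for that has to be created in the source bucket's own account. The sketch below shows one plausible permissions policy for such a role. The function name is hypothetical, KMS statements are omitted, and `s3:ReplicateDelete` is deliberately left out on the assumption that delete markers should not reach the immutable archive.

```javascript
// Permissions policy for a replication role in the SOURCE account (staging or prod).
// Hypothetical builder; KMS decrypt/encrypt statements are intentionally omitted.
function replicationRolePolicy(sourceBucket, archiveBucket) {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        // Read the replication configuration and list the source bucket.
        Effect: "Allow",
        Action: ["s3:GetReplicationConfiguration", "s3:ListBucket"],
        Resource: `arn:aws:s3:::${sourceBucket}`,
      },
      {
        // Read the object versions that are to be replicated.
        Effect: "Allow",
        Action: [
          "s3:GetObjectVersionForReplication",
          "s3:GetObjectVersionAcl",
          "s3:GetObjectVersionTagging",
        ],
        Resource: `arn:aws:s3:::${sourceBucket}/*`,
      },
      {
        // Write replicas into the archive and hand ownership to the mgmt account.
        Effect: "Allow",
        Action: [
          "s3:ReplicateObject",
          "s3:ReplicateTags",
          "s3:ObjectOwnerOverrideToBucketOwner",
        ],
        Resource: `arn:aws:s3:::${archiveBucket}/*`,
      },
    ],
  };
}

const prodPolicy = replicationRolePolicy("askflorence-prod-data", "askflorence-data-archive");
```

The archive bucket policy in the mgmt account has to grant the same replicate actions to each source-account role; the role policy alone is not sufficient across accounts.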

Document the PUF promotion workflow at docs/runbooks/puf-year-promotion.md (new file):

New PUF year arrives from CMS:
1. Upload to staging bucket: aws s3 cp plan-attributes-puf.csv s3://askflorence-staging-data/puf/<year>/
2. Run ingest against staging Atlas: S3_PUF_SOURCE_BUCKET=askflorence-staging-data ... --year <year>
3. Run audit tier 1-5 against staging: scripts/audit/*.js with staging URI
4. Manual review at stage.askflorence.health
5. aws s3 sync s3://askflorence-staging-data/puf/<year>/ s3://askflorence-prod-data/puf/<year>/
6. Run ingest against prod Atlas: S3_PUF_SOURCE_BUCKET=askflorence-prod-data ... --year <year>
7. Run audit tier 1-5 against prod: scripts/audit/*.js with prod URI
8. Replication to archive bucket happens automatically (both staging + prod replicate to mgmt archive)
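Step 5 of the workflow above is the one command that moves data into prod, so it is worth generating rather than typing. The sketch below builds the sync arguments for a given year. The function names are hypothetical; the bucket names and `puf/<year>/` layout are taken from this runbook.

```javascript
// Build the S3 URI for one PUF year. Rejects anything that is not a 4-digit year
// so a typo cannot widen the sync to an unintended prefix.
function pufPrefixUri(bucket, year) {
  if (!/^\d{4}$/.test(String(year))) {
    throw new Error(`invalid year: ${year}`);
  }
  return `s3://${bucket}/puf/${year}/`;
}

// Returns an argv array (not a shell string), suitable for child_process.execFile.
function promotionSyncCommand(year) {
  const src = pufPrefixUri("askflorence-staging-data", year);
  const dst = pufPrefixUri("askflorence-prod-data", year);
  return ["aws", "s3", "sync", src, dst];
}
```

Returning an argument array instead of a single string avoids shell interpolation of the year value.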

Phase 12.5 — Deprecate mgmt `askflorence-data` bucket

Steps:

1. Verify all runtime flows have moved off askflorence-data:
   - Prod ECS task def env: S3_AGENT_SURVEY_BUCKET points at prod bucket
   - Staging ECS task def env: points at staging bucket
   - Prod + staging ingest scripts point at their respective new buckets
   - No code path reads from askflorence-data directly
2. Make the bucket read-only:
   - Replace bucket policy with only DenyNonSSLRequests + Deny * for any write actions + allow Get/List for archive browsing
   - Revoke any IAM user creds that still had write access (Vercel-era IAM user; see Phase 11 Resend retirement + related static-creds retirement)
3. Leave the existing objects in place as the pre-migration historical record. Do NOT delete.
4. Optionally: set a lifecycle policy to transition everything in the bucket to Glacier Deep Archive after 90 days. Keep-forever retention.
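The first verification step (confirming nothing still points at askflorence-data) can be partly automated by scanning each environment's variables. This is a minimal sketch with a hypothetical function name; it takes a plain name-to-value map, however that map is obtained from the task definition or shell.

```javascript
const LEGACY_BUCKET = "askflorence-data";

// Return the names of env vars whose value is exactly the legacy bucket.
// Exact match matters: "askflorence-data-archive" contains the legacy name as a
// prefix and must not be flagged.
function findLegacyBucketRefs(env) {
  return Object.entries(env)
    .filter(([, value]) => value === LEGACY_BUCKET)
    .map(([name]) => name)
    .sort();
}
```

An empty result for both the prod and staging environments covers the env-var checks only; it does not prove that no code path hard-codes the bucket name, which still needs a source search.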

Documentation trail (evidence for auditors)

Every step of the migration generates evidence. Log locations:

Evidence type	Where it lives
Timestamped change record	`docs/infrastructure/change-log.md` — one entry per phase step
Session-level narrative	`docs/session-log/<date>-s3-data-migration-phase-<n>.md` — the chronological "what happened"
Terraform state diffs	`infra/envs/<env>/terraform.tfstate.backup` (auto) + `git log` on `infra/`
Ingest sanity check output	`scripts/audit/audit-tier-*-results.json` snapshots before + after the migration
Data-level confirmation	`scripts/audit/audit-parity-check.js` output at each phase transition
CloudTrail	All bucket creation, policy changes, replication config changes captured in org trail in log-archive account (Phase 2 setup)

Rollback philosophy

At every phase, the previous phase's state is preserved. No step makes the PRIOR state unreachable. This means:

Phase 11 agent-uploads split: mgmt bucket's agent-survey-uploads/ prefix stays in place; flipping S3_AGENT_SURVEY_BUCKET env var back (and restoring the cross-account grant on the mgmt bucket, if it was already removed) restores the old behavior
Phase 11.5 archive bucket: writes are one-directional (replication INTO archive); no workload depends on reading FROM archive yet
Phase 11.75 staging data bucket: scripts fall back to mgmt bucket via env-var change
Phase 12 prod data bucket: scripts fall back to mgmt bucket via env-var change
Phase 12.5 mgmt deprecation: read-only posture, not destruction; re-granting write access is one bucket policy edit if needed

The mgmt askflorence-data bucket is never deleted during this migration. It becomes read-only cold storage but remains intact. This is the recovery floor.

Effort estimate

Phase	Effort	Risk	Data migration
11 — Agent-uploads split	~1 hour	Low	~0 objects (fresh prefix)
11.5 — Archive bucket	~30 min	Low (isolated, no downstream yet)	None
11.75 — Staging data bucket + ingest validation	~2 hours	Medium (ingest scripts touch prod DB if not careful)	~100 MB PUF copy to staging
12 — Prod data bucket + promotion flow	~2 hours	Medium (same as 11.75 on prod side)	~100 MB staging-to-prod sync
12.5 — Mgmt bucket deprecation	~30 min	Very low	None
Total	~6 hours focused	—	~200 MB total network transfer

All five phases can happen in a single focused session, or spread across 2-3 sessions. All are post-48h-bake, and all are compatible with the existing Phase 11/12 hardening + compliance closeout work.

Related documents

Session log 2026-04-23, Phase 10 cutover: context on why this migration surfaced
aws-setup runbook: account topology + general AWS operations
infra/envs/management/s3-askflorence-data.tf: current state of the mgmt bucket policy; gets reduced in Phase 12.5
infra/envs/prod/ecs.tf: S3AgentSurveyUploadsWrite inline policy + S3_AGENT_SURVEY_BUCKET env var that Phase 11 updates
GuardDuty setup: Malware Protection for S3 config; will be extended to the new buckets in Phase 11