S3 data bucket migration runbook
Status: planned, not executed. Target window: Phase 11 (post 48h Phase 10 bake).
Why this exists: the current askflorence-data bucket in the management account (778477254880) predates AWS Organizations and holds two unrelated payloads: PUF source CSVs for the ingest pipeline, and agent-uploaded templates from the runtime app. Splitting them into purpose-scoped, environment-scoped buckets gives each payload clean SOC 2 / EDE audit lines.
Target end-state architecture
Management account (778477254880)
└── askflorence-data-archive              ← Immutable backup. Object Lock
                                            COMPLIANCE mode, 7-year retention.
                                            Replicated-to, never read by runtime.

Staging account (549136075525)
├── askflorence-staging-data              ← Authoritative PUF source for review.
│   └── puf/<year>/                         Each year's new PUF validated here
│       ├── plan-attributes-puf.csv         BEFORE promotion to prod.
│       ├── benefits-and-cost-sharing-puf.csv
│       └── ...
└── askflorence-staging-agent-uploads     ← Staging-app agent template uploads.
    └── agent-survey-uploads/

Prod account (039624954211)
├── askflorence-prod-data                 ← Authoritative PUF source for runtime.
│   └── puf/<year>/                         Same layout as staging. Populated via
│                                           explicit promotion after staging validates.
└── askflorence-prod-agent-uploads        ← Prod-app agent template uploads.
    └── agent-survey-uploads/

Why this shape
| Concern | Current (mgmt bucket) | Target |
|---|---|---|
| Blast radius | mgmt compromise reaches prod agent uploads + PUF | Each env owns its own bucket; compromise stays local |
| Year-over-year PUF release cycle | New PUF ingested directly to prod with no review env | Staging ingests first → audit → promote → prod |
| Backup/retention | No separate archive; primary == backup | Immutable archive in mgmt with Object Lock; primary is mutable |
| Auditor framing | "Why is production PHI in management account?" | Clean per-environment isolation |
| Lifecycle independence | One bucket, conflicting policies (PUF-indefinite vs agent-7yr) | One policy per bucket, per purpose |
| Cross-account IAM complexity | Every write needs mgmt bucket policy + prod task role | Same-account writes, simpler IAM |
| Cost allocation | Prod app costs billed to mgmt account | Prod S3 costs stay in prod account |
Why staging-first for PUF
CMS releases updated PUF data each year in Q3/Q4 for the next plan year's open enrollment. The current workflow ingests that data directly to the prod Atlas cluster via scripts/db/ingest-puf-augment.js, with a sanity check afterward via the tier 1-5 audits. No review environment sits between "CMS drops new data" and "prod serves it to real consumers."
The staging-first pattern adds a review gate:
- New CMS PUF CSVs uploaded to staging bucket askflorence-staging-data/puf/<year>/
- ingest-puf-augment.js run against staging Atlas with the new year; stores as year: 2027 (or whichever)
- Tier 1-5 audit harness runs against staging; catches rate drift, shape changes, missing fields, new carrier formats
- Manual review at stage.askflorence.health with the new year's data
- Only once staging is clean: aws s3 sync staging bucket → prod bucket, run ingest against prod Atlas, run final audit
- aws s3 sync staging bucket → archive bucket for the permanent historical record
This mirrors the code promotion flow (main → staging → prod) for data. Any breaking change in CMS's PUF format gets caught on staging, not in production.
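The review gate described above can be sketched as a small check that refuses to produce the staging → prod sync command until every staging stage has passed. This is a minimal illustration, not the project's tooling: the gate names (ingest, audit, manualReview) and both function names are hypothetical; only the bucket names and the puf/<year>/ layout come from this runbook.

```javascript
// Hypothetical gate names; the runbook defines the order of the checks,
// not how their results are recorded.
const STAGING_GATES = ["ingest", "audit", "manualReview"];

// Returns whether promotion may proceed, plus the gates that have not passed.
// A gate counts as passed only when it is explicitly true.
function canPromoteToProd(results) {
  const failed = STAGING_GATES.filter((gate) => results[gate] !== true);
  return { ok: failed.length === 0, failed };
}

// Builds the staging -> prod sync command for one PUF year.
// Rejects anything that is not a four-digit year so a typo cannot
// widen the sync to the whole bucket.
function promotionSyncCommand(year) {
  if (!/^\d{4}$/.test(String(year))) {
    throw new Error(`invalid PUF year: ${year}`);
  }
  return (
    `aws s3 sync s3://askflorence-staging-data/puf/${year}/ ` +
    `s3://askflorence-prod-data/puf/${year}/`
  );
}
```

Treating a missing gate result as a failure keeps the default safe: promotion has to be earned by recorded passes rather than assumed.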
Migration plan
Phase 11 — Agent-uploads bucket split (low-risk, no data migration)
Why first: the agent-uploads prefix in the current mgmt bucket has exactly one real object today (a smoke-test PDF from the Phase 10 cutover validation). Vercel-era uploads, if any exist, stay where they are as historical. This step has zero user-data migration.
Prod steps:
1. New file infra/envs/prod/s3-agent-uploads.tf:
   - aws_s3_bucket prod_agent_uploads with name askflorence-prod-agent-uploads
   - Bucket encryption: SSE-KMS using module.kms_prod.key_arn
   - Versioning: enabled
   - Public access block: all 4 flags true
   - Bucket policy: DenyNonSSLRequests (mirror of mgmt bucket pattern)
   - Lifecycle: agent-survey-uploads/ → 7-year retention then delete (HIPAA minimum with buffer)
2. Update infra/envs/prod/ecs.tf:
   - Change S3_AGENT_SURVEY_BUCKET=askflorence-data to S3_AGENT_SURVEY_BUCKET=askflorence-prod-agent-uploads
   - Remove the S3AgentSurveyUploadsWrite inline policy's cross-account resource ARN; replace with same-account arn:aws:s3:::askflorence-prod-agent-uploads/agent-survey-uploads/*
3. GuardDuty Malware Protection for S3: add the new bucket to the protected-resources list.
4. aws s3 sync s3://askflorence-data/agent-survey-uploads/ s3://askflorence-prod-agent-uploads/agent-survey-uploads/: copy any accumulated objects (expected: 0 user data, maybe smoke tests).
5. terraform apply + register new ECS task def revision + force-new-deployment.
6. Smoke test: POST /api/agents/discovery/upload from prod with a real PDF → verify object lands in askflorence-prod-agent-uploads, NOT in askflorence-data.
7. Update infra/envs/management/s3-askflorence-data.tf: remove the AllowProdEcsTaskRolePutAgentSurveyUploads statement. Prod task role loses the cross-account grant. Mgmt bucket's agent-survey-uploads/ prefix becomes read-only history.
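For reference, the two policy pieces named in the prod steps can be sketched as plain JSON documents built in JavaScript. This assumes DenyNonSSLRequests follows the common aws:SecureTransport deny pattern; the actual statement in the mgmt bucket may differ in detail, and both function names are hypothetical.

```javascript
// Sketch of a DenyNonSSLRequests bucket policy, assuming the common
// pattern: deny every S3 action when the request is not made over TLS.
function denyNonSslPolicy(bucket) {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Sid: "DenyNonSSLRequests",
        Effect: "Deny",
        Principal: "*",
        Action: "s3:*",
        // Both the bucket itself and every object in it.
        Resource: [`arn:aws:s3:::${bucket}`, `arn:aws:s3:::${bucket}/*`],
        Condition: { Bool: { "aws:SecureTransport": "false" } },
      },
    ],
  };
}

// Same-account resource ARN for the task role's upload permission,
// scoped to the agent-survey-uploads/ prefix as in step 2.
function agentUploadsWriteArn(bucket) {
  return `arn:aws:s3:::${bucket}/agent-survey-uploads/*`;
}

console.log(JSON.stringify(denyNonSslPolicy("askflorence-prod-agent-uploads"), null, 2));
```

In Terraform the same document would normally be expressed through an aws_s3_bucket_policy resource; the JSON shape is what ends up attached to the bucket either way.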
Staging steps (mirror of prod but scoped to staging):
- New file infra/envs/staging/s3-agent-uploads.tf with askflorence-staging-agent-uploads.
- Update infra/envs/staging/ecs.tf to point S3_AGENT_SURVEY_BUCKET there.
- GuardDuty Malware Protection added.
- Smoke test via stage.askflorence.health.
Rollback: aws s3 sync in reverse (new bucket → mgmt bucket) + revert env var + revert task role policy. No user impact during rollback because the active env var is the primary switch.
Phase 11.5 — Mgmt immutable archive bucket
Purpose: receive replicated copies from both staging and prod data buckets, retain immutably for 7 years. No runtime process reads from this bucket.
Steps:
1. New file infra/envs/management/s3-data-archive.tf:
   - aws_s3_bucket data_archive with name askflorence-data-archive
   - Object Lock enabled at create-time (the simplest path; AWS did not support enabling it on existing buckets until late 2023)
   - Default retention: 7 years COMPLIANCE mode
   - Versioning: enabled (required for Object Lock)
   - SSE-KMS with mgmt CMK alias/askflorence-data
   - Public access block: all 4 flags true
   - Bucket policy: DenyNonSSLRequests + DenyDeleteObject + only-allow-replication-writes
   - Replication destination configuration accepting writes from staging + prod buckets
2. Cross-account replication IAM:
   - Role askflorence-data-replicator in mgmt account, trusted by s3.amazonaws.com
   - Role policy: PutObject + PutObjectVersionAcl on askflorence-data-archive
   - Source-account grants: staging + prod bucket policies allow this replicator role to read source objects
3. Lifecycle on archive: Standard → Standard-IA after 30 days → Glacier Deep Archive after 90 days. Deep Archive is ~$0.00099/GB-month; effectively free for PUF volumes.
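To make "effectively free" concrete, here is a rough worked calculation. It uses the ~$0.00099/GB-month figure quoted above and assumes about 0.2 GB in the archive (the ~100 MB staging copy plus the ~100 MB prod copy from the effort estimate). It ignores request, transition, and minimum-storage-duration charges, and it reads both lifecycle thresholds as days since object creation. Function names are hypothetical.

```javascript
// Price quoted in this runbook; check current AWS pricing before relying on it.
const DEEP_ARCHIVE_USD_PER_GB_MONTH = 0.00099;

// Storage class an object would be in at a given age, reading
// "IA after 30 days, Deep Archive after 90 days" as days since creation.
function storageClassAtAge(days) {
  if (days >= 90) return "DEEP_ARCHIVE";
  if (days >= 30) return "STANDARD_IA";
  return "STANDARD";
}

// Steady-state monthly storage cost once everything sits in Deep Archive.
function deepArchiveMonthlyUsd(sizeGb) {
  return sizeGb * DEEP_ARCHIVE_USD_PER_GB_MONTH;
}

const archiveGb = 0.2; // assumption: staging + prod PUF copies, ~100 MB each
const monthly = deepArchiveMonthlyUsd(archiveGb);
const sevenYears = monthly * 12 * 7;
console.log(`monthly: $${monthly.toFixed(6)}, 7-year total: $${sevenYears.toFixed(4)}`);
```

Under these assumptions the storage line item stays well below one cent per month, so the retention period rather than the storage price is the decision that matters for this bucket.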
Rollback: destroy in reverse order. Object Lock COMPLIANCE mode means objects placed during the testing period cannot be deleted until their retention expires, so test this step in a disposable bucket before committing to askflorence-data-archive.
Phase 11.75 — Staging PUF data bucket + staging-first ingest validation
Purpose: establish staging as the PUF review environment.
Steps:
1. New file infra/envs/staging/s3-puf-data.tf:
   - aws_s3_bucket staging_data with name askflorence-staging-data
   - SSE-KMS staging CMK
   - Versioning enabled
   - Public access block all flags true
   - Replication to askflorence-data-archive in mgmt account (for historical backup)
   - Lifecycle: same as prod bucket (Standard → IA after 90d)
2. aws s3 sync s3://askflorence-data/ s3://askflorence-staging-data/ --exclude 'agent-survey-uploads/*': copies all PUF years to staging.
3. Update scripts/db/ingest-*.js env var handling:
   - Current scripts read from askflorence-data hard-coded or via env
   - Add S3_PUF_SOURCE_BUCKET env with staging bucket default when running locally against staging Mongo
   - Document the env var contract in docs/infrastructure/mongodb-setup.md
4. Ingest sanity check (critical step, do NOT skip):
   - MONGODB_WRITE_URI=<staging> S3_PUF_SOURCE_BUCKET=askflorence-staging-data node scripts/db/ingest-puf-augment.js --dry-run --year 2026
   - Confirm the dry-run output matches the existing prod Atlas state for year 2026; no drift introduced by the bucket change
   - Run the same thing in --apply mode against a throwaway staging collection; verify collection contents byte-for-byte match what's already in prod Atlas
If step 4 passes cleanly, staging data bucket is the authoritative review environment going forward.
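One way to approach the parity check in step 4 is to hash each document in a canonical form and compare the two sets by key. The sketch below is an approximation of "byte-for-byte": it compares content after sorting object keys, and it assumes documents have already been fetched and reduced to plain JSON-like values (driver-specific types such as ObjectId would need converting first). The function names and the planId key in the usage note are hypothetical.

```javascript
const crypto = require("node:crypto");

// Recursively sort object keys so that key order does not affect the hash.
function canonicalize(value) {
  if (value instanceof Date) return value.toISOString();
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === "object") {
    const sorted = {};
    for (const key of Object.keys(value).sort()) {
      sorted[key] = canonicalize(value[key]);
    }
    return sorted;
  }
  return value;
}

// SHA-256 over the canonical JSON form of one document.
function digest(doc) {
  const json = JSON.stringify(canonicalize(doc));
  return crypto.createHash("sha256").update(json).digest("hex");
}

// Compare two document sets by key: report keys missing from `actual`,
// extra keys in `actual`, and keys whose content hash differs.
function diffCollections(expectedDocs, actualDocs, keyOf) {
  const expected = new Map(expectedDocs.map((d) => [keyOf(d), digest(d)]));
  const actual = new Map(actualDocs.map((d) => [keyOf(d), digest(d)]));
  const missing = [...expected.keys()].filter((k) => !actual.has(k));
  const extra = [...actual.keys()].filter((k) => !expected.has(k));
  const changed = [...expected.keys()].filter(
    (k) => actual.has(k) && actual.get(k) !== expected.get(k)
  );
  const match = missing.length + extra.length + changed.length === 0;
  return { match, missing, extra, changed };
}
```

A call such as diffCollections(prodDocs, stagingDocs, (d) => d.planId) would return match: true only when both sides hold the same keys with identical content; any entry in missing, extra, or changed is a reason to stop before declaring staging authoritative.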
Rollback: if the ingest scripts break, revert S3_PUF_SOURCE_BUCKET to point at the mgmt bucket (unchanged at this point). Scripts resume working. Staging data bucket stays as a replicated copy, not authoritative.
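The env-var switch that both the migration and its rollback rely on could look roughly like this. It is a sketch under assumptions: the helper name is hypothetical, the allow-list is an addition for illustration, and the real ingest scripts may resolve the bucket differently. The S3_PUF_SOURCE_BUCKET name, the staging default, and the bucket names come from this runbook.

```javascript
// Buckets the ingest scripts are expected to read from during the migration.
// The allow-list is an illustrative guard against typos, not an existing check.
const KNOWN_PUF_BUCKETS = [
  "askflorence-data", // legacy mgmt bucket (rollback target)
  "askflorence-staging-data",
  "askflorence-prod-data",
];

// Resolve the PUF source bucket: an explicit S3_PUF_SOURCE_BUCKET wins,
// otherwise fall back to the staging bucket as the local default.
function resolvePufSourceBucket(env = process.env) {
  const requested = (env.S3_PUF_SOURCE_BUCKET || "").trim();
  const bucket = requested || "askflorence-staging-data";
  if (!KNOWN_PUF_BUCKETS.includes(bucket)) {
    throw new Error(`unknown PUF source bucket: ${bucket}`);
  }
  return bucket;
}
```

With this shape, rollback is a configuration change only: setting S3_PUF_SOURCE_BUCKET=askflorence-data points the scripts back at the unchanged mgmt bucket without a code revert.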
Phase 12 — Prod PUF data bucket + promotion flow
Steps:
1. New file infra/envs/prod/s3-puf-data.tf:
   - aws_s3_bucket prod_data with name askflorence-prod-data
   - SSE-KMS prod CMK
   - Versioning enabled
   - Public access block all flags true
   - Replication to askflorence-data-archive in mgmt account
   - Lifecycle: Standard → IA after 90d
2. Initial population: aws s3 sync s3://askflorence-staging-data/ s3://askflorence-prod-data/ (staging → prod for the first time)
3. Update ingest scripts' prod-pointing env:
   - S3_PUF_SOURCE_BUCKET=askflorence-prod-data in any prod-run context
   - This is a scripts-only change; the serving app does not read S3 for PUF data (it reads Atlas)
4. Re-run full audit tier 1-5 against prod Atlas after the first ingest from the new bucket; confirm no regression in serving data.
5. Document the PUF promotion workflow at docs/runbooks/puf-year-promotion.md (new file):

   New PUF year arrives from CMS:
   1. Upload to staging bucket: aws s3 cp plan-attributes-puf.csv s3://askflorence-staging-data/puf/<year>/
   2. Run ingest against staging Atlas: S3_PUF_SOURCE_BUCKET=askflorence-staging-data ... --year <year>
   3. Run audit tier 1-5 against staging: scripts/audit/*.js with staging URI
   4. Manual review at stage.askflorence.health
   5. aws s3 sync s3://askflorence-staging-data/puf/<year>/ s3://askflorence-prod-data/puf/<year>/
   6. Run ingest against prod Atlas: S3_PUF_SOURCE_BUCKET=askflorence-prod-data ... --year <year>
   7. Run audit tier 1-5 against prod: scripts/audit/*.js with prod URI
   8. Replication to archive bucket happens automatically (both staging + prod replicate to mgmt archive)
Phase 12.5 — Deprecate mgmt askflorence-data bucket
Steps:
1. Verify all runtime flows have moved off askflorence-data:
   - Prod ECS task def env: S3_AGENT_SURVEY_BUCKET points at prod bucket
   - Staging ECS task def env: points at staging bucket
   - Prod + staging ingest scripts point at their respective new buckets
   - No code path reads from askflorence-data directly
2. Make the bucket read-only:
   - Replace bucket policy with only DenyNonSSLRequests + Deny * for any write actions + allow Get/List for archive browsing
   - Revoke any IAM user creds that still had write access (Vercel-era IAM user; see Phase 11 Resend retirement + related static-creds retirement)
3. Leave the existing objects in place as the pre-migration historical record. Do NOT delete.
4. Optionally: set a lifecycle policy to transition everything in the bucket to Glacier Deep Archive after 90 days. Keep-forever retention.
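The read-only posture from step 2 might translate into policy statements along these lines. This is a sketch, not the final policy: the list of write actions is a reasonable starting set rather than an exhaustive one, the reader principal ARN is a hypothetical parameter, and the DenyNonSSLRequests statement the runbook also keeps is left out for brevity.

```javascript
// Object-level actions that modify or remove data. Illustrative, not exhaustive.
const WRITE_ACTIONS = [
  "s3:PutObject",
  "s3:PutObjectAcl",
  "s3:DeleteObject",
  "s3:DeleteObjectVersion",
];

// Statements for a read-only bucket: deny writes for everyone, allow
// Get/List for one reader principal (hypothetical ARN supplied by the caller).
function readOnlyStatements(bucket, readerArn) {
  return [
    {
      Sid: "DenyAllWrites",
      Effect: "Deny",
      Principal: "*",
      Action: WRITE_ACTIONS,
      Resource: `arn:aws:s3:::${bucket}/*`,
    },
    {
      Sid: "AllowArchiveBrowsing",
      Effect: "Allow",
      Principal: { AWS: readerArn },
      Action: ["s3:GetObject", "s3:ListBucket"],
      // ListBucket applies to the bucket ARN, GetObject to object ARNs.
      Resource: [`arn:aws:s3:::${bucket}`, `arn:aws:s3:::${bucket}/*`],
    },
  ];
}
```

Because an explicit Deny overrides any Allow in IAM evaluation, re-granting write access later means removing or narrowing the deny statement, which matches the "one bucket policy edit" rollback described for this phase.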
Documentation trail (evidence for auditors)
Every step of the migration generates evidence. Log locations:
| Evidence type | Where it lives |
|---|---|
| Timestamped change record | docs/infrastructure/change-log.md — one entry per phase step |
| Session-level narrative | docs/session-log/<date>-s3-data-migration-phase-<n>.md — the chronological "what happened" |
| Terraform state diffs | infra/envs/<env>/terraform.tfstate.backup (auto) + git log on infra/ |
| Ingest sanity check output | scripts/audit/audit-tier-*-results.json snapshots before + after the migration |
| Data-level confirmation | scripts/audit/audit-parity-check.js output at each phase transition |
| CloudTrail | All bucket creation, policy changes, replication config changes captured in org trail in log-archive account (Phase 2 setup) |
Rollback philosophy
At every phase, the previous phase's state is preserved. No step makes the PRIOR state unreachable. This means:
- Phase 11 agent-uploads split: mgmt bucket's agent-survey-uploads/ prefix stays in place; flipping S3_AGENT_SURVEY_BUCKET env var back restores the old behavior
- Phase 11.5 archive bucket: writes are one-directional (replication INTO archive); no workload depends on reading FROM archive yet
- Phase 11.75 staging data bucket: scripts fall back to mgmt bucket via env-var change
- Phase 12 prod data bucket: scripts fall back to mgmt bucket via env-var change
- Phase 12.5 mgmt deprecation: read-only posture, not destruction; re-granting write access is one bucket policy edit if needed
The mgmt askflorence-data bucket is never deleted during this migration. It becomes read-only cold storage but remains intact. This is the recovery floor.
Effort estimate
| Phase | Effort | Risk | Data migration |
|---|---|---|---|
| 11 — Agent-uploads split | ~1 hour | Low | ~0 objects (fresh prefix) |
| 11.5 — Archive bucket | ~30 min | Low (isolated, no downstream yet) | None |
| 11.75 — Staging data bucket + ingest validation | ~2 hours | Medium (ingest scripts touch prod DB if not careful) | ~100 MB PUF copy to staging |
| 12 — Prod data bucket + promotion flow | ~2 hours | Medium (same as 11.75 on prod side) | ~100 MB staging-to-prod sync |
| 12.5 — Mgmt bucket deprecation | ~30 min | Very low | None |
| Total | ~6 hours focused | — | ~200 MB total network transfer |
All five phases can happen in a single focused session, or spread across 2-3 sessions. All are post-48h-bake, and all are compatible with the existing Phase 11/12 hardening + compliance closeout work.
Related work
- Session log 2026-04-23 — Phase 10 cutover — context on why this migration surfaced
- aws-setup runbook — account topology + general AWS operations
- infra/envs/management/s3-askflorence-data.tf: current state of the mgmt bucket policy; gets reduced in Phase 12.5
- infra/envs/prod/ecs.tf: S3AgentSurveyUploadsWrite inline policy + S3_AGENT_SURVEY_BUCKET env var that Phase 11 updates
- GuardDuty setup: Malware Protection for S3 config; will be extended to the new buckets in Phase 11