Runbook — Security Incident Response

Use this in the moment. This is the first-responder checklist; the policy framework lives at docs/security-compliance/incident-response-plan.md.

Severity decision (one sentence)

If PHI may have been accessed by an unauthorized party OR the site is fully down for customers OR active exploitation is in progress → SEV-0. Page the IC immediately.

For the full SEV-0/1/2/3 matrix see the IRP severity classification.
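
The one-sentence rule above can be written as a small predicate. This is a minimal sketch for illustration; the function and parameter names are hypothetical and not part of any existing tooling.

```python
def is_sev0(phi_unauthorized_access: bool,
            full_customer_outage: bool,
            active_exploitation: bool) -> bool:
    """Return True when any single SEV-0 trigger from the rule above holds."""
    return phi_unauthorized_access or full_customer_outage or active_exploitation


# "May have been accessed" counts: pass True for a suspected PHI exposure.
print(is_sev0(phi_unauthorized_access=True,
              full_customer_outage=False,
              active_exploitation=False))
```

Any one trigger is enough; the three conditions are not weighed against each other.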

Page the IC (Incident Commander)

| Method | Use when |
|---|---|
| Text Taha at his personal phone | First contact for SEV-0 / SEV-1, day or night |
| Email `[email protected]` + `[email protected]` simultaneously | First contact for SEV-2 |
| Google Chat in the team space | SEV-3 |

The Incident Commander acks within the notification window: 15 min for SEV-0, 1 h for SEV-1, 4 h for SEV-2, 24 h for SEV-3.
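
To make the ack windows concrete, the sketch below computes the latest acceptable ack time from the moment the page went out. The names `ACK_WINDOWS`, `ack_deadline`, and `ack_overdue` are hypothetical. The runbook does not say what to do after a missed window, so the sketch only reports it.

```python
from datetime import datetime, timedelta, timezone

# Ack windows copied from the runbook text above.
ACK_WINDOWS = {
    "SEV-0": timedelta(minutes=15),
    "SEV-1": timedelta(hours=1),
    "SEV-2": timedelta(hours=4),
    "SEV-3": timedelta(hours=24),
}


def ack_deadline(severity: str, paged_at: datetime) -> datetime:
    """Return the latest time by which the IC should have acknowledged."""
    return paged_at + ACK_WINDOWS[severity]


def ack_overdue(severity: str, paged_at: datetime, now: datetime) -> bool:
    """Return True when the ack window has already elapsed."""
    return now > ack_deadline(severity, paged_at)


paged = datetime(2025, 1, 15, 3, 0, tzinfo=timezone.utc)
print(ack_deadline("SEV-0", paged).isoformat())
```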

Step 1 — Detect + open incident channel

When the IC acks:

Open a private incident channel: a Google Chat space named `🚨 sev-N incident YYYY-MM-DD <short slug>`.
Invite: IC + Compliance Liaison (Asad) + Comms Lead (Ian) + the team member who detected the incident.
Pin the incident summary at the top: 1 sentence on what was detected + 1 sentence on initial impact estimate.
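
A consistent channel name makes incidents easy to find later. The helper below is a sketch that fills in the naming pattern from this step. The function name and the slug normalization (lowercase, hyphen-separated) are assumptions, not an existing convention.

```python
import re
from datetime import date


def incident_channel_name(severity: int, opened_on: date, slug: str) -> str:
    """Build a channel name in the runbook's format."""
    if severity not in (0, 1, 2, 3):
        raise ValueError("severity must be 0, 1, 2, or 3")
    # Assumed normalization: lowercase, runs of other characters become one hyphen.
    clean = re.sub(r"[^a-z0-9]+", "-", slug.lower()).strip("-")
    return f"🚨 sev-{severity} incident {opened_on.isoformat()} {clean}"


print(incident_channel_name(0, date(2025, 1, 15), "Leaked API key"))
```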

Step 2 — Contain

The IC + Engineering Responder execute, in this order:

1. Stop the bleeding. Pick the smallest action that stops the immediate damage:
   - If a credential is suspected compromised → rotate it: `aws secretsmanager update-secret --secret-id <id> --secret-string <new>`, then deploy a task-definition update.
   - If an Atlas user is suspected compromised → reset the user's password (`atlas dbusers update <username> --password <random>`) or disable the user.
   - If a route is exposing data → disable the route with a feature-flag toggle or a temporary 503 deploy.
   - If a source IP is hostile → block it at AWS WAF and remove it from the Atlas IP allowlist if present.
   - If an ECS task is compromised → stop the task (`aws ecs stop-task --cluster <cluster> --task <arn>`). If the task belongs to a service, ECS replaces it automatically; the replacement task does NOT inherit the suspect task's state.
2. Preserve evidence. Do NOT delete logs. Do NOT clean up. Snapshot first:
   - Atlas: `atlas backups snapshots create askflorence-prod-01 --desc "incident-<slug>"`.
   - S3: confirm bucket versioning is on (default for stateful buckets); inventory suspect objects with `aws s3api list-objects-v2`.
   - CloudWatch: log groups have 90-day default retention; capture the relevant streams with `aws logs filter-log-events` and save the output to the incident channel.
   - GuardDuty: capture finding ARNs in the incident channel.
3. Stand up a war-room cadence for SEV-0/1: IC updates every 30 minutes in the incident channel until status is "stable."
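
When capturing CloudWatch evidence, the `--start-time` and `--end-time` options of `aws logs filter-log-events` take milliseconds since the Unix epoch, which is easy to get wrong under pressure. The sketch below builds the command for a UTC time window without running it. The helper names and the log group `/ecs/example-service` are hypothetical placeholders.

```python
import shlex
from datetime import datetime, timezone


def to_epoch_ms(moment: datetime) -> int:
    """Convert a timezone-aware datetime to milliseconds since the Unix epoch."""
    if moment.tzinfo is None:
        raise ValueError("use timezone-aware datetimes to avoid local-time mistakes")
    return int(moment.timestamp() * 1000)


def filter_log_events_cmd(log_group: str, start: datetime, end: datetime) -> list[str]:
    """Build the argument list for `aws logs filter-log-events` over a time window."""
    return [
        "aws", "logs", "filter-log-events",
        "--log-group-name", log_group,
        "--start-time", str(to_epoch_ms(start)),
        "--end-time", str(to_epoch_ms(end)),
        "--output", "json",
    ]


# Placeholder log group and window; substitute the real values for the incident.
cmd = filter_log_events_cmd(
    "/ecs/example-service",
    datetime(2025, 1, 15, 2, 0, tzinfo=timezone.utc),
    datetime(2025, 1, 15, 4, 0, tzinfo=timezone.utc),
)
print(shlex.join(cmd))
```

Printing the command rather than executing it lets a second responder review the window before anything runs; redirect the real output to a file and attach it to the incident channel.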

Step 3 — Assess

The IC + Engineering Responder + Compliance Liaison collaborate:

What data was accessed? Read the audit log (agent_audit_log for app-layer, CloudTrail for AWS-layer, Atlas database audit for DB-layer).
Whose data? Build a candidate-affected-individuals list. If PHI / PII is in scope, the list goes into the Compliance Liaison's regulatory-clock tracker.
HIPAA breach check (45 CFR §164.402): was there unauthorized acquisition, access, use, or disclosure of unsecured PHI? If yes, the 60-day notification clock starts at discovery, so record the discovery timestamp prominently.
Time-bound the assessment: SEV-0/1 assessment within 24h; SEV-2 within 72h.
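
Both clocks in this step can be derived from one recorded timestamp. The sketch below assumes the assessment window is measured from the discovery timestamp (the runbook does not state the starting point) and treats the 60-day figure as the outer limit for individual notice. All names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assessment windows from the runbook; SEV-3 has no stated window.
ASSESSMENT_WINDOWS = {
    "SEV-0": timedelta(hours=24),
    "SEV-1": timedelta(hours=24),
    "SEV-2": timedelta(hours=72),
}
HIPAA_NOTIFICATION_WINDOW = timedelta(days=60)


def assessment_due(severity: str, discovered_at: datetime) -> datetime:
    """Latest time to finish the assessment (assumed to run from discovery)."""
    return discovered_at + ASSESSMENT_WINDOWS[severity]


def individual_notice_due(discovered_at: datetime) -> datetime:
    """Outer limit for notifying affected individuals of a HIPAA breach."""
    return discovered_at + HIPAA_NOTIFICATION_WINDOW


discovered = datetime(2025, 1, 15, 3, 0, tzinfo=timezone.utc)
print("assessment due:", assessment_due("SEV-0", discovered).isoformat())
print("individual notice due:", individual_notice_due(discovered).isoformat())
```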

Step 4 — Notify

The Compliance Liaison owns the regulatory clock; the Comms Lead owns the messaging.

| Recipient | Trigger | Deadline | Owner | How |
|---|---|---|---|---|
| Affected individuals | HIPAA breach involving their PHI | 60 days from discovery (CA = 30 days for residents; check each state) | Comms Lead drafts; Compliance Liaison reviews | Per-individual letter or email with HHS-required content |
| HHS OCR | HIPAA breach (any number of individuals) | 60 days from discovery (≥500 affected); within 60 days after the end of the calendar year (<500) | Compliance Liaison | OCR breach portal |
| Media | HIPAA breach affecting >500 individuals in a state | 60 days | Comms Lead | "Prominent media outlet" in the state |
| State AG | Per state-specific law | Varies; default 30 days | Compliance Liaison | Per state-specific procedure |
| CMS EDE program contact | EDE program-eligibility-relevant incident (post-submission) | Per program | Compliance Liaison | EDE program portal |
| Affected vendor (BAA partner) | Incident involves their data flow | Per BAA terms (typically 30 days) | Compliance Liaison | Per vendor contract |
| Investors + advisors | SEV-0 customer-facing | Same business day | Founder (Taha) | Email + scheduled brief |
| Internal team | All SEV-0/1 | Immediate | IC | Incident channel |

Each notification's sent date is recorded in the incident channel + the post-mortem file.
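
The HHS OCR deadline depends on the size of the breach, which is a common source of mistakes. The sketch below encodes the rule as stated in 45 CFR §164.408: breaches affecting 500 or more individuals are reported within 60 days of discovery, smaller ones within 60 days after the end of the calendar year of discovery. The function name is hypothetical, and state-law deadlines are deliberately left out.

```python
from datetime import date, timedelta


def hhs_ocr_deadline(discovered_on: date, affected_count: int) -> date:
    """Latest date for the HHS OCR submission under 45 CFR §164.408."""
    if affected_count >= 500:
        # 500 or more individuals: 60 calendar days from discovery.
        return discovered_on + timedelta(days=60)
    # Fewer than 500: 60 days after the end of the calendar year of discovery.
    return date(discovered_on.year, 12, 31) + timedelta(days=60)


print(hhs_ocr_deadline(date(2025, 1, 15), affected_count=500))
print(hhs_ocr_deadline(date(2025, 1, 15), affected_count=12))
```

These are outer limits only; the Compliance Liaison's tracker remains the source of truth for the actual filing dates.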

Step 5 — Remediate + post-mortem

Implement the durable fix. Document the fix's deploy timestamp in the incident channel.
Verify remediation. Run the relevant CI guards (staging-collections-guard, staging-cluster-drift, validate-secrets) + a synthetic exercise of the previously-vulnerable path.
Close the incident when (a) the immediate vector is closed, (b) all required notifications are sent, (c) the durable fix is deployed, (d) the post-mortem placeholder is open.
File the post-mortem within 5 business days at docs/session-log/<date>-incident-<slug>.md. Use the template below.
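
The close criteria and the post-mortem deadline can be checked mechanically. The sketch below assumes the 5 business days run from the close date, that `<date>` in the file path is an ISO date, and that business days are Monday to Friday with public holidays ignored. All names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class CloseChecklist:
    vector_closed: bool                # (a) immediate vector is closed
    notifications_sent: bool           # (b) all required notifications are sent
    durable_fix_deployed: bool         # (c) durable fix is deployed
    postmortem_placeholder_open: bool  # (d) post-mortem placeholder is open

    def can_close(self) -> bool:
        """All four criteria must hold before the incident is closed."""
        return all((self.vector_closed, self.notifications_sent,
                    self.durable_fix_deployed, self.postmortem_placeholder_open))


def add_business_days(start: date, days: int) -> date:
    """Advance by `days` weekdays (Mon-Fri); public holidays are NOT considered."""
    current, remaining = start, days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def postmortem_path(incident_date: date, slug: str) -> str:
    """Fill in the runbook's post-mortem path, assuming an ISO date."""
    return f"docs/session-log/{incident_date.isoformat()}-incident-{slug}.md"


closed_on = date(2025, 1, 15)
print(CloseChecklist(True, True, True, False).can_close())
print(add_business_days(closed_on, 5), postmortem_path(closed_on, "leaked-api-key"))
```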

Post-mortem template

---
title: "Incident post-mortem — <slug>"
date: YYYY-MM-DD
severity: SEV-N
status: closed
---

# Incident — <slug>

## Timeline

| When (UTC) | What |
|---|---|
| YYYY-MM-DD HH:MM | Detection — `<source>` |
| YYYY-MM-DD HH:MM | IC acknowledged + incident channel opened |
| YYYY-MM-DD HH:MM | Containment action — `<action>` |
| YYYY-MM-DD HH:MM | Assessment complete |
| YYYY-MM-DD HH:MM | Notifications sent (per regulatory clock) |
| YYYY-MM-DD HH:MM | Durable fix deployed |
| YYYY-MM-DD HH:MM | Incident closed |

## Impact

- Data exposure: `<scope>`
- Affected individuals: `<count, or N/A>`
- Customer-visible impact: `<duration, or N/A>`
- Financial impact: `<estimate, or N/A>`

## Root cause

`<plain language; 1-3 paragraphs>`

## Contributing factors

`<bulleted list>`

## What worked

`<bulleted list>`

## What didn't

`<bulleted list>`

## Preventive measures

| Owner | Action | Due |
|---|---|---|
| | | |

## Regulatory notifications

| Recipient | Sent | Confirmation # |
|---|---|---|
| | | |

Preventive-measure rows feed into the next quarterly access review until they close.

When in doubt

Classify severity higher (go one tier up if you're unsure).
Stop the bleeding first; root-cause analysis can wait.
Don't clean up — preserve evidence.
Ack early to the team — silence on a suspected incident is worse than a false alarm.

Reference

Incident Response Plan — full policy framework + worked examples
Break-Glass Root Login — when standard auth is unavailable
Atlas user provisioning runbook — credential rotation specifics
Risk Assessment — known risks the IRP must handle
HIPAA Breach Notification Rule: 45 CFR §§164.400-414
HHS OCR Breach Portal: https://ocrportal.hhs.gov/ocr/breach/wizard_breach.jsf