Testing strategy

Where each test layer lives, what it catches, and the decision history behind it.

For the decision narrative (why we chose Playwright in PR CI against staging over ephemeral PR previews), see ADR 0008. This doc is the operational view.

The four layers

Layer	What runs	Where	Catches	Speed
L1: Local preflight	`npm run preflight`: typecheck + 3 audits + 2 builds. `--full` mode adds HTTP smoke + Playwright E2E.	Dev laptop	The same regressions PR CI catches, before push	~57s default, ~90s `--full`
L2: PR-time CI guards	7 required checks: 2 static, ECS task-def coverage, EBS resource coverage, typecheck, Next.js build, docs build. Plus Playwright (ENG-304), which runs against staging and is in a soak phase before it joins the required checks.	GitHub Actions on every PR	TS errors, manifest↔Terraform drift, doc dead links, UI regressions	~3-4min cold, ~1.5-2min warm
L3: Post-deploy smoke	`scripts/audit/post-deploy-smoke.ts`: 11 HTTP-level checks (6 read + 5 write paths with cleanup)	After every deploy, against staging or prod origin	Runtime regressions only visible on deployed code (write paths, secret bindings, route 500s)	~10s
L4: Nightly drift detection	Playwright suite + post-deploy-smoke against staging + prod in a parallel matrix on a `0 9 * * *` cron (09:00 UTC daily, ENG-305). Auto-files a GitHub issue on failure (and a Linear issue if the `LINEAR_API_KEY` repo secret is wired)	Scheduled GHA cron + `workflow_dispatch`	Out-of-band regressions: CMS API contract changes, cert rotations, infra config drift, Atlas index changes, vendor outages	~5-7min per env, parallel

Each layer catches a different threat model. They compound — a PR going to prod has to pass all four (L1 before push, L2 to merge, L3 to verify deploy, L4 to detect drift later).

L4 nightly drift detection — what it actually catches

The same code that passed L1/L2/L3 can fail at L4 hours or days later because the things L4 exercises aren't owned by us:

CMS Marketplace API drift (the eligibility/county/plan endpoints): no SLA, no changelog we are notified about, and contract changes ship without warning
Atlas index/role state changed out-of-band: someone (or an automation) widened a role via the UI
CloudFront / WAF / ACM config changes: a managed-rule version bump silently breaks a route, or a cert renewal changes a SAN
HubSpot / SES / EventBridge vendor outages: partial failure modes that don't trip ALB health checks but break user-visible flows
Application regressions with delayed manifestation: e.g. a memory leak, an SQS-backed cleanup job that didn't fire for 6 hours, a feature flag flipped server-side

L4 runs the same Playwright suite + post-deploy-smoke as PR CI and deploy time, but against both staging and prod in a matrix, every night. Same tests, different threat model, different time. The 14-day artifact retention on traces + videos means a regression that flakes once a week still has 2 nights of evidence to diagnose from.

Failure handling: the workflow opens a GitHub issue scoped to the failing step (`steps.smoke.outcome == 'failure' || steps.playwright.outcome == 'failure'`), not on the generic `failure()`, so infra hiccups (`npm ci` flake, OIDC blip) don't auto-open P1 issues. Linear filing is optional; wire `LINEAR_API_KEY` as a repo secret to enable it. See `.github/workflows/nightly-drift-check.yml`.
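
The scoping rule is easy to get backwards, so here is a minimal TypeScript sketch of the same condition. The real logic lives in the workflow YAML as a step condition; the `NightlyOutcomes` type and the `shouldFileIssue` function are hypothetical names used only for illustration.

```typescript
// Values GitHub Actions reports for steps.<id>.outcome.
type StepOutcome = "success" | "failure" | "cancelled" | "skipped";

// Hypothetical shape: the outcomes of the two test steps in the nightly job.
interface NightlyOutcomes {
  smoke: StepOutcome;      // steps.smoke.outcome
  playwright: StepOutcome; // steps.playwright.outcome
}

// Mirrors: steps.smoke.outcome == 'failure' || steps.playwright.outcome == 'failure'
// A job that dies earlier (for example during `npm ci`) leaves both steps
// "skipped", so this returns false and no issue is filed.
function shouldFileIssue(outcomes: NightlyOutcomes): boolean {
  return outcomes.smoke === "failure" || outcomes.playwright === "failure";
}

const setupFlake: NightlyOutcomes = { smoke: "skipped", playwright: "skipped" };
const realRegression: NightlyOutcomes = { smoke: "success", playwright: "failure" };

console.log(shouldFileIssue(setupFlake));
console.log(shouldFileIssue(realRegression));
```

A generic `failure()` check would treat both cases the same way, which is exactly what the step-scoped condition avoids.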

What's NOT in the strategy (deliberately)

Ephemeral PR preview environments (ENG-303 / M2): deferred. See ADR 0008 "What's the actual gap" for the honest accounting. Re-evaluate when the team grows past 3-4 devs, a SOC 2 Type II audit lands, or specific failure classes start shipping.
Component-level unit tests with Jest/Vitest — we have minimal coverage today; deferred until a feature class proves to need it. The integration + E2E layers carry most of the load.
Visual regression tests (Chromatic, Percy, Storybook visual) — deferred; Playwright screenshots + the manual flow catch most visual regressions for now.
Paid AI-native test platforms (Mabl, Reflect, Octomind) — re-evaluate post-funding. Pre-funding budget posture wins.
Stagehand AI-wrapped Playwright (browserbase/stagehand) — interesting future direction if Playwright maintenance cost grows; tracked in ADR 0008 "Tool evolution path."

When to revisit ENG-303 (ephemeral PR previews)

ADR 0008 documents the five revisit triggers. Most likely first trigger: team grows past 3-4 devs. Today (1-2 devs), shared staging works.

Tool: Playwright

Per ADR 0008 alternatives analysis, Playwright is the chosen tool. Key reasons:

$0 license + $0 runtime — fits pre-funding posture
Native TypeScript — codebase alignment
Trace viewer + video on failure — fast flake diagnosis
Cross-browser (Chromium, Firefox, WebKit/Safari)
State of JS 2024: overtook Cypress in satisfaction + usage growth
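
As a rough illustration of how the trace, video, and cross-browser points map onto configuration, here is a sketch of a `playwright.config.ts`. It is not the project's actual config: the project names and the `retain-on-failure` choices are assumptions, and `retries: 2` is taken from the flake budget in this doc. In a real file, wrapping the object in `defineConfig` from `@playwright/test` adds type checking.

```typescript
// playwright.config.ts (sketch; every value except retries is an assumption)
const config = {
  // Re-run a failing test up to two more times before reporting it as failed.
  retries: 2,
  use: {
    // Record a trace and a video for each test, but keep them only on failure.
    trace: "retain-on-failure",
    video: "retain-on-failure",
  },
  // One project per browser engine.
  projects: [
    { name: "chromium", use: { browserName: "chromium" } },
    { name: "firefox", use: { browserName: "firefox" } },
    { name: "webkit", use: { browserName: "webkit" } },
  ],
};

export default config;
```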

Selector strategy (codified to avoid maintenance debt)

These rules draw on BrowserStack's 2026 best-practices guide and our own lessons from the manual smoke flow:

Prefer role-based selectors: `page.getByRole("button", { name: "Submit" })`. Resilient to CSS refactors.
`data-testid` for ambiguous or dynamic content: calculator inputs, autocomplete results, plan card placement. Small ongoing PR commitment: when you touch a UI component that the smoke tests run against, leave a `data-testid` if one's missing.
NEVER use CSS class selectors or XPath. Class selectors break on every CSS refactor, and XPath breaks whenever the DOM structure shifts.
No `page.waitForTimeout(N)`. Use only event-based waits such as `await expect(locator).toBeVisible()`.

This isn't ceremonial. The most-cited cause of Playwright flake per Mergify's analysis is selector brittleness — and it's avoidable with discipline.
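
One way to keep these rules from eroding is a small source check that flags banned patterns in spec files. The sketch below is hypothetical and not part of the current tooling: `scanSpec`, the rule names, and the regexes are illustrative, and the regexes are heuristics that only look at string arguments passed to `locator(...)`.

```typescript
interface Violation {
  line: number; // 1-based line number in the spec source
  rule: string;
  text: string;
}

// Heuristic patterns for the banned practices listed above.
const BANNED: { rule: string; pattern: RegExp }[] = [
  { rule: "no fixed sleeps", pattern: /\.waitForTimeout\s*\(/ },
  { rule: "no XPath selectors", pattern: /locator\(\s*["'`](xpath=|\/\/)/ },
  { rule: "no CSS class selectors", pattern: /locator\(\s*["'`][^"'`]*\.[A-Za-z_-][\w-]*/ },
];

// Returns one violation per (line, rule) match.
function scanSpec(source: string): Violation[] {
  const violations: Violation[] = [];
  source.split("\n").forEach((text, index) => {
    for (const { rule, pattern } of BANNED) {
      if (pattern.test(text)) {
        violations.push({ line: index + 1, rule, text: text.trim() });
      }
    }
  });
  return violations;
}

// Sample spec body; the "plan-card" test id is a made-up example.
const sample = [
  'await page.getByRole("button", { name: "Submit" }).click();',
  'await page.locator(".btn-primary").click();',
  "await page.waitForTimeout(3000);",
  'await expect(page.getByTestId("plan-card")).toBeVisible();',
].join("\n");

for (const v of scanSpec(sample)) {
  console.log(`line ${v.line}: ${v.rule}: ${v.text}`);
}
```

In the sample, the role-based and test-id lines pass, while the class selector on line 2 and the fixed sleep on line 3 are flagged.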

Flake budget

Target: <1% flake rate within 4 weeks of M3 ship.

Industry benchmark: mid-stage SaaS teams hit 4% flake → ~1,000 spurious failures/week (Bug0 analysis). Our hard line:

`retries: 2` (Playwright's built-in retry option, applied to each failing test)
If a spec flakes more than 2× in a week, it gets fixed (better selectors) or removed, not "tolerated"
Each flake gets a 5-min trace-viewer diagnosis pass — track recurring patterns
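
To make the budget checkable, the two thresholds above can be computed from run history. This is a minimal sketch under stated assumptions: the `SpecRun` record shape is hypothetical, "flaky" means a run that failed at first and passed on retry, and the spec names and dates are made up.

```typescript
type RunStatus = "passed" | "failed" | "flaky"; // "flaky": failed first, passed on retry

// Hypothetical record of one spec execution.
interface SpecRun {
  spec: string;
  finishedAt: Date;
  status: RunStatus;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Share of runs that were flaky; the target above is a value below 0.01.
function flakeRate(runs: SpecRun[]): number {
  if (runs.length === 0) return 0;
  return runs.filter((r) => r.status === "flaky").length / runs.length;
}

// Specs that flaked more than `maxFlakesPerWeek` times in the 7 days before `now`.
function specsOverBudget(runs: SpecRun[], now: Date, maxFlakesPerWeek = 2): string[] {
  const cutoff = now.getTime() - 7 * DAY_MS;
  const counts: Record<string, number> = {};
  for (const run of runs) {
    if (run.status === "flaky" && run.finishedAt.getTime() >= cutoff) {
      counts[run.spec] = (counts[run.spec] ?? 0) + 1;
    }
  }
  return Object.keys(counts).filter((spec) => (counts[spec] ?? 0) > maxFlakesPerWeek);
}

const now = new Date("2025-01-15T09:00:00Z");
const daysAgo = (n: number): Date => new Date(now.getTime() - n * DAY_MS);

const runs: SpecRun[] = [
  { spec: "calculator.spec.ts", finishedAt: daysAgo(1), status: "flaky" },
  { spec: "calculator.spec.ts", finishedAt: daysAgo(2), status: "flaky" },
  { spec: "calculator.spec.ts", finishedAt: daysAgo(3), status: "flaky" },
  { spec: "plan-cards.spec.ts", finishedAt: daysAgo(1), status: "flaky" },
  { spec: "plan-cards.spec.ts", finishedAt: daysAgo(9), status: "flaky" }, // outside the window
  { spec: "plan-cards.spec.ts", finishedAt: daysAgo(2), status: "passed" },
];

console.log(specsOverBudget(runs, now)); // only calculator.spec.ts exceeds 2 flakes
console.log(flakeRate(runs) < 0.01);     // false for this sample
```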

Test data discipline

The same pattern that post-deploy-smoke established (see `docs/infrastructure/post-deploy-smoke.md`):

Synthetic email pattern: `taha+playwright-<spec>-<runId>@askflorence.health`
Per-test cleanup in `afterEach` blocks
Suite-level cleanup sweep removes any orphans matching the pattern older than 5 minutes
HubSpot GDPR-delete reserved for `+ci-smoke-*` / `+playwright-*` patterns only (never real work emails; see CLAUDE.md May 9 precedent)
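
The rules above can be expressed as three small helpers. This is a sketch, not the suite's actual code: the function names and the `Contact` shape are hypothetical, and restricting GDPR deletion to the `askflorence.health` domain is a conservative assumption on top of the plus-tag rule.

```typescript
const DOMAIN = "askflorence.health";
const ORPHAN_AGE_MS = 5 * 60 * 1000; // sweep threshold: 5 minutes

// Builds taha+playwright-<spec>-<runId>@askflorence.health
function syntheticEmail(spec: string, runId: string): string {
  return `taha+playwright-${spec}-${runId}@${DOMAIN}`;
}

// Addresses the Playwright sweep may touch.
const PLAYWRIGHT_ADDRESS = /^[^+@\s]+\+playwright-[^@\s]+@askflorence\.health$/;
// Addresses eligible for GDPR deletion: +ci-smoke-* or +playwright-* tags only.
const DELETABLE_ADDRESS = /^[^+@\s]+\+(ci-smoke|playwright)-[^@\s]+@askflorence\.health$/;

// Guard to call before any GDPR-delete request; real addresses return false.
function isGdprDeletable(email: string): boolean {
  return DELETABLE_ADDRESS.test(email);
}

// Hypothetical minimal view of a stored contact.
interface Contact {
  email: string;
  createdAt: Date;
}

// Orphans: Playwright-pattern contacts created more than 5 minutes before `now`.
function orphansToSweep(contacts: Contact[], now: Date): Contact[] {
  return contacts.filter(
    (c) =>
      PLAYWRIGHT_ADDRESS.test(c.email) &&
      now.getTime() - c.createdAt.getTime() > ORPHAN_AGE_MS,
  );
}

console.log(syntheticEmail("calculator", "8841"));
console.log(isGdprDeletable("taha@askflorence.health")); // a real address is never deletable
```

Keeping the age check in the sweep means a test that is still running never has its contact removed from under it by a parallel run's cleanup.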

Cost summary

L1 + L2 (Playwright on PR CI against staging): ~$41/month at 40 PRs/day.

For the cost comparison vs ephemeral previews (~$270/month), see ADR 0008 cost analysis.

Related

ADR 0008: the decision document this operationalizes
ADR 0007: Terraform-driven deploy (companion architectural decision)
docs/development/preflight.md: L1 local mirror
docs/infrastructure/post-deploy-smoke.md: L3 deploy-time smoke
ENG-304: Playwright UI smoke (M3, the work that operationalizes this strategy)
ENG-303: AWS ephemeral PR previews (M2, deferred per ADR 0008; revisit triggers documented there)