Testing strategy
Where each test layer lives, what it catches, and where to find the decision history behind it.
For the decision narrative (why we chose Playwright + PR-CI against staging vs ephemeral PR previews), see ADR 0008. This doc is the operational view.
The four layers
| Layer | What runs | Where | Catches | Speed |
|---|---|---|---|---|
| L1: Local preflight | npm run preflight — typecheck + 3 audits + 2 builds. --full mode adds HTTP smoke + Playwright E2E. | Dev laptop | Same regressions as PR CI catches, before push | ~57s default, ~90s --full |
| L2: PR-time CI guards | 7 required checks: 2 static, ECS task-def coverage, EBS resource coverage, typecheck, Next.js build, docs build. Plus Playwright (ENG-304 — runs against staging; soak phase before adding to required checks) | GitHub Actions on every PR | TS errors, manifest↔Terraform drift, doc dead links, UI regressions | ~3-4min cold, ~1.5-2min warm |
| L3: Post-deploy smoke | scripts/audit/post-deploy-smoke.ts — 11 HTTP-level checks (6 read + 5 write paths with cleanup) | After every deploy, against staging or prod origin | Runtime regressions only visible on deployed code (write paths, secret bindings, route 500s) | ~10s |
| L4: Nightly drift detection | Playwright suite + post-deploy-smoke against staging + prod in parallel matrix on 0 9 * * * UTC cron (ENG-305). Auto-files GitHub issue on failure (Linear too if LINEAR_API_KEY repo secret is wired) | Scheduled GHA cron + workflow_dispatch | Out-of-band regressions: CMS API contract changes, cert rotations, infra config drift, Atlas index changes, vendor outages | ~5-7min per env, parallel |
Each layer addresses a different threat model, and they compound: a PR going to prod has to pass all four (L1 before push, L2 to merge, L3 to verify deploy, L4 to detect drift later).
L4 nightly drift detection — what it actually catches
The same code that passed L1/L2/L3 can fail at L4 hours or days later because the things L4 exercises aren't owned by us:
- CMS Marketplace API drifts (the eligibility/county/plan endpoints) — no SLA, no change-log we get notified about, contract changes ship without warning
- Atlas index/role state drifted out-of-band — someone (or an automation) widened a role via the UI
- CloudFront / WAF / ACM config rotated — managed-rule version bump silently breaks a route, cert renewal flipped a SAN
- HubSpot / SES / EventBridge vendor outages — partial failure modes that don't trip ALB health checks but break user-visible flows
- Application regression with delayed manifestation (e.g. a memory leak, an SQS-backed cleanup job that didn't fire for 6 hours, a feature flag flipped server-side)
L4 runs the same Playwright suite + post-deploy-smoke as PR CI and deploy time, but against both staging and prod in a matrix, every night. Same tests, different threat model, different time. The 14-day artifact retention on traces + videos means a regression that flakes once a week still has 2 nights of evidence to diagnose from.
Failure handling: the workflow opens a GitHub issue scoped to the failing step (steps.smoke.outcome == 'failure' || steps.playwright.outcome == 'failure'), NOT generic failure() — so infra hiccups (npm ci flake, OIDC blip) don't auto-open P1 issues. Linear filing is optional; wire LINEAR_API_KEY as a repo secret to enable. See .github/workflows/nightly-drift-check.yml.
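The step-scoped condition can be modeled as a small predicate. This is an illustrative TypeScript sketch, not the workflow's own expression: the step ids smoke and playwright come from the condition quoted above, while the install id and the function names are hypothetical.

```typescript
// Outcome values GitHub Actions exposes as steps.<id>.outcome.
type StepOutcome = "success" | "failure" | "cancelled" | "skipped";

// Hypothetical shape: one entry per step id in the nightly job.
type StepOutcomes = Record<string, StepOutcome | undefined>;

// Only these steps carry a real drift signal.
const DRIFT_STEPS = ["smoke", "playwright"] as const;

// Returns the drift steps that failed. Failures in any other step
// (npm ci, OIDC login) are ignored on purpose.
function failedDriftSteps(outcomes: StepOutcomes): string[] {
  return DRIFT_STEPS.filter((id) => outcomes[id] === "failure");
}

// Mirrors: steps.smoke.outcome == 'failure' || steps.playwright.outcome == 'failure'
function shouldFileIssue(outcomes: StepOutcomes): boolean {
  return failedDriftSteps(outcomes).length > 0;
}

// An install flake fails the job but files nothing.
const infraFlake = shouldFileIssue({
  install: "failure",
  smoke: "skipped",
  playwright: "skipped",
});

// A failing Playwright run files an issue scoped to that step.
const realDrift = shouldFileIssue({
  install: "success",
  smoke: "success",
  playwright: "failure",
});
```

Here infraFlake is false and realDrift is true, which is the behavior a generic failure() condition would not give: it would fire in both cases.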
What's NOT in the strategy (deliberately)
- Ephemeral PR preview environments (ENG-303 / M2) — deferred. See ADR 0008 "What's the actual gap" for the honest accounting. Re-evaluate when team grows past 3-4 devs, SOC 2 Type II audit lands, or specific failure classes start shipping.
- Component-level unit tests with Jest/Vitest — we have minimal coverage today; deferred until a feature class proves to need it. The integration + E2E layers carry most of the load.
- Visual regression tests (Chromatic, Percy, Storybook visual) — deferred; Playwright screenshots + the manual flow catch most visual regressions for now.
- Paid AI-native test platforms (Mabl, Reflect, Octomind) — re-evaluate post-funding. Pre-funding budget posture wins.
- Stagehand AI-wrapped Playwright (browserbase/stagehand) — interesting future direction if Playwright maintenance cost grows; tracked in ADR 0008 "Tool evolution path."
When to revisit ENG-303 (ephemeral PR previews)
ADR 0008 documents the five revisit triggers. Most likely first trigger: team grows past 3-4 devs. Today (1-2 devs), shared staging works.
Tool: Playwright
Per ADR 0008 alternatives analysis, Playwright is the chosen tool. Key reasons:
- $0 license + $0 runtime — fits pre-funding posture
- Native TypeScript — codebase alignment
- Trace viewer + video on failure — fast flake diagnosis
- Cross-browser (Chromium, Firefox, WebKit/Safari)
- State of JS 2024: overtook Cypress in satisfaction + usage growth
Selector strategy (codified to avoid maintenance debt)
BrowserStack's 2026 best-practices guide plus our own lessons from the manual smoke flow:
- Prefer role-based selectors: page.getByRole("button", { name: "Submit" }). Resilient to CSS refactors.
- data-testid for ambiguous or dynamic content: calculator inputs, autocomplete results, plan card placement. Small ongoing PR commitment: when you touch a UI component that the smoke tests run against, leave a data-testid if one's missing.
- NEVER use CSS class selectors or XPath: they break on every CSS refactor.
- No page.waitForTimeout(N): only event-based waits via await expect(locator).toBeVisible() etc.
This isn't ceremonial. The most-cited cause of Playwright flake per Mergify's analysis is selector brittleness — and it's avoidable with discipline.
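One way to keep these rules from eroding is a cheap source scan over spec files. The following is a minimal sketch under stated assumptions: the rule names, the Violation shape, and findSelectorViolations are all hypothetical, and the regexes are line-based heuristics that will miss or over-report some cases.

```typescript
// Hypothetical rule set: each rule pairs a regex with the guideline it enforces.
interface SelectorRule {
  name: string;
  pattern: RegExp;
}

interface Violation {
  line: number; // 1-based line number within the scanned source
  rule: string;
}

const RULES: SelectorRule[] = [
  // Fixed sleeps such as page.waitForTimeout(500)
  { name: "no-wait-for-timeout", pattern: /\.waitForTimeout\s*\(/ },
  // CSS class selectors such as locator(".plan-card") or locator("div.plan-card")
  { name: "no-css-class-selector", pattern: /\.locator\(\s*["'][^"']*\.[A-Za-z_-]/ },
  // XPath selectors such as locator("//main") or locator("xpath=//main")
  { name: "no-xpath-selector", pattern: /\.locator\(\s*["'](?:xpath=|\/\/)/ },
];

// Scans spec source line by line and reports every rule each line breaks.
function findSelectorViolations(source: string): Violation[] {
  const violations: Violation[] = [];
  source.split("\n").forEach((text, index) => {
    for (const rule of RULES) {
      if (rule.pattern.test(text)) {
        violations.push({ line: index + 1, rule: rule.name });
      }
    }
  });
  return violations;
}

const sampleSpec = [
  'await page.getByRole("button", { name: "Submit" }).click();',
  'await expect(page.getByTestId("plan-card")).toBeVisible();',
  'await page.locator(".plan-card").click();',
  "await page.waitForTimeout(500);",
  'await page.locator("xpath=//main").click();',
].join("\n");

const found = findSelectorViolations(sampleSpec);
```

In this sample the first two lines pass and the last three are flagged, one rule each. A scan like this could plausibly run as one of the static audits, though whether it belongs in preflight or PR CI is a separate decision.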
Flake budget
Target: <1% flake rate within 4 weeks of M3 ship.
Industry benchmark: mid-stage SaaS teams hit 4% flake → ~1,000 spurious failures/week (Bug0 analysis). Our hard line:
- retries: 2 per spec (built-in Playwright)
- If a spec flakes more than 2× in a week, it gets fixed (better selectors) or removed, not "tolerated"
- Each flake gets a 5-min trace-viewer diagnosis pass to track recurring patterns
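The budget above reduces to two numbers that can be computed from run history. This sketch assumes a hypothetical SpecRun record with one entry per spec per CI run, where "flaky" means the spec failed at least once and then passed on a retry; how those records are collected is left open.

```typescript
// Hypothetical record: one entry per spec per CI run.
interface SpecRun {
  spec: string;
  // "flaky" = failed at least once, then passed on a retry.
  status: "passed" | "failed" | "flaky";
}

const FLAKE_RATE_TARGET = 0.01; // the <1% target from this section
const MAX_FLAKES_PER_WEEK = 2; // more than this means fix or remove

// Share of spec runs that only passed after a retry.
function flakeRate(runs: SpecRun[]): number {
  if (runs.length === 0) return 0;
  return runs.filter((r) => r.status === "flaky").length / runs.length;
}

// Specs that flaked more than MAX_FLAKES_PER_WEEK times in one week of runs.
function specsOverBudget(weekOfRuns: SpecRun[]): string[] {
  const counts: Record<string, number> = {};
  for (const run of weekOfRuns) {
    if (run.status === "flaky") {
      counts[run.spec] = (counts[run.spec] ?? 0) + 1;
    }
  }
  return Object.keys(counts)
    .filter((spec) => (counts[spec] ?? 0) > MAX_FLAKES_PER_WEEK)
    .sort();
}

// Illustrative week: 100 spec runs, "quote" flaked 3 times, "login" once.
const week: SpecRun[] = [];
for (let i = 0; i < 96; i++) week.push({ spec: "home", status: "passed" });
for (let i = 0; i < 3; i++) week.push({ spec: "quote", status: "flaky" });
week.push({ spec: "login", status: "flaky" });

const rate = flakeRate(week);
const overBudget = specsOverBudget(week);
const withinTarget = rate < FLAKE_RATE_TARGET;
```

With these made-up numbers the weekly rate is 4 flaky runs out of 100, so the week misses the target, and only "quote" crosses the fix-or-remove line because "login" flaked once.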
Test data discipline
Same pattern post-deploy-smoke established (see docs/infrastructure/post-deploy-smoke.md):
- Synthetic email pattern: taha+playwright-<spec>-<runId>@askflorence.health
- Per-test cleanup in afterEach blocks
- Suite-level cleanup sweep removes any orphans matching the pattern older than 5 minutes
- HubSpot GDPR-delete reserved for +ci-smoke-* / +playwright-* patterns only (never real work emails; see CLAUDE.md May 9 precedent)
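The naming and cleanup rules above can be expressed as a few pure helpers. This is a sketch, not the repository's code: the function names and the Contact shape are hypothetical, and the deletable-address regex matches on the +ci-smoke- / +playwright- tag only, since the exact address format used by post-deploy-smoke is not stated here.

```typescript
const ORPHAN_MAX_AGE_MS = 5 * 60 * 1000; // the 5-minute sweep threshold

// Builds taha+playwright-<spec>-<runId>@askflorence.health
function syntheticEmail(spec: string, runId: string): string {
  return `taha+playwright-${spec}-${runId}@askflorence.health`;
}

// True for addresses produced by syntheticEmail.
function isPlaywrightEmail(email: string): boolean {
  return /^taha\+playwright-[^@\s]+@askflorence\.health$/.test(email);
}

// Guard for GDPR-delete: only +ci-smoke-* and +playwright-* tagged addresses.
function isGdprDeletable(email: string): boolean {
  return /^[^@+\s]+\+(?:ci-smoke|playwright)-[^@\s]+@[^@\s]+$/.test(email);
}

// Hypothetical contact record as a cleanup sweep might see it.
interface Contact {
  email: string;
  createdAt: number; // epoch milliseconds
}

// Orphans: synthetic Playwright contacts older than the sweep threshold.
function findOrphans(contacts: Contact[], now: number): Contact[] {
  return contacts.filter(
    (c) => isPlaywrightEmail(c.email) && now - c.createdAt > ORPHAN_MAX_AGE_MS,
  );
}

const now = 1000000000;
const minute = 60 * 1000;
const contacts: Contact[] = [
  { email: syntheticEmail("quote", "123"), createdAt: now - 6 * minute },
  { email: syntheticEmail("login", "124"), createdAt: now - 1 * minute },
  { email: "someone@askflorence.health", createdAt: now - 60 * minute },
];

const orphans = findOrphans(contacts, now);
```

Only the six-minute-old synthetic contact is swept. The one-minute-old contact may belong to a test still running, and the untagged address never matches regardless of age, which is the property that keeps real emails out of the delete path.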
Cost summary
L1 + L2 (Playwright on PR CI against staging): ~$41/month at 40 PRs/day.
For the cost comparison vs ephemeral previews (~$270/month), see ADR 0008 cost analysis.
Related
- ADR 0008: the decision document this operationalizes
- ADR 0007: Terraform-driven deploy (companion architectural decision)
- docs/development/preflight.md: L1 local mirror
- docs/infrastructure/post-deploy-smoke.md: L3 deploy-time smoke
- ENG-304: Playwright UI smoke (M3, the work that operationalizes this strategy)
- ENG-303: AWS ephemeral PR previews (M2, deferred per ADR 0008; revisit triggers documented there)