Testing strategy

Where each test layer lives, what it catches, and the decision history behind it.

For the decision narrative (why we chose Playwright + PR-CI against staging vs ephemeral PR previews), see ADR 0008. This doc is the operational view.

The four layers

| Layer | What runs | Where | Catches | Speed |
|---|---|---|---|---|
| L1: Local preflight | npm run preflight: typecheck + 3 audits + 2 builds. --full mode adds HTTP smoke + Playwright E2E. | Dev laptop | The same regressions PR CI catches, before push | ~57s default, ~90s --full |
| L2: PR-time CI guards | 7 required checks: 2 static, ECS task-def coverage, EBS resource coverage, typecheck, Next.js build, docs build. Plus Playwright (ENG-304; runs against staging, in a soak phase before being added to required checks) | GitHub Actions on every PR | TS errors, manifest↔Terraform drift, doc dead links, UI regressions | ~3-4min cold, ~1.5-2min warm |
| L3: Post-deploy smoke | scripts/audit/post-deploy-smoke.ts: 11 HTTP-level checks (6 read + 5 write paths with cleanup) | After every deploy, against staging or prod origin | Runtime regressions only visible on deployed code (write paths, secret bindings, route 500s) | ~10s |
| L4: Nightly drift detection | Playwright suite + post-deploy-smoke against staging + prod in parallel matrix on 0 9 * * * UTC cron (ENG-305). Auto-files GitHub issue on failure (Linear too if LINEAR_API_KEY repo secret is wired) | Scheduled GHA cron + workflow_dispatch | Out-of-band regressions: CMS API contract changes, cert rotations, infra config drift, Atlas index changes, vendor outages | ~5-7min per env, parallel |

Each layer addresses a different threat model. They compound: a change going to prod has to pass all four (L1 before push, L2 to merge, L3 to verify the deploy, L4 to detect drift later).

L4 nightly drift detection: what it actually catches

The same code that passed L1/L2/L3 can fail at L4 hours or days later, because what L4 exercises can change outside our PR pipeline and most of it isn't owned by us:

  • CMS Marketplace API drifts (the eligibility/county/plan endpoints) — no SLA, no change-log we get notified about, contract changes ship without warning
  • Atlas index/role state drifted out-of-band — someone (or an automation) widened a role via the UI
  • CloudFront / WAF / ACM config rotated — managed-rule version bump silently breaks a route, cert renewal flipped a SAN
  • HubSpot / SES / EventBridge vendor outages — partial failure modes that don't trip ALB health checks but break user-visible flows
  • Application regression with delayed manifestation — e.g. a memory leak, a SQS-backed cleanup job that didn't fire for 6 hours, a feature flag flipped server-side

L4 runs the same Playwright suite + post-deploy-smoke as PR CI and deploy time, but against both staging and prod in a matrix, every night. Same tests, different threat model, different time. The 14-day artifact retention on traces + videos means a regression that flakes once a week still has 2 nights of evidence to diagnose from.

Failure handling: the workflow opens a GitHub issue scoped to the failing step (steps.smoke.outcome == 'failure' || steps.playwright.outcome == 'failure'), NOT generic failure() — so infra hiccups (npm ci flake, OIDC blip) don't auto-open P1 issues. Linear filing is optional; wire LINEAR_API_KEY as a repo secret to enable. See .github/workflows/nightly-drift-check.yml.
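The scoping rule above can be modeled in a few lines of TypeScript. This is only an illustrative sketch of the condition, not the workflow itself: the type and function names are hypothetical, and the real logic lives in the workflow's `if:` expression.

```typescript
// Outcome values mirror GitHub Actions' steps.<id>.outcome.
type StepOutcome = "success" | "failure" | "cancelled" | "skipped";

// Hypothetical container for the two step outcomes the workflow inspects.
interface NightlyOutcomes {
  smoke: StepOutcome;      // post-deploy-smoke step
  playwright: StepOutcome; // Playwright step
}

// Mirrors: steps.smoke.outcome == 'failure' || steps.playwright.outcome == 'failure'
// A generic failure() would also be true when an earlier setup step failed.
function shouldFileIssue(outcomes: NightlyOutcomes): boolean {
  return outcomes.smoke === "failure" || outcomes.playwright === "failure";
}
```

When a setup step such as npm ci fails, the later test steps are skipped rather than failed, so both outcomes read "skipped" and no issue is filed.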

What's NOT in the strategy (deliberately)

  • Ephemeral PR preview environments (ENG-303 / M2) — deferred. See ADR 0008 "What's the actual gap" for the honest accounting. Re-evaluate when team grows past 3-4 devs, SOC 2 Type II audit lands, or specific failure classes start shipping.
  • Component-level unit tests with Jest/Vitest — we have minimal coverage today; deferred until a feature class proves to need it. The integration + E2E layers carry most of the load.
  • Visual regression tests (Chromatic, Percy, Storybook visual) — deferred; Playwright screenshots + the manual flow catch most visual regressions for now.
  • Paid AI-native test platforms (Mabl, Reflect, Octomind) — re-evaluate post-funding. Pre-funding budget posture wins.
  • Stagehand AI-wrapped Playwright (browserbase/stagehand) — interesting future direction if Playwright maintenance cost grows; tracked in ADR 0008 "Tool evolution path."

When to revisit ENG-303 (ephemeral PR previews)

ADR 0008 documents the five revisit triggers. Most likely first trigger: team grows past 3-4 devs. Today (1-2 devs), shared staging works.

Tool: Playwright

Per ADR 0008 alternatives analysis, Playwright is the chosen tool. Key reasons:

  • $0 license + $0 runtime — fits pre-funding posture
  • Native TypeScript — codebase alignment
  • Trace viewer + video on failure — fast flake diagnosis
  • Cross-browser (Chromium, Firefox, WebKit/Safari)
  • State of JS 2024: overtook Cypress in satisfaction + usage growth

Selector strategy (codified to avoid maintenance debt)

These rules draw on BrowserStack's 2026 best-practices guide and our own lessons from the manual smoke flow:

  1. Prefer role-based selectors: page.getByRole("button", { name: "Submit" }). Resilient to CSS refactors.
  2. Use data-testid for ambiguous or dynamic content: calculator inputs, autocomplete results, plan card placement. This is a small ongoing PR commitment: when you touch a UI component that the smoke tests run against, leave a data-testid if one's missing.
  3. NEVER use CSS class selectors or XPath. Class selectors break on CSS refactors, and XPath breaks whenever the DOM structure shifts.
  4. No page.waitForTimeout(N). Use only event-based waits, such as await expect(locator).toBeVisible().

This isn't ceremonial. The most-cited cause of Playwright flake per Mergify's analysis is selector brittleness — and it's avoidable with discipline.
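Rules 3 and 4 are mechanical enough to check automatically. The sketch below is a rough, regex-based scan of spec source text, offered as an illustration only: every name in it is hypothetical, the patterns will miss cases and can produce false positives, and a real guard would more likely be an ESLint rule.

```typescript
// Hypothetical result shape for one flagged line.
interface SelectorViolation {
  line: number; // 1-based line number in the spec source
  rule: string;
  text: string;
}

// Rough patterns for the banned constructs (assumptions, not an exhaustive grammar).
const SELECTOR_RULES: { rule: string; pattern: RegExp }[] = [
  // e.g. page.locator(".btn-primary") or locator("div.card")
  { rule: "no-css-class-selector", pattern: /\.locator\(\s*["'`][^"'`]*\.[A-Za-z_-][\w-]*/ },
  // e.g. locator("//button") or locator("xpath=//button")
  { rule: "no-xpath", pattern: /\.locator\(\s*["'`](xpath=|\/\/)/ },
  // e.g. page.waitForTimeout(500)
  { rule: "no-wait-for-timeout", pattern: /\.waitForTimeout\(/ },
];

// Scans spec source line by line and reports each rule that matches.
function lintSpecSource(source: string): SelectorViolation[] {
  const violations: SelectorViolation[] = [];
  source.split("\n").forEach((text, index) => {
    for (const { rule, pattern } of SELECTOR_RULES) {
      if (pattern.test(text)) {
        violations.push({ line: index + 1, rule, text: text.trim() });
      }
    }
  });
  return violations;
}
```

Role-based and data-testid lookups such as page.getByRole(...) and page.getByTestId(...) pass through untouched, since the patterns only look at locator(...) arguments and waitForTimeout calls.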

Flake budget

Target: <1% flake rate within 4 weeks of M3 ship.

Industry benchmark: mid-stage SaaS teams hit 4% flake → ~1,000 spurious failures/week (Bug0 analysis). Our hard line:

  • retries: 2 per spec (built into Playwright)
  • If a spec flakes more than 2× in a week, it gets fixed (better selectors) or removed, not "tolerated"
  • Each flake gets a 5-min trace-viewer diagnosis pass — track recurring patterns
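One way to track the budget is to compute it from per-spec run records. The sketch below assumes a hypothetical SpecRun shape (the field names are not taken from any reporter in the repo) and treats a run as a flake when it failed at least once and then passed on retry, which matches how Playwright labels a test "flaky".

```typescript
// Hypothetical record for one spec execution; field names are assumptions.
interface SpecRun {
  spec: string;     // spec file or test title
  attempts: number; // 1 = passed or failed on the first try, >1 = retried
  passed: boolean;  // final outcome after retries
}

// A flake is a run that needed a retry and eventually passed.
function isFlake(run: SpecRun): boolean {
  return run.passed && run.attempts > 1;
}

// Share of runs that flaked, as a fraction (0.01 means 1%).
function flakeRate(runs: SpecRun[]): number {
  if (runs.length === 0) return 0;
  return runs.filter(isFlake).length / runs.length;
}

// Specs that flaked more than maxFlakes times in the supplied window,
// for example one week of runs with the default threshold of 2.
function specsOverBudget(runs: SpecRun[], maxFlakes = 2): string[] {
  const counts = new Map<string, number>();
  for (const run of runs.filter(isFlake)) {
    counts.set(run.spec, (counts.get(run.spec) ?? 0) + 1);
  }
  return Array.from(counts.entries())
    .filter(([, count]) => count > maxFlakes)
    .map(([spec]) => spec);
}
```

Feeding one week of runs into specsOverBudget yields the specs that must be fixed or removed, and flakeRate over the same window can be compared against the <1% target.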

Test data discipline

This follows the pattern post-deploy-smoke established (see docs/infrastructure/post-deploy-smoke.md):

  • Synthetic email pattern: taha+playwright-<spec>-<runId>@askflorence.health
  • Per-test cleanup in afterEach blocks
  • Suite-level cleanup sweep removes any orphans matching the pattern older than 5 minutes
  • HubSpot GDPR-delete reserved for +ci-smoke-* / +playwright-* patterns only (never real work emails — see CLAUDE.md May 9 precedent)
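The naming pattern and the delete guard can be expressed as small pure functions. This is a sketch under stated assumptions: the helper names, the slug normalization, and the ContactRecord shape are hypothetical, and the regex is one reasonable reading of the +ci-smoke-* / +playwright-* rule rather than the repo's actual check.

```typescript
// Sweep threshold from the list above: orphans older than 5 minutes.
const ORPHAN_AGE_MS = 5 * 60 * 1000;

// Builds taha+playwright-<spec>-<runId>@askflorence.health.
// Normalizing the spec name to a lowercase hyphenated slug is an assumption.
function syntheticEmail(spec: string, runId: string): string {
  const slug = spec
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
  return `taha+playwright-${slug}-${runId}@askflorence.health`;
}

// True only for +ci-smoke-* and +playwright-* addresses on the test domain,
// so real work emails can never match.
function isSyntheticEmail(email: string): boolean {
  return /^[^@+\s]+\+(ci-smoke|playwright)-[^@\s]+@askflorence\.health$/i.test(email);
}

// Hypothetical minimal record shape for the cleanup sweep.
interface ContactRecord {
  email: string;
  createdAt: Date;
}

// Selects records the suite-level sweep may delete: synthetic and older than 5 minutes.
function findOrphans(records: ContactRecord[], now: Date): ContactRecord[] {
  return records.filter(
    (record) =>
      isSyntheticEmail(record.email) &&
      now.getTime() - record.createdAt.getTime() > ORPHAN_AGE_MS,
  );
}
```

Routing every delete call through a guard like isSyntheticEmail keeps the GDPR-delete path allow-listed by construction, instead of relying on each spec to pass the right address.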

Cost summary

L1 + L2 (Playwright on PR CI against staging): ~$41/month at 40 PRs/day.

For the cost comparison vs ephemeral previews (~$270/month), see ADR 0008 cost analysis.

Related

  • ADR 0008 — the decision document this operationalizes
  • ADR 0007 — Terraform-driven deploy (companion architectural decision)
  • docs/development/preflight.md — L1 local mirror
  • docs/infrastructure/post-deploy-smoke.md — L3 deploy-time smoke
  • ENG-304 — Playwright UI smoke (M3, the work that operationalizes this strategy)
  • ENG-303 — AWS ephemeral PR previews (M2, deferred per ADR 0008; revisit triggers documented there)

AskFlorence Internal Documentation. Not for public distribution.