Florence AI - the voice-synced rendered WOW flow

Research + a runnable local demo. Linear ENG-356. Branch eng-356-florence-wow-flow-research. No em-dashes anywhere per the CLAUDE.md hard rule (hyphens only). Brand: Florence AI / AskFlorence AI per index.

0. One paragraph

A member talks to Florence by voice. The screen composes each beat in lockstep with her voice (a Clicky-style rendered, guided experience, not a chat): the hook, then ZIP / household / income collected conversationally, then the real subsidized plan revealed with the sticker-to-subsidized strike, then optional doctor / Rx coverage, then onto the waitlist for the plan they pick. Florence orchestrates only. Every dollar, subsidy, deductible, and coverage answer comes verbatim from the existing deterministic AskFlorence pipeline (the same fetchPlansForHousehold + /api/* the home calculator uses) - she never computes a number. ElevenLabs Conversational AI is the ears + turn-taking + voice; our deterministic client tools are the only source of facts.

1. Clicky distilled (from the actual `farzaa/clicky` code)

Concrete, named patterns - not vague praise. Read: worker/src/index.ts, ElevenLabsTTSClient.swift, CompanionManager.swift, CompanionResponseOverlay.swift, CompanionPanelView.swift, BuddyDictationManager.swift, AssemblyAIStreamingTranscriptionProvider.swift, OverlayWindow.swift, DesignSystem.swift.

1a. Their voice-loop architecture

Clicky is a macOS menu-bar companion. The loop:

1. Push-to-talk (Control+Option) starts mic capture (BuddyDictationManager).
2. Audio streams over a websocket to AssemblyAI v3 streaming for ASR; partial transcripts compose live (storedTurnTranscriptsByOrder -> accumulated text).
3. On release, the final transcript + a screenshot go to Claude (api.anthropic.com/v1/messages, streaming SSE) via a Cloudflare Worker that holds the keys (/chat, /tts, /transcribe-token).
4. Claude's reply (point-tag stripped) is spoken via ElevenLabs TTS (/v1/text-to-speech/{voiceId}, model_id: eleven_flash_v2_5, streaming, AVAudioPlayer).
5. Barge-in: a new push-to-talk cancels the in-flight response task and calls stopPlayback() immediately.

The load-bearing fact for us: Clicky uses ElevenLabs as TTS only. Claude is the brain; AssemblyAI is the ears. This matches our documented three-stream voice architecture 1:1 and proves the separation works. We go one better by using ElevenLabs Conversational AI (its own integrated ASR + turn-taking), which sets a far higher conversational bar than Clicky's push-to-talk.

1b. Their rendering craft (the part the founder is pointing at)

Clicky pattern (code)	Effect	Our editorial translation (shipped)
Cursor-glued non-activating overlay, 60fps, never steals focus (`CompanionResponseOverlay`)	A presence on the canvas, not a window	`FlorencePresence`: a lantern mark on cream with a gold glow, anchored to the active scene
`[POINT:x,y:label]` -> triangle flies to + points at the exact element; bezier arc, smoothstep ease, tangent rotation, scale pulse at apex, glow intensifies in flight (`OverlayWindow` 495-568)	The AI directs attention as it talks	On-canvas spotlight: the named element gets a gold focus ring + the rest dims; spotlight target is part of the Scene state
Visible voice state: dot color + status text + live audio-power waveform; RMS boosted 10.2x, decay `max(level, prev*0.72)`, 70ms sample (`CompanionManager`, `BuddyDictationManager` 687-734)	User always knows what it is doing	`FlorencePresence` 4 states (greeting / listening / thinking / speaking); the listening/speaking ring scale is driven by `getInputVolume`/`getOutputVolume` with the same 0.72 decay + 70ms cadence
Streaming narration near the cursor, fades after read	Narration rendered where attention is	Caption ribbon bound to the agent transcript
Latency hiding: hold the processing/spinner state until TTS audio truly starts (`sendTranscriptToClaudeWithScreenshot`)	No dead air without a visual signal	The `searching` beat (gold pulse) holds until the find_plans tool result, reusing the proven home cinematic
Onboarding self-reveal (welcome anim -> intro -> at 40s the buddy demos itself -> char-streamed prompt)	The product teaches itself, cinematically	Florence's hook IS the self-reveal; the first scene composes as she speaks it
Spring 0.2-0.4s / 0.6 damping, char-stream 30-60ms, fade 0.4s, cursor blue `#3380FF` (`DesignSystem`)	Alive, not mechanical	Reused as the home register's `cubic-bezier(0.16,1,0.3,1)` 520ms stagger + 1400ms gold pulse; color stays our gold-2 `#B8903F` on cream
Conversation history capped at 10; point-tag stripped before TTS	Clean spoken text, bounded context	Agent transcript is the record; a grounding dragnet flags any spoken number absent from a tool result
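
The level smoothing in the voice-state row can be sketched as below. This is a minimal sketch, not the shipped `FlorencePresence` code: only the 10.2x boost, the 0.72 decay, and the 70ms cadence come from the table; the function names, the clamp to 0..1, and the 1.0-1.25 ring-scale range are assumptions for illustration.

```typescript
// Constants from the Clicky row above. Samples are assumed to arrive every 70ms.
const CLICKY_RMS_BOOST = 10.2;
const DECAY = 0.72;

// One smoothing step: rise instantly to a louder sample, otherwise
// fall by 0.72 per sample, i.e. max(level, prev * 0.72).
function smoothLevel(prev: number, raw: number, boost: number = 1): number {
  // Clamping to 0..1 is an assumption so the ring never over-scales.
  const level = Math.min(1, Math.max(0, raw * boost));
  return Math.max(level, prev * DECAY);
}

// Run a whole series of samples through the smoother.
function smoothSeries(samples: number[], boost: number = 1): number[] {
  let prev = 0;
  return samples.map((sample) => {
    prev = smoothLevel(prev, sample, boost);
    return prev;
  });
}

// Hypothetical mapping from a 0..1 level to a CSS scale for the ring.
function ringScale(level: number): number {
  return 1 + 0.25 * level;
}
```

In the browser the `raw` value would come from `getInputVolume` / `getOutputVolume` while listening / speaking; a boost of 1 is used there on the assumption that those values are already normalized.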

2. Voice vendor evaluation: Cartesia vs ElevenLabs

This section is the written research, decoupled from the demo (the demo is ElevenLabs per the founder decision). The load-bearing factor is BAA: members speak medications, doctor names, and conditions into the mic, so the voice vendor processes PHI-adjacent audio AND stores a transcript (which IS PHI).

Axis	ElevenLabs	Cartesia
Product fit	Conversational AI (Agents): integrated ASR + turn-taking model + barge-in + TTS + tools + per-conversation transcripts. One vendor, one socket. This is what the founder validated on elevenlabs.io.	Sonic-2 TTS is best-in-class for latency (~75ms first audio) but Cartesia is TTS-first; ASR + turn-taking + agent orchestration are not a single integrated product to the same degree. You assemble the loop yourself (Deepgram ASR + your turn logic + Sonic TTS).
Time-to-first-audio	Low; turn-taking model hides latency well in practice	Sonic-2 ~75ms first audio (the strongest single number); but end-to-end depends on the ASR + LLM you bolt on
Naturalness	Very high; large voice library; the "Sarah" voice reads as a reassuring professional	Very high; fewer voices; excellent prosody
Turn-taking / barge-in	Native in Conversational AI (their turn model + interruption handling). This is the hard part and it is solved for you.	You build it (VAD + endpointing). More control, more work, more risk for an overnight bar.
Browser SDK + mic	First-class: `@elevenlabs/react` `useConversation` + WebRTC, mic + playback + barge-in handled	Browser TTS SDK exists; full conversational loop in-browser is more assembly
STT side	Their ASR inside Conversational AI (no separate STT vendor needed). Standalone "Scribe" model also exists.	No first-party real-time ASR at parity; pairs with Deepgram Nova-3 (the voice.md Phase 1.5 ASR pick, has a BAA)
Cost at expected volume	Conversational AI priced per minute; higher than raw TTS; acceptable at pre-scale, watch at 100k members against the unit-economics targets	Sonic TTS cheaper per minute; total cost depends on the ASR you add
HIPAA / BAA	The decision gate. ElevenLabs offers a BAA on enterprise/scale tiers (not the default self-serve tier). Confirm: (a) does the BAA cover Conversational AI specifically (ASR audio + the hosted LLM turn) or only TTS; (b) the transcript store. Until a signed BAA explicitly covering Conversational AI + transcript storage is in hand, ElevenLabs Conversational AI is demo-acceptable but NOT production-acceptable for real member PHI.	Cartesia: confirm BAA on TTS. Because you would pair it with Deepgram (BAA: yes) + our own LLM, the PHI surface is more decomposable and each subprocessor's BAA is individually known. Easier to reason about for production.
Transcript = audit AND PHI	ElevenLabs stores a transcript of every conversation (a free audit trail - good). But a transcript of a member speaking meds/doctors/conditions IS PHI. Must resolve before production: where does that transcript live (ElevenLabs side), retention + deletion controls, and is it inside the ElevenLabs BAA. The audit-trail benefit only counts if that store is BAA-covered OR we disable/redirect it to our own BAA-covered storage (Mongo `florence_*` per runtime.md).	Same question, smaller blast radius: with the decomposed stack the transcript can be produced + stored by OUR runtime (text-as-source-of-truth, voice.md) inside our existing BAA boundary, rather than vendor-side.

Recommendation

Demo (now): ElevenLabs Conversational AI. Fastest path to the WOW bar, founder-validated, the turn-taking is solved. Acceptable because the demo uses synthetic data only; no real member PHI.
Production: do NOT let the demo's vendor choice imply the production vendor. Two viable production shapes, decided by which BAA lands cleanest:
1. ElevenLabs Conversational AI with a signed BAA that explicitly covers Conversational AI (ASR + transcript store) + bring-your-own LLM pointed at our Bedrock Claude so the reasoning + grounding stay ours. Transcript retention/deletion contractually pinned or disabled in favor of our Mongo store.
2. Decomposed: Deepgram Nova-3 ASR (BAA) + our Bedrock Claude + Cartesia Sonic-2 TTS (BAA), our runtime owns the transcript inside the existing AWS BAA boundary. More wiring, cleanest compliance story, matches voice.md Phase 1.5 exactly.
Tie-in to vendor-BAA discipline (#57): add ElevenLabs (Conversational AI tier + transcript store) and Cartesia to the vendor register with BAA status = OPEN; a vendor that will not sign a BAA covering the conversational + transcript surface is disqualifying for production even though it is fine for this demo.

3. The proposed WOW flow (what shipped in the demo)

Persona: the warmest, most expert, most genuinely helpful health insurance guide alive - a sharp friend who is the best agent in the country. Hook (spoken first, no tool):

"Turns out good, affordable healthcare in America is real. It is just hidden from the people who need it. I am Florence. Plans here can start at zero to seven dollars a month, many with no deductible and strong coverage. The same plans run close to a thousand dollars a month on healthcare dot gov, and usually only show up through a broker. Tell me three quick things and I will pull real plans with the subsidy already applied. No social security number, no spam, just real numbers. First, what is your ZIP code?"

Beat	Scene composes	Tool (deterministic)	Spotlight
greeting	hook + presence bloom	none	presence
collect_zip	ZIP prompt; location chip fills	`collect_location` -> `GET /api/counties` (multi-county / SBE / PO-box edges handled)	zip chip
collect_household	who + ages	`set_household`	household chip
collect_income	rough yearly	`set_income`	income chip
searching	gold pulse, holds for audio + result	`find_plans` -> `fetchPlansForHousehold` (the exact shared pipeline = `/api/eligibility` + `/api/plans`)	center
reveal	PriceReveal strike: sticker -> subsidized	(from find_plans)	price
plans	3-col PlanCard micro stagger	(from find_plans)	top plan
coverage	per-plan coverage pills light	`check_provider` / `check_drug` -> `/api/providers|drugs`	
waitlist	hold-your-spot card	`select_plan` then `join_waitlist` -> `POST /api/waitlist` (`interest:"plan_interest"`, `source_page:"florence_voice"`)	waitlist
done	warm close + what's next	none	presence
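
The beat table above can be held as plain data so that a client-tool call doubles as a scene timing event. A minimal sketch, assuming a hypothetical `BEATS` array and helper names; the beat, tool, and spotlight values are copied from the table, and the coverage spotlight is left `null` because the table does not state it clearly.

```typescript
type Beat =
  | "greeting" | "collect_zip" | "collect_household" | "collect_income"
  | "searching" | "reveal" | "plans" | "coverage" | "waitlist" | "done";

interface BeatSpec {
  beat: Beat;
  tools: readonly string[];   // deterministic tools that fire during this beat
  spotlight: string | null;   // element that gets the focus ring
}

// Order matters: this is the golden arc, top to bottom.
const BEATS: readonly BeatSpec[] = [
  { beat: "greeting", tools: [], spotlight: "presence" },
  { beat: "collect_zip", tools: ["collect_location"], spotlight: "zip chip" },
  { beat: "collect_household", tools: ["set_household"], spotlight: "household chip" },
  { beat: "collect_income", tools: ["set_income"], spotlight: "income chip" },
  { beat: "searching", tools: ["find_plans"], spotlight: "center" },
  { beat: "reveal", tools: [], spotlight: "price" },
  { beat: "plans", tools: [], spotlight: "top plan" },
  { beat: "coverage", tools: ["check_provider", "check_drug"], spotlight: null },
  { beat: "waitlist", tools: ["select_plan", "join_waitlist"], spotlight: "waitlist" },
  { beat: "done", tools: [], spotlight: "presence" },
];

// Which beat should compose when the agent calls a given tool?
function beatForTool(tool: string): Beat | undefined {
  return BEATS.find((spec) => spec.tools.indexOf(tool) >= 0)?.beat;
}

// The beat that follows the current one, or undefined at the end of the arc.
function nextBeat(current: Beat): Beat | undefined {
  const index = BEATS.findIndex((spec) => spec.beat === current);
  return index >= 0 ? BEATS[index + 1]?.beat : undefined;
}
```

Keeping the arc as data means the desktop and mobile trees can read the same sequence and differ only in how each beat renders.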

Voice state machine (UX layer; ElevenLabs owns transport turn-taking): IDLE -> GREETING -> LISTENING -> THINKING(tool) -> SPEAKING -> LISTENING ... -> WAITLISTED. Barge-in cancels scene + spotlight.
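
The UX-layer state machine above can be sketched as a transition table. The state names come from the line above; the event names, the `clearScene` flag, and the choice to enter WAITLISTED from THINKING are assumptions for illustration, since the source does not name the events.

```typescript
type VoiceState = "IDLE" | "GREETING" | "LISTENING" | "THINKING" | "SPEAKING" | "WAITLISTED";

// Hypothetical event names; ElevenLabs owns the real transport turn-taking.
type VoiceEvent =
  | "start" | "agent_done_speaking" | "tool_called"
  | "tool_result" | "barge_in" | "waitlist_joined";

const TRANSITIONS: Record<VoiceState, Partial<Record<VoiceEvent, VoiceState>>> = {
  IDLE: { start: "GREETING" },
  GREETING: { agent_done_speaking: "LISTENING", barge_in: "LISTENING" },
  LISTENING: { tool_called: "THINKING" },
  THINKING: { tool_result: "SPEAKING", waitlist_joined: "WAITLISTED" },
  SPEAKING: { agent_done_speaking: "LISTENING", barge_in: "LISTENING" },
  WAITLISTED: {},
};

interface StepResult {
  state: VoiceState;
  clearScene: boolean; // true when a barge-in should cancel scene + spotlight
}

// Apply one event. Events with no entry for the current state are ignored.
function step(state: VoiceState, event: VoiceEvent): StepResult {
  const next = TRANSITIONS[state][event] ?? state;
  return { state: next, clearScene: event === "barge_in" && next !== state };
}

// Fold a sequence of events from IDLE, returning the final step.
function run(events: VoiceEvent[]): StepResult {
  let result: StepResult = { state: "IDLE", clearScene: false };
  for (const event of events) {
    result = step(result.state, event);
  }
  return result;
}
```

Ignoring unknown events rather than throwing keeps the presence stable if the transport emits something the UX layer does not model.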

Edges handled: mis-heard / invalid ZIP (re-ask), multi-county ZIP (ask which), SBE state (honest stop, name the state marketplace, no fabricated numbers), PO-box / unsupported ZIP (suggest nearby county), no plans (honest), provider/drug not found (re-ask), bad email (re-ask), tool error (Florence says she could not pull it + offers retry; never fabricates), mic denied (typed fallback on the dedicated page).
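
One way to keep the ZIP edges honest is to have the location tool return a tagged result and map each tag to exactly one conversational move. A minimal sketch under assumptions: the `LocationResult` shape, the `Move` names, and both function names are hypothetical, not the shipped `collect_location` contract.

```typescript
// Hypothetical tagged result for the location step.
type LocationResult =
  | { kind: "ok"; county: string; state: string }
  | { kind: "invalid_zip" }
  | { kind: "multi_county"; counties: string[] }
  | { kind: "sbe_state"; state: string }
  | { kind: "unsupported_zip" }
  | { kind: "tool_error" };

type Move =
  | "proceed" | "re_ask" | "ask_which_county"
  | "honest_stop" | "suggest_nearby_county" | "offer_retry";

// Cheap pre-check before any network call: five digits, nothing else.
function isPlausibleZip(heard: string): boolean {
  return /^\d{5}$/.test(heard.trim());
}

// Exactly one move per edge; none of them involves stating a number.
function moveFor(result: LocationResult): Move {
  switch (result.kind) {
    case "ok": return "proceed";
    case "invalid_zip": return "re_ask";
    case "multi_county": return "ask_which_county";
    case "sbe_state": return "honest_stop";
    case "unsupported_zip": return "suggest_nearby_county";
    case "tool_error": return "offer_retry";
    default: {
      // Compile-time exhaustiveness check: a new kind must add a case.
      const unreachable: never = result;
      return unreachable;
    }
  }
}
```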

Grounding (the architectural invariant): the agent prompt forbids stating any number not in a tool result this turn; a client-side dragnet scans every salient number Florence speaks against the tool-result numbers and soft-flags ungrounded ones (logged + surfaced in dev). Production uses the Haiku grounding pass per principles #1.
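
The client-side dragnet described above can be sketched as two steps: collect every number present in this turn's tool results, then flag any number in the spoken text that is not in that set. This is a minimal sketch, not the shipped implementation; the function names are hypothetical, and it assumes the transcript renders amounts as digits (for example `$1,051.30`) rather than spelled-out words.

```typescript
// Pull numeric tokens out of text, dropping "$" and thousands separators.
function extractNumbers(text: string): number[] {
  const tokens = text.match(/\$?\d[\d,]*(?:\.\d+)?/g) ?? [];
  return tokens.map((token) => Number(token.replace(/[$,]/g, "")));
}

// Walk an arbitrary JSON-like tool result and collect every number in it,
// including numbers embedded in strings.
function collectNumbers(value: unknown, into: Set<number>): void {
  if (typeof value === "number") {
    into.add(value);
  } else if (typeof value === "string") {
    for (const n of extractNumbers(value)) into.add(n);
  } else if (Array.isArray(value)) {
    for (const item of value) collectNumbers(item, into);
  } else if (typeof value === "object" && value !== null) {
    const record = value as Record<string, unknown>;
    for (const key in record) collectNumbers(record[key], into);
  }
}

// Numbers Florence spoke that no tool result this turn contains.
function findUngrounded(spoken: string, toolResults: unknown[]): number[] {
  const grounded = new Set<number>();
  collectNumbers(toolResults, grounded);
  return extractNumbers(spoken).filter((n) => !grounded.has(n));
}
```

This sketch compares every numeric token; the real dragnet narrows to salient numbers, and the source states production replaces this soft flag with the Haiku grounding pass.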

4. Architecture - demo vs production

Both keep every number byte-for-byte ours. They differ only in where the conversational LLM runs.

Demo (shipped): ElevenLabs Conversational AI + their hosted LLM (gemini-2.0-flash, orchestration only) + our deterministic CLIENT tools. The agent cannot state a number; it must call client tools that run in the browser and hit our same-origin /api/*. Zero tunnel, fully local, lowest latency, most robust, numbers 100% deterministic. Client-tool calls double as the Scene Director's timing events.
Production (recommended, seam shipped, off the demo path): bring-your-own-LLM -> our Bedrock Claude via the OpenAI-shaped shim at /api/florence/byo-llm (uses @anthropic-ai/bedrock-sdk, FLORENCE_BEDROCK_MODEL_ID default us.anthropic.claude-sonnet-4-6, flips to us.anthropic.claude-opus-4-7 by one env var once Bedrock model-access for opus-4-7 is granted on the mgmt account). Requires ElevenLabs cloud to reach our endpoint (deploy or tunnel) - that is why it is the production path, decoupled from the local demo.
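
The production seam amounts to two small steps: resolve the model ID from one env var, and reshape an OpenAI-style chat request into the Anthropic Messages shape (system prompt hoisted to a top-level field, `max_tokens` required). A minimal sketch under assumptions: the type and function names are hypothetical, the 1024 default for `max_tokens` is illustrative, and it handles plain string content only (no tool calls, no streaming). It is not the code at /api/florence/byo-llm.

```typescript
interface OpenAIMessage {
  role: "system" | "user" | "assistant";
  content: string;
}
interface OpenAIChatRequest {
  model?: string;
  messages: OpenAIMessage[];
  max_tokens?: number;
}
interface AnthropicMessage {
  role: "user" | "assistant";
  content: string;
}
interface AnthropicRequest {
  model: string;
  system?: string;
  messages: AnthropicMessage[];
  max_tokens: number;
}

const DEFAULT_MODEL_ID = "us.anthropic.claude-sonnet-4-6";

// One env var flips the model; blank or missing falls back to the default.
function resolveModelId(env: Record<string, string | undefined>): string {
  const configured = (env["FLORENCE_BEDROCK_MODEL_ID"] ?? "").trim();
  return configured !== "" ? configured : DEFAULT_MODEL_ID;
}

// The incoming `model` field is ignored on purpose (assumption): our env decides.
function toAnthropic(
  req: OpenAIChatRequest,
  env: Record<string, string | undefined>,
): AnthropicRequest {
  const systemParts: string[] = [];
  const messages: AnthropicMessage[] = [];
  for (const message of req.messages) {
    if (message.role === "system") systemParts.push(message.content);
    else messages.push({ role: message.role, content: message.content });
  }
  const system = systemParts.join("\n\n");
  return {
    model: resolveModelId(env),
    ...(system !== "" ? { system } : {}),
    messages,
    max_tokens: req.max_tokens ?? 1024,
  };
}
```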

Why this resolves the comment-6 fork: ElevenLabs Agents brings its own LLM, but the agent is contractually + structurally barred from being the source of any fact - the only way it can say a price is to call our tool and read back exactly what the deterministic pipeline returned. Audio + transcript transit ElevenLabs in BOTH wirings, so the BAA question is identical either way and is the production gate, not a demo blocker.
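
The "read back exactly what the pipeline returned" rule can be sketched as a thin wrapper: the client tool forwards its input to the deterministic pipeline and returns the serialized result untouched, or an explicit error the agent can apologize for. A minimal sketch; `makeClientTool` and the `{ ok, result }` envelope are hypothetical, and the `pipeline` argument stands in for `fetchPlansForHousehold`.

```typescript
// Stand-in type for a deterministic pipeline call such as fetchPlansForHousehold.
type Pipeline<Input> = (input: Input) => Promise<unknown>;

// Wrap a pipeline as a client tool. The wrapper never edits, rounds,
// or derives a value; it only serializes what the pipeline returned.
function makeClientTool<Input>(pipeline: Pipeline<Input>) {
  return async (input: Input): Promise<string> => {
    try {
      const result = await pipeline(input);
      return JSON.stringify({ ok: true, result });
    } catch {
      // No fallback numbers: the agent is told the call failed and offers a retry.
      return JSON.stringify({ ok: false, error: "tool_error" });
    }
  };
}
```

A tool built this way would be registered with the agent under the name the prompt uses (for example `find_plans`), so every price the model speaks has already passed through this function as a string.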

Files (post-M0 monorepo): apps/web/src/lib/florence/* (agent prompt + provision, scene steps + director, deterministic client tools, Bedrock seam), apps/web/src/app/api/florence/* (agent-session, flag, byo-llm), apps/web/src/components/florence/* (the shared hook + presence + the two dedicated experiences + launcher), apps/web/src/app/florence/* (flag-gated page + CSS). Server flag FLORENCE_WOW_DEMO_ENABLED (plain server env, default off, mirrors session-flag.ts).

Two dedicated experiences, one brain: desktop (1496x756, side presence + on-canvas spotlight + dim, 3-col stagger) and a separate purpose-built mobile-native tree (390x844, full-bleed beat scenes, persistent presence, 56px thumb CTA, safe-area, progressive disclosure). Selected by a real device decision (UA hint + matchMedia swap), not a CSS reflow.
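
The device decision can be sketched as a pure function over the signals this document names: the `?device=` override and the 600px breakpoint from section 5, plus a UA hint. The precedence order (override, then UA hint, then width) and all names here are assumptions for illustration.

```typescript
type Tree = "mobile" | "desktop";

interface DeviceSignals {
  override?: string | null; // value of ?device=, if present
  uaMobile?: boolean;       // UA hint, when the browser provides one
  viewportWidth: number;    // CSS pixels, as a matchMedia query would see it
}

const MOBILE_MAX_WIDTH = 600;

// Pick which dedicated tree to mount. This chooses a component tree;
// it is not a CSS reflow of a single tree.
function pickTree(signals: DeviceSignals): Tree {
  if (signals.override === "mobile" || signals.override === "desktop") {
    return signals.override;
  }
  if (signals.uaMobile === true) return "mobile";
  return signals.viewportWidth <= MOBILE_MAX_WIDTH ? "mobile" : "desktop";
}
```

In the browser, `viewportWidth` would typically be replaced by a `matchMedia("(max-width: 600px)")` listener so the tree swaps when the window crosses the breakpoint.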

5. Running the demo

Local only. No deploy, no staging.

# from the worktree
cd ~/Developer/ask-florence-eng-356-florence-wow-flow-research/apps/web
PORT=3056 npx next dev
# open http://localhost:3056/florence  (grant microphone)
# resize to <=600px wide to load the dedicated mobile tree, or
#   /florence?device=mobile  /florence?device=desktop  to force a tree

The flag FLORENCE_WOW_DEMO_ENABLED=enabled and ELEVENLABS_API_KEY live in the canonical .env.local (gitignored, symlinked into the worktree). Flag off -> /florence 404s and the launcher does not mount.
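
The flag behavior can be sketched as one strict comparison that both the page and the launcher consult. A minimal sketch; the function names and the returned shape are hypothetical, and only the env var name, the `enabled` value, and the 404 / no-mount behavior come from this document.

```typescript
// Default off: anything other than the exact string "enabled" is off.
function isFlorenceDemoEnabled(env: Record<string, string | undefined>): boolean {
  return env["FLORENCE_WOW_DEMO_ENABLED"] === "enabled";
}

interface GateDecision {
  pageStatus: 200 | 404;  // 404 is where the page would call notFound()
  mountLauncher: boolean;
}

// One decision drives both surfaces so they cannot disagree.
function gateFlorence(env: Record<string, string | undefined>): GateDecision {
  const enabled = isFlorenceDemoEnabled(env);
  return { pageStatus: enabled ? 200 : 404, mountLauncher: enabled };
}
```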

Real vs simulated (honest)

Verified live in the sandbox	How
ElevenLabs Conversational AI integration end to end	`POST /api/florence/agent-session` provisions a real agent + mints a real WebRTC conversation token with the founder's key (agent `AskFlorence WOW (ENG-356)`, voice Sarah). The voice brain is real, not stubbed.
Both dedicated experiences render on-register	Desktop intro pixel-centered at 1496x756 (DOM geometry + screenshot), mobile a separate tree at 390x844 with a 56px thumb CTA (screenshot). Zero console errors. tsc clean across `apps/web`.
The full voice-synced rendered arc + grounding	A network-free Scene Director proof runs the entire golden arc (greeting -> ... -> done) and confirms the grounding dragnet flags ONLY a fabricated `$873`, never the real tool-sourced `$0` / `$1,051.30`.
Flag gate	`/api/florence/flag` -> `{enabled:true}`; the 404-when-off path is the `notFound()` guard.

NOT live-captured here	Why
The spoken golden-scenario numbers (84094 -> Salt Lake UT -> Medicaid -> ~$1,041 APTC -> Tyler Wood -> Ozempic narrowing)	This sandbox firewalls outbound TCP 27017, so MongoDB Atlas is unreachable. This breaks the ENTIRE app's data layer here (the home calculator, /plans, every `/api/*` that hits `getDb()`), not anything Florence-specific. The Florence layer delegates 100% to the unchanged shared pipeline and recomputes nothing, so on any Atlas-reachable machine (the founder's local, prod) the real byte-for-byte numbers flow. The Atlas CLI is authed but allowlisting cannot fix an egress-port block.
The live spoken loop (mic in, Florence voice out)	A headless preview cannot grant a real microphone, run WebRTC, or play audio. Founder: open `http://localhost:3056/florence` on your machine (Atlas-reachable), grant the mic, and speak the golden scenario; the spoken numbers will be the live deterministic values.

6. Open decisions / asks for the founder

Bedrock Opus 4.7 access. Verified working on the askflorence-mgmt profile: Sonnet 4.6 / Opus 4.6 / Opus 4.5 / Haiku 4.5. Opus 4.7 is an ACTIVE inference profile but anthropic.claude-opus-4-7 is not yet granted (AccessDeniedException). Enable model-access in the Bedrock console and the production BYO-LLM flips to Opus 4.7 with one env var.
Production voice-vendor + BAA decision (section 2). Demo-acceptable != production-signable. Needs: a signed BAA explicitly covering ElevenLabs Conversational AI (ASR + transcript store) OR a decision to go decomposed (Deepgram + Bedrock + Cartesia). Add both vendors to the #57 register, BAA status OPEN.
Transcript-as-PHI (section 2). Decide: contractually pin ElevenLabs transcript retention/deletion inside a BAA, or disable vendor-side transcripts and persist only to our Mongo florence_* store (text-as-source-of-truth). Production blocker, not a demo one.
Prod brain-wiring = bring-your-own-LLM -> Bedrock Claude + server tools (seam shipped at /api/florence/byo-llm). Confirm direction.
Integrated launcher placement. FlorenceLauncher is built + self-gating but intentionally NOT mounted on home / /plans / plan-detail in this PR (avoid regression risk on conversion-critical pages during an unattended build). Say where to mount it.