Florence AI - the voice-synced rendered WOW flow
Research + a runnable local demo. Linear ENG-356. Branch `eng-356-florence-wow-flow-research`. No em-dashes anywhere per the CLAUDE.md hard rule (hyphens only). Brand: Florence AI / AskFlorence AI per index.
0. One paragraph
A member talks to Florence by voice. The screen composes each beat in lockstep with her voice (a Clicky-style rendered, guided experience, not a chat): the hook, then ZIP / household / income collected conversationally, then the real subsidized plan revealed with the sticker-to-subsidized strike, then optional doctor / Rx coverage, then onto the waitlist for the plan they pick. Florence orchestrates only. Every dollar, subsidy, deductible, and coverage answer comes verbatim from the existing deterministic AskFlorence pipeline (the same fetchPlansForHousehold + /api/* the home calculator uses) - she never computes a number. ElevenLabs Conversational AI is the ears + turn-taking + voice; our deterministic client tools are the only source of facts.
1. Clicky distilled (from the actual farzaa/clicky code)
Concrete, named patterns - not vague praise. Read: worker/src/index.ts, ElevenLabsTTSClient.swift, CompanionManager.swift, CompanionResponseOverlay.swift, CompanionPanelView.swift, BuddyDictationManager.swift, AssemblyAIStreamingTranscriptionProvider.swift, OverlayWindow.swift, DesignSystem.swift.
1a. Their voice-loop architecture
Clicky is a macOS menu-bar companion. The loop:
- Push-to-talk (Control+Option) starts mic capture (`BuddyDictationManager`).
- Audio streams over a websocket to AssemblyAI v3 streaming for ASR; partial transcripts compose live (`storedTurnTranscriptsByOrder` -> accumulated text).
- On release, the final transcript + a screenshot go to Claude (`api.anthropic.com/v1/messages`, streaming SSE) via a Cloudflare Worker that holds the keys (`/chat`, `/tts`, `/transcribe-token`).
- Claude's reply (point-tag stripped) is spoken via ElevenLabs TTS (`/v1/text-to-speech/{voiceId}`, `model_id: eleven_flash_v2_5`, streaming, AVAudioPlayer).
- Barge-in: a new push-to-talk cancels the in-flight response task and calls `stopPlayback()` immediately.
The load-bearing fact for us: Clicky uses ElevenLabs as TTS only. Claude is the brain; AssemblyAI is the ears. This 1:1 matches our documented three-stream voice architecture. It proves the separation works; we go one better by using ElevenLabs Conversational AI (its own integrated ASR + turn-taking) for a far higher conversational bar than Clicky's push-to-talk.
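A minimal sketch of the "point-tag stripped" step in the loop above, assuming the tag has the `[POINT:x,y:label]` shape with numeric coordinates. The helper name `splitPointTags` and the `PointTag` type are hypothetical; this is not Clicky's actual code.

```typescript
// Hypothetical helper: separates pointing directives from the text that goes to TTS.
// Assumption: tags look like [POINT:x,y:label] with numeric x and y.
interface PointTag {
  x: number;
  y: number;
  label: string;
}

const POINT_TAG = /\[POINT:(-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?):([^\]]*)\]/g;

function splitPointTags(reply: string): { spoken: string; points: PointTag[] } {
  const points: PointTag[] = [];
  const spoken = reply
    .replace(POINT_TAG, (_match: string, x: string, y: string, label: string) => {
      points.push({ x: Number(x), y: Number(y), label: label.trim() });
      return ""; // the tag itself is never spoken
    })
    .replace(/\s{2,}/g, " ") // collapse the gap the removed tag leaves behind
    .trim();
  return { spoken, points };
}
```

The spoken string goes to TTS and the collected points drive the overlay, so the voice never reads coordinates aloud.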
1b. Their rendering craft (the part the founder is pointing at)
| Clicky pattern (code) | Effect | Our editorial translation (shipped) |
|---|---|---|
| Cursor-glued non-activating overlay, 60fps, never steals focus (CompanionResponseOverlay) | A presence on the canvas, not a window | FlorencePresence: a lantern mark on cream with a gold glow, anchored to the active scene |
| [POINT:x,y:label] -> triangle flies to + points at the exact element; bezier arc, smoothstep ease, tangent rotation, scale pulse at apex, glow intensifies in flight (OverlayWindow 495-568) | The AI directs attention as it talks | On-canvas spotlight: the named element gets a gold focus ring + the rest dims; spotlight target is part of the Scene state |
| Visible voice state: dot color + status text + live audio-power waveform; RMS boosted 10.2x, decay max(level, prev*0.72), 70ms sample (CompanionManager, BuddyDictationManager 687-734) | User always knows what it is doing | FlorencePresence 4 states (greeting / listening / thinking / speaking); the listening/speaking ring scale is driven by getInputVolume/getOutputVolume with the same 0.72 decay + 70ms cadence |
| Streaming narration near the cursor, fades after read | Narration rendered where attention is | Caption ribbon bound to the agent transcript |
| Latency hiding: hold the processing/spinner state until TTS audio truly starts (sendTranscriptToClaudeWithScreenshot) | No dead air without a visual signal | The searching beat (gold pulse) holds until the find_plans tool result, reusing the proven home cinematic |
| Onboarding self-reveal (welcome anim -> intro -> at 40s the buddy demos itself -> char-streamed prompt) | The product teaches itself, cinematically | Florence's hook IS the self-reveal; the first scene composes as she speaks it |
| Spring 0.2-0.4s / 0.6 damping, char-stream 30-60ms, fade 0.4s, cursor blue #3380FF (DesignSystem) | Alive, not mechanical | Reused as the home register's cubic-bezier(0.16,1,0.3,1) 520ms stagger + 1400ms gold pulse; color stays our gold-2 #B8903F on cream |
| Conversation history capped at 10; point-tag stripped before TTS | Clean spoken text, bounded context | Agent transcript is the record; a grounding dragnet flags any spoken number absent from a tool result |
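The level smoothing in the voice-state row (10.2x boost, `max(level, prev*0.72)` decay, 70ms cadence) can be sketched as below. The function names, the clamp to 0..1, and the ring-scale range are assumptions for illustration; only the three constants come from the table.

```typescript
// Constants taken from the table above; everything else is a hypothetical sketch.
const RMS_BOOST = 10.2;
const DECAY = 0.72;
const SAMPLE_INTERVAL_MS = 70; // how often a caller would sample the volume

// Fast attack, slow release: a louder sample wins immediately,
// a quieter one lets the previous level fall by 28% per tick.
function nextLevel(prevLevel: number, rawRms: number): number {
  const boosted = Math.min(1, Math.max(0, rawRms * RMS_BOOST)); // clamp is an assumption
  return Math.max(boosted, prevLevel * DECAY);
}

// Maps a 0..1 level onto a ring scale; the 1.0..1.35 range is illustrative only.
function ringScale(level: number, minScale = 1, maxScale = 1.35): number {
  return minScale + (maxScale - minScale) * level;
}
```

A caller would sample the input or output volume every `SAMPLE_INTERVAL_MS`, feed it through `nextLevel`, and bind `ringScale(level)` to the presence ring's transform.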
2. Voice vendor evaluation: Cartesia vs ElevenLabs
This section is the written research, decoupled from the demo (the demo is ElevenLabs per the founder decision). The load-bearing factor is BAA: members speak medications, doctor names, and conditions into the mic, so the voice vendor processes PHI-adjacent audio AND stores a transcript (which IS PHI).
| Axis | ElevenLabs | Cartesia |
|---|---|---|
| Product fit | Conversational AI (Agents): integrated ASR + turn-taking model + barge-in + TTS + tools + per-conversation transcripts. One vendor, one socket. This is what the founder validated on elevenlabs.io. | Sonic-2 TTS is best-in-class for latency (~75ms first audio) but Cartesia is TTS-first; ASR + turn-taking + agent orchestration are not a single integrated product to the same degree. You assemble the loop yourself (Deepgram ASR + your turn logic + Sonic TTS). |
| Time-to-first-audio | Low; turn-taking model hides latency well in practice | Sonic-2 ~75ms first audio (the strongest single number); but end-to-end depends on the ASR + LLM you bolt on |
| Naturalness | Very high; large voice library; the "Sarah" voice reads as a reassuring professional | Very high; fewer voices; excellent prosody |
| Turn-taking / barge-in | Native in Conversational AI (their turn model + interruption handling). This is the hard part and it is solved for you. | You build it (VAD + endpointing). More control, more work, more risk for an overnight bar. |
| Browser SDK + mic | First-class: @elevenlabs/react useConversation + WebRTC, mic + playback + barge-in handled | Browser TTS SDK exists; full conversational loop in-browser is more assembly |
| STT side | Their ASR inside Conversational AI (no separate STT vendor needed). Standalone "Scribe" model also exists. | No first-party real-time ASR at parity; pairs with Deepgram Nova-3 (the voice.md Phase 1.5 ASR pick, has a BAA) |
| Cost at expected volume | Conversational AI priced per minute; higher than raw TTS; acceptable at pre-scale, watch at 100k members against the unit-economics targets | Sonic TTS cheaper per minute; total cost depends on the ASR you add |
| HIPAA / BAA | The decision gate. ElevenLabs offers a BAA on enterprise/scale tiers (not the default self-serve tier). Confirm: (a) does the BAA cover Conversational AI specifically (ASR audio + the hosted LLM turn) or only TTS; (b) the transcript store. Until a signed BAA explicitly covering Conversational AI + transcript storage is in hand, ElevenLabs Conversational AI is demo-acceptable but NOT production-acceptable for real member PHI. | Cartesia: confirm BAA on TTS. Because you would pair it with Deepgram (BAA: yes) + our own LLM, the PHI surface is more decomposable and each subprocessor's BAA is individually known. Easier to reason about for production. |
| Transcript = audit AND PHI | ElevenLabs stores a transcript of every conversation (a free audit trail - good). But a transcript of a member speaking meds/doctors/conditions IS PHI. Must resolve before production: where does that transcript live (ElevenLabs side), retention + deletion controls, and is it inside the ElevenLabs BAA. The audit-trail benefit only counts if that store is BAA-covered OR we disable/redirect it to our own BAA-covered storage (Mongo florence_* per runtime.md). | Same question, smaller blast radius: with the decomposed stack the transcript can be produced + stored by OUR runtime (text-as-source-of-truth, voice.md) inside our existing BAA boundary, rather than vendor-side. |
Recommendation
- Demo (now): ElevenLabs Conversational AI. Fastest path to the WOW bar, founder-validated, the turn-taking is solved. Acceptable because the demo uses synthetic data only; no real member PHI.
- Production: do NOT let the demo's vendor choice imply the production vendor. Two viable production shapes, decided by which BAA lands cleanest:
  - ElevenLabs Conversational AI with a signed BAA that explicitly covers Conversational AI (ASR + transcript store) + bring-your-own LLM pointed at our Bedrock Claude so the reasoning + grounding stay ours. Transcript retention/deletion contractually pinned or disabled in favor of our Mongo store.
  - Decomposed: Deepgram Nova-3 ASR (BAA) + our Bedrock Claude + Cartesia Sonic-2 TTS (BAA), our runtime owns the transcript inside the existing AWS BAA boundary. More wiring, cleanest compliance story, matches voice.md Phase 1.5 exactly.
- Tie-in to vendor-BAA discipline (#57): add ElevenLabs (Conversational AI tier + transcript store) and Cartesia to the vendor register with BAA status = OPEN; a vendor that will not sign a BAA covering the conversational + transcript surface is disqualifying for production even though it is fine for this demo.
3. The proposed WOW flow (what shipped in the demo)
Persona: the warmest, most expert, most genuinely helpful health insurance guide alive - a sharp friend who is the best agent in the country. Hook (spoken first, no tool):
"Turns out good, affordable healthcare in America is real. It is just hidden from the people who need it. I am Florence. Plans here can start at zero to seven dollars a month, many with no deductible and strong coverage. The same plans run close to a thousand dollars a month on healthcare dot gov, and usually only show up through a broker. Tell me three quick things and I will pull real plans with the subsidy already applied. No social security number, no spam, just real numbers. First, what is your ZIP code?"
| Beat | Scene composes | Tool (deterministic) | Spotlight |
|---|---|---|---|
| greeting | hook + presence bloom | none | presence |
| collect_zip | ZIP prompt; location chip fills | collect_location -> GET /api/counties (multi-county / SBE / PO-box edges handled) | zip chip |
| collect_household | who + ages | set_household | household chip |
| collect_income | rough yearly | set_income | income chip |
| searching | gold pulse, holds for audio + result | find_plans -> fetchPlansForHousehold (the exact shared pipeline = /api/eligibility + /api/plans) | center |
| reveal | PriceReveal strike: sticker -> subsidized | (from find_plans) | price |
| plans | 3-col PlanCard micro stagger | (from find_plans) | top plan |
| coverage | per-plan coverage pills light | check_provider / check_drug -> `/api/providers\|drugs` | |
| waitlist | hold-your-spot card | select_plan then join_waitlist -> POST /api/waitlist (interest:"plan_interest", source_page:"florence_voice") | waitlist |
| done | warm close + what's next | none | presence |
Voice state machine (UX layer; ElevenLabs owns transport turn-taking): IDLE -> GREETING -> LISTENING -> THINKING(tool) -> SPEAKING -> LISTENING ... -> WAITLISTED. Barge-in cancels scene + spotlight.
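One way to picture the UX-layer state machine is as a transition table. This is a sketch under assumptions: the state names come from the line above, but the event names (`start`, `agent_done`, `user_done`, `tool_result`, `barge_in`, `waitlisted`) and the exact edges are hypothetical, not the shipped director.

```typescript
type VoiceState = "IDLE" | "GREETING" | "LISTENING" | "THINKING" | "SPEAKING" | "WAITLISTED";
// Event names are hypothetical; ElevenLabs owns the transport-level turn-taking.
type VoiceEvent = "start" | "agent_done" | "user_done" | "tool_result" | "barge_in" | "waitlisted";

const TRANSITIONS: Record<VoiceState, Partial<Record<VoiceEvent, VoiceState>>> = {
  IDLE: { start: "GREETING" },
  GREETING: { agent_done: "LISTENING", barge_in: "LISTENING" },
  LISTENING: { user_done: "THINKING" },
  THINKING: { tool_result: "SPEAKING", waitlisted: "WAITLISTED" },
  SPEAKING: { agent_done: "LISTENING", barge_in: "LISTENING" },
  WAITLISTED: {}, // terminal in this sketch
};

// Unknown (state, event) pairs are ignored rather than thrown,
// so a late or duplicate event cannot wedge the UI.
function step(state: VoiceState, event: VoiceEvent): VoiceState {
  return TRANSITIONS[state][event] ?? state;
}

// Barge-in is the one event that should also clear the scene and spotlight.
function clearsScene(event: VoiceEvent): boolean {
  return event === "barge_in";
}
```

Keeping the table as data makes the legal arcs easy to review next to the beat table.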
Edges handled: mis-heard / invalid ZIP (re-ask), multi-county ZIP (ask which), SBE state (honest stop, name the state marketplace, no fabricated numbers), PO-box / unsupported ZIP (suggest nearby county), no plans (honest), provider/drug not found (re-ask), bad email (re-ask), tool error (Florence says she could not pull it + offers retry; never fabricates), mic denied (typed fallback on the dedicated page).
Grounding (the architectural invariant): the agent prompt forbids stating any number not in a tool result this turn; a client-side dragnet scans every salient number Florence speaks against the tool-result numbers and soft-flags ungrounded ones (logged + surfaced in dev). Production uses the Haiku grounding pass per principles #1.
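A minimal sketch of the client-side dragnet idea, assuming tool results arrive as plain JSON and that every digit-form number in the spoken text counts as salient. The function names are hypothetical, and numbers spoken as words ("seven dollars") are out of scope for this sketch.

```typescript
// Matches digit-form numbers such as $0, 873, or $1,051.30.
const NUMBER_TOKEN = /\$?\d[\d,]*(?:\.\d+)?/g;

// Compare in integer cents so 1051.3 and "1,051.30" are treated as equal.
function toCents(n: number): number {
  return Math.round(n * 100);
}

// Walks any JSON-like value and collects every numeric leaf.
function collectNumbers(value: unknown, out: Set<number> = new Set<number>()): Set<number> {
  if (typeof value === "number" && Number.isFinite(value)) {
    out.add(toCents(value));
  } else if (Array.isArray(value)) {
    for (const item of value) collectNumbers(item, out);
  } else if (value !== null && typeof value === "object") {
    for (const item of Object.values(value)) collectNumbers(item, out);
  }
  return out;
}

// Returns the spoken numbers that do not appear in any tool result for this turn.
function findUngroundedNumbers(spoken: string, toolResults: unknown[]): number[] {
  const grounded = collectNumbers(toolResults);
  const flagged: number[] = [];
  for (const token of spoken.match(NUMBER_TOKEN) ?? []) {
    const value = Number(token.replace(/[$,]/g, ""));
    if (Number.isFinite(value) && !grounded.has(toCents(value))) flagged.push(value);
  }
  return flagged;
}
```

In this sketch a flag is only a soft signal to log and surface in dev; it does not interrupt the voice.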
4. Architecture - demo vs production
Both keep every number byte-for-byte ours. They differ only in where the conversational LLM runs.
- Demo (shipped): ElevenLabs Conversational AI + their hosted LLM (gemini-2.0-flash, orchestration only) + our deterministic CLIENT tools. The agent cannot state a number; it must call client tools that run in the browser and hit our same-origin `/api/*`. Zero tunnel, fully local, lowest latency, most robust, numbers 100% deterministic. Client-tool calls double as the Scene Director's timing events.
- Production (recommended, seam shipped, off the demo path): bring-your-own-LLM -> our Bedrock Claude via the OpenAI-shaped shim at `/api/florence/byo-llm` (uses `@anthropic-ai/bedrock-sdk`; `FLORENCE_BEDROCK_MODEL_ID` defaults to `us.anthropic.claude-sonnet-4-6` and flips to `us.anthropic.claude-opus-4-7` by one env var once Bedrock model-access for opus-4-7 is granted on the mgmt account). Requires ElevenLabs cloud to reach our endpoint (deploy or tunnel) - that is why it is the production path, decoupled from the local demo.
Why this resolves the comment-6 fork: ElevenLabs Agents brings its own LLM, but the agent is contractually + structurally barred from being the source of any fact - the only way it can say a price is to call our tool and read back exactly what the deterministic pipeline returned. Audio + transcript transit ElevenLabs in BOTH wirings, so the BAA question is identical either way and is the production gate, not a demo blocker.
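To illustrate how a client tool can be both the only source of numbers and a Scene Director timing event, here is a hedged sketch. The `Household` shape, the `SceneEvent` type, the injected `fetchPlans` function, and the `plans_unavailable` error code are all hypothetical stand-ins; the real tools live under `apps/web/src/lib/florence/*` and may look different.

```typescript
// All names below are illustrative assumptions, not the shipped API.
interface Household {
  zip: string;
  ages: number[];
  income: number;
}

interface SceneEvent {
  beat: "searching" | "reveal";
  at: "tool_start" | "tool_result" | "tool_error";
}

type PlanFetcher = (household: Household) => Promise<unknown>;
type Emit = (event: SceneEvent) => void;

// Wraps the deterministic pipeline call so the scene advances exactly when
// the tool starts and finishes, and the agent only ever reads back the result.
function makeFindPlansTool(fetchPlans: PlanFetcher, emit: Emit) {
  return async (household: Household): Promise<string> => {
    emit({ beat: "searching", at: "tool_start" });
    try {
      const result = await fetchPlans(household);
      emit({ beat: "reveal", at: "tool_result" });
      return JSON.stringify(result); // passed through untouched, never recomputed
    } catch {
      emit({ beat: "searching", at: "tool_error" });
      // No numbers on failure: the agent can only say it could not pull plans.
      return JSON.stringify({ error: "plans_unavailable" });
    }
  };
}
```

Injecting the fetcher keeps the sketch testable without a network and makes the invariant visible: the tool serializes what the pipeline returned and nothing else.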
Files (post-M0 monorepo): apps/web/src/lib/florence/* (agent prompt + provision, scene steps + director, deterministic client tools, Bedrock seam), apps/web/src/app/api/florence/* (agent-session, flag, byo-llm), apps/web/src/components/florence/* (the shared hook + presence + the two dedicated experiences + launcher), apps/web/src/app/florence/* (flag-gated page + CSS). Server flag FLORENCE_WOW_DEMO_ENABLED (plain server env, default off, mirrors session-flag.ts).
Two dedicated experiences, one brain: desktop (1496x756, side presence + on-canvas spotlight + dim, 3-col stagger) and a separate purpose-built mobile-native tree (390x844, full-bleed beat scenes, persistent presence, 56px thumb CTA, safe-area, progressive disclosure). Selected by a real device decision (UA hint + matchMedia swap), not a CSS reflow.
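The device decision could be reduced to a pure function like the sketch below. The 600px threshold and the `?device=` override come from the run instructions in section 5; the `DeviceSignals` shape, the function name, and the precedence order (forced param, then UA hint, then viewport width) are assumptions.

```typescript
type Tree = "desktop" | "mobile";

// Hypothetical input shape: callers would fill this from the query string,
// a UA hint, and a matchMedia / viewport reading.
interface DeviceSignals {
  forced?: string | null; // value of ?device=, if present
  uaMobile?: boolean;     // UA hint says this is a phone
  viewportWidth: number;  // CSS pixels
}

const MOBILE_MAX_WIDTH = 600;

function chooseTree(signals: DeviceSignals): Tree {
  // An explicit, valid override always wins.
  if (signals.forced === "mobile" || signals.forced === "desktop") return signals.forced;
  if (signals.uaMobile) return "mobile";
  return signals.viewportWidth <= MOBILE_MAX_WIDTH ? "mobile" : "desktop";
}
```

Because the result selects which component tree mounts, the mobile experience stays a separate build rather than a reflowed desktop layout.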
5. Running the demo
Local only. No deploy, no staging.
# from the worktree
cd ~/Developer/ask-florence-eng-356-florence-wow-flow-research/apps/web
PORT=3056 npx next dev
# open http://localhost:3056/florence (grant microphone)
# resize to <=600px wide to load the dedicated mobile tree, or
# append ?device=mobile or ?device=desktop to /florence to force a tree

The flag FLORENCE_WOW_DEMO_ENABLED=enabled and ELEVENLABS_API_KEY live in the canonical .env.local (gitignored, symlinked into the worktree). Flag off -> /florence 404s and the launcher does not mount.
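The flag behavior described above can be sketched as a pure check. The variable name and the `enabled` value come from this doc; the function names and the `Env` type are hypothetical, and the real guard (which mirrors `session-flag.ts`) may differ.

```typescript
// A plain map stands in for the server environment so the check stays testable.
type Env = Record<string, string | undefined>;

// Default off: anything other than the exact string "enabled" is treated as disabled.
function isFlorenceDemoEnabled(env: Env): boolean {
  return env["FLORENCE_WOW_DEMO_ENABLED"] === "enabled";
}

// What the page route would do with that decision: render, or fall through to a 404.
function florencePageStatus(env: Env): 200 | 404 {
  return isFlorenceDemoEnabled(env) ? 200 : 404;
}

// The launcher uses the same check, so it never mounts when the page would 404.
function shouldMountLauncher(env: Env): boolean {
  return isFlorenceDemoEnabled(env);
}
```

Comparing against one exact value means a typo or an empty variable fails closed.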
Real vs simulated (honest)
| Verified live in the sandbox | How |
|---|---|
| ElevenLabs Conversational AI integration end to end | POST /api/florence/agent-session provisions a real agent + mints a real WebRTC conversation token with the founder's key (agent AskFlorence WOW (ENG-356), voice Sarah). The voice brain is real, not stubbed. |
| Both dedicated experiences render on-register | Desktop intro pixel-centered at 1496x756 (DOM geometry + screenshot), mobile a separate tree at 390x844 with a 56px thumb CTA (screenshot). Zero console errors. tsc clean across apps/web. |
| The full voice-synced rendered arc + grounding | A network-free Scene Director proof runs the entire golden arc (greeting -> ... -> done) and confirms the grounding dragnet flags ONLY a fabricated $873, never the real tool-sourced $0 / $1,051.30. |
| Flag gate | /api/florence/flag -> {enabled:true}; the 404-when-off path is the notFound() guard. |
| NOT live-captured here | Why |
|---|---|
| The spoken golden-scenario numbers (84094 -> Salt Lake UT -> Medicaid -> ~$1,041 APTC -> Tyler Wood -> Ozempic narrowing) | This sandbox firewalls outbound TCP 27017, so MongoDB Atlas is unreachable. This breaks the ENTIRE app's data layer here (the home calculator, /plans, every /api/* that hits getDb()), not anything Florence-specific. The Florence layer delegates 100% to the unchanged shared pipeline and recomputes nothing, so on any Atlas-reachable machine (the founder's local, prod) the real byte-for-byte numbers flow. The Atlas CLI is authed but allowlisting cannot fix an egress-port block. |
| The live spoken loop (mic in, Florence voice out) | A headless preview cannot grant a real microphone, run WebRTC, or play audio. Founder: open http://localhost:3056/florence on your machine (Atlas-reachable), grant the mic, and speak the golden scenario; the spoken numbers will be the live deterministic values. |
6. Open decisions / asks for the founder
- Bedrock Opus 4.7 access. Verified working on the `askflorence-mgmt` profile: Sonnet 4.6 / Opus 4.6 / Opus 4.5 / Haiku 4.5. Opus 4.7 is an ACTIVE inference profile but `anthropic.claude-opus-4-7` is not yet granted (AccessDeniedException). Enable model-access in the Bedrock console and the production BYO-LLM flips to Opus 4.7 with one env var.
- Production voice-vendor + BAA decision (section 2). Demo-acceptable != production-signable. Needs: a signed BAA explicitly covering ElevenLabs Conversational AI (ASR + transcript store) OR a decision to go decomposed (Deepgram + Bedrock + Cartesia). Add both vendors to the #57 register, BAA status OPEN.
- Transcript-as-PHI (section 2). Decide: contractually pin ElevenLabs transcript retention/deletion inside a BAA, or disable vendor-side transcripts and persist only to our Mongo `florence_*` store (text-as-source-of-truth). Production blocker, not a demo one.
- Prod brain-wiring = bring-your-own-LLM -> Bedrock Claude + server tools (seam shipped at `/api/florence/byo-llm`). Confirm direction.
- Integrated launcher placement. `FlorenceLauncher` is built + self-gating but intentionally NOT mounted on home / /plans / plan-detail in this PR (avoid regression risk on conversion-critical pages during an unattended build). Say where to mount it.