Voice
Voice ships in Phase 1.5, immediately after text Florence is validated in production. The architecture is designed so voice is an I/O affordance on the same text runtime, not a separate product.
Three independent streams, joined in FlorenceRuntime
Text is the source of truth. ASR produces text; text is what the runtime consumes, what the audit log stores, what evals grade, what the grounding check verifies. Voice does not bypass any quality or compliance control that applies to text.
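A minimal sketch of that ordering may help. All names here (`handleVoiceTurn`, `VoiceTurnDeps`, `AuditEntry`) are hypothetical and not taken from the Florence codebase, and the stages are synchronous stubs standing in for the real streaming ASR, text runtime, and TTS calls:

```typescript
// Hypothetical types for illustration only.
type AuditEntry = { role: "member" | "florence"; text: string };

interface VoiceTurnDeps {
  transcribe: (audio: Uint8Array) => string; // ASR: audio in, text out
  respond: (text: string) => string;         // text runtime: tools, grounding, guardrails
  synthesize: (text: string) => Uint8Array;  // TTS: text in, audio out
  audit: AuditEntry[];                       // text-only record
}

// One voice turn. Audio only exists at the edges; everything in between is text.
function handleVoiceTurn(audio: Uint8Array, deps: VoiceTurnDeps): Uint8Array {
  const memberText = deps.transcribe(audio);
  deps.audit.push({ role: "member", text: memberText });

  // The same runtime call a typed message would take.
  const replyText = deps.respond(memberText);
  deps.audit.push({ role: "florence", text: replyText });

  // TTS runs last, on text that has already been logged.
  return deps.synthesize(replyText);
}

// Usage with stubbed stages.
const audit: AuditEntry[] = [];
const replyAudio = handleVoiceTurn(new Uint8Array([1, 2, 3]), {
  transcribe: () => "what plans can I afford?",
  respond: (text) => "You asked: " + text,
  synthesize: (text) => new Uint8Array(text.length),
  audit,
});
```

In this shape the audit log never sees audio, and TTS cannot run on text that skipped the runtime.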
Why not voice-to-voice models (OpenAI Realtime, Gemini Live, etc.)
Tempting — one stream in, one stream out, visibly magical. Wrong choice for Florence. Reasons:
- They abstract away the tool loop, grounding check, and audit trail — exactly the things we cannot abstract under our compliance posture.
- Text is the legal record; voice is a UI affordance. Designs that make audio the primary artifact are harder to audit.
- Unit economics are opaque and vendor-controlled.
The three-stream design costs us ~100–200 ms of latency over voice-to-voice; that's the price of auditability and we accept it.
Latency budget
End-of-speech to first audio back, target ≤ 400 ms:
| Stage | Budget |
|---|---|
| ASR finalization | 100–250 ms |
| Intent classification + model routing | 50–80 ms |
| First LLM token (with prompt cache hit + pre-warmed tool calls where possible) | 100–250 ms |
| TTS first audio | 75–150 ms |
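
As a quick check on the table, the stage ranges can be summed. This sketch assumes the four stages run strictly one after another, which the table does not state; `stages`, `TARGET_MS`, and the other names are illustrative:

```typescript
type StageBudget = { stage: string; minMs: number; maxMs: number };

// Ranges copied from the latency budget table.
const stages: StageBudget[] = [
  { stage: "ASR finalization", minMs: 100, maxMs: 250 },
  { stage: "Intent classification + model routing", minMs: 50, maxMs: 80 },
  { stage: "First LLM token", minMs: 100, maxMs: 250 },
  { stage: "TTS first audio", minMs: 75, maxMs: 150 },
];

const TARGET_MS = 400;

// Serial sums: every stage at its best, and every stage at its worst.
const bestCaseMs = stages.reduce((sum, s) => sum + s.minMs, 0);  // 325
const worstCaseMs = stages.reduce((sum, s) => sum + s.maxMs, 0); // 730

// Headroom left under the target when every stage hits its lower bound.
const headroomMs = TARGET_MS - bestCaseMs; // 75
```

Run serially, the stages fit the 400 ms target only when each one lands near the low end of its range, so overlapping stages matters as much as shrinking any single one.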
Dominant term is ASR finalization. Two moves compress it:
- Partial-transcript dispatch. Start intent classification (and optionally speculative tool calls) on partial ASR output; commit when ASR finalizes.
- Warm connections. Persistent streaming connections to ASR + TTS; avoid per-turn TLS handshakes and DNS lookups.
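Partial-transcript dispatch could look roughly like the sketch below. `SpeculativeDispatcher` and the `classifyIntent` callback are hypothetical stand-ins, and the commit rule used here (reuse the speculation only if the final transcript equals the last partial) is one simple policy, not necessarily the one Florence would use:

```typescript
type Intent = string;

class SpeculativeDispatcher {
  private readonly classifyIntent: (text: string) => Intent;
  private speculative: { text: string; intent: Intent } | null = null;
  classifierCalls = 0;

  constructor(classifyIntent: (text: string) => Intent) {
    this.classifyIntent = classifyIntent;
  }

  // Called for each partial ASR hypothesis: start classification early.
  onPartial(text: string): void {
    this.speculative = { text, intent: this.classify(text) };
  }

  // Called once when ASR finalizes: commit the speculation or redo the work.
  onFinal(text: string): { intent: Intent; reusedSpeculation: boolean } {
    if (this.speculative !== null && this.speculative.text === text) {
      return { intent: this.speculative.intent, reusedSpeculation: true };
    }
    return { intent: this.classify(text), reusedSpeculation: false };
  }

  private classify(text: string): Intent {
    this.classifierCalls += 1;
    return this.classifyIntent(text);
  }
}

// Usage with a toy classifier.
const dispatcher = new SpeculativeDispatcher((t) =>
  t.indexOf("plan") !== -1 ? "plan_search" : "unknown",
);
dispatcher.onPartial("show me");
dispatcher.onPartial("show me plans");
const committed = dispatcher.onFinal("show me plans");
```

When the last partial matches the final transcript, classification has already finished by the time ASR finalizes, so that stage drops off the critical path. The cost is extra classifier calls on partials that get discarded.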
Vendor strategy through Phase 3
The principle: preserve premium quality through Phase 3 rather than degrading to basic FedRAMP options. Detail of the strategy, including the dedicated-VPC + FedRAMP-reference-customer tracks, is in #61 and is summarized below.
Phase 1.5 launch (commercial AWS + BAA)
| Role | Vendor | Languages | Cost | Latency | BAA |
|---|---|---|---|---|---|
| ASR | Deepgram Nova-3 | 36 incl. EN + ES with code-switching | ~$0.004/min | ~250 ms | Yes |
| TTS | Cartesia Sonic-2 | 15 incl. ES | ~$0.015/min | ~75 ms first audio | Confirming (Issue #57) |
Both pluggable via adapter sinks (see tool surface); swap is a config change.
Phase 3 — keep premium quality without regression
Three tracks run in parallel; whichever lands first becomes primary.
Track A — Dedicated VPC / single-tenant deployment
Negotiate with Deepgram / Cartesia / ElevenLabs for deployment of their models inside our AWS account, on our GPUs. Pattern exists (MongoDB dedicated-VPC, Anthropic via Bedrock). Vendor becomes a software license, not a data subprocessor — our FedRAMP posture covers them.
Most likely fast yes: Cartesia (Series A, hungry for enterprise logos). Deepgram already offers enterprise on-prem. ElevenLabs is the hardest nut; worth the ask at our projected scale.
Track B — Named reference customer for FedRAMP
Deepgram is actively pursuing FedRAMP Moderate. Named regulated-healthcare reference customers accelerate vendor 3PAO packages by 6–12 months. Same conversation opening with Cartesia and ElevenLabs. We become part of their regulatory roadmap rather than waiting passively.
Track C — Self-hosted fine-tuned Florence voice
Fallback AND potentially the authenticated-member experience by design. Open-weight models:
- ASR: Whisper large-v3-turbo, hosted on AWS SageMaker in our FedRAMP account. Quality matches Deepgram on EN, very close on ES. Latency 100–250 ms warm.
- TTS: F5-TTS / Orpheus / StyleTTS-2, fine-tuned on a reference-audio corpus we collect from launch day forward (consent-captured). Produces a proprietary Florence voice no competitor can fingerprint.
Inherits our FedRAMP posture. No subprocessor. Unit economics flip favorable past ~100 voice-hours/day (GPU amortization).
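As a rough sanity check on that break-even, the launch list prices can be turned into a daily vendor spend. This assumes every voice minute is billed to both ASR and TTS, which overstates real usage, and it ignores volume discounts and engineering cost; the constant and function names are illustrative:

```typescript
// Phase 1.5 list prices, in thousandths of a dollar per minute.
// Integers avoid floating-point rounding when the two prices are added.
const ASR_MILLIDOLLARS_PER_MIN = 4;  // ~$0.004/min (Deepgram Nova-3)
const TTS_MILLIDOLLARS_PER_MIN = 15; // ~$0.015/min (Cartesia Sonic-2)
const BREAK_EVEN_VOICE_HOURS_PER_DAY = 100;

// Upper-bound vendor spend per day at a given voice volume.
function vendorCostPerDayUsd(voiceHoursPerDay: number): number {
  const minutes = voiceHoursPerDay * 60;
  const perMinute = ASR_MILLIDOLLARS_PER_MIN + TTS_MILLIDOLLARS_PER_MIN;
  return (minutes * perMinute) / 1000;
}

// Vendor spend at the stated break-even volume: 6000 min * $0.019 = $114/day.
const impliedDailyGpuBudgetUsd = vendorCostPerDayUsd(BREAK_EVEN_VOICE_HOURS_PER_DAY);
```

Under these assumptions, ~100 voice-hours/day corresponds to roughly $114/day of vendor spend, so self-hosting pays off once amortized GPU cost per day falls below that figure.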
Product framing: after enrollment, Florence becomes "your Florence" — slightly warmer, personalized, hers alone. Transition feels like an upgrade, not a compromise.
Compliance read that may unlock Track A+B wholesale
FTI is IRS-sourced data via the CMS Hub — not user-self-attested income. Pre-enrollment subsidy estimates computed from "I make $40k" are PII, not FTI. Post-enrollment confirmed APTC from the Hub is FTI.
If an EDE-literate compliance counsel confirms this reading, most of the voice surface is not in EDE scope, and Deepgram + Cartesia with HIPAA BAA covers it. Only the narrow post-enrollment authenticated-member FTI utterances need special handling. This would be the single biggest unlock on the whole voice track — #61 carries the action item.
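If that reading holds, the routing rule it implies is small. The sketch below is hypothetical (the type names, `classifyIncome`, and `voicePathFor` are not from the codebase) and encodes only the distinction stated above, pending counsel's confirmation:

```typescript
type IncomeSource = "self-attested" | "cms-hub";
type IncomeDataClass = "PII" | "FTI";
type VoicePath = "commercial-baa" | "fti-restricted";

// Under the reading above, only Hub-sourced income data is FTI;
// anything the member states themselves is PII.
function classifyIncome(source: IncomeSource): IncomeDataClass {
  return source === "cms-hub" ? "FTI" : "PII";
}

// PII can go to the commercial vendors under the HIPAA BAA;
// FTI needs the special-handling path.
function voicePathFor(source: IncomeSource): VoicePath {
  return classifyIncome(source) === "FTI" ? "fti-restricted" : "commercial-baa";
}

const preEnrollmentPath = voicePathFor("self-attested"); // "I make $40k"
const postEnrollmentPath = voicePathFor("cms-hub");      // confirmed APTC from the Hub
```

The classification keys off where the data came from, not what it says: the same dollar figure is PII when self-attested and FTI when returned by the Hub.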
Multilingual — EN + ES from launch, more later
Multilingual is nearly free by architecture, because:
- Claude speaks Spanish fluently — no translation layer needed for the LLM turn.
- Tool results are language-agnostic JSON.
- Deepgram Nova-3 handles EN/ES natively with in-utterance code-switching (common for US Hispanic members).
- Cartesia has ES voices; Amazon Polly / Transcribe support ES at the Phase 3 fallback tier.
- The system prompt includes a one-line instruction: respond in the user's language.
Adding Mandarin / Vietnamese / Tagalog later = UI locale toggle + new eval set + TTS-voice check. No architectural change.
Voice adapter sinks
Same pattern as every vendor integration:
```ts
// src/lib/adapters/voice-asr.ts
export const asrAdapter = defineAdapter({
  name: "voice-asr",
  provider: process.env.ASR_PROVIDER ?? "deepgram", // or "aws-transcribe" | "self-whisper"
  fedramp: /* resolved from provider */,
  baa: /* resolved from provider */,
  accepts: ["Public", "PHI", "PII", "FTI"] as const, // after compliance read
});
```
Swap is a config change. The adapter enforces declared class acceptance at compile time and at runtime.
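One way the declared class acceptance could be checked at both levels is sketched here. This is not the actual `defineAdapter` implementation; `ClassifiedAdapter`, `sendToAdapter`, and the example adapter without `"FTI"` are assumptions made for illustration:

```typescript
type DataClass = "Public" | "PHI" | "PII" | "FTI";

interface ClassifiedAdapter<A extends DataClass> {
  name: string;
  accepts: readonly A[];
}

// Compile time: C must be one of the classes the adapter declared (C extends A).
// Runtime: the guard catches values that bypassed the type system, e.g. parsed JSON.
function sendToAdapter<A extends DataClass, C extends A>(
  adapter: ClassifiedAdapter<A>,
  dataClass: C,
  payload: string,
): string {
  if (adapter.accepts.indexOf(dataClass) === -1) {
    throw new Error(adapter.name + " does not accept " + dataClass + " data");
  }
  return adapter.name + " accepted " + dataClass + " payload";
}

// Hypothetical adapter that has not been cleared for FTI.
const preReviewAsr = { name: "voice-asr", accepts: ["Public", "PHI", "PII"] as const };

const accepted = sendToAdapter(preReviewAsr, "PII", "I make $40k");

// sendToAdapter(preReviewAsr, "FTI", "..."); // would not type-check: "FTI" is not declared

let runtimeError = "";
try {
  const parsed = JSON.parse('"FTI"'); // typed as `any`, so the compiler cannot object
  sendToAdapter(preReviewAsr, parsed, "confirmed APTC");
} catch (e) {
  runtimeError = (e as Error).message;
}
```

The second type parameter keeps the compiler from widening the accepted set to fit the argument. The runtime guard then covers inputs whose type is only known at runtime.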
On-device voice for the native app
When a React Native app ships:
- iOS: `SFSpeechRecognizer` for ASR, `AVSpeechSynthesizer` (premium voices) for TTS. Audio never leaves the device.
- Android: equivalent on-device APIs.
Zero voice subprocessor. Strongest privacy posture. Marketable as "your voice stays on your phone." Web support via Web Speech API is inconsistent; skip on web, use in native.
Quality on modern iPhones is competitive enough for the majority of conversations; fall back to server-side voice for the long tail where on-device is too weak.
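The selection logic described here might reduce to something like the following. `chooseVoiceBackend` and the `onDeviceQualityOk` flag are hypothetical; how on-device quality would be judged for a given conversation is not specified in this document:

```typescript
type Platform = "ios" | "android" | "web";
type VoiceBackend = "on-device" | "server";

function chooseVoiceBackend(platform: Platform, onDeviceQualityOk: boolean): VoiceBackend {
  // Web Speech API support is inconsistent, so web always uses server-side voice.
  if (platform === "web") {
    return "server";
  }
  // Native apps prefer on-device and fall back to the server for the long tail.
  return onDeviceQualityOk ? "on-device" : "server";
}

const modernIphone = chooseVoiceBackend("ios", true);    // "on-device"
const weakAndroid = chooseVoiceBackend("android", false); // "server"
const browser = chooseVoiceBackend("web", true);          // "server"
```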
What voice does NOT change
- The tool surface (see tool surface): same tools, same schemas.
- The audit log (see evals & observability): text transcript is the record; audio is stored encrypted under a retention policy but is not the primary artifact.
- Evals: grade the text transcript produced by ASR + the text response Florence generates. Audio-level quality is a separate telemetry dimension (see evals & observability).
- Guardrails: all five classifier layers run on the text path, before TTS.
Tracking
- Phase 1.5 voice launch: tied to Florence text launch + 1 iteration window
- Phase 3 voice track outcome: #61 + voice-specific sub-issue to be spawned when text launch nears
- Vendor partnership work: #61 voice-vendor-partnership-track comment
- BAA + FedRAMP status per vendor: #57