Runtime

How Florence is built. The FlorenceRuntime is a thin first-party server-side layer on top of the Claude Agent SDK; the client is a ~50-line streaming hook; the intelligence lives in the tool surface, prompts, and eval harness.

Stack choice, and why

Layer	Choice	Rationale
Agent runtime (server)	Claude Agent SDK (wrapped by our provider abstraction)	Anthropic's production SDK (powers Claude Code). Supports Anthropic-direct and Bedrock transparently — Phase 1 → Phase 3 provider swap is a config change. Best-in-class context compression, tool parallelism, streaming, interrupts. Wrapped by our own `FlorenceLLMProvider` interface so Claude Agent SDK is one of several possible backends, not the external contract (see provider risk). Does not pull Vercel-specific packages into our bundle (competitor-camouflage bonus).
Streaming transport	Server-Sent Events (Next.js App Router streaming response)	Native to our stack. Works behind CloudFront. No websocket infra required.
Client hook	Custom ~50-line React hook	`useFlorence()`. Consumes SSE, renders assistant tokens as they arrive, exposes tool-call events for UI-driving tools (`ui_*`). We own the rendering so no framework leaks the model family.
LLM provider	Phase 1: Anthropic direct API under BAA. Phase 3: Claude via AWS Bedrock.	Adapter-sink pattern (see tool surface) makes swap one file.
State store	MongoDB Atlas (extends existing architecture)	PHI with CSFLE per data class (see data classification). Already in compliance boundary.

Rejected options: Vercel AI SDK (TypeScript library is provider-agnostic but we want zero @ai-sdk/* in the public bundle, and Claude Agent SDK is closer to Bedrock), LangChain / LangGraph (heavyweight, Python-idiom, no net value over Claude Agent SDK), Mastra (young, adds abstraction we don't need), OpenClaw (category mismatch — personal assistant framework, not a multi-tenant regulated-product runtime), self-hosted open-weight models as Phase 1 default (quality gap, op burden; revisit at scale per roadmap).

Component diagram

The turn lifecycle

A single user turn walks this sequence. Budgets shown are all-in targets; steps in parallel where noted.

Step	Purpose	Model / cost	Budget
1. Input classifier	Health-insurance-legit vs. jailbreak vs. out-of-scope. Out-of-scope → scripted refusal, short-circuit.	Haiku 4.5, ~$0.0001	~80 ms
2. Model router	Heuristic-first (input length, tool-use history, user signal); escalate to Haiku router call only when uncertain. Selects Haiku / Sonnet / Opus.	Haiku 4.5 (when called), ~$0.0001	~80 ms
3. Prompt assembly	Fixed-order: `systemPrompt / toolDefinitions / userProfile / conversationSummary / recentTurns / currentTurn`. First four cached via Anthropic prompt cache.	—	~5 ms
4. Main turn	The assistant response + zero-or-more tool calls. Streams tokens. Tool calls executed in parallel via Claude Agent SDK.	Haiku / Sonnet / Opus	depends on tool latency
5. Profile extractor (parallel to step 4)	Cheap pass: "does this user turn add facts to the profile?" Writes to `user_profile`.	Haiku 4.5, ~$0.0003	~300 ms, non-blocking
6. Grounding check	Scans assistant response for factual claims; asserts each ties to a tool-result ID in this turn. Shadow-mode at launch → blocking after eval validation.	Haiku 4.5, ~$0.0005	~250 ms
7. Output classifier	Blocks: code, off-domain content, system-prompt echo, model-identity leakage.	Haiku 4.5, ~$0.0003	~200 ms
8. Style normalizer	Deterministic pass: strip model-family tells, enforce Florence voice.	—	~5 ms
9. Audit emit	Append-only record to `florence_audit_log` with tool-call IDs, token counts, classifier outcomes, escalation flags.	—	~20 ms, async

Per-turn cost target: ≤ $0.005 all-in. Dominated by main-turn output tokens; classifiers + grounding add ~$0.0012 baseline and are the conscious safety-over-savings tradeoff.

Prompt architecture (the caching lever)

The single biggest unit-economics decision. A fixed prompt structure allows Anthropic's prompt cache to hit ≥ 85 % of input tokens across turns within a session:

[1] systemPrompt          — stable, version-pinned
[2] toolDefinitions       — stable per session (snapshot at start)
[3] userProfile           — stable unless extractor updates mid-session
[4] conversationSummary   — grows; rewritten at 40% context budget
[5] recentTurns           — last N turns verbatim
[6] currentTurn           — the user's new message

Layers 1–3 are marked for cache on every call. Layer 4 is re-written only at summarization events. Layers 5–6 are fresh per turn.

Consequences of getting this wrong (Sonnet-for-everything, no cache, fresh system prompt per turn): 5–10× LLM cost. At 100 k members, this alone is the difference between ~$1 k/month and ~$8 k/month in LLM spend.

Model routing

Heuristic-first routing; LLM-based routing as fallback.

Signal	Decision
Input < 80 characters AND no member-auth context	Haiku 4.5
Tool-call-required intent (plan search, drug lookup, provider lookup)	Haiku 4.5
Tool-use response phase (assistant synthesizing tool results)	Haiku 4.5
Multi-plan comparison or SEP triage or ambiguous life-event	Sonnet 4.6
Prior-turn escalation flag OR low-confidence assistant reply	Sonnet 4.6
Sonnet emitted a low-confidence result with explicit escalation marker	Opus 4.7 (budgeted; alert if > 1 % of turns)

Measured monthly. Routing drift > 5 % from target mix triggers a tuning pass on the heuristics. Opus usage > 1.5 % of turns triggers investigation.

Context management

Conversations grow. Two levers:

Summarization at 40 % context budget, not 90 %. A Haiku call rewrites everything prior to the last N turns into a compact summary, which replaces the earlier turns in prompt layer [4]. Preserves Anthropic-cache sweet spot and keeps the fresh-token count predictable.
Hard cap at 80 k tokens per conversation. Past that, the conversation is closed and a new conversation opened with the compressed profile as its seed. User-invisible; audit-trailed.

State persistence

Store	Collection	Class	Retention
MongoDB	`florence_conversations`	PHI (mixed)	6 years min (HIPAA), 10 years target (EDE)
MongoDB	`user_profile`	PHI + PII	Lifetime of member + 6 years
MongoDB	`florence_audit_log`	Audit (append-only)	10 years
MongoDB	`florence_escalations`	PHI	10 years

All collections CSFLE-encrypted with the CMK for their class (data classification).

Guardrails and camouflage

Detail is split across multiple principles; the runtime is where they compose. Five layers vs. prompt injection and scope abuse:

Input classifier (step 1 above) — stops scope abuse pre-model.
System-prompt fortification — explicit refusals baked in; version-controlled.
Tool-layer authorization — declared accepted auth contexts per tool; enforced in the wrapper.
Output classifier (step 7 above) — blocks code, off-domain, model-identity leaks.
Canary tokens + jailbreak eval suite — prompt contains unique tokens; if they ever appear in output, we know what leaked from where.

Camouflage posture:

Tool-use blocks never stream to the client. ui_* calls dispatch as a separate event type the client renders; api_* calls are entirely server-side.
Style normalizer (step 8) deterministically strips Claude-isms and enforces the Florence voice.
All model calls server-mediated; no browser → Anthropic/OpenAI/Bedrock.
Error messages sanitized — no vendor provenance reaches users.
Randomized small timing jitter on response start to defeat trivial timing fingerprinting.
Public marketing says "Florence," not "powered by Claude."

Observability (cost + latency)

Every turn emits metrics into a Mongo-backed dashboard (in-house; no Langfuse / Helicone to avoid vendor spend + stay in boundary):

Per-conversation cost attribution (tokens × model × provider rate)
Routing mix (Haiku / Sonnet / Opus distribution, weekly)
Cache-hit rate (target ≥ 85 %)
Latency percentiles per step
Escalation rate (target ≤ 5 %)
Classifier blocks per category

Alert thresholds baked in from day one. Daily cost drift > 20 % triggers review. See evals & observability.