AskFlorence

Runtime

How Florence is built. The FlorenceRuntime is a thin first-party server-side layer on top of the Claude Agent SDK; the client is a ~50-line streaming hook; the intelligence lives in the tool surface, prompts, and eval harness.

Stack choice, and why

| Layer | Choice | Rationale |
| --- | --- | --- |
| Agent runtime (server) | Claude Agent SDK (wrapped by our provider abstraction) | Anthropic's production SDK (powers Claude Code). Supports Anthropic-direct and Bedrock transparently, so the Phase 1 → Phase 3 provider swap is a config change. Best-in-class context compression, tool parallelism, streaming, interrupts. Wrapped by our own FlorenceLLMProvider interface so Claude Agent SDK is one of several possible backends, not the external contract (see provider risk). Does not pull Vercel-specific packages into our bundle (competitor-camouflage bonus). |
| Streaming transport | Server-Sent Events (Next.js App Router streaming response) | Native to our stack. Works behind CloudFront. No websocket infra required. |
| Client hook | Custom ~50-line React hook | useFlorence(). Consumes SSE, renders assistant tokens as they arrive, exposes tool-call events for UI-driving tools (ui_*). We own the rendering so no framework leaks the model family. |
| LLM provider | Phase 1: Anthropic direct API under BAA. Phase 3: Claude via AWS Bedrock. | Adapter-sink pattern (see tool surface) makes the swap a one-file change. |
| State store | MongoDB Atlas (extends existing architecture) | PHI with CSFLE per data class (see data classification). Already in compliance boundary. |

Rejected options: Vercel AI SDK (TypeScript library is provider-agnostic but we want zero @ai-sdk/* in the public bundle, and Claude Agent SDK is closer to Bedrock), LangChain / LangGraph (heavyweight, Python-idiom, no net value over Claude Agent SDK), Mastra (young, adds abstraction we don't need), OpenClaw (category mismatch — personal assistant framework, not a multi-tenant regulated-product runtime), self-hosted open-weight models as Phase 1 default (quality gap, op burden; revisit at scale per roadmap).
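To make the provider seam concrete, here is a minimal sketch of what a FlorenceLLMProvider boundary could look like. Only the interface name comes from this page; its members, the ProviderId values, the EchoProvider stub, and selectProvider are assumptions for illustration, and the stub stands in for a real adapter around the Claude Agent SDK.

```typescript
// Sketch only: FlorenceLLMProvider is named in the text, but its members are assumed here.
type ProviderId = "anthropic-direct" | "bedrock";

interface TurnRequest {
  model: string;
  prompt: string;
}

interface FlorenceLLMProvider {
  readonly id: ProviderId;
  // Calls onToken for each streamed chunk and resolves when the turn ends.
  streamTurn(req: TurnRequest, onToken: (token: string) => void): Promise<void>;
}

// Stand-in backend that echoes the prompt word by word; a real adapter would wrap an SDK.
class EchoProvider implements FlorenceLLMProvider {
  readonly id: ProviderId;
  constructor(id: ProviderId) {
    this.id = id;
  }
  async streamTurn(req: TurnRequest, onToken: (token: string) => void): Promise<void> {
    for (const word of req.prompt.split(" ")) {
      onToken(word);
    }
  }
}

// The single place that maps configuration to a backend (hypothetical config values).
function selectProvider(configured: string | undefined): FlorenceLLMProvider {
  const id: ProviderId = configured === "bedrock" ? "bedrock" : "anthropic-direct";
  return new EchoProvider(id);
}
```

Because callers depend only on the interface, moving from the Phase 1 backend to the Phase 3 backend touches selectProvider and nothing else in this sketch.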

Component diagram

(Diagram not included in this text export.)

The turn lifecycle

A single user turn walks this sequence. Budgets shown are all-in targets; steps run in parallel where noted.

| Step | Purpose | Model / cost | Budget |
| --- | --- | --- | --- |
| 1. Input classifier | Health-insurance-legit vs. jailbreak vs. out-of-scope. Out-of-scope → scripted refusal, short-circuit. | Haiku 4.5, ~$0.0001 | ~80 ms |
| 2. Model router | Heuristic-first (input length, tool-use history, user signal); escalate to Haiku router call only when uncertain. Selects Haiku / Sonnet / Opus. | Haiku 4.5 (when called), ~$0.0001 | ~80 ms |
| 3. Prompt assembly | Fixed-order: systemPrompt / toolDefinitions / userProfile / conversationSummary / recentTurns / currentTurn. First four cached via Anthropic prompt cache. | none | ~5 ms |
| 4. Main turn | The assistant response + zero-or-more tool calls. Streams tokens. Tool calls executed in parallel via Claude Agent SDK. | Haiku / Sonnet / Opus | depends on tool latency |
| 5. Profile extractor (parallel to step 4) | Cheap pass: "does this user turn add facts to the profile?" Writes to user_profile. | Haiku 4.5, ~$0.0003 | ~300 ms, non-blocking |
| 6. Grounding check | Scans assistant response for factual claims; asserts each ties to a tool-result ID in this turn. Shadow-mode at launch → blocking after eval validation. | Haiku 4.5, ~$0.0005 | ~250 ms |
| 7. Output classifier | Blocks: code, off-domain content, system-prompt echo, model-identity leakage. | Haiku 4.5, ~$0.0003 | ~200 ms |
| 8. Style normalizer | Deterministic pass: strip model-family tells, enforce Florence voice. | none | ~5 ms |
| 9. Audit emit | Append-only record to florence_audit_log with tool-call IDs, token counts, classifier outcomes, escalation flags. | none | ~20 ms, async |
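The control flow of the lifecycle can be sketched as a single async function. This is an illustration under assumptions, not the production runtime: the TurnSteps interface, the refusal text, and the audit outcome strings are hypothetical, the router, prompt assembly, and grounding check are omitted for brevity, and treating a jailbreak verdict the same as out-of-scope is a choice made for the sketch.

```typescript
// All step implementations are injected; these names are illustrative, not the real module API.
interface TurnSteps {
  classifyInput(text: string): Promise<"in_scope" | "out_of_scope" | "jailbreak">;
  mainTurn(text: string): Promise<string>;
  extractProfile(text: string): Promise<void>;
  checkOutput(reply: string): Promise<boolean>; // true means safe to send
  normalizeStyle(reply: string): string;
  emitAudit(record: { outcome: string }): void;
}

// Hypothetical wording for the scripted refusal.
const SCRIPTED_REFUSAL = "I can only help with health insurance questions.";

async function runTurn(text: string, steps: TurnSteps): Promise<string> {
  // Step 1: anything not in scope short-circuits before any main-model spend.
  const scope = await steps.classifyInput(text);
  if (scope !== "in_scope") {
    steps.emitAudit({ outcome: "refused:" + scope });
    return SCRIPTED_REFUSAL;
  }

  // Step 5 starts alongside step 4 and is never awaited, so it cannot delay or fail the turn.
  void steps.extractProfile(text).catch(() => undefined);

  // Step 4: the main assistant response.
  const reply = await steps.mainTurn(text);

  // Step 7 gates the reply; step 8 is a deterministic rewrite of whatever passes.
  const safe = await steps.checkOutput(reply);
  const finalReply = safe ? steps.normalizeStyle(reply) : SCRIPTED_REFUSAL;

  // Step 9: record the outcome after the reply is decided.
  steps.emitAudit({ outcome: safe ? "answered" : "blocked_output" });
  return finalReply;
}
```

Injecting the steps keeps the ordering testable with stubs and makes the short-circuit visible: a refused turn never reaches the main model.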

Per-turn cost target: ≤ $0.005 all-in. Main-turn output tokens dominate. The four fixed Haiku passes (input classifier, profile extractor, grounding check, output classifier) add a ~$0.0012 baseline, which is the conscious safety-over-savings tradeoff.
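The baseline figure can be checked against the lifecycle table. The sketch below assumes that the ~$0.0012 baseline is the sum of the four passes that run on every turn (input classifier, profile extractor, grounding check, output classifier), with the router call counted separately because it is only paid when the heuristics are uncertain. The constant names are illustrative.

```typescript
// Per-step costs in USD, copied from the "Model / cost" column of the lifecycle table.
const inputClassifierUsd = 0.0001;
const routerUsd = 0.0001; // paid only when the heuristics are uncertain
const profileExtractorUsd = 0.0003;
const groundingCheckUsd = 0.0005;
const outputClassifierUsd = 0.0003;
const perTurnTargetUsd = 0.005;

// Passes that run on every turn regardless of routing.
const baselineUsd =
  inputClassifierUsd + profileExtractorUsd + groundingCheckUsd + outputClassifierUsd;

// What is left for the main turn when the router call is also paid.
const mainTurnBudgetUsd = perTurnTargetUsd - baselineUsd - routerUsd;

// Share of the per-turn target consumed by the fixed passes.
const baselineShare = baselineUsd / perTurnTargetUsd;
```

Under this reading the fixed passes take about 24 % of the target, leaving roughly $0.0037 for the main turn in the worst case where the router is also called.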

Prompt architecture (the caching lever)

Prompt structure is the single biggest unit-economics decision. A fixed layer order lets Anthropic's prompt cache serve ≥ 85 % of input tokens across turns within a session:

[1] systemPrompt          — stable, version-pinned
[2] toolDefinitions       — stable per session (snapshot at start)
[3] userProfile           — stable unless extractor updates mid-session
[4] conversationSummary   — grows; rewritten at 40% context budget
[5] recentTurns           — last N turns verbatim
[6] currentTurn           — the user's new message

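Layers 1–3 are marked for cache on every call. Layer 4 is also cached (step 3 of the turn lifecycle caches the first four layers), but its content changes, and its cache entry is invalidated, only at summarization events. Layers 5–6 are fresh per turn.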
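A minimal sketch of the fixed-order assembly follows. Prompt caching matches on an unchanged prefix, so the useful question between two turns is how many leading cacheable layers are byte-identical. The PromptLayer shape, assemblePrompt, and cachedPrefixLength are hypothetical names, and marking the first four layers cacheable follows step 3 of the turn lifecycle; how each flag maps onto the provider's cache markers is left out.

```typescript
type LayerName =
  | "systemPrompt"
  | "toolDefinitions"
  | "userProfile"
  | "conversationSummary"
  | "recentTurns"
  | "currentTurn";

interface PromptLayer {
  name: LayerName;
  text: string;
  cacheable: boolean;
}

// The order is fixed: stable layers first, volatile layers last.
const LAYER_ORDER: LayerName[] = [
  "systemPrompt",
  "toolDefinitions",
  "userProfile",
  "conversationSummary",
  "recentTurns",
  "currentTurn",
];

// Assumption: the first four layers are the cached ones, per the turn lifecycle.
const CACHEABLE_LAYERS: LayerName[] = LAYER_ORDER.slice(0, 4);

function assemblePrompt(parts: Record<LayerName, string>): PromptLayer[] {
  return LAYER_ORDER.map((name) => ({
    name,
    text: parts[name],
    cacheable: CACHEABLE_LAYERS.indexOf(name) !== -1,
  }));
}

// Counts the leading cacheable layers whose text is unchanged since the previous turn.
function cachedPrefixLength(prev: PromptLayer[], next: PromptLayer[]): number {
  let n = 0;
  while (
    n < prev.length &&
    n < next.length &&
    next[n].cacheable &&
    prev[n].text === next[n].text
  ) {
    n++;
  }
  return n;
}
```

In this model a mid-session profile update drops the reusable prefix from four layers to two, which is why the extractor updating userProfile is called out as the exception to stability.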

Consequences of getting this wrong (Sonnet-for-everything, no cache, fresh system prompt per turn): 5–10× LLM cost. At 100 k members, this alone is the difference between ~$1 k/month and ~$8 k/month in LLM spend.

Model routing

Heuristic-first routing; LLM-based routing as fallback.

| Signal | Decision |
| --- | --- |
| Input < 80 characters AND no member-auth context | Haiku 4.5 |
| Tool-call-required intent (plan search, drug lookup, provider lookup) | Haiku 4.5 |
| Tool-use response phase (assistant synthesizing tool results) | Haiku 4.5 |
| Multi-plan comparison or SEP triage or ambiguous life-event | Sonnet 4.6 |
| Prior-turn escalation flag OR low-confidence assistant reply | Sonnet 4.6 |
| Sonnet emitted a low-confidence result with explicit escalation marker | Opus 4.7 (budgeted; alert if > 1 % of turns) |

The routing mix is measured monthly. Drift of more than 5 % from the target mix triggers a tuning pass on the heuristics, and Opus usage above 1.5 % of turns triggers an investigation.
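The routing table reads naturally as an ordered rule list. The sketch below is one possible encoding: the RoutingSignals fields and intent labels are hypothetical, and the precedence (escalation rules before intent rules before the short-input rule) is an assumption, since the table does not state which row wins when several match. A null result is the cue to fall back to the LLM router.

```typescript
type Model = "haiku" | "sonnet" | "opus";

type Intent =
  | "tool_lookup" // plan search, drug lookup, provider lookup
  | "tool_synthesis" // assistant synthesizing tool results
  | "comparison"
  | "sep_triage"
  | "ambiguous_life_event"
  | "other";

interface RoutingSignals {
  inputLength: number;
  hasMemberAuth: boolean;
  intent: Intent;
  priorEscalation: boolean;
  lowConfidenceReply: boolean;
  sonnetRequestedEscalation: boolean;
}

// Returns null when no heuristic fires, so the caller can use the LLM-based router.
function routeByHeuristics(s: RoutingSignals): Model | null {
  if (s.sonnetRequestedEscalation) return "opus";
  if (s.priorEscalation || s.lowConfidenceReply) return "sonnet";
  if (s.intent === "comparison" || s.intent === "sep_triage" || s.intent === "ambiguous_life_event") {
    return "sonnet";
  }
  if (s.intent === "tool_lookup" || s.intent === "tool_synthesis") return "haiku";
  if (s.inputLength < 80 && !s.hasMemberAuth) return "haiku";
  return null;
}

// Opus share above 1.5 % of turns is the investigation threshold given in the text.
function opusNeedsInvestigation(opusTurns: number, totalTurns: number): boolean {
  return totalTurns > 0 && opusTurns / totalTurns > 0.015;
}
```

Putting escalation first means a flagged conversation is never routed back down to Haiku just because the next message is short.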

Context management

Conversations grow. Two levers:

  1. Summarization at 40 % of the context budget, not 90 %. A Haiku call rewrites everything prior to the last N turns into a compact summary, which replaces the earlier turns in prompt layer [4]. This preserves the Anthropic-cache sweet spot and keeps the fresh-token count predictable.
  2. Hard cap at 80 k tokens per conversation. Past that, the conversation is closed and a new conversation is opened with the compressed profile as its seed. The rollover is invisible to the user and audit-trailed.
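The two levers reduce to a small decision function evaluated before each turn. This is a sketch under assumptions: the page does not state the size of the context budget, so contextBudgetTokens is a parameter, and the function and type names are hypothetical.

```typescript
type ContextAction = "continue" | "summarize" | "rollover";

interface ContextPolicy {
  contextBudgetTokens: number; // assumption: the budget size is not given in the text
  summarizeAtFraction: number; // 0.40 in the text
  hardCapTokens: number; // 80 000 in the text
}

function nextContextAction(
  promptTokens: number, // tokens in the prompt that would be sent for this turn
  conversationTokens: number, // running total for the whole conversation
  policy: ContextPolicy
): ContextAction {
  // The hard cap wins: past it, the conversation is closed and reseeded.
  if (conversationTokens > policy.hardCapTokens) return "rollover";
  // Summarize early so the cached prefix stays small and predictable.
  if (promptTokens >= policy.contextBudgetTokens * policy.summarizeAtFraction) return "summarize";
  return "continue";
}
```

Checking the hard cap first avoids paying for a summarization call on a conversation that is about to be closed anyway.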

State persistence

| Store | Collection | Class | Retention |
| --- | --- | --- | --- |
| MongoDB | florence_conversations | PHI (mixed) | 6 years min (HIPAA), 10 years target (EDE) |
| MongoDB | user_profile | PHI + PII | Lifetime of member + 6 years |
| MongoDB | florence_audit_log | Audit (append-only) | 10 years |
| MongoDB | florence_escalations | PHI | 10 years |

All collections are CSFLE-encrypted with the CMK for their data class (see data classification).

Guardrails and camouflage

Detail is split across multiple principles; the runtime is where they compose. Five layers defend against prompt injection and scope abuse:

  1. Input classifier (step 1 above) — stops scope abuse pre-model.
  2. System-prompt fortification — explicit refusals baked in; version-controlled.
  3. Tool-layer authorization — declared accepted auth contexts per tool; enforced in the wrapper.
  4. Output classifier (step 7 above) — blocks code, off-domain, model-identity leaks.
  5. Canary tokens + jailbreak eval suite — prompt contains unique tokens; if they ever appear in output, we know what leaked from where.
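Layers 3 and 5 in the list above are the most mechanical, so they are sketched here. Everything named in the block is hypothetical: the AuthContext values, the ToolSpec shape, the example tool name, and the canary format are illustrations, not the real registry or prompt contents.

```typescript
// Hypothetical auth contexts; the real set lives in the tool registry.
type AuthContext = "anonymous" | "member" | "agent";

interface ToolSpec {
  name: string;
  acceptedAuth: AuthContext[]; // declared per tool, enforced by the wrapper
}

// Layer 3: the wrapper refuses any call whose auth context the tool did not declare.
function isToolCallAllowed(tool: ToolSpec, caller: AuthContext): boolean {
  return tool.acceptedAuth.indexOf(caller) !== -1;
}

interface Canary {
  token: string; // unique string embedded in one prompt section
  location: string; // which section it was embedded in
}

// Layer 5: report which prompt sections leaked, based on the canaries found in the output.
function findLeakedCanaries(output: string, canaries: Canary[]): string[] {
  return canaries
    .filter((c) => output.indexOf(c.token) !== -1)
    .map((c) => c.location);
}
```

Tagging each canary with its location is what turns a leak from a yes/no signal into "what leaked from where".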

Camouflage posture:

  • Tool-use blocks never stream to the client. ui_* calls dispatch as a separate event type the client renders; api_* calls are entirely server-side.
  • Style normalizer (step 8) deterministically strips Claude-isms and enforces the Florence voice.
  • All model calls are server-mediated; the browser never calls Anthropic, OpenAI, or Bedrock directly.
  • Error messages sanitized — no vendor provenance reaches users.
  • Randomized small timing jitter on response start to defeat trivial timing fingerprinting.
  • Public marketing says "Florence," not "powered by Claude."
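The first camouflage rule can be sketched as a filter between the model stream and the SSE response. The ModelEvent shape and the SSE event names ("token", "ui") are assumptions for illustration; only the ui_* and api_* prefixes come from this page.

```typescript
// Hypothetical shape of events coming out of the agent runtime.
type ModelEvent =
  | { kind: "text"; text: string }
  | { kind: "tool_use"; tool: string; input: unknown };

// Converts one runtime event into zero or more SSE frames for the client.
function toSseFrames(event: ModelEvent): string[] {
  if (event.kind === "text") {
    return ["event: token\ndata: " + JSON.stringify({ text: event.text }) + "\n\n"];
  }
  // ui_* calls are re-emitted as their own event type for the client to render.
  if (event.tool.slice(0, 3) === "ui_") {
    return ["event: ui\ndata: " + JSON.stringify({ tool: event.tool, input: event.input }) + "\n\n"];
  }
  // api_* calls, and anything unrecognized, never leave the server.
  return [];
}
```

Defaulting to an empty result means a newly added tool prefix stays server-side until someone deliberately exposes it.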

Observability (cost + latency)

Every turn emits metrics into a Mongo-backed dashboard (built in-house; no Langfuse or Helicone, to avoid vendor spend and stay inside the compliance boundary):

  • Per-conversation cost attribution (tokens × model × provider rate)
  • Routing mix (Haiku / Sonnet / Opus distribution, weekly)
  • Cache-hit rate (target ≥ 85 %)
  • Latency percentiles per step
  • Escalation rate (target ≤ 5 %)
  • Classifier blocks per category

Alert thresholds are baked in from day one: daily cost drift > 20 % triggers review. See evals & observability.
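Two of the dashboard metrics, cost attribution and cache-hit rate, are simple enough to sketch. The TurnUsage and ModelRate shapes are hypothetical, and rates are passed in as configuration rather than hard-coded because no provider prices appear on this page.

```typescript
interface TurnUsage {
  model: string;
  freshInputTokens: number;
  cachedInputTokens: number;
  outputTokens: number;
}

// USD per million tokens; values come from configuration, not from this sketch.
interface ModelRate {
  input: number;
  cachedInput: number;
  output: number;
}

// Cost attribution: tokens x model x provider rate, for one turn.
function turnCostUsd(usage: TurnUsage, rates: Record<string, ModelRate>): number {
  const rate = rates[usage.model];
  if (!rate) throw new Error("no rate configured for " + usage.model);
  const weighted =
    usage.freshInputTokens * rate.input +
    usage.cachedInputTokens * rate.cachedInput +
    usage.outputTokens * rate.output;
  return weighted / 1000000;
}

// Cache-hit rate: share of input tokens served from the prompt cache (target is 0.85 or higher).
function cacheHitRate(turns: TurnUsage[]): number {
  let cached = 0;
  let total = 0;
  for (const t of turns) {
    cached += t.cachedInputTokens;
    total += t.cachedInputTokens + t.freshInputTokens;
  }
  return total === 0 ? 0 : cached / total;
}
```

Summing turnCostUsd over a conversation's turns gives the per-conversation attribution, and comparing cacheHitRate with the 85 % target is the quickest check that the prompt layer order is still doing its job.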

AskFlorence Internal Documentation. Not for public distribution.
