
Provider risk & portability

Florence's core intelligence runs on a third-party LLM. That vendor relationship is the single largest concentration risk in the stack: one policy change, pricing move, compliance snag, or outage at Anthropic — or at AWS Bedrock as the Phase 3 transport — could degrade or disable the product overnight.

This document articulates the risks, the architecture we build to survive them, the tiers of switch we can execute, the capability matrix across alternatives, and the operational plan that keeps all escape hatches warm — not theoretical.

The guiding principle: the provider is a commodity; Florence's differentiation is the tool surface, prompts, evals, and deterministic platform underneath. We must retain the freedom to move.

Risk register

| # | Risk | Likelihood | Impact | Time to detect | Time to mitigate (without plan) | Time to mitigate (with plan) |
|---|------|------------|--------|----------------|---------------------------------|------------------------------|
| R1 | Anthropic/Bedrock outage (hours-to-day scale) | High (validated; see outage playbook) | High | Minutes | Product degraded until restored | Seconds (kill-switch to warm standby) |
| R2 | Anthropic significant price increase (≥ 2×) | Medium-Low | Medium-High | Days (announcement) | Weeks (migration) | Days (pre-benchmarked alternatives) |
| R3 | Model deprecation on a version we depend on | Medium | Low-Medium | Months (announced) | Days (code change + re-eval) | Hours (drop-in to successor after standing eval) |
| R4 | Quality regression on a new model version | Medium | High | Hours (evals) | Rollback to previous | Rollback; no production change |
| R5 | BAA revocation / compliance dispute | Low | Critical | Days | Substantial rebuild | Priority cutover via pre-signed secondary BAA |
| R6 | Anthropic acquisition / strategy pivot away from API | Low | Critical | Months | Months (platform re-implementation) | Weeks (abstraction + evals already exist) |
| R7 | Regulatory exclusion of Claude from EDE scope | Low | Critical | Months | Months | Weeks (self-hosted + alternate-vendor paths) |
| R8 | Rate-limit / capacity constraint at growth | Medium | Medium | Days | Weeks (renegotiation) | Hours (spill to secondary) |
| R9 | Geographic / political restriction | Low | Low-Medium | Days | Weeks | Days |
| R10 | Catastrophic Anthropic security incident | Very low | Critical | Hours | Months | Days (activate self-hosted) |

R1, R4, and R8 are the high-likelihood day-to-day risks; the other seven are low-frequency, high-consequence tail risks. The architecture must address both.

Architectural enablers: what we build into the system to make switching possible

These are not optional; they are prerequisites for the product being resilient.

1. Provider abstraction at the adapter layer (required from day 1)

No application code calls Anthropic, Bedrock, OpenAI, Vertex AI, or any provider SDK directly. Every LLM call goes through a single interface:

```ts
// src/lib/florence/providers/types.ts
interface FlorenceLLMProvider {
  // Adapter identifier; the `string` fallback leaves room for providers added later.
  name: "anthropic-direct" | "bedrock-claude" | "openai" | "vertex-gemini" | "self-hosted" | string;
  // Per-provider feature flags and limits.
  capabilities: {
    toolUse: boolean;
    streaming: boolean;
    promptCaching: boolean;
    parallelToolUse: boolean;
    maxContextTokens: number;
  };
  // Streams the events for a single request.
  invoke(req: FlorenceLLMRequest): AsyncIterable<FlorenceLLMEvent>;
}
```

The primary implementation wraps the Claude Agent SDK. Every other implementation is a drop-in replacement behind this same interface. The runtime does not know or care which is active; it reads FLORENCE_LLM_PROVIDER and instantiates accordingly.

This is non-negotiable. The Claude Agent SDK is excellent for Claude; it is not the external API the rest of Florence depends on.
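As a rough illustration of how the runtime might turn FLORENCE_LLM_PROVIDER into an adapter instance, here is a minimal sketch. The `providerRegistry` map, the `stubProvider` helper, and the simplified request/event types are hypothetical stand-ins, not the actual Florence code; a real adapter would wrap the vendor SDK.

```typescript
// Simplified stand-ins for the types in src/lib/florence/providers/types.ts (assumption).
type FlorenceLLMRequest = { prompt: string };
type FlorenceLLMEvent = { type: "text"; text: string } | { type: "done" };

interface FlorenceLLMProvider {
  name: string;
  invoke(req: FlorenceLLMRequest): AsyncIterable<FlorenceLLMEvent>;
}

// Hypothetical stub adapter that echoes the prompt instead of calling a vendor.
function stubProvider(name: string): FlorenceLLMProvider {
  return {
    name,
    async *invoke(req: FlorenceLLMRequest): AsyncIterable<FlorenceLLMEvent> {
      yield { type: "text", text: `[${name}] ${req.prompt}` };
      yield { type: "done" };
    },
  };
}

// Hypothetical registry keyed by the value of FLORENCE_LLM_PROVIDER.
const providerRegistry = new Map<string, () => FlorenceLLMProvider>([
  ["anthropic-direct", () => stubProvider("anthropic-direct")],
  ["bedrock-claude", () => stubProvider("bedrock-claude")],
  ["openai", () => stubProvider("openai")],
]);

// Resolve the active provider from an env-like map. Unknown or missing names
// fail loudly rather than silently falling back to a default vendor.
function resolveProvider(env: Record<string, string | undefined>): FlorenceLLMProvider {
  const key = env["FLORENCE_LLM_PROVIDER"];
  const factory = key === undefined ? undefined : providerRegistry.get(key);
  if (!factory) {
    throw new Error(`Unknown FLORENCE_LLM_PROVIDER: ${String(key)}`);
  }
  return factory();
}
```

Passing the environment in as a plain map keeps the resolver testable; in the runtime it would presumably receive the process environment.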

2. Model-neutral tool schemas

Tool definitions live as Zod schemas + semantic metadata. They are rendered per-provider at runtime:

  • Anthropic: { name, description, input_schema } JSON schema format
  • OpenAI: { type: "function", function: { name, description, parameters } }
  • Vertex AI: Google's function-declaration format
  • Self-hosted (Hermes / Qwen): OpenAI-compatible or Hermes native

The Zod schema is the single source of truth. Provider-specific rendering lives in the adapter. Adding a provider adds a renderer; it does not change a tool.
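A minimal sketch of per-provider rendering, covering the Anthropic and OpenAI shapes listed above. To stay dependency-free, a plain JSON Schema object stands in for the output of the Zod schema (an assumption); `NeutralTool`, `renderForAnthropic`, `renderForOpenAI`, and the `lookup_plan` tool are hypothetical names used only for illustration.

```typescript
// Plain JSON Schema stand-in for what would be derived from a Zod schema (assumption).
type JsonSchema = {
  type: "object";
  properties: Record<string, unknown>;
  required?: string[];
};

// Provider-neutral tool definition: the single source of truth.
interface NeutralTool {
  name: string;
  description: string;
  parameters: JsonSchema;
}

// Anthropic shape: { name, description, input_schema }.
function renderForAnthropic(tool: NeutralTool) {
  return {
    name: tool.name,
    description: tool.description,
    input_schema: tool.parameters,
  };
}

// OpenAI shape: { type: "function", function: { name, description, parameters } }.
function renderForOpenAI(tool: NeutralTool) {
  return {
    type: "function" as const,
    function: {
      name: tool.name,
      description: tool.description,
      parameters: tool.parameters,
    },
  };
}

// Hypothetical tool, for illustration only.
const lookupPlan: NeutralTool = {
  name: "lookup_plan",
  description: "Fetch a plan summary by plan ID.",
  parameters: {
    type: "object",
    properties: { planId: { type: "string" } },
    required: ["planId"],
  },
};
```

Both renderers read the same `lookupPlan` object, so adding a provider means adding one more render function while the tool definition stays untouched.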

3. Prompt templates with provider adaptation layer

Claude prompts and GPT prompts are not drop-in compatible. Claude responds well to long structured markdown + XML tagging; GPT prefers JSON-shaped instructions; Gemini has its own idioms.

Structure:

```
prompts/
  florence-member/
    core.md                # provider-neutral canonical prompt
    adaptations/
      anthropic.md         # Claude-specific wrapping + tags
      openai.md            # GPT-specific format
      vertex.md            # Gemini-specific format
      self-hosted.md       # open-weight models
```

The adapter assembles core + adaptations/<provider> at load time. Eval harness runs the full suite against every adaptation; regressions block merge.
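The assembly step could look roughly like the sketch below. It assumes the files have already been read into memory and that assembly is simple concatenation of the core prompt followed by the provider adaptation; the real adapter may interleave or wrap them differently. `PromptSources` and `assemblePrompt` are hypothetical names.

```typescript
type PromptProvider = "anthropic" | "openai" | "vertex" | "self-hosted";

// In-memory contents of core.md and adaptations/<provider>.md (assumption:
// file loading happens elsewhere).
interface PromptSources {
  core: string;
  adaptations: Partial<Record<PromptProvider, string>>;
}

// Assemble core + adaptation for one provider. A missing adaptation is an
// error, so a provider can never run on an untested prompt by accident.
function assemblePrompt(src: PromptSources, provider: PromptProvider): string {
  const adaptation = src.adaptations[provider];
  if (adaptation === undefined) {
    throw new Error(`No prompt adaptation for provider: ${provider}`);
  }
  return `${src.core.trim()}\n\n${adaptation.trim()}`;
}
```

Failing on a missing adaptation, rather than falling back to the core prompt alone, matches the intent that every provider has an adaptation the eval harness has exercised.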

4. Evals run against every provider we maintain (not just the primary)

The eval harness (see evals & observability) includes a PROVIDERS parameter. CI runs the full eval suite against the primary on every PR and against the warm-standby providers on a daily schedule. Capability differences become known data, not surprises.

This is the most important enabler. We cannot credibly claim "we can switch" unless we measure the alternative's performance continuously. A quarterly evaluation is too rare; daily is the minimum.

5. Kill switch: provider swap is a config change

Flipping the primary provider is a one-line config change:

```bash
# In production environment config
FLORENCE_LLM_PROVIDER=openai-gpt4o   # was: bedrock-claude-sonnet-4-6
```

No code change, no deploy, no rollout. An SRE or on-call engineer can flip it in under 60 seconds. The flip is announced in the ops channel and recorded in the audit log.

6. Prompt-and-provider pair versioning

Every audit log row and every eval run records the (prompt_version, provider, model) triple. Debugging a regression or an incident is O(1): which combination was running?
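To make the triple concrete, the sketch below groups failed eval rows by their (prompt_version, provider, model) combination. The `RunFingerprint` and `EvalRow` shapes and the `|`-joined key are illustrative assumptions, not the actual audit log schema.

```typescript
// The (prompt_version, provider, model) triple recorded on every row.
interface RunFingerprint {
  promptVersion: string;
  provider: string;
  model: string;
}

// Hypothetical eval result row.
interface EvalRow {
  fingerprint: RunFingerprint;
  passed: boolean;
}

// Stable string key for one combination.
function fingerprintKey(f: RunFingerprint): string {
  return [f.promptVersion, f.provider, f.model].join("|");
}

// Count failures per combination so a regression points at one triple.
function failuresByFingerprint(rows: EvalRow[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const row of rows) {
    if (row.passed) continue;
    const key = fingerprintKey(row.fingerprint);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}
```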

The four tiers of switch

Switches happen at different scopes with different costs. Architecture supports all four.

Tier 0: same-model, same-vendor, different transport

"Anthropic direct API is down; swap to Bedrock Claude (or vice versa)."

  • Effort: seconds (env var flip)
  • Quality impact: zero
  • Cost impact: minor pricing delta; different BAA path
  • Use for: outages, regional issues, testing Phase 1 ↔ Phase 3 path before the real migration

Maintained as: both paths always configured; both BAAs signed; both tested weekly.

Tier 1: same-vendor, different model

"Claude Haiku 4.5 deprecated; upgrade to Haiku 5.0" or "Sonnet 4.6 regressed on plan comparison; rollback to 4.5."

  • Effort: minutes (router config change) after evals validate
  • Quality impact: depends; evals gate the change
  • Cost impact: price-per-token varies by model version
  • Use for: model deprecation, new-model adoption, regression rollback

Maintained as: eval harness tests every released Claude model monthly against our suite. When Anthropic announces a deprecation, we already know the target.

Tier 2: cross-vendor swap (Claude → OpenAI / Gemini)

"Anthropic has a BAA dispute / 2× price hike / access issue; route primary to OpenAI GPT-5 via Azure."

  • Effort: small with the plan in place (adapter + adaptation prompt + BAA pre-signed); substantial re-prompt + re-eval burden without it
  • Quality impact: likely regression on some eval categories, improvement on others; measured, not guessed
  • Cost impact: varies per vendor per model
  • Compliance impact: new BAA required (or confirm existing Azure OpenAI / Vertex AI BAA)
  • Use for: BAA revocation, strategic pricing pressure, quality advantage
  • Phase 3 impact: target vendor must have FedRAMP Moderate posture (Azure Government, Vertex AI in authorized region) or we fall back to Tier 3

Maintained as:

  1. Abstraction + per-provider prompt adaptation exists from day 1.
  2. Full eval suite runs against the secondary vendor daily, with pass-rate published to the provider-risk dashboard.
  3. BAA with the secondary vendor is signed and active at launch (even if 0% of production traffic flows through it).
  4. Secondary vendor gets a small percentage of production traffic (1–5%) in shadow mode continuously — proves the path works under real load, keeps the integration from bit-rotting.
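One way to pick the 1–5% shadow slice is deterministic hashing on a turn identifier, so the same turn always lands on the same side of the cut. The sketch below is an assumption about how this could be done, not a description of the current router; `inShadowSample` and the use of an FNV-1a variant over UTF-16 code units are illustrative choices.

```typescript
// FNV-1a style 32-bit hash over the UTF-16 code units of a string.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Decide whether a turn is mirrored to the secondary vendor in shadow mode.
// shadowPercent is a percentage in [0, 100]; buckets give 0.01% resolution.
function inShadowSample(turnId: string, shadowPercent: number): boolean {
  if (shadowPercent < 0 || shadowPercent > 100) {
    throw new RangeError("shadowPercent must be between 0 and 100");
  }
  const bucket = fnv1a(turnId) % 10000;
  return bucket < shadowPercent * 100;
}
```

Because the decision depends only on the turn ID and the percentage, raising the percentage keeps every previously sampled turn in the sample, which makes before/after comparisons easier.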

Tier 3: self-hosted open-weight model

"Every commercial option is unavailable or economically untenable; activate self-hosted fallback."

  • Effort: shippable once GPU capacity is live + eval baseline validated; multi-step cold-start otherwise (capacity provisioning + fine-tune + eval baseline)
  • Quality impact: known gap vs. Claude/GPT on complex reasoning; acceptable for lookup + response synthesis if the deterministic API carries the load
  • Cost impact: favorable at volume (GPU amortization), unfavorable at low volume
  • Compliance impact: inherits our FedRAMP posture — no subprocessor
  • Use for: true nuclear scenario (no commercial option viable), long-term strategic independence, brand moat via custom "Florence" fine-tune
  • Current candidates: Hermes-4 (Qwen-based, strong function calling), Llama-4 (Meta's next generation; sizes and licensing TBD), DeepSeek V3.x, Qwen3, and a healthcare-specific open-weight model if a good one emerges

Maintained as:

  1. SageMaker hosting runbook + Terraform module exists, but endpoint is not running in production (scale-to-zero).
  2. Weekly eval run against the candidate model using a spun-up-and-torn-down SageMaker job (costs ~tens of dollars/month).
  3. Quarterly spike to validate the latest open-weight candidate (always moving target; quality is improving rapidly in 2025–2026).
  4. The voice track (see voice) already invests in self-hosted open-weight TTS for the authenticated-member experience; the LLM self-hosting story rides on the same infra pattern.

Capability matrix: current assessment

Indicative; refreshed quarterly on the provider-risk dashboard.

| Vendor / model | Tool use quality | Streaming | Prompt cache | Context (tokens) | Latency | Cost in ($/M tokens) | Cost out ($/M tokens) | BAA | FedRAMP |
|----------------|------------------|-----------|--------------|------------------|---------|----------------------|-----------------------|-----|---------|
| Anthropic Claude Sonnet 4.6 (direct) | Excellent | Yes | Yes | 200 k | ~800 ms first token | ~$3 | ~$15 | Signed | No (Phase 1) |
| Anthropic Claude Haiku 4.5 (direct) | Very good | Yes | Yes | 200 k | ~500 ms | ~$0.80 | ~$4 | Signed | No |
| Bedrock Claude Sonnet 4.6 | Same as direct | Yes | Yes | 200 k | ~800 ms + region | AWS tier | AWS tier | AWS BAA | Yes (Mod) |
| OpenAI GPT-5 (via Azure) | Excellent | Yes | Partial | 128–256 k | ~600 ms | ~$2–5 | ~$8–20 | Azure BAA | Azure Gov has FedRAMP High |
| OpenAI GPT-4o-mini (Azure) | Good | Yes | Partial | 128 k | ~300 ms | ~$0.15 | ~$0.60 | Azure BAA | Azure Gov |
| Google Gemini 2.5 Pro (Vertex AI) | Very good | Yes | Yes | 1 M+ | ~700 ms | ~$1.25 | ~$10 | Vertex BAA | Vertex has FedRAMP High |
| Google Gemini 2.5 Flash (Vertex) | Good | Yes | Yes | 1 M+ | ~400 ms | ~$0.30 | ~$2.50 | Vertex BAA | Yes |
| Hermes-4 (Qwen3-based, self-hosted) | Good for structured tool use | Yes | via vLLM prefix cache | 128 k | ~150–300 ms warm | GPU amortized | GPU amortized | Inherits ours | Inherits ours |
| Llama-4-class self-hosted | TBD at release | Yes | via vLLM | Depends | ~200 ms warm | GPU amortized | GPU amortized | Inherits ours | Inherits ours |

Numbers drift; the dashboard is the source of truth in production.

Operational plan: keeping the options warm

Launch state (Phase 1)

  • Primary: Anthropic direct API (or Bedrock if that's where we land; decision pending per roadmap)
  • Tier 0 configured: the non-primary Claude transport is signed + tested. BAA covers both paths.
  • Tier 2 configured, idle: Azure OpenAI or Vertex AI Gemini — one is chosen as the designated Tier 2 — has BAA signed, adapter implemented, adaptation prompt written, evals run daily, no production traffic.
  • Tier 3 dormant: SageMaker runbook exists + a throwaway eval spike has validated one self-hosted candidate. No running endpoint.

Ongoing cadence

| Cadence | Activity |
|---------|----------|
| Continuous | Eval dashboard shows provider pass rates. Alert on any primary-vs-secondary gap > 10 pp. |
| Daily | Eval suite runs against primary + secondary on latest prompts. |
| Weekly | Tier 0 swap drill (flip Anthropic ↔ Bedrock for a canary turn; verify parity). |
| Monthly | Cost, latency, and quality report across all tiers. Posted to the architecture issue thread. |
| Quarterly | Self-hosted spike: latest open-weight candidate benchmarked against Claude baseline; capability matrix updated. |
| Annually | Provider-risk review with Taha + CFO + counsel. Re-confirm BAAs, re-validate pricing assumptions, reassess strategic posture. |
| Ad-hoc | On any model deprecation announcement, spike the successor as priority work; don't wait for the deprecation window. |
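The continuous alert on a primary-vs-secondary gap above 10 pp could be computed along these lines. This is a sketch under two assumptions: the gap is taken as an absolute difference (the direction is not specified above), and pass rates are derived from raw pass/total counts. All function names are hypothetical.

```typescript
// Pass rate as a percentage, computed integer-first to avoid rounding noise.
function passRatePct(passed: number, total: number): number {
  if (total <= 0) throw new RangeError("total must be positive");
  return (passed * 100) / total;
}

// Absolute gap between two pass rates, in percentage points (assumption:
// a gap in either direction is worth alerting on).
function providerGapPp(primaryPct: number, secondaryPct: number): number {
  return Math.abs(primaryPct - secondaryPct);
}

// Alert when the gap strictly exceeds the threshold (10 pp by default).
function shouldAlertOnGap(primaryPct: number, secondaryPct: number, thresholdPp = 10): boolean {
  return providerGapPp(primaryPct, secondaryPct) > thresholdPp;
}
```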

Triggers to activate each tier

Pre-committed triggers reduce decision latency during an incident:

  • Activate Tier 0: primary-provider error rate > 2 % for > 5 minutes, OR latency p95 degrades > 2× baseline. Automatic failover for individual turns.
  • Activate Tier 1: eval pass rate drops > 5 pp on a new model version. Router rolled back to prior model automatically.
  • Activate Tier 2: Anthropic pricing increases > 50 %, OR BAA / compliance dispute unresolved for > 72 hours, OR sustained capacity constraint limiting growth. Decision: Taha. Execution: priority cutover.
  • Activate Tier 3: no Tier 0–2 option is viable. Decision: Taha + board. Execution: GPU capacity planning + staged traffic ramp.
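The Tier 0 trigger above can be sketched as a pure function over recent health samples. The thresholds come from the trigger list; everything else is an assumption for illustration: the `HealthSample` shape, samples being sorted oldest-first, and the latency condition being checked against the most recent sample only (no duration is specified above).

```typescript
// Hypothetical health sample for the primary provider.
interface HealthSample {
  timestampMs: number;
  errorRate: number; // fraction of failed calls, 0..1
  p95LatencyMs: number;
}

const ERROR_RATE_THRESHOLD = 0.02; // > 2 %
const SUSTAINED_MS = 5 * 60 * 1000; // for > 5 minutes
const LATENCY_FACTOR = 2; // p95 > 2x baseline

// Returns true when either Tier 0 condition holds.
// Assumes samples are sorted by timestamp, oldest first.
function shouldActivateTier0(samples: HealthSample[], baselineP95Ms: number): boolean {
  if (samples.length === 0) return false;
  const latest = samples[samples.length - 1];
  if (latest.p95LatencyMs > LATENCY_FACTOR * baselineP95Ms) return true;

  // Walk backwards to find where the current unbroken error-rate breach began.
  let breachStartMs: number | undefined;
  for (let i = samples.length - 1; i >= 0; i--) {
    if (samples[i].errorRate > ERROR_RATE_THRESHOLD) {
      breachStartMs = samples[i].timestampMs;
    } else {
      break;
    }
  }
  return breachStartMs !== undefined && latest.timestampMs - breachStartMs > SUSTAINED_MS;
}
```

Keeping the check pure (samples in, boolean out) means the pre-committed thresholds can be unit-tested and replayed against past incidents.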

What this costs us vs. not investing

  • Daily eval against secondary: ~$5–15 per day in LLM spend on the secondary provider. Trivial.
  • Warm-standby 1–5 % production shadow: ~1–5 % of primary LLM spend. Acceptable insurance.
  • Maintaining prompt adaptation layer: hours per sprint to keep aligned. Low.
  • BAA with secondary vendor from launch: legal time, ~$0 marginal (BAAs are usually template).
  • SageMaker self-hosted runbook: one-time scaffold; maintenance is a weekly eval run.

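Total ongoing cost: estimated <2 % of LLM spend at the 1 % launch shadow target; the shadow share dominates, so the total rises roughly in step with the shadow percentage. Total ongoing benefit: the product survives any of the 10 risks in the register.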

What this does NOT promise

  • Zero user-visible impact during a cross-vendor switch. Quality shifts will be measurable; prompts may need real tuning under load. Tier 2 activations target "functional on cutover, continuously tuned against eval signal thereafter."
  • Identical output. Claude-ism, GPT-ism, Gemini-ism — the voice normalizer (see runtime — guardrails and camouflage) helps, but each model has its own tendencies. Evals catch regressions on substance, not style.
  • Zero-downtime Tier 3 activation. Self-hosting at full traffic requires GPU capacity planning; a cold-start scale-up from zero to a large member population is not a one-hour move. Tier 3 is a plan-for-the-worst, not a routine flex.

Related

  • Principles — unit economics — provider-switch scenarios must continue to hit the same targets or the switch is not accepted.
  • Runtime — where the abstraction lives and how the kill switch is wired.
  • Voice — vendor strategy — the same portability discipline applies to voice vendors.
  • Roadmap — revisit self-hosted — the cost-driven trigger for Tier 3 adoption as a primary (not just fallback).

Tracking

Open items on #61:

  • Secondary-vendor selection (Azure OpenAI vs. Vertex AI Gemini) — decision + BAA in place before Florence text launch
  • Daily-eval-vs-secondary CI pipeline
  • Self-hosted spike schedule (quarterly)
  • Warm-standby production traffic percentage (target: 1 % at launch, 5 % by six-month mark)

AskFlorence Internal Documentation. Not for public distribution.
