Appearance
Provider risk & portability
Florence's core intelligence runs on a third-party LLM. That vendor relationship is the single largest concentration risk in the stack: one policy change, pricing move, compliance snag, or outage at Anthropic — or at AWS Bedrock as the Phase 3 transport — could degrade or disable the product overnight.
This document articulates the risks, the architecture we build to survive them, the tiers of switch we can execute, the capability matrix across alternatives, and the operational plan that keeps all escape hatches warm — not theoretical.
The guiding principle: the provider is a commodity; Florence's differentiation is the tool surface, prompts, evals, and deterministic platform underneath. We must retain the freedom to move.
Risk register
| # | Risk | Likelihood | Impact | Time to detect | Time to mitigate (without plan) | With plan |
|---|---|---|---|---|---|---|
| R1 | Anthropic/Bedrock outage (hours-to-day scale) | High (validated — see outage playbook) | High | Minutes | Product degraded until restored | Seconds (kill-switch to warm standby) |
| R2 | Anthropic significant price increase (≥ 2×) | Medium-Low | Medium-High | Days (announcement) | Weeks (migration) | Days (pre-benchmarked alternatives) |
| R3 | Model deprecation on a version we depend on | Medium | Low-Medium | Months (announced) | Days (code change + re-eval) | Hours (drop-in to successor after standing eval) |
| R4 | Quality regression on a new model version | Medium | High | Hours (evals) | Rollback to previous | Rollback; no production change |
| R5 | BAA revocation / compliance dispute | Low | Critical | Days | Substantial rebuild | Priority cutover via pre-signed secondary BAA |
| R6 | Anthropic acquisition / strategy pivot away from API | Low | Critical | Months | Months (platform re-implementation) | Weeks (abstraction + evals already exist) |
| R7 | Regulatory exclusion of Claude from EDE scope | Low | Critical | Months | Months | Weeks (self-hosted + alternate-vendor paths) |
| R8 | Rate-limit / capacity constraint at growth | Medium | Medium | Days | Weeks (renegotiation) | Hours (spill to secondary) |
| R9 | Geographic / political restriction | Low | Low-Medium | Days | Weeks | Days |
| R10 | Catastrophic Anthropic security incident | Very low | Critical | Hours | Months | Days (activate self-hosted) |
R1, R4, R8 are the high-likelihood day-to-day risks; the other six are low-frequency high-consequence tail risks. Architecture must address both.
Architectural enablers — what we build into the system to make switching possible
These are not optional; they are prerequisites for the product being resilient.
1. Provider abstraction at the adapter layer (required from day 1)
No application code calls Anthropic, Bedrock, OpenAI, Vertex AI, or any provider SDK directly. Every LLM call goes through a single interface:
ts
// src/lib/florence/providers/types.ts
interface FlorenceLLMProvider {
name: "anthropic-direct" | "bedrock-claude" | "openai" | "vertex-gemini" | "self-hosted" | string;
capabilities: {
toolUse: boolean;
streaming: boolean;
promptCaching: boolean;
parallelToolUse: boolean;
maxContextTokens: number;
};
invoke(req: FlorenceLLMRequest): AsyncIterable<FlorenceLLMEvent>;
}Primary implementation wraps the Claude Agent SDK. Every other implementation is a drop-in replacement of this one interface. The runtime does not know or care which is active; it reads FLORENCE_LLM_PROVIDER and instantiates accordingly.
This is non-negotiable. The Claude Agent SDK is excellent for Claude; it is not the external API the rest of Florence depends on.
2. Model-neutral tool schemas
Tool definitions live as Zod schemas + semantic metadata. They are rendered per-provider at runtime:
- Anthropic:
{ name, description, input_schema }JSON schema format - OpenAI:
{ type: "function", function: { name, description, parameters } } - Vertex AI: Google's function-declaration format
- Self-hosted (Hermes / Qwen): OpenAI-compatible or Hermes native
The Zod schema is the single source of truth. Provider-specific rendering lives in the adapter. Adding a provider adds a renderer; it does not change a tool.
3. Prompt templates with provider adaptation layer
Claude prompts and GPT prompts are not drop-in compatible. Claude responds well to long structured markdown + XML tagging; GPT prefers JSON-shaped instructions; Gemini has its own idioms.
Structure:
prompts/
florence-member/
core.md — provider-neutral canonical prompt
adaptations/
anthropic.md — Claude-specific wrapping + tags
openai.md — GPT-specific format
vertex.md — Gemini-specific format
self-hosted.md — open-weight modelsThe adapter assembles core + adaptations/<provider> at load time. Eval harness runs the full suite against every adaptation; regressions block merge.
4. Evals run against every provider we maintain (not just the primary)
The eval harness (see evals & observability) includes a PROVIDERS parameter. CI runs the full eval suite against the primary on every PR and against the warm-standby providers on a daily schedule. Capability differences become known data, not surprises.
This is the most important enabler. We cannot credibly claim "we can switch" unless we measure the alternative's performance continuously. A quarterly evaluation is too rare; daily is the minimum.
5. Kill switch — provider swap is a config change
Flipping primary provider is:
bash
# In production environment config
FLORENCE_LLM_PROVIDER=openai-gpt4o # was: bedrock-claude-sonnet-4-6No code change, no deploy, no rollout. An SRE or on-call engineer can flip it in under 60 seconds. Announced via the ops channel + audit log.
6. Prompt-and-provider pair versioning
Every audit log row and every eval run records the (prompt_version, provider, model) triple. Debugging a regression or an incident is O(1): which combination was running?
The four tiers of switch
Switches happen at different scopes with different costs. Architecture supports all four.
Tier 0 — same-model, same-vendor, different transport
"Anthropic direct API is down; swap to Bedrock Claude (or vice versa)."
- Effort: seconds (env var flip)
- Quality impact: zero
- Cost impact: minor pricing delta; different BAA path
- Use for: outages, regional issues, testing Phase 1 ↔ Phase 3 path before the real migration
Maintained as: both paths always configured; both BAAs signed; both tested weekly.
Tier 1 — same-vendor, different model
"Claude Haiku 4.5 deprecated; upgrade to Haiku 5.0" or "Sonnet 4.6 regressed on plan comparison; rollback to 4.5."
- Effort: minutes (router config change) after evals validate
- Quality impact: depends; evals gate the change
- Cost impact: price-per-token varies by model version
- Use for: model deprecation, new-model adoption, regression rollback
Maintained as: eval harness tests every released Claude model monthly against our suite. When Anthropic announces a deprecation, we already know the target.
Tier 2 — cross-vendor swap (Claude → OpenAI / Gemini)
"Anthropic has a BAA dispute / 2× price hike / access issue; route primary to OpenAI GPT-5 via Azure."
- Effort: small with plan in place (adapter + adaptation prompt + BAA pre-signed); substantial re-prompt + re-eval burden without
- Quality impact: likely regression on some eval categories, improvement on others; measured, not guessed
- Cost impact: varies per vendor per model
- Compliance impact: new BAA required (or confirm existing Azure OpenAI / Vertex AI BAA)
- Use for: BAA revocation, strategic pricing pressure, quality advantage
- Phase 3 impact: target vendor must have FedRAMP Moderate posture (Azure Government, Vertex AI in authorized region) or we fall back to Tier 3
Maintained as:
- Abstraction + per-provider prompt adaptation exists from day 1.
- Full eval suite runs against the secondary vendor daily, with pass-rate published to the provider-risk dashboard.
- BAA with the secondary vendor is signed and active at launch (even if 0% of production traffic flows through it).
- Secondary vendor gets a small percentage of production traffic (1–5%) in shadow mode continuously — proves the path works under real load, keeps the integration from bit-rotting.
Tier 3 — self-hosted open-weight model
"Every commercial option is unavailable or economically untenable; activate self-hosted fallback."
- Effort: shippable once GPU capacity is live + eval baseline validated; multi-step cold-start otherwise (capacity provisioning + fine-tune + eval baseline)
- Quality impact: known gap vs. Claude/GPT on complex reasoning; acceptable for lookup + response synthesis if the deterministic API carries the load
- Cost impact: favorable at volume (GPU amortization), unfavorable at low volume
- Compliance impact: inherits our FedRAMP posture — no subprocessor
- Use for: true nuclear scenario (no commercial option viable), long-term strategic independence, brand moat via custom "Florence" fine-tune
- Current candidates: Hermes-4 (Qwen-based, strong function calling), Llama-4 (Meta's next generation — sizes and licensing TBD), DeepSeek V3.x, Qwen3, open-weight specific to healthcare if a good one emerges
Maintained as:
- SageMaker hosting runbook + Terraform module exists, but endpoint is not running in production (scale-to-zero).
- Weekly eval run against the candidate model using a spun-up-and-torn-down SageMaker job (costs ~tens of dollars/month).
- Quarterly spike to validate the latest open-weight candidate (always moving target; quality is improving rapidly in 2025–2026).
- The voice track (see voice) already invests in self-hosted open-weight TTS for the authenticated-member experience; the LLM self-hosting story rides on the same infra pattern.
Capability matrix — current assessment
Indicative; refreshed quarterly on the provider-risk dashboard.
| Vendor / model | Tool use quality | Streaming | Prompt cache | Context | Latency | Cost $/M in | Cost $/M out | BAA | FedRAMP |
|---|---|---|---|---|---|---|---|---|---|
| Anthropic Claude Sonnet 4.6 (direct) | Excellent | Yes | Yes | 200 k | ~800 ms first token | ~$3 | ~$15 | Signed | N (Phase 1) |
| Anthropic Claude Haiku 4.5 (direct) | Very good | Yes | Yes | 200 k | ~500 ms | ~$0.80 | ~$4 | Signed | N |
| Bedrock Claude Sonnet 4.6 | Same as direct | Yes | Yes | 200 k | ~800 ms + region | AWS tier | AWS tier | AWS BAA | Yes (Mod) |
| OpenAI GPT-5 (via Azure) | Excellent | Yes | Partial | 128–256 k | ~600 ms | ~$2–5 | ~$8–20 | Azure BAA | Azure Gov has FedRAMP High |
| OpenAI GPT-4o-mini (Azure) | Good | Yes | Partial | 128 k | ~300 ms | ~$0.15 | ~$0.60 | Azure BAA | Azure Gov |
| Google Gemini 2.5 Pro (Vertex AI) | Very good | Yes | Yes | 1 M+ | ~700 ms | ~$1.25 | ~$10 | Vertex BAA | Vertex has FedRAMP High |
| Google Gemini 2.5 Flash (Vertex) | Good | Yes | Yes | 1 M+ | ~400 ms | ~$0.30 | ~$2.50 | Vertex BAA | Yes |
| Hermes-4 (Qwen3-based, self-hosted) | Good for structured tool use | Yes | via vLLM prefix cache | 128 k | ~150–300 ms warm | GPU amortized | GPU amortized | Inherits ours | Inherits ours |
| Llama-4-class self-hosted | TBD at release | Yes | via vLLM | Depends | ~200 ms warm | GPU amortized | GPU amortized | Inherits ours | Inherits ours |
Numbers drift; the dashboard is the source of truth in production.
Operational plan — keeping the options warm
Launch state (Phase 1)
- Primary: Anthropic direct API (or Bedrock if that's where we land; decision pending per roadmap)
- Tier 0 configured: the non-primary Claude transport is signed + tested. BAA covers both paths.
- Tier 2 configured, idle: Azure OpenAI or Vertex AI Gemini — one is chosen as the designated Tier 2 — has BAA signed, adapter implemented, adaptation prompt written, evals run daily, no production traffic.
- Tier 3 dormant: SageMaker runbook exists + a throwaway eval spike has validated one self-hosted candidate. No running endpoint.
Ongoing cadence
| Cadence | Activity |
|---|---|
| Continuous | Eval dashboard shows provider pass rates. Alert on any primary-vs-secondary gap > 10 pp. |
| Daily | Eval suite runs against primary + secondary on latest prompts. |
| Weekly | Tier 0 swap drill (flip Anthropic ↔ Bedrock for a canary turn; verify parity). |
| Monthly | Cost, latency, and quality report across all tiers. Posted to the architecture issue thread. |
| Quarterly | Self-hosted spike — latest open-weight candidate benchmarked against Claude baseline; capability matrix updated. |
| Annually | Provider-risk review with Taha + CFO + counsel. Re-confirm BAAs, re-validate pricing assumptions, reassess strategic posture. |
| Ad-hoc | On any model deprecation announcement — spike the successor as priority work; don't wait for the deprecation window. |
Triggers to activate each tier
Pre-committed triggers reduce decision latency during an incident:
- Activate Tier 0: primary-provider error rate > 2 % for > 5 minutes, OR latency p95 degrades > 2× baseline. Automatic failover for individual turns.
- Activate Tier 1: eval pass rate drops > 5 pp on a new model version. Router rolled back to prior model automatically.
- Activate Tier 2: Anthropic pricing increases > 50 %, OR BAA / compliance dispute unresolved for > 72 hours, OR sustained capacity constraint limiting growth. Decision: Taha. Execution: priority cutover.
- Activate Tier 3: no Tier 0–2 option is viable. Decision: Taha + board. Execution: GPU capacity planning + staged traffic ramp.
What this costs us vs. not investing
Daily eval against secondary: ~$5–15 per day in LLM spend on the secondary provider. Trivial. Warm-standby 1–5 % production shadow: ~1–5 % of primary LLM spend. Acceptable insurance. Maintaining prompt adaptation layer: hours per sprint to keep aligned. Low. BAA with secondary vendor from launch: legal time, ~$0 marginal (BAAs are usually template). SageMaker self-hosted runbook: one-time scaffold; maintenance is a weekly eval run.
Total ongoing cost: estimated <2 % of LLM spend. Total ongoing benefit: the product survives any of the 10 risks in the register.
What this does NOT promise
- Zero user-visible impact during a cross-vendor switch. Quality shifts will be measurable; prompts may need real tuning under load. Tier 2 activations target "functional on cutover, continuously tuned against eval signal thereafter."
- Identical output. Claude-ism, GPT-ism, Gemini-ism — the voice normalizer (see runtime — guardrails and camouflage) helps, but each model has its own tendencies. Evals catch regressions on substance, not style.
- Zero-downtime Tier 3 activation. Self-hosting at full traffic requires GPU capacity planning; a cold-start scale-up from zero to a large member population is not a one-hour move. Tier 3 is a plan-for-the-worst, not a routine flex.
Related
- Principles — unit economics — provider-switch scenarios must continue to hit the same targets or the switch is not accepted.
- Runtime — where the abstraction lives and how the kill switch is wired.
- Voice — vendor strategy — the same portability discipline applies to voice vendors.
- Roadmap — revisit self-hosted — the cost-driven trigger for Tier 3 adoption as a primary (not just fallback).
Tracking
Open items on #61:
- Secondary-vendor selection (Azure OpenAI vs. Vertex AI Gemini) — decision + BAA in place before Florence text launch
- Daily-eval-vs-secondary CI pipeline
- Self-hosted spike schedule (quarterly)
- Warm-standby production traffic percentage (target: 1 % at launch, 5 % by six-month mark)