Provider risk & portability

Florence's core intelligence runs on a third-party LLM. That vendor relationship is the single largest concentration risk in the stack: one policy change, pricing move, compliance snag, or outage at Anthropic — or at AWS Bedrock as the Phase 3 transport — could degrade or disable the product overnight.

This document articulates the risks, the architecture we build to survive them, the tiers of switch we can execute, the capability matrix across alternatives, and the operational plan that keeps all escape hatches warm — not theoretical.

The guiding principle: the provider is a commodity; Florence's differentiation is the tool surface, prompts, evals, and deterministic platform underneath. We must retain the freedom to move.

Risk register

#	Risk	Likelihood	Impact	Time to detect	Time to mitigate (without plan)	Time to mitigate (with plan)
R1	Anthropic/Bedrock outage (hours-to-day scale)	High (validated — see outage playbook)	High	Minutes	Product degraded until restored	Seconds (kill-switch to warm standby)
R2	Anthropic significant price increase (≥ 2×)	Medium-Low	Medium-High	Days (announcement)	Weeks (migration)	Days (pre-benchmarked alternatives)
R3	Model deprecation on a version we depend on	Medium	Low-Medium	Months (announced)	Days (code change + re-eval)	Hours (drop-in to successor after standing eval)
R4	Quality regression on a new model version	Medium	High	Hours (evals)	Rollback to previous	Rollback; no production change
R5	BAA revocation / compliance dispute	Low	Critical	Days	Substantial rebuild	Priority cutover via pre-signed secondary BAA
R6	Anthropic acquisition / strategy pivot away from API	Low	Critical	Months	Months (platform re-implementation)	Weeks (abstraction + evals already exist)
R7	Regulatory exclusion of Claude from EDE scope	Low	Critical	Months	Months	Weeks (self-hosted + alternate-vendor paths)
R8	Rate-limit / capacity constraint at growth	Medium	Medium	Days	Weeks (renegotiation)	Hours (spill to secondary)
R9	Geographic / political restriction	Low	Low-Medium	Days	Weeks	Days
R10	Catastrophic Anthropic security incident	Very low	Critical	Hours	Months	Days (activate self-hosted)

R1, R4, and R8 are the higher-likelihood, day-to-day risks; the other seven are low-frequency, high-consequence tail risks. The architecture must address both.

Architectural enablers — what we build into the system to make switching possible

These are not optional; they are prerequisites for the product being resilient.

1. Provider abstraction at the adapter layer (required from day 1)

No application code calls Anthropic, Bedrock, OpenAI, Vertex AI, or any provider SDK directly. Every LLM call goes through a single interface:

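// src/lib/florence/providers/types.ts
interface FlorenceLLMProvider {
  // Known provider names plus any other string. `string & {}` keeps the literal
  // names visible to tooling; a bare `| string` would widen the union to `string`.
  name: "anthropic-direct" | "bedrock-claude" | "openai" | "vertex-gemini" | "self-hosted" | (string & {});
  // Static feature flags the runtime can check before routing a request.
  capabilities: {
    toolUse: boolean;
    streaming: boolean;
    promptCaching: boolean;
    parallelToolUse: boolean;
    maxContextTokens: number;
  };
  // Streams events for a single request.
  invoke(req: FlorenceLLMRequest): AsyncIterable<FlorenceLLMEvent>;
}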

The primary implementation wraps the Claude Agent SDK. Every other implementation satisfies the same interface, so it is a drop-in replacement. The runtime does not know or care which is active; it reads FLORENCE_LLM_PROVIDER and instantiates accordingly.

This is non-negotiable. The Claude Agent SDK is excellent for Claude; it is not the external API the rest of Florence depends on.
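A minimal sketch of how the runtime might resolve the active provider, assuming a registry keyed by the value of FLORENCE_LLM_PROVIDER. The simplified request and event shapes, `registry`, `resolveProvider`, and `echoProvider` are hypothetical stand-ins for illustration, not the real adapter code.

```typescript
// Hypothetical, simplified shapes; the real request/event types are not shown in this document.
interface FlorenceLLMRequest { prompt: string }
interface FlorenceLLMEvent { type: "text" | "done"; text?: string }

interface FlorenceLLMProvider {
  name: string;
  invoke(req: FlorenceLLMRequest): AsyncIterable<FlorenceLLMEvent>;
}

// Stand-in provider used only for illustration: it streams the prompt back.
function echoProvider(name: string): FlorenceLLMProvider {
  return {
    name,
    async *invoke(req: FlorenceLLMRequest): AsyncGenerator<FlorenceLLMEvent> {
      yield { type: "text", text: `[${name}] ${req.prompt}` };
      yield { type: "done" };
    },
  };
}

// Registry of factories, keyed by the configured provider name (assumed layout).
const registry = new Map<string, () => FlorenceLLMProvider>([
  ["anthropic-direct", () => echoProvider("anthropic-direct")],
  ["bedrock-claude", () => echoProvider("bedrock-claude")],
]);

// Resolve the active provider from environment-style config.
// Unknown or missing names fail loudly instead of silently falling back.
function resolveProvider(env: Record<string, string | undefined>): FlorenceLLMProvider {
  const key = env["FLORENCE_LLM_PROVIDER"];
  if (!key) throw new Error("FLORENCE_LLM_PROVIDER is not set");
  const factory = registry.get(key);
  if (!factory) throw new Error(`Unknown provider: ${key}`);
  return factory();
}
```

Passing the environment in as a parameter, rather than reading a global, keeps the resolver easy to exercise in tests; application code would only ever see the returned `FlorenceLLMProvider`.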

2. Model-neutral tool schemas

Tool definitions live as Zod schemas + semantic metadata. They are rendered per-provider at runtime:

Anthropic: { name, description, input_schema } JSON schema format
OpenAI: { type: "function", function: { name, description, parameters } }
Vertex AI: Google's function-declaration format
Self-hosted (Hermes / Qwen): OpenAI-compatible or Hermes native

The Zod schema is the single source of truth. Provider-specific rendering lives in the adapter. Adding a provider adds a renderer; it does not change a tool.
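A sketch of per-provider rendering for the two formats listed above. It assumes the Zod schema has already been converted to a JSON Schema object; `NeutralTool`, `renderAnthropic`, `renderOpenAI`, and the `lookup_plan` tool are hypothetical names used only for illustration.

```typescript
// Hypothetical neutral tool definition. In Florence the parameters would be
// derived from the Zod schema; here they are a hand-written JSON Schema object.
interface NeutralTool {
  name: string;
  description: string;
  parameters: Record<string, unknown>; // JSON Schema
}

// Anthropic tool format: { name, description, input_schema }
function renderAnthropic(tool: NeutralTool) {
  return {
    name: tool.name,
    description: tool.description,
    input_schema: tool.parameters,
  };
}

// OpenAI tool format: { type: "function", function: { name, description, parameters } }
function renderOpenAI(tool: NeutralTool) {
  return {
    type: "function" as const,
    function: {
      name: tool.name,
      description: tool.description,
      parameters: tool.parameters,
    },
  };
}

// Illustrative tool; not a real Florence tool.
const lookupPlan: NeutralTool = {
  name: "lookup_plan",
  description: "Fetch a plan by its identifier.",
  parameters: {
    type: "object",
    properties: { planId: { type: "string" } },
    required: ["planId"],
  },
};
```

Both renderers read from the same `NeutralTool`, so adding a provider means adding one more render function while the tool definition stays unchanged.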

3. Prompt templates with provider adaptation layer

Claude prompts and GPT prompts are not drop-in compatible. Claude responds well to long structured markdown + XML tagging; GPT prefers JSON-shaped instructions; Gemini has its own idioms.

Structure:

prompts/
  florence-member/
    core.md                — provider-neutral canonical prompt
    adaptations/
      anthropic.md         — Claude-specific wrapping + tags
      openai.md            — GPT-specific format
      vertex.md            — Gemini-specific format
      self-hosted.md       — open-weight models

The adapter assembles core + adaptations/<provider> at load time. Eval harness runs the full suite against every adaptation; regressions block merge.
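The assembly step could look roughly like the sketch below. It works on an in-memory map of file contents so it stays self-contained; `assemblePrompt` is a hypothetical name, and joining the core first with the adaptation after it is an assumption, since the document does not specify how the adaptation wraps the core.

```typescript
// Map from repository-relative path to file contents (loaded elsewhere).
type PromptFiles = Map<string, string>;

// Assemble core + adaptations/<provider> for one agent.
// A missing file fails loudly so a provider cannot start with a partial prompt.
function assemblePrompt(files: PromptFiles, agent: string, provider: string): string {
  const corePath = `prompts/${agent}/core.md`;
  const adaptationPath = `prompts/${agent}/adaptations/${provider}.md`;

  const core = files.get(corePath);
  if (core === undefined) throw new Error(`Missing prompt file: ${corePath}`);

  const adaptation = files.get(adaptationPath);
  if (adaptation === undefined) throw new Error(`Missing prompt file: ${adaptationPath}`);

  // Assumed join order: canonical core first, provider-specific adaptation after.
  return `${core.trim()}\n\n${adaptation.trim()}`;
}
```

Failing on a missing adaptation, instead of falling back to the core alone, means a provider without a reviewed adaptation cannot be activated by accident.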

4. Evals run against every provider we maintain (not just the primary)

The eval harness (see evals & observability) includes a PROVIDERS parameter. CI runs the full eval suite against the primary on every PR and against the warm-standby providers on a daily schedule. Capability differences become known data, not surprises.

This is the most important enabler. We cannot credibly claim "we can switch" unless we measure the alternative's performance continuously. A quarterly evaluation is too rare; daily is the minimum.

5. Kill switch — provider swap is a config change

Flipping the primary provider is a one-line config change:

# In production environment config
FLORENCE_LLM_PROVIDER=openai-gpt4o   # was: bedrock-claude-sonnet-4-6

No code change, no deploy, no rollout. An SRE or on-call engineer can flip it in under 60 seconds. Each flip is announced in the ops channel and recorded in the audit log.

6. Prompt-and-provider pair versioning

Every audit log row and every eval run records the (prompt_version, provider, model) triple. Debugging a regression or an incident is O(1): which combination was running?
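As an illustration of why the triple makes triage cheap, the sketch below groups failed turns by combination. The `AuditRow` shape and its field names are assumptions for this example; the real audit schema is not shown in this document.

```typescript
// The (prompt_version, provider, model) triple recorded on every row.
interface RunTriple {
  promptVersion: string;
  provider: string;
  model: string;
}

// Hypothetical audit row: the triple plus a turn id and an outcome flag.
interface AuditRow extends RunTriple {
  turnId: string;
  ok: boolean;
}

// Stable string key for one combination.
function tripleKey(t: RunTriple): string {
  return `${t.promptVersion}|${t.provider}|${t.model}`;
}

// Count failed turns per combination, so a regression points at one triple.
function failuresByTriple(rows: AuditRow[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const row of rows) {
    if (row.ok) continue;
    const key = tripleKey(row);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}
```

If failures cluster under a single key, the incident is tied to that prompt version, provider, and model without any further correlation work.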

The four tiers of switch

Switches happen at different scopes with different costs. Architecture supports all four.

Tier 0 — same-model, same-vendor, different transport

"Anthropic direct API is down; swap to Bedrock Claude (or vice versa)."

Effort: seconds (env var flip)
Quality impact: zero
Cost impact: minor pricing delta; different BAA path
Use for: outages, regional issues, testing Phase 1 ↔ Phase 3 path before the real migration

Maintained as: both paths always configured; both BAAs signed; both tested weekly.

Tier 1 — same-vendor, different model

"Claude Haiku 4.5 deprecated; upgrade to Haiku 5.0" or "Sonnet 4.6 regressed on plan comparison; rollback to 4.5."

Effort: minutes (router config change) after evals validate
Quality impact: depends; evals gate the change
Cost impact: price-per-token varies by model version
Use for: model deprecation, new-model adoption, regression rollback

Maintained as: eval harness tests every released Claude model monthly against our suite. When Anthropic announces a deprecation, we already know the target.

Tier 2 — cross-vendor swap (Claude → OpenAI / Gemini)

"Anthropic has a BAA dispute / 2× price hike / access issue; route primary to OpenAI GPT-5 via Azure."

Effort: small with plan in place (adapter + adaptation prompt + BAA pre-signed); substantial re-prompt + re-eval burden without
Quality impact: likely regression on some eval categories, improvement on others; measured, not guessed
Cost impact: varies per vendor per model
Compliance impact: new BAA required (or confirm existing Azure OpenAI / Vertex AI BAA)
Use for: BAA revocation, strategic pricing pressure, quality advantage
Phase 3 impact: target vendor must have FedRAMP Moderate posture (Azure Government, Vertex AI in authorized region) or we fall back to Tier 3

Maintained as:

Abstraction + per-provider prompt adaptation exists from day 1.
Full eval suite runs against the secondary vendor daily, with pass-rate published to the provider-risk dashboard.
BAA with the secondary vendor is signed and active at launch (even if 0% of production traffic flows through it).
Secondary vendor gets a small percentage of production traffic (1–5%) in shadow mode continuously — proves the path works under real load, keeps the integration from bit-rotting.
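One way to pick the shadow sample is deterministic bucketing on a turn identifier, sketched below. The hash choice (FNV-1a over UTF-16 code units), the 10,000-bucket granularity, and the names `fnv1a` and `inShadowSample` are assumptions for illustration; the document does not say how shadow turns are selected.

```typescript
// 32-bit FNV-1a over the string's UTF-16 code units.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep as unsigned 32-bit
  }
  return hash;
}

// Decide whether a turn is mirrored to the secondary vendor in shadow mode.
// shadowPercent is a percentage in [0, 100], e.g. 1 at launch and 5 later.
function inShadowSample(turnId: string, shadowPercent: number): boolean {
  if (shadowPercent < 0 || shadowPercent > 100) {
    throw new RangeError("shadowPercent must be between 0 and 100");
  }
  const bucket = fnv1a(turnId) % 10_000; // 0..9999, i.e. 0.01 % granularity
  return bucket < shadowPercent * 100;
}
```

Because the bucket depends only on the turn id, the same turn gets the same decision on every node, and raising the percentage from 1 to 5 only adds turns to the sample without dropping any that were already in it.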

Tier 3 — self-hosted open-weight model

"Every commercial option is unavailable or economically untenable; activate self-hosted fallback."

Effort: shippable once GPU capacity is live + eval baseline validated; multi-step cold-start otherwise (capacity provisioning + fine-tune + eval baseline)
Quality impact: known gap vs. Claude/GPT on complex reasoning; acceptable for lookup + response synthesis if the deterministic API carries the load
Cost impact: favorable at volume (GPU amortization), unfavorable at low volume
Compliance impact: inherits our FedRAMP posture — no subprocessor
Use for: true nuclear scenario (no commercial option viable), long-term strategic independence, brand moat via custom "Florence" fine-tune
Current candidates: Hermes-4 (Qwen-based, strong function calling), Llama-4 (Meta's next generation — sizes and licensing TBD), DeepSeek V3.x, Qwen3, open-weight specific to healthcare if a good one emerges

Maintained as:

SageMaker hosting runbook + Terraform module exists, but endpoint is not running in production (scale-to-zero).
Weekly eval run against the candidate model using a spun-up-and-torn-down SageMaker job (costs ~tens of dollars/month).
Quarterly spike to validate the latest open-weight candidate (always moving target; quality is improving rapidly in 2025–2026).
The voice track (see voice) already invests in self-hosted open-weight TTS for the authenticated-member experience; the LLM self-hosting story rides on the same infra pattern.

Capability matrix — current assessment

Indicative; refreshed quarterly on the provider-risk dashboard.

Vendor / model	Tool use quality	Streaming	Prompt cache	Context	Latency	Cost $/M in	Cost $/M out	BAA	FedRAMP
Anthropic Claude Sonnet 4.6 (direct)	Excellent	Yes	Yes	200 k	~800 ms first token	~$3	~$15	Signed	No (Phase 1)
Anthropic Claude Haiku 4.5 (direct)	Very good	Yes	Yes	200 k	~500 ms	~$0.80	~$4	Signed	No
Bedrock Claude Sonnet 4.6	Same as direct	Yes	Yes	200 k	~800 ms + region	AWS tier	AWS tier	AWS BAA	Yes (Mod)
OpenAI GPT-5 (via Azure)	Excellent	Yes	Partial	128–256 k	~600 ms	~$2–5	~$8–20	Azure BAA	Azure Gov has FedRAMP High
OpenAI GPT-4o-mini (Azure)	Good	Yes	Partial	128 k	~300 ms	~$0.15	~$0.60	Azure BAA	Azure Gov
Google Gemini 2.5 Pro (Vertex AI)	Very good	Yes	Yes	1 M+	~700 ms	~$1.25	~$10	Vertex BAA	Vertex has FedRAMP High
Google Gemini 2.5 Flash (Vertex)	Good	Yes	Yes	1 M+	~400 ms	~$0.30	~$2.50	Vertex BAA	Yes
Hermes-4 (Qwen3-based, self-hosted)	Good for structured tool use	Yes	via vLLM prefix cache	128 k	~150–300 ms warm	GPU amortized	GPU amortized	Inherits ours	Inherits ours
Llama-4-class self-hosted	TBD at release	Yes	via vLLM	Depends	~200 ms warm	GPU amortized	GPU amortized	Inherits ours	Inherits ours

Numbers drift; the dashboard is the source of truth in production.

Operational plan — keeping the options warm

Launch state (Phase 1)

Primary: Anthropic direct API (or Bedrock if that's where we land; decision pending per roadmap)
Tier 0 configured: the non-primary Claude transport is signed + tested. BAA covers both paths.
Tier 2 configured, idle: one of Azure OpenAI or Vertex AI Gemini is chosen as the designated Tier 2, with BAA signed, adapter implemented, adaptation prompt written, evals running daily, and no member-facing production traffic (shadow mode only).
Tier 3 dormant: SageMaker runbook exists + a throwaway eval spike has validated one self-hosted candidate. No running endpoint.

Ongoing cadence

Cadence	Activity
Continuous	Eval dashboard shows provider pass rates. Alert on any primary-vs-secondary gap > 10 pp.
Daily	Eval suite runs against primary + secondary on latest prompts.
Weekly	Tier 0 swap drill (flip Anthropic ↔ Bedrock for a canary turn; verify parity).
Monthly	Cost, latency, and quality report across all tiers. Posted to the architecture issue thread.
Quarterly	Self-hosted spike — latest open-weight candidate benchmarked against Claude baseline; capability matrix updated.
Annually	Provider-risk review with Taha + CFO + counsel. Re-confirm BAAs, re-validate pricing assumptions, reassess strategic posture.
Ad-hoc	On any model deprecation announcement — spike the successor as priority work; don't wait for the deprecation window.
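The continuous alert in the table above reduces to a small comparison, sketched here. The `EvalSummary` shape and function names are hypothetical, and treating "any gap" as an absolute difference in either direction is an assumption.

```typescript
// Hypothetical per-provider summary of one eval run.
interface EvalSummary {
  provider: string;
  passed: number;
  total: number;
}

// Pass rate as a percentage (0..100).
function passRatePct(s: EvalSummary): number {
  if (s.total <= 0) throw new Error(`No eval cases recorded for ${s.provider}`);
  return (s.passed * 100) / s.total;
}

// Gap in percentage points; positive means the primary is ahead.
function gapPp(primary: EvalSummary, secondary: EvalSummary): number {
  return passRatePct(primary) - passRatePct(secondary);
}

// Alert when the gap exceeds the threshold in either direction (assumed reading of "any gap").
function shouldAlert(primary: EvalSummary, secondary: EvalSummary, thresholdPp = 10): boolean {
  return Math.abs(gapPp(primary, secondary)) > thresholdPp;
}
```

For example, a primary at 95 of 100 and a secondary at 80 of 100 gives a 15 pp gap and raises the alert, while 95 against 90 does not.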

Triggers to activate each tier

Pre-committed triggers reduce decision latency during an incident:

Activate Tier 0: primary-provider error rate > 2 % for > 5 minutes, OR p95 latency exceeds 2× baseline. Failover is automatic and applies to individual turns.
Activate Tier 1: eval pass rate drops > 5 pp on a new model version. Router rolled back to prior model automatically.
Activate Tier 2: Anthropic pricing increases > 50 %, OR BAA / compliance dispute unresolved for > 72 hours, OR sustained capacity constraint limiting growth. Decision: Taha. Execution: priority cutover.
Activate Tier 3: no Tier 0–2 option is viable. Decision: Taha + board. Execution: GPU capacity planning + staged traffic ramp.
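The Tier 0 trigger is mechanical enough to sketch in code. The example below assumes periodic health samples; `HealthSample`, `shouldActivateTier0`, and the choice to evaluate the latency condition on the most recent sample only are assumptions, since the trigger text gives no duration for the latency condition.

```typescript
// Hypothetical health sample emitted by monitoring at a regular interval.
interface HealthSample {
  timestampMs: number;
  errorRate: number;    // fraction, e.g. 0.02 for 2 %
  p95LatencyMs: number;
}

const ERROR_RATE_THRESHOLD = 0.02;      // > 2 %
const SUSTAINED_MS = 5 * 60 * 1000;     // for > 5 minutes
const LATENCY_FACTOR = 2;               // p95 > 2x baseline

function shouldActivateTier0(samples: HealthSample[], baselineP95Ms: number): boolean {
  const ordered = [...samples].sort((a, b) => a.timestampMs - b.timestampMs);
  const latest = ordered[ordered.length - 1];
  if (latest === undefined) return false; // no data, no trigger

  // Latency condition, evaluated on the most recent sample (assumption).
  if (latest.p95LatencyMs > LATENCY_FACTOR * baselineP95Ms) return true;

  // Error-rate condition: walk back from the latest sample while the
  // threshold is continuously breached, then measure how long that run is.
  let breachStartMs: number | null = null;
  for (const sample of [...ordered].reverse()) {
    if (sample.errorRate <= ERROR_RATE_THRESHOLD) break;
    breachStartMs = sample.timestampMs;
  }
  return breachStartMs !== null && latest.timestampMs - breachStartMs > SUSTAINED_MS;
}
```

Keeping the thresholds as named constants makes the pre-committed values easy to audit against the trigger list above.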

What this costs us vs. not investing

Daily eval against secondary: ~$5–15 per day in LLM spend on the secondary provider. Trivial.
Warm-standby 1–5 % production shadow: ~1–5 % of primary LLM spend. Acceptable insurance.
Maintaining prompt adaptation layer: hours per sprint to keep aligned. Low.
BAA with secondary vendor from launch: legal time, ~$0 marginal (BAAs are usually template).
SageMaker self-hosted runbook: one-time scaffold; maintenance is a weekly eval run.

Total ongoing cost: estimated <2 % of LLM spend at the 1 % launch shadow level, rising as the shadow share grows toward 5 %. Total ongoing benefit: the product survives any of the 10 risks in the register.
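The overhead can be sanity-checked with simple arithmetic. The sketch below counts only the two items that carry dollar figures (daily evals and shadow traffic); the monthly primary spend is a hypothetical input, and `overheadFraction` is an illustrative name.

```typescript
interface OverheadInputs {
  monthlyPrimarySpendUsd: number; // hypothetical; not stated in this document
  shadowFraction: number;         // 0.01 to 0.05 of primary spend
  dailyEvalUsd: number;           // roughly 5 to 15 per day
  daysPerMonth?: number;          // defaults to 30
}

// Ongoing overhead as a fraction of primary LLM spend.
function overheadFraction(input: OverheadInputs): number {
  if (input.monthlyPrimarySpendUsd <= 0) {
    throw new Error("monthlyPrimarySpendUsd must be positive");
  }
  const days = input.daysPerMonth ?? 30;
  const evalUsd = input.dailyEvalUsd * days;
  const shadowUsd = input.shadowFraction * input.monthlyPrimarySpendUsd;
  return (evalUsd + shadowUsd) / input.monthlyPrimarySpendUsd;
}
```

With a hypothetical $50,000 monthly spend and $10 per day of evals, a 1 % shadow share works out to 1.6 % overhead and a 5 % shadow share to 5.6 %, so the shadow percentage dominates the total once spend is large.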

What this does NOT promise

Zero user-visible impact during a cross-vendor switch. Quality shifts will be measurable; prompts may need real tuning under load. Tier 2 activations target "functional on cutover, continuously tuned against eval signal thereafter."
Identical output. Claude-ism, GPT-ism, Gemini-ism — the voice normalizer (see runtime — guardrails and camouflage) helps, but each model has its own tendencies. Evals catch regressions on substance, not style.
Zero-downtime Tier 3 activation. Self-hosting at full traffic requires GPU capacity planning; a cold-start scale-up from zero to a large member population is not a one-hour move. Tier 3 is a plan-for-the-worst, not a routine flex.

Related

Principles (unit economics): provider-switch scenarios must continue to hit the same targets or the switch is not accepted.
Runtime: where the abstraction lives and how the kill switch is wired.
Voice (vendor strategy): the same portability discipline applies to voice vendors.
Roadmap (revisit self-hosted): the cost-driven trigger for Tier 3 adoption as a primary (not just fallback).

Tracking

Open items on #61:

Secondary-vendor selection (Azure OpenAI vs. Vertex AI Gemini) — decision + BAA in place before Florence text launch
Daily-eval-vs-secondary CI pipeline
Self-hosted spike schedule (quarterly)
Warm-standby production traffic percentage (target: 1 % at launch, 5 % by six-month mark)