Adding (or removing) a tool

The tool surface will grow and change continuously. Drug coverage ships in Phase C (#17) and the provider directory in Phase D (#18). Appointment booking, claims, bill review, renewal analysis, carrier integrations, and agent-side drafting will all arrive as tools, and some earlier tools will be renamed, refactored, or deprecated.

This playbook is the standard path for adding, versioning, and removing a Florence tool. Follow it every time.

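Lifecycle at a glance

Proposal → design review → implementation → eval coverage → security review → beta → stable → deprecated → removed.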

Adding a tool — checklist

1. Proposal (before writing code)

Write a one-paragraph proposal in the PR description or a linked issue that answers:

What member / agent scenario does this tool unlock?
What deterministic endpoint does it wrap? (If the endpoint doesn't exist yet, this is two PRs.)
What data classes flow through it (input + output)?
Which auth contexts should be allowed?
What failure modes matter most (slow API, empty result, auth edge cases)?

2. Design review (required before implementation)

Confirm with the Florence AI architect (Taha by default):

[ ] Tool name follows conventions: api_verb_noun or ui_verb_noun.
[ ] Input / output shapes are reviewed for LLM-friendliness (stable field names, compact, self-describing).
[ ] Data classes assessed against the data classification doc. If any FTI or ApplicationPayload appears in the output, additional egress-control review is required.
[ ] Auth contexts reviewed: is anonymous OK? Agent cross-member access intentional?
[ ] Cacheability decided: TTL set, invalidation rules clear.
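
The naming rule in this checklist can be checked mechanically before review. The following is a minimal sketch, assuming names are lowercase snake_case with a verb followed by at least one noun segment; `isValidToolName` and `TOOL_NAME_PATTERN` are hypothetical names, not part of the Florence codebase.

```typescript
// Hypothetical helper: checks the api_verb_noun / ui_verb_noun convention.
// Assumption: names are lowercase snake_case with at least two segments
// after the prefix (a verb, then one or more noun words or a version suffix).
const TOOL_NAME_PATTERN = /^(api|ui)_[a-z]+(_[a-z0-9]+)+$/;

function isValidToolName(name: string): boolean {
  return TOOL_NAME_PATTERN.test(name);
}

console.log(isValidToolName("api_draft_renewal_outreach")); // true
console.log(isValidToolName("draftRenewalOutreach"));       // false
```

A check like this could run in CI against every registered name, so the convention does not depend on a reviewer spotting a stray camelCase name.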

3. Implementation

[ ] Write the tool in src/lib/florence/tools/api/<tool-name>.ts (or ui/<tool-name>.ts). One tool per file.
[ ] Export a FlorenceTool<Input, Output> object with every field populated.
[ ] Zod schemas for input and output.
[ ] Wire it into src/lib/florence/tools/registry.ts.
[ ] Add an entry to the tool registry doc.
[ ] If the tool reads plan / drug / provider data: handle SBE vs FFM correctly. Thread the user's state into the deterministic call, and confirm the discovery identifier matches the coverage identifier for owned-data states (NY, CA). CA providers key on Symphony providerId, not NPPES NPI — a tool that discovers via NPPES but checks coverage against CA data will silently return "not covered." See the SBE vs FFM data lookup section in the tool-surface contract + docs/data-sources/sbe-state-watchouts.md. Add at least one eval case per owned-data state you claim to support.
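
As a rough illustration of the implementation steps above, the sketch below shows one possible shape for a tool file and its registry wiring. The field names on `FlorenceTool`, the auth context names other than `authenticated_agent` and `authenticated_admin`, and the tool `api_check_drug_coverage` are all assumptions; the tool-surface contract is authoritative. To keep the sketch dependency-free, a small `Schema` interface stands in for the Zod schemas a real tool would use.

```typescript
// Stand-in for a Zod schema so the sketch runs without dependencies.
// Assumption: the real tool would use z.object(...) here instead.
interface Schema<T> {
  parse(value: unknown): T;
}

// "anonymous" and "authenticated_member" are assumed context names.
type AuthContext =
  | "anonymous"
  | "authenticated_member"
  | "authenticated_agent"
  | "authenticated_admin";

// Hypothetical field set; the tool-surface contract is authoritative.
interface FlorenceTool<Input, Output> {
  name: string;
  description: string;
  status: "beta" | "stable" | "deprecated";
  inputSchema: Schema<Input>;
  outputSchema: Schema<Output>;
  acceptsAuthContexts: AuthContext[];
  execute(input: Input): Promise<Output>;
}

interface CoverageInput { state: string; drugName: string }
interface CoverageOutput { covered: boolean }

const coverageInputSchema: Schema<CoverageInput> = {
  parse(value: unknown): CoverageInput {
    const v = value as Partial<CoverageInput> | null;
    if (!v || typeof v.state !== "string" || typeof v.drugName !== "string") {
      throw new Error("invalid input for api_check_drug_coverage");
    }
    return { state: v.state, drugName: v.drugName };
  },
};

const coverageOutputSchema: Schema<CoverageOutput> = {
  parse(value: unknown): CoverageOutput {
    const v = value as Partial<CoverageOutput> | null;
    if (!v || typeof v.covered !== "boolean") {
      throw new Error("invalid output for api_check_drug_coverage");
    }
    return { covered: v.covered };
  },
};

// Hypothetical tool. In the real file this const would be exported.
const apiCheckDrugCoverage: FlorenceTool<CoverageInput, CoverageOutput> = {
  name: "api_check_drug_coverage",
  description: "Checks whether a drug is covered in the user's state.",
  status: "beta",
  inputSchema: coverageInputSchema,
  outputSchema: coverageOutputSchema,
  acceptsAuthContexts: ["authenticated_member", "authenticated_agent"],
  // Placeholder result standing in for the deterministic endpoint call.
  execute: async () => coverageOutputSchema.parse({ covered: true }),
};

// Registry wiring keyed by tool name (registry shape assumed).
const toolRegistry = new Map<string, FlorenceTool<any, any>>([
  [apiCheckDrugCoverage.name, apiCheckDrugCoverage],
]);

console.log(toolRegistry.has("api_check_drug_coverage")); // true
```

Note that `state` is part of the input here, in line with the requirement to thread the user's state into the deterministic call.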

4. Eval coverage (blocking on merge)

Minimum eval bundle in scripts/florence-evals/tools/<tool-name>/:

[ ] At least three factual cases where the expected tool output is known and the response must include those facts.
[ ] At least one adversarial case: a question that looks computational but must route to the tool.
[ ] At least one auth-boundary case: call from a disallowed auth context must be rejected without the underlying endpoint being hit.
[ ] At least one hallucination-dragnet case if the tool returns numeric data.
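
The auth-boundary case is the easiest of these to get subtly wrong, because a rejection that happens after the endpoint call still passes a naive "was it denied?" check. The sketch below shows one way to assert both halves, using a hit counter as a spy. `invokeTool`, `GuardedTool`, and the context names `anonymous` and `authenticated_member` are assumptions for illustration, and the tool is synchronous only for brevity.

```typescript
type AuthContext =
  | "anonymous"
  | "authenticated_member"
  | "authenticated_agent"
  | "authenticated_admin";

// Hypothetical minimal tool shape for this eval sketch.
interface GuardedTool<I, O> {
  name: string;
  acceptsAuthContexts: AuthContext[];
  execute(input: I): O; // synchronous here for brevity
}

type InvokeResult<O> =
  | { ok: true; output: O }
  | { ok: false; reason: "auth_denied" };

// Check the allowlist first so a denied call never reaches the endpoint.
function invokeTool<I, O>(
  tool: GuardedTool<I, O>,
  ctx: AuthContext,
  input: I,
): InvokeResult<O> {
  if (!tool.acceptsAuthContexts.includes(ctx)) {
    return { ok: false, reason: "auth_denied" };
  }
  return { ok: true, output: tool.execute(input) };
}

// Eval case: the spy counts endpoint hits; the count must stay at zero.
let endpointHits = 0;
const agentOnlyTool: GuardedTool<{ memberId: string }, { drafted: boolean }> = {
  name: "api_draft_renewal_outreach",
  acceptsAuthContexts: ["authenticated_agent", "authenticated_admin"],
  execute: () => {
    endpointHits += 1;
    return { drafted: true };
  },
};

const denied = invokeTool(agentOnlyTool, "anonymous", { memberId: "m-1" });
console.log(denied.ok, endpointHits); // false 0
```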

5. Security review (blocking on merge for any tool touching PHI, PII, FTI, or authenticated endpoints)

[ ] Data classification sign-off: output class declared correctly.
[ ] Auth context allowlist reviewed.
[ ] Adapter-sink compatibility confirmed (the deterministic endpoint's destination vendor accepts the declared class).
[ ] Audit-log payload reviewed — does it capture what an auditor would need without over-retaining PII?
[ ] Cache semantics reviewed — member-specific outputs must not be cached across members.
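
For the cache-semantics item, a simple way to rule out cross-member leakage is to make the member ID a required part of the key. This is a sketch under assumptions: `memberScopedCacheKey`, the key format, and the tool name `api_get_claims` are hypothetical.

```typescript
// Hypothetical helper: every key for member-specific data is scoped by member ID.
function memberScopedCacheKey(
  toolName: string,
  memberId: string,
  args: Record<string, string>,
): string {
  if (memberId.length === 0) {
    throw new Error("member-specific cache key requires a member ID");
  }
  // Sort argument names so the same call always yields the same key.
  const argPart = Object.keys(args)
    .sort()
    .map((k) => `${k}=${args[k]}`)
    .join("&");
  return `${toolName}|member:${memberId}|${argPart}`;
}

const keyA = memberScopedCacheKey("api_get_claims", "m-1", { year: "2024" });
const keyB = memberScopedCacheKey("api_get_claims", "m-2", { year: "2024" });
console.log(keyA);          // api_get_claims|member:m-1|year=2024
console.log(keyA === keyB); // false
```

Making `memberId` a required parameter, rather than an optional one, means a forgotten ID fails at compile time instead of producing a shared key.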

6. Ship as beta

[ ] Feature flag on: FLORENCE_TOOL_<NAME>=beta.
[ ] Announced to the LLM in the tool-definitions block only when the flag is beta or stable.
[ ] Staging deploy verified end-to-end (see evals & observability).
[ ] Monitoring dashboard includes this tool's latency, cost, error rate, auth-denial count, and cache-hit rate.
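
The flag rule above amounts to a filter over the registered tool names. A minimal sketch follows, assuming `<NAME>` is the tool name upper-cased and that flags arrive as a plain string map; `flagNameFor`, `announcedTools`, and both tool names are hypothetical.

```typescript
// Assumption: <NAME> in FLORENCE_TOOL_<NAME> is the tool name upper-cased.
function flagNameFor(toolName: string): string {
  return `FLORENCE_TOOL_${toolName.toUpperCase()}`;
}

// Only tools whose flag is "beta" or "stable" are announced to the LLM.
// A missing flag or any other value keeps the tool out of the block.
function announcedTools(
  toolNames: string[],
  flags: Record<string, string | undefined>,
): string[] {
  return toolNames.filter((name) => {
    const flag = flags[flagNameFor(name)];
    return flag === "beta" || flag === "stable";
  });
}

const flags: Record<string, string | undefined> = {
  FLORENCE_TOOL_API_CHECK_DRUG_COVERAGE: "beta",
};

console.log(announcedTools(["api_check_drug_coverage", "api_find_provider"], flags));
// [ 'api_check_drug_coverage' ]
```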

7. Graduate to stable

Flip to stable once beta has sustained all of:

[ ] Zero unexplained auth denials.
[ ] p95 latency within budget (set per tool; default ≤ 800 ms for API-wrapper tools).
[ ] Eval pass rate ≥ 98 %.
[ ] Cost per invocation within 20 % of estimate.
[ ] Cache-hit rate within target (if cacheable).

Then update the tool's status in the tool registry.
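
The graduation criteria can be expressed as a single gate that returns the list of remaining blockers. The sketch below is one interpretation, not the team's actual tooling: it assumes "within 20 % of estimate" means absolute deviation in either direction, and that the cache-hit target is a minimum rate. `BetaMetrics`, `ToolBudget`, and `graduationBlockers` are hypothetical names.

```typescript
interface BetaMetrics {
  unexplainedAuthDenials: number;
  p95LatencyMs: number;
  evalPassRate: number;               // fraction between 0 and 1
  costPerInvocation: number;
  estimatedCostPerInvocation: number;
  cacheHitRate?: number;              // omitted for non-cacheable tools
}

interface ToolBudget {
  p95LatencyMs: number;               // 800 by default for API-wrapper tools
  minCacheHitRate?: number;           // assumption: the target is a minimum
}

// Returns the criteria not yet met; an empty list means ready for stable.
function graduationBlockers(m: BetaMetrics, budget: ToolBudget): string[] {
  const blockers: string[] = [];
  if (m.unexplainedAuthDenials > 0) blockers.push("unexplained auth denials");
  if (m.p95LatencyMs > budget.p95LatencyMs) blockers.push("p95 latency over budget");
  if (m.evalPassRate < 0.98) blockers.push("eval pass rate below 98%");
  // Assumption: "within 20%" is measured as absolute deviation from the estimate.
  const costDrift =
    Math.abs(m.costPerInvocation - m.estimatedCostPerInvocation) /
    m.estimatedCostPerInvocation;
  if (costDrift > 0.2) blockers.push("cost more than 20% off estimate");
  if (
    budget.minCacheHitRate !== undefined &&
    (m.cacheHitRate ?? 0) < budget.minCacheHitRate
  ) {
    blockers.push("cache-hit rate below target");
  }
  return blockers;
}

const slowTool: BetaMetrics = {
  unexplainedAuthDenials: 0,
  p95LatencyMs: 950,
  evalPassRate: 0.99,
  costPerInvocation: 0.011,
  estimatedCostPerInvocation: 0.01,
  cacheHitRate: 0.7,
};

console.log(graduationBlockers(slowTool, { p95LatencyMs: 800, minCacheHitRate: 0.6 }));
// [ 'p95 latency over budget' ]
```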

Versioning a tool

Tools evolve. Two rules:

Minor (additive, non-breaking). New optional input field. New output field. Relaxed validation. No version bump. Deploy straight to stable once evals pass.

Major (breaking). Input renamed, removed, or retyped. Output shape changed. Behavior changed. Ship the change as a new tool:

Create src/lib/florence/tools/api/<tool-name>-v2.ts. Register as api_<tool_name>_v2.
Keep the v1 tool in the registry marked deprecated for one audit window (one eval cycle minimum).
The system prompt announces only stable + beta tools — the v1 deprecated tool is still callable by in-flight conversations that already have it in context, but Florence no longer volunteers it.
Full eval bundle for v2 (independent of v1).
After the audit window, v1 is removed in a PR that also archives its eval bundle (kept for auditor traceability; not run in CI).
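
During the audit window the registry holds both versions, and the difference between "announced" and "callable" is what keeps in-flight conversations working. A minimal sketch of that distinction, assuming a flat registry of name and status pairs; `RegistryEntry`, `announcedToolNames`, `isCallable`, and the tool name `api_check_coverage` are hypothetical.

```typescript
type ToolStatus = "beta" | "stable" | "deprecated";

interface RegistryEntry {
  name: string;
  status: ToolStatus;
}

// v1 stays registered as deprecated while v2 is the announced version.
const registry: RegistryEntry[] = [
  { name: "api_check_coverage", status: "deprecated" },
  { name: "api_check_coverage_v2", status: "stable" },
];

// New conversations only hear about beta and stable tools.
function announcedToolNames(entries: RegistryEntry[]): string[] {
  return entries.filter((e) => e.status !== "deprecated").map((e) => e.name);
}

// In-flight conversations can still call anything that remains registered.
function isCallable(entries: RegistryEntry[], name: string): boolean {
  return entries.some((e) => e.name === name);
}

console.log(announcedToolNames(registry));                 // [ 'api_check_coverage_v2' ]
console.log(isCallable(registry, "api_check_coverage"));   // true
```

Once the audit window closes and the v1 entry is deleted, `isCallable` returns false for it, which matches the removal step.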

Deprecating / removing a tool

Tools go away. Most often because:

Deterministic endpoint was replaced.
Scope of Florence shifted (e.g. Ian decides agent-side Florence doesn't need api_draft_renewal_outreach because the marketing team owns that copy).
Data classification changed and the tool must be removed from a specific auth context.

Steps:

Mark the tool deprecated in its file. The registry reflects this automatically.
Announce in the sprint notes; flag the downstream code / prompts / UI that depended on it.
One audit window of dual-running (tool still callable, not announced to LLM for new conversations).
Remove the tool file + registry entry in a PR that:
- [ ] Deletes src/lib/florence/tools/{api,ui}/<tool-name>.ts.
- [ ] Removes registry entry.
- [ ] Updates tool registry doc — removed tools are moved to an "Archived" section with the removal date, so the auditor-facing trail is preserved.
- [ ] Archives the eval bundle (move scripts/florence-evals/tools/<tool-name>/ to scripts/florence-evals/tools/_archived/<tool-name>/).
- [ ] Bumps the system-prompt version (see runtime) — removing tools is a cache-invalidating change.

Anti-patterns

Things that look fine and aren't.

Hand-editing the LLM-visible tool description in one place and the Zod schema in another. Single source of truth: the Zod schema + the description field on the tool object. Regenerate the LLM-visible block from those.
Skipping eval coverage "because the API is tested." The API being tested doesn't tell us Florence calls it correctly. Eval coverage is non-negotiable.
Catching and swallowing deterministic-API errors inside the wrapper. If the API is slow or errors, Florence needs to know — she'll tell the user, offer to retry, or escalate. A wrapper that returns a faked "empty" result on error causes silent hallucinations downstream.
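Per-user prompt caching. Per-user data must not sit in the cached prefix: the Anthropic prompt cache matches on an exact prefix, so user-specific content there makes the prefix unique to each user and defeats sharing across users. Keep user_profile as its own prompt slot, cached separately; do not bake user-specific content into the stable prefix.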
Adding a tool "just for agents" without thinking about the prompt. Agent-mode Florence is a different system prompt + tool surface; see principles §9. Agent-only tools declare acceptsAuthContexts: ["authenticated_agent", "authenticated_admin"] and are registered only in the agent-mode tool list.
Caching member-specific outputs across members. The cache key must include the member ID (or any identifier that makes the result user-specific). A static-analysis check on cacheKey implementations flags keys that omit it.
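
The error-swallowing anti-pattern is easiest to see side by side with the alternative. In the sketch below, both wrappers catch the endpoint failure, but only the second lets the caller tell "the API failed" apart from "the API returned nothing." The `ToolResult` shape, the `retryable` field, and both wrapper names are assumptions, and the calls are synchronous only for brevity.

```typescript
// Hypothetical result shape that keeps failures visible to the caller.
type ToolResult<T> =
  | { ok: true; data: T }
  | { ok: false; error: { message: string; retryable: boolean } };

// Anti-pattern: a failure becomes indistinguishable from "no results".
function swallowingWrapper(fetchClaims: () => string[]): string[] {
  try {
    return fetchClaims();
  } catch (err) {
    return [];
  }
}

// Preferred: the failure is surfaced so Florence can tell the user,
// offer to retry, or escalate.
function surfacingWrapper(fetchClaims: () => string[]): ToolResult<string[]> {
  try {
    return { ok: true, data: fetchClaims() };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    // Assumption: upstream failures are treated as retryable by default.
    return { ok: false, error: { message, retryable: true } };
  }
}

const failingEndpoint = (): string[] => {
  throw new Error("upstream timeout");
};
const emptyEndpoint = (): string[] => [];

console.log(swallowingWrapper(failingEndpoint).length); // 0
console.log(swallowingWrapper(emptyEndpoint).length);   // 0
console.log(surfacingWrapper(failingEndpoint).ok);      // false
console.log(surfacingWrapper(emptyEndpoint).ok);        // true
```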

Tool surface — the shape every tool conforms to
Tool registry — living inventory
Evals & observability — eval harness detail
Data classification — the broader compliance-in-code plan