ADR 0005 — Delayed-job architecture for sub-hour transactional + 24h+ marketing

Status

Accepted — 2026-05-09.

Context

Several agent-flow features need to execute deferred work at a specific later time:

15-minute discovery survey reminder (ENG-242) — agent signs up via /agent-onboarding, doesn't complete /agent-discovery within 15 min, gets a single nudge with a resume link.
Resume email on first partial save (ENG-244) — fires inline from the partial-save handler. Already event-driven; not a delayed-job concern.
24h / 72h / 7d second/third nudges — marketing-class lifecycle cadence. Same trigger condition as the 15-min reminder but on longer windows.
Future: agent-activation email after admin approval ([Phase 5 / ENG-202 family]) — once-off email when an admin approves a NIPR-validated agent.
Future: renewal alerts at policy renewal time — service-of-record reminders; ~30-day notice window.
Future: Florence-AI-driven personalized nudges — high-volume per-member events.

The class of problem ("schedule arbitrary work for arbitrary later time, conditional on intermediate state") will recur. We need a default architecture for it.

We are deployed on AWS ECS Fargate post-#47 (Phase 10 cutover, v0.18.0). Email goes through AWS SES v2 (Resend retired in v0.33.0). AWS BAA covers ECS, SES, S3, Secrets Manager, CloudFront, CloudWatch, EventBridge, Lambda, SQS, Step Functions, and DynamoDB. We have an AWS Activate grant covering ~$1K+ of AWS-side spend.

The original ENG-242 implementation used a Vercel-Cron polling pattern. That was wrong for our deploy target — we are no longer on Vercel. Refactored in commit 18d9abd to AWS-native primitives. This ADR captures why.

We reviewed the 2025 SaaS landscape — Inngest (post-Mergent acquisition), Trigger.dev v3, Hatchet (YC-backed Postgres-based, MIT-licensed), Defer (sunset), Quirrel (acquired by Netlify), Vercel Queues (limited beta as of June 2025), Temporal Cloud (priced ~$100-500/mo floor with $6K startup credits), Convex (full-stack, no managed-tier BAA). The decision criteria forced by AskFlorence's stage:

HIPAA BAA mandatory. SES, EventBridge Scheduler, SQS, Lambda, and Step Functions are all on the AWS BAA. Inngest BAA is enterprise-only (negotiated, reports peg ~$500-1,500/mo+). Trigger.dev managed cloud has no HIPAA offering. Convex BAA is enterprise-only. Vercel Queues BAA is on Pro tier but we are not on Vercel.
Cost-sensitive (pre-revenue, AWS Activate grant covers AWS). $0 floor on EventBridge Scheduler (free tier covers 14M invocations/mo) vs $500-1,500/mo+ for Inngest BAA tier.
Single engineer. Whatever ships needs to not require dedicated platform-team operations.
Deploy target is AWS ECS — the choice should ride on existing IaC patterns rather than introducing a parallel deploy surface.

Decision

Two-tier architecture by time horizon and audience:

Sub-hour transactional delays (engineering-controlled, individual events): AWS EventBridge Scheduler one-shot per row, target = AWS Lambda thin-proxy invoking our app's HTTP endpoint with an internal token (INTERNAL_REMINDER_TOKEN).
24h+ marketing/lifecycle cadences (marketer-controlled, cohort-based): HubSpot lifecycle workflows triggered by HubSpot contact properties synced from the app. The marketer (Ian) builds + iterates campaigns in HubSpot UI without engineering tickets.
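
As an illustration of the property sync that feeds the HubSpot tier, a minimal sketch of the mapping from a waitlist row to a contact update body might look like the following. The `{ properties: { ... } }` wrapper is the general shape of a HubSpot CRM contact update, but the property names (`agent_signup_at`, `agent_discovery_status`), the row fields, and the function name are hypothetical and would need to match whatever is actually defined in the HubSpot portal.

```typescript
// Sketch only: maps an app-side waitlist row to a HubSpot contact update body.
// All property names and row fields below are hypothetical placeholders.

interface AgentRow {
  email: string;
  submittedAt: Date;
  discoveryCompletedAt: Date | null;
}

interface HubSpotContactUpdate {
  properties: Record<string, string>;
}

function toHubSpotProperties(row: AgentRow): HubSpotContactUpdate {
  return {
    properties: {
      email: row.email,
      // ISO 8601 timestamp the workflow can use as its enrollment reference point
      agent_signup_at: row.submittedAt.toISOString(),
      // The workflow would nudge only contacts still marked "incomplete"
      agent_discovery_status: row.discoveryCompletedAt ? "completed" : "incomplete",
    },
  };
}

console.log(
  toHubSpotProperties({
    email: "agent@example.com",
    submittedAt: new Date("2026-05-09T12:00:00Z"),
    discoveryCompletedAt: null,
  }),
);
```

The point of the split is that the app only reports state; when and how often to nudge stays in the HubSpot workflow definition.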

Concretely for the 15-minute reminder (ENG-242):

/api/waitlist agent-signup
  → scheduleDiscoveryReminder() — CreateScheduleCommand at submittedAt+15m
  → EventBridge Scheduler (one-shot, ActionAfterCompletion=DELETE)
T+15min:
  → EventBridge Scheduler invokes Lambda target with { email } payload
  → Lambda POSTs to /api/agents/discovery/send-reminder with internal token
  → route atomically claims the row, sends SES email, fires PostHog event
If user submits full survey within 15min:
  → /api/agents/discovery → cancelDiscoveryReminder() → DeleteScheduleCommand
  → reminder never fires
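
The scheduling step in the flow above can be sketched roughly as follows. This only builds the request shape that `CreateScheduleCommand` (from `@aws-sdk/client-scheduler`) accepts, so it runs without AWS credentials. The function name `buildReminderSchedule`, the group name, and both ARNs are placeholders for illustration, not values taken from the implementation.

```typescript
// Sketch only: builds a CreateSchedule request for a one-shot reminder.
// Field names follow the EventBridge Scheduler CreateSchedule API;
// the group name and ARNs are hypothetical placeholders.

interface ReminderScheduleInput {
  Name: string;
  GroupName: string;
  ScheduleExpression: string;
  ScheduleExpressionTimezone: string;
  FlexibleTimeWindow: { Mode: "OFF" };
  ActionAfterCompletion: "DELETE";
  Target: { Arn: string; RoleArn: string; Input: string };
}

const REMINDER_DELAY_MS = 15 * 60 * 1000;

// One-time schedules use at(yyyy-mm-ddThh:mm:ss): no milliseconds, no zone suffix.
function toAtExpression(when: Date): string {
  return `at(${when.toISOString().slice(0, 19)})`;
}

function buildReminderSchedule(
  scheduleName: string,
  email: string,
  submittedAt: Date,
): ReminderScheduleInput {
  const fireAt = new Date(submittedAt.getTime() + REMINDER_DELAY_MS);
  return {
    Name: scheduleName,
    GroupName: "agent-reminders", // hypothetical schedule group
    ScheduleExpression: toAtExpression(fireAt),
    ScheduleExpressionTimezone: "UTC", // matches the UTC string produced above
    FlexibleTimeWindow: { Mode: "OFF" },
    ActionAfterCompletion: "DELETE", // schedule removes itself after firing
    Target: {
      Arn: "arn:aws:lambda:us-east-1:000000000000:function:reminder-proxy", // placeholder
      RoleArn: "arn:aws:iam::000000000000:role/scheduler-invoke-reminder", // placeholder
      Input: JSON.stringify({ email }), // payload delivered to the Lambda target
    },
  };
}

const example = buildReminderSchedule(
  "agent-reminder-example",
  "agent@example.com",
  new Date("2026-05-09T12:00:00Z"),
);
console.log(example.ScheduleExpression); // at(2026-05-09T12:15:00)
```

In the real handler the returned object would be passed to `new CreateScheduleCommand(...)` and sent with a `SchedulerClient`; cancellation needs only the same `Name` and `GroupName`.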

Step Functions are reserved for multi-step durable workflows that aren't needed yet. Kafka / MSK is reserved for the case where Florence AI produces high-throughput multi-consumer event streams (Year 2-3+); even then, Kinesis Data Streams is the cheaper AWS-native first step before considering MSK.

Consequences

What we accept

Lambda thin-proxy is required. EventBridge Scheduler does not support raw HTTPS targets directly (Universal Target API is limited to AWS service ARNs). HTTPS targets need either an EventBridge API Destination (heavier Terraform) or a small Lambda. We picked Lambda. The Lambda code is ~30 lines, deployable via Terraform alongside the schedule group + IAM roles. It's another deployable surface, but it's small and AWS-native.
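
For a sense of scale, a thin proxy of this kind could look something like the sketch below. It is not the deployed Lambda: the header name `x-internal-token`, the `APP_BASE_URL` setting, and the injected `post` function are assumptions made so the example stays self-contained and testable.

```typescript
// Sketch of a thin-proxy Lambda. The header name, the APP_BASE_URL setting,
// and the injected `post` function are assumptions made for illustration.

interface ReminderEvent {
  email: string;
}

interface ProxyEnv {
  APP_BASE_URL: string;
  INTERNAL_REMINDER_TOKEN: string;
}

interface HttpRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

type Post = (req: HttpRequest) => Promise<{ status: number }>;

// Pure request builder, kept separate so it can be unit-tested without a network.
function buildReminderRequest(event: ReminderEvent, env: ProxyEnv): HttpRequest {
  return {
    url: `${env.APP_BASE_URL}/api/agents/discovery/send-reminder`,
    method: "POST",
    headers: {
      "content-type": "application/json",
      "x-internal-token": env.INTERNAL_REMINDER_TOKEN, // hypothetical header name
    },
    body: JSON.stringify({ email: event.email }),
  };
}

function makeHandler(env: ProxyEnv, post: Post) {
  return async (event: ReminderEvent): Promise<void> => {
    const res = await post(buildReminderRequest(event, env));
    // Throwing marks the invocation as failed; what happens next depends on
    // whatever retry settings are configured.
    if (res.status >= 500) {
      throw new Error(`send-reminder failed with status ${res.status}`);
    }
  };
}
```

In a Lambda runtime, `post` could be a small wrapper around the global `fetch`, with `env` read from the function's environment variables.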
Per-row schedule resource overhead. EventBridge Scheduler caps concurrent schedules at 1M per region (raisable). At AskFlorence scale (forecast: <10/day Year 1 → 1K/day Year 2 → 10K/day Year 3), this is well under the cap.
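
A back-of-envelope check of the forecast against the limits quoted in this ADR, assuming uniform arrivals, a schedule lifetime of about 15 minutes (created at signup, deleted on fire or cancel), and a 30-day month. The function names are illustrative.

```typescript
// Back-of-envelope capacity check. Assumes uniform arrivals and that each
// schedule lives about 15 minutes before it fires or is cancelled.

const MINUTES_PER_DAY = 24 * 60;
const SCHEDULE_CAP = 1e6; // concurrent schedules per region (figure from this ADR)
const FREE_INVOCATIONS_PER_MONTH = 14e6; // free tier (figure from this ADR)

// Little's law: items in the system = arrival rate x time each item spends in it.
function concurrentSchedules(perDay: number, lifetimeMinutes: number): number {
  return (perDay / MINUTES_PER_DAY) * lifetimeMinutes;
}

// Upper bound: counts every signup as an invocation, ignoring cancelled reminders.
function monthlyInvocations(perDay: number, daysPerMonth = 30): number {
  return perDay * daysPerMonth;
}

for (const perDay of [10, 1000, 10000]) {
  const live = concurrentSchedules(perDay, 15);
  const monthly = monthlyInvocations(perDay);
  console.log(
    `${perDay}/day: ~${live.toFixed(1)} live schedules ` +
      `(${((live / SCHEDULE_CAP) * 100).toFixed(4)}% of cap), ` +
      `${monthly} invocations/mo ` +
      `(${((monthly / FREE_INVOCATIONS_PER_MONTH) * 100).toFixed(2)}% of free tier)`,
  );
}
```

At the Year 3 forecast this works out to roughly 104 live schedules and at most 300,000 invocations a month under these assumptions.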
Schedule names must be deterministic for cancel idempotency. We use agent-reminder-${sha256(email).slice(0,32)} so cancel can find the schedule without storing its name on the waitlist row. Re-creating an already-existing schedule returns ConflictException (treated as success); deleting a non-existent schedule returns ResourceNotFoundException (treated as success).
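
The naming rule and the two "treated as success" cases can be sketched as below. The helper names are illustrative. The sketch hashes the email exactly as passed; whether the real code normalizes it first (trim, lowercase) is not stated here, so that is left to the caller.

```typescript
import { createHash } from "node:crypto";

// Deterministic schedule name, following the formula given above.
// Hashes the email exactly as passed; any normalization is the caller's job.
function reminderScheduleName(email: string): string {
  const digest = createHash("sha256").update(email).digest("hex");
  return `agent-reminder-${digest.slice(0, 32)}`;
}

type ScheduleOp = "create" | "delete";

// AWS SDK v3 service errors carry the error code in `name`.
function isBenignScheduleError(op: ScheduleOp, err: unknown): boolean {
  const name = err instanceof Error ? err.name : "";
  if (op === "create") return name === "ConflictException";
  return name === "ResourceNotFoundException";
}

// Wraps a scheduler call so the "already done" outcomes count as success.
async function idempotent(op: ScheduleOp, call: () => Promise<unknown>): Promise<void> {
  try {
    await call();
  } catch (err) {
    if (!isBenignScheduleError(op, err)) throw err;
  }
}

console.log(reminderScheduleName("agent@example.com").length); // 47
```

A cancel would then be something like `idempotent("delete", () => client.send(new DeleteScheduleCommand({ Name: reminderScheduleName(email) })))`, where `client` is a hypothetical `SchedulerClient` instance. At 47 characters the name stays within the scheduler's 64-character name limit.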
Vendor risk = none added. All primitives are AWS-managed. AWS BAA already signed. No new vendor contracts, no new BAAs, no new deploy pipelines.
HubSpot is the marketer surface. We don't build 24h+ cadences in code; Ian owns the templates, scheduling rules, and cohort segmentation. Trade-off: we depend on HubSpot's workflow engine being reliable (it is, at our scale). Engineering still owns the property sync that powers the workflows.

What we don't get

Inngest's superior dev experience — multi-step step.sleep(), step.waitForEvent(), observability dashboard. AWS-native equivalent is Step Functions + CloudWatch which is more verbose. Acceptable until use case count crosses 5+ AND coordination becomes painful.
Generic delayed-job abstraction. We're building one-off integrations per use case. With only one use case today (the 15-min reminder), a generic abstraction would be premature. Re-evaluate at use case 3+.
Sub-second timing precision. EventBridge Scheduler fires within roughly 30 seconds of the scheduled time in practice. Acceptable for "approximately 15 min after signup" semantics.

Alternatives considered

Inngest (Enterprise tier with BAA)

Rejected. Best dev experience in 2025 (TypeScript-native step.* API, dashboard, retry semantics built in). But BAA is enterprise-only with no public pricing — reports indicate $500-1,500/mo+. At our pre-revenue stage with the AWS Activate grant offsetting AWS costs to ~$0, the Inngest premium has no offsetting benefit yet. Re-evaluate when use case count crosses 5+ AND we have engineering time being burned on AWS-side workflow boilerplate.

Trigger.dev (managed cloud)

Rejected. No HIPAA on managed cloud. BYOC option exists but adds operational burden (run our own Trigger.dev backplane on AWS) without proportional benefit over EventBridge Scheduler.

Hatchet (Team tier with BAA)

Considered, deferred. YC-backed, MIT-licensed, Postgres-based. Team tier includes BAA at lower price than Inngest Enterprise. Self-host on our existing ECS cluster is technically feasible. Re-evaluate at use case count 5+ if Inngest pricing hasn't improved.

Vercel Queues

Rejected. Limited Beta as of June 2025 with BAA on Pro tier — but we are not on Vercel post-#47.

AWS Step Functions (with Wait state)

Reserved for multi-step workflows. Pricing $25 per million state transitions makes it expensive for single-delay use cases. Right tool when we need durable multi-step orchestration (e.g. "wait 24h, check status, branch on result, loop"). Not needed for the single 15-minute reminder.
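
To put a rough number on the per-transition pricing, the sketch below estimates a monthly bill if every reminder ran as its own workflow. The $25 per million figure is the one quoted above. The count of three billed transitions per execution is an assumption, and any free-tier allowance is ignored.

```typescript
// Rough monthly cost if every reminder were a Step Functions workflow.
// The price is the figure quoted in this ADR; the number of transitions per
// execution is an assumption and depends on how the state machine is drawn.

const USD_PER_MILLION_TRANSITIONS = 25;

function stepFunctionsMonthlyUsd(
  executionsPerDay: number,
  transitionsPerExecution: number,
  daysPerMonth = 30,
): number {
  const transitions = executionsPerDay * daysPerMonth * transitionsPerExecution;
  return (transitions / 1e6) * USD_PER_MILLION_TRANSITIONS;
}

for (const perDay of [10, 1000, 10000]) {
  const usd = stepFunctionsMonthlyUsd(perDay, 3);
  console.log(`${perDay}/day -> $${usd.toFixed(2)}/mo`);
}
```

Under these assumptions the Year 3 forecast (10K/day) comes to about $22.50 a month, against a reminder volume that stays inside the EventBridge Scheduler free tier.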

AWS SQS (with DelaySeconds)

Considered. Max DelaySeconds is 15 min (900 seconds), exactly our case for the reminder. Would require an SQS-triggered Lambda to drain the queue. Equivalent operational complexity to EventBridge Scheduler + Lambda; chose EventBridge Scheduler for the per-schedule observability and the cleaner cancel semantics (DeleteScheduleCommand vs. an SQS message-purge dance).

Kafka / MSK

Rejected. Wrong category — Kafka is a stream/log, not a scheduler. Would need a delayed-message scheduler built on top. Cost floor ~$3,300/yr for unused capacity. Right tool when we have hundreds of MB/sec sustained throughput AND multiple downstream consumers per event AND want event-sourcing semantics — none of which apply at AskFlorence scale yet.

MongoDB-backed delayed worker (poll DB for `runAt <= now`)

Considered, deferred. $0 incremental cost (uses existing M10 cluster), familiar pattern (Sidekiq / DelayedJob / pg-boss). Trade-off: we own worker reliability. Acceptable at a very early stage, but EventBridge Scheduler gives us AWS-managed reliability for the same effort. Re-evaluate if we grow into multiple use cases that benefit from a single generic worker.
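
For comparison, the polling pattern reduces to "find due jobs, claim each one exactly once". The sketch below uses an in-memory array as a stand-in for the database, with illustrative field names. In a real store the claim would have to be a single atomic conditional update (for example MongoDB's `findOneAndUpdate` with a filter on the unclaimed state), and that is the part of worker reliability we would own.

```typescript
// Sketch of the polling alternative: select jobs whose runAt has passed and
// claim each one so a later tick cannot process it twice.
// In-memory stand-in for the database; field names are illustrative.

interface DelayedJob {
  id: string;
  runAt: Date;
  claimedAt: Date | null;
}

// Stand-in for a conditional update: succeeds only if the job is due and unclaimed.
function tryClaim(job: DelayedJob, now: Date): boolean {
  if (job.claimedAt !== null || job.runAt.getTime() > now.getTime()) return false;
  job.claimedAt = now;
  return true;
}

// One worker tick: returns the ids of the jobs this tick claimed.
function pollOnce(jobs: DelayedJob[], now: Date): string[] {
  const claimed: string[] = [];
  for (const job of jobs) {
    if (tryClaim(job, now)) claimed.push(job.id);
  }
  return claimed;
}

const jobs: DelayedJob[] = [
  { id: "a", runAt: new Date("2026-05-09T12:15:00Z"), claimedAt: null },
  { id: "b", runAt: new Date("2026-05-09T12:30:00Z"), claimedAt: null },
];
const now = new Date("2026-05-09T12:20:00Z");
console.log(pollOnce(jobs, now)); // [ 'a' ]
console.log(pollOnce(jobs, now)); // []
```

Delivery latency in this model is bounded by the poll interval, and every tick costs a query even when nothing is due.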

DynamoDB TTL + Streams

Right tool for >24h delays. TTL precision is "within 48 hours of TTL expiry" per AWS docs, so wrong for sub-hour use cases. Strong candidate for renewal alerts (30 days out) and weekly digests. Not the answer for the 15-min reminder.

Cron via ECS Scheduled Tasks

Rejected. Polling-based — wastes compute when no rows match (most ticks are empty). Per-row precision is 0-N min depending on cron interval. EventBridge Scheduler beats it on every axis for our case.

HubSpot for everything

Rejected for transactional. HubSpot's workflow scheduling has minimum ~5-10 minute precision (depending on the trigger model) and is not designed for sub-hour transactional sends. The 15-min reminder is too tight + too engineering-controlled to live in HubSpot.

Revisit triggers (explicit)

Reopen this ADR (file ADR 0006 superseding) when one or more of these fire:

Trigger	New consideration
Delayed-job use case count crosses 5+ AND multi-step coordination becomes painful (e.g. "wait, check, branch, loop")	Generic abstraction layer + maybe Inngest if BAA pricing has improved, or Step Functions per workflow
Florence AI requires high-throughput stream processing (>1M events/day, multiple downstream consumers)	Kinesis Data Streams (not Kafka) — AWS-native + BAA-covered + 10-20x cheaper floor than MSK
Inngest publishes self-serve BAA pricing < $200/mo	Re-evaluate Inngest as the workflow tier — DX wins are real
Hatchet Cloud (managed) ships with BAA tier < $500/mo	Re-evaluate Hatchet — closest open-source competitor
ECS team grows to 3+ engineers AND we accumulate 10+ delayed-job use cases	Build vs buy reconsideration; might justify a small custom abstraction (Stripe Pelican / Shopify Postal pattern at smaller scale)
EDE Phase 3 audit demands an immutable event log beyond CloudTrail+Mongo append-only	EventBridge bus + Kinesis (or MSK at scale)
Vercel Queues exits Limited Beta with BAA on Pro tier AND we move part of the system back to Vercel	Reconsider if Vercel ever returns to the deploy mix

References

GitHub #103 — Discovery survey reminder email (15min stall) (ENG-242)
GitHub #105 — Save & resume token flow + bundled flow polish (ENG-244)
GitHub #110 — HubSpot lifecycle workflows for 24h+ agent nudges (ENG-249)
GitHub #111 — This ADR (ENG-250)
GitHub #47 — AWS migration (Phase 10 cutover, ECS Fargate)
GitHub #57 — Vendor BAA coverage tracking
Implementation: commit 18d9abd — Vercel Cron → AWS EventBridge Scheduler one-shot per row
AWS docs: EventBridge Scheduler limits, HIPAA-eligible services
2025 modern-SaaS landscape research: see Linear ENG-244 architecture thread for vendor-by-vendor BAA + pricing table