ADR 0005 — Delayed-job architecture for sub-hour transactional + 24h+ marketing
Status
Accepted — 2026-05-09.
Context
Several agent-flow features need to execute deferred work at a specific later time:
- 15-minute discovery survey reminder (ENG-242): an agent signs up via /agent-onboarding, doesn't complete /agent-discovery within 15 min, and gets a single nudge with a resume link.
- Resume email on first partial save (ENG-244): fires inline from the partial-save handler. Already event-driven; not a delayed-job concern.
- 24h / 72h / 7d second/third nudges — marketing-class lifecycle cadence. Same trigger condition as the 15-min reminder but on longer windows.
- Future: agent-activation email after admin approval (Phase 5 / ENG-202 family): once-off email when an admin approves a NIPR-validated agent.
- Future: renewal alerts at policy renewal time — service-of-record reminders; ~30-day notice window.
- Future: Florence-AI-driven personalized nudges — high-volume per-member events.
The class of problem ("schedule arbitrary work for arbitrary later time, conditional on intermediate state") will recur. We need a default architecture for it.
We are deployed on AWS ECS Fargate post-#47 (Phase 10 cutover, v0.18.0). Email goes through AWS SES v2 (Resend retired in v0.33.0). AWS BAA covers ECS, SES, S3, Secrets Manager, CloudFront, CloudWatch, EventBridge, Lambda, SQS, Step Functions, and DynamoDB. We have an AWS Activate grant covering ~$1K+ of AWS-side spend.
The original ENG-242 implementation used a Vercel-Cron polling pattern. That was wrong for our deploy target — we are no longer on Vercel. Refactored in commit 18d9abd to AWS-native primitives. This ADR captures why.
We reviewed the 2025 SaaS landscape — Inngest (post-Mergent acquisition), Trigger.dev v3, Hatchet (YC-backed Postgres-based, MIT-licensed), Defer (sunset), Quirrel (acquired by Netlify), Vercel Queues (limited beta as of June 2025), Temporal Cloud (priced ~$100-500/mo floor with $6K startup credits), Convex (full-stack, no managed-tier BAA). The decision criteria forced by AskFlorence's stage:
- HIPAA BAA mandatory. SES, EventBridge Scheduler, SQS, Lambda, Step Functions are all on the AWS BAA. Inngest BAA is enterprise-only (negotiated, reports peg ~$500-1,500/mo+). Trigger.dev managed cloud has no HIPAA. Convex enterprise-only. Vercel Queues BAA is on Pro tier but we are not on Vercel.
- Cost-sensitive (pre-revenue, AWS Activate grant covers AWS). $0 floor on EventBridge Scheduler (free tier covers 14M invocations/mo) vs $500-1,500/mo+ for Inngest BAA tier.
- Single engineer. Whatever ships needs to not require dedicated platform-team operations.
- Deploy target is AWS ECS — the choice should ride on existing IaC patterns rather than introducing a parallel deploy surface.
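The cost criterion can be sanity-checked with a few lines of arithmetic. The sketch below takes the 14M invocations/mo free-tier figure and the Year 1 to Year 3 volume forecast stated elsewhere in this ADR, and assumes one Scheduler invocation per reminder and a 31-day month; the function names are illustrative only.

```typescript
// Figures quoted in this ADR.
const FREE_TIER_INVOCATIONS_PER_MONTH = 14_000_000;
const forecastPerDay: Record<string, number> = {
  year1: 10, // upper bound of "<10/day"
  year2: 1_000,
  year3: 10_000,
};

// Assumption: one Scheduler invocation per reminder, worst-case 31-day month.
function monthlyInvocations(perDay: number, daysInMonth = 31): number {
  return perDay * daysInMonth;
}

// Fraction of the monthly free tier consumed at a given daily volume.
function freeTierShare(perDay: number): number {
  return monthlyInvocations(perDay) / FREE_TIER_INVOCATIONS_PER_MONTH;
}

for (const [label, perDay] of Object.entries(forecastPerDay)) {
  const pct = (freeTierShare(perDay) * 100).toFixed(3);
  console.log(`${label}: ${monthlyInvocations(perDay)} invocations/mo (${pct}% of free tier)`);
}
```

Even the Year 3 forecast uses only a few percent of the free tier under these assumptions, which is what makes the $0 floor credible for this use case.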
Decision
Two-tier architecture by time horizon and audience:
- Sub-hour transactional delays (engineering-controlled, individual events): AWS EventBridge Scheduler one-shot per row, target = AWS Lambda thin-proxy invoking our app's HTTP endpoint with an internal token (INTERNAL_REMINDER_TOKEN).
- 24h+ marketing/lifecycle cadences (marketer-controlled, cohort-based): HubSpot lifecycle workflows triggered by HubSpot contact properties synced from the app. The marketer (Ian) builds + iterates campaigns in HubSpot UI without engineering tickets.
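As a rough illustration of the thin-proxy idea in the first tier, a Lambda of this kind could look like the sketch below. It is not the code from the implementation commit: the `APP_BASE_URL` variable, the bearer-header convention, and the `{ email }` event type are assumptions made for the example. Only the endpoint path and the `INTERNAL_REMINDER_TOKEN` name come from this ADR.

```typescript
// Hypothetical event shape: the { email } payload the schedule passes to the Lambda target.
interface ReminderEvent {
  email: string;
}

interface ProxyRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Pure request builder, kept separate so it can be checked without a network.
function buildReminderRequest(appBaseUrl: string, token: string, event: ReminderEvent): ProxyRequest {
  return {
    url: `${appBaseUrl.replace(/\/+$/, "")}/api/agents/discovery/send-reminder`,
    init: {
      method: "POST",
      headers: {
        "content-type": "application/json",
        // Assumption: the token travels as a bearer header; the ADR only says "internal token".
        authorization: `Bearer ${token}`,
      },
      body: JSON.stringify({ email: event.email }),
    },
  };
}

// Lambda entry point (Node 18+ runtimes provide a global fetch).
async function handler(event: ReminderEvent): Promise<{ status: number }> {
  const { url, init } = buildReminderRequest(
    process.env.APP_BASE_URL ?? "", // hypothetical env var name
    process.env.INTERNAL_REMINDER_TOKEN ?? "",
    event,
  );
  const res = await fetch(url, init);
  // Throw on non-2xx so the invocation is recorded as failed instead of silently dropped.
  if (!res.ok) throw new Error(`send-reminder returned ${res.status}`);
  return { status: res.status };
}
```

Keeping the proxy free of business logic means the claim-and-send behaviour stays in the app route, where it can be tested with the rest of the codebase.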
Concretely for the 15-minute reminder (ENG-242):
/api/waitlist agent-signup
  → scheduleDiscoveryReminder(): CreateScheduleCommand at submittedAt+15m
  → EventBridge Scheduler (one-shot, ActionAfterCompletion=DELETE)
T+15min:
  → EBS invokes Lambda target with { email } payload
  → Lambda POSTs to /api/agents/discovery/send-reminder with internal token
  → route atomically claims the row, sends SES email, fires PostHog event
If user submits full survey within 15min:
  → /api/agents/discovery → cancelDiscoveryReminder() → DeleteScheduleCommand
  → reminder never fires

Step Functions are reserved for multi-step durable workflows that aren't needed yet. Kafka / MSK is reserved for the case where Florence AI produces high-throughput multi-consumer event streams (Year 2-3+); even then, Kinesis Data Streams is the cheaper AWS-native first step before considering MSK.
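For the scheduling step of the ENG-242 flow above, the input handed to CreateScheduleCommand might be assembled roughly as follows. This is a sketch, not the contents of scheduleDiscoveryReminder(): the field names follow the EventBridge Scheduler CreateSchedule API, while the helper names, the group name, and the ARNs in the usage lines are placeholders.

```typescript
// Plain-object mirror of the CreateSchedule fields used here (AWS SDK v3 input shape).
interface ScheduleInput {
  Name: string;
  GroupName: string;
  ScheduleExpression: string;
  ScheduleExpressionTimezone: string;
  FlexibleTimeWindow: { Mode: "OFF" };
  ActionAfterCompletion: "DELETE";
  Target: { Arn: string; RoleArn: string; Input: string };
}

// One-time schedules use at(yyyy-mm-ddThh:mm:ss): no milliseconds, no zone suffix.
function atExpression(when: Date): string {
  return `at(${when.toISOString().slice(0, 19)})`;
}

function buildReminderSchedule(opts: {
  name: string;
  groupName: string;
  email: string;
  submittedAt: Date;
  lambdaArn: string;
  roleArn: string;
  delayMinutes?: number;
}): ScheduleInput {
  const fireAt = new Date(opts.submittedAt.getTime() + (opts.delayMinutes ?? 15) * 60_000);
  return {
    Name: opts.name,
    GroupName: opts.groupName,
    ScheduleExpression: atExpression(fireAt),
    ScheduleExpressionTimezone: "UTC", // toISOString() is always UTC
    FlexibleTimeWindow: { Mode: "OFF" },
    ActionAfterCompletion: "DELETE", // the schedule removes itself after firing
    Target: {
      Arn: opts.lambdaArn,
      RoleArn: opts.roleArn,
      Input: JSON.stringify({ email: opts.email }),
    },
  };
}

// Usage with placeholder values (not real resources):
const example = buildReminderSchedule({
  name: "agent-reminder-example",
  groupName: "agent-reminders",
  email: "agent@example.com",
  submittedAt: new Date("2026-05-09T12:00:00Z"),
  lambdaArn: "arn:aws:lambda:us-east-1:000000000000:function:reminder-proxy",
  roleArn: "arn:aws:iam::000000000000:role/scheduler-invoke",
});
console.log(example.ScheduleExpression);
```

With the real SDK, an object of this shape would be passed to `new CreateScheduleCommand(...)` and sent through a `SchedulerClient`.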
Consequences
What we accept
- Lambda thin-proxy is required. EventBridge Scheduler does not support raw HTTPS targets directly (Universal Target API is limited to AWS service ARNs). HTTPS targets need either an EventBridge API Destination (heavier Terraform) or a small Lambda. We picked Lambda. The Lambda code is ~30 lines, deployable via Terraform alongside the schedule group + IAM roles. It's another deployable surface, but it's small and AWS-native.
- Per-row schedule resource overhead. EventBridge Scheduler limits 1M concurrent schedules per region (raisable). At AskFlorence scale (forecast: <10/day Year 1 → 1K/day Year 2 → 10K/day Year 3), this is well under the cap.
- Schedule names must be deterministic for cancel idempotency. We use agent-reminder-${sha256(email).slice(0,32)} so cancel can find the schedule without storing its name on the waitlist row. Re-creating an already-existing schedule returns ConflictException (treated as success); deleting a non-existent schedule returns ResourceNotFoundException (treated as success).
- Vendor risk = none added. All primitives are AWS-managed. AWS BAA already signed. No new vendor contracts, no new BAAs, no new deploy pipelines.
- HubSpot is the marketer surface. We don't build 24h+ cadences in code; Ian owns the templates, scheduling rules, and cohort segmentation. Trade-off: we depend on HubSpot's workflow engine being reliable (it is, at our scale). Engineering still owns the property sync that powers the workflows.
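The deterministic-name and idempotency points above can be sketched as follows. The name format is the one given in this ADR; the `idempotent` wrapper and its return values are illustrative, and the error check relies on AWS SDK v3 errors exposing the exception name in their `name` property.

```typescript
import { createHash } from "node:crypto";

// agent-reminder-${sha256(email).slice(0,32)}: 47 characters, hex plus hyphens only.
function reminderScheduleName(email: string): string {
  const digest = createHash("sha256").update(email).digest("hex");
  return `agent-reminder-${digest.slice(0, 32)}`;
}

type Op = "create" | "delete";

// Errors that mean "already in the desired state" for each operation.
const BENIGN: Record<Op, string> = {
  create: "ConflictException",
  delete: "ResourceNotFoundException",
};

function isBenign(op: Op, err: unknown): boolean {
  return typeof err === "object" && err !== null && (err as { name?: unknown }).name === BENIGN[op];
}

// Wraps a create or delete call so that repeats count as success; other errors propagate.
async function idempotent(op: Op, call: () => Promise<unknown>): Promise<"done" | "noop"> {
  try {
    await call();
    return "done";
  } catch (err) {
    if (isBenign(op, err)) return "noop";
    throw err;
  }
}

console.log(reminderScheduleName("agent@example.com").length);
```

One caveat: the name is only stable if the email is hashed in the same form at signup and at cancel time, so any normalization (trimming, lowercasing) has to happen consistently before hashing.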
What we don't get
- Inngest's superior dev experience: multi-step step.sleep(), step.waitForEvent(), observability dashboard. AWS-native equivalent is Step Functions + CloudWatch, which is more verbose. Acceptable until use case count crosses 5+ AND coordination becomes painful.
- Generic delayed-job abstraction. We're building one-off integrations per use case. With only one use case today (the 15-min reminder), a generic abstraction would be premature. Re-evaluate at use case 3+.
- Sub-minute timing precision. EventBridge Scheduler fires within ~30 seconds of the target time in practice, so sub-second exactness is out of reach. Acceptable for "approximately 15 min after signup" semantics.
Alternatives considered
Inngest (Enterprise tier with BAA)
Rejected. Best dev experience in 2025 (TypeScript-native step.* API, dashboard, retry semantics built in). But BAA is enterprise-only with no public pricing — reports indicate $500-1,500/mo+. At our pre-revenue stage with the AWS Activate grant offsetting AWS costs to ~$0, the Inngest premium has no offsetting benefit yet. Re-evaluate when use case count crosses 5+ AND we have engineering time being burned on AWS-side workflow boilerplate.
Trigger.dev (managed cloud)
Rejected. No HIPAA on managed cloud. BYOC option exists but adds operational burden (run our own Trigger.dev backplane on AWS) without proportional benefit over EventBridge Scheduler.
Hatchet (Team tier with BAA)
Considered, deferred. YC-backed, MIT-licensed, Postgres-based. Team tier includes BAA at lower price than Inngest Enterprise. Self-host on our existing ECS cluster is technically feasible. Re-evaluate at use case count 5+ if Inngest pricing hasn't improved.
Vercel Queues
Rejected. Limited Beta as of June 2025 with BAA on Pro tier — but we are not on Vercel post-#47.
AWS Step Functions (with Wait state)
Reserved for multi-step workflows. Pricing $25 per million state transitions makes it expensive for single-delay use cases. Right tool when we need durable multi-step orchestration (e.g. "wait 24h, check status, branch on result, loop"). Not needed for the single 15-minute reminder.
AWS SQS (with DelaySeconds)
Considered. Max DelaySeconds is 15 min — exactly our case for the reminder. Would require an SQS-triggered Lambda to drain the queue. Equivalent operational complexity to EventBridge Scheduler + Lambda; chose EBS for the per-schedule observability + the cleaner cancel semantics (DeleteScheduleCommand vs SQS message-purge dance).
Kafka / MSK
Rejected. Wrong category — Kafka is a stream/log, not a scheduler. Would need a delayed-message scheduler built on top. Cost floor ~$3,300/yr for unused capacity. Right tool when we have hundreds of MB/sec sustained throughput AND multiple downstream consumers per event AND want event-sourcing semantics — none of which apply at AskFlorence scale yet.
MongoDB-backed delayed worker (poll DB for runAt <= now)
Considered, deferred. $0 incremental cost (uses existing M10 cluster), familiar pattern (Sidekiq / DelayedJob / pg-boss). Trade-off: we own worker reliability. Acceptable for very-early-stage but EBS gives us AWS-managed reliability for the same effort. Re-evaluate if we grow into multiple use cases that benefit from a single generic worker.
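To make the "we own worker reliability" trade-off concrete, here is a minimal in-memory sketch of the polling pattern. The array stands in for a collection and `claimDueJobs` stands in for a single atomic conditional update (with MongoDB, something like `findOneAndUpdate` filtered on an unclaimed, due row). All names are illustrative and none of this is existing project code.

```typescript
interface Job {
  id: string;
  runAt: number; // epoch milliseconds
  claimedAt: number | null;
}

// Claims every job that is due and not yet claimed. In a real store the check and
// the write must be one atomic operation, or two workers can claim the same row.
function claimDueJobs(jobs: Job[], now: number): Job[] {
  const claimed: Job[] = [];
  for (const job of jobs) {
    if (job.runAt <= now && job.claimedAt === null) {
      job.claimedAt = now;
      claimed.push(job);
    }
  }
  return claimed;
}

// Two poll ticks over the same data: the second must not re-claim job "a".
const jobs: Job[] = [
  { id: "a", runAt: 1_000, claimedAt: null },
  { id: "b", runAt: 5_000, claimedAt: null },
];
const firstTick = claimDueJobs(jobs, 2_000).map((j) => j.id);
const secondTick = claimDueJobs(jobs, 6_000).map((j) => j.id);
console.log(firstTick, secondTick);
```

Everything outside this function (the poll loop, crash recovery for claimed-but-unfinished jobs, retry limits) is what the team would own under this alternative.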
DynamoDB TTL + Streams
Right tool for >24h delays. TTL precision is "within 48 hours of TTL expiry" per AWS docs, so wrong for sub-hour use cases. Strong candidate for renewal alerts (30 days out) and weekly digests. Not the answer for the 15-min reminder.
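For the renewal-alert case, the TTL value is a Unix timestamp in seconds stored in a numeric attribute. A small sketch of the arithmetic, using the 30-day notice window and the 48-hour deletion lag quoted in this ADR (function names are illustrative):

```typescript
const SECONDS_PER_DAY = 86_400;

// TTL attribute value: epoch seconds at which the item should expire,
// set noticeDays before the renewal date.
function ttlForRenewalNotice(renewalDate: Date, noticeDays = 30): number {
  return Math.floor(renewalDate.getTime() / 1000) - noticeDays * SECONDS_PER_DAY;
}

// Effective notice if deletion (and the resulting stream record) arrives as late
// as the lag figure cited above.
function worstCaseNoticeDays(noticeDays = 30, ttlLagHours = 48): number {
  return noticeDays - ttlLagHours / 24;
}

const ttl = ttlForRenewalNotice(new Date("2027-01-31T00:00:00Z"));
console.log(new Date(ttl * 1000).toISOString(), worstCaseNoticeDays());
```

A two-day slip on a 30-day window is tolerable; the same slip on a 15-minute window is not, which is the reason this option is reserved for long delays.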
Cron via ECS Scheduled Tasks
Rejected. Polling-based — wastes compute when no rows match (most ticks are empty). Per-row precision is 0-N min depending on cron interval. EventBridge Scheduler beats it on every axis for our case.
HubSpot for everything
Rejected for transactional. HubSpot's workflow scheduling has minimum ~5-10 minute precision (depending on the trigger model) and is not designed for sub-hour transactional sends. The 15-min reminder is too tight + too engineering-controlled to live in HubSpot.
Revisit triggers (explicit)
Reopen this ADR (file ADR 0006 superseding) when one or more of these fire:
| Trigger | New consideration |
|---|---|
| Delayed-job use case count crosses 5+ AND multi-step coordination becomes painful (e.g. "wait, check, branch, loop") | Generic abstraction layer + maybe Inngest if BAA pricing has improved, or Step Functions per workflow |
| Florence AI requires high-throughput stream processing (>1M events/day, multiple downstream consumers) | Kinesis Data Streams (not Kafka) — AWS-native + BAA-covered + 10-20x cheaper floor than MSK |
| Inngest publishes self-serve BAA pricing < $200/mo | Re-evaluate Inngest as the workflow tier — DX wins are real |
| Hatchet Cloud (managed) ships with BAA tier < $500/mo | Re-evaluate Hatchet — closest open-source competitor |
| ECS team grows to 3+ engineers AND we accumulate 10+ delayed-job use cases | Build vs buy reconsideration; might justify a small custom abstraction (Stripe Pelican / Shopify Postal pattern at smaller scale) |
| EDE Phase 3 audit demands an immutable event log beyond CloudTrail+Mongo append-only | EventBridge bus + Kinesis (or MSK at scale) |
| Vercel Queues exits Limited Beta with BAA on Pro tier AND we move part of the system back to Vercel | Reconsider if Vercel ever returns to the deploy mix |
References
- GitHub #103 — Discovery survey reminder email (15min stall) (ENG-242)
- GitHub #105 — Save & resume token flow + bundled flow polish (ENG-244)
- GitHub #110 — HubSpot lifecycle workflows for 24h+ agent nudges (ENG-249)
- GitHub #111 — This ADR (ENG-250)
- GitHub #47 — AWS migration (Phase 10 cutover, ECS Fargate)
- GitHub #57 — Vendor BAA coverage tracking
- Implementation: commit 18d9abd (Vercel Cron → AWS EventBridge Scheduler one-shot per row)
- AWS docs: EventBridge Scheduler limits, HIPAA-eligible services
- 2025 modern-SaaS landscape research: see Linear ENG-244 architecture thread for vendor-by-vendor BAA + pricing table