Deferred architecture decisions
This page is the canonical home for architecture decisions that are correctly deferred today but worth documenting so future engineers (and auditors) can see we've thought about them. Pattern: keep Linear backlog focused on actively-actionable work; park "revisit when conditions X happen" decisions here.
For each deferred decision: current state, known limitations, trigger conditions to revisit, proposed migration plan + effort, cross-references.
Companion to: docs/data-sources/cms-dependency-map.md (same pattern, applied to CMS dependency posture).
How to use this page:
- Reviewing this page quarterly is enough to catch shifts in trigger conditions
- When a trigger fires, file a Linear issue for the migration work and reference the section here
- New deferred decisions land here, not as perpetual Linear backlog issues
Rate-limiter storage: per-task in-memory → Redis (ElastiCache)
Source: ENG-286 audit M8. Originally filed as ENG-334 (cancelled 2026-05-14 — moved here).
Current state
Per-task in-memory Map<ip, timestamps[]> rate limiter in src/lib/agent-db.ts:136-195. Used by waitlist + agent-discovery + (post-ENG-321) every state-changing POST + every CMS-proxy route.
typescript
// Pattern (simplified):
const buckets = new Map<string, number[]>();
function checkRateLimit(ip: string, limit: number, windowMs: number): boolean {
  const now = Date.now();
  const timestamps = (buckets.get(ip) ?? []).filter(t => now - t < windowMs);
  if (timestamps.length >= limit) return false;
  timestamps.push(now);
  buckets.set(ip, timestamps);
  return true;
}
Known limitation: fuzzy cap
- ECS service runs N tasks (currently 2 in prod)
- Each task holds its own in-memory Map
- Effective user-facing cap is N × configured cap — a user load-balanced across tasks gets up to N× the per-task throughput
- Not a breakage; it just means a configured 30/5min is in practice up to 60/5min for a real user
ENG-321 explicitly documents acceptance of this fuzziness: for anti-scraping defense it's a speed bump, not a hard ceiling. A real scraper still hits a meaningful (even if fuzzy) cap.
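The N × cap effect can be reproduced with a small simulation. This is a sketch, not the production code: makeTaskLimiter and countAllowed are hypothetical names, each limiter instance stands in for one ECS task's in-memory Map, and strict round-robin load balancing is assumed (a real load balancer will spread requests less evenly, so the effective cap lands somewhere between 1× and N×).

```typescript
// Simulation only: each "task" is a separate limiter with its own Map,
// mirroring one ECS task's in-memory state.
type Limiter = (ip: string) => boolean;

function makeTaskLimiter(limit: number, windowMs: number): Limiter {
  const buckets = new Map<string, number[]>();
  return (ip) => {
    const now = Date.now();
    const timestamps = (buckets.get(ip) ?? []).filter((t) => now - t < windowMs);
    if (timestamps.length >= limit) return false;
    timestamps.push(now);
    buckets.set(ip, timestamps);
    return true;
  };
}

// Hypothetical helper: send `requests` from one IP, round-robin across tasks,
// and count how many are allowed through.
function countAllowed(taskCount: number, limit: number, requests: number): number {
  const windowMs = 5 * 60 * 1000;
  const tasks = Array.from({ length: taskCount }, () => makeTaskLimiter(limit, windowMs));
  let allowed = 0;
  for (let i = 0; i < requests; i++) {
    if (tasks[i % taskCount]?.("203.0.113.7")) allowed++;
  }
  return allowed;
}

console.log(countAllowed(1, 30, 100)); // 30: one task enforces the configured cap
console.log(countAllowed(2, 30, 100)); // 60: two tasks double the effective cap
```

With 4 tasks the same simulation allows 120 of 200 requests, which is the drift the ≥4-task trigger below is guarding against.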
Trigger conditions to revisit
Migrate to shared-state rate limiting when any of:
- ECS scales to ≥4 tasks (effective cap drift ≥4x, abuse defense gets too loose)
- Legitimate traffic grows to where N × per-task cap matters for legitimate UX (currently low volume; user portal milestone will change this)
- Anti-scraping precision becomes a strategic requirement vs. "speed bump" deterrent
- Specific abuse pattern observed that the fuzzy cap permits (e.g., scraper exploiting per-task state intentionally)
Proposed migration plan
Target: ElastiCache Redis cluster in VPC, shared across all ECS tasks for rate-limit state.
Scope:
- Provision ElastiCache Redis cluster (cache.t4g.micro for start) in existing VPC subnet group
- Wire security group: ECS task SG → Redis SG, port 6379
- Add Redis connection string to Secrets Manager (prod/redis-rate-limit, staging/redis-rate-limit)
- Refactor src/lib/agent-db.ts rate-limit logic to use Redis INCR + EXPIRE:
typescript
async function checkRateLimit(ip: string, route: string, limit: number, windowSec: number): Promise<boolean> {
  const key = `rl:${route}:${ip}`;
  const count = await redis.incr(key);
  if (count === 1) await redis.expire(key, windowSec);
  return count <= limit;
}
- Update ECS task-def to inject Redis connection string env var
- Add Redis health check to startup probes
- Fall-open behavior: if Redis unreachable, allow request (log WARN with [rate-limit-degraded] marker per ENG-330 observability pattern)
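The fall-open behavior in the last scope item could be sketched roughly as follows. Everything beyond the INCR + EXPIRE pattern and the [rate-limit-degraded] marker is an assumption: CounterStore is a hypothetical minimal interface standing in for whichever Redis client is chosen, WarnFn stands in for the project's logger, and checkRateLimitFailOpen is an illustrative name.

```typescript
// Assumed minimal surface of a Redis client; real clients expose far more.
interface CounterStore {
  incr(key: string): Promise<number>;
  expire(key: string, seconds: number): Promise<unknown>;
}

// Hypothetical logger signature; swap in the project's own WARN-level logger.
type WarnFn = (message: string) => void;

async function checkRateLimitFailOpen(
  store: CounterStore,
  warn: WarnFn,
  ip: string,
  route: string,
  limit: number,
  windowSec: number,
): Promise<boolean> {
  const key = `rl:${route}:${ip}`;
  try {
    const count = await store.incr(key);
    // The first hit in a window starts the TTL (fixed-window counting).
    if (count === 1) await store.expire(key, windowSec);
    return count <= limit;
  } catch (err) {
    // Fall open: an unreachable store must not block legitimate traffic.
    const reason = err instanceof Error ? err.message : String(err);
    warn(`[rate-limit-degraded] allowing ${route} request: ${reason}`);
    return true;
  }
}
```

Injecting the store and logger keeps the function testable without a live Redis. One caveat worth weighing at migration time: INCR followed by EXPIRE is two commands, so a task that dies between them leaves a counter with no TTL; running both inside a Lua script or a MULTI transaction would close that gap.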
Effort: ~4h (Terraform + code + verification)
Reversibility: trivial — revert code change keeps in-memory map; ElastiCache cluster can stay running for future use or be destroyed.
Related future opportunities (not part of the rate-limiter migration itself)
When Redis lands for rate limiting, two adjacent opportunities to consider (file separate Linear issues at that time):
- Marketing session storage: ENG-322's marketing_sessions Mongo collection could optionally move to Redis. Faster reads (~ms vs ~10-50ms), but Mongo with encryption-at-rest is sufficient for marketing-tier (non-PHI) data. Decide at migration time based on actual perf data.
- Distributed locks for delayed-job coordination: currently scheduler-coordinated; Redis-backed locks would enable finer-grained job orchestration. Phase 5 (user portal) work may need this.
Cross-references
- src/lib/agent-db.ts:134 (current rate-limiter)
- ENG-321: rate limits + Origin allowlist + test bypass (consumer of this rate-limiter today)
- ENG-322: session-cookie architecture (potential future co-tenant on Redis)
- ENG-330: graceful degradation + observability pattern (same [degraded] log marker pattern applies)
- ENG-286 audit doc docs/audit/comprehensive-code-review-2026-05-12.md, finding M8
ECS task execution role: shared → per-task-def secret ARN scoping
Source: ENG-286 audit I16. Originally filed as ENG-333 (cancelled 2026-05-14 — moved here).
Current state
Prod ECS task execution role gets secretsmanager:GetSecretValue on every ARN in values(module.secrets.secret_arns) — broader than the per-task-def need.
hcl
# infra/envs/prod/ecs.tf:117
task_execution_secret_arns = values(module.secrets.secret_arns)
Known limitation: defense-in-depth gap
- The TASK role (different from execution role) is correctly narrow
- The EXECUTION role's job is to pull secrets at task startup and inject them as env vars
- Execution role is invoked once per task spawn, never used by the running app
- If an attacker compromised the execution role (highly unusual: it is a startup-time credential, not a runtime one), they could pull secrets the task doesn't reference
- BUT: the task definition's secrets_from_manager map limits which secrets actually get injected into the running task at runtime
So this is defense-in-depth: tightening the IAM grant would align it with the task-def need, but the current gap doesn't change runtime behavior. The audit explicitly flagged this as Info-severity ("fine for current scale, refine when scaling").
Trigger conditions to revisit
Tighten to per-task-def secret ARN scoping when any of:
- User portal milestone adds new task definitions (multiple task defs sharing one execution role = bigger blast radius if execution role compromised)
- SOC 2 audit specifically flags least-privilege evidence requirements
- General Terraform refactor sweeps the ECS module (fold this in for free)
Proposed migration plan
In infra/modules/ecs-service (or per-env config), build the list of secret ARNs from the actual task_definition.secrets_from_manager map rather than values(module.secrets.secret_arns). Each task def's execution role policy contains only the ARNs that task def references.
Effort: ~30min Terraform refactor + verification (IAM policy JSON diff pre/post)
Reversibility: trivial — revert the Terraform change.
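The scoping rule and the pre/post policy diff can be illustrated outside Terraform. This is a TypeScript sketch of the logic only, not the HCL change itself: the map shapes are assumptions (secret name to ARN for module.secrets.secret_arns, env var name to secret name for secrets_from_manager), and the function names, secret names, and ARNs are hypothetical.

```typescript
// Assumed shapes; the real Terraform values may be structured differently.
type SecretArns = Record<string, string>; // secret name -> ARN
type SecretsFromManager = Record<string, string>; // env var name -> secret name

// ARNs the execution role needs for one task def: only what that task def references.
function scopedSecretArns(all: SecretArns, taskSecrets: SecretsFromManager): string[] {
  const arns = new Set<string>();
  for (const envVar of Object.keys(taskSecrets)) {
    const secretName = taskSecrets[envVar] ?? "";
    const arn = all[secretName];
    if (arn === undefined) {
      throw new Error(`task def references unknown secret: ${secretName}`);
    }
    arns.add(arn);
  }
  return Array.from(arns).sort();
}

// ARNs the role would lose after scoping, i.e. the expected pre/post policy diff.
function droppedArns(all: SecretArns, scoped: string[]): string[] {
  const keep = new Set(scoped);
  return Object.keys(all)
    .map((name) => all[name] ?? "")
    .filter((arn) => !keep.has(arn))
    .sort();
}

// Hypothetical example data.
const allSecrets: SecretArns = {
  "prod/example-a": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/example-a",
  "prod/example-b": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/example-b",
  "prod/example-c": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/example-c",
};
const webTaskSecrets: SecretsFromManager = { A_KEY: "prod/example-a", B_KEY: "prod/example-b" };

const scoped = scopedSecretArns(allSecrets, webTaskSecrets);
console.log(scoped.length); // 2
console.log(droppedArns(allSecrets, scoped)); // only the prod/example-c ARN
```

During verification, the dropped list should match exactly the ARNs that disappear from the IAM policy JSON; anything else in the diff suggests the task def references a secret the refactor missed.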
Cross-references
- infra/envs/prod/ecs.tf:117 (current task-execution-role secret ARN grant)
- ENG-286 audit doc docs/audit/comprehensive-code-review-2026-05-12.md, finding I16
Pattern: when does a decision belong here vs in Linear?
Belongs in Linear (actionable now, time-boxed, milestone-bound):
- The work has an immediate acute pain it addresses
- The work is part of an active milestone or sprint
- The work has a definite "done" state achievable in the current cycle
Belongs here (deferred-pending-trigger):
- No acute pain today
- Specific trigger conditions exist that would change the calculus
- Migration plan can be sketched but execution waits for the trigger
- Auditor / future engineer benefits from documented thinking
When a trigger fires:
- File a new Linear issue
- Reference the relevant section here
- The Linear issue captures the execution; this doc captures the decision and trigger
This page is reviewed quarterly to catch shifts in trigger conditions and surface any items that have become actionable.