
Architecture: multi-agent sessions

How a chat thread, a scheduled automation, and a long-running cooperative background task share the same primitives — session, agents, tasks, runs, messages — in a Postgres-backed schema that survives process restarts and scales across pods.

Schema

Ten tables live under the automation Postgres schema, created by Alembic migration 029_automation_platform_init. They are owned by the chat/automation surface, not the rest of the app:

| Table | Role |
| --- | --- |
| agent_sessions | Top-level unit of work. Kind ∈ {interactive, automation, background}. |
| automations | Reusable blueprint that spawns sessions when triggers fire. |
| automation_triggers | One row per trigger condition. Indexed on next_fire_at for the scheduler tick. |
| agent_definitions | A role/agent attached to a session: planner, worker, judge, researcher, marketer, … |
| agent_tasks | DAG of work; depends_on for dependency ordering. |
| agent_runs | One execution attempt. Workers claim with SELECT ... FOR UPDATE SKIP LOCKED. |
| agent_messages | Durable inter-agent inbox. Replaces in-process queues; LISTEN/NOTIFY for low-latency delivery. |
| agent_artifacts | Outputs (file writes, PRs opened, emails sent, social posts). URI + sha256 + size. |
| agent_memory_entries | Per-automation key/value with optional TTL + confidence; survives across runs. |
| agent_credentials | Fernet-encrypted OAuth tokens + API keys. Cross-tenant access rejected at resolver. |
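
As an illustration of how depends_on can drive ordering in agent_tasks, here is a minimal in-memory sketch. The Task class and promote_ready function are hypothetical names, and the rule "a PENDING task becomes READY once every dependency is DONE" is an assumption inferred from the status names used elsewhere on this page, not a description of the actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    # Hypothetical in-memory mirror of an agent_tasks row.
    id: str
    status: str = "PENDING"
    depends_on: list[str] = field(default_factory=list)


def promote_ready(tasks: dict[str, Task]) -> list[str]:
    """Move PENDING tasks to READY once every dependency is DONE.

    Returns the ids of the tasks promoted in this pass.
    """
    promoted = []
    for task in tasks.values():
        if task.status != "PENDING":
            continue
        if all(tasks[dep].status == "DONE" for dep in task.depends_on):
            task.status = "READY"
            promoted.append(task.id)
    return promoted
```

A task whose dependency is only READY or RUNNING stays PENDING, so a chain such as plan → write → review unlocks one step per completed task.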

AutomationDaemon

One daemon instance lives in every API pod. Three coroutines:

  • scheduler_tick_loop: every MIDCORE_AUTOMATION_TICK_SECONDS (default 30s), selects automation_triggers rows with next_fire_at <= now(), inserts agent_runs rows in QUEUED status, and recomputes next_fire_at for cron triggers. A Postgres advisory lock guards the tick so multiple replicas don't fan out the same fire.
  • run_worker_loop: MIDCORE_AUTOMATION_WORKERS concurrent coroutines (default 4). Each claims a run with SELECT ... FOR UPDATE SKIP LOCKED LIMIT 1, so horizontal pod scaling is safe, then dispatches to AutomationEngine and persists the outcome.
  • watchdog_loop: every 60s, detects runs with heartbeat_at < now() - MIDCORE_AUTOMATION_HEARTBEAT_STALE (default 180s), marks them STALLED, and re-queues them up to max_attempts times.
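
The watchdog rule can be sketched as a pure function over in-memory rows. This is a rough model only: the Run fields, the RUNNING status name, and the reading of "up to max_attempts times" as "re-queue while attempt < max_attempts" are all assumptions made for the example; only the 180s default and the QUEUED and STALLED statuses come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Default of MIDCORE_AUTOMATION_HEARTBEAT_STALE, per the text above.
HEARTBEAT_STALE = timedelta(seconds=180)


@dataclass
class Run:
    # Hypothetical subset of agent_runs columns.
    id: str
    status: str
    heartbeat_at: datetime
    attempt: int
    max_attempts: int


def sweep(runs: list[Run], now: datetime) -> None:
    """Mark stale RUNNING runs as STALLED, then re-queue those with attempts left."""
    for run in runs:
        if run.status != "RUNNING" or run.heartbeat_at >= now - HEARTBEAT_STALE:
            continue
        run.status = "STALLED"
        if run.attempt < run.max_attempts:
            # Assumed policy: a re-queued run counts as a new attempt.
            run.attempt += 1
            run.status = "QUEUED"
```

In this model, a run that has exhausted its attempts stays STALLED, so it remains visible for inspection instead of looping forever.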

Worker id format: <hostname>-<pid>-<random8> — so a single run is traceable to the exact pod that claimed it, even after pod restarts.
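
A worker id in that shape could be built as follows. This is a sketch, assuming that random8 means eight random hex characters; make_worker_id is a hypothetical name. The CLAIM_SQL string shows one common way to pair such an id with a SKIP LOCKED claim; its column names (claimed_by, created_at) and the RUNNING status are assumptions, not the real schema.

```python
import os
import secrets
import socket


def make_worker_id() -> str:
    """Return <hostname>-<pid>-<random8>, with random8 as 8 hex characters."""
    return f"{socket.gethostname()}-{os.getpid()}-{secrets.token_hex(4)}"


# Illustrative claim statement; column names are hypothetical.
# SKIP LOCKED makes concurrent workers pass over rows that another
# transaction has already locked instead of blocking on them.
CLAIM_SQL = """
UPDATE automation.agent_runs
SET status = 'RUNNING', claimed_by = $1, heartbeat_at = now()
WHERE id = (
    SELECT id
    FROM automation.agent_runs
    WHERE status = 'QUEUED'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id
"""
```

The random suffix keeps ids distinct when a restarted container reuses both its hostname and its pid.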

DurableCoordinator (LISTEN/NOTIFY)

Inter-agent messaging is durable: every message is one row in agent_messages. The coordinator wraps the table with Postgres LISTEN/NOTIFY for ≤10ms delivery, and falls back to polling when LISTEN/NOTIFY isn't reachable (rare: SQLite-backed tests, or PgBouncer pooling modes that don't hold a dedicated server connection for the listener).

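Channel naming: automation_session_<uuid_with_underscores>. NOTIFY payloads are just the message UUID; the subscriber re-reads the row from the DB. This keeps payloads far below the Postgres NOTIFY payload limit (just under 8000 bytes in the default configuration) and makes the row, not the notification, the source of truth. Provided the NOTIFY is issued in the same transaction as the insert, Postgres delivers it only at commit, so the row is visible by the time the listener re-reads it.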
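
The naming rule can be sketched in a few lines. The function name channel_for is hypothetical; the prefix and the underscore substitution come from the description above. Hyphens are not valid in an unquoted Postgres identifier, which is a plausible reason for the substitution.

```python
import uuid

CHANNEL_PREFIX = "automation_session_"


def channel_for(session_id: uuid.UUID) -> str:
    """Build the LISTEN/NOTIFY channel name for a session."""
    return CHANNEL_PREFIX + str(session_id).replace("-", "_")
```

The result is 55 characters long, which fits within the 63-byte limit Postgres applies to identifiers by default.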

Public API:

  • publish(session_id, message) writes the row and issues the NOTIFY.
  • subscribe(session_id, to_agent_def_id?) async-iterates new messages, with optional backfill from since_message_id.
  • ack(message_id) sets delivered_at for at-least-once callers.
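
To make the publish / backfill / ack semantics concrete, here is a toy in-memory model. It is not the Postgres-backed coordinator: there is no NOTIFY and no async iteration, integer ids stand in for message UUIDs so that "since" has an obvious ordering, and the class, its method signatures, and the unacked helper are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Message:
    id: int  # stand-in for the message UUID; ordered so backfill is simple
    session_id: str
    to_agent_def_id: Optional[str]
    body: str
    delivered_at: Optional[datetime] = None


class InMemoryCoordinator:
    """Toy model of publish / backfill / ack over a list of rows."""

    def __init__(self) -> None:
        self._rows: list[Message] = []

    def publish(self, session_id: str, body: str,
                to_agent_def_id: Optional[str] = None) -> int:
        msg = Message(len(self._rows) + 1, session_id, to_agent_def_id, body)
        self._rows.append(msg)
        return msg.id

    def backfill(self, session_id: str, since_message_id: int = 0,
                 to_agent_def_id: Optional[str] = None) -> list[Message]:
        # Messages published after since_message_id, optionally filtered
        # to a single recipient agent definition.
        return [
            m for m in self._rows
            if m.session_id == session_id
            and m.id > since_message_id
            and (to_agent_def_id is None or m.to_agent_def_id == to_agent_def_id)
        ]

    def ack(self, message_id: int) -> None:
        self._rows[message_id - 1].delivered_at = datetime.now(timezone.utc)

    def unacked(self, session_id: str) -> list[Message]:
        # At-least-once: anything without delivered_at may be delivered again.
        return [m for m in self._rows
                if m.session_id == session_id and m.delivered_at is None]
```

Because a message stays in unacked until ack is called, a consumer that crashes mid-processing sees the message again, which is why handlers in an at-least-once design are usually written to be idempotent.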

Session lifecycle

draft  ─enable─▶  active  ─pause─▶  paused  ─resume─▶  active
                  │                                    │
                  └─complete──▶ done                   └──fail──▶ failed
                                  │
                                  └──archive─▶ archived
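
The diagram above can be read as a small transition table. The sketch below assumes the two "active" boxes are the same state, so complete and fail are both available whenever a session is active; the names TRANSITIONS and apply_action are hypothetical.

```python
# (current state, action) -> next state, transcribed from the diagram.
TRANSITIONS = {
    ("draft", "enable"): "active",
    ("active", "pause"): "paused",
    ("paused", "resume"): "active",
    ("active", "complete"): "done",
    ("active", "fail"): "failed",
    ("done", "archive"): "archived",
}


def apply_action(state: str, action: str) -> str:
    """Return the next session state, or raise if the move is not in the diagram."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(
            f"cannot {action!r} a session in state {state!r}"
        ) from None
```

Rejecting unknown pairs keeps invalid moves, such as archiving a paused session, from silently changing state.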

Interactive sessions are chat threads. Automation sessions are triggered runs of a defined automation. Background sessions are user-spawned multi-agent goals (e.g. "draft a marketing campaign, review my inbox, summarize the week"). All three share the same primitives so the UI, daemon, and replay engine treat them uniformly.

Live trace (SSE)

GET /api/v1/sessions/{id}/trace coalesces three streams into one Server-Sent Events feed:

  • event: run — new agent_runs row (claimed / started / finished).
  • event: message — new agent_messages row delivered to the session.
  • event: task — task status transition (PENDING → READY → RUNNING → DONE/FAILED).

Reconnects honor the standard Last-Event-ID header: the server treats the header value as a timestamp and replays events created after it.
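
For reference, one frame of such a feed could be serialized as shown below. The id, event, and data field names and the blank-line terminator come from the Server-Sent Events wire format; the function name format_sse and the JSON payload shape are assumptions made for the example.

```python
import json


def format_sse(event: str, data: dict, event_id: str) -> str:
    """Serialize one Server-Sent Events frame.

    The id line is what the browser echoes back in Last-Event-ID
    when it reconnects.
    """
    payload = json.dumps(data, separators=(",", ":"))
    return f"id: {event_id}\nevent: {event}\ndata: {payload}\n\n"
```

Compact JSON keeps the payload on a single data line; a payload containing raw newlines would need one data: line per line of text.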

Why durable, why now

Before this design, multi-agent state lived in process memory and the legacy asyncio.Queue coordinator. That worked when one pod owned the whole session, but it lost messages on pod restart and couldn't span horizontally scaled API workers. The new schema-backed design unlocks four things: scheduled work continues when the user's laptop closes; pod restarts no longer drop in-flight conversations; multi-pod scale-out is correct by construction (SKIP LOCKED); and replay is just SELECT * FROM agent_messages WHERE session_id = ….
