Architecture: multi-agent sessions
How a chat thread, a scheduled automation, and a long-running cooperative background task share the same primitives — session, agents, tasks, runs, messages — in a Postgres-backed schema that survives process restarts and scales across pods.
Schema
Ten tables live under the automation Postgres schema, created by Alembic migration 029_automation_platform_init. They are owned by the chat/automation surface, not the rest of the app:
| Table | Role |
|---|---|
agent_sessions | Top-level unit of work. Kind ∈ {interactive, automation, background}. |
automations | Reusable blueprint that spawns sessions when triggers fire. |
automation_triggers | One row per trigger condition. Indexed on next_fire_at for the scheduler tick. |
agent_definitions | A role/agent attached to a session: planner, worker, judge, researcher, marketer, … |
agent_tasks | DAG of work; depends_on for dependency ordering. |
agent_runs | One execution attempt. Workers claim with SELECT ... FOR UPDATE SKIP LOCKED. |
agent_messages | Durable inter-agent inbox. Replaces in-process queues; LISTEN/NOTIFY for low-latency delivery. |
agent_artifacts | Outputs (file writes, PRs opened, emails sent, social posts). URI + sha256 + size. |
agent_memory_entries | Per-automation key/value with optional TTL + confidence; survives across runs. |
agent_credentials | Fernet-encrypted OAuth tokens + API keys. Cross-tenant access rejected at resolver. |
AutomationDaemon
One daemon instance lives in every API pod. Three coroutines:
- scheduler_tick_loop — every
MIDCORE_AUTOMATION_TICK_SECONDS(default 30s), selectautomation_triggerswithnext_fire_at <= now(), insertagent_runsrows in QUEUED status, recomputenext_fire_atfor cron triggers. Guarded by a Postgres advisory lock so multiple replicas don't fan out the same fire. - run_worker_loop —
MIDCORE_AUTOMATION_WORKERSconcurrent coroutines (default 4). Each claims withSELECT ... FOR UPDATE SKIP LOCKED LIMIT 1so horizontal pod scaling is safe. Dispatches to AutomationEngine, persists outcome. - watchdog_loop — every 60s, detect runs with
heartbeat_at < now() - MIDCORE_AUTOMATION_HEARTBEAT_STALE(default 180s). Mark as STALLED and re-queue up tomax_attemptstimes.
Worker id format: <hostname>-<pid>-<random8> — so a single run is traceable to the exact pod that claimed it, even after pod restarts.
DurableCoordinator (LISTEN/NOTIFY)
Inter-agent messaging is durable: every message is one row in agent_messages. The coordinator wraps it with Postgres LISTEN/NOTIFY for ≤10ms delivery and a polling fallback when LISTEN/NOTIFY isn't reachable (rare — sqlite tests, weird pgbouncer modes).
Channel naming: automation_session_<uuid_with_underscores>. NOTIFY payloads are just the message UUID — the subscriber re-reads the row from the DB. This keeps payloads under the 8KiB PG NOTIFY limit and avoids races where a listener sees the NOTIFY before the transaction commits.
Public API: publish(session_id, message) writes + NOTIFY; subscribe(session_id, to_agent_def_id?) async-iterates new messages with optional backfill from since_message_id; ack(message_id) sets delivered_at for at-least-once callers.
Session lifecycle
draft ─enable─▶ active ─pause─▶ paused ─resume─▶ active
│ │
└─complete──▶ done └──fail──▶ failed
│
└──archive─▶ archivedInteractive sessions are chat threads. Automation sessions are triggered runs of a defined automation. Background sessions are user-spawned multi-agent goals (e.g. "draft a marketing campaign, review my inbox, summarize the week"). All three share the same primitives so the UI, daemon, and replay engine treat them uniformly.
Live trace (SSE)
GET /api/v1/sessions/{id}/trace coalesces three streams into one Server-Sent-Events feed:
event: run— newagent_runsrow (claimed / started / finished).event: message— newagent_messagesrow delivered to the session.event: task— task status transition (PENDING → READY → RUNNING → DONE/FAILED).
Reconnect honors the standard Last-Event-ID header — events created after that timestamp replay on reconnect.
Why durable, why now
Before this design, multi-agent state lived in process memory and the legacyasyncio.Queue coordinator. That worked when one pod owned the whole session but lost messages on pod restart and couldn't span horizontally-scaled API workers. The new schema-backed design unlocks four things: scheduled work continues when the user's laptop closes; pod restarts no longer drop in-flight conversations; multi-pod scale-out is correct by construction (SKIP LOCKED); replay is just SELECT * FROM agent_messages WHERE session_id = ….