Agent System (WG)

This is a normative, implementation-agnostic guide for implementing an LLM-driven agent system. It is meant to play the same role for agent runtimes that the Agent Client Protocol plays for editor ↔ agent integration and that the Language Server Protocol plays for language tooling.

It answers one question: what does an agent system look like if it can host a code agent, a design agent, or any other agent without rewriting the core?

The shape is host-agnostic: it holds for a desktop daemon, a cloud sandbox runtime, a CLI, an IDE plugin, or a hosted multi-tenant service. UX (window, panel, picker) is out of scope except where a UX requirement reaches back into the protocol.

Conventions

The keywords MUST, MUST NOT, SHOULD, SHOULD NOT, and MAY are to be interpreted as described in RFC 2119.

| Identifier shape | Convention used in this guide |
|---|---|
| Field, column, and function names | snake_case |
| Path variables | kebab-case (e.g. {user-data}, {workspace}) |
| Type names | PascalCase (e.g. SessionStatus, ChatSessionRow) |
| ACP wire identifiers | Carried verbatim from upstream (camelCase); translated at the seam. See ACP / naming seam. |
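
The convention for ACP wire identifiers implies a rename step at the boundary. A minimal sketch of such a seam, assuming a plain recursive key rename is enough: the helpers `camel_to_snake` and `to_internal` are hypothetical names, and the sample payload is illustrative rather than a statement of the exact ACP schema.

```typescript
// Hypothetical seam helper: rename one camelCase identifier to snake_case.
function camel_to_snake(name: string): string {
  return name.replace(/[A-Z]/g, (ch) => "_" + ch.toLowerCase());
}

type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Recursively rename object keys. Values, including strings, are left untouched.
function to_internal(value: Json): Json {
  if (Array.isArray(value)) return value.map((item) => to_internal(item));
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const [key, inner] of Object.entries(value)) {
      out[camel_to_snake(key)] = to_internal(inner);
    }
    return out;
  }
  return value;
}

// Illustrative wire payload (assumed shape, not taken from the ACP spec).
const wire: Json = { sessionId: "s1", toolCall: { toolCallId: "t1", status: "pending" } };
const internal = to_internal(wire);
console.log(JSON.stringify(internal));
```

Keeping the rename in one function at the boundary means the rest of the system only ever sees snake_case, which is the point of translating "at the seam". A real adapter would also need the inverse mapping for outbound messages.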

Vocabulary

  • Agent — a config object: a system prompt + a tool list + a resolved model. Agent-as-data, not agent-as-class.
  • Session — one conversation. Carries messages, parts, token rollups, and a parent pointer for forks. Persistent; survives the client process.
  • Turn — one round-trip from a user message through assistant output (text, reasoning, tool calls, tool outputs) to a finished state. The unit a user can rewind to.
  • Tool — a self-describing capability the agent can invoke. The set of fundamental tools is locked across agents; MCP tools and skills extend it without changing the contract.
  • Runtime — the per-run capability surface handed to the agent (fs, net, shell, stream). Backed by a sandbox the agent does not see.
  • Host — the process that loads the agent system. Desktop app, CLI, server, cloud sandbox. The host decides UI; the system decides protocol.
  • Environment — where the host (and therefore the agent) runs: web, cloud sandbox, or computer. See environments.

What this is

A normative guide:

  • Names the invariants every implementor MUST honor for an agent to be portable.
  • Names the policies each implementor picks for their product shape.
  • Specifies the wire-level shapes (session schema, chunk vocabulary, result envelope) that two conforming implementations agree on.

What this is not

  • A model-provider router. Provider selection (Anthropic, OpenAI, cloud gateways, BYOK) is a sibling concern. The guide only requires that whichever provider is picked feeds the same AI-SDK-v6 chunk shape.
  • A UI framework. Window / tab / sidebar / picker decisions belong to the host. The guide touches UX only where UX requirements bend the protocol (compositor format, sidecar branching, queued sends).
  • A billing engine. Usage rollups land on the session row so a billing layer can read them; pricing is not the agent system's job.
  • A multi-agent orchestration graph. Agents call subagents through the locked task tool; there are no chains, no DAGs, no shared cross-run state.

Pages

The guide is organized as a set of pages. Read Foundations first; the rest can be read in any order.

| Page | Covers |
|---|---|
| Foundations | Bedrock: AI SDK v6 chunk shape, directory-rooted execution, the locked tool set summary, watchdog placement, web search, cross-cutting invariants. |
| AI SDK (reference substrate) | Implementor's annex to AI SDK's own docs. The token-usage cache normalization rule, where the SDK's tool-loop helper fits, what the RFC adds on top of the substrate. |
| Runtime Environments | Web / cloud sandbox / computer. Which capabilities each environment exposes; how the locked tool set degrades; sandbox primitives. |
| Sandbox Runtime (srt) | srt as the reference implementation of the computer environment's sandbox primitive. Capability surface, platform support, what the protocol does and does not lock to. |
| Session Lifecycle | Context tracking, rewinding, branching, compaction (auto + manual + failure), per-turn model switch, streaming, interruption, session status, permission scopes. |
| Persistency | Storage engine, the three-table schema, save policy, ID strategy, JSON discipline, event-log opt-in, schema evolution. |
| Tools | The locked fundamental set, the tool contract, capability requirements, result envelope, truncation, watchdog at the tool boundary, ACP kind mapping. |
| MCP and Connectors | User-plugged MCP servers, lazy materialization, tool_search for bulk discovery, OAuth, dynamic refresh, the untrusted-by-default trust policy. |
| Skills and Project Instructions | Two layers of knowledge: skills (lazy, advertise-then-load) and project instructions (eager, unconditional). Discovery sources, manifests, decision matrix. |
| Binary file handling | Glossary / reference. Three resolution paths (provider-native multimodal, skill-per-format, shell-based conversion), the format matrix (pdf / zip / pptx / psd / fig / …), the scratch-space pattern for archive extraction. |
| Subagents | The task tool, agent modes, blocking vs background, recursion, permission inheritance, inspectability, awareness, specialized subagents, opinionated patterns. |
| Triggers | Non-human-originated turns. Schedule / external webhook / programmatic API / agent self-schedule / MCP-pushed event sources. Trigger envelope on metadata_json.trigger, queue semantics, interactive-vs-hosted execution, lifecycle bounds, auth and trust. |
| Compositor | User intent representation. The multipart user-message shape, file refs vs attachments, inline commands, mentions, editor context (host-emitted selection / open / cursor / recent-action), attachment handling, and the user-view-vs-model-view lowering rules. |
| UX Patterns | What rides on top of the compositor: queued sends, sidecar chat as ephemeral branch, memory as a built-on-top layer. |
| Debugging | The canonical inspection format, export paths, what an inspection tool MUST expose, replay semantics, the DX checklist. |
| ACP Integration | The Agent Client Protocol as the default outward wire. Method mapping, capability matrix, where the protocol and the guide diverge. |
| FAQ | Question-and-answer index over the guide. Doubles as an entry point and as a conformance test: if a Q cannot be answered from the RFC, the RFC owes a clarification. |

Cross-cutting invariants

The following hold across every implementor:

| Layer | Invariant | Policy |
|---|---|---|
| Loop | One universal LLM loop drives any agent | Native vs AI SDK runtime path; cancel semantics |
| Agent | Agent-as-data: { manifest, tools, system_prompt } | Where the manifest lives; how it is compiled |
| Tools | Locked fundamental set; self-describing parameters | Which tools beyond the lock; how MCP is surfaced |
| Session | Three-table shape: chat_sessions / chat_messages / chat_parts | DB engine (SQLite default; alternatives); event-log opt-in |
| Streaming | AI SDK v6 chunk shape internally | Transport (SSE / IPC / WS); resume semantics |
| Outward protocol | ACP-conformant when an external client speaks ACP | Whether to ship an ACP adapter; which capabilities to advertise |
| Compaction | Auto-fire on overflow; user-fire on demand; failure modes named | Threshold tuning; which model summarizes; tail-budget |
| Skills | Discovered once; names + descriptions injected; body loaded lazily | Where to look; remote-skill fetch policy |
| Subagents | Same loop, gated by intersected permissions; deny rules unconditional | Recursion limit; whether parent inspects child |
| Sandbox | Capability surface, not free spawn | OS-level enforcement (seatbelt / landlock / VM); per-call sub-policies |
| Persistence | Save on every chunk by default | Storage engine; write-buffer trade-off |
| Model switch per turn | Allowed; carries to the next turn | What to do if new model has smaller context (force compaction vs error) |
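
One way to read the Subagents invariant ("gated by intersected permissions; deny rules unconditional") is sketched below. This is an interpretation under stated assumptions, not the guide's permission model: the `PermissionSet` shape, the flat capability names, and the helpers `child_permissions` and `is_permitted` are all hypothetical, and `net.fetch` is an invented capability name used only for the example.

```typescript
// Hypothetical permission shape: flat capability names, no wildcards.
interface PermissionSet {
  allow: readonly string[];
  deny: readonly string[];
}

// A child never gains a capability the parent lacks (intersection of allows),
// and every deny rule from either side carries over (union of denies).
function child_permissions(parent: PermissionSet, requested: PermissionSet): PermissionSet {
  return {
    allow: requested.allow.filter((cap) => parent.allow.includes(cap)),
    deny: Array.from(new Set([...parent.deny, ...requested.deny])),
  };
}

// Deny is checked first, so it wins even when the capability is also allowed.
function is_permitted(perms: PermissionSet, cap: string): boolean {
  if (perms.deny.includes(cap)) return false;
  return perms.allow.includes(cap);
}

const parent_perms: PermissionSet = {
  allow: ["fs.read", "fs.write", "shell.run"],
  deny: ["net.fetch"],
};
const requested_perms: PermissionSet = {
  allow: ["fs.read", "net.fetch", "shell.run"],
  deny: ["shell.run"],
};
const child_perms = child_permissions(parent_perms, requested_perms);
console.log(child_perms);
```

In this example the child ends up able to use only `fs.read`: `net.fetch` was never granted to the parent, `fs.write` was not requested, and `shell.run` is denied even though both sides list it under allow.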

Abstract

What matters most

The single decision that compounds across every other one is whether the system treats an agent as data or as code. An agent-as-data system publishes a config ({ manifest, tools, system_prompt, model? }) and runs one universal loop over it. An agent-as-code system publishes a function per agent.

This guide picks agent-as-data because:

  • Specialization is cheap. A "title" agent, a "summary" agent, a "compaction" agent are all the same loop with different config.
  • The system is inspectable. Diff two agent configs to see what changed.
  • The runtime is auditable in one place — one stream loop to read, one abort path, one permission gate.

Everything else in the guide follows from that choice.
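
To make the choice concrete, here is a minimal sketch of agent-as-data: two configs run through one loop. It is an illustration under assumptions, not the guide's reference loop. The `AgentConfig` fields follow `{ manifest, tools, system_prompt, model? }`, but the manifest contents, the `read` tool, the `ModelStep` shape, and the scripted model standing in for a provider call are all hypothetical.

```typescript
// Hypothetical config shape following { manifest, tools, system_prompt, model? }.
interface AgentConfig {
  manifest: { name: string; description: string };
  tools: string[];
  system_prompt: string;
  model?: string;
}

type ModelStep =
  | { kind: "text"; text: string }
  | { kind: "tool_call"; tool: string; input: string };

// Stand-in for a provider call; a real system would stream chunks instead.
type ModelFn = (system_prompt: string, transcript: string[]) => ModelStep;
type ToolFn = (input: string) => string;

// The one universal loop: nothing in here is specific to any agent.
function run_agent(
  agent: AgentConfig,
  user_message: string,
  model: ModelFn,
  tool_impls: Record<string, ToolFn>,
  max_steps = 8,
): string[] {
  const transcript = [`user: ${user_message}`];
  for (let step = 0; step < max_steps; step++) {
    const out = model(agent.system_prompt, transcript);
    if (out.kind === "text") {
      transcript.push(`assistant: ${out.text}`);
      return transcript;
    }
    // Only tools listed in the config are callable.
    const impl = agent.tools.includes(out.tool) ? tool_impls[out.tool] : undefined;
    const result = impl ? impl(out.input) : `error: tool ${out.tool} not available`;
    transcript.push(`tool ${out.tool}: ${result}`);
  }
  transcript.push("error: step limit reached");
  return transcript;
}

const title_agent: AgentConfig = {
  manifest: { name: "title", description: "Names a session" },
  tools: [],
  system_prompt: "Write a short title.",
};
const code_agent: AgentConfig = {
  manifest: { name: "code", description: "Edits a repository" },
  tools: ["read"],
  system_prompt: "You are a coding agent.",
};

// Scripted model: asks for `read` once, then answers after any read result.
const scripted_model: ModelFn = (_prompt, transcript) =>
  transcript.some((line) => line.startsWith("tool read:"))
    ? { kind: "text", text: "done" }
    : { kind: "tool_call", tool: "read", input: "README.md" };

const tool_impls: Record<string, ToolFn> = { read: (path) => `contents of ${path}` };

console.log(run_agent(code_agent, "hi", scripted_model, tool_impls));
console.log(run_agent(title_agent, "hi", scripted_model, tool_impls));
```

The two runs differ only because the configs differ: the code agent's `read` call executes, while the title agent's identical call is refused because `read` is not in its tool list. The loop itself has no branch on agent identity.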

Properties that follow

  • Dynamic, task-agnostic workflow. A code agent and a design agent differ only by manifest. Adding a new agent type does not rewrite the loop, the session schema, the streaming layer, or the tool contract — it adds a config.
  • Parallel workflow. A subagent is the same agent loop on a child session. The parent's loop continues while children run; results return as tool outputs. Parallelism is a function call, not a new framework. See subagents.
  • Safety and harness. The agent never touches the OS directly. Every shell call, every file read, every network fetch goes through a capability the runtime declared. The runtime sits on top of a sandbox the host owns. See environments.
  • Watchdog. A pre-execute hook on every tool call can refuse with a reason that goes back to the model. Policy is host configuration. See tools / watchdog.
  • Web search. It is in the locked tool set because agents need it so often, but its implementation is a special case: search cannot be done in-house, so the tool abstracts over whichever provider the host wires up. See tools / web search.
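
The watchdog property can be sketched as a pre-execute hook wrapped around tool execution. This is a minimal illustration under assumptions: the `ToolCall`, `Verdict`, and `ToolResult` shapes are hypothetical and are not the guide's result envelope, and the blocked-word policy is an invented stand-in for whatever the host configures.

```typescript
// Hypothetical shapes for illustration only.
interface ToolCall {
  tool: string;
  input: Record<string, string | undefined>;
}
type Verdict = { allow: true } | { allow: false; reason: string };
type Watchdog = (call: ToolCall) => Verdict;
type ToolResult = { ok: true; output: string } | { ok: false; error: string };

// Host-configured policy: refuse shell commands that contain a blocked word.
function make_shell_watchdog(blocked_words: string[]): Watchdog {
  return (call) => {
    if (call.tool !== "shell.run") return { allow: true };
    const command = call.input["command"] ?? "";
    const hit = blocked_words.find((word) => command.includes(word));
    return hit === undefined
      ? { allow: true }
      : { allow: false, reason: `command contains blocked word "${hit}"` };
  };
}

// Every tool call passes through the hook before execution. A refusal is
// returned as an ordinary tool result so the model can read the reason.
function execute_with_watchdog(
  call: ToolCall,
  watchdog: Watchdog,
  execute: (call: ToolCall) => string,
): ToolResult {
  const verdict = watchdog(call);
  if (!verdict.allow) {
    return { ok: false, error: `refused by watchdog: ${verdict.reason}` };
  }
  return { ok: true, output: execute(call) };
}
```

Because the refusal travels back as a tool result rather than an exception, the loop does not need a separate error path: the model sees the reason on its next step and can choose a different action.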

Stress tests

The guide is task-agnostic, but it pays to test it against the agents it targets:

  • A code agent — long-running, file-heavy, shell-heavy, occasional web search. Exercises fs.*, shell.run with sub-policies, rewind-to-edit, hour-long session compaction.
  • A design agent — file-light, model-call-heavy, tool-arg-heavy (vector diffs as tool calls). Exercises tool-output streaming, fast rewind, per-turn model swaps between cheap and premium tiers.
  • A research / write agent — web-heavy, low write traffic. Exercises web search, subagent fan-out for parallel reading, queued sends.
  • A scripted job agent — runs unattended on a queue. Exercises the watchdog, the canonical inspection format, permission policies with no human in the loop.

A change to the guide that breaks any of these four agents is the wrong change.

See also