Aller au contenu principal

Compositor

The compositor is the layer where the user's intent is composed and lowered into a structured message the model can read. The surface varies — a rich web editor, a TUI prompt, a CLI flag, a voice transcription, a chat input embedded in an IDE — but the shape on the wire is the same.

This page is not "UI/UX." It is the representation of user intent: the parts a user message MAY carry, the non-arguable defaults for how each part is handled, and the rules for lowering what the user sees to what the model sees. The principle the rest of the doc rests on:

Structure preserves information. Flattening a user message into a string loses what the model can use to act correctly. Keep the message multipart; let the model see the intent, not a transcript.

The compositor and the editor context sit on the same vocabulary; UX patterns that ride on top (queued sends, sidecar, memory) live in ux.

The user message shape

A user message is a structured multipart object, not a string:

{
role: "user",
parts: [
{ type: "text", text: "look at this svg and the spec" },
{ type: "file-ref", ref: { kind: "path", path: "designs/icon.svg" } },
{ type: "file-attachment", data: "<base64>", mime: "image/png", name: "screenshot.png" },
{ type: "command", id: "/search", args: { query: "icon set" } },
{ type: "mention", target: { kind: "skill", name: "canvas-docs-svg-kit" } },
{ type: "editor-context", kind: "selection", payload: { /* host-shaped */ } }
]
}

The compositor MAY emit any subset. The order matters: parts are read in sequence. The model sees a structured input — except where the lowering rules say a part collapses, resolves out, or becomes a provider-native block.

File references vs file attachments

Two distinct shapes, deliberately:

ShapeBody locationUse when
file-refPath / URL / content-id; body NOT inlinedThe file is in the workspace; the agent reads it via read or web_fetch when it needs the contents.
file-attachmentInline bytes (base64 or binary)The file is outside the workspace; the model has multi-modal capability; the model should see the bytes directly.

The host MAY map the same drag-and-drop gesture to either shape. A code-agent host typically prefers "drag = file-ref" (preserves the agent's tool-driven reading flow). A design-agent host typically prefers "drag image = attachment" (the model sees the image directly).

File-ref with range

A file-ref MAY carry a byte- or line-range:

{
type: "file-ref",
ref: {
kind: "path",
path: "src/lib/foo.ts",
range: { start: 10, end: 25 }
}
}

A range-carrying file-ref is the de-facto standard for "the user selected this code." The lowering layer inlines only the selected lines, not the whole file. Editor surfaces that already speak the file path / line range (IDEs, code editors) SHOULD use this shape rather than editor context.

Multi-modal handling

Attached images, PDFs, audio, video go to the model as provider-native multi-modal parts. The agent system shells them through; the provider adapter re-encodes if needed. The model sees them in the format its provider supports — Anthropic's image block, OpenAI's image_url, etc.

Binary the model does not understand

A user attaches a .psd. The model has no native PSD support. The compositor MUST NOT inline raw bytes the model cannot use; instead it emits a descriptor part:

{ type: "file-attachment", name: "design.psd", mime: "image/vnd.adobe.photoshop", size: 12000000 }

The model sees the descriptor, not the bytes. A tool the host wires up (psd_to_png, shipped as MCP or agent-specific) can convert if the model decides it needs to.

Inline commands

A command (/search foo, /branch this chat, /compact now) is not a string the model parses out of prose. It is a part with a distinct type. The compositor MUST resolve commands at submission time into one of:

OutcomeWhen
A direct host action executed before the message reaches the modelCommands that compact, branch, archive — the model never sees them.
A structured user-input part the model can act onCommands that supply a query or a hint the model uses (e.g. /search becomes a text part + a search-intent flag).
A pass-through text part with the original /foo args bodyCommands the host does not recognize. Forward-compatible.

The compositor knows the command catalog; the model does not need to. The catalog includes:

  • Built-in commands the host ships (/branch, /compact, /archive, /rewind).
  • Project-config commands the user defines.
  • MCP-server-published commands (prompt templates).
  • Skill-bound commands (a skill MAY expose a slash trigger).

The de-facto pattern is that slash commands never appear in the model's input as literals. They either execute host-side and emit their result as a normal part, or they expand to plain text / structured parts at submission time. The model never sees /foo.

Command templates and argument substitution

User-defined, MCP-prompt-template, and skill-bound commands often ship as a template body with substitution slots: the user types /review src/foo.ts critical, the template body expands with $1src/foo.ts and $2critical (or $ARGUMENTS for the full trailing string). The expanded body is what lands as a text part in the user message; the literal /review never reaches the model.

The de-facto syntax is positional + aggregate:

TokenSubstitutes with
$1, $2, …The Nth whitespace-separated argument.
$ARGUMENTSEverything after the command name, verbatim.
$@Synonym for $ARGUMENTS in some hosts.

The exact syntax is implementor's choice — named slots (${name}), Mustache-style ({{name}}), or another convention all qualify — but positional + aggregate is the most portable and the one users have learned from shell.

A command template with no template body and no arguments is a plain action trigger (e.g. /compact). The compositor still resolves it; it just has nothing to substitute.

Mentions

@skill:canvas-docs-svg-kit or @file:src/lib/foo.ts in the compositor becomes a mention part the model reads but is not forced to act on. Mentions are suggestions the user gives the model without bending the loop.

A mention to a skill is a special case: the model never sees the literal @skill:.... Instead, the compositor records the intent in the persisted user message (for the inspector / replay), and the skill body lands in context via the normal skill tool flow. This keeps the model's input clean and avoids two paths for "the same skill."

A mention to a file lowers to a file-ref. A mention to a doc identifier (a symbol, a node, an entity the host knows) lowers to an editor-context part with kind: "ref".

Editor context

The compositor captures what the user composes. Editor context covers what the host contributes on the user's behalf — selections, open documents, cursor position, recent actions — whatever the user is currently touching in the editor that the agent should be aware of without the user retyping it.

This is a sibling of the compositor's text input: the user types prose; the host emits state. Both feed the same multipart user message.

Why this is in the protocol

A useful agent in an editor surface knows what the user is looking at. Without it, the user has to paste, describe, or re-state the editor's state every turn — the surface might as well be a generic chat box. With it, "make this red" or "fix this" is meaningful because this resolves.

The host already knows the current state. The protocol's job is to give that state a standard shape, persistable as part of the user message, that any conforming agent can read.

Host categories

The flavor of editor context varies by host. Three rough categories, with examples and the context shapes each typically emits:

CategoryExamplesTypical context shapes
Text (code / prose)IDE, code editor, document editorSelected range (often a file-ref with range), current file + cursor position, open tabs, recent edits.
Image (design / spatial)Canvas, image editor, video / timeline editorSelected node(s), current frame or timeline range, viewport, last action.
Graph (large workspace)Multi-file IDE, knowledge base, note-graph, large-document editorCurrently open document(s), recent files, the active subtree, related entities.

The boundaries blur — a code IDE with a file-tree sidebar is both text and graph. The categories are about which context shapes are typical, not which host is allowed to use which.

The part shape

Editor context arrives in user messages as parts of type editor-context:

{
type: "editor-context",
kind: string, // discriminator: "selection", "open", "viewport", "history", "last-action", "ref", …
source?: string, // which editor surface emitted it (panel id, document id, …)
payload: object, // host-defined; opaque to the loop
emitted_at: int, // epoch ms — when the host sampled this
}

The payload is opaque to the loop — its shape is the host's business, the way it is for mention targets. The model sees the part; tools the host or the agent's manifest wires up can act on it. The runtime does not validate payloads beyond presence.

A single user message MAY include multiple editor-context parts — one selection plus the list of open files plus a recent-action descriptor — and they read in order.

How context lands in a user message

Three patterns, all host policy (not protocol):

PatternTriggerWhen to use
AutomaticHost injects on every user messageCost of inclusion is low and the agent almost always benefits (e.g. "selection" in a design app).
Explicit actionUser clicks "send with selection" / drags into chatInclusion is a deliberate choice and would otherwise inflate the message.
Compositor mentionUser types @selection / @open / @historyThe user wants to compose a specific subset of context.

Mentions follow the same shape as the compositor's mention parts — a @<kind> token resolves at submit time into one or more editor-context parts. A host MAY pre-register a default mention catalog (@selection, @open, @cursor, @last) and let agents or skills extend it.

Copy-paste as reference

A common pattern is "copy this element, paste into chat" — the user expects the pasted block to behave like a reference, not as inline text or a screenshot. The compositor MAY model paste-of-an-entity as an editor-context part with kind: "ref" (the host's identifier for the entity) rather than inlining the entity's bytes or text.

This makes the paste cheap (a string id, not the entity's body) and keeps the model honest — it must call a tool to read the entity if it needs the contents. The same path serves drag-and-drop of an entity into the compositor.

Relationship to other parts

  • vs file-reffile-ref is the path / URL / content-id of a file in the workspace. editor-context covers in-editor entities that aren't files (a canvas node, a timeline range that isn't a file slice, a code symbol the IDE resolves), plus ambient state (cursor, viewport, last action). When the editor state IS a file range, prefer file-ref with range.
  • vs mention — a mention is a user-driven suggestion; the model is not forced to act on it. editor-context is a host-driven description; the model can treat it as authoritative ("this is what the user is currently doing").
  • vs Memory (ux / memory) — Memory is durable across turns and sessions. Editor context is ephemeral — a snapshot at the moment of the user's message.
  • vs skills / / commands — a / command invokes a host action or scope. Editor context describes state, not actions.

Attachments — the long tail

The compositor MUST handle, at minimum, the following attachment classes:

ClassDefault treatment
TextInlined as text part if small (< host threshold); path-ref otherwise.
ImageMulti-modal part if the provider supports it; descriptor + read_image path otherwise.
PDFMulti-modal part if supported; PDF-to-text tool route otherwise.
AudioMulti-modal where supported; transcription tool route otherwise.
VideoMulti-modal where supported; descriptor + frame-extractor tool otherwise.
Binary unknownAlways a path-ref or descriptor, never inlined. The model sees a name, mime, size.

The agent always has a fallback path. Any attachment the model cannot read directly SHOULD still be reachable through read or bash (for shell-level inspection). The host's job is to ensure the fallback exists.

For the deeper treatment — provider-native multimodal shortcuts (PDF on Anthropic / Gemini, video / audio on Gemini), the skill-per-format convention, shell-based conversion, the format matrix (zip / pptx / psd / fig / …), and the scratch-space pattern for archive extraction — see binary.

Attachment storage

Inlined attachments (base64 in the part's data_json) inflate the DB fast. A conforming implementation SHOULD apply a threshold (e.g. 1 MB per attachment, 20 MB per session) above which attachments are split out to a sidecar blob store; the part carries a content-id instead of the bytes. The schema MAY add a blobs table; the threshold is host config.

Templating: user view vs model view

The user sees one thing; the model sees another. The compositor is the layer where this divergence is reconciled. There are two intuitions worth correcting up front:

"The model must see XML-ish wrappers like <file path='…'>…</file> or <skill-invocation>…</skill-invocation>."

It generally does not. The de-facto pattern is structural, not stringly-typed: parts stay typed end-to-end and are lowered to provider-native shapes at the final boundary. Wrappers exist only where the model needs to discriminate state that has no native representation.

"Slash commands and mentions land in the model's input as literals."

They do not. Anything that can be resolved before the model sees it, is. The model is presented with a clean, structured input; the literal compositor tokens stay in the persisted user message (for the inspector / replay) but never reach the prompt.

The lowering rules

Per part type:

PartUser view (rich UI / TUI)Persisted shapeModel view (after lowering)
textInline text{ type: "text", text }Text, verbatim.
file-refChip / link with file name and optional :lines{ type: "file-ref", ref }The file's contents inlined as text (or the selected range only). For images, a provider-native image block. The model never sees the literal @path.
file-attachmentThumbnail or file chip{ type: "file-attachment", data, mime, name, … }Provider-native multi-modal block when the provider supports the mime type; descriptor placeholder otherwise.
commandResolved chip / palette result{ type: "command", id, args }Nothing when the command is host-action-only. Otherwise the command's result lowered as text/file parts (e.g. /read foo.ts → the file's contents). The literal /foo is never sent.
mention (skill)Chip / pill in the input{ type: "mention", target }Nothing in the user message; the skill body loads via the normal skill tool flow.
mention (file)SameSameLowered as a file-ref (and from there per the file-ref row).
mention (other)SameSameLowered as an editor-context part if it resolves to an entity; otherwise dropped.
editor-contextChip / collapsible / inline-quote{ type: "editor-context", kind, source?, payload, emitted_at }A structured-text block the model can discriminate. Default: <editor_context kind="..." source="...">payload-as-text</editor_context>. Implementors MAY substitute, but the marker SHOULD be stable for portability.

The lowering chain

compositor part ──┐

(persistence) ├─► chat_parts row (typed, JSON in data_json)

(turn fires) ├─► in-memory typed part (rehydrated from data_json)

(provider adapter)└─► provider message (provider-native multimodal blocks)

└─ this is where text-or-multimodal lowering happens

The first three stages keep the part typed and structured. The provider adapter is the only stage that produces a flattened shape the model literally consumes. Implementors MAY add stages (host-side rendering for the rich UI, ACP session/update projection for an external client) but those stages MUST NOT mutate the typed part or its order.

Why structural

Three reasons:

  • Resolved-out invariants survive refactor. When the model never sees /foo or @skill:bar, you can rename a slash command or add a new mention kind without disturbing the in-flight model output. If those tokens were in the model's prompt, every rename would be a prompt-engineering exercise.
  • Multimodal is provider-shaped, not text-shaped. Wrapping an image in <image …> for the model is strictly worse than the provider's native block — it costs tokens, breaks vision routing, and forces every provider adapter to parse a custom syntax.
  • Inspection and replay want types. The persisted part is the durable record; the inspector renders the types; replay re-emits the same shape. Stringly-typed prompt assembly fights all three.

When implementors substitute the default

The lowering rules are normative defaults for portability — two conforming agents lowering the same compositor parts will produce prompts the same model can act on consistently. An implementor MAY substitute when:

  • A target provider has no native multimodal: lower images to descriptors with a read_image tool path.
  • An editor surface has a richer convention the agent's manifest declares: the manifest's tool catalog can include a custom lowering function as long as the persisted part shape stays conformant.
  • The host ships a tool that consumes editor-context directly: in that case, the model does not need a text-block lowering; the tool reads the typed payload.

Substitution that changes the persisted part shape — renaming the type field, mutating kind — is not conformant and breaks inspector / replay.

Versioning the user message shape

Part types are an evolving vocabulary. The user message shape MUST carry a version field on the message metadata so a future reader degrades gracefully:

metadata: { schema_version: 1,}

Migration discipline:

  • A new part type lands without a version bump if the type is additive (older readers ignore unknown types and render them as text with the part's text content if present, or a "unsupported part" placeholder otherwise).
  • A breaking shape change (a part field renamed) bumps the version and requires either a migration that rewrites historical rows or a read-time adapter that translates v1 parts to v2 on the way out of the DB.

The guide recommends the read-time adapter. The DB is the long-term record; rewriting it on every schema change is the wrong cost.

Implementor checklist

A conforming compositor MUST:

  • Emit user messages as structured multipart objects.
  • Distinguish file-ref from file-attachment.
  • Carry a byte- or line-range on file-ref when the user's selection is a slice of a file.
  • Resolve commands at submission time; never let the model parse a literal /foo out of prose.
  • Resolve skill mentions before the model sees them; the skill body loads via the skill tool, not via a string wrapper.
  • Emit ephemeral editor state (selection on non-file entities, open documents, recent actions) as editor-context parts rather than inlining it into the user's text.
  • Carry a schema version on user message metadata.
  • Lower attachments to provider-native multimodal blocks at the provider boundary; fall back to descriptor parts when the model has no native support.
  • Persist user messages before they reach the model.

What this guide does not specify

  • The compositor's UI. Textarea, palette, voice, gesture, CLI flag — all conformant.
  • A command palette UI. How the host surfaces the catalog is the host's choice.
  • The shape of each editor-context kind's payload. Host territory; the protocol unifies the envelope, not the payload.
  • When the host samples editor state. At submit time, at compose time, both, or none — host policy.
  • Which contexts an agent should default to. Manifest territory.
  • The exact text rendering of editor-context in the model view. The default marker is recommended for portability, but a richer lowering (e.g. a tool the model calls to read the payload) is a valid substitute.

See also

  • UX Patterns — queued sends, sidecar branches, memory, the patterns that ride on top of the compositor.
  • Skills — how skill bodies actually enter context (not via mention strings).
  • Tools — the read, web_fetch, and host-supplied tools the compositor's refs flow into.
  • Persistency — how the persisted user message is stored (data_json carries the typed part verbatim).
  • Session Lifecycle — what happens after the compositor's submission: the turn fires, the lowering chain runs.
  • Debugging — the canonical inspection format that records what the compositor emitted, at every stage.