Session Lifecycle

A session is the unit of state. Everything an agent does — read, write, call a tool, switch model, fork, restart — lands here. This page covers the lifecycle (what happens between session-open and session-close) and the runtime mechanics (streaming, abort, status). The storage shape lives in persistency; the structures referenced here are detailed there.

Sessions, messages, parts

A session contains messages; a message contains parts. The shape is the AI SDK v6 message + part model. See persistency / three-table schema for the column-level shape.

The two properties this guide relies on:

  • Messages are append-only. A user turn is a new row; an assistant reply is a new row. Existing messages are never rewritten.
  • Parts are upserted while the stream lands. Text grows; tool calls transition input-streaming → input-available → output-available on the same row keyed by tool_call_id. An aborted run leaves the partial state truthful on reload.
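
The upsert behavior can be illustrated with a small in-memory sketch. The `PartStore` class and the `ToolPart` fields are hypothetical stand-ins, not the real column shape, which is documented in persistency.

```typescript
// Hypothetical part row; only the fields needed for the illustration.
type ToolState = "input-streaming" | "input-available" | "output-available";

interface ToolPart {
  tool_call_id: string;
  state: ToolState;
  input?: unknown;
  output?: unknown;
}

class PartStore {
  private rows = new Map<string, ToolPart>();

  // Insert the row on first sight, otherwise update the same row in place.
  upsert(part: ToolPart): void {
    const existing = this.rows.get(part.tool_call_id);
    this.rows.set(part.tool_call_id, { ...existing, ...part });
  }

  get(id: string): ToolPart | undefined {
    return this.rows.get(id);
  }

  get size(): number {
    return this.rows.size;
  }
}

const store = new PartStore();
store.upsert({ tool_call_id: "call_1", state: "input-streaming" });
store.upsert({ tool_call_id: "call_1", state: "input-available", input: { path: "a.txt" } });
// If the run aborts here, a reload still finds one row, in "input-available".
```

Because the key is tool_call_id, each state transition lands on the existing row instead of appending a new one.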

When the agent is fronted by an ACP adapter, the session id is the same handle ACP returns from session/new and accepts on session/load / session/resume / session/close — one identifier across both layers.

Context window tracking

Tokens are tracked per category on every assistant message and rolled up to the session row:

chat_messages.metadata_json: {
  usage: {
    input,       // user + system tokens this step consumed
    output,      // tokens the model produced this step
    reasoning,   // thinking-token charge, when the model emits one
    cache_read,  // tokens served from the provider's prompt cache
    cache_write, // tokens written to the provider's prompt cache
  }
}

Session-level rollups (prompt_tokens, completion_tokens, reasoning_tokens, cache_read, cache_write, total_tokens, cost_usd) are sums of the above across the session's assistant messages.

Why the breakdown

  • Compaction needs input + output + reasoning + cache_read + cache_write to decide whether the next turn will overflow. Cache reads still count against the context window; only the charge for them is lower.
  • Cost reporting needs reasoning separate. Providers price reasoning tokens at the output rate but report them separately; rolling them into output hides cache-vs-thinking ratios that matter for picking a tier.
  • Cache write vs cache read tells the picker how much of the conversation is hot. A session at 80% cache hit can sustain ten more turns; one at 5% cannot.

Source of truth

The AI SDK's per-step onStepFinish callback delivers a usage object per turn. The recorder bumps the session row from there, NEVER from finish-step chunks on the wire — keeping one source of truth avoids double-counting when an SDK update starts emitting both.

The SDK's inputTokens field already includes the cache-read and cache-write counts. The recorder MUST subtract them out before persisting prompt_tokens, or the cache columns get double-counted in the session rollup. See ai-sdk / token usage for the exact formula and the per-component mapping.
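
A minimal sketch of the subtraction, assuming a usage object whose input total already includes both cache counts. The `StepUsage` field names are hypothetical and may not match the SDK's usage object; the output and reasoning fields are passed through untouched as a further assumption. The exact formula lives in ai-sdk / token usage.

```typescript
// Hypothetical per-step usage: inputTokens already includes both cache counts.
interface StepUsage {
  inputTokens: number;
  outputTokens: number;
  reasoningTokens: number;
  cacheReadTokens: number;
  cacheWriteTokens: number;
}

// The usage block persisted on the assistant message.
interface UsageColumns {
  input: number;
  output: number;
  reasoning: number;
  cache_read: number;
  cache_write: number;
}

function toUsageColumns(u: StepUsage): UsageColumns {
  // Remove the cache counts so they are not counted once here and again in the cache columns.
  const input = Math.max(0, u.inputTokens - u.cacheReadTokens - u.cacheWriteTokens);
  return {
    input,
    output: u.outputTokens,
    reasoning: u.reasoningTokens,
    cache_read: u.cacheReadTokens,
    cache_write: u.cacheWriteTokens,
  };
}

const columns = toUsageColumns({
  inputTokens: 1200,
  outputTokens: 300,
  reasoningTokens: 50,
  cacheReadTokens: 900,
  cacheWriteTokens: 100,
});
// columns.input is 200: 1200 - 900 - 100
```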

Cost

cost_usd is recorded when a writer (a hosted route, a metadata hook) supplies it. The agent system MUST NOT compute cost from a pricing table inside the recorder. Pricing lives next to the model catalog, not next to the session store.

What "the context window" means here

The number compared against the model's limit is:

context_window_used =
  chat_sessions.prompt_tokens +
  chat_sessions.completion_tokens +
  chat_sessions.reasoning_tokens +
  chat_sessions.cache_read +
  chat_sessions.cache_write

Not the cumulative total of tokens billed (that includes per-turn re-reads). Not just input + output (that misses reasoning and cache).

Rewinding

A user MUST be able to rewind to a prior user message and edit or resend it. The unit is the user message because:

  • Assistant turns are non-deterministic; "rewind to the assistant's word 47" is meaningless when the next run produces different words.
  • The user message is the only deterministic checkpoint the user controls.
  • A turn-pair (user, assistant) is the unit of rollback in every chat product the user has used.

What happens on rewind

  1. The user picks a prior user message. The UI shows the message body in the editor; the user edits or accepts.
  2. The system soft-truncates the conversation: every message and part after the chosen one is marked invisible to the next LLM call.
  3. The user submits. A new turn appends after the chosen message; the model sees the conversation as if everything after the rewind point never happened.
  4. The hidden messages remain in the DB for inspection, un-rewind, and audit. They are NOT deleted.
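
The steps above can be sketched as follows. The `Message` shape is a hypothetical simplification that carries only the hidden_at marker; parts would be hidden the same way.

```typescript
// Hypothetical, simplified message row.
interface Message {
  id: string;
  role: "user" | "assistant";
  hidden_at: number | null; // epoch ms when soft-hidden, null while visible
}

// Soft-truncate: hide everything after the chosen user message. Nothing is deleted.
function rewindTo(messages: Message[], targetId: string, now: number): number {
  const idx = messages.findIndex((m) => m.id === targetId);
  const target = messages[idx];
  if (target === undefined) throw new Error(`unknown message ${targetId}`);
  if (target.role !== "user") throw new Error("rewind target must be a user message");

  let hidden = 0;
  for (const m of messages.slice(idx + 1)) {
    if (m.hidden_at === null) {
      m.hidden_at = now;
      hidden++;
    }
  }
  return hidden;
}

// What the next LLM call sees.
function visible(messages: Message[]): Message[] {
  return messages.filter((m) => m.hidden_at === null);
}

const history: Message[] = [
  { id: "u1", role: "user", hidden_at: null },
  { id: "a1", role: "assistant", hidden_at: null },
  { id: "u2", role: "user", hidden_at: null },
  { id: "a2", role: "assistant", hidden_at: null },
];
const hiddenCount = rewindTo(history, "u1", 1_700_000_000_000);
```

After the call, the model's view contains only u1, while all four rows remain available for inspection and un-rewind.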

Soft-truncate vs delete

Soft-truncate (a hidden_at per message) is the required shape. It preserves history for inspection and lets a user un-rewind by moving the pointer back. An implementation that hard-deletes on rewind trades inspection for storage, and SHOULD document the loss.

Side-effect rewind

A rewound turn MAY have created files, run a shell command, or hit the network. The agent system does NOT undo side effects on rewind. Rewind is a prompt rewind, not a world rewind.

A host that wants world-rewind (a code agent that restores the workspace to its state at message N) ships a workspace-snapshot layer that hooks on user messages and tags the snapshot id into the message metadata. The contract:

  • A user-message metadata field snapshot_id?: string that hosts populate when they snapshotted the world at submission time.
  • A hook point (message.user, the same hook permission rules use) fires before the message lands, giving the host's snapshot layer a place to attach.
  • On rewind, the host reads snapshot_id off the target message and restores. The guide is silent on how (git patches, VM snapshots, copy-on-write filesystems are all valid).

Without a snapshot layer, rewind is prompt-only. With one, the same flow restores both.

Rewinding past a compaction

A compaction creates a synthetic assistant message with a summary; the messages it summarized are soft-hidden. Rewinding past the compaction MUST un-hide them. The truncation pointer moves back over the summary AND the summarized turns. Implementations MUST NOT hard-delete on compaction, because doing so loses this property.

Branching

A branch is a new session whose parent_id points back to the parent and whose parent_message_id points to the fork point. The new session starts with a copy of the parent's messages up to and including the fork point; new turns append only to the branch.

branch API

branch({
  parent_session_id: string,
  from_message_id: string, // a chat_messages.id reachable from parent_session_id
  metadata?: object,       // merged into the new session's metadata_json
}) → ChatSessionRow        // the newly created session

Behavior:

  • The runtime MUST reject the call if the parent session has a run in flight (SessionStatus.state != "idle"); a 4xx-equivalent error is returned.
  • The runtime MUST copy every non-hidden message up to and including from_message_id into the new session with new ids. Parts are copied verbatim.
  • Token rollups MUST be recomputed from the copied messages, not copied from the parent's row.
  • The metadata blob is merged into the new session's metadata_json. metadata.ephemeral = true is the convention for sidecar branches; see ux / sidecar.
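
A rough in-memory sketch of these rules. The row shapes, the `newId` helper, and the single `tokens` field per message are hypothetical simplifications; a real implementation would copy parts and recompute the full per-category rollup.

```typescript
// Hypothetical, simplified rows; the real columns are in persistency.
interface Msg {
  id: string;
  session_id: string;
  hidden_at: number | null;
  data_json: string;
  tokens: number; // stand-in for the per-message usage breakdown
}
interface Session {
  id: string;
  parent_id: string | null;
  parent_message_id: string | null;
  total_tokens: number;
  metadata_json: Record<string, unknown>;
}

let counter = 0;
const newId = (prefix: string): string => `${prefix}_${++counter}`; // hypothetical id strategy

function branch(
  parent: Session,
  parentMessages: Msg[],
  parentState: "idle" | "busy" | "retrying" | "error",
  fromMessageId: string,
  metadata: Record<string, unknown> = {},
): { session: Session; messages: Msg[] } {
  if (parentState !== "idle") throw new Error("parent session has a run in flight");
  const forkIdx = parentMessages.findIndex((m) => m.id === fromMessageId);
  if (forkIdx === -1) throw new Error("from_message_id is not reachable from the parent");

  const sessionId = newId("sess");
  // Copy non-hidden messages up to and including the fork point, with new ids.
  const messages = parentMessages
    .slice(0, forkIdx + 1)
    .filter((m) => m.hidden_at === null)
    .map((m) => ({ ...m, id: newId("msg"), session_id: sessionId }));

  const session: Session = {
    id: sessionId,
    parent_id: parent.id,
    parent_message_id: fromMessageId,
    // Recomputed from the copies, never read from the parent's row.
    total_tokens: messages.reduce((sum, m) => sum + m.tokens, 0),
    metadata_json: { ...parent.metadata_json, ...metadata },
  };
  return { session, messages };
}
```

A sidecar branch would pass `{ ephemeral: true }` as the metadata argument.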

Why a new session and not a fork of the same session

  • Each branch needs its own running stream, its own token rollup, its own model selection. A session is the natural unit.
  • Two branches MUST be inspectable side by side. Two rows are easier to render than one row with a tree shape.
  • The picker shows branches as siblings under the parent without schema gymnastics — SELECT * FROM chat_sessions WHERE parent_id = ?.

What gets copied on branch

  • Every non-hidden message and its parts up to and including the fork point. The copies have new ids; their data_json is verbatim from the source.
  • The session row's settings (agent, model, workspace, metadata).
  • Token rollups for the copied turns, recomputed from the copies. When nothing before the fork point is hidden, the branch starts with the same total_tokens the parent had at the fork point.

What does NOT get copied

  • Side effects the parent took. Branching is a conversation fork, not a workspace fork. If the host wants workspace-snapshot branching, it layers it on the message metadata (same hook as side-effect rewind).
  • The parent's in-flight run, if any. A branch CANNOT fork a running turn; the user MUST wait for the parent's turn to finish or abort.

Sidecar = ephemeral branch

A sidecar chat (the "ask the model a side question without messing up the main thread") is exactly a branch — same wire, same shape — with one UX twist: the host marks the branch ephemeral, hiding it from the picker. The schema gains nothing; the host filters on a metadata flag. See ux / sidecar.

Compaction

Compaction is the act of replacing a stretch of conversation history with a summary, freeing tokens for the next turn. It is not optional once a conversation can outgrow the model's context window. An implementation that ships a "the chat just stops working" failure mode is shipping a bug.

Threshold

Compaction fires when usable context is exceeded:

usable = model.context_limit - reserve
context_window_used >= usable → fire compaction

reserve is the headroom kept for the next turn's output and reasoning. Default: min(20k tokens, model.max_output). Implementors tune for product shape (a code agent that emits long diffs needs a larger reserve; a chat agent does not).

The threshold is per-model. A 200k model and a 1M model fire at different absolute token counts. A session that switches model mid-conversation (see Per-turn model switch) re-evaluates against the new model's limit before the next turn.
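
The threshold arithmetic can be written out as a small sketch. `ModelLimits` and the function names are hypothetical; the numbers in the usage lines are illustrative, not measurements.

```typescript
// Hypothetical view of the model catalog entry.
interface ModelLimits {
  context_limit: number;
  max_output: number;
}

const DEFAULT_RESERVE_CAP = 20_000;

// reserve defaults to min(20k tokens, model.max_output).
function usableContext(
  model: ModelLimits,
  reserve: number = Math.min(DEFAULT_RESERVE_CAP, model.max_output),
): number {
  return model.context_limit - reserve;
}

// Fires when context_window_used reaches or exceeds the usable budget.
function shouldCompact(contextWindowUsed: number, model: ModelLimits): boolean {
  return contextWindowUsed >= usableContext(model);
}

const model200k: ModelLimits = { context_limit: 200_000, max_output: 64_000 };
const model1m: ModelLimits = { context_limit: 1_000_000, max_output: 64_000 };

const fires200k = shouldCompact(185_000, model200k); // usable is 180_000
const fires1m = shouldCompact(185_000, model1m);     // usable is 980_000
```

The same usage count fires compaction on the smaller model and not on the larger one, which is why a model switch re-evaluates the threshold.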

Auto vs manual

  • Auto-compaction is the default. For casual users, the system fires compaction before the next overflow and replays seamlessly. The user sees a one-line "summarized N earlier turns" affordance.
  • Manual compaction lets the user fire it on demand (a slash command, a button). Useful when the user knows the next turn will be expensive and wants the conversation cheap to re-run.

Both produce the same artifact in the schema; only the trigger differs.

What compaction produces

A compaction part (type: "data-compaction") on a synthetic assistant message. The part's data_json carries:

{
  summary: string,       // Markdown body, sectioned (Goal / Progress / Decisions / Next Steps is a reasonable shape)
  tail_start_id: string, // chat_messages.id of the first message kept verbatim
  auto: boolean,         // true for auto-compaction, false for user-fired
  summary_tokens: int,   // token count of `summary`, for the next rollup
}

The model sees the summary plus every message from tail_start_id onward.

The summarized messages are soft-hidden, not deleted. Inspection, rewind, and "show what was summarized" all need them. The model's view skips them; the picker can expose them under "details".

Tail preservation

Compaction keeps the N most recent turns verbatim. Default N = 2 (one user message + one assistant response). The tail budget caps at ~25% of usable. If the tail at N=2 exceeds the budget, the implementor either:

  • Drops to N=1 (last turn only).
  • Splits the last turn (keep the last user message and the assistant's final text; drop intermediate tool calls and reasoning).

Default: "drop to N=1 and warn". "Split the last turn" is for agents that habitually run long tool chains.
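
A sketch of the default policy ("drop to N=1 and warn"). The function is hypothetical and takes the token cost of the most recent tail units, newest first, without taking a position on how an implementation counts a unit.

```typescript
interface TailDecision {
  keep: number;  // how many recent units stay verbatim
  warn: boolean; // true when the default N did not fit the budget
}

function chooseTail(
  recentUnitTokens: number[], // newest first
  usable: number,
  budgetPct = 0.25,
  defaultN = 2,
): TailDecision {
  const budget = Math.floor(usable * budgetPct);
  const cost = recentUnitTokens.slice(0, defaultN).reduce((a, b) => a + b, 0);
  if (cost <= budget) {
    return { keep: Math.min(defaultN, recentUnitTokens.length), warn: false };
  }
  // Over budget at the default N: keep only the most recent unit and warn.
  return { keep: 1, warn: true };
}

// With usable = 180_000 the tail budget is 45_000 tokens.
const fits = chooseTail([10_000, 20_000], 180_000);
const tooBig = chooseTail([30_000, 20_000], 180_000);
```

An agent that habitually runs long tool chains would replace the fallback branch with the split-the-last-turn strategy.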

Summarizer cost discipline

The summarizer is a specialized subagent running the same loop. See subagents / specialized subagents.

Required discipline:

  • Cheapest model the provider exposes (nano / small tier).
  • Low temperature.
  • Hard maxOutputTokens cap.
  • Short timeout.
  • A specialized system prompt that constrains output format.
  • No tools.
  • mode: "subagent", inspectable: false.

The summarizer's input is the soft-hidden history; its output lands in the CompactionPart. One model call, one shot.

Failure modes

| Failure | Trigger | Recovery |
| --- | --- | --- |
| Transient (network, provider 5xx) | Summarizer call failed once | Retry with backoff. The main session blocks on retry up to N seconds, then proceeds without compaction (warning logged). |
| Spec limit (history > model's input cap) | Sum of soft-hidden history exceeds the model's input | Hard failure. Surface to the user; suggest a new session or a tool-output prune first. |
| Best-effort prune | Spec-limit failure, prune is on | Run a tool-output prune pass: erase output of completed tool calls (keep input + result-truthy state), then retry. |
| Soft / smart fallback | Spec-limit failure, prune still insufficient | Split soft-hidden history into chunks; summarize each; concatenate. Or drop the middle and keep head + tail. |

The implementor's compaction config:

{
  "reserve_tokens": "int",       // headroom for next turn (default 20000)
  "tail_turns": "int",           // recent turns to keep verbatim (default 2)
  "tail_budget_pct": "float",    // tail token budget as fraction (default 0.25)
  "retry_on_transient": "int",   // retry count on transient failure (default 2)
  "smart_recovery": {
    "prune_tool_outputs": "bool", // try output prune before chunked summarize
    "chunked_summarize": "bool",  // summarize in chunks if one-shot won't fit
    "drop_middle": "bool",        // last-resort: keep head + tail, drop middle
  },
}

The shape is normative; the defaults are reasonable starting points.
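
One way to type the config and derive the recovery order for a spec-limit failure is sketched below. The numeric defaults come from the comments above; the `smart_recovery` defaults and the `specLimitPlan` helper are assumptions made for illustration.

```typescript
interface CompactionConfig {
  reserve_tokens: number;
  tail_turns: number;
  tail_budget_pct: number;
  retry_on_transient: number;
  smart_recovery: {
    prune_tool_outputs: boolean;
    chunked_summarize: boolean;
    drop_middle: boolean;
  };
}

const defaultConfig: CompactionConfig = {
  reserve_tokens: 20_000,
  tail_turns: 2,
  tail_budget_pct: 0.25,
  retry_on_transient: 2,
  // Assumed defaults: the guide does not state them.
  smart_recovery: { prune_tool_outputs: true, chunked_summarize: true, drop_middle: false },
};

type RecoveryStep = "prune_tool_outputs" | "chunked_summarize" | "drop_middle";

// Ordered steps to try after a spec-limit failure; an empty plan means hard failure.
function specLimitPlan(cfg: CompactionConfig): RecoveryStep[] {
  const steps: RecoveryStep[] = [];
  if (cfg.smart_recovery.prune_tool_outputs) steps.push("prune_tool_outputs");
  if (cfg.smart_recovery.chunked_summarize) steps.push("chunked_summarize");
  if (cfg.smart_recovery.drop_middle) steps.push("drop_middle");
  return steps;
}

const plan = specLimitPlan(defaultConfig);
```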

Tool-output pruning vs compaction

Tool-output pruning is a separate, cheaper pass that runs before a real compaction is attempted:

  • Walk backwards through completed turns.
  • For each completed tool call, if its output is large and the tool is not protected (protected tools are those whose output the current task still references: todo, task state, the model's most recent files), drop the output and keep the tool's input + a stub like <output pruned, N tokens>.
  • Stop once enough tokens are reclaimed.

Pruning preserves the conversation's logical structure; the model still sees "I called read on file X" without re-reading X's contents. A code agent that ran 30 reads usually only needs the last 5 in context.
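
A minimal sketch of the pruning pass. The `ToolCall` shape, the protected tool names, the `minTokens` cutoff for "large", and the `keepRecent` window (one reading of "the model's most recent files") are all assumptions.

```typescript
// Hypothetical record of a completed tool call, oldest first in the array.
interface ToolCall {
  tool: string;
  input: unknown;
  output: string;
  outputTokens: number;
}

const PROTECTED_TOOLS = new Set(["todo", "task"]); // hypothetical names

function pruneToolOutputs(
  calls: ToolCall[],
  tokensNeeded: number,
  keepRecent = 1,  // most recent calls left untouched
  minTokens = 500, // outputs smaller than this are not worth pruning
): number {
  let reclaimed = 0;
  // Walk backwards, skipping the protected recent window; stop once enough is reclaimed.
  for (let i = calls.length - 1 - keepRecent; i >= 0 && reclaimed < tokensNeeded; i--) {
    const call = calls[i];
    if (call === undefined) continue;
    if (PROTECTED_TOOLS.has(call.tool) || call.outputTokens < minTokens) continue;
    reclaimed += call.outputTokens;
    call.output = `<output pruned, ${call.outputTokens} tokens>`; // input stays intact
    call.outputTokens = 0;
  }
  return reclaimed;
}

const calls: ToolCall[] = [
  { tool: "read", input: { path: "a.ts" }, output: "A", outputTokens: 2000 },
  { tool: "todo", input: {}, output: "T", outputTokens: 3000 },
  { tool: "read", input: { path: "b.ts" }, output: "B", outputTokens: 1000 },
  { tool: "read", input: { path: "c.ts" }, output: "C", outputTokens: 800 },
];
const reclaimed = pruneToolOutputs(calls, 2500);
```

In this run the two older reads are stubbed, while the todo output and the most recent read stay in context.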

Pruning MAY be triggered:

  • Automatically before compaction (cheap, often enough).
  • Manually by the user ("free some space").
  • Periodically on a token threshold lower than the compaction threshold (e.g. at 60% context prune; at 90% compact).

Per-turn model switch

The model is per-message, not per-session. A user MAY pick a different model on any turn, and the new model carries forward until they pick again.

Shape. The user message metadata carries { provider_id, model_id, variant? } for that turn. The session row's model_json is the most recent active model — a denormalization for the picker so it can render "Currently using X." Historical use is queryable from the messages.

Why per-message and not per-session.

  • Users push the same prompt through cheap and premium models to compare. Forcing a new session for each pin is hostile.
  • Compaction uses a cheap model. If model were a session attribute, compaction would swap and swap back, racing with the user's own swap.
  • Variants (reasoning mode, JSON mode, thinking depth) are per-turn by their nature. A reasoning-mode toggle that lasts the session is a bug.

Compaction interacts with model switch

| Scenario | Behavior |
| --- | --- |
| New model has larger context than current usage | No-op. Conversation proceeds. |
| New model has smaller context, but next turn fits | No-op. Conversation proceeds. |
| New model's context cannot fit the current rollup | Force compaction before the turn proceeds. The next user message blocks on the summarizer. |
| After compaction, history still does not fit | The model swap fails. Surface to the user: "Switching to a model with a smaller context window would lose the conversation; consider branching with the new model." |

Branching is the escape hatch for the last case: a branch with the new model has only the parent's tail in scope (or the user manually picks the slice).
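
The scenarios reduce to a three-way decision, sketched here with hypothetical names. `usedAfterCompaction` stands for an estimate of the summary plus the preserved tail; how it is obtained is left open.

```typescript
type SwitchOutcome = "proceed" | "compact-then-proceed" | "reject";

function planModelSwitch(
  contextUsed: number,         // context_window_used for the session
  newModelUsable: number,      // new model's context_limit minus reserve
  usedAfterCompaction: number, // estimated usage once compaction has run
): SwitchOutcome {
  if (contextUsed < newModelUsable) return "proceed";
  if (usedAfterCompaction < newModelUsable) return "compact-then-proceed";
  return "reject"; // surface the error and suggest branching with the new model
}

const roomy = planModelSwitch(50_000, 180_000, 12_000);
const tight = planModelSwitch(150_000, 100_000, 30_000);
const impossible = planModelSwitch(150_000, 20_000, 30_000);
```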

System prompt assembly

Every turn assembles a fresh system prompt from these sections, in this order, top to bottom:

  1. Agent manifest prompt — the agent's intrinsic system prompt.
  2. Project instructions — concatenated AGENTS.md / CLAUDE.md / CONTEXT.md content, walked from the outermost file inward (nearest-last, so the project root has the final word). See skills / project instructions.
  3. Skill index — names + one-line descriptions of every discovered skill. Bodies load on demand. See skills.
  4. Environment context — platform, workspace root, git status, current date, resolved model + variant.
  5. Tool catalog — JSON Schema for every active tool (locked, agent-specific, materialized MCP).

The order is normative; implementors MUST NOT shuffle the sections. Two implementations that follow this order produce digest-comparable prompts for the same inputs.

Optional system_prompt_digest storage MAY persist a SHA-256 of the assembled prompt on each assistant message; see persistency / metadata conventions.
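
A sketch of ordered assembly plus the optional digest, using Node's built-in crypto module. The `PromptSections` type and the blank-line separator are assumptions; only the section order comes from the list above.

```typescript
import { createHash } from "node:crypto";

// Hypothetical container; each field is the already-rendered text of one section.
interface PromptSections {
  manifest: string;
  projectInstructions: string;
  skillIndex: string;
  environment: string;
  toolCatalog: string;
}

// The section order is fixed; the separator is an assumption.
function assembleSystemPrompt(s: PromptSections): string {
  return [s.manifest, s.projectInstructions, s.skillIndex, s.environment, s.toolCatalog].join("\n\n");
}

// Hex-encoded SHA-256 of the assembled prompt.
function systemPromptDigest(prompt: string): string {
  return createHash("sha256").update(prompt, "utf8").digest("hex");
}

const sections: PromptSections = {
  manifest: "You are a code agent.",
  projectInstructions: "Use tabs.",
  skillIndex: "pdf: read PDF files",
  environment: "platform: linux",
  toolCatalog: "{\"tools\":[]}",
};
const digestA = systemPromptDigest(assembleSystemPrompt(sections));
const digestB = systemPromptDigest(assembleSystemPrompt({ ...sections }));
```

Identical inputs yield identical digests, which is what makes two implementations comparable.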

Streaming and layering

The core emits AI SDK chunks to a transport the host owns. The core does NOT render anything.

┌─ Agent system core ──────────────────────────────────────────────┐
│ run(session, input, model)                                       │
│   → AsyncIterable<UIMessageChunk>                                │
│                                                                  │
│ one universal LLM loop:                                          │
│   - assemble system prompt (see above)                           │
│   - call model (native runtime or AI SDK adapter)                │
│   - emit chunks                                                  │
│   - on tool call: validate args, capability check, watchdog,     │
│     execute, emit tool-output chunks                             │
│   - loop until no tool calls or abort                            │
└──────────────┬───────────────────────────────────────────────────┘
               │ same chunk shape, two consumers
               │
       ┌───────┴────────┐
       ▼                ▼
    recorder       host transport
    (writes parts  (SSE / IPC / WS / in-memory)
     to the DB)         │
                        ▼
                client / renderer
                (drops chunks into the AI SDK reducer;
                 renders to user)

Strict layering

  • The core MUST NOT know whether the host transport is SSE, IPC, WebSocket, gRPC, or in-memory.
  • The host MUST NOT know how the chunks were produced (native runtime vs AI SDK adapter).
  • The client / renderer MUST NOT know about the recorder.

The renderer's only job

The renderer renders. It does NOT own the recorder. It does NOT own model selection. It does NOT own permission state. These belong to the core because they survive the renderer being closed.

Resume across renderer disconnect

A renderer that closes its connection mid-stream (page refresh, OS sleep, window close) MUST NOT cancel the upstream model call. The required behavior:

  • The core keeps the model call alive while at least one consumer is attached. The recorder is always attached for the run's lifetime.
  • A reconnect endpoint, given the session id, replays the chunk log from the start and live-tails until the upstream finishes.
  • Replay starts from index 0 (not from a cursor). The AI SDK reducer rejects text-delta for a part it has not seen text-start for; cursor-based resume would require chunk rewriting and is not worth the complexity.

Lifespan of the live stream registry. In-process, in-memory, keyed by session id. A host restart drops every entry; the renderer falls back to hydrating from the DB. Cross-restart resume is out of scope — the upstream provider has no notion of "your previous request."
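
A minimal in-process sketch of the registry and the replay-from-zero rule. `LiveStream`, `Chunk`, and the listener callback are hypothetical; a real host would bridge `attach` to its transport.

```typescript
// Hypothetical chunk shape; only `type` matters for the illustration.
interface Chunk {
  type: string;
  id?: string;
  delta?: string;
}
type Listener = (chunk: Chunk) => void;

class LiveStream {
  private log: Chunk[] = [];
  private listeners = new Set<Listener>();
  private done = false;

  // Called by the core for every emitted chunk.
  push(chunk: Chunk): void {
    this.log.push(chunk);
    for (const l of this.listeners) l(chunk);
  }

  finish(): void {
    this.done = true;
    this.listeners.clear();
  }

  // Replay from index 0, then live-tail. Returns a detach function.
  attach(listener: Listener): () => void {
    for (const c of this.log) listener(c);
    if (!this.done) this.listeners.add(listener);
    return () => {
      this.listeners.delete(listener);
    };
  }
}

// In-memory registry keyed by session id; a host restart drops every entry.
const registry = new Map<string, LiveStream>();

const stream = new LiveStream();
registry.set("sess_1", stream);
stream.push({ type: "text-start", id: "t1" });
stream.push({ type: "text-delta", id: "t1", delta: "Hel" });
```

A renderer that reconnects after these two chunks receives text-start before any text-delta, so the reducer never sees a delta for an unknown part.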

Orphaned in-flight tool calls on restart. A tool-input-available part with no matching tool-output-* companion MUST be finalized as a tool-error envelope ("aborted by host restart" or equivalent) before the next run on that session is allowed. Without finalization the assembled prompt would contain an unresolved tool call and the loop could not proceed. The rule applies to every tool — bash, web_fetch, MCP calls, task — uniformly.
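
The finalization rule can be sketched as a pass over a session's parts before the next run is admitted. The `Part` shape and the "tool-output-error" type name used for the error envelope are assumptions.

```typescript
// Hypothetical, simplified part row.
interface Part {
  tool_call_id: string;
  type: string;
  error_text?: string;
}

// Appends an error envelope for every tool call that never got an output.
// Returns the number of orphans finalized.
function finalizeOrphans(parts: Part[], reason = "aborted by host restart"): number {
  const resolved = new Set(
    parts.filter((p) => p.type.startsWith("tool-output-")).map((p) => p.tool_call_id),
  );
  const orphans = parts.filter(
    (p) => p.type === "tool-input-available" && !resolved.has(p.tool_call_id),
  );
  for (const o of orphans) {
    parts.push({ tool_call_id: o.tool_call_id, type: "tool-output-error", error_text: reason });
  }
  return orphans.length;
}

const parts: Part[] = [
  { tool_call_id: "a", type: "tool-input-available" },
  { tool_call_id: "a", type: "tool-output-available" },
  { tool_call_id: "b", type: "tool-input-available" }, // orphaned by a restart
];
const finalized = finalizeOrphans(parts);
```

The pass is idempotent: running it again finds no orphans, so it is safe to call on every session load.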

Multi-replica deployments. The in-memory registry assumes one process owns the session. A horizontally-scaled deployment (multiple replicas behind a load balancer) MUST swap the in-memory map for a pubsub layer so any replica can subscribe to a stream another replica is producing. vercel/resumable-stream is one option that matches this shape (Redis-backed pubsub, same chunk vocabulary). The protocol does not change — replay still walks the DB, live resume still hooks the registry — only the registry's storage moves out of process.

ACP mapping. An ACP-fronted host exposes two resume-style methods: session/load (replay every past message as a session/update notification — used after a host restart) and session/resume (rejoin without replay — used by a reconnecting client that already has the history). Both map onto this layer; load walks the DB rows and resume hooks into the in-memory registry. See ACP integration.

Interruption

The user MUST be able to abort the assistant mid-turn. The abort path:

  1. The host's UI surfaces an abort button.
  2. The host calls abort(session_id).
  3. The core's per-session run state holds an AbortController. The controller's signal is the one passed to the model call and to every tool's execute.
  4. abort() calls controller.abort(). The model call cancels at the next chunk boundary; tools that watch the signal cancel themselves; tools that do not watch finish naturally.
  5. The recorder finalizes the in-flight assistant message. Tool calls in input-streaming or input-available stay frozen in that state. Text in flight stops where it stopped.

Abort vs TCP close

A renderer that closes its TCP connection has not aborted; it has detached. The model call keeps going; the recorder keeps writing. Only an explicit abort (its own endpoint) cancels.

The cost of distinguishing the two is one endpoint. The cost of collapsing them is a user who refreshed mid-stream and lost every already-spent token.
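
A small sketch of the distinction, using the standard AbortController. The `runs` map and the three function names are hypothetical.

```typescript
// Per-session run state: one AbortController per in-flight run.
const runs = new Map<string, AbortController>();

// The returned signal is what the model call and every tool's execute would receive.
function startRun(sessionId: string): AbortSignal {
  const controller = new AbortController();
  runs.set(sessionId, controller);
  return controller.signal;
}

// Explicit abort endpoint: the only path that cancels the run.
function abort(sessionId: string): boolean {
  const controller = runs.get(sessionId);
  if (!controller) return false;
  controller.abort();
  runs.delete(sessionId);
  return true;
}

// Renderer closed its connection: it has detached, and the run is left alone.
function onRendererDisconnect(_sessionId: string): void {
  // Intentionally empty: the recorder stays attached and keeps writing.
}

const tool = { cancelled: false };
const signal = startRun("sess_1");
// A tool that watches the signal cancels itself when abort fires.
signal.addEventListener("abort", () => {
  tool.cancelled = true;
});

onRendererDisconnect("sess_1");
const aliveAfterDisconnect = !signal.aborted;
const aborted = abort("sess_1");
```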

Subagent abort

Aborting the parent's run aborts every subagent spawned by that run. The signal propagates through the task tool's implementation: a child session running on the parent's abort signal sees the abort, stops, and its tool call returns an "aborted by parent" error.

Session status

The streaming layer is one-directional (core → transport → client). Session status is the back-channel — what a client uses to know whether the session is currently running, retrying, or idle, without subscribing to the chunk stream.

Shape:

SessionStatus = {
  state: "idle" | "busy" | "retrying" | "error",
  attempt?: int,     // current retry attempt, when state="retrying"
  message?: string,  // human-readable status, when state="retrying" | "error"
  started_at?: int,  // epoch ms; present when state="busy" | "retrying"
}

Where it lives.

  • Authoritative source: an in-memory map keyed by session id, owned by the core, mutated by the run-state machine.
  • Subscription transport: an event stream on the host's bus. The event payload is the SessionStatus shape.
  • Read API: get_status(session_id) → SessionStatus for consumers that join late.

Not persisted. Status is volatile. On host restart, every session reads as idle — correct, because no run is in flight after a restart.

One run per session at a time. The state machine refuses a new run while one is in flight (state is "busy" or "retrying"). Hosts that want to surface "you have a turn already running" do so by checking status before submitting. An "error" state does not block: it clears on the next submission attempt, when the state machine transitions to busy for that run and follows the normal lifecycle from there.

Retry visibility. When the model call fails transiently (rate limit, provider 5xx) and the loop backs off, status transitions to retrying with the current attempt count and a message sourced from the error (e.g. "rate limited, retry in 12s"). The client renders this directly; the user understands the delay is not their fault.
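
A sketch of the run-state machine behind the status map. The `StatusRegistry` class is hypothetical, and the discriminated union is a stricter rendering of the optional fields in the shape above. It treats "error" as startable, following the rule that error clears on the next submission.

```typescript
type SessionStatus =
  | { state: "idle" }
  | { state: "busy"; started_at: number }
  | { state: "retrying"; started_at: number; attempt: number; message: string }
  | { state: "error"; message: string };

class StatusRegistry {
  private map = new Map<string, SessionStatus>();

  // Sessions with no entry read as idle, which is also the state after a host restart.
  get_status(id: string): SessionStatus {
    return this.map.get(id) ?? { state: "idle" };
  }

  beginRun(id: string, now: number): void {
    const s = this.get_status(id);
    if (s.state === "busy" || s.state === "retrying") {
      throw new Error("a run is already in flight for this session");
    }
    this.map.set(id, { state: "busy", started_at: now });
  }

  retrying(id: string, attempt: number, message: string): void {
    const s = this.get_status(id);
    if (s.state !== "busy" && s.state !== "retrying") throw new Error("no run in flight");
    this.map.set(id, { state: "retrying", started_at: s.started_at, attempt, message });
  }

  finish(id: string): void {
    this.map.delete(id);
  }

  fail(id: string, message: string): void {
    this.map.set(id, { state: "error", message });
  }
}

const statuses = new StatusRegistry();
statuses.beginRun("sess_1", 1_700_000_000_000);
statuses.retrying("sess_1", 1, "rate limited, retry in 12s");
```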

Permission scopes

Permissions evaluated at tool-call time come from a layered ruleset. Three scopes that compose:

| Scope | Lifetime | Set by |
| --- | --- | --- |
| Manifest | Compile-time, immutable | The agent author. Describes the agent's intrinsic policy. |
| Session | This session only | User replies to a watchdog ask with "always for this session". Stored on chat_sessions.permissions_json. |
| Project | All sessions in this project | User replies to a watchdog ask with "always". Or pre-configured by the host. |

Evaluation walks the layers in order; the most specific matching rule wins (manifest deny is overridden by a more recent session allow if and only if the manifest did not pin it; project rules override manifest defaults but never manifest pins; and so on). An implementation MAY flatten the layers into a single ranked ruleset at evaluation time.
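
One possible flattening into a single ranked ruleset is sketched below. The `Rule` shape, the `pinned` flag, the numeric ranks, and the "ask" fallback are assumptions chosen to match the examples in the paragraph above, not a normative algorithm.

```typescript
type Effect = "allow" | "deny" | "ask";
type Scope = "manifest" | "project" | "session";

// Hypothetical rule: exact tool-name match only, for brevity.
interface Rule {
  scope: Scope;
  tool: string;
  effect: Effect;
  pinned?: boolean; // only meaningful on manifest rules
}

// Higher rank wins: manifest pins, then session, then project, then manifest defaults.
function rank(r: Rule): number {
  if (r.scope === "manifest") return r.pinned ? 3 : 0;
  return r.scope === "session" ? 2 : 1;
}

function evaluate(rules: Rule[], tool: string, fallback: Effect = "ask"): Effect {
  const matches = rules.filter((r) => r.tool === tool);
  if (matches.length === 0) return fallback;
  return matches.reduce((best, r) => (rank(r) > rank(best) ? r : best)).effect;
}

const rules: Rule[] = [
  { scope: "manifest", tool: "bash", effect: "deny" },
  { scope: "manifest", tool: "web_fetch", effect: "deny", pinned: true },
  { scope: "session", tool: "bash", effect: "allow" },
  { scope: "project", tool: "web_fetch", effect: "allow" },
];
const bashDecision = evaluate(rules, "bash");
const fetchDecision = evaluate(rules, "web_fetch");
```

Here the session allow overrides the unpinned manifest deny for bash, while the pinned manifest deny for web_fetch survives the project allow.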

The session row carries a permissions_json blob for the session scope. The project scope's storage is up to the host (a project-level config file, a per-user DB, both); the guide only requires it exists separately from session scope.

Three scopes, not one. A single ruleset elides the difference between "I trust this for this conversation" and "I trust this everywhere in this project." Collapsing them forces every "ask" reply into a project-permanent commitment.

Persistence

The default policy: save on every chunk. Detailed in Persistency / save policy.

See also

  • Foundations — AI SDK + locked tools + sandbox placement.
  • Persistency — the schema, save policy, ID strategy.
  • Tools — what the loop invokes.
  • Subagents — the specialized compaction subagent this page references.
  • UX Patterns — compositor, queued sends, sidecar, memory.
  • Debugging — inspection format and DX checklist.
  • ACP integration — the outward wire.