Session Lifecycle
A session is the unit of state. Everything an agent does — read,
write, call a tool, switch model, fork, restart — lands here. This
page covers the lifecycle (what happens between session-open and
session-close) and the runtime mechanics (streaming, abort,
status). The storage shape lives in persistency;
the structures referenced here are detailed there.
Sessions, messages, parts
A session contains messages; a message contains parts. The shape is
the AI SDK v6 message + part model. See
persistency / three-table schema
for the column-level shape.
The two properties this guide relies on:
- Messages are append-only. A user turn is a new row; an assistant reply is a new row. Existing messages are never rewritten.
- Parts are upserted while the stream lands. Text grows; tool calls transition input-streaming → input-available → output-available on the same row, keyed by tool_call_id. An aborted run leaves the partial state truthful on reload.
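The upsert behavior for tool-call parts can be sketched as below. This is a minimal in-memory illustration, not the storage layer itself: ToolPartRow, upsertToolPart, and the rule that out-of-order updates are ignored are assumptions made for the example. Only the three state names and the tool_call_id key come from the text.

```typescript
// Hypothetical part row; field names are illustrative, not the real schema.
type ToolState = "input-streaming" | "input-available" | "output-available";

interface ToolPartRow {
  tool_call_id: string;
  state: ToolState;
  input?: unknown;
  output?: unknown;
}

// Lifecycle order. Assumption: a part never moves backwards.
const ORDER: ToolState[] = ["input-streaming", "input-available", "output-available"];

// One row per tool_call_id, updated in place as the stream lands.
function upsertToolPart(table: Map<string, ToolPartRow>, update: ToolPartRow): ToolPartRow {
  const existing = table.get(update.tool_call_id);
  if (existing && ORDER.indexOf(update.state) < ORDER.indexOf(existing.state)) {
    return existing; // stale update: keep the more advanced state
  }
  const merged: ToolPartRow = { ...existing, ...update };
  table.set(update.tool_call_id, merged);
  return merged;
}

const parts = new Map<string, ToolPartRow>();
upsertToolPart(parts, { tool_call_id: "call_1", state: "input-streaming" });
upsertToolPart(parts, { tool_call_id: "call_1", state: "input-available", input: { path: "a.txt" } });
// An abort at this point leaves the row at "input-available",
// which is exactly what a reload would show.
```

Because the row is keyed by tool_call_id, a reload after an abort needs no reconciliation step: whatever state was last written is the state that is displayed.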
When the agent is fronted by an ACP adapter, the session
id is the same handle ACP returns from session/new and accepts
on session/load / session/resume / session/close — one
identifier across both layers.
Context window tracking
Tokens are tracked per role on every assistant message and rolled up to the session row:
chat_messages.metadata_json: {
usage: {
input, // user + system tokens this step consumed
output, // tokens the model produced this step
reasoning, // thinking-token charge, when the model emits one
cache_read, // tokens served from the provider's prompt cache
cache_write, // tokens written to the provider's prompt cache
}
}
Session-level rollups (prompt_tokens, completion_tokens,
reasoning_tokens, cache_read, cache_write, total_tokens,
cost_usd) are sums of the above across the session's assistant
messages.
Why the breakdown
- Compaction needs input + output + cache_read + cache_write to decide whether the next turn will overflow. Cache reads still count against the context window; only the charge for them is lower.
- Cost reporting needs reasoning separate. Providers price reasoning tokens at the output rate but report them separately; rolling them into output hides cache-vs-thinking ratios that matter for picking a tier.
- Cache write vs cache read tells the picker how much of the conversation is hot. A session at 80% cache hit can sustain ten more turns; one at 5% cannot.
Source of truth
The AI SDK's per-step onStepFinish callback delivers a usage
object per turn. The recorder bumps the session row from there,
NEVER from finish-step chunks on the wire — keeping one source of
truth avoids double-counting when an SDK update starts emitting
both.
The SDK's inputTokens field already includes the cache-read
and cache-write counts. The recorder MUST subtract them out before
persisting prompt_tokens, or the cache columns get double-counted
in the session rollup. See
ai-sdk / token usage
for the exact formula and the per-component mapping.
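The subtraction described above might look like the sketch below. The StepUsage field names are placeholders chosen for this example, not the SDK's actual property names, and the clamp to zero is an added safety assumption. The referenced token-usage page remains the authority on the exact formula.

```typescript
// Assumed shape of the per-step usage object handed to the recorder.
// Field names are hypothetical; map them from the SDK version in use.
interface StepUsage {
  inputTokens: number; // assumed to INCLUDE cache reads and cache writes
  outputTokens: number;
  reasoningTokens?: number;
  cacheReadTokens?: number;
  cacheWriteTokens?: number;
}

// Mirrors the usage breakdown stored in chat_messages.metadata_json.
interface PersistedUsage {
  input: number;
  output: number;
  reasoning: number;
  cache_read: number;
  cache_write: number;
}

function toPersistedUsage(u: StepUsage): PersistedUsage {
  const cache_read = u.cacheReadTokens ?? 0;
  const cache_write = u.cacheWriteTokens ?? 0;
  return {
    // Subtract the cache counts so they are not counted twice in the rollup.
    input: Math.max(0, u.inputTokens - cache_read - cache_write),
    output: u.outputTokens,
    reasoning: u.reasoningTokens ?? 0,
    cache_read,
    cache_write,
  };
}

// Example: 1000 reported input tokens, of which 700 were cache reads
// and 100 were cache writes, persist as 200 fresh input tokens.
const persisted = toPersistedUsage({
  inputTokens: 1000,
  outputTokens: 200,
  cacheReadTokens: 700,
  cacheWriteTokens: 100,
});
```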
Cost
cost_usd is recorded when a writer (a hosted route, a metadata
hook) supplies it. The agent system MUST NOT compute cost from a
pricing table inside the recorder. Pricing lives next to the model
catalog, not next to the session store.
What "the context window" means here
The number compared against the model's limit is:
context_window_used =
chat_sessions.prompt_tokens +
chat_sessions.completion_tokens +
chat_sessions.reasoning_tokens +
chat_sessions.cache_read +
chat_sessions.cache_write
This is not the cumulative total of tokens billed (that includes per-turn
re-reads), and not just input + output (that misses reasoning and
cache).
Rewinding
A user MUST be able to rewind to a prior user message and edit or resend it. The unit is the user message because:
- Assistant turns are non-deterministic; "rewind to the assistant's word 47" is meaningless when the next run produces different words.
- The user message is the only deterministic checkpoint the user controls.
- A turn-pair (user, assistant) is the unit of rollback in every chat product the user has used.
What happens on rewind
- The user picks a prior user message. The UI shows the message body in the editor; the user edits or accepts.
- The system soft-truncates the conversation: every message and part after the chosen one is marked invisible to the next LLM call.
- The user submits. A new turn appends after the chosen message; the model sees the conversation as if everything after the rewind point never happened.
- The hidden messages remain in the DB for inspection, un-rewind, and audit. They are NOT deleted.
Soft-truncate vs delete
Soft-truncate (a hidden_at per message) is the required shape. It
preserves history for inspection and lets a user un-rewind by moving
the pointer back. An implementation that hard-deletes on rewind
trades inspection for storage, and SHOULD document the loss.
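A minimal sketch of soft-truncation, assuming a per-message hidden_at timestamp as described above. The MessageRow shape, the function names, and the choice to un-rewind by matching the timestamp stamped at rewind time are illustrative assumptions, not the required implementation.

```typescript
// Hypothetical row shape: only hidden_at is taken from the text.
interface MessageRow {
  id: string;
  role: "user" | "assistant";
  hidden_at: number | null; // epoch ms when hidden; null when visible
}

// Rewind: hide every message after the chosen user message. Nothing is deleted.
function rewindTo(messages: MessageRow[], targetId: string, now: number): void {
  const idx = messages.findIndex((m) => m.id === targetId);
  const target = messages[idx];
  if (!target || target.role !== "user") {
    throw new Error("rewind target must be an existing user message");
  }
  for (const m of messages.slice(idx + 1)) {
    if (m.hidden_at === null) m.hidden_at = now;
  }
}

// Un-rewind (assumed mechanism): clear the stamp written by that rewind.
function unRewind(messages: MessageRow[], stampedAt: number): void {
  for (const m of messages) {
    if (m.hidden_at === stampedAt) m.hidden_at = null;
  }
}

// What the next LLM call sees.
function visibleToModel(messages: MessageRow[]): MessageRow[] {
  return messages.filter((m) => m.hidden_at === null);
}
```

Because rows are only stamped, the same table answers both "what does the model see" and "what happened before the rewind".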
Side-effect rewind
A rewound turn MAY have created files, run a shell command, or hit the network. The agent system does NOT undo side effects on rewind. Rewind is a prompt rewind, not a world rewind.
A host that wants world-rewind (a code agent that restores the workspace to its state at message N) ships a workspace-snapshot layer that hooks on user messages and tags the snapshot id into the message metadata. The contract:
- A user-message metadata field snapshot_id?: string that hosts populate when they snapshotted the world at submission time.
- A hook point (message.user, the same hook permission rules use) fires before the message lands, giving the host's snapshot layer a place to attach.
- On rewind, the host reads snapshot_id off the target message and restores. The guide is silent on how (git patches, VM snapshots, copy-on-write filesystems are all valid).
Without a snapshot layer, rewind is prompt-only. With one, the same flow restores both.
Rewinding past a compaction
A compaction creates a synthetic assistant message with a summary; the messages it summarized are soft-hidden. Rewinding past the compaction MUST un-hide them: the truncation pointer moves back over the summary AND the summarized turns. Implementations MUST NOT hard-delete on compaction, because doing so loses this property.
Branching
A branch is a new session whose parent_id points back to the
parent and whose parent_message_id points to the fork point. The
new session starts with a copy of the parent's messages up to and
including the fork point; new turns append only to the branch.
branch API
branch({
parent_session_id: string,
from_message_id: string, // a chat_messages.id reachable from parent_session_id
metadata?: object, // merged into the new session's metadata_json
}) → ChatSessionRow // the newly created session
Behavior:
- The runtime MUST reject the call if the parent session has a run in flight (SessionStatus.state != "idle"); a 4xx-equivalent error is returned.
- The runtime MUST copy every non-hidden message up to and including from_message_id into the new session with new ids. Parts are copied verbatim.
- Token rollups MUST be recomputed from the copied messages, not copied from the parent's row.
- The metadata blob is merged into the new session's metadata_json. metadata.ephemeral = true is the convention for sidecar branches; see ux / sidecar.
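The four behaviors above can be sketched against in-memory rows. Everything here is a simplification for illustration: the Msg and Session shapes, the per-message tokens field standing in for the usage breakdown, the id generator, and the passing of status as an argument are all assumptions.

```typescript
interface Msg {
  id: string;
  session_id: string;
  hidden_at: number | null;
  data_json: string;
  tokens: number; // simplification: one number instead of the usage breakdown
}

interface Session {
  id: string;
  parent_id: string | null;
  parent_message_id: string | null;
  total_tokens: number;
  metadata_json: Record<string, unknown>;
}

let counter = 0;
const newId = (prefix: string): string => `${prefix}_${++counter}`; // hypothetical id strategy

function branch(
  parent: Session,
  parentMessages: Msg[],
  fromMessageId: string,
  parentState: "idle" | "busy" | "retrying" | "error",
  metadata: Record<string, unknown> = {},
): { session: Session; messages: Msg[] } {
  if (parentState !== "idle") throw new Error("parent session has a run in flight");
  const cut = parentMessages.findIndex((m) => m.id === fromMessageId);
  if (cut === -1) throw new Error("from_message_id is not reachable from the parent");

  const session: Session = {
    id: newId("sess"),
    parent_id: parent.id,
    parent_message_id: fromMessageId,
    total_tokens: 0,
    metadata_json: { ...parent.metadata_json, ...metadata },
  };
  // Copy non-hidden messages up to and including the fork point, with new ids.
  const messages = parentMessages
    .slice(0, cut + 1)
    .filter((m) => m.hidden_at === null)
    .map((m) => ({ ...m, id: newId("msg"), session_id: session.id }));
  // Recompute the rollup from the copies instead of copying the parent's row.
  session.total_tokens = messages.reduce((sum, m) => sum + m.tokens, 0);
  return { session, messages };
}
```

Recomputing the rollup matters when the parent contains hidden messages: the parent's row may count tokens the branch never received.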
Why a new session and not a fork of the same session
- Each branch needs its own running stream, its own token rollup, its own model selection. A session is the natural unit.
- Two branches MUST be inspectable side by side. Two rows are easier to render than one row with a tree shape.
- The picker shows branches as siblings under the parent without
schema gymnastics —
SELECT * FROM chat_sessions WHERE parent_id = ?.
What gets copied on branch
- Every message and part up to and including the fork point. The copies have new ids; their data_json is verbatim from the source.
- The session row's settings (agent, model, workspace, metadata).
- Token rollups for the copied turns. The branch starts with the same total_tokens the parent had at the fork point.
What does NOT get copied
- Side effects the parent took. Branching is a conversation fork, not a workspace fork. If the host wants workspace-snapshot branching, it layers it on the message metadata (same hook as side-effect rewind).
- The parent's in-flight run, if any. A branch CANNOT fork a running turn; the user MUST wait for the parent's turn to finish or abort.
Sidecar = ephemeral branch
A sidecar chat (the "ask the model a side question without messing
up the main thread") is exactly a branch — same wire, same shape —
with one UX twist: the host marks the branch ephemeral, hiding it
from the picker. The schema gains nothing; the host filters on a
metadata flag. See ux / sidecar.
Compaction
Compaction is the act of replacing a stretch of conversation history with a summary, freeing tokens for the next turn. It is not optional once a conversation approaches the model's context window. An implementation that ships a "the chat just stops working" failure mode is shipping a bug.
Threshold
Compaction fires when usable context is exceeded:
usable = model.context_limit - reserve
context_window_used >= usable → fire compaction
reserve is the headroom kept for the next turn's output and
reasoning. Default: min(20k tokens, model.max_output).
Implementors tune for product shape (a code agent that emits long
diffs needs a larger reserve; a chat agent does not).
The threshold is per-model. A 200k model and a 1M model fire at different absolute token counts. A session that switches model mid-conversation (see Per-turn model switch) re-evaluates against the new model's limit before the next turn.
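As a worked illustration of the threshold, the sketch below combines the context_window_used sum with the default reserve. The ModelLimits and SessionRollup interfaces are assumed shapes for this example; the arithmetic follows the formulas above.

```typescript
interface ModelLimits {
  context_limit: number;
  max_output: number;
}

interface SessionRollup {
  prompt_tokens: number;
  completion_tokens: number;
  reasoning_tokens: number;
  cache_read: number;
  cache_write: number;
}

const DEFAULT_RESERVE = 20000;

function contextWindowUsed(s: SessionRollup): number {
  return s.prompt_tokens + s.completion_tokens + s.reasoning_tokens + s.cache_read + s.cache_write;
}

function shouldCompact(
  s: SessionRollup,
  model: ModelLimits,
  reserve: number = Math.min(DEFAULT_RESERVE, model.max_output),
): boolean {
  const usable = model.context_limit - reserve;
  return contextWindowUsed(s) >= usable;
}

// 185k tokens used: over the line for a 200k model (usable = 180k),
// far below it for a 1M model (usable = 980k).
const rollup: SessionRollup = {
  prompt_tokens: 40000,
  completion_tokens: 20000,
  reasoning_tokens: 5000,
  cache_read: 110000,
  cache_write: 10000,
};
const fires200k = shouldCompact(rollup, { context_limit: 200000, max_output: 32000 });
const fires1M = shouldCompact(rollup, { context_limit: 1000000, max_output: 32000 });
```

The same rollup gives different answers for the two models, which is why a mid-conversation model switch has to re-run the check.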
Auto vs manual
- Auto-compaction is the default. For casual users, the system fires compaction before the next overflow and replays seamlessly. The user sees a one-line "summarized N earlier turns" affordance.
- Manual compaction lets the user fire it on demand (a slash command, a button). Useful when the user knows the next turn will be expensive and wants the conversation cheap to re-run.
Both produce the same artifact in the schema; only the trigger differs.
What compaction produces
A compaction part (type: "data-compaction") on a synthetic
assistant message. The part's data_json carries:
{
summary: string, // Markdown body, sectioned (Goal / Progress / Decisions / Next Steps is a reasonable shape)
tail_start_id: string, // chat_messages.id of the first message kept verbatim
auto: boolean, // true for auto-compaction, false for user-fired
summary_tokens: int, // token count of `summary`, for the next rollup
}
The model sees the summary plus every message from tail_start_id
onward.
The summarized messages are soft-hidden, not deleted. Inspection, rewind, and "show what was summarized" all need them. The model's view skips them; the picker can expose them under "details".
Tail preservation
Compaction keeps the N most recent turns verbatim. Default
N = 2 (one user message + one assistant response). The tail budget
caps at ~25% of usable. If the tail at N=2 exceeds the budget, the
implementor either:
- Drops to N=1 (last turn only).
- Splits the last turn (keep the last user message and the assistant's final text; drop intermediate tool calls and reasoning).
Default: "drop to N=1 and warn". "Split the last turn" is for agents that habitually run long tool chains.
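The default policy ("drop to N=1 and warn") can be sketched as follows. The Turn shape and the pickTail name are hypothetical; as in the text, each entry counts toward N, so N = 2 is one user message plus one assistant response.

```typescript
interface Turn {
  id: string;
  tokens: number;
}

function pickTail(
  turns: Turn[],
  usable: number,
  tailTurns = 2,
  tailBudgetPct = 0.25,
): { kept: Turn[]; warned: boolean } {
  const budget = usable * tailBudgetPct;
  const takeLast = (n: number): Turn[] => turns.slice(Math.max(0, turns.length - n));
  const total = (ts: Turn[]): number => ts.reduce((acc, t) => acc + t.tokens, 0);

  const full = takeLast(tailTurns);
  if (total(full) <= budget) return { kept: full, warned: false };
  // Over budget: fall back to the last entry only and flag a warning.
  return { kept: takeLast(1), warned: true };
}
```

The "split the last turn" alternative would replace the fallback branch; it is omitted here because it depends on part-level structure this sketch does not model.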
Summarizer cost discipline
The summarizer is a specialized subagent running the same loop.
See subagents / specialized subagents.
Required discipline:
- Cheapest model the provider exposes (nano / small tier).
- Low temperature.
- Hard maxOutputTokens cap.
- Short timeout.
- A specialized system prompt that constrains output format.
- No tools.
- mode: "subagent", inspectable: false.
The summarizer's input is the soft-hidden history; its output lands in the CompactionPart. One model call, one shot.
Failure modes
| Failure | Trigger | Recovery |
|---|---|---|
| Transient (network, provider 5xx) | Summarizer call failed once | Retry with backoff. The main session blocks on retry up to N seconds, then proceeds without compaction (warning logged). |
| Spec limit (history > model's input cap) | Sum of soft-hidden history exceeds the model's input | Hard failure. Surface to the user; suggest a new session or a tool-output prune first. |
| Best-effort prune | Spec-limit failure, prune is on | Run a tool-output prune pass: erase output of completed tool calls (keep input + result-truthy state), then retry. |
| Soft / smart fallback | Spec-limit failure, prune still insufficient | Split soft-hidden history into chunks; summarize each; concatenate. Or drop the middle and keep head + tail. |
The implementor's compaction config:
{
"reserve_tokens": "int", // headroom for next turn (default 20000)
"tail_turns": "int", // recent turns to keep verbatim (default 2)
"tail_budget_pct": "float", // tail token budget as fraction (default 0.25)
"retry_on_transient": "int", // retry count on transient failure (default 2)
"smart_recovery": {
"prune_tool_outputs": "bool", // try output prune before chunked summarize
"chunked_summarize": "bool", // summarize in chunks if one-shot won't fit
"drop_middle": "bool", // last-resort: keep head + tail, drop middle
},
}
The shape is normative; the defaults are reasonable starting points.
Tool-output pruning vs compaction
Tool-output pruning is a separate, cheaper pass that runs before a real compaction is attempted:
- Walk backwards through completed turns.
- For each completed tool call, if its output is large and not protected (outputs still referenced by the current task are protected: todo, task state, the model's most recent files), drop the output and keep the tool's input + a stub like <output pruned, N tokens>.
- Stop once enough tokens are reclaimed.
Pruning preserves the conversation's logical structure; the model
still sees "I called read on file X" without re-reading X's
contents. A code agent that ran 30 reads usually only needs the
last 5 in context.
Pruning MAY be triggered:
- Automatically before compaction (cheap, often enough).
- Manually by the user ("free some space").
- Periodically on a token threshold lower than the compaction threshold (e.g. at 60% context prune; at 90% compact).
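A pruning pass along these lines might look like the sketch below. The ToolCallPart shape, the protected-tool set, the "large" cutoff, and the keepRecent window standing in for "the model's most recent files" are all assumptions for illustration.

```typescript
interface ToolCallPart {
  tool: string;
  input: unknown;
  output: string;
  output_tokens: number;
  pruned: boolean;
}

// Assumed protected set; a real host would derive this from its tool catalog.
const PROTECTED_TOOLS = new Set(["todo", "task"]);

function pruneToolOutputs(
  parts: ToolCallPart[], // completed tool calls, oldest first
  tokensToReclaim: number,
  minTokens = 500, // what counts as "large" (assumption)
  keepRecent = 5, // most recent calls left untouched (assumption)
): number {
  let reclaimed = 0;
  // Walk backwards, skipping the protected recent window.
  for (let i = parts.length - 1 - keepRecent; i >= 0 && reclaimed < tokensToReclaim; i--) {
    const p = parts[i];
    if (!p || p.pruned || PROTECTED_TOOLS.has(p.tool) || p.output_tokens < minTokens) continue;
    p.output = `<output pruned, ${p.output_tokens} tokens>`; // input is kept as-is
    p.pruned = true;
    reclaimed += p.output_tokens;
  }
  return reclaimed;
}
```

The return value lets the caller decide whether pruning alone was enough or whether a real compaction still has to run.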
Per-turn model switch
The model is per-message, not per-session. A user MAY pick a different model on any turn, and the new model carries forward until they pick again.
Shape. The user message metadata carries { provider_id, model_id, variant? } for that turn. The session row's model_json
is the most recent active model — a denormalization for the
picker so it can render "Currently using X." Historical use is
queryable from the messages.
Why per-message and not per-session.
- Users push the same prompt through cheap and premium models to compare. Forcing a new session for each pin is hostile.
- Compaction uses a cheap model. If model were a session attribute, compaction would swap and swap back, racing with the user's own swap.
- Variants (reasoning mode, JSON mode, thinking depth) are per-turn by their nature. A reasoning-mode toggle that lasts the session is a bug.
Compaction interacts with model switch
| Scenario | Behavior |
|---|---|
| New model has larger context than current usage | No-op. Conversation proceeds. |
| New model has smaller context, but next turn fits | No-op. Conversation proceeds. |
| New model's context cannot fit the current rollup | Force compaction before the turn proceeds. The next user message blocks on the summarizer. |
| After compaction, history still does not fit | The model swap fails. Surface to the user: "Switching to a model with a smaller context window would lose the conversation; consider branching with the new model." |
Branching is the escape hatch for the last case: a branch with the new model has only the parent's tail in scope (or the user manually picks the slice).
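The table reduces to a small decision function. In this sketch the function name, the result labels, and the compact callback (which returns the context size after compaction) are hypothetical; the branching logic follows the four rows.

```typescript
type SwitchDecision = "proceed" | "compact-then-proceed" | "reject-switch";

function decideModelSwitch(
  used: number, // current context_window_used
  nextTurnEstimate: number, // assumed estimate of the next turn's size
  usableNew: number, // new model's context_limit minus reserve
  compact: () => number, // runs compaction, returns tokens used afterwards
): SwitchDecision {
  // Rows 1 and 2: the next turn fits, whatever the relative window sizes.
  if (used + nextTurnEstimate <= usableNew) return "proceed";
  // Row 3: force compaction before the turn proceeds.
  const afterCompaction = compact();
  if (afterCompaction + nextTurnEstimate <= usableNew) return "compact-then-proceed";
  // Row 4: even the compacted history does not fit; the swap fails.
  return "reject-switch";
}
```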
System prompt assembly
Every turn assembles a fresh system prompt from these sections, in this order, top to bottom:
- Agent manifest prompt: the agent's intrinsic system prompt.
- Project instructions: concatenated AGENTS.md / CLAUDE.md / CONTEXT.md content, walked from the outermost file inward (nearest-last, so the project root has the final word). See skills / project instructions.
- Skill index: names + one-line descriptions of every discovered skill. Bodies load on demand. See skills.
- Environment context: platform, workspace root, git status, current date, resolved model + variant.
- Tool catalog: JSON Schema for every active tool (locked, agent-specific, materialized MCP).
The order is normative; implementors MUST NOT shuffle the sections. Two implementations that follow this order produce digest-comparable prompts for the same inputs.
Optional system_prompt_digest storage MAY persist a SHA-256 of the
assembled prompt on each assistant message; see
persistency / metadata conventions.
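A sketch of ordered assembly plus the optional digest, assuming Node's built-in crypto module. The PromptSections shape and the blank-line separator are assumptions; two implementations would need to agree on the separator as well as the order for their digests to match.

```typescript
import { createHash } from "node:crypto";

interface PromptSections {
  manifest: string;
  projectInstructions: string;
  skillIndex: string;
  environment: string;
  toolCatalog: string;
}

// Sections are joined in the normative order. The separator is an assumption.
function assembleSystemPrompt(s: PromptSections): string {
  return [s.manifest, s.projectInstructions, s.skillIndex, s.environment, s.toolCatalog].join("\n\n");
}

// SHA-256 of the assembled prompt, hex-encoded (64 characters).
function systemPromptDigest(prompt: string): string {
  return createHash("sha256").update(prompt, "utf8").digest("hex");
}
```

Storing only the digest keeps assistant-message metadata small while still letting a debugger tell whether two turns ran under the same prompt.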
Streaming and layering
The core emits AI SDK chunks to a transport the host owns. The core does NOT render anything.
┌─ Agent system core ──────────────────────────────────────────────┐
│ run(session, input, model) │
│ → AsyncIterable<UIMessageChunk> │
│ │
│ one universal LLM loop: │
│ - assemble system prompt (see above) │
│ - call model (native runtime or AI SDK adapter) │
│ - emit chunks │
│ - on tool call: validate args, capability check, watchdog, │
│ execute, emit tool-output chunks │
│ - loop until no tool calls or abort │
└──────────────┬───────────────────────────────────────────────────┘
               │ same chunk shape, two consumers
               │
       ┌───────┴────────┐
       ▼                ▼
   recorder        host transport
   (writes parts   (SSE / IPC / WS / in-memory)
    to the DB)          │
                        ▼
                 client / renderer
                 (drops chunks into the AI SDK reducer;
                  renders to user)
Strict layering
- The core MUST NOT know whether the host transport is SSE, IPC, WebSocket, gRPC, or in-memory.
- The host MUST NOT know how the chunks were produced (native runtime vs AI SDK adapter).
- The client / renderer MUST NOT know about the recorder.
The renderer's only job
The renderer renders. It does NOT own the recorder. It does NOT own model selection. It does NOT own permission state. These belong to the core because they survive the renderer being closed.
Resume across renderer disconnect
A renderer that closes its connection mid-stream (page refresh, OS sleep, window close) MUST NOT cancel the upstream model call. The required behavior:
- The core keeps the model call alive while at least one consumer is attached. The recorder is always attached for the run's lifetime.
- A reconnect endpoint, given the session id, replays the chunk log from the start and live-tails until the upstream finishes.
- Replay starts from index 0 (not from a cursor). The AI SDK reducer rejects text-delta for a part it has not seen text-start for; cursor-based resume would require chunk rewriting and is not worth the complexity.
Lifespan of the live stream registry. In-process, in-memory, keyed by session id. A host restart drops every entry; the renderer falls back to hydrating from the DB. Cross-restart resume is out of scope — the upstream provider has no notion of "your previous request."
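The replay-then-tail behavior and the in-memory registry can be sketched with a callback-based stream. LiveStream, reconnect, and the Chunk shape are hypothetical names for this example; a real host would sit this behind its own transport.

```typescript
type Chunk = { type: string; [key: string]: unknown };
type Listener = (chunk: Chunk) => void;

class LiveStream {
  private log: Chunk[] = [];
  private listeners = new Set<Listener>();
  private done = false;

  // Called by the core for every chunk it emits.
  push(chunk: Chunk): void {
    this.log.push(chunk);
    this.listeners.forEach((l) => l(chunk));
  }

  finish(): void {
    this.done = true;
    this.listeners.clear();
  }

  // Reconnect: replay the full log from index 0, then live-tail.
  attach(listener: Listener): () => void {
    for (const chunk of this.log) listener(chunk);
    if (!this.done) this.listeners.add(listener);
    return () => {
      this.listeners.delete(listener); // detaching never cancels the upstream
    };
  }
}

// In-process registry keyed by session id; lost on host restart.
const registry = new Map<string, LiveStream>();

function reconnect(sessionId: string, listener: Listener): (() => void) | undefined {
  // undefined means no live stream: the caller hydrates from the DB instead.
  return registry.get(sessionId)?.attach(listener);
}
```

Detaching only removes a listener, so a renderer that disappears leaves the log and the upstream call untouched.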
Orphaned in-flight tool calls on restart. A tool-input-available
part with no matching tool-output-* companion MUST be finalized as
a tool-error envelope ("aborted by host restart" or equivalent)
before the next run on that session is allowed. Without finalization
the assembled prompt would contain an unresolved tool call and the
loop could not proceed. The rule applies to every tool — bash,
web_fetch, MCP calls, task — uniformly.
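Finalizing orphans on startup can be sketched as a pure function over a session's parts. The ToolPart shape is illustrative, and "tool-output-error" is an assumed name for the error envelope; the text only requires that some tool-error companion exists.

```typescript
interface ToolPart {
  tool_call_id: string;
  type: string; // e.g. "tool-input-available", "tool-output-available"
  error?: string;
}

function finalizeOrphans(parts: ToolPart[], reason = "aborted by host restart"): ToolPart[] {
  // Every call id that already has a tool-output-* companion is resolved.
  const resolved = new Set(
    parts.filter((p) => p.type.startsWith("tool-output-")).map((p) => p.tool_call_id),
  );
  const orphans = parts.filter(
    (p) => p.type === "tool-input-available" && !resolved.has(p.tool_call_id),
  );
  // Append one error envelope per orphan; existing parts are left untouched.
  const errors: ToolPart[] = orphans.map((o) => ({
    tool_call_id: o.tool_call_id,
    type: "tool-output-error", // assumed envelope name
    error: reason,
  }));
  return [...parts, ...errors];
}
```

The function is idempotent: a second pass finds the error envelopes it wrote the first time and adds nothing.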
Multi-replica deployments. The in-memory registry assumes one
process owns the session. A horizontally-scaled deployment (multiple
replicas behind a load balancer) MUST swap the in-memory map for a
pubsub layer so any replica can subscribe to a stream another
replica is producing. vercel/resumable-stream
is one option that matches this shape (Redis-backed pubsub, same
chunk vocabulary). The protocol does not change — replay still
walks the DB, live resume still hooks the registry — only the
registry's storage moves out of process.
ACP mapping. An ACP-fronted host exposes two resume-style
methods: session/load (replay every past message as a
session/update notification — used after a host restart) and
session/resume (rejoin without replay — used by a reconnecting
client that already has the history). Both map onto this layer;
load walks the DB rows and resume hooks into the in-memory
registry. See ACP integration.
Interruption
The user MUST be able to abort the assistant mid-turn. The abort path:
- The host's UI surfaces an abort button.
- The host calls abort(session_id).
- The core's per-session run state holds an AbortController. The controller's signal is the one passed to the model call and to every tool's execute.
- abort() calls controller.abort(). The model call cancels at the next chunk boundary; tools that watch the signal cancel themselves; tools that do not watch finish naturally.
- The recorder finalizes the in-flight assistant message. Tool calls in input-streaming or input-available stay frozen in that state. Text in flight stops where it stopped.
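The controller wiring can be sketched with the standard AbortController. The runs map, startRun, and sleepTool are hypothetical names; sleepTool stands in for any tool that watches the signal.

```typescript
// Per-session run state, owned by the core.
const runs = new Map<string, { controller: AbortController }>();

function startRun(sessionId: string): AbortSignal {
  const controller = new AbortController();
  runs.set(sessionId, { controller });
  return controller.signal; // passed to the model call and every tool
}

function abort(sessionId: string): boolean {
  const run = runs.get(sessionId);
  if (!run) return false; // nothing in flight
  run.controller.abort();
  return true;
}

// A tool that watches the signal and cancels itself.
function sleepTool(ms: number, signal: AbortSignal): Promise<string> {
  return new Promise<string>((resolve, reject) => {
    if (signal.aborted) {
      reject(new Error("aborted"));
      return;
    }
    const timer = setTimeout(() => resolve("done"), ms);
    signal.addEventListener(
      "abort",
      () => {
        clearTimeout(timer);
        reject(new Error("aborted"));
      },
      { once: true },
    );
  });
}
```

A tool that ignores the signal simply runs to completion, which matches the "finish naturally" case above.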
Abort vs TCP close
A renderer that closes its TCP connection has not aborted; it has detached. The model call keeps going; the recorder keeps writing. Only an explicit abort (its own endpoint) cancels.
The cost of distinguishing the two is one endpoint. The cost of collapsing them is a user who refreshed mid-stream and lost every already-spent token.
Subagent abort
Aborting the parent's run aborts every subagent spawned by that
run. The signal propagates through the task tool's
implementation: a child session running on the parent's abort
signal sees the abort, stops, and its tool call returns an "aborted
by parent" error.
Session status
The streaming layer is one-directional (core → transport → client). Session status is the back-channel — what a client uses to know whether the session is currently running, retrying, or idle, without subscribing to the chunk stream.
Shape:
SessionStatus = {
state: "idle" | "busy" | "retrying" | "error",
attempt?: int, // current retry attempt, when state="retrying"
message?: string, // human-readable status, when state="retrying" | "error"
started_at?: int, // epoch ms; present when state="busy" | "retrying"
}
Where it lives.
- Authoritative source: an in-memory map keyed by session id, owned by the core, mutated by the run-state machine.
- Subscription transport: an event stream on the host's bus. The event payload is the SessionStatus shape.
- Read API: get_status(session_id) → SessionStatus for consumers that join late.
Not persisted. Status is volatile. On host restart, every
session reads as idle — correct, because no run is in flight after
a restart.
One run per session at a time. The state machine refuses a new
run while state != "idle". Hosts that want to surface "you have a
turn already running" do so by checking status before submitting.
error clears on the next submission attempt: the state machine
transitions to busy for that run and follows the normal lifecycle
from there.
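A sketch of the in-memory status map and its transitions. The StatusMap class and its method names are hypothetical. It reads the two rules above together as: refuse while a run is in flight (busy or retrying), and let a submission from error clear the error.

```typescript
interface SessionStatus {
  state: "idle" | "busy" | "retrying" | "error";
  attempt?: number;
  message?: string;
  started_at?: number;
}

class StatusMap {
  private map = new Map<string, SessionStatus>();

  // Sessions with no entry read as idle, which is also the post-restart state.
  get(sessionId: string): SessionStatus {
    return this.map.get(sessionId) ?? { state: "idle" };
  }

  beginRun(sessionId: string, now: number): void {
    const { state } = this.get(sessionId);
    if (state === "busy" || state === "retrying") {
      throw new Error("a run is already in flight for this session");
    }
    // From idle or error: the new run starts and any error is cleared.
    this.map.set(sessionId, { state: "busy", started_at: now });
  }

  retry(sessionId: string, attempt: number, message: string): void {
    const started_at = this.get(sessionId).started_at;
    this.map.set(sessionId, { state: "retrying", attempt, message, started_at });
  }

  fail(sessionId: string, message: string): void {
    this.map.set(sessionId, { state: "error", message });
  }

  finish(sessionId: string): void {
    this.map.delete(sessionId);
  }
}
```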
Retry visibility. When the model call fails transiently (rate
limit, provider 5xx) and the loop backs off, status transitions to
retrying with the current attempt count and a message sourced
from the error (e.g. "rate limited, retry in 12s"). The client
renders this directly; the user understands the delay is not their
fault.
Permission scopes
Permissions evaluated at tool-call time come from a layered ruleset. Three scopes that compose:
| Scope | Lifetime | Set by |
|---|---|---|
| Manifest | Compile-time, immutable | The agent author. Describes the agent's intrinsic policy. |
| Session | This session only | User replies to a watchdog ask with "always for this session". Stored on chat_sessions.permissions_json. |
| Project | All sessions in this project | User replies to a watchdog ask with "always". Or pre-configured by the host. |
Evaluation walks the layers in order; the most specific matching rule wins (manifest deny is overridden by a more recent session allow if and only if the manifest did not pin it; project rules override manifest defaults but never manifest pins; and so on). An implementation MAY flatten the layers into a single ranked ruleset at evaluation time.
The session row carries a permissions_json blob for the session
scope. The project scope's storage is up to the host (a project-level
config file, a per-user DB, both); the guide only requires it exists
separately from session scope.
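One possible reading of the layering, flattened into a single lookup. The Rule shape, the pinned flag, the session > project > manifest precedence, and the "ask" default are all assumptions made for this sketch; the text fixes only that manifest pins are never overridden.

```typescript
type Decision = "allow" | "deny" | "ask";

interface Rule {
  tool: string;
  decision: Decision;
  pinned?: boolean; // only meaningful on manifest rules
}

interface Layers {
  manifest: Rule[];
  project: Rule[];
  session: Rule[];
}

function evaluate(tool: string, layers: Layers): Decision {
  const find = (rules: Rule[]): Rule | undefined => rules.find((r) => r.tool === tool);
  const manifestRule = find(layers.manifest);
  // Manifest pins win over every other scope.
  if (manifestRule?.pinned) return manifestRule.decision;
  // Otherwise the narrowest scope with a matching rule wins.
  return (
    find(layers.session)?.decision ??
    find(layers.project)?.decision ??
    manifestRule?.decision ??
    "ask" // assumed default when no rule matches
  );
}
```

Keeping the session and project arrays separate is what lets "always for this session" expire with the session while "always" persists.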
Three scopes, not one. A single ruleset elides the difference between "I trust this for this conversation" and "I trust this everywhere in this project." Collapsing them forces every "ask" reply into a project-permanent commitment.
Persistence
The default policy: save on every chunk. Detailed in Persistency / save policy.
See also
- Foundations — AI SDK + locked tools + sandbox placement.
- Persistency — the schema, save policy, ID strategy.
- Tools — what the loop invokes.
- Subagents — the specialized compaction subagent this page references.
- UX Patterns — compositor, queued sends, sidecar, memory.
- Debugging — inspection format and DX checklist.
- ACP integration — the outward wire.