Notes on Agent Context Compression Design
Reference: Context Compression Instruction: Prompt Analysis of Claude Code and Gemini
What Problem Does Context Compression Solve?
An agent’s context window is not infinite. As multi-turn conversations, tool calls, file reads, error logs, and code diffs accumulate, the model gradually approaches the token limit. The goal of context compression is not simply to “make it shorter,” but to preserve task continuity while reorganizing history into a state that the next agent turn can continue from.
I treat context compression as a work handoff:
- Keep what the user is actually trying to accomplish
- Keep project constraints, tech stack, and key decisions
- Keep file states that were read, modified, or created
- Keep errors, fixes, and unresolved issues
- Drop repetitive, outdated, and noisy tool outputs
- Let the next context window continue execution instead of re-exploring
A good compression system should answer three questions:
- When to compress: scheduling strategy based on token thresholds, message length, tool output size, etc.
- What to compress: user messages, system constraints, tool results, file states, or plans
- How to compress: LLM summarization, rule-based trimming, retrieval reconstruction, or a hybrid approach
Classic Approach 1: LLM Summarization Compression
Both Claude Code and Gemini CLI share one core idea: when the context grows too long, pass the history to a model and have it output a structured summary. That summary becomes the core memory of the next context window.
The advantage is strong semantic retention: goals, constraints, errors, and plans scattered across long history can be reorganized. The downside is that quality depends on prompt design. A weak prompt may lose file paths, snippets, user preferences, or unfinished tasks.
Claude Code Style: Detailed Structured Handoff
Claude Code-style compression is closer to a full handoff document. It emphasizes chronological analysis and focuses on user requests, technical details, file changes, error handling, and next steps.
Suggested fields:
| Field | Purpose |
|---|---|
| Primary requests and intent | Preserve the initial user goal and later intent shifts |
| Key technical concepts | Record stack, frameworks, architecture patterns, dependencies |
| Files and code sections | Track read/modified/created files and key snippets |
| Errors and fixes | Prevent repeating the same mistakes after compression |
| Problem-solving status | Separate resolved issues from ongoing debugging |
| User messages | Preserve original feedback to reduce intent distortion |
| Pending tasks | Make remaining work explicit |
| Current work state | Capture what was in progress before compression |
| Optional next steps | Keep only directly relevant follow-up actions |
The point is not “a pretty summary,” but “a handoff that can keep coding.” In coding-agent workflows, file paths, function names, test commands, failed logs, and user corrections are critical.
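One way to keep a handoff from drifting into free-form prose is to give it a typed shape and render it deterministically. The sketch below is only an illustration: `HandoffSummary`, `FileNote`, and `renderHandoff` are hypothetical names chosen to mirror the table above, not anything taken from Claude Code itself.

```typescript
// Hypothetical typed form of the handoff; field names mirror the table above.
interface FileNote {
  path: string;
  action: 'read' | 'modified' | 'created';
  note: string;
}

interface HandoffSummary {
  primaryRequests: string[];
  keyConcepts: string[];
  files: FileNote[];
  errorsAndFixes: string[];
  problemStatus: string[];
  userMessages: string[];
  pendingTasks: string[];
  currentWork: string;
  nextSteps?: string[];
}

// Render one titled section as a list; empty sections are skipped entirely.
function section(title: string, items: string[]): string {
  if (items.length === 0) return '';
  return `## ${title}\n` + items.map(item => `- ${item}`).join('\n') + '\n';
}

function renderHandoff(summary: HandoffSummary): string {
  const parts = [
    section('Primary requests and intent', summary.primaryRequests),
    section('Key technical concepts', summary.keyConcepts),
    section(
      'Files and code sections',
      summary.files.map(f => `${f.path} (${f.action}): ${f.note}`)
    ),
    section('Errors and fixes', summary.errorsAndFixes),
    section('Problem-solving status', summary.problemStatus),
    section('User messages', summary.userMessages),
    section('Pending tasks', summary.pendingTasks),
    section('Current work state', [summary.currentWork]),
    section('Optional next steps', summary.nextSteps ?? []),
  ];
  return parts.filter(part => part !== '').join('\n');
}
```

Keeping file entries as structured records rather than prose makes it harder for a path or an action to silently disappear during compression.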
Compression template:
Please compress the conversation history into a handoff summary that can continue execution.
Must keep:
1. User’s primary goals and explicit requests
2. Tech stack, architecture constraints, and key decisions
3. Files read/modified/created/deleted and why
4. Key code snippets, function signatures, config items
5. Encountered errors, failure logs, and fixes
6. Important user feedback and preferences
7. Completed items, pending items, and current pause point
8. Next-step suggestions directly related to the current task only
Must remove:
1. Repetitive explanations
2. Outdated tool outputs
3. Intermediate attempts that no longer help
4. Irrelevant small talk
Gemini CLI Style: State Snapshot
Gemini CLI-style compression is more like generating a compact state_snapshot. It uses fewer fields but packs more information into each one.
Typical fields:
| Field | Purpose |
|---|---|
| overall_goal | One-line high-level user objective |
| key_knowledge | Facts, constraints, and conventions that must be remembered |
| file_system_state | Created/read/modified/deleted file state |
| recent_actions | Recent key actions and outcomes |
| current_plan | Current plan and progress |
This style works well as a runtime snapshot, especially for recovery after interruption. It is shorter than the Claude-style handoff, so each field has to be stricter about keeping concrete details such as file paths, commands, and exact error text.
<state_snapshot>
<overall_goal>User's current high-level goal</overall_goal>
<key_knowledge>Critical facts, constraints, preferences, technical decisions</key_knowledge>
<file_system_state>File read/modify/create/delete state</file_system_state>
<recent_actions>Recent important actions and outcomes</recent_actions>
<current_plan>Current plan, completed steps, pending steps</current_plan>
</state_snapshot>
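If the snapshot is assembled in code rather than written freehand by the model, the tags above can be filled from a typed object. This is a minimal sketch under assumptions: the `StateSnapshot` interface, the list-style field bodies, and the helper names are hypothetical, and the escaping only covers the characters that would break the tag structure.

```typescript
// Hypothetical typed form of the snapshot; keys mirror the tags above.
interface StateSnapshot {
  overall_goal: string;
  key_knowledge: string[];
  file_system_state: string[];
  recent_actions: string[];
  current_plan: string[];
}

// Escape characters that would otherwise be read as markup.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Render a multi-item field as one tag with one bullet per item.
function renderList(tag: string, items: string[]): string {
  const body = items.map(item => `    - ${escapeXml(item)}`).join('\n');
  return `  <${tag}>\n${body}\n  </${tag}>`;
}

function renderSnapshot(s: StateSnapshot): string {
  return [
    '<state_snapshot>',
    `  <overall_goal>${escapeXml(s.overall_goal)}</overall_goal>`,
    renderList('key_knowledge', s.key_knowledge),
    renderList('file_system_state', s.file_system_state),
    renderList('recent_actions', s.recent_actions),
    renderList('current_plan', s.current_plan),
    '</state_snapshot>',
  ].join('\n');
}
```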
Classic Approach 2: Tool Message Trimming
In real agent systems, the biggest token consumer is often tool output, not user text or assistant replies. File reads, code search, test runs, and logs can explode token usage.
So tool-message trimming is highly practical:
- Keep system messages
- Keep normal user and assistant messages
- Remove outdated tool calls and tool outputs
- Keep only the last N tool rounds
- Summarize key tool outputs before deleting raw long outputs
A common policy: identify all tool rounds, keep only the last N, and remove older tool-related messages.
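type MessageRole = 'system' | 'user' | 'assistant' | 'tool';

interface Message {
  role: MessageRole;
  content: string;
  tool_calls?: unknown[];
  tool_call_id?: string;
}

interface CompressionOptions {
  enabled: boolean;
  keepLastToolRounds: number;
}

// Assumed definition: a tool round is an assistant message carrying tool_calls
// plus the tool results that immediately follow it.
function identifyToolRounds(messages: Message[]): { indexes: number[] }[] {
  const rounds: { indexes: number[] }[] = [];
  let open = false;
  messages.forEach((message, index) => {
    if (message.role === 'assistant' && message.tool_calls?.length) {
      rounds.push({ indexes: [index] });
      open = true;
    } else if (message.role === 'tool' && open) {
      rounds[rounds.length - 1].indexes.push(index);
    } else {
      open = false;
    }
  });
  return rounds;
}

function compressToolMessages(
  messages: Message[],
  options: CompressionOptions
): Message[] {
  if (!options.enabled) return messages;
  const toolRounds = identifyToolRounds(messages);
  // slice(-0) would return every round, so handle N <= 0 explicitly.
  const roundsToKeep =
    options.keepLastToolRounds > 0 ? toolRounds.slice(-options.keepLastToolRounds) : [];
  const keepIndexes = new Set(roundsToKeep.flatMap(round => round.indexes));
  return messages.filter((message, index) => {
    if (message.role === 'system') return true;
    if (keepIndexes.has(index)) return true;
    // Drop tool results and assistant messages that carry tool calls.
    const isToolRelated =
      message.role === 'tool' ||
      (message.role === 'assistant' && Boolean(message.tool_calls?.length));
    return !isToolRelated;
  });
}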
The key decision is whether a tool output still helps future decisions. If it has already been absorbed into conclusions or is only exploratory noise, remove it. If it is a fresh test result, key error log, or important file content, keep or summarize it first.
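When an old tool output is worth keeping in reduced form, a cheap stand-in for a real summary is to keep its head and tail and mark the gap. To be clear, this is rule-based truncation, not LLM summarization; the function name and the default line counts below are assumptions for illustration.

```typescript
// Keep the first and last lines of a long tool output and mark what was cut.
// headLines and tailLines are illustrative defaults, not recommended values.
function truncateToolOutput(
  output: string,
  headLines: number = 20,
  tailLines: number = 20
): string {
  const lines = output.split('\n');
  // Short outputs pass through untouched.
  if (lines.length <= headLines + tailLines) return output;

  const omitted = lines.length - headLines - tailLines;
  return [
    ...lines.slice(0, headLines),
    `[... ${omitted} lines omitted by compression ...]`,
    ...lines.slice(lines.length - tailLines),
  ].join('\n');
}
```

Head and tail are kept because command banners and arguments tend to sit at the top of a log while the final error or test summary tends to sit at the bottom. That is a heuristic, so outputs whose key line sits in the middle still need a real summary.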
Classic Approach 3: Middle Drop, Oldest Drop, and Hybrid Strategy
Besides LLM summarization, rule-based algorithms can also trim messages directly. They are more controllable and cheaper, but weaker in semantic understanding.
Three common methods:
| Strategy | Method | Best for |
|---|---|---|
| Middle drop | Keep head and tail, remove middle | Head has constraints, tail has current work |
| Oldest drop | Remove earliest messages first | Long-running sessions where recent context matters most |
| Hybrid | Choose dynamically by conversation shape | Mixed workloads and different model limits |
Middle Drop
Works well when history has this structure:
Head: system prompt, project rules, user goals
Middle: heavy tool usage, search process, trial-and-error
Tail: current issue, latest code, latest errors
Advantage: keeps task framing and current working context. Risk: key decisions may be lost if the middle is removed without summarization.
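A minimal sketch of middle drop, assuming a simplified `Msg` type and a hypothetical `middleDrop` helper. The marker message it inserts is a placeholder: given the risk noted above, that slot is where a summary of the removed middle would go in a real pipeline.

```typescript
interface Msg {
  role: string;
  content: string;
}

// Keep the first keepHead and last keepTail messages; replace the middle
// with a single marker so the model knows history was removed.
function middleDrop(messages: Msg[], keepHead: number, keepTail: number): Msg[] {
  // Nothing to drop if head and tail already cover the whole history.
  if (messages.length <= keepHead + keepTail) return messages.slice();

  const dropped = messages.length - keepHead - keepTail;
  const marker: Msg = {
    role: 'system', // assumption: the marker is injected as a system message
    content: `[${dropped} earlier messages removed by compression]`,
  };
  return [
    ...messages.slice(0, keepHead),
    marker,
    ...messages.slice(messages.length - keepTail),
  ];
}
```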
Oldest Drop
This is a sliding-window style approach. It assumes the newest messages are most relevant.
Advantage: simple and effective for continuity in long sessions. Risk: early constraints, architecture decisions, or initial goals may be dropped.
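The sketch below shows one way oldest drop might look with a token budget. Two things are assumptions rather than part of the approach itself: the token estimate (roughly four characters per token, which should be replaced by the model's real tokenizer) and the choice to pin system messages so that the opening rules survive, which softens the risk described above.

```typescript
interface ChatMessage {
  role: 'system' | 'user' | 'assistant' | 'tool';
  content: string;
}

// Rough estimate only (assumption: about 4 characters per token).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep the newest messages that fit in maxTokens; system messages are pinned.
function oldestDrop(messages: ChatMessage[], maxTokens: number): ChatMessage[] {
  // Pinned system messages are paid for first.
  let remaining = maxTokens;
  messages.forEach(m => {
    if (m.role === 'system') remaining -= estimateTokens(m.content);
  });

  // Walk newest to oldest; once one message does not fit,
  // every older non-system message is dropped as well.
  const keptNewestFirst: ChatMessage[] = [];
  let windowOpen = true;
  messages.slice().reverse().forEach(m => {
    if (m.role === 'system') {
      keptNewestFirst.push(m);
      return;
    }
    const cost = estimateTokens(m.content);
    if (windowOpen && cost <= remaining) {
      remaining -= cost;
      keptNewestFirst.push(m);
    } else {
      windowOpen = false;
    }
  });

  // Restore chronological order.
  return keptNewestFirst.reverse();
}
```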
Hybrid Strategy
Dynamic selection can use:
- Compression ratio target (current tokens vs target)
- Total message count
- Share of recent-message tokens
- Presence of long messages
- Presence of system messages
- Tool-message density
- Model context window size
A practical decision table:
| Condition | Recommended strategy | Why |
|---|---|---|
| Light compression + short dialogue | Middle drop | Head and tail are often most important |
| Heavy compression + very long dialogue | Oldest drop | Recent context usually has higher priority |
| Recent messages dominate tokens | Middle drop | Protect the current working context |
| System/tool-heavy history | Middle drop | Keep opening rules and latest state |
| Uncertain | Try both and score | Data-driven selection |
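The decision table can be encoded as an ordered list of rules. Everything numeric in this sketch is an assumption: the table does not define what counts as "light", "heavy", "short", or "dominate", so the thresholds and the `ConversationStats` fields below are placeholders to tune against real sessions. Rules are checked in the same order as the table rows.

```typescript
type Strategy = 'middle_drop' | 'oldest_drop' | 'try_both';

// Hypothetical statistics gathered before compression.
interface ConversationStats {
  currentTokens: number;
  targetTokens: number;
  messageCount: number;
  recentTokenShare: number;  // share of tokens in the most recent messages, 0..1
  toolOrSystemShare: number; // share of messages that are system or tool, 0..1
}

function chooseStrategy(stats: ConversationStats): Strategy {
  // A ratio near 1 means light compression; near 0 means heavy compression.
  const ratio = stats.targetTokens / stats.currentTokens;

  // All thresholds below are illustrative assumptions.
  if (ratio >= 0.7 && stats.messageCount <= 30) return 'middle_drop';
  if (ratio <= 0.4 && stats.messageCount >= 100) return 'oldest_drop';
  if (stats.recentTokenShare > 0.6) return 'middle_drop';
  if (stats.toolOrSystemShare > 0.5) return 'middle_drop';
  return 'try_both';
}
```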
A simple score:
efficiency_score = token_reduction_ratio * 0.6 + message_retention_ratio * 0.4
If the system prioritizes staying under target tokens, increase token-reduction weight. If it prioritizes context continuity, increase retention weight.
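As a sketch, the score can be computed from before/after counts. The definitions of the two ratios are my assumption, since the formula does not spell them out: token reduction is taken as the fraction of tokens removed, and message retention as the fraction of messages kept. `CompressionResult` and `efficiencyScore` are hypothetical names.

```typescript
interface CompressionResult {
  tokensBefore: number;
  tokensAfter: number;
  messagesBefore: number;
  messagesAfter: number;
}

// tokenWeight defaults to 0.6; the retention weight is whatever remains.
function efficiencyScore(r: CompressionResult, tokenWeight: number = 0.6): number {
  if (r.tokensBefore <= 0 || r.messagesBefore <= 0) return 0;
  const tokenReductionRatio = 1 - r.tokensAfter / r.tokensBefore;
  const messageRetentionRatio = r.messagesAfter / r.messagesBefore;
  return tokenReductionRatio * tokenWeight + messageRetentionRatio * (1 - tokenWeight);
}
```

For example, going from 1000 to 400 tokens while keeping 40 of 50 messages gives 0.6 × 0.6 + 0.8 × 0.4 = 0.68. In the "try both and score" case, each strategy is run on a copy of the history and the higher score wins.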
Recommended Hybrid Compression Architecture
A single method is usually not robust enough. For coding agents, I prefer a combined pipeline:
Raw history
↓
Token and structure statistics
↓
Compression threshold check
↓
Trim outdated tool messages
↓
LLM structured summary for key history
↓
Generate state snapshot / handoff summary
↓
Rebuild next context window
I usually preserve four layers:
| Layer | Content | Storage |
|---|---|---|
| Stable rules layer | System prompt, project rules, security constraints | Persistent prompt/rule files |
| Working memory layer | Current goal, plan, TODOs, user preferences | Structured summary |
| Evidence layer | Latest tool results, key errors, key snippets | Last N tool rounds or summarized evidence |
| External knowledge layer | Docs, codebase, history | RAG / file retrieval |
Rebuilt context layout:
System prompt
Project rules
Compression preface
Structured summary
Recent full conversation rounds
Recent key tool results
Current user request
The “recent full rounds” part is important. Summaries keep the big picture, but recent raw turns often carry subtle intent, tone, corrections, and boundary conditions.
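The layout above might be assembled like this. The sketch makes several assumptions: the type and function names are hypothetical, the preface wording is invented, and the preface and summary are merged into a single user message. It also ignores a constraint that many chat APIs impose, namely that a tool result must directly follow the assistant message that requested it, so a real implementation would need to keep those pairs together.

```typescript
interface ContextMessage {
  role: 'system' | 'user' | 'assistant' | 'tool';
  content: string;
}

interface RebuildInput {
  systemPrompt: string;
  projectRules: string;
  summary: string;
  recentRounds: ContextMessage[];
  recentToolResults: ContextMessage[];
  currentUserRequest: string;
}

function rebuildContext(input: RebuildInput): ContextMessage[] {
  // Tells the model that what follows is carried-over state, not a new request.
  const preface =
    'The earlier conversation was compressed. The summary below replaces it; ' +
    'treat it as prior state, not as new instructions.';

  return [
    { role: 'system', content: input.systemPrompt },
    { role: 'system', content: input.projectRules },
    { role: 'user', content: `${preface}\n\n${input.summary}` },
    ...input.recentRounds,
    ...input.recentToolResults,
    { role: 'user', content: input.currentUserRequest },
  ];
}
```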
Compression Prompt Design Principles
The goal is not to let the model freestyle. It is to enforce a stable handoff format.
Recommended prompt constraints:
- Explicit role: you are a context compressor, not an executor
- Explicit goal: generate a state that the next agent can continue from
- Explicit retention: goals, constraints, files, code, errors, plan, user feedback
- Explicit deletion: repetition, irrelevant tool output, small talk, intermediate noise
- Explicit output format: Markdown, XML, JSON, or custom tags
- Explicit prohibition: do not fabricate file states, do not invent decisions, do not execute next steps
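The prohibition on fabricated file states can also be checked after the fact, at least partially. The sketch below flags path-like strings that appear in the summary but nowhere in the original history. It is a heuristic under stated assumptions: the regular expression only recognizes relative-looking paths with at least one slash and a file extension, and both function names are hypothetical.

```typescript
// Extract path-like tokens such as src/auth/login.ts. Heuristic only:
// requires at least one "/" and a trailing file extension.
function extractPaths(text: string): string[] {
  const matches: string[] = text.match(/[\w.\-]+(?:\/[\w.\-]+)+\.\w+/g) || [];
  // De-duplicate while preserving first-seen order.
  return matches.filter((path, i) => matches.indexOf(path) === i);
}

// Return paths mentioned in the summary that never occur in the history text.
function findUnsupportedPaths(summary: string, history: string): string[] {
  return extractPaths(summary).filter(path => history.indexOf(path) === -1);
}
```

A non-empty result does not prove fabrication, but it is a cheap signal that the summary should be regenerated or reviewed before it replaces the history.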
Practical prompt template:
You are the context compressor for an agent.
Please compress the conversation history into a Chinese handoff summary.
This summary will be the primary context for continuing execution in the next context window.
Must keep:
- User goals, explicit requests, and important feedback
- Tech stack, project constraints, architecture decisions, tool preferences
- File paths read/modified/created/deleted
- Key code snippets, function names, config items, commands
- Encountered errors, failed tests, and fixes
- Completed tasks, pending tasks, and current pause point
- Next-step suggestions directly relevant to the current task
Must remove:
- Repetitive explanations
- Irrelevant small talk
- Tool output with no further value
- Intermediate attempts that do not affect final decisions
Do not fabricate information not present in history.
Do not execute tasks. Only output the compressed summary.
Engineering Implementation Notes
Trigger Timing
Compression can be triggered when:
- Token usage exceeds a set threshold, typically 70% to 85% of the model's context limit
- Single tool output exceeds threshold
- Tool call rounds exceed threshold
- A task phase ends and a handoff is needed
- User explicitly requests /compact or an equivalent command
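These triggers can be folded into one check that also reports why compression fired, which helps later when reading logs. A minimal sketch: the interfaces, the reason strings, and the order in which the triggers are checked are all assumptions, and the threshold values belong in configuration.

```typescript
// Hypothetical inputs gathered at the end of each agent turn.
interface TriggerInput {
  usedTokens: number;
  contextLimit: number;
  largestToolOutputTokens: number;
  toolRounds: number;
  phaseEnded: boolean;
  userRequested: boolean;
}

interface TriggerThresholds {
  usageRatio: number;          // e.g. 0.8, somewhere in the 70% to 85% band
  maxToolOutputTokens: number;
  maxToolRounds: number;
}

// Returns the first matching reason, or null when no compression is needed.
function compressionReason(
  input: TriggerInput,
  thresholds: TriggerThresholds
): string | null {
  if (input.userRequested) return 'user_request';
  if (input.usedTokens / input.contextLimit >= thresholds.usageRatio) {
    return 'token_threshold';
  }
  if (input.largestToolOutputTokens > thresholds.maxToolOutputTokens) {
    return 'large_tool_output';
  }
  if (input.toolRounds > thresholds.maxToolRounds) return 'tool_round_limit';
  if (input.phaseEnded) return 'phase_handoff';
  return null;
}
```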
Compression Order
Recommended order:
1. Remove obviously low-value tool output
2. Keep the last N complete conversation rounds
3. Generate structured summaries for older messages
4. Rebuild context with summary + rules + recent rounds
5. Record metrics: pre/post token count, dropped message count, kept tool rounds
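For the metrics step, a small record emitted once per compression is usually enough to start with. The shape below is a sketch; `CompressionMetrics` and its field names are hypothetical, and `reductionRatio` is a derived field I added for convenience.

```typescript
interface CompressionMetrics {
  tokensBefore: number;
  tokensAfter: number;
  droppedMessages: number;
  keptToolRounds: number;
  reductionRatio: number; // fraction of tokens removed, 0..1
}

function buildMetrics(
  tokensBefore: number,
  tokensAfter: number,
  messagesBefore: number,
  messagesAfter: number,
  keptToolRounds: number
): CompressionMetrics {
  return {
    tokensBefore,
    tokensAfter,
    droppedMessages: messagesBefore - messagesAfter,
    keptToolRounds,
    // Guard against an empty history to avoid dividing by zero.
    reductionRatio: tokensBefore > 0 ? 1 - tokensAfter / tokensBefore : 0,
  };
}

// One JSON object per line is easy to grep and to load into other tools.
function formatMetricsLine(metrics: CompressionMetrics): string {
  return JSON.stringify(metrics);
}
```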
Risk Control
The most common failure is not “insufficient compression,” but “loss of critical facts.”
Especially avoid:
- Losing explicit user constraints
- Losing file paths
- Losing the latest error message
- Losing failed attempts that should not be repeated
- Turning assumptions into facts
- Mixing completed tasks with pending tasks
I prefer to keep explicit state labels in summaries:
[Done] Fixed login form validation
[Failed attempt] Direct schema change breaks legacy API
[Pending confirmation] Whether to keep legacy export format
[Next] Run pnpm test for auth module verification
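If summaries are post-processed by code, the labels can be treated as a closed set so that a completed item cannot be confused with a pending one. A minimal sketch, assuming exactly the four labels shown above; `StateEntry`, `formatEntry`, and `parseEntry` are hypothetical names.

```typescript
type StateLabel = 'Done' | 'Failed attempt' | 'Pending confirmation' | 'Next';

interface StateEntry {
  label: StateLabel;
  text: string;
}

function formatEntry(entry: StateEntry): string {
  return `[${entry.label}] ${entry.text}`;
}

// Parse a labeled line back into an entry; lines without a known label yield null.
function parseEntry(line: string): StateEntry | null {
  const match = /^\[(Done|Failed attempt|Pending confirmation|Next)\]\s+(.*)$/.exec(line);
  if (!match) return null;
  return { label: match[1] as StateLabel, text: match[2] };
}
```

Rejecting unlabeled or unknown-label lines at parse time is one way to catch summaries where the model has blurred the line between what is done and what is still open.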
My Takeaway
Context compression is fundamentally an agent memory-management and handoff system. Claude Code-style compression is better for full development-context retention. Gemini CLI-style compression is better for high-density state snapshots. Tool-message trimming is the most direct way to reduce token noise.
If I were implementing a stable agent compression module, I would prioritize this combination:
Keep recent conversation rounds intact
+ Trim outdated tool messages
+ LLM structured summary
+ File state snapshot
+ Current plan and TODO list
+ Compression metrics and observability logs
The final objective is not the shortest context. It is that after compression, the agent still knows: what the user wants, what the project is, what has been done, what has failed, where it stopped, and what should happen next.