Context on XEDCZQ Blog

Agent_Context Engineering

Tue, 19 May 2026 16:35:00 +0800

What Context Engineering Is

Context engineering can be defined as:

Injecting the “just-enough and highly relevant” information at every agent step, while continuously managing the lifecycle of that information.

If prompt engineering focuses on “how to phrase the task,” context engineering focuses on “what information to provide, in what order, and when to prune or rebuild it.”

Phase 1: Passive Truncation and Sliding Window (2020–2022) — “Every Token Counts”

Typical Characteristics

Context windows were generally small, and tokens were highly constrained.
The default strategy was “truncate when over limit.”
A common implementation was a sliding window (keep only the latest N turns).
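
The sliding-window policy can be sketched as follows. This is a minimal illustration: the `ChatMessage` shape, the `slidingWindow` name, and the rule that each user message starts a new turn are assumptions made for the example, not part of any specific framework.

```typescript
// Hypothetical message shape; not tied to any vendor SDK.
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Keep all system messages plus the last `maxTurns` dialogue turns.
// Assumption: a turn starts at each user message.
function slidingWindow(messages: ChatMessage[], maxTurns: number): ChatMessage[] {
  const system = messages.filter(m => m.role === 'system');
  const dialogue = messages.filter(m => m.role !== 'system');
  if (maxTurns <= 0) return system;

  // Positions in `dialogue` where a user message opens a new turn.
  const turnStarts: number[] = [];
  dialogue.forEach((m, i) => {
    if (m.role === 'user') turnStarts.push(i);
  });

  // Nothing to drop if the dialogue already fits in the window.
  if (turnStarts.length <= maxTurns) return [...system, ...dialogue];

  const cutoff = turnStarts[turnStarts.length - maxTurns];
  return [...system, ...dialogue.slice(cutoff)];
}
```

With `maxTurns = 2` on a three-turn dialogue, the first user turn disappears entirely, which is how early critical information ends up being dropped.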

What It Solved

Prevented immediate failure from overlong input.
Preserved recent interaction and basic multi-turn continuity.

Core Problems

Early critical information was often dropped.
Goal drift was severe in long tasks.
Historical state could not be inherited reliably.

Phase 2: External Topology Introduction (2021–2023) — “The Birth of an External Brain (RAG)”

Typical Characteristics

The paradigm shifted from “stuff everything into context” to “retrieve on demand then inject.”
Vector retrieval and semantic recall became mainstream.
RAG decoupled parametric knowledge from external knowledge.
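
As a rough sketch of "retrieve on demand, then inject", the example below ranks pre-embedded chunks by cosine similarity and builds a prompt from the top results. The `Chunk` shape, the function names, and the prompt layout are assumptions for illustration; the embeddings are assumed to come from some embedding model that is not shown.

```typescript
interface Chunk {
  id: string;
  text: string;
  embedding: number[]; // assumed to be precomputed elsewhere
}

// Cosine similarity between two vectors; returns 0 for zero-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k chunks most similar to the query embedding.
function retrieveTopK(query: number[], chunks: Chunk[], k: number): Chunk[] {
  return chunks
    .map(chunk => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, Math.max(0, k))
    .map(scored => scored.chunk);
}

// Inject only the retrieved evidence into the prompt (hypothetical layout).
function buildPrompt(question: string, evidence: Chunk[]): string {
  const context = evidence.map(c => `[${c.id}] ${c.text}`).join('\n');
  return `Context:\n${context}\n\nQuestion: ${question}`;
}
```

Only the top-ranked chunks reach the model, so a retrieval miss at this step cannot be repaired later in the prompt.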

What It Solved

Broke through the memory ceiling of single-window context.
Reduced hallucinations by grounding responses with retrievable evidence.
Enabled knowledge updates without retraining the model.

Core Problems

Retrieval quality remained unstable (missed recall, wrong recall).
Attention dilution still occurred after retrieval chunks were merged.
“Retrieved” did not necessarily mean “used correctly by the model.”

Phase 3: Fine-Grained Compression and Reordering (2023–2024) — “Addressing the Lost-in-the-Middle Problem”

Typical Characteristics

The community began to systematically focus on long-context utilization.
Research and engineering attention increased around the Lost-in-the-Middle effect.
Strategy evolved from “adding more context” to “compressing, reordering, and layered memory.”

Common Methods

History summarization (state snapshot / handoff summary)
Tool-output pruning (keep recent critical rounds)
Information reordering (place highest-priority evidence near strong attention zones)
Task segmentation and stage-wise handoff
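
Information reordering can be illustrated with a small sketch. It assumes the "strong attention zones" are the beginning and end of the context, in line with the Lost-in-the-Middle finding, and that each piece of evidence already carries a relevance `score`. The `Evidence` type and `reorderForEdges` name are hypothetical.

```typescript
interface Evidence {
  text: string;
  score: number; // higher means more important; assumed to be given
}

// Place the highest-scoring items at the edges and the weakest in the middle.
function reorderForEdges(items: Evidence[]): Evidence[] {
  // Sort a copy so the caller's array is not mutated.
  const ranked = [...items].sort((a, b) => b.score - a.score);
  const front: Evidence[] = [];
  const back: Evidence[] = [];
  ranked.forEach((item, rank) => {
    // Alternate: even ranks fill from the start, odd ranks fill from the end.
    if (rank % 2 === 0) front.push(item);
    else back.unshift(item);
  });
  return [...front, ...back];
}
```

For scores 5, 4, 3, 2, 1 the resulting order is 5, 3, 1, 2, 4: the two strongest items sit at the edges and the weakest lands in the middle.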

What It Solved

Reduced middle-section information neglect.
Improved long-task state continuity.
Made cross-window agent execution more controllable.

Core Problems

Compression summaries could introduce information loss.
Reordering rules were task-dependent and hard to generalize.
Evaluation was required to verify post-compression executability.

Phase 4: Ultra-Long Context and Infrastructure Caching (2024–2026, Current) — “KV Cache and Intelligent Memory”

Typical Characteristics

Context windows continued to expand.
Vendors and frameworks introduced stronger cache/reuse mechanisms.
Agent systems moved from “context management” to “context infrastructure.”

Common Capabilities

Prompt/prefix caching (reducing repeated token cost)
Session state snapshots and resume
Multi-layer memory architecture (short-term working memory + long-term external memory)
Policy-based dynamic context construction

What It Solved

Lowered long-chain cost and latency.
Improved continuity in long-running tasks.
Made memory management governable as an engineering subsystem.

Core Problems

Cost and system complexity increased.
Memory contamination and stale-information governance became harder.
Strong observability was required to diagnose context failure points.

Representative Industry Articles and References

Below are high-value public references for context engineering:

Anthropic: Effective context engineering for AI agents

Clearly positions context engineering as the natural extension of prompt engineering.
Emphasizes that reliability bottlenecks in agents are often in context construction, not single prompts.

Anthropic: Prompt engineering for Claude’s long context window

Early long-context practice guidance with concrete input-structuring patterns.

Anthropic Docs: Long context prompting tips

Practical, checklist-style implementation guidance.

LangChain Docs: Context engineering in agents

Implementation-oriented strategies for what to inject at each agent step.

Paper: Lost in the Middle: How Language Models Use Long Contexts

Provides systematic evidence for degraded utilization of middle context.
Directly influenced later compression/reordering practices.

Foundational RAG Paper: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Established the mainstream retrieval+generation paradigm.

What Problems Context Engineering Solves

These can be grouped into six core problem classes:

Information selection

Not all data should be provided; supply only the context relevant to the current step.

Memory continuity

Keep long tasks continuous across turns, windows, and sessions.

Cost and performance

Control token spend, latency, and throughput by reducing low-value context.

Reliability

Reduce missed evidence, state misreads, and repeated failed attempts.

Governance

Make context policies (compression/retrieval/reordering) configurable, measurable, and easy to iterate on.

Toolchain coordination

Integrate context with RAG, caching, state machines, and orchestration systems.

One-line summary:

Context engineering is not about whether a model can answer once; it is about whether it can keep answering correctly, consistently, and cost-effectively in complex workflows.

My Practical Conclusion

For agent projects, a pragmatic build order is:

Start with prompt engineering (clear task contract)
Then add context engineering (information lifecycle management)
Finally implement harness engineering (end-to-end execution loop)

If you only do prompt engineering, long tasks remain fragile. If you skip context engineering and jump directly to harness engineering, complexity increases quickly and debugging becomes expensive.

Agent_Context Compression Prompt

Fri, 15 May 2026 17:58:59 +0800

Notes on Agent Context Compression Design

Reference: Context Compression Instruction: Prompt Analysis of Claude Code and Gemini

What Problem Does Context Compression Solve?

An agent’s context window is not infinite. As multi-turn conversations, tool calls, file reads, error logs, and code diffs accumulate, the model gradually approaches the token limit. The goal of context compression is not simply to “make it shorter,” but to preserve task continuity while reorganizing history into a state that the next agent turn can continue from.

I treat context compression as a work handoff:

Keep what the user is actually trying to accomplish
Keep project constraints, tech stack, and key decisions
Keep file states that were read, modified, or created
Keep errors, fixes, and unresolved issues
Drop repetitive, outdated, and noisy tool outputs
Let the next context window continue execution instead of re-exploring

A good compression system should answer three questions:

When to compress: scheduling strategy based on token thresholds, message length, tool output size, etc.
What to compress: user messages, system constraints, tool results, file states, or plans
How to compress: LLM summarization, rule-based trimming, retrieval reconstruction, or a hybrid approach

Classic Approach 1: LLM Summarization Compression

Claude Code and Gemini CLI share one core idea: when the context grows too long, pass the history to a model and have it output a structured summary. This summary becomes the core memory in the next context window.

The advantage is strong semantic retention: goals, constraints, errors, and plans scattered across long history can be reorganized. The downside is that quality depends on prompt design. A weak prompt may lose file paths, snippets, user preferences, or unfinished tasks.

Claude Code Style: Detailed Structured Handoff

Claude Code-style compression is closer to a full handoff document. It emphasizes chronological analysis and focuses on user requests, technical details, file changes, error handling, and next steps.

Suggested fields:

Field	Purpose
Primary requests and intent	Preserve the initial user goal and later intent shifts
Key technical concepts	Record stack, frameworks, architecture patterns, dependencies
Files and code sections	Track read/modified/created files and key snippets
Errors and fixes	Prevent repeating the same mistakes after compression
Problem-solving status	Separate resolved issues from ongoing debugging
User messages	Preserve original feedback to reduce intent distortion
Pending tasks	Make remaining work explicit
Current work state	Capture what was in progress before compression
Optional next steps	Keep only directly relevant follow-up actions

The point is not “a pretty summary,” but “a handoff that can keep coding.” In coding-agent workflows, file paths, function names, test commands, failed logs, and user corrections are critical.

Compression template:

Please compress the conversation history into a handoff summary that can continue execution.

Must keep:
1. User’s primary goals and explicit requests
2. Tech stack, architecture constraints, and key decisions
3. Files read/modified/created/deleted and why
4. Key code snippets, function signatures, config items
5. Encountered errors, failure logs, and fixes
6. Important user feedback and preferences
7. Completed items, pending items, and current pause point
8. Next-step suggestions directly related to the current task only

Must remove:
1. Repetitive explanations
2. Outdated tool outputs
3. Intermediate attempts that no longer help
4. Irrelevant small talk

Gemini CLI Style: State Snapshot

Gemini CLI-style compression is more like generating a compact state_snapshot. It uses fewer fields but packs higher density.

Typical fields:

Field	Purpose
`overall_goal`	One-line high-level user objective
`key_knowledge`	Facts, constraints, and conventions that must be remembered
`file_system_state`	Created/read/modified/deleted file state
`recent_actions`	Recent key actions and outcomes
`current_plan`	Current plan and progress

This style works well as a runtime snapshot, especially for recovery after interruption. It is shorter than the Claude-style handoff, so each field has to be written more carefully to avoid dropping details.

<state_snapshot>
 <overall_goal>User's current high-level goal</overall_goal>
 <key_knowledge>Critical facts, constraints, preferences, technical decisions</key_knowledge>
 <file_system_state>File read/modify/create/delete state</file_system_state>
 <recent_actions>Recent important actions and outcomes</recent_actions>
 <current_plan>Current plan, completed steps, pending steps</current_plan>
</state_snapshot>
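
If the snapshot is produced or validated in code rather than written freehand by the model, a typed representation helps keep the five fields consistent. The sketch below is one possible shape, not Gemini CLI's actual implementation; the `StateSnapshot` interface, the list formatting, and the function names are assumptions.

```typescript
interface StateSnapshot {
  overallGoal: string;
  keyKnowledge: string[];
  fileSystemState: string[];
  recentActions: string[];
  currentPlan: string[];
}

// Escape characters that would break the tag structure.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Render one list-valued field as a tag containing one bullet per item.
function renderList(tag: string, items: string[]): string {
  const body = items.map(item => `  - ${escapeXml(item)}`).join('\n');
  return ` <${tag}>\n${body}\n </${tag}>`;
}

// Serialize the snapshot using the tag names from the template above.
function renderSnapshot(s: StateSnapshot): string {
  return [
    '<state_snapshot>',
    ` <overall_goal>${escapeXml(s.overallGoal)}</overall_goal>`,
    renderList('key_knowledge', s.keyKnowledge),
    renderList('file_system_state', s.fileSystemState),
    renderList('recent_actions', s.recentActions),
    renderList('current_plan', s.currentPlan),
    '</state_snapshot>',
  ].join('\n');
}
```

Keeping the fields as arrays makes it easy to check, before the snapshot is injected, that no field is empty after compression.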

Classic Approach 2: Tool Message Trimming

In real agent systems, the biggest token consumer is often tool output, not user text or assistant replies. File reads, code search, test runs, and logs can explode token usage.

So tool-message trimming is highly practical:

Keep system messages
Keep normal user and assistant messages
Remove outdated tool calls and tool outputs
Keep only the last N tool rounds
Summarize key tool outputs before deleting raw long outputs

A common policy: identify all tool rounds, keep only the last N, and remove older tool-related messages.

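type MessageRole = 'system' | 'user' | 'assistant' | 'tool';

interface Message {
  role: MessageRole;
  content: string;
  tool_calls?: unknown[];
  tool_call_id?: string;
}

interface CompressionOptions {
  enabled: boolean;
  keepLastToolRounds: number;
}

// One tool round: an assistant message that issues tool calls,
// plus the tool messages that answer it.
interface ToolRound {
  indexes: number[];
}

// Assumed helper (not defined in the original notes): groups each assistant
// tool-call message with the tool messages that directly follow it.
function identifyToolRounds(messages: Message[]): ToolRound[] {
  const rounds: ToolRound[] = [];
  let current: ToolRound | null = null;

  messages.forEach((message, index) => {
    if (message.role === 'assistant' && Boolean(message.tool_calls)) {
      current = { indexes: [index] };
      rounds.push(current);
    } else if (message.role === 'tool' && current) {
      current.indexes.push(index);
    } else {
      current = null;
    }
  });
  return rounds;
}

function compressToolMessages(
  messages: Message[],
  options: CompressionOptions
): Message[] {
  if (!options.enabled) return messages;

  const toolRounds = identifyToolRounds(messages);
  // slice(-0) would return every round, so handle "keep none" explicitly.
  const roundsToKeep =
    options.keepLastToolRounds > 0
      ? toolRounds.slice(-options.keepLastToolRounds)
      : [];
  const keepIndexes = new Set(roundsToKeep.flatMap(round => round.indexes));

  return messages.filter((message, index) => {
    // System messages always survive.
    if (message.role === 'system') return true;
    // Messages that belong to a kept tool round survive.
    if (keepIndexes.has(index)) return true;

    const isToolRelated =
      message.role === 'tool' ||
      (message.role === 'assistant' && Boolean(message.tool_calls));

    // Plain user/assistant messages survive; older tool traffic is dropped.
    return !isToolRelated;
  });
}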

The key decision is whether a tool output still helps future decisions. If it has already been absorbed into conclusions or is only exploratory noise, remove it. If it is a fresh test result, key error log, or important file content, keep or summarize it first.
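
When an old tool output is worth keeping but too long to keep verbatim, a cheap stand-in for LLM summarization is to keep its head and tail and mark the gap. The sketch below uses plain truncation, not semantic summarization, and the `truncateToolOutput` name and marker format are assumptions.

```typescript
// Keep the first and last parts of a long tool output and mark what was cut.
// Note: this is rule-based truncation, not a semantic summary.
function truncateToolOutput(content: string, maxChars: number): string {
  if (maxChars <= 0) return '';
  if (content.length <= maxChars) return content;

  const headLen = Math.ceil(maxChars / 2);
  const tailLen = maxChars - headLen;
  const omitted = content.length - headLen - tailLen;

  const head = content.slice(0, headLen);
  // Guard against slice(-0), which would return the whole string.
  const tail = tailLen > 0 ? content.slice(-tailLen) : '';

  return `${head}\n[... ${omitted} characters omitted ...]\n${tail}`;
}
```

Keeping the tail matters for logs and test runs, where the final lines usually hold the error or the pass/fail summary.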

Classic Approach 3: Middle Drop, Oldest Drop, and Hybrid Strategy

Besides LLM summarization, rule-based algorithms can also trim messages directly. They are more controllable and cheaper, but weaker in semantic understanding.

Three common methods:

Strategy	Method	Best for
Middle drop	Keep head and tail, remove middle	Head has constraints, tail has current work
Oldest drop	Remove earliest messages first	Long-running sessions where recent context matters most
Hybrid	Choose dynamically by conversation shape	Mixed workloads and different model limits

Middle Drop

Works well when history has this structure:

Head: system prompt, project rules, user goals
Middle: heavy tool usage, search process, trial-and-error
Tail: current issue, latest code, latest errors

Advantage: keeps task framing and current working context. Risk: key decisions may be lost if the middle is removed without summarization.
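
A minimal sketch of middle drop, assuming a hypothetical `Msg` shape. The optional `middleSummary` parameter addresses the risk just mentioned by leaving a short bridge message where the middle used to be; giving that bridge the `user` role is an arbitrary choice for the example.

```typescript
interface Msg {
  role: 'system' | 'user' | 'assistant' | 'tool';
  content: string;
}

// Keep the first `keepHead` and last `keepTail` messages and drop the rest.
// If `middleSummary` is given, insert it in place of the dropped span.
function middleDrop(
  messages: Msg[],
  keepHead: number,
  keepTail: number,
  middleSummary?: string
): Msg[] {
  const head = Math.max(0, keepHead);
  const tail = Math.max(0, keepTail);
  // Nothing to drop when head and tail already cover the history.
  if (head + tail >= messages.length) return [...messages];

  const headPart = messages.slice(0, head);
  const tailPart = tail > 0 ? messages.slice(-tail) : [];
  const dropped = messages.length - head - tail;

  const bridge: Msg[] = middleSummary
    ? [{ role: 'user', content: `[Summary of ${dropped} omitted messages] ${middleSummary}` }]
    : [];

  return [...headPart, ...bridge, ...tailPart];
}
```

Calling `middleDrop(history, 1, 2)` on six messages keeps the first message and the last two; passing a summary adds one bridge message between them.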

Oldest Drop

This is a sliding-window style approach. It assumes the newest messages are most relevant.

Advantage: simple and effective for continuity in long sessions. Risk: early constraints, architecture decisions, or initial goals may be dropped.
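
Oldest drop under a token budget might look like the sketch below. The `Msg` shape is hypothetical, and `estimateTokens` uses a rough four-characters-per-token heuristic as a placeholder for a real tokenizer.

```typescript
interface Msg {
  role: 'system' | 'user' | 'assistant' | 'tool';
  content: string;
}

// Rough heuristic (assumption): about 4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Walk from newest to oldest and keep messages until the budget is spent.
function oldestDrop(messages: Msg[], maxTokens: number): Msg[] {
  const kept: Msg[] = [];
  let used = 0;
  for (let i = messages.length - 1; i >= 0; i--) {
    const message = messages[i];
    if (message === undefined) continue;
    const cost = estimateTokens(message.content);
    // Stop at the first message that does not fit, so the kept
    // history stays contiguous.
    if (used + cost > maxTokens) break;
    kept.unshift(message);
    used += cost;
  }
  return kept;
}
```

This plain version does not pin system messages, so it reproduces the stated risk: an early constraint is dropped as soon as the budget runs out. Pinning the head before applying the budget is a common refinement.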

Hybrid Strategy

Dynamic selection can use:

Compression ratio target (current tokens vs target)
Total message count
Share of recent-message tokens
Presence of long messages
Presence of system messages
Heavy tool-message density
Model context window size

A practical decision table:

Condition	Recommended strategy	Why
Light compression + short dialogue	Middle drop	Head and tail are often most important
Heavy compression + very long dialogue	Oldest drop	Recent context usually has higher priority
Recent messages dominate tokens	Middle drop	Protect the current working context
System/tool-heavy history	Middle drop	Keep opening rules and latest state
Uncertain	Try both and score	Data-driven selection
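
The decision table can be encoded as a small rule function. Every threshold below, and the order in which the rules are checked, is an illustrative assumption; the table itself does not fix concrete numbers or precedence.

```typescript
type Strategy = 'middle_drop' | 'oldest_drop' | 'try_both';

interface ConversationShape {
  currentTokens: number;
  targetTokens: number;
  messageCount: number;
  recentTokenShare: number;     // share of tokens in recent messages, 0..1
  systemToolTokenShare: number; // share of tokens in system/tool messages, 0..1
}

// Hypothetical thresholds; tune them against real traffic.
const THRESHOLDS = {
  lightCompressionRatio: 0.7, // target/current at or above this is "light"
  heavyCompressionRatio: 0.4, // target/current at or below this is "heavy"
  shortDialogueMessages: 30,
  longDialogueMessages: 100,
  recentDominanceShare: 0.6,
  systemToolHeavyShare: 0.5,
};

function chooseStrategy(shape: ConversationShape): Strategy {
  const ratio =
    shape.currentTokens > 0 ? shape.targetTokens / shape.currentTokens : 1;

  // Recent messages dominate: protect the current working context.
  if (shape.recentTokenShare >= THRESHOLDS.recentDominanceShare) return 'middle_drop';
  // System/tool-heavy history: keep opening rules and latest state.
  if (shape.systemToolTokenShare >= THRESHOLDS.systemToolHeavyShare) return 'middle_drop';
  // Light compression on a short dialogue.
  if (
    ratio >= THRESHOLDS.lightCompressionRatio &&
    shape.messageCount <= THRESHOLDS.shortDialogueMessages
  ) {
    return 'middle_drop';
  }
  // Heavy compression on a very long dialogue.
  if (
    ratio <= THRESHOLDS.heavyCompressionRatio &&
    shape.messageCount >= THRESHOLDS.longDialogueMessages
  ) {
    return 'oldest_drop';
  }
  // No rule matched: run both strategies and compare scores.
  return 'try_both';
}
```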

A simple score:

efficiency_score = token_reduction_ratio * 0.6 + message_retention_ratio * 0.4

If the system prioritizes staying under target tokens, increase token-reduction weight. If it prioritizes context continuity, increase retention weight.
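
The score and the "try both and score" selection can be sketched as below. The 0.6/0.4 weights come from the formula above; the `CompressionOutcome` type, the function names, and the sample numbers in the usage note are assumptions.

```typescript
interface CompressionOutcome {
  tokensBefore: number;
  tokensAfter: number;
  messagesBefore: number;
  messagesAfter: number;
}

// efficiency_score = token_reduction_ratio * w + message_retention_ratio * (1 - w)
function efficiencyScore(outcome: CompressionOutcome, tokenWeight = 0.6): number {
  // Clamp the weight to [0, 1] so the two weights always sum to 1.
  const w = Math.min(1, Math.max(0, tokenWeight));
  const tokenReduction =
    outcome.tokensBefore > 0
      ? (outcome.tokensBefore - outcome.tokensAfter) / outcome.tokensBefore
      : 0;
  const retention =
    outcome.messagesBefore > 0
      ? outcome.messagesAfter / outcome.messagesBefore
      : 1;
  return tokenReduction * w + retention * (1 - w);
}

// Run each candidate strategy elsewhere, then pick the best-scoring outcome.
function pickBest(
  candidates: { name: string; outcome: CompressionOutcome }[],
  tokenWeight = 0.6
): string | undefined {
  let bestName: string | undefined;
  let bestScore = -Infinity;
  for (const candidate of candidates) {
    const score = efficiencyScore(candidate.outcome, tokenWeight);
    if (score > bestScore) {
      bestScore = score;
      bestName = candidate.name;
    }
  }
  return bestName;
}
```

As a worked case: a strategy that cuts 1000 tokens to 400 while keeping 15 of 20 messages scores 0.6 × 0.6 + 0.75 × 0.4 = 0.66. One that cuts to 300 tokens but keeps only 8 messages scores 0.7 × 0.6 + 0.4 × 0.4 = 0.58. Raising the token weight to 0.9 reverses the ranking.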

Recommended Hybrid Compression Architecture

A single method is usually not robust enough. For coding agents, I prefer a combined pipeline:

Raw history
 ↓
Token and structure statistics
 ↓
Compression threshold check
 ↓
Trim outdated tool messages
 ↓
LLM structured summary for key history
 ↓
Generate state snapshot / handoff summary
 ↓
Rebuild next context window

I usually preserve four layers:

Layer	Content	Storage
Stable rules layer	System prompt, project rules, security constraints	Persistent prompt/rule files
Working memory layer	Current goal, plan, TODOs, user preferences	Structured summary
Evidence layer	Latest tool results, key errors, key snippets	Last N tool rounds or summarized evidence
External knowledge layer	Docs, codebase, history	RAG / file retrieval

Rebuilt context layout:

System prompt
Project rules
Compression preface
Structured summary
Recent full conversation rounds
Recent key tool results
Current user request
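
Assembling that layout in code could look like the following sketch. The message shape, the preface wording, and the decision to fold tool results into a single `user` message are assumptions; real chat APIs differ in how they accept system and tool content.

```typescript
interface ContextMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

interface RebuildInput {
  systemPrompt: string;
  projectRules: string;
  structuredSummary: string;
  recentRounds: ContextMessage[];
  recentToolResults: string[];
  currentRequest: string;
}

// Hypothetical preface telling the model that earlier history was compressed.
const COMPRESSION_PREFACE =
  'Earlier conversation was compressed. The summary below replaces it.';

// Build the next context window in the order listed above.
function rebuildContext(input: RebuildInput): ContextMessage[] {
  const messages: ContextMessage[] = [
    { role: 'system', content: input.systemPrompt },
    { role: 'system', content: input.projectRules },
    { role: 'user', content: `${COMPRESSION_PREFACE}\n\n${input.structuredSummary}` },
    ...input.recentRounds,
  ];
  // Tool results are optional; skip the section when there are none.
  if (input.recentToolResults.length > 0) {
    messages.push({
      role: 'user',
      content: `Recent key tool results:\n${input.recentToolResults.join('\n---\n')}`,
    });
  }
  messages.push({ role: 'user', content: input.currentRequest });
  return messages;
}
```

Placing the stable parts (system prompt, project rules) first also keeps the prefix identical across rebuilds, which helps where prompt/prefix caching is available.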

The “recent full rounds” part is important. Summaries keep the big picture, but recent raw turns often carry subtle intent, tone, corrections, and boundary conditions.

Compression Prompt Design Principles

The goal is not to let the model freestyle. It is to enforce a stable handoff format.

Recommended prompt constraints:

Explicit role: you are a context compressor, not an executor
Explicit goal: generate a state that the next agent can continue from
Explicit retention: goals, constraints, files, code, errors, plan, user feedback
Explicit deletion: repetition, irrelevant tool output, small talk, intermediate noise
Explicit output format: Markdown, XML, JSON, or custom tags
Explicit prohibition: do not fabricate file states, do not invent decisions, do not execute next steps

Practical prompt template:

You are the context compressor for an agent.

Please compress the conversation history into a Chinese handoff summary.
This summary will be the primary context for continuing execution in the next context window.

Must keep:
- User goals, explicit requests, and important feedback
- Tech stack, project constraints, architecture decisions, tool preferences
- File paths read/modified/created/deleted
- Key code snippets, function names, config items, commands
- Encountered errors, failed tests, and fixes
- Completed tasks, pending tasks, and current pause point
- Next-step suggestions directly relevant to the current task

Must remove:
- Repetitive explanations
- Irrelevant small talk
- Tool output with no further value
- Intermediate attempts that do not affect final decisions

Do not fabricate information not present in history.
Do not execute tasks. Only output the compressed summary.

Engineering Implementation Notes

Trigger Timing

Compression can be triggered when:

Token usage exceeds a chosen threshold, typically 70% to 85% of the model's context limit
A single tool output exceeds a size threshold
The number of tool-call rounds exceeds a threshold
A task phase ends and a handoff is needed
User explicitly requests /compact or equivalent command
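
These triggers can be combined into one check that also reports why compression fired, which is useful for the metrics discussed later. The `SessionStats` and `TriggerPolicy` shapes and the default numbers are assumptions; only the 70% to 85% range comes from the list above.

```typescript
interface SessionStats {
  usedTokens: number;
  contextLimit: number;
  largestToolOutputTokens: number;
  toolRounds: number;
  phaseEnded: boolean;
  userRequestedCompact: boolean;
}

interface TriggerPolicy {
  tokenRatio: number;          // e.g. 0.7 to 0.85
  maxToolOutputTokens: number; // hypothetical per-output limit
  maxToolRounds: number;       // hypothetical round limit
}

const DEFAULT_POLICY: TriggerPolicy = {
  tokenRatio: 0.8,
  maxToolOutputTokens: 8000,
  maxToolRounds: 20,
};

// Return every trigger that currently applies; an empty list means no compression.
function compressionReasons(
  stats: SessionStats,
  policy: TriggerPolicy = DEFAULT_POLICY
): string[] {
  const reasons: string[] = [];
  if (stats.contextLimit > 0 && stats.usedTokens / stats.contextLimit > policy.tokenRatio) {
    reasons.push('token_ratio');
  }
  if (stats.largestToolOutputTokens > policy.maxToolOutputTokens) {
    reasons.push('large_tool_output');
  }
  if (stats.toolRounds > policy.maxToolRounds) reasons.push('tool_rounds');
  if (stats.phaseEnded) reasons.push('phase_end');
  if (stats.userRequestedCompact) reasons.push('user_request');
  return reasons;
}

function shouldCompress(stats: SessionStats, policy: TriggerPolicy = DEFAULT_POLICY): boolean {
  return compressionReasons(stats, policy).length > 0;
}
```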

Compression Order

Recommended order:

Remove obviously low-value tool output
Keep the last N complete conversation rounds
Generate structured summaries for older messages
Rebuild context with summary + rules + recent rounds
Record metrics: pre/post token count, dropped message count, kept tool rounds

Risk Control

The most common failure is not “insufficient compression,” but “loss of critical facts.”

Especially avoid:

Losing explicit user constraints
Losing file paths
Losing the latest error message
Losing failed attempts that should not be repeated
Turning assumptions into facts
Mixing completed tasks with pending tasks

I prefer to keep explicit state labels in summaries:

[Done] Fixed login form validation
[Failed attempt] Direct schema change breaks legacy API
[Pending confirmation] Whether to keep legacy export format
[Next] Run pnpm test for auth module verification
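
One way to keep these labels from drifting is to store status as a typed field and render the label at output time. The sketch below is only an illustration; the `ItemStatus` values and function name are assumptions that mirror the four labels above.

```typescript
type ItemStatus = 'done' | 'failed_attempt' | 'pending_confirmation' | 'next';

interface SummaryItem {
  status: ItemStatus;
  text: string;
}

// Fixed label per status, so completed and pending items cannot be mixed up.
const STATUS_LABELS: Record<ItemStatus, string> = {
  done: '[Done]',
  failed_attempt: '[Failed attempt]',
  pending_confirmation: '[Pending confirmation]',
  next: '[Next]',
};

// Render one labeled line per item.
function renderStatusLines(items: SummaryItem[]): string {
  return items.map(item => `${STATUS_LABELS[item.status]} ${item.text}`).join('\n');
}
```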

My Takeaway

Context compression is fundamentally an agent memory-management and handoff system. Claude Code-style compression is better for full development-context retention. Gemini CLI-style compression is better for high-density state snapshots. Tool-message trimming is the most direct way to reduce token noise.

If I were implementing a stable agent compression module, I would prioritize this combination:

Keep recent conversation rounds intact
+ Trim outdated tool messages
+ LLM structured summary
+ File state snapshot
+ Current plan and TODO list
+ Compression metrics and observability logs

The final objective is not the shortest context. It is that after compression, the agent still knows: what the user wants, what the project is, what has been done, what has failed, where it stopped, and what should happen next.