Agent on XEDCZQ Blog

Agent_RAG Optimization

Fri, 22 May 2026 10:30:00 +0800

RAG Optimization Notes (First-Person)

After reviewing recent RAG optimization materials, my conclusion is straightforward:

The bottleneck of RAG is no longer “can it run,” but “can it hit reliably, stay controllable, and remain measurable in production.”

I now break RAG optimization into four layers:

Pre-retrieval optimization (Query + Chunk)
Retrieval-time optimization (Recall + Rank)
Post-retrieval optimization (Context Packing + Compression)
Production loop optimization (Evaluation + Feedback)

1) Pre-Retrieval Optimization: Fix Input and Corpus Quality First

What I focus on

Semantic chunking

I no longer use fixed 300/500-token hard cuts.
I chunk by semantic paragraphs, code boundaries, and heading hierarchy.
My goal is to make each chunk self-contained and independently citable.

Query rewriting

Normalize colloquial user questions into domain terms.
Handle abbreviations, aliases, and typo normalization.
Decompose complex questions into sub-queries.

HyDE (Hypothetical Document Embeddings)

Generate an “ideal answer draft” first.
Retrieve using the draft embedding, not only the short user query.
I treat HyDE as a recall-boost switch, enabled only in low-recall scenarios.

My assessment

If pre-retrieval is weak, reranking/compression/caching are mostly damage control.

2) Retrieval-Time Optimization: Multi-Path Recall + Rerank, Not Vector-Only

My current approach

Hybrid search

Dense vectors for semantic recall.
Sparse retrieval (BM25/keywords) to recover exact-match cases.
Fuse results before reranking.

Two-stage ranking (Recall L1 -> Rank L2)

Stage 1 maximizes recall (better to over-fetch).
Stage 2 reranker narrows to top-k precision.

Cross-encoder / API rerank

Score query-doc pairs directly.
More stable than pure embedding similarity, especially on long chunks.

My assessment

In production, the issue is often not “nothing found,” but “too many low-precision hits.” Rerank is not optional; it is a quality gate.

3) Post-Retrieval Optimization: Turn Context into High-Density Evidence

Three things I optimize

Evidence compression

Rerank first, then compress.
Remove weakly relevant sentences, template noise, and duplicates.
Keep entities, numbers, and conclusion-bearing sentences.

Context packing strategy

Do not concatenate by raw retrieval order.
Repack by “question sub-intent -> evidence groups.”
Tag each evidence block with source IDs for traceability.

Cache-friendly prompt assembly

Place stable system prefixes and static background first.
Maximize prefix reuse and cache hit rate (cost + latency benefits).

My assessment

RAG cost is often dominated not by retrieval itself, but by sending low-value context to the LLM. Post-retrieval refinement is one of the most direct cost levers.

4) Production Loop Optimization: Make RAG a System, Not a Demo

My evaluation perspective

Retrieval-layer metrics

Recall@k
MRR / nDCG
Hit-rate buckets (short query / long query / code query)

Generation-layer metrics

Faithfulness (is the answer grounded in evidence?)
Answer relevance (does it answer the actual question?)
Context precision (how much retrieved context is truly useful?)

System-layer metrics

P95 latency
Per-query token cost
Cache hit rate
Fallback-routing ratio (needs backup retrieval/web search)

My feedback loop

User query -> recall -> rerank -> generate answer
Evaluator scores answer and evidence automatically
Low-score samples flow into a hard-case dataset
Weekly regression over retrieval params, chunking policy, and reranker setup

Vendor/Framework Recommendations I Use as Baseline

I prioritize official vendor/framework docs over second-hand summaries.

Microsoft Learn: Build Advanced Retrieval-Augmented Generation Systems

End-to-end advanced RAG workflow
Strong emphasis on query rewriting, post-retrieval processing, and evaluation loops

Azure Architecture Center: Develop a RAG Solution—Information-Retrieval Phase

Systematic retrieval-phase guidance
Explicitly covers query augmentation/decomposition/rewriting/HyDE

Anthropic Engineering: Contextual Retrieval

Practical guidance on hybrid retrieval and context utilization
Clearly addresses “retrieved is not equal to used correctly”

Anthropic Help: Retrieval Augmented Generation (RAG) for Projects

Checklist-oriented practical recommendations for productization

Cohere Docs: Best Practices for using Rerank

Practical rerank guidance for input organization and deployment

Paper: Lost in the Middle

Evidence for middle-context utilization degradation
Supports the need for reranking, compression, and packing

Paper: RAG: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Foundational retrieval+generation paradigm

How I Integrate These Optimizations into Real AI Application Iteration

I run a weekly optimization loop:

Step 0: Define scenario buckets and baseline

Build 100–300 real QA samples (bucketed by scenario).
Record baseline: retrieval hit quality, answer quality, latency, and cost.

Step 1: Change only one variable per iteration

I modify one parameter at a time:

Chunking policy
Query rewriting switch
Hybrid fusion weights
Reranker model/threshold
Context compression ratio

This avoids confounded results.

Step 2: Pass offline evaluation first

No offline pass, no online rollout.
I check three dimensions: quality gain, latency impact, cost impact.

Step 3: Online canary with rollback thresholds

Roll out on small traffic.
Set automatic rollback thresholds (P95, complaint rate, empty-answer rate).

Step 4: Convert wins into engineering assets

I persist proven improvements into:

Retrieval config templates
Prompt/context assembly conventions
RAG regression scripts
Failure case datasets and labeling rules

My Conclusion

My final view on RAG optimization:

Pre-retrieval defines the ceiling (is the question represented correctly?)
Retrieval-time defines hit quality (are we finding the right evidence?)
Post-retrieval defines cost and usability (is high-density evidence delivered to the LLM?)
Production loop defines sustainability (can quality keep improving?)

One-line summary:

RAG optimization is not "just tune model parameters"; it is engineering governance across retrieval, reranking, context construction, evaluation, and feedback.

Agent_Context Engineering

Tue, 19 May 2026 16:35:00 +0800

What Context Engineering Is

Context engineering can be defined as:

Injecting the “just-enough and highly relevant” information at every agent step, while continuously managing the lifecycle of that information.

If prompt engineering focuses on “how to phrase the task,” context engineering focuses on “what information to provide, in what order, and when to prune or rebuild it.”

Phase 1: Passive Truncation and Sliding Window (2020–2022) — “Every Token Counts”

Typical Characteristics

Context windows were generally small, and tokens were highly constrained.
The default strategy was “truncate when over limit.”
A common implementation was sliding window (keep only the latest N turns).

What It Solved

Prevented immediate failure from overlong input.
Preserved recent interaction and basic multi-turn continuity.

Core Problems

Early critical information was often dropped.
Goal drift was severe in long tasks.
Historical state could not be inherited reliably.

Phase 2: External Topology Introduction (2021–2023) — “The Birth of an External Brain (RAG)”

Typical Characteristics

The paradigm shifted from “stuff everything into context” to “retrieve on demand then inject.”
Vector retrieval and semantic recall became mainstream.
RAG decoupled parametric knowledge from external knowledge.

What It Solved

Broke through the memory ceiling of single-window context.
Reduced hallucinations by grounding responses with retrievable evidence.
Enabled knowledge updates without retraining the model.

Core Problems

Retrieval quality remained unstable (missed recall, wrong recall).
Attention dilution still occurred after retrieval chunks were merged.
“Retrieved” did not necessarily mean “used correctly by the model.”

Phase 3: Fine-Grained Compression and Reordering (2023–2024) — “Addressing the Lost-in-the-Middle Problem”

Typical Characteristics

The community began to systematically focus on long-context utilization.
Research and engineering attention increased around the Lost-in-the-Middle effect.
Strategy evolved from “adding more context” to “compressing, reordering, and layered memory.”

Common Methods

History summarization (state snapshot / handoff summary)
Tool-output pruning (keep recent critical rounds)
Information reordering (place highest-priority evidence near strong attention zones)
Task segmentation and stage-wise handoff

What It Solved

Reduced middle-section information neglect.
Improved long-task state continuity.
Made cross-window agent execution more controllable.

Core Problems

Compression summaries could introduce information loss.
Reordering rules were task-dependent and hard to generalize.
Evaluation was required to verify post-compression executability.

Phase 4: Ultra-Long Context and Infrastructure Caching (2024–2026, Current) — “KV Cache and Intelligent Memory”

Typical Characteristics

Context windows continued to expand.
Vendors and frameworks introduced stronger cache/reuse mechanisms.
Agent systems moved from “context management” to “context infrastructure.”

Common Capabilities

Prompt/prefix caching (reducing repeated token cost)
Session state snapshots and resume
Multi-layer memory architecture (short-term working memory + long-term external memory)
Policy-based dynamic context construction

What It Solved

Lowered long-chain cost and latency.
Improved continuity in long-running tasks.
Made memory management governable as an engineering subsystem.

Core Problems

Cost and system complexity increased.
Memory contamination and stale-information governance became harder.
Strong observability was required to diagnose context failure points.

Representative Industry Articles and References

Below are high-value public references for context engineering:

Anthropic: Effective context engineering for AI agents

Clearly positions context engineering as the natural extension of prompt engineering.
Emphasizes that reliability bottlenecks in agents are often in context construction, not single prompts.

Anthropic: Prompt engineering for Claude’s long context window

Early long-context practice guidance with concrete input-structuring patterns.

Anthropic Docs: Long context prompting tips

Practical implementation checklist style guidance.

LangChain Docs: Context engineering in agents

Implementation-oriented strategies for what to inject at each agent step.

Paper: Lost in the Middle: How Language Models Use Long Contexts

Provides systematic evidence for degraded utilization of middle context.
Directly influenced later compression/reordering practices.

Foundational RAG Paper: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Established the mainstream retrieval+generation paradigm.

What Problems Context Engineering Solves

This can be summarized into 6 core problem classes:

Information selection

Not all data should be provided; only context relevant to the current step.

Memory continuity

Keep long tasks continuous across turns, windows, and sessions.

Cost and performance

Control token spend, latency, and throughput by reducing low-value context.

Reliability

Reduce missed evidence, state misreads, and repeated failed attempts.

Governance

Make context policies (compression/retrieval/reordering) configurable, measurable, and iteratable.

Toolchain coordination

Integrate context with RAG, caching, state machines, and orchestration systems.

One-line summary:

Context engineering is not about whether a model can answer once; it is about whether it can keep answering correctly, consistently, and cost-effectively in complex workflows.

My Practical Conclusion

For agent projects, a pragmatic build order is:

Start with prompt engineering (clear task contract)
Then add context engineering (information lifecycle management)
Finally implement harness engineering (end-to-end execution loop)

If you only do prompt engineering, long tasks remain fragile. If you skip context engineering and jump directly to harness engineering, complexity increases quickly and debugging becomes expensive.

Agent_Prompt Engineering

Tue, 19 May 2026 16:20:00 +0800

What Prompt Engineering Is

Prompt engineering is essentially:

Designing input structure (instructions, context, examples, and output constraints) to improve model output quality, stability, and usability.

At an early stage, this was mainly a “single-call optimization” problem:

How to reduce model drift for the same question
How to force structured output for programmatic integration
How to make the model focus on the most relevant information under limited context

One-line view:

Prompt engineering = translating natural-language requirements into stable, executable model input contracts

What Early Prompt Engineering Tried to Solve

In early LLM usage, the main pain points were direct:

Unstable outputs

Same input, varying output quality across runs

Inconsistent instruction following

Missing constraints, skipped steps, or task boundary drift

Uncontrolled output format

Hard to reliably produce JSON/table/structured fields

Hallucination and fabrication

Models tend to fill gaps with invented facts

High engineering integration cost

Hard to plug responses into automated pipelines (parse/store/invoke)

The real value of prompt engineering was turning “probabilistic conversation behavior” into “repeatable invocation behavior.”

Typical Methods in Prompt Engineering

1. Instruction Clarification

Break tasks into explicit actions and avoid vague intent.

You are a backend code review assistant.
Goal: identify concurrency safety issues.
Scope: only check src/service/*.java.
Output: return a Markdown table with columns risk_level/file_path/fix_suggestion.

2. Structured Constraints

Define a fixed output schema to reduce “looks good but unusable” responses.

{
 "risk_level": "high|medium|low",
 "file": "string",
 "issue": "string",
 "fix": "string"
}

3. Few-shot Examples

Provide 1-3 high-quality examples to improve style consistency and task alignment.

4. Role and Boundary Control

State what the model can and cannot do, especially no guessing.

If evidence is insufficient, return "insufficient information" and do not fabricate.

5. Iterative Tuning

Treat prompts like code: version, test, and refine.

How to Use It in Real Development (Executable Workflow)

Step 0: Define the Task Interface First

Define clearly:

What the input is
Who consumes the output (human/program)
What qualifies as acceptable output

This is essentially defining an API contract for prompts.

Step 1: Use Prompt Templates, Not One-off Writing

Use a stable template:

Role
Goal
Input
Constraints
Output format
Failure handling rules

Example:

[Role]
You are a senior frontend reviewer.

[Goal]
Check whether the following PR diff contains accessibility issues.

[Input]
{{DIFF_CONTENT}}

[Constraints]
- Judge only based on the provided diff
- Do not infer unprovided code

[Output Format]
JSON array: [{"severity":"","file":"","issue":"","fix":""}]

[Failure Handling]
If evidence is insufficient, return an empty array and include a reason field.

Step 2: Add Automatic Evaluation to Prompts

Do not rely only on manual reading. At least run:

Format checks: JSON parsable, required fields present
Quality checks: key constraints satisfied (e.g. file and fix must exist)

Step 3: Feed Failure Samples Back into Prompt Design

Convert typical failures into:

New constraints
New examples
New counter-examples

This is the core learning loop in prompt engineering.

Step 4: Split Prompts by Scenario

Do not expect one mega-prompt to cover all tasks. Split by function:

Information extraction prompt
Code review prompt
Planning prompt
Generation prompt

This improves stability and testability.

Limits of Prompt Engineering Alone

Prompt engineering is effective, but has natural boundaries, especially in agent/long-running development:

Limited memory management

Prompt tuning optimizes “how to ask now,” not “how to manage multi-turn state”

Long-context degradation

As history grows, prompt constraints alone cannot solve token/attention dilution

Weak state continuity

After interruption, a single prompt cannot reliably restore full task state

No execution loop by itself

A prompt can say “run tests,” but that does not guarantee tests are executed, logs collected, and state updated

No system-level governance

It cannot alone solve tool orchestration, failure recovery, observability, and quality gates

Why It Evolved into Context Engineering

Once tasks evolved from Q&A to continuous development, the key problems became:

What history to keep
When to compress history
How to retrieve and refill old information
How to hand off state without loss across context windows

That is the scope of context engineering:

Prompt engineering focuses on: how to express tasks
Context engineering focuses on: how to manage task history and state

Why It Further Evolved into Harness Engineering

Even with prompt + context engineering, a larger challenge remains:

How to make agents reliably deliver in real engineering workflows.

That requires system capabilities:

Toolchain orchestration (lint/test/build/deploy)
Quality gates and automatic verification
Failure recovery and retry strategies
Task scheduling and state tracking
Rule accumulation and observability

That is the scope of harness engineering:

Harness engineering = assembling prompt, context, tools, checks, and workflow into a sustainable delivery system

Relationship Among the Three

Dimension	Prompt Engineering	Context Engineering	Harness Engineering
Core question	How to improve single-call output	How to manage multi-turn memory and state	How to make end-to-end delivery stable
Main object	Single input text	History, summaries, retrieval, state	Toolchains, rules, validation, orchestration
Typical artifact	Prompt templates	State snapshots, compression summaries, memory layers	Agent workflows, check loops, runtime policies
Main failure point	Drift in long tasks	Lacks execution/governance	Higher implementation cost, but highest stability

My Practical Conclusion

Prompt engineering is not outdated. It is the foundational layer.

In real development, a practical sequence is:

Stabilize prompt engineering first (stable input/output)
Add context engineering next (handle long-running memory)
Build harness engineering last (close the system loop for stable delivery)

If you jump directly to harness while prompt quality is unstable, complexity rises quickly and failures become harder to debug. If you only do prompt engineering, long-running development remains fragile.

References

OpenAI: Prompt Engineering Guide
OpenAI: Best practices for prompt engineering
Anthropic: Prompt engineering overview
Anthropic: Use XML tags to structure prompts

Agent_Context Compression Prompt

Fri, 15 May 2026 17:58:59 +0800

Notes on Agent Context Compression Design

Reference: Context Compression Instruction: Prompt Analysis of Claude Code and Gemini

What Problem Does Context Compression Solve?

An agent’s context window is not infinite. As multi-turn conversations, tool calls, file reads, error logs, and code diffs accumulate, the model gradually approaches the token limit. The goal of context compression is not simply to “make it shorter,” but to preserve task continuity while reorganizing history into a state that the next agent turn can continue from.

I treat context compression as a work handoff:

Keep what the user is actually trying to accomplish
Keep project constraints, tech stack, and key decisions
Keep file states that were read, modified, or created
Keep errors, fixes, and unresolved issues
Drop repetitive, outdated, and noisy tool outputs
Let the next context window continue execution instead of re-exploring

A good compression system should answer three questions:

When to compress: scheduling strategy based on token thresholds, message length, tool output size, etc.
What to compress: user messages, system constraints, tool results, file states, or plans
How to compress: LLM summarization, rule-based trimming, retrieval reconstruction, or a hybrid approach

Classic Approach 1: LLM Summarization Compression

Both Claude Code and Gemini CLI follow a core idea: when context is too long, pass history to a model and let it output a structured summary. This summary becomes the core memory in the next context window.

The advantage is strong semantic retention: goals, constraints, errors, and plans scattered across long history can be reorganized. The downside is that quality depends on prompt design. A weak prompt may lose file paths, snippets, user preferences, or unfinished tasks.

Claude Code Style: Detailed Structured Handoff

Claude Code-style compression is closer to a full handoff document. It emphasizes chronological analysis and focuses on user requests, technical details, file changes, error handling, and next steps.

Suggested fields:

Field	Purpose
Primary requests and intent	Preserve the initial user goal and later intent shifts
Key technical concepts	Record stack, frameworks, architecture patterns, dependencies
Files and code sections	Track read/modified/created files and key snippets
Errors and fixes	Prevent repeating the same mistakes after compression
Problem-solving status	Separate resolved issues from ongoing debugging
User messages	Preserve original feedback to reduce intent distortion
Pending tasks	Make remaining work explicit
Current work state	Capture what was in progress before compression
Optional next steps	Keep only directly relevant follow-up actions

The point is not “a pretty summary,” but “a handoff that can keep coding.” In coding-agent workflows, file paths, function names, test commands, failed logs, and user corrections are critical.

Compression template:

Please compress the conversation history into a handoff summary that can continue execution.

Must keep:
1. User’s primary goals and explicit requests
2. Tech stack, architecture constraints, and key decisions
3. Files read/modified/created/deleted and why
4. Key code snippets, function signatures, config items
5. Encountered errors, failure logs, and fixes
6. Important user feedback and preferences
7. Completed items, pending items, and current pause point
8. Next-step suggestions directly related to the current task only

Must remove:
1. Repetitive explanations
2. Outdated tool outputs
3. Intermediate attempts that no longer help
4. Irrelevant small talk

Gemini CLI Style: State Snapshot

Gemini CLI-style compression is more like generating a compact state_snapshot. It uses fewer fields but packs higher density.

Typical fields:

Field	Purpose
`overall_goal`	One-line high-level user objective
`key_knowledge`	Facts, constraints, and conventions that must be remembered
`file_system_state`	Created/read/modified/deleted file state
`recent_actions`	Recent key actions and outcomes
`current_plan`	Current plan and progress

This style works well as a runtime snapshot, especially for recovery after interruption. It is shorter than the Claude-style handoff but requires stricter detail retention.

<state_snapshot>
 <overall_goal>User's current high-level goal</overall_goal>
 <key_knowledge>Critical facts, constraints, preferences, technical decisions</key_knowledge>
 <file_system_state>File read/modify/create/delete state</file_system_state>
 <recent_actions>Recent important actions and outcomes</recent_actions>
 <current_plan>Current plan, completed steps, pending steps</current_plan>
</state_snapshot>

Classic Approach 2: Tool Message Trimming

In real agent systems, the biggest token consumer is often tool output, not user text or assistant replies. File reads, code search, test runs, and logs can explode token usage.

So tool-message trimming is highly practical:

Keep system messages
Keep normal user and assistant messages
Remove outdated tool calls and tool outputs
Keep only the last N tool rounds
Summarize key tool outputs before deleting raw long outputs

A common policy: identify all tool rounds, keep only the last N, and remove older tool-related messages.

type MessageRole = 'system' | 'user' | 'assistant' | 'tool';

interface Message {
 role: MessageRole;
 content: string;
 tool_calls?: unknown[];
 tool_call_id?: string;
}

interface CompressionOptions {
 enabled: boolean;
 keepLastToolRounds: number;
}

function compressToolMessages(
 messages: Message[],
 options: CompressionOptions
): Message[] {
 if (!options.enabled) return messages;

 const toolRounds = identifyToolRounds(messages);
 const roundsToKeep = toolRounds.slice(-options.keepLastToolRounds);
 const keepIndexes = new Set(roundsToKeep.flatMap(round => round.indexes));

 return messages.filter((message, index) => {
 if (message.role === 'system') return true;
 if (keepIndexes.has(index)) return true;

 const isToolRelated =
 message.role === 'tool' ||
 (message.role === 'assistant' && Boolean(message.tool_calls));

 return !isToolRelated;
 });
}

The key decision is whether a tool output still helps future decisions. If it has already been absorbed into conclusions or is only exploratory noise, remove it. If it is a fresh test result, key error log, or important file content, keep or summarize it first.

Classic Approach 3: Middle Drop, Oldest Drop, and Hybrid Strategy

Besides LLM summarization, rule-based algorithms can also trim messages directly. They are more controllable and cheaper, but weaker in semantic understanding.

Three common methods:

Strategy	Method	Best for
Middle drop	Keep head and tail, remove middle	Head has constraints, tail has current work
Oldest drop	Remove earliest messages first	Long-running sessions where recent context matters most
Hybrid	Choose dynamically by conversation shape	Mixed workloads and different model limits

Middle Drop

Works well when history has this structure:

Head: system prompt, project rules, user goals
Middle: heavy tool usage, search process, trial-and-error
Tail: current issue, latest code, latest errors

Advantage: keeps task framing and current working context. Risk: key decisions may be lost if the middle is removed without summarization.

Oldest Drop

This is a sliding-window style approach. It assumes the newest messages are most relevant.

Advantage: simple and effective for continuity in long sessions. Risk: early constraints, architecture decisions, or initial goals may be dropped.

Hybrid Strategy

Dynamic selection can use:

Compression ratio target (current tokens vs target)
Total message count
Share of recent-message tokens
Presence of long messages
Presence of system messages
Heavy tool-message density
Model context window size

A practical decision table:

Condition	Recommended strategy	Why
Light compression + short dialogue	Middle drop	Head and tail are often most important
Heavy compression + very long dialogue	Oldest drop	Recent context usually has higher priority
Recent messages dominate tokens	Middle drop	Protect the current working context
System/tool-heavy history	Middle drop	Keep opening rules and latest state
Uncertain	Try both and score	Data-driven selection

A simple score:

efficiency_score = token_reduction_ratio * 0.6 + message_retention_ratio * 0.4

If the system prioritizes staying under target tokens, increase token-reduction weight. If it prioritizes context continuity, increase retention weight.

Recommended Hybrid Compression Architecture

A single method is usually not robust enough. For coding agents, I prefer a combined pipeline:

Raw history
 ↓
Token and structure statistics
 ↓
Compression threshold check
 ↓
Trim outdated tool messages
 ↓
LLM structured summary for key history
 ↓
Generate state snapshot / handoff summary
 ↓
Rebuild next context window

I usually preserve four layers:

Layer	Content	Storage
Stable rules layer	System prompt, project rules, security constraints	Persistent prompt/rule files
Working memory layer	Current goal, plan, TODOs, user preferences	Structured summary
Evidence layer	Latest tool results, key errors, key snippets	Last N tool rounds or summarized evidence
External knowledge layer	Docs, codebase, history	RAG / file retrieval

Rebuilt context layout:

System prompt
Project rules
Compression preface
Structured summary
Recent full conversation rounds
Recent key tool results
Current user request

The “recent full rounds” part is important. Summaries keep the big picture, but recent raw turns often carry subtle intent, tone, corrections, and boundary conditions.

Compression Prompt Design Principles

The goal is not to let the model freestyle. It is to enforce a stable handoff format.

Recommended prompt constraints:

Explicit role: you are a context compressor, not an executor
Explicit goal: generate a state that the next agent can continue from
Explicit retention: goals, constraints, files, code, errors, plan, user feedback
Explicit deletion: repetition, irrelevant tool output, small talk, intermediate noise
Explicit output format: Markdown, XML, JSON, or custom tags
Explicit prohibition: do not fabricate file states, do not invent decisions, do not execute next steps

Practical prompt template:

You are the context compressor for an agent.

Please compress the conversation history into a Chinese handoff summary.
This summary will be the primary context for continuing execution in the next context window.

Must keep:
- User goals, explicit requests, and important feedback
- Tech stack, project constraints, architecture decisions, tool preferences
- File paths read/modified/created/deleted
- Key code snippets, function names, config items, commands
- Encountered errors, failed tests, and fixes
- Completed tasks, pending tasks, and current pause point
- Next-step suggestions directly relevant to the current task

Must remove:
- Repetitive explanations
- Irrelevant small talk
- Tool output with no further value
- Intermediate attempts that do not affect final decisions

Do not fabricate information not present in history.
Do not execute tasks. Only output the compressed summary.

Engineering Implementation Notes

Trigger Timing

Compression can be triggered when:

Tokens exceed 70% to 85% of model context limit
Single tool output exceeds threshold
Tool call rounds exceed threshold
A task phase ends and a handoff is needed
User explicitly requests /compact or equivalent command

Compression Order

Recommended order:

Remove obviously low-value tool output
Keep the last N complete conversation rounds
Generate structured summaries for older messages
Rebuild context with summary + rules + recent rounds
Record metrics: pre/post token count, dropped message count, kept tool rounds

Risk Control

The most common failure is not “insufficient compression,” but “loss of critical facts.”

Especially avoid:

Losing explicit user constraints
Losing file paths
Losing the latest error message
Losing failed attempts that should not be repeated
Turning assumptions into facts
Mixing completed tasks with pending tasks

I prefer to keep explicit state labels in summaries:

[Done] Fixed login form validation
[Failed attempt] Direct schema change breaks legacy API
[Pending confirmation] Whether to keep legacy export format
[Next] Run pnpm test for auth module verification

My Takeaway

Context compression is fundamentally an agent memory-management and handoff system. Claude Code-style compression is better for full development-context retention. Gemini CLI-style compression is better for high-density state snapshots. Tool-message trimming is the most direct way to reduce token noise.

If I were implementing a stable agent compression module, I would prioritize this combination:

Keep recent conversation rounds intact
+ Trim outdated tool messages
+ LLM structured summary
+ File state snapshot
+ Current plan and TODO list
+ Compression metrics and observability logs

The final objective is not the shortest context. It is that after compression, the agent still knows: what the user wants, what the project is, what has been done, what has failed, where it stopped, and what should happen next.

Agent_Harness Engineering

Tue, 19 May 2026 11:29:42 +0800

What Harness Engineering Actually Is

My conclusion after reading these articles side by side:

Harness Engineering is not just about writing better prompts. It is about engineering all the capabilities around the model into an iterative system, so an agent can produce stable and verifiable outcomes during long-running tasks.

One-line summary:

Agent = Model + Harness
Harness = State management + Tooling + Constraints + Feedback loops + Execution orchestration

The model provides intelligence. The harness makes that intelligence usable, controllable, and repeatable.

Shared Takeaways Across the Articles

Theme	Common Ground
Definition of harness	Not the model itself, but surrounding code, configuration, process, tools, and validation mechanisms
Goal	Reduce supervision cost, improve first-pass correctness, and support long-running execution
Core method	Turn repeated failure modes into engineered assets: rules, tools, tests, and loops
Main long-task challenge	Limited context windows, session interruption, state drift, and premature “done” claims
Solution direction	Incremental task decomposition, state handoff, automated checks, observability, and continuous correction

5 Core Components (My Practical View)

Task scaffolding

Clear decomposition strategy (one feature at a time)
Clear Definition of Done (DoD) to avoid “looks finished” outputs

State and memory

Recoverable state: progress files, commit notes, change logs
Reliable handoff between sessions instead of relying on model guessing

Tools and environment

Fast deterministic tools for agents (tests, lint, screenshots, logs)
Self-serve context access instead of manual copy/paste

Feedback and sensors

Computational sensors: lint/typecheck/unit/e2e (fast, deterministic)
Reasoning sensors: LLM review/semantic QA (slower, costlier, but useful for semantics)

Scheduling and governance

After failure, do not only retry; improve capability
Accumulate reusable rules in templates (AGENTS.md, docs, checklists)

Practical Harness Workflow for Normal WebCoding Users

This is my compressed version for individual developers. You do not need multi-agent orchestration to start.

Step 0: Define “Done” First

Create a one-page SPEC.md for each feature:

User scenario
Input and output
Acceptance criteria
Failure scenarios

Without this, agents tend to produce “confident but misaligned” output.

Step 1: Create Minimal Harness Files

At least these 4 files:

AGENTS.md: repository rules (commands, directory conventions, no-touch zones, commit style)
TASKS.md: feature backlog with todo/doing/done
PROGRESS.md: what was done, what is unfinished, next step
CHECKLIST.md: unified acceptance checks (build, test, UI, performance, security)

Step 2: One Feature Per Iteration

Execution pattern:

Pick one item from TASKS.md
Give the agent a bounded task
Avoid “build the entire site in one go” requests

This sharply reduces context chaos and regressions.

Step 3: Let the Agent Change, Then Prove

Require the agent to output every round:

Files changed
Why each change was made
Commands executed
Passed/failed checks
Risk and rollback points

This converts hidden reasoning into auditable execution traces.

Step 4: Two-Layer Validation (Computational First)

Run at least:

npm run lint
npm run test
npm run build

For frontend UI changes, also add:

Key path screenshot checks
Manual critical interaction checklist
Responsive checks on main breakpoints

Rule: pass deterministic checks first, then do semantic review.

Step 5: Convert Every Failure into Harness Assets

When agent output fails, do not only patch the immediate bug:

If it is a rule issue, add it to AGENTS.md
If it is repeated execution, script it
If it is quality drift, add it to CHECKLIST.md

Goal: prevent the same class of errors from recurring.

Step 6: Force Handoff for Long Tasks

If work spans more than one context window, generate a handoff containing:

Current goal
Completed work
Remaining work
Blockers
First step for next round

Store it in PROGRESS.md or planning files, not only in chat history.

Step 7: Run a Release-Grade Loop Before Merge

Before merge, run one unified cycle:

Regression checks
Critical user-path smoke tests
Quick performance and error-log scan
Agent self-review plus human spot-check

This prevents “local pass, system-level failure.”

Step 8: Weekly Harness Cleanup

Weekly maintenance:

Remove stale rules
Fix broken scripts
Merge duplicate constraints
Refresh docs index

Harness is also code. Without maintenance, it decays.

Minimum Viable Harness (MVP) for Individuals

If you want the fastest starting point, do this:

Write 20-50 lines of hard rules in AGENTS.md
Ask the agent to do only one feature per iteration
Run lint/test/build every round
Update PROGRESS.md each round
Convert repeated failures into rules or scripts

These five actions are usually enough to move from “using agents by feel” to “compounding engineering productivity.”

My Practical Conclusion

Harness Engineering answers one core question:

When an agent fails, do you supervise it repeatedly, or convert that failure into system capability?

The first consumes human time. The second compounds.

For normal webcoding users, the key is not the fanciest model, but:

Do you have executable rules?
Do you have automated feedback?
Do you convert failures into deterministic advantages for the next run?

That is the real value of harness engineering.

References

OpenAI: Harness engineering: leveraging Codex in an agent-first world
Anthropic: Effective harnesses for long-running agents
Anthropic: Harness design for long-running application development
LangChain: The Anatomy of an Agent Harness
Mitchell Hashimoto: My AI Adoption Journey
Martin Fowler: Harness Engineering - first thoughts
Martin Fowler: Harness engineering for coding agent users

AI Resume Analysis: Knowledgebase Module

Fri, 15 May 2026 21:55:13 +0800

Knowledgebase Module Design and Implementation

This note records how I implemented the Knowledgebase module in the interview-guide project. The goal is to connect document upload, vectorization, RAG query, and session association into a sustainable knowledge service workflow.

Module Capability Overview

Document management: supports upload, download, deletion, categorization, keyword search, and statistics.
Vectorization capability: stores vectors with pgvector, and processes chunking/storage through async tasks.
RAG Q&A: supports both non-streaming and streaming (SSE) multi-knowledgebase query.
Session coordination: automatically removes associated session references when deleting a knowledgebase to reduce inconsistency risk.

State Transitions

Diagram 1: KnowledgeBase Main State Machine

flowchart TD
A["Call POST /api/knowledgebase/upload to upload file"] --> B["File validation + type detection + dedup check"]

B --> C{"Is file duplicated (fileHash exists)?"}
C -->|Yes| D["Return existing knowledgebase record\nduplicate=true\nno vectorization triggered"]
C -->|No| E["Parse text content + upload file to storage"]

E --> F["Save KnowledgeBaseEntity\ninitial vectorStatus=PENDING"]
F --> G["Send vectorization task to Redis Stream"]

G --> H["VectorizeStreamConsumer consumes task"]
H --> I["markProcessing\nvectorStatus=PROCESSING"]

I --> J["vectorizeAndStore\nchunk text and write to pgvector"]
J --> K{"Did vectorization succeed?"}

K -->|Yes| L["markCompleted\nvectorStatus=COMPLETED\nvectorError=null"]
K -->|No| M{"retryCount < 3 ?"}
M -->|Yes| N["Requeue task (retry+1)"]
N --> H
M -->|No| O["markFailed\nvectorStatus=FAILED\nwrite vectorError"]

P["Call POST /api/knowledgebase/{id}/revectorize"] --> Q["Reset status to PENDING\nclear vectorError"]
Q --> G

R["Call DELETE /api/knowledgebase/{id} to delete knowledgebase"] --> S["Remove RAG session associations"]
S --> T["Delete vector data (best effort) + delete storage file (best effort)"]
T --> U["Delete knowledgebase DB record\nlifecycle ends"]

Diagram 2: Chunked Knowledgebase Vectorization Flow

flowchart TD
A["Knowledgebase upload succeeds"] --> B["Save knowledgebase record vectorStatus=PENDING"]
B --> C["Send vectorization task to Redis Stream"]
C --> D["VectorizeStreamConsumer starts polling"]
D --> E["Read one message: kbId + content + retryCount"]
E --> F["Set status to PROCESSING"]
F --> G["Execute vectorizeAndStore"]

G --> H["Delete old vectors for this kbId"]
H --> I["Text chunking via TokenTextSplitter"]
I --> J["Add metadata kb_id to each chunk"]
J --> K["Batch call vectorStore.add to write vectors"]

K --> L["Set status to COMPLETED"]
L --> M["ACK message"]

G --> N{"Processing exception?"}
N -->|Yes| O{"retryCount < 3"}
O -->|Yes| P["retryCount+1 and requeue"]
P --> M
O -->|No| Q["Set status to FAILED and record error"]
Q --> M

Key API Design

`GET /api/knowledgebase/list` Get Knowledgebase List (Status Filter + Sorting)

Call chain:

Result.success(listService.listKnowledgeBases(status, sortBy));
knowledgeBaseRepository.findByVectorStatusOrderByUploadedAtDesc(vectorStatus);
knowledgeBaseRepository.findAllByOrderByUploadedAtDesc();
entities = sortEntities(entities, sortBy);

`GET /api/knowledgebase/{id}` Get Knowledgebase Detail

Call chain:

listService.getKnowledgeBase(id);
knowledgeBaseRepository.findById(id);

`DELETE /api/knowledgebase/{id}` Delete Knowledgebase

Core flow:

deleteService.deleteKnowledgeBase(id);
knowledgeBaseRepository.findById(id);
sessionRepository.findByKnowledgeBaseIds(List.of(id));
vectorService.deleteByKnowledgeBaseId(id);
storageService.deleteKnowledgeBase(kb.getStorageKey());
knowledgeBaseRepository.deleteById(id);

Notes:

Removes RAG session associations first, then deletes vectors/storage files, then DB record.
Vector/storage deletion failures are logged as warn and do not block the main delete flow.

`POST /api/knowledgebase/query` Non-Streaming Q&A (Multi-Knowledgebase)

Rate limits:

GLOBAL/IP: 10 each

Call chain:

queryService.queryKnowledgeBase(request);
answerQuestion(...);
countService.updateQuestionCounts(...);
vectorService.similaritySearch(...);

Processing highlights:

knowledgeBaseIds and question are required.
If no hit, returns fixed fallback text: “No information retrieved”.
If hit exists, builds context + prompts and calls default ChatClient for answer generation.
Returns QueryResponse(answer, primaryKbId, kbNamesStr).

`POST /api/knowledgebase/query/stream` Streaming Q&A (SSE, Multi-Knowledgebase)

Rate limits:

GLOBAL/IP: 5 each

Call chain:

queryService.answerQuestionStream(kbIds, question);
countService.updateQuestionCounts(...);
vectorService.similaritySearch(...);
chatClient.prompt().stream().content();
normalizeStreamOutput(...);

Processing highlights:

Returns Flux<String> (text/event-stream).
Empty input or no hit returns fallback text stream directly.
Both stream-time and outer exceptions are downgraded to safe fallback output.

`GET /api/knowledgebase/categories` Get All Category Names

Call chain:

listService.getAllCategories();

Return:

Result<List<String>>

`GET /api/knowledgebase/category/{category}` Get Knowledgebase List by Category

Call chain:

listService.listByCategory(category);

Return:

Result<List<KnowledgeBaseListItemDTO>>

`GET /api/knowledgebase/uncategorized` Get Uncategorized Knowledgebase List

Call chain:

listService.listByCategory(category);

Notes:

Current implementation reuses category-query path and distinguishes uncategorized by specific category value.

`PUT /api/knowledgebase/{id}/category` Update Knowledgebase Category

Call chain:

listService.updateCategory(id, body.get("category"));

Processing highlights:

Queries by id first and throws business exception if not found.
Updates category and persists record when found.

`POST /api/knowledgebase/upload` Upload Knowledgebase File (multipart)

Parameters:

file (required)
name (optional)
category (optional)

Rate limits:

GLOBAL/IP: 3 each

Call chain:

uploadService.uploadKnowledgeBase(file, name, category);
findByFileHash(fileHash);

Processing flow:

Validate file presence and size (max 50MB).
Validate type by MIME + extension whitelist (PDF/DOCX/DOC/TXT/MD).
Compute SHA-256 for dedup check.
Parse text content; fail directly on empty text.
Upload file to RustFS (S3-compatible), generate fileKey/fileUrl.
Save KnowledgeBaseEntity with initial vector status PENDING.
Enqueue async vectorization task to Redis Stream (knowledgebase:vectorize:stream).
Return knowledgeBase + storage + duplicate=false.

`GET /api/knowledgebase/{id}/download` Download Original Knowledgebase File

Call chain:

listService.getEntityForDownload(id);
listService.downloadFile(id);

Return:

ResponseEntity<byte[]> (with Content-Disposition and Content-Type)

`GET /api/knowledgebase/search?keyword=...` Keyword Search Knowledgebase

Call chain:

listService.search(keyword);

`GET /api/knowledgebase/stats` Get Knowledgebase Statistics

Call chain:

listService.getStatistics();

Return:

KnowledgeBaseStatsDTO

`POST /api/knowledgebase/{id}/revectorize` Manual Re-Vectorization

Rate limits:

GLOBAL/IP: 2 each

Call chain:

uploadService.revectorize(id);

Processing flow:

Query knowledgebase by id, throw exception if missing.
Download source file from object storage and re-parse text.
Fail directly if parsing fails or returns empty text.
Reset vector status to PENDING.
Enqueue vectorization task to Redis Stream.
Return success immediately; frontend polls status afterward.

Async Vectorization Processing Flow (Core Implementation)

// 1) Delete old vectors
deleteByKnowledgeBaseId(knowledgeBaseId);

// 2) Text chunking (default no overlap)
List<Document> chunks = textSplitter.apply(List.of(new Document(content)));

// 3) Add metadata (kb_id)
chunks.forEach(chunk -> chunk.getMetadata().put("kb_id", knowledgeBaseId.toString()));

// 4) Batch vector write (DashScope batch <= 10)
for (int i = 0; i < batchCount; i++) {
 int start = i * MAX_BATCH_SIZE;
 int end = Math.min(start + MAX_BATCH_SIZE, totalChunks);
 List<Document> batch = chunks.subList(start, end);
 vectorStore.add(batch);
}

Summary

The core value of the Knowledgebase module is connecting file asset management with retrieval-augmented Q&A. For me, the real value is not just successful upload, but making sure documents reliably enter the vectorization pipeline and finally provide reusable, traceable knowledge support in Q&A scenarios.

AI Resume Analysis: Voice Interview Module

Thu, 14 May 2026 22:34:43 +0800

VoiceInterview Module Design and Implementation

This note records how I implemented the VoiceInterview module in the interview-guide project. The core goal is to make voice interviews deliver a complete experience of real-time interaction, resumable sessions, and traceable evaluation.

Module Capability Overview

Real-time voice interaction: built on WebSocket + Qwen3 Voice Model (shared API key for ASR/TTS/LLM).
Streaming experience optimization: sentence-level concurrent TTS, generation/synthesis/playback in parallel, first-packet latency around 200ms.
Server-side VAD: automatic segmentation with real-time subtitles (including intermediate results).
Echo protection: supports manual submission to avoid AI playback being captured as user input.
Session continuity: supports pause/resume and multi-turn context memory, with auto-pause on timeout.
Observability metrics: Micrometer metrics for TTS/ASR latency, session duration, etc.

State Transitions

flowchart TD
A["Create Session
POST /api/voice-interview/sessions"] --> B["IN_PROGRESS"]

B --> C{"Session Events"}
C -- "Pause / Timeout" --> D["PAUSED"]
D -- "Resume" --> B

C -- "End Interview" --> E["COMPLETED"]
E --> F["evaluateStatus = PENDING"]
F --> G["evaluateStatus = PROCESSING"]

G --> H{"Evaluation Result"}
H -- "Success" --> I["EVALUATED
evaluateStatus = COMPLETED"]
H -- "Failure" --> J["evaluateStatus = FAILED"]

B --> K["DELETE /api/voice-interview/sessions/{id}"]
D --> K
E --> K
I --> K
J --> K

Key API Design

`POST /api/voice-interview/sessions` Create Voice Interview Session

Controller entry:

VoiceInterviewController.createSession(@Valid @RequestBody CreateSessionRequest request)

Core call chain:

voiceInterviewService.createSession(request);

Implementation highlights:

Fallback skillId (use default skill when missing).
Fallback llmProvider (use default provider when empty).
Build VoiceInterviewSessionEntity (phase switches, difficulty, resume ID, JD text, planned duration, etc.).
Default userId = "default".
Set initial phase (the first enabled one in intro/tech/project/hr).
Persist to voice_interview_sessions and cache in Redis (with TTL).
Return SessionResponseDTO (session ID, status, phase, config, etc.).

`GET /api/voice-interview/sessions/{sessionId}` Get Session Detail by ID

Controller call:

voiceInterviewService.getSessionDTO(sessionId);

Implementation highlights:

Read Redis first, then DB fallback.
Build SessionResponseDTO when found.
Return unified error when not found: Session not found: {sessionId}.

`POST /api/voice-interview/sessions/{sessionId}/end` End Session and Trigger Async Evaluation

Controller call:

voiceInterviewService.endSession(sessionId.toString());

End + evaluation logic:

session.setEndTime(now);
session.setCurrentPhase(COMPLETED);
session.setStatus(COMPLETED);
session.setEvaluateStatus(PENDING);
sessionRepository.save(session);
voiceEvaluateStreamProducer.sendEvaluateTask(sessionId);
redisService.streamAdd(streamKey(), buildMessage(payload), AsyncTaskStreamConstants.STREAM_MAX_LEN);

Notes:

API returns Result.success() immediately without waiting for evaluation completion.
Frontend polls GET /api/voice-interview/sessions/{sessionId}/evaluation for progress.

`PUT /api/voice-interview/sessions/{sessionId}/pause` Pause Session

Core call:

voiceInterviewService.pauseSession(sessionId.toString(), reason);

Implementation highlights:

Only IN_PROGRESS sessions can be paused.
Set status to PAUSED, record reason, update updatedAt.
Persist DB and sync Redis cache.

`PUT /api/voice-interview/sessions/{sessionId}/resume` Resume Session

Core call:

voiceInterviewService.resumeSession(sessionId.toString());

Implementation highlights:

Only PAUSED sessions can be resumed.
After resume, status becomes IN_PROGRESS without resetting phase/progress.
Persist DB, sync Redis, and return latest SessionResponseDTO.

`GET /api/voice-interview/sessions` Get Session List (Filter by userId/status)

Call chain:

voiceInterviewService.getAllSessions(userId, status);
sessionRepository.findByUserIdAndStatusOrderByUpdatedAtDesc(userId, statusEnum);

Return:

Result<List<SessionMetaDTO>>

`DELETE /api/voice-interview/sessions/{sessionId}` Delete Voice Interview Session

Call chain:

voiceInterviewService.deleteSession(sessionId);

Implementation highlights:

Validate session existence.
Delete session and related data (messages/evaluation, depending on repository implementation).
Clear Redis cache.

`GET /api/voice-interview/sessions/{sessionId}/messages` Get Conversation History

Call chain:

voiceInterviewService.getConversationHistoryDTO(sessionId);

Return:

Result<List<VoiceInterviewMessageDTO>>

`GET /api/voice-interview/sessions/{sessionId}/evaluation` Get Async Evaluation Status and Result

Implementation highlights:

Validate session first (throw VOICE_SESSION_NOT_FOUND if missing).
Read evaluateStatus and evaluateError.
If status is COMPLETED, load evaluation details:

evaluationService.getEvaluation(sessionId);

Return VoiceEvaluationStatusDTO (includes status and result when completed).

`POST /api/voice-interview/sessions/{sessionId}/evaluation` Manually Trigger Async Evaluation

Processing logic:

voiceInterviewService.getSession(sessionId);
evaluationService.getEvaluation(sessionId);
voiceInterviewService.triggerEvaluation(sessionId);

Rules:

If already COMPLETED: return existing evaluation result directly.
If PENDING/PROCESSING: return current status without duplicate triggering.
For other triggerable states: enqueue evaluation task and return PENDING, then frontend continues polling.

Summary

The key value of the VoiceInterview module is not just making voice interaction work, but making the entire real-time pipeline and session lifecycle robustly connected. For me, only when the full chain (create, pause, resume, end, evaluate) works reliably can voice interviews become a truly evolvable product capability.

AI Resume Analysis: Interview Schedule Module

Thu, 14 May 2026 17:10:42 +0800

InterviewSchedule Module Design and Implementation

This note records how I implemented the InterviewSchedule module in the interview-guide project. The goal is to integrate invitation parsing, record management, status maintenance, and reminder coordination into one stable and maintainable workflow.

Module Capability Overview

Invitation parsing: dual-channel parsing with rule engine + AI, supports Feishu/Tencent Meeting/Zoom text formats, automatically extracts company, role, interview time, and meeting link.
Calendar management: supports day/week/month view, drag-and-drop adjustment, and list view collaboration.
Status maintenance: supports manual status updates and scheduled auto-expiration.
Reminder mechanism: supports configurable reminders to reduce missed interviews.

State Transitions

flowchart TD
A["Call POST /api/interview-schedule/parse to parse invitation text"] --> B{"Did rule parsing succeed?"}
B -->|Yes| C["Return ParseResponse\nparseMethod = rule"]
B -->|No| D["Call LLM parsing"]
D --> E{"Did AI parsing succeed?"}
E -->|Yes| F["Return ParseResponse\nparseMethod = ai"]
E -->|No| G["Return parse failure\nsuccess = false"]

H["Call POST /api/interview-schedule to create record"] --> I["create(): force status = PENDING"]
I --> J["Write to DB\nstatus: PENDING"]

J --> K["Call GET /api/interview-schedule or /{id} to query record"]

J --> L["Call PUT /api/interview-schedule/{id} to update base info"]
L --> M["Only update company/role/time fields\nwithout changing status"]
M --> J

J --> N["Call PATCH|PUT /api/interview-schedule/{id}/status?status=..."]
N --> O["updateStatus(): entity.setStatus(status)"]

O --> P{"Target status"}
P -->|COMPLETED| Q["Status -> COMPLETED"]
P -->|CANCELLED| R["Status -> CANCELLED"]
P -->|RESCHEDULED| S["Status -> RESCHEDULED"]
P -->|PENDING| T["Status -> PENDING"]

Q --> U["Record can still be rewritten via status API"]
R --> U
S --> U
T --> U
U --> N

J --> V["Scheduled task ScheduleStatusUpdater\nruns every hour"]
V --> W{"Condition met?\nstatus=PENDING and interviewTime < now"}
W -->|Yes| X["Batch update to CANCELLED"]
W -->|No| Y["No change"]

X --> R
Y --> J

J --> Z["Call DELETE /api/interview-schedule/{id}"]
Z --> AA["Delete record (lifecycle ends)"]

Key API Design

`POST /api/interview-schedule/parse` Parse Interview Invitation Text

Core logic:

parseService.parse(request.getRawText(), request.getSource());
tryRuleParsing(rawText, source);
parseWithAI(rawText, source);

Rule parsing handles structured patterns from Feishu/Tencent/Zoom first.
AI parsing acts as a fallback channel for non-standard text.
Input boundary constraints and prompt-injection protection are applied before AI parsing.

`POST /api/interview-schedule` Create Interview Record

Purpose:

Allows users to directly create an interview schedule record from manual input.

Call chain:

scheduleService.create(request);

Request body (core fields):

public class CreateInterviewRequest {
 @NotBlank(message = "Company name cannot be empty")
 private String companyName;

 @NotBlank(message = "Position cannot be empty")
 private String position;

 @NotNull(message = "Interview time cannot be empty")
 @com.fasterxml.jackson.annotation.JsonFormat(pattern = "yyyy-MM-dd'T'HH:mm[:ss]")
 private java.time.LocalDateTime interviewTime;

 private String interviewType; // ONSITE, VIDEO, PHONE
 private String meetingLink;
 private Integer roundNumber = 1;
 private String interviewer;
 private String notes;
}

`GET /api/interview-schedule/{id}` Get Interview Record by ID

Processing flow:

Controller receives id
Calls scheduleService.getById(id)
Service queries repository for one record and throws business exception if not found
Returns Result<InterviewScheduleDTO>

Call chain:

scheduleService.getById(id);

`GET /api/interview-schedule` Get Interview Record List

Processing flow:

Controller accepts optional filters: status/start/end
Calls scheduleService.getAll(status, start, end)
Service queries by conditions and converts to DTO
Returns Result<List<InterviewScheduleDTO>>

Call chain:

scheduleService.getAll(status, start, end);

`PUT /api/interview-schedule/{id}` Update Interview Record

Processing flow:

Controller receives id + CreateInterviewRequest (with @Valid validation)
Calls scheduleService.update(id, request)
Service loads existing record, updates fields, and saves
Returns updated Result<InterviewScheduleDTO>

Call chain:

scheduleService.update(id, request);

`DELETE /api/interview-schedule/{id}` Delete Interview Record

Processing flow:

Controller receives id
Calls scheduleService.delete(id)
Service deletes when found, throws exception when missing
Returns Result<Void>

Call chain:

scheduleService.delete(id);

`PATCH/PUT /api/interview-schedule/{id}/status` Update Interview Status

API implementation:

@RequestMapping(path = "/{id}/status", method = {RequestMethod.PATCH, RequestMethod.PUT})
public Result<InterviewScheduleDTO> updateStatus(
 @PathVariable Long id,
 @RequestParam InterviewStatus status
) {
 log.info("Update interview status: ID={}, status={}", id, status);
 InterviewScheduleDTO dto = scheduleService.updateStatus(id, status);
 return Result.success(dto);
}

Core call:

scheduleService.updateStatus(id, status);

Summary

The core value of the InterviewSchedule module is connecting invitation understanding with interview process management. For me, this layer is what enables frontend calendar interaction, reminder strategy, and downstream interview evaluation to form a continuous user experience, instead of scattering information across chats and manual notes.

AI Resume Analysis: Interview Module

Thu, 14 May 2026 15:00:53 +0800

Interview Mock Interview Module Design and Implementation

This note records how I implemented the Interview module in the interview-guide project, including the core APIs and evaluation pipeline. The main goal is to build a complete closed loop for question generation, answering, evaluation, and report export, while keeping text interviews and voice interviews aligned under the same evaluation logic.

Module Capability Overview

Skill-driven question generation: supports 10+ interview tracks (Java backend, major-company tracks, frontend, Python, algorithms, system design, test development, AI Agent, etc.). Each track is defined by SKILL.md for scope and difficulty distribution.
Historical question deduplication: previously asked questions in historical sessions are excluded during session creation to reduce repeated assessment.
Interview stage duration linkage: after total duration changes, each stage (self-introduction, technical assessment, project deep-dive, reverse Q&A) is auto-allocated by ratio.
Intelligent follow-up flow: supports multi-round follow-up configuration (default: 1 round) to simulate realistic interview interactions.
Unified evaluation engine: text and voice interviews share the same evaluation architecture (batch evaluation + structured output + summarization + fallback).
Report export: supports asynchronous generation and export of PDF interview reports.
Interview center: unified entry for continue/restart/history operations.

Core State Flow

flowchart TD
A["Call POST /api/interview/sessions to create session"] --> B{"Any unfinished session\nand forceCreate != true?"}
B -->|Yes| C["Return existing session"]
B -->|No| D["Generate questions and save session"]

D --> E["Session state: CREATED\nCache in Redis + persist in DB"]
C --> E

E --> F["Call GET /api/interview/sessions/{sessionId}/question"]
F --> G{"Is current state CREATED?"}
G -->|Yes| H["Switch to IN_PROGRESS"]
G -->|No| I["Keep current state"]
H --> J["Return current question"]
I --> J

J --> K["Call POST /api/interview/sessions/{sessionId}/answers to submit answer"]
K --> L["Save answer"]
L --> M{"Any next question?"}
M -->|Yes| N["currentIndex + 1\nState remains IN_PROGRESS"]
M -->|No| O["Switch state to COMPLETED"]

N --> F
O --> P["Set evaluateStatus to PENDING"]
P --> Q["Send evaluation task to Redis Stream"]

R["Call POST /api/interview/sessions/{sessionId}/complete for early submit"] --> O

Q --> S["Evaluation consumer processes task"]
S --> T["evaluateStatus = PROCESSING"]
T --> U{"Evaluation successful?"}
U -->|Yes| V["Save evaluation report"]
V --> W["Session state = EVALUATED\nevaluateStatus = COMPLETED"]
U -->|No| X{"Retry count < 3 ?"}
X -->|Yes| Q
X -->|No| Y["evaluateStatus = FAILED\nRecord evaluateError"]

Z["Call DELETE /api/interview/sessions/{sessionId}"] --> AA["Delete DB session and answers"]
AA --> AB["Session ended"]

Key API Design

`GET /api/interview/sessions` List Interview Sessions

Purpose:

Used by the interview history page, returns session list in reverse creation order.

Call chain:

persistenceService.findAll().stream();

`POST /api/interview/sessions` Create Interview Session

Rate limiting:

Global limit + IP limit (5)

Core logic:

sessionService.createSession(request);
persistenceService.getHistoricalQuestions(skillId, request.resumeId());
sessionRepository.findTop10ByResumeIdAndSkillIdOrderByCreatedAtDesc(...);
sessionRepository.findTop10BySkillIdOrderByCreatedAtDesc(...);
questionService.generateQuestionsBySkill(...);
sessionCache.saveSession(...);
persistenceService.saveSession(...);

`GET /api/interview/sessions/{sessionId}` Get Session Info

Core logic:

sessionService.getSession(sessionId);
sessionCache.getSession(sessionId);
restoreSessionFromDatabase(sessionId);

`GET /api/interview/sessions/{sessionId}/question` Get Current Question

Core logic:

sessionService.getCurrentQuestionResponse(sessionId);
getCurrentQuestion(sessionId);
getOrRestoreSession(sessionId);

If session is in CREATED state, return question by currentIndex.

`POST /api/interview/sessions/{sessionId}/answers` Submit Answer and Move Forward

Rate limiting:

Global limit (10)

Core logic:

sessionService.submitAnswer(request);

Updates answer, session state, cache, and DB.
If this is the last question:

persistenceService.updateEvaluateStatus(sessionId, AsyncTaskStatus.PENDING, null);
evaluateStreamProducer.sendEvaluateTask(sessionId);

`POST /api/interview/sessions/{sessionId}/answers` Save Draft Answer (No Progress)

Core logic:

sessionService.saveAnswer(request);

Syncs both Redis and DB.

`POST /api/interview/sessions/{sessionId}/complete` Early Submit

Core logic:

sessionService.completeInterview(sessionId);
sessionCache.updateSessionStatus(sessionId, SessionStatus.COMPLETED);

Persists DB status.

evaluateStreamProducer.sendEvaluateTask(sessionId);

`GET /api/interview/sessions/unfinished/{resumeId}` Find Unfinished Session

Core logic:

sessionService.findUnfinishedSessionOrThrow(resumeId);
findUnfinishedSession(resumeId);
sessionCache.findUnfinishedSessionId(resumeId);
persistenceService.findUnfinishedSession(resumeId);

`GET /api/interview/sessions/{sessionId}/report` Generate Interview Evaluation Report

Core logic:

sessionService.generateReport(sessionId);
evaluationService.evaluateInterview(...);
unifiedEvaluationService.evaluate(...);
evaluateInBatches(...);
summarizeBatchResults(...);
structuredOutputInvoker.invoke(...);
securedSystemPrompt = systemPromptWithFormat + ANTI_INJECTION_INSTRUCTION;

Uses anti-injection instruction to reduce prompt contamination risk from user input.

`GET /api/interview/sessions/{sessionId}/details` Get Interview Detail

Call chain:

historyService.getInterviewDetail(sessionId);
interviewPersistenceService.findBySessionId(sessionId);

`GET /api/interview/sessions/{sessionId}/export` Export Interview Report as PDF

Call chain:

historyService.exportInterviewPdf(sessionId);
interviewPersistenceService.findBySessionId(sessionId);
pdfExportService.exportInterviewReport(session);

`DELETE /api/interview/sessions/{sessionId}` Delete Interview Session

Call chain:

persistenceService.deleteSessionBySessionId(sessionId);
sessionRepository.findBySessionId(sessionId);
sessionRepository.delete(session);

Evaluation Engine Implementation Highlights

A single evaluation pipeline supports both text and voice interviews, reducing branch complexity.
Batch-first then summarize strategy balances long-context stability and structured output quality.
Anti-injection prompt composition is applied to reduce malicious-input interference.
In failure scenarios, unified invoker + fallback fields avoid hard report failures.

Summary

The Interview module now covers the full workflow from session creation, dynamic question generation, answer progression, asynchronous evaluation, to report export. For me, the key value is separating interview process management from evaluation result production into two evolvable layers, so future changes to question strategy or model upgrades can stay controlled.

AI Resume Analysis: Resume Module

Thu, 14 May 2026 11:31:10 +0800

Resume Module Design and Implementation

This note records the core design, API responsibilities, async processing pipeline, and practical considerations of the Resume module in the interview-guide project.

Module Capabilities

Multi-format parsing: supports PDF, DOCX, DOC, TXT, and MD.
Async processing: uses Redis Stream for asynchronous resume analysis with status tracking.
Stability: built-in auto-retry on analysis failure (up to 3 times) + duplicate detection based on file hash.
Report export: supports one-click export of AI analysis results as a structured PDF report.

Core Status Flow

flowchart TD
A["Call /api/resumes/upload"] --> B["Validate file and type"]
B --> C{"Is duplicate resume?"}

C -->|Yes| D["Return historical result or status (duplicate=true)"]
C -->|No| E["Parse text + upload object storage + save ResumeEntity"]

E --> F["Set analyzeStatus = PENDING"]
F --> G["Send Redis Stream analyze task"]

G --> H{"Task queued successfully?"}
H -->|No| I["Set FAILED (queue failed)"]
H -->|Yes| J["Consumer pulls task"]

J --> K["Set PROCESSING"]
K --> L["Call ResumeGradingService for AI analysis"]

L --> M{"Any exception in this round?"}
M -->|No| N["Save analysis result"]
N --> O["Set COMPLETED"]

M -->|Yes| P{"retryCount < 3 ?"}
P -->|Yes| Q["retryCount + 1, requeue task"]
Q --> J
P -->|No| R["Set FAILED (final failure)"]

S["Manual retry /api/resumes/{id}/reanalyze"] --> T["Set PENDING and requeue"]
T --> J

Key API Design

`/api/resumes/upload` Upload Resume (Async Analysis)

Rate limit strategy:

Global limit: @RateLimit(dimension = RateLimit.Dimension.GLOBAL, count = 5)
IP limit: @RateLimit(dimension = RateLimit.Dimension.IP, count = 5)

Entry call:

uploadService.uploadAndAnalyze(file);

Processing flow:

Basic file validation

fileValidationService.validateFile(file, MAX_FILE_SIZE, "Resume");

Includes: null check, file size limit, and logging. 2. File type detection

String contentType = parseService.detectContentType(file);

Supports: PDF, DOCX, DOC, TXT, MD. 3. Duplicate file detection

persistenceService.findExistingResume(file);

Internal flow:

String fileHash = fileHashService.calculateHash(file);
resumeRepository.findByFileHash(fileHash);

Resume parsing and text cleaning

parseService.parseResume(file);

Parse to plain text using Apache Tika
textCleaningService.cleanText(content) to reduce excessive line breaks and token usage

File storage (unstructured data)

storageService.uploadResume(file);
storageService.getFileUrl(fileKey);

Uploads to RustFS/MinIO for unstructured file storage. 6. Metadata persistence

persistenceService.saveResume(file, resumeText, fileKey, fileUrl);

Send async analysis task

analyzeStreamProducer.sendAnalyzeTask(savedResume.getId(), resumeText);

Uses Redis Stream as the message queue 8. Return upload response
Frontend checks subsequent APIs for async processing status.

`/api/resumes` Get Resume List

Call chain:

historyService.getAllResumes();
resumePersistenceService.findAllResumes();

Current issue:

User-level isolation is not implemented yet, so it currently returns the full list.

`/api/resumes/{id}/detail` Get Resume Detail

Call chain:

historyService.getResumeDetail(id);
resumePersistenceService.findById(id);
resumeRepository.findById(id);

`/api/resumes/{id}/export` Export Analysis Report as PDF

Call chain:

historyService.exportAnalysisPdf(id);
resumePersistenceService.findById(resumeId);
resumePersistenceService.getLatestAnalysisAsDTO(resumeId);
pdfExportService.exportResumeAnalysis(resume, analysisDTO);

`/api/resumes/{id}` Delete Resume

Call chain:

deleteService.deleteResume(id);
persistenceService.findById(id);
storageService.deleteResume(resume.getStorageKey());
interviewPersistenceService.deleteSessionsByResumeId(id);
persistenceService.deleteResume(id);

`/api/resumes/{id}/reanalyze` Reanalyze Resume

Rate limit strategy:

Global limit: @RateLimit(dimension = RateLimit.Dimension.GLOBAL, count = 2)
IP limit: @RateLimit(dimension = RateLimit.Dimension.IP, count = 2)

Call chain:

uploadService.reanalyze(id);
resumeRepository.findById(resumeId);
analyzeStreamProducer.sendAnalyzeTask(resumeId, resumeText);

Then update and persist status in the processing step.

`/api/resumes/health` Health Check

return Result.success();

For service liveness checks.

Stability Design Points

Async decoupling: upload and analysis are separated to improve responsiveness.
Auto-retry: failed analysis retries up to 3 times to reduce transient failures.
Hash-based dedup: SHA-256 content hash avoids repeated analysis of identical files.

Summary

The Resume module already forms a complete loop: upload, parse, async analyze, export, and delete. The current implementation is stable enough for iterative feature expansion and production hardening.