Agent_RAG Optimization

Fri, 22 May 2026 10:30:00 +0800

RAG Optimization Notes (First-Person)

After reviewing recent RAG optimization materials, my conclusion is straightforward:

The bottleneck of RAG is no longer “can it run,” but “can it hit reliably, stay controllable, and remain measurable in production.”

I now break RAG optimization into four layers:

Pre-retrieval optimization (Query + Chunk)
Retrieval-time optimization (Recall + Rank)
Post-retrieval optimization (Context Packing + Compression)
Production loop optimization (Evaluation + Feedback)

1) Pre-Retrieval Optimization: Fix Input and Corpus Quality First

What I focus on

Semantic chunking

I no longer use fixed 300/500-token hard cuts.
I chunk by semantic paragraphs, code boundaries, and heading hierarchy.
My goal is to make each chunk self-contained and independently citable.
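
To illustrate heading- and paragraph-aware chunking, here is a minimal sketch. It assumes plain text in which headings sit in their own paragraph (either "# Title" or "1) Title" style) and paragraphs are separated by blank lines. The names chunk_by_headings and max_chars are illustrative, not part of any library, and code-boundary detection is left out.

```python
import re

# Assumption: a heading is a single-line paragraph like "# Title" or "1) Title".
HEADING_RE = re.compile(r"^(#{1,6}\s+.+|\d+\)\s+.+)$")


def chunk_by_headings(text: str, max_chars: int = 1200) -> list:
    """Split text at headings first, then at paragraph boundaries.

    Each chunk carries its heading so it stays readable on its own.
    """
    chunks = []
    heading = ""
    buf = []

    def flush():
        body = "\n\n".join(buf).strip()
        if body:
            chunks.append({"heading": heading, "text": body})
        buf.clear()

    for para in re.split(r"\n\s*\n", text):
        para = para.strip()
        if not para:
            continue
        if HEADING_RE.match(para):
            flush()              # close the chunk that belongs to the previous heading
            heading = para
            continue
        # Start a new chunk rather than cutting a paragraph in half.
        if buf and sum(len(p) for p in buf) + len(para) > max_chars:
            flush()
        buf.append(para)
    flush()
    return chunks
```

A paragraph longer than max_chars is kept whole here; whether to split it further is a policy choice this sketch does not make.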

Query rewriting

Normalize colloquial user questions into domain terms.
Handle abbreviations, aliases, and typo normalization.
Decompose complex questions into sub-queries.
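
The rewriting steps above can be sketched with plain string handling. This is a deliberately naive illustration: the alias table is hypothetical, decomposition only splits on ";", "?" and "and then", and typo correction is omitted. In practice an LLM or a domain dictionary would likely do this work.

```python
import re

# Hypothetical alias table; a real one would come from domain glossaries.
ALIASES = {"k8s": "kubernetes", "db": "database", "pg": "postgresql"}


def normalize_query(query: str, aliases: dict) -> str:
    """Replace known abbreviations/aliases with canonical domain terms."""
    def swap(match):
        word = match.group(0)
        return aliases.get(word.lower(), word)
    return re.sub(r"[A-Za-z0-9_]+", swap, query).strip()


def decompose(query: str) -> list:
    """Naive decomposition: split on ';', '?' and the phrase 'and then'."""
    parts = re.split(r"[;?]\s*|\s+and then\s+", query)
    return [p.strip() for p in parts if p.strip()]


def rewrite_query(query: str, aliases: dict = ALIASES) -> list:
    """Return one normalized sub-query per detected sub-question."""
    return [normalize_query(part, aliases) for part in decompose(query)]
```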

HyDE (Hypothetical Document Embeddings)

Generate an “ideal answer draft” first.
Retrieve using the draft embedding, not only the short user query.
I treat HyDE as a recall-boost switch, enabled only in low-recall scenarios.
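
A minimal sketch of HyDE used as a switch rather than a default. It assumes a toy bag-of-words "embedding" in place of a real embedding model, and draft_answer stands in for an LLM call; both are placeholders, as is the min_score threshold.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query_text: str, docs: list, top_k: int = 3) -> list:
    """Return (score, doc) pairs, best first."""
    q = embed(query_text)
    scored = sorted(((cosine(q, embed(d)), d) for d in docs), reverse=True)
    return scored[:top_k]


def retrieve_with_hyde(query, docs, draft_answer, min_score=0.3, top_k=3):
    """Use the plain query first; fall back to HyDE only when recall looks weak."""
    hits = search(query, docs, top_k)
    if hits and hits[0][0] >= min_score:
        return hits                      # plain query is good enough, skip the LLM call
    draft = draft_answer(query)          # hypothetical "ideal answer draft"
    # Retrieve with query + draft so the original wording is not lost.
    return search(query + " " + draft, docs, top_k)
```

Gating on the best plain-query score keeps the extra generation cost confined to the low-recall cases.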

My assessment

If pre-retrieval is weak, reranking, compression, and caching are mostly damage control.

2) Retrieval-Time Optimization: Multi-Path Recall + Rerank, Not Vector-Only

My current approach

Hybrid search

Dense vectors for semantic recall.
Sparse retrieval (BM25/keywords) to recover exact-match cases.
Fuse results before reranking.
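
One common way to fuse dense and sparse result lists is reciprocal rank fusion (RRF). The sketch below is an illustration of that technique, not necessarily the fusion method used in any particular stack; the optional weights parameter is an added assumption for tuning the balance between retrievers, and k=60 is the constant commonly used with RRF.

```python
def rrf_fuse(rankings: list, weights: list = None, k: int = 60) -> list:
    """Fuse several ranked lists of doc IDs with (weighted) reciprocal rank fusion.

    rankings: one list of doc IDs per retriever, best first.
    weights:  optional per-retriever weight (defaults to 1.0 each).
    """
    weights = weights or [1.0] * len(rankings)
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["a", "b", "c"]    # e.g. from vector search
sparse_hits = ["c", "a", "d"]   # e.g. from BM25
fused = rrf_fuse([dense_hits, sparse_hits])
```

RRF only needs ranks, so dense and sparse scores never have to be put on the same scale.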

Two-stage ranking (Recall L1 -> Rank L2)

Stage 1 maximizes recall (better to over-fetch).
Stage 2 uses a reranker to cut the candidates down to a high-precision top-k.

Cross-encoder / API rerank

Score query-doc pairs directly.
More stable than pure embedding similarity, especially on long chunks.
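
The recall-then-rerank flow can be sketched with pluggable functions. Here recall_fn and score_pair are placeholders: in a real system the first would be the hybrid retriever and the second a cross-encoder or a rerank API call. The toy keyword_recall and overlap_score helpers exist only to make the example runnable.

```python
def two_stage_retrieve(query, recall_fn, score_pair, recall_k=50, top_k=5, min_score=None):
    """Stage 1 over-fetches candidates; stage 2 scores each (query, doc) pair."""
    candidates = recall_fn(query, recall_k)
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    if min_score is not None:
        # Quality gate: drop low-precision hits even if top_k is not filled.
        scored = [item for item in scored if item[0] >= min_score]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# Toy stand-ins so the sketch runs without external services.
CORPUS = [
    "reset password via email link",
    "password policy requires twelve characters",
    "email newsletter settings",
]


def keyword_recall(query, k):
    terms = set(query.split())
    return [doc for doc in CORPUS if terms & set(doc.split())][:k]


def overlap_score(query, doc):
    terms = set(query.split())
    return len(terms & set(doc.split())) / len(terms)


results = two_stage_retrieve("reset password email", keyword_recall, overlap_score,
                             top_k=2, min_score=0.5)
```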

My assessment

In production, the issue is often not “nothing found,” but “too many low-precision hits.” Rerank is not optional; it is a quality gate.

3) Post-Retrieval Optimization: Turn Context into High-Density Evidence

Three things I optimize

Evidence compression

Rerank first, then compress.
Remove weakly relevant sentences, template noise, and duplicates.
Keep entities, numbers, and conclusion-bearing sentences.
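
A rough, rule-based sketch of sentence-level compression, applied to passages that have already been reranked. The stopword list, the sentence splitter, and the rule "sentences containing digits need only one query term, others need min_overlap" are all simplifying assumptions; entity detection is not attempted here.

```python
import re

STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "what"}


def compress_evidence(query: str, passages: list, min_overlap: int = 2) -> list:
    """Keep unique sentences that overlap with the query; favor ones with numbers."""
    q_terms = set(re.findall(r"\w+", query.lower())) - STOPWORDS
    seen = set()
    kept = []
    for passage in passages:
        # Naive splitter: breaks after '.', '!' or '?' followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", passage.strip()):
            key = " ".join(re.findall(r"\w+", sentence.lower()))
            if not key or key in seen:
                continue                      # empty or duplicate sentence
            seen.add(key)
            overlap = len(q_terms & set(key.split()))
            needed = 1 if re.search(r"\d", sentence) else min_overlap
            if overlap >= needed:
                kept.append(sentence)
    return kept
```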

Context packing strategy

Do not concatenate by raw retrieval order.
Repack by “question sub-intent -> evidence groups.”
Tag each evidence block with source IDs for traceability.
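
As an illustration of packing by sub-intent, the sketch below groups evidence items and prefixes each with its source ID. The item fields (intent, source_id, text, score) and the output layout are assumptions made for this example.

```python
from collections import defaultdict


def pack_context(evidence: list) -> str:
    """Group evidence by sub-intent and tag every line with its source ID.

    Each item is a dict with keys: intent, source_id, text, score.
    """
    groups = defaultdict(list)
    for item in evidence:
        groups[item["intent"]].append(item)

    blocks = []
    for intent, items in groups.items():
        # Inside a group, strongest evidence first.
        ordered = sorted(items, key=lambda e: e["score"], reverse=True)
        lines = [f"Sub-question: {intent}"]
        lines += [f"[{e['source_id']}] {e['text']}" for e in ordered]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)
```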

Cache-friendly prompt assembly

Place stable system prefixes and static background first.
Maximize prefix reuse and cache hit rate (cost + latency benefits).
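
The ordering idea can be shown with a small prompt builder: everything that is identical across requests goes first, and per-request evidence and the question go last. The function names and the prompt layout are illustrative; how much of the shared prefix a provider actually caches depends on that provider.

```python
def build_prompt(system_prefix: str, static_background: str,
                 evidence_blocks: list, question: str) -> str:
    """Stable parts first, per-request parts last, so prefixes can be reused."""
    stable = f"{system_prefix}\n\n{static_background}"
    dynamic = "\n\n".join(evidence_blocks) + f"\n\nQuestion: {question}"
    return stable + "\n\n" + dynamic


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


SYSTEM = "You answer strictly from the provided evidence."
BACKGROUND = "Product glossary: ..."   # placeholder for static background text
p1 = build_prompt(SYSTEM, BACKGROUND, ["[doc-1] Plan A is free."], "Is there a free plan?")
p2 = build_prompt(SYSTEM, BACKGROUND, ["[doc-7] Max 5 seats."], "What is the seat limit?")
```

Comparing shared_prefix_len(p1, p2) against the length of the stable part is a cheap regression check that nobody has slipped dynamic text into the prefix.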

My assessment

RAG cost is often dominated not by retrieval itself, but by sending low-value context to the LLM. Post-retrieval refinement is one of the most direct cost levers.

4) Production Loop Optimization: Make RAG a System, Not a Demo

My evaluation perspective

Retrieval-layer metrics

Recall@k
MRR / nDCG
Hit-rate buckets (short query / long query / code query)
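
For reference, the retrieval metrics above can be computed in a few lines. This sketch assumes each query has a ranked list of retrieved doc IDs plus a set of relevant IDs (or graded gains for nDCG); the function names are illustrative.

```python
import math


def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of relevant docs that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list, relevant: set) -> float:
    """1 / rank of the first relevant doc, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mrr(runs: list) -> float:
    """Mean reciprocal rank over (retrieved, relevant) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(retrieved: list, gains: dict, k: int) -> float:
    """nDCG with graded gains; docs missing from `gains` count as 0."""
    dcg = sum(gains.get(d, 0.0) / math.log2(i + 1)
              for i, d in enumerate(retrieved[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0
```

Computing these per bucket (short, long, and code queries) is just a matter of grouping the runs before averaging.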

Generation-layer metrics

Faithfulness (is the answer grounded in evidence?)
Answer relevance (does it answer the actual question?)
Context precision (how much retrieved context is truly useful?)

System-layer metrics

P95 latency
Per-query token cost
Cache hit rate
Fallback-routing ratio (share of queries that need backup retrieval or web search)
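
A small sketch of how these system-layer numbers could be derived from request logs. The record fields (latency_ms, tokens, cache_hit, fallback) are hypothetical, the percentile uses the nearest-rank method, and the log is assumed to be non-empty.

```python
import math


def percentile(values: list, pct: int) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(records: list) -> dict:
    """Aggregate per-request log records into system-layer metrics."""
    n = len(records)
    return {
        "p95_latency_ms": percentile([r["latency_ms"] for r in records], 95),
        "avg_tokens_per_query": sum(r["tokens"] for r in records) / n,
        "cache_hit_rate": sum(r["cache_hit"] for r in records) / n,
        "fallback_ratio": sum(r["fallback"] for r in records) / n,
    }
```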

My feedback loop

User query -> recall -> rerank -> generate answer
Evaluator scores answer and evidence automatically
Low-score samples flow into a hard-case dataset
Weekly regression over retrieval params, chunking policy, and reranker setup
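
The "low-score samples flow into a hard-case dataset" step could look roughly like this. The result schema, the JSONL format, and the 0.7 threshold are assumptions for illustration; the evaluator that produces the scores is out of scope here.

```python
import json


def route_hard_cases(results: list, sink, min_score: float = 0.7) -> int:
    """Append samples whose worst evaluator score is below min_score to a JSONL sink.

    results: dicts with keys query, scores (metric name -> float), evidence_ids.
    sink:    any object with a write() method, e.g. an open file.
    Returns the number of samples written.
    """
    written = 0
    for result in results:
        worst = min(result["scores"].values())
        if worst < min_score:
            row = {
                "query": result["query"],
                "scores": result["scores"],
                "evidence_ids": result["evidence_ids"],
            }
            sink.write(json.dumps(row) + "\n")
            written += 1
    return written
```

Using the worst score rather than the average means a fluent but ungrounded answer still lands in the hard-case set.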

Vendor/Framework Recommendations I Use as Baseline

I prioritize official vendor/framework docs over second-hand summaries.

Microsoft Learn: Build Advanced Retrieval-Augmented Generation Systems

End-to-end advanced RAG workflow
Strong emphasis on query rewriting, post-retrieval processing, and evaluation loops

Azure Architecture Center: Develop a RAG Solution—Information-Retrieval Phase

Systematic retrieval-phase guidance
Explicitly covers query augmentation/decomposition/rewriting/HyDE

Anthropic Engineering: Contextual Retrieval

Practical guidance on hybrid retrieval and context utilization
Clearly addresses the gap between “retrieved” and “used correctly”

Anthropic Help: Retrieval Augmented Generation (RAG) for Projects

Checklist-oriented practical recommendations for productization

Cohere Docs: Best Practices for using Rerank

Practical rerank guidance for input organization and deployment

Paper: Lost in the Middle

Evidence for middle-context utilization degradation
Supports the need for reranking, compression, and packing

Paper: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Foundational retrieval+generation paradigm

How I Integrate These Optimizations into Real AI Application Iteration

I run a weekly optimization loop:

Step 0: Define scenario buckets and baseline

Build 100–300 real QA samples (bucketed by scenario).
Record baseline: retrieval hit quality, answer quality, latency, and cost.

Step 1: Change only one variable per iteration

I modify one parameter at a time:

Chunking policy
Query rewriting switch
Hybrid fusion weights
Reranker model/threshold
Context compression ratio

This avoids confounded results.
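
One way to enforce the one-variable rule is to diff the candidate config against the baseline before an experiment is allowed to run. The config keys below mirror the list above but are otherwise hypothetical.

```python
_MISSING = object()


def changed_keys(baseline: dict, candidate: dict) -> list:
    """Names of parameters whose values differ between two configs."""
    keys = sorted(set(baseline) | set(candidate))
    return [k for k in keys
            if baseline.get(k, _MISSING) != candidate.get(k, _MISSING)]


def require_single_change(baseline: dict, candidate: dict) -> str:
    """Return the one changed parameter, or raise if zero or several changed."""
    diff = changed_keys(baseline, candidate)
    if len(diff) != 1:
        raise ValueError(f"expected exactly one changed parameter, got {diff}")
    return diff[0]


BASELINE = {
    "chunking_policy": "semantic",
    "query_rewriting": False,
    "fusion_weights": (0.5, 0.5),
    "reranker": "none",
    "compression_ratio": 1.0,
}
candidate = dict(BASELINE, query_rewriting=True)
changed = require_single_change(BASELINE, candidate)
```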

Step 2: Pass offline evaluation first

No offline pass, no online rollout.
I check three dimensions: quality gain, latency impact, cost impact.

Step 3: Online canary with rollback thresholds

Roll out on small traffic.
Set automatic rollback thresholds (P95 latency, complaint rate, empty-answer rate).
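
A rollback check for the canary can be as simple as comparing live metrics with fixed limits. The threshold values and metric names below are placeholders, not recommendations; real limits would come from the recorded baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackThresholds:
    # Placeholder limits; derive real ones from the Step 0 baseline.
    max_p95_ms: float = 2500.0
    max_complaint_rate: float = 0.02
    max_empty_answer_rate: float = 0.05


def breached_thresholds(metrics: dict, limits: RollbackThresholds) -> list:
    """Return the names of metrics that exceed their limit (empty list = healthy)."""
    checks = [
        ("p95_ms", limits.max_p95_ms),
        ("complaint_rate", limits.max_complaint_rate),
        ("empty_answer_rate", limits.max_empty_answer_rate),
    ]
    return [name for name, limit in checks if metrics[name] > limit]


canary = {"p95_ms": 3000.0, "complaint_rate": 0.01, "empty_answer_rate": 0.01}
breaches = breached_thresholds(canary, RollbackThresholds())
should_roll_back = bool(breaches)
```

Returning the list of breached metrics, rather than a bare boolean, makes the rollback reason easy to log.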

Step 4: Convert wins into engineering assets

I persist proven improvements into:

Retrieval config templates
Prompt/context assembly conventions
RAG regression scripts
Failure case datasets and labeling rules

My Conclusion

My final view on RAG optimization:

Pre-retrieval defines the ceiling (is the question represented correctly?)
Retrieval-time defines hit quality (are we finding the right evidence?)
Post-retrieval defines cost and usability (is high-density evidence delivered to the LLM?)
Production loop defines sustainability (can quality keep improving?)

One-line summary:

RAG optimization is not “just tune model parameters”; it is engineering governance across retrieval, reranking, context construction, evaluation, and feedback.
