<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>RAG on XEDCZQ Blog</title><link>https://xedczq.cn/en/tags/rag/</link><description>Recent content in RAG on XEDCZQ Blog</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Fri, 22 May 2026 10:30:00 +0800</lastBuildDate><atom:link href="https://xedczq.cn/en/tags/rag/index.xml" rel="self" type="application/rss+xml"/><item><title>Agent_RAG Optimization</title><link>https://xedczq.cn/en/post/agent_rag%E4%BC%98%E5%8C%96/</link><pubDate>Fri, 22 May 2026 10:30:00 +0800</pubDate><guid>https://xedczq.cn/en/post/agent_rag%E4%BC%98%E5%8C%96/</guid><description>&lt;h1 id="rag-optimization-notes-first-person"&gt;&lt;a href="#rag-optimization-notes-first-person" class="header-anchor"&gt;&lt;/a&gt;RAG Optimization Notes (First-Person)
&lt;/h1&gt;&lt;p&gt;After reviewing recent RAG optimization materials, my conclusion is straightforward:&lt;/p&gt;
&lt;p&gt;The bottleneck of RAG is no longer &amp;ldquo;can it run,&amp;rdquo; but &amp;ldquo;can it retrieve the right evidence reliably, stay controllable, and remain measurable in production.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;I now break RAG optimization into four layers:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Pre-retrieval optimization (Query + Chunk)&lt;/li&gt;
&lt;li&gt;Retrieval-time optimization (Recall + Rank)&lt;/li&gt;
&lt;li&gt;Post-retrieval optimization (Context Packing + Compression)&lt;/li&gt;
&lt;li&gt;Production loop optimization (Evaluation + Feedback)&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h2 id="1-pre-retrieval-optimization-fix-input-and-corpus-quality-first"&gt;&lt;a href="#1-pre-retrieval-optimization-fix-input-and-corpus-quality-first" class="header-anchor"&gt;&lt;/a&gt;1) Pre-Retrieval Optimization: Fix Input and Corpus Quality First
&lt;/h2&gt;&lt;h3 id="what-i-focus-on"&gt;&lt;a href="#what-i-focus-on" class="header-anchor"&gt;&lt;/a&gt;What I focus on
&lt;/h3&gt;&lt;ol&gt;
&lt;li&gt;Semantic chunking&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;I no longer use fixed 300/500-token hard cuts.&lt;/li&gt;
&lt;li&gt;I chunk by semantic paragraphs, code boundaries, and heading hierarchy.&lt;/li&gt;
&lt;li&gt;My goal is to make each chunk self-contained and independently citable.&lt;/li&gt;
&lt;/ul&gt;
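&lt;p&gt;A minimal sketch of heading-aware chunking, assuming the corpus is Markdown. The names &lt;code&gt;Chunk&lt;/code&gt; and &lt;code&gt;chunk_markdown&lt;/code&gt; are hypothetical and not taken from any library. The sketch splits at headings, never cuts inside a fenced code block, and stores the heading path with each chunk so the chunk can be cited on its own.&lt;/p&gt;

```python
import re
from dataclasses import dataclass

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the sample readable


@dataclass
class Chunk:
    heading_path: list[str]  # e.g. ["Guide", "Install"]
    text: str


def chunk_markdown(doc: str) -> list[Chunk]:
    """Split Markdown at headings; lines inside code fences are never treated as headings."""
    chunks: list[Chunk] = []
    path: list[str] = []
    buf: list[str] = []
    in_code = False

    def flush() -> None:
        body = "\n".join(buf).strip()
        if body:
            chunks.append(Chunk(list(path), body))
        buf.clear()

    for line in doc.splitlines():
        if line.startswith(FENCE):
            in_code = not in_code
        match = None if in_code else HEADING.match(line)
        if match:
            flush()  # close the chunk that belonged to the previous heading
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at the same or deeper level
            path.append(match.group(2).strip())
        else:
            buf.append(line)
    flush()
    return chunks
```

&lt;p&gt;Prepending the heading path to the chunk text before embedding is one way to keep short chunks self-contained. Oversized sections would still need a secondary split, for example at paragraph boundaries.&lt;/p&gt;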
&lt;ol start="2"&gt;
&lt;li&gt;Query rewriting&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;Normalize colloquial user questions into domain terms.&lt;/li&gt;
&lt;li&gt;Handle abbreviations, aliases, and typo normalization.&lt;/li&gt;
&lt;li&gt;Decompose complex questions into sub-queries.&lt;/li&gt;
&lt;/ul&gt;
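&lt;p&gt;The three rewriting steps above can be sketched with plain rules. This is an illustration only: the &lt;code&gt;ALIASES&lt;/code&gt; table is hypothetical, and the split on &amp;ldquo;and&amp;rdquo; is a deliberately naive stand-in for whatever decomposition step (often an LLM call) a real system would use.&lt;/p&gt;

```python
import re

# Hypothetical alias/typo table; in practice this would come from a domain glossary.
ALIASES = {
    "k8s": "kubernetes",
    "db": "database",
    "kuberntes": "kubernetes",  # common typo
}


def normalize(query: str) -> str:
    """Lowercase the query and map aliases and known typos to canonical terms."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return " ".join(ALIASES.get(token, token) for token in tokens)


def decompose(query: str) -> list[str]:
    """Naive decomposition: split on the word 'and' or on '?', then normalize each part."""
    parts = re.split(r"\band\b|\?", query, flags=re.IGNORECASE)
    return [normalize(part) for part in parts if part.strip()]
```

&lt;p&gt;Each sub-query is then retrieved separately and the results are merged before reranking.&lt;/p&gt;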
&lt;ol start="3"&gt;
&lt;li&gt;HyDE (Hypothetical Document Embeddings)&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;Generate an &amp;ldquo;ideal answer draft&amp;rdquo; first.&lt;/li&gt;
&lt;li&gt;Retrieve using the draft embedding, not only the short user query.&lt;/li&gt;
&lt;li&gt;I treat HyDE as a recall-boost switch, enabled only in low-recall scenarios.&lt;/li&gt;
&lt;/ul&gt;
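&lt;p&gt;A rough sketch of HyDE used as a switch. The callables &lt;code&gt;embed&lt;/code&gt; and &lt;code&gt;draft_answer&lt;/code&gt; are assumptions standing in for an embedding model and an LLM call, and &lt;code&gt;min_score&lt;/code&gt; is a hypothetical threshold for deciding that direct recall is too weak. The brute-force &lt;code&gt;search&lt;/code&gt; replaces a real vector index.&lt;/p&gt;

```python
import math
from typing import Callable

Vector = list[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(vec: Vector, index: dict[str, Vector], k: int) -> list[tuple[str, float]]:
    """Brute-force top-k by cosine similarity (stand-in for a vector store)."""
    scored = [(doc_id, cosine(vec, doc_vec)) for doc_id, doc_vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def hyde_search(
    query: str,
    index: dict[str, Vector],
    embed: Callable[[str], Vector],
    draft_answer: Callable[[str], str],
    k: int = 3,
    min_score: float = 0.5,
) -> list[tuple[str, float]]:
    """Try the plain query first; fall back to HyDE only when the best hit is weak."""
    direct = search(embed(query), index, k)
    if direct and direct[0][1] >= min_score:
        return direct  # recall looks fine, so skip the extra LLM call
    draft = draft_answer(query)  # hypothetical "ideal answer draft"
    return search(embed(query + "\n" + draft), index, k)
```

&lt;p&gt;Embedding the query together with the draft keeps the user's own wording in the vector, which matches the &amp;ldquo;not only the short user query&amp;rdquo; point above.&lt;/p&gt;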
&lt;h3 id="my-assessment"&gt;&lt;a href="#my-assessment" class="header-anchor"&gt;&lt;/a&gt;My assessment
&lt;/h3&gt;&lt;p&gt;If pre-retrieval is weak, reranking/compression/caching are mostly damage control.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="2-retrieval-time-optimization-multi-path-recall--rerank-not-vector-only"&gt;&lt;a href="#2-retrieval-time-optimization-multi-path-recall--rerank-not-vector-only" class="header-anchor"&gt;&lt;/a&gt;2) Retrieval-Time Optimization: Multi-Path Recall + Rerank, Not Vector-Only
&lt;/h2&gt;&lt;h3 id="my-current-approach"&gt;&lt;a href="#my-current-approach" class="header-anchor"&gt;&lt;/a&gt;My current approach
&lt;/h3&gt;&lt;ol&gt;
&lt;li&gt;Hybrid search&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;Dense vectors for semantic recall.&lt;/li&gt;
&lt;li&gt;Sparse retrieval (BM25/keywords) to recover exact-match cases.&lt;/li&gt;
&lt;li&gt;Fuse results before reranking.&lt;/li&gt;
&lt;/ul&gt;
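&lt;p&gt;These notes do not name a fusion method, so the sketch below uses weighted Reciprocal Rank Fusion (RRF) as one common choice. The function name &lt;code&gt;rrf_fuse&lt;/code&gt; and the per-retriever weights are assumptions; &lt;code&gt;k=60&lt;/code&gt; is a widely used default for the smoothing constant.&lt;/p&gt;

```python
from collections import defaultdict
from typing import Optional


def rrf_fuse(
    rankings: dict[str, list[str]],
    weights: Optional[dict[str, float]] = None,
    k: int = 60,
) -> list[str]:
    """Fuse ranked ID lists: each list adds weight / (k + rank) to a document's score."""
    weights = weights or {}
    scores: dict[str, float] = defaultdict(float)
    for retriever, ranked_ids in rankings.items():
        weight = weights.get(retriever, 1.0)
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties broken by ID for a deterministic order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

&lt;p&gt;Because RRF works on ranks rather than raw scores, dense and BM25 results can be fused without normalizing their score scales.&lt;/p&gt;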
&lt;ol start="2"&gt;
&lt;li&gt;Two-stage ranking (Recall L1 -&amp;gt; Rank L2)&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;Stage 1 maximizes recall; over-fetching candidates is cheaper than missing the right one.&lt;/li&gt;
&lt;li&gt;Stage 2 reranks those candidates and keeps only the top-k, which is where precision comes from.&lt;/li&gt;
&lt;/ul&gt;
&lt;ol start="3"&gt;
&lt;li&gt;Cross-encoder / API rerank&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;Score query-doc pairs directly.&lt;/li&gt;
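&lt;li&gt;Usually more accurate than pure embedding similarity, especially on long chunks, because the model reads the query and the document together; the trade-off is extra latency per pair.&lt;/li&gt;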
&lt;/ul&gt;
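&lt;p&gt;The recall-then-rank flow can be sketched as below. Both scorers are injected callables and are assumptions: &lt;code&gt;cheap_score&lt;/code&gt; stands in for the first-stage retriever and &lt;code&gt;pair_score&lt;/code&gt; for a cross-encoder or a rerank API that scores one query-document pair at a time.&lt;/p&gt;

```python
from typing import Callable, Sequence

Scorer = Callable[[str, str], float]


def two_stage(
    query: str,
    docs: Sequence[str],
    cheap_score: Scorer,
    pair_score: Scorer,
    fetch_n: int = 50,
    top_k: int = 5,
) -> list[str]:
    """Stage 1 over-fetches with a cheap scorer; stage 2 reranks only those candidates."""
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:fetch_n]
    reranked = sorted(candidates, key=lambda d: pair_score(query, d), reverse=True)
    return reranked[:top_k]
```

&lt;p&gt;The expensive scorer runs on at most &lt;code&gt;fetch_n&lt;/code&gt; documents, so &lt;code&gt;fetch_n&lt;/code&gt; is the knob that trades recall against rerank latency.&lt;/p&gt;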
&lt;h3 id="my-assessment-1"&gt;&lt;a href="#my-assessment-1" class="header-anchor"&gt;&lt;/a&gt;My assessment
&lt;/h3&gt;&lt;p&gt;In production, the issue is often not &amp;ldquo;nothing found,&amp;rdquo; but &amp;ldquo;too many low-precision hits.&amp;rdquo; Rerank is not optional; it is a quality gate.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="3-post-retrieval-optimization-turn-context-into-high-density-evidence"&gt;&lt;a href="#3-post-retrieval-optimization-turn-context-into-high-density-evidence" class="header-anchor"&gt;&lt;/a&gt;3) Post-Retrieval Optimization: Turn Context into High-Density Evidence
&lt;/h2&gt;&lt;h3 id="three-things-i-optimize"&gt;&lt;a href="#three-things-i-optimize" class="header-anchor"&gt;&lt;/a&gt;Three things I optimize
&lt;/h3&gt;&lt;ol&gt;
&lt;li&gt;Evidence compression&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;Rerank first, then compress.&lt;/li&gt;
&lt;li&gt;Remove weakly relevant sentences, template noise, and duplicates.&lt;/li&gt;
&lt;li&gt;Keep entities, numbers, and conclusion-bearing sentences.&lt;/li&gt;
&lt;/ul&gt;
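&lt;p&gt;A small rule-based sketch of the compression step, assuming the passages are already reranked. The function &lt;code&gt;compress&lt;/code&gt; and its heuristics are illustrative: it drops duplicate sentences, keeps sentences that share terms with the query, and keeps any sentence containing a digit as a crude proxy for &amp;ldquo;numbers.&amp;rdquo;&lt;/p&gt;

```python
import re


def compress(query: str, passages: list[str], min_overlap: int = 1) -> list[str]:
    """Keep unique sentences that overlap with the query or contain a number."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    seen: set[str] = set()
    kept: list[str] = []
    for passage in passages:
        # Split on sentence punctuation followed by whitespace, so "3.5" stays intact.
        for raw in re.split(r"[.!?]\s+", passage.strip()):
            sentence = raw.strip().rstrip(".!?")
            key = " ".join(sentence.lower().split())
            if not key or key in seen:
                continue  # empty or duplicate sentence
            seen.add(key)
            terms = set(re.findall(r"\w+", key))
            has_number = any(ch.isdigit() for ch in key)
            if len(terms.intersection(query_terms)) >= min_overlap or has_number:
                kept.append(sentence)
    return kept
```

&lt;p&gt;Term overlap is a weak relevance signal; a sentence-level reranker score could replace it without changing the overall shape.&lt;/p&gt;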
&lt;ol start="2"&gt;
&lt;li&gt;Context packing strategy&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;Do not concatenate by raw retrieval order.&lt;/li&gt;
&lt;li&gt;Repack by &amp;ldquo;question sub-intent -&amp;gt; evidence groups.&amp;rdquo;&lt;/li&gt;
&lt;li&gt;Tag each evidence block with source IDs for traceability.&lt;/li&gt;
&lt;/ul&gt;
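&lt;p&gt;One way to express the packing rule, assuming each evidence item already carries a sub-intent label from the query decomposition step. The &lt;code&gt;Evidence&lt;/code&gt; type and the block layout are hypothetical.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    source_id: str   # used for traceability in the final answer
    sub_intent: str  # which part of the question this evidence supports
    text: str


def pack_context(evidence: list[Evidence]) -> str:
    """Group evidence by sub-intent (first-seen order) and tag every line with its source ID."""
    groups: dict[str, list[Evidence]] = {}
    for item in evidence:
        groups.setdefault(item.sub_intent, []).append(item)
    blocks = []
    for intent, items in groups.items():
        lines = [f"## {intent}"]
        lines += [f"[{item.source_id}] {item.text}" for item in items]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)
```

&lt;p&gt;With source IDs inline, the generation prompt can ask the model to cite them, and the evaluator can check each citation against the packed block.&lt;/p&gt;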
&lt;ol start="3"&gt;
&lt;li&gt;Cache-friendly prompt assembly&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;Place stable system prefixes and static background first; put per-query evidence and the question last.&lt;/li&gt;
&lt;li&gt;Maximize prefix reuse so the cache hit rate goes up, which lowers both cost and latency.&lt;/li&gt;
&lt;/ul&gt;
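&lt;p&gt;A minimal sketch of the ordering idea. The prompt strings are placeholders and the exact caching behavior depends on the model provider, so treat this as an illustration of keeping the prefix byte-identical across requests rather than a provider-specific recipe.&lt;/p&gt;

```python
# Placeholder texts; anything that changes per request must stay out of these two.
SYSTEM_PREFIX = "You are a support assistant. Answer only from the evidence."
STATIC_BACKGROUND = "Cite sources as [id]. Say so when the evidence is insufficient."
STABLE_PREFIX = f"{SYSTEM_PREFIX}\n\n{STATIC_BACKGROUND}"


def build_prompt(evidence: str, question: str) -> str:
    """Stable content first, per-query content last, so requests share one prefix."""
    variable = f"Evidence:\n{evidence}\n\nQuestion: {question}"
    return f"{STABLE_PREFIX}\n\n{variable}"
```

&lt;p&gt;Timestamps, user names, or request IDs placed near the top would change the prefix on every call and defeat prefix reuse.&lt;/p&gt;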
&lt;h3 id="my-assessment-2"&gt;&lt;a href="#my-assessment-2" class="header-anchor"&gt;&lt;/a&gt;My assessment
&lt;/h3&gt;&lt;p&gt;RAG cost is often dominated not by retrieval itself, but by sending low-value context to the LLM. Post-retrieval refinement is one of the most direct cost levers.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="4-production-loop-optimization-make-rag-a-system-not-a-demo"&gt;&lt;a href="#4-production-loop-optimization-make-rag-a-system-not-a-demo" class="header-anchor"&gt;&lt;/a&gt;4) Production Loop Optimization: Make RAG a System, Not a Demo
&lt;/h2&gt;&lt;h3 id="my-evaluation-perspective"&gt;&lt;a href="#my-evaluation-perspective" class="header-anchor"&gt;&lt;/a&gt;My evaluation perspective
&lt;/h3&gt;&lt;ol&gt;
&lt;li&gt;Retrieval-layer metrics&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;Recall@k&lt;/li&gt;
&lt;li&gt;MRR / nDCG&lt;/li&gt;
&lt;li&gt;Hit rate per query bucket (short / long / code queries)&lt;/li&gt;
&lt;/ul&gt;
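&lt;p&gt;For reference, the retrieval metrics above can be computed per query as follows, assuming binary relevance labels. MRR is the mean of &lt;code&gt;reciprocal_rank&lt;/code&gt; over all queries in a bucket. Function names are illustrative.&lt;/p&gt;

```python
import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]).intersection(relevant)) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """nDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

&lt;p&gt;Computing these per bucket (short, long, code) rather than as one global average makes regressions in a single query type visible.&lt;/p&gt;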
&lt;ol start="2"&gt;
&lt;li&gt;Generation-layer metrics&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;Faithfulness (is the answer grounded in evidence?)&lt;/li&gt;
&lt;li&gt;Answer relevance (does it answer the actual question?)&lt;/li&gt;
&lt;li&gt;Context precision (how much retrieved context is truly useful?)&lt;/li&gt;
&lt;/ul&gt;
&lt;ol start="3"&gt;
&lt;li&gt;System-layer metrics&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;P95 latency&lt;/li&gt;
&lt;li&gt;Per-query token cost&lt;/li&gt;
&lt;li&gt;Cache hit rate&lt;/li&gt;
&lt;li&gt;Fallback-routing ratio (share of queries that need backup retrieval or web search)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="my-feedback-loop"&gt;&lt;a href="#my-feedback-loop" class="header-anchor"&gt;&lt;/a&gt;My feedback loop
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;User query -&amp;gt; recall -&amp;gt; rerank -&amp;gt; generate answer&lt;/li&gt;
&lt;li&gt;An automatic evaluator scores both the answer and its evidence&lt;/li&gt;
&lt;li&gt;Low-score samples flow into a hard-case dataset&lt;/li&gt;
&lt;li&gt;Weekly regression over retrieval params, chunking policy, and reranker setup&lt;/li&gt;
&lt;/ul&gt;
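&lt;p&gt;The routing step of this loop might look like the sketch below. The &lt;code&gt;Sample&lt;/code&gt; fields and the 0.7 threshold are assumptions; the scores are whatever the automatic evaluator produces, normalized here to the range 0 to 1.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Sample:
    query: str
    answer: str
    faithfulness: float  # 0..1, from the automatic evaluator
    relevance: float     # 0..1, from the automatic evaluator


def collect_hard_cases(samples: list[Sample], threshold: float = 0.7) -> list[Sample]:
    """A sample is a hard case when its weakest score falls below the threshold."""
    return [s for s in samples if threshold > min(s.faithfulness, s.relevance)]
```

&lt;p&gt;Using the minimum of the two scores means a fluent but ungrounded answer still lands in the hard-case set.&lt;/p&gt;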
&lt;hr&gt;
&lt;h2 id="vendorframework-recommendations-i-use-as-baseline"&gt;&lt;a href="#vendorframework-recommendations-i-use-as-baseline" class="header-anchor"&gt;&lt;/a&gt;Vendor/Framework Recommendations I Use as Baseline
&lt;/h2&gt;&lt;p&gt;I prioritize official vendor/framework docs over second-hand summaries.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Microsoft Learn: &lt;a class="link" href="https://learn.microsoft.com/en-us/azure/developer/ai/advanced-retrieval-augmented-generation" target="_blank" rel="noopener"
 &gt;Build Advanced Retrieval-Augmented Generation Systems&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;End-to-end advanced RAG workflow&lt;/li&gt;
&lt;li&gt;Strong emphasis on query rewriting, post-retrieval processing, and evaluation loops&lt;/li&gt;
&lt;/ul&gt;
&lt;ol start="2"&gt;
&lt;li&gt;Azure Architecture Center: &lt;a class="link" href="https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/rag/rag-information-retrieval" target="_blank" rel="noopener"
 &gt;Develop a RAG Solution—Information-Retrieval Phase&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;Systematic retrieval-phase guidance&lt;/li&gt;
&lt;li&gt;Explicitly covers query augmentation/decomposition/rewriting/HyDE&lt;/li&gt;
&lt;/ul&gt;
&lt;ol start="3"&gt;
&lt;li&gt;Anthropic Engineering: &lt;a class="link" href="https://www.anthropic.com/engineering/contextual-retrieval" target="_blank" rel="noopener"
 &gt;Contextual Retrieval&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;Practical guidance on hybrid retrieval and context utilization&lt;/li&gt;
&lt;li&gt;Clearly addresses &amp;ldquo;retrieved is not equal to used correctly&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;ol start="4"&gt;
&lt;li&gt;Anthropic Help: &lt;a class="link" href="https://support.anthropic.com/en/articles/11473015-retrieval-augmented-generation-rag-for-projects" target="_blank" rel="noopener"
 &gt;Retrieval Augmented Generation (RAG) for Projects&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;Checklist-oriented practical recommendations for productization&lt;/li&gt;
&lt;/ul&gt;
&lt;ol start="5"&gt;
&lt;li&gt;Cohere Docs: &lt;a class="link" href="https://docs.cohere.com/docs/reranking-best-practices" target="_blank" rel="noopener"
 &gt;Best Practices for using Rerank&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;Practical rerank guidance for input organization and deployment&lt;/li&gt;
&lt;/ul&gt;
&lt;ol start="6"&gt;
&lt;li&gt;Paper: &lt;a class="link" href="https://arxiv.org/abs/2307.03172" target="_blank" rel="noopener"
 &gt;Lost in the Middle: How Language Models Use Long Contexts&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;Evidence for middle-context utilization degradation&lt;/li&gt;
&lt;li&gt;Supports the need for reranking, compression, and packing&lt;/li&gt;
&lt;/ul&gt;
&lt;ol start="7"&gt;
&lt;li&gt;Paper: &lt;a class="link" href="https://arxiv.org/abs/2005.11401" target="_blank" rel="noopener"
 &gt;Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;Foundational retrieval+generation paradigm&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="how-i-integrate-these-optimizations-into-real-ai-application-iteration"&gt;&lt;a href="#how-i-integrate-these-optimizations-into-real-ai-application-iteration" class="header-anchor"&gt;&lt;/a&gt;How I Integrate These Optimizations into Real AI Application Iteration
&lt;/h2&gt;&lt;p&gt;I run a weekly optimization loop:&lt;/p&gt;
&lt;h3 id="step-0-define-scenario-buckets-and-baseline"&gt;&lt;a href="#step-0-define-scenario-buckets-and-baseline" class="header-anchor"&gt;&lt;/a&gt;Step 0: Define scenario buckets and baseline
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Build 100–300 real QA samples (bucketed by scenario).&lt;/li&gt;
&lt;li&gt;Record baseline: retrieval hit quality, answer quality, latency, and cost.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="step-1-change-only-one-variable-per-iteration"&gt;&lt;a href="#step-1-change-only-one-variable-per-iteration" class="header-anchor"&gt;&lt;/a&gt;Step 1: Change only one variable per iteration
&lt;/h3&gt;&lt;p&gt;I modify one parameter at a time:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Chunking policy&lt;/li&gt;
&lt;li&gt;Query rewriting switch&lt;/li&gt;
&lt;li&gt;Hybrid fusion weights&lt;/li&gt;
&lt;li&gt;Reranker model/threshold&lt;/li&gt;
&lt;li&gt;Context compression ratio&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This avoids confounded results.&lt;/p&gt;
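&lt;p&gt;The one-variable rule is easy to enforce mechanically. A small sketch, assuming each experiment is described by a flat config dictionary; the key names in the usage example are hypothetical.&lt;/p&gt;

```python
def changed_keys(baseline: dict, candidate: dict) -> list[str]:
    """Names of settings whose values differ between the two configs."""
    keys = sorted(set(baseline) | set(candidate))
    return [key for key in keys if baseline.get(key) != candidate.get(key)]


def assert_single_change(baseline: dict, candidate: dict) -> str:
    """Return the single changed setting, or raise if zero or several changed."""
    diff = changed_keys(baseline, candidate)
    if len(diff) != 1:
        raise ValueError(f"expected exactly one changed setting, got {diff}")
    return diff[0]
```

&lt;p&gt;For example, comparing &lt;code&gt;{"chunking": "heading", "rerank_top_k": 5}&lt;/code&gt; with a copy where only &lt;code&gt;rerank_top_k&lt;/code&gt; is 8 returns &lt;code&gt;"rerank_top_k"&lt;/code&gt;; changing both keys raises an error before the run starts.&lt;/p&gt;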
&lt;h3 id="step-2-pass-offline-evaluation-first"&gt;&lt;a href="#step-2-pass-offline-evaluation-first" class="header-anchor"&gt;&lt;/a&gt;Step 2: Pass offline evaluation first
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;No offline pass, no online rollout.&lt;/li&gt;
&lt;li&gt;I check three dimensions: quality gain, latency impact, cost impact.&lt;/li&gt;
&lt;/ul&gt;
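&lt;p&gt;The offline gate can be written as a single predicate over the three dimensions. The &lt;code&gt;RunMetrics&lt;/code&gt; fields and the default tolerances (at least 0.01 quality gain, at most 10% more latency or cost) are assumptions chosen for illustration, not recommended values.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class RunMetrics:
    quality: float          # e.g. mean evaluator score on the offline set
    p95_latency_ms: float
    cost_per_query: float


def passes_offline_gate(
    baseline: RunMetrics,
    candidate: RunMetrics,
    min_quality_gain: float = 0.01,
    max_latency_ratio: float = 1.10,
    max_cost_ratio: float = 1.10,
) -> bool:
    """Require a quality gain while keeping latency and cost within the allowed ratios."""
    return (
        candidate.quality - baseline.quality >= min_quality_gain
        and baseline.p95_latency_ms * max_latency_ratio >= candidate.p95_latency_ms
        and baseline.cost_per_query * max_cost_ratio >= candidate.cost_per_query
    )
```

&lt;p&gt;Only candidates that pass this check move on to the canary stage.&lt;/p&gt;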
&lt;h3 id="step-3-online-canary-with-rollback-thresholds"&gt;&lt;a href="#step-3-online-canary-with-rollback-thresholds" class="header-anchor"&gt;&lt;/a&gt;Step 3: Online canary with rollback thresholds
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Roll out on small traffic.&lt;/li&gt;
&lt;li&gt;Set automatic rollback thresholds (P95, complaint rate, empty-answer rate).&lt;/li&gt;
&lt;/ul&gt;
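&lt;p&gt;A sketch of the rollback check, assuming canary metrics arrive as a dictionary per evaluation window. The threshold values are hypothetical placeholders to be replaced with numbers derived from the recorded baseline.&lt;/p&gt;

```python
# Hypothetical limits; derive real ones from the baseline recorded in Step 0.
THRESHOLDS = {
    "p95_latency_ms": 1500.0,
    "complaint_rate": 0.02,
    "empty_answer_rate": 0.05,
}


def breached(metrics: dict[str, float], thresholds: dict[str, float] = THRESHOLDS) -> list[str]:
    """Names of metrics that exceed their limit; missing metrics count as 0.0."""
    return [name for name, limit in thresholds.items() if metrics.get(name, 0.0) > limit]


def should_rollback(metrics: dict[str, float]) -> bool:
    """Roll back as soon as any single threshold is breached."""
    return bool(breached(metrics))
```

&lt;p&gt;Returning the breached metric names, not just a boolean, makes the rollback reason easy to log alongside the canary config.&lt;/p&gt;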
&lt;h3 id="step-4-convert-wins-into-engineering-assets"&gt;&lt;a href="#step-4-convert-wins-into-engineering-assets" class="header-anchor"&gt;&lt;/a&gt;Step 4: Convert wins into engineering assets
&lt;/h3&gt;&lt;p&gt;I persist proven improvements into:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Retrieval config templates&lt;/li&gt;
&lt;li&gt;Prompt/context assembly conventions&lt;/li&gt;
&lt;li&gt;RAG regression scripts&lt;/li&gt;
&lt;li&gt;Failure case datasets and labeling rules&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="my-conclusion"&gt;&lt;a href="#my-conclusion" class="header-anchor"&gt;&lt;/a&gt;My Conclusion
&lt;/h2&gt;&lt;p&gt;My final view on RAG optimization:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Pre-retrieval defines the ceiling (is the question represented correctly?)&lt;/li&gt;
&lt;li&gt;Retrieval-time defines hit quality (are we finding the right evidence?)&lt;/li&gt;
&lt;li&gt;Post-retrieval defines cost and usability (is high-density evidence delivered to the LLM?)&lt;/li&gt;
&lt;li&gt;Production loop defines sustainability (can quality keep improving?)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;One-line summary:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;RAG optimization is not &amp;#34;just tune model parameters&amp;#34;; it is engineering governance across retrieval, reranking, context construction, evaluation, and feedback.
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description></item></channel></rss>