Advanced and agentic RAG
Query rewriting, HyDE, multi-query and when the agent decides how to retrieve.
6 min read
Basic RAG works well when the user knows exactly what they want and writes it cleanly. In practice, that rarely happens. Questions are vague, incomplete, or poorly phrased — and the naive pipeline returns garbage with confidence. The techniques in this lesson exist to solve exactly that problem: making retrieval more robust before the LLM ever generates the final answer.
Query rewriting: fix the question before searching
The user types: "and the deadline?". Without context, that fragment finds nothing useful in the vector index. Query rewriting uses an LLM to transform the original question into something the search system can handle well.
The idea is simple: before hitting the vector store, you pass the query through a prompt that asks the model to rewrite it more explicitly, completely, and unambiguously. If there is conversation history, it is included in the context — the model infers that "the deadline" refers to the contract discussed two turns ago and produces "what is the delivery deadline stated in the supply contract mentioned?".
On AWS, this can be an InvokeModel call to Bedrock (Claude Haiku or Titan Text Lite are cheap enough for this step) before querying OpenSearch or Knowledge Bases. The extra cost is low; the precision gain is usually high.
One caveat: rewriting can alter intent if the prompt is not careful. Test with real examples from your domain. If the model is over-rewriting — adding assumptions the user never made — lower the temperature and be more directive in the rewriting prompt.
Multi-query and HyDE: attack the index from different angles
Multi-query is the idea that a question can be legitimately rephrased in several ways, and each phrasing may retrieve different chunks. You ask the LLM to generate N variations of the original query (typically 3 to 5), run each one against the vector store in parallel, merge the results, and deduplicate. The reranker from lesson 05 comes in here to rank the final set before sending it to the LLM.
The gain is real: variations capture synonyms, perspectives, and levels of abstraction that the original query missed. The cost is proportional to the number of searches — plan for it.
HyDE (Hypothetical Document Embeddings) is more elegant and a bit counterintuitive. Instead of searching with the question, you ask the LLM to invent a plausible hypothetical answer — with no access to the index, just the model's parametric knowledge. Then you embed that hypothetical answer and use its vector to search the index.
The intuition: real documents look more like other documents than like questions. The embedding of a hypothetical answer sits closer to the relevant chunks in vector space than the embedding of the original question. This works especially well when the vocabulary of the question differs greatly from the vocabulary of the documents — for example, colloquial questions about technical documents.
In practice, query rewriting is almost always worth it — the cost is minimal and the gain in multi-turn conversations is immediate. I use multi-query when the domain has rich, inconsistent vocabulary (e.g., legal or medical documents with many synonyms). I reserve HyDE for cases where the user's query is very short or colloquial and the documents are dense and technical — it is the technique with the highest potential gain, but also the most sensitive to the quality of the hypothetical LLM. If the model hallucinates in the hypothetical answer, the generated vector will retrieve garbage with surgical precision.
Agentic RAG loop: decide → retrieve → evaluate → respond
The agent controls the loop: decides whether to retrieve, which tool to use, evaluates whether the retrieved chunks are sufficient, and repeats if needed — before generating the final answer.
- Agente LLM · Raciocínio + Planejamento
- Decidir · Buscar? Qual fonte?
- Avaliar chunks · Suficientes? Relevantes?
- Reescrita / Multi-query · Query transformation
- Vector Store · OpenSearch / KB
- Reranker · Ordenar chunks
- Resposta Final · com citações
Agentic RAG: when the system decides how to retrieve
In the previous techniques, the pipeline is still fixed: query in, chunks out, LLM generates. Agentic RAG breaks that linear flow. Here, an LLM agent receives the question and decides what to do: search now? In which source? With which strategy? Are the retrieved chunks sufficient, or does it need another round?
The diagram above shows the central loop: the agent reasons, decides to search, transforms the query, retrieves, evaluates chunk quality, and only then generates — or repeats the cycle if results are insufficient.
This enables behaviors that linear RAG cannot achieve:
- Compound questions: "compare the refund policy for plans A and B" — the agent makes two separate searches and synthesizes.
- Iterative refinement: if the first chunks do not cover the question, the agent rephrases and searches again.
- Source routing: the agent chooses between searching the vector store, calling an external API, or using direct parametric knowledge.
On AWS, this is implemented with Bedrock Agents — you register search tools as action groups, and the model (Claude, by default) decides when and how to call them. Lesson 09 goes into detail about Knowledge Bases integrated with agents. This lesson covers the reasoning behind the loop; the concrete implementation comes later.
The critical point: agents add latency and cost by design. Each iteration of the loop is an LLM call. For simple, direct questions, linear RAG with good query rewriting is faster, cheaper, and equally effective.
Advanced retrieval techniques: quick comparison
| Technique | When to use | Extra cost | Main risk | |
|---|---|---|---|---|
| Query rewriting | Multi-turn conversations, vague queries | Low (1 lightweight LLM call) | Intent alteration with bad prompt | — |
| Multi-query | Domain with rich, inconsistent vocabulary | Medium (N parallel searches) | Noise if variations are irrelevant | — |
| HyDE | Colloquial queries, dense technical docs | Medium (1 generation + 1 embedding) | Hallucination in the hypothetical answer | — |
| Agentic RAG | Compound questions, multiple sources, iterative refinement | High (multiple LLM calls) | Unpredictable latency and cost without loop limits | — |
Advanced retrieval techniques
Tap a concept, then its definition.
Key takeaways from this lesson
How to introduce advanced techniques incrementally
- 1
Start with linear RAG and measure
Before adding any advanced technique, establish an evaluation baseline (faithfulness, relevance — lesson 08). You need to know what you are improving.
- 2
Add query rewriting first
It is the lowest-risk, highest-immediate-return technique. Implement it as a preprocessing step before the vector store call. Measure the impact on your evaluation metrics.
- 3
Try multi-query if vocabulary is the problem
If failure analysis shows the system is not finding relevant chunks because the user uses different terms than the documents, multi-query is the natural next step.
- 4
Consider HyDE for very short or colloquial queries
Test HyDE on a subset of your evaluation dataset. Compare precision@k before and after. If there is no measurable gain, do not add the complexity.
- 5
Migrate to agentic RAG only when the linear pipeline is not enough
Compound questions, multiple sources, and iterative refinement are the clear signals. Implement with Bedrock Agents, define loop limits, and monitor cost per session from day one.
Frequently asked questions
Won't HyDE hallucinate and bring wrong chunks?
It can. The hypothetical answer does not need to be factually correct — it just needs to be in the same semantic space as the relevant documents. But if the model hallucinates wildly (inventing terms, concepts, or entities that do not exist in the documents), the generated vector will retrieve garbage. That is why HyDE works best in domains where the LLM has some parametric knowledge of the subject, even if incomplete.
Can I combine rewriting + multi-query + reranker?
Yes, and it is a common combination in production. The natural order is: rewriting → multi-query → parallel search → merge → reranker → LLM. The reranker is especially valuable here because the merged chunk set can be large and noisy.
Does agentic RAG with Bedrock Agents natively support Knowledge Bases?
Yes. You can associate a Knowledge Base directly with a Bedrock Agent as a knowledge source. The agent automatically decides when to query the KB based on the model's reasoning. Lesson 09 covers this in detail.
How do I prevent infinite loops in agentic RAG?
Explicitly define a max_iterations in your agent. In Bedrock Agents, this is configurable. Also monitor the average number of iterations per session — if it is growing, the agent is struggling to satisfy questions and you need to review the available tools or the system prompt.