Who is Fernando F. Azevedo?

Fernando F. Azevedo is a Senior Solutions Architect at Banco Itaú with 16+ years of experience across AWS, event-driven architecture, DevSecOps, Data Mesh, AI and financial systems.

Production RAG on AWS

Module 2 · Retrieval · Lesson 07/12

Advanced and agentic RAG

Query rewriting, HyDE, multi-query and when the agent decides how to retrieve.

Basic RAG works well when the user knows exactly what they want and writes it cleanly. In practice, that rarely happens. Questions are vague, incomplete, or poorly phrased — and the naive pipeline returns garbage with confidence. The techniques in this lesson exist to solve exactly that problem: making retrieval more robust before the LLM ever generates the final answer.

Query rewriting: fix the question before searching

The user types: "and the deadline?". Without context, that fragment finds nothing useful in the vector index. Query rewriting uses an LLM to transform the original question into something the search system can handle well.

The idea is simple: before hitting the vector store, you pass the query through a prompt that asks the model to rewrite it more explicitly, completely, and unambiguously. If there is conversation history, it is included in the context — the model infers that "the deadline" refers to the contract discussed two turns ago and produces "what is the delivery deadline stated in the supply contract mentioned?".

On AWS, this can be an InvokeModel call to Bedrock (Claude Haiku or Titan Text Lite are cheap enough for this step) before querying OpenSearch or Knowledge Bases. The extra cost is low; the precision gain is usually high.

One caveat: rewriting can alter intent if the prompt is not careful. Test with real examples from your domain. If the model is over-rewriting — adding assumptions the user never made — lower the temperature and be more directive in the rewriting prompt.
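As a minimal sketch of the rewriting step, the function below builds a directive prompt from the conversation history and hands it to an LLM callable. The names `REWRITE_PROMPT`, `rewrite_query`, and the `llm` parameter are illustrative assumptions, not part of any AWS SDK; in a real pipeline `llm` would wrap a low-temperature Bedrock call.

```python
from typing import Callable, Sequence

# Hypothetical prompt wording; tune it against real examples from your domain.
REWRITE_PROMPT = """Rewrite the user's last message as a single standalone search query.
Use the conversation history only to resolve references such as "it" or "the deadline".
Do not add facts or assumptions the user did not state.
Return only the rewritten query.

Conversation history:
{history}

Last message: {question}
Rewritten query:"""


def rewrite_query(question: str, history: Sequence[str], llm: Callable[[str], str]) -> str:
    """Return a standalone version of `question`, falling back to the original."""
    prompt = REWRITE_PROMPT.format(
        history="\n".join(history) or "(none)",
        question=question,
    )
    rewritten = llm(prompt).strip()
    # An empty reply would make the search worse than the raw question.
    return rewritten or question


if __name__ == "__main__":
    def fake_llm(prompt: str) -> str:
        # Stand-in for the real model call, so the sketch runs offline.
        return "What is the delivery deadline stated in the supply contract?"

    turns = ["User: Can you summarize the supply contract?"]
    print(rewrite_query("and the deadline?", turns, fake_llm))
```

Falling back to the original question keeps the pipeline working when the rewriting call returns nothing useful, which is one cheap guard against the step making retrieval worse.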

Multi-query and HyDE: attack the index from different angles

Multi-query is the idea that a question can be legitimately rephrased in several ways, and each phrasing may retrieve different chunks. You ask the LLM to generate N variations of the original query (typically 3 to 5), run each one against the vector store in parallel, merge the results, and deduplicate. The reranker from lesson 05 comes in here to rank the final set before sending it to the LLM.

The gain is real: variations capture synonyms, perspectives, and levels of abstraction that the original query missed. The cost is proportional to the number of searches — plan for it.
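The multi-query flow described above can be sketched as follows. This is an illustration under stated assumptions: `generate_variations` and `search` are hypothetical callables you would back with an LLM and your vector store, chunks are assumed to be dicts with `id` and `score` keys, and searching with the original question alongside the variations is a design choice of this sketch.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List, Sequence

Chunk = Dict[str, Any]  # assumed shape: {"id": str, "text": str, "score": float}


def multi_query_search(
    question: str,
    generate_variations: Callable[[str, int], Sequence[str]],
    search: Callable[[str], List[Chunk]],
    n_variations: int = 3,
) -> List[Chunk]:
    """Search with the question plus N variations, then merge and deduplicate."""
    queries = [question, *generate_variations(question, n_variations)]
    # Searches are I/O-bound, so threads are enough to run them in parallel.
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        result_lists = list(pool.map(search, queries))

    best_by_id: Dict[str, Chunk] = {}
    for results in result_lists:
        for chunk in results:
            chunk_id = str(chunk["id"])
            kept = best_by_id.get(chunk_id)
            # Deduplicate by id, keeping the highest-scoring copy of each chunk.
            if kept is None or chunk["score"] > kept["score"]:
                best_by_id[chunk_id] = chunk
    # Provisional order only; a reranker should produce the final ranking.
    return sorted(best_by_id.values(), key=lambda c: c["score"], reverse=True)
```

Scores from different queries are not strictly comparable, which is why the sorted output is treated as provisional and the final ordering is left to the reranker.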

HyDE (Hypothetical Document Embeddings) is more elegant and a bit counterintuitive. Instead of searching with the question, you ask the LLM to invent a plausible hypothetical answer — with no access to the index, just the model's parametric knowledge. Then you embed that hypothetical answer and use its vector to search the index.

The intuition: real documents look more like other documents than like questions. The embedding of a hypothetical answer sits closer to the relevant chunks in vector space than the embedding of the original question. This works especially well when the vocabulary of the question differs greatly from the vocabulary of the documents — for example, colloquial questions about technical documents.
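To make the vocabulary-gap intuition concrete, here is a toy sketch of HyDE. It assumes a bag-of-words `toy_embed` in place of a real embedding model and a hypothetical `generate_hypothetical` callable in place of the LLM, so it only illustrates the flow: generate an answer, embed it, and search with that vector.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Toy bag-of-words embedding; a real system would call an embedding model."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(v * v for v in counts.values())) or 1.0
    return {word: v / norm for word, v in counts.items()}


def cosine(a: Vector, b: Vector) -> float:
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


def hyde_search(
    question: str,
    generate_hypothetical: Callable[[str], str],
    documents: Sequence[str],
    embed: Callable[[str], Vector] = toy_embed,
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank documents by similarity to a hypothetical answer, not to the question."""
    hypothetical = generate_hypothetical(question)
    query_vector = embed(hypothetical)  # embed the invented answer
    scored = [(doc, cosine(query_vector, embed(doc))) for doc in documents]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


if __name__ == "__main__":
    docs = [
        "refunds are processed within 30 days of the cancellation request",
        "the api gateway throttles requests per second",
    ]
    # The invented answer gets the number wrong but uses the documents' vocabulary.
    fake_llm = lambda q: "refunds are processed within 14 days of the request"
    print(hyde_search("can i get my money back?", fake_llm, docs, k=1))
```

In the usage example the colloquial question shares no words with the refund document, while the hypothetical answer shares most of them even though its number is wrong. That mirrors the point that the answer only needs to land in the right region of the vector space.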

In practice: when each technique is worth it

In practice, query rewriting is almost always worth it: the cost is minimal and the gain in multi-turn conversations is immediate. I use multi-query when the domain has rich, inconsistent vocabulary (e.g., legal or medical documents with many synonyms). I reserve HyDE for cases where the user's query is very short or colloquial and the documents are dense and technical. It is the technique with the highest potential gain, but also the most sensitive to the quality of the LLM's hypothetical answer. If the model hallucinates in that answer, the resulting vector will retrieve garbage with surgical precision.

Agentic RAG loop: decide → retrieve → evaluate → respond

The agent controls the loop: it decides whether to retrieve and which tool to use, evaluates whether the retrieved chunks are sufficient, and repeats if needed before generating the final answer.

[Diagram: agentic RAG loop]
Agent (reasoning and planning): decide whether to search and in which source; evaluate whether the chunks are sufficient and relevant.
Retrieval tools: query transformation (rewriting / multi-query); vector store (OpenSearch / Knowledge Bases); reranker (order chunks).
Output: final answer with citations.

Agentic RAG: when the system decides how to retrieve

In the previous techniques, the pipeline is still fixed: query in, chunks out, LLM generates. Agentic RAG breaks that linear flow. Here, an LLM agent receives the question and decides what to do: search now? In which source? With which strategy? Are the retrieved chunks sufficient, or does it need another round?

The diagram above shows the central loop: the agent reasons, decides to search, transforms the query, retrieves, evaluates chunk quality, and only then generates — or repeats the cycle if results are insufficient.

This enables behaviors that linear RAG cannot achieve:

Compound questions: "compare the refund policy for plans A and B" — the agent makes two separate searches and synthesizes.
Iterative refinement: if the first chunks do not cover the question, the agent rephrases and searches again.
Source routing: the agent chooses between searching the vector store, calling an external API, or using direct parametric knowledge.

On AWS, this can be implemented with Bedrock Agents: you register search tools as action groups, and the foundation model you select for the agent (Claude, for example) decides when and how to call them. Lesson 09 goes into detail about Knowledge Bases integrated with agents. This lesson covers the reasoning behind the loop; the concrete implementation comes later.

The critical point: agents add latency and cost by design. Each iteration of the loop is an LLM call. For simple, direct questions, linear RAG with good query rewriting is faster, cheaper, and equally effective.
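The loop and its cost implications can be sketched in a few lines. This is a simplified illustration, not how Bedrock Agents works internally: it assumes a single retrieval tool, and `retrieve`, `is_sufficient`, `rephrase`, and `generate` are hypothetical callables, several of which would be LLM calls in practice.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AgentResult:
    answer: str
    iterations: int
    chunks: List[str] = field(default_factory=list)


def agentic_rag(
    question: str,
    retrieve: Callable[[str], List[str]],
    is_sufficient: Callable[[str, List[str]], bool],
    rephrase: Callable[[str, List[str]], str],
    generate: Callable[[str, List[str]], str],
    max_iterations: int = 3,
) -> AgentResult:
    """Retrieve, evaluate, and retry with a rephrased query, up to a hard limit."""
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    query = question
    chunks: List[str] = []
    for iterations in range(1, max_iterations + 1):
        # Keep earlier chunks and add only the new ones from this round.
        for chunk in retrieve(query):
            if chunk not in chunks:
                chunks.append(chunk)
        # Stop when the evidence is enough or the budget is spent.
        if is_sufficient(question, chunks) or iterations == max_iterations:
            break
        query = rephrase(question, chunks)
    # Answer with whatever was gathered, even if the limit was hit.
    return AgentResult(generate(question, chunks), iterations, chunks)
```

Returning the iteration count alongside the answer makes it straightforward to log how many rounds each question consumed, which is the number that drives latency and cost.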

Advanced retrieval techniques: quick comparison

Technique	When to use	Extra cost	Main risk
Query rewriting	Multi-turn conversations, vague queries	Low (1 lightweight LLM call)	Intent alteration with bad prompt
Multi-query	Domain with rich, inconsistent vocabulary	Medium (N parallel searches)	Noise if variations are irrelevant
HyDE	Colloquial queries, dense technical docs	Medium (1 generation + 1 embedding)	Hallucination in the hypothetical answer
Agentic RAG	Compound questions, multiple sources, iterative refinement	High (multiple LLM calls)	Unpredictable latency and cost without loop limits

Key takeaways from this lesson

Query rewriting is the highest ROI improvement: one lightweight LLM call before search resolves most vague or incomplete query problems.

Multi-query generates variations of the question and merges results — useful when user vocabulary and document vocabulary differ.

HyDE inverts the logic: generates a hypothetical answer and searches with it, because documents look more like documents than questions in vector space.

Agentic RAG gives the LLM control over the retrieval loop — when to search, in which source, how many times. It enables compound questions and iterative refinement.

Complexity has real cost: each technique adds latency and/or cost. Use the simplest one that solves your problem.

In agentic RAG, always define a maximum loop iteration limit — without it, an ambiguous question can turn into an unexpected API bill.

How to introduce advanced techniques incrementally

1. Start with linear RAG and measure
Before adding any advanced technique, establish an evaluation baseline (faithfulness and relevance, covered in lesson 08). You need to know what you are improving.

2. Add query rewriting first
It is the lowest-risk, highest-immediate-return technique. Implement it as a preprocessing step before the vector store call. Measure the impact on your evaluation metrics.

3. Try multi-query if vocabulary is the problem
If failure analysis shows the system is not finding relevant chunks because the user uses different terms than the documents, multi-query is the natural next step.

4. Consider HyDE for very short or colloquial queries
Test HyDE on a subset of your evaluation dataset. Compare precision@k before and after. If there is no measurable gain, do not add the complexity.

5. Migrate to agentic RAG only when the linear pipeline is not enough
Compound questions, multiple sources, and iterative refinement are the clear signals. Implement with Bedrock Agents, define loop limits, and monitor cost per session from day one.
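The precision@k comparison mentioned in step 4 can be computed with a small helper. This is a sketch that assumes each evaluation example pairs the retrieved chunk ids with a hand-labeled set of relevant ids; the function names and data layout are illustrative.

```python
from typing import List, Sequence, Set, Tuple


def precision_at_k(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for chunk_id in retrieved_ids[:k] if chunk_id in relevant_ids)
    # Dividing by k (not by the number retrieved) penalizes short result lists.
    return hits / k


def mean_precision_at_k(runs: List[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Average precision@k over (retrieved_ids, relevant_ids) pairs."""
    scores = [precision_at_k(retrieved, relevant, k) for retrieved, relevant in runs]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Made-up ids, only to show the before/after comparison.
    baseline = [(["a", "x", "y"], {"a", "b"}), (["x", "y", "z"], {"c"})]
    with_hyde = [(["a", "b", "y"], {"a", "b"}), (["c", "y", "z"], {"c"})]
    print(mean_precision_at_k(baseline, k=3), mean_precision_at_k(with_hyde, k=3))
```

Running the same labeled subset through both pipeline variants and comparing the two means gives the measurable gain, or lack of one, that should decide whether the technique stays.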

Frequently asked questions

Won't HyDE hallucinate and bring wrong chunks?

It can. The hypothetical answer does not need to be factually correct — it just needs to be in the same semantic space as the relevant documents. But if the model hallucinates wildly (inventing terms, concepts, or entities that do not exist in the documents), the generated vector will retrieve garbage. That is why HyDE works best in domains where the LLM has some parametric knowledge of the subject, even if incomplete.

Can I combine rewriting + multi-query + reranker?

Yes, and it is a common combination in production. The natural order is: rewriting → multi-query → parallel search → merge → reranker → LLM. The reranker is especially valuable here because the merged chunk set can be large and noisy.
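The ordering above can be expressed as one composed function. This is a sketch with every stage injected as a hypothetical callable (`rewrite`, `expand`, `search`, `rerank`, `generate`), so it shows only how the stages connect, not how any of them is implemented.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def advanced_rag_pipeline(
    question: str,
    history: Sequence[str],
    rewrite: Callable[[str, Sequence[str]], str],
    expand: Callable[[str], Sequence[str]],
    search: Callable[[str], List[str]],
    rerank: Callable[[str, List[str]], List[str]],
    generate: Callable[[str, List[str]], str],
    top_k: int = 5,
) -> str:
    """Rewriting, multi-query, parallel search, merge, reranker, then the LLM."""
    standalone = rewrite(question, history)        # 1. rewriting
    queries = [standalone, *expand(standalone)]    # 2. multi-query
    with ThreadPoolExecutor() as pool:             # 3. parallel search
        result_lists = list(pool.map(search, queries))
    merged: List[str] = []
    for results in result_lists:                   # 4. merge and deduplicate
        for chunk in results:
            if chunk not in merged:
                merged.append(chunk)
    ranked = rerank(standalone, merged)[:top_k]    # 5. reranker trims the noisy set
    return generate(standalone, ranked)            # 6. final answer from the LLM
```

Applying `top_k` after the reranker rather than per search is what keeps the merged set from overwhelming the final prompt.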

Does agentic RAG with Bedrock Agents natively support Knowledge Bases?

Yes. You can associate a Knowledge Base directly with a Bedrock Agent as a knowledge source. The agent automatically decides when to query the KB based on the model's reasoning. Lesson 09 covers this in detail.

How do I prevent infinite loops in agentic RAG?

Explicitly define a maximum iteration limit (for example, a max_iterations setting) for your agent. In Bedrock Agents, this is configurable. Also monitor the average number of iterations per session: if it is growing, the agent is struggling to satisfy questions and you need to review the available tools or the system prompt.

References

Amazon Bedrock Agents: Developer Guide
Amazon Bedrock Knowledge Bases: Overview
HyDE: Precise Zero-Shot Dense Retrieval without Relevance Labels (Gao et al., 2022)
LangChain: Multi-Query Retriever
AWS Blog: Build a RAG-based generative AI application with Amazon Bedrock Agents
