Who is Fernando F. Azevedo?

Fernando F. Azevedo is a Senior Solutions Architect at Banco Itaú with 16+ years of experience across AWS, event-driven architecture, DevSecOps, Data Mesh, AI and financial systems.

What technical topics does Fernando work with?

Fernando works with AWS, Kubernetes, Kafka, Data Mesh, Amazon Bedrock, RAG, DevSecOps, observability, financial systems and architecture communication using C4, ADRs and trade-off analysis.

Is Fernando available for professional conversations?

Fernando is currently building at Banco Itaú and is open to thoughtful conversations about architecture, cloud, AI, engineering leadership, community, podcasts and technical collaboration.

The AI Architect Track

Module 2 · From model to application · Lesson 06/22

RAG: giving the model memory and facts

Retrieval-Augmented Generation end to end: when to use it and how to architect it.

6 min read

An LLM knows a lot about the world — up to its training cutoff. It knows nothing about your internal documents, yesterday's updated policies, or the support ticket opened right now. RAG (Retrieval-Augmented Generation) solves exactly that: instead of training the model on your data, you retrieve the relevant passages at query time and inject them into the context before generating the answer.

The problem RAG solves

LLMs have two fundamental limits you need to understand before designing any system.

Knowledge cutoff: the model was trained on a fixed corpus. Everything that happened after — or that was never public — is invisible to it. If you ask about version 3.2 of your internal product, it will invent a plausible answer. That is hallucination from missing facts.

Context window limit: even if you wanted to paste all your documents into the prompt, the context window has a finite size. You cannot throw 50,000 pages of documentation into a single call — and even if you could, the cost and latency would be prohibitive.
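A rough back-of-the-envelope check makes this limit concrete. The figures below (500 tokens per page, a 200,000-token context window, 4,000 tokens reserved for the answer) are illustrative assumptions, not properties of any specific model.

```python
# Rough estimate: does a document set fit in a single context window?
# Every number here is an illustrative assumption.
TOKENS_PER_PAGE = 500      # assumed average for dense prose
CONTEXT_WINDOW = 200_000   # assumed window size, in tokens


def fits_in_context(pages: int, reserved_for_answer: int = 4_000) -> bool:
    """Return True if `pages` of text fit alongside room for the answer."""
    needed = pages * TOKENS_PER_PAGE + reserved_for_answer
    return needed <= CONTEXT_WINDOW


total_tokens = 50_000 * TOKENS_PER_PAGE
print(f"50,000 pages is about {total_tokens:,} tokens")
print(f"that is {total_tokens / CONTEXT_WINDOW:.0f} full windows")
print(fits_in_context(50_000))  # False
print(fits_in_context(100))     # True
```

Under these assumptions the full corpus is two orders of magnitude larger than one call can hold, which is why selecting passages comes before generating.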

RAG attacks both problems at once. Instead of giving everything to the model, you find what is relevant to that specific question and deliver only that. The model receives surgical context, with real facts, and generates a grounded answer — not a confabulation.

In lesson 04 you saw how embeddings turn text into vectors and how semantic search finds passages similar in meaning, not just in words. RAG is the direct application of that: the embedding is the retrieval engine.

RAG Pipeline: ingestion and query

Two independent flows: ingestion (offline, prepares the data) and query (online, answers the user). The vector store is the meeting point between them.

📥 Ingestion (offline)

Documents · PDFs, wikis, DBs
Chunking · split into passages
Embedding Model · text → vector
Vector Store · vectors + metadata

🔍 Query (online)

User · question
Embedding Model · question → vector
Retriever · top-k chunks
Context Builder · assembles the prompt
LLM · generates the answer
Answer · with citations

The pipeline in detail: ingestion and query

RAG has two distinct flows. Confusing them is one of the first traps.

Ingestion (offline): you take your raw documents, split them into smaller pieces (chunks), generate an embedding for each chunk, and store the vectors in a vector store along with metadata (source, page, date). This process runs once — or incrementally when documents change. The result is a semantic index of your data.
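The ingestion flow can be sketched in a few lines. This is a minimal illustration, not a production design: `toy_embed` is a hypothetical stand-in that counts words instead of calling a real embedding model, the vector store is a plain Python list, and the 50-word chunk size is an arbitrary assumption.

```python
import re
from collections import Counter
from dataclasses import dataclass


def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def chunk_words(text: str, size: int = 50) -> list[str]:
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


@dataclass
class Record:
    vector: Counter   # a real model would return a dense float vector
    text: str
    metadata: dict


def ingest(doc_text: str, source: str, store: list[Record]) -> None:
    """Chunk a document, embed each chunk, and append it to the store."""
    for n, chunk in enumerate(chunk_words(doc_text)):
        store.append(Record(toy_embed(chunk), chunk, {"source": source, "chunk": n}))


store: list[Record] = []
ingest("word " * 120, source="handbook.txt", store=store)
print(len(store))          # 3 chunks: 50 + 50 + 20 words
print(store[2].metadata)   # {'source': 'handbook.txt', 'chunk': 2}
```

Keeping the metadata next to each vector is what later makes citations and filtering possible.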

Query (online): when the user asks a question, you generate the embedding of the question using the same embedding model used during ingestion — this is critical. Then you search for the top-k closest chunks in the vector store (cosine similarity or inner product). With those passages in hand, you build the prompt: system instruction + retrieved context + question. The LLM receives all of this and generates an answer that can cite the sources.
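The query flow described above can be sketched the same way. The example assumes a tiny in-memory index and a hypothetical `toy_embed` function that counts words, so it matches on shared words rather than meaning. A real embedding model would replace it, but the steps (embed, rank by cosine similarity, assemble the prompt) keep the same shape.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Illustrative chunks; in practice these come from the ingestion flow.
CHUNKS = [
    {"text": "Refunds are processed within five business days.", "source": "policy.md"},
    {"text": "The cafeteria opens at eight in the morning.", "source": "facilities.md"},
    {"text": "Refunds above the limit require manager approval.", "source": "policy.md"},
]
INDEX = [(toy_embed(c["text"]), c) for c in CHUNKS]


def retrieve(question: str, k: int = 2) -> list[dict]:
    """Embed the question with the same function and return the top-k chunks."""
    q = toy_embed(question)
    ranked = sorted(INDEX, key=lambda item: cosine(q, item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]


def build_prompt(question: str, chunks: list[dict]) -> str:
    """System instruction + retrieved context + question."""
    context = "\n".join(
        f"[{i + 1}] ({c['source']}) {c['text']}" for i, c in enumerate(chunks)
    )
    return (
        "Answer using only the context below and cite sources by number.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


question = "How are refunds processed?"
top = retrieve(question)
print(build_prompt(question, top))
```

The string returned by `build_prompt` is what would be sent to the LLM. The numbered source labels are what allow the answer to cite where each fact came from.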

An important detail: the embedding model and the LLM are separate components. You can use Amazon Titan Embeddings to index and Claude Sonnet to generate. They have different roles — the embedding finds, the LLM reasons.
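One way to keep the two roles explicit is to configure them separately and record on the index which embedding model built it. The sketch below is an assumption about how such a guard could look. The model identifiers are placeholders, not real model IDs, and `VectorIndex` is a hypothetical type.

```python
from dataclasses import dataclass

# Placeholder identifiers for illustration only, not verified model IDs.
EMBEDDING_MODEL = "example-embedding-model-v1"   # finds passages
GENERATION_MODEL = "example-chat-model-v1"       # writes the answer


@dataclass
class VectorIndex:
    embedding_model: str  # recorded once, at ingestion time


class EmbeddingModelMismatch(ValueError):
    """Raised when a query would use a different model than the index."""


def check_query_model(index: VectorIndex, query_model: str) -> None:
    """Refuse to search an index with a different embedding model."""
    if index.embedding_model != query_model:
        raise EmbeddingModelMismatch(
            f"index built with {index.embedding_model!r}, query uses {query_model!r}"
        )


index = VectorIndex(embedding_model=EMBEDDING_MODEL)
check_query_model(index, EMBEDDING_MODEL)        # passes silently
try:
    check_query_model(index, GENERATION_MODEL)   # wrong component for retrieval
except EmbeddingModelMismatch as err:
    print(f"rejected: {err}")
```

Failing loudly here is cheaper than debugging a retriever that silently returns unrelated chunks.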

On AWS, Bedrock Knowledge Bases (which you will see in detail in Module 4) manages this entire pipeline: automatic ingestion, configurable chunking, managed vector store, and integrated retrieval. But understanding the pipeline manually is what lets you debug when something goes wrong.

In practice: RAG is not magic, it's data engineering

Senior Solutions Architect

In practice, RAG quality depends 80% on ingestion quality — not on the LLM. If chunks are too large, the retriever brings noise. If they are too small, they lose context. If metadata is wrong, you cannot filter. I have seen RAG systems that performed poorly not because of the model, but because the PDFs were scans without OCR. The LLM can only reason about what you handed it. Garbage in, garbage out — with a very well-written response.
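A cheap ingestion check can catch the scanned-PDF case before anything is indexed. The sketch below assumes the text has already been extracted page by page into strings. The function name and the 40-character threshold are hypothetical choices for illustration.

```python
def flag_empty_pages(pages: list[str], min_chars: int = 40) -> list[int]:
    """Return 1-based numbers of pages whose extracted text is nearly empty."""
    return [
        number
        for number, text in enumerate(pages, start=1)
        if len(text.strip()) < min_chars
    ]


pages = [
    "Section 1. Refunds are processed within five business days of approval.",
    "",           # scanned page: extraction returned nothing
    " \n 3 \n ",  # only a page number survived
]
print(flag_empty_pages(pages))  # [2, 3]
```

Pages flagged this way are candidates for OCR or manual review rather than for indexing as they are.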

RAG, fine-tuning, or just prompt? The right decision

This is the question every architect faces. The answer depends on what you want to solve.

Use RAG when: your data changes frequently, is private or proprietary, or you need traceable citations. RAG is dynamic: you update the index without retraining anything. It is also usually cheaper and faster to put into production than fine-tuning.

Use fine-tuning when: you want to change the model's behavior or style — not inject facts. Fine-tuning teaches the model to respond in a specific way, use a proprietary format, or master technical jargon. It is not good for injecting factual knowledge that changes.

Use just prompt when: the knowledge fits in the context, is stable, and you do not have a data volume that justifies indexing infrastructure. For many use cases, a well-crafted prompt with a few business rules already solves it.

In practice, RAG and fine-tuning are not mutually exclusive. You can fine-tune a model to have the right style and use RAG to supply current facts. But start with the simplest: prompt → RAG → fine-tuning. Each step has increasing cost and complexity.

A golden rule: if the problem is "the model doesn't know this fact", the answer is RAG. If the problem is "the model doesn't behave the way I want", the answer might be fine-tuning.

Put in order

Order the RAG query pipeline

From the user's question to the answer with sources. The steps below are listed out of order.

A. Assemble the context with the retrieved chunks
B. Embed the user's question
C. Generate the answer with the LLM, citing the sources
D. Search for the most similar chunks in the vector store

RAG vs Fine-tuning vs Prompt

Criterion	Prompt Only	RAG	Fine-tuning
Private/current data	❌ No	✅ Yes	⚠️ Fixed snapshot
Implementation cost	Low	Medium	High
Traceable citations	❌	✅	❌
Change behavior/style	⚠️ Partial	❌ No	✅ Yes
Data updates	Immediate	Re-index	Retrain

Common RAG pitfalls

Chunks too large: the retriever brings entire paragraphs with noise, diluting the relevant signal in the LLM's context.

Chunks too small: isolated sentences lose context — the LLM receives a passage that makes no sense without the surrounding paragraph.

Different embedding model at ingestion and query time: vectors end up in incompatible spaces and the search returns garbage.

Retrieving without filtering: fetching top-k without metadata filters (e.g., date, department) injects irrelevant or outdated context.

Not evaluating the retriever separately: most RAG bugs are in retrieval, not generation. Evaluate retriever recall and precision before blaming the LLM.

Frequently asked questions about RAG

How many chunks should I retrieve (top-k)?

It depends on chunk size and the model's context window. A reasonable starting point is top-3 to top-5 with 300-500 token chunks. More than that starts diluting context and increases cost. Evaluate with evals (lesson 09).
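The trade-off is easy to see with simple arithmetic. The sketch below assumes 500-token chunks and hypothetical fixed overheads for the system instruction and the question. Only the growth with k matters.

```python
def prompt_budget(
    top_k: int,
    chunk_tokens: int,
    instruction_tokens: int = 200,  # assumed size of the system instruction
    question_tokens: int = 50,      # assumed size of the user question
) -> int:
    """Estimate the input tokens of one RAG call."""
    return instruction_tokens + question_tokens + top_k * chunk_tokens


for k in (3, 5, 20):
    print(k, prompt_budget(k, chunk_tokens=500))
# 3 1750
# 5 2750
# 20 10250
```

Input size, and with it cost and latency, grows linearly with k, while the extra chunks tend to be the least relevant ones.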

Does RAG eliminate hallucination?

It significantly reduces hallucination for questions covered by the index, but does not eliminate it. The LLM can still ignore context or mix information. Guardrails and evals (lessons 09 and 10) are complementary.

Which vector store should I use?

On AWS, Bedrock Knowledge Bases manages this for you: its quick-create option provisions an Amazon OpenSearch Serverless collection, and it can also connect to other supported vector stores. For your own control, pgvector on RDS/Aurora is a good fit if you already use PostgreSQL. For larger scale, consider OpenSearch or a dedicated vector database service. The main criteria are search latency and operational cost.

Do I need RAG if my document fits in the context?

Not necessarily. If the document is small, stable, and you have few users, putting everything in the prompt can be simpler. RAG pays off when data volume is large, documents change, or the per-call token cost starts to matter.

RAG is the foundation of most enterprise AI systems

Essential

If you are going to build a single applied AI pattern, make it RAG. It solves the most common problem — the model doesn't know your data — without the cost and rigidity of fine-tuning. But well-done RAG is serious engineering: well-calibrated chunking, consistent embedding model, evaluated retriever, rich metadata for filtering. In the next lesson, you will see how the model can go beyond static context and call external tools in real time — which opens a new level of capability.
