RAG: giving the model memory and facts
Retrieval-Augmented Generation end to end: when to use it and how to architect it.
An LLM knows a lot about the world — up to its training cutoff. It knows nothing about your internal documents, yesterday's updated policies, or the support ticket opened right now. RAG (Retrieval-Augmented Generation) solves exactly that: instead of training the model on your data, you retrieve the relevant passages at query time and inject them into the context before generating the answer.
The problem RAG solves
LLMs have two fundamental limits you need to understand before designing any system.
Knowledge cutoff: the model was trained on a fixed corpus. Everything that happened after — or that was never public — is invisible to it. If you ask about version 3.2 of your internal product, it will invent a plausible answer. That is hallucination from missing facts.
Context window limit: even if you wanted to paste all your documents into the prompt, the context window has a finite size. You cannot throw 50,000 pages of documentation into a single call — and even if you could, the cost and latency would be prohibitive.
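To make the scale concrete, here is a back-of-the-envelope sketch. The figures are assumptions chosen for illustration, not measurements: 500 tokens per page and a 200,000-token context window. The helper names `corpus_tokens` and `windows_needed` are hypothetical.

```python
def corpus_tokens(pages, tokens_per_page):
    """Rough total token count for a document set."""
    return pages * tokens_per_page

def windows_needed(total_tokens, context_window):
    """How many full context windows the corpus would fill (ceiling division)."""
    return -(-total_tokens // context_window)

# Assumed figures, for illustration only.
total = corpus_tokens(pages=50_000, tokens_per_page=500)
print(total)                           # 25000000
print(windows_needed(total, 200_000))  # 125
```

Under these assumptions the corpus is over a hundred times larger than a single call can hold, before counting the question or the answer.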
RAG attacks both problems at once. Instead of giving everything to the model, you find what is relevant to that specific question and deliver only that. The model receives surgical context, with real facts, and generates a grounded answer — not a confabulation.
In lesson 04 you saw how embeddings turn text into vectors and how semantic search finds passages similar in meaning, not just in words. RAG is the direct application of that: the embedding is the retrieval engine.
RAG Pipeline: ingestion and query
Two independent flows: ingestion (offline, prepares the data) and query (online, answers the user). The vector store is the meeting point between them.
- Documents · PDFs, wikis, DBs
- Chunking · split into passages
- Embedding Model · text → vector
- Vector Store · vectors + metadata
- User · question
- Embedding Model · question → vector
- Retriever · top-k chunks
- Context Builder · assembles the prompt
- LLM · generates the answer
- Answer · with citations
The pipeline in detail: ingestion and query
RAG has two distinct flows. Confusing them is one of the first traps.
Ingestion (offline): you take your raw documents, split them into smaller pieces (chunks), generate an embedding for each chunk, and store the vectors in a vector store along with metadata (source, page, date). This process runs once — or incrementally when documents change. The result is a semantic index of your data.
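The ingestion flow can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a production implementation: `chunk_text`, `toy_embed`, and `ingest` are hypothetical names, the chunker counts words rather than tokens, the vector store is a plain list, and `toy_embed` is a hashed bag-of-words stand-in for a real embedding model.

```python
import hashlib
import math

def chunk_text(text, size=50, overlap=10):
    """Split text into chunks of `size` words; neighbours share `overlap` words."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reached the end of the text
    return chunks

def toy_embed(text, dim=64):
    """Stand-in for a real embedding model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for word in text.lower().split():
        idx = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def ingest(documents, store):
    """Chunk each document, embed each chunk, and store vector + text + metadata."""
    for doc in documents:
        for i, chunk in enumerate(chunk_text(doc["text"])):
            store.append({
                "vector": toy_embed(chunk),
                "text": chunk,
                "metadata": {"source": doc["source"], "chunk": i, "date": doc["date"]},
            })
    return store
```

The overlap is a design choice: it keeps a sentence that straddles a chunk boundary retrievable from at least one side.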
Query (online): when the user asks a question, you embed the question with the same embedding model used during ingestion. This is critical: vectors produced by different models live in different spaces, so similarity scores between them are meaningless. Then you search the vector store for the top-k closest chunks (by cosine similarity or inner product). With those passages in hand, you build the prompt: system instruction + retrieved context + question. The LLM receives all of this and generates an answer that can cite the sources.
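The retrieval and prompt-building steps might look like the sketch below. Everything here is illustrative: the function names, the three-dimensional vectors, the sample texts, and the prompt wording are made up for the example, and `query_vector` stands in for the output of a real embedding model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query_vector, store, k=3):
    """Return the k stored chunks most similar to the query vector."""
    ranked = sorted(store, key=lambda e: cosine(query_vector, e["vector"]), reverse=True)
    return ranked[:k]

def build_prompt(question, chunks):
    """Assemble instruction + numbered context + question."""
    context = "\n".join(
        f"[{i}] ({c['source']}) {c['text']}" for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the context below and cite sources as [n].\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# Made-up index entries, for illustration only.
store = [
    {"vector": [1.0, 0.0, 0.0], "text": "Refunds are processed within 5 days.", "source": "policy.pdf"},
    {"vector": [0.0, 1.0, 0.0], "text": "Version 3.2 adds SSO support.", "source": "release-notes.md"},
    {"vector": [0.7, 0.7, 0.0], "text": "Refund requests for 3.2 licences go through support.", "source": "faq.md"},
]
query_vector = [0.9, 0.1, 0.0]  # pretend embedding of the user's question
top = retrieve(query_vector, store, k=2)
prompt = build_prompt("How long do refunds take?", top)
print(prompt)
```

Numbering the chunks in the prompt is what makes citations traceable: each `[n]` in the answer maps back to a source in the retrieved list.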
An important detail: the embedding model and the LLM are separate components. You can use Amazon Titan Embeddings to index and Claude Sonnet to generate. They have different roles — the embedding finds, the LLM reasons.
On AWS, Bedrock Knowledge Bases (which you will see in detail in Module 4) manages this entire pipeline: automatic ingestion, configurable chunking, managed vector store, and integrated retrieval. But understanding the pipeline manually is what lets you debug when something goes wrong.
In practice, RAG quality depends mostly on ingestion quality (roughly 80%, as a rule of thumb), not on the LLM. If chunks are too large, the retriever brings in noise. If they are too small, they lose context. If metadata is wrong, you cannot filter. I have seen RAG systems that performed poorly not because of the model, but because the PDFs were scans without OCR. The LLM can only reason about what you handed it. Garbage in, garbage out, just with a very well-written response.
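A cheap pre-ingestion check can catch the scanned-PDF case before it pollutes the index. The sketch below is a hypothetical heuristic, not a standard technique: it assumes you already have the extracted text of each page as a string, and the thresholds (`min_chars=50`, `threshold=0.5`) are arbitrary starting values to tune.

```python
def extractable_ratio(page_texts, min_chars=50):
    """Fraction of pages with at least `min_chars` non-whitespace characters."""
    if not page_texts:
        return 0.0
    ok = sum(1 for t in page_texts if len("".join(t.split())) >= min_chars)
    return ok / len(page_texts)

def needs_ocr(page_texts, threshold=0.5):
    """Flag a document whose pages mostly yield no text (likely a scan)."""
    return extractable_ratio(page_texts) < threshold

scanned = ["", " ", "\n"]             # text extraction returned nothing useful
digital = ["x" * 200, "y" * 300, ""]  # two real pages and one blank page
print(needs_ocr(scanned))  # True
print(needs_ocr(digital))  # False
```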
RAG, fine-tuning, or just prompt? The right decision
This is the question every architect faces. The answer depends on what you want to solve.
Use RAG when: your data changes frequently, is private/proprietary, or you need traceable citations. RAG is dynamic — you update the index without retraining anything. It is also cheaper and faster to put into production.
Use fine-tuning when: you want to change the model's behavior or style — not inject facts. Fine-tuning teaches the model to respond in a specific way, use a proprietary format, or master technical jargon. It is not good for injecting factual knowledge that changes.
Use just prompt when: the knowledge fits in the context, is stable, and you do not have a data volume that justifies indexing infrastructure. For many use cases, a well-crafted prompt with a few business rules already solves it.
In practice, RAG and fine-tuning are not mutually exclusive. You can fine-tune a model to have the right style and use RAG to supply current facts. But start with the simplest: prompt → RAG → fine-tuning. Each step has increasing cost and complexity.
A golden rule: if the problem is "the model doesn't know this fact", the answer is RAG. If the problem is "the model doesn't behave the way I want", the answer might be fine-tuning.
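The decision logic of this section can be written down as a small function. This is one possible encoding, offered as a sketch: `choose_approach` and its flag names are hypothetical, and real decisions involve more factors (budget, team skills, latency) than five booleans.

```python
def choose_approach(*, knowledge_gap, fits_in_context, data_is_stable,
                    needs_citations, behavior_gap):
    """Return the approaches to try, in order of increasing cost and complexity."""
    steps = ["prompt"]  # always start with the simplest option
    prompt_is_enough = fits_in_context and data_is_stable and not needs_citations
    if knowledge_gap and not prompt_is_enough:
        steps.append("rag")  # the model is missing facts that a prompt cannot carry
    if behavior_gap:
        steps.append("fine-tuning")  # the model has the facts but the wrong style or format
    return steps

print(choose_approach(knowledge_gap=True, fits_in_context=False, data_is_stable=False,
                      needs_citations=True, behavior_gap=False))
# ['prompt', 'rag']
```

Returning a list rather than a single label reflects that the approaches are not mutually exclusive.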
Order the RAG query pipeline
From the user's question to the answer with sources.
- 1. Assemble the context with the retrieved chunks
- 2. Embed the user's question
- 3. Generate the answer with the LLM, citing the sources
- 4. Search the most similar chunks in the vector store
RAG vs Fine-tuning vs Prompt
| Criterion | Prompt Only | RAG | Fine-tuning |
|---|---|---|---|
| Private/current data | ❌ No | ✅ Yes | ⚠️ Fixed snapshot |
| Implementation cost | Low | Medium | High |
| Traceable citations | ❌ | ✅ | ❌ |
| Change behavior/style | ⚠️ Partial | ❌ No | ✅ Yes |
| Data updates | Immediate | Re-index | Retrain |
Frequently asked questions about RAG
How many chunks should I retrieve (top-k)?
It depends on chunk size and the model's context window. A reasonable starting point is top-3 to top-5 with chunks of 300 to 500 tokens. More than that starts to dilute the context and increases cost. Measure the effect with evals (lesson 09) before changing it.
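The trade-off is simple arithmetic, sketched below. The function names are hypothetical, and the 8,000-token window and 2,000 reserved tokens (for the system instruction, the question, and the answer) are assumed values for illustration.

```python
def context_budget(top_k, chunk_tokens, context_window, reserved_tokens):
    """Check whether top_k chunks fit in what is left of the context window."""
    retrieved = top_k * chunk_tokens
    available = context_window - reserved_tokens
    return {
        "retrieved_tokens": retrieved,
        "available_tokens": available,
        "fits": retrieved <= available,
    }

def max_top_k(chunk_tokens, context_window, reserved_tokens):
    """Largest top_k that still fits in the available space."""
    return max(0, (context_window - reserved_tokens) // chunk_tokens)

print(context_budget(top_k=5, chunk_tokens=500, context_window=8_000, reserved_tokens=2_000))
# {'retrieved_tokens': 2500, 'available_tokens': 6000, 'fits': True}
print(max_top_k(chunk_tokens=500, context_window=8_000, reserved_tokens=2_000))
# 12
```

Fitting in the window is only the upper bound; the quality argument for a small top-k still applies well below it.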
Does RAG eliminate hallucination?
It significantly reduces hallucination for questions covered by the index, but does not eliminate it. The LLM can still ignore context or mix information. Guardrails and evals (lessons 09 and 10) are complementary.
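One inexpensive guardrail is to verify that every citation in the answer points at a chunk that was actually retrieved. The sketch below assumes the answer cites sources as `[n]`, numbered from 1, which is a convention chosen for this example. It checks only that citations are valid references, not that the cited chunk supports the claim.

```python
import re

def cited_indices(answer):
    """Extract the set of citation numbers written as [n] in the answer."""
    return {int(n) for n in re.findall(r"\[(\d+)\]", answer)}

def check_citations(answer, num_chunks):
    """Report citations that do not map to any retrieved chunk."""
    cited = cited_indices(answer)
    invalid = sorted(n for n in cited if not 1 <= n <= num_chunks)
    return {"cited": sorted(cited), "invalid": invalid, "uncited_answer": not cited}

print(check_citations("Refunds take 5 days [1][4].", num_chunks=3))
# {'cited': [1, 4], 'invalid': [4], 'uncited_answer': False}
```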
Which vector store should I use?
On AWS, Bedrock Knowledge Bases manages this for you (OpenSearch Serverless by default). If you want more control, pgvector on RDS/Aurora is a strong choice when you already use PostgreSQL. For larger scale, consider OpenSearch or a dedicated vector database. The main criteria are search latency and operational cost.
Do I need RAG if my document fits in the context?
Not necessarily. If the document is small, stable, and you have few users, putting everything in the prompt can be simpler. RAG pays off when data volume is large, documents change, or the per-call token cost starts to matter.
RAG is the foundation of most enterprise AI systems
If you are going to build a single applied AI pattern, make it RAG. It solves the most common problem — the model doesn't know your data — without the cost and rigidity of fine-tuning. But well-done RAG is serious engineering: well-calibrated chunking, consistent embedding model, evaluated retriever, rich metadata for filtering. In the next lesson, you will see how the model can go beyond static context and call external tools in real time — which opens a new level of capability.
Quick check
1. Which problem does RAG best solve?