The end-to-end RAG pipeline
The two halves — ingestion and query — and how they fit together.
In the previous three lessons you learned the ingredients: why RAG exists, how embeddings work, and how to chunk without destroying context. Now let's assemble the full dish — the end-to-end pipeline, with offline ingestion and online query side by side, so you can see exactly where each earlier decision fits and where latency and cost show up on the bill.
RAG Pipeline: ingestion (offline) and query (online)
Two independent halves. Ingestion runs on demand or on a schedule. The query path runs in real time for every user question.
Ingestion (offline):
- Sources · S3, web, DB
- Loader · parse + clean
- Chunker · strategy + overlap
- Embedding Model · Titan / Cohere
- Vector Store · upsert + metadata
Query (online):
- User · question
- Embedding Model · same model
- Vector Store · top-k search
- Context Builder · rank + assemble prompt
- LLM · Bedrock / Claude
- Answer · + citations
The offline half: ingestion
Ingestion doesn't need to be fast — it needs to be correct and reproducible. The flow starts at the loader, which reads the source (PDF, HTML, database, S3) and delivers clean text. The chunker splits that text using the strategy you chose in Lesson 03 — size, overlap, semantic boundaries. Each chunk becomes input for the embedding model, which returns a high-dimensional vector. That vector, together with the chunk's metadata (source, date, section, document ID), is written to the vector store.
Two critical points here:
Model consistency. The embedding model used during ingestion must be exactly the same one used at query time. Swapping the model means reindexing everything — no exceptions. Store the model name and version as part of the index schema.
Metadata is not optional. Without metadata you can't filter, trace the origin of an answer, or do selective reindexing. Store at minimum: source_uri, chunk_index, doc_updated_at, section. You'll be grateful in Lesson 06 (filters and routing) and Lesson 11 (citations).
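One way to make both points enforceable is to write them down as types. The sketch below is an illustration only, not any particular vector store's API: `IndexSchema`, `ChunkRecord`, and `assert_same_model` are hypothetical names, and the model identifier and bucket are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexSchema:
    """Records which embedding model produced the vectors in an index."""
    embedding_model: str      # placeholder identifier, not a real model ID
    embedding_version: str
    dimensions: int


@dataclass(frozen=True)
class ChunkRecord:
    """One entry in the vector store: the vector plus the minimum metadata."""
    vector: tuple[float, ...]
    source_uri: str
    chunk_index: int
    doc_updated_at: str       # ISO 8601 timestamp of the source document
    section: str


def assert_same_model(schema: IndexSchema, model: str, version: str) -> None:
    """Refuse to query (or write to) an index built with a different model."""
    if (schema.embedding_model, schema.embedding_version) != (model, version):
        raise ValueError(
            f"index was built with {schema.embedding_model}@{schema.embedding_version}, "
            f"got {model}@{version}: reindex before switching models"
        )


schema = IndexSchema("example-embedder", "v1", dimensions=4)
record = ChunkRecord(
    vector=(0.1, 0.2, 0.3, 0.4),
    source_uri="s3://example-bucket/policies/refunds.pdf",
    chunk_index=0,
    doc_updated_at="2024-01-15T00:00:00Z",
    section="Refunds",
)
assert_same_model(schema, "example-embedder", "v1")  # same model: passes silently
```

Calling the guard at the start of both the ingestion job and the query service turns a silent quality regression into an explicit error.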
Cost in this stage comes from the embedding model (per token) and vector store storage. Latency doesn't matter here — ingestion is asynchronous. What matters is throughput and idempotency: if the job runs twice on the same document, the index must not duplicate it.
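To make the idempotency requirement concrete, here is a minimal sketch of the ingestion flow. Everything in it is a stand-in: `chunk_text` is a naive fixed-size chunker, `fake_embed` replaces the real embedding model call with a deterministic hash, and a plain dict plays the role of the vector store.

```python
import hashlib


def chunk_text(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Fixed-size character chunks with overlap (stand-in for the real chunker)."""
    if not text:
        return []
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def fake_embed(chunk: str, dims: int = 8) -> list[float]:
    """Deterministic placeholder for an embedding model call."""
    digest = hashlib.sha256(chunk.encode("utf-8")).digest()
    return [b / 255 for b in digest[:dims]]


def chunk_id(source_uri: str, chunk_index: int) -> str:
    """Deterministic ID: the same source and position always map to the same key."""
    return hashlib.sha256(f"{source_uri}#{chunk_index}".encode("utf-8")).hexdigest()


def ingest(store: dict, source_uri: str, text: str, doc_updated_at: str) -> int:
    """Chunk, embed, and upsert one document. Returns the number of chunks written."""
    chunks = chunk_text(text)
    for index, chunk in enumerate(chunks):
        store[chunk_id(source_uri, index)] = {
            "vector": fake_embed(chunk),
            "text": chunk,
            "source_uri": source_uri,
            "chunk_index": index,
            "doc_updated_at": doc_updated_at,
        }
    return len(chunks)


store: dict = {}
doc = "Refunds are accepted within 30 days of purchase. " * 3
ingest(store, "s3://example-bucket/refunds.txt", doc, "2024-01-15T00:00:00Z")
size_after_first_run = len(store)
ingest(store, "s3://example-bucket/refunds.txt", doc, "2024-01-15T00:00:00Z")
print(len(store) == size_after_first_run)  # True: the second run overwrites
```

Because the key depends only on the source and the chunk position, running the job twice rewrites the same entries instead of adding new ones.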
The online half: query and the top-k concept
The query path needs to be fast — the user is waiting. The flow starts with embedding the question: the same embedding model transforms the query into a vector. The vector store runs a nearest-neighbor search and returns the k most similar chunks — that's the top-k.
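As a rough illustration of what "nearest-neighbor search" means, the sketch below scores every stored vector against the query with cosine similarity and keeps the best k. It is an exhaustive scan over a toy in-memory index with made-up vectors; production vector stores typically use approximate nearest-neighbor indexes instead.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vector: list[float], index: dict, k: int) -> list[tuple]:
    """Return the k (chunk_id, score) pairs most similar to the query, best first."""
    scored = [(cid, cosine_similarity(query_vector, vec)) for cid, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy index: chunk IDs and vectors are invented for the example.
index = {
    "refunds#0": [0.9, 0.1, 0.0],
    "refunds#1": [0.8, 0.3, 0.1],
    "shipping#0": [0.1, 0.9, 0.2],
    "careers#0": [0.0, 0.1, 0.9],
}
query_vector = [1.0, 0.2, 0.0]  # pretend this came from the same embedding model
results = top_k(query_vector, index, k=2)
print([cid for cid, _ in results])  # ['refunds#0', 'refunds#1']
```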
Why does top-k matter so much? It's a direct trade-off between recall and noise:
- Small k (e.g., 3): clean context, lower LLM token cost, but you may miss relevant information that didn't make the cut.
- Large k (e.g., 20): higher chance of capturing the right chunk, but the prompt grows large, the LLM may get lost in the middle of the context, and cost rises.
In practice, k between 5 and 10 is a reasonable starting point. You'll calibrate this in Lesson 08 (evaluation), measuring recall and faithfulness against real test sets.
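To see how recall moves with k, you can score a ranked result list against a set of chunks known to be relevant. The sketch below uses invented chunk IDs and a hand-labeled relevance set. It is only the recall side of the calibration, not a full evaluation harness.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the first k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


# Hypothetical ranked results for one test question, best match first.
retrieved = ["c7", "c2", "c9", "c4", "c1", "c8", "c3", "c5"]
relevant = {"c2", "c1", "c5"}  # chunks a human marked as needed for the answer

for k in (3, 5, 8):
    print(k, round(recall_at_k(retrieved, relevant, k), 2))
# 3 0.33
# 5 0.67
# 8 1.0
```

In this toy case recall only reaches 1.0 at k = 8, which is exactly the trade-off described above: the missing chunk is recovered at the price of five extra chunks of noise in the prompt.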
After top-k, the context builder assembles the prompt: system instruction + retrieved chunks (with their references) + user question. The LLM generates the answer, ideally citing the sources of the chunks used.
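A context builder can be as small as a string template. The sketch below is one possible layout, assuming each retrieved chunk is a dict with `text` and `source_uri` keys. The prompt wording and the `[n]` citation convention are illustrative choices, not a required format.

```python
def build_prompt(system_instruction: str, chunks: list[dict], question: str) -> str:
    """Assemble system instruction + numbered, source-tagged chunks + the question."""
    context_lines = []
    for number, chunk in enumerate(chunks, start=1):
        context_lines.append(f"[{number}] (source: {chunk['source_uri']})\n{chunk['text']}")
    context = "\n\n".join(context_lines)
    return (
        f"{system_instruction}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the context above and cite sources as [n]."
    )


# Example chunks; the URIs and text are placeholders.
chunks = [
    {"source_uri": "s3://example-bucket/refunds.txt",
     "text": "Refunds are accepted within 30 days."},
    {"source_uri": "s3://example-bucket/shipping.txt",
     "text": "Standard shipping takes 5 business days."},
]
prompt = build_prompt(
    "You are a support assistant.",
    chunks,
    "How long do I have to request a refund?",
)
print(prompt)
```

Numbering the chunks gives the LLM something stable to cite, and lets you map each `[n]` in the answer back to a `source_uri`.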
Latency in this stage breaks down roughly as: query embedding (~50–150 ms), vector search (~20–100 ms), and LLM generation (dominant — hundreds of ms to seconds depending on model and output size). Cost comes from LLM input tokens (context + question) and output tokens, plus the embedding call.
In practice, the mistake I see most often in first RAG pipelines is treating ingestion and query as a single coupled process — running the question embedding inside the same job that indexes documents, or worse, reindexing everything on each query. Separate the two halves from the start: ingestion is an async job, query is a synchronous service. This separation isn't just architectural — it defines how you scale, monitor, and cost each part independently. A bug in ingestion must not take down the query path.
Reindexing: when data changes
Documents change. Policies get updated, prices shift, products are discontinued. If the vector store doesn't reflect the current state of the sources, the LLM will generate outdated answers with full confidence — and that's worse than not answering.
There are three patterns for handling reindexing:
Full reindex: delete and rebuild the entire index. Simple, reliable, but expensive and slow for large corpora. Suitable for small corpora or structural changes (e.g., swapping the embedding model).
Incremental upsert: detect new or modified documents (via hash, updated_at, or event-driven via S3 notifications / EventBridge) and reprocess only those. More complex, but scales well. Requires each chunk to have a deterministic ID based on the source and chunk index — so the upsert overwrites the old chunk without duplicating it.
Soft delete + versioning: keeps old versions marked as inactive, useful when you need audit trails or rollback. Increases storage cost but gives traceability.
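The incremental pattern can be sketched in a few lines. This is a simplified illustration: a dict stands in for the vector store, the embedding call is omitted, and `sync_document` and `doc_hashes` are hypothetical names. It skips unchanged documents by content hash, overwrites chunks by deterministic ID, and removes leftover chunks when a document shrinks.

```python
import hashlib


def content_hash(text: str) -> str:
    """Fingerprint used to detect whether a document changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def sync_document(store: dict, doc_hashes: dict, source_uri: str, chunks: list[str]) -> str:
    """Upsert a document's chunks only if its content changed since the last run."""
    new_hash = content_hash("\n".join(chunks))
    if doc_hashes.get(source_uri) == new_hash:
        return "skipped"

    for index, chunk in enumerate(chunks):
        # Deterministic ID: the new chunk overwrites the old one at the same position.
        store[f"{source_uri}#{index}"] = {"text": chunk, "source_uri": source_uri}

    # Remove leftovers when the new version has fewer chunks than the old one.
    index = len(chunks)
    while f"{source_uri}#{index}" in store:
        del store[f"{source_uri}#{index}"]
        index += 1

    doc_hashes[source_uri] = new_hash
    return "upserted"


store, doc_hashes = {}, {}
uri = "s3://example-bucket/pricing.txt"
v1 = ["plan A: $10", "plan B: $20", "plan C: $30"]
v2 = ["plan A: $12", "plan B: $20"]
print(sync_document(store, doc_hashes, uri, v1))  # upserted
print(sync_document(store, doc_hashes, uri, v1))  # skipped
print(sync_document(store, doc_hashes, uri, v2))  # upserted
print(len(store))  # 2
```

Without the cleanup loop, the discontinued "plan C" chunk would stay in the index and could still be retrieved.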
My default recommendation: start with a full reindex on a schedule (e.g., daily), and move to incremental upsert only when volume or change frequency makes a full reindex prohibitive. Don't optimize before measuring.
On AWS, S3 Event Notifications + Lambda or EventBridge Pipes are the natural pattern for triggering incremental reindexing when a document is updated in the source bucket.
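For the S3 Event Notifications + Lambda variant, the handler mostly unpacks the event and hands each object to the ingestion code. The sketch below assumes the standard S3 notification shape (`Records[].s3.bucket.name` and `Records[].s3.object.key`). `reindex_document` is a hypothetical hook for your own load, chunk, embed, and upsert logic, and the sample event is trimmed to the fields used here.

```python
from urllib.parse import unquote_plus


def reindex_document(bucket: str, key: str) -> str:
    """Hypothetical hook: load, chunk, embed, and upsert one object."""
    return f"s3://{bucket}/{key}"


def handler(event: dict, context: object = None) -> dict:
    """Lambda-style entry point for S3 event notifications."""
    reindexed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        reindexed.append(reindex_document(bucket, key))
    return {"reindexed": reindexed}


# Trimmed sample event for local testing; bucket and key are placeholders.
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "policies/refund+policy.pdf"},
            },
        }
    ]
}
print(handler(sample_event))
# {'reindexed': ['s3://example-bucket/policies/refund policy.pdf']}
```

Events delivered through EventBridge use a different envelope, so the field access would need adjusting for that route.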
Order the RAG query (online)
From question to answer with sources. Put these steps in the correct order:
- Generate the answer with citations
- Retrieve the top-k chunks from the vector store
- Assemble the context with the chunks
- Embed the question
Key takeaways from this lesson
Ingestion vs. Query: characteristics of each half
| Characteristic | Ingestion (offline) | Query (online) |
|---|---|---|
| When it runs | Schedule or event (S3, EventBridge) | On every user question |
| Main requirement | Throughput and idempotency | Low latency |
| Biggest cost | Embedding per token (document volume) | LLM input/output tokens |
| Failure tolerable? | Yes: async retry, no user impact | No: failure is immediately visible to the user |
| Scales with | Document volume and update frequency | Number of concurrent users |
Frequently asked questions about the pipeline
Can I use different embedding models for documents from different domains?
Yes, but each model needs its own separate index in the vector store. At query time, you need to know which index to use (domain routing — covered in Lesson 06). Never mix vectors from different models in the same index; the distances are not comparable.
What happens if I increase top-k but the LLM has a small context window?
You'll either truncate the context or get an error. The solution is to calculate the average chunk size in tokens and ensure k × chunk_tokens fits within the model's window with margin for the instruction and response. Reranking (Lesson 05) helps select the best k before assembling the prompt.
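The budget check is simple arithmetic. The sketch below assumes you already know the average chunk size in tokens; the window size and reserves in the example are illustrative numbers, not the limits of any specific model.

```python
def max_k(context_window: int, chunk_tokens: int, instruction_tokens: int,
          question_tokens: int, response_reserve: int, margin_tokens: int = 0) -> int:
    """Largest k such that k * chunk_tokens fits in what is left of the window."""
    available = (context_window - instruction_tokens - question_tokens
                 - response_reserve - margin_tokens)
    return max(available // chunk_tokens, 0)


# Illustrative numbers only.
k = max_k(
    context_window=8000,
    chunk_tokens=400,
    instruction_tokens=300,
    question_tokens=100,
    response_reserve=1000,
    margin_tokens=600,
)
print(k)  # 15
```

If the result is lower than the k you wanted, the options are smaller chunks, a larger window, or retrieving more candidates and letting a reranker pick the ones that fit.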
Do I need to reindex everything if only one document changes?
No, if you implemented upsert with deterministic per-chunk IDs. Reprocess only the chunks of the changed document and upsert them into the index, deleting any leftover chunks if the new version produces fewer chunks than the old one. The rest of the index remains intact. A full reindex is only mandatory when you swap the embedding model.
Closing Module 1
With this lesson you have the full RAG map: you know what happens at each stage, where the money goes, where latency appears, and what breaks when data changes. Module 1 was about fundamentals, and solid fundamentals are what separate a prototype that impresses in a demo from a system that works in production. Module 2 starts with retrieval quality: hybrid search, reranking, and metadata filters, because retrieving the right chunks is the one thing the LLM cannot fix for you.
Checkpoint — Module 1
1. Ingestion typically happens…
2. Increasing top-k tends to…