Who is Fernando F. Azevedo?

Fernando F. Azevedo is a Senior Solutions Architect at Banco Itaú with 16+ years of experience across AWS, event-driven architecture, DevSecOps, Data Mesh, AI and financial systems.

Production RAG on AWS

Module 1 · Fundamentals · Lesson 04/12

The end-to-end RAG pipeline

The two halves — ingestion and query — and how they fit together.

5 min read

In the previous three lessons you learned the ingredients: why RAG exists, how embeddings work, and how to chunk without destroying context. Now let's assemble the full dish — the end-to-end pipeline, with offline ingestion and online query side by side, so you can see exactly where each earlier decision fits and where latency and cost show up on the bill.

RAG Pipeline: ingestion (offline) and query (online)

Two independent halves. Ingestion runs on demand, on a schedule, or in response to an event. The query path runs in real time for every user question.

📥 Ingestion (offline)

Sources · S3, web, DB
Loader · parse + clean
Chunker · strategy + overlap
Embedding Model · Titan / Cohere
Vector Store · upsert + metadata

🔍 Query (online)

User · question
Embedding Model · same model
Vector Store · top-k search
Context Builder · rank + assemble prompt
LLM · Bedrock / Claude
Answer · + citations

The offline half: ingestion

Ingestion doesn't need to be fast — it needs to be correct and reproducible. The flow starts at the loader, which reads the source (PDF, HTML, database, S3) and delivers clean text. The chunker splits that text using the strategy you chose in Lesson 03 — size, overlap, semantic boundaries. Each chunk becomes input for the embedding model, which returns a high-dimensional vector. That vector, together with the chunk's metadata (source, date, section, document ID), is written to the vector store.

Two critical points here:

Model consistency. The embedding model used during ingestion must be exactly the same one used at query time. Swapping the model means reindexing everything — no exceptions. Store the model name and version as part of the index schema.

Metadata is not optional. Without metadata you can't filter, trace the origin of an answer, or do selective reindexing. Store at minimum: source_uri, chunk_index, doc_updated_at, section. You'll be grateful in Lesson 06 (filters and routing) and Lesson 11 (citations).
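The two points above can be sketched as plain data structures. This is a minimal illustration, not a vector store API: IndexSchema, ChunkRecord, assert_same_model, and the model name "example-embed" are hypothetical names chosen for the example; only the four metadata fields come from the lesson.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class IndexSchema:
    # Hypothetical record kept alongside the index: which model built it.
    embedding_model: str
    embedding_version: str
    dimensions: int


@dataclass(frozen=True)
class ChunkRecord:
    text: str
    source_uri: str
    chunk_index: int
    doc_updated_at: str  # ISO 8601 timestamp of the source document
    section: str


def assert_same_model(schema: IndexSchema, model: str, version: str) -> None:
    """Refuse to read or write with a model the index was not built with."""
    if (schema.embedding_model, schema.embedding_version) != (model, version):
        raise ValueError(
            f"index built with {schema.embedding_model}:{schema.embedding_version}, "
            f"got {model}:{version}; reindex before switching models"
        )


schema = IndexSchema("example-embed", "v2", 1024)  # placeholder model name
record = ChunkRecord(
    text="Refunds are processed within 10 business days.",
    source_uri="s3://example-bucket/policies/refunds.pdf",
    chunk_index=3,
    doc_updated_at="2024-05-01T00:00:00Z",
    section="Refunds",
)

assert_same_model(schema, "example-embed", "v2")  # same model: passes silently

# Everything except the text travels as metadata next to the vector.
metadata = {k: v for k, v in asdict(record).items() if k != "text"}
```

Calling the same guard on both the ingestion job and the query service is one simple way to catch a model swap before it silently degrades retrieval.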

Cost in this stage comes from the embedding model (per token) and vector store storage. Latency doesn't matter here — ingestion is asynchronous. What matters is throughput and idempotency: if the job runs twice on the same document, the index must not duplicate it.
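Idempotency can be illustrated with a deterministic chunk ID. In this sketch a plain dict stands in for the vector store, and chunk_id and ingest are hypothetical helpers; a real store would expose its own upsert call.

```python
import hashlib


def chunk_id(source_uri: str, chunk_index: int) -> str:
    """Deterministic ID: the same source and position always give the same key."""
    return hashlib.sha256(f"{source_uri}#{chunk_index}".encode("utf-8")).hexdigest()


def ingest(store: dict, source_uri: str, chunks: list[str]) -> None:
    """Upsert chunks into `store`, a dict standing in for the vector store."""
    for i, text in enumerate(chunks):
        store[chunk_id(source_uri, i)] = {
            "text": text,
            "source_uri": source_uri,
            "chunk_index": i,
        }


store: dict = {}
doc = ["First chunk.", "Second chunk."]
ingest(store, "s3://example-bucket/faq.md", doc)
ingest(store, "s3://example-bucket/faq.md", doc)  # the job runs a second time
print(len(store))  # 2
```

Because the key depends only on the source and the chunk position, the second run overwrites the same two entries instead of adding new ones.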

The online half: query and the top-k concept

The query path needs to be fast — the user is waiting. The flow starts with embedding the question: the same embedding model transforms the query into a vector. The vector store runs a nearest-neighbor search and returns the k most similar chunks — that's the top-k.
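To make top-k concrete, here is a brute-force sketch using cosine similarity over a tiny in-memory index. The vectors and chunk IDs are invented toy values, and real vector stores typically use approximate nearest-neighbor indexes rather than scoring every vector as this example does.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: list[float], index: dict, k: int) -> list[tuple]:
    """Score every chunk against the query and return the k best, best first."""
    scored = [(cid, cosine(query_vec, vec)) for cid, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors; real embeddings have hundreds or thousands of dimensions.
index = {
    "refunds#0": [0.9, 0.1, 0.0],
    "refunds#1": [0.8, 0.2, 0.1],
    "shipping#0": [0.1, 0.9, 0.2],
    "careers#0": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]

results = top_k(query, index, k=2)
print([cid for cid, _ in results])  # ['refunds#0', 'refunds#1']
```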

Why does top-k matter so much? It's a direct trade-off between recall and noise:

Small k (e.g., 3): clean context, lower LLM token cost, but you may miss relevant information that didn't make the cut.
Large k (e.g., 20): higher chance of capturing the right chunk, but the prompt grows large, the LLM may get lost in the middle of the context, and cost rises.

In practice, k between 5 and 10 is a reasonable starting point. You'll calibrate this in Lesson 08 (evaluation), measuring recall and faithfulness against real test sets.

After top-k, the context builder assembles the prompt: system instruction + retrieved chunks (with their references) + user question. The LLM generates the answer, ideally citing the sources of the chunks used.
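A context builder along these lines might look like the sketch below. The function name build_prompt, the prompt wording, and the [n] citation convention are assumptions made for illustration; the lesson only fixes the three ingredients (system instruction, retrieved chunks with references, user question).

```python
def build_prompt(system: str, chunks: list[dict], question: str) -> str:
    """Assemble system instruction + numbered chunks with sources + question."""
    context_lines = []
    for n, chunk in enumerate(chunks, start=1):
        # Each chunk keeps its source so the model can cite it as [n].
        context_lines.append(f"[{n}] (source: {chunk['source_uri']}) {chunk['text']}")
    context = "\n".join(context_lines)
    return (
        f"{system}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the context above and cite sources as [n]."
    )


retrieved = [
    {"source_uri": "s3://example-bucket/policies/refunds.pdf",
     "text": "Refunds are processed within 10 business days."},
    {"source_uri": "s3://example-bucket/policies/returns.pdf",
     "text": "Items can be returned within 30 days of delivery."},
]
prompt = build_prompt(
    "You are a support assistant.",
    retrieved,
    "How long does a refund take?",
)
print(prompt)
```

Numbering the chunks in the prompt gives the answer something stable to cite, which is what makes the source references traceable back to the metadata stored at ingestion.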

Latency in this stage breaks down roughly as: query embedding (~50–150 ms), vector search (~20–100 ms), and LLM generation (dominant — hundreds of ms to seconds depending on model and output size). Cost comes from LLM input tokens (context + question) and output tokens, plus the embedding call.

In practice: the most common mistake in a first pipeline

Senior Solutions Architect

In practice, the mistake I see most often in first RAG pipelines is treating ingestion and query as a single coupled process — running the question embedding inside the same job that indexes documents, or worse, reindexing everything on each query. Separate the two halves from the start: ingestion is an async job, query is a synchronous service. This separation isn't just architectural — it defines how you scale, monitor, and cost each part independently. A bug in ingestion must not take down the query path.

Reindexing: when data changes

Documents change. Policies get updated, prices shift, products are discontinued. If the vector store doesn't reflect the current state of the sources, the LLM will generate outdated answers with full confidence — and that's worse than not answering.

There are three patterns for handling reindexing:

Full reindex: delete and rebuild the entire index. Simple, reliable, but expensive and slow for large corpora. Suitable for small corpora or structural changes (e.g., swapping the embedding model).

Incremental upsert: detect new or modified documents (via hash, updated_at, or event-driven via S3 notifications / EventBridge) and reprocess only those. More complex, but scales well. Requires each chunk to have a deterministic ID based on the source and chunk index — so the upsert overwrites the old chunk without duplicating it.

Soft delete + versioning: keeps old versions marked as inactive, useful when you need audit trails or rollback. Increases storage cost but gives traceability.
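The incremental upsert pattern can be sketched with a content hash per document. As before, dicts stand in for the vector store and the hash registry, and sync_document, chunk_id, and split_on_blank_lines are hypothetical helpers. Embedding is omitted to keep the example short. Note the extra step of deleting leftover chunks when the new version of a document is shorter than the old one.

```python
import hashlib


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def chunk_id(source_uri: str, chunk_index: int) -> str:
    return f"{source_uri}#{chunk_index}"


def split_on_blank_lines(text: str) -> list[str]:
    """Toy chunker: one chunk per paragraph."""
    return [p for p in text.split("\n\n") if p.strip()]


def sync_document(store: dict, doc_hashes: dict, source_uri: str, text: str) -> bool:
    """Reprocess one document only if its content changed. Returns True if reindexed."""
    new_hash = content_hash(text)
    if doc_hashes.get(source_uri) == new_hash:
        return False  # unchanged: skip chunking and embedding entirely

    chunks = split_on_blank_lines(text)
    for i, chunk in enumerate(chunks):
        store[chunk_id(source_uri, i)] = chunk  # overwrite by deterministic ID

    # Remove chunks left over from a longer previous version.
    i = len(chunks)
    while chunk_id(source_uri, i) in store:
        del store[chunk_id(source_uri, i)]
        i += 1

    doc_hashes[source_uri] = new_hash
    return True


store, doc_hashes = {}, {}
uri = "s3://example-bucket/policy.md"
first = sync_document(store, doc_hashes, uri, "A\n\nB\n\nC")   # new document
second = sync_document(store, doc_hashes, uri, "A\n\nB\n\nC")  # unchanged
third = sync_document(store, doc_hashes, uri, "A\n\nB2")       # shorter update
print(first, second, third, len(store))  # True False True 2
```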

My default recommendation: start with a full reindex on a schedule (e.g., daily), and move to incremental upsert once volume or change frequency makes a full reindex prohibitive. Don't optimize before measuring.

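On AWS, the natural pattern for triggering incremental reindexing when a document is updated in the source bucket is S3 Event Notifications invoking Lambda, or S3 events routed through EventBridge rules. EventBridge Pipes does not read from S3 directly, so it fits only when the notifications first land in a supported source such as an SQS queue.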
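A Lambda handler for the S3 Event Notifications path could start from a sketch like this one. It relies on the standard notification payload (a Records list with s3.bucket.name and a URL-encoded s3.object.key); enqueue_reindex is a hypothetical hook standing in for whatever starts your ingestion job, and the bucket name is made up.

```python
from urllib.parse import unquote_plus


def changed_objects(event: dict) -> list[str]:
    """Extract s3:// URIs of created or overwritten objects from an S3 notification."""
    uris = []
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue  # ignore deletions and other event types in this sketch
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        uris.append(f"s3://{bucket}/{key}")
    return uris


def enqueue_reindex(uri: str) -> None:
    """Hypothetical hook: replace with a queue message or a job submission."""
    print(f"would reindex {uri}")


def handler(event, context):
    """Lambda entry point: one reindex request per changed object."""
    uris = changed_objects(event)
    for uri in uris:
        enqueue_reindex(uri)
    return {"reindexed": uris}


sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "example-docs"},
                "object": {"key": "policies/refund+policy.pdf"},
            },
        }
    ]
}
print(handler(sample_event, None))
```

Keeping the handler thin, so that it only translates events into reindex requests, preserves the separation between the trigger and the ingestion job itself.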

Put in order

Order the RAG query (online)

From question to answer with sources.

Generate the answer with citations
Retrieve the top-k chunks from the vector store
Assemble the context with the chunks
Embed the question

Key takeaways from this lesson

Ingestion (offline) and query (online) are separate pipelines — never couple the two.

The embedding model must be identical on both sides; swapping the model requires full reindexing.

top-k controls the trade-off between recall and noise in the context sent to the LLM.

Metadata written during ingestion enables filters, citations, and selective reindexing — always store it.

LLM generation dominates online latency; the embedding call dominates ingestion cost.

Incremental reindexing requires deterministic per-chunk IDs to upsert without duplicating.

Ingestion vs. Query: characteristics of each half

Characteristic	Ingestion (offline)	Query (online)
When it runs	Schedule or event (S3, EventBridge)	On every user question
Main requirement	Throughput and idempotency	Low latency
Biggest cost	Embedding per token (document volume)	LLM input/output tokens
Failure tolerable?	Yes: async retry, no user impact	No: failure immediately visible to user
Scales with	Document volume and update frequency	Number of concurrent users

Frequently asked questions about the pipeline

Can I use different embedding models for documents from different domains?

Yes, but each model needs its own separate index in the vector store. At query time, you need to know which index to use (domain routing — covered in Lesson 06). Never mix vectors from different models in the same index; the distances are not comparable.

What happens if I increase top-k but the LLM has a small context window?

You'll either truncate the context or get an error. The solution is to calculate the average chunk size in tokens and ensure k × chunk_tokens fits within the model's window with margin for the instruction and response. Reranking (Lesson 05) helps select the best k before assembling the prompt.
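The budget check described above is simple arithmetic. The sketch below assumes you already know the token counts; max_k, its parameters, and the 10% safety margin are illustrative choices, and the numbers in the example are invented rather than taken from any specific model.

```python
def max_k(
    context_window: int,
    chunk_tokens: int,
    instruction_tokens: int,
    question_tokens: int,
    response_tokens: int,
    margin_pct: int = 10,
) -> int:
    """Largest k such that k chunks fit in the window with room for everything else."""
    if chunk_tokens <= 0:
        raise ValueError("chunk_tokens must be positive")
    # Hold back a percentage of the window as a safety margin (integer math).
    usable = context_window * (100 - margin_pct) // 100
    available = usable - instruction_tokens - question_tokens - response_tokens
    return max(available // chunk_tokens, 0)


# Invented numbers: 8,000-token window, 400-token average chunk.
print(max_k(
    context_window=8000,
    chunk_tokens=400,
    instruction_tokens=300,
    question_tokens=100,
    response_tokens=800,
))  # 15
```

Using the average chunk size is an approximation; if chunk sizes vary widely, budgeting with the maximum chunk size is the more conservative choice.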

Do I need to reindex everything if only one document changes?

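No, provided you implemented upsert with deterministic per-chunk IDs. Reprocess only the chunks of the changed document and upsert them into the index. If the new version produces fewer chunks than the old one, also delete the leftover IDs so stale text does not stay searchable. The rest of the index remains intact. A full reindex is mandatory only for structural changes, such as swapping the embedding model.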

Closing Module 1

Module 1 complete ✓

With this lesson you have the full RAG map: you know what happens at each stage, where the money goes, where latency appears, and what breaks when data changes. Module 1 was about fundamentals — and solid fundamentals are what separates a prototype that impresses in a demo from a system that works in production. Module 2 starts with retrieval quality: hybrid search, reranking, and metadata filters. Because retrieving the right chunks is the one thing the LLM cannot fix for you.
