Who is Fernando F. Azevedo?

Fernando F. Azevedo is a Senior Solutions Architect at Banco Itaú with 16+ years of experience across AWS, event-driven architecture, DevSecOps, Data Mesh, AI and financial systems.

Production RAG on AWS

Module 1 · Fundamentals · Lesson 03/12

Chunking: strategies and pitfalls

How to split documents without destroying context — the decision that most affects RAG quality.

5 min read

The model doesn't see your document — it sees the chunk you cut out of it. If that chunk is poorly split, context-free, sliced through a table, or too large to be useful, the answer will be bad regardless of which LLM you use. Chunking is the architectural decision that most affects RAG quality, and it's the one most projects get wrong first.

Why chunking matters so much

Think of the RAG pipeline as a two-step funnel: first you retrieve the most relevant chunks, then the model generates the answer from them. The model can only reason about what's in the context window — and what's in the context window is exactly the chunks the retriever selected.

This creates a direct dependency: bad chunk → bad embedding → bad retrieval → bad answer. You can have the best model in the world and a perfectly tuned vector search, but if the retrieved chunk cut the sentence before its conclusion, or mixed two different topics together, the model will hallucinate or answer incompletely.

Beyond quality, chunk size directly affects cost. Large chunks increase the number of tokens sent to the model on every call. If you have 5 chunks of 1,000 tokens each in context, that's already 5,000 tokens just for context — before counting the prompt and the response. In production, with thousands of calls per day, that adds up fast.
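To make the arithmetic concrete, here is a minimal sketch of the context-token budget. The call volume of 10,000 per day is an assumption for illustration, not a figure from this lesson, and the function name is hypothetical.

```python
def daily_context_tokens(chunks_per_call: int, tokens_per_chunk: int, calls_per_day: int) -> int:
    """Tokens spent on retrieved context alone (prompt and response excluded)."""
    return chunks_per_call * tokens_per_chunk * calls_per_day


# 5 chunks of 1,000 tokens in a single call, as in the text above.
per_call = daily_context_tokens(5, 1_000, 1)

# Assumed volume: 10,000 calls per day.
per_day_large = daily_context_tokens(5, 1_000, 10_000)
per_day_small = daily_context_tokens(5, 512, 10_000)  # same volume, 512-token chunks

print(per_call, per_day_large, per_day_small)
```

Multiply the daily totals by your model's input-token price to get a cost estimate; the price is deliberately left out because it varies by model and region.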

The good news: chunking is a decision you can iterate on. There is no universally correct strategy — there is the right strategy for your document type and your use case.

Chunking strategies: from document to chunks

The same document can be split in very different ways. Each strategy produces chunks with distinct characteristics of size, semantic coherence, and structure preservation.

📄 Original document

Document · PDF / MD / HTML

✂️ Splitting strategies

Fixed size · 512 tokens, overlap 50
By sentence · or paragraph
Recursive · \n\n → \n → . → space
By structure · headings / tables
Semantic · grouped by similarity

🧩 Chunks with metadata

Chunk · text + title + section + source

🔢 Indexing

Embedding · model
Vector store · OpenSearch / Pinecone

The five main strategies — and when to use each one

Fixed size is almost everyone's starting point: you set a token limit (e.g., 512) and split the document at that interval, with a configurable overlap. It's simple, predictable, and easy to debug. The problem is that it ignores text structure — it can cut a sentence in half, separate a question from its answer, or mix the end of one topic with the start of the next.
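A minimal sketch of fixed-size chunking with overlap. It assumes whitespace-separated words as a stand-in for tokens; a real pipeline would count tokens with the tokenizer of its embedding model. The function name and the tiny sizes in the example are illustrative only.

```python
def fixed_size_chunks(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Slide a window of `size` tokens, stepping by `size - overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the window already reached the end of the document
    return chunks


# Assumption: words stand in for tokens.
tokens = "the refund policy applies only if the request arrives within thirty days".split()
chunks = fixed_size_chunks(tokens, size=5, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

Note how the second chunk starts with "only if the request arrives": the condition has been separated from the policy it qualifies, which is exactly the structural blindness described above.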

By sentence or paragraph respects the natural breaks in text. It's better than fixed size for flowing prose, but produces chunks of very variable size — a paragraph can have 30 words or 300.
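As a rough sketch, paragraph and sentence splitting can be done with two regular expressions. The sentence rule here is deliberately naive (it breaks after ., ! or ? followed by whitespace) and will mis-split abbreviations such as "e.g."; treat it as an illustration, not a production sentence segmenter.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split on blank lines and drop empty pieces."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph: str) -> list[str]:
    """Naive rule: a sentence ends at ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]


text = "Chunking matters. It shapes retrieval.\n\nShort one."
paragraphs = split_paragraphs(text)
sizes = [len(p.split()) for p in paragraphs]  # words per paragraph
print(paragraphs, sizes)
```

Even in this tiny example the two paragraphs differ in size, which is the variability the strategy trades for cleaner boundaries.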

Recursive is the most widely used in practice for general text. You define a hierarchy of separators (\n\n, \n, ., and finally a single space) and the algorithm splits on the highest-level separator first, falling back to the next one only for pieces that still exceed the limit. This is the approach behind LangChain's RecursiveCharacterTextSplitter, and it works well for most documents.
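The idea can be sketched in plain Python. This is a simplified illustration of the recursive approach, not LangChain's implementation: it measures length in characters rather than tokens, has no overlap, and the separator list and `max_chars` default are assumptions.

```python
def recursive_split(
    text: str,
    max_chars: int = 200,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " "),
) -> list[str]:
    """Split on the highest-level separator; recurse only into oversized pieces."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        return recursive_split(text, max_chars, rest)

    chunks, current = [], ""
    for piece in pieces:
        candidate = current + sep + piece if current else piece
        if len(candidate) <= max_chars:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) > max_chars:
            # Piece is still too big: retry with lower-level separators.
            chunks.extend(recursive_split(piece, max_chars, rest))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    # Note: the separator at each chunk boundary is dropped in this sketch.
    return chunks


doc = (
    "Intro paragraph about chunking.\n\n"
    "Second paragraph. It has two sentences that are somewhat longer than the first.\n\n"
    "Third."
)
chunks = recursive_split(doc, max_chars=60)
for c in chunks:
    print(repr(c))
```

Short paragraphs survive intact, and only the long one is broken down further. That is why this strategy degrades more gracefully than a fixed window.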

By structure is the right choice when the document has explicit hierarchy: Markdown headings, HTML sections, slides. You keep each section as a minimum unit and inherit the title as metadata automatically. For technical documents with tables and code blocks, this strategy avoids destructive cuts.
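For Markdown, a structural split can be sketched as a single pass over the lines that tracks the current heading path. This is a minimal sketch under simplifying assumptions: only `#`-style headings are recognized, and a `#` at the start of a line inside a fenced code block would be misread as a heading. The `heading_path` key is a hypothetical name.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str) -> list[dict]:
    """Return one section per heading, with the heading path as metadata."""
    sections: list[dict] = []
    path: list[str] = []   # e.g. ["Guide", "Install"]
    body: list[str] = []   # lines collected under the current heading

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            sections.append({"heading_path": " > ".join(path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]          # drop headings at this level or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return sections


doc = "# Guide\nIntro text.\n## Install\nRun the installer.\n## Usage\nCall the API."
sections = split_by_headings(doc)
for s in sections:
    print(s)
```

Each section carries its full heading path, so the title is inherited as metadata without any extra bookkeeping.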

Semantic groups sentences by embedding similarity — semantically cohesive chunks, but with higher processing cost at indexing time. It makes sense for large, heterogeneous corpora where structure is unreliable.
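The grouping logic can be sketched without any external model. The `embed` function below is a toy bag-of-words stand-in, used only so the example runs on its own; in a real pipeline it would be replaced by a call to an embedding model, which is where the extra indexing cost comes from. The threshold of 0.2 is an arbitrary assumption.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in for an embedding model: word counts."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk when a sentence drifts away from the previous one."""
    chunks: list[list[str]] = []
    for sentence in sentences:
        if chunks and cosine(embed(chunks[-1][-1]), embed(sentence)) >= threshold:
            chunks[-1].append(sentence)
        else:
            chunks.append([sentence])
    return chunks


sentences = [
    "Refunds are issued within thirty days.",
    "Refunds are issued to the original card.",
    "Kafka topics are partitioned by key.",
    "Kafka topics are replicated across brokers.",
]
groups = semantic_chunks(sentences)
for g in groups:
    print(g)
```

The two refund sentences end up in one chunk and the two Kafka sentences in another, even though nothing in the text marks the boundary.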

Chunking strategy comparison

Strategy	Semantic coherence	Predictable size	Implementation cost	Best for
Fixed size	Low	High	Minimal	Fast prototyping
Sentence/paragraph	Medium	Low	Low	Flowing prose, articles
Recursive	Medium-high	Medium	Low	General text, documentation
Structural	High	Variable	Medium	Technical docs, wikis, HTML
Semantic	High	Low	High	Large, heterogeneous corpora

Overlap, metadata, and the pitfalls that destroy quality

Overlap is the number of tokens shared between consecutive chunks. If you use 512-token chunks with a 50-token overlap, the last 50 tokens of chunk N also appear at the start of chunk N+1. This seems wasteful, but it has a clear purpose: preventing important information from falling in the "seam" between two chunks and being retrieved whole by neither. For most cases, an overlap of 10% to 15% of the chunk size is enough. Beyond that, you mostly add duplicated content to the index and to the context.
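The cost of overlap is easy to quantify. The sketch below assumes a sliding window with a fixed step of `size - overlap` tokens; the 10,000-token document is an arbitrary example and the function name is hypothetical.

```python
import math


def overlap_stats(total_tokens: int, size: int, overlap: int) -> tuple[int, float]:
    """Return (number of chunks, fraction of extra tokens indexed due to overlap)."""
    step = size - overlap
    if total_tokens <= size:
        n_chunks = 1
    else:
        n_chunks = math.ceil((total_tokens - overlap) / step)
    # Every pair of neighbouring chunks shares `overlap` tokens.
    indexed = total_tokens + (n_chunks - 1) * overlap
    return n_chunks, indexed / total_tokens - 1


modest = overlap_stats(10_000, size=512, overlap=50)    # roughly 10% overlap
heavy = overlap_stats(10_000, size=512, overlap=256)    # 50% overlap
print(modest, heavy)
```

With a 50-token overlap the index grows by about 10%; with a 256-token overlap it nearly doubles, with no matching gain in coverage.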

Metadata in the chunk is as important as the text itself. Each chunk should carry: document title, source section or heading, URL or file path, and ideally the creation or update date. This metadata serves two purposes: filtering in search (we'll cover this in lesson 06) and building citations in the response (lesson 11). If you don't preserve provenance at chunking time, it will be impossible to reconstruct it later.
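One way to keep provenance attached is to make it part of the chunk's type from the start. The field names below are hypothetical, chosen to mirror the list above; adapt them to whatever schema your vector store expects.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Chunk:
    text: str
    document_title: str
    section: str                       # source section or heading
    source: str                        # URL or file path
    updated_at: Optional[str] = None   # ISO date, if known


def to_index_record(chunk: Chunk, chunk_id: str) -> dict:
    """Flatten a chunk into the record stored next to its vector."""
    record = asdict(chunk)
    record["id"] = chunk_id
    return record


def cite(chunk: Chunk) -> str:
    """Build a human-readable citation from the stored provenance."""
    return f"{chunk.document_title}, {chunk.section} ({chunk.source})"


chunk = Chunk(
    text="Refunds are issued within thirty days.",
    document_title="Refund Policy",
    section="Eligibility",
    source="docs/refund-policy.md",
    updated_at="2024-01-15",
)
record = to_index_record(chunk, "refund-policy-0001")
print(cite(chunk))
```

Because every field is set at chunking time, both search filters and citations can be built later without going back to the source document.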

The most common pitfalls: chunks that are too large (above 1,000 tokens) inject noise into the context, because the model receives irrelevant information alongside the relevant passage and may get confused. Chunks that are too small (below 100 tokens) lose context, because an isolated sentence rarely carries enough meaning for the model to answer well. The worst case is cutting a table or code block in half. A table is a header plus rows, and once they are separated neither half is intelligible. Code has dependencies between lines. Always treat tables and code as atomic units.
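A minimal sketch of isolating atomic blocks in Markdown before any split. It assumes fenced code blocks and pipe-style tables (lines starting with `|`); other table or code formats would need their own detection. Only the `prose` segments would then go through the text splitter.

```python
FENCE = "`" * 3  # Markdown code fence, built this way so no literal fence appears below


def split_atomic(markdown: str) -> list[tuple[str, str]]:
    """Return (kind, text) segments where kind is 'prose', 'code' or 'table'."""
    segments: list[tuple[str, list[str]]] = []
    in_code = False

    def add(kind: str, line: str) -> None:
        # Extend the current segment if the kind matches, else start a new one.
        if segments and segments[-1][0] == kind:
            segments[-1][1].append(line)
        else:
            segments.append((kind, [line]))

    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            add("code", line)         # opening or closing fence belongs to the block
            in_code = not in_code
        elif in_code:
            add("code", line)
        elif stripped.startswith("|"):
            add("table", line)
        else:
            add("prose", line)
    return [(kind, "\n".join(lines)) for kind, lines in segments]


doc = "\n".join([
    "Intro.",
    "| a | b |",
    "| - | - |",
    "| 1 | 2 |",
    "More prose.",
    FENCE + "python",
    "x = 1",
    "print(x)",
    FENCE,
    "End.",
])
segments = split_atomic(doc)
kinds = [kind for kind, _ in segments]
print(kinds)
```

The table keeps its header with its rows and the code block keeps all of its lines, whatever chunk size you later apply to the prose.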

In practice: how I start a new project

In practice, I almost always start with recursive chunking at 512 tokens and 10% overlap. It's the safest starting point for general text — it works reasonably well before any optimization. Then I look at the actual documents: if they have clear headings, I switch to structural chunking and inherit the heading as metadata. If they have tables or code, I isolate those sections before any split. I only invest in semantic chunking when the corpus is large, heterogeneous, and recursive results have already hit a ceiling. Order matters: don't optimize chunking before you have a baseline and an evaluation metric — otherwise you're tuning in the dark.
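As an illustration of what a baseline can look like, here is a sketch of one simple retrieval metric: the hit rate at k. It is one possible metric, chosen here as an assumption, not necessarily the one used later in this course. The questions, file names, and retrieval results are made up for the example.

```python
def hit_rate_at_k(results: dict[str, list[str]], expected: dict[str, str], k: int = 5) -> float:
    """Fraction of questions whose expected source appears in the top-k results."""
    if not expected:
        return 0.0
    hits = sum(1 for question, source in expected.items() if source in results.get(question, [])[:k])
    return hits / len(expected)


# Hypothetical evaluation set: question id -> source that should be retrieved.
expected = {"q1": "refund-policy.md", "q2": "kafka-guide.md", "q3": "onboarding.md"}

# Hypothetical retrieval output for two chunking strategies.
recursive_results = {
    "q1": ["refund-policy.md", "faq.md"],
    "q2": ["faq.md", "onboarding.md"],
    "q3": ["onboarding.md"],
}
structural_results = {
    "q1": ["refund-policy.md"],
    "q2": ["kafka-guide.md", "faq.md"],
    "q3": ["faq.md", "onboarding.md"],
}

baseline = hit_rate_at_k(recursive_results, expected, k=2)
candidate = hit_rate_at_k(structural_results, expected, k=2)
print(baseline, candidate)
```

With a number like this for the baseline, a change of strategy becomes a comparison instead of a guess.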

Key takeaways from this lesson

The retrieved chunk is everything the model sees — chunk quality determines answer quality.

Recursive chunking is the safest starting point for general text; structural chunking is right for documents with explicit hierarchy.

10–15% overlap prevents information loss at chunk seams without excessive duplication in the context.

Preserve metadata (title, section, source) in each chunk at indexing time — you'll need it for filters and citations.

Tables and code blocks are atomic units — never split them in half.

Don't optimize chunking without an evaluation metric — you need to know if you actually improved.

Frequently asked questions about chunking

What is the ideal chunk size?

There is no universal number. For technical documentation, 400–600 tokens with 50–80 token overlap is a solid starting point. For more narrative texts, whole paragraphs often work better than a fixed token limit. The right size is whatever maximizes your evaluation metric — which is why lesson 08 (evaluation) is the mandatory companion to this one.

Do I need to re-index everything if I change the chunking strategy?

Yes. Chunking and embedding are inseparable — the vector represents the text of that specific chunk. If you change how you split the text, the old vectors become inconsistent with the new ones. A full re-index is required. That's why it's worth having an automated indexing pipeline from the start.
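An automated pipeline can detect this condition by fingerprinting the settings that determine the vectors. This is a minimal sketch; the configuration keys and the model name are hypothetical placeholders.

```python
import hashlib
import json


def index_fingerprint(config: dict) -> str:
    """Stable short hash of the settings that determine the stored vectors."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def needs_reindex(stored_fingerprint: str, config: dict) -> bool:
    """True when the current settings differ from those used to build the index."""
    return stored_fingerprint != index_fingerprint(config)


# Hypothetical configuration used when the index was built.
indexed_config = {
    "strategy": "recursive",
    "chunk_size": 512,
    "overlap": 51,
    "embedding_model": "example-embedding-v1",
}
stored = index_fingerprint(indexed_config)

# Changing any chunking parameter invalidates the index.
new_config = dict(indexed_config, overlap=77)
print(needs_reindex(stored, indexed_config), needs_reindex(stored, new_config))
```

Storing the fingerprint alongside the index lets the pipeline refuse to mix vectors produced under different chunking settings.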

Does Amazon Bedrock Knowledge Bases do chunking automatically?

Yes. Knowledge Bases offers built-in chunking as a configurable option, including fixed-size, hierarchical, and semantic strategies. It's convenient, but you give up some fine-grained control over table handling and custom metadata compared with running your own chunking pipeline. We'll detail this in lesson 09.
