Who is Fernando F. Azevedo?

Fernando F. Azevedo is a Senior Solutions Architect at Banco Itaú with 16+ years of experience across AWS, event-driven architecture, DevSecOps, Data Mesh, AI and financial systems.

Production RAG on AWS

Module 2 · Retrieval · Lesson 05/12

Hybrid search and reranking

Combine semantic and keyword search, then reorder the results with a reranker to improve retrieval.

Vector search is powerful, but it fails at something basic: exact terms. If the user types "CVE-2024-1234" or "API v3.2", embeddings won't save you — they'll return semantically close chunks that are probably wrong. The solution isn't to switch approaches; it's to combine both and then reorder the results with a model that actually understands the question.

The limit of vector-only search

In lesson 02 we saw that embeddings capture meaning. That's great for questions like "how do I cancel my subscription?", where language variations exist. But consider these cases:

A developer searches for NullPointerException in error logs.
A security analyst queries the code CVE-2024-1234.
A user types an exact product name: XR-7000 Pro.

The embedding model will vectorize these terms and find neighbors in semantic space. The problem is that CVE-2024-1234 and CVE-2024-5678 end up very close in vector space because they are structurally similar, so the system can return the wrong chunk with a high similarity score.

Lexical search (BM25, the classic term-frequency ranking algorithm) doesn't have this problem. It treats CVE-2024-1234 as a literal string and only returns documents containing that exact token. It's deterministic, fast, and needs no GPU.

BM25's weakness is the inverse: no synonyms, no tolerance for variation. "Cancel subscription" won't find "terminate contract". For natural language, it loses badly to embeddings.

Neither approach is superior in every scenario. The right answer is to run both in parallel and merge the results — that's hybrid search.

Hybrid search: combining both worlds

The mechanics are simple: you fire two searches in parallel — one vector, one lexical — and each returns a ranked list of chunks. The challenge is merging those lists into one, since the scores are incomparable (cosine similarity vs. BM25 score).
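As a minimal sketch of the parallel step, the two searches can be launched concurrently with asyncio. The functions vector_search and lexical_search are hypothetical stand-ins that return hard-coded document IDs; in a real system they would call your vector index and your BM25 index.

```python
import asyncio


async def vector_search(query: str, n: int) -> list[str]:
    # Hypothetical stand-in for a k-NN query against the vector index.
    await asyncio.sleep(0.01)
    return ["doc-b", "doc-a", "doc-c"][:n]


async def lexical_search(query: str, n: int) -> list[str]:
    # Hypothetical stand-in for a BM25 keyword query.
    await asyncio.sleep(0.01)
    return ["doc-a", "doc-d"][:n]


async def retrieve_both(query: str, n: int = 20) -> tuple[list[str], list[str]]:
    # gather() runs both coroutines concurrently and returns the
    # results in the same order the coroutines were passed in.
    vector_hits, lexical_hits = await asyncio.gather(
        vector_search(query, n),
        lexical_search(query, n),
    )
    return vector_hits, lexical_hits
```

Calling asyncio.run(retrieve_both("CVE-2024-1234")) returns two independently ranked lists, which is exactly the input the merge step needs.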

A widely used technique in practice is Reciprocal Rank Fusion (RRF). The idea is elegant: instead of normalizing scores (which is brittle), you use only each document's position in the lists.

The formula is:

RRF(doc) = Σ 1 / (k + rank_i(doc))

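Where k is a constant (typically 60), rank_i is the document's 1-based position in list i, and the sum runs over the lists that contain the document. A document that appears 3rd in vector search and 5th in BM25 scores 1/63 + 1/65 ≈ 0.031. A document that appears in only one list scores at most 1/61 ≈ 0.016, so agreement between the two lists is rewarded.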
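The formula can be sketched in a few lines of Python. This is an illustrative implementation, not a library API: rrf_merge and the document IDs are made up for the example, and ranks are assumed to start at 1.

```python
from collections import defaultdict


def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Merge ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the first hit in a list has rank 1.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID
    # so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


vector_hits = ["doc-b", "doc-c", "doc-a"]                      # doc-a is 3rd
lexical_hits = ["doc-d", "doc-e", "doc-f", "doc-g", "doc-a"]   # doc-a is 5th
fused = rrf_merge([vector_hits, lexical_hits])
```

Here doc-a ends up first even though it tops neither list, because it is the only document both searches agree on.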

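OpenSearch supports hybrid search natively: you send a hybrid query and a search pipeline merges the results of the sub-queries. Depending on the version, the merge is done by score normalization with a weight per sub-query or by rank-based fusion (RRF), so check what your OpenSearch Service or OpenSearch Serverless deployment supports. Amazon Bedrock Knowledge Bases also exposes a hybrid search option in the console and via API.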
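As a rough sketch of what this looks like in OpenSearch, the helpers below build an index body with one vector field and one text field, plus a hybrid query that runs a match and a knn sub-query together. The field names embedding and text, the helper names, and the dimension are assumptions for illustration; the hybrid query also needs a search pipeline attached to the request or the index.

```python
def build_index_body(dimension: int) -> dict:
    """Index with a k-NN vector field and a BM25-searchable text field."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                # Assumed field names; use whatever your ingestion writes.
                "embedding": {"type": "knn_vector", "dimension": dimension},
                "text": {"type": "text"},
            }
        },
    }


def build_hybrid_query(query_text: str, query_vector: list[float], n: int = 20) -> dict:
    """Hybrid query: one lexical sub-query and one vector sub-query."""
    return {
        "size": n,
        "query": {
            "hybrid": {
                "queries": [
                    {"match": {"text": {"query": query_text}}},
                    {"knn": {"embedding": {"vector": query_vector, "k": n}}},
                ]
            }
        },
    }
```

The order of the sub-queries matters later: any per-query weights in the search pipeline are matched to them by position.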

An important parameter is the relative weight between vector and lexical search. There's no universal value: it depends on your domain. For technical documentation with lots of codes and acronyms, I give more weight to BM25. For natural-language FAQ, more weight to vector. Start 50/50 and adjust based on evaluation results (lesson 08).
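One way to express the relative weight in OpenSearch is a search pipeline with a normalization processor. Note that this sketch uses min-max score normalization with a weighted average, not RRF; the helper name and the example weights are assumptions, and the weights are expected to sum to 1.0.

```python
def build_search_pipeline(lexical_weight: float, vector_weight: float) -> dict:
    """Search pipeline body that blends sub-query scores by weight."""
    if abs(lexical_weight + vector_weight - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1.0")
    return {
        "description": "Hybrid search: normalize scores, then weighted average",
        "phase_results_processors": [
            {
                "normalization-processor": {
                    "normalization": {"technique": "min_max"},
                    "combination": {
                        "technique": "arithmetic_mean",
                        # Weight order must match the order of the
                        # sub-queries in the hybrid query (lexical first here).
                        "parameters": {"weights": [lexical_weight, vector_weight]},
                    },
                }
            }
        ],
    }


# Technical docs full of codes and acronyms: lean towards BM25.
technical_docs_pipeline = build_search_pipeline(0.7, 0.3)
```

Keeping the weights in one function makes it easy to sweep a few values during evaluation instead of hard-coding a guess.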

Flow: hybrid search + reranking

Full pipeline from user query to final chunks sent to the LLM. Both searches run in parallel; RRF merges the rankings; the reranker selects the best candidates.

🔍 Retrieval: parallel search

Vector search · ANN / cosine
Lexical search · BM25 / keyword

🔀 Fusion: RRF

Reciprocal Rank Fusion (RRF)
Top-N candidates · (e.g. 20-50 chunks)

🎯 Reranking: cross-encoder

Reranker · Cohere Rerank / Bedrock
Final top-K · (e.g. 3-5 chunks)

🤖 Generation: LLM

LLM (Claude / Titan) · Prompt + context

Reranking: the second filter that changes the game

Hybrid search improves recall — you retrieve more relevant chunks. But the ranking is still imperfect. RRF doesn't know why the question was asked; it only knows a chunk ranked well in two lists.

The reranker solves this. It's a cross-encoder: it receives the query and each chunk together as input and produces a true relevance score. Unlike embeddings (which vectorize query and document separately), the cross-encoder reads both at the same time and can capture subtle dependencies — for example, that the answer to the question is in the third sentence of the chunk, not the first.

The usage pattern is: retrieve many candidates (20, 50, even 100 chunks) with hybrid search, then pass everything through the reranker and keep the top-3 or top-5. This works because rerankers are expensive per call but you use them once per query, over a small set.
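The retrieve-many, keep-few pattern can be sketched with a pluggable scoring function. The toy_overlap_score function below is only a stand-in so the example runs without a model; a real reranker would replace it with a cross-encoder call that scores the query and chunk together.

```python
from typing import Callable


def rerank_top_k(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    k: int = 3,
) -> list[tuple[float, str]]:
    """Score every (query, chunk) pair and keep the k best."""
    scored = [(score_fn(query, chunk), chunk) for chunk in candidates]
    # Sort by score only; equal scores keep their retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def toy_overlap_score(query: str, chunk: str) -> float:
    # Toy stand-in for a cross-encoder: share of query words found in the chunk.
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words) if query_words else 0.0


candidates = [
    "reset your password in settings",
    "cancel your subscription from the billing page",
    "billing page shows invoices",
]
top = rerank_top_k("cancel subscription billing", candidates, toy_overlap_score, k=2)
```

The reranker is called once per candidate, which is why the candidate set is kept to tens of chunks rather than the whole index.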

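On AWS, the most direct path is Cohere Rerank, available via Amazon Bedrock. You pass the query and the list of chunks, and get back the reordered indices with scores. In this flow the integration with Knowledge Bases is manual: you call the reranker after retrieve and before building the prompt. Newer Bedrock releases also offer a reranking configuration on the Knowledge Bases retrieve call, so check the current API reference to see whether it covers your case.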

An important detail: rerankers have a token limit per document, and chunks longer than that limit (see lesson 03) are truncated. If you use 1024-token chunking, check the reranking model's limit first. Limits vary by model; unless the model's documentation says otherwise, treat 512 tokens per passage as a conservative ceiling.
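A quick pre-flight check can flag chunks that are likely to be truncated. This sketch approximates tokens by counting whitespace-separated words, which is an assumption that usually undercounts; for real numbers, use the tokenizer that ships with your reranking model. The 512 default mirrors the conservative ceiling mentioned above.

```python
def approx_token_count(text: str) -> int:
    # Rough heuristic: one token per whitespace-separated word.
    # Real tokenizers usually produce more tokens than words.
    return len(text.split())


def oversized_chunks(chunks: list[str], max_tokens: int = 512) -> list[int]:
    """Return the indices of chunks that exceed the token budget."""
    return [
        index
        for index, chunk in enumerate(chunks)
        if approx_token_count(chunk) > max_tokens
    ]


chunks = ["short chunk", "word " * 600]
too_long = oversized_chunks(chunks)
```

Running a check like this over the index before enabling reranking shows how many chunks would lose their tail.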

In practice: when is the extra cost worth it?

Senior Solutions Architect

In practice, I don't enable reranking in every RAG I build. For simple cases — short FAQ, small knowledge base, predictable queries — hybrid search already works well and the reranker only adds latency and cost. I enable reranking when: (1) the base has more than 10k documents and information density is high; (2) users ask complex or ambiguous questions; (3) evaluation tests show the top-3 chunks still miss frequently. The precision gain is real, but measure before going to production — use the metrics from lesson 08 to justify the decision.

Put in order

Order the hybrid + rerank flow

From the question to the few final chunks.

Rerank the candidates by relevance
Fuse the two lists (e.g. RRF)
Keep only the top-K best for the context
Retrieve candidates by vector AND by keyword

Comparing search approaches

Criterion	Vector only	Lexical only (BM25)	Hybrid + Rerank
Exact terms / acronyms	❌ Weak	✅ Strong	✅ Strong
Natural language / synonyms	✅ Strong	❌ Weak	✅ Strong
Latency	Low	Very low	Medium-high
Cost per query	Low	Very low	Higher (reranker)
Top-K precision	Medium	Medium	High

Implementing hybrid search + rerank on AWS

1. Configure the OpenSearch index with both vector and text fields
Create an index with a knn_vector field for embeddings and a standard text field for BM25. OpenSearch maintains both structures as documents are indexed. In Bedrock Knowledge Bases, the hybrid index is configured via console or CloudFormation.
2. Fire both searches in parallel
Use OpenSearch's hybrid query, or fire two async queries (one knn, one match) and merge manually with RRF. Retrieve N candidates; start with 20.
3. Apply RRF to merge the rankings
If you use OpenSearch's native hybrid query, the search pipeline merges the results for you. If merging manually, implement the formula 1/(k+rank) with k=60 and sum the scores from each list for each document.
4. Call Cohere Rerank via Bedrock
Pass the original query and the N merged chunks. Use bedrock-runtime with invoke_model and the Cohere Rerank model ID. Receive the reordered indices and select the top-K (3 to 5 is a reasonable default).
5. Build the prompt with top-K chunks and send to the LLM
Only now do you build the final context. With fewer, more precise chunks, the LLM tends to hallucinate less and input token cost drops. Store the reranker scores for observability (lesson 12).
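Step 4 can be sketched as follows. The model ID and the request and response field names are assumptions based on the Cohere Rerank 3.5 format on Bedrock; verify them against the model's documentation for your region. The client is passed in so the function can be exercised with a fake; in real use you would create it with boto3.client("bedrock-runtime").

```python
import json

# Assumed model ID; confirm the exact value available in your region.
RERANK_MODEL_ID = "cohere.rerank-v3-5:0"


def rerank_with_bedrock(
    client,
    query: str,
    chunks: list[str],
    top_k: int = 3,
    model_id: str = RERANK_MODEL_ID,
) -> list[tuple[str, float]]:
    """Rerank chunks for a query and return (chunk, score), best first."""
    # Assumed request shape for Cohere Rerank on Bedrock.
    body = {
        "query": query,
        "documents": chunks,
        "top_n": top_k,
        "api_version": 2,
    }
    response = client.invoke_model(modelId=model_id, body=json.dumps(body))
    payload = json.loads(response["body"].read())
    # Each result is assumed to carry the index of the chunk in the
    # input list plus a relevance score, already sorted by relevance.
    return [
        (chunks[result["index"]], result["relevance_score"])
        for result in payload["results"]
    ]
```

Returning the scores alongside the chunks makes it straightforward to log them for observability before building the prompt.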

Frequently asked questions

Do I need reranking if I already use hybrid search?

Not necessarily. Hybrid search already improves recall significantly. Reranking improves top-K precision. If your evaluation tests show the retrieved chunks are already good enough, save the reranker's latency and cost.

What's the latency impact of the reranker?

It depends on the number of candidates and chunk size. Generally, expect 200-600ms additional latency for 20-50 chunks with Cohere Rerank via Bedrock. This is acceptable for most use cases but may be critical for real-time applications.

Can I use an open-source reranker instead of Cohere?

Yes. Models like cross-encoder/ms-marco-MiniLM-L-6-v2 (HuggingFace) work well and can be hosted on SageMaker. The trade-off is endpoint operation vs. Cohere's per-call cost. For high volumes, a self-hosted model may be cheaper; for low volumes, managed Cohere is simpler.

Does Bedrock Knowledge Bases do reranking automatically?

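Not in the flow described in this lesson. You use the retrieve API to get chunks and then call the reranker separately before passing them to the LLM. Newer Bedrock releases also offer a reranking configuration on the retrieve call, so check the current API reference. Lesson 09 details what Knowledge Bases manages and what remains your responsibility.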

Key takeaways from this lesson

Vector search fails on exact terms (acronyms, codes, IDs). BM25 fails on natural language. Use both.

RRF is the most robust way to merge heterogeneous rankings — it uses position, not absolute score.

The pattern is: retrieve many (N=20-50) → rerank → keep few (K=3-5). High recall, high precision.

Rerankers are cross-encoders: they read query and chunk together, capturing dependencies embeddings miss.

Reranking has cost and latency. Only enable it when evaluation tests justify it — it's not a mandatory default.

Long chunks are truncated by the reranker. Align chunk size (lesson 03) with the reranking model's token limit.


