Hybrid search and reranking
Combine semantic and keyword search, then reorder the results with a reranker to improve retrieval.
Vector search is powerful, but it fails at something basic: exact terms. If the user types "CVE-2024-1234" or "API v3.2", embeddings won't save you — they'll return semantically close chunks that are probably wrong. The solution isn't to switch approaches; it's to combine both and then reorder the results with a model that actually understands the question.
The limit of vector-only search
In lesson 02 we saw that embeddings capture meaning. That's great for questions like "how do I cancel my subscription?", where language variations exist. But consider these cases:
- A developer searches for `NullPointerException` in error logs.
- A security analyst queries the code `CVE-2024-1234`.
- A user types an exact product name: `XR-7000 Pro`.
The embedding model will vectorize these terms and find neighbors in semantic space. The problem is that CVE-2024-1234 and CVE-2024-5678 end up very close in vector space because they are structurally similar, so the system can retrieve the wrong chunk with a high similarity score.
Lexical search (BM25, the classic term-frequency ranking algorithm) doesn't have this problem. It treats CVE-2024-1234 as a literal string and only returns documents containing that exact token. It's deterministic, fast, and needs no GPU.
BM25's weakness is the inverse: no synonyms, no tolerance for variation. "Cancel subscription" won't find "terminate contract". For natural language, it loses badly to embeddings.
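To make the contrast concrete, here is a minimal sketch of BM25 scoring in plain Python. It assumes whitespace tokenization and the common defaults `k1=1.5` and `b=0.75`; the sample documents are invented for illustration, and a real engine such as OpenSearch handles tokenization and scoring details differently.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    avg_len = sum(len(tokens) for tokens in tokenized) / len(tokenized)
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # no literal match, so no contribution
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Invented sample documents (assumption, for illustration only).
docs = [
    "patch notes for cve-2024-1234 in the auth service",
    "patch notes for cve-2024-5678 in the billing service",
    "how to terminate your contract",
]
exact = bm25_scores("cve-2024-1234", docs)
synonym = bm25_scores("cancel subscription", docs)
```

In this sketch only the first document gets a non-zero score for the exact CVE code, while "cancel subscription" scores zero everywhere because no document contains those literal tokens.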
Neither approach is superior in every scenario. The right answer is to run both in parallel and merge the results — that's hybrid search.
Hybrid search: combining both worlds
The mechanics are simple: you fire two searches in parallel — one vector, one lexical — and each returns a ranked list of chunks. The challenge is merging those lists into one, since the scores are incomparable (cosine similarity vs. BM25 score).
The most widely used technique in practice is Reciprocal Rank Fusion (RRF). The idea is elegant: instead of normalizing scores (which is brittle), you use only each document's position in the lists.
The formula is:
RRF(doc) = Σ_i 1 / (k + rank_i(doc))
Where k is a constant (typically 60) and rank_i is the document's position in list i. A document that appears 3rd in vector search and 5th in BM25 will have a high RRF score. A document that only appears in one list will score lower.
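The formula can be sketched in a few lines of Python. This is a minimal illustration, assuming each search returns a list of document IDs ordered from best to worst; the function name `reciprocal_rank_fusion` and the sample IDs are hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists of document IDs using only their positions."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

vector_hits = ["a", "b", "c", "d"]      # "c" is 3rd in vector search
bm25_hits = ["e", "f", "g", "h", "c"]   # "c" is 5th in BM25
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

With k = 60, document "c" scores 1/63 + 1/65 ≈ 0.0313, while a document ranked first in only one list scores 1/61 ≈ 0.0164, so "c" ends up at the top of the fused list.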
OpenSearch Serverless and OpenSearch Service support hybrid search with RRF natively — you configure the hybrid query type and set weights for each sub-query. Amazon Bedrock Knowledge Bases also exposes this option in the console and via API.
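As a rough sketch of what that configuration can look like, the helpers below build a hybrid query body and a search pipeline definition as plain Python dicts. The field names `content` and `embedding` are hypothetical, and the `score-ranker-processor` with the `rrf` technique reflects recent OpenSearch releases as I understand them; check processor names and availability against the documentation for your version or for Serverless before relying on it.

```python
def build_hybrid_query(query_text, query_vector, size=20):
    """Request body for an OpenSearch `hybrid` query (assumed field names)."""
    return {
        "size": size,
        "query": {
            "hybrid": {
                "queries": [
                    # Lexical sub-query, scored with BM25.
                    {"match": {"content": {"query": query_text}}},
                    # Vector sub-query, approximate nearest neighbors.
                    {"knn": {"embedding": {"vector": query_vector, "k": size}}},
                ]
            }
        },
    }

def build_rrf_pipeline(rank_constant=60):
    """Search pipeline body that fuses sub-query results with RRF (assumed)."""
    return {
        "description": "Hybrid search fused with reciprocal rank fusion",
        "phase_results_processors": [
            {
                "score-ranker-processor": {
                    "combination": {
                        "technique": "rrf",
                        "rank_constant": rank_constant,
                    }
                }
            }
        ],
    }

body = build_hybrid_query("CVE-2024-1234", [0.1, 0.2, 0.3])
pipeline = build_rrf_pipeline()
```

The pipeline body would be registered once under a name of your choice, and each search request would then reference that pipeline.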
An important parameter is the relative weight between vector and lexical search. There's no universal value: it depends on your domain. For technical documentation with lots of codes and acronyms, I give more weight to BM25. For natural-language FAQ, more weight to vector. Start 50/50 and adjust based on evaluation results (lesson 08).
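One simple way to express that relative weight when merging manually is a weighted variant of RRF, sketched below. This is an assumption for illustration, not necessarily how OpenSearch or Knowledge Bases apply weights internally; `weighted_rrf` and `vector_weight` are hypothetical names.

```python
def weighted_rrf(vector_hits, lexical_hits, vector_weight=0.5, k=60):
    """RRF where each list's contribution is scaled by a weight in [0, 1]."""
    lexical_weight = 1.0 - vector_weight
    scores = {}
    weighted_lists = ((vector_weight, vector_hits), (lexical_weight, lexical_hits))
    for weight, ranking in weighted_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Technical docs full of codes and acronyms: lean toward BM25.
technical = weighted_rrf(["v1"], ["l1"], vector_weight=0.2)
# Natural-language FAQ: lean toward vector search.
faq = weighted_rrf(["v1"], ["l1"], vector_weight=0.8)
```

Treat `vector_weight` as a tunable parameter: start at 0.5 and move it only when the evaluation results point in one direction.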
Flow: hybrid search + reranking
Full pipeline from user query to final chunks sent to the LLM. Both searches run in parallel; RRF merges the rankings; the reranker selects the best candidates.
- Vector search · ANN / cosine
- Lexical search · BM25 / keyword
- Reciprocal Rank Fusion (RRF)
- Top-N candidates · (e.g. 20-50 chunks)
- Reranker · Cohere Rerank / Bedrock
- Final top-K · (e.g. 3-5 chunks)
- LLM (Claude / Titan) · Prompt + context
Reranking: the second filter that changes the game
Hybrid search improves recall — you retrieve more relevant chunks. But the ranking is still imperfect. RRF doesn't know why the question was asked; it only knows a chunk ranked well in two lists.
The reranker solves this. It's a cross-encoder: it receives the query and each chunk together as input and produces a true relevance score. Unlike embeddings (which vectorize query and document separately), the cross-encoder reads both at the same time and can capture subtle dependencies — for example, that the answer to the question is in the third sentence of the chunk, not the first.
The usage pattern is: retrieve many candidates (20, 50, even 100 chunks) with hybrid search, then pass everything through the reranker and keep the top-3 or top-5. This works because rerankers are expensive per call but you use them once per query, over a small set.
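The pattern itself is small enough to sketch. In the example below, `score_fn` stands in for whatever reranking model you call; `overlap_score` is a deliberately crude, hypothetical substitute (word overlap) used only so the sketch runs without a model.

```python
def rerank(query, candidates, score_fn, top_k=3):
    """Score every (query, chunk) pair and keep the top_k best chunks."""
    scored = [(score_fn(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]

def overlap_score(query, chunk):
    """Toy stand-in for a cross-encoder: share of query words in the chunk."""
    query_words = set(query.lower().split())
    return len(query_words & set(chunk.lower().split())) / len(query_words)

candidates = [
    "billing cycles and invoices",
    "how to reset your password",
    "password rules for new accounts",
]
best = rerank("reset password", candidates, overlap_score, top_k=2)
```

The important design point is that the expensive scorer only ever sees the candidate set that hybrid search already narrowed down, never the whole corpus.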
On AWS, the most direct path is Cohere Rerank available via Amazon Bedrock. You pass the query and the list of chunks, and get back the reordered indices with scores. Integration with Knowledge Bases is still manual in this flow — you call the reranker after retrieve and before building the prompt.
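A minimal sketch of that call, assuming the request and response shapes I understand Cohere Rerank to use on Bedrock (`query`, `documents`, `top_n`, `api_version` in the request; `results` with `index` and `relevance_score` in the response). The model ID below is also an assumption; confirm both against the Bedrock documentation for your region. The client is passed in so the function can be exercised without AWS credentials.

```python
import json

# Assumed model ID; verify the one available in your account and region.
RERANK_MODEL_ID = "cohere.rerank-v3-5:0"

def rerank_with_bedrock(client, query, chunks, top_k=3):
    """Return (chunk, score) pairs ordered by relevance.

    `client` is expected to behave like boto3.client("bedrock-runtime").
    """
    body = {
        "query": query,
        "documents": chunks,
        "top_n": top_k,
        "api_version": 2,
    }
    response = client.invoke_model(
        modelId=RERANK_MODEL_ID,
        body=json.dumps(body),
    )
    payload = json.loads(response["body"].read())
    # Each result points back to the original chunk by its index.
    return [(chunks[r["index"]], r["relevance_score"]) for r in payload["results"]]

# Usage, assuming boto3 is installed and credentials are configured:
# import boto3
# client = boto3.client("bedrock-runtime")
# top = rerank_with_bedrock(client, "how do I cancel?", retrieved_chunks)
```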
An important detail: rerankers have a token limit per document. Very long chunks (from lesson 03) will be truncated. If you use 1024-token chunking, check the reranking model's limit — typically 512 tokens per passage is the safe maximum.
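A quick pre-flight check can flag chunks that are likely to be truncated. The sketch below estimates tokens from the word count with a rough factor of 1.3 tokens per word, which is an assumption; for real decisions, count with the tokenizer of the reranking model you actually use.

```python
def oversized_chunks(chunks, max_tokens=512, tokens_per_word=1.3):
    """Return (index, estimated_tokens) for chunks likely over the limit."""
    flagged = []
    for i, chunk in enumerate(chunks):
        # Rough estimate only; a real tokenizer gives the exact count.
        estimate = int(len(chunk.split()) * tokens_per_word)
        if estimate > max_tokens:
            flagged.append((i, estimate))
    return flagged

chunks = ["word " * 100, "word " * 500]
too_long = oversized_chunks(chunks)
```

Here only the second chunk is flagged, since roughly 500 words is estimated well above the 512-token limit.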
In practice, I don't enable reranking in every RAG I build. For simple cases — short FAQ, small knowledge base, predictable queries — hybrid search already works well and the reranker only adds latency and cost. I enable reranking when: (1) the base has more than 10k documents and information density is high; (2) users ask complex or ambiguous questions; (3) evaluation tests show the top-3 chunks still miss frequently. The precision gain is real, but measure before going to production — use the metrics from lesson 08 to justify the decision.
Order the hybrid + rerank flow
From the question to the few final chunks.
- Rerank the candidates by relevance
- Fuse the two lists (e.g. RRF)
- Keep only the top-N best for the context
- Retrieve candidates by vector AND by keyword
Comparing search approaches
| Criterion | Vector only | Lexical only (BM25) | Hybrid + Rerank |
|---|---|---|---|
| Exact terms / acronyms | ❌ Weak | ✅ Strong | ✅ Strong |
| Natural language / synonyms | ✅ Strong | ❌ Weak | ✅ Strong |
| Latency | Low | Very low | Medium-high |
| Cost per query | Low | Very low | Higher (reranker) |
| Top-K precision | Medium | Medium | High |
Implementing hybrid search + rerank on AWS
1. Configure the OpenSearch index with both vector and text fields

   Create an index with `knn_vector` for embeddings and a standard `text` field for BM25. OpenSearch indexes both automatically. In Bedrock Knowledge Bases, the hybrid index is configured via console or CloudFormation.

2. Fire both searches in parallel

   Use OpenSearch's `hybrid` query or fire two async queries (one `knn`, one `match`) and merge manually with RRF. Retrieve N candidates; start with 20.

3. Apply RRF to merge the rankings

   If using OpenSearch's native `hybrid` query, RRF is already built in. If merging manually, implement the formula `1/(k+rank)` with k=60 and sum scores from each list for each document.

4. Call Cohere Rerank via Bedrock

   Pass the original query and the N merged chunks. Use `bedrock-runtime` with `invoke_model` and the Cohere Rerank model ID. Receive the reordered indices and select the top-K (3 to 5 is a reasonable default).

5. Build the prompt with top-K chunks and send to the LLM

   Only now do you build the final context. Fewer chunks, more precise: the LLM will hallucinate less and input token cost drops. Store the reranker scores for observability (lesson 12).
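The manual variant of these steps can be sketched end to end with `asyncio`. Everything here is a stand-in: `vector_search` and `lexical_search` are hypothetical functions returning hard-coded IDs where real code would query OpenSearch, and `rerank` is an optional callable where real code would call the reranking model.

```python
import asyncio

async def vector_search(query, n):
    # Hypothetical stand-in for a kNN query against the vector index.
    await asyncio.sleep(0)
    return ["c2", "c1", "c3"][:n]

async def lexical_search(query, n):
    # Hypothetical stand-in for a BM25 `match` query.
    await asyncio.sleep(0)
    return ["c1", "c4", "c2"][:n]

def rrf_merge(rankings, k=60):
    """Sum 1/(k+rank) per document across all rankings."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

async def retrieve(query, n=20, top_k=3, rerank=None):
    # Step 2: fire both searches concurrently.
    vector_hits, lexical_hits = await asyncio.gather(
        vector_search(query, n), lexical_search(query, n)
    )
    # Step 3: merge the two rankings with RRF.
    candidates = rrf_merge([vector_hits, lexical_hits])
    # Step 4: optionally reorder the candidates with a reranker.
    if rerank is not None:
        candidates = rerank(query, candidates)
    # Step 5: keep only the top-K for the prompt.
    return candidates[:top_k]

final_chunks = asyncio.run(retrieve("CVE-2024-1234"))
```

Keeping the reranker as an optional argument makes it easy to compare the pipeline with and without reranking during evaluation.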
Frequently asked questions
Do I need reranking if I already use hybrid search?
Not necessarily. Hybrid search already improves recall significantly. Reranking improves top-K precision. If your evaluation tests show the retrieved chunks are already good enough, save the reranker's latency and cost.
What's the latency impact of the reranker?
It depends on the number of candidates and chunk size. Generally, expect 200-600ms additional latency for 20-50 chunks with Cohere Rerank via Bedrock. This is acceptable for most use cases but may be critical for real-time applications.
Can I use an open-source reranker instead of Cohere?
Yes. Models like cross-encoder/ms-marco-MiniLM-L-6-v2 (HuggingFace) work well and can be hosted on SageMaker. The trade-off is endpoint operation vs. Cohere's per-call cost. For high volumes, a self-hosted model may be cheaper; for low volumes, managed Cohere is simpler.
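For reference, a minimal sketch of the self-hosted route using the `CrossEncoder` class from the sentence-transformers library. The model is passed in as an argument so the function itself has no heavy dependency; the usage lines in the comments assume the library is installed and the model can be downloaded.

```python
def rerank_with_cross_encoder(model, query, chunks, top_k=3):
    """Return the top_k (chunk, score) pairs according to a cross-encoder.

    `model` is expected to expose predict(list_of_pairs), as
    sentence_transformers.CrossEncoder does.
    """
    pairs = [(query, chunk) for chunk in chunks]
    scores = model.predict(pairs)  # one relevance score per (query, chunk) pair
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]

# Usage, assuming `pip install sentence-transformers`:
# from sentence_transformers import CrossEncoder
# model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
# top = rerank_with_cross_encoder(model, "how do I cancel?", retrieved_chunks)
```

On SageMaker the same function would sit behind the endpoint's inference handler, with the model loaded once at startup.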
Does Bedrock Knowledge Bases do reranking automatically?
Not natively integrated into the managed flow. You use the retrieve API to get chunks and then call the reranker separately before passing to the LLM. Lesson 09 details what Knowledge Bases manages and what remains your responsibility.
Quick check
1. What does a reranker do?