# Playbook: RAG in Production — The 15-Item Checklist Before You Go Live

Dumping documents into a vector store is not production RAG — it's a prototype waiting to fail publicly. This playbook covers the 15 concrete, testable items that separate a reliable RAG pipeline from one that hallucinates, leaks PII, and blows budget without anyone noticing. The uncomfortable truth: ~80% of hallucinations are retrieval failures, not model failures.

- URL: https://fernando.moretes.com/studies/playbook-rag-em-producao-checklist

- Markdown: https://fernando.moretes.com/studies/playbook-rag-em-producao-checklist/study.md?lang=en

- Type: Playbook

- Domain: AI / RAG

- Date: 2025-09-15

- Tags: RAG, GenAI, AWS Bedrock, vector search, production, evaluation, guardrails, OpenSearch

- Reading time: 9 min

---

Everyone can get a RAG demo working in 30 minutes. The problem is that 90% of those demos should never reach production as-is — arbitrary chunking, no evaluation, no cost control, no guardrails. This playbook is the checklist I apply before signing off on any RAG deployment in a real environment.

## TL;DR — What you'll be able to decide after this playbook

- Identify the 5 risk groups in any RAG pipeline: ingestion, retrieval, generation, evaluation, and guardrails/cost
- Apply the 15 checklist items in a concrete, testable way before each deploy
- Decide between semantic-only search and hybrid+reranking based on real recall, precision, cost, and latency
- Stop swapping models when the real problem is retrieval quality
- Measure cost per query from day zero — not after the bill arrives

## The mental model that unlocks everything: RAG is a data pipeline, not an AI feature

Most teams treat RAG as if it were an API call with one extra step. It isn't. RAG is a data pipeline with at least six sequential stages — and each stage has its own silent failure modes. When the system answers incorrectly, the instinctive blame goes to the LLM. In practice, the LLM is often the most reliable part of the system.

Think of it this way: the LLM can only answer based on what you put in the context. If the retrieved chunks are wrong, incomplete, duplicated, or out of order, the model will do its best with garbage — and that looks like hallucination, but it's classic garbage-in, garbage-out. The estimate circulating in the community, which matches what I see in the field, is that roughly 80% of incorrect responses in RAG originate in the retrieval phase: wrong chunk, poorly calibrated similarity threshold, missing tenant filter, or simply an embedding that doesn't capture the domain.

The correct mental model is: **you are building a search system with a natural language generator at the end**. This completely changes the priorities. Before tuning temperature, before swapping models, before sophisticated prompt engineering — fix the search. Measure recall. Measure precision. Only then move up the stack.
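"Measure recall, measure precision" can be made concrete with very little code. The sketch below assumes you already have a ranked list of chunk IDs from the retriever and a ground-truth set of relevant chunk IDs per query; the IDs in the example are hypothetical.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are actually relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for chunk_id in top_k if chunk_id in relevant) / len(top_k)


# Hypothetical example: IDs returned by the retriever vs. ground truth.
retrieved_ids = ["c7", "c2", "c9", "c4", "c1"]
relevant_ids = {"c2", "c4", "c8"}

print(recall_at_k(retrieved_ids, relevant_ids, k=5))     # 2 of 3 relevant chunks found
print(precision_at_k(retrieved_ids, relevant_ids, k=5))  # 2 of 5 results are relevant
```

Averaging these two numbers over the whole test set gives you a baseline to compare against before touching the prompt or the model.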

In the AWS context, this means treating Amazon OpenSearch Serverless (or another vector store) as a first-class data component, with the same rigor you'd apply to a relational database in production: index versioning, reindexing strategy, distribution drift monitoring, and regression tests when you swap the embedding model.
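As one illustration of index versioning, here is a minimal sketch of a blue/green alias switch. It only builds the request body in the shape used by the OpenSearch `_aliases` API; the alias and index names are hypothetical, and I am assuming that endpoint is available to your collection, so verify that before relying on it.

```python
from typing import Any, Dict


def build_alias_swap(alias: str, old_index: str, new_index: str) -> Dict[str, Any]:
    """Request body for POST /_aliases that moves an alias in one atomic call."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names: queries always hit the alias, never a concrete index.
cutover = build_alias_swap("kb-active", "kb-embed-v1", "kb-embed-v2")

# Rollback is the same call with the indices reversed.
rollback = build_alias_swap("kb-active", "kb-embed-v2", "kb-embed-v1")
```

Because the query path only knows the alias, the old index can stay in place until the regression tests on the new one pass.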

## Quick Reference — Stack and Scope

- **Domain:** Generative AI / Production RAG
- **AWS reference stack:** Amazon Bedrock Knowledge Bases, OpenSearch Serverless (vector engine), Bedrock Evaluation, Bedrock Guardrails
- **Relevant embedding models:** Amazon Titan Embeddings V2, Cohere Embed v3 (via Bedrock)
- **Relevant generation models:** Claude 3.x (Sonnet/Haiku), Amazon Nova Pro/Lite (via Bedrock)
- **Recommended search pattern:** Hybrid (semantic + BM25) + cross-encoder reranking
- **Key evaluation metrics:** Faithfulness, Answer Relevance, Context Recall, Context Precision
- **Playbook scope:** Domain-agnostic; examples anchored in AWS; applicable to any RAG stack

## The five risk groups — why the checklist is structured this way

I organize the 15 items into five groups because each group has a different failure profile and a different owner within the team.

**Group 1 — Ingestion** is where most problems are born and where the fewest people look after the initial deploy. Bad chunking contaminates the entire index. Missing metadata eliminates the possibility of efficient filters. Duplicates inflate the index and distort similarity scores. And without index versioning, you can't roll back when a new embedding model breaks quality.
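To make the chunking and duplicate points tangible, here is a minimal sketch assuming Markdown input. It splits on headings so each chunk is a complete section, then drops chunks whose SHA-256 content hash was already seen. Normalizing whitespace before hashing is my assumption, and the sample document is hypothetical. A real parser would also need to skip `#` lines inside fenced code.

```python
import hashlib
import re
from typing import Dict, List


def chunk_by_headings(markdown_text: str) -> List[Dict[str, str]]:
    """Split a Markdown document into one chunk per heading-delimited section."""
    chunks: List[Dict[str, str]] = []
    section = "preamble"
    buffer: List[str] = []

    def flush() -> None:
        body = "\n".join(buffer).strip()
        if body:
            chunks.append({"section": section, "text": body})

    for line in markdown_text.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            flush()  # close the previous section before starting a new one
            section = match.group(1).strip()
            buffer.clear()
        else:
            buffer.append(line)
    flush()
    return chunks


def content_hash(text: str) -> str:
    """SHA-256 of whitespace-normalized text (normalization is an assumption)."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(chunks: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first occurrence of each distinct chunk body."""
    seen = set()
    unique = []
    for chunk in chunks:
        digest = content_hash(chunk["text"])
        if digest in seen:
            continue
        seen.add(digest)
        unique.append({**chunk, "content_hash": digest})
    return unique


doc = """# Refunds
Refunds are issued within 5 days.

# Shipping
Orders ship in 48 hours.

# Refunds (copy)
Refunds are issued  within 5 days.
"""
unique_chunks = deduplicate(chunk_by_headings(doc))
print([c["section"] for c in unique_chunks])
```

Storing `content_hash` alongside the chunk also gives the pipeline a cheap key for update-instead-of-insert on later runs.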

**Group 2 — Retrieval** is the heart of the system. The decision between purely semantic search and hybrid search (semantic + BM25 + reranking) has a direct, measurable impact on recall and precision. Poorly calibrated top-k or missing tenant filters are information leakage vectors in multi-tenant systems — a real security risk, not a theoretical one.
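The tenant filter is easiest to enforce when the query builder refuses to produce an unfiltered query. The sketch below builds an OpenSearch k-NN query body with a filter inside the `knn` clause, plus a helper you could use in an isolation test. The field names `embedding` and `tenant_id` are assumptions, and `tenant_id` is assumed to be mapped as a keyword field.

```python
from typing import Any, Dict, List


def build_tenant_knn_query(
    vector: List[float],
    tenant_id: str,
    k: int = 5,
    vector_field: str = "embedding",
) -> Dict[str, Any]:
    """k-NN query body that always carries a tenant filter."""
    if not tenant_id:
        raise ValueError("tenant_id is required; refusing to build an unfiltered query")
    return {
        "size": k,
        "query": {
            "knn": {
                vector_field: {
                    "vector": vector,
                    "k": k,
                    # Filter applied as part of the k-NN search, not after it.
                    "filter": {"term": {"tenant_id": tenant_id}},
                }
            }
        },
    }


def find_cross_tenant_hits(hits: List[Dict[str, Any]], tenant_id: str) -> List[Dict[str, Any]]:
    """Return any hit whose stored tenant_id differs from the caller's tenant."""
    return [hit for hit in hits if hit.get("_source", {}).get("tenant_id") != tenant_id]


query = build_tenant_knn_query([0.12, 0.48, 0.31], tenant_id="tenant-a", k=3)
```

In an integration test, run the query as tenant A against an index that also holds tenant B data and assert that `find_cross_tenant_hits` returns an empty list.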

**Group 3 — Generation** is where the LLM comes in — and where people spend too much time. The items here are simpler than they seem: citing the source in the output (traceability), having a valid 'I don't know' response when the context doesn't support the answer (avoiding confabulation), and using structured output when downstream needs to parse the response.
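The three generation items fit in one small sketch: a prompt that asks for JSON with citations and allows an explicit abstention, and a validator that downgrades anything unparseable or uncited to that abstention. The JSON shape, the abstention wording, and the `source_id` field are assumptions for illustration, not a Bedrock format.

```python
import json
from typing import Any, Dict, List

ABSTAIN = "I don't know based on the provided documents."


def build_prompt(question: str, chunks: List[Dict[str, str]]) -> str:
    """Prompt that requires citations and permits an explicit abstention."""
    context = "\n\n".join(f"[{c['source_id']}] {c['text']}" for c in chunks)
    return (
        "Answer using only the context below and cite the source_id of every chunk you used.\n"
        f'If the context does not support an answer, reply with answer "{ABSTAIN}" '
        "and an empty citations list.\n"
        'Respond as JSON: {"answer": "...", "citations": ["source_id", ...]}\n\n'
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def validate_answer(raw: str, chunks: List[Dict[str, str]]) -> Dict[str, Any]:
    """Accept the model output only if it parses and cites retrieved chunks."""
    allowed = {c["source_id"] for c in chunks}
    abstain = {"answer": ABSTAIN, "citations": []}
    try:
        data = json.loads(raw)
        answer = str(data["answer"])
        citations = [str(c) for c in data["citations"]]
    except (json.JSONDecodeError, KeyError, TypeError):
        return abstain
    if answer == ABSTAIN or not citations or not set(citations) <= allowed:
        return abstain
    return {"answer": answer, "citations": citations}


chunks = [{"source_id": "doc-1", "text": "Refunds are issued within 5 days."}]
print(validate_answer('{"answer": "Within 5 days.", "citations": ["doc-1"]}', chunks))
print(validate_answer('{"answer": "Within 2 days.", "citations": ["doc-9"]}', chunks))
```

Treating an unknown citation as an abstention is a deliberately strict choice; a softer variant could strip the bad citation and log it instead.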

**Group 4 — Evaluation** is the most neglected group in projects I've seen. Without an evaluation dataset with ground truth, you're flying blind. Without regression tests, every swap of embedding model or LLM is a leap of faith. Bedrock Evaluation offers faithfulness and relevance metrics that can be integrated into a CI/CD pipeline.
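A regression gate for CI/CD can be as simple as comparing the current evaluation run against a stored baseline. The sketch below assumes you have already exported metric scores as plain dictionaries (from Bedrock Evaluation or any other tool); the scores and the 0.02 tolerance are hypothetical.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.02,
) -> List[str]:
    """List the metrics where the candidate run is worse than baseline by more than tolerance."""
    regressions = []
    for metric, base_value in baseline.items():
        new_value = candidate.get(metric)
        if new_value is None:
            regressions.append(f"{metric}: missing from candidate run")
        elif base_value - new_value > tolerance:
            regressions.append(f"{metric}: {base_value:.2f} -> {new_value:.2f}")
    return regressions


# Hypothetical scores for illustration only.
baseline = {"faithfulness": 0.91, "answer_relevance": 0.88,
            "context_recall": 0.84, "context_precision": 0.79}
candidate = {"faithfulness": 0.92, "answer_relevance": 0.87,
             "context_recall": 0.76, "context_precision": 0.80}

for problem in find_regressions(baseline, candidate):
    print("REGRESSION", problem)
# In a CI job, a non-empty list would fail the build (for example via sys.exit(1)).
```

The tolerance exists because evaluation scores are noisy; set it from the run-to-run variance you observe on an unchanged system.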

**Group 5 — Guardrails and Cost** closes the loop. PII in retrieved context is a compliance problem, not a UX problem. Prompt injection via malicious documents is a real attack vector. And unmeasured cost per query is the fastest path to an unpleasant surprise at the end of the month — especially with reranking and multiple embedding calls adding up tokens.
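Cost per query is just arithmetic over token counts, which is why there is no excuse for not measuring it. The sketch below uses placeholder prices that are not real Bedrock rates; plug in the current pricing for your embedding, reranking, and generation models. The token counts in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """Placeholder USD prices; replace with the current rates for your models."""
    embed_per_1k_tokens: float
    llm_input_per_1k_tokens: float
    llm_output_per_1k_tokens: float
    rerank_per_query: float


def cost_per_query(
    prices: Prices,
    query_tokens: int,
    context_tokens: int,
    output_tokens: int,
    reranked: bool = True,
) -> float:
    """Estimated USD cost of one query: embedding + rerank + LLM input + LLM output."""
    embed = query_tokens / 1000 * prices.embed_per_1k_tokens
    llm_in = (query_tokens + context_tokens) / 1000 * prices.llm_input_per_1k_tokens
    llm_out = output_tokens / 1000 * prices.llm_output_per_1k_tokens
    rerank = prices.rerank_per_query if reranked else 0.0
    return embed + llm_in + llm_out + rerank


placeholder = Prices(
    embed_per_1k_tokens=0.0001,
    llm_input_per_1k_tokens=0.003,
    llm_output_per_1k_tokens=0.015,
    rerank_per_query=0.001,
)
per_query = cost_per_query(placeholder, query_tokens=50, context_tokens=4000, output_tokens=400)
print(f"per query: ${per_query:.4f}, per day at 10,000 queries: ${per_query * 10_000:.2f}")
```

Emitting this estimate as a metric on every request lets you alert on the daily total instead of discovering it on the invoice.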

## The 15-Item Checklist — Concrete and Testable

1. **INGESTION #1 — Chunk by document structure, not fixed size** — Test: take 10 representative documents and inspect chunks manually. A valid chunk should contain a complete semantic unit (section, paragraph, list item). Chunks that cut mid-sentence or separate question from answer are failures. Use heading-aware or sentence-boundary chunking. Bedrock Knowledge Bases supports hierarchical and semantic chunking natively.

2. **INGESTION #2 — Structured metadata on every chunk** — Test: query the vector store and verify every chunk has at least: source_id, document_title, section, tenant_id (if multi-tenant), created_at, version. Without these fields, retrieval filters and source traceability are impossible. In OpenSearch Serverless, define the mapping explicitly before indexing.

3. **INGESTION #3 — Deduplication before indexing** — Test: compute SHA-256 hash of each chunk's content before inserting. Verify the pipeline rejects or updates duplicates instead of re-inserting. Duplicates inflate the index, distort similarity scores, and cause the same passage to appear multiple times in context, wasting tokens.

4. **INGESTION #4 — Versioned reindexing with rollback** — Test: swap the embedding model for a new version and verify you can keep the old index active while the new one is built (index blue/green). In OpenSearch Serverless, use index aliases. Without this, any embedding upgrade is a risky operation with downtime or silent degradation.

5. **RETRIEVAL #5 — Hybrid search enabled (semantic + BM25)** — Test: build a set of 20 test queries with ground-truth expected chunks. Measure recall@5 for semantic-only search and for hybrid search. If hybrid recall@5 does not match or exceed semantic-only recall@5 on at least 70% of the queries, revise the fusion weights. Hybrid search is especially critical for queries with exact terms (IDs, proper names, codes).

6. **RETRIEVAL #6 — Cross-encoder reranking before passing to LLM** — Test: compare MRR (Mean Reciprocal Rank) of top-k before and after reranking on your test dataset. The reranker should consistently move the most relevant chunk to higher positions. Without reranking, you pass the k closest chunks in embedding space — which is not the same as the k most relevant for the query.

7. **RETRIEVAL #7 — Top-k calibrated for domain, not left at default** — Test: measure the distribution of actually useful chunks per query in your dataset. If 80% of queries need at most 3 chunks, passing top-k=10 to the LLM wastes tokens and dilutes context. If some queries need 8 chunks, top-k=3 will truncate. Calibrate by percentile, not intuition.

8. **RETRIEVAL #8 — Metadata/tenant filter mandatory in multi-tenant** — Test: with two tenants A and B in the same index, make a query authenticated as tenant A and verify no tenant B chunks appear in results. This is not optional — it is data isolation. In OpenSearch Serverless, use pre-filters in the kNN query. In Bedrock Knowledge Bases, use metadata filtering.
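The tests for items #6 and #7 reduce to two small computations: MRR over a labeled query set, and a nearest-rank percentile over the number of useful chunks per query. The sketch below assumes you have ranked chunk IDs before and after reranking plus ground truth; all data shown is hypothetical.

```python
import math
from typing import Dict, List, Set


def mean_reciprocal_rank(
    rankings: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
) -> float:
    """Average of 1/rank of the first relevant chunk per query (0 when none is retrieved)."""
    if not rankings:
        return 0.0
    total = 0.0
    for query_id, ranked_ids in rankings.items():
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            if chunk_id in relevant.get(query_id, set()):
                total += 1.0 / rank
                break
    return total / len(rankings)


def top_k_for_percentile(useful_counts: List[int], percent: int = 90) -> int:
    """Smallest k that covers `percent` of queries (nearest-rank percentile)."""
    if not useful_counts:
        raise ValueError("useful_counts must not be empty")
    ordered = sorted(useful_counts)
    index = max(0, math.ceil(percent * len(ordered) / 100) - 1)
    return ordered[index]


# Hypothetical rankings for two queries, before and after reranking.
relevant = {"q1": {"b"}, "q2": {"f"}}
before = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}
after = {"q1": ["b", "a", "c"], "q2": ["d", "f", "e"]}
print(mean_reciprocal_rank(before, relevant), mean_reciprocal_rank(after, relevant))

# Hypothetical count of useful chunks per query across ten test queries.
print(top_k_for_percentile([1, 2, 2, 3, 3, 3, 3, 4, 5, 8], percent=90))
```

If MRR does not improve after reranking on your own data, the reranker is adding latency and cost without earning its place.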

## Semantic-Only Search vs. Hybrid + Reranking

| Criterion | Semantic-Only (pure kNN) | Hybrid (kNN + BM25) | Hybrid + Cross-Encoder Reranking |
| --- | --- | --- | --- |
| Recall on natural language queries | High | High | High |
| Recall on exact-term queries (IDs, codes) | Low — embeddings over-generalize | High — BM25 captures exact terms | High |
| Precision in final top-k | Medium — many false positives from embedding similarity | Medium-High | High — reranker reorders by actual relevance |
| Added latency | Baseline (~10-50ms in OpenSearch Serverless) | +5-15ms (score fusion) | +50-200ms (reranker call) |
| Additional cost per query | None beyond query embedding | Minimal | Moderate — depends on reranking model |
| Implementation complexity | Low | Medium — requires BM25 index config + fusion | High — requires separate reranking service or Bedrock model |
| When to use | Prototypes, domains with very homogeneous language | Most production cases | High precision required, technical domain, long documents |
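The "score fusion" step in the hybrid column can take several forms. One common option is Reciprocal Rank Fusion (RRF), sketched below; I am naming it as an example, not as the fusion this stack necessarily uses. It needs only the two ranked ID lists, so it avoids normalizing kNN and BM25 scores onto the same scale. The chunk IDs are hypothetical, and `k=60` is the constant commonly used with RRF.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each list contributes 1 / (k + rank) to a document's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


semantic_hits = ["c3", "c1", "c7"]  # hypothetical kNN ranking
bm25_hits = ["c9", "c3", "c2"]      # hypothetical BM25 ranking
print(reciprocal_rank_fusion([semantic_hits, bm25_hits]))
```

A chunk that appears in both lists, like `c3` here, rises to the top, which is the behavior you want for queries that mix natural language with exact terms.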

## Production RAG Pipeline — Ingest → Index → Retrieve → Rerank → Generate → Guardrail → Eval

Complete flow of a production RAG pipeline on AWS. The upper path is the ingestion pipeline (offline/batch). The lower path is the query pipeline (online/real-time). Evaluation is a continuous loop that feeds improvements to both paths.

### 📥 Ingestion (Offline)

- Sources: S3 / SharePoint / DB (storage)
- Parser: structure + metadata (compute)
- Dedup: SHA-256 hash (compute)
- Chunker: semantic / hierarchical (compute)
- Embedding model: Titan V2 / Cohere (ai)

### 🗄️ Index (Versioned)

- OpenSearch Serverless: vector + BM25 index (data)
- Index alias: blue/green (data)

### 🔍 Retrieval (Online)

- User: query (user)
- Embedding: query (ai)
- Hybrid search: kNN + BM25 + tenant filter (compute)
- Reranker: cross-encoder (ai)

### 🤖 Generation + Guardrails

- Prompt builder: context + instruction (compute)
- Bedrock Guardrails: PII + injection filter (security)
- LLM: Claude / Nova (ai)
- Output: response + citations (frontend)

### 📊 Continuous Evaluation

- Dataset: ground truth (data)
- Bedrock Evaluation: faithfulness / relevance (ai)
- CloudWatch: cost + latency + alerts (messaging)

### Flows

- src -> parser: extract
- parser -> dedup: raw chunks
- dedup -> chunker: deduplicated
- chunker -> embed_ingest: chunks + metadata
- embed_ingest -> oss: vectors
- oss -> alias: active version
- user -> embed_query: query
- embed_query -> hybrid: vector + text
- alias -> hybrid: active index
- hybrid -> rerank: top-k candidates
- rerank -> prompt: reordered chunks
- prompt -> guardrail: full prompt
- guardrail -> llm: filtered prompt
- llm -> output: response + citations
- output -> bedrock_eval: response log
- eval_ds -> bedrock_eval: ground truth
- bedrock_eval -> cw: metrics
- cw -> chunker: feedback loop

## What to do when evaluation metrics regress

When you run the evaluation dataset and see a regression, the instinct is to go straight to the LLM or the prompt. Most of the time, that's the wrong place to start. I follow a layered diagnostic protocol:

**First, isolate the stage.** Measure context recall separately from faithfulness. If context recall dropped but faithfulness is stable, the problem is in retrieval — chunking, embedding, similarity threshold, or filters. If faithfulness dropped but context recall is fine, then the problem may be in the prompt or model.

**Second, check for distribution shift in documents.** New documents with different structure, new terminology, or a different language can degrade embedding quality without any code change. Monitor the distribution of similarity scores over time — a drop in the average top-1 chunk score is an early warning signal.

**Third, before swapping models, tune retrieval parameters.** Changing the minimum similarity threshold, adjusting hybrid fusion weights, or calibrating top-k often resolves regressions without the cost and risk of a model swap. Swapping the embedding model means reindexing everything — it's an expensive operation that requires the regression tests from item #13.

**Fourth, maintain a log of evaluation decisions.** Every time you accept a regression in one metric in exchange for improvement in another, document it. This prevents the team from reverting already-made decisions and creates a trade-off history that is valuable as the system grows.
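The first step of the protocol, isolating the stage, can be encoded as a tiny triage function. This is a sketch under assumptions: the metric names match the ones used in this playbook, the 0.02 tolerance is hypothetical, and the "both dropped" branch (start with retrieval) is my own reading of the fix-the-search-first rule rather than something the metrics dictate.

```python
from typing import Dict


def diagnose_regression(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,
) -> str:
    """Point at the pipeline stage to inspect first, based on which metric dropped."""
    recall_dropped = baseline["context_recall"] - current["context_recall"] > tolerance
    faithfulness_dropped = baseline["faithfulness"] - current["faithfulness"] > tolerance

    if recall_dropped and faithfulness_dropped:
        # Assumption: fix retrieval first, then re-measure faithfulness.
        return "retrieval-then-generation"
    if recall_dropped:
        return "retrieval"   # chunking, embedding, similarity threshold, filters
    if faithfulness_dropped:
        return "generation"  # prompt or model
    return "none"


# Hypothetical scores: recall fell, faithfulness held.
print(diagnose_regression(
    {"context_recall": 0.84, "faithfulness": 0.91},
    {"context_recall": 0.75, "faithfulness": 0.90},
))
```

Logging the returned label next to each evaluation run also gives you raw material for the decision log described in the fourth step.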

> **Anti-patterns I see in every RAG project:**
>
> **1. Fixed-size chunking by characters or tokens.** The most common and most destructive pattern. A 512-token chunk that cuts through a table or separates question from answer in a FAQ will produce useless retrievals regardless of which embedding you use. Always inspect chunks manually before indexing.
>
> **2. Deploy without an evaluation dataset.** 'We tested manually with a few queries and it seemed fine' is not evaluation. Without ground truth, you have no way to know if a change improved or degraded the system. Building the dataset is work, but it's the only way to have confidence in changes.
>
> **3. Unmeasured cost per query.** Reranking + embedding + LLM with large context adds up fast. I've seen projects with $0.15-0.30 per query that nobody had measured. At 10,000 queries/day, that's $1,500-3,000/day. Configure estimated cost metrics from the first deploy.
>
> **4. Swapping models as the first resort when quality is poor.** Claude 3.5 Sonnet won't fix bad chunks. GPT-4o won't fix missing tenant filters. The model is the last place to optimize — fix the search first.
>
> **5. Ignoring prompt injection via documents.** If users can upload documents that will be indexed, a malicious document with embedded instructions can manipulate LLM behavior. Use Bedrock Guardrails and sanitize document content before indexing.

> **Rule of Thumb:** **If the answer is wrong, blame the search before blaming the model.** ~80% of quality problems in RAG originate in retrieval. Before swapping LLMs, measure context recall. If the right chunk isn't in the top-k, no model will generate the right answer.

> **My senior perspective — what I actually do in practice:**
>
> When I start a RAG project, the first thing I do is not choose the LLM — it's understanding the document structure. Documents with predictable structure (PDFs with headings, JSONs, markdowns) allow deterministic, high-quality chunking. Scanned documents, form PDFs, or mixed content are where most projects sink before even reaching the model.
>
> My invariable priority order: (1) correct ingestion and chunking, (2) complete metadata, (3) hybrid search with filters, (4) evaluation dataset with at least 50 pairs, (5) reranking, (6) guardrails, (7) only then prompt optimization and model selection. I never reverse this order.
>
> On AWS, I use Bedrock Knowledge Bases for the happy path — hierarchical chunking + OpenSearch Serverless + native citations save weeks of work. But for systems with strict multi-tenancy requirements or very specialized domains, I build the pipeline manually with more control over the index and filters.
>
> The item most frequently absent when I receive a project for review is #12 — the evaluation dataset. Without it, every architecture decision is based on intuition. With it, you have a data-driven conversation. It is the lowest-cost, highest-return investment in any RAG project.

## Verdict

Reliable RAG in production is not a model problem — it's a data engineering problem with an LLM at the end. The 15 items in this checklist are not optional best practices: they are the line that separates a system you can operate and improve with confidence from one you just hope works. Fix the search before swapping the model. Measure before optimizing. And never deploy without an evaluation dataset — you're just deferring the discovery of the problem to the worst possible moment.

## References

- [Amazon Bedrock Knowledge Bases — Overview](https://aws.amazon.com/bedrock/knowledge-bases/)
- [Amazon Bedrock — Retrieval Augmented Generation (User Guide)](https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base.html)
- [Amazon OpenSearch Serverless — Vector Engine for Semantic Search](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-vector-search.html)
- [Amazon Bedrock — Evaluate RAG Systems](https://docs.aws.amazon.com/bedrock/latest/userguide/evaluation.html)

