# Design Doc: Enterprise RAG Platform with Continuous Evaluation and Guardrails on Bedrock

This document describes the architecture of an enterprise RAG platform built on Amazon Bedrock, covering semantic retrieval, continuous quality evaluation, safety guardrails, and cost control. The design prioritizes traceability, operability, and risk containment in regulated environments, without sacrificing acceptable end-user latency.

- URL: https://fernando.moretes.com/studies/design-doc-enterprise-rag-guardrails

- Markdown: https://fernando.moretes.com/studies/design-doc-enterprise-rag-guardrails/study.md?lang=en

- Type: Design Doc / RFC

- Company: Enterprise RAG (cenário)

- Domain: IA

- Date: 2026-02-05

- Tags: RAG, Bedrock, guardrails, evaluation, enterprise-ai, AWS, cost-control, security

- Reading time: 10 min

---

Building a RAG system that works in production in an enterprise environment is not the same as making a notebook work. Latency, cost, hallucination, data leakage, and auditability are real engineering problems — and each one demands a deliberate architectural decision.

## The Problem

Enterprise organizations are under pressure to expose natural language capabilities over their internal knowledge bases — regulatory documents, technical manuals, HR policies, contracts. The most direct approach is Retrieval-Augmented Generation (RAG): given a user query, retrieve relevant fragments from the corpus and provide them as context for a Large Language Model (LLM) to generate the answer.

The problem is that this approach, in its simplest form, fails on at least four dimensions critical to enterprise:

**1. Unmeasurable quality.** Without systematic evaluation, there is no way to know if the system is answering correctly, if it is degrading over time as the corpus changes, or if certain question categories consistently underperform. The engineering team is flying blind.

**2. Absence of guardrails.** LLMs can leak information from improper context, generate out-of-scope content, or be induced by prompt injection to bypass restrictions. In environments with sensitive data — financial, legal, healthcare — this is a compliance risk, not just a quality issue.

**3. Uncontrolled cost.** Every call to a frontier LLM has a per-token cost. Without context control, caching, and intelligent routing, cost scales non-linearly with usage volume. In production, this quickly becomes a budget problem.

**4. Insufficient traceability.** Auditors and security teams need to know: which document underpinned which answer, which model was invoked, which prompt version was active, and who asked the question. Naive RAG systems record none of this in a structured way.

This document proposes an architecture that treats these four problems as first-class requirements, not afterthoughts.

## Goals and Non-Goals

- ✅ GOAL: Provide answers grounded in internal documents with P95 latency < 4s for synchronous queries
- ✅ GOAL: Continuously evaluate response quality using automated metrics (faithfulness, relevance, answer correctness) via async pipeline
- ✅ GOAL: Apply content and data access guardrails via Amazon Bedrock Guardrails before each response reaches the user
- ✅ GOAL: Control cost via semantic caching, complexity-based routing, and per-session token limits
- ✅ GOAL: Produce a complete, immutable audit trail for each interaction (user, query, retrieved chunks, model, response, guardrail outcome)
- ✅ GOAL: Support multiple corpora with access isolation per user group (RBAC over vector indices)

## Scenario Fact Sheet

- **Scenario:** Enterprise RAG — internal knowledge base (regulatory + technical)
- **AI Platform:** Amazon Bedrock (models: Claude 3.5 Sonnet, Titan Embeddings V2)
- **Vector store:** Amazon OpenSearch Serverless (vector collections)
- **Estimated scale:** ~500 active users, ~10k queries/day, corpus of ~200k documents
- **AWS Region:** us-east-1 (primary), with artifact replication to us-west-2
- **Guardrails:** Amazon Bedrock Guardrails (PII, topic denial, grounding check)
- **Evaluation:** Async evaluation pipeline with RAGAS-style metrics via Lambda + DynamoDB + QuickSight
- **Cost control:** Semantic cache (ElastiCache), complexity routing, token budget per session
- **Compliance:** Immutable trail via S3 + Athena; RBAC via IAM + Cognito

## Proposed Design — Overview

The architecture is organized into five functional planes that operate relatively independently, communicating via events and well-defined APIs:

**Ingestion Plane.** Documents arrive via S3 (manual upload or integration with document management systems). An event-driven pipeline (S3 Event → EventBridge → Step Functions) executes: text extraction (Textract for scanned PDFs, native parsing for DOCX/HTML), chunking with a hybrid strategy (fixed-size with overlap + sentence boundary detection), embedding generation via Bedrock Titan Embeddings V2, and indexing into OpenSearch Serverless. Access metadata (which group can see which document) is persisted alongside the vector and used at retrieval time to filter results.

**Retrieval and Generation Plane.** The latency-critical path. The user query first passes through a semantic cache checker (ElastiCache + embedding comparison with configurable threshold). A cache hit returns directly, without invoking the LLM. A cache miss proceeds to: (1) query embedding generation, (2) hybrid search in OpenSearch (kNN + BM25 with Reciprocal Rank Fusion), (3) reranking of top-K chunks with a lightweight cross-encoder model, (4) prompt assembly with RBAC-filtered context, (5) LLM invocation via Bedrock with Guardrails applied inline. Complexity routing decides between Claude 3 Haiku (simple queries, factual lookup) and Claude 3.5 Sonnet (complex queries, multi-document synthesis) — complexity classification is done by a lightweight classifier before the main call.

**Guardrails Plane.** Bedrock Guardrails is configured with three layers: (a) PII detection and redaction in the generated response, (b) topic denial for topics outside the system's scope (e.g., direct legal advice, medical diagnosis), (c) grounding check — Bedrock verifies that the response is supported by the provided context chunks, blocking responses that introduce facts not present in the context (the primary hallucination vector in RAG). When a guardrail fires, the response is replaced by a standardized fallback message and the event is logged with severity and category.

**Continuous Evaluation Plane.** Each complete interaction is published to a Kinesis stream. A Lambda consumer computes async metrics: faithfulness (is the response supported by the context?), context relevance (are the retrieved chunks relevant to the query?), and answer relevance (does the response address the question?). These metrics are computed using a judge LLM (Claude 3 Haiku, cheaper) via structured prompts — the LLM-as-judge approach. Results are persisted in DynamoDB and aggregated in QuickSight dashboards. CloudWatch alarms fire when metrics fall below configured thresholds.

**Audit Plane.** Every interaction — query, retrieved chunks (with IDs and scores), assembled prompt, raw response, guardrail outcome, invoked model, latency, estimated token cost — is serialized to JSON and written to S3 (prefixed by date/user/session). The bucket has Object Lock enabled (WORM) for immutability. Athena provides ad-hoc queries for compliance investigations. CloudTrail records all Bedrock API calls.

## Enterprise RAG Platform Architecture

Full flow from document ingestion to user response, including guardrails, async evaluation, and audit trail.

### 👤 Usuários / Users

- Usuário User (user)
- Admin / Auditor (user)

### 🌐 Frontend / API Layer

- API Gateway (REST + Auth) (edge)
- Amazon Cognito (RBAC / JWT) (security)

### ⚙️ Retrieval & Generation

- RAG Orchestrator (Lambda) (compute)
- Semantic Cache (ElastiCache Redis) (data)
- OpenSearch Serverless (Vector + BM25) (data)
- Reranker (Lambda / cross-encoder) (compute)
- Complexity Router (Lambda classifier) (compute)

### 🤖 Amazon Bedrock

- Bedrock Guardrails (PII / Topic / Grounding) (ai)
- Claude 3 Haiku (simple queries) (ai)
- Claude 3.5 Sonnet (complex queries) (ai)
- Titan Embeddings V2 (query + ingest) (ai)

### 📥 Ingestão / Ingestion

- S3 Bucket (documentos brutos) (storage)
- EventBridge (S3 events) (messaging)
- Step Functions (ingest pipeline) (compute)
- Amazon Textract (OCR) (ai)

### 📊 Avaliação & Auditoria / Evaluation & Audit

- Kinesis Data Stream (interaction events) (messaging)
- Eval Pipeline (Lambda / LLM-as-judge) (compute)
- DynamoDB (eval metrics) (data)
- S3 Audit Bucket (WORM / Object Lock) (storage)
- Athena (compliance queries) (data)
- QuickSight (eval dashboards) (frontend)
- CloudWatch (alarms / metrics) (security)

### Flows

- user -> apigw: HTTP query
- apigw -> cognito: validate JWT
- apigw -> rag_lambda: query + claims
- rag_lambda -> cache: cache lookup
- rag_lambda -> embeddings: embed query
- rag_lambda -> opensearch: hybrid search (RBAC filter)
- opensearch -> reranker: top-K chunks
- reranker -> complexity: ranked chunks
- complexity -> llm_haiku: simple route
- complexity -> llm_sonnet: complex route
- llm_haiku -> guardrails: raw response
- llm_sonnet -> guardrails: raw response
- guardrails -> rag_lambda: filtered response
- rag_lambda -> kinesis: interaction event
- rag_lambda -> s3_audit: immutable log
- kinesis -> eval_lambda: stream consumer
- eval_lambda -> llm_haiku: LLM-as-judge
- eval_lambda -> dynamodb: eval metrics
- dynamodb -> quicksight: dashboards
- dynamodb -> cloudwatch: alarms
- s3_audit -> athena: compliance queries
- admin -> athena: investigation
- admin -> quicksight: monitoring
- s3_docs -> eventbridge: S3 event
- eventbridge -> stepfn: trigger ingestion
- stepfn -> textract: OCR (if PDF)
- stepfn -> embeddings: embed chunks
- stepfn -> opensearch: index vectors

## Critical Design Decisions

**Why hybrid search and not just kNN?**
Pure vector search (kNN) is biased toward semantic similarity but fails on queries requiring exact matching of technical terms, acronyms, standard numbers, or identifiers. BM25 complements exactly those cases. Reciprocal Rank Fusion (RRF) combines rankings without requiring score normalization — a pragmatic and well-established choice in the IR literature. OpenSearch Serverless supports both natively, avoiding the need for an additional search service.

**Why reranking after retrieval?**
Retrieving top-20 chunks and reranking to top-5 before assembling the prompt has two effects: (1) it reduces the context size sent to the LLM, reducing cost and latency, and (2) it improves context precision, which is the primary driver of faithfulness. The cross-encoder is computationally more expensive than the bi-encoder used for indexing, but operates on a small set (top-20), so the cost is acceptable. Models like `ms-marco-MiniLM-L-6-v2` are sufficient for most enterprise cases and can be served via Lambda with a container image.

**Why LLM-as-judge for evaluation and not deterministic metrics?**
Deterministic metrics like ROUGE and BLEU measure n-gram overlap, not semantic correctness. For RAG, what matters is: is the response factually supported by the context? Is the response relevant to the query's intent? These questions require semantic understanding. LLM-as-judge with structured prompts and few-shot examples produces evaluations with high correlation to human judgment, as demonstrated in the literature (RAGAS, G-Eval). The cost of using Claude 3 Haiku for async evaluation is significantly lower than Sonnet, making the pipeline economically viable.

**Why Object Lock on the audit bucket?**
In regulated environments, audit trail integrity cannot depend on access controls — a compromised administrator could delete logs. Object Lock in COMPLIANCE mode ensures that not even the root account can delete or modify objects within the configured retention period. This is a non-negotiable requirement for any system processing data subject to regulation (LGPD, SOX, HIPAA).

**On semantic cache: threshold is critical.**
A threshold that is too high (e.g., 0.99) results in near-zero hit rate — the cache is useless. A threshold that is too low (e.g., 0.80) results in semantically incorrect responses being returned for different queries. The correct value is empirical and domain-dependent. I recommend starting at 0.92-0.95 and adjusting based on false positive analysis in the first weeks of production. The cache should have configurable TTL per document category — frequently changing documents should have short TTL or be excluded from the cache.

## Evaluated Alternatives

### Bedrock Knowledge Bases (managed RAG)

**Pros**
- Fast setup, native Bedrock integration, less infrastructure to manage
- Native guardrails and source citation support

**Cons**
- Less control over chunking, reranking, and retrieval logic
- No native support for hybrid search or custom semantic cache
- Makes granular continuous evaluation and complexity routing harder

**Verdict:** Adequate for MVPs and cases with simple requirements. Insufficient for the evaluation and cost control requirements of this design.

### LangChain + self-managed vector DB (Pinecone/Weaviate)

**Pros**
- Maximum flexibility, rich integration ecosystem
- Support for multiple LLM providers without lock-in

**Cons**
- High operational complexity: more external services to manage, monitor, and secure
- Guardrails must be implemented manually — risk of gaps
- External service costs can exceed Bedrock at scale

**Verdict:** Preferable when there is a multi-cloud or LLM portability requirement. In this AWS-native scenario, it adds complexity without proportional benefit.

### Proposed design (custom RAG + Bedrock Guardrails + eval pipeline)

**Pros**
- Full control over retrieval, reranking, cache, and routing
- AWS-managed guardrails with native auditing
- Continuous evaluation pipeline and complete traceability

**Cons**
- Higher initial implementation and operational complexity
- Higher AWS service dependency (portability trade-off)

**Verdict:** Correct choice for the enterprise scenario with evaluation, compliance, and cost control requirements.

## Rollout Plan

1. **Phase 0 — Foundation (Weeks 1-2)** — Base infrastructure provisioning via IaC (CDK): VPC, IAM roles, S3 buckets (docs + audit with Object Lock), OpenSearch Serverless collection, ElastiCache Redis. Cognito setup with user groups and claims mapping for RBAC. CloudTrail enablement for Bedrock API calls.

2. **Phase 1 — Ingestion Pipeline (Weeks 3-4)** — Step Functions workflow implementation: S3 trigger → Textract (conditional) → chunking Lambda → Titan Embeddings → OpenSearch indexer. Tests with pilot corpus of 5k documents. Validation of RBAC filters in search. Definition of chunking strategy (size, overlap) based on corpus analysis.

3. **Phase 2 — RAG Core + Guardrails (Weeks 5-7)** — RAG Orchestrator Lambda implementation: hybrid search, reranker, prompt assembly, complexity routing, Bedrock invocation with Guardrails. Bedrock Guardrails configuration: PII entities, topic denial list, grounding check threshold. Semantic cache implementation with initial threshold 0.93. Load testing with k6 to validate P95 < 4s.

4. **Phase 3 — Evaluation Pipeline (Weeks 8-9)** — Kinesis consumer and eval Lambda implementation with LLM-as-judge prompts for faithfulness, context relevance, and answer relevance. DynamoDB persistence and QuickSight dashboards. CloudWatch alarm configuration with initial thresholds (faithfulness > 0.80, context relevance > 0.75). Baseline metric collection in the first 2 weeks.

5. **Phase 4 — Pilot with Real Users (Weeks 10-12)** — Rollout to pilot group of 50 users. Intensive monitoring of quality, latency, and cost metrics. Cache threshold, chunking parameters, and guardrail configurations adjusted based on real feedback. False positive analysis on guardrails (incorrectly blocked responses).

6. **Phase 5 — GA and Full Corpus Ingestion (Weeks 13-16)** — Full corpus ingestion (~200k documents). Rollout to all users with per-group feature flags. Operational documentation and runbooks. Operations team training on evaluation dashboard analysis and quality alert response.

> **Risks and Mitigations:** **Risk 1 — Prompt Injection.** A malicious user may attempt to inject instructions in the query field to bypass guardrails or exfiltrate other users' data. Mitigation: input sanitization at API Gateway (WAF), prompt templates that isolate the user query in explicit delimiters, and Bedrock Guardrails as the last line of defense. Important: guardrails are not infallible — regular adversarial testing is necessary.

**Risk 2 — Silent quality drift.** The corpus changes (documents updated, new ones added), but the evaluation pipeline may not detect degradation if alert thresholds are poorly calibrated. Mitigation: golden dataset of questions with expected answers, evaluated weekly deterministically. The golden dataset must be maintained and expanded by the domain team.

**Risk 3 — Evaluation cost scaling with volume.** The LLM-as-judge pipeline invokes the model for each interaction. At 10k queries/day, that is 10k additional Haiku calls. Mitigation: sampling — evaluate 20-30% of interactions randomly, with full evaluation only for interactions that triggered guardrails or received explicit negative user feedback.

**Risk 4 — False positives in guardrails.** An overly aggressive grounding check may block correct responses that use language not literally present in the context (paraphrase, inference). This degrades user experience. Mitigation: monitor block rate by guardrail category; adjust grounding check threshold based on analysis of blocked cases.

> **My Senior Take:** The most common mistake I see in enterprise RAG projects is treating evaluation as a backlog item — something that will be done 'after the system is in production'. This is a trap. Without an evaluation pipeline from the start, you have no way to know if your chunking, reranking, or prompt optimizations are improving or degrading quality. You are optimizing in the dark.

My firm recommendation: the golden dataset and evaluation pipeline should be built in parallel with the retrieval system, not after. Start with 50-100 questions annotated by domain experts. That is enough to have a reliable quality signal in the first weeks.

On guardrails: Bedrock Guardrails is a good tool, but it is not a complete security solution. Sophisticated prompt injection can bypass guardrails based on content classification. My approach is defense-in-depth: WAF at the edge, sanitization in the orchestrator, guardrails in Bedrock, and usage pattern anomaly monitoring (e.g., a user making 500 queries in 10 minutes is a sign of abuse, regardless of content).

On cost: complexity routing is the biggest cost reduction lever in this design. In my experience, 60-70% of queries in enterprise systems are simple factual ones — 'what is the deadline for X?', 'who is responsible for Y?' — that can be answered with Haiku at a fraction of Sonnet's cost. Investing time in calibrating the complexity classifier has direct return on the AWS bill.

Finally: document the audit log schema from day zero.

## Success Metrics and Targets

- **P95 Latency (critical path):** < 4 seconds (cache miss); < 200ms (cache hit)
- **Faithfulness Score (LLM-as-judge):** > 0.82 weekly average (baseline to calibrate in first 4 weeks)
- **Context Relevance Score:** > 0.78 weekly average
- **Semantic cache hit rate:** > 25% after 30 days in production (estimate)
- **Guardrail block rate (false positive):** < 2% of legitimate queries incorrectly blocked
- **Cost per query (estimate):** < USD 0.008 weighted average (Haiku/Sonnet mix + embeddings + search)
- **Audit coverage:** 100% of interactions with immutable S3 log within 5 seconds
- **System availability:** > 99.5% (serverless + OpenSearch Serverless SLA)

## Verdict

This design solves the central problem of enterprise RAG systems: the gap between 'works in the demo' and 'operable in production with confidence'. The four pillars — hybrid retrieval with RBAC, managed guardrails, continuous evaluation, and cost control — are not optional features; they are engineering requirements for any system that processes sensitive information at scale.

The choice of Amazon Bedrock as the AI platform is deliberate and defensible: managed Guardrails, native invocation traceability via CloudTrail, and the ability to swap models without rewriting infrastructure are real advantages in an enterprise environment where the 'best' model changes every quarter. The trade-off is vendor dependency — acceptable here given that the design isolates business logic (chunking, reranking, evaluation) in components that can be reused if the AI platform changes.

The most underestimated risk in this type of project is not technical: it is the absence of quality ownership. The evaluation pipeline only has value if someone looks at the dashboards and acts when metrics drop.

## References

- [Amazon Bedrock — Product Page](https://aws.amazon.com/bedrock/)
- [Amazon Bedrock Guardrails — Documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/guardrails.html)
- [Amazon OpenSearch Serverless — Vector Search](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-vector-search.html)
- [RAGAS: Automated Evaluation of Retrieval Augmented Generation](https://arxiv.org/abs/2309.15217)
- [Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf)
- [Amazon Bedrock — Knowledge Bases](https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base.html)
- [Amazon S3 Object Lock — Compliance Mode](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock.html)
- [G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment](https://arxiv.org/abs/2303.16634)

## Case sources

- [AWS — Amazon Bedrock](https://aws.amazon.com/bedrock/)