# Agentic RAG with OpenSearch Serverless: Anatomy of a Pattern

The agentic RAG pattern with OpenSearch Serverless promises elastic scale and semantic retrieval without infrastructure management — but hides serious latency, cost, and consistency pitfalls that financial-grade systems cannot afford to ignore. In this article, I dissect the pattern's anatomy, map when it works, when it fails, and how to configure it with production-grade rigor.

- URL: https://fernando.moretes.com/blog/opensearch-serverless-agentic-rag-escala

- Markdown: https://fernando.moretes.com/blog/opensearch-serverless-agentic-rag-escala/article.md?lang=en

- Published: 2026-05-28T08:05:00.000Z

- Category: Data Platforms

- Tags: opensearch, rag, agentic-ai, vector-search, serverless, bedrock, financial-grade, aws

- Reading time: 8 min

- Source: [Next generation Amazon OpenSearch Serverless for agentic AI](https://aws.amazon.com/blogs/aws/)

---

When AWS announces a new generation of OpenSearch Serverless aimed at agentic AI, the technical signal that matters is not in the press release. It is in the design implications most architects will discover too late: cold starts that destroy latency SLOs, OCU costs that explode with batch embedding workloads, and the illusion that 'serverless' eliminates the need for partition modeling and concurrency control. I have spent 16 years building financial systems on AWS infrastructure, and I know what happens when a promising architectural pattern meets the reality of a regulated environment. This article tears down the agentic RAG pattern from the ground up: the problem it solves, its internal anatomy, the numbers that matter, and, most importantly, when you should not use it.

## The Real Problem: Why Classical RAG Breaks in Agentic Workflows

Classical RAG is a two-phase pattern: you retrieve k relevant documents via vector search and inject them into an LLM's context for generation. It works well for static Q&A over a stable knowledge base. The problem surfaces when you add agency — that is, when the LLM iteratively decides which tools to call, which queries to reformulate, and how to compose the final answer from multiple heterogeneous sources.

In a real agentic workflow, the retrieval chain is not linear. A financial agent answering 'what is the consolidated credit risk for this counterparty across all exposures in the last 90 days?' may issue 4 to 8 retrieval calls in sequence or in parallel, each with a different query vector, crossing indices of contracts, market news, rating history, and regulatory data. The vector index becomes a hot-path component with P99 latency requirements below 200ms per call and throughput of tens of queries per second per agent session.

The previous generation of OpenSearch Serverless had a well-documented problem here: capacity, measured in OCUs (OpenSearch Compute Units), scales per collection, not per query, and the cold start of an idle collection can reach 2-3 minutes, which is unacceptable for any interactive agent. The new generation promises more granular scaling and lower provisioning latency, but the architect still needs to understand the capacity model to avoid a surprise on the month-end bill.

## Anatomy of the Agentic RAG Pattern with OpenSearch Serverless

Full flow: document ingestion, vector indexing, agentic retrieval-generation cycle, with control plane and observability

### 📥 Ingestion & Embedding

- S3 Raw documents (storage)
- AWS Glue Chunking + ETL (compute)
- Bedrock Titan Embedding v2 (ai)

### 🔍 OpenSearch Serverless

- OpenSearch Serverless Vector Collection (data)
- k-NN Index HNSW / FAISS (data)
- Reranker Cross-encoder (ai)

### 🤖 Agentic Layer

- Bedrock Agent Orchestrator (ai)
- Tool Registry Lambda Actions (compute)
- Session Memory DynamoDB (storage)

### 🔐 Security & Control

- IAM + ABAC Data Access Policy (security)
- KMS CMK Encryption at rest (security)

### 📊 Observability

- CloudWatch SLO / Alarms (compute)
- OpenTelemetry Trace propagation (compute)

### Flows

- s3raw -> glue: S3 event trigger
- glue -> embed: text chunks
- embed -> oss: vectors + metadata
- oss -> knn: HNSW index
- agent -> tools: tool call
- tools -> knn: vector query
- knn -> rerank: top-k candidates
- rerank -> agent: reranked context
- agent -> mem: session / history
- iam -> oss: data access policy
- kms -> oss: CMK encryption
- agent -> otel: trace span
- otel -> cw: metrics / logs

## Technical Anatomy: Each Component and Its Critical Configurations

**Vector Indexing in OpenSearch Serverless**

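The heart of the pattern is the k-NN index. OpenSearch offers two approximate nearest-neighbor methods, HNSW (Hierarchical Navigable Small World) and IVF (Inverted File Index), but per the AWS documentation at the time of writing, Serverless vector search collections accept only HNSW on the Faiss engine, so IVF is a choice only on provisioned domains. For agentic workloads with high read throughput and low latency, HNSW with `ef_construction=512` and `m=16` offers the best recall-latency trade-off: expect P50 of 15-30ms and P99 of 80-150ms for collections up to 10M vectors of 1536 dimensions. Note that 1536 is the output size of Titan Text Embeddings G1; Titan Text Embeddings V2 emits 1024 dimensions by default, which lowers the memory figures proportionally. Beyond that size, the memory cost per OCU starts pressuring the budget: each OCU supports approximately 8GB of index in memory.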
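
The HNSW settings above can be written down as an index-creation body. This is a minimal sketch, assuming the standard OpenSearch `knn_vector` mapping; the field names (`embedding`, `text`, `section`, `effective_date`) and the `l2` space type are illustrative assumptions, and the dimension must match whichever embedding model you use.

```python
import json


def build_knn_index_body(dimension, ef_construction=512, m=16):
    """Return an index-creation body for an HNSW k-NN field on the Faiss engine."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                # Hypothetical field names; adapt them to your own schema.
                "embedding": {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "name": "hnsw",
                        "engine": "faiss",
                        "space_type": "l2",  # assumption: pick what suits your model
                        "parameters": {"ef_construction": ef_construction, "m": m},
                    },
                },
                "text": {"type": "text"},
                "section": {"type": "keyword"},
                "effective_date": {"type": "date"},
            }
        },
    }


# 1024 is the default output size of Titan Text Embeddings V2.
index_body = build_knn_index_body(dimension=1024)
print(json.dumps(index_body, indent=2))
```

The body is plain JSON, so it can be sent with whatever signed HTTP client you already use against the collection endpoint.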

**Chunking and Embedding Strategy**

Retrieval quality starts in the ingestion pipeline. For financial documents — contracts, prospectuses, regulatory reports — I use hierarchical chunking: 512-token chunks with 64-token overlap, preserving structural metadata (section, page, effective date) as filter fields in the index. This enables hybrid search: vector search + metadata filter, reducing the search space and improving precision without increasing k.
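
The windowing part of that strategy can be sketched in a few lines. The sketch below covers only the fixed 512/64 sliding window plus metadata attachment, not the hierarchy itself, and it uses a whitespace split as a stand-in for a real tokenizer; the function names and record layout are hypothetical.

```python
def chunk_tokens(tokens, size=512, overlap=64):
    """Split a token list into windows of `size` tokens sharing `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        end = min(start + size, len(tokens))
        chunks.append(tokens[start:end])
        if end == len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


def build_chunk_records(text, section, page, effective_date):
    """Attach structural metadata to every chunk so it can serve as a filter field."""
    tokens = text.split()  # assumption: whitespace split instead of a model tokenizer
    return [
        {
            "chunk_index": i,
            "text": " ".join(chunk),
            "section": section,
            "page": page,
            "effective_date": effective_date,
        }
        for i, chunk in enumerate(chunk_tokens(tokens))
    ]


sample = " ".join(f"tok{i}" for i in range(1000))
records = build_chunk_records(sample, section="4.2", page=17, effective_date="2026-01-01")
print(len(records), [len(r["text"].split()) for r in records])
```

Because every record carries `section`, `page`, and `effective_date`, a query can narrow the candidate set with a filter before vector similarity is computed.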

**Reranking: The Most Underestimated Component**

Retrieving the top-20 candidates via k-NN and reranking them with a cross-encoder (Cohere Rerank or Bedrock Rerank) before injecting into the agent's context is the difference between a system that hallucinates and one that cites correct sources. Reranking cost is low (< $0.002 per 1000 documents on Cohere) and the precision gain in specialized domains is consistently 15-25% in MRR@10 on benchmarks I've run internally.
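
For readers who want to reproduce that kind of comparison, MRR@10 is the mean, over queries, of the reciprocal rank of the first relevant document within the top 10. A minimal sketch, assuming you hold one ranked list of document IDs and one set of relevant IDs per query; the IDs below are made up for illustration and are not measurements.

```python
def mrr_at_k(ranked_ids_per_query, relevant_ids_per_query, k=10):
    """Mean reciprocal rank of the first relevant document within the top k."""
    if len(ranked_ids_per_query) != len(relevant_ids_per_query):
        raise ValueError("one ranking and one relevance set per query are required")
    if not ranked_ids_per_query:
        return 0.0
    total = 0.0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_ids_per_query):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_ids_per_query)


# Illustrative rankings for two queries, before and after a reranker.
relevant = [{"d4"}, {"d9"}]
knn_only = [["d1", "d2", "d3", "d4"], ["d7", "d9", "d8"]]
reranked = [["d4", "d1", "d2", "d3"], ["d7", "d9", "d8"]]
print(mrr_at_k(knn_only, relevant), mrr_at_k(reranked, relevant))
```

Running the same function over the k-NN output and the reranked output on a fixed query set gives a like-for-like view of what the reranker contributes.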

## Numbers That Matter for Sizing

- **~8 GB** — Index per OCU. HNSW index capacity in memory per OpenSearch Compute Unit; plan for 70% maximum utilization
- **< 150ms** — Vector search P99. Realistic P99 latency for collections up to 10M vectors of 1536 dims with HNSW ef_search=256 on warm OCUs
- **2–3 min** — Collection cold start. Provisioning time for an idle collection; the new generation reduces this but does not eliminate it — keep collections warm via heartbeat queries
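
These figures can be turned into a rough capacity estimate. The sketch below assumes the HNSW memory rule of thumb from the OpenSearch k-NN documentation, `1.1 * (4 * d + 8 * m)` bytes per vector, together with the ~8 GB per OCU and 70% utilization ceiling quoted above. It ignores standby capacity, replicas, and non-vector fields, so treat the result as a lower bound.

```python
import math


def hnsw_memory_bytes(num_vectors, dimension, m=16):
    """Rule-of-thumb HNSW footprint: 1.1 * (4 * d + 8 * m) bytes per vector."""
    return 1.1 * (4 * dimension + 8 * m) * num_vectors


def ocus_for_index(num_vectors, dimension, m=16, gb_per_ocu=8.0, max_utilization=0.70):
    """OCUs needed to hold the index in memory at the target utilization."""
    usable_bytes = gb_per_ocu * 1e9 * max_utilization
    return math.ceil(hnsw_memory_bytes(num_vectors, dimension, m) / usable_bytes)


# 10M vectors of 1536 dimensions with m=16, as in the sizing figures above.
footprint_gb = hnsw_memory_bytes(10_000_000, 1536) / 1e9
print(round(footprint_gb, 1), "GB ->", ocus_for_index(10_000_000, 1536), "OCUs")
# The same corpus at 1024 dimensions needs noticeably less memory.
print(ocus_for_index(10_000_000, 1024), "OCUs at 1024 dims")
```

Under these assumptions, 10M vectors at 1536 dimensions come to roughly 69 GB of graph and vector data, or 13 OCUs at 70% utilization, which is why dimension count is a budget decision as much as a quality one.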

## When to Use This Pattern: The Honest Criteria

The agentic RAG pattern with OpenSearch Serverless fits well under a specific set of conditions. First, **workloads with high traffic variability and low predictability**: if you have daytime usage peaks with overnight valleys, the serverless model amortizes the cost of idle capacity you would pay for with a provisioned OpenSearch cluster. For a financial analyst support system peaking from 9am to 5pm with residual overnight traffic, savings can be 40-60% compared to a 3-node m6g.2xlarge cluster.

Second, **knowledge bases that grow non-uniformly**: regulatory documents, meeting minutes, risk reports — corpora that grow in sprints (end of quarter, market events) and stay stable for weeks. The async ingestion model of OpenSearch Serverless, with Glue or Lambda processing SQS queues of new documents, adapts naturally to that rhythm.

Third, **multi-tenancy with data isolation**: OpenSearch Serverless supports data access policies per collection with IAM conditions based on tags (`aws:ResourceTag/tenant`). For a financial SaaS platform with multiple clients, you can have one collection per tenant or use index namespaces with granular access policies — without operating separate clusters.
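
When tenants share an index, access policies are usefully complemented by an application-side guard that refuses to build a query without a tenant filter. A minimal sketch, assuming the OpenSearch k-NN query syntax with an embedded `filter` clause; the `embedding` and `tenant_id` field names are hypothetical.

```python
def build_tenant_knn_query(query_vector, tenant_id, k=20, extra_filters=None):
    """Build a k-NN search body that always carries a tenant_id filter."""
    if not tenant_id:
        raise ValueError("tenant_id is mandatory for every retrieval call")
    filters = [{"term": {"tenant_id": tenant_id}}]
    filters.extend(extra_filters or [])
    return {
        "size": k,
        "query": {
            "knn": {
                "embedding": {  # hypothetical vector field name
                    "vector": list(query_vector),
                    "k": k,
                    "filter": {"bool": {"filter": filters}},
                }
            }
        },
    }


query = build_tenant_knn_query(
    [0.12, -0.03, 0.88],  # toy vector; real ones match the index dimension
    tenant_id="tenant-a",
    extra_filters=[{"range": {"effective_date": {"gte": "2026-01-01"}}}],
)
print(query["query"]["knn"]["embedding"]["filter"])
```

Routing every retrieval tool through a single builder like this makes the tenant filter a property of the code path instead of a convention each caller has to remember.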

The most important negative criterion: **do not use this pattern if you have end-to-end latency SLOs below 500ms for the complete agent response**. An agentic cycle with 3-4 rounds of retrieval, reranking, and LLM generation typically takes 3-8 seconds at P50 and will rarely come in below that. That is acceptable for async analysis but unacceptable for trading or real-time credit decisions.

## Anti-Patterns: What Will Break in Production

- **Single index for all tenants without metadata filter**: Placing documents from multiple clients in the same index without a `tenant_id` field as a mandatory filter on all queries is a data isolation failure. In regulated financial environments, this is an audit finding, not just a product bug.
- **Embeddings generated at query time without cache**: Calling Bedrock Titan to generate the user query embedding on every request, without caching embeddings of frequent queries, adds an unnecessary 50-100ms of latency plus API cost. Use ElastiCache with a 5-minute TTL for recurring queries in agent sessions.
- **k too high without reranking**: Retrieving top-50 or top-100 documents and injecting everything into the LLM context without reranking fills the context window with noise, increases token cost, and degrades response quality. The correct pattern is k=20 with reranking down to top-5.
- **Ignoring the OCU cost model for batch ingestion workloads**: Indexing OCUs and search OCUs are billed separately. A batch ingestion job processing 1M documents can provision 10+ indexing OCUs for hours and generate an unexpected bill. Always throttle the ingestion pipeline and monitor `IndexingOCUs` in CloudWatch.
- **No idempotency in the ingestion pipeline**: Reprocessing documents without content hash verification creates duplicates in the vector index, increases storage cost, and degrades search precision with redundant chunks. Use a SHA-256 hash of the content as `document_id` and perform upsert, not insert.
- **Relying on SDK auto-retry without agent-level idempotency**: Agentic workflows with Step Functions or Bedrock Agent that retry retrieval tool calls without idempotency guarantees can emit duplicate queries to the index and accumulate inconsistent context in the session. Each tool call must have a traceable `call_id`.
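
The ingestion idempotency point can be illustrated without any AWS dependency. The sketch below derives `document_id` from a SHA-256 hash of the content and uses an in-memory dictionary as a stand-in for the index; the whitespace normalization and the `InMemoryIndex` class are assumptions for illustration, and how the ID maps onto the write API of your collection type needs to be checked separately.

```python
import hashlib


def document_id(content):
    """Deterministic ID: SHA-256 hex digest of the whitespace-normalized content."""
    normalized = " ".join(content.split())  # assumption: whitespace is not meaningful
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class InMemoryIndex:
    """Stand-in for the vector index, keyed by document_id (not a real client)."""

    def __init__(self):
        self.docs = {}

    def upsert(self, content, metadata):
        """Insert or overwrite by content hash; return True if the ID was new."""
        doc_id = document_id(content)
        is_new = doc_id not in self.docs
        self.docs[doc_id] = {"text": content, **metadata}
        return is_new


index = InMemoryIndex()
first = index.upsert("Clause 4.2: collateral must be posted daily.", {"section": "4.2"})
# Reprocessing the same chunk (with stray whitespace) must not create a duplicate.
second = index.upsert("Clause 4.2:  collateral must be posted daily.\n", {"section": "4.2"})
print(first, second, len(index.docs))
```

The same idea extends to agent tool calls: a deterministic key per call lets a retry be recognized as a repeat instead of being treated as new work.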

## Security and Governance in Regulated Financial Environments

In financial systems, the vector index is not just a performance component — it is a repository of potentially sensitive data. Embeddings of confidential documents can, in theory, be partially reversed with embedding inversion attacks. This is not science fiction: recent research has demonstrated partial text reconstruction from vectors of popular models.

**Mandatory controls I implement in production:**

1. **KMS CMK with annual rotation**: Every OpenSearch Serverless collection must use a customer-managed CMK (`aws/opensearchserverless` is not sufficient for regulated environments). Configure the `kms:ViaService` condition in the key policy to restrict usage exclusively to the service.

2. **Data Access Policies with least privilege**: Separate IAM roles for ingestion (`aoss:CreateIndex`, `aoss:WriteDocument`) and for search (`aoss:ReadDocument`). Never give the agent write permission on the index.

3. **VPC Endpoint for OpenSearch Serverless**: In financial environments, all traffic to the index must pass through `vpce-opensearchserverless` with an endpoint policy that rejects requests from outside the VPC. This eliminates the exfiltration vector via the public internet.

4. **Query auditing via CloudTrail**: Enable `aoss:APICall` in CloudTrail to log all queries to the index. In LGPD/GDPR environments, this is necessary to demonstrate that personal data is not being accessed outside the authorized context.

5. **Data classification in chunk metadata**: Add `data_classification` (PUBLIC, INTERNAL, CONFIDENTIAL, RESTRICTED) and `retention_date` fields to each indexed document. Use these fields as mandatory filters in data access policies to ensure the agent never retrieves documents above its authorization level.
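
The role separation in control 2 can be sketched as two data access policy documents. This assumes the JSON shape used by OpenSearch Serverless data access policies (`Rules`, `Principal`, `Description`); the collection name, account ID, and role names are placeholders, and a real search role may need additional read-side permissions such as `aoss:DescribeIndex`.

```python
import json

INGEST_PERMISSIONS = {"aoss:CreateIndex", "aoss:WriteDocument"}
SEARCH_PERMISSIONS = {"aoss:ReadDocument"}


def build_access_policy(collection, role_arn, permissions, description):
    """Return one data access policy document scoped to a collection's indices."""
    return [
        {
            "Description": description,
            "Rules": [
                {
                    "ResourceType": "index",
                    "Resource": [f"index/{collection}/*"],
                    "Permission": sorted(permissions),
                }
            ],
            "Principal": [role_arn],
        }
    ]


# Placeholder ARNs and collection name, for illustration only.
ingest_policy = build_access_policy(
    "risk-docs", "arn:aws:iam::123456789012:role/rag-ingest",
    INGEST_PERMISSIONS, "Ingestion pipeline: create indices and write documents",
)
search_policy = build_access_policy(
    "risk-docs", "arn:aws:iam::123456789012:role/rag-agent-search",
    SEARCH_PERMISSIONS, "Agent retrieval: read-only",
)
print(json.dumps(search_policy, indent=2))
```

Keeping the two permission sets as separate constants makes it easy to assert in CI that the agent's policy never gains a write permission.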

## Observability: What to Monitor and Why

An agentic RAG system without adequate observability is a black box that fails silently. Quality degradation — less precise responses, increasing hallucinations — does not show up in infrastructure metrics. You need a three-layer observability strategy.

**Layer 1 — Infrastructure (CloudWatch):**
- `SearchOCUs` and `IndexingOCUs`: alerts when > 80% of configured maximum capacity
- `SearchLatency` P99: SLO of 150ms; alarm at 120ms to allow reaction time
- `IndexingRate` (documents/second): sudden drop indicates a problem in the ingestion pipeline
- `SearchRequestRate` vs `SearchErrorRate`: error rate > 0.1% triggers investigation

**Layer 2 — Application (OpenTelemetry + CloudWatch Logs Insights):**
Instrument each agent cycle with a trace span that includes: number of retrieval rounds, effective k per round, reranker latency, context tokens injected into the LLM, and response confidence score. This allows correlating quality degradation with traffic or data changes.

**Layer 3 — Retrieval Quality (offline):**
I implement an async evaluation pipeline with RAGAS (Retrieval Augmented Generation Assessment) that runs daily over a 5% sample of production queries. The `context_precision`, `context_recall`, and `faithfulness` metrics are published to CloudWatch as custom metrics and integrated into the product's SLO dashboard. A 10% drop in `faithfulness` over 7 days is an indicator of knowledge base drift that requires reindexing.
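
The drift rule at the end of Layer 3 reduces to a small check over the daily metric series. A minimal sketch, assuming one `faithfulness` score per day and reading "10% drop" as a relative decline between the score seven days ago and the latest one; both the interpretation and the function names are assumptions.

```python
def relative_drop(daily_scores, window_days=7):
    """Relative decline between the score `window_days` ago and the latest score."""
    if len(daily_scores) <= window_days:
        raise ValueError("need at least window_days + 1 daily scores")
    baseline = daily_scores[-(window_days + 1)]
    current = daily_scores[-1]
    if baseline <= 0:
        return 0.0  # nothing meaningful to compare against
    return (baseline - current) / baseline


def needs_reindex_review(daily_scores, threshold=0.10, window_days=7):
    """Flag possible knowledge-base drift when the drop reaches the threshold."""
    return relative_drop(daily_scores, window_days) >= threshold


# Illustrative series: eight daily faithfulness scores, oldest first.
scores = [0.90, 0.89, 0.88, 0.87, 0.85, 0.84, 0.82, 0.80]
print(round(relative_drop(scores), 3), needs_reindex_review(scores))
```

A single-day comparison is noisy on a 5% sample, so in practice a rolling mean on both ends of the window may be a steadier signal.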

> **Keep Collections Warm with Heartbeat Queries:** The cold start of idle OpenSearch Serverless collections is the biggest operational risk of this pattern. Configure an EventBridge Scheduler to fire a lightweight heartbeat query (`GET /_cat/indices`) every 10 minutes on critical collections. The cost is negligible (< $0.50/month in OCUs) and eliminates the risk of a 2-3 minute cold start on the first access of the day — which in financial systems may be exactly when an analyst needs an urgent answer before market open.

## OpenSearch Serverless vs Alternatives for Agentic RAG
| Criterion | OpenSearch Serverless | Provisioned OpenSearch | pgvector (Aurora) |
| --- | --- | --- | --- |
| P99 Latency (10M vectors) | 80-150ms (warm) | 20-60ms | 200-500ms |
| Cold Start | 2-3 min (mitigable) | None | None |
| Base monthly cost (idle) | ~$700 (2 OCU minimum) | ~$400 (3x r6g.large) | ~$200 (Aurora Serverless v2) |
| Auto-scaling | Yes, per collection | Manual / UltraWarm | Yes (ACU) |
| Native multi-tenancy | Data Access Policies via IAM | Index-level RBAC | Row-level security (RLS) |
| Hybrid search (vector + BM25) | Yes, native | Yes, native | Not native (workaround) |

> **My Curation Note:** After implementing variations of this pattern in three distinct financial environments — asset manager, digital bank, and insurer — the hardest lesson I learned is that RAG quality is determined 70% by the ingestion pipeline and 30% by index configuration. Architects who spend weeks tuning HNSW parameters while having poorly structured chunks and missing metadata are optimizing the wrong thing. My practical recommendation: before any index tuning, invest in a golden dataset of 200-300 domain-specific (query, relevant document) pairs and use it to measure `context_recall` — if it's below 0.75, the problem is in chunking or embedding, not in k-NN. OpenSearch Serverless is a good choice for this pattern when the minimum cost of ~$700/month is justifiable and traffic is genuinely variable; outside of that, a small provisioned cluster with UltraWarm delivers better TCO.

## Verdict: Use with Criteria, Configure with Rigor

The agentic RAG pattern with Amazon OpenSearch Serverless is technically mature and operationally viable for financial-grade systems — as long as you accept its constraints clearly. The minimum cost of ~$700/month in OCUs is non-negotiable and cold start is still a real operational risk that requires active mitigation. On the other hand, the IAM-based data access policy model is genuinely superior for regulated multi-tenancy, native hybrid search support (vector + BM25) is a concrete advantage over pgvector, and the absence of cluster management frees engineering capacity for what truly matters: ingestion pipeline quality and continuous retrieval evaluation. My recommendation: adopt this pattern if you have agentic workloads with variable traffic, a specialized domain knowledge base, and multi-tenancy requirements with data isolation. Avoid it if you need sub-100ms P99 latency, have predictable and constant traffic, or if the base cost is not justifiable by usage volume. In any case, invest first in the evaluation golden dataset — without it, you are flying blind.

**Rating:** Recommended with conditions

## Technical References

- [Amazon OpenSearch Serverless Developer Guide](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless.html)
- [Amazon OpenSearch Service k-NN Plugin Documentation](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/knn.html)
- [Amazon Bedrock Agents — Knowledge Bases Integration](https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base.html)
- [RAGAS: Evaluation Framework for RAG Pipelines](https://docs.ragas.io/en/latest/)
- [OpenSearch Serverless Security — Data Access Policies](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-data-access.html)
- [AWS Well-Architected Framework — Machine Learning Lens](https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/welcome.html)
- [Embedding Inversion Attacks: Vec2Text Research](https://arxiv.org/abs/2310.06816)
- [Cohere Rerank API Documentation](https://docs.cohere.com/reference/rerank)