Who is Fernando F. Azevedo?

Fernando F. Azevedo is a Senior Solutions Architect at Banco Itaú with 16+ years of experience across AWS, event-driven architecture, DevSecOps, Data Mesh, AI and financial systems.

What technical topics does Fernando work with?

Fernando works with AWS, Kubernetes, Kafka, Data Mesh, Amazon Bedrock, RAG, DevSecOps, observability, financial systems and architecture communication using C4, ADRs and trade-off analysis.

Is Fernando available for professional conversations?

Fernando is currently building at Banco Itaú and is open to thoughtful conversations about architecture, cloud, AI, engineering leadership, community, podcasts and technical collaboration.

Production RAG on AWS

Module 1 · Fundamentals · Lesson 02/12

Embeddings and vector search

How meaning becomes a vector, cosine similarity and what a vector store is.

5 min read

Before assembling any RAG pipeline, you need to understand what happens when text becomes a vector — because that transformation is what makes searching by meaning, not just keywords, possible. In this lesson you will understand embeddings, cosine similarity, vector indexes, and how to choose the right embedding model for your use case.

What an embedding actually is

A language model doesn't understand text as a string — it understands text as a position in a high-dimensional space. An embedding is exactly that: a list of numbers (the vector) representing where that text "lives" in that space.

The most important property: texts with similar meaning end up close to each other in the space. "Cancel subscription" and "terminate plan" land near each other. "Cancel subscription" and "cake recipe" are far apart. This is not magic: it is the result of training the model on billions of examples in which related phrases appear in similar contexts.

Think of it this way: imagine every document is a star in the sky. The embedding model is the telescope that projects each text's meaning into coordinates. When you search, you are asking "which stars are closest to my query's coordinates?"

This representation carries semantic context that keyword search never captures. That is why RAG works where CTRL+F fails: you find the right passage even when it uses completely different vocabulary from your query.

Flow: text → embedding → index → search

How a text travels through the embedding pipeline to become a search result

📄 Input: Documents and Query

Document · (text chunk)
User query · (question)

🤖 AI: Embedding Model

Embedding Model · Titan / Cohere

🗄️ Storage: Vector Store

Vector Index · ANN index
Metadata · source, date, id

🔍 Search: Retrieval

Similarity · cosine
Top-K chunks · results

Cosine similarity and the vector index

To compare two vectors, the most common method is cosine similarity: instead of measuring the absolute distance between two points, you measure the angle between them. Two vectors pointing in the same direction have cosine 1 (treated as identical in meaning). Opposite directions give -1, and perpendicular vectors give 0.
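
As a minimal sketch of the definition above, cosine similarity can be computed in plain Python. The function name and the tiny 2-dimensional vectors are illustrative only; real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Same direction, different magnitude: cosine is (approximately) 1.
same = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Opposite direction: cosine is (approximately) -1.
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
# Perpendicular vectors: cosine is 0.
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 3.0])
```

Note that the first pair differs in length by a factor of two and still scores as the same direction, which is the point of measuring the angle rather than the distance.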

Why angle and not Euclidean distance? Because what matters is the semantic direction, not the vector's magnitude. With normalized embeddings (length 1), which most models produce by default, the two metrics rank neighbors identically, so the choice mostly affects how you configure the index.
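
The relationship between the two metrics for unit-length vectors follows from expanding the squared distance. Here a and b are two embeddings and θ is the angle between them (symbols introduced for this sketch):

```latex
\[
\lVert a - b \rVert^2
  = \lVert a \rVert^2 + \lVert b \rVert^2 - 2\, a \cdot b
  = 2 - 2\cos\theta
  \qquad \text{when } \lVert a \rVert = \lVert b \rVert = 1 .
\]
```

A smaller Euclidean distance therefore always corresponds to a larger cosine, so sorting by either one returns the same neighbors in the same order.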

Now the practical problem: if you have 10 million indexed chunks, comparing the query against every single one in real time is not feasible. This is where ANN — Approximate Nearest Neighbor comes in. Algorithms like HNSW (used in OpenSearch and pgvector) build a graph structure that lets you find the nearest neighbors without scanning the entire index.
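
To make the cost concrete, here is a sketch of the exact (brute-force) search that ANN indexes are designed to avoid. It scores the query against every stored vector, so the work grows linearly with the index size. The function names, chunk ids and 3-dimensional vectors are made up for illustration.

```python
import heapq
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_top_k(query, index, k):
    """Score every stored vector against the query and keep the k best."""
    scored = (
        (cosine_similarity(query, vector), chunk_id)
        for chunk_id, vector in index.items()
    )
    return heapq.nlargest(k, scored)


# Toy index: chunk id -> embedding (hypothetical values).
index = {
    "chunk-a": [0.9, 0.1, 0.0],
    "chunk-b": [0.0, 1.0, 0.0],
    "chunk-c": [0.8, 0.2, 0.1],
}

results = exact_top_k([1.0, 0.0, 0.0], index, k=2)
top_ids = [chunk_id for _, chunk_id in results]
```

With three chunks this is instant; with millions of chunks and 1024 dimensions each, every query would repeat that full scan, which is what a graph index such as HNSW sidesteps.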

The "approximate" is not a flaw; it is a deliberate engineering choice. You trade 100% recall for millisecond latency. In practice, with well-tuned parameters, recall stays at 95-99% and latency drops from seconds to tens of milliseconds. For most RAG workloads, that trade-off is worth it.
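
One way to check where your own index sits on that trade-off is to measure recall@k: run an exact search and the ANN search for the same queries and count how many true neighbors the approximate result kept. The sketch below assumes you already have both result lists as chunk ids; the function name and the sample ids are hypothetical.

```python
def recall_at_k(exact_ids, approx_ids, k):
    """Fraction of the true top-k neighbors that the approximate search also returned."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(exact_ids[:k])
    if not truth:
        raise ValueError("exact_ids must not be empty")
    found = truth & set(approx_ids[:k])
    return len(found) / len(truth)


# Ground truth from an exact scan vs. what an ANN index returned (made-up ids).
exact = ["c1", "c2", "c3", "c4", "c5"]
approx = ["c1", "c3", "c2", "c9", "c5"]

score = recall_at_k(exact, approx, k=5)  # 4 of the 5 true neighbors were found
```

Averaging this score over a sample of real queries gives a number you can track while tuning index parameters.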

In practice: the embedding model is an architecture decision, not a detail

Senior Solutions Architect

In practice, the embedding model you choose during indexing is locked into your index forever — or until you reindex everything. Switching models later means reprocessing every chunk and rebuilding the index from scratch. That is why I treat this choice with the same weight as choosing a database: evaluate upfront, test with your real data, and document the decision. Amazon Titan Embeddings V2 is my default starting point on AWS projects for its low cost and native integration with Bedrock Knowledge Bases. Cohere Embed v3 comes in when I need robust multilingual support or when benchmarks on my specific domain justify the extra cost. I never choose an embedding model by popularity — I choose by recall on my corpus.

Dimension, normalization, and model choice

Every embedding has a dimension — the number of values in the vector. Titan Embeddings V2 supports 256, 512, or 1024 dimensions. Cohere Embed v3 uses 1024. Open-source models like all-MiniLM-L6-v2 use 384.

More dimensions = more capacity to capture semantic nuance, but also higher storage cost and search latency. For most enterprise RAG cases in Portuguese, 1024 dimensions is the sweet spot. Smaller dimensions (256-512) make sense when volume is huge and the domain is narrow.
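
A rough back-of-the-envelope calculation shows how dimension drives storage. The sketch assumes each value is stored as a 4-byte float and counts raw vectors only; index structures (such as an HNSW graph) and metadata add overhead on top, and the 10 million figure is just the example volume used earlier in this lesson.

```python
def index_size_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw vector storage only, assuming fixed-width values (float32 by default)."""
    return num_vectors * dimensions * bytes_per_value


for dims in (256, 512, 1024):
    gib = index_size_bytes(10_000_000, dims) / 1024**3
    print(f"{dims:>4} dims: {gib:.1f} GiB of raw vectors")
```

Halving the dimension halves the raw storage, which is why smaller vectors become attractive at very large volumes.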

Normalization means the generated vector has L2 norm equal to 1. Almost all modern models normalize by default. This matters because it guarantees that cosine similarity and dot product give the same result — and simplifies index configuration.
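
As an illustration of where dimension and normalization show up in practice, here is a hedged sketch of a Titan Embeddings V2 call through the Bedrock runtime with boto3. The helper names and the default region are assumptions made for this example, the model id and request fields reflect my understanding of the Bedrock request format and should be checked against the current documentation, and the call itself needs AWS credentials and model access.

```python
import json

# Model id as I understand it for Titan Text Embeddings V2; verify in the Bedrock console.
TITAN_V2_MODEL_ID = "amazon.titan-embed-text-v2:0"


def build_titan_v2_body(text, dimensions=1024, normalize=True):
    """Build the JSON request body (hypothetical helper, not part of boto3)."""
    if dimensions not in (256, 512, 1024):
        raise ValueError("Titan Embeddings V2 supports 256, 512, or 1024 dimensions")
    return json.dumps(
        {"inputText": text, "dimensions": dimensions, "normalize": normalize}
    )


def embed_with_titan_v2(text, dimensions=1024, region_name="us-east-1"):
    """Return the embedding vector for one text. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the body builder works without AWS dependencies

    client = boto3.client("bedrock-runtime", region_name=region_name)
    response = client.invoke_model(
        modelId=TITAN_V2_MODEL_ID,
        body=build_titan_v2_body(text, dimensions),
        contentType="application/json",
        accept="application/json",
    )
    payload = json.loads(response["body"].read())
    return payload["embedding"]
```

Whatever values you pick for `dimensions` and `normalize` at indexing time need to be used again at query time, since they define the vector space the index lives in.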

Some concrete criteria for choosing a model:

Language: Titan V2 and Cohere v3 have good Portuguese coverage. English-only models degrade silently on PT-BR text.
Max input size: Titan V2 accepts up to 8192 tokens. Cohere v3 accepts up to 512 tokens per input (with proper chunking this is not a problem; see lesson 03).
Cost per token: evaluate against your project's real volume before deciding.
Inference latency: in synchronous pipelines, query embedding time adds to total RAG latency.

In lesson 04 you will see how this query vector fits into the complete retrieval and generation pipeline.


Embedding models available on Amazon Bedrock

Model	Dimensions	Max input tokens	Multilingual	Best for
Amazon Titan Embeddings V2	256 / 512 / 1024	8192	Yes (25+ languages)	General RAG on AWS, native Knowledge Bases integration
Cohere Embed v3 (English)	1024	512	No	English corpora with high semantic precision
Cohere Embed v3 (Multilingual)	1024	512	Yes (100+ languages)	Multilingual documents, high-quality PT-BR

Key takeaways from this lesson

Embedding = vector representing semantic position in space; similar texts land close together.

Cosine similarity measures the angle between vectors — semantic direction matters, not magnitude.

ANN (approximate search) trades perfect recall for viable latency, a trade-off that is usually worth making in production.

The embedding model is an architecture decision: changing it later requires full reindexing.

Verify Portuguese support before choosing a model — silent degradation is real.

Vector dimension affects storage cost and search latency — 1024d is the sweet spot for most cases.

Frequently asked questions

Can I use the same embedding model for indexing and for the query?

Yes — and you must. Document and query need to be in the same vector space for the comparison to make sense. Using different models for each is a silent bug that destroys search quality.
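
One way to keep this bug from staying silent is to record the embedding model alongside the index and check it on every query. The sketch below is an assumption about how you might structure that guard; the class, function and field names are hypothetical and not part of any vector store's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexMetadata:
    """What the index was built with (stored next to the index itself)."""
    embedding_model_id: str
    dimensions: int


def assert_same_vector_space(index_meta, query_model_id, query_vector):
    """Fail loudly if the query was embedded differently from the indexed chunks."""
    if query_model_id != index_meta.embedding_model_id:
        raise ValueError(
            f"index built with {index_meta.embedding_model_id!r}, "
            f"query embedded with {query_model_id!r}"
        )
    if len(query_vector) != index_meta.dimensions:
        raise ValueError(
            f"index expects {index_meta.dimensions} dimensions, "
            f"query has {len(query_vector)}"
        )


meta = IndexMetadata(embedding_model_id="amazon.titan-embed-text-v2:0", dimensions=4)
# Passes: same model id and same dimension (toy 4-dimensional vector).
assert_same_vector_space(meta, "amazon.titan-embed-text-v2:0", [0.1, 0.2, 0.3, 0.4])
```

A dimension mismatch usually fails on its own at the vector store, but two different models with the same dimension will not, which is why the model id check matters.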

What is the difference between dot product and cosine similarity?

For normalized vectors (L2 norm = 1), they are equivalent. Most modern models normalize by default, so in practice you can use either. If your model does not normalize, use cosine explicitly.
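
A small numeric check makes the equivalence visible. In this sketch (helper names and vectors are illustrative), the raw dot product depends on magnitude, while the dot product of the L2-normalized vectors matches the cosine.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to length 1 (L2 norm = 1)."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


a = [3.0, 4.0]  # length 5
b = [4.0, 3.0]  # length 5

raw_dot = dot(a, b)                                               # magnitude-dependent
cosine = raw_dot / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))  # 24 / 25
unit_dot = dot(l2_normalize(a), l2_normalize(b))                  # dot of unit vectors
```

Here `raw_dot` is 24 while `cosine` and `unit_dot` agree up to floating-point rounding, so an index configured for dot product only behaves like cosine when the stored vectors are normalized.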

Do I need a GPU to generate embeddings in production?

Not when you use models via API (Bedrock, Cohere API). Inference runs on the provider's infrastructure. If you host the model yourself (e.g., SageMaker with an open-source model), a GPU speeds up inference significantly, but for most RAG cases on AWS the API is simpler and cheaper.

What happens if my chunk is larger than the embedding model's token limit?

It depends on the model and its settings. Some endpoints reject the request with an error; others truncate silently, and you lose content beyond the limit with no warning. This is one of the reasons why your chunking strategy (lesson 03) needs to account for the chosen embedding model's token limit.
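
A pre-flight check before indexing can surface oversized chunks instead of letting them lose content unnoticed. The sketch below is deliberately generic: `count_tokens` is a callable you supply, and `rough_token_count` is only a crude whitespace stand-in. Real token counts are model-specific and usually higher than word counts, so treat the numbers as a lower bound.

```python
def find_oversized_chunks(chunks, max_tokens, count_tokens):
    """Return (position, token_count) for every chunk above the model's limit."""
    oversized = []
    for position, chunk in enumerate(chunks):
        tokens = count_tokens(chunk)
        if tokens > max_tokens:
            oversized.append((position, tokens))
    return oversized


def rough_token_count(text):
    """Crude stand-in: counts whitespace-separated words, not real model tokens."""
    return len(text.split())


chunks = [
    "short chunk about cancellation",
    "a much longer chunk " * 200,  # 800 words, above a 512-token limit
]
too_long = find_oversized_chunks(chunks, max_tokens=512, count_tokens=rough_token_count)
```

Running this with the limit of the model you chose (for example 512 or 8192, per the table above) turns a quiet quality problem into an explicit list of chunks to re-split.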

References

Amazon Titan Embeddings V2 (Bedrock docs)
Cohere Embed v3 on Amazon Bedrock
HNSW: Efficient and robust approximate nearest neighbor search
OpenSearch k-NN plugin: ANN algorithms
Bedrock Knowledge Bases: supported embedding models
