Embeddings and vector search
How meaning becomes a vector, cosine similarity and what a vector store is.
Before assembling any RAG pipeline, you need to understand what happens when text becomes a vector — because that transformation is what makes searching by meaning, not just keywords, possible. In this lesson you will understand embeddings, cosine similarity, vector indexes, and how to choose the right embedding model for your use case.
What an embedding actually is
An embedding model doesn't treat text as a string of characters; it maps the text to a position in a high-dimensional space. An embedding is exactly that: a list of numbers (the vector) representing where that text "lives" in that space.
The most important property: texts with similar meaning end up close to each other in the space. "Cancel subscription" and "terminate plan" land near each other. "Cancel subscription" and "cake recipe" are far apart. This is not magic: it is the result of training the model on billions of examples in which related phrases show up in similar contexts.
Think of it this way: imagine every document is a star in the sky. The embedding model is the telescope that projects each text's meaning into coordinates. When you search, you are asking "which stars are closest to my query's coordinates?"
This representation carries semantic context that keyword search never captures. That is why RAG works where CTRL+F fails: you find the right passage even when it uses completely different vocabulary from your query.
Flow: text → embedding → index → search
How a text travels through the embedding pipeline to become a search result
- Document · (text chunk)
- User query · (question)
- Embedding Model · Titan / Cohere
- Vector index · ANN index
- Metadata · source, date, id
- Similarity · cosine
- Top-K chunks · results
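The flow above can be sketched end to end in a few lines. This is a minimal illustration, not a real pipeline: `toy_embed` is a hypothetical stand-in that only counts shared words, so unlike a real embedding model it would not match "terminate plan" to "cancel subscription". The chunk texts, ids, and sources are invented for the example.

```python
import math

def build_vocab(texts):
    """Assign one vector dimension to each distinct word in the corpus."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def toy_embed(text, vocab):
    """Hypothetical stand-in for an embedding model: word counts, L2-normalized."""
    vec = [0.0] * len(vocab)
    for word in text.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # avoid dividing by zero
    return [x / norm for x in vec]

# Documents (text chunks) with their metadata
chunks = [
    {"id": "doc1#0", "source": "billing.md", "text": "how to cancel your subscription plan"},
    {"id": "doc2#0", "source": "recipes.md", "text": "chocolate cake recipe with butter"},
]

# Index: one vector per chunk, stored next to its metadata
vocab = build_vocab(c["text"] for c in chunks)
index = [
    {"vector": toy_embed(c["text"], vocab), "metadata": {"id": c["id"], "source": c["source"]}}
    for c in chunks
]

def search(query, k=1):
    """Embed the query with the same function, score every entry, return the top k."""
    q = toy_embed(query, vocab)
    scored = [(sum(a * b for a, b in zip(q, e["vector"])), e["metadata"]) for e in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

results = search("cancel subscription")
print(results[0][1]["id"])  # doc1#0
```

In a real system the only parts that change are the embedding function (a call to Titan or Cohere) and the index (an ANN structure instead of a Python list); the shape of the flow stays the same.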
Cosine similarity and the vector index
To compare two vectors, the most common method is cosine similarity: instead of measuring the absolute distance between two points, you measure the angle between them. Two vectors pointing in the same direction have cosine 1 (identical in meaning). Opposite directions give -1. Perpendicular, zero.
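As a minimal sketch, cosine similarity is the dot product divided by the product of the two vector lengths. The function name and the tiny 2-dimensional vectors below are illustrative only; real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

same = cosine_similarity([1.0, 2.0], [2.0, 4.0])          # same direction, close to 1
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])    # opposite direction, close to -1
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 1.0]) # right angle, 0
```

Note that `[1.0, 2.0]` and `[2.0, 4.0]` have different lengths but the same direction, which is why their similarity is 1: magnitude does not affect the result.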
Why angle and not Euclidean distance? Because what matters is the semantic direction, not the vector's magnitude. For normalized embeddings (length 1) the two metrics produce the same ranking anyway, since the squared Euclidean distance equals 2 - 2·cosine, and many models normalize by default.
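The link between the two metrics follows from expanding the squared distance: for unit vectors, ‖a - b‖² = ‖a‖² + ‖b‖² - 2·(a·b) = 2 - 2·cos θ. A quick numeric check, using hand-picked 2-dimensional vectors as an illustration:

```python
import math

def normalize(v):
    """Scale v to length 1 (L2 norm)."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

a = normalize([3.0, 4.0])  # [0.6, 0.8]
b = normalize([1.0, 0.0])  # [1.0, 0.0]

cosine = sum(x * y for x, y in zip(a, b))                # dot product of unit vectors
squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))

# Both sides are 0.8 up to floating-point rounding
print(squared_distance, 2 - 2 * cosine)
```

Because the distance shrinks exactly as the cosine grows, sorting by smallest Euclidean distance and sorting by largest cosine return the same neighbors when vectors are normalized.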
Now the practical problem: if you have 10 million indexed chunks, comparing the query against every single one means roughly 10 billion multiply-adds per query at 1024 dimensions, which is too slow for interactive use. This is where ANN (Approximate Nearest Neighbor) search comes in. Algorithms like HNSW (available in OpenSearch and pgvector) build a graph structure that lets you find the nearest neighbors without scanning the entire index.
The "approximate" is not a flaw; it is a deliberate engineering choice. You trade 100% recall for millisecond latency. In practice, with well-tuned parameters, recall typically stays at 95-99% and latency drops from seconds to tens of milliseconds. For RAG, that trade-off is almost always worth it.
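To see what ANN avoids, here is the exact (brute-force) search it approximates, as a minimal sketch. It scores every vector against the query, so the cost grows linearly with the size of the index. The function name and sample vectors are illustrative, and vectors are assumed to be normalized so the dot product equals the cosine.

```python
import heapq

def exact_top_k(query, vectors, k):
    """Score every vector against the query and keep the k best (score, position) pairs."""
    scored = (
        (sum(q * x for q, x in zip(query, v)), i)
        for i, v in enumerate(vectors)
    )
    return heapq.nlargest(k, scored)

vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
top = exact_top_k([1.0, 0.0], vectors, k=2)
print(top)  # [(1.0, 0), (0.6, 2)]

# Work per query at the scale mentioned above: one multiply-add per value
multiply_adds_per_query = 10_000_000 * 1024
print(multiply_adds_per_query)  # 10240000000
```

An HNSW index answers the same question by walking a graph and visiting only a small fraction of the vectors, which is where the latency gain comes from.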
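Recall here has a concrete definition: of the true nearest neighbors that an exact search would return, what fraction did the approximate index also return? A minimal sketch, with made-up chunk ids for illustration:

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k results that the approximate search also found."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact_ids must not be empty")
    return len(exact & set(approx_ids)) / len(exact)

# Hypothetical ids: the ANN index missed one of the ten true neighbors
exact_ids = [3, 7, 12, 18, 21, 25, 30, 41, 56, 60]
approx_ids = [3, 7, 12, 18, 21, 25, 30, 41, 56, 99]

recall = recall_at_k(approx_ids, exact_ids)
print(recall)  # 0.9
```

Averaging this value over a sample of real queries is one way to confirm that index parameters are tuned well enough before going to production.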
In practice, the embedding model you choose during indexing is locked into your index forever — or until you reindex everything. Switching models later means reprocessing every chunk and rebuilding the index from scratch. That is why I treat this choice with the same weight as choosing a database: evaluate upfront, test with your real data, and document the decision. Amazon Titan Embeddings V2 is my default starting point on AWS projects for its low cost and native integration with Bedrock Knowledge Bases. Cohere Embed v3 comes in when I need robust multilingual support or when benchmarks on my specific domain justify the extra cost. I never choose an embedding model by popularity — I choose by recall on my corpus.
Dimension, normalization, and model choice
Every embedding has a dimension — the number of values in the vector. Titan Embeddings V2 supports 256, 512, or 1024 dimensions. Cohere Embed v3 uses 1024. Open-source models like all-MiniLM-L6-v2 use 384.
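With Titan Embeddings V2 the dimension is chosen per request. The sketch below assumes the Bedrock Runtime `invoke_model` call from boto3, the model id `amazon.titan-embed-text-v2:0`, and the request fields `inputText`, `dimensions`, and `normalize`; check them against the current Bedrock documentation before relying on them. The helper names and the default region are hypothetical.

```python
import json

# Assumed model id for Titan Text Embeddings V2 on Bedrock
TITAN_V2_MODEL_ID = "amazon.titan-embed-text-v2:0"

def build_titan_v2_body(text, dimensions=1024, normalize=True):
    """Build the JSON request body, rejecting dimensions the model does not offer."""
    if dimensions not in (256, 512, 1024):
        raise ValueError("Titan Embeddings V2 supports 256, 512, or 1024 dimensions")
    return json.dumps({"inputText": text, "dimensions": dimensions, "normalize": normalize})

def embed_with_titan_v2(text, dimensions=1024, region_name="us-east-1"):
    """Call Bedrock and return the embedding as a list of floats (needs AWS credentials)."""
    import boto3  # imported lazily so the body builder works without AWS dependencies

    client = boto3.client("bedrock-runtime", region_name=region_name)
    response = client.invoke_model(
        modelId=TITAN_V2_MODEL_ID,
        body=build_titan_v2_body(text, dimensions),
        contentType="application/json",
        accept="application/json",
    )
    payload = json.loads(response["body"].read())
    return payload["embedding"]

body = build_titan_v2_body("cancel subscription", dimensions=512)
```

Whatever dimension you pick at indexing time has to be used for queries as well, since vectors of different sizes cannot be compared.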
More dimensions = more capacity to capture semantic nuance, but also higher storage cost and search latency. For most enterprise RAG cases in Portuguese, 1024 dimensions is the sweet spot. Smaller dimensions (256-512) make sense when volume is huge and the domain is narrow.
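The storage side of this trade-off is simple arithmetic. The sketch below assumes 4-byte float32 values and counts raw vectors only; an ANN graph and the metadata add overhead on top, and the function name is illustrative.

```python
def raw_vector_storage_gb(num_vectors, dimensions, bytes_per_value=4):
    """Size of the raw vectors in GB (10^9 bytes), assuming float32 by default."""
    return num_vectors * dimensions * bytes_per_value / 1e9

at_1024 = raw_vector_storage_gb(10_000_000, 1024)
at_256 = raw_vector_storage_gb(10_000_000, 256)
print(at_1024)  # 40.96
print(at_256)   # 10.24
```

For 10 million chunks, dropping from 1024 to 256 dimensions cuts raw vector storage by a factor of four, which is why smaller dimensions become attractive at high volume.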
Normalization means the generated vector has L2 norm equal to 1. Almost all modern models normalize by default. This matters because it guarantees that cosine similarity and dot product give the same result — and simplifies index configuration.
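A short sketch of what normalization does and why it makes the dot product interchangeable with cosine similarity. The function name and sample vectors are illustrative.

```python
import math

def l2_normalize(v):
    """Return v scaled so that its L2 norm equals 1."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

raw_a, raw_b = [3.0, 4.0], [4.0, 3.0]

# Cosine similarity on the raw vectors: dot product divided by both lengths
cosine_raw = sum(x * y for x, y in zip(raw_a, raw_b)) / (
    math.sqrt(sum(x * x for x in raw_a)) * math.sqrt(sum(y * y for y in raw_b))
)

# Plain dot product on the normalized vectors: no division needed
a, b = l2_normalize(raw_a), l2_normalize(raw_b)
dot_normalized = sum(x * y for x, y in zip(a, b))

print(cosine_raw, dot_normalized)  # both 0.96 up to floating-point rounding
```

This is why an index over normalized vectors can be configured with the cheaper dot product metric and still return cosine rankings.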
Some concrete criteria for choosing a model:
- Language: Titan V2 and Cohere v3 have good Portuguese coverage. English-only models degrade silently on PT-BR text.
- Max input size: Titan V2 accepts up to 8192 tokens. Cohere v3 accepts up to 512 tokens per input (with proper chunking this is not a problem; see lesson 03).
- Cost per token: evaluate against your project's real volume before deciding.
- Inference latency: in synchronous pipelines, query embedding time adds to total RAG latency.
In lesson 04 you will see how this query vector fits into the complete retrieval and generation pipeline.
Embedding models available on Amazon Bedrock
| Model | Dimensions | Max input tokens | Multilingual | Best for |
|---|---|---|---|---|
| Amazon Titan Embeddings V2 | 256 / 512 / 1024 | 8192 | Yes (25+ languages) | General RAG on AWS, native Knowledge Bases integration |
| Cohere Embed v3 (English) | 1024 | 512 | No | English corpora with high semantic precision |
| Cohere Embed v3 (Multilingual) | 1024 | 512 | Yes (100+ languages) | Multilingual documents, high-quality PT-BR |
Frequently asked questions
Can I use the same embedding model for indexing and for the query?
Yes — and you must. Document and query need to be in the same vector space for the comparison to make sense. Using different models for each is a silent bug that destroys search quality.
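One way to keep this bug from staying silent is to record the model on the index and reject anything embedded with a different one. The class below is a hypothetical in-memory sketch, not the API of any real vector store; the model ids in the usage lines are placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class VectorIndex:
    """Toy index that remembers which embedding model produced its vectors."""
    model_id: str
    dimensions: int
    vectors: list = field(default_factory=list)

    def _check(self, vector, model_id):
        if model_id != self.model_id:
            raise ValueError(f"index was built with {self.model_id!r}, got {model_id!r}")
        if len(vector) != self.dimensions:
            raise ValueError(f"expected {self.dimensions} dimensions, got {len(vector)}")

    def add(self, vector, model_id):
        self._check(vector, model_id)
        self.vectors.append(vector)

    def search(self, query_vector, model_id, k=3):
        """Return the k best (score, position) pairs by dot product."""
        self._check(query_vector, model_id)
        scored = sorted(
            ((sum(q * x for q, x in zip(query_vector, v)), i) for i, v in enumerate(self.vectors)),
            reverse=True,
        )
        return scored[:k]

index = VectorIndex(model_id="model-a", dimensions=2)
index.add([1.0, 0.0], model_id="model-a")
hits = index.search([1.0, 0.0], model_id="model-a")  # works: same model on both sides
# index.search([1.0, 0.0], model_id="model-b") would raise ValueError
```

The dimension check alone is not enough: two different models can share a dimension (both Cohere v3 variants and Titan V2 can produce 1024 values), so the model id is what actually protects you.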
What is the difference between dot product and cosine similarity?
For normalized vectors (L2 norm = 1), they are equivalent. Most modern models normalize by default, so in practice you can use either. If your model does not normalize, use cosine explicitly.
Do I need a GPU to generate embeddings in production?
Not when you use models via API (Bedrock, Cohere API). Inference runs on the provider's infrastructure. If you host the model yourself (e.g., SageMaker with an open-source model), a GPU speeds up inference significantly, but for most RAG cases on AWS the API is simpler and cheaper.
What happens if my chunk is larger than the embedding model's token limit?
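It depends on the model and the API settings: the input is either rejected with an error or truncated. Silent truncation is the dangerous case, because you lose content beyond the limit with no warning. This is one of the reasons why your chunking strategy (lesson 03) needs to account for the chosen embedding model's token limit.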
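A simple guard before indexing is to flag chunks that exceed the limit. The sketch below uses whitespace word count as a rough stand-in for token count, which is an assumption: real tokenizers usually produce more tokens than words, so pass the model's own tokenizer as `count_tokens` when one is available. The function name is illustrative.

```python
def find_oversized_chunks(chunks, max_tokens, count_tokens=None):
    """Return (position, token_count) for every chunk above max_tokens."""
    if count_tokens is None:
        # Rough approximation only: counts whitespace-separated words
        count_tokens = lambda text: len(text.split())
    oversized = []
    for position, chunk in enumerate(chunks):
        tokens = count_tokens(chunk)
        if tokens > max_tokens:
            oversized.append((position, tokens))
    return oversized

chunks = ["a short chunk", "this chunk has more than five words in it"]
too_long = find_oversized_chunks(chunks, max_tokens=5)
print(too_long)  # [(1, 9)]
```

Running a check like this over the whole corpus with the real limit (512 for Cohere v3, 8192 for Titan V2) turns a silent loss of content into a visible list of chunks to split.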