Embeddings and Enterprise Semantic Search: the Invisible Engine Behind RAG
Embeddings seem like a technical detail — but the wrong choice degrades any enterprise AI agent. Objective guide for CTOs on models, dimensions, cost and what really matters in production.
Fabiano Brito
CEO & Founder
An embedding is a numeric representation of text that makes semantic search work. For enterprises, choosing the right embedding model is critical to avoiding mediocre search results and apparent AI hallucinations caused by bad context.
The default model (text-embedding-005 via Vertex AI) covers 90% of cases. The other 10% (heavy multilingual, code, legal) deserve a conscious choice. More important than the model: chunking, metadata, reranker and hybrid search.
The embedding is the most underrated RAG component. Teams debate prompts for weeks but use the default embedding model without comparing alternatives. The result is a mediocre semantic search engine and an agent that seems to be "hallucinating" when it is actually receiving bad context.
This post is the guide we'd give a CTO before approving the architecture.
What is an embedding (in 2 paragraphs)
An embedding model takes text ("services contract") and returns a vector of N numbers (e.g., 768). Texts with similar meaning become nearby vectors in space; unrelated texts land far apart.
Semantic search works like this: index all documents as vectors. Receive a question, vectorize it, find the K nearest vectors = top-K relevant documents. This is what is behind "search" in any modern RAG.
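The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production index: the 3-dimensional vectors are hand-made stand-ins for real embeddings, the document names are hypothetical, and cosine similarity is assumed as the distance measure (a common choice, though the text does not name one).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=3):
    """Return the k (doc_id, score) pairs most similar to query_vec."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors standing in for real embeddings (hypothetical data)
index = {
    "services_contract": [0.9, 0.1, 0.0],
    "supplier_agreement": [0.8, 0.2, 0.1],
    "vacation_policy": [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]  # stand-in for the vectorized question
results = top_k(query, index, k=2)
```

A real deployment would replace the linear scan with an approximate nearest-neighbor index, but the ranking logic is the same.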
Models available in Vertex AI
| Model | Dim | Cost (US$/1M tokens) | Recommended use |
|---|---|---|---|
| text-embedding-005 | 768 | 0.025 | General default, light multilingual |
| text-embedding-large-005 | 3072 | 0.10 | High precision, higher latency |
| text-multilingual-embedding-002 | 768 | 0.025 | Specific multilingual (20+ languages) |
| OSS (E5, BGE) | variable | own infra | Cases with ultra-sensitive on-prem data |
In Gemini Enterprise + Vertex AI Search, the choice comes configured. To customize, create a dedicated Data Store.
Practical decision criteria
PT-BR + occasional English only → text-embedding-005. Significant ES/ZH/AR/JA → text-multilingual-embedding-002.
General business → default. Dense legal/medical → test the large version. Source code → a dedicated code embedding model (never a general text embedding).
Larger vector = slower search + more storage. Conversational chat → 768 unless recall is critical.
At 100k+ docs, the difference between 768 vs 3072 becomes real storage and periodic reindexing cost.
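To make the volume argument concrete, here is a rough back-of-the-envelope calculation. It is a sketch under stated assumptions: vectors stored as 4-byte floats, 20 chunks per document and 400 tokens per chunk (both hypothetical), and the per-token prices taken from the table above. It counts raw vector bytes only and ignores index overhead.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors stored as 32-bit floats

def index_size_mb(num_vectors, dim, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw vector storage in MB, excluding any index overhead."""
    return num_vectors * dim * bytes_per_value / 1_000_000

def embedding_cost_usd(num_chunks, avg_tokens_per_chunk, usd_per_million_tokens):
    """Cost of embedding the whole corpus once (one full reindex)."""
    total_tokens = num_chunks * avg_tokens_per_chunk
    return total_tokens * usd_per_million_tokens / 1_000_000

num_docs = 100_000
chunks_per_doc = 20          # hypothetical average
avg_tokens_per_chunk = 400   # hypothetical average
num_chunks = num_docs * chunks_per_doc

size_768 = index_size_mb(num_chunks, 768)     # roughly 6 GB of raw vectors
size_3072 = index_size_mb(num_chunks, 3072)   # roughly 25 GB of raw vectors

cost_768 = embedding_cost_usd(num_chunks, avg_tokens_per_chunk, 0.025)
cost_3072 = embedding_cost_usd(num_chunks, avg_tokens_per_chunk, 0.10)

print(f"768 dims:  {size_768:,.0f} MB, ${cost_768:,.2f} per full reindex")
print(f"3072 dims: {size_3072:,.0f} MB, ${cost_3072:,.2f} per full reindex")
```

Under these assumptions both storage and reindexing cost scale by 4x when moving from 768 to 3072 dimensions; plug in your own chunk counts to see whether that matters for your corpus.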
What matters MORE than model choice
A mediocre embedding of a good chunk beats a good embedding of a bad chunk. The absolute priority is semantic chunking, not the model.
- Semantic chunking: respect document structure. See enterprise RAG.
- Pre-summarization: indexing (chunk + summary) doubles recall in technical corpora.
- Metadata as filter: category/date/author reduce the universe before semantic search.
- Reranking: top-50 from embedding → top-5 from cross-encoder. Vertex AI Search offers it natively.
- Hybrid search: semantic + lexical combination. Catches synonyms AND exact terms (contract number, proper name).
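One simple way to combine a semantic ranking with a lexical one is reciprocal rank fusion (RRF). The sketch below is an illustration of that technique, not the method Vertex AI Search uses internally; the document IDs and the two input rankings are hypothetical, and k=60 is the conventional default constant for RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list.

    Each document scores 1 / (k + rank) per list it appears in;
    documents ranked well by several retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results for the query "termination terms of contract 4512"
semantic = ["doc_termination_clause", "doc_contract_4512", "doc_renewal_policy"]
lexical = ["doc_contract_4512", "doc_invoice_4512", "doc_termination_clause"]

fused = reciprocal_rank_fusion([semantic, lexical])
print(fused)
```

Here the document that matches both the meaning and the exact contract number ends up first, which is the behavior hybrid search is meant to deliver. The fused list can then be passed to a reranker.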
How to evaluate your search quality
| Metric | What it measures | Target |
|---|---|---|
| Recall@10 | Right document in top 10? | > 90% |
| MRR | Mean reciprocal rank of the first relevant result | > 0.6 |
| nDCG@10 | Quality of result ordering | > 0.7 |
Build a gold set of 100–300 (question, expected document) pairs. Use Vertex AI Evaluation to run on every change. More in production agent evaluation.
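The three metrics are simple enough to compute by hand on a gold set. The following is a minimal standalone sketch, not the Vertex AI Evaluation API: it assumes one expected document per question (as in the gold set described above), so nDCG reduces to a binary-relevance form, and the `search` function and sample data are hypothetical placeholders for your retriever.

```python
import math

def rank_of(expected, ranked):
    """1-based position of the expected document, or None if absent."""
    try:
        return ranked.index(expected) + 1
    except ValueError:
        return None

def evaluate(gold_set, search, k=10):
    """Compute Recall@k, MRR and nDCG@k for (question, expected_doc) pairs."""
    hits = rr_sum = ndcg_sum = 0.0
    for question, expected in gold_set:
        rank = rank_of(expected, search(question))
        if rank is None:
            continue  # a miss contributes zero to every metric
        rr_sum += 1.0 / rank
        if rank <= k:
            hits += 1
            # One relevant doc per query: ideal DCG is 1, so nDCG = DCG
            ndcg_sum += 1.0 / math.log2(rank + 1)
    n = len(gold_set) or 1  # avoid division by zero on an empty gold set
    return {"recall@k": hits / n, "mrr": rr_sum / n, "ndcg@k": ndcg_sum / n}

# Hypothetical retriever output, keyed by question
fake_results = {
    "q1": ["a", "b", "c"],
    "q2": ["c", "b", "a"],
    "q3": ["a", "b", "c"],
}
gold_set = [("q1", "a"), ("q2", "b"), ("q3", "z")]
metrics = evaluate(gold_set, lambda q: fake_results[q], k=10)
print(metrics)
```

Running this on every change to chunking, model or reranker settings gives a comparable number per configuration instead of an impression.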
Common mistakes
When to build your own embedding
Very rare. Consider only if:
- Extremely specific domain (industrial chemistry, pharma patents).
- Volume justifies fine-tuning + serving cost (millions of queries/month).
- Team has a dedicated ML engineer.
In 95% of enterprise projects, default Vertex AI model + careful chunking beats amateur fine-tuning.
Frequently Asked Questions about Embeddings and Enterprise Semantic Search: the Invisible Engine Behind RAG
What is an embedding?
An embedding model takes a text and returns a vector of N numbers. Texts with similar meaning become vectors close in space; different texts are far apart.
How does semantic search work?
Semantic search indexes all documents as vectors, receives a question, transforms it into a vector, and finds the K nearest vectors, which represent the most relevant documents.
Which embedding models are available in Vertex AI?
The models available in Vertex AI include text-embedding-005, text-embedding-large-005, text-multilingual-embedding-002 and OSS models (E5, BGE).
Which embedding model should I use for Portuguese and English?
For Brazilian Portuguese and occasional English, the text-embedding-005 model is sufficient.
Is your RAG underperforming and you don't know whether the bottleneck is the embedding?
In 2 weeks: gold set + benchmark of 3 models + reranker on + report with action plan. Decision based on data, not hype.
