Embeddings and Enterprise Semantic Search: the Invisible Engine Behind RAG
Embeddings seem like a technical detail — but the wrong choice degrades any enterprise AI agent. Objective guide for CTOs on models, dimensions, cost and what really matters in production.
Fabiano Brito
CEO & Founder
TL;DR: the default model (text-embedding-005 via Vertex AI) covers 90% of cases. The other 10% (heavily multilingual, code, legal) deserve a conscious choice. More important than the model: chunking, metadata, reranking and hybrid search.
Embeddings are the most underrated component of RAG. Teams debate prompts for weeks but ship the default embedding model without comparing alternatives. The result: a mediocre semantic search engine and an agent that seems to be "hallucinating" when it is actually receiving bad context.
This post is the guide we'd give a CTO before approving the architecture.
What is embedding (in 2 paragraphs)
An embedding model takes text ("services contract") and returns a vector of N numbers (e.g., 768). Texts with similar meaning map to nearby vectors in that space; texts with different meanings land far apart.
Semantic search builds on this: index every document as a vector. When a question arrives, vectorize it and find the K nearest vectors — those are the top-K relevant documents. This is what sits behind "search" in any modern RAG.
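The nearest-neighbor step above can be sketched in a few lines. This is a toy illustration with hand-made 3-dimensional vectors (real models return 768+ dimensions, and real systems use an embedding API plus a vector index):

```python
import math

def cosine(a, b):
    # Cosine similarity: ~1.0 for same direction, ~0 for unrelated vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, index, k=2):
    # index: list of (doc_id, vector) pairs produced at indexing time.
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]

# Hypothetical mini-index of three documents.
index = [
    ("contract.pdf", [0.9, 0.1, 0.0]),
    ("invoice.pdf", [0.2, 0.8, 0.1]),
    ("handbook.pdf", [0.1, 0.2, 0.9]),
]
print(top_k([0.85, 0.15, 0.05], index, k=2))
```

A query vector close to `contract.pdf`'s vector retrieves it first; at production scale the brute-force loop is replaced by an approximate nearest-neighbor index.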
Models available in Vertex AI
| Model | Dim | Cost (US$/1M tokens) | Recommended use |
|---|---|---|---|
| text-embedding-005 | 768 | 0.025 | General default, light multilingual |
| text-embedding-large-005 | 3072 | 0.10 | High precision, higher latency |
| text-multilingual-embedding-002 | 768 | 0.025 | Specific multilingual (20+ languages) |
| OSS (E5, BGE) | variable | own infra | Ultra-sensitive on-prem data |
In Gemini Enterprise + Vertex AI Search, the choice comes pre-configured. To customize it, create a dedicated Data Store.
Practical decision criteria
- Language: PT-BR plus occasional English → text-embedding-005. Significant ES/ZH/AR/JA → text-multilingual-embedding-002.
- Domain: general business → the default. Dense legal/medical text → test the large version. Source code → a dedicated code-embedding model (never a text embedding).
- Latency and cost: larger vectors mean slower search and more storage. For conversational chat, stay at 768 unless recall is critical.
- Scale: at 100k+ docs, the difference between 768 and 3072 dimensions becomes a real cost in storage and periodic reindexing.
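A back-of-the-envelope for that storage difference, with assumed figures (100k documents, ~20 chunks each, float32 vectors):

```python
def index_size_gb(docs, chunks_per_doc, dims, bytes_per_float=4):
    # Raw vector storage only; real indexes (HNSW graphs, replicas) add overhead.
    return docs * chunks_per_doc * dims * bytes_per_float / 1e9

for dims in (768, 3072):
    print(f"{dims} dims: {index_size_gb(100_000, 20, dims):.1f} GB")
```

4x the dimensions means 4x the raw vector storage (here roughly 6 GB vs. 25 GB), plus proportionally more arithmetic per distance computation on every query.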
What matters MORE than model choice
A mediocre embedding of a good chunk beats an excellent embedding of a bad chunk. The absolute priority is semantic chunking, not the model.
- Semantic chunking: respect document structure. See enterprise RAG.
- Pre-summarization: indexing (chunk + summary) doubles recall in technical corpora.
- Metadata as filter: category/date/author reduce the universe before semantic search.
- Reranking: top-50 from the embedding search → top-5 from a cross-encoder. Vertex AI Search has this natively.
- Hybrid search: semantic + lexical combination. Catches synonyms AND exact terms (contract number, proper name).
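One common way to merge the semantic and lexical result lists is Reciprocal Rank Fusion. A sketch of the idea (Vertex AI Search handles the combination internally; the doc IDs below are hypothetical):

```python
def rrf(rankings, k=60):
    # rankings: ranked doc-id lists, e.g. [semantic_top50, lexical_top50].
    # Each doc earns 1/(k + rank) per list it appears in; k=60 is the
    # constant from the original RRF formulation.
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["doc_a", "doc_b", "doc_c"]  # embedding-similarity order
lexical = ["doc_c", "doc_a", "doc_d"]   # exact-term (BM25-style) order
print(rrf([semantic, lexical]))
```

Documents that rank well in both lists (here `doc_a` and `doc_c`) float to the top, which is exactly the behavior that catches synonyms and exact terms at once.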
How to evaluate your search quality
| Metric | What it measures | Target |
|---|---|---|
| Recall@10 | Right document in top 10? | > 90% |
| MRR | Mean reciprocal rank of the first relevant result | > 0.6 |
| nDCG@10 | Quality of result ordering | > 0.7 |
Build a gold set of 100–300 (question, expected document) pairs. Use Vertex AI Evaluation to run on every change. More in production agent evaluation.
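Given such a gold set, all three metrics can be computed in plain Python. A sketch assuming one expected document per question (`gold` maps question → expected doc, `results` maps question → ranked doc IDs returned by the search):

```python
import math

def recall_at_k(gold, results, k=10):
    hits = sum(1 for q, doc in gold.items() if doc in results[q][:k])
    return hits / len(gold)

def mrr(gold, results):
    # 1/rank of the expected doc, 0 if absent, averaged over questions.
    total = 0.0
    for q, doc in gold.items():
        if doc in results[q]:
            total += 1.0 / (results[q].index(doc) + 1)
    return total / len(gold)

def ndcg_at_k(gold, results, k=10):
    # With a single relevant doc, ideal DCG is 1/log2(2) = 1, so no division needed.
    total = 0.0
    for q, doc in gold.items():
        if doc in results[q][:k]:
            total += 1.0 / math.log2(results[q].index(doc) + 2)
    return total / len(gold)

# Toy gold set with two questions.
gold = {"q1": "d1", "q2": "d7"}
results = {"q1": ["d1", "d3"], "q2": ["d2", "d7", "d9"]}
print(recall_at_k(gold, results), mrr(gold, results), ndcg_at_k(gold, results))
```

Run this on every indexing change (chunking, model, reranker) and the bottleneck stops being a matter of opinion.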
When to build your own embedding
Very rare. Consider only if:
- Extremely specific domain (industrial chemistry, pharma patents).
- Volume justifies fine-tuning + serving cost (millions of queries/month).
- Team has a dedicated ML engineer.
In 95% of enterprise projects, default Vertex AI model + careful chunking beats amateur fine-tuning.
Is your RAG performing poorly and you don't know whether the bottleneck is the embedding?
In 2 weeks: a gold set, a benchmark of 3 models, the reranker switched on, and a report with an action plan. A decision based on data, not hype.
