Embeddings and Enterprise Semantic Search: the Invisible Engine Behind RAG
Embeddings seem like a technical detail — but the wrong choice degrades any enterprise AI agent. Objective guide for CTOs on models, dimensions, cost and what really matters in production.
Fabiano Brito
CEO & Founder
TL;DR: the default model (text-embedding-005 via Vertex AI) covers 90% of cases. The other 10% (heavily multilingual, code, legal) deserve a conscious choice. More important than the model: chunking, metadata, reranking and hybrid search.
Embeddings are the most underrated component of RAG. Teams debate prompts for weeks but ship the default embedding model without comparing alternatives. The result: a mediocre semantic search engine and an agent that seems to be "hallucinating" when it is actually receiving bad context.
This post is the guide we'd give a CTO before approving the architecture.
What is embedding (in 2 paragraphs)
An embedding model takes text ("services contract") and returns a vector of N numbers (e.g., 768). Texts with similar meaning map to nearby vectors in that space; texts with different meanings land far apart.
Semantic search builds on this: index every document as a vector. When a question arrives, vectorize it and find the K nearest vectors — those are the top-K relevant documents. This is what sits behind "search" in any modern RAG.
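The nearest-neighbor step above can be sketched in a few lines. This is a toy illustration with hand-made 3-dimensional vectors (real models return 768+ dimensions, and real systems use an embedding API plus a vector index):

```python
import math

def cosine(a, b):
    # Cosine similarity: ~1.0 for same direction, ~0 for unrelated vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, index, k=2):
    # index: list of (doc_id, vector) pairs produced at indexing time.
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]

# Hypothetical mini-index of three documents.
index = [
    ("contract.pdf", [0.9, 0.1, 0.0]),
    ("invoice.pdf", [0.2, 0.8, 0.1]),
    ("handbook.pdf", [0.1, 0.2, 0.9]),
]
print(top_k([0.85, 0.15, 0.05], index, k=2))
```

A query vector close to `contract.pdf`'s vector retrieves it first; at production scale the brute-force loop is replaced by an approximate nearest-neighbor index.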
Models available in Vertex AI
| Model | Dim | Cost (US$/1M tokens) | Recommended use |
|---|---|---|---|
| text-embedding-005 | 768 | 0.025 | General default, light multilingual |
| text-embedding-large-005 | 3072 | 0.10 | High precision, higher latency |
| text-multilingual-embedding-002 | 768 | 0.025 | Specific multilingual (20+ languages) |
| OSS (E5, BGE) | variable | own infra | Ultra-sensitive on-prem data |
In Gemini Enterprise + Vertex AI Search, the choice comes pre-configured. To customize it, create a dedicated Data Store.
Practical decision criteria
- Language: PT-BR plus occasional English → text-embedding-005. Significant ES/ZH/AR/JA → text-multilingual-embedding-002.
- Domain: general business → the default. Dense legal/medical text → test the large version. Source code → a dedicated code-embedding model (never a text embedding).
- Latency and cost: larger vectors mean slower search and more storage. For conversational chat, stay at 768 unless recall is critical.
- Scale: at 100k+ docs, the difference between 768 and 3072 dimensions becomes a real cost in storage and periodic reindexing.
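A back-of-the-envelope for that storage difference, with assumed figures (100k documents, ~20 chunks each, float32 vectors):

```python
def index_size_gb(docs, chunks_per_doc, dims, bytes_per_float=4):
    # Raw vector storage only; real indexes (HNSW graphs, replicas) add overhead.
    return docs * chunks_per_doc * dims * bytes_per_float / 1e9

for dims in (768, 3072):
    print(f"{dims} dims: {index_size_gb(100_000, 20, dims):.1f} GB")
```

4x the dimensions means 4x the raw vector storage (here roughly 6 GB vs. 25 GB), plus proportionally more arithmetic per distance computation on every query.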
What matters MORE than model choice
A mediocre embedding of a good chunk beats an excellent embedding of a bad chunk. The absolute priority is semantic chunking, not the model.
- Semantic chunking: respect document structure. See enterprise RAG.
- Pre-summarization: indexing (chunk + summary) doubles recall in technical corpora.
- Metadata as filter: category/date/author reduce the universe before semantic search.
- Reranking: top-50 from the embedding search → top-5 from a cross-encoder. Vertex AI Search has this natively.
- Hybrid search: semantic + lexical combination. Catches synonyms AND exact terms (contract number, proper name).
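One common way to merge the semantic and lexical result lists is Reciprocal Rank Fusion. A sketch of the idea (Vertex AI Search handles the combination internally; the doc IDs below are hypothetical):

```python
def rrf(rankings, k=60):
    # rankings: ranked doc-id lists, e.g. [semantic_top50, lexical_top50].
    # Each doc earns 1/(k + rank) per list it appears in; k=60 is the
    # constant from the original RRF formulation.
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["doc_a", "doc_b", "doc_c"]  # embedding-similarity order
lexical = ["doc_c", "doc_a", "doc_d"]   # exact-term (BM25-style) order
print(rrf([semantic, lexical]))
```

Documents that rank well in both lists (here `doc_a` and `doc_c`) float to the top, which is exactly the behavior that catches synonyms and exact terms at once.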
How to evaluate your search quality
| Metric | What it measures | Target |
|---|---|---|
| Recall@10 | Right document in top 10? | > 90% |
| MRR | Mean reciprocal rank of the first relevant result | > 0.6 |
| nDCG@10 | Quality of result ordering | > 0.7 |
Build a gold set of 100–300 (question, expected document) pairs. Use Vertex AI Evaluation to run on every change. More in production agent evaluation.
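Given such a gold set, all three metrics can be computed in plain Python. A sketch assuming one expected document per question (`gold` maps question → expected doc, `results` maps question → ranked doc IDs returned by the search):

```python
import math

def recall_at_k(gold, results, k=10):
    hits = sum(1 for q, doc in gold.items() if doc in results[q][:k])
    return hits / len(gold)

def mrr(gold, results):
    # 1/rank of the expected doc, 0 if absent, averaged over questions.
    total = 0.0
    for q, doc in gold.items():
        if doc in results[q]:
            total += 1.0 / (results[q].index(doc) + 1)
    return total / len(gold)

def ndcg_at_k(gold, results, k=10):
    # With a single relevant doc, ideal DCG is 1/log2(2) = 1, so no division needed.
    total = 0.0
    for q, doc in gold.items():
        if doc in results[q][:k]:
            total += 1.0 / math.log2(results[q].index(doc) + 2)
    return total / len(gold)

# Toy gold set with two questions.
gold = {"q1": "d1", "q2": "d7"}
results = {"q1": ["d1", "d3"], "q2": ["d2", "d7", "d9"]}
print(recall_at_k(gold, results), mrr(gold, results), ndcg_at_k(gold, results))
```

Run this on every indexing change (chunking, model, reranker) and the bottleneck stops being a matter of opinion.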
When to build your own embedding
Very rare. Consider only if:
- Extremely specific domain (industrial chemistry, pharma patents).
- Volume justifies fine-tuning + serving cost (millions of queries/month).
- Team has a dedicated ML engineer.
In 95% of enterprise projects, default Vertex AI model + careful chunking beats amateur fine-tuning.
Is your RAG performing poorly and you don't know whether the bottleneck is the embedding?
In 2 weeks: a gold set, a benchmark of 3 models, the reranker switched on, and a report with an action plan. A decision based on data, not hype.
