Gemini Embedding 2: One Vector for Text, Image, Video and Audio
Google's first multimodal embedding unifies text, image, video and audio in a single 3072-dim vector. What changes for enterprise RAG architectures.
Fabiano Brito
CEO & Founder
For years, building a multimodal semantic search pipeline meant maintaining separate models per data type — one for text, one for image, one for audio — then trying to align those vector spaces with fragile fusion layers. On March 10, 2026, Google ended that era with the launch of Gemini Embedding 2: the first natively multimodal embedding model in the Gemini family, in Public Preview via Gemini API and Vertex AI.
The problem it solves
Enterprise data is multimodal by nature. Customer support involves text tickets, call recordings, error screenshots and PDF manuals. A product analytics system handles demo videos, written specs and catalog images. Indexing all of this semantically required separate pipelines — and quality dropped at the seam between modalities.
Gemini Embedding 2 fixes this at the foundational layer: by training a single model across all modalities, the vector distance between “an audio recording of a customer complaining about latency” and “a knowledge-base article on performance tuning” is semantically coherent — with no intermediate translation layer.
"The bridge between different media types has finally been built. Use this with complex document similarity tasks, and the results in semantic proximity should be a massive leap forward for RAG pipelines."
What’s in the spec
The gemini-embedding-2-preview model is built on the Gemini architecture and inherits its multimodal understanding. Per-modality limits:
| Modality | Per-request limit | Supported formats |
|---|---|---|
| Text | 8,192 tokens | Any UTF-8 text |
| Image | Up to 6 images | PNG, JPEG |
| Audio | Up to 80 seconds | MP3, WAV |
| Video | Up to 128 seconds | MP4, MOV (H264, H265, AV1, VP9) |
| Document (PDF) | Up to 6 pages | PDF (visual + text) |
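The limits in the table can be checked before a request is ever sent. The sketch below is one possible pre-flight check, not part of any SDK: `RequestShape`, `limit_violations` and `windows_needed` are hypothetical names, and the numbers are simply copied from the table above.

```python
import math
from dataclasses import dataclass

# Limits copied from the table above; the constant names are illustrative.
MAX_TEXT_TOKENS = 8192
MAX_IMAGES = 6
MAX_AUDIO_SECONDS = 80
MAX_VIDEO_SECONDS = 128
MAX_PDF_PAGES = 6


@dataclass
class RequestShape:
    """Hypothetical summary of what one embedding request would carry."""
    text_tokens: int = 0
    images: int = 0
    audio_seconds: float = 0.0
    video_seconds: float = 0.0
    pdf_pages: int = 0


def limit_violations(req: RequestShape) -> list[str]:
    """Return one message per limit that the request exceeds."""
    checks = [
        ("text tokens", req.text_tokens, MAX_TEXT_TOKENS),
        ("images", req.images, MAX_IMAGES),
        ("audio seconds", req.audio_seconds, MAX_AUDIO_SECONDS),
        ("video seconds", req.video_seconds, MAX_VIDEO_SECONDS),
        ("PDF pages", req.pdf_pages, MAX_PDF_PAGES),
    ]
    return [f"{name}: {value} exceeds {limit}"
            for name, value, limit in checks if value > limit]


def windows_needed(duration_seconds: float, max_seconds: float) -> int:
    """Count the non-overlapping windows needed to cover a recording."""
    return math.ceil(duration_seconds / max_seconds)


request = RequestShape(text_tokens=1200, images=7, audio_seconds=95)
for message in limit_violations(request):
    print(message)

# A one-hour call split into 80-second audio windows.
print(windows_needed(3600, MAX_AUDIO_SECONDS))  # 45
```

Anything longer than a limit has to be split or sampled before indexing, so a one-hour call recording becomes dozens of vectors rather than one. That multiplier is worth including in any storage estimate.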
A key architectural detail: the model accepts interleaved input. You can combine multiple modalities in a single request (text + image + audio) and receive one aggregated embedding that captures the relationship between them. This is different from generating separate embeddings and averaging them.
Matryoshka: flexibility without sacrifice
Gemini Embedding 2 incorporates Matryoshka Representation Learning (MRL), in which semantic information is nested hierarchically inside the vector. The first 768 values already hold a useful representation; the first 1536 add nuance; the full 3072 deliver maximum fidelity.
Pick dimensionality at inference time via the output_dimensionality parameter:
```python
from google import genai
from google.genai import types

client = genai.Client()

# Reduced-dimension embedding (75% storage savings)
result = client.models.embed_content(
    model="gemini-embedding-2-preview",
    contents="Q1 2026 performance report",
    config=types.EmbedContentConfig(output_dimensionality=768),
)
print(f"Dimensions: {len(result.embeddings[0].values)}")  # 768
```
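Because MRL nests information in the leading values, a stored full-length vector can in principle also be shortened after the fact by keeping a prefix. The sketch below shows that idea on a toy vector. It assumes prefix truncation is an acceptable stand-in for requesting a smaller `output_dimensionality`, which this article does not confirm, and `truncate_and_normalize` is a hypothetical helper name. Rescaling to unit length afterwards is a safe default when you rank by cosine or dot product.

```python
import math


def truncate_and_normalize(vector: list[float], dims: int) -> list[float]:
    """Keep the first `dims` values and rescale the result to unit length."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


# Toy 8-value vector standing in for a 3072-value embedding.
full = [0.5, -0.5, 0.5, 0.5, 0.1, -0.1, 0.05, 0.02]
short = truncate_and_normalize(full, 4)

print(len(short))                            # 4
print(round(sum(x * x for x in short), 6))   # 1.0
```

Keeping the full vector in cold storage and serving a truncated copy lets you change the serving dimensionality later without calling the model again.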
Storage impact for a 10-million-document corpus:
| Dimensionality | Storage (10M docs) | Recommended use case |
|---|---|---|
| 3072 (default) | ~117 GB | High-precision RAG, legal/medical search, dedup |
| 1536 | ~58 GB | General semantic search, content classification |
| 768 | ~29 GB | Real-time recommendation, low-latency filtering |
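The figures in the table can be sanity-checked with simple arithmetic. The sketch below assumes each value is stored as a 4-byte float32 and ignores index structures, metadata and replication, so treat it as a lower bound. `raw_vector_bytes` is an illustrative helper name.

```python
BYTES_PER_FLOAT32 = 4


def raw_vector_bytes(num_vectors: int, dims: int,
                     bytes_per_value: int = BYTES_PER_FLOAT32) -> int:
    """Bytes needed for the raw vector values alone, with no index overhead."""
    return num_vectors * dims * bytes_per_value


for dims in (3072, 1536, 768):
    size = raw_vector_bytes(10_000_000, dims)
    print(f"{dims:>4} dims: {size / 1e9:6.1f} GB  ({size / 2**30:6.1f} GiB)")
```

For 3072 dimensions this comes to roughly 123 GB, or about 114 GiB. The table's values sit between the two unit conventions, and the 4:2:1 ratio between rows holds either way.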
Embeddings from gemini-embedding-2-preview are not interchangeable with those from gemini-embedding-001. Vector spaces from different models can't be compared directly, so if you migrate, plan for the cost and time to re-index the entire corpus.
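One way to make that incompatibility hard to trip over is to store the model name next to every vector and refuse cross-model comparisons. The sketch below is a hypothetical in-memory illustration. `StoredVector`, `dot_product` and `needs_reindex` are made-up names, and a real vector database would hold the model name as a metadata field instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredVector:
    """Hypothetical index record that remembers which model produced it."""
    doc_id: str
    model: str
    values: tuple[float, ...]


def dot_product(a: StoredVector, b: StoredVector) -> float:
    """Compare two vectors only if they share a model and a dimensionality."""
    if a.model != b.model:
        raise ValueError(f"cannot compare {a.model} with {b.model}")
    if len(a.values) != len(b.values):
        raise ValueError("vectors have different dimensionality")
    return sum(x * y for x, y in zip(a.values, b.values))


def needs_reindex(corpus: list[StoredVector], target_model: str) -> list[str]:
    """List the documents whose vectors came from a different model."""
    return [v.doc_id for v in corpus if v.model != target_model]


# Toy two-value vectors; real embeddings have hundreds or thousands of values.
corpus = [
    StoredVector("doc-1", "gemini-embedding-001", (0.6, 0.8)),
    StoredVector("doc-2", "gemini-embedding-2-preview", (1.0, 0.0)),
]
print(needs_reindex(corpus, "gemini-embedding-2-preview"))  # ['doc-1']
```

During a staged migration the same tag tells you how much of the corpus still has to be re-embedded.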
Multimodal RAG in practice
The most immediate impact is pipeline simplification. The old pattern required multiple embedding models, custom fusion logic and modality-separated vector spaces. The new pattern is radically cleaner:
```python
from google import genai
from google.genai import types

client = genai.Client()

# Index a PDF directly (no manual OCR)
with open('financial_report.pdf', 'rb') as f:
    pdf_bytes = f.read()
pdf_embedding = client.models.embed_content(
    model='gemini-embedding-2-preview',
    contents=[types.Part.from_bytes(data=pdf_bytes, mime_type='application/pdf')],
)

# Index a meeting recording (no transcription)
with open('board_meeting.mp3', 'rb') as f:
    audio_bytes = f.read()
audio_embedding = client.models.embed_content(
    model='gemini-embedding-2-preview',
    contents=[types.Part.from_bytes(data=audio_bytes, mime_type='audio/mpeg')],
)

# Query with text: compared against PDF and audio in the same space
query_embedding = client.models.embed_content(
    model='gemini-embedding-2-preview',
    contents="What revenue targets were discussed in Q4?",
)

# All vectors live in the same space: unified search
```
The critical point: the text query is compared directly with PDF and audio embeddings — with no intermediate translation layer.
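Once every item is a vector in the same space, the search step reduces to ranking by similarity, whatever the source modality. The sketch below shows that ranking with cosine similarity over an in-memory list. The three-value vectors are hand-made toy data, not model output, `office_party.mp4` is an invented distractor, and a production system would delegate this step to a vector database.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query: list[float], index: list[tuple[str, str, list[float]]],
           top_k: int = 2) -> list[tuple[float, str, str]]:
    """Rank every indexed item against the query, regardless of modality."""
    scored = [(cosine(query, vec), name, modality)
              for name, modality, vec in index]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# (name, modality, toy vector); real entries would hold model embeddings.
index = [
    ("financial_report.pdf", "pdf", [0.9, 0.1, 0.0]),
    ("board_meeting.mp3", "audio", [0.7, 0.6, 0.1]),
    ("office_party.mp4", "video", [0.0, 0.2, 0.9]),
]
query = [0.8, 0.3, 0.0]  # stands in for the embedded text question

results = search(query, index)
for score, name, modality in results:
    print(f"{score:.3f}  {modality:<5}  {name}")
```

The ranking function never branches on modality, which is the practical meaning of a unified index.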
High-value use cases
The combination of native multimodality, an 8,192-token window and 100+ languages opens up cases that were previously infeasible or economically prohibitive:
| Sector | Use case | What changes |
|---|---|---|
| Legal | Contracts + recorded hearings | PDFs and audio in one index; clause search retrieves both |
| Healthcare | Multimodal EHR | PDF reports, imaging exams and voice notes indexed together |
| Retail | Visual + text product search | Customer uploads a photo and gets results by visual similarity and description |
| Education | Lecture repository | Videos, slides and transcripts in one space; students search by concept |
| Financial | Earnings calls + reports | Correlate conference calls with PDF reports, no transcription pipeline |
Where you can use it today
Gemini Embedding 2 is available via the Gemini API (development) and Vertex AI (production with SLA, VPC Service Controls and Vector Search). Support is documented across the major libraries:
- LangChain and LlamaIndex — native integration via Google’s embedding class
- Haystack — component available on the hub
- Weaviate, Qdrant, ChromaDB — Google vectorization module
- Vertex AI Vector Search — managed integration with auto-scaling
For Google Cloud teams, the Gemini Embedding 2 + Vertex AI Vector Search + Gemini 2.5 Pro stack delivers a fully managed RAG pipeline with no external dependencies.
Adoption-readiness checklist
- Map which data types (text, image, audio, video, PDF) live in your corpus and the volume per type.
- Estimate cost and time to re-index the existing corpus if migrating from gemini-embedding-001.
- Decide whether 768, 1536 or 3072 dimensions fits your quality/cost trade-off.
- Choose the platform: Gemini API for development/preview; Vertex AI for production-grade SLA.
- Build an eval set (queries + relevant documents) to measure real-world improvement in your domain before migrating everything.
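The eval set from the last item can be scored with a simple metric such as recall@k. The sketch below is a minimal version that assumes each query maps to a set of relevant document IDs and that each system under test returns a ranked list of IDs. All IDs and both runs are invented toy data, not measurements of any model.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        raise ValueError("each query needs at least one relevant document")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(run: dict[str, list[str]],
                     eval_set: dict[str, set[str]], k: int) -> float:
    """Average recall@k over every query in the eval set."""
    scores = [recall_at_k(run.get(query, []), relevant, k)
              for query, relevant in eval_set.items()]
    return sum(scores) / len(scores)


# Toy eval set: query -> IDs of documents a human judged relevant.
eval_set = {"q1": {"a", "b"}, "q2": {"c"}}

# Toy ranked results from two hypothetical configurations.
baseline_run = {"q1": ["a", "x", "b"], "q2": ["y", "z", "c"]}
candidate_run = {"q1": ["a", "b", "x"], "q2": ["c", "y", "z"]}

print(mean_recall_at_k(baseline_run, eval_set, k=2))   # 0.25
print(mean_recall_at_k(candidate_run, eval_set, k=2))  # 1.0
```

Running the same eval set against each dimensionality, and against your current embedding model, turns the quality/cost decision into a comparison of numbers from your own domain.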
Evaluating Gemini Embedding 2 for your architecture?
We work with clients in healthcare, legal, education and finance, sectors where multimodal data is the norm. We run the feasibility analysis, estimate the re-indexing cost and design the final architecture.
