Gemini Embedding 2: One Vector for Text, Image, Video and Audio

Gemini Embedding 2 is Google’s first natively multimodal embedding model. It maps text, images, video, audio, and PDFs into a single 3072-dimension vector space, so enterprises no longer need to maintain separate models and fragile fusion layers per data type. That simplifies semantic search and RAG pipelines.

TL;DR: gemini-embedding-2-preview maps text, images, video, audio and PDFs into a single 3072-dimension vector space. It supports Matryoshka output sizes (768/1536/3072) and an 8,192-token context window, sets new MTEB records (84.0 code / 69.9 multilingual) and ships with native integrations for LangChain, LlamaIndex, Vertex AI Vector Search and the major vector DBs.

For years, building a multimodal semantic search pipeline meant maintaining separate models per data type — one for text, one for image, one for audio — then trying to align those vector spaces with fragile fusion layers. On March 10, 2026, Google ended that era with the launch of Gemini Embedding 2: the first natively multimodal embedding model in the Gemini family, in Public Preview via Gemini API and Vertex AI.

The numbers behind the migration

Metric	Value	Note
MTEB Code	84.0	New absolute record
MTEB Multilingual	69.9	100+ native languages
Context window	8,192 tokens	2× most competitors

The problem it solves

Enterprise data is multimodal by nature. Customer support involves text tickets, call recordings, error screenshots and PDF manuals. A product analytics system handles demo videos, written specs and catalog images. Indexing all of this semantically required separate pipelines — and quality dropped at the seam between modalities.

Gemini Embedding 2 fixes this at the foundational layer: by training a single model across all modalities, the vector distance between “an audio recording of a customer complaining about latency” and “a knowledge-base article on performance tuning” is semantically coherent — with no intermediate translation layer.

"The bridge between different media types has finally been built. Use this with complex document similarity tasks, and the results in semantic proximity should be a massive leap forward for RAG pipelines."

— Eric Dong, Engineer @ Google Cloud AI

What’s in the spec

The gemini-embedding-2-preview model is built on the Gemini architecture and inherits its multimodal understanding. Per-modality limits:

Modality	Per-request limit	Supported formats
Text	8,192 tokens	Any UTF-8 text
Image	Up to 6 images	PNG, JPEG
Audio	Up to 80 seconds	MP3, WAV
Video	Up to 128 seconds	MP4, MOV (H264, H265, AV1, VP9)
Document (PDF)	Up to 6 pages	PDF (visual + text)
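
The per-request limits in the table lend themselves to a client-side check before anything is sent to the API. The sketch below is only an illustration: `RequestSummary` and `limit_violations` are hypothetical names rather than SDK types, and it assumes you already know the token count, durations and page count of your inputs. Whether the limits change when modalities are interleaved in one request is not stated here, so each limit is checked on its own.

```python
from dataclasses import dataclass

# Per-request limits copied from the table above.
MAX_TEXT_TOKENS = 8192
MAX_IMAGES = 6
MAX_AUDIO_SECONDS = 80
MAX_VIDEO_SECONDS = 128
MAX_PDF_PAGES = 6


@dataclass
class RequestSummary:
    """Hypothetical description of one embedding request (not an SDK type)."""
    text_tokens: int = 0
    images: int = 0
    audio_seconds: float = 0.0
    video_seconds: float = 0.0
    pdf_pages: int = 0


def limit_violations(req: RequestSummary) -> list[str]:
    """Return one readable message for every limit the request exceeds."""
    checks = [
        ("text tokens", req.text_tokens, MAX_TEXT_TOKENS),
        ("images", req.images, MAX_IMAGES),
        ("audio seconds", req.audio_seconds, MAX_AUDIO_SECONDS),
        ("video seconds", req.video_seconds, MAX_VIDEO_SECONDS),
        ("PDF pages", req.pdf_pages, MAX_PDF_PAGES),
    ]
    return [
        f"{name}: {value} exceeds limit of {limit}"
        for name, value, limit in checks
        if value > limit
    ]


if __name__ == "__main__":
    # A 95-second recording is over the 80-second audio limit.
    request = RequestSummary(text_tokens=500, images=2, audio_seconds=95.0)
    for problem in limit_violations(request):
        print(problem)
```

Inputs that exceed a limit would need to be split (for example, a long recording into segments of 80 seconds or less) and embedded as separate items.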

A key architectural detail: the model accepts interleaved input. You can combine multiple modalities in a single request (text + image + audio) and receive one aggregated embedding that captures the relationship between them. This is different from generating a separate embedding per modality and averaging the results.

Matryoshka: flexibility without sacrifice

Gemini Embedding 2 incorporates Matryoshka Representation Learning (MRL), which nests semantic information hierarchically inside the vector. The first 768 values already hold a useful representation; the first 1536 add nuance; the full 3072 deliver maximum fidelity.

Pick dimensionality at inference time via the output_dimensionality parameter:

Python

from google import genai
from google.genai import types

# The client reads the API key from the environment.
client = genai.Client()

# Reduced-dimension embedding (75% storage savings compared with 3072)
result = client.models.embed_content(
    model="gemini-embedding-2-preview",
    contents="Q1 2026 performance report",
    config=types.EmbedContentConfig(output_dimensionality=768),
)
print(f"Dimensions: {len(result.embeddings[0].values)}")  # 768
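
Because MRL nests information in the leading values, a full-length vector that is already stored can in principle be shortened client-side by keeping its first values and rescaling to unit length. The sketch below shows that step with a synthetic vector. It rests on two assumptions that are not confirmed here: that a client-side prefix behaves like the output of `output_dimensionality`, and that re-normalizing is appropriate for your similarity metric. `truncate_and_normalize` is a hypothetical helper name.

```python
import math


def truncate_and_normalize(vector: list[float], dims: int) -> list[float]:
    """Keep the first `dims` values and rescale the result to unit length."""
    if not 0 < dims <= len(vector):
        raise ValueError(f"dims must be between 1 and {len(vector)}")
    head = vector[:dims]
    norm = math.sqrt(sum(v * v for v in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero vector")
    return [v / norm for v in head]


if __name__ == "__main__":
    # Synthetic stand-in for a stored 3072-value embedding (not model output).
    full = [math.sin(i) for i in range(3072)]
    short = truncate_and_normalize(full, 768)
    print(len(short))
    print(round(math.sqrt(sum(v * v for v in short)), 6))
```

Validate retrieval quality on your own data before relying on truncated vectors, since the quality loss at 768 or 1536 values depends on the domain.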

Storage impact for a 10-million-document corpus (one float32 vector per document, before index overhead):

Dimensionality	Storage (10M docs)	Recommended use case
3072 (default)	~123 GB	High-precision RAG, legal/medical search, dedup
1536	~61 GB	General semantic search, content classification
768	~31 GB	Real-time recommendation, low-latency filtering
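
The storage column can be sanity-checked with simple arithmetic. The sketch below assumes one vector per document stored as 32-bit floats with no index overhead; these are assumptions, and real vector databases add overhead or use other number formats, so treat the output as a lower-bound estimate rather than a sizing guarantee. `raw_vector_bytes` is a hypothetical helper name.

```python
BYTES_PER_FLOAT32 = 4  # assumption: each value is stored as a 32-bit float


def raw_vector_bytes(num_vectors: int, dims: int) -> int:
    """Raw size of the vectors alone, ignoring index structures and metadata."""
    return num_vectors * dims * BYTES_PER_FLOAT32


if __name__ == "__main__":
    corpus_size = 10_000_000
    full = raw_vector_bytes(corpus_size, 3072)
    for dims in (3072, 1536, 768):
        size = raw_vector_bytes(corpus_size, dims)
        savings = 1 - size / full
        print(
            f"{dims:>4} dims: {size / 1e9:6.1f} GB ({size / 2**30:6.1f} GiB), "
            f"{savings:.0%} smaller than 3072"
        )
```

The 75% saving quoted for 768 dimensions follows directly from 768 / 3072 = 1/4. If documents are chunked before embedding, multiply by the average number of chunks per document.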

Re-indexing required: Gemini Embedding 2 is not backward-compatible with vectors from gemini-embedding-001. Vector spaces from different models can't be compared directly, so if you migrate, plan for the cost and time to re-index the entire corpus.
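
One way to avoid mixing incompatible vectors during a migration is to store the producing model's name next to every vector and check it before any comparison. The sketch below illustrates that idea; `StoredVector`, `ensure_comparable` and `needs_reindex` are hypothetical names, not part of any SDK or vector database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredVector:
    """Hypothetical index record that remembers which model produced it."""
    doc_id: str
    model: str
    values: tuple[float, ...]


def ensure_comparable(a: StoredVector, b: StoredVector) -> None:
    """Raise if two vectors come from different models or differ in length."""
    if a.model != b.model:
        raise ValueError(
            f"{a.doc_id} ({a.model}) and {b.doc_id} ({b.model}) "
            "come from different models"
        )
    if len(a.values) != len(b.values):
        raise ValueError(
            f"{a.doc_id} and {b.doc_id} have different dimensionality"
        )


def needs_reindex(records: list[StoredVector], target_model: str) -> list[str]:
    """Return the ids of records not yet embedded with the target model."""
    return [r.doc_id for r in records if r.model != target_model]


if __name__ == "__main__":
    records = [
        StoredVector("faq-001", "gemini-embedding-001", (0.1, 0.2)),
        StoredVector("faq-002", "gemini-embedding-2-preview", (0.3, 0.4)),
    ]
    print(needs_reindex(records, "gemini-embedding-2-preview"))
```

The list returned by `needs_reindex` also gives a direct count for estimating re-indexing cost and time.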

Multimodal RAG in practice

The most immediate impact is pipeline simplification. The old pattern required multiple embedding models, custom fusion logic and modality-separated vector spaces. The new pattern is radically cleaner:

Python — Multimodal RAG with Gemini Embedding 2

from google import genai
from google.genai import types

client = genai.Client()

# Index a PDF directly (no manual OCR)
with open('financial_report.pdf', 'rb') as f:
    pdf_bytes = f.read()
pdf_embedding = client.models.embed_content(
    model='gemini-embedding-2-preview',
    contents=[types.Part.from_bytes(data=pdf_bytes, mime_type='application/pdf')],
)

# Index a meeting recording (no transcription)
with open('board_meeting.mp3', 'rb') as f:
    audio_bytes = f.read()
audio_embedding = client.models.embed_content(
    model='gemini-embedding-2-preview',
    contents=[types.Part.from_bytes(data=audio_bytes, mime_type='audio/mpeg')],
)

# Query with text: compared against PDF and audio in the same space
query_embedding = client.models.embed_content(
    model='gemini-embedding-2-preview',
    contents='What revenue targets were discussed in Q4?',
)

# All vectors live in the same space, so one search covers every modality

The critical point: the text query is compared directly with PDF and audio embeddings — with no intermediate translation layer.
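
To make the comparison step concrete, the sketch below ranks items of different modalities against a query by cosine similarity. It uses toy 3-value vectors in place of real embeddings, and `rank` is a hypothetical helper; a production system would delegate this to a vector database rather than scanning a Python list.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query: list[float], index: list[tuple[str, str, list[float]]], top_k: int = 3):
    """Score every (doc_id, modality, vector) entry and return the best top_k."""
    scored = [
        (cosine_similarity(query, vector), doc_id, modality)
        for doc_id, modality, vector in index
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy vectors standing in for real embeddings of each file.
    index = [
        ("financial_report.pdf", "pdf", [0.9, 0.1, 0.0]),
        ("board_meeting.mp3", "audio", [0.7, 0.6, 0.1]),
        ("office_party.mp4", "video", [0.0, 0.2, 0.9]),
    ]
    query = [1.0, 0.2, 0.0]
    for score, doc_id, modality in rank(query, index, top_k=2):
        print(f"{score:.3f}  {modality:<5}  {doc_id}")
```

Note that the ranking code never branches on modality: that field is carried along only for display, which is the practical meaning of a shared vector space.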

High-value use cases

The combination of native multimodality, an 8,192-token window and 100+ languages opens up cases that were previously infeasible or economically prohibitive:

Sector	Use case	What changes
Legal	Contracts + recorded hearings	PDFs and audio in one index; clause search retrieves both
Healthcare	Multimodal EHR	PDF reports, imaging exams and voice notes indexed together
Retail	Visual + text product search	Customer uploads a photo and gets results by visual similarity and description
Education	Lecture repository	Videos, slides and transcripts in one space; students search by concept
Financial	Earnings calls + reports	Correlate conference calls with PDF reports, no transcription pipeline

Where you can use it today

The model is available via Gemini API (development) and Vertex AI (production with SLA, VPC Service Controls and Vector Search). Support is documented across the major libraries:

LangChain and LlamaIndex — native integration via Google’s embedding class
Haystack — component available on the hub
Weaviate, Qdrant, ChromaDB — Google vectorization module
Vertex AI Vector Search — managed integration with auto-scaling

For Google Cloud teams, the Gemini Embedding 2 + Vertex AI Vector Search + Gemini 2.5 Pro stack delivers a fully managed RAG pipeline with no external dependencies.

Adoption-readiness checklist

Modality inventory: Map which data types (text, image, audio, video, PDF) live in your corpus and the volume per type.
Re-indexing assessment: Estimate cost and time to re-index the existing corpus if migrating from gemini-embedding-001.
Dimensionality choice: Decide whether 768, 1536 or 3072 fits your quality/cost trade-off.
Deployment environment: Gemini API for development/preview; Vertex AI for production-grade SLA.
Retrieval evaluation: Build an eval set (queries + relevant documents) to measure real-world improvement in your domain before migrating everything.
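
For the retrieval evaluation step, one common metric is recall@k: the share of known-relevant documents that show up in the top k results. The sketch below is a minimal version, assuming your eval set maps each query id to a set of relevant document ids; the function names and sample ids are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents found in the top-k results."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(
    results: dict[str, list[str]],
    judgments: dict[str, set[str]],
    k: int,
) -> float:
    """Average recall@k over every query that has relevance judgments."""
    if not judgments:
        raise ValueError("judgments must not be empty")
    scores = [
        recall_at_k(results.get(query_id, []), relevant, k)
        for query_id, relevant in judgments.items()
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical eval set: query id -> ids of documents judged relevant.
    judgments = {"q1": {"a", "b"}, "q2": {"c"}}
    # Ranked results returned by the retrieval system under test.
    results = {"q1": ["a", "x", "b"], "q2": ["y", "z", "c"]}
    print(mean_recall_at_k(results, judgments, k=2))
    print(mean_recall_at_k(results, judgments, k=3))
```

Running the same eval set against the current index and a trial index built with the new model gives a like-for-like comparison before committing to a full migration.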

Frequently Asked Questions about Gemini Embedding 2: One Vector for Text, Image, Video and Audio

What is Gemini Embedding 2? It is the first natively multimodal embedding model from the Gemini family, available in Public Preview via Gemini API and Vertex AI. It maps text, images, videos, audio, and PDFs into a single vector space.

What are the main benefits of Gemini Embedding 2? It supports Matryoshka, has a context window of 8,192 tokens, and has achieved new records in MTEB (84.0 in code and 69.9 in multilingual). It also has native integration with LangChain, LlamaIndex, Vertex AI Vector Search, and major vector databases.

What are the input limits for each modality in Gemini Embedding 2? For text, the limit is 8,192 tokens; for images, up to 6 images; for audio, up to 80 seconds; for video, up to 128 seconds; and for PDF documents, up to 6 pages.

What is Matryoshka Representation Learning (MRL) in Gemini Embedding 2? It is a technique that incorporates hierarchically nested semantic information into the vector. You can choose the dimensionality at inference time via the output_dimensionality parameter.

