Autenticare
Agentic Engineering · · 10 min

Gemini Embedding 2: One Vector for Text, Image, Video and Audio

Google's first multimodal embedding unifies text, image, video and audio in a single 3072-dim vector. What changes for enterprise RAG architectures.

Fabiano Brito

CEO & Founder

TL;DR: gemini-embedding-2-preview maps text, images, video, audio and PDFs into a single 3072-dimension vector space. It supports Matryoshka output sizes (768/1536/3072) and an 8,192-token context window, sets new MTEB records (84.0 code / 69.9 multilingual), and ships with native integrations for LangChain, LlamaIndex, Vertex AI Vector Search and the major vector DBs.

For years, building a multimodal semantic search pipeline meant maintaining separate models per data type — one for text, one for image, one for audio — then trying to align those vector spaces with fragile fusion layers. On March 10, 2026, Google ended that era with the launch of Gemini Embedding 2: the first natively multimodal embedding model in the Gemini family, in Public Preview via Gemini API and Vertex AI.

The numbers behind the migration

  • MTEB Code: 84.0 (new absolute record)
  • MTEB Multilingual: 69.9 (100+ native languages)
  • Context: 8,192 tokens (2× most competitors)

The problem it solves

Enterprise data is multimodal by nature. Customer support involves text tickets, call recordings, error screenshots and PDF manuals. A product analytics system handles demo videos, written specs and catalog images. Indexing all of this semantically required separate pipelines — and quality dropped at the seam between modalities.

Gemini Embedding 2 fixes this at the foundational layer: by training a single model across all modalities, the vector distance between “an audio recording of a customer complaining about latency” and “a knowledge-base article on performance tuning” is semantically coherent — with no intermediate translation layer.

"The bridge between different media types has finally been built. Use this with complex document similarity tasks, and the results in semantic proximity should be a massive leap forward for RAG pipelines."

— Eric Dong, Engineer @ Google Cloud AI

What’s in the spec

The gemini-embedding-2-preview model is built on the Gemini architecture and inherits its multimodal understanding. Per-modality limits:

Modality | Per-request limit | Supported formats
Text | 8,192 tokens | Any UTF-8 text
Image | Up to 6 images | PNG, JPEG
Audio | Up to 80 seconds | MP3, WAV
Video | Up to 128 seconds | MP4, MOV (H264, H265, AV1, VP9)
Document (PDF) | Up to 6 pages | PDF (visual + text)
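
Because these limits apply per request, longer files have to be checked and split before they are embedded. A minimal sketch of that pre-flight step, assuming the limits in the table above; the names LIMITS, check_limit and split_pages are illustrative and not part of any SDK:

```python
# Per-request limits copied from the table above; the names are illustrative.
LIMITS = {
    "text": ("tokens", 8192),
    "image": ("images", 6),
    "audio": ("seconds", 80),
    "video": ("seconds", 128),
    "pdf": ("pages", 6),
}


def check_limit(modality: str, amount: float) -> None:
    """Raise ValueError if `amount` exceeds the published limit for `modality`."""
    if modality not in LIMITS:
        raise ValueError(f"unsupported modality: {modality}")
    unit, maximum = LIMITS[modality]
    if amount > maximum:
        raise ValueError(f"{modality}: {amount} {unit} exceeds the limit of {maximum}")


def split_pages(page_count: int, pages_per_request: int = 6) -> list[range]:
    """Split a long PDF into zero-based page ranges that each fit in one request."""
    return [
        range(start, min(start + pages_per_request, page_count))
        for start in range(0, page_count, pages_per_request)
    ]
```

A 14-page report, for example, would go out as three requests (pages 0-5, 6-11 and 12-13), each producing its own vector.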

A key architectural detail: the model accepts interleaved input. You can combine multiple modalities in a single request (text + image + audio) and receive one aggregated embedding that captures the relationship between them. This is different from generating separate embeddings and averaging them.

Matryoshka: flexibility without sacrifice

Gemini Embedding 2 incorporates Matryoshka Representation Learning (MRL), which nests semantic information hierarchically inside the vector. The first 768 values already hold a useful representation; the first 1536 add nuance; the full 3072 deliver maximum fidelity.

Pick dimensionality at inference time via the output_dimensionality parameter:

Python
from google import genai
from google.genai import types

client = genai.Client()

# Reduced-dimension embedding (75% storage savings vs. 3072)
result = client.models.embed_content(
    model="gemini-embedding-2-preview",
    contents="Q1 2026 performance report",
    config=types.EmbedContentConfig(output_dimensionality=768),
)
print(f"Dimensions: {len(result.embeddings[0].values)}")  # 768
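
Since MRL puts the most important information in the leading values, a full-length vector that is already stored can in principle be shortened later without calling the model again. The sketch below keeps the first values and rescales to unit length so cosine and dot-product scores stay comparable. Whether this reproduces exactly what the API returns for output_dimensionality=768 is an assumption worth verifying on your own data, and truncate_and_normalize is a hypothetical helper name:

```python
import math


def truncate_and_normalize(vector: list[float], dims: int) -> list[float]:
    """Keep the first `dims` values and rescale the result to unit length."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


# Toy 8-value vector standing in for a 3072-value embedding.
full = [0.5, -0.5, 0.5, 0.5, 0.1, -0.2, 0.3, 0.0]
short = truncate_and_normalize(full, 4)
print(len(short), round(sum(x * x for x in short), 6))  # 4 1.0
```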

Storage impact for a 10-million-document corpus:

Dimensionality | Storage (10M docs) | Recommended use case
3072 (default) | ~117 GB | High-precision RAG, legal/medical search, dedup
1536 | ~58 GB | General semantic search, content classification
768 | ~29 GB | Real-time recommendation, low-latency filtering
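
The figures in the table can be sanity-checked with simple arithmetic. This sketch assumes one vector per document stored as 32-bit floats and counts only the raw values; index structures, metadata and chunking (several vectors per document) would all add to the total. The function name is illustrative:

```python
def raw_vector_bytes(num_docs: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw size of the vectors alone, assuming one float32 vector per document."""
    return num_docs * dims * bytes_per_value


for dims in (3072, 1536, 768):
    size = raw_vector_bytes(10_000_000, dims)
    print(f"{dims}: {size / 1e9:.1f} GB ({size / 2**30:.1f} GiB)")
# 3072: 122.9 GB (114.4 GiB)
# 1536: 61.4 GB (57.2 GiB)
# 768: 30.7 GB (28.6 GiB)
```

The raw numbers land in the same range as the table, and the 4:2:1 ratio between the three sizes holds whatever the storage format.
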

Re-indexing required: Gemini Embedding 2 is not backward-compatible with vectors from gemini-embedding-001. Vector spaces from different models can't be compared directly, so if you migrate, plan for the cost and time to re-index the entire corpus.
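
One way to make that incompatibility hard to trip over during a gradual migration is to store the model name next to every vector and refuse cross-model comparisons. This is a design sketch, not something the SDK or any vector DB does for you; StoredVector and assert_comparable are hypothetical names:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredVector:
    doc_id: str
    model: str  # embedding model that produced `values`
    values: tuple[float, ...]


def assert_comparable(a: StoredVector, b: StoredVector) -> None:
    """Refuse to compare vectors that come from different models or sizes."""
    if a.model != b.model:
        raise ValueError(f"model mismatch: {a.model} vs {b.model}")
    if len(a.values) != len(b.values):
        raise ValueError("dimensionality mismatch")
```

In practice the same effect is often achieved by keeping one collection or index per model and switching reads over only once the new index is fully built.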

Multimodal RAG in practice

The most immediate impact is pipeline simplification. The old pattern required multiple embedding models, custom fusion logic and modality-separated vector spaces. The new pattern is radically cleaner:

Python — Multimodal RAG with Gemini Embedding 2
from google import genai
from google.genai import types

client = genai.Client()

# Note: per the limits above, one request takes at most 6 PDF pages or
# 80 seconds of audio, so longer files must be split into chunks first.

# Index a PDF directly (no manual OCR)
with open("financial_report.pdf", "rb") as f:
    pdf_bytes = f.read()
pdf_embedding = client.models.embed_content(
    model="gemini-embedding-2-preview",
    contents=[types.Part.from_bytes(data=pdf_bytes, mime_type="application/pdf")],
)

# Index a meeting recording (no transcription)
with open("board_meeting.mp3", "rb") as f:
    audio_bytes = f.read()
audio_embedding = client.models.embed_content(
    model="gemini-embedding-2-preview",
    contents=[types.Part.from_bytes(data=audio_bytes, mime_type="audio/mpeg")],
)

# Query with text: compared against PDF and audio in the same space
query_embedding = client.models.embed_content(
    model="gemini-embedding-2-preview",
    contents="What revenue targets were discussed in Q4?",
)

# All vectors live in the same space, so one search covers every modality

The critical point: the text query is compared directly with PDF and audio embeddings — with no intermediate translation layer.
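
The search step itself can be sketched as a plain cosine-similarity ranking over whatever vectors are in the index, regardless of which modality produced them. The vectors below are toy three-value stand-ins rather than real model output, office_party.mp4 is an invented file added for contrast, and a production system would use the vector database's nearest-neighbour search instead of this linear scan:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query: list[float], index: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Score every indexed vector against the query, best match first."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy vectors standing in for real embeddings of each file.
index = {
    "financial_report.pdf": [0.9, 0.1, 0.0],
    "board_meeting.mp3": [0.7, 0.6, 0.1],
    "office_party.mp4": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]
results = rank(query, index)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

The PDF and the audio file compete in the same ranked list, with no per-modality branch anywhere in the code.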

High-value use cases

The combination of native multimodality, an 8,192-token window and 100+ languages opens up cases that were previously infeasible or economically prohibitive:

Sector | Use case | What changes
Legal | Contracts + recorded hearings | PDFs and audio in one index; clause search retrieves both
Healthcare | Multimodal EHR | PDF reports, imaging exams and voice notes indexed together
Retail | Visual + text product search | Customer uploads a photo and gets results by visual similarity and description
Education | Lecture repository | Videos, slides and transcripts in one space; students search by concept
Financial | Earnings calls + reports | Correlate conference calls with PDF reports, no transcription pipeline

Where you can use it today

The model is available via the Gemini API (development) and Vertex AI (production with SLA, VPC Service Controls and Vector Search). Support is documented across the major libraries:

  • LangChain and LlamaIndex — native integration via Google’s embedding class
  • Haystack — component available on the hub
  • Weaviate, Qdrant, ChromaDB — Google vectorization module
  • Vertex AI Vector Search — managed integration with auto-scaling

For Google Cloud teams, the Gemini Embedding 2 + Vertex AI Vector Search + Gemini 2.5 Pro stack delivers a fully managed RAG pipeline with no external dependencies.

Adoption-readiness checklist

1. Modality inventory: map which data types (text, image, audio, video, PDF) live in your corpus and the volume per type.

2. Re-indexing assessment: estimate cost and time to re-index the existing corpus if migrating from gemini-embedding-001.

3. Dimensionality choice: decide whether 768, 1536 or 3072 fits your quality/cost trade-off.

4. Deployment environment: Gemini API for development/preview; Vertex AI for production-grade SLA.

5. Retrieval evaluation: build an eval set (queries + relevant documents) to measure real-world improvement in your domain before migrating everything.
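
For the retrieval evaluation step, recall@k is one simple metric to start with: the share of known-relevant documents that show up in the top k results. The sketch below assumes an eval set shaped as a mapping from query to relevant document ids; the data and function names are made up for illustration, and the same functions can be run once per embedding model to compare them:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


def mean_recall_at_k(runs: dict[str, list[str]], gold: dict[str, set[str]], k: int) -> float:
    """Average recall@k over every query in the eval set."""
    return sum(recall_at_k(runs[q], gold[q], k) for q in gold) / len(gold)


# Hypothetical eval set: query -> relevant doc ids, plus one system's ranked results.
gold = {"q1": {"a", "b"}, "q2": {"c"}}
runs = {"q1": ["a", "x", "b"], "q2": ["y", "z", "c"]}
print(mean_recall_at_k(runs, gold, k=2))  # 0.25
```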


