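Google unifies text, image, video, and audio in a single vector space with Gemini Embedding 2
With Gemini Embedding 2, Google has announced its first native multimodal embedding model, which maps text, images, video, audio, and documents into a single vector space and removes the need for a separate model per modality in AI pipelines.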
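Key points
First native multimodal embedding model
Gemini Embedding 2 is Google's first native multimodal embedding model, mapping text, images, video, audio, and documents into a single vector space.
More efficient AI pipelines
Because one model handles several modalities, pipelines no longer need a separate model for each, which reduces system complexity and cost.
Unified representation across formats
Mapping different data formats into a common vector space is expected to improve performance on cross-modal search and multimedia understanding tasks.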
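Impact analysis
This technology could accelerate the practical adoption of multimodal AI and strengthen the foundation of search, content understanding, and generative AI systems. Compared with the conventional one-model-per-modality approach, it improves efficiency and integration, and it lowers the barrier to building AI applications.
Editorial comment
An important technical milestone toward practical multimodal AI. Handling heterogeneous data with a single model has the potential to change how AI systems are designed and operated.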

Google's first native multimodal embedding model maps text, images, video, audio, and documents into a single vector space, so AI pipelines no longer need a separate model for each modality.
This article, "Google unifies text, image, video, and audio in a single vector space with Gemini Embedding 2," was first published on The Decoder.
Google's first native multimodal embedding model maps text, images, video, audio, and documents into one shared semantic space, potentially simplifying complex AI pipelines.
Back in July 2025, Google shipped gemini-embedding-001, a text-only embedding model supporting over 100 languages that landed a top spot on the MTEB Multilingual Leaderboard. With Gemini Embedding 2, the company is making a much bigger move: the new model still builds on the Gemini architecture, but now also maps images, video, audio, and PDF documents into the same vector space as text.
Embeddings are numerical representations of data that capture its meaning. They're the backbone of applications like semantic search, retrieval augmented generation (RAG), sentiment analysis, and data clustering. A shared embedding space makes it possible to compare different media types directly, without running them through separate models or adding extra steps.
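To make the idea of a shared space concrete, here is a minimal Python sketch of cross-modal retrieval by cosine similarity. The four-dimensional vectors and the file names are invented for illustration and are not model output; a real system would get its vectors from an embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("a zero vector has no direction")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, items):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in items.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors (assumption: made up for this example, not real embeddings).
library = {
    "photo_of_dog.jpg": [0.9, 0.1, 0.0, 0.2],
    "podcast_about_taxes.mp3": [0.0, 0.8, 0.6, 0.1],
    "clip_of_puppy.mp4": [0.8, 0.2, 0.1, 0.3],
}
text_query = [1.0, 0.0, 0.0, 0.2]  # stand-in for an embedded text query about dogs

ranking = rank_by_similarity(text_query, library)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because every item lives in the same space, the image, the video clip, and the audio file are all scored against the text query with one function, and the two dog-related items rank above the unrelated podcast.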
Gemini Embedding 2 handles five modalities: text, images, video, audio, and PDF documents. | Image: Google
Native audio processing cuts out the transcription middleman
Google says Gemini Embedding 2 supports up to 8,192 input tokens for text, four times the 2,048-token limit of its predecessor. It can handle up to six images per request in PNG and JPEG formats. Videos can run up to 120 seconds, and PDF documents can be up to six pages long.
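A client could check these reported limits before sending a request. The sketch below is a hypothetical helper, not part of any Google SDK: the class, function, and constant names are invented, and it assumes each limit applies independently within a single request, which the article does not state.

```python
from dataclasses import dataclass

# Limits as reported in the article. All names are hypothetical.
MAX_TEXT_TOKENS = 8192
MAX_IMAGES_PER_REQUEST = 6
MAX_VIDEO_SECONDS = 120
MAX_PDF_PAGES = 6
ALLOWED_IMAGE_FORMATS = frozenset({"png", "jpeg"})


@dataclass
class EmbeddingRequest:
    """Sizes of the parts of one planned request."""
    text_tokens: int = 0
    image_formats: tuple = ()  # one entry per image, e.g. ("png", "jpeg")
    video_seconds: float = 0.0
    pdf_pages: int = 0


def find_limit_violations(req):
    """Return a list of readable problems; an empty list means within limits."""
    problems = []
    if req.text_tokens > MAX_TEXT_TOKENS:
        problems.append(f"text has {req.text_tokens} tokens, limit is {MAX_TEXT_TOKENS}")
    if len(req.image_formats) > MAX_IMAGES_PER_REQUEST:
        problems.append(f"{len(req.image_formats)} images, limit is {MAX_IMAGES_PER_REQUEST}")
    bad_formats = sorted({f.lower() for f in req.image_formats} - ALLOWED_IMAGE_FORMATS)
    if bad_formats:
        problems.append(f"unsupported image formats: {', '.join(bad_formats)}")
    if req.video_seconds > MAX_VIDEO_SECONDS:
        problems.append(f"video is {req.video_seconds} s, limit is {MAX_VIDEO_SECONDS} s")
    if req.pdf_pages > MAX_PDF_PAGES:
        problems.append(f"PDF has {req.pdf_pages} pages, limit is {MAX_PDF_PAGES}")
    return problems


request = EmbeddingRequest(
    text_tokens=9000, image_formats=("png", "gif"), video_seconds=95, pdf_pages=6
)
for problem in find_limit_violations(request):
    print(problem)
```

In this example the text length and the GIF image are flagged, while the 95-second video and the six-page PDF pass.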
The audio side is worth noting. The model processes audio natively without converting it to text first. Most previous approaches rely on a speech-to-text step in between, which tends to lose information along the way. Gemini Embedding 2 skips that entirely.
There's also what Google calls "interleaved input": developers can mix multiple modalities in a single request, like pairing an image with a text description. Google says this helps the model pick up on relationships between different media types better than embedding each one on its own.
Like its predecessor, Gemini Embedding 2 uses Matryoshka Representation Learning (MRL). The technique layers information so output dimensions can be scaled down dynamically, like a Matryoshka doll where smaller representations nest inside larger ones.
The default dimension is 3,072, with Google recommending 1,536 and 768 as useful alternatives. This lets developers trade off between maximum quality and lower storage costs depending on their use case. Google says the model supports semantic capture in over 100 languages.
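The quality-versus-storage trade-off can be illustrated with a short Python sketch. It keeps the leading dimensions of a vector, rescales the result to unit length, and estimates storage for one million vectors. The input vector is a deterministic stand-in rather than model output, the function names are hypothetical, and both the renormalization step and the float32 storage format are assumptions; renormalizing is common practice with MRL-style embeddings but is not confirmed by the article for this model.

```python
import math


def truncate_and_normalize(vector, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def storage_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for dense vectors; 4 bytes per value assumes float32."""
    return num_vectors * dim * bytes_per_value


# Deterministic stand-in for a 3,072-dimensional embedding (not model output).
full = [math.sin(i + 1) for i in range(3072)]

for dim in (3072, 1536, 768):
    short = truncate_and_normalize(full, dim)
    gib = storage_bytes(1_000_000, dim) / 2**30
    print(f"{len(short)} dims: {gib:.2f} GiB per million float32 vectors")
```

Under these assumptions, a million vectors take roughly 11.4 GiB at 3,072 dimensions, 5.7 GiB at 1,536, and 2.9 GiB at 768, before any index overhead.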
Benchmarks show a clear lead across every modality tested
Google backs up its performance claims with benchmark comparisons against Amazon's Nova 2 Multimodal Embeddings, Voyage Multimodal 3.5, and its own earlier models. According to the published numbers, the new model comes out on top in every category tested: text, images, video, and spoken language.
The gap is widest in text/video tasks: Gemini Embedding 2 hits up to 68.8 points, while Amazon Nova 2 lands at 60.3 and Voyage Multimodal 3.5 at 55.2. In text-image comparisons, Google also pulls ahead clearly with 93.4 versus Amazon's 84.0.
Google pits Gemini Embedding 2 against competing models across text, image, video, and audio benchmarks. The model reportedly leads in nearly every category. | Image: Google
Google says early access partners are already putting the model to work in multimodal applications. Embeddings are the technology driving many Google products, from RAG-powered context engineering to large-scale data management and classic search.
Gemini Embedding 2 is available through the Gemini API and Vertex AI. Google provides interactive Colab notebooks and supports integrations with popular frameworks and vector databases, including LangChain, LlamaIndex, Haystack, Weaviate, Qdrant, ChromaDB, and Vector Search. The company also put out a lightweight demo for multimodal semantic search so developers can test the model's capabilities firsthand.
In late February, AI search engine Perplexity dropped two open-source embedding models under an MIT license. The models—pplx-embed-v1 and pplx-embed-context-v1—only handle text, but they focus on extreme memory efficiency and bidirectional text understanding.
On the MTEB retrieval benchmark, Perplexity's largest model reportedly matched Alibaba's Qwen3 embedding scores and outperformed Google's gemini-embedding-001 at the time, all while using significantly less memory.