Google DeepMind · June 9, 2026, 23:10 · approx. 6 min read

Introducing Gemma 4 12B: a unified, encoder-free multimodal model

#Gemma #Multimodal AI #Edge AI #Encoder-free #Google DeepMind

TL;DR

Google DeepMind has announced Gemma 4 12B, a unified, encoder-free multimodal model built for high-performance inference on local hardware.

AI deep analysis · June 10, 2026, 22:17

Depth: 40%

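Key points

Encoder-free architecture

The model adopts an encoder-free architecture that does away with the conventional image encoder, unifying the model and improving efficiency.

High performance on local devices

Despite its 12B-parameter scale, it is designed to run on edge devices such as laptops, with an emphasis on mobile-first efficiency.

Integrated advanced reasoning

Beyond multimodal processing, it can handle complex reasoning tasks and is intended for use in practical applications.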

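Native audio input and unified architecture

Gemma 4 12B adopts a new unified architecture with no multimodal encoders: vision and audio inputs flow directly into the LLM backbone. It is also the first mid-sized model to support native audio input.

Local execution and reasoning

It runs in 16GB of VRAM or unified memory while approaching the benchmark performance of the 26B model, enabling complex multi-step reasoning and agentic workflows in a local environment.

Latency reduction

It ships with a Multi-Token Prediction (MTP) drafter as standard, which reduces inference latency and enables faster responses.

Encoder-free unified architecture

Separate encoders have been dropped, and vision and audio inputs are integrated directly into the language model, reducing latency and memory usage.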

Key quotes

Gemma 4 12B is designed to bring high-performance multimodal intelligence directly to your laptop, combining mobile-first efficiency with advanced reasoning.

Gemma 4 12B packages powerful capabilities inside a reduced memory footprint.

Small enough to run locally on consumer laptops with 16GB of RAM, it unlocks powerful multimodal and agentic experiences right on your machine.

We trained Gemma 4 12B with an encoder-free architecture to integrate audio and vision input directly.

We removed the audio encoder entirely and projected the raw audio signal into the same dimensional space as text tokens.

To support agents to build with the latest Gemma advancements, we are releasing our official Skills Repository.

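Impact analysis

This announcement could accelerate the use of multimodal AI in local environments where privacy and low latency matter, reducing the dependence of large models on the cloud. The shift to an encoder-free architecture in particular can be seen as a groundbreaking approach that contributes to lower inference cost and greater model versatility.

Editorial comment

Although this press release carries a future date in 2026, its aim of high-capability multimodal processing on local hardware matches current AI trends and marks an important step for edge computing.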

June 3, 2026

3 min read

Gemma 4 12B is designed to bring high-performance multimodal intelligence directly to your laptop, combining mobile-first efficiency with advanced reasoning.

Olivier Lacombe

Director of Product Management, Google DeepMind

Gus Martins

Product Manager, Google DeepMind

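Today, we're introducing Gemma 4 12B, our latest model designed to bring agentic multimodal intelligence directly to your laptop. Bridging the gap between the edge-ready E4B and the more advanced 26B Mixture of Experts (MoE), Gemma 4 12B packages powerful capabilities inside a reduced memory footprint. It is also our first mid-sized model with native audio input.

Thanks to the developer community, Gemma 4 models have now passed 150 million downloads. You have built everything from wearable robotic arms for physical assistance to enterprise-grade AI security. We can't wait to see what you create with this latest model.

Here is an overview of what makes Gemma 4 12B unique:

Innovative unified architecture: No multimodal encoders. Vision and audio inputs flow directly into the LLM backbone.
Advanced reasoning: Benchmark performance approaches our 26B model, enabling strong multi-step reasoning and agentic workflows.
Laptop-ready: Sized to run locally with as little as 16GB of VRAM or unified memory.
Open and accessible: Released under the Apache 2.0 license, with support across the developer ecosystem.
Drafter-enabled: Gemma 4 12B ships with Multi-Token Prediction (MTP) drafters for reduced latency.

With these capabilities, you get advanced multimodal functionality on everyday hardware without sacrificing speed or reasoning. Let's take a closer look at how Gemma 4 12B achieves this.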
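The post does not explain how the MTP drafter is used at inference time. One common pattern for drafters is draft-and-verify (speculative) decoding, and the toy sketch below illustrates that pattern under that assumption. `target_model` and `draft_model` are hypothetical stand-ins, not Gemma components: a cheap drafter proposes several tokens, and the large model keeps the longest prefix it agrees with.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a token context to the next token (greedy)

def target_model(ctx: List[int]) -> int:
    """Toy stand-in for the large model: a fixed deterministic rule."""
    return (ctx[-1] * 3 + len(ctx)) % 11

def draft_model(ctx: List[int]) -> int:
    """Toy stand-in for a cheap drafter: wrong whenever len(ctx) is a multiple of 4."""
    guess = target_model(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 11

def greedy_decode(prompt: List[int], target: Model, new_tokens: int) -> List[int]:
    """Baseline: one target call (one forward pass) per generated token."""
    out = list(prompt)
    for _ in range(new_tokens):
        out.append(target(out))
    return out

def speculative_decode(prompt: List[int], target: Model, draft: Model,
                       new_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy draft-and-verify decoding; returns (tokens, verification rounds)."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < new_tokens:
        # 1) The drafter proposes k tokens autoregressively.
        proposal, ctx = [], list(out)
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2) The target checks the proposal. A real system scores all k
        #    positions in one batched forward pass, so each round stands in
        #    for a single pass of the large model.
        rounds += 1
        for tok in proposal:
            expected = target(out)
            out.append(expected)      # the target's own token is always kept
            if expected != tok:       # first mismatch: discard the rest
                break
    return out[: len(prompt) + new_tokens], rounds

prompt = [1, 2, 3]
fast, rounds = speculative_decode(prompt, target_model, draft_model, new_tokens=12)
assert fast == greedy_decode(prompt, target_model, 12)  # identical to plain greedy
print(f"{rounds} verification rounds for 12 tokens")
```

Because the target's token is kept at every position, the output matches plain greedy decoding exactly. Any latency gain comes from how often the drafter's guesses are accepted, which in practice depends on the model and the workload.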

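Run state-of-the-art agents locally

Gemma 4 12B performs close to our larger 26B MoE model on standard benchmarks while requiring less than half the memory footprint. Small enough to run locally on consumer laptops with 16GB of RAM, it unlocks powerful multimodal and agentic experiences right on your machine.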
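A rough back-of-the-envelope estimate helps put the 16GB figure in context. The sketch below counts weight memory only and ignores the KV cache, activations, and runtime overhead. The precisions listed are illustrative assumptions, since the post does not say which weight format the 16GB target refers to, and `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

PARAMS_12B = 12e9  # nominal parameter count implied by the model name

# Assumed precisions; the post does not state which one the 16GB target uses.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(PARAMS_12B, bits):.1f} GB")
```

Under these assumptions the weights take 24 GB at 16 bits, 12 GB at 8 bits, and 6 GB at 4 bits. Fitting in 16GB would therefore imply weights stored at 8 bits or fewer, with the remainder left for the cache and activations.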

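Experience a uniquely efficient, unified architecture

What makes Gemma 4 12B stand out is its streamlined approach to processing visual and audio inputs. Traditional multimodal models typically rely on separate encoders to translate images and audio before passing those representations to the language model. Because these split encoders add latency and increase memory usage, we trained Gemma 4 12B with an encoder-free architecture to integrate audio and vision input directly.

Here is how Gemma 4 12B processes multimodal inputs natively:

Vision: We replaced Gemma 4's vision encoder with a lightweight embedding module consisting of a single matrix multiplication, positional embedding, and normalization. This allows the LLM backbone to take over visual processing.
Audio: We simplified audio processing even further. We removed the audio encoder entirely and projected the raw audio signal into the same dimensional space as text tokens.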
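The two items above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the actual Gemma 4 12B implementation. The patch size, frame length, model width, use of RMS normalization, and the names `embed_image` and `embed_audio` are all hypothetical. The sketch only shows the shape of the idea: one matrix multiplication per modality maps raw inputs to vectors of the same width as text token embeddings.

```python
import numpy as np

def rms_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Scale each row to unit root-mean-square (assumed normalization type)."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def embed_image(image: np.ndarray, w_patch: np.ndarray,
                pos_emb: np.ndarray, patch: int = 4) -> np.ndarray:
    """Patchify an (H, W, C) image, then one matmul + positional embedding + norm."""
    h, w, c = image.shape
    # Cut the image into non-overlapping patch x patch tiles and flatten each one.
    tiles = image.reshape(h // patch, patch, w // patch, patch, c)
    tiles = tiles.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
    return rms_norm(tiles @ w_patch + pos_emb)

def embed_audio(waveform: np.ndarray, w_frame: np.ndarray,
                frame: int = 160) -> np.ndarray:
    """Split a raw 1-D waveform into frames and project each to the model width."""
    n = len(waveform) // frame
    frames = waveform[: n * frame].reshape(n, frame)
    return frames @ w_frame

rng = np.random.default_rng(0)
d_model = 32                                   # hypothetical model width
image = rng.normal(size=(16, 16, 3))           # toy 16x16 RGB image
audio = rng.normal(size=1600)                  # toy waveform: 10 frames of 160 samples
w_patch = rng.normal(size=(4 * 4 * 3, d_model))
pos_emb = rng.normal(size=(16, d_model))       # one vector per patch (4x4 grid)
w_frame = rng.normal(size=(160, d_model))

vision_tokens = embed_image(image, w_patch, pos_emb)
audio_tokens = embed_audio(audio, w_frame)
text_tokens = rng.normal(size=(5, d_model))    # stand-in for text token embeddings

# All modalities share one width, so they can be concatenated into one sequence.
sequence = np.concatenate([vision_tokens, audio_tokens, text_tokens], axis=0)
print(sequence.shape)  # (31, 32)
```

In this sketch the 16 vision vectors, 10 audio vectors, and 5 text vectors end up in one sequence that a single backbone could consume, which is the property the encoder-free design relies on.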

For developers who want a detailed technical breakdown, head over to our companion Gemma 4 12B Developer Guide.

See native audio processing in action: Watch Gemma 4 12B transcribe, format, and translate voice inputs entirely offline using the Google AI Edge Eloquent app.

Get started today

Try it yourself: Experiment with a couple of clicks in LM Studio, Ollama, Google AI Edge Gallery App, the Google AI Edge Eloquent app, and the LiteRT-LM CLI.
Download the weights: Download the pre-trained and instruction-tuned checkpoints directly from Hugging Face and Kaggle.
Integrate & learn: Review the developer documentation and the quick start notebook.
Use your favorite development tools: Implement local inference pipelines with Hugging Face Transformers, llama.cpp, MLX, SGLang, and vLLM, or fine-tune efficiently with Unsloth.
Unlock agentic development with Gemma Skills: To help agents build with the latest Gemma advancements, we are releasing our official Skills Repository, a library of skills designed specifically to enable agents to build with Gemma models.
Deploy your way: Spin up production endpoints on Google Cloud, or deploy through Gemini Enterprise Agent Platform Model Garden, Cloud Run, and GKE.
