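Welcome Gemma 4: Frontier multimodal intelligence on device
The Gemma 4 family of multimodal models, developed by Google DeepMind, has been released on Hugging Face. The models process image, audio, and text input and are open-sourced under the Apache 2 license, which makes on-device AI considerably more practical.
Key points
Open-source multimodal models
Gemma 4 is a truly open model family released under the Apache 2 license, and it can process multimodal image, audio, and text input.
A range of sizes, including on-device
Four sizes are offered, from 2.3B to 31B parameters; the smaller models in particular make on-device execution possible.
Technical improvements
The models incorporate improvements over the previous generation, including an image encoder that supports variable aspect ratios, long context windows, and a shared KV cache.
Broad implementation support
The models are available across many frameworks and inference engines, including transformers, llama.cpp, MLX, WebGPU, and Rust.
High quality out of the box
The article reports that the models are so good out of the box that the authors "struggled to find good fine-tuning examples".
Architectural efficiency and compatibility
Gemma 4 leaves out complex features, achieves high compatibility across libraries and devices, and is designed to efficiently support long context and agentic use cases.
Outstanding parameter efficiency
The 31B dense model and the 26B MoE model reach estimated LMArena text scores roughly on par with GLM-5 and Kimi K2.5 with about 30 times fewer parameters.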
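Impact analysis
The release of Gemma 4 shows that open-source multimodal AI has reached the point where it can be used practically on device, and it is likely to accelerate the spread of edge AI. In particular, the freedom of commercial use under the Apache 2 license and the support for a wide range of implementation environments should have a large impact on the developer community as a whole.
Editorial comment
Open-source multimodal models now running on device at practical quality is an important milestone that significantly advances both the democratization of AI and its real-world application.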
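The audio is a speech excerpt where a speaker is delivering a farewell address to the nation from Chicago. The speaker reflects on their time in office, expressing gratitude for the conversations they had with the American people across various settings like living rooms, schools, farms, factories, diners, and military outposts. The tone is reflective and appreciative, highlighting the importance of these interactions in their political journey.
Here is an example if you want to do transcription:
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "url": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama_first_45_secs.mp3"},
            {"type": "text", "text": "Transcribe the audio?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=1000,
    do_sample=False,
)
print(processor.decode(output[0], skip_special_tokens=True))
This week I traveled to Chicago to deliver my final farewell address to the nation following in the tradition of presidents before me. It was an opportunity to say thank you. Whether we've seen eye to eye or rarely agreed at all, my conversations with you, the American people, in living rooms and schools, at farms and on factory floors, at diners and on distant military outposts, all these conversations are what have kept me honest.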
Multimodal Function Calling
We test the model by asking it to get the weather in the place shown in the image.
import re
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Gets the current weather for a specific location.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "The city name"},
            },
            "required": ["city"],
        },
    },
}
tools = [WEATHER_TOOL]
messages = [
    {"role": "user",
     "content": [
         {"type": "image", "image": "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/thailand.jpg"},
         {"type": "text", "text": "What is the city in this image? Check the weather there right now."},
     ]},
]
inputs = processor.apply_chat_template(
    messages,
    tools=[WEATHER_TOOL],
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
    enable_thinking=True,
).to(model.device)
output = model.generate(**inputs, max_new_tokens=1000)
input_len = inputs.input_ids.shape[-1]
generated_text_ids = output[0][input_len:]
generated_text = processor.decode(generated_text_ids, skip_special_tokens=True)
result = processor.parse_response(generated_text)
print(result["content"])
get_weather(city="Bangkok")
Deploy Anywhere
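Gemma 4 comes with day-0 support for many open-source inference engines. We also release ONNX checkpoints that can run on many hardware backends, allowing use cases on edge devices or in the browser!
Gemma 4 comes with first-class transformers support from the get-go 🤗. This integration allows using the model with other libraries like bitsandbytes, PEFT and TRL. Make sure to install the latest version of transformers.
pip install -U transformers
The easiest way to run inference with the small Gemma 4 models is through the any-to-any pipeline. You can initialize it as follows.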
from transformers import pipeline
pip
Welcome Gemma 4: Frontier multimodal intelligence on device





The Gemma 4 family of multimodal models by Google DeepMind is out on Hugging Face, with support for your favorite agents, inference engines, and fine-tuning libraries 🤗
These models are the real deal: truly open with Apache 2 licenses, high quality with pareto frontier arena scores, multimodal including audio, and sizes you can use everywhere including on-device. Gemma 4 builds on advances from previous families and makes them click together. In our tests with pre-release checkpoints we have been impressed by their capabilities, to the extent that we struggled to find good fine-tuning examples because they are so good out of the box.
We collaborated with Google and the community to make them available everywhere: transformers, llama.cpp, MLX, WebGPU, Rust; you name it. This blog post will show you how to build with your favorite tools so let us know what you think!
Table of Contents
What is New with Gemma 4?
Overview of Capabilities and Architecture
  Architecture at a Glance
  Per-Layer Embeddings (PLE)
  Shared KV Cache
Multimodal Capabilities
Deploy Anywhere
  transformers
  Plug in to your local agent
  transformers.js
Fine-tuning & Demos
  Fine-tuning with TRL
  Fine-tuning with TRL on Vertex AI
  Fine-tuning with Unsloth Studio
Acknowledgements
What is new with Gemma 4?
Similar to Gemma-3n, Gemma 4 supports image, text, and audio inputs, and generates text responses. The text decoder is based on the Gemma model with support for long context windows. The image encoder is similar to the one from Gemma 3 but with two crucial improvements: variable aspect ratios, and configurable number of image token inputs to find your sweet spot between speed, memory, and quality. All models support images (or video) and text inputs, while the small variants (E2B and E4B) support audio as well.
Gemma 4 comes in four sizes, all base and instruction fine-tuned:
Gemma 4 E2B: 2.3B effective, 5.1B with embeddings
Gemma 4 E4B: 4.5B effective, 8B with embeddings
Gemma 4 31B: 31B dense model
Gemma 4 26B A4B: mixture-of-experts with 4B activated/26B total parameters
Overview of Capabilities and Architecture
Gemma 4 leverages several architecture components used in previous Gemma versions and other open models, and leaves out complex or inconclusive features such as AltUp. The combination is designed to be highly compatible across libraries and devices and to efficiently support long context and agentic use cases, while being ideal for quantization.
With this feature mix (and the undisclosed training data or recipe), the 31B dense model achieves an estimated LMArena score (text only) of 1452, while the 26B MoE reaches 1441 with just 4B active parameters 🤯. To put this in context, these scores are more or less the same as those of the recent GLM-5 or Kimi K2.5, but with ~30 times fewer parameters. As we'll see, multimodal operation is comparatively as good as text generation, at least in informal and subjective tests.
These are the main architecture characteristics in Gemma 4:
Alternating local sliding-window and global full-context attention layers. Smaller dense models use sliding windows of 512 tokens while larger models use 1024 tokens.
Dual RoPE configurations: standard RoPE for sliding layers, proportional RoPE for global layers, to enable longer context.
Per-Layer Embeddings (PLE): a second embedding table that feeds a small residual signal into every decoder layer.
Shared KV Cache: the last N layers of the model reuse key-value states from earlier layers, eliminating redundant KV projections.
Vision encoder: uses learned 2D positions and multidimensional RoPE. Preserves the original aspect ratios and can encode images to a few different token budgets (70, 140, 280, 560, 1120).
Audio encoder: USM-style conformer with the same base architecture as the one in Gemma-3n.
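To make the difference between the local and global attention layers in the list above concrete, here is a minimal sketch of the two causal masks. It is an illustration only: the function name `attention_mask` is hypothetical, and whether the window counts the current token is an assumption, not something the post specifies.

```python
def attention_mask(seq_len, window=None):
    """Build a causal mask: mask[i][j] is True if query i may attend to key j.

    window=None models a global (full-context) layer.
    window=W models a sliding-window layer that sees at most the last W
    positions, counting the current one (an assumption for this sketch).
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            visible = j <= i  # causal: never look at future tokens
            if window is not None:
                visible = visible and (i - j) < window
            row.append(visible)
        mask.append(row)
    return mask


# Tiny demo: the post mentions windows of 512 or 1024 tokens; 3 keeps it readable.
local_mask = attention_mask(6, window=3)
global_mask = attention_mask(6)
print(sum(local_mask[5]), sum(global_mask[5]))
```

For the last query position, the local layer sees only the most recent `window` tokens while the global layer sees the full prefix, which is why only the global layers carry the cost of very long contexts.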
Per-Layer Embeddings (PLE)
One of the most distinctive features in smaller Gemma 4 models is Per-Layer Embeddings (PLE), which was introduced previously in Gemma-3n. In a standard transformer, each token gets a single embedding vector at input, and the same initial representation is what the residual stream builds on across all layers, forcing the embedding to frontload everything the model might need. PLE adds a parallel, lower-dimensional conditioning pathway alongside the main residual stream. For each token, it produces a small dedicated vector for every layer by combining two signals: a token-identity component (from an embedding lookup) and a context-aware component (from a learned projection of the main embeddings). Each decoder layer then uses its corresponding vector to modulate the hidden states via a lightweight residual block after attention and feed-forward. This gives each layer its own channel to receive token-specific information only when it becomes relevant, rather than requiring everything to be packed into a single upfront embedding. Because the PLE dimension is much smaller than the main hidden size, this adds meaningful per-layer specialization at modest parameter cost. For multimodal inputs (images, audio, video), PLE is computed before soft tokens are merged into the embedding sequence — since PLE relies on token IDs that are lost once multimodal features replace the placeholders. Multimodal positions use the pad token ID, effectively receiving neutral per-layer signals.
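The per-layer pathway described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the actual Gemma 4 implementation: the names `per_layer_inputs`, `ple_table`, and `proj` are hypothetical, and combining the two signals by simple addition is an assumption (the real model may scale or normalize them).

```python
def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def per_layer_inputs(token_ids, main_embeds, ple_table, proj, num_layers, ple_dim):
    """Return out[t][l]: the small conditioning vector for token t at layer l.

    ple_table[token_id] is a lookup row of length num_layers * ple_dim
    (the token-identity component). proj maps a main embedding to the same
    length (the context-aware component). Adding the two is an assumption.
    """
    out = []
    for tok, emb in zip(token_ids, main_embeds):
        identity = ple_table[tok]
        context = matvec(proj, emb)
        combined = [a + b for a, b in zip(identity, context)]
        # Slice the flat vector into one chunk per decoder layer.
        out.append([combined[l * ple_dim:(l + 1) * ple_dim] for l in range(num_layers)])
    return out


# Toy sizes: 3 layers, PLE dimension 2, hidden size 4, vocabulary of 5 tokens.
NUM_LAYERS, PLE_DIM, HIDDEN, VOCAB = 3, 2, 4, 5
ple_table = [[0.1 * tok] * (NUM_LAYERS * PLE_DIM) for tok in range(VOCAB)]
proj = [[0.01] * HIDDEN for _ in range(NUM_LAYERS * PLE_DIM)]
embeds = [[1.0] * HIDDEN, [2.0] * HIDDEN]
vectors = per_layer_inputs([1, 4], embeds, ple_table, proj, NUM_LAYERS, PLE_DIM)
print(len(vectors), len(vectors[0]), len(vectors[0][0]))
```

In this picture, a multimodal position would simply pass the pad token ID as its entry in `token_ids`, so its lookup contributes the same neutral row at every such position.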
Shared KV Cache
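The shared KV cache is an efficiency optimization that reduces both compute and memory during inference. The last num_kv_shared_layers layers of the model reuse key-value states from earlier layers instead of computing their own KV projections.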
In practice, this has a minimal impact on quality while being much more efficient (in terms of both memory and compute) for long context generation and on-device use.
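As a rough sketch of the bookkeeping involved, the function below maps each layer to the layer whose key-value states it would read. The name `kv_source_layers` is hypothetical, and the rule that a shared layer borrows from the last non-shared layer of the same attention type is an assumption made for illustration; the text above only says that the last layers reuse states from earlier ones.

```python
def kv_source_layers(layer_types, num_kv_shared_layers):
    """Map each layer index to the layer whose K/V states it uses.

    layer_types is a list such as ["sliding", "global", ...].
    Non-shared layers use their own K/V. Each of the last
    num_kv_shared_layers layers borrows from the most recent non-shared
    layer with the same attention type (an assumption for this sketch).
    """
    first_shared = len(layer_types) - num_kv_shared_layers
    sources = []
    for idx, kind in enumerate(layer_types):
        if idx < first_shared:
            sources.append(idx)  # computes and caches its own K/V
            continue
        donor = None
        for j in range(first_shared - 1, -1, -1):
            if layer_types[j] == kind:
                donor = j
                break
        if donor is None:
            raise ValueError(f"no non-shared {kind!r} layer to borrow K/V from")
        sources.append(donor)
    return sources


pattern = ["sliding", "sliding", "global", "sliding", "sliding", "global"]
print(kv_source_layers(pattern, num_kv_shared_layers=2))
```

Only the layers that map to themselves need KV projections and cache entries, which is where the memory and compute savings for long contexts come from.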
Multimodal Capabilities
We saw in our tests that Gemma 4 supports comprehensive multimodal capabilities out of the box. We don't know what the training mix was, but we had success using it for tasks such as OCR, speech-to-text, object detection, or pointing. It also supports text-only and multimodal function calling, reasoning, and code completion and correction.
Here, we show a few inference examples across different model sizes. You can run them conveniently with this notebook. We encourage you to try the demos and share them below this blog!
Object Detection and Pointing
We test Gemma 4 on GUI element detection and pointing across different sizes, with the following image and text prompt: "What's the bounding box for the "view recipe" element in the image?"

With this prompt, the model natively responds in JSON format with the detected bounding boxes - no need for specific instructions or grammar-constrained generation. We found that the coordinates refer to a 1000x1000 grid and are relative to the input dimensions, so they must be rescaled to obtain pixel coordinates.
We visualize the outputs below for your convenience. We parse the bounding boxes from the returned JSON:
[
  {"box_2d": [171, 75, 245, 308], "label": "view recipe element"}
]
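The parsing and rescaling step could look like the sketch below. The helper name `boxes_to_pixels` is hypothetical, and the coordinate order `[ymin, xmin, ymax, xmax]` is an assumption (it is not stated in the post), so check it against your own outputs before relying on it.

```python
import json


def boxes_to_pixels(response_text, width, height):
    """Parse the model's JSON answer and rescale boxes to pixel coordinates.

    Assumes each entry has "box_2d" on a 0-1000 grid, ordered as
    [ymin, xmin, ymax, xmax] (an assumption), plus a "label" string.
    Returns a list of (label, (left, top, right, bottom)) tuples.
    """
    # The answer may be wrapped in a Markdown code fence, so keep only
    # the text from the first "[" to the last "]".
    start, end = response_text.find("["), response_text.rfind("]")
    entries = json.loads(response_text[start:end + 1])
    results = []
    for entry in entries:
        ymin, xmin, ymax, xmax = entry["box_2d"]
        box = (
            xmin * width / 1000,
            ymin * height / 1000,
            xmax * width / 1000,
            ymax * height / 1000,
        )
        results.append((entry["label"], box))
    return results


raw = '[\n {"box_2d": [171, 75, 245, 308], "label": "view recipe element"}\n]'
for label, box in boxes_to_pixels(raw, width=2000, height=1000):
    print(label, box)
```

The resulting tuples are in the left, top, right, bottom order that most drawing libraries expect for rectangles.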




Object Detection
We also test the models on detecting everyday objects. Here we ask them to detect the bike and compare the outputs of the different models. As in the previous case, we parse the bounding box from the JSON and translate it to image-space coordinates.




Multimodal Thinking and Function Calling
We asked Gemma 4 to write HTML code to reconstruct a page we made with Gemini 3. Below you can find the code to do this. We enable thinking and let each model generate up to 4000 new tokens, to make it foolproof.
Gemini Generated Website (Reference)
Gemini Reproduced Image


messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/landing_page.png",
            },
            {"type": "text", "text": "Write HTML code for this page."},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
    enable_thinking=True,
).to(model.device)
output = model.generate(**inputs, max_new_tokens=4000)
input_len = inputs.input_ids.shape[-1]
generated_text_ids = output[0][input_len:]
generated_text = processor.decode(generated_text_ids, skip_special_tokens=True)
result = processor.parse_response(generated_text)
print(result["content"])




Video Understanding
Smaller Gemma 4 models can take in videos with audio, while larger ones can take in videos without audio. Although the models are not explicitly post-trained on videos, they can understand videos both with and without audio. The models are particularly strong on audio.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "url": "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/concert.mp4"},
            {"type": "text", "text": "What is happening in the video? What is the song about?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
    load_audio_from_video=True,  # disable this for larger models
).to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
# Keep only the newly generated tokens, not the prompt.
input_len = inputs.input_ids.shape[-1]
generated_text_ids = output[0][input_len:]
generated_text = processor.decode(generated_text_ids, skip_special_tokens=True)
# The original snippet printed an undefined `result`; print the decoded text.
print(generated_text)
E2B (hallucinated for the audio)
Based on the video provided, here is an analysis: What is happening in this video? The video captures a live musical performance, likely at a large outdoor festival or concert. * Setting: The scene is a large stage set up outdoors, indicated by the extensive lighting rig, large speakers, and the audience visible in the foreground. * Performers: There are several musicians on stage. The focal point seems to be a lead singer/guitarist in a blue shirt and white pants who is actively performing, playing an acoustic or electric guitar. There are also other musicians, including drummers and bassists, visible in the background. * Atmosphere: The atmosphere is energetic and dramatic, characterized by bright stage lights, intense blue and white spotlights, and a significant amount of stage fog/smoke, which adds to the concert mood. * Audience: The foreground is filled with the backs of audience members, showing they are close to the stage and engaged in watching the performance. What is the song about? The lyrics provided below the video are: > "I don't care 'bout street > From what I cherish > Could it's time to protest
The video captures a live performance of a band on a stage with bright lights and smoke effects, surrounded by an audience. The band performs and sings to the crowd. The song appears to be about the struggles and disillusionment of modern life, specifically the feeling of being stuck and the difficulties in finding one's place in a chaotic and overwhelming world. Lyrics mention themes of the "concrete street," "broken promises," "storm of the day," and the difficulty of finding oneself amidst societal pressures.
26B A4B (no audio) ✅
This video shows a live music performance at an outdoor concert or festival at night. A solo male artist, dressed in a blue shirt and white pants, is the central focus. He is playing an acoustic guitar and singing into a microphone on a large, lit stage. The performance is accompanied by a band, including a drummer and other musicians visible in the background and on large video screens. The stage is filled with atmospheric blue lighting, smoke, and bright spotlights. The video captures different perspectives, including shots from the crowd looking up at the stage and close-up views of the performers projected onto large screens. The crowd is visible in the foreground, watching the show. One of the screens also displays the "Estrella Damm" brand name.
31B (no audio) ✅
A musician is performing on stage during a concert, playing an acoustic guitar and singing into a microphone. He is wearing a blue shirt and white pants. The stage is filled with atmospheric smoke and illuminated by bright blue and white lights. Other band members, including a drummer and a keyboardist, are also visible on stage. The video shows the musician from the perspective of the audience, with some shots focusing on him and others showing the large screen on the side of the stage, which displays close-ups of the performer and the crowd.
We have tested all models on captioning. All checkpoints perform very well at capturing nuances accurately in complex scenes. Here is the image we use, together with the text prompt "Write single detailed caption for this image."

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/bird.png"},
            {"type": "text", "text": "Write single detailed caption for this image."},
        ],
    },
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
input_len = inputs.input_ids.shape[-1]
generated_text_ids = output[0][input_len:]
generated_text = processor.decode(generated_text_ids, skip_special_tokens=True)
result = processor.parse_response(generated_text)
print(result["content"])
A medium shot captures a weathered seagull perched atop a stone pedestal in what appears to be a bustling European square, with a grand, classical-style building featuring ornate columns and architectural details dominating the right side of the frame. In the background, a low, light-colored building stretches across the middle ground, flanked by some green foliage, and hints of other structures, including a terracotta-roofed building, are visible. In the foreground, part of a modern bus stop or shelter is visible, with signage indicating routes, and the overall scene is bathed in bright daylight under a pale sky.
A solitary seagull perches atop a grey stone pedestal in a bustling European piazza, with the historical grandeur of a classical building dominating the right side of the frame, contrasting with the more modern, low-rise architecture in the background under a bright, hazy sky.
A full shot shows a white and brown seagull perched on a gray pillar in Venice, Italy. The seagull faces left, its beak pointed towards the edge of the frame. Behind it is a light gray, two-story building with several dark rectangular windows. To the right of the seagull is a large, white building with columns and intricate carvings. The sky is a clear, light blue. The lighting is bright and even, suggesting it is daytime.
A high-angle, medium shot shows a white and gray seagull perched on a gray concrete pylon in Venice, Italy. The seagull is facing left, its body angled slightly towards the camera. Its white head, neck, and chest contrast with its gray wings and back. The seagull's orange beak is slightly open. The pylon is a light gray concrete cylinder with a weathered texture. In the background, there are several buildings. To the left is a long, low, white building with a series of small, rectangular windows. In the center is a cluster of green trees. To the right is a red building with a tiled roof and several small, rectangular windows. Further right is a large, ornate, white stone building with several columns and sculptures. The sky is a pale blue, and the light is bright and clear. In the lower-left corner, there is a yellow bus with the word "Ferrovia" printed on it in black letters. There are several people walking in the background, but they are small and out of focus.
Audio Question Answering
These models are trained to answer questions about speech in audio. Music and non-speech sounds were not part of the training data.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "url": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama_first_45_secs.mp3"},
            {"type": "text", "text": "Can you describe this audio in detail?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=1000,
    do_sample=False,
)
print(processor.decode(output[0], skip_special_tokens=True))
This audio is a personal reflection. The speaker is talking about their final farewell address to the nation, which they delivered in Chicago. They express gratitude for the conversations they've had with the American people, noting that despite not having met them face-to-face or even greeted them, these interactions in various settings like living rooms, schools, farms, factory floors, diners, and military outposts have been what has kept them going.
The audio is a speech excerpt where a speaker is delivering a farewell address to the nation from Chicago. The speaker reflects on their time in office, expressing gratitude for the conversations they had with the American people across various settings like living rooms, schools, farms, factories, diners, and military outposts. The tone is reflective and appreciative, highlighting the importance of these interactions in their political journey.
Here is an example if you want to do transcription:
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "url": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama_first_45_secs.mp3"},
            {"type": "text", "text": "Transcribe the audio?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=1000,
    do_sample=False,
)
print(processor.decode(output[0], skip_special_tokens=True))
This week I traveled to Chicago to deliver my final farewell address to the nation following in the tradition of presidents before me It was an opportunity to say thank you whether we've seen eye to eye or rarely agreed at all my conversations with you the American people in living rooms and schools at farms and on factory floors at diners and on distant military outposts all these conversations are what have kept me honest
This week I traveled to Chicago to deliver my final farewell address to the nation following in the tradition of presidents before me. It was an opportunity to say thank you. Whether we've seen eye to eye or rarely agreed at all, my conversations with you, the American people, in living rooms and schools, at farms and on factory floors, at diners and on distant military outposts, all these conversations are what have kept me honest.
Multimodal Function Calling
We test the model by asking to get the weather in the place shown in the image.
import re

WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Gets the current weather for a specific location.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "The city name"},
            },
            "required": ["city"],
        },
    },
}
tools = [WEATHER_TOOL]
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/thailand.jpg"},
            {"type": "text", "text": "What is the city in this image? Check the weather there right now."},
        ],
    },
]
inputs = processor.apply_chat_template(
    messages,
    tools=[WEATHER_TOOL],
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
    enable_thinking=True,
).to(model.device)
output = model.generate(**inputs, max_new_tokens=1000)
input_len = inputs.input_ids.shape[-1]
generated_text_ids = output[0][input_len:]
generated_text = processor.decode(generated_text_ids, skip_special_tokens=True)
result = processor.parse_response(generated_text)
print(result["content"])
1. Analyze the Request: The user has provided an image and asked two questions:
 * "What is the city in this image?"
 * "Check the weather there right now."

2. Analyze the Image: The image displays a large, ornate, tall structure, characteristic of temple architecture, likely in Southeast Asia (given the style). It appears to be a famous landmark.
3. Identify the Landmark/Location (Internal Knowledge/Visual Cues): The structure strongly resembles the chedis/stupas found in places like Bangkok, Thailand (e.g., Wat Arun or similar structures).

4. Determine the City: Based on the visual evidence, the location is highly likely to be Bangkok, Thailand.

5. Check Available Tools: The available tool is get_weather(city: STRING)
get_weather(city="Bangkok")
1. Analyze the Request: The user has provided an image and asked two things:
 a. Identify the city in the image.
 b. Check the weather in that city right now.

2. Analyze the Image: The image shows a very distinctive, ornate tower/spire structure, characteristic of Thai architecture (specifically, a chedi or prang).

3. Determine City Identification (Internal Knowledge/Visual Recognition): Based on the architectural style, the structure strongly resembles prominent temples in Thailand, such as Wat Arun in Bangkok.

4. Evaluate Tool Availability: I have one available tool: get_weather(city: STRING)
The user wants to know the city in the image and the current weather there. 1. Analyze the image: The image shows ornate, spire-like structures (prangs) characteristic of Thai temple architecture. Specifically, these look like the prangs at Wat Phra Kaew (Temple of the Emerald Buddha) or Wat Pho in Bangkok, Thailand. The style is very distinctive of Bangkok's royal temple complexes. 2. Identify the city: Based on the architectural style, the city is Bangkok. 3. Find the weather: I need to call the get_weather
The image shows several tall, ornate towers (prangs) typical of Thai Buddhist temple architecture. Specifically, these look like the structures within the Grand Palace or Wat Phra Kaew in Bangkok, Thailand. The style is unmistakably Thai. I'll confirm this by looking at the specific towers. These are very iconic.
2. Confirm the location: The image shows elements of the Grand Palace complex in Bangkok.
3. Get weather information: Once the city is identified as Bangkok, I will use the get_weather
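Once the model emits a call such as get_weather(city="Bangkok"), the application still has to parse it and run the matching function. The following is a minimal sketch of that step, assuming the call arrives as plain text in exactly that form with double-quoted string keyword arguments only. The names `parse_call`, `dispatch`, and `TOOL_REGISTRY`, and the stub `get_weather` implementation, are hypothetical and not part of transformers.

```python
import re

# Matches a single call like: name(key="value", other="value")
CALL_RE = re.compile(r'^\s*(\w+)\((.*)\)\s*$', re.S)
ARG_RE = re.compile(r'(\w+)\s*=\s*"([^"]*)"')


def parse_call(text):
    """Return (function_name, kwargs) or None if text is not a simple call."""
    match = CALL_RE.match(text)
    if match is None:
        return None
    name, arg_text = match.groups()
    return name, dict(ARG_RE.findall(arg_text))


def get_weather(city):
    """Stub standing in for a real weather lookup (hypothetical)."""
    return f"(stub) weather report for {city}"


# Only functions listed here can be invoked by the model.
TOOL_REGISTRY = {"get_weather": get_weather}


def dispatch(text):
    """Parse a call string and run the registered tool, if any."""
    parsed = parse_call(text)
    if parsed is None:
        return None
    name, kwargs = parsed
    tool = TOOL_REGISTRY.get(name)
    return tool(**kwargs) if tool is not None else None


print(dispatch('get_weather(city="Bangkok")'))
```

Routing calls through an explicit registry, rather than evaluating the model's text, keeps the model from invoking anything the application did not deliberately expose.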
Deploy Anywhere
Gemma 4 comes with day-0 support for many open-source inference engines. We also release ONNX checkpoints that can run on many hardware backends, allowing use cases on edge devices or in browser!
Gemma 4 comes with first-class transformers support from the get-go 🤗. This integration allows using the model with other libraries like bitsandbytes, PEFT and TRL. Make sure to install the latest version of transformers.
pip install -U transformers
The easiest way to infer with small Gemma 4 models is through the any-to-any pipeline. You can initialize it as follows.
from transformers import pipeline pip