xAI News · April 12, 2024 09:00 · approx. 2 min read

Grok-1.5 Vision Preview

#MultimodalAI #VisionLanguageModel #PhysicalWorldUnderstanding #xAI #Benchmark #RealWorldQA

TL;DR

xAI's first multimodal model connects the digital and physical worlds.

AI in-depth analysis · February 25, 2026 14:41


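Key points

xAI announced its first multimodal model, Grok-1.5V

It can process a wide range of visual inputs, including documents, diagrams, and photographs

On xAI's own RealWorldQA benchmark, it outperforms the compared models in physical-world understanding

It will be available soon to early testers and existing users

Practical applications such as code generation were demonstrated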


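Impact analysis

xAI has entered the multimodal model market alongside existing models such as OpenAI's GPT-4V, and it showed an edge in spatial understanding of the physical world in particular. Introducing its own benchmark could influence how the industry evaluates models, and competition over real-world applications of AI assistants is expected to intensify.

Editorial comment

xAI, led by Elon Musk, has joined the multimodal race in earnest. Its emphasis on understanding physical space hints at one direction future AI development may take.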

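Grok-1.5V (Grok-1.5 Vision) preview: advancing real-world understanding with a first multimodal model

On April 12, 2024, xAI announced a preview of Grok-1.5V, its first multimodal model. In addition to its existing strong text capabilities, the model can process a wide variety of visual information, including documents, diagrams, charts, screenshots, and photographs. It will be available soon to existing Grok users and early testers.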

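Grok-1.5V is competitive with existing frontier multimodal models in a number of domains, ranging from multi-disciplinary reasoning to understanding documents, science diagrams, charts, screenshots, and photographs. Its ability to understand the physical world stands out in particular: on RealWorldQA, a new benchmark introduced by xAI to measure real-world spatial understanding, the model scored higher than the other models compared. All evaluations were run in a zero-shot setting without chain-of-thought prompting.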
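To make the evaluation setting concrete, here is a minimal sketch of what "zero-shot without chain-of-thought" can look like for a multiple-choice question. The class name, function names, and prompt wording are assumptions made for illustration; the post does not publish the actual prompts xAI used.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MultipleChoiceItem:
    """Hypothetical container for one benchmark question (not xAI's format)."""
    question: str
    options: Tuple[str, ...]


def format_options(item: MultipleChoiceItem) -> str:
    """Render options as 'A. ...', 'B. ...', one per line."""
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    return "\n".join(f"{letters[i]}. {opt}" for i, opt in enumerate(item.options))


def zero_shot_prompt(item: MultipleChoiceItem) -> str:
    # Zero-shot: no solved examples are included in the prompt.
    # No chain-of-thought: the model is not asked to reason step by step.
    return f"{item.question}\n{format_options(item)}\nAnswer with the letter only."


def chain_of_thought_prompt(item: MultipleChoiceItem) -> str:
    # Shown only for contrast; the post states this style was NOT used.
    return (
        f"{item.question}\n{format_options(item)}\n"
        "Think step by step, then give the final letter."
    )


# Question text taken from the RealWorldQA examples in the post.
lane_item = MultipleChoiceItem(
    question="Where can we go from the current lane?",
    options=("Turn left.", "Go straight.", "Turn left and go straight.", "Turn right."),
)

if __name__ == "__main__":
    print(zero_shot_prompt(lane_item))
```

In a real multimodal evaluation the image would be passed alongside this text; that part is omitted here because it depends on the model API.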

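In the reported results, Grok-1.5V scored 88.3% on diagram understanding, 78.1% on text reading (TextVQA), 76.1% on chart understanding (ChartQA), and 85.6% on document understanding (DocVQA). On real-world understanding (RealWorldQA) it scored 68.7%, the highest among the compared models. The benchmark consists of questions grounded in physical space that are easy for people but hard for AI models, such as comparing the size of a pizza cutter and a pair of scissors, judging which way a driving lane allows the car to go, checking whether there is enough room to pass the car ahead, and inferring a compass direction from a photo. The dataset currently contains more than 700 images, each with a question and a verifiable answer.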
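The scores can be put in context with a little arithmetic. The sketch below uses the percentages listed in the original post (Grok-1.5V first, then the four comparison models in the order they appear there) and computes Grok-1.5V's rank and its margin over the best other model on each benchmark. The function names are illustrative, and `None` marks a score the post does not report.

```python
# Scores in percent, copied from the original post.
# Index 0 is Grok-1.5V; indices 1-4 are the four comparison models.
SCORES = {
    "Diagrams": [88.3, 78.2, 88.7, 88.1, 80.3],
    "TextVQA": [78.1, 78.0, None, None, 73.5],
    "ChartQA": [76.1, 78.5, 81.1, 80.8, 81.3],
    "DocVQA": [85.6, 88.4, 89.5, 89.3, 86.5],
    "RealWorldQA": [68.7, 61.4, 51.9, 49.8, 67.5],
}


def grok_margin(row):
    """Grok-1.5V's score minus the best reported score among the other models."""
    grok, *others = row
    best_other = max(s for s in others if s is not None)
    return round(grok - best_other, 1)


def grok_rank(row):
    """1-based rank of Grok-1.5V among the models with a reported score."""
    grok, *others = row
    return 1 + sum(1 for s in others if s is not None and s > grok)


if __name__ == "__main__":
    for name, row in SCORES.items():
        print(f"{name}: rank {grok_rank(row)}, margin {grok_margin(row):+.1f} points")
```

On these numbers, Grok-1.5V leads RealWorldQA by 1.2 points and TextVQA by 0.1 points, and trails the best comparison model on the diagram, chart, and document benchmarks.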

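As a practical example, the model also demonstrated generating Python code from a flowchart. It interpreted a user-provided diagram of a number-guessing game and wrote out a corresponding runnable program.

xAI stresses that advancing a model's understanding of its physical environment is essential to building AI assistants that are useful in the real world. The RealWorldQA benchmark is intended to measure and encourage progress on this capability. Grok-1.5V is positioned as an important step toward linking digital information with the physical world.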

Original text

Grok-1.5 Vision Preview | xAI

April 12, 2024

Connecting the digital and physical worlds with our first multimodal model.

Introducing Grok-1.5V, our first-generation multimodal model. In addition to its strong text capabilities, Grok can now process a wide variety of visual information, including documents, diagrams, charts, screenshots, and photographs. Grok-1.5V will be available soon to our early testers and existing Grok users.

Grok-1.5V is competitive with existing frontier multimodal models in a number of domains, ranging from multi-disciplinary reasoning to understanding documents, science diagrams, charts, screenshots, and photographs. We are particularly excited about Grok’s capabilities in understanding our physical world. Grok outperforms its peers in our new RealWorldQA benchmark that measures real-world spatial understanding. For all datasets below, we evaluate Grok in a zero-shot setting without chain-of-thought prompting.

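Benchmark                                Grok-1.5V   Comparison models (column labels not preserved)
Diagrams                                 88.3%       78.2%   88.7%   88.1%   80.3%
TextVQA (text reading)                   78.1%       78.0%   -       -       73.5%
ChartQA (charts)                         76.1%       78.5%   81.1%   80.8%   81.3%
DocVQA (documents)                       85.6%       88.4%   89.5%   89.3%   86.5%
RealWorldQA (real-world understanding)   68.7%       61.4%   51.9%   49.8%   67.5%

Example: Writing code from a diagram

User: [image: flowchart of a number-guessing game]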

Certainly! The flowchart you’ve provided describes a simple guessing game where the computer generates a random number, and the user has to guess it. Here’s the Python code that represents the logic in the flowchart:

import random

def guess_number():
    # Generate a random number between 1 and 10
    target = random.randint(1, 10)

    while True:
        guess = int(input("Guess a number between 1 and 10: "))

        if guess == target:
            print("You won!")
            break
        else:
            print("Wrong guess, try again!")

if __name__ == "__main__":
    guess_number()

This code will run the guessing game as described in the flowchart. The user will be prompted to enter their guess until they guess the correct number.

In order to develop useful real-world AI assistants, it is crucial to advance a model's understanding of the physical world. Towards this goal, we are introducing a new benchmark, RealWorldQA. This benchmark is designed to evaluate basic real-world spatial understanding capabilities of multimodal models. While many of the examples in the current benchmark are relatively easy for humans, they often pose a challenge for frontier models.

Which object is larger the pizza cutter or the scissors? A. The pizza cutter is larger. B. The scissors is larger. C. They are about the same size.

Where can we go from the current lane? A. Turn left. B. Go straight. C. Turn left and go straight. D. Turn right.

Given this front camera view from our sedan, do we have enough space to drive around the gray car in front of us? A. Yes. B. No.

Given the picture, in which cardinal direction is the dinosaur facing? A. North. B. South. C. East. D. West.

The initial release of RealWorldQA consists of over 700 images, with a question and easily verifiable answer for each image. The dataset consists of anonymized images taken from vehicles, in addition to other real-world images. We are excited to release RealWorldQA to the community, and we intend to expand it as our multimodal models improve. RealWorldQA is released under CC BY-ND 4.0. The dataset is available for download (677MB).
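Because each image comes with one question and an easily verifiable answer, scoring can be as simple as exact-match accuracy over option letters. The following is a minimal sketch under that assumption; the function names and the answer-parsing rule are hypothetical, and the sample responses and answer key are made up for illustration rather than taken from the dataset.

```python
import re


def extract_choice(response, valid_letters="ABCD"):
    """Return the option letter a response starts with, or None.

    Accepts forms such as 'B', 'B.', '(B)', or 'B. The scissors is larger.'.
    Anything that does not begin with a standalone option letter is treated
    as unanswered.
    """
    pattern = rf"\s*\(?([{valid_letters}])\)?(?:[.:\s]|$)"
    match = re.match(pattern, response)
    return match.group(1) if match else None


def accuracy(predictions, answers):
    """Fraction of predictions that exactly match the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("answer key is empty")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


if __name__ == "__main__":
    # Made-up model responses and answer key, for illustration only.
    responses = ["A. Yes.", "(C)", "Because the lane is marked...", "D"]
    answer_key = ["A", "C", "B", "B"]
    predictions = [extract_choice(r) for r in responses]
    print(predictions, accuracy(predictions, answer_key))
```

Requiring the response to start with the letter is a deliberately strict choice: it avoids miscounting words such as "A" at the start of a sentence, at the cost of marking free-form answers as wrong.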

Advancing both our multimodal understanding and generation capabilities are important steps in building beneficial AGI that can understand the universe. In the coming months, we anticipate making significant improvements in both capabilities, across various modalities such as images, audio, and video.

If you want to be a part of this journey, we are hiring.
