The Decoder · March 13, 2026, 04:25 · approx. 1 min read

Grok 4.20 trails Gemini and GPT-5.4 by a wide margin but sets a new record for not hallucinating

#LLM #Benchmarks #Hallucination #xAI #LargeLanguageModels #ModelEvaluation

TL;DR

xAI's Grok 4.20 reportedly trails Gemini and GPT-5.4 by a wide margin on benchmarks, but has set a new record for the lowest hallucination rate.

AI Deep Analysis · March 13, 2026, 05:41

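Key Points

Two sides of the performance picture

Grok 4.20 trails competing models (Gemini, GPT-5.4) by a wide margin on major benchmarks.

A new record for hallucination suppression

Among the models tested, it produced the fewest hallucinations, setting a new record.

Cost and speed advantages

The article also notes that Grok 4.20 is inexpensive and fast.

Market positioning

The assessment is that the model has a specific strength in hallucination suppression but has not caught up with the top tier in overall performance.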

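Impact Analysis

The article shows that, in evaluating large language models (LLMs), practical reliability measures such as hallucination rate are emerging as important metrics alongside raw benchmark scores. Grok 4.20's approach suggests it could be competitive in specific use cases where reliability is the top priority, and it may encourage the industry to diversify its evaluation criteria.

Editorial Comment

As AI models evolve in different directions, this story shows that choosing between overall capability and excellence in a specific area, depending on the use case, is becoming more important.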

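xAI's Grok 4.20 is cheap, fast, and less prone to hallucination than any other model tested, but it falls short of top-tier performance on benchmarks.

This article, "Grok 4.20 trails Gemini and GPT-5.4 by a wide margin but sets a new record for not hallucinating," was first published on The Decoder.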

Original text

xAI's Grok 4.20 can't keep up with the top AI models in benchmarks but hallucinates less than any other model tested. According to Artificial Analysis, Grok 4.20 Beta scores 48 on the Intelligence Index with reasoning enabled, well behind Gemini 3.1 Pro Preview and GPT-5.4 at 57, but still a 6-point improvement over Grok 4.

Grok trails the latest models from major AI labs in overall benchmark performance. | Image: Artificial Analysis

xAI shipped three API variants: with reasoning, without reasoning, and a multi-agent mode. The model supports a 2-million-token context window and costs 2 or 6 dollars per million tokens, which makes it cheaper than Grok 4 and competitively priced among Western models.

Where Grok 4.20 stands out, of all things, is factual reliability. On the AA Omniscience test, it hit a 78 percent non-hallucination rate, a record, according to Artificial Analysis. The test measures how often a model fabricates an answer instead of admitting it doesn't know, alongside factual recall. When Grok 4.20 didn't have the answer, it got it wrong only about one in five times.
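To make the metric concrete, here is a minimal sketch of how a non-hallucination rate could be computed from graded answers. The article does not give Artificial Analysis's exact formula, so the definition below (abstentions divided by all questions the model did not answer correctly) is an assumption based on the description above. The `Tally` type and the example counts are hypothetical, chosen only so the result matches the reported 78 percent.

```python
from dataclasses import dataclass


@dataclass
class Tally:
    """Hypothetical counts of graded answers on a knowledge benchmark."""
    correct: int
    incorrect: int  # the model answered, but the answer was wrong (fabricated)
    abstained: int  # the model admitted it did not know


def non_hallucination_rate(t: Tally) -> float:
    """Assumed definition: of the questions the model did not get right,
    the share where it abstained instead of fabricating an answer."""
    not_known = t.incorrect + t.abstained
    if not_known == 0:
        raise ValueError("no incorrect or abstained answers to measure")
    return t.abstained / not_known


# Illustrative counts only, not real benchmark data.
example = Tally(correct=500, incorrect=110, abstained=390)
rate = non_hallucination_rate(example)
print(f"non-hallucination rate: {rate:.0%}")
print(f"fabricated when unsure: {1 - rate:.0%}")
```

Under this assumed definition, the count of correct answers does not enter the rate at all. That fits the article's point that a model can score lower on overall benchmarks and still lead on this measure.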
