AIモデルは見たことのない画像を自信満々に描写し、ベンチマークはその問題を見逃している
スタンフォード大学の研究によると、GPT-5やGemini 3 ProなどのマルチモーダルAIモデルは、画像が提供されていない場合でも詳細な画像説明や医療診断を自信を持って生成し、一般的なベンチマークはこの問題を隠蔽していることが明らかになった。
キーポイント
マルチモーダルAIの幻覚生成問題
GPT-5、Gemini 3 Pro、Claude Opus 4.5などの先進的マルチモーダルAIモデルが、実際には画像が提供されていない状況でも、詳細な画像説明や医療診断を自信を持って生成する現象が確認された。
ベンチマークの限界と問題隠蔽
現在広く使用されている評価ベンチマークは、この「画像なし生成」問題を適切に検出できず、むしろ問題を隠蔽していることがスタンフォード大学の研究で明らかになった。
研究手法と発見
スタンフォード大学の研究者らは、画像入力を意図的に省略した条件下でモデルを評価する実験を行い、モデルが現実には存在しない画像内容について詳細な記述を生成することを実証した。
実用上の重大なリスク
特に医療診断などの高リスク応用分野において、モデルが実際のデータに基づかない自信過剰な出力を生成することは、重大な実用上の問題を引き起こす可能性がある。
影響分析・編集コメントを表示
影響分析
この発見は、マルチモーダルAIの信頼性評価における根本的な欠陥を暴露し、特に医療や安全性が重要な分野でのAI適用に重大な疑問を投げかけている。AI評価ベンチマークの設計とモデル開発の両方にパラダイムシフトを要求する内容であり、業界全体の評価基準の見直しを促す可能性が高い。
編集コメント
AIモデルの「自信過剰な誤り」という根本的な問題を浮き彫りにした重要な研究。ベンチマークの盲点が実用リスクに直結する点が特に懸念される。

GPT-5、Gemini 3 Pro、Claude Opus 4.5などのマルチモーダルAIモデルは、画像が一切提供されていない場合でも、詳細な画像描写や医療診断を生成してしまいます。スタンフォード大学の研究は、一般的なベンチマークがこの問題を明確に捉えられていないことを示しています。
本記事「AI models confidently describe images they never saw, and benchmarks fail to catch it」は、The Decoderに最初に掲載されました。
原文を表示
To measure the scope of the problem, the team developed a benchmark called "Phantom-0" containing 200 visual questions across 20 categories, presented without any accompanying image. All tested frontier models - including GPT-5, GPT-5.1, GPT-5.2, Gemini 3 Pro, Claude Opus 4.5, and Claude Sonnet 4.5 - confidently described visual details in over 60 percent of cases, according to the study. When additional prompt instructions common in typical evaluation workflows were added, that rate jumped to 90 to 100 percent.
Fabricated medical diagnoses skew heavily toward severe pathologies
The results get especially alarming in the medical domain. The researchers had Gemini 3 Pro describe nonexistent images and provide diagnoses across five clinical categories: X-ray, brain MRI, ECG, pathology, and dermatology. Each question was repeated with 200 different random seeds.
The findings show that mirage-based diagnoses skew heavily toward severe pathologies. Among the most frequently generated diagnoses, the study found ST-elevation myocardial infarctions (STEMI), melanomas, and carcinomas. While "Normal" and "No diagnosis" appear among the top responses, pathological findings cumulatively dominate by a wide margin.
The chart shows the distribution of Gemini 3 Pro's responses when asked about nonexistent medical images. Although "Normal" and "No diagnosis" are frequent answers, diagnoses of specific pathologies cumulatively dominate. | Image: Asadi, O'Sullivan, Cao et al.
In practice, this could mean a failed image upload leads to an urgent recommendation for a condition that doesn't exist. In API-based applications and agentic tools in particular, it's difficult to verify whether the image actually arrived.
Models reach 70 to 80 percent of benchmark scores without seeing a single image
The study also reveals how deeply this problem distorts model evaluation. The researchers tested four frontier models (Gemini 3 Pro, Gemini 2.5 Pro, GPT-5.1, Claude Opus 4.5) on six established benchmarks: MMMU-Pro, Video-MMMU, and Video-MME for general visual understanding, plus VQA-Rad, MicroVQA, and MedXpertQA-MM for medical image analysis.
The central finding: the models achieved an average of 70 to 80 percent of their full benchmark accuracy without ever seeing an image. The actual image contributed only the remaining 20 to 30 percent of overall performance. The majority of the results on which these models' visual competence claims rest comes from text patterns, prior knowledge, and structural cues in the questions.
The gap was largest for medical benchmarks. Here, models achieved up to 99 percent of their image-mode accuracy through text alone - the actual image contributed almost nothing.
These numbers have direct practical consequences. Companies and hospitals pick AI models based on benchmark rankings. If those rankings largely reflect non-visual reasoning, they say little about a model's actual visual capabilities.
A text-only model with 3 billion parameters outperforms all frontier models and radiologists
To show how far text-based shortcuts can carry, the researchers trained a "super-guesser": a pure text model based on Qwen 2.5 with 3 billion parameters, fine-tuned on the public training set of the ReXVQA benchmark for chest X-ray analysis, with all images removed. The base model was released a year before the benchmark to minimize the risk of data contamination.
According to the study, this text-only model outperformed all frontier multimodal models - including those with hundreds of billions of parameters - on the held-out test set, and beat human radiologists by more than 10 percent on average. It also generated explanations that were in some cases indistinguishable from actual ground-truth reasoning. A model with no access to any images delivered both the correct answer and a plausible visual justification.
Current benchmarks fail to measure what they promise
The super-guesser experiment exposes a two-sided problem. On one side are models that use textual prior knowledge and statistical patterns as shortcuts instead of actually processing images. On the other are benchmarks that enable exactly this behavior: their questions contain enough linguistic cues, structural regularities, and implicit answer distributions that a pure text model can solve them. The two sides reinforce each other.
The study emphasizes that it remains unclear how well multimodal models actually see. A high benchmark score neither proves that a model processed an image, nor can reasoning traces reveal whether a visual justification is based on real input or on a mirage. The researchers don't dispute that the models can process images in principle. Their finding is more specific: current benchmarks can't distinguish whether a model actually uses an image or derives the answer from text. And this silent failure mode varies by domain, according to the study. A model that works in a visually grounded way on natural images doesn't necessarily do so on X-rays or pathology slides.
Models score higher when they hallucinate than when they're told to guess
A further experiment shows that the mirage effect isn't simple guessing. The researchers compared GPT-5.1 in two setups: in mirage mode, the model received the visual question without an image and without any indication that the image was missing. In guess mode, the model was explicitly told no image was available and instructed to pick the best possible answer.
The comparison between "mirage mode" and "guess mode" shows that GPT-5.1 consistently achieves better or comparable results with missing images when the model hallucinates visual information rather than explicitly guessing. | Image: Asadi, O'Sullivan, Cao et al.
Performance dropped in guess mode across nearly all benchmark categories. In both cases, the model has access to the same information: the question text and its trained world knowledge. The difference, according to the researchers, lies in the processing regime. In guess mode, the model knows no image is present and acts conservatively, deriving the answer from obvious cues in the question text. In mirage mode, the model behaves as though an image exists, constructs a plausible perceptual narrative, and activates associations and patterns it doesn't access in the consciously image-free mode.
Previous benchmarking controls that use explicit guessing to identify image-independent questions therefore systematically underestimate the problem: they capture only what a model can solve in conservative mode without an image, not what it achieves in mirage mode.
Cleaned benchmarks reshuffle model rankings
As a solution, the researchers propose the "B-Clean" framework. It first evaluates each candidate model in mirage mode and then removes all questions that at least one model answered correctly without an image. Only questions remain that none of the tested models could solve without visual input.
The researchers applied B-Clean to three benchmarks, with GPT-5.1, Gemini 2.5 Pro, and Gemini 3 Pro as candidates. In the process, 74 to 77 percent of all questions were removed. According to the researchers, this doesn't necessarily mean those questions were poorly constructed. The high filtering rates reflect a mix of unintentional data contamination, hidden statistical patterns in question formulations, and prevalence distributions that allow models to answer correctly without an image.
The "B-Clean" method identifies and removes compromised questions from benchmarks to enable a pure evaluation of AI models' visual understanding. The process significantly reduces the number of test questions and changes performance assessments. | Image: Asadi, O'Sullivan, Cao et al.
On the cleaned benchmarks, accuracy dropped drastically in some cases and barely at all in others. On MicroVQA, GPT-5.1 fell from 61.5 to 15.4 percent, Gemini 3 Pro from 68.8 to 23.2 percent. On MMMU-Pro, however, Gemini 2.5 Pro remained nearly stable (68.0 to 68.2 percent), while GPT-5.1 dropped from 76.0 to 67.1 percent. Model rankings changed on two of three benchmarks, which the researchers interpret as evidence that the original rankings were partly inflated by non-visual reasoning.
The researchers note that B-Clean doesn't deliver absolute values: the cleaned results apply only to comparisons among the tested models and aren't transferable to other models. However, B-Clean does offer a method for relative, vision-grounded comparisons on existing benchmarks without the need to constantly create new ones.
Beyond that, the researchers lay out three demands: modality-ablation testing should become standard in every multimodal evaluation workflow. The field should move toward private or dynamically updated benchmarks that can't be absorbed into pretraining data. And evaluation metrics should measure not absolute accuracy but the delta between image-present and image-absent performance.
Stronger language skills make the problem worse, not better
The researchers hypothesize that the mirage effect stems from how these systems are trained. Modern multimodal models are built on pretrained language models trained on web-scale corpora, enabling them to retrieve statistical regularities and reconstruct plausible contexts from sparse cues. During multimodal training, the models receive an image, a question, and an answer. A human would intuitively rely on the image in this setup because, as the researchers note, a human lacks "access to an entire text corpora." A language model, however, has already internalized this prior knowledge. Optimized for correct next-token prediction, it can take the shorter path and ignore visual information when its linguistic priors already lead to the correct answer.
This effect isn't static. Mirage rates across model generations show that newer versions of the same model tended to exhibit higher mirage rates than older ones, according to the study. Better language capabilities appear to amplify the effect rather than fix it.
"As models become more capable linguistic reasoners, the risk increases that their language abilities will mask deficiencies in other modalities," the researchers write. The study doesn't call into question the general text capabilities of frontier models, but rather their claimed visual understanding - and the suitability of current benchmarks to measure it.
関連記事
今日のまとめ
AI日報で今日の重要ニュースをまとめ読み