AIニュース最前線

The Decoder · April 11, 2026, 18:39 · about 1 min read

Researchers find: AI models would rather guess than ask for help

#MultimodalAI #LanguageModels #ReinforcementLearning #AIEvaluation #AIReliability #HumanAICollaboration
TL;DR

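Researchers evaluated 22 multimodal language models and found that, when visual information is missing, most of them guess rather than ask the user for help. A simple reinforcement learning approach shows signs of improving this.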

AI deep analysis · April 11, 2026, 19:41
Attention: 3 out of 5
Depth (40%): 3
Relevance (30%): 4
Practicality (20%): 3
Novelty (10%): 3

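Key points

1. Multimodal AI tends not to ask for help

ProactiveBench tests whether multimodal language models ask the user for help when visual information is missing. Of the 22 models tested, almost all simply guessed at the missing information instead of asking questions.

2. Reinforcement learning may improve this

The study suggests that a simple reinforcement learning approach can train models to ask the user for missing information, which makes it a practical route to improvement.

3. Impact on practical use

Guessing instead of asking can lead to wrong outputs and lower reliability in real-world applications such as image captioning and visual question answering, so solving this problem directly improves practical usefulness.

4. Novelty of the method

The researchers developed a new benchmark, ProactiveBench, that systematically measures a model's ability to proactively request information, which gives the work some novelty as an evaluation method.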

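Impact analysis

The study exposes an important obstacle to deploying multimodal AI in practice and contributes both an evaluation method and an approach to improvement. Safer, more reliable AI applications need systems that recognize their own uncertainty and engage the user instead of guessing overconfidently.

Editor's comment

The study is valuable for demonstrating AI "overconfidence" empirically, but because the summary names no specific models or companies, its direct impact on the industry is limited. Its main significance is as a proposed evaluation method.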

AI models would rather guess than ask for help, researchers confirm

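ProactiveBench is a benchmark that tests whether multimodal language models ask the user for help when visual information is missing. Most of the 22 models tested did not actively request the information they needed, but a simple reinforcement learning approach showed potential for improvement.

The article "AI models would rather guess than ask for help, researchers find" first appeared on The Decoder.

Original text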

Apr 11, 2026

Nano Banana Pro prompted by THE DECODER

Larger models don't ask better questions
Models that look proactive are mostly just winging it
Reinforcement learning can teach models when to speak up
AI models don't know what they don't know

ProactiveBench tests whether multimodal language models ask users for help when visual information is missing. Out of 22 models tested, almost none ask for what they need, but a simple reinforcement learning approach hints at a fix.

If you ask a person to identify an object that's blocked from view, they'll ask you to move whatever's in the way. Multimodal language models don't work that way. They either hallucinate a wrong answer or just refuse to respond. The new ProactiveBench benchmark puts this problem under a microscope, systematically testing whether today's AI models can recognize when they need help and actually ask for it.

Reactive models hallucinate a wrong answer or refuse to respond. A proactive model would ask to move the blocks, then answer correctly. | Image: De Min et al.

The benchmark pulls from seven existing datasets and turns them into test scenarios that are impossible to solve without human input. Models have to identify hidden objects, clean up noisy images, interpret rough sketches, or request different camera angles. All told, ProactiveBench packs more than 108,000 images into 18,000 samples. A built-in filter strips out any task a model can nail on the first try; to pass, a model has to proactively ask for more information.
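The first-try filter described above can be sketched roughly as follows. This is only an illustration, not the authors' code: the sample fields (`id`, `question`, `answer`), the `predict` callable, and the toy predictor are all hypothetical, and the text does not say whether the real filter runs per model or across models.

```python
from typing import Callable, Dict, List

Sample = Dict[str, str]  # hypothetical record: id, question, gold answer


def drop_first_try_solvable(
    samples: List[Sample], predict: Callable[[Sample], str]
) -> List[Sample]:
    """Keep only samples the model answers incorrectly without any extra input."""
    return [s for s in samples if predict(s) != s["answer"]]


# Toy data: the stand-in "model" only ever recognizes the unoccluded mug.
samples = [
    {"id": "a", "question": "What object is shown?", "answer": "mug"},
    {"id": "b", "question": "What is behind the blocks?", "answer": "teapot"},
]


def toy_predict(sample: Sample) -> str:
    # Assumption for illustration: always guesses the same label.
    return "mug"


remaining = drop_first_try_solvable(samples, toy_predict)
print([s["id"] for s in remaining])
```

Only the occluded sample survives the filter here, so any remaining task can be passed only by requesting more information first.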

ProactiveBench draws on seven datasets across six kinds of scenario: occluded objects (ROD, VSOD), uninformative viewpoints (MVP-N), noisy images (ImageNet-C), sketches (QuickDraw), temporal ambiguities (ChangeIt), and camera movements (MS-COCO). Proactive models ask for help; reactive ones hallucinate or bail. | Image: De Min et al.

Larger models don't ask better questions

The researchers put 22 multimodal language models through their paces, including LLaVA-OV, Qwen2.5-VL, InternVL3, GPT-4.1, GPT-5.2, and o4-mini. In the reference setting with clearly visible objects, the models solve an average of 79.8 percent of tasks. On ProactiveBench, that number craters by more than 60 percentage points, to 17.5 percent.

The ROD dataset tells the starkest story. When objects are hidden behind blocks, accuracy plummets from 98.3 percent in the reference setting to just 8.2 percent. The models can spot objects just fine when they're in plain sight; they just never think to ask someone to uncover them.

With visible objects, models average 79.8 percent accuracy. On ProactiveBench, where they'd need to ask for help, that drops to 17.5 percent. | Image: De Min et al.
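A drop in accuracy can be expressed in percentage points or relative to the starting value, and both follow directly from the figures quoted above. A small sketch using only the reported numbers (the function and variable names are illustrative):

```python
from typing import Tuple


def drop(reference: float, proactive: float) -> Tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    points = reference - proactive
    relative = points / reference * 100
    return round(points, 1), round(relative, 1)


overall = drop(79.8, 17.5)  # average over all tested models
rod = drop(98.3, 8.2)       # ROD dataset only
print(overall, rod)
```

The average falls by about 62 percentage points, which is roughly 78 percent in relative terms; on ROD the relative loss is above 90 percent.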

Model size doesn't help either. InternVL3-1B actually outperforms InternVL3-8B at 27.1 versus 12.7 percent. The older LLaVA-1.5-7B beats the much newer LLaVA-OV-72B at 24.8 versus 13 percent. The choice of underlying language model matters too: LLaVA-NeXT with Vicuna hits 19.3 percent, while the same setup with Mistral manages just 4.5 percent. Closed models like GPT-4.1 posted the best accuracy numbers, though the researchers flag their unusually high COCO scores as possible data contamination.

Models that look proactive are mostly just winging it

Some models appear more proactive than others at first glance. The researchers stress-tested this by swapping valid proactive suggestions with nonsensical ones—like "Rewind the video" for a sketching task. Models that previously seemed proactive picked the meaningless options just as happily. LLaVA-NeXT Vicuna actually bumped its selection rate from 37 to 49 percent when given bogus choices. The takeaway is that what looks like proactivity is really just a lower bar for guessing, not actual understanding.

When valid proactive suggestions get swapped for invalid ones, models like LLaVA-OV-0.5B and InternVL3-1B pick them anyway. Their "proactivity" is guesswork, not comprehension. | Image: De Min et al.
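The swap test can be pictured as comparing how often a model picks a proactive option when that option makes sense and when it does not. The sketch below is a hypothetical reconstruction: the logged picks and the "ask for more strokes" option are made up, and only "rewind the video" comes from the example above.

```python
from typing import Iterable, Set


def selection_rate(picks: Iterable[str], proactive_options: Set[str]) -> float:
    """Fraction of turns on which the model picked one of the proactive options."""
    picks = list(picks)
    return sum(p in proactive_options for p in picks) / len(picks)


# Hypothetical logs for a sketching task: what the model chose on four turns.
valid_run = ["ask for more strokes", "cat", "dog", "ask for more strokes"]
bogus_run = ["rewind the video", "rewind the video", "cat", "rewind the video"]

valid_rate = selection_rate(valid_run, {"ask for more strokes"})
bogus_rate = selection_rate(bogus_run, {"rewind the video"})

# If the rate does not fall once the option stops making sense,
# the apparent proactivity is more likely guessing than understanding.
looks_like_guessing = bogus_rate >= valid_rate
print(valid_rate, bogus_rate, looks_like_guessing)
```

A model that understood the task would pick the nonsensical option far less often than the valid one, so a flat or rising rate is the warning sign.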

Dropping explicit hints into prompts and conversation histories doesn't fix things either. Hints do push the rate of proactive suggestions up, nudging accuracy to 25.8 percent, but that still doesn't beat chance on average. In 16 percent of cases, models just blindly spam proactive suggestions up to the maximum allowed steps. Conversation histories actually make performance worse: models parrot the proactive actions from the history instead of learning from them.
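To make the "maximum allowed steps" behavior concrete, here is a minimal sketch of a multi-turn episode with a step budget. The protocol is an assumption for illustration: the two action labels, the `policy` callable, and the budget of three steps are all hypothetical and not taken from the benchmark.

```python
from typing import Callable, List

Action = str  # assumed labels: "ask" (proactive suggestion) or "answer"


def run_episode(policy: Callable[[List[Action]], Action], max_steps: int) -> List[Action]:
    """Query the policy until it answers or the step budget runs out."""
    history: List[Action] = []
    for _ in range(max_steps):
        action = policy(history)
        history.append(action)
        if action == "answer":
            break
    return history


def spammed_budget(history: List[Action], max_steps: int) -> bool:
    """True when every allowed step was spent on a proactive suggestion."""
    return len(history) == max_steps and all(a == "ask" for a in history)


def always_ask(history: List[Action]) -> Action:
    return "ask"


def ask_once(history: List[Action]) -> Action:
    return "answer" if history else "ask"


spam_flag = spammed_budget(run_episode(always_ask, 3), 3)
normal_flag = spammed_budget(run_episode(ask_once, 3), 3)
print(spam_flag, normal_flag)
```

Counting the episodes flagged this way gives the kind of share reported above for models that never commit to an answer.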

Reinforcement learning can teach models when to speak up

There is a bright spot, though. The researchers showed that proactivity can be trained in. They fine-tuned LLaVA-NeXT-Mistral-7B and Qwen2.5-VL-3B using Group Relative Policy Optimization (GRPO) on roughly 27,000 examples. The key detail: the reward function scores correct predictions higher than proactive suggestions, so the model only asks for help when it's genuinely stuck.
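The reward ordering described here might look something like the sketch below. The numeric values, the action labels, and the assumption that a wrong answer scores lowest are all illustrative; the only property taken from the text is that a correct prediction earns more than a proactive suggestion.

```python
def reward(
    action: str,
    correct: bool = False,
    r_correct: float = 1.0,  # assumed value
    r_ask: float = 0.5,      # assumed value, must stay below r_correct
    r_wrong: float = 0.0,    # assumed value
) -> float:
    """Score one model turn so that correct > ask > wrong."""
    if action == "ask":
        return r_ask
    if action == "answer":
        return r_correct if correct else r_wrong
    raise ValueError(f"unknown action: {action!r}")


scores = [reward("answer", correct=True), reward("ask"), reward("answer", correct=False)]
print(scores)
```

Under this ordering, asking pays off only when the model would otherwise be wrong, which matches the intent of asking only when genuinely stuck.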

After training, both models beat every one of the 22 previously tested models, including o4-mini (37.4 and 38.6 versus 34.0 percent). The learned proactivity also carried over to scenarios outside the training data. On ChangeIt, Qwen2.5-VL-3B's accuracy jumped from 12.4 to 55.6 percent. But get the reward balance wrong, and the whole thing falls apart: when proactive suggestions are rewarded equally to correct answers, the model spams help requests nonstop, and accuracy tanks to 5.4 percent.
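One way to see why the reward balance matters: in GRPO, each sampled response is scored relative to the other responses in its group. The sketch below uses one common formulation of that normalization (reward minus group mean, divided by group standard deviation). The reward values and the three-response group are illustrative assumptions, not the paper's settings.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the mean and spread of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One group of three sampled responses: correct answer, help request, wrong answer.
balanced = group_relative_advantages([1.0, 0.5, 0.0])  # asking earns less than correct
equal = group_relative_advantages([1.0, 1.0, 0.0])     # asking earns the same as correct

print([round(a, 3) for a in balanced])
print([round(a, 3) for a in equal])
```

With equal rewards, a help request is reinforced exactly as strongly as a correct answer while carrying no risk of being wrong, which is consistent with the nonstop requests reported above.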

Even with these gains, a big gap remains compared to the reference setting (40.7 versus 75.1 percent). The researchers have released ProactiveBench as open source and frame it as a first step toward models that know when they're missing information and ask for it instead of making things up.

AI models don't know what they don't know

ProactiveBench taps into a pattern that keeps surfacing across recent AI research: multimodal language models are terrible at handling uncertainty. Moonshot AI's WorldVQA benchmark recently found that even top-tier models cap out around 50 percent in visual object recognition, pointing to baked-in overconfidence.

A Stanford study on what researchers call the Mirage effect drove this point home. Multimodal models like GPT-5 and Gemini 3 Pro confidently described visual details and offered medical diagnoses even when no image was provided. On standard benchmarks, they hit 70 to 80 percent of their normal performance using nothing but text patterns and prior knowledge, essentially faking visual understanding without realizing the input was missing.

Other research tells a similar story. A study on exam question difficulty found that language models can't reliably gauge their own limits, while researchers at Sapienza University of Rome used their "Spilled Energy" method to show that hallucinations leave measurable traces in a model's computations—suggesting that even when models don't know they're guessing, the math under the hood does.
