TLDR AI·2026年4月29日 09:00·約9分

エージェントが検索スタックを代替できるか？

#LLM #Agentic Search #RAG #Reasoning

TL;DR

LLM を活用したエージェント型検索モデルは基本的な検索ツールで回答品質を向上させるが、モデル自体の知識不足を検証する手段がなく、学習データに依存する限界が残る。

AI深層分析2026年4月29日 23:05

重要/ 5段階

深度40%

キーポイント

エージェント検索による品質向上

エージェントに基本検索ツールを与えることで回答品質が向上し、さらに探索を促すことでさらなる改善が見られる。

モデルの知識評価の限界

大規模言語モデルは自分が知らないことを評価する能力を持たないため、知識不足に対する補完手段が依然として存在しない。

学習データの依存性

エージェント検索モデルは関連結果を抽出する方法を習得できるが、その性能の上限や得意分野（スイートスポット）は依然として訓練データによって決定される。

影響分析・編集コメントを表示

影響分析

この記事は、LLM を基盤とした検索エージェントの実用化における根本的な限界を浮き彫りにしており、単なるツール連携だけでなく「モデルの自己認識能力」や「知識補完メカニズム」の開発が今後の重要な課題であることを示唆しています。開発者は検索機能の追加だけで解決策を得られると過信せず、データの偏りや評価バイアスへの対策を講じる必要性があります。

編集コメント

検索機能の追加だけでエージェントが完璧になるわけではないという、開発者にとって重要な警鐘です。技術的な限界を理解した上で、どのようなアーキテクチャで知識不足を補うかが次の勝負どころとなるでしょう。

あなたの職場では検索はどのように実装されていますか？おそらく、取得機能の上に構築された複雑な能力のセットとしてでしょう。私たちの検索 API はクエリを理解し、バックエンドの検索システムを呼び出し、最後に結果を再ランク付けします。

しかし、ループ内にエージェントがあれば、それらすべてが必要でしょうか？検索バックエンドをエージェントに置き換えることはできるのでしょうか？結局のところ、エージェントはユーザーのリクエストを理解し、取得ツールを呼び出し、関連性を自ら評価します。ChatGPT はこれを常に行っていますが、なぜ私たちの検索バーではできないのでしょうか。

つまり、基本的な BM25 バックエンドをエージェントに与えれば、それは検索 API の仕事を引き継ぐことができるでしょうか？

シンプルなツール使用で驚くほど良い結果

実は、初期の発見を測定しました（過去よりも少し深く分析しています）。エージェントに基本的な検索ツールを与え、ポップコーンを用意して観察しました。

Amazon ESCI では、わずかに優れた取得機能と LLM（大規模言語モデル）を組み合わせた場合、性能が向上する様子が見られます。

まず、bm25+e5_embedding のベースラインから始めます。それぞれ 0.289 と 0.314 の NDCG を達成します。

次に、3 つのエージェント実験を測定しました。それぞれ GPT-5-mini が異なる検索ツールの組み合わせを駆動しています。e5 埋め込みを使用するエージェントは 0.359、BM25 を使用するのは 0.385 のスコアを出しました。両方を使用すると 0.410 になります。

GPT-5 が両方のツールを使用した場合は NDCG が 0.453 になりました。

image

これが素晴らしいのは、私がほとんど手を動かしていないからです。単に標準的な大規模言語モデル（LLM）にいくつかの検索ツールを設定しただけです。ここに私のデータに合わせて調整された要素は一つもありません。品質が 0.289 から 0.453 に向上したのは、モデルの質と単純な検索エンジン（retrievers）の結果に過ぎません。

もしエージェントが十分に高速であれば、API を使わずにエージェントで検索を構築する未来があるのでしょうか？ユーザーに対して最適な結果を示すためにクエリについて推論を行うエージェントの世界が実現されるでしょうか？

しかし、エージェントは検索を約 1 回しか呼び出さない

これらのエージェントはどのような巧妙な戦略を採用しているのでしょうか。実は大部分の場合、各検索ツールを約 1 回だけ呼び出しています：

image

推論の追跡（reasoning traces）を見ると、エージェントが検索結果を解釈し、単一のクエリを発行し、最大 20 件までの結果を取得し、それらをランク付けしている様子がわかります。

ごく稀に、BM25 のみを使用する場合に、2 つの並列検索が行われることがあります。例えば、「PVC coupler」の場合、まず「pvc coupler」で BM25 検索を行い、エージェントがその結果に失望してフォローアップクエリ（PVC pipe coupler）を発行するといったケースです：

Calling search_bm25 with args keywords='pvc coupler' top_k=20

PVC カップラーの検索

検索結果をみると、主にネットワークケーブル用の RJ45 カップラーが返されており、ユーザーが意図した内容とは異なるようです。おそらく彼らが求めているのは、配管で使用される PVC パイプ用カップラーでしょう。現在の検索ツールでは、PVC パイプ用カップラーに特化したエントリは得られていません。説明の中に PVC という単語の言及は見受けられますが、より明確な結果を得るためには、より広範なキーワードを使用する方が良いでしょう。「PVC pipe coupler」や「PVC coupling」といった用語に焦点を当てて検索を絞り込む必要があるため、次はそのように試してみます。

Calling search_bm25 with args keywords='PVC pipe coupler' top_k=20

また、エージェントが BM25 を使用して、結果の不一致を見越して同時に 2 つのクエリを発行するケースも確認できます:

Calling search_bm25 with args keywords='AN10 oil catch can no filter' top_k=20

Calling search_bm25 with args keywords='AN10 oil catch can without breather filter' top_k=20

どうやらキーワード検索には、エージェントにより多くのクエリを試させるような何らかの特性があるようです。エージェントは結果を解釈し、なぜ一致しなかったのかを理解し、次のステップを実行できます。キーワード検索には原因と結果（キーワードがマッチした／しなかった）があり、エージェントはその反応を通じて改善を図ることができます。

探索を促すことでさらに向上する

エージェントによる探索を例外ではなく、標準的な運用にしたいと考えています。

ハネス自体に変更を加えたわけではありません。エージェントがより効果的に検索できるよう誘導することは可能です。設定にいくつかの小さな変更を加えた場合、どのようなことが起こるでしょうか：

より多くのツール呼び出しを要求する - エージェントに対して、必要な数のツール呼び出し（例：4 回）を行わなかった場合は継続して試すよう指示します。その際、「素晴らしいですが、もっとできるはずです！」といった励ましのようなメッセージを発します。

エージェントに多様な検索クエリを使用させる - エージェントが本当に探索を行うことを保証するため、エージェントが過去のこの検索で発行したクエリとあまりにも類似しているクエリは拒否します。

GPT-5-mini の例では段階的な改善が確認できます。

戦略

モデル

平均

中央値

2 つのツール

gpt-5-mini

0.4101

0.3743

4 回の呼び出しを要求し、重複を禁止

gpt-5-mini

0.4290

0.3948

4 回の呼び出しを要求し、過去のクエリとの類似度が 0.9 を超えるものを禁止

gpt-5-mini

0.4308

0.4258

これは素晴らしい結果です。より小さなモデルでありながら、以前の GPT-5 ベンチマークに近づいています。ただし、ツール呼び出しの回数は増えています。

私がまだ試していないのは、メタデータの断片 bits of metadata を豊富に備え、コーパスの重要な部分に焦点を当てやすくする検索ツールです。カテゴリフィルターや最新情報のための検索ツールなどがあります。

エージェント型検索モデル（SID-1 など）はどうでしょうか？

新たな可能性として、エージェント型検索モデルが挙げられます。SID などのチームは、RAG スタックの drop-in replacements として機能するモデルを訓練しています。いくつかの検索ツールを設定すれば、SID-1 といったモデルが関連結果をどのように提示するかを判断します。

なぜ専用エージェント型検索モデルが必要なのでしょうか？最先端ラボの LLM は「検索」をウェブ検索と捉えています。そのため、一度クエリを発行して結果を信頼し、次に進みます。これはユーザーの問題について推論しているため（検索の関連性については推論していない）価値があります。

しかし、SID-1 といったモデルは、関連性について現実的に推論します。私たちの企業が構築する検索機能はそれほど洗練されていません。SID-1 はそのことを知っています。そのため、慎重に探索し、結果について推論し、再試行を行います。検索要件を満たす結果の取得にのみ集中した自己ホスト型のサブエージェントとして動作する SID-1 は、メインの LLM がユーザーのタスクに集中できるようにします。

小さく、速く、かつ良質であるためには、焦点を絞る必要があります。ドメインごとに特化したエージェント型検索 LLM が登場するのでしょうか？e コースに優れたもの、求人検索に優れたもの、文書検索に優れたものなどです。あるいは、各ドメイン向けにエージェント型検索モデルをファインチューニングする検索チームが現れるのでしょうか。

トレードオフは埋め込みモデルのそれと似ています：トレーニングデータがモデルの適正領域を決定します。そこから大きく外れれば、突然最適ではない結果が得られるようになります。

「ものを探すこと」と「ディープ・リサーチ」の違い

このエージェント型検索の理想郷はどこで崩壊するのでしょうか？これまでに私たちは*もの*（製品、求人など）を見つけることに焦点を当ててきました。しかし、エージェントが持っていない*情報*についてはどうでしょうか？それは根本的に異なる問題です。

MSMarco パッセージデータセットに変更すると、エージェントにツール操作を任せても改善は見られません。埋め込みモデルには勝てません。

image

この埋め込みモデルはトレーニングセットに MSMarco を含んでいます。つまり検索担当者が最もよく知っています。LLM は自分が知らないものを評価できません。もし正しい情報が何であるかを知っていれば、検索など不要です！

これがエージェント型検索における二つの世界の違いです。一つはエージェントが私たちを助けてものを見つける世界、もう一つはエージェントが学ぶために私たちの助けを必要とする世界です。後者のケースこそが、従来の検索スタックが依然として繁栄している領域です。

私はこの最後のアプローチを「ディープ・リサーチ」と呼び、私がこれまで行ってきたものを「エージェント型検索」と呼んでいます。現在、LLM には知識のギャップが存在しており、私がこれまでに述べたどの方法も、LLM の知識不足を補うものとしては適切ではありません。

エージェントを検索に活用し、より優れた RAG（Retrieval-Augmented Generation）を構築し、クエリ理解において LLM を使用する技術を学ぶために、Cheat at Search with Agents で私と一緒に参加してください。

原文を表示

How is search implemented where you work? Probably as a complex set of capabilities on top of retrieval. Our search APIs understand queries, call backend search systems, and finally rerank results.

But if we had an agent in the loop, would we need all that? Could we replace search backends with an agent? After all, an agent understands user requests, calls retrieval tools, and evaluates relevance on its own. We see ChatGPT do this all the time, why can’t our search bar?

In other words, if you give a basic BM25 backend to an agent, could it take the Search API’s job?

Surprisingly good results with simple tool usage

Well I’ve measured initial findings (a bit deeper than in the past). I gave agents basic search tools and got the popcorn out to watch.

On Amazon ESCI, we see increasing performance with slightly better retrievers + LLMs.

First, we start with a bm25+e5_embedding baseline. They achieve 0.289/0.314 NDCG respectively.

Then we measure three agentic experiments. Each with GPT-5-mini driving different combinations of search tools. We see an agent with e5 embeddings get 0.359 and with BM25 0.385. With both we get 0.410.

GPT-5 with both tools gets an NDCG of 0.453.

What’s amazing about this? I’ve barely lifted a finger. I just set up a stock LLM with some search tools. Nothing here has been fit to my data. A 0.289→0.453 jump in quality depends only on model quality + simple retrievers.

If an agent could be fast enough, is there a future here we build search with agents and not APIs? A world where agents reason about queries to show users the best results?

But, agents call search approximately… once?

What kind of ingenious strategies are these agents employing? It turns out, for the most part, they call each search tool roughly once:

When you look at reasoning traces, you see the agent interpret the search, issue a single query, retrieve up to the max of 20 results, then rank them.

Occasionally, very rarely, with just BM25, you’ll see two simultaneous searches. For example, “PVC coupler” shows first a BM25 search for “pvc coupler”, the agent, disappointed with results, issues a follow up query (PVC pipe coupler):

code

Calling search_bm25 with args keywords='pvc coupler' top_k=20

**Searching for PVC couplers**  I noticed that the search results mainly
returned RJ45 couplers, which are for network cables, not what the user likely
meant. They’re probably asking about PVC pipe couplers used in plumbing. The
current search tool isn't yielding entries for PVC pipe couplers specifically. I
saw mentions of PVC in descriptions, but it’s clearer to use broader keywords to
refine my search. I need to focus on terms like "PVC pipe coupler" or "PVC
coupling" to get better results, so I'll try that next.

Calling search_bm25 with args keywords='PVC pipe coupler' top_k=20

You also see cases where the agent uses BM25 to issue two queries simultaneously, anticipating inconsistent results:

code

Calling search_bm25 with args keywords='AN10 oil catch can no filter' top_k=20
Calling search_bm25 with args keywords='AN10 oil catch can without breather filter' top_k=20

There’s something about keyword search, it seems, that nudges the agent to try more queries. The agent can interpret results, understand why they didn’t match, and take next steps. Keyword search has cause-and-effect (keywords matched / or didn’t). Agents can react to that and improve.

Encouraging exploration improves further

I want agent exploration to be the norm, not the exception.

We haven’t changed a thing about the harness. We could steer agents to search more effectively. What happens with a few small changes to our setup:

Require more tool calls - we ask the agent to keep trying if its made fewer than the required number of tool calls (ie 4). We say something encouraging like “wow these are great, but I bet you can do better!”

Require the agent use diverse retrieval queries - to ensure the agent truly explores, we reject queries too similar to those issued by the agent previously for this search.

Within the GPT-5-mini examples, we see incremental improvements.

strategy

model

mean

median

Two tools

gpt-5-mini

0.4101

0.3743

Require 4 calls, disallow duplicates

gpt-5-mini

0.4290

0.3948

Require 4 calls, disallow queries > 0.9 sim to past queries

gpt-5-mini

0.4308

0.4258

That’s nice, it approaches the GPT-5 benchmark from before, but with a smaller model. Albeit with more tool calls.

What I haven’t yet tested are retrieval tools rich with affordances - bits of metadata making it easy to focus on important parts of the corpus. A category filter. A search tool for recent information. And so on.

What about agentic search models (ie SID-1)?

Another possibility comes from new agentic search models. Teams like SID train models that act as drop-in replacements for your RAG stack. Setup some retrieval tools, and a model like SID-1 will figure out how to surface relevant results.

Why a dedicated agentic search model? Frontier lab LLMs think of “search” as web search. So they issue a query once, trust the results, and move on. That’s valuable because they’re reasoning about the user’s problem - not search relevance.

But a model like SID-1 reasons about relevance realistically. Our companies build less sophisticated retrievers. SID-1 knows that. So it takes care to explore thoughtfully, reason about the results, and retry. As a self-hosted subagent, focused only on retrieving results that satisfy the search, SID-1 leaves the main LLM focused on the user’s task.

To be smaller+faster+good, they must focus. One wonders if we’ll have agentic search LLMs tailored to different domains. One good at e-commerce. Another at job search. Another good at document retrieval, etc. Or search teams fine tuning agentic search models for their domain?

The tradeoffs look like embedding models: the training data dictates the models sweet spot. Move too far away, and suddenly get suboptimal results.

“Finding things” different than Deep Research

Where does this agentic search nirvana break down? So far we’ve focused on finding *things* (products, jobs, etc). But what about *information* the agent doesn’t have? That’s fundamentally a different problem.

Change to MSMarco passages dataset. There’s no improvement by letting an agent drive tools. You can’t beat the embedding model.

This embedding model has MSMarco in its training set. So the retriever knows best. The LLM can’t evaluate what it doesn’t know. If it knew what information was correct, it wouldn’t need search!

That’s a difference between two worlds of agentic search. One where an agent helps us find things. Another where the agent needs our help to learn. This latter case is where the traditional search stack still thrives.

I’d call this last approach “Deep Research” - while what I’ve been doing as “agentic search”. As they currently exist, LLMs have gaps about their knowledge, and nothing I’ve talked about would be suitable for compensating for the LLMs lack of knowledge.

I hope you join me at Cheat at Search with Agents to learn use agents in search. build better RAG and use LLMs in query understanding.

この記事をシェア

Vercel Blog重要度42026年6月25日 22:00

AI SDK 7 の発表

KDnuggets2026年6月25日 21:00

2026 年に AI エンジニアになるためのロードマップ

Vercel Blog重要度42026年6月25日 16:00

Vercel でエージェントに製品デザインを教える

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む

Calling search_bm25 with args keywords='pvc coupler' top_k=20 **Searching for PVC couplers** I noticed that the search results mainly returned RJ45 couplers, which are for network cables, not what the user likely meant. They’re probably asking about PVC pipe couplers used in plumbing. The current search tool isn't yielding entries for PVC pipe couplers specifically. I saw mentions of PVC in descriptions, but it’s clearer to use broader keywords to refine my search. I need to focus on terms like "PVC pipe coupler" or "PVC coupling" to get better results, so I'll try that next. Calling search_bm25 with args keywords='PVC pipe coupler' top_k=20

エージェントが検索スタックを代替できるか？

キーポイント

影響分析

編集コメント

シンプルなツール使用で驚くほど良い結果

しかし、エージェントは検索を約 1 回しか呼び出さない

探索を促すことでさらに向上する

エージェント型検索モデル（SID-1 など）はどうでしょうか？

「ものを探すこと」と「ディープ・リサーチ」の違い

Surprisingly good results with simple tool usage

But, agents call search approximately… once?

Encouraging exploration improves further

What about agentic search models (ie SID-1)?

“Finding things” different than Deep Research

関連記事

エージェントが検索スタックを代替できるか？

キーポイント

影響分析

編集コメント

シンプルなツール使用で驚くほど良い結果

しかし、エージェントは検索を約 1 回しか呼び出さない

探索を促すことでさらに向上する

エージェント型検索モデル（SID-1 など）はどうでしょうか？

「ものを探すこと」と「ディープ・リサーチ」の違い

Surprisingly good results with simple tool usage

But, agents call search approximately… once?

Encouraging exploration improves further

What about agentic search models (ie SID-1)?

“Finding things” different than Deep Research

関連記事