OpenSeekerのオープンソースアプローチがAI検索エージェントのデータ独占を打破することを目指す
オープンソースのAI検索エージェントOpenSeekerは、わずか11,700のトレーニングデータポイントと1回のトレーニングでAlibabaなどのソリューションに匹敵する結果を達成し、データ・コード・モデルをすべて公開することでAI検索エージェントのデータ独占を打破することを目指している。
キーポイント
効率的なトレーニングによる高性能達成
OpenSeekerはわずか11,700のトレーニングデータポイントと1回のトレーニングで、Alibabaなどの大規模なソリューションと同等の結果を達成している。
完全なオープンソースアプローチ
データ、コード、モデルのすべてが公開されており、透明性と再現性を確保している。
データ独占への挑戦
このアプローチはAI検索エージェント分野におけるデータの独占状態を打破することを目的としている。
実用性の実証
小規模なリソースで大企業のソリューションに匹敵する性能を実現し、実用性を証明している。
影響分析・編集コメントを表示
影響分析
この記事は、AI検索エージェント分野においてデータの独占状態を打破する可能性を示しており、オープンソースモデルが大企業のソリューションと競合できることを実証している。これにより、より民主的なAI開発環境の促進と、中小企業や研究者の参入障壁の低下が期待される。
編集コメント
大企業のデータ独占に対抗するオープンソースモデルの成功例として、AI業界の民主化トレンドを象徴する重要なニュース。効率的なトレーニング手法も注目に値する。

わずか11,700のトレーニングデータポイントと1回のトレーニング実行により、AI検索エージェント「OpenSeeker」はAlibabaなど他社のソリューションと同等の結果を達成しています。データ、コード、モデルはすべて公開されており、自由にアクセス可能です。
この記事「OpenSeeker's open-source approach aims to break up the data monopoly for AI search agents」は、The Decoderに最初に掲載されました。
原文を表示
Skip to content
Mar 23, 2026
Nano Banana Pro prompted by THE DECODER
Web link structures replace language model guesswork for training data11,700 data points versus 147,000: data quality beats sheer volume
With just 11,700 training data points and a single training run, the AI search agent OpenSeeker achieves results that rival solutions from Alibaba and others. Data, code, and model are all openly accessible.
Powerful AI search agents—systems that autonomously search the internet for information across multiple steps—have so far been big tech's territory. OpenAI, Google, and Alibaba keep their training data locked down. Even projects that publish their model weights stay quiet about the data behind them.
This data monopoly has held back the open research community for close to a year, according to researchers at Shanghai Jiao Tong University. With OpenSeeker, the academic team aims to change that: all training data (MIT license), the code, and the model weights are openly available.
Web link structures replace language model guesswork for training data
OpenSeeker builds on two core ideas for generating data. For question-answer pairs, the team uses the real link structure of the web as its foundation, generating questions from it. Starting from randomly selected seed pages within a web corpus (around 68 GB of English and 9 GB of Chinese data), the system follows hyperlinks to related pages and pulls out the most important information.
Specific names and terms then get swapped out for vague descriptions, so a search agent can't find the answer with a simple keyword search. This forces genuine multi-step search and reasoning.
OpenSeeker's training question pipeline: from web corpus to graph expansion and obfuscation to double quality checking. | Image: Du et al.
A two-stage filter weeds out unusable questions: a strong base model must not be able to answer them without tools but must be able to solve them with full context. If either condition fails, the question gets tossed.
The second idea focuses on the search paths the model learns from. Web pages contain a lot of noise that drags down the quality of recorded solution paths. During data generation, a teacher model gets a cleaned-up summary of previous search results and makes better decisions based on that.
During training, the student model then sees the raw, uncleaned data but is still expected to reproduce the teacher's high-quality decisions. This forces it to figure out on its own how to separate signal from noise.
11,700 data points versus 147,000: data quality beats sheer volume
OpenSeeker is based on Qwen3-30B-A3B and was trained with just 11,700 data points in a single run using supervised fine-tuning, without any reinforcement learning or repeated tweaking.
According to the paper, the model hit 48.4 percent on the Chinese-language BrowseComp-ZH benchmark, beating Alibaba's Tongyi DeepResearch at 46.7 percent. Alibaba's model went through a three-stage process of extended pre-training, supervised fine-tuning, and reinforcement learning.
OpenSeeker (orange) is the only fully open agent to achieve competitive results on all four benchmarks. | Image: Du et al.
On OpenAI's English-language BrowseComp, OpenSeeker scores 29.5 percent - nearly double DeepDive's 15.3 percent, the previous leader among fully open agents.
A comparison with MiroThinker drives home how much data quality matters over raw quantity: that model was fed 147,000 training examples yet only manages 13.8 percent on BrowseComp-ZH. OpenSeeker hits 3.5 times that score with less than one-twelfth the data.
OpenSeeker's Chinese training data requires an average of 46 tool calls per task, compared to just 27 for BrowseComp-ZH. | Image: Du et al.
There's still a gap compared to the strongest proprietary systems, though. OpenAI's GPT-5-High hits 54.9 percent on BrowseComp, and DeepSeek-V3.2 with 671 billion parameters reaches 51.4 percent. OpenSeeker runs on a fraction of the model size and training effort.
The question of who gets access to high-quality training data has been a central issue in the AI industry for a while now. Last year, a research team released the Common Pile, an 8 TB text dataset built from openly licensed sources. So far, that hasn't done much to shake the dominance of commercial models.
AI News Without the Hype – Curated by Humans
Subscribe to THE DECODER for ad-free reading, a weekly AI newsletter, our exclusive "AI Radar" frontier report six times a year, full archive access, and access to our comment section.
Subscribe now
More than 16% discount.
Read without distractions – no Google ads.
Access to comments and community discussions.
Weekly AI newsletter.
6 times a year: “AI Radar” – deep dives on key AI topics.
Up to 25 % off on KI Pro online events.
Access to our full ten-year archive.
Get the latest AI news from The Decoder.
Subscribe to The Decoder
関連記事
今日のまとめ
AI日報で今日の重要ニュースをまとめ読み