MarkTechPost · June 20, 2026, 07:06 · About a 10-minute read

VibeThinker-3B: A 3B Dense Reasoning Model Built on Qwen2.5-Coder-3B With the Spectrum-to-Signal Post-Training Pipeline

#Reasoning Models #Open Source LLM #SFT #Reinforcement Learning #Efficient AI

TL;DR

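VibeThinker-3B, a lightweight 3B-parameter reasoning model developed by researchers at Sina Weibo, uses its own post-training pipeline to match models hundreds of times its size on math and coding benchmarks, setting a new reference point for resource efficiency.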

AI Deep Analysis · June 19, 2026, 23:02

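Key points

High-efficiency training with the Spectrum-to-Signal (SSP) pipeline

Supervised fine-tuning builds a "spectrum" of reasoning paths, and reinforcement learning then amplifies the "signal" of the correct ones.

Benchmark results that rival much larger models

The model scores 94.3 on AIME26 and 80.2 on LiveCodeBench v6, comparable to models with 671B and 1T parameters.

Specialized for verifiable tasks

The model targets domains where answers can be checked, such as math and coding. For general-knowledge tasks the team recommends larger models.

Four-stage post-training pipeline

Curriculum-based SFT, multi-domain RL (using MGPO), offline self-distillation, and instruction-following RL together give the 3B model both reasoning ability and controllability.

CLR test-time scaling

Claim-Level Reliability Assessment (CLR) adds no parameters. It verifies claims extracted from multiple generated trajectories and aggregates the results, reaching 97.1 on AIME26.

Context length and reward design

Progressive context expansion is dropped in favor of a fixed 64K window, and math RL gives shorter correct answers a higher reward to cut redundant tokens.

Reasoning performance and cost efficiency

Despite its 3B size, the model scores 94.3 on AIME26 (comparable to DeepSeek V3.2 and Kimi K2.5), and CLR test-time scaling raises that further.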

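Impact analysis

The model suggests that reasoning comparable to large models on verifiable tasks can be deployed in resource-constrained environments, including edge devices. For the industry, it is a notable milestone in the shift from competing on parameter count to gaining efficiency through better training methods.

Editorial comment

Matching very large models on verifiable tasks with only 3B parameters is an important step toward making reasoning capability widely accessible. The "Spectrum-to-Signal" training concept in particular could become a reference approach for future lightweight models.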

While recent breakthroughs in AI reasoning have largely been driven by massive scale, with billions of parameters poured in to cross complex cognitive thresholds, VibeThinker-3B is charting a completely different path.

Created by researchers from Sina Weibo Inc (China), this 3-billion-parameter model proves that efficiency can punch far above its weight class. Released under an open-source MIT license, VibeThinker-3B matches the performance of models hundreds of times its size on verifiable tasks like mathematics, coding, and STEM disciplines.

What is VibeThinker-3B

VibeThinker-3B is a compact dense model built on the Qwen2.5-Coder-3B base. It is post-trained, not pretrained from scratch. The research team applies supervised fine-tuning, reinforcement learning, and self-distillation on top.

The training framework continues the Spectrum-to-Signal Principle (SSP) from the earlier VibeThinker-1.5B. SFT (Supervised Fine-Tuning) builds a broad space of valid reasoning paths, the ‘Spectrum.’ RL then amplifies the correct paths, the ‘Signal.’

The model targets one job: reasoning where a verifier can confirm the answer. The research team recommends larger general models for open-domain knowledge tasks. VibeThinker-3B is a specialist by design.

It runs on standard stacks. The model weights require transformers>=4.54.0. For faster inference, vLLM==0.10.1 or SGLang>=0.4.9.post6 is recommended. The BF16 weights are roughly 6 GB, small enough for a single GPU.
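As a rough check on the 6 GB figure, the sketch below multiplies a nominal parameter count by the bytes each parameter occupies. The exact count (taken here as 3.0e9) and the helper name `weight_gb` are assumptions for illustration; activations and the KV cache are not included.

```python
# Back-of-the-envelope size of the weights alone.
# Assumes a nominal 3.0e9 parameters; the real count may differ slightly.
BYTES_PER_PARAM = {"bf16": 2, "fp32": 4}

def weight_gb(num_params: float, dtype: str = "bf16") -> float:
    """Approximate weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

print(weight_gb(3.0e9))          # BF16 estimate
print(weight_gb(3.0e9, "fp32"))  # FP32 estimate, for comparison
```

Long reasoning traces add KV-cache memory on top of this, so the practical footprint on a GPU is larger than the weights alone.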

Image source: https://arxiv.org/pdf/2606.16140v1

Benchmark

On AIME26, VibeThinker-3B scores 94.3. According to the research paper, this is comparable to DeepSeek V3.2 (671B) and Kimi K2.5 (1T).

On LiveCodeBench v6, it reaches 80.2 Pass@1. On OJBench, another code benchmark, it scores 38.6, below the largest models. On HMMT25 it scores 89.3, and on BruMO25 it reaches 93.8. On IMO-AnswerBench, a 400-problem IMO-level set, it scores 76.4.

The table below compares it against much larger reasoning models. The ‘+CLR’ row uses test-time scaling with Claim-Level Reliability Assessment (CLR).

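| Model | Params | AIME26 | HMMT25 | IMO-Ans | LCBv6 | GPQA-D |
| --- | --- | --- | --- | --- | --- | --- |
| VibeThinker-3B | 3B | 94.3 | 89.3 | 76.4 | 80.2 | 70.2 |
| VibeThinker-3B +CLR | 3B | 97.1 | 95.4 | 80.6 | n/a | 72.9 |
| GPT-OSS (high) | 120B | 93.2 | 90.0 | 75.6 | 81.9 | 80.1 |
| DeepSeek V3.2 | 671B | 94.2 | 90.2 | 78.3 | 80.8 | 82.4 |
| GLM-5 | 744B | 95.8 | 97.9 | 82.5 | 85.5 | 86.0 |
| Kimi K2.5 | 1T | 93.3 | 95.4 | 81.8 | 85.0 | 87.6 |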

Source: VibeThinker-3B Technical Report, Table 2. GPQA-D is GPQA-Diamond.

The pattern is consistent. On verifiable math and code, the 3B model sits near the top cluster. On GPQA-Diamond, a knowledge-heavy benchmark, the gap to large models stays visible.
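To put "hundreds of times its size" in numbers, the sketch below divides each comparison model's parameter count from the table by 3B. Treating VibeThinker-3B as exactly 3.0B parameters is an approximation, and the helper name `size_ratio` is hypothetical.

```python
# Parameter counts in billions, as listed in the comparison table.
PARAMS_B = {
    "GPT-OSS (high)": 120,
    "DeepSeek V3.2": 671,
    "GLM-5": 744,
    "Kimi K2.5": 1000,
}

def size_ratio(params_b: float, base_b: float = 3.0) -> float:
    """How many times larger a model is than a nominal 3B model."""
    return params_b / base_b

for name, size in PARAMS_B.items():
    print(f"{name}: about {size_ratio(size):.0f}x")
```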

The research team also ran an out-of-distribution coding test. It used recent LeetCode weekly and biweekly contests, from Apr 25 to May 31, 2026. The model passed 123 of 128 first-attempt Python submissions. That is a 96.1% acceptance rate on unseen problems.

Inside the Spectrum-to-Signal Pipeline

The post-training pipeline runs in four stages. Each one targets a different weakness of small reasoning models.

First comes curriculum-based two-stage SFT. Stage 1 covers math, code, STEM, dialogue, and instruction following broadly. Stage 2 shifts to harder, longer-horizon samples filtered by reasoning length and difficulty. Diversity-Exploring Distillation preserves multiple valid solution paths through both stages.

Second comes multi-domain Reasoning RL. The research team reuses MaxEnt-Guided Policy Optimization (MGPO). MGPO weights prompts near the model’s current capability boundary, where correct and incorrect rollouts coexist. Training runs sequentially across Math, Code, and STEM.
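The article does not give MGPO's formula, so the following is only an illustrative stand-in for the idea of weighting prompts near the capability boundary. It uses the binary entropy of a prompt's empirical pass rate, which peaks when correct and incorrect rollouts are evenly mixed and vanishes when all rollouts agree. The function name and the choice of entropy as the weight are assumptions.

```python
import math

def boundary_weight(num_correct: int, num_rollouts: int) -> float:
    """Binary entropy (in bits) of the empirical pass rate.

    Illustrative stand-in, not the report's formula: the weight is 1.0 when
    half the rollouts are correct and 0.0 when every rollout agrees.
    """
    if num_rollouts <= 0:
        raise ValueError("num_rollouts must be positive")
    p = num_correct / num_rollouts
    if p in (0.0, 1.0):
        return 0.0  # prompt is either too hard or already solved
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Weights for prompts with 0, 2, 4, and 8 correct rollouts out of 8.
for correct in (0, 2, 4, 8):
    print(correct, round(boundary_weight(correct, 8), 3))
```

Under this weighting, prompts the model always solves or always fails contribute nothing, which matches the stated intent of focusing training where outcomes are still uncertain.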

A notable detail: VibeThinker-3B drops progressive context expansion. The research team found high-truncation warm-up hurt long reasoning at this scale. So RL uses a single 64K long-context window throughout.

Math RL adds a Long2Short stage. It redistributes reward among correct trajectories by length. Shorter correct answers get higher reward, longer ones lower, with the group mean unchanged. The goal is fewer redundant tokens without losing accuracy.
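One way to picture the Long2Short redistribution is a zero-sum adjustment among the correct rollouts of a group. The sketch below is a minimal version under assumptions: the linear form, the `alpha` strength, and the function name are not taken from the report. Only the properties described above are preserved, namely that shorter correct answers gain reward, longer ones lose it, and the group mean does not move.

```python
def long2short_rewards(lengths, correct, alpha=0.2):
    """Shift reward from long correct rollouts to short ones.

    Incorrect rollouts keep reward 0. Among correct rollouts, the term
    alpha * (mean_len - length) / mean_len sums to zero, so the group mean
    reward is unchanged. alpha and the linear form are assumptions.
    """
    rewards = [1.0 if ok else 0.0 for ok in correct]
    correct_lengths = [n for n, ok in zip(lengths, correct) if ok]
    if len(correct_lengths) < 2:
        return rewards  # nothing to redistribute
    mean_len = sum(correct_lengths) / len(correct_lengths)
    return [
        r + alpha * (mean_len - n) / mean_len if ok else r
        for r, n, ok in zip(rewards, lengths, correct)
    ]

# Three correct rollouts of different lengths and one incorrect rollout.
lengths = [1000, 2000, 3000, 4000]
correct = [True, True, True, False]
print(long2short_rewards(lengths, correct))
```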

Third, Offline Self-Distillation merges the RL checkpoints back into one student model. Fourth, Instruct RL improves instruction adherence. That stage explains the 93.4 IFEval and 74.5 IFBench scores. Both show reasoning tuning did not break controllability.

CLR: Scaling at Test Time, Not Parameter Count

Claim-Level Reliability Assessment (CLR) is the report’s test-time scaling method. It runs on answer-verifiable tasks and adds no parameters.

The procedure has two steps. The model first generates K = 32 trajectories per problem. From each, it extracts M = 5 decision-relevant claims plus a final answer.

The model then acts as its own verifier. It validates or falsifies each claim, producing binary verdicts. CLR maps these into a nonlinear trajectory reliability score, where one weak claim sharply lowers the weight.

Answers are clustered by equivalence, and the highest reliability-weighted answer wins. The full flow runs 8 times, and the averaged Pass@1 is reported. CLR lifts AIME26 to 97.1 and BruMO25 to 99.2.
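The aggregation step can be sketched as a reliability-weighted vote. The article does not state the nonlinear scoring function, so the version below assumes an exponential penalty per rejected claim; the `penalty` value, the function names, and the use of normalized strings as a stand-in for answer equivalence are all illustrative assumptions. The example uses 4 trajectories rather than K = 32 to stay readable.

```python
from collections import defaultdict

def reliability(verdicts, penalty=0.2):
    """Nonlinear trajectory score from binary claim verdicts.

    Each rejected claim multiplies the weight by `penalty`, so a single
    failure cuts it sharply. The exponential form is an assumption.
    """
    failed = sum(1 for ok in verdicts if not ok)
    return penalty ** failed

def clr_vote(trajectories, penalty=0.2):
    """Return the answer cluster with the highest total reliability.

    `trajectories` is a list of (final_answer, verdicts) pairs. Answers are
    clustered by normalized string, a stand-in for equivalence checking.
    """
    totals = defaultdict(float)
    for answer, verdicts in trajectories:
        totals[answer.strip().lower()] += reliability(verdicts, penalty)
    return max(totals, key=totals.get)

# M = 5 verdicts per trajectory; three flawed paths agree on "17".
samples = [
    ("42", [True, True, True, True, True]),
    ("17", [True, False, True, True, True]),
    ("17", [True, True, False, True, True]),
    ("17", [False, True, True, True, False]),
]
print(clr_vote(samples))  # 42
```

A plain majority vote would pick "17" here. The weighted vote picks "42" because the single fully verified trajectory (weight 1.0) outweighs the three flawed ones (0.2 + 0.2 + 0.04).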

Use Cases With Examples

The research team frames VibeThinker-3B as a specialist, so use cases follow the verifiable-reasoning boundary.

Competitive math tutoring: It solves AIME and HMMT-style problems with full chains of reasoning. A study tool could generate worked solutions and self-check answers locally.

Algorithmic coding help: The 96.1% LeetCode acceptance rate suggests strong one-shot Python generation. An IDE assistant could draft contest-style solutions and run hidden tests.

Cost-sensitive RL or agent backends: A 3B model is cheap to serve at scale. Teams running many verifiable subtasks could route them here instead of a 600B+ model.

On-device reasoning: BF16 weights fit one consumer GPU. Edge or offline deployments gain a reasoning engine without cloud calls.

Running It: Quick Start

Serving with vLLM exposes an OpenAI-compatible endpoint:

    # Install vLLM (version 0.10.1 is the one recommended above)
    pip install vllm

    # Start an OpenAI-compatible server (default port 8000)
    vllm serve "WeiboAI/VibeThinker-3B"

    # Send a chat completion request
    curl -X POST "http://localhost:8000/v1/chat/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "WeiboAI/VibeThinker-3B",
        "messages": [{"role":"user","content":"Prove there are infinitely many primes."}],
        "temperature": 1.0, "top_p": 0.95
      }'

Direct Transformers usage mirrors the official card:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the tokenizer and the BF16 weights; device_map="auto" places them on the available device(s)
    tok = AutoTokenizer.from_pretrained("WeiboAI/VibeThinker-3B", trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        "WeiboAI/VibeThinker-3B", torch_dtype="bfloat16", device_map="auto")

    # Build the prompt with the model's chat template
    msgs = [{"role": "user", "content": "Your prompt"}]
    text = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    inputs = tok([text], return_tensors="pt").to(model.device)

    # Sample with a large token budget so long reasoning traces are not cut off
    out = model.generate(**inputs, max_new_tokens=102400,
                         do_sample=True, temperature=1.0, top_p=0.95)

    # Decode only the newly generated tokens
    print(tok.decode(out[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))

The high max_new_tokens matters. The model produces long reasoning traces, so short caps can truncate answers.
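When the model is served through the OpenAI-compatible endpoint instead, truncation can be detected from the response itself. The sketch below uses only the Python standard library and assumes the local vLLM server from the curl example is listening on port 8000. The helper names are hypothetical, and the check relies on the `finish_reason` field of the OpenAI chat-completions response format, where "length" marks a generation that hit its token limit.

```python
import json
import urllib.request

MODEL = "WeiboAI/VibeThinker-3B"
# Assumes the local server started with `vllm serve` on the default port.
URL = "http://localhost:8000/v1/chat/completions"

def build_payload(prompt: str) -> dict:
    """Same request body as the curl example."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 0.95,
    }

def was_truncated(response: dict) -> bool:
    """True if any choice stopped because it hit the token limit."""
    return any(
        choice.get("finish_reason") == "length"
        for choice in response.get("choices", [])
    )

def ask(prompt: str, url: str = URL, timeout: float = 3600.0) -> dict:
    """POST the payload and return the parsed JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)

# Offline check with a hand-written response in the chat-completions shape.
mock = {"choices": [{"message": {"role": "assistant", "content": "..."},
                     "finish_reason": "length"}]}
print(was_truncated(mock))  # True
```

With a running server, `was_truncated(ask("Your prompt"))` would report whether the answer was cut off, in which case raising the token limit and retrying is the natural next step.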

Key Takeaways

VibeThinker-3B is a 3B dense model, MIT-licensed, built on Qwen2.5-Coder-3B for verifiable reasoning.

It scores 94.3 on AIME26, comparable to DeepSeek V3.2 (671B) and Kimi K2.5 (1T).

CLR test-time scaling lifts AIME26 to 97.1 and BruMO25 to 99.2, with no extra parameters.

On unseen LeetCode contests, it passed 123 of 128 first-attempt Python submissions (96.1%).

The gain is narrow: it trails large models on GPQA-Diamond and broad open-domain knowledge.

The post VibeThinker-3B: A 3B Dense Reasoning Model Built on Qwen2.5-Coder-3B With the Spectrum-to-Signal Post-Training Pipeline appeared first on MarkTechPost.
