MarkTechPost·2026年7月5日 11:31·約12分

Qwen の元リーダーが「ハイブリッド思考」の誤りと、なぜ今「エージェント」を支持するのか

#LLM #Reasoning #Agentic AI #Alibaba Cloud #Qwen

TL;DR

アリババの元Qwen技術責任者であるJunyang Lin氏は、ハイブリッド思考モードの実装における根本的な課題を指摘し、現在は「モデルの訓練からエージェントの訓練へ」というパラダイムシフトを強く支持する立場へと転換した。

AI深層分析2026年7月5日 12:03

重要/ 5段階

深度40%

キーポイント

ハイブリッド思考モードの限界と分離

Lin氏は、即時応答を重視する指示モードと推論に時間を割く思考モードは相反する性質を持ち、無理に統合すると双方の性能が劣化することを指摘し、2025年以降は別々のバリアントとして提供される方針へ転換した。

Qwen3 の技術的詳細と拡張

Qwen3 は多言語サポートを 29 から 119 に拡大し、最大 235B パラメータの MoE モデルを含め、128K コンテキストや動的思考予算機能を搭載しているが、アーキテクチャ上の制約によりハイブリッド化は困難だった。

次世代への転換：エージェント訓練へ

Lin氏は今後の研究開発の方向性を「モデルを訓練すること」から「エージェントを訓練すること」へと明確にシフトさせ、環境フィードバックを用いた RL やより長いコンテキスト処理が重要視されると述べている。

思考の進化：推論から行動へ

Lin は「推論思考」から「エージェント思考」への移行を定義し、前者が内部検討の質に焦点を当てるのに対し、後者は環境との閉ループ相互作用を通じて行動する計画立案と実行に重点を置くと述べている。

最適化目標の変化

推論思考では数学やコードなどの検証可能な答えが報酬信号となるが、エージェント思考では対話環境におけるタスクの成功継続こそが主要な評価基準となり、モデルと環境（ハッチ）の両方を訓練対象とする。

実装アプローチの違い

コーディングエージェントなどの具体例において、推論モデルは単一のパッチを生成するのに対し、エージェントシステムはテストハッチを実行しエラーを読み取って修正を繰り返すなど、ツールオーケストレーションとエラー回復に思考を活用すべきだと示唆している。

Agentic RL のインフラ要件

エージェント型強化学習では、ツールサーバーやサンドボックスを含む環境でのロールアウトが必要となるため、トレーニングと推論の明確な分離が不可欠である。

影響分析・編集コメントを表示

影響分析

この記事は、大規模言語モデルの開発における「万能型思考」への楽観論に警鐘を鳴らし、実用性を高めるためには機能の分離と、より高度な自律性を持つエージェント開発へ焦点を移す必要があるという重要な示唆を与えています。特に元技術責任者の立場からのこの見解転換は、業界全体が単なる推論能力の向上から、環境との相互作用による学習へとパラダイムシフトしていることを裏付ける証拠となります。

編集コメント

「ハイブリッド思考」への期待が高まる中、その実装の難しさを内部関係者が率直に告白した点は非常に貴重です。今後は単なるモデルの性能競争ではなく、環境と対話できるエージェントとしての能力構築が次の主戦場になることが明確になりました。

Junyang Lin はアリババの Qwen プロジェクトの技術リーダーでした。彼は 2026 年 3 月 3 日に退任を発表しました。現在、彼の個人サイトでは独立した研究者として登録されています。

「Qwen: Towards a Generalist Model / Agent」と題された講演で、彼は Qwen ファミリーを紹介し、最後に「モデルのトレーニング → エージェントのトレーニング」という一行で締めくくりました。その後、独立した研究者としてこの一行を詳細な投稿に展開しました。この記事では、その講演と詳細な投稿を合わせて読み解きます。

Lin の講演が実際にカバーしている内容

この講演は単一のリリースではなく、Qwen モデルファミリー全体を紹介するツアーです。QwQ-32B、Qwen2.5-Max、Qwen3、Qwen2.5-VL、そして Qwen2.5-Omni を順に巡ります。各停止地点では、同時代のモデルに対するベンチマークチャートが示されます。比較対象となるベースラインには、DeepSeek-R1、Grok 3 Beta、Gemini 2.5 Pro、OpenAI の o シリーズが含まれます。

Qwen3 のセクションで最も詳細な説明が行われます。Lin はハイブリッド思考モード（hybrid thinking modes）を強調しました。これは、段階的な推論のための「思考モード」と、ほぼ即時の応答のための「非思考モード」です。さらに、動的思考予算（dynamic thinking budgets）も追加され、呼び出し元がモデルの推論量を制限できるようになりました。Qwen3 は多言語サポートを 29 から 119 の言語と方言に拡大しました。

プレゼンテーションでは、0.6B から 235B パラメータに及ぶ多数のモデルタイプとサイズがリストされています。また、GGUF、GPTQ、AWQ、MLX を含む量子化フォーマットも Apache 2.0 ライセンスの下で列挙されています。その後、Web 開発デモと深層研究（Deep Research）デモの 2 つの実演が続きます。最後の「今後の課題」スライドはエージェント指向を示唆しており、さらなる事前学習、環境フィードバックを用いた強化学習（RL）、より長いコンテキスト、そして多様なモダリティへの対応を挙げています。最後に重要なのは、「モデルのトレーニングからエージェントのトレーニングへ」という転換点です。

Qwen3 アーキテクチャ、発表で示された内容

本発表では Qwen3 のアーキテクチャ表が含まれており、以下に再録します。

モデル | レイヤー数 | ヘッド数 (Q/KV) | 埋め込み結合 / エキスパート (総数/アクティブ) | コンテキストサイズ

---|---|---|---|---

Qwen3-0.6B | 28 | 16 / 8 | 結合: はい | 32K

Qwen3-1.7B | 28 | 16 / 8 | 結合: はい | 32K

Qwen3-4B | 36 | 32 / 8 | 結合: はい | 32K

Qwen3-8B | 36 | 32 / 8 | 結合: いいえ | 128K

Qwen3-14B | 40 | 40 / 8 | 結合: いいえ | 128K

Qwen3-32B | 64 | 64 / 8 | 結合: いいえ | 128K

Qwen3-30B-A3B | 48 | 32 / 4 | エキスパート: 128 / 8 | 128K

Qwen3-235B-A22B | 94 | 64 / 4 | エキスパート: 128 / 8 | 128K

小規模な密着型モデルは入力と出力の埋め込みを結合し、32K のコンテキストを使用します。一方、大規模な密着型および MoE（Mixture of Experts）モデルではこの結合を外し、コンテキスト長を 128K に拡張しています。2 つの MoE モデルは、トークンあたり 128 個のエキスパートのうち 8 個のみをアクティブ化します。

ハイブリッド思考と、その統合が困難な理由

Lin はハイブリッド思考を明確な機能として提示しました。投稿では、なぜこれが構築しにくかったのかの説明がなされています。Lin は、思考モードと指示モードが互いに逆方向に引き合うためであると述べています。

強力な指示モデルは、直接的さ、簡潔さ、低遅延に対して報酬を得ます。一方、強力な思考モデルは、困難な問題に多くのトークンを費やすことに対して報酬を得ます。これら 2 つを不注意に融合させると、両方の性能が低下します。思考行動は肥大化し、指示行動は鮮やかさを失います。

Qwen3 は、4 つの段階からなるポストトレーニングパイプラインを用いてこの融合を試みました。そのパイプラインには、長文のコト（Chain of Thought）によるコールドスタート、推論 RL（強化学習）、および「思考モード融合」ステップが含まれていました。その後 2025 年後半に、2507 ラインは別々の指示用と思考用のバリアントをリリースしました。リン氏はこれをモデルの問題というよりもデータの問題として捉えています。

Anthropic は逆のルートを取り、リン氏はそれを有用な修正と呼んでいます。Claude 3.7 Sonnet は、ユーザーが設定可能な思考予算を持つハイブリッドモデルとして出荷されました。Claude 4 では、推論をツール使用と交互に行うことを目指し、コーディングや長時間実行されるタスクに焦点を当てています。彼の主張は、より長い推論トレースがモデルを賢くするわけではないという点です。思考はベンチマークではなく、対象となるワークロードによって形成されるべきです。

インタラクティブ解説器

(function(){

window.addEventListener("message",function(e){

var d=e.data||{};

if(d && d.mtpResize && d.frame==="reasoning-agentic"){

var f=document.getElementById("mtp-reasoning-agentic");

if(f){ f.style.height=d.mtpResize+"px"; }

}

});

})();

「推論」思考から「エージェント型」思考へ

リンは2つの時代を区別します。最初の時代は推論思考であり、o1 や DeepSeek-R1 によって定義されました。この時代は、強化学習（RL）には決定論的で検証可能な報酬が必要であることを分野に教え、数学、コード、論理が中心となりました。また、強化学習を大規模なロールアウトと検証のシステム問題へと変えました。

リンの枠組みにおける次の時代は、行動するための思考、つまりエージェント思考です。エージェントは計画を立て、いつ行動するかを決定し、ツールを使用し、環境からのフィードバックを読み取り、修正を行います。これは長い内的独白ではなく、世界とのクローズドループ相互作用によって定義されます。

リンが列挙するのは、純粋な推論では避けられるが、エージェント思考が扱う必要がある事項です：

いつ思考を停止して行動するかを決定する

どのツールを呼び出し、どのような順序で実行するかを選択する

環境からのノイズの多いまたは不完全な観測を取り込む

失敗後に計画を修正する

多数のターンや多数のツール呼び出しにわたって一貫性を維持する

最適化目標は時代とともに変化します。以下の表はリンが描く対比を要約したものです。

次元 | 推論思考 | エージェント思考

---|---|---

評価基準 | 回答前の内的熟考の質 | 行動しながら進捗が持続しているか

報酬信号 | 検証可能な答え（数学、コード、論理） | インタラクティブ環境におけるタスク成功

トレーニングの核心対象 | モデル | モデルとその環境（ハネス）

インフラのボトルネック | ロールアウト、検証、安定したポリシー更新 | ツールサーバー、サンドボックス、訓練と実行の分離

主な失敗モード：冗長で価値の低い推論トレース、ツールアクセスや環境漏洩を通じた報酬ハッキング

ユースケースと具体例

この区別は構築方法を変えます:

コーディングエージェント: 推論モデルがスタックトレースからパッチを1 つ生成します。エージェントシステムはテストハーネスを実行し、実際のエラーを読み取り、修正して再実行し、スイートが合格するまでこれを繰り返します。ここでは思考がコードベースのナビゲーション、エラー回復、およびツールオーケストレーションに役立つべきです。

深層研究: 推論モデルは記憶から長い回答を記述します。エージェントシステムは質問をサブクエリに分解し、検索を実行し、弱いソースを除外し、根拠のある引用を返します。Qwen の独自の Deep Research デモはこのカテゴリに位置しています。

マルチエージェントオーケストレーション: Lin は「ハーネスエンジニアリング」がより重要になると予想しています。オーケストレーターは作業の計画とルーティングを行います。専門的なサブエージェントは狭いタスクを実行し、コンテキスト汚染の制御を支援します。

具体的なフック：Qwen3 思考トグル

ハイブリッド思考はコードで直接露呈されます。enable_thinking フラグがチャットテンプレート内でモードを切り替えます。

コードをコピーしました別のブラウザを使用してください

from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen3-8B"

tok = AutoTokenizer.from_pretrained(name)

model = AutoModelForCausalLM.from_pretrained(

name, torch_dtype="auto", device_map="auto"

)

messages = [{"role": "user", "content": "この関数をリファクタリングし、変更点を説明してください。"}]

enable_thinking=True -> ステップバイステップ思考モード

enable_thinking=False -> ほぼ即時、非思考モード

text = tok.apply_chat_template(

messages, tokenize=False,

add_generation_prompt=True, enable_thinking=True,

)

inputs = tok(text, return_tensors="pt").to(model.device)

Qwen の推奨する思考モード用のサンプリング

out = model.generate(

**inputs, max_new_tokens=2048,

temperature=0.6, top_p=0.95, top_k=20,

)

enable_thinking=True がデフォルトであり、出力は推論を ... ブロックで囲みます。Qwen3 はソフトスイッチも受け入れます。ユーザーのターンに /think または /no_think を追加することで、メッセージごとにモードを切り替えることができます。このターンごとの制御こそが、動的思考予算の基盤となっています。

なぜアジェンティック RL インフラストラクチャは難しいのか

プレゼンの核心的な工学的ポイントはインフラストラクチャに関するものです。推論 RL ではロールアウトは主に自己完結型の軌跡であり、クリーンな評価器を備えています。一方、アジェンティック RL ではポリシーがツールサーバー、ブラウザ、ターミナル、サンドボックスからなるハネスの中に存在します。

このハネスが新たな要件を強要します：トレーニングと推論は明確に分離されなければなりません。そうでなければロールアウトのスループットは崩壊してしまいます。ライブテスト実行を待っているコーディングエージェントは推論を停止させ、トレーニングを飢えさせます。GPU の利用率は、推論 RL が達成する水準よりも大幅に低下します。

リンはまた、何に執着すべきかを再定義します。SFT（Supervised Fine-Tuning：教師あり微調整）の時代にはチームがデータの多様性を最適化していましたが、エージェントの時代においては、環境の質—安定性、現実味、網羅性、そして悪用耐性—を最適化するべきだと彼は主張しています。彼はその中で、リワード・ハッキング（報酬ハッキング）こそが最も困難な問題であると名指ししました。なぜなら、ツールへのアクセスが可能になることで、偽の最適化に対する攻撃対象領域が拡大するからです。

主要なポイント

ジュンヤン・リンは2026年3月3日にQwenを離れ、現在は独立した研究者として活動しています。

彼の講演は一つの仮説で終わります：この分野はモデルのトレーニングからエージェントのトレーニングへと移行しているという点です。

エージェンシー思考（Agentic thinking）は、内部的な熟考ではなく、環境における持続的な行動によって評価されます。

エージェンシー強化学習（Agentic RL）には、検証可能な報酬だけでなく、訓練と運用を分離したインフラストラクチャと高品質な環境が必要です。

モデルが実際のツールアクセス権を獲得した際、リワード・ハッキングが中心的なリスクとなります。

出典：

一次情報源 — 講演動画

https://www.youtube.com/watch?v=b0xlsQ_6wUQ

一次情報源 — ジュンヤン・リンのブログ

「『推論』思考から『エージェンシー』思考へ」: https://justinlin610.github.io/blog/from-reasoning-to-agentic-thinking/

彼のホームページ（独立研究者としてのステータス）: https://justinlin610.github.io/

Qwen3 の技術詳細（アーキテクチャ、119 言語対応、ハイブリッド思考）

Qwen3 Technical Report (arXiv:2505.09388): https://arxiv.org/abs/2505.09388 · HTML: https://arxiv.org/html/2505.09388v1

コード検証（enable_thinking, /think /no_think, サンプリング）

Qwen ドキュメントクイックスタートガイド：https://qwen.readthedocs.io/en/latest/getting_started/quickstart.html

Qwen3-8B モデルカード：https://huggingface.co/Qwen/Qwen3-8B

Qwen3-32B モデルカード：https://huggingface.co/Qwen/Qwen3-32B

記事内で引用された離職の事実

TechCrunch: https://techcrunch.com/2026/03/03/alibabas-qwen-tech-lead-steps-down-after-major-ai-push/

Bloomberg: https://www.bloomberg.com/news/articles/2026-03-04/alibaba-qwen-head-who-warned-of-openai-gap-steps-down

VentureBeat: https://venturebeat.com/technology/did-alibaba-just-kneecap-its-powerful-qwen-ai-team-key-figures-depart-in

離職の背景や文脈を補足する報道（クロスチェック用。すべてが本文内で直接引用されているわけではない）

RecodeChinaAI (LatePost 翻訳): https://www.recodechinaai.com/p/alibabas-qwen-lead-just-stepped-down

Simon Willison: https://simonwillison.net/2026/Mar/4/qwen/

Geopolitechs: https://www.geopolitechs.org/p/inside-the-stepping-down-of-qwens

OfficeChai: https://officechai.com/ai/alibaba-qwens-tech-lead-junyang-lin-steps-down/

MLQ News: https://mlq.ai/news/key-researcher-steps-down-from-alibabas-qwen-ai-project/

GenAI Assembling (エッセイ分析、エッセイの初出場所を特定するために使用): https://genaiassembling.substack.com/p/what-junyang-lin-saw

X 投稿 2 件

https://x.com/h100envy/status/2068987470960623783

https://x.com/h100envy/status/2073433806254624930

「Qwen の元リーダーがハイブリッド思考の誤りと、なぜ今ではエージェントを支持するのか」という投稿は、MarkTechPost で最初に掲載されました。

原文を表示

Junyang Lin was the technical lead of Alibaba’s Qwen project. He announced he was stepping down on March 3, 2026. He now lists himself as an independent researcher on his personal site.

In a talk titled ‘Qwen: Towards a Generalist Model / Agent,‘ he walks through the Qwen family. It ends on a single line: “Training models -> training agents.” He later expanded that line into an detailed post as an independent researcher. This article reads the talk and the detailed post together.

What Lin’s Talk Actually Covers

The talk is a tour of the Qwen model family, not a single release. It moves through QwQ-32B, Qwen2.5-Max, Qwen3, Qwen2.5-VL, and Qwen2.5-Omni. Each stop shows benchmark charts against contemporaries. The named baselines include DeepSeek-R1, Grok 3 Beta, Gemini 2.5 Pro, and OpenAI’s o-series.

The Qwen3 stop carries the most detail. Lin highlights hybrid thinking modes: a thinking mode for step-by-step reasoning, and a non-thinking mode for near-instant responses. He adds dynamic thinking budgets, so callers can cap how much the model reasons. Qwen3 expanded multilingual support from 29 to 119 languages and dialects.

The presentation lists many model types and sizes from 0.6B to 235B parameters. It also lists quantized formats including GGUF, GPTQ, AWQ, and MLX, all under Apache 2.0. Two demos follow: a Web Dev demo and a Deep Research demo. The closing “Future work” slide points at agents. It lists more pretraining, RL with environment feedback, longer context, and more modalities. The last key mention is the “training models -> training agents.”

Qwen3 Architecture, As Shown in the Talk

The talk includes the Qwen3 architecture tables, reproduced below.

ModelLayersHeads (Q/KV)Tie Embedding / Experts (Total/Act.)Context

Qwen3-0.6B2816 / 8Tie: Yes32K

Qwen3-1.7B2816 / 8Tie: Yes32K

Qwen3-4B3632 / 8Tie: Yes32K

Qwen3-8B3632 / 8Tie: No128K

Qwen3-14B4040 / 8Tie: No128K

Qwen3-32B6464 / 8Tie: No128K

Qwen3-30B-A3B4832 / 4Experts: 128 / 8128K

Qwen3-235B-A22B9464 / 4Experts: 128 / 8128K

The small dense models tie input and output embeddings and use a 32K context. The larger dense and MoE models drop tying and extend context to 128K. The two MoE models activate 8 of 128 experts per token.

Hybrid Thinking, and Why Merging is Hard

Lin presents hybrid thinking as a clean feature. The post explains why it was hard to build. Lin writes that thinking mode and instruct mode pull in opposite directions.

A strong instruct model is rewarded for directness, brevity, and low latency. A strong thinking model is rewarded for spending more tokens on hard problems. Merge the two carelessly, and both degrade. The thinking behavior gets bloated, and the instruct behavior gets less crisp.

Qwen3 tried the merge with a four-stage post-training pipeline. That pipeline included a long-CoT cold start, reasoning RL, and a “thinking mode fusion” step. Later in 2025, the 2507 line shipped separate Instruct and Thinking variants instead. Lin frames this as a data problem more than a model problem.

Anthropic took the opposite route, and Lin calls it a useful corrective. Claude 3.7 Sonnet shipped as a hybrid model with a user-set thinking budget. Claude 4 let reasoning interleave with tool use, aimed at coding and long-running tasks. His point: a longer reasoning trace does not make a model smarter. Thinking should be shaped by the target workload, not by the benchmark.

Interactive Explainer

(function(){

window.addEventListener("message",function(e){

var d=e.data||{};

if(d && d.mtpResize && d.frame==="reasoning-agentic"){

var f=document.getElementById("mtp-reasoning-agentic");

if(f){ f.style.height=d.mtpResize+"px"; }

}

});

})();

From ‘Reasoning’ Thinking to ‘Agentic’ Thinking

Lin draws a line between two eras. The first was reasoning thinking, defined by o1 and DeepSeek-R1. It taught the field that RL needs deterministic, verifiable rewards, so math, code, and logic became central. It also turned RL into a systems problem of large-scale rollouts and verification.

The next era, in his framing, is agentic thinking: thinking in order to act. An agent formulates plans, decides when to act, uses tools, reads environment feedback, and revises. It is defined by closed-loop interaction with the world, not by a long internal monologue.

Lin lists what agentic thinking must handle that pure reasoning can avoid:

Deciding when to stop thinking and take an action

Choosing which tool to invoke, and in what order

Incorporating noisy or partial observations from the environment

Revising plans after failures

Maintaining coherence across many turns and many tool calls

The optimization target changes with the era. The table below summarizes the contrast Lin draws.

DimensionReasoning thinkingAgentic thinking

Judged byQuality of internal deliberation before an answerWhether progress is sustained while acting

Reward signalVerifiable answers (math, code, logic)Task success in an interactive environment

Core object of trainingThe modelThe model plus its environment (the harness)

Infra bottleneckRollouts, verification, stable policy updatesTool servers, sandboxes, train-serve decoupling

Main failure modeVerbose, low-value reasoning tracesReward hacking through tool access and env leaks

Use Cases, With Examples

The distinction changes how you build:

Coding agents: A reasoning model emits one patch from a stack trace. An agentic system runs the test harness, reads the real error, revises, and re-runs until the suite passes. Thinking here should help with codebase navigation, error recovery, and tool orchestration.

Deep research: A reasoning model writes a long answer from memory. An agentic system breaks the question into sub-queries, calls search, drops weak sources, and returns grounded citations. Qwen’s own Deep Research demo sits in this category.

Multi-agent orchestration: Lin expects ‘harness engineering’ to matter more. An orchestrator plans and routes work. Specialized sub-agents execute narrower tasks and help control context pollution.

A Concrete Hook: Qwen3 Thinking Toggle

Hybrid thinking is exposed directly in code. The enable_thinking flag switches modes in the chat template.

Copy CodeCopiedUse a different Browser

from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen3-8B"

tok = AutoTokenizer.from_pretrained(name)

model = AutoModelForCausalLM.from_pretrained(

name, torch_dtype="auto", device_map="auto"

)

messages = [{"role": "user", "content": "Refactor this function and explain the change."}]

enable_thinking=True -> step-by-step thinking mode

enable_thinking=False -> near-instant, non-thinking mode

text = tok.apply_chat_template(

messages, tokenize=False,

add_generation_prompt=True, enable_thinking=True,

)

inputs = tok(text, return_tensors="pt").to(model.device)

Qwen's recommended sampling for thinking mode

out = model.generate(

**inputs, max_new_tokens=2048,

temperature=0.6, top_p=0.95, top_k=20,

)

enable_thinking=True is the default, and the output wraps reasoning in a <think>...</think> block. Qwen3 also accepts soft switches. Appending /think or /no_think to a user turn flips the mode per message. That per-turn control is what dynamic thinking budgets build on.

Why Agentic RL Infrastructure is Harder

The presentation’s core engineering point is about infrastructure. In reasoning RL, rollouts are mostly self-contained trajectories with clean evaluators. In agentic RL, the policy lives inside a harness of tool servers, browsers, terminals, and sandboxes.

That harness forces a new requirement: training and inference must be cleanly decoupled. Without it, rollout throughput collapses. A coding agent waiting on live test execution stalls inference and starves training. GPU utilization drops well below what reasoning RL achieves.

Lin also reframes what to obsess over. In the SFT era, teams optimized data diversity. In the agent era, he argues teams should optimize environment quality: stability, realism, coverage, and exploit resistance. He names reward hacking as the hardest problem, because tool access enlarges the attack surface for spurious optimization.

Key Takeaways

Junyang Lin left Qwen on March 3, 2026, and now publishes as an independent researcher.

His talk ends on one thesis: the field is moving from training models to training agents.

Agentic thinking is judged by sustained action in an environment, not by internal deliberation.

Agentic RL needs decoupled train-serve infra and high-quality environments, not just verifiable rewards.

Reward hacking is the central risk once models gain real tool access.

Sources:

Primary source — the talk

https://www.youtube.com/watch?v=b0xlsQ_6wUQ

Primary source — Junyang Lin’s Blog

“From ‘Reasoning’ Thinking to ‘Agentic’ Thinking”: https://justinlin610.github.io/blog/from-reasoning-to-agentic-thinking/

His homepage (independent-researcher status): https://justinlin610.github.io/

Qwen3 technical details (architecture, 119 languages, hybrid thinking)

Qwen3 Technical Report (arXiv:2505.09388): https://arxiv.org/abs/2505.09388 · HTML: https://arxiv.org/html/2505.09388v1

Code verification (enable_thinking, /think /no_think, sampling)

Qwen docs Quickstart: https://qwen.readthedocs.io/en/latest/getting_started/quickstart.html

Qwen3-8B model card: https://huggingface.co/Qwen/Qwen3-8B

Qwen3-32B model card: https://huggingface.co/Qwen/Qwen3-32B

Departure facts (cited in the article)

TechCrunch: https://techcrunch.com/2026/03/03/alibabas-qwen-tech-lead-steps-down-after-major-ai-push/

Bloomberg: https://www.bloomberg.com/news/articles/2026-03-04/alibaba-qwen-head-who-warned-of-openai-gap-steps-down

VentureBeat: https://venturebeat.com/technology/did-alibaba-just-kneecap-its-powerful-qwen-ai-team-key-figures-depart-in

Supporting departure/context coverage (used for cross-checking, not all cited inline)

RecodeChinaAI (LatePost translation): https://www.recodechinaai.com/p/alibabas-qwen-lead-just-stepped-down

Simon Willison: https://simonwillison.net/2026/Mar/4/qwen/

Geopolitechs: https://www.geopolitechs.org/p/inside-the-stepping-down-of-qwens

OfficeChai: https://officechai.com/ai/alibaba-qwens-tech-lead-junyang-lin-steps-down/

MLQ News: https://mlq.ai/news/key-researcher-steps-down-from-alibabas-qwen-ai-project/

GenAI Assembling (essay analysis, used to first locate the essay): https://genaiassembling.substack.com/p/what-junyang-lin-saw

Two X posts

https://x.com/h100envy/status/2068987470960623783

https://x.com/h100envy/status/2073433806254624930

The post Qwen’s Former Lead on What Hybrid Thinking Got Wrong — and Why He Now Backs Agents appeared first on MarkTechPost.

この記事をシェア

TLDR AI2026年7月3日 09:00

AI 向けラマヌジャン・チャレンジ（1 分読了）

Latent Space重要度42026年7月3日 06:25

未来のウェブサイトは訪問者ごとに自動構成されるかもしれない

Simon Willison Blog2026年7月5日 10:00

sqlite-utils 4.0rc2、主にClaude Fable（約149.25ドル分）が執筆

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む

MarkTechPost·2026年7月5日 11:31·約12分

Qwen の元リーダーが「ハイブリッド思考」の誤りと、なぜ今「エージェント」を支持するのか

#LLM #Reasoning #Agentic AI #Alibaba Cloud #Qwen

TL;DR

AI深層分析2026年7月5日 12:03

重要/ 5段階

深度40%

キーポイント

ハイブリッド思考モードの限界と分離

Qwen3 の技術的詳細と拡張

次世代への転換：エージェント訓練へ

思考の進化：推論から行動へ

最適化目標の変化

実装アプローチの違い

Agentic RL のインフラ要件

影響分析・編集コメントを表示

影響分析

編集コメント

Lin の講演が実際にカバーしている内容

Qwen3 アーキテクチャ、発表で示された内容

本発表では Qwen3 のアーキテクチャ表が含まれており、以下に再録します。

モデル | レイヤー数 | ヘッド数 (Q/KV) | 埋め込み結合 / エキスパート (総数/アクティブ) | コンテキストサイズ

---|---|---|---|---

Qwen3-0.6B | 28 | 16 / 8 | 結合: はい | 32K

Qwen3-1.7B | 28 | 16 / 8 | 結合: はい | 32K

Qwen3-4B | 36 | 32 / 8 | 結合: はい | 32K

Qwen3-8B | 36 | 32 / 8 | 結合: いいえ | 128K

Qwen3-14B | 40 | 40 / 8 | 結合: いいえ | 128K

Qwen3-32B | 64 | 64 / 8 | 結合: いいえ | 128K

Qwen3-30B-A3B | 48 | 32 / 4 | エキスパート: 128 / 8 | 128K

Qwen3-235B-A22B | 94 | 64 / 4 | エキスパート: 128 / 8 | 128K

ハイブリッド思考と、その統合が困難な理由

インタラクティブ解説器

(function(){

window.addEventListener("message",function(e){

var d=e.data||{};

if(d && d.mtpResize && d.frame==="reasoning-agentic"){

var f=document.getElementById("mtp-reasoning-agentic");

if(f){ f.style.height=d.mtpResize+"px"; }

}

});

})();

「推論」思考から「エージェント型」思考へ

リンが列挙するのは、純粋な推論では避けられるが、エージェント思考が扱う必要がある事項です：

いつ思考を停止して行動するかを決定する

どのツールを呼び出し、どのような順序で実行するかを選択する

環境からのノイズの多いまたは不完全な観測を取り込む

失敗後に計画を修正する

多数のターンや多数のツール呼び出しにわたって一貫性を維持する

最適化目標は時代とともに変化します。以下の表はリンが描く対比を要約したものです。

次元 | 推論思考 | エージェント思考

---|---|---

評価基準 | 回答前の内的熟考の質 | 行動しながら進捗が持続しているか

報酬信号 | 検証可能な答え（数学、コード、論理） | インタラクティブ環境におけるタスク成功

トレーニングの核心対象 | モデル | モデルとその環境（ハネス）

インフラのボトルネック | ロールアウト、検証、安定したポリシー更新 | ツールサーバー、サンドボックス、訓練と実行の分離

主な失敗モード：冗長で価値の低い推論トレース、ツールアクセスや環境漏洩を通じた報酬ハッキング

ユースケースと具体例

この区別は構築方法を変えます:

具体的なフック：Qwen3 思考トグル

ハイブリッド思考はコードで直接露呈されます。enable_thinking フラグがチャットテンプレート内でモードを切り替えます。

コードをコピーしました別のブラウザを使用してください

from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen3-8B"

tok = AutoTokenizer.from_pretrained(name)

model = AutoModelForCausalLM.from_pretrained(

name, torch_dtype="auto", device_map="auto"

)

messages = [{"role": "user", "content": "この関数をリファクタリングし、変更点を説明してください。"}]

enable_thinking=True -> ステップバイステップ思考モード

enable_thinking=False -> ほぼ即時、非思考モード

text = tok.apply_chat_template(

messages, tokenize=False,

add_generation_prompt=True, enable_thinking=True,

)

inputs = tok(text, return_tensors="pt").to(model.device)

Qwen の推奨する思考モード用のサンプリング

out = model.generate(

**inputs, max_new_tokens=2048,

temperature=0.6, top_p=0.95, top_k=20,

)

なぜアジェンティック RL インフラストラクチャは難しいのか

主要なポイント

ジュンヤン・リンは2026年3月3日にQwenを離れ、現在は独立した研究者として活動しています。

彼の講演は一つの仮説で終わります：この分野はモデルのトレーニングからエージェントのトレーニングへと移行しているという点です。

エージェンシー思考（Agentic thinking）は、内部的な熟考ではなく、環境における持続的な行動によって評価されます。

エージェンシー強化学習（Agentic RL）には、検証可能な報酬だけでなく、訓練と運用を分離したインフラストラクチャと高品質な環境が必要です。

モデルが実際のツールアクセス権を獲得した際、リワード・ハッキングが中心的なリスクとなります。

出典：

一次情報源 — 講演動画

https://www.youtube.com/watch?v=b0xlsQ_6wUQ

一次情報源 — ジュンヤン・リンのブログ

「『推論』思考から『エージェンシー』思考へ」: https://justinlin610.github.io/blog/from-reasoning-to-agentic-thinking/

彼のホームページ（独立研究者としてのステータス）: https://justinlin610.github.io/

Qwen3 の技術詳細（アーキテクチャ、119 言語対応、ハイブリッド思考）

Qwen3 Technical Report (arXiv:2505.09388): https://arxiv.org/abs/2505.09388 · HTML: https://arxiv.org/html/2505.09388v1

コード検証（enable_thinking, /think /no_think, サンプリング）

Qwen ドキュメントクイックスタートガイド：https://qwen.readthedocs.io/en/latest/getting_started/quickstart.html

Qwen3-8B モデルカード：https://huggingface.co/Qwen/Qwen3-8B

Qwen3-32B モデルカード：https://huggingface.co/Qwen/Qwen3-32B

記事内で引用された離職の事実

TechCrunch: https://techcrunch.com/2026/03/03/alibabas-qwen-tech-lead-steps-down-after-major-ai-push/

Bloomberg: https://www.bloomberg.com/news/articles/2026-03-04/alibaba-qwen-head-who-warned-of-openai-gap-steps-down

VentureBeat: https://venturebeat.com/technology/did-alibaba-just-kneecap-its-powerful-qwen-ai-team-key-figures-depart-in

離職の背景や文脈を補足する報道（クロスチェック用。すべてが本文内で直接引用されているわけではない）

RecodeChinaAI (LatePost 翻訳): https://www.recodechinaai.com/p/alibabas-qwen-lead-just-stepped-down

Simon Willison: https://simonwillison.net/2026/Mar/4/qwen/

Geopolitechs: https://www.geopolitechs.org/p/inside-the-stepping-down-of-qwens

OfficeChai: https://officechai.com/ai/alibaba-qwens-tech-lead-junyang-lin-steps-down/

MLQ News: https://mlq.ai/news/key-researcher-steps-down-from-alibabas-qwen-ai-project/

GenAI Assembling (エッセイ分析、エッセイの初出場所を特定するために使用): https://genaiassembling.substack.com/p/what-junyang-lin-saw

X 投稿 2 件

https://x.com/h100envy/status/2068987470960623783

https://x.com/h100envy/status/2073433806254624930

「Qwen の元リーダーがハイブリッド思考の誤りと、なぜ今ではエージェントを支持するのか」という投稿は、MarkTechPost で最初に掲載されました。

原文を表示

Junyang Lin was the technical lead of Alibaba’s Qwen project. He announced he was stepping down on March 3, 2026. He now lists himself as an independent researcher on his personal site.

What Lin’s Talk Actually Covers

Qwen3 Architecture, As Shown in the Talk

The talk includes the Qwen3 architecture tables, reproduced below.

ModelLayersHeads (Q/KV)Tie Embedding / Experts (Total/Act.)Context

Qwen3-0.6B2816 / 8Tie: Yes32K

Qwen3-1.7B2816 / 8Tie: Yes32K

Qwen3-4B3632 / 8Tie: Yes32K

Qwen3-8B3632 / 8Tie: No128K

Qwen3-14B4040 / 8Tie: No128K

Qwen3-32B6464 / 8Tie: No128K

Qwen3-30B-A3B4832 / 4Experts: 128 / 8128K

Qwen3-235B-A22B9464 / 4Experts: 128 / 8128K

Hybrid Thinking, and Why Merging is Hard

Lin presents hybrid thinking as a clean feature. The post explains why it was hard to build. Lin writes that thinking mode and instruct mode pull in opposite directions.

Interactive Explainer

(function(){

window.addEventListener("message",function(e){

var d=e.data||{};

if(d && d.mtpResize && d.frame==="reasoning-agentic"){

var f=document.getElementById("mtp-reasoning-agentic");

if(f){ f.style.height=d.mtpResize+"px"; }

}

});

})();

From ‘Reasoning’ Thinking to ‘Agentic’ Thinking

Lin lists what agentic thinking must handle that pure reasoning can avoid:

Deciding when to stop thinking and take an action

Choosing which tool to invoke, and in what order

Incorporating noisy or partial observations from the environment

Revising plans after failures

Maintaining coherence across many turns and many tool calls

The optimization target changes with the era. The table below summarizes the contrast Lin draws.

DimensionReasoning thinkingAgentic thinking

Judged byQuality of internal deliberation before an answerWhether progress is sustained while acting

Reward signalVerifiable answers (math, code, logic)Task success in an interactive environment

Core object of trainingThe modelThe model plus its environment (the harness)

Infra bottleneckRollouts, verification, stable policy updatesTool servers, sandboxes, train-serve decoupling

Main failure modeVerbose, low-value reasoning tracesReward hacking through tool access and env leaks

Use Cases, With Examples

The distinction changes how you build:

A Concrete Hook: Qwen3 Thinking Toggle

Hybrid thinking is exposed directly in code. The enable_thinking flag switches modes in the chat template.

Copy CodeCopiedUse a different Browser

from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen3-8B"

tok = AutoTokenizer.from_pretrained(name)

model = AutoModelForCausalLM.from_pretrained(

name, torch_dtype="auto", device_map="auto"

)

messages = [{"role": "user", "content": "Refactor this function and explain the change."}]

enable_thinking=True -> step-by-step thinking mode

enable_thinking=False -> near-instant, non-thinking mode

text = tok.apply_chat_template(

messages, tokenize=False,

add_generation_prompt=True, enable_thinking=True,

)

inputs = tok(text, return_tensors="pt").to(model.device)

Qwen's recommended sampling for thinking mode

out = model.generate(

**inputs, max_new_tokens=2048,

temperature=0.6, top_p=0.95, top_k=20,

)

Why Agentic RL Infrastructure is Harder

Key Takeaways

Junyang Lin left Qwen on March 3, 2026, and now publishes as an independent researcher.

His talk ends on one thesis: the field is moving from training models to training agents.

Agentic thinking is judged by sustained action in an environment, not by internal deliberation.

Agentic RL needs decoupled train-serve infra and high-quality environments, not just verifiable rewards.

Reward hacking is the central risk once models gain real tool access.

Sources:

Primary source — the talk

https://www.youtube.com/watch?v=b0xlsQ_6wUQ

Primary source — Junyang Lin’s Blog

“From ‘Reasoning’ Thinking to ‘Agentic’ Thinking”: https://justinlin610.github.io/blog/from-reasoning-to-agentic-thinking/

His homepage (independent-researcher status): https://justinlin610.github.io/

Qwen3 technical details (architecture, 119 languages, hybrid thinking)

Qwen3 Technical Report (arXiv:2505.09388): https://arxiv.org/abs/2505.09388 · HTML: https://arxiv.org/html/2505.09388v1

Code verification (enable_thinking, /think /no_think, sampling)

Qwen docs Quickstart: https://qwen.readthedocs.io/en/latest/getting_started/quickstart.html

Qwen3-8B model card: https://huggingface.co/Qwen/Qwen3-8B

Qwen3-32B model card: https://huggingface.co/Qwen/Qwen3-32B

Departure facts (cited in the article)

TechCrunch: https://techcrunch.com/2026/03/03/alibabas-qwen-tech-lead-steps-down-after-major-ai-push/

Bloomberg: https://www.bloomberg.com/news/articles/2026-03-04/alibaba-qwen-head-who-warned-of-openai-gap-steps-down

VentureBeat: https://venturebeat.com/technology/did-alibaba-just-kneecap-its-powerful-qwen-ai-team-key-figures-depart-in

Supporting departure/context coverage (used for cross-checking, not all cited inline)

RecodeChinaAI (LatePost translation): https://www.recodechinaai.com/p/alibabas-qwen-lead-just-stepped-down

Simon Willison: https://simonwillison.net/2026/Mar/4/qwen/

Geopolitechs: https://www.geopolitechs.org/p/inside-the-stepping-down-of-qwens

OfficeChai: https://officechai.com/ai/alibaba-qwens-tech-lead-junyang-lin-steps-down/

MLQ News: https://mlq.ai/news/key-researcher-steps-down-from-alibabas-qwen-ai-project/

GenAI Assembling (essay analysis, used to first locate the essay): https://genaiassembling.substack.com/p/what-junyang-lin-saw

Two X posts

https://x.com/h100envy/status/2068987470960623783

https://x.com/h100envy/status/2073433806254624930

The post Qwen’s Former Lead on What Hybrid Thinking Got Wrong — and Why He Now Backs Agents appeared first on MarkTechPost.

この記事をシェア

TLDR AI2026年7月3日 09:00

AI 向けラマヌジャン・チャレンジ（1 分読了）

Latent Space重要度42026年7月3日 06:25

未来のウェブサイトは訪問者ごとに自動構成されるかもしれない

Simon Willison Blog2026年7月5日 10:00

sqlite-utils 4.0rc2、主にClaude Fable（約149.25ドル分）が執筆

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む

Qwen の元リーダーが「ハイブリッド思考」の誤りと、なぜ今「エージェント」を支持するのか

キーポイント

影響分析

編集コメント

enable_thinking=True -> ステップバイステップ思考モード

enable_thinking=False -> ほぼ即時、非思考モード

Qwen の推奨する思考モード用のサンプリング

enable_thinking=True -> step-by-step thinking mode

enable_thinking=False -> near-instant, non-thinking mode

Qwen's recommended sampling for thinking mode

関連記事

Qwen の元リーダーが「ハイブリッド思考」の誤りと、なぜ今「エージェント」を支持するのか

キーポイント

影響分析

編集コメント

enable_thinking=True -> ステップバイステップ思考モード

enable_thinking=False -> ほぼ即時、非思考モード

Qwen の推奨する思考モード用のサンプリング

enable_thinking=True -> step-by-step thinking mode

enable_thinking=False -> near-instant, non-thinking mode

Qwen's recommended sampling for thinking mode

関連記事