Simon Willison Blog · July 5, 2026 07:53 · approx. 2 min read

Better Models, Worse Tools

#LLM Tool Use #Anthropic #Claude Code #Schema Validation #Coding Agents

TL;DR

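Anthropic's latest models (Opus 4.8 and Sonnet 5) generate invalid arguments for tool schemas defined by third-party coding environments, possibly because they were optimized for Anthropic's own tools. On this task the newest models do worse than their predecessors.

AI Deep Analysis · July 5, 2026 08:02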

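Key Points

Declining tool schema conformance in the latest models

Anthropic's top models (Opus 4.8, Sonnet 5) violate externally defined tool schemas more often than earlier models do, generating fields that do not exist in the schema.

A side effect of optimizing for first-party tools

One theory is that reinforcement learning aimed at making the models better at Claude Code's built-in edit tools has cost them generality with external coding harnesses such as Pi.

Searching for a developer workaround

To cope with per-model behavioral differences, third-party tools may need to implement multiple edit tools and select whichever works best for the model in use.

Signs of an industry-wide paradigm shift

While OpenAI's Codex relies on a patch-application mechanism, Anthropic's edit tool takes a different approach, and the contrast points to a trade-off between model capability gains and compatibility with general-purpose tools.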

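Impact Analysis

This phenomenon suggests that when a model developer optimizes for maximum performance inside its own ecosystem, the process can cause unexpected compatibility problems for external developers and general-purpose platforms. The industry as a whole should recognize model-specific training bias in LLM tool calling, and the importance of flexible architectures that can absorb it.

Editorial Comment

An important observation that highlights an ironic reality: a "better model" is not necessarily an "easier-to-use tool." Developers need to focus not only on adopting the latest models but also on schema robustness and on understanding each model's characteristics.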

Better Models: Worse Tools

Armin reports on a weird problem he ran into while hacking on Pi:

The short version is that newer Claude models sometimes call Pi’s edit tool with extra, invented fields in the nested edits[] array. And not Haiku or some small model: Opus 4.8. The edit itself is usually correct but the arguments do not match the schema as the model invents made-up keys and Pi thus rejects the tool call and asks to try again.
That alone is not too surprising as models emit malformed tool calls sometimes. Particularly small ones. What surprised me is that this is getting worse with newer Anthropic models as both Opus 4.8 and Sonnet 5 show it but none of the older models. In other words, the SOTA models of the family are worse at this specific tool schema than their older siblings.
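
The failure mode described in the quote can be illustrated with a minimal sketch of strict argument validation. The source does not show Pi's actual edit tool schema, so the field names here (`path`, `edits`, `old_text`, `new_text`) and the `validate_edit_call` helper are hypothetical. The point is only that an otherwise correct edit is rejected as soon as a nested entry carries a key the schema does not declare.

```python
# Hypothetical schema: these field names are illustrative assumptions,
# not Pi's actual edit tool definition.
ALLOWED_TOP_LEVEL = {"path", "edits"}
ALLOWED_EDIT_KEYS = {"old_text", "new_text"}


def validate_edit_call(args: dict) -> list[str]:
    """Return a list of schema violations; an empty list means the call is valid."""
    errors = []
    # Check the top-level object for unknown and missing fields.
    for key in sorted(set(args) - ALLOWED_TOP_LEVEL):
        errors.append(f"unexpected top-level field: {key!r}")
    for key in sorted(ALLOWED_TOP_LEVEL - set(args)):
        errors.append(f"missing top-level field: {key!r}")

    edits = args.get("edits", [])
    if not isinstance(edits, list):
        errors.append("'edits' must be a list")
        return errors

    # Check every nested edits[] entry the same way.
    for i, edit in enumerate(edits):
        if not isinstance(edit, dict):
            errors.append(f"edits[{i}] must be an object")
            continue
        for key in sorted(set(edit) - ALLOWED_EDIT_KEYS):
            errors.append(f"edits[{i}]: unexpected field {key!r}")
        for key in sorted(ALLOWED_EDIT_KEYS - set(edit)):
            errors.append(f"edits[{i}]: missing field {key!r}")
    return errors


if __name__ == "__main__":
    call = {
        "path": "app.py",
        "edits": [{"old_text": "x = 1", "new_text": "x = 2", "reason": "bump"}],
    }
    print(validate_edit_call(call))
```

A harness along these lines could return the error list to the model as the tool result, which corresponds to the "rejects the tool call and asks to try again" step in the quote.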

Armin theorizes that this is because more recent Anthropic models have been specifically trained (presumably via Reinforcement Learning) to better use the edit tools that are baked into Claude Code. This has the unfortunate effect that other coding harnesses, such as Pi, may find that their own custom edit tools are more likely to be used incorrectly.

Claude's edit tool uses search and replace. OpenAI's Codex uses an apply_patch mechanism instead, and OpenAI have talked in the past about how their models are trained to use that tool effectively.
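
As a rough illustration of the search-and-replace style of editing, here is a minimal sketch. The function name `apply_search_replace` and the rule that the search text must match exactly once are assumptions made for this example, not a description of how Claude Code's tool is implemented. The `apply_patch` format is not reproduced here because the source does not describe it.

```python
def apply_search_replace(source: str, old: str, new: str) -> str:
    """Replace exactly one occurrence of `old` in `source` with `new`.

    Assumed rule for this sketch: the search text must be non-empty and
    must match exactly once, so the edit cannot land in the wrong place.
    """
    if not old:
        raise ValueError("search text must not be empty")
    count = source.count(old)
    if count == 0:
        raise ValueError("search text not found")
    if count > 1:
        raise ValueError(f"search text is ambiguous: {count} matches")
    return source.replace(old, new, 1)


if __name__ == "__main__":
    original = "a = 1\nb = 2\n"
    print(apply_search_replace(original, "b = 2", "b = 3"))
```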

Does this mean third-party coding harnesses like Pi should implement multiple edit tools just so they can use the one with the best performance for the underlying model the user has selected?
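
If a harness did go that route, the selection step might look something like the sketch below. Everything in it is hypothetical: the model-name prefixes, the tool names, and the default are placeholders for illustration, and nothing in the source says Pi does or will work this way.

```python
# Hypothetical mapping from model-name prefix to preferred edit tool.
# Both the prefixes and the tool names are illustrative placeholders.
EDIT_TOOL_BY_MODEL_PREFIX = {
    "claude": "search_replace",
    "gpt": "apply_patch",
}
DEFAULT_EDIT_TOOL = "search_replace"


def pick_edit_tool(model_name: str) -> str:
    """Pick the edit tool to expose for the given model, by name prefix."""
    lowered = model_name.lower()
    for prefix, tool in EDIT_TOOL_BY_MODEL_PREFIX.items():
        if lowered.startswith(prefix):
            return tool
    # Unknown model families fall back to a single default tool.
    return DEFAULT_EDIT_TOOL


if __name__ == "__main__":
    for name in ["claude-example", "gpt-example", "some-other-model"]:
        print(name, "->", pick_edit_tool(name))
```

The cost of this design is that every edit tool has to be implemented, documented in the prompt, and tested separately, which is the burden the question above is pointing at.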

Tags: armin-ronacher, ai, openai, generative-ai, llms, anthropic, llm-tool-use, coding-agents, pi
