Interconnects · February 9, 2026, 23:03 · approx. 11 min read

Opus 4.6, Codex 5.3, and the post-benchmark era

#LLM #Code Generation #AI Agents #OpenAI #Anthropic

TL;DR

A comparison of the latest coding-agent models, OpenAI's Codex 5.3 and Anthropic's Opus 4.6, concludes that the performance gap between them is small and that Claude holds a slight edge in usability and reliability.

AI deep analysis · April 27, 2026, 16:43

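Key points

New model releases and market dynamics

OpenAI's GPT-5.3-Codex and Anthropic's Claude Opus 4.6 were announced into a competitive landscape in which Anthropic currently leads on agents.

Codex 5.3's progress and convergence with Claude

Codex 5.3 is a major step up from earlier versions: with faster feedback and stronger handling of a broad range of tasks (such as git operations), it approaches Claude's level of product-market fit.

OpenAI's edge in complex code fixes

Claude models are dependable on simple tasks, while OpenAI's model may hold a slight edge in complex bug fixing and codebase understanding.

The trade-off between user experience and supervision cost

Codex 5.3 needs "babysitting" in the form of more detailed instructions, and it still trails Claude in contextual understanding and reliability.

The trade-off between capability and ease of use in the new models

Opus 4.6 and Codex 5.3 both push capability and speed at some cost to ease of use; in particular, both tend to ignore an instruction when several tasks are queued.

Claude Code's product advantage and reach

Codex has made great strides, but Claude Code offers a better experience across a wider range of tasks, which gives Claude a large advantage as the agent to recommend to users with little software experience.

The end of the benchmark era and the use of multiple models

With the releases of Opus 4.6 and Codex 5.3, benchmark scores are no longer a meaningful signal for users. No single model is best for a given use case, and users need the skill of working with several models.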

Key quotes

"Codex 5.3 feels much more Claude-like, where it’s much faster in its feedback and much more capable in a broad suite of tasks from git to data analysis"

"OpenAI’s latest GPT, with this context, keeps an edge as a better coding model... it seems to be a bit better at finding bugs and fixing things in codebases"

"Switching from Opus 4.6 to Codex 5.3 feels like I need to babysit the model in terms of more detailed descriptions"

"Each of the AI laboratories, and the media ecosystems covering them, have been on this transition away from standard evaluations at their own pace."

"It should be clear with the releases of both Opus 4.6 and Codex 5.3 that benchmark-based release reactions barely matter."

"They were likely not the only AI lab to note the coming role of agents, but they were by far the first to shift their messaging and prioritization towards this."

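Impact analysis

This article suggests that the AI coding-assistant market is shifting from "plain code generation" to "reliability and controllability as a capable agent." Developers now weigh how easily a model fits into their workflow and how forgiving it is of errors, not just its absolute performance, and the competition between Anthropic and OpenAI is being decided in the details of user experience.

Editor's comment

A valuable first-hand report showing that real-world "reliability" and "supervision cost," not just the latest benchmark numbers, have become the keys to competition.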

Last Thursday, February 5th, both OpenAI and Anthropic unveiled the next iterations of their models designed as coding assistants, GPT-5.3-Codex and Claude Opus 4.6, respectively. Ahead of this, Anthropic had a firm grasp of the mindshare as everyone collectively grappled with the new world of agents, primarily driven by the step change in performance that came with Claude Code on Opus 4.5. This post doesn’t unpack how software is changing forever, how Moltbook is showcasing the future, how ML research is accelerating, or the many broader implications; it focuses instead on how to assess, live with, and prepare for new models. The fine margins between Opus 4.6 and Codex 5.3 will be felt in many model versions this year, with Opus ahead in this matchup on usability.

Going into these releases I’d been using Claude Code extensively as a general computer agent, with some software engineering and a lot of data analysis, automation, etc. I had dabbled with Codex 5.2 (usually on xhigh, maximum thinking effort), but found it not to quite work for me among my broad, horizontal set of tasks.

For the last few days, I’ve been using both of the models much more evenly. I mean this as a great compliment, but Codex 5.3 feels much more Claude-like, where it’s much faster in its feedback and much more capable in a broad suite of tasks from git to data analysis (previous versions of Codex, including up to 5.2, regularly failed basic git operations like creating a fresh branch). Codex 5.3 takes a very important step towards Claude’s territory by having better product-market fit. This is a very important move for OpenAI, and of the two models, Codex 5.3 feels like the bigger departure from its predecessors.

OpenAI’s latest GPT, with this context, keeps an edge as a better coding model. It’s hard to describe this general statement precisely, and a lot of it is based on reading others’ work, but it seems to be a bit better at finding bugs and fixing things in codebases, such as the minimal algorithmic examples for my RLHF Book. In my experience, this is a minor edge, and the community thinks that this is most apparent in complex situations (i.e. not most vibe-coded apps).

As users become better at supervising these new agents, having the best top-end ability in software understanding and creation could become a meaningful edge for Codex 5.3, but it is not an obvious advantage today. Many of my most trusted friends in the AI space swear by Codex because it can be just this tiny bit better. I haven’t been able to unlock it.

Switching from Opus 4.6 to Codex 5.3 feels like I need to babysit the model in terms of more detailed descriptions when doing somewhat mundane tasks like “clean up this branch and push the PR.” I can trust Claude to understand the context of the fix and generally get it right, where Codex can skip files, put stuff in weird places, etc.

Both of these releases feel like the companies are pushing for capabilities and speed of execution in the models, but at the cost of some ease of use. I’ve found both Opus 4.6 and Codex 5.3 ignoring an instruction if I queue up multiple things to do; they’re really best when given well-scoped, clear problems (especially Codex). Claude Code’s harness has a terrible bug that makes subagents brick the terminal, where new messages say you must compact or clear, but compaction fails.

Despite the massive step by Codex, they still have a large gap to close to Claude on the product side. Opus 4.6 is another step in the right direction, where Claude Code feels like a great experience. It’s approachable, it tends to work in the wide range of tasks I throw at it, and this’ll help them gain much broader adoption than Codex. If I’m going to recommend a coding agent to an audience who has limited-to-no software experience, it’s certainly going to be Claude. At a time when agents are just emerging into general use, this is a massive advantage, both in mindshare and feedback in terms of usage data. [1]

In the meantime, there’s no cut-and-dried guideline on which agent you need to use for any use-case; you need to use multiple models all the time and keep up with the skill that is managing agents.

Assessing models in 2026

There have been many hints through 2025 that we were heading toward an AI world where benchmarks associated with model releases no longer convey meaningful signal to users. Back in the time of the GPT-4 or Gemini 2.5 Pro releases, the benchmark deltas could be easily felt within the chatbot form factor of the day — models were more reliable, could do more tasks, etc. This continued through models like OpenAI’s o3. During this phase of AI’s buildout, roughly from 2023 to 2025, we were assembling the core functionality of modern language models: tool-use, extended reasoning, basic scaling, etc. The gains were obvious.

It should be clear with the releases of both Opus 4.6 and Codex 5.3 that benchmark-based release reactions barely matter. For this release, I barely looked at the evaluation scores. I saw that Opus 4.6 had a bit better search scores and Codex 5.3 used far fewer tokens per answer, but neither of these was going to make me sure they were much better models.

Each of the AI laboratories, and the media ecosystems covering them, have been on this transition away from standard evaluations at their own pace. The most telling example is the Gemini 3 Pro release in November of 2025. The collective vibe was Google is back in the lead. Kevin Roose, self-proclaimed “AGI-pilled” NYTimes reporter in SF said:

There's sort of this feeling that Google, which kind of struggled in AI for a couple of years there — they had the launch of Bard and the first versions of Gemini, which had some issues — and I think they were seen as sort of catching up to the state of the art. And now the question is: is this them taking their crown back?

We don’t need to dwell on the depths of Gemini’s current crisis, but they have effectively no impact at the frontier of coding agents, which as an area feels the most likely to see dramatic strides in performance and, dare I say, to meet many commonly accepted definitions of AGI that center around the notion of a “remote worker.” The timeline has left them behind two months after their coronation, showing Gemini 3 was hailed as a false king.

On the other end of the spectrum is Anthropic. With Anthropic’s release of Claude 4 in May of 2025, I was skeptical of their bet on code — I was distracted by the glitz of OpenAI and Gemini trading blows with announcements like models achieving IMO Gold medals in mathematics or other evaluation breakthroughs.

Anthropic deserves serious credit for the focus of its vision. They were likely not the only AI lab to note the coming role of agents, but they were by far the first to shift their messaging and prioritization towards this. In my post in June of 2025, a month after Claude 4 was released, I was coming around to them being right to deprioritize standard benchmarks:

This is a different path for the industry and will take a different form of messaging than we’re used to. More releases are going to look like Anthropic’s Claude 4, where the benchmark gains are minor and the real world gains are a big step. There are plenty of more implications for policy, evaluation, and transparency that come with this. It is going to take much more nuance to understand if the pace of progress is continuing, especially as critics of AI are going to seize the opportunity of evaluations flatlining to say that AI is no longer working.

This leaves me reflecting on the role of Interconnects’ model reviews in 2026. 2025 was characterized by many dramatic, day-of model release blog posts, with the entry of many new Chinese open model builders, OpenAI’s first open language model since GPT-2, and of course the infinitely hyped GPT-5. These timely release posts still have great value — they center the conversation around the current snapshot of a company vis-a-vis the broader industry, but if models remain similar, they’ll do little to disentangle the complexity in mapping the current frontier of AI.

In order to serve my role as an independent voice tracking the frontier models, I need to keep providing regular updates on how I’m using models, why, and why not. Over time, the industry is going to develop better ways of articulating the differences in agentic models. For the next few months, maybe even years, I expect the pace of progress to be so fast and uneven in agentic capabilities, that consistent testing and clear articulation will be the only way to monitor it.

[1] The emerging frontier of coding agents is in the use of subagents (or “agent teams”, which are subagents that can work together), where the primary orchestration agent sends off copies of itself to work on pieces of the problem. Claude is slightly ahead here with more polished features, but the space will evolve quickly, and maybe OpenAI can take their experiences with products like GPT-Pro to make a Pro agent.

The GPT-Pro line of models is a major advantage OpenAI has over Anthropic. I use them all the time. As we learn to use these agents for more complex, long-term tasks, harnessing more compute on a single problem will be a crucial differentiator.
