TLDR AI·2026年7月1日 09:00·約14分

Claude Sonnet 5（4 分間の読み物）

#LLM #Reasoning #Anthropic #Claude #Long-context

TL;DR

Anthropic が新モデル「Claude Sonnet 5」を発表し、推論能力や長文コンテキストの処理において大幅な性能向上を達成したと発表している。

AI深層分析2026年7月2日 00:04

重要/ 5段階

深度40%

キーポイント

推論能力の飛躍的向上

複雑な数学的問題やコード生成タスクにおける解決精度が前世代から劇的に改善され、専門的な推論能力が強化された。

超長文コンテキストの最適化

数十万トークン規模のコンテキストウィンドウを維持しながらも、情報の検索精度と整合性を大幅に向上させた。

コスト効率と速度のバランス

高性能化を実現しつつ、従来モデルと比較して推論コストと応答時間の最適化を図り、実用性の高いラインナップとなった。

影響分析・編集コメントを表示

影響分析

この発表は、LLM の推論能力に関する競争をさらに激化させ、複雑なタスク処理が可能な次世代 AI ツールへの移行を加速させる要因となるでしょう。特に、高度な分析やコード生成が必要な開発者や企業にとって、Claude Sonnet 5 は強力な代替手段となり、市場のダイナミクスに大きな影響を与える可能性があります。

編集コメント

推論能力の向上は実務での信頼性を高める上で決定的な要素であり、Claude Sonnet 5 の登場は業界標準を再定義する重要な一歩です。

Claude Sonnet 5 は、これまでで最も自律的な Sonnet モデルとなるよう設計されています。計画立案が可能であり、ブラウザやターミナルなどのツールを使用し、数ヶ月前まではより大規模で高価なモデルを必要としていたレベルで自律的に動作できます。

多くの開発者にとって、自律型 AI の時代は Sonnet クラスのモデルから始まりました：Claude Sonnet 3.5、3.6、および 3.7 は、コーディングやツール使用において印象的なスキルを示した最初のモデルでした。しかし最近では、自律的機能における最も明確な進歩は、Opus クラスのモデルで見られています。

Sonnet 5 はその格差を縮めます：その性能は Opus 4.8 に近いものですが、より低い価格で提供されます。推論、ツール使用、コーディング、知識作業など、自律的パフォーマンスの重要な側面において、前作である Sonnet 4.6 から大幅な改善が図られています。

image*Sonnet 5 の各種評価におけるスコアを、Sonnet 4.6 および Opus 4.8（参考のために、より汎用的な能力を持つモデル）と比較したものです。Claude Sonnet 5 システムカードでは、より広範な評価の詳細が報告されています。*

私たちの安全性評価では、Sonnet 5 は Sonnet 4.6 と比べて望ましくない行動の全体的な発生率が低く、エージェント（agent）コンテキストにおいて一般的により安全に使用できることが示されました。また、評価結果からは、現在の Opus モデルと比較してサイバーセキュリティタスクを実行する能力が大幅に低いことも明らかになりました。

本日より、Claude Sonnet 5 はすべてのプランで利用可能となりました。無料プランと Pro プランのデフォルトモデルとなり、Max、Team、Enterprise ユーザーにも提供されます。また、Claude Code および Claude Platform でも利用可能で、2026 年 8 月 31 日までは導入価格として、入力トークン 100 万あたり 2 ドル、出力トークン 100 万あたり 10 ドルで提供されます。それ以降は、入力トークン 100 万あたり 3 ドル、出力トークン 100 万あたり 15 ドルとなります。開発者は Claude API を通じて claude-sonnet-5 を利用できます。

Claude Sonnet 5 との連携

以下のチャートは、エージェント検索評価 BrowseComp およびコンピュータ操作評価 OSWorld-Verified において、異なる effort レベルで Sonnet 5 のパフォーマンスを Sonnet 4.6 および Opus 4.8 と比較したものです。Sonnet 5（オレンジ色の線）は Sonnet 4.6（グレーの線）に対する明確な改善であり、Opus 4.8（黄色の線）よりもはるかに広いコストパフォーマンスの選択肢を提供します。中程度の effort レベルでは大幅にコスト効率が向上しており、高 effort のパフォーマンスでは一部のタスクにおいて Opus 4.8 と同等の結果を達成できます。Sonnet 5 と Opus 4.8 の間では、ユーザーがコストとパフォーマンスの適切なバランスを見つけるために effort レベルを調整することが可能です。

早期アクセスパートナーからのフィードバックは一定しており、Sonnet 5 はその先行モデルよりもはるかにエージェント的であるという点で一致しています。テスターたちは、以前の Sonnet モデルでは完了できなかった複雑なタスクを Sonnet 5 が完遂すること、明示的な指示がなくても自身の出力を検証すること、そしてこれらすべてのエージェント的な作業を魅力的な価格帯で行うことを報告しました。

image

Claude Sonnet 5 は、多段階のソフトウェアエンジニアリング作業に対して、エージェントに強力な実行レイヤーを提供します。複雑な技術的コンテキストにおいて、持続的なコーディング、ツール使用、デバッグを良好に処理し、特に継続性と技術的根拠が重要となるワークフローで非常に有用です。

image

Claude Sonnet 5 に、セールスフォースのアカウントティアを更新し、エンタープライズ連絡先にローンチ発表を送信するという 2 つの部分からなるタスクを任せたところ、エンドツーエンドで完了させました。以前は途中で立ち止まることが多かったものです。日常の自動化においては、これは言うまでもなく最適な選択肢です。

image

Claude Sonnet 5 は、より少ないリソースでより多くの成果を上げます。出力品質は同じままに保ちつつ、到達までのステップ数を削減します。また、安全ではない要求に対しても、明確かつ一貫して拒否します。Lovable では、強力なツールを数百万人のビルダーの手に届けることを目指しています。「どのように構築するか」を知るモデルと同様に、「いつノーと言うか」を知っているモデルも極めて重要です。

image

Claude Sonnet 5 を、最も困難な実 Pull Request（プルリクエスト）の数十件に対して実行したところ、それぞれを単独でテスト済みかつ検証済みの結果へと導き通しました。これにより、エンジニアは判断、意思決定、そして最終的な承認に集中することが可能になりました。

image

私は Claude Sonnet 5 にバグの調査を依頼しました。指示なしで、再現テストを作成し、修正を実装した上で、その変更がないとバグが再発することを確認するために一時保存（stash）まで行いました。これらはすべて単一のパスで完了しました。

image

Claude Sonnet 5 を用いれば、エージェントは計画を維持し、当社の規約に従い、効率的なコストでクリーンな多段階の変更を実行できます。

image

Claude Sonnet 5 は、既存システム（ブラウンフィールド）のコードにおいて最もその真価を発揮します。競合状態や隠されたテスト、誰も手をつけたくない部分などです。このモデルは障害を実際の根本原因まで追跡し、症状に対する応急処置ではなく、永続的な修正を提供します。

image

Claude Sonnet 5 は、Eve の原告側業務におけるパレート最適のフロンティアに位置しています。特に法的調査と分析において最も明確な改善が見られ、そのコストパフォーマンス比は移行先を選定する決断を容易なものにしました。

image

ClickHouse エージェントは生データを探索し、その場で洞察を生成するため、新しいモデルをテストする際には「洞察を得るまでの時間」が重要です。Claude Sonnet 5 はより細かなステップで推論を行い、ユーザーに答えを明らかに速く提供します。この速度こそが、お客様が実感できる違いです。

image

Pace において、私たちのコンピュータ使用型エージェントは、運用チームがすでに利用しているシステム上で保険ワークフロー（申込受付、事故報告 (FNOL)、損失実績など）を実行しています。Claude Sonnet 5 は一貫して適切な行動を取り、かつ迅速に実行します。これが実際の保険業務で求められることです。

01 /

セーフティ評価

事前展開時のセーフティ評価では、Sonnet 5 が全体として Sonnet 4.6 よりも改善されていることが判明しました。エージェントとしての安全性においては、悪意のあるリクエストを拒否し、プロンプトインジェクション攻撃における乗っ取り試行に抵抗する能力が向上しています。また、幻覚 (hallucination) や迎合的行動 (sycophancy) の発生率は Sonnet 4.6 よりも低くなっています。私たちの自動化された行動監査では、悪用への協力や欺瞞など、多様な整合性のない行動をテストしますが、Sonnet 5 は全体としてより低いスコア（つまりより安全）を示しました。ただし、この評価においては、より能力の高い Opus 4.8 や Claude Mythos Preview と比較すると、やや高い割合で整合性のない行動が見られることも確認されました。

image*自動行動監査（多くの状況や文脈にわたる広範な望ましくない行動を検出するテスト）における整合性外れ行動の発生率。これは、Sonnet 5 システムカードのセクション 6.4 に各特定の行動に関する完全なリストと結果が記載されています。Sonnet 5 は Sonnet 4.6 よりも全体的に整合性外れ行動の発生率が低いものの、Mythos Preview や Opus 4.8 よりも高い発生率を示しています。

私たちは Sonnet 5 をサイバーセキュリティタスクのために意図的に訓練していません。Sonnet 5 は一部の日常的で有害ではないサイバータスクを遂行できますが、ソフトウェアの脆弱性攻撃（エクスプロイト）の開発など潜在的に危険なサイバースキルを検証する評価では、Opus 4.8 や Mythos 5 などのモデルと比較して著しく低いパフォーマンスを示します。Firefox ブラウザの脆弱性に対するエクスプロイトを開発する能力をテストしたある評価からのスコアを以下のチャートに示します。Sonnet 5 は完全な動作可能なエクスプロイトの開発には決して成功しませんでしたが、Sonnet 4.6 よりもわずかに高い*部分的*成功率を示しています。この後者の変化は、特定の訓練によるものではなく、一般知能の向上によるものである可能性が高いです。

image*Firefox 147 のソフトウェア脆弱性に対するエクスプロイトを開発するモデルの成功度を測定したスコア（本評価は Mozilla と共同で開発されました。すべての脆弱性は Firefox 148 でパッチ済みです）。各モデルについて、左側の棒グラフはモデルが（安全対策なしで）動作するエクスプロイトを開発した頻度を、右側の棒グラフは部分的な成功を収めた頻度を示しています。Sonnet モデルのいずれも動作するエクスプロイトの開発に成功せず（両者ともスコア 0.0%）、Sonnet 5 は Sonnet 4.6 よりわずかに高い部分的な成功率を示しました。両 Sonnet モデルは、Opus 4.8 や Mythos 5 に比べてサイバー能力が大幅に劣っています。詳細については Sonnet 5 システムカードのセクション 3.2.4 を参照してください。*

Sonnet 5 はこれらのタスクにおいて前作よりもやや強力であるため、サイバーセキュリティ対策をデフォルトで有効にしてリリースしました。この対策は、危険なサイバー利用をリアルタイムで検知してブロックするものであり、Claude Opus 4.7 および 4.8 に搭載されているものと同じです（Sonnet 5 の全体的なサイバーセキュリティリスクが低いと判断したため、Fable 5 で導入された対策よりも厳格さは低く設定されています。Fable 5 の対策はより広範なサイバーセキュリティタスクをブロックするためです）。1

Sonnet 5 に関する安全性および機能評価の包括的な評価結果は、Claude Sonnet 5 システムカードに報告されています。

利用可能状況と価格設定

Claude Sonnet 5 は今日よりどこでも利用可能です。2026 年 8 月 31 日までは、入力トークン 100 万あたり 2 ドル、出力トークン 100 万あたり 10 ドルの導入価格で提供されます。その後、標準価格として入力トークン 100 万あたり 3 ドル、出力トークン 100 万あたり 15 ドルに引き上げられます。2

Chat、Cowork、Claude Code、および Claude Platform 3 におけるレート制限を全体的に引き上げました。これは、より高い努力レベルでの利用に伴うトークン使用量の増加に対応するためです。ユーザーは自身のプロジェクトに適したレベルを選択できます。

Changelog

*2026 年 6 月 30 日付編集：この投稿の元バージョンでは、BrowseComp 評価に関するコストパフォーマンスチャートが含まれていましたが、これはアジェンシー検索（エージェント型検索）の評価に我々が採用している標準的な手法を反映していない、より単純な手法に基づくデータでした。その結果、Sonnet 5 の評価におけるパフォーマンスが過小評価されることになりました。

現在、チャートを Sonnet 5 システムカードで我々が使用し議論した手法（コンパクションとプログラムによるツール呼び出しを伴う 10M トークンの予算を使用）に一致するように更新しました。また、周囲のテキストも更新しています。

Footnotes

1 Sonnet 5 は、ネイティブ Claude プラットフォーム、AWS 上の Claude プラットフォーム、Microsoft Foundry 内の Claude（Azure および Anthropic でホスト）、およびまもなく Google Vertex 内の Claude で利用可能な、我々のサイバー検証プログラムの一部です。すでにサイバー検証プログラムに登録済みの組織は、再申請の必要なく Sonnet 5 でも同じアクセス権限を自動的に付与されます。全体として、ガードレール（安全装置）を低減する必要があるセキュリティ関連作業には Claude Opus 4.8 を推奨します。

Sonnet 5 は Sonnet 4.6 のアップグレード版ですが、モデルのテキスト処理方法を変更してパフォーマンスを向上させるために、更新されたトークナイザーを採用しています（これは Claude Opus 4.7 で導入したトークナイザーの変更と同様です）。その代償として、同じ入力でもコンテンツの種類に応じて約 1.0–1.35 倍のトークン数にマッピングされる可能性があります。導入時の価格は、Sonnet 5 への移行が概ねコスト中立になるように設定されています。

2026 年 4 月 26 日、私たちは Claude Platform のネイティブ環境において、すべての利用階層で Sonnet および Haiku のレート制限を引き上げ、3 つの階層（Start, Build, Scale）に簡素化しました。ご自身の階層と現在の制限については Claude Console で確認するか、ドキュメントを読んで詳細をご覧ください。

Humanity's Last Exam: Humanity's Last Exam の採点モデルを更新し、Sonnet 4.6 のスコアをツールなしで 34.6%、ツールありで 46.8% に更新しました。これが、Sonnet 4.6 のローンチブログで報告されたスコアと異なる理由です。
OSWorld-Verified: モデルの実際の世界でのパフォーマンスをより正確に反映するために、OSWorld-Verified の評価実行方法を変更し、Sonnet 4.6 のスコアを 78.5% に更新しました。これが、Sonnet 4.6 のローンチブログで報告されたスコアと異なる理由です。

Fable 5 の再展開

Fable 5 は 7 月 1 日にグローバルに再展開されます。また、Amazon、Microsoft、Google、およびその他の Glasswing パートナーと共に、脱獄の深刻度を評価するための業界全体向けのフレームワークを提案しています。

科学者向け AI ワークベンチ「Claude Science」が利用可能に

Claude Science は、研究者が最も頻繁に使用するツールやパッケージを統合し、監査可能な成果物を生成し、計算リソースへの柔軟なアクセスを提供するカスタマイズ可能なアプリです。

Claude Tag の紹介

Claude Tag は、チームが Claude と連携して作業するための新しい方法です。

原文を表示

Claude Sonnet 5 is built to be the most agentic Sonnet model yet. It can make plans, use tools like browsers and terminals, and run autonomously at a level that, just a few months ago, required larger and more expensive models.

For many developers, the agentic AI era began with Sonnet-class models: Claude Sonnet 3.5, 3.6, and 3.7 were the first models that showed impressive skills in coding and tool use. More recently, though, the clearest gains in agentic capabilities have been in our Opus-class models.

Sonnet 5 narrows the gap: its performance is close to that of Opus 4.8, but at lower prices. It’s a substantial improvement over its predecessor, Sonnet 4.6, on important aspects of agentic performance like reasoning, tool use, coding, and knowledge work:

Scores for Sonnet 5 on a variety of evaluations compared to those of Sonnet 4.6 and Opus 4.8 (a more generally capable model, for reference). The [Claude Sonnet 5 System Card reports a broader set of evaluations in detail.](https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-cdn.anthropic.com%2Fimages%2F4zrzovbb%2Fwebsite%2F9941d610909f28a504e16dd5af823df172ec6035-2600x1234.png&w=3840&q=75)

Our safety assessments found that Sonnet 5 shows an overall lower rate of undesirable behaviors than Sonnet 4.6, and is generally safer to use in agentic contexts. Evaluations also show that it has a much lower ability to perform cybersecurity tasks than our current Opus models.

From today, Claude Sonnet 5 is available across all plans: it is the default model for Free and Pro plans, and is available to Max, Team, and Enterprise users. It’s also available in Claude Code and on the Claude Platform, where it launches with introductory pricing of $2 per million input tokens and $10 per million output tokens through August 31, 2026, after which it will be priced at $3 per million input tokens and $15 per million output tokens. Developers can use claude-sonnet-5 via the Claude API.

Working with Claude Sonnet 5

The charts below compare the performance of Sonnet 5 with Sonnet 4.6 and Opus 4.8 at different effort levels on the agentic search evaluation BrowseComp and the computer use evaluation OSWorld-Verified. Sonnet 5 (orange line) is a strict improvement over Sonnet 4.6 (gray line) and covers a much wider range of cost-performance options than Opus 4.8 (yellow line). It provides substantially improved cost efficiency at medium effort; its higher-effort performance can match Opus 4.8 on some tasks. Between Sonnet 5 and Opus 4.8, users can adjust the effort level to find the right balance of cost and performance.

Feedback from our early access partners has been consistent: Sonnet 5 is much more agentic than its predecessors. Testers described how it finishes complex tasks where previous Sonnet models would stop short, how it checks its own output without explicitly being asked, and how it does all this agentic work at an attractive price point:

logo

Claude Sonnet 5 gives our agents a strong execution layer for multi-step software engineering work. It handles sustained coding, tool use, and debugging well across messy technical contexts, and has been especially useful for workflows where follow-through and technical grounding matter.

logo

We handed Claude Sonnet 5 a two-part job—update Salesforce account tiers, send a launch announcement to enterprise contacts—and it finished end to end. That used to stall halfway. For day-to-day automation, it’s a no-brainer.

logo

Claude Sonnet 5 gets more done with less. Same output quality, fewer steps to get there. It refuses unsafe requests cleanly and consistently, too. At Lovable, we’re putting powerful tools in the hands of millions of builders. A model that knows when to say no is just as important as one that knows how to build.

logo

We ran Claude Sonnet 5 against dozens of our most challenging real pull requests, and it carried each one through to a tested, verified result on its own — freeing our engineers to focus on the judgment, the decision, and the final sign-off.

logo

I asked Claude Sonnet 5 to investigate a bug. Unprompted, it wrote a reproducing test, implemented the fix, then stashed it to confirm the bug came back without the change. All in a single pass.

logo

With Claude Sonnet 5, agents stay on plan, follow our conventions, and ship clean multi-step changes, all at an efficient cost.

logo

Claude Sonnet 5 is at its best on brownfield code—race conditions, hidden tests, the parts nobody wants to touch. It traces a failure to its actual root cause and ships a durable fix instead of patching the symptom.

logo

Claude Sonnet 5 sits on the Pareto frontier for Eve’s plaintiff-law tasks. We see the clearest gains in legal research and analysis, at a price-to-performance ratio that made the choice to migrate easy.

logo

ClickHouse agents explore live data and produce insights on the fly, so time-to-insight matters when testing new models. Claude Sonnet 5 reasons in tighter steps and gets our users to answers noticeably faster. That speed is a difference our customers feel.

logo

At Pace, our computer-use agents run insurance workflows—submission intake, FNOL, loss runs—on the systems our operations teams already use. Claude Sonnet 5 consistently takes the right action and does it quickly, which is what real insurance work demands.

01 /

Safety evaluations

Our pre-deployment safety evaluations found that Sonnet 5 was overall an improvement on Sonnet 4.6. On agentic safety, the model is better at refusing malicious requests and resisting hijack attempts in prompt injection attacks. The model shows lower rates of hallucination and sycophancy than Sonnet 4.6. On our automated behavioral audit, which tests a wide range of misaligned behaviors such as cooperation with misuse and deception, Sonnet 5 scored lower (that is, safer) overall. However, it did show somewhat higher rates of misaligned behavior on this assessment compared to the more capable Opus 4.8 and Claude Mythos Preview.

Rates of misaligned behavior on our automated behavioral audit, which tests for a very wide range of undesirable behaviors across many situations and contexts (see Section 6.4 of the [Sonnet 5 System Card for a complete list and results for each specific behavior). Sonnet 5 shows an overall lower rate of misaligned behavior than Sonnet 4.6, though a higher rate than Mythos Preview and Opus 4.8.](https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-cdn.anthropic.com%2Fimages%2F4zrzovbb%2Fwebsite%2Fd018d76aa03c0ef18abc8a68de8f6fcd51c0a574-3840x2160.png&w=3840&q=75)

We did not deliberately train Sonnet 5 on cybersecurity tasks. It can perform some routine, non-harmful cyber tasks, but on evaluations testing potentially dangerous cyber skills, such as developing software exploits, it shows substantially poorer performance than models such as Opus 4.8 and Mythos 5. Scores from one evaluation, which tested models’ ability to develop exploits for vulnerabilities in the Firefox browser, are shown in the chart below. Sonnet 5 was never able to develop a full working exploit, but it does show a slightly higher rate of *partial* success than Sonnet 4.6. This latter change is likely due to improvements in general intelligence rather than specific training.

Scores measuring models’ success at developing exploits for software vulnerabilities in Firefox 147 (this evaluation was developed [in collaboration with Mozilla; all vulnerabilities have been patched in Firefox 148). For each model, the left-hand bar shows how often the model (without safeguards) developed a working exploit; the right-hand bar shows how often the model had partial success. Neither of the Sonnet models could successfully develop a working exploit (both scored 0.0%); Sonnet 5 showed a slightly higher partial success rate than Sonnet 4.6. Both Sonnet models have substantially poorer cyber capabilities than Opus 4.8 and Mythos 5. For full details, see Section 3.2.4 of the Sonnet 5 System Card.](https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-cdn.anthropic.com%2Fimages%2F4zrzovbb%2Fwebsite%2Fee9944c865937053bae293f057fffa478ee0f46b-3840x2160.png&w=3840&q=75)

Since Sonnet 5 is somewhat stronger than its predecessor on these tasks, we’ve launched it with cyber safeguards enabled by default. These safeguards—which detect and block dangerous cyber usage in real time—are the same as those present in Claude Opus 4.7 and 4.8 (because we judged that the overall level of cybersecurity risk from Sonnet 5 was low, the safeguards are less strict than those launched with Fable 5, which block a much wider range of cybersecurity tasks).1

Our full assessment of Sonnet 5 across many safety and capability evaluations is reported in the Claude Sonnet 5 System Card.

Availability and pricing

Claude Sonnet 5 is available everywhere today at an introductory price of $2 per million input tokens and $10 per million output tokens through August 31, 2026. It then moves to standard pricing at $3 per million input tokens and $15 per million output tokens.2 We’ve increased rate limits across Chat, Cowork, Claude Code, and the Claude Platform3 to accommodate the higher token usage of higher effort levels; users can select whichever level makes sense for their particular project.

Changelog

*Edit June 30, 2026: In the original version of this post, we included a cost-performance chart for the BrowseComp evaluation that was based on data from a simpler methodology that did not reflect the standard methodology we use for agentic search evaluations. This had the result of underestimating Sonnet 5's performance on the evaluation.*

*We have now updated the chart so that it matches the methodology that we used and discussed in the Sonnet 5 system card (which used a 10M token budget with compaction and programmatic tool calling). We have also updated the surrounding text.*

Footnotes

1 Sonnet 5 is part of our Cyber Verification Program, which is available today on the native Claude Platform, the Claude Platform on AWS, and Claude in Microsoft Foundry (hosted on Azure and Anthropic), and coming soon on Claude in Google Vertex. Organizations that are already enrolled in the Cyber Verification Program automatically have the same access on Sonnet 5, with no need to reapply. Overall, we recommend Claude Opus 4.8 for cybersecurity work that requires reduced guardrails.

2 Sonnet 5 is an upgrade to Sonnet 4.6, but it uses an updated tokenizer that changes how the model processes text to improve performance (this is similar to the tokenizer change we introduced with Claude Opus 4.7). The tradeoff is that the same input can map to more tokens: roughly 1.0–1.35× depending on the content type. The introductory pricing is set so that the transition to Sonnet 5 is roughly cost-neutral.

3 On April 26, 2026, we raised Sonnet and Haiku rate limits at every usage tier and simplified to three tiers (Start, Build, and Scale) on the native Claude Platform. You can view your tier and current limits in the Claude Console or read the documentation to learn more.

Humanity’s Last Exam: We updated the grader model for Humanity’s Last Exam and have updated the Sonnet 4.6 score to 34.6% (no tools) and 46.8% (with tools). This is the reason the score differs from that reported in the Sonnet 4.6 launch blog.
OSWorld-Verified: We made changes to how we run the OSWorld-Verified evaluation to more accurately reflect the model’s performance in the real world, and have updated the Sonnet 4.6 score to 78.5%. This is the reason the score differs from that reported in the Sonnet 4.6 launch blog.

Claude Sonnet 5（4 分間の読み物）

#LLM #Reasoning #Anthropic #Claude #Long-context

TL;DR

Anthropic が新モデル「Claude Sonnet 5」を発表し、推論能力や長文コンテキストの処理において大幅な性能向上を達成したと発表している。

AI深層分析2026年7月2日 00:04

重要/ 5段階

深度40%

キーポイント

推論能力の飛躍的向上

複雑な数学的問題やコード生成タスクにおける解決精度が前世代から劇的に改善され、専門的な推論能力が強化された。

超長文コンテキストの最適化

数十万トークン規模のコンテキストウィンドウを維持しながらも、情報の検索精度と整合性を大幅に向上させた。

コスト効率と速度のバランス

高性能化を実現しつつ、従来モデルと比較して推論コストと応答時間の最適化を図り、実用性の高いラインナップとなった。

影響分析・編集コメントを表示

影響分析

編集コメント

推論能力の向上は実務での信頼性を高める上で決定的な要素であり、Claude Sonnet 5 の登場は業界標準を再定義する重要な一歩です。

Claude Sonnet 5 との連携

image

Claude Sonnet 5 を用いれば、エージェントは計画を維持し、当社の規約に従い、効率的なコストでクリーンな多段階の変更を実行できます。

image

01 /

セーフティ評価

Sonnet 5 に関する安全性および機能評価の包括的な評価結果は、Claude Sonnet 5 システムカードに報告されています。

利用可能状況と価格設定

Changelog

Footnotes

Humanity's Last Exam: Humanity's Last Exam の採点モデルを更新し、Sonnet 4.6 のスコアをツールなしで 34.6%、ツールありで 46.8% に更新しました。これが、Sonnet 4.6 のローンチブログで報告されたスコアと異なる理由です。
OSWorld-Verified: モデルの実際の世界でのパフォーマンスをより正確に反映するために、OSWorld-Verified の評価実行方法を変更し、Sonnet 4.6 のスコアを 78.5% に更新しました。これが、Sonnet 4.6 のローンチブログで報告されたスコアと異なる理由です。

Fable 5 の再展開

科学者向け AI ワークベンチ「Claude Science」が利用可能に

Claude Tag の紹介

Claude Tag は、チームが Claude と連携して作業するための新しい方法です。

原文を表示

Working with Claude Sonnet 5

logo

Claude Sonnet 5 gives our agents a strong execution layer for multi-step software engineering work. It handles sustained coding, tool use, and debugging well across messy technical contexts, and has been especially useful for workflows where follow-through and technical grounding matter.

logo

We handed Claude Sonnet 5 a two-part job—update Salesforce account tiers, send a launch announcement to enterprise contacts—and it finished end to end. That used to stall halfway. For day-to-day automation, it’s a no-brainer.

logo

Claude Sonnet 5 gets more done with less. Same output quality, fewer steps to get there. It refuses unsafe requests cleanly and consistently, too. At Lovable, we’re putting powerful tools in the hands of millions of builders. A model that knows when to say no is just as important as one that knows how to build.

logo

We ran Claude Sonnet 5 against dozens of our most challenging real pull requests, and it carried each one through to a tested, verified result on its own — freeing our engineers to focus on the judgment, the decision, and the final sign-off.

logo

I asked Claude Sonnet 5 to investigate a bug. Unprompted, it wrote a reproducing test, implemented the fix, then stashed it to confirm the bug came back without the change. All in a single pass.

logo

With Claude Sonnet 5, agents stay on plan, follow our conventions, and ship clean multi-step changes, all at an efficient cost.

logo

Claude Sonnet 5 is at its best on brownfield code—race conditions, hidden tests, the parts nobody wants to touch. It traces a failure to its actual root cause and ships a durable fix instead of patching the symptom.

logo

Claude Sonnet 5 sits on the Pareto frontier for Eve’s plaintiff-law tasks. We see the clearest gains in legal research and analysis, at a price-to-performance ratio that made the choice to migrate easy.

logo

ClickHouse agents explore live data and produce insights on the fly, so time-to-insight matters when testing new models. Claude Sonnet 5 reasons in tighter steps and gets our users to answers noticeably faster. That speed is a difference our customers feel.

logo

At Pace, our computer-use agents run insurance workflows—submission intake, FNOL, loss runs—on the systems our operations teams already use. Claude Sonnet 5 consistently takes the right action and does it quickly, which is what real insurance work demands.

01 /

Safety evaluations

Our full assessment of Sonnet 5 across many safety and capability evaluations is reported in the Claude Sonnet 5 System Card.

Availability and pricing

Changelog

Footnotes

Humanity’s Last Exam: We updated the grader model for Humanity’s Last Exam and have updated the Sonnet 4.6 score to 34.6% (no tools) and 46.8% (with tools). This is the reason the score differs from that reported in the Sonnet 4.6 launch blog.
OSWorld-Verified: We made changes to how we run the OSWorld-Verified evaluation to more accurately reflect the model’s performance in the real world, and have updated the Sonnet 4.6 score to 78.5%. This is the reason the score differs from that reported in the Sonnet 4.6 launch blog.

キーポイント

影響分析

編集コメント

Claude Sonnet 5 との連携

セーフティ評価

利用可能状況と価格設定

Changelog

Footnotes

Related content

Fable 5 の再展開

科学者向け AI ワークベンチ「Claude Science」が利用可能に

Claude Tag の紹介

Working with Claude Sonnet 5

Safety evaluations

Availability and pricing

Changelog

Footnotes

Related content

Redeploying Fable 5

Claude Science, an AI workbench for scientists, is now available

Introducing Claude Tag

関連記事

キーポイント

影響分析

編集コメント

Claude Sonnet 5 との連携

セーフティ評価

利用可能状況と価格設定

Changelog

Footnotes

Related content

Fable 5 の再展開

科学者向け AI ワークベンチ「Claude Science」が利用可能に

Claude Tag の紹介

Working with Claude Sonnet 5

Safety evaluations

Availability and pricing

Changelog

Footnotes

Related content

Redeploying Fable 5

Claude Science, an AI workbench for scientists, is now available

Introducing Claude Tag

関連記事