Anthropic News·2026年2月5日 09:00·約7分

Claude Opus 4.6の紹介

#LLM #Long Context #Agentic AI #Code Generation #Anthropic

TL;DR

Anthropic は、100 万トークンのコンテキストウィンドウと強力なエージェント機能を備えた最新モデル「Claude Opus 4.6」を発表し、複雑なコーディングや経済的タスクにおいて競合他社を凌駕する性能を示した。

AI深層分析2026年4月29日 18:04

最重要/ 5段階

深度40%

キーポイント

100 万トークンコンテキストウィンドウの導入

Opus クラスモデル史上初めて、ベータ版として 1M トークンのコンテキストウィンドウをサポートし、大規模なコードベースや長文書の処理能力を飛躍的に向上させた。

エージェント機能とコーディング能力の強化

より慎重な計画立案が可能になり、長時間にわたる自律的なタスク実行、コードレビュー、デバッグ能力が大幅に改善され、Terminal-Bench 2.0 で最高スコアを記録した。

競合他社との明確な性能差

経済的価値のある知識労働タスクの評価「GDPval-AA」において、OpenAI の GPT-5.2 より約 144 エロポイント高いスコアを記録し、業界最高水準であることを証明した。

Office ツールとの統合と新機能

Excel での機能が大幅に強化され、PowerPoint への対応が研究プレビューとして開始されたほか、API では「アダプティブ・シンキング」やコンテキスト圧縮機能が追加された。

推論深度とコストのトレードオフ

より深い推論を行うことで複雑な問題での精度が高まる一方、単純なタスクでは過剰思考による遅延やコスト増が発生する可能性がある。

自律的なエージェント機能の強化

複雑なタスクを独立したサブタスクに分解し、並列実行とツール呼び出しを精密に行うことで、手動介入なしで長期的かつ多段階の作業を完遂できる。

コード理解とセキュリティ調査での卓越性

大規模なコードベースのナビゲーションや不慣れな環境のデバッグにおいて最先端の性能を発揮し、サイバーセキュリティ調査では競合モデルを圧倒する結果を出している。

影響分析・編集コメントを表示

影響分析

この発表は、LLM が単なる対話ツールから、大規模コードベースを自律的に処理・修正できる高度な「エージェント」としての役割を確立する重要な転換点です。特に 100 万トークンという圧倒的なコンテキスト容量と、競合他社を凌駕する実務能力は、企業の開発ワークフローや複雑な業務自動化におけるデファクトスタンダードとなる可能性を秘めています。

編集コメント

競合他社の最新モデルを明確に上回る数値データと、実務で即戦力となる 100 万トークンコンテキストの提供により、企業向け AI ツールの競争激化が決定的となりました。

Claude Opus 4.6の紹介

私たちは最も賢いモデルをアップグレードします。

新しいClaude Opus 4.6は、前身モデルのコーディングスキルを向上させています。より慎重に計画を立て、エージェントタスクをより長く持続させ、大規模なコードベースにおいてより確実に動作し、自身の誤りを捕捉するためのコードレビューおよびデバッグスキルが優れています。また、Opusクラスモデルとして初めて、Opus 4.6はベータ版で100万トークンのコンテキストウィンドウを備えています¹。

Opus 4.6は、その向上した能力を財務分析の実行、調査、文書・スプレッドシート・プレゼンテーションの利用および作成といった、さまざまな日常業務タスクに適用することもできます。Claudeが自律的にマルチタスクを実行できるCowork内では、Opus 4.6はこれらすべてのスキルをあなたに代わって働かせることができます。

このモデルの性能は、いくつかの評価において最先端です。例えば、エージェント的コーディング評価「Terminal-Bench 2.0」で最高スコアを達成し、複雑な学際的推論テスト「Humanity's Last Exam」において他のすべてのフロンティアモデルをリードしています。金融、法律などの分野における経済的に価値のある知識作業タスクの性能を評価する「GDPval-AA」²では、Opus 4.6は業界で次点のモデル（OpenAIのGPT-5.2）を約144エロポイント³、また前身モデル（Claude Opus 4.5）を190ポイント上回ります。さらに、オンライン上で見つけにくい情報を特定するモデルの能力を測定する「BrowseComp」においても、Opus 4.6は他のどのモデルよりも優れた性能を示しています。

詳細なシステムカードで示しているように、Opus 4.6はまた、安全性評価全体にわたる不整合行動の発生率が低く、業界の他のどのフロンティアモデルにも劣らない、あるいはそれ以上に優れた全体的な安全性プロファイルを示しています。

Claude Codeでは、タスクに共同で取り組むエージェントチームを編成できるようになりました。APIでは、Claudeがコンパクションを使用して自身のコンテキストを要約し、制限に達することなくより長時間実行されるタスクを実行できます。また、モデルが拡張思考をどれだけ使用すべきかについて文脈上の手がかりを捉えることができる適応的思考（adaptive thinking）と、開発者が知性、速度、コストをより細かく制御できる新しい努力制御（effort controls）を導入しています。

Claude in Excelに大幅なアップグレードを行い、Claude in PowerPointをリサーチプレビューとしてリリースします。これにより、Claudeは日常業務においてはるかに有能になります。

Claude Opus 4.6は本日、claude.ai、当社のAPI、およびすべての主要なクラウドプラットフォームで利用可能です。開発者の方は、claude-opus-4-6をご利用ください。

モデル、新しい製品アップデート、評価、および詳細な安全性テストについては、以下で詳しく説明します。

第一印象

私たちはClaudeでClaudeを構築しています。当社のエンジニアは毎日Claude Codeでコードを書き、すべての新しいモデルはまず私たち自身の作業でテストされます。Opus 4.6では、このモデルが指示されなくてもタスクの最も困難な部分により集中し、より単純な部分は素早く処理し、曖昧な問題をより優れた判断力で扱い、長時間のセッションにわたって生産性を維持することがわかりました。

Opus 4.6は、しばしばより深く考え、答えを確定する前に自身の推論をより慎重に見直します。これはより困難な問題ではより良い結果を生みますが、より単純な問題ではコストとレイテンシを増加させる可能性があります。特定のタスクでモデルが考えすぎていると感じる場合は、デフォルト設定（高）から努力（effort）を中程度に下げることをお勧めします。これは /effort で簡単に制御できます。

以下は、早期アクセスパートナーから寄せられたClaude Opus 4.6に関する感想の一部です。手取り足取りの指導を必要とせず自律的に作業する性質、以前のモデルが失敗した場面での成功、そしてチームの働き方への影響について言及されています。

Claude Opus 4.6はAnthropicがリリースした最強のモデルです。複雑な要求を受け取り、実際にそれを実行に移し、具体的なステップに分解し、実行し、タスクが野心的であっても洗練された成果を生み出します。Notionユーザーにとって、それは単なるツールというよりも、有能な共同作業者のように感じられます。

初期テストでは、Claude Opus 4.6が、開発者が日常的に直面する複雑で多段階のコーディング作業、特に計画とツール呼び出しを要求するエージェント的ワークフローにおいて、期待に応えることが示されています。これはフロンティアにおける長期視野のタスクの実現を始めています。

Claude Opus 4.6は、エージェント的計画において大きな飛躍です。複雑なタスクを独立したサブタスクに分解し、ツールとサブエージェントを並列実行し、ブロッカーを非常に正確に特定します。

Claude Opus 4.6は私たちがテストした中で最高のモデルです。その推論と計画能力は、私たちのAIチームメイトを駆動するのに卓越しています。また、素晴らしいコーディングモデルでもあります。大規模なコードベースをナビゲートし、行うべき適切な変更を特定する能力は最先端です。

Claude Opus 4.6は、これまでに見たことのないレベルで複雑な問題を推論します。他のモデルが見落とすエッジケースを考慮し、一貫してより優雅でよく考え抜かれた解決策に到達します。特にDevin ReviewにおけるOpus 4.6の性能には感銘を受けています。バグ発見率を向上させています。

Claude Opus 4.6はWindsurfにおいて、Opus 4.5よりも顕著に優れていると感じられます。特に、デバッグや見慣れないコードベースの理解のように注意深い探索を必要とするタスクにおいてそうです。Opus 4.6はより長く考えることに気づきましたが、より深い推論が必要な場合にはそれが報われます。

Claude Opus 4.6は、長文脈性能において意味のある飛躍を表しています。私たちのテストでは、はるかに大量の情報を、複雑な研究ワークフローの設計と展開方法を強化する一貫性のレベルで処理することを確認しました。この分野の進歩は、専門家が信頼できる真にエキスパート級のシステムを提供するための、より強力な構成要素を私たちに与えます。

40件のサイバーセキュリティ調査において、Claude Opus 4.6は、Claude 4.5モデル群とのブラインドランキングで40回中38回最高の結果を生み出しました。各モデルは、最大9つのサブエージェントと100回以上のツール呼び出しを伴う同じエージェント的ハーネス上でエンドツーエンドで実行されました。

Claude Opus 4.6は、当社の内部ベンチマークとテストにおいて、長時間実行タスクに関する新たなフロンティアです。また、コードレビューにおいても非常に効果的でした。

Claude Opus 4.6は、あらゆるClaudeモデルの中で最高のBigLaw Benchスコア90.2%を達成しました。40%の完全正解と8...

原文を表示

Introducing Claude Opus 4.6

We’re upgrading our smartest model.

The new Claude Opus 4.6 improves on its predecessor’s coding skills. It plans more carefully, sustains agentic tasks for longer, can operate more reliably in larger codebases, and has better code review and debugging skills to catch its own mistakes. And, in a first for our Opus-class models, Opus 4.6 features a 1M token context window in beta1.

Opus 4.6 can also apply its improved abilities to a range of everyday work tasks: running financial analyses, doing research, and using and creating documents, spreadsheets, and presentations. Within Cowork, where Claude can multitask autonomously, Opus 4.6 can put all these skills to work on your behalf.

The model’s performance is state-of-the-art on several evaluations. For example, it achieves the highest score on the agentic coding evaluation Terminal-Bench 2.0 and leads all other frontier models on Humanity’s Last Exam, a complex multidisciplinary reasoning test. On GDPval-AA—an evaluation of performance on economically valuable knowledge work tasks in finance, legal, and other domains2—Opus 4.6 outperforms the industry’s next-best model (OpenAI’s GPT-5.2) by around 144 Elo points,3 and its own predecessor (Claude Opus 4.5) by 190 points. Opus 4.6 also performs better than any other model on BrowseComp, which measures a model’s ability to locate hard-to-find information online.

As we show in our extensive system card, Opus 4.6 also shows an overall safety profile as good as, or better than, any other frontier model in the industry, with low rates of misaligned behavior across safety evaluations.

In Claude Code, you can now assemble agent teams to work on tasks together. On the API, Claude can use compaction to summarize its own context and perform longer-running tasks without bumping up against limits. We’re also introducing adaptive thinking, where the model can pick up on contextual clues about how much to use its extended thinking, and new effort controls to give developers more control over intelligence, speed, and cost.

We’ve made substantial upgrades to Claude in Excel, and we’re releasing Claude in PowerPoint in a research preview. This makes Claude much more capable for everyday work.

Claude Opus 4.6 is available today on claude.ai, our API, and all major cloud platforms. If you’re a developer, use claude-opus-4-6

We cover the model, our new product updates, our evaluations, and our extensive safety testing in depth below.

First impressions

We build Claude with Claude. Our engineers write code with Claude Code every day, and every new model first gets tested on our own work. With Opus 4.6, we’ve found that the model brings more focus to the most challenging parts of a task without being told to, moves quickly through the more straightforward parts, handles ambiguous problems with better judgment, and stays productive over longer sessions.

Opus 4.6 often thinks more deeply and more carefully revisits its reasoning before settling on an answer. This produces better results on harder problems, but can add cost and latency on simpler ones. If you’re finding that the model is overthinking on a given task, we recommend dialing effort down from its default setting (high) to medium. You can control this easily with the /effort

Here are some of the things our Early Access partners told us about Claude Opus 4.6, including its propensity to work autonomously without hand-holding, its success where previous models failed, and its effect on how teams work:

Claude Opus 4.6 is the strongest model Anthropic has shipped. It takes complicated requests and actually follows through, breaking them into concrete steps, executing, and producing polished work even when the task is ambitious. For Notion users, it feels less like a tool and more like a capable collaborator.

Early testing shows Claude Opus 4.6 delivering on the complex, multi-step coding work developers face every day—especially agentic workflows that demand planning and tool calling. This starts unlocking long-horizon tasks at the frontier.

Claude Opus 4.6 is a huge leap for agentic planning. It breaks complex tasks into independent subtasks, runs tools and subagents in parallel, and identifies blockers with real precision.

Claude Opus 4.6 is the best model we've tested yet. Its reasoning and planning capabilities have been exceptional at powering our AI Teammates. It's also a fantastic coding model – its ability to navigate a large codebase and identify the right changes to make is state of the art.

Claude Opus 4.6 reasons through complex problems at a level we haven't seen before. It considers edge cases that other models miss and consistently lands on more elegant, well-considered solutions. We're particularly impressed with Opus 4.6 in Devin Review, where it's increased our bug catching rates.

Claude Opus 4.6 feels noticeably better than Opus 4.5 in Windsurf, especially on tasks that require careful exploration like debugging and understanding unfamiliar codebases. We’ve noticed Opus 4.6 thinks longer, which pays off when deeper reasoning is needed.

Claude Opus 4.6 represents a meaningful leap in long-context performance. In our testing, we saw it handle much larger bodies of information with a level of consistency that strengthens how we design and deploy complex research workflows. Progress in this area gives us more powerful building blocks to deliver truly expert-grade systems professionals can trust.

Across 40 cybersecurity investigations, Claude Opus 4.6 produced the best results 38 of 40 times in a blind ranking against Claude 4.5 models. Each model ran end to end on the same agentic harness with up to 9 subagents and 100+ tool calls.

Claude Opus 4.6 is the new frontier on long-running tasks from our internal benchmarks and testing. It's also been highly effective at reviewing code.

Claude Opus 4.6 achieved the highest BigLaw Bench score of any Claude model at 90.2%. With 40% perfect scores and 84% above 0.8, it’s remarkably capable for legal reasoning.

Claude Opus 4.6 autonomously closed 13 issues and assigned 12 issues to the right team members in a single day, managing a ~50-person organization across 6 repositories. It handled both product and organizational decisions while synthesizing context across multiple domains, and it knew when to escalate to a human.

Claude Opus 4.6 is an uplift in design quality. It works beautifully with our design systems and it’s more autonomous, which is core to Lovable’s values. People should be creating things that matter, not micromanaging AI.

Claude Opus 4.6 excels in high-reasoning tasks like multi-source analysis across legal, financial, and technical content. Box’s eval showed a 10% lift in performance, reaching 68% vs. a 58% baseline, and near-perfect scores in technical domains.

Claude Opus 4.6 generates complex, interactive apps and prototypes in Figma Make with an impressive creative range. The model translates detailed designs and multi-layered tasks into code on the first try, making it a powerful starting point for teams to explore and build ideas.

Claude Opus 4.6 is the best Anthropic model we’ve tested. It understands intent with minimal prompting and went above and beyond, exploring and creating details I didn’t even know I wanted until I saw them. It felt like I was working with the model, not waiting on it.

Both hands-on testing and evals show Claude Opus 4.6 is a meaningful improvement for design systems and large codebases, use cases that drive enormous enterprise value. It also one-shotted a fully functional physics engine, handling a large multi-scope task in a single pass.

Claude Opus 4.6 is the biggest leap I’ve seen in months. I’m more comfortable giving it a sequence of tasks across the stack and letting it run. It’s smart enough to use subagents for the individual pieces.

Claude Opus 4.6 handled a multi-million-line codebase migration like a senior engineer. It planned up front, adapted its strategy as it learned, and finished in half the time.

We only ship models in v0 when developers will genuinely feel the difference. Claude Opus 4.6 passed that bar with ease. Its frontier-level reasoning, especially with edge cases, helps v0 to deliver on our number-one aim: to let anyone elevate their ideas from prototype to production.

The performance jump with Claude Opus 4.6 feels almost unbelievable. Real-world tasks that were challenging for Opus [4.5] suddenly became easy. This feels like a watershed moment for spreadsheet agents on Shortcut.

Evaluating Claude Opus 4.6

Across agentic coding, computer use, tool use, search, and finance, Opus 4.6 is an industry-leading model, often by a wide margin. The table below shows how Claude Opus 4.6 compares to our previous models and to other industry models on a variety of benchmarks.

Opus 4.6 is much better at retrieving relevant information from large sets of documents. This extends to long-context tasks, where it holds and tracks information over hundreds of thousands of tokens with less drift, and picks up buried details that even Opus 4.5 would miss.

A common complaint about AI models is “context rot,” where performance degrades as conversations exceed a certain number of tokens. Opus 4.6 performs markedly better than its predecessors: on the 8-needle 1M variant of MRCR v2—a needle-in-a-haystack benchmark that tests a model’s ability to retrieve information “hidden” in vast amounts of text—Opus 4.6 scores 76%, whereas Sonnet 4.5 scores just 18.5%. This is a qualitative shift in how much context a model can actually use while maintaining peak performance.

All in all, Opus 4.6 is better at finding information across long contexts, better at reasoning after absorbing that information, and has substantially better expert-level reasoning abilities in general.

Finally, the charts below show how Claude Opus 4.6 performs on a variety of benchmarks that assess its software engineering skills, multilingual coding ability, long-term coherence, cybersecurity capabilities, and its life sciences knowledge.

A step forward on safety

These intelligence gains do not come at the cost of safety. On our automated behavioral audit, Opus 4.6 showed a low rate of misaligned behaviors such as deception, sycophancy, encouragement of user delusions, and cooperation with misuse. Overall, it is just as well-aligned as its predecessor, Claude Opus 4.5, which was our most-aligned frontier model to date. Opus 4.6 also shows the lowest rate of over-refusals—where the model fails to answer benign queries—of any recent Claude model.

For Claude Opus 4.6, we ran the most comprehensive set of safety evaluations of any model, applying many different tests for the first time and upgrading several that we’ve used before. We included new evaluations for user wellbeing, more complex tests of the model’s ability to refuse potentially dangerous requests, and updated evaluations of the model’s ability to surreptitiously perform harmful actions. We also experimented with new methods from interpretability, the science of the inner workings of AI models, to begin to understand why the model behaves in certain ways—and, ultimately, to catch problems that standard testing might miss.

A detailed description of all capability and safety evaluations is available in the Claude Opus 4.6 system card.

We’ve also applied new safeguards in areas where Opus 4.6 shows particular strengths that might be put to dangerous as well as beneficial uses. In particular, since the model shows enhanced cybersecurity abilities, we’ve developed six new cybersecurity probes—methods of detecting harmful responses—to help us track different forms of potential misuse.

We’re also accelerating the cyberdefensive uses of the model, using it to help find and patch vulnerabilities in open-source software (as we describe in our new cybersecurity blog post). We think it’s critical that cyberdefenders use AI models like Claude to help level the playing field. Cybersecurity moves fast, and we’ll be adjusting and updating our safeguards as we learn more about potential threats; in the near future, we may institute real-time intervention to block abuse.

Product and API updates

We’ve made substantial updates across Claude, Claude Code, and the Claude Developer Platform to let Opus 4.6 perform at its best.

Claude Developer Platform

On the API, we’re giving developers better control over model effort and more flexibility for long-running agents. To do so, we’re introducing the following features:

Adaptive thinking. Previously, developers only had a binary choice between enabling or disabling extended thinking. Now, with adaptive thinking, Claude can decide when deeper reasoning would be helpful. At the default effort level (high), the model uses extended thinking when useful, but developers can adjust the effort level to make it more or less selective.

Effort. There are now four effort levels to choose from: low, medium, high (default), and max. We encourage developers to experiment with different options to find what works best.

Context compaction (beta). Long-running conversations and agentic tasks often hit the context window. Context compaction automatically summarizes and replaces older context when the conversation approaches a configurable threshold, letting Claude perform longer tasks without hitting limits.

1M token context (beta). Opus 4.6 is our first Opus-class model with 1M token context. Premium pricing applies for prompts exceeding 200k tokens ($10/$37.50 per million input/output tokens), available only on the Claude Developer Platform.

128k output tokens. Opus 4.6 supports outputs of up to 128k tokens, which lets Claude complete larger-output tasks without breaking them into multiple requests.

US-only inference. For workloads that need to run in the United States, US-only inference is available at 1.1× token pricing.

Product updates

Across Claude and Claude Code, we’ve added features that allow knowledge workers and developers to tackle harder tasks with more of the tools they use every day.

We’ve introduced agent teams in Claude Code as a research preview. You can now spin up multiple agents that work in parallel as a team and coordinate autonomously—best for tasks that split into independent, read-heavy work like codebase reviews. You can take over any subagent directly using Shift+Up/Down or tmux.

Claude now also works better with the office tools you already use. Claude in Excel handles long-running and harder tasks with improved performance, and can plan before acting, ingest unstructured data and infer the right structure without guidance, and handle multi-step changes in one pass. Pair that with Claude in PowerPoint, and you can first process and structure your data in Excel, then bring it to life visually in PowerPoint. Claude reads your layouts, fonts, and slide masters to stay on brand, whether you’re building from a template or generating a full deck from a description. Claude in PowerPoint is now available in research preview for Max, Team, and Enterprise plans.

[1] The 1M token context window is currently available in beta on the Claude Developer Platform only.

[2] Run independently by Artificial Analysis. See here for full methodological details.

[3] This translates into Claude Opus 4.6 obtaining a higher score than GPT-5.2 on this eval approximately 70% of the time (where 50% of the time would have implied parity in the scores).

For GPT-5.2 and Gemini 3 Pro models, we compared the best reported model version in the charts and table.

Terminal-Bench 2.0: We report both scores reproduced on our infrastructure and published scores from other labs. All runs used the Terminus-2 harness, except for OpenAI’s Codex CLI. All experiments used 1× guaranteed / 3× ceiling resource allocation and 5–15 samples per task across staggered batches. See system card for details.

Humanity’s Last Exam: Claude models run “with tools” were run with web search, web fetch, code execution, programmatic tool calling, context compaction triggered at 50k tokens up to 3M total tokens, max reasoning effort, and adaptive thinking enabled. A domain blocklist was used to decontaminate eval results. See system card for more details.

SWE-bench Verified: Our score was averaged over 25 trials. With a prompt modification, we saw a score of 81.42%.

MCP Atlas: Claude Opus 4.6 was run with max effort. When run at high effort, it reached an industry-leading score of 62.7%.

BrowseComp: Claude models were run with web search, web fetch, programmatic tool calling, context compaction triggered at 50k tokens up to 10M total tokens, max reasoning effort, and no thinking enabled. Adding a multi-agent harness increased scores to 86.8%. See system card for more details.

ARC AGI 2: Claude Opus 4.6 was run with max effort and a 120k thinking budget score.

CyberGym: Claude models were run on no thinking, default effort, temperature, and top_p

OpenRCA: For each failure case in OpenRCA, Claude receives 1 point if all generated root-cause elements match the ground-truth ones, and 0 points if any mismatch is identified. The overall accuracy is the average score across all failure cases. The benchmark was run on the benchmark author’s harness, graded using their official methodology, and has been submitted for official verification.

[Feb 23, 2026] Updated reported score for Opus 4.6 for HLE with tools (53.1% to 53.0%). The update was caused by running an improved cheating detection pipeline which flagged 3 additional instances of cheating that our original pipeline had missed.