
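OpenAI News · April 23, 2026, 20:00 · about 29 min

Introducing GPT-5.5

#LLM #OpenAI #GPT-5.5 #coding-assistance #data-analysis

TL;DR

OpenAI has officially announced its latest model, GPT-5.5. It claims faster processing and stronger general capability, and positions the model for complex tasks such as coding, academic research, and data analysis that spans multiple tools.

AI deep analysis · April 24, 2026, 04:23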

Attention: / 5

Depth: 40%

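Key points

Official announcement of a new model

OpenAI announced GPT-5.5, which it says surpasses its existing series, and explicitly claims a higher level of intelligence.

Performance optimization

OpenAI touts faster responses and stronger task handling, which should improve efficiency in production use.

Built for specialized work

The model is described as built for complex workflows such as coding assistance, academic research, and data analysis that coordinates external tools.

Key quote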

Introducing GPT-5.5, our smartest model yet—faster, more capable, and built for complex tasks like coding, research, and data analysis across tools.


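Impact analysis

The announcement signals that competition among frontier models is intensifying further. Because the headline announcement on its own gives no detailed technical specifications or benchmark data, its direct impact on day-to-day practice is limited for now. Across the industry, expectations are rising for next-generation model baselines and a stronger developer ecosystem, but real performance comparisons and adoption decisions should wait until the official details have been reviewed.

Editorial comment

Announcement copy that amounts to a tagline is not enough to judge practical applicability. We consider this a stage at which to wait for a detailed technical report and published benchmarks.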

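We are releasing GPT-5.5, our smartest and most intuitive-to-use model yet, and the next step toward a new way of getting work done on a computer.

GPT-5.5 understands what you are trying to do faster and can carry more of the work itself. It excels at writing and debugging code, researching online, analyzing data, creating documents and spreadsheets, operating software, and moving across tools until a task is finished. Instead of carefully managing every step, you can give GPT-5.5 a messy, multi-part task and trust it to plan, use tools, check its work, navigate ambiguity, and keep going.

The gains are especially strong in agentic coding, computer use, knowledge work, and early scientific research. These are areas where progress depends on reasoning across context and taking action over time. GPT-5.5 delivers this step up in intelligence without compromising on speed. Larger, more capable models are often slower to serve, but GPT-5.5 matches GPT-5.4's per-token latency in real-world serving while performing at a much higher level of intelligence. It also uses significantly fewer tokens to complete the same Codex tasks, making it more efficient as well as more capable.

We are releasing GPT-5.5 with our strongest set of safeguards to date, designed to reduce misuse while preserving access for beneficial work. We evaluated this model across our full suite of safety and preparedness frameworks, worked with internal and external red teamers, added targeted testing for advanced cybersecurity and biology capabilities, and collected feedback on real use cases from nearly 200 trusted early-access partners before release.

Today, GPT-5.5 is rolling out to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex, and GPT-5.5 Pro is rolling out to Pro, Business, and Enterprise users in ChatGPT. API deployments require different safeguards, and we are working closely with partners and customers on the safety and security requirements for serving the model at scale. We will bring GPT-5.5 and GPT-5.5 Pro to the API very soon.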

GPT-5.5

GPT-5.4

GPT-5.5 Pro

GPT-5.4 Pro

Claude Opus 4.7

Gemini 3.1 Pro

Terminal-Bench 2.0

82.7%

75.1%

69.4%

68.5%

Expert-SWE (Internal)

73.1%

68.5%

GDPval (wins or ties)

84.9%

83.0%

82.3%

82.0%

80.3%

67.3%

OSWorld-Verified

78.7%

75.0%

78.0%

Toolathlon

55.6%

54.6%

48.8%

BrowseComp

84.4%

82.7%

90.1%

89.3%

79.3%

85.9%

FrontierMath Tier 1–3

51.7%

47.6%

52.4%

50.0%

43.8%

36.9%

FrontierMath Tier 4

35.4%

27.1%

39.6%

38.0%

22.9%

16.7%

CyberGym

81.8%

79.0%

73.1%

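OpenAI is building the global infrastructure for agentic AI so that people and businesses around the world can get work done with AI. Over the past year, we have seen AI dramatically accelerate software engineering. With GPT-5.5 in Codex and ChatGPT, that same transformation is beginning to extend into scientific research and the broader work people do on computers.

Across these domains, GPT-5.5 is not just more intelligent. It also works through problems more efficiently, often reaching higher-quality outputs with fewer tokens and fewer retries. On Artificial Analysis's Coding Index, GPT-5.5 delivers state-of-the-art intelligence at half the cost of competing frontier coding models.

GPT-5.5 is our strongest agentic coding model to date. On Terminal-Bench 2.0, which tests complex command-line workflows requiring planning, iteration, and tool coordination, it achieves a state-of-the-art accuracy of 82.7%. On SWE-Bench Pro, which evaluates real-world GitHub issue resolution, it reaches 58.6%, solving more tasks end-to-end in a single pass than previous models. On Expert-SWE, our internal frontier eval for long-horizon coding tasks with a median estimated human completion time of 20 hours, GPT-5.5 also outperforms GPT-5.4.

Across all three evals, GPT-5.5 improves on GPT-5.4's scores while using fewer tokens.

The model's coding strengths show up especially clearly in Codex, where it can take on engineering work ranging from implementation and refactors to debugging, testing, and validation. Early testing suggests GPT-5.5 is better at the behaviors real engineering work depends on: holding context across large systems, reasoning through ambiguous failures, checking assumptions with tools, and carrying changes through the surrounding codebase.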

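The rendered trajectory uses NASA/JPL Horizons vector data for Orion, the Moon, and the Sun, with display scaling applied for readability.

Prompt: [attached image] Implement this as a new app using WebGL and Vite, using real data from the Artemis II mission. Make sure to test the app thoroughly until it is fully functional and looks like the app in the picture. Pay close attention to the rendering of the planets and fly paths. I want to be able to interact with the 3D rendering. Ensure it has realistic orbital mechanics.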

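Beyond benchmarks, early testers said GPT-5.5 shows a stronger ability to understand the shape of a system: why something is failing, where the fix needs to land, and what else in the codebase would be affected.

Dan Shipper, Founder and CEO of Every, described GPT-5.5 as "the first coding model I've used that has serious conceptual clarity."

After launching an app, he spent days debugging a post-launch issue before bringing in one of his best engineers to rewrite part of the system. To test GPT-5.5, he effectively rewound the clock: could the model look at the broken state and produce the same kind of rewrite the engineer eventually decided on? GPT-5.4 could not. GPT-5.5 could.

Pietro Schirano, CEO of MagicPath, saw a similar step change when GPT-5.5 merged a branch with hundreds of frontend and refactor changes into a main branch that had also changed substantially. It resolved the work in one shot in about 20 minutes.

Senior engineers who tested the model said GPT-5.5 was noticeably stronger than GPT-5.4 and Claude Opus 4.7 at reasoning and autonomy, catching issues in advance and predicting testing and review needs without explicit prompting. In one case, an engineer asked it to re-architect the comment system in a collaborative markdown editor and came back to a 12-diff stack that was nearly complete. Others said they needed surprisingly little implementation correction and felt more confident in GPT-5.5's plans than in GPT-5.4's.

One engineer at NVIDIA who had early access to the model went as far as to say: "Losing access to GPT-5.5 feels like I've had a limb amputated."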

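"GPT-5.5 is noticeably smarter and more persistent than GPT-5.4, with stronger coding performance and more reliable tool use. It stays on task for significantly longer without stopping early, which matters most for the complex, long-running work our users delegate to Cursor."

— Michael Truell, Co-founder & CEO at Cursor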

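The same strengths that make GPT-5.5 good at coding also make it powerful for everyday work on a computer. Because the model is better at understanding intent, it can move more naturally through the full loop of knowledge work: finding information, understanding what matters, using tools, checking the output, and turning raw material into something useful.

In Codex, GPT-5.5 is better than GPT-5.4 at generating documents, spreadsheets, and slide presentations. Alpha testers said it outperformed past models on work like operational research, spreadsheet modeling, and turning messy business inputs into plans. Combined with Codex's computer use skills, GPT-5.5 brings us closer to the feeling that the model can actually use the computer with you: seeing what is on screen, clicking, typing, navigating interfaces, and moving across tools with precision.

Teams at OpenAI are already using these strengths in real workflows. Today, more than 85% of the company uses Codex every week across functions including software engineering, finance, communications, marketing, data science, and product management. In Comms, the team used GPT-5.5 in Codex to analyze six months of speaking request data, build a scoring and risk framework, and validate an automated Slack agent so that low-risk requests could be handled automatically while higher-risk requests still route to human review. In Finance, the team used Codex to review 24,771 K-1 tax forms totaling 71,637 pages, using a workflow that excluded personal information and finishing the task two weeks faster than the prior year. On the Go-to-Market team, an employee automated the generation of weekly business reports, saving 5 to 10 hours a week.

In ChatGPT, GPT-5.5 Thinking provides faster help on harder problems, with smarter and more concise answers that help you move through complex work more efficiently. It excels at professional work like coding, research, information synthesis and analysis, and document-heavy tasks, especially when using plugins.

With GPT-5.5 Pro, early testers report a significant step up in both the difficulty and the quality of work ChatGPT can take on, with latency improvements that make it much more practical for demanding tasks. Compared to GPT-5.4 Pro, testers found GPT-5.5 Pro's responses significantly more comprehensive, well-structured, accurate, relevant, and useful, with especially strong performance in business, legal, education, and data science.

GPT-5.5 reaches state-of-the-art performance across multiple benchmarks that reflect this kind of work. On GDPval, which tests agents' abilities to produce well-specified knowledge work across 44 occupations, GPT-5.5 scores 84.9%. On OSWorld-Verified, which measures whether a model can operate real computer environments on its own, it reaches 78.7%. On Tau2-bench Telecom, which tests complex customer-service workflows, it reaches 98.0% without prompt tuning. GPT-5.5 also performs strongly on other knowledge work benchmarks: 60.0% on FinanceAgent, 88.5% on internal investment-banking modeling tasks, and 54.1% on OfficeQA Pro.

Tau2-bench Telecom was run without prompt tuning (and with GPT-4.1 as the user model). GPT-5.5 understands task intent more accurately and is more token-efficient than previous-generation models.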

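"GPT-5.5 delivers the sustained performance required for execution-heavy work. Built and served on NVIDIA GB200 NVL72 systems, the model enables our teams to ship end-to-end features from natural language prompts, cut debug time from days to hours, and turn weeks of experimentation into overnight progress in complex codebases. It's more than faster coding. It's a new way of working that helps people operate at a fundamentally different speed."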

— Justin Boitano, VP of Enterprise AI at NVIDIA

GPT-5.5 also shows gains on scientific and technical research workflows, which require more than answering a hard question. Researchers need to explore an idea, gather evidence, test assumptions, interpret results, and decide what to try next. GPT-5.5 is better at persisting across that loop than other models.

Notably, GPT-5.5 shows a clear improvement over GPT-5.4 on GeneBench, a new eval focused on multi-stage scientific data analysis in genetics and quantitative biology. These problems require models to reason about potentially ambiguous or error-prone data with minimal supervisory guidance, address realistic obstacles such as hidden confounders or QC (quality control) failures, and correctly implement and interpret modern statistical methods. The model's performance is striking given that tasks here often correspond to multi-day projects for scientific experts.

Similarly, on BixBench, a benchmark designed around real-world bioinformatics and data analysis, GPT-5.5 achieved leading performance among models with published scores. The model's scientific capabilities are now strong enough to meaningfully accelerate progress at the frontiers of biomedical research as a bona fide co-scientist.

In another example, an internal version of GPT-5.5 with a custom harness helped discover a new proof about Ramsey numbers, one of the central objects in combinatorics. Combinatorics studies how discrete objects such as graphs, networks, sets, and patterns fit together. Ramsey numbers ask, roughly, how large a network has to be before some kind of order is guaranteed to appear. Results in this area are rare and often technically difficult. Here, GPT-5.5 found a proof of a longstanding asymptotic fact about off-diagonal Ramsey numbers, which was later verified in Lean. The result is a concrete example of GPT-5.5 contributing not just code or explanation but a surprising and useful mathematical argument in a core research area.
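To make "how large a network has to be before some kind of order is guaranteed to appear" concrete, here is a minimal sketch of the smallest classical case, R(3,3) = 6. Any two-coloring of the edges of a complete graph on 6 vertices contains a single-colored triangle, while 5 vertices are not enough. This brute-force check only illustrates the definition. It is unrelated to the asymptotic off-diagonal result or to the Lean proof mentioned above, and the function names are ours.

```python
from itertools import combinations, product


def has_mono_triangle(n, coloring):
    """Return True if some triangle has all three edges in one color.

    `coloring` maps each edge (i, j) with i < j to a color, 0 or 1.
    """
    for a, b, c in combinations(range(n), 3):
        if coloring[(a, b)] == coloring[(a, c)] == coloring[(b, c)]:
            return True
    return False


def every_coloring_has_mono_triangle(n):
    """Exhaustively check all 2-colorings of the complete graph K_n."""
    edges = list(combinations(range(n), 2))
    for colors in product((0, 1), repeat=len(edges)):
        if not has_mono_triangle(n, dict(zip(edges, colors))):
            return False  # found a coloring that avoids order
    return True


if __name__ == "__main__":
    print(every_coloring_has_mono_triangle(5))  # False: K_5 can avoid it
    print(every_coloring_has_mono_triangle(6))  # True: K_6 cannot
```

The search covers 2^15 colorings for 6 vertices, which is why exhaustive checking stops being feasible almost immediately and results in this area rely on proof rather than enumeration.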

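Early testers used GPT-5.5 Pro in ChatGPT less like a one-shot answer engine and more like a research partner: critiquing manuscripts over multiple passes, stress-testing technical arguments, proposing analyses, and working with code, notes, and PDF context. The common thread is that GPT-5.5 is better at helping researchers move from question to experiment to output.

Derya Unutmaz, an immunology professor and researcher at the Jackson Laboratory for Genomic Medicine, used GPT-5.5 Pro to analyze a gene-expression dataset with 62 samples and nearly 28,000 genes. It produced a detailed research report that not only summarized the findings but also surfaced key questions and insights, work he said would have taken his team months.

Bartosz Naskręcki, assistant professor of mathematics at Adam Mickiewicz University in Poznań, Poland, used GPT-5.5 in Codex to build an algebraic-geometry app from a single prompt in 11 minutes. The app visualizes the intersection of quadratic surfaces and converts the resulting curve into a Weierstrass model.

He later extended the app with more stable singularity visualization and exact coefficients that can be reused in further work. For him, the bigger shift is that Codex can now help implement custom mathematical visualization and computer-algebra workflows that previously required dedicated tools. Together, these examples show GPT-5.5 turning expert intent into working research tools and analyses.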
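For readers unfamiliar with the target of that conversion, the sketch below shows what a short Weierstrass model is: a curve y^2 = x^3 + a*x + b, with the standard discriminant and j-invariant formulas. It does not perform the Riemann-Roch conversion from an intersection of two quadrics, and it is not taken from the app described above. The function names are illustrative, and exact rational arithmetic is used to echo the "exact coefficients" point.

```python
from fractions import Fraction


def discriminant(a, b):
    """Discriminant of y^2 = x^3 + a*x + b; zero means the curve is singular."""
    return -16 * (4 * a**3 + 27 * b**2)


def j_invariant(a, b):
    """j-invariant 1728 * 4a^3 / (4a^3 + 27b^2), computed exactly."""
    d = 4 * a**3 + 27 * b**2
    if d == 0:
        raise ValueError("singular curve: 4a^3 + 27b^2 = 0")
    return Fraction(1728 * 4 * a**3, d)


def on_curve(x, y, a, b):
    """Check whether the point (x, y) satisfies y^2 = x^3 + a*x + b."""
    return y * y == x**3 + a * x + b


if __name__ == "__main__":
    print(discriminant(-1, 0), j_invariant(-1, 0))  # y^2 = x^3 - x
    print(discriminant(0, 1), j_invariant(0, 1))    # y^2 = x^3 + 1
    print(on_curve(2, 3, 0, 1))                     # 9 == 8 + 1
```

A nonzero discriminant is what makes the curve an elliptic curve, and the j-invariant identifies it up to isomorphism over an algebraically closed field. Exact values of this kind are the sort a tool can expose for reuse in further work.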

Image credit: Bartosz Naskręcki

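Prompt: # Algebraic geometry surface intersection

Make an app which draws two quadratic surfaces and colors the intersection curve in red. Use the computational Riemann-Roch theorem to convert this into a Weierstrass curve.

Main window

Two tinted surfaces with slightly transparent shading, rendered in high quality, intersecting along a red algebraic curve

Mouse rotation in both directions, full pinch mechanism for zoom, haptic press to show a small menu with sliders for changing the coefficients of each surface; detection via Z-buffer level

Right side window

Short Weierstrass equation (over Q or a quadratic field extension) computed on the fly via effective Riemann-Roch theorem formulas

Ambient mode where all the controls are hidden and the user can admire the beauty of the shapes

Specs

App runs in the browser; lightweight implementation using the newest full-stack libraries; portable and deployable

Docs

Git repo, journal, plan (Markdown files)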

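"It's incredibly energizing to use OpenAI's new GPT-5.5 model in our harness, have it reason over massive biochemical datasets to predict human drug outcomes, and then see it deliver significant accuracy gains on our hardest drug discovery evals. If OpenAI keeps cooking like this, the foundations of drug discovery will change by the end of the year."

— Brandon White, Co-Founder & CEO at Axiom Bio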

Serving GPT-5.5 at GPT-5.4 latency required rethinking inference as an integrated system rather than a set of isolated optimizations. GPT-5.5 was co-designed for, trained with, and served on NVIDIA GB200 and GB300 NVL72 systems. Codex and GPT-5.5 were instrumental in hitting our performance targets. Codex helped the team move faster from idea to benchmarkable implementation by sketching approaches, wiring up experiments, and helping identify which optimizations were worth deeper investment. GPT-5.5 helped find and implement key improvements in the stack itself. Put simply, the model helped improve the infrastructure that serves it.

One such improvement was in load balancing and partitioning heuristics. Before GPT-5.5, we split requests on an accelerator into a fixed number of chunks to balance work across computing cores, which ensured that big and small requests could run on the same GPU. However, a predetermined number of static chunks is not optimal for every traffic shape. To make better use of GPUs, Codex analyzed weeks' worth of production traffic patterns and wrote custom heuristic algorithms to optimally partition and balance work. The effort had an outsized impact, increasing token generation speeds by over 20%.
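The trade-off between a fixed chunk count and a traffic-aware one can be illustrated with a toy model. This is a sketch under our own assumptions, not the heuristic described above: work is measured in abstract units, every chunk pays a fixed scheduling `overhead`, chunks are placed greedily on the least-loaded core, and the "adaptive" rule simply targets a chunk size. All names and numbers are hypothetical.

```python
import heapq


def split(size, n_chunks):
    """Split `size` units of work into n_chunks near-equal integer pieces."""
    base, extra = divmod(size, n_chunks)
    return [base + (1 if i < extra else 0) for i in range(n_chunks)]


def makespan(chunks, n_cores, overhead):
    """Greedy placement: largest chunk first onto the least-loaded core.

    Each chunk costs its size plus a fixed per-chunk overhead (assumed).
    Returns the load of the busiest core.
    """
    loads = [0] * n_cores
    heapq.heapify(loads)
    for chunk in sorted(chunks, reverse=True):
        least = heapq.heappop(loads)
        heapq.heappush(loads, least + chunk + overhead)
    return max(loads)


def static_chunks(requests, n_chunks):
    """Every request is cut into the same fixed number of chunks."""
    return [c for r in requests for c in split(r, n_chunks) if c > 0]


def adaptive_chunks(requests, target_size):
    """Chunk count scales with request size (ceil(size / target_size))."""
    chunks = []
    for r in requests:
        n = max(1, -(-r // target_size))  # ceiling division
        chunks.extend(split(r, n))
    return chunks


if __name__ == "__main__":
    traffic = [4000] + [100] * 16  # one large request, many small ones
    print(makespan(static_chunks(traffic, 1), 4, 10))       # 4010
    print(makespan(static_chunks(traffic, 4), 4, 10))       # 1570
    print(makespan(adaptive_chunks(traffic, 1000), 4, 10))  # 1450
```

In this toy traffic mix, one chunk per request leaves the large request stuck on a single core, four chunks per request wastes overhead on the small requests, and sizing the chunk count to each request avoids both costs. The real system would need heuristics fitted to measured traffic, which is what the paragraph above says Codex derived from production data.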

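Preparing the world for models that are very good at finding and patching vulnerabilities is a team sport. It requires the whole ecosystem to work together on building resilience, while democratizing model access and deploying iteratively for the next generation of cyber defense.

Frontier models are becoming increasingly capable in cybersecurity. These capabilities will become widely available, and we believe the best path forward is to make sure they are used to accelerate cyber defense and strengthen the ecosystem.

GPT-5.5 is an incremental but important step toward AI that can help solve some of the world's biggest challenges, such as cybersecurity. With GPT-5.2, released in December, we proactively introduced the cyber safeguards needed to limit potential cyber misuse of our models. With GPT-5.5, we are deploying stricter classifiers for potential cyber risk, which we will tune over time. Some users may find these inconvenient at first, but the aim is better safety and performance over the long term.

As our models have steadily improved, we have long treated cybersecurity as a category in our Preparedness Framework. Over that time, we have iterated on developing and calibrating mitigations so that we can responsibly release models with meaningful cybersecurity capabilities.

We are deploying industry-leading safeguards for this level of cyber capability. We first introduced cyber-specific safeguards with GPT-5.2 last year and have continued to test, refine, and strengthen them in subsequent deployments. With GPT-5.5, we have tightened controls around higher-risk activity and sensitive cyber requests, and added protections against repeated misuse. Our investments in model safety, authenticated usage, and monitoring for impermissible use are what make broad access possible. We have worked with external experts for months to develop, test, and iteratively improve the robustness of these safeguards. With GPT-5.5, we are making sure developers can easily secure their code while placing stronger controls on the cyber workflows most likely to cause harm in the hands of malicious actors.

We are expanding access to accelerate cyber defense at every level. Through Trusted Access for Cyber, we have begun offering cyber-permissive models, starting with Codex. This expands access to GPT-5.5's advanced cybersecurity capabilities, with fewer restrictions for verified users who meet certain trust signals at launch. Organizations responsible for defending critical infrastructure can apply for access to cyber-permissive models such as GPT-5.4-Cyber, provided they meet strict security requirements. This puts more capable tools in the hands of a broad set of verified defenders for legitimate security work and reduces unnecessary friction, democratizing access to important defensive capabilities. Users can apply for trusted access at chatgpt.com/cyber, which reduces unnecessary refusals when using GPT-5.5 for verified defensive work.

We are working with government partners to help protect critical infrastructure for the public. Together we are exploring how advanced AI can support the trusted personnel who defend the systems people rely on, from the digital systems that protect sensitive taxpayer data to the power grids and water supplies of local communities.

We treat GPT-5.5's biological/chemical and cybersecurity capabilities as High under our Preparedness Framework. GPT-5.5 does not reach the Critical cybersecurity capability level, but our evaluations and testing show that its cybersecurity capabilities are a step up from GPT-5.4.

GPT-5.5 also completed our full safety and governance process before release, including preparedness evaluations, domain-specific testing, new targeted evaluations for advanced biology and cybersecurity capabilities, and robust testing by external experts. See the GPT-5.5 system card for details.

This work reflects a broader approach to AI resilience that we believe will be necessary as model capabilities advance. We want to put powerful AI in the hands of those who protect systems, institutions, and the public. The path that makes this possible is trusted access, robust safeguards that scale with capability, and the operational capacity to detect and respond to serious misuse.

Today, GPT-5.5 is rolling out to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex, and GPT-5.5 Pro is rolling out to Pro, Business, and Enterprise users in ChatGPT. GPT-5.5 and GPT-5.5 Pro will be available in the API soon.

In ChatGPT, GPT-5.5 Thinking is available to Plus, Pro, Business, and Enterprise users. GPT-5.5 Pro, designed for even harder questions and higher-accuracy work, is available to Pro, Business, and Enterprise users.

In Codex, GPT-5.5 is available on Plus, Pro, Business, Enterprise, Edu, and Go plans with a 400K context window. GPT-5.5 is also available in Fast mode, which generates tokens 1.5x faster at 2.5x the cost.

For API developers, gpt-5.5 will soon be available in the Responses and Chat Completions APIs with a 1M context window, priced at $5 per 1M input tokens and $30 per 1M output tokens. Batch and Flex pricing is half the standard API rate, and Priority processing is available at 2.5x the standard rate. The higher-accuracy gpt-5.5-pro will also be released in the API at $30 per 1M input tokens and $180 per 1M output tokens. See the pricing page for details.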
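As a worked example of the prices quoted above, the sketch below estimates the cost of a single request. The per-token prices and tier multipliers come from the announcement. The function, the dictionary names, and the assumption that the Batch, Flex, and Priority multipliers apply uniformly to both models are ours, and this is plain arithmetic rather than any official SDK call.

```python
# USD per 1M tokens as (input, output), as quoted in the announcement.
PRICES = {
    "gpt-5.5": (5.00, 30.00),
    "gpt-5.5-pro": (30.00, 180.00),
}

# Multipliers relative to the standard rate. Applying them to every model
# is an assumption made for this illustration.
TIER_MULTIPLIER = {
    "standard": 1.0,
    "batch": 0.5,
    "flex": 0.5,
    "priority": 2.5,
}


def estimate_cost(model, input_tokens, output_tokens, tier="standard"):
    """Return the estimated USD cost of one request."""
    in_price, out_price = PRICES[model]
    base = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    return base * TIER_MULTIPLIER[tier]


if __name__ == "__main__":
    # 200K input tokens and 50K output tokens
    print(estimate_cost("gpt-5.5", 200_000, 50_000))              # 2.5
    print(estimate_cost("gpt-5.5", 200_000, 50_000, "batch"))     # 1.25
    print(estimate_cost("gpt-5.5", 200_000, 50_000, "priority"))  # 6.25
    print(estimate_cost("gpt-5.5-pro", 200_000, 50_000))          # 15.0
```

Because output tokens cost six times as much as input tokens at these rates, the claimed reduction in tokens needed per task matters more for the bill than the input price does.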

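GPT-5.5 is priced higher than GPT-5.4, but it is more intelligent and significantly more token-efficient. In Codex, we have carefully tuned the experience so that for most users GPT-5.5 delivers better results with fewer tokens than GPT-5.4, and we continue to offer generous usage limits at each subscription level.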

Coding

Eval

GPT-5.5

GPT‑5.4

GPT-5.5 Pro

GPT‑5.4 Pro

Claude Opus 4.7

Gemini 3.1 Pro

SWE-Bench Pro (Public)*

58.6%

57.7%

64.3%

54.2%

Terminal-Bench 2.0

82.7%

75.1%

69.4%

68.5%

Expert-SWE (Internal)

73.1%

68.5%

Professional

Eval

GPT-5.5

GPT‑5.4

GPT-5.5 Pro

GPT‑5.4 Pro

Claude Opus 4.7

Gemini 3.1 Pro

GDPval (wins or ties)

84.9%

83.0%

82.3%

82.0%

80.3%

67.3%

FinanceAgent v1.1

60.0%

56.0%

61.5%

64.4%

59.7%

Investment Banking Modeling Tasks (Internal)

88.5%

87.3%

88.6%

83.6%

OfficeQA Pro

54.1%

53.2%

43.6%

18.1%

Computer use and vision

Eval

GPT-5.5

GPT‑5.4

GPT-5.5 Pro

GPT‑5.4 Pro

Claude Opus 4.7

Gemini 3.1 Pro

OSWorld-Verified

78.7%

75.0%

78.0%

MMMU Pro (no tools)

81.2%

80.5%

MMMU Pro (with tools)

83.2%

82.1%

Tool use

Eval

GPT-5.5

GPT‑5.4

GPT-5.5 Pro

GPT‑5.4 Pro

Claude Opus 4.7

Gemini 3.1 Pro

BrowseComp

84.4%

82.7%

90.1%

89.3%

79.3%

85.9%

MCP Atlas**

75.3%

70.6%

79.1%

78.2%

Toolathlon

55.6%

54.6%

48.8%

Tau2-bench Telecom*** (original prompts)

98.0%

92.8%

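** MCP Atlas: results from Scale AI after the latest April 2026 update.

*** Tau2-bench Telecom: results for 5.5 and 5.4 use the original prompts (that is, no prompt tuning). Results from other labs that were evaluated with prompt tuning are excluded.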

Academic

Eval

GPT-5.5

GPT‑5.4

GPT-5.5 Pro

GPT‑5.4 Pro

Claude Opus 4.7

Gemini 3.1 Pro

GeneBench

25.0%

19.0%

33.2%

25.6%

FrontierMath Tier 1–3

51.7%

47.6%

52.4%

50.0%

43.8%

36.9%

FrontierMath Tier 4

35.4%

27.1%

39.6%

38.0%

22.9%

16.7%

BixBench

80.5%

74.0%

GPQA Diamond

93.6%

92.8%

94.4%

94.2%

94.3%

Humanity's Last Exam (no tools)

41.4%

39.8%

43.1%

42.7%

46.9%

44.4%

Humanity's Last Exam (with tools)

52.2%

52.1%

57.2%

58.7%

54.7%

51.4%

Cybersecurity

Eval

GPT-5.5

GPT‑5.4

GPT-5.5 Pro

GPT‑5.4 Pro

Claude Opus 4.7

Gemini 3.1 Pro

Capture-the-Flags challenge tasks (Internal)****

88.1%

83.7%

CyberGym

81.8%

79.0%

73.1%

**** An expanded version of the hardest CTFs (Capture-the-Flags) used in the system card, with additional, more difficult challenges.

Long context

Eval

GPT-5.5

GPT‑5.4

GPT-5.5 Pro

GPT‑5.4 Pro

Claude Opus 4.7

Gemini 3.1 Pro

Graphwalks BFS 256k f1

73.7%

62.5%

76.9%

Graphwalks BFS 1mil f1

45.4%

9.4%

41.2% (Opus 4.6)

Graphwalks parents 256k f1

90.1%

82.8%

93.6%

Graphwalks parents 1mil f1

58.5%

44.4%

72.0% (Opus 4.6)

OpenAI MRCR v2 8-needle 4K-8K

98.1%

97.3%

OpenAI MRCR v2 8-needle 8K-16K

93.0%

91.4%

OpenAI MRCR v2 8-needle 16K-32K

96.5%

97.2%

OpenAI MRCR v2 8-needle 32K-64K

90.0%

90.5%

OpenAI MRCR v2 8-needle 64K-128K

83.1%

86.0%

OpenAI MRCR v2 8-needle 128K-256K

87.5%

79.3%

59.2%

OpenAI MRCR v2 8-needle 256K-512K

81.5%

57.5%

OpenAI MRCR v2 8-needle 512K-1M

74.0%

36.6%

32.2%

Abstract reasoning

Eval

GPT-5.5

GPT‑5.4

GPT-5.5 Pro

GPT‑5.4 Pro

Claude Opus 4.7

Gemini 3.1 Pro

ARC-AGI-1 (Verified)

95.0%

93.7%

94.5%

93.5%

98.0%

ARC-AGI-2 (Verified)

85.0%

73.3%

83.3%

75.8%

77.1%

The GPT evals were run with reasoning effort set to xhigh in a research environment, so in some cases outputs may differ slightly from production ChatGPT.

Original text

We’re releasing GPT‑5.5, our smartest and most intuitive to use model yet, and the next step toward a new way of getting work done on a computer.

GPT‑5.5 understands what you’re trying to do faster and can carry more of the work itself. It excels at writing and debugging code, researching online, analyzing data, creating documents and spreadsheets, operating software, and moving across tools until a task is finished. Instead of carefully managing every step, you can give GPT‑5.5 a messy, multi-part task and trust it to plan, use tools, check its work, navigate through ambiguity, and keep going.

The gains are especially strong in agentic coding, computer use, knowledge work, and early scientific research—areas where progress depends on reasoning across context and taking action over time. GPT‑5.5 delivers this step up in intelligence without compromising on speed: larger, more capable models are often slower to serve, but GPT‑5.5 matches GPT‑5.4 per-token latency in real-world serving, while performing at a much higher level of intelligence. It also uses significantly fewer tokens to complete the same Codex tasks, making it more efficient as well as more capable.

We are releasing GPT‑5.5 with our strongest set of safeguards to date, designed to reduce misuse while preserving access for beneficial work. We evaluated this model across our full suite of safety and preparedness frameworks, worked with internal and external redteamers, added targeted testing for advanced cybersecurity and biology capabilities, and collected feedback on real use cases from nearly 200 trusted early-access partners before release.

Today, GPT‑5.5 is rolling out to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex, and GPT‑5.5 Pro is rolling out to Pro, Business, and Enterprise users in ChatGPT. API deployments require different safeguards and we are working closely with partners and customers on the safety and security requirements for serving it at scale. We'll bring GPT‑5.5 and GPT‑5.5 Pro to the API very soon.

GPT-5.5

GPT-5.4

GPT-5.5 Pro

GPT-5.4 Pro

Claude Opus 4.7

Gemini 3.1 Pro

Terminal-Bench 2.0

82.7%

75.1%

69.4%

68.5%

Expert-SWE (Internal)

73.1%

68.5%

GDPval (wins or ties)

84.9%

83.0%

82.3%

82.0%

80.3%

67.3%

OSWorld-Verified

78.7%

75.0%

78.0%

Toolathlon

55.6%

54.6%

48.8%

BrowseComp

84.4%

82.7%

90.1%

89.3%

79.3%

85.9%

FrontierMath Tier 1–3

51.7%

47.6%

52.4%

50.0%

43.8%

36.9%

FrontierMath Tier 4

35.4%

27.1%

39.6%

38.0%

22.9%

16.7%

CyberGym

81.8%

79.0%

73.1%

OpenAI is building the global infrastructure for agentic AI, making it possible for people and businesses around the world to get work done with AI. Over the past year, we’ve seen AI dramatically accelerate software engineering. With GPT‑5.5 in Codex and ChatGPT, that same transformation is beginning to extend into scientific research and the broader work people do on computers.

Across these domains, GPT‑5.5 is not just more intelligent; it is more efficient in how it works through problems, often reaching higher-quality outputs with fewer tokens and fewer retries. On Artificial Analysis's Coding Index, GPT‑5.5 delivers state-of-the-art intelligence at half the cost of competitive frontier coding models.

GPT‑5.5 is our strongest agentic coding model to date. On Terminal-Bench 2.0, which tests complex command-line workflows requiring planning, iteration, and tool coordination, it achieves a state-of-the-art accuracy of 82.7%. On SWE-Bench Pro, which evaluates real-world GitHub issue resolution, it reaches 58.6%, solving more tasks end-to-end in a single pass than previous models. On Expert-SWE, our internal frontier eval for long-horizon coding tasks with a median estimated human completion time of 20 hours, GPT‑5.5 also outperforms GPT‑5.4.

Across all three evals, GPT‑5.5 improves on GPT‑5.4’s scores while using fewer tokens.

The model’s coding strengths show up especially clearly in Codex where it can take on engineering work ranging from implementation and refactors to debugging, testing, and validation. Early testing suggests GPT‑5.5 is better at the behaviors real engineering work depends on, like holding context across large systems, reasoning through ambiguous failures, checking assumptions with tools, and carrying changes through the surrounding codebase.

The rendered trajectory uses NASA/JPL Horizons vector data for Orion, the Moon, and the Sun, with display scaling applied for readability.

Prompt: [attached image] Implement this as a new app using webgl and vite using real data from the artemis II mission. Make sure to test the app thoroughly until it is fully functional and looks like the app in the picture. Pay close attention to the rendering of the planets and fly paths. I want to be able to interact with the 3D rendering. Ensure it has realistic orbital mechanics.

Beyond benchmarks, early testers said GPT‑5.5 shows a stronger ability to understand the shape of a system: why something is failing, where the fix needs to land, and what else in the codebase would be affected.

Dan Shipper, Founder and CEO of Every, described GPT‑5.5 as “the first coding model I’ve used that has serious conceptual clarity.”

After launching an app, he spent days debugging a post-launch issue before bringing in one of his best engineers to rewrite part of the system. To test GPT‑5.5, he effectively rewound the clock: could the model look at the broken state and produce the same kind of rewrite the engineer eventually decided on? GPT‑5.4 could not. GPT‑5.5 could.

Pietro Schirano, CEO of MagicPath, saw a similar step change when GPT‑5.5 merged a branch with hundreds of frontend and refactor changes into a main branch that had also changed substantially, resolving the work in one shot in about 20 minutes.

Senior engineers who tested the model said GPT‑5.5 was noticeably stronger than GPT‑5.4 and Claude Opus 4.7 at reasoning and autonomy, catching issues in advance and predicting testing and review needs without explicit prompting. In one case, an engineer asked it to re-architect a comment system in a collaborative markdown editor and returned to a 12-diff stack that was nearly complete. Others said they needed surprisingly little implementation correction and felt more confident in GPT‑5.5’s plans compared with GPT‑5.4.

One engineer at NVIDIA who had early access to the model went as far as to say: "Losing access to GPT-5.5 feels like I've had a limb amputated."

“GPT-5.5 is noticeably smarter and more persistent than GPT-5.4, with stronger coding performance and more reliable tool use. It stays on task for significantly longer without stopping early, which matters most for the complex, long-running work our users delegate to Cursor.”

— Michael Truell, Co-founder & CEO at Cursor

The same strengths that make GPT‑5.5 great at coding also make it powerful for everyday work on a computer. Because the model is better at understanding intent, it can move more naturally through the full loop of knowledge work: finding information, understanding what matters, using tools, checking the output, and turning raw material into something useful.

In Codex, GPT‑5.5 is better than GPT‑5.4 at generating documents, spreadsheets, and slide presentations. Alpha testers said it outperformed past models on work like operational research, spreadsheet modeling, and turning messy business inputs into plans. When combined with Codex’s computer use skills, GPT‑5.5 brings us closer to the feeling that the model can actually use the computer with you: seeing what’s on screen, clicking, typing, navigating interfaces, and moving across tools with precision.

Teams at OpenAI are already using these strengths in real workflows. Today, more than 85% of the company uses Codex every week across functions including software engineering, finance, communications, marketing, data science, and product management. In Comms, the team used GPT‑5.5 in Codex to analyze six months of speaking request data, build a scoring and risk framework, and validate an automated Slack agent so low-risk requests could be handled automatically while higher-risk requests still route to human review. In Finance, the team used Codex to review 24,771 K-1 tax forms totaling 71,637 pages, using a workflow that excluded personal information and helped the team accelerate the task by two weeks compared to the prior year. On the Go-to-Market team, an employee automated generating weekly business reports, saving 5-10 hours a week.

In ChatGPT, GPT‑5.5 Thinking unlocks faster help for harder problems, with smarter and more concise answers to help you move through complex work more efficiently. It excels at professional work like coding, research, information synthesis and analysis, and document-heavy tasks, especially when using plugins.

In GPT‑5.5 Pro, early testers are seeing a significant step up in both the difficulty and quality of work ChatGPT can take on, with latency improvements that make it much more practical for demanding tasks. Compared to GPT‑5.4 Pro, testers found GPT‑5.5 Pro’s responses significantly more comprehensive, well-structured, accurate, relevant, and useful, with especially strong performance in business, legal, education, and data science.

GPT-5.5 reaches state-of-the-art performance across multiple benchmarks that reflect this kind of work. On GDPval, which tests agents' abilities to produce well-specified knowledge work across 44 occupations, GPT-5.5 scores 84.9%. On OSWorld-Verified, which measures whether a model can operate real computer environments on its own, it reaches 78.7%. And on Tau2-bench Telecom, which tests complex customer-service workflows, it reaches 98.0% without prompt tuning. GPT-5.5 also performs strongly across other knowledge work benchmarks: 60.0% on FinanceAgent, 88.5% on internal investment-banking modeling tasks, and 54.1% on OfficeQA Pro.

“GPT-5.5 delivers the sustained performance required for execution-heavy work. Built and served on NVIDIA GB200 NVL72 systems, the model enables our teams to ship end-to-end features from natural language prompts, cut debug time from days to hours, and turn weeks of experimentation into overnight progress in complex codebases. It’s more than faster coding—it’s a new way of working that helps people operate at a fundamentally different speed.”

— Justin Boitano, VP of Enterprise AI at NVIDIA

GPT‑5.5 also shows gains on scientific and technical research workflows, which require more than answering a hard question. Researchers need to explore an idea, gather evidence, test assumptions, interpret results, and decide what to try next. GPT‑5.5 is better at persisting across that loop than other models.

Notably, GPT-5.5 shows a clear improvement over GPT-5.4 on GeneBench, a new eval focusing on multi-stage scientific data analysis in genetics and quantitative biology. These problems require models to reason about potentially ambiguous or errorful data with minimal supervisory guidance, address realistic obstacles such as hidden confounders or QC failures, and correctly implement and interpret modern statistical methods. The model's performance is striking in light of the fact that tasks here often correspond to multi-day projects for scientific experts.

Similarly, on BixBench, a benchmark designed around real-world bioinformatics and data analysis, GPT-5.5 achieved leading performance among models with published scores. The model's scientific capabilities are now strong enough to meaningfully accelerate progress at the frontiers of biomedical research as a bona fide co-scientist.

In another example, an internal version of GPT-5.5 with a custom harness helped discover a new proof about Ramsey numbers, one of the central objects in combinatorics. Combinatorics studies how discrete objects fit together: graphs, networks, sets, and patterns. Ramsey numbers ask, roughly, how large a network has to be before some kind of order is guaranteed to appear. Results in this area are rare and often technically difficult. Here, GPT-5.5 found a proof of a longstanding asymptotic fact about off-diagonal Ramsey numbers, later verified in Lean. The result is a concrete example of GPT-5.5 contributing not just code or explanation, but a surprising and useful mathematical argument in a core research area.

Early testers used GPT‑5.5 Pro in ChatGPT less like a one-shot answer engine and more like a research partner: critiquing manuscripts over multiple passes, stress-testing technical arguments, proposing analyses, and working with code, notes, and PDF context. The common thread is that GPT‑5.5 is better at helping researchers move from question to experiment to output.

Derya Unutmaz, an immunology professor and researcher at the Jackson Laboratory for Genomic Medicine, used GPT‑5.5 Pro to analyze a gene-expression dataset with 62 samples and nearly 28,000 genes, producing a detailed research report that not only summarized the findings but also surfaced key questions and insights—work he said would have taken his team months.

Bartosz Naskręcki, assistant professor of mathematics at Adam Mickiewicz University in Poznań, Poland, used GPT‑5.5 in Codex to build an algebraic-geometry app from a single prompt in 11 minutes, visualizing the intersection of quadratic surfaces and converting the resulting curve into a Weierstrass model.

He later extended the app with more stable singularity visualization and exact coefficients that can be reused in further work. For him, the bigger shift is that Codex can now help implement custom mathematical visualization and computer-algebra workflows that previously required dedicated tools. Together, these examples show GPT‑5.5 turning expert intent into working research tools and analyses.

Credit: Bartosz Naskręcki

Prompt: # Algebraic geometry surface intersection

Make an app which draws two quadratic surfaces and colors in red the intersection curve. Use computational Riemann-Roch theorem to convert this into Weierstrass curve.

Main window

Two tinted surfaces with a slightly transparent shading, high quality rendering intersect along a red colored algebraic curve

Rotation with mouses in both directions, full pinch mechanism for zoom, haptic press to show the little menu with sliders for changing the coefficients of each surface; detection via Z-buffor level

Side right window

Short Weierstrass equation (over Q or quadratic field extension) computed on the go via effective Riemann-Roch theorem formulas

Ambient mode where all the controls are hidden and the user can admire the beauty of the shapes

Specs

App is running in the browser, light-weight implementation with full stack newest libraries, portable, deployable

Docs

Git repo, journal, plan (Markdown files)

“It’s incredibly energizing to use OpenAI’s new GPT-5.5 model in our harness, have it reason over massive biochemical datasets to predict human drug outcomes, and then see it deliver significant accuracy gains on our hardest drug discovery evals. If OpenAI keeps cooking like this, the foundations of drug discovery will change by the end of the year.”

— Brandon White, Co-Founder & CEO at Axiom Bio

Serving GPT‑5.5 at GPT‑5.4 latency required rethinking inference as an integrated system, not a set of isolated optimizations. GPT‑5.5 was co-designed for, trained with, and served on NVIDIA GB200 and GB300 NVL72 systems. Codex and GPT‑5.5 were instrumental in how we achieved our performance targets. Codex helped the team move faster from idea to benchmarkable implementation, sketching approaches, wiring experiments, and helping identify which optimizations were worth deeper investment. GPT‑5.5 helped find and implement key improvements in the stack itself. Put simply, the model helped improve the infrastructure that serves it.

One such improvement was load balancing and partitioning heuristics. Before GPT‑5.5, we split requests on an accelerator into a fixed number of chunks to balance work across computing cores, ensuring big and small requests could run on the same GPU. However, a pre-determined number of static chunks is not optimal for all traffic shapes. To better utilize GPUs, Codex analyzed weeks’ worth of production traffic patterns and wrote custom heuristic algorithms to optimally partition and balance work. The effort had an outsized impact, increasing token generation speeds by over 20%.
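The contrast between a fixed chunk count and a traffic-aware one can be sketched in a few lines of Python. The announcement does not disclose the actual heuristics, so everything here is an assumption made for illustration: the per-chunk overhead cost, the longest-first greedy assignment, and the rule of cutting a request only as finely as a per-core target requires.

```python
import heapq
import math


def split(size, k):
    """Split `size` tokens into k chunks whose sizes differ by at most one."""
    base, extra = divmod(size, k)
    return [base + (1 if i < extra else 0) for i in range(k)]


def makespan(chunks, cores, overhead):
    """Greedy longest-first assignment; returns the busiest core's load.

    `overhead` is a hypothetical fixed cost paid once per chunk.
    """
    loads = [0] * cores
    heapq.heapify(loads)
    for chunk in sorted(chunks, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + chunk + overhead)
    return max(loads)


def static_chunks(requests, k):
    """Baseline: every request is cut into the same fixed number of chunks."""
    return [c for size in requests for c in split(size, k)]


def adaptive_chunks(requests, cores):
    """Cut each request only as finely as needed to fit a per-core target."""
    target = max(1, math.ceil(sum(requests) / cores))
    chunks = []
    for size in requests:
        k = max(1, math.ceil(size / target))
        chunks.extend(split(size, k))
    return chunks


if __name__ == "__main__":
    requests = [8000, 200, 200, 200]  # one large request, three small ones
    cores, overhead = 4, 50
    print(makespan(static_chunks(requests, 4), cores, overhead))       # 2350
    print(makespan(adaptive_chunks(requests, cores), cores, overhead))  # 2300
```

In this toy traffic shape the fixed split pays chunk overhead on small requests that did not need splitting, so the adaptive plan finishes slightly sooner. With a different overhead or request mix the ranking can change, which is the point of choosing the partition from observed traffic rather than fixing it in advance.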

Preparing the world for models that are very good at finding and patching security vulnerabilities is a team sport. It will require the entire ecosystem to work hard to build resilience, with democratized model access and iterative deployment for the next era of cyber defense.

Frontier models are becoming increasingly capable in cybersecurity. Those capabilities will become broadly distributed, and we believe the best path forward is to make sure they can be put to use for accelerating cyber defense and strengthening the ecosystem.

GPT‑5.5 is an incremental but important step towards AI that can solve some of the world’s toughest challenges like cybersecurity. With GPT‑5.2 in December, we proactively deployed the necessary cyber safeguards to limit potential cyber abuse with our models; now with GPT‑5.5, we’re deploying stricter classifiers for potential cyber risk, which some users may find annoying initially as we tune them over time.

Cybersecurity has been a category in our Preparedness Framework for years. As our models have incrementally improved, we have developed and calibrated mitigations iteratively so that we can responsibly release models with meaningful cybersecurity capabilities.

We are deploying industry-leading safeguards for this level of cyber capability. We first introduced cyber-specific safeguards with GPT‑5.2 last year, which we have continued to test, refine, and build on in subsequent deployments. For GPT‑5.5, we designed tighter controls around higher-risk activity and sensitive cyber requests, and added protections against repeated misuse. Broad access is made possible through our investments in model safety, authenticated usage, and monitoring for impermissible use. We have been working with external experts for months to develop, test, and iterate on the robustness of these safeguards. With GPT‑5.5, we are ensuring developers can secure their code with ease, while putting stronger controls around the cyber workflows that malicious actors are most likely to use to cause harm.

We are expanding access to accelerate cyber defense at every level. We are making our cyber-permissive models available through Trusted Access for Cyber, starting with Codex, which includes expanded access to the advanced cybersecurity capabilities of GPT‑5.5 with fewer restrictions for verified users meeting certain trust signals at launch. Organizations that are responsible for defending critical infrastructure can apply to access cyber-permissive models like GPT‑5.4‑Cyber, while meeting strict security requirements to use these models for securing their internal systems. This gives a wide range of verified defenders more capable tools for legitimate security work with less unnecessary friction, democratizing access to important defensive capabilities. Users can apply for trusted access at chatgpt.com/cyber to reduce unnecessary refusals while using GPT‑5.5 for verified defensive work.

We are working with government partners to help protect critical infrastructure for the public. Together, we are exploring how advanced AI can support the defensive work of trusted officials responsible for systems people rely on, from the digital systems that secure important taxpayer data to the power grid and water supplies in local communities.

We are treating the biological/chemical and cybersecurity capabilities of GPT‑5.5 as High under our Preparedness Framework. While GPT‑5.5 didn’t reach Critical cybersecurity capability level, our evaluations and testing showed that its cybersecurity capabilities are a step up compared to GPT‑5.4.

In addition, GPT‑5.5 went through our full safety and governance process prior to release, including preparedness evaluations, domain-specific testing, new targeted evaluations for advanced biology and cybersecurity capabilities, and robust testing with external experts. We share more details in the GPT‑5.5 system card.

This work reflects our broader AI resilience approach, which we believe is needed as model capabilities advance. We want powerful AI to be available to the people using it to defend systems, institutions, and the public. The viable path is trusted access, robust safeguards that scale with capability, and the operational capacity to detect and respond to serious misuse.

Today, GPT‑5.5 is rolling out to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex, and GPT‑5.5 Pro is rolling out to Pro, Business, and Enterprise users in ChatGPT. We'll bring GPT‑5.5 and GPT‑5.5 Pro to the API very soon.

In ChatGPT, GPT‑5.5 Thinking is available to Plus, Pro, Business, and Enterprise users. GPT‑5.5 Pro, designed for even harder questions and higher-accuracy work, is available to Pro, Business, and Enterprise users.

In Codex, GPT‑5.5 is available for Plus, Pro, Business, Enterprise, Edu, and Go plans with a 400K context window. GPT‑5.5 is also available in Fast mode, generating tokens 1.5x faster for 2.5x the cost.

For API developers, gpt-5.5 will soon be available in the Responses and Chat Completions APIs at $5 per 1M input tokens and $30 per 1M output tokens, with a 1M context window. Batch and Flex pricing are available at half the standard API rate, while Priority processing is available at 2.5x the standard rate. We will also release gpt-5.5-pro in the API for even higher accuracy, priced at $30 per 1M input tokens and $180 per 1M output tokens. See the pricing page⁠ for full details.
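The quoted rates translate into a simple per-request estimate. The Python sketch below uses only the numbers stated above. The function and dictionary names are hypothetical. It ignores anything the text does not mention, such as cached-input discounts, and applying the Batch, Flex, and Priority multipliers to gpt-5.5-pro is an assumption, since the text states them only relative to "the standard API rate".

```python
# Standard rates quoted in the announcement, in USD per 1M tokens:
# (input rate, output rate).
RATES = {
    "gpt-5.5": (5.00, 30.00),
    "gpt-5.5-pro": (30.00, 180.00),
}

# Multipliers relative to the standard rate, as described in the text.
TIER_MULTIPLIER = {
    "standard": 1.0,
    "batch": 0.5,
    "flex": 0.5,
    "priority": 2.5,
}


def estimate_cost(model, input_tokens, output_tokens, tier="standard"):
    """Estimated USD cost of one request at the quoted list prices."""
    input_rate, output_rate = RATES[model]
    base = (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
    return base * TIER_MULTIPLIER[tier]


if __name__ == "__main__":
    # 200K input tokens and 20K output tokens on gpt-5.5.
    for tier in ("standard", "batch", "priority"):
        print(tier, round(estimate_cost("gpt-5.5", 200_000, 20_000, tier), 2))
```

For that example the standard estimate is $1.60: $1.00 for input plus $0.60 for output. Output tokens cost six times as much as input tokens at these rates, so the token-efficiency claims matter most for long generations.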

While GPT‑5.5 is priced higher than GPT‑5.4, it is both more intelligent and much more token efficient. In Codex, we have carefully tuned the experience so GPT‑5.5 delivers better results with fewer tokens than GPT‑5.4 for most users, while continuing to offer generous usage across subscription levels.

Coding

Eval

GPT-5.5

GPT‑5.4

GPT-5.5 Pro

GPT‑5.4 Pro

Claude Opus 4.7

Gemini 3.1 Pro

SWE-Bench Pro (Public) *

58.6%

57.7%

64.3%

54.2%

Terminal-Bench 2.0

82.7%

75.1%

69.4%

68.5%

Expert-SWE (Internal)

73.1%

68.5%

Professional

Eval

GPT-5.5

GPT‑5.4

GPT-5.5 Pro

GPT‑5.4 Pro

Claude Opus 4.7

Gemini 3.1 Pro

GDPval (wins or ties)

84.9%

83.0%

82.3%

82.0%

80.3%

67.3%

FinanceAgent v1.1

60.0%

56.0%

61.5%

64.4%

59.7%

Investment Banking Modeling Tasks (Internal)

88.5%

87.3%

88.6%

83.6%

OfficeQA Pro

54.1%

53.2%

43.6%

18.1%

Computer use and vision

Eval

GPT-5.5

GPT‑5.4

GPT-5.5 Pro

GPT‑5.4 Pro

Claude Opus 4.7

Gemini 3.1 Pro

OSWorld-Verified

78.7%

75.0%

78.0%

MMMU Pro (no tools)

81.2%

80.5%

MMMU Pro (with tools)

83.2%

82.1%

Tool use

Eval

GPT-5.5

GPT‑5.4

GPT-5.5 Pro

GPT‑5.4 Pro

Claude Opus 4.7

Gemini 3.1 Pro

BrowseComp

84.4%

82.7%

90.1%

89.3%

79.3%

85.9%

MCP Atlas**

75.3%

70.6%

79.1%

78.2%

Toolathlon

55.6%

54.6%

48.8%

Tau2-bench Telecom*** (original prompts)

98.0%

92.8%

** MCP Atlas: results from Scale AI after the latest April 2026 update.

*** Tau2-bench Telecom: results for GPT‑5.5 and GPT‑5.4 with original prompts, i.e., no prompt adjustment. This omits results from other labs that were evaluated with prompt adjustments.

Academic

Eval

GPT-5.5

GPT‑5.4

GPT-5.5 Pro

GPT‑5.4 Pro

Claude Opus 4.7

Gemini 3.1 Pro

GeneBench

25.0%

19.0%

33.2%

25.6%

FrontierMath Tier 1–3

51.7%

47.6%

52.4%

50.0%

43.8%

36.9%

FrontierMath Tier 4

35.4%

27.1%

39.6%

38.0%

22.9%

16.7%

BixBench

80.5%

74.0%

GPQA Diamond

93.6%

92.8%

94.4%

94.2%

94.3%

Humanity's Last Exam (no tools)

41.4%

39.8%

43.1%

42.7%

46.9%

44.4%

Humanity's Last Exam (with tools)

52.2%

52.1%

57.2%

58.7%

54.7%

51.4%

Cybersecurity

Eval

GPT-5.5

GPT‑5.4

GPT-5.5 Pro

GPT‑5.4 Pro

Claude Opus 4.7

Gemini 3.1 Pro

Capture-the-Flags challenge tasks (Internal)****

88.1%

83.7%

CyberGym

81.8%

79.0%

73.1%

**** An expansion of the hardest CTFs used in system cards with additional hard challenges.

Long context

Eval

GPT-5.5

GPT‑5.4

GPT-5.5 Pro

GPT‑5.4 Pro

Claude Opus 4.7

Gemini 3.1 Pro

Graphwalks BFS 256k f1

73.7%

62.5%

76.9%

Graphwalks BFS 1mil f1

45.4%

9.4%

41.2% (Opus 4.6)

Graphwalks parents 256k f1

90.1%

82.8%

93.6%

Graphwalks parents 1mil f1

58.5%

44.4%

72.0% (Opus 4.6)

OpenAI MRCR v2 8-needle 4K-8K

98.1%

97.3%

OpenAI MRCR v2 8-needle 8K-16K

93.0%

91.4%

OpenAI MRCR v2 8-needle 16K-32K

96.5%

97.2%

OpenAI MRCR v2 8-needle 32K-64K

90.0%

90.5%

OpenAI MRCR v2 8-needle 64K-128K

83.1%

86.0%

OpenAI MRCR v2 8-needle 128K-256K

87.5%

79.3%

59.2%

OpenAI MRCR v2 8-needle 256K-512K

81.5%

57.5%

OpenAI MRCR v2 8-needle 512K-1M

74.0%

36.6%

32.2%

Abstract reasoning

Eval

GPT-5.5

GPT‑5.4

GPT-5.5 Pro

GPT‑5.4 Pro

Claude Opus 4.7

Gemini 3.1 Pro

ARC-AGI-1 (Verified)

95.0%

93.7%

94.5%

93.5%

98.0%

ARC-AGI-2 (Verified)

85.0%

73.3%

83.3%

75.8%

77.1%

Evals of GPT were run with reasoning effort set to xhigh and were conducted in a research environment, which may provide slightly different output from production ChatGPT in some cases.
