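OpenAI News · June 26, 2026 19:00 · approx. 14 min read

Next-generation model "GPT-5.6 Sol" released in preview

#LLM #OpenAI #GPT-5.6 Sol

TL;DR

OpenAI has released a preview of its next-generation model, "GPT-5.6 Sol," and described its performance and features, but the article is extremely brief and lacks concrete technical insight.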

AI Deep Analysis · June 26, 2026 18:02

Reference / 5-point scale

Depth: 40%

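Key points

New model announcement

OpenAI has released a preview of its next-generation model, "GPT-5.6 Sol."

Details provided

The company describes the new model's performance and features in detail.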


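Impact analysis

The news points to new model development at OpenAI, but the text provided is little more than a launch announcement, so there is not enough information to assess its immediate impact on the industry or its technical significance. Concrete performance comparisons and implementation examples are still awaited.

Editorial comment

The information in the article is extremely limited and is insufficient for judging the new model's specific innovations or its impact on the industry. Future detailed technical reports deserve attention.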

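We're beginning a limited preview of the GPT-5.6 series: Sol, our flagship model; Terra, a balanced model for everyday work; and Luna, a fast and affordable model. Terra has performance competitive with GPT-5.5 at half the cost, and Luna brings strong capability at our lowest cost.

GPT-5.6 Sol launches with our most robust safety stack to date. We strengthened protections for higher-risk activity, sensitive cyber requests, and repeated misuse, and spent multiple weeks finding weaknesses, pressure-testing our system, and hardening it against real-world attacks.

We believe in broad access, and we plan to make GPT-5.6 Sol, Terra, and Luna generally available in the coming weeks. As part of our ongoing engagement with the U.S. government, we previewed our plans and the models' capabilities ahead of today's launch. At their request, we are starting with a limited preview for a small group of trusted partners whose participation has been shared with the government, before releasing more broadly. During this preview, we will continue testing and coordinating closely with partners as we work toward broader availability. We don't believe this kind of government access process should become the long-term default, because it keeps the best tools from the users, developers, enterprises, cyber defenders, and global partners who need them. We are taking this short-term step because we believe it is the strongest path to broader availability in the coming weeks, while we work with the Administration to develop the cyber Executive Order framework and a repeatable process for future model releases.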

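Capabilities

GPT-5.6 Sol is our strongest model yet. To give a preview of model performance, we share a set of evaluations highlighting improved agentic capabilities in coding, biology, and cybersecurity, with additional safety and preparedness evaluations available in our system card. We will share an expanded suite of evaluation results when we make the model broadly available.

With GPT-5.6, we're introducing a new max reasoning effort to give Sol the most time to reason deeply. Additionally, we're introducing a new ultra mode that goes beyond the capabilities of a single agent by leveraging subagents to accelerate complex work.

For coding workflows, GPT-5.6 Sol sets a new state of the art on Terminal-Bench 2.1, which tests command-line workflows requiring planning, iteration, and tool coordination.

GPT-5.6 Sol also shows broad improvements in biology workflows. On GeneBench v1, which evaluates long-horizon genomics and quantitative-biology analyses, it achieves stronger results than GPT-5.5 while using fewer tokens.

GPT-5.6 Sol is our most capable model yet for cybersecurity. It shifts the performance-efficiency frontier for long-horizon security tasks, including vulnerability research and exploitation. On ExploitBench², GPT-5.6 Sol is competitive with Mythos Preview while using only about 1/3 of the output tokens. On ExploitGym³, a benchmark created by UC Berkeley researchers in collaboration with OpenAI and other frontier labs, the GPT-5.6 Sol, Terra, and Luna models all demonstrate strong improvements in cyber capabilities as we increase reasoning.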

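Stronger cyber capabilities with stronger safeguards

We developed GPT-5.6 Sol, Terra, and Luna with our most robust safeguards to date, with configurations matched to each model's capabilities. As the model becomes more capable, we design safeguards to increasingly hold up to real-world adversarial pressure while preserving access to legitimate work such as code review, vulnerability research, patch development, debugging, security education, and defensive testing. Our goal is to make prohibited offensive activity more difficult, uncertain, and detectable without unnecessarily limiting those beneficial uses. Based on our assessment of the model and safeguards, we expect substantial benefit for legitimate defensive work, while meaningfully constraining prohibited offensive use.

GPT-5.6 Sol is better at helping people find and fix vulnerabilities than at reliably carrying out end-to-end attacks. As these capabilities continue to advance, our priority is to make sure they reach and benefit defenders, who can use these tools to find weaknesses, develop patches, and strengthen systems more broadly.

GPT-5.6 Sol does not cross the Cyber Critical threshold under our Preparedness Framework. In evaluations involving Chromium and Firefox, it identified bugs and exploitation primitives (the building blocks of an exploit) but did not autonomously produce a functional full-chain exploit under the conditions tested. Still, benchmark thresholds cannot capture every way a model may be used or combined with other tools. That uncertainty, along with the model's broader step change in capabilities, is why we are pairing the model's increased capabilities with stronger safeguards and a phased release. We share more details about our safeguards in the GPT-5.6 Preview system card.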

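A layered safeguard stack

No single safeguard is sufficient against determined or adaptive misuse. Across the GPT-5.6 preview, we use layered safeguards, with exact configurations varying across models, and pressure-test them against real-world attacks. These include protections trained into the model, real-time checks during generation, account-level signals, differentiated access, monitoring, enforcement, and continued testing.

GPT-5.6 is trained to refuse prohibited cyber assistance, including when users attempt to disguise their intent or jailbreak the model. These model-level safeguards establish the first boundary around what the model should and should not help with.

Real-time cyber and biology misuse classifiers provide another layer by evaluating output as it is generated. For higher-risk cases, if they detect a potential violation, generation may be paused while a larger reasoning model reviews the conversation and its context. If the output is assessed as disallowed, it is withheld before it reaches the user.

Flagged activity can also trigger account-level review across relevant conversations and risk signals, consistent with our terms and policies around content retention and review. Looking beyond a single conversation helps our systems distinguish persistent malicious behavior from legitimate dual-use security work, where similar technical concepts may appear in very different contexts.

Together, these layers make the overall approach more robust than any one safeguard on its own. Model behavior reduces the likelihood of harmful responses, real-time systems can intervene during generation, account-level review can identify broader patterns, and differentiated access preserves important defensive work without making the most sensitive capabilities broadly available by default.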
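As a rough illustration of how the real-time layer described above might hand off to a slower reviewer, here is a minimal Python sketch. Everything in it is hypothetical: the class names, the `classifier` and `reviewer` callables, the risk threshold, and the choice to count a flag only when review finds the output disallowed are assumptions made for illustration, not a description of OpenAI's actual system.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Verdict:
    delivered: bool  # whether the draft output reaches the user
    reason: str


@dataclass
class SafeguardPipeline:
    # Hypothetical real-time classifier: returns a risk score in [0, 1].
    classifier: Callable[[str], float]
    # Hypothetical larger reviewer: returns True if the output is disallowed.
    reviewer: Callable[[str, str], bool]
    # Hypothetical score at which generation is paused for review.
    escalate_at: float = 0.5
    # Per-account count of withheld outputs, a stand-in for account-level signals.
    flagged_accounts: dict[str, int] = field(default_factory=dict)

    def check(self, account_id: str, conversation: str, draft_output: str) -> Verdict:
        score = self.classifier(draft_output)
        if score < self.escalate_at:
            return Verdict(True, "passed real-time check")
        # Higher-risk case: pause and let the reviewer see the full context.
        if self.reviewer(conversation, draft_output):
            self.flagged_accounts[account_id] = self.flagged_accounts.get(account_id, 0) + 1
            return Verdict(False, "withheld after review")
        return Verdict(True, "released after review")


# Toy stand-ins so the sketch runs end to end.
pipeline = SafeguardPipeline(
    classifier=lambda text: 0.9 if "exploit" in text else 0.1,
    reviewer=lambda conversation, text: "full-chain" in text,
)
print(pipeline.check("acct-1", "ctx", "how to patch this bug"))
print(pipeline.check("acct-1", "ctx", "full-chain exploit for target"))
```

The point of the structure is that the cheap check runs on every output, while the expensive review runs only on escalated cases, which is also why some requests may take longer than others.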

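Especially during the preview, users may encounter safeguards that block or refuse some requests. Other requests may take longer because generation is paused for additional review. Safeguards may occasionally intervene on legitimate work, particularly in dual-use areas where defensive and offensive activity can initially look similar.

That is part of what the preview is designed to test. We want to understand not only whether the safeguards constrain misuse, but whether legitimate users can still complete normal work reliably and efficiently. Feedback during the preview will help us reduce unnecessary blocks and delays, improve how the safeguards interpret context, and create a smoother experience before wider release.

We are also working with enterprise customers on longer-term approaches, including privacy-preserving detection, customer-operated safety controls, and access calibrated to the risk of a customer, user, or workload, to advance safety while supporting enterprise privacy requirements.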

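Improving robustness with automated red-teaming

Safeguards also need to remain effective when attackers adapt their tactics. A protection that works only on a fixed set of known attacks is not robust enough for a frontier model.

That's why we are applying more intelligence and compute than ever before to safety, using our own models to find weaknesses and improve safeguards faster. We dedicated over 700,000 A100-equivalent GPU hours to automated red-teaming aimed at finding universal jailbreaks: attacks that can work across many prompts or contexts, not just one narrow setting. Focusing on these harder, more general attacks lets us test the safeguards beyond a fixed set of known failures. It also lets us explore far more attack patterns than human testing alone could cover, identify failure patterns earlier, and shorten the path from finding a weakness to addressing it.

In addition to automated red-teaming, we worked with third-party testers to conduct extensive human expert red-teaming, which will continue in the preview period. Human red-teaming complements the automated work by testing safeguards against creative experts trying to misuse the model in ways our systems might not anticipate.

No evaluation can represent every product configuration, multi-step attack, or real-world workflow. We therefore maintain a rapid-response process to reproduce, assess, prioritize, and remediate newly discovered jailbreaks, then add them to our ongoing evaluations so we can test against similar failures in the future.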

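Availability and pricing

During the preview, GPT-5.6 models will initially be available through the API and Codex to a select group of trusted partners and organizations. We plan to make them more broadly available to people using ChatGPT, Codex, and the API soon.

In the new naming system introduced with GPT-5.6, the number identifies a model's generation, while Sol, Terra, and Luna identify durable capability tiers that can advance on their own cadence. Together, the family gives people and developers clearer choices across intelligence, speed, and cost.

GPT-5.6 is priced per 1M tokens across three model sizes: Sol is $5 input / $30 output; Terra is $2.50 input / $15 output; and Luna is $1 input / $6 output. GPT-5.6 also introduces more predictable prompt caching, including support for explicit cache breakpoints and a 30-minute minimum cache life. For GPT-5.6 and later models, cache writes are billed at 1.25x the model's uncached input rate, while cache reads continue to receive the 90% cached-input discount.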
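To make the pricing arithmetic concrete, the following Python sketch estimates the cost of a request from the list prices quoted above. It is a back-of-the-envelope calculator, not an official billing formula: it assumes that cache-write tokens are billed at 1.25x in place of the normal input rate, and that the 90% discount on cache reads is taken off the uncached input rate. The function name and the token categories are illustrative.

```python
# Per-1M-token list prices quoted in the announcement (USD).
PRICES = {
    "sol": {"input": 5.00, "output": 30.00},
    "terra": {"input": 2.50, "output": 15.00},
    "luna": {"input": 1.00, "output": 6.00},
}

CACHE_WRITE_MULTIPLIER = 1.25  # cache writes: 1.25x the uncached input rate
CACHE_READ_DISCOUNT = 0.90     # cache reads: 90% off (assumed base: uncached input rate)


def estimate_cost(tier, uncached_input, cache_write, cache_read, output):
    """Estimate the USD cost of one request; all counts are in tokens."""
    price = PRICES[tier]
    millions = lambda tokens: tokens / 1_000_000
    return (
        millions(uncached_input) * price["input"]
        + millions(cache_write) * price["input"] * CACHE_WRITE_MULTIPLIER
        + millions(cache_read) * price["input"] * (1 - CACHE_READ_DISCOUNT)
        + millions(output) * price["output"]
    )


# First call writes a 100k-token prefix to the cache; a later call reads it back.
first = estimate_cost("terra", uncached_input=2_000, cache_write=100_000, cache_read=0, output=1_000)
later = estimate_cost("terra", uncached_input=2_000, cache_write=0, cache_read=100_000, output=1_000)
print(f"first call: ${first:.4f}, later call: ${later:.4f}")
```

Under these assumptions a cached prefix pays for itself on its first reuse: writing costs an extra 0.25x of the input rate, while each subsequent read saves 0.9x.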

We're also launching GPT-5.6 Sol on Cerebras at up to 750 tokens per second in July, bringing frontier intelligence to customers at unprecedented speed. Access will initially be limited to select customers as we expand capacity.
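For a sense of scale, the quoted throughput can be turned into a lower bound on generation time. The sketch below treats 750 tokens per second as a ceiling ("up to"), so the result is a best case; it ignores time to first token, tool calls, and queueing, none of which the announcement quantifies. The function name is illustrative.

```python
MAX_TOKENS_PER_SECOND = 750  # "up to" figure from the announcement: a ceiling, not a guarantee


def min_generation_seconds(output_tokens, tokens_per_second=MAX_TOKENS_PER_SECOND):
    """Best-case time to stream `output_tokens` at the given throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


# A 30,000-token response takes at least 40 seconds at the peak rate.
print(min_generation_seconds(30_000))  # → 40.0
```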

We're excited to continue learning from this preview period, and to bring GPT-5.6 Sol, Terra, and Luna to more people soon.

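We estimate latency and API cost by looking at the production behavior of our models and simulating offline. These estimates account for tool call details, sampled tokens, and input tokens. Real-world results may vary substantially and depend on many factors not captured in our simulation. We simulate latency at fast API speeds, and cost at regular API pricing.

All models are evaluated using the ExploitBench API harness with 5 seeds and reasoning continuity.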
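When a benchmark is run over several seeds, results are usually reported as a mean with some measure of spread. The sketch below shows one conventional way to summarize five per-seed scores in Python; the scores are made-up placeholders, and the announcement does not say how its per-seed results were aggregated.

```python
from statistics import mean, stdev


def summarize_seeds(scores):
    """Return the mean, sample standard deviation, and count of per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    return {"mean": mean(scores), "stdev": stdev(scores), "n": len(scores)}


# Hypothetical success rates from 5 seeds of the same evaluation.
hypothetical_scores = [0.40, 0.50, 0.60, 0.50, 0.50]
print(summarize_seeds(hypothetical_scores))
```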

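We ran ExploitGym on our alpha API, which outputs responses faster than our public API, and then rescaled the results to match our public API. Rescaling latencies to the speeds expected for our public API causes some estimated latencies to exceed the 2h and 6h time limits, even though the limits were correctly obeyed in the evaluation run. To get faster speeds for time-sensitive work, we offer priority processing in the API and fast mode in Codex.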
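The effect described here, where a run that respected its time limit appears to exceed it after rescaling, can be reproduced with a small Python sketch. It assumes latency scales linearly with the ratio of output speeds, which is a simplification of whatever rescaling was actually used; the speeds and durations are made-up numbers, since the announcement does not state the alpha-to-public speed ratio.

```python
def rescale_latency(observed_seconds, alpha_tps, public_tps):
    """Scale a latency measured on a faster API to a slower API's expected speed.

    Assumes latency is inversely proportional to tokens per second.
    """
    if alpha_tps <= 0 or public_tps <= 0:
        raise ValueError("speeds must be positive")
    return observed_seconds * (alpha_tps / public_tps)


TWO_HOURS = 2 * 60 * 60

# Hypothetical run: 1.5 hours on an alpha API that is twice as fast as the public one.
observed = 1.5 * 60 * 60
estimated = rescale_latency(observed, alpha_tps=200, public_tps=100)

print(observed <= TWO_HOURS)   # True: the run itself obeyed the 2h limit
print(estimated <= TWO_HOURS)  # False: the rescaled estimate is 3 hours
```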

Models without reported output tokens, latency, or cost are plotted as horizontal dotted lines.

