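MarkTechPost · June 26, 2026, 02:11 · approx. 8 min read

DeepReinforce Releases Ornith-1.0: An Open-Source Coding Model Family That Learns Its Own RL Scaffolds

#Ornith-1.0 #Reinforcement Learning (RL) #Coding Agents #Open-Source Models #Reasoning

TL;DR

Ornith-1.0, DeepReinforce's model family for coding agents, learns to write and optimize its own scaffold during training instead of relying on a fixed harness. Its largest model reports state-of-the-art results among open models of comparable size.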

AI Deep Analysis · June 25, 2026, 18:03


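Key Points

Autonomous scaffold learning

Rather than relying on a fixed, human-designed harness, the model co-evolves its scaffold (the execution framework around it) with its policy during reinforcement learning (RL).

A broad model lineup

Four sizes, from a 9B dense model to a 397B mixture-of-experts (MoE) model, are available on Hugging Face under the MIT license.

Strong reasoning performance and benchmark results

Ornith-1.0-397B outscores Claude Opus 4.7 on Terminal-Bench 2.1 and SWE-Bench Verified, placing it among the strongest models on complex coding tasks.

Safety measures and easy deployment

A three-layer defense (fixed trust boundary, deterministic monitor, frozen LLM judge) guards against reward hacking, and the models can be served locally with vLLM or SGLang.

Layered defense against reward hacking

To keep the model from reading test files or modifying verification scripts, training uses three layers: an immutable trust boundary, a deterministic monitor, and a veto from a frozen LLM judge.

Practical roles for large and small models

The 397B model targets maximum accuracy on long, multi-step tasks, while the 9B model suits low-latency, low-cost work on edge or single-GPU setups.

Comparison with competitors on benchmarks

Ornith-1.0-397B ranks second on SWE-Bench Verified, behind Claude Opus 4.8. On Terminal-Bench 2.1 it outscores open models of comparable size and some closed models.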


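Impact Analysis

This release could shift the design paradigm for coding agents from fixed frameworks designed by humans to dynamic frameworks that the model optimizes itself. Together with the reported gains at large scale and the strengthened safeguards, it is a notable step toward production-grade autonomous development tools.

Editorial Comment

Treating the scaffold itself as a learning target could substantially improve an agent's flexibility and adaptability. Keeping an increasingly complex training process safe will be the key to practical adoption.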

DeepReinforce has released Ornith-1.0, an open-source model family built for agentic coding. The lineup spans four sizes, from a 9B dense model to a 397B mixture-of-experts (MoE) flagship, and every checkpoint ships under the MIT license on Hugging Face. The models are post-trained on top of pretrained Gemma 4 and Qwen 3.5.

Most coding agents pair a model with a fixed, human-designed harness. Ornith-1.0 instead learns to write its own harness. The DeepReinforce research team reports state-of-the-art results among open models of comparable size.

TL;DR

Ornith-1.0 ships in 9B, 31B, 35B-MoE, and 397B-MoE sizes under the MIT license, built on Gemma 4 and Qwen 3.5.

The model learns its own scaffold during reinforcement learning (RL), jointly optimizing the harness and the solution.

Ornith-1.0-397B tops Claude Opus 4.7 on both headline benchmarks, but not Opus 4.8 or the larger GLM-5.2-744B.

Three layers (a fixed trust boundary, a deterministic monitor, and a frozen LLM judge) guard against reward hacking.

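What is Ornith-1.0?

Ornith-1.0 is a set of reasoning models tuned for coding agents. The variants are 9B Dense, 31B Dense, 35B MoE, and 397B MoE. The 35B model is a mixture-of-experts (MoE) model that activates roughly 3B parameters per token. FP8 and GGUF builds are also published for faster local inference.

Each model is a reasoning model. Replies open with a <think> block before the final answer. The serving recipes enable a reasoning parser, so the trace is returned in a separate reasoning_content field. The models also emit well-formed tool calls for agent loops.

Deployment is straightforward. The 9B model is about 19GB in bf16 and can be served on a single 80GB GPU. Serving recipes target vLLM, SGLang, and Transformers. Each model exposes an OpenAI-compatible endpoint, so standard agent frameworks work without code changes.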


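The Self-Scaffolding Idea

Most coding agents rely on a scaffold, also called a harness. A scaffold wraps the model with memory, tools, error handling, and orchestration logic. AI teams usually hand-design one scaffold per task category.

Ornith-1.0 treats the scaffold as a learnable object instead. During reinforcement learning, the scaffold co-evolves with the model's policy. Each RL step runs in two stages.

First, the model reads the task and its previous scaffold, then proposes a refined scaffold. Second, it uses that scaffold and the task to generate a solution rollout. Reward from the rollout flows back to both stages.

So the model is optimized to author orchestration, not just answers. Over training, higher-reward scaffolds are mutated and selected automatically, and per-task strategies emerge without a hand-engineered harness.

Training also runs asynchronously, using a pipeline-RL setup. A staleness weight downweights older, off-policy tokens and drops them past a threshold. The optimization uses a token-level GRPO objective.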

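Guarding Against Reward Hacking

Letting a model write its own scaffold invites reward hacking. For example, a scaffold could read visible test files and hardcode the expected outputs, or copy an oracle solution sitting in the environment. The DeepReinforce team describes three defense layers.

The outer trust boundary is fixed and immutable. The environment, tool surface, and test isolation stay outside the model's reach. The model evolves only the policy scaffold inside that boundary.

A deterministic monitor flags banned actions. Reading withheld paths or editing verification scripts earns zero reward, and those trajectories are excluded from the advantage computation.

A frozen LLM judge acts as a veto. It sits on top of the verifier rather than serving as the primary reward.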

Benchmark

DeepReinforce reports vendor numbers across several agentic coding benchmarks. At flagship scale, Ornith-1.0-397B posts 77.5 on Terminal-Bench 2.1 and 82.4 on SWE-Bench Verified. On SWE-Bench Verified, that 82.4 trails only Claude Opus 4.8 (87.6) among the listed models. On Terminal-Bench 2.1, the picture is more mixed.

With 77.5 on Terminal-Bench 2.1, Ornith-1.0-397B beats Claude Opus 4.7 (70.3). But it trails Claude Opus 4.8 (85) and the larger GLM-5.2-744B (81.0). So the "state-of-the-art" claim is scoped to open models of comparable size.

The smaller models carry the efficiency case. The 35B model scores 64.2 on Terminal-Bench 2.1, above Qwen 3.5-397B's 53.5. The 9B model reaches 43.1 on Terminal-Bench 2.1 and 69.4 on SWE-Bench Verified.

Benchmark | Ornith-1.0-397B | Qwen3.5-397B | Qwen3.7-Max | GLM-5.2-744B | Minimax-M3-428B | DeepSeek-V4-Pro-1.6T | Claude Opus 4.7 | Claude Opus 4.8
---|---|---|---|---|---|---|---|---
Terminal-Bench 2.1 | 77.5 | 53.5 | 73.5 | 81.0 | 64 | 64 | 70.3 | 85
SWE-Bench Verified | 82.4 | 76.4 | 80.4 | – | – | 80.6 | 80.8 | 87.6
SWE-Bench Pro | 62.2 | 51.6 | 60.6 | 62.1 | 59 | 55.4 | 64.3 | 69.2
SWE-Bench Multilingual | 78.9 | 69.3 | 78.3 | – | – | 76.2 | – | –
NL2Repo | 48.2 | 36.8 | 47.2 | 48.9 | 42.1 | – | – | 69.7
ClawEval Avg | 77.1 | 70.7 | 65.2 | – | – | 75.8 | 78.2 | –

Use Cases and a Quick Start

The models target terminal-native coding agents and repository-scale work. Practical fits include multi-file refactors, bug localization, and test-driven patches. The 9B model suits edge or single-GPU setups where latency and cost matter. The 397B model targets maximum accuracy on long, multi-step tasks.

For example, a developer can run the 9B model locally to triage a failing test suite. A platform team can self-host the 397B model for an internal coding agent.

Serving is a one-liner with vLLM:

```shell
vllm serve deepreinforce-ai/Ornith-1.0-9B \
  --served-model-name Ornith-1.0-9B \
  --max-model-len 262144 \
  --enable-auto-tool-choice --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3 \
  --trust-remote-code
```

Then call it with any OpenAI client:

```python
from openai import OpenAI

# Point the client at the local vLLM server; api_key is a placeholder value here.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Ornith-1.0-9B",
    messages=[{"role": "user", "content": "Write a Python is_prime(n)."}],
    temperature=0.6, top_p=0.95,
)

msg = resp.choices[0].message
print(getattr(msg, "reasoning_content", None))  # the trace
print(msg.content)  # the final answer
```

The reasoning trace returns in reasoning_content, with the answer in content. Recommended sampling is temperature=0.6, top_p=0.95, top_k=20. The model also plugs into OpenHands, OpenClaw, and OpenCode.


The post DeepReinforce Releases Ornith-1.0: An Open-Source Coding Model Family That Learns Its Own RL Scaffolds appeared first on MarkTechPost.

Original text

DeepReinforce has released Ornith-1.0, an open-source model family built for agentic coding. The lineup spans four sizes, from a 9B dense model to a 397B mixture-of-experts flagship. Every checkpoint ships under the MIT license on Hugging Face. The models are post-trained on top of pretrained Gemma 4 and Qwen 3.5.

Most coding agents pair a model with a fixed, human-designed harness. Ornith-1.0 instead learns to write its own. The DeepReinforce research team reports state-of-the-art results among open models of comparable size.

TL;DR

Ornith-1.0 ships in 9B, 31B, 35B-MoE, and 397B-MoE sizes under MIT, built on Gemma 4 and Qwen 3.5.

The model learns its own scaffold during RL, jointly optimizing the harness and the solution.

Ornith-1.0-397B tops Claude Opus 4.7 on both headline benchmarks, but not Opus 4.8 or the larger GLM-5.2-744B.

Three layers — fixed trust boundary, deterministic monitor, frozen LLM judge — guard against reward hacking.

What is Ornith-1.0?

Ornith-1.0 is a set of reasoning models tuned for coding agents. The variants are 9B Dense, 31B Dense, 35B MoE, and 397B MoE. The 35B model is mixture-of-experts and activates roughly 3B parameters per token. FP8 and GGUF builds are also published for faster local serving.

Each model is a reasoning model. Replies open with a <think> block before the final answer. The serving recipes enable a reasoning parser, so that trace returns in a separate reasoning_content field. The models also emit well-formed tool calls for agent loops.

Deployment is straightforward. The 9B model is about 19GB in bf16 and serves on a single 80GB GPU. Serving recipes target vLLM, SGLang, and Transformers. Each model exposes an OpenAI-compatible endpoint. Standard agent frameworks therefore work without code changes.
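As a rough sanity check on the 19GB figure: bf16 stores each parameter in 2 bytes, so weight memory is roughly the parameter count times two. The sketch below assumes exactly 9e9 parameters and decimal gigabytes. The published checkpoint is slightly larger because "9B" is a rounded label, and serving also needs KV-cache memory on top of the weights.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

# Assumption: a nominal 9e9 parameters; the true count is not given in the article.
bf16_gb = weight_memory_gb(9e9)      # bf16: 2 bytes per parameter
fp8_gb = weight_memory_gb(9e9, 1)    # FP8: 1 byte per parameter
print(f"bf16 ~{bf16_gb:.0f} GB, FP8 ~{fp8_gb:.0f} GB")
```

Under these assumptions the weights take a fraction of an 80GB card, which leaves the remainder for the KV cache at long context lengths.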


The Self-Scaffolding Idea

Most coding agents rely on a scaffold, also called a harness. A scaffold wraps the model with memory, tools, error handling, and orchestration logic. AI teams usually hand-design one scaffold per task category.

Ornith-1.0 treats the scaffold as a learnable object instead. During reinforcement learning, the scaffold co-evolves with the model’s policy. Each RL step runs in two stages.

First, the model reads the task and its previous scaffold. It then proposes a refined scaffold. Second, it uses that scaffold and the task to generate a solution rollout. Reward from the rollout flows back to both stages.

So the model is optimized to author orchestration, not just answers. Over training, higher-reward scaffolds are mutated and selected automatically. Per-task strategies emerge without hand-engineered harness design.
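The two-stage step described above can be sketched as follows. This is a minimal illustration, not DeepReinforce's training code: `rl_step`, `StepRecord`, and the three `toy_*` callables are hypothetical names, and the toy functions stand in for model calls and a verifier so the control flow runs end to end.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class StepRecord:
    scaffold: str    # stage-1 output: the refined scaffold
    solution: str    # stage-2 output: the solution rollout
    reward: float    # one scalar credited to both stages

def rl_step(task: str, prev_scaffold: str,
            propose: Callable[[str, str], str],
            solve: Callable[[str, str], str],
            verify: Callable[[str, str], float]) -> StepRecord:
    # Stage 1: read the task and the previous scaffold, propose a refinement.
    scaffold = propose(task, prev_scaffold)
    # Stage 2: generate a solution rollout conditioned on that scaffold.
    solution = solve(task, scaffold)
    # The rollout's reward is recorded once and attributed to both stages.
    reward = verify(task, solution)
    return StepRecord(scaffold, solution, reward)

# Toy stand-ins (assumptions, not real model or verifier calls).
def toy_propose(task: str, prev: str) -> str:
    return prev + " -> run tests before finishing"

def toy_solve(task: str, scaffold: str) -> str:
    return "patch+tests" if "run tests" in scaffold else "patch"

def toy_verify(task: str, solution: str) -> float:
    return 1.0 if "tests" in solution else 0.0

record = rl_step("fix failing parser test", "plan, edit",
                 toy_propose, toy_solve, toy_verify)
print(record)
```

In a real trainer the same reward would feed the policy update for both the scaffold tokens and the solution tokens, which is what lets scaffold quality improve alongside solution quality.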

Training also runs asynchronously, using a pipeline-RL setup. A staleness weight downweights older, off-policy tokens and drops them past a threshold. The optimization uses a token-level GRPO objective.
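To make the staleness weighting and the token-level objective concrete, here is a small sketch. The article does not give the weighting schedule or the exact loss, so the geometric decay, the `max_lag` cutoff, and the unclipped surrogate are all assumptions chosen for illustration. Only the group-relative advantage follows the standard GRPO recipe of normalizing each reward against its own sampling group.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward against its own group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def staleness_weight(lag, decay=0.5, max_lag=3):
    """Hypothetical schedule: geometric decay per policy version, zero past max_lag."""
    return decay ** lag if lag <= max_lag else 0.0

def token_level_objective(rollouts):
    """rollouts: list of (reward, policy_lag, per-token importance ratios).

    Averages over every kept token in the group instead of averaging per
    sequence first, so long rollouts are not down-weighted. Clipping is omitted.
    """
    advantages = group_advantages([reward for reward, _, _ in rollouts])
    total, kept_tokens = 0.0, 0
    for adv, (_, lag, ratios) in zip(advantages, rollouts):
        weight = staleness_weight(lag)
        if weight == 0.0:
            continue  # too stale: these tokens are dropped entirely
        for ratio in ratios:
            total += weight * ratio * adv
            kept_tokens += 1
    return total / kept_tokens if kept_tokens else 0.0

group = [
    (1.0, 0, [1.0, 1.0]),       # fresh, successful rollout
    (0.0, 1, [1.0]),            # one version old, failed rollout
    (1.0, 5, [1.0, 1.0, 1.0]),  # past the cutoff, contributes nothing
]
print(token_level_objective(group))
```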

Guarding Against Reward Hacking

Letting a model write its own scaffold invites reward hacking. A scaffold could read visible test files and hardcode expected outputs. It could also copy an oracle solution sitting in the environment. The DeepReinforce team describes three defense layers.

The outer trust boundary is fixed and immutable. The environment, tool surface, and test isolation stay outside the model’s reach. The model evolves only its inner policy scaffold.

A deterministic monitor flags banned actions. Reading withheld paths or editing verification scripts earns zero reward. Those trajectories are excluded from the advantage computation.

A frozen LLM judge acts as a veto. It sits on top of the verifier, not as the primary reward.
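The monitor and veto layers can be sketched as a reward wrapper. This is an illustration under stated assumptions: the path prefixes, the `Trajectory` fields, and the function names are hypothetical, and the judge's verdict is passed in as a boolean rather than computed. The article also does not say whether vetoed trajectories stay in the advantage computation, so keeping them is an assumption here.

```python
from dataclasses import dataclass, field

# Hypothetical locations; real withheld paths are environment-specific.
BANNED_READ_PREFIXES = ("/tests/hidden/", "/oracle/")
BANNED_WRITE_PREFIXES = ("/verify/",)

@dataclass
class Trajectory:
    reads: list = field(default_factory=list)
    writes: list = field(default_factory=list)
    verifier_reward: float = 0.0
    judge_veto: bool = False  # verdict from a frozen LLM judge, supplied externally

def violates_policy(traj: Trajectory) -> bool:
    """Deterministic monitor: plain prefix checks, no model in the loop."""
    return (any(p.startswith(BANNED_READ_PREFIXES) for p in traj.reads)
            or any(p.startswith(BANNED_WRITE_PREFIXES) for p in traj.writes))

def guarded_reward(traj: Trajectory):
    """Return (reward, keep_for_advantage)."""
    if violates_policy(traj):
        return 0.0, False   # zero reward, excluded from the advantage computation
    if traj.judge_veto:
        return 0.0, True    # the judge can only remove reward, never add it
    return traj.verifier_reward, True

clean = Trajectory(reads=["/repo/src/parser.py"], verifier_reward=1.0)
cheat = Trajectory(reads=["/tests/hidden/test_parser.py"], verifier_reward=1.0)
print(guarded_reward(clean), guarded_reward(cheat))
```

Keeping the monitor deterministic means a violation is decided by fixed rules that the policy cannot influence, while the judge only ever lowers a reward that the verifier already granted.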

Benchmark

DeepReinforce reports vendor numbers across several agentic coding benchmarks. At flagship scale, Ornith-1.0-397B posts 77.5 on Terminal-Bench 2.1 and 82.4 on SWE-Bench Verified. On SWE-Bench Verified, that 82.4 trails only Claude Opus 4.8 (87.6) among the listed models. On Terminal-Bench 2.1, the picture is more mixed.

Ornith-1.0-397B beats Claude Opus 4.7 (70.3) on Terminal-Bench 2.1. But it trails Claude Opus 4.8 (85) and the larger GLM-5.2-744B (81.0). So the ‘state-of-the-art’ claim is scoped to open models of comparable size.

The smaller models carry the efficiency case. The 35B model scores 64.2 on Terminal-Bench 2.1, above Qwen 3.5-397B’s 53.5. The 9B model reaches 43.1 on Terminal-Bench 2.1 and 69.4 on SWE-Bench Verified.

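Benchmark | Ornith-1.0-397B | Qwen3.5-397B | Qwen3.7-Max | GLM-5.2-744B | Minimax-M3-428B | DeepSeek-V4-Pro-1.6T | Claude Opus 4.7 | Claude Opus 4.8
---|---|---|---|---|---|---|---|---
Terminal-Bench 2.1 | 77.5 | 53.5 | 73.5 | 81.0 | 64 | 64 | 70.3 | 85
SWE-Bench Verified | 82.4 | 76.4 | 80.4 | – | – | 80.6 | 80.8 | 87.6
SWE-Bench Pro | 62.2 | 51.6 | 60.6 | 62.1 | 59 | 55.4 | 64.3 | 69.2
SWE-Bench Multilingual | 78.9 | 69.3 | 78.3 | – | – | 76.2 | – | –
NL2Repo | 48.2 | 36.8 | 47.2 | 48.9 | 42.1 | – | – | 69.7
ClawEval Avg | 77.1 | 70.7 | 65.2 | – | – | 75.8 | 78.2 | –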

Use Cases and a Quick Start

The models target terminal-native coding agents and repository-scale work. Practical fits include multi-file refactors, bug localization, and test-driven patches. The 9B model suits edge or single-GPU setups where latency and cost matter. The 397B model targets maximum accuracy on long, multi-step tasks.

For example, a dev can run the 9B model locally to triage a failing test suite. A platform team can self-host the 397B model for an internal coding agent.

Serving is a one-liner with vLLM:

```shell
vllm serve deepreinforce-ai/Ornith-1.0-9B \
  --served-model-name Ornith-1.0-9B \
  --max-model-len 262144 \
  --enable-auto-tool-choice --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3 \
  --trust-remote-code
```

Then call it with any OpenAI client:

```python
from openai import OpenAI

# Point the client at the local vLLM server; api_key is a placeholder value here.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Ornith-1.0-9B",
    messages=[{"role": "user", "content": "Write a Python is_prime(n)."}],
    temperature=0.6, top_p=0.95,
)

msg = resp.choices[0].message
print(getattr(msg, "reasoning_content", None))  # the <think> trace
print(msg.content)  # the final answer
```


