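TLDR AI · May 21, 2026 09:00 · about 16 min

How to Build an Agent from First Principles (15 minute read)

#Agent Training #Reinforcement Learning from Human Feedback #RLHF #Prompt Engineering

TL;DR

Mishra strips away abstraction layers such as TRL and Unsloth and shows that agent training reduces to a basic loop: prompt -> action -> environment -> reward -> gradient update.

AI deep analysis · May 22, 2026 00:06

Depth: 40%

Key points

Peeling back abstractions to expose the basic loop

The article removes the abstraction layers of existing frameworks such as TRL, Unsloth, and PRIME-RL and shows that the essence of agent training comes down to a simple iterative loop.

Pure-Python implementation demo

It builds a text-to-diagram agent in pure Python: the model outputs JSON actions that create and connect shapes, and those actions run against a validating canvas.

Multi-faceted reward function design

It introduces a composite reward function that combines JSON validity, schema conformance, layout quality, and semantic coverage of prompt keywords.

Impact analysis

The article asks developers to reconsider the industry's widespread reliance on black-box frameworks and to return to the underlying principles of agent training. In particular, dropping complex abstraction layers and controlling reward design and evaluation logic directly is offered as practical guidance for building more robust and predictable agent systems.

Editorial comment

For developers who tend to lean entirely on existing tools, the piece makes a strong case for returning to fundamentals and for the value of control.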

How to define an environment, generate teacher trajectories, fine-tune a student, and improve it with reinforcement learning.

Authors: Anshuman Mishra & GPT 5.5

May 20, 2026

[Editor’s note: Arguments are mine; writing and structure were refined with GPT 5.5. This is partly an experiment in using AI to help speed-write technical research blogs from rough notes, while keeping the actual taste, direction, and claims human.]

Most tutorials on post-training start too high up the stack. They begin with a framework: install this library, define this reward function, run this trainer, watch the reward curve move. That is useful once you already understand what is happening. It is less useful when you are trying to build a mental model of the whole system.

I find it more helpful to start lower down. Before there is a trainer, there is an environment. Before there is reinforcement learning, there is an action space. Before there is an agent, there is a policy producing actions that change some state of the world.

This post is an attempt to build that picture from first principles.

The example will be deliberately small: a text-to-diagram agent. The user asks for a simple diagram, and the model outputs structured JSON actions that create shapes on a canvas. You can think of this as a tiny version of a tldraw-style agent. Instead of clicking around a real editor, the model emits actions like “create a rectangle,” “add a label,” and “connect these two nodes.”

The goal is not to build the world’s best diagram agent. The goal is to understand the shape of agent training itself.

At a high level, the loop is:

prompt -> model action -> environment -> reward -> gradient update

Almost every agent-training system is a scaled-up version of this loop. A browser agent, a coding agent, a spreadsheet agent, a robotics planner, a math solver, and a diagramming agent all have the same basic structure. They differ in the environment, the action space, and the reward function.

This is the part that is easy to miss when using frameworks. Libraries like TRL, Unsloth, PRIME-RL, verl, OpenRLHF, or custom internal trainers are not magic. They are mostly infrastructure around this loop: batching, rollout generation, distributed inference, reward computation, logging, reference models, clipping, checkpointing, and scaling.

The conceptual core is much smaller.
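To make the loop concrete, here is a minimal sketch in pure Python. It is not the post's training code: the three-action "policy" (one logit per action), the `environment_reward` stand-in, and the learning rate are all assumptions chosen for illustration, and the update rule is plain REINFORCE without a baseline.

```python
import math
import random

# Hypothetical toy setup: three candidate "actions" for one fixed prompt.
ACTIONS = ["prose", "broken_json", "valid_json"]


def environment_reward(action: str) -> float:
    """Stand-in environment: only the valid JSON action earns reward."""
    return 1.0 if action == "valid_json" else 0.0


def softmax(logits: list[float]) -> list[float]:
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps: int = 500, lr: float = 0.5, seed: int = 0) -> list[float]:
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]  # the "policy" is just one logit per action
    for _ in range(steps):
        probs = softmax(logits)
        # prompt -> model action: sample from the current policy
        idx = rng.choices(range(len(ACTIONS)), weights=probs)[0]
        # environment -> reward
        r = environment_reward(ACTIONS[idx])
        # gradient update (REINFORCE):
        # d log pi(idx) / d logit_j = 1[j == idx] - probs[j]
        for j in range(len(logits)):
            grad_log_prob = (1.0 if j == idx else 0.0) - probs[j]
            logits[j] += lr * r * grad_log_prob
    return softmax(logits)


final_probs = train()
print(dict(zip(ACTIONS, (round(p, 3) for p in final_probs))))
```

Rollouts that earn zero reward leave the logits untouched, so all learning comes from the samples the environment accepts. Probability mass should shift toward `valid_json` as training proceeds.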

A language model is a distribution over sequences. When we use it as an agent, we are asking this distribution to produce actions instead of ordinary prose.

Given an observation, the model emits an action. In a chat model, the observation is the conversation and the action is the assistant message. In a browser agent, the observation might be the DOM and the action might be a click or a keystroke. In a coding agent, the observation might be the repository state and the action might be a patch. In a diagram agent, the observation is the user request and the action is a JSON object describing shapes and connections.

So the first question is not “which RL trainer should I use?”

The first question is: what is the environment?

The environment defines what actions are valid, what happens when those actions are executed, and how success is measured. In ordinary supervised fine-tuning, this environment is often implicit. We show the model examples of good behavior and ask it to imitate them. In reinforcement learning, the environment becomes explicit. The model tries something, the environment responds, and the reward function decides whether that attempt was good.
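Those three responsibilities (which actions are valid, what happens when they run, how success is measured) can be written down as a small interface. The sketch below is one possible shape, not an API from any framework: `StepResult`, `Environment`, `EchoEnvironment`, and `run_episode` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass
class StepResult:
    observation: str  # what the policy sees next
    reward: float     # how success is measured
    done: bool        # whether the episode is over


class Environment(Protocol):
    def reset(self, prompt: str) -> str: ...
    def step(self, action: str) -> StepResult: ...


class EchoEnvironment:
    """Toy environment: the only valid action is repeating the prompt exactly."""

    def __init__(self) -> None:
        self.prompt = ""

    def reset(self, prompt: str) -> str:
        self.prompt = prompt
        return prompt

    def step(self, action: str) -> StepResult:
        ok = action == self.prompt
        feedback = "accepted" if ok else "rejected: action does not match prompt"
        return StepResult(observation=feedback, reward=1.0 if ok else 0.0, done=True)


def run_episode(env: Environment, prompt: str, policy: Callable[[str], str]) -> float:
    """Single-turn episode: observe the prompt, act once, collect the reward."""
    observation = env.reset(prompt)
    result = env.step(policy(observation))
    return result.reward


print(run_episode(EchoEnvironment(), "hello", lambda obs: obs))    # policy that copies
print(run_episode(EchoEnvironment(), "hello", lambda obs: "bye"))  # policy that fails
```

The diagram environment built later in the post fits the same shape: the prompt is the observation, the JSON string is the action, and validation plus scoring produces the reward.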

For a diagram agent, a completion is not good merely because it sounds plausible. It is good if the JSON parses, the schema is valid, the canvas accepts the actions, the requested objects appear, the arrows connect the right nodes, and the final layout is understandable.

That is the core difference between ordinary chat fine-tuning and agent training. Agent training grounds the model’s output in an executable world.

Let us start with the smallest possible action space.

The model must return JSON with an actions array. Each action either creates a shape or connects two shapes. A valid completion might look like this:

```json
{
  "actions": [
    {
      "type": "create_shape",
      "id": "frontend",
      "shape": "rectangle",
      "x": 80,
      "y": 100,
      "w": 180,
      "h": 80,
      "text": "Frontend"
    },
    {
      "type": "create_shape",
      "id": "api",
      "shape": "rectangle",
      "x": 340,
      "y": 100,
      "w": 180,
      "h": 80,
      "text": "API"
    },
    {
      "type": "connect",
      "from": "frontend",
      "to": "api",
      "text": "request"
    }
  ]
}
```

This looks simple, but it already contains the essential structure of tool use. The model is no longer just generating text. It is generating instructions that another system will execute.

That changes the training problem. The model has to learn not just what to say, but what the environment will accept.

It has to create shapes before connecting them. It has to use stable IDs. It has to avoid duplicate IDs. It has to keep coordinates finite. It has to avoid invalid shape types. It has to emit parseable JSON. These are not philosophical details. They determine whether the policy can even enter the valid region of the action space.

This is why SFT is often necessary before RL. Before the model can optimize reward, it has to learn the language of the environment.

An environment needs only three things: an input prompt, an action format, and a reward function.

Here is a minimal pure-Python canvas environment. It supports rectangles, ellipses, diamonds, text blocks, and arrows. The point is not that this canvas is sophisticated. The point is that it gives us a deterministic world in which model outputs can succeed or fail.

```python
# env.py
from __future__ import annotations

import json
import math
from dataclasses import dataclass, field
from typing import Any


ALLOWED_SHAPES = {"rectangle", "ellipse", "diamond", "text"}


@dataclass
class Shape:
    id: str
    shape: str
    x: float
    y: float
    w: float
    h: float
    text: str = ""


@dataclass
class Arrow:
    source: str
    target: str
    text: str = ""


@dataclass
class Canvas:
    """Deterministic world state: shapes keyed by ID, plus arrows between them."""

    shapes: dict[str, Shape] = field(default_factory=dict)
    arrows: list[Arrow] = field(default_factory=list)

    def create_shape(self, action: dict[str, Any]) -> None:
        shape_id = require_str(action, "id")
        shape_type = require_str(action, "shape")
        if shape_type not in ALLOWED_SHAPES:
            raise ValueError(f"unknown shape type: {shape_type}")
        if shape_id in self.shapes:
            raise ValueError(f"duplicate shape id: {shape_id}")

        x = require_number(action, "x")
        y = require_number(action, "y")
        w = require_number(action, "w")
        h = require_number(action, "h")
        if w <= 0 or h <= 0:
            raise ValueError("shape width and height must be positive")
        if w > 1000 or h > 1000:
            raise ValueError("shape too large")

        self.shapes[shape_id] = Shape(
            id=shape_id,
            shape=shape_type,
            x=x,
            y=y,
            w=w,
            h=h,
            text=str(action.get("text", "")),
        )

    def connect(self, action: dict[str, Any]) -> None:
        # Both endpoints must already exist, so shapes have to be created first.
        source = require_str(action, "from")
        target = require_str(action, "to")
        if source not in self.shapes:
            raise ValueError(f"arrow source does not exist: {source}")
        if target not in self.shapes:
            raise ValueError(f"arrow target does not exist: {target}")
        if source == target:
            raise ValueError("arrow cannot connect a shape to itself")
        self.arrows.append(Arrow(source=source, target=target, text=str(action.get("text", ""))))

    def apply(self, action: dict[str, Any]) -> None:
        """Dispatch one action to its handler; unknown action types are rejected."""
        action_type = require_str(action, "type")
        if action_type == "create_shape":
            self.create_shape(action)
        elif action_type == "connect":
            self.connect(action)
        else:
            raise ValueError(f"unknown action type: {action_type}")


def require_str(obj: dict[str, Any], key: str) -> str:
    value = obj.get(key)
    if not isinstance(value, str) or not value:
        raise ValueError(f"{key} must be a non-empty string")
    return value


def require_number(obj: dict[str, Any], key: str) -> float:
    value = obj.get(key)
    # bool is a subclass of int, so reject it explicitly: JSON `true` is not a coordinate.
    if isinstance(value, bool) or not isinstance(value, (int, float)) or not math.isfinite(value):
        raise ValueError(f"{key} must be a finite number")
    return float(value)


def parse_actions(text: str) -> list[dict[str, Any]]:
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Fallback: the model may wrap JSON in prose or markdown, so try the
        # outermost {...} span. A second failure propagates as a ValueError.
        start = text.find("{")
        end = text.rfind("}")
        if start == -1 or end == -1 or end <= start:
            raise ValueError("model output does not contain JSON")
        data = json.loads(text[start : end + 1])

    # Guard against valid JSON that is not an object (for example a bare list).
    if not isinstance(data, dict):
        raise ValueError("top-level JSON must be an object")
    actions = data.get("actions")
    if not isinstance(actions, list):
        raise ValueError("missing actions array")
    if not actions:
        raise ValueError("actions array is empty")
    if len(actions) > 40:
        raise ValueError("too many actions")
    if not all(isinstance(action, dict) for action in actions):
        raise ValueError("each action must be an object")
    return actions


def validate_completion(text: str) -> tuple[Canvas | None, list[str]]:
    """Run a completion against a fresh canvas; return the canvas and any errors."""
    errors: list[str] = []
    canvas = Canvas()

    try:
        actions = parse_actions(text)
    except Exception as exc:
        return None, [str(exc)]

    # Keep executing after a failed action so every error is reported.
    for i, action in enumerate(actions):
        try:
            canvas.apply(action)
        except Exception as exc:
            errors.append(f"action {i}: {exc}")

    return canvas, errors
```

This already gives us the core of a verifier environment. The model emits text. The environment parses that text, executes the action sequence, and returns errors if something goes wrong.

A framework would make this more scalable. It would not make it conceptually different.

Once the environment can execute actions, we need to define success.

This is where most of the real difficulty lives. The trainer can only optimize the reward you give it. If the reward is mostly syntactic, the model will learn syntax. If the reward measures task satisfaction, the model has a chance of learning useful behavior. If the reward is brittle, the model will eventually find the brittleness.

For the toy diagram agent, we can combine a few signals. We can reward outputs that parse, actions that validate, layouts that do not overlap, labels that are present, arrows that connect objects, and labels that cover important words from the user request.

```python
# reward.py
from __future__ import annotations

import re

from env import Canvas, validate_completion


def score_layout(canvas: Canvas) -> float:
    if not canvas.shapes:
        return 0.0

    # Start from a perfect score and subtract for every overlapping pair.
    score = 1.0

    shapes = list(canvas.shapes.values())
    for i, a in enumerate(shapes):
        for b in shapes[i + 1 :]:
            ax2, ay2 = a.x + a.w, a.y + a.h
            bx2, by2 = b.x + b.w, b.y + b.h
            # Axis-aligned boxes overlap unless one lies strictly beside, above,
            # or below the other. Touching edges count as an overlap.
            overlap = not (ax2 < b.x or bx2 < a.x or ay2 < b.y or by2 < a.y)
            if overlap:
                score -= 0.15

    # Bonuses for labels and arrows. Because of the final clamp to [0, 1],
    # they can only offset overlap penalties.
    labeled = sum(1 for shape in shapes if shape.text.strip())
    score += 0.1 * min(labeled, 5)

    score += 0.1 * min(len(canvas.arrows), 5)

    return max(0.0, min(1.0, score))


def score_semantics(prompt: str, canvas: Canvas) -> float:
    # Crude keyword coverage: which prompt words show up in shape labels?
    prompt_words = set(re.findall(r"[a-zA-Z][a-zA-Z0-9_-]+", prompt.lower()))
    label_words: set[str] = set()
    for shape in canvas.shapes.values():
        label_words.update(re.findall(r"[a-zA-Z][a-zA-Z0-9_-]+", shape.text.lower()))

    # Treat words of four or more characters as "important".
    important = {w for w in prompt_words if len(w) >= 4}
    if not important:
        return 0.5

    coverage = len(important & label_words) / max(1, len(important))
    return max(0.0, min(1.0, coverage))


def reward(prompt: str, completion: str) -> float:
    canvas, errors = validate_completion(completion)
    # Any parse or execution error zeroes the reward.
    if errors or canvas is None:
        return 0.0

    validity = 1.0
    layout = score_layout(canvas)
    semantics = score_semantics(prompt, canvas)

    return 0.4 * validity + 0.3 * layout + 0.3 * semantics
```

This reward is intentionally imperfect. It will miss many things humans care about. It may overvalue labels. It may undervalue aesthetics. It may fail on synonyms. It may reward a diagram that contains the right words but has the wrong structure.

That is fine for a toy environment. In fact, it is useful because it exposes the central problem.
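One way to see that problem is to probe the keyword-coverage idea in isolation. The sketch below is a simplified stand-in for `score_semantics`: it takes a plain list of label strings instead of a `Canvas`, and the function name `keyword_coverage` and the example prompt are assumptions made for illustration.

```python
import re

WORD = re.compile(r"[a-zA-Z][a-zA-Z0-9_-]+")


def keyword_coverage(prompt: str, labels: list[str]) -> float:
    """Fraction of prompt words (4+ characters) that appear in any label."""
    important = {w for w in WORD.findall(prompt.lower()) if len(w) >= 4}
    if not important:
        return 0.5
    label_words: set[str] = set()
    for label in labels:
        label_words.update(WORD.findall(label.lower()))
    return len(important & label_words) / len(important)


prompt = "Draw a frontend that sends a request to the API server"

# A sensible two-box diagram covers only some of the prompt words.
honest = keyword_coverage(prompt, ["Frontend", "API Server"])

# A degenerate diagram: one box whose label is the whole prompt.
stuffed = keyword_coverage(prompt, [prompt])

print(f"honest={honest:.2f} stuffed={stuffed:.2f}")
```

The degenerate diagram scores higher on this term than the sensible one. A policy trained against it has an incentive to copy the prompt into a label, which is the kind of brittleness an optimizer tends to find.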

The hard part of RL for agents is usually not the policy-gradient equation. The hard part is constructing an environment where reward is correlated with the behavior you actually want.

A weak reward for this task would be:

1 if JSON parses, else 0

That teaches the model to be valid, but not useful.

A stronger reward might look like this:

```python
reward = (
    0.25 * parses_as_json
    + 0.20 * schema_valid
    + 0.20 * renderer_accepts
    + 0.15 * requested_entities_present
    + 0.10 * arrows_connect_expected_entities
    + 0.10 * layout_quality
)
```

For a real tldraw-like agent, some of these can be checked with code. You can validate the schema, execute actions inside the real editor, inspect final shapes, check arrow bindings, count overlaps, and export a screenshot. Other parts may require a judge model. You might ask a VLM or LLM judge whether the screenshot satisfies the user’s intent.

But even then, the judge should be treated as a noisy component of the environment, not as an oracle. Log its judgments. Inspect failures. Compare against human review. Add negative tests. Assume the model will eventually exploit whatever reward signal is easiest to exploit.
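Comparing the judge against human review can start as simple bookkeeping. The following sketch assumes each audited sample carries a boolean judge verdict and a boolean human verdict. The `JudgedSample` record and the `audit_judge` helper are hypothetical names, and the four samples are made-up illustration data rather than results.

```python
from dataclasses import dataclass


@dataclass
class JudgedSample:
    sample_id: str
    judge_pass: bool  # verdict from the (hypothetical) VLM/LLM judge
    human_pass: bool  # verdict from a human reviewer on the same sample


def audit_judge(samples: list[JudgedSample]) -> dict[str, float]:
    """Summarize how often the judge agrees with human review."""
    if not samples:
        raise ValueError("need at least one audited sample")
    n = len(samples)
    agree = sum(s.judge_pass == s.human_pass for s in samples)
    false_pass = sum(s.judge_pass and not s.human_pass for s in samples)
    false_fail = sum(s.human_pass and not s.judge_pass for s in samples)
    return {
        "agreement": agree / n,
        "false_pass_rate": false_pass / n,
        "false_fail_rate": false_fail / n,
    }


audit = audit_judge([
    JudgedSample("a", True, True),
    JudgedSample("b", True, False),  # judge fooled: a candidate negative test
    JudgedSample("c", False, False),
    JudgedSample("d", True, True),
])
print(audit)
```

False passes deserve the most attention, because they are the samples where the policy collected reward for something a human would reject.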

If we start RL from a model that cannot produce valid actions, almost every rollout receives zero reward.

This is not merely an optimization inconvenience. It is a state-distribution problem. Before SFT, the model’s policy places most of its probability mass outside the valid region of the environment. It writes explanations, markdown, malformed JSON, invalid IDs, impossible connections, or plausible-looking actions that the canvas rejects. The environment returns zero. The gradient has very little useful information.
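A small worked example shows why an all-zero batch carries no signal. It assumes a policy-gradient estimator that weights each rollout by its advantage over the batch-mean reward. That baseline is an assumption for illustration, since the post does not commit to a specific estimator, and the reward values are invented.

```python
def mean_baseline_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each rollout relative to the batch-mean reward."""
    if not rewards:
        return []
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


# Before SFT: every rollout is rejected, so every advantage is zero and a
# gradient step weighted by these advantages changes nothing.
before_sft = mean_baseline_advantages([0.0, 0.0, 0.0, 0.0])

# After SFT: some rollouts validate, so the batch distinguishes good from bad.
after_sft = mean_baseline_advantages([0.0, 0.7, 0.0, 0.9])

print(before_sft)
print([round(a, 2) for a in after_sft])
```

The same conclusion holds without a baseline: if every reward is zero, every term in the gradient estimate is multiplied by zero.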

This is why teacher trajectories matter.

A stronger model, such as Gemini, can generate examples of the action language. We can sample multiple completions, validate them, keep the ones that work, and turn them into an SFT dataset. In a single-turn environment, a trajectory is just:

observation: user prompt

action: JSON actions

reward: validation score

In a multi-turn environment, it becomes:

obs_0 -> action_0 -> obs_1 -> action_1 -> obs_2 -> reward

For a richer tldraw-style setup, the trajectory might contain the user request, the visible canvas state, the selected shapes, the model’s action batch, the validator output, the final screenshot, and the reward.
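As a data structure, a trajectory can be as plain as a list of observation/action pairs plus a final reward. The sketch below is one possible representation. `Step`, `Trajectory`, and `to_sft_examples` are hypothetical names, and the observations, actions, and reward values are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    observation: str  # what the policy saw (prompt, canvas state, ...)
    action: str       # what the policy emitted (a JSON action batch)


@dataclass
class Trajectory:
    steps: list[Step] = field(default_factory=list)
    reward: float = 0.0  # assigned once, at the end of the episode

    def to_sft_examples(self) -> list[dict[str, str]]:
        """One (prompt, completion) pair per step, for imitation training."""
        return [{"prompt": s.observation, "completion": s.action} for s in self.steps]


box = '{"type": "create_shape", "id": "api", "shape": "rectangle", "x": 0, "y": 0, "w": 100, "h": 60, "text": "API"}'

# Single-turn: one observation, one action, one reward.
single_turn = Trajectory(
    steps=[Step("Draw an API box", '{"actions": [' + box + "]}")],
    reward=0.8,
)

# Multi-turn: obs_0 -> action_0 -> obs_1 -> action_1 -> reward.
multi_turn = Trajectory(
    steps=[
        Step("Draw an API box", '{"actions": [' + box + "]}"),
        Step("canvas: 1 shape (api); user: add a database", '{"actions": []}'),
    ],
    reward=0.0,  # the second action batch is empty, so this episode fails
)

print(len(single_turn.to_sft_examples()), len(multi_turn.to_sft_examples()))
```

Only trajectories that clear a reward threshold would normally be converted into SFT examples, so the failed multi-turn episode here would be filtered out.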

The key point is that teacher generation is not magic. It is sampling from a stronger policy, executing the output in the environment, and keeping the traces that survive.

Here is a minimal sketch using Gemini structured output.

```python
# teacher_generate.py
from __future__ import annotations

import json
from pathlib import Path
from typing import Literal

from google import genai
from pydantic import BaseModel, Field

from reward import reward
from env import validate_completion


class CreateShape(BaseModel):
    type: Literal["create_shape"]
    id: str
    shape: Literal["rectangle", "ellipse", "diamond", "text"]
    x: float
    y: float
    w: float
    h: float
    text: str = ""


class Connect(BaseModel):
    ...
```

This produces the dataset we need for the first phase. The teacher moves the student into the valid action manifold. It does not solve the whole task. It gives RL somewhere useful to start.
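The sample, validate, and keep procedure can also be sketched independently of any model API. In the version below the teacher and the scorer are passed in as plain callables. `collect_sft_examples`, `to_jsonl`, `fake_teacher`, and `fake_score` are hypothetical names, and the fakes exist only so the sketch runs on its own. A real run would presumably call the teacher model and the `reward` function from `reward.py` instead.

```python
import json
from typing import Callable


def collect_sft_examples(
    prompts: list[str],
    sample_teacher: Callable[[str], str],
    score: Callable[[str, str], float],
    samples_per_prompt: int = 4,
    min_reward: float = 0.5,
) -> list[dict[str, object]]:
    """Sample teacher completions and keep only those the environment rewards."""
    kept: list[dict[str, object]] = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            completion = sample_teacher(prompt)
            r = score(prompt, completion)
            if r >= min_reward:  # surviving traces become SFT data
                kept.append({"prompt": prompt, "completion": completion, "reward": r})
    return kept


def to_jsonl(examples: list[dict[str, object]]) -> str:
    """Serialize examples as one JSON object per line."""
    return "\n".join(json.dumps(e) for e in examples)


# Stand-ins for the teacher and the reward function (assumptions, not real APIs).
_canned = iter(["not json", '{"actions": [{"type": "noop"}]}'])


def fake_teacher(prompt: str) -> str:
    return next(_canned)


def fake_score(prompt: str, completion: str) -> float:
    try:
        json.loads(completion)
    except json.JSONDecodeError:
        return 0.0
    return 1.0


dataset = collect_sft_examples(["draw a box"], fake_teacher, fake_score, samples_per_prompt=2)
print(to_jsonl(dataset))
```

Of the two sampled completions, only the one that parses survives the filter. The threshold `min_reward` is a tunable assumption: a higher value yields a cleaner but smaller dataset.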

この記事をシェア

TLDR AI重要度42026年7月8日 09:00

最終トークン選好最適化によるドゥームループの削減

TLDR AI重要度42026年7月8日 09:00

マイクロソフトの真の AI ストラテジーはチャットボートではない（8 分読了）

TLDR AI2026年7月8日 09:00

MiniMax M3：スパースアテンションが長期ホライズンエージェントを現実的なものにする方法（11 分読了）

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む

TLDR AI·2026年5月21日 09:00·約16分

第一原理からエージェントを構築する方法（15 分読了）

#Agent Training #Reinforcement Learning from Human Feedback #RLHF #Prompt Engineering

TL;DR

AI深層分析2026年5月22日 00:06

重要/ 5段階

深度40%

キーポイント

抽象化の剥離と基本ループの可視化

TRL、Unsloth、PRIME-RL などの既存フレームワークの抽象層を除去し、エージェント学習の本質が単純な反復ループに帰着することを示した。

純粋 Python による実装デモ

多面的な報酬関数の設計

JSON の妥当性、スキーマ準拠、レイアウト品質、およびプロンプトキーワードの意味的カバレッジを組み合わせた複合的な報酬関数を導入した。

影響分析・編集コメントを表示

影響分析

編集コメント

既存のツールに頼りきりになりがちな開発者に対し、基礎原理への回帰と制御性の重要性を強く訴える内容です。

環境の定義方法、教師の軌道の生成法、学生のファインチューニング手法、そして強化学習による改善方法を解説します。

著者：Anshuman Mishra & GPT 5.5

2026年5月20日

本稿は、その像を第一原理（ファーストプリンシプル）から構築しようとする試みです。

目標は世界最高の図表エージェントを構築することではありません。真の目的は、エージェントトレーニングそのものの形状を理解することです。

高レベルでは、このループが成り立ちます：

プロンプト -> モデルのアクション -> 環境 -> リワード -> グラディエント更新

概念的な核心部分は、はるかに小さく単純です。

したがって、最初の質問は「どの強化学習トレーナーを使うべきか？」ではありません。

最初の質問は、「環境とは何か？」です。

最も小さな行動空間から始めましょう。

json

{

"actions": [

{

"type": "create_shape",

"id": "frontend",

"shape": "rectangle",

"x": 80,

"y": 100,

"w": 180,

"h": 80,

"text": "Frontend"

{

"type": "create_shape",

"id": "api",

"shape": "rectangle",

"x": 340,

"y": 100,

"w": 180,

"h": 80,

"text": "API"

{

"type": "connect",

"from": "frontend",

"to": "api",

"text": "request"

}

]

}

これが学習問題を根本的に変えます。モデルは言うべきことを学ぶだけでなく、環境が受け入れるものを学ぶ必要があります。

環境には、入力プロンプト、アクション形式、報酬関数の 3 つのものだけで十分です。

python

env.py

from __future__ import annotations

import json

import math

from dataclasses import dataclass, field

from typing import Any

ALLOWED_SHAPES = {"rectangle", "ellipse", "diamond", "text"}

@dataclass

class Shape:

id: str

shape: str

x: float

y: float

w: float

h: float

text: str = ""

@dataclass

class Arrow:

source: str

target: str

text: str = ""

@dataclass

class Canvas:

shapes: dict[str, Shape] = field(default_factory=dict)

arrows: list[Arrow] = field(default_factory=list)

def create_shape(self, action: dict[str, Any]) -> None:

shape_id = require_str(action, "id")

shape_type = require_str(action, "shape")

if shape_type not in ALLOWED_SHAPES:

raise ValueError(f"unknown shape type: {shape_type}")

if shape_id in self.shapes:

raise ValueError(f"duplicate shape id: {shape_id}")

{"translation": "翻訳全文"}

        x = require_number(action, "x")
        y = require_number(action, "y")
        w = require_number(action, "w")
        h = require_number(action, "h")
        if w <= 0 or h <= 0:
            raise ValueError("shape width and height must be positive")
        if w > 1000 or h > 1000:
            raise ValueError("shape too large")

        self.shapes[shape_id] = Shape(
            id=shape_id,
            shape=shape_type,
            x=x,
            y=y,
            w=w,
            h=h,
            text=str(action.get("text", "")),
        )

    def connect(self, action: dict[str, Any]) -> None:
        source = require_str(action, "from")
        target = require_str(action, "to")
        if source not in self.shapes:
            raise ValueError(f"arrow source does not exist: {source}")
        if target not in self.shapes:
            raise ValueError(f"arrow target does not exist: {target}")
        if source == target:
            raise ValueError("arrow cannot connect a shape to itself")
        self.arrows.append(Arrow(source=source, target=target, text=str(action.get("text", ""))))

    def apply(self, action: dict[str, Any]) -> None:
        action_type = require_str(action, "type")
        if action_type == "create_shape":
            self.create_shape(action)
        elif action_type == "connect":
            self.connect(action)
        else:
            raise ValueError(f"unknown action type: {action_type}")


def require_str(obj: dict[str, Any], key: str) -> str:
    value = obj.get(key)
    if not isinstance(value, str) or not value:
        raise ValueError(f"{key} must be a non-empty string")
    return value


def require_number(obj: dict[str, Any], key: str) -> float:
    value = obj.get(key)
    if not isinstance(value, int | float) or not math.isfinite(value):
        raise ValueError(f"{key} must be a finite number")
    return float(value)


def parse_actions(text: str) -> list[dict[str, Any]]:
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span when the model wraps JSON in prose.
        start = text.find("{")
        end = text.rfind("}")
        if start == -1 or end == -1 or end <= start:
            raise ValueError("model output does not contain JSON")
        data = json.loads(text[start : end + 1])

    actions = data.get("actions")
    if not isinstance(actions, list):
        raise ValueError("missing actions array")
    if not actions:
        raise ValueError("actions array is empty")
    if len(actions) > 40:
        raise ValueError("too many actions")
    if not all(isinstance(action, dict) for action in actions):
        raise ValueError("each action must be an object")
    return actions


def validate_completion(text: str) -> tuple[Canvas | None, list[str]]:
    errors: list[str] = []
    canvas = Canvas()

    try:
        actions = parse_actions(text)
    except Exception as exc:
        return None, [str(exc)]

    for i, action in enumerate(actions):
        try:
            canvas.apply(action)
        except Exception as exc:
            errors.append(f"action {i}: {exc}")

    return canvas, errors

A framework would make this more scalable. It would not, however, make it conceptually different.

Once the environment can execute actions, the next step is to define success.

python

# reward.py
from __future__ import annotations

import re

from env import Canvas, validate_completion


def score_layout(canvas: Canvas) -> float:
    if not canvas.shapes:
        return 0.0

    score = 1.0

    # Penalize every pair of shapes whose bounding boxes overlap.
    shapes = list(canvas.shapes.values())
    for i, a in enumerate(shapes):
        for b in shapes[i + 1 :]:
            ax2, ay2 = a.x + a.w, a.y + a.h
            bx2, by2 = b.x + b.w, b.y + b.h
            overlap = not (ax2 < b.x or bx2 < a.x or ay2 < b.y or by2 < a.y)
            if overlap:
                score -= 0.15

    labeled = sum(1 for shape in shapes if shape.text.strip())
    score += 0.1 * min(labeled, 5)

    score += 0.1 * min(len(canvas.arrows), 5)

    return max(0.0, min(1.0, score))


def score_semantics(prompt: str, canvas: Canvas) -> float:
    prompt_words = set(re.findall(r"[a-zA-Z][a-zA-Z0-9_-]+", prompt.lower()))
    label_words: set[str] = set()
    for shape in canvas.shapes.values():
        label_words.update(re.findall(r"[a-zA-Z][a-zA-Z0-9_-]+", shape.text.lower()))

    # Treat prompt words of four or more characters as the ones that matter.
    important = {w for w in prompt_words if len(w) >= 4}
    if not important:
        return 0.5

    coverage = len(important & label_words) / max(1, len(important))
    return max(0.0, min(1.0, coverage))


def reward(prompt: str, completion: str) -> float:
    canvas, errors = validate_completion(completion)
    if errors or canvas is None:
        return 0.0

    validity = 1.0
    layout = score_layout(canvas)
    semantics = score_semantics(prompt, canvas)

    return 0.4 * validity + 0.3 * layout + 0.3 * semantics

That is fine for a toy environment. In fact, it is useful because it exposes the central problem.

A weak reward for this task would be:

1 if JSON parses, else 0

That teaches the model to be valid, but not useful.

A stronger reward might look like this:

python

reward = (
    0.25 * parses_as_json
    + 0.20 * schema_valid
    + 0.20 * renderer_accepts
    + 0.15 * requested_entities_present
    + 0.10 * arrows_connect_expected_entities
    + 0.10 * layout_quality
)

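If we start reinforcement learning (RL) from a model that cannot produce valid actions, almost every rollout receives zero reward.

This is why teacher trajectories matter.

observation: user prompt

action: JSON actions

reward: validation score

In a multi-turn environment, it becomes:

obs_0 -> action_0 -> obs_1 -> action_1 -> obs_2 -> reward

Here is a minimal sketch using Gemini structured output.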

python

# teacher_generate.py
from __future__ import annotations

import json
from pathlib import Path
from typing import Literal

from google import genai
from pydantic import BaseModel, Field

from reward import reward
from env import validate_completion


class CreateShape(BaseModel):
    type: Literal["create_shape"]
    id: str
    shape: Literal["rectangle", "ellipse", "diamond", "text"]
    x: float
    y: float
    w: float
    h: float
    text: str = ""


class Connect(BaseModel):
    ...  # the source excerpt is truncated at this point

Original text

How to define an environment, generate teacher trajectories, fine-tune a student, and improve it with reinforcement learning.

Authors: Anshuman Mishra & GPT 5.5

May 20, 2026

This post is an attempt to build that picture from first principles.

The goal is not to build the world’s best diagram agent. The goal is to understand the shape of agent training itself.

At a high level, the loop is:

prompt -> model action -> environment -> reward -> gradient update
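The loop above can be sketched end to end in a few lines of pure Python. This is a toy under stated assumptions, not the author's training code: the two-action space, the `environment_reward` function, and the plain REINFORCE update on two logits are all illustrative stand-ins for a real model and environment.

```python
import math
import random

ACTIONS = ["valid_json", "broken_json"]  # hypothetical two-action toy space


def softmax(logits: list[float]) -> list[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def environment_reward(action: str) -> float:
    # Toy environment: only the well-formed action is accepted.
    return 1.0 if action == "valid_json" else 0.0


def train(steps: int = 500, lr: float = 0.1, seed: int = 0) -> list[float]:
    rng = random.Random(seed)
    logits = [0.0, 0.0]  # the "policy": one logit per action
    for _ in range(steps):
        probs = softmax(logits)
        # prompt -> model action: sample from the current policy
        idx = rng.choices(range(len(ACTIONS)), weights=probs)[0]
        # environment -> reward
        r = environment_reward(ACTIONS[idx])
        # gradient update (REINFORCE): d log pi(a) / d logit_j = 1[j == a] - p_j
        for j in range(len(logits)):
            grad = (1.0 if j == idx else 0.0) - probs[j]
            logits[j] += lr * r * grad
    return softmax(logits)


final_probs = train()
```

After training, `final_probs[0]` is the probability the toy policy assigns to the accepted action. Everything a framework adds (batching, distributed rollouts, clipping) wraps this same sample, score, update cycle.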

The conceptual core is much smaller.

A language model is a distribution over sequences. When we use it as an agent, we are asking this distribution to produce actions instead of ordinary prose.

So the first question is not “which RL trainer should I use?”

The first question is: what is the environment?

That is the core difference between ordinary chat fine-tuning and agent training. Agent training grounds the model’s output in an executable world.

Let us start with the smallest possible action space.

The model must return JSON with an actions array. Each action either creates a shape or connects two shapes. A valid completion might look like this:

json

{
  "actions": [
    {
      "type": "create_shape",
      "id": "frontend",
      "shape": "rectangle",
      "x": 80,
      "y": 100,
      "w": 180,
      "h": 80,
      "text": "Frontend"
    },
    {
      "type": "create_shape",
      "id": "api",
      "shape": "rectangle",
      "x": 340,
      "y": 100,
      "w": 180,
      "h": 80,
      "text": "API"
    },
    {
      "type": "connect",
      "from": "frontend",
      "to": "api",
      "text": "request"
    }
  ]
}

This looks simple, but it already contains the essential structure of tool use. The model is no longer just generating text. It is generating instructions that another system will execute.

That changes the training problem. The model has to learn not just what to say, but what the environment will accept.

This is why SFT is often necessary before RL. Before the model can optimize reward, it has to learn the language of the environment.

An environment needs only three things: an input prompt, an action format, and a reward function.

python

# env.py
from __future__ import annotations

import json
import math
from dataclasses import dataclass, field
from typing import Any


ALLOWED_SHAPES = {"rectangle", "ellipse", "diamond", "text"}


@dataclass
class Shape:
    id: str
    shape: str
    x: float
    y: float
    w: float
    h: float
    text: str = ""


@dataclass
class Arrow:
    source: str
    target: str
    text: str = ""


@dataclass
class Canvas:
    shapes: dict[str, Shape] = field(default_factory=dict)
    arrows: list[Arrow] = field(default_factory=list)

    def create_shape(self, action: dict[str, Any]) -> None:
        shape_id = require_str(action, "id")
        shape_type = require_str(action, "shape")
        if shape_type not in ALLOWED_SHAPES:
            raise ValueError(f"unknown shape type: {shape_type}")
        if shape_id in self.shapes:
            raise ValueError(f"duplicate shape id: {shape_id}")

        x = require_number(action, "x")
        y = require_number(action, "y")
        w = require_number(action, "w")
        h = require_number(action, "h")
        if w <= 0 or h <= 0:
            raise ValueError("shape width and height must be positive")
        if w > 1000 or h > 1000:
            raise ValueError("shape too large")

        self.shapes[shape_id] = Shape(
            id=shape_id,
            shape=shape_type,
            x=x,
            y=y,
            w=w,
            h=h,
            text=str(action.get("text", "")),
        )

    def connect(self, action: dict[str, Any]) -> None:
        source = require_str(action, "from")
        target = require_str(action, "to")
        if source not in self.shapes:
            raise ValueError(f"arrow source does not exist: {source}")
        if target not in self.shapes:
            raise ValueError(f"arrow target does not exist: {target}")
        if source == target:
            raise ValueError("arrow cannot connect a shape to itself")
        self.arrows.append(Arrow(source=source, target=target, text=str(action.get("text", ""))))

    def apply(self, action: dict[str, Any]) -> None:
        action_type = require_str(action, "type")
        if action_type == "create_shape":
            self.create_shape(action)
        elif action_type == "connect":
            self.connect(action)
        else:
            raise ValueError(f"unknown action type: {action_type}")


def require_str(obj: dict[str, Any], key: str) -> str:
    value = obj.get(key)
    if not isinstance(value, str) or not value:
        raise ValueError(f"{key} must be a non-empty string")
    return value


def require_number(obj: dict[str, Any], key: str) -> float:
    value = obj.get(key)
    if not isinstance(value, int | float) or not math.isfinite(value):
        raise ValueError(f"{key} must be a finite number")
    return float(value)


def parse_actions(text: str) -> list[dict[str, Any]]:
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        start = text.find("{")
        end = text.rfind("}")
        if start == -1 or end == -1 or end <= start:
            raise ValueError("model output does not contain JSON")
        data = json.loads(text[start : end + 1])

    actions = data.get("actions")
    if not isinstance(actions, list):
        raise ValueError("missing actions array")
    if not actions:
        raise ValueError("actions array is empty")
    if len(actions) > 40:
        raise ValueError("too many actions")
    if not all(isinstance(action, dict) for action in actions):
        raise ValueError("each action must be an object")
    return actions


def validate_completion(text: str) -> tuple[Canvas | None, list[str]]:
    errors: list[str] = []
    canvas = Canvas()

    try:
        actions = parse_actions(text)
    except Exception as exc:
        return None, [str(exc)]

    for i, action in enumerate(actions):
        try:
            canvas.apply(action)
        except Exception as exc:
            errors.append(f"action {i}: {exc}")

    return canvas, errors

This already gives us the core of a verifier environment. The model emits text. The environment parses that text, executes the action sequence, and returns errors if something goes wrong.
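The "parses that text" step deserves a closer look, because models often wrap JSON in prose. The sketch below isolates the fallback used in `parse_actions`: try a strict parse first, then retry on the outermost `{...}` span. The helper name `extract_json_object` and the extra top-level type check are assumptions added for illustration; they are not part of the original `env.py`.

```python
import json


def extract_json_object(text: str) -> dict:
    """Parse text as JSON, falling back to the outermost {...} span."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            raise ValueError("model output does not contain JSON")
        data = json.loads(text[start : end + 1])
    # Added check (not in env.py): reject top-level arrays, numbers, and strings.
    if not isinstance(data, dict):
        raise ValueError("top-level JSON value must be an object")
    return data


chatty = (
    "Sure! Here is the diagram:\n"
    '{"actions": [{"type": "connect", "from": "a", "to": "b"}]}\n'
    "Done."
)
parsed = extract_json_object(chatty)
```

The fallback is deliberately forgiving. A stricter environment could reject any completion that is not pure JSON, which would push the policy toward cleaner output at the cost of sparser reward early on.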

A framework would make this more scalable. It would not make it more conceptually different.

Once the environment can execute actions, we need to define success.

python

# reward.py
from __future__ import annotations

import re

from env import Canvas, validate_completion


def score_layout(canvas: Canvas) -> float:
    if not canvas.shapes:
        return 0.0

    score = 1.0

    shapes = list(canvas.shapes.values())
    for i, a in enumerate(shapes):
        for b in shapes[i + 1 :]:
            ax2, ay2 = a.x + a.w, a.y + a.h
            bx2, by2 = b.x + b.w, b.y + b.h
            overlap = not (ax2 < b.x or bx2 < a.x or ay2 < b.y or by2 < a.y)
            if overlap:
                score -= 0.15

    labeled = sum(1 for shape in shapes if shape.text.strip())
    score += 0.1 * min(labeled, 5)

    score += 0.1 * min(len(canvas.arrows), 5)

    return max(0.0, min(1.0, score))


def score_semantics(prompt: str, canvas: Canvas) -> float:
    prompt_words = set(re.findall(r"[a-zA-Z][a-zA-Z0-9_-]+", prompt.lower()))
    label_words: set[str] = set()
    for shape in canvas.shapes.values():
        label_words.update(re.findall(r"[a-zA-Z][a-zA-Z0-9_-]+", shape.text.lower()))

    important = {w for w in prompt_words if len(w) >= 4}
    if not important:
        return 0.5

    coverage = len(important & label_words) / max(1, len(important))
    return max(0.0, min(1.0, coverage))


def reward(prompt: str, completion: str) -> float:
    canvas, errors = validate_completion(completion)
    if errors or canvas is None:
        return 0.0

    validity = 1.0
    layout = score_layout(canvas)
    semantics = score_semantics(prompt, canvas)

    return 0.4 * validity + 0.3 * layout + 0.3 * semantics

That is fine for a toy environment. In fact, it is useful because it exposes the central problem.

The hard part of RL for agents is usually not the policy-gradient equation. The hard part is constructing an environment where reward is correlated with the behavior you actually want.
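One way to see the gap between reward and intent is to probe the keyword-coverage idea from `score_semantics` in isolation. The function below is a simplified stand-in that takes a list of labels instead of a `Canvas`; the prompt and labels are made up for illustration.

```python
import re

WORD = re.compile(r"[a-zA-Z][a-zA-Z0-9_-]+")


def keyword_coverage(prompt: str, labels: list[str]) -> float:
    # Same heuristic as score_semantics: prompt words of length >= 4 count.
    important = {w for w in WORD.findall(prompt.lower()) if len(w) >= 4}
    if not important:
        return 0.5
    label_words: set[str] = set()
    for label in labels:
        label_words.update(WORD.findall(label.lower()))
    return len(important & label_words) / len(important)


prompt = "Draw a frontend calling an API backed by a database"

# A sensible three-box diagram.
honest = keyword_coverage(prompt, ["Frontend", "API", "Database"])

# A degenerate diagram: one box whose label is the whole prompt.
stuffed = keyword_coverage(prompt, [prompt])
```

Here the degenerate diagram scores higher than the sensible one, because words such as "draw" and "calling" count as important while "API" is too short to count. A policy optimized against this term alone would have an incentive to copy the prompt into a label.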

A weak reward for this task would be:

1 if JSON parses, else 0

That teaches the model to be valid, but not useful.
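A short sketch makes the weakness concrete. The function name `parse_only_reward` and both sample completions are illustrative assumptions.

```python
import json


def parse_only_reward(completion: str) -> float:
    # The weak reward: 1 if the text parses as JSON, else 0.
    try:
        json.loads(completion)
    except json.JSONDecodeError:
        return 0.0
    return 1.0


useful = (
    '{"actions": [{"type": "create_shape", "id": "api", "shape": "rectangle", '
    '"x": 0, "y": 0, "w": 100, "h": 50, "text": "API"}]}'
)
useless = "{}"  # valid JSON, draws nothing

scores = (parse_only_reward(useful), parse_only_reward(useless))
```

Both completions receive the same score, so this reward gives the optimizer no reason to prefer the one that actually draws something.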

A stronger reward might look like this:

python

reward = (
    0.25 * parses_as_json
    + 0.20 * schema_valid
    + 0.20 * renderer_accepts
    + 0.15 * requested_entities_present
    + 0.10 * arrows_connect_expected_entities
    + 0.10 * layout_quality
)
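The weighted sum above can be written as a small runnable function. This is a sketch that assumes each component has already been computed elsewhere and is passed in as a number in [0, 1]; the `WEIGHTS` table and `composite_reward` name are illustrative, and the individual checks (renderer, entity matching) are not implemented here.

```python
WEIGHTS = {
    "parses_as_json": 0.25,
    "schema_valid": 0.20,
    "renderer_accepts": 0.20,
    "requested_entities_present": 0.15,
    "arrows_connect_expected_entities": 0.10,
    "layout_quality": 0.10,
}


def composite_reward(checks: dict[str, float]) -> float:
    """Weighted sum of component scores, each clamped to [0, 1]."""
    missing = WEIGHTS.keys() - checks.keys()
    if missing:
        raise KeyError(f"missing reward components: {sorted(missing)}")
    return sum(
        weight * min(1.0, max(0.0, float(checks[name])))
        for name, weight in WEIGHTS.items()
    )


# A completion that is well-formed but ignores the request.
valid_but_empty = composite_reward({
    "parses_as_json": 1.0,
    "schema_valid": 1.0,
    "renderer_accepts": 0.0,
    "requested_entities_present": 0.0,
    "arrows_connect_expected_entities": 0.0,
    "layout_quality": 0.0,
})
```

Because the weights sum to 1, a completion that only parses and validates tops out at 0.45, which leaves more than half of the reward available for the behavior the task actually cares about.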

If we start RL from a model that cannot produce valid actions, almost every rollout receives zero reward.
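To see why all-zero rewards stall learning, consider a simple mean-baseline advantage, where each rollout is compared against the average reward of its group. This estimator is an assumption chosen for illustration, not necessarily the one the author uses, and the reward values below are made up rather than measured.

```python
def mean_baseline_advantages(rewards: list[float]) -> list[float]:
    # Advantage = reward minus the group's average reward.
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


# Cold start: no rollout produces a valid action, so every reward is zero.
cold_start = mean_baseline_advantages([0.0] * 8)

# After a supervised warm-up: some rollouts are valid and score differently.
warm_start = mean_baseline_advantages([0.0, 0.7, 0.0, 0.9, 0.4, 0.0, 0.8, 0.6])
```

In the cold-start group every advantage is zero, so a policy-gradient update weighted by these advantages changes nothing. In the warm-start group the advantages have mixed signs, which is what gives the update a direction.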

This is why teacher trajectories matter.

observation: user prompt

action: JSON actions

reward: validation score

In a multi-turn environment, it becomes:

obs_0 -> action_0 -> obs_1 -> action_1 -> obs_2 -> reward
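The multi-turn pattern can be sketched as a rollout loop that alternates observations and actions and pays out a single reward at the end. `CountingEnv` is a hypothetical toy environment invented for this example; it has nothing to do with the diagram canvas beyond sharing the same loop shape.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CountingEnv:
    """Hypothetical toy environment: reach `target` with "+1" / "-1" actions."""

    target: int = 3
    max_turns: int = 5
    state: int = 0
    turn: int = 0

    def reset(self) -> str:
        self.state, self.turn = 0, 0
        return f"state={self.state} target={self.target}"

    def step(self, action: str) -> tuple[str, bool]:
        self.state += 1 if action == "+1" else -1
        self.turn += 1
        done = self.state == self.target or self.turn >= self.max_turns
        return f"state={self.state} target={self.target}", done

    def final_reward(self) -> float:
        return 1.0 if self.state == self.target else 0.0


def rollout(env: CountingEnv, policy: Callable[[str], str]):
    """Collect (observation, action) pairs until the episode ends."""
    trajectory: list[tuple[str, str]] = []
    obs = env.reset()
    done = False
    while not done:
        action = policy(obs)
        next_obs, done = env.step(action)
        trajectory.append((obs, action))
        obs = next_obs
    return trajectory, env.final_reward()


trajectory, episode_reward = rollout(CountingEnv(), lambda obs: "+1")
```

The single-turn diagram task is the special case where the loop body runs once: one observation (the prompt), one action (the JSON), then the reward.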

The key point is that teacher generation is not magic. It is sampling from a stronger policy, executing the output in the environment, and keeping the traces that survive.
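The sample, execute, and keep pattern can be sketched without any API calls. Here `stub_teacher` stands in for the stronger model and `accepts` stands in for running the output through the environment; both are hypothetical placeholders, and the 30% failure rate is arbitrary.

```python
import json
import random


def stub_teacher(prompt: str, rng: random.Random) -> str:
    # Stand-in for a stronger policy: sometimes emits malformed output.
    if rng.random() < 0.3:
        return "Sorry, here is your diagram: {actions: ["
    label = prompt.split()[-1]
    action = {
        "type": "create_shape", "id": "n0", "shape": "rectangle",
        "x": 0, "y": 0, "w": 120, "h": 60, "text": label,
    }
    return json.dumps({"actions": [action]})


def accepts(completion: str) -> bool:
    # Stand-in for executing the output in the environment.
    try:
        data = json.loads(completion)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and bool(data.get("actions"))


def collect_traces(prompts: list[str], samples_per_prompt: int = 4, seed: int = 0):
    rng = random.Random(seed)
    kept = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            completion = stub_teacher(prompt, rng)
            if accepts(completion):  # keep only the traces that survive
                kept.append({"prompt": prompt, "completion": completion})
    return kept


traces = collect_traces(["draw a box labeled API", "draw a box labeled Cache"])
```

In a real pipeline the filter would be the full `validate_completion` check, optionally combined with a minimum `reward` threshold, and the surviving prompt and completion pairs would become the supervised fine-tuning set.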

Here is a minimal sketch using Gemini structured output.

This produces the dataset we need for the first phase. The teacher moves the student into the valid action manifold. It does not solve the whole task. It gives RL somewhere useful to start.

python

# teacher_generate.py
from __future__ import annotations

import json
from pathlib import Path
from typing import Literal

from google import genai
from pydantic import BaseModel, Field

from reward import reward
from env import validate_completion


class CreateShape(BaseModel):
    type: Literal["create_shape"]
    id: str
    shape: Literal["rectangle", "ellipse", "diamond", "text"]
    x: float
    y: float
    w: float
    h: float
    text: str = ""


class Connect(BaseModel):
    ...  # the source excerpt is truncated at this point
