TLDR AI · June 10, 2026 09:00 · about 20 min read

Treating Text as a Serious Optimization Layer (8 minute read)

#LLM #Prompt Engineering #System Optimization #Cost Efficiency

TL;DR

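The article, featured in TLDR AI, argues that text should no longer be treated as a mere source of information and should instead be re-evaluated as an important layer for optimizing the performance of the whole system.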

AI deep analysis · June 11, 2026 00:04

Depth: 40%

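Key points

A changed role for text

Text should be redefined not as mere input data or output, but as an active optimization layer for controlling and improving model performance and resource efficiency.

Integration into system design

The article suggests that the structure and quality of text, often neglected in conventional architectures, are important parameters that bear directly on inference speed and cost.

Practical optimization methods

It discusses concrete approaches that save hardware resources while maintaining accuracy through careful text design and prompt engineering.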


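Impact analysis

The article calls for a fundamental rethink of how text is handled in AI system development, and it matters most to practitioners who must balance cost efficiency against performance. Beyond conventional hardware-dependent optimization, work at the software (text) layer may become a new determinant of overall system performance.

Editorial comment

With hardware resources widely said to be reaching their limits, the idea of looking for optimization headroom on the software side (the text layer) is sharp and likely to have high practical value.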

There is a common negative sentiment I observe among ML researchers toward prompting, or more broadly, text optimization. The underlying view seems to be something like *“real learning happens in the weights.”* By text optimization, I broadly mean methods that modify the mutable text layer around a model: prompts, context, filesystem state, memory, retrieval databases, and model harnesses.[1]

I think this layer should be taken more seriously by the broader research community. I’ll argue for text optimization on three counts:

Text optimization is a legitimate update mechanism. It holds the same functional role as gradient-based weight optimization: changing future behavior in response to new information.
Text optimization is much more sample-efficient than weight optimization, particularly in the low-data regime. Relatively short, high-likelihood text has low description length, giving text optimization a favorable inductive bias.
Text optimization enables a new scaling axis: update-time compute. Reflective text optimization lets a system spend more compute learning from a single experience, the way inference-time scaling lets a model spend more on a single input.

Learning Outside the Weights

Deployed AI systems are no longer just a parameter vector queried in isolation; they are complex, stateful machines with many moving parts, the weights being just one of them. Once this whole system is the object of study, learning can mean changing any behavior-conditioning state. Weights are one state, typically updated through gradient-based optimization. Prompts, memories, retrieval indices, and harness code are others, with different costs, capacities, and failure modes. The important question is *which update target is the most appropriate* for a given piece of information.

Text artifacts have a useful inductive bias. The usual Kolmogorov-style compression intuition applies: short specifications that explain many cases are more likely to capture real structure than long lists of exceptions. In this sense, good text updates are *compact patches to a pretrained world prior*. Empirically, text optimization is orders of magnitude more sample-efficient in the low-data regime (1, 2, 3). Because of this, a recurring pattern at scale is to use the text layer to elicit and compose existing capabilities in the model, and then distill this into the weights over time (Anthropic, OpenAI, Cursor, Letta, Hippocratic AI, Harvey).
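The compression intuition above can be made concrete with a rough sketch. Kolmogorov complexity is not computable, so the snippet below assumes zlib-compressed size as a crude upper-bound proxy for description length. The rule and the enumerated cases are hypothetical examples, not taken from any cited work.

```python
import zlib


def description_length(text: str) -> int:
    """Crude proxy for description length: byte size after zlib compression."""
    return len(zlib.compress(text.encode("utf-8"), 9))


# Hypothetical text artifacts: one compact rule vs. an enumerated list of cases.
rule = "Format every date as ISO 8601 (YYYY-MM-DD)."
cases = "\n".join(
    f"If the input date is {m:02d}/{d:02d}/2024, write 2024-{m:02d}-{d:02d}."
    for m in range(1, 13)
    for d in range(1, 29)
)

rule_bytes = description_length(rule)
cases_bytes = description_length(cases)

# The short rule covers every enumerated case (and more) at a fraction of the length.
print(f"rule: {rule_bytes} bytes, enumerated cases: {cases_bytes} bytes")
```

Under this proxy, a text update that states the rule is a far shorter "patch" than one that lists its instances, which is the sense in which short, high-likelihood text carries a favorable inductive bias.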

Update-Time Compute: A New Scaling Axis

The text layer enables reflective learning (Reflexion, Trace, GEPA, Meta-Harness): an optimization loop grounded in text can *externalize its own hypotheses about how it should change*. This makes *hypothesis testing* scalably useful at update time: systems can propose multiple ideas in text and test them against new evidence before accepting or rejecting them, the way a scientist might propose and test multiple theories before settling on one. See e.g. Appendix A.2 of Meta-Harness for a real example of such hypothesis-testing behavior. SGD can’t cheaply do this; its single running parameter vector commits each update, with no easy way to fork and compare.

I think the core promise of text optimization is that we can scale “update-time compute”: just as inference-time scaling lets a model spend more compute to solve a single instance, reflective text optimization lets a system spend more compute learning from a single experience. A failed trajectory can be reread, diagnosed, abstracted, tested against candidate revisions, and then converted into a proposed update. Text-space learning is therefore especially useful when (1) failures are expensive, (2) the desired behavior is hard to specify, or (3) there is abundant offline trace data that other methods (SFT or offline RL) do not exploit well.
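The propose-test-accept loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of Reflexion, Trace, GEPA, or Meta-Harness. `toy_run` stands in for a frozen model, `toy_propose` stands in for an LLM reflecting on a failure, and both names are hypothetical. The `n_candidates` argument is the update-time compute knob.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (input, expected output)


def score(artifact: str, run: Callable[[str, str], str],
          examples: Sequence[Example]) -> float:
    """Fraction of examples the system gets right with this text artifact."""
    return sum(run(artifact, x) == y for x, y in examples) / len(examples)


def reflective_update(current: str,
                      propose: Callable[[str, int], List[str]],
                      run: Callable[[str, str], str],
                      examples: Sequence[Example],
                      n_candidates: int = 4) -> Tuple[str, float]:
    """Propose several revisions and accept one only if it beats the current artifact."""
    best, best_score = current, score(current, run, examples)
    for candidate in propose(current, n_candidates):
        s = score(candidate, run, examples)
        if s > best_score:  # otherwise the hypothesis is rejected
            best, best_score = candidate, s
    return best, best_score


def toy_run(artifact: str, x: str) -> str:
    # Hypothetical frozen "model": it obeys two instructions if they appear.
    out = x
    if "strip" in artifact:
        out = out.strip()
    if "lowercase" in artifact:
        out = out.lower()
    return out


def toy_propose(current: str, n: int) -> List[str]:
    # Hypothetical reflection step: candidate edits to the instruction text.
    hypotheses = ["strip whitespace", "lowercase",
                  "strip whitespace; lowercase", "reverse the text"]
    return [f"{current}; {h}" if current else h for h in hypotheses[:n]]


examples = [("  Hello ", "hello"), ("WORLD", "world"), (" ok", "ok")]
artifact, acc = reflective_update("", toy_propose, toy_run, examples)
print(artifact, acc)
```

Because every candidate is a separate piece of text, the loop can fork, compare, and discard hypotheses freely. A single running parameter vector offers no equally cheap way to do that.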

The Strongest Case for Weights, and My Counterpoints

There are some compelling arguments for keeping learning in the weights. For each, I will state my strongest interpretation of the argument, and then respond in rebuttal style.

Weights give amortization. Once a behavior is trained into the model, the system no longer has to carry the full specification of that behavior in every context window. The context window, in contrast, is a finite resource.

I think this is a strong argument for many types of information to *ultimately* belong in weights. I agree; for example, LLMs should not need a long prompt to explain basic arithmetic for every request. Even here, though, many pieces of useful information are not stable or general enough to be worth the cost of amortization, as with search agents that gather dynamic internet context or personalized agents that depend on changing user history, preferences, and private state. I think the right framing is as a routing problem: weights are where stable, repeatedly useful information belongs, while text is where information stays while it is volatile, local, auditable, or not yet trusted enough to amortize.
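As a rough sketch of the routing framing, the rule can be written as a small decision function. The `Info` fields and the `route` function are hypothetical names introduced for illustration. A real system would presumably measure these properties rather than hard-code them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Info:
    name: str
    volatile: bool     # changes often, e.g. live web context
    local: bool        # specific to one user or deployment
    needs_audit: bool  # humans must be able to inspect and edit it
    trusted: bool      # validated enough to be worth amortizing


def route(info: Info) -> str:
    """Return 'text' or 'weights' following the routing rule sketched above."""
    if info.volatile or info.local or info.needs_audit or not info.trusted:
        return "text"
    return "weights"


arithmetic = Info("basic arithmetic", volatile=False, local=False,
                  needs_audit=False, trusted=True)
preferences = Info("user preferences", volatile=True, local=True,
                   needs_audit=False, trusted=True)

print(route(arithmetic), route(preferences))
```

Stable, repeatedly useful information falls through to the weights, while anything volatile, local, audit-sensitive, or not yet trusted stays in text.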

Additionally, good text-layer systems do not dump all available information into the context window. They implement progressive disclosure of information, where the system retrieves and conditions on relevant information as needed (RAG, MemGPT, RLM, Anthropic dynamic workflows, Meta-Harness). With the right organization, it’s fairly straightforward to implicitly condition on a much larger context than the model’s input window. When you know what to include, you can pack a surprising amount of information into context. 1% of a 1M-token context is 10K tokens, which is more than three copies of this post; hopefully enough that reading it meaningfully shifts a reader’s mental model.
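A minimal sketch of progressive disclosure under a token budget follows. It makes two simplifying assumptions: whitespace-separated words stand in for tokens, and keyword overlap stands in for a real retriever. It is not how RAG, MemGPT, or the other cited systems work. The function names and notes are hypothetical.

```python
from typing import List


def approx_tokens(text: str) -> int:
    # Assumption: whitespace-separated words as a rough stand-in for tokens.
    return len(text.split())


def select_context(query: str, notes: List[str], budget: int) -> List[str]:
    """Greedily pack the notes that overlap most with the query into the budget."""
    q = set(query.lower().split())

    def overlap(note: str) -> int:
        return len(q & set(note.lower().split()))

    chosen: List[str] = []
    used = 0
    for note in sorted(notes, key=overlap, reverse=True):
        cost = approx_tokens(note)
        if overlap(note) == 0 or used + cost > budget:
            continue  # irrelevant or over budget: leave it out of context
        chosen.append(note)
        used += cost
    return chosen


budget = 1_000_000 // 100  # 1% of a 1M-token window = 10,000 tokens
notes = [
    "deploy checklist: run migrations before restarting the api",
    "user prefers metric units in all reports",
    "api rate limit is 100 requests per minute",
]
picked = select_context("what is the api rate limit", notes, budget)
print(budget, picked)
```

Only the notes relevant to the query enter the window, so the store behind the retriever can be much larger than the context itself.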

Even if some information is worth amortizing into weights, it need not be amortized immediately. I’ve come to view the text layer as a kind of flexible *“staging ground”* for information that may eventually be distilled into weights. This layer makes it very easy to test and refine behavioral hypotheses before committing them to the model. How to evolve the text layer and use it to improve the weights over time is an interesting open research question, whether through direct distillation (1, 2, 3, 4), synthetic data generation (5, 6, 7, 8), modifying the training loop itself (9, 10), or fast-slow learning frameworks (11).

Training the weights creates new neural circuits. Text optimization only ever elicits existing behavior from a fixed set of weights, and given those weights, there is a ceiling on what the text layer can reach.

Agreed that a weak model gives text optimization very little to work with. However, such a ceiling is not unique to text optimization, and this argument has even been made against RL.[2] Text optimization does not *need* to create completely new latent capabilities to be useful. Many deployed systems are bottlenecked not by whether the model could in principle perform a certain behavior primitive, but by whether the system can elicit and compose that behavior reliably (mgh). The practical question is therefore *how much useful headroom remains* between the model’s latent capabilities and the behavior the deployed system actually exhibits.

Empirically, the headroom for improving the text layer is significant. It shows up across retrieval-augmented QA, test-time scaling, and tool-use agents: fixed-model behavior improves when we change the context or execution environment rather than the weights (1, 2, 3, 4, 5). Scale also appears to increase the value of text conditioning: larger models become better at using information supplied at inference time, and some context-conditioned abilities appear only at larger scale (1, 2, 3).

The “existence argument”: the human brain is clearly intelligent. It must be possible to learn by changing weights alone.

I’d actually make a similar existence argument for text optimization. Look at the collection of all written text (books, papers, code, webpages, etc.): good external representations greatly amplify human intelligence. *How much would the quality of our work suffer if we were suddenly cut off from all external text?*

Anyone can change a text artifact and get a seemingly better-looking output. Text optimization is unusually vulnerable to benchmark leakage and folk theories about model psychology.

First, text optimization has been poorly marketed by its early successes. The most visible examples were amusing model quirks like “let’s think step by step”, “take a deep breath”, “this is very important to my career”, personas, and threats and tipping. It’s perhaps tempting to conclude that text optimization itself will disappear as newer models become more robust to such tricks. But this confuses a weak early framing of the field with the underlying research problem.

It’s very easy to tinker on the text layer: anyone can edit an instruction and declare victory based on cherry-picked outputs.[3] This low barrier to entry makes bad science common here. If anything, I view such immature methodological norms as a strong argument for studying text optimization more rigorously, especially given its practical importance.

Gradient descent is a real optimizer. You can lean on the large literature on optimization, generalization, and convergence to understand how it works. Text optimization is heuristic hill-climbing.

Convergence theory only guarantees that you will minimize the proxy loss, not that the proxy matches what you actually care about. A stronger optimizer just exploits this gap; the field has largely moved on from theoretical analysis of generalization dynamics to empirical scaling laws and best practices. RL post-training in particular is notoriously finicky and prone to this kind of overfitting (1, 2, 3, 4). In contrast, text-layer edits apply weaker optimization pressure while remaining highly auditable, and in many cases also composable.

Neural networks are universal function approximators and can represent anything.

Representational capacity is not the right thing to look at; even a two-layer MLP can *in principle* represent any function, but that doesn’t mean it can learn to do so efficiently or reliably. We should be looking at reachable behavior, i.e., what behaviors are sufficiently high-likelihood under the implicit prior. Harnesses can demonstrably execute behaviors that we wouldn’t expect a frozen model to perform in a single forward pass.

Text artifacts are not portable. They are overfit to one model’s quirks and often break on the next checkpoint.

The relevant comparison is with other update artifacts. A text artifact written for one model may fail on another, but a weight delta trained for one architecture is usually not portable at all. Text artifacts are slightly more portable since text still carries meaning across models.

Perhaps the Pendulum Has Swung Too Far

The “weights are the real learning” view is partly a reaction to early AI, when researchers were focused on building systems that could learn by changing their internal parameters. For decades, the dominant picture treated intelligence as explicit symbol manipulation. Newell and Simon’s physical symbol system hypothesis and Haugeland’s GOFAI are canonical examples of this mindset. Neural networks showed that this was too narrow: useful information can clearly live in weights; modern LLMs are the strongest evidence for that claim.

We seem to have overcorrected towards viewing weights as the *only* serious home for knowledge. This is strange when zoomed out because human cognition routinely depends on external artifacts. In Cognition in the Wild, Edwin Hutchins analyzes ship navigation as a cognitive system made of people, instruments, procedures, and external representations. Clark and Chalmers make a related point in The Extended Mind: the boundary of a cognitive system can extend beyond the internal state of a single component. The computer-science version of this lineage runs at least back to Vannevar Bush’s Memex: an external memory organized around associative trails through a personal archive. Modern tools-for-thought systems like Notion and Obsidian are concrete attempts to make external memory part of everyday knowledge work.

*Scientific practice* is a useful comparison. One of the core goals of science is to construct compact representations of the world, which is *aided by* private intuitions inside scientists’ heads but not reducible to them. The usual products are crystallized: an abstraction, a theorem, or a causal model, which can be written down and shared. Their value comes in large part *from* externalization: they can be criticized, compared against new evidence, revised, and applied to new cases. Text artifacts occupy a similar functional role in model systems: they are external representations that encode behavior-relevant abstractions. Updating them is “learning” in the same sense that revising a scientific theory in light of new evidence is learning.

A Call for Good Research on the Text Layer

I think text optimization deserves the same kind of community we built around weight optimization, and I wish there were more high-quality research here. Several directions seem ripe for foundational work in the very near future:

Theoretical analysis of the text layer. Generally, text space gives a much better prior than weight space, and cleanly formalizing this observation could be very useful for guiding practice. This old-ish paper is a promising start applying PAC-Bayes to prompts in 2023-level models, which seems very much worth revisiting with the latest generation of models and text artifacts.
Better evals. CL-bench is an initial attempt at a proper eval for context learning, and agentic benchmarks like TerminalBench-2 have partly become a battleground for harnesses. Still, we need more benchmarks that isolate useful properties of the text layer, controlling for weight capability while flagging the weird new classes of overfitting and cheating that the text layer enables.
“Architecture research”, i.e., understanding the design space. There are so many proposed designs for the text layer, from the instruction hierarchy, DSPy programs, agent skills, and OpenClaw-style agents to the massive number of memory system designs. There is a sense in which these are all points in one huge design space, but we don’t have a good way to talk about that space, let alone compare different points in it.
HCI research on how to elicit input from humans to optimize the text layer, and how to present the system’s internal state back to users for inspection and revision. I think figuring out the right ways to interact with the text layer can make it economically viable to routinely have top domain experts sit down for “verbal fine-tuning” sessions with AI systems. I don’t know of a good example of work in this direction, but this paper of mine had essentially this motivation, though it worked in a very limited domain.
Seriously scaling up text optimization, including establishing scaling laws. The compute budgets currently allocated to text optimization are orders of magnitude smaller than weight post-training scale. For example, a scaled-up artifact might look like a Wikipedia-scale knowledge/harness layer, optimized from the ground up against measurable model-system performance.[4]

Thanks to Omar Khattab, Allen Nie, Chelsea Finn, Alex Zhang, Ahmad Beirami, and Qizheng Zhang for excellent feedback on an earlier draft. This post is a distillation of many conversations with researchers over the past year or so, which I won’t attempt to list here in full.

Footnotes

[1] I use “text” because language is the clearest and most common case, but the argument should apply more broadly to external artifacts that can condition a model’s future behavior, including images, audio, video, and other tokenized state.
[2] This is contested. ProRL and The Art of Scaling RL Compute argue that with the right training recipe, RL can expand reasoning capacity beyond the base model. Personally, I think the truth is somewhere in the middle: RL by design should be able to discover new behaviors, but there’s definitely a strong empirical dependence on the quality of the base model. Either way, the details here don’t matter for the argument I’m making in the post.
[3] I do see this happen somewhat often, at least more than in weight-space research. Everyone seems to have a strong opinion on what the best system prompt or skill is. This probably has more to do with social media dynamics: “one weird trick to make your model 10x smarter” is much more actionable than any weight-space intervention.
[4] I’ve started to think this may be a good startup.
