Hugging Face Blog·2026年6月7日 04:02·約8分で読める

5 つのラボ、5 つの思考：小規模モデルによるマルチモデル金融ドラマの構築

#LLM #Small Language Models #Multi-Agent Systems #Emergent Behavior #Hugging Face

TL;DR

Hugging Face の記事は、異なるラボの小型モデルを各エージェントに割り当てることで多様性を生み出す「Thousand Token Wood v2」の実装と、その技術的意義について報告している。

AI深層分析2026年6月11日 21:13

注目/ 5段階

深度40%

キーポイント

異種モデルによるエージェント構成

単一のモデルではなく、OpenAI, OpenBMB, NVIDIA, Qwen の異なる小型モデルを各生物エージェントに割り当て、市場の多様性を意図的に設計している。

プレイヤーの役割変化とゲーム性

単なる観察者から「パトロン（影の金融家）」へとプレイヤーの役割が進化し、金利設定や情報操作、同盟結成など複雑な経済シミュレーションが可能になった。

多様性が製品価値となる設計思想

モデルの異質性を制約ではなく製品の核として位置づけ、異なる推論特性を持つエージェント間の相互作用から emergent（創発的な）な経済現象を生み出している。

小規模モデルの信頼性向上戦略

小規模モデルは推論に不向きだが、構造化、プロンプトエンジニアリング、および微調整を組み合わせることで信頼性の高いフォーマット生成が可能になる。

異種多様性と秘密情報の扱い

均質なモデル群よりも多様なモデル構成の方が興味深く、エージェントへの機密情報はプロンプト内の指示ではなくデータフロー上のファイアウォールで管理する必要がある。

永続的記憶の活用

エージェントに生命感を与える最も安価な方法は永続的記憶であり、プロンプトには有界（bounded）された要約のみを提示すれば十分である。

影響分析・編集コメントを表示

影響分析

この記事は、大規模言語モデル（LLM）の性能競争だけでなく、異なる特性を持つ小型モデルを組み合わせることで複雑なシミュレーション環境を構築できる可能性を示唆しています。特に、実用性の高い軽量モデルを活用したマルチエージェントシステムの設計指針として、研究開発やゲーム開発の現場において重要な示唆を与える内容です。

編集コメント

「モデルの多様性」をゲームデザインやシミュレーションの核に据えた視点は、単一の最強モデルを求める現在のトレンドに対する有益な対案となっています。実用リソースでいかに複雑な振る舞いを創出するかというエンジニアリングの知見が光ります。

記事一覧に戻る

*第二回 Build Small ハッカソンのフィールドレポート：新興経済の各エージェントが異なるラボの小規模モデルで動作し、プレイヤーが糸を操る資金提供者となる時に何が起きるか。

Thousand Token Wood の最初のバージョンは気象神のサンドボックスでした。1 つの微調整済み 0.5B モデル上で 5 匹の森の生き物が物品を取引し、あなたは世界にショックを与えてバブルやクラッシュが現れる様子を観察しました。それは素敵な玩具でしたが、遊ぶというよりは観る対象でした。

v2 ではそれをあなたが運営するゲームへと再構築しました。あなたは「森のパトロン」、影の資金提供者です。利息をつけて貸し出しを行い、真実か偽りのどちらか分からない情報を囁き、市場を空売りし、賄賂を渡し、同盟を仲介します。一方、裁判官はあなたが知るべきではない情報を使って取引を行ったとしてあなたを追跡しています。生き物たちはあなたの扱い方を記憶しており、復讐の策略を巡らせます。そして最も大きな変化は裏側で起こっています：すべての生き物が今や異なるラボの小規模モデルで思考するようになりました。これが本技術報告書です。

多様性が製品であり、制約ではない

エージェント評議会を実行する明白な方法は、1 つのモデルに複数のプロンプトを適用することです。v2 では 4 つのモデルを実行しています：gpt-oss-20b（OpenAI）、MiniCPM3-4B（OpenBMB）、Nemotron-Mini-4B（NVIDIA）、そして私が独自にファインチューニングした Qwen 0.5B です。目的は新奇性そのもののためではありません。参加者が本質的に異なる場合、市場は興味深いものになります。異なるデータで訓練され、異なるポストトレーニングを経た 4 つの研究所のモデルは、小規模モデルとしてはこれ以上ないほど異なります。フクロウは貯蔵の仕方が異なり、キツネは投機的な行動が異なります。評議会は台本ではなく、生きた議論です。

1 つのプラットフォーム上で 4 つの異なるモデルを並列に実行したことで浮き彫りになった真の教訓は、摩擦はほぼ完全にサービング層（serving layer）にあり、モデリング層にはないということです。

現在の vLLM (0.22.1) はロード時にカーネルを JIT コンパイルする必要があり、CUDA ツールキット（nvcc）が利用可能である必要があります。軽量なベースイメージには nvcc が同梱されていないため、CUDA 開発用イメージを基盤とするまで、4 つのモデルすべてが同じく「nvcc が見つからない」というエラーで失敗しました。これは gpt-oss の特有の挙動ではなく、vLLM のバージョンに共通する問題でした。1 つのイメージ修正ですべての問題が解決しました。

gpt-oss-20b はネイティブの MXFP4 量子化（quantization）で動作し、24GB の L4 GPU に余裕を持って収まります。高価な GPU は不要です。また、回答を分析の前書きで囲むチャネル形式をサポートしているため、消費者側は最終的なチャネル部分を抽出する必要があります。

MiniCPM3 では trust_remote_code が必要でしたが、Nemotron は問題なくロードされました。モデルごとの落とし穴であり、それぞれが 1 行の構成設定で対応可能です。

4 つの異種モデルを扱いやすくした要因は、v1 で単一モデルを扱いやすくしたのと同じ原初的な仕組み、すなわちすべてのモデル出力が流れる許容性の高い JSON パース・修復レイヤーです。異なるトークナイザーやフォーマット習慣はそれぞれ異なる変形を生み出しますが、パーサーは救済不能な部分を切り捨て、シミュレーションは決してクラッシュしません。このレイヤーを一度構築すれば、新しいモデルを追加するだけで済み、リファクタリングは不要となります。

情報非対称性にはファイアウォールが必要

v2 の劇的な核心は内部者からのヒントです。あなたは生物に*真実*（デッキが引き出す次の市場熱狂の実際の予測であり、あなたの真の優位性）か*偽物*（おとり）のどちらかのヒントを囁くことができます。真のヒントに基づいて行動し利益を得ると、あなたの「ヒート」が上昇します。ある閾値を超えると、裁判官が捜査を開始し、罰金、資産凍結、あるいは追放という結末を迎えます。

それが真のゲームとなるためには、噂の真偽を「生き物」から隠さなければなりません。彼らは噂のテキストは目にしますが、フラグ（秘密の合図）を決して見てはいけません。これは UI の小細工ではなくセキュリティ上の要件であり、小型モデルのエージェントにおいては特に鋭い問題となります：モデルが繰り返し返すものはすべて、プロンプトに含めた内容そのものだからです。したがって、隠されたフラグはプロンプトの外（プレイヤーの台帳上）に存在し、構築時に公開イベント記録から削除され、ナレーターが要約するのは公開イベントのみです。単一のテストで、各ターンごとにすべての生き物の完全なプロンプトをスキャンし、禁止トークンが含まれていないか確認します。このテストは、スイート（一連のテスト）の中で最も重要なものです。エージェントに秘密情報を渡す際は、テストによって漏洩しないことが証明されるまで、必ず漏れると仮定してください。

制約されたメモリは、安価なドラマを生み出します

生き物は永続的な関係性を保持しています：パトロン（支援者）や互いに対する署名付きの感情です。これはイベントによって微調整されます（「あなたの作物を空売りした」「借金を返済した」「敵対者と同盟を結んだ」など）。敵対的になった生き物は融資を拒否し、より不利な条件で取引します；同盟関係にある生き物たちは互いの価格引き下げを止め、カルテルのように振る舞います。

罠はプロンプトの膨張です。生の履歴は無限に成長し、小規模モデルはその中で溺れてしまいます。解決策は、決して履歴をプロンプトに含めないことです：モデルは一行の要約（「Oona に対して温かい感情を抱きつつも、パトロンのことは警戒している」）というバケット化された概要のみを見ます。これは少数の最も強い感情に基づき、整数値のセンチメントから導き出され、上限が設けられています。ノートは追跡のために保持されますが、範囲が限定されており、決して表示されることはありません。この行動バイアスは、一部は創発的（要約がモデルを誘導する）であり、一部は機械的（強い敵対性を持つ生物が決定論的に拒絶する）であるため、単なる希望ではなく、観測可能でテスト可能なものです。

実際に何が起こったか

完全な v2 メカニクスを実装した代表者評議会による実行：

レバー

結果

評議会のモデル

4 機関が参加し、すべてが 32B の上限内で、Modal で提供されました。

微調整された 0.5B の信頼性

自己売買 0%、有効なオファー 100%（3B の教師モデルを上回る）

真実のファイアウォール

スキャンされたすべてのプロンプトにおいて、ヒントの隠しフラグの漏洩はゼロ

インサイダー・ヒントのエッジ

真のヒントによる事前配置は正の損益を確定させますが、偽のヒントはそうしません。

調査への熱意

2 つのクリーンな疑わしい勝利が裁判官の基準を超えました。

破滅

証拠保全（マージンコール）と貸付デフォルト禁止により、ある生物が追放され、数章後に復帰します。

*パトロンの行使、情報戦、人間関係、そしてレバレッジをエンドツーエンドで実行した単一のシードされたランです。*

小規模モデルでの構築における教訓

小規模モデルは信頼できるフォーマット生成器ですが、推論においては信頼性が低いです。このギャップを埋めるには、スケーリングではなく、構造化、プロンプト設計、そして小規模なファインチューニングによって達成されます。均質な評議会よりも多様な評議会のほうが興味深く、サービング層が堅牢であれば構成コストは一度きりで済みます。エージェントに与えられる機密情報はファイアウォールの問題であり、そのファイアウォールはプロンプト内の指示ではなく、データフロー内に存在し、テストによって証明されるべきものです。また、永続的メモリは、プロンプトが常に有界な要約のみを参照する限りにおいて、エージェントに生きている感覚を持たせる最も安価な手段です。

小規模モデルでも大冒険が可能。評議会全体もトレースもすべてオープンです。

原文を表示

Back to Articles

*A second Build Small Hackathon field report: what happens when each agent in an emergent economy runs on a different lab's small model, and the player becomes the financier pulling the strings.*

The first version of Thousand Token Wood was a weather-god sandbox: five woodland creatures on one fine-tuned 0.5B model traded goods, and you poked the world with shocks and watched bubbles and crashes emerge. It was a nice toy. It was also something you watched rather than played.

v2 rebuilt it into a game you operate. You are the Patron of the Wood, a shadow financier: you lend at interest, whisper tips that may be true or planted, short the market, bribe, and broker alliances, while a magistrate hunts you for trading on what you should not know. The creatures remember how you treated them and scheme back. And the biggest change is under the hood: every creature now thinks with a different lab's small model. This is the engineering report.

Heterogeneity is the product, not a constraint

The obvious way to run a council of agents is one model, many prompts. v2 runs four: gpt-oss-20b (OpenAI), MiniCPM3-4B (OpenBMB), Nemotron-Mini-4B (NVIDIA), and a fine-tuned Qwen 0.5B of my own. The point is not novelty for its own sake. A market is interesting when the participants genuinely differ, and four labs' models trained on different data with different post-training are about as different as small models get. The owl hoards differently than the fox speculates. The council is a live argument, not a script.

Standing four distinct models up on one platform surfaced the real lesson: the friction is almost entirely at the serving layer, not the modeling layer.

Current vLLM (0.22.1) JIT-compiles kernels at load and needs the CUDA toolkit (nvcc) present. A lean base image does not ship it, so all four models failed identically with "could not find nvcc" until I based them on a CUDA devel image. This was not a gpt-oss quirk; it was universal to the vLLM version. One image fix unblocked everything.

gpt-oss-20b runs in its native MXFP4 quantization and fits a 24GB L4 with room to spare; no high-end GPU needed. It also speaks a channel format that wraps the answer in an analysis preamble, so the consumer has to extract the final channel.

MiniCPM3 needed trust_remote_code; Nemotron loaded clean. Per-model footguns, each a one-line config.

The thing that made four heterogeneous models tractable was the same primitive that made one model tractable in v1: a tolerant JSON parse-and-repair layer that every model's output flows through. Different tokenizers and formatting habits produce different malformations; the parser drops what it cannot salvage and the simulation never crashes. Build that layer once and adding a model is a config entry, not a refactor.

Information asymmetry needs a firewall

The dramatic core of v2 is the insider tip. You can whisper a tip to a creature that is *true* (a real forecast of the next market mania the deck will draw, your genuine edge) or *false* (bait). Acting on a true tip and profiting raises your heat; cross a threshold and the magistrate opens an investigation that ends in a fine, frozen assets, or exile.

For that to be a real game, the truth of a tip must be hidden from the creatures. They see the rumor text; they must never see the flag. This is a security property, not a UI nicety, and small-model agents make it sharp: everything the model could repeat back is whatever you put in its prompt. So the hidden flag lives off-prompt entirely (on the player's ledger), it is stripped from the public event record at construction, and the only thing the narrator ever summarizes is public events. A single test scans every creature's full prompt, every turn, for the banned tokens. That test is the most important one in the suite. When you give an agent secret information, assume it will leak unless a test proves it cannot.

Memory is cheap drama if you bound it

Creatures carry persistent relationships: a signed sentiment toward the Patron and toward each other, nudged by events (you shorted my crop, you repaid your loan, you allied me with a rival). A creature that turns hostile refuses your loans and quotes you worse; allied creatures stop undercutting each other and behave like a cartel.

The trap is prompt inflation. Raw history grows without bound and a small model drowns in it. The fix is to never put history in the prompt: the model sees a one-line bucketed summary ("you feel warmly toward Oona, wary of the Patron"), capped to the few strongest feelings, derived from integer sentiment. Notes are kept for traces but bounded and never shown. The behavioral bias is part emergent (the summary nudges the model) and part mechanical (a strongly hostile creature deterministically refuses), so it is observable and testable rather than a hope.

What actually happened

A representative council run, with the full v2 mechanics live:

Lever

Result

Models in the council

4 labs, all under the 32B cap, served on Modal

Fine-tuned 0.5B reliability

0% self-buys, 100% valid offers (beats its 3B teacher)

Truth firewall

0 leaks of a tip's hidden flag across every prompt scanned

Insider tip edge

a true-tip pre-position settles a positive P&L; a false tip does not

Heat to investigation

two clean suspicious wins cross the magistrate's line

Ruin

a margin call and a loan default banish a creature, who returns a chapter later

*A single seeded run exercising the Patron, the information war, relationships, and leverage end to end.*

Takeaways for building with small models

A small model is a reliable format generator and an unreliable reasoner; you close the gap with structure, prompting, and a small fine-tune, not with scale. A heterogeneous council is more interesting than a homogeneous one and costs you only config once the serving layer is solid. Secret information given to an agent is a firewall problem, and the firewall belongs in the data flow, proven by a test, not in a prompt instruction. And persistent memory is the cheapest way to make agents feel alive, as long as the prompt only ever sees a bounded summary.

Small models, big adventures. The whole council is open, and so are the traces.

この記事をシェア

Hugging Face Blog★42026年6月19日 03:13

MosaicLeaks：研究エージェントは秘密を守れるか？

Hugging Face は、AI エージェントが機密情報を漏洩するリスクを検証する「MosaicLeaks」という評価フレームワークを発表した。

Latent Space2026年6月20日 17:06

[AINews] 今日特に大きな出来事はありませんでした

Latent Space は、GLM 5.2 が依然として注目されていると指摘しつつ、AIE WF 2026 の通常チケットが月曜日に完売すると発表しました。同サイト購読者向けに限定割引を提供し、参加者には Warp や Datadog などからのスポンサークレジットも付与されます。

TechCrunch AI★42026年6月20日 01:01

米国がアンソロピックの「Fable 5」発売を禁止、しかし市場は動じず

米国政府は国家安全保障上の懸念から、アマゾンの研究者らがガードレール回避手法を発見したとして、アンソロピックに対し最新モデル「Fable 5」と「Mythos 5」の販売差し止めを命じた。サイバーセキュリティ研究者らはこの措置が危険だとする公開書簡に署名し、同社も他モデルでも同様の抜け道が存在すると指摘している。

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む

Hugging Face Blog·2026年6月7日 04:02·約8分で読める

5 つのラボ、5 つの思考：小規模モデルによるマルチモデル金融ドラマの構築

#LLM #Small Language Models #Multi-Agent Systems #Emergent Behavior #Hugging Face

TL;DR

AI深層分析2026年6月11日 21:13

注目/ 5段階

深度40%

キーポイント

異種モデルによるエージェント構成

単一のモデルではなく、OpenAI, OpenBMB, NVIDIA, Qwen の異なる小型モデルを各生物エージェントに割り当て、市場の多様性を意図的に設計している。

プレイヤーの役割変化とゲーム性

多様性が製品価値となる設計思想

小規模モデルの信頼性向上戦略

異種多様性と秘密情報の扱い

永続的記憶の活用

エージェントに生命感を与える最も安価な方法は永続的記憶であり、プロンプトには有界（bounded）された要約のみを提示すれば十分である。

影響分析・編集コメントを表示

影響分析

編集コメント

記事一覧に戻る

多様性が製品であり、制約ではない

現在の vLLM (0.22.1) はロード時にカーネルを JIT コンパイルする必要があり、CUDA ツールキット（nvcc）が利用可能である必要があります。軽量なベースイメージには nvcc が同梱されていないため、CUDA 開発用イメージを基盤とするまで、4 つのモデルすべてが同じく「nvcc が見つからない」というエラーで失敗しました。これは gpt-oss の特有の挙動ではなく、vLLM のバージョンに共通する問題でした。1 つのイメージ修正ですべての問題が解決しました。

gpt-oss-20b はネイティブの MXFP4 量子化（quantization）で動作し、24GB の L4 GPU に余裕を持って収まります。高価な GPU は不要です。また、回答を分析の前書きで囲むチャネル形式をサポートしているため、消費者側は最終的なチャネル部分を抽出する必要があります。

MiniCPM3 では trust_remote_code が必要でしたが、Nemotron は問題なくロードされました。モデルごとの落とし穴であり、それぞれが 1 行の構成設定で対応可能です。

情報非対称性にはファイアウォールが必要

制約されたメモリは、安価なドラマを生み出します

実際に何が起こったか

完全な v2 メカニクスを実装した代表者評議会による実行：

レバー

結果

評議会のモデル

4 機関が参加し、すべてが 32B の上限内で、Modal で提供されました。

微調整された 0.5B の信頼性

自己売買 0%、有効なオファー 100%（3B の教師モデルを上回る）

真実のファイアウォール

スキャンされたすべてのプロンプトにおいて、ヒントの隠しフラグの漏洩はゼロ

インサイダー・ヒントのエッジ

真のヒントによる事前配置は正の損益を確定させますが、偽のヒントはそうしません。

調査への熱意

2 つのクリーンな疑わしい勝利が裁判官の基準を超えました。

破滅

証拠保全（マージンコール）と貸付デフォルト禁止により、ある生物が追放され、数章後に復帰します。

*パトロンの行使、情報戦、人間関係、そしてレバレッジをエンドツーエンドで実行した単一のシードされたランです。*

小規模モデルでの構築における教訓

小規模モデルでも大冒険が可能。評議会全体もトレースもすべてオープンです。

原文を表示

Back to Articles

*A second Build Small Hackathon field report: what happens when each agent in an emergent economy runs on a different lab's small model, and the player becomes the financier pulling the strings.*

Heterogeneity is the product, not a constraint

Standing four distinct models up on one platform surfaced the real lesson: the friction is almost entirely at the serving layer, not the modeling layer.

Current vLLM (0.22.1) JIT-compiles kernels at load and needs the CUDA toolkit (nvcc) present. A lean base image does not ship it, so all four models failed identically with "could not find nvcc" until I based them on a CUDA devel image. This was not a gpt-oss quirk; it was universal to the vLLM version. One image fix unblocked everything.

gpt-oss-20b runs in its native MXFP4 quantization and fits a 24GB L4 with room to spare; no high-end GPU needed. It also speaks a channel format that wraps the answer in an analysis preamble, so the consumer has to extract the final channel.

MiniCPM3 needed trust_remote_code; Nemotron loaded clean. Per-model footguns, each a one-line config.

Information asymmetry needs a firewall

Memory is cheap drama if you bound it

What actually happened

A representative council run, with the full v2 mechanics live:

Lever

Result

Models in the council

4 labs, all under the 32B cap, served on Modal

Fine-tuned 0.5B reliability

0% self-buys, 100% valid offers (beats its 3B teacher)

Truth firewall

0 leaks of a tip's hidden flag across every prompt scanned

Insider tip edge

a true-tip pre-position settles a positive P&L; a false tip does not

Heat to investigation

two clean suspicious wins cross the magistrate's line

Ruin

a margin call and a loan default banish a creature, who returns a chapter later

*A single seeded run exercising the Patron, the information war, relationships, and leverage end to end.*

Takeaways for building with small models

Small models, big adventures. The whole council is open, and so are the traces.

この記事をシェア

Hugging Face Blog★42026年6月19日 03:13

MosaicLeaks：研究エージェントは秘密を守れるか？

Hugging Face は、AI エージェントが機密情報を漏洩するリスクを検証する「MosaicLeaks」という評価フレームワークを発表した。

Latent Space2026年6月20日 17:06

[AINews] 今日特に大きな出来事はありませんでした

TechCrunch AI★42026年6月20日 01:01

米国がアンソロピックの「Fable 5」発売を禁止、しかし市場は動じず

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む

5 つのラボ、5 つの思考：小規模モデルによるマルチモデル金融ドラマの構築

キーポイント

影響分析

編集コメント

多様性が製品であり、制約ではない

情報非対称性にはファイアウォールが必要

実際に何が起こったか

小規模モデルでの構築における教訓

Heterogeneity is the product, not a constraint

Information asymmetry needs a firewall

Memory is cheap drama if you bound it

What actually happened

Takeaways for building with small models

関連記事

5 つのラボ、5 つの思考：小規模モデルによるマルチモデル金融ドラマの構築

キーポイント

影響分析

編集コメント

多様性が製品であり、制約ではない

情報非対称性にはファイアウォールが必要

実際に何が起こったか

小規模モデルでの構築における教訓

Heterogeneity is the product, not a constraint

Information asymmetry needs a firewall

Memory is cheap drama if you bound it

What actually happened

Takeaways for building with small models

関連記事