MarkTechPost · June 21, 2026, 08:04 · About a 13-minute read

Cisco AI Introduces FAPO: Pipeline-Aware Prompt Optimization With Step-Level Failure Attribution and Claude Code Orchestration

#LLM #Prompt Engineering #Open Source #Autonomous Agents #LangGraph

TL;DR

Cisco AI has open-sourced FAPO, a system that uses Claude Code to locate failures in multi-step LLM pipelines and automatically optimize their prompts and structure.

AI Deep Analysis · June 21, 2026, 09:02


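Key points

Automated optimization loop

Claude Code agents drive the process, autonomously running evaluation, failure attribution, variant proposal, and validation, and iterating until the target accuracy is reached.

Staged intervention strategy

FAPO takes a tiered approach: prompt tweaks come first, and if they bring no improvement it escalates to parameter changes and, ultimately, to changes in the chain structure itself.

Comparative evaluation against a competing method

In Cisco's evaluation, FAPO beat GEPA, an existing state-of-the-art method, in 15 of 18 comparisons, and on the benchmarks that required pipeline structure changes it recorded a mean gain of 33.8 points.

Overfitting prevention and isolated design

Strict guardrails against overfitting are built in: inspection limited to training-split cases, immutable variant files, and an independent reviewer that checks each proposal.

FAPO's tiered optimization approach

It starts with low-cost prompt edits and, once the cause of a failure has been identified, escalates step by step to parameter changes and structural changes (for example, adding a self-reflection node).

Mapping failure classes to fixes

A clear rule keeps optimization efficient: format and reasoning problems are fixed in the prompt, while retrieval and cascading failures are fixed with structural changes.

Demonstrated gains over GEPA

FAPO won 15 of 18 comparisons, and where pipeline structure changes were required it beat GEPA by a mean of 33.8 points.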


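Impact analysis

This release targets the most time-consuming bottlenecks in LLM application development: prompt engineering and debugging. For complex multi-step pipelines in particular, where pinpointing the cause of a failure by hand has been difficult, having an AI agent search for fixes autonomously could dramatically accelerate industrial adoption of LLMs.

Editor's comment

This is a practical autonomous optimization framework that overturns the received wisdom that prompt tuning comes down to luck. Its ability to automate even changes to pipeline structure should make it immediately useful for building complex business systems.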



Getting prompts right is still the hardest part of shipping reliable LLM applications. Small wording changes can swing accuracy by 20 percent. What works on a few examples often breaks at scale. When a multi-step pipeline returns a wrong answer, finding the failing step means inspecting intermediate outputs by hand.

Cisco AI introduced FAPO to address that bottleneck. FAPO stands for Fully Automated Prompt Optimization. It is a Claude Code-driven system that optimizes LLM pipelines from baseline prompts to target accuracy. You supply a dataset and an initial prompt. FAPO then evaluates, classifies failures, proposes variants, validates them, and iterates. The whole loop is orchestrated by Claude Code agents. The project ships open source under Apache 2.0, and also supports Codex as the optimization agent.

In Cisco’s reported evaluation, FAPO beat GEPA, a state-of-the-art prompt optimizer, on 15 of 18 model-benchmark comparisons. On the two benchmarks where FAPO escalated to pipeline changes, the mean gain over GEPA reached +33.8pp.

TL;DR

FAPO is a Claude Code-driven system that autonomously optimizes multi-step LLM pipelines from baseline prompts to target accuracy, open source under Apache 2.0.

It escalates through three levels — prompt, parameter, then chain structure — using step-level failure attribution to decide what to change next.

In Cisco’s evaluation, FAPO beat GEPA on 15 of 18 model-benchmark comparisons, with a +14.1pp mean gain.

On HoVer and IFBench, where it escalated to pipeline changes, FAPO won all six pairs at a +33.8pp mean gain; AIME was GEPA’s only win, within sampling noise.

Guardrails against overfitting include training-split-only inspection, immutable variant files, and an independent reviewer on every proposal.

What is FAPO

FAPO is a multi-tenant evaluation and optimization framework. A tenant is a self-contained optimization project. Each tenant directory holds one task’s prompts, dataset, chain definition, scorer, and config. Tenants stay isolated, so unrelated tasks optimize side by side without interference.

The core engine is named hephaestus and is domain-agnostic. It handles evaluation, chain execution, and scoring. Chains are LangGraph state graphs that process each test case. Out of the box, FAPO supports three providers: OpenAI, Baseten, and SageMaker.

The one input you must bring is a dataset: paired inputs and expected outputs that define success. FAPO splits it into a validation set and a held-out test set. The validation set drives iteration; the test set is used only for a final one-shot evaluation. From a task description, Claude can scaffold the rest: the initial prompt, the chain, and the scorer.
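The split into a validation set and a held-out test set can be pictured with a small sketch. The article does not show FAPO's own splitting logic, so the function name `split_dataset`, the 80/20 ratio, and the fixed seed below are assumptions made for illustration only.

```python
import random


def split_dataset(cases, test_fraction=0.2, seed=0):
    """Shuffle cases deterministically and return (validation, test).

    Hypothetical helper: the ratio and seed are illustrative choices,
    not values taken from FAPO.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(cases)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


cases = [{"case_id": str(i)} for i in range(10)]
validation, test = split_dataset(cases)
```

Fixing the seed keeps the held-out set stable across runs, which matters if the test set is meant to be touched only once at the end.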

How the Optimization Loop Works

Once the pieces exist, FAPO runs a closed loop until target accuracy is reached. Each cycle runs six stages:

Evaluate — run the chain on the dataset, collect per-case scores and step-level outputs.

Attribute — classify failures by root cause using rule-based heuristics plus LLM analysis.

Propose — generate a variant targeting the dominant failure cluster.

Review — an independent agent validates the proposal for scope compliance and data leakage.

Compare — accept the variant only if it improves on the previous best, otherwise reject.

Iterate — continue until target accuracy is reached or the optimization budget is exhausted.
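The six stages above can be sketched as a plain Python loop. This is not FAPO's code: `Variant`, `optimize`, and the four callables are hypothetical stand-ins, and the toy functions at the bottom exist only so the loop can run end to end.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Variant:
    name: str
    prompt: str


def optimize(
    baseline: Variant,
    evaluate: Callable[[Variant], float],             # score on a 0-100 scale
    attribute: Callable[[Variant], str],              # dominant failure class
    propose: Callable[[Variant, str, int], Variant],  # builds a new variant
    review: Callable[[Variant], bool],                # independent approval check
    target: float = 90.0,
    budget: int = 5,
):
    best, best_score = baseline, evaluate(baseline)            # Evaluate
    history = [(baseline.name, best_score, "baseline")]
    for step in range(1, budget + 1):                          # Iterate
        if best_score >= target:
            break
        failure = attribute(best)                              # Attribute
        candidate = propose(best, failure, step)               # Propose
        if not review(candidate):                              # Review
            history.append((candidate.name, None, "rejected by reviewer"))
            continue
        score = evaluate(candidate)
        if score > best_score:                                 # Compare
            best, best_score = candidate, score
            history.append((candidate.name, score, "accepted"))
        else:
            history.append((candidate.name, score, "rejected"))
    return best, best_score, history


# Toy stand-ins: every added instruction line is worth 20 points.
def toy_evaluate(variant):
    return min(100.0, 40.0 + 20.0 * variant.prompt.count("\n"))


def toy_attribute(variant):
    return "format"


def toy_propose(variant, failure, step):
    return Variant(f"v{step}", variant.prompt + f"\nAddress {failure} failures (round {step}).")


def toy_review(variant):
    return "expected answer" not in variant.prompt  # crude leakage check


best, score, history = optimize(
    Variant("v0", "Answer the question."),
    toy_evaluate, toy_attribute, toy_propose, toy_review,
)
```

Passing the stages in as callables keeps the control flow visible; in the real system each stage is carried out by an agent or by the evaluation engine rather than by a local function.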

The system works at three escalating levels. Prompt edits are lowest cost and tried first. Parameter changes adjust config values like retrieval_k or temperature. Structural changes alter chain topology, such as adding a self-reflection node or switching to a ReAct pattern. FAPO exhausts one level before escalating to the next.
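One way to picture "exhausting a level before escalating" is a small state machine. The article does not say how FAPO decides a level is exhausted, so the `patience` counter below (a fixed number of consecutive non-improving attempts) is an assumption, as are the class and method names.

```python
LEVELS = ("prompt", "parameter", "structural")


class EscalationPolicy:
    """Hypothetical policy: escalate after `patience` misses in a row."""

    def __init__(self, patience=2):
        self.patience = patience
        self.level_index = 0
        self.misses = 0

    @property
    def level(self):
        return LEVELS[self.level_index]

    def record(self, improved):
        """Update state after one attempt and return the level to use next."""
        if improved:
            self.misses = 0
        else:
            self.misses += 1
            can_escalate = self.level_index < len(LEVELS) - 1
            if self.misses >= self.patience and can_escalate:
                self.level_index += 1
                self.misses = 0
        return self.level


policy = EscalationPolicy(patience=2)
outcomes = (True, False, False, False, False, False)
trace = [policy.record(improved) for improved in outcomes]
```

With a patience of two, the run above stays on prompt edits while they help, then moves to parameters and finally to structural changes.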

Step attribution sorts failures into four classes. Retrieval failures return empty or irrelevant content. Cascading failures begin when an early step produces empty output. Format failures hide the correct answer inside text the scorer cannot parse. Reasoning failures occur when good inputs still produce a wrong conclusion. Format and reasoning issues can be addressed in the prompt. Retrieval and cascade issues need structural fixes.
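A rule-based first pass over these four classes might look like the sketch below. The step names, the shape of the step trace, and the rules themselves are assumptions for illustration; the article says FAPO combines heuristics with LLM analysis, and judging whether retrieved content is irrelevant would fall to the LLM side, which is not modeled here.

```python
# Which optimization level each failure class points to, per the article.
ADDRESSED_BY = {
    "retrieval": "structural",
    "cascade": "structural",
    "format": "prompt",
    "reasoning": "prompt",
}


def classify_failure(steps, final_output, expected):
    """Classify one failed case.

    steps: ordered list of (step_name, output) pairs (hypothetical trace shape).
    """
    # Retrieval: the retrieval step came back with nothing.
    for name, output in steps:
        if name == "retrieve" and not output:
            return "retrieval"
    # Cascade: some step before the last one produced empty output.
    for name, output in steps[:-1]:
        if not output:
            return "cascade"
    # Format: the right answer is present but wrapped in extra text.
    if expected.strip().lower() in final_output.lower():
        return "format"
    # Otherwise the inputs were usable and the conclusion was wrong.
    return "reasoning"


empty_retrieval = classify_failure(
    [("retrieve", []), ("extract", ""), ("answer", "unknown")], "unknown", "Paris")
cascade = classify_failure(
    [("retrieve", ["doc"]), ("extract", ""), ("answer", "unknown")], "unknown", "Paris")
wrapped = classify_failure(
    [("retrieve", ["doc"]), ("extract", "Paris is the capital"), ("answer", "The answer is Paris.")],
    "The answer is Paris.", "Paris")
wrong = classify_failure(
    [("retrieve", ["doc"]), ("extract", "Lyon is a city"), ("answer", "Lyon")], "Lyon", "Paris")
```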

Guardrails keep the optimizer from overfitting. It inspects only training-split cases, while validation and test expose aggregate scores only. Every variant is a new immutable file, never edited in place. An independent reviewer checks each proposal before it runs.
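The "new immutable file, never edited in place" rule can be approximated with Python's exclusive-create file mode. This is only a sketch of the idea; the helper name `write_variant` and the `.txt` naming are hypothetical and not taken from the FAPO repository.

```python
import tempfile
from pathlib import Path


def write_variant(directory, name, prompt_text):
    """Create a new variant file and refuse to overwrite an existing one."""
    path = Path(directory) / f"{name}.txt"
    # Mode "x" raises FileExistsError if the file is already there.
    with open(path, "x", encoding="utf-8") as handle:
        handle.write(prompt_text)
    return path


with tempfile.TemporaryDirectory() as tmp:
    first = write_variant(tmp, "v001", "Answer concisely.")
    try:
        write_variant(tmp, "v001", "Edited in place.")
        overwritten = True
    except FileExistsError:
        overwritten = False
    saved_text = first.read_text(encoding="utf-8")
```

Because every attempt gets its own file, a run leaves behind a complete trail of what was tried, which is what makes the later audit possible.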

The Benchmark Case: FAPO vs. GEPA

The Cisco team evaluated FAPO against GEPA (Genetic-Pareto), a state-of-the-art prompt optimization method. GEPA uses evolutionary search with genetic operators to optimize prompts for multi-step pipelines. Both systems started from identical baseline pipelines and prompts. FAPO could escalate to structural changes when attribution found bottlenecks. GEPA was limited to prompt-level optimization.

The comparison spanned six benchmarks and three task models: GPT-4.1-mini, GPT-5.4-mini, and Gemma 3-12B. Claude Opus 4.6 served as both FAPO’s orchestrator and GEPA’s reflector. Scores below are averaged across the three task models.

Benchmark | Baseline | GEPA | FAPO | Gain vs. GEPA
HoVer | 35.9 | 48.5 | 83.8 | +35.3pp
IFBench | 35.7 | 48.5 | 80.7 | +32.2pp
LiveBench-Math | 51.0 | 52.6 | 62.0 | +9.4pp
HotpotQA | 50.9 | 61.8 | 68.3 | +6.5pp
Papillon | 73.6 | 90.7 | 94.9 | +4.2pp
AIME | 16.7 | 16.0 | 12.9 | -3.1pp

FAPO won 15 of 18 model-benchmark comparisons, with a mean gain of +14.1pp over GEPA. On HoVer and IFBench, where FAPO escalated to pipeline changes, it won all six model-benchmark pairs. The mean gain there was +33.8pp. On the four benchmarks without structural changes, FAPO still won 9 of 12 through prompt optimization alone. AIME was the only benchmark where GEPA led, by 3.1pp. The gap is smaller than the standard deviation across stochastic trials.
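The headline gains can be rechecked from the per-benchmark averages reported in the table. The short script below uses only those published numbers; the variable and function names are illustrative. Because each benchmark contributes the same three task models, averaging the six per-benchmark gains gives the same result as averaging over all 18 pairs.

```python
from decimal import Decimal, ROUND_HALF_UP

# benchmark: (baseline, GEPA, FAPO), averaged over the three task models
SCORES = {
    "HoVer": ("35.9", "48.5", "83.8"),
    "IFBench": ("35.7", "48.5", "80.7"),
    "LiveBench-Math": ("51.0", "52.6", "62.0"),
    "HotpotQA": ("50.9", "61.8", "68.3"),
    "Papillon": ("73.6", "90.7", "94.9"),
    "AIME": ("16.7", "16.0", "12.9"),
}


def mean_pp(values):
    """Mean in percentage points, rounded half-up to one decimal place."""
    mean = sum(values) / len(values)
    return mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# Gain of FAPO over GEPA per benchmark, kept exact with Decimal.
gains = {
    name: Decimal(fapo) - Decimal(gepa)
    for name, (_baseline, gepa, fapo) in SCORES.items()
}
overall = mean_pp(list(gains.values()))
escalated = mean_pp([gains["HoVer"], gains["IFBench"]])

print(f"mean gain over GEPA: +{overall}pp")
print(f"mean gain on HoVer and IFBench: +{escalated}pp")
```

`Decimal` is used rather than floats so that the one-decimal rounding does not depend on binary representation.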

A capability comparison shows the design difference reported by Cisco. Every row below reflects the source description of the two systems.

Capability | GEPA | FAPO
Optimization levels | Prompt text only | Prompt → parameter → structural
Can change chain structure | No | Yes, when attribution finds bottlenecks
How it is driven | Evolutionary search with genetic operators | Claude Code or Codex agent loop
Result across 18 model-benchmark pairs | Reference | Wins 15 of 18; +14.1pp mean

Where It Fits: Use Cases

FAPO targets multi-step LLM pipelines, not single prompts. A few concrete examples:

Multi-hop question answering: A chain retrieves documents, extracts facts, reasons over evidence, and formats an answer. In Cisco’s documented walkthrough, a multi-hop QA chain rose from 39.3% to 70.3% validation exact match across two iterations. Attribution then flagged the remaining failures as retrieval-limited, signaling a structural fix. Separately, on the HotpotQA benchmark, FAPO reached 68.3% test accuracy versus GEPA’s 61.8%.

Instruction following: On IFBench, format-constraint failures pushed FAPO to escalate beyond prompts, reaching 80.7% test accuracy.

Classification: A software-name-to-category task can be scaffolded by Claude Code, then optimized to exact-match targets.

ReAct agents: An MCP workflow extension optimizes a tool-calling ReAct agent using trajectory scoring and LLM-as-Judge scoring.

Getting Started

The fastest path is to let Claude Code create the tenant files. From the repo, describe your task in plain English, then add a JSONL dataset. Each line is one test case with case_id, task_type, context, expected, and metadata:

{"case_id": "1", "task_type": "qa", "context": {"question": "What is the capital of France?"}, "expected": {"answer": "Paris"}, "metadata": {}}
{"case_id": "2", "task_type": "qa", "context": {"question": "What is 2 + 2?"}, "expected": {"answer": "4"}, "metadata": {}}
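Before handing a dataset to the optimizer, it can help to check that every line parses and carries the five fields listed above. The loader below is a standalone sketch using only the standard library; `load_cases` is a hypothetical helper, not part of hephaestus.

```python
import json

REQUIRED_FIELDS = ("case_id", "task_type", "context", "expected", "metadata")


def load_cases(lines):
    """Parse JSONL lines, skip blank ones, and check the required fields."""
    cases = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        case = json.loads(line)
        missing = [field for field in REQUIRED_FIELDS if field not in case]
        if missing:
            raise ValueError(f"line {number}: missing fields {missing}")
        cases.append(case)
    return cases


sample = [
    '{"case_id": "1", "task_type": "qa", "context": {"question": "What is the capital of France?"}, '
    '"expected": {"answer": "Paris"}, "metadata": {}}',
    "",
    '{"case_id": "2", "task_type": "qa", "context": {"question": "What is 2 + 2?"}, '
    '"expected": {"answer": "4"}, "metadata": {}}',
]
cases = load_cases(sample)
```

To read a real file, pass an open file handle, since iterating over it yields one line at a time.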

A scorer compares the chain output to the expected answer. It implements validate_case to catch bad data early and score_case to return a composite score:

from hephaestus.scoring.scorer import Scorer as BaseScorer


class Scorer(BaseScorer):
    def validate_case(self, case, scoring_profile):
        # Catch malformed cases early.
        assert "answer" in case.expected, "Missing 'answer' in expected"

    def score_case(self, case, output_text, scoring_profile):
        # Case-insensitive exact match after trimming whitespace.
        expected = case.expected["answer"].strip().lower()
        predicted = output_text.strip().lower()
        em = 100.0 if predicted == expected else 0.0
        return {"composite_score": em, "score_breakdown": {"exact_match": em}}
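To see how strict exact match behaves without installing hephaestus, the same comparison can be written as a plain function. `exact_match_score` and the `SimpleNamespace` case are stand-ins invented for this sketch; only the comparison logic mirrors the scorer above.

```python
from types import SimpleNamespace


def exact_match_score(case, output_text):
    """Same comparison as score_case, as a standalone function."""
    expected = case.expected["answer"].strip().lower()
    predicted = output_text.strip().lower()
    em = 100.0 if predicted == expected else 0.0
    return {"composite_score": em, "score_breakdown": {"exact_match": em}}


case = SimpleNamespace(expected={"answer": "Paris"})
clean = exact_match_score(case, "  paris ")
wordy = exact_match_score(case, "The answer is Paris.")
```

The second call scores zero even though the answer is present, which is the kind of format failure the attribution step is meant to surface.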

Verify the setup with a baseline evaluation:

export OPENAI_API_KEY="sk-..."
python -m hephaestus.cli eval --config tenants/my_project/configs/eval.json

Then invoke the optimization agent with a tenant, config, and success criteria such as composite_score >= 90. Claude Code produces a scope contract, then iterates autonomously. Every prompt variant, config, and per-variant analysis is written to disk, so each run stays auditable. A local read-only UI called FAPO Explorer browses the artifacts afterward.

Strengths and Weaknesses

Strengths

Pipeline-aware scoring attributes failures to the step that caused them, not just the final output.

Three-level escalation handles failures that prompts alone cannot fix.

Guardrails against overfitting: training-split-only inspection, immutable variants, and an independent reviewer.

Open source under Apache 2.0, with both Claude Code and Codex supported.

Weaknesses

Optimization quality is bounded by the dataset’s quality and coverage, which you must supply.

The project is recent, so independent production track records are still limited.

The default loop depends on agentic coding tools (Claude Code or Codex) rather than a standalone optimizer.


The post Cisco AI Introduces FAPO: Pipeline-Aware Prompt Optimization With Step-Level Failure Attribution and Claude Code Orchestration appeared first on MarkTechPost.

