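Claude Blog · March 3, 2026 09:00 · approx. 10 min

Improving skill-creator: Test, measure, and refine Agent Skills

#AIAgents #SkillDevelopment #NoCodeDevelopment #Anthropic #Claude #EvaluationAndTesting

TL;DR

Anthropic has added evaluation features to skill-creator, its tool for developing Agent Skills for Claude, so that skill authors who are not engineers can test, benchmark, and iteratively improve their skills.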

AI deep analysis · March 4, 2026 03:45

Importance (out of 5)

Depth: 40%

Key points

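Enhancements to the skill authoring tool

skill-creator now includes evaluation features, so skill authors can write tests, verify that a skill works, and improve its quality without writing code.

Two categories of skills

Skills are classified as "capability uplift skills" or "encoded preference skills". The two differ in why they need testing and in how durable they are.

Concrete uses for evals

Evals are used to catch quality regressions and to understand model progress, and they help solve concrete problems, as in the fix to the PDF skill.

A wider target audience

The tool is designed for users who have domain expertise but are not engineers, and it brings the rigor of software development to skill authoring without code.

Faster, more accurate evaluation with multiple agents

skill-creator speeds up evals by running independent agents in parallel, which also prevents context from bleeding between runs. It adds comparator agents for A/B comparisons, so the effect of a change can be judged objectively.

Tuning support for more accurate skill triggering

Analysis of a skill's description, together with suggested edits, reduces both false triggers (false positives) and missed triggers (false negatives), so that skills fire at the right time.

The shifting boundary between skill and specification

As models improve, skills may shift from detailed implementation instructions (SKILL.md) toward natural-language descriptions of intent, and the eval framework points in that direction.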


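Impact analysis

This enhancement democratizes skill development for AI agents and encourages domain experts without a technical background to create high-quality skills. That could broaden AI adoption inside companies and support a more diverse range of use cases.

Editorial comment

What stands out is that it offers a concrete solution to an important problem in making AI agents practical: supporting development by non-engineers.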

Improving skill-creator: Test, measure, and refine Agent Skills

Skill authors can now confirm that their skills work, detect regressions, and improve descriptions.

Category: Claude Code, Product announcements

Product: Claude Code

Date: March 3, 2026

Reading time: 5 min

Share: https://claude.com/blog/improving-skill-creator-test-measure-and-refine-agent-skills

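Skill-creator now supports writing evals and running benchmarks, so you can keep your skills working as models evolve. These updates are available in Claude.ai and Cowork, as a plugin for Claude Code, and in the repository.

Since releasing Agent Skills last October, we have found that most authors are subject matter experts rather than engineers. They know their workflows well, but they lack the tools to tell whether a skill still works with a new model, whether it triggers when needed, or whether it actually improved after an edit.

The skill-creator enhancements announced today let authors build with more confidence. They bring some of the rigor of software development (testing, benchmarking, iterative improvement) to skill authoring without requiring anyone to write code.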

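Two kinds of skills

Skills generally fall into two categories:

Capability uplift skills help Claude do something the base model cannot do, or cannot do consistently. Our document creation skills are a good example. They encode techniques and patterns that produce better output than prompting alone.

Encoded preference skills document workflows in which Claude can already perform each step; the skill sequences those steps according to your team's process. Examples include a skill that walks through a non-disclosure agreement (NDA) review against set criteria, and one that pulls data from various MCPs to draft a weekly update.

This distinction matters because the two kinds of skills may need testing for different reasons.

Capability uplift skills may become unnecessary as models improve. Evals tell you when that has happened.

Encoded preference skills are more durable, but their value depends on how faithfully they follow your actual workflow. Evals verify that fidelity.

Either way, testing turns a skill that seems to work into one you can be confident actually works.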

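Using evals to test and improve skills

Skill-creator now supports writing evals: tests that check whether Claude behaves as expected for a given prompt. If you have written software tests, this will feel familiar. Define a few test prompts (adding files where needed), describe what success looks like, and skill-creator judges whether the skill meets the requirements.

For example, our PDF skill previously struggled with non-fillable forms. Claude had to place text at exact coordinates with no defined fields to guide it. Evals isolated the failure, and we shipped a fix that anchors positioning to the coordinates of extracted text.

Evals help in many ways, but two uses are especially important: catching quality regressions and understanding model progress.

First, catching quality regressions. As models and the infrastructure around them evolve, a skill that worked well last month may behave differently today. Running evals against a new model gives you an early signal that something has changed before it affects your team's work.

Second, knowing when general model capabilities have outgrown your skill. This applies mainly to capability uplift skills. If the base model starts passing your evals without the skill loaded, that is a signal that the skill's techniques may have been incorporated into the model's default behavior. The skill is not broken; it is simply no longer necessary.

We have also added a benchmark mode that runs a standardized assessment using your evals. You can run it after a model update or while iterating on the skill itself. It tracks eval pass rate, elapsed time, and token usage.

Your evals and results stay with you. Store them locally, integrate them with a dashboard, or plug them into a CI system.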

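Faster, more consistent evaluation with multi-agent support

Running evals sequentially takes time, and context can bleed between test runs. With multi-agent support, skill-creator now launches independent agents that run evals in parallel. Each agent works in a clean context with its own token and timing metrics. Results arrive faster, with no cross-contamination.

We have also added comparator agents for A/B comparisons. They compare two versions of a skill, or a run with the skill against a run without it. Because they judge the outputs without knowing which is which, you can tell whether a change actually helped.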

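Getting skills to trigger at the right time

Evals measure output quality, but that only matters if the skill triggers when it is needed. As the number of skills grows, the precision of each description becomes critical: too broad and you get false triggers, too narrow and the skill never fires. Skill-creator helps you tune descriptions for more reliable triggering. It analyzes the current description against sample prompts and suggests edits that reduce both false positives and false negatives.

When we ran it across our document creation skills, triggering improved on 5 of the 6 public skills.

As models improve, the line between "skill" and "specification" may blur. Today, a SKILL.md file is essentially an implementation plan that gives Claude detailed instructions on how to do something. In the future, a natural-language description of what the skill should do may be enough, with the model working out the rest.

The eval framework released today is a step in that direction. Evals already describe the "what." Eventually, that description may itself be the skill.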

Getting started

All skill-creator updates are available now in Claude.ai and Cowork. Ask Claude to use skill-creator to get started.

Claude Code users can install the plugin or download it from the repository.


Original text

Improving skill-creator: Test, measure, and refine Agent Skills

Skill authors can now verify that their skills work, catch regressions, and improve descriptions.

Category: Claude Code, Product announcements

Product: Claude Code

Date: March 3, 2026

Reading time: 5 min

Share: https://claude.com/blog/improving-skill-creator-test-measure-and-refine-agent-skills

Skill-creator now helps you write evals, run benchmarks, and keep your skills working as models evolve. These updates are available now in Claude.ai and Cowork, as a plugin for Claude Code, and within our repo.

Since launching Agent Skills last October, we've noticed that most authors are subject matter experts, not engineers. They know their workflows but don't have the tools to tell whether a skill still works with a new model, triggers when it should, or if it actually improved after an edit.

Today we're announcing skill-creator enhancements that help authors build with more confidence. We are bringing some of the rigor of software development (testing, benchmarking, iterative improvement) to skill authoring without requiring anyone to write code.

Two kinds of skills

Skills generally fall into two categories:

Capability uplift skills help Claude do something the base model either can't do or can't do consistently. Our document creation skills are good examples. They encode techniques and patterns that produce better output than prompting alone.

Encoded preference skills document workflows where Claude can already do each piece, but the skill sequences them according to your team's process. Examples: a skill that walks through NDA review against set criteria, or one that drafts weekly updates with data from various MCPs.

This distinction matters because these two types of skills may need testing for different reasons:

Capability uplift skills may become less necessary as models improve. Evals tell you when that's happened.

Encoded preference skills are more durable, but only as valuable as their fidelity to your actual workflow. Evals verify that fidelity.

Either way, testing turns a skill that seems to work into one you know works.

Using evals to test and improve skills

Skill-creator now helps you write evals, which are tests that check Claude does what you expect for a given prompt. If you've written software tests, this will feel familiar: define some test prompts (plus files if needed), describe what good looks like, and skill-creator tells you whether the skill holds up.
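To make the shape of an eval concrete, here is a minimal Python sketch. It is not skill-creator's actual format, which the post does not specify: `EvalCase`, `run_evals`, and `fake_skill` are hypothetical names, and a Python predicate stands in for the natural-language description of "what good looks like".

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    """Hypothetical eval: a test prompt, optional files, and a check on the output."""
    name: str
    prompt: str
    check: Callable[[str], bool]  # stands in for "what good looks like"
    files: List[str] = field(default_factory=list)


def run_evals(cases: List[EvalCase], run_skill: Callable[[str, List[str]], str]) -> Dict[str, bool]:
    """Run each case through `run_skill` and record whether its check passed."""
    return {case.name: case.check(run_skill(case.prompt, case.files)) for case in cases}


def fake_skill(prompt: str, files: List[str]) -> str:
    """Stand-in for a real skill run; returns canned text only for form prompts."""
    return "Name: Ada Lovelace" if "form" in prompt else ""


cases = [
    EvalCase("fills_name", "Fill in the name on this form",
             lambda out: "Ada Lovelace" in out, ["form.pdf"]),
    EvalCase("non_empty", "Summarize the report", lambda out: len(out) > 0),
]
results = run_evals(cases, fake_skill)
print(results)
```

The second case fails on purpose: the stand-in skill returns nothing for the summary prompt, which is the kind of isolated failure an eval is meant to surface.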

Our PDF skill, for instance, previously struggled with non-fillable forms. Claude had to place text at exact coordinates with no defined fields to guide it. Evals isolated the failure, and we shipped a fix that anchors positioning to extracted text coordinates.

Evals help in many ways, but two important uses are to catch quality regressions and understand model progress.

First, catching regressions in quality. As models and the infrastructure around them evolve, a skill that worked well last month might behave differently today. Running evals against a new model gives you an early signal when something shifts before it impacts your team’s work.

Second, knowing when general model capabilities have outgrown your skill. This applies mainly to capability uplift skills. If the base model starts passing your evals without the skill loaded, that's a signal the skill's techniques may have been incorporated into the model's default behavior. The skill isn't broken; it's just no longer necessary.
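One way to read that signal is to run the same evals with and without the skill and compare pass rates. The sketch below assumes a hypothetical `skill_status` helper and an arbitrary 0.9 threshold; neither comes from skill-creator.

```python
from typing import List


def pass_rate(results: List[bool]) -> float:
    """Fraction of evals that passed (0.0 for an empty list)."""
    return sum(results) / len(results) if results else 0.0


def skill_status(with_skill: List[bool], without_skill: List[bool], threshold: float = 0.9) -> str:
    """Classify a skill from eval results with and without it loaded.

    The threshold is an assumption for illustration, not a recommended value.
    """
    if pass_rate(with_skill) < threshold:
        return "needs_work"            # the skill itself is failing its evals
    if pass_rate(without_skill) >= threshold:
        return "possibly_redundant"    # the base model passes without the skill
    return "still_useful"              # the skill makes the difference


print(skill_status([True] * 10, [True] * 5 + [False] * 5))
print(skill_status([True] * 10, [True] * 10))
```

A "possibly_redundant" result is only a prompt to look closer, since the post itself says the techniques "may" have been absorbed into the model's default behavior.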

We've also added a benchmark mode that runs a standardized assessment using your evals. This is something you can run after model updates or as you iterate on the skill itself. It tracks eval pass rate, elapsed time, and token usage.
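The three tracked metrics can be sketched as a small aggregation. The `EvalRun` record, the `summarize` and `regressed` helpers, and the 0.05 tolerance are all assumptions for illustration; the post does not describe how benchmark mode stores or compares results.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class EvalRun:
    """Hypothetical record of one eval run."""
    passed: bool
    elapsed_s: float
    tokens: int


def summarize(runs: List[EvalRun]) -> Dict[str, float]:
    """Aggregate pass rate, mean elapsed time, and total token usage."""
    if not runs:
        raise ValueError("no runs to summarize")
    return {
        "pass_rate": sum(r.passed for r in runs) / len(runs),
        "mean_elapsed_s": mean(r.elapsed_s for r in runs),
        "total_tokens": sum(r.tokens for r in runs),
    }


def regressed(before: Dict[str, float], after: Dict[str, float], tolerance: float = 0.05) -> bool:
    """Flag a drop in pass rate larger than the (assumed) tolerance."""
    return after["pass_rate"] < before["pass_rate"] - tolerance


old = summarize([EvalRun(True, 2.0, 100), EvalRun(True, 4.0, 300)])
new = summarize([EvalRun(True, 2.0, 100), EvalRun(False, 4.0, 300)])
print(new, regressed(old, new))
```

Comparing a summary taken before a model update with one taken after it is the regression check described earlier in this section.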

Your evals and results stay with you. Store them locally, integrate them with a dashboard, or plug them into a CI system.

Faster, more consistent evaluation with multi-agent support

Running evals sequentially can be slow, and accumulating context can bleed between test runs. Skill-creator now spins up independent agents to run evals in parallel with multi-agent support — each in a clean context with its own token and timing metrics. Faster results, no cross-contamination.
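The isolation idea can be illustrated with plain Python threads. This is only an analogy under stated assumptions: `run_one` is a hypothetical stand-in for an agent, its "context" is a local list, and the token count is a word count rather than real model usage.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def run_one(prompt: str) -> Dict[str, object]:
    """Run a single eval in a fresh context with its own metrics."""
    context: List[str] = []          # created per run, so nothing carries over
    start = time.perf_counter()
    context.append(prompt)
    output = f"echo: {prompt}"       # stand-in for the agent's real output
    return {
        "prompt": prompt,
        "output": output,
        "context_size": len(context),
        "tokens": len(prompt.split()),   # toy proxy for token usage
        "elapsed_s": time.perf_counter() - start,
    }


def run_parallel(prompts: List[str], max_workers: int = 4) -> List[Dict[str, object]]:
    """Run all prompts concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, prompts))


results = run_parallel(["fill out the form", "summarize"])
print([r["context_size"] for r in results])
```

Because each run builds its own context, the size stays at one entry per run regardless of how many evals execute alongside it.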

We've also added comparator agents for A/B comparisons: two skill versions, or skill vs. no skill. They judge outputs without knowing which is which, so you can tell whether a change actually helped.
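Blind judging can be sketched in a few lines: shuffle the two outputs, show the judge only the text, then map the verdict back to its source. `blind_compare` and the toy `longer_is_better` judge are hypothetical; a real comparator agent would be a model, not a length check.

```python
import random
from typing import Callable


def blind_compare(output_a: str, output_b: str,
                  judge: Callable[[str, str], int], rng: random.Random) -> str:
    """Return "A" or "B" for the preferred output, hiding the labels from the judge."""
    pair = [("A", output_a), ("B", output_b)]
    rng.shuffle(pair)                          # the judge cannot infer source from position
    choice = judge(pair[0][1], pair[1][1])     # judge returns 0 or 1 for the shuffled texts
    return pair[choice][0]                     # map the verdict back to the hidden label


def longer_is_better(first: str, second: str) -> int:
    """Toy judge for illustration only: prefers the longer text."""
    return 0 if len(first) >= len(second) else 1


winner = blind_compare("short", "a much longer answer", longer_is_better, random.Random(0))
print(winner)
```

The same function covers both comparisons named above: pass two skill versions, or pass a with-skill output and a no-skill output.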

Getting skills to trigger at the right time

Evals measure output quality, but that only matters if your skill triggers when it should. As your skill count grows, description precision becomes critical: too broad and you get false triggers, too narrow and it never fires. Skill-creator now helps you tune descriptions for more reliable triggering — it analyzes your current description against sample prompts and suggests edits that cut both false positives and false negatives.
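The broad-versus-narrow trade-off can be shown with labeled sample prompts. In this sketch a keyword matcher stands in for the model's decision to load a skill from its description, which is a strong simplification; `trigger_report` and `keyword_trigger` are hypothetical helpers.

```python
from typing import Callable, Dict, List, Tuple


def trigger_report(samples: List[Tuple[str, bool]],
                   triggers: Callable[[str], bool]) -> Dict[str, List[str]]:
    """List prompts that fired when they should not, and prompts that were missed."""
    return {
        "false_positives": [p for p, want in samples if triggers(p) and not want],
        "false_negatives": [p for p, want in samples if not triggers(p) and want],
    }


def keyword_trigger(keywords: List[str]) -> Callable[[str], bool]:
    """Toy trigger: fires if any keyword appears in the lowercased prompt."""
    return lambda prompt: any(k in prompt.lower() for k in keywords)


samples = [
    ("Fill out this PDF form", True),
    ("Merge two PDF files", True),
    ("Write a poem about forms of life", False),
]
broad = trigger_report(samples, keyword_trigger(["form", "pdf"]))
narrow = trigger_report(samples, keyword_trigger(["pdf form"]))
print(broad)
print(narrow)
```

The broad matcher fires on the poem prompt, while the narrow one misses the merge request, so tuning has to count both kinds of error on the same sample set.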

We ran it across our document-creation skills and saw improved triggering on 5 out of 6 public skills.

As models improve, the line between "skill" and "specification" may blur. Today, a SKILL.md file is essentially an implementation plan, providing detailed instructions telling Claude how to do something. Over time, a natural-language description of what the skill should do may be enough, with the model figuring out the rest.

The eval framework we're releasing today is a step in that direction. Evals already describe the "what." Eventually, that description may be the skill itself.

Getting Started

All skill-creator updates are available now on Claude.ai and Cowork. Ask Claude to use the skill-creator to get started.

Claude Code users can install the plugin or download from our repo.
