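Hugging Face Blog · January 28, 2026 09:00 · approx. 2 min read

We got Claude to build CUDA kernels and teach open models!

#LLM #agent-skills #CUDA #open-source-models #Hugging Face

TL;DR

Hugging Face introduces "upskill", a method that uses Claude Opus 4.5 as a teacher model to generate and share "skills" that let open-source models learn the advanced task of writing CUDA kernels.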

AI deep analysis · May 2, 2026 03:07

Key points

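The concept and purpose of agent skills

Agent skills define model context as files (Markdown and scripts), making them a practical way to share and reuse capabilities for specific domains and hard problems.

Skill generation with Claude as the teacher model

Claude Code (Opus 4.5) is used interactively to build a CUDA kernel, and the session is traced to extract the steps needed to solve a complex task.

Transfer to open-source models

The generated skill file is used to improve the performance of lightweight, locally run open-source models on CUDA development tasks that would otherwise be difficult for them.

Verifying the effect and limits of skills

A simple documentation-based skill does not improve performance for every model; in some cases it degrades performance or increases token usage, so iterative refinement is needed.

Creating and evaluating skills

A working skill can be produced by having the agent write a skill file right after it completes the task, or by using Anthropic's "skill creator" or the upskill tool.

Knowledge transfer to open-source models and optimization

It is important to apply the resulting skill to smaller, cheaper open-source models and evaluate whether they can keep accuracy while using fewer tokens.

Full support for the CUDA kernel build workflow

Beyond code generation alone, the provided skill covers the full workflow, including project structure and build configuration, and generates kernels optimized for H100.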


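Impact analysis

This article presents a new paradigm in which LLM capabilities are not "trained in" but externalized and shared as "skill files". It could greatly widen the use of open-source models in specialized, close-to-hardware programming such as CUDA, and is expected to accelerate ecosystems that can run advanced tasks in resource-constrained environments.

Editorial comment

Rather than simply using a high-performance model, extracting its problem-solving process as a "skill" and handing it down to lighter models is a notable advance in both cost efficiency and generality. The results also show that it does not work in every case, so careful testing is needed in practice.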


We got Claude to teach open models how to write CUDA kernels!

The best thing about agent skills is upskilling your agents on hard problems. There are two ways to look at that:

You can take Opus 4.5 or other SOTA models and tackle the hardest problems out there.

You can take models that run on your laptop and upskill them to harder problems. In this blog post, we’ll show you how to take on the latter.

This blog post walks through the process of using a new tool, upskill

What are agent skills?

In case you missed it, agent skills are taking the coding agent game by storm. The concept is straightforward: define model context as files, with instructions as markdown and code as scripts. The file format makes them easy to generate, share, and review. In short, they're a practical medium for sharing capabilities across models and tools, and they're most useful in specific domains or on hard problems, not on things the model can already do well.

This post showcases the process by using Claude to generate a skill file that open source models can use for a complex and specialized task: writing CUDA kernels. We first tried a simple skill based on existing documentation and found that it improved performance for some models, but not all. In fact, it could even degrade performance or increase token usage for some models. Check out the plot below to see the performance of the model with and without the basic skill.

Now, let's walk through how you can use upskill

Get the teacher (Claude Opus 4.5) to build a kernel

First, we use Claude Code to build a kernel interactively and export the trace. We worked through the process by instructing, validating, and adding documentation links. This somewhat naive process is important to reveal the models' initial challenges. In fact, you can iterate on this multiple times, by trying to solve the task with draft versions of the skill, and experimenting with smaller models. Each time, you can instruct the agent to improve the skill and test it on the smaller model.

Here's an example of the skill that we created and have been using to build kernels. We started from this agent trace where the agent was able to build a kernel, but not without some help.

Make an agent skill from the trace

Once the teacher model has performed the task, we need it to make a skill. There are a number of effective ways to do this.

Within the same session, instruct the agent to create a skill file for the task it just completed.

Use Anthropic's 'skill creator' skill, either within the agent session or with an exported trace and a new agent session.

Use the upskill

In most cases, the first 2 options result in functional skills. However, the performance of an agent with the skill is unknown. That’s where upskill

Take your skill to an open source, smaller, or cheaper model

Finally, we need to transfer our newly created skill to the tool or model we intend to use. Most tools like codex

{agent}/skills/{skill_name}/SKILL.md
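As a rough illustration of that layout, the Python sketch below copies a skill directory into an agent's skills folder. The `.my-agent` directory name and the `install_skill` helper are hypothetical; each tool defines its own `{agent}` directory, so check its documentation before relying on this.

```python
import shutil
import tempfile
from pathlib import Path


def install_skill(skill_dir: Path, agent_dir: Path) -> Path:
    """Copy a skill into {agent}/skills/{skill_name}/ and return the SKILL.md path."""
    if not (skill_dir / "SKILL.md").is_file():
        raise FileNotFoundError(f"{skill_dir} does not contain a SKILL.md")
    target = agent_dir / "skills" / skill_dir.name
    # dirs_exist_ok lets a newer version of the skill overwrite an older copy.
    shutil.copytree(skill_dir, target, dirs_exist_ok=True)
    return target / "SKILL.md"


# Demo inside a temporary directory so nothing on disk is touched.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    skill = root / "kernel-builder-cuda-kernels"
    skill.mkdir()
    (skill / "SKILL.md").write_text("# Building CUDA Kernels with kernel-builder\n")

    # ".my-agent" is a placeholder for the tool-specific agent directory.
    installed = install_skill(skill, root / ".my-agent")
    installed_rel = installed.relative_to(root).as_posix()
    print(installed_rel)
```

The skill name is taken from the directory name, which matches the `{skill_name}` segment of the path pattern above.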

In this case, we might want to iterate further on the gpt-oss

upskill generate --from {skill}

There is more to agent skills than model performance. Often an agent can reach a given accuracy with or without a skill; it just needs to consume more tokens to get there. For recurring tasks, we want to optimize agents to use fewer tokens to achieve the same accuracy. The results below reveal another dimension of the skill: some models significantly reduce their token usage, whilst others use more tokens with the skill. For example, with moonshotai/Kimi-K2-Thinking

tldr; try out and evaluate models with the skills you create. Use upskill eval

That’s the high level end to end of upskilling your coding agents on hard problems. Try out upskill now like this:

# install upskill
pip install upskill

# or use uvx
uvx upskill --help

# generate a skill based on an agent trace
upskill generate "write nvidia kernels" --from ./trace.md

# evaluate models on a skill
upskill eval ./skills/my-skill/ --model haiku --model sonnet

# generate skills for local models
upskill generate "parse YAML" --model opus --eval-model "unsloth/GLM-4.7-Flash-GGUF:Q4_0" --eval-base-url http://localhost:8080/v1

Deep dive tutorial into building kernels with agent skills

We have a high level understanding of how we can upskill an agent. Let’s now look at the use case we solved for writing CUDA kernels.

We didn’t just want to write kernel code, but understand the full kernel-builder workflow: project structure, build.toml

The kernel-builder-cuda-kernels

With this skill, you can tell Claude things like:

Build a fused LayerNorm + GELU kernel optimized for H100.

And Claude will create the complete project structure, CUDA implementation, and build configuration—following the exact conventions that kernel-builder expects.

This isn't about generating boilerplate. The skill encodes domain expertise: H100 uses compute capability 9.0, shared memory should be aligned to 128 bytes, async memory copies require __CUDA_ARCH__ >= 900

Setup and Install

Install upskill:

pip install upskill

# or use uvx for one-off runs
uvx upskill --help

Set your API keys:

export ANTHROPIC_API_KEY=sk-ant-...
export HF_TOKEN=hf_...

That's it. upskill uses Anthropic's Claude Opus 4.5 model by default, but it also supports OpenAI and local models via OpenAI-compatible endpoints as generators. We want to use the more expensive, higher-quality models to generate skills and the smaller ones to use them. Think Robin Hood.

Skill Generation

Let's walk through generating a skill that teaches agents how to build CUDA kernels with HuggingFace's kernels

Generate the Skill

Start with a clear task description:

upskill generate "build optimized CUDA kernels for PyTorch using HuggingFace kernel-builder"

Above we used upskill, but it could in fact be any agent or chat tool and an exported trace.

upskill generate "write kernels" --from <agent-trace>.md

Also, we could start from an existing skill and add to it:

upskill generate "add more error handling and edge cases" --from ./skills/kernel-builder-cuda-kernels/

upskill loads the existing skill, applies your improvements, and re-evaluates to ensure the changes help.

upskill creates a skill, generates test cases, evaluates performance, and refines based on failures:

Generating skill with sonnet...
Generating test cases...
Evaluating on sonnet... (attempt 1)
  60% -> 95% (+35%) OK

  kernel-builder-cuda-kernels
  Build optimized CUDA kernels for PyTorch using HuggingFace kernel-builder.

  SKILL.md              ~520 tokens

  baseline   ████████████ 60%
  with skill ███████████████████ 95%  (+35%)

Saved to ./skills/kernel-builder-cuda-kernels

The baseline shows how the model performs without any skill. The "with skill" result shows performance after the skill is injected into context. Going from 60% to 95%, an improvement of 35 percentage points, means the skill is working.
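To make the arithmetic behind that number explicit, here is a minimal Python sketch. The function names are our own and are not part of upskill; the sketch only shows that the reported "+35%" is a difference in pass rates (percentage points), which is not the same as the relative gain over the baseline.

```python
def skill_lift(baseline: float, with_skill: float) -> float:
    """Absolute change in pass rate, in percentage points."""
    return round((with_skill - baseline) * 100, 1)


def relative_gain(baseline: float, with_skill: float) -> float:
    """Change in pass rate relative to the baseline, in percent."""
    if baseline == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return round((with_skill - baseline) / baseline * 100, 1)


# Pass rates taken from the output above: 60% without the skill, 95% with it.
print(skill_lift(0.60, 0.95))     # 35.0
print(relative_gain(0.60, 0.95))  # 58.3
```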

The skill is saved as a directory following the Agent Skills specification:

./skills/kernel-builder-cuda-kernels/
├── SKILL.md           # Main instructions (~520 tokens)
└── skill_meta.json    # Metadata and test cases

---
name: kernel-builder-cuda-kernels
description: Build optimized CUDA kernels for PyTorch using HuggingFace kernel-builder.
---

# Building CUDA Kernels with kernel-builder

## Overview

This guide explains how to create optimized CUDA kernels for PyTorch models using HuggingFace's kernel-builder. It covers project setup, kernel implementation, and building for specific GPU architectures like NVIDIA H100.

## Project Structure

project/
├── build.toml              # Build configuration
├── kernel_src/             # CUDA kernel implementations
│   ├── attention.cu
│   ├── layernorm.cu
│   └── geglu.cu
└── torch-ext/              # PyTorch C++ bindings
    └── torch_binding.cpp

## Build Configuration

Create build.toml to define your kernel package:

[general]
name = "diffuser_kernels"
backends = ["cuda"]

[general.cuda]
# H100 is compute capability 9.0
capabilities = ["9.0"]
...
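The skill file above opens with a small metadata header between `---` lines, followed by the Markdown instructions. As a sketch of how a tool might separate the two, the Python below reads simple `key: value` pairs without a YAML library. `parse_skill_front_matter` is a hypothetical helper of ours, and it assumes flat, single-line values like the ones shown here; a real implementation would likely use a proper YAML parser.

```python
def parse_skill_front_matter(text: str) -> tuple[dict[str, str], str]:
    """Split SKILL.md text into (metadata, markdown body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("SKILL.md must start with a '---' line")

    # Find the closing '---' of the metadata header.
    end = next((i for i in range(1, len(lines)) if lines[i].strip() == "---"), None)
    if end is None:
        raise ValueError("metadata header is not closed with '---'")

    meta: dict[str, str] = {}
    for line in lines[1:end]:
        if not line.strip():
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"expected 'key: value', got {line!r}")
        meta[key.strip()] = value.strip()

    body = "\n".join(lines[end + 1:]).strip()
    return meta, body


skill_text = """---
name: kernel-builder-cuda-kernels
description: Build optimized CUDA kernels for PyTorch using HuggingFace kernel-builder.
---
# Building CUDA Kernels with kernel-builder
"""

meta, body = parse_skill_front_matter(skill_text)
print(meta["name"])
print(body)
```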

Evaluate on a Different Model

The important test is: does this skill help local or cheaper models to build kernels?

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf unsloth/GLM-4.7-Flash-GGUF:Q4_K_M

# Evaluate on local model (llama.cpp server)
upskill eval ./skills/my-skill/ --model "unsloth/GLM-4.7-Flash-GGUF:Q4_0" --base-url http://localhost:8080/v1

Generating skill with sonnet...
Generating test cases...
Evaluating on "unsloth/GLM-4.7-Flash-GGUF:Q4_0"... (attempt 1)
  40% -> 85% (+45%) OK

  baseline   ████████░░░░░░░░░░░░ 40%
  with skill █████████████████░░░ 85%  (+45%)

Saved to ./skills/kernel-builder-cuda-kernels

A 45% improvement on "unsloth/GLM-4.7-Flash-GGUF:Q4_0"

This is the core value proposition: use expensive models to create skills, then deploy those skills with cheap or local models.

How the evaluation in upskill works

upskill uses a teacher-student approach to evaluate models where the teacher model generates test cases for the student model to be evaluated on.

Teacher model (Opus) generates the skill

Test cases (Opus) are generated automatically from the task description

Student model (local) is evaluated with and without the skill

Skill lift measures the improvement
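The four steps above can be sketched in a few lines of Python. This is not upskill's implementation: `evaluate`, `toy_student`, and `contains_grader` are hypothetical stand-ins, the student is a stub rather than a real model call, and we assume the skill is simply prepended to the prompt.

```python
from typing import Callable, Optional

Student = Callable[[str], str]        # prompt -> model output
Grader = Callable[[str, dict], bool]  # (output, test case) -> passed?


def evaluate(student: Student, grader: Grader, cases: list[dict],
             skill: Optional[str] = None) -> float:
    """Return the fraction of test cases the student passes."""
    passed = 0
    for case in cases:
        # Assumption: the skill text is injected ahead of the task input.
        prompt = f"{skill}\n\n{case['input']}" if skill else case["input"]
        if grader(student(prompt), case):
            passed += 1
    return passed / len(cases)


def toy_student(prompt: str) -> str:
    """Stub model: it only gets H100 right when the skill text is in context."""
    if "compute capability 9.0" in prompt:
        return 'capabilities = ["9.0"]'
    return 'capabilities = ["8.0"]'


def contains_grader(output: str, case: dict) -> bool:
    return case["expected"]["contains"] in output


cases = [{"input": "Create a build.toml for a CUDA kernel targeting H100",
          "expected": {"contains": "9.0"}}]
skill = "H100 uses compute capability 9.0."

baseline = evaluate(toy_student, contains_grader, cases)
with_skill = evaluate(toy_student, contains_grader, cases, skill)
print(f"baseline={baseline:.0%} with_skill={with_skill:.0%}")
```

Running the same test cases twice, once with and once without the skill, is what makes the lift attributable to the skill rather than to the model.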

If you pass an existing skill to upskill eval

{
  "cases": [
    {
      "input": "Create a build.toml for a CUDA kernel targeting H100",
      "expected": {"contains": "9.0"}
    },
    {
      "input": "Write a basic CUDA kernel template with proper includes",
      "expected": {"contains": "cuda_runtime.h"}
    }
  ]
}
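To show how test cases in this shape could be checked, here is a small Python sketch. It assumes that `contains` means a plain substring match on the model output, which the post does not spell out; `check_case` and the two sample outputs are hypothetical.

```python
import json

CASES_JSON = """
{
  "cases": [
    {"input": "Create a build.toml for a CUDA kernel targeting H100",
     "expected": {"contains": "9.0"}},
    {"input": "Write a basic CUDA kernel template with proper includes",
     "expected": {"contains": "cuda_runtime.h"}}
  ]
}
"""


def check_case(case: dict, output: str) -> bool:
    """Return True if the output satisfies the case's 'contains' expectation."""
    expected = case["expected"]
    unknown = set(expected) - {"contains"}
    if unknown:
        raise ValueError(f"unsupported assertion type(s): {sorted(unknown)}")
    return expected["contains"] in output


cases = json.loads(CASES_JSON)["cases"]

# Made-up model outputs: the first satisfies its case, the second does not.
outputs = [
    '[general.cuda]\ncapabilities = ["9.0"]',
    "#include <stdio.h>\n__global__ void kernel() {}",
]

results = [check_case(case, out) for case, out in zip(cases, outputs)]
print(results)  # [True, False]
```

Substring checks like these are cheap to run but coarse: they confirm that a key fact appears in the output, not that the generated kernel builds or runs.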

We can also test how a skill performs across different models:

upskill eval ./skills/kernel-builder-cuda-kernels/ --model haiku --model kimi --runs 5

Evaluating kernel-builder-cuda-kernels across 2 model(s)
  3 test case(s), 5 run(s) per model

haiku
  Pass rate: 4/5 (80%)  Avg assertions: 2.8/3

sonnet
  Pass rate: 5/5 (100%)  Avg assertions: 3.0/3

┏━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Model  ┃ Pass Rate ┃ Avg Assertions ┃ Avg Tokens ┃
┡━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ haiku  │ 4/5       │ 2.8/3          │ 1250       │
│ kimi   │ 5/5       │ 3.0/3          │ 1890       │
└────────┴───────────┴────────────────┴────────────┘

This helps you find the cost-performance sweet spot: maybe Haiku with the skill is good enough for your use case, saving significant API costs.
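One way to act on a table like this is to pick the model with the lowest token usage among those that clear a required pass rate. The sketch below uses the figures from the table; `EvalResult` and `pick_model` are hypothetical names, and average tokens is only a rough proxy for cost because per-token prices differ between models.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalResult:
    model: str
    passed: int
    runs: int
    avg_tokens: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.runs


def pick_model(results: list[EvalResult], min_pass_rate: float) -> Optional[EvalResult]:
    """Return the lowest-token model that meets the pass-rate threshold, if any."""
    good_enough = [r for r in results if r.pass_rate >= min_pass_rate]
    return min(good_enough, key=lambda r: r.avg_tokens) if good_enough else None


# Figures copied from the evaluation table above.
results = [
    EvalResult("haiku", passed=4, runs=5, avg_tokens=1250),
    EvalResult("kimi", passed=5, runs=5, avg_tokens=1890),
]

print(pick_model(results, min_pass_rate=0.8).model)  # haiku
print(pick_model(results, min_pass_rate=0.9).model)  # kimi
```

With only five runs per model, a single failure moves the pass rate by 20 points, so more runs would be needed before treating the difference between models as reliable.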

We've shown that upskill can create validated skills that transfer domain expertise from powerful models to cheaper ones. The kernel-builder skill is just one example of what's possible.

Some things to try:

Generate skills for your internal tools

Build a skill library for your codebase

Capture tribal knowledge

Benchmark across models

The approach works for any specialized task where you'd otherwise write detailed prompts repeatedly. Skills are portable across Claude Code, Codex, Cursor, and other tools that support the Agent Skills specification.

Agent Skills Specification

HuggingFace kernel-builder
