ClaudeにCUDAカーネルを構築させ、オープンモデルを教え込むことに成功!
Hugging Face は、Claude Opus 4.5 を教師モデルとして活用し、オープンソースモデルが CUDA カーネルの作成という高度なタスクを習得するための「スキル」を生成・共有する手法「upskill」を紹介した。
キーポイント
エージェントスキルの概念と目的
エージェントスキルは、モデルのコンテキストをファイル(Markdown やスクリプト)として定義し、特定のドメインや困難な問題に対する能力を共有・再利用するための実用的な手段である。
Claude を用いた教師モデルによる生成
Claude Code (Opus 4.5) をインタラクティブに使用して CUDA カーネルを作成し、そのプロセスを追跡(トレース)することで、複雑なタスクの解決手順を抽出する。
オープンソースモデルへの転移学習
生成されたスキルファイルを活用することで、ローカルで動作する軽量なオープンソースモデルでも、本来は困難とされる CUDA 開発タスクのパフォーマンス向上を図る。
スキルの効果と限界の検証
単純なドキュメントベースのスキルですべてのモデルで性能が向上するわけではなく、場合によってはパフォーマンス低下やトークン使用量の増加を招く可能性があり、反復的な改善プロセスが必要である。
スキル作成と評価の重要性
エージェントがタスクを完了した直後にスキルファイルを作成するか、Anthropicの「skill creator」や「upskill」ツールを活用することで機能するスキルを生成できます。
オープンソースモデルへの知識移転と最適化
作成したスキルをより小さく安価なオープンソースモデルに適用し、精度を維持しつつトークン使用量を削減できるか評価することが重要です。
CUDA カーネル構築ワークフローの完全サポート
単なるコード生成だけでなく、プロジェクト構造やビルド設定を含めた完全なワークフローを理解し、H100向けに最適化されたカーネルを自動生成するスキルが提供されています。
影響分析・編集コメントを表示
影響分析
この記事は、LLM の能力を「学習させる」のではなく、「スキルファイル」という形で外部化・共有する新しいパラダイムを示しており、特にハードウェアに近い低レベルなプログラミング(CUDA)のような専門領域において、オープンソースモデルの活用可能性を大きく広げる可能性があります。これにより、リソース制約のある環境でも高度なタスクを実行できるエコシステムの構築が加速すると予想されます。
編集コメント
高性能モデルを単に使うだけでなく、その思考プロセスを「スキル」として抽出し、より軽量なモデルに継承させるというアプローチは、コスト効率と汎用性の両面で非常に注目すべき進展です。ただし、すべてのケースで有効ではないという検証結果も示されており、実装時には注意深いテストが必要です。
教師モデル (Opus) がスキルを生成します
テストケース (Opus) はタスクの説明から自動生成されます
生徒モデル (ローカル) はスキルあり・なしで評価されます
スキルリフトは改善度を測定します
既存のスキルをupskill evalに渡す場合
{
"cases": [
{
"input": "H100をターゲットとしたCUDAカーネルのbuild.tomlを作成せよ",
"expected": {"contains": "9.0"}
},
{
"input": "適切なインクルードを含む基本的なCUDAカーネルのテンプレートを書け",
"expected": {"contains": "cuda_runtime.h"}
}
]
}また、スキルが異なるモデル間でどのように機能するかをテストできます:
upskill eval ./skills/kernel-builder-cuda-kernels/ --model haiku --model kimi --runs 52つのモデル、3つのテストケース、モデルあたり5回の実行でkernel-builder-cuda-kernelsを評価中
haiku 合格率: 4/5 (80%) 平均アサーション: 2.8/3
sonnet 合格率: 5/5 (100%) 平均アサーション: 3.0/3
┏━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Model ┃ Pass Rate ┃ Avg Assertions ┃ Avg Tokens ┃
┡━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ haiku │ 4/5 │ 2.8/3 │ 1250 │
│ kimi │ 5/5 │ 3.0/3 │ 1890 │
└────────┴───────────┴────────────────┴────────────┘
これはコストパフォーマンスの最適点を見つけるのに役立ちます: おそらくスキルを適用したHaikuがあなたのユースケースには十分であり、大幅なAPIコストを節約できます。
私たちは、upskillが強力なモデルから安価なモデルへドメイン知識を転送する検証済みスキルを作成できることを示しました。kernel-builderスキルは可能性のほんの一例です。
試すべきこと:
- 内部ツールのスキルを生成する
- コードベースのスキルライブラリを構築する
- 暗黙知を捕捉する
- モデル間でベンチマークする
このアプローチは、詳細なプロンプトを繰り返し書く必要があるような専門的なタスクであれば何にでも機能します。スキルは、Agent Skills仕様をサポートするClaude Code、Cursor、その他のツール間で移植可能です。
Agent Skills仕様
HuggingFace kernel-builder



原文を表示
Back to Articles We got Claude to teach open models how to write CUDA kernels!
Upvote 142 ![]()




The best thing about agent skills is upskilling your agents on hard problems. There are two ways to look at that:
You can take Opus 4.5 or other SOTA models and tackle the hardest problems out there.
You can take models that run on your laptop and upskill them to harder problems. In this blog post, we’ll show you how to take on the latter.
This blog post walks through the process of using a new tool, upskill
What are agent skills?
In case you missed it, agent skills are taking the coding agent game by storm. In fact, they’re a straightforward concept to define model context as files, like instructions as markdown and code as scripts. The file format makes them easy to generate, share, and review. In short, they’re a practical medium to share capabilities across models and tools, and they're most useful in specific domains or hard problems. Not stuff the model can do well anyway.
This post showcases this process by using Claude to generate a Skill file that can be used by open source models for a complex and specialized task: write CUDA kernels. We first tried a simple skill based on existing documentation, and we found that it improved performance for some others, but not all. In fact, it could even degrade performance or increase token usage for some models. Check out the plot below to see the performance of the model with and without the basic skill.

Now, let's walk through how you can use upskill
- Get the teacher (Claude Opus 4.5) to build a kernel
First, we use Claude Code to build a kernel interactively and export the trace. We worked through the process by instructing, validating, and adding documentation links. This somewhat naive process is important to reveal the models' initial challenges. In fact, you can iterate on this multiple times, by trying to solve the task with draft versions of the skill, and experimenting with smaller models. Each time, you can instruct the agent to improve the skill and test it on the smaller model.
Here's an example of the skill that we created and have been using to build kernels. We started from this agent trace where the agent was able to build a kernel, but not without some help.
- Make an agent skill from the trace
Once the teacher model has performed the task, we need them to make a skill. There are a number of effective ways to do this.
Within the same session, instruct the agent to create a skill file for the task it just completed.
Use Anthropic ‘skill creator’ skill either within the agent session or with an exported trace and a new agent session.
Use the upskill
In most cases, the first 2 options result in functional skills. However, the performance of an agent with the skill is unknown. That’s where upskill

- Take your skill to an open source, smaller, or cheaper model
Finally, we need to transfer our newly created skill to the tool or model we intend to use. Most tools like codex
{agent}/skills/{skill_name}/SKILL.md

In this case, we might want to iterate further on the gpt-oss
upskill generate --from {skill}
There is more to agent skills than model performance. Often agents can reach a given accuracy with or without a skill, they just need to consume more tokens to get there. For recurring tasks, we want to optimize agents to use less tokens to achieve the same accuracy. The results below reveal another dimension to the skill. Some models are significantly reducing their performance token usage, whilst others are using more tokens with the skill. For example, with moonshotai/Kimi-K2-Thinking

tldr; try out and evaluate models with the skills you create. Use upskill eval
That’s the high level end to end of upskilling your coding agents on hard problems. Try out upskill now like this:
install upskill pip install upskill # or use uvx uvx upskill --help # generate a skill based on an agent trace upskill generate "write nvidia kernels" --from ./trace.md # evaluate models on a skill upskill eval ./skills/my-skill/ --model haiku --model sonnet # generate skills for local models upskill generate "parse YAML" --model opus --eval-model "unsloth/GLM-4.7-Flash-GGUF:Q4_0" --eval-base-url http://localhost:8080/v1
Deep dive tutorial into building kernels with agent skills
We have a high level understanding of how we can upskill an agent. Let’s now look at the use case we solved for writing CUDA kernels.
We didn’t just want to write kernel code, but understand the full kernel-builder workflow: project structure, build.toml
The kernel-builder-cuda-kernels
With this skill, you can tell Claude things like:
Build a fused LayerNorm + GELU kernel optimized for H100.
And Claude will create the complete project structure, CUDA implementation, and build configuration—following the exact conventions that kernel-builder expects.
This isn't about generating boilerplate. The skill encodes domain expertise: H100 uses compute capability 9.0, shared memory should be aligned to 128 bytes, async memory copies require __CUDA_ARCH__ >= 900
Setup and Install
Install upskill:
pip install upskill # or use uvx for one-off runs uvx upskill --help
Set your API key:
export ANTHROPIC_API_KEY=sk-ant-... export HF_TOKEN=hf_...
That's it. Upskill uses Anthropic Claude Opus-4.5 model by default but also supports OpenAI and local models via OpenAI-compatible endpoints as generators. We want to use the more expensive and higher quality models to generate skills, and the smaller ones to use them. Think robin hood.
Skill Generation
Let's walk through generating a skill that teaches agents how to build CUDA kernels with HuggingFace's kernels
Generate the Skill
Start with a clear task description:
upskill generate "build optimized CUDA kernels for PyTorch using HuggingFace kernel-builder"
Above we used upskill, but it could in fact be any agent or chat tool and an exported trace.
upskill generate "write kernels" --from <agent-trace>.md
Also, we could start from an existing skill and add to it:
upskill generate "add more error handling and edge cases" --from ./skills/kernel-builder-cuda-kernels/
upskill loads the existing skill, applies your improvements, and re-evaluates to ensure the changes help.
upskill creates a skill, generates test cases, evaluates performance, and refines based on failures:
Generating skill with sonnet... Generating test cases... Evaluating on sonnet... (attempt 1) 60% -> 95% (+35%) OK kernel-builder-cuda-kernels Build optimized CUDA kernels for PyTorch using HuggingFace kernel-builder. SKILL.md ~520 tokens baseline ████████████ 60% with skill ███████████████████ 95% (+35%) Saved to ./skills/kernel-builder-cuda-kernels
The baseline shows how the model performs without any skill. The "with skill" result shows performance after the skill is injected into context. A 35% improvement means the skill is working.
The skill is saved as a directory following the Agent Skills specification:
./skills/kernel-builder-cuda-kernels/ ├── SKILL.md # Main instructions (~520 tokens) └── skill_meta.json # Metadata and test cases
--- name: kernel-builder-cuda-kernels description: Build optimized CUDA kernels for PyTorch using HuggingFace kernel-builder. --- # Building CUDA Kernels with kernel-builder ## Overview This guide explains how to create optimized CUDA kernels for PyTorch models using HuggingFace's kernel-builder. It covers project setup, kernel implementation, and building for specific GPU architectures like NVIDIA H100. ## Project Structure project/ ├── build.toml # Build configuration ├── kernel_src/ # CUDA kernel implementations │ ├── attention.cu │ ├── layernorm.cu │ └── geglu.cu └── torch-ext/ # PyTorch C++ bindings └── torch_binding.cpp ## Build Configuration Create build.toml to define your kernel package: [general] name = "diffuser_kernels" backends = ["cuda"] [general.cuda] # H100 is compute capability 9.0 capabilities = ["9.0"] ...
Evaluate on a Different Model
The important test is: does this skill help local or cheaper models to build kernels?
Start a local OpenAI-compatible server with a web UI: llama-server -hf unsloth/GLM-4.7-Flash-GGUF:Q4_K_M # Evaluate on local model (llama.cpp server) upskill eval ./skills/my-skill/ --model "unsloth/GLM-4.7-Flash-GGUF:Q4_0" --base-url http://localhost:8080/v1
Generating skill with sonnet... Generating test cases... Evaluating on "unsloth/GLM-4.7-Flash-GGUF:Q4_0"... (attempt 1) 40% -> 85% (+45%) OK baseline ████████░░░░░░░░░░░░ 40% with skill █████████████████░░░ 85% (+45%) Saved to ./skills/kernel-builder-cuda-kernels
A 45% improvement on "unsloth/GLM-4.7-Flash-GGUF:Q4_0"
This is the core value proposition: use expensive models to create skills, then deploy those skills with cheap or local models.
How the evaluation in upskill works
upskill uses a teacher-student approach to evaluate models where the teacher model generates test cases for the student model to be evaluated on.
Teacher model (Opus) generates the skill
Test cases (Opus) are generated automatically from the task description
Student model (local) is evaluated with and without the skill
Skill lift measures the improvement
If you pass an existing skill to upskill eval
{ "cases": [ { "input": "Create a build.toml for a CUDA kernel targeting H100", "expected": {"contains": "9.0"} }, { "input": "Write a basic CUDA kernel template with proper includes", "expected": {"contains": "cuda_runtime.h"} } ] }
We can also test how a skill performs across different models:
upskill eval ./skills/kernel-builder-cuda-kernels/ --model haiku --m kimi --runs 5
Evaluating kernel-builder-cuda-kernels across 2 model(s) 3 test case(s), 5 run(s) per model haiku Pass rate: 4/5 (80%) Avg assertions: 2.8/3 sonnet Pass rate: 5/5 (100%) Avg assertions: 3.0/3 ┏━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓ ┃ Model ┃ Pass Rate ┃ Avg Assertions ┃ Avg Tokens ┃ ┡━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩ │ haiku │ 4/5 │ 2.8/3 │ 1250 │ │ kimi │ 5/5 │ 3.0/3 │ 1890 │ └────────┴───────────┴────────────────┴────────────┘
This helps you find the cost-performance sweet spot: maybe Haiku with the skill is good enough for your use case, saving significant API costs.
We've shown that upskill can create validated skills that transfer domain expertise from powerful models to cheaper ones. The kernel-builder skill is just one example of what's possible.
Some things to try:
Generate skills for your internal tools
Build a skill library for your codebase
Capture tribal knowledge
Benchmark across models
The approach works for any specialized task where you'd otherwise write detailed prompts repeatedly. Skills are portable across Claude Code, Codex, Cursor, and other tools that support the Agent Skills specification.
Agent Skills Specification
HuggingFace kernel-builder



関連記事
今日のまとめ
AI日報で今日の重要ニュースをまとめ読み