Hamel Husain·2024年3月27日 16:00·約4分

ファインチューニングは依然として価値があるか？

#LLM #Fine-tuning #RAG #Prompt Engineering #Evaluation Harness

TL;DR

Hamel Husain は、ファインチューニングの有用性を否定する声に対し、ドメイン固有の評価ハッチネスやプロンプトエンジニアリングの必要性を強調し、構文・スタイル学習においては依然として有効であると主張している。

AI深層分析2026年5月3日 05:17

重要/ 5段階

深度40%

キーポイント

ファインチューニングが不要とされる背景の分析

開発者ツールや汎用パーソナルアシスタントなど、基盤モデル自体が既に最適化されている領域や、評価システム（eval harness）が未整備な初期段階で「不要」と判断されやすい傾向がある。

評価システムの重要性とプロンプトエンジニアリングの役割

効果的なファインチューニングには厳密な評価ハッチネスが必須であり、プロンプトエンジニアリングは評価システムをストレステストするための重要な手段として位置づけられている。

ファインチューニングと RAG の適切な使い分け

ファインチューニングはドメイン固有の構文・スタイル・ルールの学習に、RAG は最新情報やコンテキストの提供にそれぞれ最適であり、目的に応じて使い分けるべきである。

実証事例：Honeycomb と ReChat

Honeycomb では専門クエリ言語の構文学習に、ReChat では CRM 連携のための複雑な出力フォーマット生成にファインチューニングが成功した具体例が示されている。

複雑なフォーマット要件への対応

構造化データと非構造化データを織り交ぜた独自形式での応答が必要となる場合、ファインチューニングが実装の鍵となります。

大規模モデルへの適用可能性

ファインチューニングは小規模なオープンソースモデルに限定されず、GPT-3.5などの大規模モデルに対しても広く活用されています。

影響分析・編集コメントを表示

影響分析

この記事は、AI 開発現場における「ファインチューニング vs プロンプトエンジニアリング」という議論に対し、単なる技術選定ではなく、評価基盤（Eval Harness）の整備状況という文脈で捉えるべきだと指摘しています。これにより、開発者が安易にファインチューニングを断念するのを防ぎ、ドメイン固有の要件に対して適切な技術選択を行うための判断基準を提供します。

編集コメント

ファインチューニングの是非を議論する際、多くの開発者が見過ごしがちな「評価ハッチネス」の整備状況を指摘した非常に実践的な洞察です。技術選定において、ツールの有無だけでなく、その運用基盤が整っているかを問う視点は重要です。

このツイートで私が提示した質問に対する私の個人的な見解は以下の通りです。

ファインチューニングへの幻滅を表す声が増えています。

私は、より一般的な感情について興味を持っています。（現時点では私の意見は共有しません）

以下のツイートは @mlpowered, @abacaj, @emollick 氏からのものです pic.twitter.com/cU0hCdubBU

— Hamel Husain (@HamelHusain) 2024年3月26日

ファインチューニングは多くの状況において依然として非常に価値があると考えています。さらに詳しく調査した結果、ファインチューニングが有用ではないと言う人々は、実際にはファインチューニングが有用になる可能性の低い製品に取り組んでいることが多いことが分かりました。

彼らは開発者向けツールを作成しています。基盤モデル（foundation models）はコーディングタスクに対して広範にトレーニングされています。

彼らは基盤モデルを構築し、最も一般的なケースについてテストを行っています。しかし、基盤モデル自体もまた最も一般的なケースのためにトレーニングされています。

彼らは特定のドメインやユースケースに限定されないパーソナルアシスタントを構築しており、これは本質的に基盤モデルを構築している人々と同じグループです。

もう一つの共通のパターンは、多くの人が製品開発の初期段階でこのように言うことです。人々が非常に初期段階にいることを示す兆候の一つとして、ドメイン固有の評価ハーンチ（eval harness）を持っていないことが挙げられます。

評価システムなしに効果的にファインチューニングを行うことは不可能であり、この前提条件を完了していないとファインチューニングを無効視してしまう可能性があります。また、長期的には製品を改善するためにも優れた評価システムが不可欠であり、ファインチューニングの有無にかかわらずこれは変わりません。

ファインチューニングを行う前に、可能な限り多くのプロンプトエンジニアリングを実施すべきです。ただし、それはあなたが考えるような理由によるものではありません！膨大な量のプロンプトエンジニアリングを行う理由は、それが評価システム（eval system）のストレステストに最適な手段だからです。

もしプロンプトエンジニアリングがうまく機能していることが分かり（かつ体系的に製品を評価している場合）、そこで止めても問題ありません。私は問題を解決するために最もシンプルなアプローチを採用すべきだと強く信じています。しかし、まだファインチューニングを無効視するのは早計だと思います。

私がファインチューニングが効果的に機能したと確認した事例

一般的に、ファインチューニングは構文（syntax）、スタイル、およびルールを学習させるのに最も適しており、RAG（Retrieval-Augmented Generation）などの手法は、モデルにコンテキストや最新の事実を提供するのに最も適しています。

これらは私が携わった企業からのいくつかの事例です。今後さらに詳細をお伝えできることを願っています。

Honeycomb の自然言語クエリアシスタント - 以前は、Honeycomb クエリ言語のための「プログラミングマニュアル」が多くの例とともにプロンプトに追加されていました。これは許容範囲でしたが、ファインチューニングを行うことで、モデルがこのニッチなドメイン固有の言語（domain-specific language）の構文とルールを学習させることがはるかに効果的になりました。

ReChat の Lucy は、既存の不動産 CRM システムに統合された AI 不動産アシスタントです。ReChat では、フロントエンドがチャットインターフェースにウィジェットやカード、その他のインタラクティブ要素を動的にレンダリングできるように、構造化データと非構造化データを織り交ぜた非常に個性的な形式で LLM の応答を提供する必要があります。これを正しく動作させる鍵となったのがファインチューニングでした。本講演ではさらに詳細が述べられています。

P.S. ファインチューニングは、オープンソースや「小規模」モデルに限定されるものではありません。Perplexity.AI や CaseText など、GPT-3.5 をファインチューニングしている事例は数多くあります。

原文を表示

Here is my personal opinion about the questions I posed in this tweet:

There are a growing number of voices expressing disillusionment with fine-tuning.

I'm curious about the sentiment more generally. (I am withholding sharing my opinion rn).

Tweets below are from @mlpowered @abacaj @emollick pic.twitter.com/cU0hCdubBU

— Hamel Husain (@HamelHusain) March 26, 2024

I think that fine-tuning is still very valuable in many situations. I’ve done some more digging and I find that people who say that fine-tuning isn’t useful are indeed often working on products where fine-tuning isn’t likely to be useful:

They are making developer tools - foundation models have been trained extensively on coding tasks.

They are building foundation models and testing for the most general cases. But the foundation models themselves are also being trained for the most general cases.

They are building a personal assistant that isn’t scoped to any type of domain or use case and is essentially similar to the same folks building foundation models.

Another common pattern is that people often say this in earlier stages of their product development. One sign that folks are in really early stages is that they don’t have a domain-specific eval harness.

It’s impossible to fine-tune effectively without an eval system which can lead to writing off fine-tuning if you haven’t completed this prerequisite. It’s also impossible to improve your product without a good eval system in the long term, fine-tuning or not.

You should do as much prompt engineering as possible before you fine-tune. But not for reasons you would think! The reason for doing lots of prompt engineering is that it’s a great way to stress test your eval system!

If you find that prompt-engineering works fine (and you are systematically evaluating your product) then it’s fine to stop there. I’m a big believer in using the simplest approach to solving a problem. I just don’t think you should write off fine-tuning yet.

Examples where I’ve seen fine-tuning work well

Generally speaking, fine-tuning works best to learn syntax, style and rules whereas techniques like RAG work best to supply the model with context or up-to-date facts.

These are some examples from companies I’ve worked with. Hopefully, we will be able to share more details soon.

Honeycomb’s Natural Language Query Assistant - previously, the “programming manual” for the Honeycomb query language was being dumped into the prompt along with many examples. While this was OK, fine-tuning worked much better to allow the model to learn the syntax and rules of this niche domain-specific language.

ReChat’s Lucy - this is an AI real estate assistant integrated into an existing Real Estate CRM system. ReChat needs LLM responses to be provided in a very idiosyncratic format that weaves together structured and unstructured data to allow the front end to render widgets, cards and other interactive elements dynamically into the chat interface. Fine-tuning was the key to making this work correctly. This talk has more details.

P.S. Fine-tuning is not only limited to open or “small” models. There are lots of folks who have been fine-tuning GPT-3.5, such as Perplexity.AI: and CaseText, to name a few.

この記事をシェア

LangChain Blog重要度42026年6月27日 02:13

Deep Agents との連携によるプロンプトキャッシング

KDnuggets2026年6月25日 21:00

2026 年に AI エンジニアになるためのロードマップ

Simon Willison Blog重要度52026年6月27日 02:10

OpenAI、GPT-5.6 シリーズの限定プレビューを開始

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む