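Hamel Husain · October 15, 2023 09:00 · about 5 min read

Optimizing LLM Latency

#LLMOptimization #Latency #InferenceServer #OpenSource #ModelDeployment #PerformanceComparison

TL;DR

Hamel Husain compared tools for optimizing the latency of open source LLMs and reported that mlc was the fastest, CTranslate2 stood out for ease of use, and vLLM is well suited to large models.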

AI deep analysis · March 1, 2026 17:46


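Key points

mlc is the fastest tool

mlc showed the fastest performance, though the author notes that evaluating output quality remains future work.

CTranslate2 leads on ease of use

CTranslate2 combines speed with excellent documentation and ergonomics, and the author calls it his favorite tool.

vLLM suits large models

vLLM supports distributed inference, which positions it as a good choice for serving very large models.

Consistent experimental conditions

Conditions were held constant (batch size 1, Nvidia A6000, 200 max output tokens) to compare the tools fairly.

Compiling and running MLC LLM

The article walks through compiling Llama-2-7b-chat-hf with MLC for the CUDA target with q4f16_1 quantization and interacting with it through the Python client.

Customizing the MLC config file

Editing mlc-chat-config.json adjusts the conversation template and generation parameters such as temperature and max_gen_len.

CTranslate2 as an optimization tool

CTranslate2 is an optimization tool that can make LLMs very fast, and its documentation includes specific instructions for Llama models.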


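Impact analysis

This article gives developers evaluating practical LLM deployments a useful reference for choosing tools. Its latency-focused comparison is particularly valuable for decisions about production environments.

Editorial comment

As a practice-oriented tool comparison, the article addresses concrete problems that developers face. Note, however, that it is a rough comparison rather than a rigorous benchmark.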



Below is a summary of my findings:

🏁 mlc is the fastest. This is so fast that I’m skeptical and am now motivated to measure quality (if I have time). When checking the outputs manually, they didn’t seem that different from other approaches.

❤️ CTranslate2 is my favorite tool, which is among the fastest but is also the easiest to use. The documentation is the best out of all of the solutions I tried. Furthermore, I think that the ergonomics are excellent for the models that they support. Unlike vLLM, CTranslate2 doesn’t seem to support distributed inference just yet.

🛠️ vLLM is really fast, but CTranslate2 can be much faster. On the other hand, vLLM supports distributed inference, which is something you will need for larger models. vLLM might be the sweet spot for serving very large models.

😐 Text Generation Inference is an ok option (but nowhere near as fast as vLLM).

Rough Benchmarks

This study focuses on various approaches to optimizing latency. Specifically, I want to know which tools are the most effective at optimizing latency for open source LLMs. In order to focus on latency, I hold the following variables constant:

batch size of n = 1

All experiments were conducted on an Nvidia A6000

Max output tokens were always set to 200

All numbers are calculated as an average over a fixed set of 9 prompts.

The model used is meta-llama/Llama-2-7b-hf on the HuggingFace Hub [2].

In addition to batch size of n = 1

[Benchmark table. Column headers: avg time (seconds), avg output token count. Row labels: float16 quantization; int8 quantization; HF Hosted Inference Endpoint; HuggingFace Transformers (no server); nf4 4bit quantization bitsandbytes; quantized w/ GPTQ; quantized w/ bitsandbytes; text-generation-webui; A100 (on Modal Labs).]

In some cases I did not use an A6000

I noticed that the output of the LLM was quite different (fewer tokens) when using vLLM. I am not sure if I did something wrong here, or if vLLM changes the behavior of the LLM.

Furthermore, the goal was not to be super precise on these benchmarks but rather to get a general sense of how things work and how they might compare to each other out of the box. Some of the tools above are inference servers which perform logging, tracing etc. in addition to optimizing models, which affects latency. The idea is to see where there are significant differences between tools. I discussed this more here.
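The per-tool averages can be derived from per-prompt records such as the tok_count and time fields that the benchmark scripts write to CSV. A minimal sketch using only the standard library: it assumes average tok/sec is the mean of per-prompt ratios (total tokens divided by total time is an equally plausible definition), and the sample rows are hypothetical, not measured results.

```python
import csv
import io
from statistics import mean


def summarize(rows):
    """Average tok/sec, time, and token count over benchmark records."""
    times = [float(r["time"]) for r in rows]
    counts = [int(r["tok_count"]) for r in rows]
    return {
        # Assumption: mean of per-prompt tokens/second ratios.
        "avg_tok_per_sec": mean(c / t for c, t in zip(counts, times)),
        "avg_time_seconds": mean(times),
        "avg_output_token_count": mean(counts),
    }


# Hypothetical records for illustration only (not real measurements).
SAMPLE_CSV = """tok_count,time,note
200,4.0,example
100,2.5,example
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
summary = summarize(rows)
print(summary)
```

Pointing csv.DictReader at one of the bench-*.csv files instead of SAMPLE_CSV would summarize a real run.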

One capability you need to be successful with open source LLMs is the ability to serve models efficiently. There are two categories of tools for model inference:

Inference servers: these help with providing a web server that can provide a REST/gRPC or other interface to interact with your model as a service. These inference servers usually have parameters to help you make trade-offs between throughput and latency. Additionally, some inference servers come with additional features like telemetry, model versioning and more. You can learn more about this topic in the serving section of these notes. For LLMs, popular inference servers are Text Generation Inference (TGI) and vLLM.

Model Optimization: These modify your model to make it faster for inference. Examples include quantization, Paged Attention, ExLlama and more.

It is common to use both inference servers and model optimization techniques in conjunction. Some inference servers like TGI and vLLM even help you apply optimization techniques [3].

Other than benchmarking, an important goal of this study was to understand how to use different platforms & tools.

mlc

Start by compiling the model as shown in these docs.

After installing MLC, you can compile meta-llama/Llama-2-7b-chat-hf like so:

python3 -m mlc_llm.build \
    --hf-path meta-llama/Llama-2-7b-chat-hf \
    --target cuda --quantization q4f16_1

The arguments for the compilation are documented here. This puts the model in ./dist/ under the name Llama-2-7b-chat-hf-q4f16_1.

You can use their python client to interact with the compiled model:

from mlc_chat import ChatModule, ChatConfig

cfg = ChatConfig(max_gen_len=200)
cm = ChatModule(model="Llama-2-7b-chat-hf-q4f16_1", chat_config=cfg)
# `prompt` is the input string to send to the model
output = cm.generate(prompt=prompt)

You can see the full benchmarking code here.

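I wasn’t able to get meta-llama/Llama-2-7b-hf to work here, so I used the chat variant, Llama-2-7b-chat-hf. The Python client applies the following system prompt for Llama models: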

conv.system =
    ("[INST] <<SYS>>\n\nYou are a helpful, respectful and honest assistant. "
     "Always answer as helpfully as possible, while being safe. "
     "Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, "
     "or illegal content. "
     "Please ensure that your responses are socially unbiased and positive in nature.\n\n"
     "If a question does not make any sense, or is not factually coherent, explain why instead "
     "of answering something not correct. "
     "If you don't know the answer to a question, please don't share false "
     "information.\n<</SYS>>\n\n ");

If you want to fix this, you must edit mlc-chat-config.json.

The config file is located in ./dist/<model-name>/params/mlc-chat-config.json.

cat ./dist/Llama-2-7b-hf-q4f16_1/params/mlc-chat-config.json

{
    "model_lib": "Llama-2-7b-hf-q4f16_1",
    "local_id": "Llama-2-7b-hf-q4f16_1",
    "conv_template": "llama-2",
    "temperature": 0.7,
    "repetition_penalty": 1.0,
    "top_p": 0.95,
    "mean_gen_len": 128,
    "max_gen_len": 512,
    "shift_fill_factor": 0.3,
    "tokenizer_files": [
        "tokenizer.json",
        "tokenizer.model"
    ],
    "model_category": "llama",
    "model_name": "Llama-2-7b-hf"
}
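Editing the config by hand works, but the same change can be scripted. A minimal sketch, assuming the file is plain JSON with top-level keys like those shown above; update_chat_config is a hypothetical helper, not part of MLC, and the demo writes to a temporary file rather than a real ./dist/ directory.

```python
import json
import tempfile
from pathlib import Path


def update_chat_config(config_path, **overrides):
    """Load a JSON config, apply key overrides, and write it back."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    config.update(overrides)
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config


# Demo on a throwaway copy; the starting values mirror the config above.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "mlc-chat-config.json"
    demo_path.write_text(json.dumps(
        {"conv_template": "llama-2", "temperature": 0.7, "max_gen_len": 512}
    ))
    updated = update_chat_config(demo_path, temperature=0.2, max_gen_len=200)
    reloaded = json.loads(demo_path.read_text())

print(updated)
```

Keys that are not overridden (conv_template here) are written back unchanged.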

CTranslate2

CTranslate2 is an optimization tool that can make models ridiculously fast. h/t to Anton. The documentation for CTranslate2 contains specific instructions for Llama models.

To optimize Llama v2, first convert and quantize the model:

ct2-transformers-converter --model meta-llama/Llama-2-7b-hf --quantization int8 --output_dir llama-2-7b-ct2 --force

Here meta-llama/Llama-2-7b-hf is the model's repo on the HuggingFace Hub. The benchmarking code:

import sys
import time

import ctranslate2
import pandas as pd
import transformers

# The shared prompt list lives in ../common/questions.py
sys.path.append('../common/')
from questions import questions

generator = ctranslate2.Generator("llama-2-7b-ct2", device="cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")


def predict(prompt: str):
    "Generate text given a prompt"
    start = time.perf_counter()
    tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
    results = generator.generate_batch([tokens], sampling_topk=1, max_length=200, include_prompt_in_result=False)
    tokens = results[0].sequences_ids[0]
    output = tokenizer.decode(tokens)
    request_time = time.perf_counter() - start
    return {'tok_count': len(tokens),
            'time': request_time,
            'question': prompt,
            'answer': output,
            'note': 'CTranslate2 int8 quantization'}


if __name__ == '__main__':
    counter = 1
    responses = []
    for q in questions:
        # Only prompts after the first are run and recorded
        if counter >= 2:
            responses.append(predict(q))
        counter += 1
    df = pd.DataFrame(responses)
    df.to_csv('bench-ctranslate-int8.csv', index=False)

Text Generation Inference (TGI)

The license for TGI was recently changed away from Apache 2.0 to be more restrictive. Be careful when using TGI in commercial applications.

Text Generation Inference, often referred to as “TGI”, was easy to use without any optimization. You can run it like this:

“start_server.sh”

#!/bin/bash
if [ -z "$HUGGING_FACE_HUB_TOKEN" ]
then
    echo "HUGGING_FACE_HUB_TOKEN is not set. Please set it before running this script."
    exit 1
fi

model="TheBloke/Llama-2-7B-GPTQ"
volume=$PWD/data

docker run --gpus all \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    -e GPTQ_BITS=4 -e GPTQ_GROUPSIZE=128 \
    --shm-size 5g -p 8081:80 \
    -v $volume:/data ghcr.io/huggingface/text-generation-inference \
    --max-best-of 1 "$@"

We can then run the server with this command:

bash start_server.sh --model-id "meta-llama/Llama-2-7b-hf"
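Once the server is running, requests go to it over HTTP. A minimal sketch of building such a request with the standard library: it assumes TGI's /generate route with an inputs/parameters JSON payload and the 8081 host port mapped in start_server.sh. build_generate_request is a hypothetical helper, not part of TGI, and the actual network call is left commented out.

```python
import json
import urllib.request


def build_generate_request(prompt, max_new_tokens=200, base_url="http://localhost:8081"):
    """Build a POST request for a text-generation server's /generate route."""
    # Assumption: the server accepts {"inputs": ..., "parameters": {...}}.
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_generate_request("What is the product of 9 and 8?")
print(request.full_url)

# With the server up, the request could be sent like this:
# with urllib.request.urlopen(request) as resp:
#     print(json.loads(resp.read())["generated_text"])
```

Timing the urlopen call with time.perf_counter() would measure the full HTTP round trip, which is what the server-based benchmarks in this study capture.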

You can see all the options for the TGI container with the help flag like so:

docker run ghcr.io/huggingface/text-generation-inference --help | less

Quantization was very difficult to get working. There is a --quantize flag. To quantize a model yourself, first download the weights:

text-generation-server download-weights meta-llama/Llama-2-7b-hf

You can run the following command to perform the quantization (the last argument is the destination directory where the weights are stored).

text-generation-server quantize "meta-llama/Llama-2-7b-hf" data/quantized/

However, this step is not needed for the most popular models, as someone will likely already have quantized and uploaded them to the Hub.

Pre-Quantized Models

Alternatively, you can use a pre-quantized model that has been uploaded to the Hub. TheBloke/Llama-2-7B-GPTQ is a good example of one. To get this to work, you have to be careful to set the GPTQ_BITS and GPTQ_GROUPSIZE environment variables (GPTQ_BITS=4 and GPTQ_GROUPSIZE=128 here). These are already set in start_server.sh above.

To use TheBloke/Llama-2-7B-GPTQ with TGI, I can use the same bash script with the following arguments:

bash start_server.sh --model-id TheBloke/Llama-2-7B-GPTQ --quantize gptq

Comparison Without TGI Server

When I first drafted this study, I got the following response on Twitter:

Based on your code (https://t.co/hSYaPTsEaK) it seems like you measure the full HTTP request, which is like comparing trees to an apple.

Phillip certainly has a point! I am indeed testing both! I’m looking for big differences in tools here, and since some inference servers have optimization tools, and some optimization tools do not have an inference server, I cannot do a true apples-to-apples comparison. However, I think it’s still useful to try different things as advertised to see what is possible, and also take note of really significant gaps in latency between tools.

Therefore, I ran the following tests to perform similar optimizations as TGI, but without the server, to see what happened:

HuggingFace Transformers

I was able to get slightly better performance without the TGI server, as predicted by Phillip, but it did not account for the massive gap between some tools (which is exactly the kind of thing I was looking for).

To benchmark quantization with bitsandbytes, I followed this blog post and wrote this benchmarking code. I quantized the model by loading it like this:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit NF4 quantization with bfloat16 compute
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model_nf4 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config)

Unlike TGI, I was able to get bitsandbytes to work properly here, but just like TGI it didn’t speed anything up for me with respect to inference latency. As reflected in the benchmark table, I got nearly the same results with transformers without any optimizations.

I also quantized the model using AutoGPTQ without an inference server to compare against TGI. The code for that is here.

The results were so bad (~5 tok/sec) that I decided not to put this in the table, because it seemed quite off to me.

Text Generation WebUI

Aman let me know about text-generation-webui, and also these instructions for quickly experimenting with ExLlama and ggml. I wasn’t able to get ggml to work.

From the root of the text-generation-webui repo, you can run the following commands to start an inference server optimized with ExLlama:

python3 download-model.py TheBloke/Llama-2-7B-GPTQ
python3 server.py --listen --extensions openai --loader exllama_hf --model TheBloke_Llama-2-7B-GPTQ

After the server was started, I used this code to conduct the benchmark.

Overall, I didn’t like this particular piece of software much. It’s a bit bloated because it’s trying to do too many things at once (an inference server, web UIs, and other optimizations). That being said, the documentation is good and it is easy to use.

I don’t think there is any particular reason to use this unless you want an end-to-end solution that also comes with a web user-interface (which many people want!).

vLLM

vLLM only works with CUDA 11.8, which I configured using this approach. After configuring CUDA and installing the right version of PyTorch, you need to install the bleeding edge from git:

pip install -U git+https://github.com/vllm-project/vllm.git

A good recipe to use for vLLM can be found in these Modal docs. Surprisingly, I had much lower latency when running on a local A6000.

import time

import pandas as pd
from vllm import SamplingParams, LLM

# from https://modal.com/docs/guide/ex/vllm_inference
questions = [
    # Coding questions
    "Implement a Python function to compute the Fibonacci numbers.",
    "Write a Rust function that performs binary exponentiation.",
    "What are the differences between Javascript and Python?",
    # Literature
    "Write a story in the style of James Joyce about a trip to the Australian outback in 2083, to see robots in the beautiful desert.",
    "Who does Harry turn into a balloon?",
    "Write a tale about a time-traveling historian who's determined to witness the most significant events in human history.",
    # Math
    "What is the product of 9 and 8?",
    "If a train travels 120 kilometers in 2 hours, what is its average speed?",
    "Think through this step by step. If the sequence a_n is defined by a_1 = 3, a_2 = 5, and a_n = a_(n-1) + a_(n-2) for n > 2, find a_6.",
]

MODEL_DIR = "/home/ubuntu/hamel-drive/vllm-models"


def download_model_to_folder():
    from huggingface_hub import snapshot_download
    import os

    snapshot_download(
        "meta-llama/Llama-2-7b-hf",
        local_dir=MODEL_DIR,
        token=os.environ["HUGGING_FACE_HUB_TOKEN"],
    )
    return LLM(MODEL_DIR)


def generate(question, llm, note=None):
    response = {'question': question, 'note': note}
    sampling_params = SamplingParams(
        temperature=1.0,
        top_p=1,
        max_tokens=200,
    )
    start = time.perf_counter()
    result = llm.generate(question, sampling_params)
    request_time = time.perf_counter() - start

    for output in result:
        response['tok_count'] = len(output.outputs[0].token_ids)
        response['time'] = request_time
        response['answer'] = output.outputs[0].text

    return response


if __name__ == '__main__':
    llm = download_model_to_folder()

    counter = 1
    responses = []

    for q in questions:
        response = generate(question=q, llm=llm, note='vLLM')
        # The first request runs but its result is not recorded
        if counter >= 2:
            responses.append(response)
        counter += 1

    df = pd.DataFrame(responses)
    df.to_csv('bench-vllm.csv', index=False)
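The counter logic in the script above records results only from the second prompt onward, so the first request effectively acts as a warm-up. The same pattern can be written as a reusable loop. A minimal sketch: run_benchmark and fake_predict are hypothetical names, and fake_predict stands in for a real model call so the example runs without a GPU.

```python
import time


def run_benchmark(prompts, predict, warmup=1):
    """Time predict() on every prompt and drop the first `warmup` results."""
    records = []
    for i, prompt in enumerate(prompts):
        start = time.perf_counter()
        answer = predict(prompt)
        elapsed = time.perf_counter() - start
        # Warm-up calls are executed but not recorded.
        if i >= warmup:
            records.append({"question": prompt, "answer": answer, "time": elapsed})
    return records


def fake_predict(prompt):
    """Hypothetical stand-in for a model call."""
    return prompt.upper()


records = run_benchmark(["a", "b", "c"], fake_predict)
print([r["question"] for r in records])
```

Passing a wrapper around llm.generate as predict would apply the same warm-up handling to a real model.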

HuggingFace Inference Endpoint

I deployed an inference endpoint on HuggingFace for meta-llama/Llama-2-7b-hf, on an Nvidia A10G.

The documentation for these interfaces can be found here. There is also a Python client.

Their documentation says they are using TGI under the hood. However, my latency was significantly faster on their hosted inference platform than using TGI locally. This could be due to the fact that I used an A10G.

The code for this benchmark can be found here.

[1] It is common to explore the latency vs. throughput frontier when conducting inference benchmarks. I did not do this, since I was most interested in latency. Here is an example of how to conduct inference benchmarks that consider both throughput and latency.

[2] For Llama v2 models, you must be careful to use the models ending in -hf.

[3] The Modular Inference Engine is another example of an inference server that also applies optimization techniques. At the time of this writing, this is proprietary technology, but it’s worth keeping an eye on this in the future.

