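TLDR AI · May 4, 2026 09:00 · approx. 13 min read

AutoRound: A High-Accuracy Quantization Toolkit for Large Language Models

#Large Language Models (LLM) #Quantization #Open Source #Inference Optimization #Vision-Language Models

TL;DR

AutoRound is an advanced quantization toolkit for large language models and vision-language models. It achieves high accuracy at ultra-low bit widths with minimal tuning and can quantize a 7B model on a single GPU in about 10 minutes.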

AI in-depth analysis · May 4, 2026 23:04

Depth: 40%

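Key points

High-accuracy quantization at very low bit widths

AutoRound uses an advanced quantization algorithm that maintains high accuracy even at ultra-low bit widths with only minimal tuning.

Seamless integration with major frameworks

It is compatible with major inference and training frameworks such as Transformers, vLLM, and SGLang, so it is easy to integrate into existing workflows.

Fast quantization with low resource requirements

A 7B-parameter model can be quantized on a single GPU in about 10 minutes, which can substantially shorten development cycles.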

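Impact analysis

This tool addresses the tradeoff between quantization time and accuracy, which has been a bottleneck when running large models in low-cost, low-latency environments, and it lowers the barrier to adoption for development teams. It is likely to become important infrastructure for accelerating the adoption and practical use of LLMs, particularly in resource-constrained environments and in cases that call for rapid prototyping.

Editorial comment

Being able to quantize a model on a single GPU in a matter of minutes is a real productivity gain for developers, and the tool is worth watching as a candidate standard for edge AI and local LLM deployment.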

🚀 What is AutoRound?

AutoRound is an advanced quantization toolkit designed for Large Language Models (LLMs) and Vision-Language Models (VLMs).

It uses sign-gradient descent to achieve high accuracy at ultra-low bit widths (2–4 bits) with minimal tuning, and it is compatible with a broad range of hardware.

See our papers SignRoundV1 and SignRoundV2 for more details. For usage instructions, please refer to the User Guide.
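To make the idea of sign-gradient descent on rounding more concrete, here is a toy pure-Python sketch. It is not AutoRound's implementation: it assumes a single weight row, a fixed symmetric 4-bit scale, a learnable rounding offset `v` kept in [-0.5, 0.5], a straight-through gradient for the rounding step, and a step size of 1/iters. All function and variable names are made up for illustration.

```python
import random


def quantize(w, v, scale, qmin=-8, qmax=7):
    """Fake-quantize: round(w/scale + v), clamp to the int4 range, rescale."""
    return [scale * max(qmin, min(qmax, round(wi / scale + vi))) for wi, vi in zip(w, v)]


def output_error(w, wq, xs):
    """Per-sample difference between quantized and full-precision outputs."""
    return [sum((q - f) * x for q, f, x in zip(wq, w, row)) for row in xs]


def sign(g):
    return (g > 0) - (g < 0)


def sign_round(w, xs, iters=200):
    scale = max(abs(wi) for wi in w) / 7  # symmetric 4-bit scale (illustrative)
    v = [0.0] * len(w)                    # v = 0 is plain round-to-nearest (RTN)
    lr = 1.0 / iters
    best_v, best_loss, rtn_loss = list(v), float("inf"), None
    for _ in range(iters + 1):
        err = output_error(w, quantize(w, v, scale), xs)
        loss = sum(e * e for e in err)
        if rtn_loss is None:
            rtn_loss = loss               # first pass uses v = 0, i.e. RTN
        if loss < best_loss:
            best_loss, best_v = loss, list(v)
        # Straight-through estimator: treat round() as identity, so dWq/dv = scale.
        grad = [2 * scale * sum(e * row[j] for e, row in zip(err, xs)) for j in range(len(w))]
        # Only the sign of the gradient is used; v is clamped to [-0.5, 0.5].
        v = [max(-0.5, min(0.5, vj - lr * sign(gj))) for vj, gj in zip(v, grad)]
    return rtn_loss, best_loss, best_v


random.seed(0)
w = [random.gauss(0, 1) for _ in range(16)]
xs = [[random.gauss(0, 1) for _ in range(16)] for _ in range(32)]
rtn_loss, tuned_loss, best_v = sign_round(w, xs)
print(f"RTN loss: {rtn_loss:.4f}, best tuned loss: {tuned_loss:.4f}")
```

Because the loop keeps the best offsets it has seen and starts from the RTN solution, the tuned loss can never be worse than the RTN loss in this sketch.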

🆕 What's New

[2026/03] Block-wise FP8 quantization is available via --scheme FP8_BLOCK --iters 0 --disable_opt_rtn.

[2026/03] MTP layer quantization is now supported; see this PR.

[2025/12] The SignRoundV2 paper is available. Turn on enable_alg_ext and use the AutoScheme API for mixed-precision quantization to reproduce the results: Paper, Notes for evaluating LLaMA models.

[2025/11] AutoRound has landed in LLM-Compressor: Usage, vLLM blog, RedHat blog, X post, Intel blog, LinkedIn, 微信, 知乎.

[2025/11] An enhanced GGUF quantization algorithm is available via --enable_alg_ext: Accuracy.

[2025/10] AutoRound has been integrated into SGLang: Usage, LMSYS Blog, X post, Intel blog, LinkedIn.

[2025/10] A mixed precision algorithm is available to generate schemes in minutes: Usage, Accuracy.

[2025/09] MXFP4 and NVFP4 dtypes are available: Accuracy.

[2025/08] An improved INT2 algorithm is available via --enable_alg_ext: Accuracy.

[2025/07] GGUF format is supported: Usage.

[2025/05] AutoRound has been integrated into vLLM: Usage, Medium blog, 小红书.

[2025/05] AutoRound has been integrated into Transformers: Blog.

[2025/03] The INT2-mixed DeepSeek-R1 model (~200GB) retains 97.9% accuracy: Model.

✨ Key Features

✅ Superior Accuracy

Delivers strong performance even at 2–3 bits (see the example models), with leading results at 4 bits (see the benchmark).

✅ Ecosystem Integration

Works seamlessly with Transformers, vLLM, SGLang, and more.

✅ Multiple Formats Export

Supports AutoRound, AutoAWQ, AutoGPTQ, and GGUF for maximum compatibility. Details are shown in export formats.

✅ Fast Mixed Bits/Dtypes Scheme Generation

Generates a mixed bit/data-type scheme automatically in minutes, with an overhead of about 1.1X–1.5X the model's BF16 RAM size. See the accuracy results and the user guide.

✅ Optimized Round-to-Nearest Mode

Use --iters 0 for fast quantization, at the cost of some accuracy drop at 4 bits. Details are shown in opt_rtn mode.

✅ Affordable Quantization Cost

Quantizes 7B models in about 10 minutes on a single GPU. Details are shown in quantization costs.

✅ 10+ VLMs Support

Out-of-the-box quantization for 10+ vision-language models. See the example models and the support matrix.

✅ Multiple Recipes

Choose from auto-round-best, auto-round, and auto-round-light to suit your needs. Details are shown in quantization recipes.

✅ Advanced Utilities

Includes multi-GPU quantization, support for multiple calibration datasets, and support for 10+ runtime backends.

✅ Beyond Weight-Only Quantization

We are actively expanding support for additional data types such as MXFP, NVFP, W8A8, and more.

Installation

Install from PyPI

```bash
# CPU(Xeon)/GPU(CUDA)
pip install auto-round

# CPU(Xeon)/GPU(CUDA) nightly
pip install auto-round-nightly

# HPU(Gaudi)
# install inside the hpu docker container, e.g. vault.habana.ai/gaudi-docker/1.23.0/ubuntu24.04/habanalabs/pytorch-installer-2.9.0:latest
pip install auto-round-hpu

# XPU(Intel GPU)
pip install torch --index-url https://download.pytorch.org/whl/xpu
pip install auto-round
```

Build from Source

```bash
# CPU(Xeon)/GPU(CUDA)
pip install .

# HPU(Gaudi)
python setup.py install hpu

# XPU(Intel GPU)
pip install torch --index-url https://download.pytorch.org/whl/xpu
pip install .
```

Model Quantization (CPU/Intel GPU/Gaudi/CUDA)

If you encounter issues during quantization, try using pure RTN mode with iters=0, disable_opt_rtn=True. Additionally, using group_size=32 or mixed bits is recommended for better results.
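The troubleshooting advice above maps onto the Python API roughly as follows. This is a sketch under the assumption that `iters`, `disable_opt_rtn`, and `group_size` are passed as constructor keyword arguments, as the hyperparameter list in this README suggests. The helper names `rtn_fallback_kwargs` and `quantize_with_rtn` and the output directory are hypothetical.

```python
def rtn_fallback_kwargs(group_size=32):
    """Settings for pure round-to-nearest (no tuning) with a smaller group size."""
    return {
        "scheme": "W4A16",
        "iters": 0,                # 0 tuning iterations = RTN mode
        "disable_opt_rtn": True,   # pure RTN rather than the optimized RTN variant
        "group_size": group_size,  # overrides the group size implied by the scheme
    }


def quantize_with_rtn(model_name_or_path, output_dir="./qmodel_rtn"):
    # Imported lazily so the settings can be inspected without auto-round installed.
    from auto_round import AutoRound

    ar = AutoRound(model_name_or_path, **rtn_fallback_kwargs())
    ar.quantize_and_save(output_dir=output_dir, format="auto_round")


print(rtn_fallback_kwargs())
```

Calling `quantize_with_rtn("Qwen/Qwen3-0.6B")` would then run the fallback path on the same small model used in the CLI example.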

CLI Usage

The full list of supported arguments is provided by calling auto-round -h on the terminal.

ModelScope is supported for model downloads; simply set AR_USE_MODELSCOPE=1.

```bash
auto-round \
    --model Qwen/Qwen3-0.6B \
    --scheme "W4A16" \
    --format "auto_round" \
    --output_dir ./tmp_autoround
```

We offer another two recipes, auto-round-best and auto-round-light, designed for optimal accuracy and improved speed, respectively. Details are as follows.

Other Recipes

```bash
# Best accuracy, 3X slower, low_gpu_mem_usage could save ~20G but ~30% slower
auto-round-best \
  --model Qwen/Qwen3-0.6B \
  --scheme "W4A16" \
  --low_gpu_mem_usage
```

```bash
# 2-3X speedup, slight accuracy drop at W4 and larger accuracy drop at W2
auto-round-light \
  --model Qwen/Qwen3-0.6B \
  --scheme "W4A16"
```

In conclusion, we recommend using auto-round for W4A16 and auto-round-best with enable_alg_ext for W2A16. However, you may adjust the configuration to suit your specific requirements and available resources.
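For readers using the Python API rather than the CLI, the sketch below collects the settings this README shows for its highest-accuracy and faster configurations, alongside the documented defaults. Treating those settings as equivalent to `auto-round-best` and `auto-round-light` is an assumption made for illustration, and the `RECIPES` table and helper functions are hypothetical names.

```python
RECIPES = {
    "best": {"nsamples": 512, "iters": 1000},             # "highest accuracy" settings
    "default": {"nsamples": 128, "iters": 200},           # documented defaults
    "light": {"nsamples": 128, "iters": 50, "lr": 5e-3},  # "faster quantization" settings
}


def recipe_kwargs(scheme):
    """Pick tuning settings for a scheme, following the recommendation in the text."""
    if scheme == "W2A16":
        # W2A16: best-accuracy settings plus the experimental algorithm extension.
        return {"scheme": scheme, **RECIPES["best"], "enable_alg_ext": True}
    return {"scheme": scheme, **RECIPES["default"]}


def quantize(model_name_or_path, scheme, output_dir="./qmodel"):
    from auto_round import AutoRound  # lazy import; requires auto-round to be installed

    ar = AutoRound(model_name_or_path, **recipe_kwargs(scheme))
    ar.quantize_and_save(output_dir=output_dir, format="auto_round")


print(recipe_kwargs("W4A16"))
print(recipe_kwargs("W2A16"))
```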

API Usage

```python
from auto_round import AutoRound

# Load a model (supports FP8/BF16/FP16/FP32)
model_name_or_path = "Qwen/Qwen3-0.6B"

# Available schemes: "W2A16", "W3A16", "W4A16", "W8A16", "NVFP4", "MXFP4" (no real kernels), "GGUF:Q4_K_M", etc.
ar = AutoRound(model_name_or_path, scheme="W4A16")

# Highest accuracy (4–5× slower).
# `low_gpu_mem_usage=True` saves ~20GB VRAM but runs ~30% slower.
# ar = AutoRound(model_name_or_path, nsamples=512, iters=1000, low_gpu_mem_usage=True)

# Faster quantization (2–3× speedup) with slight accuracy drop at W4G128.
# ar = AutoRound(model_name_or_path, nsamples=128, iters=50, lr=5e-3)

# Supported formats: "auto_round" (default), "auto_gptq", "auto_awq", "llm_compressor", "gguf:q4_k_m", etc.
ar.quantize_and_save(output_dir="./qmodel", format="auto_round")
```

Important Hyperparameters

Quantization Scheme & Configuration

scheme (str|dict|AutoScheme): The predefined quantization keys, e.g. W4A16, MXFP4, NVFP4, GGUF:Q4_K_M. For MXFP4/NVFP4, we recommend exporting to LLM-Compressor format.

bits (int): Number of bits for quantization (default is None). If not None, it will override the scheme setting.

group_size (int): Size of the quantization group (default is None). If not None, it will override the scheme setting.

sym (bool): Whether to use symmetric quantization (default is None). If not None, it will override the scheme setting.

layer_config (dict): Configuration for the layer-wise scheme (default is None), mainly for customized mixed schemes.
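The override rule for `bits`, `group_size`, and `sym` can be illustrated with a small stand-alone resolver. This is not AutoRound's code: `SCHEME_DEFAULTS` and `resolve` are hypothetical, and the preset values (group size 128, symmetric) are assumptions chosen for the example.

```python
# Assumed preset values, for illustration only.
SCHEME_DEFAULTS = {
    "W4A16": {"bits": 4, "group_size": 128, "sym": True},
    "W2A16": {"bits": 2, "group_size": 128, "sym": True},
}


def resolve(scheme, bits=None, group_size=None, sym=None):
    """Start from the scheme preset; any argument that is not None replaces it."""
    settings = dict(SCHEME_DEFAULTS[scheme])
    overrides = {"bits": bits, "group_size": group_size, "sym": sym}
    settings.update({k: v for k, v in overrides.items() if v is not None})
    return settings


print(resolve("W4A16"))                 # preset unchanged
print(resolve("W4A16", group_size=32))  # only group_size is overridden
print(resolve("W4A16", sym=False))      # False is an explicit override, not "unset"
```

The check is `is not None` rather than truthiness, so an explicit `sym=False` still counts as an override.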

Algorithm Settings

enable_alg_ext (bool): [Experimental Feature] Only for iters>0. Enable algorithm variants for specific schemes (e.g., MXFP4/W2A16) that could bring notable improvements. Default is False.

disable_opt_rtn (bool|None): Use pure RTN mode for specific schemes (e.g., GGUF and WOQ). Default is None. If None, it defaults to False in most cases to improve accuracy, but may be set to True due to known issues.

Tuning Process Parameters

iters (int): Number of tuning iterations (default is 200). Common values: 0 (RTN mode), 50 (with lr=5e-3 recommended), 1000. Higher values increase accuracy but slow down tuning.

lr (float): The learning rate for rounding value (default is None). When None, it will be set to 1.0/iters automatically.

batch_size (int): Batch size for training (default is 8). 4 is also commonly used.

enable_deterministic_algorithms (bool): Whether to enable deterministic algorithms for reproducibility (default is False).

Calibration Dataset

dataset (str|list|tuple|torch.utils.data.DataLoader): The dataset for tuning (default is "NeelNanda/pile-10k"). Supports local JSON files and dataset combinations, e.g. "./tmp.json,NeelNanda/pile-10k:train,mbpp:train+validation+test".

nsamples (int): Number of samples for tuning (default is 128).

seqlen (int): Data length of the sequence for tuning (default is 2048).
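The combined dataset string packs several sources into one value: in the example, entries are separated by commas, an optional `:` introduces the split, and `+` joins multiple splits. The parser below is a hypothetical illustration of that format as it appears in the example string, not the parser AutoRound itself uses.

```python
def parse_dataset_spec(spec):
    """Split 'name[:split+split...],name...' into (name, [splits]) pairs."""
    entries = []
    for item in spec.split(","):
        name, _, splits = item.strip().partition(":")
        entries.append((name, splits.split("+") if splits else []))
    return entries


spec = "./tmp.json,NeelNanda/pile-10k:train,mbpp:train+validation+test"
for name, splits in parse_dataset_spec(spec):
    print(name, splits)
```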

Device/Speed Configuration

enable_torch_compile (bool): We generally recommend setting this to True for faster quantization with lower resource usage, provided it does not raise an exception.

low_gpu_mem_usage (bool): Whether to offload intermediate features to CPU at the cost of ~20% more tuning time (default is False).

low_cpu_mem_usage (bool): [Experimental Feature] Whether to enable immediate saving to reduce RAM usage (default is True).

device_map (str|dict|int): The device to be used for tuning, e.g., auto, cpu, cuda, 0,1,2 (default is 0). When using auto, it will try to use all available GPUs.

Supported Schemes

Gray indicates the absence of a kernel or the presence of only an inefficient/reference kernel. BF16 is mainly for AutoScheme.

| Format | Supported Schemes |
|---|---|
| auto_round | W4A16 (Recommended), W2A16, W3A16, W8A16, W2A16G64, W2A16G32, MXFP4, MXFP8, MXFP4_RCEIL, MXFP8_RCEIL, NVFP4, FPW8A16, FP8_STATIC, BF16 |
| auto_awq | W4A16 (Recommended), BF16 |
| auto_gptq | W4A16 (Recommended), W2A16, W3A16, W8A16, W2A16G64, W2A16G32, BF16 |
| llm_compressor | NVFP4 (Recommended), MXFP4, MXFP8, FPW8A16, FP8_STATIC, FP8_BLOCK, INT8, W4A16, W8A16 |
| gguf | GGUF:Q4_K_M (Recommended), GGUF:Q2_K_S, GGUF:Q3_K_S, GGUF:Q3_K_M, GGUF:Q3_K_L, GGUF:Q4_K_S, GGUF:Q5_K_S, GGUF:Q5_K_M, GGUF:Q6_K, GGUF:Q4_0, GGUF:Q4_1, GGUF:Q5_0, GGUF:Q5_1, GGUF:Q8_0 |
| fake | all schemes (only for research) |
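When scripting exports, the format-to-scheme listing above can be turned into a simple lookup. The dictionary below is transcribed from that listing; the names `SUPPORTED` and `formats_for` are illustrative, and the `fake` format is left out because it accepts every scheme.

```python
SUPPORTED = {
    "auto_round": [
        "W4A16", "W2A16", "W3A16", "W8A16", "W2A16G64", "W2A16G32", "MXFP4", "MXFP8",
        "MXFP4_RCEIL", "MXFP8_RCEIL", "NVFP4", "FPW8A16", "FP8_STATIC", "BF16",
    ],
    "auto_awq": ["W4A16", "BF16"],
    "auto_gptq": ["W4A16", "W2A16", "W3A16", "W8A16", "W2A16G64", "W2A16G32", "BF16"],
    "llm_compressor": [
        "NVFP4", "MXFP4", "MXFP8", "FPW8A16", "FP8_STATIC", "FP8_BLOCK", "INT8",
        "W4A16", "W8A16",
    ],
    "gguf": [
        "GGUF:Q4_K_M", "GGUF:Q2_K_S", "GGUF:Q3_K_S", "GGUF:Q3_K_M", "GGUF:Q3_K_L",
        "GGUF:Q4_K_S", "GGUF:Q5_K_S", "GGUF:Q5_K_M", "GGUF:Q6_K", "GGUF:Q4_0",
        "GGUF:Q4_1", "GGUF:Q5_0", "GGUF:Q5_1", "GGUF:Q8_0",
    ],
}


def formats_for(scheme):
    """Return the export formats whose row lists the given scheme."""
    return [fmt for fmt, schemes in SUPPORTED.items() if scheme in schemes]


print(formats_for("W4A16"))
print(formats_for("FP8_BLOCK"))
```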

Adaptive Schemes (Experimental Feature)

AutoScheme provides an automatic algorithm to generate adaptive mixed bits/data-type quantization recipes.

Please refer to the user guide for more details on AutoScheme.

```python
from auto_round import AutoRound, AutoScheme

model_name = "Qwen/Qwen3-8B"
avg_bits = 3.0
scheme = AutoScheme(avg_bits=avg_bits, options=("GGUF:Q2_K_S", "GGUF:Q4_K_S"), ignore_scale_zp_bits=True)
layer_config = {"lm_head": "GGUF:Q6_K"}

# Change iters to 200 for non-GGUF schemes
ar = AutoRound(model=model_name, scheme=scheme, layer_config=layer_config, iters=0)
ar.quantize_and_save()
```

Important Hyperparameters of AutoScheme

AutoScheme Hyperparameters

avg_bits (float): Target average bit-width for the entire model. Only quantized layers are included in the average bit calculation.

options (str | list[str] | list[QuantizationScheme]): Candidate quantization schemes to choose from. It can be a single comma-separated string (e.g., "W4A16,W2A16"), a list of strings (e.g., ["W4A16", "W2A16"]), or a list of QuantizationScheme objects.

ignore_scale_zp_bits (bool): Only supported in API usage. Determines whether to exclude the bits of scale and zero-point from the average bit-width calculation (default: False).

shared_layers (Iterable[Iterable[str]], optional): Only supported in API usage. Defines groups of layers that share quantization settings.

batch_size (int, optional): Only supported in API usage. Can be set to 1 to reduce VRAM usage at the expense of longer tuning time.
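As a worked example of what `avg_bits` means, suppose the candidates are a 2-bit and a 4-bit scheme and the scale and zero-point bits are ignored. The sketch below computes the parameter-weighted average bit-width of one possible layer assignment. The layer names and sizes are invented, and treating each scheme as exactly its nominal bit-width is a simplification.

```python
def average_bits(layer_params, layer_bits):
    """Parameter-weighted average bit-width over the quantized layers only."""
    total = sum(layer_params.values())
    return sum(layer_params[name] * layer_bits[name] for name in layer_params) / total


# Hypothetical layers: parameter counts and the bit-width assigned to each.
layer_params = {"attn": 2_000_000, "mlp_up": 6_000_000, "mlp_down": 8_000_000}
layer_bits = {"attn": 4, "mlp_up": 4, "mlp_down": 2}

# (2M*4 + 6M*4 + 8M*2) / 16M = 48M / 16M
print(average_bits(layer_params, layer_bits))
```

Here half of the parameters sit at 4 bits and half at 2 bits, so the assignment meets a target of `avg_bits=3.0`.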

API Usage for VLMs

This feature is experimental and may be subject to changes.

By default, AutoRound quantizes only the text module of VLMs and uses NeelNanda/pile-10k for calibration. To quantize the entire model, set quant_nontext_module to True, though support for this feature is limited. For more information, please refer to the AutoRound readme.

```python
from auto_round import AutoRound

# Load the model
model_name_or_path = "Qwen/Qwen2.5-VL-7B-Instruct"
# Quantize the model
ar = AutoRound(model_name_or_path, scheme="W4A16")
output_dir = "./qmodel"
ar.quantize_and_save(output_dir)
```

Model Inference

vLLM (CPU/Intel GPU/CUDA)

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95)
model_name = "Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound"
llm = LLM(model=model_name)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

SGLang (Intel GPU/CUDA)

Please note that support for the MoE models and visual language models is currently limited.

```python
import sglang as sgl

llm = sgl.Engine(model_path="Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound")
prompts = [
    "Hello, my name is",
]
sampling_params = {"temperature": 0.6, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Transformers (CPU/Intel GPU/Gaudi/CUDA)

AutoRound supports 10+ backends. It automatically selects the best available backend based on the installed libraries and prompts the user to install additional libraries when a better backend is found.

Please avoid manually moving the quantized model to a different device (e.g., model.to('cpu')) during inference, as this may cause unexpected exceptions.

Support for Gaudi devices is limited.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```

Publications & Events

SignRoundV2: Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs (2025.12 paper)

Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLM (2023.09 paper)

TEQ: Trainable Equivalent Transformation for Quantization of LLMs (2023.10 paper)

Effective Post-Training Quantization for Large Language Models (2023.04 blog)

Check out Full Publication List.

Acknowledgement

Special thanks to open-source low precision libraries such as AutoGPTQ, AutoAWQ, GPTQModel, Triton, Marlin, and ExLLaMAV2 for providing low-precision CUDA kernels, which are leveraged in AutoRound.

If you find AutoRound helpful, please ⭐ star the repo and share it with your community!

この記事をシェア

Latent Space重要度42026年6月25日 11:14

[AINews] メタハーネスの夏が到来

Simon Willison Blog2026年6月25日 08:59

ブラウザ互換性データベースをSQLite化

MarkTechPost重要度42026年6月25日 04:08

ツール、メモリ、権限、スキル、マルチエージェント協調を備えた OpenHarness スタイルのエージェントランタイム設計方法

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む

TLDR AI·2026年5月4日 09:00·約13分

大規模言語モデル向け高精度量子化ツールキット「AutoRound」

#大規模言語モデル (LLM)#量子化 (Quantization)#Open Source #推論最適化 #ビジョン・ランゲージモデル

TL;DR

AI深層分析2026年5月4日 23:04

重要/ 5段階

深度40%

キーポイント

超高精度・低ビット量子化の実現

最小限の調整（チューニング）のみで、超低ビット幅においても高い精度を維持する高度な量子化アルゴリズムを採用している。

主要フレームワークとのシームレス連携

Transformers, vLLM, SGLang などの主要な推論・学習フレームワークと互換性があり、既存のワークフローへの統合が容易である。

圧倒的な高速化と低リソース運用

単一の GPU 環境で 7B パラメータモデルをわずか 10 分以内に量子化処理できるため、開発サイクルの大幅な短縮が可能である。

影響分析・編集コメントを表示

影響分析

編集コメント

🚀 AutoRound とは

AutoRound は、大規模言語モデル（LLM）およびビジョン・ランゲージモデル（VLM）向けに設計された高度な量子化ツールキットです。

詳細については、論文 SignRoundV1 および SignRoundV2 をご覧ください。使用方法については、ユーザーガイドを参照してください。

🆕 新着情報

[2026/03] --scheme FP8_BLOCK --iters 0 --disable_opt_rtn を介して、ブロックごとの FP8 量子化が可能になりました。

[2026/03] この PR で MTP レイヤーの量子化がサポートされました。

[2025/12] SignRoundV2 の論文が公開されました。結果を再現するには、enable_alg_ext をオンにして AutoScheme API を使用し、混合精度量子化を実行してください：論文、LLaMA モデル評価のためのノート。

[2025/11] AutoRound が LLM-Compressor に統合されました。詳細は、使用方法、vLLM ブログ、RedHat ブログ、X 投稿、Intel ブログ、LinkedIn、微信、知乎をご覧ください。

[2025/11] --enable_alg_ext を介して、精度を向上させた GGUF 量子化アルゴリズムが利用可能になりました。

[2025/10] AutoRound が SGLang に統合されました。詳細は、使用方法、LMSYS ブログ、X 投稿、Intel ブログ、LinkedIn をご覧ください。

[2025/10] 数分でスキームを生成できる混合精度アルゴリズムが利用可能になりました。詳細は、使用方法、精度をご覧ください。

[2025/09] MXFP4 および NVFP4 データ型が利用可能になりました。詳細は、精度をご覧ください。

[2025/08] --enable_alg_ext オプションにより、精度を向上させた INT2 アルゴリズムを利用可能になりました。

[2025/07] GGUF 形式に対応しました：使用方法。

[2025/05] AutoRound が vLLM に統合されました：使用方法、Medium ブログ、小红书。

[2025/05] AutoRound が Transformers に統合されました：ブログ。

[2025/03] INT2 ミックスの DeepSeek-R1 モデル（約 200GB）は 97.9% の精度を維持しています：モデル。

✨ 主要機能

✅ 優れた精度

2〜3 ビット例示モデルでも強力なパフォーマンスを発揮し、4 ビットではベンチマークでトップの結果を記録しています。

✅ エコシステムとの統合

Transformers, vLLM, SGLang などとシームレスに動作します。

✅ 多様な形式へのエクスポート対応

最大限の互換性のために、AutoRound, AutoAWQ, AutoGPTQ, GGUF に対応しています。詳細はエクスポート形式で確認できます。

✅ 高速な混合ビット/データ型スキームの生成

数分で自動的に設定可能で、オーバーヘッドはモデルの BF16 メモリサイズの約 1.1〜1.5 倍です。精度結果とユーザーガイドは以下の通りです。

✅ 最適化された四捨五入モード

4 ビットでの高速量子化には --iters 0 を使用してください（若干の精度低下があります）。詳細は opt_rtn モードをご覧ください。

✅ 手頃な量子化コスト

単一の GPU で 7B モデルを約 10 分で量子化できます。詳細は量子化コストをご覧ください。

✅ 10 以上の VLM サポート

10 以上のビジョン・ランゲージモデル（VLM）に対して、そのまま使える量子化を提供します。例示モデル、サポートマトリックス。

✅ 複数のレシピ

ニーズに合わせて auto-round-best、auto-round、auto-round-light から選択できます。詳細は量子化レシピをご覧ください。

✅ 高度なユーティリティ

複数 GPU での量子化、複数のキャリブレーションデータセットのサポート、および 10 以上のランタイムバックエンドへの対応を含みます。

✅ 重みのみ量子化を超えて。私たちは MXFP、NVFP、W8A8 など、追加のデータ型に対するサポートを積極的に拡大しています。

インストール

PyPI からインストール

CPU(Xeon)/GPU(CUDA)

pip install auto-round

CPU(Xeon)/GPU(CUDA) nightly ビルド

pip install auto-round-nightly

HPU(Gaudi)

HPU Docker コンテナ内でインストールしてください。例：vault.habana.ai/gaudi-docker/1.23.0/ubuntu24.04/habanalabs/pytorch-installer-2.9.0:latest

pip install auto-round-hpu

XPU(Intel GPU)

pip install torch --index-url https://download.pytorch.org/whl/xpu

pip install auto-round

ソースコードからのビルド

CPU(Xeon)/GPU(CUDA)

pip install .

HPU(Gaudi)

python setup.py install hpu

XPU(Intel GPU)

pip install torch --index-url https://download.pytorch.org/whl/xpu

pip install .

モデル量子化 (CPU/Intel GPU/Gaudi/CUDA)

CLI の使用方法

サポートされている引数の完全リストは、ターミナルで auto-round -h と実行することで確認できます。

モデルダウンロードには ModelScope がサポートされており、AR_USE_MODELSCOPE=1 を設定するだけです。**

auto-round \

--model Qwen/Qwen3-0.6B \

--scheme "W4A16" \

--format "auto_round" \

--output_dir ./tmp_autoround

最適な精度と高速化をそれぞれ目的とした、もう 2 つのレシピ（auto-round-best と auto-round-light）も提供しています。詳細は以下の通りです。

その他のレシピ

最高精度、3 倍遅い、低 GPU メモリ使用量は約 20GB 節約できるが約 30% 遅くなる

auto-round-best \

--model Qwen/Qwen3-0.6B \

--scheme "W4A16" \

--low_gpu_mem_usage

2〜3 倍の高速化、W4 でわずかな精度低下、W2 ではより大きな精度低下

auto-round-light \

--model Qwen/Qwen3-0.6B \

--scheme "W4A16"

API の使用方法

from auto_round import AutoRound

モデルの読み込み (FP8/BF16/FP16/FP32 に対応)

model_name_or_path = "Qwen/Qwen3-0.6B"

利用可能なスキーム: "W2A16", "W3A16", "W4A16", "W8A16", "NVFP4", "MXFP4" (実装されたカーネルなし), "GGUF:Q4_K_M" など

ar = AutoRound(model_name_or_path, scheme="W4A16")

最高精度 (4〜5 倍遅い)。

`low_gpu_mem_usage=True` は約 20GB の VRAM を節約するが、実行速度は約 30% 低下する。

ar = AutoRound(model_name_or_path, nsamples=512, iters=1000, low_gpu_mem_usage=True)

高速な量子化 (2〜3 倍の高速化)、W4G128 でわずかな精度低下。

ar = AutoRound(model_name_or_path, nsamples=128, iters=50, lr=5e-3)

サポートされる形式: "auto_round" (デフォルト), "auto_gptq", "auto_awq", "llm_compressor", "gguf:q4_k_m" など

ar.quantize_and_save(output_dir="./qmodel", format="auto_round")

重要なハイパーパラメータ

量子化スキームと設定

scheme (str|dict|AutoScheme): 事前定義された量子化キー（例：W4A16, MXFP4, NVFP4, GGUF:Q4_K_M）。MXFP4/NVFP4 の場合、LLM-Compressor 形式へのエクスポートを推奨します。

bits (int): 量子化に使用するビット数（デフォルトは None）。None でない場合、scheme の設定を上書きします。

group_size (int): 量子化グループのサイズ（デフォルトは None）。None でない場合、scheme の設定を上書きします。

sym (bool): 対称量子化を使用するかどうか（デフォルトは None）。None でない場合、scheme の設定を上書きします。

layer_config (dict): レベルごとのスキーム用設定（デフォルトは None）、主にカスタマイズされた混合スキーム向けです。

アルゴリズム設定

enable_alg_ext (bool): [実験的機能] iters>0 の場合のみ有効。特定のスキーム（例：MXFP4/W2A16）に対してアルゴリズムのバリアントを有効化し、顕著な改善をもたらす可能性があります。デフォルトは False です。

disable_opt_rtn (bool|None): 特定のスキーム（例：GGUF および WOQ）に対して純粋な RTN モードを使用します。デフォルトは None です。None の場合、精度を向上させるため通常は False に設定されますが、既知の問題により True に設定される場合があります。

チューニングプロセスパラメータ

iters (int): チューニング反復回数（デフォルトは 200）。一般的な値：0（RTN モード）、50（lr=5e-3 を推奨）、1000。値を大きくすると精度が向上しますが、チューニング速度は低下します。

lr (float): ラウンド値の学習率（デフォルトは None）。None の場合、自動的に 1.0/iters に設定されます。

batch_size (int): トレーニング用のバッチサイズ（デフォルトは 8。4 も一般的に使用されます）。

enable_deterministic_algorithms (bool): 再現性を確保するために決定論的アルゴリズムを有効にするかどうか（デフォルトは False）。

Calibration Dataset

dataset (str|list|tuple|torch.utils.data.DataLoader): チューニング用のデータセット（デフォルトは "NeelNanda/pile-10k"）。ローカルの JSON ファイルやデータセットの組み合わせに対応しており、例としては "./tmp.json,NeelNanda/pile-10k:train,mbpp:train+validation+test" のような形式が利用可能です。

nsamples (int): チューニングに使用するサンプル数（デフォルトは 128）。

seqlen (int): チューニング用のシーケンスのデータ長（デフォルトは 2048）。

Device/Speed Configuration

enable_torch_compile (bool): 例外が発生しない場合、通常はリソース消費を抑えて高速な量子化を行うために True に設定することを推奨します。

low_gpu_mem_usage (bool): 中間特徴量を CPU にオフロードして GPU メモリ使用量を削減するかどうか。その代わりチューニング時間が約 20% 増加します（デフォルトは False）。

low_cpu_mem_usage (bool): [実験的機能] RAM 使用量を削減するために即座に保存を有効にするかどうか（デフォルトは True）。

device_map (str|dict|int): チューニングに使用するデバイス（例：auto, cpu, cuda, 0,1,2 など、デフォルトは 0）。auto を指定すると、利用可能なすべての GPU を使用しようと試みます。

Supported Schemes

Gray indicates that no kernel exists, or that only an inefficient/reference kernel exists. BF16 is mainly provided for AutoScheme.

| Format | Supported Schemes |
| --- | --- |
| auto_round | W4A16 (recommended), W2A16, W3A16, W8A16, W2A16G64, W2A16G32, MXFP4, MXFP8, MXFP4_RCEIL, MXFP8_RCEIL, NVFP4, FPW8A16, FP8_STATIC, BF16 |
| auto_awq | W4A16 (recommended), BF16 |
| auto_gptq | W4A16 (recommended), W2A16, W3A16, W8A16, W2A16G64, W2A16G32, BF16 |
| llm_compressor | NVFP4 (recommended), MXFP4, MXFP8, FPW8A16, FP8_STATIC, FP8_BLOCK, INT8, W4A16, W8A16 |
| gguf | GGUF:Q4_K_M (recommended), GGUF:Q2_K_S, GGUF:Q3_K_S, GGUF:Q3_K_M, GGUF:Q3_K_L, GGUF:Q4_K_S, GGUF:Q5_K_S, GGUF:Q5_K_M, GGUF:Q6_K, GGUF:Q4_0, GGUF:Q4_1, GGUF:Q5_0, GGUF:Q5_1, GGUF:Q8_0 |
| fake | All schemes (research only) |

Adaptive Schemes (Experimental Feature)

AutoScheme provides an algorithm that automatically generates adaptive mixed-bit/data-type quantization recipes.

Please refer to the user guide for more details on AutoScheme.

from auto_round import AutoRound, AutoScheme

model_name = "Qwen/Qwen3-8B"
avg_bits = 3.0
scheme = AutoScheme(avg_bits=avg_bits, options=("GGUF:Q2_K_S", "GGUF:Q4_K_S"), ignore_scale_zp_bits=True)
layer_config = {"lm_head": "GGUF:Q6_K"}

# Change iters to 200 for non-GGUF schemes
ar = AutoRound(model=model_name, scheme=scheme, layer_config=layer_config, iters=0)
ar.quantize_and_save()

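Important Hyperparameters of AutoScheme

AutoScheme Hyperparameters

avg_bits (float): Target average bit-width for the entire model. Only quantized layers are included in the average-bit calculation.

options (str | list[str] | list[QuantizationScheme]): Candidate quantization schemes to choose from. It can be a single comma-separated string (e.g., "W4A16,W2A16"), a list of strings (e.g., ["W4A16", "W2A16"]), or a list of QuantizationScheme objects.

ignore_scale_zp_bits (bool): Only supported in API usage. Determines whether the bits used by scales and zero-points are excluded from the average bit-width calculation (default: False).

shared_layers (Iterable[Iterable[str]], optional): Only supported in API usage. Defines groups of layers that share quantization settings.

batch_size (int, optional): Only supported in API usage. Can be set to 1 to reduce VRAM usage at the expense of longer tuning time.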

API Usage for VLMs

This feature is experimental and may change.

from auto_round import AutoRound

# Load the model
model_name_or_path = "Qwen/Qwen2.5-VL-7B-Instruct"
# Quantize the model
ar = AutoRound(model_name_or_path, scheme="W4A16")
output_dir = "./qmodel"
ar.quantize_and_save(output_dir)

Model Inference

vLLM (CPU/Intel GPU/CUDA)

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95)
model_name = "Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound"
llm = LLM(model=model_name)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

SGLang (Intel GPU/CUDA)

Support for MoE models and vision-language models is currently limited.

import sglang as sgl

llm = sgl.Engine(model_path="Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound")
prompts = [
    "Hello, my name is",
]
sampling_params = {"temperature": 0.6, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Transformers (CPU/Intel GPU/Gaudi/CUDA)

Support for Gaudi devices is limited.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))

Publications & Events

SignRoundV2: Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs (2025.12 paper)

Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLM (2023.09 paper)

TEQ: Trainable Equivalent Transformation for Quantization of LLMs (2023.10 paper)

Effective Post-Training Quantization for Large Language Models (2023.04 blog)

See the full publication list.

Acknowledgement

If you find AutoRound helpful, please ⭐ star the repo and share it with your community!

Original text

🚀 What is AutoRound?

AutoRound is an advanced quantization toolkit designed for Large Language Models (LLMs) and Vision-Language Models (VLMs).

It achieves high accuracy at ultra-low bit widths (2–4 bits) with minimal tuning by leveraging sign-gradient descent and providing broad hardware compatibility.

See our papers SignRoundV1 and SignRoundV2 for more details. For usage instructions, please refer to the User Guide.
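To make the idea of sign-gradient descent on rounding concrete, here is a toy sketch in pure Python. It is not AutoRound's implementation: it tunes one rounding offset per weight, kept within [-0.5, 0.5], for a single made-up calibration vector, uses a straight-through approximation for the gradient, and keeps the best offsets seen. The function names (`fake_quant`, `output_error`, `tune_rounding`) and the data are hypothetical, and the toy leaves out the rest of what the papers describe.

```python
import random


def fake_quant(w, scale, v, bits):
    """Quantize then dequantize one weight, with rounding offset v added before rounding."""
    qmax = 2 ** (bits - 1) - 1
    q = round(w / scale + v)
    q = max(-qmax - 1, min(qmax, q))
    return q * scale


def output_error(weights, offsets, x, scale, bits):
    """Difference between the quantized and the full-precision dot product with x."""
    ref = sum(xi * wi for xi, wi in zip(x, weights))
    out = sum(xi * fake_quant(wi, scale, vi, bits)
              for xi, wi, vi in zip(x, weights, offsets))
    return out - ref


def tune_rounding(weights, x, bits=4, iters=200):
    """Tune rounding offsets with sign-gradient steps; return (rtn_loss, best_loss, best_offsets)."""
    scale = max(abs(w) for w in weights) / (2 ** (bits - 1) - 1)
    lr = 1.0 / iters  # same default rule as the `lr` hyperparameter
    offsets = [0.0] * len(weights)  # all-zero offsets are plain round-to-nearest
    rtn_loss = output_error(weights, offsets, x, scale, bits) ** 2
    best_loss, best_offsets = rtn_loss, list(offsets)
    for _ in range(iters):
        err = output_error(weights, offsets, x, scale, bits)
        for i, xi in enumerate(x):
            # Straight-through estimate: d(err^2)/d(offset_i) ~ 2 * err * x_i * scale
            grad = 2.0 * err * xi * scale
            # Only the sign of the gradient is used; the step size is always lr.
            if grad > 0:
                offsets[i] = max(-0.5, offsets[i] - lr)
            elif grad < 0:
                offsets[i] = min(0.5, offsets[i] + lr)
        loss = output_error(weights, offsets, x, scale, bits) ** 2
        if loss < best_loss:
            best_loss, best_offsets = loss, list(offsets)
    return rtn_loss, best_loss, best_offsets


random.seed(0)
weights = [random.uniform(-1.0, 1.0) for _ in range(16)]
x = [random.uniform(-1.0, 1.0) for _ in range(16)]
rtn_loss, tuned_loss, offsets = tune_rounding(weights, x)
print(f"RTN loss: {rtn_loss:.6f}, tuned loss: {tuned_loss:.6f}")
```

Because the best offsets are tracked from the all-zero starting point, the tuned loss in this toy can never be worse than plain round-to-nearest.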

🆕 What's New

[2026/03] Block-wise FP8 quantization is available via --scheme FP8_BLOCK --iters 0 --disable_opt_rtn.

[2026/03] MTP layer quantization is supported in this PR.

[2025/12] The SignRoundV2 paper is available. Turn on enable_alg_ext and use the AutoScheme API for mixed-precision quantization to reproduce the results: Paper, Notes for evaluating LLaMA models.

[2025/11] AutoRound has landed in LLM-Compressor: Usage, vLLM blog, RedHat blog, X post, Intel blog, LinkedIn, 微信, 知乎.

[2025/11] An enhanced GGUF quantization algorithm is available via --enable_alg_ext: Accuracy.

[2025/10] AutoRound has been integrated into SGLang: Usage, LMSYS Blog, X post, Intel blog, LinkedIn.

[2025/10] A mixed precision algorithm is available to generate schemes in minutes: Usage, Accuracy.

[2025/09] MXFP4 and NVFP4 dtypes are available: Accuracy.

[2025/08] An improved INT2 algorithm is available via --enable_alg_ext: Accuracy.

[2025/07] GGUF format is supported: Usage.

[2025/05] AutoRound has been integrated into vLLM: Usage, Medium blog, 小红书.

[2025/05] AutoRound has been integrated into Transformers: Blog.

[2025/03] The INT2-mixed DeepSeek-R1 model (~200GB) retains 97.9% accuracy: Model.

✨ Key Features

✅ Superior Accuracy

Delivers strong performance even at 2–3 bits (example models), with leading results at 4 bits (benchmark).

✅ Ecosystem Integration

Seamlessly works with Transformers, vLLM, SGLang and more.

✅ Multiple Formats Export

Supports AutoRound, AutoAWQ, AutoGPTQ, and GGUF for maximum compatibility. Details are shown in export formats.

✅ Fast Mixed Bits/Dtypes Scheme Generation

Automatically configures a scheme in minutes, with about 1.1X-1.5X the model’s BF16 RAM size as overhead. See the accuracy results and user guide.

✅ Optimized Round-to-Nearest Mode

Use --iters 0 for fast quantization with some accuracy drop for 4 bits. Details are shown in opt_rtn mode

✅ Affordable Quantization Cost

Quantize 7B models in about 10 minutes on a single GPU. Details are shown in quantization costs

✅ 10+ VLMs Support

Out-of-the-box quantization for 10+ vision-language models (example models, support matrix).

✅ Multiple Recipes

Choose from auto-round-best, auto-round, and auto-round-light to suit your needs. Details are shown in quantization recipes

✅ Advanced Utilities

Includes multi-GPU quantization, multiple calibration datasets, and support for 10+ runtime backends.

✅ Beyond weight only quantization. We are actively expanding support for additional datatypes such as MXFP, NVFP, W8A8, and more.

Installation

Install from PyPI

code

# CPU(Xeon)/GPU(CUDA)
pip install auto-round

# CPU(Xeon)/GPU(CUDA) nightly
pip install auto-round-nightly

# HPU(Gaudi)
# install inside the hpu docker container, e.g. vault.habana.ai/gaudi-docker/1.23.0/ubuntu24.04/habanalabs/pytorch-installer-2.9.0:latest  
pip install auto-round-hpu

# XPU(Intel GPU)
pip install torch --index-url https://download.pytorch.org/whl/xpu
pip install auto-round

Build from Source

code

# CPU(Xeon)/GPU(CUDA)
pip install .

# HPU(Gaudi)
python setup.py install hpu

# XPU(Intel GPU)
pip install torch --index-url https://download.pytorch.org/whl/xpu
pip install .

Model Quantization (CPU/Intel GPU/Gaudi/CUDA)

If you encounter issues during quantization, try using pure RTN mode with iters=0, disable_opt_rtn=True. Additionally, using group_size=32 or mixed bits is recommended for better results.

CLI Usage

The full list of supported arguments is provided by calling auto-round -h on the terminal.

ModelScope is supported for model downloads; simply set AR_USE_MODELSCOPE=1.

code

auto-round \
    --model Qwen/Qwen3-0.6B \
    --scheme "W4A16" \
    --format "auto_round" \
    --output_dir ./tmp_autoround

We offer two other recipes, auto-round-best and auto-round-light, designed for optimal accuracy and improved speed, respectively. Details are as follows.

Other Recipes

code

# Best accuracy, 3X slower, low_gpu_mem_usage could save ~20G but ~30% slower
auto-round-best \
  --model Qwen/Qwen3-0.6B \
  --scheme "W4A16" \
  --low_gpu_mem_usage

code

# 2-3X speedup, slight accuracy drop at W4 and larger accuracy drop at W2
auto-round-light \
  --model Qwen/Qwen3-0.6B \
  --scheme "W4A16"

In conclusion, we recommend using auto-round for W4A16 and auto-round-best with enable_alg_ext for W2A16. However, you may adjust the configuration to suit your specific requirements and available resources.

API Usage

code

from auto_round import AutoRound

# Load a model (supports FP8/BF16/FP16/FP32)
model_name_or_path = "Qwen/Qwen3-0.6B"

# Available schemes: "W2A16", "W3A16", "W4A16", "W8A16", "NVFP4", "MXFP4" (no real kernels), "GGUF:Q4_K_M", etc.
ar = AutoRound(model_name_or_path, scheme="W4A16")

# Highest accuracy (4–5× slower).
# `low_gpu_mem_usage=True` saves ~20GB VRAM but runs ~30% slower.
# ar = AutoRound(model_name_or_path, nsamples=512, iters=1000, low_gpu_mem_usage=True)

# Faster quantization (2–3× speedup) with slight accuracy drop at W4G128.
# ar = AutoRound(model_name_or_path, nsamples=128, iters=50, lr=5e-3)

# Supported formats: "auto_round" (default), "auto_gptq", "auto_awq", "llm_compressor", "gguf:q4_k_m", etc.
ar.quantize_and_save(output_dir="./qmodel", format="auto_round")
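The three settings shown in the comments above can be collected into a small lookup, which makes the speed/accuracy trade-off easier to switch between. This is only a convenience sketch: the preset labels (`default`, `best`, `light`) and the helper `recipe_kwargs` are hypothetical names, and the values are copied from the comments above and from the defaults listed under Important Hyperparameters.

```python
# Keyword arguments taken from the API usage comments; the labels are made up here.
RECIPE_KWARGS = {
    "default": {"nsamples": 128, "iters": 200},
    "best": {"nsamples": 512, "iters": 1000, "low_gpu_mem_usage": True},
    "light": {"nsamples": 128, "iters": 50, "lr": 5e-3},
}


def recipe_kwargs(name):
    """Return a copy of the keyword arguments for one preset."""
    if name not in RECIPE_KWARGS:
        raise ValueError(f"unknown preset {name!r}; choose from {sorted(RECIPE_KWARGS)}")
    return dict(RECIPE_KWARGS[name])


# Intended use (requires auto-round to be installed):
#   from auto_round import AutoRound
#   ar = AutoRound("Qwen/Qwen3-0.6B", scheme="W4A16", **recipe_kwargs("light"))
print(recipe_kwargs("light"))
```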

Important Hyperparameters

Quantization Scheme & Configuration

scheme (str|dict|AutoScheme): The predefined quantization keys, e.g. W4A16, MXFP4, NVFP4, GGUF:Q4_K_M. For MXFP4/NVFP4, we recommend exporting to LLM-Compressor format.

bits (int): Number of bits for quantization (default is None). If not None, it will override the scheme setting.

group_size (int): Size of the quantization group (default is None). If not None, it will override the scheme setting.

sym (bool): Whether to use symmetric quantization (default is None). If not None, it will override the scheme setting.

layer_config (dict): Configuration for layer-wise schemes (default is None), mainly for customized mixed schemes.
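To illustrate what bits, group_size, and sym control, the following is a generic textbook round-to-nearest (RTN) quantizer in pure Python. It is a sketch under simplifying assumptions, not AutoRound's kernels or its optimized RTN mode; the function name `rtn_quantize` and the small default group size are hypothetical.

```python
def rtn_quantize(values, bits=4, group_size=4, sym=True):
    """Quantize then dequantize a flat list of weights, one group at a time."""
    out = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        if sym:
            # Symmetric: signed integer grid centred on zero, no zero-point.
            qmax = 2 ** (bits - 1) - 1
            qmin = -qmax - 1
            scale = max(abs(w) for w in group) / qmax
            zero_point = 0
        else:
            # Asymmetric: unsigned grid stretched over [min, max] of the group.
            qmin, qmax = 0, 2 ** bits - 1
            lo, hi = min(group), max(group)
            scale = (hi - lo) / qmax
            zero_point = None  # computed below once scale is known to be non-zero
        if scale == 0.0:
            scale = 1.0  # constant group: any scale represents it
        if zero_point is None:
            zero_point = round(-lo / scale)
        for w in group:
            q = max(qmin, min(qmax, round(w / scale) + zero_point))
            out.append((q - zero_point) * scale)
    return out


weights = [-0.7, 0.1, 0.35, 0.7, 0.02, 0.04, 0.06, 0.08]
print(rtn_quantize(weights, bits=4, group_size=4, sym=True))
print(rtn_quantize(weights, bits=4, group_size=4, sym=False))
```

Each group gets its own scale, so a smaller group_size tracks local weight ranges more closely at the cost of storing more scales.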

Algorithm Settings

enable_alg_ext (bool): [Experimental Feature] Only for iters>0. Enable algorithm variants for specific schemes (e.g., MXFP4/W2A16) that could bring notable improvements. Default is False.

disable_opt_rtn (bool|None): Use pure RTN mode for specific schemes (e.g., GGUF and WOQ). Default is None. If None, it defaults to False in most cases to improve accuracy, but may be set to True due to known issues.

Tuning Process Parameters

iters (int): Number of tuning iterations (default is 200). Common values: 0 (RTN mode), 50 (with lr=5e-3 recommended), 1000. Higher values increase accuracy but slow down tuning.

lr (float): The learning rate for rounding value (default is None). When None, it will be set to 1.0/iters automatically.

batch_size (int): Batch size for training (default is 8). 4 is also commonly used.

enable_deterministic_algorithms (bool): Whether to enable deterministic algorithms for reproducibility (default is False).
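The default rule for lr can be written out in a few lines. The helper below is a hypothetical illustration of the documented behaviour (lr falls back to 1.0/iters when left as None), not code from the library; returning None for iters=0 is an assumption made here because RTN mode performs no tuning steps.

```python
def resolve_lr(iters, lr=None):
    """Return the effective learning rate for a given iteration count."""
    if lr is not None:
        return lr  # an explicit value always wins
    if iters <= 0:
        return None  # assumption: RTN mode has no tuning steps, so no lr is needed
    return 1.0 / iters


for iters in (50, 200, 1000):
    lr = resolve_lr(iters)
    # With sign-based updates every step has size lr, so a value can travel
    # at most iters * lr over the whole run.
    print(iters, lr, iters * lr)
```

With this rule the maximum total movement iters * lr stays at about 1.0 regardless of the iteration count. The recommended lr=5e-3 at iters=50 is smaller than the 1.0/50 = 0.02 default, so each step is more conservative.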

Calibration Dataset

dataset (str|list|tuple|torch.utils.data.DataLoader): The dataset for tuning (default is "NeelNanda/pile-10k"). Supports local JSON files and dataset combinations, e.g. "./tmp.json,NeelNanda/pile-10k:train,mbpp:train+validation+test".

nsamples (int): Number of samples for tuning (default is 128).

seqlen (int): Data length of the sequence for tuning (default is 2048).
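The combined dataset string packs several sources into one value: entries are separated by commas, and an optional `:split+split` suffix selects splits. The parser below is a hypothetical sketch of how such a string can be read, based only on the example above; it is not AutoRound's own parsing code and ignores edge cases such as paths that contain colons.

```python
def parse_dataset_spec(spec):
    """Split 'a,b:train,c:train+test' into [(name, [splits...]), ...]."""
    entries = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, splits = item.partition(":")
        entries.append((name, splits.split("+") if sep else []))
    return entries


def calibration_tokens(nsamples=128, seqlen=2048):
    """Upper bound on the tokens seen during tuning: one full sequence per sample."""
    return nsamples * seqlen


spec = "./tmp.json,NeelNanda/pile-10k:train,mbpp:train+validation+test"
for name, splits in parse_dataset_spec(spec):
    print(name, splits)
print(calibration_tokens())
```

With the defaults, tuning sees at most 128 × 2048 = 262,144 calibration tokens.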

Device/Speed Configuration

enable_torch_compile (bool): If no exception is raised, we typically recommend setting it to True for faster quantization with lower resource usage.

low_gpu_mem_usage (bool): Whether to offload intermediate features to CPU at the cost of ~20% more tuning time (default is False).

low_cpu_mem_usage (bool): [Experimental Feature] Whether to enable saving immediately to reduce RAM usage (default is True).

device_map (str|dict|int): The device to be used for tuning, e.g., auto, cpu, cuda, 0,1,2 (default is 0). When using auto, it will try to use all available GPUs.

Supported Schemes

Gray indicates the absence of a kernel or the presence of only an inefficient/reference kernel. BF16 is mainly for AutoScheme.

| Format | Supported Schemes |
| --- | --- |
| auto_round | W4A16 (Recommended), W2A16, W3A16, W8A16, W2A16G64, W2A16G32, MXFP4, MXFP8, MXFP4_RCEIL, MXFP8_RCEIL, NVFP4, FPW8A16, FP8_STATIC, BF16 |
| auto_awq | W4A16 (Recommended), BF16 |
| auto_gptq | W4A16 (Recommended), W2A16, W3A16, W8A16, W2A16G64, W2A16G32, BF16 |
| llm_compressor | NVFP4 (Recommended), MXFP4, MXFP8, FPW8A16, FP8_STATIC, FP8_BLOCK, INT8, W4A16, W8A16 |
| gguf | GGUF:Q4_K_M (Recommended), GGUF:Q2_K_S, GGUF:Q3_K_S, GGUF:Q3_K_M, GGUF:Q3_K_L, GGUF:Q4_K_S, GGUF:Q5_K_S, GGUF:Q5_K_M, GGUF:Q6_K, GGUF:Q4_0, GGUF:Q4_1, GGUF:Q5_0, GGUF:Q5_1, GGUF:Q8_0 |
| fake | All schemes (only for research) |
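Integer scheme names such as W4A16 or W2A16G64 appear to follow a `W<weight bits>A<activation bits>[G<group size>]` pattern. The helper below decodes names of that shape. Both the reading of the naming convention and the function `parse_scheme` are assumptions made for illustration. When a name carries no G suffix, the helper reports the group size as None instead of guessing the scheme's default.

```python
import re

# Matches names like "W4A16" and "W2A16G64"; other families (MXFP4, GGUF:..., FPW8A16) do not match.
_INT_SCHEME = re.compile(r"^W(\d+)A(\d+)(?:G(\d+))?$")


def parse_scheme(name):
    """Decode an integer scheme name, or return None if it does not fit the pattern."""
    match = _INT_SCHEME.match(name)
    if match is None:
        return None
    weight_bits, act_bits, group = match.groups()
    return {
        "weight_bits": int(weight_bits),
        "act_bits": int(act_bits),
        "group_size": int(group) if group else None,
    }


for name in ("W4A16", "W2A16G64", "W2A16G32", "MXFP4", "GGUF:Q4_K_M"):
    print(name, parse_scheme(name))
```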

Adaptive Schemes (Experimental Feature)

AutoScheme provides an automatic algorithm to generate adaptive mixed bits/data-type quantization recipes.

Please refer to the user guide for more details on AutoScheme.

code

from auto_round import AutoRound, AutoScheme

model_name = "Qwen/Qwen3-8B"
avg_bits = 3.0
scheme = AutoScheme(avg_bits=avg_bits, options=("GGUF:Q2_K_S", "GGUF:Q4_K_S"), ignore_scale_zp_bits=True)
layer_config = {"lm_head": "GGUF:Q6_K"}

# Change iters to 200 for non-GGUF schemes
ar = AutoRound(model=model_name, scheme=scheme, layer_config=layer_config, iters=0)
ar.quantize_and_save()

Important Hyperparameters of AutoScheme

AutoScheme Hyperparameters

avg_bits (float): Target average bit-width for the entire model. Only quantized layers are included in the average bit calculation.

options (str | list[str] | list[QuantizationScheme]): Candidate quantization schemes to choose from. It can be a single comma-separated string (e.g., "W4A16,W2A16"), a list of strings (e.g., ["W4A16", "W2A16"]), or a list of QuantizationScheme objects.

ignore_scale_zp_bits (bool): Only supported in API usage. Determines whether to exclude the bits of scale and zero-point from the average bit-width calculation (default: False).

shared_layers (Iterable[Iterable[str]], optional): Only supported in API usage. Defines groups of layers that share quantization settings.

batch_size (int, optional): Only supported in API usage. Can be set to 1 to reduce VRAM usage at the expense of longer tuning time.
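A worked example helps show what avg_bits and ignore_scale_zp_bits mean in practice. The sketch below computes a parameter-weighted average bit-width over quantized layers and, optionally, adds a per-weight overhead for scales and zero-points. The overhead formula (one scale and one zero-point per group, spread over the group's weights), the 16-bit scale width, and the function name `average_bits` are assumptions for illustration; AutoScheme's actual accounting may differ.

```python
def average_bits(layers, ignore_scale_zp_bits=True, scale_bits=16, zp_bits=0):
    """Parameter-weighted average bits per weight.

    layers: iterable of (num_params, bits, group_size) for quantized layers only.
    """
    total_params = 0
    total_bits = 0.0
    for num_params, bits, group_size in layers:
        per_weight = float(bits)
        if not ignore_scale_zp_bits:
            # Assumed overhead: one scale (and zero-point) shared by each group.
            per_weight += (scale_bits + zp_bits) / group_size
        total_params += num_params
        total_bits += num_params * per_weight
    return total_bits / total_params


# Two equally sized layers, one at 2 bits and one at 4 bits, both with group size 128.
layers = [(100, 2, 128), (100, 4, 128)]
print(average_bits(layers))                              # weights only
print(average_bits(layers, ignore_scale_zp_bits=False))  # with scale overhead
```

Under these assumptions, a mix that hits a 3.0-bit target on weights alone costs slightly more than 3 bits per weight once scales are counted. That gap is what ignore_scale_zp_bits includes in or excludes from the calculation.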

API Usage for VLMs


This feature is experimental and may be subject to changes.

By default, AutoRound only quantizes the text module of VLMs and uses NeelNanda/pile-10k for calibration. To quantize the entire model, you can enable quant_nontext_module by setting it to True, though support for this feature is limited. For more information, please refer to the AutoRound readme.

code

from auto_round import AutoRound

# Load the model
model_name_or_path = "Qwen/Qwen2.5-VL-7B-Instruct"
# Quantize the model
ar = AutoRound(model_name_or_path, scheme="W4A16")
output_dir = "./qmodel"
ar.quantize_and_save(output_dir)

Model Inference

vLLM (CPU/Intel GPU/CUDA)

code

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95)
model_name = "Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound"
llm = LLM(model=model_name)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

SGLang (Intel GPU/CUDA)

Please note that support for MoE models and vision-language models is currently limited.

code

import sglang as sgl

llm = sgl.Engine(model_path="Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound")
prompts = [
    "Hello, my name is",
]
sampling_params = {"temperature": 0.6, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Transformers (CPU/Intel GPU/Gaudi/CUDA)

AutoRound supports 10+ backends. It automatically selects the best available backend based on the installed libraries, and prompts the user to install additional libraries when a better backend is found.

Please avoid manually moving the quantized model to a different device (e.g., model.to('cpu')) during inference, as this may cause unexpected exceptions.

Support for Gaudi devices is limited.

code

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))

Publications & Events

SignRoundV2: Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs (2025.12 paper)

Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLM (2023.09 paper)

TEQ: Trainable Equivalent Transformation for Quantization of LLMs (2023.10 paper)

Effective Post-Training Quantization for Large Language Models (2023.04 blog)

Check out Full Publication List.

Acknowledgement

Special thanks to open-source low precision libraries such as AutoGPTQ, AutoAWQ, GPTQModel, Triton, Marlin, and ExLLaMAV2 for providing low-precision CUDA kernels, which are leveraged in AutoRound.

If you find AutoRound helpful, please ⭐ star the repo and share it with your community!
