MiniMax's sparse attention technology enables million-token context (GitHub repository)


MiniMax Sparse Attention (MSA)

MSA (fmha_sm100) ships dense FlashAttention and sparse top-k attention kernels for NVIDIA SM100. Two JIT-compiled stacks share one Python package:

| Stack | Path | What it gives you |
| --- | --- | --- |
| csrc JIT | python/fmha_sm100/csrc/ | Dense FMHA (fmha_sm100, fmha_sm100_plan) + sparse_topk_select indexer, compiled from Jinja templates by jit.py at runtime. |
| CuTe-DSL | python/fmha_sm100/cute/ | Full sparse attention (forward + paged FP8 decode, BF16 / FP8 / NVFP4 / FP4), compiled at runtime via cute.compile. |
| Bridge | python/fmha_sm100/sparse_fmha_adapter.py | Adapts the fmha_sm100 API to call sparse_atten_func for sparse prefill paths. |

Algorithm reference: MiniMax Sparse Attention paper.

License: MIT. Self-authored files carry SPDX-License-Identifier: MIT. See LICENSE and NOTICE. Bundled / derived third-party code retains its own license; see Third-party licenses.

Requirements

GPU: NVIDIA SM100.

Toolchain: CUDA Toolkit with nvcc on PATH (or CUDA_HOME / CUDA_PATH set).

Python: ≥ 3.10.

OS: Linux x86_64 (aarch64 untested; JIT builds may need small Makefile edits on WSL).

Quick sanity check before installing:

```bash
nvcc --version                                                   # expect ≥ 12.x
nvidia-smi --query-gpu=compute_cap --format=csv | grep "10.0"    # confirm SM100
python -c "import sys; print(sys.version_info[:2])"              # ≥ (3, 10)
```
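The same checks can be scripted in Python, which is convenient in CI. The sketch below uses only the standard library and mirrors the requirements listed above. The helper names (`find_nvcc`, `is_sm100`, `python_ok`) are hypothetical and not part of MSA, and the `<CUDA_HOME>/bin/nvcc` layout is an assumption about a typical CUDA Toolkit install.

```python
import os
import shutil
import sys


def find_nvcc(env=None, which=shutil.which):
    """Return a path to nvcc, or None if it cannot be located.

    Looks on PATH first, then under CUDA_HOME / CUDA_PATH
    (assumed layout: <root>/bin/nvcc).
    """
    env = os.environ if env is None else env
    path = which("nvcc")
    if path:
        return path
    for var in ("CUDA_HOME", "CUDA_PATH"):
        root = env.get(var)
        if root:
            candidate = os.path.join(root, "bin", "nvcc")
            if os.path.isfile(candidate):
                return candidate
    return None


def is_sm100(compute_cap):
    """True if an nvidia-smi compute_cap string such as '10.0' denotes SM100."""
    return compute_cap.strip() == "10.0"


def python_ok(version_info=sys.version_info):
    """True if the interpreter is Python 3.10 or newer."""
    return tuple(version_info[:2]) >= (3, 10)


if __name__ == "__main__":
    print("nvcc:", find_nvcc() or "not found")
    print("python ok:", python_ok())
```

The `env` and `which` parameters exist only so the lookup can be exercised without a real CUDA install.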

Using with the kernels library

To quickly get started using MSA kernels, you can use the kernels library:

```python
# make sure `kernels` is installed: `pip install -U kernels`
from kernels import get_kernel

kernel_module = get_kernel("MiniMaxAI/msa", version=0)
sparse_atten_func = kernel_module.sparse_atten_func

sparse_atten_func(...)
```

The kernel is published on the Hugging Face Hub as MiniMaxAI/msa.

Install

```bash
# --recursive pulls the NVIDIA CUTLASS submodule (python/fmha_sm100/cutlass/),
# whose headers are required for JIT/AOT compilation.
git clone --recursive https://github.com/MiniMax-AI/MSA.git msa
cd msa
# If you cloned without --recursive:
#   git submodule update --init --recursive
pip install .           # standard install (works from a wheel too)
# or
pip install -e .        # editable install for development
```

This pulls in the CuTe-DSL stack via nvidia-cutlass-dsl and quack-kernels; the csrc kernels are JIT-compiled at first import from sources shipped inside the package.

Verify

Run a small CUDA smoke test. **The first run JIT-compiles sparse_topk_select, which takes from 30 s to a few minutes on a cold nvcc cache**; this is normal, not a hang. Subsequent runs hit the JIT cache and finish in seconds.

```bash
python tests/smoke/test_sparse_topk_forced.py
```
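The cold/warm behaviour described above is typical of source-keyed JIT caches. The sketch below is a generic illustration of that pattern, not MSA's jit.py: it keys a cache directory on a hash of the generated source and compiler flags, so the expensive build step runs only when that key is new. The names `cache_key` and `build_once`, the cache layout, the `head_dim` template parameter, and the use of `string.Template` as a stand-in for Jinja are all assumptions.

```python
import hashlib
import tempfile
from pathlib import Path
from string import Template


def cache_key(source, flags):
    """Stable key derived from the generated source and compiler flags."""
    digest = hashlib.sha256()
    digest.update(source.encode())
    digest.update("\0".join(flags).encode())
    return digest.hexdigest()[:16]


def build_once(source, flags, cache_dir, compile_fn):
    """Run compile_fn only if no artifact exists for this (source, flags)."""
    artifact = Path(cache_dir) / cache_key(source, flags) / "kernel.bin"
    if artifact.exists():
        return artifact, True  # warm: reuse the cached artifact
    artifact.parent.mkdir(parents=True, exist_ok=True)
    artifact.write_bytes(compile_fn(source, flags))  # cold: slow path
    return artifact, False


if __name__ == "__main__":
    # string.Template stands in for a Jinja template; head_dim is hypothetical.
    template = Template("// kernel specialised for head_dim=$head_dim\n")
    source = template.substitute(head_dim=128)

    def fake_compile(src, flags):
        """Placeholder for an nvcc invocation."""
        return src.encode()

    with tempfile.TemporaryDirectory() as tmp:
        print(build_once(source, ["-O3"], tmp, fake_compile)[1])  # False
        print(build_once(source, ["-O3"], tmp, fake_compile)[1])  # True
```

Changing either the rendered source or the flags yields a new key, which is why a fresh specialisation triggers another slow build.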

Usage

```python
import torch
from fmha_sm100 import fmha_sm100, fmha_sm100_plan, sparse_topk_select

# The input tensors and metadata (q, k_pages, v_pages, proxy_q, proxy_k_pages,
# proxy_v_pages, qo_lens, kv_lens, kv_indices, num_pages) are assumed to be
# prepared by the caller; their construction is not shown here.

# Page size and top-k for the sparse prefill path.
page_size, topk = 128, 16

# Dense proxy pass: compute per-block max score from a cheap Q slice.
proxy_plan = fmha_sm100_plan(
    qo_lens, kv_lens, proxy_q.shape[1],
    num_kv_heads=1,
    page_size=page_size,
    output_maxscore=True,
)
_, max_score = fmha_sm100(
    proxy_q, proxy_k_pages, proxy_v_pages, proxy_plan,
    kv_indices=kv_indices,
    output_o=False,
    output_maxscore=True,
)

# max_score -> sparse KV block indexes.
kv_block_indexes = sparse_topk_select(
    max_score.contiguous(), topk, num_valid_pages=num_pages,
)

# Sparse attention with the selected blocks.
sparse_plan = fmha_sm100_plan(
    qo_lens, kv_lens, q.shape[1],
    num_kv_heads=k_pages.shape[1],
    page_size=page_size,
    kv_block_num=topk,
)
out, _ = fmha_sm100(
    q, k_pages, v_pages, sparse_plan,
    kv_indices=kv_indices,
    kv_block_indexes=kv_block_indexes,
)
```

For block-sparse prefill with CSR metadata, the FP4 indexer, NVFP4 K/V, and the paged FP8 decode wrapper, see the CuTe-DSL deep dive: python/fmha_sm100/cute/README.md
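To make the data flow in the Usage example concrete, here is a pure-Python reference sketch of the idea for a single query: score each KV block by its maximum q·k, keep the top-k blocks, and run softmax attention over only those blocks. It illustrates the technique under stated assumptions and is not the kernels' implementation. The function names are hypothetical, and details such as score scaling, tie-breaking, index ordering, and how the proxy Q/K are derived are assumptions not taken from MSA.

```python
import heapq
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def block_max_scores(q, keys, page_size):
    """Max q.k score within each block of `page_size` consecutive keys."""
    return [
        max(dot(q, k) for k in keys[start:start + page_size])
        for start in range(0, len(keys), page_size)
    ]


def topk_blocks(scores, topk):
    """Indexes of the `topk` highest-scoring blocks, in ascending order."""
    best = heapq.nlargest(topk, range(len(scores)), key=scores.__getitem__)
    return sorted(best)


def sparse_attention(q, keys, values, block_ids, page_size):
    """Softmax attention for one query over the selected blocks only."""
    token_ids = [
        t
        for b in block_ids
        for t in range(b * page_size, min((b + 1) * page_size, len(keys)))
    ]
    logits = [dot(q, keys[t]) for t in token_ids]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in logits]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[t][d] for w, t in zip(weights, token_ids)) / total
        for d in range(dim)
    ]


if __name__ == "__main__":
    q = [1.0, 0.0]
    keys = [[0.1, 0.0], [0.2, 0.0], [5.0, 0.0], [0.3, 0.0],
            [0.0, 1.0], [0.0, 2.0], [4.0, 0.0], [0.5, 0.0]]
    values = [[float(i)] for i in range(len(keys))]
    scores = block_max_scores(q, keys, page_size=2)
    picked = topk_blocks(scores, topk=2)
    print(scores, picked)
    print(sparse_attention(q, keys, values, picked, page_size=2))
```

With `topk` equal to the number of blocks, the result matches dense attention, which gives a simple way to sanity-check a sparse path against a dense one.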

Test

```bash
# Fast smoke tests.
python -m pytest tests/smoke -q

# API and end-to-end integration tests.
python -m pytest tests/integration -q
python tests/integration/test_proxy_kv_e2e.py

# Large regression suites.
python tests/regression/test_correctness.py
python tests/regression/test_sparse_attn.py

# CuTe-DSL forward-only sparse attention.
cd python/fmha_sm100/cute
python -m pytest test_sparse_atten.py -q
```

Benchmark

benchmarks/bench_sparse_attention_ops.py covers dense prefill, paged prefill, sparse prefill, dense decode, paged decode, and sparse decode, in fp8 and bf16 (nvfp4 is sparse-prefill only).

```bash
python benchmarks/bench_sparse_attention_ops.py --help     # full flag list
```

Common invocations (output is TSV):

| Goal | Command |
| --- | --- |
| FP8 full sweep | python benchmarks/bench_sparse_attention_ops.py --dtype fp8 --sections all --output_mode o -o /tmp/msa_fp8.tsv |
| BF16 full sweep | python benchmarks/bench_sparse_attention_ops.py --dtype bf16 --sections all --output_mode o -o /tmp/msa_bf16.tsv |
| NVFP4 sparse prefill | python benchmarks/bench_sparse_attention_ops.py --dtype nvfp4 --sections sparse_prefill --output_mode o -o /tmp/msa_nvfp4.tsv |
| Quick CI smoke | python benchmarks/bench_sparse_attention_ops.py --dtype fp8 --sections prefill,decode,sparse_decode --seqs 8192,16384 --tp 1,4 --decode-k 8192,131072 --decode-b 32 --dry-run-ms 50 --repeat-ms 200 -o /tmp/msa_smoke.tsv |
| Output-mode checks (dense/paged) | --output_mode maxscore or --output_mode full |
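Because the benchmark writes TSV, the results can be loaded with Python's standard csv module. This is a minimal sketch: the column names in the sample (`section`, `seq_len`, `latency_ms`) are hypothetical placeholders, so check the header row of your own output file for the real ones.

```python
import csv
import io


def read_tsv(stream):
    """Parse a TSV stream with a header row into a list of dicts."""
    return list(csv.DictReader(stream, delimiter="\t"))


def column(rows, name, cast=float):
    """Extract one column, converting each cell with `cast`."""
    return [cast(row[name]) for row in rows]


if __name__ == "__main__":
    # Hypothetical sample; real column names come from the benchmark output.
    sample = io.StringIO(
        "section\tseq_len\tlatency_ms\n"
        "prefill\t8192\t1.5\n"
        "prefill\t16384\t3.0\n"
    )
    rows = read_tsv(sample)
    print(column(rows, "seq_len", int), column(rows, "latency_ms"))
```

For a real run, open the file passed to `-o` (for example /tmp/msa_fp8.tsv) with `open(path, newline="")` and hand the file object to `read_tsv`.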

Layout

```text
python/fmha_sm100/                  Python package
  __init__.py                       Public re-exports (lazy for the CuTe-DSL stack)
  api.py                            fmha_sm100 / fmha_sm100_plan / sparse_topk_select
  jit.py                            Runtime JIT (nvcc + ninja) for the csrc stack
  sparse.py                         Lazy shim that loads the cute/ stack
  sparse_fmha_adapter.py            Bridge: fmha_sm100 API → sparse_atten_func
  csrc/                             CUDA kernels + Jinja templates (JIT-compiled)
    include/                        Vendored FlashInfer / CUTLASS-derived / TRT-LLM headers
  cutlass/                          NVIDIA CUTLASS git submodule (include/ + tools/util/include/)
  cute/                             CuTe-DSL sparse attention (loaded via sys.path)
tests/                              Correctness tests
  smoke/  integration/  regression/
scripts/                            Warmup + cache-management helpers
benchmarks/                         bench_sparse_attention_ops.py
```

Stacks

csrc JIT: dense FlashAttention, paged KV, and the sparse_topk_select indexer. Compiled at runtime from csrc/*.cu.jinja plus csrc/include/. Public entry: fmha_sm100.plan → run.

CuTe-DSL: block-sparse prefill, FP8 / NVFP4 / FP4 quantization, paged FP8 decode (SparseDecodePagedAttentionWrapper), and the FP4 block-score indexer. Public entry: fmha_sm100.sparse_atten_func, fmha_sm100.sparse_decode_atten_func, fmha_sm100.fp4_indexer_block_scores.

Bridge: sparse_fmha_plan / sparse_fmha adapt the dense-API call site to the sparse backend for prefill paths; useful when you already drive the dense kernel and want a one-line swap to sparse.
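The Layout section describes sparse.py as a lazy shim and __init__.py as re-exporting the CuTe-DSL stack lazily. One common way to get that behaviour in Python is to defer the import until the first attribute access. The sketch below shows the general pattern only: `LazyModule` is a hypothetical helper, not MSA's code, and the standard-library `json` module stands in for a heavy backend.

```python
import importlib


class LazyModule:
    """Proxy that imports the target module on first attribute access."""

    def __init__(self, name):
        self._name = name
        self._module = None

    @property
    def loaded(self):
        """True once the underlying module has been imported via this proxy."""
        return self._module is not None

    def __getattr__(self, attr):
        # Only called for attributes not found on the proxy itself.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)


if __name__ == "__main__":
    backend = LazyModule("json")  # stand-in for a heavy backend
    print(backend.loaded)         # nothing imported through the proxy yet
    print(backend.dumps({"topk": 16}))
    print(backend.loaded)
```

The benefit of this design is that importing the package stays cheap for users who only need the dense path; the cost of loading the second stack is paid on first use.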

Third-party licenses

fmha_sm100 bundles, derives from, or depends on the third-party components below. Each retains its original license; this section summarizes them. Authoritative text is shipped with each component.

Vendored / derived source (shipped in this repo)

| Component | License | Where |
| --- | --- | --- |
| NVIDIA CUTLASS | BSD-3-Clause | Git submodule at python/fmha_sm100/cutlass/ (provides include/ + tools/util/include/), plus BSD-3-tagged headers under python/fmha_sm100/csrc/include/. The SM100 MMA descriptor encodings in python/fmha_sm100/cute/src/common/mma_sm100_desc.py mirror CUTLASS hardware descriptors. Copyright (c) 2017–2025 NVIDIA CORPORATION & AFFILIATES. |
| FlashInfer | Apache-2.0 | Headers and sources under python/fmha_sm100/csrc/ and python/fmha_sm100/csrc/include/ that carry a Copyright (c) <year> by FlashInfer team line (e.g. allocator.h, exception.h, utils.cuh, cutlass_utils.cuh, fmha_cutlass_sm100.cuh, sparse_topk_select.cuh, plan.cuh, sm100_fmha_reduction.hpp, tvm_ffi_utils.h). Project: https://github.com/flashinfer-ai/flashinfer. |
| NVIDIA TensorRT-LLM + NAVER Corp (CLOVA) | Apache-2.0 | Portions of python/fmha_sm100/csrc/include/sparse_topk_select.cuh: the indexerTopK histogram-step + insertion-sort, derived from tensorrt_llm/cpp/tensorrt_llm/kernels/indexerTopK.cu. Copyright (c) 2019–2026 NVIDIA CORPORATION; Copyright (c) 2021 NAVER Corp. The per-file header in sparse_topk_select.cuh includes a function-level provenance map. |

Runtime dependencies (installed via pip)

| Package | Upstream | License |
| --- | --- | --- |
| quack-kernels | | Apache-2.0 |
| nvidia-cutlass-dsl | NVIDIA CUTLASS Python DSL | NVIDIA / BSD-3-Clause (see package) |
| apache-tvm-ffi | Apache TVM FFI | Apache-2.0 |
| cuda-python | NVIDIA | NVIDIA / see package |
| torch | | BSD-3-Clause |
| jinja2 | | BSD-3-Clause |
| ninja | | Apache-2.0 |
| pybind11 | | BSD-3-Clause |

The exact license of each installed package is distributed with that package; consult its metadata (pip show <pkg>) for the authoritative text.

Citation

If MSA helps your research, please cite it. (A final BibTeX entry will follow once the companion paper / technical report has a stable identifier; the entry below is a placeholder.)

The algorithmic reference is shipped at docs/MiniMaxSparseAttention.pdf.

```bibtex
@software{msa2026,
  title  = {MiniMax Sparse Attention (MSA): FlashAttention and block-sparse
            attention kernels for NVIDIA SM100},
  author = {{MiniMax}},
  year   = {2026},
  url    = {https://github.com/MiniMax-AI/MSA}
}
```

Contributing

Issues and PRs are welcome on the issue tracker. For kernel or runtime-contract changes, open an issue first to align on the public surface: fmha_sm100.api, fmha_sm100.sparse and cute.interface are the stable entry points; everything else is internal and may change without notice.


TLDR AI · June 15, 2026, 09:00 · about 8 min read

#LLM #Sparse Attention #FlashAttention #NVIDIA Blackwell #Open Source

AI deep analysis · June 16, 2026, 04:06

Key points

Optimized specifically for NVIDIA Blackwell (SM100)

Unified dense and sparse attention

Both the Dense FlashAttention and Sparse Top-k Attention kernels are provided in the same Python package via JIT compilation, so you can switch between them as the use case requires.

High-performance implementation stack

Long-context processing

Automatic dependency retrieval

JIT compilation on first run

Development install

In addition to the standard `pip install .`, an editable-mode `pip install -e .` is supported, which references the source tree directly for development.

