TLDR AI · May 21, 2026, 09:00 · approx. 5 min read

WavFlow generates audio directly in waveform space (GitHub repo)

#Audio Generation #Flow Matching #Waveform Space #Multimodal AI #Facebook Research

TL;DR

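WavFlow, developed by Facebook Research, generates audio directly in waveform space, bypassing latent compression, and reports fidelity and synchronization on par with established latent-based methods.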

AI in-depth analysis · July 4, 2026, 23:15


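Key points

Direct generation in waveform space

WavFlow avoids compressing audio into a latent space and instead generates audio directly in the raw waveform space.

Technical innovation: patchifying and amplitude lifting

Waveform patchifying and amplitude lifting make flow matching on raw audio stable.

Benchmark performance on par with latent-based methods

On the VGGSound (VT2A) and AudioCaps (T2A) evaluations, WavFlow performs on par with established latent-based methods in acoustic richness, fidelity, and synchronization.

End-to-end generation

The results show that end-to-end waveform generation, producing high-quality audio directly from text or video input, can reach the same level as conventional frameworks.

Support for diverse sound sources

The repository presents demo videos covering a range of generated audio, including music samples such as drums, which suggests high-quality audio synthesis.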

Flow matching on raw waveforms

The method applies flow matching with direct x-prediction to raw waveform data rather than to a compressed latent representation.

Flexible inference input format

The input CSV can specify video, text, or both for each row; when a modality is absent, a learned empty token is used in its place.


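Impact analysis

This technique could overturn the established practice in audio generation of going through a latent space, and may mark an important turning point toward more faithful, potentially lower-latency audio synthesis. In particular, the ability to generate high-quality audio directly while staying synchronized with video is likely to be of immediate value for content production and, possibly, real-time applications.

Editor's comment

Direct generation without a latent space is a notable result that could help reduce the information loss and artifacts that have been a bottleneck for conventional audio generation models.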


Overview

WavFlow introduces a paradigm for generating synchronized, high-fidelity audio from video and text inputs directly in the raw waveform space, bypassing latent compression entirely. Through *waveform patchifying* and *amplitude lifting*, WavFlow enables stable flow matching on raw audio via direct *x*-prediction. Evaluation on the VGGSound (VT2A) and AudioCaps (T2A) benchmarks shows that WavFlow delivers performance on par with established latent-based methods, proving that end-to-end waveform generation can match traditional frameworks in acoustic richness, fidelity, and synchronization.
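To make the terms above more concrete, here is a minimal, dependency-free sketch of what patchifying a waveform and turning an x-prediction into a flow-matching velocity can look like. It is not taken from the WavFlow code: the function names, the patch size, and the linear interpolation path `x_t = t * x + (1 - t) * noise` are all assumptions chosen for illustration. Amplitude lifting is not shown because the overview does not define it.

```python
from typing import List


def patchify(wave: List[float], patch_size: int) -> List[List[float]]:
    """Split a mono waveform into non-overlapping patches (one token each)."""
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    usable = len(wave) - len(wave) % patch_size  # drop the ragged tail
    return [wave[i:i + patch_size] for i in range(0, usable, patch_size)]


def unpatchify(patches: List[List[float]]) -> List[float]:
    """Concatenate patches back into a flat waveform."""
    return [sample for patch in patches for sample in patch]


def velocity_from_x_prediction(x_t: List[float], x_pred: List[float],
                               t: float, eps: float = 1e-4) -> List[float]:
    """Assumed path: x_t = t * x + (1 - t) * noise, so x - noise = (x - x_t) / (1 - t)."""
    denom = max(1.0 - t, eps)  # avoid dividing by zero as t approaches 1
    return [(xp - xt) / denom for xp, xt in zip(x_pred, x_t)]
```

Under the assumed path, the velocity recovered from a perfect x-prediction equals `x - noise`, which is the time derivative of `x_t`, so a sampler could integrate it with any ODE solver.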

Demo

🌳 Forest *(natural)*

forest.mp4

🐸 Frog *(animal)*

frog.mp4

🥁 Drum *(music)*

drum.mp4

🛹 Skateboard *(sport)*

skateboard.mp4

See the Project Page for 24+ samples and side-by-side benchmark comparisons.

Method

Installation

```bash
git clone https://github.com/facebookresearch/WavFlow.git
cd WavFlow
bash scripts/setup.sh        # creates conda env 'wavflow' and installs everything
conda activate wavflow
```

Manual setup

```bash
conda create -n wavflow python=3.10 -y
conda activate wavflow
pip install -r requirements.txt
pip install -e . --no-deps
conda install -n wavflow -c conda-forge "ffmpeg<7" -y    # for torio video decoding
```

All required external weights (CLIP, Synchformer, the empty-string CFG embedding) are downloaded or computed automatically on first run and cached under ~/.cache/wavflow/.

Inference

⚠️ Due to organizational policy constraints, we are currently unable to release the production-trained checkpoints. We are working on a foundation checkpoint trained on fully open-source data; in the meantime you can train your own — see the training guide.

Once you have a trained checkpoint, run:

```bash
bash scripts/launch/predict.sh [--gpu N] [--config PATH]
```

The default config is wavflow/configs/infer.yaml. The input CSV (data.csv_path) accepts video, text, or both:

```text
video_path,caption,video_exist,text_exist
/abs/path/sample1.mp4,a whistling rocket explodes,1,1   # video + text
/abs/path/sample2.mp4,birds chirping in a forest,1,1    # video + text
,a whistling rocket explodes,0,1                        # text-only
/abs/path/sample3.mp4,,1,0                              # video-only
```
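If you build this CSV programmatically, a small helper along the following lines may be convenient. This is a sketch, not part of WavFlow: `make_row` and `rows_to_csv` are hypothetical names, and the rule that each flag simply mirrors whether the corresponding field is non-empty is an assumption based on the example rows. The trailing `#` annotations from the example are left out, and Python's `csv` module takes care of quoting captions that contain commas.

```python
import csv
import io
from typing import Optional

FIELDS = ["video_path", "caption", "video_exist", "text_exist"]


def make_row(video_path: Optional[str] = None, caption: Optional[str] = None) -> dict:
    """Build one input row; the *_exist flags mirror which modality is present."""
    if not video_path and not caption:
        raise ValueError("a row needs a video, a caption, or both")
    return {
        "video_path": video_path or "",
        "caption": caption or "",
        "video_exist": 1 if video_path else 0,
        "text_exist": 1 if caption else 0,
    }


def rows_to_csv(rows) -> str:
    """Serialize rows with a header line; fields containing commas are quoted."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = [
    make_row("/abs/path/sample1.mp4", "a whistling rocket explodes"),
    make_row(caption="birds chirping, then rain"),  # text-only, caption has a comma
    make_row("/abs/path/sample3.mp4"),              # video-only
]
print(rows_to_csv(rows))
```

Writing the returned string to the file referenced by `data.csv_path` would then give the launcher its input.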

Configuration reference

Launcher options

| Flag / env | Default | Description |
|---|---|---|
| `--gpu N` *(or `GPU=N`)* | `0` | CUDA device index |
| `--config PATH` *(or `CONFIG_PATH=...`)* | `wavflow/configs/infer.yaml` | YAML config to load |
| `WAVFLOW_ENV` | `wavflow` | conda env name to auto-activate |

Any extra positional argument is forwarded to python -m wavflow.infer.

Key fields in infer.yaml

| Field | What to set |
|---|---|
| `data.csv_path` | the input CSV (above) |
| `model.name` | one of `medium_16k`, `medium_44k`, `large_16k`, `large_44k` (must match the trained checkpoint) |
| `model.ckpt_path` | a `checkpoint_*.pth` (full checkpoint) or `ema_epoch_*.pth` (EMA-only) |
| `model.use_ema` | `true` to load `model_ema1` from a full checkpoint; `false` to use the live model weights |
| `inference.duration_sec` / `target_sample_rate` | output length and sample rate (must match the model architecture) |
| `inference.cfg`, `num_steps`, `noise_scale`, `noise_shift`, `prediction_type`, `seed` | sampling hyperparameters |
| `inference.batch_size` | rows per ODE batch |
| `inference.trim_to_duration` | trim output to `duration_sec` |
| `output.output_dir` | where wavs are written |
| `output.loudness_norm`, `loudness_target_lufs` | optional pyloudnorm post-processing |
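A quick pre-flight check can catch the most common mismatches before a run starts. The sketch below is an illustration only: `check_infer_config` is a hypothetical helper, and it assumes the dotted field names map to nested keys in the loaded YAML (for example `model.name` becoming `cfg["model"]["name"]`). It verifies only what the fields above state explicitly.

```python
from fnmatch import fnmatch
from pathlib import Path

MODEL_NAMES = {"medium_16k", "medium_44k", "large_16k", "large_44k"}
CKPT_PATTERNS = ("checkpoint_*.pth", "ema_epoch_*.pth")


def check_infer_config(cfg: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    model = cfg.get("model", {})

    name = model.get("name")
    if name not in MODEL_NAMES:
        problems.append(f"model.name {name!r} is not one of {sorted(MODEL_NAMES)}")

    # Only the file name is compared against the documented naming patterns.
    ckpt_name = Path(str(model.get("ckpt_path", ""))).name
    if not any(fnmatch(ckpt_name, pattern) for pattern in CKPT_PATTERNS):
        problems.append(f"model.ckpt_path {ckpt_name!r} matches neither naming pattern")

    if not cfg.get("data", {}).get("csv_path"):
        problems.append("data.csv_path is not set")
    return problems
```

The dictionary passed in could come from any YAML loader; the check itself does not read files or touch the checkpoint.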

CSV semantics

- `video_exist=0` → uses the learned empty CLIP/Sync tokens (no video decode)
- `text_exist=0` → uses the learned empty CLIP-text token (caption ignored)
- The `id` column is optional; without it, the wav file name is derived from `Path(video_path).stem`, falling back to `row_<idx>` for text-only rows
- Captions containing commas must be quoted
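The file-naming rule can be written out as a short function. This is a reading of the rule as stated, not the project's code: `output_stem` is a hypothetical name, it assumes a non-empty `id` takes precedence over the video path, and whether `idx` starts at 0 or 1 is not specified in the text.

```python
from pathlib import Path


def output_stem(row: dict, idx: int) -> str:
    """Pick the wav file stem for one CSV row: id, then video stem, then row_<idx>."""
    if row.get("id"):
        return str(row["id"])
    if row.get("video_path"):
        return Path(row["video_path"]).stem
    return f"row_{idx}"


print(output_stem({"video_path": "/abs/path/sample1.mp4", "caption": ""}, 0))
print(output_stem({"video_path": "", "caption": "a whistling rocket explodes"}, 2))
```

Two video rows that share a file name in different directories would collide under this rule, which is one reason to supply an explicit `id` column.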

EMA caveat

The EMA tensor stored as model_ema1 is updated with ema_decay = 0.9999 per step. After only a few hundred / thousand steps it still contains random-init values and produces noise during inference. Set model.use_ema: false (or pass an ema_epoch_*.pth saved after enough steps) when sampling from a short / overfit run.
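The arithmetic behind this caveat is easy to check. The sketch below assumes the standard update `ema = decay * ema + (1 - decay) * weights`, with the EMA starting from the initial weights and no warm-up or bias correction; whether WavFlow applies any such correction is not stated above. Under that assumption, the share of the EMA that is still the initial value after `n` steps is `decay ** n`.

```python
import math


def init_weight_remaining(steps: int, decay: float = 0.9999) -> float:
    """Fraction of the EMA still carried over from its initial value after `steps` updates."""
    return decay ** steps


def steps_until_init_below(fraction: float, decay: float = 0.9999) -> int:
    """Smallest step count at which the initial value's share is at or below `fraction`."""
    return math.ceil(math.log(fraction) / math.log(decay))


for n in (500, 1_000, 10_000, 50_000):
    share = init_weight_remaining(n)
    print(f"{n:>6} steps: {share:.1%} of the EMA is still the initial value")
```

With `ema_decay = 0.9999`, roughly 90% of the EMA is still the initial value after 1,000 steps, which is why sampling from it that early yields noise.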

Training

For feature extraction and training (single-node and multi-node), see TRAINING.md.

Citation

```bibtex
@misc{zhou2026wavflowaudiogenerationwaveform,
      title={WavFlow: Audio Generation in Waveform Space},
      author={Feiyan Zhou and Luyuan Wang and Shoufa Chen and Zhe Wang and Zhiheng Liu and Yuren Cong and Xiaohui Zhang and Fanny Yang and Belinda Zeng},
      year={2026},
      eprint={2605.18749},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2605.18749},
}
```

Acknowledgements

WavFlow builds on the open-source community. We gratefully acknowledge:

- MMAudio: multimodal audio generation
- JiT: Just Image Transformer
- Synchformer: audio-visual synchronization

License

The majority of WavFlow is licensed under CC-BY-NC 4.0. Portions of the project are vendored from third-party open source projects under their original license terms (MIT, Apache 2.0, CC BY-NC 4.0, and Stability AI Community License). See NOTICE.txt for the full per-component breakdown and license texts.
