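Overview

We simplify Tuna by progressively stripping away its visual encoding components. Removing the VAE first yields Tuna-R, a pixel-space unified multimodal model (UMM) that relies solely on a representation encoder. Tuna-2 streamlines the design further by bypassing the representation encoder entirely and applying patch embedding layers directly to raw image inputs. With pixel embeddings, Tuna-2 outperforms both Tuna-R and Tuna across a diverse suite of multimodal benchmarks.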
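To make "patch embedding layers on raw image inputs" concrete, here is a minimal pure-Python sketch of the general technique: cut the image into non-overlapping patches, flatten each patch, and project it with a single linear layer. It is not Tuna-2's implementation. The function names `patchify` and `embed`, the patch size of 4, and the embedding width of 16 are all assumptions chosen for illustration, since this README does not state the real values.

```python
import random


def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for row in image[top:top + patch]:
                for pixel in row[left:left + patch]:
                    flat.extend(pixel)  # append the C channel values of this pixel
            tokens.append(flat)
    return tokens


def embed(tokens, weight):
    """Project each flattened patch with one bias-free linear layer.

    `weight` holds one row per output dimension, each as long as a flattened patch.
    """
    return [[sum(x * w for x, w in zip(token, row)) for row in weight]
            for token in tokens]


# Hypothetical sizes: an 8x8 RGB image, 4x4 patches, 16-dimensional embeddings.
rng = random.Random(0)
image = [[[rng.random() for _ in range(3)] for _ in range(8)] for _ in range(8)]
tokens = patchify(image, patch=4)  # 4 patches of 4 * 4 * 3 = 48 values each
weight = [[rng.gauss(0.0, 0.02) for _ in range(48)] for _ in range(16)]
embeddings = embed(tokens, weight)
print(len(embeddings), len(embeddings[0]))  # 4 16
```

In a real model the projection weights are learned and the resulting token sequence is fed to the backbone; here they are random and serve only to show the shapes.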

Generation Results

Installation

    git clone https://github.com/facebookresearch/tuna-2.git
    cd tuna-2
    bash scripts/setup_uv.sh   # creates .venv with all dependencies
    source .venv/bin/activate

Manual setup (if you prefer to drive uv yourself)

    curl -LsSf https://astral.sh/uv/install.sh | sh
    uv sync
    uv pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
    uv pip install -e .
    source .venv/bin/activate

Inference

All inference runs through a single unified script:

    bash scripts/launch/predict.sh --ckpt <PATH> --prompt <TEXT> [OPTIONS]

Options

Flag | Values | Default | Description
---|---|---|---
--ckpt | path | *(required)* | Path to the model checkpoint
--prompt | text | *(required)* | Text prompt (t2i) or editing instruction (edit)
--task | t2i, edit | t2i | Inference task
--variant | none_encoder, siglip_pixel, vae | none_encoder | Model variant: Tuna-2, Tuna-R, or Tuna
--size | 7b, 2b | 7b | Model size (2b is only available with --variant vae)
--resolution | See table below | 512x512 | Output resolution (HxW)
--gpu | int | 0 | GPU device index
--image | path | (none) | Source image (required for --task edit)
--steps | int | 50 | Number of diffusion steps
--guidance | float | *(from config)* | Classifier-free guidance scale
--seed | int | 42 | Random seed
--negative | text | *(from config)* | Negative prompt
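The table implies a few constraints between flags: 2b requires --variant vae, and --task edit requires --image. The sketch below checks those before assembling the command line, which can be handy when launching many runs from a script. The helper name `build_predict_command` and the example paths are hypothetical. Only the flags and values come from the table above, and predict.sh may well perform its own validation.

```python
import shlex

# Allowed values copied from the options table.
TASKS = {"t2i", "edit"}
VARIANTS = {"none_encoder", "siglip_pixel", "vae"}  # Tuna-2, Tuna-R, Tuna
SIZES = {"7b", "2b"}


def build_predict_command(ckpt, prompt, task="t2i", variant="none_encoder",
                          size="7b", image=None, extra=()):
    """Return the argv list for scripts/launch/predict.sh after basic checks."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task}")
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant}")
    if size not in SIZES:
        raise ValueError(f"unknown size: {size}")
    if size == "2b" and variant != "vae":
        raise ValueError("size 2b is only available with --variant vae")
    if task == "edit" and image is None:
        raise ValueError("--task edit requires --image")
    argv = ["bash", "scripts/launch/predict.sh",
            "--ckpt", ckpt, "--prompt", prompt,
            "--task", task, "--variant", variant, "--size", size]
    if image is not None:
        argv += ["--image", image]
    return argv + list(extra)


# Example: an edit run with a hypothetical source image and a custom seed.
command = build_predict_command(
    "/path/to/tuna_2_pixel_7b.pt", "make the sky overcast",
    task="edit", image="/path/to/input.png", extra=["--seed", "7"],
)
print(shlex.join(command))  # shell-quoted form, ready to paste or log
```

Passing the returned list to `subprocess.run(command, check=True)` would execute it without any shell quoting concerns.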

Supported Resolutions

512-class | 1024-class
---|---
512x512 | 1024x1024
448x576 | 896x1152
576x448 | 1152x896
384x672 | 768x1344
672x384 | 1344x768
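Each 1024-class entry in the table is exactly twice its 512-class neighbour in both dimensions, so the full set can be derived from the five base shapes. The sketch below uses that to pick the supported resolution closest to a desired aspect ratio. It reads the values as height x width, following the "(HxW)" note in the options table. The helper names are hypothetical and not part of the repository.

```python
import math

# 512-class resolutions from the table, as (height, width).
BASE_512 = [(512, 512), (448, 576), (576, 448), (384, 672), (672, 384)]


def supported_resolutions(res_class=512):
    """Return the (height, width) pairs for the 512 or 1024 class."""
    scale = {512: 1, 1024: 2}[res_class]
    return [(h * scale, w * scale) for h, w in BASE_512]


def closest_resolution(aspect_w, aspect_h, res_class=512):
    """Pick the supported resolution whose width/height ratio is nearest the target.

    Ratios are compared in log space so that 2:1 and 1:2 are treated symmetrically.
    The result is an 'HxW' string suitable for --resolution.
    """
    target = math.log(aspect_w / aspect_h)
    height, width = min(
        supported_resolutions(res_class),
        key=lambda hw: abs(math.log(hw[1] / hw[0]) - target),
    )
    return f"{height}x{width}"


print(closest_resolution(16, 9))        # 384x672
print(closest_resolution(3, 4, 1024))   # 1152x896
```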

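Examples

See assets/prompts.txt for sample prompts.

    # Tuna-2 (7B, no encoder, 512px)
    bash scripts/launch/predict.sh \
        --ckpt /path/to/tuna_2_pixel_7b.pt \
        --prompt "A highly realistic beauty portrait in extreme close-up, showing the face of a young woman from just above the eyebrows down to the lips. Her skin is natural, luminous, and textured, with visible pores, fine facial hairs, subtle unevenness, and a slightly dewy finish, without heavy retouching or artificial smoothing."

    # Tuna (2B, VAE latent, 512px)
    bash scripts/launch/predict.sh \
        --variant vae --size 2b \
        --ckpt /path/to/tuna_2b.pt \
        --prompt "A brutally realistic cinematic close-up inside a real space station cupola, side profile of a blonde female astronaut floating in zero gravity beside the window, her loose braid drifting naturally, looking out at Earth in silence."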

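Video

Due to policy constraints, we are unable to release the video generation model at this time. However, we provide the complete video training and inference codebase. If you want to train your own video model, it is a ready-to-use starting point: see configs/train/video_t2v.yaml for the training configuration and configs/predict/t2v_2b.yaml for inference.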

TODO

Release some of the Tuna-2 model weights.

Release some of the Tuna model weights.

Release the fully restored model weights (fine-tuned on external data to recover the missing layers).

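A Note on Model Release

Due to organizational policy constraints, we are unable to release the full production-trained model weights. To support the research community, we plan to release a foundation checkpoint with a small number of layers removed from both the LLM backbone and the diffusion head (flow head). The remaining layers and all other components (vision encoder, projections, embeddings, etc.) are fully preserved. With a short fine-tuning pass on your own data, the removed layers can be quickly re-learned and the model restored to full quality.

For detailed fine-tuning instructions, please refer to the training guide.

Meanwhile, we are also fine-tuning the removed layers on external data ourselves and plan to release the complete weights as soon as possible.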
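As a rough illustration of what working with such a checkpoint could involve, the sketch below finds which layer indices have no parameters in a checkpoint and selects only those layers' parameters for training. Everything specific here is an assumption: the `backbone.layers.N.` naming scheme, the helper names, and the choice to train only the missing layers while freezing the rest are not taken from this README. The training guide remains the authoritative procedure.

```python
import re

# Hypothetical naming scheme: per-layer parameters contain "layers.<index>.".
LAYER_PATTERN = r"layers\.(\d+)\."


def missing_layer_indices(param_names, num_layers, pattern=LAYER_PATTERN):
    """Return the layer indices in range(num_layers) with no parameters present."""
    present = set()
    for name in param_names:
        match = re.search(pattern, name)
        if match:
            present.add(int(match.group(1)))
    return [index for index in range(num_layers) if index not in present]


def trainable_names(model_param_names, missing, pattern=LAYER_PATTERN):
    """Select the parameters that belong to the missing layers; the rest stay frozen."""
    chosen = []
    for name in model_param_names:
        match = re.search(pattern, name)
        if match and int(match.group(1)) in missing:
            chosen.append(name)
    return chosen


# Toy example: a 4-layer backbone whose checkpoint lacks layer 2.
checkpoint_names = [
    "embed.weight",
    "backbone.layers.0.attn.weight",
    "backbone.layers.1.attn.weight",
    "backbone.layers.3.attn.weight",
]
model_names = checkpoint_names + ["backbone.layers.2.attn.weight"]
missing = missing_layer_indices(checkpoint_names, num_layers=4)
print(missing)                                     # [2]
print(trainable_names(model_names, set(missing)))  # ['backbone.layers.2.attn.weight']
```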

Citation

    @article{tuna2,
      title={TUNA-2: Pixel Embeddings Beat Vision Encoders
             for Unified Understanding and Generation},
      author={Liu, Zhiheng and Ren, Weiming and Huang, Xiaoke
              and Chen, Shoufa and Li, Tianhong and Chen, Mengzhao
              and Ji, Yatai and He, Sen and Schult, Jonas
              and Xiang, Tao and Chen, Wenhu and Luo, Ping
              and Zettlemoyer, Luke and Cong, Yuren},
      journal={arXiv preprint arXiv:2604.24763},
      year={2026}
    }

    @article{liu2025tuna,
      title={Tuna: Taming unified visual representations for native unified multimodal models},
      author={Liu, Zhiheng and Ren, Weiming and Liu, Haozhe and Zhou, Zijian and Chen, Shoufa and Qiu, Haonan and Huang, Xiaoke and An, Zhaochong and Yang, Fanny and Patel, Aditya and others},
      journal={CVPR2026},
      year={2026}
    }

License

This project is licensed under the Apache License 2.0. See LICENSE for details.


Zhiheng Liu*¹,², Weiming Ren*¹,³, Sen He¹

¹Meta ²The University of Hong Kong ³University of Waterloo

*Equal contribution

[[Project Page]](https://tuna-ai.org/tuna-2) [[arXiv]](https://arxiv.org/abs/2604.24763)



TLDR AI · May 5, 2026, 09:00 · about 5 min read

Tuna-2 (GitHub Repo): Meta Releases Foundation Checkpoint for Its Multimodal Model

#MultimodalAI #Meta #OpenSourceModels #DiffusionModels #LLM


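Key Points

Tuna-2's performance gains and technical features

Changes to the model release strategy

Instead of the full production-trained model weights, only a foundation checkpoint will be released.

Architecture adjustments and implementation

Generated image samples

The GitHub repository includes sample images generated by the model, which show its capabilities visually.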

