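TLDR AI · May 25, 2026, 09:00 · approx. 10 min read

Lance, a Lightweight Native Unified Multimodal Model, on Hugging Face

#Multimodal AI #Image Generation #Video Generation #Efficient AI #ByteDance

TL;DR

Lance, a lightweight native unified multimodal model released by ByteDance, handles image and video understanding, generation, and editing in a single framework at a scale of only 3 billion parameters.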

AI in-depth analysis · May 26, 2026, 01:03


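Key points

Lightweight, high-performing unified architecture

A native unified model that, with only 3B active parameters, performs strongly across image generation, image editing, and video generation.

Efficient training from scratch

Trained entirely from scratch with a staged multi-task recipe, within a budget of 128 A100 GPUs.

Comprehensive multimodal capabilities

Supports understanding, generation, and editing of both images and video within a single framework.

Multi-turn consistent editing

Lance can apply a series of editing instructions in succession while keeping the video consistent from turn to turn.

Intelligent video generation

Lance interprets the user's intent and generates a video that extends the prompt in a logically coherent way.

Broad coverage of video understanding tasks

In the demos, Lance handles VQA tasks that require reasoning, such as counting how often an action occurs, identifying the direction of motion, and spotting implausible events, as well as captioning tasks such as summarizing a cooking step or describing a nature scene in detail.

Fine-grained grasp of temporal and spatial information

Beyond plain object recognition, the demo answers distinguish repeated actions and flag events that violate physical laws (for example, grabbing an object through a phone screen), which requires tracking temporal change and causality within the video.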

Key quotes

Lance is a lightweight native unified multimodal model that supports image and video understanding, generation, and editing within a single framework.

With only 3B active parameters, Lance delivers strong performance across image generation, image editing, and video generation benchmarks.

The man grabs an object through a phone screen.

In a sunlit meadow, a small tortoiseshell butterfly rests on a purple flower. A bee... lands on the same flower.

The Colosseum in Rome, Italy is a magnificent ancient amphitheater made of stone and concrete. It is characterized by its large, arched openings and its semi-circular shape.

A total solar eclipse looks like a dark circle with a bright, white outer edge.


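Impact analysis

The release suggests a shift in multimodal AI from reliance on ever-larger models toward efficiency and unification. Because a 3B model makes image and video generation feasible in resource-constrained settings, it could speed up adoption among researchers and practitioners.

Editorial comment

A unified model that covers video generation at only 3B parameters could substantially lower the barrier to practical use, and it raises expectations for local inference and deployment on edge devices.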



Lance: Unified Multimodal Modeling by Multi-Task Synergy

Fengyi Fu*,

Mengqi Huang*,✉,

Shaojin Wu*,

Yunsheng Jiang*,

Yufei Huo,

Jianzhu Guo✉,§

Hao Li,

Yinghang Song,

Fei Ding,

Qian He,

Zheren Fu,

Zhendong Mao,

Yongdong Zhang

*ByteDance*

\* Equal contribution

✉ Corresponding authors

§ Project lead

English | 简体中文

## 🌟 Highlights

Lance is a lightweight native unified multimodal model that supports **image and video understanding, generation, and editing** within a single framework.

Efficient at 3B scale. With only 3B active parameters, Lance delivers strong performance across image generation, image editing, and video generation benchmarks.

Trained from scratch. Lance is built with a staged multi-task recipe and trained entirely from scratch within a 128-A100-GPU budget.

Lance benchmark overview across image generation, image editing, video generation, and video understanding

## 🎨 Demo

### Text-to-Video

### Video Editing

### Multi-turn Consistency Editing

### Intelligent Video Generation

### Video Understanding

Question: How many times did the person launch objects on the table? Options: (A) 3 (B) 2 (C) 4

Response: (A) 3

Question: The person makes sets of repeated actions. How many distinct repeated actions did the person do? Options: (A) 2 (B) 3 (C) 4

Response: (A) 2

Question: In which direction does the purple sphere move in the video? Options: (A) Down and to the right. (B) Up and to the left. (C) Up and to the right. (D) The object is stationary.

Response: (A) Down and to the right.

Question: What is the unrealistic phenomenon displayed in the video? Options: (A) The man can manipulate time via phone. (B) Man grabs an object through a phone screen. (C) Chocolate transforms into different objects. (D) Visible means of propulsion enables flight.

Response: (B) Man grabs an object through a phone screen.

Question: Offer a succinct account of the culinary process shown in this video.

Response: Add tomato puree and mix it well with chicken pieces.

Question: Provide a detailed description of the given video, capturing its key moments.

Response: In a sunlit meadow, a small tortoiseshell butterfly rests on a purple flower. A bee, with black and yellow stripes, lands on the same flower. The butterfly flaps its wings gently, while the bee busies itself, collecting nectar. The flower sways slightly in the breeze. The butterfly then takes off, and the bee follows, both heading to the next flower. The scene is a vivid display of insect interaction in a natural setting, with the colors of the butterfly and the bee contrasting against the green background of the meadow. The video captures this peaceful moment in a short 6-second duration.

### Text-to-Image Generation

### Image Editing

### Image Understanding

Question: Is the largest segment greater than sum of all the other segments?

Response: Yes, the largest segment in the pie chart is the blue one, which is the most populous segment.

Survey chart about border security priorities

Question: What percentage of respondents want better border security?

Response: 29%

Question: What is the license plate number of the car?

Response: The license plate number of the car is BX62 BFY.

Question: According to the data from the proprietary market research, how much amount was spent on the promotional meetings and events during 1998?

Response: According to the data from the proprietary market research, the total amount spent on the promotional meetings and events during 1998 was approximately $1.3 billion.

Question: What is the appearance of the Colosseum in Rome, Italy?

Response: The Colosseum in Rome, Italy is a magnificent ancient amphitheater made of stone and concrete. It is characterized by its large, arched openings and its semi-circular shape. The Colosseum has been partially restored, and its original color has faded, but the architectural details, like the columns and the semi-circular arches, are still visible. The lighting in the image highlights the grandeur and historical significance of the Colosseum, making it stand out against the backdrop of the evening sky.

Question: How does a total solar eclipse look like from Earth?

Response: A total solar eclipse looks like a dark circle with a bright, white outer edge. This is a result of the Earth's rotation and the Earth's shadow, which partially obscures the sun's light. The bright white edge of the eclipse is caused by the sun's high energy and its interaction with the Earth's atmosphere, while the dark part of the eclipse is due to the Earth's shadow and the surrounding air currents. The solar eclipse's shape, with its bright white edge and dark center, is similar to the shape of a full moon or a dark disk. It is a natural phenomenon that occurs in the atmosphere of the Earth and is an important part of the solar system.

### Installation Steps

First, clone the repository:

```bash
git clone https://github.com/bytedance/Lance.git
cd Lance
```

Then, set up the environment:

```bash
conda create -n Lance python=3.11 -y
conda activate Lance
pip install torch==2.5.1+cu124 torchvision==0.20.1+cu124 torchaudio==2.5.1+cu124 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
pip install flash-attn==2.8.3 --no-build-isolation
```

Note: If installing flash-attn from source fails, you can install the prebuilt wheel instead:

```bash
pip install --no-cache-dir --no-deps --force-reinstall \
  "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3%2Bcu12torch2.5cxx11abiFALSE-cp311-cp311-linux_x86_64.whl"
```
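After installation, a quick check can confirm that the pinned packages are actually in place. The sketch below is an illustration and not part of the repository: `strip_local` and `check_pins` are hypothetical helper names, and the pins are copied from the install commands above.

```python
from importlib import metadata

# Pins copied from the install commands above (local "+cu124" tag dropped).
PINS = {
    "torch": "2.5.1",
    "torchvision": "0.20.1",
    "torchaudio": "2.5.1",
    "flash-attn": "2.8.3",
}


def strip_local(version: str) -> str:
    """Drop a local version label such as '+cu124'."""
    return version.split("+", 1)[0]


def check_pins(pins: dict[str, str]) -> dict[str, str]:
    """Return one status per package: 'ok', 'missing', or a mismatch message."""
    report = {}
    for name, expected in pins.items():
        try:
            installed = strip_local(metadata.version(name))
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if installed == expected:
            report[name] = "ok"
        else:
            report[name] = f"found {installed}, expected {expected}"
    return report


if __name__ == "__main__":
    for package, status in check_pins(PINS).items():
        print(f"{package}: {status}")
```

Running the script inside the `Lance` conda environment prints one line per package; anything other than `ok` points at the install step to repeat.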


### Download Model Weights

Please download all necessary model checkpoints from [Lance-3B on Hugging Face](https://huggingface.co/bytedance-research/Lance) and place them in the `downloads/` directory.

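```python
import os

from huggingface_hub import snapshot_download

save_dir = "./downloads"
repo_id = "bytedance-research/Lance"
cache_dir = os.path.join(save_dir, "cache")

snapshot_download(
    repo_id=repo_id,
    local_dir=save_dir,
    cache_dir=cache_dir,
    # Recent huggingface_hub releases deprecate the next two flags;
    # drop them if your installed version warns about or rejects them.
    local_dir_use_symlinks=False,
    resume_download=True,
    # Fetch only configs, weights, code, and docs.
    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt", "*.pth"],
)
```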
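Once the download finishes, it can help to confirm that weight files actually landed under `downloads/`. The following is a minimal sketch under stated assumptions: `find_weight_files` is a hypothetical helper, the weight suffixes are taken from the `allow_patterns` list above, and the `cache` subdirectory is skipped because the download script places the Hugging Face cache there.

```python
from pathlib import Path

# Weight-file suffixes taken from the allow_patterns list above.
WEIGHT_SUFFIXES = {".safetensors", ".bin", ".pth"}


def find_weight_files(root: str, skip_dir: str = "cache") -> list[str]:
    """Return sorted relative paths of weight files under root, skipping skip_dir."""
    base = Path(root)
    if not base.is_dir():
        return []
    found = []
    for path in base.rglob("*"):
        rel = path.relative_to(base)
        if rel.parts and rel.parts[0] == skip_dir:
            continue  # ignore the cache directory created next to the checkpoints
        if path.is_file() and path.suffix in WEIGHT_SUFFIXES:
            found.append(rel.as_posix())
    return sorted(found)


if __name__ == "__main__":
    files = find_weight_files("./downloads")
    print(f"{len(files)} weight file(s) found")
    for name in files:
        print(" ", name)
```

An empty result usually means the download was interrupted or the files were written to a different directory.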


## 📚 Usage

### Inference

We provide a unified command-line interface for all generation / editing / understanding tasks:

### Option 1: Configure and Run the Unified Script

```bash
bash inference_lance.sh
```


- Before running, please configure the inference parameters at the top of inference_lance.sh.

- Supported tasks: t2i, t2v, image_edit, video_edit, x2t_image, and x2t_video. You can modify TASK_DEFAULT_CONFIGS in inference_lance.py to customize the default data samples for each task.

- Note: For all tasks, we recommend following the prompt format used in the provided examples when writing input prompts, as this typically leads to better generation quality.

### Option 2: Run Task-Specific One-Click Commands

We provide task-specific one-click commands for different generation, editing, and understanding tasks.

### Text-to-Video Generation

```bash
bash inference_lance.sh \
  --TASK_NAME t2v \
  --MODEL_PATH downloads/Lance_3B_Video \
  --RESOLUTION video_480p \
  --NUM_FRAMES 121 \
  --VIDEO_HEIGHT 480 \
  --VIDEO_WIDTH 848 \
  --SAVE_PATH_GEN results/t2v_121f
```


### Text-to-Image Generation

```bash
bash inference_lance.sh \
  --TASK_NAME t2i \
  --MODEL_PATH downloads/Lance_3B \
  --RESOLUTION image_768res \
  --VIDEO_HEIGHT 768 \
  --VIDEO_WIDTH 768 \
  --SAVE_PATH_GEN results/t2i
```


### Video Editing

```bash
bash inference_lance.sh \
  --TASK_NAME video_edit \
  --MODEL_PATH downloads/Lance_3B_Video \
  --RESOLUTION video_480p \
  --SAVE_PATH_GEN results/video_edit
```


### Image Editing

```bash
bash inference_lance.sh \
  --TASK_NAME image_edit \
  --MODEL_PATH downloads/Lance_3B \
  --RESOLUTION image_768res \
  --SAVE_PATH_GEN results/image_edit
```


### Video Understanding

```bash
bash inference_lance.sh \
  --TASK_NAME x2t_video \
  --MODEL_PATH downloads/Lance_3B_Video \
  --RESOLUTION video_480p \
  --NUM_FRAMES 50 \
  --SAVE_PATH_GEN results/x2t_video
```


### Image Understanding

```bash
bash inference_lance.sh \
  --TASK_NAME x2t_image \
  --MODEL_PATH downloads/Lance_3B \
  --RESOLUTION image_768res \
  --SAVE_PATH_GEN results/x2t_image
```
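The six one-click commands differ only in a handful of flags, so they can be generated from a small lookup table. The sketch below is one possible way to script this and is not part of the repository: `TASK_FLAGS` and `build_command` are hypothetical names, and every flag value is copied from the commands above.

```python
import shlex

# Per-task flags copied from the one-click commands above.
TASK_FLAGS = {
    "t2v": {"MODEL_PATH": "downloads/Lance_3B_Video", "RESOLUTION": "video_480p",
            "NUM_FRAMES": 121, "VIDEO_HEIGHT": 480, "VIDEO_WIDTH": 848,
            "SAVE_PATH_GEN": "results/t2v_121f"},
    "t2i": {"MODEL_PATH": "downloads/Lance_3B", "RESOLUTION": "image_768res",
            "VIDEO_HEIGHT": 768, "VIDEO_WIDTH": 768,
            "SAVE_PATH_GEN": "results/t2i"},
    "video_edit": {"MODEL_PATH": "downloads/Lance_3B_Video", "RESOLUTION": "video_480p",
                   "SAVE_PATH_GEN": "results/video_edit"},
    "image_edit": {"MODEL_PATH": "downloads/Lance_3B", "RESOLUTION": "image_768res",
                   "SAVE_PATH_GEN": "results/image_edit"},
    "x2t_video": {"MODEL_PATH": "downloads/Lance_3B_Video", "RESOLUTION": "video_480p",
                  "NUM_FRAMES": 50, "SAVE_PATH_GEN": "results/x2t_video"},
    "x2t_image": {"MODEL_PATH": "downloads/Lance_3B", "RESOLUTION": "image_768res",
                  "SAVE_PATH_GEN": "results/x2t_image"},
}


def build_command(task: str, **overrides) -> list[str]:
    """Build the argv list for one task; keyword overrides replace default flags."""
    if task not in TASK_FLAGS:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(TASK_FLAGS)}")
    flags = {"TASK_NAME": task, **TASK_FLAGS[task], **overrides}
    argv = ["bash", "inference_lance.sh"]
    for name, value in flags.items():
        argv += [f"--{name}", str(value)]
    return argv


if __name__ == "__main__":
    # Print a shorter text-to-video run; pass the list to subprocess.run to execute it.
    print(shlex.join(build_command("t2v", NUM_FRAMES=50)))
```

Keeping the flags in one table makes it easy to loop over all tasks or to change an output directory without editing six separate commands.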


### Available Tasks

| Task Name | Description | Example JSON |
| --- | --- | --- |
| `t2v` | Text-to-Video generation | `config/examples/t2v_example.json` |
| `t2i` | Text-to-Image generation | `config/examples/t2i_example.json` |
| `image_edit` | Image editing | `config/examples/image_edit_example.json` |
| `video_edit` | Video editing | `config/examples/video_edit_example.json` |
| `x2t_image` | Image understanding | `config/examples/x2t_image_example.json` |
| `x2t_video` | Video understanding | `config/examples/x2t_video_example.json` |

For understanding examples:

- config/examples/x2t_image_example.json: image understanding examples for visual question answering and image-based reasoning.

- config/examples/x2t_video_example.json: video understanding examples for video question answering and video captioning.

### Parameters

You can configure the following hyperparameters at the top of the `inference_lance.sh` script:

| Parameter | Default Value | Description |
| --- | --- | --- |
| `MODEL_PATH` | `"downloads/lance_3b"` | Path to the downloaded Lance model weights. |
| `NUM_GPUS` | `1` | Number of GPUs to use for inference. |
| `VALIDATION_NUM_TIMESTEPS` | `30` | Number of denoising steps (e.g., 30 or 50). |
| `VALIDATION_TIMESTEP_SHIFT` | `3.5` | Timestep shift parameter for flow matching scheduling. |
| `CFG_TEXT_SCALE` | `4.0` | Classifier-Free Guidance (CFG) scale for text conditioning. |
| `VALIDATION_DATA_SEED` | `42` | Random seed for generation reproducibility. |
| `NUM_FRAMES` | `50` | Number of frames for video generation (Max: 121). *Unused for image tasks.* |
| `VIDEO_HEIGHT` / `VIDEO_WIDTH` | `768` | Spatial resolution. *Unused for editing tasks (determined by input image/video).* |
| `RESOLUTION` | `"video_480p"` | Base resolution preset (`image_768res` or `video_480p`). |
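To give a feel for what `VALIDATION_TIMESTEP_SHIFT` and `CFG_TEXT_SCALE` control, here is a small numeric sketch. It assumes the shift warps a uniform schedule as t' = s·t / (1 + (s - 1)·t), a form used by several flow-matching samplers, and that guidance combines predictions as uncond + scale·(cond - uncond). Whether Lance's scheduler uses exactly these formulas is an assumption, and `shifted_timesteps` and `cfg_combine` are illustrative names rather than functions from the repository.

```python
def shifted_timesteps(num_steps: int, shift: float) -> list[float]:
    """Uniform times from 1.0 (assumed pure noise) downward, warped by the shift."""
    uniform = [1.0 - i / num_steps for i in range(num_steps)]
    # A shift above 1 keeps more of the steps near the noisy end of the schedule.
    return [shift * t / (1.0 + (shift - 1.0) * t) for t in uniform]


def cfg_combine(cond: list[float], uncond: list[float], scale: float) -> list[float]:
    """Classifier-free guidance: move from the unconditional toward the conditional prediction."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    steps = shifted_timesteps(num_steps=30, shift=3.5)
    print("first five timesteps:", [round(t, 3) for t in steps[:5]])
    print("guided prediction:", cfg_combine([0.2, -0.1], [0.1, 0.0], scale=4.0))
```

With `shift=1.0` the schedule stays uniform, and with `scale=1.0` the guided prediction equals the conditional one, which makes both defaults easy to reason about when tuning.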

### Gradio

```bash
python lance_gradio_t2v_v2t.py --gpus 0 --server-port 7860
```


### Benchmarks

### DPG-Bench Evaluation

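| Models | # Params. | Global | Entity | Attribute | Relation | Other | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| *Generation-only Models* | | | | | | | |
| SDXL | 3.5B | 83.27 | 82.43 | 80.91 | 86.76 | 80.41 | 74.65 |
| DALL-E 3 | - | 90.97 | 89.61 | 88.39 | 90.58 | 89.83 | 83.50 |
| SD3-Medium | 2B | 87.90 | 91.01 | 88.83 | 80.70 | 88.68 | 84.08 |
| FLUX.1-dev | 12B | 74.35 | 90.00 | 88.96 | 90.87 | 88.33 | 83.84 |
| Qwen-Image | 20B | 91.32 | 91.56 | 92.02 | 94.31 | 92.73 | 88.32 |
| *Unified Models* | | | | | | | |
| Janus-Pro-7B | 7B | 86.90 | 88.90 | 89.40 | 89.32 | 89.48 | 84.19 |
| OmniGen2 | 4B | 88.81 | 88.83 | 90.18 | 89.37 | 90.27 | 83.57 |
| Show-o2 | 7B | 89.00 | **91.78** | 89.96 | 91.81 | **91.64** | 86.14 |
| BAGEL† | 7B | 88.94 | 90.37 | 91.29 | 90.82 | 88.67 | 85.07 |
| InternVL-U | 1.7B | 90.39 | 90.78 | 90.68 | 90.29 | 88.77 | 85.18 |
| TUNA | 7B | **90.42** | 91.68 | 90.94 | 91.87 | 90.73 | **86.76** |
| TUNA-2 | 7B | 89.50 | 91.40 | **92.07** | 91.91 | 88.81 | 86.54 |
| 🌟 **Lance (Ours)** | **3B** | **83.89** | **91.07** | **89.36** | **93.38** | **80.80** | **84.67** |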

*† indicates methods that use LLM rewriters for prompt rewriting before generation.*

### GenEval Evaluation

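| Models | # Params. | 1-Obj. | 2-Obj. | Count | Colors | Position | Attr. | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Generation-only Models* | | | | | | | | |
| SDXL | 3.5B | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 |
| DALL-E 3 | - | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 | 0.67 |
| SD3-Medium | 2B | 0.99 | 0.94 | 0.72 | 0.89 | 0.33 | 0.60 | 0.74 |
| FLUX.1-dev | 12B | 0.98 | 0.93 | 0.75 | 0.93 | 0.68 | 0.65 | 0.82 |
| Qwen-Image | 20B | 0.99 | 0.92 | 0.89 | 0.88 | 0.76 | 0.77 | 0.87 |
| *Unified Models* | | | | | | | | |
| Janus-Pro-7B | 7B | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 | 0.80 |
| OmniGen2 | 4B | **1.00** | 0.95 | 0.64 | 0.88 | 0.55 | 0.76 | 0.80 |
| Show-o2 | 7B | **1.00** | 0.87 | 0.58 | 0.92 | 0.52 | 0.62 | 0.76 |
| BAGEL† | 7B | 0.98 | 0.95 | **0.84** | 0.95 | 0.78 | 0.77 | 0.88 |
| Mogao | 7B | **1.00** | **0.97** | 0.83 | 0.93 | 0.84 | 0.80 | 0.89 |
| InternVL-U | 1.7B | 0.99 | 0.94 | 0.74 | 0.91 | 0.77 | 0.74 | 0.85 |
| TUNA | 7B | **1.00** | **0.97** | 0.81 | 0.91 | **0.88** | **0.83** | **0.90** |
| TUNA-2 | 7B | 0.99 | 0.96 | 0.80 | 0.91 | 0.84 | 0.76 | 0.87 |
| 🌟 **Lance (Ours)** | **3B** | **1.00** | **0.94** | **0.84** | **0.97** | **0.87** | **0.81** | **0.90** |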

*† indicates methods that use LLM rewriters for prompt rewriting before generation.*

### GEdit-Bench Evaluation

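| Models | # Params. | BC | CA | MM | MC | PB | ST | SA | SR | SRp | TM | TT | Avg/G_O |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Generation-only Models* | | | | | | | | | | | | | |
| Gemini 2.0 | - | - | - | - | - | - | - | - | - | - | - | - | 6.32 |
| GPT Image 1 | - | 6.96 | 6.85 | 7.10 | 5.41 | 6.74 | 7.44 | 7.51 | 8.73 | 8.55 | 8.45 | 8.69 | 7.49 |
| Qwen-Image-Edit | 20B | 8.23 | 8.30 | 7.33 | 8.05 | 7.49 | 6.74 | 8.57 | 8.09 | 8.29 | 8.48 | 8.50 | 8.01 |
| *Unified Models* | | | | | | | | | | | | | |
| Lumina-DiMOO | 8B | 3.43 | 4.27 | 3.08 | 2.77 | 4.74 | 5.19 | 4.44 | 3.80 | 4.38 | 2.68 | 4.20 | 3.91 |
| Ovis-U1 | 1.2B | 7.49 | 6.88 | 6.21 | 4.79 | 5.98 | 6.46 | 7.49 | 7.25 | 7.27 | 4.48 | 6.31 | 6.42 |
| BAGEL | 7B | 7.32 | 6.91 | 6.38 | 4.75 | 4.57 | 6.15 | **7.90** | 7.16 | 7.02 | 7.32 | 6.22 | 6.52 |
| InternVL-U | 1.7B | 7.08 | 7.05 | 6.38 | 7.02 | 6.03 | 6.27 | 7.13 | 6.55 | 6.33 | 6.59 | 6.85 | 6.66 |
| InternVL-U (w/ CoT) | 1.7B | 7.05 | **7.87** | 6.50 | 6.99 | 5.77 | 6.10 | 7.33 | 7.16 | 7.12 | **7.36** | 6.46 | 6.88 |
| 🌟 **Lance (Ours)** | **3B** | **7.73** | 7.74 | **7.28** | **7.83** | **7.50** | **7.03** | 7.64 | **7.85** | **7.71** | 4.46 | **7.57** | **7.30** |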

### VBench Evaluation (Video Generation)

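| Type | Model | # Params. | Total Score ↑ |
| --- | --- | --- | --- |
| *Gen. Only* | ModelScope | 1.7B | 75.75 |
| *Gen. Only* | LaVie | 3B | 77.08 |
| *Gen. Only* | Show-1 | 6B | 78.93 |
| *Gen. Only* | AnimateDiff-V2 | - | 80.27 |
| *Gen. Only* | VideoCrafter-2.0 | - | 80.44 |
| *Gen. Only* | CogVideoX | 5B | 81.61 |
| *Gen. Only* | Kling | - | 81.85 |
| *Gen. Only* | Open-Sora-2.0 | - | 81.71 |
| *Gen. Only* | Gen-3 | - | 82.32 |
| *Gen. Only* | Step-Video-T2V | 30B | 81.83 |
| *Gen. Only* | Hunyuan Video | - | 83.43 |
| *Gen. Only* | Wan2.1-T2V | 14B | 83.69 |
| *Unified* | HaproOmni | 7B | 78.10 |
| *Unified* | Emu3 | 8B | 80.96 |
| *Unified* | VILA-U | 7B | 74.01 |
| *Unified* | Show-o2 | 2B | 81.34 |
| *Unified* | TUNA | 1.5B | 84.06 |
| *Unified* | 🌟 **Lance (Ours)** | **3B** | **85.11** |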
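The size of Lance's lead on VBench can be read off the table with a little arithmetic. The sketch below is only an illustration: it uses scores copied from the table above (the generation-only list is truncated to the two highest entries), and `margin_over_best` is a hypothetical helper name.

```python
# Total scores copied from the VBench table above.
UNIFIED = {
    "HaproOmni": 78.10, "Emu3": 80.96, "VILA-U": 74.01,
    "Show-o2": 81.34, "TUNA": 84.06, "Lance": 85.11,
}
# Only the two highest generation-only entries are needed for the comparison.
GEN_ONLY_TOP = {"Hunyuan Video": 83.43, "Wan2.1-T2V": 83.69}


def margin_over_best(scores: dict[str, float], target: str) -> tuple[str, float]:
    """Return the strongest competitor of target and target's lead over it."""
    others = [name for name in scores if name != target]
    best_other = max(others, key=scores.get)
    return best_other, round(scores[target] - scores[best_other], 2)


if __name__ == "__main__":
    print("vs unified:", margin_over_best(UNIFIED, "Lance"))
    print("vs generation-only:",
          margin_over_best({**GEN_ONLY_TOP, "Lance": UNIFIED["Lance"]}, "Lance"))
```

On these numbers the lead is about one point over the best unified model and under one and a half points over the best generation-only model, so the comparison is close rather than lopsided.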

### Running Benchmarks

Ready-to-run benchmark scripts are provided under `benchmarks/`:

| Benchmark | Modality | Script |
| --- | --- | --- |
| GenEVAL (image gen) | Image | `benchmarks/image_gen/GenEVAL/sample_GenEVAL.sh` |
| DPG (image gen) | Image | `benchmarks/image_gen/DPG/sample_DPG.sh` |
| GEdit (image edit) | Image | `benchmarks/image_gen/GEdit/sample_GEdit.sh` |
| VBench (video gen) | Video | `benchmarks/video_gen/Vbench/sample_vbench.sh` |
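Before launching a long benchmark run, it can be useful to check that the scripts exist in the checkout. The sketch below assumes the scripts are invoked with `bash` from the repository root and take no arguments, which the source does not state; `missing_scripts` and `run_benchmark` are hypothetical helper names, and the paths are copied from the table above.

```python
import subprocess
from pathlib import Path

# Script paths copied from the benchmark table above.
BENCHMARK_SCRIPTS = {
    "GenEVAL": "benchmarks/image_gen/GenEVAL/sample_GenEVAL.sh",
    "DPG": "benchmarks/image_gen/DPG/sample_DPG.sh",
    "GEdit": "benchmarks/image_gen/GEdit/sample_GEdit.sh",
    "VBench": "benchmarks/video_gen/Vbench/sample_vbench.sh",
}


def missing_scripts(repo_root: str = ".") -> list[str]:
    """Return the benchmark names whose script file is absent under repo_root."""
    root = Path(repo_root)
    return [name for name, rel in BENCHMARK_SCRIPTS.items() if not (root / rel).is_file()]


def run_benchmark(name: str, repo_root: str = ".") -> int:
    """Run one benchmark script with bash from repo_root and return its exit code."""
    # Assumption: the script needs no arguments; check its header before relying on this.
    result = subprocess.run(["bash", BENCHMARK_SCRIPTS[name]], cwd=repo_root)
    return result.returncode


if __name__ == "__main__":
    absent = missing_scripts(".")
    print("missing scripts:", absent if absent else "none")
```

Calling `run_benchmark("GenEVAL")` from the repository root would then start the sampling script for that benchmark.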

## 📄 License

Copyright 2025 Bytedance Ltd. and/or its affiliates.

## 🙏 Acknowledgements

We would like to thank the contributors of [BAGEL](https://github.com/ByteDance-Seed/bagel), [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct), and [Wan2.2](https://github.com/Wan-Video/Wan2.2) for their open research and contributions.

## 💖 Citation

If you find **Lance** useful for your project or research, please 🌟 this repo and cite our work using the following BibTeX:

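```bibtex
@misc{lance2026,
  title  = {Lance: Unified Multimodal Modeling by Multi-Task Synergy},
  author = {Fengyi Fu and Mengqi Huang and Shaojin Wu and Yunsheng Jiang and Yufei Huo and Jianzhu Guo and Hao Li and Yinghang Song and Fei Ding and Qian He and Zheren Fu and Zhendong Mao and Yongdong Zhang},
  year   = {2026},
  note   = {Manuscript}
}
```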


## 📞 Contact

For questions, issues, or collaborations, please contact [Mengqi Huang](https://corleone-huang.github.io/) and [Jianzhu Guo](https://guojianzhu.com/).
