ByteDanceのオープンウェイトHeliosモデル、1分間のAI動画生成をほぼリアルタイムに
Bytedanceの研究者が公開したオープンウェイトの動画生成AIモデル「Helios」は、単一GPUで19.5 FPSを達成し、1分間の動画生成をほぼリアルタイムに近づけたと発表した。
キーポイント
画期的な性能
140億パラメータの動画生成モデルとして初めて、単一GPUで19.5 FPSを達成し、1分間の動画生成を実現した。
オープンウェイト公開
コードとモデルウェイトが公開されており、研究コミュニティや開発者が自由に利用・改良できる。
実用性の向上
従来の動画生成AIよりも高速な処理により、より実用的なアプリケーションへの道を開いた。
技術的革新
大規模な動画生成モデルを単一GPUで効率的に実行する技術的進歩を示している。
影響分析・編集コメントを表示
影響分析
この発表は、動画生成AIの実用化に向けた重要なマイルストーンであり、オープンウェイトとして公開されることで、研究コミュニティ全体の進歩を加速させる可能性が高い。単一GPUでの高速処理は、より広範な開発者や企業が動画生成技術を利用する道を開く。
編集コメント
動画生成AIの性能競争が激化する中、オープンウェイトでの公開は業界全体の技術進歩を促す戦略的な動きと言える。単一GPUでの高速処理実現は、コスト面でも大きな意味を持つ。

Bytedanceの研究者らは、Heliosが単一GPUで19.5 FPSを達成し、1分間の動画クリップを生成する初の140億パラメータ動画モデルであると述べています。コードとモデル重みは公開されています。
この記事「BytedanceのオープンウェイトHeliosモデル、1分間のAI動画生成をほぼリアルタイムに」は、The Decoderで最初に公開されました。
原文を表示
Helios is the first 14B video model to hit 19.5 FPS on a single GPU while producing minute-long videos. The code and model weights are publicly available.
Most video generation models today only produce 5-to-10-second clips and can take minutes to render them. Real-time approaches for longer videos rely on much smaller 1.3B models that struggle with quality. Larger models like Krea-RealTime-14B top out at 6.7 FPS on an H100 and suffer from severe drifting artifacts.
Helios builds on Wan-2.1-14B, which takes around 50 minutes to generate five seconds of video on an A100. Training happens in three stages: Helios-Base (architecture and anti-drifting), Helios-Mid (token compression, 1.05 FPS), and Helios-Distilled, which maxes out speed by cutting computation down to just three steps.
In developer benchmarks, the distilled version of Helios hits 19.53 FPS—even faster than some much smaller distilled models. SANA Video Long, which has 2 billion parameters and is roughly seven times smaller, only manages 13.24 FPS.
At 19.53 FPS on a single H100, Helios-Distilled matches the speed of significantly smaller 1.3B models, while other 14B models sometimes drop below 1 FPS. | Image: Yuan et al.
On video quality, Helios scores 6.00 overall for short videos with 81 frames. The authors say that it beats every distilled model and is on par with most base models at this size. For long videos, Helios scored 6.94, edging out the previous leader, Reward Forcing, at 6.88. A user study with 200 participants backs up the results.
Helios-Base leads the long video quality ranking and holds its own against significantly larger base models on short video generation. | Image: Yuan et al.
Longer generated videos typically lose quality, color consistency, and content coherence over time. Previous models tackle this with complex techniques like self-forcing, where the model feeds its own output back as input during training to close the gap between training and inference. Helios skips all of that.
Three typical drifting patterns in long video generation: position shift, color shift, and restoration artifacts that show up as either noise or blur. | Image: Yuan et al.
Instead, the authors pinpoint three typical drifting patterns and propose simpler fixes. Relative position coding keeps the model from hitting unknown position indices in long videos, which would otherwise cause repetitive movements. A first-frame anchor holds the opening frame in memory at all times, giving the model a visual reference point to prevent color shifts. A targeted perturbation simulation during training makes the model more resilient to its own errors, which would otherwise snowball over time.
One model handles text, image, and video input
Helios uses a unified architecture that supports text-to-video, image-to-video, and video-to-video in a single framework. The model switches between tasks automatically based on what's in the previous context.
If the context is empty, the model generates from text. If only the last frame is present, it works as an image animator. If multiple frames are available, it continues an existing video. Users can also swap the text prompt mid-generation; a gradual crossfade between the old and new prompt helps prevent jarring visual breaks.
Helios' architecture compresses historical context across three time scales and automatically switches between text, image, and video input. | Image: Yuan et al.
The model was trained in three stages on 800,000 short video clips, each under ten seconds long. Resolution tops out at 384 x 640 pixels for now, and flicker artifacts still show up at segment transitions. Since no open benchmark exists for real-time long videos, the researchers built their own test dataset called HeliosBench with 240 prompts.
Aggressive compression slashes compute costs
Helios hits its speed targets without common acceleration tricks like KV cache, sparse attention, or quantization. Instead, the model aggressively compresses input data at two levels.
A hierarchical memory structure splits the video history into three time scales. Recent frames get lighter compression, while older frames get compressed more heavily. This cuts the number of tokens to process by a factor of eight.
A multi-stage sampling process reduces tokens for the video segment being generated by a factor of 2.29. Early steps run at lower resolution, and only later steps fill in fine details. Together, these two techniques bring compute costs down to roughly the same level as generating a single image.
As context length grows, the naive approach sees linear increases in token count, memory usage, and inference time, while Helios stays nearly flat. The standard approach runs out of memory at a context length of 6. | Image: Yuan et al.
A dedicated distillation technique also cuts the number of required computation steps per video segment from 50 to 3. Unlike previous approaches, Helios uses only real video data as context and generates just one segment per training step. An adversarial training objective similar to a GAN pushes quality beyond the limits of the teacher model.
Token compression lets Helios train through its first two stages on a single GPU. The third stage requires four complete models running simultaneously, but they fit into 80 GB of GPU memory thanks to various memory optimizations. Custom compute kernels for common operations speed up both training and inference by about 14 percent over the standard implementation, according to the researchers.
Helios is available as an open-weight model on GitHub and Hugging Face, which also hosts a live demo. Generated video examples are available on the project page. The project is strictly for research and isn't planned for integration into any Bytedance products.
Bytedance recently made waves with Seedance 2.0, a multimodal video generation model that processes images, videos, audio, and text all at once. Seedance needs significantly more compute and caps out at 15-second clips, but delivers much higher visual quality—so high that it raised alarms in Hollywood over its potential for massive copyright infringement.
関連記事
今日のまとめ
AI日報で今日の重要ニュースをまとめ読み