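Training Design for Text-to-Image Models: Lessons from Ablations
Hugging Face has published design guidelines and ablation results for efficiently training text-to-image models from scratch, outlining optimization strategies for compute-constrained settings.
Key points
Shifting the focus from architecture to training
After discussing the model architecture (PRX) in the previous post, this installment focuses on concrete techniques for making training more efficient, stabilizing convergence, and improving the quality of learned representations.
Validation in an experimental logbook format
Rather than exhaustively surveying existing "training tricks", the post is structured as an experimental log that reproduces or adapts them in a consistent setup and records their actual effect on optimization and convergence.
Evaluating techniques alone and in combination
Beyond reporting each technique in isolation, the post also explores synergies between techniques and which ones remain useful when combined.
Open training recipe to be released
The next post will publish the full training recipe as code and report the results of an end-to-end stress test, a "speedrun" that combines the best practices into a single configuration.
Establishing a clean baseline for the ablation study
A stable point of comparison is set using a standard Flow Matching setup (the PRX-1.2B model in Flux VAE latent space) that excludes compute-saving tricks and auxiliary objectives.
Identifying improvements and defining evaluation metrics
Every technique is compared against the baseline so that any change can be attributed to a specific training intervention, and is monitored with four practical metrics: FID, CMMD, DINO-MMD, and network throughput.
Improving training efficiency through representation alignment
Adding an auxiliary loss based on a strong, frozen vision encoder to diffusion and flow models speeds up early training and yields high-quality features even when compute is limited.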
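Impact analysis
This article is a reminder that, for large language models and image generation models alike, performance and efficiency depend not only on scaling up model size but also on how the training process is designed. In particular, by giving the open-source community reproducible experimental data, it offers developers concrete guidance for using compute efficiently and building their own foundation models.
Editorial comment
The article offers practical, highly instructive insights for practitioners on how to train high-performing models under compute constraints.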
Training Design for Text-to-Image Models: Lessons from Ablations
Welcome back! This is the second part of our series on training efficient text-to-image models from scratch.
In the first post of this series, we introduced our goal: training a competitive text-to-image foundation model entirely from scratch, in the open, and at scale. We focused primarily on architectural choices and motivated the core design decisions behind our model PRX. We also released an early, small (1.2B parameters) version of the model as a preview of what we are building (go try it if you haven't already 😉).
In this post, we shift our focus from architecture to training. The goal is to document what actually moved the needle for us when trying to make models train faster, converge more reliably, and learn better representations. The field is moving quickly and the list of “training tricks” keeps growing, so rather than attempting an exhaustive survey, we structured this as an experimental logbook: we reproduce (or adapt) a set of recent ideas, implement them in a consistent setup, and report how they affect optimization and convergence in practice. Finally, we do not only report these techniques in isolation; we also explore which ones remain useful when combined.
In the next post, we will publish the full training recipe as code, including the experiments in this post. We will also run and report on a public "speedrun" where we put the best pieces together into a single configuration and stress-test it end-to-end. This exercise will serve both as a stress test of our current training pipeline and as a concrete demonstration of how far careful training design can go under tight constraints. If you haven’t already, we invite you to join our Discord to continue the discussion. A significant part of this project has been shaped by exchanges with community members, and we place a high value on external feedback, ablations, and alternative interpretations of the results.
Before introducing any training-efficiency techniques, we first establish a clean reference run. This baseline is intentionally simple. It uses standard components, avoids auxiliary objectives, and does not rely on architectural shortcuts or tricks to save compute resources. Its role is to serve as a stable point of comparison for all subsequent experiments. Concretely, this is a pure Flow Matching (Lipman et al., 2022) training setup (as introduced in Part 1) with no extra objectives and no architectural speed hacks. We will use the small PRX-1.2B model we presented in the first post of this series (single stream architecture with global attention for the image tokens and text tokens) as our baseline and train it in Flux VAE latent space, keeping the configuration fixed across all comparisons unless stated otherwise.
The baseline training setup is as follows:
Public 1M synthetic images generated with MidJourneyV6
Global batch size
Positional encoding
This baseline configuration provides a transparent and reproducible anchor. It allows us to attribute observed improvements and regressions to specific training interventions, rather than to shifting hyperparameters or hidden setup changes. Throughout the remainder of this post, every technique is evaluated against this reference with a single guiding question in mind:
Does this modification improve convergence or training efficiency relative to the baseline?
Examples of baseline model generations after 100K training steps.
Benchmarking Metrics
To keep this post grounded, we rely on a small set of metrics to monitor checkpoints over time. None of them is a perfect proxy for perceived image quality, but together they provide a practical scoreboard while we iterate.
Fréchet Inception Distance (FID): (Heusel et al., 2017) Measures how close the distributions of generated and real images are, using Inception-v3 feature statistics (mean and covariance). Lower values typically correlate with higher sample fidelity.
CLIP Maximum Mean Discrepancy (CMMD): (Jayasumana et al., 2024) Measures the distance between real and generated image distributions using CLIP image embeddings and Maximum Mean Discrepancy (MMD). Unlike FID, CMMD does not assume Gaussian feature distributions and can be more sample-efficient; in practice it often tracks perceptual quality better than FID, though it is still an imperfect proxy.
DINOv2 Maximum Mean Discrepancy (DINO-MMD): Same MMD-based distance as CMMD, but computed on DINOv2 (Oquab et al., 2023) image embeddings instead of CLIP. This provides a complementary view of distribution shift under a self-supervised vision backbone.
Network throughput: Average number of samples processed per second (samples/s), as a measure of end-to-end training efficiency.
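To make the MMD-based metrics above more concrete, here is a minimal NumPy sketch of an unbiased squared-MMD estimate with a Gaussian RBF kernel. It is an illustration under stated assumptions, not the evaluation code used for this post: the function names, the bandwidth `sigma`, and the random arrays standing in for CLIP or DINOv2 embeddings are all hypothetical.

```python
import numpy as np


def rbf_kernel(a: np.ndarray, b: np.ndarray, sigma: float) -> np.ndarray:
    """Gaussian RBF kernel matrix between the rows of a (n, d) and b (m, d)."""
    sq_dists = (
        np.sum(a * a, axis=1)[:, None]
        + np.sum(b * b, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative values
    return np.exp(-sq_dists / (2.0 * sigma**2))


def mmd2(real: np.ndarray, fake: np.ndarray, sigma: float = 10.0) -> float:
    """Unbiased estimate of squared MMD between two sets of embeddings.

    sigma is a placeholder bandwidth chosen for this sketch.
    """
    n, m = len(real), len(fake)
    k_rr = rbf_kernel(real, real, sigma)
    k_ff = rbf_kernel(fake, fake, sigma)
    k_rf = rbf_kernel(real, fake, sigma)
    # Exclude the diagonal so each within-set average uses distinct pairs only.
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (n * (n - 1))
    term_ff = (k_ff.sum() - np.trace(k_ff)) / (m * (m - 1))
    return float(term_rr + term_ff - 2.0 * k_rf.mean())


# Random vectors stand in for image embeddings (hypothetical data).
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 16))
same = rng.normal(size=(200, 16))
shifted = rng.normal(loc=3.0, size=(200, 16))
print("same distribution:   ", mmd2(real, same))
print("shifted distribution:", mmd2(real, shifted))
```

Because the estimate is unbiased, it can come out slightly negative when the two sets are drawn from the same distribution; only the relative ordering between checkpoints matters for monitoring.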
With the scoreboard defined, we can now dive into the methods we explored, grouped into four buckets: Representation Alignment, Training Objectives, Token Routing and Sparsification, and Data.
Representation Alignment
Diffusion and flow models are typically trained with a single objective: predict a noise-like target (or vector field) from a corrupted input. Early in training, that one objective is doing two jobs at once: it must build a useful internal representation and learn to denoise on top of it. Representation alignment makes this explicit by keeping the denoising objective and adding an auxiliary loss that directly supervises intermediate features using a strong, frozen vision encoder. This tends to speed up early learning and bring the model’s features closer to those of modern self-supervised encoders. As a result, you often need less compute to hit the same quality.
A useful way to view it is to decompose the denoiser into an implicit encoder that produces intermediate hidden states, and a decoder that maps those states to the denoising target. The claim is that representation learning is the bottleneck: diffusion and flow transformers do learn discriminative features, but they lag behind foundation vision encoders when training is compute-limited. Therefore, borrowing a powerful representation space can make the denoising problem easier.
REPA (Yu et al., 2024)
REPA adds a representation matching term on top of the base flow-matching objective. Let $x_0 \sim p_{\text{data}}$ be a clean sample and $x_1 \sim p_{\text{prior}}$ be the noise sample. The model is trained on an interpolated state $x_t$ (for $t \in [0,1]$) and predicts a vector field $v_\theta(x_t, t)$. In REPA, a pretrained vision encoder $f$ processes the clean sample $x_0$ to produce patch embeddings $y_0 = f(x_0) \in \mathbb{R}^{N \times D}$, where $N$ is the number of patch tokens and $D$ is the teacher embedding dimension. In parallel, the denoiser processes $x_t$ and produces intermediate hidden tokens $h_t$ (one token per patch). A small projection head $h_\phi$ maps these student hidden tokens into the teacher embedding space, and an auxiliary loss maximizes patch-wise similarity between corresponding teacher and student tokens:
$$\mathcal{L}_{\text{REPA}}(\theta,\phi) = -\mathbb{E}_{x_0,x_1,t}\Big[\frac{1}{N}\sum_{n=1}^{N} \text{sim}\big(y_{0,[n]},\, h_\phi(h_{t,[n]})\big)\Big]$$
Here $n \in \{1,\dots,N\}$ indexes patch tokens, $y_{0,[n]}$ is the teacher embedding for patch $n$, $h_{t,[n]}$ is the corresponding student hidden token at time $t$, and $\text{sim}(\cdot,\cdot)$ is typically cosine similarity.
This term is combined with the main flow-matching loss:
$$\mathcal{L} = \mathcal{L}_{\text{FM}} + \lambda\,\mathcal{L}_{\text{REPA}}$$
with $\lambda$ controlling the trade-off.
In practice, the student is trained to produce noise-robust, data-consistent patch representations from $x_t$, so later layers can focus on predicting the vector field and generating details rather than rediscovering a semantic scaffold from scratch.
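The REPA term can be sketched in a few lines of NumPy. This is a forward-pass illustration only, under several assumptions: the projection head is reduced to a single linear map (`proj_weight`), the function names are hypothetical, and the default `lam` is a placeholder rather than the value used in our runs.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Cosine similarity along the last axis."""
    a_unit = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b_unit = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return np.sum(a_unit * b_unit, axis=-1)


def repa_loss(teacher_tokens: np.ndarray,
              student_tokens: np.ndarray,
              proj_weight: np.ndarray) -> float:
    """Negative mean patch-wise cosine similarity.

    teacher_tokens: (B, N, D) frozen-encoder patch embeddings of the clean image.
    student_tokens: (B, N, H) intermediate hidden tokens of the denoiser.
    proj_weight:    (H, D) linear stand-in for the projection head (assumption).
    """
    projected = student_tokens @ proj_weight  # (B, N, D)
    return float(-cosine_similarity(teacher_tokens, projected).mean())


def total_loss(fm_loss: float, repa: float, lam: float = 0.5) -> float:
    """Flow-matching loss plus the weighted alignment term (lam is a placeholder)."""
    return fm_loss + lam * repa


rng = np.random.default_rng(0)
teacher = rng.normal(size=(2, 16, 8))    # hypothetical teacher embeddings
student = rng.normal(size=(2, 16, 12))   # hypothetical student hidden tokens
weight = rng.normal(size=(12, 8))
print(total_loss(fm_loss=1.0, repa=repa_loss(teacher, student, weight)))
```

The alignment term is bounded between -1 (perfect alignment) and 1, so `lam` directly sets how much it can move the total loss relative to the flow-matching term.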
What we observed
We ran REPA on top of our baseline PRX training, using two frozen teachers: DINOv2 and DINOv3 (Siméoni et al., 2025). The pattern was very consistent: adding alignment improves quality metrics, and the stronger teacher helps more, at the cost of a bit of speed.

On the quality metrics, both teachers improve over the baseline. The effect is strongest with DINOv3, which achieves the best overall numbers in this run.
REPA is not free: we pay for an extra frozen teacher forward and the patch-level similarity loss, which shows up as a throughput drop from 3.95 batches/s to 3.66 (DINOv2) or 3.46 (DINOv3). In other words, DINOv3 prioritizes maximum representation quality at the cost of slower training, while DINOv2 offers a more efficient tradeoff, still delivering substantial gains with a smaller slowdown.
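One way to read these throughput numbers is as a break-even point: a slower variant still wins in wall-clock time if it needs proportionally fewer steps. The short sketch below computes that fraction from the figures quoted above; the helper name is hypothetical, and the comparison assumes the same batch size in every run.

```python
def relative_slowdown(baseline: float, variant: float) -> float:
    """Fractional throughput loss of a variant relative to the baseline."""
    return (baseline - variant) / baseline


baseline = 3.95  # batches/s for the baseline run, as reported above
variants = {"REPA + DINOv2": 3.66, "REPA + DINOv3": 3.46}

for name, throughput in variants.items():
    slowdown = relative_slowdown(baseline, throughput)
    # Wall-clock time is steps / throughput, so the variant breaks even when it
    # reaches the target quality in `slowdown` fewer steps than the baseline.
    print(f"{name}: {slowdown:.1%} lower throughput")
# prints:
# REPA + DINOv2: 7.3% lower throughput
# REPA + DINOv3: 12.4% lower throughput
```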
Our practical takeaway is that REPA is a strong lever for text-to-image training. In our setup, the throughput trade-off is real and the net speedup (time required to reach a given level of image quality) felt a bit less dramatic than what the authors of the paper report on ImageNet-style, class-conditioned generation. That said, the quality gains are still clearly significant. Qualitatively, we also saw the difference early: after ~100K steps, samples trained with alignment tended to lock in cleaner global structure and more coherent layouts, which makes it easy to see why REPA (and alignment variants more broadly) have become a go-to ingredient in modern T2I training recipes.



iREPA (Singh et al., 2025)
A natural follow-up to REPA is: what exactly should we be aligning? iREPA argues that the answer is spatial structure, not global semantics. Across a large sweep of 27 vision encoders, the authors find that ImageNet-style “global” quality (e.g., linear-probe accuracy on patch tokens) is only weakly predictive of downstream generation quality under REPA, while simple measures of patch-token spatial self-similarity correlate much more strongly with FID. Based on that diagnosis, iREPA makes two tiny but targeted changes to the REPA recipe to better preserve and transfer spatial information:
Replace the usual MLP projection head with a lightweight 3×3 convolutional projection operating on the patch grid.
Apply a spatial normalization to teacher patch tokens that removes a global overlay (mean across spatial locations) to increase local contrast.
Despite representing “less than 4 lines of code”, these tweaks consistently speed up convergence and improve quality across encoders, model sizes, and even REPA-adjacent training recipes.
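As a rough illustration of the second change, the sketch below removes the component shared by all patches of an image by subtracting the mean over spatial locations. It is a simplified reading of the description above, not the iREPA authors' code: the function name is hypothetical, any additional scaling the paper may apply is omitted, and the convolutional projection head is not shown.

```python
import numpy as np


def spatial_normalize(teacher_tokens: np.ndarray) -> np.ndarray:
    """Subtract the per-image mean over patch positions.

    teacher_tokens: (B, N, D) patch embeddings from the frozen teacher.
    The result keeps only how each patch differs from the image-wide average.
    """
    global_component = teacher_tokens.mean(axis=1, keepdims=True)  # (B, 1, D)
    return teacher_tokens - global_component


rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 16, 8))  # hypothetical teacher embeddings
normalized = spatial_normalize(tokens)
print(np.abs(normalized.mean(axis=1)).max())  # close to zero
```

A constant vector added to every patch of an image leaves the output unchanged, which is the sense in which the global overlay is removed and local contrast is emphasized.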
What we observed
In our setup, we observed a similar kind of boost when applying the iREPA spatial tweaks on top of DINOv2: convergence was a bit smoother and the metrics improved more steadily over the first 100K steps. Interestingly, the same changes did not transfer as cleanly when applied on top of a DINOv3 teacher and they tended to degrade performance rather than help. We do not want to over-interpret that result: this could easily be an interaction with our specific architecture, resolution/patching, loss weighting, or even small implementation details. Still, given this inconsistency across teachers, we will likely not include these tweaks in our default recipe, even if they remain an interesting option to revisit when tuning for a specific setup.

About Using REPA During the Full Training:
The paper REPA Works Until It Doesn't: Early-Stopped, Holistic Alignment Supercharges Diffusion Training (Wang et al., 2025) highlights a key caveat: REPA is a powerful early accelerator, but it can plateau or even become a brake later in training. The authors describe a capacity mismatch. Once the generative model starts fitting the full data distribution (especially high-frequency details), forcing it to stay close to a frozen recognition encoder’s lower-dimensional embedding manifold becomes constraining. Their practical takeaway is simple: keep alignment for the “burn-in” phase, then turn it off with a stage-wise schedule.
We observed the same qualitative pattern in our own runs. When training our preview model, removing REPA after ~200K steps noticeably improved the overall feel of image quality: textures, micro-contrast, and fine detail continued to sharpen instead of looking slightly muted. For that reason, we also recommend treating representation alignment as a transient scaffold: use it to get fast early progress, then drop it once the model’s own generative features have caught up.
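A stage-wise schedule of this kind can be as simple as a step function on the alignment weight. The sketch below is one possible form, not our exact implementation: the function name and `base_weight` are hypothetical, and the cutoff reuses the ~200K-step figure mentioned above purely as an example.

```python
def repa_weight(step: int, base_weight: float = 0.5, cutoff_step: int = 200_000) -> float:
    """Alignment weight: constant during burn-in, zero once the cutoff is reached."""
    return base_weight if step < cutoff_step else 0.0


# The total loss at a given step would then be fm_loss + repa_weight(step) * repa_loss.
for step in (0, 100_000, 199_999, 200_000, 300_000):
    print(step, repa_weight(step))
```

Once the weight is zero, the frozen teacher forward pass can be skipped entirely, which also recovers the baseline throughput for the rest of training.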
Alignment in the Token Latent Space
So far, “alignment” meant regularizing the generator’s internal features against a frozen teacher while treating the tokenizer / latent space as fixed. A more direct lever is to shape the latent space itself so the representation presented to the flow backbone is intrinsically easier to model, without sacrificing the reconstruction fidelity needed for editing and downstream workflows.
REPA-E (Leng et al., 2025) makes this concrete. Its starting point is a failure mode: if you simply backprop the diffusion / flow loss into the VAE, the tokenizer quickly learns a pathologically easy latent for the denoiser, which can even degrade final generation quality. REPA-E’s fix is a two-signal training recipe:
keep the diffusion loss, but apply a stop-gradient so it only updates the latent diffusion model (not the VAE);
update both the VAE and the diffusion model using an end-to-end REPA alignment loss.
Thanks to these two tricks, the tokenizer is explicitly optimized to produce latents that yield higher alignment and empirically better generations.
In parallel, Black Forest Labs’ FLUX.2 AE work frames latent design as a trade-off between learnability, quality, and compression. Their core argument is that improving learnability requires injecting semantic structure into the representation, rather than treating the tokenizer as a pure compression module. This motivates retraining the latent space to explicitly target “better learnability and higher image quality at the same time”. They do not share the full recipe, but they do clearly state the key idea: make the AE’s latent space more learnable by adding semantic or representation alignment, and explicitly point to REPA-style alignment with a frozen vision encoder as the mechanism they build on and integrate into the FLUX.2 AE.
What we observed
To probe alignment in the latent space, we compared two pretrained autoencoders as drop-in tokenizers for the same flow backbone: a REPA-E-VAE (where we do add the REPA alignment objective, as in the paper) and the Flux2-AE (where we do not add REPA, following their recommendation). The results were, honestly, extremely impressive, both quantitatively and qualitatively. In samples, the gap is immediately visible: generations show more coherent global structure and cleaner layouts, with far fewer “early training” artifacts.

A first striking point is that both latent-space interventions lower the FID by ~6 points (18.20 to ~12.08), which is a much larger jump than what we typically get from “just” aligning intermediate features. This strongly supports the core idea: if the tokenizer produces a representation that is intrinsically more learnable, the flow model benefits everywhere.
The two AEs then behave quite differently in the details. Flux2-AE dominates most metrics (very low CMMD and DINO-MMD), but it comes with a huge throughput penalty: batches/sec drops from 3.95 to 1.79. In our case this slowdown is explained by practical factors they also emphasize: the model is simply heavier, and it also produces a larger latent (32 channels), which increases the amount of work the diffusion backbone has to do per step.
REPA-E-VAE is the “balanced” option: it reaches essentially the same FID as Flux2-AE while keeping throughput much closer to the baseline (3.39 batches/sec).



Training Objectives: Beyond Vanilla Flow Matching
Architecture gets you capacity, but the training objective is what decides how that capacity is used. In practice, small changes to the loss often have outsized effects on convergence speed, conditional fidelity, and how quickly a model “locks in” global structure. In the sections below, we will go through the objectives we tested on top of our baseline rectified flow setup, starting with a simple but surprisingly effective modification: Contrastive Flow Matching.
Contrastive Flow Matching (Stoica et al., 2025)
Flow matching has a nice property in the unconditional case: trajectories are implicitly encouraged to be unique (flows should not intersect). But once we move to conditional generation (class- or text-conditioned), different conditions can still induce overlapping flows, which empirically shows up as “averaging” behavior: weaker conditional specificity, and muddier global structure. Contrastive flow matching addresses this directly by adding a contrastive term that pushes conditional flows away from other flows in the batch.
For a given training triplet $(x, y, \varepsilon)$, standard conditional flow matching trains the model velocity $v_\theta(x_t,t,y)$ to match the target transport direction. Contrastive flow matching keeps that positive term, but additionally samples a negative pair $(\tilde{x}, \tilde{y}, \tilde{\varepsilon})$ from the batch and penalizes the model if its predicted flow is also compatible with that other trajectory. In the paper’s notation, this becomes:
$$\mathcal{L}_{\Delta \text{FM}}(\theta) = \mathbb{E}\Big[ \|v_\theta(x_t,t,y)-(\dot{\alpha}_t x+\dot{\sigma}_t\varepsilon)\|^2 \;-\; \lambda \|v_\theta(x_t,t,y)-(\dot{\alpha}_t \tilde{x}+\dot{\sigma}_t\tilde{\varepsilon})\|^2 \Big]$$
where $\lambda\in[0,1)$ controls the strength of the “push-away” term. Intuitively: match your own trajectory, and be incompatible with someone else’s.
The authors show that contrastive flow matching produces more discriminative trajectories and that this translates into both quality and efficiency gains: faster convergence (reported up to 9× fewer training iterations to reach similar FID) and fewer sampling steps (reported up to 5× fewer denoising steps) on ImageNet (Deng et al., 2009) and CC3M (Sharma et al., 2018) experiments.
A key advantage is that the objective is almost a drop-in replacement: you keep the usual flow-matching loss, then add a single contrastive “push-away” term that uses other samples in the same batch as negatives. This provides the extra supervision without introducing additional model passes.
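The drop-in nature of the objective is easy to see in code. The NumPy sketch below evaluates the loss for a batch of predicted velocities and targets; it is an illustration under assumptions, not the paper's implementation. Negatives are taken by rolling the batch by one position (a deterministic stand-in for random in-batch sampling), the function name is hypothetical, and the default `lam` is an illustrative placeholder.

```python
import numpy as np


def contrastive_fm_loss(v_pred: np.ndarray, target: np.ndarray, lam: float = 0.05) -> float:
    """Flow-matching loss with an in-batch "push-away" term.

    v_pred, target: (B, ...) predicted velocities and their own targets.
    Requires B >= 2 so that every sample has a distinct negative.
    """
    batch = v_pred.shape[0]
    pred = v_pred.reshape(batch, -1)
    own_target = target.reshape(batch, -1)
    # Pair each sample with the target of its neighbor in the batch.
    other_target = np.roll(own_target, shift=1, axis=0)
    positive = np.sum((pred - own_target) ** 2, axis=1)
    negative = np.sum((pred - other_target) ** 2, axis=1)
    return float(np.mean(positive - lam * negative))


rng = np.random.default_rng(0)
target = rng.normal(size=(4, 3, 2, 2))  # hypothetical velocity targets
v_pred = rng.normal(size=(4, 3, 2, 2))  # hypothetical model outputs
print(contrastive_fm_loss(v_pred, target, lam=0.0))   # plain flow matching
print(contrastive_fm_loss(v_pred, target, lam=0.05))  # with push-away term
```

Setting `lam=0.0` recovers the standard flow-matching loss, and the negative term reuses predictions that are already computed, so no extra forward pass is needed.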
What we observed

On this run, contrastive flow matching yields a small but measurable improvement on the representation-driven metrics: CMMD goes from 0.41 → 0.40 and DINO-MMD from 0.39 → 0.36. The magnitude of the gain is smaller than what the paper reports on ImageNet, which is not too surprising: text conditioning is much more complex than discrete classes, and the training data distribution is likely less “separable” than ImageNet, making the contrastive signal harder to exploit.
We do not see an improvement in FID in this specific experiment (it slightly worsens), but the throughput cost is negligible in practice (3.95 → 3.75 batches/sec). Given the simplicity of the change and the consistent movement in the right direction for the conditioning/representation metrics, we will likely still keep contrastive flow matching in our training pipeline as a low-cost regularizer.
JiT (Li and He, 2025)
Back to Basics: Let Denoising Generative Models Denoise is probably one of our favorite recent papers in the diffusion space because it is not a new trick but a reset: stop asking the network to predict off-manifold quantities (noise or velocity) and just let it denoise. Most modern diffusion and flow models train the network to predict noise $\varepsilon$ or a mixed quantity like velocity $v$. Under the manifold assumption, natural images live on a low-dimensional manifold, while $\varepsilon$ and $v$ are inherently off-manifold, so predicting them can be a harder learning problem than it looks.
The authors frame the problem with the standard linear interpolation between the clean image $x$ and the noise $\varepsilon$:
$$z_t = t\,x + (1-t)\,\varepsilon,$$
and the corresponding flow velocity:
$$v = \frac{d z_t}{dt} = x - \varepsilon.$$
Instead of outputting $v_\theta$ directly, the model predicts a clean image estimate:
$$x_\theta(z_t,t) := \mathrm{net}_\theta(z_t,t),$$
and we convert it to a velocity prediction via:
$$v_\theta(z_t,t) = \frac{x_\theta(z_t,t) - z_t}{1-t}.$$
Then we can keep the exact same flow-style objective in v-space:
$$\mathcal{L}_{v} = \mathbb{E}_{t,x,\varepsilon}\left[\left\|v_\theta(z_t,t) - v\right\|_2^2\right] \quad\text{with}\quad v = x-\varepsilon.$$
This formulation makes the learning problem substantially easier in high dimensions: instead of predicting noise or velocity (which are essentially unconstrained in pixel space), the network predicts the clean image $x$, i.e., something that lies on the data manifold. In practice, this makes it feasible to train large-patch Transformers directly on pixels without a VAE or tokenizer while keeping optimization stable and the total number of tokens manageable.
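The conversion from an x-prediction to a velocity can be checked numerically. The NumPy sketch below follows the interpolation convention above on flattened inputs; the function names are hypothetical, and the clamp on $1-t$ is a numerical safeguard assumed for this sketch (the division blows up as $t \to 1$), with an illustrative threshold.

```python
import numpy as np


def interpolate(x: np.ndarray, eps: np.ndarray, t: np.ndarray) -> np.ndarray:
    """z_t = t * x + (1 - t) * eps, with t broadcast as (B, 1)."""
    return t * x + (1.0 - t) * eps


def velocity_from_x_pred(x_pred: np.ndarray, z_t: np.ndarray, t: np.ndarray,
                         min_denominator: float = 0.05) -> np.ndarray:
    """Convert a clean-image estimate into a velocity estimate."""
    denominator = np.maximum(1.0 - t, min_denominator)  # assumed safeguard
    return (x_pred - z_t) / denominator


def x_pred_flow_loss(x_pred: np.ndarray, x: np.ndarray,
                     eps: np.ndarray, t: np.ndarray) -> float:
    """Squared error in v-space for a network that outputs x. Inputs are (B, D)."""
    z_t = interpolate(x, eps, t)
    v_pred = velocity_from_x_pred(x_pred, z_t, t)
    v_target = x - eps
    return float(np.mean(np.sum((v_pred - v_target) ** 2, axis=-1)))


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6))      # hypothetical clean images, flattened
eps = rng.normal(size=(4, 6))    # noise samples
t = rng.uniform(0.0, 0.9, size=(4, 1))
print(x_pred_flow_loss(x, x, eps, t))                 # perfect x-prediction
print(x_pred_flow_loss(np.zeros_like(x), x, eps, t))  # poor x-prediction
```

With a perfect estimate $x_\theta = x$, the numerator is $x - z_t = (1-t)(x-\varepsilon)$, so the converted velocity equals $x - \varepsilon$ and the loss vanishes.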
What we observed
We first evaluated x-prediction in the same setting as the rest of our objective experiments, namely training in the FLUX latent space at 256×256 resolution.

In this regime, the benefit of x-prediction is unclear. While FID improves slightly compared to the baseline, both CMMD and DINO-MMD degrade noticeably, and throughput is unchanged. This suggests that, when working in an already well-structured latent space, predicting clean images instead of velocity does not consistently dominate the baseline objective, and can even hurt representation-level alignment.
That said, this experiment is not where x-prediction really shines.
The exciting part is that x-prediction stabilizes high-dimensional training, making it feasible to use larger patches and denoise directly in pixel space, without a VAE, at much higher resolutions. Using JiT, we trained a model directly on 1024×1024 images with 32×32 patches, instead of operating in a compressed latent space. Despite the much higher resolution and the absence of a tokenizer, optimization remained stable and fast. We reached FID 17.42, DINO-MMD 0.56, and CMMD 0.71 with a throughput of 1.33 batches/sec.
These results are remarkable: training directly on 1024×1024 images is only about 3× slower than training in a 256×256 latent space.
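As a quick sanity check on these figures, the slowdown factor and the pixel-space sequence length can be worked out directly. The ratio uses the baseline throughput of 3.95 batches/sec reported earlier and assumes both runs use the same batch size, which is our reading of the numbers rather than something stated explicitly.

```latex
\frac{3.95\ \text{batches/s}}{1.33\ \text{batches/s}} \approx 2.97,
\qquad
\left(\frac{1024}{32}\right)^{2} = 32^{2} = 1024\ \text{patch tokens per image}
```

So even at 1024×1024 in pixel space, 32×32 patches keep the sequence at about a thousand tokens per image.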