Representation Distribution Matching for One-Step Visual Generation
Abstract
The paper formalizes Representation Distribution Matching (RDM) as a paradigm for one-step image generation and analyzes its two design axes: the distribution comparison method and the representation space. The authors find that classical MMD becomes a strong, scalable objective when estimated with large batches (>2048), and that any single representation can be gamed, which motivates matching against a battery of encoders and evaluating with the SW_r14 metric. Their improved RDM (iRDM) sets a new one-step state of the art on ImageNet (SW_r14 = 1.30) and is preferred by PickScore over the prior best one-step generator on 71.2% of samples. The same recipe also post-trains the four-step FLUX.2 into a one-step generator that surpasses the four-step version on GenEval (0.826 vs 0.794) and PickScore (22.76 vs 22.58) in only 90 H200 GPU-hours.
Key points
Classical maximum mean discrepancy (MMD) becomes a strong, scalable one-step training objective when estimated with large batches; the optimum batch size is above 2048.
Any single representation can be gamed, so the method matches against a balanced battery of encoders and evaluates with SW_r14, a sliced Wasserstein distance over 14 encoders that resists gaming.
iRDM sets a new one-step state of the art on ImageNet (SW_r14 = 1.30) and is preferred by PickScore, a proxy for human preference, over the prior best one-step generator on 71.2% of samples.
Post-training the four-step FLUX.2 with the same recipe yields a one-step generator that surpasses the four-step version on GenEval (0.826 vs 0.794) and PickScore (22.76 vs 22.58) in only 90 H200 GPU-hours.
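The two distribution comparisons named in the key points, MMD and the sliced Wasserstein distance, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the summary does not state which kernel or estimator the MMD objective uses, nor how SW_r14 aggregates over its 14 encoders. The function names `rbf_mmd2` and `sliced_wasserstein`, the RBF kernel, the `bandwidth` parameter, and the number of random projections are all assumptions made for illustration, and the input arrays stand in for features produced by a single hypothetical encoder.

```python
import numpy as np


def rbf_mmd2(x, y, bandwidth=1.0):
    """Unbiased estimate of squared MMD between x (n, d) and y (m, d).

    Assumes an RBF kernel; the kernel choice is illustrative only.
    """
    def sq_dists(a, b):
        # Pairwise squared Euclidean distances, clipped at 0 for stability.
        d = (a * a).sum(1)[:, None] + (b * b).sum(1)[None, :] - 2.0 * a @ b.T
        return np.maximum(d, 0.0)

    gamma = 1.0 / (2.0 * bandwidth ** 2)
    k_xx = np.exp(-gamma * sq_dists(x, x))
    k_yy = np.exp(-gamma * sq_dists(y, y))
    k_xy = np.exp(-gamma * sq_dists(x, y))
    n, m = len(x), len(y)
    # Exclude diagonal (self-similarity) terms from the within-set averages.
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (n * (n - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (m * (m - 1))
    return float(term_xx + term_yy - 2.0 * k_xy.mean())


def sliced_wasserstein(x, y, n_projections=128, seed=0):
    """Monte Carlo sliced 2-Wasserstein distance for equal-sized sample sets."""
    assert x.shape == y.shape, "this sketch requires equal sample counts"
    rng = np.random.default_rng(seed)
    # Random unit directions, one per column.
    dirs = rng.normal(size=(x.shape[1], n_projections))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
    # In 1-D, sorting both projected samples gives the optimal coupling.
    px = np.sort(x @ dirs, axis=0)
    py = np.sort(y @ dirs, axis=0)
    return float(np.sqrt(np.mean((px - py) ** 2)))
```

Both functions return a value near zero when the two sample sets come from the same distribution and a larger value when they differ. Because the MMD estimate averages over all sample pairs, its variance shrinks as the batch grows, which is one plausible reading of why large batches help. A battery-style metric in the spirit of SW_r14 would apply such a distance to the features of several encoders and combine the results, though the exact aggregation is not given in this summary.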