Multi-Block Diffusion Language Models
English summary
This paper proposes Multi-Block Diffusion Language Models (MBD-LMs), which extend block diffusion LMs to decode multiple consecutive blocks in parallel, adding inter-block parallelism. To align training with multi-block inference, the authors introduce Multi-block Teacher Forcing (MultiTF), which trains on bounded noise groups conditioned on clean prefixes, using randomized noise schedulers. A Block Buffer decoding algorithm preserves KV-cache reuse and static input shapes, so the added parallelism translates into wall-clock speedup. On MBD-LLaDA2-Mini, average tokens per forward pass (TPF) rise from 3.47 to 6.19 while accuracy rises from 79.95% to 81.03%. Combined with DMax, the model reaches 9.34 TPF with only a 1.02% accuracy drop on math and code benchmarks.
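As a quick check on the reported figures, the arithmetic below derives the relative TPF gains and the number of forward passes needed for a fixed output length. The 1,000-token length and the helper names are illustrative assumptions, not taken from the paper, and a TPF ratio is not the same thing as a measured wall-clock speedup.

```python
import math

# Figures reported in the summary above.
BASELINE_TPF = 3.47
MBD_TPF = 6.19
MBD_DMAX_TPF = 9.34

def tpf_gain(new_tpf, old_tpf):
    """Ratio of tokens per forward pass, new over old."""
    return new_tpf / old_tpf

def forward_passes(num_tokens, tpf):
    """Forward passes needed to emit num_tokens at a given average TPF."""
    return math.ceil(num_tokens / tpf)

mbd_gain = round(tpf_gain(MBD_TPF, BASELINE_TPF), 2)        # 1.78
dmax_gain = round(tpf_gain(MBD_DMAX_TPF, BASELINE_TPF), 2)  # 2.69
accuracy_delta = round(81.03 - 79.95, 2)                    # 1.08 points

# Hypothetical 1,000-token generation.
passes_baseline = forward_passes(1000, BASELINE_TPF)  # 289
passes_mbd = forward_passes(1000, MBD_TPF)            # 162
passes_dmax = forward_passes(1000, MBD_DMAX_TPF)      # 108
```

Under these assumptions, the multi-block model needs roughly 1.78 times fewer forward passes than the baseline, and about 2.69 times fewer when combined with DMax.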
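One way to picture the MultiTF training state is as a mask over blocks: a clean prefix followed by a bounded group of noised blocks, each with its own noise level. The sketch below is a guess at that shape only. The function name, the uniform sampling of group size and start, and the independent per-block mask ratios are all assumptions for illustration and may differ from the paper's actual procedure.

```python
import random

def sample_multitf_mask(num_blocks, block_size, max_group, rng):
    """Hypothetical sampler: returns (start, group, mask).

    mask[b][i] is True when token i of block b is noised (masked).
    Blocks after the noise group are left out of the example entirely.
    """
    # Bounded noise group: between 1 and max_group consecutive blocks.
    group = rng.randint(1, min(max_group, num_blocks))
    # Blocks before `start` form the clean prefix.
    start = rng.randint(0, num_blocks - group)
    mask = []
    for b in range(start + group):
        if b < start:
            mask.append([False] * block_size)  # clean prefix block
        else:
            # Assumption: each noised block draws its own mask ratio,
            # standing in for a randomized noise scheduler.
            ratio = rng.uniform(0.0, 1.0)
            mask.append([rng.random() < ratio for _ in range(block_size)])
    return start, group, mask

rng = random.Random(0)
start, group, mask = sample_multitf_mask(num_blocks=8, block_size=4, max_group=3, rng=rng)
```

The point of the sketch is the heterogeneity: several blocks are noised at once, at different levels, which is the kind of state a model sees when it decodes multiple blocks in parallel.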
Key points
Extends block diffusion LMs to multi-block decoding, enabling parallel processing of consecutive blocks for higher throughput.
Introduces Multi-block Teacher Forcing (MultiTF) to align training with the heterogeneous noise patterns of multi-block inference.
Designs a Block Buffer decoding algorithm that maintains KV-cache reuse and static input shapes, yielding wall-clock acceleration.
MBD-LLaDA2-Mini increases average tokens per forward pass (TPF) from 3.47 to 6.19 and accuracy from 79.95% to 81.03%; with DMax, TPF reaches 9.34 with only a 1.02% accuracy drop on math and code benchmarks.
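The Block Buffer idea can be illustrated with a toy simulation: a fixed number of blocks is decoded together, the front block is committed once it is fully unmasked, and a fresh masked block is appended so the input size never changes. This is a minimal sketch under stated assumptions, not the paper's algorithm. The `toy_unmask` stand-in for the model, the rule that the front block unmasks two tokens per pass and later blocks one, and the refill policy are all hypothetical.

```python
MASK = None  # placeholder for a still-masked position

def toy_unmask(block, n):
    """Hypothetical stand-in for the model: fill up to n masked slots in place."""
    filled = 0
    for i, tok in enumerate(block):
        if tok is MASK and filled < n:
            block[i] = "tok"
            filled += 1
    return filled

def block_buffer_decode(num_blocks, block_size=4, buffer_blocks=2):
    """Decode num_blocks blocks using a fixed-size buffer of active blocks."""
    committed = []  # finished blocks; a real model would keep their KV entries cached
    buffer = [[MASK] * block_size for _ in range(buffer_blocks)]
    forward_passes = 0
    input_sizes = set()  # distinct input lengths seen across passes
    while len(committed) < num_blocks:
        input_sizes.add(sum(len(b) for b in buffer))
        forward_passes += 1
        # Toy heterogeneous progress: the front block advances faster.
        for depth, block in enumerate(buffer):
            toy_unmask(block, n=max(1, 2 - depth))
        # Commit finished front blocks and refill to keep the shape static.
        while len(committed) < num_blocks and all(t is not MASK for t in buffer[0]):
            committed.append(buffer.pop(0))
            buffer.append([MASK] * block_size)
    return committed, forward_passes, input_sizes

blocks, passes_multi, sizes_multi = block_buffer_decode(4, buffer_blocks=2)
_, passes_single, sizes_single = block_buffer_decode(4, buffer_blocks=1)
```

In this toy setting, four blocks of four tokens take 6 forward passes with a two-block buffer versus 8 with a single block, and the input length stays at 8 tokens on every pass. The numbers illustrate the mechanism only and say nothing about the real model's TPF.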