NVIDIA Releases Nemotron-Labs-TwoTower: an Open-Weight Diffusion Language Model Built on a Frozen Autoregressive Nemotron-3-Nano-30B-A3B Backbone
English summary
NVIDIA released Nemotron-Labs-TwoTower under open weights, a discrete diffusion language model that uses a frozen autoregressive context tower (Nemotron-3-Nano-30B-A3B) and a separately trained denoiser tower. The model retains 98.7% of the AR baseline's aggregate benchmark quality while delivering 2.42× higher wall-clock generation throughput (γ=0.8, block size 16 on 2×H100). The denoiser was trained on ~2.1T tokens, only a fraction of the backbone’s 25T-token pretraining. A single checkpoint provides three generation modes: full mask diffusion, mock-AR, and standard AR decoding. The two-tower architecture uses layer-aligned cross-attention and Mamba-2 state seeding to preserve context representation across diffusion steps.
Chinese summary
NVIDIA 以开放权重发布了 Nemotron-Labs-TwoTower,一个离散扩散语言模型。它采用冻结的自回归上下文塔(基于 Nemotron-3-Nano-30B-A3B)和单独训练的去噪塔。该模型保留了 AR 基线 98.7% 的综合基准质量,同时实现了 2.42 倍的生成吞吐量(γ=0.8,块大小 16,2×H100)。去噪塔在约 2.1T 个 token 上训练,远少于骨干的 25T。单一检查点支持三种生成模式:全量掩码扩散、模拟 AR 和标准 AR 解码。双塔架构通过逐层交叉注意力和 Mamba-2 状态播种,在扩散步骤间保持上下文表示的一致性。
Key points
Open-weight release under NVIDIA Nemotron Open Model License.
以 NVIDIA Nemotron 开放模型许可证发布开放权重。
Two-tower design: frozen AR context tower + trained denoiser tower, built on Nemotron-3-Nano-30B-A3B backbone (Mamba-2, self-attention, MoE layers).
双塔设计:冻结的自回归上下文塔 + 训练好的去噪塔,基于 Nemotron-3-Nano-30B-A3B 骨干(Mamba-2、自注意力、MoE 层)。
Retains 98.7% of AR baseline aggregate benchmark quality at 2.42× generation throughput (γ=0.8, S=16, 2×H100).
在 2.42 倍生成吞吐量下(γ=0.8,块大小 16,2×H100),保留了 AR 基线 98.7% 的综合基准质量。
Denoiser trained on ~2.1T tokens; backbone pretrained on 25T tokens.
去噪塔在约 2.1T tokens 上训练,骨干在 25T tokens 上预训练。
Single checkpoint supports three inference modes: diffusion, mock-AR, and standard AR decoding.
单一检查点支持三种推理模式:扩散、模拟 AR 和标准 AR 解码。
Layer-aligned cross-attention and Mamba-2 state seeding connect the towers; total ~60B parameters (~3B active per token per tower).
通过逐层交叉注意力和 Mamba-2 状态播种连接双塔;总参数约 60B(每个 token 每塔激活约 3B)。