Tutorials
Source: MARKTECHPOST
July 1, 2026
Importance: 5/5

NVIDIA Releases Nemotron-Labs-TwoTower: an Open-Weight Diffusion Language Model Built on a Frozen Autoregressive Nemotron-3-Nano-30B-A3B Backbone

English summary

NVIDIA released Nemotron-Labs-TwoTower, an open-weight discrete diffusion language model that pairs a frozen autoregressive (AR) context tower (Nemotron-3-Nano-30B-A3B) with a separately trained denoiser tower. The model retains 98.7% of the AR baseline's aggregate benchmark quality while delivering 2.42× higher wall-clock generation throughput (γ=0.8, block size S=16, on 2×H100). The denoiser was trained on ~2.1T tokens, roughly 8% of the backbone's 25T-token pretraining budget. A single checkpoint provides three generation modes: full mask diffusion, mock-AR, and standard AR decoding. The two towers are connected by layer-aligned cross-attention and Mamba-2 state seeding, which preserve the context representation across diffusion steps.
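To put the reported numbers in proportion, the following minimal Python sketch works through the arithmetic. The constants are the figures quoted in the summary; the helper names (`training_fraction`, `time_saved`) are illustrative and not part of any NVIDIA tooling.

```python
# Figures as quoted in the summary (approximate where the source says "~").
DENOISER_TOKENS_T = 2.1    # denoiser training tokens, in trillions
BACKBONE_TOKENS_T = 25.0   # backbone pretraining tokens, in trillions
QUALITY_RETAINED = 0.987   # share of the AR baseline's aggregate benchmark quality
THROUGHPUT_GAIN = 2.42     # wall-clock generation throughput multiplier


def training_fraction(denoiser_t: float, backbone_t: float) -> float:
    """Denoiser token budget as a share of the backbone's pretraining budget."""
    return denoiser_t / backbone_t


def time_saved(speedup: float) -> float:
    """Fraction of wall-clock time saved when throughput rises by `speedup`."""
    return 1.0 - 1.0 / speedup


frac = training_fraction(DENOISER_TOKENS_T, BACKBONE_TOKENS_T)
print(f"denoiser budget: {frac:.1%} of backbone pretraining")
print(f"quality given up: {1.0 - QUALITY_RETAINED:.1%}")
print(f"generation time saved: {time_saved(THROUGHPUT_GAIN):.1%}")
```

Under these figures the denoiser uses about 8.4% of the backbone's token budget, and a 2.42× throughput gain corresponds to generating the same number of tokens in roughly 41% of the time.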

Key points

Open-weight release under the NVIDIA Nemotron Open Model License.
Two-tower design: a frozen AR context tower plus a trained denoiser tower, built on the Nemotron-3-Nano-30B-A3B backbone (Mamba-2, self-attention, and MoE layers).
Retains 98.7% of the AR baseline's aggregate benchmark quality at 2.42× generation throughput (γ=0.8, block size S=16, 2×H100).
Denoiser trained on ~2.1T tokens; backbone pretrained on 25T tokens.
A single checkpoint supports three inference modes: mask diffusion, mock-AR, and standard AR decoding.
Layer-aligned cross-attention and Mamba-2 state seeding connect the towers; total ~60B parameters (~3B active per token per tower).
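The diffusion mode's speedup comes from committing several tokens of a block per denoising step instead of one token per step. The toy Python sketch below illustrates that idea only. It assumes, without confirmation from the source, that γ acts as a confidence threshold for accepting proposed tokens; `toy_denoiser` and `decode_block` are hypothetical stand-ins that emit random proposals, not the released model or its API.

```python
import random

MASK = None  # marker for a position that has not been decoded yet


def toy_denoiser(block, rng):
    """Hypothetical stand-in for a denoiser tower.

    Returns {position: (token_id, confidence)} for every masked position.
    A real model would derive these from logits; here they are random.
    """
    return {
        i: (rng.randrange(1000), rng.random())
        for i, tok in enumerate(block)
        if tok is MASK
    }


def decode_block(block_size=16, gamma=0.8, seed=0):
    """Unmask one block and return (tokens, number_of_denoising_steps)."""
    rng = random.Random(seed)
    block = [MASK] * block_size
    steps = 0
    while any(tok is MASK for tok in block):
        proposals = toy_denoiser(block, rng)
        # ASSUMPTION: gamma is a confidence threshold; commit every proposal that clears it.
        accepted = [i for i, (_, conf) in proposals.items() if conf >= gamma]
        if not accepted:
            # Commit the single most confident position so the loop always makes progress.
            accepted = [max(proposals, key=lambda i: proposals[i][1])]
        for i in accepted:
            block[i] = proposals[i][0]
        steps += 1
    return block, steps


tokens, steps = decode_block(block_size=16, gamma=0.8)
print(f"decoded {len(tokens)} tokens in {steps} denoising steps")
```

Because every step commits at least one position, the loop never needs more steps than the block size, and it needs fewer whenever several proposals clear the threshold at once. With random confidences the exact step count carries no meaning; the sketch only shows the control flow.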
