Variable-Width Transformers

English summary

This paper proposes the > <former, a transformer architecture whose early and late layers are wider than its middle layers, with a parameter-free residual resizing mechanism connecting layers of different widths. Across decoder-only language models from 200M to 2B dense parameters and 3B MoE parameters, the > <former consistently achieves lower language modeling loss than uniform-width baselines. Under loss-matched scaling, the architecture reduces overall FLOPs by 22% and KV cache memory and I/O cost by 15%. Analysis shows that the bottleneck structure produces qualitatively different representations in the residual stream, and the results indicate that nonuniform width allocation enables more resource-efficient scaling.

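Chinese summary (translated)

This paper proposes the > <former architecture, which allocates width nonuniformly within a transformer (wide early layers, wide late layers, and narrow middle layers) and connects layers of different widths through a parameter-free residual resizing mechanism. On autoregressive language models with 200M to 2B dense parameters and 3B mixture-of-experts (MoE) parameters, the > <former consistently outperforms uniform-width baselines on language modeling loss. Under loss-matched scaling, the architecture reduces total FLOPs by 22% and KV cache memory and I/O cost by 15%. Analysis shows that the bottleneck structure qualitatively changes the representations in the residual stream, indicating that nonuniform width allocation enables more resource-efficient scaling.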

Key points

Introduces the > <former, a variable-width transformer with wide early and late layers and narrow middle layers; a parameter-free residual resizing mechanism connects layers of different widths.
Achieves lower language modeling loss than uniform-width baselines across 200M–2B dense models and 3B MoE models.
Under loss-matched scaling, reduces FLOPs by 22% and KV cache memory by 15%.
The bottleneck structure yields residual stream representations that are qualitatively different from those of uniform-width models.
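
Illustrative sketch

The wide-narrow-wide layout and the parameter-free resizing described above can be made concrete with a small NumPy sketch. The summary does not say how the resizing is actually done, so everything here is an assumption for illustration: the helper names (width_schedule, resize_residual, toy_layer, forward), the example widths, and the resizing rule itself (truncate the residual stream to shrink it, zero-pad it to grow it), which was chosen only because it needs no learned weights. The paper's mechanism may differ.

```python
import numpy as np

def width_schedule(n_layers, d_wide, d_narrow, n_edge):
    """Hypothetical schedule: first and last n_edge layers are wide, the rest narrow."""
    return [d_wide if (i < n_edge or i >= n_layers - n_edge) else d_narrow
            for i in range(n_layers)]

def resize_residual(x, d_out):
    """Assumed parameter-free resize of the last axis:
    truncate to shrink, zero-pad to grow. No learned weights involved."""
    d_in = x.shape[-1]
    if d_out <= d_in:
        return x[..., :d_out]
    pad = [(0, 0)] * (x.ndim - 1) + [(0, d_out - d_in)]
    return np.pad(x, pad)  # default mode pads with zeros

def toy_layer(x, rng):
    """Stand-in for an attention/MLP block running at the layer's own width."""
    d = x.shape[-1]
    w = rng.standard_normal((d, d)) / np.sqrt(d)
    return x + np.tanh(x @ w)  # residual update

def forward(x, widths, seed=0):
    """Run the toy stack, resizing the residual stream before each layer."""
    rng = np.random.default_rng(seed)
    shapes = []
    for d in widths:
        x = resize_residual(x, d)
        x = toy_layer(x, rng)
        shapes.append(x.shape[-1])
    return x, shapes

# Example values are arbitrary, not taken from the paper.
widths = width_schedule(n_layers=8, d_wide=64, d_narrow=32, n_edge=2)
x = np.ones((2, 5, 64))  # (batch, sequence, width)
y, shapes = forward(x, widths)
print(widths)   # [64, 64, 32, 32, 32, 32, 64, 64]
print(y.shape)  # (2, 5, 64)
```

In this toy version the middle layers run at half the width, so their weight matrices and any per-layer cache would be correspondingly smaller; that is the kind of saving the FLOPs and KV cache figures above refer to, although the sketch makes no attempt to reproduce those numbers.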
