Morphing into Hybrid Attention Models
Abstract
The paper proposes FlashMorph, a method that converts standard Transformers into hybrid attention models by formulating layer selection as a budget-constrained optimization problem. It uses morphable models and linearization regularization to decide which layers keep full attention and which switch to linear attention, taking global interdependencies between layers into account. The approach outperforms heuristic selection strategies and discovers efficient configurations that maintain strong long-context recall and overall performance. FlashMorph also reduces the computational cost of the layer selection process itself, which makes the selection scalable.
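To make the budget-constrained formulation concrete, here is a minimal sketch of the objective on a four-layer toy model. Everything in it is an assumption made for illustration: the penalty values, the pairwise interaction term, and the function names `toy_loss`, `select_jointly`, and `select_per_layer` are hypothetical and do not come from the paper. The brute-force search only shows what is being optimized; the summary states that FlashMorph avoids this kind of costly search.

```python
from itertools import combinations

# Hypothetical toy numbers, not taken from the paper.
# PENALTY[i]: loss increase when layer i alone is switched to linear attention.
PENALTY = [5, 1, 4, 2]
# Extra loss when both layers of a pair are linear (a stand-in for interdependence).
PAIR_PENALTY = {(1, 3): 10}
NUM_LAYERS = len(PENALTY)


def toy_loss(full_layers):
    """Loss of a configuration, given the set of layers kept as full attention."""
    linear = {i for i in range(NUM_LAYERS) if i not in full_layers}
    loss = sum(PENALTY[i] for i in linear)
    for (i, j), extra in PAIR_PENALTY.items():
        if i in linear and j in linear:
            loss += extra
    return loss


def select_jointly(budget):
    """Brute-force search over all configurations with at most `budget` full layers."""
    candidates = (
        frozenset(subset)
        for k in range(budget + 1)
        for subset in combinations(range(NUM_LAYERS), k)
    )
    return min(candidates, key=toy_loss)


def select_per_layer(budget):
    """Heuristic baseline: rank layers independently by their individual penalty."""
    ranked = sorted(range(NUM_LAYERS), key=lambda i: PENALTY[i], reverse=True)
    return frozenset(ranked[:budget])


joint = select_jointly(budget=2)
heuristic = select_per_layer(budget=2)
print(sorted(joint), toy_loss(joint))
print(sorted(heuristic), toy_loss(heuristic))
```

In this toy setup the joint search keeps layers 0 and 3 as full attention (loss 5), while the per-layer ranking keeps layers 0 and 2 (loss 13), because it cannot see the interaction between layers 1 and 3.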
Key Points
FlashMorph formulates hybrid layer selection as a budget-constrained optimization, explicitly modeling global interdependencies between layers.
It uses morphable models and linearization regularization to efficiently determine which layers should become linear attention versus full attention.
Experiments show FlashMorph finds more computationally efficient hybrid configurations while preserving long-context recall and overall model performance.
The method reduces the cost of the layer selection process itself, unlike prior heuristic approaches that require extensive trial-and-error.
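The summary does not spell out how morphable models or linearization regularization are defined, so the following is only one plausible reading, sketched under stated assumptions: each layer gets a continuous gate between 0 (linear attention) and 1 (full attention), a penalty on the gates pushes layers toward linear attention, and the layers with the largest surviving gates keep full attention within the budget. The per-layer `sensitivity` values, the quadratic toy objective, and the function `relax_and_select` are hypothetical. The toy objective is separable per layer for brevity; in a real model the gates would be coupled through the shared task loss.

```python
def relax_and_select(sensitivity, budget, reg_strength=0.5, lr=0.1, steps=200):
    """Optimize per-layer gates on a toy objective, then keep the top `budget` layers.

    sensitivity[i] is a hypothetical measure of how much the loss grows when
    layer i is fully linearized. Gate 1.0 means full attention, 0.0 means linear.
    """
    gates = [1.0] * len(sensitivity)  # start from the original full-attention model
    for _ in range(steps):
        for i, s in enumerate(sensitivity):
            # Toy per-layer objective: s * (1 - g)^2 + reg_strength * g.
            # The first term rewards keeping full attention; the second is the
            # assumed regularizer that pushes the gate toward linear attention.
            grad = -2.0 * s * (1.0 - gates[i]) + reg_strength
            # Projected gradient step keeps the gate inside [0, 1].
            gates[i] = min(1.0, max(0.0, gates[i] - lr * grad))
    # Layers whose gates stay largest keep full attention, up to the budget.
    ranked = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)
    keep_full = sorted(ranked[:budget])
    return gates, keep_full


gates, keep_full = relax_and_select([2.0, 0.1, 1.0, 0.2], budget=2)
print([round(g, 3) for g in gates], keep_full)
```

With these made-up sensitivities, the gates of layers 1 and 3 collapse to zero while layers 0 and 2 settle at interior values, so layers 0 and 2 keep full attention. A single optimization run replaces repeated trial-and-error over configurations, which is the kind of saving the last key point describes.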