Morphing into Hybrid Attention Models
The paper proposes FlashMorph, a method for converting standard Transformers into hybrid attention models by formulating layer selection as a budget-constrained optimization problem. Using morphable models and linearization regularization, it decides which layers keep full attention and which switch to linear attention, taking into account interdependencies among layers across the whole network. According to the paper, this approach outperforms heuristic selection strategies and discovers efficient configurations that maintain strong long-context recall and overall performance. FlashMorph also reduces the computational cost of the layer-selection step itself, which makes the method scalable.
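To make the idea of budget-constrained layer selection concrete, the following is a minimal sketch and not FlashMorph's actual algorithm, whose objective and training procedure are not given in this summary. It assumes each layer carries a hypothetical learnable gate (1 means full attention, 0 means linear attention), that a regularizer penalizes exceeding the full-attention budget, and that the final layout keeps full attention in the layers with the largest gates. All function names and the example logits are illustrative assumptions.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gate_values(logits):
    """Map per-layer logits to gates in (0, 1).

    Assumed convention: a gate near 1 favors full attention,
    a gate near 0 favors linear attention.
    """
    return [sigmoid(z) for z in logits]


def morph_output(gate, full_out, linear_out):
    """Blend the two attention outputs of one (hypothetical) morphable layer."""
    return gate * full_out + (1.0 - gate) * linear_out


def budget_penalty(gates, budget):
    """Assumed regularizer: squared excess of the expected number of
    full-attention layers over the allowed budget (zero when within budget)."""
    excess = sum(gates) - budget
    return max(0.0, excess) ** 2


def discretize(gates, budget):
    """Keep full attention in the `budget` layers with the largest gates."""
    ranked = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)
    keep = set(ranked[:budget])
    return ["full" if i in keep else "linear" for i in range(len(gates))]


# Illustrative logits for a 6-layer model (made-up values, not from the paper).
logits = [2.0, -1.0, 0.5, -3.0, 1.5, -0.5]
gates = gate_values(logits)
penalty = budget_penalty(gates, budget=2)
layout = discretize(gates, budget=2)
print(layout)
```

In a real setting the gates would be trained jointly with the model, so that the choice for one layer can depend on the choices for the others; the fixed logits here only stand in for that learned state.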