
Consider: enable streaming attention as default for Llama models (1-4M context) #86

Open · lessw2020 opened this issue Feb 25, 2024 · 0 comments
Labels: enhancement (New feature or request)

@lessw2020 (Contributor):

For the price of 4 additional tokens (the first four, which act as attention sinks), we can enable streaming window attention and support extremely long context lengths (1-4M tokens?).

"we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence lengths without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to 22.2x speedup."

Mixtral uses the sliding window approach. This might be an easy add to showcase the newest attention techniques, though it's not a 'core' aspect for PTD.
See: https://arxiv.org/abs/2309.17453
I can make a PR to enable this if there is interest.
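As a rough sketch of how such a mask could be wired in: PyTorch's `F.scaled_dot_product_attention` accepts a boolean `attn_mask` where True marks positions a query may attend to. The shapes below are toy values, and `streaming_attention_mask` is the illustrative helper from the sketch above:

```python
import torch
import torch.nn.functional as F

# Toy shapes: batch=1, heads=2, seq_len=8, head_dim=16 (illustrative only).
q = torch.randn(1, 2, 8, 16)
k = torch.randn(1, 2, 8, 16)
v = torch.randn(1, 2, 8, 16)

# Reuse the illustrative mask; True means "take part in attention",
# matching the semantics of SDPA's boolean attn_mask. The (8, 8) mask
# broadcasts over the batch and head dimensions.
mask = streaming_attention_mask(seq_len=8, num_sink_tokens=4, window_size=3)

out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
print(out.shape)  # torch.Size([1, 2, 8, 16])
```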

@tianyu-l added the enhancement (New feature or request) label on May 3, 2024