
Consider: enable streaming attention as default for Llama models (1-4M context) #86

Open · lessw2020 opened this issue Feb 25, 2024 · 0 comments
Labels: enhancement (New feature or request)

@lessw2020 (Contributor):

For the price of 4 additional tokens (the first four, which act as attention sinks), we can enable streaming window attention and support extremely long context lengths (1-4M tokens?).

"we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence lengths without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to 22.2x speedup."

Mixtral uses the sliding window approach. This might be an easy add to showcase the newest attention techniques, though it's not a 'core' aspect for PTD.
See: https://arxiv.org/abs/2309.17453
I can make a PR to enable this if there is interest.
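As a rough sketch of how such a mask could be wired in: PyTorch's `F.scaled_dot_product_attention` accepts a boolean `attn_mask` where True marks positions a query may attend to. The shapes below are toy values, and `streaming_attention_mask` is the illustrative helper from the sketch above:

```python
import torch
import torch.nn.functional as F

# Toy shapes: batch=1, heads=2, seq_len=8, head_dim=16 (illustrative only).
q = torch.randn(1, 2, 8, 16)
k = torch.randn(1, 2, 8, 16)
v = torch.randn(1, 2, 8, 16)

# Reuse the illustrative mask; True means "take part in attention",
# matching the semantics of SDPA's boolean attn_mask. The (8, 8) mask
# broadcasts over the batch and head dimensions.
mask = streaming_attention_mask(seq_len=8, num_sink_tokens=4, window_size=3)

out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
print(out.shape)  # torch.Size([1, 2, 8, 16])
```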

@tianyu-l added the enhancement (New feature or request) label on May 3, 2024