For the price of 4 additional tokens (the first four), we can enable streaming window attention and support extremely long context lengths (1–4M tokens?).
"we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence lengths without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to 22.2x speedup."
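The core of the paper's approach is a KV-cache eviction policy: always retain the first few "attention sink" tokens plus a sliding window of the most recent tokens, evicting everything in between. A minimal sketch of that policy (function and parameter names are illustrative, not from any particular codebase):

```python
def streaming_keep_indices(seq_len: int, n_sink: int = 4, window: int = 1024) -> list[int]:
    """Return the KV-cache positions retained under a StreamingLLM-style policy:
    the first `n_sink` attention-sink tokens plus the last `window` tokens.
    All positions in between are evicted."""
    if seq_len <= n_sink + window:
        # Cache still fits; keep everything.
        return list(range(seq_len))
    return list(range(n_sink)) + list(range(seq_len - window, seq_len))
```

For example, with `n_sink=4` and `window=4`, a 10-token sequence keeps positions `[0, 1, 2, 3, 6, 7, 8, 9]` — the cache size stays bounded at `n_sink + window` no matter how long generation runs, which is what makes the "infinite" sequence length claim work.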
Mixtral already uses the sliding-window approach, so this might be an easy addition to showcase the newest attention mechanism, though it's not a 'core' aspect of PTD.
See: https://arxiv.org/abs/2309.17453
I can make a PR to enable this if there is interest.