Is max_position_embeddings=8096 necessary in 2b model? #41

Open
agiwave opened this issue Mar 8, 2024 · 3 comments
Labels
type:support Support issues

Comments

agiwave commented Mar 8, 2024

I just tried some small changes on the '2b' model:
1. Limited max_position_embeddings from 8096 to 256. :)
2. Trimmed the KV cache in GemmaAttention to max_position_embeddings (256).
3. Removed the limit on the output length of model.generate.
Generation still works fine and can produce about 400 tokens for the prompt "The life meaning is". (A rough sketch of the KV-cache trim in point 2 is below.)
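For reference, a minimal sketch of the trim I mean; the cache layout (batch, seq_len, num_kv_heads, head_dim) and the names here are my assumptions, not the actual GemmaAttention code:

```python
import torch

WINDOW = 256  # stands in for the reduced max_position_embeddings

def trim_kv_cache(k_cache: torch.Tensor, v_cache: torch.Tensor, window: int = WINDOW):
    """Keep only the most recent `window` positions of a KV cache.

    Assumes caches shaped (batch, seq_len, num_kv_heads, head_dim); this is a
    sketch of the idea, not the real gemma_pytorch implementation.
    """
    if k_cache.size(1) > window:
        k_cache = k_cache[:, -window:]
        v_cache = v_cache[:, -window:]
    return k_cache, v_cache

# Inside attention, apply the trim right after the newest key/value is appended,
# so scores are only computed against the last `window` cached positions.
```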

Does that mean:
1. The oldest KV-cache entries are not necessary, and the model can store and compress long-context info into a 256-entry KV cache (18 layers)?
2. Could we try training the model this way (with only a max 256-entry KV cache)?
3. If the above is true, does this mean we can decrease training and generation complexity tremendously, from O(L·L·D) to O(256·L·D) = O(L·D)? (Rough arithmetic below.)
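Back-of-the-envelope version of point 3, with L = sequence length, D = hidden size, and W = KV window; this is just a sketch of the claim, not a measured result:

$$
\underbrace{\sum_{t=1}^{L} O(tD)}_{\text{full cache}} = O(L^2 D)
\qquad
\underbrace{\sum_{t=1}^{L} O(WD)}_{\text{window } W=256} = O(WLD) = O(LD)\ \text{for fixed } W
$$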

agiwave (Author) commented Mar 8, 2024

Maybe we can extend Gemma's context length to unlimited size (depending on the compression rate) in this way (with a limited KV-cache length of 256 or a little more?) at linear complexity.

pengchongjin (Collaborator) commented Mar 13, 2024

> 2. Trimmed the KV cache in GemmaAttention to max_position_embeddings (256).

Do you mean using a sliding window of size 256 as you generate the output tokens?

I think this is an interesting observation. I believe there is some related work in the literature that tries to use sliding windows to extrapolate the context. It sounds like you are doing something similar.

agiwave (Author) commented Mar 13, 2024

Yeah. I tried limiting max_position_embeddings to 256 and generating an answer of more than 400 tokens, and that seemed to work well. I was hoping the model could compress context info from well beyond 256 tokens, so I tested it: I told the model my name first, followed by about 300 tokens of other information, and at the end I asked Gemma "Do you know what's my name?". Gemma couldn't give me the right answer. So Gemma has no memory beyond the sliding window; this only works within 256 tokens (the attention scope). A rough sketch of the probe is below. Emm, a bit of a letdown here :).
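Roughly, the recall probe looked like this; the model.generate(...) call is a placeholder, not the exact gemma_pytorch signature:

```python
# Sketch of the "tell the name, pad with unrelated text, ask the name" probe.
name_fact = "Hello, my name is Alice. "                # hypothetical fact to recall
filler = "Here is some unrelated information. " * 50   # roughly 300+ tokens of padding
question = "Do you know what's my name?"

prompt = name_fact + filler + question
# answer = model.generate(prompt, output_len=32)       # placeholder generation call
# With only a 256-token KV window, the name falls outside the attention scope,
# so the answer comes back wrong -- matching the result described above.
```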

tilakrayal added the type:support label on Apr 24, 2024