[BUG]: Citations not reflecting custom sized embedding #1259
Comments
It's unexpected, but the chunk length value is in characters, not tokens, so it's smaller by a factor of 3 to 4. (So is the sequence-length setting for the embedding model, btw.)
@Propheticus is correct, the split size is by chars, not tokens. This is actually intentional and known, as counting tokens depends on the model's tokenizer, which we cannot really replicate since we only have access to the tokenizer that comes with. I do think, though, that we are at a point where we can probably rely on that library more, since by counting chars we for sure "underestimate" the token length, which is indeed off by some factor depending on the tokenizer.
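A minimal sketch (not AnythingLLM's actual code) of the mismatch described above: a limit interpreted as characters yields chunks that are roughly 3-4x smaller in tokens than the same number read as a token budget. The `CHARS_PER_TOKEN = 4` ratio is the common rough heuristic, not a property of any specific tokenizer.

```python
def split_by_chars(text, max_chars):
    """Split text into chunks of at most max_chars characters."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

CHARS_PER_TOKEN = 4  # rough heuristic; the real ratio depends on the tokenizer

def approx_tokens(chunk):
    """Approximate token count of a chunk from its character length."""
    return len(chunk) / CHARS_PER_TOKEN

text = "lorem ipsum " * 400          # 4800 characters of filler
chunks = split_by_chars(text, 800)   # "800" interpreted as characters...
print(len(chunks[0]))                # 800 characters per chunk
print(approx_tokens(chunks[0]))      # ...but only ~200 tokens
```

So a user who enters 800 expecting tokens actually gets chunks of roughly a quarter of that token budget.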
Mistake on my side, I did not realize the setting was for characters, not tokens. Which means that I will now re-vectorize all of my embeddings. However, it looks like the citations are still too small even when counting characters and not tokens?
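When re-vectorizing, the intended token budget has to be translated into the character-based setting. A hedged sketch, using the ~4 chars-per-token heuristic from the discussion above (the exact ratio depends on the embedding model's tokenizer):

```python
def chars_for_token_budget(tokens, chars_per_token=4):
    """Convert a desired token budget into the character-based chunk-size setting.

    chars_per_token is a heuristic, not an exact tokenizer property.
    """
    return tokens * chars_per_token

# To get chunks of roughly 800 tokens, enter ~3200 as the chunk size.
print(chars_for_token_budget(800))  # 3200
```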
After further testing, it seems that something is off:
How are you running AnythingLLM?
Docker (remote machine)
What happened?
Having started a fresh instance of ALLM and set the custom chunk size to 800 tokens, the citations shown are much shorter than the actual vector size. As an example:
Are there known steps to reproduce?
See above