
[BUG]: Chat with doc (RAG) is 100x slower than chat with same base model #1418

Open
cuihee opened this issue May 16, 2024 · 3 comments
Labels: question (Further information is requested), wontfix (This will not be worked on)

Comments


cuihee commented May 16, 2024

How are you running AnythingLLM?

AnythingLLM Docker (local)
LM Studio (local)
Model: QWen chat 1.5 7B q8 gguf
GPU: NVIDIA RTX4000 SFF Ada 20GB

What happened?

When I am in a workspace without documents, it starts streaming after one second.
When I am in a workspace with documents, it takes about 80 seconds to start streaming.

My documents are not very complex or large: just two files, one .doc and one .txt.

Are there known steps to reproduce?

No response

@cuihee added the possible bug label (Bug was reported but is not confirmed or is unable to be replicated) on May 16, 2024
@timothycarambat added the question (Further information is requested) and wontfix (This will not be worked on) labels, and removed the possible bug label, on May 16, 2024
@timothycarambat (Member) commented:

How many words are in the documents? You would be surprised how much even a little extra context in the window impacts time to first token. This includes the snippets injected into the prompt.

Also, how are you running AnythingLLM? If this is running on CPU and you are not on an M1/M2/M3 Mac, then this is not unexpected.
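
To illustrate the point above, here is a rough back-of-envelope sketch. This is not AnythingLLM code; the word-based token estimate and the 60 tokens/sec prefill rate are made-up placeholders. It only shows that injected document snippets inflate the prompt, and that with prefill cost roughly linear in prompt length, time to first token grows with it:

```python
# Back-of-envelope illustration (not AnythingLLM code): why a RAG prompt
# raises time to first token. The token estimate and prefill throughput
# below are arbitrary placeholders, not measured figures.

def approx_tokens(text: str) -> int:
    """Very rough token estimate: ~4/3 tokens per whitespace-separated word."""
    return int(len(text.split()) * 4 / 3)

def time_to_first_token(prompt: str, prefill_tokens_per_sec: float) -> float:
    """Prefill cost grows roughly linearly with prompt token count."""
    return approx_tokens(prompt) / prefill_tokens_per_sec

question = "Summarize the onboarding policy."
plain_prompt = question

# A workspace with documents injects retrieved snippets ahead of the question.
snippets = ["(retrieved document chunk) " + "lorem ipsum " * 400 for _ in range(4)]
rag_prompt = "\n\n".join(snippets) + "\n\n" + question

for name, prompt in [("no documents", plain_prompt), ("with documents", rag_prompt)]:
    # 60 tok/s prefill is a hypothetical figure for a 7B model not fully on GPU.
    print(f"{name}: ~{approx_tokens(prompt)} prompt tokens, "
          f"~{time_to_first_token(prompt, 60):.1f}s before the first token")
```

With realistic snippet sizes and slow prefill, a delay on the order of tens of seconds before the first token is plausible even though the model itself is unchanged.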

@frost19k commented:

I'm using the LM Studio backend, and in the AnythingLLM UI agents don't seem to stream; their responses show up as one full text blob. But in the LM Studio logs I can see that text generation has begun. This makes it appear slower than it actually is.

@timothycarambat (Member) commented:

@frost19k We don't stream agent responses (because of tool calling), but we will be resolving that soon. The "latency" here may very well just be the model generating the full response. We do stream regular chat responses, of course.
