
[BUG]: Chat with doc (RAG) is 100x slower than chat with same base model #1418

Open
cuihee opened this issue May 16, 2024 · 3 comments
Labels: question (Further information is requested), wontfix (This will not be worked on)

Comments


cuihee commented May 16, 2024

How are you running AnythingLLM?

AnythingLLM Docker (local)
LM Studio (local)
Model: QWen chat 1.5 7B q8 gguf
GPU: NVIDIA RTX4000 SFF Ada 20GB

What happened?

When I am in a workspace without documents, it starts streaming after one second.
When I am in a workspace with documents, it takes about 80 seconds to start streaming.

My documents are not very complex or large: just two files, one .doc and one .txt.

Are there known steps to reproduce?

No response

@cuihee added the possible bug label (Bug was reported but is not confirmed or is unable to be replicated) on May 16, 2024
@timothycarambat added the question (Further information is requested) and wontfix (This will not be worked on) labels, and removed the possible bug label, on May 16, 2024
@timothycarambat (Member) commented:

How many words are in the documents? You would be surprised how much even a little extra context in the window impacts time to first token. This includes the snippets injected into the prompt.

Also, how are you running AnythingLLM? If this is running on CPU and you are not on an M1/M2/M3 Mac, then this is not unexpected.
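
To illustrate the point above, here is a rough back-of-envelope sketch. This is not AnythingLLM code; the word-based token estimate and the 60 tokens/sec prefill rate are made-up placeholders. It only shows that injected document snippets inflate the prompt, and that with prefill cost roughly linear in prompt length, time to first token grows with it:

```python
# Back-of-envelope illustration (not AnythingLLM code): why a RAG prompt
# raises time to first token. The token estimate and prefill throughput
# below are arbitrary placeholders, not measured figures.

def approx_tokens(text: str) -> int:
    """Very rough token estimate: ~4/3 tokens per whitespace-separated word."""
    return int(len(text.split()) * 4 / 3)

def time_to_first_token(prompt: str, prefill_tokens_per_sec: float) -> float:
    """Prefill cost grows roughly linearly with prompt token count."""
    return approx_tokens(prompt) / prefill_tokens_per_sec

question = "Summarize the onboarding policy."
plain_prompt = question

# A workspace with documents injects retrieved snippets ahead of the question.
snippets = ["(retrieved document chunk) " + "lorem ipsum " * 400 for _ in range(4)]
rag_prompt = "\n\n".join(snippets) + "\n\n" + question

for name, prompt in [("no documents", plain_prompt), ("with documents", rag_prompt)]:
    # 60 tok/s prefill is a hypothetical figure for a 7B model not fully on GPU.
    print(f"{name}: ~{approx_tokens(prompt)} prompt tokens, "
          f"~{time_to_first_token(prompt, 60):.1f}s before the first token")
```

With realistic snippet sizes and slow prefill, a delay on the order of tens of seconds before the first token is plausible even though the model itself is unchanged.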

@frost19k commented:

I'm using the LM Studio backend, and in the AnythingLLM UI agents don't seem to stream; their responses show up as one full text blob. But in the LM Studio logs I can see that text generation has begun. This makes it appear slower than it actually is.

@timothycarambat (Member) commented:

@frost19k We don't stream agent responses (because of tool calling), but we will be resolving that soon. The "latency" here may very well just be the model generating the full response. We do stream regular chat responses, of course.
