Integrate with a locally hosted LLM instead of using API #5
Comments
Cohere's trial API key is free.
The model needs at least 4,000 tokens of context length, has to run on consumer hardware, and has to work in real time; I don't think there are any models that can do all that.
Just to say, you can get 4,000 tokens of context length when running models through exllama. I've been doing that with the Chronos 30B model, with exllama in tow, with just enough room to still run Stable Diffusion on the side. Worth bearing in mind, though, that I've got a pretty beefy system with a 4090, so that would probably be a struggle for most. Plus, even on this PC, I probably couldn't run it alongside games that require particularly high specs. Still, any LLaMA model should be able to run 4,000 tokens through exllama, so smaller models should be doable. All the same, I'd love to test this out with a local LLM. Would it be a struggle to alter the code to work with a local LLM through Ooga Booga?
Here's a Reddit post discussing the larger context lengths you can get with exllama: https://www.reddit.com/r/LocalLLaMA/comments/14j4l7h/6000_tokens_context_with_exllama/ The repo itself: https://github.com/turboderp/exllama
Oh cool, I haven't tried exllama or Ooga Booga before. The code currently uses LangChain and Cohere to generate the responses, so it should be possible to use a local LLM: the prompts will be more or less the same, but the parameters will need to be modified a bit depending on the model. I'm not sure I can add this feature myself, though, because I'm not running such a powerful machine 😅
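A rough sketch of what the swap could look like. This assumes a text-generation-webui (Ooga Booga) instance running locally with its API mode enabled; the endpoint path, port, and parameter names below are assumptions based on its default API, not code from this project, so check them against your local setup:

```python
# Hypothetical sketch: replace the Cohere call with a request to a locally
# hosted model served by text-generation-webui. The existing prompts can be
# reused; only the transport and the sampling parameters change.
import json


def build_request(prompt, max_new_tokens=200, temperature=0.7):
    """Build the JSON payload for the local server.

    Parameter names (max_new_tokens, temperature, stopping_strings) follow
    text-generation-webui conventions and may need adjusting per model.
    """
    return {
        "prompt": prompt,
        "max_new_tokens": max_new_tokens,
        "temperature": temperature,
        "stopping_strings": ["\nUser:"],
    }


def generate(prompt, host="http://localhost:5000"):
    """Send the prompt to the local model and return the generated text.

    The /api/v1/generate path is an assumption; confirm it against the
    version of text-generation-webui you are running.
    """
    import requests  # only needed when actually calling the server

    resp = requests.post(f"{host}/api/v1/generate", json=build_request(prompt))
    resp.raise_for_status()
    return resp.json()["results"][0]["text"]
```

Because the server is model-agnostic, the same code should work whether the backend is exllama, GGML, or a plain Transformers load; only the sampling parameters would need tuning per model.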
Great work!