
Update to use GPU-accelerated hardware instead of CPU-bound with gpt4all #6

Open
2 of 3 tasks
misslivirose opened this issue Nov 7, 2023 · 4 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed), privateGPT (This is related to the privateGPT code)


misslivirose (Collaborator) commented Nov 7, 2023

Memory Cache should use an available GPU for inference in order to speed up queries and deriving insights from documents.

What I tried so far

I spent a few days last week exploring the differences between the primordial privateGPT version and the latest one. One of the major differences is that the newer project includes support for GPU inference for llama and gpt4all, but the challenge I ran into with the newer version is that moving from the older groovy.ggml model (which is no longer supported, since privateGPT now uses the .gguf format) to llama doesn't produce the same results when ingesting and querying the same local file store.

This might be a matter of how RAG is implemented, something about how I set things up on my local machine, or a function of model choice.

I've lazily tried to resolve this through dependency changes, but I haven't had luck getting to a version that runs and supports .ggml and GPU acceleration together. From what I can tell, Nomic introduced GPU support in gpt4all 2.4 (the latest is 2.5+), but it's unclear whether there's a way to get this working cleanly with minimal changes to how my fork of privateGPT uses langchain to import the gpt4all package. It's also unclear to me whether this works on Ubuntu or only through the Vulkan APIs; I need to do some additional investigation.

I did get CUDA installed and verified that my GPU is properly detected and set up to run the sample projects provided by Nvidia.
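
As a quick sanity check from inside the Python environment (assuming torch is already present, since the embeddings dependencies pull it in), something like this confirms the GPU is visible:

```python
# Quick check that CUDA is visible from the Python environment privateGPT runs in.
import torch

print(torch.cuda.is_available())      # expect True if CUDA is set up correctly
print(torch.cuda.get_device_name(0))  # should print the NVIDIA card's name
```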

What's next

  • I'm going to test against gpt4all's chat client with snoozy (which uses the same dataset as groovy) and the shared file directory, but there seems to be a sweet spot for the combination of primordial privateGPT + groovy that is challenging to replicate.
  • Branch and start experimenting with upgrading gpt4all and langchain in the primordial privateGPT repo to see if I can get any of it running with the existing groovy.ggml model.
  • Attempt to convert groovy from ggml to gguf using the llama.cpp utility and try to switch from gpt4all to llama, which might be easier than trying to get a proper CUDA-backed gpt4all working.

Testing

I've been using a highly subjective test to evaluate:

Prompt: "What is the meaning of a life well-lived?"

Primordial privateGPT + groovy, augmented with my local files, consistently answers this question with some combination of "technology and community". No other model/project combination has replicated that consistently.

misslivirose (Collaborator Author) commented:

Update: GPT4All 2.5.2 with snoozy fails the "life well-lived" test.

misslivirose (Collaborator Author) commented Nov 8, 2023

My quick attempt at converting groovy from ggml to gguf using the llama.cpp utility did not work - it looks like this is a known consequence of the swap to gguf, but I didn't have time today to investigate further.

I did find the Nomic.ai model card for GPT4All-J to be helpful in explaining the specific iterations that led to groovy. New idea to test:

  • Pull down the latest version of the privateGPT repo and use snoozy to see if this gets closer than gpt4all's own chat client. privateGPT uses gpt4all's modules, but maybe between the prompts and the RAG implementation, there's something else going on that makes the life well-lived test fail.

misslivirose pinned this issue Nov 8, 2023
misslivirose added the enhancement, help wanted, and privateGPT labels Nov 8, 2023
johnshaughnessy unpinned this issue Jan 16, 2024
misslivirose pinned this issue Jan 16, 2024
tomjorquera commented Jan 22, 2024

Hey @misslivirose, I got curious about making the project work on GPU, so I spent some of my Sunday evening investigating the issue.

I managed to make the whole thing work with GPU by updating some of the dependencies and making the required changes. Feel free to look at my branch at https://github.com/tomjorquera/privateGPT/tree/gpu-acceleration and to pull as you please.

It's not all rosy, sadly; I hit some snags along the road (more on that below):

  • You will need to recreate the DB, as chroma needs to be updated to a version which contains breaking changes
  • GPT4All streaming is broken in the current version of langchain
  • Sadly, I still couldn't manage to make GPT4All-J work (I gained a little more info on that, however)

How to use with GPU

My changes introduce a USE_GPU env variable that controls GPU execution, as well as an additional MODEL_N_GPU_LAYERS that lets you choose the number of layers to run on the GPU (LlamaCpp only; GPT4All doesn't support such a thing, to my knowledge).
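
To give an idea, here is a minimal sketch of how these variables could be wired up (the variable names match the ones above; the default path and the surrounding code are illustrative only, not the exact code from my branch):

```python
# Illustrative sketch: read the env variables and pass the layer count to LlamaCpp.
import os
from langchain.llms import LlamaCpp

model_path = os.environ.get("MODEL_PATH", "models/model.gguf")  # placeholder default
use_gpu = os.environ.get("USE_GPU", "false").lower() in ("1", "true", "yes")
n_gpu_layers = int(os.environ.get("MODEL_N_GPU_LAYERS", "0"))

# n_gpu_layers only has an effect if llama-cpp-python was built with CUDA support
# (see the CMAKE_ARGS note below); GPT4All has no equivalent per-layer control.
llm = LlamaCpp(
    model_path=model_path,
    n_gpu_layers=n_gpu_layers if use_gpu else 0,
)
```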

I tested the changes with both LlamaCpp and GPT4All models, both with and without GPU, and it seems to work well on my side.

Installing llama-cpp-python properly

One very important note, however: you must set the correct env variable when installing llama-cpp-python for the first time. So if you're OK with nuking your virtualenv, you can simply do:

CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install -r requirements.txt

That will enable CUDA support.

If you've already installed llama-cpp-python without this env variable, reinstalling it will not work, as the cache will not be rebuilt. If for some reason you do not want to nuke your venv, the magic command to force a reinstallation is:

CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir

Issue: Chroma DB needs to be recreated

Sadly, I had to update Chroma to a version with breaking changes, so you will need to recreate the DB.

Also, the Chroma API changed a lot in the meantime, so I had to make some modifications to ingest.py to get it to work.

Personally, I would do away with the use of LangChain in ingest.py and use the chroma bindings directly. The API is really straightforward, and the LangChain wrapper only adds complexity without any additional value here.
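
To give an idea of how little code is needed, here is a rough sketch of direct chroma usage (the collection name, paths, and embedding model are placeholders, not necessarily what the project uses):

```python
# Rough sketch of ingesting and querying with the chroma bindings directly,
# no LangChain wrapper involved. Names and paths are placeholders.
import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="db")
embedder = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2")
collection = client.get_or_create_collection("documents", embedding_function=embedder)

# `chunks` would come from whatever text splitter ingest.py already uses
chunks = ["first chunk of text...", "second chunk of text..."]
collection.add(
    ids=[f"doc-{i}" for i in range(len(chunks))],
    documents=chunks,
    metadatas=[{"source": "example.txt"} for _ in chunks],
)

# Querying is just as direct
results = collection.query(query_texts=["a life well-lived"], n_results=4)
```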

Issue: GPT4All streaming is broken in latest LangChain (FIXED!)

LangChain has gone through multiple refactorings, and it seems at some point streaming support for GPT4All broke. I found a way to re-enable it and created an issue with a PR for that at langchain-ai/langchain#16389, but I'm not optimistic it will be fixed soon, given that:

  • folks at langchain seem to want to move toward an asynchronous API for streaming, which in itself is a good idea (I do the same in my LLM-related projects) but is not implemented for GPT4All and would require some deeper changes in privateGPT.py
  • they have a lot of things in their backlog 😅

I don't have a satisfying solution for that (except by doing away with LangChain completely).

EDIT: scratch all that, my proposed change has just been merged. So once it is released the issue should be fixable by adding streaming=True to the GPT4All constructor.
The LangChain dev also shared a nice way to do away with callbacks in the latest version: langchain-ai/langchain#16392 (comment)

EDIT2: My fix has been released with v0.1.3. I updated my branch to the latest version (0.1.4) and re-enabled streaming with GPT4All.
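
For reference, usage after the fix should look roughly like this (the model path is a placeholder and import paths may vary slightly between langchain versions):

```python
# Minimal sketch of streaming with GPT4All once langchain >= 0.1.3 is installed.
from langchain_community.llms import GPT4All
from langchain_core.callbacks import StreamingStdOutCallbackHandler

llm = GPT4All(
    model="models/gpt4all-model.gguf",  # placeholder path
    streaming=True,                     # re-enables token-by-token output
    callbacks=[StreamingStdOutCallbackHandler()],
)
llm.invoke("What is the meaning of a life well-lived?")
```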

Issue: GPT4All-j still doesn't work

So, as you mentioned previously, GPT4All-J has not been migrated to gguf.

Looking around in the gpt4all repo, I noticed a promising script, gpt4all-backend/script/convert_gptj_to_gguf.py, but it was badly broken.

I reported the issue and proposed a fix at nomic-ai/gpt4all#1863, and managed to use that to generate a (seemingly) valid gguf file from the original GPT4All-J model (downloaded from HF).

However trying to quantize the model using gpt4all-backend/llama.cpp-mainline/quantize fails, and the unquantized model is simply too big for me to try.

Potential solution: adding support for HF models directly

So sadly, I didn't find a way to make it work. What can be done, however, is to add support for HF models in privateGPT.py and run the original model directly. This will probably not give you the same results as the model you were using, but if you really want to keep using GPT4All-J specifically, that could be a solution.

Just for fun, I did that using HuggingFacePipeline in my branch https://github.com/tomjorquera/privateGPT/tree/hf-backend, which builds on the previous one and adds support for the "HF" MODEL_TYPE.

For the MODEL_PATH, you can give a local directory downloaded from HF, or an HF repo identifier. In the latter case, it will fetch the files from HF on first use (and since nomic-ai/gpt4all-j is public, you don't need to set an auth token). I've set it up to do 4-bit quantization on the fly if you enable GPU support (this probably needs to be more configurable, however).

Note, however, that it will not do any quantization whatsoever when running on CPU, so it's probably not worth trying to use it that way.
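
For the curious, the core of the HF path boils down to something like this (not the exact code from my branch, just the general shape; the 4-bit part needs bitsandbytes and a CUDA-capable GPU):

```python
# Sketch of loading GPT4All-J straight from HF with on-the-fly 4-bit quantization
# and wrapping it for LangChain. Parameters are illustrative, not my branch verbatim.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
from langchain_community.llms import HuggingFacePipeline

model_id = "nomic-ai/gpt4all-j"  # or a local directory downloaded from HF

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # GPU-only path
)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=256)
llm = HuggingFacePipeline(pipeline=pipe)
```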

Let me know if you take it for a spin 😄


Fun project to work on. Hope that helps!

tomjorquera commented:
Update: My fix for GPT4All streaming has been released with langchain v0.1.3. I updated my branch to the latest version (0.1.4) and to make use of the relevant option. So now GPT4All streaming works again 🙂
