
Training on own data? #532

Closed
dinnerisserved opened this issue May 11, 2023 · 9 comments
Labels
duplicate This issue or pull request already exists

Comments

@dinnerisserved

Is there a way to feed GPT4All my own data so that it can be trained on that information? I would like to feed it my emails, my PDF files, and a bunch of other data I have, then use GPT4All's chat to trawl through this data and pull out information for me. Is this something that is being worked on, or is it currently possible?

@cigoic

cigoic commented May 11, 2023

Voting for this! Same question. ✋

@mdogadailo

mdogadailo commented May 11, 2023

LangChain allows you to connect a language model to other sources of data. You can also take a look at https://github.com/imartinez/privateGPT

@mehrooo

mehrooo commented May 11, 2023

It's known as "Chat with PDF" or "Talk to your PDF/Book/Document"; look it up.
You don't need to train a model on your own data; you can fine-tune a model so that it analyses the given file and extracts the requested information from it.
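In practice, most "Chat with PDF" tools do this with no training at all: the file's text is simply placed inside the prompt. A minimal sketch of that prompt-stuffing pattern (here `ask_model` is a hypothetical placeholder for whatever chat backend you use, e.g. a local GPT4All model; the document and question are illustrative):

```python
def build_prompt(document_text: str, question: str) -> str:
    """Stuff the document into the prompt so the model can answer from it."""
    return (
        "Use only the document below to answer the question.\n\n"
        f"--- DOCUMENT ---\n{document_text}\n--- END DOCUMENT ---\n\n"
        f"Question: {question}\nAnswer:"
    )

def ask_model(prompt: str) -> str:
    # Placeholder: swap in a real call to your local chat model here.
    return "(model response)"

doc = "Invoice #42 was paid on 2023-05-02 by ACME Corp."
prompt = build_prompt(doc, "Who paid invoice #42?")
answer = ask_model(prompt)
```

The obvious limitation is context length: this only works while the whole document fits in the model's context window, which is why larger collections need the retrieval step discussed below in the thread.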

@dinnerisserved
Author

I never intended to "train" on my own data; it was more about letting the GPT access a file repository to take into consideration when I ask it questions. "Chat with a data lake" is what I wanted to achieve.

@sunilkumardash9

> I never intended to "train" on own data, but it was more about letting the GPT access a file repository to take into consideration when asking it questions. Chat with a datalake is what I wanted to achieve.

Well, I think you can do this by performing a semantic search over your text data (embeddings), feeding the relevant chunks to a chat model, and getting your answers.
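That retrieve-then-read idea can be sketched end to end. The toy example below uses bag-of-words vectors and cosine similarity purely as a stand-in for a real sentence-embedding model and vector store (e.g. FAISS); the chunks and query are illustrative, and in a real setup the retrieved chunks would then be stuffed into the chat prompt:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Stand-in for a real embedding model: bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank document chunks by similarity to the query; keep the top k."""
    qv = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(qv, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "The quarterly report shows revenue grew 12 percent.",
    "Cafeteria menu: soup on Mondays, pasta on Fridays.",
    "Revenue growth was driven by the new enterprise tier.",
]
top = retrieve("what drove revenue growth?", chunks)
```

Only the top-ranked chunks go into the prompt, so the model's context window is spent on relevant text rather than the whole repository.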

@naman-TI

Semantic search is one option, agreed. But it sounds better than it actually performs.

Is there a way to fine-tune (domain adaptation) the gpt4all model on my local enterprise data, such that gpt4all "knows" the local data the way it knows the open data (from Wikipedia etc.)?

@zubair-ahmed-ai

> I never intended to "train" on own data, but it was more about letting the GPT access a file repository to take into consideration when asking it questions. Chat with a datalake is what I wanted to achieve.

> Well, I think you can do this by performing a semantic search over your text data (embeddings) and feed the relevant ones to chat models and get your answers.

Tried that with dolly-v2-3b, LangChain and FAISS, but boy is it slow: loading the embeddings takes too long (over 4 GB for 30 PDF files of less than 1 MB each), then I hit CUDA out-of-memory errors on the 7B and 12B models running on an Azure STANDARD_NC6 instance with a single Nvidia K80 GPU, and tokens keep repeating on the 3B model with chaining.
Currently trying mosaicml/mpt-7b to generate some tokens, and that's taking forever.

Came here to test out GPT4All and see if it is any better.

@claell
Contributor

claell commented Jun 9, 2023

@niansa: #87, #198, #223

@niansa
Collaborator

niansa commented Aug 11, 2023

Yes, this is a duplicate.

Please open a new, updated issue if this is still relevant to you. You're encouraged to open new issues or even PRs if there's anything you need!

@niansa closed this as not planned (duplicate) on Aug 11, 2023
@niansa added the duplicate label on Aug 11, 2023

9 participants