
Training on own data? #532

Closed
dinnerisserved opened this issue May 11, 2023 · 9 comments
Labels
duplicate This issue or pull request already exists

Comments

@dinnerisserved

Is there a way to feed GPT4All my own data so that it can be trained on that information? I would like to feed it my emails, my PDF files, and a bunch of other data I have, then use GPT4All's chat to trawl through this data and pull out information for me. Is this something that is being worked on, or is it currently possible?

@cigoic

cigoic commented May 11, 2023

Voting for this! Same question. ✋

@mdogadailo

mdogadailo commented May 11, 2023

LangChain allows you to connect a language model to other sources of data. You can also take a look at https://github.com/imartinez/privateGPT

@mehrooo

mehrooo commented May 11, 2023

It's known as "Chat with PDF" or "Talk to your PDF/Book/Document"; look it up.
You don't need to train a model on your own data; you can fine-tune a model so that it analyses the given file and extracts the requested information from it.
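In practice, most "Chat with PDF" tools do this with no training at all: the file's text is simply placed inside the prompt. A minimal sketch of that prompt-stuffing pattern (here `ask_model` is a hypothetical placeholder for whatever chat backend you use, e.g. a local GPT4All model; the document and question are illustrative):

```python
def build_prompt(document_text: str, question: str) -> str:
    """Stuff the document into the prompt so the model can answer from it."""
    return (
        "Use only the document below to answer the question.\n\n"
        f"--- DOCUMENT ---\n{document_text}\n--- END DOCUMENT ---\n\n"
        f"Question: {question}\nAnswer:"
    )

def ask_model(prompt: str) -> str:
    # Placeholder: swap in a real call to your local chat model here.
    return "(model response)"

doc = "Invoice #42 was paid on 2023-05-02 by ACME Corp."
prompt = build_prompt(doc, "Who paid invoice #42?")
answer = ask_model(prompt)
```

The obvious limitation is context length: this only works while the whole document fits in the model's context window, which is why larger collections need the retrieval step discussed below in the thread.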

@dinnerisserved
Author

I never intended to "train" on my own data; it was more about letting the GPT access a file repository to take into consideration when I ask it questions. "Chat with a data lake" is what I wanted to achieve.

@sunilkumardash9

> I never intended to "train" on own data, but it was more about letting the GPT access a file repository to take into consideration when asking it questions. Chat with a datalake is what I wanted to achieve.

Well, I think you can do this by performing a semantic search over your text data (embeddings), feeding the relevant chunks to a chat model, and getting your answers.
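That retrieve-then-read idea can be sketched end to end. The toy example below uses bag-of-words vectors and cosine similarity purely as a stand-in for a real sentence-embedding model and vector store (e.g. FAISS); the chunks and query are illustrative, and in a real setup the retrieved chunks would then be stuffed into the chat prompt:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Stand-in for a real embedding model: bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank document chunks by similarity to the query; keep the top k."""
    qv = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(qv, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "The quarterly report shows revenue grew 12 percent.",
    "Cafeteria menu: soup on Mondays, pasta on Fridays.",
    "Revenue growth was driven by the new enterprise tier.",
]
top = retrieve("what drove revenue growth?", chunks)
```

Only the top-ranked chunks go into the prompt, so the model's context window is spent on relevant text rather than the whole repository.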

@naman-TI

Semantic search is one option, agreed. But it sounds better than it actually performs.

Is there a way to fine-tune (domain adaptation) the gpt4all model on my local enterprise data, such that gpt4all "knows" the local data the way it knows the open data (from Wikipedia etc.)?

@zubair-ahmed-ai

> I never intended to "train" on own data, but it was more about letting the GPT access a file repository to take into consideration when asking it questions. Chat with a datalake is what I wanted to achieve.

> Well, I think you can do this by performing a semantic search over your text data (embeddings) and feed the relevant ones to chat models and get your answers.

Tried that with dolly-v2-3b, LangChain and FAISS, but boy is it slow: loading the embeddings takes too long (over 4 GB for 30 PDF files of less than 1 MB each), then I hit CUDA out-of-memory errors on the 7B and 12B models running on an Azure STANDARD_NC6 instance with a single Nvidia K80 GPU, and tokens keep repeating on the 3B model with chaining.
Currently trying mosaicml/mpt-7b to generate some tokens, and that's taking forever.

Came here to test out GPT4All and see if it is any better.

@claell
Contributor

claell commented Jun 9, 2023

@niansa: #87, #198, #223

@niansa
Collaborator

niansa commented Aug 11, 2023

Yes, this is a duplicate.

Please open a new, updated issue if this is still relevant to you. You're encouraged to open new issues or even PRs if there's anything you need!

@niansa closed this as not planned (duplicate) on Aug 11, 2023
@niansa added the duplicate label on Aug 11, 2023

9 participants