Collections that are missing embeddings can get stuck that way until an explicit re-index #2273

Open
3Simplex opened this issue Apr 26, 2024 · 3 comments
Labels
bug Something isn't working chat gpt4all-chat issues local-docs

Comments

@3Simplex
Contributor

3Simplex commented Apr 26, 2024

Bug Report

Collections created before the update to 2.7.4 do not work after the update.
Only collections created in 2.7.4 work.

Steps to Reproduce

  1. Create collection in version 2.7.3 or older.
  2. Update to 2.7.4
  3. Start new chat
  4. Select LocalDocs Collections that were made before the update.
  5. Reference all collections in chat to pull context from each collection.
    Screenshot 2024-04-26 172417
  6. LocalDocs will not find contents in collections made before the update to 2.7.4
    Screenshot 2024-04-26 172912
  7. Create and add a new collection using 2.7.4
  8. Include new collection in selected LocalDocs collections
  9. Reference all collections in chat to pull context from each collection.
    Screenshot 2024-04-26 164521
  10. LocalDocs will find contents in newly created collections only.
    Screenshot 2024-04-26 163344

Expected Behavior

All collections were expected to function as usual.

Your Environment

  • GPT4All version: 2.7.4
  • Operating System: Win11
  • Chat model used (if applicable): SBert-LocalDocs [model-all-MiniLM-L6-v2.gguf2.f16.gguf]
@3Simplex 3Simplex added bug-unconfirmed chat gpt4all-chat issues labels Apr 26, 2024
@SINAPSA-IC

SINAPSA-IC commented Apr 29, 2024

I second this.

The program starts searching the selected collections...

I tried this with 4 collections, to spot the fraction-of-a-second status message "searching in localdocs:..."

...but it immediately switches to the default "generating response..." and "processing..." without parsing the collections. They are mentioned at the beginning but never really used.

@cebtenzzre cebtenzzre added bug Something isn't working local-docs and removed bug-unconfirmed labels Apr 29, 2024
@cebtenzzre
Member

cebtenzzre commented Apr 29, 2024

I am able to reproduce this issue using a copy of some of 3Simplex's collections. It seems like the embeddings are missing for certain documents, due to the process getting interrupted somehow. These documents would have been re-indexed on every launch in previous versions of GPT4All because their modification timestamp did not match the database. Now they are only re-indexed the first time GPT4All v2.7.4 is started, and if that did not succeed then the collections will be broken until they are once again re-indexed (e.g. by changing the document snippet size) and it completes successfully.
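
To illustrate the failure mode described above, here is a minimal sketch (not GPT4All's actual code or schema) of how a staleness check based only on the modification timestamp can miss a document whose embedding run was interrupted. The `DocumentRow` fields and both function names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class DocumentRow:
    # Hypothetical database row for one indexed document.
    path: str
    indexed_mtime: int    # modification time recorded at indexing time
    chunks_total: int     # chunks the document was split into
    chunks_embedded: int  # chunks that actually received an embedding


def stale_by_timestamp(row: DocumentRow, file_mtime: int) -> bool:
    # Re-index only when the file changed since it was recorded.
    return row.indexed_mtime != file_mtime


def stale_by_timestamp_or_embeddings(row: DocumentRow, file_mtime: int) -> bool:
    # Also re-index when some chunks never got an embedding.
    return stale_by_timestamp(row, file_mtime) or row.chunks_embedded < row.chunks_total


# A document whose timestamp matches but whose embedding run stopped early.
row = DocumentRow("notes.pdf", indexed_mtime=1714150000, chunks_total=12, chunks_embedded=5)
timestamp_only = stale_by_timestamp(row, 1714150000)
with_embedding_check = stale_by_timestamp_or_embeddings(row, 1714150000)
```

With the timestamp-only check the document looks up to date, so nothing triggers a re-index; the combined check flags it.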

We need to implement a way to know whether embeddings have been generated for a chunk so the program can continue where it left off.
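
One possible shape for this, sketched under assumptions: store the embedding per chunk and treat a NULL embedding as "not generated yet", so each launch only processes what is missing. The `chunks` table, the `fake_embed` placeholder, and the `limit` parameter (used here to simulate an interrupted run) are all hypothetical and do not reflect the actual LocalDocs database.

```python
import sqlite3
import struct


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE chunks ("
        " id INTEGER PRIMARY KEY,"
        " text TEXT NOT NULL,"
        " embedding BLOB)"  # NULL means no embedding has been generated yet
    )
    return conn


def fake_embed(text: str) -> bytes:
    # Placeholder for a real embedding model: packs two floats into bytes.
    return struct.pack("2f", float(len(text)), float(text.count(" ")))


def pending_count(conn: sqlite3.Connection) -> int:
    return conn.execute(
        "SELECT COUNT(*) FROM chunks WHERE embedding IS NULL"
    ).fetchone()[0]


def embed_pending(conn: sqlite3.Connection, embed, limit=None) -> int:
    # Process only chunks without an embedding, so a later run resumes
    # where an earlier one stopped.
    rows = conn.execute(
        "SELECT id, text FROM chunks WHERE embedding IS NULL ORDER BY id"
    ).fetchall()
    if limit is not None:
        rows = rows[:limit]
    for chunk_id, text in rows:
        conn.execute(
            "UPDATE chunks SET embedding = ? WHERE id = ?", (embed(text), chunk_id)
        )
        conn.commit()  # commit per chunk so an interruption keeps finished work
    return len(rows)


conn = open_db()
conn.executemany(
    "INSERT INTO chunks (text) VALUES (?)",
    [("alpha beta",), ("gamma",), ("delta epsilon zeta",)],
)
conn.commit()

first_run = embed_pending(conn, fake_embed, limit=1)  # simulated interrupted run
left_over = pending_count(conn)
second_run = embed_pending(conn, fake_embed)          # next launch picks up the rest
```

Committing after every chunk is the simplest choice for a sketch; a real implementation would likely batch commits and accept redoing at most one batch after an interruption.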

@cebtenzzre cebtenzzre changed the title LocalDocs ignoring old Collections Collections that are missing embeddings can get stuck that way until an explicit re-index Apr 29, 2024
@SINAPSA-IC

SINAPSA-IC commented Apr 29, 2024

I have also done as 3Simplex said and changed the contents of a folder that is registered as a collection. Here's what I did:

  • removed one file (cut and pasted it one level up) from a folder that was already known as a LocalDocs collection
  • after removing the file, the program did not reindex the collection
  • after pasting the file back into its folder, the program started reindexing that collection

I did this with 3 distinct files in 3 distinct folders/categories.
The result was the same: those collections were reindexed.

However, the issue of existing collections being reindexed is still here. Immediately after program start, I see several collections being indexed again that were created even before 2.7.3 (I can't remember whether it was 2.6.1 or a 2.7.x) and have stayed that way since then...

Edit :) - cebtenzzre's explanation clarifies why this happens. Indeed, a flag or something similar would be handy, like the way Windows knows that it didn't shut down properly :)
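
The "flag" idea could look roughly like the sketch below: a marker file is written before indexing starts and removed only when indexing finishes cleanly, so a leftover marker at startup means the previous run did not complete. The `IndexingMarker` class and the `indexing.dirty` file name are hypothetical; this is an illustration of the general dirty-flag pattern, not how GPT4All does or will handle it.

```python
import os
import tempfile


class IndexingMarker:
    """Hypothetical dirty flag: present on disk while indexing is unfinished."""

    def __init__(self, path: str):
        self.path = path

    def was_interrupted(self) -> bool:
        # A marker left over from an earlier run means it never finished.
        return os.path.exists(self.path)

    def __enter__(self):
        with open(self.path, "w") as f:
            f.write("indexing in progress\n")
        return self

    def __exit__(self, exc_type, exc, tb):
        # Remove the marker only on a clean finish; on an error (or a hard
        # crash, where this method never runs) the marker stays in place.
        if exc_type is None:
            os.remove(self.path)
        return False


marker = IndexingMarker(os.path.join(tempfile.mkdtemp(), "indexing.dirty"))

try:
    with marker:
        raise RuntimeError("simulated crash during indexing")
except RuntimeError:
    pass
interrupted = marker.was_interrupted()  # marker is still on disk

with marker:
    pass  # indexing completes normally this time
clean = not marker.was_interrupted()
```

A collection-wide flag like this only says that something was interrupted; it would still need per-chunk tracking to know where to continue.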
