VectorCollection.AddOrUpdateFrom #294

Open
HavenDV opened this issue May 5, 2024 · 0 comments

HavenDV commented May 5, 2024

dani
I have a question: when I store a stream in the storage, how does it decide whether the insertion should be done or whether the index already exists? I have seen that DataSource.FromStream does not use a path. Doesn't that make it difficult to find an element?
HavenDV — 05/02/2024 9:58 PM
DataSource.FromStream simply retrieves Documents from this Stream. Although there is metadata here, it is not currently used in any way, and there is no check for whether the same data is already present in the VectorCollection
dani — 05/02/2024 10:01 PM
I think I have not expressed myself well. For example, in the code I am testing, I insert files from a repository. Can I decide whether or not to insert if the vector already exists in the database?

foreach (var f in files)
{
    if (!ignoreExt.Contains(Path.GetExtension(f.FilePath).ToLower()))
    {
        index = await vectordb.AddDocumentsFromAsync<GitlabLoader>(
            embeddings,
            dimensions: dimensions,
            dataSource: DataSource.FromStream(f.ContentToStream),
            collectionName: collection);
    }
}

HavenDV — 05/02/2024 10:06 PM
IVectorDatabase.IsCollectionExistsAsync is probably the best choice at the moment if you can store files in different collections
IVectorCollection.IsEmptyAsync can also be used, but it is not yet implemented/tested for all databases
dani — 05/02/2024 10:11 PM
My idea was to use one collection to store an entire repository. Would it be viable to have a method to check whether a path already exists? And could DataSource.FromStream have an option to pass it a path, so that the data is indexed by it in some way?
HavenDV — 05/02/2024 10:16 PM
I understand your problem, I'll think about it. For now the only solution is to recreate the entire collection for all files if any of the files have changed
The problem is that one DataSource can turn into several Documents, and as a result into several vectors in the database. So we need to add metadata to the database, such as FilePath, and then check for the presence of vectors with this metadata
Using the Id of a specific vector for this won't work, because the Id needs to be unique and one file can produce several vectors
dani — 05/02/2024 10:19 PM
You could retrieve a count of the vectors with the same path, right?
HavenDV — 05/02/2024 10:23 PM
But what if the file changes partially?
dani — 05/02/2024 10:25 PM
Maybe you could have a hash function and if it doesn't match delete and reinsert?
HavenDV — 05/02/2024 10:31 PM
Yeah, that sounds like a good suggestion.

And add AddOrUpdateFrom, which does this automatically.
Also add the ability to pass metadata to the DataSource so that you can control this.
But this seems to go a little beyond the scope of the current release and sounds like a good plan for the near future
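
The plan above (store FilePath and a content hash as metadata, then delete and reinsert when the hash no longer matches) could be sketched roughly as follows. This is only an illustration under stated assumptions: `StoredVector`, `UpsertResult`, and `InMemoryCollection` are hypothetical names invented for the example and are not part of the library's API, the store is a plain in-memory list, and embedding generation is left out entirely.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Hypothetical record: one stored vector's text plus the metadata discussed above.
public sealed record StoredVector(string Id, string FilePath, string ContentHash, string Text);

public enum UpsertResult { Added, Updated, Unchanged }

// Hypothetical in-memory stand-in for a vector collection (assumption, not the real API).
public sealed class InMemoryCollection
{
    private readonly List<StoredVector> _items = new();

    public int Count => _items.Count;

    // SHA-256 of the whole file content, as an uppercase hex string.
    public static string Hash(string content) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(content)));

    // One file can produce several chunks, so every chunk carries the same
    // FilePath and file-level hash, while each vector keeps its own unique Id.
    public UpsertResult AddOrUpdate(string filePath, string content, IEnumerable<string> chunks)
    {
        var hash = Hash(content);
        var existing = _items.Where(v => v.FilePath == filePath).ToList();

        // Same path and same hash: the file has not changed, so skip it.
        if (existing.Count > 0 && existing.All(v => v.ContentHash == hash))
            return UpsertResult.Unchanged;

        // New file, or the hash differs: delete everything for this path and reinsert.
        _items.RemoveAll(v => v.FilePath == filePath);
        foreach (var chunk in chunks)
            _items.Add(new StoredVector(Guid.NewGuid().ToString(), filePath, hash, chunk));

        return existing.Count > 0 ? UpsertResult.Updated : UpsertResult.Added;
    }
}
```

Because the hash covers the whole file, a partial change to a file replaces all of that file's vectors at once, which matches the "delete and reinsert" suggestion and avoids tracking which individual chunks changed.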
