How to handle multiple embeddings? #12158

cmosguy · 2024-03-21T23:30:56Z

I had an issue where I saved my embeddings and documents in ChromaDB using a SentenceTransformer model. I lost track of the exact model used at the time, so when I attempted a semantic search query, my query embedding did not match the document embeddings stored in ChromaDB.

I have two questions:

1. At the time of writing to the database, is the exact document embedding model stored somewhere for future reference? I looked in the metadata for the doc, but it indicates embedding=Null (what is this field used for, anyway?)
2. What is the best practice for managing stored embeddings? New embedding models come out all the time, so how do we record which embedding model was used at the time of indexing? Should I have a separate Chroma DB collection per model and store the embedding name in the collection metadata, or put it somewhere else? My plan was to have a single doc_id= identifier so I do not have redundant document records. Not sure what the best strategy is here.

Thanks!

@dosu-bot · 2024-03-21T23:31:10Z

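Based on the current implementation of LlamaIndex, there isn't a built-in feature that stores the exact SentenceTransformer model used for generating embeddings at the time of writing to the database. The embedding field in the document metadata appears to be meant for the embedding vector itself rather than the model name, so a null value there most likely means the vector is kept in the vector store instead of being duplicated in the serialized node. Either way, it does not tell you which model was used during indexing.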

Given how often new embedding models are released, here are some strategies you could consider for managing stored embeddings:

Versioning: You could maintain version control of your embeddings by including a version identifier in the metadata of each embedding. This would allow you to track and retrieve embeddings generated by specific versions of models.
Metadata Management: Enhance your embeddings with rich metadata. This includes not only the version of the model used but also parameters that might affect the embeddings, such as preprocessing steps or model configurations. This metadata can be stored alongside the embeddings in ChromaDB or in a separate metadata management system.
Decoupling Storage and Processing: Consider decoupling the storage of embeddings from the processes that generate or consume them. This allows for more flexibility in updating models or embeddings without disrupting downstream systems. APIs or microservices can be used to abstract the access to embeddings, providing a consistent interface while allowing for changes in the backend.
Monitoring and Evaluation: Regularly monitor the performance of your embeddings in downstream tasks and evaluate the impact of updating to newer versions. This can help in making informed decisions about when to update the embeddings and which version to use.
Automated Pipelines: Automate the pipeline for generating, updating, and deploying embeddings as much as possible. This includes the process of training or fine-tuning models, generating embeddings, evaluating their performance, and updating the storage system. Automation can help in managing the frequent updates more efficiently.
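
To make the versioning and metadata ideas above more concrete, here is a minimal sketch of a record that captures what was used to produce the vectors. The `EmbeddingSpec` class, its field names, and the `embedding_fingerprint` key are all hypothetical; nothing here is part of LlamaIndex or ChromaDB, and it only uses the Python standard library.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    """Hypothetical record of the settings that affect the stored vectors."""

    model_name: str          # e.g. a SentenceTransformer model id
    dimension: int           # length of each embedding vector
    normalized: bool         # whether vectors were L2-normalized
    preprocessing: str = ""  # free-text note on chunking or cleaning

    def fingerprint(self) -> str:
        """Short, stable hash of the spec, usable as a version identifier."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

    def to_metadata(self) -> dict:
        """Flat dict of scalar values, suitable for a metadata field."""
        meta = asdict(self)
        meta["embedding_fingerprint"] = self.fingerprint()
        return meta


# Example values; adjust to whatever model and settings you actually use.
spec = EmbeddingSpec("all-MiniLM-L6-v2", 384, True, "512-token chunks")
metadata = spec.to_metadata()
```

The resulting `metadata` dict contains only strings, numbers, and booleans, so it could be attached to each document or to the collection as a whole. Two indexing runs with identical settings produce the same fingerprint, which makes a changed model or preprocessing step easy to spot.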

In terms of storing the embedding type at the time of indexing, you could consider storing the embedding name in the collection metadata. This would allow you to easily identify the embedding type used for each collection. However, this approach would require careful management of the collection metadata to ensure it remains accurate and up-to-date.
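
One way to keep that collection metadata useful is to check it before every query. The sketch below assumes one collection per embedding model and a metadata key called `embedding_model`; the helper names, the naming scheme, and the key are hypothetical. A plain dict stands in for the metadata you would read back from the store (in ChromaDB, as far as I know, a `metadata` dict can be passed when creating a collection).

```python
import re


def collection_name_for(base: str, model_name: str) -> str:
    """Derive one collection name per embedding model (hypothetical scheme)."""
    slug = re.sub(r"[^a-zA-Z0-9]+", "-", model_name).strip("-").lower()
    return f"{base}-{slug}"


class EmbeddingMismatchError(ValueError):
    """Raised when a query model differs from the one used for indexing."""


def check_query_model(collection_metadata: dict, query_model: str) -> None:
    """Refuse to query if the collection was indexed with another model."""
    indexed_with = collection_metadata.get("embedding_model")
    if indexed_with is None:
        raise EmbeddingMismatchError(
            "collection does not record its embedding model"
        )
    if indexed_with != query_model:
        raise EmbeddingMismatchError(
            f"collection was indexed with {indexed_with!r}, not {query_model!r}"
        )


# A plain dict stands in for the collection metadata read back from the store.
stored = {"embedding_model": "all-MiniLM-L6-v2"}
name = collection_name_for("docs", "all-MiniLM-L6-v2")
check_query_model(stored, "all-MiniLM-L6-v2")  # same model, passes silently
```

Failing loudly on a mismatch is a design choice: a query embedded with the wrong model may still return results, just meaningless ones, so an exception is easier to notice than degraded search quality.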

As for avoiding redundant document records, using a single doc_id=<name of file> identifier could work, provided that the file name is unique for each document. This would allow you to easily identify and retrieve specific documents without having to deal with duplicate records.
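
As a rough illustration of the uniqueness caveat, the sketch below keys records by relative path instead of bare file name and uses a content hash to skip unchanged documents. The `upsert` helper and the dict-based `store` are hypothetical stand-ins for whatever write path you use; the same id could then be reused in each per-model collection.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hash the document text so changes can be detected cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def upsert(store: dict, doc_id: str, text: str) -> str:
    """Insert or overwrite by id; returns 'added', 'updated', or 'unchanged'."""
    new_hash = content_hash(text)
    existing = store.get(doc_id)
    if existing is None:
        status = "added"
    elif existing["content_hash"] == new_hash:
        return "unchanged"  # same id, same content: nothing to re-embed
    else:
        status = "updated"
    store[doc_id] = {"text": text, "content_hash": new_hash}
    return status


store = {}
results = [
    upsert(store, "reports/q1.txt", "first draft"),   # new id
    upsert(store, "reports/q1.txt", "first draft"),   # identical content
    upsert(store, "reports/q1.txt", "second draft"),  # content changed
    upsert(store, "archive/q1.txt", "first draft"),   # same file name, other path
]
```

Here the two `q1.txt` files stay separate because the id includes the directory, while re-ingesting the same path never creates a second record.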

Please note that these strategies would require additional code and systems for versioning, metadata management, and automation, which are not covered in the current LlamaIndex implementation. You might need to extend the existing codebase or integrate with other systems to implement these strategies.

I hope this helps! If you have any further questions, feel free to ask.

