Creating embedding with MongoDB text store when library contains CSV file fails #475

Open
BillJones-SectorFlow opened this issue Mar 2, 2024 · 7 comments


@BillJones-SectorFlow

Hi all! I think I may have found a bug related to creating embeddings of CSV files.

When attempting to create an embedding of a library (with ChromaDB as vector_db), where the library has a CSV file added, I'm getting the following exception:

  File "C:\Files\llmware-main\llmware\embeddings.py", line 2164, in create_new_embedding
    text_search = block["text_search"].strip()
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'list' object has no attribute 'strip'

I certainly could be doing something wrong, but on the line mentioned, block["text_search"] is a list of lists representing the rows of the CSV. The code then calls strip() on that list as if it were a str, which causes the error. Simply skipping the strip() based on instance type doesn't work, because errors then surface elsewhere.

I'm able to reproduce with the latest main branch using the following sample code (with above setup included):

Setup/Boilerplate Minimized for Simplicity

from llmware.configs import ChromaDBConfig, LLMWareConfig, MongoConfig
from llmware.library import Library

vector_db = "chromadb"
active_db = "mongo"
debug_int = 3
home_dir = "LLMWARE_WORKSPACE"
path = "data"
chroma_path = "ChromaDB\\data"

account_name = "testaccount"
library_name = "testlib"
file_path = "addresses.csv"
embedding_model_name = "industry-bert-contracts"

mongo_username = "mongousr"
mongo_password = "XXXXX"
mongo_dbname = "llmware_libraries"
mongo_db_uri = "mongodb://" + mongo_username + ":" + mongo_password + "@127.0.0.1:27017/"

def setup():

    LLMWareConfig().set_vector_db(vector_db)
    LLMWareConfig().set_active_db(active_db)
    LLMWareConfig().set_config("debug_mode", debug_int)

    LLMWareConfig().set_home(home_dir)

    LLMWareConfig().set_llmware_path_name(path)
    LLMWareConfig().setup_llmware_workspace()

    MongoConfig.set_config("user_name", mongo_username)
    MongoConfig.set_config("pw", mongo_password)
    MongoConfig.set_config("db_name", mongo_dbname)
    MongoConfig.set_config("db_uri", mongo_db_uri)

    ChromaDBConfig.set_config("persistent_path", chroma_path)

def csv_example():
   
    global library_name, account_name, file_path, embedding_model_name, vector_db

    library = Library().create_new_library(library_name=library_name, account_name=account_name)

    library.add_file(file_path=file_path)

    library.install_new_embedding(embedding_model_name=embedding_model_name, vector_db=vector_db)

if __name__ == "__main__":
    setup()
    csv_example()

Example addresses.csv file contents (though, it happens with all I've tested):

a,b,c,d,e,f,g,h
h,g,f,e,d,c,b,a
z,z,z,z,z,z,z,z
a,a,a,a,a,a,a,a

Bug or user error? :)

@BillJones-SectorFlow
Author

Anyone use this functionality?

@MacOS
Contributor

MacOS commented Mar 6, 2024

Hi @BillJones-SectorFlow,

Thank you for reporting this. Do you have this problem solely with ChromaDB, or also with other vector stores?

@BillJones-SectorFlow
Author

Hi @MacOS,

I'm not positive whether it only happens with ChromaDB, but the exception mentioned above, which occurs for any non-string value, is raised in the create_new_embedding function of the EmbeddingChromaDB class. I haven't yet had a chance to test with different vector stores.

@MacOS
Contributor

MacOS commented Mar 7, 2024

I see, @BillJones-SectorFlow. So it could affect other vector stores too.

@BillJones-SectorFlow
Author

As a test, I moved to postgres for both db types and I no longer get this error, so I do believe it is specific to ChromaDB (or at least it doesn't affect postgres).

@MacOS
Contributor

MacOS commented Mar 8, 2024

🤔 I think it has nothing to do with the vector store, but rather with the text store, because the line in the ChromaDB class that is causing the trouble is the same for Postgres.

Compare ChromaDB here.

text_search = block["text_search"].strip()

With Postgres here.

text_search = block["text_search"].strip()

For some reason, the text collection is returning a list for ChromaDB but not for Postgres. In other words, this line

block["text_search"]

is returning a list in the former case but a string in the latter, which matches what you described above.

In your case, I would try to convert the list to a string.

''.join(block["text_search"])

And the full line change would then be

text_search = ''.join(block["text_search"]).strip()

With that said, I do not understand how this is possible. May I kindly ask you to post a self-contained example that reproduces the error for ChromaDB and Postgres?
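
One caveat on the suggested workaround: the original report describes block["text_search"] as a list of lists (one inner list per CSV row), and str.join raises a TypeError when any item is not a str, so ''.join(...) would fail on that shape. Below is a minimal sketch of a normalizer that handles a str, a flat list, or nested lists. The helper name normalize_text_search and the choice of a single space as separator are assumptions made for illustration; neither is part of llmware.

```python
def normalize_text_search(value):
    """Return a stripped string for a str, a list of str, or nested lists.

    Hypothetical helper for illustration only; not part of llmware.
    """
    if value is None:
        return ""
    if isinstance(value, str):
        return value.strip()
    if isinstance(value, (list, tuple)):
        # Recurse so that a list of rows (list of lists) is flattened too.
        parts = [normalize_text_search(item) for item in value]
        # Separator is an assumption: cells and rows are joined with one space.
        return " ".join(part for part in parts if part)
    # Fall back to the string form for numbers and other scalar values.
    return str(value).strip()


# Example block shaped like the one described in the report (assumed shape).
block = {"text_search": [["a", "b", "c"], ["h", "g", "f"]]}
text_search = normalize_text_search(block["text_search"])
```

This only avoids the AttributeError at that one line. It does not explain why the text store returns a list in the first place.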

@BillJones-SectorFlow
Author

BillJones-SectorFlow commented Mar 11, 2024

I did not test ChromaDB + Postgres, but I did switch from MongoDB to using postgres on both sides and you are correct: I also see a string instead of a list coming in from the text collection store after moving off of MongoDB. So, I believe it might actually be an issue with Mongo (either how it's saving it or how it's retrieved -- unsure which). I've changed the title of the issue to indicate this.
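
To narrow down whether the list appears when the block is saved or when it is retrieved, it may help to count the Python types found under "text_search" for the blocks coming back from each text store. The sketch below assumes the blocks are available as plain dicts; how they are fetched from the library is deliberately left out, and text_search_types is a hypothetical helper, not an llmware function.

```python
from collections import Counter


def text_search_types(blocks):
    """Count the type names of the 'text_search' value across blocks.

    Hypothetical diagnostic helper; `blocks` is assumed to be an
    iterable of dicts as returned by whichever text store is in use.
    """
    return Counter(type(block.get("text_search")).__name__ for block in blocks)


# Stand-in data: one block as Postgres reportedly returns it (str),
# one as MongoDB reportedly returns it for a CSV (list of lists).
sample_blocks = [
    {"text_search": "plain text"},
    {"text_search": [["a", "b"], ["c", "d"]]},
]
counts = text_search_types(sample_blocks)
```

Running the same count against blocks read from MongoDB and from Postgres for the same CSV would show directly which store yields non-string values.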

@BillJones-SectorFlow changed the title from "Creating embedding on ChromaDB when library contains CSV file fails" to "Creating embedding with MongoDB text store when library contains CSV file fails" on Mar 11, 2024