Chroma VectorBase Use "L2" as Similarity Measure Rather than Cosine #21599
Comments
The distance function is determined by the collection metadata. If the collection already exists (for example, when loading from disk), its metadata is the same as it was when the collection was saved to disk. So to use cosine distance, we need to specify the metadata like this: db = Chroma.from_documents(documents, embeddings, persist_directory=persist_dir, collection_metadata={"hnsw:space": "cosine"})
@DragonMengLong I don't think that adding
On the other hand, calculating cosine similarity with sentence_transformers returns a different value:
@klaudialemiec I think you should use model.embed_documents(['what is parameter?']) to get the embedding of the document, because with some embedding models the embedding method for documents and queries differs. For example, the model may add a different prompt before the input.
@DragonMengLong I did as you suggested. The result is the same:
Meanwhile, I ran one more test with the ChromaDB library:
The distance calculated with Chroma makes sense, as it returns cosine distance, while sentence_transformers returns cosine similarity (1 - 0.8539 = 0.1461). Additionally, when I remove the line
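To make the arithmetic in the comment above concrete, here is a minimal sketch of how cosine similarity and cosine distance relate. It uses only the standard library. The function names and the small test vectors are illustrative and do not come from Chroma or sentence_transformers. The value 0.8539 is the similarity quoted in the thread.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Cosine distance, defined here as 1 minus cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


# Illustrative vectors (not taken from the issue).
a = [1.0, 0.0]
b = [1.0, 1.0]
sim = cosine_similarity(a, b)
dist = cosine_distance(a, b)

# The conversion quoted in the thread: similarity 0.8539 -> distance 0.1461.
similarity_from_thread = 0.8539
distance_from_thread = round(1.0 - similarity_from_thread, 4)
print(sim, dist, distance_from_thread)
```

A library that reports similarity and one that reports distance can therefore agree even though the printed numbers differ.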
I would be more than happy to resolve the bug as my first PR to langchain :)
Now I see where the problem was! What happened: The correct way of testing should include deleting the collection and creating it again. Nevertheless, I believe it would be helpful for new users if the documentation explained how to set the distance metric (see DragonMengLong's first reply to this issue).
Checked other resources
Example Code
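from langchain_chroma import Chroma  # assumed import; langchain-chroma is listed under System Info

# `persist_dir`, `embeddings`, and `user_query` are defined elsewhere by the reporter.
vectorstore = Chroma(
    persist_directory=persist_dir,
    embedding_function=embeddings,
)
docs_and_scores = vectorstore.similarity_search_with_score(query=user_query)
for doc, score in docs_and_scores:
    print(score)  # raw distance score returned by the store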
Error Message and Stack Trace (if applicable)
No response
Description
The LangChain documentation says Chroma uses cosine to measure distance by default, but I found that it actually uses L2 distance. If we debug and step into the Chroma DB code, we can see that the default distance_fn is l2.
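The following sketch illustrates why the two metrics give different scores for the same pair of embeddings. It assumes that the "l2" space reports squared Euclidean distance, which is my understanding of Chroma's documentation rather than something stated in this issue. The helper names and vectors are hypothetical.

```python
import math


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    """Squared Euclidean distance (assumed to match the "l2" space)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 minus cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Illustrative unit-length vectors (not taken from the issue).
a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

l2_score = squared_l2(a, b)
cos_score = cosine_distance(a, b)
print(l2_score, cos_score)
```

For unit-length vectors, squared L2 distance equals twice the cosine distance, so the ranking of results is the same and only the scores differ. For embeddings that are not normalized, the two metrics can also rank results differently.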
System Info
langchain==0.1.17
langchain-chroma==0.1.0
langchain-community==0.0.37
langchain-core==0.1.52
langchain-text-splitters==0.0.1
chroma-hnswlib==0.7.3
chromadb==0.4.24