
Chroma VectorBase Use "L2" as Similarity Measure Rather than Cosine #21599

Open · DragonMengLong opened this issue May 13, 2024 · 7 comments
Labels: 🤖:bug · 🔌: chroma · Ɑ: vector store

DragonMengLong commented May 13, 2024

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from langchain_chroma import Chroma

vectorstore = Chroma(
    persist_directory=persist_dir,
    embedding_function=embeddings,
)
docs_and_scores = vectorstore.similarity_search_with_score(query=user_query)
for doc, score in docs_and_scores:
    print(score)

Error Message and Stack Trace (if applicable)

No response

Description

The LangChain documentation says that Chroma uses cosine similarity to measure distance by default, but I found that it actually uses L2 distance. If you debug and step into the Chroma DB code, you can see that the default distance_fn is l2.
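One way to confirm which distance function a collection uses (a hedged debugging sketch; _collection is a private attribute of the LangChain Chroma wrapper, so use it for inspection only):

# If "hnsw:space" is absent from the collection metadata, Chroma falls
# back to its default distance function, which is "l2".
metadata = vectorstore._collection.metadata
print((metadata or {}).get("hnsw:space", "l2"))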

System Info

langchain==0.1.17
langchain-chroma==0.1.0
langchain-community==0.0.37
langchain-core==0.1.52
langchain-text-splitters==0.0.1
chroma-hnswlib==0.7.3
chromadb==0.4.24

@DragonMengLong (Author)

The distance function is determined by the collection metadata, and if the collection already exists (as when loading from disk), the metadata is the same as the metadata it was saved with. So to use cosine distance we need to specify the metadata like this:

db = Chroma.from_documents(
    documents,
    embeddings,
    persist_directory=persist_dir,
    collection_metadata={"hnsw:space": "cosine"},
)
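A quick sanity check after creating the store (a sketch assuming access to the wrapper's private _collection attribute; the metric is stored with the collection, so it survives a reload from disk):

print(db._collection.metadata)  # expected: {"hnsw:space": "cosine"}

# Reloading from the same persist_directory keeps the saved metric.
db_reloaded = Chroma(persist_directory=persist_dir, embedding_function=embeddings)
print(db_reloaded._collection.metadata)  # still {"hnsw:space": "cosine"}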

klaudialemiec commented May 20, 2024

@DragonMengLong I don't think that adding collection_metadata={"hnsw:space": "cosine"} solves the issue. Please take a look at this example:

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.docstore.document import Document
from langchain.vectorstores import Chroma

model = HuggingFaceEmbeddings(model_name="gtr-t5-base")

docs_txt = [
    "what is parameterized complexity theory?", 
    "what is the complexity of parameterized complexity theory?", 
    "what is parameter?", 
    "I like cats!", 
    "what is universe?"
]
docs = [Document(page_content=doc, metadata={}) for doc in docs_txt]

db = Chroma.from_documents(
    docs,
    model,
    persist_directory="data/test",
    collection_name="docs",
    collection_metadata={"hnsw:space": "cosine"}
)

docs_and_scores = db.similarity_search_with_score(query="Parameters", k=1)

for doc, score in docs_and_scores:
    print(score)
    print(doc)

output:
0.2922315249599653
page_content='what is parameter?'

On the other hand, calculating cosine similarity with sentence_transformers returns a different value:

from sentence_transformers.util import cos_sim

query = model.embed_query("Parameters")
text = model.embed_query('what is parameter?')

cos_sim(query, text)

output:
tensor([[0.8539]])

@DragonMengLong (Author)

@klaudialemiec I think you should use model.embed_documents(['what is parameter?']) to get the embedding of the document, because for some embedding models the embedding method for documents and queries differs; for example, a model may prepend a different prompt to the input.
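A quick way to check whether a given model embeds queries and documents differently (a minimal sketch, assuming numpy is installed):

import numpy as np

text = "what is parameter?"
q = np.array(model.embed_query(text))
d = np.array(model.embed_documents([text])[0])

# True if the model applies the same encoding to queries and documents;
# False for models that, e.g., prepend a task-specific prompt to queries.
print(np.allclose(q, d))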

klaudialemiec commented May 20, 2024

@DragonMengLong I did as you suggested. The result is the same:

from sentence_transformers.util import cos_sim

query = model.embed_query("Parameters")
text = model.embed_documents(['what is parameter?'])

cos_sim(query, text)

output: tensor([[0.8539]])

Meanwhile, I've run one more test with the ChromaDB library:

import chromadb
from chromadb.utils import embedding_functions

embedding = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="gtr-t5-base")
chroma_client = chromadb.Client()

collection = chroma_client.create_collection(
    name="my_collection", 
    embedding_function=embedding,
    metadata={"hnsw:space": "cosine"}
)

collection.add(
    documents = [
        "what is parameterized complexity theory?", 
        "what is the complexity of parameterized complexity theory?", 
        "what is parameter?", 
        "I like cats!", 
        "what is universe?"
    ],
    ids=["id1", "id2", "id3", "id4", "id5"]
)

results = collection.query(
    query_texts=["Parameters"],
    n_results=1
)
print(results)

{'ids': [['id3']], 'distances': [[0.14611518383026123]], 'metadatas': [[None]], 'embeddings': None, 'documents': [['what is parameter?']], 'uris': None, 'data': None}

The distance calculated by Chroma makes sense: it is the cosine distance, i.e. one minus the sentence-transformers cosine similarity (1 - 0.8539 = 0.1461).
The above code is basically copied from the Chroma documentation.
Resources: https://docs.trychroma.com/guides#changing-the-distance-function, https://docs.trychroma.com/guides/embeddings
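For reference, the relationship can be verified directly (a sketch reusing query and text from the snippet above):

from sentence_transformers.util import cos_sim

similarity = cos_sim(query, text).item()  # ~0.8539
print(1 - similarity)                     # ~0.1461, matching Chroma's cosine distance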

Additionally, when I remove the line metadata={"hnsw:space": "cosine"} from create_collection (so Chroma uses its default distance metric, l2), I get a distance of 0.2922315299510956 for the same query and most similar document.

collection = chroma_client.create_collection(
    name="my_collection2", 
    embedding_function=embedding,
)

collection.add(
    documents = [
        "what is parameterized complexity theory?", 
        "what is the complexity of parameterized complexity theory?", 
        "what is parameter?", 
        "I like cats!", 
        "what is universe?"
    ],
    ids=["id1", "id2", "id3", "id4", "id5"]
)

results = collection.query(
    query_texts=["Parameters"],
    n_results=1
)
print(results)

output: {'ids': [['id3']], 'distances': [[0.2922315299510956]], 'metadatas': [[None]], 'embeddings': None, 'documents': [['what is parameter?']], 'uris': None, 'data': None}

@klaudialemiec (Contributor)

I would be more than happy to resolve the bug as my first PR to langchain :)

@DragonMengLong (Author)

@klaudialemiec :)

@klaudialemiec (Contributor)

Now I see what the problem was!
The implementation is fine - the problem I ran into was caching.

What happened:
The first time I created the collection, I didn't define a distance metric, so Chroma's default (l2) was assigned. Later, I created a collection with the same name once again and set a different metric. But the collection already existed, so the new documents were added to the existing collection and the new distance setting was ignored.

The correct way of testing is to delete the collection and create it again, for example:
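A minimal sketch (it reuses the chroma_client from the earlier test and assumes chromadb 0.4.x, where delete_collection raises ValueError if the collection does not exist):

# Drop the stale collection so the new metric actually takes effect.
try:
    chroma_client.delete_collection(name="my_collection")
except ValueError:
    pass  # the collection did not exist yet

collection = chroma_client.create_collection(
    name="my_collection",
    embedding_function=embedding,
    metadata={"hnsw:space": "cosine"},  # applied, since the collection is fresh
)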

Nevertheless, I believe it would be helpful for new users to add information to the documentation on how to set the distance metric (see DragonMengLong's first reply to this issue).
