
Chroma VectorBase Use "L2" as Similarity Measure Rather than Cosine #21599

Open · DragonMengLong opened this issue May 13, 2024 · 7 comments
Labels: 🤖:bug · 🔌: chroma · Ɑ: vector store

DragonMengLong commented May 13, 2024

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from langchain_chroma import Chroma

vectorstore = Chroma(
    persist_directory=persist_dir,
    embedding_function=embeddings,
)
docs_and_scores = vectorstore.similarity_search_with_score(query=user_query)
for doc, score in docs_and_scores:
    print(score)

Error Message and Stack Trace (if applicable)

No response

Description

The LangChain documentation says that Chroma uses cosine similarity to measure distance by default, but I found that it actually uses L2 distance. If you debug and step into the Chroma DB code, you can see that the default distance_fn is l2.
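One way to confirm which distance function a collection uses (a hedged debugging sketch; _collection is a private attribute of the LangChain Chroma wrapper, so use it for inspection only):

# If "hnsw:space" is absent from the collection metadata, Chroma falls
# back to its default distance function, which is "l2".
metadata = vectorstore._collection.metadata
print((metadata or {}).get("hnsw:space", "l2"))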

System Info

langchain==0.1.17
langchain-chroma==0.1.0
langchain-community==0.0.37
langchain-core==0.1.52
langchain-text-splitters==0.0.1
chroma-hnswlib==0.7.3
chromadb==0.4.24

@DragonMengLong (Author)

The distance function is determined by the collection metadata, and if the collection already exists (as when loading from disk), the metadata is the same as the metadata it was saved with. So to use cosine distance we need to specify the metadata like this:

db = Chroma.from_documents(
    documents,
    embeddings,
    persist_directory=persist_dir,
    collection_metadata={"hnsw:space": "cosine"},
)
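A quick sanity check after creating the store (a sketch assuming access to the wrapper's private _collection attribute; the metric is stored with the collection, so it survives a reload from disk):

print(db._collection.metadata)  # expected: {"hnsw:space": "cosine"}

# Reloading from the same persist_directory keeps the saved metric.
db_reloaded = Chroma(persist_directory=persist_dir, embedding_function=embeddings)
print(db_reloaded._collection.metadata)  # still {"hnsw:space": "cosine"}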

klaudialemiec commented May 20, 2024

@DragonMengLong I don't think that adding collection_metadata={"hnsw:space": "cosine"} solves the issue. Please take a look at this example:

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.docstore.document import Document
from langchain.vectorstores import Chroma

model = HuggingFaceEmbeddings(model_name="gtr-t5-base")

docs_txt = [
    "what is parameterized complexity theory?", 
    "what is the complexity of parameterized complexity theory?", 
    "what is parameter?", 
    "I like cats!", 
    "what is universe?"
]
docs = [Document(page_content=doc, metadata={}) for doc in docs_txt]

db = Chroma.from_documents(
    docs,
    model,
    persist_directory="data/test",
    collection_name="docs",
    collection_metadata={"hnsw:space": "cosine"}
)

docs_and_scores = db.similarity_search_with_score(query="Parameters", k=1)

for doc, score in docs_and_scores:
    print(score)
    print(doc)

output:
0.2922315249599653
page_content='what is parameter?'

On the other hand, calculating cosine similarity with sentence_transformers returns a different value:

from sentence_transformers.util import cos_sim

query = model.embed_query("Parameters")
text = model.embed_query('what is parameter?')

cos_sim(query, text)

output:
tensor([[0.8539]])

@DragonMengLong (Author)

@klaudialemiec I think you should use model.embed_documents(['what is parameter?']) to get the embedding of the document, because for some embedding models the embedding method for documents and queries differs; for example, a model may prepend a different prompt to the input.
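A quick way to check whether a given model embeds queries and documents differently (a minimal sketch, assuming numpy is installed):

import numpy as np

text = "what is parameter?"
q = np.array(model.embed_query(text))
d = np.array(model.embed_documents([text])[0])

# True if the model applies the same encoding to queries and documents;
# False for models that, e.g., prepend a task-specific prompt to queries.
print(np.allclose(q, d))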

klaudialemiec commented May 20, 2024

@DragonMengLong I did as you suggested. The result is the same:

from sentence_transformers.util import cos_sim

query = model.embed_query("Parameters")
text = model.embed_documents(['what is parameter?'])

cos_sim(query, text)

output: tensor([[0.8539]])

Meanwhile, I've run one more test with the ChromaDB library:

import chromadb
from chromadb.utils import embedding_functions

embedding = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="gtr-t5-base")
chroma_client = chromadb.Client()

collection = chroma_client.create_collection(
    name="my_collection", 
    embedding_function=embedding,
    metadata={"hnsw:space": "cosine"}
)

collection.add(
    documents = [
        "what is parameterized complexity theory?", 
        "what is the complexity of parameterized complexity theory?", 
        "what is parameter?", 
        "I like cats!", 
        "what is universe?"
    ],
    ids=["id1", "id2", "id3", "id4", "id5"]
)

results = collection.query(
    query_texts=["Parameters"],
    n_results=1
)
print(results)

{'ids': [['id3']], 'distances': [[0.14611518383026123]], 'metadatas': [[None]], 'embeddings': None, 'documents': [['what is parameter?']], 'uris': None, 'data': None}

The distance calculated by Chroma makes sense: it is the cosine distance, i.e. one minus the sentence-transformers cosine similarity (1 - 0.8539 = 0.1461).
The above code is basically copied from the Chroma documentation.
Resources: https://docs.trychroma.com/guides#changing-the-distance-function, https://docs.trychroma.com/guides/embeddings
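For reference, the relationship can be verified directly (a sketch reusing query and text from the snippet above):

from sentence_transformers.util import cos_sim

similarity = cos_sim(query, text).item()  # ~0.8539
print(1 - similarity)                     # ~0.1461, matching Chroma's cosine distance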

Additionally, when I remove the line metadata={"hnsw:space": "cosine"} from create_collection (so Chroma uses its default distance metric, l2), I get a distance of 0.2922315299510956 for the same query and most similar document.

collection = chroma_client.create_collection(
    name="my_collection2", 
    embedding_function=embedding,
)

collection.add(
    documents = [
        "what is parameterized complexity theory?", 
        "what is the complexity of parameterized complexity theory?", 
        "what is parameter?", 
        "I like cats!", 
        "what is universe?"
    ],
    ids=["id1", "id2", "id3", "id4", "id5"]
)

results = collection.query(
    query_texts=["Parameters"],
    n_results=1
)
print(results)

output: {'ids': [['id3']], 'distances': [[0.2922315299510956]], 'metadatas': [[None]], 'embeddings': None, 'documents': [['what is parameter?']], 'uris': None, 'data': None}

@klaudialemiec (Contributor)

I would be more than happy to resolve the bug as my first PR to langchain :)

@DragonMengLong (Author)

@klaudialemiec :)

@klaudialemiec (Contributor)

Now I see what the problem was!
The implementation is fine - the problem I ran into was caching.

What happened:
The first time I created the collection, I didn't define a distance metric, so Chroma's default (l2) was assigned. Later, I created a collection with the same name once again and set a different metric. But the collection already existed, so the new documents were added to the existing collection and the new distance setting was ignored.

The correct way of testing is to delete the collection and create it again, for example:
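A minimal sketch (it reuses the chroma_client from the earlier test and assumes chromadb 0.4.x, where delete_collection raises ValueError if the collection does not exist):

# Drop the stale collection so the new metric actually takes effect.
try:
    chroma_client.delete_collection(name="my_collection")
except ValueError:
    pass  # the collection did not exist yet

collection = chroma_client.create_collection(
    name="my_collection",
    embedding_function=embedding,
    metadata={"hnsw:space": "cosine"},  # applied, since the collection is fresh
)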

Nevertheless, I believe it would be helpful for new users to add information to the documentation on how to set the distance metric (see DragonMengLong's first reply to this issue).
