mxbai-embed-large embedding not consistent with original paper #4207

Open
deadbeef84 opened this issue May 6, 2024 · 9 comments
Labels
bug Something isn't working

Comments

@deadbeef84

What is the issue?

I'm trying to use embeddings from mxbai-embed-large to build similarity/semantic search, but the quality of the embeddings coming out of Ollama seems poor.

I've tried replicating the numbers from the original blog post:

import { Ollama } from 'ollama'
import cosineSimilarity from 'compute-cosine-similarity'

const ollama = new Ollama({ host: 'http://127.0.0.1:11434' })

const docs = [
  'Represent this sentence for searching relevant passages: A man is eating a piece of bread',
  'A man is eating food.',
  'A man is eating pasta.',
  'The girl is carrying a baby.',
  'A man is riding a horse.',
]

// The first doc is the query (with the retrieval instruction prefix);
// the rest are the passages to rank against it.
const [queryEmbedding, ...embeddings] = await Promise.all(
  docs.map(
    async (doc) => (await ollama.embeddings({ model: 'mxbai-embed-large', prompt: doc })).embedding
  )
)

const similarities = embeddings.map((e) => cosineSimilarity(queryEmbedding, e))
console.log(similarities)

Output:

[
  0.6231103528590645,
  0.6258446589848462,
  0.5631986516911313,
  0.5891047395895846
]

Those numbers are nowhere near the values reported in the blog post, and if I compare the raw embedding vectors they are completely different.

The JavaScript implementation on Hugging Face produces the same numbers as the original post.
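
For reference, a minimal sketch of that check using @xenova/transformers, reusing the `docs` array from the snippet above. It assumes ONNX weights are published under the mixedbread-ai/mxbai-embed-large-v1 model id and that CLS pooling matches the model card:

import { pipeline, cos_sim } from '@xenova/transformers'

const extractor = await pipeline('feature-extraction', 'mixedbread-ai/mxbai-embed-large-v1')

// Embed all five sentences; the first one is the query.
const output = await extractor(docs, { pooling: 'cls', normalize: true })
const [query, ...rest] = output.tolist()
console.log(rest.map((e) => cos_sim(query, e)))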

OS

Linux, Docker

GPU

Nvidia

CPU

AMD

Ollama version

0.1.33

@deadbeef84 deadbeef84 added the bug Something isn't working label May 6, 2024
@deadbeef84
Author

Same thing with snowflake:

Ollama using snowflake-arctic-embed:137m-m-long-fp16

Query: Represent this sentence for searching relevant passages: A man is eating a piece of bread
0.408 A man is eating food.
0.368 A man is riding a horse.
0.353 A man is eating pasta.
0.259 The girl is carrying a baby.

@xenova/transformers and Snowflake/snowflake-arctic-embed-m-long

Query: Represent this sentence for searching relevant passages: A man is eating a piece of bread
0.581 A man is eating food.
0.538 A man is eating pasta.
0.471 A man is riding a horse.
0.375 The girl is carrying a baby.

I expected these to give the same results.

@deadbeef84
Author

I've now also verified that the embeddings generated by https://github.com/ggerganov/llama.cpp/tree/master/examples/embedding are correct and consistent with the blog post:

./embedding --model ./models/mxbai-embed-large/mxbai-embed-large-v1-f16.gguf --prompt $'Represent this sentence for searching relevant passages: A man is eating a piece of bread\nA man is eating food.\nA man is eating pasta.\nThe girl is carrying a baby.\nA man is riding a horse.'

Output:

embedding 0:  0.031844 -0.020246  0.003061  0.025761 -0.030529  0.007648 -0.003402 -0.006877  0.003626  0.005590  0.021032 -0.048852  0.050770 -0.010658 -0.042844 -0.014537 
embedding 1:  0.018362 -0.016959 -0.009913 -0.000620 -0.031476 -0.012503 -0.004979  0.036731 -0.004214  0.031309  0.030365 -0.014224  0.038043 -0.029713 -0.049113  0.000813 
embedding 2:  0.011478 -0.011224 -0.008358  0.031598 -0.008998 -0.023611 -0.009947  0.029237 -0.000569  0.029407  0.044036 -0.003409  0.034929 -0.028693 -0.053001  0.002418 
embedding 3: -0.025487  0.045029 -0.005886 -0.025535  0.006403  0.000159 -0.009435  0.026796  0.023252  0.004105 -0.019179 -0.007933 -0.007297 -0.007150  0.016169  0.043604 
embedding 4:  0.028173  0.013244  0.045796 -0.018567  0.014471 -0.002285  0.029447  0.018477  0.046593  0.005216  0.031499 -0.007253 -0.030249  0.025316  0.050654 -0.006526 

cosine similarity matrix:

  1.00   0.79   0.64   0.16   0.36 
  0.79   1.00   0.79   0.13   0.38 
  0.64   0.79   1.00   0.17   0.33 
  0.16   0.13   0.17   1.00   0.13 
  0.36   0.38   0.33   0.13   1.00 
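
The matrix is easy to recompute from any backend's raw vectors for cross-checking. A minimal sketch, assuming `vectors` is an array of embedding vectors (one per prompt) parsed from output like the above:

// Plain cosine similarity between two vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    na += a[i] * a[i]
    nb += b[i] * b[i]
  }
  return dot / Math.sqrt(na * nb)
}

const matrix = vectors.map((a) => vectors.map((b) => cosine(a, b).toFixed(2)))
console.table(matrix)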

@deadbeef84
Author

I've now also confirmed the issue is still present in Ollama v0.1.34.

@deadbeef84
Author

Perhaps #3777 is related?

@ActionsPerMinute

I also have this issue on v0.1.35. Following the blog with mxbai-embed-large gives the wrong results; switching to other embedding models gives the correct results.

@VMinB12

VMinB12 commented May 11, 2024

I can also confirm that Ollama embeddings for snowflake-arctic-embed:137m-m-long-fp16 are not behaving as expected. I set up a synthetic benchmark for internal testing: I take 500 articles and use an LLM to generate a question for each article, then retrieve against each question and check whether the top-ranked article is the one the question was generated from (a sketch of this metric follows the results below).
I'm using LangChain and get the following results:

# embeddings = OpenAIEmbeddings(model="text-embedding-3-small") # 0.8709677419354839
# embeddings = OpenAIEmbeddings(model="text-embedding-3-large") # 0.8951612903225806
# embeddings = OllamaEmbeddings(model="nomic-embed-text")  # 0.7842741935483871
# embeddings = OllamaEmbeddings(
#    model="nomic-embed-text",
#    query_instruction="",
#    embed_instruction="",
#    num_ctx=8192,
#    temperature=0,
#)  # 0.8185483870967742
# embeddings = OllamaEmbeddings(model="mxbai-embed-large")  # 0.6653225806451613
# embeddings = OllamaEmbeddings(model="all-minilm")  # 0.45564516129032256
# embeddings = OllamaEmbeddings(model="snowflake-arctic-embed")  # 0.14516129032258066
# embeddings = OllamaEmbeddings(
#     model="snowflake-arctic-embed:137m-m-long-fp16",
#     query_instruction="",
#     embed_instruction="",
#     num_ctx=8192,
#     temperature=0,
# )  # 0.06854838709677419
# OllamaEmbeddings(model="snowflake-arctic-embed:137m-m-long-fp16") # 0.07661290322580645

I'm not sure whether the nomic-embed-text result is in line with expectations. If it is, that could indicate the problem is not with Ollama itself, but rather with the model weights of the snowflake and mxbai models.
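
For context, the top-1 metric in that benchmark boils down to the following. A minimal sketch, in JavaScript for consistency with the earlier snippets; the `pairs` data and the `embed` function are placeholders, not the actual LangChain benchmark code:

import cosineSimilarity from 'compute-cosine-similarity'

// `pairs` is an array of { article, question } objects; `embed` maps a string
// to an embedding vector using whichever backend is being evaluated.
async function top1Accuracy(pairs, embed) {
  const articleVecs = await Promise.all(pairs.map((p) => embed(p.article)))
  let hits = 0
  for (let i = 0; i < pairs.length; i++) {
    const queryVec = await embed(pairs[i].question)
    const scores = articleVecs.map((v) => cosineSimilarity(queryVec, v))
    // A hit means the best-scoring article is the one the question was generated from.
    if (scores.indexOf(Math.max(...scores)) === i) hits++
  }
  return hits / pairs.length
}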

Edit: My ollama version is 0.1.28

@deadbeef84
Author

Similarities after PR #4399:

mxbai-embed-large

Query: Represent this sentence for searching relevant passages: A man is eating a piece of bread
0.791 A man is eating food.
0.636 A man is eating pasta.
0.360 A man is riding a horse.
0.163 The girl is carrying a baby.

snowflake-arctic-embed:137m-m-long-fp16

Query: Represent this sentence for searching relevant passages: A man is eating a piece of bread
0.581 A man is eating food.
0.538 A man is eating pasta.
0.471 A man is riding a horse.
0.375 The girl is carrying a baby.

@fredrik-smedberg

fredrik-smedberg commented May 14, 2024

Thanks @deadbeef84 for creating this issue. I've been scratching my head over the weekend, having switched between different embedding models with very odd results. Good to know I wasn't crazy and that Ollama actually was a bit broken. I'll update Ollama and see if my ChromaDB <-> Ollama experiment runs better.

Update: I downloaded and built #4399. I can confirm it indeed fixes the obvious issues I had when embedding and querying with the mxbai-embed-large and nomic-embed-text models.
I can't tell whether the PR introduces other issues, but it definitely solved the embedding problem.

@hazelwolf

@deadbeef84 thanks for the fix, this solves the issue.
