mxbai-embed-large embedding not consistent with original paper #4207

Open
deadbeef84 opened this issue May 6, 2024 · 9 comments
Labels
bug Something isn't working

Comments

@deadbeef84

What is the issue?

I'm trying to use embeddings from mxbai-embed-large to build similarity/semantic search, but the quality of the embeddings coming out of Ollama seems poor.

I've tried replicating the numbers from the original blog post:

import { Ollama } from 'ollama'
import cosineSimilarity from 'compute-cosine-similarity'

const ollama = new Ollama({ host: 'http://127.0.0.1:11434' })

const docs = [
  'Represent this sentence for searching relevant passages: A man is eating a piece of bread',
  'A man is eating food.',
  'A man is eating pasta.',
  'The girl is carrying a baby.',
  'A man is riding a horse.',
]

// The first doc is the query (with the retrieval instruction prefix);
// the rest are the passages to rank against it.
const [queryEmbedding, ...embeddings] = await Promise.all(
  docs.map(
    async (doc) => (await ollama.embeddings({ model: 'mxbai-embed-large', prompt: doc })).embedding
  )
)

const similarities = embeddings.map((e) => cosineSimilarity(queryEmbedding, e))
console.log(similarities)

Output:

[
  0.6231103528590645,
  0.6258446589848462,
  0.5631986516911313,
  0.5891047395895846
]

Those numbers are nowhere near the values reported in the blog post, and if I compare the raw embedding vectors they are completely different.

The JavaScript implementation on Hugging Face produces the same numbers as the original post.
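
For reference, a minimal sketch of that check using @xenova/transformers, reusing the `docs` array from the snippet above. It assumes ONNX weights are published under the mixedbread-ai/mxbai-embed-large-v1 model id and that CLS pooling matches the model card:

import { pipeline, cos_sim } from '@xenova/transformers'

const extractor = await pipeline('feature-extraction', 'mixedbread-ai/mxbai-embed-large-v1')

// Embed all five sentences; the first one is the query.
const output = await extractor(docs, { pooling: 'cls', normalize: true })
const [query, ...rest] = output.tolist()
console.log(rest.map((e) => cos_sim(query, e)))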

OS

Linux, Docker

GPU

Nvidia

CPU

AMD

Ollama version

0.1.33

@deadbeef84 deadbeef84 added the bug Something isn't working label May 6, 2024
@deadbeef84
Author

Same thing with snowflake:

Ollama using snowflake-arctic-embed:137m-m-long-fp16

Query: Represent this sentence for searching relevant passages: A man is eating a piece of bread
0.408 A man is eating food.
0.368 A man is riding a horse.
0.353 A man is eating pasta.
0.259 The girl is carrying a baby.

@xenova/transformers and Snowflake/snowflake-arctic-embed-m-long

Query: Represent this sentence for searching relevant passages: A man is eating a piece of bread
0.581 A man is eating food.
0.538 A man is eating pasta.
0.471 A man is riding a horse.
0.375 The girl is carrying a baby.

I expected these to give the same results.

@deadbeef84
Author

I've now also verified that the embeddings generated by https://github.com/ggerganov/llama.cpp/tree/master/examples/embedding are correct and consistent with the blog post:

./embedding --model ./models/mxbai-embed-large/mxbai-embed-large-v1-f16.gguf --prompt $'Represent this sentence for searching relevant passages: A man is eating a piece of bread\nA man is eating food.\nA man is eating pasta.\nThe girl is carrying a baby.\nA man is riding a horse.'

Output:

embedding 0:  0.031844 -0.020246  0.003061  0.025761 -0.030529  0.007648 -0.003402 -0.006877  0.003626  0.005590  0.021032 -0.048852  0.050770 -0.010658 -0.042844 -0.014537 
embedding 1:  0.018362 -0.016959 -0.009913 -0.000620 -0.031476 -0.012503 -0.004979  0.036731 -0.004214  0.031309  0.030365 -0.014224  0.038043 -0.029713 -0.049113  0.000813 
embedding 2:  0.011478 -0.011224 -0.008358  0.031598 -0.008998 -0.023611 -0.009947  0.029237 -0.000569  0.029407  0.044036 -0.003409  0.034929 -0.028693 -0.053001  0.002418 
embedding 3: -0.025487  0.045029 -0.005886 -0.025535  0.006403  0.000159 -0.009435  0.026796  0.023252  0.004105 -0.019179 -0.007933 -0.007297 -0.007150  0.016169  0.043604 
embedding 4:  0.028173  0.013244  0.045796 -0.018567  0.014471 -0.002285  0.029447  0.018477  0.046593  0.005216  0.031499 -0.007253 -0.030249  0.025316  0.050654 -0.006526 

cosine similarity matrix:

  1.00   0.79   0.64   0.16   0.36 
  0.79   1.00   0.79   0.13   0.38 
  0.64   0.79   1.00   0.17   0.33 
  0.16   0.13   0.17   1.00   0.13 
  0.36   0.38   0.33   0.13   1.00 
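
The matrix is easy to recompute from any backend's raw vectors for cross-checking. A minimal sketch, assuming `vectors` is an array of embedding vectors (one per prompt) parsed from output like the above:

// Plain cosine similarity between two vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    na += a[i] * a[i]
    nb += b[i] * b[i]
  }
  return dot / Math.sqrt(na * nb)
}

const matrix = vectors.map((a) => vectors.map((b) => cosine(a, b).toFixed(2)))
console.table(matrix)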

@deadbeef84
Author

I've now also confirmed the issue is still present in Ollama v0.1.34.

@deadbeef84
Author

Perhaps #3777 is related?

@ActionsPerMinute

I also have this issue on v0.1.35. Following the blog with mxbai-embed-large gives the wrong results; switching to other embedding models gives the correct results.

@VMinB12

VMinB12 commented May 11, 2024

I can also confirm that Ollama embeddings for snowflake-arctic-embed:137m-m-long-fp16 are not behaving as expected. I set up a synthetic benchmark for internal testing: I take 500 articles and use an LLM to generate a question for each article, then retrieve against each question and check whether the top-ranked article is the one the question was generated from (a sketch of this metric follows the results below).
I'm using LangChain and get the following results:

# embeddings = OpenAIEmbeddings(model="text-embedding-3-small") # 0.8709677419354839
# embeddings = OpenAIEmbeddings(model="text-embedding-3-large") # 0.8951612903225806
# embeddings = OllamaEmbeddings(model="nomic-embed-text")  # 0.7842741935483871
# embeddings = OllamaEmbeddings(
#    model="nomic-embed-text",
#    query_instruction="",
#    embed_instruction="",
#    num_ctx=8192,
#    temperature=0,
#)  # 0.8185483870967742
# embeddings = OllamaEmbeddings(model="mxbai-embed-large")  # 0.6653225806451613
# embeddings = OllamaEmbeddings(model="all-minilm")  # 0.45564516129032256
# embeddings = OllamaEmbeddings(model="snowflake-arctic-embed")  # 0.14516129032258066
# embeddings = OllamaEmbeddings(
#     model="snowflake-arctic-embed:137m-m-long-fp16",
#     query_instruction="",
#     embed_instruction="",
#     num_ctx=8192,
#     temperature=0,
# )  # 0.06854838709677419
# OllamaEmbeddings(model="snowflake-arctic-embed:137m-m-long-fp16") # 0.07661290322580645

I'm not sure whether the nomic-embed-text result is in line with expectations. If it is, that could indicate the problem is not with Ollama itself, but rather with the model weights of the snowflake and mxbai models.
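
For context, the top-1 metric in that benchmark boils down to the following. A minimal sketch, in JavaScript for consistency with the earlier snippets; the `pairs` data and the `embed` function are placeholders, not the actual LangChain benchmark code:

import cosineSimilarity from 'compute-cosine-similarity'

// `pairs` is an array of { article, question } objects; `embed` maps a string
// to an embedding vector using whichever backend is being evaluated.
async function top1Accuracy(pairs, embed) {
  const articleVecs = await Promise.all(pairs.map((p) => embed(p.article)))
  let hits = 0
  for (let i = 0; i < pairs.length; i++) {
    const queryVec = await embed(pairs[i].question)
    const scores = articleVecs.map((v) => cosineSimilarity(queryVec, v))
    // A hit means the best-scoring article is the one the question was generated from.
    if (scores.indexOf(Math.max(...scores)) === i) hits++
  }
  return hits / pairs.length
}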

Edit: My ollama version is 0.1.28

@deadbeef84
Author

Similarities after PR #4399:

mxbai-embed-large

Query: Represent this sentence for searching relevant passages: A man is eating a piece of bread
0.791 A man is eating food.
0.636 A man is eating pasta.
0.360 A man is riding a horse.
0.163 The girl is carrying a baby.

snowflake-arctic-embed:137m-m-long-fp16

Query: Represent this sentence for searching relevant passages: A man is eating a piece of bread
0.581 A man is eating food.
0.538 A man is eating pasta.
0.471 A man is riding a horse.
0.375 The girl is carrying a baby.

@fredrik-smedberg

fredrik-smedberg commented May 14, 2024

Thanks @deadbeef84 for creating this issue. I've been scratching my head over the weekend, having switched between different embedding models with very odd results. Good to know I wasn't crazy and that Ollama actually was a bit broken. I'll update Ollama and see if my ChromaDB <-> Ollama experiment runs better.

Update: I downloaded and built #4399. I can confirm it indeed fixes the obvious issues I had when embedding and querying with the mxbai-embed-large and nomic-embed-text models.
I can't tell whether the PR introduces other issues, but it definitely solved the embedding problem.

@hazelwolf

@deadbeef84 thanks for the fix, this solves the issue.
