
Different tokenization than AutoTokenizer when word is adjacent to non-special added token #7049

Open
JohanAR opened this issue May 2, 2024 · 7 comments


JohanAR commented May 2, 2024

llama.cpp commit: 6ecf318 (current master)

>>> from transformers import AutoTokenizer
>>> t = AutoTokenizer.from_pretrained("NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO")
>>> t("<|im_start|>")
{'input_ids': [1, 32001], 'attention_mask': [1, 1]}
>>> t("user")
{'input_ids': [1, 2188], 'attention_mask': [1, 1]}
>>> t("<|im_start|>user")
{'input_ids': [1, 32001, 2188], 'attention_mask': [1, 1, 1]}

llama.cpp example server with https://huggingface.co/RichardErkhov/NousResearch_-_Nous-Hermes-2-Mixtral-8x7B-DPO-gguf/blob/main/Nous-Hermes-2-Mixtral-8x7B-DPO.Q3_K_M.gguf

$ curl --header "Content-Type: application/json" --request POST http://127.0.0.1:8080/tokenize --data '{"content": "<|im_start|>"}'
{"tokens":[32001]}
$ curl --header "Content-Type: application/json" --request POST http://127.0.0.1:8080/tokenize --data '{"content": "user"}'
{"tokens":[2188]}
$ curl --header "Content-Type: application/json" --request POST http://127.0.0.1:8080/tokenize --data '{"content": "<|im_start|>user"}'
{"tokens":[32001,1838]}

llama.cpp changes the tokenization of the word "user" when it comes directly after the added token, whereas AutoTokenizer does not.

Could it be relevant that <|im_start|> is not a special token in this model? (See its tokenizer_config.json.)

A different model where <|im_start|> is a special token does not have this behaviour. (See that model's tokenizer_config.json.)
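As a side note, the "special" flag lives in the added_tokens_decoder section of tokenizer_config.json, so it can be checked directly. A minimal sketch, using an illustrative excerpt shaped like the HF format (the id and flag below mirror the report above, not the full file):

```python
import json

# Illustrative excerpt shaped like HF's tokenizer_config.json;
# only the fields relevant here are included.
cfg = json.loads("""
{
  "added_tokens_decoder": {
    "32001": {"content": "<|im_start|>", "special": false}
  }
}
""")

# Map each added token's text to its "special" flag.
specials = {
    info["content"]: info["special"]
    for info in cfg["added_tokens_decoder"].values()
}
print(specials)
```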


Jeximo commented May 3, 2024

> Could it be relevant that <|im_start|> is not a special token in this model? tokenizer_config.json.

I'm able to reproduce your results.

Model where <|im_start|> has "special": false in its tokenizer_config.json:

curl --header "Content-Type: application/json" --request POST http://127.0.0.1:8080/tokenize --data '{"content": "<|im_start|>"}'
{"tokens":[32001]}  

curl --header "Content-Type: application/json" --request POST http://127.0.0.1:8080/tokenize --data '{"content": "user"}'
{"tokens":[2188]}

curl --header "Content-Type: application/json" --request POST http://127.0.0.1:8080/tokenize --data '{"content": "<|im_start|>user"}'
{"tokens":[32001,1838]}

A different model, where <|im_start|> has "special": true in its tokenizer_config.json:

curl --header "Content-Type: application/json" --request POST http://127.0.0.1:8080/tokenize --data '{"content": "<|im_start|>"}'
{"tokens":[128002]}

curl --header "Content-Type: application/json" --request POST http://127.0.0.1:8080/tokenize --data '{"content": "user"}'      
{"tokens":[882]}

curl --header "Content-Type: application/json" --request POST http://127.0.0.1:8080/tokenize --data '{"content": "<|im_start|>user"}'
{"tokens":[128002,882]}

The "special": false setting on <|im_start|> in tokenizer_config.json is why it's tokenized differently.

@JohanAR JohanAR changed the title Different tokenization than AutoTokenizer adjacent to added token (ChatML) Different tokenization than AutoTokenizer when word is adjacent to non-special added token May 3, 2024

JohanAR commented May 3, 2024

@Jeximo thanks! Updating the issue title to reflect this.


turian commented May 6, 2024

I have also seen this issue: passing HF-tokenized text into llama.cpp gives good perplexity, but llama.cpp-tokenized text gives bad perplexity.

@ggerganov (Owner)

There is logic in llama.cpp to add a space prefix if the first token is not special:

llama.cpp/llama.cpp, lines 12690 to 12695 in 947d3ad:

    auto raw_text = fragment.raw_text.substr(fragment.offset, fragment.length);
    if (&fragment == &fragment_buffer.front()) {
        if (vocab.add_space_prefix) {
            raw_text = " " + raw_text; // prefix with space if the first token is not special
        }
    }

      "user": 1838,
      "▁user": 2188,

That's why "user" on its own tokenizes to [2188] instead of [1838].

If <|im_start|> is not special, then you need to add the space manually: "<|im_start|> user" should tokenize to [32001, 2188].
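To make the mechanism concrete, here is a minimal pure-Python sketch of the two behaviors, ignoring the BOS token and using a toy two-entry vocab (not the real SentencePiece model). Both tokenizers split the input on added tokens; the difference modeled here is that llama.cpp prefixes a space only to the first fragment of the whole string, while the AutoTokenizer behavior observed above effectively prefixes each text fragment:

```python
# Toy vocab: "\u2581" is SentencePiece's word-boundary (space) marker.
ADDED = {"<|im_start|>": 32001}
VOCAB = {"user": 1838, "\u2581user": 2188}

def split_on_added(text):
    """Split text into (is_added, fragment) pairs at added-token boundaries."""
    parts, rest = [], text
    while rest:
        hits = [(rest.find(t), t) for t in ADDED if t in rest]
        if not hits:
            parts.append((False, rest))
            break
        idx, tok = min(hits)          # earliest added-token match
        if idx > 0:
            parts.append((False, rest[:idx]))
        parts.append((True, tok))
        rest = rest[idx + len(tok):]
    return parts

def tokenize(text, prefix_first_fragment_only):
    """True mimics the llama.cpp logic quoted above; False mimics AutoTokenizer."""
    ids = []
    for i, (is_added, frag) in enumerate(split_on_added(text)):
        if is_added:
            ids.append(ADDED[frag])
            continue
        if i == 0 or not prefix_first_fragment_only:
            frag = "\u2581" + frag    # space prefix -> picks the "\u2581user" entry
        ids.append(VOCAB[frag])
    return ids

print(tokenize("<|im_start|>user", True))   # llama.cpp-style: [32001, 1838]
print(tokenize("<|im_start|>user", False))  # AutoTokenizer-style: [32001, 2188]
print(tokenize("user", True))               # [2188] under either behavior
```

This reproduces the reported discrepancy: the fragment after the non-special added token is the one that loses its space prefix in the llama.cpp-style path.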


JohanAR commented May 7, 2024

Sounds reasonable. Is it a project goal to be compatible with Transformers, or is it acceptable that llama.cpp behaves differently? As far as I can tell, it tokenizes the string as [32001,2188] regardless of whether <|im_start|> is special or not.

I'll just close the issue if it's not a problem that llama.cpp behaves differently than AutoTokenizer.

@teleprint-me (Contributor)

> Is it a project goal to be compatible with Transformers, or is it acceptable that llama.cpp behaves differently?

This is a valid question, and one I'm keen to have answered.

@ggerganov (Owner)

> Is it a project goal to be compatible with Transformers, or is it acceptable that llama.cpp behaves differently?

It's probably a good idea to stick to Transformers' behavior, so if there is a suggestion for how to fix our logic, we can merge it. I'm not 100% sure, though.

I think this might be one more instance of the topic about added tokens discussed in #7144
