
Different tokenization than AutoTokenizer when word is adjacent to non-special added token #7049

Open
JohanAR opened this issue May 2, 2024 · 7 comments


JohanAR commented May 2, 2024

llama.cpp commit: 6ecf318 (current master)

>>> from transformers import AutoTokenizer
>>> t = AutoTokenizer.from_pretrained("NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO")
>>> t("<|im_start|>")
{'input_ids': [1, 32001], 'attention_mask': [1, 1]}
>>> t("user")
{'input_ids': [1, 2188], 'attention_mask': [1, 1]}
>>> t("<|im_start|>user")
{'input_ids': [1, 32001, 2188], 'attention_mask': [1, 1, 1]}

llama.cpp example server with https://huggingface.co/RichardErkhov/NousResearch_-_Nous-Hermes-2-Mixtral-8x7B-DPO-gguf/blob/main/Nous-Hermes-2-Mixtral-8x7B-DPO.Q3_K_M.gguf

$ curl --header "Content-Type: application/json" --request POST http://127.0.0.1:8080/tokenize --data '{"content": "<|im_start|>"}'
{"tokens":[32001]}
$ curl --header "Content-Type: application/json" --request POST http://127.0.0.1:8080/tokenize --data '{"content": "user"}'
{"tokens":[2188]}
$ curl --header "Content-Type: application/json" --request POST http://127.0.0.1:8080/tokenize --data '{"content": "<|im_start|>user"}'
{"tokens":[32001,1838]}

llama.cpp changes the tokenization of the word "user" when it comes directly after the added token, whereas AutoTokenizer does not.

Could it be relevant that <|im_start|> is not a special token in this model? (See its tokenizer_config.json.)

A different model where <|im_start|> is a special token does not have this behaviour. (See that model's tokenizer_config.json.)
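As a side note, the "special" flag lives in the added_tokens_decoder section of tokenizer_config.json, so it can be checked directly. A minimal sketch, using an illustrative excerpt shaped like the HF format (the id and flag below mirror the report above, not the full file):

```python
import json

# Illustrative excerpt shaped like HF's tokenizer_config.json;
# only the fields relevant here are included.
cfg = json.loads("""
{
  "added_tokens_decoder": {
    "32001": {"content": "<|im_start|>", "special": false}
  }
}
""")

# Map each added token's text to its "special" flag.
specials = {
    info["content"]: info["special"]
    for info in cfg["added_tokens_decoder"].values()
}
print(specials)
```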


Jeximo commented May 3, 2024

> Could it be relevant that <|im_start|> is not a special token in this model? tokenizer_config.json.

I'm able to reproduce your results.

Model where <|im_start|> has "special": false in its tokenizer_config.json:

curl --header "Content-Type: application/json" --request POST http://127.0.0.1:8080/tokenize --data '{"content": "<|im_start|>"}'
{"tokens":[32001]}  

curl --header "Content-Type: application/json" --request POST http://127.0.0.1:8080/tokenize --data '{"content": "user"}'
{"tokens":[2188]}

curl --header "Content-Type: application/json" --request POST http://127.0.0.1:8080/tokenize --data '{"content": "<|im_start|>user"}'
{"tokens":[32001,1838]}

A different model, where <|im_start|> has "special": true in its tokenizer_config.json:

curl --header "Content-Type: application/json" --request POST http://127.0.0.1:8080/tokenize --data '{"content": "<|im_start|>"}'
{"tokens":[128002]}

curl --header "Content-Type: application/json" --request POST http://127.0.0.1:8080/tokenize --data '{"content": "user"}'      
{"tokens":[882]}

curl --header "Content-Type: application/json" --request POST http://127.0.0.1:8080/tokenize --data '{"content": "<|im_start|>user"}'
{"tokens":[128002,882]}

The "special": false setting on <|im_start|> in tokenizer_config.json is why it's tokenized differently.

@JohanAR JohanAR changed the title Different tokenization than AutoTokenizer adjacent to added token (ChatML) Different tokenization than AutoTokenizer when word is adjacent to non-special added token May 3, 2024

JohanAR commented May 3, 2024

@Jeximo thanks! Updating the issue title to reflect this.


turian commented May 6, 2024

I have also seen this issue: passing HF-tokenized text into llama.cpp gives good perplexity, but llama.cpp-tokenized text gives bad perplexity.

@ggerganov (Owner)

There is logic in llama.cpp to add a space prefix if the first token is not special:

llama.cpp/llama.cpp, lines 12690 to 12695 in 947d3ad:

    auto raw_text = fragment.raw_text.substr(fragment.offset, fragment.length);
    if (&fragment == &fragment_buffer.front()) {
        if (vocab.add_space_prefix) {
            raw_text = " " + raw_text; // prefix with space if the first token is not special
        }
    }

      "user": 1838,
      "▁user": 2188,

That's why "user" on its own tokenizes to [2188] instead of [1838].

If <|im_start|> is not special, then you need to add the space manually: "<|im_start|> user" should tokenize to [32001, 2188].
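To make the mechanism concrete, here is a minimal pure-Python sketch of the two behaviors, ignoring the BOS token and using a toy two-entry vocab (not the real SentencePiece model). Both tokenizers split the input on added tokens; the difference modeled here is that llama.cpp prefixes a space only to the first fragment of the whole string, while the AutoTokenizer behavior observed above effectively prefixes each text fragment:

```python
# Toy vocab: "\u2581" is SentencePiece's word-boundary (space) marker.
ADDED = {"<|im_start|>": 32001}
VOCAB = {"user": 1838, "\u2581user": 2188}

def split_on_added(text):
    """Split text into (is_added, fragment) pairs at added-token boundaries."""
    parts, rest = [], text
    while rest:
        hits = [(rest.find(t), t) for t in ADDED if t in rest]
        if not hits:
            parts.append((False, rest))
            break
        idx, tok = min(hits)          # earliest added-token match
        if idx > 0:
            parts.append((False, rest[:idx]))
        parts.append((True, tok))
        rest = rest[idx + len(tok):]
    return parts

def tokenize(text, prefix_first_fragment_only):
    """True mimics the llama.cpp logic quoted above; False mimics AutoTokenizer."""
    ids = []
    for i, (is_added, frag) in enumerate(split_on_added(text)):
        if is_added:
            ids.append(ADDED[frag])
            continue
        if i == 0 or not prefix_first_fragment_only:
            frag = "\u2581" + frag    # space prefix -> picks the "\u2581user" entry
        ids.append(VOCAB[frag])
    return ids

print(tokenize("<|im_start|>user", True))   # llama.cpp-style: [32001, 1838]
print(tokenize("<|im_start|>user", False))  # AutoTokenizer-style: [32001, 2188]
print(tokenize("user", True))               # [2188] under either behavior
```

This reproduces the reported discrepancy: the fragment after the non-special added token is the one that loses its space prefix in the llama.cpp-style path.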


JohanAR commented May 7, 2024

Sounds reasonable. Is it a project goal to be compatible with Transformers, or is it acceptable that llama.cpp behaves differently? As far as I can tell, it tokenizes the string as [32001,2188] regardless of whether <|im_start|> is special or not.

I'll just close the issue if it's not a problem that llama.cpp behaves differently than AutoTokenizer.

@teleprint-me (Contributor)

> Is it a project goal to be compatible with Transformers, or is it acceptable that llama.cpp behaves differently?

This is a valid question, and one I'm keen to have answered.

@ggerganov (Owner)

> Is it a project goal to be compatible with Transformers, or is it acceptable that llama.cpp behaves differently?

It's probably a good idea to stick to Transformers' behavior, so if there is a suggestion for how to fix our logic, we can merge it. I'm not 100% sure, though.

I think this might be one more instance of the topic about added tokens discussed in #7144
