Tokenizers questions and ... proposals? #6980

Open
JoanFM opened this issue Apr 29, 2024 · 30 comments
Labels
enhancement New feature or request stale

Comments

@JoanFM
Contributor

JoanFM commented Apr 29, 2024

Hello @ggerganov ,

Thanks for the great project. I have been trying to include the Jina Embedding models into llama.cpp as you can see in #6826.

I have been successful in getting it to run for most of the models that Jina offers (English, Spanish and German), but I cannot get it working for the Chinese one.

I have seen that the issue comes from the tokenization part of the model, and I have been digging into the code of llama.cpp as well as the tokenizers code from HuggingFace.

I have some questions that I will try to place here.

1st. - How do we know which tokenizer needs to be used for each model? For instance, I see that the SPM and BPE tokenizers here seem to work quite similarly, but there are some discrepancies.

2nd. - I have seen that, for the Chinese model, the differences in output compared to transformers come from the fact that the model uses some Normalizers and PreTokenizers that are very hard to configure in llama.cpp.

I wonder if some refactoring would be needed in the tokenizer to decouple the tokenizing logic from the surrounding normalization code, plus some options for a richer mapping between the tokenizer options in transformers and in llama.cpp.

I am not sure if my observations here make any sense, or if I am just misusing the project or misunderstanding some of the concepts.

Thank you for the great work and happy to bring some help.

@JoanFM JoanFM added the enhancement New feature or request label Apr 29, 2024
@JoanFM JoanFM changed the title Tokenizers questions and proposals Tokenizers questions and ... proposals? Apr 29, 2024
@ggerganov
Owner

On master there is no way to support correct tokenization for BPE/WPM tokenizers

Please take a look at the description in #6920 - this will be merged soon and it will introduce a pre-tokenizer field that llama.cpp can use to do pre-tokenization correctly. I've focused only on BPE tokenizers in that PR. AFAICT the Jina tokenizer falls in the WPM category - probably similar handling would be necessary, but I think it should be relatively simple to add. First step is to add the model as explained in the PR and create tokenizer tests for it and then we'll see what is necessary
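
For illustration, a tokenizer test essentially checks that llama.cpp produces the same token ids as transformers for a set of texts. A minimal sketch, where gguf_tokenize is a hypothetical stand-in for the llama.cpp-side tokenization call:

from transformers import AutoTokenizer

def check_tokenizer(model_id: str, gguf_tokenize, texts):
    # the test passes only if both tokenizers agree on every input text
    hf = AutoTokenizer.from_pretrained(model_id)
    for text in texts:
        expected = hf.encode(text, add_special_tokens=False)
        assert gguf_tokenize(text) == expected, text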

@JoanFM
Contributor Author

JoanFM commented Apr 29, 2024

Okay, I will follow the developments closely, thank you very much.

@JoanFM
Contributor Author

JoanFM commented Apr 30, 2024

Hey @ggerganov ,

I have been checking the PR and the pre-tokenization closely, and I have some questions and doubts.

  • The model I was trying to add is a Chinese model that uses NFC Unicode normalization and Lowercase as a list of Normalizers from HF (https://huggingface.co/docs/tokenizers/components). I am not sure how I can map this to a regex, or can I do the normalization without a regex? (A minimal sketch of this follows the list.)

  • I have another model which I would like to add that has a Precompiled HF normalizer with a precompiled_charsmap attribute which I am failing to understand. (I was wondering if that in fact has any relationship with the regex you set in the llama : improve BPE pre-processing + LLaMA 3 and Deepseek support #6920 PR.)
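
For the first point, here is a minimal Python sketch of what that normalizer sequence amounts to, using the standard unicodedata module purely for illustration:

import unicodedata

def normalize_like_hf(text: str) -> str:
    # the HF config lists NFC followed by Lowercase as its Normalizers
    text = unicodedata.normalize("NFC", text)
    return text.lower()

So there is no regex involved at this stage; it is a plain per-string transformation that in HF tokenizers runs before the pre-tokenization split.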

Thanks for all the help and attention I am receiving in my PRs and inquiries.

@ggerganov
Owner

I'm not sure about NFC normalization - need to understand what it is first. Lowercase should be easy to apply in the tokenize() method before or after the regex split based on vocab.type_pre

Not familiar with pre-compiled char maps

@JoanFM
Contributor Author

JoanFM commented May 2, 2024

Another question.

In order to decide the pre-tokenizer we do this:

        chktok = tokenizer.encode(chktxt)
        chkhsh = sha256(str(chktok).encode()).hexdigest()

and we intend to identify the pre-tokenization behavior based on this.

But does this actually make sense?

The chktok may differ even for models with the same pre-tokenization behavior but just a different vocabulary. Shouldn't there be a way to make this independent of the actual vocabulary?
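
To make the concern concrete, this is roughly how the resulting hash ends up being used; the hash values and names below are placeholders, not the real entries in convert-hf-to-gguf.py:

from hashlib import sha256

def detect_pre_tokenizer(tokenizer, chktxt: str) -> str:
    # hash of the token ids produced for a fixed check string
    chktok = tokenizer.encode(chktxt)
    chkhsh = sha256(str(chktok).encode()).hexdigest()

    # placeholder hash -> pre-tokenizer name table
    known = {
        "aaaa...": "llama-bpe",
        "bbbb...": "deepseek-llm",
    }
    # a model with identical pre-tokenization but a different vocabulary
    # produces different token ids, hence a different hash, and falls
    # through to the default even though its behavior is already covered
    return known.get(chkhsh, "default")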

Wdyt @ggerganov ?

@JoanFM
Contributor Author

JoanFM commented May 3, 2024

@ggerganov @slaren @teleprint-me I have a question.

If I were to consider adding support for NFC normalization, would you be open to including other dependencies such as Boost or ICU for this, or is including external dependencies very much avoided given that llama.cpp is intended to be usable on edge devices?

@JoanFM
Contributor Author

JoanFM commented May 3, 2024

Also, perhaps a nice option to better scale the effort of handling tokenization logic could be to take a look at https://github.com/mlc-ai/tokenizers-cpp, which seems to be a port of tokenizers and is linked in this issue: huggingface/tokenizers#185.

@slaren
Collaborator

slaren commented May 3, 2024

Large external dependencies like boost are out of the question. Small, self-contained libraries that can be bundled in the repository have a higher chance of being accepted by @ggerganov, but the preference is still to reduce dependency on external libraries.

@ggerganov
Owner

The chktok may differ even for models with the same pre-tokenization behavior but just a different vocabulary. Shouldn't there be a way to make this independent of the actual vocabulary?

Yes, the resulting hashes would be different, but if you observe that the pre-tokenizer configs are the same, you can assign the same "name" for both models in order to reuse the pre-tokenizer. But also adding a duplicate one is fine.

If I were to consider adding support for NFC normalization, would you be open to including other dependencies such as Boost or ICU for this

From a quick look, is NFC normalization similar to NFD normalization? We have the latter already implemented:

llama.cpp/unicode.cpp

Lines 472 to 485 in 3275e60

std::vector<uint32_t> unicode_cpts_normalize_nfd(const std::vector<uint32_t> & cpts) {
    std::vector<uint32_t> result;
    result.reserve(cpts.size());
    for (size_t i = 0; i < cpts.size(); ++i) {
        auto it = unicode_map_nfd.find(cpts[i]);
        if (it == unicode_map_nfd.end()) {
            result.push_back(cpts[i]);
        } else {
            result.push_back(it->second);
        }
    }
    return result;
}

Seems simple enough to implement from scratch

@JoanFM
Contributor Author

JoanFM commented May 3, 2024

Hey @ggerganov ,

To be honest, I know NFC is related to NFD, but I am not sure how easy it is to implement; I am trying to understand some implementations, but I am quite new to this. (I believe it is a little more complex, as it composes symbols rather than decomposing them.) I will try to dig deeper, thanks.
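
For a quick illustration of the relationship (Python's unicodedata used just to show the round trip):

import unicodedata

s = "é"                                   # U+00E9, precomposed
nfd = unicodedata.normalize("NFD", s)     # "e" + U+0301 (combining acute)
nfc = unicodedata.normalize("NFC", nfd)   # recomposed back to U+00E9
assert [hex(ord(c)) for c in nfd] == ["0x65", "0x301"]
assert nfc == s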

I know you can map different hashes to the same name, but it would be nice if it could actually detect it in a better way. (I have a small proposal in #7039 for some improvement.)

@teleprint-me
Contributor

teleprint-me commented May 5, 2024

NFC simply merges similar characters together to reduce redundancy, but the importance and practicality of such an implementation is debatable. It really depends on the use case and needs of the project. I have no say here, but I agree with @ggerganov on implementing a function from scratch if it really is desired for some reason. There's a nifty online tool with a brief overview on what the concept is.

Edit: It's important to keep in mind that there's a definite potential for data loss with the use of NFC, but everything is compressed here. It's all compression, so I suppose it's more practical to think in terms of how lossy it is in comparison to other methods.

@JoanFM
Contributor Author

JoanFM commented May 6, 2024

Hey @teleprint-me ,

As for whether it should be used or not, I believe this is a discussion to be had when training the tokenizer. In the case of this library, I believe that if a model was trained with a tokenizer requiring this normalization process, we should be able to reproduce this normalization.

As for whether we should implement it ourselves, I agree, and I am working in that direction. However, I am finding it very hard to get the data needed for the mapping. Where did @ggerganov get the required data to build unicode_map_nfd?

As I understand it, the algorithm goes as follows (a rough sketch follows the list):

  • Decompose the unicode_cpts recursively following the mapping rules that can be extracted (I will see how later) from https://www.unicode.org/Public/13.0.0/ucd/UnicodeData.txt
  • Reorder adjacent unicode_cpts according to their Unicode Canonical Combining Class (also potentially extracted from the same file)
  • Apply the Canonical Composition Algorithm (as described here: https://unicode.org/reports/tr15/#Description_Norm). I guess this takes a map of pairs of unicode_cpts and merges them into a single unicode_cpt. The problem is I have no clue where to extract the mappings to obtain such behavior.
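
To make the three steps concrete, here is a rough, simplified Python sketch; the table names (nfd_map, ccc, comp_map) are hypothetical placeholders for data extracted from the Unicode files, and the composition step ignores the "blocked" check from UAX #15 for brevity:

def normalize_nfc(cpts, nfd_map, ccc, comp_map):
    # 1. canonical decomposition, applied recursively
    decomposed = []
    def decompose(cp):
        if cp in nfd_map:
            for d in nfd_map[cp]:          # nfd_map: cp -> list of cps
                decompose(d)
        else:
            decomposed.append(cp)
    for cp in cpts:
        decompose(cp)

    # 2. canonical ordering: stable-sort each run of non-starters (ccc != 0)
    i = 0
    while i < len(decomposed):
        if ccc.get(decomposed[i], 0) == 0:
            i += 1
            continue
        j = i
        while j < len(decomposed) and ccc.get(decomposed[j], 0) != 0:
            j += 1
        decomposed[i:j] = sorted(decomposed[i:j], key=lambda c: ccc.get(c, 0))
        i = j

    # 3. canonical composition: merge each mark into the last starter when
    #    the pair has a canonical composite
    result = []
    last_starter = -1
    for cp in decomposed:
        if last_starter >= 0 and (result[last_starter], cp) in comp_map:
            result[last_starter] = comp_map[(result[last_starter], cp)]
        else:
            result.append(cp)
            if ccc.get(cp, 0) == 0:
                last_starter = len(result) - 1
    return result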

Is there any hint you could give me as to where I could look?

Thanks

@JoanFM
Contributor Author

JoanFM commented May 6, 2024

@iamlemec may be able to provide more details about the current implementation.

Plus I have a question,

In theory NFC should be somehow equivalent to using NFD plus a composition step, but I am wondering why the current NFD implementation uses find on the multimap instead of equal_range as in the first implementation (#5799). It seems to me that find will return only the first occurrence, and thus the decomposition would be wrong?

Thanks for all the help

@iamlemec
Collaborator

iamlemec commented May 6, 2024

Hi @JoanFM! Very interesting work here. Having Jina embeddings would be great. I'm actually working on BAAI/bge-m3 support right now and running into similar tokenizer issues. Especially with sentencepiece and precompiled_charsmap. With regards to the latter, there is this parser library: https://github.com/huggingface/spm_precompiled. I haven't looked at it in detail, but it seems like it should work.

As for the provenance of the NFD tables, I think I eventually used this script here: https://github.com/n1t0/unicode-normalization/blob/master/scripts/unicode.py. Plus some post-processing for formatting. I should patch that up again and add it to the scripts directory. I think for BERT-style models that use WordPiece tokenizer, it makes most sense to use NFD since the vocabulary will typically just contain the root character.

Hadn't noticed that little equal_range to find changeup. I guess in a simple case like à that will be equivalent, since we're stripping the accent characters out right afterwards in the WPM preprocessing implementation. That said, if we are always just taking the first character, we may as well strip the rest out beforehand and turn unicode_map_nfd into a std::map, which would probably speed up compile time too.

@JoanFM
Contributor Author

JoanFM commented May 6, 2024

Hey @iamlemec,

I think we should not change that; I believe the change tells us that NFD is not being properly applied (at least not in the way HF tokenizers aims to do it).

I am not sure; maybe llama.cpp is not being used extensively with languages other than English?

As for the SPM library, I had already noticed it, but I could not figure out exactly what it is doing, how one could implement that in C++, or what its actual use is.

Thanks for the help!!

@turian

turian commented May 7, 2024

Possibly related to the tokenization issues discussed in #7056 #7062 #7049 #7006

@JoanFM
Contributor Author

JoanFM commented May 7, 2024

Hi @JoanFM! Very interesting work here. Having Jina embeddings would be great. I'm actually working on BAAI/bge-m3 support right now and running into similar tokenizer issues. Especially with sentencepiece and precompiled_charsmap. With regards to the latter, there is this parser library: https://github.com/huggingface/spm_precompiled. I haven't looked at it in detail, but it seems like it should work.

As for the provenance of the NFD tables, I think I eventually used this script here: https://github.com/n1t0/unicode-normalization/blob/master/scripts/unicode.py. Plus some post-processing for formatting. I should patch that up again and add it to the scripts directory. I think for BERT-style models that use WordPiece tokenizer, it makes most sense to use NFD since the vocabulary will typically just contain the root character.

Hadn't noticed that little equal_range to find changeup. I guess in a simple case like à that will be equivalent, since we're stripping the accent characters out right afterwards in the WPM preprocessing implementation. That said, if we are always just taking the first character, we may as well strip the rest out beforehand and turn unicode_map_nfd into a std::map, which would probably speed up compile time too.

Hey @iamlemec .

I think that the find was changed after this refactoring https://github.com/ggerganov/llama.cpp/pull/5992/files#diff-70eb27fba52eb29d31f61ec3d85c7864431a2f512d9d9a5a95021e7c679affb1.

It feels to me like a bug more than an intentional change?

@ggerganov
Owner

I think that the find was changed after this refactoring #5992 (files).

It feels to me like a bug more than an intentional change?

Hm, definitely not intentional. Nice find

@JoanFM
Contributor Author

JoanFM commented May 7, 2024

I am trying to do some fixes here, but I am still not sure about the implementation: #7122

@iamlemec
Collaborator

iamlemec commented May 7, 2024

Ok, let's bring back the old equal_range way! Will hop over to comment on #7122 in a moment. I mostly have experience with English and CJK, where it seems like NFD isn't required too often. What are the most common examples beyond accented characters like à? Just want to have some on hand for testing.

@JoanFM
Contributor Author

JoanFM commented May 7, 2024

Ok, let's bring back the old equal_range way! Will hop over to comment on #7122 in a moment. I mostly have experience with English and CJK, where it seems like NFD isn't required too often. What are the most common examples beyond accented characters like à? Just want to have some on hand for testing.

I am not so sure about the most common cases. But I just found that it does not implement NFD as it is supposed to:

  • It does not apply the decomposition recursively.
  • It does not do the canonical ordering.

I am doing a trial in #7122, but it is not working yet.

@iamlemec
Collaborator

iamlemec commented May 7, 2024

So unicode_map_nfd is actually already pre-computed recursively. I just pushed a branch on my repo that adds a proper Python script to generate it (the output has three additional entries due to recently added Unicode codepoints). Here's the source: https://github.com/iamlemec/llama.cpp/blob/gen-nfd-table/scripts/gen-nfd-table.py. Will push a PR in a moment.

As for the canonical order, do we know if this will actually affect any specific examples? We're stripping out all of the accent mark characters right afterwards, so I'd be kind of surprised. And the reordering might be a bit expensive, which can start to impact speeds on smaller embedding models.

@JoanFM
Contributor Author

JoanFM commented May 7, 2024

So unicode_map_nfd is actually already pre-computed recursively. I just pushed a branch on my repo that adds a proper Python script to generate it (the output has three additional entries due to recently added Unicode codepoints). Here's the source: https://github.com/iamlemec/llama.cpp/blob/gen-nfd-table/scripts/gen-nfd-table.py. Will push a PR in a moment.

As for the canonical order, do we know if this will actually affect any specific examples? We're stripping out all of the accent mark characters right afterwards, so I'd be kind of surprised. And the reordering might be a bit expensive, which can start to impact speeds on smaller embedding models.

Not sure how much it affects, but it is part of the definition of the algorithm, so it should be applied; otherwise they would not have made it part of it. I guess for non-English languages the effect will be greater. You can always check whether no match in the NFD map has been made in order to skip the sorting.

@iamlemec
Collaborator

iamlemec commented May 8, 2024

Still would be nice to see at least a single use case. Either way, can't we simply pre-compute the reordering in the Python script rather than doing it at runtime?

@JoanFM
Contributor Author

JoanFM commented May 8, 2024

Still would be nice to see at least a single use case. Either way, can't we simply pre-compute the reordering in the Python script rather than doing it at runtime?

I do not think that is feasible, right?

@iamlemec
Collaborator

iamlemec commented May 8, 2024

The UnicodeData.txt file has the Canonical Combining Class for each codepoint, so you can use those to sort the decomposed codepoints.
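
For reference, a minimal sketch of pulling those combining classes out of UnicodeData.txt (field 3 of each semicolon-separated record); the function name is just illustrative:

def load_combining_classes(path="UnicodeData.txt"):
    ccc = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.split(";")
            cp = int(fields[0], 16)           # code point in hex
            combining_class = int(fields[3])  # Canonical_Combining_Class
            if combining_class != 0:          # 0 is the default, skip it
                ccc[cp] = combining_class
    return ccc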

@JoanFM
Contributor Author

JoanFM commented May 8, 2024

The UnicodeData.txt file has the Canonical Combining Class for each codepoint, so you can use those to sort the decomposed codepoints.

Yes, but I mean I do not think you can pre-sort that in the map so that you can skip the sorting during normalization.

@JoanFM
Contributor Author

JoanFM commented May 8, 2024

Hey @iamlemec ,

I have been digging a little bit, and I saw that in the cases where the reordering happens, NFD and NFC would differ from each other and from the original representation, but this seems to only happen with very unusual characters which may not be relevant at all.

See https://unicode.org/reports/tr15/#Multiple_Mark_Figure

I even tried this experiment in Python (with a text which I believe is complex enough), and it seems the reordering does not come into play, so I think it is okay for simplicity to skip it for now.

So to fix NFD, only adding equal_range back should be okay:

a = "北京的清晨,空氣清新而寧靜,一个年轻的旅行者在长城上漫步,他从自己的故乡—서울에서 출발하여 아시아의 다양한 문화를 탐험하고자 하는 꿈을 품고 떠났다。彼は日本の古都、京都を訪れ、そこで美しい桜の花が満開の下で古典音楽のコンサートに参加しました。祭りの夜、彼は色とりどりの灯籠が空に浮かぶのを見て、その美しさに感動しました。その後、彼は印度のバラナシに到着し、गंगा की घाटों पर आध्यात्मिक शांति की खोज में जुट गया। वहाँ उसने दिवाली के उत्सव में हिस्सा लिया, जहां लाखों दीये जलाकर समृद्धि और खुशहाली की कामना की गई थी।この旅は彼にとって非常に啓発的であり、多くの異なる文化から新しいことを学び、新しい友達を作る機会を与えました。彼はこの経験を通じて、 異なる文化の間の共通点と相違点を理解するようになりました。España is your's mine's l'heure èspciâl café über naïve résumé cañón élite cañas Barça 例子 東京 こんにちは 你好 中国"
import unicodedata  # needed for the snippet to run standalone
nfd = unicodedata.normalize('NFD', a)
nfc = unicodedata.normalize('NFC', a)
>>> nfd == nfc
False
>>> nfd == a
False
>>> nfc == a
True
>>> 

@iamlemec
Collaborator

iamlemec commented May 8, 2024

Thanks for looking into it @JoanFM! I do love learning about the rich complexity of Unicode. Yeah, I think the main place this shows up is with languages that use multiple accents per base character, like Vietnamese. But at least in the WordPiece model, we strip these accents out anyway, so it shouldn't make a difference.

Overall, it seems like embedding models tend to ignore accents pretty aggressively, possibly because English and Chinese are so dominant in that space right now. For instance, the original BAAI/bge-* models don't even have "á" in their vocabulary. However, looking at BAAI/bge-m3 this does have "á" since it's geared towards multi-lingual, but that also appears not to use a WordPiece tokenizer.

@JoanFM
Contributor Author

JoanFM commented May 8, 2024

Thanks for looking into it @JoanFM! I do love learning about the rich complexity of Unicode. Yeah, I think the main place this shows up is with languages that use multiple accents per base character, like Vietnamese. But at least in the WordPiece model, we strip these accents out anyway, so it shouldn't make a difference.

Overall, it seems like embedding models tend to ignore accents pretty aggressively, possibly because English and Chinese are so dominant in that space right now. For instance, the original BAAI/bge-* models don't even have "á" in their vocabulary. However, looking at BAAI/bge-m3 this does have "á" since it's geared towards multi-lingual, but that also appears not to use a WordPiece tokenizer.

Yes, even trying to fix NFD in #7122 I struggled to find a test failing for that case.

@github-actions github-actions bot added the stale label Jun 8, 2024