Tokenizers questions and ... proposals? #6980

Open
JoanFM opened this issue Apr 29, 2024 · 30 comments
Labels
enhancement New feature or request stale

Comments

@JoanFM
Contributor

JoanFM commented Apr 29, 2024

Hello @ggerganov ,

Thanks for the great project. I have been trying to include the Jina Embedding models into llama.cpp as you can see in #6826.

I have been successful in getting it to run for most of the models that Jina offers (English, Spanish and German), but I cannot get it working for the Chinese one.

I have seen that the issue comes from the tokenization part of the model, and I have been digging into the code of llama.cpp as well as the tokenizers code from HuggingFace.

I have some questions that I will try to place here.

1st. - How do we know which tokenizer needs to be used for each model? For instance, I see that the SPM and BPE tokenizers here seem to work quite similarly, but there are some discrepancies.

2nd. - I have seen that, for the Chinese model, the differences in output compared to transformers come from the fact that the model uses some Normalizers and PreTokenizers that are very hard to configure in llama.cpp.

I wonder if some refactoring would be needed in the tokenizer to decouple the tokenizing logic from the surrounding normalization code, plus some options for a richer mapping between the tokenizer options in transformers and in llama.cpp.

I am not sure if my observations here make any sense, or if I am just misusing the project or misunderstanding some of the concepts.

Thank you for the great work and happy to bring some help.

@JoanFM JoanFM added the enhancement New feature or request label Apr 29, 2024
@JoanFM JoanFM changed the title Tokenizers questions and proposals Tokenizers questions and ... proposals? Apr 29, 2024
@ggerganov
Owner

On master there is no way to support correct tokenization for BPE/WPM tokenizers

Please take a look at the description in #6920 - this will be merged soon and it will introduce a pre-tokenizer field that llama.cpp can use to do pre-tokenization correctly. I've focused only on BPE tokenizers in that PR. AFAICT the Jina tokenizer falls in the WPM category - probably similar handling would be necessary, but I think it should be relatively simple to add. First step is to add the model as explained in the PR and create tokenizer tests for it and then we'll see what is necessary
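
For illustration, a tokenizer test essentially checks that llama.cpp produces the same token ids as transformers for a set of texts. A minimal sketch, where gguf_tokenize is a hypothetical stand-in for the llama.cpp-side tokenization call:

from transformers import AutoTokenizer

def check_tokenizer(model_id: str, gguf_tokenize, texts):
    # the test passes only if both tokenizers agree on every input text
    hf = AutoTokenizer.from_pretrained(model_id)
    for text in texts:
        expected = hf.encode(text, add_special_tokens=False)
        assert gguf_tokenize(text) == expected, text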

@JoanFM
Contributor Author

JoanFM commented Apr 29, 2024

Okay, I will follow the developments closely, thank you very much.

@JoanFM
Contributor Author

JoanFM commented Apr 30, 2024

Hey @ggerganov ,

I have been checking the PR and the pre-tokenization closely, and I have some questions and doubts.

  • The model I was trying to add is a Chinese model that uses NFC Unicode normalization and Lowercase as a list of Normalizers from HF (https://huggingface.co/docs/tokenizers/components). I am not sure how I can map this to a regex, or can I do the normalization without a regex? (A minimal sketch of this follows the list.)

  • I have another model which I would like to add that has a Precompiled HF normalizer with a precompiled_charsmap attribute which I am failing to understand. (I was wondering if that in fact has any relationship with the regex you set in the llama : improve BPE pre-processing + LLaMA 3 and Deepseek support #6920 PR.)
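
For the first point, here is a minimal Python sketch of what that normalizer sequence amounts to, using the standard unicodedata module purely for illustration:

import unicodedata

def normalize_like_hf(text: str) -> str:
    # the HF config lists NFC followed by Lowercase as its Normalizers
    text = unicodedata.normalize("NFC", text)
    return text.lower()

So there is no regex involved at this stage; it is a plain per-string transformation that in HF tokenizers runs before the pre-tokenization split.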

Thanks for all the help and attention I am receiving in my PRs and inquiries.

@ggerganov
Owner

I'm not sure about NFC normalization - need to understand what it is first. Lowercase should be easy to apply in the tokenize() method before or after the regex split based on vocab.type_pre

Not familiar with pre-compiled char maps

@JoanFM
Contributor Author

JoanFM commented May 2, 2024

Another question.

In order to decide the pre-tokenizer we do this:

        chktok = tokenizer.encode(chktxt)
        chkhsh = sha256(str(chktok).encode()).hexdigest()

and we intend to identify the pre-tokenization behavior based on this.

But does this actually make sense?

The chktok may differ even for models with the same pre-tokenization behavior but just a different vocabulary. Shouldn't there be a way to make this independent of the actual vocabulary?
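
To make the concern concrete, this is roughly how the resulting hash ends up being used; the hash values and names below are placeholders, not the real entries in convert-hf-to-gguf.py:

from hashlib import sha256

def detect_pre_tokenizer(tokenizer, chktxt: str) -> str:
    # hash of the token ids produced for a fixed check string
    chktok = tokenizer.encode(chktxt)
    chkhsh = sha256(str(chktok).encode()).hexdigest()

    # placeholder hash -> pre-tokenizer name table
    known = {
        "aaaa...": "llama-bpe",
        "bbbb...": "deepseek-llm",
    }
    # a model with identical pre-tokenization but a different vocabulary
    # produces different token ids, hence a different hash, and falls
    # through to the default even though its behavior is already covered
    return known.get(chkhsh, "default")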

Wdyt @ggerganov ?

@JoanFM
Contributor Author

JoanFM commented May 3, 2024

@ggerganov @slaren @teleprint-me I have a question.

If I were to consider adding support for NFC normalization, would you be open to including other dependencies such as Boost or ICU for this, or is including external dependencies very much avoided given that llama.cpp is intended to be usable on edge devices?

@JoanFM
Contributor Author

JoanFM commented May 3, 2024

Also, perhaps a nice option to better scale the effort of handling tokenization logic could be to take a look at https://github.com/mlc-ai/tokenizers-cpp, which seems to be a port of tokenizers and is linked in this issue: huggingface/tokenizers#185.

@slaren
Collaborator

slaren commented May 3, 2024

Large external dependencies like boost are out of the question. Small, self-contained libraries that can be bundled in the repository have a higher chance of being accepted by @ggerganov, but the preference is still to reduce dependency on external libraries.

@ggerganov
Owner

The chktok may differ even for models with the same pre-tokenization behavior but just a different vocabulary. Shouldn't there be a way to make this independent of the actual vocabulary?

Yes, the resulting hashes would be different, but if you observe that the pre-tokenizer configs are the same, you can assign the same "name" for both models in order to reuse the pre-tokenizer. But also adding a duplicate one is fine.

If I were to consider adding support for NFC normalization, would you be open to including other dependencies such as Boost or ICU for this

From a quick look, is NFC normalization similar to NFD normalization? We have the latter already implemented:

llama.cpp/unicode.cpp

Lines 472 to 485 in 3275e60

std::vector<uint32_t> unicode_cpts_normalize_nfd(const std::vector<uint32_t> & cpts) {
    std::vector<uint32_t> result;
    result.reserve(cpts.size());
    for (size_t i = 0; i < cpts.size(); ++i) {
        auto it = unicode_map_nfd.find(cpts[i]);
        if (it == unicode_map_nfd.end()) {
            result.push_back(cpts[i]);
        } else {
            result.push_back(it->second);
        }
    }
    return result;
}

Seems simple enough to implement from scratch

@JoanFM
Contributor Author

JoanFM commented May 3, 2024

Hey @ggerganov ,

To be honest, I know NFC is related to NFD, but I am not sure how easy it is to implement; I am trying to understand some implementations, but I am quite new to this. (I believe it is a little more complex, as it composes symbols rather than decomposing them.) I will try to dig deeper, thanks.
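
For a quick illustration of the relationship (Python's unicodedata used just to show the round trip):

import unicodedata

s = "é"                                   # U+00E9, precomposed
nfd = unicodedata.normalize("NFD", s)     # "e" + U+0301 (combining acute)
nfc = unicodedata.normalize("NFC", nfd)   # recomposed back to U+00E9
assert [hex(ord(c)) for c in nfd] == ["0x65", "0x301"]
assert nfc == s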

I know you can map different hashes to the same name, but it would be nice if it could actually detect it in a better way. (I have a small proposal in #7039 for some improvement.)

@teleprint-me
Contributor

teleprint-me commented May 5, 2024

NFC simply merges similar characters together to reduce redundancy, but the importance and practicality of such an implementation is debatable. It really depends on the use case and needs of the project. I have no say here, but I agree with @ggerganov on implementing a function from scratch if it really is desired for some reason. There's a nifty online tool with a brief overview on what the concept is.

Edit: It's important to keep in mind that there's a definite potential for data loss with the use of NFC, but everything is compressed here. It's all compression, so I suppose it's more practical to think in terms of how lossy it is in comparison to other methods.

@JoanFM
Contributor Author

JoanFM commented May 6, 2024

Hey @teleprint-me ,

As for whether it should be used or not, I believe this is a discussion to be had when training the tokenizer. In the case of this library, I believe that if a model was trained with a tokenizer requiring this normalization process, we should be able to reproduce this normalization.

As for whether we should implement it ourselves, I agree, and I am working in that direction. However, I am finding it very hard to get the data needed for the mapping. Where did @ggerganov get the required data to build unicode_map_nfd?

As I understand it, the algorithm goes as follows (a rough sketch follows the list):

  • Decompose the unicode_cpts recursively following the mapping rules that can be extracted (I will see how later) from https://www.unicode.org/Public/13.0.0/ucd/UnicodeData.txt
  • Reorder adjacent unicode_cpts according to their Unicode Canonical Combining Class (also potentially extracted from the same file)
  • Apply the Canonical Composition Algorithm (as described here: https://unicode.org/reports/tr15/#Description_Norm). I guess this takes a map of pairs of unicode_cpts and merges them into a single unicode_cpt. The problem is I have no clue where to extract the mappings to obtain such behavior.
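
To make the three steps concrete, here is a rough, simplified Python sketch; the table names (nfd_map, ccc, comp_map) are hypothetical placeholders for data extracted from the Unicode files, and the composition step ignores the "blocked" check from UAX #15 for brevity:

def normalize_nfc(cpts, nfd_map, ccc, comp_map):
    # 1. canonical decomposition, applied recursively
    decomposed = []
    def decompose(cp):
        if cp in nfd_map:
            for d in nfd_map[cp]:          # nfd_map: cp -> list of cps
                decompose(d)
        else:
            decomposed.append(cp)
    for cp in cpts:
        decompose(cp)

    # 2. canonical ordering: stable-sort each run of non-starters (ccc != 0)
    i = 0
    while i < len(decomposed):
        if ccc.get(decomposed[i], 0) == 0:
            i += 1
            continue
        j = i
        while j < len(decomposed) and ccc.get(decomposed[j], 0) != 0:
            j += 1
        decomposed[i:j] = sorted(decomposed[i:j], key=lambda c: ccc.get(c, 0))
        i = j

    # 3. canonical composition: merge each mark into the last starter when
    #    the pair has a canonical composite
    result = []
    last_starter = -1
    for cp in decomposed:
        if last_starter >= 0 and (result[last_starter], cp) in comp_map:
            result[last_starter] = comp_map[(result[last_starter], cp)]
        else:
            result.append(cp)
            if ccc.get(cp, 0) == 0:
                last_starter = len(result) - 1
    return result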

Is there any hint you could give me as to where I could look?

Thanks

@JoanFM
Contributor Author

JoanFM commented May 6, 2024

@iamlemec may be able to provide more details about the current implementation.

Plus I have a question,

In theory NFC should be somehow equivalent to using NFD plus a composition step, but I am wondering why the current NFD implementation uses find on the multimap instead of equal_range as in the first implementation (#5799). It seems to me that find will return only the first occurrence, and thus the decomposition would be wrong?

Thanks for all the help

@iamlemec
Collaborator

iamlemec commented May 6, 2024

Hi @JoanFM! Very interesting work here. Having Jina embeddings would be great. I'm actually working on BAAI/bge-m3 support right now and running into similar tokenizer issues. Especially with sentencepiece and precompiled_charsmap. With regards to the latter, there is this parser library: https://github.com/huggingface/spm_precompiled. I haven't looked at it in detail, but it seems like it should work.

As for the provenance of the NFD tables, I think I eventually used this script here: https://github.com/n1t0/unicode-normalization/blob/master/scripts/unicode.py. Plus some post-processing for formatting. I should patch that up again and add it to the scripts directory. I think for BERT-style models that use WordPiece tokenizer, it makes most sense to use NFD since the vocabulary will typically just contain the root character.

Hadn't noticed that little equal_range to find changeup. I guess in a simple case like à that will be equivalent, since we're stripping the accent characters out right afterwards in the WPM preprocessing implementation. That said, if we are always just taking the first character, we may as well strip the rest out beforehand and turn unicode_map_nfd into a std::map, which would probably speed up compile time too.

@JoanFM
Contributor Author

JoanFM commented May 6, 2024

Hey @iamlemec,

I think we should not change that; I believe the change tells us that NFD is not being properly applied (at least not in the way HF tokenizers aims to do it).

I am not sure; maybe llama.cpp is not being used extensively with languages other than English?

As for the SPM library, I had already noticed it, but I could not figure out exactly what it is doing, how one could implement that in C++, or what its actual use is.

Thanks for the help!!

@turian

turian commented May 7, 2024

Possibly related to the tokenization issues discussed in #7056 #7062 #7049 #7006

@JoanFM
Contributor Author

JoanFM commented May 7, 2024

Hi @JoanFM! Very interesting work here. Having Jina embeddings would be great. I'm actually working on BAAI/bge-m3 support right now and running into similar tokenizer issues. Especially with sentencepiece and precompiled_charsmap. With regards to the latter, there is this parser library: https://github.com/huggingface/spm_precompiled. I haven't looked at it in detail, but it seems like it should work.

As for the provenance of the NFD tables, I think I eventually used this script here: https://github.com/n1t0/unicode-normalization/blob/master/scripts/unicode.py. Plus some post-processing for formatting. I should patch that up again and add it to the scripts directory. I think for BERT-style models that use WordPiece tokenizer, it makes most sense to use NFD since the vocabulary will typically just contain the root character.

Hadn't noticed that little equal_range to find changeup. I guess in a simple case like à that will be equivalent, since we're stripping the accent characters out right afterwards in the WPM preprocessing implementation. That said, if we are always just taking the first character, we may as well strip the rest out beforehand and turn unicode_map_nfd into a std::map, which would probably speed up compile time too.

Hey @iamlemec .

I think that the find was changed after this refactoring https://github.com/ggerganov/llama.cpp/pull/5992/files#diff-70eb27fba52eb29d31f61ec3d85c7864431a2f512d9d9a5a95021e7c679affb1.

It feels to me like a bug more than an intentional change?

@ggerganov
Owner

I think that the find was changed after this refactoring #5992 (files).

It feels to me like a bug more than an intentional change?

Hm, definitely not intentional. Nice find

@JoanFM
Contributor Author

JoanFM commented May 7, 2024

I am trying to do some fixes here, but I am still not sure about the implementation: #7122

@iamlemec
Collaborator

iamlemec commented May 7, 2024

Ok, let's bring back the old equal_range way! Will hop over to comment on #7122 in a moment. I mostly have experience with English and CJK, where it seems like NFD isn't required too often. What are the most common examples beyond accented characters like à? Just want to have some on hand for testing.

@JoanFM
Contributor Author

JoanFM commented May 7, 2024

Ok, let's bring back the old equal_range way! Will hop over to comment on #7122 in a moment. I mostly have experience with English and CJK, where it seems like NFD isn't required too often. What are the most common examples beyond accented characters like à? Just want to have some on hand for testing.

I am not so sure about the most common cases. But I just found that it does not implement NFD as it is supposed to:

  • It does not apply the decomposition recursively.
  • It does not do the canonical ordering.

I am doing a trial in #7122, but it is not working yet.

@iamlemec
Collaborator

iamlemec commented May 7, 2024

So unicode_map_nfd is actually already pre-computed recursively. I just pushed a branch on my repo that adds a proper Python script to generate it (the output has three additional entries due to recently added Unicode codepoints). Here's the source: https://github.com/iamlemec/llama.cpp/blob/gen-nfd-table/scripts/gen-nfd-table.py. Will push a PR in a moment.

As for the canonical order, do we know if this will actually affect any specific examples? We're stripping out all of the accent mark characters right afterwards, so I'd be kind of surprised. And the reordering might be a bit expensive, which can start to impact speeds on smaller embedding models.

@JoanFM
Contributor Author

JoanFM commented May 7, 2024

So unicode_map_nfd is actually already pre-computed recursively. I just pushed a branch on my repo that adds a proper Python script to generate it (the output has three additional entries due to recently added Unicode codepoints). Here's the source: https://github.com/iamlemec/llama.cpp/blob/gen-nfd-table/scripts/gen-nfd-table.py. Will push a PR in a moment.

As for the canonical order, do we know if this will actually affect any specific examples? We're stripping out all of the accent mark characters right afterwards, so I'd be kind of surprised. And the reordering might be a bit expensive, which can start to impact speeds on smaller embedding models.

Not sure how much it affects, but it is part of the definition of the algorithm, so it should be applied; otherwise they would not have made it part of it. I guess for non-English languages the effect will be greater. You can always check whether no match in the NFD map has been made in order to skip the sorting.

@iamlemec
Collaborator

iamlemec commented May 8, 2024

Still would be nice to see at least a single use case. Either way, can't we simply pre-compute the reordering in the Python script rather than doing it at runtime?

@JoanFM
Contributor Author

JoanFM commented May 8, 2024

Still would be nice to see at least a single use case. Either way, can't we simply pre-compute the reordering in the Python script rather than doing it at runtime?

I do not think that is feasible, right?

@iamlemec
Collaborator

iamlemec commented May 8, 2024

The UnicodeData.txt file has the Canonical Combining Class for each codepoint, so you can use those to sort the decomposed codepoints.
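
For reference, a minimal sketch of pulling those combining classes out of UnicodeData.txt (field 3 of each semicolon-separated record); the function name is just illustrative:

def load_combining_classes(path="UnicodeData.txt"):
    ccc = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.split(";")
            cp = int(fields[0], 16)           # code point in hex
            combining_class = int(fields[3])  # Canonical_Combining_Class
            if combining_class != 0:          # 0 is the default, skip it
                ccc[cp] = combining_class
    return ccc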

@JoanFM
Contributor Author

JoanFM commented May 8, 2024

The UnicodeData.txt file has the Canonical Combining Class for each codepoint, so you can use those to sort the decomposed codepoints.

Yes, but I mean I do not think you can pre-sort that in the map so that you can skip the sorting during normalization.

@JoanFM
Contributor Author

JoanFM commented May 8, 2024

Hey @iamlemec ,

I have been digging a little bit, and I saw that in the cases where the reordering happens, NFD and NFC would differ from each other and from the original representation, but this seems to only happen with very unusual characters which may not be relevant at all.

See https://unicode.org/reports/tr15/#Multiple_Mark_Figure

I even tried this experiment in Python (with a text which I believe is complex enough), and it seems the reordering does not come into play, so I think it is okay for simplicity to skip it for now.

So to fix NFD, only adding equal_range back should be okay:

a = "北京的清晨,空氣清新而寧靜,一个年轻的旅行者在长城上漫步,他从自己的故乡—서울에서 출발하여 아시아의 다양한 문화를 탐험하고자 하는 꿈을 품고 떠났다。彼は日本の古都、京都を訪れ、そこで美しい桜の花が満開の下で古典音楽のコンサートに参加しました。祭りの夜、彼は色とりどりの灯籠が空に浮かぶのを見て、その美しさに感動しました。その後、彼は印度のバラナシに到着し、गंगा की घाटों पर आध्यात्मिक शांति की खोज में जुट गया। वहाँ उसने दिवाली के उत्सव में हिस्सा लिया, जहां लाखों दीये जलाकर समृद्धि और खुशहाली की कामना की गई थी।この旅は彼にとって非常に啓発的であり、多くの異なる文化から新しいことを学び、新しい友達を作る機会を与えました。彼はこの経験を通じて、 異なる文化の間の共通点と相違点を理解するようになりました。España is your's mine's l'heure èspciâl café über naïve résumé cañón élite cañas Barça 例子 東京 こんにちは 你好 中国"
import unicodedata  # needed for the snippet to run standalone
nfd = unicodedata.normalize('NFD', a)
nfc = unicodedata.normalize('NFC', a)
>>> nfd == nfc
False
>>> nfd == a
False
>>> nfc == a
True
>>> 

@iamlemec
Collaborator

iamlemec commented May 8, 2024

Thanks for looking into it @JoanFM! I do love learning about the rich complexity of Unicode. Yeah, I think the main place this shows up is with languages that use multiple accents per base character, like Vietnamese. But at least in the WordPiece model, we strip these accents out anyway, so it shouldn't make a difference.

Overall, it seems like embedding models tend to ignore accents pretty aggressively, possibly because English and Chinese are so dominant in that space right now. For instance, the original BAAI/bge-* models don't even have "á" in their vocabulary. However, looking at BAAI/bge-m3 this does have "á" since it's geared towards multi-lingual, but that also appears not to use a WordPiece tokenizer.

@JoanFM
Contributor Author

JoanFM commented May 8, 2024

Thanks for looking into it @JoanFM! I do love learning about the rich complexity of Unicode. Yeah, I think the main place this shows up is with languages that use multiple accents per base character, like Vietnamese. But at least in the WordPiece model, we strip these accents out anyway, so it shouldn't make a difference.

Overall, it seems like embedding models tend to ignore accents pretty aggressively, possibly because English and Chinese are so dominant in that space right now. For instance, the original BAAI/bge-* models don't even have "á" in their vocabulary. However, looking at BAAI/bge-m3 this does have "á" since it's geared towards multi-lingual, but that also appears not to use a WordPiece tokenizer.

Yes, even trying to fix NFD in #7122 I struggled to find a test failing for that case.

@github-actions github-actions bot added the stale label Jun 8, 2024