llama : improve BPE pre-processing + LLaMA 3 and Deepseek support #6920

Merged · 61 commits · Apr 29, 2024

Conversation

ggerganov (Owner) commented Apr 26, 2024

Continuing the work in #6252 by @dragnil1

This PR adds support for BPE pre-tokenization to llama.cpp

Summary

The state so far has been that for all BPE-based models, llama.cpp applied a default pre-tokenization inherited from GPT-2:

llama.cpp/llama.cpp

Lines 12186 to 12196 in e00b4a8

std::vector<std::string> bpe_gpt2_preprocess(const std::string & text) {
    std::vector<std::string> bpe_words;
    std::vector<std::string> bpe_encoded_words;

    std::string token = "";
    // GPT2 system regex: 's|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+
    bool collecting_numeric = false;
    bool collecting_letter = false;
    bool collecting_special = false;
    bool collecting_whitespace_lookahead = false;
    bool collecting = false;

This works most of the time, since most BPE models use similar pre-tokenization strategies. However, there are cases where it fails: #6914. This leads to poor generation quality, because the model starts to work with out-of-distribution data when the pre-tokenization splits the input string in the wrong way.
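
To make the failure mode concrete, here is a minimal sketch (not llama.cpp code) of what the pre-tokenizer does: it splits the input into chunks with a regex before any BPE merges are applied, so a mismatched regex hands the BPE stage different chunks than the model saw during training. The sketch assumes the third-party Python regex package, which supports the \p{L}/\p{N} Unicode categories that the stdlib re module lacks.

# minimal sketch: pre-tokenization = regex split applied before the BPE merges
# requires the third-party `regex` package for \p{L}/\p{N} Unicode categories
import regex

# the GPT-2 pattern that llama.cpp previously applied to all BPE models
GPT2_PRE = r"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"

def pre_tokenize(text: str) -> list[str]:
    # each chunk is byte-encoded and BPE-merged independently afterwards
    return regex.findall(GPT2_PRE, text)

print(pre_tokenize("Hello, world! 1314151"))
# ['Hello', ',', ' world', '!', ' 1314151']

With the LLaMA 3 pattern, for example, the digit run would instead be split into groups of at most three digits, which is exactly the kind of difference that produces different token sequences when the wrong regex is used.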

There are 2 main obstacles to introducing proper BPE pre-tokenization: the pre-tokenizer configuration lives in the model's tokenizer.json, and the split patterns rely on Unicode categories that the standard C++ regex facilities do not handle well. Both introducing a dedicated regex library and supporting complex JSON configurations are out of scope for llama.cpp. Therefore, this PR implements the following solution:

Details

  • Introduce new convert-hf-to-gguf-update.py script

    # This script downloads the tokenizer models of the specified models from Huggingface and
    # generates the get_vocab_base_pre() function for convert-hf-to-gguf.py
    #
    # This is necessary in order to analyze the type of pre-tokenizer used by the model and
    # provide the necessary information to llama.cpp via the GGUF header in order to implement
    # the same pre-tokenizer.
    #
    # ref: https://github.com/ggerganov/llama.cpp/pull/6920
    #
    # Instructions:
    #
    # - Add a new model to the "models" list
    # - Run the script with your huggingface token:
    #
    # python3 convert-hf-to-gguf-update.py <huggingface_token>
    #
    # - Copy-paste the generated get_vocab_base_pre() function into convert-hf-to-gguf.py
    # - Update llama.cpp with the new pre-tokenizer if necessary
    #

  • From now on, we start listing all supported models in it:

    # TODO: add models here, base models preferred
    models = [
        { "name": "llama-spm",      "tokt": TOKENIZER_TYPE.SPM, "repo": "https://huggingface.co/meta-llama/Llama-2-7b-hf", },
        { "name": "llama-bpe",      "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/meta-llama/Meta-Llama-3-8B", },
        { "name": "deepseek-llm",   "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/deepseek-ai/deepseek-llm-7b-base", },
        { "name": "deepseek-coder", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-base", },
        { "name": "falcon",         "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/tiiuae/falcon-7b", },
        { "name": "bert-bge",       "tokt": TOKENIZER_TYPE.WPM, "repo": "https://huggingface.co/BAAI/bge-small-en-v1.5", },
    ]

  • During conversion with convert-hf-to-gguf.py, if the hash of the tokens of a large string is not recognized, we prompt for an update of convert-hf-to-gguf-update.py

    # NOTE: this function is generated by convert-hf-to-gguf-update.py
    # do not modify it manually!
    # ref: https://github.com/ggerganov/llama.cpp/pull/6920
    def get_vocab_base_pre(self, tokenizer) -> str:
        # encoding this string and hashing the resulting tokens would (hopefully) give us a unique identifier that
        # is specific for the BPE pre-tokenizer used by the model
        # we will use this unique identifier to write a "tokenizer.ggml.pre" entry in the GGUF file which we can
        # use in llama.cpp to implement the same pre-tokenizer
        chktxt = '\n \n\n \n\n\n \t \t\t \t\n \n \n \n \n🚀 (normal) 😶\u200d🌫️ (multiple emojis concatenated) ✅ 🦙🦙 3 33 333 3333 33333 333333 3333333 33333333 3.3 3..3 3...3 កាន់តែពិសេសអាច😁 ?我想在apple工作1314151天~ ------======= нещо на Български \'\'\'\'\'\'```````""""......!!!!!!?????? I\'ve been \'told he\'s there, \'RE you sure? \'M not sure I\'ll make it, \'D you like some tea? We\'Ve a\'lL'

        chktok = tokenizer.encode(chktxt)
        chkhsh = sha256(str(chktok).encode()).hexdigest()

        print(f"chktok: {chktok}")
        print(f"chkhsh: {chkhsh}")

        res = None

        # NOTE: if you get an error here, you need to add the model to the if-elif chain below
        # don't do this manually - use the convert-hf-to-gguf-update.py script!
        if chkhsh == "0ef9807a4087ebef797fc749390439009c3b9eda9ad1a097abbe738f486c01e5":
            # ref: https://huggingface.co/meta-llama/Meta-Llama-3-8B
            res = "llama-bpe"
        if chkhsh == "049ecf7629871e3041641907f3de7c733e4dbfdc736f57d882ba0b0845599754":
            # ref: https://huggingface.co/deepseek-ai/deepseek-llm-7b-base
            res = "deepseek-llm"
        if chkhsh == "347715f544604f9118bb75ed199f68779f423cabb20db6de6f31b908d04d7821":
            # ref: https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-base
            res = "deepseek-coder"
        if chkhsh == "8aeee3860c56296a157a1fe2fad249ec40aa59b1bb5709f4ade11c4e6fe652ed":
            # ref: https://huggingface.co/tiiuae/falcon-7b
            res = "falcon"
        if chkhsh == "0876d13b50744004aa9aeae05e7b0647eac9d801b5ba4668afc01e709c15e19f":
            # ref: https://huggingface.co/BAAI/bge-small-en-v1.5
            res = "bert-bge"

        if res is None:
            print("\n")
            print("**************************************************************************************")
            print("** WARNING: The BPE pre-tokenizer was not recognized!")
            print("** This means that it was not added yet or you are using an older version.")
            print("** Check convert-hf-to-gguf-update.py and update it accordingly.")
            print("**")
            print(f"** chkhsh: {chkhsh}")
            print("**************************************************************************************")
            print("\n")
            raise NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")

        print(f"tokenizer.ggml.pre: {res}")
        print(f"chkhsh: {chkhsh}")

        return res

  • For now, this is required only for BPE models, since it seems SPM does not use pre-tokenization

  • The string used for the hashing should be extended to cover as much pre-tokenizer functionality as possible:

    # TODO: this string has to exercise as much pre-tokenizer functionality as possible
    # will be updated with time - contributions welcome
    chktxt = '\n \n\n \n\n\n \t \t\t \t\n \n \n \n \n🚀 (normal) 😶‍🌫️ (multiple emojis concatenated) ✅ 🦙🦙 3 33 333 3333 33333 333333 3333333 33333333 3.3 3..3 3...3 កាន់តែពិសេសអាច😁 ?我想在apple工作1314151天~ ------======= нещо на Български \'\'\'\'\'\'```````\"\"\"\"......!!!!!!?????? I\'ve been \'told he\'s there, \'RE you sure? \'M not sure I\'ll make it, \'D you like some tea? We\'Ve a\'lL'

  • Pre-tokenizer types are identified via a string written to the GGUF header:

    { LLM_KV_TOKENIZER_PRE, "tokenizer.ggml.pre" },

  • For each pre-tokenizer, we have to tell llama.cpp what pre-processing regexes to use:

    llama.cpp/llama.cpp

    Lines 12087 to 12141 in c21ab18

    std::vector<std::string> word_collection;
    switch (vocab.type) {
        case LLAMA_VOCAB_TYPE_BPE:
            switch (vocab.type_pre) {
                case LLAMA_VOCAB_PRE_TYPE_LLAMA3:
                    word_collection = unicode_regex_split(text, {
                        // original regex from tokenizer.json
                        //"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",

                        // adapted: https://github.com/ggerganov/llama.cpp/pull/6920#issuecomment-2080233989
                        "(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",
                    });
                    break;
                case LLAMA_VOCAB_PRE_TYPE_DEEPSEEK_LLM:
                    word_collection = unicode_regex_split(text, {
                        "[\r\n]",
                        "\\s?[A-Za-zµÀ-ÖØ-öø-ƺƼ-ƿDŽ-ʓʕ-ʯͰ-ͳͶͷͻ-ͽͿΆΈ-ΊΌΎ-ΡΣ-ϵϷ-ҁҊ-ԯԱ-ՖႠ-ჅᎠ-Ᏽᏸ-ᏽᲐ-ᲺᲽ-Ჿᴀ-ᴫᵫ-ᵷᵹ-ᶚḀ-ἕἘ-Ἕἠ-ὅὈ-Ὅὐ-ὗὙὛὝὟ-ώᾀ-ᾴᾶ-ᾼιῂ-ῄῆ-ῌῐ-ΐῖ-Ίῠ-Ῥῲ-ῴῶ-ῼℂℇℊ-ℓℕℙ-ℝℤΩℨK-ℭℯ-ℴℹℼ-ℿⅅ-ⅉⅎↃↄⰀ-ⱻⱾ-ⳤⳫ-ⳮⳲⳳꙀ-ꙭꚀ-ꚛꜢ-ꝯꝱ-ꞇꞋ-ꞎꭰ-ꮿff-stﬓ-ﬗA-Za-z𐐀-𐑏𐒰-𐓓𐓘-𐓻𐲀-𐲲𐳀-𐳲𑢠-𑣟𞤀-𞥃]+",
                        "\\s?[!-/:-~!-/:-~‘-‟ -。]+",
                        "\\s+$",
                        "[一-龥ࠀ-一가-퟿]+",
                        "\\p{N}+",
                    });
                    break;
                case LLAMA_VOCAB_PRE_TYPE_DEEPSEEK_CODER:
                    word_collection = unicode_regex_split(text, {
                        "[\r\n]",
                        "\\s?\\p{L}+",
                        "\\s?\\p{P}+",
                        "[一-龥ࠀ-一가-퟿]+",
                        "\\p{N}+",
                    });
                    break;
                case LLAMA_VOCAB_PRE_TYPE_FALCON:
                    word_collection = unicode_regex_split(text, {
                        "[\\p{P}\\$\\+<=>\\^~\\|]+",
                        "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
                        "\\p{N}+",
                        "[0-9][0-9][0-9]",
                    });
                    break;
                default:
                    // default regex for BPE tokenization pre-processing
                    word_collection = unicode_regex_split(text, {
                        "[\\p{P}\\$\\+<=>\\^~\\|]+",
                        "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
                        "\\p{N}+",
                        "[0-9][0-9][0-9]",
                    });
                    break;
            }
            break;
        default:
            GGML_ASSERT(false);
            break;
    }

    Here, we have to manually inspect the contents of the model's tokenizer.json and either reuse an existing set of regex patterns or add a new one corresponding to the new configuration. For a tutorial, see 120cf37. We verify correctness using the tests/test-tokenizer-0 program and the exported vocab for that model:

    make tests
    ./tests/test-tokenizer-0 ./models/ggml-vocab-llama-bpe.gguf
  • Old GGUF models using BPE tokenizers, generated before this change, will fall back to the "default" pre-tokenization, which in almost all cases is wrong. A warning is printed in the output:

    llama.cpp/llama.cpp

    Lines 4333 to 4352 in 80cb312

    // for now, only BPE models have pre-tokenizers
    if (vocab.type == LLAMA_VOCAB_TYPE_BPE) {
        if (tokenizer_pre.empty()) {
            LLAMA_LOG_WARN("%s: missing pre-tokenizer type, using: 'default'\n", __func__);
            LLAMA_LOG_WARN("%s: \n", __func__);
            LLAMA_LOG_WARN("%s: ************************************ \n", __func__);
            LLAMA_LOG_WARN("%s: GENERATION QUALITY WILL BE DEGRADED! \n", __func__);
            LLAMA_LOG_WARN("%s: CONSIDER REGENERATING THE MODEL \n", __func__);
            LLAMA_LOG_WARN("%s: ************************************ \n", __func__);
            LLAMA_LOG_WARN("%s: \n", __func__);
            vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_DEFAULT;
        } else if (
                tokenizer_pre == "default") {
            vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_DEFAULT;
        } else if (
                tokenizer_pre == "llama3"   ||
                tokenizer_pre == "llama-v3" ||
                tokenizer_pre == "llama-bpe") {
            vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_LLAMA3;
        } else if (

  • Although we now support pre-processing using regexes, there is also infrastructure to add custom splitting implementations for better performance:

    llama.cpp/unicode.cpp

    Lines 424 to 432 in c21ab18

    static std::vector<size_t> unicode_regex_split_custom(const std::string & text, const std::string & regex_expr, const std::vector<size_t> & offsets) {
        std::vector<size_t> bpe_offsets;

        if (regex_expr == "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)") {
            bpe_offsets = unicode_regex_split_custom_gpt2(text, offsets);
        }

        return bpe_offsets;
    }

    For example, there is already an attempt to add custom LLaMA v3 pre-tokenization: llama3 custom regex split #6965

  • The tokenizer tests have been refactored to allow easy addition of more tests and vocabs. Add tests here and run convert-hf-to-gguf-update.py to create input/output files for all known tokenizer models (a small spot-check sketch follows the CMake snippet below):

# generate tests for each tokenizer model
tests = [
"",
" ",
" ",
" ",
"\t",
"\n",
"\n\n",
"\n\n\n",
"\t\n",
"Hello world",
" Hello world",
"Hello World",
" Hello World",
" Hello World!",
"Hello, world!",
" Hello, world!",
" this is 🦙.cpp",
"w048 7tuijk dsdfhu",
"нещо на Български",
"កាន់តែពិសេសអាចខលចេញ",
"🚀 (normal) 😶‍🌫️ (multiple emojis concatenated) ✅ (only emoji that has its own token)",
"Hello",
" Hello",
" Hello",
" Hello",
" Hello",
" Hello\n Hello",
" (",
"\n =",
"' era",
"Hello, y'all! How are you 😁 ?我想在apple工作1314151天~",
"3",
"33",
"333",
"3333",
"33333",
"333333",
"3333333",
"33333333",
"333333333",
chktxt,
]

# build test-tokenizer-0 target once and add many tests
add_executable(test-tokenizer-0 test-tokenizer-0.cpp)
target_link_libraries(test-tokenizer-0 PRIVATE common)
install(TARGETS test-tokenizer-0 RUNTIME)
llama_test(test-tokenizer-0 NAME test-tokenizer-0-llama-spm ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-llama-spm.gguf)
llama_test(test-tokenizer-0 NAME test-tokenizer-0-llama-bpe ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-llama-bpe.gguf)
llama_test(test-tokenizer-0 NAME test-tokenizer-0-falcon ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-falcon.gguf)
llama_test(test-tokenizer-0 NAME test-tokenizer-0-deepseek-llm ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-deepseek-llm.gguf)
llama_test(test-tokenizer-0 NAME test-tokenizer-0-deepseek-coder ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-deepseek-coder.gguf)
llama_test(test-tokenizer-0 NAME test-tokenizer-0-bert-bge ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-bert-bge.gguf)
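
As a quick spot check while adding a new vocab, the reference token ids can also be printed directly from the upstream Hugging Face tokenizer and compared by eye with the test-tokenizer-0 output. This is only a hedged sketch, not part of the PR's test harness; the model name is an example, and gated repos may require a Hugging Face token.

# hedged sketch (not the PR's test harness): print reference ids for a few of the
# test strings above, to compare against:
#   ./tests/test-tokenizer-0 ./models/ggml-vocab-llama-bpe.gguf
from transformers import AutoTokenizer

samples = ["", " ", "Hello world", " Hello World!", "3333333", "нещо на Български"]

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")  # example repo

for s in samples:
    ids = tok.encode(s, add_special_tokens=False)
    print(f"{s!r:>30} -> {ids}")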

TODOs

  • Fix custom GPT-2 pre-processing bug:

    llama.cpp/unicode.cpp

    Lines 430 to 434 in 120cf37

    // TODO: this implementation is actually wrong, uncomment and run:
    // make -j && ./bin/test-tokenizer-0 ../models/ggml-vocab-gpt-2.gguf
    //if (regex_expr == "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)") {
    // bpe_offsets = unicode_regex_split_custom_gpt2(text, offsets);
    //}

  • Fix MPT pre-tokenization:

llama.cpp/llama.cpp

Lines 12136 to 12146 in 120cf37

    case LLAMA_VOCAB_PRE_TYPE_MPT:
        // TODO: MPT pre-tokenization regexes are unknown
        //       the following are close, but not exact. run the following:
        //       ./bin/test-tokenizer-0 ../models/ggml-vocab-mpt.gguf
        GGML_ASSERT("MPT pre-tokenization regexes are unknown - fixes needed");
        word_collection = unicode_regex_split(text, {
            "\\s?\\p{L}+",
            "\\s?\\p{P}+",
            "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
        });
        break;


unicode.cpp Outdated
Comment on lines 206 to 207
-static inline std::string unicode_wstring_to_utf8(const std::wstring & ws)
-{
-    // code to convert from utf32/utf16 to utf8
-    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> converter;
-    std::string utf8 = converter.to_bytes(ws);
-    return utf8;
+static inline std::string unicode_wstring_to_utf8(const std::wstring & ws) {
+    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
+    return conv.to_bytes(ws);
ggerganov (Owner, Author):

@dragnil1 Not sure if this is the intent, but the following change of this function makes the tokenizer tests pass on my Mac. Do you think this is OK to change?

Contributor:

This change converts a UCS-2 or UCS-4/UTF-32 encoded std::wstring to a UTF-8 encoded std::string, whereas the previous version converted a UTF-16 encoded std::wstring to UTF-8, according to the reference. Both work on Ubuntu (tested), but I am not sure about Windows, as it uses UTF-16 encoded std::wstring.

llama.cpp Outdated
Comment on lines 12011 to 12052
std::vector<std::string> word_collection;
switch (vocab.type) {
    case LLAMA_VOCAB_TYPE_BPE:
        switch (vocab.arch) {
            // TODO: how to detect deepseek and llama v3 models?
            //case LLM_ARCH_LLAMA:
            //case LLM_ARCH_DEEPSEEK_CODER:
            //    word_collection = unicode_regex_split(text, {
            //        "[\r\n]",
            //        "\\s?\\p{L}+",
            //        "\\s?\\p{P}+",
            //        "[一-龥ࠀ-一가-퟿]+",
            //        "\\p{N}+"
            //    });
            //    break;
            //case LLM_ARCH_DEEPSEEK_LLM:
            //    word_collection = unicode_regex_split(text, {
            //        "[\r\n]",
            //        "\\s?[A-Za-zµÀ-ÖØ-öø-ƺƼ-ƿDŽ-ʓʕ-ʯͰ-ͳͶͷͻ-ͽͿΆΈ-ΊΌΎ-ΡΣ-ϵϷ-ҁҊ-ԯԱ-ՖႠ-ჅᎠ-Ᏽᏸ-ᏽᲐ-ᲺᲽ-Ჿᴀ-ᴫᵫ-ᵷᵹ-ᶚḀ-ἕἘ-Ἕἠ-ὅὈ-Ὅὐ-ὗὙὛὝὟ-ώᾀ-ᾴᾶ-ᾼιῂ-ῄῆ-ῌῐ-ΐῖ-Ίῠ-Ῥῲ-ῴῶ-ῼℂℇℊ-ℓℕℙ-ℝℤΩℨK-ℭℯ-ℴℹℼ-ℿⅅ-ⅉⅎↃↄⰀ-ⱻⱾ-ⳤⳫ-ⳮⳲⳳꙀ-ꙭꚀ-ꚛꜢ-ꝯꝱ-ꞇꞋ-ꞎꭰ-ꮿff-stﬓ-ﬗA-Za-z𐐀-𐑏𐒰-𐓓𐓘-𐓻𐲀-𐲲𐳀-𐳲𑢠-𑣟𞤀-𞥃]+",
            //        "\\s?[!-/:-~!-/:-~‘-‟ -。]+",
            //        "\\s+$",
            //        "[一-龥ࠀ-一가-퟿]+",
            //        "\\p{N}+"
            //    });
            //    break;
            default:
                // default regex for BPE tokenization pre-processing
                {
                    word_collection = unicode_regex_split(text, {
                        "\\p{P}+",
                        "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
                        "\\p{N}+",
                        "[0-9][0-9][0-9]"
                    });
                }
                break;
        }
        break;
    default:
        GGML_ASSERT(false);
        break;
}
ggerganov (Owner, Author):

This is the missing part - how to distinguish models from one another?

For example, all LLaMA, Deepseek Coder and Deepseek LLM models have the same architecture:

  "architectures": [
    "LlamaForCausalLM"
  ],

There seems to be no way to automatically determine which model we are converting. Therefore, there is no way to automatically determine the correct regex to use.

Seems we will have to rely on some heuristics based on the rest of the parameters, such as vocab size and tensor sizes. Not great


Sorry if I'm new to the inner workings of llama.cpp and get something wrong, but is vocab.arch coming from the gguf_metadata_kv_t in the gguf?

If it's not coming from there, would it be reasonable to add it as a key in the gguf? Then the file could specify what it needs, and llama.cpp could use that, or otherwise just fallback to the current behavior.

The gguf specification talks about how it "is designed to be unambiguous by containing all the information needed to load a model", and this seems like information needed to load a model.

ggerganov (Owner, Author):

The problem is that when creating the GGUF file in the first place (i.e. during conversion from HF to GGUF) there is no way to know which model we are dealing with. For example, take the LLaMA 3 and DeepSeek LLM models shown below: both use the LLaMA architecture and a BPE tokenizer, so currently they will be interpreted as the same arch by llama.cpp.

However, they use different pre-tokenizers:

LLaMA:

  "normalizer": null,
  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {
        "type": "Split",
        "pattern": {
          "Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "ByteLevel",
        "add_prefix_space": false,
        "trim_offsets": true,
        "use_regex": false
      }
    ]
  },

DeepSeek LLM:

  "normalizer": {
    "type": "Sequence",
    "normalizers": []
  },
  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {
        "type": "Split",
        "pattern": {
          "Regex": "[\r\n]"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Split",
        "pattern": {
          "Regex": "\\s?[A-Za-zµÀ-ÖØ-öø-ƺƼ-ƿDŽ-ʓʕ-ʯͰ-ͳͶͷͻ-ͽͿΆΈ-ΊΌΎ-ΡΣ-ϵϷ-ҁҊ-ԯԱ-ՖႠ-ჅᎠ-Ᏽᏸ-ᏽᲐ-ᲺᲽ-Ჿᴀ-ᴫᵫ-ᵷᵹ-ᶚḀ-ἕἘ-Ἕἠ-ὅὈ-Ὅὐ-ὗὙὛὝὟ-ώᾀ-ᾴᾶ-ᾼιῂ-ῄῆ-ῌῐ-ΐῖ-Ίῠ-Ῥῲ-ῴῶ-ῼℂℇℊ-ℓℕℙ-ℝℤΩℨK-ℭℯ-ℴℹℼ-ℿⅅ-ⅉⅎↃↄⰀ-ⱻⱾ-ⳤⳫ-ⳮⳲⳳꙀ-ꙭꚀ-ꚛꜢ-ꝯꝱ-ꞇꞋ-ꞎꭰ-ꮿff-stﬓ-ﬗA-Za-z𐐀-𐑏𐒰-𐓓𐓘-𐓻𐲀-𐲲𐳀-𐳲𑢠-𑣟𞤀-𞥃]+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Split",
        "pattern": {
          "Regex": "\\s?[!-/:-~!-/:-~‘-‟ -。]+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Split",
        "pattern": {
          "Regex": "\\s+$"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Split",
        "pattern": {
          "Regex": "[一-龥ࠀ-一가-퟿]+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Digits",
        "individual_digits": true
      },
      {
        "type": "ByteLevel",
        "add_prefix_space": false,
        "trim_offsets": true,
        "use_regex": false
      }
    ]
  },

So maybe we have to start parsing this information from the tokenizer.json and use it to determine the correct arch. Not sure yet

ggerganov (Owner, Author) commented Apr 26, 2024:

Thinking more about this, I'm starting to consider the option where we tokenize a few strings during conversion and based on the resulting tokens we add a new enum to the GGUF header indicating the pre-tokenizer type. In llama.cpp we will have custom implementations of each pre-tokenizer type with a fallback to some default pre-tokenizer (as we already do)

In the convert script, if the strings tokenize to an unknown set of tokens, we stop with an error asking the developer to check the pre-tokenizer configuration and either assign an existing one or add a new one to the enum


Regarding the "option where we tokenize a few strings during conversion": it looks like a pretty messy solution. Maybe it's better to go with parsing tokenizer.json and making an alternative implementation in C++?

ggerganov (Owner, Author):

Here is a prototype of the idea above:

def get_vocab_base_pre(self, tokenizer) -> str:
    # encoding this string and hashing the resulting tokens would (hopefully) give us a unique identifier that
    # is specific for the BPE pre-tokenizer used by the model
    # we will use this unique identifier to write a "tokenizer.ggml.pre" entry in the GGUF file which we can
    # use in llama.cpp to implement the same pre-tokenizer
    chktxt = "\n \n\n \n\n\n \t \t\t \t\n  \n   \n    \n     \n🚀 (normal) 😶‍🌫️ (multiple emojis concatenated) ✅ 🦙🦙 3 33 333 3333 33333 333333 3333333 33333333 3.3 3..3 3...3 កាន់តែពិសេសអាច😁 ?我想在apple工作1314151天~ ------======= нещо на Български what's ''''''```````\"\"\"\"......!!!!!!??????"

    chktok = tokenizer.encode(chktxt)
    chkhsh = hash(tuple(chktok))

    print(f"chktok: {chktok}")
    print(f"chkhsh: {chkhsh}")

    res = None

    # NOTE: if you get an error here, you need to add the model to the if-elif chain below
    # observe the stdout for the chkhsh value and add it to the chain
    if self.model_arch == gguf.MODEL_ARCH.LLAMA:
        if chkhsh == -3290901550109860290:
            # ref: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/blob/main/tokenizer.json
            res = "llama3"
        if chkhsh == 4190561703949727616:
            # ref: https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct/blob/main/tokenizer.json
            res = "deepseek-coder"

    if res is None:
        raise NotImplementedError(f"BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")

Feedback is welcome

Contributor:

Yeah, I went through each of these steps a while back and realized there's no way to do it unless I create my own way of doing it. Automating it was just not an option. For detection though, why not just create a hash sum of the encodings as a list? e.g. hash(tuple((k, v) for k, v in tokenizer.model.vocab.items())).

Could probably do this for any model file. The only issue is knowing the sum value in advance, which means they'd need to be added manually. This would need to include added tokens and any other required misc files.
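
A hedged sketch of that suggestion (not what this PR ships): fingerprint the vocabulary itself. One caveat is that Python's built-in hash() is randomized per process for strings, so a cryptographic digest is needed if the value is going to be hard-coded in the converter; the model path below is a placeholder.

# hedged sketch of the vocab-hash idea (not what the PR implements)
from hashlib import sha256
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/model")  # placeholder path

# sort so the digest does not depend on dict iteration order
vocab_items = sorted(tokenizer.get_vocab().items())
vocab_hash = sha256(str(vocab_items).encode()).hexdigest()
print(vocab_hash)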

github-actions bot commented Apr 26, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 425 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=11122.45ms p(95)=33038.18ms fails=, finish reason: stop=374 truncated=51
  • Prompt processing (pp): avg=121.76tk/s p(95)=561.52tk/s
  • Token generation (tg): avg=25.61tk/s p(95)=37.24tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=gg/bpe-preprocess commit=80cb3127df55a05a7688797ae5b46be8c0b6a8cf

[Benchmark charts omitted: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio and requests_processing over the 10m run (425 iterations).]


teleprint-me (Contributor) commented May 1, 2024

This implementation conflicts with StableLMForCausalLM.

21:03:35 | /mnt/valerie/remote/ggerganov/llama.cpp
(.venv) git:(master | Δ) λ python convert-hf-to-gguf.py /mnt/valerie/models/stabilityai/stablelm-2-1_6b-chat                                                                            
Loading model: stablelm-2-1_6b-chat
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
chktok: [198, 4815, 15073, 66597, 8004, 1602, 2355, 79772, 11187, 9468, 248, 222, 320, 8416, 8, 27623, 114, 378, 235, 9468, 234, 104, 31643, 320, 36773, 100166, 98634, 8, 26602, 227, 11410, 99, 247, 9468, 99, 247, 220, 18, 220, 18, 18, 220, 18, 18, 18, 220, 18, 18, 18, 18, 220, 18, 18, 18, 18, 18, 220, 18, 18, 18, 18, 18, 18, 220, 18, 18, 18, 18, 18, 18, 18, 220, 18, 18, 18, 18, 18, 18, 18, 18, 220, 18, 13, 18, 220, 18, 497, 18, 220, 18, 1131, 18, 220, 21549, 222, 98629, 241, 45358, 233, 21549, 237, 45358, 224, 21549, 244, 21549, 115, 21549, 253, 45358, 223, 21549, 253, 21549, 95, 98629, 227, 76460, 223, 949, 37046, 33565, 111, 19000, 23182, 49792, 19967, 16, 18, 16, 19, 16, 20, 16, 36827, 21909, 56560, 54337, 19175, 14476, 1482, 13373, 64571, 34694, 3114, 15752, 17721, 80112, 3436, 4708, 4708, 14196, 14196, 74694, 3089, 3089, 29249, 17523, 3001, 27708, 7801, 358, 3077, 1027, 364, 83, 820, 568, 596, 1070, 11, 364, 793, 499, 2771, 30, 364, 44, 539, 2771, 358, 3358, 1304, 433, 11, 364, 35, 499, 1093, 1063, 15600, 30, 1226, 6, 43712, 264, 64966, 43]
chkhsh: 32d85c31273f8019248f2559fed492d929ea28b17e51d81d3bb36fff23ca72b3


**************************************************************************************
** WARNING: The BPE pre-tokenizer was not recognized!
**          There are 2 possible reasons for this:
**          - the model has not been added to convert-hf-to-gguf-update.py yet
**          - the pre-tokenization config has changed upstream
**          Check your model files and convert-hf-to-gguf-update.py and update them accordingly.
** ref:     https://github.com/ggerganov/llama.cpp/pull/6920
**
** chkhsh:  32d85c31273f8019248f2559fed492d929ea28b17e51d81d3bb36fff23ca72b3
**************************************************************************************


Traceback (most recent call last):
  File "/mnt/valerie/remote/ggerganov/llama.cpp/convert-hf-to-gguf.py", line 3001, in <module>
    main()
  File "/mnt/valerie/remote/ggerganov/llama.cpp/convert-hf-to-gguf.py", line 2988, in main
    model_instance.set_vocab()
  File "/mnt/valerie/remote/ggerganov/llama.cpp/convert-hf-to-gguf.py", line 1280, in set_vocab
    self._set_vocab_gpt2()
  File "/mnt/valerie/remote/ggerganov/llama.cpp/convert-hf-to-gguf.py", line 331, in _set_vocab_gpt2
    tokens, toktypes, tokpre = self.get_vocab_base()
                               ^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/valerie/remote/ggerganov/llama.cpp/convert-hf-to-gguf.py", line 242, in get_vocab_base
    tokpre = self.get_vocab_base_pre(tokenizer)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/valerie/remote/ggerganov/llama.cpp/convert-hf-to-gguf.py", line 323, in get_vocab_base_pre
    raise NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")
NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()

The GPT-2 hash isn't the same because the vocab differs.

>>> chktxt = '\n \n\n \n\n\n \t \t\t \t\n  \n   \n    \n     \n🚀 (normal) 😶\u200d🌫️ (multiple emojis concatenated) ✅ 🦙🦙 3 33 333 3333 33333 333333 3333333 33333333 3.3 3..3 3...3 កាន់តែពិសេសអាច😁 ?我想在apple工作1314151天~ ------======= нещо на Български \'\'\'\'\'\'```````""""......!!!!!!?????? I\'ve been \'told he\'s there, \'RE you sure? \'M not sure I\'ll make it, \'D you like some tea? We\'Ve a\'lL'
>>> chktok = tokenizer.encode(chktxt)
>>> chkhsh = sha256(str(chktok).encode()).hexdigest()
>>> print(f"chktok: {chktok}")
chktok: [198, 4815, 15073, 66597, ...]
>>> print(f"chkhsh: {chkhsh}")
chkhsh: 32d85c31273f8019248f2559fed492d929ea28b17e51d81d3bb36fff23ca72b3
>>> type(tokenizer)
<class 'transformers.models.gpt2.tokenization_gpt2_fast.GPT2TokenizerFast'>

Also, the fallback is to use Qwen's vocab for 1.6B models, which invokes the self._set_vocab_gpt2() method call.

if chkhsh == "3ce83efda5659b07b1ad37ca97ca5797ea4285d9b9ab0dc679e4a720c9da7454":
    # ref: https://huggingface.co/openai-community/gpt2
    res = "gpt-2"
if chkhsh == "32d85c31273f8019248f2559fed492d929ea28b17e51d81d3bb36fff23ca72b3":
    res = "gpt-2"  # StableForCausalLM

I mentioned this earlier, but I'll explicitly state it here now because it should allow for dynamically handling the hashes. Manually referencing them will still be required though.

21:18:18 | /mnt/valerie/remote/ggerganov/llama.cpp
(.venv) git:(master | Δ) λ bpython
bpython version 0.24 on top of Python 3.12.3 /mnt/valerie/remote/ggerganov/llama.cpp/.venv/bin/python
>>> from pathlib import Path
>>> from hashlib import sha256
>>> from transformers import AutoTokenizer
>>> 
>>> model_path = Path("/mnt/valerie/models/stabilityai/stablelm-2-1_6b-chat")
>>> model_path.exists()
True
>>> tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
>>> tokenizer.name_or_path
'/mnt/valerie/models/stabilityai/stablelm-2-1_6b-chat'
>>> len(tokenizer.vocab)
100289
>>> vocab_hashsum = sha256(str(tuple((k, v) for k, v in tokenizer.vocab.items())).encode()).hexdigest()
>>> vocab_hashsum
'b7a950fbf72c9984f3f3c5f3481e5eff112d647bca38349d00523fc73bc4073e'

The downside to this approach, which is probably why it was implemented the way it was, is that there is a definite and noticeable latency when reading the vocab this way, but that only happens the first time: the result is cached, and the latency is negligible on repeated calls.

The upside to this approach is that the vocab is dynamically handled and read in from the model path. This should be applicable to any vocabulary of any type.

The algorithm for BPE is really simple at its core.

"""
examples/slow_bpe.py - Neural Machine Translation of Rare Words with Subword Units

Byte Pair Encoding (BPE) Tokenization for Natural Language Processing

Paper: https://arxiv.org/abs/1508.07909v5
"""
import argparse
import collections
import json
import re
from typing import Dict, Tuple


def get_stats(vocab: Dict[str, int]) -> Dict[Tuple[str, str], int]:
    """
    Calculate frequencies of pairs of adjacent symbols in the vocabulary.

    Args:
        vocab (dict): Dictionary with space-separated symbols as keys and frequencies as values

    Returns:
        dict: Dictionary of symbol pairs (tuple) and their combined frequency
    """
    symbol_pairs_frequency = collections.defaultdict(int)

    for word, frequency in vocab.items():
        symbols = word.split()
        for index in range(len(symbols) - 1):
            symbol_pair = (symbols[index], symbols[index + 1])
            symbol_pairs_frequency[symbol_pair] += frequency

    return symbol_pairs_frequency


def merge_vocab(
    symbol_pair: Tuple[str, str], input_vocab: Dict[str, int]
) -> Dict[str, int]:
    """
    Merge a given pair of symbols in the vocabulary.

    Args:
        symbol_pair (tuple): Tuple of two strings, the pair of symbols to merge
        input_vocab (dict): Input vocabulary

    Returns:
        dict: New vocabulary with the specified pair merged
    """
    output_vocab = {}
    bigram = re.escape(" ".join(symbol_pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")

    for word in input_vocab:
        merged_word = pattern.sub("".join(symbol_pair), word)
        output_vocab[merged_word] = input_vocab[word]

    return output_vocab


def load_vocab_from_json(json_file: str) -> Dict[str, int]:
    """
    Load a vocabulary from a JSON file.

    Args:
        json_file (str): Path to the JSON file containing the vocabulary.

    Returns:
        dict: Vocabulary loaded from the JSON file.
    """
    try:
        with open(json_file, "r") as file:
            vocab = json.load(file)
    except json.JSONDecodeError as e:
        raise ValueError(f"Error loading JSON vocabulary: {e}") from e

    return vocab


def main(args: argparse.Namespace) -> None:
    if args.vocab_json:
        vocab = load_vocab_from_json(args.vocab_json)
    else:
        # Default vocabulary
        vocab = {
            "l o w </w>": 5,
            "l o w e r </w>": 2,
            "n e w e s t </w>": 6,
            "w i d e s t </w>": 3,
        }

    for i in range(args.num_merges):
        pairs = get_stats(vocab)
        best = max(pairs, key=pairs.get)
        vocab = merge_vocab(best, vocab)
        print(best)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    parser.add_argument(
        "--num_merges",
        type=int,
        default=10,
        help="Number of BPE merges (default is 10)",
    )

    parser.add_argument(
        "--vocab_json",
        type=str,
        help="Path to a JSON file containing the initial vocabulary (optional)",
    )

    args = parser.parse_args()

    main(args)
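
For reference, running the script as-is (e.g. python examples/slow_bpe.py --num_merges 4, per the path in its docstring) prints the learned merges in order; with the built-in sample vocabulary, and given Python's insertion-ordered dicts for tie-breaking, the first ones come out as ('e', 's'), ('es', 't'), ('est', '</w>') and ('l', 'o').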

The issue is the implementation details from tokenizer to tokenizer. I won't have any free time for quite some time, and this was time consuming, as there isn't much information on this specifically on the net (that I could find). Any other implementations are extremely complicated, as they rely on concepts such as word2vec and other well-known methods.

It might be worth considering a proper custom implementation for gguf models. I mention this because sophisticated regular expression patterns are truly overkill IMHO. A simple lexer and parser should suffice for more complex implementations. This is something I've only toyed around with in theory, though.

sealad886 commented May 1, 2024

Hoping for a bit of help with Command-R-Plus please!

I've updated the files as noted here (https://github.com/ggerganov/llama.cpp/pull/6920), and then I've re-made everything with make -B tests (unconditionally make all targets) more than once. I've made sure that my model from HF is up to date. I've re-converted the model twice.

It's still failing on the tests though, and I can't figure out why.

As far as I can tell, Command-R-Plus has its own:

tokenizer.ggml.pre: command-r-plus

And then in the (admittedly far too long) tokenizer.json file, it looks very similar to Starcoder:

 "normalizer": {
        "type": "NFC"
    },
    "pre_tokenizer": {
        "type": "Sequence",
        "pretokenizers": [
            {
                "type": "Digits",
                "individual_digits": true
            },
            {
                "type": "ByteLevel",
                "add_prefix_space": false,
                "trim_offsets": true,
                "use_regex": true
            }
        ]
    },
    "post_processor": {
        "add_prefix_space": true,
        "trim_offsets": false,
        "use_regex": true,
        "type": "TemplateProcessing",
        "single": [
            {

And Starcoder just had a blank switch statement in the regex part of llama.cpp:

case LLAMA_VOCAB_PRE_TYPE_MPT:
    // TODO: MPT pre-tokenization regexes are unknown
    //       the following are close, but not exact. run the following:
    //       ./bin/test-tokenizer-0 ../models/ggml-vocab-mpt.gguf
    GGML_ASSERT("MPT pre-tokenization regexes are unknown - fixes needed");
    word_collection = unicode_regex_split(text, {
        "\\s?\\p{L}+",
        "\\s?\\p{P}+",
        "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
    });
    break;
case LLAMA_VOCAB_PRE_TYPE_STARCODER:
case LLAMA_VOCAB_PRE_TYPE_GPT2:
    word_collection = unicode_regex_split(text, {
        "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
    });
    break;
case LLAMA_VOCAB_PRE_TYPE_COMMANDRPLUS:
default:
    // default regex for BPE tokenization pre-processing
    word_collection = unicode_regex_split(text, {
        "[\\p{P}\\$\\+<=>\\^~\\|]+",
        "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
        "\\p{N}+",
        "[0-9][0-9][0-9]",
    });
    break;

teleprint-me (Contributor) commented May 1, 2024

Starcoder uses the GPT-2 tokenizer, so it's not blank. A lot of models actually use the GPT-2 tokenizer. To add it, we need the hash sum output by the conversion script when the error is raised (a sketch of how to compute it locally is below).
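
A hedged sketch of how to get that value locally (mirroring the REPL session earlier in the thread): run the same encode-and-hash step as get_vocab_base_pre() against the model directory. The path is a placeholder, and chktxt must be copied verbatim from convert-hf-to-gguf.py.

# hedged sketch: reproduce the converter's chkhsh for a local model so it can be
# reported and added to convert-hf-to-gguf-update.py
from hashlib import sha256
from transformers import AutoTokenizer

chktxt = '...'  # paste the exact chktxt string from get_vocab_base_pre()

tokenizer = AutoTokenizer.from_pretrained("path/to/c4ai-command-r-plus", trust_remote_code=True)  # placeholder
chktok = tokenizer.encode(chktxt)
print("chkhsh:", sha256(str(chktok).encode()).hexdigest())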

USBhost commented May 1, 2024

Has anyone been able to convert C4AI Command R+ after this PR?

kallewoof commented May 1, 2024

Edit: my llama3 files were outdated; re-downloading the json files fixes this. I am seeing the following when attempting to convert-hf-to-gguf the llama3 instruct model by Meta.

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
chktok: [198, 4815, 15073, 66597, 8004, 1602, 2355, 79772, 11187, 9468, 248, 222, 320, 8416, 8, 27623, 114, 102470, 9468, 234, 104, 31643, 320, 36773, 100166, 98634, 8, 26602, 227, 11410, 99, 247, 9468, 99, 247, 220, 18, 220, 1644, 220, 8765, 220, 8765, 18, 220, 8765, 1644, 220, 8765, 8765, 220, 8765, 8765, 18, 220, 8765, 8765, 1644, 220, 18, 13, 18, 220, 18, 497, 18, 220, 18, 1131, 18, 220, 21549, 222, 98629, 241, 45358, 233, 21549, 237, 45358, 224, 21549, 244, 21549, 115, 21549, 253, 45358, 223, 21549, 253, 21549, 95, 98629, 227, 76460, 223, 949, 37046, 101067, 19000, 23182, 102301, 9263, 18136, 16, 36827, 21909, 56560, 54337, 19175, 102118, 13373, 64571, 34694, 3114, 112203, 80112, 3436, 106451, 14196, 14196, 74694, 3089, 3089, 29249, 17523, 3001, 27708, 7801, 358, 3077, 1027, 364, 83, 820, 568, 596, 1070, 11, 364, 793, 499, 2771, 30, 364, 44, 539, 2771, 358, 3358, 1304, 433, 11, 364, 35, 499, 1093, 1063, 15600, 30, 1226, 6, 43712, 264, 64966, 43]
chkhsh: c136ed14d01c2745d4f60a9596ae66800e2b61fa45643e72436041855ad4089d


**************************************************************************************
** WARNING: The BPE pre-tokenizer was not recognized!
**          There are 2 possible reasons for this:
**          - the model has not been added to convert-hf-to-gguf-update.py yet
**          - the pre-tokenization config has changed upstream
**          Check your model files and convert-hf-to-gguf-update.py and update them accordingly.
** ref:     https://github.com/ggerganov/llama.cpp/pull/6920
**
** chkhsh:  c136ed14d01c2745d4f60a9596ae66800e2b61fa45643e72436041855ad4089d
**************************************************************************************

Edit: my llama3 model was outdated. Fetching latest version made this go away.

BramVanroy commented May 1, 2024

Is phi-2 included in the fixes? It uses CodeGenTokenizer, which is also a BPE tokenizer. Or should I avoid generating GGUFs for phi-2-based models because of the potential bad tokenization? Related issue: #7022

teleprint-me (Contributor) commented May 1, 2024

@BramVanroy I'm about to have my breakfast. I'll add it to my PR #7018 if it isn't merged by then. This is an issue with every model supported by the HF script. They all require a hash and pretokenizer in order to be validated. The quality of the output is degraded otherwise. I had to regen all the converted models I use. Spent last night uploading the ones I care about.

BramVanroy:

@teleprint-me I had some users tell me that sometimes generation degrades significantly after a while when using ollama. I can't reproduce it in plain Python, so I came looking for a potential issue with llama.cpp. I've generated the hash in #7022, so you can just copy that, I think.

teleprint-me (Contributor) commented May 1, 2024

@BramVanroy That's possibly just an Ollama issue. Model generation on latest llama.cpp is phenomenal. My micro pretrained models' output quality skyrocketed with this PR update for some reason. I think it depends on the model, because I tested phi 1, 2, and 3, llama 3, and mistral 7 v2, as well as stablelm 1.6.

MoonRide303:

Both llama.cpp (b2776) and koboldcpp (1.64) seem to be fine now, but ollama as of 0.1.32 still has tokenizer issues (ollama/ollama#4082).

Imaniac230:

Hi, I just want to clarify some things (I'm currently on c4ec9c0). It appears that there are currently two ways to successfully convert HF llama3 models to gguf:

  1. python convert.py <path-to-llama3-hf> --outtype f16 --vocab-type bpe (this doesn't add the new pre-tokenizer type field, so main spits out a warning during loading)
  2. python convert-hf-to-gguf.py <path-to-llama3-hf> --outtype f16

The original pytorch checkpoints from Meta have to be converted to HF as mentioned here: #6819 (I used the script from transformers version 4.41.0.dev0), or the HF version has to be downloaded from the Meta repo (ex. https://huggingface.co/meta-llama/Meta-Llama-3-8B) .

The tokenizer config in the Meta repo is slightly modified over the raw conversion:

  1. They removed the chat template from tokenizer_config.json -> https://huggingface.co/meta-llama/Meta-Llama-3-8B/commit/561487d18c41c76bcb5fc6cfb73a324982f04f47
  2. They changed the post-processor structure in tokenizer.json -> https://huggingface.co/meta-llama/Meta-Llama-3-8B/commit/cd892e8f4da1043d4b01d5ea182a2e8412bf658f
  3. They added parameters in the generation_config.json -> https://huggingface.co/meta-llama/Meta-Llama-3-8B/commit/1460c22666392e470910ce3d44ffeb2ab7dbd4df

This now creates a total of 4 possible ways to generate a gguf:

  1. Use the un-changed tokenizer with convert.py (without the pre-tokenizer type)
  2. Use the changed tokenizer with convert.py (without the pre-tokenizer type)
  3. Use the un-changed tokenizer with convert-hf-to-gguf.py (here the hash has to be updated with convert-hf-to-gguf-update.py)
  4. Use the changed tokenizer with convert-hf-to-gguf.py (here the hash already matches the output)

The generation_config.json always stays the same, but I'm assuming it is used only for the pytorch inference so it shouldn't matter?

I'm not really up-to-date on this stuff, but I assume (very naively) that the tokenizer changes just shifted around the logic to a different part, so they should be equivalent in the end? Or could this have any meaningful effect on the gguf results?

nkeilar commented May 3, 2024

I'm not sure of the best place to comment this, but the current llama-3 Q_4_M performance compared with Groq seems quite different. Maybe this is due to the quantization. Can anyone confirm? It's very clear to me, as CrewAI won't run with Ollama without additional prompt tweaks, vs just running on Groq.

x4080 commented May 5, 2024

@nkeilar Did you try using regular llama.cpp to test the prompt? I found that the output from the server and from regular llama.cpp is quite different (ollama's output is like the llama.cpp server's).

Edit: I compared to Groq too, and regular llama.cpp matches Groq; the server does not.

nopperl pushed a commit to nopperl/llama.cpp that referenced this pull request May 5, 2024
* merged the changes from deepseeker models to main branch

* Moved regex patterns to unicode.cpp and updated unicode.h

* Moved header files

* Resolved issues

* added and refactored unicode_regex_split and related functions

* Updated/merged the deepseek coder pr

* Refactored code

* Adding unicode regex mappings

* Adding unicode regex function

* Added needed functionality, testing remains

* Fixed issues

* Fixed issue with gpt2 regex custom preprocessor

* unicode : fix? unicode_wstring_to_utf8

* lint : fix whitespaces

* tests : add tokenizer tests for numbers

* unicode : remove redundant headers

* tests : remove and rename tokenizer test scripts

* tests : add sample usage

* gguf-py : reader prints warnings on duplicate keys

* llama : towards llama3 tokenization support (wip)

* unicode : shot in the dark to fix tests on Windows

* unicode : first try custom implementations

* convert : add "tokenizer.ggml.pre" GGUF KV (wip)

* llama : use new pre-tokenizer type

* convert : fix pre-tokenizer type writing

* lint : fix

* make : add test-tokenizer-0-llama-v3

* wip

* models : add llama v3 vocab file

* llama : adapt punctuation regex + add llama 3 regex

* minor

* unicode : set bomb

* unicode : set bomb

* unicode : always use std::wregex

* unicode : support \p{N}, \p{L} and \p{P} natively

* unicode : try fix windows

* unicode : category support via std::regex

* unicode : clean-up

* unicode : simplify

* convert : add convert-hf-to-gguf-update.py

ggml-ci

* lint : update

* convert : add falcon

ggml-ci

* unicode : normalize signatures

* lint : fix

* lint : fix

* convert : remove unused functions

* convert : add comments

* convert : exercise contractions

ggml-ci

* lint : fix

* cmake : refactor test targets

* tests : refactor vocab tests

ggml-ci

* tests : add more vocabs and tests

ggml-ci

* unicode : cleanup

* scripts : ignore new update script in check-requirements.sh

* models : add phi-3, mpt, gpt-2, starcoder

* tests : disable obsolete

ggml-ci

* tests : use faster bpe test

ggml-ci

* llama : more prominent warning for old BPE models

* tests : disable test-tokenizer-1-bpe due to slowness

ggml-ci

---------

Co-authored-by: Jaggzh <jaggz.h@gmail.com>
Co-authored-by: Kazim Abrar Mahi <kazimabrarmahi135@gmail.com>
@ggerganov ggerganov mentioned this pull request May 7, 2024
@mofosyne mofosyne added the enhancement New feature or request label May 9, 2024
@akx akx mentioned this pull request May 17, 2024
Labels: enhancement (New feature or request), high priority (Very important issue), need feedback (Testing and feedback with results are needed)

Projects: none

Linked issues: none