llama : improve BPE pre-processing + LLaMA 3 and Deepseek support #6920

Merged · 61 commits · Apr 29, 2024

Conversation

ggerganov (Owner) commented Apr 26, 2024

Continuing the work in #6252 by @dragnil1

This PR adds support for BPE pre-tokenization to llama.cpp

Summary

The state so far has been that for all BPE-based models, llama.cpp applied a default pre-tokenization inherited from GPT-2:

llama.cpp/llama.cpp

Lines 12186 to 12196 in e00b4a8

std::vector<std::string> bpe_gpt2_preprocess(const std::string & text) {
    std::vector<std::string> bpe_words;
    std::vector<std::string> bpe_encoded_words;

    std::string token = "";
    // GPT2 system regex: 's|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+
    bool collecting_numeric = false;
    bool collecting_letter = false;
    bool collecting_special = false;
    bool collecting_whitespace_lookahead = false;
    bool collecting = false;

This works most of the time, since most BPE models use similar pre-tokenization strategies. However, there are cases where it fails: #6914. This leads to poor generation quality, because the model starts to work with out-of-distribution data when the pre-tokenization splits the input string in the wrong way.
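
To make the failure mode concrete, here is a minimal sketch (not llama.cpp code) of what the pre-tokenizer does: it splits the input into chunks with a regex before any BPE merges are applied, so a mismatched regex hands the BPE stage different chunks than the model saw during training. The sketch assumes the third-party Python regex package, which supports the \p{L}/\p{N} Unicode categories that the stdlib re module lacks.

# minimal sketch: pre-tokenization = regex split applied before the BPE merges
# requires the third-party `regex` package for \p{L}/\p{N} Unicode categories
import regex

# the GPT-2 pattern that llama.cpp previously applied to all BPE models
GPT2_PRE = r"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"

def pre_tokenize(text: str) -> list[str]:
    # each chunk is byte-encoded and BPE-merged independently afterwards
    return regex.findall(GPT2_PRE, text)

print(pre_tokenize("Hello, world! 1314151"))
# ['Hello', ',', ' world', '!', ' 1314151']

With the LLaMA 3 pattern, for example, the digit run would instead be split into groups of at most three digits, which is exactly the kind of difference that produces different token sequences when the wrong regex is used.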

There are 2 main obstacles to introducing proper BPE pre-tokenization: the pre-tokenizer configuration lives in the model's tokenizer.json, and the split patterns rely on Unicode categories that the standard C++ regex facilities do not handle well. Both introducing a dedicated regex library and supporting complex JSON configurations are out of scope for llama.cpp. Therefore, this PR implements the following solution:

Details

  • Introduce new convert-hf-to-gguf-update.py script

    # This script downloads the tokenizer models of the specified models from Huggingface and
    # generates the get_vocab_base_pre() function for convert-hf-to-gguf.py
    #
    # This is necessary in order to analyze the type of pre-tokenizer used by the model and
    # provide the necessary information to llama.cpp via the GGUF header in order to implement
    # the same pre-tokenizer.
    #
    # ref: https://github.com/ggerganov/llama.cpp/pull/6920
    #
    # Instructions:
    #
    # - Add a new model to the "models" list
    # - Run the script with your huggingface token:
    #
    # python3 convert-hf-to-gguf-update.py <huggingface_token>
    #
    # - Copy-paste the generated get_vocab_base_pre() function into convert-hf-to-gguf.py
    # - Update llama.cpp with the new pre-tokenizer if necessary
    #

  • From now on, we start listing all supported models in it:

    # TODO: add models here, base models preferred
    models = [
        { "name": "llama-spm",      "tokt": TOKENIZER_TYPE.SPM, "repo": "https://huggingface.co/meta-llama/Llama-2-7b-hf", },
        { "name": "llama-bpe",      "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/meta-llama/Meta-Llama-3-8B", },
        { "name": "deepseek-llm",   "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/deepseek-ai/deepseek-llm-7b-base", },
        { "name": "deepseek-coder", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-base", },
        { "name": "falcon",         "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/tiiuae/falcon-7b", },
        { "name": "bert-bge",       "tokt": TOKENIZER_TYPE.WPM, "repo": "https://huggingface.co/BAAI/bge-small-en-v1.5", },
    ]

  • During conversion with convert-hf-to-gguf.py, if the hash of the tokens of a large string is not recognized, we prompt for an update of convert-hf-to-gguf-update.py

    # NOTE: this function is generated by convert-hf-to-gguf-update.py
    # do not modify it manually!
    # ref: https://github.com/ggerganov/llama.cpp/pull/6920
    def get_vocab_base_pre(self, tokenizer) -> str:
        # encoding this string and hashing the resulting tokens would (hopefully) give us a unique identifier that
        # is specific for the BPE pre-tokenizer used by the model
        # we will use this unique identifier to write a "tokenizer.ggml.pre" entry in the GGUF file which we can
        # use in llama.cpp to implement the same pre-tokenizer
        chktxt = '\n \n\n \n\n\n \t \t\t \t\n \n \n \n \n🚀 (normal) 😶\u200d🌫️ (multiple emojis concatenated) ✅ 🦙🦙 3 33 333 3333 33333 333333 3333333 33333333 3.3 3..3 3...3 កាន់តែពិសេសអាច😁 ?我想在apple工作1314151天~ ------======= нещо на Български \'\'\'\'\'\'```````""""......!!!!!!?????? I\'ve been \'told he\'s there, \'RE you sure? \'M not sure I\'ll make it, \'D you like some tea? We\'Ve a\'lL'

        chktok = tokenizer.encode(chktxt)
        chkhsh = sha256(str(chktok).encode()).hexdigest()

        print(f"chktok: {chktok}")
        print(f"chkhsh: {chkhsh}")

        res = None

        # NOTE: if you get an error here, you need to add the model to the if-elif chain below
        # don't do this manually - use the convert-hf-to-gguf-update.py script!
        if chkhsh == "0ef9807a4087ebef797fc749390439009c3b9eda9ad1a097abbe738f486c01e5":
            # ref: https://huggingface.co/meta-llama/Meta-Llama-3-8B
            res = "llama-bpe"
        if chkhsh == "049ecf7629871e3041641907f3de7c733e4dbfdc736f57d882ba0b0845599754":
            # ref: https://huggingface.co/deepseek-ai/deepseek-llm-7b-base
            res = "deepseek-llm"
        if chkhsh == "347715f544604f9118bb75ed199f68779f423cabb20db6de6f31b908d04d7821":
            # ref: https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-base
            res = "deepseek-coder"
        if chkhsh == "8aeee3860c56296a157a1fe2fad249ec40aa59b1bb5709f4ade11c4e6fe652ed":
            # ref: https://huggingface.co/tiiuae/falcon-7b
            res = "falcon"
        if chkhsh == "0876d13b50744004aa9aeae05e7b0647eac9d801b5ba4668afc01e709c15e19f":
            # ref: https://huggingface.co/BAAI/bge-small-en-v1.5
            res = "bert-bge"

        if res is None:
            print("\n")
            print("**************************************************************************************")
            print("** WARNING: The BPE pre-tokenizer was not recognized!")
            print("** This means that it was not added yet or you are using an older version.")
            print("** Check convert-hf-to-gguf-update.py and update it accordingly.")
            print("**")
            print(f"** chkhsh: {chkhsh}")
            print("**************************************************************************************")
            print("\n")
            raise NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")

        print(f"tokenizer.ggml.pre: {res}")
        print(f"chkhsh: {chkhsh}")

        return res

  • For now, this is required only for BPE models, since it seems SPM does not use pre-tokenization

  • The string used for the hashing should be extended to cover as much pre-tokenizer functionality as possible:

    # TODO: this string has to exercise as much pre-tokenizer functionality as possible
    # will be updated with time - contributions welcome
    chktxt = '\n \n\n \n\n\n \t \t\t \t\n \n \n \n \n🚀 (normal) 😶‍🌫️ (multiple emojis concatenated) ✅ 🦙🦙 3 33 333 3333 33333 333333 3333333 33333333 3.3 3..3 3...3 កាន់តែពិសេសអាច😁 ?我想在apple工作1314151天~ ------======= нещо на Български \'\'\'\'\'\'```````\"\"\"\"......!!!!!!?????? I\'ve been \'told he\'s there, \'RE you sure? \'M not sure I\'ll make it, \'D you like some tea? We\'Ve a\'lL'

  • Pre-tokenizer types are identified via a string written to the GGUF header:

    { LLM_KV_TOKENIZER_PRE, "tokenizer.ggml.pre" },

  • For each pre-tokenizer, we have to tell llama.cpp what pre-processing regexes to use:

    llama.cpp/llama.cpp

    Lines 12087 to 12141 in c21ab18

    std::vector<std::string> word_collection;
    switch (vocab.type) {
        case LLAMA_VOCAB_TYPE_BPE:
            switch (vocab.type_pre) {
                case LLAMA_VOCAB_PRE_TYPE_LLAMA3:
                    word_collection = unicode_regex_split(text, {
                        // original regex from tokenizer.json
                        //"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",

                        // adapted: https://github.com/ggerganov/llama.cpp/pull/6920#issuecomment-2080233989
                        "(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",
                    });
                    break;
                case LLAMA_VOCAB_PRE_TYPE_DEEPSEEK_LLM:
                    word_collection = unicode_regex_split(text, {
                        "[\r\n]",
                        "\\s?[A-Za-zµÀ-ÖØ-öø-ƺƼ-ƿDŽ-ʓʕ-ʯͰ-ͳͶͷͻ-ͽͿΆΈ-ΊΌΎ-ΡΣ-ϵϷ-ҁҊ-ԯԱ-ՖႠ-ჅᎠ-Ᏽᏸ-ᏽᲐ-ᲺᲽ-Ჿᴀ-ᴫᵫ-ᵷᵹ-ᶚḀ-ἕἘ-Ἕἠ-ὅὈ-Ὅὐ-ὗὙὛὝὟ-ώᾀ-ᾴᾶ-ᾼιῂ-ῄῆ-ῌῐ-ΐῖ-Ίῠ-Ῥῲ-ῴῶ-ῼℂℇℊ-ℓℕℙ-ℝℤΩℨK-ℭℯ-ℴℹℼ-ℿⅅ-ⅉⅎↃↄⰀ-ⱻⱾ-ⳤⳫ-ⳮⳲⳳꙀ-ꙭꚀ-ꚛꜢ-ꝯꝱ-ꞇꞋ-ꞎꭰ-ꮿff-stﬓ-ﬗA-Za-z𐐀-𐑏𐒰-𐓓𐓘-𐓻𐲀-𐲲𐳀-𐳲𑢠-𑣟𞤀-𞥃]+",
                        "\\s?[!-/:-~!-/:-~‘-‟ -。]+",
                        "\\s+$",
                        "[一-龥ࠀ-一가-퟿]+",
                        "\\p{N}+",
                    });
                    break;
                case LLAMA_VOCAB_PRE_TYPE_DEEPSEEK_CODER:
                    word_collection = unicode_regex_split(text, {
                        "[\r\n]",
                        "\\s?\\p{L}+",
                        "\\s?\\p{P}+",
                        "[一-龥ࠀ-一가-퟿]+",
                        "\\p{N}+",
                    });
                    break;
                case LLAMA_VOCAB_PRE_TYPE_FALCON:
                    word_collection = unicode_regex_split(text, {
                        "[\\p{P}\\$\\+<=>\\^~\\|]+",
                        "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
                        "\\p{N}+",
                        "[0-9][0-9][0-9]",
                    });
                    break;
                default:
                    // default regex for BPE tokenization pre-processing
                    word_collection = unicode_regex_split(text, {
                        "[\\p{P}\\$\\+<=>\\^~\\|]+",
                        "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
                        "\\p{N}+",
                        "[0-9][0-9][0-9]",
                    });
                    break;
            }
            break;
        default:
            GGML_ASSERT(false);
            break;
    }

    Here, we have to manually inspect the contents of the model's tokenizer.json and either reuse an existing set of regex patterns or add a new one corresponding to the new configuration. For a tutorial, see 120cf37. We verify correctness using the tests/test-tokenizer-0 program and the exported vocab for that model:

    make tests
    ./tests/test-tokenizer-0 ./models/ggml-vocab-llama-bpe.gguf
  • Old GGUF models using BPE tokenizers, generated before this change, will fall back to the "default" pre-tokenization, which in almost all cases is wrong. A warning is printed in the output:

    llama.cpp/llama.cpp

    Lines 4333 to 4352 in 80cb312

    // for now, only BPE models have pre-tokenizers
    if (vocab.type == LLAMA_VOCAB_TYPE_BPE) {
        if (tokenizer_pre.empty()) {
            LLAMA_LOG_WARN("%s: missing pre-tokenizer type, using: 'default'\n", __func__);
            LLAMA_LOG_WARN("%s: \n", __func__);
            LLAMA_LOG_WARN("%s: ************************************ \n", __func__);
            LLAMA_LOG_WARN("%s: GENERATION QUALITY WILL BE DEGRADED! \n", __func__);
            LLAMA_LOG_WARN("%s: CONSIDER REGENERATING THE MODEL \n", __func__);
            LLAMA_LOG_WARN("%s: ************************************ \n", __func__);
            LLAMA_LOG_WARN("%s: \n", __func__);
            vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_DEFAULT;
        } else if (
                tokenizer_pre == "default") {
            vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_DEFAULT;
        } else if (
                tokenizer_pre == "llama3"   ||
                tokenizer_pre == "llama-v3" ||
                tokenizer_pre == "llama-bpe") {
            vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_LLAMA3;
        } else if (

  • Although we now support pre-processing using regexes, there is also infrastructure to add custom splitting implementations for better performance:

    llama.cpp/unicode.cpp

    Lines 424 to 432 in c21ab18

    static std::vector<size_t> unicode_regex_split_custom(const std::string & text, const std::string & regex_expr, const std::vector<size_t> & offsets) {
        std::vector<size_t> bpe_offsets;

        if (regex_expr == "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)") {
            bpe_offsets = unicode_regex_split_custom_gpt2(text, offsets);
        }

        return bpe_offsets;
    }

    For example, there is already an attempt to add custom LLaMA v3 pre-tokenization: llama3 custom regex split #6965

  • The tokenizer tests have been refactored to allow easy addition of more tests and vocabs. Add tests here and run convert-hf-to-gguf-update.py to create input/output files for all known tokenizer models (a small spot-check sketch follows the CMake snippet below):

# generate tests for each tokenizer model
tests = [
"",
" ",
" ",
" ",
"\t",
"\n",
"\n\n",
"\n\n\n",
"\t\n",
"Hello world",
" Hello world",
"Hello World",
" Hello World",
" Hello World!",
"Hello, world!",
" Hello, world!",
" this is 🦙.cpp",
"w048 7tuijk dsdfhu",
"нещо на Български",
"កាន់តែពិសេសអាចខលចេញ",
"🚀 (normal) 😶‍🌫️ (multiple emojis concatenated) ✅ (only emoji that has its own token)",
"Hello",
" Hello",
" Hello",
" Hello",
" Hello",
" Hello\n Hello",
" (",
"\n =",
"' era",
"Hello, y'all! How are you 😁 ?我想在apple工作1314151天~",
"3",
"33",
"333",
"3333",
"33333",
"333333",
"3333333",
"33333333",
"333333333",
chktxt,
]

# build test-tokenizer-0 target once and add many tests
add_executable(test-tokenizer-0 test-tokenizer-0.cpp)
target_link_libraries(test-tokenizer-0 PRIVATE common)
install(TARGETS test-tokenizer-0 RUNTIME)
llama_test(test-tokenizer-0 NAME test-tokenizer-0-llama-spm ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-llama-spm.gguf)
llama_test(test-tokenizer-0 NAME test-tokenizer-0-llama-bpe ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-llama-bpe.gguf)
llama_test(test-tokenizer-0 NAME test-tokenizer-0-falcon ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-falcon.gguf)
llama_test(test-tokenizer-0 NAME test-tokenizer-0-deepseek-llm ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-deepseek-llm.gguf)
llama_test(test-tokenizer-0 NAME test-tokenizer-0-deepseek-coder ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-deepseek-coder.gguf)
llama_test(test-tokenizer-0 NAME test-tokenizer-0-bert-bge ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-bert-bge.gguf)
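
As a quick spot check while adding a new vocab, the reference token ids can also be printed directly from the upstream Hugging Face tokenizer and compared by eye with the test-tokenizer-0 output. This is only a hedged sketch, not part of the PR's test harness; the model name is an example, and gated repos may require a Hugging Face token.

# hedged sketch (not the PR's test harness): print reference ids for a few of the
# test strings above, to compare against:
#   ./tests/test-tokenizer-0 ./models/ggml-vocab-llama-bpe.gguf
from transformers import AutoTokenizer

samples = ["", " ", "Hello world", " Hello World!", "3333333", "нещо на Български"]

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")  # example repo

for s in samples:
    ids = tok.encode(s, add_special_tokens=False)
    print(f"{s!r:>30} -> {ids}")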

TODOs

  • Fix custom GPT-2 pre-processing bug:

    llama.cpp/unicode.cpp

    Lines 430 to 434 in 120cf37

    // TODO: this implementation is actually wrong, uncomment and run:
    // make -j && ./bin/test-tokenizer-0 ../models/ggml-vocab-gpt-2.gguf
    //if (regex_expr == "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)") {
    // bpe_offsets = unicode_regex_split_custom_gpt2(text, offsets);
    //}

  • Fix MPT pre-tokenization:

llama.cpp/llama.cpp

Lines 12136 to 12146 in 120cf37

    case LLAMA_VOCAB_PRE_TYPE_MPT:
        // TODO: MPT pre-tokenization regexes are unknown
        //       the following are close, but not exact. run the following:
        //       ./bin/test-tokenizer-0 ../models/ggml-vocab-mpt.gguf
        GGML_ASSERT("MPT pre-tokenization regexes are unknown - fixes needed");
        word_collection = unicode_regex_split(text, {
            "\\s?\\p{L}+",
            "\\s?\\p{P}+",
            "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
        });
        break;


unicode.cpp Outdated
Comment on lines 206 to 207
-static inline std::string unicode_wstring_to_utf8(const std::wstring & ws)
-{
-    // code to convert from utf32/utf16 to utf8
-    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> converter;
-    std::string utf8 = converter.to_bytes(ws);
-    return utf8;
+static inline std::string unicode_wstring_to_utf8(const std::wstring & ws) {
+    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
+    return conv.to_bytes(ws);
ggerganov (Owner, Author):

@dragnil1 Not sure if this is the intent, but the following change of this function makes the tokenizer tests pass on my Mac. Do you think this is OK to change?

Contributor:

This change converts a UCS-2 or UCS-4/UTF-32 encoded std::wstring to a UTF-8 encoded std::string, whereas the previous version converted a UTF-16 encoded std::wstring to UTF-8, according to the reference. Both work on Ubuntu (tested), but I am not sure about Windows, as it uses UTF-16 encoded std::wstring.

llama.cpp Outdated
Comment on lines 12011 to 12052
std::vector<std::string> word_collection;
switch (vocab.type) {
    case LLAMA_VOCAB_TYPE_BPE:
        switch (vocab.arch) {
            // TODO: how to detect deepseek and llama v3 models?
            //case LLM_ARCH_LLAMA:
            //case LLM_ARCH_DEEPSEEK_CODER:
            //    word_collection = unicode_regex_split(text, {
            //        "[\r\n]",
            //        "\\s?\\p{L}+",
            //        "\\s?\\p{P}+",
            //        "[一-龥ࠀ-一가-퟿]+",
            //        "\\p{N}+"
            //    });
            //    break;
            //case LLM_ARCH_DEEPSEEK_LLM:
            //    word_collection = unicode_regex_split(text, {
            //        "[\r\n]",
            //        "\\s?[A-Za-zµÀ-ÖØ-öø-ƺƼ-ƿDŽ-ʓʕ-ʯͰ-ͳͶͷͻ-ͽͿΆΈ-ΊΌΎ-ΡΣ-ϵϷ-ҁҊ-ԯԱ-ՖႠ-ჅᎠ-Ᏽᏸ-ᏽᲐ-ᲺᲽ-Ჿᴀ-ᴫᵫ-ᵷᵹ-ᶚḀ-ἕἘ-Ἕἠ-ὅὈ-Ὅὐ-ὗὙὛὝὟ-ώᾀ-ᾴᾶ-ᾼιῂ-ῄῆ-ῌῐ-ΐῖ-Ίῠ-Ῥῲ-ῴῶ-ῼℂℇℊ-ℓℕℙ-ℝℤΩℨK-ℭℯ-ℴℹℼ-ℿⅅ-ⅉⅎↃↄⰀ-ⱻⱾ-ⳤⳫ-ⳮⳲⳳꙀ-ꙭꚀ-ꚛꜢ-ꝯꝱ-ꞇꞋ-ꞎꭰ-ꮿff-stﬓ-ﬗA-Za-z𐐀-𐑏𐒰-𐓓𐓘-𐓻𐲀-𐲲𐳀-𐳲𑢠-𑣟𞤀-𞥃]+",
            //        "\\s?[!-/:-~!-/:-~‘-‟ -。]+",
            //        "\\s+$",
            //        "[一-龥ࠀ-一가-퟿]+",
            //        "\\p{N}+"
            //    });
            //    break;
            default:
                // default regex for BPE tokenization pre-processing
                {
                    word_collection = unicode_regex_split(text, {
                        "\\p{P}+",
                        "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
                        "\\p{N}+",
                        "[0-9][0-9][0-9]"
                    });
                }
                break;
        }
        break;
    default:
        GGML_ASSERT(false);
        break;
}
ggerganov (Owner, Author):

This is the missing part - how to distinguish models from one another?

For example, all LLaMA, Deepseek Coder and Deepseek LLM models have the same architecture:

  "architectures": [
    "LlamaForCausalLM"
  ],

There seems to be no way to automatically determine which model we are converting. Therefore, there is no way to automatically determine the correct regex to use.

Seems we will have to rely on some heuristics based on the rest of the parameters, such as vocab size and tensor sizes. Not great


Sorry if I'm new to the inner workings of llama.cpp and get something wrong, but is vocab.arch coming from the gguf_metadata_kv_t in the gguf?

If it's not coming from there, would it be reasonable to add it as a key in the gguf? Then the file could specify what it needs, and llama.cpp could use that, or otherwise just fallback to the current behavior.

The gguf specification talks about how it "is designed to be unambiguous by containing all the information needed to load a model", and this seems like information needed to load a model.

ggerganov (Owner, Author):

The problem is that when creating the GGUF file in the first place (i.e. during conversion from HF to GGUF) there is no way to know which model we are dealing with. For example, take the LLaMA 3 and DeepSeek LLM models shown below: both use the LLaMA architecture and a BPE tokenizer, so currently they will be interpreted as the same arch by llama.cpp.

However, they use different pre-tokenizers:

LLaMA:

  "normalizer": null,
  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {
        "type": "Split",
        "pattern": {
          "Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "ByteLevel",
        "add_prefix_space": false,
        "trim_offsets": true,
        "use_regex": false
      }
    ]
  },

DeepSeek LLM:

  "normalizer": {
    "type": "Sequence",
    "normalizers": []
  },
  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {
        "type": "Split",
        "pattern": {
          "Regex": "[\r\n]"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Split",
        "pattern": {
          "Regex": "\\s?[A-Za-zµÀ-ÖØ-öø-ƺƼ-ƿDŽ-ʓʕ-ʯͰ-ͳͶͷͻ-ͽͿΆΈ-ΊΌΎ-ΡΣ-ϵϷ-ҁҊ-ԯԱ-ՖႠ-ჅᎠ-Ᏽᏸ-ᏽᲐ-ᲺᲽ-Ჿᴀ-ᴫᵫ-ᵷᵹ-ᶚḀ-ἕἘ-Ἕἠ-ὅὈ-Ὅὐ-ὗὙὛὝὟ-ώᾀ-ᾴᾶ-ᾼιῂ-ῄῆ-ῌῐ-ΐῖ-Ίῠ-Ῥῲ-ῴῶ-ῼℂℇℊ-ℓℕℙ-ℝℤΩℨK-ℭℯ-ℴℹℼ-ℿⅅ-ⅉⅎↃↄⰀ-ⱻⱾ-ⳤⳫ-ⳮⳲⳳꙀ-ꙭꚀ-ꚛꜢ-ꝯꝱ-ꞇꞋ-ꞎꭰ-ꮿff-stﬓ-ﬗA-Za-z𐐀-𐑏𐒰-𐓓𐓘-𐓻𐲀-𐲲𐳀-𐳲𑢠-𑣟𞤀-𞥃]+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Split",
        "pattern": {
          "Regex": "\\s?[!-/:-~!-/:-~‘-‟ -。]+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Split",
        "pattern": {
          "Regex": "\\s+$"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Split",
        "pattern": {
          "Regex": "[一-龥ࠀ-一가-퟿]+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Digits",
        "individual_digits": true
      },
      {
        "type": "ByteLevel",
        "add_prefix_space": false,
        "trim_offsets": true,
        "use_regex": false
      }
    ]
  },

So maybe we have to start parsing this information from the tokenizer.json and use it to determine the correct arch. Not sure yet

ggerganov (Owner, Author) commented Apr 26, 2024:

Thinking more about this, I'm starting to consider the option where we tokenize a few strings during conversion and based on the resulting tokens we add a new enum to the GGUF header indicating the pre-tokenizer type. In llama.cpp we will have custom implementations of each pre-tokenizer type with a fallback to some default pre-tokenizer (as we already do)

In the convert script, if the strings tokenize to an unknown set of tokens, we stop with an error asking the developer to check the pre-tokenizer configuration and either assign an existing one or add a new one to the enum


Regarding the "option where we tokenize a few strings during conversion": it looks like a pretty messy solution. Maybe it's better to go with parsing tokenizer.json and making an alternative implementation in C++?

ggerganov (Owner, Author):

Here is a prototype of the idea above:

def get_vocab_base_pre(self, tokenizer) -> str:
    # encoding this string and hashing the resulting tokens would (hopefully) give us a unique identifier that
    # is specific for the BPE pre-tokenizer used by the model
    # we will use this unique identifier to write a "tokenizer.ggml.pre" entry in the GGUF file which we can
    # use in llama.cpp to implement the same pre-tokenizer
    chktxt = "\n \n\n \n\n\n \t \t\t \t\n  \n   \n    \n     \n🚀 (normal) 😶‍🌫️ (multiple emojis concatenated) ✅ 🦙🦙 3 33 333 3333 33333 333333 3333333 33333333 3.3 3..3 3...3 កាន់តែពិសេសអាច😁 ?我想在apple工作1314151天~ ------======= нещо на Български what's ''''''```````\"\"\"\"......!!!!!!??????"

    chktok = tokenizer.encode(chktxt)
    chkhsh = hash(tuple(chktok))

    print(f"chktok: {chktok}")
    print(f"chkhsh: {chkhsh}")

    res = None

    # NOTE: if you get an error here, you need to add the model to the if-elif chain below
    # observe the stdout for the chkhsh value and add it to the chain
    if self.model_arch == gguf.MODEL_ARCH.LLAMA:
        if chkhsh == -3290901550109860290:
            # ref: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/blob/main/tokenizer.json
            res = "llama3"
        if chkhsh == 4190561703949727616:
            # ref: https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct/blob/main/tokenizer.json
            res = "deepseek-coder"

    if res is None:
        raise NotImplementedError(f"BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")

Feedback is welcome

Contributor:

Yeah, I went through each of these steps a while back and realized there's no way to do it unless I create my own way of doing it. Automating it was just not an option. For detection though, why not just create a hash sum of the encodings as a list? e.g. hash(tuple((k, v) for k, v in tokenizer.model.vocab.items())).

Could probably do this for any model file. The only issue is knowing the sum value in advance, which means they'd need to be added manually. This would need to include added tokens and any other required misc files.
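
A hedged sketch of that suggestion (not what this PR ships): fingerprint the vocabulary itself. One caveat is that Python's built-in hash() is randomized per process for strings, so a cryptographic digest is needed if the value is going to be hard-coded in the converter; the model path below is a placeholder.

# hedged sketch of the vocab-hash idea (not what the PR implements)
from hashlib import sha256
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/model")  # placeholder path

# sort so the digest does not depend on dict iteration order
vocab_items = sorted(tokenizer.get_vocab().items())
vocab_hash = sha256(str(vocab_items).encode()).hexdigest()
print(vocab_hash)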

github-actions bot commented Apr 26, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 425 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=11122.45ms p(95)=33038.18ms fails=, finish reason: stop=374 truncated=51
  • Prompt processing (pp): avg=121.76tk/s p(95)=561.52tk/s
  • Token generation (tg): avg=25.61tk/s p(95)=37.24tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=gg/bpe-preprocess commit=80cb3127df55a05a7688797ae5b46be8c0b6a8cf

[Benchmark charts omitted: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio and requests_processing over the 10m run (425 iterations).]


teleprint-me (Contributor) commented May 1, 2024

This implementation conflicts with StableLMForCausalLM.

21:03:35 | /mnt/valerie/remote/ggerganov/llama.cpp
(.venv) git:(master | Δ) λ python convert-hf-to-gguf.py /mnt/valerie/models/stabilityai/stablelm-2-1_6b-chat                                                                            
Loading model: stablelm-2-1_6b-chat
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
chktok: [198, 4815, 15073, 66597, 8004, 1602, 2355, 79772, 11187, 9468, 248, 222, 320, 8416, 8, 27623, 114, 378, 235, 9468, 234, 104, 31643, 320, 36773, 100166, 98634, 8, 26602, 227, 11410, 99, 247, 9468, 99, 247, 220, 18, 220, 18, 18, 220, 18, 18, 18, 220, 18, 18, 18, 18, 220, 18, 18, 18, 18, 18, 220, 18, 18, 18, 18, 18, 18, 220, 18, 18, 18, 18, 18, 18, 18, 220, 18, 18, 18, 18, 18, 18, 18, 18, 220, 18, 13, 18, 220, 18, 497, 18, 220, 18, 1131, 18, 220, 21549, 222, 98629, 241, 45358, 233, 21549, 237, 45358, 224, 21549, 244, 21549, 115, 21549, 253, 45358, 223, 21549, 253, 21549, 95, 98629, 227, 76460, 223, 949, 37046, 33565, 111, 19000, 23182, 49792, 19967, 16, 18, 16, 19, 16, 20, 16, 36827, 21909, 56560, 54337, 19175, 14476, 1482, 13373, 64571, 34694, 3114, 15752, 17721, 80112, 3436, 4708, 4708, 14196, 14196, 74694, 3089, 3089, 29249, 17523, 3001, 27708, 7801, 358, 3077, 1027, 364, 83, 820, 568, 596, 1070, 11, 364, 793, 499, 2771, 30, 364, 44, 539, 2771, 358, 3358, 1304, 433, 11, 364, 35, 499, 1093, 1063, 15600, 30, 1226, 6, 43712, 264, 64966, 43]
chkhsh: 32d85c31273f8019248f2559fed492d929ea28b17e51d81d3bb36fff23ca72b3


**************************************************************************************
** WARNING: The BPE pre-tokenizer was not recognized!
**          There are 2 possible reasons for this:
**          - the model has not been added to convert-hf-to-gguf-update.py yet
**          - the pre-tokenization config has changed upstream
**          Check your model files and convert-hf-to-gguf-update.py and update them accordingly.
** ref:     https://github.com/ggerganov/llama.cpp/pull/6920
**
** chkhsh:  32d85c31273f8019248f2559fed492d929ea28b17e51d81d3bb36fff23ca72b3
**************************************************************************************


Traceback (most recent call last):
  File "/mnt/valerie/remote/ggerganov/llama.cpp/convert-hf-to-gguf.py", line 3001, in <module>
    main()
  File "/mnt/valerie/remote/ggerganov/llama.cpp/convert-hf-to-gguf.py", line 2988, in main
    model_instance.set_vocab()
  File "/mnt/valerie/remote/ggerganov/llama.cpp/convert-hf-to-gguf.py", line 1280, in set_vocab
    self._set_vocab_gpt2()
  File "/mnt/valerie/remote/ggerganov/llama.cpp/convert-hf-to-gguf.py", line 331, in _set_vocab_gpt2
    tokens, toktypes, tokpre = self.get_vocab_base()
                               ^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/valerie/remote/ggerganov/llama.cpp/convert-hf-to-gguf.py", line 242, in get_vocab_base
    tokpre = self.get_vocab_base_pre(tokenizer)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/valerie/remote/ggerganov/llama.cpp/convert-hf-to-gguf.py", line 323, in get_vocab_base_pre
    raise NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")
NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()

The GPT-2 hash isn't the same because the vocab differs.

>>> chktxt = '\n \n\n \n\n\n \t \t\t \t\n  \n   \n    \n     \n🚀 (normal) 😶\u200d🌫️ (multiple emojis concatenated) ✅ 🦙🦙 3 33 333 3333 33333 333333 3333333 33333333 3.3 3..3 3...3 កាន់តែពិសេសអាច😁 ?我想在apple工作1314151天~ ------======= нещо на Български \'\'\'\'\'\'```````""""......!!!!!!?????? I\'ve been \'told he\'s there, \'RE you sure? \'M not sure I\'ll make it, \'D you like some tea? We\'Ve a\'lL'
>>> chktok = tokenizer.encode(chktxt)
>>> chkhsh = sha256(str(chktok).encode()).hexdigest()
>>> print(f"chktok: {chktok}")
chktok: [198, 4815, 15073, 66597, ...]
>>> print(f"chkhsh: {chkhsh}")
chkhsh: 32d85c31273f8019248f2559fed492d929ea28b17e51d81d3bb36fff23ca72b3
>>> type(tokenizer)
<class 'transformers.models.gpt2.tokenization_gpt2_fast.GPT2TokenizerFast'>

Also, the fallback is to use Qwen's vocab for 1.6B models, which invokes the self._set_vocab_gpt2() method call.

if chkhsh == "3ce83efda5659b07b1ad37ca97ca5797ea4285d9b9ab0dc679e4a720c9da7454":
    # ref: https://huggingface.co/openai-community/gpt2
    res = "gpt-2"
if chkhsh == "32d85c31273f8019248f2559fed492d929ea28b17e51d81d3bb36fff23ca72b3":
    res = "gpt-2"  # StableForCausalLM

I mentioned this earlier, but I'll explicitly state it here now because it should allow for dynamically handling the hashes. Manually referencing them will still be required though.

21:18:18 | /mnt/valerie/remote/ggerganov/llama.cpp
(.venv) git:(master | Δ) λ bpython
bpython version 0.24 on top of Python 3.12.3 /mnt/valerie/remote/ggerganov/llama.cpp/.venv/bin/python
>>> from pathlib import Path
>>> from hashlib import sha256
>>> from transformers import AutoTokenizer
>>> 
>>> model_path = Path("/mnt/valerie/models/stabilityai/stablelm-2-1_6b-chat")
>>> model_path.exists()
True
>>> tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
>>> tokenizer.name_or_path
'/mnt/valerie/models/stabilityai/stablelm-2-1_6b-chat'
>>> len(tokenizer.vocab)
100289
>>> vocab_hashsum = sha256(str(tuple((k, v) for k, v in tokenizer.vocab.items())).encode()).hexdigest()
>>> vocab_hashsum
'b7a950fbf72c9984f3f3c5f3481e5eff112d647bca38349d00523fc73bc4073e'

The downside to this approach, which is probably why it was implemented the way it was, is that there is a definite and noticeable latency when reading the vocab this way, but that only happens the first time: the result is cached, and the latency is negligible on repeated calls.

The upside to this approach is that the vocab is dynamically handled and read in from the model path. This should be applicable to any vocabulary of any type.

The algorithm for BPE is really simple at its core.

"""
examples/slow_bpe.py - Neural Machine Translation of Rare Words with Subword Units

Byte Pair Encoding (BPE) Tokenization for Natural Language Processing

Paper: https://arxiv.org/abs/1508.07909v5
"""
import argparse
import collections
import json
import re
from typing import Dict, Tuple


def get_stats(vocab: Dict[str, int]) -> Dict[Tuple[str, str], int]:
    """
    Calculate frequencies of pairs of adjacent symbols in the vocabulary.

    Args:
        vocab (dict): Dictionary with space-separated symbols as keys and frequencies as values

    Returns:
        dict: Dictionary of symbol pairs (tuple) and their combined frequency
    """
    symbol_pairs_frequency = collections.defaultdict(int)

    for word, frequency in vocab.items():
        symbols = word.split()
        for index in range(len(symbols) - 1):
            symbol_pair = (symbols[index], symbols[index + 1])
            symbol_pairs_frequency[symbol_pair] += frequency

    return symbol_pairs_frequency


def merge_vocab(
    symbol_pair: Tuple[str, str], input_vocab: Dict[str, int]
) -> Dict[str, int]:
    """
    Merge a given pair of symbols in the vocabulary.

    Args:
        symbol_pair (tuple): Tuple of two strings, the pair of symbols to merge
        input_vocab (dict): Input vocabulary

    Returns:
        dict: New vocabulary with the specified pair merged
    """
    output_vocab = {}
    bigram = re.escape(" ".join(symbol_pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")

    for word in input_vocab:
        merged_word = pattern.sub("".join(symbol_pair), word)
        output_vocab[merged_word] = input_vocab[word]

    return output_vocab


def load_vocab_from_json(json_file: str) -> Dict[str, int]:
    """
    Load a vocabulary from a JSON file.

    Args:
        json_file (str): Path to the JSON file containing the vocabulary.

    Returns:
        dict: Vocabulary loaded from the JSON file.
    """
    try:
        with open(json_file, "r") as file:
            vocab = json.load(file)
    except json.JSONDecodeError as e:
        raise ValueError(f"Error loading JSON vocabulary: {e}") from e

    return vocab


def main(args: argparse.Namespace) -> None:
    if args.vocab_json:
        vocab = load_vocab_from_json(args.vocab_json)
    else:
        # Default vocabulary
        vocab = {
            "l o w </w>": 5,
            "l o w e r </w>": 2,
            "n e w e s t </w>": 6,
            "w i d e s t </w>": 3,
        }

    for i in range(args.num_merges):
        pairs = get_stats(vocab)
        best = max(pairs, key=pairs.get)
        vocab = merge_vocab(best, vocab)
        print(best)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    parser.add_argument(
        "--num_merges",
        type=int,
        default=10,
        help="Number of BPE merges (default is 10)",
    )

    parser.add_argument(
        "--vocab_json",
        type=str,
        help="Path to a JSON file containing the initial vocabulary (optional)",
    )

    args = parser.parse_args()

    main(args)
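
For reference, running the script as-is (e.g. python examples/slow_bpe.py --num_merges 4, per the path in its docstring) prints the learned merges in order; with the built-in sample vocabulary, and given Python's insertion-ordered dicts for tie-breaking, the first ones come out as ('e', 's'), ('es', 't'), ('est', '</w>') and ('l', 'o').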

The issue is the implementation details from tokenizer to tokenizer. I won't have any free time for quite some time, and this was time consuming, as there isn't much information on this specifically on the net (that I could find). Any other implementations are extremely complicated, as they rely on concepts such as word2vec and other well-known methods.

It might be worth considering a proper custom implementation for gguf models. I mention this because sophisticated regular expression patterns are truly overkill IMHO. A simple lexer and parser should suffice for more complex implementations. This is something I've only toyed around with in theory, though.

sealad886 commented May 1, 2024

Hoping for a bit of help with Command-R-Plus please!

I've updated the files as noted here (https://github.com/ggerganov/llama.cpp/pull/6920), and then I've re-made everything with make -B tests (unconditionally make all targets) more than once. I've made sure that my model from HF is up to date. I've re-converted the model twice.

It's still failing on the tests though, and I can't figure out why.

As far as I can tell, Command-R-Plus has its own:

tokenizer.ggml.pre: command-r-plus

And then in the (admittedly far too long) tokenizer.json file, it looks very similar to Starcoder:

 "normalizer": {
        "type": "NFC"
    },
    "pre_tokenizer": {
        "type": "Sequence",
        "pretokenizers": [
            {
                "type": "Digits",
                "individual_digits": true
            },
            {
                "type": "ByteLevel",
                "add_prefix_space": false,
                "trim_offsets": true,
                "use_regex": true
            }
        ]
    },
    "post_processor": {
        "add_prefix_space": true,
        "trim_offsets": false,
        "use_regex": true,
        "type": "TemplateProcessing",
        "single": [
            {

And Starcoder just had a blank switch statement in the regex part of llama.cpp:

case LLAMA_VOCAB_PRE_TYPE_MPT:
    // TODO: MPT pre-tokenization regexes are unknown
    //       the following are close, but not exact. run the following:
    //       ./bin/test-tokenizer-0 ../models/ggml-vocab-mpt.gguf
    GGML_ASSERT("MPT pre-tokenization regexes are unknown - fixes needed");
    word_collection = unicode_regex_split(text, {
        "\\s?\\p{L}+",
        "\\s?\\p{P}+",
        "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
    });
    break;
case LLAMA_VOCAB_PRE_TYPE_STARCODER:
case LLAMA_VOCAB_PRE_TYPE_GPT2:
    word_collection = unicode_regex_split(text, {
        "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
    });
    break;
case LLAMA_VOCAB_PRE_TYPE_COMMANDRPLUS:
default:
    // default regex for BPE tokenization pre-processing
    word_collection = unicode_regex_split(text, {
        "[\\p{P}\\$\\+<=>\\^~\\|]+",
        "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
        "\\p{N}+",
        "[0-9][0-9][0-9]",
    });
    break;

teleprint-me (Contributor) commented May 1, 2024

Starcoder uses the GPT-2 tokenizer, so it's not blank. A lot of models actually use the GPT-2 tokenizer. To add it, we need the hash sum output by the conversion script when the error is raised (a sketch of how to compute it locally is below).
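
A hedged sketch of how to get that value locally (mirroring the REPL session earlier in the thread): run the same encode-and-hash step as get_vocab_base_pre() against the model directory. The path is a placeholder, and chktxt must be copied verbatim from convert-hf-to-gguf.py.

# hedged sketch: reproduce the converter's chkhsh for a local model so it can be
# reported and added to convert-hf-to-gguf-update.py
from hashlib import sha256
from transformers import AutoTokenizer

chktxt = '...'  # paste the exact chktxt string from get_vocab_base_pre()

tokenizer = AutoTokenizer.from_pretrained("path/to/c4ai-command-r-plus", trust_remote_code=True)  # placeholder
chktok = tokenizer.encode(chktxt)
print("chkhsh:", sha256(str(chktok).encode()).hexdigest())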

USBhost commented May 1, 2024

Has anyone been able to convert C4AI Command R+ after this PR?

kallewoof commented May 1, 2024

Edit: my llama3 files were outdated; re-downloading the json files fixes this. I am seeing the following when attempting to convert-hf-to-gguf the llama3 instruct model by Meta.

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
chktok: [198, 4815, 15073, 66597, 8004, 1602, 2355, 79772, 11187, 9468, 248, 222, 320, 8416, 8, 27623, 114, 102470, 9468, 234, 104, 31643, 320, 36773, 100166, 98634, 8, 26602, 227, 11410, 99, 247, 9468, 99, 247, 220, 18, 220, 1644, 220, 8765, 220, 8765, 18, 220, 8765, 1644, 220, 8765, 8765, 220, 8765, 8765, 18, 220, 8765, 8765, 1644, 220, 18, 13, 18, 220, 18, 497, 18, 220, 18, 1131, 18, 220, 21549, 222, 98629, 241, 45358, 233, 21549, 237, 45358, 224, 21549, 244, 21549, 115, 21549, 253, 45358, 223, 21549, 253, 21549, 95, 98629, 227, 76460, 223, 949, 37046, 101067, 19000, 23182, 102301, 9263, 18136, 16, 36827, 21909, 56560, 54337, 19175, 102118, 13373, 64571, 34694, 3114, 112203, 80112, 3436, 106451, 14196, 14196, 74694, 3089, 3089, 29249, 17523, 3001, 27708, 7801, 358, 3077, 1027, 364, 83, 820, 568, 596, 1070, 11, 364, 793, 499, 2771, 30, 364, 44, 539, 2771, 358, 3358, 1304, 433, 11, 364, 35, 499, 1093, 1063, 15600, 30, 1226, 6, 43712, 264, 64966, 43]
chkhsh: c136ed14d01c2745d4f60a9596ae66800e2b61fa45643e72436041855ad4089d


**************************************************************************************
** WARNING: The BPE pre-tokenizer was not recognized!
**          There are 2 possible reasons for this:
**          - the model has not been added to convert-hf-to-gguf-update.py yet
**          - the pre-tokenization config has changed upstream
**          Check your model files and convert-hf-to-gguf-update.py and update them accordingly.
** ref:     https://github.com/ggerganov/llama.cpp/pull/6920
**
** chkhsh:  c136ed14d01c2745d4f60a9596ae66800e2b61fa45643e72436041855ad4089d
**************************************************************************************

Edit: my llama3 model was outdated. Fetching latest version made this go away.

BramVanroy commented May 1, 2024

Is phi-2 included in the fixes? It uses CodeGenTokenizer, which is also a BPE tokenizer. Or should I avoid generating GGUFs for phi-2-based models because of the potential bad tokenization? Related issue: #7022

teleprint-me (Contributor) commented May 1, 2024

@BramVanroy I'm about to have my breakfast. I'll add it to my PR #7018 if it isn't merged by then. This is an issue with every model supported by the HF script. They all require a hash and pretokenizer in order to be validated. The quality of the output is degraded otherwise. I had to regen all the converted models I use. Spent last night uploading the ones I care about.

BramVanroy:

@teleprint-me I had some users tell me that sometimes generation degrades significantly after a while when using ollama. I can't reproduce it in plain Python, so I came looking for a potential issue with llama.cpp. I've generated the hash in #7022, so you can just copy that, I think.

teleprint-me (Contributor) commented May 1, 2024

@BramVanroy That's possibly just an Ollama issue. Model generation on latest llama.cpp is phenomenal. My micro pretrained models' output quality skyrocketed with this PR update for some reason. I think it depends on the model, because I tested phi 1, 2, and 3, llama 3, and mistral 7 v2, as well as stablelm 1.6.

MoonRide303:

Both llama.cpp (b2776) and koboldcpp (1.64) seem to be fine now, but ollama as of 0.1.32 still has tokenizer issues (ollama/ollama#4082).

Imaniac230:

Hi, I just want to clarify some things (I'm currently on c4ec9c0). It appears that there are currently two ways to successfully convert HF llama3 models to gguf:

  1. python convert.py <path-to-llama3-hf> --outtype f16 --vocab-type bpe (this doesn't add the new pre-tokenizer type field, so main spits out a warning during loading)
  2. python convert-hf-to-gguf.py <path-to-llama3-hf> --outtype f16

The original pytorch checkpoints from Meta have to be converted to HF as mentioned here: #6819 (I used the script from transformers version 4.41.0.dev0), or the HF version has to be downloaded from the Meta repo (ex. https://huggingface.co/meta-llama/Meta-Llama-3-8B) .

The tokenizer config in the Meta repo is slightly modified over the raw conversion:

  1. They removed the chat template from tokenizer_config.json -> https://huggingface.co/meta-llama/Meta-Llama-3-8B/commit/561487d18c41c76bcb5fc6cfb73a324982f04f47
  2. They changed the post-processor structure in tokenizer.json -> https://huggingface.co/meta-llama/Meta-Llama-3-8B/commit/cd892e8f4da1043d4b01d5ea182a2e8412bf658f
  3. They added parameters in the generation_config.json -> https://huggingface.co/meta-llama/Meta-Llama-3-8B/commit/1460c22666392e470910ce3d44ffeb2ab7dbd4df

This now creates a total of 4 possible ways to generate a gguf:

  1. Use the un-changed tokenizer with convert.py (without the pre-tokenizer type)
  2. Use the changed tokenizer with convert.py (without the pre-tokenizer type)
  3. Use the un-changed tokenizer with convert-hf-to-gguf.py (here the hash has to be updated with convert-hf-to-gguf-update.py)
  4. Use the changed tokenizer with convert-hf-to-gguf.py (here the hash already matches the output)

The generation_config.json always stays the same, but I'm assuming it is used only for the pytorch inference so it shouldn't matter?

I'm not really up-to-date on this stuff, but I assume (very naively) that the tokenizer changes just shifted around the logic to a different part, so they should be equivalent in the end? Or could this have any meaningful effect on the gguf results?

nkeilar commented May 3, 2024

I'm not sure of the best place to comment this, but the current llama-3 Q_4_M performance compared with Groq seems quite different. Maybe this is due to the quantization. Can anyone confirm? It's very clear to me, as CrewAI won't run with Ollama without additional prompt tweaks, vs just running on Groq.

x4080 commented May 5, 2024

@nkeilar Did you try using regular llama.cpp to test the prompt? I found that the output from the server and from regular llama.cpp is quite different (ollama's output is like the llama.cpp server's).

Edit: I compared to Groq too, and regular llama.cpp matches Groq; the server does not.

nopperl pushed a commit to nopperl/llama.cpp that referenced this pull request May 5, 2024
* merged the changes from deepseeker models to main branch

* Moved regex patterns to unicode.cpp and updated unicode.h

* Moved header files

* Resolved issues

* added and refactored unicode_regex_split and related functions

* Updated/merged the deepseek coder pr

* Refactored code

* Adding unicode regex mappings

* Adding unicode regex function

* Added needed functionality, testing remains

* Fixed issues

* Fixed issue with gpt2 regex custom preprocessor

* unicode : fix? unicode_wstring_to_utf8

* lint : fix whitespaces

* tests : add tokenizer tests for numbers

* unicode : remove redundant headers

* tests : remove and rename tokenizer test scripts

* tests : add sample usage

* gguf-py : reader prints warnings on duplicate keys

* llama : towards llama3 tokenization support (wip)

* unicode : shot in the dark to fix tests on Windows

* unicode : first try custom implementations

* convert : add "tokenizer.ggml.pre" GGUF KV (wip)

* llama : use new pre-tokenizer type

* convert : fix pre-tokenizer type writing

* lint : fix

* make : add test-tokenizer-0-llama-v3

* wip

* models : add llama v3 vocab file

* llama : adapt punctuation regex + add llama 3 regex

* minor

* unicode : set bomb

* unicode : set bomb

* unicode : always use std::wregex

* unicode : support \p{N}, \p{L} and \p{P} natively

* unicode : try fix windows

* unicode : category support via std::regex

* unicode : clean-up

* unicode : simplify

* convert : add convert-hf-to-gguf-update.py

ggml-ci

* lint : update

* convert : add falcon

ggml-ci

* unicode : normalize signatures

* lint : fix

* lint : fix

* convert : remove unused functions

* convert : add comments

* convert : exercise contractions

ggml-ci

* lint : fix

* cmake : refactor test targets

* tests : refactor vocab tests

ggml-ci

* tests : add more vocabs and tests

ggml-ci

* unicode : cleanup

* scripts : ignore new update script in check-requirements.sh

* models : add phi-3, mpt, gpt-2, starcoder

* tests : disable obsolete

ggml-ci

* tests : use faster bpe test

ggml-ci

* llama : more prominent warning for old BPE models

* tests : disable test-tokenizer-1-bpe due to slowness

ggml-ci

---------

Co-authored-by: Jaggzh <jaggz.h@gmail.com>
Co-authored-by: Kazim Abrar Mahi <kazimabrarmahi135@gmail.com>
@ggerganov ggerganov mentioned this pull request May 7, 2024
@mofosyne mofosyne added the enhancement New feature or request label May 9, 2024
@akx akx mentioned this pull request May 17, 2024
Labels: enhancement (New feature or request), high priority (Very important issue), need feedback (Testing and feedback with results are needed)

Projects: none

Linked issues: none