Supporting phi-2 tokenizer #7022

BramVanroy opened this issue May 1, 2024 · 3 comments · May be fixed by #7300
Labels
enhancement New feature or request

Comments


BramVanroy commented May 1, 2024

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

Provide support for phi-2. Running the following yields an error:

python -c "
from huggingface_hub import snapshot_download;
snapshot_download(repo_id='microsoft/phi-2', local_dir='phi-2', local_dir_use_symlinks=False)
"
python convert-hf-to-gguf.py phi-2/ --outtype f16

Error:

Traceback (most recent call last):
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 3001, in <module>
    main()
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 2988, in main
    model_instance.set_vocab()
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 75, in set_vocab
    self._set_vocab_gpt2()
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 331, in _set_vocab_gpt2
    tokens, toktypes, tokpre = self.get_vocab_base()
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 242, in get_vocab_base
    tokpre = self.get_vocab_base_pre(tokenizer)
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 323, in get_vocab_base_pre
    raise NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")
NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()

Phi-2 uses CodeGenTokenizer, which is a BPE tokenizer.
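
For context, get_vocab_base_pre() appears to identify the pre-tokenizer by hashing the token IDs that the tokenizer produces for a fixed test string. A simplified sketch of that check (the real chktxt in convert-hf-to-gguf.py is a much longer multilingual string):

from hashlib import sha256
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")

# stand-in for the long multilingual test string used by the script
chktxt = "\n \n\n 🚀 3.3 3..3 ?我想在apple工作"

chktok = tokenizer.encode(chktxt)
chkhsh = sha256(str(chktok).encode()).hexdigest()
print(chkhsh)  # an unrecognized hash triggers the NotImplementedError above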

I'm not sure whether it's as simple as adding the following line here:

{ "name": "phi-2",          "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/microsoft/phi-2" },

Edit: I tried that; this is the generated hash:

if chkhsh == "fcace8b9cac38ce847670c970cd5892031a753a1ef381abd1d9af00f713da085":
    # ref: https://huggingface.co/microsoft/phi-2
    res = "phi-2"

arch-btw commented May 1, 2024

@BramVanroy #7024


turian commented May 6, 2024

Can you confirm that HF tokenization and the tokenizer in the llama.cpp-converted (quantized) GGUF give identical results, particularly when the text contains special characters?

See #7049 and #7062

@BramVanroy (Author)

@turian Any idea how I can easily test that?
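
For instance, would a direct comparison like this be enough? (A minimal sketch assuming the llama-cpp-python bindings and the f16 GGUF produced by the conversion above; the model filename is a guess.)

from transformers import AutoTokenizer
from llama_cpp import Llama

text = "Hello 🚀  \t 3.3 3..3 кирилица ''''''??!!"

hf_ids = AutoTokenizer.from_pretrained("microsoft/phi-2").encode(text)

# vocab_only loads just the tokenizer, not the weights
llm = Llama(model_path="phi-2/ggml-model-f16.gguf", vocab_only=True)
gguf_ids = llm.tokenize(text.encode("utf-8"), add_bos=False)

print("HF:  ", hf_ids)
print("GGUF:", gguf_ids)
print("identical:", hf_ids == gguf_ids)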

BramVanroy linked a pull request May 15, 2024 that will close this issue