Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A draft implementation of BPE tokenizer #39

Closed
wants to merge 1 commit into from

Conversation

gboduljak
Copy link

@gboduljak gboduljak commented Jan 18, 2024

This is a draft implementation of the naive BPE tokenizer. I tested it on CLIP tokenizer configuration. Some basics tests are working, but more tests are necessary.

I realized that current tokenization methods are mostly based on CharTrie. However, this data structure may not be the best for the BPE tokenization. This is because BPE tokenization kinda works bottom up from unigrams. Consequently, I think we may want a separate class for BPE tokenizer. To get quick feedback, I implemented everything within the current Tokenizer.

I would appreciate your feedback on:

  1. The current implementation of the naive BPE algorithm.
  2. Where to place the new version of the BPE tokenization implementation.
  3. Perhaps better data structure design or some ideas for better utilization of the existing CharTrie.

Closes #36.

@gboduljak gboduljak marked this pull request as ready for review January 18, 2024 02:07
@angeloskath
Copy link
Member

@gboduljak sorry for not getting to it a bit sooner. Thanks for the contribution this already looks great!

As you wrote above, indeed this should be a different class as the tokenization algorithm is fundamentally very different. I would create a class BPETokenizer and corresponding files in mlx/data/core. No need for inheritance trees (ie no base class for all tokenizers; if needed we 'll refactor later).

I love the python tests. We really need to get to porting the tests we had to the open source repo. I will add some C++ test scaffolding ASAP so that we can later add some C++ tests as well. In the meantime I would not commit full merges and vocabulary files. For the test you can create some small merges and vocab on the fly or even better compare against a BPE tokenizer from HF using its merges file and vocab.

I will definitely take a close look to the implementation over the weekend.

@angeloskath
Copy link
Member

@gboduljak I will close this. #62 implements a fast generic bpe tokenizer. It was quite involved in the end. Sorry if I dragged you into implementing it only to not be merged. I do appreciate your contributions here and in mlx.core.

Let me know if you think I am missing something in the implementation in #62. Thanks!

@gboduljak
Copy link
Author

No worries. I will go through your implementation :) It is awesome to see this implemented in mlx. Great work 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

An efficient implementation of BytePairTokenizer
2 participants