An efficient implementation of BytePairTokenizer #36

Open
gboduljak opened this issue Jan 17, 2024 · 1 comment
Comments

@gboduljak

As suggested in @angeloskath's code review (ml-explore/mlx-examples#315 (comment)), an implementation of BytePairTokenizer would be useful for many use cases, but mlx-data currently lacks one. I did some research on byte pair tokenization in transformers, and I think the implementation there is somewhat slow. More precisely, every time a merge could be done, it iterates over all adjacent symbol pairs to determine the best pair to merge. This implies quadratic time complexity in the number of symbols. However, the referenced paper describes an elegant linearithmic-time implementation. Since that implementation requires some pointer trickery, it seems that we could (relatively) easily implement it in C++ and expose it to Python.
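To make the quadratic behaviour concrete, here is a minimal sketch of the rescanning approach. It is not the transformers code: `naive_bpe` and the `ranks` table (a dict mapping a symbol pair to its merge priority, lower meaning earlier) are hypothetical names chosen for illustration, and the sketch merges one occurrence per round rather than all occurrences of the best pair.

```python
def naive_bpe(word, ranks):
    """Apply BPE merges to `word`, rescanning all adjacent pairs each round.

    `ranks` is an assumed lookup table: (left, right) -> merge priority.
    """
    symbols = list(word)
    while len(symbols) > 1:
        # Scan every adjacent pair and remember the lowest-ranked (leftmost) one.
        best_rank, best_idx = None, None
        for i in range(len(symbols) - 1):
            rank = ranks.get((symbols[i], symbols[i + 1]))
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best_idx = rank, i
        if best_idx is None:
            break  # no adjacent pair is mergeable any more
        # Replace the two symbols with their concatenation.
        merged = symbols[best_idx] + symbols[best_idx + 1]
        symbols[best_idx:best_idx + 2] = [merged]
    return symbols


# Toy merge table, made up for this example.
ranks = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}
print(naive_bpe("lower", ranks))
```

Each round costs a full pass over the symbols, and there can be up to one round per symbol, which is where the quadratic bound comes from.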

I would appreciate your thoughts on:

  1. Do we want an implementation of BytePairTokenizer in C++?
  2. Do we want the faster implementation of BytePairTokenizer in C++, referenced in the paper?

Paper: https://arxiv.org/pdf/2306.16837.pdf
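For the second question, one common way to get linearithmic time is a priority queue of candidate pairs over a doubly linked list of symbols. The sketch below illustrates that general idea under my own assumptions; it is not taken from the paper, and `heap_bpe` and the `ranks` table (symbol pair to merge priority, lower first) are hypothetical names.

```python
import heapq


def heap_bpe(word, ranks):
    """Apply BPE merges using a heap of candidate pairs and a linked list."""
    n = len(word)
    if n == 0:
        return []
    symbols = list(word)
    # Doubly linked list stored as index arrays; -1 marks "no neighbour".
    prev = list(range(-1, n - 1))
    nxt = list(range(1, n + 1))
    nxt[-1] = -1
    alive = [True] * n
    heap = []

    def push(i):
        # Queue the pair starting at node i if it is mergeable.
        j = nxt[i]
        if j != -1:
            rank = ranks.get((symbols[i], symbols[j]))
            if rank is not None:
                heapq.heappush(heap, (rank, i, symbols[i], symbols[j]))

    for i in range(n):
        push(i)

    while heap:
        rank, i, left, right = heapq.heappop(heap)
        j = nxt[i] if alive[i] else -1
        # Skip stale entries whose nodes were merged after being queued.
        if j == -1 or symbols[i] != left or symbols[j] != right:
            continue
        # Merge node j into node i and unlink j.
        symbols[i] = left + right
        alive[j] = False
        nxt[i] = nxt[j]
        if nxt[j] != -1:
            prev[nxt[j]] = i
        # Only the two pairs touching the merged node can have changed.
        if prev[i] != -1:
            push(prev[i])
        push(i)

    return [symbols[i] for i in range(n) if alive[i]]


# Toy merge table, made up for this example.
ranks = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}
print(heap_bpe("lower", ranks))
```

Every merge pushes at most two new heap entries, so the number of heap operations stays linear in the word length. The index arrays stand in for the pointers a C++ version would use.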

@angeloskath
Member

Hi @gboduljak!

Yeah, we would want a tokenizer in C++. I think for starters, implementing it similarly to the Python version but in C++ would be sufficient. BPE is quite a simple algorithm, and if a Python implementation is usable, I think a C++ one would be at least as usable (probably much faster), with the benefit of allowing us to use threads.

Subsequently, we can optimize it if needed.
