Yes, that's what I'm doing.

In fact, I've been able to edit the pretrained model's tokenizer and change the tokens inside it. What I found is that merely reloading the pretrained tokenizer with the `change_vocabulary` method breaks the whole decoding process.
Describe the bug
So I picked nvidia/parakeet-ctc-0.6b and untarred the `.nemo` file. After that, I loaded the model and changed the vocab this way:
Steps/Code to reproduce bug
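The vocabulary-change snippet itself was not captured in this report. The untar step it refers to, however, is ordinary archive handling: a `.nemo` checkpoint is a plain tar file. A minimal, self-contained sketch of that step, building a stand-in archive instead of downloading the real nvidia/parakeet-ctc-0.6b checkpoint (all file names below are illustrative):

```python
import tarfile
from pathlib import Path

# A .nemo checkpoint is an ordinary tar archive. Build a stand-in one
# (instead of downloading the real checkpoint) and extract it the same
# way the real checkpoint would be untarred.
src = Path("stand_in")
src.mkdir(exist_ok=True)
(src / "tokenizer.model").write_text("dummy tokenizer payload")

with tarfile.open("stand_in.nemo", "w") as tar:
    tar.add(src / "tokenizer.model", arcname="tokenizer.model")

# The extraction step described above: untar the .nemo file to reach
# the tokenizer files packed next to the model weights.
out = Path("extracted_nemo")
out.mkdir(exist_ok=True)
with tarfile.open("stand_in.nemo") as tar:
    tar.extractall(out)

print(sorted(p.name for p in out.iterdir()))  # → ['tokenizer.model']
```

In a real checkpoint the extracted directory also contains the model weights and config; the tokenizer files are what matter for the vocabulary change described here.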
where `vocab_extension_path` is the path of the pretrained model.

Expected behavior
The model's tokenizer is supposed to remain intact and not start generating gibberish, because I am just reloading the exact tokenizer that was used to pretrain the model.
Why I need this
I need to replace some tokens in the model's vocab while keeping the order of the other tokens intact. If I can't keep the other parts of the tokenizer intact, then my token replacement cannot work.
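The constraint stated here — swap selected token strings while every token keeps its position, and therefore its integer ID — can be sketched without NeMo at all. The vocab list and replacement map below are illustrative stand-ins, not values from this report:

```python
# Replace selected token strings in a vocab while preserving order,
# so every token keeps its original integer ID (index == token ID).
# Both the vocab and the replacement map are illustrative stand-ins.
vocab = ["<unk>", "▁the", "▁cat", "▁dog", "s"]
replacements = {"▁cat": "▁chat", "▁dog": "▁chien"}  # old token -> new token

new_vocab = [replacements.get(tok, tok) for tok in vocab]

# Order, and therefore every token ID, is unchanged:
assert len(new_vocab) == len(vocab)
assert new_vocab.index("▁chat") == vocab.index("▁cat")
print(new_vocab)
```

The point of the in-place mapping is exactly the requirement above: untouched tokens keep their IDs, so decoding stays consistent everywhere except the deliberately replaced entries.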