Guessing subtitle language based on subtitle text #287

Open
big-eater opened this issue Dec 31, 2023 · 0 comments

I saw the existing suggestion in #130 and agree that something like this would be a great addition.

I have tried implementing it and will create a pull request. If you think it fits this project but don't like something about the implementation, or feel that something is missing (e.g. tests or usage documentation), let me know. If you only want one part of it, or it inspires you to do something similar, feel free to take any part of it. I don't care about getting credit for the implementation; I would just be happy to see it available.

I tried a few different text language guessers. Many of them are not easy to install on some platforms. Notably, I don't have access to a Windows machine, so I have not tested installation on Windows. For that reason, I think it would be good to offer a few different options.

I tried CLDv3 (suggested in #130) but gave up on it, because it doesn't currently work with Python 3.11 / 3.12.

These are the language guessers that I integrated:

  • langid
    • In my limited testing, this seems to be the best guesser
    • Requires numpy
  • lingua
    • In my limited testing, this seems to be the next-best guesser
  • fasttext
    • Decent detector
    • Requires g++ or another modern C++ compiler to build
  • langdetect
    • Probably the easiest one to install on various platforms, but also the least accurate in my testing
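
Because not every guesser installs cleanly on every platform, a caller may want to check which ones are actually importable before choosing one. The following is a minimal sketch, not the code from the branch: the function name `available_guessers` and the `GUESSER_MODULES` mapping are hypothetical, and the top-level module names are assumed to match the package names listed above.

```python
from importlib.util import find_spec

# Hypothetical mapping: guesser name -> top-level module assumed to provide it.
# The order mirrors the accuracy ranking reported above (limited testing).
GUESSER_MODULES = {
    "langid": "langid",
    "lingua": "lingua",
    "fasttext": "fasttext",
    "langdetect": "langdetect",
}


def available_guessers(modules=GUESSER_MODULES):
    """Return the names of guessers whose module can be found, in preference order."""
    # find_spec() locates a top-level module without importing it,
    # and returns None when the module is not installed.
    return [name for name, module in modules.items() if find_spec(module) is not None]


if __name__ == "__main__":
    print("Installed guessers:", available_guessers())
```

The first entry of the returned list could serve as a default when the user does not name a guesser explicitly.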

To try it out, install it from my branch. To install all guessers:

pip install "mnamer[guess_all]@git+https://git@github.com/big-eater/mnamer.git@subtitle-text-guesser"

Or, install one or more of guess_langid, guess_lingua, guess_fasttext, guess_langdetect, for example:

pip install "mnamer[guess_langid]@git+https://git@github.com/big-eater/mnamer.git@subtitle-text-guesser"

To use it, specify a guesser when running mnamer:

mnamer --test --batch --subtitle-lang-guesser=langid /path/to/files

It only tries to guess the language from the text in the subtitle file if it was unable to guess the language from the file name.
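
The fallback order described above can be sketched roughly as follows. This is an illustration only, not the implementation in the branch: `guess_subtitle_language`, `from_filename`, and `from_text` are hypothetical names, and each guesser is assumed to be a callable that returns a language code or `None`.

```python
from pathlib import Path
from typing import Callable, Optional

# Hypothetical guesser type: takes a string, returns a language code or None.
Guesser = Callable[[str], Optional[str]]


def guess_subtitle_language(
    path: Path,
    from_filename: Guesser,
    from_text: Optional[Guesser] = None,
) -> Optional[str]:
    """Prefer the file name; fall back to the subtitle text only if that fails."""
    lang = from_filename(path.name)
    if lang is not None or from_text is None:
        # Either the file name was enough, or no text guesser was requested.
        return lang
    # Subtitle encodings vary; undecodable bytes are replaced rather than raising.
    text = path.read_text(encoding="utf-8", errors="replace")
    return from_text(text)
```

Passing the guessers in as callables keeps the fallback logic independent of whichever library (langid, lingua, fasttext, or langdetect) is installed.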
