Equivalent of transformer's chunk_length_s in whisper.cpp #2165

Open
digikar99 opened this issue May 18, 2024 · 2 comments


@digikar99

Hello, thank you very much for whisper.cpp!

While trying out a fine-tuned model with whisper.cpp (specifically, whisper-hindi-large-v2), I noticed poor performance on a particular audio file with whisper.cpp compared to using Hugging Face directly. After a bit of debugging, I was able to boil it down to the Hugging Face inference example given in the repository using chunk_length_s=30. If this option is removed from their pipeline, the performance is as poor as with whisper.cpp.

I wonder if an equivalent of chunk_length_s is already implemented in whisper.cpp. Here is the implementation in the huggingface/transformers repository. If so, what parameters should I be using?

More specific questions:

  1. Is there anything that controls the "stride_length" of the processing? (A sketch of the transformers-side parameter is below, for comparison.)
  2. I think I understand --max-len, but in what situations is --max-context useful?
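
For context on question 1, this is how the stride is exposed on the transformers side, as far as I understand it (stride_length_s defaults to chunk_length_s / 6 when not given); this is only a sketch using the same model as in the example below:

import torch
from transformers import pipeline

# same pipeline as the inference example below, but with the stride made explicit
transcribe = pipeline(
    task="automatic-speech-recognition",
    model="vasista22/whisper-hindi-large-v2",
    chunk_length_s=30,
    stride_length_s=5,  # seconds of overlap between consecutive 30 s chunks
    device="cuda:0" if torch.cuda.is_available() else "cpu",
)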

My current idea is to run whisper.cpp multiple times with appropriate --offset-t and --duration values, obtain the outputs, and finally do a find_longest_common_sequence over them.
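
To make that idea concrete, here is a rough sketch of what I mean, assuming the fine-tuned model has been converted to ggml format (the binary and model paths are placeholders), and using a naive word-overlap merge as a stand-in for transformers' find_longest_common_sequence:

import subprocess

WHISPER_BIN = "./main"                      # whisper.cpp CLI
MODEL = "models/ggml-hindi-large-v2.bin"    # placeholder path to a converted model
AUDIO = "audio.wav"                         # whisper.cpp expects 16 kHz WAV input
CHUNK_MS, STRIDE_MS = 30_000, 5_000         # 30 s chunks, 5 s overlap

def transcribe_chunk(offset_ms: int) -> str:
    # one whisper.cpp pass over a single window of the audio
    out = subprocess.run(
        [WHISPER_BIN, "-m", MODEL, "-f", AUDIO, "-l", "hi",
         "--offset-t", str(offset_ms), "--duration", str(CHUNK_MS),
         "--no-timestamps"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

def merge(prev_words: list[str], next_words: list[str]) -> list[str]:
    # drop the longest exact word overlap between the end of prev and the start of next
    for k in range(min(len(prev_words), len(next_words)), 0, -1):
        if prev_words[-k:] == next_words[:k]:
            return prev_words + next_words[k:]
    return prev_words + next_words

words: list[str] = []
offset = 0
total_ms = 120_000  # would be read from the audio file in practice
while offset < total_ms:
    words = merge(words, transcribe_chunk(offset).split())
    offset += CHUNK_MS - STRIDE_MS

print(" ".join(words))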

For this particular audio, it seems to be the presence of multiple speakers that is confusing Whisper. So, diarizing or clustering the audio and processing each speaker/cluster individually might be a better idea than doing all this.


This is the inference code given in the repository.

import torch
from transformers import pipeline

# path to the audio file to be transcribed
audio = "/path/to/audio.format"
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# chunk_length_s=30 enables the pipeline's chunked long-form transcription
transcribe = pipeline(
    task="automatic-speech-recognition",
    model="vasista22/whisper-hindi-large-v2",
    chunk_length_s=30,
    device=device,
)
transcribe.model.config.forced_decoder_ids = transcribe.tokenizer.get_decoder_prompt_ids(
    language="hi", task="transcribe"
)

print('Transcription: ', transcribe(audio)["text"])

Output with chunk_length_s=30:

Transcription: हैलो हैलो मैं बोल रहा हूँ आपकी क्वीरी जो सॉल्व नहीं हुई थी फर्स्ट आपका मेल मिला था हमारी तरफ से तीन मेल भी गए थे उसका रिस्पॉन्स नहीं अच्छा मैं ये बता रहा था मैडम इसमें ज्यादा प्रॉब्लम होगी नहीं तो आज मैने स्पेशली अपने ब्रांच मैनेजर से कहा मैने कहा निधि का करो या उनका फिर वो फंड में हाँ तो मैं अभी क्या करूँ मैडम मैं आपकी कॉल ट्रांसफर कर रहा हूँ अभी दो एजेंट �्य और भोलू प्रसाद दोनों खाली है न तो इनमें से मैडम को कोई भी समझा देगा भोलू प्रसाद को ट्रांसफर कर देता हूँ मैं आपकी कॉल नहीं नहीं वेट वेट वेट हाँ एकलव्य सर आई थिंक अपने एकलव्य किसी का नाम या उनको ट्रांसफर कर दीजिये आई डोंट नो

Output without chunk_length_s=30:

Transcription: हैलो हैलो मैं बोल रहा हूँ आपकी क्वीरी जो सॉल्व नहीं हुई थी फर्स्ट आपका मेल मिला था हमारी तरफ से तीन मेल भी गए थे उसका रिस्पॉन्स नहीं अभी

@ggerganov
Owner

There is no option currently to set the chunk size, but using --no-timestamps would be equivalent to a chunk size of 30s. Try adding this flag and see if it helps.

@digikar99
Author

Thanks for getting back!

Nope, --no-timestamps does not help; it produces the same output.

I tried out whisperX today. For this particular audio, it worked amazingly well! Interestingly, using faster-whisper (a dependency of whisperX) on its own did not help.
