Whisper stutters #774
Interesting. I've seen that behavior before in lower-quality models. Two questions:
I expect that the mp3 will be 16-bit. The problem seems to be a feature of the underlying architecture: padding to 30s chunks, for example. The original paper offered some mitigations, but they were far from completely effective (e.g. using the results from the previous chunk as a prompt, fiddling with temperatures and beaming, etc.). WhisperX seems to do a better job, but needs components that are only x86/CUDA based. It seems ironic that AI effectiveness should rely on hand tuning that is input-specific. 😇
As most of the output seems very accurate, I can only suppose that the repetition is caused by some heuristic that says "if you cannot generate output, just repeat what you just produced". Reasons for not generating output could include silence, padding from the 30s chunking, background noise, or ...
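For what it's worth, the reference Whisper implementation guards against exactly this failure mode with a compression-ratio check: decoded text that compresses too well (ratio above `compression_ratio_threshold`, default 2.4) is treated as degenerate and retried at a higher temperature. If mlx_whisper follows the reference implementation here, the same knob should apply. A minimal sketch of that ratio check:

```python
import zlib

def compression_ratio(text: str) -> float:
    # Highly repetitive text compresses far better than normal speech,
    # so a large ratio is a cheap "stutter" signal.
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))

normal = "Alice was beginning to get very tired of sitting by her sister on the bank."
stutter = "and then he said " * 20

print(compression_ratio(normal))   # close to 1: ordinary prose
print(compression_ratio(stutter))  # well above 2.4: repetition detected
```

So a chunk whose decode degenerates into repetition should trip the threshold and trigger the temperature-fallback retry.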
The repetition problem is common with encoder-decoder style models, though it usually becomes vanishingly rare for high-quality models. Indeed, edge-case inputs could be more likely to trigger it.
I meant the model parameters. The default model is fp16, which may be slightly worse. You could try an fp32 model (pass `fp16=False`).
Thanks. I'm just trying a recording of a back-and-forth chat. Most of the transcription looks great; it's just these repetitions that are anomalous. I've tried using this:

```python
import mlx_whisper

speech_file = "/Users/jrp/.cache/whisper/alice.mp3"
result = mlx_whisper.transcribe(
    speech_file,
    path_or_hf_repo="mlx-community/whisper-large-v3-mlx",
    verbose=False,
    fp16=False,
)
with open("result.txt", "w") as f:
    for segment in result["segments"]:
        print(segment["text"], file=f)
```

with fp16 and fp32 versions on the Alice chapter, to get started. Most of the differences (see attachment) seem to be just how the output is segmented, with the fp16 version (`<`) being preferable in most cases, but there are a couple of oddities. E.g.:
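For comparing the two runs, something like the standard-library `difflib` does the job; the segment texts below are hypothetical stand-ins for the two transcripts (in practice you'd feed in the `"text"` fields from each run's `result["segments"]`):

```python
import difflib

# Hypothetical stand-ins for the fp16 and fp32 transcripts.
fp16_lines = ["Alice was beginning to get very tired", "of sitting by her sister on the bank,"]
fp32_lines = ["Alice was beginning to get very tired of sitting", "by her sister on the bank,"]

diff = list(difflib.unified_diff(fp16_lines, fp32_lines,
                                 fromfile="fp16", tofile="fp32", lineterm=""))
print("\n".join(diff))
```

This makes pure re-segmentation differences (same words, different line breaks) easy to tell apart from genuine wording differences.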
Blimey, the fp32 version is about half the speed of the fp16 one. Doesn't half exercise the fans on this 48 GB machine; GPUs are at 100%... Does transcription stream, or is it going to just increase memory demand? Looking at the various Whisper offshoots (the original, lightning, kit, etc.), they all seem to suffer from the same problem, with various heuristics being added and subtracted.
... and the fp32 version also stutters / hallucinates for me. This is a pity: most of the output is remarkably good; it just seems that chopping up the input, padding it, and stitching it back together introduces errors.
Edit: it is actually implemented and should be enabled by default.
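The chunk-and-pad framing discussed above can be sketched roughly as follows. The constants match Whisper's fixed 16 kHz / 30 s windows; the function itself is a simplified illustration, not mlx_whisper's actual code:

```python
SAMPLE_RATE = 16_000      # Whisper resamples all audio to 16 kHz
CHUNK_SECONDS = 30
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples per window

def frame(samples):
    # Cut the audio into fixed 30 s windows; the final short window is
    # zero-padded, which is where silence-induced repetition can creep in.
    windows = []
    for i in range(0, len(samples), N_SAMPLES):
        w = samples[i:i + N_SAMPLES]
        w = w + [0.0] * (N_SAMPLES - len(w))
        windows.append(w)
    return windows
```

A 31-second recording, for instance, becomes two 30 s windows, the second of which is 29 s of synthetic silence that the decoder still has to do something with.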
Using
I find that the output contains repeated phrases from time to time, enough to ruin the transcription. E.g.:
Maybe this is a feature of the underlying model?
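One way to spot these runs programmatically, whatever their cause, is to scan the returned segments for back-to-back duplicates. This helper is hypothetical (not part of mlx_whisper), but the `segments` shape matches what `transcribe` returns:

```python
def find_stutters(segments, min_repeats=2):
    # Scan transcription segments for runs where the same (normalized)
    # text repeats back-to-back; return (start, end, text) per run.
    runs = []
    i = 0
    while i < len(segments):
        j = i
        key = segments[i]["text"].strip().lower()
        while j + 1 < len(segments) and segments[j + 1]["text"].strip().lower() == key:
            j += 1
        if j - i + 1 >= min_repeats:
            runs.append((segments[i]["start"], segments[j]["end"],
                         segments[i]["text"].strip()))
        i = j + 1
    return runs
```

Running it over a transcript at least tells you where the stutters are, so you can re-run just those time ranges or excise them.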