Extremely slow: 50 minutes to transcribe 4 minute speech? Does that sound right? Please review my data and make suggestions. #2059

peppies · 2024-04-16T00:53:31Z

Greetings,

I am trying whisper.cpp because I do not have a GPU on my DreamHost web server, which has 4 vCPUs and 8 GB RAM. I believe I have everything installed, and I'm using base.en as the model. Here are the steps I've taken.

Step 1) Downloaded a random 4-minute presidential speech from Wikipedia and uploaded it to my server: https://upload.wikimedia.org/wikipedia/commons/9/99/Confidence_in_Government_%28James_M._Cox%29.ogg

Step 2) Converted the .ogg file to a 16 kHz .wav file using ffmpeg (assuming whisper.cpp can only work on 16 kHz .wav files?):
ffmpeg -i speech.ogg -ar 16000 speech.wav
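
For reference, the whisper.cpp README asks for 16-bit WAV input and shows a conversion that also sets the channel count and codec explicitly. The command above already produced a usable file (the log further down reports 237.3 sec of samples), so this is only a more explicit variant. A minimal Python sketch that builds that command; the helper name build_ffmpeg_cmd and the file names are illustrative assumptions, and actually running it assumes ffmpeg is on PATH:

```python
import shlex


def build_ffmpeg_cmd(src, dst, sample_rate=16000):
    """Return an ffmpeg argv that converts `src` to 16-bit mono PCM WAV.

    Hypothetical helper for illustration; the flags mirror the conversion
    example in the whisper.cpp README.
    """
    return [
        "ffmpeg",
        "-i", src,                 # input file (any format ffmpeg can decode)
        "-ar", str(sample_rate),   # resample to 16 kHz
        "-ac", "1",                # downmix to a single (mono) channel
        "-c:a", "pcm_s16le",       # 16-bit little-endian PCM samples
        dst,
    ]


cmd = build_ffmpeg_cmd("speech.ogg", "speech.wav")
print(shlex.join(cmd))
# To actually run the conversion:
#   import subprocess
#   subprocess.run(cmd, check=True)
```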

Step 3) Ran the following whisper.cpp command:

ubuntu@home:~/whisper.cpp$ sudo ./main --output-vtt --no-fallback true --max-context 0 -f speech.wav
error: input file not found 'true'
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:      CPU total size =   147.37 MB
whisper_model_load: model size    =  147.37 MB
whisper_init_state: kv self size  =   16.52 MB
whisper_init_state: kv cross size =   18.43 MB
whisper_init_state: compute buffer (conv)   =   16.39 MB
whisper_init_state: compute buffer (encode) =  132.07 MB
whisper_init_state: compute buffer (cross)  =    4.78 MB
whisper_init_state: compute buffer (decode) =   96.48 MB

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0

main: processing 'speech.wav' (3797049 samples, 237.3 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:12.160]   We desire industrial peace. We want our people to have an abiding confidence in government.
...... [rest of transcribed lines intentionally removed]
[00:03:53.620 --> 00:04:01.580]   [BLANK_AUDIO]

output_vtt: saving output to 'speech.wav.vtt'

whisper_print_timings:     load time =   271.63 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   684.16 ms
whisper_print_timings:   sample time = 42974.38 ms /  3370 runs (   12.75 ms per run)
whisper_print_timings:   encode time = 94221.25 ms /    10 runs ( 9422.12 ms per run)
whisper_print_timings:   decode time = 46045.39 ms /    14 runs ( 3288.96 ms per run)
whisper_print_timings:   batchd time = 2782029.00 ms /  3316 runs (  838.97 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time = 2966312.00 ms

I added "--no-fallback true --max-context 0" to the command after reading a separate post that suggested this might speed things up, but it made no difference on our end.

I noticed this in the output above when running the command: "error: input file not found 'true'". However, the file is found and is transcribed correctly; it just takes an extremely long time.
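
A note on that message: in the whisper.cpp main example, --no-fallback is a switch that takes no value, and any argument that does not start with "-" is treated as an input file, so the word "true" is read as a second file name. The sketch below builds a command line without the stray value. The helper name build_whisper_cmd is hypothetical. The -bs (beam size) and -bo (best of) flags are listed in the tool's help, but setting both to 1 is only an assumption worth testing, prompted by the "5 beams + best of 5" line in the log, not a confirmed fix:

```python
import shlex


def build_whisper_cmd(wav, model="models/ggml-base.en.bin",
                      threads=4, beam_size=1, best_of=1):
    """Return a whisper.cpp `main` argv (hypothetical helper)."""
    return [
        "./main",
        "-m", model,
        "-t", str(threads),
        "--output-vtt",
        "--no-fallback",          # switch: no value follows it
        "--max-context", "0",
        "-bs", str(beam_size),    # beam size 1 disables beam search (to be tested)
        "-bo", str(best_of),      # number of sampling candidates
        "-f", wav,
    ]


cmd = build_whisper_cmd("speech.wav")
print(shlex.join(cmd))
```

Comparing the timings of this variant against the original run would show whether the beam search settings matter on this server.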

Let me know if I'm doing something wrong, or if 50 minutes is actually an expected time for transcribing a 4-minute recording.
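
To put the numbers above in perspective, here is a small Python sketch that works out the real-time factor and each stage's share of the total. All values are copied from the whisper_print_timings output in this post; the variable names are just illustrative. It comes out at roughly 12.5 times slower than real time, with the "batchd" stage accounting for about 94% of the run:

```python
# Stage timings in milliseconds, copied from the log above.
timings_ms = {
    "load": 271.63,
    "mel": 684.16,
    "sample": 42974.38,
    "encode": 94221.25,
    "decode": 46045.39,
    "batchd": 2782029.00,
}
total_ms = 2966312.00   # "total time" line
audio_sec = 237.3       # duration reported by main

# How many seconds of compute each second of audio costs.
real_time_factor = (total_ms / 1000.0) / audio_sec

# Percentage of the total run spent in each stage.
shares = {name: 100.0 * ms / total_ms for name, ms in timings_ms.items()}

print(f"total: {total_ms / 60000.0:.1f} min, "
      f"real-time factor: {real_time_factor:.1f}x")
for name, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>7}: {pct:5.1f}%")
```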

Thanks

ulatekh · 2024-04-25T16:21:20Z

Without hardware acceleration, e.g. CUDA, Metal, BLAS, etc., yes, that's a totally expected amount of time.

3 replies

peppies Apr 28, 2024
Author

I thought the whole point of whisper.cpp was to run on a CPU, for web servers that don't have GPUs. whisper.cpp is supposed to be different from the original Whisper.

bobqianic Apr 28, 2024
Collaborator

It's just like the laws of physics. If you're using the same model, there's a minimum requirement for compute and memory bandwidth. All we can do is reduce overhead. However, if you're solely focused on English, you can try Distil-Whisper (supported by whisper.cpp), which offers much better performance.

ulatekh Apr 29, 2024

@peppies: And it can be used on a CPU; it's just a lot slower. Also, the other Whisper implementations (including the original) can be run without a GPU; most, if not all, take a --device parameter that allows you to specify cpu.

bobqianic · 2024-04-28T13:13:28Z

Collaborator

> I am trying whisper.cpp because I do not have a GPU on my DreamHost web server, which has 4 vCPUs and 8 GB RAM.

Cloud service providers often oversubscribe their hardware. Although the plan is advertised as 4 cores, in reality you get 4 threads that compete with other users for resources. Therefore, this speed may be normal. I suggest you run Geekbench once to see how the server actually performs.
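
As a rough complement to Geekbench (a different technique, not what the reply above proposes), a Linux guest can look at CPU "steal" time, which counts time the hypervisor gave to other tenants. The sketch below assumes a Linux server whose /proc/stat aggregate "cpu" line has at least eight counters; the function name steal_percent is hypothetical:

```python
def steal_percent(before, after):
    """Percentage of CPU time stolen by the hypervisor between two
    aggregate `cpu` lines taken from /proc/stat."""
    def fields(line):
        parts = line.split()
        if not parts or parts[0] != "cpu" or len(parts) < 9:
            raise ValueError("expected an aggregate 'cpu' line with >= 8 counters")
        return [int(x) for x in parts[1:]]

    deltas = [b - a for a, b in zip(fields(before), fields(after))]
    # Counter order: user nice system idle iowait irq softirq steal ...
    # guest/guest_nice are already counted in user/nice, so sum only
    # the first eight counters to get total elapsed CPU time.
    total = sum(deltas[:8])
    return 100.0 * deltas[7] / total if total else 0.0


# Usage on the server (sample the line twice while whisper.cpp is running):
#   import time
#   with open("/proc/stat") as f: first = f.readline()
#   time.sleep(5)
#   with open("/proc/stat") as f: second = f.readline()
#   print(f"steal: {steal_percent(first, second):.1f}%")
```

A steal percentage that stays well above zero during a transcription run would be consistent with the oversubscription explanation.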

0 replies
