Whisper Static Cache #30760

Open · wants to merge 2 commits into main
Conversation

@huseinzol05 (Contributor) commented May 11, 2024

Static Cache for Whisper

This enables torch.compile for Whisper generation, making generation faster. Example: https://gist.github.com/huseinzol05/9aff34ec1427ee8c92240cb4f3cc0c88

The compiled static cache achieves 186.26 it/s, while the non-compiled version gets 150.20 it/s.

Still a work in progress:

  1. The current fork only works with the static cache; it needs to follow the same caching steps as Llama.
  2. There are many conditions that need to be fulfilled first.
  3. It only works on PyTorch 2.4.0.dev20240508+cu121, a nightly not yet released as stable, which is needed for reduce-overhead torch.compile of custom functions.
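
For readers new to the technique, here is a minimal sketch of the idea behind a static k/v cache: preallocate fixed-shape buffers, mark their addresses as static, and update them in place so torch.compile with CUDA graphs can reuse one recorded graph. The class name, shapes, and update signature below are illustrative assumptions, not this PR's actual code.

```python
import torch

# Illustrative only: a minimal static k/v cache in the spirit of this PR.
class StaticKV:
    def __init__(self, num_layers, batch, num_heads, max_len, head_dim,
                 dtype=torch.float16, device="cuda"):
        shape = (batch, num_heads, max_len, head_dim)
        self.keys = [torch.zeros(shape, dtype=dtype, device=device) for _ in range(num_layers)]
        self.values = [torch.zeros(shape, dtype=dtype, device=device) for _ in range(num_layers)]
        for k, v in zip(self.keys, self.values):
            # Fixed addresses let torch.compile / CUDA graphs reuse the same buffers.
            torch._dynamo.mark_static_address(k)
            torch._dynamo.mark_static_address(v)

    def update(self, layer_idx, positions, new_keys, new_values):
        # In-place writes keep the static addresses intact (no reallocation).
        self.keys[layer_idx].index_copy_(2, positions, new_keys)
        self.values[layer_idx].index_copy_(2, positions, new_values)
        return self.keys[layer_idx], self.values[layer_idx]
```

A decode step that reads and writes such buffers can then be wrapped with `torch.compile(step, mode="reduce-overhead", fullgraph=True)`, which is where the speed-up quoted above comes from.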

@mobicham (Contributor)

Thank you very much @huseinzol05 for the work.
Here's a version with HQQ 4-bit using the torchao backend. As expected, there's a good speed-up with the static cache and fullgraph compilation: https://gist.github.com/mobicham/ecfe09a48efb11e4014386901a5c6cce

GPU: 4090
orig - no compile : 48 it/sec
orig + compiled   : 227 it/sec

hqq - no compile  : 42 it/sec
hqq + compile     : 308 it/sec

@kadirnar (Contributor)

Will it be merged? @younesbelkada

@amyeroberts (Collaborator)

cc @sanchit-gandhi

@huseinzol05 (Contributor, Author)

@kadirnar, this PR is not ready to merge; you're welcome to continue working on it to address points 1, 2, and 3. If you want to use it as-is, you have to split the audio into 30-second chunks with overlap and feed them through the encoder-decoder process; feel free to add temperature and top_k sampling like https://github.com/pytorch-labs/gpt-fast/blob/main/generate.py#L52 (a rough sketch of both follows below).
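
For illustration, a rough sketch of the chunking and gpt-fast-style sampling mentioned above; the 30 s / 5 s window sizes and the `transcribe_chunk` helper are assumptions, not part of this PR:

```python
import numpy as np
import torch

SAMPLE_RATE = 16_000          # Whisper expects 16 kHz audio
CHUNK_S, OVERLAP_S = 30, 5    # assumed window and overlap lengths

def iter_chunks(audio: np.ndarray):
    """Yield 30 s windows of a 16 kHz waveform with 5 s overlap."""
    size = CHUNK_S * SAMPLE_RATE
    step = (CHUNK_S - OVERLAP_S) * SAMPLE_RATE
    for start in range(0, max(len(audio) - OVERLAP_S * SAMPLE_RATE, 1), step):
        yield audio[start:start + size]

def sample_top_k(logits: torch.Tensor, temperature: float = 0.7, top_k: int = 50):
    """Temperature + top-k sampling, in the style of gpt-fast's generate.py."""
    logits = logits / max(temperature, 1e-5)
    v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
    logits = logits.masked_fill(logits < v[..., -1:], float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)

# Each chunk would then go through the encoder-decoder loop, e.g.:
# texts = [transcribe_chunk(chunk) for chunk in iter_chunks(audio)]  # hypothetical helper
```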

@kadirnar (Contributor) commented May 16, 2024

Will it work if I run this code? Also, do I need to make any changes to the gpt-fast library?

https://gist.github.com/huseinzol05/9aff34ec1427ee8c92240cb4f3cc0c88

@huseinzol05 (Contributor, Author)

Yeah, it should work; I use it in production. But don't forget to warm up the static cache multiple times first.
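
For context, the warm-up just means running the compiled decode path a few times on dummy inputs before timing or serving it, so compilation and CUDA-graph capture happen up front. A minimal sketch, where `compiled_decode_step` stands in for whatever function was wrapped with `torch.compile`:

```python
import torch

def warmup(compiled_decode_step, dummy_token, dummy_position, n_iters: int = 3):
    # The first call(s) trigger tracing / CUDA-graph capture; subsequent calls
    # hit the cached graph, so steady-state latency is reached after a few runs.
    for _ in range(n_iters):
        compiled_decode_step(dummy_token, dummy_position)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
```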

@kadirnar (Contributor)

> Yeah, it should work; I use it in production. But don't forget to warm up the static cache multiple times first.

I ran the notebook file. It gives this error:

File /usr/local/lib/python3.10/dist-packages/transformers/cache_utils.py:484, in WhisperStaticCache.__init__(self, config, dtype, device, existing_cache, batch_size)
    482 torch._dynamo.mark_static_address(e_key_cache)
    483 torch._dynamo.mark_static_address(e_value_cache)
--> 484 e_key_cache[:, :, :, :] = existing_cache[k][2].clone()
    485 e_value_cache[:, :, :, :] = existing_cache[k][3].clone()
    486 self.key_cache.append(new_layer_key_cache)

IndexError: tuple index out of range
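
One hedged reading of this traceback: `existing_cache[k][2]` assumes the legacy Whisper `past_key_values` layout of four tensors per layer (self-attention key/value plus cross-attention key/value), so the error suggests the cache being converted only held two tensors per layer in that run. A quick check along these lines (reusing the gist's `existing_cache` variable) would confirm:

```python
# Expected per layer for Whisper's legacy cache:
# (self_attn_key, self_attn_value, cross_attn_key, cross_attn_value)
for layer_idx, layer_cache in enumerate(existing_cache):
    print(layer_idx, len(layer_cache), [tuple(t.shape) for t in layer_cache])
```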

@huseinzol05 (Contributor, Author)

I just reran it and there's no issue, super weird. Which line gives you the error?

@@ -448,3 +448,109 @@ def reset(self):
# In-place ops prevent breaking the static address
self.key_cache[layer_idx].zero_()
self.value_cache[layer_idx].zero_()

class WhisperStaticCache(Cache):
Contributor:
Thanks for this great first start @huseinzol05! With @gante, we were discussing how the design of the static k/v cache should look for encoder-decoder models, and we distilled the design options down to two possibilities:

  1. Hold a tuple of StaticCache caches, e.g. as proposed here
  2. Add a new Cache class specific to encoder-decoder models, e.g. one with the attributes:
    • key_cache (same as decoder-only self-attn)
    • value_cache (same as decoder-only self-attn)
    • cross_key_cache (new for enc-dec cross-attn)
    • cross_value_cache (new for enc-dec cross-attn)

Option 1 doesn't require any new Cache classes, so it should be easier to maintain! Thus, we were thinking this would be the best design option for Whisper (and other encoder-decoder models in the library, such as BART). Would be curious to hear your opinions here, having had a go at option 2!
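
To make the two layouts concrete, a purely illustrative sketch (not the transformers API): option 1 carries two existing Cache objects as a tuple, while option 2 would add a class with the four attributes listed above.

```python
import torch
from dataclasses import dataclass, field
from typing import List

# Option 1: keep single-purpose cache classes and pass two of them around, e.g.
#   past_key_values = (decoder_self_attn_cache, cross_attn_cache)
# Any Cache implementation (static, quantized, ...) can fill either slot.

# Option 2: a dedicated encoder-decoder cache class (attributes as listed above).
@dataclass
class EncoderDecoderCacheSketch:
    key_cache: List[torch.Tensor] = field(default_factory=list)         # decoder self-attn keys
    value_cache: List[torch.Tensor] = field(default_factory=list)       # decoder self-attn values
    cross_key_cache: List[torch.Tensor] = field(default_factory=list)   # enc-dec cross-attn keys
    cross_value_cache: List[torch.Tensor] = field(default_factory=list) # enc-dec cross-attn values
```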

Member:

@huseinzol05 this is great work!

I'm heavily biased towards option 1, especially now that we are seeing more cache types. For instance, we could easily plug in the quantized cache as the decoder cache with 0 code overhead, if we design Whisper to support a tuple of Cache objects through past_key_values 🤗

Contributor (Author):

I'm good with either option.
