
Metal shaders for memory efficient self attention on large sequences #964

Open · wants to merge 10 commits into main from user/bkeene/fast_sdpa_self_attention

Conversation

bpkeene (Contributor) commented Apr 6, 2024

Implements metal shaders for:

o = mx.fast.scaled_dot_product_attention(queries, keys, values, scale=scale, mask=None)

Supports fp16, fp32 dtypes; flexible hidden dimension, currently templated for 64 and 128.

Causal masking for prompt encoding is not yet implemented; this shader currently targets full self-attention.
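
For reference, a minimal usage sketch of the fast path above (the (batch, heads, sequence, head_dim) layout and the 1/sqrt(head_dim) scale are assumptions based on the standard MLX SDPA convention, not something prescribed by this PR):

import math
import mlx.core as mx

# Assumed layout: (batch, num_heads, seq_len, head_dim); full self-attention, no mask.
B, H, L, D = 2, 38, 4096, 64
q = mx.random.normal((B, H, L, D), dtype=mx.float32)
k = mx.random.normal((B, H, L, D), dtype=mx.float32)
v = mx.random.normal((B, H, L, D), dtype=mx.float32)

scale = 1.0 / math.sqrt(D)  # conventional 1/sqrt(head_dim) scaling
o = mx.fast.scaled_dot_product_attention(q, k, v, scale=scale, mask=None)
mx.eval(o)       # force evaluation of the lazy graph
print(o.shape)   # (2, 38, 4096, 64)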

Context

We're continuing the work started with #735 and following up with memory savings for the self-attention use case common in diffusion transformer workflows such as Stable Diffusion 3. In particular, we observe memory savings of more than 5 GB for float32 in the SD3 8B use case. The current shaders are implemented using Steel-like GEMM primitives in MLX style, with potential for performance tuning to improve latency while retaining the memory savings.
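
As a rough sanity check on that figure, here is a back-of-the-envelope sketch; it assumes the dominant saving is the full (L, L) float32 score tensor per attention call that the fused kernel avoids materializing:

# SD3 8B-style configuration used in the measurements below: batch 2, 38 heads, seq_len 4250.
B, H, L = 2, 38, 4250
bytes_fp32 = 4
scores_bytes = B * H * L * L * bytes_fp32
print(f"{scores_bytes / 1e9:.2f} GB")  # ~5.5 GB, consistent with the >5 GB savings quoted above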

Supported

  • Supports mx.float16 and mx.float32 dtypes
  • Supports head_dim=64, 128 (covers most 7B+ LLMs)
  • MHA supported (no MQA, no GQA)
  • Unsupported use cases still go through MLX primitives under the hood (see the reference sketch after this list).
  • No backward pass implementation (inference-only kernel)
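
For context on that fallback path, here is a sketch of the equivalent computation in MLX primitives; it is a reference implementation under my own naming, not the exact code MLX dispatches to:

import mlx.core as mx

def sdpa_reference(q, k, v, scale, mask=None):
    # softmax(q @ k^T * scale) @ v, materializing the full (L, L) score matrix
    # that the fused Metal shader avoids for long sequences.
    scores = (q * scale) @ mx.swapaxes(k, -1, -2)
    if mask is not None:
        scores = scores + mask
    return mx.softmax(scores, axis=-1) @ v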

Checklist

Put an x in the boxes that apply.

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

bpkeene (Contributor, Author) commented Apr 6, 2024

Marking as draft: currently working through some numerical issues via a separate workflow, and will add CPU-side bindings and dispatch, tests, and docs. Sharing the current status and will update this PR.

@bpkeene bpkeene marked this pull request as draft April 6, 2024 04:01
@bpkeene bpkeene force-pushed the user/bkeene/fast_sdpa_self_attention branch 2 times, most recently from 54d1412 to beccbf5 on April 6, 2024 04:07
@bpkeene bpkeene marked this pull request as ready for review April 12, 2024 06:46
@bpkeene bpkeene marked this pull request as draft May 3, 2024 17:19
1 /* self attention: unused */,
alpha};

set_array_buffer(compute_encoder, q, 0);
A Member commented:
These changed to compute_encoder.set_input_array (for inputs) and compute_encoder.set_output_array for outputs of the shader.

constexpr const uint rows_per_tgroup = 16;
const int tgp_y_indices = ((int(qseq) - 1) / rows_per_tgroup) + 1;

auto compute_encoder = d.get_command_encoder(s.index);
A Member commented:
This changes to auto& compute_encoder ...

@bpkeene bpkeene force-pushed the user/bkeene/fast_sdpa_self_attention branch from f9ede9b to 9da07d0 on May 16, 2024 03:46
@bpkeene bpkeene marked this pull request as ready for review May 16, 2024 03:47
@bpkeene bpkeene changed the title Metal shaders for efficient self attention on large sequences Metal shaders for memory efficient self attention on large sequences May 21, 2024
bpkeene (Contributor, Author) commented May 21, 2024

Hi folks,
[Graphs: Latency - Self Attention SDPA; Memory - Self Attention SDPA]

Attaching some graphs of measured latency on M3 Max and estimated memory savings per attention block (empirically observed at several data points; the graph here was obtained via formulas).

bpkeene (Contributor, Author) commented May 21, 2024

There is some room for improvement in latency on larger sequences, with a divergence after a sequence length of ~2300, though the memory savings exceed 1 GB around 2k and approach 5 GB at a sequence length of 4250 (the SD3 8B use case).

All measurements were with batch size 2, heads = 38, hidden dim = 64, and float32 on M3 Max / 48GB.
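
For anyone reproducing these numbers, here is a rough timing sketch under the same configuration; the harness is not part of this PR, and mx.metal.get_peak_memory is assumed to be available in the installed MLX version:

import math
import time
import mlx.core as mx

# Configuration from the measurements above: batch 2, 38 heads, head_dim 64, float32.
B, H, D = 2, 38, 64
L = 4096

q = mx.random.normal((B, H, L, D))
k = mx.random.normal((B, H, L, D))
v = mx.random.normal((B, H, L, D))
mx.eval(q, k, v)  # materialize inputs before timing

scale = 1.0 / math.sqrt(D)
tic = time.perf_counter()
o = mx.fast.scaled_dot_product_attention(q, k, v, scale=scale, mask=None)
mx.eval(o)
toc = time.perf_counter()
print(f"seq_len={L}: {1e3 * (toc - tic):.2f} ms")
print(f"peak memory: {mx.metal.get_peak_memory() / 2**30:.2f} GiB")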

mlx/fast.cpp: 3 review threads resolved (2 outdated)
awni (Member) commented May 22, 2024

@bpkeene left a few minor comments. Could you address them? Once updated we can run the tests and get this merged.

bpkeene (Contributor, Author) commented May 23, 2024

Updated with the requested changes, thank you for the prompt review!
