numerical difference for SDPA between non-dtensor vs dtensor, when math attention and fp16 are used #317

Open
tianyu-l opened this issue May 8, 2024 · 0 comments
Labels: bug (Something isn't working)

Comments

tianyu-l (Contributor) commented on May 8, 2024

A higher loss (9.5602 vs. 9.3164) was observed for the dtensor case after 10 steps on the llama2 debug model. This happens even without applying rotary embedding, i.e. even without the complex-number multiplication issue mentioned in #267.
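
For context, the non-dtensor baseline in that comparison is just SDPA run in fp16 with the math backend forced. A minimal sketch of that configuration (not the torchtitan training code; shapes and values are illustrative) could look like:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Illustrative (batch, heads, seq_len, head_dim) shapes; the real dimensions
# come from the llama2 debug model config and are not reproduced here.
q, k, v = (torch.randn(8, 16, 256, 64, dtype=torch.float16, device="cuda")
           for _ in range(3))

# Force the math (reference) backend instead of flash / memory-efficient attention.
with sdpa_kernel(SDPBackend.MATH):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```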

Note: to apply math attention with dtensor, one needs to set _allow_implicit_replication to True, because SDPA generates a non-dtensor causal mask when is_causal=True.
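
A minimal sketch of that workaround, assuming a single-process DTensor setup on CPU in fp32 purely for illustration (the original report used fp16 on GPU). Here implicit_replication() is the experimental context manager that flips the underlying _allow_implicit_replication flag; the exact import path and spelling may differ across PyTorch versions:

```python
import os
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.distributed._tensor import DeviceMesh, Replicate, distribute_tensor
# Experimental API; may live elsewhere (or be spelled differently) in other versions.
from torch.distributed._tensor.experimental import implicit_replication
from torch.nn.attention import SDPBackend, sdpa_kernel

# Single-process process group just to construct a DeviceMesh for the sketch.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)
mesh = DeviceMesh("cpu", [0])

# Replicated DTensor q/k/v with illustrative shapes.
q, k, v = (distribute_tensor(torch.randn(2, 8, 128, 64), mesh, [Replicate()])
           for _ in range(3))

# With is_causal=True the math decomposition materializes a causal mask as a
# plain (non-DTensor) tensor; mixing it with DTensor operands fails unless
# implicit replication is allowed.
with implicit_replication(), sdpa_kernel(SDPBackend.MATH):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

dist.destroy_process_group()
```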

This issue doesn't seem to be urgent, as math attention is only a fallback option for flash attention and memory-efficient attention.
