numerical difference for SDPA between non-dtensor vs dtensor, when math attention and fp16 are used #317

Open
tianyu-l opened this issue May 8, 2024 · 0 comments
Labels: bug (Something isn't working)

Comments

tianyu-l (Contributor) commented on May 8, 2024

A higher loss (9.5602 vs. 9.3164) was observed for the dtensor case after 10 steps on the llama2 debug model. This happens even without applying rotary embedding, i.e. even without the complex-number multiplication issue mentioned in #267.
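
For context, the non-dtensor baseline in that comparison is just SDPA run in fp16 with the math backend forced. A minimal sketch of that configuration (not the torchtitan training code; shapes and values are illustrative) could look like:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Illustrative (batch, heads, seq_len, head_dim) shapes; the real dimensions
# come from the llama2 debug model config and are not reproduced here.
q, k, v = (torch.randn(8, 16, 256, 64, dtype=torch.float16, device="cuda")
           for _ in range(3))

# Force the math (reference) backend instead of flash / memory-efficient attention.
with sdpa_kernel(SDPBackend.MATH):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```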

Note: to apply math attention with dtensor, one needs to set _allow_implicit_replication to True, because SDPA generates a non-dtensor causal mask when is_causal=True.
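
A minimal sketch of that workaround, assuming a single-process DTensor setup on CPU in fp32 purely for illustration (the original report used fp16 on GPU). Here implicit_replication() is the experimental context manager that flips the underlying _allow_implicit_replication flag; the exact import path and spelling may differ across PyTorch versions:

```python
import os
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.distributed._tensor import DeviceMesh, Replicate, distribute_tensor
# Experimental API; may live elsewhere (or be spelled differently) in other versions.
from torch.distributed._tensor.experimental import implicit_replication
from torch.nn.attention import SDPBackend, sdpa_kernel

# Single-process process group just to construct a DeviceMesh for the sketch.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)
mesh = DeviceMesh("cpu", [0])

# Replicated DTensor q/k/v with illustrative shapes.
q, k, v = (distribute_tensor(torch.randn(2, 8, 128, 64), mesh, [Replicate()])
           for _ in range(3))

# With is_causal=True the math decomposition materializes a causal mask as a
# plain (non-DTensor) tensor; mixing it with DTensor operands fails unless
# implicit replication is allowed.
with implicit_replication(), sdpa_kernel(SDPBackend.MATH):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

dist.destroy_process_group()
```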

This issue doesn't seem to be urgent, as math attention is only a fallback option for flash attention and memory-efficient attention.
