Update flash attention section in memory_optimizations.rst #9188
Conversation
Signed-off-by: cyanguwa <8636796+cyanguwa@users.noreply.github.com>
Turn Flash Attention On and Off
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- In the NeMo Framework, Flash Attention is supported through the Transformer Engine with the inclusion of Flash Attention 2. By default, Flash Attention is enabled, but the Transformer Engine may switch to a different kernel if the tensor dimensions are not optimal for Flash Attention. Users can completely disable Flash Attention by setting the environment variable ``NVTE_FLASH_ATTN=0``.
+ In the NeMo Framework, flash attention is supported through `Transformer Engine <https://github.com/NVIDIA/TransformerEngine/tree/main>`_ with both of the above implementations. Transformer Engine selects the appropriate implementation based on the input information (sequence length, number of heads, head dimension, etc.), but when both implementations are applicable, Transformer Engine prefers cuDNN flash attention on Hopper+ architectures, and Tri Dao flash attention on Ampere-based architectures. To disable Tri Dao flash attention, users can set the environment variable ``NVTE_FLASH_ATTN=0``, and to disable cuDNN flash attention, users can set ``NVTE_FUSED_ATTN=0``.
Change to:
“In the NeMo Framework, Flash Attention is supported through the Transformer Engine, including both of the implementations mentioned above. The Transformer Engine selects the appropriate implementation based on input information such as sequence length, number of heads, and head dimension. When both implementations are applicable, the Transformer Engine prefers cuDNN flash attention on Hopper+ architectures and Tri Dao flash attention on Ampere-based architectures.
To disable Tri Dao flash attention, set the environment variable NVTE_FLASH_ATTN=0. To disable cuDNN flash attention, set NVTE_FUSED_ATTN=0.”
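For illustration, a minimal Python sketch of using these two variables; it assumes they are read when Transformer Engine selects its attention backend, so they are set before training starts (the variable names come from the text above; everything else here is a sketch, not NeMo-specific code):

```python
import os

# Set before Transformer Engine selects an attention backend.
os.environ["NVTE_FLASH_ATTN"] = "0"  # disable Tri Dao flash attention
os.environ["NVTE_FUSED_ATTN"] = "0"  # disable cuDNN flash attention

# ... launch NeMo training as usual from here.
```

Setting both variables to 0 would disable both flash attention implementations, leaving Transformer Engine to fall back to its unfused attention path; in practice, only one would typically be disabled at a time.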
I reviewed the file and made a few copyedits, formatting changes, and paragraph rewrites.
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
What does this PR do?
Update the flash attention section in memory_optimizations.rst
Collection: [Note which collection this PR will affect]
Changelog
Usage
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines contain specific people who can review PRs in various areas.
Additional Information