
[gemini] async grad chunk reduce (all-reduce&reduce-scatter) #5713

Merged: 15 commits merged into hpcaitech:main on May 24, 2024

Conversation

@botbw (Contributor) commented on May 13, 2024

📌 Checklist before creating the PR

  • I have created an issue for this PR for traceability
  • The title follows the standard format: [doc/gemini/tensor/...]: A concise description
  • I have added relevant tags if possible for us to better distinguish different PRs
  • I have installed pre-commit: pip install pre-commit && pre-commit install

🚨 Issue number

Link this PR to your issue with words like fixed to automatically close the linked issue upon merge

e.g. fixed #1234, closed #1234, resolved #1234

📝 What does this PR do?

This PR makes Gemini's per-chunk gradient reduction asynchronous, for both the all-reduce and reduce-scatter code paths, so the collective overlaps with the remaining backward computation. It is exercised through the --async-reduce flag in the llama benchmark; the trace and benchmark numbers are in the follow-up comment, and a sketch of the general pattern follows below.
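For readers unfamiliar with the technique, here is a minimal sketch of the overlap pattern described above, written against plain torch.distributed (PyTorch >= 1.13 for reduce_scatter_tensor). It illustrates the general idea only, not the actual Gemini code path; reduce_chunk_async, flush_pending, and the pending list are hypothetical names for illustration.

import torch
import torch.distributed as dist

# In-flight reductions: (work handle, buffer to finalize) pairs.
pending = []

def reduce_chunk_async(chunk, world_size, use_reduce_scatter):
    # Launch the collective without blocking; NCCL runs it on its own
    # stream while the backward kernels keep executing.
    if use_reduce_scatter:
        # ZeRO-style: each rank keeps only its shard of the reduced grads.
        # Assumes chunk.numel() is padded to a multiple of world_size.
        shard = torch.empty(chunk.numel() // world_size,
                            dtype=chunk.dtype, device=chunk.device)
        work = dist.reduce_scatter_tensor(shard, chunk,
                                          op=dist.ReduceOp.SUM, async_op=True)
        pending.append((work, shard))
    else:
        work = dist.all_reduce(chunk, op=dist.ReduceOp.SUM, async_op=True)
        pending.append((work, chunk))

def flush_pending(world_size):
    # Call once before the optimizer step: wait on every in-flight
    # reduction, then average the summed gradients.
    for work, buf in pending:
        work.wait()
        buf.div_(world_size)
    pending.clear()

In the real integration the wait happens wherever Gemini consumes the gradients; the benchmark comment below shows the measured effect of the overlap.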

💥 Checklist before requesting a review

  • I have linked my PR to an issue (instruction)
  • My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
  • I have performed a self-review of my code
  • I have added thorough tests
  • I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

  • 🥰 Yes, I do.
  • 🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.

@botbw (Contributor, Author) commented on May 14, 2024

Trace

[trace screenshots: previous vs. now]

Benchmark

Before (without --async-reduce):

colossalai run --nproc_per_node 8 --hostfile hosts.txt benchmark.py -g -x -b 4 -s 100

num_samples: 392, dp_world_size: 8, flop_megatron: 9.7808637396779e+16, flop: 86555325938794496, avg_duration: 567.2686767578125, avg_throughput: 5.528244601700203
Throughput: 5.53 samples/sec, TFLOPS per GPU by Megatron: 172.42, TFLOPS per GPU: 152.58
Max CUDA memory usage: 53503.48 MB

Now (with --async-reduce):

colossalai run --nproc_per_node 8 --hostfile hosts.txt benchmark.py -g -x -b 4 -s 100 --async-reduce

num_samples: 392, dp_world_size: 8, flop_megatron: 9.7808637396779e+16, flop: 86555325938794496, avg_duration: 562.5335083007812, avg_throughput: 5.574779019782774
Throughput: 5.57 samples/sec, TFLOPS per GPU by Megatron: 173.87, TFLOPS per GPU: 153.87
Max CUDA memory usage: 53504.40 MB
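The logged fields are internally consistent. Assuming avg_duration is wall-clock seconds over the measured steps (an inference from the numbers, not documented benchmark output), the derived metrics of the "now" run can be reproduced:

# Reproducing the logged metrics of the "now" run above.
num_samples, dp_world_size = 392, 8
avg_duration = 562.5335083007812        # assumed: seconds over measured steps
flop = 86555325938794496                # per-GPU FLOPs over the measured steps
flop_megatron = 9.7808637396779e16      # Megatron-style FLOP estimate

print(num_samples / avg_duration * dp_world_size)    # 5.57   samples/sec
print(flop / avg_duration / 1e12)                    # 153.87 TFLOPS per GPU
print(flop_megatron / avg_duration / 1e12)           # 173.87 TFLOPS per GPU (Megatron)

Net effect of --async-reduce in this run: throughput rises from 5.53 to 5.57 samples/sec (about 0.8%) at essentially unchanged peak memory (53503.48 MB vs. 53504.40 MB).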

@botbw marked this pull request as ready for review May 14, 2024 01:39
@botbw requested a review from a team as a code owner May 14, 2024 01:39
Review threads (now outdated and resolved): colossalai/zero/gemini/gemini_ddp.py, examples/language/llama/benchmark.py
@botbw self-assigned this May 16, 2024
@botbw added the gemini label (related to the gemini feature) May 16, 2024
@botbw marked this pull request as draft May 20, 2024 07:31
@botbw marked this pull request as ready for review May 20, 2024 09:32
@botbw marked this pull request as draft May 23, 2024 02:42
@botbw marked this pull request as ready for review May 23, 2024 13:43
@ver217 merged commit 2fc85ab into hpcaitech:main May 24, 2024
8 checks passed