fix sequence parallel(Ulysses) grad scale for zero0 #5555

inkcherry · 2024-05-21T07:47:06Z

Use dp_world_size for grad reduction instead of seq_dp_world_size.
Currently, for zero0, only sparse tensors use the correct world size; other gradients end up with the wrong scale.

tiny model with sp=4 grad norm test:

| grad_norm | step1 | step2 | step3 | step4 | step5 | step100 |
|---|---|---|---|---|---|---|
| zero1 | 15.825 | 16.646 | 15.853 | 16.159 | 17.333 | 15.555 |
| zero0 | 3.956 | 4.161 | 3.963 | 4.040 | 4.333 | 3.889 |
| zero0 (this patch) | 15.825 | 16.646 | 15.853 | 16.159 | 17.333 | 15.554 |
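
The unpatched zero0 row is smaller than the zero1 row by a factor equal to sp=4. That is what one would expect if the reduced gradient were divided by seq_dp_world_size instead of dp_world_size, assuming seq_dp_world_size = dp_world_size * sp (this relation is an assumption, not stated in the description). A minimal sketch that checks the reported numbers; the helper name `scaled_down_by` is hypothetical:

```python
# Grad norms copied from the test results above.
SP_SIZE = 4
zero1 = [15.825, 16.646, 15.853, 16.159, 17.333, 15.555]
zero0_unpatched = [3.956, 4.161, 3.963, 4.040, 4.333, 3.889]
zero0_patched = [15.825, 16.646, 15.853, 16.159, 17.333, 15.554]


def scaled_down_by(reference, observed, factor, tol=2e-3):
    """Return True if every observed value equals reference / factor within tol."""
    return all(abs(r / factor - o) <= tol for r, o in zip(reference, observed))


if __name__ == "__main__":
    # Unpatched zero0 compared against zero1 / sp.
    print(scaled_down_by(zero1, zero0_unpatched, SP_SIZE))
    # Patched zero0 compared against zero1 directly.
    print(scaled_down_by(zero1, zero0_patched, 1))
```

The tolerance only absorbs the three-decimal rounding of the reported values.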

samadejacobs · 2024-05-24T17:02:48Z

deepspeed/runtime/engine.py


    def _reduce_expert_gradients(self, expert_grads, elements_per_buffer):
        # to maintain the gradients value unaffected by ep_size setting,
        # utilize dp_world_size for allreduce average
-        dp_world_size = dist.get_world_size(groups._get_data_parallel_group())
+        dp_world_size = dist.get_world_size(groups._get_data_parallel_group()) / float(self.sequence_parallel_size)


@inkcherry, can you help me understand why this is scaled by sp_size? get_data_parallel_group != get_sequence_data_parallel_group, so you should already have the correct value, no?

inkcherry · 2024-05-28

Thanks for the review, @samadejacobs! Yes, that should already be the correct value. We should only need to modify dp_world_size in the instance above.
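
The point of this exchange can be illustrated with plain arithmetic. The sketch below is not DeepSpeed code: the rank counts and the summed gradient value are made up, and it assumes the sequence data parallel group spans dp_world_size * sp_size ranks. It compares three divisors for a summed gradient: dp_world_size, seq_dp_world_size, and dp_world_size additionally divided by sp_size as in the `+` line of the diff.

```python
def averaged_grad(summed_grad: float, divisor: float) -> float:
    """Divide a summed (all-reduced) gradient by the chosen world size."""
    return summed_grad / divisor


# Hypothetical layout: 8 ranks in total, sequence parallel size 4.
world_size = 8
sp_size = 4
dp_world_size = world_size // sp_size        # size of the data parallel group
seq_dp_world_size = dp_world_size * sp_size  # assumed size of the seq + dp group

summed = 16.0  # made-up gradient value after a sum all-reduce

# Divisor the description asks for.
correct = averaged_grad(summed, dp_world_size)
# Dividing by seq_dp_world_size shrinks the result by sp_size.
too_small = averaged_grad(summed, seq_dp_world_size)
# Dividing an already correct dp_world_size by sp_size inflates it by sp_size.
too_large = averaged_grad(summed, dp_world_size / float(sp_size))

if __name__ == "__main__":
    print(correct, too_small, too_large)
```

Under these assumptions the first case matches the zero0 symptom in the table, and the last case is the extra scaling the reviewer asked about.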

inkcherry · 2024-06-03T03:48:05Z

Hi @samadejacobs, I have reverted the modification on the line you mentioned. Could you please review the other parts again? Thanks!

fix ds-sp grad scale for zero0

cb15ffa

inkcherry requested review from mrwyattii and tjruwase as code owners May 21, 2024 07:47

tjruwase requested review from samadejacobs and tohtana and removed request for tjruwase and mrwyattii May 21, 2024 15:14

samadejacobs reviewed May 24, 2024

keep the correct dp_size

60e0dbc

inkcherry commented May 21, 2024 (edited)

samadejacobs May 24, 2024

inkcherry May 28, 2024

inkcherry commented Jun 3, 2024
