[Question] How does Horovod ensure allreduce is finished before gradients get applied in TensorFlow? #3900

Hi @zhaolianshuizls, good question, this is not entirely obvious!

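HorovodAllreduceOp is an asynchronous TensorFlow kernel, so TensorFlow treats the op as finished, and runs the ops that consume its output, only after the TF callback done() has been called. In HorovodAllreduceOp::ComputeAsync() we also define a callback as a lambda function. The Horovod background thread invokes it once the allreduce has finished, and it in turn calls done() to inform TensorFlow that the operation is complete: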
https://github.com/horovod/horovod/blob/master/horovod/tensorflow/mpi_ops.cc#L492-L505

Op implementations like NCCLAllreduce::Execute() (https://github.com/horovod/horovod/blob/master/horovod/common/ops/nccl_operations.cc#L185) make sure that this callback is called once the operation is complete, in this case via FinalizeGPUQueue(), https://github.com/horovod/horovod/blob/master/horovod/co…
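
To make the ordering concrete, here is a minimal Python sketch of the same pattern. It is not Horovod or TensorFlow code: `BackgroundWorker`, `compute_async` and `run_step` are hypothetical names invented for illustration, the "allreduce" is just an element-wise sum over in-process lists, and a `threading.Event` stands in for TensorFlow's done() callback.

```python
import queue
import threading


class BackgroundWorker:
    """Hypothetical stand-in for the Horovod background thread."""

    def __init__(self):
        self._requests = queue.Queue()
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def enqueue(self, per_rank_grads, on_complete):
        # Hand the request to the background thread and return immediately.
        self._requests.put((per_rank_grads, on_complete))

    def _loop(self):
        while True:
            per_rank_grads, on_complete = self._requests.get()
            # Stand-in for the real allreduce: element-wise sum across "ranks".
            reduced = [sum(values) for values in zip(*per_rank_grads)]
            # The callback fires only after the reduction has completed.
            on_complete(reduced)


def compute_async(worker, per_rank_grads, output, done):
    """Analogue of ComputeAsync(): returns at once, done() is called later."""

    def callback(reduced):
        output.extend(reduced)  # write the result first
        done()                  # then tell the "framework" the op is finished

    worker.enqueue(per_rank_grads, callback)


def run_step(per_rank_grads, weights, lr=0.1):
    """Analogue of the framework: dependent work waits for done()."""
    worker = BackgroundWorker()
    reduced = []
    finished = threading.Event()
    compute_async(worker, per_rank_grads, reduced, finished.set)
    finished.wait()  # the gradient update is not scheduled before done()
    return [w - lr * g for w, g in zip(weights, reduced)]
```

Calling `run_step([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0], lr=0.5)` applies the update only after the summed gradients `[4.0, 6.0]` have been written, because the update waits on the event that the callback sets.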

Answer selected by zhaolianshuizls