[Question] How horovod ensures allreduce is finished before gradients get applied in tensorflow #3900
-
In TensorFlow, the main thread calls HorovodAllreduceOP::ComputeAsync, which puts allreduce requests into a queue through EnqueueTensorAllreduce; these requests are later consumed by the background thread, and the actual allreduce is performed by that background thread. If I got that right, here is the question that has puzzled me for a while: the TensorFlow graph runs in the main thread, while the allreduce operations happen in the background thread, so how are these two threads synchronized so that the allreduce is finished before the reduced gradients are applied?
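The producer/consumer hand-off described in the question can be sketched in plain Python. This is a hypothetical illustration, not Horovod's actual implementation: `background_loop`, the doubling "reduction", and the `on_done` callback are all made-up stand-ins for EnqueueTensorAllreduce, the real allreduce, and the completion signal.

```python
import queue
import threading

# Hypothetical sketch of the pattern in the question: the main thread
# enqueues allreduce requests, a background thread consumes them, performs
# the reduction, and then fires a completion callback.
request_queue = queue.Queue()

def background_loop():
    # Consume requests until a None sentinel arrives.
    while True:
        item = request_queue.get()
        if item is None:
            break
        tensor, callback = item
        # Stand-in for the real allreduce: "sum" across 2 simulated workers.
        reduced = [x * 2 for x in tensor]
        callback(reduced)  # Signal completion back to the caller.

worker = threading.Thread(target=background_loop)
worker.start()

done = threading.Event()
result = {}

def on_done(reduced):
    result["grads"] = reduced
    done.set()

# "Main thread" enqueues a request, then waits for the callback to fire
# before using the reduced gradients.
request_queue.put(([1.0, 2.0, 3.0], on_done))
done.wait()  # Gradients are only applied after this point.
request_queue.put(None)
worker.join()
print(result["grads"])  # -> [2.0, 4.0, 6.0]
```

The key point the sketch demonstrates is that the main thread does not poll the background thread; it blocks on a completion signal that the background thread raises.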
Replies: 1 comment 1 reply
-
Hi @zhaolianshuizls, good question, this is not entirely obvious! TensorFlow uses a pool with multiple threads of its own to run async op implementations like a Horovod allreduce, so they don't all run in the same main thread.
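This point can be illustrated with a small Python sketch, using Python's `concurrent.futures` pool as a hypothetical stand-in for TensorFlow's internal C++ thread pool:

```python
from concurrent.futures import ThreadPoolExecutor
import threading

# Illustrative only: like TensorFlow's internal thread pool, the async op
# body submitted here executes on a pool worker thread, not the main thread.
pool = ThreadPoolExecutor(max_workers=4)

def async_op():
    # Report whether we are running outside the main thread.
    return threading.current_thread() is not threading.main_thread()

future = pool.submit(async_op)
print(future.result())  # -> True
pool.shutdown()
```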
In `HorovodAllreduceOP::ComputeAsync()` we also define a callback (to be called by the Horovod background thread) which informs TensorFlow that the operation is done, by calling the TF callback `done()`, with this lambda function: https://github.com/horovod/horovod/blob/master/horovod/tensorflow/mpi_ops.cc#L492-L505

Op implementations like `NCCLAllreduce::Execute()` (https://github.com/horovod/horovod/blob/master/horovod/common/ops/nccl_operations.cc#L185) make sure that this callback is called once the operation is complete, in this case via `FinalizeGPUQueue()`, https://github.com/horovod/horovod/co…
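Putting the pieces together, the contract can be sketched in Python. All names here (`FakeFramework`, `run_async_op`, `compute_async`) are illustrative; the real mechanism is TensorFlow's C++ `AsyncOpKernel::ComputeAsync` receiving a done callback, which Horovod invokes from its background thread.

```python
import threading

# Hypothetical sketch of the ComputeAsync/done() contract: the framework
# hands the kernel a `done` callback, the kernel hands the work off to a
# background thread, and the framework does not consider the op finished
# (or run downstream ops) until `done` is invoked.
class FakeFramework:
    def run_async_op(self, compute_async):
        finished = threading.Event()
        compute_async(done=finished.set)  # kernel must call done() eventually
        finished.wait()                   # downstream ops wait on completion
        return "gradients applied after allreduce"

def compute_async(done):
    def background_work():
        # ... perform the allreduce here (omitted) ...
        done()  # analogous to the completion callback firing in Horovod
    threading.Thread(target=background_work).start()

print(FakeFramework().run_async_op(compute_async))
# -> gradients applied after allreduce
```

So the synchronization is not between the two threads directly; it is mediated by TensorFlow's async-op machinery, which holds back the ops consuming the reduced gradients until the callback fires.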