Is your feature request related to a problem? Please describe.
Multi-node training on a 2xH100 server works completely fine with the nvcr.io/nvidia/tensorflow:23.02-tf2-py3 Docker container provided by NVIDIA, which has the following packages and versions.
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
[6]<stderr>: return fn(*args, **kwargs)
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/keras/src/engine/training.py", line 1742, in fit
[6]<stderr>: tmp_logs = self.train_function(iterator)
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/keras/src/engine/training.py", line 1338, in train_function
[6]<stderr>: return step_function(self, iterator)
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/keras/src/engine/training.py", line 1322, in step_function
[6]<stderr>: outputs = model.distribute_strategy.run(run_step, args=(data,))
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/keras/src/engine/training.py", line 1303, in run_step
[6]<stderr>: outputs = model.train_step(data)
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/keras/src/engine/training.py", line 1084, in train_step
[6]<stderr>: self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/keras/src/optimizers/legacy/optimizer_v2.py", line 598, in minimize
[6]<stderr>: grads_and_vars = self._compute_gradients(
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/horovod/_keras/__init__.py", line 136, in _compute_gradients
[6]<stderr>: allreduced_grads = self._allreduce(grads, weights)
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/horovod/_keras/__init__.py", line 218, in _allreduce
[6]<stderr>: return __filtered_reduce_grads(grads, vars)
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/horovod/_keras/__init__.py", line 184, in __filtered_reduce_grads
[6]<stderr>: rg = self._allreduce_grads(rg, rv)
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 573, in allreduce_grads
[6]<stderr>: if groups is not None:
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 616, in allreduce_grads
[6]<stderr>: op=op,
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 616, in allreduce_grads
[6]<stderr>: op=op,
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 616, in allreduce_grads
[6]<stderr>: op=op,
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 398, in _allreduce_cond
[6]<stderr>: return tf.cond(cond, allreduce_fn, id_fn)
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 384, in allreduce_fn
[6]<stderr>: return allreduce(tensor, *args, process_set=process_set, **kwargs)
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 102, in allreduce
[6]<stderr>: if isinstance(tensor, tf.IndexedSlices):
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 138, in allreduce
[6]<stderr>: summed_tensor_compressed = _allreduce(tensor_compressed, op=op,
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/mpi_ops.py", line 130, in _allreduce
[6]<stderr>: return MPI_LIB.horovod_allreduce(tensor, name=name, reduce_op=op,
[6]<stderr>:
[6]<stderr>: File "multinode_training/multinode_training.py", line 477, in <module>
[6]<stderr>: net.fit(train_batches,
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
[6]<stderr>: return fn(*args, **kwargs)
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/keras/src/engine/training.py", line 1742, in fit
[6]<stderr>: tmp_logs = self.train_function(iterator)
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/keras/src/engine/training.py", line 1338, in train_function
[6]<stderr>: return step_function(self, iterator)
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/keras/src/engine/training.py", line 1322, in step_function
[6]<stderr>: outputs = model.distribute_strategy.run(run_step, args=(data,))
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/keras/src/engine/training.py", line 1303, in run_step
[6]<stderr>: outputs = model.train_step(data)
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/keras/src/engine/training.py", line 1084, in train_step
[6]<stderr>: self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/keras/src/optimizers/legacy/optimizer_v2.py", line 598, in minimize
[6]<stderr>: grads_and_vars = self._compute_gradients(
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/horovod/_keras/__init__.py", line 136, in _compute_gradients
[6]<stderr>: allreduced_grads = self._allreduce(grads, weights)
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/horovod/_keras/__init__.py", line 218, in _allreduce
[6]<stderr>: return __filtered_reduce_grads(grads, vars)
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/horovod/_keras/__init__.py", line 184, in __filtered_reduce_grads
[6]<stderr>: rg = self._allreduce_grads(rg, rv)
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 573, in allreduce_grads
[6]<stderr>: if groups is not None:
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 616, in allreduce_grads
[6]<stderr>: op=op,
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 616, in allreduce_grads
[6]<stderr>: op=op,
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 616, in allreduce_grads
[6]<stderr>: op=op,
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 398, in _allreduce_cond
[6]<stderr>: return tf.cond(cond, allreduce_fn, id_fn)
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 384, in allreduce_fn
[6]<stderr>: return allreduce(tensor, *args, process_set=process_set, **kwargs)
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 102, in allreduce
[6]<stderr>: if isinstance(tensor, tf.IndexedSlices):
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 138, in allreduce
[6]<stderr>: summed_tensor_compressed = _allreduce(tensor_compressed, op=op,
[6]<stderr>:
[6]<stderr>: File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/mpi_ops.py", line 130, in _allreduce
[6]<stderr>: return MPI_LIB.horovod_allreduce(tensor, name=name, reduce_op=op,
[6]<stderr>:
[6]<stderr>: File "<string>", line 108, in horovod_allreduce
[6]<stderr>:
[6]<stderr>:ncclCommInitRank failed: invalid usage (run with NCCL_DEBUG=WARN for details)
[6]<stderr>: [[{{node DistributedAdam_Allreduce/cond_74/HorovodAllreduce_grads_74_0}}]] [Op:__inference_train_function_6700]
[6]<stderr>:Terminated
[0]<stderr>:Terminated
[4]<stderr>:Terminated
[5]<stderr>:Terminated
Process 7 exit with status code 1.
Terminating remaining workers after failure of Process 7.
Process 3 exit with status code 1.
Process 1 exit with status code 1.
Process 2 exit with status code 1.
Process 6 exit with status code 143.
Process 0 exit with status code 143.
Process 4 exit with status code 143.
Process 5 exit with status code 143.
Traceback (most recent call last):
  File "/home/idps/.local/bin/horovodrun", line 8, in <module>
    sys.exit(run_commandline())
  File "/home/idps/.local/lib/python3.8/site-packages/horovod/runner/launch.py", line 837, in run_commandline
    _run(args)
  File "/home/idps/.local/lib/python3.8/site-packages/horovod/runner/launch.py", line 827, in _run
    return _run_static(args)
  File "/home/idps/.local/lib/python3.8/site-packages/horovod/runner/launch.py", line 685, in _run_static
    _launch_job(args, settings, nics, command)
  File "/home/idps/.local/lib/python3.8/site-packages/horovod/runner/launch.py", line 800, in _launch_job
    run_controller(args.use_gloo, gloo_run_fn,
  File "/home/idps/.local/lib/python3.8/site-packages/horovod/runner/launch.py", line 776, in run_controller
    gloo_run()
  File "/home/idps/.local/lib/python3.8/site-packages/horovod/runner/launch.py", line 792, in gloo_run_fn
    gloo_run(settings, nics, env, driver_ip, command)
  File "/home/idps/.local/lib/python3.8/site-packages/horovod/runner/gloo_run.py", line 300, in gloo_run
    launch_gloo(command, exec_command, settings, nics, env, server_ip)
  File "/home/idps/.local/lib/python3.8/site-packages/horovod/runner/gloo_run.py", line 284, in launch_gloo
    raise RuntimeError('Horovod detected that one or more processes exited with non-zero '
RuntimeError: Horovod detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: 7
Exit code: 1
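The NCCL message in the log above asks for a rerun with `NCCL_DEBUG=WARN`. A minimal sketch of how that rerun could be prepared from Python is shown here. The host list `node1:4,node2:4` is a hypothetical placeholder (the log only shows that 8 processes were started), and it is an assumption that the launcher forwards its environment to the remote workers; if it does not, the variable has to be exported on each node.

```python
import os
import shlex


def build_debug_launch(script, num_procs, hosts, base_env=None):
    """Return (argv, env) for a horovodrun launch with NCCL warnings enabled."""
    # Copy the environment so the caller's os.environ is not modified.
    env = dict(os.environ if base_env is None else base_env)
    # The NCCL error message itself suggests this setting.
    env["NCCL_DEBUG"] = "WARN"
    argv = ["horovodrun", "-np", str(num_procs), "-H", hosts, "python", script]
    return argv, env


if __name__ == "__main__":
    argv, env = build_debug_launch(
        "multinode_training/multinode_training.py",
        num_procs=8,
        hosts="node1:4,node2:4",  # hypothetical host names and slot counts
    )
    # Print the command instead of running it, so it can be checked first.
    print("NCCL_DEBUG=" + env["NCCL_DEBUG"], shlex.join(argv))
```

Passing the returned `argv` and `env` to `subprocess.run(argv, env=env)` would start the job; the extra NCCL output should then appear in the workers' stderr next to the existing traceback.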
PurvagLapsiwala changed the title from "Getting error while running multi node machine learning training on H100 server" to "Getting error while running multi node machine learning training on H100 servers" on Oct 2, 2023.
When I run the same training directly on the host, without Docker, using the following versions, I get the `ncclCommInitRank failed: invalid usage` error.
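For comparing the host installation with the container, the installed package versions can be collected with a short script. This is only a sketch: the package names `tensorflow`, `keras`, and `horovod` are taken from the paths in the traceback, and NCCL and CUDA are not covered because they are usually not installed as Python packages.

```python
from importlib import metadata

# Package names taken from the site-packages paths in the traceback.
PACKAGES = ("tensorflow", "keras", "horovod")


def installed_versions(names=PACKAGES):
    """Map each package name to its installed version, or 'not installed'."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}=={version}")
```

Running the same script inside the container and on the host gives two lists that can be compared line by line.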
Error:
script
Describe the solution you'd like
The training should run on the host without any errors, as it does in the container.