Environment:
Checklist:
Bug report: Please describe erroneous behavior you're observing and steps to reproduce it.
[root@bm83 ~]# cat /etc/centos-release
CentOS Linux release 8.1.1911 (Core)

top - 15:59:32 up 345 days, 5:36, 2 users, load average: 3.35, 3.45, 3.31
Tasks: 395 total, 1 running, 380 sleeping, 14 stopped, 0 zombie
%Cpu(s): 5.3 us, 5.9 sy, 0.0 ni, 85.8 id, 1.6 wa, 0.2 hi, 1.1 si, 0.0 st
MiB Mem : 64260.5 total, 383.6 free, 5623.3 used, 58253.6 buff/cache
MiB Swap: 32288.0 total, 25981.3 free, 6306.7 used. 57995.8 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12652 root 20 0 1838576 363928 183776 S 97.7 0.6 16:13.30 python
12653 root 20 0 1838524 364924 184768 S 97.7 0.6 16:14.91 python
17665 1000 20 0 16.1g 2.1g 25564 S 57.3 3.4 419294:36 java
2. This is how I built my environment:
[root@bm83 ~]# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
horovod/horovod latest 4f3896dc9b9e 7 months ago 14.3GB

docker run -it -d --privileged --name horovod --network host -v /data/ssh/:/root/.ssh/ -v /data/horovod:/data/ horovod/horovod:latest
docker exec -it horovod /bin/bash
sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
sed -i 's/#PubkeyAuthentication yes/PubkeyAuthentication yes/' /etc/ssh/sshd_config
sed -i 's/#Port 22/Port 12345/' /etc/ssh/sshd_config
service ssh restart
apt update -y && apt install rsync net-tools vim ncat telnet -y
3. The script I executed, main.py:
import tensorflow as tf
import numpy as np
from tensorflow import keras
import horovod.tensorflow.keras as hvd

print("1111111111111111111")
hvd.init()
print("2222222222222222222")

model = tf.keras.Sequential([keras.layers.Dense(units=1, input_shape=[1])])
model.compile(optimizer='sgd', loss='mean_squared_error')

xs = np.array([-1.0, 0.0, 1.0, 2.0, 3.0, 4.0], dtype=float)
ys = np.array([-2.0, 1.0, 4.0, 7.0, 10.0, 13.0], dtype=float)

model.fit(xs, ys, epochs=3000)

if hvd.rank() == 0:
    model.save_weights("adasd.h5")
4. The command I used to launch it:
root@bm83:/data/QuakeMitchell# export HOROVOD_LOG_LEVEL=trace
root@bm83:/data/QuakeMitchell# mpirun --allow-run-as-root -oversubscribe --mca oob_tcp_include eth0,eth2 --mca btl tcp,self --mca oob tcp -map-by slot --mca plm_rsh_args "-p 12345 -q -o StrictHostKeyChecking=no" -np 2 -H 10.206.74.32:2 python /data/QuakeMitchell/main.py
1111111111111111111
[2024-01-26 07:43:02.115518: D /tmp/pip-req-build-9nlys6qr/horovod/common/utils/env_parser.cc:107] Using MPI to perform controller operations.
[2024-01-26 07:43:02.115573: D /tmp/pip-req-build-9nlys6qr/horovod/common/utils/env_parser.cc:73] Using MPI to perform CPU operations.
[2024-01-26 07:43:02.115589: D /tmp/pip-req-build-9nlys6qr/horovod/common/mpi/mpi_context.h:51] MPI context enabled.
[2024-01-26 07:43:02.115612: D /tmp/pip-req-build-9nlys6qr/horovod/common/mpi/mpi_controller.h:36] MPI Controller constructed.
1111111111111111111
[2024-01-26 07:43:02.118399: D /tmp/pip-req-build-9nlys6qr/horovod/common/utils/env_parser.cc:107] Using MPI to perform controller operations.
[2024-01-26 07:43:02.118443: D /tmp/pip-req-build-9nlys6qr/horovod/common/utils/env_parser.cc:73] Using MPI to perform CPU operations.
[2024-01-26 07:43:02.118473: D /tmp/pip-req-build-9nlys6qr/horovod/common/mpi/mpi_context.h:51] MPI context enabled.
[2024-01-26 07:43:02.118503: D /tmp/pip-req-build-9nlys6qr/horovod/common/mpi/mpi_controller.h:36] MPI Controller constructed.
[2024-01-26 07:43:02.185741: D /tmp/pip-req-build-9nlys6qr/horovod/common/mpi/mpi_context.cc:195] Using MPI_COMM_WORLD as global communicator.
[2024-01-26 07:43:02.185741: D /tmp/pip-req-build-9nlys6qr/horovod/common/mpi/mpi_context.cc:195] Using MPI_COMM_WORLD as global communicator.
--------------The program blocks in hvd.init()-------------
The second print ("2222222222222222222") never appears, and both python processes stay at close to 100% CPU:

root@bm83:/data/QuakeMitchell# top
top - 07:28:21 up 345 days, 5:05, 1 user, load average: 2.62, 3.10, 3.13
Tasks: 26 total, 1 running, 11 sleeping, 14 stopped, 0 zombie
%Cpu(s): 5.5 us, 5.7 sy, 0.0 ni, 87.1 id, 0.3 wa, 0.2 hi, 1.1 si, 0.0 st
MiB Mem : 64260.5 total, 308.6 free, 5691.5 used, 58260.4 buff/cache
MiB Swap: 32288.0 total, 26062.5 free, 6225.5 used. 57926.1 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9134 root 20 0 1842688 367908 183580 S 98.3 0.6 1:43.27 python
9135 root 20 0 1842640 368152 183812 S 97.7 0.6 1:42.90 python
1 root 20 0 4244 0 0 S 0.0 0.0 0:00.04 bash
29 root 20 0 4244 1808 1544 S 0.0 0.0 0:00.15 bash
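When a call blocks like this, one way to confirm which Python frame each rank is stuck in is the standard-library faulthandler module. The sketch below is only an illustration: hang_watchdog and slow_init are hypothetical names, and slow_init is a stand-in for hvd.init() so the example runs without Horovod. It can only show the Python-level frame, not what the native MPI code underneath is waiting on.

```python
import contextlib
import faulthandler
import sys
import time


@contextlib.contextmanager
def hang_watchdog(timeout_s, file=None):
    """Dump all Python thread stacks if the guarded block exceeds timeout_s.

    hang_watchdog is a hypothetical helper name, not part of Horovod.
    """
    target = file if file is not None else sys.stderr
    # Arms a background timer; the process keeps running after the dump.
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=target)
    try:
        yield
    finally:
        # Disarm the timer once the guarded block returns or raises.
        faulthandler.cancel_dump_traceback_later()


def slow_init(seconds):
    """Hypothetical stand-in for a blocking call such as hvd.init()."""
    time.sleep(seconds)
    return "initialised"


if __name__ == "__main__":
    # Finishes well inside the timeout, so no traceback is dumped here.
    with hang_watchdog(timeout_s=1.0):
        print(slow_init(0.1))
```

In main.py the guarded block would wrap hvd.init() instead of slow_init, with a timeout of a minute or so; each rank that is still blocked then writes its stack to stderr.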
Neither your code nor the way you are launching Horovod looks correct. Please follow the Keras example here: https://github.com/horovod/horovod/blob/master/examples/keras/keras_mnist.py Also follow the Horovod MPI docs to see how to run the program with the horovodrun command: https://github.com/horovod/horovod/blob/master/examples/keras/keras_mnist.py
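As a rough illustration of the horovodrun suggestion, the sketch below assembles a launch command for the host and script from the report. The helper name horovodrun_command is hypothetical. The sketch assumes the total process count is the sum of the per-host slots, and that the custom sshd port (12345 in the report) is passed with horovodrun's -p option.

```python
import shlex


def horovodrun_command(script, hosts, ssh_port=None, python="python"):
    """Build a horovodrun argument list (hypothetical helper, not a Horovod API).

    hosts maps a host name or address to the number of slots on that host.
    """
    if not hosts:
        raise ValueError("at least one host is required")
    # Assumption: one process per slot, so -np is the sum of all slots.
    total = sum(hosts.values())
    host_spec = ",".join(f"{host}:{slots}" for host, slots in hosts.items())
    cmd = ["horovodrun", "-np", str(total), "-H", host_spec]
    if ssh_port is not None:
        # Only needed when sshd does not listen on the default port.
        cmd += ["-p", str(ssh_port)]
    cmd += [python, script]
    return cmd


if __name__ == "__main__":
    cmd = horovodrun_command(
        "/data/QuakeMitchell/main.py",
        {"10.206.74.32": 2},
        ssh_port=12345,
    )
    print(shlex.join(cmd))
```

Returning a list rather than a string keeps the result usable with subprocess.run without shell quoting; shlex.join is only used to print a copy-pasteable form.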