Question about running MPI and Horovod in a Jupyter Python kernel #3880
christaylor181
started this conversation in
General
Replies: 1 comment
-
@christaylor181 just wondering if you were able to make progress on this issue, and if so, how you did it.
-
Hi - I want to experiment with a notebook running Horovod distributed across three HPC nodes, each with one GPU. I load these modules in my kernel definition:
"module load shared slurm jupyter-eg-kernel-wlm-py39 horovod-tensorflow2-py39-cuda11.2-gcc9/0.22.1 nccl2-cuda11.2-gcc9/2.14.3 tensorflow2-py39-cuda11.2-gcc9/2.7.0 openmpi4-cuda11.2-ofed51-gcc9"
Then the kernel is started with this SLURM submission:
...etc
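(For reference, and not necessarily what my submission script does - the module names and resource flags below are assumptions - a standalone three-node, one-GPU-per-node Horovod job under Slurm is typically launched with one MPI task per node, e.g.:)

```shell
# Sketch of a typical multi-node Horovod launch under Slurm.
# Assumes the same modules as above are loaded; script name is hypothetical.
srun --nodes=3 --ntasks=3 --ntasks-per-node=1 --gpus-per-node=1 \
    python train_hvd.py
```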
So it starts up fine and I see three nodes allocated for my three tasks, but when I run
hvd.init()
and then hvd.size()
in a notebook cell, it only prints 1. Shouldn't I be able to see a size of 3? Various online examples that load the MNIST dataset and train a model all run fine, but only ever on one of the GPUs on one of the nodes. I can see the GPUs on all the nodes fine with nvidia-smi, and I always get this message when I run the sample:

I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1631 MB memory: -> device: 0, name: Quadro P600, pci bus id: 0000:07:00.0, compute capability: 6.1
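For what it's worth, hvd.size() falls back to 1 whenever the Python process wasn't actually started by an MPI launcher, so one quick sanity check from inside the kernel is to look for the environment variables the launcher would export. This is just a diagnostic sketch - the variable names below cover the usual Open MPI and Slurm cases, but your site may use different ones:

```python
import os

# When a process is launched by mpirun/srun, the MPI runtime exports
# rank/size environment variables. A Jupyter kernel started outside an
# MPI context sees none of these, so Horovod initializes a one-process
# "world" and hvd.size() returns 1.
MPI_SIZE_VARS = [
    "OMPI_COMM_WORLD_SIZE",  # Open MPI
    "PMI_SIZE",              # MPICH / Slurm PMI
    "SLURM_NTASKS",          # Slurm job step
]

def detected_world_size(env=os.environ):
    """Return the world size advertised by the launcher, or 1 if none found."""
    for var in MPI_SIZE_VARS:
        if var in env:
            return int(env[var])
    return 1

if __name__ == "__main__":
    # Prints 1 unless this process was launched under mpirun/srun.
    print(detected_world_size())
```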