Question about running MPI and Horovod in a Jupyter Python kernel #3880
christaylor181
started this conversation in
General
Replies: 1 comment
-
@christaylor181 just wondering if you were able to make progress on this issue, and if so, how you did it.
-
Hi - I want to experiment with a notebook running Horovod distributed across three HPC nodes, each with one GPU. I load these modules in my kernel definition:
"module load shared slurm jupyter-eg-kernel-wlm-py39 horovod-tensorflow2-py39-cuda11.2-gcc9/0.22.1 nccl2-cuda11.2-gcc9/2.14.3 tensorflow2-py39-cuda11.2-gcc9/2.7.0 openmpi4-cuda11.2-ofed51-gcc9"
Then the kernel is started with this SLURM submission:
...etc
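(For reference, and not necessarily what my submission script does - the module names and resource flags below are assumptions - a standalone three-node, one-GPU-per-node Horovod job under Slurm is typically launched with one MPI task per node, e.g.:)

```shell
# Sketch of a typical multi-node Horovod launch under Slurm.
# Assumes the same modules as above are loaded; script name is hypothetical.
srun --nodes=3 --ntasks=3 --ntasks-per-node=1 --gpus-per-node=1 \
    python train_hvd.py
```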
So it starts up fine and I see three nodes allocated for my three tasks, but when I run
hvd.init()
and then hvd.size()
in a notebook cell, it only prints 1. Shouldn't I be able to see a size of 3? Various online examples that load the MNIST dataset and train a model all run fine, but only ever on one of the GPUs on one of the nodes. I can see the GPUs on all the nodes fine with nvidia-smi, and I always get this message when I run the sample:

I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1631 MB memory: -> device: 0, name: Quadro P600, pci bus id: 0000:07:00.0, compute capability: 6.1
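For what it's worth, hvd.size() falls back to 1 whenever the Python process wasn't actually started by an MPI launcher, so one quick sanity check from inside the kernel is to look for the environment variables the launcher would export. This is just a diagnostic sketch - the variable names below cover the usual Open MPI and Slurm cases, but your site may use different ones:

```python
import os

# When a process is launched by mpirun/srun, the MPI runtime exports
# rank/size environment variables. A Jupyter kernel started outside an
# MPI context sees none of these, so Horovod initializes a one-process
# "world" and hvd.size() returns 1.
MPI_SIZE_VARS = [
    "OMPI_COMM_WORLD_SIZE",  # Open MPI
    "PMI_SIZE",              # MPICH / Slurm PMI
    "SLURM_NTASKS",          # Slurm job step
]

def detected_world_size(env=os.environ):
    """Return the world size advertised by the launcher, or 1 if none found."""
    for var in MPI_SIZE_VARS:
        if var in env:
            return int(env[var])
    return 1

if __name__ == "__main__":
    # Prints 1 unless this process was launched under mpirun/srun.
    print(detected_world_size())
```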