Compiling with MPI+PyTorch does not work #3992

Open
fferroni opened this issue Oct 6, 2023 · 2 comments

fferroni commented Oct 6, 2023

Environment:

I am trying to build the latest Horovod release (v0.28.1) in a Docker container with MPI and PyTorch. The Dockerfile below is largely based on the Docker image in the repo.

FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04

# Select pytorch toolchain version
ARG PYTORCH_VERSION=2.0.1+cu118
ARG TORCHVISION_VERSION=0.15.2+cu118
ARG PYTORCH_LIGHTNING_VERSION=2.0.6

ENV DEBIAN_FRONTEND=noninteractive \
    DEBCONF_NONINTERACTIVE_SEEN=true \
    PYTHONUNBUFFERED=1 \
    TZ=Europe/London
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone

RUN apt update && \
    apt install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
    build-essential \
    cmake \
    git \
    curl \
    vim \
    wget \
    ca-certificates \
    libjpeg-dev \
    libpng-dev \
    librdmacm1 \
    libibverbs1 \
    ibverbs-providers \
    openjdk-8-jdk-headless \
    openssh-client \
    openssh-server && \
    gpg --list-keys && \
    gpg --no-default-keyring --keyring /usr/share/keyrings/deadsnakes.gpg --keyserver keyserver.ubuntu.com --recv-keys F23C5A6CF475977595C89F51BA6932366A755776 && \
    echo 'deb [signed-by=/usr/share/keyrings/deadsnakes.gpg] http://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy main' | tee -a /etc/apt/sources.list.d/python.list && \
    apt update && \
    apt install python3.10 python3.10-dev python3.10-distutils -y && \
    update-alternatives --install /usr/bin/python python /usr/bin/python3.10 1 && \
    curl --output get-pip.py https://bootstrap.pypa.io/get-pip.py && \
    python3.10 get-pip.py && \
    apt clean

# Set default shell to /bin/bash
SHELL ["/bin/bash", "-euo", "pipefail", "-c"]

# Install Open MPI
RUN wget --progress=dot:mega -O /tmp/openmpi-4.1.4-bin.tar.gz https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.4.tar.gz && \
    cd /tmp && tar -zxf /tmp/openmpi-4.1.4-bin.tar.gz && \
    mkdir openmpi-4.1.4/build && cd openmpi-4.1.4/build && ../configure --prefix=/usr/local && \
    make -j all && make install && ldconfig && \
    mpirun --version

# Install PyTorch
RUN pip install --no-cache-dir \
    torch==${PYTORCH_VERSION} \
    torchvision==${TORCHVISION_VERSION} \
    pytorch_lightning==${PYTORCH_LIGHTNING_VERSION} \
    --extra-index-url=https://download.pytorch.org/whl/cu118

# Install Horovod, temporarily using CUDA stubs
WORKDIR /horovod
RUN wget https://github.com/horovod/horovod/archive/refs/tags/v0.28.1.tar.gz && \
    tar -xvf v0.28.1.tar.gz && \
    cp -r horovod-0.28.1/* . && \
    rm v0.28.1.tar.gz && \
    rm -r horovod-0.28.1/
RUN python setup.py sdist && \
    ldconfig /usr/local/cuda/targets/x86_64-linux/lib/stubs && \
    bash -c "HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_MPI=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITHOUT_GLOO=1 HOROVOD_WITHOUT_TENSORFLOW=1 HOROVOD_WITHOUT_MXNET=1 pip install --no-cache-dir -v $(ls /horovod/dist/horovod-*.tar.gz)" && \
    horovodrun --check-build && \
    ldconfig

# Check all frameworks are working correctly. Use CUDA stubs to ensure CUDA libs can be found correctly
# when running on CPU machine
WORKDIR "/horovod/examples"
RUN ldconfig /usr/local/cuda/targets/x86_64-linux/lib/stubs && \
    python -c "import horovod.torch as hvd; hvd.init()" && \
    ldconfig

Error
The build fails on a missing Eigen/Core header. Has anyone come across this? Installing libeigen3-dev via apt does not help.

#0 44.68   cd /tmp/pip-req-build-gb4msjz3/build/temp.linux-x86_64-cpython-310/RelWithDebInfo/horovod/torch && /usr/bin/c++ -DEIGEN_MPL2_ONLY=1 -DHAVE_CUDA=1 -DHAVE_GPU=1 -DHAVE_MPI=1 -DHAVE_NCCL=1 -DHAVE_NVTX=1 -DHOROVOD_GPU_ALLGATHER=78 -DHOROVOD_GPU_ALLREDUCE=78 -DHOROVOD_GPU_ALLTOALL=78 -DHOROVOD_GPU_BROADCAST=78 -DHOROVOD_GPU_REDUCESCATTER=78 -DPYTORCH_VERSION=2000001000 -DTORCH_API_INCLUDE_EXTENSION_H=1 -Dpytorch_EXPORTS -I/tmp/pip-req-build-gb4msjz3/third_party/HTTPRequest/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/assert/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/config/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/core/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/detail/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/iterator/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/lockfree/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/mpl/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/parameter/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/predef/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/preprocessor/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/static_assert/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/type_traits/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/utility/include -I/tmp/pip-req-build-gb4msjz3/third_party/lbfgs/include -I/tmp/pip-req-build-gb4msjz3/third_party/eigen -I/tmp/pip-req-build-gb4msjz3/third_party/flatbuffers/include -isystem /usr/local/cuda/targets/x86_64-linux/include -isystem /usr/local/cuda/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0  -pthread -fPIC -Wall -ftree-vectorize -mf16c -mavx -mfma -O3 -g 
-DNDEBUG -std=c++14 -fPIC -MD -MT horovod/torch/CMakeFiles/pytorch.dir/__/common/half.cc.o -MF CMakeFiles/pytorch.dir/__/common/half.cc.o.d -o CMakeFiles/pytorch.dir/__/common/half.cc.o -c /tmp/pip-req-build-gb4msjz3/horovod/common/half.cc
#0 44.68   cd /tmp/pip-req-build-gb4msjz3/build/temp.linux-x86_64-cpython-310/RelWithDebInfo/horovod/torch && /usr/bin/c++ -DEIGEN_MPL2_ONLY=1 -DHAVE_CUDA=1 -DHAVE_GPU=1 -DHAVE_MPI=1 -DHAVE_NCCL=1 -DHAVE_NVTX=1 -DHOROVOD_GPU_ALLGATHER=78 -DHOROVOD_GPU_ALLREDUCE=78 -DHOROVOD_GPU_ALLTOALL=78 -DHOROVOD_GPU_BROADCAST=78 -DHOROVOD_GPU_REDUCESCATTER=78 -DPYTORCH_VERSION=2000001000 -DTORCH_API_INCLUDE_EXTENSION_H=1 -Dpytorch_EXPORTS -I/tmp/pip-req-build-gb4msjz3/third_party/HTTPRequest/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/assert/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/config/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/core/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/detail/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/iterator/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/lockfree/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/mpl/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/parameter/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/predef/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/preprocessor/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/static_assert/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/type_traits/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/utility/include -I/tmp/pip-req-build-gb4msjz3/third_party/lbfgs/include -I/tmp/pip-req-build-gb4msjz3/third_party/eigen -I/tmp/pip-req-build-gb4msjz3/third_party/flatbuffers/include -isystem /usr/local/cuda/targets/x86_64-linux/include -isystem /usr/local/cuda/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0  -pthread -fPIC -Wall -ftree-vectorize -mf16c -mavx -mfma -O3 -g 
-DNDEBUG -std=c++14 -fPIC -MD -MT horovod/torch/CMakeFiles/pytorch.dir/__/common/logging.cc.o -MF CMakeFiles/pytorch.dir/__/common/logging.cc.o.d -o CMakeFiles/pytorch.dir/__/common/logging.cc.o -c /tmp/pip-req-build-gb4msjz3/horovod/common/logging.cc
#0 44.68   cd /tmp/pip-req-build-gb4msjz3/build/temp.linux-x86_64-cpython-310/RelWithDebInfo/horovod/torch && /usr/bin/c++ -DEIGEN_MPL2_ONLY=1 -DHAVE_CUDA=1 -DHAVE_GPU=1 -DHAVE_MPI=1 -DHAVE_NCCL=1 -DHAVE_NVTX=1 -DHOROVOD_GPU_ALLGATHER=78 -DHOROVOD_GPU_ALLREDUCE=78 -DHOROVOD_GPU_ALLTOALL=78 -DHOROVOD_GPU_BROADCAST=78 -DHOROVOD_GPU_REDUCESCATTER=78 -DPYTORCH_VERSION=2000001000 -DTORCH_API_INCLUDE_EXTENSION_H=1 -Dpytorch_EXPORTS -I/tmp/pip-req-build-gb4msjz3/third_party/HTTPRequest/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/assert/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/config/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/core/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/detail/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/iterator/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/lockfree/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/mpl/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/parameter/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/predef/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/preprocessor/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/static_assert/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/type_traits/include -I/tmp/pip-req-build-gb4msjz3/third_party/boost/utility/include -I/tmp/pip-req-build-gb4msjz3/third_party/lbfgs/include -I/tmp/pip-req-build-gb4msjz3/third_party/eigen -I/tmp/pip-req-build-gb4msjz3/third_party/flatbuffers/include -isystem /usr/local/cuda/targets/x86_64-linux/include -isystem /usr/local/cuda/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0  -pthread -fPIC -Wall -ftree-vectorize -mf16c -mavx -mfma -O3 -g 
-DNDEBUG -std=c++14 -fPIC -MD -MT horovod/torch/CMakeFiles/pytorch.dir/__/common/message.cc.o -MF CMakeFiles/pytorch.dir/__/common/message.cc.o.d -o CMakeFiles/pytorch.dir/__/common/message.cc.o -c /tmp/pip-req-build-gb4msjz3/horovod/common/message.cc
#0 44.79   In file included from /tmp/pip-req-build-gb4msjz3/horovod/common/global_state.h:25,
#0 44.79                    from /tmp/pip-req-build-gb4msjz3/horovod/common/controller.h:24,
#0 44.79                    from /tmp/pip-req-build-gb4msjz3/horovod/common/controller.cc:18:
#0 44.79   /tmp/pip-req-build-gb4msjz3/horovod/common/parameter_manager.h:26:10: fatal error: Eigen/Core: No such file or directory
#0 44.79      26 | #include <Eigen/Core>
#0 44.79         |          ^~~~~~~~~~~~
#0 44.79   compilation terminated.
#0 44.79   gmake[2]: *** [horovod/torch/CMakeFiles/pytorch.dir/build.make:90: horovod/torch/CMakeFiles/pytorch.dir/__/common/controller.cc.o] Error 1
#0 44.79   gmake[2]: *** Waiting for unfinished jobs....
#0 44.81   In file included from /tmp/pip-req-build-gb4msjz3/horovod/common/message.cc:23:
#0 44.81   /tmp/pip-req-build-gb4msjz3/horovod/common/wire/message_generated.h:23:10: fatal error: flatbuffers/flatbuffers.h: No such file or directory
#0 44.81      23 | #include "flatbuffers/flatbuffers.h"
#0 44.81         |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~
#0 44.81   compilation terminated.
#0 44.81   gmake[2]: *** [horovod/torch/CMakeFiles/pytorch.dir/build.make:160: horovod/torch/CMakeFiles/pytorch.dir/__/common/message.cc.o] Error 1
#0 44.82   [ 26%] Linking CUDA static library libhorovod_cuda_kernels.a
#0 44.82   cd /tmp/pip-req-build-gb4msjz3/build/temp.linux-x86_64-cpython-310/RelWithDebInfo/horovod/common/ops/cuda && /usr/local/lib/python3.10/dist-packages/cmake/data/bin/cmake -P CMakeFiles/horovod_cuda_kernels.dir/cmake_clean_target.cmake
#0 44.82   cd /tmp/pip-req-build-gb4msjz3/build/temp.linux-x86_64-cpython-310/RelWithDebInfo/horovod/common/ops/cuda && /usr/local/lib/python3.10/dist-packages/cmake/data/bin/cmake -E cmake_link_script CMakeFiles/horovod_cuda_kernels.dir/link.txt --verbose=1
#0 44.82   /usr/bin/ar qc libhorovod_cuda_kernels.a CMakeFiles/horovod_cuda_kernels.dir/cuda_kernels.cu.o
#0 44.83   /usr/bin/ranlib libhorovod_cuda_kernels.a
#0 44.84   gmake[2]: Leaving directory '/tmp/pip-req-build-gb4msjz3/build/temp.linux-x86_64-cpython-310/RelWithDebInfo'
#0 44.84   In file included from /tmp/pip-req-build-gb4msjz3/horovod/common/half.cc:16:
#0 44.84   /tmp/pip-req-build-gb4msjz3/horovod/common/half.h: In function ‘void horovod::common::HalfBits2Float(const short unsigned int*, float*)’:
#0 44.84   /tmp/pip-req-build-gb4msjz3/horovod/common/half.h:76:11: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
#0 44.84      76 |   *res = *reinterpret_cast<float const*>(&f);
#0 44.84         |           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#0 44.85   [ 26%] Built target horovod_cuda_kernels
#0 45.27   /tmp/pip-req-build-gb4msjz3/horovod/common/half.h:76:10: warning: ‘f’ is used uninitialized [-Wuninitialized]
#0 45.27      76 |   *res = *reinterpret_cast<float const*>(&f);
#0 45.28         |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#0 45.28   /tmp/pip-req-build-gb4msjz3/horovod/common/half.h:48:12: note: ‘f’ declared here
#0 45.28      48 |   unsigned f = 0;
#0 45.28         |            ^
#0 46.49   gmake[2]: Leaving directory '/tmp/pip-req-build-gb4msjz3/build/temp.linux-x86_64-cpython-310/RelWithDebInfo'
#0 46.49   gmake[1]: *** [CMakeFiles/Makefile2:154: horovod/torch/CMakeFiles/pytorch.dir/all] Error 2
#0 46.49   gmake[1]: Leaving directory '/tmp/pip-req-build-gb4msjz3/build/temp.linux-x86_64-cpython-310/RelWithDebInfo'
#0 46.49   gmake: *** [Makefile:91: all] Error 2
#0 46.50   Traceback (most recent call last):
#0 46.50     File "<string>", line 2, in <module>
#0 46.50     File "<pip-setuptools-caller>", line 34, in <module>
#0 46.50     File "/tmp/pip-req-build-gb4msjz3/setup.py", line 213, in <module>
#0 46.50       setup(name='horovod',
#0 46.50     File "/usr/local/lib/python3.10/dist-packages/setuptools/__init__.py", line 103, in setup
#0 46.50       return distutils.core.setup(**attrs)
#0 46.50     File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/core.py", line 185, in setup
#0 46.51       return run_commands(dist)
#0 46.51     File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/core.py", line 201, in run_commands
#0 46.51       dist.run_commands()
#0 46.51     File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/dist.py", line 969, in run_commands
#0 46.51       self.run_command(cmd)
#0 46.51     File "/usr/local/lib/python3.10/dist-packages/setuptools/dist.py", line 989, in run_command
#0 46.51       super().run_command(command)
#0 46.51     File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/dist.py", line 988, in run_command
#0 46.51       cmd_obj.run()
#0 46.51     File "/usr/local/lib/python3.10/dist-packages/wheel/bdist_wheel.py", line 364, in run
#0 46.51       self.run_command("build")
#0 46.51     File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/cmd.py", line 318, in run_command
#0 46.51       self.distribution.run_command(command)
#0 46.51     File "/usr/local/lib/python3.10/dist-packages/setuptools/dist.py", line 989, in run_command
#0 46.51       super().run_command(command)
#0 46.51     File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/dist.py", line 988, in run_command
#0 46.51       cmd_obj.run()
#0 46.51     File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/command/build.py", line 131, in run
#0 46.51       self.run_command(cmd_name)
#0 46.51     File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/cmd.py", line 318, in run_command
#0 46.51       self.distribution.run_command(command)
#0 46.51     File "/usr/local/lib/python3.10/dist-packages/setuptools/dist.py", line 989, in run_command
#0 46.52       super().run_command(command)
#0 46.52     File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/dist.py", line 988, in run_command
#0 46.52       cmd_obj.run()
#0 46.52     File "/usr/local/lib/python3.10/dist-packages/setuptools/command/build_ext.py", line 88, in run
#0 46.52       _build_ext.run(self)
#0 46.52     File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/command/build_ext.py", line 345, in run
#0 46.52       self.build_extensions()
#0 46.52     File "/tmp/pip-req-build-gb4msjz3/setup.py", line 145, in build_extensions
#0 46.52       subprocess.check_call(command, cwd=cmake_build_dir)
#0 46.52     File "/usr/lib/python3.10/subprocess.py", line 369, in check_call
#0 46.52       raise CalledProcessError(retcode, cmd)
#0 46.52   subprocess.CalledProcessError: Command '['cmake', '--build', '.', '--config', 'RelWithDebInfo', '--', '-j8', 'VERBOSE=1']' returned non-zero exit status 2.
#0 46.54   error: subprocess-exited-with-error
#0 46.54   
#0 46.54   × python setup.py bdist_wheel did not run successfully.
#0 46.54   │ exit code: 1
#0 46.54   ╰─> See above for output.
#0 46.54   
#0 46.54   note: This error originates from a subprocess, and is likely not a problem with pip.
#0 46.54   full command: /usr/bin/python3.10 -u -c '
#0 46.54   exec(compile('"'"''"'"''"'"'
#0 46.54   # This is <pip-setuptools-caller> -- a caller that pip uses to run setup.py
#0 46.54   #
#0 46.54   # - It imports setuptools before invoking setup.py, to enable projects that directly
#0 46.54   #   import from `distutils.core` to work with newer packaging standards.
#0 46.54   # - It provides a clear error message when setuptools is not installed.
#0 46.54   # - It sets `sys.argv[0]` to the underlying `setup.py`, when invoking `setup.py` so
#0 46.54   #   setuptools doesn'"'"'t think the script is `-c`. This avoids the following warning:
#0 46.54   #     manifest_maker: standard file '"'"'-c'"'"' not found".
#0 46.54   # - It generates a shim setup.py, for handling setup.cfg-only projects.
#0 46.54   import os, sys, tokenize
#0 46.54   
#0 46.54   try:
#0 46.54       import setuptools
#0 46.54   except ImportError as error:
#0 46.54       print(
#0 46.54           "ERROR: Can not execute `setup.py` since setuptools is not available in "
#0 46.54           "the build environment.",
#0 46.54           file=sys.stderr,
#0 46.54       )
#0 46.54       sys.exit(1)
#0 46.54   
#0 46.54   __file__ = %r
#0 46.54   sys.argv[0] = __file__
#0 46.54   
#0 46.54   if os.path.exists(__file__):
#0 46.54       filename = __file__
#0 46.54       with tokenize.open(__file__) as f:
#0 46.54           setup_py_code = f.read()
#0 46.54   else:
#0 46.54       filename = "<auto-generated setuptools caller>"
#0 46.54       setup_py_code = "from setuptools import setup; setup()"
#0 46.54   
#0 46.54   exec(compile(setup_py_code, filename, "exec"))
#0 46.54   '"'"''"'"''"'"' % ('"'"'/tmp/pip-req-build-gb4msjz3/setup.py'"'"',), "<pip-setuptools-caller>", "exec"))' bdist_wheel -d /tmp/pip-wheel-f6c_nxp6
#0 46.54   cwd: /tmp/pip-req-build-gb4msjz3/
#0 46.54   Building wheel for horovod (setup.py): finished with status 'error'
#0 46.54   ERROR: Failed building wheel for horovod
#0 46.54   Running setup.py clean for horovod
#0 46.54   Running command python setup.py clean
#0 46.70   running clean
#0 46.70   removing 'build/temp.linux-x86_64-cpython-310' (and everything under it)
#0 46.71   removing 'build/lib.linux-x86_64-cpython-310' (and everything under it)
#0 46.72   'build/bdist.linux-x86_64' does not exist -- can't clean it
#0 46.72   'build/scripts-3.10' does not exist -- can't clean it
#0 46.72   removing 'build'
#0 46.74 Failed to build horovod
#0 46.74 ERROR: Could not build wheels for horovod, which is required to install pyproject.toml-based projects
------
horovod.Dockerfile:71
--------------------
  70 |         rm -r horovod-0.28.1/
  71 | >>> RUN python setup.py sdist && \
  72 | >>>     ldconfig /usr/local/cuda/targets/x86_64-linux/lib/stubs && \
  73 | >>>     bash -c "HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_MPI=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITHOUT_GLOO=1 HOROVOD_WITHOUT_TENSORFLOW=1 HOROVOD_WITHOUT_MXNET=1 pip install --no-cache-dir -v $(ls /horovod/dist/horovod-*.tar.gz)" && \
  74 | >>>     horovodrun --check-build && \
  75 | >>>     ldconfig
  76 |     
--------------------
ERROR: failed to solve: process "/bin/bash -euo pipefail -c python setup.py sdist &&     ldconfig /usr/local/cuda/targets/x86_64-linux/lib/stubs &&     bash -c \"HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_MPI=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITHOUT_GLOO=1 HOROVOD_WITHOUT_TENSORFLOW=1 HOROVOD_WITHOUT_MXNET=1 pip install --no-cache-dir -v $(ls /horovod/dist/horovod-*.tar.gz)\" &&     horovodrun --check-build &&     ldconfig" did not complete successfully: exit code: 1
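The log above is long, and the actual failures are only a few "fatal error" lines. A minimal sketch for pulling the missing headers out of such a log follows. It assumes the GCC diagnostic format seen above; the function name missing_headers is hypothetical, and the sample lines are copied from this log.

```python
import re

# Matches GCC's "<file>:<line>:<col>: fatal error: <header>: No such file or directory".
MISSING_HEADER = re.compile(
    r"(?P<source>\S+):(?P<line>\d+):(?P<col>\d+): fatal error: "
    r"(?P<header>\S+): No such file or directory"
)


def missing_headers(log_text):
    """Return (header, including_file) pairs in first-seen order, without duplicates."""
    seen = []
    for match in MISSING_HEADER.finditer(log_text):
        pair = (match.group("header"), match.group("source"))
        if pair not in seen:
            seen.append(pair)
    return seen


# Two lines copied from the build log above.
sample = """\
#0 44.79   /tmp/pip-req-build-gb4msjz3/horovod/common/parameter_manager.h:26:10: fatal error: Eigen/Core: No such file or directory
#0 44.81   /tmp/pip-req-build-gb4msjz3/horovod/common/wire/message_generated.h:23:10: fatal error: flatbuffers/flatbuffers.h: No such file or directory
"""

for header, source in missing_headers(sample):
    print(f"{header} (included from {source})")
```

On this log it reports two headers, Eigen/Core and flatbuffers/flatbuffers.h, so Eigen is not the only include that cannot be found.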

Earlier in the build output, there are warnings that nothing can be found under third_party/eigen/Eigen:

#0 2.088   warning: no directories found matching 'third_party/eigen/Eigen'
#0 2.089   warning: no previously-included files found matching 'third_party/eigen/Eigen/Eigen'
#0 2.089   warning: no previously-included files found matching 'third_party/eigen/Eigen/IterativeLinearSolvers'
#0 2.090   warning: no previously-included files found matching 'third_party/eigen/Eigen/MetisSupport'
#0 2.091   warning: no previously-included files found matching 'third_party/eigen/Eigen/Sparse'
#0 2.091   warning: no previously-included files found matching 'third_party/eigen/Eigen/SparseCholesky'
#0 2.092   warning: no previously-included files found matching 'third_party/eigen/Eigen/SparseLU'
#0 2.092   warning: no previously-included files found matching 'third_party/eigen/Eigen/src/IterativeSolvers/*'
#0 2.093   warning: no previously-included files found matching 'third_party/eigen/Eigen/src/OrderingMethods/Amd.h'
#0 2.093   warning: no previously-included files found matching 'third_party/eigen/Eigen/src/SparseCholesky/*'
#0 2.093   warning: no previously-included files found matching 'third_party/eigen/unsupported/test/mpreal/mpreal.h'
#0 2.094   warning: no previously-included files found matching 'third_party/eigen/unsupported/Eigen/FFT'
#0 2.094   warning: no previously-included files found matching 'third_party/eigen/unsupported/Eigen/MPRealSupport'
#0 2.094   warning: no previously-included files found matching 'third_party/eigen/doc/PreprocessorDirectives.dox'
#0 2.095   warning: no previously-included files found matching 'third_party/eigen/doc/UsingIntelMKL.dox'
#0 2.095   warning: no previously-included files found matching 'third_party/eigen/doc/SparseLinearSystems.dox'
#0 2.095   warning: no previously-included files found matching 'third_party/eigen/COPYING.GPL'
#0 2.096   warning: no previously-included files found matching 'third_party/eigen/COPYING.LGPL'
#0 2.096   warning: no previously-included files found matching 'third_party/eigen/COPYING.README'
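These warnings, together with the -I.../third_party/... flags in the compile commands, suggest that the vendored third_party directories may be empty in the extracted source tree. One possible cause, which is an assumption and not confirmed anywhere in this thread, is that the release tarball downloaded from GitHub does not contain the contents of git submodules. A small sketch to check this before starting the build follows. The directory list is taken from the include flags in the log above; the function name empty_vendored_dirs is hypothetical.

```python
from pathlib import Path

# Vendored include directories taken from the -I flags in the compile log above.
VENDORED_DIRS = [
    "third_party/eigen",
    "third_party/flatbuffers/include",
    "third_party/lbfgs/include",
    "third_party/boost/lockfree/include",
    "third_party/HTTPRequest/include",
]


def empty_vendored_dirs(source_root, candidates=VENDORED_DIRS):
    """Return the candidate directories that are missing or contain no entries."""
    root = Path(source_root)
    problems = []
    for rel in candidates:
        path = root / rel
        if not path.is_dir() or not any(path.iterdir()):
            problems.append(rel)
    return problems
```

Calling empty_vendored_dirs("/horovod") right after the tarball is unpacked would show whether the sources are incomplete. If the list is not empty, fetching the sources with a recursive git clone instead of the tag archive may be worth trying.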
@fferroni fferroni added the bug label Oct 6, 2023

fferroni commented Oct 6, 2023

It seems this specific problem goes away only if Horovod is also built with TensorFlow support.

Several further errors then appear, such as missing flatbuffers, boost, and liblbfgs headers. I am now stuck on:

#0 74.40   /tmp/pip-req-build-rffy10f0/horovod/common/optim/bayesian_optimization.cc:24:10: fatal error: LBFGS.h: No such file or directory
#0 74.40      24 | #include "LBFGS.h"

Installing liblbfgs-dev via apt does not seem to help.

I feel like I'm doing something very wrong in the setup, but it's not obvious to me what.
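One way to see why the apt package does not help is to check which directories actually contain a file named exactly LBFGS.h. The compile commands in the first comment pass -I.../third_party/lbfgs/include, so the build appears to expect the header there. It is an assumption here that liblbfgs-dev installs a differently named header (lowercase lbfgs.h, for a separate C library). The sketch below is a case-sensitive header lookup; the function name find_header and the example directories are hypothetical.

```python
import os


def find_header(name, include_dirs):
    """Return the first path where `name` exists with exactly this spelling, else None."""
    for directory in include_dirs:
        candidate = os.path.join(directory, name)
        parent, leaf = os.path.split(candidate)
        # Compare against the directory listing so that the match stays
        # case-sensitive even on a case-insensitive filesystem.
        if os.path.isdir(parent) and leaf in os.listdir(parent):
            return candidate
    return None
```

For example, find_header("LBFGS.h", ["/horovod/third_party/lbfgs/include", "/usr/include"]) returns None if neither directory holds that exact file name, which would match the compiler error above.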

This is the current Dockerfile:

FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04

# Select pytorch toolchain version
ARG PYTORCH_VERSION=2.0.1+cu118
ARG TORCHVISION_VERSION=0.15.2+cu118
ARG PYTORCH_LIGHTNING_VERSION=2.0.6
ARG TENSORFLOW_VERSION=2.9.2
ARG MXNET_VERSION=1.9.1

ENV DEBIAN_FRONTEND=noninteractive \
    DEBCONF_NONINTERACTIVE_SEEN=true \
    PYTHONUNBUFFERED=1 \
    TZ=Europe/London
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone

RUN apt update && \
    apt install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
    build-essential \
    cmake \
    git \
    curl \
    vim \
    wget \
    ca-certificates \
    libjpeg-dev \
    libpng-dev \
    librdmacm1 \
    libibverbs1 \
    ibverbs-providers \
    openjdk-8-jdk-headless \
    openssh-client \
    openssh-server && \
    gpg --list-keys && \
    gpg --no-default-keyring --keyring /usr/share/keyrings/deadsnakes.gpg --keyserver keyserver.ubuntu.com --recv-keys F23C5A6CF475977595C89F51BA6932366A755776 && \
    echo 'deb [signed-by=/usr/share/keyrings/deadsnakes.gpg] http://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy main' | tee -a /etc/apt/sources.list.d/python.list && \
    apt update && \
    apt install python3.10 python3.10-dev python3.10-distutils -y && \
    update-alternatives --install /usr/bin/python python /usr/bin/python3.10 1 && \
    curl --output get-pip.py https://bootstrap.pypa.io/get-pip.py && \
    python3.10 get-pip.py && \
    apt clean

# Set default shell to /bin/bash
SHELL ["/bin/bash", "-euo", "pipefail", "-c"]

# Install Open MPI
RUN wget --progress=dot:mega -O /tmp/openmpi-4.1.4-bin.tar.gz https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.4.tar.gz && \
    cd /tmp && tar -zxf /tmp/openmpi-4.1.4-bin.tar.gz && \
    mkdir openmpi-4.1.4/build && cd openmpi-4.1.4/build && ../configure --prefix=/usr/local && \
    make -j all && make install && ldconfig && \
    mpirun --version

# Install PyTorch
RUN pip install --no-cache-dir \
    torch==${PYTORCH_VERSION} \
    torchvision==${TORCHVISION_VERSION} \
    pytorch_lightning==${PYTORCH_LIGHTNING_VERSION} \
    --extra-index-url=https://download.pytorch.org/whl/cu118

# Install Tensorflow
RUN pip install --no-cache-dir future typing packaging
RUN pip install --no-cache-dir \
    tensorflow==${TENSORFLOW_VERSION} \
    keras \
    h5py

# Install MXNet
RUN pip install --no-cache-dir mxnet-cu112==${MXNET_VERSION} "numpy<1.24.0"

# Install Horovod, temporarily using CUDA stubs
# Install with NCCL and MPI support
RUN apt install -y libflatbuffers-dev libboost-all-dev liblbfgs-dev

WORKDIR /horovod
RUN wget https://github.com/horovod/horovod/archive/refs/tags/v0.28.1.tar.gz && \
    tar -xvf v0.28.1.tar.gz && \
    cp -r horovod-0.28.1/* . && \
    rm v0.28.1.tar.gz && \
    rm -r horovod-0.28.1/
RUN python setup.py sdist && \
    ldconfig /usr/local/cuda/targets/x86_64-linux/lib/stubs && \
    bash -c "HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_MPI=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_MXNET=1 HOROVOD_WITHOUT_GLOO=1 pip install --no-cache-dir -v $(ls /horovod/dist/horovod-*.tar.gz)" && \
    horovodrun --check-build && \
    ldconfig

# Check all frameworks are working correctly. Use CUDA stubs to ensure CUDA libs can be found correctly
# when running on CPU machine
WORKDIR "/horovod/examples"
RUN ldconfig /usr/local/cuda/targets/x86_64-linux/lib/stubs && \
    python -c "import horovod.torch as hvd; hvd.init()" && \
    ldconfig

@datasith

Did you ever figure this out, @fferroni?
