Horovod with TensorFlow crashed #4020

Open
mythZhu opened this issue Feb 5, 2024 · 0 comments

mythZhu commented Feb 5, 2024

Environment:

  1. Framework: TensorFlow
  2. Framework version: 1.15.3
  3. Horovod version: 0.28.1
  4. MPI version: 4.0.1
  5. CUDA version:
  6. NCCL version:
  7. Python version: 3.6.9
  8. Spark / PySpark version:
  9. Ray version:
  10. OS and version: Ubuntu 18.04.6 LTS
  11. GCC version: 7.5.0
  12. CMake version: 3.13.0

Checklist:

  1. Did you search issues to find if somebody asked this question before? Yes
  2. If your question is about hang, did you read this doc? Yes
  3. If your question is about docker, did you read this doc? Yes
  4. Did you check if your question is answered in the troubleshooting guide? Yes

Bug report:
I ran CPU-only training with Horovod and TensorFlow, launched with Open MPI. Horovod consistently crashed with one of the following errors whenever some workers had no data to process and called hvd.join() directly to wait for the other workers.

munmap_chunk(): invalid pointer
[node-0:410078] *** Process received signal ***
[node-0:410078] Signal: Aborted (6)
[node-0:410078] Signal code:  (-6)
[node-0:410078] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef10)[0x7f77f0cebf10]
[node-0:410078] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f77f0cebe87]
[node-0:410078] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f77f0ced7f1]
[node-0:410078] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x89837)[0x7f77f0d36837]
[node-0:410078] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x908ba)[0x7f77f0d3d8ba]
[node-0:410078] [ 5] /lib/x86_64-linux-gnu/libc.so.6(cfree+0x58c)[0x7f77f0d44e9c]
[node-0:410078] [ 6] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN9__gnu_cxx13new_allocatorIN7horovod6common7RequestEE10deallocateEPS3_m+0x20)[0x7f77d197de04]
[node-0:410078] [ 7] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNSt16allocator_traitsISaIN7horovod6common7RequestEEE10deallocateERS3_PS2_m+0x2b)[0x7f77d197bac8]
[node-0:410078] [ 8] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNSt12_Vector_baseIN7horovod6common7RequestESaIS2_EE13_M_deallocateEPS2_m+0x32)[0x7f77d1978de0]
[node-0:410078] [ 9] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNSt12_Vector_baseIN7horovod6common7RequestESaIS2_EED2Ev+0x52)[0x7f77d1977d66]
[node-0:410078] [10] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNSt6vectorIN7horovod6common7RequestESaIS2_EED1Ev+0x41)[0x7f77d1974e7b]
[node-0:410078] [11] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNSt4pairIKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt6vectorIN7horovod6common7RequestESaISA_EEED1Ev+0x1c)[0x7f77d197f9ca]
[node-0:410078] [12] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN9__gnu_cxx13new_allocatorISt4pairIKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt6vectorIN7horovod6common7RequestESaISC_EEEE7destroyISF_EEvPT_+0x1c)[0x7f77d197f9f6]
[node-0:410078] [13] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNSt16allocator_traitsISaISt4pairIKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt6vectorIN7horovod6common7RequestESaISB_EEEEE7destroyISE_EEvRSF_PT_+0x23)[0x7f77d197e4ee]
[node-0:410078] [14] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNSt8__detail16_Hashtable_allocISaINS_10_Hash_nodeISt4pairIKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt6vectorIN7horovod6common7RequestESaISD_EEELb1EEEEE18_M_deallocate_nodeEPSH_+0x6c)[0x7f77d197c34c]
[node-0:410078] [15] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNSt10_HashtableINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt4pairIKS5_St6vectorIN7horovod6common7RequestESaISB_EEESaISE_ENSt8__detail10_Select1stESt8equal_toIS5_ESt4hashIS5_ENSG_18_Mod_range_hashingENSG_20_Default_ranged_hashENSG_20_Prime_rehash_policyENSG_17_Hashtable_traitsILb1ELb0ELb1EEEE8_M_eraseEmPNSG_15_Hash_node_baseEPNSG_10_Hash_nodeISE_Lb1EEE+0x12b)[0x7f77d197da9f]
[node-0:410078] [16] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNSt10_HashtableINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt4pairIKS5_St6vectorIN7horovod6common7RequestESaISB_EEESaISE_ENSt8__detail10_Select1stESt8equal_toIS5_ESt4hashIS5_ENSG_18_Mod_range_hashingENSG_20_Default_ranged_hashENSG_20_Prime_rehash_policyENSG_17_Hashtable_traitsILb1ELb0ELb1EEEE5eraseENSG_20_Node_const_iteratorISE_Lb0ELb1EEE+0x62)[0x7f77d197b67e]
[node-0:410078] [17] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNSt10_HashtableINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt4pairIKS5_St6vectorIN7horovod6common7RequestESaISB_EEESaISE_ENSt8__detail10_Select1stESt8equal_toIS5_ESt4hashIS5_ENSG_18_Mod_range_hashingENSG_20_Default_ranged_hashENSG_20_Prime_rehash_policyENSG_17_Hashtable_traitsILb1ELb0ELb1EEEE5eraseENSG_14_Node_iteratorISE_Lb0ELb1EEE+0x45)[0x7f77d1978609]
[node-0:410078] [18] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNSt13unordered_mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt6vectorIN7horovod6common7RequestESaIS9_EESt4hashIS5_ESt8equal_toIS5_ESaISt4pairIKS5_SB_EEE5eraseENSt8__detail14_Node_iteratorISI_Lb0ELb1EEE+0x23)[0x7f77d1975711]
[node-0:410078] [19] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common10Controller17ConstructResponseERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEi+0x1cf0)[0x7f77d1970eb4]
[node-0:410078] [20] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common10Controller19ComputeResponseListEbRNS0_18HorovodGlobalStateERNS0_10ProcessSetE+0x1c2f)[0x7f77d196e85d]
[node-0:410078] [21] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0x10365e)[0x7f77d199d65e]
[node-0:410078] [22] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0x102e54)[0x7f77d199ce54]
[node-0:410078] [23] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZSt13__invoke_implIvPFvRN7horovod6common18HorovodGlobalStateEEJSt17reference_wrapperIS2_EEET_St14__invoke_otherOT0_DpOT1_+0x39)[0x7f77d19ae578]
[node-0:410078] [24] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZSt8__invokeIPFvRN7horovod6common18HorovodGlobalStateEEJSt17reference_wrapperIS2_EEENSt15__invoke_resultIT_JDpT0_EE4typeEOS9_DpOSA_+0x4e)[0x7f77d19a9d68]
[node-0:410078] [25] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNSt6thread8_InvokerISt5tupleIJPFvRN7horovod6common18HorovodGlobalStateEESt17reference_wrapperIS4_EEEE9_M_invokeIJLm0ELm1EEEEDTcl8__invokespcl10_S_declvalIXT_EEEEESt12_Index_tupleIJXspT_EEE+0x43)[0x7f77d19c39f9]
[node-0:410078] [26] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNSt6thread8_InvokerISt5tupleIJPFvRN7horovod6common18HorovodGlobalStateEESt17reference_wrapperIS4_EEEEclEv+0x2c)[0x7f77d19c399a]
[node-0:410078] [27] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJPFvRN7horovod6common18HorovodGlobalStateEESt17reference_wrapperIS5_EEEEEE6_M_runEv+0x1c)[0x7f77d19c391e]
[node-0:410078] [28] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd44c0)[0x7f73eb5a54c0]
[node-0:410078] [29] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7f77f0a956db]
[node-0:410078] *** End of error message ***

OR

free(): invalid next size (normal)
[node-0:391803] *** Process received signal ***
[node-0:391803] Signal: Aborted (6)
[node-0:391803] Signal code:  (-6)
[node-0:391803] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef10)[0x7fd1dc5a7f10]
[node-0:391803] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7fd1dc5a7e87]
[node-0:391803] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7fd1dc5a97f1]
[node-0:391803] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x89837)[0x7fd1dc5f2837]
[node-0:391803] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x908ba)[0x7fd1dc5f98ba]
[node-0:391803] [ 5] /lib/x86_64-linux-gnu/libc.so.6(cfree+0x76d)[0x7fd1dc60107d]
[node-0:391803] [ 6] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0xd2550)[0x7fd1bd228550]
[node-0:391803] [ 7] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0xd23ee)[0x7fd1bd2283ee]
[node-0:391803] [ 8] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0xd2204)[0x7fd1bd228204]
[node-0:391803] [ 9] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0xd1df7)[0x7fd1bd227df7]
[node-0:391803] [10] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0xd189b)[0x7fd1bd22789b]
[node-0:391803] [11] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common10Controller17ConstructResponseERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEi+0x1d20)[0x7fd1bd22cee4]
[node-0:391803] [12] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common10Controller19ComputeResponseListEbRNS0_18HorovodGlobalStateERNS0_10ProcessSetE+0x1c2f)[0x7fd1bd22a85d]
[node-0:391803] [13] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0x10365e)[0x7fd1bd25965e]
[node-0:391803] [14] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0x102e54)[0x7fd1bd258e54]
[node-0:391803] [15] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZSt13__invoke_implIvPFvRN7horovod6common18HorovodGlobalStateEEJSt17reference_wrapperIS2_EEET_St14__invoke_otherOT0_DpOT1_+0x39)[0x7fd1bd26a578]
[node-0:391803] [16] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZSt8__invokeIPFvRN7horovod6common18HorovodGlobalStateEEJSt17reference_wrapperIS2_EEENSt15__invoke_resultIT_JDpT0_EE4typeEOS9_DpOSA_+0x4e)[0x7fd1bd265d68]
[node-0:391803] [17] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNSt6thread8_InvokerISt5tupleIJPFvRN7horovod6common18HorovodGlobalStateEESt17reference_wrapperIS4_EEEE9_M_invokeIJLm0ELm1EEEEDTcl8__invokespcl10_S_declvalIXT_EEEEESt12_Index_tupleIJXspT_EEE+0x43)[0x7fd1bd27f9f9]
[node-0:391803] [18] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNSt6thread8_InvokerISt5tupleIJPFvRN7horovod6common18HorovodGlobalStateEESt17reference_wrapperIS4_EEEEclEv+0x2c)[0x7fd1bd27f99a]
[node-0:391803] [19] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJPFvRN7horovod6common18HorovodGlobalStateEESt17reference_wrapperIS5_EEEEEE6_M_runEv+0x1c)[0x7fd1bd27f91e]
[node-0:391803] [20] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd44c0)[0x7fcdd6e614c0]
[node-0:391803] [21] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7fd1dc3516db]
[node-0:391803] [22] /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7fd1dc68a61f]
[node-0:391803] *** End of error message ***
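Both traces above are long, so it can help to reduce them to frame number, library, and mangled symbol before comparing them. The following is a minimal sketch for that, not part of Horovod or Open MPI: `parse_frames` and `frames_matching` are hypothetical helper names, and the regular expression assumes the exact line layout seen in the traces above (`[host:pid] [ N] library(symbol+offset)[address]`).

```python
import re

# Assumed layout, taken from the traces above:
# [node-0:410078] [19] /path/lib.so(_ZN7horovod...+0x1cf0)[0x7f77d1970eb4]
FRAME_RE = re.compile(
    r"^\[(?P<host>[^:\]]+):(?P<pid>\d+)\]\s+"       # [host:pid]
    r"\[\s*(?P<index>\d+)\]\s+"                      # [ N] frame number
    r"(?P<library>[^(\s]+)"                          # path of the shared object
    r"\((?P<symbol>[^+)]*)\+(?P<offset>0x[0-9a-fA-F]+)\)"  # (symbol+offset)
    r"\[(?P<address>0x[0-9a-fA-F]+)\]\s*$"           # [address]
)


def parse_frames(trace):
    """Return one dict per backtrace frame; non-frame lines are skipped."""
    frames = []
    for line in trace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match is None:
            continue
        frame = match.groupdict()
        frame["index"] = int(frame["index"])
        frame["pid"] = int(frame["pid"])
        frames.append(frame)
    return frames


def frames_matching(frames, needle):
    """Return the frames whose (mangled) symbol contains `needle`."""
    return [frame for frame in frames if needle in frame["symbol"]]


# Three lines copied from the first trace above, plus its header line.
SAMPLE = """
[node-0:410078] *** Process received signal ***
[node-0:410078] [ 5] /lib/x86_64-linux-gnu/libc.so.6(cfree+0x58c)[0x7f77f0d44e9c]
[node-0:410078] [19] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common10Controller17ConstructResponseERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEi+0x1cf0)[0x7f77d1970eb4]
[node-0:410078] [21] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0x10365e)[0x7f77d199d65e]
"""

if __name__ == "__main__":
    for frame in frames_matching(parse_frames(SAMPLE), "Controller"):
        print(frame["index"], frame["symbol"], frame["offset"])
```

Run over the two full traces, a filter on `Controller` shows that both crashes go through `horovod::common::Controller::ConstructResponse`, called from `Controller::ComputeResponseList`. In the first trace the abort happens while a `std::vector<horovod::common::Request>` is freed during an `unordered_map` erase.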

What's wrong? Thanks for your help in advance!

mythZhu added the bug label Feb 5, 2024