Segmentation fault during training #642

Open
JKwon0331 opened this issue Mar 31, 2024 · 22 comments

Comments

@JKwon0331

Hi,

I'm having difficulty training the model: I keep hitting a segmentation fault.
It occurs at a random epoch; for example, in the screenshot below it happened at epoch 65.
[Screenshot 2024-03-30 at 8 23 14 PM]

Sometimes it occurs at epoch 99 or 104, etc.
I know it is hard to figure out the cause from such limited information, but could you let me know what I should suspect as the reason?

@awni
Member

awni commented Mar 31, 2024

Is it one of the examples in this repo? Can you share anything else? There shouldn't be a segfault from MLX, so if you are getting one it is a bug. But without more information it's almost impossible for us to debug, so anything you can share is appreciated.

@awni
Member

awni commented Mar 31, 2024

It would also be useful if you could share the MLX version and platform (OS, machine, etc.).

@JKwon0331
Author

Hi, I'm so sorry for the late response.
I uploaded the example code to my GitHub:
https://github.com/JKwon0331/mlx_test/tree/main

In the mlx folder you can find the example code. This is what I see when I run train.py:
[Screenshot 2024-03-31 at 1 54 01 AM]

I'm trying to implement Squeezeformer, referring to the code from https://github.com/upskyy/Squeezeformer.

As far as I know, MLX has no depthwise convolution, so I implemented one myself as best I could.
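(For illustration only: one simple way to emulate a depthwise 1-D convolution is to run mx.conv1d once per channel and concatenate the results. The sketch below assumes MLX's channels-last layout, i.e. (N, L, C) inputs, and is not necessarily the exact code in the linked repo.)

import mlx.core as mx

def depthwise_conv1d(x, w, stride=1, padding=0):
    # Depthwise 1-D convolution emulated with one mx.conv1d call per channel.
    # x: (N, L, C) input, w: (C, K, 1) filters, one length-K filter per channel.
    # Correct but slow; for illustration only.
    outs = []
    for c in range(x.shape[-1]):
        xc = x[..., c:c + 1]              # (N, L, 1)
        wc = w[c:c + 1]                   # (1, K, 1)
        outs.append(mx.conv1d(xc, wc, stride=stride, padding=padding))
    return mx.concatenate(outs, axis=-1)  # (N, L_out, C)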

Also, regarding #625: you might be able to see it if you uncomment lines 125 and 133 and comment out lines 126 and 134 of model.py.

Finally, my GitHub also has a PyTorch version, which is a simplified version of https://github.com/upskyy/Squeezeformer.

You can run its train.py as well.

As you can see, training in PyTorch (about 8 s per epoch) is quite a bit faster than in MLX (about 12 s per epoch).
[Screenshot 2024-03-31 at 1 58 56 AM]

I compared the PyTorch depthwise convolution with my depthwise convolution implemented for MLX, and there was no difference in training speed.

I am not sure that I implemented the code correctly.
However, I hope this is helpful for fixing the bug and improving the speed.

If you have any questions, please feel free to let me know.

@JKwon0331
Author

Oh, I use MLX 0.9.0 and a 16-inch M2 Max MacBook Pro.

@awni
Member

awni commented Mar 31, 2024

Thanks for the code, that's great. I'm running it now. It's at epoch 60 so far, no segfault yet; let's see.

@awni
Member

awni commented Mar 31, 2024

> As you can see, training in PyTorch (about 8 s per epoch) is quite a bit faster than in MLX (about 12 s per epoch).

Indeed, I'm not too surprised by that for this model. Training conv nets still needs some optimization in MLX, and some of the features you are using will be a bit slow (like the depthwise conv, RNN, etc.). We need some time to add these features and optimize further. But this is a great benchmark to have, thanks!

@awni
Member

awni commented Apr 2, 2024

I ran it for hundreds of epochs on both my M1 Max and an M2 Ultra and was not able to get a segfault. It may have been something we fixed in the latest MLX 🤔 . If you still see the segfault after the next release, please let us know.

@JKwon0331
Author

> I ran it for hundreds of epochs on both my M1 Max and an M2 Ultra and was not able to get a segfault. It may have been something we fixed in the latest MLX 🤔 . If you still see the segfault after the next release, please let us know.

Hi, for me it still occurs.
[Screenshot 2024-04-02 at 6 20 52 PM]

Hmm... do you use an external monitor?

I have no idea...

@awni
Member

awni commented Apr 3, 2024

@JKwon0331 we aren't able to reproduce the segfault. Could you share a bit more information:

  1. Operating system version
  2. Output of python -c "import mlx.core as mx; print(mx.__version__)"

@JKwon0331
Author

@awni
Here they are.

  1. macOS Sonoma 14.4
  2. 0.9.0

@JKwon0331
Author

@awni

Hi, I tried it several times.
Sometimes, it doesn't happen for hundreds of epochs.
However, sometimes, it happens in just a few epochs, as shown in the figure below.

[Screenshot 2024-04-03 at 11 55 32 PM]

@angeloskath
Member

@JKwon0331 I let it run overnight for several hundred epochs and didn't encounter it. Would it be possible to run it with the debugger attached, so we can maybe get an idea of where it segfaults?

Since we can't repro even after several hours, you will have to do some of the digging unfortunately. Step 1 would be to just run it with the debugger attached as is. Step 2, which may be a bit of a pain, would be to compile mlx in debug mode and run it with the debugger attached. If you can do step 2 and it segfaults, then we will know exactly where it happened in the code.

Let me know if you need help with either of the above.

@angeloskath
Member

A tutorial for lldb (LLVM debugger) can be found at https://lldb.llvm.org/use/tutorial.html .

However, the simplest way to attach it would be the following steps:

  • Get the PID of the training process; one way would be import os; print(os.getpid()) (see the snippet after this list)
  • In another terminal run sudo lldb -p <PID printed in the training run>
  • c + enter in the debugger to continue the training
  • Wait until it segfaults
  • bt all in the debugger to print all backtraces
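
For example, a minimal way to do step 1 is to add the following near the top of train.py (hypothetical placement) and leave it running:

import os

# Print the PID so you can attach lldb from another terminal:
#   sudo lldb -p <this PID>
print("training process PID:", os.getpid())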

Just in case, it might be simpler to try a new Python environment first. Start a brand new environment, install the latest MLX, and try again to see if it segfaults.

@JKwon0331
Author

JKwon0331 commented Apr 8, 2024

> A tutorial for lldb (LLVM debugger) can be found at https://lldb.llvm.org/use/tutorial.html .
>
> However the simplest way to attach it would be the following steps
>
>   • Get the PID of the training process, one way would be import os; print(os.getpid())
>   • In another terminal run sudo lldb -p <PID printed in the training run>
>   • c + enter in the debugger to continue the training
>   • Wait until it segfaults
>   • bt all in the debugger to print all backtraces
>
> Just in case, it might be simpler to try from a new python environment first. Start a brand new environment, install the latest MLX and try again to see if it segfaults.

Hi, I'm so sorry for the late response.

Following your recommendation, I tried a new Python environment (Python 3.11.8 and MLX 0.9.1).
Unfortunately, there was another issue: a bus error.

The attached figures are screenshots from following the steps you mentioned.
Please let me know if there is anything else I need to do.

[Screenshot 2024-04-07 at 10 41 51 PM]
[Screenshot 2024-04-07 at 10 40 56 PM]
[Screenshot 2024-04-07 at 10 41 09 PM]

@JKwon0331
Author

The attached screenshots are for the segmentation fault.

[Screenshot 2024-04-07 at 11 57 28 PM]
[Screenshot 2024-04-07 at 11 57 44 PM]
[Screenshot 2024-04-07 at 11 57 53 PM]

@angeloskath
Member

Thanks, that is very helpful!

There seems to be an issue in the function collapse_contiguous_dims, which is used internally to route to a better kernel when possible. It is still very weird that this only happens on your machine, but we are closer to figuring it out. It is great that it happens from a reshape, because reshape is the simplest way to call collapse_contiguous_dims, so I have some chance of brute-force replicating it.

I will look into it. Not tonight but possibly tomorrow.
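
(For context, "collapsing contiguous dims" is the standard trick of merging adjacent dimensions whose strides line up, so a copy can be dispatched over fewer dimensions. The following is a rough Python sketch of the general idea, purely for illustration; it is not the actual MLX C++ routine.)

def collapse_contiguous_dims(shape, strides):
    # Merge adjacent dims (outer, inner) whenever
    # outer_stride == inner_stride * inner_size, i.e. the outer dim steps
    # exactly over the inner one in memory. Illustrative sketch only.
    if not shape:
        return [], []
    new_shape, new_strides = [shape[0]], [strides[0]]
    for dim, stride in zip(shape[1:], strides[1:]):
        if new_strides[-1] == stride * dim:
            new_shape[-1] *= dim      # fold this dim into the previous one
            new_strides[-1] = stride
        else:
            new_shape.append(dim)
            new_strides.append(stride)
    return new_shape, new_strides

# A contiguous (2, 3, 4) array with strides (12, 4, 1) collapses to ([24], [1]).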

@angeloskath
Member

Sorry for the delayed response. I have been trying to reproduce the bug locally, even though we know where it happens, and no luck yet. It happens from a reshape which calls copy, so I wrote a small fuzzer and ran several hundreds of thousands of different reshapes, including transpositions, broadcasts, strided slices, etc., but nothing breaks, unfortunately (or fortunately :-) ).

This means I will have to trouble you a bit more, since it seems to be something that maybe only happens on your machine. If you can, build from source in debug mode:

$ cd mlx/
$ CMAKE_ARGS='-DCMAKE_BUILD_TYPE=Debug' pip install .

After that, the backtraces will include the line of code where the problem is encountered. You can also just run the fuzzer at https://gist.github.com/angeloskath/a3dc38b030c080ae5e4135f0125a94b2 to see if it causes the error on your setup.
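
(Roughly the kind of thing the fuzzer does; this is an illustrative reconstruction with arbitrary shapes and iteration counts, not the exact script from the gist.)

import random
import mlx.core as mx

def fuzz_step():
    # Build a small random array, optionally make it non-contiguous,
    # then reshape and force evaluation to exercise the copy path.
    ndim = random.randint(2, 5)
    shape = [random.randint(1, 8) for _ in range(ndim)]
    x = mx.random.normal(shape)

    op = random.choice(["transpose", "slice", "broadcast", "none"])
    if op == "transpose":
        axes = list(range(ndim))
        random.shuffle(axes)
        x = mx.transpose(x, axes)
    elif op == "slice":
        x = x[..., ::2]
    elif op == "broadcast":
        x = mx.broadcast_to(x[..., None], (*x.shape, 3))

    y = x.reshape(-1)  # reshaping a non-contiguous array triggers a copy
    mx.eval(y)

for _ in range(100_000):
    fuzz_step()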

@JKwon0331
Author

JKwon0331 commented Apr 16, 2024

Hi, I am so sorry about this, but I am not familiar with debugging; could you give me more details?

You mean,

  1. Get the PID of the training process, one way would be import os; print(os.getpid())
  2. In another terminal run sudo lldb -p <PID printed in the training run>
  3. c + enter in the debugger to continue the training
  4. Wait until it segfaults
  5. bt all in the debugger to print all backtraces

Before starting step 3, should I build from source like this?
$ cd mlx/
$ CMAKE_ARGS='-DCMAKE_BUILD_TYPE=Debug' pip install .

Actually, cd mlx/ does not work for me; I get "no such file or directory: mlx".

Thank you

@JKwon0331
Author

When I ran the fuzzer, there was no error.

@angeloskath
Member

Sorry, cd mlx/ was meant to be cd your/local/path/to/mlx/source.

@JKwon0331
Author

I am sorry... I cannot find that directory. I tried "Go to Folder" in Finder, but I still have no idea what you meant.

@jrp2014

jrp2014 commented May 3, 2024

I think he assumed that you had cloned the MLX source from GitHub and were installing from there with pip install .
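
For example (assuming the MLX source is the ml-explore/mlx repository on GitHub and you want the debug build installed into your current environment):

$ git clone https://github.com/ml-explore/mlx.git
$ cd mlx
$ CMAKE_ARGS='-DCMAKE_BUILD_TYPE=Debug' pip install .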
