Segmentation fault during training #642

Open
JKwon0331 opened this issue Mar 31, 2024 · 22 comments

Comments

@JKwon0331

Hi,

I'm having difficulty training the model: I keep hitting a segmentation fault.
It occurs at a random epoch; for example, in the screenshot below it happened at epoch 65.
[Screenshot 2024-03-30 at 8 23 14 PM]

Sometimes it occurs at epoch 99 or 104, etc.
I know it is hard to figure out the cause from such limited information, but could you let me know what I should suspect as the reason?

@awni
Member

awni commented Mar 31, 2024

Is it one of the examples in this repo? Can you share anything else? There shouldn't be a segfault from MLX, so if you are getting one it is a bug. But without more information it's almost impossible for us to debug, so anything you can share is appreciated.

@awni
Member

awni commented Mar 31, 2024

It would also be useful if you could share the MLX version and platform (OS, machine, etc.).

@JKwon0331
Author

Hi, I'm so sorry for the late response.
I uploaded the example code to my GitHub:
https://github.com/JKwon0331/mlx_test/tree/main

In the mlx folder you can find the example code. This is what I see when I run train.py:
[Screenshot 2024-03-31 at 1 54 01 AM]

I'm trying to implement Squeezeformer, referring to the code from https://github.com/upskyy/Squeezeformer.

As far as I know, MLX has no depthwise convolution, so I implemented one myself as best I could.
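(For illustration only: one simple way to emulate a depthwise 1-D convolution is to run mx.conv1d once per channel and concatenate the results. The sketch below assumes MLX's channels-last layout, i.e. (N, L, C) inputs, and is not necessarily the exact code in the linked repo.)

import mlx.core as mx

def depthwise_conv1d(x, w, stride=1, padding=0):
    # Depthwise 1-D convolution emulated with one mx.conv1d call per channel.
    # x: (N, L, C) input, w: (C, K, 1) filters, one length-K filter per channel.
    # Correct but slow; for illustration only.
    outs = []
    for c in range(x.shape[-1]):
        xc = x[..., c:c + 1]              # (N, L, 1)
        wc = w[c:c + 1]                   # (1, K, 1)
        outs.append(mx.conv1d(xc, wc, stride=stride, padding=padding))
    return mx.concatenate(outs, axis=-1)  # (N, L_out, C)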

Also, regarding #625: you might be able to see it if you uncomment lines 125 and 133 and comment out lines 126 and 134 of model.py.

Finally, my GitHub also has a PyTorch version, which is a simplified version of https://github.com/upskyy/Squeezeformer.

You can run its train.py as well.

As you can see, training in PyTorch (about 8 s per epoch) is quite a bit faster than in MLX (about 12 s per epoch).
[Screenshot 2024-03-31 at 1 58 56 AM]

I compared the PyTorch depthwise convolution with my depthwise convolution implemented for MLX, and there was no difference in training speed.

I am not sure that I implemented the code correctly.
However, I hope this is helpful for fixing the bug and improving the speed.

If you have any questions, please feel free to let me know.

@JKwon0331
Author

Oh, I use MLX 0.9.0 and a 16-inch M2 Max MacBook Pro.

@awni
Member

awni commented Mar 31, 2024

Thanks for the code, that's great. I'm running it now. It's at epoch 60 so far, no segfault yet; let's see.

@awni
Member

awni commented Mar 31, 2024

> As you can see, training in PyTorch (about 8 s per epoch) is quite a bit faster than in MLX (about 12 s per epoch).

Indeed, I'm not too surprised by that for this model. Training conv nets still needs some optimization in MLX, and some of the features you are using will be a bit slow (like the depthwise conv, RNN, etc.). We need some time to add these features and optimize further. But this is a great benchmark to have, thanks!

@awni
Member

awni commented Apr 2, 2024

I ran it for hundreds of epochs on both my M1 Max and an M2 Ultra and was not able to get a segfault. It may have been something we fixed in the latest MLX 🤔 . If you still see the segfault after the next release, please let us know.

@JKwon0331
Author

> I ran it for hundreds of epochs on both my M1 Max and an M2 Ultra and was not able to get a segfault. It may have been something we fixed in the latest MLX 🤔 . If you still see the segfault after the next release, please let us know.

Hi, for me it still occurs.
[Screenshot 2024-04-02 at 6 20 52 PM]

Hmm... do you use an external monitor?

I have no idea...

@awni
Member

awni commented Apr 3, 2024

@JKwon0331 we aren't able to reproduce the segfault. Could you share a bit more information:

  1. Operating system version
  2. Output of python -c "import mlx.core as mx; print(mx.__version__)"

@JKwon0331
Author

@awni
Here they are.

  1. macOS Sonoma 14.4
  2. 0.9.0

@JKwon0331
Author

@awni

Hi, I tried it several times.
Sometimes, it doesn't happen for hundreds of epochs.
However, sometimes, it happens in just a few epochs, as shown in the figure below.

[Screenshot 2024-04-03 at 11 55 32 PM]

@angeloskath
Member

@JKwon0331 I let it run overnight for several hundred epochs and didn't encounter it. Would it be possible to run it with the debugger attached, so we can maybe get an idea of where it segfaults?

Since we can't repro even after several hours, you will have to do some of the digging unfortunately. Step 1 would be to just run it with the debugger attached as is. Step 2, which may be a bit of a pain, would be to compile mlx in debug mode and run it with the debugger attached. If you can do step 2 and it segfaults, then we will know exactly where it happened in the code.

Let me know if you need help with either of the above.

@angeloskath
Member

A tutorial for lldb (LLVM debugger) can be found at https://lldb.llvm.org/use/tutorial.html .

However, the simplest way to attach it would be the following steps:

  • Get the PID of the training process; one way would be import os; print(os.getpid()) (see the snippet after this list)
  • In another terminal run sudo lldb -p <PID printed in the training run>
  • c + enter in the debugger to continue the training
  • Wait until it segfaults
  • bt all in the debugger to print all backtraces
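
For example, a minimal way to do step 1 is to add the following near the top of train.py (hypothetical placement) and leave it running:

import os

# Print the PID so you can attach lldb from another terminal:
#   sudo lldb -p <this PID>
print("training process PID:", os.getpid())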

Just in case, it might be simpler to try a new Python environment first. Start a brand new environment, install the latest MLX, and try again to see if it segfaults.

@JKwon0331
Author

JKwon0331 commented Apr 8, 2024

> A tutorial for lldb (LLVM debugger) can be found at https://lldb.llvm.org/use/tutorial.html .
>
> However the simplest way to attach it would be the following steps
>
>   • Get the PID of the training process, one way would be import os; print(os.getpid())
>   • In another terminal run sudo lldb -p <PID printed in the training run>
>   • c + enter in the debugger to continue the training
>   • Wait until it segfaults
>   • bt all in the debugger to print all backtraces
>
> Just in case, it might be simpler to try from a new python environment first. Start a brand new environment, install the latest MLX and try again to see if it segfaults.

Hi, I'm so sorry for the late response.

Following your recommendation, I tried a new Python environment (Python 3.11.8 and MLX 0.9.1).
Unfortunately, there was another issue: a bus error.

The attached figures are screenshots from following the steps you mentioned.
Please let me know if there is anything else I need to do.

[Screenshot 2024-04-07 at 10 41 51 PM]
[Screenshot 2024-04-07 at 10 40 56 PM]
[Screenshot 2024-04-07 at 10 41 09 PM]

@JKwon0331
Author

The attached screenshots are for the segmentation fault.

[Screenshot 2024-04-07 at 11 57 28 PM]
[Screenshot 2024-04-07 at 11 57 44 PM]
[Screenshot 2024-04-07 at 11 57 53 PM]

@angeloskath
Member

Thanks, that is very helpful!

There seems to be an issue in the function collapse_contiguous_dims, which is used internally to route to a better kernel when possible. It is still very weird that this only happens on your machine, but we are closer to figuring it out. It is great that it happens from a reshape, because reshape is the simplest way to call collapse_contiguous_dims, so I have some chance of brute-force replicating it.

I will look into it. Not tonight but possibly tomorrow.
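
(For context, "collapsing contiguous dims" is the standard trick of merging adjacent dimensions whose strides line up, so a copy can be dispatched over fewer dimensions. The following is a rough Python sketch of the general idea, purely for illustration; it is not the actual MLX C++ routine.)

def collapse_contiguous_dims(shape, strides):
    # Merge adjacent dims (outer, inner) whenever
    # outer_stride == inner_stride * inner_size, i.e. the outer dim steps
    # exactly over the inner one in memory. Illustrative sketch only.
    if not shape:
        return [], []
    new_shape, new_strides = [shape[0]], [strides[0]]
    for dim, stride in zip(shape[1:], strides[1:]):
        if new_strides[-1] == stride * dim:
            new_shape[-1] *= dim      # fold this dim into the previous one
            new_strides[-1] = stride
        else:
            new_shape.append(dim)
            new_strides.append(stride)
    return new_shape, new_strides

# A contiguous (2, 3, 4) array with strides (12, 4, 1) collapses to ([24], [1]).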

@angeloskath
Member

Sorry for the delayed response. I have been trying to reproduce the bug locally, even though we know where it happens, and no luck yet. It happens from a reshape which calls copy, so I wrote a small fuzzer and ran several hundreds of thousands of different reshapes, including transpositions, broadcasts, strided slices, etc., but nothing breaks, unfortunately (or fortunately :-) ).

This means I will have to trouble you a bit more, since it seems to be something that maybe only happens on your machine. If you can, build from source in debug mode:

$ cd mlx/
$ CMAKE_ARGS='-DCMAKE_BUILD_TYPE=Debug' pip install .

After that, the backtraces will include the line of code where the problem is encountered. You can also just run the fuzzer at https://gist.github.com/angeloskath/a3dc38b030c080ae5e4135f0125a94b2 to see if it causes the error on your setup.
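
(Roughly the kind of thing the fuzzer does; this is an illustrative reconstruction with arbitrary shapes and iteration counts, not the exact script from the gist.)

import random
import mlx.core as mx

def fuzz_step():
    # Build a small random array, optionally make it non-contiguous,
    # then reshape and force evaluation to exercise the copy path.
    ndim = random.randint(2, 5)
    shape = [random.randint(1, 8) for _ in range(ndim)]
    x = mx.random.normal(shape)

    op = random.choice(["transpose", "slice", "broadcast", "none"])
    if op == "transpose":
        axes = list(range(ndim))
        random.shuffle(axes)
        x = mx.transpose(x, axes)
    elif op == "slice":
        x = x[..., ::2]
    elif op == "broadcast":
        x = mx.broadcast_to(x[..., None], (*x.shape, 3))

    y = x.reshape(-1)  # reshaping a non-contiguous array triggers a copy
    mx.eval(y)

for _ in range(100_000):
    fuzz_step()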

@JKwon0331
Author

JKwon0331 commented Apr 16, 2024

Hi, I am so sorry about this, but I am not familiar with debugging; could you give me more details?

You mean,

  1. Get the PID of the training process, one way would be import os; print(os.getpid())
  2. In another terminal run sudo lldb -p <PID printed in the training run>
  3. c + enter in the debugger to continue the training
  4. Wait until it segfaults
  5. bt all in the debugger to print all backtraces

Before starting step 3, should I build from source like this?
$ cd mlx/
$ CMAKE_ARGS='-DCMAKE_BUILD_TYPE=Debug' pip install .

Actually, cd mlx/ does not work for me; I get "no such file or directory: mlx".

Thank you

@JKwon0331
Author

When I ran the fuzzer, there was no error.

@angeloskath
Member

Sorry, cd mlx/ was meant to be cd your/local/path/to/mlx/source.

@JKwon0331
Author

I am sorry... I cannot find that directory. I tried "Go to Folder" in Finder, but I still have no idea what you meant.

@jrp2014

jrp2014 commented May 3, 2024

I think he assumed that you had cloned the MLX source from GitHub and were installing from there with pip install .
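
For example (assuming the MLX source is the ml-explore/mlx repository on GitHub and you want the debug build installed into your current environment):

$ git clone https://github.com/ml-explore/mlx.git
$ cd mlx
$ CMAKE_ARGS='-DCMAKE_BUILD_TYPE=Debug' pip install .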
