Segmentation fault during training #642
Comments
Is it one of the examples in this repo? Can you share anything else? There shouldn't be a segfault from MLX, so if you are getting one it is a bug. But without more information it's almost impossible for us to debug, so anything you can share is appreciated.
Also useful if you can share the MLX version and platform (OS, machine, etc.).
Hi, I'm so sorry for the late response. You can see the example code in my mlx repository. I'm trying to implement Squeezeformer, referring to the code from https://github.com/upskyy/Squeezeformer. As far as I know, there is no depthwise convolution in MLX, so I tried to implement it as best I could (this also relates to #625). You can also see the PyTorch version in my GitHub, which is a simplified version of https://github.com/upskyy/Squeezeformer; you can run train.py there. As you can see, training in PyTorch (about 8 s per epoch) is quite a bit faster than in MLX (about 12 s per epoch). I compared the PyTorch depthwise convolution with my depthwise convolution implemented for MLX, and there was no difference in training speed. I am not sure that I implemented the code correctly. If you have any questions, please feel free to let me know.
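For readers unfamiliar with the operation being discussed: a depthwise convolution applies one filter per input channel, with channels convolved independently (no cross-channel mixing). The sketch below is a framework-agnostic NumPy illustration of the operation itself, not the poster's actual MLX implementation:

```python
import numpy as np

def depthwise_conv1d(x, w):
    """Depthwise 1-D convolution (valid padding, stride 1).

    x: (T, C) input sequence; w: (K, C) one length-K filter per channel.
    Each channel is correlated independently with its own filter.
    """
    T, C = x.shape
    K, C2 = w.shape
    assert C == C2, "one filter per channel"
    out = np.empty((T - K + 1, C))
    for t in range(T - K + 1):
        # elementwise product over the window, summed along the time axis,
        # keeps the C channels separate
        out[t] = np.sum(x[t:t + K] * w, axis=0)
    return out
```

A grouped convolution with `groups == channels` reduces to exactly this, which is one common way to emulate it in frameworks that support grouped but not depthwise convolutions.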
Oh, I use mlx 0.9.0 on a 16-inch M2 Max MacBook Pro.
Thanks for the code, that's great. I'm running it now. It's at epoch 60 so far... no segfault yet, let's see.
Indeed I'm not too surprised by that for this model. Training conv nets needs some optimization in MLX, and some of the features you are using will be a bit slow (like the depthwise conv, RNN, etc.). We need some time to add these features and optimize more. But this is a great benchmark to have, thanks!
I ran it for hundreds of epochs on both my M1 Max and an M2 Ultra and was not able to get a segfault. It may have been something we fixed in the latest MLX 🤔 . If you still see the segfault after the next release, please let us know.
Hm... do you use an external monitor? I have no idea...
@JKwon0331 we aren't able to reproduce the segfault. Could you share a bit more information:
@awni
Hi, I tried it several times.
@JKwon0331 I let it run overnight for several hundred epochs and didn't encounter it. Would it be possible to run it with the debugger attached so we can maybe get an idea of where it segfaults? Since we can't repro even after several hours, you will have to do some of the digging unfortunately. Step 1 would be to just run it with the debugger attached as is. Step 2, which may be a bit of a pain, would be to compile mlx in debug mode and run it with the debugger attached. If you can do step 2 and it segfaults, then we will know exactly where it happened in the code. Let me know if you need help with either of the above.
A tutorial for lldb (the LLVM debugger) can be found at https://lldb.llvm.org/use/tutorial.html. However, the simplest way to attach it would be the following steps
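The concrete steps did not survive in this copy of the thread; a rough transcript of what attaching lldb to a Python training script typically looks like (the `train.py` entry point is assumed from earlier comments) would be:

```shell
# Launch the training script under lldb
lldb -- python train.py
# Inside the lldb prompt, start the process
(lldb) run
# ...wait for the segfault, then print the native backtrace
(lldb) bt
```

With a release build of mlx the backtrace will show function names but not source lines; a debug build (discussed below in the thread) adds the exact line of code.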
Just in case, it might be simpler to try from a new Python environment first. Start a brand new environment, install the latest MLX, and try again to see if it segfaults.
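A minimal sketch of creating that fresh environment (the environment name `mlx-env` is arbitrary):

```shell
# Create and activate a brand new virtual environment
python -m venv mlx-env
source mlx-env/bin/activate
# Install the latest released MLX from PyPI
pip install -U mlx
```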
Hi, I'm so sorry for the late response. Following your recommendation, I tried a new Python environment: Python 3.11.8 and mlx 0.9.1. The attached figures are screenshots following the steps you mentioned.
Thanks, that is very helpful! There seems to be an issue in the function; I will look into it. Not tonight, but possibly tomorrow.
Sorry for the delayed response. I am trying to reproduce the bug locally, even though we know where it happens, and no luck yet. It happens from a reshape which calls copy, so I wrote a small fuzzer and ran several hundreds of thousands of different reshapes including transpositions, broadcasts, strided slices, etc., but nothing breaks unfortunately (or fortunately :-) ). This means I will have to trouble you a bit more, since it seems to be something that happens maybe only on your machine. If you can, build from source in debug mode:

```shell
$ cd mlx/
$ CMAKE_ARGS='-DCMAKE_BUILD_TYPE=Debug' pip install .
```

After that the backtraces will include the line of code where the problem is encountered. You can also just run the fuzzer https://gist.github.com/angeloskath/a3dc38b030c080ae5e4135f0125a94b2 to see if this causes the error on your setup.
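The actual fuzzer is in the linked gist; as a rough illustration of the idea it describes (generate random non-contiguous layouts via transposes and strided slices, reshape, and verify the element order of the resulting copy), a NumPy analogue might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

def fuzz_reshape_once():
    """One fuzz iteration: random layout, reshape, independent verification."""
    # random source shape with 1..4 dims of size 1..6
    shape = tuple(int(s) for s in rng.integers(1, 7, size=rng.integers(1, 5)))
    x = rng.standard_normal(shape)
    # random transpose plus a strided slice to make the layout non-contiguous
    x = x.transpose(tuple(rng.permutation(x.ndim)))
    if x.shape[-1] > 1:
        x = x[..., ::2]
    # reshape to 1-D forces a copy of the non-contiguous data
    flat = x.reshape(-1)
    # independent reference: walk the indices in C order
    ref = np.array([x[idx] for idx in np.ndindex(x.shape)])
    return np.array_equal(flat, ref)

assert all(fuzz_reshape_once() for _ in range(1000))
```

The MLX fuzzer stresses the library's own copy kernels; the NumPy version above only shows the verification pattern.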
Hi, I am so sorry about this, but I am not familiar with debugging. Could you let me know more details? You mean,
Before starting step 3, I can build from this source? Actually, `cd mlx/` does not work for me: no such file or directory: mlx. Thank you
When I ran the fuzzer, there was no error.
Sorry, |
I am sorry... I cannot find that directory. I tried "Go to Folder" in Finder, but I still have no idea what you meant.
I think he assumed that you had cloned the GitHub source and installed from there with `pip install .`
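A sketch of that assumed workflow, using the main MLX repository URL (adjust if a fork was meant):

```shell
# Clone the MLX source; this creates the mlx/ directory that
# the earlier `cd mlx/` instruction refers to
git clone https://github.com/ml-explore/mlx.git
cd mlx/
# Build and install in debug mode so backtraces include source lines
CMAKE_ARGS='-DCMAKE_BUILD_TYPE=Debug' pip install .
```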
Hi,
I'm having difficulty training the model.
I always hit a segmentation fault.
It occurs randomly, I mean at a random epoch.
For example, it occurred at epoch 65 in the picture below.
Sometimes it occurs at epoch 99 or 104, etc.
I know it is hard to figure out the reason from this short description.
However, could you let me know what I should suspect as the reason?