libc++abi: terminating due to uncaught exception of type std::runtime_error #664
Curious... what machine are you on? What OS?
The command runs fine for me with our default dataset:
@awni macOS 14.4.1, tested on Python 3.11 and Python 3.12; both have the same issue.
@awni I also encountered this issue when training Qwen/Qwen1.5-0.5B. Is it related to the memory footprint?
@awni, I tested your dataset. It runs successfully, but using my dataset triggers this issue.
@awni I tested gemma-1.1-2b-it and the issue still exists. Is it a bug in MPS?
I'm not sure... I've never seen that error before. It seems related to excessive resource use (e.g., OOM). Does it run if you use a smaller batch size?
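For example, a reduced-footprint run might look like the sketch below (the model name and data path are placeholders; substitute your own):

```bash
# Hypothetical model/data paths -- replace with your own.
# Lowering --batch-size and --lora-layers both reduce peak GPU memory.
python -m mlx_lm.lora \
  --model google/gemma-1.1-2b-it \
  --train \
  --data ./my_data \
  --batch-size 1 \
  --lora-layers 4
```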
@awni My Mac has enough memory to fine-tune Gemma 2b. However, this issue occurs randomly with Gemma and Qwen, but rarely with Mistral and Phi.
I'm running the command you shared.
@danny-su what version of MLX / MLX LM are you using:
If it's not the latest, please update and try again. So far I'm not able to reproduce the issue you shared... I will try on another machine.
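To check which versions are installed, assuming the packages were installed from PyPI under the names `mlx` and `mlx-lm`:

```bash
# Report the installed versions of both packages.
pip show mlx mlx-lm | grep -E "^(Name|Version)"
# Update to the latest releases.
pip install -U mlx mlx-lm
```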
@awni I used the latest version: 0.10.0. Here is the data I used to fine-tune Qwen1.5-1.8B and Qwen1.5-0.5B. Both failed, but I can fine-tune Mistral-7B-Instruct-v0.2 with data of the same size, so I suspect there may be bugs in the mlx_lm implementations for Qwen and Gemma.
Quantized or fp16?
@awni The original version downloaded from Hugging Face.
Wow, that is a really long sequence length:
I tried this on three different machines (M2 Ultra / M2 Mini / M1 Max) and I've not been able to reproduce the same issue. I recommend trying the following:
I will do some more digging about that error message... but without being able to reproduce it, it's difficult to help debug.
My data does not actually have sequences that long; I set the limit to a larger value because some outliers exceed the default of 2048.
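For what it's worth, a rough way to spot such outliers before training, using the character length of each JSONL line as a crude proxy for token count (the path is illustrative):

```bash
# Print the character length of the longest line; token counts will be
# smaller, but very long lines are a decent first signal of outlier samples.
awk '{ if (length($0) > max) max = length($0) } END { print "longest line:", max, "chars" }' ./my_data/train.jsonl
```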
Hi @danny-su sorry I cannot reproduce this issue. I tried on several machines with several models and the datasets you shared. As I understand it, the error you are seeing has to do with a command buffer (computation on the GPU) timing out. I am not entirely sure why that's happening for you, and since I can't reproduce the issue it will be difficult to diagnose on my end. Let me know if you are able to try some of the above steps and whether they resolve the issue or not. Also, if you are able to log the input data and share which data causes the program to crash, that might also help us reproduce it. It seems likely (though not certain) to be data-dependent.
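For example, one low-effort way to capture the crash context for a report, assuming a training invocation along the lines of the earlier sketch (`<model>` and `./my_data` are placeholders):

```bash
# Capture everything the run prints, including the final libc++abi message,
# so the failing step can be shared in a bug report.
python -m mlx_lm.lora --model <model> --train --data ./my_data 2>&1 | tee train.log
```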
The same error occurs in my environment when running inference with Stable Diffusion XL.
It seems to happen more often when free memory is low, so it may be related to swapping.
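One crude way to test the low-memory theory is to log swap usage in a second terminal while reproducing the error (`sysctl vm.swapusage` is a stock macOS command):

```bash
# Print swap usage every 10 seconds; a steady climb while the job runs
# would support the low-memory / swapping theory.
while true; do sysctl vm.swapusage; sleep 10; done
```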
@kaeru-shigure that is very odd 🤔, I assume you have a 16GB machine?
I have a 16GB MacBook Pro with an M2 Pro chip and encountered this exact error at 60 epochs when I increased the training data size, batch size, and number of layers. I reran it after restarting my computer, with less memory in use, and it reached 200 epochs before giving me the same error.
@GusLovesMath what were you running? Could you share the script / command?
I was running the command below. It worked when I reduced the GPU- and RAM-related parameters.
Follow-up:
Training Gemma raised the following exception; training Mistral did not.