libc++abi: terminating due to uncaught exception of type std::runtime_error #664

Open
danny-su opened this issue Apr 8, 2024 · 24 comments

danny-su commented Apr 8, 2024

Training Gemma encountered the following exception, but training Mistral did not.

python -m mlx_lm.lora \
    --model google/gemma-2b-it \
    --train \
    --data /Users/danny/mlx_demo/data \
    --iters 600 --adapter-path /Users/danny/mlx_demo/models/gemma
Loading pretrained model
Fetching 9 files: 100%|████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 112347.43it/s]
Trainable parameters: 0.033% (0.819M/2506.172M)
Loading datasets
Training
Starting training..., iters: 600
libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Internal Error (0000000e:Internal Error)
[1]    72864 abort      python -m mlx_lm.lora --model google/gemma-2b-it --train --data  --iters 600
/opt/homebrew/Cellar/python@3.12/3.12.2_1/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Package versions:
certifi==2024.2.2
charset-normalizer==3.3.2
filelock==3.13.3
fsspec==2024.3.1
huggingface-hub==0.22.2
idna==3.6
jinja2==3.1.3
markupsafe==2.1.5
mlx==0.9.1
mlx-lm==0.7.0
mpmath==1.3.0
networkx==3.3
numpy==1.26.4
packaging==24.0
protobuf==5.26.1
pyyaml==6.0.1
regex==2023.12.25
requests==2.31.0
safetensors==0.4.2
sympy==1.12
tokenizers==0.15.2
torch==2.2.2
tqdm==4.66.2
transformers==4.39.3
typing-extensions==4.11.0
urllib3==2.2.1
awni (Member) commented Apr 8, 2024

Curious... what machine are you on? What OS?

awni (Member) commented Apr 8, 2024

The command runs fine for me with our default dataset:

python -m mlx_lm.lora \
    --model google/gemma-2b-it \
    --train \
    --data ../lora/data \
    --iters 600 --adapter-path . 

danny-su (Author) commented Apr 9, 2024

Curious... what machine are you on? What OS?

@awni macOS 14.4.1; tested on both Python 3.11 and Python 3.12, with the same issue.

(screenshot attached)

danny-su (Author) commented Apr 9, 2024

@awni I also encountered this issue when training Qwen/Qwen1.5-0.5B. Is it related to memory footprint?

(screenshot attached)

danny-su (Author) commented Apr 9, 2024

@awni, I tested your dataset and it runs successfully, but my dataset triggers this issue.
I uploaded my dataset so you can give it a try:
data.zip
PS: The training process is very slow, and the GPU usage is at its peak most of the time.

python -m mlx_lm.lora \
    --model Qwen/Qwen1.5-0.5B \
    --train \
    --data /Users/danny/mlx_demo/data \
    --iters 600 --adapter-path /Users/danny/mlx_demo/models/Qwen
Loading pretrained model
Fetching 7 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 95948.13it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Trainable parameters: 0.085% (0.524M/619.570M)
Loading datasets
Training
Starting training..., iters: 600
Iter 1: Val loss 1.306, Val took 62.408s
Iter 10: Train loss 1.171, Learning Rate 1.000e-05, It/sec 0.377, Tokens/sec 1384.743, Trained Tokens 36717, Peak mem 18.499 GB
Iter 20: Train loss 0.758, Learning Rate 1.000e-05, It/sec 0.302, Tokens/sec 1112.727, Trained Tokens 73579, Peak mem 18.499 GB
libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Internal Error (0000000e:Internal Error)
[1]    24608 abort      python -m mlx_lm.lora --model Qwen/Qwen1.5-0.5B --train --data  --iters 600
/opt/homebrew/Cellar/python@3.12/3.12.2_1/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
python -m mlx_lm.lora \
    --model Qwen/Qwen1.5-0.5B \
    --train \
    --data /Users/danny/Downloads/mlx-examples/lora/data \
    --iters 600 --adapter-path /Users/danny/Downloads/llm/data/models/Qwen
Loading pretrained model
Fetching 7 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 166818.91it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Trainable parameters: 0.085% (0.524M/619.570M)
Loading datasets
Training
Starting training..., iters: 600
Iter 1: Val loss 2.529, Val took 2.059s
Iter 10: Train loss 2.268, Learning Rate 1.000e-05, It/sec 6.197, Tokens/sec 2288.035, Trained Tokens 3692, Peak mem 3.544 GB
Iter 20: Train loss 1.809, Learning Rate 1.000e-05, It/sec 5.394, Tokens/sec 1998.107, Trained Tokens 7396, Peak mem 3.544 GB
Iter 30: Train loss 1.472, Learning Rate 1.000e-05, It/sec 5.675, Tokens/sec 1989.779, Trained Tokens 10902, Peak mem 3.544 GB
Iter 40: Train loss 1.441, Learning Rate 1.000e-05, It/sec 5.613, Tokens/sec 1956.271, Trained Tokens 14387, Peak mem 3.544 GB
Iter 50: Train loss 1.207, Learning Rate 1.000e-05, It/sec 6.172, Tokens/sec 2090.439, Trained Tokens 17774, Peak mem 3.544 GB
Iter 60: Train loss 1.356, Learning Rate 1.000e-05, It/sec 5.851, Tokens/sec 2034.451, Trained Tokens 21251, Peak mem 3.544 GB
Iter 70: Train loss 1.270, Learning Rate 1.000e-05, It/sec 4.538, Tokens/sec 1807.524, Trained Tokens 25234, Peak mem 4.089 GB
Iter 80: Train loss 1.293, Learning Rate 1.000e-05, It/sec 5.466, Tokens/sec 2000.177, Trained Tokens 28893, Peak mem 4.089 GB
Iter 90: Train loss 1.263, Learning Rate 1.000e-05, It/sec 6.040, Tokens/sec 2073.083, Trained Tokens 32325, Peak mem 4.089 GB
Iter 100: Train loss 1.245, Learning Rate 1.000e-05, It/sec 5.272, Tokens/sec 1945.532, Trained Tokens 36015, Peak mem 4.089 GB
Iter 100: Saved adapter weights to /Users/danny/Downloads/llm/data/models/Qwen/adapters.safetensors and /Users/danny/Downloads/llm/data/models/Qwen/0000100_adapters.safetensors.
Iter 110: Train loss 1.263, Learning Rate 1.000e-05, It/sec 4.132, Tokens/sec 1538.110, Trained Tokens 39737, Peak mem 4.777 GB
Iter 120: Train loss 1.141, Learning Rate 1.000e-05, It/sec 5.657, Tokens/sec 1918.372, Trained Tokens 43128, Peak mem 4.777 GB
Iter 130: Train loss 1.189, Learning Rate 1.000e-05, It/sec 3.227, Tokens/sec 1249.486, Trained Tokens 47000, Peak mem 4.777 GB
Iter 140: Train loss 1.321, Learning Rate 1.000e-05, It/sec 3.101, Tokens/sec 1098.420, Trained Tokens 50542, Peak mem 4.777 GB
Iter 150: Train loss 1.185, Learning Rate 1.000e-05, It/sec 5.003, Tokens/sec 1690.969, Trained Tokens 53922, Peak mem 4.777 GB
Iter 160: Train loss 1.301, Learning Rate 1.000e-05, It/sec 3.979, Tokens/sec 1434.800, Trained Tokens 57528, Peak mem 4.777 GB
Iter 170: Train loss 1.281, Learning Rate 1.000e-05, It/sec 1.569, Tokens/sec 621.018, Trained Tokens 61486, Peak mem 4.777 GB
Iter 180: Train loss 1.102, Learning Rate 1.000e-05, It/sec 3.781, Tokens/sec 1222.717, Trained Tokens 64720, Peak mem 4.777 GB
Iter 190: Train loss 1.216, Learning Rate 1.000e-05, It/sec 5.807, Tokens/sec 1999.907, Trained Tokens 68164, Peak mem 4.777 GB
Iter 200: Train loss 1.060, Learning Rate 1.000e-05, It/sec 5.666, Tokens/sec 1968.231, Trained Tokens 71638, Peak mem 4.777 GB
Iter 200: Val loss 1.300, Val took 3.507s
Iter 200: Saved adapter weights to /Users/danny/Downloads/llm/data/models/Qwen/adapters.safetensors and /Users/danny/Downloads/llm/data/models/Qwen/0000200_adapters.safetensors.
Iter 210: Train loss 1.125, Learning Rate 1.000e-05, It/sec 3.213, Tokens/sec 1177.234, Trained Tokens 75302, Peak mem 4.777 GB
Iter 220: Train loss 1.224, Learning Rate 1.000e-05, It/sec 3.143, Tokens/sec 1142.332, Trained Tokens 78937, Peak mem 4.777 GB
Iter 230: Train loss 1.167, Learning Rate 1.000e-05, It/sec 3.375, Tokens/sec 1286.044, Trained Tokens 82747, Peak mem 4.777 GB
Iter 240: Train loss 1.161, Learning Rate 1.000e-05, It/sec 5.189, Tokens/sec 1778.210, Trained Tokens 86174, Peak mem 4.777 GB
Iter 250: Train loss 1.150, Learning Rate 1.000e-05, It/sec 6.528, Tokens/sec 2210.498, Trained Tokens 89560, Peak mem 4.777 GB
Iter 260: Train loss 1.206, Learning Rate 1.000e-05, It/sec 6.279, Tokens/sec 2193.992, Trained Tokens 93054, Peak mem 4.777 GB
Iter 270: Train loss 1.121, Learning Rate 1.000e-05, It/sec 6.113, Tokens/sec 2231.362, Trained Tokens 96704, Peak mem 4.777 GB
Iter 280: Train loss 1.055, Learning Rate 1.000e-05, It/sec 3.694, Tokens/sec 1397.752, Trained Tokens 100488, Peak mem 4.777 GB
Iter 290: Train loss 1.018, Learning Rate 1.000e-05, It/sec 5.504, Tokens/sec 1793.900, Trained Tokens 103747, Peak mem 4.777 GB
Iter 300: Train loss 1.125, Learning Rate 1.000e-05, It/sec 5.693, Tokens/sec 2138.719, Trained Tokens 107504, Peak mem 4.777 GB
Iter 300: Saved adapter weights to /Users/danny/Downloads/llm/data/models/Qwen/adapters.safetensors and /Users/danny/Downloads/llm/data/models/Qwen/0000300_adapters.safetensors.
Iter 310: Train loss 1.095, Learning Rate 1.000e-05, It/sec 6.214, Tokens/sec 2163.038, Trained Tokens 110985, Peak mem 4.777 GB
Iter 320: Train loss 1.082, Learning Rate 1.000e-05, It/sec 6.020, Tokens/sec 2205.861, Trained Tokens 114649, Peak mem 4.777 GB
Iter 330: Train loss 1.087, Learning Rate 1.000e-05, It/sec 6.105, Tokens/sec 2202.581, Trained Tokens 118257, Peak mem 4.777 GB
Iter 340: Train loss 1.222, Learning Rate 1.000e-05, It/sec 3.726, Tokens/sec 1358.650, Trained Tokens 121903, Peak mem 4.777 GB
Iter 350: Train loss 1.045, Learning Rate 1.000e-05, It/sec 6.062, Tokens/sec 2059.877, Trained Tokens 125301, Peak mem 4.777 GB
Iter 360: Train loss 1.187, Learning Rate 1.000e-05, It/sec 3.849, Tokens/sec 1458.039, Trained Tokens 129089, Peak mem 4.777 GB
Iter 370: Train loss 1.030, Learning Rate 1.000e-05, It/sec 6.032, Tokens/sec 2036.418, Trained Tokens 132465, Peak mem 4.777 GB
Iter 380: Train loss 1.045, Learning Rate 1.000e-05, It/sec 4.722, Tokens/sec 1687.298, Trained Tokens 136038, Peak mem 4.777 GB
Iter 390: Train loss 0.967, Learning Rate 1.000e-05, It/sec 4.942, Tokens/sec 1688.123, Trained Tokens 139454, Peak mem 4.777 GB
Iter 400: Train loss 1.029, Learning Rate 1.000e-05, It/sec 3.888, Tokens/sec 1367.146, Trained Tokens 142970, Peak mem 4.777 GB
Iter 400: Val loss 1.239, Val took 2.315s
Iter 400: Saved adapter weights to /Users/danny/Downloads/llm/data/models/Qwen/adapters.safetensors and /Users/danny/Downloads/llm/data/models/Qwen/0000400_adapters.safetensors.
Iter 410: Train loss 1.108, Learning Rate 1.000e-05, It/sec 5.109, Tokens/sec 1980.925, Trained Tokens 146847, Peak mem 4.777 GB
Iter 420: Train loss 0.981, Learning Rate 1.000e-05, It/sec 6.454, Tokens/sec 2187.782, Trained Tokens 150237, Peak mem 4.777 GB
Iter 430: Train loss 1.110, Learning Rate 1.000e-05, It/sec 5.464, Tokens/sec 1972.631, Trained Tokens 153847, Peak mem 4.777 GB
Iter 440: Train loss 1.012, Learning Rate 1.000e-05, It/sec 4.878, Tokens/sec 1682.292, Trained Tokens 157296, Peak mem 4.777 GB
Iter 450: Train loss 0.937, Learning Rate 1.000e-05, It/sec 6.140, Tokens/sec 2059.229, Trained Tokens 160650, Peak mem 4.777 GB
Iter 460: Train loss 0.970, Learning Rate 1.000e-05, It/sec 4.933, Tokens/sec 1775.521, Trained Tokens 164249, Peak mem 4.777 GB
Iter 470: Train loss 1.060, Learning Rate 1.000e-05, It/sec 6.227, Tokens/sec 2154.638, Trained Tokens 167709, Peak mem 4.777 GB
Iter 480: Train loss 1.112, Learning Rate 1.000e-05, It/sec 3.563, Tokens/sec 1397.810, Trained Tokens 171632, Peak mem 4.777 GB
Iter 490: Train loss 1.004, Learning Rate 1.000e-05, It/sec 5.196, Tokens/sec 1934.301, Trained Tokens 175355, Peak mem 4.777 GB
Iter 500: Train loss 0.972, Learning Rate 1.000e-05, It/sec 3.898, Tokens/sec 1467.423, Trained Tokens 179120, Peak mem 4.777 GB
Iter 500: Saved adapter weights to /Users/danny/Downloads/llm/data/models/Qwen/adapters.safetensors and /Users/danny/Downloads/llm/data/models/Qwen/0000500_adapters.safetensors.
Iter 510: Train loss 1.158, Learning Rate 1.000e-05, It/sec 5.176, Tokens/sec 2116.396, Trained Tokens 183209, Peak mem 4.777 GB
Iter 520: Train loss 1.022, Learning Rate 1.000e-05, It/sec 3.405, Tokens/sec 1211.972, Trained Tokens 186768, Peak mem 4.777 GB
Iter 530: Train loss 0.885, Learning Rate 1.000e-05, It/sec 6.226, Tokens/sec 2104.546, Trained Tokens 190148, Peak mem 4.777 GB
Iter 540: Train loss 0.954, Learning Rate 1.000e-05, It/sec 3.651, Tokens/sec 1330.368, Trained Tokens 193792, Peak mem 4.777 GB
Iter 550: Train loss 1.042, Learning Rate 1.000e-05, It/sec 3.709, Tokens/sec 1342.178, Trained Tokens 197411, Peak mem 4.777 GB
Iter 560: Train loss 0.974, Learning Rate 1.000e-05, It/sec 4.469, Tokens/sec 1564.471, Trained Tokens 200912, Peak mem 4.777 GB
Iter 570: Train loss 0.959, Learning Rate 1.000e-05, It/sec 5.560, Tokens/sec 1802.950, Trained Tokens 204155, Peak mem 4.777 GB
Iter 580: Train loss 1.070, Learning Rate 1.000e-05, It/sec 6.116, Tokens/sec 2150.402, Trained Tokens 207671, Peak mem 4.777 GB
Iter 590: Train loss 0.847, Learning Rate 1.000e-05, It/sec 6.688, Tokens/sec 2119.963, Trained Tokens 210841, Peak mem 4.777 GB
Iter 600: Train loss 0.957, Learning Rate 1.000e-05, It/sec 3.137, Tokens/sec 1173.311, Trained Tokens 214581, Peak mem 4.777 GB
Iter 600: Val loss 1.226, Val took 3.087s
Iter 600: Saved adapter weights to /Users/danny/Downloads/llm/data/models/Qwen/adapters.safetensors and /Users/danny/Downloads/llm/data/models/Qwen/0000600_adapters.safetensors.
Saved final adapter weights to /Users/danny/Downloads/llm/data/models/Qwen/adapters.safetensors.
(screenshot attached)

danny-su (Author) commented:

@awni I tested gemma-1.1-2b-it and the issue still exists. Is it a bug in MPS?

python -m mlx_lm.lora \
    --model google/gemma-1.1-2b-it \
    --train \
    --data  data/Gemma \
    --iters 300 --adapter-path adapters/gemma-1.1-2b-it
Loading pretrained model
Fetching 9 files: 100%|██████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 183246.29it/s]
Trainable parameters: 0.033% (0.819M/2506.172M)
Loading datasets
Training
Starting training..., iters: 300
Iter 1: Val loss 4.472, Val took 35.372s
libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Internal Error (0000000e:Internal Error)
[1]    6094 abort      python -m mlx_lm.lora --model google/gemma-1.1-2b-it --train --data  --iters
/opt/homebrew/Cellar/python@3.12/3.12.2_1/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

awni (Member) commented Apr 11, 2024

I'm not sure... I've never seen that error before. It seems related to too much resource use (e.g. OOM). Does it run if you use a smaller batch size, --batch-size=1?

danny-su (Author) commented:

@awni My Mac has enough memory to fine-tune Gemma 2b. However, this issue occurs randomly with Gemma and Qwen, rarely with Mistral and Phi.

 python -m mlx_lm.lora \
    --model google/gemma-1.1-2b-it \
    --train \
    --data ../data/Gemma \
    --iters 300 --batch-size 1 --adapter-path ../adapters/gemma-1.1-2b-it
Loading pretrained model
Trainable parameters: 0.033% (0.819M/2506.172M)
Loading datasets
Training
Starting training..., iters: 300
Iter 1: Val loss 2.433, Val took 25.997s
Iter 10: Train loss 2.004, Learning Rate 1.000e-05, It/sec 0.298, Tokens/sec 266.739, Trained Tokens 8950, Peak mem 13.134 GB
Iter 20: Train loss 1.258, Learning Rate 1.000e-05, It/sec 0.288, Tokens/sec 253.914, Trained Tokens 17756, Peak mem 13.134 GB
Iter 30: Train loss 0.873, Learning Rate 1.000e-05, It/sec 0.299, Tokens/sec 263.815, Trained Tokens 26572, Peak mem 13.134 GB
Iter 40: Train loss 0.590, Learning Rate 1.000e-05, It/sec 0.372, Tokens/sec 329.684, Trained Tokens 35444, Peak mem 13.134 GB
Iter 50: Train loss 0.390, Learning Rate 1.000e-05, It/sec 0.410, Tokens/sec 360.239, Trained Tokens 44234, Peak mem 13.134 GB
Iter 60: Train loss 0.289, Learning Rate 1.000e-05, It/sec 0.367, Tokens/sec 327.103, Trained Tokens 53139, Peak mem 13.134 GB
Iter 70: Train loss 0.251, Learning Rate 1.000e-05, It/sec 0.345, Tokens/sec 307.374, Trained Tokens 62061, Peak mem 13.134 GB
Iter 80: Train loss 0.216, Learning Rate 1.000e-05, It/sec 0.460, Tokens/sec 407.837, Trained Tokens 70928, Peak mem 13.134 GB
Iter 90: Train loss 0.198, Learning Rate 1.000e-05, It/sec 0.318, Tokens/sec 279.738, Trained Tokens 79725, Peak mem 13.134 GB
Iter 100: Train loss 0.178, Learning Rate 1.000e-05, It/sec 0.313, Tokens/sec 273.145, Trained Tokens 88441, Peak mem 13.134 GB
Iter 100: Saved adapter weights to /Users/danny/Downloads/llm/data/adapters/gemma-1.1-2b-it/adapters.safetensors and /Users/danny/Downloads/llm/data/adapters/gemma-1.1-2b-it/0000100_adapters.safetensors.
libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Internal Error (0000000e:Internal Error)
[1]    15804 abort      python -m mlx_lm.lora --model google/gemma-1.1-2b-it --train --data  --iters

python -m mlx_lm.lora \
    --model google/gemma-1.1-2b-it \
    --train \
    --data ../data/Gemma \
    --iters 300 --batch-size 1 --adapter-path ../adapters/gemma-1.1-2b-it
Loading pretrained model
Trainable parameters: 0.033% (0.819M/2506.172M)
Loading datasets
Training
Starting training..., iters: 300
Iter 1: Val loss 2.436, Val took 16.906s
Iter 10: Train loss 2.015, Learning Rate 1.000e-05, It/sec 0.485, Tokens/sec 433.968, Trained Tokens 8950, Peak mem 13.134 GB
libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Internal Error (0000000e:Internal Error)
[1]    17830 abort      python -m mlx_lm.lora --model google/gemma-1.1-2b-it --train --data  --iters

awni (Member) commented Apr 12, 2024

I'm running the command you shared.

awni (Member) commented Apr 12, 2024

@danny-su what version of MLX / MLX LM are you using:

python -c "import mlx.core as mx; print(mx.__version__)"

If it's not the latest, please update and try again.
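
For completeness, the installed mlx-lm version can be checked with pip as well (a minimal sketch; this works for any standard pip install):

pip show mlx mlx-lm | grep -E "Name|Version"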

So far I'm not able to reproduce the issue you shared. I will try on another machine.

danny-su (Author) commented Apr 12, 2024

@awni I used the latest version: 0.10.0. Here is the data I used to fine-tune Qwen1.5-1.8B and Qwen1.5-0.5B. Both failed, but I can use data of the same size to fine-tune Mistral-7B-Instruct-v0.2, so I think there may be bugs in the mlx_lm implementations for Qwen and Gemma.

qwen.zip

python3.11 -m mlx_lm.lora --max-seq-length 102400 --batch-size 1 \
    --model Qwen/Qwen1.5-0.5B \
    --train \
    --data ../data/Qwen \
    --iters 1000 --adapter-path ../adapters/qwen

python3.11 -m mlx_lm.lora --max-seq-length 102400 --batch-size 1 \
    --model Qwen/Qwen1.5-1.8B \
    --train \
    --data ../data/Qwen \
    --iters 1000 --adapter-path ../adapters/qwen

awni (Member) commented Apr 12, 2024

I can use the same size data to fine-tune Mistral-7B-Instruct-v0.2

Quantized or fp16?

danny-su (Author) commented:

@awni The original version downloaded from Hugging Face.

awni (Member) commented Apr 12, 2024

Wow, that is a really long sequence length: 102400. I can't imagine you have enough memory on your machine for a sequence that long. Just the attention scores for one layer alone would be about 20GB.
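
As a rough back-of-envelope check (a sketch assuming fp16 scores and a single 102400 x 102400 attention score matrix), the numbers work out to roughly that figure:

python -c "L = 102400; print(f'{L * L * 2 / 1e9:.1f} GB per fp16 attention score matrix')"
# prints ~21.0 GB for a single 102400 x 102400 matrix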

awni (Member) commented Apr 12, 2024

I tried this on three different machines (M2 Ultra / M2 Mini / M1 Max) and I've not been able to reproduce the same issue.

I recommend trying the following:

  • Reboot
  • Maybe disconnect any external monitors / make sure no other GPU heavy processes are running
  • Try a fresh environment with a fresh install of everything

I will do some more digging into that error message, but without being able to reproduce it, it's difficult to help debug.

danny-su (Author) commented:

Wow, that is a really long sequence length: 102400. I can't imagine you have enough memory on your machine for a sequence that long. Just the attention scores for one layer alone would be about 20GB.

My data does not have that long sequence length; I set it to a larger value because some outliers exceed the default value of 2048.

awni (Member) commented Apr 13, 2024

Hi @danny-su, sorry, I cannot reproduce this issue. I tried on several machines with several models and the datasets you shared.

As I understand it, the error you are seeing occurs when a command buffer (a computation on the GPU) times out. I am not entirely sure why that's happening for you, and since I can't reproduce the issue it will be difficult to diagnose on my end.

Let me know if you are able to try some of the above steps and they resolve the issue or not.

Also, if you are able to log the input data and share which samples cause the program to crash, that would also be helpful for reproducing it. It seems likely (though not certain) to be data dependent.
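
One possible way to log that (a hypothetical sketch, not part of mlx_lm; it assumes the jsonl files use the {"text": ...} format and that train.jsonl lives in the --data directory, so adjust paths, keys, and the model name as needed):

python - <<'EOF'
# Hypothetical helper: print the ten longest samples in train.jsonl so the
# offending batch can be narrowed down.
import json
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-0.5B")
lengths = []
with open("data/train.jsonl") as f:
    for i, line in enumerate(f):
        sample = json.loads(line)
        lengths.append((len(tok.encode(sample["text"])), i))
for n_tokens, idx in sorted(lengths, reverse=True)[:10]:
    print(f"line {idx}: {n_tokens} tokens")
EOF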

kaeru-shigure commented:

The same error occurs in my environment when running Stable Diffusion XL inference.
However, it is not reliably reproducible; it happens randomly, with low probability.

libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Internal Error (0000000e:Internal Error)

env: M1 Pro, macOS 14.4 / M1 Pro, macOS 14.4.1

kaeru-shigure commented:

It seems to happen more often when the remaining memory is low, so it may be related to swap, etc.

awni (Member) commented Apr 17, 2024

@kaeru-shigure that is very odd 🤔, I assume you have a 16GB machine?

GusLovesMath commented:

I have a 16GB MacBook Pro with an M2 Pro chip and encountered this exact error at 60 epochs when I increased the training data size, batch size, and layers. I reran it after restarting my computer and not utilizing as much memory, and it reached 200 epochs before giving me the same error.

awni (Member) commented May 6, 2024

@GusLovesMath what were you running? Could you share the script / command?

GusLovesMath commented:

I was running the command below. It worked when I reduced the GPU- and RAM-related parameters below (batch size and LoRA layers).

    !python -m mlx_lm.lora \
        --model mlx-community/Meta-Llama-3-8B-Instruct-4bit \
        --train \
        --batch-size 1 \
        --lora-layers 1 \
        --iters 1000 \
        --data Data \
        --seed 0

kaeru-shigure commented May 26, 2024

Follow-up:
First, MLX slows down dramatically, with reduced GPU usage, when memory swapping occurs.
The most obvious indicator of this condition is GPU usage:
when no memory swap is occurring, GPU usage stays at 100%.
Once I avoided this situation, the same error did not occur even after hundreds of hours of execution.
Note: my MacBook is a 32GB model.
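
A small sketch of how MLX's Metal memory use can be checked directly (assuming the mx.metal memory helpers available in MLX releases from this period; capping the cache is only a mitigation to try, not a confirmed fix):

python - <<'EOF'
# Sketch: run a small matmul, then report MLX's active and peak Metal memory.
# Optionally cap the buffer cache so freed memory is returned to the OS sooner.
import mlx.core as mx

a = mx.random.normal((4096, 4096))
mx.eval(a @ a)

print(f"active: {mx.metal.get_active_memory() / 1e9:.2f} GB")
print(f"peak:   {mx.metal.get_peak_memory() / 1e9:.2f} GB")

mx.metal.set_cache_limit(0)  # optional: disable buffer caching
EOF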
