🐛 Describe the bug
I am training the llama2-70b model on 4*8 H100 GPUs (80 GB each) with the "gemini" plugin. Training runs normally, but an "OOM" error occurs when saving the model checkpoint.
Here is the log:

Epoch 0:   0%| | 5/1119698 [01:42<6209:10:15, 19.96s/it, Loss=13.5324]
Start saving model checkpoint with running states
Traceback (most recent call last):
  File "/mnt/disk1/gyj/ColossalAI/applications/Colossal-LLaMA-2/train.py", line 430, in <module>
    main()
  File "/mnt/disk1/gyj/ColossalAI/applications/Colossal-LLaMA-2/train.py", line 390, in main
    save_checkpoint(
  File "/mnt/disk1/gyj/ColossalAI/applications/Colossal-LLaMA-2/colossal_llama2/utils/ckpt_io.py", line 56, in save_checkpoint
    booster.save_optimizer(optimizer, os.path.join(save_dir, "optimizer"), shard=True)
  File "/mnt/disk1/gyj/ColossalAI/colossalai/booster/booster.py", line 307, in save_optimizer
    self.checkpoint_io.save_optimizer(optimizer, checkpoint, shard, gather_dtensor, prefix, size_per_shard)
  File "/mnt/disk1/gyj/ColossalAI/colossalai/checkpoint_io/checkpoint_io_base.py", line 197, in save_optimizer
    self.save_sharded_optimizer(optimizer, checkpoint, gather_dtensor, prefix, size_per_shard)
  File "/mnt/disk1/gyj/ColossalAI/colossalai/booster/plugin/gemini_plugin.py", line 191, in save_sharded_optimizer
    total_size = save_state_dict_shards(
  File "/mnt/disk1/gyj/ColossalAI/colossalai/checkpoint_io/utils.py", line 234, in save_state_dict_shards
    for idx, shard_pair in enumerate(sharded_state_dict):
  File "/mnt/disk1/gyj/ColossalAI/colossalai/zero/gemini/gemini_optimizer.py", line 799, in state_shard
    state = self.collect_states(param_id=param_id, only_rank_0=only_rank_0)
  File "/mnt/disk1/gyj/ColossalAI/colossalai/zero/gemini/gemini_optimizer.py", line 519, in collect_states
    dist.all_gather_object(gathered_state_shards, [compacted_states, shard_offset, shard_size], group=zero_group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1436, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2057, in all_gather_object
    object_list[i] = _tensor_to_object(tensor, tensor_size)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1955, in _tensor_to_object
    return _unpickler(io.BytesIO(buf)).load()
  File "/usr/local/lib/python3.10/dist-packages/torch/storage.py", line 241, in _load_from_bytes
    return torch.load(io.BytesIO(b))
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 815, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 1043, in _legacy_load
    result = unpickler.load()
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 980, in persistent_load
    wrap_storage=restore_location(obj, location),
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 217, in default_restore_location
    result = fn(storage, location)
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 185, in _cuda_deserialize
    return torch.UntypedStorage(obj.nbytes(), device=torch.device(location))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 288.00 MiB (GPU 0; 79.11 GiB total capacity; 1.40 GiB already allocated; 29.69 MiB free; 1.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
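For context, here is a minimal sketch of the save path as I understand it from the traceback, together with the allocator setting that the PyTorch error message itself suggests trying. The tiny placeholder model, the checkpoint path, and the max_split_size_mb value of 128 are assumptions for illustration only, not my real training configuration or a verified fix:

# Minimal sketch, assembled from the traceback above; run under torchrun on
# the same 4*8 GPU setup. Placeholders are marked in the comments.
import os

# Allocator tuning that the OOM message itself suggests; 128 MiB is an
# arbitrary example split size, not a verified fix. It must take effect before
# the first CUDA allocation in each process (ideally set in the launch env).
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch.nn as nn
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin
from colossalai.nn.optimizer import HybridAdam

colossalai.launch_from_torch(config={})

plugin = GeminiPlugin()                      # "gemini" plugin, as in train.py
booster = Booster(plugin=plugin)

model = nn.Linear(1024, 1024)                # placeholder stand-in for llama2-70b
optimizer = HybridAdam(model.parameters(), lr=1e-4)
model, optimizer, *_ = booster.boost(model, optimizer)

# The call that raises the CUDA OOM while all-gathering optimizer states
# (colossal_llama2/utils/ckpt_io.py, line 56 in the traceback):
save_dir = "./checkpoints/step_5"            # placeholder path
booster.save_optimizer(optimizer, os.path.join(save_dir, "optimizer"), shard=True)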
Environment
4 nodes, each with 8 H100 GPUs (80 GB memory per GPU)
CUDA: 12.1
CUDNN: 2.18.1
Python: 3.10
PyTorch: 2.0.0