🐛 Describe the bug
I am training the llama2-70b model on 4*8 H100 GPUs (80 GB each) with the "gemini" plugin. Training runs normally, but an "OOM" error occurs when saving the model checkpoint.
Here is the log:

Epoch 0:   0%| | 5/1119698 [01:42<6209:10:15, 19.96s/it, Loss=13.5324]
Start saving model checkpoint with running states
Traceback (most recent call last):
  File "/mnt/disk1/gyj/ColossalAI/applications/Colossal-LLaMA-2/train.py", line 430, in <module>
    main()
  File "/mnt/disk1/gyj/ColossalAI/applications/Colossal-LLaMA-2/train.py", line 390, in main
    save_checkpoint(
  File "/mnt/disk1/gyj/ColossalAI/applications/Colossal-LLaMA-2/colossal_llama2/utils/ckpt_io.py", line 56, in save_checkpoint
    booster.save_optimizer(optimizer, os.path.join(save_dir, "optimizer"), shard=True)
  File "/mnt/disk1/gyj/ColossalAI/colossalai/booster/booster.py", line 307, in save_optimizer
    self.checkpoint_io.save_optimizer(optimizer, checkpoint, shard, gather_dtensor, prefix, size_per_shard)
  File "/mnt/disk1/gyj/ColossalAI/colossalai/checkpoint_io/checkpoint_io_base.py", line 197, in save_optimizer
    self.save_sharded_optimizer(optimizer, checkpoint, gather_dtensor, prefix, size_per_shard)
  File "/mnt/disk1/gyj/ColossalAI/colossalai/booster/plugin/gemini_plugin.py", line 191, in save_sharded_optimizer
    total_size = save_state_dict_shards(
  File "/mnt/disk1/gyj/ColossalAI/colossalai/checkpoint_io/utils.py", line 234, in save_state_dict_shards
    for idx, shard_pair in enumerate(sharded_state_dict):
  File "/mnt/disk1/gyj/ColossalAI/colossalai/zero/gemini/gemini_optimizer.py", line 799, in state_shard
    state = self.collect_states(param_id=param_id, only_rank_0=only_rank_0)
  File "/mnt/disk1/gyj/ColossalAI/colossalai/zero/gemini/gemini_optimizer.py", line 519, in collect_states
    dist.all_gather_object(gathered_state_shards, [compacted_states, shard_offset, shard_size], group=zero_group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1436, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2057, in all_gather_object
    object_list[i] = _tensor_to_object(tensor, tensor_size)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1955, in _tensor_to_object
    return _unpickler(io.BytesIO(buf)).load()
  File "/usr/local/lib/python3.10/dist-packages/torch/storage.py", line 241, in _load_from_bytes
    return torch.load(io.BytesIO(b))
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 815, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 1043, in _legacy_load
    result = unpickler.load()
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 980, in persistent_load
    wrap_storage=restore_location(obj, location),
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 217, in default_restore_location
    result = fn(storage, location)
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 185, in _cuda_deserialize
    return torch.UntypedStorage(obj.nbytes(), device=torch.device(location))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 288.00 MiB (GPU 0; 79.11 GiB total capacity; 1.40 GiB already allocated; 29.69 MiB free; 1.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
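For context, here is a minimal sketch of the save path as I understand it from the traceback, together with the allocator setting that the PyTorch error message itself suggests trying. The tiny placeholder model, the checkpoint path, and the max_split_size_mb value of 128 are assumptions for illustration only, not my real training configuration or a verified fix:

# Minimal sketch, assembled from the traceback above; run under torchrun on
# the same 4*8 GPU setup. Placeholders are marked in the comments.
import os

# Allocator tuning that the OOM message itself suggests; 128 MiB is an
# arbitrary example split size, not a verified fix. It must take effect before
# the first CUDA allocation in each process (ideally set in the launch env).
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch.nn as nn
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin
from colossalai.nn.optimizer import HybridAdam

colossalai.launch_from_torch(config={})

plugin = GeminiPlugin()                      # "gemini" plugin, as in train.py
booster = Booster(plugin=plugin)

model = nn.Linear(1024, 1024)                # placeholder stand-in for llama2-70b
optimizer = HybridAdam(model.parameters(), lr=1e-4)
model, optimizer, *_ = booster.boost(model, optimizer)

# The call that raises the CUDA OOM while all-gathering optimizer states
# (colossal_llama2/utils/ckpt_io.py, line 56 in the traceback):
save_dir = "./checkpoints/step_5"            # placeholder path
booster.save_optimizer(optimizer, os.path.join(save_dir, "optimizer"), shard=True)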
Environment
4 nodes, each with 8 H100 GPUs (80 GB memory per GPU)
CUDA: 12.1
CUDNN: 2.18.1
Python: 3.10
PyTorch: 2.0.0