When running quickstart.ipynb, loading the model in int8 vs fp16 occupies significantly different amounts of GPU memory. #374

lankuohsing commented Feb 17, 2024

I am trying to fine-tune Llama 2 7B with LoRA by running quickstart.ipynb (https://github.com/facebookresearch/llama-recipes/blob/main/examples/quickstart.ipynb) on a single A100 40GB GPU.
When I load the model in int8 and create a PeftModel in int8 (the original setting in quickstart.ipynb), training occupies 14 GB of GPU memory (with batch_size set to 2).
However, when I load the model in fp16 and create a PeftModel in fp16, training occupies 40 GB of GPU memory (with batch_size set to 1).

The part of the code I modified is shown below:

model = LlamaForCausalLM.from_pretrained(model_id, device_map='auto', torch_dtype=torch.float16)
...
config = {
    'lora_config': lora_config,
    'learning_rate': 1e-5,                 # changed from 1e-4 to 1e-5
    'num_train_epochs': 1,
    'gradient_accumulation_steps': 4,      # changed from 2 to 4
    'per_device_train_batch_size': 1,      # changed from 2 to 1
    'gradient_checkpointing': False,
}
...
# model = prepare_model_for_int8_training(model)
model = get_peft_model(model, peft_config)
...

Can someone explain why there is such a huge difference in GPU memory consumption?
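For reference, the int8 path I kept from the notebook looks roughly like the sketch below (a minimal reconstruction from memory, not a verbatim copy of quickstart.ipynb; model_id and peft_config are the same objects as in the snippet above):

import torch
from transformers import LlamaForCausalLM
from peft import prepare_model_for_int8_training, get_peft_model

# Load the base weights quantized to int8 via bitsandbytes instead of fp16.
model = LlamaForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    load_in_8bit=True,
)
# Casts norm layers / lm_head to fp32 and, by default, enables gradient checkpointing.
model = prepare_model_for_int8_training(model)
model = get_peft_model(model, peft_config)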

HamidShojanazeri (Contributor) commented

@lankuohsing Reading the bitsandbytes docs might be helpful. It supports a bunch of functionality besides int8 matrix multiplication; it also provides an 8-bit optimizer that can significantly reduce memory requirements. And if you look at prepare_model_for_int8_training, it also enables gradient checkpointing, which is a big memory saver. That should largely explain the difference.
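A minimal sketch of applying those same two savings to the fp16 path, assuming the Hugging Face Trainer setup from the notebook (model_id and peft_config as above; output_dir is illustrative, and the 8-bit optimizer requires bitsandbytes):

import torch
from transformers import LlamaForCausalLM, TrainingArguments
from peft import get_peft_model

model = LlamaForCausalLM.from_pretrained(model_id, device_map='auto', torch_dtype=torch.float16)
# Needed when gradient checkpointing is combined with a frozen base model + LoRA adapters.
model.enable_input_require_grads()
model = get_peft_model(model, peft_config)

training_args = TrainingArguments(
    output_dir="tmp_llama2_lora",       # illustrative path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=1e-5,
    fp16=True,
    gradient_checkpointing=True,        # recompute activations in backward instead of storing them
    optim="adamw_bnb_8bit",             # bitsandbytes 8-bit Adam: far smaller optimizer state than fp32 Adam
)

With gradient checkpointing off and a standard fp32 Adam optimizer (as in the modified config above), the fp16 run pays for full activations plus large optimizer states, which accounts for most of the 14 GB vs 40 GB gap.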
