
Generation takes forever #111

Open
Kira-Pgr opened this issue Feb 6, 2024 · 3 comments

Kira-Pgr commented Feb 6, 2024

Env

  • Python 3.9.18
  • NVIDIA GeForce RTX 4060 Laptop GPU
  • PyTorch 2.1.1
  • airllm 2.8.3
  • CUDA build cuda_12.2.r12.2/compiler.32965470_0

Model used

https://huggingface.co/152334H/miqu-1-70b-sf

Code

from airllm import AutoModel

MAX_LENGTH = 128
# Load the 70B model layer by layer with 4-bit compression
model = AutoModel.from_pretrained("/mnt/d/miqu-1-70b-sf", compression='4bit')
input_text = [
    "[INST] eloquent high camp prose about a cute catgirl [/INST]",
]
# The Llama tokenizer has no pad token by default; reuse EOS so padding works
model.tokenizer.pad_token = model.tokenizer.eos_token
input_tokens = model.tokenizer(input_text,
                               return_tensors="pt",
                               return_attention_mask=False,
                               truncation=True,
                               max_length=MAX_LENGTH,
                               padding=True)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=20,
    use_cache=False,
    return_dict_in_generate=True)

output = model.tokenizer.decode(generation_output.sequences[0])

print(output)

Problem

Generation just keeps looping over running layers(self.running_device), and each full pass over the 83 layers takes anywhere from ~20 minutes to over an hour:

new version of transfomer, no need to use BetterTransformer, try setting attn impl to sdpa...
attn imp: <class 'transformers.models.llama.modeling_llama.LlamaSdpaAttention'>
running layers(self.running_device): 100%|██████████████████████████████████████████████| 83/83 [27:57<00:00, 20.22s/it]
new version of transfomer, no need to use BetterTransformer, try setting attn impl to sdpa...
attn imp: <class 'transformers.models.llama.modeling_llama.LlamaSdpaAttention'>
running layers(self.running_device): 100%|██████████████████████████████████████████████| 83/83 [30:15<00:00, 21.87s/it]
new version of transfomer, no need to use BetterTransformer, try setting attn impl to sdpa...
attn imp: <class 'transformers.models.llama.modeling_llama.LlamaSdpaAttention'>
running layers(self.running_device): 100%|████████████████████████████████████████████| 83/83 [1:04:38<00:00, 46.73s/it]
new version of transfomer, no need to use BetterTransformer, try setting attn impl to sdpa...
attn imp: <class 'transformers.models.llama.modeling_llama.LlamaSdpaAttention'>
running layers(self.running_device): 100%|████████████████████████████████████████████| 83/83 [1:13:57<00:00, 53.47s/it]
new version of transfomer, no need to use BetterTransformer, try setting attn impl to sdpa...
attn imp: <class 'transformers.models.llama.modeling_llama.LlamaSdpaAttention'>
running layers(self.running_device):  23%|██████████▌                                   | 19/83 [11:01<37:06, 34.79s/it]

Loading the model didn't produce any errors, but it prints this:

new version of transfomer, no need to use BetterTransformer, try setting attn impl to sdpa...
attn imp: <class 'transformers.models.llama.modeling_llama.LlamaSdpaAttention'>
not support prefetching for compression for now. loading with no prepetching mode.

The solution from #107 didn't work.


ahmedbr commented Feb 21, 2024

Any updates?

@shailin1

Getting the same issue on a 13700K, a 4090, and 32 GB RAM. Was this resolved?

@leedahae340

This isn't a bug. It's related to max_new_tokens=20: if it's 20, the model has to run 20 full passes over the layers; if it's 200, it runs 200 passes...
It's just a bit slow.
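
Rough back-of-the-envelope arithmetic from the progress bars above (a sketch only, assuming one full pass over all 83 layers per generated token; the per-layer timing is taken from the first, fastest pass):

layers = 83                # layer count shown in the progress bars
secs_per_layer = 20.22     # s/it reported by the first pass
max_new_tokens = 20        # value passed to model.generate

secs_per_token = layers * secs_per_layer              # ~28 minutes per token
total_hours = max_new_tokens * secs_per_token / 3600
print(f"~{secs_per_token / 60:.0f} min/token, ~{total_hours:.1f} h for {max_new_tokens} tokens")
# -> ~28 min/token, ~9.3 h for 20 tokens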
