
Encountering RuntimeError: Cannot convert token (29333) to bytes: � for some model vocabularies when using llama.cpp #820

Closed
abetlen opened this issue Apr 15, 2024 · 9 comments · Fixed by #892

@abetlen

abetlen commented Apr 15, 2024

Describe the issue as clearly as possible:

Ran into this tokenizer issue in regex.py with some models (Qwen1.5, Phi-2) but not others (OpenHermes-2.5-Mistral-7B) when using llama.cpp.

This honestly might be my fault, something I'm doing in llama-cpp-python, but I'm not familiar enough with the outlines codebase to tell.

Steps/code to reproduce the bug:

from outlines import models, generate

model = models.llamacpp("Qwen/Qwen1.5-0.5B-Chat-GGUF", "*q8*.gguf")
generator = generate.choice(model, ["skirt", "dress", "pen", "jacket"])
answer = generator("Pick the odd word out: skirt, dress, pen, jacket")

print(answer)

Expected result:

`skirt`

Error message:

Exception: Cannot convert token `` (29333) to bytes:  �
Traceback (most recent call last):
  File "/home/andrei/Documents/llms/llama_cpp/server/errors.py", line 171, in custom_route_handler
    response = await original_route_handler(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andrei/Documents/llms/.direnv/python-3.11.9+/lib/python3.11/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andrei/Documents/llms/.direnv/python-3.11.9+/lib/python3.11/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andrei/Documents/llms/llama_cpp/server/app.py", line 451, in create_chat_completion
    ] = await run_in_threadpool(llama.create_chat_completion, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andrei/Documents/llms/.direnv/python-3.11.9+/lib/python3.11/site-packages/starlette/concurrency.py", line 42, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andrei/Documents/llms/.direnv/python-3.11.9+/lib/python3.11/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andrei/Documents/llms/.direnv/python-3.11.9+/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 2144, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/home/andrei/Documents/llms/.direnv/python-3.11.9+/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 851, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andrei/Documents/llms/llama_cpp/llama.py", line 1655, in create_chat_completion
    return handler(
           ^^^^^^^^
  File "/home/andrei/Documents/llms/llama_cpp/llama_chat_format.py", line 582, in chat_completion_handler
    json_schema_processor = JSONLogitsProcessor(schema, llama)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andrei/Documents/llms/.direnv/python-3.11.9+/lib/python3.11/site-packages/outlines/integrations/llamacpp.py", line 179, in __init__
    super().__init__(regex_string=regex_string, llm=llm)
  File "/home/andrei/Documents/llms/.direnv/python-3.11.9+/lib/python3.11/site-packages/outlines/integrations/llamacpp.py", line 143, in __init__
    fsm = RegexGuide(regex_string, tokenizer)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andrei/Documents/llms/.direnv/python-3.11.9+/lib/python3.11/site-packages/outlines/fsm/guide.py", line 146, in __init__
    ) = create_states_mapping(
        ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andrei/Documents/llms/.direnv/python-3.11.9+/lib/python3.11/site-packages/outlines/caching.py", line 74, in wrapper
    result = cached_function(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andrei/Documents/llms/.direnv/python-3.11.9+/lib/python3.11/site-packages/outlines/fsm/guide.py", line 125, in create_states_mapping
    states_to_token_maps, empty_token_ids = create_fsm_index_tokenizer(
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andrei/Documents/llms/.direnv/python-3.11.9+/lib/python3.11/site-packages/outlines/fsm/regex.py", line 831, in create_fsm_index_tokenizer
    vocabulary, empty_token_ids = reduced_vocabulary(tokenizer)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andrei/Documents/llms/.direnv/python-3.11.9+/lib/python3.11/site-packages/outlines/fsm/regex.py", line 793, in reduced_vocabulary
    raise RuntimeError(
RuntimeError: Cannot convert token `` (29333) to bytes:  �

Outlines/Python version information:

Version information

0.0.37 Python 3.11.9+ (heads/3.11:cd12e6c779, Apr 3 2024, 02:38:14) [GCC 9.4.0] annotated-types==0.6.0 anyio==4.3.0 attrs==23.2.0 Babel==2.14.0 backports.tarfile==1.0.0 black==24.4.0 certifi==2024.2.2 cffi==1.16.0 charset-normalizer==3.3.2 click==8.1.7 cloudpickle==3.0.0 colorama==0.4.6 cryptography==42.0.5 diskcache==5.6.3 docutils==0.21.1 fastapi==0.110.1 filelock==3.13.4 fsspec==2024.3.1 ghp-import==2.1.0 griffe==0.42.1 h11==0.14.0 httpcore==1.0.5 httpx==0.27.0 huggingface-hub==0.22.2 idna==3.7 importlib_metadata==7.1.0 iniconfig==2.0.0 interegular==0.3.3 jaraco.classes==3.4.0 jaraco.context==5.3.0 jaraco.functools==4.0.0 jeepney==0.8.0 Jinja2==3.1.3 joblib==1.4.0 jsonschema==4.21.1 jsonschema-specifications==2023.12.1 keyring==25.1.0 lark==1.1.9 llvmlite==0.42.0 Markdown==3.6 markdown-it-py==3.0.0 MarkupSafe==2.1.5 mdurl==0.1.2 mergedeep==1.3.4 mkdocs==1.5.3 mkdocs-autorefs==1.0.1 mkdocs-material==9.5.17 mkdocs-material-extensions==1.3.1 mkdocstrings==0.24.3 mkdocstrings-python==1.9.2 more-itertools==10.2.0 mpmath==1.3.0 mypy-extensions==1.0.0 nest-asyncio==1.6.0 networkx==3.3 nh3==0.2.17 numba==0.59.1 numpy==1.26.4 nvidia-cublas-cu12==12.1.3.1 nvidia-cuda-cupti-cu12==12.1.105 nvidia-cuda-nvrtc-cu12==12.1.105 nvidia-cuda-runtime-cu12==12.1.105 nvidia-cudnn-cu12==8.9.2.26 nvidia-cufft-cu12==11.0.2.54 nvidia-curand-cu12==10.3.2.106 nvidia-cusolver-cu12==11.4.5.107 nvidia-cusparse-cu12==12.1.0.106 nvidia-nccl-cu12==2.19.3 nvidia-nvjitlink-cu12==12.4.127 nvidia-nvtx-cu12==12.1.105 outlines==0.0.37 packaging==24.0 paginate==0.5.6 pathspec==0.12.1 pkginfo==1.10.0 platformdirs==4.2.0 pluggy==1.4.0 pycparser==2.22 pydantic==2.7.0 pydantic-settings==2.2.1 pydantic_core==2.18.1 Pygments==2.17.2 pymdown-extensions==10.7.1 pytest==8.1.1 python-dateutil==2.9.0.post0 python-dotenv==1.0.1 PyYAML==6.0.1 pyyaml_env_tag==0.1 readme_renderer==43.0 referencing==0.34.0 regex==2023.12.25 requests==2.31.0 requests-toolbelt==1.0.0 rfc3986==2.0.0 rich==13.7.1 rpds-py==0.18.0 safetensors==0.4.2 scipy==1.13.0 SecretStorage==3.3.3 six==1.16.0 sniffio==1.3.1 sse-starlette==2.1.0 starlette==0.37.2 starlette-context==0.3.6 sympy==1.12 tokenizers==0.15.2 torch==2.2.2 tqdm==4.66.2 transformers==4.39.3 triton==2.2.0 twine==5.0.0 typing_extensions==4.11.0 urllib3==2.2.1 uvicorn==0.29.0 watchdog==4.0.0 zipp==3.18.1

Context for the issue:

No response

@abetlen abetlen added the bug label Apr 15, 2024
@lmiller-phdata

This is also true of Gemma models (e.g. gemma-2b-it).

@FreakTheMighty

I'm seeing this with Llama 3 Instruct as well.

@melisekm

Temp workaround: install outlines==0.0.36

@pushad

pushad commented Apr 21, 2024

Experiencing this as well on Llama 3 8b instruct, as well as various other models.

@rlouf
Member

rlouf commented Apr 21, 2024

> Temp workaround: install outlines==0.0.36

Can someone confirm that the problem was introduced in outlines==0.0.37?

@pushad

pushad commented Apr 21, 2024

@rlouf can confirm.

When testing the following script:

from outlines import models, generate

model = models.llamacpp("./Meta-Llama-3-8B-Instruct.Q8_0.gguf")
generator = generate.choice(model, ["skirt", "dress", "pen", "jacket"])
answer = generator("Pick the odd word out: skirt, dress, pen, jacket")
print(answer)

It runs fine under 0.0.36; under 0.0.37 I get RuntimeError: Cannot convert token (30433) to bytes: �

@rlouf
Member

rlouf commented Apr 21, 2024

Thank you! The bug was most likely introduced by #738. I'll try to understand why.

@cegutica

cegutica commented May 8, 2024

Hi @rlouf, is there any update on this? I'm using Outlines version 0.0.41 with Llama-3-8B and it keeps failing. When using version 0.0.36, inference takes too long with any Llama-3 model compared to other models such as Mixtral 8x7B or 8x22B.

@lapp0
Contributor

lapp0 commented May 16, 2024

This issue appears to be a result of llama.cpp's pre-tokenizer not working correctly; more details on the problem are in #892.
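
Roughly what goes wrong (a simplified sketch, not the actual outlines or llama.cpp code; the byte values below are only an illustration): some vocabulary entries are fragments of a multi-byte UTF-8 sequence, and detokenizing such a fragment yields the Unicode replacement character, from which the original bytes can no longer be recovered.

    # Hypothetical illustration of the failure mode.
    fragment = "é".encode("utf-8")[1:]    # b'\xa9': a lone UTF-8 continuation byte
    decoded = fragment.decode("utf-8", errors="replace")
    print(decoded)                        # '�' (U+FFFD)
    # Re-encoding the replacement character gives b'\xef\xbf\xbd', not b'\xa9',
    # so the token cannot be mapped back to its original bytes, which is what
    # reduced_vocabulary() is complaining about.
    print(decoded.encode("utf-8"))        # b'\xef\xbf\xbd'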

You can resolve this issue with the following steps:

1. Clear the outlines cache: `import outlines.caching; outlines.caching.clear_cache()`
2. Install my fix branch: `pip install "git+https://github.com/lapp0/outlines@test-issue-820"`
3. Ensure you specify a `LlamaHFTokenizer` when creating `models.llamacpp`, as follows:

    import llama_cpp
    from outlines import models

    model = models.llamacpp(
        "Qwen/Qwen1.5-0.5B-Chat-GGUF",
        "*q8*.gguf",
        tokenizer=llama_cpp.llama_tokenizer.LlamaHFTokenizer.from_pretrained(
            "Qwen/Qwen1.5-0.5B-Chat"
        ),
    )
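
With the model constructed this way, the reproduction script from the top of this thread should then run through (reusing the `generate.choice` example above; `model` is the object created in step 3):

    from outlines import generate

    generator = generate.choice(model, ["skirt", "dress", "pen", "jacket"])
    answer = generator("Pick the odd word out: skirt, dress, pen, jacket")
    print(answer)  # one of the four choices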
