
Encountering RuntimeError: Cannot convert token (29333) to bytes: � for some model vocabularies when using llama.cpp #820

Closed
abetlen opened this issue Apr 15, 2024 · 9 comments · Fixed by #892

@abetlen

abetlen commented Apr 15, 2024

Describe the issue as clearly as possible:

Ran into this tokenizer issue in regex.py with some models (Qwen1.5, Phi-2) but not others (OpenHermes-2.5-Mistral-7B) when using llama.cpp.

This honestly might be my fault, something I'm doing in llama-cpp-python, but I'm not familiar enough with the outlines codebase to tell.

Steps/code to reproduce the bug:

from outlines import models, generate

model = models.llamacpp("Qwen/Qwen1.5-0.5B-Chat-GGUF", "*q8*.gguf")
generator = generate.choice(model, ["skirt", "dress", "pen", "jacket"])
answer = generator("Pick the odd word out: skirt, dress, pen, jacket")

print(answer)

Expected result:

`skirt`

Error message:

Exception: Cannot convert token `` (29333) to bytes:  �
Traceback (most recent call last):
  File "/home/andrei/Documents/llms/llama_cpp/server/errors.py", line 171, in custom_route_handler
    response = await original_route_handler(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andrei/Documents/llms/.direnv/python-3.11.9+/lib/python3.11/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andrei/Documents/llms/.direnv/python-3.11.9+/lib/python3.11/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andrei/Documents/llms/llama_cpp/server/app.py", line 451, in create_chat_completion
    ] = await run_in_threadpool(llama.create_chat_completion, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andrei/Documents/llms/.direnv/python-3.11.9+/lib/python3.11/site-packages/starlette/concurrency.py", line 42, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andrei/Documents/llms/.direnv/python-3.11.9+/lib/python3.11/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andrei/Documents/llms/.direnv/python-3.11.9+/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 2144, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/home/andrei/Documents/llms/.direnv/python-3.11.9+/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 851, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andrei/Documents/llms/llama_cpp/llama.py", line 1655, in create_chat_completion
    return handler(
           ^^^^^^^^
  File "/home/andrei/Documents/llms/llama_cpp/llama_chat_format.py", line 582, in chat_completion_handler
    json_schema_processor = JSONLogitsProcessor(schema, llama)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andrei/Documents/llms/.direnv/python-3.11.9+/lib/python3.11/site-packages/outlines/integrations/llamacpp.py", line 179, in __init__
    super().__init__(regex_string=regex_string, llm=llm)
  File "/home/andrei/Documents/llms/.direnv/python-3.11.9+/lib/python3.11/site-packages/outlines/integrations/llamacpp.py", line 143, in __init__
    fsm = RegexGuide(regex_string, tokenizer)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andrei/Documents/llms/.direnv/python-3.11.9+/lib/python3.11/site-packages/outlines/fsm/guide.py", line 146, in __init__
    ) = create_states_mapping(
        ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andrei/Documents/llms/.direnv/python-3.11.9+/lib/python3.11/site-packages/outlines/caching.py", line 74, in wrapper
    result = cached_function(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andrei/Documents/llms/.direnv/python-3.11.9+/lib/python3.11/site-packages/outlines/fsm/guide.py", line 125, in create_states_mapping
    states_to_token_maps, empty_token_ids = create_fsm_index_tokenizer(
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andrei/Documents/llms/.direnv/python-3.11.9+/lib/python3.11/site-packages/outlines/fsm/regex.py", line 831, in create_fsm_index_tokenizer
    vocabulary, empty_token_ids = reduced_vocabulary(tokenizer)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andrei/Documents/llms/.direnv/python-3.11.9+/lib/python3.11/site-packages/outlines/fsm/regex.py", line 793, in reduced_vocabulary
    raise RuntimeError(
RuntimeError: Cannot convert token `` (29333) to bytes:  �

Outlines/Python version information:

Version information

0.0.37 Python 3.11.9+ (heads/3.11:cd12e6c779, Apr 3 2024, 02:38:14) [GCC 9.4.0] annotated-types==0.6.0 anyio==4.3.0 attrs==23.2.0 Babel==2.14.0 backports.tarfile==1.0.0 black==24.4.0 certifi==2024.2.2 cffi==1.16.0 charset-normalizer==3.3.2 click==8.1.7 cloudpickle==3.0.0 colorama==0.4.6 cryptography==42.0.5 diskcache==5.6.3 docutils==0.21.1 fastapi==0.110.1 filelock==3.13.4 fsspec==2024.3.1 ghp-import==2.1.0 griffe==0.42.1 h11==0.14.0 httpcore==1.0.5 httpx==0.27.0 huggingface-hub==0.22.2 idna==3.7 importlib_metadata==7.1.0 iniconfig==2.0.0 interegular==0.3.3 jaraco.classes==3.4.0 jaraco.context==5.3.0 jaraco.functools==4.0.0 jeepney==0.8.0 Jinja2==3.1.3 joblib==1.4.0 jsonschema==4.21.1 jsonschema-specifications==2023.12.1 keyring==25.1.0 lark==1.1.9 llvmlite==0.42.0 Markdown==3.6 markdown-it-py==3.0.0 MarkupSafe==2.1.5 mdurl==0.1.2 mergedeep==1.3.4 mkdocs==1.5.3 mkdocs-autorefs==1.0.1 mkdocs-material==9.5.17 mkdocs-material-extensions==1.3.1 mkdocstrings==0.24.3 mkdocstrings-python==1.9.2 more-itertools==10.2.0 mpmath==1.3.0 mypy-extensions==1.0.0 nest-asyncio==1.6.0 networkx==3.3 nh3==0.2.17 numba==0.59.1 numpy==1.26.4 nvidia-cublas-cu12==12.1.3.1 nvidia-cuda-cupti-cu12==12.1.105 nvidia-cuda-nvrtc-cu12==12.1.105 nvidia-cuda-runtime-cu12==12.1.105 nvidia-cudnn-cu12==8.9.2.26 nvidia-cufft-cu12==11.0.2.54 nvidia-curand-cu12==10.3.2.106 nvidia-cusolver-cu12==11.4.5.107 nvidia-cusparse-cu12==12.1.0.106 nvidia-nccl-cu12==2.19.3 nvidia-nvjitlink-cu12==12.4.127 nvidia-nvtx-cu12==12.1.105 outlines==0.0.37 packaging==24.0 paginate==0.5.6 pathspec==0.12.1 pkginfo==1.10.0 platformdirs==4.2.0 pluggy==1.4.0 pycparser==2.22 pydantic==2.7.0 pydantic-settings==2.2.1 pydantic_core==2.18.1 Pygments==2.17.2 pymdown-extensions==10.7.1 pytest==8.1.1 python-dateutil==2.9.0.post0 python-dotenv==1.0.1 PyYAML==6.0.1 pyyaml_env_tag==0.1 readme_renderer==43.0 referencing==0.34.0 regex==2023.12.25 requests==2.31.0 requests-toolbelt==1.0.0 rfc3986==2.0.0 rich==13.7.1 rpds-py==0.18.0 safetensors==0.4.2 scipy==1.13.0 SecretStorage==3.3.3 six==1.16.0 sniffio==1.3.1 sse-starlette==2.1.0 starlette==0.37.2 starlette-context==0.3.6 sympy==1.12 tokenizers==0.15.2 torch==2.2.2 tqdm==4.66.2 transformers==4.39.3 triton==2.2.0 twine==5.0.0 typing_extensions==4.11.0 urllib3==2.2.1 uvicorn==0.29.0 watchdog==4.0.0 zipp==3.18.1

Context for the issue:

No response

@abetlen abetlen added the bug label Apr 15, 2024
@lmiller-phdata

This is also true of Gemma models (e.g. gemma-2b-it).

@FreakTheMighty

I'm seeing this with Llama 3 Instruct as well.

@melisekm

Temp workaround: install outlines==0.0.36

@pushad

pushad commented Apr 21, 2024

Experiencing this as well on Llama 3 8b instruct, as well as various other models.

@rlouf
Member

rlouf commented Apr 21, 2024

> Temp workaround: install outlines==0.0.36

Can someone confirm that the problem was introduced in outlines==0.0.37?

@pushad

pushad commented Apr 21, 2024

@rlouf can confirm.

When testing the following script:

from outlines import models, generate

model = models.llamacpp("./Meta-Llama-3-8B-Instruct.Q8_0.gguf")
generator = generate.choice(model, ["skirt", "dress", "pen", "jacket"])
answer = generator("Pick the odd word out: skirt, dress, pen, jacket")
print(answer)

It runs fine under 0.0.36; under 0.0.37 I get RuntimeError: Cannot convert token (30433) to bytes: �

@rlouf
Member

rlouf commented Apr 21, 2024

Thank you! The bug was most likely introduced by #738. I'll try to understand why.

@cegutica

cegutica commented May 8, 2024

Hi @rlouf, is there any update on this? I'm using Outlines version 0.0.41 with Llama-3-8B and it keeps failing. When using version 0.0.36, inference takes too long with any Llama-3 model compared to other models such as Mixtral 8x7B or 8x22B.

@lapp0
Contributor

lapp0 commented May 16, 2024

This issue appears to be a result of llama.cpp's pre-tokenizer not working correctly; more details on the problem are in #892.
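
Roughly what goes wrong (a simplified sketch, not the actual outlines or llama.cpp code; the byte values below are only an illustration): some vocabulary entries are fragments of a multi-byte UTF-8 sequence, and detokenizing such a fragment yields the Unicode replacement character, from which the original bytes can no longer be recovered.

    # Hypothetical illustration of the failure mode.
    fragment = "é".encode("utf-8")[1:]    # b'\xa9': a lone UTF-8 continuation byte
    decoded = fragment.decode("utf-8", errors="replace")
    print(decoded)                        # '�' (U+FFFD)
    # Re-encoding the replacement character gives b'\xef\xbf\xbd', not b'\xa9',
    # so the token cannot be mapped back to its original bytes, which is what
    # reduced_vocabulary() is complaining about.
    print(decoded.encode("utf-8"))        # b'\xef\xbf\xbd'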

You can resolve this issue with the following steps:

1. Clear the outlines cache: `import outlines.caching; outlines.caching.clear_cache()`
2. Install my fix branch: `pip install "git+https://github.com/lapp0/outlines@test-issue-820"`
3. Ensure you specify a `LlamaHFTokenizer` when creating `models.llamacpp`, as follows:

    import llama_cpp
    from outlines import models

    model = models.llamacpp(
        "Qwen/Qwen1.5-0.5B-Chat-GGUF",
        "*q8*.gguf",
        tokenizer=llama_cpp.llama_tokenizer.LlamaHFTokenizer.from_pretrained(
            "Qwen/Qwen1.5-0.5B-Chat"
        ),
    )
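
With the model constructed this way, the reproduction script from the top of this thread should then run through (reusing the `generate.choice` example above; `model` is the object created in step 3):

    from outlines import generate

    generator = generate.choice(model, ["skirt", "dress", "pen", "jacket"])
    answer = generator("Pick the odd word out: skirt, dress, pen, jacket")
    print(answer)  # one of the four choices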
