LLMEval crashing due to force unwrapping of optional in token decode for Gemma #50

ViRo3 · 2024-04-09T17:43:25Z

st.txt is the crash log.

Model : Gemma 2B Quantized
Prompt : "Write code to boot a raspberry pico"

davidkoski · 2024-04-09T19:23:04Z

This looks a bit like:

Gemma tokenizer, fix Unicode split huggingface/swift-transformers#52
BPETokenizer splits grapheme clusters rather than code points -- crashes in Gemma merges huggingface/swift-transformers#51

What version (hash) of swift-transformers are you using?

ViRo3 · 2024-04-10T03:40:48Z

74b9421

davidkoski · 2024-04-10T05:58:09Z

OK, that is the current head of main and includes that fix. Oh, interesting -- that particular prompt does reproduce it for me!

Here is the token in question:

(lldb) p model.convertIdToToken(235345)
(String?) nil

thought it looks like it is defined in the tokenizer.json:

      "#": 235345,

The problem is that there are two (ish) #'s:

egrep '122661|235345' tokenizer.json
      "#": 122661,
      "#": 235345,

though one of them has some extra unicode, in particular a ZERO WIDTH NO-BREAK SPACE:

egrep '122661|235345' tokenizer.json | od -c
0000000                            " 357 273 277   #   "   :       1   2
0000020    2   6   6   1   ,  \n                           "   #   "   :
0000040        2   3   5   3   4   5   ,  \n                            
0000051

It looks like the strings map to the same value on read and the tokenizer model loses the entry for 235345.

Blaizzy · 2024-04-10T07:57:01Z

@ViRo3 could install the latest swift-transformers from GH and let us know if the issue persists.

Btw this might be a huggingface swift-transformers issue and not MLX.

ViRo3 · 2024-04-10T13:10:35Z

@ViRo3 could install the latest swift-transformers from GH and let us know if the issue persists.

Btw this might be a huggingface swift-transformers issue and not MLX.

I have checked and it persists and is a swift-transformer issue so filed an issue there too.

davidkoski · 2024-04-11T05:06:56Z

I had a thought about how this might be fixed -- we might be able to make something along these lines:

struct CodePointString : StringProtocol {
    let value: String

    static func ==(lhs: CodePointString, rhs: CodePointString) {
        // this isn't actually comparable but this is the idea
        lhs.utf16 == rhs.utf16
    }

    func hash(hasher: inout Hasher) {
        hasher.combine(value.utf16)
    }

    ...
}

Then we load the config as [CodePointString:Any]. Something like this would let us treat the strings as code points (which is what python, etc. do) rather than unicode strings. Maybe :-)

ViRo3 mentioned this issue Apr 10, 2024

Tokenizer force unwraps an optional that's nil for Gemma huggingface/swift-transformers#88

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LLMEval crashing due to force unwrapping of optional in token decode for Gemma #50

LLMEval crashing due to force unwrapping of optional in token decode for Gemma #50

ViRo3 commented Apr 9, 2024

davidkoski commented Apr 9, 2024

ViRo3 commented Apr 10, 2024

davidkoski commented Apr 10, 2024

Blaizzy commented Apr 10, 2024

ViRo3 commented Apr 10, 2024

davidkoski commented Apr 11, 2024

LLMEval crashing due to force unwrapping of optional in token decode for Gemma #50

LLMEval crashing due to force unwrapping of optional in token decode for Gemma #50

Comments

ViRo3 commented Apr 9, 2024

davidkoski commented Apr 9, 2024

ViRo3 commented Apr 10, 2024

davidkoski commented Apr 10, 2024

Blaizzy commented Apr 10, 2024

ViRo3 commented Apr 10, 2024

davidkoski commented Apr 11, 2024