Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LLMEval crashing due to force unwrapping of optional in token decode for Gemma #50

Open
ViRo3 opened this issue Apr 9, 2024 · 6 comments

Comments

@ViRo3
Copy link

ViRo3 commented Apr 9, 2024

st.txt is the crash log.

Model : Gemma 2B Quantized
Prompt : "Write code to boot a raspberry pico"

@davidkoski
Copy link
Collaborator

@ViRo3
Copy link
Author

ViRo3 commented Apr 10, 2024

74b9421

@davidkoski
Copy link
Collaborator

OK, that is the current head of main and includes that fix. Oh, interesting -- that particular prompt does reproduce it for me!

Here is the token in question:

(lldb) p model.convertIdToToken(235345)
(String?) nil

thought it looks like it is defined in the tokenizer.json:

      "#": 235345,

The problem is that there are two (ish) #'s:

egrep '122661|235345' tokenizer.json
      "#": 122661,
      "#": 235345,

though one of them has some extra unicode, in particular a ZERO WIDTH NO-BREAK SPACE:

egrep '122661|235345' tokenizer.json | od -c
0000000                            " 357 273 277   #   "   :       1   2
0000020    2   6   6   1   ,  \n                           "   #   "   :
0000040        2   3   5   3   4   5   ,  \n                            
0000051

It looks like the strings map to the same value on read and the tokenizer model loses the entry for 235345.

@Blaizzy
Copy link

Blaizzy commented Apr 10, 2024

@ViRo3 could install the latest swift-transformers from GH and let us know if the issue persists.

Btw this might be a huggingface swift-transformers issue and not MLX.

@ViRo3
Copy link
Author

ViRo3 commented Apr 10, 2024

@ViRo3 could install the latest swift-transformers from GH and let us know if the issue persists.

Btw this might be a huggingface swift-transformers issue and not MLX.

I have checked and it persists and is a swift-transformer issue so filed an issue there too.

@davidkoski
Copy link
Collaborator

I had a thought about how this might be fixed -- we might be able to make something along these lines:

struct CodePointString : StringProtocol {
    let value: String

    static func ==(lhs: CodePointString, rhs: CodePointString) {
        // this isn't actually comparable but this is the idea
        lhs.utf16 == rhs.utf16
    }

    func hash(hasher: inout Hasher) {
        hasher.combine(value.utf16)
    }

    ...
}

Then we load the config as [CodePointString:Any]. Something like this would let us treat the strings as code points (which is what python, etc. do) rather than unicode strings. Maybe :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants