
llamafile v0.8.1

@jart released this 26 Apr 20:33 · 45 commits to main since this release · 2095d50
  • Support for Phi-3 Mini 4k has been introduced
  • A bug causing GPU module crashes on some systems has been resolved
  • Support for Command-R Plus has now been vetted with proper 64-bit indexing
  • We now support more AMD GPU architectures thanks to better detection of offload archs (#368)
  • We now ship prebuilt NVIDIA and ROCm modules for both Windows and Linux users. They link tinyBLAS, a libre math library that depends only on the graphics driver being installed. Since tinyBLAS is slower, llamafile will automatically build a native module for your system if the CUDA or ROCm SDK is installed. You can control this behavior using --nocompile or --recompile (see the sketch after this list). Even so, our LLaVA llamafile still manages to squeak under the Windows 4GB file size limit!
  • An assertion error that occurred when using llamafile-quantize to create K quants from an F32 GGUF file has been fixed (a usage sketch follows this list)
  • A new llamafile-tokenize command-line tool has been introduced. For example, if you want to count how many tokens are in a text file, you can say cat file.txt | llamafile-tokenize -m model.llamafile | wc -l, since it prints each token on its own line. A fuller example appears below.
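
Here is a minimal sketch of the GPU module controls mentioned above. The --nocompile and --recompile flags come from this release; the model filename and the -ngl value are placeholders:

    # Use the prebuilt tinyBLAS module and never compile a native one,
    # even if the CUDA or ROCm SDK is installed.
    ./llava-v1.5-7b-q4.llamafile --nocompile -ngl 999

    # Force llamafile to rebuild the native GPU module against the
    # locally installed SDK.
    ./llava-v1.5-7b-q4.llamafile --recompile -ngl 999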
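
And a hedged sketch of the quantize fix, assuming llamafile-quantize follows the usual llama.cpp quantize conventions (input GGUF, output GGUF, quant type); the filenames are placeholders:

    # Create a K quant from an F32 GGUF file; this path previously
    # tripped an assertion error.
    llamafile-quantize model-f32.gguf model-Q4_K_M.gguf Q4_K_M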
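
Finally, the token-counting one-liner from the last bullet, spelled out. The command itself is taken verbatim from this release; file.txt and model.llamafile are placeholders:

    # llamafile-tokenize prints one token per line, so wc -l yields
    # the total token count of file.txt.
    cat file.txt | llamafile-tokenize -m model.llamafile | wc -l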