Releases: Mozilla-Ocho/llamafile
llamafile v0.8.4
This release fixes underflows and overflows.
- A memory bug in the grammar parser has been fixed that caused commands like ./llamafile -m foo.gguf -p bar --grammar 'root::="' (which fails to specify a closing quote) to crash. Anyone using the server as a public facing endpoint (despite our previous recommendations) is strongly encouraged to upgrade. See 22aba95 and 3fe045f. Credit for discovering (and most importantly, reporting) this issue goes to Eclypsium Security Researcher Richard Johnson. We incorrectly reported earlier that this fix was incorporated into the v0.8.2 release. You need to use the v0.8.4 release. This bug fix was upstreamed in ggerganov/llama.cpp#7194.
- Our new vectorized expf() implementation now handles underflow by producing subnormals rather than flushing to zero (see b5c6df6).
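To make the difference between producing subnormals and flushing to zero concrete, here is a minimal Python sketch. It is only an illustration: it uses Python's double-precision math.exp rather than llamafile's single-precision vectorized expf(), and flush_to_zero and is_subnormal are hypothetical helpers written for this comparison.

```python
import math
import sys

# Smallest positive *normal* double; nonzero values below it are subnormal.
SMALLEST_NORMAL = sys.float_info.min


def is_subnormal(x: float) -> bool:
    """True when x is nonzero but too small to be a normal double."""
    return x != 0.0 and abs(x) < SMALLEST_NORMAL


def flush_to_zero(x: float) -> float:
    """Hypothetical helper: mimic a flush-to-zero policy by discarding subnormals."""
    return 0.0 if abs(x) < SMALLEST_NORMAL else x


# exp(-740) lies below the normal range of a double but is still representable
# as a subnormal, so gradual underflow keeps a tiny nonzero value.
gradual = math.exp(-740.0)
flushed = flush_to_zero(gradual)

print("subnormal result kept:", is_subnormal(gradual))
print("after flush-to-zero:  ", flushed)
```

With gradual underflow, very small exponentials keep some precision instead of collapsing to exactly zero.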
See these instructions for how to put the latest llamafile software into your old weights, without having to redownload. #24 (comment)
llamafile v0.8.2
llamafile lets you distribute and run LLMs with a single file
llamafile is a local LLM inference tool introduced by Mozilla Ocho in Nov 2023, which offers superior performance and binary portability to the stock installs of six OSes without needing to be installed. It features the best of llama.cpp and cosmopolitan libc while aiming to stay ahead of the curve by including the most cutting-edge performance and accuracy enhancements. What llamafile gives you is a fun web GUI chatbot, a turnkey OpenAI API compatible server, and a shell-scriptable CLI interface which together put you in control of artificial intelligence.
- This release introduces faster AVX2 prompt processing for K-quants and IQ4_XS (#394). This was contributed to llamafile by @ikawrakow, who originally invented K-quants last year: ggerganov/llama.cpp@99009e7. In prior releases we recommended the legacy Q4_0 quant since it was the simplest and most intuitive to get working with recent matmul optimizations. Thanks to Iwan Kawrakow's efforts, the best quants (e.g. Q5_K_M) will now go the fastest (on modern x86 systems).
- Text generation (or prediction) should now go slightly faster too, thanks to development work on matmul kernels and enhancements to thread synchronization (see 89c189e), which should be noticed most on CPUs with many cores running smaller models. macOS ARM users who are using CPU rather than Metal can expect to see the biggest boost, now that llamafile knows how to utilize all cores (see 6c45e3e).
- Bugs in the server /embedding endpoint have been fixed (see 0e2845a and 7900294). You can also now run llamafile --embedding -m model -p prompt to have embeddings printed to standard output (see 42bd9b8).
- This release synchronizes with the upstream llama.cpp project as of May 7th in 94d0940, which improves tokenization for Command-R, Refact, Olmo, and StarCoder. There's a new flash attention op that may be enabled for many models by passing the -fa flag. We haven't been able to include this in our prebuilt cuda/rocm binaries yet, so you may need to pass the --recompile flag to llamafile for GPU.
- This release introduces the --precise, --fast, and --trap flags, which control the execution of math. The --precise flag can slightly enhance the thinking of LLMs at the cost of some performance (see 2af3b88 and 9540b43). The --fast flag is included since it's unspecified which mode llamafile will use for any given situation (see bbae0f6 and b749326). The --trap flag can help you pinpoint the exact moment any NaNs appear (on CPUs that support this, e.g. most of x86), which is useful for troubleshooting. Additionally, a new vectorized expf() function has been introduced that enables llamafile to compute the exponential function faster and at full quality (see e2b3cb2). This matters because it's the function that powers SiLU and SoftMax, which are used by most of today's premier public models.
- Most of the CPU code in the GGML library now has optimal performance across different hardware architectures, thanks to new build system techniques. Features, specific options, or models that underperformed before may do better now (see 0bdea60 and c9d7393).
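To show why the exponential function sits on the hot path for SiLU and SoftMax, here is a minimal scalar Python sketch of the textbook definitions of both. This is not llamafile's vectorized code. The function names and the use of math.exp in place of expf() are assumptions made for illustration.

```python
import math
from typing import List


def silu(x: float) -> float:
    """SiLU (also called swish): x * sigmoid(x), one exp call per element."""
    if x >= 0.0:
        return x / (1.0 + math.exp(-x))
    # For negative x, use exp(x) so the argument never overflows.
    e = math.exp(x)
    return x * e / (1.0 + e)


def softmax(xs: List[float]) -> List[float]:
    """Softmax with the usual max subtraction, one exp call per element."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


print(silu(1.0))
print(softmax([1.0, 2.0, 3.0]))
```

Both functions call exp once per element, so a faster exp translates fairly directly into faster activations and attention.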
Additional fixes:
- a2d159e Fix server multimodal statistics (#392)
- aa8c01a Revert moondream vision language model support
- eecbf89 More conservative strong/em markdown matcher (#352)
- 38311f2 CUDA: CUDART < 11.7 workaround for __hmax, __hmax2
- 58d2ca0 Use qsort and set linkage to static for internal functions used for offload-arch-fix (#375)
- 4ee1e39 The PDF documentation in llamafile-0.8.2.zip is now fixed
- 4ee1e39 Remove warnings from cuda build
Additional notes:
- We're experiencing some instability with our Windows AMD GPU support. If you encounter crashes using the -ngl 999 flag on Windows, then try using the previous 0.8.1 release. Please also consider filing an issue, to report if it doesn't work, or better yet, please file an issue if it does work, since we otherwise have no way of knowing that (llamafile doesn't have telemetry because maximally respecting the user's privacy on their local machine is one of the stated goals of the project). You can also share details about your experience with us on the Mozilla AI Discord server.
See these instructions for how to put the latest llamafile software into your old weights, without having to redownload. #24 (comment)
llamafile v0.8.1
- Support for Phi-3 Mini 4k has been introduced
- A bug causing GPU module crashes on some systems has been resolved
- Support for Command-R Plus has now been vetted with proper 64-bit indexing
- We now support more AMD GPU architectures thanks to better detection of offload archs (#368)
- We now ship prebuilt NVIDIA and ROCm modules for both Windows and Linux users. They link tinyBLAS, which is a libre math library that only depends on the graphics driver being installed. Since it's slower, llamafile will automatically build a native module for your system if the CUDA or ROCm SDKs are installed. You can control this behavior using --nocompile or --recompile. Yes, our LLaVA llamafile still manages to squeak under the Windows 4GB file size limit!
- An assertion error has been fixed that happened when using llamafile-quantize to create K quants from an F32 GGUF file
- A new llamafile-tokenize command line tool has been introduced. For example, if you want to count how many "tokens" are in a text file, you can say cat file.txt | llamafile-tokenize -m model.llamafile | wc -l since it prints each token on a single line.
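The same token count can be scripted from Python. The sketch below assumes that llamafile-tokenize is on your PATH and that it accepts text on standard input with the -m flag, as in the pipeline from these notes. The helper names count_token_lines and count_tokens are hypothetical.

```python
import subprocess


def count_token_lines(tokenizer_output: bytes) -> int:
    """Count newline-terminated lines, mirroring what `wc -l` reports."""
    return tokenizer_output.count(b"\n")


def count_tokens(text_path: str, model_path: str) -> int:
    """Pipe a text file through llamafile-tokenize and count the lines it prints.

    Assumes the tool prints one token per line, as the release notes describe.
    """
    with open(text_path, "rb") as text_file:
        result = subprocess.run(
            ["llamafile-tokenize", "-m", model_path],
            stdin=text_file,
            capture_output=True,
            check=True,  # raise CalledProcessError if the tool exits non-zero
        )
    return count_token_lines(result.stdout)
```

For example, count_tokens("file.txt", "model.llamafile") would be the scripted counterpart of the shell one-liner.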
llamafile v0.8
llamafile is a local LLM inference tool introduced by Mozilla Ocho in Nov 2023, which offers superior performance and binary portability to the stock installs of six OSes without needing to be installed. llamafile goes 2x faster than llama.cpp and 25x faster than ollama for some use cases like CPU prompt evaluation. It has a fun web GUI chatbot, a turnkey OpenAI API compatible server, and a shell-scriptable CLI interface which together put you in control of artificial intelligence.
This release further improves performance and introduces support for new models.
- Support for LLaMA3 is now available
- Support for Grok has been introduced
- Support for Mixtral 8x22b has been introduced
- Support for Command-R models has been introduced
- MoE models (e.g. Mixtral, Grok) now go 2-5x faster on CPU 4db03a1
- F16 is now 20% faster on Raspberry Pi 5 (TinyLLaMA 1.1b prompt eval improved 62 -> 75 tok/sec)
- F16 is now 30% faster on Skylake (TinyLLaMA 1.1b prompt eval improved 171 -> 219 tok/sec)
- F16 is now 60% faster on Apple M2 (Mistral 7b prompt eval improved 79 -> 128 tok/sec)
- Add ability to override chat template in web gui when creating llamafiles da5cbe4
- Improve markdown and syntax highlighting in server (#88)
- CPU feature detection has been improved
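The rounded F16 percentages in the list above follow from the quoted tokens-per-second figures. The short Python sketch below reproduces that arithmetic; speedup_percent is a hypothetical helper name.

```python
def speedup_percent(before_tok_per_sec: float, after_tok_per_sec: float) -> float:
    """Relative throughput gain, expressed as a percentage of the old rate."""
    return (after_tok_per_sec - before_tok_per_sec) / before_tok_per_sec * 100.0


# Prompt eval figures (tok/sec, before -> after) quoted in the notes above.
benchmarks = {
    "Raspberry Pi 5, TinyLLaMA 1.1b F16": (62, 75),
    "Skylake, TinyLLaMA 1.1b F16": (171, 219),
    "Apple M2, Mistral 7b F16": (79, 128),
}

for name, (before, after) in benchmarks.items():
    print(f"{name}: {speedup_percent(before, after):.1f}% faster")
```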
Downloads
You can download prebuilt llamafiles from:
- https://huggingface.co/jartine (llamafiles quantized and compiled by us)
- https://huggingface.co/models?library=llamafile (llamafiles built by our user community)
Errata
- The new web gui chat template override feature isn't working as intended. If you want to use LLaMA3 8B then you need to manually copy and paste the chat templates from our README into the llamafile web GUI.
- The llamafile-quantize program may fail with an assertion error when K-quantizing weights from an F32 converted file. You can work around this by asking llama.cpp's convert.py script to output an FP16 GGUF file, and then running llamafile-quantize on that instead.
llamafile v0.7.4
llamafile v0.7.3
llamafile v0.7.2
llamafile v0.7.1
This release fixes bugs in the 0.7.0 release.
- Fix 2 embeddings-related issues in server.cpp (#324)
- Detect search query to start webchat (#333)
- Use LLAMAFILE_GPU_ERROR value -2 instead of -1 (#291)
- Fix --silent-prompt flag regression #328
- Clamp out of range values in K quantizer ef0307e
- Update to latest q5_k quantization code a8b0b15
- Change the file format magic number for the bf16 file format recently introduced in 0.7.0. This is a breaking change. It's due to a numbering conflict with the upstream project. We're still waiting on a permanent assignment for bfloat16, so this could potentially change again. Follow ggerganov/llama.cpp#6412 for updates.
Mixtral 8x22b and Grok support are not available in this release, but they are available if you build llamafile from source on the main branch at HEAD. We're currently dealing with an AMD Windows GPU support regression there. Once it's resolved, a 0.8 release will ship.
llamafile v0.7
This release improves the performance and accuracy of both CPU and GPU computations in addition to security.
- tinyBLAS now gives outputs consistent with cuBLAS, thanks to Kahan summation on matvec ops. This is good news for Windows users, because llamafile releases bundle tinyBLAS DLLs for driver-only GPU support. That support will now be faster and more accurate than before, thereby reducing the need to install the CUDA / ROCm SDKs yourself.
- Prompt evaluation now goes much faster on CPU. For example, f16 weights on Raspberry Pi 5 are now 8x faster. These new optimizations mostly apply to F16, BF16, Q8_0, Q4_0, and F32 weights. Depending on the hardware and weights being used, we've observed llamafile-0.7 going anywhere between 30% and 500% faster than llama.cpp upstream.
- Support for the bf16 data type, which is the Google Brain floating point format, has been introduced for CPU only.
- Support for AVX512 has been introduced. Owners of CPUs like Zen4 can expect to see 10x faster prompt eval times.
- If you want to run llamafile-0.7 [...] --recompile --gpu amd on Windows, this release requires that you use version 5.7+ of the ROCm HIP SDK, which may be downloaded here.
- This release includes a security fix for CVE-2024-23496 (see #294).
- This release is synced with llama.cpp 2024-03-22 upstream.
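To illustrate the Kahan summation credited above for tinyBLAS matvec accuracy, here is a minimal Python sketch of a compensated dot product, the inner loop of a matrix-vector multiply. It is the textbook algorithm, not tinyBLAS's GPU code. The function names and test data are assumptions chosen to make the rounding error visible.

```python
import math
from typing import Sequence


def naive_dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain accumulation: small terms can be lost against a large running total."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total


def kahan_dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Kahan (compensated) accumulation: carries the rounding error forward."""
    total = 0.0
    compensation = 0.0  # running estimate of the low-order bits lost so far
    for x, y in zip(a, b):
        term = x * y - compensation
        new_total = total + term
        # (new_total - total) is what was actually added; subtracting term
        # leaves the rounding error, which is fed into the next iteration.
        compensation = (new_total - total) - term
        total = new_total
    return total


# One large entry followed by many entries too small to register individually.
row = [1.0] + [1e-16] * 10_000
vec = [1.0] * len(row)

print("naive:    ", naive_dot(row, vec))
print("kahan:    ", kahan_dot(row, vec))
print("reference:", math.fsum(row))
```

The naive loop drops every small term, while the compensated loop stays close to the correctly rounded reference from math.fsum.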
llamafile v0.6.2
This release synchronizes with llama.cpp upstream and polishes GPU
auto-configuration. Support for splitting a model onto multiple NVIDIA
GPUs has been restored.
- dfd3335 Synchronize with llama.cpp 2024-01-27
- c008e43 Synchronize with llama.cpp 2024-01-26
- e34b35c Make GPU auto configuration more resilient
- 79b88f8 Sanitize -ngl flag on Apple Metal
There's a known issue with support for splitting onto multiple AMD GPUs,
which currently doesn't work. This is an upstream issue we're working to
solve. The workaround is to set HIP_VISIBLE_DEVICES=0 in your environment
(export HIP_VISIBLE_DEVICES=0) when running llamafile, so it'll only see
the first GPU.
Example llamafiles
Our llamafiles on Hugging Face are updated shortly after a release goes live.
Flagship models
Supreme models (highest-end consumer hardware)
- https://hf.co/jartine/Mixtral-8x7B-Instruct-v0.1-llamafile
- https://hf.co/jartine/WizardCoder-Python-34B-V1.0-llamafile
Tiny models (small enough to use on raspberry pi)
- https://hf.co/jartine/phi-2-llamafile
- https://hf.co/jartine/rocket-3B-llamafile
- https://hf.co/jartine/TinyLlama-1.1B-Chat-v1.0-GGUF
Other models:
- https://hf.co/jartine/jartine/wizardcoder-13b-python
- https://hf.co/jartine/jartine/Nous-Hermes-Llama2-llamafile
- https://hf.co/jartine/jartine/dolphin-2.5-mixtral-8x7b-llamafile
If you have a slow Internet connection and want to update your llamafiles without needing to redownload, then see the instructions here: #24 (comment). You can also download llamafile-0.6.2 and simply say ./llamafile-0.6.2 -m old.llamafile to run your old weights.