Releases: b4rtaz/distributed-llama
0.9.1
0.9.0
This version introduces breaking changes in the tokenizer format. The tokenizer file now contains the whole chat template in the HuggingFace format.
Breaking changes
You need to regenerate your tokenizer. The tokenizer from the 0.8.0 version won't work with the 0.9.0 version.
0.8.0
This version introduces a new tokenizer format that includes configuration for chat functionality. With this update, Distributed Llama can support various models that operate in chat mode. The new tokenizer is required for these modes:
dllama chat ...
dllama-api ...
Breaking changes
The above change requires the tokenizer file to be regenerated. For Llama 3, you need to rerun the convert-tokenizer-llama3.py script. For other models, the process is a bit more complicated; please refer to this post for detailed instructions.
0.7.4
0.7.3
0.7.2
- fix: chunked stream; close the stream without ECONNRESET.
This version fixes a problem with the chunked stream in the dllama-api
server. It's now possible to connect to the server using, for example, the AnythingLLM client. Note that the server still supports only Llama 3 Instruct.
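The dllama-api server streams chat completions as chunked server-sent events in the OpenAI style. As a minimal sketch, here is how such a stream can be reassembled on the client side; the payload below is illustrative, and the exact fields emitted by dllama-api may differ:

```python
import json

# Illustrative SSE payload in the OpenAI streaming format (assumption:
# dllama-api emits "data: {...}" lines ending with a "[DONE]" sentinel).
raw_stream = (
    'data: {"choices": [{"delta": {"content": "Hello"}}]}\n\n'
    'data: {"choices": [{"delta": {"content": " world"}}]}\n\n'
    'data: [DONE]\n\n'
)

def collect_stream(raw: str) -> str:
    """Concatenate the content deltas from an OpenAI-style SSE stream."""
    parts = []
    for line in raw.splitlines():
        if not line.startswith("data: "):
            continue  # skip blank separator lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        parts.append(chunk["choices"][0]["delta"].get("content", ""))
    return "".join(parts)

print(collect_stream(raw_stream))  # -> Hello world
```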
0.7.1
This version introduces a new approach to the synchronization of the MLP layers, which reduces the transfer size per token by roughly 40% (see the table below).
Big thanks to @zhengpeirong for the assistance!
Model: Llama 3 8B Q40, Q80 Buffer
| Devices | Transfer size / token (0.5.0) | Transfer size / token (0.7.1) | Change |
|---|---|---|---|
| 2 devices | S 510 kB + R 442 kB = 952 kB | S 272 kB + R 272 kB = 544 kB | -42.8% |
| 4 devices | S 1887 kB + R 867 kB = 2754 kB | S 816 kB + R 816 kB = 1632 kB | -40.7% |
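The percentage column follows directly from the send/receive totals in the table; a quick check of the arithmetic:

```python
# Totals per token from the table above (send S + receive R, in kB).
before = {"2 devices": 510 + 442, "4 devices": 1887 + 867}
after = {"2 devices": 272 + 272, "4 devices": 816 + 816}

for devices in before:
    change = (after[devices] - before[devices]) / before[devices] * 100
    print(f"{devices}: {before[devices]} kB -> {after[devices]} kB ({change:.2f}%)")
```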
Here are the results of tests comparing previous versions.
0.7.0
This version introduces a new model converter, convert-hf.py, that supports the HuggingFace .safetensors format. The new converter supports three model types: llama, mistral, and mixtral. From now on, many models that use these architectures can be easily converted to the Distributed Llama format.
Successfully tested the new converter on:
To convert a model you need to run:
python3 convert-hf.py path/to/TinyLlama-1.1B q40 tinylama
Then you also need to convert the tokenizer:
python3 convert-tokenizer-sentencepiece.py path/to/tokenizer.model tinylama
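The .safetensors container that the new converter reads has a simple layout: an 8-byte little-endian length, followed by a JSON header that maps tensor names to their dtype, shape, and data offsets, followed by the raw tensor data. A minimal sketch of inspecting such a file (the helper name is hypothetical; convert-hf.py does its own parsing):

```python
import json
import struct

def read_safetensors_header(path: str) -> dict:
    """Read the JSON header of a .safetensors file.

    The first 8 bytes hold a little-endian uint64 with the header length;
    the header itself maps tensor names to dtype, shape, and data offsets.
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))
```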
0.6.1
0.6.0
This version renames the main application to dllama. From now on, to run the root node or a worker, you need to compile and run the dllama application.
make dllama
./dllama inference --model ... --tokenizer ...
This version also introduces an early-stage HTTP API compatible with the OpenAI API (only the /v1/chat/completions
endpoint). You can find instructions on how to run the API here. A big shout out to @DifferentialityDevelopment for implementing this feature. #39
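Because the endpoint mirrors the OpenAI chat completions API, any OpenAI-style payload should work. A minimal sketch of building such a request with only the standard library; the host, port, and model name are assumptions, so adjust them to your dllama-api setup:

```python
import json
import urllib.request

# Assumed address; adjust to wherever your dllama-api server is listening.
API_URL = "http://localhost:9990/v1/chat/completions"

def build_chat_request(user_message: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for dllama-api."""
    payload = {
        "model": "llama3",  # hypothetical model name; the server may ignore it
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To actually send the request (requires a running dllama-api server):
# with urllib.request.urlopen(build_chat_request("Hello!")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```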