
Releases: b4rtaz/distributed-llama

0.9.1

01 Jun 14:22
08b4bcf

The --weights-float-type argument is now optional for models converted with the converter from version 0.9.1 or above.
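For example, a model converted with the 0.9.1 converter can be run without the flag (file names below are illustrative, other arguments omitted):

./dllama inference --model dllama_model_llama3_q40.m --tokenizer dllama_tokenizer_llama3.t ...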

0.9.0

01 Jun 12:18
a5e0445

This version introduces breaking changes in the tokenizer format. The tokenizer now contains the whole chat template in the Hugging Face format.

Breaking changes

You need to regenerate your tokenizer. A tokenizer generated with version 0.8.0 won't work with version 0.9.0.
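For reference, a chat template in the Hugging Face format is a Jinja expression stored alongside the tokenizer configuration. A simplified sketch (real templates shipped with models are more elaborate) could look like this:

{% for message in messages %}<|start_header_id|>{{ message['role'] }}<|end_header_id|>\n\n{{ message['content'] }}<|eot_id|>{% endfor %}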

0.8.0

31 May 20:49
6eccd30

This version introduces a new tokenizer format that includes configuration for chat functionality. With this update, Distributed Llama can support various models that operate in chat mode. The new tokenizer is required for these modes (see the example after the list):

  • dllama chat ...
  • dllama-api ...
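For example, assuming a regenerated tokenizer and an already converted model (file names below are illustrative), chat mode can be started like this:

./dllama chat --model dllama_model_llama3_q40.m --tokenizer dllama_tokenizer_llama3.t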

Breaking changes

The above change requires the tokenizer file to be regenerated. For Llama 3, you need to rerun the convert-tokenizer-llama3.py script. For other models, the process is a bit more complicated; please refer to this post for detailed instructions.
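A minimal sketch of the regeneration step for Llama 3 (the path is illustrative; check the script's usage for the exact arguments):

python3 convert-tokenizer-llama3.py path/to/Meta-Llama-3-8B-Instruct/tokenizer.model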

0.7.4

29 May 22:15
dc997b4

dllama-api:

  • Resolved a problem where an unwanted <|eot_id|> was appended to the response.
  • Introduced a naive cache that speeds up inference in chat clients like AnythingLLM (demo).

0.7.3

27 May 21:15

This version adds Windows support. 🎉🎉🎉 Thanks @DifferentialityDevelopment!

Additionally, this version introduces a limit on the number of nodes: nSlices <= nKvHeads. For example, a model with 8 KV heads can be split across at most 8 nodes. More details here.

0.7.2

26 May 09:36
ef1e312
  • fix: chunked stream, close stream without econnreset.

This version fixes a problem with the chunked stream in the dllama-api server. It is now possible to connect to the server using, for example, the AnythingLLM client. The server still supports only Llama 3 Instruct.

0.7.1

25 May 09:30
b4b3842

This version introduces a new approach to the synchronization of the MLP layers, which significantly reduces the transfer size per token.

Big thanks to @zhengpeirong for the assistance!


Model: Llama 3 8B Q40, Q80 Buffer

Devices     Transfer size / token (0.5.0)      Transfer size / token (0.7.1)     Percentage change
2 devices   S 510 kB + R 442 kB = 952 kB       S 272 kB + R 272 kB = 544 kB      -42.8%
4 devices   S 1887 kB + R 867 kB = 2754 kB     S 816 kB + R 816 kB = 1632 kB     -40.7%
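The percentage change follows from the totals, e.g. for 4 devices: (1632 - 2754) / 2754 ≈ -40.7%.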

Here are the results of tests comparing previous versions.

0.7.0

24 May 19:45
2e523f6

This version introduces a new model converter that supports the Hugging Face .safetensors format: convert-hf.py. The new converter supports three model types: llama, mistral, and mixtral. From now on, many models that use these architectures can be easily converted to the Distributed Llama format.

Successfully tested the new converter on:

To convert a model, you need to run:

python3 convert-hf.py path/to/TinyLlama-1.1B q40 tinylama

Then you also need to convert the tokenizer:

python3 convert-tokenizer-sentencepiece.py path/to/tokenizer.model tinylama
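After both steps, the converted model can be run with dllama (the output file names depend on the converter and are illustrative here):

./dllama inference --model dllama_model_tinylama_q40.m --tokenizer dllama_tokenizer_tinylama.t ...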

0.6.1

23 May 16:43
9a1e284
  • fix: use non-blocking sockets.

0.6.0

19 May 19:25

This version changes the name of the main application to dllama. From now on, to run the root node or a worker, you need to compile and run the dllama application.

make dllama
./dllama inference --model ... --tokenizer ...

This version also introduces an early-stage HTTP API compatible with the OpenAI API (only the /v1/chat/completions endpoint). You can find instructions on how to run the API here. A big shout-out to @DifferentialityDevelopment for implementing this feature. #39
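Once the API server is running, a minimal request sketch looks like this (the host and port are illustrative assumptions; the request body follows the OpenAI chat completions format):

curl http://localhost:9990/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3", "messages": [{"role": "user", "content": "Hello!"}]}'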