
Releases: b4rtaz/distributed-llama

0.9.1

01 Jun 14:22
08b4bcf

The --weights-float-type argument is now optional for models converted with the converter from version 0.9.1 or above.
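For example, a model converted with the 0.9.1 converter can be run without the flag (file names below are illustrative, other arguments omitted):

./dllama inference --model dllama_model_llama3_q40.m --tokenizer dllama_tokenizer_llama3.t ...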

0.9.0

01 Jun 12:18
a5e0445

This version introduces breaking changes in the tokenizer format. The tokenizer now contains the whole chat template in the Hugging Face format.

Breaking changes

You need to regenerate your tokenizer. A tokenizer generated with version 0.8.0 won't work with version 0.9.0.
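For reference, a chat template in the Hugging Face format is a Jinja expression stored alongside the tokenizer configuration. A simplified sketch (real templates shipped with models are more elaborate) could look like this:

{% for message in messages %}<|start_header_id|>{{ message['role'] }}<|end_header_id|>\n\n{{ message['content'] }}<|eot_id|>{% endfor %}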

0.8.0

31 May 20:49
6eccd30

This version introduces a new tokenizer format that includes configuration for chat functionality. With this update, Distributed Llama can support various models that operate in chat mode. The new tokenizer is required for these modes (see the example after the list):

  • dllama chat ...
  • dllama-api ...
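For example, assuming a regenerated tokenizer and an already converted model (file names below are illustrative), chat mode can be started like this:

./dllama chat --model dllama_model_llama3_q40.m --tokenizer dllama_tokenizer_llama3.t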

Breaking changes

The above change requires the tokenizer file to be regenerated. For Llama 3, you need to rerun the convert-tokenizer-llama3.py script. For other models, the process is a bit more complicated; please refer to this post for detailed instructions.
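A minimal sketch of the regeneration step for Llama 3 (the path is illustrative; check the script's usage for the exact arguments):

python3 convert-tokenizer-llama3.py path/to/Meta-Llama-3-8B-Instruct/tokenizer.model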

0.7.4

29 May 22:15
dc997b4

dllama-api:

  • Resolved a problem where an unwanted <|eot_id|> was appended to the response.
  • Introduced a naive cache that speeds up inference in chat clients like AnythingLLM (demo).

0.7.3

27 May 21:15

This version adds Windows support. 🎉🎉🎉 Thanks @DifferentialityDevelopment!

Additionally, this version introduces a limit on the number of nodes: nSlices <= nKvHeads. For example, a model with 8 KV heads can be split across at most 8 nodes. More details here.

0.7.2

26 May 09:36
ef1e312
  • fix: chunked stream, close stream without econnreset.

This version fixes a problem with the chunked stream in the dllama-api server. It is now possible to connect to the server using, for example, the AnythingLLM client. The server still supports only Llama 3 Instruct.

0.7.1

25 May 09:30
b4b3842

This version introduces a new approach to the synchronization of the MLP layers, which significantly reduces the transfer size per token.

Big thanks to @zhengpeirong for the assistance!


Model: Llama 3 8B Q40, Q80 Buffer

Devices     Transfer size / token (0.5.0)      Transfer size / token (0.7.1)     Percentage change
2 devices   S 510 kB + R 442 kB = 952 kB       S 272 kB + R 272 kB = 544 kB      -42.8%
4 devices   S 1887 kB + R 867 kB = 2754 kB     S 816 kB + R 816 kB = 1632 kB     -40.7%
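The percentage change follows from the totals, e.g. for 4 devices: (1632 - 2754) / 2754 ≈ -40.7%.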

Here are the results of tests comparing previous versions.

0.7.0

24 May 19:45
2e523f6

This version introduces a new model converter that supports the Hugging Face .safetensors format: convert-hf.py. The new converter supports three model types: llama, mistral, and mixtral. From now on, many models that use these architectures can be easily converted to the Distributed Llama format.

Successfully tested the new converter on:

To convert a model, you need to run:

python3 convert-hf.py path/to/TinyLlama-1.1B q40 tinylama

Then you also need to convert the tokenizer:

python3 convert-tokenizer-sentencepiece.py path/to/tokenizer.model tinylama
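After both steps, the converted model can be run with dllama (the output file names depend on the converter and are illustrative here):

./dllama inference --model dllama_model_tinylama_q40.m --tokenizer dllama_tokenizer_tinylama.t ...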

0.6.1

23 May 16:43
9a1e284
  • fix: use non-blocking sockets.

0.6.0

19 May 19:25

This version changes the name of the main application to dllama. From now on, to run the root node or a worker, you need to compile and run the dllama application.

make dllama
./dllama inference --model ... --tokenizer ...

This version also introduces an early-stage HTTP API compatible with the OpenAI API (only the /v1/chat/completions endpoint). You can find instructions on how to run the API here. A big shout-out to @DifferentialityDevelopment for implementing this feature. #39
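Once the API server is running, a minimal request sketch looks like this (the host and port are illustrative assumptions; the request body follows the OpenAI chat completions format):

curl http://localhost:9990/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3", "messages": [{"role": "user", "content": "Hello!"}]}'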