
Hi, do you know why the synchronization time suddenly increases from 4 Pis to 8 Pis? #20

Open
yuezhan0721 opened this issue Apr 6, 2024 · 15 comments

@yuezhan0721

No description provided.

@b4rtaz
Owner

b4rtaz commented Apr 6, 2024

Hello, what router are you using?

@yuezhan0721
Author

> Hello, what router are you using?

Thank you for your reply. It's just based on the results in your first table. For example, for Llama 2 7B the time goes from 34.06 ms to 289.75 ms, an increase of nearly 10x. What factors do you think restrict the communication between devices?

@b4rtaz
Owner

b4rtaz commented Apr 7, 2024

Could you put a link to these results? Normally, the synchronization time is very similar during inference, like here.

@yuezhan0721
Author

> Could you put a link to these results? Normally, the synchronization time is very similar during inference, like here.

Thanks, like here:

[screenshot: QQ截图20240407175014]

@b4rtaz
Owner

b4rtaz commented Apr 7, 2024

I was wondering about it as well. I suppose in this case the problem may be a weak router/switch. I used a cheap TP-Link LS1008G switch. It may slow down under heavy load.

The other thing is that the amount of data required to synchronize doesn't grow linearly with the number of parameters (7B, 13B, 70B). What matters most are model parameters like the number of blocks (7B: 32, 70B: 80) or the length of the "dim" vector, etc.

For example, Llama 2 70B on 4 devices with a Q80 buffer requires 14917 kB to synchronize the state. Grok-1 314B needs only 21013 kB, even though it's 4.4x larger (!).

Also, if you look at this report, where the link between nodes was highly efficient, the transfer time on 2 and 4 devices is similar for the 70B model, even though the amount of data is almost 3x larger for 4 devices (28.50 ms, 14917 kB) than for 2 devices (25.00 ms, 5525 kB).
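
To illustrate the scaling, here is a rough back-of-envelope sketch (not the repo's actual formula; the "two dim-sized vectors per block" factor and the Q80 byte cost are illustrative assumptions only):

```cpp
// Rough sketch: per-token sync data scales with nLayers * dim, not with the
// total parameter count. The "2 vectors per block" factor and the Q80 cost
// (~34 bytes per 32 values) are illustrative assumptions, not the repo's code.
#include <cstdio>

int main() {
    const double q80BytesPerValue = 34.0 / 32.0;

    struct Model { const char* name; int nLayers; int dim; long long params; };
    const Model models[] = {
        { "Llama 2 7B",  32, 4096,  7000000000LL },
        { "Llama 2 70B", 80, 8192, 70000000000LL },
    };

    for (const Model& m : models) {
        double kb = 2.0 * m.nLayers * m.dim * q80BytesPerValue / 1024.0;
        printf("%-12s ~%5.0f kB/token (params: %lld)\n", m.name, kb, m.params);
    }
    // Under these assumptions the 70B model syncs only ~5x more data per token
    // despite having ~10x the parameters.
    return 0;
}
```

Under these assumptions the sync data grows with nLayers and dim, which is why it doesn't track the total parameter count.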

@yuezhan0721
Author

> I was wondering about it as well. I suppose in this case the problem may be a weak router/switch. I used a cheap TP-Link LS1008G switch. It may slow down under heavy load.
>
> The other thing is that the amount of data required to synchronize doesn't grow linearly with the number of parameters (7B, 13B, 70B). What matters most are model parameters like the number of blocks (7B: 32, 70B: 80) or the length of the "dim" vector, etc.
>
> For example, Llama 2 70B on 4 devices with a Q80 buffer requires 14917 kB to synchronize the state. Grok-1 314B needs only 21013 kB, even though it's 4.4x larger (!).
>
> Also, if you look at this report, where the link between nodes was highly efficient, the transfer time on 2 and 4 devices is similar for the 70B model, even though the amount of data is almost 3x larger for 4 devices (28.50 ms, 14917 kB) than for 2 devices (25.00 ms, 5525 kB).

Thanks for your explanation.

@zhengpeirong

> I was wondering about it as well. I suppose in this case the problem may be a weak router/switch. I used a cheap TP-Link LS1008G switch. It may slow down under heavy load.
>
> The other thing is that the amount of data required to synchronize doesn't grow linearly with the number of parameters (7B, 13B, 70B). What matters most are model parameters like the number of blocks (7B: 32, 70B: 80) or the length of the "dim" vector, etc.
>
> For example, Llama 2 70B on 4 devices with a Q80 buffer requires 14917 kB to synchronize the state. Grok-1 314B needs only 21013 kB, even though it's 4.4x larger (!).
>
> Also, if you look at this report, where the link between nodes was highly efficient, the transfer time on 2 and 4 devices is similar for the 70B model, even though the amount of data is almost 3x larger for 4 devices (28.50 ms, 14917 kB) than for 2 devices (25.00 ms, 5525 kB).

Thank you. Have you run any experiments on a high-end switch, e.g. a Google Cloud service? If the result supports the poor-switch hypothesis, then the bottleneck of this repo is not communication overhead.

@b4rtaz
Owner

b4rtaz commented Apr 12, 2024

@zhengpeirong please check this report (4 x c3d-highcpu-30 / Google Cloud).

For Llama 7B / Q40 Weights Q80 Buffer I got:

| Metric            | 1 VM    | 2 VMs   | 4 VMs    |
|-------------------|---------|---------|----------|
| Avg transfer time | 0.19 ms | 7.62 ms | 12.81 ms |

The data needed for the synchronization per 1 token (Q80 Buffer):

| Model      | 2 devices | 4 devices | 8 devices |
|------------|-----------|-----------|-----------|
| Llama 2 7B | 1112 kB   | 2830 kB   | 6008 kB   |

So yes, if you have a fast enough link between nodes, the communication is not the bottleneck.

Btw: a USB4 link may achieve 10 Gbps. Google Cloud is much, much slower than this.
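
As a sanity check of those numbers, here is a small back-of-envelope calculation (my arithmetic, not output from the repo) of the effective throughput implied by the tables above:

```cpp
// Effective throughput implied by the figures above: bytes synced per token
// divided by the average transfer time per token (Llama 2 7B, Q80 buffer).
#include <cstdio>

int main() {
    struct Run { const char* label; double syncKB; double transferMs; };
    const Run runs[] = {
        { "2 VMs", 1112.0,  7.62 },
        { "4 VMs", 2830.0, 12.81 },
    };
    for (const Run& r : runs) {
        double mbPerSec = (r.syncKB / 1024.0) / (r.transferMs / 1000.0);
        double gbitPerSec = mbPerSec * 1024.0 * 1024.0 * 8.0 / 1e9;
        printf("%s: ~%.0f MB/s (~%.1f Gbit/s) effective throughput\n",
               r.label, mbPerSec, gbitPerSec);
    }
    return 0;
}
```

This works out to roughly 1 to 2 Gbit/s of effective throughput between the GCP nodes, well below what a USB4 link could provide.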

@zhengpeirong

> @zhengpeirong please check this report (4 x c3d-highcpu-30 / Google Cloud).
>
> For Llama 7B / Q40 Weights Q80 Buffer I got:
>
> | Metric            | 1 VM    | 2 VMs   | 4 VMs    |
> |-------------------|---------|---------|----------|
> | Avg transfer time | 0.19 ms | 7.62 ms | 12.81 ms |
>
> The data needed for the synchronization per 1 token (Q80 Buffer):
>
> | Model      | 2 devices | 4 devices | 8 devices |
> |------------|-----------|-----------|-----------|
> | Llama 2 7B | 1112 kB   | 2830 kB   | 6008 kB   |
>
> So yes, if you have a fast enough link between nodes, the communication is not the bottleneck.
>
> Btw: a USB4 link may achieve 10 Gbps. Google Cloud is much, much slower than this.

Thank you!! I have checked the specifications of the switch:

| Spec                 | Value  |
|----------------------|--------|
| Packet Buffer Memory | 1.5 Mb |
| Jumbo Frames         | 16 KB  |

Although I don't know the switch's internals, this "Packet Buffer Memory" may be the key challenge for communication when it's smaller than the size of the synchronization data.
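
For a rough sense of scale, here is my arithmetic using the figures from this thread (assuming "1.5 Mb" means 1.5 megabits; whether the buffer actually limits throughput depends on how the switch forwards traffic under load):

```cpp
// Compare the switch's packet buffer (1.5 Mbit) with the data synchronized per
// token (Llama 2 7B on 4 devices, 2830 kB from the table above). This only
// compares sizes, it does not model the switch's forwarding behaviour.
#include <cstdio>

int main() {
    double bufferKB = 1.5 * 1000.0 / 8.0;   // 1.5 Mbit ≈ 187.5 KB
    double syncPerTokenKB = 2830.0;          // Llama 2 7B, 4 devices, Q80 buffer
    printf("Packet buffer ~%.1f KB, sync per token %.0f kB (~%.0fx larger)\n",
           bufferKB, syncPerTokenKB, syncPerTokenKB / bufferKB);
    return 0;
}
```

So the data moved per token is roughly 15x the size of the switch's packet buffer.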

@zhengpeirong

@b4rtaz I recommend synchronizing more often with smaller chunks instead of the whole QKV or FFN result. This should improve the transfer time by keeping each transfer small enough for a low-end switch to handle.
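
A minimal sketch of what chunked sending could look like (hypothetical helper names, not distributed-llama's actual socket code):

```cpp
// Sketch of the chunking idea: instead of sending one large buffer per
// synchronization step, send it in smaller pieces so each piece stays within
// what a low-end switch buffers comfortably. sendBytes() is a hypothetical
// stand-in for a socket write.
#include <cstddef>
#include <cstdio>
#include <vector>

static void sendBytes(const char* data, size_t size) {
    (void)data;
    printf("send %zu bytes\n", size);
}

static void sendChunked(const char* buffer, size_t totalSize, size_t chunkSize) {
    for (size_t offset = 0; offset < totalSize; offset += chunkSize) {
        size_t n = totalSize - offset < chunkSize ? totalSize - offset : chunkSize;
        sendBytes(buffer + offset, n);
    }
}

int main() {
    std::vector<char> buffer(100 * 1024);                  // e.g. one 100 KB sync buffer
    sendChunked(buffer.data(), buffer.size(), 16 * 1024);  // sent in 16 KB chunks
    return 0;
}
```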

@b4rtaz
Owner

b4rtaz commented Apr 28, 2024

Today, I tried a minor adjustment to the order of synchronization:

(current) matmul Q -> matmul K -> matmul V -> sync Q -> sync K -> sync V

VS

(new)     matmul Q -> sync Q -> matmul K -> sync K -> matmul V -> sync V
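
A minimal, self-contained sketch of the idea (not the actual distributed-llama task loop; the function names are stand-ins):

```cpp
// Current order: compute all three matmuls, then run all three syncs.
// New order: synchronize each result right after it is computed, so each
// transfer is smaller and can start earlier.
#include <cstdio>

typedef void (*Step)();

static void matmulQ() { printf("matmul Q\n"); }
static void matmulK() { printf("matmul K\n"); }
static void matmulV() { printf("matmul V\n"); }
static void syncQ()   { printf("sync Q\n"); }
static void syncK()   { printf("sync K\n"); }
static void syncV()   { printf("sync V\n"); }

int main() {
    Step current[] = { matmulQ, matmulK, matmulV, syncQ, syncK, syncV };
    Step updated[] = { matmulQ, syncQ, matmulK, syncK, matmulV, syncV };

    for (Step s : current) s();
    printf("---\n");
    for (Step s : updated) s();
    return 0;
}
```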

Setup: 4 x Raspberry Pi 5 8GB, Llama 3 8B Q40, Q80 Buffer, TP-Link LS1008G Switch.

Results:

Edit: I'm hiding these results because they contain an error; check the discussion below.

current

Test 1:

⏩ Loaded 6323781632 bytes
🔶 G  705 ms I  583 ms T  118 ms S 2877687 kB R    714 kB The
🔶 G  352 ms I  251 ms T  101 ms S   2295 kB R    714 kB  E
🔶 G  407 ms I  278 ms T  129 ms S   2295 kB R    714 kB iff
🔶 G  398 ms I  285 ms T  113 ms S   2295 kB R    714 kB el
🔶 G  394 ms I  282 ms T  112 ms S   2295 kB R    714 kB  Tower
🔶 G  386 ms I  276 ms T  110 ms S   2295 kB R    714 kB  is
🔶 G  377 ms I  273 ms T  103 ms S   2295 kB R    714 kB  a
...
🔶 G  447 ms I  392 ms T   54 ms S   2295 kB R    714 kB  
🔶 G  428 ms I  354 ms T   72 ms S   2295 kB R    714 kB 108
🔶 G  425 ms I  356 ms T   67 ms S   2295 kB R    714 kB 0
🔶 G  384 ms I  318 ms T   64 ms S   2295 kB R    714 kB  steps
Generated tokens:    64
Avg generation time: 405.23 ms
Avg inference time:  326.27 ms
Avg transfer time:   77.47 ms

Test 2:

⏩ Loaded 6323781632 bytes
🔶 G  407 ms I  292 ms T  110 ms S 2877687 kB R    714 kB The
🔶 G  351 ms I  249 ms T  102 ms S   2295 kB R    714 kB  E
🔶 G  393 ms I  247 ms T  146 ms S   2295 kB R    714 kB iff
🔶 G  386 ms I  243 ms T  143 ms S   2295 kB R    714 kB el
🔶 G  347 ms I  246 ms T  101 ms S   2295 kB R    714 kB  Tower
🔶 G  339 ms I  246 ms T   93 ms S   2295 kB R    714 kB  is
...
🔶 G  426 ms I  353 ms T   71 ms S   2295 kB R    714 kB .
🔶 G  406 ms I  348 ms T   57 ms S   2295 kB R    714 kB  It
🔶 G  429 ms I  361 ms T   66 ms S   2295 kB R    714 kB  is
🔶 G  423 ms I  345 ms T   76 ms S   2295 kB R    714 kB  the
Generated tokens:    64
Avg generation time: 400.53 ms
Avg inference time:  322.92 ms
Avg transfer time:   76.03 ms

new

Commit: d5b8354

Test 1:

⏩ Loaded 6323781632 bytes
🔶 G  425 ms I  308 ms T  114 ms S 2877687 kB R    714 kB The
🔶 G  355 ms I  263 ms T   92 ms S   2295 kB R    714 kB  E
🔶 G  337 ms I  259 ms T   78 ms S   2295 kB R    714 kB iff
🔶 G  338 ms I  259 ms T   79 ms S   2295 kB R    714 kB el
🔶 G  362 ms I  269 ms T   93 ms S   2295 kB R    714 kB  Tower
🔶 G  377 ms I  259 ms T  118 ms S   2295 kB R    714 kB  is
🔶 G  339 ms I  260 ms T   77 ms S   2295 kB R    714 kB  a
...
🔶 G  425 ms I  369 ms T   54 ms S   2295 kB R    714 kB  and
🔶 G  401 ms I  356 ms T   44 ms S   2295 kB R    714 kB  was
🔶 G  403 ms I  346 ms T   56 ms S   2295 kB R    714 kB  completed
🔶 G  422 ms I  338 ms T   82 ms S   2295 kB R    714 kB  in
Generated tokens:    64
Avg generation time: 384.66 ms
Avg inference time:  318.31 ms
Avg transfer time:   64.97 ms

Test 2:

⏩ Loaded 6323781632 bytes
🔶 G  374 ms I  298 ms T   71 ms S 2877687 kB R    714 kB The
🔶 G  357 ms I  261 ms T   96 ms S   2295 kB R    714 kB  E
🔶 G  363 ms I  264 ms T   99 ms S   2295 kB R    714 kB iff
🔶 G  340 ms I  259 ms T   81 ms S   2295 kB R    714 kB el
🔶 G  340 ms I  256 ms T   84 ms S   2295 kB R    714 kB  Tower
🔶 G  341 ms I  257 ms T   84 ms S   2295 kB R    714 kB  is
🔶 G  340 ms I  261 ms T   78 ms S   2295 kB R    714 kB  a
...
🔶 G  430 ms I  380 ms T   48 ms S   2295 kB R    714 kB  of
🔶 G  444 ms I  386 ms T   57 ms S   2295 kB R    714 kB  the
🔶 G  461 ms I  403 ms T   56 ms S   2295 kB R    714 kB  most
🔶 G  420 ms I  354 ms T   64 ms S   2295 kB R    714 kB  recognizable
Generated tokens:    64
Avg generation time: 392.34 ms
Avg inference time:  326.58 ms
Avg transfer time:   64.03 ms

Conclusions: it seems this change reduced the synchronization time by 12 ms/token, which is a very good improvement.

It looks like there is more room for improvement if this approach works.

@cewuandy

cewuandy commented Apr 29, 2024

I found that the llamaSyncAttQ, llamaSyncAttK and llamaSyncAttV tasks are all set to TASK_TYPE_INFERENCE. That may affect the transfer time statistics.
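
For context, a minimal sketch (not the repo's code; the task list, types and timings here are hypothetical) of why the tag matters for the reported statistics:

```cpp
// Time is accumulated per task type, so a sync task tagged as inference
// inflates the "I" (inference) column and shrinks the "T" (transfer) column.
#include <cstdio>

enum TaskType { TASK_TYPE_INFERENCE, TASK_TYPE_TRANSFER };

struct Task { const char* name; TaskType type; double ms; };

int main() {
    // Hypothetical timings; llamaSyncAtt* tagged as transfer (the suggested fix).
    const Task tasks[] = {
        { "llamaMatmulQkv", TASK_TYPE_INFERENCE, 250.0 },
        { "llamaSyncAttQ",  TASK_TYPE_TRANSFER,   25.0 },
        { "llamaSyncAttK",  TASK_TYPE_TRANSFER,   25.0 },
        { "llamaSyncAttV",  TASK_TYPE_TRANSFER,   25.0 },
    };
    double inferenceMs = 0, transferMs = 0;
    for (const Task& t : tasks)
        (t.type == TASK_TYPE_INFERENCE ? inferenceMs : transferMs) += t.ms;
    printf("I %.0f ms, T %.0f ms\n", inferenceMs, transferMs);
    return 0;
}
```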

@b4rtaz
Owner

b4rtaz commented Apr 29, 2024

@cewuandy Nice catch! Fixed it; I'll retest later.

@b4rtaz
Owner

b4rtaz commented Apr 29, 2024

Setup: the same as before.

0.3.0

Commit: ad10e18

Test 1:

b4rtaz@raspberrypi3:~/distributed-llama $ ./main inference --prompt "The Eiffel Tower is" --weights-float-type q40 --buffer-float-type q80 --nthreads 4 --model ../dllama_meta-llama-3-8b_q40.bin --tokenizer ../dllama-llama3-tokenizer.t --steps 64 --workers 10.0.0.4:9999 10.0.0.1:9999 10.0.0.2:9999
💡 arch: llama2
💡 dim: 4096
💡 hiddenDim: 14336
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 8
💡 vocabSize: 128256
💡 seqLen: 2048
💡 nSlices: 4
💡 ropeTheta: 500000.0
📄 bosId: 128000
📄 eosId: 128001
⏩ Loaded 6323781632 bytes
...
Generated tokens:    64
Avg generation time: 343.39 ms
Avg inference time:  258.80 ms
Avg transfer time:   82.66 ms

Test 2:

Generated tokens:    64
Avg generation time: 347.48 ms
Avg inference time:  257.58 ms
Avg transfer time:   79.97 ms

Test 3:

Generated tokens:    64
Avg generation time: 339.42 ms
Avg inference time:  258.86 ms
Avg transfer time:   78.42 ms

Test 4:

Generated tokens:    64
Avg generation time: 334.41 ms
Avg inference time:  251.34 ms
Avg transfer time:   80.67 ms

0.3.1

Commit: 7f63f9e

Test 1:

Generated tokens:    64
Avg generation time: 329.61 ms
Avg inference time:  252.23 ms
Avg transfer time:   75.52 ms

Test 2:

Generated tokens:    64
Avg generation time: 333.89 ms
Avg inference time:  253.94 ms
Avg transfer time:   78.00 ms

Test 3:

Generated tokens:    64
Avg generation time: 330.98 ms
Avg inference time:  252.69 ms
Avg transfer time:   76.47 ms

Test 4:

Generated tokens:    64
Avg generation time: 327.75 ms
Avg inference time:  247.88 ms
Avg transfer time:   77.30 ms

So the average transfer time is 80.43 ms for 0.3.0 vs 76.82 ms for 0.3.1 (n=4).

My setup looks very non-deterministic. Yesterday I observed an average inference time close to 320.00 ms, today it's close to 250 ms. 🤯 I may have positioned the cooling fan better today. My setup is a bit improvised:

[photo of the improvised setup]

Yesterday I achieved a similar average transfer time for the old version as I did today for the new one.

So I think my tests can neither prove nor disprove that this approach is better.

@b4rtaz
Owner

b4rtaz commented Apr 29, 2024

So for now I have reverted these changes back to the old approach. The previous implementation is easier to maintain.
