Batched inference API and support for float16 inference #279

salvaba94 · 2024-01-27T13:13:29Z

This branch adds support for half-precision (float16) inference and a batched inference API (BatchedModel). It also includes a short demo showing how to use this API.

ZachOBrien · 2024-02-07T17:47:11Z

@salvaba94 I ran this batch inference demo but did not see any performance benefit from batching. Batch size 1 took 223 ms on average, batch size 2 took 447 ms on average, and so on. Latency scaled linearly with the batch size.

Did you observe the same behavior? Or were you able to get better throughput via batching?

salvaba94 · 2024-02-10T09:57:17Z

Hi @ZachOBrien, I've just checked, and I see only marginal improvements from batching.

Here are the results:

Batch 1 and float32: 363 ms
Batch 2 and float32: 668 ms
Batch 1 and float16: 217 ms
Batch 2 and float16: 351 ms

I guess the improvement depends on the GPU (this was tested on an RTX 2060).
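
To make the numbers in this thread easier to compare, here is a minimal sketch that converts the reported batch latencies into per-image latency and a speedup relative to batch size 1. The timings are copied from the comments above; the helper names (`per_image_ms`, `batching_speedup`) are hypothetical and not part of this PR, and the sketch assumes each reported figure is the total time for one batch.

```python
# Latencies in ms copied from the thread, keyed by batch size.
# Assumption: each value is the total time for one batch.
reported = {
    "ZachOBrien (precision not stated)": {1: 223, 2: 447},
    "RTX 2060, float32": {1: 363, 2: 668},
    "RTX 2060, float16": {1: 217, 2: 351},
}


def per_image_ms(total_ms, batch_size):
    """Average time spent per image in a batch."""
    return total_ms / batch_size


def batching_speedup(timings, batch_size):
    """Per-image speedup of `batch_size` relative to batch size 1.

    A value close to 1.0 means latency scales linearly with batch size,
    i.e. batching brings no throughput benefit.
    """
    baseline = per_image_ms(timings[1], 1)
    batched = per_image_ms(timings[batch_size], batch_size)
    return baseline / batched


if __name__ == "__main__":
    for label, timings in reported.items():
        for batch_size in sorted(timings):
            ms = per_image_ms(timings[batch_size], batch_size)
            speedup = batching_speedup(timings, batch_size)
            print(f"{label}: batch {batch_size}: "
                  f"{ms:.1f} ms/image, speedup x{speedup:.2f}")
```

Read this way, a speedup near 1.0 at batch size 2 corresponds to the linear scaling described earlier in the thread, while values above 1.0 indicate that batching improves throughput on that setup.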

Batched inference API and support for float16 inference

ace383e

rentainhe requested a review from SlongLiu January 28, 2024 03:02

Merge branch 'IDEA-Research:main' into batched_float16_inference

bffa375
