GPU Info

| Model | VRAM (GB) | CUDA Cores |
|---|---|---|
| GeForce RTX 3050 Mobile/Laptop | 4 | 2048 |
| GeForce RTX 3050 | 8 | 2304 |
| GeForce RTX 4050 Mobile/Laptop | 6 | 2560 |
| GeForce RTX 3050 Ti Mobile/Laptop | 4 | 2560 |
| GeForce RTX 4060 | 8 | 3072 |
| GeForce RTX 3060 | 12 | 3584 |
| GeForce RTX 3060 Mobile/Laptop | 6 | 3840 |
| GeForce RTX 4060 Ti | 16 | 4352 |
| GeForce RTX 4070 Mobile/Laptop | 8 | 4608 |
| GeForce RTX 3060 Ti | 8 | 4864 |
| GeForce RTX 3070 Mobile/Laptop | 8 | 5120 |
| GeForce RTX 3070 | 8 | 5888 |
| GeForce RTX 4070 | 12 | 5888 |
| GeForce RTX 3070 Ti | 8 | 6144 |
| GeForce RTX 3070 Ti Mobile/Laptop | 8-16 | 6144 |
| GeForce RTX 4070 Super | 12 | 7168 |
| GeForce RTX 4080 Mobile/Laptop | 12 | 7424 |
| GeForce RTX 3080 Ti Mobile/Laptop | 16 | 7424 |
| GeForce RTX 4070 Ti | 12 | 7680 |
| GeForce RTX 4080 | 12 | 7680 |
| GeForce RTX 4070 Ti Super | 16 | 8448 |
| GeForce RTX 3080 | 10 | 8704 |
| GeForce RTX 3080 Ti | 12 | 8960 |
| GeForce RTX 4080 | 16 | 9728 |
| GeForce RTX 4090 Mobile/Laptop | 16 | 9728 |
| GeForce RTX 4080 Super | 16 | 10240 |
| GeForce RTX 3090 | 24 | 10496 |
| GeForce RTX 3090 Ti | 24 | 10752 |
| GeForce RTX 4090 D | 24 | 14592 |
| GeForce RTX 4090 | 24 | 16384 |

CUDA Compute Capability

| Compute Capability | Architecture | GeForce Models |
|---|---|---|
| 6.1 | Pascal | NVIDIA TITAN Xp, TITAN X, GeForce GTX 1080 Ti, GTX 1080, GTX 1070 Ti, GTX 1070, GTX 1060, GTX 1050 Ti, GTX 1050, GT 1030, GT 1010, MX350, MX330, MX250, MX230, MX150, MX130, MX110 |
| 7.0 | Volta | NVIDIA TITAN V |
| 7.5 | Turing | NVIDIA TITAN RTX, GeForce RTX 2080 Ti, RTX 2080 Super, RTX 2080, RTX 2070 Super, RTX 2070, RTX 2060 Super, RTX 2060 12GB, RTX 2060, GeForce GTX 1660 Ti, GTX 1660 Super, GTX 1660, GTX 1650 Super, GTX 1650, MX550, MX450 |
| 8.6 | Ampere | GeForce RTX 3090 Ti, RTX 3090, RTX 3080 Ti, RTX 3080 12GB, RTX 3080, RTX 3070 Ti, RTX 3070, RTX 3060 Ti, RTX 3060, RTX 3050, RTX 3050 Ti (mobile), RTX 3050 (mobile), RTX 2050 (mobile), MX570 |
| 8.9 | Ada Lovelace | GeForce RTX 4090, RTX 4080 Super, RTX 4080, RTX 4070 Ti Super, RTX 4070 Ti, RTX 4070 Super, RTX 4070, RTX 4060 Ti, RTX 4060 |

CTranslate2 Quantization Compatibility

The tables below map the compute type requested at model load (column) to the compute type CTranslate2 actually uses on each platform.

  • NOTE: Only Ampere and later Nvidia GPUs support CTranslate2's flash_attention parameter.

CPU

| Architecture | int8_float32 | int8_float16 | int8_bfloat16 | int16 | float16 | bfloat16 |
|---|---|---|---|---|---|---|
| x86-64 (Intel) | int8_float32 | int8_float32 | int8_float32 | int16 | float32 | float32 |
| x86-64 (other) | int8_float32 | int8_float32 | int8_float32 | int8_float32 | float32 | float32 |
| AArch64/ARM64 (Apple) | int8_float32 | int8_float32 | int8_float32 | int8_float32 | float32 | float32 |
| AArch64/ARM64 (other) | int8_float32 | int8_float32 | int8_float32 | int8_float32 | float32 | float32 |

Nvidia GPU

| Compute Capability | int8_float32 | int8_float16 | int8_bfloat16 | int16 | float16 | bfloat16 |
|---|---|---|---|---|---|---|
| >= 8.0 | int8_float32 | int8_float16 | int8_bfloat16 | float16 | float16 | bfloat16 |
| >= 7.0, < 8.0 | int8_float32 | int8_float16 | int8_float32 | float16 | float16 | float32 |
| 6.2 | float32 | float32 | float32 | float32 | float32 | float32 |
| 6.1 | int8_float32 | int8_float32 | int8_float32 | float32 | float32 | float32 |
| <= 6.0 | float32 | float32 | float32 | float32 | float32 | float32 |
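The Nvidia GPU table above can be captured as a small lookup, which is handy for picking a compute type before loading a model. This is a sketch of the table only — the function names and structure are illustrative and not part of CTranslate2's API:

```python
# Effective compute type used on Nvidia GPUs per the table above,
# keyed by compute-capability band. Names here are illustrative.
_TYPES = ["int8_float32", "int8_float16", "int8_bfloat16",
          "int16", "float16", "bfloat16"]

GPU_EFFECTIVE_TYPE = {
    ">=8.0": {"int8_float32": "int8_float32", "int8_float16": "int8_float16",
              "int8_bfloat16": "int8_bfloat16", "int16": "float16",
              "float16": "float16", "bfloat16": "bfloat16"},
    "7.x":   {"int8_float32": "int8_float32", "int8_float16": "int8_float16",
              "int8_bfloat16": "int8_float32", "int16": "float16",
              "float16": "float16", "bfloat16": "float32"},
    "6.2":   dict.fromkeys(_TYPES, "float32"),
    "6.1":   {"int8_float32": "int8_float32", "int8_float16": "int8_float32",
              "int8_bfloat16": "int8_float32", "int16": "float32",
              "float16": "float32", "bfloat16": "float32"},
    "<=6.0": dict.fromkeys(_TYPES, "float32"),
}

def band(major: int, minor: int) -> str:
    """Map a compute capability (e.g. 8, 6 for an RTX 3090) to a table row."""
    if major >= 8:
        return ">=8.0"
    if major == 7:
        return "7.x"
    if (major, minor) == (6, 2):
        return "6.2"
    if (major, minor) == (6, 1):
        return "6.1"
    return "<=6.0"

def effective_type(major: int, minor: int, requested: str) -> str:
    """Compute type actually used when `requested` is asked for."""
    return GPU_EFFECTIVE_TYPE[band(major, minor)][requested]
```

For example, `effective_type(8, 6, "int8_float16")` returns `"int8_float16"` (Ampere supports it natively), while `effective_type(6, 1, "float16")` falls back to `"float32"`.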

Chat Model Benchmarks

  • Tested with CTranslate2 running in int8 on an RTX 4090.

| Model | Tokens per Second | VRAM Usage (GB) |
|---|---|---|
| gemma-1.1-2b-it | 63.69 | 3.0 |
| Phi-3-mini-4k-instruct | 36.46 | 4.5 |
| dolphin-llama2-7b | 37.43 | 7.5 |
| Orca-2-7b | 30.47 | 7.5 |
| Llama-2-7b-chat-hf | 37.78 | 7.6 |
| neural-chat-7b-v3-3 | 28.38 | 8.1 |
| Meta-Llama-3-8B-Instruct | 30.12 | 8.8 |
| dolphin-2.9-llama3-8b | 34.16 | 8.8 |
| Mistral-7B-Instruct-v0.3 | 32.24 | 7.9 |
| SOLAR-10.7B-Instruct-v1.0 | 23.32 | 11.7 |
| Llama-2-13b-chat-hf | 25.12 | 14.0 |
| Orca-2-13b | 20.01 | 14.1 |

Concurrency

| Library/Tool | Type | Best Use Case | Pros | Cons |
|---|---|---|---|---|
| Python threading | Threading | I/O-bound tasks | Simple API, good for I/O-bound tasks | GIL limits effectiveness for CPU-bound tasks |
| Python multiprocessing | Multiprocessing | CPU-bound tasks | True parallelism, bypasses GIL | Higher memory overhead, complex IPC |
| Python subprocess | Process control | Running external commands | Simple process control, captures I/O | Limited to external process management |
| concurrent.futures | High-level API (threading & multiprocessing) | Unified task management | Simplifies task execution; one interface for threads and processes | Less flexibility, higher abstraction |
| asyncio | Async/coroutines | High-concurrency I/O-bound tasks | Non-blocking I/O, single-threaded concurrency | Steeper learning curve (coroutines and the event loop) |
| QThread | Threading (Qt) | Threads integrated with the Qt event loop, signal-slot communication | Seamless Qt integration, easy inter-thread communication | More boilerplate, requires subclassing |
| QRunnable/QThreadPool | Threading (Qt) | Many short-lived tasks within Qt applications | Efficient task management, less boilerplate | Requires understanding Qt's threading architecture |
| QtConcurrent | Threading (Qt) | High-level parallel tasks in Qt | High-level functions for parallel execution, automatic thread pooling | Less control over individual threads |
| QProcess | Process control (Qt) | Running external commands in Qt applications | Integrates with Qt, handles process I/O | Limited to process control |
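The trade-offs in the table are easiest to see with concurrent.futures, whose unified Executor interface lets you swap threads for processes without changing the task-submission code. A minimal sketch, where `fetch` is a made-up stand-in for real I/O-bound work:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(name: str) -> str:
    """Stand-in for an I/O-bound task (e.g. a network call)."""
    time.sleep(0.1)  # while sleeping/waiting on I/O, the GIL is released
    return f"{name}: done"

start = time.perf_counter()
# ThreadPoolExecutor suits I/O-bound work; swap in ProcessPoolExecutor
# (same Executor API) for CPU-bound work that must bypass the GIL.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch, ["a", "b", "c", "d"]))
elapsed = time.perf_counter() - start

print(results)             # map() preserves submission order
print(f"{elapsed:.2f}s")   # ~0.1 s concurrent, not 0.4 s serial
```

Because `Executor.map` preserves input order and the four sleeps overlap, the wall time is close to one task's duration rather than the sum of all four.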

Summary

  • Python threading: Best for simple I/O-bound tasks.
  • Python multiprocessing: Best for CPU-bound tasks requiring true parallelism.
  • Python subprocess: Simple external process management. Use when you need straightforward process control and portability across different environments.
  • concurrent.futures: Unified API for high-level task management.
  • asyncio: Best for high-concurrency I/O-bound tasks; provides single-threaded concurrency via the event loop.
  • QThread: Ideal for complex threading in Qt applications with signal-slot communication.
  • QRunnable/QThreadPool: Efficient for managing multiple short-lived tasks in Qt.
  • QtConcurrent: Simplifies parallel task execution in Qt applications.
  • QProcess: Handles running and managing external processes within Qt applications. Use when you need tight integration with the Qt event loop and signal-slot mechanism.
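As a minimal illustration of the asyncio entry above — many I/O-bound tasks interleaved on a single thread (the `fetch` coroutine is an invented stand-in for non-blocking I/O):

```python
import asyncio

async def fetch(name: str) -> str:
    """Stand-in for a non-blocking I/O operation."""
    await asyncio.sleep(0.1)  # yields control back to the event loop
    return f"{name}: done"

async def main() -> list:
    # gather() runs all three coroutines concurrently on one thread
    return await asyncio.gather(*(fetch(n) for n in "abc"))

results = asyncio.run(main())
print(results)
```

Unlike the threading example, no extra OS threads are created: concurrency comes from coroutines cooperatively yielding at each `await`.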