Model | VRAM (GB) | CUDA Cores |
---|---|---|
GeForce RTX 3050 Mobile/Laptop | 4 | 2048 |
GeForce RTX 3050 | 8 | 2304 |
GeForce RTX 4050 Mobile/Laptop | 6 | 2560 |
GeForce RTX 3050 Ti Mobile/Laptop | 4 | 2560 |
GeForce RTX 4060 | 8 | 3072 |
GeForce RTX 3060 | 12 | 3584 |
GeForce RTX 3060 Mobile/Laptop | 6 | 3840 |
GeForce RTX 4060 Ti | 16 | 4352 |
GeForce RTX 4070 Mobile/Laptop | 8 | 4608 |
GeForce RTX 3060 Ti | 8 | 4864 |
GeForce RTX 3070 Mobile/Laptop | 8 | 5120 |
GeForce RTX 3070 | 8 | 5888 |
GeForce RTX 4070 | 12 | 5888 |
GeForce RTX 3070 Ti | 8 | 6144 |
GeForce RTX 3070 Ti Mobile/Laptop | 8-16 | 6144 |
GeForce RTX 4070 Super | 12 | 7168 |
GeForce RTX 4080 Mobile/Laptop | 12 | 7424 |
GeForce RTX 3080 Ti Mobile/Laptop | 16 | 7424 |
GeForce RTX 4070 Ti | 12 | 7680 |
GeForce RTX 4070 Ti Super | 16 | 8448 |
GeForce RTX 3080 | 10 | 8704 |
GeForce RTX 4080 | 16 | 9728 |
GeForce RTX 4090 Mobile/Laptop | 16 | 9728 |
GeForce RTX 4080 Super | 16 | 10240 |
GeForce RTX 3080 Ti | 12 | 10240 |
GeForce RTX 3090 | 24 | 10496 |
GeForce RTX 3090 Ti | 24 | 10752 |
GeForce RTX 4090 D | 24 | 14592 |
GeForce RTX 4090 | 24 | 16384 |
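To find where your own card falls in these tables, you can query it at runtime. A minimal sketch using PyTorch (assumes a CUDA-enabled `torch` install; CUDA cores are not exposed directly, so the last line reports streaming multiprocessors instead):

```python
import torch

props = torch.cuda.get_device_properties(0)
print(f"GPU:                {props.name}")
print(f"VRAM:               {props.total_memory / 1024**3:.1f} GB")
print(f"Compute capability: {props.major}.{props.minor}")  # 8.6 = Ampere, 8.9 = Ada Lovelace
print(f"SM count:           {props.multi_processor_count}")  # CUDA cores = SMs x cores per SM (128 on Ampere/Ada)
```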
CUDA Compute Capability | Architecture | GeForce |
---|---|---|
6.1 | Pascal | NVIDIA TITAN Xp, TITAN X, GeForce GTX 1080 Ti, GTX 1080, GTX 1070 Ti, GTX 1070, GTX 1060, GTX 1050 Ti, GTX 1050, GT 1030, GT 1010, MX350, MX330, MX250, MX230, MX150, MX130, MX110 |
7.0 | Volta | NVIDIA TITAN V |
7.5 | Turing | NVIDIA TITAN RTX, GeForce RTX 2080 Ti, RTX 2080 Super, RTX 2080, RTX 2070 Super, RTX 2070, RTX 2060 Super, RTX 2060 12GB, RTX 2060, GeForce GTX 1660 Ti, GTX 1660 Super, GTX 1660, GTX 1650 Super, GTX 1650, MX550, MX450 |
8.6 | Ampere | GeForce RTX 3090 Ti, RTX 3090, RTX 3080 Ti, RTX 3080 12GB, RTX 3080, RTX 3070 Ti, RTX 3070, RTX 3060 Ti, RTX 3060, RTX 3050, RTX 3050 Ti (mobile), RTX 3050 (mobile), RTX 2050 (mobile), MX570 |
8.9 | Ada Lovelace | GeForce RTX 4090, RTX 4080 Super, RTX 4080, RTX 4070 Ti Super, RTX 4070 Ti, RTX 4070 Super, RTX 4070, RTX 4060 Ti, RTX 4060 |
- NOTE: Only Ampere and later Nvidia GPUs support the new `flash_attention` parameter in `CTranslate2`.
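A minimal sketch of enabling it when loading a converted model (the model path is hypothetical; `flash_attention` landed in recent CTranslate2 releases, 4.0 or newer, and needs a GPU with compute capability >= 8.0):

```python
import ctranslate2

# Hypothetical path to a model converted with ct2-transformers-converter.
generator = ctranslate2.Generator(
    "path/to/ct2_model",
    device="cuda",
    compute_type="int8",
    flash_attention=True,  # only effective on Ampere (CC >= 8.0) or newer GPUs
)
```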
On CPU, the compute type you request (column headers) resolves to the effective type shown in each cell:

Architecture | int8_float32 | int8_float16 | int8_bfloat16 | int16 | float16 | bfloat16 |
---|---|---|---|---|---|---|
x86-64 (Intel) | int8_float32 | int8_float32 | int8_float32 | int16 | float32 | float32 |
x86-64 (other) | int8_float32 | int8_float32 | int8_float32 | int8_float32 | float32 | float32 |
AArch64/ARM64 (Apple) | int8_float32 | int8_float32 | int8_float32 | int8_float32 | float32 | float32 |
AArch64/ARM64 (other) | int8_float32 | int8_float32 | int8_float32 | int8_float32 | float32 | float32 |
On CUDA GPUs, the effective type depends on the device's compute capability:

Compute Capability | int8_float32 | int8_float16 | int8_bfloat16 | int16 | float16 | bfloat16 |
---|---|---|---|---|---|---|
>= 8.0 | int8_float32 | int8_float16 | int8_bfloat16 | float16 | float16 | bfloat16 |
>= 7.0, < 8.0 | int8_float32 | int8_float16 | int8_float16 | float16 | float16 | float16 |
6.2 | float32 | float32 | float32 | float32 | float32 | float32 |
6.1 | int8_float32 | int8_float32 | int8_float32 | float32 | float32 | float32 |
<= 6.0 | float32 | float32 | float32 | float32 | float32 | float32 |
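To see which compute types your own hardware supports natively before requesting one, CTranslate2 exposes a query function; anything else you request is implicitly converted per the tables above:

```python
import ctranslate2

print(ctranslate2.get_supported_compute_types("cpu"))
print(ctranslate2.get_supported_compute_types("cuda"))  # requires a CUDA build and GPU
```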
- Tested using `ctranslate2` running in `int8` on an RTX 4090.
Model | Tokens per Second | VRAM Usage (GB) |
---|---|---|
gemma-1.1-2b-it | 63.69 | 3.0 |
Phi-3-mini-4k-instruct | 36.46 | 4.5 |
dolphin-llama2-7b | 37.43 | 7.5 |
Orca-2-7b | 30.47 | 7.5 |
Llama-2-7b-chat-hf | 37.78 | 7.6 |
Mistral-7B-Instruct-v0.3 | 32.24 | 7.9 |
neural-chat-7b-v3-3 | 28.38 | 8.1 |
Meta-Llama-3-8B-Instruct | 30.12 | 8.8 |
dolphin-2.9-llama3-8b | 34.16 | 8.8 |
SOLAR-10.7B-Instruct-v1.0 | 23.32 | 11.7 |
Llama-2-13b-chat-hf | 25.12 | 14.0 |
Orca-2-13b | 20.01 | 14.1 |
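A rough way to reproduce this kind of tokens-per-second measurement with `ctranslate2`. A minimal sketch; the model directory is hypothetical and assumed to be the output of `ct2-transformers-converter`:

```python
import time

import ctranslate2
import transformers

model_dir = "path/to/ct2_model"  # hypothetical converted-model directory
tokenizer = transformers.AutoTokenizer.from_pretrained(model_dir)
generator = ctranslate2.Generator(model_dir, device="cuda", compute_type="int8")

# Encode a prompt, generate, and divide new tokens by wall-clock time.
prompt = tokenizer.convert_ids_to_tokens(tokenizer.encode("Explain GPU quantization."))
start = time.perf_counter()
results = generator.generate_batch(
    [prompt], max_length=256, include_prompt_in_result=False
)
elapsed = time.perf_counter() - start
new_tokens = len(results[0].sequences_ids[0])
print(f"{new_tokens / elapsed:.2f} tokens/sec")
```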
Library/Tool | Type | Best Use Case | Pros | Cons |
---|---|---|---|---|
Python threading | Threading | I/O-bound tasks | Simple API, good for I/O-bound tasks | GIL limits effectiveness for CPU-bound tasks |
Python multiprocessing | Multiprocessing | CPU-bound tasks | True parallelism, bypasses GIL | Higher memory overhead, complex IPC |
Python subprocess | Process control | Running external commands | Simple process control, capture I/O | Limited to external process management |
concurrent.futures | High-level API for threading & multiprocessing | Unified task management | Simplifies task execution, combines threading and multiprocessing | Limited flexibility, higher abstraction |
asyncio | Async/coroutine | I/O-bound, high-concurrency tasks | Non-blocking I/O, single-threaded concurrency | Steeper learning curve due to coroutines and event loop |
QThread | Threading | Integrating threads into the Qt event loop, signal-slot communication | Seamless Qt integration, easy inter-thread communication | More boilerplate, requires subclassing |
QRunnable/QThreadPool | Threading | Managing multiple short-lived tasks within Qt applications | Efficient task management, less boilerplate | Requires understanding of Qt threading architecture |
QtConcurrent | Threading | High-level parallel tasks in Qt | High-level functions for parallel execution, automatic thread pooling | Less control over individual threads |
QProcess | Process control | Running external commands in Qt applications | Integrates with Qt, handles process I/O | Limited to process control |
- Python `threading`: Best for simple I/O-bound tasks.
- Python `multiprocessing`: Best for CPU-bound tasks requiring true parallelism.
- Python `subprocess`: Simple external process management. Use when you need straightforward process control and portability across different environments.
- `concurrent.futures`: Unified API for high-level task management (see the sketch after this list).
- `asyncio`: Suitable for I/O-bound tasks with high concurrency, single-threaded.
- QThread: Ideal for complex threading in Qt applications with signal-slot communication.
- QRunnable/QThreadPool: Efficient for managing multiple short-lived tasks in Qt.
- QtConcurrent: Simplifies parallel task execution in Qt applications.
- QProcess: Handles running and managing external processes within Qt applications. Use when you need tight integration with the Qt event loop and signal-slot mechanism.
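A short sketch of the `concurrent.futures` trade-off described above, using hypothetical workloads (not from the original doc): threads for I/O-bound waits, processes for CPU-bound work.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def cpu_bound(n: int) -> int:
    # Pure-Python arithmetic holds the GIL, so threads would not help here.
    return sum(i * i for i in range(n))

def io_bound(seconds: float) -> float:
    time.sleep(seconds)  # stands in for a network or disk wait
    return seconds

if __name__ == "__main__":  # guard required for process pools on Windows/macOS
    # Threads: fine for I/O-bound work despite the GIL.
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(list(pool.map(io_bound, [0.1] * 4)))

    # Processes: true parallelism for CPU-bound work, at higher memory cost.
    with ProcessPoolExecutor(max_workers=4) as pool:
        print(list(pool.map(cpu_bound, [1_000_000] * 4)))
```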