Releases: ollama/ollama
v0.1.38
New Models
- Falcon 2: A new 11B-parameter causal decoder-only model built by TII and trained on 5T tokens.
- Yi 1.5: A new high-performing version of Yi, now licensed as Apache 2.0. Available in 6B, 9B and 34B sizes.
What's Changed
ollama ps
A new command is now available: `ollama ps`. This command displays currently loaded models, their memory footprint, and the processors used (GPU or CPU):

```
% ollama ps
NAME               ID              SIZE      PROCESSOR          UNTIL
mixtral:latest     7708c059a8bb    28 GB     47%/53% CPU/GPU    Forever
llama3:latest      a6990ed6be41    5.5 GB    100% GPU           4 minutes from now
all-minilm:latest  1b226e2802db    585 MB    100% GPU           4 minutes from now
```
/clear
To clear the chat history for a session when running `ollama run`, use `/clear`:

```
>>> /clear
Cleared session context
```
- Fixed issue where switching loaded models on Windows would take several seconds
- Running `/save` will no longer abort the chat session if an incorrect name is provided
- The `/api/tags` API endpoint will now correctly return an empty list `[]` instead of `null` if no models are found
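As a quick check (assuming a local server on the default port 11434), listing models when none are installed should now return the empty list:

```
# Ask the local Ollama server for its installed models
curl http://localhost:11434/api/tags
# With no models installed, the response is now {"models":[]} rather than {"models":null}
```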
New Contributors
- @fangtaosong made their first contribution in #4387
- @machimachida made their first contribution in #4424
Full Changelog: v0.1.37...v0.1.38
v0.1.37
What's Changed
- Fixed issue where models with uppercase characters in the name would not show with `ollama list`
- Fixed usage string for `ollama create`
- Fixed `finish_reason` being `""` instead of `null` in the OpenAI-compatible chat API (see the example below)
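A sketch of the corrected behavior (assuming the llama3 model is installed locally):

```
# Call the OpenAI-compatible chat endpoint on a local server
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [{"role": "user", "content": "Say hello"}]
  }'
# finish_reason in the completed response is a value such as "stop",
# and null (not "") while a streamed response is still in progress
```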
New Contributors
- @todashuta made their first contribution in #4362
Full Changelog: v0.1.36...v0.1.37
v0.1.36
What's Changed
- Fixed `exit status 0xc0000005` error with AMD graphics cards on Windows
- Fixed rare out-of-memory errors when loading a model to run with CPU
Full Changelog: v0.1.35...v0.1.36
v0.1.35
New models
- Llama 3 ChatQA: A model from NVIDIA based on Llama 3 that excels at conversational question answering (QA) and retrieval-augmented generation (RAG).
What's Changed
- Quantization: `ollama create` can now quantize models when importing them using the `--quantize` or `-q` flag:

```
ollama create -f Modelfile --quantize q4_0 mymodel
```

Note: `--quantize` works when importing `float16` or `float32` models:
  - From a binary GGUF file (e.g. `FROM ./model.gguf`)
  - From a library model (e.g. `FROM llama3:8b-instruct-fp16`)
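A minimal end-to-end sketch, assuming a float16 GGUF file on disk (the file path and model name here are hypothetical):

```
# Write a one-line Modelfile that imports a float16 GGUF file
echo 'FROM ./llama3-8b-f16.gguf' > Modelfile

# Import the weights and quantize them to 4-bit in one step
ollama create llama3-8b-q4 -f Modelfile --quantize q4_0
```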
- Fixed issue where inference subprocesses wouldn't be cleaned up on shutdown.
- Fixed a series of out-of-memory errors when loading models on multi-GPU systems
- Ctrl+J characters will now properly add newlines in `ollama run`
- Fixed issues when running `ollama show` for vision models
- `OPTIONS` requests to the Ollama API will no longer result in errors
- Fixed issue where partially downloaded files wouldn't be cleaned up
- Added a new `done_reason` field in responses describing why generation stopped (see the example after this list)
- Ollama will now more accurately estimate how much memory is available on multi-GPU systems, especially when running different models one after another
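A short illustration of `done_reason` (assuming the llama3 model is installed on a local server):

```
# Request a non-streaming completion and inspect the final fields
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
# The JSON response now includes "done": true together with a
# "done_reason" value such as "stop" explaining why generation ended
```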
New Contributors
- @fmaclen made their first contribution in #3884
- @Renset made their first contribution in #3881
- @glumia made their first contribution in #3043
- @boessu made their first contribution in #4236
- @gaardhus made their first contribution in #2307
- @svilupp made their first contribution in #2192
- @WolfTheDeveloper made their first contribution in #4300
Full Changelog: v0.1.34...v0.1.35
v0.1.34
New models
- Llava Llama 3: A new high-performing LLaVA model fine-tuned from Llama 3 Instruct.
- Llava Phi 3: A new small LLaVA model fine-tuned from Phi 3.
- StarCoder2 15B Instruct: A new instruct fine-tune of the StarCoder2 model
- CodeGemma 1.1: A new release of the CodeGemma model.
- StableLM2 12B: A new 12B version of the StableLM 2 model from Stability AI
- Moondream 2: Moondream 2's runtime parameters have been improved for better responses
What's Changed
- Fixed issues with LLaVa models where they would respond incorrectly after the first request
- Fixed out of memory errors when running large models such as Llama 3 70B
- Fixed various issues with Nvidia GPU discovery on Linux and Windows
- Fixed a series of Modelfile errors when running
ollama create
- Fixed
no slots available
error that occurred when cancelling a request and then sending follow up requests - Improved AMD GPU detection on Fedora
- Improved reliability when using the experimental
OLLAMA_NUM_PARALLEL
andOLLAMA_MAX_LOADED
flags ollama serve
will now shut down quickly, even if a model is loading
New Contributors
- @drnic made their first contribution in #4116
- @bernardo-bruning made their first contribution in #4111
- @Drlordbasil made their first contribution in #4174
- @Saif-Shines made their first contribution in #4119
- @HydenLiu made their first contribution in #4194
- @jl-codes made their first contribution in #3621
- @Nurgo made their first contribution in #3473
- @adrienbrault made their first contribution in #3129
- @Darinochka made their first contribution in #3945
Full Changelog: v0.1.33...v0.1.34
v0.1.33
New models
- Llama 3: a new model by Meta, and the most capable openly available LLM to date
- Phi 3 Mini: a new 3.8B-parameter, lightweight, state-of-the-art open model by Microsoft.
- Moondream: a small vision language model designed to run efficiently on edge devices.
- Llama 3 Gradient 1048K: A Llama 3 fine-tune by Gradient to support up to a 1M token context window.
- Dolphin Llama 3: The uncensored Dolphin model, trained by Eric Hartford and based on Llama 3 with a variety of instruction, conversational, and coding skills.
- Qwen 110B: The first Qwen model over 100B parameters in size with outstanding performance in evaluations
What's Changed
- Fixed issues where the model would not terminate, causing the API to hang.
- Fixed a series of out of memory errors on Apple Silicon Macs
- Fixed out of memory errors when running Mixtral architecture models
Experimental concurrency features
New concurrency features are coming soon to Ollama. They are available as experimental settings:
- `OLLAMA_NUM_PARALLEL`: Handle multiple requests simultaneously for a single model
- `OLLAMA_MAX_LOADED_MODELS`: Load multiple models simultaneously

To enable these features, set the environment variables for `ollama serve`. For more info, see this guide:

```
OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=4 ollama serve
```
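A rough sketch of the effect (assuming the llama3 model is installed): with `OLLAMA_NUM_PARALLEL` greater than 1, two requests against the same model are served concurrently instead of queueing:

```
# Fire two prompts at the same model at once; with OLLAMA_NUM_PARALLEL > 1
# the second request no longer waits for the first to finish
curl http://localhost:11434/api/generate -d '{"model":"llama3","prompt":"Hi","stream":false}' &
curl http://localhost:11434/api/generate -d '{"model":"llama3","prompt":"Hello","stream":false}' &
wait
```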
New Contributors
- @hmartinez82 made their first contribution in #3972
- @Cephra made their first contribution in #4037
- @arpitjain099 made their first contribution in #4007
- @MarkWard0110 made their first contribution in #4031
- @alwqx made their first contribution in #4073
- @sidxt made their first contribution in #3705
- @ChengenH made their first contribution in #3789
- @secondtruth made their first contribution in #3503
- @reid41 made their first contribution in #3612
- @ericcurtin made their first contribution in #3626
- @JT2M0L3Y made their first contribution in #3633
- @datvodinh made their first contribution in #3655
- @MapleEve made their first contribution in #3817
- @swuecho made their first contribution in #3810
- @brycereitano made their first contribution in #3895
- @bsdnet made their first contribution in #3889
- @fyxtro made their first contribution in #3855
- @natalyjazzviolin made their first contribution in #3962
Full Changelog: v0.1.32...v0.1.33
v0.1.32
New models
- WizardLM 2: State-of-the-art large language model from Microsoft AI with improved performance on complex chat, multilingual, reasoning and agent use cases.
  - `wizardlm2:8x22b`: large 8x22B model based on Mixtral 8x22B
  - `wizardlm2:7b`: fast, high-performing model based on Mistral 7B
- Snowflake Arctic Embed: A suite of text embedding models by Snowflake, optimized for performance.
- Command R+: a powerful, scalable large language model purpose-built for RAG use cases
- DBRX: A large, open, general-purpose 132B LLM created by Databricks.
- Mixtral 8x22B: the new leading Mixture of Experts (MoE) base model by Mistral AI.
What's Changed
- Ollama will now better utilize available VRAM, leading to fewer out-of-memory errors and better GPU utilization
- When running larger models that don't fit into VRAM on macOS, Ollama will now split the model between GPU and CPU to maximize performance.
- Fixed several issues where Ollama would hang upon encountering an error
- Fixed issue where using quotes in `OLLAMA_ORIGINS` would cause an error
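For example, a quoted value now parses as expected when starting the server (the origin URL here is hypothetical):

```
# Allow browser requests from an additional origin; the quoted value
# no longer causes an error on startup
OLLAMA_ORIGINS="http://localhost:3000" ollama serve
```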
New Contributors
- @sugarforever made their first contribution in #3400
- @yaroslavyaroslav made their first contribution in #3378
- @Nagi-ovo made their first contribution in #3423
- @ParisNeo made their first contribution in #3436
- @philippgille made their first contribution in #3437
- @cesto93 made their first contribution in #3461
- @ThomasVitale made their first contribution in #3515
- @writinwaters made their first contribution in #3539
- @alexmavr made their first contribution in #3555
Full Changelog: v0.1.31...v0.1.32
v0.1.31
Ollama supports embedding models. Bring your existing documents or other data and combine them with text prompts to build RAG (retrieval-augmented generation) apps using the Ollama REST API or the Python and JavaScript libraries.
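A small sketch of the REST API (assuming an embedding model such as `nomic-embed-text` has been pulled):

```
# Generate an embedding vector for a snippet of text
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "Llamas are members of the camelid family"
}'
# Response shape: {"embedding": [0.567, -0.123, ...]}
```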
New models
- Qwen 1.5 32B: A new 32B multilingual model competitive with larger models such as Mixtral
- StarlingLM Beta: A high-ranking 7B model on popular benchmarks with a permissive Apache 2.0 license.
- DolphinCoder StarCoder 7B: A 7B uncensored variant of the Dolphin model family that excels at coding, based on StarCoder2 7B.
- StableLM 1.6 Chat: A new version of StableLM 1.6 tuned for instruction
What's new
- Fixed issue where Ollama would hang when using certain Unicode characters in the prompt, such as emojis
Full Changelog: v0.1.30...v0.1.31
v0.1.30
New models
- Command R: a Large Language Model optimized for conversational interaction and long context tasks.
- mxbai-embed-large: A new state-of-the-art large embedding model
What's Changed
- Fixed various issues with `ollama run` on Windows:
  - History now works when pressing up and down arrow keys
  - Right and left arrow keys will now move the cursor appropriately
  - Pasting multi-line strings will now work on Windows
- Fixed issue where mounting or sharing files between Linux and Windows (e.g. via WSL or Docker) would cause errors due to having `:` in the filename
- Improved support for AMD MI300 and MI300X Accelerators
- Improved cleanup of temporary files resulting in better space utilization
Important change
For filesystem compatibility, Ollama has changed model data filenames to use `-` instead of `:`. This change will be applied automatically. If downgrading to 0.1.29 or lower from 0.1.30 (on Linux or macOS only), run:

```
find ~/.ollama/models/blobs -type f -exec bash -c 'mv "$0" "${0//-/:}"' {} \;
```
New Contributors
- @alitrack made their first contribution in #3111
- @drazdra made their first contribution in #3338
- @rapidarchitect made their first contribution in #3288
- @yusufcanb made their first contribution in #3274
- @jikkuatwork made their first contribution in #3178
- @timothycarambat made their first contribution in #3145
- @fly2tomato made their first contribution in #2946
- @enoch1118 made their first contribution in #2927
- @danny-avila made their first contribution in #2918
- @mmo80 made their first contribution in #2881
- @anaisbetts made their first contribution in #2428
- @marco-souza made their first contribution in #1905
- @guchenhe made their first contribution in #1944
- @herval made their first contribution in #1873
- @Npahlfer made their first contribution in #1623
- @remy415 made their first contribution in #2279
Full Changelog: v0.1.29...v0.1.30
v0.1.29
AMD Preview
Ollama now supports AMD graphics cards in preview on Windows and Linux. All of Ollama's features can now be accelerated by AMD graphics cards, and support is included by default in Ollama for Linux, Windows, and Docker.
Supported cards and accelerators
| Family | Supported cards and accelerators |
| --- | --- |
| AMD Radeon RX | 7900 XTX, 7900 XT, 7900 GRE, 7800 XT, 7700 XT, 7600 XT, 7600, 6950 XT, 6900 XTX, 6900XT, 6800 XT, 6800, Vega 64, Vega 56 |
| AMD Radeon PRO | W7900, W7800, W7700, W7600, W7500, W6900X, W6800X Duo, W6800X, W6800, V620, V420, V340, V320, Vega II Duo, Vega II, VII, SSG |
| AMD Instinct | MI300X, MI300A, MI300, MI250X, MI250, MI210, MI200, MI100, MI60, MI50 |
What's Changed
- `ollama <command> -h` will now show documentation for supported environment variables
- Fixed issue where generating embeddings with `nomic-embed-text`, `all-minilm`, or other embedding models would hang on Linux
- Experimental support for importing Safetensors models using the `FROM <directory with safetensors model>` command in the Modelfile (see the sketch after this list)
- Fixed issues where Ollama would hang when using JSON mode
- Fixed issue where `ollama run` would error when piping output to `tee` and other tools
- Fixed an issue where memory would not be released when running vision models
- Ollama will no longer show an error message when piping to stdin on Windows
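A minimal sketch of the experimental Safetensors import (the directory path and model name here are hypothetical):

```
# Point a Modelfile at a directory containing safetensors weights
echo 'FROM ./mistral-7b-safetensors' > Modelfile

# Build an Ollama model from the imported weights
ollama create my-mistral -f Modelfile
```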
New Contributors
- @tgraupmann made their first contribution in #2582
- @andersrex made their first contribution in #2909
- @leonid20000 made their first contribution in #2440
- @hishope made their first contribution in #2973
- @mrdjohnson made their first contribution in #2759
- @mofanke made their first contribution in #3077
- @racerole made their first contribution in #3073
- @Chris-AS1 made their first contribution in #3094
Full Changelog: v0.1.28...v0.1.29