feat: enable OLLAMA Arc GPU support with SYCL backend #3796

Open · gamunu wants to merge 21 commits into main

Conversation


@gamunu commented Apr 21, 2024

This is based on the original PR #2458 created by @felipeagc.

It seems that work on that pull request has come to a halt. I would like to pick this up over the next few days and accelerate progress. I have tested the build on Ubuntu LTS with an Arc A770 GPU.

I'm happy to progress the PR with community feedback.
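
For reference, a minimal sketch of the build and test steps, assuming the instructions from PR #2458 and a default oneAPI install under /opt/intel/oneapi (the exact commands I used may differ slightly):

```bash
# Assumed reproduction steps, not verbatim from this PR: load the oneAPI
# toolchain, build ollama with the SYCL runner, and start the server.
source /opt/intel/oneapi/setvars.sh    # exposes the SYCL compiler and Level-Zero libs
go generate ./...                      # builds the embedded llama.cpp runner variants
go build .                             # produces the ollama binary
ZES_ENABLE_SYSMAN=1 ./ollama serve     # SYSMAN lets Level-Zero report GPU memory
```

The startup log below is from a run like this: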

time=2024-04-21T17:39:51.870+05:30 level=INFO source=images.go:817 msg="total blobs: 0"
time=2024-04-21T17:39:51.870+05:30 level=INFO source=images.go:824 msg="total unused blobs removed: 0"
time=2024-04-21T17:39:51.874+05:30 level=INFO source=routes.go:1143 msg="Listening on 127.0.0.1:11434 (version 0.1.32-17-g91f1201-dirty)"
time=2024-04-21T17:39:51.874+05:30 level=INFO source=payload.go:28 msg="extracting embedded files" dir=/tmp/ollama2497595442/runners
time=2024-04-21T17:39:55.586+05:30 level=INFO source=payload.go:41 msg="Dynamic LLM libraries [cpu_avx cpu_avx2 cuda_v11 rocm_v60002 cpu]"
time=2024-04-21T17:39:55.586+05:30 level=INFO source=gpu.go:140 msg="Detecting GPU type"
time=2024-04-21T17:39:55.586+05:30 level=INFO source=gpu.go:320 msg="Searching for GPU management library libcudart.so*"
time=2024-04-21T17:39:55.588+05:30 level=INFO source=gpu.go:366 msg="Discovered GPU libraries: [/tmp/ollama2497595442/runners/cuda_v11/libcudart.so.11.0]"
time=2024-04-21T17:39:55.601+05:30 level=INFO source=gpu.go:395 msg="Unable to load cudart CUDA management library /tmp/ollama2497595442/runners/cuda_v11/libcudart.so.11.0: cudart init failure: 100"
time=2024-04-21T17:39:55.601+05:30 level=INFO source=gpu.go:320 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-04-21T17:39:55.603+05:30 level=INFO source=gpu.go:366 msg="Discovered GPU libraries: [/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.535.171.04]"
time=2024-04-21T17:39:55.608+05:30 level=INFO source=gpu.go:378 msg="Unable to load NVML management library /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.535.171.04: nvml vram init failure: 9"
time=2024-04-21T17:39:55.608+05:30 level=INFO source=gpu.go:320 msg="Searching for GPU management library libze_intel_gpu.so"
time=2024-04-21T17:39:55.610+05:30 level=INFO source=gpu.go:366 msg="Discovered GPU libraries: [/usr/lib/x86_64-linux-gnu/libze_intel_gpu.so.1.3.28202.51]"
time=2024-04-21T17:39:55.662+05:30 level=INFO source=gpu.go:166 msg="Intel GPU detected"
time=2024-04-21T17:39:55.662+05:30 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"

@gamunu changed the title from "Enable OLLAMA Arc GPU support with SYCL backend" to "feat: enable OLLAMA Arc GPU support with SYCL backend" on Apr 21, 2024
@mlim15 commented Apr 22, 2024

In its current state, this code always seems to use the CPU for inference on my system. I followed the build instructions from the other pull request. It seems to detect the GPU and prints out some relevant messages, but doesn't actually use it. In addition to my dGPU, I also have a Level-Zero-compatible iGPU; disabling it in the BIOS did not help. Compiling llama.cpp from source with SYCL support works fine on the same system, as does ComfyUI using IPEX for acceleration.

Arch Linux
Kernel 6.8.4
CPU Intel i5-12600k (also tried with iGPU disabled)
GPU Arc A580

intel-compute-runtime-bin 24.13.29138.7-1
intel-gpu-tools 1.27-2
intel-graphics-compiler-bin 1:1.0.16510.2-1
intel-media-driver 24.2.1-1
intel-oneapi-basekit 2024.0.0.49564-2
level-zero-loader 1.15.1-1
libva-intel-driver 2.4.1-2
libva-mesa-driver 1:24.0.5-1
linux 6.8.4.arch1-1
linux-firmware 20240409.1addd7dc-1
linux-firmware-whence 20240409.1addd7dc-1
mesa 1:24.0.5-1
ocl-icd 2.3.2-1
vulkan-icd-loader 1.3.279-1
vulkan-intel 1:24.0.5-1

sycl-ls output with iGPU disabled

[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.10.0.17_160000]
[opencl:cpu:1] Intel(R) OpenCL, 12th Gen Intel(R) Core(TM) i5-12600K OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A580 Graphics OpenCL 3.0 NEO  [24.13.29138.7]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A580 Graphics 1.3 [1.3.29138]

To get this working, I needed to do two things:

  1. The SYCL build of llama.cpp is not placed in the same location as the others and doesn't seem to get included in the final binary by default, at least when following the instructions I linked above. I needed to grab the oneapi folder produced in GIT_ROOT/llm/llama.cpp/build/linux/x86_64 and copy it beside the other folders in GIT_ROOT/llm/build/linux/x86_64. This is possibly an issue with the gen_linux.sh script: I did a fresh clone and followed the instructions in the other pull request exactly, and it exhibited the same behaviour. Compare its build dir line with the one used for ROCm above it, BUILD_DIR="../build/linux/${ARCH}/rocm${ROCM_VARIANT}"; it possibly needs to be changed to match (see the sketch after this list).
  2. I needed to remove this else if from the code; otherwise it would prevent usage of my GPU. I'm not sure whether this is specific to the A580, or whether it happens because I have a compatible iGPU in the system (despite it being disabled in the BIOS).
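
A sketch of the fix for (1) under these assumptions: the copy paths are the ones from item 1, while the gen_linux.sh variable is assumed to mirror the ROCm line quoted above, so treat it as illustrative rather than the final patch.

```bash
# Workaround for (1): copy the oneapi runner build next to the other
# variants so it gets embedded into the final binary.
# GIT_ROOT is the repository checkout.
cp -r "$GIT_ROOT/llm/llama.cpp/build/linux/x86_64/oneapi" \
      "$GIT_ROOT/llm/build/linux/x86_64/"

# Possible gen_linux.sh fix (assumed variable layout): point the oneapi
# build at the shared tree, matching the ROCm pattern
# BUILD_DIR="../build/linux/${ARCH}/rocm${ROCM_VARIANT}" quoted above.
BUILD_DIR="../build/linux/${ARCH}/oneapi"
```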

Here's the full console output with debug enabled with and without the change:

Debug output with SYCL acceleration working after removing the elif
dev/ollama » ZES_ENABLE_SYSMAN=1 OLLAMA_DEBUG=1 OLLAMA_HOST=0.0.0.0 ./ollama serve
time=2024-04-21T23:59:41.281-04:00 level=INFO source=images.go:817 msg="total blobs: 5"
time=2024-04-21T23:59:41.282-04:00 level=INFO source=images.go:824 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:	export GIN_MODE=release
 - using code:	gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST   /api/pull                 --> github.com/ollama/ollama/server.PullModelHandler (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/ollama/ollama/server.GenerateHandler (5 handlers)
[GIN-debug] POST   /api/chat                 --> github.com/ollama/ollama/server.ChatHandler (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/ollama/ollama/server.EmbeddingsHandler (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/ollama/ollama/server.CreateModelHandler (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/ollama/ollama/server.PushModelHandler (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/ollama/ollama/server.CopyModelHandler (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/ollama/ollama/server.DeleteModelHandler (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/ollama/ollama/server.ShowModelHandler (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/ollama/ollama/server.CreateBlobHandler (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/ollama/ollama/server.HeadBlobHandler (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.ChatHandler (6 handlers)
[GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/ollama/ollama/server.ListModelsHandler (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.ListModelsHandler (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-04-21T23:59:41.282-04:00 level=INFO source=routes.go:1143 msg="Listening on [::]:11434 (version 0.0.0)"
time=2024-04-21T23:59:41.282-04:00 level=INFO source=payload.go:28 msg="extracting embedded files" dir=/tmp/ollama1421623603/runners
time=2024-04-21T23:59:41.282-04:00 level=DEBUG source=payload.go:160 msg=extracting variant=cpu file=build/linux/x86_64/cpu/bin/ollama_llama_server.gz
time=2024-04-21T23:59:41.282-04:00 level=DEBUG source=payload.go:160 msg=extracting variant=cpu_avx file=build/linux/x86_64/cpu_avx/bin/ollama_llama_server.gz
time=2024-04-21T23:59:41.282-04:00 level=DEBUG source=payload.go:160 msg=extracting variant=cpu_avx2 file=build/linux/x86_64/cpu_avx2/bin/ollama_llama_server.gz
time=2024-04-21T23:59:41.282-04:00 level=DEBUG source=payload.go:160 msg=extracting variant=oneapi file=build/linux/x86_64/oneapi/bin/ollama_llama_server.gz
time=2024-04-21T23:59:41.327-04:00 level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama1421623603/runners/cpu
time=2024-04-21T23:59:41.327-04:00 level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama1421623603/runners/cpu_avx
time=2024-04-21T23:59:41.327-04:00 level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama1421623603/runners/cpu_avx2
time=2024-04-21T23:59:41.327-04:00 level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama1421623603/runners/oneapi
time=2024-04-21T23:59:41.327-04:00 level=INFO source=payload.go:41 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 oneapi]"
time=2024-04-21T23:59:41.327-04:00 level=DEBUG source=payload.go:42 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-04-21T23:59:41.327-04:00 level=INFO source=gpu.go:140 msg="Detecting GPU type"
time=2024-04-21T23:59:41.327-04:00 level=INFO source=gpu.go:317 msg="Searching for GPU management library libcudart.so*"
time=2024-04-21T23:59:41.327-04:00 level=DEBUG source=gpu.go:335 msg="gpu management search paths: [/tmp/ollama1421623603/runners/cuda*/libcudart.so* /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudart.so* /usr/lib/wsl/lib/libcudart.so* /usr/lib/wsl/drivers/*/libcudart.so* /opt/cuda/lib64/libcudart.so* /usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/aarch64-linux-gnu/libcudart.so* /usr/local/cuda/lib*/libcudart.so* /usr/lib*/libcudart.so* /usr/local/lib*/libcudart.so* /opt/intel/oneapi/tbb/2021.11/lib/intel64/gcc4.8/libcudart.so** /opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/lib/libcudart.so** /opt/intel/oneapi/mpi/2021.11/lib/libcudart.so** /opt/intel/oneapi/mkl/2024.0/lib/libcudart.so** /opt/intel/oneapi/ippcp/2021.9/lib/libcudart.so** /opt/intel/oneapi/ipp/2021.10/lib/libcudart.so** /opt/intel/oneapi/dpl/2022.3/lib/libcudart.so** /opt/intel/oneapi/dnnl/2024.0/lib/libcudart.so** /opt/intel/oneapi/debugger/2024.0/opt/debugger/lib/libcudart.so** /opt/intel/oneapi/dal/2024.0/lib/libcudart.so** /opt/intel/oneapi/compiler/2024.0/opt/oclfpga/host/linux64/lib/libcudart.so** /opt/intel/oneapi/compiler/2024.0/opt/compiler/lib/libcudart.so** /opt/intel/oneapi/compiler/2024.0/lib/libcudart.so** /opt/intel/oneapi/ccl/2021.11/lib/libcudart.so**]"
time=2024-04-21T23:59:41.330-04:00 level=INFO source=gpu.go:363 msg="Discovered GPU libraries: []"
time=2024-04-21T23:59:41.330-04:00 level=INFO source=gpu.go:317 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-04-21T23:59:41.330-04:00 level=DEBUG source=gpu.go:335 msg="gpu management search paths: [/usr/local/cuda/lib64/libnvidia-ml.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so* /usr/lib/x86_64-linux-gnu/libnvidia-ml.so* /usr/lib/wsl/lib/libnvidia-ml.so* /usr/lib/wsl/drivers/*/libnvidia-ml.so* /opt/cuda/lib64/libnvidia-ml.so* /usr/lib*/libnvidia-ml.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libnvidia-ml.so* /usr/lib/aarch64-linux-gnu/libnvidia-ml.so* /usr/local/lib*/libnvidia-ml.so* /opt/cuda/targets/x86_64-linux/lib/stubs/libnvidia-ml.so* /opt/intel/oneapi/tbb/2021.11/lib/intel64/gcc4.8/libnvidia-ml.so* /opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/lib/libnvidia-ml.so* /opt/intel/oneapi/mpi/2021.11/lib/libnvidia-ml.so* /opt/intel/oneapi/mkl/2024.0/lib/libnvidia-ml.so* /opt/intel/oneapi/ippcp/2021.9/lib/libnvidia-ml.so* /opt/intel/oneapi/ipp/2021.10/lib/libnvidia-ml.so* /opt/intel/oneapi/dpl/2022.3/lib/libnvidia-ml.so* /opt/intel/oneapi/dnnl/2024.0/lib/libnvidia-ml.so* /opt/intel/oneapi/debugger/2024.0/opt/debugger/lib/libnvidia-ml.so* /opt/intel/oneapi/dal/2024.0/lib/libnvidia-ml.so* /opt/intel/oneapi/compiler/2024.0/opt/oclfpga/host/linux64/lib/libnvidia-ml.so* /opt/intel/oneapi/compiler/2024.0/opt/compiler/lib/libnvidia-ml.so* /opt/intel/oneapi/compiler/2024.0/lib/libnvidia-ml.so* /opt/intel/oneapi/ccl/2021.11/lib/libnvidia-ml.so*]"
time=2024-04-21T23:59:41.333-04:00 level=INFO source=gpu.go:363 msg="Discovered GPU libraries: []"
time=2024-04-21T23:59:41.333-04:00 level=INFO source=gpu.go:317 msg="Searching for GPU management library libze_intel_gpu.so"
time=2024-04-21T23:59:41.333-04:00 level=DEBUG source=gpu.go:335 msg="gpu management search paths: [/usr/lib/x86_64-linux-gnu/libze_intel_gpu.so* /usr/lib*/libze_intel_gpu.so* /opt/intel/oneapi/tbb/2021.11/lib/intel64/gcc4.8/libze_intel_gpu.so* /opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/lib/libze_intel_gpu.so* /opt/intel/oneapi/mpi/2021.11/lib/libze_intel_gpu.so* /opt/intel/oneapi/mkl/2024.0/lib/libze_intel_gpu.so* /opt/intel/oneapi/ippcp/2021.9/lib/libze_intel_gpu.so* /opt/intel/oneapi/ipp/2021.10/lib/libze_intel_gpu.so* /opt/intel/oneapi/dpl/2022.3/lib/libze_intel_gpu.so* /opt/intel/oneapi/dnnl/2024.0/lib/libze_intel_gpu.so* /opt/intel/oneapi/debugger/2024.0/opt/debugger/lib/libze_intel_gpu.so* /opt/intel/oneapi/dal/2024.0/lib/libze_intel_gpu.so* /opt/intel/oneapi/compiler/2024.0/opt/oclfpga/host/linux64/lib/libze_intel_gpu.so* /opt/intel/oneapi/compiler/2024.0/opt/compiler/lib/libze_intel_gpu.so* /opt/intel/oneapi/compiler/2024.0/lib/libze_intel_gpu.so* /opt/intel/oneapi/ccl/2021.11/lib/libze_intel_gpu.so*]"
time=2024-04-21T23:59:41.336-04:00 level=INFO source=gpu.go:363 msg="Discovered GPU libraries: [/usr/lib/libze_intel_gpu.so.1.3.29138.7 /usr/lib64/libze_intel_gpu.so.1.3.29138.7]"
wiring Level-Zero management library functions in /usr/lib/libze_intel_gpu.so.1.3.29138.7
dlsym: zesInit
dlsym: zesDriverGet
dlsym: zesDeviceGet
dlsym: zesDeviceGetProperties
dlsym: zesDeviceEnumMemoryModules
dlsym: zesMemoryGetProperties
dlsym: zesMemoryGetState
time=2024-04-21T23:59:41.346-04:00 level=INFO source=gpu.go:166 msg="Intel GPU detected"
time=2024-04-21T23:59:41.346-04:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
discovered 1 Level-Zero drivers
discovered 1 Level-Zero devices
[0] oneAPI device name: Intel(R) Arc(TM) A580 Graphics
[0] oneAPI brand: unknown
[0] oneAPI vendor: Intel(R) Corporation
[0] oneAPI S/N: unknown
[0] oneAPI board number: unknown
discovered 1 Level-Zero memory modules
time=2024-04-21T23:59:45.853-04:00 level=DEBUG source=gguf.go:57 msg="model = &llm.gguf{containerGGUF:(*llm.containerGGUF)(0xc00015fac0), kv:llm.KV{}, tensors:[]*llm.Tensor(nil), parameters:0x0}"
time=2024-04-21T23:59:46.591-04:00 level=DEBUG source=gguf.go:193 msg="general.architecture = llama"
time=2024-04-21T23:59:46.592-04:00 level=INFO source=gpu.go:140 msg="Detecting GPU type"
time=2024-04-21T23:59:46.592-04:00 level=INFO source=gpu.go:317 msg="Searching for GPU management library libcudart.so*"
time=2024-04-21T23:59:46.592-04:00 level=DEBUG source=gpu.go:335 msg="gpu management search paths: [/tmp/ollama1421623603/runners/cuda*/libcudart.so* /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudart.so* /usr/lib/wsl/lib/libcudart.so* /usr/lib/wsl/drivers/*/libcudart.so* /opt/cuda/lib64/libcudart.so* /usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/aarch64-linux-gnu/libcudart.so* /usr/local/cuda/lib*/libcudart.so* /usr/lib*/libcudart.so* /usr/local/lib*/libcudart.so* /opt/intel/oneapi/tbb/2021.11/lib/intel64/gcc4.8/libcudart.so** /opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/lib/libcudart.so** /opt/intel/oneapi/mpi/2021.11/lib/libcudart.so** /opt/intel/oneapi/mkl/2024.0/lib/libcudart.so** /opt/intel/oneapi/ippcp/2021.9/lib/libcudart.so** /opt/intel/oneapi/ipp/2021.10/lib/libcudart.so** /opt/intel/oneapi/dpl/2022.3/lib/libcudart.so** /opt/intel/oneapi/dnnl/2024.0/lib/libcudart.so** /opt/intel/oneapi/debugger/2024.0/opt/debugger/lib/libcudart.so** /opt/intel/oneapi/dal/2024.0/lib/libcudart.so** /opt/intel/oneapi/compiler/2024.0/opt/oclfpga/host/linux64/lib/libcudart.so** /opt/intel/oneapi/compiler/2024.0/opt/compiler/lib/libcudart.so** /opt/intel/oneapi/compiler/2024.0/lib/libcudart.so** /opt/intel/oneapi/ccl/2021.11/lib/libcudart.so**]"
time=2024-04-21T23:59:46.595-04:00 level=INFO source=gpu.go:363 msg="Discovered GPU libraries: []"
time=2024-04-21T23:59:46.595-04:00 level=INFO source=gpu.go:317 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-04-21T23:59:46.595-04:00 level=DEBUG source=gpu.go:335 msg="gpu management search paths: [/usr/local/cuda/lib64/libnvidia-ml.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so* /usr/lib/x86_64-linux-gnu/libnvidia-ml.so* /usr/lib/wsl/lib/libnvidia-ml.so* /usr/lib/wsl/drivers/*/libnvidia-ml.so* /opt/cuda/lib64/libnvidia-ml.so* /usr/lib*/libnvidia-ml.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libnvidia-ml.so* /usr/lib/aarch64-linux-gnu/libnvidia-ml.so* /usr/local/lib*/libnvidia-ml.so* /opt/cuda/targets/x86_64-linux/lib/stubs/libnvidia-ml.so* /opt/intel/oneapi/tbb/2021.11/lib/intel64/gcc4.8/libnvidia-ml.so* /opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/lib/libnvidia-ml.so* /opt/intel/oneapi/mpi/2021.11/lib/libnvidia-ml.so* /opt/intel/oneapi/mkl/2024.0/lib/libnvidia-ml.so* /opt/intel/oneapi/ippcp/2021.9/lib/libnvidia-ml.so* /opt/intel/oneapi/ipp/2021.10/lib/libnvidia-ml.so* /opt/intel/oneapi/dpl/2022.3/lib/libnvidia-ml.so* /opt/intel/oneapi/dnnl/2024.0/lib/libnvidia-ml.so* /opt/intel/oneapi/debugger/2024.0/opt/debugger/lib/libnvidia-ml.so* /opt/intel/oneapi/dal/2024.0/lib/libnvidia-ml.so* /opt/intel/oneapi/compiler/2024.0/opt/oclfpga/host/linux64/lib/libnvidia-ml.so* /opt/intel/oneapi/compiler/2024.0/opt/compiler/lib/libnvidia-ml.so* /opt/intel/oneapi/compiler/2024.0/lib/libnvidia-ml.so* /opt/intel/oneapi/ccl/2021.11/lib/libnvidia-ml.so*]"
time=2024-04-21T23:59:46.598-04:00 level=INFO source=gpu.go:363 msg="Discovered GPU libraries: []"
time=2024-04-21T23:59:46.598-04:00 level=INFO source=gpu.go:317 msg="Searching for GPU management library libze_intel_gpu.so"
time=2024-04-21T23:59:46.598-04:00 level=DEBUG source=gpu.go:335 msg="gpu management search paths: [/usr/lib/x86_64-linux-gnu/libze_intel_gpu.so* /usr/lib*/libze_intel_gpu.so* /opt/intel/oneapi/tbb/2021.11/lib/intel64/gcc4.8/libze_intel_gpu.so* /opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/lib/libze_intel_gpu.so* /opt/intel/oneapi/mpi/2021.11/lib/libze_intel_gpu.so* /opt/intel/oneapi/mkl/2024.0/lib/libze_intel_gpu.so* /opt/intel/oneapi/ippcp/2021.9/lib/libze_intel_gpu.so* /opt/intel/oneapi/ipp/2021.10/lib/libze_intel_gpu.so* /opt/intel/oneapi/dpl/2022.3/lib/libze_intel_gpu.so* /opt/intel/oneapi/dnnl/2024.0/lib/libze_intel_gpu.so* /opt/intel/oneapi/debugger/2024.0/opt/debugger/lib/libze_intel_gpu.so* /opt/intel/oneapi/dal/2024.0/lib/libze_intel_gpu.so* /opt/intel/oneapi/compiler/2024.0/opt/oclfpga/host/linux64/lib/libze_intel_gpu.so* /opt/intel/oneapi/compiler/2024.0/opt/compiler/lib/libze_intel_gpu.so* /opt/intel/oneapi/compiler/2024.0/lib/libze_intel_gpu.so* /opt/intel/oneapi/ccl/2021.11/lib/libze_intel_gpu.so*]"
time=2024-04-21T23:59:46.600-04:00 level=INFO source=gpu.go:363 msg="Discovered GPU libraries: [/usr/lib/libze_intel_gpu.so.1.3.29138.7 /usr/lib64/libze_intel_gpu.so.1.3.29138.7]"
wiring Level-Zero management library functions in /usr/lib/libze_intel_gpu.so.1.3.29138.7
dlsym: zesInit
dlsym: zesDriverGet
dlsym: zesDeviceGet
dlsym: zesDeviceGetProperties
dlsym: zesDeviceEnumMemoryModules
dlsym: zesMemoryGetProperties
dlsym: zesMemoryGetState
time=2024-04-21T23:59:46.601-04:00 level=INFO source=gpu.go:166 msg="Intel GPU detected"
time=2024-04-21T23:59:46.601-04:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
discovered 1 Level-Zero drivers
discovered 1 Level-Zero devices
[0] oneAPI device name: Intel(R) Arc(TM) A580 Graphics
[0] oneAPI brand: unknown
[0] oneAPI vendor: Intel(R) Corporation
[0] oneAPI S/N: unknown
[0] oneAPI board number: unknown
discovered 1 Level-Zero memory modules
time=2024-04-21T23:59:46.601-04:00 level=INFO source=gpu.go:140 msg="Detecting GPU type"
time=2024-04-21T23:59:46.601-04:00 level=INFO source=gpu.go:317 msg="Searching for GPU management library libcudart.so*"
time=2024-04-21T23:59:46.601-04:00 level=DEBUG source=gpu.go:335 msg="gpu management search paths: [/tmp/ollama1421623603/runners/cuda*/libcudart.so* /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudart.so* /usr/lib/wsl/lib/libcudart.so* /usr/lib/wsl/drivers/*/libcudart.so* /opt/cuda/lib64/libcudart.so* /usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/aarch64-linux-gnu/libcudart.so* /usr/local/cuda/lib*/libcudart.so* /usr/lib*/libcudart.so* /usr/local/lib*/libcudart.so* /opt/intel/oneapi/tbb/2021.11/lib/intel64/gcc4.8/libcudart.so** /opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/lib/libcudart.so** /opt/intel/oneapi/mpi/2021.11/lib/libcudart.so** /opt/intel/oneapi/mkl/2024.0/lib/libcudart.so** /opt/intel/oneapi/ippcp/2021.9/lib/libcudart.so** /opt/intel/oneapi/ipp/2021.10/lib/libcudart.so** /opt/intel/oneapi/dpl/2022.3/lib/libcudart.so** /opt/intel/oneapi/dnnl/2024.0/lib/libcudart.so** /opt/intel/oneapi/debugger/2024.0/opt/debugger/lib/libcudart.so** /opt/intel/oneapi/dal/2024.0/lib/libcudart.so** /opt/intel/oneapi/compiler/2024.0/opt/oclfpga/host/linux64/lib/libcudart.so** /opt/intel/oneapi/compiler/2024.0/opt/compiler/lib/libcudart.so** /opt/intel/oneapi/compiler/2024.0/lib/libcudart.so** /opt/intel/oneapi/ccl/2021.11/lib/libcudart.so**]"
time=2024-04-21T23:59:46.603-04:00 level=INFO source=gpu.go:363 msg="Discovered GPU libraries: []"
time=2024-04-21T23:59:46.603-04:00 level=INFO source=gpu.go:317 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-04-21T23:59:46.603-04:00 level=DEBUG source=gpu.go:335 msg="gpu management search paths: [/usr/local/cuda/lib64/libnvidia-ml.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so* /usr/lib/x86_64-linux-gnu/libnvidia-ml.so* /usr/lib/wsl/lib/libnvidia-ml.so* /usr/lib/wsl/drivers/*/libnvidia-ml.so* /opt/cuda/lib64/libnvidia-ml.so* /usr/lib*/libnvidia-ml.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libnvidia-ml.so* /usr/lib/aarch64-linux-gnu/libnvidia-ml.so* /usr/local/lib*/libnvidia-ml.so* /opt/cuda/targets/x86_64-linux/lib/stubs/libnvidia-ml.so* /opt/intel/oneapi/tbb/2021.11/lib/intel64/gcc4.8/libnvidia-ml.so* /opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/lib/libnvidia-ml.so* /opt/intel/oneapi/mpi/2021.11/lib/libnvidia-ml.so* /opt/intel/oneapi/mkl/2024.0/lib/libnvidia-ml.so* /opt/intel/oneapi/ippcp/2021.9/lib/libnvidia-ml.so* /opt/intel/oneapi/ipp/2021.10/lib/libnvidia-ml.so* /opt/intel/oneapi/dpl/2022.3/lib/libnvidia-ml.so* /opt/intel/oneapi/dnnl/2024.0/lib/libnvidia-ml.so* /opt/intel/oneapi/debugger/2024.0/opt/debugger/lib/libnvidia-ml.so* /opt/intel/oneapi/dal/2024.0/lib/libnvidia-ml.so* /opt/intel/oneapi/compiler/2024.0/opt/oclfpga/host/linux64/lib/libnvidia-ml.so* /opt/intel/oneapi/compiler/2024.0/opt/compiler/lib/libnvidia-ml.so* /opt/intel/oneapi/compiler/2024.0/lib/libnvidia-ml.so* /opt/intel/oneapi/ccl/2021.11/lib/libnvidia-ml.so*]"
time=2024-04-21T23:59:46.606-04:00 level=INFO source=gpu.go:363 msg="Discovered GPU libraries: []"
time=2024-04-21T23:59:46.606-04:00 level=INFO source=gpu.go:317 msg="Searching for GPU management library libze_intel_gpu.so"
time=2024-04-21T23:59:46.606-04:00 level=DEBUG source=gpu.go:335 msg="gpu management search paths: [/usr/lib/x86_64-linux-gnu/libze_intel_gpu.so* /usr/lib*/libze_intel_gpu.so* /opt/intel/oneapi/tbb/2021.11/lib/intel64/gcc4.8/libze_intel_gpu.so* /opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/lib/libze_intel_gpu.so* /opt/intel/oneapi/mpi/2021.11/lib/libze_intel_gpu.so* /opt/intel/oneapi/mkl/2024.0/lib/libze_intel_gpu.so* /opt/intel/oneapi/ippcp/2021.9/lib/libze_intel_gpu.so* /opt/intel/oneapi/ipp/2021.10/lib/libze_intel_gpu.so* /opt/intel/oneapi/dpl/2022.3/lib/libze_intel_gpu.so* /opt/intel/oneapi/dnnl/2024.0/lib/libze_intel_gpu.so* /opt/intel/oneapi/debugger/2024.0/opt/debugger/lib/libze_intel_gpu.so* /opt/intel/oneapi/dal/2024.0/lib/libze_intel_gpu.so* /opt/intel/oneapi/compiler/2024.0/opt/oclfpga/host/linux64/lib/libze_intel_gpu.so* /opt/intel/oneapi/compiler/2024.0/opt/compiler/lib/libze_intel_gpu.so* /opt/intel/oneapi/compiler/2024.0/lib/libze_intel_gpu.so* /opt/intel/oneapi/ccl/2021.11/lib/libze_intel_gpu.so*]"
time=2024-04-21T23:59:46.609-04:00 level=INFO source=gpu.go:363 msg="Discovered GPU libraries: [/usr/lib/libze_intel_gpu.so.1.3.29138.7 /usr/lib64/libze_intel_gpu.so.1.3.29138.7]"
wiring Level-Zero management library functions in /usr/lib/libze_intel_gpu.so.1.3.29138.7
dlsym: zesInit
dlsym: zesDriverGet
dlsym: zesDeviceGet
dlsym: zesDeviceGetProperties
dlsym: zesDeviceEnumMemoryModules
dlsym: zesMemoryGetProperties
dlsym: zesMemoryGetState
time=2024-04-21T23:59:46.609-04:00 level=INFO source=gpu.go:166 msg="Intel GPU detected"
time=2024-04-21T23:59:46.609-04:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
discovered 1 Level-Zero drivers
discovered 1 Level-Zero devices
[0] oneAPI device name: Intel(R) Arc(TM) A580 Graphics
[0] oneAPI brand: unknown
[0] oneAPI vendor: Intel(R) Corporation
[0] oneAPI S/N: unknown
[0] oneAPI board number: unknown
discovered 1 Level-Zero memory modules
time=2024-04-21T23:59:46.609-04:00 level=INFO source=server.go:135 msg="offload to gpu" layers.real=33 layers.estimate=33 memory.available="8128.0 MiB" memory.required.full="5879.9 MiB" memory.required.partial="5879.9 MiB" memory.required.kv="256.0 MiB" memory.weights.total="5459.9 MiB" memory.weights.repeating="4704.5 MiB" memory.weights.nonrepeating="755.4 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="677.5 MiB"
time=2024-04-21T23:59:46.609-04:00 level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama1421623603/runners/cpu
time=2024-04-21T23:59:46.609-04:00 level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama1421623603/runners/cpu_avx
time=2024-04-21T23:59:46.609-04:00 level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama1421623603/runners/cpu_avx2
time=2024-04-21T23:59:46.609-04:00 level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama1421623603/runners/oneapi
time=2024-04-21T23:59:46.609-04:00 level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama1421623603/runners/cpu
time=2024-04-21T23:59:46.609-04:00 level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama1421623603/runners/cpu_avx
time=2024-04-21T23:59:46.609-04:00 level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama1421623603/runners/cpu_avx2
time=2024-04-21T23:59:46.609-04:00 level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama1421623603/runners/oneapi
time=2024-04-21T23:59:46.609-04:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-04-21T23:59:46.610-04:00 level=DEBUG source=server.go:296 msg="LD_LIBRARY_PATH=/opt/intel/oneapi/tbb/2021.11/env/../lib/intel64/gcc4.8:/opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/lib:/opt/intel/oneapi/mpi/2021.11/lib:/opt/intel/oneapi/mkl/2024.0/lib:/opt/intel/oneapi/ippcp/2021.9/lib/:/opt/intel/oneapi/ipp/2021.10/lib:/opt/intel/oneapi/dpl/2022.3/lib:/opt/intel/oneapi/dnnl/2024.0/lib:/opt/intel/oneapi/debugger/2024.0/opt/debugger/lib:/opt/intel/oneapi/dal/2024.0/lib:/opt/intel/oneapi/compiler/2024.0/opt/oclfpga/host/linux64/lib:/opt/intel/oneapi/compiler/2024.0/opt/compiler/lib:/opt/intel/oneapi/compiler/2024.0/lib:/opt/intel/oneapi/ccl/2021.11/lib/:/tmp/ollama1421623603/runners/oneapi"
time=2024-04-21T23:59:46.610-04:00 level=INFO source=server.go:301 msg="starting llama server" cmd="/tmp/ollama1421623603/runners/oneapi/ollama_llama_server --model /home/morgan/.ollama/models/blobs/sha256-6d30e88b357bc17d40dd2f50f24c26eefe3df19b3ec4e1c7456f8263c089cbdb --ctx-size 2048 --batch-size 512 --embedding --log-format json --n-gpu-layers 33 --verbose --port 42181"
time=2024-04-21T23:59:46.611-04:00 level=INFO source=server.go:426 msg="waiting for llama runner to start responding"
{"function":"server_params_parse","level":"WARN","line":2494,"msg":"server.cpp is not built with verbose logging.","tid":"125196650092544","timestamp":1713758386}
{"build":2679,"commit":"7593639","function":"main","level":"INFO","line":2820,"msg":"build info","tid":"125196650092544","timestamp":1713758386}
{"function":"main","level":"INFO","line":2827,"msg":"system info","n_threads":8,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | ","tid":"125196650092544","timestamp":1713758386,"total_threads":16}
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /home/morgan/.ollama/models/blobs/sha256-6d30e88b357bc17d40dd2f50f24c26eefe3df19b3ec4e1c7456f8263c089cbdb (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = .
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 17
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,128256]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q5_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
time=2024-04-21T23:59:46.861-04:00 level=DEBUG source=server.go:457 msg="server not yet available" error="server not responding"
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 5.33 GiB (5.70 BPW) 
llm_load_print_meta: general.name     = .
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 4 SYCL devices:
|  |                  |                                             |Compute   |Max compute|Max work|Max sub|               |
|ID|       Device Type|                                         Name|capability|units      |group   |group  |Global mem size|
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
| 0|[level_zero:gpu:0]|               Intel(R) Arc(TM) A580 Graphics|       1.3|        384|    1024|     32|     8096681984|
| 1|    [opencl:gpu:0]|               Intel(R) Arc(TM) A580 Graphics|       3.0|        384|    1024|     32|     8096681984|
| 2|    [opencl:cpu:0]|         12th Gen Intel(R) Core(TM) i5-12600K|       3.0|         16|    8192|     64|    32892153856|
| 3|    [opencl:acc:0]|               Intel(R) FPGA Emulation Device|       1.2|         16|67108864|     64|    32892153856|
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:384
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  5115.49 MiB
llm_load_tensors:        CPU buffer size =   344.44 MiB
........................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      SYCL0 KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.50 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =   258.50 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    12.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
{"function":"initialize","level":"INFO","line":448,"msg":"initializing slots","n_slots":1,"tid":"125196650092544","timestamp":1713758389}
{"function":"initialize","level":"INFO","line":460,"msg":"new slot","n_ctx_slot":2048,"slot_id":0,"tid":"125196650092544","timestamp":1713758389}
{"function":"main","level":"INFO","line":3064,"msg":"model loaded","tid":"125196650092544","timestamp":1713758389}
{"function":"validate_model_chat_template","level":"ERR","line":437,"msg":"The chat template comes with this model is not yet supported, falling back to chatml. This may cause the model to output suboptimal responses","tid":"125196650092544","timestamp":1713758389}
{"function":"main","hostname":"127.0.0.1","level":"INFO","line":3267,"msg":"HTTP server listening","n_threads_http":"15","port":"42181","tid":"125196650092544","timestamp":1713758389}
{"function":"update_slots","level":"INFO","line":1578,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"125196650092544","timestamp":1713758389}
{"function":"process_single_task","level":"INFO","line":1510,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":0,"tid":"125196650092544","timestamp":1713758389}
{"function":"process_single_task","level":"INFO","line":1510,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":1,"tid":"125196650092544","timestamp":1713758389}
{"function":"log_server_request","level":"INFO","line":2741,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":38466,"status":200,"tid":"125193968158400","timestamp":1713758389}
{"function":"process_single_task","level":"INFO","line":1510,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":2,"tid":"125196650092544","timestamp":1713758389}
{"function":"process_single_task","level":"INFO","line":1510,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":3,"tid":"125196650092544","timestamp":1713758389}
{"function":"log_server_request","level":"INFO","line":2741,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":38478,"status":200,"tid":"125193498396352","timestamp":1713758389}
{"function":"log_server_request","level":"INFO","line":2741,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":38464,"status":200,"tid":"125193487910592","timestamp":1713758389}
{"function":"process_single_task","level":"INFO","line":1510,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":5,"tid":"125196650092544","timestamp":1713758389}
{"function":"log_server_request","level":"INFO","line":2741,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":38484,"status":200,"tid":"125193978644160","timestamp":1713758389}
{"function":"process_single_task","level":"INFO","line":1510,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":4,"tid":"125196650092544","timestamp":1713758389}
{"function":"log_server_request","level":"INFO","line":2741,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":38494,"status":200,"tid":"125193957672640","timestamp":1713758389}
{"function":"log_server_request","level":"INFO","line":2741,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":38490,"status":200,"tid":"125193947186880","timestamp":1713758389}
{"function":"process_single_task","level":"INFO","line":1510,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":6,"tid":"125196650092544","timestamp":1713758389}
{"function":"log_server_request","level":"INFO","line":2741,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":38552,"status":200,"tid":"125193720694464","timestamp":1713758389}
time=2024-04-21T23:59:49.472-04:00 level=DEBUG source=server.go:468 msg="llama runner started in 2.861372 seconds"
{"function":"process_single_task","level":"INFO","line":1510,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":7,"tid":"125196650092544","timestamp":1713758389}
{"function":"log_server_request","level":"INFO","line":2741,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":38552,"status":200,"tid":"125193720694464","timestamp":1713758389}
{"function":"log_server_request","level":"INFO","line":2741,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":38552,"status":200,"tid":"125193720694464","timestamp":1713758389}
time=2024-04-21T23:59:49.559-04:00 level=DEBUG source=prompt.go:172 msg="prompt now fits in context window" required=68 window=2048
time=2024-04-21T23:59:49.559-04:00 level=DEBUG source=routes.go:1347 msg="chat handler" prompt="<|start_header_id|>system<|end_header_id|>\n\nThis is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nTell me a random fun fact about the Roman Empire<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" images=0
{"function":"process_single_task","level":"INFO","line":1510,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":8,"tid":"125196650092544","timestamp":1713758389}
{"function":"log_server_request","level":"INFO","line":2741,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":38552,"status":200,"tid":"125193720694464","timestamp":1713758389}
{"function":"launch_slot_with_data","level":"INFO","line":833,"msg":"slot is processing task","slot_id":0,"task_id":9,"tid":"125196650092544","timestamp":1713758389}
{"function":"update_slots","ga_i":0,"level":"INFO","line":1816,"msg":"slot progression","n_past":0,"n_past_se":0,"n_prompt_tokens_processed":66,"slot_id":0,"task_id":9,"tid":"125196650092544","timestamp":1713758389}
{"function":"update_slots","level":"INFO","line":1840,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":9,"tid":"125196650092544","timestamp":1713758389}
^Ctime=2024-04-22T00:00:18.410-04:00 level=DEBUG source=server.go:869 msg="stopping llama server"
time=2024-04-22T00:00:18.410-04:00 level=DEBUG source=assets.go:94 msg="cleaning up" dir=/tmp/ollama1421623603
Debug output before removing the elif, resulting in running on the CPU (I also had it print out `memInfo` and `memInfo.count` before the other print line)
dev/ollama2 » OLLAMA_DEBUG=1 OLLAMA_HOST=0.0.0.0 ./ollama serve
time=2024-04-22T00:19:00.010-04:00 level=INFO source=images.go:817 msg="total blobs: 5"
time=2024-04-22T00:19:00.010-04:00 level=INFO source=images.go:824 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
- using env:	export GIN_MODE=release
- using code:	gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST   /api/pull                 --> github.com/ollama/ollama/server.PullModelHandler (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/ollama/ollama/server.GenerateHandler (5 handlers)
[GIN-debug] POST   /api/chat                 --> github.com/ollama/ollama/server.ChatHandler (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/ollama/ollama/server.EmbeddingsHandler (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/ollama/ollama/server.CreateModelHandler (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/ollama/ollama/server.PushModelHandler (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/ollama/ollama/server.CopyModelHandler (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/ollama/ollama/server.DeleteModelHandler (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/ollama/ollama/server.ShowModelHandler (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/ollama/ollama/server.CreateBlobHandler (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/ollama/ollama/server.HeadBlobHandler (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.ChatHandler (6 handlers)
[GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/ollama/ollama/server.ListModelsHandler (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.ListModelsHandler (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-04-22T00:19:00.010-04:00 level=INFO source=routes.go:1143 msg="Listening on [::]:11434 (version 0.0.0)"
time=2024-04-22T00:19:00.010-04:00 level=INFO source=payload.go:28 msg="extracting embedded files" dir=/tmp/ollama2043914106/runners
time=2024-04-22T00:19:00.010-04:00 level=DEBUG source=payload.go:160 msg=extracting variant=cpu file=build/linux/x86_64/cpu/bin/ollama_llama_server.gz
time=2024-04-22T00:19:00.010-04:00 level=DEBUG source=payload.go:160 msg=extracting variant=cpu_avx file=build/linux/x86_64/cpu_avx/bin/ollama_llama_server.gz
time=2024-04-22T00:19:00.010-04:00 level=DEBUG source=payload.go:160 msg=extracting variant=cpu_avx2 file=build/linux/x86_64/cpu_avx2/bin/ollama_llama_server.gz
time=2024-04-22T00:19:00.010-04:00 level=DEBUG source=payload.go:160 msg=extracting variant=oneapi file=build/linux/x86_64/oneapi/bin/ollama_llama_server.gz
time=2024-04-22T00:19:00.041-04:00 level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama2043914106/runners/cpu
time=2024-04-22T00:19:00.041-04:00 level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama2043914106/runners/cpu_avx
time=2024-04-22T00:19:00.041-04:00 level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama2043914106/runners/cpu_avx2
time=2024-04-22T00:19:00.041-04:00 level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama2043914106/runners/oneapi
time=2024-04-22T00:19:00.041-04:00 level=INFO source=payload.go:41 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 oneapi]"
time=2024-04-22T00:19:00.041-04:00 level=DEBUG source=payload.go:42 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-04-22T00:19:00.041-04:00 level=INFO source=gpu.go:140 msg="Detecting GPU type"
time=2024-04-22T00:19:00.041-04:00 level=INFO source=gpu.go:322 msg="Searching for GPU management library libcudart.so*"
time=2024-04-22T00:19:00.041-04:00 level=DEBUG source=gpu.go:340 msg="gpu management search paths: [/tmp/ollama2043914106/runners/cuda*/libcudart.so* /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudart.so* /usr/lib/wsl/lib/libcudart.so* /usr/lib/wsl/drivers/*/libcudart.so* /opt/cuda/lib64/libcudart.so* /usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/aarch64-linux-gnu/libcudart.so* /usr/local/cuda/lib*/libcudart.so* /usr/lib*/libcudart.so* /usr/local/lib*/libcudart.so* /opt/intel/oneapi/tbb/2021.11/lib/intel64/gcc4.8/libcudart.so** /opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/lib/libcudart.so** /opt/intel/oneapi/mpi/2021.11/lib/libcudart.so** /opt/intel/oneapi/mkl/2024.0/lib/libcudart.so** /opt/intel/oneapi/ippcp/2021.9/lib/libcudart.so** /opt/intel/oneapi/ipp/2021.10/lib/libcudart.so** /opt/intel/oneapi/dpl/2022.3/lib/libcudart.so** /opt/intel/oneapi/dnnl/2024.0/lib/libcudart.so** /opt/intel/oneapi/debugger/2024.0/opt/debugger/lib/libcudart.so** /opt/intel/oneapi/dal/2024.0/lib/libcudart.so** /opt/intel/oneapi/compiler/2024.0/opt/oclfpga/host/linux64/lib/libcudart.so** /opt/intel/oneapi/compiler/2024.0/opt/compiler/lib/libcudart.so** /opt/intel/oneapi/compiler/2024.0/lib/libcudart.so** /opt/intel/oneapi/ccl/2021.11/lib/libcudart.so**]"
time=2024-04-22T00:19:00.044-04:00 level=INFO source=gpu.go:368 msg="Discovered GPU libraries: []"
time=2024-04-22T00:19:00.044-04:00 level=INFO source=gpu.go:322 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-04-22T00:19:00.045-04:00 level=DEBUG source=gpu.go:340 msg="gpu management search paths: [/usr/local/cuda/lib64/libnvidia-ml.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so* /usr/lib/x86_64-linux-gnu/libnvidia-ml.so* /usr/lib/wsl/lib/libnvidia-ml.so* /usr/lib/wsl/drivers/*/libnvidia-ml.so* /opt/cuda/lib64/libnvidia-ml.so* /usr/lib*/libnvidia-ml.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libnvidia-ml.so* /usr/lib/aarch64-linux-gnu/libnvidia-ml.so* /usr/local/lib*/libnvidia-ml.so* /opt/cuda/targets/x86_64-linux/lib/stubs/libnvidia-ml.so* /opt/intel/oneapi/tbb/2021.11/lib/intel64/gcc4.8/libnvidia-ml.so* /opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/lib/libnvidia-ml.so* /opt/intel/oneapi/mpi/2021.11/lib/libnvidia-ml.so* /opt/intel/oneapi/mkl/2024.0/lib/libnvidia-ml.so* /opt/intel/oneapi/ippcp/2021.9/lib/libnvidia-ml.so* /opt/intel/oneapi/ipp/2021.10/lib/libnvidia-ml.so* /opt/intel/oneapi/dpl/2022.3/lib/libnvidia-ml.so* /opt/intel/oneapi/dnnl/2024.0/lib/libnvidia-ml.so* /opt/intel/oneapi/debugger/2024.0/opt/debugger/lib/libnvidia-ml.so* /opt/intel/oneapi/dal/2024.0/lib/libnvidia-ml.so* /opt/intel/oneapi/compiler/2024.0/opt/oclfpga/host/linux64/lib/libnvidia-ml.so* /opt/intel/oneapi/compiler/2024.0/opt/compiler/lib/libnvidia-ml.so* /opt/intel/oneapi/compiler/2024.0/lib/libnvidia-ml.so* /opt/intel/oneapi/ccl/2021.11/lib/libnvidia-ml.so*]"
time=2024-04-22T00:19:00.048-04:00 level=INFO source=gpu.go:368 msg="Discovered GPU libraries: []"
time=2024-04-22T00:19:00.048-04:00 level=INFO source=gpu.go:322 msg="Searching for GPU management library libze_intel_gpu.so"
time=2024-04-22T00:19:00.048-04:00 level=DEBUG source=gpu.go:340 msg="gpu management search paths: [/usr/lib/x86_64-linux-gnu/libze_intel_gpu.so* /usr/lib*/libze_intel_gpu.so* /opt/intel/oneapi/tbb/2021.11/lib/intel64/gcc4.8/libze_intel_gpu.so* /opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/lib/libze_intel_gpu.so* /opt/intel/oneapi/mpi/2021.11/lib/libze_intel_gpu.so* /opt/intel/oneapi/mkl/2024.0/lib/libze_intel_gpu.so* /opt/intel/oneapi/ippcp/2021.9/lib/libze_intel_gpu.so* /opt/intel/oneapi/ipp/2021.10/lib/libze_intel_gpu.so* /opt/intel/oneapi/dpl/2022.3/lib/libze_intel_gpu.so* /opt/intel/oneapi/dnnl/2024.0/lib/libze_intel_gpu.so* /opt/intel/oneapi/debugger/2024.0/opt/debugger/lib/libze_intel_gpu.so* /opt/intel/oneapi/dal/2024.0/lib/libze_intel_gpu.so* /opt/intel/oneapi/compiler/2024.0/opt/oclfpga/host/linux64/lib/libze_intel_gpu.so* /opt/intel/oneapi/compiler/2024.0/opt/compiler/lib/libze_intel_gpu.so* /opt/intel/oneapi/compiler/2024.0/lib/libze_intel_gpu.so* /opt/intel/oneapi/ccl/2021.11/lib/libze_intel_gpu.so*]"
time=2024-04-22T00:19:00.051-04:00 level=INFO source=gpu.go:368 msg="Discovered GPU libraries: [/usr/lib/libze_intel_gpu.so.1.3.29138.7 /usr/lib64/libze_intel_gpu.so.1.3.29138.7]"
wiring Level-Zero management library functions in /usr/lib/libze_intel_gpu.so.1.3.29138.7
dlsym: zesInit
dlsym: zesDriverGet
dlsym: zesDeviceGet
dlsym: zesDeviceGetProperties
dlsym: zesDeviceEnumMemoryModules
dlsym: zesMemoryGetProperties
dlsym: zesMemoryGetState
time=2024-04-22T00:19:00.061-04:00 level=INFO source=gpu.go:166 msg="Intel GPU detected"
time=2024-04-22T00:19:00.062-04:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
discovered 1 Level-Zero drivers
discovered 1 Level-Zero devices
[0] oneAPI device name: Intel(R) Arc(TM) A580 Graphics
[0] oneAPI brand: unknown
[0] oneAPI vendor: Intel(R) Corporation
[0] oneAPI S/N: unknown
[0] oneAPI board number: unknown
discovered 1 Level-Zero memory modules
time=2024-04-22T00:19:00.062-04:00 level=INFO source=gpu.go:246 msg="{8522825728 8522825728 1 0 <nil>}"
time=2024-04-22T00:19:00.062-04:00 level=INFO source=gpu.go:247 msg=1
time=2024-04-22T00:19:00.062-04:00 level=INFO source=gpu.go:248 msg="OneAPI unsupported integrated GPU detected"
time=2024-04-22T00:19:00.062-04:00 level=INFO source=routes.go:1164 msg="no GPU detected"
time=2024-04-22T00:19:22.568-04:00 level=DEBUG source=gguf.go:57 msg="model = &llm.gguf{containerGGUF:(*llm.containerGGUF)(0xc000127880), kv:llm.KV{}, tensors:[]*llm.Tensor(nil), parameters:0x0}"
time=2024-04-22T00:19:23.269-04:00 level=DEBUG source=gguf.go:193 msg="general.architecture = llama"
time=2024-04-22T00:19:23.270-04:00 level=INFO source=gpu.go:140 msg="Detecting GPU type"
time=2024-04-22T00:19:23.270-04:00 level=INFO source=gpu.go:322 msg="Searching for GPU management library libcudart.so*"
time=2024-04-22T00:19:23.270-04:00 level=DEBUG source=gpu.go:340 msg="gpu management search paths: [/tmp/ollama2043914106/runners/cuda*/libcudart.so* /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudart.so* /usr/lib/wsl/lib/libcudart.so* /usr/lib/wsl/drivers/*/libcudart.so* /opt/cuda/lib64/libcudart.so* /usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/aarch64-linux-gnu/libcudart.so* /usr/local/cuda/lib*/libcudart.so* /usr/lib*/libcudart.so* /usr/local/lib*/libcudart.so* /opt/intel/oneapi/tbb/2021.11/lib/intel64/gcc4.8/libcudart.so** /opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/lib/libcudart.so** /opt/intel/oneapi/mpi/2021.11/lib/libcudart.so** /opt/intel/oneapi/mkl/2024.0/lib/libcudart.so** /opt/intel/oneapi/ippcp/2021.9/lib/libcudart.so** /opt/intel/oneapi/ipp/2021.10/lib/libcudart.so** /opt/intel/oneapi/dpl/2022.3/lib/libcudart.so** /opt/intel/oneapi/dnnl/2024.0/lib/libcudart.so** /opt/intel/oneapi/debugger/2024.0/opt/debugger/lib/libcudart.so** /opt/intel/oneapi/dal/2024.0/lib/libcudart.so** /opt/intel/oneapi/compiler/2024.0/opt/oclfpga/host/linux64/lib/libcudart.so** /opt/intel/oneapi/compiler/2024.0/opt/compiler/lib/libcudart.so** /opt/intel/oneapi/compiler/2024.0/lib/libcudart.so** /opt/intel/oneapi/ccl/2021.11/lib/libcudart.so**]"
time=2024-04-22T00:19:23.273-04:00 level=INFO source=gpu.go:368 msg="Discovered GPU libraries: []"
time=2024-04-22T00:19:23.273-04:00 level=INFO source=gpu.go:322 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-04-22T00:19:23.273-04:00 level=DEBUG source=gpu.go:340 msg="gpu management search paths: [/usr/local/cuda/lib64/libnvidia-ml.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so* /usr/lib/x86_64-linux-gnu/libnvidia-ml.so* /usr/lib/wsl/lib/libnvidia-ml.so* /usr/lib/wsl/drivers/*/libnvidia-ml.so* /opt/cuda/lib64/libnvidia-ml.so* /usr/lib*/libnvidia-ml.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libnvidia-ml.so* /usr/lib/aarch64-linux-gnu/libnvidia-ml.so* /usr/local/lib*/libnvidia-ml.so* /opt/cuda/targets/x86_64-linux/lib/stubs/libnvidia-ml.so* /opt/intel/oneapi/tbb/2021.11/lib/intel64/gcc4.8/libnvidia-ml.so* /opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/lib/libnvidia-ml.so* /opt/intel/oneapi/mpi/2021.11/lib/libnvidia-ml.so* /opt/intel/oneapi/mkl/2024.0/lib/libnvidia-ml.so* /opt/intel/oneapi/ippcp/2021.9/lib/libnvidia-ml.so* /opt/intel/oneapi/ipp/2021.10/lib/libnvidia-ml.so* /opt/intel/oneapi/dpl/2022.3/lib/libnvidia-ml.so* /opt/intel/oneapi/dnnl/2024.0/lib/libnvidia-ml.so* /opt/intel/oneapi/debugger/2024.0/opt/debugger/lib/libnvidia-ml.so* /opt/intel/oneapi/dal/2024.0/lib/libnvidia-ml.so* /opt/intel/oneapi/compiler/2024.0/opt/oclfpga/host/linux64/lib/libnvidia-ml.so* /opt/intel/oneapi/compiler/2024.0/opt/compiler/lib/libnvidia-ml.so* /opt/intel/oneapi/compiler/2024.0/lib/libnvidia-ml.so* /opt/intel/oneapi/ccl/2021.11/lib/libnvidia-ml.so*]"
time=2024-04-22T00:19:23.276-04:00 level=INFO source=gpu.go:368 msg="Discovered GPU libraries: []"
time=2024-04-22T00:19:23.276-04:00 level=INFO source=gpu.go:322 msg="Searching for GPU management library libze_intel_gpu.so"
time=2024-04-22T00:19:23.276-04:00 level=DEBUG source=gpu.go:340 msg="gpu management search paths: [/usr/lib/x86_64-linux-gnu/libze_intel_gpu.so* /usr/lib*/libze_intel_gpu.so* /opt/intel/oneapi/tbb/2021.11/lib/intel64/gcc4.8/libze_intel_gpu.so* /opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/lib/libze_intel_gpu.so* /opt/intel/oneapi/mpi/2021.11/lib/libze_intel_gpu.so* /opt/intel/oneapi/mkl/2024.0/lib/libze_intel_gpu.so* /opt/intel/oneapi/ippcp/2021.9/lib/libze_intel_gpu.so* /opt/intel/oneapi/ipp/2021.10/lib/libze_intel_gpu.so* /opt/intel/oneapi/dpl/2022.3/lib/libze_intel_gpu.so* /opt/intel/oneapi/dnnl/2024.0/lib/libze_intel_gpu.so* /opt/intel/oneapi/debugger/2024.0/opt/debugger/lib/libze_intel_gpu.so* /opt/intel/oneapi/dal/2024.0/lib/libze_intel_gpu.so* /opt/intel/oneapi/compiler/2024.0/opt/oclfpga/host/linux64/lib/libze_intel_gpu.so* /opt/intel/oneapi/compiler/2024.0/opt/compiler/lib/libze_intel_gpu.so* /opt/intel/oneapi/compiler/2024.0/lib/libze_intel_gpu.so* /opt/intel/oneapi/ccl/2021.11/lib/libze_intel_gpu.so*]"
time=2024-04-22T00:19:23.279-04:00 level=INFO source=gpu.go:368 msg="Discovered GPU libraries: [/usr/lib/libze_intel_gpu.so.1.3.29138.7 /usr/lib64/libze_intel_gpu.so.1.3.29138.7]"
wiring Level-Zero management library functions in /usr/lib/libze_intel_gpu.so.1.3.29138.7
dlsym: zesInit
dlsym: zesDriverGet
dlsym: zesDeviceGet
dlsym: zesDeviceGetProperties
dlsym: zesDeviceEnumMemoryModules
dlsym: zesMemoryGetProperties
dlsym: zesMemoryGetState
time=2024-04-22T00:19:23.279-04:00 level=INFO source=gpu.go:166 msg="Intel GPU detected"
time=2024-04-22T00:19:23.279-04:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
discovered 1 Level-Zero drivers
discovered 1 Level-Zero devices
[0] oneAPI device name: Intel(R) Arc(TM) A580 Graphics
[0] oneAPI brand: unknown
[0] oneAPI vendor: Intel(R) Corporation
[0] oneAPI S/N: unknown
[0] oneAPI board number: unknown
discovered 1 Level-Zero memory modules
time=2024-04-22T00:19:23.279-04:00 level=INFO source=gpu.go:246 msg="{8522825728 8522825728 1 0 <nil>}"
time=2024-04-22T00:19:23.279-04:00 level=INFO source=gpu.go:247 msg=1
time=2024-04-22T00:19:23.279-04:00 level=INFO source=gpu.go:248 msg="OneAPI unsupported integrated GPU detected"
time=2024-04-22T00:19:23.279-04:00 level=INFO source=gpu.go:140 msg="Detecting GPU type"
time=2024-04-22T00:19:23.280-04:00 level=INFO source=gpu.go:322 msg="Searching for GPU management library libcudart.so*"
time=2024-04-22T00:19:23.280-04:00 level=DEBUG source=gpu.go:340 msg="gpu management search paths: [/tmp/ollama2043914106/runners/cuda*/libcudart.so* /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudart.so* /usr/lib/wsl/lib/libcudart.so* /usr/lib/wsl/drivers/*/libcudart.so* /opt/cuda/lib64/libcudart.so* /usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/aarch64-linux-gnu/libcudart.so* /usr/local/cuda/lib*/libcudart.so* /usr/lib*/libcudart.so* /usr/local/lib*/libcudart.so* /opt/intel/oneapi/tbb/2021.11/lib/intel64/gcc4.8/libcudart.so** /opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/lib/libcudart.so** /opt/intel/oneapi/mpi/2021.11/lib/libcudart.so** /opt/intel/oneapi/mkl/2024.0/lib/libcudart.so** /opt/intel/oneapi/ippcp/2021.9/lib/libcudart.so** /opt/intel/oneapi/ipp/2021.10/lib/libcudart.so** /opt/intel/oneapi/dpl/2022.3/lib/libcudart.so** /opt/intel/oneapi/dnnl/2024.0/lib/libcudart.so** /opt/intel/oneapi/debugger/2024.0/opt/debugger/lib/libcudart.so** /opt/intel/oneapi/dal/2024.0/lib/libcudart.so** /opt/intel/oneapi/compiler/2024.0/opt/oclfpga/host/linux64/lib/libcudart.so** /opt/intel/oneapi/compiler/2024.0/opt/compiler/lib/libcudart.so** /opt/intel/oneapi/compiler/2024.0/lib/libcudart.so** /opt/intel/oneapi/ccl/2021.11/lib/libcudart.so**]"
time=2024-04-22T00:19:23.282-04:00 level=INFO source=gpu.go:368 msg="Discovered GPU libraries: []"
time=2024-04-22T00:19:23.282-04:00 level=INFO source=gpu.go:322 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-04-22T00:19:23.282-04:00 level=DEBUG source=gpu.go:340 msg="gpu management search paths: [/usr/local/cuda/lib64/libnvidia-ml.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so* /usr/lib/x86_64-linux-gnu/libnvidia-ml.so* /usr/lib/wsl/lib/libnvidia-ml.so* /usr/lib/wsl/drivers/*/libnvidia-ml.so* /opt/cuda/lib64/libnvidia-ml.so* /usr/lib*/libnvidia-ml.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libnvidia-ml.so* /usr/lib/aarch64-linux-gnu/libnvidia-ml.so* /usr/local/lib*/libnvidia-ml.so* /opt/cuda/targets/x86_64-linux/lib/stubs/libnvidia-ml.so* /opt/intel/oneapi/tbb/2021.11/lib/intel64/gcc4.8/libnvidia-ml.so* /opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/lib/libnvidia-ml.so* /opt/intel/oneapi/mpi/2021.11/lib/libnvidia-ml.so* /opt/intel/oneapi/mkl/2024.0/lib/libnvidia-ml.so* /opt/intel/oneapi/ippcp/2021.9/lib/libnvidia-ml.so* /opt/intel/oneapi/ipp/2021.10/lib/libnvidia-ml.so* /opt/intel/oneapi/dpl/2022.3/lib/libnvidia-ml.so* /opt/intel/oneapi/dnnl/2024.0/lib/libnvidia-ml.so* /opt/intel/oneapi/debugger/2024.0/opt/debugger/lib/libnvidia-ml.so* /opt/intel/oneapi/dal/2024.0/lib/libnvidia-ml.so* /opt/intel/oneapi/compiler/2024.0/opt/oclfpga/host/linux64/lib/libnvidia-ml.so* /opt/intel/oneapi/compiler/2024.0/opt/compiler/lib/libnvidia-ml.so* /opt/intel/oneapi/compiler/2024.0/lib/libnvidia-ml.so* /opt/intel/oneapi/ccl/2021.11/lib/libnvidia-ml.so*]"
time=2024-04-22T00:19:23.285-04:00 level=INFO source=gpu.go:368 msg="Discovered GPU libraries: []"
time=2024-04-22T00:19:23.285-04:00 level=INFO source=gpu.go:322 msg="Searching for GPU management library libze_intel_gpu.so"
time=2024-04-22T00:19:23.285-04:00 level=DEBUG source=gpu.go:340 msg="gpu management search paths: [/usr/lib/x86_64-linux-gnu/libze_intel_gpu.so* /usr/lib*/libze_intel_gpu.so* /opt/intel/oneapi/tbb/2021.11/lib/intel64/gcc4.8/libze_intel_gpu.so* /opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/lib/libze_intel_gpu.so* /opt/intel/oneapi/mpi/2021.11/lib/libze_intel_gpu.so* /opt/intel/oneapi/mkl/2024.0/lib/libze_intel_gpu.so* /opt/intel/oneapi/ippcp/2021.9/lib/libze_intel_gpu.so* /opt/intel/oneapi/ipp/2021.10/lib/libze_intel_gpu.so* /opt/intel/oneapi/dpl/2022.3/lib/libze_intel_gpu.so* /opt/intel/oneapi/dnnl/2024.0/lib/libze_intel_gpu.so* /opt/intel/oneapi/debugger/2024.0/opt/debugger/lib/libze_intel_gpu.so* /opt/intel/oneapi/dal/2024.0/lib/libze_intel_gpu.so* /opt/intel/oneapi/compiler/2024.0/opt/oclfpga/host/linux64/lib/libze_intel_gpu.so* /opt/intel/oneapi/compiler/2024.0/opt/compiler/lib/libze_intel_gpu.so* /opt/intel/oneapi/compiler/2024.0/lib/libze_intel_gpu.so* /opt/intel/oneapi/ccl/2021.11/lib/libze_intel_gpu.so*]"
time=2024-04-22T00:19:23.288-04:00 level=INFO source=gpu.go:368 msg="Discovered GPU libraries: [/usr/lib/libze_intel_gpu.so.1.3.29138.7 /usr/lib64/libze_intel_gpu.so.1.3.29138.7]"
wiring Level-Zero management library functions in /usr/lib/libze_intel_gpu.so.1.3.29138.7
dlsym: zesInit
dlsym: zesDriverGet
dlsym: zesDeviceGet
dlsym: zesDeviceGetProperties
dlsym: zesDeviceEnumMemoryModules
dlsym: zesMemoryGetProperties
dlsym: zesMemoryGetState
time=2024-04-22T00:19:23.288-04:00 level=INFO source=gpu.go:166 msg="Intel GPU detected"
time=2024-04-22T00:19:23.288-04:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
discovered 1 Level-Zero drivers
discovered 1 Level-Zero devices
[0] oneAPI device name: Intel(R) Arc(TM) A580 Graphics
[0] oneAPI brand: unknown
[0] oneAPI vendor: Intel(R) Corporation
[0] oneAPI S/N: unknown
[0] oneAPI board number: unknown
discovered 1 Level-Zero memory modules
time=2024-04-22T00:19:23.289-04:00 level=INFO source=gpu.go:246 msg="{8522825728 8522825728 1 0 <nil>}"
time=2024-04-22T00:19:23.289-04:00 level=INFO source=gpu.go:247 msg=1
time=2024-04-22T00:19:23.289-04:00 level=INFO source=gpu.go:248 msg="OneAPI unsupported integrated GPU detected"
time=2024-04-22T00:19:23.289-04:00 level=INFO source=server.go:135 msg="offload to gpu" layers.real=0 layers.estimate=0 memory.available="0 B" memory.required.full="5879.9 MiB" memory.required.partial="677.5 MiB" memory.required.kv="256.0 MiB" memory.weights.total="5459.9 MiB" memory.weights.repeating="4704.5 MiB" memory.weights.nonrepeating="755.4 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="677.5 MiB"
time=2024-04-22T00:19:23.289-04:00 level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama2043914106/runners/cpu
time=2024-04-22T00:19:23.289-04:00 level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama2043914106/runners/cpu_avx
time=2024-04-22T00:19:23.289-04:00 level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama2043914106/runners/cpu_avx2
time=2024-04-22T00:19:23.289-04:00 level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama2043914106/runners/oneapi
time=2024-04-22T00:19:23.289-04:00 level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama2043914106/runners/cpu
time=2024-04-22T00:19:23.289-04:00 level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama2043914106/runners/cpu_avx
time=2024-04-22T00:19:23.289-04:00 level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama2043914106/runners/cpu_avx2
time=2024-04-22T00:19:23.289-04:00 level=DEBUG source=payload.go:68 msg="availableServers : found" file=/tmp/ollama2043914106/runners/oneapi
time=2024-04-22T00:19:23.290-04:00 level=DEBUG source=server.go:296 msg="LD_LIBRARY_PATH=/opt/intel/oneapi/tbb/2021.11/env/../lib/intel64/gcc4.8:/opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/lib:/opt/intel/oneapi/mpi/2021.11/lib:/opt/intel/oneapi/mkl/2024.0/lib:/opt/intel/oneapi/ippcp/2021.9/lib/:/opt/intel/oneapi/ipp/2021.10/lib:/opt/intel/oneapi/dpl/2022.3/lib:/opt/intel/oneapi/dnnl/2024.0/lib:/opt/intel/oneapi/debugger/2024.0/opt/debugger/lib:/opt/intel/oneapi/dal/2024.0/lib:/opt/intel/oneapi/compiler/2024.0/opt/oclfpga/host/linux64/lib:/opt/intel/oneapi/compiler/2024.0/opt/compiler/lib:/opt/intel/oneapi/compiler/2024.0/lib:/opt/intel/oneapi/ccl/2021.11/lib/:/tmp/ollama2043914106/runners/cpu_avx2"
time=2024-04-22T00:19:23.290-04:00 level=INFO source=server.go:301 msg="starting llama server" cmd="/tmp/ollama2043914106/runners/cpu_avx2/ollama_llama_server --model /home/morgan/.ollama/models/blobs/sha256-6d30e88b357bc17d40dd2f50f24c26eefe3df19b3ec4e1c7456f8263c089cbdb --ctx-size 2048 --batch-size 512 --embedding --log-format json --n-gpu-layers 0 --verbose --port 34827"
time=2024-04-22T00:19:23.290-04:00 level=INFO source=server.go:426 msg="waiting for llama runner to start responding"
{"function":"server_params_parse","level":"WARN","line":2380,"msg":"Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support","n_gpu_layers":-1,"tid":"140703529813888","timestamp":1713759563}
{"function":"server_params_parse","level":"WARN","line":2494,"msg":"server.cpp is not built with verbose logging.","tid":"140703529813888","timestamp":1713759563}
{"build":2679,"commit":"7593639","function":"main","level":"INFO","line":2819,"msg":"build info","tid":"140703529813888","timestamp":1713759563}
{"function":"main","level":"INFO","line":2822,"msg":"system info","n_threads":8,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | ","tid":"140703529813888","timestamp":1713759563,"total_threads":16}

So this detection of compatible integrated GPUs doesn't seem to be working as intended. In my case, it was excluding a dGPU. Let me know if there are any extra commands I can run to help debug this.

@mlim15

mlim15 commented Apr 22, 2024

Another thing: when running Llama.cpp locally built with SYCL, it seems to use fp16 and is much faster:

~ » ZES_ENABLE_SYSMAN=1 ./dev/llama.cpp/build/bin/server -m ./llama3-q2k.gguf -ngl 33 -sm layer --host 0.0.0.0 --chat-template llama3

warning: llama.cpp was compiled without CUDA. Setting the split mode has no effect.
{"tid":"129851088406464","timestamp":1713760489,"level":"INFO","function":"main","line":2924,"msg":"build info","build":2710,"commit":"5cf5e7d4"}
{"tid":"129851088406464","timestamp":1713760489,"level":"INFO","function":"main","line":2931,"msg":"system info","n_threads":6,"n_threads_batch":-1,"total_threads":16,"system_info":"AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | "}
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from ./llama3-q2k.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = .
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 10
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,128256]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q2_K:  129 tensors
llama_model_loader: - type q3_K:   64 tensors
llama_model_loader: - type q4_K:   32 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q2_K - Medium
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 2.95 GiB (3.16 BPW) 
llm_load_print_meta: general.name     = .
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: yes
found 4 SYCL devices:
|  |                  |                                             |Compute   |Max compute|Max work|Max sub|               |
|ID|       Device Type|                                         Name|capability|units      |group   |group  |Global mem size|
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
| 0|[level_zero:gpu:0]|               Intel(R) Arc(TM) A580 Graphics|       1.3|        384|    1024|     32|     8096681984|
| 1|    [opencl:gpu:0]|               Intel(R) Arc(TM) A580 Graphics|       3.0|        384|    1024|     32|     8096681984|
| 2|    [opencl:cpu:0]|         12th Gen Intel(R) Core(TM) i5-12600K|       3.0|         16|    8192|     64|    32892153856|
| 3|    [opencl:acc:0]|               Intel(R) FPGA Emulation Device|       1.2|         16|67108864|     64|    32892153856|
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:384

Compare this to the example above, where running llama.cpp through Ollama reports ggml_init_sycl: GGML_SYCL_F16: no. This is likely due to the -DLLAMA_SYCL_F16=OFF build flag here. Is there a reason it was disabled? Some quick searching suggests the llama.cpp build may have been failing with this flag when the original pull request was put together, but that appears to have been fixed since then. It should probably be switched on now for performance, but I'm just an observer here, so maybe check with an actual contributor.
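
For reference, the upstream llama.cpp SYCL build enables FP16 roughly like this (a sketch based on llama.cpp's SYCL README from this period; the oneAPI path is the default installer location, adjust to your install):

# hedged sketch: standalone llama.cpp build with SYCL and FP16 enabled
source /opt/intel/oneapi/setvars.sh
cmake -B build -DLLAMA_SYCL=ON -DLLAMA_SYCL_F16=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j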

Sorry, I'm an idiot and I was accidentally comparing a different quantization of the model. This is probably not important.

@gamunu
Author

gamunu commented Apr 22, 2024

@mlim15 Yes, I could reproduce the problem with an Intel i5 and an Arc A770 GPU. I had previously tested on an AMD + Arc machine. I will work on both issues and run a benchmark.

- Update the Intel oneAPI base image version to 2024.0.1
- Copy oneAPI build artifacts to the correct path in the final image
- Update gen_linux.sh to build in ../build/linux instead of llama.cpp/build
- Remove -g flag when building with Intel oneAPI compiler
@gamunu
Author

gamunu commented Apr 22, 2024

I managed to fix the Linux build script. The oneAPI libraries are too large to be bundled in, so the libraries need to be in the runtime environment.

This needs to be run before starting the server:

source /opt/intel/oneapi/setvars.sh

I haven't tested this for Arm64 as I don't have a device to test it on.
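
A minimal wrapper script can handle this (a sketch: the oneAPI path is the default installer location, and the binary path is an assumption based on the build script output):

#!/bin/bash
# load the oneAPI runtime libraries, then start the server
source /opt/intel/oneapi/setvars.sh
export ZES_ENABLE_SYSMAN=1   # optional: lets Level-Zero report free VRAM
exec ./dist/ollama-linux-amd64 serve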

SYCL log:

[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
time=2024-04-23T00:55:18.127+05:30 level=DEBUG source=server.go:457 msg="server not yet available" error="server not responding"
found 4 SYCL devices:

|  |                  |                                             |Compute   |Max compute|Max work|Max sub|               |
|ID|       Device Type|                                         Name|capability|units      |group   |group  |Global mem size|
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
| 0|[level_zero:gpu:0]|               Intel(R) Arc(TM) A770 Graphics|       1.3|        512|    1024|     32|    16225243136|
| 1|[level_zero:gpu:1]|                    Intel(R) UHD Graphics 770|       1.3|         32|     512|     32|    30878105600|
| 2|    [opencl:gpu:0]|               Intel(R) Arc(TM) A770 Graphics|       3.0|        512|    1024|     32|    16225243136|
| 3|    [opencl:gpu:1]|                    Intel(R) UHD Graphics 770|       3.0|         32|     512|     32|    30878105600|
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:512
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  3847.55 MiB
llm_load_tensors:        CPU buffer size =    70.31 MiB
..................................................................................................


Devices log:

time=2024-04-23T01:13:11.938+05:30 level=INFO source=gpu.go:166 msg="Intel GPU detected"
time=2024-04-23T01:13:11.938+05:30 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
discovered 1 Level-Zero drivers
discovered 2 Level-Zero devices
[0] oneAPI device name: Intel(R) Arc(TM) A770 Graphics
[0] oneAPI brand: unknown
[0] oneAPI vendor: Intel(R) Corporation
[0] oneAPI S/N: unknown
[0] oneAPI board number: unknown
discovered 1 Level-Zero memory modules
[1] oneAPI device name: Intel(R) UHD Graphics 770
[1] oneAPI brand: unknown
[1] oneAPI vendor: Intel(R) Corporation
[1] oneAPI S/N: unknown
[1] oneAPI board number: unknown
Device 1 is an integrated device

I ran a benchmark, and these are the results I got. intel-gpu-top shows activity for the correct GPU now.

model_name =    mistral:7b
prompt = Write a step-by-step guide on how to bake a chocolate cake from scratch.
eval rate:            17.51 tokens/s
prompt = Develop a python function that solves the following problem, sudoku game
eval rate:            16.78 tokens/s
prompt = Create a dialogue between two characters that discusses economic crisis
eval rate:            16.41 tokens/s
prompt = In a forest, there are brave lions living there. Please continue the story.
eval rate:            17.11 tokens/s
prompt = I'd like to book a flight for 4 to Seattle in U.S.
eval rate:            17.07 tokens/s
--------------------
Average of eval rate:  16.976  tokens/s


model_name =    llama2:7b
prompt = Explain Artificial Intelligence and give its applications.
eval rate:            23.27 tokens/s
prompt = How are machine learning and AI related?
eval rate:            20.92 tokens/s
prompt = What is Deep Learning based on?
eval rate:            20.17 tokens/s
prompt = What is the full form of LSTM?
eval rate:            20.81 tokens/s
prompt = What are different components of GAN?
eval rate:            20.09 tokens/s
--------------------
Average of eval rate:  21.052  tokens/s


model_name =    llama2:13b
prompt = Explain Artificial Intelligence and give its applications.
eval rate:            12.47 tokens/s
prompt = How are machine learning and AI related?
eval rate:            13.03 tokens/s
prompt = What is Deep Learning based on?
eval rate:            12.98 tokens/s
prompt = What is the full form of LSTM?
eval rate:            14.23 tokens/s
prompt = What are different components of GAN?
eval rate:            13.02 tokens/s
--------------------
Average of eval rate:  13.146  tokens/s
----------------------------------------
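
To double-check which card the load lands on, intel_gpu_top can be pointed at a specific DRM node (the card path below is an assumption; check /dev/dri for yours):

sudo intel_gpu_top -d drm:/dev/dri/card1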

@Chimrod

Chimrod commented Apr 24, 2024

Hello, I've tested the branch in my Debian environment, and ollama fails to run a model. I'm using an Intel Arc A770 16 GB.

found 5 SYCL devices:
|  |                  |                                             |Compute   |Max compute|Max work|Max sub|               |
|ID|       Device Type|                                         Name|capability|units      |group   |group  |Global mem size|
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
| 0|[level_zero:gpu:0]|               Intel(R) Arc(TM) A770 Graphics|       1.3|        512|    1024|     32|    16225243136|
| 1|    [opencl:gpu:0]|               Intel(R) Arc(TM) A770 Graphics|       3.0|        512|    1024|     32|    16225243136|
| 2|    [opencl:cpu:0]|AMD Ryzen 9 7950X 16-Core Processor            |       3.0|         32|    8192|     64|    32827715584|
| 3|    [opencl:acc:0]|               Intel(R) FPGA Emulation Device|       1.2|         32|67108864|     64|    32827715584|
| 4|    [opencl:acc:1]|               Intel(R) FPGA Emulation Device|       1.2|         32|67108864|     64|    32827715584|
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:512
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  4155.99 MiB
llm_load_tensors:        CPU buffer size =   281.81 MiB
......................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      SYCL0 KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.50 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =   258.50 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    12.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
Unexpected pattern!
UNREACHABLE executed at ./lib/SPIRV/SPIRVUtil.cpp:1887!
The program was built for 1 devices
Build program log for 'Intel(R) Arc(TM) A770 Graphics':
 -11 (PI_ERROR_BUILD_PROGRAM_FAILURE)Exception caught at file:/home/sebastien/Projets/ollama/llm/llama.cpp/ggml-sycl.cpp, line:14904
time=2024-04-24T22:20:28.289+02:00 level=ERROR source=routes.go:120 msg="error loading llama server" error="llama runner process no longer running: 1 "

This may be related to the Debian libraries or be specific to my environment, but I'd rather report it now in case it hides a more general issue.

@gamunu
Author

gamunu commented May 1, 2024

@Chimrod, for some reason llama.cpp is not picking up the GPUs. I'm currently looking into it to see what I can do.

@Chimrod

Chimrod commented May 4, 2024

Hello, the latest commit gives me a segfault when I start ollama serve.

segfault.log

I don't know Go, but tell me if I need to enable any switch during compilation, or add any tracing to the code, to pinpoint the cause of this.

- Fix GPU ID formatting issue in oneapi_check_vram function
- Update GPU detection logic for oneapi devices
- Update gen_linux.sh to remove the LLAMA_SYCL_F16 default flag
@gamunu
Author

gamunu commented May 4, 2024

@Chimrod I only used the Docker build for testing. Are you using Debian or a RHEL variant?
Most likely the default library search paths differ in your distribution. At runtime, it looks for the oneAPI Level-Zero library in the following paths:

"/usr/lib/x86_64-linux-gnu/libze_intel_gpu.so*",
"/usr/lib*/libze_intel_gpu.so*",

Build:

source /opt/intel/oneapi/setvars.sh
export BUILD_ARCH=amd64
./scripts/build_linux.sh

Run:

source /opt/intel/oneapi/setvars.sh
export ZES_ENABLE_SYSMAN=1 #optional
 ./dist/ollama-linux-amd64 serve

gpu/gpu_info_oneapi.c (outdated review thread, resolved)
@felipeagc

Hi, I'm the original author of this patch, thanks for continuing to work on it! I'm finally able to play with this stuff again, so let's see if we can get the kinks figured out!

Just to note, I'm trying this on a fresh Ubuntu 24.04 install, using oneAPI installed through Intel's APT repository. (When I was first working on this I was using Arch Linux and it seemed to work fine there, so now I want to make sure it works on Ubuntu.)

I pointed out a small mistake that was causing a segfault. With that fixed, I'm now getting this error when llama.cpp tries to load the model:

Build program log for 'Intel(R) Arc(TM) A750 Graphics':
 -999 (Unknown PI error)Exception caught at file:/home/felipe/code/go/ollama/llm/llama.cpp/ggml-sycl.cpp, line:14914

Currently looking into this.

@gamunu
Author

gamunu commented May 4, 2024

@felipeagc, it's great that you're able to work on this again. I'm on Ubuntu 24.04 as well.

I took a detour from the original work you did. I'm new to oneAPI and C, so if you feel my approach might not be the right one, please feel free to incorporate any relevant code into your PR.

@rafasaurus

rafasaurus commented May 5, 2024

"I'm currently attempting to execute the code on an Intel i7 1185g7 processor running Arch Linux. It seems that SYCL_DEVICE_ALLOWLIST isn't being detected properly, if my understanding is correct. The error message what(): SYCL_DEVICE_ALLOWLIST has incorrect format. For details, please refer to https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md -30 (PI_ERROR_INVALID_VALUE) is appearing. I'll attach the log file for reference: logy.log. Shouldn't SYCL_DEVICE_ALLOWLIST be detected automatically?"

Update: I fixed my internal configuration, and it's now fully using the GPU with SYCL on the i7-1185G7.
Thank you.
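
For anyone hitting a similar device-selection error, the SYCL runtime can be pinned to the Level-Zero GPU backend explicitly (hedged: ONEAPI_DEVICE_SELECTOR needs a recent oneAPI runtime, and the allowlist syntax below is taken from the intel/llvm EnvironmentVariables.md linked in the error message):

export ONEAPI_DEVICE_SELECTOR=level_zero:gpu   # only consider Level-Zero GPUs
# or an explicit allowlist (Intel vendor id 0x8086):
export SYCL_DEVICE_ALLOWLIST='BackendName:level_zero,DeviceType:gpu,DeviceVendorId:0x8086'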

@jiriks74

jiriks74 commented May 15, 2024

I've been able to build the Docker image. The Arc A370M dGPU was detected (the iGPU of my i7-12700H was too), but I'm having a hard time running any models.

Log
[GIN] 2024/05/15 - 14:04:31 | 200 |      18.829µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/05/15 - 14:04:31 | 200 |    1.054858ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2024/05/15 - 14:04:31 | 200 |     270.166µs |       127.0.0.1 | POST     "/api/show"
time=2024-05-15T14:04:31.471Z level=INFO source=gpu.go:116 msg="Detecting GPUs"
time=2024-05-15T14:04:31.471Z level=DEBUG source=gpu.go:263 msg="Searching for GPU library" name=libcudart.so*
time=2024-05-15T14:04:31.471Z level=DEBUG source=gpu.go:281 msg="gpu library search" globs="[/tmp/ollama1769277002/runners/cuda*/libcudart.so* /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudart.so* /usr/lib/wsl/lib/libcudart.so* /usr/lib/wsl/drivers/*/libcudart.so* /opt/cuda/lib64/libcudart.so* /usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/aarch64-linux-gnu/libcudart.so* /usr/local/cuda/lib*/libcudart.so* /usr/lib*/libcudart.so* /usr/local/lib*/libcudart.so* /opt/intel/oneapi/lib/libcudart.so**]"
time=2024-05-15T14:04:31.472Z level=DEBUG source=gpu.go:309 msg="discovered GPU libraries" paths=[/tmp/ollama1769277002/runners/cuda_v11/libcudart.so.11.0]
cudaSetDevice err: 35
time=2024-05-15T14:04:31.473Z level=DEBUG source=gpu.go:321 msg="Unable to load cudart" library=/tmp/ollama1769277002/runners/cuda_v11/libcudart.so.11.0 error="your nvidia driver is too old or missing.  If you have a CUDA GPU please upgrade to run ollama"
time=2024-05-15T14:04:31.473Z level=INFO source=gpu.go:128 msg="Detecting Intel GPUs"
time=2024-05-15T14:04:31.473Z level=DEBUG source=gpu.go:263 msg="Searching for GPU library" name=libze_intel_gpu.so
time=2024-05-15T14:04:31.473Z level=DEBUG source=gpu.go:281 msg="gpu library search" globs="[/usr/lib/x86_64-linux-gnu/libze_intel_gpu.so* /usr/lib*/libze_intel_gpu.so* /opt/intel/oneapi/lib/libze_intel_gpu.so*]"
time=2024-05-15T14:04:31.473Z level=DEBUG source=gpu.go:309 msg="discovered GPU libraries" paths=[/usr/lib64/libze_intel_gpu.so.1.3.27191.42]
wiring Level-Zero management library functions in /usr/lib64/libze_intel_gpu.so.1.3.27191.42
dlsym: zesInit
dlsym: zesDriverGet
dlsym: zesDeviceGet
dlsym: zesDeviceGetProperties
dlsym: zesDeviceEnumMemoryModules
dlsym: zesMemoryGetProperties
dlsym: zesMemoryGetState
discovered 1 Level-Zero drivers
discovered 2 Level-Zero devices
[1] oneAPI device name: Intel(R) Iris(R) Xe Graphics
[1] oneAPI brand: unknown
[1] oneAPI vendor: Intel(R) Corporation
[1] oneAPI S/N: unknown
[1] oneAPI board number: unknown
[2] oneAPI device name: Intel(R) Arc(TM) A370M Graphics
[2] oneAPI brand: unknown
[2] oneAPI vendor: Intel(R) Corporation
[2] oneAPI S/N: unknown
[2] oneAPI board number: unknown
time=2024-05-15T14:04:31.484Z level=INFO source=gpu.go:133 msg="detected Intel GPUs" library=/usr/lib64/libze_intel_gpu.so.1.3.27191.42 count=2
time=2024-05-15T14:04:31.484Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
discovered 0 Level-Zero memory modules
discovered 1 Level-Zero memory modules
time=2024-05-15T14:04:31.484Z level=DEBUG source=amd_linux.go:297 msg="amdgpu driver not detected /sys/module/amdgpu"
releasing oneapi Library
time=2024-05-15T14:04:31.485Z level=DEBUG source=gguf.go:57 msg="model = &llm.gguf{containerGGUF:(*llm.containerGGUF)(0xc000a116c0), kv:llm.KV{}, tensors:[]*llm.Tensor(nil), parameters:0x0}"
time=2024-05-15T14:04:32.033Z level=DEBUG source=sched.go:173 msg="loading first model" model=/root/.ollama/models/blobs/sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29
time=2024-05-15T14:04:32.033Z level=DEBUG source=memory.go:64 msg=evaluating library=oneapi gpu_count=1 available="4048.0 MiB"
time=2024-05-15T14:04:32.034Z level=INFO source=memory.go:152 msg="offload to gpu" layers.real=-1 layers.estimate=26 memory.available="4048.0 MiB" memory.required.full="4576.0 MiB" memory.required.partial="3928.3 MiB" memory.required.kv="256.0 MiB" memory.weights.total="4156.0 MiB" memory.weights.repeating="3745.0 MiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="677.5 MiB"
time=2024-05-15T14:04:32.034Z level=DEBUG source=memory.go:64 msg=evaluating library=oneapi gpu_count=1 available="1 B"
time=2024-05-15T14:04:32.034Z level=DEBUG source=memory.go:103 msg="insufficient VRAM to load any model layers"
time=2024-05-15T14:04:32.034Z level=DEBUG source=memory.go:64 msg=evaluating library=oneapi gpu_count=2 available="4048.0 MiB"
time=2024-05-15T14:04:32.034Z level=INFO source=memory.go:152 msg="offload to gpu" layers.real=-1 layers.estimate=21 memory.available="4048.0 MiB" memory.required.full="4740.0 MiB" memory.required.partial="3980.6 MiB" memory.required.kv="256.0 MiB" memory.weights.total="4156.0 MiB" memory.weights.repeating="3745.0 MiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="328.0 MiB" memory.graph.partial="1355.0 MiB"
time=2024-05-15T14:04:32.034Z level=DEBUG source=memory.go:64 msg=evaluating library=oneapi gpu_count=2 available="4048.0 MiB"
time=2024-05-15T14:04:32.034Z level=INFO source=memory.go:152 msg="offload to gpu" layers.real=-1 layers.estimate=21 memory.available="4048.0 MiB" memory.required.full="4740.0 MiB" memory.required.partial="3980.6 MiB" memory.required.kv="256.0 MiB" memory.weights.total="4156.0 MiB" memory.weights.repeating="3745.0 MiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="328.0 MiB" memory.graph.partial="1355.0 MiB"
time=2024-05-15T14:04:32.034Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1769277002/runners/cpu
time=2024-05-15T14:04:32.034Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1769277002/runners/cpu_avx
time=2024-05-15T14:04:32.034Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1769277002/runners/cpu_avx2
time=2024-05-15T14:04:32.034Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1769277002/runners/cuda_v11
time=2024-05-15T14:04:32.034Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1769277002/runners/oneapi
time=2024-05-15T14:04:32.034Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1769277002/runners/rocm_v60002
time=2024-05-15T14:04:32.034Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1769277002/runners/cpu
time=2024-05-15T14:04:32.034Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1769277002/runners/cpu_avx
time=2024-05-15T14:04:32.034Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1769277002/runners/cpu_avx2
time=2024-05-15T14:04:32.034Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1769277002/runners/cuda_v11
time=2024-05-15T14:04:32.034Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1769277002/runners/oneapi
time=2024-05-15T14:04:32.034Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1769277002/runners/rocm_v60002
time=2024-05-15T14:04:32.034Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-05-15T14:04:32.037Z level=INFO source=server.go:289 msg="starting llama server" cmd="/tmp/ollama1769277002/runners/oneapi/ollama_llama_server --model /root/.ollama/models/blobs/sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29 --ctx-size 2048 --batch-size 512 --embedding --log-format json --n-gpu-layers 21 --verbose --parallel 1 --port 39715"
time=2024-05-15T14:04:32.037Z level=DEBUG source=server.go:291 msg=subprocess environment="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin HOSTNAME=e535b3abb323 OLLAMA_DEBUG=1 LANG=C.UTF-8 LD_LIBRARY_PATH=/opt/intel/oneapi/lib ONEAPI_ROOT=/opt/intel/oneapi SETVARS_COMPLETED=1 OLLAMA_HOST=0.0.0.0 HOME=/root LD_LIBRARY_PATH=/opt/intel/oneapi/lib:/tmp/ollama1769277002/runners/oneapi ONEAPI_VISIBLE_DEVICES=GPU-8680a646-0c00-0000-0002-000000000000,GPU-86809356-0500-0000-0300-000000000000]"
time=2024-05-15T14:04:32.038Z level=INFO source=sched.go:351 msg="loaded runners" count=1
time=2024-05-15T14:04:32.038Z level=INFO source=server.go:432 msg="waiting for llama runner to start responding"
{"function":"server_params_parse","level":"WARN","line":2495,"msg":"server.cpp is not built with verbose logging.","tid":"139829892099200","timestamp":1715781872}
{"build":1,"commit":"433def2","function":"main","level":"INFO","line":2821,"msg":"build info","tid":"139829892099200","timestamp":1715781872}
{"function":"main","level":"INFO","line":2828,"msg":"system info","n_threads":6,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | ","tid":"139829892099200","timestamp":1715781872,"total_threads":20}
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /root/.ollama/models/blobs/sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  19:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
time=2024-05-15T14:04:32.288Z level=DEBUG source=server.go:466 msg="server not yet available" error="server not responding"
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:                                             
llm_load_vocab: ************************************        
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!        
llm_load_vocab: CONSIDER REGENERATING THE MODEL             
llm_load_vocab: ************************************        
llm_load_vocab:                                             
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW) 
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 2 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0|     [opencl:cpu:0]|          12th Gen Intel Core i7-12700H|    3.0|     20|    8192|   64| 16482M|2023.16.12.0.12_195853.xmain-hotfix|
| 1|     [opencl:acc:0]|            Intel FPGA Emulation Device|    1.2|     20|67108864|   64| 16482M|2023.16.12.0.12_195853.xmain-hotfix|
ggml_backend_sycl_set_mul_device_mode: true
llama_model_load: error loading model: DeviceList is empty. -30 (PI_ERROR_INVALID_VALUE)
llama_load_model_from_file: exception loading model
terminate called after throwing an instance of 'sycl::_V1::invalid_parameter_error'
  what():  DeviceList is empty. -30 (PI_ERROR_INVALID_VALUE)
time=2024-05-15T14:04:33.693Z level=DEBUG source=server.go:466 msg="server not yet available" error="health resp: Get \"http://127.0.0.1:39715/health\": dial tcp 127.0.0.1:39715: connect: connection refused"

I've tried llama3 and llama2 but both just hang without the prompt appearing.

EDIT: Here's a recording of me trying to run llama3 using Ollama first on the CPU and then the Arc A370M dGPU:

asciicast

Here is an asciicast file in case the player breaks or the recording is deleted from the website:
ollama-oneapi.cast.zip

EDIT: From using llama.cpp directly, I can see it's an old issue I reported (and forgot about because the developer didn't respond): ggerganov/llama.cpp#6808

@tristan-k

tristan-k commented May 16, 2024

Why is the integrated GPU skipped? I'm trying to run ollama on an i7-1360P with Intel Iris Xe Graphics (iGPU, 6 Xe cores / 96 EU / 768 SP, Gen 12.2, Raptor Lake GT1) on Ubuntu 22.04.4 LTS:

time=2024-05-16T12:49:02.701+02:00 level=INFO source=routes.go:999 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-05-16T12:49:02.701+02:00 level=INFO source=payload_common.go:106 msg="Extracting dynamic libraries..."
time=2024-05-16T12:49:02.774+02:00 level=INFO source=payload_common.go:145 msg="Dynamic LLM libraries [oneapi cpu cpu_avx cpu_avx2]"
time=2024-05-16T12:49:02.774+02:00 level=INFO source=gpu.go:105 msg="Detecting GPU type"
time=2024-05-16T12:49:02.774+02:00 level=INFO source=gpu.go:285 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-05-16T12:49:02.775+02:00 level=INFO source=gpu.go:331 msg="Discovered GPU libraries: []"
time=2024-05-16T12:49:02.775+02:00 level=INFO source=gpu.go:285 msg="Searching for GPU management library librocm_smi64.so"
time=2024-05-16T12:49:02.776+02:00 level=INFO source=gpu.go:331 msg="Discovered GPU libraries: []"
time=2024-05-16T12:49:02.776+02:00 level=INFO source=gpu.go:285 msg="Searching for GPU management library libze_intel_gpu.so"
time=2024-05-16T12:49:02.778+02:00 level=INFO source=gpu.go:331 msg="Discovered GPU libraries: [/usr/lib/x86_64-linux-gnu/libze_intel_gpu.so.1.3.28202.52]"
time=2024-05-16T12:49:02.795+02:00 level=INFO source=gpu.go:130 msg="Intel GPU detected"
time=2024-05-16T12:49:02.795+02:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-05-16T12:49:02.795+02:00 level=INFO source=gpu.go:218 msg="OneAPI unsupported integrated GPU detected"
time=2024-05-16T12:49:02.795+02:00 level=INFO source=routes.go:1022 msg="no GPU detected"

Running the llama2 example with llama.cpp on the same integrated GPU works just fine, albeit slowly.

./build/bin/ls-sycl-device
found 4 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                 Intel Iris Xe Graphics|    1.3|     96|     512|   32| 30711M|            1.3.28202|
| 1|     [opencl:gpu:0]|                 Intel Iris Xe Graphics|    3.0|     96|     512|   32| 30711M|       23.52.28202.52|
| 2|     [opencl:cpu:0]|           13th Gen Intel Core i7-1360P|    3.0|     16|    8192|   64| 33174M|2024.17.3.0.08_160000|
| 3|     [opencl:acc:0]|            Intel FPGA Emulation Device|    1.2|     16|67108864|   64| 33174M|2024.17.3.0.08_160000|
$ ZES_ENABLE_SYSMAN=1 ./build/bin/main -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm none -mg 0
llama_print_timings:        load time =    9859,22 ms
llama_print_timings:      sample time =      12,13 ms /   400 runs   (    0,03 ms per token, 32970,66 tokens per second)
llama_print_timings: prompt eval time =    2720,75 ms /    14 tokens (  194,34 ms per token,     5,15 tokens per second)
llama_print_timings:        eval time =   60272,54 ms /   399 runs   (  151,06 ms per token,     6,62 tokens per second)
llama_print_timings:       total time =   63130,20 ms /   413 tokens

@gamunu
Author

gamunu commented May 16, 2024

@jiriks74 I haven't dug into the Docker builds since the recent changes I made; it looks like the library paths are not set properly. I plan to get the Docker images working next.

@gamunu
Author

gamunu commented May 16, 2024

@tristan-k The log messages are from one of the previous iterations of this implementation. Could you check if you have the latest code from this PR?

@jiriks74

@gamunu It looks like a llama.cpp issue: when I pass the dGPU to llama.cpp, the iGPU disappears from the device list and the whole thing crashes, unable to load the model. I created an issue some time ago and forgot about it: ggerganov/llama.cpp#6808.

You can take a look at how it crashes in basically the same way: it works with the iGPU alone, but once the dGPU is passed it breaks down, even though the iGPU is still present.

@tristan-k

tristan-k commented May 19, 2024

@tristan-k The log messages are from one of the previous iterations of this implementation. Could you check if you have the latest code from this PR?

Thanks. Your main branch works fine on a bare-metal Ubuntu 22.04 server, but for some reason fails to pick up the integrated GPU in an Ubuntu 22.04 LXC on Proxmox 8. In the LXC it falls back to cpu_avx2.
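
A quick way to check whether Level-Zero sysman queries work at all inside the container (hedged: assumes a llama.cpp SYCL build with the ls-sycl-device tool, as in my earlier comment):

export ZES_ENABLE_SYSMAN=1
./build/bin/ls-sycl-device   # should list the iGPU with a sane memory size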

./ollama serve
...
time=2024-05-19T22:41:41.794+02:00 level=INFO source=gpu.go:173 msg="detected Intel GPUs" library=/usr/lib/x86_64-linux-gnu/libze_intel_gpu.so.1.3.28202.52 count=8
time=2024-05-19T22:41:41.795+02:00 level=INFO source=types.go:71 msg="inference compute" id=GPU-8680a0a7-0400-0000-0002-070000000000 library=oneapi compute="" driver=0.0 name="" total="0 B" available="1 B"
time=2024-05-19T22:41:41.795+02:00 level=INFO source=types.go:71 msg="inference compute" id=GPU-8680a0a7-0400-0000-0002-060000000000 library=oneapi compute="" driver=0.0 name="" total="0 B" available="1 B"
time=2024-05-19T22:41:41.795+02:00 level=INFO source=types.go:71 msg="inference compute" id=GPU-8680a0a7-0400-0000-0002-050000000000 library=oneapi compute="" driver=0.0 name="" total="0 B" available="1 B"
time=2024-05-19T22:41:41.795+02:00 level=INFO source=types.go:71 msg="inference compute" id=GPU-8680a0a7-0400-0000-0002-040000000000 library=oneapi compute="" driver=0.0 name="" total="0 B" available="1 B"
time=2024-05-19T22:41:41.795+02:00 level=INFO source=types.go:71 msg="inference compute" id=GPU-8680a0a7-0400-0000-0002-030000000000 library=oneapi compute="" driver=0.0 name="" total="0 B" available="1 B"
time=2024-05-19T22:41:41.795+02:00 level=INFO source=types.go:71 msg="inference compute" id=GPU-8680a0a7-0400-0000-0002-020000000000 library=oneapi compute="" driver=0.0 name="" total="0 B" available="1 B"
time=2024-05-19T22:41:41.795+02:00 level=INFO source=types.go:71 msg="inference compute" id=GPU-8680a0a7-0400-0000-0002-010000000000 library=oneapi compute="" driver=0.0 name="" total="0 B" available="1 B"
time=2024-05-19T22:41:41.795+02:00 level=INFO source=types.go:71 msg="inference compute" id=GPU-8680a0a7-0400-0000-0002-000000000000 library=oneapi compute="" driver=0.0 name="" total="0 B" available="1 B"
[GIN] 2024/05/19 - 22:41:54 | 200 |    1.018918ms |       127.0.0.1 | GET      "/api/tags"
time=2024-05-19T22:41:54.427+02:00 level=INFO source=gpu.go:168 msg="Detecting Intel GPUs"
time=2024-05-19T22:41:54.445+02:00 level=INFO source=gpu.go:173 msg="detected Intel GPUs" library=/usr/lib/x86_64-linux-gnu/libze_intel_gpu.so.1.3.28202.52 count=8
time=2024-05-19T22:41:54.552+02:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=35 layers.real=33 memory.available="1 B" memory.required.full="4.5 GiB" memory.required.partial="4.5 GiB" memory.required.kv="256.0 MiB" memory.weights.total="4.0 GiB" memory.weights.repeating="3.9 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="181.0 MiB"
time=2024-05-19T22:41:54.552+02:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=35 layers.real=33 memory.available="1 B" memory.required.full="4.5 GiB" memory.required.partial="4.5 GiB" memory.required.kv="256.0 MiB" memory.weights.total="4.0 GiB" memory.weights.repeating="3.9 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="181.0 MiB"
time=2024-05-19T22:41:54.553+02:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=35 layers.real=33 memory.available="1 B" memory.required.full="4.5 GiB" memory.required.partial="4.5 GiB" memory.required.kv="256.0 MiB" memory.weights.total="4.0 GiB" memory.weights.repeating="3.9 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="181.0 MiB"
time=2024-05-19T22:41:54.553+02:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=35 layers.real=33 memory.available="1 B" memory.required.full="4.5 GiB" memory.required.partial="4.5 GiB" memory.required.kv="256.0 MiB" memory.weights.total="4.0 GiB" memory.weights.repeating="3.9 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="181.0 MiB"
time=2024-05-19T22:41:54.553+02:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=35 layers.real=33 memory.available="1 B" memory.required.full="4.5 GiB" memory.required.partial="4.5 GiB" memory.required.kv="256.0 MiB" memory.weights.total="4.0 GiB" memory.weights.repeating="3.9 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="181.0 MiB"
time=2024-05-19T22:41:54.553+02:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=35 layers.real=33 memory.available="1 B" memory.required.full="4.5 GiB" memory.required.partial="4.5 GiB" memory.required.kv="256.0 MiB" memory.weights.total="4.0 GiB" memory.weights.repeating="3.9 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="181.0 MiB"
time=2024-05-19T22:41:54.554+02:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=35 layers.real=33 memory.available="1 B" memory.required.full="4.5 GiB" memory.required.partial="4.5 GiB" memory.required.kv="256.0 MiB" memory.weights.total="4.0 GiB" memory.weights.repeating="3.9 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="181.0 MiB"
time=2024-05-19T22:41:54.554+02:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=35 layers.real=33 memory.available="1 B" memory.required.full="4.5 GiB" memory.required.partial="4.5 GiB" memory.required.kv="256.0 MiB" memory.weights.total="4.0 GiB" memory.weights.repeating="3.9 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="181.0 MiB"
time=2024-05-19T22:41:54.554+02:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=35 layers.real=33 memory.available="8 B" memory.required.full="5.7 GiB" memory.required.partial="5.7 GiB" memory.required.kv="256.0 MiB" memory.weights.total="4.0 GiB" memory.weights.repeating="3.9 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="1.3 GiB" memory.graph.partial="1.4 GiB"
time=2024-05-19T22:41:54.554+02:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=35 layers.real=33 memory.available="8 B" memory.required.full="5.7 GiB" memory.required.partial="5.7 GiB" memory.required.kv="256.0 MiB" memory.weights.total="4.0 GiB" memory.weights.repeating="3.9 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="1.3 GiB" memory.graph.partial="1.4 GiB"
time=2024-05-19T22:41:54.555+02:00 level=INFO source=server.go:320 msg="starting llama server" cmd="/tmp/ollama3691754935/runners/cpu_avx2/ollama_llama_server --model /root/.ollama/models/blobs/sha256-5cd5336db349ecb34eb0e3df2e4ed93f40ee52379646a77e355e342a365b79e2 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 35 --parallel 1 --port 40169"
time=2024-05-19T22:41:54.555+02:00 level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-05-19T22:41:54.555+02:00 level=INFO source=server.go:504 msg="waiting for llama runner to start responding"
time=2024-05-19T22:41:54.556+02:00 level=INFO source=server.go:540 msg="waiting for server to become available" status="llm server error"
WARN [server_params_parse] Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support | n_gpu_layers=-1 tid="136333138933632" timestamp=1716151314
INFO [main] build info | build=2783 commit="433def2" tid="136333138933632" timestamp=1716151314
INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="136333138933632" timestamp=1716151314 total_threads=12
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="11" port="40169" tid="136333138933632" timestamp=1716151314
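
For what it's worth, the inference compute lines above report total="0 B" and available="1 B"/"8 B" per device, so the free-VRAM query itself seems to be failing; with effectively zero reported memory the scheduler will never offload and falls back to the cpu_avx2 runner, which matches the "Not compiled with GPU offload support" warning. The discovery path goes through Level Zero sysman, and that query can be checked outside of ollama with a small standalone program. This is only a sketch I put together against the public zes_api.h (file name, variable names, and error handling are mine; build with cc check_vram.c -lze_loader and run with ZES_ENABLE_SYSMAN=1):

#include <stdio.h>
#include <level_zero/zes_api.h>

int main(void) {
    // Initialize the driver. ZES_ENABLE_SYSMAN=1 must be set in the
    // environment so core device handles can be used as sysman handles.
    if (zeInit(0) != ZE_RESULT_SUCCESS) {
        fprintf(stderr, "zeInit failed\n");
        return 1;
    }

    uint32_t ndrivers = 1;
    ze_driver_handle_t driver;
    if (zeDriverGet(&ndrivers, &driver) != ZE_RESULT_SUCCESS || ndrivers == 0)
        return 1;

    uint32_t ndev = 0;
    zeDeviceGet(driver, &ndev, NULL);          // first call: query device count
    ze_device_handle_t dev[32];
    if (ndev > 32) ndev = 32;
    zeDeviceGet(driver, &ndev, dev);           // second call: fetch handles

    for (uint32_t i = 0; i < ndev; i++) {
        // Legacy sysman mode: the core handle doubles as a sysman handle.
        zes_device_handle_t sdev = (zes_device_handle_t)dev[i];
        uint32_t nmem = 0;
        zesDeviceEnumMemoryModules(sdev, &nmem, NULL);
        zes_mem_handle_t mem[8];
        if (nmem > 8) nmem = 8;
        zesDeviceEnumMemoryModules(sdev, &nmem, mem);
        for (uint32_t m = 0; m < nmem; m++) {
            zes_mem_state_t st = { .stype = ZES_STRUCTURE_TYPE_MEM_STATE };
            if (zesMemoryGetState(mem[m], &st) == ZE_RESULT_SUCCESS)
                printf("device %u mem %u: free=%llu size=%llu\n", i, m,
                       (unsigned long long)st.free,
                       (unsigned long long)st.size);
        }
    }
    return 0;
}

If this also prints zero or nonsense sizes for the Iris Xe devices, the problem sits in the driver/sysman layer rather than in this PR's discovery code.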

Running the llama2 example with llama.cpp inside the LXC works fine again.

ZES_ENABLE_SYSMAN=1 ./build/bin/main -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm none -mg 0
Log start
main: build = 2939 (1ea2a003)
main: built with Intel(R) oneAPI DPC++/C++ Compiler 2024.1.0 (2024.1.0.20240308) for x86_64-unknown-linux-gnu
main: seed  = 1716152340
...
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 18 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                 Intel Iris Xe Graphics|    1.3|     96|     512|   32| 30712M|            1.3.28202|
| 1| [level_zero:gpu:1]|                 Intel Iris Xe Graphics|    1.3|     96|     512|   32| 30712M|            1.3.28202|
| 2| [level_zero:gpu:2]|                 Intel Iris Xe Graphics|    1.3|     96|     512|   32| 30712M|            1.3.28202|
| 3| [level_zero:gpu:3]|                 Intel Iris Xe Graphics|    1.3|     96|     512|   32| 30712M|            1.3.28202|
| 4| [level_zero:gpu:4]|                 Intel Iris Xe Graphics|    1.3|     96|     512|   32| 30712M|            1.3.28202|
| 5| [level_zero:gpu:5]|                 Intel Iris Xe Graphics|    1.3|     96|     512|   32| 30712M|            1.3.28202|
|15|     [opencl:gpu:7]|                 Intel Iris Xe Graphics|    3.0|     96|     512|   32| 30712M|       23.52.28202.52|
|16|     [opencl:cpu:0]|           13th Gen Intel Core i7-1360P|    3.0|     12|    8192|   64| 33174M|2024.17.3.0.08_160000|
|17|     [opencl:acc:0]|            Intel FPGA Emulation Device|    1.2|     12|67108864|   64| 33174M|2024.17.3.0.08_160000|
ggml_backend_sycl_set_single_device: use single device: [0]
use 1 SYCL GPUs: [0] with Max compute units:96
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  3577.56 MiB
llm_load_tensors:        CPU buffer size =    70.31 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      SYCL0 KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.12 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =    70.50 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =     9.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2

system_info: n_threads = 12 / 12 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = 400, n_keep = 1
...
llama_print_timings:        load time =   16332.81 ms
llama_print_timings:      sample time =       7.42 ms /   400 runs   (    0.02 ms per token, 53915.62 tokens per second)
llama_print_timings: prompt eval time =    2610.73 ms /    14 tokens (  186.48 ms per token,     5.36 tokens per second)
llama_print_timings:        eval time =   55008.21 ms /   399 runs   (  137.87 ms per token,     7.25 tokens per second)
llama_print_timings:       total time =   57694.82 ms /   413 tokens
Log end
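
On the flags used here: -sm none disables splitting the model across the (duplicated) level-zero devices and -mg 0 pins everything to device 0, which is why the output shows "use 1 SYCL GPUs: [0]". To smoke-test a different device from the table, only -mg should need to change; e.g. (device index 1 is just an example):

ZES_ENABLE_SYSMAN=1 ./build/bin/main -m models/llama-2-7b.Q4_0.gguf -p "Hello" -n 32 -ngl 33 -sm none -mg 1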

@gamunu gamunu closed this May 29, 2024
@gamunu gamunu reopened this May 30, 2024
@gamunu
Author

gamunu commented May 30, 2024

I was slightly confused. I presume @felipeagc managed to get these changes into main?

@dhiltgen We may have to do some refactoring:

  • ZE_RESULT_ERROR_UNINITIALIZED and ZE_RESULT_ERROR_OUT_OF_HOST_MEMORY: I don't think we need these in the local redefinition (see the comparison after this list);
typedef enum ze_result_t {
  ZE_RESULT_SUCCESS = 0,
  ZE_RESULT_ERROR_UNINITIALIZED = 1,
  ZE_RESULT_ERROR_OUT_OF_HOST_MEMORY = 2
  // Other values omitted for now...
} ze_result_t;
  • The device info is not being logged correctly; it looks like a oneAPI issue. I've opened a support ticket with Intel and am waiting for more information.
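
For comparison, the corresponding entries in the upstream ze_api.h look like this (abridged; values as defined in the oneAPI Level Zero spec, as far as I can tell):

typedef enum _ze_result_t {
  ZE_RESULT_SUCCESS = 0,                        // operation succeeded
  ZE_RESULT_NOT_READY = 1,                      // not signaled yet
  ZE_RESULT_ERROR_DEVICE_LOST = 0x70000001,     // device hung, reset, or removed
  ZE_RESULT_ERROR_OUT_OF_HOST_MEMORY = 0x70000002,
  ZE_RESULT_ERROR_UNINITIALIZED = 0x78000001,   // driver not initialized
  // many more values omitted
} ze_result_t;

The real error codes are large sentinel values rather than small consecutive integers, so the simplified redefinition above would mis-map raw results coming back from the driver; including the real header (or at least the real values) seems safer than redefining the enum.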

@dhiltgen
Collaborator

PR #3278 was merged recently. I'm refactoring the GPU discovery code in PR #4517, but after that gets in, there might be some improvements to carry over from this one.

@gamunu
Author

gamunu commented May 31, 2024

@dhiltgen, it may be good to credit the original author of these changes in the release notes; that PR doesn't reflect the commit history of @felipeagc.
