WIP - Web server + conv2d fused + k-quants + dynamic gpu offloading #221

Draft · FSSRepo wants to merge 5 commits into master from FSSRepo:web-server
Conversation

@FSSRepo (Contributor) commented Apr 7, 2024

This PR will probably take me a very long time, since I'll have to make many modifications to the code. Due to the change in how the computation graph is built, which is now similar to PyTorch (I'm not entirely convinced by it, but it at least makes implementing new models easier), and the addition of new models like SVD and PhotoMaker, the code has grown considerably. It will require a longer review than usual.

My goal is to add a REST API and a web client (prompt manager) that is as complete as possible. As a top priority, I also want to make the SDXL model work on my 4 GB GPU. For that, I'll need to improve offloading so that only the largest tensors, which take a long time to transfer, stay on the GPU, while the rest of the smaller matrices reside in RAM.
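As a rough illustration of that offloading policy (a sketch only; TensorInfo and the VRAM budget are made up for the example, this is not the actual implementation), the idea is to sort the weights by size and keep the largest ones resident on the GPU until a budget is exhausted, leaving the rest in RAM:

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of one weight tensor (not a stable-diffusion.cpp type).
struct TensorInfo {
    std::string name;
    size_t nbytes;
    bool on_gpu = false;
};

// Keep the largest tensors on the GPU up to vram_budget bytes; the rest stay in RAM
// and would be copied to the GPU on demand.
static void plan_offloading(std::vector<TensorInfo>& tensors, size_t vram_budget) {
    std::sort(tensors.begin(), tensors.end(),
              [](const TensorInfo& a, const TensorInfo& b) { return a.nbytes > b.nbytes; });
    size_t used = 0;
    for (auto& t : tensors) {
        if (used + t.nbytes <= vram_budget) {
            t.on_gpu = true;
            used += t.nbytes;
        }
    }
}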

I'll also implement Flash Attention for the GPU backend to reduce memory usage in the UNet phase, and create a kernel that fuses the conv2d operation (im2col + GEMM), at the cost of a slight reduction in performance in the VAE phase (with a 60% reduction in memory usage, and even lower usage if I implement FA in the attention phases of the VAE).

I'll work on this in my spare time, since I don't have much of it while working on some projects related to llama.cpp.

Build and run server:

git clone https://github.com/FSSRepo/stable-diffusion.cpp.git --recursive
cd stable-diffusion.cpp
git checkout web-server
mkdir build
cd build
# after this PR is merged, this will be -DSD_CUDA=ON
cmake .. -DSD_CUBLAS=ON
cmake --build . --config Release

The images have questionable quality since I'm using VAE tiling; for that reason, I want to reduce the amount of memory consumed by the VAE phase.

(demo video: 2024-04-05.17-38-44.mp4)

Adding the k-quants is going to take a lot of my time, since I'll have to test the model by trial and error (my computer is somewhat slow for repetitive tasks, which is very frustrating) to see how much quantization affects the different parts of the model. There's also the limitation that only the attention weights are quantized, and these usually represent only about 45% of the total model weights.
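To put that 45% figure in perspective, here is a back-of-the-envelope estimate (the 2 GB fp16 checkpoint size and the ~4.5 bits per weight for Q4_K are assumptions for illustration, not measurements from this PR); quantizing only the attention weights caps the overall size reduction at roughly a third:

#include <cstdio>

int main() {
    // Assumed numbers, for illustration only.
    const double model_fp16_gb = 2.0;   // e.g. an SD 1.5 checkpoint in fp16
    const double attn_fraction = 0.45;  // share of weights that can be quantized
    const double fp16_bits     = 16.0;
    const double q4k_bits      = 4.5;   // approximate bits per weight for Q4_K

    const double attn_gb  = model_fp16_gb * attn_fraction;               // 0.90 GB
    const double other_gb = model_fp16_gb - attn_gb;                     // 1.10 GB
    const double total_gb = other_gb + attn_gb * (q4k_bits / fp16_bits);

    printf("total after quantization: %.2f GB (%.0f%% smaller)\n",
           total_gb, 100.0 * (1.0 - total_gb / model_fp16_gb));          // ~1.35 GB, ~32% smaller
    return 0;
}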

In the last few months, there have been a considerable number of changes in the ggml repository, and I want to use the latest version of ggml. This is going to be a headache.

@Green-Sky (Contributor) commented Apr 7, 2024

Very cool! It looks a bit off in dark mode though.
[screenshot: web UI rendered in dark mode]

Is there a reason you are rolling all of those changes together in this PR? The server, while simple for now, feels pretty usable as is.

@Green-Sky (Contributor)

@FSSRepo is there a reason you diverge so much from what Automatic1111's webui API returns?

It would return a JSON object containing an "images" array. You seem to return a multiline string instead:

data: <json here>

data: <the image json here>

@bssrdf (Contributor) commented Apr 7, 2024

@FSSRepo, very excited by the items you are working on. In particular, I am interested in "the kernel by merging the conv2d operation (im2col + GEMM)". Can you elaborate on this? Is im2col going to be skipped? Or done on the fly? Thanks

@Green-Sky (Contributor) commented Apr 8, 2024

@FSSRepo I suggest changing the default port to either 7860 (Automatic1111's) or something else entirely. Yesterday I ran into an issue with the llama.cpp server, and it turned out that I had both the sd.cpp and llama.cpp servers running on the same port, which made a random one of them respond. I don't know why this happens; I had never seen two processes share a port and assumed binding was always exclusive...

$ netstat -tlpen | grep 8080
tcp        0      0 127.0.0.1:8080          0.0.0.0:*               LISTEN      1000       448830     201652/llama-server
tcp        0      0 127.0.0.1:8080          0.0.0.0:*               LISTEN      1000       438210     201539/result/bin/s
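One possible explanation (an assumption on my side, not something verified against either server's code) is that both listening sockets are opened with SO_REUSEPORT, which on Linux allows several processes to bind the exact same address and port, with the kernel picking which listener gets each incoming connection. Minimal sketch:

// Two processes running this can both end up in LISTEN state on 127.0.0.1:8080.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    int fd  = socket(AF_INET, SOCK_STREAM, 0);
    int yes = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &yes, sizeof(yes)); // without this, the second bind() fails

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(8080);
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

    bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    listen(fd, 16);
    pause();   // keep the socket open
    close(fd);
    return 0;
}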

@FSSRepo (Contributor, Author) commented Apr 8, 2024

@bssrdf

Is im2col going to be skipped? Or done on the fly?

I am going to do something similar to flash attention: divide the blocks generated by im2col into smaller parts and perform the matrix multiplication using tensor cores. However, analyzing the algorithm, it will have to iterate along the channels sequentially, which could be a minimum of 102,400 iterations. This could incur a large number of memory accesses (hence the regression in performance).
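A plain CPU reference of the "on the fly" idea (just to show the indexing; this is not the CUDA kernel from this PR, which tiles the work and uses tensor cores): instead of materializing the im2col matrix, every output element gathers its inputs directly, so the in_c * kh * kw reduction is the sequential iteration mentioned above.

#include <cstddef>
#include <vector>

// Naive implicit-im2col 2D convolution, stride 1, zero padding pad, fp32 for clarity.
// input:  [in_c, in_h, in_w]
// kernel: [out_c, in_c, kh, kw]
// output: [out_c, out_h, out_w] with out_h = in_h + 2*pad - kh + 1, out_w likewise.
static void conv2d_implicit(const std::vector<float>& input,
                            const std::vector<float>& kernel,
                            std::vector<float>& output,
                            int in_c, int in_h, int in_w,
                            int out_c, int kh, int kw, int pad) {
    const int out_h = in_h + 2 * pad - kh + 1;
    const int out_w = in_w + 2 * pad - kw + 1;
    output.assign((size_t)out_c * out_h * out_w, 0.0f);
    for (int oc = 0; oc < out_c; ++oc) {
        for (int oy = 0; oy < out_h; ++oy) {
            for (int ox = 0; ox < out_w; ++ox) {
                float acc = 0.0f;
                // This is the reduction the fused kernel has to walk sequentially
                // (in_c * kh * kw terms) instead of reading a precomputed im2col row.
                for (int ic = 0; ic < in_c; ++ic) {
                    for (int ky = 0; ky < kh; ++ky) {
                        for (int kx = 0; kx < kw; ++kx) {
                            const int iy = oy + ky - pad;
                            const int ix = ox + kx - pad;
                            if (iy < 0 || iy >= in_h || ix < 0 || ix >= in_w) continue;
                            acc += input[((size_t)ic * in_h + iy) * in_w + ix] *
                                   kernel[(((size_t)oc * in_c + ic) * kh + ky) * kw + kx];
                        }
                    }
                }
                output[((size_t)oc * out_h + oy) * out_w + ox] = acc;
            }
        }
    }
}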

I suggest changing the default port to either 7860 (Automatic1111's) or something else entirely.

I will try to make the endpoints equivalent to Automatic1111's, so that they stick to what can be considered the standard functionality, so to speak.

@FSSRepo (Contributor, Author) commented Apr 10, 2024

@leejet I'm not sure if this could lead to memory leaks, since it needs to be created for each model, and 10 MB is a lot just to store the metadata of the tensors; this could instead be get_params_num()*ggml_tensor_overhead().

static char temp_buffer[1024 * 1024 * 10];
ggml_context* get_temp_ctx() {
    struct ggml_init_params params;
    params.mem_size   = sizeof(temp_buffer);
    params.mem_buffer = temp_buffer;
    params.no_alloc   = true;
    ggml_context* temp_ctx = ggml_init(params);
    GGML_ASSERT(temp_ctx != NULL);
    return temp_ctx;
}

There are some arbitrary memory space additions like this:

struct ggml_init_params params;
params.mem_size = static_cast<size_t>(outs.size() * ggml_tensor_overhead()) + 1024 * 1024;
params.mem_buffer = NULL;
params.no_alloc = true;
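For reference, a minimal sketch of the alternative being discussed: sizing a no-alloc (metadata-only) context from the tensor count instead of a fixed 10 MB static buffer. get_params_num() stands for the model-specific helper mentioned above; the rest is the regular ggml API.

#include "ggml.h"

// Metadata-only context sized exactly for num_tensors tensor headers; each header
// plus its bookkeeping costs ggml_tensor_overhead() bytes.
static ggml_context* make_meta_ctx(size_t num_tensors /* e.g. get_params_num() */) {
    struct ggml_init_params params;
    params.mem_size   = num_tensors * ggml_tensor_overhead();
    params.mem_buffer = NULL;   // let ggml allocate (and later free) this small buffer
    params.no_alloc   = true;   // tensors carry only metadata, no data
    ggml_context* ctx = ggml_init(params);
    GGML_ASSERT(ctx != NULL);
    return ctx;                 // caller is responsible for ggml_free(ctx)
}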

@leejet (Owner) commented Apr 14, 2024

@leejet I'm not sure if this could lead to memory leaks, since it needs to be created for each model, and 10 MB is a lot just to store the metadata of the tensors; this could instead be get_params_num()*ggml_tensor_overhead().

static char temp_buffer[1024 * 1024 * 10];
ggml_context* get_temp_ctx() {
    struct ggml_init_params params;
    params.mem_size   = sizeof(temp_buffer);
    params.mem_buffer = temp_buffer;
    params.no_alloc   = true;
    ggml_context* temp_ctx = ggml_init(params);
    GGML_ASSERT(temp_ctx != NULL);
    return temp_ctx;
}

There are some arbitrary memory space additions like this:

struct ggml_init_params params;
params.mem_size = static_cast<size_t>(outs.size() * ggml_tensor_overhead()) + 1024 * 1024;
params.mem_buffer = NULL;
params.no_alloc = true;

get_temp_ctx is not used anymore, so there is actually no memory leak. In the latest commit, I removed the unused code, including get_temp_ctx.

The additional space for control_ctx is necessary because it will be used later to create some other tensors, such as guided_hint. In order to avoid recalculating the size and increasing mem_size every time control_ctx is used, I simply allocate an extra 1MB of memory. In fact, relative to the memory size of the model weights, the size of this additional memory space is negligible.

@FSSRepo FSSRepo marked this pull request as draft April 18, 2024 16:32
@Green-Sky (Contributor)

@FSSRepo for the latest commit I had to use ggerganov/llama.cpp@2f538b9 (from the flash-attn llama.cpp branch).
But even with that it won't compile for me (CUDA 11.8).

stable-diffusion.cpp> nvcc warning : incompatible redefinition for option 'compiler-bindir', the last value of this option was used
stable-diffusion.cpp> /build/cj22zpf17mymwlw80zs7q5075f4rmmjs-source/ggml/src/ggml-cuda/common.cuh(310): error: more than one conversion function from "__half" to a built-in type applies:
stable-diffusion.cpp>             function "__half::operator float() const"
stable-diffusion.cpp> /nix/store/da7kq3ibhnyf2vxb1j7pl2wr8w5appih-cudatoolkit-11.8.0/include/cuda_fp16.hpp(204): here
stable-diffusion.cpp>             function "__half::operator short() const"
stable-diffusion.cpp> /nix/store/da7kq3ibhnyf2vxb1j7pl2wr8w5appih-cudatoolkit-11.8.0/include/cuda_fp16.hpp(222): here
stable-diffusion.cpp>             function "__half::operator unsigned short() const"
stable-diffusion.cpp> /nix/store/da7kq3ibhnyf2vxb1j7pl2wr8w5appih-cudatoolkit-11.8.0/include/cuda_fp16.hpp(225): here
stable-diffusion.cpp>             function "__half::operator int() const"
stable-diffusion.cpp> /nix/store/da7kq3ibhnyf2vxb1j7pl2wr8w5appih-cudatoolkit-11.8.0/include/cuda_fp16.hpp(228): here
stable-diffusion.cpp>             function "__half::operator unsigned int() const"
stable-diffusion.cpp> /nix/store/da7kq3ibhnyf2vxb1j7pl2wr8w5appih-cudatoolkit-11.8.0/include/cuda_fp16.hpp(231): here
stable-diffusion.cpp>             function "__half::operator long long() const"
stable-diffusion.cpp> /nix/store/da7kq3ibhnyf2vxb1j7pl2wr8w5appih-cudatoolkit-11.8.0/include/cuda_fp16.hpp(234): here
stable-diffusion.cpp>             function "__half::operator unsigned long long() const"
stable-diffusion.cpp> /nix/store/da7kq3ibhnyf2vxb1j7pl2wr8w5appih-cudatoolkit-11.8.0/include/cuda_fp16.hpp(237): here
stable-diffusion.cpp>             function "__half::operator __nv_bool() const"
stable-diffusion.cpp> /nix/store/da7kq3ibhnyf2vxb1j7pl2wr8w5appih-cudatoolkit-11.8.0/include/cuda_fp16.hpp(241): here

(a lot of those)

@FSSRepo (Contributor, Author) commented Apr 20, 2024

For now, the kernel I created to avoid the overhead of im2col results in a 50% reduction in performance, even though it's only applied to the operation that generates a tensor of up to 1.2 GB for a 512x512 image. I'll try to optimize it further later on. As for the flash attention kernel, it doesn't improve the overall inference performance as much as expected, because it's only applied when head_dim is 40, which generates a tensor [4096, 4096, 8] weighing 500 MB for a 512x512 image and 2 GB for a 1024x1024 image, and only with SD 1.5 so far. None of these changes have been tested on SDXL yet. I'll rent an RTX 3060 for quick tests.
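For context on where those tensor sizes come from, a rough reconstruction of the arithmetic (the 128-channel 3x3 conv and fp32 storage are assumptions; the comment above only states the resulting sizes): the im2col buffer holds in_c*kh*kw * out_h*out_w elements, and the attention score tensor holds seq*seq*heads elements.

#include <cstdio>

int main() {
    // im2col buffer for an assumed 3x3 conv with 128 input channels on a
    // 512x512 feature map in fp32: in_c*kh*kw * out_h*out_w * 4 bytes.
    const double im2col_bytes = 128.0 * 3 * 3 * 512 * 512 * 4;

    // Attention scores for the head_dim == 40 layer of SD 1.5 at 512x512:
    // a [4096, 4096, 8] fp32 tensor (4096 = 64*64 latent tokens, 8 heads).
    const double attn_bytes = 4096.0 * 4096 * 8 * 4;

    printf("im2col: %.2f GiB, attention scores: %.2f GiB\n",
           im2col_bytes / (1 << 30), attn_bytes / (1 << 30));   // ~1.13 GiB and 0.50 GiB
    return 0;
}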

@FSSRepo (Contributor, Author) commented Apr 21, 2024

@Green-Sky I'll try to run tests on an RTX 3060, mainly with CUDA Toolkit 11.8. The truth is that there isn't a standard API for stable diffusion: for example, the ComfyUI API isn't the same as Automatic1111's, nor is Fooocus's. It's somewhat different from LLMs, where the ChatML-style API emulating the OpenAI endpoints is shared by llama.cpp, vLLM, text-generation-webui, and others.

@Green-Sky (Contributor)

Yeah, there is no "the one" web API, but Automatic1111's API is (or was?) the one with the most usage. Truthfully, I made a small chatbot with it before ComfyUI was a thing.
BTW, I adopted the funky API you implemented. It took almost no effort. 😃

Thanks again for doing this PR :)

@FSSRepo (Contributor, Author) commented Apr 21, 2024

@Green-Sky It seemed easier to me to implement it this way (a streaming endpoint, like chat), since otherwise it would have required WebSockets or a loop polling an 'http:127.0.0.0:7680/progress' endpoint, which would be the Automatic1111 equivalent if you want to know the real-time status.
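A minimal sketch of that design (assuming a cpp-httplib style server; the route name and the JSON payload here are placeholders, not the PR's actual endpoints): the whole generation is reported over one chunked text/event-stream response, so the client needs neither WebSockets nor a polling loop.

#include <httplib.h>
#include <string>

int main() {
    httplib::Server svr;

    svr.Get("/txt2img-stream", [](const httplib::Request&, httplib::Response& res) {
        res.set_chunked_content_provider("text/event-stream",
            [](size_t /*offset*/, httplib::DataSink& sink) {
                for (int step = 1; step <= 20; ++step) {
                    // In the real server this would be driven by the sampler's progress callback.
                    std::string ev = "data: {\"step\": " + std::to_string(step) + ", \"steps\": 20}\n\n";
                    if (!sink.write(ev.data(), ev.size())) return false; // client disconnected
                }
                sink.done(); // end of stream
                return true;
            });
    });

    svr.listen("127.0.0.1", 8080);
    return 0;
}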

@FSSRepo (Contributor, Author) commented Apr 21, 2024

@Green-Sky
Unfortunately, I cannot run tests on CUDA Toolkit 11.8; I have no way to conduct them. I tried using Google Colab, but it already uses the latest version of the toolkit (12.2).

I fixed it already.

@FSSRepo (Contributor, Author) commented Apr 21, 2024

@Green-Sky try now

@Green-Sky (Contributor) commented Apr 23, 2024

@FSSRepo it indeed works now again :)

Did you disable VAE tiling? Because now it fails to allocate the buffer for decoding.

[INFO ] stable-diffusion.cpp:1852 - sampling completed, taking 2.41s
[INFO ] stable-diffusion.cpp:1860 - generating 1 latent images completed, taking 2.41s
[INFO ] stable-diffusion.cpp:1864 - decoding 1 latents
ggml_gallocr_reserve_n: reallocating CUDA0 buffer from size 0.00 MiB to 3744.00 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 3744.00 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 3925868544
[ERROR] ggml_extend.hpp:831  - vae: failed to allocate the compute buffer

Segmentation fault (core dumped)

i like the colors :)

edit: actually there are easily 6 GiB of VRAM available.

@FSSRepo (Contributor, Author) commented Apr 23, 2024

Try enabling SD_CONV2D_MEMORY_EFFICIENT, which reduces the VAE memory usage, or enable VAE tiling manually in the UI.

@vonjackustc

Can the fused conv2d support the P40?
