WIP - Web server + conv2d fused + k-quants + dynamic gpu offloading #221

Draft · FSSRepo wants to merge 5 commits into master from FSSRepo:web-server
Conversation

@FSSRepo (Contributor) commented Apr 7, 2024

This PR will probably take me a very long time, since I'll have to make many modifications to the code. Due to the change in how the computation graph is built, which is now similar to PyTorch (I'm not entirely convinced by it, but it at least makes implementing new models easier), and the addition of new models like SVD and PhotoMaker, the code has grown considerably. It will require a longer review than usual.

My goal is to add a REST API and a web client (prompt manager) that is as complete as possible. As a top priority, I also want to make the SDXL model work on my 4 GB GPU. For that, I'll need to improve offloading so that only the largest tensors, which take a long time to transfer, stay on the GPU, while the rest of the smaller matrices reside in RAM.
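As a rough illustration of that offloading policy (a sketch only; TensorInfo and the VRAM budget are made up for the example, this is not the actual implementation), the idea is to sort the weights by size and keep the largest ones resident on the GPU until a budget is exhausted, leaving the rest in RAM:

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of one weight tensor (not a stable-diffusion.cpp type).
struct TensorInfo {
    std::string name;
    size_t nbytes;
    bool on_gpu = false;
};

// Keep the largest tensors on the GPU up to vram_budget bytes; the rest stay in RAM
// and would be copied to the GPU on demand.
static void plan_offloading(std::vector<TensorInfo>& tensors, size_t vram_budget) {
    std::sort(tensors.begin(), tensors.end(),
              [](const TensorInfo& a, const TensorInfo& b) { return a.nbytes > b.nbytes; });
    size_t used = 0;
    for (auto& t : tensors) {
        if (used + t.nbytes <= vram_budget) {
            t.on_gpu = true;
            used += t.nbytes;
        }
    }
}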

I'll also implement Flash Attention for the GPU backend to reduce memory usage in the UNet phase, and create a kernel that fuses the conv2d operation (im2col + GEMM), at the cost of a slight reduction in performance in the VAE phase (with a 60% reduction in memory usage, and even lower usage if I implement FA in the attention phases of the VAE).

I'll work on this in my spare time, since I don't have much of it while working on some projects related to llama.cpp.

Build and run server:

git clone https://github.com/FSSRepo/stable-diffusion.cpp.git --recursive
cd stable-diffusion.cpp
git checkout web-server
mkdir build
cd build
# after this PR is merged, this will be -DSD_CUDA=ON
cmake .. -DSD_CUBLAS=ON
cmake --build . --config Release

The images have questionable quality since I'm using VAE tiling; for that reason, I want to reduce the amount of memory consumed by the VAE phase.

(demo video: 2024-04-05.17-38-44.mp4)

Adding the k-quants is going to take a lot of my time, since I'll have to test the model by trial and error (my computer is somewhat slow for repetitive tasks, which is very frustrating) to see how much quantization affects the different parts of the model. There's also the limitation that only the attention weights are quantized, and these usually represent only about 45% of the total model weights.
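To put that 45% figure in perspective, here is a back-of-the-envelope estimate (the 2 GB fp16 checkpoint size and the ~4.5 bits per weight for Q4_K are assumptions for illustration, not measurements from this PR); quantizing only the attention weights caps the overall size reduction at roughly a third:

#include <cstdio>

int main() {
    // Assumed numbers, for illustration only.
    const double model_fp16_gb = 2.0;   // e.g. an SD 1.5 checkpoint in fp16
    const double attn_fraction = 0.45;  // share of weights that can be quantized
    const double fp16_bits     = 16.0;
    const double q4k_bits      = 4.5;   // approximate bits per weight for Q4_K

    const double attn_gb  = model_fp16_gb * attn_fraction;               // 0.90 GB
    const double other_gb = model_fp16_gb - attn_gb;                     // 1.10 GB
    const double total_gb = other_gb + attn_gb * (q4k_bits / fp16_bits);

    printf("total after quantization: %.2f GB (%.0f%% smaller)\n",
           total_gb, 100.0 * (1.0 - total_gb / model_fp16_gb));          // ~1.35 GB, ~32% smaller
    return 0;
}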

In the last few months, there have been a considerable number of changes in the ggml repository, and I want to use the latest version of ggml. This is going to be a headache.

@Green-Sky (Contributor) commented Apr 7, 2024

Very cool! It looks a bit off in dark mode though.
[screenshot: web UI rendered in dark mode]

Is there a reason you are rolling all of those changes together in this PR? The server, while simple for now, feels pretty usable as is.

@Green-Sky (Contributor)

@FSSRepo is there a reason you diverge so much from what Automatic1111's webui API returns?

It would return a JSON object containing an "images" array. You seem to return a multiline string instead:

data: <json here>

data: <the image json here>

@bssrdf (Contributor) commented Apr 7, 2024

@FSSRepo, very excited by the items you are working on. In particular, I am interested in "the kernel by merging the conv2d operation (im2col + GEMM)". Can you elaborate on this? Is im2col going to be skipped? Or done on the fly? Thanks

@Green-Sky (Contributor) commented Apr 8, 2024

@FSSRepo I suggest changing the default port to either 7860 (Automatic1111's) or something else entirely. Yesterday I ran into an issue with the llama.cpp server, and it turned out that I had both the sd.cpp and llama.cpp servers running on the same port, which made a random one of them respond. I don't know why this happens; I had never seen two processes share a port and assumed binding was always exclusive...

$ netstat -tlpen | grep 8080
tcp        0      0 127.0.0.1:8080          0.0.0.0:*               LISTEN      1000       448830     201652/llama-server
tcp        0      0 127.0.0.1:8080          0.0.0.0:*               LISTEN      1000       438210     201539/result/bin/s
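One possible explanation (an assumption on my side, not something verified against either server's code) is that both listening sockets are opened with SO_REUSEPORT, which on Linux allows several processes to bind the exact same address and port, with the kernel picking which listener gets each incoming connection. Minimal sketch:

// Two processes running this can both end up in LISTEN state on 127.0.0.1:8080.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    int fd  = socket(AF_INET, SOCK_STREAM, 0);
    int yes = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &yes, sizeof(yes)); // without this, the second bind() fails

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(8080);
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

    bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    listen(fd, 16);
    pause();   // keep the socket open
    close(fd);
    return 0;
}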

@FSSRepo (Contributor, Author) commented Apr 8, 2024

@bssrdf

Is im2col going to be skipped? Or done on the fly?

I am going to do something similar to flash attention: divide the blocks generated by im2col into smaller parts and perform the matrix multiplication using tensor cores. However, analyzing the algorithm, it will have to iterate along the channels sequentially, which could be a minimum of 102,400 iterations. This could incur a large number of memory accesses (hence the regression in performance).
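A plain CPU reference of the "on the fly" idea (just to show the indexing; this is not the CUDA kernel from this PR, which tiles the work and uses tensor cores): instead of materializing the im2col matrix, every output element gathers its inputs directly, so the in_c * kh * kw reduction is the sequential iteration mentioned above.

#include <cstddef>
#include <vector>

// Naive implicit-im2col 2D convolution, stride 1, zero padding pad, fp32 for clarity.
// input:  [in_c, in_h, in_w]
// kernel: [out_c, in_c, kh, kw]
// output: [out_c, out_h, out_w] with out_h = in_h + 2*pad - kh + 1, out_w likewise.
static void conv2d_implicit(const std::vector<float>& input,
                            const std::vector<float>& kernel,
                            std::vector<float>& output,
                            int in_c, int in_h, int in_w,
                            int out_c, int kh, int kw, int pad) {
    const int out_h = in_h + 2 * pad - kh + 1;
    const int out_w = in_w + 2 * pad - kw + 1;
    output.assign((size_t)out_c * out_h * out_w, 0.0f);
    for (int oc = 0; oc < out_c; ++oc) {
        for (int oy = 0; oy < out_h; ++oy) {
            for (int ox = 0; ox < out_w; ++ox) {
                float acc = 0.0f;
                // This is the reduction the fused kernel has to walk sequentially
                // (in_c * kh * kw terms) instead of reading a precomputed im2col row.
                for (int ic = 0; ic < in_c; ++ic) {
                    for (int ky = 0; ky < kh; ++ky) {
                        for (int kx = 0; kx < kw; ++kx) {
                            const int iy = oy + ky - pad;
                            const int ix = ox + kx - pad;
                            if (iy < 0 || iy >= in_h || ix < 0 || ix >= in_w) continue;
                            acc += input[((size_t)ic * in_h + iy) * in_w + ix] *
                                   kernel[(((size_t)oc * in_c + ic) * kh + ky) * kw + kx];
                        }
                    }
                }
                output[((size_t)oc * out_h + oy) * out_w + ox] = acc;
            }
        }
    }
}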

I suggest changing the default port to either 7860 (Automatic1111's) or something else entirely.

I will try to make the endpoints equivalent to Automatic1111's, so that they stick to what can be considered the standard functionality, so to speak.

@FSSRepo (Contributor, Author) commented Apr 10, 2024

@leejet I'm not sure if this could lead to memory leaks, since it needs to be created for each model, and 10 MB is a lot just to store the metadata of the tensors; this could instead be get_params_num()*ggml_tensor_overhead().

static char temp_buffer[1024 * 1024 * 10];
ggml_context* get_temp_ctx() {
    struct ggml_init_params params;
    params.mem_size   = sizeof(temp_buffer);
    params.mem_buffer = temp_buffer;
    params.no_alloc   = true;
    ggml_context* temp_ctx = ggml_init(params);
    GGML_ASSERT(temp_ctx != NULL);
    return temp_ctx;
}

There are some arbitrary memory space additions like this:

struct ggml_init_params params;
params.mem_size = static_cast<size_t>(outs.size() * ggml_tensor_overhead()) + 1024 * 1024;
params.mem_buffer = NULL;
params.no_alloc = true;
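For reference, a minimal sketch of the alternative being discussed: sizing a no-alloc (metadata-only) context from the tensor count instead of a fixed 10 MB static buffer. get_params_num() stands for the model-specific helper mentioned above; the rest is the regular ggml API.

#include "ggml.h"

// Metadata-only context sized exactly for num_tensors tensor headers; each header
// plus its bookkeeping costs ggml_tensor_overhead() bytes.
static ggml_context* make_meta_ctx(size_t num_tensors /* e.g. get_params_num() */) {
    struct ggml_init_params params;
    params.mem_size   = num_tensors * ggml_tensor_overhead();
    params.mem_buffer = NULL;   // let ggml allocate (and later free) this small buffer
    params.no_alloc   = true;   // tensors carry only metadata, no data
    ggml_context* ctx = ggml_init(params);
    GGML_ASSERT(ctx != NULL);
    return ctx;                 // caller is responsible for ggml_free(ctx)
}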

@leejet (Owner) commented Apr 14, 2024

@leejet I'm not sure if this could lead to memory leaks, since it needs to be created for each model, and 10 MB is a lot just to store the metadata of the tensors; this could instead be get_params_num()*ggml_tensor_overhead().

static char temp_buffer[1024 * 1024 * 10];
ggml_context* get_temp_ctx() {
    struct ggml_init_params params;
    params.mem_size   = sizeof(temp_buffer);
    params.mem_buffer = temp_buffer;
    params.no_alloc   = true;
    ggml_context* temp_ctx = ggml_init(params);
    GGML_ASSERT(temp_ctx != NULL);
    return temp_ctx;
}

There are some arbitrary memory space additions like this:

struct ggml_init_params params;
params.mem_size = static_cast<size_t>(outs.size() * ggml_tensor_overhead()) + 1024 * 1024;
params.mem_buffer = NULL;
params.no_alloc = true;

get_temp_ctx is not used anymore, so there is actually no memory leak. In the latest commit, I removed the unused code, including get_temp_ctx.

The additional space for control_ctx is necessary because it will be used later to create some other tensors, such as guided_hint. In order to avoid recalculating the size and increasing mem_size every time control_ctx is used, I simply allocate an extra 1MB of memory. In fact, relative to the memory size of the model weights, the size of this additional memory space is negligible.

@FSSRepo FSSRepo marked this pull request as draft April 18, 2024 16:32
@Green-Sky (Contributor)

@FSSRepo for the latest commit I had to use ggerganov/llama.cpp@2f538b9 (from the flash-attn llama.cpp branch).
But even with that it won't compile for me (CUDA 11.8).

stable-diffusion.cpp> nvcc warning : incompatible redefinition for option 'compiler-bindir', the last value of this option was used
stable-diffusion.cpp> /build/cj22zpf17mymwlw80zs7q5075f4rmmjs-source/ggml/src/ggml-cuda/common.cuh(310): error: more than one conversion function from "__half" to a built-in type applies:
stable-diffusion.cpp>             function "__half::operator float() const"
stable-diffusion.cpp> /nix/store/da7kq3ibhnyf2vxb1j7pl2wr8w5appih-cudatoolkit-11.8.0/include/cuda_fp16.hpp(204): here
stable-diffusion.cpp>             function "__half::operator short() const"
stable-diffusion.cpp> /nix/store/da7kq3ibhnyf2vxb1j7pl2wr8w5appih-cudatoolkit-11.8.0/include/cuda_fp16.hpp(222): here
stable-diffusion.cpp>             function "__half::operator unsigned short() const"
stable-diffusion.cpp> /nix/store/da7kq3ibhnyf2vxb1j7pl2wr8w5appih-cudatoolkit-11.8.0/include/cuda_fp16.hpp(225): here
stable-diffusion.cpp>             function "__half::operator int() const"
stable-diffusion.cpp> /nix/store/da7kq3ibhnyf2vxb1j7pl2wr8w5appih-cudatoolkit-11.8.0/include/cuda_fp16.hpp(228): here
stable-diffusion.cpp>             function "__half::operator unsigned int() const"
stable-diffusion.cpp> /nix/store/da7kq3ibhnyf2vxb1j7pl2wr8w5appih-cudatoolkit-11.8.0/include/cuda_fp16.hpp(231): here
stable-diffusion.cpp>             function "__half::operator long long() const"
stable-diffusion.cpp> /nix/store/da7kq3ibhnyf2vxb1j7pl2wr8w5appih-cudatoolkit-11.8.0/include/cuda_fp16.hpp(234): here
stable-diffusion.cpp>             function "__half::operator unsigned long long() const"
stable-diffusion.cpp> /nix/store/da7kq3ibhnyf2vxb1j7pl2wr8w5appih-cudatoolkit-11.8.0/include/cuda_fp16.hpp(237): here
stable-diffusion.cpp>             function "__half::operator __nv_bool() const"
stable-diffusion.cpp> /nix/store/da7kq3ibhnyf2vxb1j7pl2wr8w5appih-cudatoolkit-11.8.0/include/cuda_fp16.hpp(241): here

(a lot of those)

@FSSRepo (Contributor, Author) commented Apr 20, 2024

For now, the kernel I created to avoid the overhead of im2col results in a 50% reduction in performance, even though it's only applied to the operation that generates a tensor of up to 1.2 GB for a 512x512 image. I'll try to optimize it further later on. As for the flash attention kernel, it doesn't improve the overall inference performance as much as expected, because it's only applied when head_dim is 40, which generates a tensor [4096, 4096, 8] weighing 500 MB for a 512x512 image and 2 GB for a 1024x1024 image, and only with SD 1.5 so far. None of these changes have been tested on SDXL yet. I'll rent an RTX 3060 for quick tests.
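For context on where those tensor sizes come from, a rough reconstruction of the arithmetic (the 128-channel 3x3 conv and fp32 storage are assumptions; the comment above only states the resulting sizes): the im2col buffer holds in_c*kh*kw * out_h*out_w elements, and the attention score tensor holds seq*seq*heads elements.

#include <cstdio>

int main() {
    // im2col buffer for an assumed 3x3 conv with 128 input channels on a
    // 512x512 feature map in fp32: in_c*kh*kw * out_h*out_w * 4 bytes.
    const double im2col_bytes = 128.0 * 3 * 3 * 512 * 512 * 4;

    // Attention scores for the head_dim == 40 layer of SD 1.5 at 512x512:
    // a [4096, 4096, 8] fp32 tensor (4096 = 64*64 latent tokens, 8 heads).
    const double attn_bytes = 4096.0 * 4096 * 8 * 4;

    printf("im2col: %.2f GiB, attention scores: %.2f GiB\n",
           im2col_bytes / (1 << 30), attn_bytes / (1 << 30));   // ~1.13 GiB and 0.50 GiB
    return 0;
}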

@FSSRepo (Contributor, Author) commented Apr 21, 2024

@Green-Sky I'll try to run tests on an RTX 3060, mainly with CUDA Toolkit 11.8. The truth is that there isn't a standard API for stable diffusion: for example, the ComfyUI API isn't the same as Automatic1111's, nor is Fooocus's. It's somewhat different from LLMs, where the ChatML-style API emulating the OpenAI endpoints is shared by llama.cpp, vLLM, text-generation-webui, and others.

@Green-Sky (Contributor)

Yeah, there is no "the one" web API, but Automatic1111's API is (or was?) the one with the most usage. Truthfully, I made a small chatbot with it before ComfyUI was a thing.
BTW, I adopted the funky API you implemented. It took almost no effort. 😃

Thanks again for doing this PR :)

@FSSRepo (Contributor, Author) commented Apr 21, 2024

@Green-Sky It seemed easier to me to implement it this way (a streaming endpoint, like chat), since otherwise it would have required WebSockets or a loop polling an 'http:127.0.0.0:7680/progress' endpoint, which would be the Automatic1111 equivalent if you want to know the real-time status.
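A minimal sketch of that design (assuming a cpp-httplib style server; the route name and the JSON payload here are placeholders, not the PR's actual endpoints): the whole generation is reported over one chunked text/event-stream response, so the client needs neither WebSockets nor a polling loop.

#include <httplib.h>
#include <string>

int main() {
    httplib::Server svr;

    svr.Get("/txt2img-stream", [](const httplib::Request&, httplib::Response& res) {
        res.set_chunked_content_provider("text/event-stream",
            [](size_t /*offset*/, httplib::DataSink& sink) {
                for (int step = 1; step <= 20; ++step) {
                    // In the real server this would be driven by the sampler's progress callback.
                    std::string ev = "data: {\"step\": " + std::to_string(step) + ", \"steps\": 20}\n\n";
                    if (!sink.write(ev.data(), ev.size())) return false; // client disconnected
                }
                sink.done(); // end of stream
                return true;
            });
    });

    svr.listen("127.0.0.1", 8080);
    return 0;
}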

@FSSRepo (Contributor, Author) commented Apr 21, 2024

@Green-Sky
Unfortunately, I cannot run tests on CUDA Toolkit 11.8; I have no way to conduct them. I tried using Google Colab, but it already uses the latest version of the toolkit (12.2).

I fixed it already.

@FSSRepo (Contributor, Author) commented Apr 21, 2024

@Green-Sky try now

@Green-Sky (Contributor) commented Apr 23, 2024

@FSSRepo it indeed works now again :)

Did you disable VAE tiling? Because now it fails to allocate the buffer for decoding.

[INFO ] stable-diffusion.cpp:1852 - sampling completed, taking 2.41s
[INFO ] stable-diffusion.cpp:1860 - generating 1 latent images completed, taking 2.41s
[INFO ] stable-diffusion.cpp:1864 - decoding 1 latents
ggml_gallocr_reserve_n: reallocating CUDA0 buffer from size 0.00 MiB to 3744.00 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 3744.00 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 3925868544
[ERROR] ggml_extend.hpp:831  - vae: failed to allocate the compute buffer

Segmentation fault (core dumped)

i like the colors :)

edit: actually there are easily 6 GiB of VRAM available.

@FSSRepo (Contributor, Author) commented Apr 23, 2024

Try enabling SD_CONV2D_MEMORY_EFFICIENT, which reduces the VAE memory usage, or enable VAE tiling manually in the UI.

@vonjackustc

Can the fused conv2d support the P40?
