Non-deterministic output of the llama.cpp server when using multiple slots #7052

Open
reuank opened this issue May 3, 2024 · 4 comments

Comments

@reuank
Contributor

reuank commented May 3, 2024

Hey there,

thank you for your great work on llama.cpp!

I am using it in my bachelor's thesis to build an LLM benchmarking tool. To make use of big GPUs, I run a llama.cpp server in the background and send HTTP requests to it from multiple threads. This way I get much faster execution times.

For my benchmarks it is important to get deterministic results when prompting the model. I therefore set the temperature to 0 and disable other samplers. When closely inspecting the returned completions and logits across multiple runs, I realised that they are, however, not deterministic.

I have created the proof of concept below, which I executed on an H100. It spawns a llama.cpp server with 8 slots, and sends the same prompt to each slot using multiple threads. I expected all completions to be identical, but they were not. When running the script multiple times, I get between 5 and 8 unique completion texts when using 8 slots. If everything were completely deterministic, there should only be a single unique completion text in this case.
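To see where the completions start to differ, rather than only how many distinct ones there are, the returned texts could be grouped and compared character by character. A minimal sketch, assuming `results` is the list of completion strings returned by `run_all` in the script further down; `first_divergence` and `summarize_completions` are hypothetical helper names and not part of llama.cpp:

```python
from collections import Counter
from typing import List, Optional


def first_divergence(a: str, b: str) -> Optional[int]:
    """Return the index of the first differing character, or None if a == b."""
    for i, (char_a, char_b) in enumerate(zip(a, b)):
        if char_a != char_b:
            return i
    # No mismatch in the common prefix: either equal, or one is a prefix of the other.
    return None if len(a) == len(b) else min(len(a), len(b))


def summarize_completions(results: List[str]) -> None:
    """Print how often each distinct completion occurs and where it leaves the most common one."""
    if not results:
        print("no completions to compare")
        return
    counts = Counter(results)
    reference, _ = counts.most_common(1)[0]
    for text, count in counts.most_common():
        position = first_divergence(reference, text)
        print(f"{count:3d}x  first difference to most common completion: {position}")
```

Calling `summarize_completions(results)` after `run_all` would show whether the completions split early or only after many shared tokens.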

When using a single slot, I always get the same answer, but still small variations in the logits. They don't seem to be big enough to cause different tokens to be selected though.
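One way to quantify such variations is to compare the per-token probabilities of two runs and check whether the selected tokens still agree. A rough sketch, assuming each run has already been reduced to a list of `(token, probability)` pairs (for example from the data returned when `n_probs` is set; the exact response layout is not shown here and should be checked against the server version in use). `compare_runs` is a hypothetical helper name:

```python
from typing import List, Tuple

TokenProb = Tuple[str, float]


def compare_runs(run_a: List[TokenProb], run_b: List[TokenProb]) -> Tuple[bool, float]:
    """Return (tokens_identical, largest absolute probability difference)."""
    tokens_identical = [tok for tok, _ in run_a] == [tok for tok, _ in run_b]
    # Only positions present in both runs are compared.
    max_diff = max(
        (abs(prob_a - prob_b) for (_, prob_a), (_, prob_b) in zip(run_a, run_b)),
        default=0.0,
    )
    return tokens_identical, max_diff
```

A result such as `(True, 1e-6)` would match the observation above: the numbers differ slightly, but not enough to change the chosen tokens.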

I am currently running llama.cpp version b2774. I experienced this behavior on my MacBook Pro M1, as well as on an H100 and an A100. I also got this behavior with different models and different quantizations. It seems to me that the output just gets more random the more slots I use.

Can anyone explain to me what is happening here? Is there a way to force the outputs to be deterministic? Am I missing something here?

I would really appreciate any help!

Best
Leon

import json
import threading
import time
from pathlib import Path
from queue import Queue
from typing import List

from requests import Session

import subprocess

from tqdm import tqdm


def kill_all_old_servers():
    # Not defined in the original post. Assumed helper: stop leftover server
    # processes from earlier runs (relies on `pkill`, i.e. Linux/macOS only).
    subprocess.run(["pkill", "-f", str(server_binary_path)], check=False)


def create_completion(prompt: str, slot_id: int):
    request = {
        "prompt": prompt,
        "id_slot": slot_id,  # ensure that a thread only uses its own server slot
        "n_predict": 128,
        "n_probs": 1,
        "temperature": 0,
        "samplers": ["temperature"],
        "seed": 1234,
        "repeat_last_n": 0,
        "min_p": 0.0,
        "top_p": 1.0,
        "top_k": 100,
        "repeat_penalty": 1.0,
        "mirostat_eta": 0.0,
        "mirostat_tau": 0.0,
        "cache_prompt": False
    }

    raw_completion_response = session.post(url=completion_url, headers=headers, json=request).json()

    return raw_completion_response["content"]


def run_subset(thread_id: int, prompts: List[str], output_queue: Queue, shared_progressbar: tqdm):
    for prompt in prompts:
        response = create_completion(prompt, thread_id)
        output_queue.put(response)
        shared_progressbar.update(1)


def run_all(prompts: List[str]):
    threads = []
    output_queue = Queue()

    def distribute_chunks(data, num_threads):
        n = len(data)
        chunk_size = n // num_threads
        remainder = n % num_threads

        chunks = []
        start = 0

        for thread_id in range(num_threads):
            end = start + chunk_size + (1 if thread_id < remainder else 0)
            chunks.append(data[start:end])
            start = end

        return chunks

    chunks = distribute_chunks(data=prompts, num_threads=n_parallel)

    shared_progressbar = tqdm(total=len(prompts), desc=f"Prompting model on {n_parallel} server slots.")

    for i in range(n_parallel):
        thread = threading.Thread(target=run_subset, args=(i, chunks[i], output_queue, shared_progressbar))
        threads.append(thread)
        thread.start()

    for thread in threads:
        thread.join()

    shared_progressbar.close()

    all_results: List[str] = []
    while not output_queue.empty():
        all_results.append(output_queue.get())

    return all_results


if __name__ == '__main__':
    prompts = [
        "Once upon a time..."
    ]

    n_parallel = 8

    server_binary_path = Path("../llama.cpp/build/bin/server")
    model_path = Path("../models/llama-2-7b-chat.Q4_K_M.gguf")

    completion_url = "http://localhost:8080/completion"
    headers = {'content-type': 'application/json'}

    session: Session = Session()

    kill_all_old_servers()

    # spawn a new server
    server_process_arguments = [
        str(server_binary_path),
        "-m", str(model_path),
        "-b", "1024",
        "-c", "8192",
        "-ngl", "1000",
        "-np", str(n_parallel)
    ]

    process = subprocess.Popen(server_process_arguments, stdout=subprocess.DEVNULL, stderr=subprocess.STDOUT)
    time.sleep(2)  # wait for the server to start

    results = run_all(prompts=prompts*16)
    unique_results = len(list(set(results)))

    print(json.dumps(unique_results, indent=2))

    process.terminate()
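
The fixed `time.sleep(2)` in the script may be too short for larger models, in which case the first requests would be sent before the server is ready. A possible alternative is to poll until the server answers. This is only a sketch: it assumes the server build in use exposes a `/health` endpoint that returns HTTP 200 once the model is loaded, and `http_ok` and `wait_until_ready` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str) -> bool:
    """Return True if a GET request to `url` succeeds with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=1) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and error status codes.
        return False


def wait_until_ready(probe: Callable[[], bool],
                     timeout_s: float = 60.0,
                     interval_s: float = 0.5) -> bool:
    """Call `probe` repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False
```

In the script, `time.sleep(2)` could then be replaced by something like `wait_until_ready(lambda: http_ok("http://localhost:8080/health"))`, aborting the run if it returns `False`.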
@kaetemi
Collaborator

kaetemi commented May 3, 2024

There is one randomizer that's shared continuously across all generations (rng in the llama_context object). EDIT: That shouldn't have any effect at temperature 0, though... This sounds like some difference in order of operations affecting precision. (Vertical vs. horizontal vectorization, maybe. I don't know the implementation details here specifically.) (You may want to trace through the batching behaviour in the server to find out what happens there.)

EDIT2: It might be continuous batching, which is enabled by default. There currently doesn't seem to be a flag to disable it, but you can change the default in the source to see what effect that has on determinism. (See #6358)

@reuank
Contributor Author

reuank commented May 3, 2024

Hey @kaetemi,
thank you for your comment. According to the server documentation, continuous batching is disabled by default. Is this outdated information then?

@kaetemi
Collaborator

kaetemi commented May 3, 2024

Yep, looks like the documentation is not updated yet.

printf(" -cb, --cont-batching enable continuous batching (a.k.a dynamic batching) (default: enabled)\n");

bool cont_batching = true; // insert new sequences for decoding on-the-fly

@JohannesGaessler
Collaborator

First of all, make sure that you're using a version that includes the fix from #6835, because without it there is only a single RNG state across all slots.

On the latest master the results for multiple slots are still not 100% deterministic. The problem mainly has to do with using more than one slot, see #6950. Generally speaking, floating point operations only return bit-for-bit identical results if the exact same operations are executed in the exact same way. However, using multiple slots is faster precisely because the operations are not executed in the same way for every request.
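
The effect can be reproduced in isolation. The following toy sketch (not llama.cpp code) adds the same four numbers with two different groupings, standing in for a reduction that is split differently depending on how the work is batched. The values are deliberately extreme so the difference is visible, and `chunked_sum`, `greedy_token`, and the competing logit of 0.5 are made up for illustration.

```python
from typing import List


def sequential_sum(values: List[float]) -> float:
    """Add values strictly left to right, without compensated summation."""
    total = 0.0
    for value in values:
        total += value
    return total


def chunked_sum(values: List[float], chunk_size: int) -> float:
    """Sum each chunk separately, then add the partial sums."""
    partials = [
        sequential_sum(values[i:i + chunk_size])
        for i in range(0, len(values), chunk_size)
    ]
    return sequential_sum(partials)


def greedy_token(logit_a: float, logit_b: float = 0.5) -> str:
    """Pick the token with the larger logit, as sampling at temperature 0 would."""
    return "A" if logit_a > logit_b else "B"


VALUES = [1e16, 1.0, -1e16, 1.0]

one_chunk = chunked_sum(VALUES, chunk_size=4)   # ((1e16 + 1) - 1e16) + 1
two_chunks = chunked_sum(VALUES, chunk_size=2)  # (1e16 + 1) + (-1e16 + 1)

print(one_chunk, greedy_token(one_chunk))    # 1.0 A
print(two_chunks, greedy_token(two_chunks))  # 0.0 B
```

Both calls add the same numbers, yet the rounding of the intermediate results differs, and here that is enough to change which token greedy selection picks. In a real model the differences are far smaller, but they can still flip the choice when two logits are nearly equal, after which the completions diverge.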

> When using a single slot, I always get the same answer, but still small variations in the logits. They don't seem to be big enough to cause different tokens to be selected though.

To my knowledge that shouldn't be happening, is this still the case on the latest master commit?
