feat: separate the sequence and conversation. #740

Closed

Conversation

AsakusaRinne
Collaborator

This PR separates Sequence from Conversation. It introduces no change from the user's point of view; it only changes the internal implementation.

In the llama.cpp APIs, the sequence is only a concept. There isn't any struct named Sequence, nor any method named sequence. The only major binding struct related to batched inference is llama_batch. Therefore I believe we should not make Sequence something that directly interops with the llama.cpp native APIs. Instead, it's better to consider it a container for the data and state used during inference. That makes it more extensible; for example, we could swap out a sequence in the conversation.

In this PR, to avoid a breaking change, the ids of Sequence and Conversation are both the llama.cpp seq_id. In the future it would be better to separate them.

@AsakusaRinne AsakusaRinne added the enhancement New feature or request label May 15, 2024
@martindevans
Member

I don't understand the purpose of this change? The idea of Conversation is to be a wrapper around the concept of a sequence and all the things you can do with it. What can be done with the new Sequence struct that can't be done with the Conversation class?

@AsakusaRinne
Collaborator Author

What can be done with the new Sequence struct that can't be done with the Conversation class?

There's nothing related to this point in this PR. You could certainly do everything in Conversation, but this class is already responsible for too many things, including the sequence, kv-cache management and partial execution. As a widely known concept, a Sequence is usually kept independent of the kv-cache and execution in many libraries. For example:

  • llama-cpp-python: it uses typing.Sequence, which is simply a data collection. It might not be a good example, but it is related to llama.cpp. link
  • vLLM: similarly, it defines a Sequence class with no execution or kv-cache parts. link
  • TensorRT-LLM: it defines a GenerationSequence class with only seq_idx and batch_idx. link
  • DeepSpeed inference: a class with the sequence definition and kv-cache management, but without execution. link

This survey doesn't cover every library with batched-inference support, but the ones above are all very popular.

I don't understand the purpose of this change?

From my point of view, I prefer to follow the design of these popular inference libraries. Don't get me wrong, I'm not saying you can't do well without referring to them. What I worry about is that the gap will cost us more time in the future when we want to adopt good (mainly high-level) features from other libraries. For example, when it comes to speculative decoding and fine-tuning, Sequence seems easier to reuse than Conversation. Besides, it might be hard for a new contributor to start contributing to the batched inference part because all the logic is mixed up in Conversation.

Anyway, I won't push strongly for this to be merged since it doesn't block my work. I'll leave the decision to you.

@martindevans
Member

You could certainly do everything in Conversation

I'm not suggesting adding anything extra to Conversation - what I meant by that question is what can be built with (i.e. on top of) this new Sequence struct that can't be done with Conversation?

The Conversation is intended as the lowest level primitive representing a sequence of tokens in the KV cache, with this struct it seems like you want something lower level?

this class has already been responsible for too many functionalities, including sequence, kv-cache management and partial execution.

I disagree - I don't think anything can be removed from the Conversation, or from anything else that works with sequences. You always need to know where to prompt the sequence, and then where to sample it later. That's really all that the Conversation does!

KV cache management is not really done in the Conversation - the KvAccessor just exposes the functions that llama.cpp offers; they're not actually used within Conversation. This is deliberate, it keeps the complex logic of working with the kv cache out of there as much as possible!

@AsakusaRinne
Collaborator Author

with this struct it seems like you want something lower level?

Rather than making Sequence a lower-level struct, my purpose is more to make Conversation a higher-level class.

As you mentioned, the Conversation is intended as the lowest level primitive representing a sequence of tokens in the KV cache. But why does such a lowest level primitive contain an executor (a mid-level class)?

Let's consider this question: if we want to split the LLamaSharp package into a package with low-level APIs and a package with mid/high-level APIs, in which package would you put the Conversation?

@martindevans
Member

Before answering your questions, here is the way I'm thinking about things:

Low Level

Just the functions and data structures that llama.cpp offers, sometimes with a small layer of safety, such as taking and returning spans instead of a raw pointer and length.

Examples of this level are everything in NativeAPI and the various safe handles.

High Level

Things like our current executors, which simply take in text/images and output text. In the long term these should automatically handle templating, self-extension, context shifting, token healing, etc.

This level has two problems:

  • It's great for people who just want to get started quickly using LLMs, but doesn't offer enough for "power users".
  • It's mostly built by directly calling into the low level primitives, which makes the executors very complex as they're trying to manage a lot of things.

Mid Level

If we're going to build new high level executors in the future, we need a set of primitives to build them with that are slightly higher level than the raw llama.cpp API. Ideally this level should offer all of the power of the low level, with no exposed unsafe behaviour. Think of this as a toolkit that can be assembled together into high level systems.

Examples:

  • LLamaTemplate (text -> templated text)
  • Conversation (token(s) -> logits)
  • BatchedExecutor (a batch of conversations, all inferred together)
  • ISamplingPipeline (logits -> token)
  • StreamingTextDecoder (token -> text)
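
To make the composition concrete, here's a very rough sketch of how those pieces could chain into a text -> text loop (the delegates below are hypothetical stand-ins, not the real LLamaSharp types):

```csharp
using System;
using System.Collections.Generic;

// Each delegate stands in for one of the mid-level pieces listed above.
delegate string ApplyTemplate(string userText);      // LLamaTemplate: text -> templated text
delegate float[] Infer(IReadOnlyList<int> tokens);   // Conversation + BatchedExecutor: token(s) -> logits
delegate int Sample(float[] logits);                 // ISamplingPipeline: logits -> token
delegate string DecodeToken(int token);              // StreamingTextDecoder: token -> text

static class MidLevelPipeline
{
    public static string Generate(
        string prompt, int maxTokens, int eosToken,
        ApplyTemplate template, Func<string, List<int>> tokenize,
        Infer infer, Sample sample, DecodeToken decode)
    {
        var tokens = (IReadOnlyList<int>)tokenize(template(prompt));
        var output = "";
        for (var i = 0; i < maxTokens; i++)
        {
            var logits = infer(tokens);   // one llama_decode over the batch
            var next = sample(logits);    // pluggable sampling strategy
            if (next == eosToken) break;
            output += decode(next);       // streaming detokenization
            tokens = new[] { next };      // only the new token needs feeding next round
        }
        return output;
    }
}
```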

Puzzle Pieces

As you can see I'm putting the batched executor in the "mid" level. It's not the pure llama.cpp API, so it doesn't fit in the low level. It's not a friendly and easy to use text to text system so it doesn't fit in the high level. It's intended as one "part of the puzzle" for building high level systems.

All the "mid level" things I have been working on over the last few months nearly form a complete picture of the whole pipeline from text to text (see the mid level examples). Each of these bits encapsulates one part of the llama.cpp API in a totally safe way, that loses no power.

Why?

The idea with splitting it up into pieces like this is that new users can start with the high level executors, built out of the mid level parts. When they want to do something a little more complicated (e.g. custom sampling) they can "plug in" a custom mid level component (e.g. implement ISamplingPipeline).

Direct Questions

I think I've probably already answered your questions, but to answer them directly:

why does such a lowest level primitive contain an executor (a mid-level class)?

As mentioned above, I consider the Conversation and the BatchedExecutor to be "mid level". It's very different from the other high level text -> text executors. I guess I was misleading when I said "lowest level", since there's always the raw llama.cpp API of course.

The Conversation and the BatchedExecutor are fundamentally linked. You cannot prompt or sample a batch (only a sequence) and you cannot run inference (llama_decode) on a sequence (only a batch).

Let's consider such a question: if we want to split the LLamaSharp package to a package with low-level APIs and a package with mid/high level APIs, in which package will you put the Conversation?

Certainly the mid/high level APIs. Low level should be (almost) a pure representation of the llama.cpp API. As a rough guide I'd say: if it's not something that gets modified in the monthly binary updates, it's not low level.

@AsakusaRinne
Collaborator Author

Thank you for your explanation.

Well, I agree to regard Conversation as mid-level, so I'll close this PR (but I still strongly prefer to separate them).

I hold a different definition of the low/mid level. In my opinion, the low-level package should contain only abstractions that depend entirely on llama.h.

My definition of low-level

Obviously, all the exported structs and exported native functions are low-level, without question.

Controversial members of low-level:

  • KV-Cache: it's nearly impossible for us to manage it with an external method; it must depend entirely on the llama.cpp exported APIs.
  • Sampling methods: unless the user wants to implement a sampling method from scratch in C# (like what Lyrcaxis tried on discord), sampling depends entirely on llama.cpp exported functions. The ISamplingPipeline, which is used in the executors, should be a mid-level interface.
  • Text decoder (optional) p.s. not a streaming one
  • State

Why sequence is not included

Because there is neither an exported struct nor a deterministic abstraction for it. The only thing related to this concept in llama.h is the seq_id. As long as we hold a unique sequence id, we could have a Sequence class with any kind of layout - for example, prompt string + output string + tokens + seq_id.
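
For illustration only (not what this PR actually ships), such a layout could be as simple as:

```csharp
using System.Collections.Generic;

// Illustrative only: one possible layout for a Sequence built around a llama.cpp seq_id.
public sealed class Sequence
{
    public int SeqId { get; }                  // the llama.cpp sequence id we hold
    public string Prompt { get; set; } = "";   // original prompt text
    public string Output { get; set; } = "";   // decoded output so far
    public List<int> Tokens { get; } = new();  // token ids fed/generated for this sequence

    public Sequence(int seqId) => SeqId = seqId;
}
```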

That's why I asked you about the level of Conversation, because I don't think it should appear in the low-level package!


Why I insist on separating Sequence from Conversation

In a word, the current design limits the ways users can use the mid-level APIs. Sequence, LLamaBatch, the executor and KV-cache management are all mixed together!

Sequence & LLamaBatch & Executor

Let's consider a case: I'm a user who wants to implement an LLM server, and I prioritize initial response latency. When the max sequence count hasn't been reached and a request comes in, I will stop the current batched generation and run the prefill instead, to minimize the initial response latency. To improve performance, I'll have two threads: one runs inference on the batch, and the other transforms new requests into a batch, so that inference can overlap with (prefill) batch preparation.

It's inconvenient to implement this with Conversation and BatchedExecutor due to the facts below. (Please correct me if I'm wrong.)

  1. All the conversations forked from the original one have the same executor.
  2. An executor has only one LLamaBatch.
  3. The sequence id is unique within an executor, not globally.

So, if I'm using multiple conversations and one executor, we cannot modify the batch during inference even though it's asynchronous (LLamaContext.DecodeAsync). If I'm using multiple conversations and two executors, id sharing becomes the problem.
Even if I handle that by manually setting the ids, there's still another problem: because of the "one BatchedExecutor, one LLamaBatch" design, I have to execute inference on the thread of the prefill executor. But what if another request comes in during the prefill inference? Now I have to wait until the previous prefill completes.

If we make the sequence and the batch independent of each other, and the batch independent of the executor, it would be much better in this case. We could create multiple batches in thread A and execute only one at a time in thread B (whether for prefill or generation). Then we could let executor.Infer accept a batch as input so that we can reuse the API.
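
A rough sketch of the shape I have in mind, with Batch and IBatchedInference as hypothetical stand-ins rather than the current LLamaSharp API:

```csharp
using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

// A batch is just data: tokens paired with the sequence they belong to.
public sealed class Batch
{
    public List<(int SeqId, int Token)> Items { get; } = new();
}

public interface IBatchedInference
{
    Task InferAsync(Batch batch);   // runs a single llama_decode over the given batch
}

public static class TwoThreadServer
{
    // Thread A: turn incoming requests into prefill batches.
    public static async Task PrepareAsync(ChannelReader<(int SeqId, int[] Tokens)> requests, ChannelWriter<Batch> batches)
    {
        await foreach (var (seqId, tokens) in requests.ReadAllAsync())
        {
            var batch = new Batch();
            foreach (var t in tokens) batch.Items.Add((seqId, t));
            await batches.WriteAsync(batch);
        }
    }

    // Thread B: the only place decoding runs, one batch at a time.
    public static async Task InferLoopAsync(ChannelReader<Batch> batches, IBatchedInference executor)
    {
        await foreach (var batch in batches.ReadAllAsync())
            await executor.InferAsync(batch);
    }
}
```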

Executor & KV-Cache management

A KvAccessor has a member of type Conversation, which in turn has a BatchedExecutor. I don't think that's reasonable, because a KvAccessor doesn't need to access an executor. The dependency should be reversed: it's the executor that needs to access a kv-cache manager.

Suggestions

Actually, I'm not against having such a "one batch - one executor" design. I can see the great effort made to improve performance. What I'm arguing is that Conversation should not be considered a Sequence suitable for all mid-level users. The sequence definition for a server, an individual desktop app and other apps can be very different.

Rather than separating a Sequence class from Conversation, I'd like to suggest the following changes (a rough skeleton is sketched after the list).

  1. Change Conversation to BatchedExecutor.Conversation (public).
  2. Change Conversation.KvAccessor to KvAccessor or BatchedExecutor.KvAccessor.
  3. Let BatchedExecutor hold multiple conversations.
  4. Let BatchedExecutor hold the KvAccessor and pass sequence id to it, instead of making it from a Conversation.
  5. Add BatchedExecutor.CreateConversation(), BatchedExecutor.GetConversations() and BatchedExecutor.AddConversation() to manage the conversations.
  6. Make BatchedExecutor.Infer accept a LLamaBatch or LLamaNativeBatch as input.

As for Sequence, we could leave it for users to define, or define it conservatively - for example, as only a seq id + tokens.
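
Put together, the skeleton implied by suggestions 1-6 would look roughly like this (hypothetical names, just to make the proposal concrete):

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Placeholder for whatever batch type Infer would accept (LLamaBatch / LLamaNativeBatch).
public sealed class SomeBatch { }

public sealed class BatchedExecutorSketch
{
    // Suggestions 1/2: Conversation and KvAccessor nested under the executor.
    public sealed class Conversation { public int SeqId { get; set; } }
    public sealed class KvAccessor   { public void RemoveSequence(int seqId) { /* kv ops keyed by sequence id */ } }

    private readonly List<Conversation> _conversations = new();
    private int _nextSeqId;

    // Suggestion 4: the executor owns the KvAccessor and passes sequence ids to it.
    public KvAccessor KvCache { get; } = new();

    // Suggestions 3/5: the executor holds and manages multiple conversations.
    public Conversation CreateConversation()
    {
        var c = new Conversation { SeqId = _nextSeqId++ };
        _conversations.Add(c);
        return c;
    }
    public IReadOnlyList<Conversation> GetConversations() => _conversations;
    public void AddConversation(Conversation conversation) => _conversations.Add(conversation);

    // Suggestion 6: inference takes a batch as input, so batches are not tied to the executor.
    public Task InferAsync(SomeBatch batch) => Task.CompletedTask;
}
```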

@martindevans
Member

Text decoder (optional) p.s. not a streaming one

This has come up a few times in recent discussions, but I don't think it is possible to have a non-streaming decoder - there is simply no correct way to do that! Non-streaming decoding is a bug in many other LLM libraries; LLamaSharp is one of the few to get it right!

The reason is that if you have a load of tokens [1, 2, 3, 4, 5], decoding them in chunks like [1, 2, 3] & [4, 5] may produce different text than decoding [1, 2, 3, 4, 5] in one go. So you either detokenize everything at once, or you have a streaming system.
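
To illustrate the streaming approach (just a sketch, not necessarily how StreamingTextDecoder is implemented): decode the full token list each time and emit only the new suffix, with care for prefix changes caused by partial characters.

```csharp
using System;
using System.Collections.Generic;

// Sketch: decode the whole token list every time and emit only the new suffix.
// `detokenize` is a hypothetical "decode all tokens in one go" function.
public sealed class SuffixStreamingDecoder
{
    private readonly Func<IReadOnlyList<int>, string> _detokenize;
    private readonly List<int> _tokens = new();
    private string _emitted = "";

    public SuffixStreamingDecoder(Func<IReadOnlyList<int>, string> detokenize) => _detokenize = detokenize;

    // Add one token and return only the newly produced text.
    public string Add(int token)
    {
        _tokens.Add(token);
        var full = _detokenize(_tokens);   // decoding [1..n] in one go keeps merges correct
        if (!full.StartsWith(_emitted, StringComparison.Ordinal))
        {
            // The decoded prefix changed (e.g. a partial multi-byte character completed);
            // a production decoder buffers bytes instead of re-emitting text.
            _emitted = full;
            return "";
        }
        var delta = full.Substring(_emitted.Length);
        _emitted = full;
        return delta;
    }
}
```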

The ISamplingPipeline, which is used in the executors, should be a mid-level interface.

I agree.

Example use case...

In this case I would say: if I'm understanding this example use-case correctly, it is not possible with llama.cpp! When you run llama_decode it overwrites buffers internal to llama.cpp (e.g. the logits output buffer, which is accessed through get_logits_ith). It is not safe to run two llama_decodes in parallel due to these singletons.

This is one of the main things BatchedExecutor is designed to solve - when you prompt a conversation it keeps track of that state and you cannot sample that conversation until llama_decode has been run exactly once. If you have just one BatchedExecutor this will never be an issue.

There's also a more fundamental issue with this example - it's actually slower! Since you cannot run two llama_decodes at once, your prefill thread will block all inference while it is running! The better way to do this would be to have multiple conversations within one batch and feed tokens to the prefill conversation in chunks (say 128 tokens at a time) so that inference for other conversations isn't too badly delayed.

if I'm using multiple conversations and one executor, we cannot modify the batch during inference even though it's asynchronous

Prompting conversations and adding to the batch while llama_decode is running should be safe (I haven't tested this, but if it isn't safe it's just a bug that needs fixing, not a fundamental limitation).

Global sequence IDs

Since it's not safe to run two llama_decode calls at once, it's also not really safe to have two BatchedExecutor instances at once. We could add internal global locks so that when you call BatchedExecutor.DecodeAsync it blocks until other batches are done, but I don't think that's really something we should encourage.

An KvAccessor has a member of type Conversation, which further has a BatchedExecutor.

I think we're probably thinking about slightly different things here.

The KvAccessor provides safe(ish) access to manipulate the KV cache for a single sequence. That's why it's part of the Conversation class (which represents a sequence). It's quite rare that you need access to the entire KV cache (e.g. to clear it); almost all access is for a single sequence (shift it, compress it with self-extend, etc.).

@AsakusaRinne
Collaborator Author

AsakusaRinne commented May 18, 2024

This has come up a few times in recent discussions but I don't think it is possible to have a non-streaming decoder

Well, sorry for the confusion. What I was talking about is adding a collection of decoding methods. It would deal with the encoding of words internally and provide some wrapping of the native decoding APIs. It's not streaming because it's in the low level, without any knowledge of the executor. In this way we could add an IStreamingDecoder in the mid level and make StreamingTextDecoder a default implementation of it.

It is not safe to run two llama_decodes in parallel due to these singletons.
Since it's not safe to be running 2 llama_decode calls at once it's also not really safe to have 2 BatchedExecutor instances at once.

You've got me wrong. :) Let me illustrate this example with a picture. In the picture, only thread2 has an executor. Thread1 is only responsible for request processing and data preparation.

[image: two-thread workflow diagram - thread1 prepares batches while thread2 runs the executor]

There are two batches in this workflow. At the beginning, batch2 is empty while batch1 has 8 prefilled sequences. Here's the timeline:

  • t1, t2: the executor is running the generation for batch1.
  • t3: while the generation is running in thread2, an incoming request arrives. Thread1 fills its tokens into batch2, creating a new sequence.
  • t4: the executor is running the prefill for batch2. Meanwhile, because the prefill stage doesn't produce any new tokens and there's only one sequence running the prefill, we can add the new sequence's info back to batch1 without waiting. Before t5, thread1 has already prepared batch1.
  • t5, t6: the executor is running the generation for batch1 again, but now with 9 sequences.

I don't want to introduce two executors, but if I'm going to use the current design of Conversation, I have to. I want two batches to exist at the same time, but the current design is "one batch - one executor". That's why I was arguing that the batch and the executor are mixed up!


it's actually slower! Since you cannot run two llama_decodes at once, your prefill thread will block all inference while it is running! The better way to do this would be to have multiple conversations within one batch and feed tokens to the prefill conversation in chunks (say 128 tokens at a time) so that inference for other conversations isn't too badly delayed.

I don't think so. Let's consider the following two cases - which one is faster?

  1. Wait 10 seconds for the first token and average 0.5 second for the following tokens.
  2. Wait 5 seconds for the first token and average 0.8 second for the following tokens.

It actually depends on the user's preference. Some services care more about initial response latency, while others care more about average service time. The example above is a strategy which minimizes the initial response latency regardless of the average service time.

The example is to indicate that users may have various ideas when they're using LLamaSharp, especially low/mid levels.

The current implementation of batched inference is certainly good, but not generalized enough. In the mid level, we should provide toolkits for users to realize their ideas easily, without unnecessary assumptions about the way they implement things.
Now back to my previous suggestion (Conversation -> BatchedExecutor.Conversation): I was saying that Conversation should be treated as a good implementation of batched inference, rather than a generalized abstraction of the sequence.

Prompting conversations and adding to the batch while llama_decode is running should be safe

I don't think so, if you mean the native batch that has already been fed into llama_decode. But in any case, this is not the point of this discussion. :)
Even if it can be safely modified, you're still adding a restriction which llama.cpp does not have. llama.cpp allows multiple batches because llama_batch is just a struct with some data. As long as the data is correctly organized, and no more than one llama_decode runs at the same time, it works.

@martindevans
Member

There's two batches in this workflow

If I'm understanding this example correctly I don't think two batches are necessary.

When DecodeAsync is called the LLamaBatch is copied into buffers for llama_decode to use internally, so it is safe to modify the LLamaBatch immediately.

So in your example:

  • Add prefill tokens to batch, start inference (T4)
  • While this is running, add tokens to batch for other sequences
  • Wait for prefill to finish...
  • As soon as it finishes, immediately start inference with whatever is in the batch (T5)

Hopefully I've understood that correctly :)

Thread safety: The LLamaBatch is not currently thread safe. It wouldn't be hard to make it safe if that's something we want.
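
In code, the pattern would look roughly like this; IBatch and IExecutor are stand-ins for the real types, and everything below stays on one thread since the LLamaBatch isn't thread safe:

```csharp
using System.Threading.Tasks;

public interface IBatch
{
    void Add(int seqId, int token);   // queue a token for a sequence
    void Clear();
}

public interface IExecutor
{
    IBatch Batch { get; }
    Task DecodeAsync();               // copies Batch into native buffers, then runs llama_decode
}

public static class OverlapExample
{
    public static async Task RunAsync(IExecutor executor, int prefillSeq, int[] prefillTokens, (int SeqId, int Token)[] pending)
    {
        foreach (var t in prefillTokens)
            executor.Batch.Add(prefillSeq, t);

        var decode = executor.DecodeAsync();    // T4: start prefill inference

        executor.Batch.Clear();                 // safe: the batch was already copied for the running decode
        foreach (var (seq, tok) in pending)     // queue tokens for the other sequences meanwhile
            executor.Batch.Add(seq, tok);

        await decode;                           // wait for the prefill to finish...
        await executor.DecodeAsync();           // T5: immediately infer whatever is queued
    }
}
```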

I don't think so if you mean the native batch that has already been fed into llama_decode

I've already addressed it above, but to answer it directly as well:

I meant the C# LLamaBatch, which is copied into a NativeLLamaBatch at the last moment just before llama_decode is called. It's definitely not safe to modify the NativeLLamaBatch!

it's actually slower!

Sorry, "slower" was the wrong word to use here, I should have said "lower throughput". The best overall throughput (average tokens across all sequences) you can get is to have one single batch and make it as large as possible. You're absolutely right about latency.

However, you don't need a second batch for that; instead it depends on how you feed the single batch. For example:

  • batch size of 512.
  • You want to prefill 1024 tokens into a sequence.
  • You also want to handle a single token in each of 10 other sequences.

There are two ways you could do this. First:

  • Add 512 prefill tokens.
  • Add 512 more prefill tokens (this can be done while previous inference is ongoing).
  • Finally feed 10 tokens to the 10 other sequences (this can be done while previous inference is ongoing).

This obviously has bad latency for the 10 sequences waiting for the next token to be generated.

Second:

  • Feed 502 prefill tokens + 10 other sequence tokens. Infer.
  • Those 10 sequences now have a response, and probably have 10 more tokens to feed already!
  • Feed 512 prefill tokens. Infer.
  • Feed final 10 prefill tokens. Infer.

This is the same throughput (batches of 512, 512 & 10), but obviously has much better latency for the 10 sequences and just slightly worse latency for the prefill sequence.
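
As a sketch, a variant of that second strategy (always reserving one slot per interactive sequence) could be scheduled like this - a hypothetical helper, not part of BatchedExecutor:

```csharp
using System;
using System.Collections.Generic;

public static class ChunkedPrefill
{
    // For each decode step, pair a chunk of prefill tokens with one token from
    // every interactive sequence, so no decode step starves the interactive conversations.
    public static IEnumerable<(int PrefillTokens, int SequenceTokens)> Plan(
        int prefillRemaining, int activeSequences, int batchSize)
    {
        while (prefillRemaining > 0)
        {
            var prefill = Math.Min(prefillRemaining, batchSize - activeSequences);
            yield return (prefill, activeSequences);
            prefillRemaining -= prefill;
        }
    }
}

// Plan(1024, 10, 512) yields (502, 10), (502, 10), (20, 10): the same total
// prefill work, but the 10 interactive sequences advance on every step.
```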

@AsakusaRinne
Collaborator Author

For the "latency & throughput" part I think we've reached an agreement. Yes, my example has worse throughput. The point I want to make behind this example is that our mid-level users may have various ideas, which requires us to make the APIs more generalized.

Hopefully I've understood that correctly

Yes, what you said matched this example well. :)

two batches are unnecessary...
The LLamaBatch is not currently thread safe...

I've managed to understand your words. What I meant in the data preparation block is preparing the native batch for the next inference. The implementation of LLamaBatch is very efficient, so the speed is completely fine by me. However, what I'm still concerned about is generalization.

Please consider the following variants of the previous example:

  1. The user may cancel the request while it's in the prefill stage.
  2. We've allowed embeddings in the batch, and the prefill request is an embedding (common in RAG).

Request cancellation case

We can abort the llama_decode by setting a callback via llama_set_abort_callback. This callback will be called before the actual computation is executed.
If we use the one-batch solution, we have to remove the information about the cancelled request from the LLamaBatch, which is a bit complex. With two batches, on the other hand, we can just drop the new batch.
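
To sketch the idea: llama_set_abort_callback is the llama.cpp function in question, but the P/Invoke declaration below is hand-written for illustration - check whether LLamaSharp's NativeApi already exposes it.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Threading;

public static class AbortExample
{
    // llama.cpp's abort callback returns true to abort the current compute.
    [UnmanagedFunctionPointer(CallingConvention.Cdecl)]
    [return: MarshalAs(UnmanagedType.I1)]
    public delegate bool GgmlAbortCallback(IntPtr userData);

    // Hand-written binding for illustration only.
    [DllImport("llama", CallingConvention = CallingConvention.Cdecl)]
    public static extern void llama_set_abort_callback(IntPtr ctx, GgmlAbortCallback callback, IntPtr userData);

    private static GgmlAbortCallback? _keepAlive;   // keep the delegate alive while native code holds it

    public static void HookCancellation(IntPtr ctx, CancellationToken token)
    {
        _keepAlive = _ => token.IsCancellationRequested;   // abort the decode when the request is cancelled
        llama_set_abort_callback(ctx, _keepAlive, IntPtr.Zero);
    }
}
```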

Embedding case

Let's assume a request with embeddings comes in while we are running the generation (tokens as input). We cannot add it to the original batch because if embd is set, the tokens will be ignored. Therefore we cannot run the prefill and the generation simultaneously. If we have two batches, this problem goes away.


I believe you can handle the two examples above within one LLamaBatch after some modifications. My central point is that Conversation should not be regarded as a Sequence as long as it contains an executor.

Sequence should be a pure data structure with no binding to the executor (or we just don't abstract it at all). The executor should accept a batch as input in BatchedExecutor.Infer. In this way I believe it will be more flexible for our mid-level users. I'm not asking to remove or refactor Conversation; what I'm arguing is that it should be treated as a good implementation of batched inference rather than a generalized abstraction of the sequence.
