feat: separate the sequence and conversation. #740
Conversation
I don't understand the purpose of this change. The idea of …
There's nothing related to this point in this PR. You could certainly do everything in …
My research doesn't cover all the libraries with batched-inference support, but the ones it does cover are all very popular.
From my point of view, I prefer to follow the design of these popular inference libraries. Don't get me wrong, I'm not saying we can't do well without referring to them. What I worry about is that it will cost us more time in the future when we want to adopt good features (mainly high-level) from other libraries, because of the gap. For example, when it comes to speculative decoding and fine-tuning, … Anyway, I won't strongly push for this to be merged since it doesn't block my work. I'll leave the decision to you.
I'm not suggesting adding anything extra to `Conversation`. The …
I disagree; I don't think anything can be removed from the `Conversation`. KV cache management is not really done in the `Conversation` …
Rather than saying … As you mentioned, the `Conversation` is intended as the lowest-level primitive representing a sequence of tokens in the KV cache. But why does such a low-level primitive contain an executor (a mid-level class)? Consider this question: if we were to split the LLamaSharp package into one package with low-level APIs and another with mid/high-level APIs, in which package would you put the `Conversation`?
Before answering your questions, here's the way I'm thinking about things:

Low Level
Just the functions and data structures that llama.cpp offers, sometimes with a small layer of safety, such as taking and returning spans instead of raw pointers and lengths. Examples of this level are everything in `LLama.Native`.

High Level
Things like our current executors, which simply take in text/images and output text. In the long term these should automatically handle templating, self extension, context shifting, token healing, etc. This level has two problems: …
Mid Level
If we're going to build new high-level executors in the future, we need a set of primitives to build them with that are slightly higher level than the raw llama.cpp API. Ideally this level should offer all of the power of the low level, with no exposed unsafe behaviour. Think of this as a toolkit that can be assembled into high-level systems. Examples: …
Puzzle Pieces
As you can see, I'm putting the batched executor in the "mid" level. It's not the pure llama.cpp API, so it doesn't fit in the low level. It's not a friendly, easy-to-use text-to-text system, so it doesn't fit in the high level. It's intended as one "part of the puzzle" for building high-level systems. All the "mid level" things I have been working on over the last few months nearly form a complete picture of the whole pipeline from text to text (see the mid-level examples). Each of these pieces encapsulates one part of the llama.cpp API in a totally safe way that loses no power.

Why?
The idea with splitting it up into pieces like this is that new users can start with the high-level executors, built out of the mid-level parts. When they want to do something a little more complicated (e.g. custom sampling) they can "plug in" a custom mid-level component (e.g. implement `ISamplingPipeline`); there's a sketch of this idea after the direct answers below.

Direct Questions
I think I've probably already answered your questions, but to answer them directly: …
As mentioned above, I consider the `Conversation` and the … The …
Certainly the mid/high level APIs. Low level should be (almost) a pure representation of the llama.cpp API. As a rough guide I'd say: if it's not something that gets modified in the monthly binary updates, it's not low level.
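To make the "plug in a mid-level component" idea concrete, here's a minimal sketch. LLamaSharp does have an `ISamplingPipeline` abstraction, but the signatures below are simplified stand-ins rather than the real API:

```csharp
using System;

// Simplified stand-ins, not the real LLamaSharp API: a mid-level
// "puzzle piece" (the sampler) that high-level executors are built
// from, and that users can replace wholesale.
interface ISamplingPipeline
{
    int Sample(float[] logits);
}

// Default mid-level component: greedy argmax sampling.
class GreedySampling : ISamplingPipeline
{
    public int Sample(float[] logits)
    {
        int best = 0;
        for (int i = 1; i < logits.Length; i++)
            if (logits[i] > logits[best]) best = i;
        return best;
    }
}

// A high-level executor doesn't own the sampling logic; it just drives
// whatever mid-level component it was given.
class TextExecutor
{
    private readonly ISamplingPipeline _sampler;

    public TextExecutor(ISamplingPipeline? sampler = null)
        => _sampler = sampler ?? new GreedySampling();

    // In a real executor this would follow a decode step; here it only
    // shows where the pluggable component slots in.
    public int NextToken(float[] logits) => _sampler.Sample(logits);
}
```

A user who wants custom sampling implements the interface and passes it in; nothing else in the high-level executor has to change.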
Thank you for your explanation. Well, I agree to regard … However, I hold a different definition of the low/mid level. In my opinion, the low-level package should contain only abstractions that depend entirely on llama.h.

My definition of low-level
Obviously, all the exported structs and exported native functions are low-level, no argument there. Controversial members of the low level: …
Why sequence is not included
Because there is neither an exported struct nor a deterministic abstraction for it. The only thing related to this concept in `llama.h` is … That's why I asked you about the level of …

Why I insist on separating `Sequence` and `Conversation`
…
This has come up a few times in recent discussions, but I don't think it is possible to have a non-streaming decoder; there is simply no correct way to do that! Non-streaming decoding is a bug in many other LLM libraries, and LLamaSharp is one of the few to get it right! The reason is that if you have a load of tokens …
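A minimal sketch of the underlying issue, assuming standard BPE tokenization: a token boundary can fall in the middle of a multi-byte UTF-8 character, so a stateless token-to-string conversion can emit garbage at the boundary. Here .NET's stateful `Decoder` carries the partial bytes between tokens (the byte values are illustrative):

```csharp
using System;
using System.Text;

class StreamingDetokenizer
{
    // Decoder keeps partial-character state between calls, which is
    // exactly what token-by-token conversion is missing.
    private readonly Decoder _decoder = Encoding.UTF8.GetDecoder();

    public string Add(byte[] tokenBytes)
    {
        var chars = new char[Encoding.UTF8.GetMaxCharCount(tokenBytes.Length)];
        int written = _decoder.GetChars(tokenBytes, 0, tokenBytes.Length, chars, 0);
        return new string(chars, 0, written); // only complete characters are emitted
    }
}

class Demo
{
    static void Main()
    {
        // "é" is 0xC3 0xA9 in UTF-8. Suppose tokenization split it across two tokens:
        var token1 = new byte[] { (byte)'c', (byte)'a', (byte)'f', 0xC3 }; // ends mid-character
        var token2 = new byte[] { 0xA9 };

        var streamer = new StreamingDetokenizer();
        Console.Write(streamer.Add(token1)); // prints "caf", holds the dangling 0xC3
        Console.Write(streamer.Add(token2)); // prints "é" once the character completes
    }
}
```

Any correct decoder has to hold that dangling byte somewhere, which is exactly the state a non-streaming API has nowhere to put.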
I agree. In this case I would say …
If I'm understanding this example use-case correctly, this is not possible with llama.cpp! When you run … This is one of the main things … There's also a more fundamental issue with this example: it's actually slower! Since you cannot be running 2 `llama_decode` calls …
Prompting conversations and adding to the batch while …
Since it's not safe to be running 2 …
I think we're probably thinking about slightly different things here. The …
Well, sorry for the confusion. What I was talking about is adding a collection of decoding methods. It would deal with the encoding of words internally and provide some wrapping of the native decoding APIs. It's not streaming, because it's in the low level and has no knowledge of the executor. In this way we could add an …
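If I read this right, the proposal is roughly the following shape (all names hypothetical; this is a sketch of the idea, not code from this PR):

```csharp
// All names hypothetical: a low-level helper collection that wraps the
// native decode call and handles token encoding internally, with no
// knowledge of any executor.
interface INativeContext
{
    int[] Tokenize(string text);          // stand-in for the native tokenizer
    void Decode(int[] tokens, int seqId); // stand-in for filling a batch + llama_decode
}

static class NativeDecodingExtensions
{
    // One-shot, non-streaming: tokenize the text and decode it into
    // sequence `seqId` in a single call.
    public static void DecodeText(this INativeContext ctx, string text, int seqId)
        => ctx.Decode(ctx.Tokenize(text), seqId);
}
```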
You've got me wrong. :) Let me illustrate this example with a picture. In the picture, only thread2 has an executor; thread1 is only responsible for request processing and data preparation. There are two batches in this workflow. At the beginning, batch2 is empty while batch1 has 8 prefilled sequences. Here's the timeline: …
I don't want to introduce two executors, but if I'm going to use the current design of …
I don't think so. Let's consider the following two cases; which one do you think is faster? …
It actually depends on the user's preference. Some services care more about initial response latency, while others care more about average service time. The example above is a strategy which minimizes initial response latency regardless of average service time. The point of the example is that users may have various ideas when they're using LLamaSharp, especially at the low/mid levels. The current implementation of batched inference is certainly good, but not generalized enough. At the mid level, we should provide toolkits for users to realize their ideas easily, without unnecessary assumptions about how they implement things.
I don't think so, if you mean the native batch that has already been fed into `llama_decode`.
If I'm understanding this example correctly, I don't think two batches are necessary. When … So in your example: …
Hopefully I've understood that correctly :) Thread safety: the …
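To illustrate the scheduling I'm describing, here's a toy model with hypothetical types. Prompt() only queues tokens; each Infer() drains whatever has been queued so far into one batch and decodes it, so anything prompted while a decode is in flight simply lands in the next batch (the real `BatchedExecutor` makes no thread-safety promise; the concurrent queue just keeps the toy self-contained):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

class MiniBatchedExecutor
{
    private readonly ConcurrentQueue<(int SeqId, int Token)> _pending = new();

    // Queue tokens for a sequence; nothing is decoded yet.
    public void Prompt(int seqId, IEnumerable<int> tokens)
    {
        foreach (var t in tokens)
            _pending.Enqueue((seqId, t));
    }

    public Task Infer()
    {
        // Snapshot the queue into the batch for this decode call. Tokens
        // queued after this point wait for the next Infer().
        var batch = new List<(int SeqId, int Token)>();
        while (_pending.TryDequeue(out var item))
            batch.Add(item);

        // Stand-in for the single llama_decode call over `batch`.
        return Task.Run(() => Console.WriteLine($"decoding batch of {batch.Count} tokens"));
    }
}
```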
I've already addressed it above, but to answer it directly as well: I meant the C# `LLamaBatch` …
Sorry, "slower" was the wrong word to use here, I should have said "lower throughput". The best overall throughput (average tokens across all sequences) you can get is to have one single batch and make it as large as possible. You're absolutely right about latency. However you don't need a second batch for that though, instead it depends on how you feed the single batch. For example:
There are two ways you could do this. First: …
This obviously has bad latency for the 10 sequences waiting for the next token to be generated. Second: …
This is the same throughput (batches of 512, 512 & 10), but obviously has much better latency for the 10 sequences and just slightly worse latency for the prefill sequence.
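The exact batch compositions aren't spelled out above, so the numbers here are one plausible reading: a 512-token batch cap, a 1024-token prompt to prefill, and 10 sequences each needing one generation token per step. The toy below just prints the two feeding orders:

```csharp
using System;
using System.Collections.Generic;

class FeedingStrategies
{
    const int BatchCap = 512;

    // First way: finish the whole prefill before serving generation tokens.
    // Batches: [512 prompt], [512 prompt], [10 gen].
    // The 10 sequences wait through two full decodes before advancing.
    static IEnumerable<string> PrefillFirst(int promptTokens, int genSeqs)
    {
        while (promptTokens > 0)
        {
            int n = Math.Min(BatchCap, promptTokens);
            promptTokens -= n;
            yield return $"{n} prompt";
        }
        yield return $"{genSeqs} gen";
    }

    // Second way: reserve room for the generation tokens in every batch.
    // Batches: [10 gen + 502 prompt], [10 gen + 502 prompt], [10 gen + 20 prompt].
    // Same total work, but the 10 sequences advance on every decode.
    static IEnumerable<string> Interleaved(int promptTokens, int genSeqs)
    {
        while (promptTokens > 0)
        {
            int n = Math.Min(BatchCap - genSeqs, promptTokens);
            promptTokens -= n;
            yield return $"{genSeqs} gen + {n} prompt";
        }
    }

    static void Main()
    {
        Console.WriteLine(string.Join(", ", PrefillFirst(1024, 10)));
        Console.WriteLine(string.Join(", ", Interleaved(1024, 10)));
    }
}
```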
For the "latency & throughput" part I think we've reached an agreement. Yes, my example has worse throughput. The point I want to make behind this example is that our mid-level users may have various ideas, which requires us to make the APIs more generalized.
Yes, what you said matches this example well. :)
I've managed to understand your words. What I meant in the … Please consider the following variants of the previous example:
Request cancellation case
We can abort the …

Embedding case
Let's assume there's a request with embeddings coming in while we are running generation (tokens as input). We cannot add it to the original batch, because if the …

I believe you can handle the two examples above within one …
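For concreteness, here's a hypothetical sketch of how both cases could live inside a single executor: cancellation drops a sequence's queued inputs before they are decoded, and embedding inputs get their own batch, since a native `llama_batch` carries either tokens or embeddings, never both:

```csharp
using System.Collections.Generic;

// Hypothetical sketch, not LLamaSharp code.
class MixedInputExecutor
{
    private readonly List<(int SeqId, int Token)> _tokenQueue = new();
    private readonly List<(int SeqId, float[] Embedding)> _embdQueue = new();

    public void Prompt(int seqId, int token) => _tokenQueue.Add((seqId, token));
    public void PromptEmbedding(int seqId, float[] embedding) => _embdQueue.Add((seqId, embedding));

    // Request cancellation: anything still queued for this sequence is
    // dropped, so no decode work is wasted on an aborted request.
    public void Cancel(int seqId)
    {
        _tokenQueue.RemoveAll(x => x.SeqId == seqId);
        _embdQueue.RemoveAll(x => x.SeqId == seqId);
    }

    public void Infer()
    {
        // Two separate decode calls per step, one per input kind.
        if (_tokenQueue.Count > 0) { /* fill token batch, call decode */ _tokenQueue.Clear(); }
        if (_embdQueue.Count > 0) { /* fill embedding batch, call decode */ _embdQueue.Clear(); }
    }
}
```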
This PR separates `Sequence` from `Conversation`. It introduces no change from the user's view; it only changes the internal implementation.

In the llama.cpp APIs, the sequence is only a concept. There isn't any struct named Sequence, nor any method named sequence. The only major binding struct related to batched inference is `llama_batch`. Therefore I believe we should not make `Sequence` something that directly interops with llama.cpp native APIs. Instead, it's better to consider it a container for the data and status during inference. That way it will be more extensible; for example, we could switch a sequence in the conversation.

In this PR, to avoid breaking changes, the ids of `Sequence` and `Conversation` are both the llama.cpp `seq_id`. In the future it would be better to separate them.
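A rough sketch of the intended shape, with hypothetical member names (the PR's actual code may differ):

```csharp
using System.Collections.Generic;

// Hypothetical sketch of the proposed split: Sequence is a plain
// data/status container with no native interop, while Conversation owns
// the llama.cpp seq_id and the executor plumbing.
class Sequence
{
    public List<int> Tokens { get; } = new();   // tokens fed so far
    public int EndPos { get; set; }             // current end position in the KV cache
    public bool RequiresInference { get; set; } // status flags live here, not in native code
}

class Conversation
{
    public int SeqId { get; }                   // the native llama.cpp seq_id
    public Sequence Sequence { get; private set; }

    public Conversation(int seqId, Sequence sequence)
    {
        SeqId = seqId;
        Sequence = sequence;
    }

    // Because Sequence is just data, it can be swapped without touching
    // any native interop, for example to restore a saved state into an
    // existing conversation slot.
    public void SwitchSequence(Sequence replacement) => Sequence = replacement;
}
```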