Add multimodal support in prompt_template for easier prompting. #242

Open
brenkao opened this issue May 14, 2024 · 6 comments
Labels: Feature Request (New feature or request)

@brenkao (Collaborator) commented May 14, 2024

Description

The first thing that comes to mind is to add something like this:

prompt_template = """
SYSTEM: 
...

USER:
...

IMAGE:
{image_url}
brenkao added the Feature Request label May 14, 2024
@willbakst (Contributor)

Since gpt-4o is multi-modal with audio too (and it looks like other players are headed in this direction), it's likely worth also thinking about how to handle both images and audio.

@off6atomic (Contributor) commented May 15, 2024

Note that users can pass in several images, and images can be part of any user message, not just the final one.
A message's content can be either a string or a list.
If the user passes in only a string, assume it's just text. If the user passes in a list, assume it can contain text, images, audio, etc.
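For reference, this mirrors the raw OpenAI chat completions format (shown here only for context): content is either a plain string or a list of typed parts.

messages = [
    # string content: plain text only
    {"role": "user", "content": "Describe this image."},
    # list content: ordered text and image parts within one user message
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
        ],
    },
]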

I suggest something like this:

prompt_template = """
SYSTEM: 
...

USER:
<you need a way to pass a string or a list of multiple input types here>

ASSISTANT:
<AI can also output text, audio, images, etc. Just like in GPT-4o>

USER:
<you need a way to pass a string or a list of multiple input types here>
"""

History should also take this into account.

I'm not sure if this is the right abstraction. Maybe OpenAI's abstraction of representing chat history as a list is already good. Because this is a chat model, not an instruct model, if you want to utilize Python features you should model the prompt template as a list, not as a string.

It feels like we are repeating the mistake of LangChain. Forcing prompt_template to be a string introduces magic unnecessarily, doesn't it? (You are forced to come up with your own markup language, similar to YAML or TOML.)

@willbakst (Contributor)

I agree, which is why we originally opted for enabling writing the messages array directly.

We also enable the MESSAGES: keyword for injecting a list of messages as chat history. This will inject any messages as-is into the messages array and can be used wherever, and however many times, you want in the prompt template, just like other keywords.
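For context, usage of the MESSAGES: keyword looks roughly like this (a sketch based on the description above; the exact type of the history field is an assumption):

from mirascope.openai import OpenAICall

class ChatCall(OpenAICall):
    prompt_template = """
    SYSTEM: You are a helpful assistant.
    MESSAGES: {history}
    USER: {question}
    """

    history: list[dict] = []  # prior messages, injected as-is into the array
    question: str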

I do think there is a way to update the prompt template parser to provide good DX for the multi-modal case; however, as you mentioned I also don't think the solution is a self-defined language.

Just because a multi-modal user message has a content array doesn't necessarily mean that we also need an array in the prompt template via a custom language. In fact, I think there is potentially a rather nice way of writing multi-modal messages still as a single string. For example:

from mirascope.openai import OpenAICall, OpenAIImage

class MultiModalCall(OpenAICall):
    prompt_template = "Can you please describe this image? {image}"

    img_bytes: bytes

    @property
    def image(self) -> OpenAIImage:
        # The {image} template variable resolves to this computed property.
        return OpenAIImage(media_type="jpeg", bytes=self.img_bytes)

To me, this feels more natural as a transcript and how I would generally interact with the chat model anyway. Then, under the hood we can parse the user message into the correct content array if images are provided.
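Under that approach, the template above might parse into a content array like the following (an illustration of the intended output, not an implemented parser; OpenAI expects raw bytes as a base64 data URL):

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Can you please describe this image?"},
            {
                "type": "image_url",
                # img_bytes base64-encoded into a data URL
                "image_url": {"url": "data:image/jpeg;base64,<encoded img_bytes>"},
            },
        ],
    }
]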

Of course we would also want to ensure:

  1. Multiple images can be passed into a single user message, just like in the content array. Something like:
    prompt_template = "Image 1: {image1}, Image 2: {image2}"
  2. Additional convenience for passing in multiple images altogether (sketched below). Something like:
    prompt_template = "Images: {images}"
  3. Ability to pass in a URL instead of the bytes, for the additional convenience of not having to load the image manually.
  4. Similar convenience for audio files, now that it looks like providers are headed in that direction (re: gpt-4o and gemini).
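For item 2 in particular, the usage might look roughly like this (the field name and list expansion are assumptions for illustration, not an existing API):

class GalleryCall(OpenAICall):
    prompt_template = "Images: {images}"

    # Each entry in the list would expand to its own image content part;
    # per item 3, entries could alternatively be constructed from URLs.
    images: list[OpenAIImage]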

What do you think @off6atomic @brenkao?

@brenkao (Collaborator, Author) commented May 15, 2024

If this works across all our various providers, then I'm all for it.

@off6atomic (Contributor) commented May 16, 2024

@willbakst I think that's a better syntax for producing a list of inputs indeed. I totally missed that.

However, I still think this is a custom markup, which means it needs to be very easy for users to understand how it's parsed, and there should be a page that explicitly explains how the custom markup is parsed into the OpenAI format (or an internal Mirascope format).

I would suggest using this syntax in a way that tells the user it's simply being parsed into a list (and that users can control the order of the items in the list).

For example, if the user wants to pass [image, text, image], maybe we can allow something like this: "{image1} Please describe the difference between the left and right image {image2}"
If the user wants to pass two images without text, they just provide no text between those images, e.g. "{image1} {image2}"

Users should also be allowed to type the inputs across multiple lines, e.g.

"""
USER:
What is the following image?
{image}

How does it relate to the following audio and video?
{audio} {video}

I want you to describe the relationship in {style} tone.
"""

would be translated to [text, image, text, audio, video, text]

I think this provides a simple mental representation for users to understand the parser. It's just splitting the string by non-text inputs.

One thing we need to make clear to users is how we remove whitespace and newlines surrounding non-text inputs.
Should we remove all the whitespace and newlines surrounding non-text inputs?
I think if we do, it's simpler to understand, and most of the time users are simply going to put non-text inputs at the end of the message anyway.

Here is a typical use case:

"""
USER:
Please look at the cat and dog images and tell me which one is more cute.

{cat_image} {dog_image}
"""

Note that a None image should be allowed; in that case it simply does not create an item in the list we send to OpenAI.
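Putting the whole mental model together, a minimal sketch might look like this (an illustration only, not Mirascope's parser): split the user message on non-text placeholders, strip surrounding whitespace, and skip None attachments.

import re

def parse_user_content(template: str, attachments: dict) -> list[dict]:
    # `attachments` maps placeholder names (e.g. "image") to pre-built
    # content parts, or None. Text variables like {style} are assumed to
    # have been substituted before this step.
    if not attachments:
        return [{"type": "text", "text": template.strip()}]
    pattern = "(" + "|".join(re.escape("{" + name + "}") for name in attachments) + ")"
    parts: list[dict] = []
    for chunk in re.split(pattern, template):
        name = chunk[1:-1] if chunk.startswith("{") and chunk.endswith("}") else None
        if name in attachments:
            if attachments[name] is not None:  # a None image creates no item
                parts.append(attachments[name])
        else:
            text = chunk.strip()  # drop whitespace around non-text inputs
            if text:
                parts.append({"type": "text", "text": text})
    return parts

For the message above, this yields [text, image, text, audio, video, text]; for "{cat_image} {dog_image}" with no surrounding text, it yields just the two image parts.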

willbakst self-assigned this May 16, 2024
@willbakst (Contributor) commented May 16, 2024

@off6atomic 100%, everything you've described is pretty much exactly the behavior I would expect. The goal is for the parser to feel intuitive and behave how you would expect so it's "convenient" and not "magic" (but still feels like magic).

Of course, I totally agree that in order for this not to be "magic" we need to have extremely clear documentation. For the README examples, this will likely be simple comments plus examples of what the output messages will look like, so it stays succinct. In the concepts/writing_prompts.md docs page we should add a more detailed writeup of exactly what is happening under the hood so it's extremely clear to users. We can also mention in the README that, with this update, users should read the docs for more details.

How we handle the parser will need some more thought as we work towards implementing this feature and see what makes sense, both from an internal implementation perspective and from the external DX perspective. Mostly I want to make sure that any decisions we make for parsing image/audio prompts don't have unintended effects on other prompts.

I'm hoping to find some time soon to prioritize this feature now that we've got a good idea of the interface and DX.

willbakst added this to the v0.16 milestone Jun 1, 2024