Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENHANCEMENT] Support for prefixing and suffixing of text #585

Open
OwenPendrighElliott opened this issue Aug 24, 2023 · 11 comments
Open

[ENHANCEMENT] Support for prefixing and suffixing of text #585

OwenPendrighElliott opened this issue Aug 24, 2023 · 11 comments
Assignees
Labels
enhancement New feature or request

Comments

@OwenPendrighElliott
Copy link
Contributor

Is your feature request related to a problem? Please describe.
Many new SOTA models for retrieval are trained with prefixes on the queries and documents, thus they expect these at inference as well. Failure to provide these degrades performance. Currently there is no way in Marqo to add a prefix or suffix to every chunk in a document as text pre-processing occurs internally.

I don't know of any models with a requirement for a specific suffix so the suffixing part can be removed if we don't want to pre-empt a use case that doesn't exist yet.

Describe the solution you'd like
New optional parameters are added to search and add_documents to enable prefixes like so:

import marqo

mq = marqo.Client()

mq.index("my-index").add_documents(
    [{"_id": 1, "text": "lorum ipsum"}],
    text_chunk_prefix="passage: ",
    text_chunk_suffix="",
    tensor_fields=["text"]
)

mq.index("my-index").search(
    "lorum ipsum",
    query_prefix="query: ",
    query_suffix=""
)

mq.index("my-index").search(
    {"lorum": 1.0, "ipsum": 0.2},
    query_prefix="query: ",
    query_suffix=""
)

When the text is chunked internally, each chunk should have the prefix added to it so that it can be tokenised and passed to the model to match the prefixing applied in training.

For queries the prefix is added to the front of every query that isn't a URL. It should be added to the front of every query in a weighted query.

Describe alternatives you've considered
These could be part of the index defaults however if a model appears in the future with different prefixes for different sub tasks (or something similar) then this won't be a good approach because it locks you in. One nice thing though is that this approach means users won't forget to add them and the correct prefix can me included in the default models.

settings = {
    "index_defaults": {
        "treat_urls_and_pointers_as_images": False, 
        "text_preprocessing": {
            "split_method": "sentence", 
            "split_length": 3, 
            "split_overlap": 1,
        }, 
        "model_properties": {
            "name": "intfloat/e5-base-v2", 
            "type": "hf", 
            "dimensions": 768,
            "query_prefix": "query: ",
            "query_suffix": "",
            "text_chunk_prefix": "passage: ",
            "text_chunk_suffix": ""
        }, 
        "model": "hf-e5-base-v2", 
        "normalize_embeddings": True, 
    },
}

Additional context

@pandu-k
Copy link
Collaborator

pandu-k commented Aug 24, 2023

@OwenPendrighElliott thanks for this!

Is it possible to add this to the model registry, so that it is hidden from the main add_docs/search workflows? Then, we can override these values by user specified values if needed?

@OwenPendrighElliott
Copy link
Contributor Author

Yep, we could combine the two solutions I proposed. Have it as part of index defaults and have an override in add_documents and search. This would mean that models from the registry could have them provided out of the box, users could set them for the index with custom models, and you could override them in the code if you wanted.

How did you envisage the override? I think adding parameters to the add_documents and search is sensible. Perhaps we start with just the prefix and drop the suffix for the moment?

@pandu-k
Copy link
Collaborator

pandu-k commented Aug 24, 2023

are there any models that expect a suffix? if not then we can leave it out of scope until a need arises.

Yep I think the overrides as add_documents or search parameters, as you suggested, is fine

@OwenPendrighElliott
Copy link
Contributor Author

Not that I am aware of, lets leave it out.

@pandu-k
Copy link
Collaborator

pandu-k commented Aug 24, 2023

Proposal

For models in the model registry

The default text_query_prefix and text_chunk_prefix are defined in the model_registry entry, and so are not immediately visible to non-power users:

'SOME/MODEL':  {
    "name": "SOME/MODEL",
    "dimensions": 512,
    "type": "clip",
    "text_query_prefix": "query:",
    "text_chunk_prefix": "passage:"
 }

Because this is added to the registry, index settings will not change::

{
    "index_defaults": {
        "treat_urls_and_pointers_as_images": True,
        "model": "SOME/MODEL",
        "normalize_embeddings": true
        "text_preprocessing": {
        
        }
    },
    
}

By default, you do not need to specify the prefixes:

# add docs:
my_index.add_documents([
        {"title": "Cool books" }
    ],
    tensor_fields = ["title"]
)

# search:
search_res = my_index.search(
    "interesting books"
)
pprint(search_res)

# search response: 
{
  "hits": [
      {"title": "Cool book" },
      "_id": "article_591",
      "_score": 1.2387788,

      # notice the text_chunk_prefix is present in highlight:
      "_highlights": {"title": "passage: Cool book"}
    }],
   "limit": 10,
   "offset": 0,
   "processingTimeMs": 49,

   # notice the  text_query_prefix in the query:
   "query": "query: What is the best outfit to wear on the moon?"
}

But you can override the default prefixes:

# add docs:
my_index.add_documents([
        {"title": "Cool books" }
    ],
    tensor_fields = ["title"],
    text_chunk_prefix="this is a passage: ",
)

# search:
search_res = my_index.search(
    q="interesting books",
    text_query_prefix="This is a question: "
)
pprint(search_res)

# search response: 
{
  "hits": [
      {"title": "Cool book" },
      "_id": "article_591",
      "_score": 1.2387788,

      # notice the text_chunk_prefix is present in highlight:
      "_highlights": {"title": "this is a passage: Cool book"}
    }],
   "limit": 10,
   "offset": 0,
   "processingTimeMs": 49,

   # notice the  text_query_prefix in the query:
   "query": "This is a question: What is the best outfit to wear on the moon?"
}

For generic models

If you really want to change the defaults prefixes on a registry model, you should load it using the generic model method.

Index settings:

{
    "index_defaults": {
        "treat_urls_and_pointers_as_images": true,
        "model": "my-model-alias",
        "model_properties": {
            "name": "SOME/MODEL",
            "dimensions": 512,
            "type": "clip",
            "text_query_prefix": "BEGIN QUERY:",
            "text_chunk_prefix": "BEGIN PASSAGE:"
        },
        "normalize_embeddings": True,
    }
}

Other behaviour is the same as registry-defined models:

  • You can override these prefixes at add documents and search time using text_query_prefix and text_chunk_prefix
  • These prefixes are visible in highlights and the search response object

@pandu-k
Copy link
Collaborator

pandu-k commented Aug 25, 2023

Update to response: remove prefix from search responses. Instead a dedicated field is used to communicate this.

# add docs:
my_index.add_documents([
        {"title": "Cool books" }
    ],
    tensor_fields = ["title"]
)

# search:
search_res = my_index.search(
    "interesting books"
)
pprint(search_res)

# search response: 
{
  "hits": [
      {"title": "Cool book" },
      "_id": "article_591",
      "_score": 1.2387788,
      "_text_chunk_prefix": "passage: ",
      "_highlights": {"Cool book"}
    }],
   "limit": 10,
   "offset": 0,
   "processingTimeMs": 49,
   "textQueryPrefix": "query: ",
   "query": "What is the best outfit to wear on the moon?"
}

Edge cases:

  • If the model is specified with a non-empty string prefix, then the customer overrides it with an empty string, the prefix key will still end up in the results as a
  • If these prefixes are defined as empty strings, they will not be returned in the response
  • how does Marqo know what the prefix was for a document, if it is added at add_docs time? We can't really parse it out

@pandu-k
Copy link
Collaborator

pandu-k commented Aug 25, 2023

Decision:

  • No overriding of prefixes

@vicilliar
Copy link
Contributor

@pandu-k clarifying your last comment: We will no longer be allowing overriding of prefixes at index creation time? Or at add docs/search time? Or both?

@vicilliar
Copy link
Contributor

Another issue: it feels unintuitive that in the response, the field name for a doc chunk prefix is _text_chunk_prefix, but the name for search prefix is textQueryPrefix, especially since they're worded in the same way and users will always use snake case to define their prefixes.

@vicilliar
Copy link
Contributor

Under the edge case:

If these prefixes are defined as empty strings, they will not be returned in the response

This means defined in the model registry right? In this case, if the model registry explicitly defines the prefix as empty string "", I feel it should be returned in the response as "".

@vicilliar
Copy link
Contributor

Clarifications:

  • One argument for adding prefix to index settings instead of model registry properties: could a model expect 2 different prefixes? for different situations?
  • Why don’t we put prefix in the text_preprocessing portion of the index settings object?
  • To override default prefix for a registry model , the proposal mentions to load it using the generic model method (with model properties). Does this mean you would always have to provide url or localpath? Feels unnecessary for a user to have to define all other mode properties just to change the default prefix.
  • If a model in registry does NOT have prefix set, it should be fine, and ignored. Should be the same if prefix is “”.
  • Possible idea: add a hidden field in each document __text_chunk_prefix so that it can be parsed out upon search (from the highlights)
  • Multimodal: Do we add prefix to text in multimodal fields?
  • We only add search prefix in LEXICAL search, right? Do we omit prefix from the search response object if lexical also?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants