-
Notifications
You must be signed in to change notification settings - Fork 179
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENHANCEMENT] Support for prefixing and suffixing of text #585
Comments
@OwenPendrighElliott thanks for this! Is it possible to add this to the model registry, so that it is hidden from the main add_docs/search workflows? Then, we can override these values by user specified values if needed? |
Yep, we could combine the two solutions I proposed. Have it as part of index defaults and have an override in How did you envisage the override? I think adding parameters to the |
are there any models that expect a suffix? if not then we can leave it out of scope until a need arises. Yep I think the overrides as add_documents or search parameters, as you suggested, is fine |
Not that I am aware of, lets leave it out. |
ProposalFor models in the model registryThe default text_query_prefix and text_chunk_prefix are defined in the model_registry entry, and so are not immediately visible to non-power users:
Because this is added to the registry, index settings will not change::
By default, you do not need to specify the prefixes:
But you can override the default prefixes:
For generic modelsIf you really want to change the defaults prefixes on a registry model, you should load it using the generic model method. Index settings:
Other behaviour is the same as registry-defined models:
|
Update to response: remove prefix from search responses. Instead a dedicated field is used to communicate this.
Edge cases:
|
Decision:
|
@pandu-k clarifying your last comment: We will no longer be allowing overriding of prefixes at index creation time? Or at add docs/search time? Or both? |
Another issue: it feels unintuitive that in the response, the field name for a doc chunk prefix is |
Under the edge case:
This means defined in the model registry right? In this case, if the model registry explicitly defines the prefix as empty string |
Clarifications:
|
Is your feature request related to a problem? Please describe.
Many new SOTA models for retrieval are trained with prefixes on the queries and documents, thus they expect these at inference as well. Failure to provide these degrades performance. Currently there is no way in Marqo to add a prefix or suffix to every chunk in a document as text pre-processing occurs internally.
I don't know of any models with a requirement for a specific suffix so the suffixing part can be removed if we don't want to pre-empt a use case that doesn't exist yet.
Describe the solution you'd like
New optional parameters are added to
search
andadd_documents
to enable prefixes like so:When the text is chunked internally, each chunk should have the prefix added to it so that it can be tokenised and passed to the model to match the prefixing applied in training.
For queries the prefix is added to the front of every query that isn't a URL. It should be added to the front of every query in a weighted query.
Describe alternatives you've considered
These could be part of the index defaults however if a model appears in the future with different prefixes for different sub tasks (or something similar) then this won't be a good approach because it locks you in. One nice thing though is that this approach means users won't forget to add them and the correct prefix can me included in the default models.
Additional context
Represent this sentence for searching relevant passages:
as a prefix https://github.com/FlagOpen/FlagEmbeddingquery:
in front of queries andpassage:
in front of passages https://huggingface.co/intfloat/e5-largequery:
in front of queries andpassage:
in front of passages https://huggingface.co/intfloat/e5-large-v2The text was updated successfully, but these errors were encountered: