qweight.matches(LeafReaderContext ctx, int doc) can be prohibitively slow for large TermInSet queries #13391

Open
dweiss opened this issue May 20, 2024 · 1 comment

dweiss commented May 20, 2024

Description

I stumbled across this one in a real-life application, where matches-API based highlighting of a query like this:

field:(a OR b OR c OR d OR ...)

took a very long time to complete, even though query execution itself is blazing fast. The reason is (I think!) how MultiTermQuery handles matches: AbstractMultiTermQueryConstantScoreWrapper returns a disjunction of iterators built from a terms enum:

    @Override
    public Matches matches(LeafReaderContext context, int doc) throws IOException {
      final Terms terms = context.reader().terms(q.field);
      if (terms == null) {
        return null;
      }
      return MatchesUtils.forField(
          q.field,
          () ->
              DisjunctionMatchesIterator.fromTermsEnum(
                  context, doc, q, q.field, q.getTermsEnum(terms)));
    }

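but for a large set of alternatives, the loop inside fromTermsEnum can take a long time. For each term it seeks the terms enum and advances a postings enum to doc, and it stops only at the first term that actually occurs in that document: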

  static MatchesIterator fromTermsEnum(
      LeafReaderContext context, int doc, Query query, String field, BytesRefIterator terms)
      throws IOException {
    Objects.requireNonNull(field);
    Terms t = Terms.getTerms(context.reader(), field);
    TermsEnum te = t.iterator();
    PostingsEnum reuse = null;
    for (BytesRef term = terms.next(); term != null; term = terms.next()) {
      if (te.seekExact(term)) {
        PostingsEnum pe = te.postings(reuse, PostingsEnum.OFFSETS);
        if (pe.advance(doc) == doc) {
          return new TermsEnumDisjunctionMatchesIterator(
              new TermMatchesIterator(query, pe), terms, te, doc, query);
        } else {
          reuse = pe;
        }
      }
    }
    return null;
  }
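
To make the cost concrete, here is a minimal stand-alone model of that loop. It does not use Lucene at all: the class name ScanCostModel, the method probesUntilMatch, and the representation of postings as sorted int arrays are all assumptions made for illustration. It only counts how many terms get probed for a single document, which is the part that grows with the size of the term set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the scan in fromTermsEnum; not Lucene code.
final class ScanCostModel {

  /**
   * Probes terms in order and returns how many were probed before finding one
   * whose (sorted) postings contain doc. If no term matches, every term is probed.
   */
  static int probesUntilMatch(List<int[]> postingsPerTerm, int doc) {
    int probes = 0;
    for (int[] postings : postingsPerTerm) {
      probes++; // stands in for one seekExact + postings + advance
      if (Arrays.binarySearch(postings, doc) >= 0) {
        return probes;
      }
    }
    return probes;
  }

  public static void main(String[] args) {
    int termCount = 10_000;
    List<int[]> postings = new ArrayList<>();
    for (int t = 0; t < termCount; t++) {
      postings.add(new int[] {t}); // term t occurs only in document t
    }
    // Document matched by the first term: a single probe.
    System.out.println(probesUntilMatch(postings, 0)); // prints 1
    // Document matched only by the last term: every term is probed.
    System.out.println(probesUntilMatch(postings, termCount - 1)); // prints 10000
  }
}
```

In this model the work per document is linear in the number of terms that precede the first matching one, and that work is repeated for every document passed to matches. The real loop additionally pays for a term dictionary seek and a postings advance on each probe.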

I have no idea what the fix could be here; I'm just recording the problem before I forget it.

Version and environment details

No response


dweiss commented May 20, 2024

Perhaps this wasn't clear - the important bit here is the use of TermInSetQuery (the query parser substitutes this type of query for large boolean expressions to prevent max-boolean-clauses-exceeded errors).
