
Rescoring phase adds unexplained latency to queries #108568

Open
nemphys opened this issue May 13, 2024 · 16 comments
Labels
>bug :Search/Ranking Scoring, rescoring, rank evaluation. Team:Search Meta label for search team v8.13.3

Comments


nemphys commented May 13, 2024

Elasticsearch Version

8.13.3

Installed Plugins

analysis-icu

Java Version

bundled

OS Version

macOS Sonoma

Problem Description

I have created a custom rescorer, following the examples of QueryRescorer / ExampleRescorer.

It works fine, but the latency it adds to responses is much higher than it should be.

I am using a test query, which consistently takes around 40ms to execute (without rescoring).

When I add my custom rescorer, the response times shoot up to ~130ms.

I have profiled the rescore() method of my custom class and it consistently executes within 30-35ms. I would expect the total query time to be close to the sum of the plain query and the rescoring, i.e. 70-75ms.

What could be causing the extra overhead? I looked through the rescoring-related code, but could not find anything suspicious.
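
For context, the rescorer is structured roughly like the minimal sketch below, assuming the org.elasticsearch.search.rescore.Rescorer interface that the example-rescore plugin implements (the class name and the actual scoring logic are placeholders):

import java.io.IOException;

import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.elasticsearch.search.rescore.RescoreContext;
import org.elasticsearch.search.rescore.Rescorer;

// Skeleton of the custom rescorer; the real implementation reads doc_values
// and recomputes the score of every hit in topDocs.scoreDocs.
public class CustomRescorer implements Rescorer {

    @Override
    public TopDocs rescore(TopDocs topDocs, IndexSearcher searcher, RescoreContext rescoreContext) throws IOException {
        // Custom scoring over topDocs.scoreDocs goes here; returning topDocs
        // unchanged would make this a no-op.
        return topDocs;
    }

    @Override
    public Explanation explain(int topLevelDocId, IndexSearcher searcher, RescoreContext rescoreContext,
                               Explanation sourceExplanation) throws IOException {
        return sourceExplanation;
    }
}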

Steps to Reproduce

It is not possible to provide a reproducible example, since it depends on a custom plugin.

Logs (if relevant)

No response

@nemphys added the >bug and needs:triage labels on May 13, 2024
@Mikep86 added the :Search/Ranking and v8.13.3 labels and removed the needs:triage label on May 13, 2024
@elasticsearchmachine added the Team:Search label on May 13, 2024
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-search (Team:Search)

@benwtrent (Member)

@nemphys does the Elasticsearch profiling output show a similar distribution of runtime? The absolute numbers with profile: true aren't very useful, but it should give us a ballpark estimate of how long each item is taking.

How are you profiling the rescore step? Do you utilize various fields stored for each doc for your rescoring?

I wonder a couple of things:

  • Could the time be spent gathering fields from the docs for your rescorer to run?
  • Is this difference due to single-threaded vs. multi-threaded execution? (Queries run in multiple threads over segments in 8.13, but rescoring is still single-threaded.)


nemphys commented May 14, 2024

@benwtrent I already checked the profile output, but I could not find anything useful:

"profile": {
    "shards": [
      {
        "id": "[XuLaRqMuRY-EAP8hmEhcbA][xxxxxxxx-daddu][0]",
        "node_id": "XuLaRqMuRY-EAP8hmEhcbA",
        "shard_id": 0,
        "index": "xxxxxxxx-daddu",
        "cluster": "(local)",
        "searches": [
          {
            "query": [
              {
                "type": "BooleanQuery",
                "description": "..................................",
                "time_in_nanos": 49346732,
                "breakdown": {
                  "set_min_competitive_score_count": 0,
                  "match_count": 5252,
                  "shallow_advance_count": 0,
                  "set_min_competitive_score": 0,
                  "next_doc": 5677266,
                  "match": 933039,
                  "next_doc_count": 5478,
                  "score_count": 5252,
                  "compute_max_score_count": 0,
                  "compute_max_score": 0,
                  "advance": 0,
                  "advance_count": 0,
                  "count_weight_count": 0,
                  "score": 13122807,
                  "build_scorer_count": 45,
                  "create_weight": 27145916,
                  "shallow_advance": 0,
                  "count_weight": 0,
                  "create_weight_count": 1,
                  "build_scorer": 2467704
                },
                "children": [
                  {
                    "type": "BooleanQuery",
                    "description": ".........................................",
                    "time_in_nanos": 46530663,
                    "breakdown": {
                      "set_min_competitive_score_count": 0,
                      "match_count": 5252,
                      "shallow_advance_count": 0,
                      "set_min_competitive_score": 0,
                      "next_doc": 0,
                      "match": 592239,
                      "next_doc_count": 0,
                      "score_count": 5252,
                      "compute_max_score_count": 0,
                      "compute_max_score": 0,
                      "advance": 4110969,
                      "advance_count": 5478,
                      "count_weight_count": 0,
                      "score": 12736165,
                      "build_scorer_count": 67,
                      "create_weight": 27041625,
                      "shallow_advance": 0,
                      "count_weight": 0,
                      "create_weight_count": 1,
                      "build_scorer": 2049665
                    },
                    "children": [
                      {
                        "type": "DisjunctionMaxQuery",
                        "description": ".....................................",
                        "time_in_nanos": 44293861,
                        "breakdown": {
                          "set_min_competitive_score_count": 0,
                          "match_count": 0,
                          "shallow_advance_count": 0,
                          "set_min_competitive_score": 0,
                          "next_doc": 0,
                          "match": 0,
                          "next_doc_count": 0,
                          "score_count": 5252,
                          "compute_max_score_count": 0,
                          "compute_max_score": 0,
                          "advance": 3776560,
                          "advance_count": 5478,
                          "count_weight_count": 0,
                          "score": 11940428,
                          "build_scorer_count": 67,
                          "create_weight": 26996250,
                          "shallow_advance": 0,
                          "count_weight": 0,
                          "create_weight_count": 1,
                          "build_scorer": 1580623
                        },
                        "children": [
                          {
                            "type": "CombinedFieldQuery",
                            "description": "..................................",
                            "time_in_nanos": 13005870,
                            "breakdown": {
                              "set_min_competitive_score_count": 0,
                              "match_count": 0,
                              "shallow_advance_count": 22,
                              "set_min_competitive_score": 0,
                              "next_doc": 0,
                              "match": 0,
                              "next_doc_count": 0,
                              "score_count": 5238,
                              "compute_max_score_count": 85,
                              "compute_max_score": 4609,
                              "advance": 692207,
                              "advance_count": 5464,
                              "count_weight_count": 0,
                              "score": 3013678,
                              "build_scorer_count": 67,
                              "create_weight": 8897500,
                              "shallow_advance": 3251,
                              "count_weight": 0,
                              "create_weight_count": 1,
                              "build_scorer": 394625
                            }
                          },
                          {
                            "type": "BoostQuery",
                            "description": "..................................",
                            "time_in_nanos": 1701103,
                            "breakdown": {
                              "set_min_competitive_score_count": 0,
                              "match_count": 0,
                              "shallow_advance_count": 22,
                              "set_min_competitive_score": 0,
                              "next_doc": 0,
                              "match": 0,
                              "next_doc_count": 0,
                              "score_count": 5252,
                              "compute_max_score_count": 81,
                              "compute_max_score": 26294,
                              "advance": 667029,
                              "advance_count": 5478,
                              "count_weight_count": 0,
                              "score": 517366,
                              "build_scorer_count": 67,
                              "create_weight": 202083,
                              "shallow_advance": 9371,
                              "count_weight": 0,
                              "create_weight_count": 1,
                              "build_scorer": 278960
                            },
                            "children": [
                              {
                                "type": "BoostQuery",
                                "description": "...............................",
                                "time_in_nanos": 48876,
                                "breakdown": {
                                  "set_min_competitive_score_count": 0,
                                  "match_count": 0,
                                  "shallow_advance_count": 0,
                                  "set_min_competitive_score": 0,
                                  "next_doc": 0,
                                  "match": 0,
                                  "next_doc_count": 0,
                                  "score_count": 1,
                                  "compute_max_score_count": 0,
                                  "compute_max_score": 0,
                                  "advance": 251,
                                  "advance_count": 2,
                                  "count_weight_count": 0,
                                  "score": 83,
                                  "build_scorer_count": 25,
                                  "create_weight": 41166,
                                  "shallow_advance": 0,
                                  "count_weight": 0,
                                  "create_weight_count": 1,
                                  "build_scorer": 7376
                                }
                              },
                              {
                                "type": "TermQuery",
                                "description": ".............................",
                                "time_in_nanos": 709867,
                                "breakdown": {
                                  "set_min_competitive_score_count": 0,
                                  "match_count": 0,
                                  "shallow_advance_count": 17,
                                  "set_min_competitive_score": 0,
                                  "next_doc": 0,
                                  "match": 0,
                                  "next_doc_count": 0,
                                  "score_count": 5240,
                                  "compute_max_score_count": 58,
                                  "compute_max_score": 16953,
                                  "advance": 305180,
                                  "advance_count": 5466,
                                  "count_weight_count": 0,
                                  "score": 226768,
                                  "build_scorer_count": 67,
                                  "create_weight": 16042,
                                  "shallow_advance": 7126,
                                  "count_weight": 0,
                                  "create_weight_count": 1,
                                  "build_scorer": 137798
                                }
                              },
                              {
                                "type": "BoostQuery",
                                "description": "..................................",
                                "time_in_nanos": 17300,
                                "breakdown": {
                                  "set_min_competitive_score_count": 0,
                                  "match_count": 0,
                                  "shallow_advance_count": 0,
                                  "set_min_competitive_score": 0,
                                  "next_doc": 0,
                                  "match": 0,
                                  "next_doc_count": 0,
                                  "score_count": 2,
                                  "compute_max_score_count": 0,
                                  "compute_max_score": 0,
                                  "advance": 207,
                                  "advance_count": 3,
                                  "count_weight_count": 0,
                                  "score": 208,
                                  "build_scorer_count": 25,
                                  "create_weight": 13959,
                                  "shallow_advance": 0,
                                  "count_weight": 0,
                                  "create_weight_count": 1,
                                  "build_scorer": 2926
                                }
                              },
                              {
                                "type": "BoostQuery",
                                "description": ".....................................",
                                "time_in_nanos": 43721,
                                "breakdown": {
                                  "set_min_competitive_score_count": 0,
                                  "match_count": 0,
                                  "shallow_advance_count": 0,
                                  "set_min_competitive_score": 0,
                                  "next_doc": 0,
                                  "match": 0,
                                  "next_doc_count": 0,
                                  "score_count": 7,
                                  "compute_max_score_count": 0,
                                  "compute_max_score": 0,
                                  "advance": 1419,
                                  "advance_count": 11,
                                  "count_weight_count": 0,
                                  "score": 6500,
                                  "build_scorer_count": 31,
                                  "create_weight": 28167,
                                  "shallow_advance": 0,
                                  "count_weight": 0,
                                  "create_weight_count": 1,
                                  "build_scorer": 7635
                                }
                              },
                              {
                                "type": "BoostQuery",
                                "description": ".................................",
                                "time_in_nanos": 33381,
                                "breakdown": {
                                  "set_min_competitive_score_count": 0,
                                  "match_count": 0,
                                  "shallow_advance_count": 0,
                                  "set_min_competitive_score": 0,
                                  "next_doc": 0,
                                  "match": 0,
                                  "next_doc_count": 0,
                                  "score_count": 2,
                                  "compute_max_score_count": 0,
                                  "compute_max_score": 0,
                                  "advance": 1958,
                                  "advance_count": 3,
                                  "count_weight_count": 0,
                                  "score": 1125,
                                  "build_scorer_count": 25,
                                  "create_weight": 15833,
                                  "shallow_advance": 0,
                                  "count_weight": 0,
                                  "create_weight_count": 1,
                                  "build_scorer": 14465
                                }
                              }
                            ]
                          },
                          {
                            "type": "CombinedFieldQuery",
                            "description": "..................................",
                            "time_in_nanos": 3390963,
                            "breakdown": {
                              "set_min_competitive_score_count": 0,
                              "match_count": 0,
                              "shallow_advance_count": 1,
                              "set_min_competitive_score": 0,
                              "next_doc": 0,
                              "match": 0,
                              "next_doc_count": 0,
                              "score_count": 1,
                              "compute_max_score_count": 6,
                              "compute_max_score": 294,
                              "advance": 166,
                              "advance_count": 2,
                              "count_weight_count": 0,
                              "score": 6250,
                              "build_scorer_count": 25,
                              "create_weight": 3362541,
                              "shallow_advance": 42,
                              "count_weight": 0,
                              "create_weight_count": 1,
                              "build_scorer": 21670
                            }
                          },
                          {
                            "type": "CombinedFieldQuery",
                            "description": "....................................",
                            "time_in_nanos": 10411190,
                            "breakdown": {
                              "set_min_competitive_score_count": 0,
                              "match_count": 0,
                              "shallow_advance_count": 22,
                              "set_min_competitive_score": 0,
                              "next_doc": 0,
                              "match": 0,
                              "next_doc_count": 0,
                              "score_count": 5172,
                              "compute_max_score_count": 97,
                              "compute_max_score": 3659,
                              "advance": 632445,
                              "advance_count": 5397,
                              "count_weight_count": 0,
                              "score": 4721708,
                              "build_scorer_count": 67,
                              "create_weight": 4752375,
                              "shallow_advance": 1165,
                              "count_weight": 0,
                              "create_weight_count": 1,
                              "build_scorer": 299838
                            }
                          },
                          {
                            "type": "CombinedFieldQuery",
                            "description": "....................................",
                            "time_in_nanos": 13241813,
                            "breakdown": {
                              "set_min_competitive_score_count": 0,
                              "match_count": 0,
                              "shallow_advance_count": 22,
                              "set_min_competitive_score": 0,
                              "next_doc": 0,
                              "match": 0,
                              "next_doc_count": 0,
                              "score_count": 5178,
                              "compute_max_score_count": 66,
                              "compute_max_score": 2069,
                              "advance": 673530,
                              "advance_count": 5404,
                              "count_weight_count": 0,
                              "score": 2756870,
                              "build_scorer_count": 67,
                              "create_weight": 9580500,
                              "shallow_advance": 1000,
                              "count_weight": 0,
                              "create_weight_count": 1,
                              "build_scorer": 227844
                            }
                          }
                        ]
                      },
                      {
                        "type": "MultiTermQueryConstantScoreBlendedWrapper",
                        "description": ".................................",
                        "time_in_nanos": 412759,
                        "breakdown": {
                          "set_min_competitive_score_count": 0,
                          "match_count": 0,
                          "shallow_advance_count": 0,
                          "set_min_competitive_score": 0,
                          "next_doc": 0,
                          "match": 0,
                          "next_doc_count": 0,
                          "score_count": 0,
                          "compute_max_score_count": 0,
                          "compute_max_score": 0,
                          "advance": 3630,
                          "advance_count": 22,
                          "count_weight_count": 0,
                          "score": 0,
                          "build_scorer_count": 44,
                          "create_weight": 4542,
                          "shallow_advance": 0,
                          "count_weight": 0,
                          "create_weight_count": 1,
                          "build_scorer": 404587
                        }
                      }
                    ]
                  },
                  {
                    "type": "FieldExistsQuery",
                    "description": "................................",
                    "time_in_nanos": 698722,
                    "breakdown": {
                      "set_min_competitive_score_count": 0,
                      "match_count": 0,
                      "shallow_advance_count": 0,
                      "set_min_competitive_score": 0,
                      "next_doc": 373244,
                      "match": 0,
                      "next_doc_count": 5478,
                      "score_count": 0,
                      "compute_max_score_count": 0,
                      "compute_max_score": 0,
                      "advance": 202859,
                      "advance_count": 2731,
                      "count_weight_count": 0,
                      "score": 0,
                      "build_scorer_count": 66,
                      "create_weight": 1125,
                      "shallow_advance": 0,
                      "count_weight": 0,
                      "create_weight_count": 1,
                      "build_scorer": 121494
                    }
                  }
                ]
              }
            ],
            "rewrite_time": 1702873,
            "collector": [
              {
                "name": "QueryPhaseCollector",
                "reason": "search_query_phase",
                "time_in_nanos": 16295193,
                "children": [
                  {
                    "name": "SimpleTopScoreDocCollector",
                    "reason": "search_top_hits",
                    "time_in_nanos": 15870874
                  }
                ]
              }
            ]
          }
        ],
        "aggregations": [],
        "fetch": {
          "type": "fetch",
          "description": "",
          "time_in_nanos": 21252291,
          "breakdown": {
            "load_stored_fields": 4819205,
            "load_source": 0,
            "load_stored_fields_count": 100,
            "next_reader_count": 11,
            "load_source_count": 0,
            "next_reader": 609624
          },
          "debug": {
            "stored_fields": [
              "_id",
              "_routing"
            ]
          },
          "children": [
            {
              "type": "FetchDocValuesPhase",
              "description": "",
              "time_in_nanos": 98041,
              "breakdown": {
                "process_count": 100,
                "process": 77251,
                "next_reader": 20790,
                "next_reader_count": 11
              }
            },
            {
              "type": "MatchedQueriesPhase",
              "description": "",
              "time_in_nanos": 3401587,
              "breakdown": {
                "process_count": 100,
                "process": 2926337,
                "next_reader": 475250,
                "next_reader_count": 11
              }
            },
            {
              "type": "StoredFieldsPhase",
              "description": "",
              "time_in_nanos": 9104,
              "breakdown": {
                "process_count": 100,
                "process": 8020,
                "next_reader": 1084,
                "next_reader_count": 11
              }
            }
          ]
        }
      }
    ]
  }

For the rescorer profiling, I am using a simple System.currentTimeMillis() diff (start/end of the method), which should reflect the actual execution time, since the method is only called once per request.

I indeed gather doc_values from various document fields during the rescoring, but shouldn't the time spent there be included in the profiled time?
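
In case it helps, the doc_values access happens inside the measured window and follows the usual per-segment pattern, roughly like the sketch below (building on the skeleton above; the field name and the scoring formula are placeholders):

import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.index.ReaderUtil;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.elasticsearch.search.rescore.RescoreContext;

// ...

@Override
public TopDocs rescore(TopDocs topDocs, IndexSearcher searcher, RescoreContext rescoreContext) throws IOException {
    long start = System.currentTimeMillis();

    // Sort a copy of the hits by global doc id so each segment's doc values
    // iterator only ever moves forward.
    ScoreDoc[] hits = topDocs.scoreDocs.clone();
    Arrays.sort(hits, Comparator.comparingInt((ScoreDoc d) -> d.doc));

    List<LeafReaderContext> leaves = searcher.getIndexReader().leaves();
    int currentLeaf = -1;
    LeafReaderContext leaf = null;
    NumericDocValues values = null;
    for (ScoreDoc hit : hits) {
        int leafIndex = ReaderUtil.subIndex(hit.doc, leaves);
        if (leafIndex != currentLeaf) {
            currentLeaf = leafIndex;
            leaf = leaves.get(leafIndex);
            values = leaf.reader().getNumericDocValues("my_feature"); // hypothetical field
        }
        if (values != null && values.advanceExact(hit.doc - leaf.docBase)) {
            hit.score = hit.score + (float) values.longValue();       // placeholder scoring
        }
    }

    System.out.println("rescore() took " + (System.currentTimeMillis() - start) + " ms");
    return topDocs;
}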

@benwtrent (Member)

@nemphys is your window_size the same as size? When a rescorer is present, we gather extra docs to account for window_size, which can increase query time.


nemphys commented May 14, 2024

@benwtrent actually the window size is ignored (set to the maximum allowed), since I must process all query results in the rescorer. (If you are wondering why I don't do this with normal scoring, e.g. with a ScriptEngine: it is because I am considering implementing some kind of LTR in the rescorer, which must process the top k results before all results are rescored with some additional logic.)

@benwtrent (Member)

actually the window size is ignored (set to the maximum allowed)

By maximum allowed you mean 10_000?

So the number of documents matched by the query you are using to measure the performance difference and the number of documents handled by your rescorer are exactly the same?


nemphys commented May 14, 2024

actually the window size is ignored (set to the maximum allowed)

By maximum allowed you mean 10_000?

So the number of documents matched by the query you are using to measure the performance difference and the number of documents handled by your rescorer are exactly the same?

Yes, both queries are exactly the same, with the only difference being the rescorer (which processes all results).


nemphys commented May 14, 2024

I think I should elaborate a little more in order to clarify what I am trying to achieve here:

Until now, I have implemented a ScriptEngine that performs custom rescoring on top of BM25. Since I would like to introduce an LTR layer to the scoring process (not using the native ES LTR functionality, nor the ES LTR plugin), I am experimenting with a custom rescorer. The problem is that the initial custom rescoring logic must be applied after the LTR rescoring, and must be applied to all query results (whereas LTR could affect only the top n).

Therefore, I created the rescorer (which does exactly what the ScriptEngine has been doing until now) and I am trying to make sure that I will not get some kind of latency overhead before I start experimenting with the LTR part. Sadly, this is not the case and I am trying to figure out what is going on :-)


nemphys commented May 14, 2024

@benwtrent New finding: if I remove the score processing code (including doc_values loading, etc.) from the custom rescorer's rescore() method (effectively turning it into a no-op), I still get ~50ms added latency on search responses (!)

This definitely means that there is something else going on, outside the rescorer.

Update: just tried using a small window size (instead of the max I have been using), and the overhead is gone. Does this mean that it is impossible to process all search results inside a rescorer without the added latency, even if the rescorer itself is fast?

Update 2: I did some further tests regarding window size and I cannot understand why it affects latency so much; the specific query I am using has a total of 5252 results. Gradually increasing the window size from 10 to 5300 increases the latency proportionally (this should be considered normal, I guess?). What is really weird is that if I keep increasing it up to the max (500000), query response times continue to go up; shouldn't the window size have no negative effect when raised above the actual total result count of the query?

@benwtrent (Member)

What is really weird is that if I keep increasing it up to the max (500000), query response times continue to go up; shouldn't the window size have no negative effect when raised above the actual total result count of the query?

When rescoring, we have to actually search for, score, and gather window_size docs. So, that means more scoring is required as window size increases.

But the fact that increasing window_size keeps making things slower even though we know only 5252 docs will actually be returned is weird.

I wonder if your rescorer is ignoring the matched docs and instead iterating past them, assuming they have a score of 0?

This warrants some investigation on our side. Something is weird and I cannot put my finger on it.


nemphys commented May 14, 2024

What is really weird is that if I keep increasing it up to the max (500000), query response times continue to go up; shouldn't the window size have no negative effect when raised above the actual total result count of the query?

When rescoring, we have to actually search for, score, and gather window_size docs. So, that means more scoring is required as window size increases.

But the fact that increasing window_size keeps making things slower even though we know only 5252 docs will actually be returned is weird.

I wonder if your rescorer is ignoring the matched docs and instead iterating past them, assuming they have a score of 0?

The rescorer only processes the matched docs; I am 100% certain about that. This is further confirmed by the fact that response times go up in proportion to the window_size even with a no-op rescorer (return topDocs)!

After digging deeper, I stumbled upon a comment in the Lucene TopScoreDocCollectorManager/TopFieldCollectorManager classes noting that "The instances returned by this method pre-allocate a full array of length numHits" (which I confirmed to be equal to the window_size setting, initially set by org.elasticsearch.search.query.QueryPhaseCollectorManager's createQueryPhaseCollectorManager method). Could this have something to do with the overhead (10/100 vs 500,000 entries could potentially make a difference I suppose)?
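
To make the suspicion concrete, here is a deliberately simplified illustration (not Lucene's actual code) of what pre-allocating a full array of length numHits implies when numHits is driven by window_size:

import org.apache.lucene.search.ScoreDoc;

// Toy model of a top-N collector whose backing array is sized by numHits
// (here: the rescore window_size), not by the number of docs that actually match.
public class PreallocatedTopN {

    private final ScoreDoc[] slots;
    private int size;

    public PreallocatedTopN(int numHits) {
        // With window_size = 500000 this is a 500000-slot array per collector,
        // allocated up front even if the query only ever matches ~5252 documents.
        this.slots = new ScoreDoc[numHits];
    }

    public void collect(int doc, float score) {
        if (size < slots.length) {
            slots[size++] = new ScoreDoc(doc, score);
        }
        // A real collector maintains a min-heap and replaces the weakest entry
        // once the array is full; the up-front allocation is the point here.
    }
}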

This warrants some investigation on our side. Something weird and I cannot put my finger on it.

That would be great, looking forward to hearing about any interesting findings for potential optimization!

@benwtrent (Member)

"The instances returned by this method pre-allocate a full array of length numHits"

🫤


nemphys commented May 15, 2024

"The instances returned by this method pre-allocate a full array of length numHits"

🫤

Does 🫤 mean "Bummer, we can do nothing about that" or "Gee, we will look into it!"? 😅

@benwtrent (Member)

It means that it is likely done this way for a reason and I am not sure we can fix it.

We don't know how many docs are going to match until we start searching.

There are optimized ways to gather the matching docs, allowing a two-pass approach (one pass to count, another to score), but this would likely make things slower, especially for queries where an optimized count just doesn't exist.

Maybe we can under-allocate the array and grow it, but that might lead to concurrency issues.

We could switch from an array to some concurrent collection that grows dynamically, but this might slow down the common case.
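
As a rough sketch of the "grow it" idea (not how Lucene or Elasticsearch works today; per-slice buffers would still need to be merged and kept thread-confined under concurrent search):

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.search.ScoreDoc;

// Hits are buffered in a list that grows with the actual number of matches,
// instead of pre-allocating a window_size-sized array.
public class GrowableHitBuffer {

    private final List<ScoreDoc> hits = new ArrayList<>(16);

    public void collect(int doc, float score) {
        hits.add(new ScoreDoc(doc, score)); // amortised growth
    }

    public ScoreDoc[] snapshot() {
        return hits.toArray(new ScoreDoc[0]);
    }
}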


nemphys commented May 15, 2024

So, does this mean that this is definitely the culprit here? Is there any chance that something else might also be affecting latency?

@benwtrent (Member)

It may be part of the cause, but it's likely not all of it.

We do sort the docs multiple times, etc.

It would be interesting to see allocations and CPU usage via JFR or async-profiler.
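
If it helps, one way to capture that when the query path can be reproduced in a JVM you control (for example a single-node test) is a programmatic JFR recording; a minimal sketch using the standard jdk.jfr API (paths, settings, and the helper method are placeholders):

import java.nio.file.Path;

import jdk.jfr.Configuration;
import jdk.jfr.Recording;

// Records CPU samples and allocation events around a block of code and dumps
// them to a file that can be opened in JDK Mission Control.
public class JfrCapture {

    public static void main(String[] args) throws Exception {
        Configuration config = Configuration.getConfiguration("profile"); // built-in "profile" settings
        try (Recording recording = new Recording(config)) {
            recording.start();
            runTestQueries(); // placeholder for the rescoring searches being measured
            recording.stop();
            recording.dump(Path.of("rescore.jfr"));
        }
    }

    private static void runTestQueries() {
        // ... issue the search requests under test ...
    }
}

Attaching to a running node with jcmd <pid> JFR.start works as well for the server-side view.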
