Retrieval Metrics: Updating HitRate and MRR for Evaluation@K documents retrieved. Also adding RR as a separate metric #12997
Conversation
@@ -12,45 +12,51 @@
 class HitRate(BaseRetrievalMetric):
-    """Hit rate metric."""
+    """Hit rate metric: Compute the proportion of matches between retrieved documents and expected documents."""
I like the metric, but I'm not sure I like it as a replacement for HitRate, since "Hit" implies a binary nature.
Perhaps we can call this MeanDocumentMatch, or something that represents an average of proportions.
To be clear, I am recommending leaving HitRate as is and creating a net new BaseRetrievalMetric.
That's a good point. For the proposed implementation, would you not say that technically each document still gets a binary representation in the score (since each doc is either hit or not hit)?
The summation and division then allow multiple docs to be accounted for, and compared, in the score.
Note: I realise this would alter the usage pattern compared to past HitRate implementations, so that adds a cost to this idea!
However, in one metric, we could get the binary representation and also avoid the information lost by having only 1 or 0 as scoring options, regardless of the number of docs used.
Let me know if you think this is worth it.
Yeah, each doc definitely gets assigned a binary value, i.e. hit or no hit. But yeah, the usage pattern would unfortunately be broken here if we go down to that level of granularity with HitRate.
Perhaps you're suggesting some sort of parameter to control the calculation of the HitRate, which if that's the case I think would be fine. We could make the default calculation be the current computation, and allow the user to specify if they would like to use the more granular calculation.
Great idea!
I'll work on that
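As a rough sketch of the granular idea being discussed, the difference between the two calculations could look like the helper functions below. These are illustrative only, not the library's API, and the binary variant assumes the existing HitRate scores 1.0 whenever any expected doc is retrieved.

```python
from typing import List


def binary_hit_rate(expected_ids: List[str], retrieved_ids: List[str]) -> float:
    """Current-style hit rate: 1.0 if any expected doc appears in the retrieved set."""
    return 1.0 if any(doc_id in retrieved_ids for doc_id in expected_ids) else 0.0


def granular_hit_rate(expected_ids: List[str], retrieved_ids: List[str]) -> float:
    """Granular hit rate: the proportion of expected docs found in the retrieved set."""
    if not expected_ids:
        return 0.0
    hits = sum(1 for doc_id in expected_ids if doc_id in retrieved_ids)
    return hits / len(expected_ids)
```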
expected_ids: Optional[List[str]] = None,
retrieved_ids: Optional[List[str]] = None,
expected_texts: Optional[List[str]] = None,
I'm not entirely sure if these can be removed. Did you check if removing these args still satisfies the compute method signature for BaseRetrievalMetric?
Yep, you are correct. I did some research and removing them would violate the Liskov substitution principle, which is not ideal, particularly with an ABC and its abstract methods.
When making the changes, I will ensure the method signatures are kept the same.
Thanks for highlighting this.
Thanks!
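To illustrate the point about keeping the signatures aligned, here is a hedged sketch of a subclass that preserves the full parameter list. The import path, the parameter names beyond those shown in the snippet above, and the RetrievalMetricResult return type are assumptions about the library and may differ by version.

```python
from typing import Any, List, Optional

# Assumed import path; may differ depending on the llama_index version.
from llama_index.core.evaluation.retrieval.metrics_base import (
    BaseRetrievalMetric,
    RetrievalMetricResult,
)


class HitRate(BaseRetrievalMetric):
    """Hit rate metric (sketch)."""

    metric_name: str = "hit_rate"

    def compute(
        self,
        query: Optional[str] = None,
        expected_ids: Optional[List[str]] = None,
        retrieved_ids: Optional[List[str]] = None,
        expected_texts: Optional[List[str]] = None,
        retrieved_texts: Optional[List[str]] = None,
        **kwargs: Any,
    ) -> RetrievalMetricResult:
        # Keep every parameter from the abstract signature, even the unused ones,
        # so the subclass remains substitutable for the base class (LSP).
        if retrieved_ids is None or expected_ids is None:
            raise ValueError("Retrieved ids and expected ids must be provided")
        is_hit = any(doc_id in expected_ids for doc_id in retrieved_ids)
        return RetrievalMetricResult(score=1.0 if is_hit else 0.0)
```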
class MRR(BaseRetrievalMetric):
    """Mean Reciprocal Rank (MRR): Sums up the reciprocal rank score for each relevant retrieved document.
    Then divides by the count of relevant documents.
    """
I think this will break the current way we invoke MRR to be computed. While I agree that RR and this MRR computation are more technically correct, I feel it's a bit pedantic (and I'm pretty pedantic myself!) and might not be worth the hassle of ensuring backwards compatibility and risking breaking changes.
Yeah, that makes sense. I completely understand.
What do you think about having a net new metric called SQMRR (single-query mean reciprocal rank)?
It would boost the flexibility of your evaluation suite for devs.
Would SQMRR ultimately compute the MRR tho? i.e., after we take an average of all SQMRR?
Yeah, to a certain degree. It would be a more granular version (since it scores based on multiple docs each time).
I could implement it as an optional implementation, using your parameter idea, in case the user wants multiple docs scored each time.
Let me know what you think.
Okay, let's roll with that.
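For reference, a hedged sketch of the distinction being debated here: the classic per-query reciprocal rank (RR) uses only the first relevant hit, while the "granular" variant averages reciprocal ranks over every relevant retrieved doc. Function names are illustrative, and the exact averaging in the merged implementation may differ.

```python
from typing import List


def reciprocal_rank(expected_ids: List[str], retrieved_ids: List[str]) -> float:
    """Classic per-query RR: reciprocal rank of the first relevant retrieved doc."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in expected_ids:
            return 1.0 / rank
    return 0.0


def granular_reciprocal_rank(expected_ids: List[str], retrieved_ids: List[str]) -> float:
    """Granular variant: average the reciprocal ranks of all relevant retrieved docs."""
    reciprocal_ranks = [
        1.0 / rank
        for rank, doc_id in enumerate(retrieved_ids, start=1)
        if doc_id in expected_ids
    ]
    if not reciprocal_ranks:
        return 0.0
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```

MRR in the usual sense would then be the mean of the per-query values over the whole evaluation set.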
Thanks for the PR @AgenP. I have left some comments. I agree with you on RR / MRR, but I don't think that change is worth the breaking changes. I like the new "HitRate" metric, but we should leave the original one as is and create a net new one.
Hey @nerdai, thank you very much for the feedback. I've sent in my responses to your comments. Here is a quick summary of my proposed action steps.
Let me know if these are all good to be implemented, or if there is room for improvement.
I think it makes sense :). I left replies to your latest comments!
Hey @nerdai, the iterations have been made! Here is a summary of the changes I've now implemented:
MRR and HitRate changes
RR removed
Testing & formatting
Lmk if there are any issues 💪
Amazing, thanks for the thorough summary @AgenP!
Looks very good! I just think that maybe, instead of using kwargs on compute to learn if we should use the granular computation or not, we just make it an instance attribute.
)

# Determining which implementation to use based on `use_granular_hit_rate` kwarg
use_granular = kwargs.get("use_granular_hit_rate", False)
Instead of using kwargs, maybe we should just create a class or instance attribute use_granular?
)

# Determining which implementation to use based on `use_granular_mrr` kwarg
use_granular_mrr = kwargs.get("use_granular_mrr", False)
same comment as above, maybe we define this as a class/instance attribute?
that way we still satisfy the superclass method signature
Thanks @nerdai! My thinking with this was that since the BaseRetrievalMetric superclass' compute already accepts **kwargs, passing the flags through kwargs keeps the method signatures identical. What do you think?
Sorry for the late reply! Yeah, I just think it's a bit odd that we have to tuck these args away in kwargs.

class HitRate(BaseRetrievalMetric):
    """Hit rate metric."""

    metric_name: str = "hit_rate"
    use_granular_hit_rate: bool = False

    ...

    def compute(...):
        if self.use_granular_hit_rate:
            ...

How about something like this?
No worries whatsoever. That makes complete sense. Thank you for highlighting this. The new iteration has been committed!
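A hedged usage sketch of the construction-time flag agreed on above. The attribute name use_granular_hit_rate follows the kwarg shown earlier, and the exact constructor and compute arguments are assumptions that may differ from the merged code.

```python
# Default, binary behaviour: any overlap scores 1.0.
hit_rate = HitRate()
result = hit_rate.compute(expected_ids=["id1", "id2"], retrieved_ids=["id2", "id3"])

# Granular behaviour opted into once, at construction time, rather than per compute() call.
granular_hit_rate = HitRate(use_granular_hit_rate=True)
granular_result = granular_hit_rate.compute(
    expected_ids=["id1", "id2"], retrieved_ids=["id2", "id3"]
)
```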
Thanks @AgenP! LGTM!
@AgenP thanks again -- merging this for us now :)
Side notes:
Description
HitRate edit (HitRate@K)
Changed to allow for non-binary scoring --> to widen the evaluation value of this metric, from my perspective.
Example: 5/10 relevant docs retrieved scores 0.5 (instead of 1.0) and 2/10 would be 0.2 (instead of 1.0) --> allowing for a much more detailed analysis of the number of expected documents retrieved (see the small worked sketch after this list).
RR edit
MRR edit (MRR@K)
Other changes made
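Referring back to the HitRate@K example above, a small worked sketch of the arithmetic (plain Python, illustrative values only):

```python
expected = [f"doc_{i}" for i in range(10)]    # 10 expected documents

retrieved_5 = [f"doc_{i}" for i in range(5)]  # 5 of them retrieved
retrieved_2 = ["doc_0", "doc_1"]              # 2 of them retrieved

hits_5 = sum(1 for doc_id in expected if doc_id in retrieved_5)
hits_2 = sum(1 for doc_id in expected if doc_id in retrieved_2)

print(hits_5 / len(expected))  # 0.5 (previously this would have scored 1.0)
print(hits_2 / len(expected))  # 0.2 (previously 1.0)
```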
Type of Change
How Has This Been Tested?
Suggested Checklist:
make format; make lint to appease the lint gods