Retrieval Metrics: Updating HitRate and MRR for Evaluation@K documents retrieved. Also adding RR as a separate metric #12997
@@ -12,45 +12,51 @@
 class HitRate(BaseRetrievalMetric):
-    """Hit rate metric."""
+    """Hit rate metric: Compute the proportion of matches between retrieved documents and expected documents."""

     metric_name: str = "hit_rate"

     def compute(
         self,
         query: Optional[str] = None,
         expected_ids: Optional[List[str]] = None,
         retrieved_ids: Optional[List[str]] = None,
         expected_texts: Optional[List[str]] = None,
         retrieved_texts: Optional[List[str]] = None,
         **kwargs: Any,
     ) -> RetrievalMetricResult:
         """Compute metric."""
-        if retrieved_ids is None or expected_ids is None:
-            raise ValueError("Retrieved ids and expected ids must be provided")
-        is_hit = any(id in expected_ids for id in retrieved_ids)
-        return RetrievalMetricResult(
-            score=1.0 if is_hit else 0.0,
-        )
+        if (
+            retrieved_ids is None
+            or expected_ids is None
+            or not retrieved_ids
+            or not expected_ids
+        ):
+            raise ValueError("Both retrieved ids and expected ids must be provided")
+
+        expected_set = set(expected_ids)
+        hits = sum(1 for doc_id in retrieved_ids if doc_id in expected_set)
+        score = hits / len(expected_ids) if expected_ids else 0.0
+
+        return RetrievalMetricResult(score=score)


-class MRR(BaseRetrievalMetric):
-    """MRR metric."""
-
-    metric_name: str = "mrr"
+class RR(BaseRetrievalMetric):
+    """Reciprocal Rank (RR): Calculates the reciprocal rank of the first, and only the first, relevant retrieved document.
+    Returns 0 if no relevant retrieved docs are found.
+    """
+
+    metric_name: str = "rr"

     def compute(
         self,
         query: Optional[str] = None,
         expected_ids: Optional[List[str]] = None,
         retrieved_ids: Optional[List[str]] = None,
         expected_texts: Optional[List[str]] = None,
         retrieved_texts: Optional[List[str]] = None,
         **kwargs: Any,
     ) -> RetrievalMetricResult:
         """Compute metric."""
-        if retrieved_ids is None or expected_ids is None:
-            raise ValueError("Retrieved ids and expected ids must be provided")
+        if (
+            retrieved_ids is None
+            or expected_ids is None
+            or not retrieved_ids
+            or not expected_ids
+        ):
+            raise ValueError("Both retrieved ids and expected ids must be provided")
         for i, id in enumerate(retrieved_ids):
             if id in expected_ids:
                 return RetrievalMetricResult(

Review thread (on the `expected_texts` / `retrieved_texts` parameters of `compute`):

Reviewer: I'm not entirely sure if these can be removed. Did you check if removing these args still satisfies the …

Author: Yep, you are correct. I did some research and it would violate the Liskov substitution principle, which is not ideal, particularly with an ABC and its abstract methods. When making the changes, I will ensure the method signatures are kept the same. Thanks for highlighting this.

Reviewer: Thanks!
|
@@ -61,6 +67,43 @@ def compute( | |
) | ||
|
||
|
||
class MRR(BaseRetrievalMetric): | ||
"""Mean Reciprocal Rank (MRR): Sums up the reciprocal rank score for each relevant retrieved document. | ||
Then divides by the count of relevant documents. | ||
""" | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this will break the current way we invoke There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah that makes sense. I completely understand. What do you think about having a net new metric called SQMRR (single query mean reciprocal rank)? To boost the flexibility of your evaluation suite, for devs There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Would There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, to a certain degree. It would be more granular version (since it scores based on multiple docs each time) I could implement it as an optional implementation, using your parameter idea. Incase the user wants multiple docs scored each time Let me know what you think There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Okay, let's roll with that. |
||
metric_name: str = "mrr" | ||
|
||
def compute( | ||
self, | ||
expected_ids: Optional[List[str]] = None, | ||
retrieved_ids: Optional[List[str]] = None, | ||
) -> RetrievalMetricResult: | ||
"""Compute the Mean Reciprocal Rank given expected document IDs and retrieved document IDs.""" | ||
if ( | ||
retrieved_ids is None | ||
or expected_ids is None | ||
or not retrieved_ids | ||
or not expected_ids | ||
): | ||
raise ValueError("Both retrieved ids and expected ids must be provided") | ||
|
||
expected_set = set(expected_ids) | ||
reciprocal_rank_sum = 0.0 | ||
relevant_docs_count = 0 | ||
|
||
for index, doc_id in enumerate(retrieved_ids): | ||
if doc_id in expected_set: | ||
relevant_docs_count += 1 | ||
reciprocal_rank_sum += 1.0 / (index + 1) | ||
|
||
if relevant_docs_count > 0: | ||
mrr_score = reciprocal_rank_sum / relevant_docs_count | ||
return RetrievalMetricResult(score=mrr_score) | ||
else: | ||
return RetrievalMetricResult(score=0.0) | ||
|
||
|
||
class CohereRerankRelevancyMetric(BaseRetrievalMetric): | ||
"""Cohere rerank relevancy metric.""" | ||
|
||
|
@@ -129,6 +172,7 @@ def compute( | |
|
||
METRIC_REGISTRY: Dict[str, Type[BaseRetrievalMetric]] = { | ||
"hit_rate": HitRate, | ||
"rr": RR, | ||
"mrr": MRR, | ||
"cohere_rerank_relevancy": CohereRerankRelevancyMetric, | ||
} | ||
|
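The three per-query computations in this diff can be sketched standalone. These are hypothetical helper functions for illustration, not the llama_index classes; they assume non-empty lists of string IDs:

```python
from typing import List


def hit_rate(expected_ids: List[str], retrieved_ids: List[str]) -> float:
    """Proportion of expected docs found anywhere in the retrieval."""
    expected_set = set(expected_ids)
    hits = sum(1 for doc_id in retrieved_ids if doc_id in expected_set)
    return hits / len(expected_ids)


def rr(expected_ids: List[str], retrieved_ids: List[str]) -> float:
    """Reciprocal rank of the first relevant doc; 0.0 if none is relevant."""
    expected_set = set(expected_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in expected_set:
            return 1.0 / rank
    return 0.0


def mrr(expected_ids: List[str], retrieved_ids: List[str]) -> float:
    """Average reciprocal rank over every relevant doc retrieved (the PR's per-query MRR)."""
    expected_set = set(expected_ids)
    ranks = [i + 1 for i, doc_id in enumerate(retrieved_ids) if doc_id in expected_set]
    if not ranks:
        return 0.0
    return sum(1.0 / r for r in ranks) / len(ranks)


expected = ["id1", "id2", "id3", "id4"]
retrieved = ["id5", "id1", "id2"]
print(hit_rate(expected, retrieved))   # 0.5  (2 of 4 expected docs retrieved)
print(rr(expected, retrieved))         # 0.5  (first relevant doc at rank 2)
print(round(mrr(expected, retrieved), 4))  # 0.4167  ((1/2 + 1/3) / 2)
```

Note how the same query can score differently under each metric: RR looks only at the first relevant document, while the proposed MRR rewards every relevant document by its position.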
@@ -0,0 +1,121 @@
from llama_index.core.evaluation.retrieval.metrics import HitRate, RR, MRR
import pytest


# Test cases using pytest
@pytest.mark.parametrize(
    ("expected_ids", "retrieved_ids", "expected_result"),
    [
        (["id1", "id2", "id3"], ["id3", "id1", "id2", "id4"], 1.0),
        (["id1", "id2", "id3", "id4"], ["id1", "id5", "id2"], 0.5),
        (["id1", "id2"], ["id3", "id4"], 0.0),
    ],
)
def test_hit_rate(expected_ids, retrieved_ids, expected_result):
    hr = HitRate()
    # Pass by keyword: HitRate.compute takes `query` as its first parameter,
    # so positional calls would bind expected_ids to query.
    result = hr.compute(expected_ids=expected_ids, retrieved_ids=retrieved_ids)
    assert result.score == expected_result


@pytest.mark.parametrize(
    ("expected_ids", "retrieved_ids", "expected_result"),
    [
        # Test cases that reflect the correct computation of RR
        (
            ["id1", "id2", "id3"],
            ["id3", "id1", "id2", "id4"],
            1 / 1,
        ),  # id3 is the first match, rank 1
        (
            ["id1", "id2", "id3", "id4"],
            ["id5", "id1"],
            1 / 2,
        ),  # id1 is the first match, rank 2
        (["id1", "id2"], ["id3", "id4"], 0.0),  # No matches found
        (
            ["id1", "id2"],
            ["id2", "id1", "id7"],
            1 / 1,
        ),  # id2 is the first match, rank 1
    ],
)
def test_rr(expected_ids, retrieved_ids, expected_result):
    rr = RR()
    result = rr.compute(expected_ids=expected_ids, retrieved_ids=retrieved_ids)
    assert result.score == pytest.approx(expected_result)


@pytest.mark.parametrize(
    ("expected_ids", "retrieved_ids", "expected_result"),
    [
        (
            ["id1", "id2", "id3"],
            ["id3", "id1", "id2", "id4"],
            (((1 / 1) + (1 / 2) + (1 / 3)) / 3),
        ),
        (
            ["id1", "id2", "id3", "id4"],
            ["id1", "id2", "id5"],
            (((1 / 1) + (1 / 2)) / 2),
        ),
        (["id1", "id2"], ["id3", "id4"], 0.0),
        (["id1", "id2"], ["id1", "id7", "id15", "id2"], (((1 / 1) + (1 / 4)) / 2)),
    ],
)
def test_mrr(expected_ids, retrieved_ids, expected_result):
    mrr = MRR()
    result = mrr.compute(expected_ids=expected_ids, retrieved_ids=retrieved_ids)
    assert result.score == pytest.approx(expected_result)


# NOTE: The following test cases are specifically for handling ValueErrors
@pytest.mark.parametrize(
    ("expected_ids", "retrieved_ids"),
    [
        ([], []),  # Empty IDs should trigger ValueError
        (
            None,
            ["id3", "id1", "id2", "id4"],
        ),  # None expected_ids should trigger ValueError
        (["id1", "id2", "id3"], None),  # None retrieved_ids should trigger ValueError
    ],
)
def test_hit_rate_exceptions(expected_ids, retrieved_ids):
    hr = HitRate()
    with pytest.raises(ValueError):
        hr.compute(expected_ids=expected_ids, retrieved_ids=retrieved_ids)


@pytest.mark.parametrize(
    ("expected_ids", "retrieved_ids"),
    [
        # Test cases for handling exceptions
        ([], []),  # Empty IDs should trigger ValueError
        (
            None,
            ["id3", "id1", "id2", "id4"],
        ),  # None expected_ids should trigger ValueError
        (["id1", "id2", "id3"], None),  # None retrieved_ids should trigger ValueError
    ],
)
def test_rr_exceptions(expected_ids, retrieved_ids):
    rr = RR()
    with pytest.raises(ValueError):
        rr.compute(expected_ids=expected_ids, retrieved_ids=retrieved_ids)


@pytest.mark.parametrize(
    ("expected_ids", "retrieved_ids"),
    [
        ([], []),  # Empty IDs should trigger ValueError
        (
            None,
            ["id3", "id1", "id2", "id4"],
        ),  # None expected_ids should trigger ValueError
        (["id1", "id2", "id3"], None),  # None retrieved_ids should trigger ValueError
    ],
)
def test_mrr_exceptions(expected_ids, retrieved_ids):
    mrr = MRR()
    with pytest.raises(ValueError):
        mrr.compute(expected_ids=expected_ids, retrieved_ids=retrieved_ids)
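The last parametrized MRR case can be checked by hand under the PR's per-query definition (average of reciprocal ranks over all relevant docs found):

```python
retrieved = ["id1", "id7", "id15", "id2"]
expected = {"id1", "id2"}

# 1-based ranks of the relevant docs in retrieval order
ranks = [i + 1 for i, doc_id in enumerate(retrieved) if doc_id in expected]
print(ranks)  # [1, 4]

# Average of their reciprocal ranks: (1/1 + 1/4) / 2
score = sum(1.0 / r for r in ranks) / len(ranks)
print(score)  # 0.625
```

This matches the `(((1 / 1) + (1 / 4)) / 2)` expectation in the test table above.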
Review thread (on the new proportion-based HitRate):

Reviewer: I like the metric, but I'm not sure I like it as a replacement for HitRate, since "Hit" implies a binary nature. Perhaps we can call this MeanDocumentMatch or something that represents an average of proportions.

Reviewer: To be clear, I am recommending leaving HitRate as is and creating a net new BaseRetrievalMetric.

Author: That's a good point. For the proposed implementation, would you not say that technically each document gets a binary representation in the score (since each doc is hit, or not hit)? Then the summation and division allow for the accounting, and comparison, of multiple docs in the score. Note: I realise this would alter the usage pattern compared to past HitRate implementations, so that adds a cost to this idea! However, in one metric, we could get the binary representation and also avoid the information lost by having only 1 or 0 as scoring options, regardless of the number of docs used. Let me know if you think this is worth it.

Reviewer: Yeah, each doc definitely gets assigned a binary value, i.e. hit or no hit. But the usage pattern would unfortunately be broken here if we go down to that level of granularity with HitRate. Perhaps you're suggesting some sort of parameter to control the calculation of the HitRate, which if that's the case I think would be fine. We could make the default calculation be the current computation, and allow the user to specify if they would like to use the more granular calculation.

Author: Great idea! I'll work on that.
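A minimal sketch of the agreed follow-up, assuming a boolean flag (the flag name `granular` is hypothetical; the thread does not settle on one): the default keeps today's binary hit rate, and opting in returns the proportion-based score.

```python
from typing import List


def hit_rate(expected_ids: List[str], retrieved_ids: List[str], granular: bool = False) -> float:
    if not expected_ids or not retrieved_ids:
        raise ValueError("Both retrieved ids and expected ids must be provided")
    expected_set = set(expected_ids)
    if granular:
        # Opt-in: proportion of expected docs that were retrieved
        hits = sum(1 for doc_id in retrieved_ids if doc_id in expected_set)
        return hits / len(expected_ids)
    # Default: current binary behaviour, 1.0 if any expected doc was retrieved
    return 1.0 if any(doc_id in expected_set for doc_id in retrieved_ids) else 0.0


print(hit_rate(["id1", "id2", "id3", "id4"], ["id1", "id5", "id2"]))                 # 1.0
print(hit_rate(["id1", "id2", "id3", "id4"], ["id1", "id5", "id2"], granular=True))  # 0.5
```

Defaulting the flag to the binary computation preserves the existing usage pattern, which is the backward-compatibility concern raised in the thread.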