Retrieval Metrics: Updating HitRate and MRR for Evaluation@K documents retrieved. Also adding RR as a separate metric #12997

Merged
merged 6 commits on May 1, 2024
84 changes: 64 additions & 20 deletions llama-index-core/llama_index/core/evaluation/retrieval/metrics.py
@@ -12,45 +12,51 @@


class HitRate(BaseRetrievalMetric):
"""Hit rate metric."""
"""Hit rate metric: Compute the proportion of matches between retrieved documents and expected documents."""
Contributor:
I like the metric, but I'm not sure I like it as a replacement for HitRate, since "Hit" implies a binary nature.

Perhaps we can call this MeanDocumentMatch or something that represents an average of proportions.

Contributor:
To be clear, I am recommending leaving HitRate as is and creating a net new BaseRetrievalMetric

Contributor Author (@AgenP), Apr 23, 2024:
That’s a good point. For the proposed implementation, would you not say that technically each document gets a binary representation in the score (since each doc is hit, or not hit)?

Then the summation and division allow for the accounting, and comparison, of multiple docs in the score.

Note: I realise this would alter the usage pattern, compared to past HitRate implementations, so that adds a cost to this idea!

However, in one metric, we could get the binary representation and also avoid the information loss of having only 1 or 0 as scoring options, regardless of the number of docs used.

Let me know if you think this is worth it

Contributor:
Yeah, each doc definitely gets assigned a binary value, i.e. hit or no hit. But the usage pattern would unfortunately be broken if we go down to that level of granularity with HitRate.

Perhaps you're suggesting some sort of parameter to control the calculation of the HitRate, which, if that's the case, I think would be fine. We could make the default calculation the current computation, and allow the user to specify if they would like to use the more granular calculation.

Contributor Author:
Great idea!

I'll work on that
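
Not part of the diff: a minimal standalone sketch of the flag-controlled idea discussed above, written as a plain function so it runs on its own (the use_granular parameter name is hypothetical). The default path reproduces the existing binary hit-or-miss behaviour, and the opt-in path reproduces the proportion-based computation proposed in this PR.

from typing import List


def compute_hit_rate(
    expected_ids: List[str],
    retrieved_ids: List[str],
    use_granular: bool = False,  # hypothetical flag; default keeps the current behaviour
) -> float:
    """Hit-rate style score for a single query."""
    if not expected_ids or not retrieved_ids:
        raise ValueError("Both retrieved ids and expected ids must be provided")
    if not use_granular:
        # Current behaviour: 1.0 if any retrieved doc is expected, else 0.0.
        return 1.0 if any(doc_id in expected_ids for doc_id in retrieved_ids) else 0.0
    # Granular behaviour from this PR: proportion of expected docs that were retrieved.
    expected_set = set(expected_ids)
    hits = sum(1 for doc_id in retrieved_ids if doc_id in expected_set)
    return hits / len(expected_ids)


# Same retrieval, two scoring modes:
print(compute_hit_rate(["id1", "id2", "id3", "id4"], ["id1", "id5", "id2"]))        # 1.0
print(compute_hit_rate(["id1", "id2", "id3", "id4"], ["id1", "id5", "id2"], True))  # 0.5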


metric_name: str = "hit_rate"

def compute(
self,
query: Optional[str] = None,
expected_ids: Optional[List[str]] = None,
retrieved_ids: Optional[List[str]] = None,
expected_texts: Optional[List[str]] = None,
retrieved_texts: Optional[List[str]] = None,
**kwargs: Any,
) -> RetrievalMetricResult:
"""Compute metric."""
if retrieved_ids is None or expected_ids is None:
raise ValueError("Retrieved ids and expected ids must be provided")
is_hit = any(id in expected_ids for id in retrieved_ids)
return RetrievalMetricResult(
score=1.0 if is_hit else 0.0,
)
if (
retrieved_ids is None
or expected_ids is None
or not retrieved_ids
or not expected_ids
):
raise ValueError("Both retrieved ids and expected ids must be provided")

expected_set = set(expected_ids)
hits = sum(1 for doc_id in retrieved_ids if doc_id in expected_set)
score = hits / len(expected_ids) if expected_ids else 0.0

class MRR(BaseRetrievalMetric):
"""MRR metric."""
return RetrievalMetricResult(score=score)

metric_name: str = "mrr"

class RR(BaseRetrievalMetric):
"""Reciprocal Rank (RR): Calculates the reciprocal rank of the first, and only the first, relevant retrieved document.
Returns 0 if no relevant retrieved docs are found.
"""

metric_name: str = "rr"

def compute(
self,
query: Optional[str] = None,
expected_ids: Optional[List[str]] = None,
retrieved_ids: Optional[List[str]] = None,
expected_texts: Optional[List[str]] = None,
Contributor:
I'm not entirely sure if these can be removed. Did you check if removing these args still satisfies the compute method signature for BaseRetrievalMetric?

Contributor Author (@AgenP), Apr 23, 2024:
Yep, you are correct. I did some research and it would violate the Liskov substitution principle, which is not ideal, particularly with an ABC and its abstract methods.

When making the changes, I will ensure the method signatures are kept the same

Thanks for highlighting this

Contributor:
Thanks!

retrieved_texts: Optional[List[str]] = None,
**kwargs: Any,
) -> RetrievalMetricResult:
"""Compute metric."""
if retrieved_ids is None or expected_ids is None:
raise ValueError("Retrieved ids and expected ids must be provided")
if (
retrieved_ids is None
or expected_ids is None
or not retrieved_ids
or not expected_ids
):
raise ValueError("Both retrieved ids and expected ids must be provided")
for i, id in enumerate(retrieved_ids):
if id in expected_ids:
return RetrievalMetricResult(
@@ -61,6 +67,43 @@ def compute(
)


class MRR(BaseRetrievalMetric):
"""Mean Reciprocal Rank (MRR): Sums up the reciprocal rank score for each relevant retrieved document.
Then divides by the count of relevant documents.
"""

Contributor:
I think this will break the current way we invoke MRR to be computed. While I agree that RR and this MRR computation are more technically correct, I feel it's a bit pedantic (and I'm pretty pedantic myself!) and might not be worth the hassle of ensuring backwards compatibility and risking breaking changes.

Contributor Author (@AgenP), Apr 23, 2024:
Yeah that makes sense. I completely understand.

What do you think about having a net new metric called SQMRR (single query mean reciprocal rank)?

It would boost the flexibility of the evaluation suite for devs.

Contributor:
Would SQMRR ultimately compute the MRR, though? I.e., after we take an average of all the SQMRR scores?

Contributor Author:
Yeah, to a certain degree. It would be a more granular version (since it scores based on multiple docs each time).

I could implement it as an option, using your parameter idea, in case the user wants multiple docs scored each time.

Let me know what you think

Contributor:
Okay, let's roll with that.
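
Not part of the diff: a minimal standalone sketch of that parameter idea (the use_granular flag name is hypothetical). The default path keeps the current first-relevant-hit reciprocal rank, and the opt-in path averages the reciprocal ranks of every relevant retrieved doc, as in this PR's granular computation.

from typing import List


def compute_mrr(
    expected_ids: List[str],
    retrieved_ids: List[str],
    use_granular: bool = False,  # hypothetical flag; default keeps the current behaviour
) -> float:
    """Single-query reciprocal-rank score."""
    if not expected_ids or not retrieved_ids:
        raise ValueError("Both retrieved ids and expected ids must be provided")
    expected_set = set(expected_ids)
    if not use_granular:
        # Current behaviour: reciprocal rank of the first relevant retrieved doc.
        for rank, doc_id in enumerate(retrieved_ids, start=1):
            if doc_id in expected_set:
                return 1.0 / rank
        return 0.0
    # Granular behaviour from this PR: average reciprocal rank over all relevant retrieved docs.
    reciprocal_ranks = [
        1.0 / rank
        for rank, doc_id in enumerate(retrieved_ids, start=1)
        if doc_id in expected_set
    ]
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0


# Same retrieval, two scoring modes:
print(compute_mrr(["id1", "id2"], ["id1", "id7", "id15", "id2"]))        # 1.0
print(compute_mrr(["id1", "id2"], ["id1", "id7", "id15", "id2"], True))  # (1/1 + 1/4) / 2 = 0.625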

metric_name: str = "mrr"

def compute(
self,
expected_ids: Optional[List[str]] = None,
retrieved_ids: Optional[List[str]] = None,
) -> RetrievalMetricResult:
"""Compute the Mean Reciprocal Rank given expected document IDs and retrieved document IDs."""
if (
retrieved_ids is None
or expected_ids is None
or not retrieved_ids
or not expected_ids
):
raise ValueError("Both retrieved ids and expected ids must be provided")

expected_set = set(expected_ids)
reciprocal_rank_sum = 0.0
relevant_docs_count = 0

for index, doc_id in enumerate(retrieved_ids):
if doc_id in expected_set:
relevant_docs_count += 1
reciprocal_rank_sum += 1.0 / (index + 1)

if relevant_docs_count > 0:
mrr_score = reciprocal_rank_sum / relevant_docs_count
return RetrievalMetricResult(score=mrr_score)
else:
return RetrievalMetricResult(score=0.0)
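
A quick worked check of the granular computation above (not part of the diff): with expected_ids = ["id1", "id2"] and retrieved_ids = ["id1", "id7", "id15", "id2"], the two relevant docs sit at ranks 1 and 4, so the score is (1/1 + 1/4) / 2 = 0.625, which is the same case exercised in the parametrized tests added below.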


class CohereRerankRelevancyMetric(BaseRetrievalMetric):
"""Cohere rerank relevancy metric."""

@@ -129,6 +172,7 @@ def compute(

METRIC_REGISTRY: Dict[str, Type[BaseRetrievalMetric]] = {
"hit_rate": HitRate,
"rr": RR,
"mrr": MRR,
"cohere_rerank_relevancy": CohereRerankRelevancyMetric,
}
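
Not part of the diff: a brief usage sketch, assuming the import path matches this file's location in llama-index-core, showing how the registered metrics can be looked up and computed for a single query.

from llama_index.core.evaluation.retrieval.metrics import METRIC_REGISTRY

expected_ids = ["id1", "id2", "id3"]
retrieved_ids = ["id3", "id1", "id2", "id4"]

for name in ("hit_rate", "rr", "mrr"):
    metric = METRIC_REGISTRY[name]()  # instantiate the metric class from the registry
    result = metric.compute(expected_ids=expected_ids, retrieved_ids=retrieved_ids)
    print(name, result.score)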
121 changes: 121 additions & 0 deletions llama-index-core/tests/evaluation/test_rr_mrr_hitrate.py
@@ -0,0 +1,121 @@
from llama_index.core.evaluation.retrieval.metrics import HitRate, RR, MRR
import pytest


# Test cases using pytest
@pytest.mark.parametrize(
("expected_ids", "retrieved_ids", "expected_result"),
[
(["id1", "id2", "id3"], ["id3", "id1", "id2", "id4"], 1.0),
(["id1", "id2", "id3", "id4"], ["id1", "id5", "id2"], 0.5),
(["id1", "id2"], ["id3", "id4"], 0.0),
],
)
def test_hit_rate(expected_ids, retrieved_ids, expected_result):
hr = HitRate()
result = hr.compute(expected_ids=expected_ids, retrieved_ids=retrieved_ids)
assert result.score == expected_result


@pytest.mark.parametrize(
("expected_ids", "retrieved_ids", "expected_result"),
[
# Test cases that reflect the correct computation of RR
(
["id1", "id2", "id3"],
["id3", "id1", "id2", "id4"],
1 / 1,
), # id3 is the first match, rank 1
(
["id1", "id2", "id3", "id4"],
["id5", "id1"],
1 / 2,
), # id1 is the first match, rank 2
(["id1", "id2"], ["id3", "id4"], 0.0), # No matches found
(
["id1", "id2"],
["id2", "id1", "id7"],
1 / 1,
), # id2 is the first match, rank 1
],
)
def test_rr(expected_ids, retrieved_ids, expected_result):
rr = RR()
result = rr.compute(expected_ids=expected_ids, retrieved_ids=retrieved_ids)
assert result.score == pytest.approx(expected_result)


@pytest.mark.parametrize(
("expected_ids", "retrieved_ids", "expected_result"),
[
(
["id1", "id2", "id3"],
["id3", "id1", "id2", "id4"],
(((1 / 1) + (1 / 2) + (1 / 3)) / 3),
),
(
["id1", "id2", "id3", "id4"],
["id1", "id2", "id5"],
(((1 / 1) + (1 / 2)) / 2),
),
(["id1", "id2"], ["id3", "id4"], 0.0),
(["id1", "id2"], ["id1", "id7", "id15", "id2"], (((1 / 1) + (1 / 4)) / 2)),
],
)
def test_mrr(expected_ids, retrieved_ids, expected_result):
mrr = MRR()
result = mrr.compute(expected_ids=expected_ids, retrieved_ids=retrieved_ids)
assert result.score == pytest.approx(expected_result)


# NOTE: The following test cases are specifically for handling ValueErrors
@pytest.mark.parametrize(
("expected_ids", "retrieved_ids"),
[
([], []), # Empty IDs should trigger ValueError
(
None,
["id3", "id1", "id2", "id4"],
), # None expected_ids should trigger ValueError
(["id1", "id2", "id3"], None), # None retrieved_ids should trigger ValueError
],
)
def test_hit_rate_exceptions(expected_ids, retrieved_ids):
hr = HitRate()
with pytest.raises(ValueError):
hr.compute(expected_ids=expected_ids, retrieved_ids=retrieved_ids)


@pytest.mark.parametrize(
("expected_ids", "retrieved_ids"),
[
# Test cases for handling exceptions
([], []), # Empty IDs should trigger ValueError
(
None,
["id3", "id1", "id2", "id4"],
), # None expected_ids should trigger ValueError
(["id1", "id2", "id3"], None), # None retrieved_ids should trigger ValueError
],
)
def test_rr_exceptions(expected_ids, retrieved_ids):
rr = RR()
with pytest.raises(ValueError):
rr.compute(expected_ids=expected_ids, retrieved_ids=retrieved_ids)


@pytest.mark.parametrize(
("expected_ids", "retrieved_ids"),
[
([], []), # Empty IDs should trigger ValueError
(
None,
["id3", "id1", "id2", "id4"],
), # None expected_ids should trigger ValueError
(["id1", "id2", "id3"], None), # None retrieved_ids should trigger ValueError
],
)
def test_mrr_exceptions(expected_ids, retrieved_ids):
mrr = MRR()
with pytest.raises(ValueError):
mrr.compute(expected_ids=expected_ids, retrieved_ids=retrieved_ids)
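
To run just the new test module (a standard pytest invocation from the repository root; adjust the path if your checkout differs):

pytest llama-index-core/tests/evaluation/test_rr_mrr_hitrate.py -q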