Optimize token_classification/rank.py for performance #1078

gogetron · 2024-03-31T19:49:43Z

Summary

This PR partially addresses #862

🎯 Purpose: Improve performance of issues_from_scores and get_label_quality_scores in token_classification/rank.py.

After profiling get_label_quality_scores, I found that the list comprehensions and the assert_valid_inputs function accounted for a significant share of the runtime. I added one test, test_assert_valid_class_labels_fails_with_str_labels, to ensure that an error is correctly raised when the labels are provided as strings. Validation is now much faster when the provided labels are numbers.
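One way validation can get faster for numeric labels is to check the dtype of a single flattened array instead of inspecting every label in Python. The sketch below only illustrates that idea; assert_integer_labels and its num_classes parameter are hypothetical names, and this is not the code in this PR nor cleanlab's assert_valid_inputs.

```python
import numpy as np


def assert_integer_labels(labels, num_classes):
    """Hypothetical validator: `labels` is a list of per-sentence label sequences."""
    # Flatten once so the checks below run in NumPy rather than per label.
    flat = np.concatenate([np.asarray(sentence) for sentence in labels])
    if not np.issubdtype(flat.dtype, np.integer):
        raise TypeError(f"labels must be integers, got dtype {flat.dtype}")
    if flat.min() < 0 or flat.max() >= num_classes:
        raise ValueError("labels must lie in [0, num_classes)")


# Integer labels pass silently; string labels such as [["B-PER", "O"]] raise TypeError.
assert_integer_labels([[0, 0, 1], [0, 1]], num_classes=2)
```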

In addition, most of the work can be batched, so I refactored the function to work in batches and added a test to ensure that the results are the same regardless of the batch size and the method chosen. I included the batch_size argument, as in the functions from other modules.
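To make the batching idea concrete, here is a minimal sketch, assuming the per-token score is the predicted probability of the given label and the sentence score is the minimum token score. sentence_min_scores is a hypothetical helper written for illustration; it is not the refactored get_label_quality_scores, and it assumes every sentence has at least one token.

```python
import numpy as np


def sentence_min_scores(labels, pred_probs, batch_size=1000):
    """Hypothetical helper: minimum token score per sentence, computed in batches."""
    sentence_scores = []
    for start in range(0, len(labels), batch_size):
        batch_labels = labels[start:start + batch_size]
        batch_probs = pred_probs[start:start + batch_size]
        lengths = [len(sentence) for sentence in batch_labels]
        # Stack all tokens of the batch so the lookup is one vectorized operation.
        flat_labels = np.concatenate(batch_labels)
        flat_probs = np.concatenate(batch_probs)
        token_scores = flat_probs[np.arange(len(flat_labels)), flat_labels]
        # Split the flat token scores back into one array per sentence.
        split_points = np.cumsum(lengths)[:-1]
        for scores in np.split(token_scores, split_points):
            sentence_scores.append(scores.min())
    return np.array(sentence_scores)


labels = [[0, 0, 1], [0, 1]]
pred_probs = [
    np.array([[0.9, 0.1], [0.7, 0.3], [0.05, 0.95]]),
    np.array([[0.8, 0.2], [0.8, 0.2]]),
]
one_by_one = sentence_min_scores(labels, pred_probs, batch_size=1)
all_at_once = sentence_min_scores(labels, pred_probs, batch_size=2)
```

Both calls should produce the same scores, which is the property the new batch-size test checks for the real function. Smaller batches bound the size of the temporary concatenated arrays at the cost of more Python-level iterations.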

I hit an edge case when I was profiling issues_from_scores. The code below raised an IndexError:

import numpy as np

from cleanlab.token_classification.rank import get_label_quality_scores, issues_from_scores

labels = [[0, 0, 1], [0, 1]]
pred_probs = [
    np.array([[0.9, 0.1], [0.7, 0.3], [0.05, 0.95]]),
    np.array([[0.8, 0.2], [0.8, 0.2]]),
]
sentence_scores, token_scores = get_label_quality_scores(labels, pred_probs)
issues_from_scores(sentence_scores, threshold=0.99)

The fix was to change the order of the comparisons in the while loop. This is a very rare case, but it is fixed now. In addition, keeping two separate lists reduces memory usage when sorting and filtering the objects.
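The ordering issue can be sketched as follows. This is an illustrative reconstruction, not the exact code in rank.py, and indices_below_threshold is a hypothetical name. Because Python's `and` short-circuits, putting the bounds check first means the array is never indexed once the cutoff reaches the end, which is what happens when every score is below the threshold.

```python
import numpy as np


def indices_below_threshold(scores, threshold):
    """Hypothetical helper: indices of scores below `threshold`, lowest score first."""
    scores = np.asarray(scores)
    ranking = np.argsort(scores)
    cutoff = 0
    # Bounds check first: if `scores[ranking[cutoff]] < threshold` were evaluated
    # first, it would raise IndexError when all scores are below the threshold.
    while cutoff < len(ranking) and scores[ranking[cutoff]] < threshold:
        cutoff += 1
    return ranking[:cutoff]


# Every score is below 0.99, so the loop runs to the end of the ranking.
all_low = indices_below_threshold([0.2, 0.1, 0.3], threshold=0.99)
```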

To measure memory I used the memory-profiler library. The code I used for benchmarking is copied below. I also sorted the imports in the modified files.

Code Setup

import random

import numpy as np

from cleanlab.token_classification.rank import get_label_quality_scores, issues_from_scores

np.random.seed(0)
random.seed(0)
%load_ext memory_profiler

TOTAL_EXAMPLES = 100_000
MAX_LENGTH = 100
NUM_CLASSES = 100


def create_dataset():
    labels = []
    pred_probs = []
    for _ in range(TOTAL_EXAMPLES):
        length = random.randint(2, MAX_LENGTH)
        new_labels = np.random.randint(NUM_CLASSES, size=length)
        labels.append(new_labels)
        probs = np.random.random((length, NUM_CLASSES))
        probs /= probs.sum(axis=1, keepdims=True)
        pred_probs.append(probs)
    return labels, pred_probs

# Create input data
labels, pred_probs = create_dataset()
# Execute once to avoid the tensorflow import time in the benchmark.
sentence_scores, token_scores = get_label_quality_scores(labels, pred_probs)

Current version

%%timeit
%memit issues_from_scores(sentence_scores, token_scores=token_scores)
# peak memory: 5676.50 MiB, increment: 814.52 MiB
# peak memory: 5677.61 MiB, increment: 350.69 MiB
# peak memory: 5676.58 MiB, increment: 348.70 MiB
# peak memory: 5676.59 MiB, increment: 347.72 MiB
# peak memory: 5676.59 MiB, increment: 347.73 MiB
# peak memory: 5676.77 MiB, increment: 347.91 MiB
# peak memory: 5676.77 MiB, increment: 347.73 MiB
# peak memory: 5676.78 MiB, increment: 347.73 MiB
# 4.82 s ± 124 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
%memit get_label_quality_scores(labels, pred_probs)
# peak memory: 9498.85 MiB, increment: 4169.62 MiB
# peak memory: 9579.55 MiB, increment: 4138.61 MiB
# peak memory: 9588.61 MiB, increment: 4147.38 MiB
# peak memory: 9617.54 MiB, increment: 4177.29 MiB
# peak memory: 9588.71 MiB, increment: 4148.34 MiB
# peak memory: 9584.19 MiB, increment: 4143.82 MiB
# peak memory: 9647.63 MiB, increment: 4252.44 MiB
# peak memory: 9533.61 MiB, increment: 4092.57 MiB
# 6.59 s ± 237 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This PR

%%timeit
%memit issues_from_scores(sentence_scores, token_scores=token_scores)
# peak memory: 5425.16 MiB, increment: 622.79 MiB
# peak memory: 5422.39 MiB, increment: 155.71 MiB
# peak memory: 5423.76 MiB, increment: 157.06 MiB
# peak memory: 5425.27 MiB, increment: 158.57 MiB
# peak memory: 5425.27 MiB, increment: 158.38 MiB
# peak memory: 5424.03 MiB, increment: 157.13 MiB
# peak memory: 5422.55 MiB, increment: 155.65 MiB
# peak memory: 5422.43 MiB, increment: 155.54 MiB
# 4.2 s ± 87.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
%memit get_label_quality_scores(labels, pred_probs)
# peak memory: 5658.94 MiB, increment: 840.13 MiB
# peak memory: 5549.62 MiB, increment: 739.04 MiB
# peak memory: 5658.11 MiB, increment: 867.00 MiB
# peak memory: 5682.33 MiB, increment: 882.35 MiB
# peak memory: 5526.24 MiB, increment: 714.20 MiB
# peak memory: 5470.04 MiB, increment: 654.19 MiB
# peak memory: 5572.88 MiB, increment: 757.04 MiB
# peak memory: 5610.20 MiB, increment: 809.95 MiB
# 3.09 s ± 48.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
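The measurements above rely on the IPython magics %timeit and %memit. Outside a notebook, a rough equivalent can be put together with the standard library. The sketch below swaps in tracemalloc and time.perf_counter for memory-profiler and %timeit; measure is a hypothetical helper, and its numbers are not comparable with the figures above, since tracemalloc only counts allocations it traces and adds overhead to the timing.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Hypothetical helper: return (result, elapsed_seconds, peak_bytes) for one call."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes since start().
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


# Stand-in workload; replace with e.g. get_label_quality_scores(labels, pred_probs).
result, elapsed, peak = measure(sorted, range(100_000, 0, -1))
print(f"{elapsed:.3f} s, peak {peak / 2**20:.2f} MiB")
```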

Testing

🔍 Testing Done: Existing test suite plus two newly added tests.


codecov · 2024-03-31T19:59:12Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.28%. Comparing base (abd0924) to head (a89641b).
Report is 1 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1078      +/-   ##
==========================================
- Coverage   96.15%   94.28%   -1.87%     
==========================================
  Files          74       74              
  Lines        5850     5865      +15     
  Branches     1044     1046       +2     
==========================================
- Hits         5625     5530      -95     
- Misses        134      254     +120     
+ Partials       91       81      -10


gogetron added 6 commits March 28, 2024 19:55

Fix: IndexError when all probabilities are below threshold

fdeaef0

Perf: issues_from_scores separated lists

8c8e345

Perf: faster array and list creation

cdc0430

Perf: Faster inputs validation

d20be0f

Add: test_assert_valid_class_labels_fails_with_str_labels

d5c6ed5

Perf: get_label_quality_scores batch_size

fef34f9

gogetron added 2 commits March 31, 2024 22:06

Fix: F401 imported but unused

c24908b

Fix: Remove deadline for batch_size test

a89641b
