
Optimize find_best_temp_scaler for performance #1075

Open · wants to merge 3 commits into master
Conversation

@gogetron gogetron commented Mar 30, 2024

Summary

This PR partially addresses #862

🎯 Purpose: Improve performance of find_best_temp_scaler

While profiling get_label_quality_multiannotator, I found that find_best_temp_scaler dominated the runtime when calibrate_probs was True. Almost all of that time was spent in compute_soft_cross_entropy. The idea is to skip the value_counts call (which invokes np.unique) and instead rely on cheaper vectorized NumPy operations wherever possible.
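To illustrate the general idea (this is a hedged sketch, not the PR's actual diff — the function name, signature, and NaN convention below are assumptions), per-row annotator label counts can be computed with a single np.bincount over flattened (row, label) indices, avoiding a per-row value_counts/np.unique pass entirely:

```python
import numpy as np

def per_row_label_counts(labels, num_classes):
    """Count how often each class appears in each row of an (N, M) annotator
    label matrix, where NaN marks a missing annotation. Returns an (N, K) array."""
    N, M = labels.shape
    observed = ~np.isnan(labels)
    # Row index for every entry, broadcast to the label matrix's shape.
    rows = np.broadcast_to(np.arange(N)[:, None], (N, M))
    # Flatten each observed (row, label) pair into a single index row*K + label,
    # then count all pairs at once with one bincount call.
    flat = rows[observed] * num_classes + labels[observed].astype(int)
    counts = np.bincount(flat, minlength=N * num_classes)
    return counts.reshape(N, num_classes)
```

A single bincount over integer codes is typically much cheaper than repeated np.unique calls, since it makes one pass over the data with no sorting.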

For memory measurements I used the memory-profiler library. The code I used for benchmarking is copied below. In addition, I sorted the imports in the modified files.

Code Setup

import numpy as np

from cleanlab.internal.multiannotator_utils import find_best_temp_scaler

np.random.seed(0)
%load_ext memory_profiler

M = 5
N = 100_000
K = 20
labels = np.random.randint(K, size=(N, M))
pred_probs = np.random.random((N, K))
pred_probs /= pred_probs.sum(axis=1, keepdims=True)

labels_with_some_nans = labels.astype(np.float64)
labels_with_some_nans[:labels_with_some_nans.shape[0] // 2, np.random.randint(M, size=(M))] = np.nan 

Current version

%%timeit
%memit find_best_temp_scaler(labels_with_some_nans, pred_probs)
# peak memory: 707.00 MiB, increment: 77.20 MiB
# peak memory: 708.11 MiB, increment: 78.07 MiB
# peak memory: 708.11 MiB, increment: 78.07 MiB
# peak memory: 708.11 MiB, increment: 78.07 MiB
# peak memory: 708.11 MiB, increment: 78.07 MiB
# peak memory: 708.12 MiB, increment: 78.08 MiB
# peak memory: 708.12 MiB, increment: 78.07 MiB
# peak memory: 708.12 MiB, increment: 78.07 MiB
# 21.5 s ± 479 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
%memit find_best_temp_scaler(labels, pred_probs)
# peak memory: 708.32 MiB, increment: 78.27 MiB
# peak memory: 708.32 MiB, increment: 78.07 MiB
# peak memory: 708.32 MiB, increment: 78.07 MiB
# peak memory: 708.32 MiB, increment: 78.07 MiB
# peak memory: 708.32 MiB, increment: 78.07 MiB
# peak memory: 708.32 MiB, increment: 78.07 MiB
# peak memory: 708.32 MiB, increment: 78.07 MiB
# peak memory: 708.32 MiB, increment: 78.07 MiB
# 21.2 s ± 338 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This PR

%%timeit
%memit find_best_temp_scaler(labels_with_some_nans, pred_probs)
# peak memory: 368.92 MiB, increment: 73.48 MiB
# peak memory: 370.00 MiB, increment: 93.27 MiB
# peak memory: 370.07 MiB, increment: 93.28 MiB
# peak memory: 370.25 MiB, increment: 93.46 MiB
# peak memory: 370.26 MiB, increment: 93.27 MiB
# peak memory: 370.20 MiB, increment: 93.21 MiB
# peak memory: 370.19 MiB, increment: 93.20 MiB
# peak memory: 370.26 MiB, increment: 93.27 MiB
# 716 ms ± 7.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
%memit find_best_temp_scaler(labels, pred_probs)
# peak memory: 370.46 MiB, increment: 93.33 MiB
# peak memory: 370.47 MiB, increment: 93.27 MiB
# peak memory: 370.47 MiB, increment: 93.27 MiB
# peak memory: 370.47 MiB, increment: 93.27 MiB
# peak memory: 370.41 MiB, increment: 93.21 MiB
# peak memory: 370.47 MiB, increment: 93.27 MiB
# peak memory: 370.47 MiB, increment: 93.27 MiB
# peak memory: 370.41 MiB, increment: 93.21 MiB
# 756 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Testing

🔍 Testing Done: Existing tests.


@gogetron gogetron changed the title Perf: compute_soft_cross_entropy Optimize find_best_temp_scaler for performance Mar 30, 2024

codecov bot commented Mar 30, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.25%. Comparing base (4e2cafb) to head (80a5eb8).
Report is 7 commits behind head on master.

Current head 80a5eb8 differs from pull request most recent head 44c4329

Please upload reports for the commit 44c4329 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1075      +/-   ##
==========================================
+ Coverage   96.13%   96.25%   +0.12%     
==========================================
  Files          80       74       -6     
  Lines        6110     5854     -256     
  Branches     1074     1045      -29     
==========================================
- Hits         5874     5635     -239     
+ Misses        140      130      -10     
+ Partials       96       89       -7     


@jwmueller jwmueller requested a review from huiwengoh May 21, 2024 01:48