Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Introduce top-k algo to HNSW #4114

Closed
wants to merge 15 commits into from
Closed

feat: Introduce top-k algo to HNSW #4114

wants to merge 15 commits into from

Conversation

KShivendu
Copy link
Member

@KShivendu KShivendu commented Apr 25, 2024

HNSW equivalent of #4037 PR.

New Feature Submissions:

  1. Does your submission pass tests?
  2. Have you formatted your code locally using cargo +nightly fmt --all command prior to submission?
  3. Have you checked your code using cargo clippy --all --all-features command?

Todo:

  • Benchmark against prev implementation

@@ -11,7 +11,7 @@ use crate::types::{ScoreType, ScoredPointOffset};
#[derive(Default)]
pub struct TopK {
k: usize,
elements: Vec<Reverse<ScoredPointOffset>>,
pub elements: Vec<Reverse<ScoredPointOffset>>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer not to expose this.
You can either return reference to a slice in a method, or deconstruct the whole thing into vector if you need ownership

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

* feat: Allow building any branch with GH dev image builder workflow

* fix: Trim feat, fix, etc

* fix: Replace all / with -
None => true,
Some(removed) => removed.idx != score_point.idx,
};
let was_added = self.nearest.push(score_point);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the deduplication based on removed.idx != score_point.idx, not required anymore?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't return index of the point if it was discarded or not, because we would need to return a whole list in this case. So the TopK return type is different and it is just bool

None => ScoreType::min_value(),
Some(worst_of_the_best) => worst_of_the_best.score,
}
self.nearest.threshold()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it Ok. to use threshold here given that it is lazy and might not represent the full information of the top values?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be not Ok, it is up to benchmarks

@generall
Copy link
Member

It doesn't look like the change gives us a benefit:
image

Even though the TopK might improve candidate selection performance, it looks like the change in search_context update logic have more negative effect on overall results

@generall generall closed this May 1, 2024
@KShivendu
Copy link
Member Author

We see small gains (<10%) in latency with lower dims datasets like glove-100 but it also costs us some precision. So we decided to not merge this.

glove-100-latency

@KShivendu KShivendu reopened this May 1, 2024
@generall
Copy link
Member

I seems this PR is not longer neeeded. Closing it

@generall generall closed this May 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants