Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues API: Too many collections #4161

Open
wants to merge 1 commit into
base: dev
Choose a base branch
from

Conversation

coszio
Copy link
Contributor

@coszio coszio commented May 2, 2024

Supersedes #3652, same thing but now event-based
Tracked in #3471
Needs #4139

This PR determines to have too many collections when all of the following are true:

  • It has above 30 collections
  • The density of points in the collections is low (avg_points / num_collections)

The current way of counting the amount of points in the collections is using the count api, aproximate.

All Submissions:

  • Contributions should target the dev branch. Did you create your branch from dev?
  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  1. Does your submission pass tests?
  2. Have you formatted your code locally using cargo +nightly fmt --all command prior to submission?
  3. Have you checked your code using cargo clippy --all --all-features command?

const MANY_COLLECTIONS: usize = 30;

/// Defines how many points are considered too few as an average per collection
const FEW_POINTS: usize = 10000;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to define a number in kilobytes instead. It represents the actual size in bytes, which is more interesting than just a count. We use it in many places, including for our indexing threshold:

indexing_threshold_kb: 20000

Speaking of it, we might want to increase this number to more closely match the default indexing threshold. Note that the threshold is per segment.

Wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why kilobytes is a better metric here, since the problem we want to spot is that the user has a dynamic number of collections. E.g. having many named vectors and/or high-dimensional ones does not relate to the root of the issue, IMO.

Would you elaborate why you think it is a better metric in this case?

Regarding the threshold, I am happy to move it around. The choice was completely arbitrary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thought was this: we count collections that are below some point threshold. I thought that it would make sense to match that up with the indexing threshold, in which case kilobytes should be used. I thought about it that way because it's usually one of the problems we mention when a lot of collections are created, many of them will be too small to make use of indexing.

But, if you designed it a bit different or if we just trigger from n collections onward, the specific metric probably doesn't really matter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we want to spot is that the user has a dynamic number of collections.

Could you elaborate on what this means, a dynamic number of collections?

Wouldn't we just trigger this issue above 30 (?) collections, no matter their contents?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me, a dynamic number of collections means that creation or deletion of collections come from user interaction, not from admin decisions.

Valid cases for creating new collections might be:

  • using new embedding models
  • refactoring, new features, etc.

I think it might still be possible to reach the fixed limit while having "correct" usage, that's why I included the density threshold too.

@coszio coszio force-pushed the unindexed-field-issue-event-based branch 2 times, most recently from a5638e5 to 26758c3 Compare May 6, 2024 15:56
Base automatically changed from unindexed-field-issue-event-based to dev May 9, 2024 13:39
@coszio coszio force-pushed the too-many-collections-event-based branch from 9c79a63 to a0a23cf Compare May 10, 2024 12:54
@coszio coszio marked this pull request as ready for review May 10, 2024 12:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants