Issues API: Too many collections #4161

coszio · 2024-05-02T21:34:10Z

Supersedes #3652, same thing but now event-based
Tracked in #3471
Needs #4139

This PR reports too many collections when all of the following are true:

There are more than 30 collections
The density of points across the collections is low (average points per collection)

Points in the collections are currently counted with the count API, in approximate mode.
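The rule above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the constants mirror the values shown in the review diff, while the function name, its input (one approximate point count per collection), and the strictness of the comparisons are assumptions.

```rust
/// Collection count above which the issue may be reported (value from the PR diff).
const MANY_COLLECTIONS: usize = 30;
/// Average points per collection below which density counts as low (value from the PR diff).
const FEW_POINTS: usize = 10_000;

/// Hypothetical helper: `point_counts` holds one approximate point count per collection.
fn has_too_many_collections(point_counts: &[usize]) -> bool {
    let num_collections = point_counts.len();
    // Condition 1: strictly more than 30 collections (assumed strict).
    if num_collections <= MANY_COLLECTIONS {
        return false;
    }
    let total_points: usize = point_counts.iter().sum();
    // Safe division: num_collections is greater than 30 at this point.
    let avg_points = total_points / num_collections;
    // Condition 2: low average number of points per collection.
    avg_points < FEW_POINTS
}
```

Both conditions must hold, so a deployment with many well-filled collections would not be flagged in this sketch.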

All Submissions:

Contributions should target the dev branch. Did you create your branch from dev?
Have you followed the guidelines in our Contributing document?
Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

Does your submission pass tests?
Have you formatted your code locally using cargo +nightly fmt --all command prior to submission?
Have you checked your code using cargo clippy --all --all-features command?

timvisee · 2024-05-03T15:45:52Z

lib/collection/src/problems/too_many_collections.rs

+    const MANY_COLLECTIONS: usize = 30;
+
+    /// Defines how many points are considered too few as an average per collection
+    const FEW_POINTS: usize = 10000;


We might want to define a number in kilobytes instead. It represents the actual size in bytes, which is more interesting than just a count. We use it in many places, including for our indexing threshold:

qdrant/config/config.yaml

Line 112 in 97c107f

indexing_threshold_kb: 20000

Speaking of it, we might want to increase this number to more closely match the default indexing threshold. Note that the threshold is per segment.

Wdyt?

coszio · 2024-05-03

I don't see why kilobytes would be a better metric here, since the problem we want to spot is that the user has a dynamic number of collections. E.g. having many named vectors and/or high-dimensional ones does not relate to the root of the issue, IMO.

Could you elaborate on why you think it is a better metric in this case?

Regarding the threshold, I am happy to move it around. The choice was completely arbitrary.

timvisee · 2024-05-07

My thought was this: we count collections that are below some point threshold. I thought it would make sense to match that up with the indexing threshold, in which case kilobytes should be used. I thought about it that way because it's usually one of the problems we mention when a lot of collections are created: many of them will be too small to make use of indexing.

But if you designed it a bit differently, or if we just trigger from n collections onward, the specific metric probably doesn't really matter.
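To make the two metrics concrete, here is a hedged sketch of the size-based variant discussed above. It is not code from this PR. The threshold is the `indexing_threshold_kb` default quoted earlier; the size estimate (dense `f32` vectors, no payload or index overhead), both function names, and the choice to compare whole collections rather than individual segments are assumptions made for illustration.

```rust
/// Default `indexing_threshold_kb` as quoted from config/config.yaml in this thread.
const INDEXING_THRESHOLD_KB: usize = 20_000;

/// Rough vector storage estimate in kilobytes.
/// Assumes dense f32 vectors and ignores payload and index overhead (hypothetical helper).
fn estimated_size_kb(points: usize, dim: usize) -> usize {
    points * dim * std::mem::size_of::<f32>() / 1024
}

/// Counts collections whose estimated size is below the indexing threshold.
/// Each entry is `(points, vector_dimension)` for one collection.
/// Simplification: the real threshold applies per segment, not per collection.
fn count_small_collections(collections: &[(usize, usize)]) -> usize {
    collections
        .iter()
        .filter(|&&(points, dim)| estimated_size_kb(points, dim) < INDEXING_THRESHOLD_KB)
        .count()
}
```

Under these assumptions, two collections with the same point count can land on different sides of the threshold depending on vector dimension, which is the trade-off between the two metrics raised in the thread.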

timvisee · 2024-05-07

"we want to spot is that the user has a dynamic number of collections."

Could you elaborate on what this means, a dynamic number of collections?

Wouldn't we just trigger this issue above 30 (?) collections, no matter their contents?

coszio · 2024-05-10

For me, a dynamic number of collections means that the creation or deletion of collections comes from user interaction, not from admin decisions.

Valid cases for creating new collections might be:

using new embedding models

refactoring, new features, etc.

I think it might still be possible to reach the fixed limit with "correct" usage, which is why I included the density threshold too.

coszio mentioned this pull request May 2, 2024

Issues API: Too many collections #3652

Closed

6 tasks

coszio force-pushed the too-many-collections-event-based branch from 2ebc202 to 9c79a63 Compare May 2, 2024 21:35

github-actions bot mentioned this pull request May 2, 2024

Flaky test segment_builder_test::test_building_cancellation #2723

Open

coszio mentioned this pull request May 3, 2024

Tracking issue: Issues API #3471

Open

6 tasks

timvisee reviewed May 3, 2024

coszio force-pushed the unindexed-field-issue-event-based branch 2 times, most recently from a5638e5 to 26758c3 Compare May 6, 2024 15:56

Base automatically changed from unindexed-field-issue-event-based to dev May 9, 2024 13:39

implement too many collections issue

a0a23cf

coszio force-pushed the too-many-collections-event-based branch from 9c79a63 to a0a23cf Compare May 10, 2024 12:54

coszio marked this pull request as ready for review May 10, 2024 12:55

timvisee approved these changes May 21, 2024
