
Clarity and Understanding around Prefilter performance and Schema design #1286

Open
OptimusLime opened this issue May 9, 2024 · 2 comments
Labels
documentation Improvements or additions to documentation

Comments

@OptimusLime

OptimusLime commented May 9, 2024

Description

Love the library so far; thank you for all your work. I'm currently using LanceDB inside the Godot game engine via the Rust API, and it's a breeze.

Use Case:

I have multiple characters in an environment, and I am storing all of their individual memories in a single table.

Setup:

  1. Each character has their own v4 UUID string identifier.
  2. Each entry is a memory unique to one of the characters.
  3. To create the embedding: a character's memories are actions with descriptions. Those descriptions are converted to embedding vectors and added to LanceDB. There is a lot of similarity in actions across agents.

Issue:

  1. Searching a query string for a specific character without a prefilter finds other characters' similar memories first, and post-filtering the ANN results then yields zero results for that character.

Clarification in Docs Needed:

  1. What is the best practice for restricting an embedding search to a single character's memories?
  2. If characters share a single table and every query uses a prefilter, what are the performance implications?
  3. Is there an alternative structure that would maximize performance?
  4. Is it better to create many unique per-character tables or to prefilter on a single large character memory table?

Example:

Table: characters
Character [Paul] plays basketball at 10:20am.
Character [Cody] plays basketball at 10:30am.

let embed = get_embedding("basketball");

let empty_cody_memories = table.query().nearest_to(embed).filter("name == 'Cody'").limit(1).execute_stream();
let cody_prefiltered_memories = table.query().nearest_to(embed).prefilter(true).filter("name == 'Cody'").limit(1).execute_stream();

empty_cody_memories has no entries: when searching for "basketball," the embedding is close to BOTH Paul's and Cody's memories, so the search returns Paul's memory as the nearest match, the post-filter name == 'Cody' then discards it, and I never see Cody's basketball memory.

cody_prefiltered_memories has an entry because we first remove all the non-Cody memories from consideration and then perform the nearest-neighbor search, yielding only Cody's basketball memories as expected.
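The difference can be sketched without LanceDB at all. The toy Rust program below (all names, descriptions, and 2-D "embeddings" are made up for illustration; this is not the LanceDB API) simulates post-filter vs. pre-filter over the two memories above:

```rust
// Toy, self-contained simulation of post-filter vs. pre-filter search.
// The embeddings are hypothetical 2-D stand-ins for real vectors.

#[derive(Clone)]
struct Memory {
    name: &'static str,
    description: &'static str,
    embedding: [f32; 2],
}

const QUERY: [f32; 2] = [1.0, 0.05]; // stand-in embedding for "basketball"

fn sample_rows() -> Vec<Memory> {
    vec![
        Memory { name: "Paul", description: "plays basketball at 10:20am", embedding: [1.0, 0.0] },
        Memory { name: "Cody", description: "plays basketball at 10:30am", embedding: [0.9, 0.1] },
    ]
}

// Squared Euclidean distance between two embeddings.
fn dist(a: [f32; 2], b: [f32; 2]) -> f32 {
    (a[0] - b[0]).powi(2) + (a[1] - b[1]).powi(2)
}

/// Post-filter: take the `limit` nearest rows first, THEN filter by name.
fn postfilter_search(rows: &[Memory], query: [f32; 2], name: &str, limit: usize) -> Vec<Memory> {
    let mut sorted = rows.to_vec();
    sorted.sort_by(|a, b| dist(a.embedding, query).partial_cmp(&dist(b.embedding, query)).unwrap());
    sorted.into_iter().take(limit).filter(|m| m.name == name).collect()
}

/// Pre-filter: drop non-matching rows first, THEN take the `limit` nearest.
fn prefilter_search(rows: &[Memory], query: [f32; 2], name: &str, limit: usize) -> Vec<Memory> {
    let mut matching: Vec<Memory> = rows.iter().filter(|m| m.name == name).cloned().collect();
    matching.sort_by(|a, b| dist(a.embedding, query).partial_cmp(&dist(b.embedding, query)).unwrap());
    matching.into_iter().take(limit).collect()
}

fn main() {
    // Paul's row is closer to the query, so post-filter with limit=1 keeps
    // only Paul, and the name filter then empties the result set entirely.
    let post = postfilter_search(&sample_rows(), QUERY, "Cody", 1);
    let pre = prefilter_search(&sample_rows(), QUERY, "Cody", 1);
    println!("post-filter hits: {}", post.len()); // 0
    for m in &pre {
        println!("pre-filter hit: {} {}", m.name, m.description); // Cody's memory
    }
}
```

With limit = 1 the post-filter path returns nothing, while the pre-filter path returns Cody's memory; this is exactly the behavior described above.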

Link

postfilter Rust API ref, which has a small statement on prefilter

@OptimusLime OptimusLime added the documentation Improvements or additions to documentation label May 9, 2024
@wjones127
Contributor

  1. What is the best practice for restricting an embedding search to a single character's memories?
  2. If characters share a single table and every query uses a prefilter, what are the performance implications?

The determining factor is how selective the filter is.

  1. If the filter narrows the table down to < 1,000 rows, then prefilter + (exact) KNN search makes the most sense. You can get this by not creating an index on the column and using a prefilter.
  2. Otherwise, when there are many matching rows, prefilter + ANN search makes more sense. If the filter isn't very selective (matches a lot of data), this is pretty quick. But if it only matches, say, 10% of rows, then you will see somewhat slower queries, as the search will have to throw out about 90% of ANN matches.
  3. Is there an alternative structure to pursue to maximize performance?

We might later implement a composite index, where you can include certain metadata columns within the ANN index itself. The character ID could then be stored alongside the vector in the index, which would make prefiltering on that column a bit faster.

  4. Is it better to create many unique character tables or prefilter on a single large character memory table?

Creating many unique character tables could also work. You'd have to benchmark to see if that's optimal for your use case right now.
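As a back-of-the-envelope sketch of the selectivity point above (the formula is my own illustration, not how LanceDB actually schedules searches): to end up with k results after a filter that matches a fraction `selectivity` of rows, an ANN pass must examine roughly k / selectivity candidates, so a 10%-selective filter discards about 90% of what the index returns:

```rust
/// Rough estimate: candidates an ANN pass must examine so that, after a
/// filter matching a fraction `selectivity` of rows, about `k` survive.
/// Illustrative arithmetic only, not a LanceDB internal.
fn candidates_needed(k: usize, selectivity: f64) -> usize {
    (k as f64 / selectivity).ceil() as usize
}

fn main() {
    // Barely selective filter (90% of rows match): little extra work.
    println!("{}", candidates_needed(10, 0.9)); // 12
    // 10%-selective filter: ~10x the candidates, ~90% thrown away.
    println!("{}", candidates_needed(10, 0.1)); // 100
}
```

This is why a highly selective filter over a large indexed table is the slow case for prefilter + ANN, while a barely selective filter costs almost nothing extra.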

@OptimusLime
Author

OptimusLime commented May 13, 2024

The determining factor is how selective the filter is.

  1. If the filter narrows the table down to < 1,000 rows, then prefilter + (exact) KNN search makes the most sense. You can get this by not creating an index on the column and using a prefilter.
  2. Otherwise, when there are many matching rows, prefilter + ANN search makes more sense. If the filter isn't very selective (matches a lot of data), this is pretty quick. But if it only matches, say, 10% of rows, then you will see somewhat slower queries, as the search will have to throw out about 90% of ANN matches.

As of 0.4.20, I only saw two example files in the Rust examples directory, and neither of them comes close to demonstrating (1) or (2) above.

You're going to think I'm being so rude; please excuse me. This issue asks for a specific conceptual example matching our data, but your response, unfortunately, doesn't point toward examples or code, and it's filled with jargon I'm trying hard to parse.

Are there Node.js, Python, or Rust examples close to what I'm asking? For example, is there anything that demonstrates performing exact KNN versus ANN? I wish I were an expert at LanceDB like you, but some clues as to best practices with the library would be helpful.

  3. Is there an alternative structure to pursue to maximize performance?

We might later implement a composite index, where you can include certain metadata columns within the ANN index itself. The character ID could then be stored alongside the vector in the index, which would make prefiltering on that column a bit faster.

Very cool!
