
Clarity and Understanding around Prefilter performance and Schema design #1286

Open
OptimusLime opened this issue May 9, 2024 · 2 comments
Labels
documentation Improvements or additions to documentation

Comments

@OptimusLime

OptimusLime commented May 9, 2024

Description

Love the library so far; thank you for all your work. I'm currently using LanceDB inside the Godot game engine via the Rust API, and it's a breeze.

Use Case:

I have multiple characters in an environment, and I am storing all of their individual memories in a single table.

Setup:

  1. Each character has their own v4 UUID string identifier.
  2. Each entry is a memory unique to one of the characters.
  3. To create the embedding: a character's memories are actions with descriptions. Those descriptions are converted to embedding vectors and added to LanceDB. There is a lot of similarity in actions across agents.

Issue:

  1. Searching a query string for a specific character without a prefilter finds other characters' similar memories first, and post-filtering the ANN results then yields zero results for that character.

Clarification in Docs Needed:

  1. What is the best practice for restricting an embedding search to a single character's memories?
  2. If characters share a single table and every query uses a prefilter, what are the performance implications?
  3. Is there an alternative structure that would maximize performance?
  4. Is it better to create many unique per-character tables or to prefilter on a single large character memory table?

Example:

Table: characters
Character [Paul] plays basketball at 10:20am.
Character [Cody] plays basketball at 10:30am.

let embed = get_embedding("basketball");

let empty_cody_memories = table.query().nearest_to(embed).filter("name == 'Cody'").limit(1).execute_stream();
let cody_prefiltered_memories = table.query().nearest_to(embed).prefilter(true).filter("name == 'Cody'").limit(1).execute_stream();

empty_cody_memories has no entries: when searching for "basketball," the embedding is close to BOTH Paul's and Cody's memories, so the search returns Paul's memory as the nearest match, the post-filter name == 'Cody' then discards it, and I never see Cody's basketball memory.

cody_prefiltered_memories has an entry because we first remove all the non-Cody memories from consideration and then perform the nearest-neighbor search, yielding only Cody's basketball memories as expected.
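The difference can be sketched without LanceDB at all. The toy Rust program below (all names, descriptions, and 2-D "embeddings" are made up for illustration; this is not the LanceDB API) simulates post-filter vs. pre-filter over the two memories above:

```rust
// Toy, self-contained simulation of post-filter vs. pre-filter search.
// The embeddings are hypothetical 2-D stand-ins for real vectors.

#[derive(Clone)]
struct Memory {
    name: &'static str,
    description: &'static str,
    embedding: [f32; 2],
}

const QUERY: [f32; 2] = [1.0, 0.05]; // stand-in embedding for "basketball"

fn sample_rows() -> Vec<Memory> {
    vec![
        Memory { name: "Paul", description: "plays basketball at 10:20am", embedding: [1.0, 0.0] },
        Memory { name: "Cody", description: "plays basketball at 10:30am", embedding: [0.9, 0.1] },
    ]
}

// Squared Euclidean distance between two embeddings.
fn dist(a: [f32; 2], b: [f32; 2]) -> f32 {
    (a[0] - b[0]).powi(2) + (a[1] - b[1]).powi(2)
}

/// Post-filter: take the `limit` nearest rows first, THEN filter by name.
fn postfilter_search(rows: &[Memory], query: [f32; 2], name: &str, limit: usize) -> Vec<Memory> {
    let mut sorted = rows.to_vec();
    sorted.sort_by(|a, b| dist(a.embedding, query).partial_cmp(&dist(b.embedding, query)).unwrap());
    sorted.into_iter().take(limit).filter(|m| m.name == name).collect()
}

/// Pre-filter: drop non-matching rows first, THEN take the `limit` nearest.
fn prefilter_search(rows: &[Memory], query: [f32; 2], name: &str, limit: usize) -> Vec<Memory> {
    let mut matching: Vec<Memory> = rows.iter().filter(|m| m.name == name).cloned().collect();
    matching.sort_by(|a, b| dist(a.embedding, query).partial_cmp(&dist(b.embedding, query)).unwrap());
    matching.into_iter().take(limit).collect()
}

fn main() {
    // Paul's row is closer to the query, so post-filter with limit=1 keeps
    // only Paul, and the name filter then empties the result set entirely.
    let post = postfilter_search(&sample_rows(), QUERY, "Cody", 1);
    let pre = prefilter_search(&sample_rows(), QUERY, "Cody", 1);
    println!("post-filter hits: {}", post.len()); // 0
    for m in &pre {
        println!("pre-filter hit: {} {}", m.name, m.description); // Cody's memory
    }
}
```

With limit = 1 the post-filter path returns nothing, while the pre-filter path returns Cody's memory; this is exactly the behavior described above.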

Link

postfilter Rust API ref, which has a small statement on prefilter

@OptimusLime OptimusLime added the documentation Improvements or additions to documentation label May 9, 2024
@wjones127
Contributor

  1. What is the best practice for restricting an embedding search to a single character's memories?
  2. If characters share a single table and every query uses a prefilter, what are the performance implications?

The determining factor is how selective the filter is.

  1. If the filter narrows the table down to < 1,000 rows, then prefilter + (exact) KNN search makes the most sense. You can get this by not creating an index on the column and using a prefilter.
  2. Otherwise, when there are many matching rows, prefilter + ANN search makes more sense. If the filter isn't very selective (matches a lot of data), this is pretty quick. But if it only matches, say, 10% of rows, then you will see somewhat slower queries, as the search will have to throw out about 90% of ANN matches.
  3. Is there an alternative structure to pursue to maximize performance?

We might later implement a composite index, where you can include certain metadata columns within the ANN index itself. The character ID could then be stored alongside the vector in the index, which would make prefiltering on that column a bit faster.

  4. Is it better to create many unique character tables or prefilter on a single large character memory table?

Creating many unique character tables could also work. You'd have to benchmark to see if that's optimal for your use case right now.
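As a back-of-the-envelope sketch of the selectivity point above (the formula is my own illustration, not how LanceDB actually schedules searches): to end up with k results after a filter that matches a fraction `selectivity` of rows, an ANN pass must examine roughly k / selectivity candidates, so a 10%-selective filter discards about 90% of what the index returns:

```rust
/// Rough estimate: candidates an ANN pass must examine so that, after a
/// filter matching a fraction `selectivity` of rows, about `k` survive.
/// Illustrative arithmetic only, not a LanceDB internal.
fn candidates_needed(k: usize, selectivity: f64) -> usize {
    (k as f64 / selectivity).ceil() as usize
}

fn main() {
    // Barely selective filter (90% of rows match): little extra work.
    println!("{}", candidates_needed(10, 0.9)); // 12
    // 10%-selective filter: ~10x the candidates, ~90% thrown away.
    println!("{}", candidates_needed(10, 0.1)); // 100
}
```

This is why a highly selective filter over a large indexed table is the slow case for prefilter + ANN, while a barely selective filter costs almost nothing extra.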

@OptimusLime
Author

OptimusLime commented May 13, 2024

The determining factor is how selective the filter is.

  1. If the filter narrows the table down to < 1,000 rows, then prefilter + (exact) KNN search makes the most sense. You can get this by not creating an index on the column and using a prefilter.
  2. Otherwise, when there are many matching rows, prefilter + ANN search makes more sense. If the filter isn't very selective (matches a lot of data), this is pretty quick. But if it only matches, say, 10% of rows, then you will see somewhat slower queries, as the search will have to throw out about 90% of ANN matches.

As of 0.4.20, I only saw two example files in the Rust examples directory, and neither of them comes close to demonstrating (1) or (2) above.

You're going to think I'm being so rude; please excuse me. This issue asks for a specific conceptual example matching our data, but your response, unfortunately, doesn't point toward examples or code, and it's filled with jargon I'm trying hard to parse.

Are there Node.js, Python, or Rust examples close to what I'm asking? For example, is there anything that demonstrates performing exact KNN versus ANN? I wish I were an expert at LanceDB like you, but some clues as to best practices with the library would be helpful.

  3. Is there an alternative structure to pursue to maximize performance?

We might later implement a composite index, where you can include certain metadata columns within the ANN index itself. The character ID could then be stored alongside the vector in the index, which would make prefiltering on that column a bit faster.

Very cool!
