Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: count number of documents using hnswlib #1759

Open
jupyterjazz opened this issue Aug 23, 2023 · 11 comments
Open

refactor: count number of documents using hnswlib #1759

jupyterjazz opened this issue Aug 23, 2023 · 11 comments
Labels
area/document-index Concerning Document Index or a Document Index backend good-first-issue Suitable as your first contribution to DocArray!

Comments

@jupyterjazz
Copy link
Contributor

jupyterjazz commented Aug 23, 2023

Data storage in HnswDocumentIndex works in the following way:

  1. Vectors are stored on disk using hnswlib.
  2. All other types of data are saved in an SQLITE database.

One of the operations we frequently perform is determining the total number of documents (num_docs()). However, the only way to get number of documents from SQLITE is by scanning the entire table. Even though we've made efforts to reduce the number of times we use this functionality (#1729), it's still a time-consuming process.

For better performance, let's do the following: instead of scanning the SQLITE table, we can use hnswlib's get_current_count function to quickly get the number of documents in the index.

But there's a potential issue with this approach. What if documents don't have associated vectors? get_current_count would return 0.

We have two potential solutions:

  1. Notify/Warn users about this behavior and return 0.
  2. Use to the older method of counting using the SQL table if vector-less documents are detected.
@jupyterjazz jupyterjazz added good-first-issue Suitable as your first contribution to DocArray! area/document-index Concerning Document Index or a Document Index backend labels Aug 23, 2023
@shobhit9957
Copy link

/attempt #1759

I would like to solve this issue. Can I get some more help.

@JoanFM
Copy link
Member

JoanFM commented Aug 24, 2023

Hey @shobhit9957 ,

Sure, the main refactoring to be done is to change the code here:

    def num_docs(self) -> int:
        """
        Get the number of documents.
        """
        if self._num_docs == 0:
            self._num_docs = self._get_num_docs_sqlite()
        return self._num_docs

to something that does not rely on the _get_num_docs_sqlite but from a new private method _get_num_docs_hnsw

@shobhit9957
Copy link

so I should replace the _get_num_docs_sqlite to _get_num_docs_hnsw is that correct?
I'm just a beginner, to this open-source-community. Just trying my hands out there...please help me out.

@JoanFM
Copy link
Member

JoanFM commented Aug 24, 2023

Yes, this would be the right approach

@shobhit9957
Copy link

Thanks! will do the PR Soon.

@shobhit9957
Copy link

Hey Joan. Submitted the PR, please check if there are any other issues or mistakes I've done in the PR, I would be happy to solve my mistakes and submit the PR again Joan. Thanks!

@shobhit9957
Copy link

def num_docs(self) -> int:
"""
Get the number of documents.
"""
if self._num_docs == 0:
# Replace the old method with the new one
self._num_docs = self._get_num_docs_hnsw()
return self._num_docs

def _get_num_docs_hnsw(self) -> int:
"""
Get the number of documents using the HNSW method.
This method should return the count of documents using the new technique.
"""
# Implement your logic here to get the count using the HNSW method
# For example, you might access some data or perform calculations
# that help you quickly determine the number of documents.
# Then return the count.
return calculated_num_docs
is this correct?

@munish0838
Copy link

Hi, is this issue fixed or needs contribution yet.
Would like to contribute

@JoanFM
Copy link
Member

JoanFM commented Oct 22, 2023

Hey @munish0838, there is still work to be done

@munish0838
Copy link

I would like to start working on this issue

@JoanFM
Copy link
Member

JoanFM commented Oct 24, 2023

be my guest. I believe there are parts of the plan that were not applied yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/document-index Concerning Document Index or a Document Index backend good-first-issue Suitable as your first contribution to DocArray!
Projects
Status: In progress by community
Development

No branches or pull requests

4 participants