
With many files, a run takes dozens of hours; if an error or unexpected termination occurs, the database is corrupted and has to be regenerated. Save progress after every 100 files processed #4040

Open
zmwstu opened this issue May 18, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@zmwstu

zmwstu commented May 18, 2024

With many files, a run takes dozens of hours, and if an error or unexpected termination occurs, the database is corrupted and has to be regenerated from scratch. Where in the code can I change it to save after every 100 files processed, or to add entries to the vector database with multiple threads?

@zmwstu zmwstu added the bug Something isn't working label May 18, 2024

dosubot bot commented May 18, 2024

To address your concerns about saving progress every 100 files and implementing multithreading for adding entries to the vector database, here's a concise guide:

  1. Saving progress every 100 files:

     • Keep a counter in your file-processing loop and increment it after each file.
     • When the counter reaches 100, trigger a save operation to persist the current state, then reset the counter.

  2. Multithreading for vector database entries:

     • Use a thread-safe approach so multiple threads can add entries without corrupting data.
     • Maintain a task queue where each task represents a file or batch of files to process.
     • Create worker threads that pull tasks from the queue, making sure any method that modifies the vector database is thread-safe.
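The first point, periodic checkpointing, can be sketched as follows. Note that `process_file` and `save_vector_store` are hypothetical placeholders for the project's own per-file processing and persistence logic, not functions from its codebase:

```python
SAVE_EVERY = 100  # checkpoint interval

def process_with_checkpoints(files, process_file, save_vector_store):
    """Process files one by one, persisting the vector store every SAVE_EVERY files."""
    processed = 0
    for path in files:
        process_file(path)           # add this file's vectors to the store
        processed += 1
        if processed % SAVE_EVERY == 0:
            save_vector_store()      # persist partial progress to disk
    save_vector_store()              # final save covers the remainder
    return processed
```

If the run dies mid-way, the vector store on disk is at most `SAVE_EVERY` files behind, so a restart only has to redo that tail instead of everything.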

Example Implementation:

import threading
from queue import Queue

def process_files(queue, vs_name, db_lock):
    while True:
        task = queue.get()
        if task is None:          # sentinel: no more work for this thread
            queue.task_done()
            break
        name, embeddings = task
        # Add your file processing logic here; guard any write to the
        # vector store `vs_name` with the lock so updates stay consistent.
        with db_lock:
            pass  # e.g. add `embeddings` for `name` to the vector store
        queue.task_done()

def main():
    files_to_process = [...]  # Your list of files
    queue = Queue()
    db_lock = threading.Lock()  # serializes vector store updates

    # Load tasks into the queue
    for file in files_to_process:
        embeddings = load_local_embeddings(file)  # Function to load embeddings for a file
        queue.put((file, embeddings))

    # Start worker threads
    num_worker_threads = 10
    threads = []
    for _ in range(num_worker_threads):
        t = threading.Thread(target=process_files,
                             args=(queue, 'your_vector_store_name', db_lock))
        t.start()
        threads.append(t)

    # One sentinel per worker so every thread exits cleanly;
    # checking queue.empty() in the workers instead would be racy.
    for _ in range(num_worker_threads):
        queue.put(None)

    queue.join()  # Wait for all tasks (and sentinels) to be processed
    for t in threads:
        t.join()

if __name__ == "__main__":
    main()

This setup uses a Queue for managing tasks and worker threads for concurrent processing. Ensure your file processing and vector database update logic is placed within the process_files function. Adjust the number of worker threads based on your system's capabilities and the workload. This approach will help you process files more efficiently and save progress periodically to mitigate data loss risks.
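Combining the two ideas, the periodic save has to happen under the same lock that guards vector store writes, so each checkpoint sees a consistent state. A minimal sketch, where the `add` and `save` callbacks and the `SAVE_EVERY` threshold are assumptions rather than part of the project's API:

```python
import threading
from queue import Queue

SAVE_EVERY = 100  # assumed checkpoint interval

def worker(tasks: Queue, state: dict, lock: threading.Lock, add, save):
    """Consume tasks; every SAVE_EVERY completed items, persist under the lock."""
    while True:
        item = tasks.get()
        if item is None:              # sentinel: shut this worker down
            tasks.task_done()
            return
        with lock:
            add(item)                 # thread-safe vector store update
            state["done"] += 1
            if state["done"] % SAVE_EVERY == 0:
                save()                # checkpoint partial progress
        tasks.task_done()
```

Because the counter is only touched while the lock is held, exactly one thread triggers each checkpoint, even with many workers running.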


@KevinFanng

python init_database.py --help
usage: init_database.py [-h] [-r] [--create-tables] [--clear-tables] [--import-db IMPORT_DB] [-u] [-i] [--prune-db] [--prune-folder]
                        [-n KB_NAME [KB_NAME ...]] [-e EMBED_MODEL]

please specify only one operate method at a time.

options:
  -h, --help            show this help message and exit
  -r, --recreate-vs     recreate vector store. use this option if you have copied document files to the content folder, but the
                        vector store has not been populated, or DEFAUL_VS_TYPE/EMBEDDING_MODEL changed.
  --create-tables       create empty tables if not existed
  --clear-tables        create empty tables, or drop the database tables before recreating vector stores
  --import-db IMPORT_DB
                        import tables from specified sqlite database
  -u, --update-in-db    update vector store for files that exist in the database. use this option if you want to recreate vectors
                        for files in the db and skip files that exist only in the local folder.
  -i, --increment       update vector store for files that exist in the local folder but not in the database. use this option if
                        you want to create vectors incrementally.
  --prune-db            delete docs in the database that no longer exist in the local folder. used to remove database docs after
                        the user deleted some doc files in the file browser.
  --prune-folder        delete doc files in the local folder that do not exist in the database. used to free local disk space by
                        deleting unused doc files.
  -n KB_NAME [KB_NAME ...], --kb-name KB_NAME [KB_NAME ...]
                        specify knowledge base names to operate on. default is all folders existing in KB_ROOT_PATH.
  -e EMBED_MODEL, --embed-model EMBED_MODEL
                        specify embeddings model.
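Given those options, recovering from a crash may not require a code change at all: rerunning with `-i` only vectorizes files that are not yet in the database. A sketch of such a workflow, where the knowledge base name `my_kb` is a hypothetical example:

```shell
# After a crash or interruption, resume instead of rebuilding from scratch:
# only files present in the local folder but missing from the database
# get vectorized on this pass.
python init_database.py -i

# Optionally restrict the pass to a single knowledge base to keep runs short:
python init_database.py -i -n my_kb
```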
