sethuiyer/search

Educational Purpose

This project focuses on building a high-quality search engine on custom data using txtai. txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.

Overview

The project includes preparing a text corpus, indexing it using txtai, and then performing advanced semantic searches. It leverages txtai's Textractor for text extraction and incorporates a custom SemanticSearch class for efficient searching.

Prerequisites

  • Python 3.6+
  • txtai library

Corpus Preparation

  1. Extract Text Data:
    • Use txtai's Textractor to extract text from various materials. Ensure sentences=True is set.
    • Store the extracted list of sentences in separate text files for different materials.
    • Merge these files into a single text file named database.txt.
    • Later, load the dataset as a list of segmented sentences with open('database.txt').readlines(). Note that each entry keeps its trailing newline.
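The merge-and-reload steps above can be sketched with the standard library alone. This is a minimal illustration, not the repository's code: the helper names `merge_sentence_files` and `load_sentences` are hypothetical, and it assumes each per-material file already holds one extracted sentence per line.

```python
def merge_sentence_files(paths, output_path="database.txt"):
    """Concatenate per-material sentence files into one file, one sentence per line.

    Returns the number of sentences written. Blank lines are skipped.
    """
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as source:
                for line in source:
                    sentence = line.strip()
                    if sentence:
                        out.write(sentence + "\n")
                        count += 1
    return count


def load_sentences(path="database.txt"):
    """Read the merged file back as a list of sentences without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]
```

Stripping on load avoids carrying the newline characters that a bare readlines() call would keep on every sentence.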

search.py

This script uses txtai to process, index, and load the raw data present in database.txt. It sets up the infrastructure for the search engine.

SemanticSearch Class Usage

Step 1: Initialization

Create an instance of the SemanticSearch class. The constructor is where the model path for embeddings is specified; the example below uses the defaults.

from src.search import SemanticSearch
semantic_search = SemanticSearch()

Step 2: Download and Load the Index

Download the index file and load it into the SemanticSearch instance.

wget https://huggingface.co/<user>/<repo>/resolve/main/index.tar.gz # or any URL where your index lives

Then load it:

from src.search import SemanticSearch
semantic_search = SemanticSearch()
semantic_search.load_index('index.tar.gz')

Alternatively, build the index from your own data with create_and_save_embeddings. Pass the data as a list of strings as the first argument and the output path (index.tar.gz) as the second.

# Dataset as a list of segmented sentences, one per line of database.txt
sentences = open('database.txt').readlines()
semantic_search.create_and_save_embeddings(sentences, 'index.tar.gz')

Step 3: Performing a Search

Perform semantic searches using the search method.

query = "Your search query"
results = semantic_search.search(query, limit=5)

# Displaying results
for result in results:
    print(result)
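For readers curious how a wrapper with these three methods might sit on top of txtai, here is a rough sketch. It is not the code in src/search.py: the class name `SemanticSearchSketch`, the default model name, and the choice to return only the matched text are all assumptions made for illustration. It relies on txtai's `Embeddings` object exposing `index`, `save`, `load` and `search`, and accepts an injected embeddings object so the logic can be exercised without downloading a model.

```python
class SemanticSearchSketch:
    """Hypothetical thin wrapper around a txtai Embeddings instance."""

    def __init__(self, embeddings=None,
                 model_path="sentence-transformers/all-MiniLM-L6-v2"):
        if embeddings is None:
            # Imported lazily so the class can be used with a stand-in object.
            from txtai import Embeddings
            # content=True stores the text so search results include it.
            embeddings = Embeddings(path=model_path, content=True)
        self.embeddings = embeddings

    def create_and_save_embeddings(self, sentences, index_path):
        """Index a list of strings and write the index archive to disk."""
        cleaned = [s.strip() for s in sentences if s.strip()]
        self.embeddings.index(cleaned)
        self.embeddings.save(index_path)

    def load_index(self, index_path):
        """Load a previously saved index archive."""
        self.embeddings.load(index_path)

    def search(self, query, limit=5):
        """Return the text of the top matches for the query."""
        return [row["text"] for row in self.embeddings.search(query, limit)]
```

Keeping the embeddings object injectable is a design choice of this sketch; it makes it easy to swap models or to test the wrapper with a fake index.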

Example

Let's see how this library performs on a custom dataset:

python test.py 
Embeddings loaded in 5.36 seconds ⚡️
🔍 Query: What is kshipta avashta

Search completed in 3.29 seconds ⚡️
['mind please see this carefully kshipta is the distracted or restless state when you know what you should do but well energy nahi hai somehow you want', 'a Avastha The next state is Which also you go into, it is the Kshipta Avastha is kshitva avastha where you are very restless, very agitated, thinking']

You can then pass these search results to a language model as context:

from txtai.pipeline import LLM

# Create and run LLM pipeline
llm = LLM('google/flan-t5-large')
llm(
  """
  SYSTEM: You are Natasha, a friendly assistant who answers user's queries.

USER: what is kshipta avastha

CONTEXT:
 ['mind please see this carefully kshipta is the distracted or restless state when you know what you should do but well energy nahi hai somehow you want', 
 'a Avastha The next state is Which also you go into, it is the Kshipta Avastha is kshitva avastha where you are very restless, very agitated, thinking']

ASSISTANT:
  """
)
Natasha: kshipta avastha is the distracted or restless state when you know what you should do but well energy nahi hai somehow you want

Pretty good response if you ask me.

Second example:

python test.py 
Embeddings loaded in 4.42 seconds ⚡️
🔍 Query: Who is Rene Descartes?

Search completed in 1.78 seconds ⚡️
['Descartes, or Cartesius (his Latinized name), is usually regarded as the founder of modern philosophy,and he was also a brilliant mathematician and a', 'According to Werner Heisenberg ( 1958 , p. 81), who struggled with the problem for many years, “This partition has penetrated deeply into the human mi']

Passing these results to the LLM:

llm(
  """
  SYSTEM: You are Natasha, a friendly assistant who answers user's queries from the given context.

USER: Who is Rene Descartes?

CONTEXT:
['Descartes, or Cartesius (his Latinized name), is usually regarded as the founder of modern philosophy,and he was also a brilliant mathematician and a', 'According to Werner Heisenberg ( 1958 , p. 81), who struggled with the problem for many years, “This partition has penetrated deeply into the human mi']


ASSISTANT:
  """
)
Descartes, or Cartesius (his Latinized name) is usually regarded as the founder of modern philosophy

Again, pretty good.
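Both prompts above were assembled by hand. A small helper could build the same SYSTEM / USER / CONTEXT / ASSISTANT layout from a query and the list returned by search. The function name `build_prompt` is hypothetical and the exact whitespace is an assumption; only the section labels and the system message come from the examples above.

```python
DEFAULT_SYSTEM = (
    "You are Natasha, a friendly assistant who answers user's queries "
    "from the given context."
)


def build_prompt(query, context, system=DEFAULT_SYSTEM):
    """Format a query and retrieved passages into a single prompt string."""
    # One passage per line keeps the context readable for the model.
    context_block = "\n".join(f"- {passage}" for passage in context)
    return (
        f"SYSTEM: {system}\n\n"
        f"USER: {query}\n\n"
        f"CONTEXT:\n{context_block}\n\n"
        "ASSISTANT:\n"
    )
```

With the objects from the earlier steps, usage would look like llm(build_prompt(query, semantic_search.search(query, limit=2))).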

Extras:

llm_router.py

This script uses txtai to determine the query type and the appropriate tools required for processing.

result = classifier.classify_instructions(["Draft a poem which also proves that sqrt of 2 is irrational"])
print(result)
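The routing idea can be sketched independently of the actual llm_router.py, whose internals are not shown here. The sketch below assumes a zero-shot classifier that is called as classify(text, labels) and returns (label_index, score) pairs, which is the call shape of txtai's Labels pipeline as far as I know. The function name `select_tools`, the label-to-tool mapping, and the threshold value are all hypothetical.

```python
def select_tools(query, tools, classify, threshold=0.3):
    """Pick every tool whose label scores at or above the threshold.

    tools maps a candidate label (e.g. "poetry") to a tool name.
    classify(text, labels) must return a list of (label_index, score) pairs.
    """
    labels = list(tools)
    scored = classify(query, labels)
    # Keep labels that clear the threshold, highest score first.
    chosen = sorted(
        ((labels[index], score) for index, score in scored if score >= threshold),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [tools[label] for label, _ in chosen]
```

A query such as the poem-plus-proof example above could match more than one label, which is why this sketch returns a list of tools rather than a single winner.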

Blog: https://medium.com/@sethuiyer/query-aware-similarity-tailoring-semantic-search-with-zero-shot-classification-5b552c2d29c7

About

Advanced Semantic Search Engine: Leveraging txtai for Dynamic, Context-Aware Information Retrieval
