Feature : Autoload a PDF file By URL #1

Open
KazeroG opened this issue May 3, 2023 · 0 comments

Autoload the PDF from a URL instead of a local upload

With this feature, the file input is removed: the app fetches the PDF itself.

Explanation:

  • BytesIO (from Python's io module), requests, and the PdfReader class (from the PyPDF2 library) are imported to download and read the PDF file.
  • CharacterTextSplitter, OpenAIEmbeddings, FAISS, load_qa_chain, OpenAI, and get_openai_callback are imported from the langchain library. These classes and functions are used to build the question-answering system.
  • The OpenAI API key must be available in the OPENAI_API_KEY environment variable. The code below does not set it, so export it before starting the app.
  • The st.set_page_config and st.header functions from the streamlit library are used to set the title and header of the web app.
  • The PDF file is downloaded with requests.get, and the response body (response.content) is wrapped in a BytesIO object so that PdfReader can read it like a file.
  • The text is extracted page by page by calling extract_text on each page in pdf_reader.pages. The page texts are concatenated into a single string.
  • The CharacterTextSplitter class is used to split the text into smaller chunks. These chunks are used to build a knowledge base for the question-answering system.
  • The OpenAIEmbeddings class is used to generate embeddings for the text chunks. These embeddings are used to perform similarity searches when answering questions.
  • The st.text_input function is used to prompt the user to ask a question about the PDF file.
  • If the user enters a question, knowledge_base.similarity_search returns the chunks most similar to it. Separately, load_qa_chain builds a question-answering chain around the OpenAI LLM with chain_type="stuff".
  • The run method of the question-answering chain is called with the retrieved chunks (input_documents) and the user question as arguments. The result is stored in the response variable.
  • The result is displayed using the st.write function.
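
The chunking step above can be illustrated with a minimal sketch. This is not how CharacterTextSplitter is implemented (it splits on the separator and then merges pieces); it is a simplified fixed-size character window, and the function name split_with_overlap is hypothetical. It only shows what chunk_size and chunk_overlap mean: consecutive chunks share their last and first chunk_overlap characters.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters.

    Simplified illustration only; hypothetical helper, not part of langchain.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters of toy input
chunks = split_with_overlap(sample, chunk_size=12, chunk_overlap=4)
```

With the values used in app.py (1000 and 200), each new chunk would start 800 characters after the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk in most cases.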

The code: app.py

from io import BytesIO
import requests
import streamlit as st
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
from langchain.callbacks import get_openai_callback

def main():
    st.set_page_config(page_title="Ask your PDF")
    st.header("Ask your PDF 💬")
    
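    # load the PDF file from a URL (placeholder address, replace with a real one)
    url = 'https://www.example.com/example.pdf'
    pdf_response = requests.get(url, timeout=30)  # 30 s timeout is an arbitrary choice
    pdf_response.raise_for_status()  # stop early on HTTP errors such as 404
    pdf = BytesIO(pdf_response.content)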
    
    # extract the text
    pdf_reader = PdfReader(pdf)
    text = ""
    for page in pdf_reader.pages:
        text += page.extract_text() or ""  # tolerate pages with no extractable text

    # split into chunks
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    chunks = text_splitter.split_text(text)
      
    # create embeddings
    embeddings = OpenAIEmbeddings()
    knowledge_base = FAISS.from_texts(chunks, embeddings)
      
    # show user input
    user_question = st.text_input("Ask a question about your PDF:")
    if user_question:
        docs = knowledge_base.similarity_search(user_question)
        
        llm = OpenAI()
        chain = load_qa_chain(llm, chain_type="stuff")
        with get_openai_callback() as cb:
            response = chain.run(input_documents=docs, question=user_question)
            print(cb)
           
        st.write(response)
    

if __name__ == '__main__':
    main()
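
Because the PDF now comes from a URL, the server may return something other than a PDF (an HTML error page, for example). A possible guard, sketched below, checks for the `%PDF-` header before handing the bytes to PdfReader. The helper name pdf_stream_from_bytes is hypothetical, and looking in the first 1024 bytes rather than strictly at offset 0 is a lenient choice assumed here, not something the original code does.

```python
from io import BytesIO

PDF_MAGIC = b"%PDF-"  # every PDF file carries this header marker


def pdf_stream_from_bytes(data: bytes) -> BytesIO:
    """Wrap downloaded bytes in a BytesIO, or raise if they do not look like a PDF.

    Hypothetical helper for illustration; the 1024-byte window is an assumption.
    """
    if PDF_MAGIC not in data[:1024]:
        raise ValueError("downloaded content does not look like a PDF")
    return BytesIO(data)


# Toy inputs standing in for a downloaded response body
good_stream = pdf_stream_from_bytes(b"%PDF-1.7\n% toy content")
try:
    pdf_stream_from_bytes(b"<html><body>Not Found</body></html>")
    rejected_html = False
except ValueError:
    rejected_html = True
```

In app.py, such a helper could be called on the downloaded bytes in place of the direct BytesIO(...) call, so a bad URL fails with a clear message instead of a parser error.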

Run & Test

streamlit run app.py
