Skip to content

FullStackWithLawrence/openai-embeddings

Repository files navigation

OpenAI Embeddings Example

🤖 Retrieval Augmented Generation and Hybrid Search 🤖

FullStackWithLawrence
OpenAI LangChain Pinecone Python
Release Notes GHA pushMain Status AGPL License hack.d Lawrence McDaniel

A Hybrid Search and Augmented Generation prompting solution using Python OpenAI API Embeddings persisted to a Pinecone vector database index and managed by LangChain. Implements the following:

  • PDF Loader. a command-line pdf loader program that extracts text, vectorizes, and loads into a Pinecone dot product vector database that is dimensioned to match OpenAI embeddings.
  • Retrieval Augmented Generation. A chatGPT prompt based on a hybrid search retriever that locates relevant documents from the vector database and includes these in OpenAI prompts.

Installation

git clone https://github.com/FullStackWithLawrence/openai-embeddings.git
cd openai-embeddings
make init

# Linux/macOS
source venv/bin/activate

# Windows Powershell (admin)
venv\Scripts\activate

You'll also need to add your api keys to the .env file in the root of the repo.

OPENAI_API_ORGANIZATION=PLEASE-ADD-ME
OPENAI_API_KEY=PLEASE-ADD-ME
PINECONE_API_KEY=PLEASE-ADD-ME

Usage

# example 1 - generic assistant
python3 -m models.examples.prompt "your are a helpful assistant" "What analytics and accounting courses does Wharton offer?"

# example 2 - assistant with improved system prompting
python3 -m models.examples.prompt "You are a student advisor at University of Pennsylvania. You provide concise answers of 100 words or less." "What analytics and accounting courses does Wharton offer?"

# example 3 - templated assistant: Online courses
python3 -m models.examples.online_courses "analytics and accounting"

# example 4 - templated assistant: Certification programs
python3 -m models.examples.certification_programs "analytics and accounting"

# example 5 - Retrieval Augmented Generation
python3 -m models.examples.load "/path/to/your/pdf/documents"
python3 -m models.examples.rag "What analytics and accounting courses does Wharton offer?"

Retrieval Augmented Generation

For the question, "What analytics and accounting courses does Wharton offer?", an embedding can potentially dramatically alter the response generated by chatGPT. To illustrate, I uploaded a batch of 21 sets of lecture notes in PDF format for an online analytics course taught by Wharton professor Brian Bushee. You can download these from https://cdn.lawrencemcdaniel.com/fswl/openai-embeddings-data.zip to test whether your results are consistent.

The control set

Example 1 above, a generic chatGPT prompt with no additional guidance provided by a system prompt nor an embedding, generates the following response:

Wharton offers a variety of analytics and accounting courses. Some of the analytics courses include:

1. Introduction to Business Analytics: This course provides an overview of the fundamentals of business analytics, including data analysis, statistical modeling, and decision-making.

2. Data Visualization and Communication: This course focuses on the effective presentation and communication of data through visualizations and storytelling techniques.

3. Predictive Analytics: This course explores the use of statistical models and machine learning algorithms to predict future outcomes and make data-driven decisions.

4. Big Data Analytics: This course covers the analysis of large and complex datasets using advanced techniques and tools, such as Hadoop and Spark.

In terms of accounting courses, Wharton offers:

1. Financial Accounting: This course provides an introduction to the principles and concepts of financial accounting, including the preparation and analysis of financial statements.

2. Managerial Accounting: This course focuses on the use of accounting information for internal decision-making and planning, including cost analysis and budgeting.

3. Advanced Financial Accounting: This course delves into more complex accounting topics, such as consolidations, partnerships, and international accounting standards.

4. Auditing and Assurance Services: This course covers the principles and practices of auditing, including risk assessment, internal controls, and audit procedures.

These are just a few examples of the analytics and accounting courses offered at Wharton. The school offers a wide range of courses to cater to different interests and skill levels in these fields.
(venv) (base) mcdaniel@MacBookAir-Lawrence openai-embeddings % python3 -m models.examples.online_courses "analytics and accounting"

Same prompt but with an embedding

After creating an embedding from the sample set of pdf documents, you can prompt models.examples.rag with the same question, and it should provide a quite different response compared to the control from example 1. It should resemble the following:

Wharton offers a variety of analytics and accounting courses. Some of the courses offered include:

1. Accounting-Based Valuation: This course, taught by Professor Brian Bushee, focuses on using accounting information to value companies and make investment decisions.

2. Review of Financial Statements: Also taught by Professor Brian Bushee, this course provides an in-depth understanding of financial statements and how to analyze them for decision-making purposes.

3. Discretionary Accruals Model: Another course taught by Professor Brian Bushee, this course explores the concept of discretionary accruals and their impact on financial statements and financial analysis.

4. Discretionary Accruals Cases: This course, also taught by Professor Brian Bushee, provides practical applications of the discretionary accruals model through case studies and real-world examples.

These are just a few examples of the analytics and accounting courses offered at Wharton. The school offers a wide range of courses in these areas to provide students with a comprehensive understanding of financial analysis and decision-making.

Requirements

  • git. pre-installed on Linux and macOS
  • make. pre-installed on Linux and macOS.
  • OpenAI platform API key. If you're new to OpenAI API then see How to Get an OpenAI API Key
  • Pinecone API key.
  • Python 3.11: for creating virtual environment used for building AWS Lambda Layer, and locally by pre-commit linters and code formatters.
  • NodeJS: used with NPM for local ReactJS developer environment, and for configuring/testing Semantic Release.

Configuration defaults

Set these as environment variables on the command line, or in a .env file that should be located in the root of the repo.

# OpenAI API
OPENAI_API_ORGANIZATION=PLEASE-ADD-ME
OPENAI_API_KEY=PLEASE-ADD-ME
OPENAI_CHAT_MAX_RETRIES=3
OPENAI_CHAT_MODEL_NAME=gpt-3.5-turbo
OPENAI_CHAT_TEMPERATURE=0.0
OPENAI_PROMPT_MODEL_NAME=gpt-3.5-turbo-instruct

# Pinecone API
PINECONE_API_KEY=PLEASE-ADD-ME
PINECONE_DIMENSIONS=1536
PINECONE_ENVIRONMENT=gcp-starter
PINECONE_INDEX_NAME=rag
PINECONE_METRIC=dotproduct
PINECONE_VECTORSTORE_TEXT_KEY=lc_id

# This package
DEBUG_MODE=False

Contributing

This project uses a mostly automated pull request and unit testing process. See the resources in .github for additional details. You additionally should ensure that pre-commit is installed and working correctly on your dev machine by running the following command from the root of the repo.

pre-commit run --all-files

Pull requests should pass these tests before being submitted:

make test

Developer setup

git clone https://github.com/lpm0073/automatic-models.git
cd automatic-models
make init
make activate

Github Actions

Actions requires the following secrets:

PAT: {{ secrets.PAT }}  # a GitHub Personal Access Token
OPENAI_API_ORGANIZATION: {{ secrets.OPENAI_API_ORGANIZATION }}
OPENAI_API_KEY: {{ secrets.OPENAI_API_KEY }}
PINECONE_API_KEY: {{ secrets.PINECONE_API_KEY }}
PINECONE_ENVIRONMENT: {{ secrets.PINECONE_ENVIRONMENT }}
PINECONE_INDEX_NAME: {{ secrets.PINECONE_INDEX_NAME }}

Additional reading