Skip to content

Text2ImageDescription retrieves relevant images from Pascal VOC 2012 dataset using OpenAI CLIP, based on text queries, and generates descriptions using quantized Mistral-7b model.

License

Notifications You must be signed in to change notification settings

mahadev0811/Text2ImageDescription

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Text2ImageDescription

The project has 2 main parts:

  1. Image Retrieval: Given a text query, retrieve images from a dataset that are relevant to the query.
  2. Image Description Generation: Given a text query, generate a description for the image that is most relevant to the query.

Image Retrieval

The image retrieval part of the project uses a pre-trained openai CLIP model(https://github.com/openai/clip) to retrieve images from a dataset that are relevant to a given text query. The dataset used for this project is the Pascal VOC 2012 dataset. The dataset contains around 3500 images (train + validation). The CLIP model is used to encode the text query and the images in the dataset. The similarity between the text query and the images is calculated using cosine similarity. The images are then ranked based on the similarity score and the top k images are returned.

Image Description Generation

The image description generation part of the project uses a pre-trained Mistral-7b (https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF) model to generate descriptions for the give input query.

Usage

To run the project, follow the steps below:

  1. Clone the repository
  2. Run the notebook code.ipynb

Performane

  • Resource: 12 GB GPU (nvidia T4)
  • Image search: ~ 50 milliseconds.
  • Description generation: Streaming starts within approximately 2.5 seconds, achieving a rate of 40 tokens per second.

Results

Check out the demo video to see Text2ImageDescription in action:

demo.mp4

License

This project is licensed under the MIT License - see the LICENSE file for details.

About

Text2ImageDescription retrieves relevant images from Pascal VOC 2012 dataset using OpenAI CLIP, based on text queries, and generates descriptions using quantized Mistral-7b model.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published