Fine-tuning Models to perform domain-specific Visual Question Answering

Generating fashion product descriptions by fine-tuning a Vision-Language Model (VLM) with Amazon SageMaker and Amazon Bedrock

This repository implements a machine learning training and inference workflow that uses Generative AI (GenAI) to answer questions about provided images. Pre-trained models exist for such tasks, but they a) cannot adapt to domain-specific scenarios without fine-tuning, and b) are not readily deployable to production environments.

To solve this problem, this repository shows how to extract domain-specific product attributes from product images by fine-tuning a Vision-Language Model (VLM) on a fashion dataset using Amazon SageMaker, and then how to use Amazon Bedrock to generate product descriptions from the extracted attributes.

For a detailed walkthrough of this repository, please refer to our blog post.

Data and Use Case

The data used in this repository is taken from the Kaggle Fashion Images Dataset. The use case is generating captions for these fashion products for an e-commerce website, a task that has historically been very time-consuming. High-quality product descriptions improve searchability through Search Engine Optimization (SEO) and increase customer satisfaction by allowing customers to make informed decisions.

Vision-Language Model

The model fine-tuned in this repository is BLIP-2, specifically the variant built on Flan-T5-XL.
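
As a minimal sketch of what this model does before any fine-tuning, the snippet below loads the publicly available Salesforce/blip2-flan-t5-xl checkpoint from the Hugging Face Hub and answers a question about a product image. The image path, prompt, and generation settings are illustrative assumptions and may differ from the repository's training and inference scripts.

```python
# Minimal sketch: answer a question about a product image with the base
# BLIP-2 (Flan-T5-XL) checkpoint. Paths and prompt are illustrative only.
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xl")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xl")

image = Image.open("tshirt.jpg").convert("RGB")  # hypothetical local product image
prompt = "Question: What colour is this product? Answer:"

inputs = processor(images=image, text=prompt, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```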

The following diagram provides an overview of BLIP-2:

Blip-2 Diagram

Solution Overview

The solution can be broken down into two sections, marked green and blue in the architecture below: a) fine-tuning in green and b) inference in blue.

Solution Overview

Fine-Tuning

  1. The data is downloaded to an S3 bucket
  2. A subset of the data is used to fine-tune the model via a SageMaker Training Job (see the sketch below)
  3. The fine-tuned model artifacts are then stored in an S3 bucket to be used for inference
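
As an illustration of step 2, here is a minimal sketch of launching a SageMaker Training Job with the SageMaker Python SDK's Hugging Face estimator. The entry point, source directory, instance type, framework versions, hyperparameters, and S3 paths are assumptions for illustration and may differ from the repository's actual configuration.

```python
# Minimal sketch (not the repository's exact configuration): launch a
# SageMaker Training Job that fine-tunes BLIP-2 on the fashion subset.
import sagemaker
from sagemaker.huggingface import HuggingFace

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # assumes a SageMaker execution role is available

estimator = HuggingFace(
    entry_point="train.py",         # hypothetical training script name
    source_dir="scripts",           # hypothetical source directory
    instance_type="ml.g5.2xlarge",  # illustrative GPU instance
    instance_count=1,
    role=role,
    transformers_version="4.28",
    pytorch_version="2.0",
    py_version="py310",
    hyperparameters={"epochs": 3, "learning_rate": 5e-5},  # illustrative values
)

# The channel name and S3 prefix are placeholders for the fashion dataset subset.
estimator.fit({"training": f"s3://{session.default_bucket()}/fashion-dataset/train"})
```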

Inference

  1. The model artifacts on S3 are deployed behind a SageMaker Endpoint
  2. The Endpoint is then invoked with an image and a question, and returns a JSON response containing the relevant product attributes
  3. The response is then passed to Amazon Bedrock along with a pre-defined prompt, which turns the attributes into a formatted product description (see the sketch below)
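
A minimal sketch of this inference path follows. The endpoint's request/response payload shape, the S3 artifact path, the Bedrock model ID, and the prompt are assumptions for illustration; the repository's inference handler and the accompanying blog post define the actual contract.

```python
# Minimal sketch (assumed payloads, paths and model IDs): deploy the
# fine-tuned artifacts, query the endpoint for product attributes, then
# ask Amazon Bedrock to write the product description.
import base64
import json

import boto3
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()

# 1. Spin up a SageMaker endpoint from the fine-tuned artifacts on S3.
model = HuggingFaceModel(
    model_data="s3://<bucket>/blip2-finetuned/model.tar.gz",  # placeholder path
    role=role,
    transformers_version="4.28",
    pytorch_version="2.0",
    py_version="py310",
)
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge")

# 2. Invoke the endpoint with an image and a question (payload shape is assumed).
with open("tshirt.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")
attributes = predictor.predict({"image": image_b64, "question": "What colour is this product?"})

# 3. Pass the extracted attributes to Amazon Bedrock with a pre-defined prompt.
bedrock = boto3.client("bedrock-runtime")
body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 300,
    "messages": [{
        "role": "user",
        "content": f"Write a short e-commerce product description using these attributes: {attributes}",
    }],
})
response = bedrock.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # illustrative Bedrock model ID
    body=body,
)
print(json.loads(response["body"].read())["content"][0]["text"])
```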

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.