
Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey

This is the repository of Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey, a systematic review of visual instruction tuning. For details, please refer to:

Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey
[Paper]


Abstract

Traditional computer vision generally solves each task independently with a dedicated model whose task instruction is implicitly encoded in the model architecture, which gives rise to two limitations: (1) it leads to task-specific models, requiring multiple models for different tasks and restricting potential synergies across diverse tasks; (2) it leads to a pre-defined, fixed model interface with limited interactivity and adaptability in following users' task instructions. To address these limitations, Visual Instruction Tuning (VIT) has been intensively studied recently. VIT fine-tunes a large vision model with language task instructions, aiming to learn, from a wide range of vision tasks described by language instructions, a general-purpose multimodal model that can follow arbitrary instructions and thus solve arbitrary tasks specified by the user. This work provides a systematic review of visual instruction tuning, covering (1) the background, presenting computer vision task paradigms and the development of VIT; (2) the foundations of VIT, introducing commonly used network architectures, visual instruction tuning frameworks and objectives, and evaluation setups and tasks; (3) the datasets commonly used in visual instruction tuning and evaluation; (4) a review of existing VIT methods, categorizing them with a taxonomy based on both the studied vision task and the method design, and highlighting their major contributions, strengths, and shortcomings; (5) a comparison and discussion of VIT methods on various instruction-following benchmarks; (6) several challenges, open directions, and possible future work in visual instruction tuning research.
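To make the training setup described above concrete, the sketch below shows the core visual instruction tuning objective: a trainable projector maps visual features into the language model's embedding space, and a next-token prediction loss is computed on the response tokens only, with the image and the instruction serving as conditioning context. This is a minimal illustration under assumed components; all module names, dimensions, and the toy encoder are placeholders for this sketch, not the architecture of any particular VIT method surveyed here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualInstructionTuner(nn.Module):
    """Minimal sketch of the VIT objective; every name here is illustrative."""

    def __init__(self, vision_dim=768, lm_dim=512, vocab_size=32000):
        super().__init__()
        # Toy stand-in for a pretrained vision encoder (real methods use e.g. a ViT).
        self.vision_encoder = nn.Linear(3 * 16 * 16, vision_dim)
        # Trainable projector mapping visual features into the LM embedding space.
        self.projector = nn.Linear(vision_dim, lm_dim)
        self.token_emb = nn.Embedding(vocab_size, lm_dim)
        layer = nn.TransformerEncoderLayer(d_model=lm_dim, nhead=8, batch_first=True)
        self.lm = nn.TransformerEncoder(layer, num_layers=2)  # toy causal LM
        self.lm_head = nn.Linear(lm_dim, vocab_size)

    def forward(self, patches, instruction_ids, response_ids):
        # patches: (B, N, 3*16*16); instruction_ids: (B, Ti); response_ids: (B, Tr)
        vis = self.projector(self.vision_encoder(patches))             # (B, N, D)
        txt = self.token_emb(torch.cat([instruction_ids, response_ids], dim=1))
        seq = torch.cat([vis, txt], dim=1)                             # (B, N+Ti+Tr, D)
        causal = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        logits = self.lm_head(self.lm(seq, mask=causal))
        # Language-modeling loss on response tokens only: the image and the
        # instruction are conditioning context, not prediction targets.
        start = vis.size(1) + instruction_ids.size(1)
        pred = logits[:, start - 1 : -1, :]                            # shifted by one
        return F.cross_entropy(pred.reshape(-1, pred.size(-1)),
                               response_ids.reshape(-1))

# One toy training step on random data, just to show the interface.
model = VisualInstructionTuner()
loss = model(torch.randn(2, 49, 3 * 16 * 16),           # 2 images, 49 patches each
             torch.randint(0, 32000, (2, 12)),           # instruction tokens
             torch.randint(0, 32000, (2, 20)))           # response tokens
loss.backward()
```

In practice the vision encoder and language model are large pretrained networks (often kept partially or fully frozen), and masking the loss to the response span is what lets one model learn many tasks from a shared instruction-following format.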

Citation

If you find our work useful in your research, please consider citing:

@article{huang2023visual,
  title={Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey},
  author={Huang, Jiaxing and Zhang, Jingyi and Jiang, Kai and Qiu, Han and Lu, Shijian},
  journal={arXiv preprint arXiv:2312.16602},
  year={2023}
}

Menu

- Datasets
  - Datasets for Visual Instruction Tuning
  - Datasets for Instruction-tuned Model Evaluation
- Visual Instruction Tuning Methods
  - Instruction-based Image Learning
    - Instruction-based Image Learning for Discriminative Tasks
    - Instruction-based Image Learning for Generative Tasks
    - Instruction-based Image Learning for Complex Reasoning Tasks
  - Instruction-based Video Learning
  - Instruction-based 3D Vision Learning
  - Instruction-based Medical Vision Learning
  - Instruction-based Document Vision Learning
