The official implementation for the ICCV 2023 paper "Grounded Image Text Matching with Mismatched Relation Reasoning".
Unofficial implementation of "Sigmoid Loss for Language Image Pre-Training"
An open-source API built on FastAPI for visual question answering.
Related papers about Referring Image Segmentation (RIS)
An end-to-end masked contrastive video-and-language pre-training framework
A comprehensive hub for updates on generative AI research, including interviews, notebooks, and additional resources.
Vision-Controllable Natural Language Generation
Reading group for Vision and Language research
A list of research papers on knowledge-enhanced multimodal learning
ACM Multimedia 2023 - Temporal Sentence Grounding in Streaming Videos
A PyTorch implementation of TVC
Under review. [IROS 2024] PGA: Personalizing Grasping Agents with Single Human-Robot Interaction
Counting dataset for Vision & Language models. Introduced in the paper "Seeing Past Words: Testing the Cross-Modal Capabilities of Pretrained V&L Models". https://arxiv.org/abs/2012.12352
VinVL+L: Enriching Visual Representation with Location Context in Visual Question Answering (VQA)
[INLG2023] The High-Level (HL) dataset is a Vision and Language (V&L) resource aligning object-centric descriptions from COCO with high-level descriptions crowdsourced along 3 axes: scene, action, rationale.
An end-to-end vision-and-language model incorporating explicit knowledge graphs and OOD detection.
[IROS 2023] GVCCI: Lifelong Learning of Visual Grounding for Language-Guided Robotic Manipulation
The open source implementation of the model from "Scaling Vision Transformers to 22 Billion Parameters"
[CVPR 2024] Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding