The official implementation for the ICCV 2023 paper "Grounded Image Text Matching with Mismatched Relation Reasoning".
Unofficial implementation for Sigmoid Loss for Language Image Pre-Training
An open-source API for visual question answering, built on FastAPI.
Related papers about Referring Image Segmentation (RIS)
"Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA"
Under review. [IROS 2024] PGA: Personalizing Grasping Agents with Single Human-Robot Interaction
Counting dataset for Vision & Language models. Introduced in the paper "Seeing Past Words: Testing the Cross-Modal Capabilities of Pretrained V&L Models". https://arxiv.org/abs/2012.12352
VinVL+L: Enriching Visual Representation with Location Context in Visual Question Answering (VQA)
[INLG2023] The High-Level (HL) dataset is a Vision and Language (V&L) resource aligning object-centric descriptions from COCO with high-level descriptions crowdsourced along 3 axes: scene, action, rationale.
An end-to-end vision and language model incorporating explicit knowledge graphs and OOD-detection.
My solutions to CS231N CNN assignments
PyTorch code for the Findings of NAACL 2022 paper "Probing the Role of Positional Information in Vision-Language Models".
Arabic WordNet matches for synsets in ImageNet
Source code and documentation for the LREC-COLING'24 paper "Sharing the Cost of Success: A Game for Evaluating and Learning Collaborative Multi-Agent Instruction Giving and Following Policies"
An end-to-end masked contrastive video-and-language pre-training framework
A comprehensive hub for updates on generative AI research, including interviews, notebooks, and additional resources.
Vision-Controllable Natural Language Generation
[IROS 2023] GVCCI: Lifelong Learning of Visual Grounding for Language-Guided Robotic Manipulation
[CVPR 2024] Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding
The open source implementation of the model from "Scaling Vision Transformers to 22 Billion Parameters"