liveseongho/Awesome-Video-Language-Understanding

Awesome-Video-Language-Understanding

This list introduces recent work on video-language understanding.

To access the full version, click here.

Table of Contents

Main

Video Language Transformers

  • VIOLETv2 (EmpiricalMVM) [Paper][Code] @Microsoft
    An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling (CVPR 2023)

  • LAVENDER [Paper][Code] @Microsoft
    LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling (CVPR 2023)

  • Flamingo [Paper] @DeepMind
    Flamingo: a Visual Language Model for Few-Shot Learning (NeurIPS 2022)

  • ALPRO [Paper][Code] @Salesforce
    Align and Prompt: Video-and-Language Pre-training with Entity Prompts (CVPR 2022)

  • VL-Adapter [Paper][Code] @UNC
    VL-ADAPTER: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks (CVPR 2022)

  • VIOLET [Paper][Code] @Microsoft
    VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling (arXiv 2021)

  • HERO [Paper][Code] @Microsoft
    HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training (EMNLP 2020)

  • UniVL [Paper][Code] @Microsoft
    UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation (arXiv 2020)

Video Retrieval

  • FiT [Paper][Code][Website][Demo] @Oxford
    Frozen in Time: A Joint Video and Image Encoder for End to End Retrieval (ICCV 2021)

Video Question Answering

  • FrozenBiLM [Paper][Code][Website][Poster][Slides] @Inria
    Zero-Shot Video Question Answering via Frozen Bidirectional Language Models (NeurIPS 2022)

  • MERLOT Reserve [Paper][Code][Website][Demo] @AI2
    MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound (CVPR 2022)

  • MERLOT [Paper][Code][Website] @AI2
    MERLOT: Multimodal Neural Script Knowledge Models (NeurIPS 2021)

  • JustAsk [Paper/Journal][Code][Website][Demo][Poster][Slides][Oral] @Inria
    Just Ask: Learning to Answer Questions from Millions of Narrated Videos (ICCV 2021)
    Learning to Answer Visual Questions from Web Videos (TPAMI 2022)

Video Captioning

  • Video ChatCaptioner [Paper][Code] @KAUST
    Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions (arXiv 2023)

  • Vid2Seq [Paper][Code][Website][Blog] @Google
    Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning (CVPR 2023)

  • MV-GPT [Paper] @Google
    End-to-end Generative Pretraining for Multimodal Video Captioning (CVPR 2022)

  • SwinBERT [Paper][Code] @Microsoft
    SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning (CVPR 2022)

Datasets and SOTA

Large-scale Video Language Dataset

  • WebVid-10M [Paper][Code][Website] @Oxford
    Frozen in Time: A Joint Video and Image Encoder for End to End Retrieval (ICCV 2021)

  • HowTo100M [Paper][Code][Website] @Inria
    HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips (ICCV 2019)

Downstream Tasks