[CVPR'23] Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation

A PyTorch implementation of TVC

Overview

TVC is an implementation of
"Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation"
Tsu-Jui Fu, Licheng Yu, Ning Zhang, Cheng-Yang Fu, Jong-Chyi Su, William Yang Wang, and Sean Bell
in Conference on Computer Vision and Pattern Recognition (CVPR) 2023

To model video together with language, we propose a temporal-aware VQGAN that represents each frame as visual tokens, placing it in the same discrete space as words. We present an effective masking strategy that masks different parts of the video for video-completion learning. The missing fragments are replaced by special [SPAN] tokens, and we consider visual guidance from diverse time points. The multimodal encoder consumes the text and the partially missing video, and the decoder learns to produce the complete video from arbitrarily placed guidance frames. By varying the masking conditions, our Multimodal Masked Video Generation (MMVG) model learns to use the [SPAN] token and unifies all text-guided video completion (TVC) tasks during training.
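The masking idea can be illustrated with a small, dependency-free sketch. It is not the code from this repository: the function name `mask_video`, the string placeholder for [SPAN], and the choice to collapse each contiguous run of missing frames into a single [SPAN] are assumptions made for illustration only.

```python
from typing import List, Sequence, Set, Union

# Placeholder for the [SPAN] token; a real model would use a vocabulary id (assumption).
SPAN = "[SPAN]"

Token = Union[int, str]


def mask_video(frames: Sequence[Sequence[int]], keep: Set[int]) -> List[Token]:
    """Flatten per-frame visual tokens into one sequence.

    Frames whose index is in `keep` act as visual guidance and are copied as-is.
    Each contiguous run of missing frames is collapsed into one SPAN token.
    """
    out: List[Token] = []
    in_gap = False
    for idx, frame in enumerate(frames):
        if idx in keep:
            out.extend(frame)
            in_gap = False
        elif not in_gap:
            # First missing frame of a run: emit a single SPAN for the whole run.
            out.append(SPAN)
            in_gap = True
    return out


# Four toy frames with two visual tokens each.
frames = [[1, 2], [3, 4], [5, 6], [7, 8]]

print(mask_video(frames, {0}))     # guidance from the first frame only
print(mask_video(frames, {3}))     # guidance from the last frame only
print(mask_video(frames, {0, 3}))  # guidance from both ends
print(mask_video(frames, {1}))     # guidance from an arbitrary middle frame
```

Changing only the `keep` set changes which masking condition is simulated, which mirrors how a single model can be trained on several completion settings.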

Requirements

This code was developed with Python 3.9, PyTorch 1.11, Torchvision 0.12, TorchMetrics 0.6, and PyTorch Lightning 1.3.

Since there is no obvious performance gap, we simplify the implementation and adopt VideoGPT in our MMVG.
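To compare an environment against the versions listed in this section, a short check like the following may help. It is only a sketch: the distribution names (`torch`, `torchvision`, `torchmetrics`, `pytorch-lightning`) are assumptions about how each package is installed, and `check_versions` is a hypothetical helper, not part of this repository.

```python
from importlib import metadata
from typing import Callable, Dict

# Version prefixes from the requirements above.
# The keys are assumed PyPI distribution names.
EXPECTED = {
    "torch": "1.11",
    "torchvision": "0.12",
    "torchmetrics": "0.6",
    "pytorch-lightning": "1.3",
}


def check_versions(
    expected: Dict[str, str],
    lookup: Callable[[str], str] = metadata.version,
) -> Dict[str, str]:
    """Return a status string per package: 'ok', 'missing', or a mismatch note."""
    report: Dict[str, str] = {}
    for name, prefix in expected.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        # Accept the exact prefix or any patch/local build of it (e.g. 1.11.0+cu113).
        ok = found == prefix or found.startswith(prefix + ".")
        report[name] = "ok" if ok else f"found {found}, expected {prefix}.x"
    return report


if __name__ == "__main__":
    for name, status in check_versions(EXPECTED).items():
        print(f"{name}: {status}")
```

A mismatch does not necessarily mean the code will fail; it only flags a difference from the versions the code was developed with.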

Usage

Dataset

Put the dataset in ./_data.

show_data.ipynb

Inference

Put the checkpoint in ./_ckpt.

inference.ipynb
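Before opening the notebooks, it can be useful to confirm that both folders are populated. The sketch below assumes it is run from the repository root; `check_layout` is a hypothetical helper, and "contains at least one entry" is only a rough proxy for "the dataset or checkpoint is in place".

```python
from pathlib import Path
from typing import Dict


def check_layout(root: Path = Path(".")) -> Dict[str, bool]:
    """Report whether ./_data and ./_ckpt exist under root and are non-empty."""
    status: Dict[str, bool] = {}
    for name in ("_data", "_ckpt"):
        folder = root / name
        # True only if the folder exists and holds at least one file or subfolder.
        status[name] = folder.is_dir() and any(folder.iterdir())
    return status


if __name__ == "__main__":
    for name, ready in check_layout().items():
        print(f"./{name}: {'ready' if ready else 'missing or empty'}")
```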

Citation

@inproceedings{fu2023tvc, 
  author = {Tsu-Jui Fu and Licheng Yu and Ning Zhang and Cheng-Yang Fu and Jong-Chyi Su and William Yang Wang and Sean Bell}, 
  title = {{Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation}}, 
  booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)}, 
  year = {2023}
}

Acknowledgement

This code is based on Taming and TATS.

