Disentangled Representation Learning for Text-Video Retrieval

This is a PyTorch implementation of the paper Disentangled Representation Learning for Text-Video Retrieval:

@Article{DRLTVR2022,
  author  = {Qiang Wang and Yanhao Zhang and Yun Zheng and Pan Pan and Xian-Sheng Hua},
  journal = {arXiv:2203.07111},
  title   = {Disentangled Representation Learning for Text-Video Retrieval},
  year    = {2022},
}

Catalog

Setup
Fine-tuning code
Visualization demo

Setup

Setup code environment

git clone https://github.com/foolwood/DRL.git
cd DRL
conda create -n drl python=3.9
conda activate drl
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple/
pip install torch==1.8.1+cu102 torchvision==0.9.1+cu102 -f https://download.pytorch.org/whl/torch_stable.html

Download CLIP Model (as pretraining)

cd tvr/models
wget https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
# wget https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt
# wget https://openaipublic.azureedge.net/clip/models/b8cca3fd41ae0c99ba7e8951adf17d267cdb84cd88be6f7c2e0eca1737a03836/ViT-L-14.pt

Download Datasets

cd data/MSR-VTT
wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip ; unzip MSRVTT.zip
mv MSRVTT/videos/all ./videos ; mv MSRVTT/annotation/MSR_VTT.json ./anns/MSRVTT_data.json

Fine-tuning code

Train on MSR-VTT 1k.

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 \
main.py --do_train 1 --workers 8 --n_display 50 \
--epochs 5 --lr 1e-4 --coef_lr 1e-3 --batch_size 128 --batch_size_val 128 \
--anno_path data/MSR-VTT/anns --video_path data/MSR-VTT/videos --datatype msrvtt \
--max_words 32 --max_frames 12 --video_framerate 1 \
--base_encoder ViT-B/32 --agg_module seqTransf \
--interaction wti --wti_arch 2 --cdcr 3 --cdcr_alpha1 0.11 --cdcr_alpha2 0.0 --cdcr_lambda 0.001 \
--output_dir ckpts/ckpt_msrvtt_wti_cdcr

Reproduce the ablation experiments scripts

configs	feature	gpus	Text-Video					Video-Text					train time (h)
configs	feature	gpus	R@1	R@5	R@10	MdR	MnR	R@1	R@5	R@10	MdR	MnR	train time (h)
CLIP4Clip	ViT/B-32	4	42.8	72.1	81.4	2.0	16.3	44.1	70.5	80.5	2.0	11.8	10.5
zero-shot	ViT/B-32	4	31.1	53.7	63.4	4.0	41.6	26.5	50.1	61.7	5.0	39.9	-
Interaction
DP+None	ViT/B-32	4	42.9	70.6	81.4	2.0	15.4	43.0	71.1	81.1	2.0	11.8	2.5
DP+seqTransf	ViT/B-32	4	42.8	71.1	81.1	2.0	15.6	44.1	70.9	80.9	2.0	11.7	2.6
XTI+None	ViT/B-32	4	40.5	71.1	82.6	2.0	13.6	42.7	70.8	80.2	2.0	12.5	14.3
XTI+seqTransf	ViT/B-32	4	42.4	71.3	80.9	2.0	15.2	40.1	69.2	79.6	2.0	15.8	16.8
TI+seqTransf	ViT/B-32	4	44.8	73.0	82.2	2.0	13.4	42.6	72.7	82.8	2.0	9.1	2.6
WTI+seqTransf	ViT/B-32	4	46.6	73.4	83.5	2.0	13.0	45.4	73.4	81.9	2.0	9.2	2.6
Channel DeCorrelation Regularization
DP+seqTransf+CDCR	ViT/B-32	4	43.9	71.1	81.2	2.0	15.3	42.3	70.3	81.1	2.0	11.4	2.6
TI+seqTransf+CDCR	ViT/B-32	4	45.8	73.0	81.9	2.0	12.8	43.3	71.8	82.7	2.0	8.9	2.6
WTI+seqTransf+CDCR	ViT/B-32	4	47.6	73.4	83.3	2.0	12.8	45.1	72.9	83.5	2.0	9.2	2.6

Note: the performances are slight boosts due to new hyperparameters.

Visualization demo

Run our visualization demo using matplotlib (no GPU needed):

License

See LICENSE for details.

Acknowledgments

Our code is partly based on CLIP4Clip.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

data/MSR-VTT/anns

data/MSR-VTT/anns

demo

demo

scripts

scripts

tvr

tvr

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

main.py

main.py

requirements.txt

requirements.txt

Repository files navigation

Disentangled Representation Learning for Text-Video Retrieval

Catalog

Setup

Setup code environment

Download CLIP Model (as pretraining)

Download Datasets

Fine-tuning code

Visualization demo

License

Acknowledgments

About

Releases

Packages

Contributors 3

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
data/MSR-VTT/anns		data/MSR-VTT/anns
demo		demo
scripts		scripts
tvr		tvr
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
main.py		main.py
requirements.txt		requirements.txt

License

foolwood/DRL

Folders and files

Latest commit

History

Repository files navigation

Disentangled Representation Learning for Text-Video Retrieval

Catalog

Setup

Setup code environment

Download CLIP Model (as pretraining)

Download Datasets

Fine-tuning code

Visualization demo

License

Acknowledgments

About

Topics

Resources

License

Stars

Watchers

Forks

Languages