Code for "Learning to Predict Salient Faces: A Novel Visual-Audio Saliency Model", ECCV 2020

MinglangQiao/visual_audio_saliency

This repository provides the code for our ECCV 2020 paper

"Learning to Predict Salient Faces: A Novel Visual-Audio Saliency Model"

Abstract

Recently, video streams have occupied a large proportion of Internet traffic, most of which contain human faces. Hence, it is necessary to predict saliency on multiple-face videos, which can provide attention cues for many content-based applications. However, most multiple-face saliency prediction works consider only visual information and ignore audio, which is inconsistent with naturalistic scenarios. Several behavioral studies have established that sound influences human attention, especially during speech turn-taking in multiple-face videos. In this paper, we thoroughly investigate such influences by establishing a large-scale eye-tracking database of Multiple-face Video in Visual-Audio condition (MVVA). Inspired by the findings of our investigation, we propose a novel multi-modal video saliency model consisting of three branches: visual, audio and face. The visual branch takes the RGB frames as input and encodes them into visual feature maps. The audio and face branches encode the audio signal and multiple cropped faces, respectively. A fusion module is introduced to integrate the information from the three modalities and to generate the final saliency map. Experimental results show that the proposed method outperforms 11 state-of-the-art saliency prediction works. It performs closer to human multi-modal attention.

Network

Requirements

  • python 3.7
  • pytorch 1.1.0
  • opencv
  • librosa

The dependencies can be installed from requirements.txt.
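After installation, it can help to confirm that the listed packages are visible to the interpreter. The snippet below is a minimal sketch, not part of this repository: the function name `check_environment` is hypothetical, the import names (`torch`, `cv2`, `librosa`) are the usual ones for these packages, and treating any Python at or above 3.7 as acceptable is an assumption. It only checks that the modules can be found and does not verify exact versions such as pytorch 1.1.0.

```python
import sys
from importlib.util import find_spec

# Import names for the packages listed above (opencv is imported as cv2).
REQUIRED_MODULES = {"pytorch": "torch", "opencv": "cv2", "librosa": "librosa"}


def check_environment(required=REQUIRED_MODULES):
    """Return a dict mapping each requirement to True if it is available."""
    # Assumption: any interpreter at or above 3.7 is treated as acceptable.
    status = {"python": sys.version_info >= (3, 7)}
    for package, module in required.items():
        # find_spec returns None when a top-level module cannot be located.
        status[package] = find_spec(module) is not None
    return status


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

Any entry reported as missing points to a package that still needs to be installed from requirements.txt.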

Inference

Download the pretrained model from here, and the GMM maps generated by the face branch together with our MVVA database from here. Then run the demo inference code:

python main.py
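Because the demo depends on several downloads, a small pre-flight check can catch a missing file before `main.py` starts. This is a sketch under stated assumptions: the entries in `EXPECTED_PATHS` are hypothetical placeholders (this README does not name the downloaded files), and `missing_paths` and `run_demo` are illustrative helpers, not functions provided by this repository.

```python
import subprocess
import sys
from pathlib import Path

# Hypothetical locations; replace them with wherever the downloads were saved.
EXPECTED_PATHS = ["pretrained_model.pth", "gmm_maps", "MVVA"]


def missing_paths(paths, root="."):
    """Return the entries of `paths` that do not exist under `root`."""
    return [p for p in paths if not (Path(root) / p).exists()]


def run_demo(paths=EXPECTED_PATHS, root="."):
    """Run main.py only if every expected download is present."""
    missing = missing_paths(paths, root)
    if missing:
        raise FileNotFoundError(f"Download these first: {missing}")
    # Same command as above, launched with the current interpreter.
    return subprocess.run([sys.executable, "main.py"], cwd=root, check=True)
```

Calling `run_demo()` from the repository root either raises an error listing what is missing or launches the demo.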

Checklist

  • Update the Visual branch, Audio branch and Fusion module
  • Update the Face branch

Citation

If you find this repository helpful, you may cite:

@inproceedings{liu2020visualaudio,
  title={Learning to Predict Salient Faces: A Novel Visual-Audio Saliency Model},
  author={Yufan Liu and Minglang Qiao and Mai Xu and Bing Li and Weiming Hu and Ali Borji},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
  year={2020}
}
