Who’s Speaking? Audio-Supervised Classification of Active Speakers in Video
Active speakers in video have traditionally been identified by detecting their moving lips. However, this method often fails when the resolution of the video is low or the person’s lips are obscured. This work demonstrates the use of visual spatio-temporal features that aim to capture other cues – movement of the head, upper body and hands of active speakers.
The visual classifier is trained using sound source localization from a microphone array. The localized sound source is fused with upper bodies tracked in video, and spatio-temporal features from the speakers are used to train the video-based classifier.
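The fusion step described above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the microphone array is co-located with a pinhole camera, that localization yields a single azimuth per frame, and that the tracked upper body nearest to that azimuth is the speaker. The names `box_azimuth_deg` and `label_boxes`, the field of view and the angular tolerance are all hypothetical.

```python
import math
from typing import List, Optional, Tuple

# (x_min, y_min, x_max, y_max) in pixels; an assumed box format.
Box = Tuple[float, float, float, float]


def box_azimuth_deg(box: Box, image_width: int, horizontal_fov_deg: float) -> float:
    """Horizontal angle of the box centre relative to the optical axis (pinhole model)."""
    centre_x = 0.5 * (box[0] + box[2])
    focal_px = (image_width / 2) / math.tan(math.radians(horizontal_fov_deg / 2))
    return math.degrees(math.atan((centre_x - image_width / 2) / focal_px))


def label_boxes(
    boxes: List[Box],
    source_azimuth_deg: Optional[float],
    image_width: int,
    horizontal_fov_deg: float = 60.0,  # assumed camera field of view
    tolerance_deg: float = 10.0,       # assumed maximum angular mismatch
) -> List[bool]:
    """Mark as speaking the box closest to the sound source, if it is close enough."""
    labels = [False] * len(boxes)
    if source_azimuth_deg is None or not boxes:
        return labels  # no localized sound: every tracked person is a negative
    errors = [
        abs(box_azimuth_deg(b, image_width, horizontal_fov_deg) - source_azimuth_deg)
        for b in boxes
    ]
    best = min(range(len(boxes)), key=errors.__getitem__)
    if errors[best] <= tolerance_deg:
        labels[best] = True
    return labels
```

Labels produced this way would then serve as training targets for the video-based classifier, so that no manual speak/non-speak annotation is needed at training time.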
This paper was presented at ICMI, Seattle, 2015.
This technique was extended to cases where only a single channel of audio is available, as is the case for thousands of YouTube videos:
Cross Modal Supervision for Learning Active Speaker Detection in Video.
We show how to use audio to supervise the learning of active speaker detection in video. Voice Activity Detection (VAD) guides the learning of the vision-based classifier in a weakly supervised manner. The classifier uses spatio-temporal features to encode upper body motion – facial expressions and gesticulations associated with speaking. We further improve a generic model for active speaker detection by learning person-specific models. Finally, we demonstrate the online adaptation of generic models learnt on one dataset to previously unseen people in a new dataset, again using audio (VAD) for weak supervision. The use of temporal continuity overcomes the lack of clean training data.
We are the first to present an active speaker detection system that learns on one audio-visual dataset and automatically adapts to speakers in a new dataset. This work can be seen as an example of how the availability of multi-modal data allows us to learn a model without the need for supervision, by transferring knowledge from one modality to another.
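As a rough illustration of VAD-guided weak supervision and of temporal continuity, the sketch below derives per-frame training labels from a single-channel VAD signal and smooths classifier scores over time. The labelling rule is an assumption made for illustration, not the paper's procedure: silence makes every visible person a negative, speech with exactly one visible person makes that person a positive, and all other frames are left unlabelled. The function names `weak_labels` and `smooth_scores` are hypothetical.

```python
from typing import Dict, List


def weak_labels(vad: List[bool], visible: List[List[str]]) -> List[Dict[str, int]]:
    """Per frame, map person id to 1 (speaking) or 0 (not speaking).

    Frames where speech is detected but several people are visible are
    ambiguous under this assumed rule, so they get an empty dict.
    """
    labels: List[Dict[str, int]] = []
    for speech, people in zip(vad, visible):
        if not speech:
            labels.append({person: 0 for person in people})
        elif len(people) == 1:
            labels.append({people[0]: 1})
        else:
            labels.append({})
    return labels


def smooth_scores(scores: List[float], window: int = 5) -> List[float]:
    """Centred moving average, truncated at the sequence edges."""
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        segment = scores[max(0, i - half): i + half + 1]
        smoothed.append(sum(segment) / len(segment))
    return smoothed
```

Averaging over neighbouring frames is one simple way to exploit temporal continuity: speaking states rarely flip from one frame to the next, so isolated noisy predictions are damped.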
This paper was presented at ECCV, Amsterdam, 2016.
Columbia dataset: As part of this work, we present an Active Speaker Detection dataset. It is an 87-minute-long video of a panel discussion at Columbia University, available from YouTube. There are 7 speakers on the panel, and the camera focusses on smaller groups of speakers at a time. We only use the parts of the video where there is more than one person in the frame, and ignore people on the margins of the video who are not detected by the upper body detector. This gives us sections of video for 5 speakers, with 2-3 speakers visible at any one time. We have annotated the upper body bounding boxes of each speaker with speak/non-speak labels, for about 35 minutes of video in all.
The video is available from YouTube:
The annotations are available here.