Active Speaker Detection

Who’s Speaking? Audio-Supervised Classification of Active Speakers in Video


Active speakers have traditionally been identified in video by detecting their moving lips. However, this method can fail when the resolution of the video is low or the person’s lips are obscured. This work demonstrates the use of visual spatio-temporal features that aim to capture other cues: movement of the head, upper body and hands of active speakers.

The visual classifier is trained using sound source localization from a microphone array. The localized sound source is fused with upper bodies tracked in the video, and spatio-temporal features from the speakers identified in this way are used to train the video-based classifier.
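The fusion step can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a pinhole camera co-located with the microphone array, a known horizontal field of view, and a single localized source azimuth per frame. The function names, the box format and the margin_deg parameter are all hypothetical.

```python
import math

def pixel_to_azimuth(x, image_width, hfov_deg):
    """Map a pixel column to a horizontal angle in degrees (0 at image center).

    Assumes an ideal pinhole camera with horizontal field of view hfov_deg.
    """
    focal_px = (image_width / 2) / math.tan(math.radians(hfov_deg / 2))
    return math.degrees(math.atan((x - image_width / 2) / focal_px))

def label_tracks(boxes, source_azimuth_deg, image_width, hfov_deg, margin_deg=2.0):
    """Label each tracked upper body as speaking or not for one frame.

    boxes: dict mapping track id -> (x_min, x_max) in pixels (hypothetical format).
    source_azimuth_deg: localized sound direction, or None if no source was found.
    A track is labeled True when the azimuth falls inside its angular extent,
    widened by margin_deg on each side.
    """
    labels = {}
    for track_id, (x_min, x_max) in boxes.items():
        lo = pixel_to_azimuth(x_min, image_width, hfov_deg) - margin_deg
        hi = pixel_to_azimuth(x_max, image_width, hfov_deg) + margin_deg
        labels[track_id] = (source_azimuth_deg is not None) and (lo <= source_azimuth_deg <= hi)
    return labels
```

For example, with a 640-pixel-wide frame and a 90-degree field of view, label_tracks({"a": (0, 200), "b": (400, 600)}, 20.0, 640, 90.0) marks track "b" as speaking and track "a" as silent. Labels produced this way would then serve as training targets for the video-based classifier.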

This paper was presented at the International Conference on Multimodal Interaction (ICMI), Seattle, 2015.

[Paper] [Video]

This technique was extended to cases where only a single channel of audio is available, as in the case of thousands of YouTube videos:

Cross Modal Supervision for Learning Active Speaker Detection in Video.

We show how to use audio to supervise the learning of active speaker detection in video. Voice Activity Detection (VAD) guides the learning of the vision-based classifier in a weakly supervised manner. The classifier uses spatio-temporal features to encode upper body motion: facial expressions and gesticulations associated with speaking. We further improve a generic model for active speaker detection by learning person-specific models. Finally, we demonstrate the online adaptation of generic models learnt on one dataset to previously unseen people in a new dataset, again using audio (VAD) for weak supervision. The use of temporal continuity overcomes the lack of clean training data.


We are the first to present an active speaker detection system that learns on one audio-visual dataset and automatically adapts to speakers in a new dataset. This work can be seen as an example of how the availability of multi-modal data allows us to learn a model without the need for supervision, by transferring knowledge from one modality to another.
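As a rough illustration of how single-channel audio could yield weak per-frame labels, the sketch below uses a simple energy threshold as a stand-in VAD and a majority vote over neighbouring frames as a stand-in for temporal continuity. Both are assumptions made for illustration; the VAD, the threshold and the smoothing actually used in the paper are not specified here, and all function names are hypothetical.

```python
def frame_energies(samples, sample_rate, fps):
    """Mean squared amplitude of the audio over each video frame interval."""
    hop = int(round(sample_rate / fps))  # audio samples per video frame
    energies = []
    for start in range(0, len(samples) - hop + 1, hop):
        window = samples[start:start + hop]
        energies.append(sum(s * s for s in window) / hop)
    return energies

def vad_labels(samples, sample_rate, fps, threshold):
    """One weak speech/non-speech label per video frame (energy-based stand-in VAD)."""
    return [e > threshold for e in frame_energies(samples, sample_rate, fps)]

def smooth_labels(labels, radius=2):
    """Majority vote over a window of up to 2 * radius + 1 frames.

    Isolated label flips are removed, reflecting the assumption that
    speaking state changes slowly relative to the frame rate.
    """
    smoothed = []
    for i in range(len(labels)):
        window = labels[max(0, i - radius): i + radius + 1]
        smoothed.append(sum(window) * 2 > len(window))
    return smoothed
```

The labels are weak in the sense that they say only that someone is speaking in a frame, not who. The vision-based classifier therefore has to be trained on noisy targets.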

This paper was presented at the European Conference on Computer Vision (ECCV), Amsterdam, 2016.


[Paper] [Video]

Columbia dataset: As part of this work, we present an Active Speaker Detection dataset. It is an 87-minute-long video of a panel discussion at Columbia University, available from YouTube. There are 7 speakers on the panel, and the camera focusses on smaller groups of speakers at a time. We use only the parts of the video where there is more than one person in the frame, and ignore people on the margins of the video who are not detected by the upper body detector. This gives us sections of video for 5 speakers, with 2-3 speakers visible at any one time. We have annotated the upper body bounding boxes of each speaker with speak/non-speak labels, covering about 35 minutes of video in all.

The video is available from YouTube:

The annotations are available here.


Active Speaker Detection with Audio-Visual Co-Training

In this work, we show how to co-train a classifier for active speaker detection using audio-visual data. First, audio Voice Activity Detection (VAD) is used to train a personalized video-based active speaker classifier in a weakly supervised fashion. The video classifier is in turn used to train a voice model for each person, and the individual voice models are then used to detect active speakers. No manual supervision is needed: audio weakly supervises video classification, and the trained video classifier then supervises the training of a personalized audio voice classifier, completing the co-training loop.
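The loop can be sketched schematically for a single person. This is a toy illustration under strong assumptions, not the paper's method: each frame is reduced to one scalar video feature (for example, a motion magnitude) and one scalar audio feature (for example, a pitch-like value), the video classifier is a nearest-centroid rule, and the voice model is a mean with a tolerance. All names are hypothetical, and the sketch assumes each training set contains both classes.

```python
def train_video_model(video_feats, weak_labels):
    """Nearest-centroid video classifier trained on weak (VAD) labels."""
    pos = [f for f, y in zip(video_feats, weak_labels) if y]
    neg = [f for f, y in zip(video_feats, weak_labels) if not y]
    return sum(neg) / len(neg), sum(pos) / len(pos)

def predict_video(model, video_feats):
    """True where the feature lies closer to the speaking centroid."""
    neg_centroid, pos_centroid = model
    return [abs(f - pos_centroid) < abs(f - neg_centroid) for f in video_feats]

def train_voice_model(audio_feats, speaking):
    """Per-person voice model: mean and spread of audio features on frames
    that the video classifier marked as speaking."""
    voiced = [f for f, s in zip(audio_feats, speaking) if s]
    mean = sum(voiced) / len(voiced)
    spread = max(abs(f - mean) for f in voiced)
    return mean, spread

def predict_voice(model, audio_feats, slack=1.5):
    """True where the audio feature is within slack * spread of the person's mean."""
    mean, spread = model
    return [abs(f - mean) <= slack * spread for f in audio_feats]

def co_train(video_feats, audio_feats, vad_labels):
    """One pass of the loop: audio (VAD) supervises video, then video supervises audio."""
    video_model = train_video_model(video_feats, vad_labels)
    video_pred = predict_video(video_model, video_feats)
    voice_model = train_voice_model(audio_feats, video_pred)
    return video_model, voice_model
```

The point of the sketch is the direction of supervision. The VAD labels are noisy for any one person, because they are also active when someone else speaks. The video classifier filters them using that person's own motion, and its predictions in turn select the audio frames from which the person's voice model is built.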

[Paper] [Video]

This paper was presented at the International Conference on Multimodal Interaction (ICMI), Tokyo, 2016.
