In this work, we introduce a model of lifelong learning, based on a Network of Experts. New tasks/experts are learned and added to the model sequentially, building on what was learned before. To ensure scalability of this process, data from previous tasks cannot be stored and hence is not available when learning a new task. A critical issue in such context, not addressed in the literature so far, relates to the decision of which expert to deploy at test time. We introduce a gating autoencoder that learns a representation for the task at hand, and is used at test time to automatically forward the test sample to the relevant expert. This has the added advantage of being memory efficient as only one expert network has to be loaded into memory at any given time. Further, the autoencoders inherently capture the relatedness of one task to another, based on which the most relevant prior model to be used for training a new expert with fine-tuning or learning-without-forgetting can be selected. We evaluate our system on image classification and video prediction problems.
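The expert-selection mechanism can be sketched as follows: each task trains its own autoencoder, and at test time the sample is routed to the expert whose autoencoder reconstructs it with the lowest error. Below is a minimal NumPy sketch; the single-layer architecture and `tanh` nonlinearity are illustrative assumptions, not the paper's exact gating design:

```python
import numpy as np

def reconstruction_error(x, W_enc, W_dec):
    # one-layer autoencoder: x -> h -> x_hat (illustrative architecture)
    h = np.tanh(W_enc @ x)
    x_hat = W_dec @ h
    return np.sum((x - x_hat) ** 2)

def select_expert(x, autoencoders):
    # route the test sample to the task whose autoencoder reconstructs it best;
    # only that task's expert network then needs to be loaded into memory
    errors = [reconstruction_error(x, W_enc, W_dec)
              for W_enc, W_dec in autoencoders]
    return int(np.argmin(errors))
```

The same per-task reconstruction errors can also serve as a rough relatedness signal between a new task and the existing experts.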
R. Aljundi, P. Chakravarty and T. Tuytelaars. Expert Gate – Lifelong Learning with a Network of Experts. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Hawaii, 2017. Submitted.
CNN-based Single Image Obstacle Avoidance on a Quadrotor
This work demonstrates the use of a single forward-facing camera for obstacle avoidance on a quadrotor. We train a CNN to estimate depth from a single image. The depth map is then fed to a behaviour-arbitration-based control algorithm that steers the quadrotor away from obstacles. We conduct experiments with simulated and real drones in a variety of environments.
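A behaviour-arbitration controller of this kind can be sketched by comparing the free space in the two halves of the estimated depth map and yawing toward the more open side. This is a hypothetical simplification of the control algorithm, with an assumed sign convention (positive = turn right):

```python
import numpy as np

def steer_from_depth(depth, gain=1.0):
    # split the estimated depth map into left and right halves and
    # turn toward the side with more free space (larger mean depth)
    h, w = depth.shape
    left = depth[:, : w // 2].mean()
    right = depth[:, w // 2 :].mean()
    # normalized yaw command; positive = turn right (assumed convention)
    return gain * (right - left) / (right + left + 1e-6)
```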
P. Chakravarty, T. Roussel, K. Kelchtermans, S. Wellens, T. Tuytelaars and L. Van Eycken. CNN-based Single Image Obstacle Avoidance on a Quadrotor. International Conference on Robotics and Automation (ICRA), 2017. Submitted.
Active Speaker Detection in Video [Project webpage]
Active speakers have traditionally been identified in video by detecting their moving lips. However, this method is not always successful when the resolution of the video is low or when the person’s lips are obscured. This work demonstrates the use of visual spatio-temporal features that capture other cues: movement of the head, upper body and hands of active speakers. Training is done using cross-modal supervision – audio is used to supervise the training of the video-based classifier. Online adaptation of the video classifier to individual speakers improves performance. Subsequently, these individual classifiers are used to train audio voice classifiers for each speaker, resulting in further improvements in active speaker detection.
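The cross-modal training scheme amounts to fitting a video-based classifier to labels derived from the audio. A toy logistic-regression sketch of that idea follows; the features and labels are placeholders, not the actual spatio-temporal features or audio pipeline:

```python
import numpy as np

def train_video_classifier(video_feats, audio_labels, lr=0.1, epochs=200):
    # cross-modal supervision: binary labels derived from the audio
    # (e.g. voice activity per tracked person) supervise a classifier
    # that operates only on video features
    rng = np.random.default_rng(0)
    w = rng.normal(0.0, 0.01, video_feats.shape[1])
    b = 0.0
    for _ in range(epochs):
        z = video_feats @ w + b
        p = 1.0 / (1.0 + np.exp(-z))   # sigmoid probabilities
        grad = p - audio_labels        # gradient of the logistic loss
        w -= lr * video_feats.T @ grad / len(audio_labels)
        b -= lr * grad.mean()
    return w, b
```

At test time the trained classifier scores video alone, so active speakers can be detected even when the audio channel gives no per-person cue.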
This work resulted in the publication of 3 papers:
P. Chakravarty, J. Zegers, T. Tuytelaars and H. Van hamme. Active Speaker Detection with Audio-Visual Co-Training. International Conference on Multimodal Interaction (ICMI), Tokyo, November 2016.
P. Chakravarty and T. Tuytelaars. Cross-modal Supervision for Learning Active Speaker Detection in Video. European Conference on Computer Vision (ECCV), Amsterdam, October 2016.
P. Chakravarty, S. Mirzaei, T. Tuytelaars and H. Van hamme. Who’s Speaking? Audio-Supervised Classification of Active Speakers in Video. International Conference on Multimodal Interaction (ICMI), Seattle, USA, November 2015.
Video Diarization [Project webpage]
In this work, we aim to automatically label actors in a TV series.
Rather than relying on transcripts and subtitles, as has been done in the past, we show how to achieve this goal starting from a set of example images of each of the main actors, collected from the Internet Movie Database (IMDB).
The problem then becomes one of domain adaptation: actors’ IMDB photos are typically taken at awards ceremonies and are quite different from their appearance in the TV series. Within each series, too, actor appearance changes considerably due to makeup, lighting, ageing, etc.
To bridge this gap, we propose a graph-matching-based self-labelling algorithm, which we coin HSL (Hungarian Self-Labelling). Further, we propose a new metric for this context, as well as an extension that is more robust to outliers, in which prototypical faces for each actor are selected by a hierarchical clustering procedure. We conduct experiments on 15 episodes from 3 different TV series and demonstrate automatic annotation with accuracies of 90% and above.
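The core of the self-labelling step is a minimum-cost one-to-one assignment between groups of detected faces and actors, for which HSL uses the Hungarian algorithm. The brute-force stand-in below illustrates the objective on small problems; the cost matrix would in practice come from face-descriptor distances, which are not modelled here:

```python
import numpy as np
from itertools import permutations

def best_assignment(cost):
    # brute-force stand-in for the Hungarian algorithm: find the
    # face-group -> actor assignment with minimal total matching cost
    # (only feasible for small n; the Hungarian algorithm is O(n^3))
    n = cost.shape[0]
    best_total, best_perm = np.inf, None
    for perm in permutations(range(n)):
        total = sum(cost[i, perm[i]] for i in range(n))
        if total < best_total:
            best_total, best_perm = total, perm
    return list(best_perm), best_total
```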
R. Aljundi, P. Chakravarty (equal contribution) and T. Tuytelaars. Who’s that Actor? Automatic Labelling of Actors in TV series starting from IMDB Images. Asian Conference on Computer Vision (ACCV), Taiwan, November 2016.
I also supervise the following quadrotor projects by Master’s students.
Person Tracking and Face Recognition from a Drone
The task of this project is to track people in the vicinity of the quadrotor, follow them, pan around to the person’s face, and perform face recognition. This project has now been running for two years.
Last year, the student, Davy Neven, focussed on person detection, tracking, and flying the robot on a trajectory suited to face recognition, which itself used a relatively simple Eigenfaces approach. This year, the student is focussing on improving face recognition using a deep Convolutional Neural Network approach that is more robust to lighting and pose changes.
Visual SLAM from a Drone
This project uses a flying quadrotor and its forward-facing camera to solve the SLAM problem: create a map of its environment and localize itself on that map. Last year, the student, Elias Vanderstuyft, set up a visual SLAM pipeline involving feature tracking, point cloud triangulation and bundle adjustment, and did a preliminary simulation study of the optimal trajectory for the quadrotor to follow, given uncertainties in the point cloud. This year, students are implementing loop closure and extending the above techniques to a real quadrotor.
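The triangulation stage of such a pipeline can be illustrated with standard linear (DLT) two-view triangulation; this is a textbook sketch, not the student's actual implementation:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    # linear (DLT) triangulation of one 3D point from two views:
    # each observation x = (u, v) under projection matrix P contributes
    # two rows of A such that A X = 0 for the homogeneous point X
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # the solution is the right singular vector of the smallest singular value
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # dehomogenize
```

In the full pipeline, the triangulated points are then refined jointly with the camera poses by bundle adjustment.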
Single Image Depth Map Estimation for a Drone
Traditionally, depth perception in vision uses multiple views in space (stereo) or in time (Structure from Motion / visual SLAM). This project aims to estimate depth from a single image using machine learning on large amounts of training images with ground-truth depth maps. Two approaches are being tried out: a whole-image regression approach using deep CNNs, and per-superpixel depth estimation using a pipeline of dense SIFT features, Fisher vector pooling and SVM classification.
Visual Route Following on a Drone
This project aims to navigate a drone using a sequence of images taken along a route. Visual place recognition, place classification and object recognition algorithms are being investigated, so that the drone can follow a route given a set of human-like instructions (go straight down the corridor, turn right when you see the coffee machine, third door on the left after that).