The main topics addressed in this thesis are the use of active learning and deep learning methods in the context of the retrieval and processing of multimodal documents. The contributions proposed in this thesis address both these topics. An active learning framework was introduced to allow more efficient annotation of broadcast TV videos through label propagation, the use of multimodal data, and effective selection strategies. Several scenarios and experiments were considered in the context of person identification in videos, taking into account different modalities (such as faces, speech segments and overlaid text) and different selection strategies. The whole system was additionally validated in a dry-run test involving real human annotators.
A second major contribution was the investigation and use of deep learning (in particular convolutional neural networks) for video information retrieval. A comprehensive study was carried out using different neural network architectures and different training techniques, such as fine-tuning or the use of more classical classifiers like SVMs. A comparison was made between learned features (the output of neural networks) and engineered features. Although the engineered features performed worse on their own, fusing the two types of features increased overall performance.
Finally, the use of convolutional neural networks for speaker identification using spectrograms was explored. The results were compared to those obtained with other state-of-the-art speaker identification systems. Different fusion approaches were also tested. The proposed approach obtained results comparable to those of some of the other tested approaches and offered an increase in performance when fused with the output of the best system.