This paper introduces EXMOVES, learned exemplar-based features for efficient recognition of actions in videos. The entries in our descriptor are produced by evaluating a set of movement classifiers over spatial-temporal volumes of the input sequence. Each movement classifier is a simple exemplar-SVM trained on low-level features, i.e., an SVM learned using a single annotated positive space-time volume and a large number of unannotated videos. Our representation offers two main advantages. First, since our mid-level features are learned from individual video exemplars, they require minimal amount of supervision. Second, we show that simple linear classification models trained on our global video descriptor yield action recognition accuracy approaching the state-of-the-art but at orders of magnitude lower cost, since at test-time no sliding window is necessary and linear models are efficient to train and test. This enables scalable action recognition, i.e., efficient classification of a large number of different actions even in large video databases. We show the generality of our approach by building our mid-level descriptors from two different low-level feature representations. The accuracy and efficiency of the approach are demonstrated on several large-scale action recognition benchmarks.
Submitted 20 Dec 2013 to Computer Vision and Pattern Recognition
Published 23 Dec 2013
Updated 28 Mar 2014
Author comments: In Proceedings of the 2nd International Conference on Learning Representations, Banff, Canada, 2014http://arxiv.org/abs/1312.5785http://arxiv.org/pdf/1312.5785.pdf