Detecting activities in untrimmed videos is an important yet challenging task. In this paper, we tackle a difficulty that existing methods often encounter: accurately locating the start and end of a long, complex action. Our key contribution is the structured segment network, a novel framework for temporal action detection that models the temporal structure of each activity instance via a structured temporal pyramid. On top of the pyramid, we further introduce a decomposed discriminative model comprising two classifiers, one for classifying activities and one for determining completeness. This allows the framework to effectively distinguish positive proposals from background or incomplete ones, leading to both accurate recognition and localization. These components are integrated into a unified network that can be trained efficiently in an end-to-end fashion. We also propose a simple yet effective temporal action proposal scheme that generates proposals of considerably higher quality. On two challenging benchmarks, THUMOS14 and ActivityNet, our method outperforms existing state-of-the-art methods by over $10\%$ absolute average mAP, demonstrating superior accuracy and strong adaptivity in handling activities with various temporal structures.
Submitted 20 Apr 2017 to Computer Vision and Pattern Recognition
Published 21 Apr 2017