Apr 26 2018 cs.CV
Movies provide a wealth of visual content as well as engaging stories. Existing methods have shown that understanding movie stories from visual content alone is still a hard problem. In this paper, to answer questions about movies, we put forward a Layered Memory Network (LMN) that represents frame-level and clip-level movie content with the Static Word Memory module and the Dynamic Subtitle Memory module, respectively. In particular, we first extract words and sentences from the training movie subtitles. Then the hierarchically formed movie representations, which are learned from LMN, encode not only the correspondence between words and visual content inside frames, but also the temporal alignment between sentences and frames inside movie clips. We also extend our LMN model into three variant frameworks to illustrate its extensibility. We conduct extensive experiments on the MovieQA dataset. With only visual content as input, LMN with the frame-level representation obtains a large performance improvement. When incorporating subtitles into LMN to form the clip-level representation, we achieve state-of-the-art performance on the online evaluation task of 'Video+Subtitles'. The good performance demonstrates that the proposed LMN framework is effective and that the hierarchically formed movie representations have good potential for movie question answering.
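As an illustration of the general idea of reading from a memory of word embeddings, the following is a minimal sketch of a soft attention read, as used in generic memory networks. It is not the LMN architecture itself: the function name `memory_read`, the dot-product scoring, and the softmax weighting are assumptions made for this example.

```python
import numpy as np

def memory_read(query, memory):
    """Soft attention read (illustrative assumption, not the paper's exact module).

    query:  (d,) feature vector, e.g. a visual feature of a frame region.
    memory: (n, d) matrix whose rows are memory slots, e.g. word embeddings.
    Returns the attention-weighted sum of memory rows and the weights.
    """
    scores = memory @ query                  # dot-product similarity per slot
    scores = scores - scores.max()           # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over slots
    return weights @ memory, weights

# Usage: three one-hot "word" slots and a query aligned with the first slot.
memory = np.eye(3)
read, weights = memory_read(np.array([5.0, 0.0, 0.0]), memory)
```

The read vector lives in the same space as the memory rows, so a visual feature is re-expressed as a mixture of word embeddings.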
Apr 18 2018 cs.CV
In this paper, we study the problem of designing efficient convolutional neural network architectures, with the aim of eliminating the redundancy in convolution kernels. In addition to structured sparse kernels, low-rank kernels and the product of low-rank kernels, the product of structured sparse kernels, which is a framework for interpreting the recently developed interleaved group convolutions (IGC) and its variants (e.g., Xception), has been attracting increasing interest. Motivated by the observation that the convolutions contained in a group convolution in IGC can be further decomposed in the same manner, we present a modularized building block, IGCV$2$: interleaved structured sparse convolutions. It generalizes interleaved group convolutions, which are composed of two structured sparse kernels, to the product of more structured sparse kernels, further eliminating the redundancy. We present the complementary condition and the balance condition to guide the design of structured sparse kernels, obtaining a balance among three aspects: model size, computation complexity and classification accuracy. Experimental results demonstrate the advantage in the balance among these three aspects compared to interleaved group convolutions and Xception, and competitive performance compared to other state-of-the-art architecture design methods.
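The "product of structured sparse kernels" view can be sketched on channel-mixing matrices alone, ignoring the spatial extent of the kernels. In this hedged example, two block-diagonal matrices (standing in for two group convolutions) are joined by a channel permutation chosen so that each group of the second matrix receives one channel from every group of the first. The helper names `block_diagonal`, `interleave_permutation`, and `igc_composite` are hypothetical and chosen for illustration.

```python
import numpy as np

def block_diagonal(num_groups, group_size, rng):
    """Random block-diagonal matrix: num_groups dense blocks of size group_size."""
    n = num_groups * group_size
    w = np.zeros((n, n))
    for g in range(num_groups):
        s = slice(g * group_size, (g + 1) * group_size)
        w[s, s] = rng.standard_normal((group_size, group_size))
    return w

def interleave_permutation(num_groups, group_size):
    """Permutation matrix moving channel i of group g to position i * num_groups + g."""
    n = num_groups * group_size
    idx = np.arange(n).reshape(num_groups, group_size).T.reshape(-1)
    return np.eye(n)[idx]

def igc_composite(num_groups, group_size, seed=0):
    """Return both sparse factors and their composed (dense) channel-mixing matrix."""
    rng = np.random.default_rng(seed)
    w1 = block_diagonal(num_groups, group_size, rng)   # first group convolution
    w2 = block_diagonal(group_size, num_groups, rng)   # second group convolution
    perm = interleave_permutation(num_groups, group_size)
    return w1, w2, w2 @ perm @ w1

w1, w2, composite = igc_composite(num_groups=4, group_size=4)
sparse_params = np.count_nonzero(w1) + np.count_nonzero(w2)  # 64 + 64
dense_params = composite.size                                 # 16 * 16
```

With 4 groups of 4 channels, the two sparse factors hold 128 parameters in total, while the composed matrix connects all 16 inputs to all 16 outputs, which a single dense matrix would need 256 parameters to do.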
Feb 08 2018 cs.CV
Existing video hash functions are built on three isolated stages: frame pooling, relaxed learning, and binarization, which have not adequately explored the temporal order of video frames in a joint binary optimization model, resulting in severe information loss. In this paper, we propose a novel unsupervised video hashing framework, dubbed Self-Supervised Video Hashing (SSVH), which is able to capture the temporal nature of videos in an end-to-end learning-to-hash fashion. We specifically address two central problems: 1) how to design an encoder-decoder architecture to generate binary codes for videos; and 2) how to equip the binary codes with the ability of accurate video retrieval. We design a hierarchical binary autoencoder to model the temporal dependencies in videos with multiple granularities, and embed the videos into binary codes with fewer computations than the stacked architecture. Then, we encourage the binary codes to simultaneously reconstruct the visual content and neighborhood structure of the videos. Experiments on two real-world datasets (FCVID and YFCC) show that our SSVH method can significantly outperform the state-of-the-art methods and achieve the currently best performance on the task of unsupervised video retrieval.
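To make the retrieval side of hashing concrete, here is a minimal sketch of how binary codes are typically used once they exist: database items are ranked by Hamming distance to the query code. The thresholding in `binarize` is a stand-in assumption; SSVH learns its codes with a binary autoencoder, which this example does not reproduce. Both function names are hypothetical.

```python
import numpy as np

def binarize(features):
    """Map real-valued features to {0, 1} codes by thresholding at zero (assumed rule)."""
    return (np.asarray(features) > 0).astype(np.uint8)

def hamming_rank(query_code, database_codes):
    """Rank database codes by Hamming distance to the query.

    Returns (order, distances): indices sorted from nearest to farthest,
    and the distance of every database item to the query.
    """
    distances = np.count_nonzero(database_codes != query_code, axis=1)
    order = np.argsort(distances, kind="stable")
    return order, distances

# Usage: three 4-bit database codes and one query.
database = binarize([[0.9, -0.2, 0.4, 0.7],
                     [-0.5, -0.1, -0.3, -0.8],
                     [0.3, -0.6, 0.2, -0.4]])
query = binarize([0.5, -0.9, 0.1, 0.6])
order, distances = hamming_rank(query, database)
# distances is [0, 3, 1], so order is [0, 2, 1]
```

Hamming distance reduces to bit comparisons, which is why compact binary codes make large-scale retrieval cheap.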
Apr 21 2017 cs.CV
Despite the promising progress made in recent years, person re-identification (re-ID) remains a challenging task due to the complex variations in human appearance across different camera views. For this challenging problem, a large variety of algorithms have been developed in the fully-supervised setting, requiring access to a large amount of labeled training data. However, the main bottleneck for fully-supervised re-ID is the limited availability of labeled training samples. To address this problem, in this paper, we propose a self-trained subspace learning paradigm for person re-ID which effectively utilizes both labeled and unlabeled data to learn a discriminative subspace where person images across disjoint camera views can be easily matched. The proposed approach first constructs pseudo pairwise relationships among unlabeled persons using the k-nearest neighbors algorithm. Then, with the pseudo pairwise relationships, the unlabeled samples can be easily combined with the labeled samples to learn a discriminative projection by solving an eigenvalue problem. In addition, we refine the pseudo pairwise relationships iteratively, which further improves the learning performance. A multi-kernel embedding strategy is also incorporated into the proposed approach to cope with the non-linearity in a person's appearance and to exploit the complementarity of multiple kernels. In this way, the performance of person re-ID can be greatly enhanced when training data are insufficient. Experimental results on six widely-used datasets demonstrate the effectiveness of our approach and show that its performance is comparable to the reported results of most state-of-the-art fully-supervised methods while using far less labeled data.
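The first step described above, building pseudo pairwise relationships with k-nearest neighbors, might look roughly like the sketch below. The abstract does not specify the distance measure or how the pairs are formed, so the Euclidean distance, the cross-view pairing, and the name `knn_pseudo_pairs` are all assumptions for illustration.

```python
import numpy as np

def knn_pseudo_pairs(features_a, features_b, k):
    """Pair each unlabeled sample in view A with its k nearest samples in view B.

    features_a: (n_a, d) features from one camera view.
    features_b: (n_b, d) features from another camera view.
    Returns a list of (index_in_a, index_in_b) pseudo-positive pairs.
    Euclidean distance is an assumption made for this sketch.
    """
    diffs = features_a[:, None, :] - features_b[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)                  # (n_a, n_b)
    nearest = np.argsort(dists, axis=1, kind="stable")[:, :k]
    return [(i, int(j)) for i in range(len(features_a)) for j in nearest[i]]

# Usage: two samples in view A, three in view B, one neighbor each.
view_a = np.array([[0.0, 0.0], [10.0, 10.0]])
view_b = np.array([[0.0, 1.0], [9.0, 10.0], [5.0, 5.0]])
pairs = knn_pseudo_pairs(view_a, view_b, k=1)
# pairs is [(0, 0), (1, 1)]
```

In an iterative scheme such as the one the abstract describes, these pairs would be recomputed in the learned subspace after each projection update.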
In this paper, we consider the problem of maximizing the weighted sum energy efficiency (WS-EE) for multi-input single-output (MISO) interference channels (ICs), which are widely acknowledged as general models of heterogeneous networks (HetNets), multicell networks, etc. To address this problem, we develop an efficient distributed beamforming algorithm based on a pricing mechanism. Specifically, we introduce a price metric for distributed beamforming design which allows efficient closed-form solutions to the per-user beam-vector optimization problem. The convergence of the distributed pricing-based beamforming design is theoretically proven. Furthermore, we present an implementation strategy of the proposed distributed algorithm with limited information exchange. Numerical results show that our algorithm converges much faster than existing algorithms, while yielding comparable, and sometimes even better, performance in terms of the WS-EE. Finally, by taking the backhaul power consumption into account, we show that the proposed algorithm with limited information exchange achieves better WS-EE than the algorithm based on full information exchange in some special cases.
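The abstract does not state the objective explicitly. A common way to write a weighted sum energy efficiency for a MISO interference channel is the weighted sum over users of achievable rate divided by consumed power, and the sketch below evaluates that quantity. The power model (transmit power scaled by an amplifier efficiency plus a constant circuit power) and all names are assumptions for this example, not the paper's formulation.

```python
import numpy as np

def weighted_sum_ee(channels, beams, weights, noise_power,
                    circuit_power, pa_efficiency):
    """Evaluate an assumed WS-EE objective for a K-user MISO interference channel.

    channels[j][k]: complex channel vector from transmitter j to receiver k.
    beams[k]:       complex beamforming vector of transmitter k.
    Rate is in bit/s/Hz; per-user power = ||v_k||^2 / pa_efficiency + circuit_power.
    """
    num_users = len(beams)
    total = 0.0
    for k in range(num_users):
        # np.vdot conjugates its first argument, giving h^H v.
        signal = abs(np.vdot(channels[k][k], beams[k])) ** 2
        interference = sum(abs(np.vdot(channels[j][k], beams[j])) ** 2
                           for j in range(num_users) if j != k)
        rate = np.log2(1.0 + signal / (noise_power + interference))
        power = np.vdot(beams[k], beams[k]).real / pa_efficiency + circuit_power
        total += weights[k] * rate / power
    return float(total)

# Usage: two users whose beams cause no cross interference.
e0, e1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
channels = [[e0, e1], [e0, e1]]
value = weighted_sum_ee(channels, [e0, e1], [1.0, 1.0],
                        noise_power=1.0, circuit_power=1.0, pa_efficiency=1.0)
```

Because each user's rate depends on every other user's beam through the interference term, the objective couples all users, which is what motivates a distributed, price-based decomposition.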
Nov 09 2015 cs.CV
Deep ConvNets have shown good performance in image classification tasks. However, deep video representation for action recognition remains an open problem. The difficulty comes from two aspects: on the one hand, current video ConvNets are relatively shallow compared with image ConvNets, which limits their capability of capturing complex video action information; on the other hand, the temporal information of videos is not properly utilized to pool and encode the video sequences. To address these issues, in this paper we utilize two state-of-the-art ConvNets, i.e., the very deep spatial net (VGGNet) and the temporal net from Two-Stream ConvNets, for action representation. The convolutional layers and the proposed new layer, called the frame-diff layer, are extracted and pooled with two temporal pooling strategies: trajectory pooling and line pooling. The pooled local descriptors are then encoded with VLAD to form the video representations. To verify the effectiveness of the proposed framework, we conduct experiments on the UCF101 and HMDB51 datasets. It achieves an accuracy of 93.78\% on UCF101, which is state-of-the-art, and an accuracy of 65.62\% on HMDB51, which is comparable to the state of the art.
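The VLAD encoding step mentioned above can be sketched as follows: each local descriptor is assigned to its nearest cluster center, the residuals to that center are summed per cluster, and the concatenated result is L2-normalized. This is a basic VLAD variant written for illustration; the cluster centers are taken as given, and any additional normalization the paper may use is not reproduced. The name `vlad_encode` is hypothetical.

```python
import numpy as np

def vlad_encode(descriptors, centers):
    """Encode local descriptors as a VLAD vector.

    descriptors: (n, d) pooled local descriptors of one video.
    centers:     (k, d) cluster centers, e.g. from k-means (assumed given).
    Returns a (k * d,) L2-normalized vector of summed residuals.
    """
    dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
    assignment = np.argmin(dists, axis=1)          # nearest center per descriptor
    vlad = np.zeros_like(centers, dtype=float)
    for k in range(len(centers)):
        members = descriptors[assignment == k]
        if len(members):
            vlad[k] = (members - centers[k]).sum(axis=0)   # sum of residuals
    vlad = vlad.reshape(-1)
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad

# Usage: three 2-D descriptors and two centers give a 4-D code.
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
descriptors = np.array([[1.0, 0.0], [0.0, 1.0], [10.0, 13.0]])
code = vlad_encode(descriptors, centers)
```

The output length depends only on the number of centers and the descriptor dimension, so videos with different numbers of local descriptors map to vectors of the same size.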
Feb 09 2015 cs.CV
Automated scene analysis has been a topic of great interest in computer vision and cognitive science. Recently, with the growth of crowd phenomena in the real world, crowded scene analysis has attracted much attention. However, the visual occlusions and ambiguities in crowded scenes, as well as the complex behaviors and scene semantics, make the analysis a challenging task. In the past few years, an increasing number of works on crowded scene analysis have been reported, covering different aspects including crowd motion pattern learning, crowd behavior and activity analysis, and anomaly detection in crowds. This paper surveys the state-of-the-art techniques on this topic. We first provide the background knowledge and the available features related to crowded scenes. Then, existing models, popular algorithms, evaluation protocols, and system performance are reviewed for each of these aspects of crowded scene analysis. We also outline the available datasets for performance evaluation. Finally, some open research problems and promising future directions are presented and discussed.